Introduction to Machine Learning

Overview of supervised and unsupervised learning, linear regression, and gradient descent

Jongmin Lee
10 min read
Updated: August 4, 2024

1. Introduction to machine learning

Supervised learning

input x → output y (learns from being given the "right answers")

  • Examples
    • Spam email filtering
    • Audio-to-text transcription (speech recognition)
    • Language translation
    • Online advertising
    • Self-driving cars
    • Visual inspection

Regression

Regression predicts a number from infinitely many possible outputs


Classification

Classification predicts categories from a small number of possible outputs

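A minimal sketch of the difference between the two task types, assuming scikit-learn and made-up toy data (the house sizes/prices and tumor sizes/labels below are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a number from infinitely many possible outputs
# (house price in $1000s from size in square feet). Toy data, made up.
sizes = np.array([[650], [800], [1200], [2104]])
prices = np.array([150, 200, 290, 400])
reg = LinearRegression().fit(sizes, prices)
print(reg.predict([[1000]]))          # a real number (a price estimate)

# Classification: predict one of a small number of categories
# (tumor malignant = 1, benign = 0). Toy data, made up.
tumor_sizes = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(tumor_sizes, labels)
print(clf.predict([[2.8]]))           # a discrete label: 0 or 1
```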

Two or more inputs

The input can also consist of two or more features; for example, classifying a tumor as malignant or benign from both its size and the patient's age.

Unsupervised Learning

Find something interesting in unlabeled data


  • Examples
    • Google News - groups related news stories into clusters
    • DNA microarray - groups individuals by patterns in their gene expression data

Unsupervised learning

Data comes only with inputs x, but not output labels y; the algorithm has to find structure in the data

  • Clustering - group similar data points together
  • Anomaly detection - find unusual data points
  • Dimensionality reduction - compress data using fewer numbers
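
As a rough illustration of clustering, here is a minimal sketch using scikit-learn's KMeans on made-up, unlabeled points (the data and the choice of 2 clusters are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled inputs x only -- there are no target labels y.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],    # one group of nearby points
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])   # another group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)    # groups similar data points together
print(cluster_ids)                     # e.g. [0 0 0 1 1 1] (cluster numbering is arbitrary)
```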

Linear Regression Model

Linear regression fits a straight line to the training data; for example, predicting a house's price from its size in square feet.

Terminology


  • Training set - data used to train the model
  • x - input variable, feature
  • y - output variable, target variable
  • m - number of training examples
  • (x, y) - single training example
  • $(x^{(i)}, y^{(i)})$ - i-th training example
    • ex) $(x^{(1)}, y^{(1)}) = (2104, 400)$

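A minimal sketch of this notation in NumPy; the training set below reuses the (2104, 400) example, and the remaining values are made up for illustration:

```python
import numpy as np

x_train = np.array([2104, 1416, 852])   # input features x (house size in square feet)
y_train = np.array([400, 232, 178])     # targets y (price in $1000s); values after the first are illustrative

m = x_train.shape[0]                    # m = number of training examples
print(m)                                # 3

# (x^(i), y^(i)) is the i-th training example; note NumPy indices start at 0.
i = 0
print(x_train[i], y_train[i])           # (x^(1), y^(1)) = (2104, 400)
```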

Cost Function


What do w, b do?

Find w,b:

$\hat{y}^{(i)}$ is close to $y^{(i)}$ for all $(x^{(i)}, y^{(i)})$


Cost function

The cost function measures how well the model is doing.

$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$


  • Error: $\hat{y}^{(i)} - y^{(i)}$ (prediction minus actual value)
    • Different people use different cost functions
    • The squared error cost function is the most commonly used one for regression
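
A minimal sketch of the squared error cost, assuming a tiny made-up dataset and the model $f_{w,b}(x) = wx + b$:

```python
import numpy as np

def compute_cost(x, y, w, b):
    """J(w, b) = (1 / 2m) * sum_i (f_wb(x^(i)) - y^(i))^2"""
    m = x.shape[0]
    f_wb = w * x + b                  # predictions for every training example
    errors = f_wb - y                 # prediction minus target
    return np.sum(errors ** 2) / (2 * m)

# Toy data (made up): two training examples.
x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])
print(compute_cost(x, y, w=200.0, b=100.0))   # 0.0   -- this (w, b) fits the data exactly
print(compute_cost(x, y, w=100.0, b=100.0))   # 12500 -- a worse fit gives a larger cost
```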

Cost function intuition

  • model
    • $f_{w,b}(x) = wx + b$
  • parameters
    • $w, b$
  • cost function
    • $J(w,b) = \frac{1}{2m}\sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$
  • goal
    • $\underset{w,b}{\text{minimize}}\; J(w,b)$

Simplified cost function

  • model
    • $f_w(x) = wx$
  • parameter
    • $w$
  • cost function
    • $J(w) = \frac{1}{2m}\sum_{i=1}^{m} \left(f_w(x^{(i)}) - y^{(i)}\right)^2$
  • goal
    • $\underset{w}{\text{minimize}}\; J(w)$
  • examples
    • Plotting J(w) for several values of w traces out a bowl-shaped curve whose minimum is the best-fitting w (see the sketch below)
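
A minimal sketch of evaluating the simplified cost $J(w)$ for a few values of w, on made-up data whose targets lie exactly on the line y = x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])          # targets chosen to lie exactly on y = x

def cost(w):
    m = x.shape[0]
    return np.sum((w * x - y) ** 2) / (2 * m)   # J(w) with f_w(x) = w * x

for w in [-0.5, 0.0, 0.5, 1.0, 1.5]:
    print(f"w = {w:4.1f}   J(w) = {cost(w):.3f}")
# J(w) is smallest at w = 1, where f_w(x) = x passes through every training point.
```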

Visualizing the cost function

With both parameters, the cost function $J(w,b)$ forms a 3D bowl-shaped surface over w and b. It is often drawn as a contour plot, where each ellipse is a set of (w, b) values with the same cost and the center of the smallest ellipse is the minimum (see the sketch below).
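
A minimal sketch of how such a contour plot can be produced, assuming matplotlib and a small made-up dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0])
y = np.array([300.0, 500.0, 700.0])     # made-up targets generated by y = 200x + 100

ws = np.linspace(0, 400, 100)
bs = np.linspace(0, 300, 100)
W, B = np.meshgrid(ws, bs)

# Evaluate J(w, b) at every (w, b) pair on the grid.
J = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        J[i, j] = np.sum((W[i, j] * x + B[i, j] - y) ** 2) / (2 * len(x))

plt.contour(W, B, J, levels=30)         # each contour is a set of (w, b) with equal cost
plt.xlabel("w")
plt.ylabel("b")
plt.title("Contour plot of J(w, b)")
plt.show()
```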

Gradient Descent

Have some function $J(w,b)$, for linear regression or any function

Want $\min_{w,b} J(w,b)$

Outline:

  • Start with some w, b (set w = 0, b = 0)
  • Keep changing w, b to reduce J(w,b)
  • Until we settle at or near a minimum
    • may have > 1 minimum


Implementing gradient descent

Gradient descent algorithm

simultaneously update w and b

$w = w - \alpha \frac{d}{dw}J(w,b)$

$b = b - \alpha \frac{d}{db}J(w,b)$

$\alpha$: learning rate

$\frac{d}{dw}J(w,b)$: derivative term
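
A minimal sketch of what "simultaneous update" means in code; `dJ_dw` and `dJ_db` are assumed helper functions that return the two derivative terms:

```python
def gradient_descent_step(w, b, alpha, dJ_dw, dJ_db):
    # Evaluate both derivatives at the *current* (w, b) first ...
    tmp_w = w - alpha * dJ_dw(w, b)
    tmp_b = b - alpha * dJ_db(w, b)
    # ... then assign, so the update of w does not leak into the update of b.
    return tmp_w, tmp_b
```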

Gradient descent intuition

The derivative term is the slope of $J$ at the current value of w: when the slope is positive, the update decreases w; when it is negative, the update increases w. In both cases w moves toward a minimum.

Learning rate

If $\alpha$ is too small, gradient descent may be slow.

If $\alpha$ is too large, gradient descent may:

  • overshoot and never reach the minimum
  • fail to converge, or even diverge

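A minimal sketch of these failure modes on the toy one-parameter cost $J(w) = w^2$ (an assumed example, not from the notes):

```python
def run(alpha, w=10.0, steps=10):
    """Run a few gradient descent steps on J(w) = w^2, whose derivative is 2w."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print(run(alpha=0.01))   # too small: w barely moves toward the minimum at 0
print(run(alpha=0.1))    # reasonable: w shrinks steadily toward 0
print(run(alpha=1.1))    # too large: |w| grows every step -- gradient descent diverges
```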

Gradient descent for linear regression

Linear regression model

$f_{w,b}(x) = wx + b$

Cost function

$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$

Gradient descent algorithm

repeat until convergence

  • $w = w - \alpha \frac{d}{dw}J(w,b)$
    • $\frac{d}{dw}J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$
  • $b = b - \alpha \frac{d}{db}J(w,b)$
    • $\frac{d}{db}J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$


  • The squared error cost for linear regression is convex (bowl-shaped), so it has no local minima other than the single global minimum
  • Gradient descent on a convex function always converges to the global minimum, given an appropriate learning rate
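
Putting the two update rules together, here is a minimal sketch of batch gradient descent for linear regression; the toy data, learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iterations=10_000):
    w, b = 0.0, 0.0                            # start with w = 0, b = 0
    m = x.shape[0]
    for _ in range(iterations):
        f_wb = w * x + b                       # predictions f_wb(x^(i)) for all examples
        dj_dw = np.sum((f_wb - y) * x) / m     # (1/m) * sum (f_wb(x^(i)) - y^(i)) * x^(i)
        dj_db = np.sum(f_wb - y) / m           # (1/m) * sum (f_wb(x^(i)) - y^(i))
        w = w - alpha * dj_dw                  # simultaneous update of w and b
        b = b - alpha * dj_db
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])             # made-up targets generated by y = 2x + 1
w, b = gradient_descent(x, y)
print(w, b)                                    # close to 2.0 and 1.0
```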

Mathematics

Derivative with respect to w: $\frac{d}{dw}J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$

  1. $\frac{d}{dw}J(w,b)$
  2. $= \frac{d}{dw}\frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$
  3. $= \frac{d}{dw}\frac{1}{2m}\sum_{i=1}^{m}\left(wx^{(i)} + b - y^{(i)}\right)^2$
  4. $= \frac{1}{2m}\sum_{i=1}^{m}\left(wx^{(i)} + b - y^{(i)}\right) \cdot 2x^{(i)}$ (chain rule; the factor of 2 cancels the $\frac{1}{2}$, giving the result above)

Derivative with respect to b: $\frac{d}{db}J(w,b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$

  1. $\frac{d}{db}J(w,b)$
  2. $= \frac{d}{db}\frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$
  3. $= \frac{d}{db}\frac{1}{2m}\sum_{i=1}^{m}\left(wx^{(i)} + b - y^{(i)}\right)^2$
  4. $= \frac{1}{2m}\sum_{i=1}^{m}\left(wx^{(i)} + b - y^{(i)}\right) \cdot 2$ (again, the factor of 2 cancels the $\frac{1}{2}$)
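
To sanity-check the derivation, here is a minimal sketch comparing these analytic derivatives with numerical (finite-difference) approximations, on made-up data and an arbitrary (w, b):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 4.0])
w, b = 0.7, 0.3                                   # arbitrary point at which to check

def J(w, b):
    return np.sum((w * x + b - y) ** 2) / (2 * len(x))

# Analytic derivatives from the derivation above.
dj_dw = np.sum((w * x + b - y) * x) / len(x)
dj_db = np.sum(w * x + b - y) / len(x)

# Numerical approximations: (J(p + eps) - J(p - eps)) / (2 * eps).
eps = 1e-6
num_dw = (J(w + eps, b) - J(w - eps, b)) / (2 * eps)
num_db = (J(w, b + eps) - J(w, b - eps)) / (2 * eps)

print(dj_dw, num_dw)    # the two values for d/dw agree to many decimal places
print(dj_db, num_db)    # likewise for d/db
```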

Running gradient descent

As gradient descent runs, the straight-line fit gets closer to the data and the cost steadily decreases until the parameters settle near the minimum.


Source

Machine Learning