Logistic Regression: From Sigmoid to Regularization

A comprehensive breakdown of logistic regression, sigmoid function, loss functions, and regularization for classification tasks.

Jongmin Lee
5 min read

3. Classification

Motivation

Classification predicts one of a small, finite set of possible outputs instead of the infinite range of values that regression produces.

Logistic regression (Classification)

Output is either 0 or 1

Sigmoid function

The sigmoid (logistic) function is an S-shaped curve that outputs a value between 0 and 1.

$g(z) = \frac{1}{1+e^{-z}}$, where $0 < g(z) < 1$
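
As a quick illustration (a minimal NumPy sketch, not part of the original notes), the sigmoid can be implemented directly and checked to stay strictly between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function: maps any real z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The output approaches 0 for very negative z and 1 for very positive z.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
```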

Steps

  1. Define the linear function $z$:

$z = \overrightarrow{w}\cdot\overrightarrow{x} + b$

  2. Pass $z$ through the sigmoid function:

$g(z) = \frac{1}{1+e^{-z}}$

Logistic regression

$f_{\overrightarrow{w},b}(\overrightarrow{x}) = g(\overrightarrow{w}\cdot\overrightarrow{x} + b) = \frac{1}{1+e^{-(\overrightarrow{w}\cdot\overrightarrow{x} + b)}} = P(y=1\mid\overrightarrow{x};\overrightarrow{w},b)$

This output is the probability that the class is 1.

  • Example
    • x is the tumor size
    • y is 0 (not malignant) or 1 (malignant)
    • f(x) = 0.7 → 70% chance that y is 1
    • $P(y=0) + P(y=1) = 1$
      • $f_{\overrightarrow{w},b}(\overrightarrow{x}) = P(y=1\mid\overrightarrow{x};\overrightarrow{w},b)$
      • the probability that y is 1, given input $\overrightarrow{x}$ and parameters $\overrightarrow{w}, b$
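
A small sketch of how the model output is read as a probability; the weight, bias, and tumor-size value below are made-up numbers for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """P(y = 1 | x; w, b) for logistic regression."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical single-feature example: x is a tumor size, w and b are assumed.
w = np.array([0.8])
b = -3.0
x = np.array([5.0])
p = predict_proba(x, w, b)
print(f"P(y=1) = {p:.2f}, P(y=0) = {1 - p:.2f}")  # the two probabilities sum to 1
```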

Decision boundary

Prediction: $\hat{y}$

Is $f_{\overrightarrow{w},b}(\overrightarrow{x}) \geq 0.5$?

  • Yes: $\hat{y} = 1$
  • No: $\hat{y} = 0$

When is $f_{\overrightarrow{w},b}(\overrightarrow{x}) \geq 0.5$?

  • $g(z) \geq 0.5$
  • $z \geq 0$
  • $\overrightarrow{w}\cdot\overrightarrow{x} + b \geq 0$ → predict 1 (when $\overrightarrow{w}\cdot\overrightarrow{x} + b < 0$ → predict 0)
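
A minimal sketch of this decision rule, showing that thresholding $f$ at 0.5 is the same as checking the sign of $z = \overrightarrow{w}\cdot\overrightarrow{x} + b$; the weights and inputs are arbitrary example values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b):
    """Return y_hat = 1 where f(x) >= 0.5, else 0."""
    z = X @ w + b
    # sigmoid(z) >= 0.5 exactly when z >= 0, so either test gives the same labels.
    return (sigmoid(z) >= 0.5).astype(int)

w = np.array([1.0, -2.0])      # assumed weights
b = 0.5                        # assumed bias
X = np.array([[3.0, 1.0],      # z =  1.5 -> predict 1
              [0.0, 1.0]])     # z = -1.5 -> predict 0
print(predict(X, w, b))        # [1 0]
assert np.array_equal(predict(X, w, b), (X @ w + b >= 0).astype(int))
```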

Cost Function

Squared error cost

$J(\overrightarrow{w},b)=\frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2$

With the sigmoid model plugged in, the squared error cost is non-convex and can have many local minima, so it is not a good choice for logistic regression.

$\frac{1}{2}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2$

  • → replaced by the loss function $L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)})$

Logistic loss function

$L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)})$

  • $-\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))$ if $y^{(i)} = 1$
  • $-\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))$ if $y^{(i)} = 0$

The closer the loss is to 0, the closer the prediction is to the true label.

Cost

With this loss, the cost $J$ is convex, so gradient descent can reach the global minimum.

$J(\overrightarrow{w},b)=\frac{1}{m}\sum_{i=1}^{m}L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)})$

  • $L = -\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))$ if $y^{(i)} = 1$
  • $L = -\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))$ if $y^{(i)} = 0$

Simplified Cost Function

$L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)}) = -y^{(i)}\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) - (1-y^{(i)})\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))$

  • if $y^{(i)} = 1$:
    • $L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)}) = -\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))$
  • if $y^{(i)} = 0$:
    • $L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)}) = -\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))$
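A quick numerical check (a sketch, with made-up predictions) that the simplified one-line loss reproduces the two-case definition:

```python
import numpy as np

def loss_piecewise(f, y):
    """Two-case logistic loss: -log(f) if y == 1, else -log(1 - f)."""
    return -np.log(f) if y == 1 else -np.log(1.0 - f)

def loss_simplified(f, y):
    """Single-expression form: -y*log(f) - (1 - y)*log(1 - f)."""
    return -y * np.log(f) - (1 - y) * np.log(1.0 - f)

# The two forms agree for both label values and any prediction f in (0, 1).
for f, y in [(0.9, 1), (0.9, 0), (0.1, 1), (0.1, 0)]:
    assert np.isclose(loss_piecewise(f, y), loss_simplified(f, y))
    print(f"f={f}, y={y}, loss={loss_simplified(f, y):.3f}")
```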

Loss function

$L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)}) = -y^{(i)}\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) - (1-y^{(i)})\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))$

Cost function

$J(\overrightarrow{w},b)=\frac{1}{m}\sum_{i=1}^{m}L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)})$

  • $= -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) + (1-y^{(i)})\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))\right]$
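
A minimal sketch of this cost computed over a training set; the toy features, labels, and parameters below are assumed values for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b):
    """Average logistic (cross-entropy) loss over m examples."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                           # predicted probabilities
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)  # per-example loss
    return loss.sum() / m

# Toy data: one feature, labels roughly increasing with x.
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([0, 0, 1, 1])
print(compute_cost(X, y, w=np.array([1.0]), b=-2.0))
```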

Gradient Descent

$J(\overrightarrow{w},b)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) + (1-y^{(i)})\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))\right]$

repeat {

$w_j = w_j - \alpha\frac{\partial}{\partial w_j}J(\overrightarrow{w},b)$

  • $\frac{\partial}{\partial w_j}J(\overrightarrow{w},b) = \frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}$

$b = b - \alpha\frac{\partial}{\partial b}J(\overrightarrow{w},b)$

  • $\frac{\partial}{\partial b}J(\overrightarrow{w},b) = \frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})$

} (simultaneous updates)

Gradient descent for logistic regression

repeat {

$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}\right]$

$b = b - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})\right]$

}
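
Putting the update rules together, here is a minimal (unregularized) gradient descent sketch for logistic regression; the toy data, learning rate, and iteration count are assumed, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Fit logistic regression by batch gradient descent."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y   # (f(x_i) - y_i) for every example
        dj_dw = (X.T @ err) / m        # gradient w.r.t. each w_j
        dj_db = err.sum() / m          # gradient w.r.t. b
        w -= alpha * dj_dw             # simultaneous updates
        b -= alpha * dj_db
    return w, b

# Toy 1-D data: the label flips from 0 to 1 as x grows.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = gradient_descent(X, y)
print(w, b, sigmoid(X @ w + b).round(2))
```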

Linear regression model

$f_{\overrightarrow{w},b}(\overrightarrow{x})=\overrightarrow{w}\cdot\overrightarrow{x}+b$

Logistic regression

$f_{\overrightarrow{w},b}(\overrightarrow{x})=\frac{1}{1+e^{-(\overrightarrow{w}\cdot\overrightarrow{x}+b)}}$

  • The same techniques from linear regression apply
    • monitoring gradient descent with a learning curve
    • vectorized implementation
    • feature scaling

The problem of overfitting

(Figures: examples of overfitting in linear regression and in classification.)

Addressing overfitting

  • Collecting more training examples
  • Selecting features to include or exclude
    • Feature selection
  • Reduce the size of parameters
    • Regularization

Cost Function to Regularization

Intuition

$\min_{\overrightarrow{w},b}\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2$

  • Small values of $w_1, w_2, \dots, w_n, b$ → a simpler model that is less likely to overfit
    • $J(\overrightarrow{w},b)=\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2$
  • Lambda $\lambda$
    • regularization parameter, $\lambda > 0$
    • $J(\overrightarrow{w},b)=\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2 + \frac{\lambda}{2m}b^2$
      • mean squared error term
        • $\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2$
      • regularization term
        • $\frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2$
      • the $b$ term can be included or excluded

Regularization

$\min_{\overrightarrow{w},b}J(\overrightarrow{w},b)=\min_{\overrightarrow{w},b}\left[\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2\right]$

  • Linear regression example

$f_{\overrightarrow{w},b}(x)=w_1x+w_2x^2+w_3x^3+w_4x^4+b$

choose $\lambda = 10^{10}$ → underfits

choose $\lambda = 0$ → overfits

Regularized linear regression

$\min_{\overrightarrow{w},b}J(\overrightarrow{w},b)=\min_{\overrightarrow{w},b}\left[\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2\right]$
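
A sketch of this regularized cost in code (toy features, labels, and parameters are assumed); note the penalty sums over the weights only, not $b$:

```python
import numpy as np

def regularized_cost_linear(X, y, w, b, lam):
    """Mean squared error plus an L2 penalty on the weights."""
    m = X.shape[0]
    err = X @ w + b - y
    mse_term = (err ** 2).sum() / (2 * m)
    reg_term = (lam / (2 * m)) * (w ** 2).sum()   # b is not penalized
    return mse_term + reg_term

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])   # assumed toy features
y = np.array([3.0, 2.5, 4.0])
w = np.array([0.5, 1.0])
print(regularized_cost_linear(X, y, w, b=0.5, lam=1.0))
```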

Gradient descent

repeat {

$w_j = w_j - \alpha\frac{\partial}{\partial w_j}J(\overrightarrow{w},b)$

$b = b - \alpha\frac{\partial}{\partial b}J(\overrightarrow{w},b)$

}

  • $\frac{\partial}{\partial w_j}J(\overrightarrow{w},b) = \frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}w_j$
  • $\frac{\partial}{\partial b}J(\overrightarrow{w},b) = \frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})$
    • we don't have to regularize $b$

Implement gradient descent with regularized linear regression

repeat {

$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}w_j\right]$

$b = b - \alpha\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})$

}
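
A minimal sketch of one such regularized update step for linear regression (the arrays, learning rate, and $\lambda$ below are arbitrary example values):

```python
import numpy as np

def regularized_gd_step(X, y, w, b, alpha, lam):
    """One simultaneous update of (w, b) with L2 regularization on w."""
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = (X.T @ err) / m + (lam / m) * w   # regularized gradient for w
    dj_db = err.sum() / m                     # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([3.0, 2.5, 4.0])
w, b = np.array([0.5, 1.0]), 0.5
w, b = regularized_gd_step(X, y, w, b, alpha=0.01, lam=1.0)
print(w, b)
```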

$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}w_j\right]$

  • $w_j = 1 \cdot w_j - \alpha\frac{\lambda}{m}w_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}$
  • $w_j = w_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}$
    • usual gradient descent update: $\alpha\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}$
    • shrinks $w_j$ a little on every iteration: the factor $\left(1 - \alpha\frac{\lambda}{m}\right)$
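
To get a feel for the shrink factor (the numbers here are chosen only for illustration): with $\alpha = 0.01$, $\lambda = 1$, and $m = 50$, the factor is $1 - \frac{0.01 \cdot 1}{50} = 0.9998$, so each iteration multiplies $w_j$ by 0.9998 before subtracting the usual gradient step, a small but persistent pull toward zero.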

How we get the derivative term

$\frac{\partial}{\partial w_j}J(\overrightarrow{w},b)=\frac{\partial}{\partial w_j}\left[\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2+\frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2\right]$

  • $= \frac{1}{2m}\sum_{i=1}^{m}\left[(\overrightarrow{w}\cdot\overrightarrow{x}^{(i)}+b-y^{(i)})\,2x_j^{(i)}\right]+\frac{\lambda}{2m}2w_j$
  • $= \frac{1}{m}\sum_{i=1}^{m}\left[(\overrightarrow{w}\cdot\overrightarrow{x}^{(i)}+b-y^{(i)})x_j^{(i)}\right]+\frac{\lambda}{m}w_j$
  • $= \frac{1}{m}\sum_{i=1}^{m}\left[(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}\right]+\frac{\lambda}{m}w_j$

Regularized Logistic Regression

Logistic regression

$z = w_1x_1 + ... + w_nx_n + b$

Sigmoid function

$f_{\overrightarrow{w},b}(\overrightarrow{x})=\frac{1}{1+e^{-z}}$

Cost function

$J(\overrightarrow{w},b)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) + (1-y^{(i)})\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2$

  • $\min_{\overrightarrow{w},b}J(\overrightarrow{w},b)$

Gradient descent

repeat {

$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}w_j\right]$

$b = b - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})\right]$

}
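
Finally, a sketch that combines everything: gradient descent on the regularized logistic cost. The learning rate, $\lambda$, iteration count, and toy data are all assumed for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_gd(X, y, alpha=0.1, lam=0.1, num_iters=1000):
    """Gradient descent on the regularized logistic regression cost."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y
        dj_dw = (X.T @ err) / m + (lam / m) * w   # L2 penalty applies to w only
        dj_db = err.sum() / m                     # b is not regularized
        w -= alpha * dj_dw                        # simultaneous updates
        b -= alpha * dj_db
    return w, b

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = regularized_logistic_gd(X, y)
print(w, b)
```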