
Logistic Regression: From Sigmoid to Regularization

A comprehensive breakdown of logistic regression, the sigmoid function, loss functions, and regularization for classification tasks.


3. Classification

Motivation

To predict one of a small handful of possible output values instead of an infinite range of outputs.

Logistic regression (Classification)

Output is either 0 or 1

Sigmoid function

(Figure: plot of the sigmoid curve.)

Outputs a value between 0 and 1:

$g(z) = \frac{1}{1+e^{-z}}$, where $0 < g(z) < 1$
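As a quick sketch (not from the original notes), the sigmoid is one line of NumPy; the function name and test values below are my own:

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Large negative z -> near 0, z = 0 -> exactly 0.5, large positive z -> near 1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```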

Steps

  1. Define the $z$ function:

$f_{\overrightarrow{w},b}(\overrightarrow{x}) = z = \overrightarrow{w}\cdot\overrightarrow{x} + b$

  2. Pass $z$ into the sigmoid function:

$g(z) = \frac{1}{1+e^{-z}}$

Logistic regression

$f_{\overrightarrow{w},b}(\overrightarrow{x}) = g(\overrightarrow{w}\cdot\overrightarrow{x} + b) = \frac{1}{1+e^{-(\overrightarrow{w}\cdot\overrightarrow{x} + b)}} = P(y=1 \mid \overrightarrow{x};\overrightarrow{w},b)$

The probability that the class is 1, given input $\overrightarrow{x}$ and parameters $\overrightarrow{w}, b$.
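A minimal sketch of that probability computation, assuming NumPy; `predict_proba`, the weights, and the input are invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_proba(x, w, b):
    """P(y = 1 | x; w, b) for a single example x."""
    z = np.dot(w, x) + b   # linear part: z = w . x + b
    return sigmoid(z)      # squash into (0, 1)

# Made-up parameters and input
w = np.array([1.5, -0.5])
b = -1.0
x = np.array([2.0, 1.0])
print(predict_proba(x, w, b))  # ~0.82, i.e. about an 82% chance that y = 1
```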

Decision boundary

Prediction: $\hat{y}$

Is $f_{\overrightarrow{w},b}(\overrightarrow{x}) \geq 0.5$? If so, predict $\hat{y} = 1$; otherwise predict $\hat{y} = 0$.

When is $f_{\overrightarrow{w},b}(\overrightarrow{x}) \geq 0.5$? Exactly when $z = \overrightarrow{w}\cdot\overrightarrow{x} + b \geq 0$, so the decision boundary is where $\overrightarrow{w}\cdot\overrightarrow{x} + b = 0$.
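A small sketch of this threshold rule; the helper name and example parameters below are assumptions, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(x, w, b, threshold=0.5):
    """Return 1 if f(x) >= threshold, else 0."""
    f = sigmoid(np.dot(w, x) + b)
    return 1 if f >= threshold else 0

# sigmoid(z) >= 0.5 exactly when z >= 0, so the boundary is w . x + b = 0
w, b = np.array([1.0, 1.0]), -3.0
print(predict(np.array([2.0, 2.0]), w, b))  # z =  1 -> predicts 1
print(predict(np.array([1.0, 1.0]), w, b))  # z = -1 -> predicts 0
```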

Cost Function

Squared error cost

$J(\overrightarrow{w},b)=\frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2$

(Figure: the non-convex cost surface produced by squared error with the sigmoid.)

Squared error cost is not a good choice for logistic regression: with the sigmoid inside, $J(\overrightarrow{w},b)$ is non-convex and can have multiple local minima.

The per-example term $\frac{1}{2}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2$ is the loss, and it is what gets replaced below.

Logistic loss function

$L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)})$

When the loss is close to 0, the prediction is close to the true label.

Cost

This loss makes the cost function convex, so gradient descent can reach the global minimum.

$J(\overrightarrow{w},b)=\frac{1}{m}\sum_{i=1}^{m}L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)})$

Simplified Cost Function


Loss function

$L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)}) = -y^{(i)}\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) - (1-y^{(i)})\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))$

Cost function

$J(\overrightarrow{w},b)=\frac{1}{m}\sum_{i=1}^{m}\left[L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}),y^{(i)})\right]$
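One way this cost might be computed with NumPy, as a sketch; `compute_cost` and the toy dataset are my own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(X, y, w, b):
    """Average logistic loss over m examples. X: (m, n), y: (m,) with labels in {0, 1}."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                           # predictions for every example
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)  # per-example logistic loss
    return loss.sum() / m

# Toy data, made up for illustration
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
print(compute_cost(X, y, w=np.array([1.0, -1.0]), b=0.0))
```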

Gradient Descent

$J(\overrightarrow{w},b)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) + (1-y^{(i)})\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))\right]$

repeat {

$w_j = w_j - \alpha\frac{\partial}{\partial w_j}J(\overrightarrow{w},b)$

$b = b - \alpha\frac{\partial}{\partial b}J(\overrightarrow{w},b)$

} simultaneous updates

Gradient descent for logistic regression

repeat {

$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}\right]$

$b = b - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})\right]$

}
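A sketch of that loop in vectorized NumPy; the function name, hyperparameters, and toy data are assumptions made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for (unregularized) logistic regression."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y   # (f(x^(i)) - y^(i)) for every example
        dj_dw = (X.T @ err) / m        # gradient w.r.t. each w_j
        dj_db = err.sum() / m          # gradient w.r.t. b
        w = w - alpha * dj_dw          # simultaneous update
        b = b - alpha * dj_db
    return w, b

# Tiny made-up dataset: one feature, classes separated around x = 2
X = np.array([[1.0], [1.5], [2.5], [3.0]])
y = np.array([0, 0, 1, 1])
w, b = gradient_descent(X, y)
print(w, b, sigmoid(X @ w + b))  # probabilities should move toward the labels
```

Note that the update has exactly the same algebraic form as gradient descent for linear regression; only the definition of $f_{\overrightarrow{w},b}(\overrightarrow{x})$ changes, as the comparison below shows.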

Linear regression model

$f_{\overrightarrow{w},b}(\overrightarrow{x})=\overrightarrow{w}\cdot\overrightarrow{x}+b$

Logistic regression

$f_{\overrightarrow{w},b}(\overrightarrow{x})=\frac{1}{1+e^{-(\overrightarrow{w}\cdot\overrightarrow{x}+b)}}$

The problem of overfitting

Linear regression

(Figure: overfitting example for linear regression.)

Classification

(Figure: overfitting example for classification.)

Addressing overfitting

One way to address overfitting is regularization: keep all the features, but shrink the parameter values $w_j$ so that no single term dominates the model.

Cost Function to Regularization

Intuition

$\min_{\overrightarrow{w},b}\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2$

Regularization

$\min_{\overrightarrow{w},b}J(\overrightarrow{w},b)=\min_{\overrightarrow{w},b}\left[\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2\right]$

$f_{\overrightarrow{w},b}(x)=w_1x+w_2x^2+w_3x^3+w_4x^4+b$

Choose $\lambda = 10^{10}$ → underfits: the penalty forces every $w_j \approx 0$, leaving roughly $f \approx b$.

Choose $\lambda = 0$ → overfits: with no penalty, we are back to the unregularized cost.

Regularized linear regression

$\min_{\overrightarrow{w},b}J(\overrightarrow{w},b)=\min_{\overrightarrow{w},b}\left[\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2\right]$
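A possible NumPy sketch of this regularized cost; the function name and sample values are assumptions for illustration only:

```python
import numpy as np

def compute_cost_reg(X, y, w, b, lambda_=1.0):
    """Regularized squared-error cost. The bias b is not penalized."""
    m = X.shape[0]
    err = X @ w + b - y                         # prediction error per example
    cost = (err @ err) / (2 * m)                # (1 / 2m) * sum of squared errors
    reg = (lambda_ / (2 * m)) * np.sum(w ** 2)  # (lambda / 2m) * sum of w_j^2
    return cost + reg

# Made-up example
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([3.0, 2.5, 5.0])
print(compute_cost_reg(X, y, w=np.array([0.5, 0.5]), b=1.0, lambda_=1.0))
```

Increasing `lambda_` makes the penalty term dominate, pushing the fitted weights toward zero.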

Gradient descent

repeat {

$w_j = w_j - \alpha\frac{\partial}{\partial w_j}J(\overrightarrow{w},b)$

$b = b - \alpha\frac{\partial}{\partial b}J(\overrightarrow{w},b)$

}

Implement gradient descent with regularized linear regression

repeat {

$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}w_j\right]$

$b = b - \alpha\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})$

}

$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}w_j\right] = w_j\left(1-\alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}$

Rearranged this way, the $(1-\alpha\frac{\lambda}{m})$ factor shows that regularization shrinks $w_j$ slightly on every iteration before the usual gradient step.

How we get the derivative term

$\frac{\partial}{\partial w_j}J(\overrightarrow{w},b)=\frac{\partial}{\partial w_j}\left[\frac{1}{2m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})^2+\frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2\right] = \frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}w_j$

(using $\frac{\partial}{\partial w_j}f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) = x_j^{(i)}$ for linear regression; the $\frac{1}{2}$ cancels the 2 from the power rule)
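Putting those derivatives into code, a sketch of regularized gradient descent for linear regression (names, hyperparameters, and data are invented):

```python
import numpy as np

def gradient_descent_reg(X, y, alpha=0.01, lambda_=1.0, num_iters=1000):
    """Gradient descent for regularized linear regression; b gets no penalty term."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = X @ w + b - y                           # (f(x^(i)) - y^(i))
        dj_dw = (X.T @ err) / m + (lambda_ / m) * w   # includes the shrinkage term
        dj_db = err.sum() / m
        w = w - alpha * dj_dw                         # simultaneous update
        b = b - alpha * dj_db
    return w, b

# Made-up data: y is roughly 2x + 1 with a little noise
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(gradient_descent_reg(X, y, alpha=0.05, lambda_=0.1, num_iters=5000))
```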

Regularized Logistic Regression

Logistic regression

$z = w_1x_1+\dots+w_nx_n+b$

Sigmoid function

$f_{\overrightarrow{w},b}(\overrightarrow{x})=\frac{1}{1+e^{-z}}$

Cost function

$J(\overrightarrow{w},b)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) + (1-y^{(i)})\log(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2$
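As a sketch, this regularized logistic cost might be computed like so (function name and toy data are my own assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_logistic_reg(X, y, w, b, lambda_=1.0):
    """Cross-entropy cost plus (lambda / 2m) * sum of w_j^2; b is not penalized."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    cross_entropy = (-y * np.log(f) - (1 - y) * np.log(1 - f)).sum() / m
    reg = (lambda_ / (2 * m)) * np.sum(w ** 2)
    return cross_entropy + reg

# Made-up example
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
print(compute_cost_logistic_reg(X, y, w=np.array([1.0, -1.0]), b=0.0, lambda_=0.5))
```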

Gradient descent

repeat {

$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}w_j\right]$

$b = b - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)})\right]$

}
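Finally, a sketch of one regularized update step in NumPy; `gradient_step_reg`, the hyperparameters, and the toy data are illustrative assumptions, not code from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step_reg(X, y, w, b, alpha=0.1, lambda_=1.0):
    """One gradient-descent step for regularized logistic regression."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w   # extra (lambda / m) * w_j term
    dj_db = err.sum() / m                         # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db

# Repeating the step drives down the regularized cost on this toy dataset
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
w, b = np.zeros(2), 0.0
for _ in range(1000):
    w, b = gradient_step_reg(X, y, w, b, alpha=0.1, lambda_=0.1)
print(w, b)
```

The only difference from the unregularized version is the shrinkage term on $w$; everything else, including leaving $b$ unpenalized, carries over unchanged.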