Multiple Features

Model
f_{w,b}(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b
w = [w_1  w_2  w_3  ...  w_n]
b
x = [x_1  x_2  x_3  ...  x_n]
Simplified - multiple linear regression
f_{w,b}(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b = w \cdot x + b
Vectorization
Parameters and features
w = [w_1  w_2  w_3]   (n = 3)
b is a number
x = [x_1  x_2  x_3]
import numpy as np

w = np.array([1, 2, 3])     # parameter vector
b = 4                       # bias, a single number
x = np.array([10, 20, 30])  # feature vector
Without vectorization
f_{w,b}(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b
f = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b   # hard-coded for n = 3, does not scale
f_{w,b}(x) = \sum_{j=1}^{n} w_j x_j + b
f = 0
n = len(w)               # number of features
for j in range(n):
    f = f + w[j] * x[j]
f = f + b
With vectorization
f_{w,b}(x) = w \cdot x + b
f = np.dot(w,x) + b   # one call computes the whole sum
vectorization computes all the per-feature products in parallel (NumPy uses optimized hardware routines); a rough timing comparison is sketched below
- much less time
- efficient → scales to large datasets
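To see the speed difference, one can time both versions on a larger vector. A minimal sketch (the array size and use of time.time() are my own choices, not from the course):

import time
import numpy as np

n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)
b = 4.0

# loop version: one multiply-add per feature
start = time.time()
f_loop = 0.0
for j in range(n):
    f_loop = f_loop + w[j] * x[j]
f_loop = f_loop + b
loop_time = time.time() - start

# vectorized version: a single dot-product call
start = time.time()
f_vec = np.dot(w, x) + b
vec_time = time.time() - start

print(f"loop: {loop_time:.4f} s, vectorized: {vec_time:.4f} s")

On typical hardware the vectorized version is orders of magnitude faster, which is what makes it practical for large datasets.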
Gradient descent
w = (w_1  w_2  ...  w_16)
d = (d_1  d_2  ...  d_16)
w = np.array([0.5, 1.3, ... , 3.4])
d = np.array([0.3, 0.2, ... , 0.4])
Compute w_j = w_j - 0.1 d_j for j = 1 ... 16
Without vectorization
w_1 = w_1 - 0.1 d_1
…
w_16 = w_16 - 0.1 d_16
for j in range(16):
    w[j] = w[j] - 0.1 * d[j]
With vectorization
w = w - 0.1 d
w = w - 0.1 * d   # updates all 16 entries at once
Gradient descent for multiple regression
Previous notation
Parameters
w_1, ..., w_n
b
Model
f_{w,b}(x) = w_1 x_1 + ... + w_n x_n + b
Cost function
J(w_1, ..., w_n, b)
Gradient descent
repeat {
w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(w_1, ..., w_n, b)
b = b - \alpha \frac{\partial}{\partial b} J(w_1, ..., w_n, b)
}
Vector notation
Parameters
w = [w_1 ... w_n]
b
Model
f_{w,b}(x) = w \cdot x + b
Cost function
J(w,b)
Gradient descent
repeat {
w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(w, b)
b = b - \alpha \frac{\partial}{\partial b} J(w, b)
}
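The cost J(w, b) is the same squared-error cost as before, now using the dot-product model. A minimal NumPy sketch, assuming a feature matrix X of shape (m, n) and a target vector y of length m (the function name is mine):

import numpy as np

def compute_cost(X, y, w, b):
    """J(w, b) = (1 / (2m)) * sum_i (f_wb(x^(i)) - y^(i))^2"""
    m = X.shape[0]
    err = X @ w + b - y          # prediction error for every example
    return np.sum(err ** 2) / (2 * m)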
Gradient Descent
One feature
repeat {
w = w - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)}
- \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)} = \frac{\partial}{\partial w} J(w, b)
b = b - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})
simultaneously update w, b
}
n features (n ≥ 2)
repeat {
j = 1: w_1 = w_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) x_1^{(i)}
- \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) x_1^{(i)} = \frac{\partial}{\partial w_1} J(w, b)
...
j = n: w_n = w_n - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)}) x_n^{(i)}
b = b - \alpha \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})
simultaneously update w_j (for j = 1, ..., n) and b
}
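Put together, one iteration of these updates can be written with the vectorized operations from above. A minimal sketch, assuming X is an (m, n) feature matrix and y an (m,) target vector (the helper function and its name are mine, not the course's code):

import numpy as np

def gradient_descent_step(X, y, w, b, alpha):
    """One simultaneous update of all w_j and b."""
    m = X.shape[0]
    err = X @ w + b - y              # f_wb(x^(i)) - y^(i) for every example
    dj_dw = (X.T @ err) / m          # partial J / partial w_j, all j at once
    dj_db = np.sum(err) / m          # partial J / partial b
    return w - alpha * dj_dw, b - alpha * dj_db

The "repeat { ... }" then becomes a loop that calls this step until the cost stops decreasing (or a fixed number of iterations is reached).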
An alternative to gradient descent
Normal equation
- only for linear regression
- solve for w, b without iterations
- need to know
  - Normal equation method may be used in machine learning libraries that implement linear regression
  - Gradient descent is the recommended method for finding parameters w, b
- disadvantages
  - doesn’t generalize to other learning algorithms
  - slow when the number of features is large (> 10,000)
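For reference, the closed-form solution can be sketched in a few lines of NumPy (this is my illustration, not course code): append a column of ones to X so that b is solved together with w, then solve the least-squares problem.

import numpy as np

def normal_equation(X, y):
    """Closed-form linear regression; returns (w, b)."""
    m = X.shape[0]
    Xb = np.hstack([X, np.ones((m, 1))])             # extra column of 1s for the bias b
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)   # minimizes ||Xb @ theta - y||^2
    return theta[:-1], theta[-1]                     # w is the first n entries, b the last

Using a least-squares solver avoids forming the matrix inverse explicitly, which is the usual practice.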
Feature scaling
Feature scaling enables gradient descent to run much faster by rescaling each feature so that all features take on comparable ranges of values
Mean normalization

Z-score normalization

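Both rescaling methods work per column of a feature matrix X; the standard formulas are x_j := (x_j - mu_j) / (max_j - min_j) for mean normalization and x_j := (x_j - mu_j) / sigma_j for z-score normalization. A minimal sketch (the function names are mine):

import numpy as np

def mean_normalize(X):
    """Rescale each column to roughly [-1, 1] around zero."""
    mu = X.mean(axis=0)
    return (X - mu) / (X.max(axis=0) - X.min(axis=0))

def zscore_normalize(X):
    """Rescale each column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma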
Checking Gradient descent for convergence

Choosing the learning rate

Feature engineering

Polynomial regression

