- How to “roll down”
- Assume we are at weights $w(t)$ and we take a step of size $\eta$ in the direction of $\vec{v}$
- $w(t+1) = w(t) + \eta\vec{v}$
- $\vec v$ is the direction of the descent
- $\eta$ is the step size
- How do we determine the best direction in which to take the step?
- The goal is to step in the direction opposite to the gradient (a small numeric sketch follows this list)
- The gradient points in the direction of steepest increase of the error
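- A small numeric sketch of a single step $w(t+1) = w(t) + \eta\vec v$, using a made-up quadratic error $E(w) = ||w||_2^2$ (the function, weights, and step size are assumptions chosen only for illustration), with $\vec v$ taken as the unit vector opposite the gradient:

```python
import numpy as np

def E(w):
    """Toy error surface E(w) = ||w||^2 (made up for illustration)."""
    return np.sum(w ** 2)

def grad_E(w):
    """Gradient of the toy error surface: dE/dw = 2w."""
    return 2 * w

w = np.array([3.0, -4.0])      # current weights w(t)
eta = 0.1                      # step size (learning rate)

g = grad_E(w)
v = -g / np.linalg.norm(g)     # unit direction opposite the gradient
w_next = w + eta * v           # w(t+1) = w(t) + eta * v

print(E(w), E(w_next))         # 25.0 -> 24.01: the error decreases
```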
- The Gradient is the Fastest Way to “roll down”
- $\eta$ is also called the “learning rate”
- We want $\Delta E$ (defined next) to be positive, i.e., each step should decrease the error
- We want $\vec v$ that maximizes $\Delta E = E(w(t)) - E(w(t+1))$
- $\Delta E = E(w(t)) - E(w(t+1)) = E(w(t)) - E(w(t) + \eta \vec v)$
- $= E(w(t)) - (E(w(t)) + \eta \nabla E(w(t))^T\vec v + O(\eta ^2))$
- $O(\eta^2)$ collects the higher-order terms of the expansion, which are negligible for small $\eta$
- This expression is obtained by taking the first-order Taylor series expansion of $E(w(t) + \eta \vec v)$ around $w(t)$
- $= -\eta\nabla E(w(t))^T \vec v - O(\eta ^2) \approx - \eta \nabla E(w(t))^T \vec v$
- Derived by cancelling the $E(w(t))$’s
- When $\vec v$ is a unit direction vector, $||\vec v|| = 1$, the Cauchy–Schwarz inequality gives
- $\Delta E \approx -\eta\nabla E(w(t))^T \vec v \leq \eta ||\nabla E(w(t))||_2$
- The bound is attained when $\vec v$ points in the direction opposite the gradient, $\vec v = -\nabla E(w(t)) / ||\nabla E(w(t))||_2$; this choice of $\vec v$ maximizes $\Delta E$ (checked numerically below)
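- A quick numerical check of this claim (a sketch, reusing the toy $E(w) = ||w||_2^2$ from above; random sampling is just one way to probe the unit directions): among unit vectors $\vec v$, the observed decrease $\Delta E$ is largest when $\vec v$ points opposite the gradient, and its value is close to $\eta ||\nabla E(w(t))||_2$.

```python
import numpy as np

def E(w):
    return np.sum(w ** 2)              # toy error surface, as above

w = np.array([3.0, -4.0])
eta = 0.01
g = 2 * w                              # gradient of E at w

rng = np.random.default_rng(0)
best_drop, best_v = -np.inf, None
for _ in range(10_000):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)             # sample a random unit direction
    drop = E(w) - E(w + eta * v)       # Delta E for this direction
    if drop > best_drop:
        best_drop, best_v = drop, v

print(best_v)                              # close to -g/||g||_2 = [-0.6, 0.8]
print(best_drop, eta * np.linalg.norm(g))  # Delta E is close to eta*||g||_2
```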
- Iterating with Negative Gradient
- At each step, update the weights with the negative gradient: $w(t+1) = w(t) - \eta \nabla E(w(t))$ (a minimal loop is sketched below)
- $w(0) \rightarrow w(1) \rightarrow w(2) \rightarrow \ldots$
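- A minimal sketch of this iteration on the same toy error $E(w) = ||w||_2^2$ (the starting point, learning rate, and number of steps are arbitrary choices for illustration):

```python
import numpy as np

def grad_E(w):
    return 2 * w                       # gradient of the toy E(w) = ||w||^2

w = np.array([3.0, -4.0])              # w(0)
eta = 0.1                              # learning rate

for t in range(50):
    w = w - eta * grad_E(w)            # w(t+1) = w(t) - eta * grad E(w(t))

print(w)                               # approaches the minimizer w = 0
```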
- Choosing step size
- if $\eta$ is too small, progress is slow and you may not reach the valley within a reasonable number of steps
- if $\eta$ is too large, you will overshoot the minimum and the iterates may oscillate instead of settling
- One common method is to make $\eta$ large at the beginning and decrease the step size as you take more steps, so later steps become more precise (one possible schedule is sketched below)
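- One way to implement “large at first, smaller later” is a decaying schedule; the $\eta_0 / (1 + \text{decay} \cdot t)$ rule below is just one common choice, not something fixed by the notes:

```python
def eta_schedule(t, eta0=0.5, decay=0.1):
    """Step size that starts at eta0 and shrinks as the step count t grows."""
    return eta0 / (1.0 + decay * t)

# Early steps are large; later steps are small and more precise.
print([round(eta_schedule(t), 3) for t in (0, 1, 10, 100)])
# -> [0.5, 0.455, 0.25, 0.045]
```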
- Learning by Gradient Descent
- The gradient descent algorithm can be applied to minimize any smooth function
- However, it only applies to smooth (differentiable) functions
- If the algorithm steps onto a flat region or a break in the surface, the gradient becomes zero (or is undefined) and the algorithm can no longer make progress (see the toy example below)
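- To see why this matters, here is a made-up error function with a flat plateau: starting on the plateau, the gradient is zero, so the update never moves, even though a lower error exists elsewhere.

```python
def E(w):
    """Piecewise error: flat plateau at E = 1 for w <= 2, quadratic bowl beyond."""
    return 1.0 if w <= 2 else (w - 3) ** 2

def grad_E(w):
    return 0.0 if w <= 2 else 2 * (w - 3)

w, eta = 1.0, 0.1                  # start on the plateau
for _ in range(100):
    w = w - eta * grad_E(w)        # gradient is 0 here, so w never changes

print(w, E(w))                     # still (1.0, 1.0), even though E(3) = 0
```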
- Gradient descent vs stochastic gradient descent
- Typically, gradient descent computes the gradient over the entire dataset; if the dataset is too large, you may use stochastic gradient descent instead
- Stochastic gradient descent updates the weights using the error on a single example (or a small subset/minibatch) of the dataset, so each update is much cheaper and convergence is often faster in practice
- The issue is that the gradient estimate is noisy, so individual steps may increase the error (see the comparison sketched below)
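- A sketch of the difference on a made-up least-squares problem (the data, learning rate, and step counts are arbitrary illustrative choices): full-batch gradient descent averages the gradient over all examples per step, while SGD uses one randomly chosen example per step, making each update cheaper but noisier.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                     # made-up dataset
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    """Gradient of the mean squared error over the batch (Xb, yb)."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

eta = 0.05

# Full-batch gradient descent: one gradient over the entire dataset per step.
w_gd = np.zeros(3)
for _ in range(200):
    w_gd -= eta * grad(w_gd, X, y)

# Stochastic gradient descent: one randomly chosen example per step.
w_sgd = np.zeros(3)
for _ in range(200):
    i = rng.integers(len(y))
    w_sgd -= eta * grad(w_sgd, X[i:i+1], y[i:i+1])

print(w_gd)    # close to true_w
print(w_sgd)   # rougher estimate, but each step touched only one example
```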