• How to “roll down”
    • Assume we are at weights $w(t)$ and we take a step of size $\eta$ in the direction of $\vec{v}$
      • $w(t+1) = w(t) + \eta\vec v$ (a small numeric example appears at the end of this list)
        • $\vec v$ is the direction of descent
        • $\eta$ is the step size
    • How do we determine the best direction in which to take the step?
    • The goal is to move in the direction opposite the gradient
      • gradient points in the direction of largest increase
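    • A quick worked example (with toy numbers chosen here purely for illustration): take $w(t) = (1,\,2)$, $\eta = 0.1$, and $\vec v = (0,\,-1)$
      • $w(t+1) = w(t) + \eta\vec v = (1,\,2) + 0.1\,(0,\,-1) = (1,\,1.9)$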
  • The Gradient is the Fastest Way to “roll down”
    • Note $\eta$ is always positive
      • it is called the “learning rate”
    • We want $\vec v$ that maximizes $\Delta E = E(w(t)) - E(w(t+1))$
      • $\Delta E = E(w(t)) - E(w(t+1)) = E(w(t)) - E(w(t) + \eta \vec v)$
      • $= E(w(t)) - (E(w(t)) + \eta \nabla E(w(t))^T\vec v + O(\eta ^2))$
        • $O(\eta^2)$ is the higher-order remainder (error) term
        • This term comes from the first-order Taylor expansion of $E(w(t) + \eta \vec v)$ around $w(t)$
      • $= -\eta\nabla E(w(t))^T \vec v - O(\eta ^2) \approx - \eta \nabla E(w(t))^T \vec v$
        • Derived by cancelling the $E(w(t))$’s
      • When $\vec v$ is a unit direction vector, $||\vec v|| = 1$, we have
        • $\vec v$ should point in the direction opposite the gradient; the inner product $\nabla E(w(t))^T\vec v$ is then as negative as possible, so $\Delta E$ is maximized (worked out below this list)
        • $\Delta E \approx -\eta\nabla E(w(t))^T \vec v \leq \eta ||\nabla E(w(t))||_2$
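    • Working out the bound above (the standard Cauchy–Schwarz argument, spelled out here for completeness):
      • For any unit vector $\vec v$: $|\nabla E(w(t))^T \vec v| \leq ||\nabla E(w(t))||_2\,||\vec v||_2 = ||\nabla E(w(t))||_2$
      • The bound is attained when $\vec v = -\dfrac{\nabla E(w(t))}{||\nabla E(w(t))||_2}$, giving $\Delta E \approx \eta\,||\nabla E(w(t))||_2$
      • Hence the negative gradient is the direction of steepest descent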
  • Iterating with Negative Gradient
    • Update the weights with the negative gradient at each step: $w(t+1) = w(t) - \eta \nabla E(w(t))$
    • $w(0) \rightarrow w(1) \rightarrow w(2) \rightarrow \ldots$
    • Choosing step size
      • if $\eta$ is too small, progress is very slow and you may not reach the valley in a reasonable number of steps
      • if $\eta$ is too large, you will overshoot the minimum and never settle accurately
      • One common method is to start with a large $\eta$ and decrease it as you take more steps, so early steps cover ground quickly and later steps become more precise (see the sketch below this list)
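    • A minimal Python sketch of gradient descent with this kind of decaying step size (the quadratic objective, the $1/(1 + \text{decay}\cdot t)$ schedule, and all constants are illustrative assumptions, not from the notes):

      ```python
      import numpy as np

      def gradient_descent(grad, w0, eta0=0.8, decay=0.1, steps=100):
          """Gradient descent with a simple 1 / (1 + decay * t) learning-rate decay."""
          w = np.asarray(w0, dtype=float)
          for t in range(steps):
              eta = eta0 / (1.0 + decay * t)   # large steps early, smaller steps later
              w = w - eta * grad(w)            # move opposite the gradient
          return w

      # Toy objective E(w) = ||w||^2, so grad E(w) = 2w; the minimum is at w = 0.
      w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0])
      print(w_min)  # ends up at (approximately) [0, 0]
      ```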
  • Learning by Gradient Descent
    • The gradient descent algorithm can be applied to minimize any smooth function
      • However, it only applies to smooth (differentiable) functions
      • If the algorithm steps into a break (a non-differentiable point) or a flat region, the gradient is undefined or zero, and it cannot take a useful step
    • Gradient descent vs stochastic gradient descent
      • Gradient descent typically takes the gradient over the entire dataset; if the dataset is too large, you may use stochastic gradient descent instead
      • Stochastic gradient descent computes the gradient of the error on a single example or a subset (mini-batch) of the dataset, allowing cheaper updates and often faster convergence (see the sketch below this list)
        • The issue is that each update is noisier, which may lead to a larger error
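    • A minimal sketch contrasting the two on a least-squares problem (the linear model, the toy data, the batch size of 32, and the learning rate are all illustrative assumptions):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(1000, 3))          # toy dataset
      y = X @ np.array([1.0, -2.0, 0.5])      # targets from a known linear model

      def grad_mse(w, Xb, yb):
          """Gradient of the mean-squared error over the batch (Xb, yb)."""
          return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

      eta, w_gd, w_sgd = 0.1, np.zeros(3), np.zeros(3)
      for t in range(200):
          # Full-batch gradient descent: one gradient over the entire dataset per step.
          w_gd -= eta * grad_mse(w_gd, X, y)
          # Stochastic (mini-batch) gradient descent: gradient over a random subset.
          idx = rng.integers(0, len(X), size=32)
          w_sgd -= eta * grad_mse(w_sgd, X[idx], y[idx])

      print(w_gd)   # ≈ [1, -2, 0.5]
      print(w_sgd)  # also ≈ [1, -2, 0.5], but each step was cheaper and noisier
      ```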