• How to “roll down”
    • Assume we are at weights $w(t)$ and we take a step of size $\eta$ in the direction of $\vec{v}$
      • $w(t+1) = w(t) + \eta\vec v$ (a small numeric example appears at the end of this list)
        • $\vec v$ is the direction of descent
        • $\eta$ is the step size
    • How do we determine the best direction in which to take the step?
    • The goal is to move in the direction opposite the gradient
      • gradient points in the direction of largest increase
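    • A quick worked example (with toy numbers chosen here purely for illustration): take $w(t) = (1,\,2)$, $\eta = 0.1$, and $\vec v = (0,\,-1)$
      • $w(t+1) = w(t) + \eta\vec v = (1,\,2) + 0.1\,(0,\,-1) = (1,\,1.9)$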
  • The Gradient is the Fastest Way to “roll down”
    • Note $\eta$ is always positive
      • it is called the “learning rate”
    • We want $\vec v$ that maximizes $\Delta E = E(w(t)) - E(w(t+1))$
      • $\Delta E = E(w(t)) - E(w(t+1)) = E(w(t)) - E(w(t) + \eta \vec v)$
      • $= E(w(t)) - (E(w(t)) + \eta \nabla E(w(t))^T\vec v + O(\eta ^2))$
        • $O(\eta^2)$ is the higher-order remainder (error) term
        • This term comes from the first-order Taylor expansion of $E(w(t) + \eta \vec v)$ around $w(t)$
      • $= -\eta\nabla E(w(t))^T \vec v - O(\eta ^2) \approx - \eta \nabla E(w(t))^T \vec v$
        • Derived by cancelling the $E(w(t))$’s
      • When $\vec v$ is a unit direction vector, $||\vec v|| = 1$, we have
        • $\vec v$ should point in the direction opposite the gradient; the inner product $\nabla E(w(t))^T\vec v$ is then as negative as possible, so $\Delta E$ is maximized (worked out below this list)
        • $\Delta E \approx -\eta\nabla E(w(t))^T \vec v \leq \eta ||\nabla E(w(t))||_2$
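    • Working out the bound above (the standard Cauchy–Schwarz argument, spelled out here for completeness):
      • For any unit vector $\vec v$: $|\nabla E(w(t))^T \vec v| \leq ||\nabla E(w(t))||_2\,||\vec v||_2 = ||\nabla E(w(t))||_2$
      • The bound is attained when $\vec v = -\dfrac{\nabla E(w(t))}{||\nabla E(w(t))||_2}$, giving $\Delta E \approx \eta\,||\nabla E(w(t))||_2$
      • Hence the negative gradient is the direction of steepest descent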
  • Iterating with Negative Gradient
    • Update the weights with the negative gradient at each step: $w(t+1) = w(t) - \eta \nabla E(w(t))$
    • $w(0) \rightarrow w(1) \rightarrow w(2) \rightarrow \ldots$
    • Choosing step size
      • if $\eta$ is too small, progress is very slow and you may not reach the valley in a reasonable number of steps
      • if $\eta$ is too large, you will overshoot the minimum and never settle accurately
      • One common method is to start with a large $\eta$ and decrease it as you take more steps, so early steps cover ground quickly and later steps become more precise (see the sketch below this list)
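    • A minimal Python sketch of gradient descent with this kind of decaying step size (the quadratic objective, the $1/(1 + \text{decay}\cdot t)$ schedule, and all constants are illustrative assumptions, not from the notes):

      ```python
      import numpy as np

      def gradient_descent(grad, w0, eta0=0.8, decay=0.1, steps=100):
          """Gradient descent with a simple 1 / (1 + decay * t) learning-rate decay."""
          w = np.asarray(w0, dtype=float)
          for t in range(steps):
              eta = eta0 / (1.0 + decay * t)   # large steps early, smaller steps later
              w = w - eta * grad(w)            # move opposite the gradient
          return w

      # Toy objective E(w) = ||w||^2, so grad E(w) = 2w; the minimum is at w = 0.
      w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0])
      print(w_min)  # ends up at (approximately) [0, 0]
      ```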
  • Learning by Gradient Descent
    • The gradient descent algorithm can be applied to minimize any smooth function
      • However, it only applies to smooth (differentiable) functions
      • If the algorithm steps into a break (a non-differentiable point) or a flat region, the gradient is undefined or zero, and it cannot take a useful step
    • Gradient descent vs stochastic gradient descent
      • Gradient descent typically takes the gradient over the entire dataset; if the dataset is too large, you may use stochastic gradient descent instead
      • Stochastic gradient descent computes the gradient of the error on a single example or a subset (mini-batch) of the dataset, allowing cheaper updates and often faster convergence (see the sketch below this list)
        • The issue is that each update is noisier, which may lead to a larger error
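    • A minimal sketch contrasting the two on a least-squares problem (the linear model, the toy data, the batch size of 32, and the learning rate are all illustrative assumptions):

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(1000, 3))          # toy dataset
      y = X @ np.array([1.0, -2.0, 0.5])      # targets from a known linear model

      def grad_mse(w, Xb, yb):
          """Gradient of the mean-squared error over the batch (Xb, yb)."""
          return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

      eta, w_gd, w_sgd = 0.1, np.zeros(3), np.zeros(3)
      for t in range(200):
          # Full-batch gradient descent: one gradient over the entire dataset per step.
          w_gd -= eta * grad_mse(w_gd, X, y)
          # Stochastic (mini-batch) gradient descent: gradient over a random subset.
          idx = rng.integers(0, len(X), size=32)
          w_sgd -= eta * grad_mse(w_sgd, X[idx], y[idx])

      print(w_gd)   # ≈ [1, -2, 0.5]
      print(w_sgd)  # also ≈ [1, -2, 0.5], but each step was cheaper and noisier
      ```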