
- Input layer
- Represented by values $x_1,x_2,x_3$
- Hidden Layer
- The $x$ vector is passed on to the Hidden Layer by taking its dot product with the weights $w_1, w_2, w_3$ and then adding a bias term
- $y = w_1x_1 + w_2 x_2 + w_3x_3 + \text{bias}$
- After calculating $y$ we apply an activation function
- $z = Act(y)$
- Example activation function
- $\frac1{1+e^{-y}}$
- Returns a value between 0 and 1 based on $y$; if it is less than $0.5$ the output layer returns 0 (or -1, depending on convention), otherwise it returns 1 (see the sketch below)
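- A minimal sketch of this forward pass in Python (the input, weight, and bias values are made up purely for illustration):

```python
import numpy as np

def sigmoid(y):
    # squashes any real-valued y into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-y))

# made-up example values, purely for illustration
x = np.array([1.0, 2.0, 3.0])    # inputs x1, x2, x3
w = np.array([0.4, -0.2, 0.1])   # weights w1, w2, w3
b = 0.5                          # bias term

y = np.dot(w, x) + b             # y = w1*x1 + w2*x2 + w3*x3 + bias
z = sigmoid(y)                   # z = Act(y)

output = 1 if z > 0.5 else 0     # threshold at 0.5 for the output layer
print(y, z, output)              # 0.8, ~0.69, 1
```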
- Activation Functions
- Sigmoid Af
- $z = \frac1{1+e^{-y}}$
- Advantage is that it returns a value between 0 and 1
- This can be useful for probabilistic predictions
- ReLU Af
- Returns $z = \max(y, 0)$
- If $y > 0$ it returns $y$; if $y \le 0$ it returns $0$ (both functions are sketched below)
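- Both activations are one-liners; a small sketch comparing them on a few sample values:

```python
import numpy as np

def sigmoid(y):
    # always returns a value strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-y))

def relu(y):
    # max(y, 0): passes positive values through, clips negatives to 0
    return np.maximum(y, 0)

ys = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(ys))  # [0.119 0.378 0.5 0.622 0.881] -> all in (0, 1)
print(relu(ys))     # [0. 0. 0. 0.5 2.] -> negatives clipped to 0
```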
- Back propagation
- Begins with forward propagation to first determine loss
- Example
- $y = (x_1w_1+x_2w_2+x_3w_3) + b_1$
- $z = Act(y)$
- Sigmoid activation function
- Example data:
- Features = playtime, studytime, sleeptime
- Output $\in \{0, 1\}$, i.e. fail or pass
- Assume that $\hat y = 0$ but real value is $y = 1$
- Loss function = $(y-\hat y)^2$
- So Loss = $(1-0)^2 = 1$
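- The same loss arithmetic as a tiny sketch (assuming the labels 1 = pass, 0 = fail):

```python
y_true = 1                       # the student actually passed
y_hat = 0                        # the network predicted fail
loss = (y_true - y_hat) ** 2     # squared-error loss
print(loss)                      # (1 - 0)^2 = 1
```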
- Goal is to reduce the loss function to a minimal value
- The loss can be minimized using an optimizer
- An example of an optimizer is gradient descent
- After determining the loss $L$
- $w_{4 \text{new}} = w_{4\text{old}} -\eta \frac{\partial L}{\partial w_4}$
- where $\eta$ is a learning rate
- The learning rate may be a very small value e.g. $\eta = 0.001$
- This way the weight values are not updated too drastically by each data update
- A small $\eta$ keeps the gradient descent step size small
- It shouldn’t be too small though, or else the weights will barely change and training will take very long (see the numeric sketch below)
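- A numeric sketch of one gradient descent update (the old weight and gradient values are made up; only the update formula comes from the notes):

```python
w4_old = 0.8     # made-up current weight value
grad_w4 = 2.5    # made-up dL/dw4 from back propagation
eta = 0.001      # small learning rate

w4_new = w4_old - eta * grad_w4
print(w4_new)    # 0.7975 -- a small, non-drastic step
```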
- In the next layer (or previous, since we’re going backwards)
- Each weight gets updated with the same formula
- $w_{3\text{new}} = w_{3\text{old}} - \eta \frac{\partial L}{\partial w_3}$
- After back propagation has iterated through the network and all $w$ weights are updated, the process repeats for however many epochs
- An epoch is one full pass through the entire training dataset; back propagation runs many times within each epoch (see the toy sketch below)
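- A toy end-to-end sketch, assuming a single sigmoid neuron with squared-error loss (the dataset, learning rate, and epoch count are made up, and real networks would have hidden layers and use automatic differentiation):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# made-up dataset: rows are (playtime, studytime, sleeptime)
X = np.array([[2.0, 8.0, 7.0],
              [9.0, 1.0, 4.0]])
t = np.array([1.0, 0.0])          # labels: pass = 1, fail = 0

w = np.zeros(3)                   # weights, one per feature
b = 0.0                           # bias
eta = 0.01                        # learning rate

for epoch in range(100):          # one epoch = one full pass over the data
    for x, target in zip(X, t):
        # forward propagation to determine the loss
        y = np.dot(w, x) + b
        z = sigmoid(y)
        loss = (target - z) ** 2

        # back propagation: chain rule through loss -> sigmoid -> y
        dL_dz = -2.0 * (target - z)
        dz_dy = z * (1.0 - z)
        dL_dy = dL_dz * dz_dy

        # w_new = w_old - eta * dL/dw, and likewise for the bias
        w -= eta * dL_dy * x
        b -= eta * dL_dy
```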