
- Input layer
- Represented by values $x_1,x_2,x_3$
- Hidden Layer
- The $x$ vector is passed on to the Hidden Layer by taking its dot product with the weights $w_1, w_2, w_3$ and then adding a bias term
- $y = w_1x_1 + w_2 x_2 + w_3x_3 + \text{bias}$
- After calculating $y$ we apply an activation function
- $z = Act(y)$
- Example activation function
- $\frac1{1+e^{-y}}$
- Returns a value between 0 and 1 based on $y$; if it is less than $0.5$ the output layer returns 0 (or -1, depending on convention), otherwise it returns 1 (see the sketch below)
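- A minimal sketch of this forward pass in Python (the input, weight, and bias values are made up purely for illustration):

```python
import numpy as np

def sigmoid(y):
    # squashes any real-valued y into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-y))

# made-up example values, purely for illustration
x = np.array([1.0, 2.0, 3.0])    # inputs x1, x2, x3
w = np.array([0.4, -0.2, 0.1])   # weights w1, w2, w3
b = 0.5                          # bias term

y = np.dot(w, x) + b             # y = w1*x1 + w2*x2 + w3*x3 + bias
z = sigmoid(y)                   # z = Act(y)

output = 1 if z > 0.5 else 0     # threshold at 0.5 for the output layer
print(y, z, output)              # 0.8, ~0.69, 1
```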
- Activation Functions
- Sigmoid Af
- $z = \frac1{1+e^{-y}}$
- Advantage is that it returns a value between 0 and 1
- This can be useful for probabilistic predictions
- ReLU Af
- Returns $z = \max(y, 0)$
- If $y > 0$ it returns $y$; if $y \le 0$ it returns $0$ (both functions are sketched below)
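- Both activations are one-liners; a small sketch comparing them on a few sample values:

```python
import numpy as np

def sigmoid(y):
    # always returns a value strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-y))

def relu(y):
    # max(y, 0): passes positive values through, clips negatives to 0
    return np.maximum(y, 0)

ys = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(ys))  # [0.119 0.378 0.5 0.622 0.881] -> all in (0, 1)
print(relu(ys))     # [0. 0. 0. 0.5 2.] -> negatives clipped to 0
```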
- Back propagation
- Begins with forward propagation to first determine loss
- Example
- $y = (x_1w_1+x_2w_2+x_3w_3) + b_1$
- $z = Act(y)$
- Sigmoid activation function
- Example data:
- Features = playtime, studytime, sleeptime
- Output $\in \{0, 1\}$, i.e. fail or pass
- Assume that $\hat y = 0$ but real value is $y = 1$
- Loss function = $(y-\hat y)^2$
- So Loss = $(1-0)^2 = 1$
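- The same loss arithmetic as a tiny sketch (assuming the labels 1 = pass, 0 = fail):

```python
y_true = 1                       # the student actually passed
y_hat = 0                        # the network predicted fail
loss = (y_true - y_hat) ** 2     # squared-error loss
print(loss)                      # (1 - 0)^2 = 1
```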
- Goal is to reduce the loss function to a minimal value
- The loss can be minimized using an optimizer
- An example of an optimizer is gradient descent
- After determining the loss $L$
- $w_{4 \text{new}} = w_{4\text{old}} -\eta \frac{\partial L}{\partial w_4}$
- where $\eta$ is a learning rate
- The learning rate may be a very small value e.g. $\eta = 0.001$
- This way the weight values are not updated too drastically by each data update
- A small $\eta$ keeps the gradient descent step size small
- It shouldn’t be too small though, or else the weights will barely change and training will take very long (see the numeric sketch below)
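- A numeric sketch of one gradient descent update (the old weight and gradient values are made up; only the update formula comes from the notes):

```python
w4_old = 0.8     # made-up current weight value
grad_w4 = 2.5    # made-up dL/dw4 from back propagation
eta = 0.001      # small learning rate

w4_new = w4_old - eta * grad_w4
print(w4_new)    # 0.7975 -- a small, non-drastic step
```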
- In the next layer (or previous, since we’re going backwards)
- Each weight gets updated with the same formula
- $w_{3\text{new}} = w_{3\text{old}} - \eta \frac{\partial L}{\partial w_3}$
- After back propagation has iterated through the network and all $w$ weights are updated, the process repeats for however many epochs
- An epoch is one full pass through the entire training dataset; back propagation runs many times within each epoch (see the toy sketch below)
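- A toy end-to-end sketch, assuming a single sigmoid neuron with squared-error loss (the dataset, learning rate, and epoch count are made up, and real networks would have hidden layers and use automatic differentiation):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# made-up dataset: rows are (playtime, studytime, sleeptime)
X = np.array([[2.0, 8.0, 7.0],
              [9.0, 1.0, 4.0]])
t = np.array([1.0, 0.0])          # labels: pass = 1, fail = 0

w = np.zeros(3)                   # weights, one per feature
b = 0.0                           # bias
eta = 0.01                        # learning rate

for epoch in range(100):          # one epoch = one full pass over the data
    for x, target in zip(X, t):
        # forward propagation to determine the loss
        y = np.dot(w, x) + b
        z = sigmoid(y)
        loss = (target - z) ** 2

        # back propagation: chain rule through loss -> sigmoid -> y
        dL_dz = -2.0 * (target - z)
        dz_dy = z * (1.0 - z)
        dL_dy = dL_dz * dz_dy

        # w_new = w_old - eta * dL/dw, and likewise for the bias
        w -= eta * dL_dy * x
        b -= eta * dL_dy
```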