- Logistic Regression
- Outputs a real value between 0 and 1
- Allows for probabilistic interpretation
- Linear model
- No closed form solution
- Example:
- Training on a sample of handwritten digits
- Digits are from 0-9
- Can first simplify to a binary problem: classifying between 1 and 5
- Dataset
- In this example, the pictures are 28x28 pixels
- To create the $x$ vector we linearize the 28x28 pixels into a vector of length 784
- Good features to differentiate between the two classes
- Pixels
- Intensity
- the amount of black pixels in the image
- Symmetry
- the negative of the absolute difference between an image and its flipped version
- $x$ vector is a concatenation of the features’ values
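- As a rough illustration (not from the notes), the two features could be computed like this in Python; the 28x28 NumPy array, the value range [0, 1] with larger meaning darker, and the left-right flip are all assumptions:

```python
import numpy as np

def extract_features(img):
    """Map a 28x28 image (values in [0, 1], larger = darker) to x = (1, intensity, symmetry)."""
    intensity = img.mean()                           # average amount of "ink" in the image
    symmetry = -np.abs(img - np.fliplr(img)).mean()  # negative asymmetry w.r.t. a left-right flip
    return np.array([1.0, intensity, symmetry])      # leading 1 so w_0 acts as the bias term
```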
- The Sigmoid Function
- $h(x) = \theta\left(\sum^d_{i=0}w_ix_i\right) = \theta(w^Tx) \in [0,1]$
- maps an input vector $x$ to a real value between $[0,1]$
- $\theta (s) = \frac{e^s}{1+e^s} = \frac{1}{1 + e^{-s}}$
- How is $\theta(-s)$ related to $\theta(s)$?
- $\theta(-s) + \theta(s) = 1$, i.e. $\theta(-s) = 1 - \theta(s)$
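- Quick check from the definition: $\theta(s) + \theta(-s) = \frac{e^s}{1+e^s} + \frac{e^{-s}}{1+e^{-s}} = \frac{e^s}{1+e^s} + \frac{1}{1+e^s} = 1$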
- What makes a good $h$?
- h is good if:
- $h(x_n) = \theta(w^Tx_n) \approx 1$ whenever $y_n = +1$
- $h(x_n) = \theta(w^Tx_n) \approx 0$ whenever $y_n = -1$
- Data representation
- $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ with labels $y_n \in \{-1, +1\}$
- A bad, simplistic error measure for optimizing $h$
- $E(h) = \frac{1}{N}\sum^N_{n=1}(h(x_n)-\frac{1}{2}(1+y_n))^2$
- $\frac12 (1+y_n)$ term maps target from $\{-1,1\} \Rightarrow\{0,1\}$
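- Check: $y_n = +1 \Rightarrow \frac12(1+1) = 1$ and $y_n = -1 \Rightarrow \frac12(1-1) = 0$, so $h(x_n) \in [0,1]$ is compared against a $\{0,1\}$ target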
- Logistic Loss Function
- $E(w) = \frac1N\sum^N_{n=1}\ln\left(1+\exp(-y_n w^Tx_n)\right)$
- Based on intuitive probabilistic interpretation of $h$
- The larger you can make $y_n w^Tx_n$ (the closer it gets to $\infty$), the smaller the $\exp$ term becomes; since that term is added to 1, the $\ln$ term approaches zero as well, resulting in the smallest error
- This function is convex, so it is easy to minimize with gradient-based methods (even though there is no closed-form solution)
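- A minimal gradient-descent sketch for minimizing this loss (not from the notes); the gradient formula follows by differentiating $E(w)$, while the array shapes, learning rate, and step count are assumptions:

```python
import numpy as np

def logistic_loss(w, X, y):
    """E(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n)); X is (N, d), y has entries in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

def logistic_grad(w, X, y):
    """Gradient: -(1/N) * sum_n y_n * x_n / (1 + exp(y_n * w^T x_n))."""
    margins = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)

def fit(X, y, lr=0.1, steps=1000):
    """Plain gradient descent; there is no closed-form minimizer for this loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * logistic_grad(w, X, y)
    return w
```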
- Probabilistic Interpretation
- Suppose that $h(x) = \theta(w^Tx)$ closely captures $P[+1|x]$
- $P(y|x) = \theta(y\, w^Tx)$
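- Both cases collapse into one expression via $1 - \theta(s) = \theta(-s)$: if $y = +1$ then $P(y|x) = \theta(w^Tx)$, and if $y = -1$ then $P(y|x) = 1 - \theta(w^Tx) = \theta(-w^Tx)$, so in either case $P(y|x) = \theta(y\, w^Tx)$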