• Logistic Regression
    • Gives a real-valued output between 0 and 1
      • Allows for probabilistic interpretation
      • Linear model
        • i.e. given by $w^Tx$
      • No closed form solution
  • Example:
    • Training on a sample of handwritten digits
    • Digits are from 0-9
      • Can first simplify to a binary problem: training a classifier to distinguish 1 from 5
    • Dataset
      • In this example, the pictures are 28x28 pixels
      • To create the $x$ vector we flatten the 28x28 pixels into a vector of length 784
    • Good features to differentiate between the two classes
      • Pixels
      • Intensity
        • the amount of black pixels
      • Symmetry
        • the negative of the absolute difference between the image and its (horizontally) flipped version
    • The $x$ vector is a concatenation of the feature values (see the sketch below)
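A minimal NumPy sketch of this feature extraction, assuming each image arrives as a flattened length-784 array with pixel values in $[0,1]$ where 1 means ink; the function name and the layout of the returned vector (bias term $x_0 = 1$ first) are illustrative choices, not prescribed by the notes:

```python
import numpy as np

def extract_features(img_flat):
    """Map a flattened 28x28 image to the feature vector x = (1, intensity, symmetry).

    Assumes pixel values in [0, 1], where 1 means dark ink
    (an assumption of this sketch, not stated in the notes).
    """
    img = img_flat.reshape(28, 28)

    # Intensity: average amount of ink in the image.
    intensity = img.mean()

    # Symmetry: negative mean absolute difference between the image and
    # its horizontally flipped version (perfectly symmetric => 0).
    symmetry = -np.abs(img - np.fliplr(img)).mean()

    # x_0 = 1 is the constant bias coordinate (the i = 0 term in w^T x).
    return np.array([1.0, intensity, symmetry])
```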
  • The Sigmoid Function
    • $h(x) = \theta\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(w^Tx) \in [0,1]$
    • maps an input vector $x$ to a real value in $[0,1]$
    • $\theta (s) = \frac{e^s}{1+e^s} = \frac{1}{1 + e^{-s}}$
    • How is $\theta(-s)$ related to $\theta(s)$?
      • $\theta(-s) + \theta(s) = 1$, i.e. $\theta(-s) = 1 - \theta(s)$
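A minimal NumPy sketch of the sigmoid and the hypothesis $h$ (function names are illustrative); the last lines check the identity above numerically:

```python
import numpy as np

def sigmoid(s):
    """theta(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s}), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-s))

def h(w, x):
    """h(x) = theta(w^T x), a value between 0 and 1."""
    return sigmoid(np.dot(w, x))

# Numerical check of the identity theta(-s) = 1 - theta(s).
s = np.linspace(-5.0, 5.0, 11)
assert np.allclose(sigmoid(-s), 1.0 - sigmoid(s))
```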
  • What makes a good $h$?
    • $h$ is good if:
      • $h(x_n) = \theta(w^Tx_n) \approx 1$ whenever $y_n = +1$
      • $h(x_n) = \theta(w^Tx_n) \approx 0$ whenever $y_n = -1$
    • Data representation
      • $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ with labels $y_n \in \{-1, +1\}$
    • A simplistic (bad) error measure for optimizing $h$
      • $E(h) = \frac{1}{N}\sum^N_{n=1}(h(x_n)-\frac{1}{2}(1+y_n))^2$
        • $\frac12 (1+y_n)$ term maps target from $\{-1,1\} \Rightarrow\{0,1\}$
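For concreteness, a sketch of this (bad) squared-error measure; the matrix layout and names are assumptions of this sketch:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def squared_error(w, X, y):
    """E(h) = (1/N) * sum_n (h(x_n) - (1 + y_n)/2)^2.

    X: (N, d+1) array whose rows are the feature vectors x_n (bias included)
    y: length-N array of labels in {-1, +1}
    """
    preds = sigmoid(X @ w)        # h(x_n) for every example
    targets = 0.5 * (1.0 + y)     # map labels {-1, +1} -> {0, 1}
    return np.mean((preds - targets) ** 2)
```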
    • Logistic Loss Function
      • $E(w) = \frac{1}{N}\sum_{n=1}^{N}\ln\left(1+\exp(-y_n w^Tx_n)\right)$
      • Based on intuitive probabilistic interpretation of $h$
        • The larger $y_n w^Tx_n$ becomes (i.e. the closer to $+\infty$), the smaller the $\exp$ term gets; since that term is added to 1 inside the $\ln$, the log term approaches zero as well, giving the smallest error
      • This function is convex, so it can be minimized by gradient descent (there is no closed-form solution); see the sketch at the end of this section
      • Probabilistic Interpretation
        • Suppose that $h(x) = \theta(w^Tx)$ closely captures $P[+1|x]$
          • then, using $\theta(-s) = 1 - \theta(s)$:
        • $P(y \mid x) = \theta(y\, w^Tx)$
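Finally, a minimal sketch of the logistic loss and a plain gradient-descent minimizer; the gradient expression comes from differentiating $E(w)$, while the step size, iteration count, and function names are illustrative assumptions, not from the notes:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_loss(w, X, y):
    """E(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))."""
    margins = y * (X @ w)                      # y_n * w^T x_n for each example
    return np.mean(np.log1p(np.exp(-margins)))

def logistic_loss_grad(w, X, y):
    """grad E(w) = -(1/N) * sum_n y_n * x_n * theta(-y_n * w^T x_n)."""
    margins = y * (X @ w)
    return -(X.T @ (y * sigmoid(-margins))) / len(y)

def fit(X, y, lr=0.1, n_iters=1000):
    """Minimize E(w) by gradient descent -- there is no closed-form solution."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= lr * logistic_loss_grad(w, X, y)
    return w

# Under the probabilistic interpretation, the trained model gives
# P[+1 | x] = sigmoid(w^T x) and, more generally, P(y | x) = sigmoid(y * w^T x).
```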