• Logistic Regression
    • Gives a real-valued output between 0 and 1
      • Allows for probabilistic interpretation
      • Linear model
        • i.e. given by $w^Tx$
      • No closed form solution
  • Example:
    • Training on a sample of handwritten digits
    • Digits are from 0-9
      • Can first simplify to a binary problem: training a classifier to distinguish 1 from 5
    • Dataset
      • In this example, the pictures are 28x28 pixels
      • To create the $x$ vector we flatten the 28x28 pixels into a vector of length 784
    • Good features to differentiate between the two classes
      • Pixels
      • Intensity
        • the amount of black pixels
      • Symmetry
        • the negative of the absolute difference between the image and its (horizontally) flipped version
    • The $x$ vector is a concatenation of the feature values (see the sketch below)
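A minimal NumPy sketch of this feature extraction, assuming each image arrives as a flattened length-784 array with pixel values in $[0,1]$ where 1 means ink; the function name and the layout of the returned vector (bias term $x_0 = 1$ first) are illustrative choices, not prescribed by the notes:

```python
import numpy as np

def extract_features(img_flat):
    """Map a flattened 28x28 image to the feature vector x = (1, intensity, symmetry).

    Assumes pixel values in [0, 1], where 1 means dark ink
    (an assumption of this sketch, not stated in the notes).
    """
    img = img_flat.reshape(28, 28)

    # Intensity: average amount of ink in the image.
    intensity = img.mean()

    # Symmetry: negative mean absolute difference between the image and
    # its horizontally flipped version (perfectly symmetric => 0).
    symmetry = -np.abs(img - np.fliplr(img)).mean()

    # x_0 = 1 is the constant bias coordinate (the i = 0 term in w^T x).
    return np.array([1.0, intensity, symmetry])
```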
  • The Sigmoid Function
    • $h(x) = \theta\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(w^Tx) \in [0,1]$
    • maps an input vector $x$ to a real value in $[0,1]$
    • $\theta (s) = \frac{e^s}{1+e^s} = \frac{1}{1 + e^{-s}}$
    • How is $\theta(-s)$ related to $\theta(s)$?
      • $\theta(-s) + \theta(s) = 1$, i.e. $\theta(-s) = 1 - \theta(s)$
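A minimal NumPy sketch of the sigmoid and the hypothesis $h$ (function names are illustrative); the last lines check the identity above numerically:

```python
import numpy as np

def sigmoid(s):
    """theta(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s}), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-s))

def h(w, x):
    """h(x) = theta(w^T x), a value between 0 and 1."""
    return sigmoid(np.dot(w, x))

# Numerical check of the identity theta(-s) = 1 - theta(s).
s = np.linspace(-5.0, 5.0, 11)
assert np.allclose(sigmoid(-s), 1.0 - sigmoid(s))
```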
  • What makes a good $h$?
    • $h$ is good if:
      • $h(x_n) = \theta(w^Tx_n) \approx 1$ whenever $y_n = +1$
      • $h(x_n) = \theta(w^Tx_n) \approx 0$ whenever $y_n = -1$
    • Data representation
      • $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ with labels $y_n \in \{-1, +1\}$
    • A simplistic (bad) error measure for optimizing $h$
      • $E(h) = \frac{1}{N}\sum^N_{n=1}(h(x_n)-\frac{1}{2}(1+y_n))^2$
        • $\frac12 (1+y_n)$ term maps target from $\{-1,1\} \Rightarrow\{0,1\}$
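For concreteness, a sketch of this (bad) squared-error measure; the matrix layout and names are assumptions of this sketch:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def squared_error(w, X, y):
    """E(h) = (1/N) * sum_n (h(x_n) - (1 + y_n)/2)^2.

    X: (N, d+1) array whose rows are the feature vectors x_n (bias included)
    y: length-N array of labels in {-1, +1}
    """
    preds = sigmoid(X @ w)        # h(x_n) for every example
    targets = 0.5 * (1.0 + y)     # map labels {-1, +1} -> {0, 1}
    return np.mean((preds - targets) ** 2)
```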
    • Logistic Loss Function
      • $E(w) = \frac{1}{N}\sum_{n=1}^{N}\ln\left(1+\exp(-y_n w^Tx_n)\right)$
      • Based on intuitive probabilistic interpretation of $h$
        • The larger $y_n w^Tx_n$ becomes (i.e. the closer to $+\infty$), the smaller the $\exp$ term gets; since that term is added to 1 inside the $\ln$, the log term approaches zero as well, giving the smallest error
      • This function is convex, so it can be minimized by gradient descent (there is no closed-form solution); see the sketch at the end of this section
      • Probabilistic Interpretation
        • Suppose that $h(x) = \theta(w^Tx)$ closely captures $P[+1|x]$
          • then, using $\theta(-s) = 1 - \theta(s)$:
        • $P(y \mid x) = \theta(y\, w^Tx)$
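Finally, a minimal sketch of the logistic loss and a plain gradient-descent minimizer; the gradient expression comes from differentiating $E(w)$, while the step size, iteration count, and function names are illustrative assumptions, not from the notes:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_loss(w, X, y):
    """E(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))."""
    margins = y * (X @ w)                      # y_n * w^T x_n for each example
    return np.mean(np.log1p(np.exp(-margins)))

def logistic_loss_grad(w, X, y):
    """grad E(w) = -(1/N) * sum_n y_n * x_n * theta(-y_n * w^T x_n)."""
    margins = y * (X @ w)
    return -(X.T @ (y * sigmoid(-margins))) / len(y)

def fit(X, y, lr=0.1, n_iters=1000):
    """Minimize E(w) by gradient descent -- there is no closed-form solution."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= lr * logistic_loss_grad(w, X, y)
    return w

# Under the probabilistic interpretation, the trained model gives
# P[+1 | x] = sigmoid(w^T x) and, more generally, P(y | x) = sigmoid(y * w^T x).
```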