Perceptron model
- Example:
- credit approval problem
- input: salary, debt, years in residence, etc.
- $x \in \R^d = X$
- where $x$ is a $d$-dimensional vector
- output: $y \in Y = \{+1, -1\}$ (approve or deny credit)
- target:
- the true (unknown) relationship between $x$ and $y$
- $f : X \rightarrow Y$
- customer data:
- pairs $(x, y)$ collected in a data set $D$ are given
- Simple Learning Model solution
- give importance weights to the different features
- credit score = $\sum_{i=1}^d w_ix_i$
- approve if credit score > threshold or deny if <
- equivalently, approve or deny according to whether $\sum_{i=1}^d w_ix_i - \text{threshold}$ is $> 0$ or $< 0$ (see the numeric sketch below)
- can be written as
- $h(x) =$ sign$((\sum_{i=1}^d w_ix_i) + w_0)$
- sign$(\cdot)$ returns $+1$ when its argument is positive and $-1$ when it is negative, so the output is the corresponding label
- how to choose importance weights $w_i$
- $x_i$ is important $\Rightarrow$ large weight
- $x_i$ is beneficial for credit $\Rightarrow$ positive weight
- $x_i$ is detrimental for credit $\Rightarrow$ negative weight
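A minimal numeric sketch of the weighted-sum rule above, in Python; the feature names, weights, and threshold are made-up illustration values, not anything prescribed by the notes.

```python
# Toy credit-score rule: weighted sum of features compared against a threshold.
# All numbers below (features, weights, threshold) are made up for illustration.
features = {"salary": 65_000, "debt": 12_000, "years_in_residence": 4}
weights = {"salary": 0.001, "debt": -0.002, "years_in_residence": 0.5}  # beneficial -> positive, detrimental -> negative
threshold = 40.0

credit_score = sum(weights[name] * value for name, value in features.items())
decision = "approve" if credit_score > threshold else "deny"
print(f"credit score = {credit_score:.1f} -> {decision}")
```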
- Perceptron Hypothesis Set $H$
- $H = \{h(x) =$ sign$(\sum_{i=1}^d w_ix_i + w_0) =$ sign$(w^Tx)\}$
- absorb the $w_0$ into $w$ vector
- $w = [w_0, w_1, ..., w_d] \in\R^{d+1}$
- $x = [x_0, x_1, ..., x_d] \in \{1\} \times\R^d$, where the added coordinate is fixed at $x_0 = 1$
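A small sketch of the vector form $h(x) = \text{sign}(w^Tx)$ with the bias absorbed; the particular weights and input are arbitrary, and $w_0$ plays the role of the negative threshold.

```python
import numpy as np

def h(w, x):
    """Perceptron hypothesis: sign of the inner product w^T x.
    Expects x to already include the constant coordinate x_0 = 1."""
    return np.sign(w @ x)

# w = [w_0, w_1, ..., w_d]; w_0 acts as -threshold (arbitrary illustration values)
w = np.array([-40.0, 0.001, -0.002, 0.5])

# raw input in R^d, then prepend x_0 = 1 to absorb the bias
x_raw = np.array([65_000.0, 12_000.0, 4.0])
x = np.concatenate(([1.0], x_raw))

print(h(w, x))  # +1 -> approve, -1 -> deny
```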
- in 2-d space
- the set of points where $w^Tx = 0$ defines a line that separates the space
- this line is orthogonal to the normal vector $w$
- points not on the line satisfy $w^Tx > 0$ on one side and $w^Tx < 0$ on the other
- the classifier thus partitions the space, assigning each region one of the discrete output values
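A tiny 2-D check of this partition, assuming an arbitrary weight vector: points on opposite sides of the line $w^Tx = 0$ receive opposite signs.

```python
import numpy as np

# Arbitrary 2-D example with the bias absorbed: w = [w_0, w_1, w_2].
# The decision boundary is the line w_0 + w_1*x_1 + w_2*x_2 = 0.
w = np.array([-1.0, 2.0, 1.0])

# Two points chosen on opposite sides of that line (first coordinate is x_0 = 1).
p_above = np.array([1.0, 1.0, 1.0])  # w^T p = -1 + 2 + 1 = +2 -> sign +1
p_below = np.array([1.0, 0.0, 0.0])  # w^T p = -1            -> sign -1

for p in (p_above, p_below):
    print(p[1:], "->", int(np.sign(w @ p)))
```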
- Model evaluation
- infinitely many hypotheses (separating lines) can classify the given training data with 100% accuracy
- therefore, we should train for generality rather than specificity, i.e., aim to perform well on unseen data
- How to Learn a Final Hypothesis from $H$
- Perceptron Learning Algorithm (PLA); a code sketch follows after these steps
- initialize $w(0) = 0$
- for iteration $t = 0, 1, 2, \dots$
- weight vector is w(t)
- From $(x_1,y_1),...,(x_N,y_N)$ pick any misclassified sample randomly
- call the misclassified sample $(x(t), y(t))$
- i.e., sign$(w(t)^Tx(t)) \neq y(t)$
- Update the weight:
- $w(t+1) = w(t) + y(t)x(t)$
- Note: $w(t)$ is the weight before the update, $w(t+1)$ the updated weight
- $t\leftarrow t+1$
- the algorithm does not account for outliers; if the data are not linearly separable (e.g., due to outliers), PLA never terminates
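A minimal NumPy sketch of PLA following the update rule above; the toy data set is made up and assumed linearly separable so that the loop terminates.

```python
import numpy as np

def pla(X, y, max_iter=10_000):
    """Perceptron Learning Algorithm.
    X: (N, d) raw inputs; y: (N,) labels in {+1, -1}.
    Returns a weight vector w in R^(d+1) with the bias absorbed as w_0."""
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])   # prepend x_0 = 1
    w = np.zeros(d + 1)                    # w(0) = 0
    rng = np.random.default_rng(0)
    for _ in range(max_iter):
        misclassified = np.flatnonzero(np.sign(Xb @ w) != y)
        if misclassified.size == 0:        # every point classified correctly -> done
            return w
        i = rng.choice(misclassified)      # pick any misclassified sample
        w = w + y[i] * Xb[i]               # w(t+1) = w(t) + y(t) x(t)
    return w                               # fallback if the data are not separable

# Made-up, linearly separable toy data.
X = np.array([[2.0, 3.0], [1.0, 1.5], [3.0, 4.0],
              [-1.0, -2.0], [-2.0, -1.0], [0.0, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w = pla(X, y)
print("weights:", w)
```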
- Theorem:
- If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one
- Converge after how long?
- Assume that there exists a hyperplane $w^*$ that separates the data, i.e., there exists $\delta > 0$ such that for $y = +1$, $(w^*)^Tx > \delta$, and for $y = -1$, $(w^*)^Tx < -\delta$
- Or equivalently
- $y(w^*)^Tx > \delta, \forall(x,y)\in D$
- Convergence of Perceptron
- when we start from a zero vector $w(0) = 0$, then we have
- $w(t) = y(0)x(0) + y(1)x(1) + ... + y(t-1)x(t-1)$, where $(x(i), y(i))$ is the misclassified sample used in the update at step $i$
- $w(t) = w(t-1) + y(t-1)x(t-1)$
- $w(t) = w(t-2) + y(t-2)x(t-2) + y(t-1)x(t-1)$
- $...$
- $= w(0) + y(0)x(0) + y(1)x(1) + ... + y(t-1)x(t-1)$
- multiply both sides by $(w^*)^T$
- $(w^*)^Tw(t) = (w^*)^Tw(0) + (w^*)^Ty(0)x(0) + ... + (w^*)^Ty(t-1)x(t-1)$
- $||w(t)||^2 = w(t)^Tw(t)$
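A sketch of how the convergence argument is typically completed from these two quantities, assuming additionally that $R = \max_n \|x_n\|$ bounds the inputs (this constant is not defined above):

```latex
% Sketch: completing the PLA convergence bound (assumes R = \max_n \|x_n\|).
\begin{align*}
  (w^*)^T w(t) &> t\delta
    && \text{each of the $t$ updates contributes $y(i)(w^*)^T x(i) > \delta$, and $w(0)=0$}\\
  \|w(i+1)\|^2 &= \|w(i)\|^2 + 2\,y(i)\,w(i)^T x(i) + \|x(i)\|^2 \le \|w(i)\|^2 + R^2
    && \text{updates use only misclassified points, so $y(i)\,w(i)^T x(i) \le 0$}\\
  \|w(t)\|^2 &\le t R^2
    && \text{apply the previous line $t$ times starting from $w(0)=0$}\\
  t\delta &< (w^*)^T w(t) \le \|w^*\|\,\|w(t)\| \le \|w^*\|\, R\,\sqrt{t}
    && \text{Cauchy--Schwarz}\\
  t &< \frac{R^2\,\|w^*\|^2}{\delta^2}
    && \text{so PLA makes only finitely many updates}
\end{align*}
```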