Perceptron model
- Example:
- credit approval problem
- input: salary, debt, years in residence, etc.
- $x \in \R^d = X$
- where $x$ is a $d$-dimensional vector
- output: $y \in Y = \{+1, -1\}$ (approve or deny credit)
- target:
- the true (unknown) relationship between $x$ and $y$
- $f : X \rightarrow Y$
- customer data:
- pairs $(x, y)$ collected in a data set $D$ are given
- Simple Learning Model solution
- give importance weights to the different features
- credit score = $\sum_{i=1}^d w_ix_i$
- approve if credit score > threshold or deny if <
- equivalently, approve or deny according to whether $\sum_{i=1}^d w_ix_i - \text{threshold}$ is $> 0$ or $< 0$ (see the numeric sketch below)
- can be written as
- $h(x) =$ sign$((\sum_{i=1}^d w_ix_i) + w_0)$
- sign$(\cdot)$ returns $+1$ when its argument is positive and $-1$ when it is negative, so the output is the corresponding label
- how to choose importance weights $w_i$
- $x_i$ is important $\Rightarrow$ large weight
- $x_i$ is beneficial for credit $\Rightarrow$ positive weight
- $x_i$ is detrimental for credit $\Rightarrow$ negative weight
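A minimal numeric sketch of the weighted-sum rule above, in Python; the feature names, weights, and threshold are made-up illustration values, not anything prescribed by the notes.

```python
# Toy credit-score rule: weighted sum of features compared against a threshold.
# All numbers below (features, weights, threshold) are made up for illustration.
features = {"salary": 65_000, "debt": 12_000, "years_in_residence": 4}
weights = {"salary": 0.001, "debt": -0.002, "years_in_residence": 0.5}  # beneficial -> positive, detrimental -> negative
threshold = 40.0

credit_score = sum(weights[name] * value for name, value in features.items())
decision = "approve" if credit_score > threshold else "deny"
print(f"credit score = {credit_score:.1f} -> {decision}")
```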
- Perceptron Hypothesis Set $H$
- $H = \{h(x) =$ sign$(\sum_{i=1}^d w_ix_i + w_0) =$ sign$(w^Tx)\}$
- absorb the $w_0$ into $w$ vector
- $w = [w_0, w_1, ..., w_d] \in\R^{d+1}$
- $x = [x_0, x_1, ..., x_d] \in \{1\} \times\R^d$, where the added coordinate is fixed at $x_0 = 1$
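A small sketch of the vector form $h(x) = \text{sign}(w^Tx)$ with the bias absorbed; the particular weights and input are arbitrary, and $w_0$ plays the role of the negative threshold.

```python
import numpy as np

def h(w, x):
    """Perceptron hypothesis: sign of the inner product w^T x.
    Expects x to already include the constant coordinate x_0 = 1."""
    return np.sign(w @ x)

# w = [w_0, w_1, ..., w_d]; w_0 acts as -threshold (arbitrary illustration values)
w = np.array([-40.0, 0.001, -0.002, 0.5])

# raw input in R^d, then prepend x_0 = 1 to absorb the bias
x_raw = np.array([65_000.0, 12_000.0, 4.0])
x = np.concatenate(([1.0], x_raw))

print(h(w, x))  # +1 -> approve, -1 -> deny
```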
- in 2-d space
- the set of points where $w^Tx = 0$ defines a line that separates the space
- this line is orthogonal to the normal vector $w$
- points not on the line satisfy $w^Tx > 0$ on one side and $w^Tx < 0$ on the other
- the classifier thus partitions the space, assigning each region one of the discrete output values
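A tiny 2-D check of this partition, assuming an arbitrary weight vector: points on opposite sides of the line $w^Tx = 0$ receive opposite signs.

```python
import numpy as np

# Arbitrary 2-D example with the bias absorbed: w = [w_0, w_1, w_2].
# The decision boundary is the line w_0 + w_1*x_1 + w_2*x_2 = 0.
w = np.array([-1.0, 2.0, 1.0])

# Two points chosen on opposite sides of that line (first coordinate is x_0 = 1).
p_above = np.array([1.0, 1.0, 1.0])  # w^T p = -1 + 2 + 1 = +2 -> sign +1
p_below = np.array([1.0, 0.0, 0.0])  # w^T p = -1            -> sign -1

for p in (p_above, p_below):
    print(p[1:], "->", int(np.sign(w @ p)))
```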
- Model evaluation
- infinitely many hypotheses (separating lines) can classify the given training data with 100% accuracy
- therefore, we should train for generality rather than specificity, i.e., aim to perform well on unseen data
- How to Learn a Final Hypothesis from $H$
- Perceptron Learning Algorithm (PLA); a code sketch follows after these steps
- initialize $w(0) = 0$
- for iteration $t = 0, 1, 2, \dots$
- weight vector is w(t)
- From $(x_1,y_1),...,(x_N,y_N)$ pick any misclassified sample randomly
- call the misclassified sample $(x(t), y(t))$
- i.e., sign$(w(t)^Tx(t)) \neq y(t)$
- Update the weight:
- $w(t+1) = w(t) + y(t)x(t)$
- Note: $w(t)$ is the weight before the update, $w(t+1)$ the updated weight
- $t\leftarrow t+1$
- the algorithm does not account for outliers; if the data are not linearly separable (e.g., due to outliers), PLA never terminates
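A minimal NumPy sketch of PLA following the update rule above; the toy data set is made up and assumed linearly separable so that the loop terminates.

```python
import numpy as np

def pla(X, y, max_iter=10_000):
    """Perceptron Learning Algorithm.
    X: (N, d) raw inputs; y: (N,) labels in {+1, -1}.
    Returns a weight vector w in R^(d+1) with the bias absorbed as w_0."""
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])   # prepend x_0 = 1
    w = np.zeros(d + 1)                    # w(0) = 0
    rng = np.random.default_rng(0)
    for _ in range(max_iter):
        misclassified = np.flatnonzero(np.sign(Xb @ w) != y)
        if misclassified.size == 0:        # every point classified correctly -> done
            return w
        i = rng.choice(misclassified)      # pick any misclassified sample
        w = w + y[i] * Xb[i]               # w(t+1) = w(t) + y(t) x(t)
    return w                               # fallback if the data are not separable

# Made-up, linearly separable toy data.
X = np.array([[2.0, 3.0], [1.0, 1.5], [3.0, 4.0],
              [-1.0, -2.0], [-2.0, -1.0], [0.0, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w = pla(X, y)
print("weights:", w)
```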
- Theorem:
- If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one
- Converge after how long?
- Assume that there exists a hyperplane $w^*$ that separates the data, i.e., there exists $\delta > 0$ such that for $y = +1$, $(w^*)^Tx > \delta$, and for $y = -1$, $(w^*)^Tx < -\delta$
- Or equivalently
- $y(w^*)^Tx > \delta, \forall(x,y)\in D$
- Convergence of Perceptron
- when we start from a zero vector $w(0) = 0$, then we have
- $w(t) = y(0)x(0) + y(1)x(1) + ... + y(t-1)x(t-1)$, where $(x(i), y(i))$ is the misclassified sample used in the update at step $i$
- $w(t) = w(t-1) + y(t-1)x(t-1)$
- $w(t) = w(t-2) + y(t-2)x(t-2) + y(t-1)x(t-1)$
- $...$
- $= w(0) + y(0)x(0) + y(1)x(1) + ... + y(t-1)x(t-1)$
- multiply both sides by $(w^*)^T$
- $(w^*)^Tw(t) = (w^*)^Tw(0) + (w^*)^Ty(0)x(0) + ... + (w^*)^Ty(t-1)x(t-1)$
- $||w(t)||^2 = w(t)^Tw(t)$
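A sketch of how the convergence argument is typically completed from these two quantities, assuming additionally that $R = \max_n \|x_n\|$ bounds the inputs (this constant is not defined above):

```latex
% Sketch: completing the PLA convergence bound (assumes R = \max_n \|x_n\|).
\begin{align*}
  (w^*)^T w(t) &> t\delta
    && \text{each of the $t$ updates contributes $y(i)(w^*)^T x(i) > \delta$, and $w(0)=0$}\\
  \|w(i+1)\|^2 &= \|w(i)\|^2 + 2\,y(i)\,w(i)^T x(i) + \|x(i)\|^2 \le \|w(i)\|^2 + R^2
    && \text{updates use only misclassified points, so $y(i)\,w(i)^T x(i) \le 0$}\\
  \|w(t)\|^2 &\le t R^2
    && \text{apply the previous line $t$ times starting from $w(0)=0$}\\
  t\delta &< (w^*)^T w(t) \le \|w^*\|\,\|w(t)\| \le \|w^*\|\, R\,\sqrt{t}
    && \text{Cauchy--Schwarz}\\
  t &< \frac{R^2\,\|w^*\|^2}{\delta^2}
    && \text{so PLA makes only finitely many updates}
\end{align*}
```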