Linear regression
- Regression task
- For reference
- In a classification task there is a score function
- prediction value:
- $\text{sign}[w^Tx]\rightarrow \{-1,1\}$
- Polynomial curve fitting
- Goal is to fit a curve so that, given an input vector or variable, we can make an accurate prediction of a target variable
- polynomial function
- $w_0 + w_1x + w_2x^2 + ... + w_Mx^M = w^T\tilde x$
- whether the function counts as "linear" depends on which quantities it is linear in
- with respect to $x$ it is not linear, since the powers of $x$ trace out a polynomial curve
- however, it is linear with respect to the coefficients $w$, which is the linearity that matters for fitting
- We can manually increase the dimension of $\tilde x$ to increase the capacity of our model, and thus its sophistication (see the sketch below)
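As a rough illustration (not from the notes; NumPy assumed), the feature map $\tilde x = (1, x, x^2, \ldots, x^M)^T$ and the prediction $w^T\tilde x$ could be sketched as:

```python
import numpy as np

def polynomial_features(x, M):
    """Map a scalar input x to x_tilde = [1, x, x^2, ..., x^M]."""
    return np.array([x ** m for m in range(M + 1)])

def predict(x, w):
    """Evaluate y(x, w) = w^T x_tilde; the model is linear in w."""
    return w @ polynomial_features(x, len(w) - 1)
```

Raising $M$ only adds entries to the feature vector; the model stays linear in $w$.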
- sum of squares error function
- The sum-of-squares error function is defined as
- $\frac{1}{2} \sum^N_{n=1} (y(x_n,w) - t_n)^2$
- $y(x_n,w)$ is the output of our model (its prediction) for input $x_n$
- the factor of $\frac{1}{2}$ is included so that the 2 produced by differentiating the square cancels, leaving no extra coefficient
- why do we square the differences?
- squaring penalizes a prediction more heavily the farther the actual value lies from the fitted curve, and it treats positive and negative deviations equally
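A minimal sketch (NumPy assumed, with a design matrix whose rows are the $\tilde x_n$) of evaluating this error and minimizing it; `np.linalg.lstsq` is used here as an illustrative solver, not something the notes prescribe:

```python
import numpy as np

def sum_of_squares_error(w, X_tilde, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2, with y(x_n, w) = X_tilde[n] @ w."""
    residuals = X_tilde @ w - t
    return 0.5 * np.sum(residuals ** 2)

def fit_least_squares(X_tilde, t):
    """Minimize E(w) in closed form (ordinary least-squares solution)."""
    w, *_ = np.linalg.lstsq(X_tilde, t, rcond=None)
    return w
```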
- how to choose the order $M$?
- if M is too low, then there will be a large fitting error due to being unable to fit to the data
- if M is too high then overfitting
- do not make the order $M$ too high, or you may end up with near-zero training error but a curve that is not actually representative of the data
- when new data is introduced, there will then be a large fitting error
- polynomial coefficients for models of different order $M$
- when there is overfitting, the norm of the coefficient vector $w$ can become very large (see the sketch below)
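A hypothetical toy experiment (synthetic sine-plus-noise data, an assumption made here and not part of the notes) showing that a high order $M$ drives the training error toward zero while the held-out error and $\|w\|$ blow up:

```python
import numpy as np

rng = np.random.default_rng(0)

def design_matrix(x, M):
    """Rows are [1, x, x^2, ..., x^M] for each sample."""
    return np.vander(x, M + 1, increasing=True)

def rms_error(w, x, t, M):
    return np.sqrt(np.mean((design_matrix(x, M) @ w - t) ** 2))

# Illustrative data: sin(2*pi*x) plus Gaussian noise.
x_train = rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = rng.uniform(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

for M in (1, 3, 9):
    w, *_ = np.linalg.lstsq(design_matrix(x_train, M), t_train, rcond=None)
    # High M: training RMS shrinks, test RMS and ||w|| grow.
    print(M, rms_error(w, x_train, t_train, M), rms_error(w, x_test, t_test, M), np.linalg.norm(w))
```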
- regularized regression
- One technique that is often used to control the over-fitting phenomenon in such cases is called regularization
- Add a penalty term to the error function in order to discourage the coefficients from reaching large values
- $\frac{1}{2} \sum_{n=1}^N(y(x_n, w) - t_n)^2 + \frac{\lambda}{2}||w||^2_2$
- for $\lambda = 0$, the objective reduces to ordinary least-squares regression
- low $\lambda$ = less regularization
- high $\lambda$ = stronger regularization, better generalization, less overfitting (though a $\lambda$ that is too high underfits)
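The regularized objective still has a closed-form minimizer, $w = (\lambda I + \tilde X^T \tilde X)^{-1}\tilde X^T t$, where $\tilde X$ denotes the design matrix whose $n$-th row is $\tilde x_n^T$ (notation added here). A short sketch, NumPy assumed:

```python
import numpy as np

def fit_ridge(X_tilde, t, lam):
    """Minimize 1/2*||X_tilde @ w - t||^2 + (lam/2)*||w||^2.
    Closed form: w = (lam*I + X^T X)^{-1} X^T t."""
    A = lam * np.eye(X_tilde.shape[1]) + X_tilde.T @ X_tilde
    return np.linalg.solve(A, X_tilde.T @ t)
```

With `lam = 0` this reduces to the ordinary least-squares solution (when $\tilde X^T\tilde X$ is invertible).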
- cross validation
- $s$-fold cross-validation
- Partition the data into $s$ groups of equal size
- Then $s - 1$ of the groups are used to train a set of models (e.g., one per candidate $\lambda$), which are then evaluated on the remaining group
- This procedure is repeated for all $s$ possible choices for the held out group and then the performance scores from the $s$ runs are averaged
- use these runs to plot the training and validation error curves
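A sketch of $s$-fold cross-validation for choosing $\lambda$ (NumPy assumed; `fit_ridge` repeats the hypothetical closed-form solver from the regularization sketch above):

```python
import numpy as np

def fit_ridge(X_tilde, t, lam):
    return np.linalg.solve(lam * np.eye(X_tilde.shape[1]) + X_tilde.T @ X_tilde, X_tilde.T @ t)

def s_fold_cv(X_tilde, t, lambdas, s=5, seed=0):
    """Average held-out sum-of-squares error over s folds for each candidate lambda."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), s)
    scores = {}
    for lam in lambdas:
        errs = []
        for k in range(s):
            held_out = folds[k]
            train = np.concatenate([folds[j] for j in range(s) if j != k])
            w = fit_ridge(X_tilde[train], t[train], lam)
            errs.append(0.5 * np.sum((X_tilde[held_out] @ w - t[held_out]) ** 2))
        scores[lam] = np.mean(errs)
    return scores  # pick the lambda with the smallest averaged score
```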
- Gaussian Noise
- Assumptions
- each observation is generated by a deterministic function of the input plus additive Gaussian noise, so the target $t$ is a random variable
- $t = y(x,w) + e$ where $e \sim \mathcal{N}(e \mid 0, \beta^{-1})$
- $\beta$ is the noise precision, so $\beta^{-1}$ plays the same role as the variance $\sigma^2$ of a Gaussian
- Likelihood Function
- Given a dataset $D = \{X,t\}$ where $X = \{x_1,\ldots,x_N\}$ are the inputs and $t = \{t_1,\ldots,t_N\}$ are the corresponding targets
- When the samples are drawn i.i.d. (independent and identically distributed), the probability of the data is given by
- $p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, w), \beta^{-1}\right)$
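Taking the log of this i.i.d. likelihood (the standard step, as in Bishop's PRML) shows how maximum likelihood connects back to the sum-of-squares error; sketching the worked equation here:

$\ln p(t \mid X, w, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\bigl(y(x_n, w) - t_n\bigr)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$

Only the first term depends on $w$, so maximizing the log-likelihood with respect to $w$ is equivalent to minimizing the sum-of-squares error $\frac{1}{2}\sum_{n=1}^N\bigl(y(x_n,w) - t_n\bigr)^2$.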