Linear regression
- Regression task
- For reference
- In a classification task there is a score function
- prediction value:
- $\text{sign}[w^Tx]\rightarrow \{-1,1\}$
- Polynomial curve fitting
- Goal is to fit a curve so that, given an input vector or variable, we can make an accurate prediction of a target variable
- polynomial function
- $w_0 + w_1x + w_2x^2 + ... + w_Mx^M = w^T\tilde x$
- whether the function counts as "linear" depends on which quantities it is linear in
- with respect to $x$ it is not linear, since the powers of $x$ trace out a polynomial curve
- however, it is linear with respect to the coefficients $w$, which is the linearity that matters for fitting
- We can manually increase the dimension of $\tilde x$ to increase the capacity of our model, and thus its sophistication (see the sketch below)
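As a rough illustration (not from the notes; NumPy assumed), the feature map $\tilde x = (1, x, x^2, \ldots, x^M)^T$ and the prediction $w^T\tilde x$ could be sketched as:

```python
import numpy as np

def polynomial_features(x, M):
    """Map a scalar input x to x_tilde = [1, x, x^2, ..., x^M]."""
    return np.array([x ** m for m in range(M + 1)])

def predict(x, w):
    """Evaluate y(x, w) = w^T x_tilde; the model is linear in w."""
    return w @ polynomial_features(x, len(w) - 1)
```

Raising $M$ only adds entries to the feature vector; the model stays linear in $w$.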
- sum of squares error function
- The sum-of-squares error function is defined as
- $\frac{1}{2} \sum^N_{n=1} (y(x_n,w) - t_n)^2$
- $y(x_n,w)$ is the output of our model (its prediction) for input $x_n$
- the factor of $\frac{1}{2}$ is included so that the 2 produced by differentiating the square cancels, leaving no extra coefficient
- why do we square the differences?
- squaring penalizes a prediction more heavily the farther the actual value lies from the fitted curve, and it treats positive and negative deviations equally
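A minimal sketch (NumPy assumed, with a design matrix whose rows are the $\tilde x_n$) of evaluating this error and minimizing it; `np.linalg.lstsq` is used here as an illustrative solver, not something the notes prescribe:

```python
import numpy as np

def sum_of_squares_error(w, X_tilde, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2, with y(x_n, w) = X_tilde[n] @ w."""
    residuals = X_tilde @ w - t
    return 0.5 * np.sum(residuals ** 2)

def fit_least_squares(X_tilde, t):
    """Minimize E(w) in closed form (ordinary least-squares solution)."""
    w, *_ = np.linalg.lstsq(X_tilde, t, rcond=None)
    return w
```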
- how to choose the order $M$?
- if M is too low, then there will be a large fitting error due to being unable to fit to the data
- if M is too high then overfitting
- do not make the order $M$ too high, or you may end up with near-zero training error but a curve that is not actually representative of the data
- when new data is introduced, there will then be a large fitting error
- polynomial coefficients for models of different order $M$
- when there is overfitting, the norm of the coefficient vector $w$ can become very large (see the sketch below)
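A hypothetical toy experiment (synthetic sine-plus-noise data, an assumption made here and not part of the notes) showing that a high order $M$ drives the training error toward zero while the held-out error and $\|w\|$ blow up:

```python
import numpy as np

rng = np.random.default_rng(0)

def design_matrix(x, M):
    """Rows are [1, x, x^2, ..., x^M] for each sample."""
    return np.vander(x, M + 1, increasing=True)

def rms_error(w, x, t, M):
    return np.sqrt(np.mean((design_matrix(x, M) @ w - t) ** 2))

# Illustrative data: sin(2*pi*x) plus Gaussian noise.
x_train = rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
x_test = rng.uniform(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

for M in (1, 3, 9):
    w, *_ = np.linalg.lstsq(design_matrix(x_train, M), t_train, rcond=None)
    # High M: training RMS shrinks, test RMS and ||w|| grow.
    print(M, rms_error(w, x_train, t_train, M), rms_error(w, x_test, t_test, M), np.linalg.norm(w))
```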
- regularized regression
- One technique that is often used to control the over-fitting phenomenon in such cases is called regularization
- Add a penalty term to the error function in order to discourage the coefficients from reaching large values
- $\frac{1}{2} \sum_{n=1}^N(y(x_n, w) - t_n)^2 + \frac{\lambda}{2}||w||^2_2$
- for $\lambda = 0$, the objective reduces to ordinary least-squares regression
- low $\lambda$ = less regularization
- high $\lambda$ = stronger regularization, better generalization, less overfitting (though a $\lambda$ that is too high underfits)
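The regularized objective still has a closed-form minimizer, $w = (\lambda I + \tilde X^T \tilde X)^{-1}\tilde X^T t$, where $\tilde X$ denotes the design matrix whose $n$-th row is $\tilde x_n^T$ (notation added here). A short sketch, NumPy assumed:

```python
import numpy as np

def fit_ridge(X_tilde, t, lam):
    """Minimize 1/2*||X_tilde @ w - t||^2 + (lam/2)*||w||^2.
    Closed form: w = (lam*I + X^T X)^{-1} X^T t."""
    A = lam * np.eye(X_tilde.shape[1]) + X_tilde.T @ X_tilde
    return np.linalg.solve(A, X_tilde.T @ t)
```

With `lam = 0` this reduces to the ordinary least-squares solution (when $\tilde X^T\tilde X$ is invertible).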
- cross validation
- $s$-fold cross-validation
- Partition the data into $s$ groups of equal size
- Then $s - 1$ of the groups are used to train a set of models (e.g., one per candidate $\lambda$), which are then evaluated on the remaining group
- This procedure is repeated for all $s$ possible choices for the held out group and then the performance scores from the $s$ runs are averaged
- use these runs to plot the training and validation error curves
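A sketch of $s$-fold cross-validation for choosing $\lambda$ (NumPy assumed; `fit_ridge` repeats the hypothetical closed-form solver from the regularization sketch above):

```python
import numpy as np

def fit_ridge(X_tilde, t, lam):
    return np.linalg.solve(lam * np.eye(X_tilde.shape[1]) + X_tilde.T @ X_tilde, X_tilde.T @ t)

def s_fold_cv(X_tilde, t, lambdas, s=5, seed=0):
    """Average held-out sum-of-squares error over s folds for each candidate lambda."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), s)
    scores = {}
    for lam in lambdas:
        errs = []
        for k in range(s):
            held_out = folds[k]
            train = np.concatenate([folds[j] for j in range(s) if j != k])
            w = fit_ridge(X_tilde[train], t[train], lam)
            errs.append(0.5 * np.sum((X_tilde[held_out] @ w - t[held_out]) ** 2))
        scores[lam] = np.mean(errs)
    return scores  # pick the lambda with the smallest averaged score
```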
- Gaussian Noise
- Assumptions
- each observation is generated by a deterministic function of the input plus additive Gaussian noise, so the target $t$ is a random variable
- $t = y(x,w) + e$ where $e \sim \mathcal{N}(e \mid 0, \beta^{-1})$
- $\beta$ is the noise precision, so $\beta^{-1}$ plays the same role as the variance $\sigma^2$ of a Gaussian
- Likelihood Function
- Given a dataset $D = \{X,t\}$ where $X = \{x_1,\ldots,x_N\}$ are the inputs and $t = \{t_1,\ldots,t_N\}$ are the corresponding targets
- When the samples are drawn i.i.d. (independent and identically distributed), the probability of the data is given by
- $p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, w), \beta^{-1}\right)$
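Taking the log of this i.i.d. likelihood (the standard step, as in Bishop's PRML) shows how maximum likelihood connects back to the sum-of-squares error; sketching the worked equation here:

$\ln p(t \mid X, w, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\bigl(y(x_n, w) - t_n\bigr)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)$

Only the first term depends on $w$, so maximizing the log-likelihood with respect to $w$ is equivalent to minimizing the sum-of-squares error $\frac{1}{2}\sum_{n=1}^N\bigl(y(x_n,w) - t_n\bigr)^2$.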