# Linear Classification

Originally, I was like, why am I learning this from CS231N, since I need to work on neural networks. However, I am realizing that Neural Networks are just stacked linear classifiers (+ Activation Functions), since they take on the form (called the Score Function): $f(x_{i},W,b)=Wx_{i}+b$

So this will be used as a foundation to understand more complex algorithms. You can think of this as a single classifier where the matrix $W$ assigns a weight to each (pixel, class) pair, i.e. the importance of a particular pixel for a particular class.

Say you want to classify an image of size $(32,32,3)$ (rolled out = $3072$) with $10$ possible classes. You have the following dimensions (a quick NumPy check of these shapes follows the list):

- $x_{i}$ is $(3072,)$ or more generally $(D,)$
- $W$ holds the parameters/weights, $(10,3072)$ or more generally $(C,D)$
- $b$ is the bias term, $(10,)$ or more generally $(C,)$
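
A minimal sketch of the score function with these shapes. The random values here are placeholders for illustration, not trained weights:

```python
import numpy as np

# Score function f(x_i, W, b) = W x_i + b with CIFAR-10-style shapes.
D, C = 3072, 10                    # input dimension, number of classes

x_i = np.random.randn(D)           # one image, rolled out into a (3072,) vector
W = np.random.randn(C, D) * 0.01   # one row of 3072 pixel weights per class
b = np.zeros(C)                    # one bias per class

scores = W @ x_i + b               # (10,) vector of class scores
predicted_class = int(np.argmax(scores))
print(scores.shape, predicted_class)   # (10,) and an index in [0, 10)
```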

In practice, we use the Bias Trick to simplify the above expression to
$f(x_{i},W)=Wx_{i}$
by extending the vector $x_{i}$ with one additional dimension that always holds the constant $1$, a default *bias dimension* (verified in the sketch after this list). With the extra dimension,

- $x_{i}$ is now $(3073,)$
- $W$ is now $(10,3073)$
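
A quick NumPy sketch of the Bias Trick, again with placeholder values, checking that the extended form computes the same scores:

```python
import numpy as np

D, C = 3072, 10
x_i = np.random.randn(D)
W = np.random.randn(C, D) * 0.01
b = np.zeros(C)

# Bias Trick: append a constant 1 to x_i and fold b into W as an extra column.
x_ext = np.append(x_i, 1.0)          # (3073,)
W_ext = np.hstack([W, b[:, None]])   # (10, 3073)

# The extended form gives exactly the same scores as W @ x_i + b.
assert np.allclose(W_ext @ x_ext, W @ x_i + b)
```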

### Summary of the information flow

- The dataset of pairs of **(x, y)** is given and fixed.
- The weights $W$ start out as random numbers and can change.
- During the forward pass, the score function computes class scores, stored in the vector **f**.
- The Loss Function contains two components: the data loss computes the compatibility between the scores **f** and the labels **y**, while the regularization loss is only a function of the weights.
- During Gradient Descent, we compute the gradient on the weights (and optionally on the data if we wish) and use it to perform a parameter update.
- We want to update $W$ to find a set of weights that minimizes the loss function, which we can then use later to make predictions (a schematic version of this loop is sketched below).
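
To tie the flow together, here is a schematic training step. The multiclass SVM (hinge) data loss and L2 regularization are illustrative assumptions on my part; the flow above only requires *some* data loss and *some* regularization loss:

```python
import numpy as np

def loss_and_grad(W, X, y, reg=1e-3):
    """One forward/backward pass. X: (N, 3073) with the bias dimension
    already appended; y: (N,) integer labels; W: (10, 3073)."""
    N = X.shape[0]
    scores = X @ W.T                                   # (N, C) class scores "f"
    correct = scores[np.arange(N), y][:, None]         # score of the true class
    margins = np.maximum(0, scores - correct + 1)      # hinge margins
    margins[np.arange(N), y] = 0
    data_loss = margins.sum() / N                      # compatibility of f with y
    reg_loss = reg * np.sum(W * W)                     # depends only on W

    mask = (margins > 0).astype(float)                 # which margins contribute
    mask[np.arange(N), y] = -mask.sum(axis=1)          # pull on the correct class
    grad = mask.T @ X / N + 2 * reg * W                # dLoss/dW, shape (10, 3073)
    return data_loss + reg_loss, grad

# One Gradient Descent parameter update on toy random data
X = np.random.randn(5, 3073)
y = np.random.randint(0, 10, size=5)
W = np.random.randn(10, 3073) * 0.01
loss, grad = loss_and_grad(W, X, y)
W -= 1e-2 * grad                                       # step against the gradient
```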

### Hard Cases for a Linear Classifier

- When the classes can't be separated by a line (more generally, a hyperplane), e.g. when one class appears in multiple disjoint regions of space