Regression

Regression is a statistical technique that relates a dependent variable to one or more independent (explanatory) variables. You predict a continuous value, rather than discrete classes as in Classification.

We can use L1 and L2 distance to solve regression problems (e.g., nearest-neighbour-style prediction). In Stanford CS231n, the analogous setup was a Classification problem, where you used the SVM loss to predict the class.
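A minimal sketch of what distance-based regression can look like (k-nearest-neighbour style: predict by averaging the targets of the closest training points; the function name and data here are illustrative, not from the notes):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=2, p=2):
    """Predict a continuous value by averaging the targets of the k nearest
    training points, using L1 (p=1) or L2 (p=2) distance."""
    dists = np.sum(np.abs(X_train - x_query) ** p, axis=1) ** (1 / p)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
# query 1.1 is closest to x=1 and x=2, so the prediction averages their targets
pred = knn_regress(X_train, y_train, np.array([1.1]), k=2)   # -> 1.5
```

This is the regression counterpart of the L1/L2 nearest-neighbour classifier from CS231n: same distance computation, but you average neighbour targets instead of taking a majority vote.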

Practice implementing it: - https://www.deep-ml.com/problems/14

Linear Regression

Method 1: Using MLE

Simple linear regression model: Alternate Formulation

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

We use data $(x_1, y_1), \dots, (x_n, y_n)$ to estimate $\beta_0$, $\beta_1$, and $\sigma^2$.

The Likelihood Function is given by

$$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)$$

We come up with the line of best fit using MLEs. We get the following results (derivation is at page 402) for the estimates of the parameters:

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The line of best fit is given by

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

If $\beta_1 = 0$, then $x$ has no predictive power for $Y$.
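These closed-form estimates are easy to sketch in code (NumPy; the function name and sample data are my own, for illustration):

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form MLE / least-squares estimates for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    Sxx = np.sum((x - x_bar) ** 2)
    Sxy = np.sum((x - x_bar) * (y - y_bar))
    beta1 = Sxy / Sxx                 # slope: S_xy / S_xx
    beta0 = y_bar - beta1 * x_bar     # intercept: y-bar - slope * x-bar
    return beta0, beta1

# data that lies almost exactly on y = 0.15 + 1.95 x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
beta0, beta1 = simple_ols(x, y)       # -> (0.15, 1.95)
```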

Method 2: Least Squares

I don’t think the teacher went too in depth for this… Least squares picks $\beta_0, \beta_1$ to minimize the sum of squared errors $\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$, and it ends up with the same final estimates as the MLE approach.

We are relying on the Gauss-Markov Theorem: under its assumptions, the least-squares estimators are the best linear unbiased estimators (BLUE).

We want to ask: if $\beta_1 = 0$, then $x$ has no predictive power for $Y$. So suppose $H_0: \beta_1 = 0$ and $H_1: \beta_1 \neq 0$.

You do hypothesis testing, where the test statistic is given by

$$t = \frac{\hat{\beta}_1 - \beta_1}{s / \sqrt{S_{xx}}}, \qquad s^2 = \frac{\text{SSE}}{n - 2}$$

  • Note that $\beta_1 = 0$ (since this is the hypothesis we are testing). And then you use your t-table, where your Degrees of Freedom is $n - 2$.
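A quick sketch of computing this test statistic (NumPy; function name and data are illustrative — you would still compare $t$ against a t-table with $n-2$ degrees of freedom):

```python
import numpy as np

def slope_t_test(x, y):
    """t statistic for H0: beta1 = 0 in simple linear regression, df = n - 2."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    Sxx = np.sum((x - x_bar) ** 2)
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / Sxx
    beta0 = y_bar - beta1 * x_bar
    residuals = y - (beta0 + beta1 * x)
    s2 = np.sum(residuals ** 2) / (n - 2)   # SSE / (n - 2), estimates sigma^2
    se = np.sqrt(s2 / Sxx)                  # standard error of beta1-hat
    return beta1 / se, n - 2                # t statistic, degrees of freedom

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
t, df = slope_t_test(x, y)   # t = 39.0 with df = 3: strong evidence beta1 != 0
```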

Other

From the deep-ml practice, they just use a formula: the normal equation.

In linear regression, you’re trying to find parameters ($\theta$) that make predictions: $\hat{y} = X\theta$, where:

  • $X$ is your design matrix (rows = training examples, columns = features, often with a column of ones for the bias term).
  • $y$ is the vector of actual target values.
  • $\theta$ is the vector of parameters (weights) you’re solving for.

The normal equation is derived by minimizing the cost function:

$$J(\theta) = \|X\theta - y\|^2 = (X\theta - y)^\top (X\theta - y)$$

You take the derivative with respect to $\theta$, set it to zero, and solve. That derivation gives you:

$$\theta = (X^\top X)^{-1} X^\top y$$
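The middle step of that derivation, sketched out (standard matrix calculus for the squared-error cost):

```latex
\nabla_\theta J(\theta) = 2 X^\top (X\theta - y) = 0
\;\Longrightarrow\; X^\top X \theta = X^\top y
\;\Longrightarrow\; \theta = (X^\top X)^{-1} X^\top y
```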

So in short:

  • $\theta = (X^\top X)^{-1} X^\top y$ is the closed-form solution for the weights of your linear regression model that minimize the squared error.
  • It “came from” solving for the best parameters that make your prediction line (or hyperplane) fit the data as best as possible.
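A minimal sketch of the normal equation in code (assuming NumPy; the function name is my own). Solving the linear system $X^\top X \theta = X^\top y$ is preferred over explicitly inverting $X^\top X$:

```python
import numpy as np

def normal_equation(X, y):
    """Solve for theta minimizing ||X @ theta - y||^2 in closed form.

    Uses np.linalg.solve on X^T X theta = X^T y, which is more numerically
    stable than computing the inverse of X^T X directly.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# one feature, no bias column: y = 2x exactly, so theta should be ~[2.0]
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = normal_equation(X, y)
```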

Normally, a linear regression model looks like this:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$

To express this in matrix form, we add a column of ones to $X$, so the first weight $\theta_0$ acts as the bias:

In the deep-ml problem, the feature matrix doesn’t include a column of ones, which means the model can’t encode a bias (intercept) term.

“A practical implementation involves augmenting $X$ with a column of ones to account for the intercept term and then applying the normal equation directly to compute $\theta$.”
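That augmentation step can be sketched like this (NumPy; function name and data are illustrative):

```python
import numpy as np

def fit_with_bias(X, y):
    """Prepend a column of ones to X (intercept term), then apply the
    normal equation to get theta = [bias, weights...]."""
    ones = np.ones((X.shape[0], 1))
    X_aug = np.hstack([ones, X])      # first column of ones handles the bias
    return np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# y = 1 + 2x, so theta should be ~[1.0, 2.0] (bias first, then slope)
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])
theta = fit_with_bias(X, y)
```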