Logistic Regression

Logistic regression is a binary classifier that outputs probabilities, not just hard decisions. It shares the same prediction rule as perceptron ($\hat{y} = \operatorname{sign}(w^\top x)$) but optimizes a convex log-likelihood that works on non-separable data and gives calibrated confidences.

Intuition

Labels are in $\{0, 1\}$ (not $\{-1, +1\}$ like perceptron). Model the label as Bernoulli:

$$y \mid x \sim \operatorname{Bernoulli}(p(x)), \qquad p(x) = \Pr(y = 1 \mid x)$$

Parameterization via the Logit Transform

Naive attempt: $\Pr(y = 1 \mid x) = w^\top x$ fails because LHS $\in [0, 1]$ while RHS $\in (-\infty, \infty)$.

The fix: equate the linear score with the log-odds (“logit”):

$$w^\top x = \log \frac{\Pr(y = 1 \mid x)}{1 - \Pr(y = 1 \mid x)} = \log \frac{p}{1 - p}$$

This works because as $p$ ranges over $(0, 1)$, the logit ranges over $(-\infty, \infty)$. Solving for $p$:

$$p = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$$

The choice of sigmoid is somewhat arbitrary: any monotone function mapping $(-\infty, \infty)$ onto $(0, 1)$ works. Taking the hard threshold $p = \mathbb{1}[w^\top x \geq 0]$ recovers the perceptron.

Intuition

Sigmoid squashes an unbounded linear score into a probability. Far from the boundary ($|w^\top x|$ large) the output saturates near 0 or 1 (high confidence), and near the boundary it is close to 0.5 (uncertain). The linear score is the log-odds: doubling $w^\top x$ squares the odds $p/(1-p)$, not the probability.
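
A quick numerical check of these claims (a minimal sketch using NumPy; the variable names are ours, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
p = sigmoid(z)
print(p)  # [0.0025 0.1192 0.5 0.8808 0.9975] -- saturates away from 0

# The score is the log-odds: logit(sigmoid(z)) recovers z.
print(np.log(p / (1 - p)))

# Doubling the score squares the odds, not the probability.
odds = np.exp(z)
print(np.allclose(np.exp(2 * z), odds ** 2))  # True
```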

Prediction

Predict $\hat{y} = 1$ if $\sigma(w^\top x) \geq 0.5$, else $\hat{y} = 0$. Since $\sigma(z) \geq 0.5 \iff z \geq 0$, this is equivalent to $w^\top x \geq 0$, same decision boundary as perceptron.

What’s different from perceptron:

  1. Objective is convex and defined on non-separable data
  2. Magnitude of $w^\top x$ is a confidence, hence “regression”

MLE Derivation

Under the Bernoulli model, writing $p_i = \sigma(w^\top x_i)$, the likelihood of data $\{(x_i, y_i)\}_{i=1}^n$ with $y_i \in \{0, 1\}$ is

$$L(w) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}$$

Take log, negate, and this is the cross-entropy loss:

$$\ell(w) = -\sum_{i=1}^n \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big]$$

Equivalently, with labels recoded as $\tilde{y}_i = +1$ if $y_i = 1$ and $\tilde{y}_i = -1$ if $y_i = 0$:

$$\ell(w) = \sum_{i=1}^n \log\!\left(1 + e^{-\tilde{y}_i\, w^\top x_i}\right)$$

This is the logistic loss, a smooth, convex surrogate for 0-1 loss.
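
A small sketch checking that the two forms agree numerically (our own variable names; `y01` is the $\{0,1\}$ encoding, `ypm` the $\{\pm 1\}$ one):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=3)
X = rng.normal(size=(5, 3))
y01 = rng.integers(0, 2, size=5)   # labels in {0, 1}
ypm = 2 * y01 - 1                  # same labels in {-1, +1}

z = X @ w
p = sigmoid(z)

# Cross-entropy form (labels in {0, 1}).
ce = -np.sum(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

# Logistic-loss form (labels in {-1, +1}).
ll = np.sum(np.log(1 + np.exp(-ypm * z)))

print(np.isclose(ce, ll))  # True
```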

Optimization

No closed form (unlike linear regression). The loss is convex so we find the point where $\nabla \ell(w) = 0$.

Gradient: $\nabla \ell(w) = \sum_{i=1}^n \big(\sigma(w^\top x_i) - y_i\big)\, x_i$.

The gradient has a clean reading: each example contributes $(\sigma(w^\top x_i) - y_i)$ (prediction error) times $x_i$ (feature vector). A confident-and-right prediction contributes nothing, a confident-and-wrong prediction contributes a big $x_i$-aligned push. Same structure as perceptron’s update, but weighted smoothly by how wrong you are instead of all-or-nothing.
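
A sketch verifying the gradient formula against finite differences (assumes the cross-entropy loss above; names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # (prediction error) times (feature vector), summed over examples
    return X.T @ (sigmoid(X @ w) - y)

rng = np.random.default_rng(1)
w = rng.normal(size=3)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)

# Central finite-difference check of each coordinate.
eps = 1e-6
num = np.array([(loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(num, grad(w, X, y)))  # True
```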

Methods (a training-loop sketch follows the list):

  • Gradient Descent: $w \leftarrow w - \eta\, \nabla \ell(w)$
  • Stochastic Gradient Descent: subsample a single example (or mini-batch) to estimate the gradient: $w \leftarrow w - \eta\, \big(\sigma(w^\top x_i) - y_i\big)\, x_i$
  • Newton’s method: $w \leftarrow w - \big(\nabla^2 \ell(w)\big)^{-1} \nabla \ell(w)$; uses Hessian, fewer steps but costly per step
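
A minimal gradient-descent trainer on synthetic data (a sketch; the step size, iteration count, and data-generating weights are illustrative choices, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=500):
    """Full-batch gradient descent on the (mean) cross-entropy loss."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

# Synthetic data: label depends on the sign of a noisy linear score.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)

w = fit_logistic(X, y)
acc = np.mean((sigmoid(X @ w) >= 0.5) == y)
print(f"train accuracy: {acc:.2f}")  # typically well above 0.9
```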

Multiclass Logistic Regression (Softmax)

With $K$ classes, give each class its own weight vector $w_k$ and set

$$\Pr(y = k \mid x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^{K} e^{w_j^\top x}}$$

This is the softmax function. Training with cross-entropy loss generalizes directly.
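
A sketch of the multiclass predictor (a numerically stable softmax; `W` holding one weight vector per class is our naming):

```python
import numpy as np

def softmax(scores):
    # Subtract the max score for numerical stability; ratios are unchanged.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])     # K=3 classes, d=2 features
x = np.array([0.5, 2.0])

probs = softmax(W @ x)
print(probs, probs.sum())        # a distribution over the 3 classes, sums to 1
print(probs.argmax())            # predicted class: the largest score wins
```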

From CS480 lec4.