Logistic Regression
Logistic regression is a binary classifier that outputs probabilities (graded confidences), not just hard decisions. It shares the same prediction rule as perceptron (predict the positive class iff $w^\top x \ge 0$) but optimizes a convex log-likelihood that is well-defined on non-separable data and gives calibrated confidences.
Intuition
Labels are in $\{0, 1\}$ (not $\{-1, +1\}$ like perceptron). Model the label as Bernoulli:

$$y \mid x \sim \operatorname{Bernoulli}(p), \qquad p = P(y = 1 \mid x)$$
Parameterization via the Logit Transform
Naive attempt: set $P(y = 1 \mid x) = w^\top x$. This fails because the LHS $\in [0, 1]$ while the RHS $\in (-\infty, \infty)$.
The fix: equate the linear score with the log-odds ("logit"):

$$w^\top x = \log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}$$
This works because as $p$ ranges over $(0, 1)$, the logit $\log\frac{p}{1-p}$ ranges over $(-\infty, \infty)$. Solving for $P(y = 1 \mid x)$:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-w^\top x}} = \sigma(w^\top x)$$
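A quick NumPy sketch (my own check, not from the lecture) confirming that the sigmoid inverts the logit:

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + exp(-t)); maps R -> (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    # log-odds: log(p / (1 - p)); maps (0, 1) -> R
    return np.log(p / (1.0 - p))

p = np.array([0.01, 0.25, 0.5, 0.75, 0.99])
assert np.allclose(sigmoid(logit(p)), p)  # sigmoid undoes the logit transform
```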
The choice of sigmoid is somewhat arbitrary: any monotone map from $\mathbb{R}$ onto $(0, 1)$ works. Taking the hard threshold $\mathbb{1}[w^\top x \ge 0]$ instead recovers the perceptron.
Intuition
Sigmoid squashes an unbounded linear score into a probability. Far from the boundary ($|w^\top x|$ large) the output saturates near 0 or 1 (high confidence); near the boundary it is close to 0.5 (uncertain). The linear score is the log-odds: doubling $w^\top x$ squares the odds ratio $\frac{p}{1-p} = e^{w^\top x}$, not the probability.
Prediction
Predict $\hat{y} = 1$ if $\sigma(w^\top x) \ge 0.5$, else $\hat{y} = 0$. Since $\sigma(t) \ge 0.5 \iff t \ge 0$, this is equivalent to $w^\top x \ge 0$: the same decision boundary as perceptron.
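A small sketch (illustrative, with made-up data) showing the two thresholds give identical predictions:

```python
import numpy as np

def predict(w, X):
    probs = 1.0 / (1.0 + np.exp(-(X @ w)))  # P(y = 1 | x) for each row
    return (probs >= 0.5).astype(int)        # threshold the probability at 0.5

rng = np.random.default_rng(0)
w = rng.normal(size=3)
X = rng.normal(size=(100, 3))
# Thresholding sigma(w.x) at 0.5 is the same as thresholding w.x at 0.
assert np.array_equal(predict(w, X), (X @ w >= 0).astype(int))
```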
What's different from perceptron:
- Objective is convex and defined on non-separable data
- Magnitude of $w^\top x$ (via $\sigma$) is a confidence, hence "regression"
MLE Derivation
The likelihood of the data, with $p_i = \sigma(w^\top x_i)$:

$$L(w) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$

Take the log, negate, and this is the cross-entropy loss:

$$\ell(w) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
Equivalently, with $\tilde{y}_i = +1$ if $y_i = 1$ and $\tilde{y}_i = -1$ if $y_i = 0$:

$$\ell(w) = \sum_{i=1}^{n} \log\left(1 + e^{-\tilde{y}_i\, w^\top x_i}\right)$$
This is the logistic loss, a smooth, convex surrogate for 0-1 loss.
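A minimal implementation of this loss (my sketch, not from the lecture; uses `np.logaddexp` on the $\log(1 + e^{-\tilde{y}\, w^\top x})$ form to avoid overflow):

```python
import numpy as np

def logistic_loss(w, X, y):
    """Cross-entropy / logistic loss for labels y in {0, 1}."""
    y_tilde = 2.0 * y - 1.0        # map {0, 1} -> {-1, +1}
    margins = y_tilde * (X @ w)    # y_tilde_i * w.x_i
    # log(1 + exp(-margin)), computed stably as logaddexp(0, -margin)
    return np.sum(np.logaddexp(0.0, -margins))
```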
Optimization
No closed form (unlike linear regression). The loss is convex, so we numerically find the point where $\nabla_w \ell(w) = 0$.
Gradient: $\nabla_w \ell(w) = \sum_{i=1}^{n} \left(\sigma(w^\top x_i) - y_i\right) x_i$.
The gradient has a clean reading: each example's contribution is proportional to $(\sigma(w^\top x_i) - y_i)$ (prediction error) times $x_i$ (feature vector). A confident-and-right prediction contributes almost nothing; a confident-and-wrong prediction contributes a big $x_i$-aligned push. Same structure as perceptron's update, but weighted smoothly by how wrong you are instead of all-or-nothing.
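A quick finite-difference sanity check of the gradient formula (my own, on random data):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    return X.T @ (sigmoid(X @ w) - y)   # sum_i (sigma(w.x_i) - y_i) x_i

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

eps = 1e-6
numeric = np.array([(loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(numeric, grad(w, X, y), atol=1e-5)
```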
Methods:
- Gradient Descent: $w \leftarrow w - \eta\, \nabla_w \ell(w)$ (see the sketch after this list)
- Stochastic Gradient Descent: estimate $\nabla_w \ell(w)$ from a random subsample (mini-batch) each step
- Newton's method: uses the Hessian; fewer steps but costlier per step
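A sketch of plain batch gradient descent on this loss (illustrative defaults, my own code; swap the full gradient for a mini-batch gradient to get SGD):

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.1, n_steps=1000):
    """Batch gradient descent for logistic regression, labels y in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted P(y = 1 | x) per row
        grad = X.T @ (p - y)                # sum_i (sigma(w.x_i) - y_i) x_i
        w -= lr * grad / n                  # average the gradient for a scale-free step
    return w
```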
Multiclass Logistic Regression (Softmax)
With $K$ classes, use one weight vector $w_k$ per class:

$$P(y = k \mid x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^{K} e^{w_j^\top x}}$$

This is the softmax function. Training with cross-entropy loss generalizes directly.
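A row-wise softmax sketch (with the usual max-shift for numerical stability; my addition, not from the lecture):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax of an (n examples x K classes) score matrix."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # subtract row max for stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

probs = softmax(np.array([[2.0, 1.0, 0.1]]))
assert np.isclose(probs.sum(), 1.0)  # each row is a distribution over the K classes
```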
From CS480 lec4.
Related
- Perceptron (same prediction rule, different objective)
- Cross-Entropy Loss
- Sigmoid
- Softmax
- MLE
- Gradient Descent
- Newton's Method
- CS480