Perceptron

The perceptron is the simplest neural network: a linear binary classifier that learns a hyperplane by making “lazy” updates only on mistakes.

An MLP stacks several of these units per layer and adds more layers.
The simplest perceptron doesn’t even have an Activation Function: the output is just $\operatorname{sign}(\langle w, x \rangle + b)$.

Setup: Binary Classification

Given pairs $(x_{1}, y_{1}), \dots, (x_{n}, y_{n})$ with feature vectors $x_{i} \in \mathbb{R}^{d}$ and labels $y_{i} \in \{-1, +1\}$, learn $h : \mathbb{R}^{d} \to \{-1, +1\}$ such that $h(x) = y$.

Assumption

The data is linearly separable: there exists some hyperplane that perfectly splits the classes.

Algorithm

Weight vector $w$ , bias $b$ . Typically initialize $w = 0, b = 0$ , set $δ = 0$ .

For each $(x_{i}, y_{i})$ :

Predict $\hat{y}_{i} = \operatorname{sign}(\langle w, x_{i} \rangle + b)$
Lazy update (only on a mistake): if $\hat{y}_{i} \neq y_{i}$, then $w \leftarrow w + y_{i} x_{i}$ and $b \leftarrow b + y_{i}$
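The loop above can be sketched in plain Python. This is a minimal sketch, not a reference implementation: the function name `train_perceptron`, the `max_epochs` budget, and the toy data are assumptions made for illustration, and a score of exactly 0 is treated as a mistake so that the all-zero initialization triggers a first update.

```python
def train_perceptron(X, y, max_epochs=100):
    """Mistake-driven perceptron on lists of features X and labels y in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    mistakes = 0
    for _ in range(max_epochs):
        errors_this_epoch = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            # Assumption: a score of 0 counts as a mistake.
            if yi * score <= 0:
                # Lazy update: w <- w + y_i x_i, b <- b + y_i
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
                errors_this_epoch += 1
        if errors_this_epoch == 0:
            break  # a full pass with no mistakes: every point is classified correctly
    return w, b, mistakes


# Toy separable data, made up for illustration.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [+1, +1, -1, -1]
w, b, mistakes = train_perceptron(X, y)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1 for xi in X]
```

On this toy set a single update on the first point already separates the data, so the second pass is clean and the loop stops.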

Intuition

The update rotates $w$ toward $y_{i} x_{i}$ . If we got the sign wrong on $x_{i}$ , nudging $w$ in the direction of $y_{i} x_{i}$ increases $⟨ w, x_{i} ⟩ \cdot y_{i}$ next time, pushing the hyperplane to put $x_{i}$ on the right side. It is the minimal local correction that makes this one point less wrong.
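To make "less wrong" concrete, here is the one-line calculation, writing $w'$ and $b'$ for the weights after an update on $(x_{i}, y_{i})$ (the primes are notation introduced here) and using $y_{i}^{2} = 1$:

$y_{i} (\langle w', x_{i} \rangle + b') = y_{i} (\langle w + y_{i} x_{i}, x_{i} \rangle + b + y_{i}) = y_{i} (\langle w, x_{i} \rangle + b) + \|x_{i}\|_{2}^{2} + 1$

The signed score on $x_{i}$ goes up by exactly $\|x_{i}\|_{2}^{2} + 1 > 0$. A single update need not make it positive, and it can lower the score of other points.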

Padding + Pre-Multiplication Trick

To absorb the bias into the weight vector and simplify the analysis:

Let $z = (w, b)$ and replace $x_{i}$ with $(x_{i}, 1)$. Then the prediction is $\hat{y}_{i} = \operatorname{sign}(\langle z, (x_{i}, 1) \rangle)$
Let $a_{i} = y_{i} (x_{i}, 1)$ . Then the goal becomes $⟨ z, a_{i} ⟩ > 0$ for all $i$ , i.e., $A z > 0$ entrywise, where $A$ is the matrix with rows $a_{i}$
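A small sketch of the trick, assuming list-based data; the helper names `pad_and_sign` and `separates` and the candidate `z` are hypothetical, chosen only for illustration. In this notation the perceptron update reads $z \leftarrow z + a_{i}$ whenever $\langle z, a_{i} \rangle \leq 0$.

```python
def pad_and_sign(X, y):
    """Build the rows a_i = y_i * (x_i, 1)."""
    return [[yi * v for v in (*xi, 1.0)] for xi, yi in zip(X, y)]


def separates(A, z):
    """True when <z, a_i> > 0 for every row, i.e. A z > 0 entrywise."""
    return all(sum(zj * aj for zj, aj in zip(z, a)) > 0 for a in A)


# Toy data, made up for illustration.
X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [+1, +1, -1, -1]
A = pad_and_sign(X, y)

z = [1.0, 1.0, 0.0]        # hypothetical candidate: w = (1, 1), b = 0
z_bad = [-1.0, -1.0, 0.0]  # the same hyperplane with the sides flipped
ok = separates(A, z)
flipped = separates(A, z_bad)
```

After the transformation every row of `A` lies on the positive side of a correct separator, so labels no longer appear in the condition.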

Linear Separability and Margin

The dataset is separable with margin $s > 0$ if there exists $z$ such that $⟨ a_{i}, z ⟩ \geq s$ for all $i$ . The (normalized) margin is

$\gamma = \max_{\|z\|_{2} = 1} \min_{i} \langle a_{i}, z \rangle$

where:

$γ$ is the (normalized) margin
$a_{i}$ are the (padded, signed) data points
$z$ is a unit-norm candidate separator

Large margin means easy to classify; small margin means hard.
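Computing $γ$ exactly means solving the outer maximization, but the inner quantity is easy to evaluate for any fixed candidate. The sketch below (function name `normalized_margin` and both candidate vectors are assumptions for illustration) scores two separators on a toy padded, signed dataset; each value is only a lower bound on $γ$.

```python
import math


def normalized_margin(A, z):
    """min_i <a_i, z> / ||z||_2: the margin achieved by the direction of z."""
    norm = math.sqrt(sum(v * v for v in z))
    return min(sum(aj * zj for aj, zj in zip(a, z)) for a in A) / norm


# Padded, signed rows a_i = y_i * (x_i, 1) of a toy dataset (made up for illustration).
A = [[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 2.0, -1.0], [2.0, 1.0, -1.0]]

m1 = normalized_margin(A, [1.0, 1.0, 0.0])  # 3 / sqrt(2)
m2 = normalized_margin(A, [2.0, 1.0, 1.0])  # 3 / sqrt(6)
```

Both candidates separate the data, but the first leaves more room around the boundary than the second.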

Geometric picture

The margin is the width of the thickest slab you can draw between the two classes. A wide slab means small perturbations in the data cannot flip a point’s side, so any reasonable separator works. A razor-thin slab means the boundary has to be threaded precisely.

Perceptron Convergence Theorem (Mistake Bound)

Theorem. Suppose there exist $z = (w, b)$ and $s > 0$ such that $A z \geq s \cdot \mathbf{1}$. Then the perceptron correctly classifies the entire dataset after a number of mistakes (updates) of at most

$\frac{R ^{2} ∥ z ∥ _{2}^{2}}{s ^{2}}$

where:

$R = \max_{i} \|a_{i}\|_{2}$ is the largest data-point norm
$s$ is a margin witness
$z$ is the corresponding separator

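The pair $(z, s)$ is not unique: $A z \geq s \mathbf{1} \Rightarrow A (2 z) \geq (2 s) \mathbf{1}$, and such a rescaling leaves the ratio $\|z\|_{2} / s$ unchanged. Minimizing over all valid pairs, that is, fixing $\|z\|_{2} = 1$ and taking the largest feasible $s$ (which is $γ$), gives the bound in terms of the margin: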

$\text{mistakes} \leq \frac{R^{2}}{\gamma^{2}}$
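A sketch of the standard argument behind the bound, in the padded notation. The symbols $z_{t}$ (weights after $t$ mistakes, with $z_{0} = 0$) and $z^{*}$ (the witness separator with $A z^{*} \geq s \mathbf{1}$) are notation introduced here, and a mistake on $a_{i}$ is taken to mean $\langle z_{t}, a_{i} \rangle \leq 0$.

Each mistake makes progress toward $z^{*}$:

$\langle z_{t+1}, z^{*} \rangle = \langle z_{t}, z^{*} \rangle + \langle a_{i}, z^{*} \rangle \geq \langle z_{t}, z^{*} \rangle + s$

Each mistake grows the norm only slowly:

$\|z_{t+1}\|_{2}^{2} = \|z_{t}\|_{2}^{2} + 2 \langle z_{t}, a_{i} \rangle + \|a_{i}\|_{2}^{2} \leq \|z_{t}\|_{2}^{2} + R^{2}$

Combining the two with Cauchy-Schwarz after $t$ mistakes:

$t s \leq \langle z_{t}, z^{*} \rangle \leq \|z_{t}\|_{2} \|z^{*}\|_{2} \leq \sqrt{t} R \|z^{*}\|_{2} \Rightarrow t \leq \frac{R^{2} \|z^{*}\|_{2}^{2}}{s^{2}}$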

Intuition:

a large $R$ (points far from the origin) makes each update less effective
a small margin $γ$ means fewer directions work
mistakes are bounded by $(R / γ)^{2}$ , not by $n$ : feed the algorithm a million easy points and it still halts quickly, while a small but tight dataset can dominate the cost
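The points above can be checked numerically on a toy example. The sketch below is illustrative only: `perceptron_padded`, `mistake_bound`, the witness `z_star`, and the data are all assumptions, and the data is given directly as padded, signed rows $a_{i}$. It counts mistakes, evaluates $R^{2} \|z\|_{2}^{2} / s^{2}$ for a known witness, and then repeats the dataset 1000 times to show that neither number depends on $n$.

```python
def perceptron_padded(A, max_epochs=1000):
    """Perceptron in padded form: z <- z + a_i whenever <z, a_i> <= 0."""
    z = [0.0] * len(A[0])
    mistakes = 0
    for _ in range(max_epochs):
        clean = True
        for a in A:
            if sum(zj * aj for zj, aj in zip(z, a)) <= 0:
                z = [zj + aj for zj, aj in zip(z, a)]
                mistakes += 1
                clean = False
        if clean:
            break
    return z, mistakes


def mistake_bound(A, z_star):
    """R^2 * ||z*||^2 / s^2 for a witness z* with margin s = min_i <a_i, z*>."""
    r_squared = max(sum(v * v for v in a) for a in A)
    s = min(sum(aj * zj for aj, zj in zip(a, z_star)) for a in A)
    return r_squared * sum(v * v for v in z_star) / (s * s)


A = [[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 2.0, -1.0], [2.0, 1.0, -1.0]]
z_star = [1.0, 1.0, 0.0]  # hypothetical witness separator

_, mistakes = perceptron_padded(A)
bound = mistake_bound(A, z_star)

# Same geometry, 1000x more points: R, s and the bound are unchanged.
_, mistakes_big = perceptron_padded(A * 1000)
bound_big = mistake_bound(A * 1000, z_star)
```

Here the bound evaluates to $6 \cdot 2 / 9 = 4/3$ and the algorithm makes one mistake on both the small and the enlarged dataset.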

Termination

When do we stop?

All points classified correctly (training error reaches zero)
Validation error stops decreasing
Budget exhausted, or weights stop changing

Not linearly separable

The algorithm never halts: perceptron will “cycle.” It is not the right algorithm for non-separable data. Use Logistic Regression or soft-margin SVM instead.
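XOR is the classic non-separable case and makes the failure easy to see. In this sketch (the function `run_perceptron` and the epoch budget are assumptions for illustration) every pass over the data contains at least one mistake, so only the budget stops the loop.

```python
def run_perceptron(X, y, max_epochs):
    """Return (w, b, converged); converged is True only after a mistake-free pass."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # assumption: a score of 0 counts as a mistake
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                errors += 1
        if errors == 0:
            return w, b, True
    return w, b, False  # budget exhausted without a clean pass


# XOR labels: no hyperplane puts (0,1),(1,0) on one side and (0,0),(1,1) on the other.
X_xor = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y_xor = [-1, +1, +1, -1]
w, b, converged = run_perceptron(X_xor, y_xor, max_epochs=1000)
```

A mistake-free pass would mean the current weights separate the data, which is impossible for XOR, so `converged` stays `False` for any budget.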

Uniqueness

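The perceptron finds some separator, not the best one: which one it finds depends on the initialization and on the order in which the points are visited. An SVM picks the max-margin separator, which typically generalizes better.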

Multiclass Extensions

One-versus-all: train $k$ classifiers (dog vs not dog, etc.), predict $\arg\max_{i} \langle z_{i}, x \rangle$
One-versus-one: train $\binom{k}{2}$ pairwise classifiers, take majority vote
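A minimal one-versus-all sketch built on the padded perceptron. The function names, the class labels, and the three-point dataset are assumptions made for illustration, and $x$ is padded with a trailing 1 before taking the arg max so that each $z_{i}$ includes its bias.

```python
def train_binary(X, y, max_epochs=100):
    """Padded perceptron: returns z = (w, b) for labels y in {-1, +1}."""
    A = [[yi * v for v in (*xi, 1.0)] for xi, yi in zip(X, y)]
    z = [0.0] * len(A[0])
    for _ in range(max_epochs):
        clean = True
        for a in A:
            if sum(zj * aj for zj, aj in zip(z, a)) <= 0:
                z = [zj + aj for zj, aj in zip(z, a)]
                clean = False
        if clean:
            break
    return z


def train_one_vs_all(X, labels):
    """One binary perceptron per class: class c versus everything else."""
    classes = sorted(set(labels))
    return {c: train_binary(X, [1 if l == c else -1 for l in labels]) for c in classes}


def predict(models, x):
    """Pick the class whose separator gives the largest score <z_c, (x, 1)>."""
    xp = (*x, 1.0)
    return max(models, key=lambda c: sum(zj * xj for zj, xj in zip(models[c], xp)))


# Three well-separated toy points, one per class (made up for illustration).
X = [[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]]
labels = ["a", "b", "c"]
models = train_one_vs_all(X, labels)
train_predictions = [predict(models, x) for x in X]
```

One design caveat: the raw scores of independently trained perceptrons are not calibrated against each other, so the arg max is a heuristic away from the training points.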

From CS480 lec1.

🛠️ Steven Gong
