Supervised Learning

Support Vector Machine

This feels like just finding a sort of Hyperplane that splits data into half.

Resources

1. Primal Formulation (The Geometric Problem)

The primal formulation defines the optimization problem in terms of the primary variables: the weight vector and the bias . Its objective is explicitly geometric: maximizing the margin.

Primal formulation of SVM (hard margin):

My Intuition

is the equation of the Hyperplane that we want to learn to linearly separate our points . The sign of tells us which side of the hyperplane the point lies on:

  • If , then lies on the same side as the normal vector
  • If , then lies on the opposite side as the normal vector

So in the hard margin setup, we have correct label when .

Equivalently, if we let

Soft-margin formulation:

2. Dual Formulation

The Dual Formulation is defined entirely in terms of the Lagrange multipliers ​ and the dot products of the data points . We no longer use the primary variables.

Convert constrained optimization to unconstrained optimization via Method of Lagrange:

  • Lagrange multipliers add a penalty to the objective for each violated constraint.

The dual formulation is given by:

In the real-world

in practice, the Soft-Margin Support Vector Machine (SVM) formulation is almost always used because real-world data is rarely perfectly linearly separable.

CS231n

The Loss Function for a multiclass SVM loss for example is defined as a Hinge Loss): where

  • is the score for the -th class
  • can be interpreted as the minimum margin (use ), i.e. the correct class should be a minimum of away to make sure we incur 0 loss.

We omit as part of the calculation for loss because if not, the loss would be , instead of 0, even though we have perfect predictions.

The loss over the full dataset is

  • The first part is the Data Loss
  • The second part of the equation () is the Regularization loss, given by (this is L2 Regularization)

In code, these are implemented as:

def L_i(x, y, W):
  """
  unvectorized version. Compute the multiclass svm loss for a single example (x,y)
  - x is a column vector representing an image (e.g. 3073 x 1 in CIFAR-10)
    with an appended bias dimension in the 3073-rd position (i.e. bias trick)
  - y is an integer giving index of correct class (e.g. between 0 and 9 in CIFAR-10)
  - W is the weight matrix (e.g. 10 x 3073 in CIFAR-10)
  """
  delta = 1.0 # see notes about delta later in this section
  scores = W.dot(x) # scores becomes of size 10 x 1, the scores for each class
  correct_class_score = scores[y]
  D = W.shape[0] # number of classes, e.g. 10
  loss_i = 0.0
  for j in range(D): # iterate over all wrong classes
    if j == y:
      # skip for the true class to only loop over incorrect classes
      continue
    # accumulate loss for the i-th example
    loss_i += max(0, scores[j] - correct_class_score + delta)
  return loss_i
 
def L_i_vectorized(x, y, W):
  """
  A faster half-vectorized implementation. half-vectorized
  refers to the fact that for a single example the implementation contains
  no for loops, but there is still one loop over the examples (outside this function)
  """
  delta = 1.0
  scores = W.dot(x)
  # compute the margins for all classes in one vector operation
  margins = np.maximum(0, scores - scores[y] + delta)
  # on y-th position scores[y] - scores[y] canceled and gave delta. We want
  # to ignore the y-th position and only consider margin on max wrong class
  margins[y] = 0
  loss_i = np.sum(margins)
  return loss_i
 
def L(X, y, W):
  """
  fully-vectorized implementation :
  - X holds all the training examples as columns (e.g. 3073 x 50,000 in CIFAR-10)
  - y is array of integers specifying correct class (e.g. 50,000-D array)
  - W are weights (e.g. 10 x 3073)
  """
  # evaluate loss over all examples in X without using any for loops
  # left as exercise to reader in the assignment