1. Overview

A support vector machine(SVM) is a supervised Machine Learning algorithm used for classification, regression and outliers detection. The basic idea of SVM is constructing a generalized optimal hyperplane that seperates the data with the maximal margin while makes the smallest number of errors.

For a particular hyperplane, the nearest instances are support vectors. The distance between support vectors and the hyperplane is margin. Basically, SVM Classification is a maximum-margin classifier.

2. Notations

We denote as a finite training set, with as training features, and as training labels, where


is the sample number, is the feature number. The label of instance is

and the features of an instance is

A hyperplane is denoted as


is a feature mapping function, and is a constant.

3. SVM: hard margin

To make things easy, we’ll consider a binary hard margin classification first. Here, we try to find a hyperplane that maximize the minimum distance between the samples and the hyperplane. The distance bewteen and the hyperplane is

So the margin is

Assume instance is the nearest one to the hyperplane, therefore the hard margin optimal problem becomes

Since for all constant , we have

so, there must exists a constant , such that

therefore, becomes

Equals to

Combine with Lagrangian dual problem(see section Lagrangian dual problem), equals to

is convex about , and , therefore , and can be drived from


where is the kernel function. Therefore, the problem equals to

4. SVM: soft margin

However, in real situation, a hard margin hyperplane may not exists or generalizes well. So we introduce soft margin hyperplane to find a good balance between maximal margin and smallest number of errors. The optimal problem becomes

where controls the trade-off and is the hinge loss function

The dual becomes

5. Kernels

Kernels are what make SVM magical and it is not limited to SVMs. A linear model can become a non-linear model by kernel functions. Through kernel functions, we can increase feature dimensions without even actually computing the feature mapping or even knowing the feature mapping function. The most common kernel functions are:

  • Gaussian kernel: , .

  • Polynomial kernel: , d means degree.

  • Expoenetial kernel: .

  • Sigmoid kernel: .

6. Iterative Methods

The traditional iterative method is sequential minimal optimization(SMO). SMO is an extreme case of the decomposition method used in LIBSVM. LIBSVM is a popular open source library for SVMs. Decomposition methods translate a problem into a new one that is easier to solve, by grouping variables into sets, and solving a subproblem for each set. SMO updates only two per iteration, other lagrange multipliers considered as constants, thus problem becomes two variables QP optimization per iteration. With the constraint as , we have

where . The Langrange in becomes

where . Therefore, we have



With the constraint , we have

where is the lower bounder, is the high bounder, and means step . Therefore,

After solving the optimal Lagrange multipliers per iteration, the next problem is how to choose Lagrange multipliers at each iteration. In LIBSVM, it is called working set selection. There are several methods working with it. The current one used in LIBSVM is via the “maximal violating pair”. For more details, please reference Chang and Lin(2001) or Fan, Chen, and Lin(2005).

Other iterative methods includes subgradient descent. Subgradient methods are iterative methods for solving convex minimization problems.(via wiki) Basically, it is very similar to gradient descent. However, subgradient descent can apply to non-differentiable objective function.

7. Lagrangian dual problem

Given the primal problem

with the domain , the is defined as


are , and are convex. With , we denoted the as

and the as

and are the optimal values.

THEOREM 1. If , and , we have and is the optimal value for primal problem .


THEOREM 2. If and are convex and continuously differentiable at a point , , and , therefore

PROOF 2: Suppose are the solutions to primal problem , same as Proof 1, we have

Similarly, suppose are the solutions to primal problem , we have

According to \eqref{proof2.1} and \eqref{proof2.2}, given f(x) is convex, we have

THEOREM 3. If and are convex and continuously differentiable at a point , , and , therefore

8. References

[1] Chih-Chung Chang and Chih-Jen Lin(2001). LIBSVM: a library for support vector machines.

[2] John C. Platt(1998). Fast Training of Support Vector Machines Using Sequential Minimal Optimization.

[3] Rong-En Fan, Pai-Hsuen Chen, and Chih-Jen Lin(2005). Working Set Selection Using Second Order Information for Training Support Vector Machines.