A Super Boring Guide To Support Vector Machines
1. Overview
A support vector machine(SVM) is a supervised Machine Learning algorithm used for classification, regression and outliers detection. The basic idea of SVM is constructing a generalized optimal hyperplane that seperates the data with the maximal margin while makes the smallest number of errors.
For a particular hyperplane, the nearest instances are support vectors. The distance between support vectors and the hyperplane is margin. Basically, SVM Classification is a maximummargin classifier.
2. Notations
We denote as a finite training set, with as training features, and as training labels, where
and
is the sample number, is the feature number. The label of instance is
and the features of an instance is
A hyperplane is denoted as
where
is a feature mapping function, and is a constant.
3. SVM: hard margin
To make things easy, we’ll consider a binary hard margin classification first. Here, we try to find a hyperplane that maximize the minimum distance between the samples and the hyperplane. The distance bewteen and the hyperplane is
So the margin is
Assume instance is the nearest one to the hyperplane, therefore the hard margin optimal problem becomes
Since for all constant , we have
so, there must exists a constant , such that
therefore, becomes
Equals to
Combine with Lagrangian dual problem(see section Lagrangian dual problem), equals to
is convex about , and , therefore , and can be drived from
becomes
where is the kernel function. Therefore, the problem equals to
4. SVM: soft margin
However, in real situation, a hard margin hyperplane may not exists or generalizes well. So we introduce soft margin hyperplane to find a good balance between maximal margin and smallest number of errors. The optimal problem becomes
where controls the tradeoff and is the hinge loss function
The dual becomes
5. Kernels
Kernels are what make SVM magical and it is not limited to SVMs. A linear model can become a nonlinear model by kernel functions. Through kernel functions, we can increase feature dimensions without even actually computing the feature mapping or even knowing the feature mapping function. The most common kernel functions are:

Gaussian kernel: , .

Polynomial kernel: , d means degree.

Expoenetial kernel: .

Sigmoid kernel: .
6. Iterative Methods
The traditional iterative method is sequential minimal optimization(SMO). SMO is an extreme case of the decomposition method used in LIBSVM. LIBSVM is a popular open source library for SVMs. Decomposition methods translate a problem into a new one that is easier to solve, by grouping variables into sets, and solving a subproblem for each set. SMO updates only two per iteration, other lagrange multipliers considered as constants, thus problem becomes two variables QP optimization per iteration. With the constraint as , we have
where . The Langrange in becomes
where . Therefore, we have
and
Therefore,
With the constraint , we have
where is the lower bounder, is the high bounder, and means step . Therefore,
After solving the optimal Lagrange multipliers per iteration, the next problem is how to choose Lagrange multipliers at each iteration. In LIBSVM, it is called working set selection. There are several methods working with it. The current one used in LIBSVM is via the “maximal violating pair”. For more details, please reference Chang and Lin(2001) or Fan, Chen, and Lin(2005).
Other iterative methods includes subgradient descent. Subgradient methods are iterative methods for solving convex minimization problems.(via wiki) Basically, it is very similar to gradient descent. However, subgradient descent can apply to nondifferentiable objective function.
7. Lagrangian dual problem
Given the primal problem
with the domain , the is defined as
where
are , and are convex. With , we denoted the as
and the as
and are the optimal values.
THEOREM 1. If , and , we have and is the optimal value for primal problem .
PROOF 1:
THEOREM 2. If and are convex and continuously differentiable at a point , , and , therefore
PROOF 2: Suppose are the solutions to primal problem , same as Proof 1, we have
Similarly, suppose are the solutions to primal problem , we have
According to \eqref{proof2.1} and \eqref{proof2.2}, given f(x) is convex, we have
THEOREM 3. If and are convex and continuously differentiable at a point , , and , therefore
8. References
[1] ChihChung Chang and ChihJen Lin(2001). LIBSVM: a library for support vector machines.
[2] John C. Platt(1998). Fast Training of Support Vector Machines Using Sequential Minimal Optimization.
[3] RongEn Fan, PaiHsuen Chen, and ChihJen Lin(2005). Working Set Selection Using Second Order Information for Training Support Vector Machines.