## 1. Overview

A support vector machine(SVM) is a supervised Machine Learning algorithm used for classification, regression and outliers detection. The basic idea of SVM is constructing a generalized optimal hyperplane that seperates the data with the maximal margin while makes the smallest number of errors.

For a particular hyperplane, the nearest instances are support vectors. The distance between support vectors and the hyperplane is margin. Basically, SVM Classification is a maximum-margin classifier.

## 2. Notations

We denote $\mathbb{S}$ as a finite training set, with $X$ as training features, and $y$ as training labels, where

and

$n$ is the sample number, $m$ is the feature number. The label of instance is

and the features of an instance is

A hyperplane is denoted as

where

$\phi(x)$ is a feature mapping function, and $b$ is a constant.

## 3. SVM: hard margin

To make things easy, we’ll consider a binary hard margin classification first. Here, we try to find a hyperplane that maximize the minimum distance between the samples and the hyperplane. The distance bewteen $\forall x_{i}$ and the hyperplane is

So the margin is

Assume instance $(x^*, y^*)$ is the nearest one to the hyperplane, therefore the hard margin optimal problem $\mathcal{H}$ becomes

Since for all constant $a \neq 0$, we have

so, there must exists a constant $c \neq 0$, such that

therefore, $\mathcal{H}$ becomes

Equals to

Combine with Lagrangian dual problem(see section Lagrangian dual problem), $\mathcal{H}$ equals to $\eqref{dual}$

$L(w, b, \lambda, \nu)$ is convex about $w$, and $b$, therefore $w$, and $b$ can be drived from $\lambda$

$L(w, b, \lambda, \nu)$ becomes

where $K(x_i, x_j)$ is the kernel function. Therefore, the problem equals to

## 4. SVM: soft margin

However, in real situation, a hard margin hyperplane may not exists or generalizes well. So we introduce soft margin hyperplane to find a good balance between maximal margin and smallest number of errors. The optimal problem becomes $\mathcal{H_S}$

where $C$ controls the trade-off and $\zeta_i$ is the hinge loss function

The dual becomes

## 5. Kernels

Kernels are what make SVM magical and it is not limited to SVMs. A linear model can become a non-linear model by kernel functions. Through kernel functions, we can increase feature dimensions without even actually computing the feature mapping or even knowing the feature mapping function. The most common kernel functions are:

• Gaussian kernel: $K(x_i, x_j) = exp(- \gamma {\| x_i - x_j \|}^2)$, $\gamma \ geq 0$.

• Polynomial kernel: $% + c)^d %]]>$, d means degree.

• Expoenetial kernel: $K(x_i, x_j) = exp(- \gamma \| x_i - x_j \|)$.

• Sigmoid kernel: $% + c) %]]>$.

## 6. Iterative Methods

The traditional iterative method is sequential minimal optimization(SMO). SMO is an extreme case of the decomposition method used in LIBSVM. LIBSVM is a popular open source library for SVMs. Decomposition methods translate a problem into a new one that is easier to solve, by grouping variables into sets, and solving a subproblem for each set. SMO updates only two $\lambda$ per iteration, other lagrange multipliers considered as constants, thus problem becomes two variables QP optimization per iteration. With the constraint as $\eqref{SoftDualProblem}$, we have

where $a = - \sum_{k \neq i, j}^n \lambda_k y_k$. The Langrange in $\eqref{SoftDualProblem}$ becomes

where $K_{ij} = K(x_i, x_j)$. Therefore, we have

and

Therefore,

With the constraint $\eqref{SoftDualProblem}$, we have

where $L$ is the lower bounder, $H$ is the high bounder, and $t$ means step $t$. Therefore,

After solving the optimal Lagrange multipliers per iteration, the next problem is how to choose Lagrange multipliers at each iteration. In LIBSVM, it is called working set selection. There are several methods working with it. The current one used in LIBSVM is via the “maximal violating pair”. For more details, please reference Chang and Lin(2001) or Fan, Chen, and Lin(2005).

Other iterative methods includes subgradient descent. Subgradient methods are iterative methods for solving convex minimization problems.(via wiki) Basically, it is very similar to gradient descent. However, subgradient descent can apply to non-differentiable objective function.

## 7. Lagrangian dual problem

Given the primal problem $\mathcal{P_0}$

with the domain $\mathbb{D} \subset \mathbb{R^n}$, the $\textit{Lagrangian function}$ $\mathbb{L} : \mathbb{R^n} \times \mathbb{R^m} \times \mathbb{R^p} \to \mathbb{R}$ is defined as

where

are $\textit{Lagrange muptiplier vectors}$, $f(x)$ and $g_i(x)$ are convex. With $\textit{Lagrangian function} \eqref{lagrangian}$ , we denoted the $\textit{primal problem}$ as

and the $\textit{dual problem}$ as

$p^*$ and $d^*$ are the optimal values.

THEOREM 1. If $\lambda_i \geq 0$, $g_i(x) \leq 0$ and $h_j(x) = 0$, we have $\max \limits_{\lambda, \nu}L(x, \lambda, \nu) = f(x)$ and $p^*$ is the optimal value for primal problem $\ref{originalPrimal}$.

PROOF 1:

THEOREM 2. If $f(x)$ and $g_i(x)$ are convex and continuously differentiable at a point $x^{*}$, $\lambda_i \geq 0$, $g_i(x) \leq 0$ and $h_j(x) = 0$, therefore

PROOF 2: Suppose $(x_p^*, \lambda_p^*, \nu_p^*)$ are the solutions to primal problem $\mathcal{P}$ $\eqref{primal}$, same as Proof 1, we have

Similarly, suppose $(x_d^*, \lambda_d^*, \nu_d^*)$ are the solutions to primal problem $\mathcal{D}$ $\eqref{dual}$, we have

According to \eqref{proof2.1} and \eqref{proof2.2}, given f(x) is convex, we have

THEOREM 3. If $f(x)$ and $g_i(x)$ are convex and continuously differentiable at a point $x^{*}$, $\lambda_i \geq 0$, $g_i(x) \leq 0$ and $h_j(x) = 0$, therefore

## 8. References

 Chih-Chung Chang and Chih-Jen Lin(2001). LIBSVM: a library for support vector machines.

 John C. Platt(1998). Fast Training of Support Vector Machines Using Sequential Minimal Optimization.

 Rong-En Fan, Pai-Hsuen Chen, and Chih-Jen Lin(2005). Working Set Selection Using Second Order Information for Training Support Vector Machines.