8 Support vector machines
While the origins of support vector machines (SVMs) are old (they go back to Vapnik and Lerner (1963)), their modern treatment was initiated in Boser, Guyon, and Vapnik (1992) and Cortes and Vapnik (1995) for binary classification, and in Drucker et al. (1997) for regression. We refer to http://www.kernel-machines.org/books for an exhaustive bibliography on their theoretical and empirical properties. Since their creation, SVMs have been very popular in the machine learning community. Nonetheless, other tools (neural networks especially) have gained traction and have progressively replaced SVMs in many applications, notably computer vision.
8.1 SVM for classification
As is often the case in machine learning, it is easier to explain a complex tool through an illustration with binary classification. In fact, this is sometimes how the tool was originally designed (e.g., the perceptron). Let us consider a simple example in the plane, that is, with two features. In Figure 8.1, the goal is to find a model that correctly classifies the points: filled circles versus empty squares.
A model consists of two weights \(\textbf{w}=(w_1,w_2)\) that load on the variables and create a natural linear separation in the plane. In the example above, we show three separations. The red one is not a good classifier because there are circles and squares above and beneath it. The blue line is a good classifier: all circles are to its left and all squares to its right. Likewise, the green line achieves a perfect classification score. Yet, there is a notable difference between the two.
The grey star at the top of the graph is a mystery point and, given its location, if the data pattern holds, it should be a circle. The blue model fails to recognize it as such while the green one succeeds. The interesting features of the scheme are those that we have not mentioned yet, namely the grey dotted lines. These lines delimit the no-man’s land in which no observation falls when the green model is enforced. In this area, each strip above and below the green line can be viewed as a margin of error for the model. Notably, the grey star is located inside this margin.
The two margins are computed as the parallel lines that maximize the distance between the model and the closest points that are correctly classified (on both sides). These points are called support vectors, which justifies the name of the technique. Obviously, the green model has a greater margin than the blue one. The core idea of SVMs is to maximize the margin, under the constraint that the classifier does not make any mistake. Said differently, SVMs try to pick the most robust model among all those that yield a correct classification.
More formally, if we numerically define circles as +1 and squares as -1, any ‘good’ linear model is expected to satisfy: \[\begin{equation} \tag{8.1} \left\{\begin{array}{lll} \sum_{k=1}^Kw_kx_{i,k}+b \ge +1 & \text{ when } y_i=+1 \\ \sum_{k=1}^Kw_kx_{i,k}+b \le -1 & \text{ when } y_i=-1, \end{array}\right. \end{equation}\]
which can be summarized in compact form \(y_i \times \left(\sum_{k=1}^K w_kx_{i,k}+b \right)\ge 1\). Now, the margin between the green model and a support vector on the dashed grey line is equal to \(||\textbf{w}||^{-1}=\left(\sum_{k=1}^Kw_k^2\right)^{-1/2}\). This value comes from the fact that the distance between a point \((x_0,y_0)\) and a line parametrized by \(ax+by+c=0\) is equal to \(d=\frac{|ax_0+by_0+c|}{\sqrt{a^2+b^2}}\). In the case of the model defined above (8.1), the numerator is equal to 1 and the norm is that of \(\textbf{w}\). Thus, the final problem is the following:
\[\begin{equation} \tag{8.2} \underset{\textbf{w}, b}{\text{argmin}} \ \frac{1}{2} ||\textbf{w}||^2 \ \text{ s.t. } y_i\left(\sum_{k=1}^Kw_kx_{i,k}+b \right)\ge 1. \end{equation}\]
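As a quick numerical check of the point-to-line distance formula mentioned above, the short sketch below (with made-up line coefficients and an arbitrary point, chosen purely for illustration) compares the closed-form expression to a brute-force search over points on the line.

a0 <- 1; b0 <- 2; c0 <- -3                           # Line coefficients: a*x + b*y + c = 0
x0 <- 2; y0 <- 2                                     # Coordinates of the point
abs(a0 * x0 + b0 * y0 + c0) / sqrt(a0^2 + b0^2)      # Closed-form distance: ~1.342
t <- seq(-10, 10, by = 0.001)                        # Sample many points on the line
line_y <- (-c0 - a0 * t) / b0                        # Their second coordinate
min(sqrt((t - x0)^2 + (line_y - y0)^2))              # Brute-force minimum: also ~1.342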
The dual form of this program (see chapter 5 in Boyd and Vandenberghe (2004)) is
\[\begin{equation} \tag{8.3} L(\textbf{w},b,\boldsymbol{\lambda})= \frac{1}{2}||\textbf{w}||^2 - \sum_{i=1}^I\lambda_i\left(y_i\left(\sum_{k=1}^Kw_kx_{i,k}+b \right)- 1\right), \end{equation}\] where either \(\lambda_i=0\) or \(y_i\left(\sum_{k=1}^Kw_kx_{i,k}+b \right)= 1\). Thus, only some points will matter in the solution (the so-called support vectors). The first-order conditions impose that the derivatives of this Lagrangian be null: \[\frac{\partial L}{\partial \textbf{w}}(\textbf{w},b,\boldsymbol{\lambda})=\textbf{0}, \quad \frac{\partial L}{\partial b}(\textbf{w},b,\boldsymbol{\lambda})=0,\] where the first condition leads to \[\textbf{w}^*=\sum_{i=1}^I\lambda_iy_i\textbf{x}_i.\] This solution is indeed a linear form of the features, but only some points are taken into account: those for which the inequalities (8.1) are equalities.
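This decomposition can be checked numerically. The sketch below uses a toy dataset (not the book's data) and the e1071 package introduced in Section 8.3, whose fitted svm objects store the products \(\lambda_iy_i\) of the support vectors in the field coefs, the support vectors themselves in SV, and the negative intercept in rho.

library(e1071)
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)                              # 100 toy points, 2 features
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "circle", "square"))   # Linearly separable labels
fit <- svm(x = x, y = y, kernel = "linear", cost = 10,
           scale = FALSE)                                      # No scaling, so SV are raw points
w <- t(fit$coefs) %*% fit$SV                                   # Reconstruct w* = sum_i lambda_i y_i x_i
b <- -fit$rho                                                  # Intercept
dec <- attr(predict(fit, x, decision.values = TRUE), "decision.values")
all.equal(as.numeric(x %*% t(w) + b), as.numeric(dec))         # Both coincide (up to tolerance)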
Naturally, this problem becomes infeasible whenever the condition cannot be satisfied, that is, when a simple line cannot perfectly separate the labels, no matter the choice of coefficients. This is the most common configuration and such datasets are then said to be not linearly separable. This complicates the process, but it is possible to resort to a trick. The idea is to introduce some flexibility in (8.1) by adding correction variables that allow the conditions to be met:
\[\begin{equation} \tag{8.4} \left\{\begin{array}{lll} \sum_{k=1}^Kw_kx_{i,k}+b \ge +1-\xi_i & \text{ when } y_i=+1 \\ \sum_{k=1}^Kw_kx_{i,k}+b \le -1+\xi_i & \text{ when } y_i=-1, \end{array}\right. \end{equation}\]
where the new terms \(\xi_i\) are positive, so-called ‘slack’ variables that make the conditions feasible. They are illustrated in Figure 8.2. In this new configuration, there is no simple linear model that can perfectly discriminate between the two classes.
The optimization program then becomes \[\begin{equation} \tag{8.5} \underset{\textbf{w},b, \boldsymbol{\xi}}{\text{argmin}} \ \frac{1}{2} ||\textbf{w}||^2+C\sum_{i=1}^I\xi_i \ \text{ s.t. } \left\{ y_i\left(\sum_{k=1}^Kw_k\phi(x_{i,k})+b \right)\ge 1-\xi_i \ \text{ and } \ \xi_i\ge 0, \ \forall i \right\}, \end{equation}\] where the parameter \(C>0\) tunes the cost of mis-classification: as \(C\) increases, errors become more penalizing.
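To see the role of \(C\) more explicitly, note that at the optimum the slack variables are exactly \(\xi_i=\max\left(0,1-y_i\left(\sum_{k=1}^Kw_k\phi(x_{i,k})+b\right)\right)\), so that (8.5) can equivalently be written as the unconstrained problem \[\underset{\textbf{w},b}{\text{argmin}} \ \frac{1}{2} ||\textbf{w}||^2+C\sum_{i=1}^I\max\left(0,\ 1-y_i\left(\sum_{k=1}^Kw_k\phi(x_{i,k})+b \right)\right),\] in which the second term is the so-called hinge loss: larger values of \(C\) put more weight on classification errors relative to the regularization term \(\frac{1}{2}||\textbf{w}||^2\).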
In addition, the program can be generalized to nonlinear models, via the kernel \(\phi\) which is applied to the input points \(x_{i,k}\). Nonlinear kernels can help cope with patterns that are more complex than straight lines (see Figure 8.3). Common kernels can be polynomial, radial or sigmoid. The solution is found using more or less standard techniques for constrained quadratic programs. Once the weights \(\textbf{w}\) and bias \(b\) are set via training, a prediction for a new vector \(\textbf{x}_j\) is simply made by computing \(\sum_{k=1}^Kw_k\phi(x_{j,k})+b\) and choosing the class based on the sign of the expression.
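To make these kernels concrete, the sketch below writes their usual LIBSVM parametrizations as plain R functions; gamma, coef0 and degree are the constants tuned by the user (the default values here are purely illustrative).

linear_kernel <- function(x, z) {
  sum(x * z)                                   # Plain inner product
}
poly_kernel <- function(x, z, gamma = 1, coef0 = 0, degree = 3) {
  (gamma * sum(x * z) + coef0)^degree          # Polynomial kernel
}
radial_kernel <- function(x, z, gamma = 0.5) {
  exp(-gamma * sum((x - z)^2))                 # Radial basis function (RBF)
}
sigmoid_kernel <- function(x, z, gamma = 0.5, coef0 = 0) {
  tanh(gamma * sum(x * z) + coef0)             # Sigmoid kernel
}
radial_kernel(c(1, 2), c(1.1, 2.1))            # Nearby points: value close to 1
radial_kernel(c(1, 2), c(5, -3))               # Distant points: value close to 0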
8.2 SVM for regression
The ideas of classification SVM can be transposed to regression exercises but the role of the margin is different. One general formulation is the following
\[\begin{align} \underset{\textbf{w},b, \boldsymbol{\xi}}{\text{argmin}} \ & \frac{1}{2} ||\textbf{w}||^2+C\sum_{i=1}^I\left(\xi_i+\xi_i^* \right)\\ \text{ s.t. }& \sum_{k=1}^Kw_k\phi(x_{i,k})+b -y_i\le \epsilon+\xi_i \\ \tag{8.6} & y_i-\sum_{k=1}^Kw_k\phi(x_{i,k})-b \le \epsilon+\xi_i^* \\ &\xi_i,\xi_i^*\ge 0, \ \forall i , \end{align}\]
and it is illustrated in Figure 8.4. The user specifies a margin \(\epsilon\) and the model will try to find the linear (up to kernel transformation) relationship between the labels \(y_i\) and the input \(\textbf{x}_i\). Just as in the classification task, if the data points are inside the strip, the slack variables \(\xi_i\) and \(\xi_i^*\) are set to zero. When the points violate the threshold, the objective function (the first line of (8.6)) is penalized. Note that setting a large \(\epsilon\) leaves room for more error. Once the model has been trained, a prediction for \(\textbf{x}_j\) is simply \(\sum_{k=1}^Kw_k\phi(x_{j,k})+b\).
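The penalty implied by (8.6) for a given prediction error can be sketched with a one-line R function (the so-called \(\epsilon\)-insensitive loss); the numerical values below are illustrative only.

eps_loss <- function(error, epsilon = 0.1) pmax(abs(error) - epsilon, 0)  # Zero inside the strip
eps_loss(c(-0.25, -0.05, 0.02, 0.30), epsilon = 0.1)                      # 0.15 0.00 0.00 0.20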
Let us take a step back and simplify what the algorithm does, that is: minimize the sum of squared weights \(||\textbf{w}||^2\) subject to the error being small enough (modulo a slack variable). In spirit, this is somewhat the opposite of penalized linear regressions, which seek to minimize the error subject to the weights being small enough.
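Schematically, ignoring the intercept and the slack variables, and taking ridge regression as the archetypal penalized regression, the two philosophies can be contrasted as \[\text{SVM: } \ \underset{\textbf{w}}{\text{min}} \ ||\textbf{w}||^2 \ \text{ s.t. } \left|y_i-\sum_{k=1}^Kw_kx_{i,k}\right| \le \epsilon \ \forall i, \qquad \text{ridge: } \ \underset{\textbf{w}}{\text{min}} \ \sum_{i=1}^I\left(y_i-\sum_{k=1}^Kw_kx_{i,k}\right)^2 \ \text{ s.t. } \ ||\textbf{w}||^2 \le \delta,\] where \(\delta\) plays the role of the penalization intensity.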
The models laid out in this section are a preview of the universe of SVM engines and several other formulations have been developed. One reference library that is coded in C and C++ is LIBSVM and it is widely used by many other programming languages. The interested reader can have a look at the corresponding article Chang and Lin (2011) for more details on the SVM zoo (a more recent November 2019 version is also available online).
8.3 Practice
In R, the LIBSVM library is used by several packages. One of them, e1071, is a good choice because it also nests many other interesting functions, especially a naive Bayes classifier that we will see later on.
In line with the LIBSVM implementation, the package requires the label and the features to be specified separately. For this reason, we recycle the variables used for the boosted trees. Moreover, since training is slow, we perform it on a subsample of these sets (the first thousand instances).
library(e1071)
fit_svm <- svm(y = train_label_xgb[1:1000],         # Train label
               x = train_features_xgb[1:1000,],     # Training features
               type = "eps-regression",             # SVM task type (see LIBSVM documentation)
               kernel = "radial",                   # SVM kernel (or: linear, polynomial, sigmoid)
               epsilon = 0.1,                       # Width of strip for errors
               gamma = 0.5,                         # Constant in the radial kernel
               cost = 0.1)                          # Slack variable penalisation
test_feat_short <- dplyr::select(testing_sample,features_short)
mean((predict(fit_svm, test_feat_short) - testing_sample$R1M_Usd)^2)  # MSE
mean(predict(fit_svm, test_feat_short) * testing_sample$R1M_Usd > 0)  # Hit ratio
## [1] 0.03839085
## [1] 0.5222197
The results are slightly better than those of the boosted trees. All parameters are completely arbitrary, especially the choice of the kernel. We finally turn to a classification example.
fit_svm_C <- svm(y = training_sample$R1M_Usd_C[1:1000],    # Train label
                 x = training_sample[1:1000,] %>%
                     dplyr::select(features),              # Training features
                 type = "C-classification",                # SVM task type (see LIBSVM doc.)
                 kernel = "sigmoid",                       # SVM kernel
                 gamma = 0.5,                              # Parameter in the sigmoid kernel
                 coef0 = 0.3,                              # Parameter in the sigmoid kernel
                 cost = 0.2)                               # Slack variable penalisation
mean(predict(fit_svm_C,
             dplyr::select(testing_sample, features)) == testing_sample$R1M_Usd_C) # Accuracy
## [1] 0.5008973
Both the small training sample and the arbitrariness in our choice of the parameters may explain why the predictive accuracy is so poor.
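One way to reduce this arbitrariness is a small grid search over the hyperparameters. The e1071 package ships the tune.svm wrapper, which cross-validates over candidate values; the sketch below (with illustrative candidate grids, not run here) searches over the sigmoid kernel parameters and the cost.

tune_svm_C <- tune.svm(x = dplyr::select(training_sample[1:1000,], features),
                       y = training_sample$R1M_Usd_C[1:1000],
                       type = "C-classification",
                       kernel = "sigmoid",
                       gamma = c(0.1, 0.5, 1),       # Candidate kernel scales
                       coef0 = c(0, 0.3, 1),         # Candidate kernel offsets
                       cost = c(0.1, 0.2, 1))        # Candidate error costs
tune_svm_C$best.parameters                           # Values retained by cross-validation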
8.4 Coding exercises
- From the simple example shown above, extend SVM models to other kernels and discuss the impact on the fit.
- Train a vanilla SVM model with labels being the 12-month forward (i.e., future) return and evaluate it on the testing sample. Do the same with a simple random forest. Compare.