The kernel trick

A support vector machine can find values of \mathbf{w} and b such that the hyperplane (i.e. a line, but in more dimensions) given by

\mathbf{w}^\top \mathbf{x} + b = 0

separates a (linearly separable) dataset such that all points above the line are in class A and all points below the line are in class B. The key feature of the SVM is that, of the many possible values of \mathbf{w} we could choose, the SVM will pick the one which maximises the “margin” (that is, the distance between the hyperplane and the closest training points on either side).
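To make this concrete, here’s a minimal sketch using scikit-learn’s SVC with a linear kernel; the toy data, labels and the large C value (to approximate a hard margin) are my own choices for illustration, not anything from the post.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable blobs: class A (label +1) and class B (label -1).
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 2.5],   # class A
              [4.0, 0.5], [5.0, 1.0], [4.5, 0.0]])  # class B
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates a hard margin: the fitted hyperplane is the one
# that maximises the distance to the closest points of each class.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", b =", b)
print(np.sign(X @ w + b))  # matches y: every point falls on the correct side
```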

Of course, this only works if our data is linearly separable. It’s pretty hard to separate inseparable data into two classes, but we can certainly relax the assumption that the boundary between the classes is linear in the original space. One way we can do this is to use a function \phi which maps our data from a space where it is not linearly separable to one where it is.
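For instance (my own toy example, not from the post), two concentric rings of points can’t be separated by a line in the plane, but a \phi that appends the squared radius as a third coordinate makes them separable by a flat plane:

```python
import numpy as np

def phi(x):
    # Map R^2 -> R^3 by appending the squared distance from the origin.
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

inner_ring = np.array([[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0]])  # class A
outer_ring = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])  # class B

# In phi-space the third coordinate alone separates the two classes,
# e.g. with the plane z = 1.
print([phi(x)[2] for x in inner_ring])  # all 0.25 (below the plane)
print([phi(x)[2] for x in outer_ring])  # all 4.0  (above the plane)
```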

The problem now is that the space we map our data into usually has a lot of dimensions. This makes training our model much more expensive, because the decision boundary is now

\mathbf{w}^\top \phi(\mathbf{x}) + b = 0

and the weight vector \mathbf{w} has to live in that high-dimensional space, so it gets a lot larger.
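To get a feel for how quickly this blows up (a rough, illustrative calculation, not a figure from the post): the common degree-d polynomial feature map on n input features has \binom{n+d}{d} coordinates, so even cubic features get expensive fast.

```python
from math import comb

# Number of coordinates of the degree-3 polynomial feature map
# (all monomials of degree <= 3, constant term included).
for n in (10, 100, 1000):
    print(n, comb(n + 3, 3))  # 286, 176851, 167668501
```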

There’s a clever trick we can use to solve this called the “kernel trick”. The key idea here is that our weights (i.e. \mathbf{w} ) are a linear combination of the training points. This means that we can write (see here for a proof)

\mathbf{w} = \sum_{i=1}^{m} \alpha_i \mathbf{x}_i

where each \alpha_i is to be determined (usually by gradient descent), and each \mathbf{x}_i is a member of the training dataset. Therefore we can write the classification problem as

b + \left\langle \mathbf{x}, \sum_{i=1}^{m} \alpha_i \mathbf{x}_i \right\rangle = b + \sum_{i=1}^{m} \alpha_i \langle \mathbf{x}, \mathbf{x}_i \rangle

This is useful because we have rewritten our model purely in terms of inner products involving the training data. To work with our data in \phi -space we just replace each \langle \mathbf{x}, \mathbf{x}_i \rangle with a kernel function k(\mathbf{x}, \mathbf{x}_i) = \langle \phi(\mathbf{x}), \phi(\mathbf{x}_i) \rangle, which for many useful choices of \phi can be computed directly from \mathbf{x} and \mathbf{x}_i, without ever explicitly mapping the dataset into the new space.
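As a sketch of what this looks like in code (the alphas, b and training points below are placeholder values, not the output of any real training run):

```python
import numpy as np

def decision(x, X_train, alphas, b, kernel):
    # b + sum_i alpha_i * k(x, x_i): the kernelised form of w^T phi(x) + b.
    return b + sum(a * kernel(x, x_i) for a, x_i in zip(alphas, X_train))

# With the plain dot product as the kernel this is exactly the expression above;
# swapping in a different kernel changes the feature space without changing the code.
dot = lambda u, v: float(np.dot(u, v))

X_train = np.array([[1.0, 2.0], [3.0, 1.0]])
alphas = [0.5, -0.25]  # hypothetical coefficients, for illustration only
b = 0.1
print(decision(np.array([2.0, 2.0]), X_train, alphas, b, dot))
```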

This has the added advantage that we can also work with more funky inner products (for example ones defined by an integral, or ones whose corresponding \phi -space is infinite-dimensional).
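One common example (my addition, not the post’s): the RBF (Gaussian) kernel corresponds to an infinite-dimensional \phi, yet the inner product itself is a one-liner.

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    # k(u, v) = exp(-gamma * ||u - v||^2): an inner product in an
    # infinite-dimensional feature space, computed in closed form.
    return float(np.exp(-gamma * np.sum((u - v) ** 2)))

print(rbf(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # exp(-2) ≈ 0.135
```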

