The hyperplane in the SVM plays the role of the decision boundary: it is used to separate two groups of examples from one another. As such, it has to be as far from each group as possible.
On the other hand, the hyperplane in linear regression is chosen to be as close to all training examples as possible.
The hyperplane satisfies the following constraints:
- $wx_i - b \geq +1$ if $y_i = +1$, and
- $wx_i - b \leq -1$ if $y_i = -1$.
We also want to minimize $\|w\|$, because the smaller the norm of $w$, the larger the margin between the hyperplane and the closest examples of each class.
The optimization problem for the SVM is therefore: minimize $\|w\|$ subject to $y_i(wx_i - b) \geq 1$ for $i = 1, \ldots, N$, where $N$ is the number of training examples.
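To make the constraints and the objective concrete, here is a minimal NumPy sketch (not part of the original formulation) that checks $y_i(wx_i - b) \geq 1$ on a made-up toy dataset for a hypothetical choice of $w$ and $b$, and reports the resulting margin width $2/\|w\|$:

```python
# Illustrative sketch: the data, w, and b below are arbitrary assumptions.
import numpy as np

# Toy linearly separable data: two examples per class.
X = np.array([[2.0, 2.0], [3.0, 3.0],      # positive class
              [-2.0, -2.0], [-3.0, -1.0]])  # negative class
y = np.array([+1, +1, -1, -1])

# Hypothetical separating hyperplane parameters.
w = np.array([0.5, 0.5])
b = 0.0

# The constraint y_i (w x_i - b) >= 1 must hold for every training example.
scores = y * (X @ w - b)
print("constraints satisfied:", bool(np.all(scores >= 1)))

# The distance between the two margin hyperplanes is 2 / ||w||,
# so minimizing ||w|| maximizes the margin.
print("margin width:", 2 / np.linalg.norm(w))
```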
To extend the SVM to cases in which the data is not linearly separable, we introduce the hinge loss function:
$\max\bigl(0, 1 - y_i(wx_i - b)\bigr)$
The hinge loss function is zero if the constraints above are satisfied, in other words, if $wx_i$ lies on the correct side of the decision boundary; for an example on the wrong side, its value is proportional to the distance from the decision boundary.
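As a small illustration (not from the original text), the hinge loss for a single example can be written in a few lines; the data points, $w$, and $b$ below are arbitrary values chosen only to show the three regimes (constraint satisfied, inside the margin, wrong side):

```python
import numpy as np

def hinge_loss(w, b, x_i, y_i):
    """Zero when y_i (w x_i - b) >= 1, i.e. when the margin constraint holds;
    otherwise grows linearly with the distance to the correct side."""
    return max(0.0, 1.0 - y_i * (np.dot(w, x_i) - b))

w, b = np.array([0.5, 0.5]), 0.0
print(hinge_loss(w, b, np.array([2.0, 2.0]), +1))   # 0.0: constraint satisfied
print(hinge_loss(w, b, np.array([0.5, 0.5]), +1))   # 0.5: inside the margin
print(hinge_loss(w, b, np.array([2.0, 2.0]), -1))   # 3.0: wrong side, penalized
```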
We then wish to minimize the following cost function:
$C\|w\|^2 + \frac{1}{N}\sum_{i=1}^{N}\max\bigl(0, 1 - y_i(wx_i - b)\bigr),$
where the hyperparameter $C$ determines the tradeoff between increasing the size of the margin and ensuring that each $x_i$ lies on the correct side of the decision boundary.
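The following sketch (my own illustration, not a reference implementation) computes this cost and minimizes it with plain subgradient descent; the toy dataset, learning rate, and number of steps are arbitrary assumptions rather than recommended settings:

```python
import numpy as np

def svm_cost(w, b, X, y, C):
    """Soft-margin cost: C * ||w||^2 + mean hinge loss over the training set."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w - b))
    return C * np.dot(w, w) + hinge.mean()

def train_svm(X, y, C, lr=0.01, steps=5000):
    """Minimize the cost with a plain subgradient-descent loop."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = 1.0 - y * (X @ w - b)
        active = margins > 0  # examples that violate the margin constraint
        # Subgradient of the cost with respect to w and b.
        grad_w = 2 * C * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny made-up dataset, used only to show the interface.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])
w, b = train_svm(X, y, C=0.01)
print("cost:", svm_cost(w, b, X, y, C=0.01))
print("margin width:", 2 / np.linalg.norm(w))
```

In practice SVMs are usually trained with specialized quadratic-programming or coordinate-descent solvers; the loop above only illustrates the shape of the objective.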
SVMs that optimize the hinge loss are called soft-margin SVMs, while the original formulation is referred to as a hard-margin SVM.
As you can see, for sufficiently high values of $C$, the second term in the cost function becomes negligible, so the SVM algorithm will simply try to find the largest margin, completely ignoring misclassification. As we decrease the value of $C$, making classification errors becomes more costly, so the SVM algorithm will try to make fewer mistakes at the expense of margin size. As we have already discussed, a larger margin is better for generalization. Therefore, $C$ regulates the tradeoff between classifying the training data well (minimizing empirical risk) and classifying future examples well (generalization).
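To see this tradeoff in practice, one can train a linear SVM at different regularization settings, for example with scikit-learn (assuming it is installed). Note that scikit-learn places its C parameter on the hinge-loss term rather than on $\|w\|^2$, so it plays the inverse role of the $C$ used above: a large scikit-learn C punishes mistakes heavily and shrinks the margin, while a small one favors a wider margin.

```python
# Illustrative sketch; the synthetic data and the two C values are arbitrary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, scale=1.5, size=(50, 2)),
               rng.normal(loc=-2.0, scale=1.5, size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

for C in (100.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_.ravel()
    print(f"C={C}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"training accuracy = {clf.score(X, y):.2f}")
```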