# Introduction to Support Vector Machines


Andreas Maletti, Technische Universität Dresden, Fakultät Informatik

June 15, 2006

1. The Problem
2. The Basics
3. The Proposed Solution

## Learning by Machines

• Rote Learning: memorization (hash tables)
• Reinforcement: feedback at end (Q-learning [Watkins 89])
• Induction: generalizing examples (ID3 [Quinlan 79])
• Clustering: grouping data (CMLIB [Hartigan 75])
• Analogy: representation similarity (JUPA [Yvon 94])
• Discovery: unsupervised, no goal
• Genetic Algorithms: simulated evolution (GABIL [DeJong 93])

## Supervised Learning

Definition (Supervised Learning): given nontrivial training data (labels known), predict test data (labels unknown).

Implementations
• Rote Learning: hash tables
• Clustering: Nearest Neighbor [Cover, Hart 67]
• Induction: Neural Networks [McCulloch, Pitts 43], Decision Trees [Hunt 66], SVMs [Vapnik et al 92]

## Problem Description—General

Problem: classify a given input.
• binary classification: two classes
• multi-class classification: several, but finitely many classes
• regression: infinitely many classes

Major Applications
• Handwriting recognition
• Cheminformatics (Quantitative Structure–Activity Relationship)
• Pattern recognition
• Spam detection (HP Labs, Palo Alto)

## Problem Description—Specific

Electricity Load Prediction Challenge 2001
• Power plant that supports the energy demand of a region
• Excess production is expensive
• Load varies substantially
• Challenge won by libSVM [Chang, Lin 06]

Problem
• given: load and temperature for 730 days (≈ 70 kB of data)
• predict: load for the next 365 days

*(Figure omitted: example data — load for 1997 at 12:00 and 24:00, plotted against day of year.)*

## Problem Description—Formal

Definition (cf. [Lin 01]): Given a training set $S \subseteq \mathbb{R}^n \times \{-1, 1\}$ of correctly classified input data vectors $\vec{x} \in \mathbb{R}^n$, where
• every input data vector appears at most once in $S$, and
• there exist input data vectors $\vec{p}$ and $\vec{n}$ such that $(\vec{p}, 1) \in S$ as well as $(\vec{n}, -1) \in S$ (non-triviality),

successfully classify unseen input data vectors.

## Linear Classification [Vapnik 63]

• Given: a training set $S \subseteq \mathbb{R}^n \times \{-1, 1\}$
• Goal: find a hyperplane that separates $\mathbb{R}^n$ into two halves, each containing only elements of one class

## Representation of a Hyperplane

Definition: a hyperplane is the set of points $\vec{x}$ with $\vec{n} \cdot (\vec{x} - \vec{x}_0) = 0$, where
• $\vec{n} \in \mathbb{R}^n$ is the weight vector,
• $\vec{x} \in \mathbb{R}^n$ is the input vector,
• $\vec{x}_0 \in \mathbb{R}^n$ is the offset.

Alternatively: $\vec{w} \cdot \vec{x} + b = 0$.

## Decision Function

• training set $S = \{(\vec{x}_i, y_i) \mid 1 \le i \le k\}$
• separating hyperplane $\vec{w} \cdot \vec{x} + b = 0$ for $S$

Decision:

$$\vec{w} \cdot \vec{x}_i + b \begin{cases} > 0 & \text{if } y_i = 1 \\ < 0 & \text{if } y_i = -1 \end{cases}$$

$$f(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x} + b)$$
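The decision function is just the sign of an affine form; a minimal sketch (assuming NumPy; the weight vector and offset are illustrative values, not learned from data):

```python
import numpy as np

def f(w, b, x):
    """Linear decision function f(x) = sgn(w . x + b)."""
    return int(np.sign(np.dot(w, x) + b))

# Illustrative hyperplane x_1 + x_2 - 1 = 0 (not learned from data)
w, b = np.array([1.0, 1.0]), -1.0
print(f(w, b, np.array([2.0, 2.0])))  # 1: positive side of the hyperplane
print(f(w, b, np.array([0.0, 0.0])))  # -1: negative side
```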

## Learning the Hyperplane

Problem
• Given: training set $S$
• Goal: coefficients $\vec{w}$ and $b$ of a separating hyperplane
• Difficulty: several or no candidates for $\vec{w}$ and $b$

Solution (cf. Vapnik's statistical learning theory): select admissible $\vec{w}$ and $b$ with maximal margin (minimal distance to any input data vector).

Observation: we can scale $\vec{w}$ and $b$ such that

$$\vec{w} \cdot \vec{x}_i + b \begin{cases} \ge 1 & \text{if } y_i = 1 \\ \le -1 & \text{if } y_i = -1 \end{cases}$$

## Maximizing the Margin

• Closest points $\vec{x}_+$ and $\vec{x}_-$ (with $\vec{w} \cdot \vec{x}_\pm + b = \pm 1$)
• Distance between the hyperplanes $\vec{w} \cdot \vec{x} + b = \pm 1$:

$$\frac{(\vec{w} \cdot \vec{x}_+ + b) - (\vec{w} \cdot \vec{x}_- + b)}{\lVert \vec{w} \rVert} = \frac{2}{\lVert \vec{w} \rVert} = \frac{2}{\sqrt{\vec{w} \cdot \vec{w}}}$$

• Hence $\max_{\vec{w}, b} \frac{2}{\sqrt{\vec{w} \cdot \vec{w}}} \;\equiv\; \min_{\vec{w}, b} \frac{\vec{w} \cdot \vec{w}}{2}$

## Basic (Primal) Support Vector Machine Form

target: $\min_{\vec{w}, b} \; \frac{1}{2} (\vec{w} \cdot \vec{w})$

subject to: $y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1 \quad (i = 1, \dots, k)$

## Non-separable Data

Problem: a linear separating hyperplane may not exist!

Solution: allow training errors $\xi_i$, penalized by a large penalty parameter $C$.

Standard (Primal) Support Vector Machine Form

target: $\min_{\vec{w}, b, \vec{\xi}} \; \frac{1}{2} (\vec{w} \cdot \vec{w}) + C \sum_{i=1}^{k} \xi_i$

subject to: $y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad (i = 1, \dots, k)$

If $\xi_i > 1$, then $\vec{x}_i$ is misclassified.
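The soft-margin objective can also be minimized directly; a minimal subgradient-descent sketch on a toy data set (assuming NumPy; the step size, iteration count, and data are illustrative, and practical solvers work on the dual instead):

```python
import numpy as np

# Toy linearly separable training set
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C, lr = 1.0, 0.01

w, b = np.zeros(2), 0.0
for _ in range(1000):
    margins = y * (X @ w + b)
    violating = margins < 1          # points with positive hinge loss
    # Subgradient of (1/2) w.w + C * sum_i max(0, 1 - y_i (w.x_i + b))
    grad_w = w - C * (y[violating, None] * X[violating]).sum(axis=0)
    grad_b = -C * y[violating].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print((np.sign(X @ w + b) == y).mean())  # training accuracy on the toy set
```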

## Higher Dimensional Feature Spaces

Problem: the data are not separable because the target function is essentially nonlinear!

Approach: the data may be separable in a higher dimensional space.
• Map input vectors nonlinearly into a high dimensional space (the feature space)
• Perform the separation there

Literature
• Classic approach [Cover 65]
• "Kernel trick" [Boser, Guyon, Vapnik 92]
• Extension to soft margin [Cortes, Vapnik 95]

Example (cf. [Lin 01]): mapping $\phi$ from $\mathbb{R}^3$ into the feature space $\mathbb{R}^{10}$:

$$\phi(\vec{x}) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1 x_2, \sqrt{2}x_1 x_3, \sqrt{2}x_2 x_3)$$

## Standard (Primal) Support Vector Machine Form

Definition

target: $\min_{\vec{w}, b, \vec{\xi}} \; \frac{1}{2} (\vec{w} \cdot \vec{w}) + C \sum_{i=1}^{k} \xi_i$

subject to: $y_i (\vec{w} \cdot \phi(\vec{x}_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad (i = 1, \dots, k)$

Note that $\vec{w}$ is now a vector in a high dimensional space.

## How to Solve?

Problem: find $\vec{w}$ and $b$ from the standard SVM form.

Solution: solve via the Lagrangian dual [Bazaraa et al 93]:

$$\max_{\vec{\alpha} \ge 0, \vec{\pi} \ge 0} \; \Bigl( \min_{\vec{w}, b, \vec{\xi}} L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\pi}) \Bigr)$$

where

$$L(\vec{w}, b, \vec{\xi}, \vec{\alpha}, \vec{\pi}) = \frac{\vec{w} \cdot \vec{w}}{2} + C \sum_{i=1}^{k} \xi_i + \sum_{i=1}^{k} \alpha_i \bigl(1 - \xi_i - y_i (\vec{w} \cdot \phi(\vec{x}_i) + b)\bigr) - \sum_{i=1}^{k} \pi_i \xi_i$$

## Simplifying the Dual [Chen et al 03]

Standard (Dual) Support Vector Machine Form

target: $\min_{\vec{\alpha}} \; \frac{1}{2} (\vec{\alpha}^T Q \vec{\alpha}) - \sum_{i=1}^{k} \alpha_i$

subject to: $\vec{y} \cdot \vec{\alpha} = 0, \quad 0 \le \alpha_i \le C \quad (i = 1, \dots, k)$

where $Q_{ij} = y_i y_j \bigl(\phi(\vec{x}_i) \cdot \phi(\vec{x}_j)\bigr)$.

Solution: we obtain $\vec{w}$ as

$$\vec{w} = \sum_{i=1}^{k} \alpha_i y_i \phi(\vec{x}_i)$$

## Where is the Benefit?

• $\vec{\alpha} \in \mathbb{R}^k$ (dimension independent of the feature space)
• Only inner products in the feature space are needed

Kernel Trick
• Inner products are calculated efficiently on input vectors via a kernel $K$ with $K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j)$
• Select an appropriate feature space
• Avoid the nonlinear transformation into the feature space
• Benefit from the better separation properties of the feature space

## Kernels

Example: for the mapping into feature space $\phi : \mathbb{R}^3 \to \mathbb{R}^{10}$,

$$\phi(\vec{x}) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \dots, \sqrt{2}x_2 x_3),$$

the kernel is

$$K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j) = (1 + \vec{x}_i \cdot \vec{x}_j)^2.$$
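This identity is easy to check numerically; a small sketch (assuming NumPy) comparing the explicit feature map with the closed-form kernel:

```python
import numpy as np

def phi(x):
    """Explicit feature map R^3 -> R^10 for the degree-2 polynomial kernel."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], s * x[2],
                     x[0] ** 2, x[1] ** 2, x[2] ** 2,
                     s * x[0] * x[1], s * x[0] * x[2], s * x[1] * x[2]])

def K(xi, xj):
    """Closed-form degree-2 polynomial kernel (1 + xi . xj)^2."""
    return (1.0 + np.dot(xi, xj)) ** 2

xi, xj = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
print(np.dot(phi(xi), phi(xj)))  # ~ 1089.0
print(K(xi, xj))                 # 1089.0, since xi . xj = 32 and 33^2 = 1089
```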

Popular Kernels
• Gaussian radial basis function (the feature space is an infinite dimensional Hilbert space):
$$g(\vec{x}_i, \vec{x}_j) = \exp(-\gamma \lVert \vec{x}_i - \vec{x}_j \rVert^2)$$
• Polynomial:
$$g(\vec{x}_i, \vec{x}_j) = (\vec{x}_i \cdot \vec{x}_j + 1)^d$$
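Both kernels are one-liners; a sketch (assuming NumPy; the parameter values γ = 0.5 and d = 3 are illustrative):

```python
import numpy as np

def rbf(xi, xj, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||xi - xj||^2)."""
    diff = xi - xj
    return np.exp(-gamma * np.dot(diff, diff))

def poly(xi, xj, d=3):
    """Polynomial kernel (xi . xj + 1)^d."""
    return (np.dot(xi, xj) + 1.0) ** d

x = np.array([1.0, 2.0])
print(rbf(x, x))                      # 1.0: identical inputs have distance 0
print(poly(x, np.array([1.0, 0.0])))  # 8.0: (1*1 + 2*0 + 1)^3
```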

## The Decision Function

Observation
• There is no need for $\vec{w}$ explicitly, because
$$f(\vec{x}) = \operatorname{sgn}\bigl(\vec{w} \cdot \phi(\vec{x}) + b\bigr) = \operatorname{sgn}\Bigl(\sum_{i=1}^{k} \alpha_i y_i \bigl(\phi(\vec{x}_i) \cdot \phi(\vec{x})\bigr) + b\Bigr)$$
• Only the $\vec{x}_i$ with $\alpha_i > 0$ (the support vectors) are used.

Few points determine the separation: the borderline points are the support vectors.

## Support Vector Machines

Definition
• Given: kernel $K$ and training set $S$
• Goal: decision function $f$

target: $\min_{\vec{\alpha}} \; \frac{1}{2} (\vec{\alpha}^T Q \vec{\alpha}) - \sum_{i=1}^{k} \alpha_i$, where $Q_{ij} = y_i y_j K(\vec{x}_i, \vec{x}_j)$

subject to: $\vec{y} \cdot \vec{\alpha} = 0, \quad 0 \le \alpha_i \le C \quad (i = 1, \dots, k)$

decide: $f(\vec{x}) = \operatorname{sgn}\Bigl(\sum_{i=1}^{k} \alpha_i y_i K(\vec{x}_i, \vec{x}) + b\Bigr)$
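Given a solved $\vec{\alpha}$ and offset $b$, the decision function only touches the support vectors; a sketch (assuming NumPy; the support vectors, multipliers, and $b$ below are hand-picked for illustration, not the result of training):

```python
import numpy as np

def decide(support_vectors, alphas, labels, b, kernel, x):
    """f(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b), summed over support vectors."""
    total = sum(a * y * kernel(sv, x)
                for sv, a, y in zip(support_vectors, alphas, labels))
    return int(np.sign(total + b))

def linear(u, v):
    return float(np.dot(u, v))

# Hand-picked illustration (not a trained model): one support vector per class
svs    = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
alphas = [1.0, 1.0]
labels = [1.0, -1.0]
b      = 0.0

print(decide(svs, alphas, labels, b, linear, np.array([2.0, 0.0])))   # 1
print(decide(svs, alphas, labels, b, linear, np.array([-2.0, 0.0])))  # -1
```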

## Solving the Dual Problem

• Suppose $Q$ (a $k \times k$ matrix) is fully dense
• 70,000 training points yield 70,000 variables
• 70,000² · 4 B ≈ 19 GB: a huge problem
• Traditional methods (Newton, quasi-Newton) cannot be applied directly
• Current methods:
  • Decomposition [Osuna et al 97], [Joachims 98], [Platt 98]
  • Nearest point of two convex hulls [Keerthi et al 99]

## Sample Implementation

www.kernel-machines.org
• Main forum on kernel machines
• Lists over 250 active researchers
• 43 competing implementations

libSVM [Chang, Lin 06]
• Supports binary and multi-class classification and regression
• Beginner's Guide for SVM classification
• "Out of the box" system (automatic data scaling, parameter selection)
• Won the EUNITE and IJCNN challenges
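From Python, scikit-learn's `SVC` class is built on top of libSVM; a minimal usage sketch (assuming scikit-learn and NumPy are installed; the toy data and the values of `C` and `gamma` are illustrative, not tuned):

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: one cluster per class
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [2.0, 2.0], [2.1, 1.9], [1.8, 2.2]])
y = np.array([-1, -1, -1, 1, 1, 1])

# RBF-kernel SVM; in practice C and gamma come from cross-validation
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)

print(clf.predict([[0.1, 0.1], [2.0, 2.1]]))  # one query point per cluster
print(clf.support_vectors_.shape)             # support vectors found
```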

## Application Accuracy

Automatic training using libSVM:

| Application    | Training Data | Features | Classes | Accuracy |
|----------------|---------------|----------|---------|----------|
| Astroparticle  | 3,089         | 4        | 2       | 96.9%    |
| Bioinformatics | 391           | 20       | 3       | 85.2%    |
| Vehicle        | 1,243         | 21       | 2       | 87.8%    |

## References

Books
• Statistical Learning Theory (Vapnik). Wiley, 1998
• Advances in Kernel Methods—Support Vector Learning (Schölkopf, Burges, Smola). MIT Press, 1999
• An Introduction to Support Vector Machines (Cristianini, Shawe-Taylor). Cambridge Univ. Press, 2000
• Support Vector Machines—Theory and Applications (Wang). Springer, 2005

Seminal Papers
• A training algorithm for optimal margin classifiers (Boser, Guyon, Vapnik). COLT'92, ACM Press
• Support vector networks (Cortes, Vapnik). Machine Learning 20, 1995
• Fast training of support vector machines using sequential minimal optimization (Platt). In Advances in Kernel Methods, MIT Press, 1999
• Improvements to Platt's SMO algorithm for SVM classifier design (Keerthi, Shevade, Bhattacharyya, Murthy). Technical Report, 1999

Recent Papers
• A tutorial on ν-Support Vector Machines (Chen, Lin, Schölkopf). 2003
• Support Vector and Kernel Machines (Cristianini). ICML, 2001
• libSVM: A library for Support Vector Machines (Chang, Lin). System Documentation, 2006

## Sequential Minimal Optimization [Platt 98]

• Commonly used to solve the standard SVM form
• Decomposition method with the smallest working set, $|B| = 2$
• Subproblem solved analytically; no need for optimization software
• Contained flaws; modified version in [Keerthi et al 99]

Karush-Kuhn-Tucker (KKT) conditions of the dual (with $\vec{E} = (1, \dots, 1)$):

$$Q\vec{\alpha} - \vec{E} + b\vec{y} - \vec{\lambda} + \vec{\mu} = 0$$
$$\mu_i (C - \alpha_i) = 0, \quad \alpha_i \lambda_i = 0, \quad \vec{\mu} \ge 0, \quad \vec{\lambda} \ge 0$$

## Computing b

• The KKT conditions yield
$$(Q\vec{\alpha} - \vec{E} + b\vec{y})_i \begin{cases} \ge 0 & \text{if } \alpha_i < C \\ \le 0 & \text{if } \alpha_i > 0 \end{cases}$$
• Let $F_i(\vec{\alpha}) = \sum_{j=1}^{k} \alpha_j y_j K(\vec{x}_i, \vec{x}_j) - y_i$ and
$$\begin{aligned}
I_0 &= \{i \mid 0 < \alpha_i < C\} \\
I_1 &= \{i \mid y_i = 1, \alpha_i = 0\} \\
I_2 &= \{i \mid y_i = -1, \alpha_i = C\} \\
I_3 &= \{i \mid y_i = 1, \alpha_i = C\} \\
I_4 &= \{i \mid y_i = -1, \alpha_i = 0\}
\end{aligned}$$
• A case analysis on $y_i$ yields bounds on $b$:
$$\max\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_3 \cup I_4\} \;\le\; b \;\le\; \min\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_1 \cup I_2\}$$

## Working Set Selection

Observation (see [Keerthi et al 99]): $\vec{\alpha}$ is not an optimal solution iff

$$\max\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_3 \cup I_4\} > \min\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_1 \cup I_2\}$$

Approach: select the working set $B = \{i, j\}$ with

$$i \equiv \arg\max_m \{F_m(\vec{\alpha}) \mid m \in I_0 \cup I_3 \cup I_4\}, \quad j \equiv \arg\min_m \{F_m(\vec{\alpha}) \mid m \in I_0 \cup I_1 \cup I_2\}$$
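The index sets and the maximal-violating-pair selection translate directly into code; a sketch (assuming NumPy; the kernel matrix, α, y, and C below are illustrative):

```python
import numpy as np

def working_set(alpha, y, K, C, eps=1e-12):
    """Return the maximal violating pair (i, j), or None if alpha is optimal."""
    F = K @ (alpha * y) - y      # F_i = sum_j alpha_j y_j K(x_i, x_j) - y_i
    free = (alpha > eps) & (alpha < C - eps)                # I_0
    lower = (free | ((y == 1) & (alpha >= C - eps))
                  | ((y == -1) & (alpha <= eps)))           # I_0 u I_3 u I_4
    upper = (free | ((y == 1) & (alpha <= eps))
                  | ((y == -1) & (alpha >= C - eps)))       # I_0 u I_1 u I_2
    i = max(np.where(lower)[0], key=lambda m: F[m])
    j = min(np.where(upper)[0], key=lambda m: F[m])
    if F[i] <= F[j]:             # optimal: max over lower set <= min over upper
        return None
    return int(i), int(j)

# Illustrative example with a linear kernel and alpha = 0
X = np.array([[1.0], [-1.0], [2.0]])
K = X @ X.T
y = np.array([1.0, -1.0, 1.0])
alpha = np.zeros(3)
print(working_set(alpha, y, K, 1.0))  # (1, 0): the maximal violating pair
```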

## The Subproblem

Definition: let $B = \{i, j\}$ and $N = \{1, \dots, k\} \setminus B$.
• $\vec{\alpha}_B = \begin{pmatrix} \alpha_i \\ \alpha_j \end{pmatrix}$ and $\vec{\alpha}_N = \vec{\alpha}|_N$ (similarly for matrices)

B-Subproblem

target:
$$\min_{\vec{\alpha}_B} \; \frac{\vec{\alpha}_B^T Q_{BB} \vec{\alpha}_B}{2} + \sum_{b \in B} \alpha_b Q_{b,N} \vec{\alpha}_N - \sum_{b \in B} \alpha_b$$

subject to: $\vec{y} \cdot \vec{\alpha} = 0, \quad 0 \le \alpha_i, \alpha_j \le C$

## Final Solution

• Note that $-y_i \alpha_i = \vec{y}_N \cdot \vec{\alpha}_N + y_j \alpha_j$
• Substituting $\alpha_i = -y_i (\vec{y}_N \cdot \vec{\alpha}_N + y_j \alpha_j)$ into the target yields a one-variable optimization problem
• It can be solved analytically (cf., e.g., [Lin 01])
• Iterate (yielding a new $\vec{\alpha}$) until, for a small tolerance $\epsilon > 0$,
$$\max\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_3 \cup I_4\} \;\le\; \min\{F_i(\vec{\alpha}) \mid i \in I_0 \cup I_1 \cup I_2\} + \epsilon$$