Support Vector Machines using Kernels

Intelligent Systems: Reasoning and Recognition
James L. Crowley
ENSIMAG 2 and MoSIG M1
Lesson 4
Winter Semester 2017
10 February 2017

Contents

Kernel Functions
    Definition
    Radial Basis Function (RBF)
    Kernel Functions for Symbolic Data
Support Vector Machines with Kernels
Soft Margin SVM's - Non-separable training data

Sources:
"Neural Networks for Pattern Recognition", C. M. Bishop, Oxford Univ. Press, 1995.
"A Computational Biology Example using Support Vector Machines", Suzy Fei, 2009 (available on line).

Support Vector Machines with Kernel Functions

Kernel Functions

Linear discriminant functions can provide very efficient 2-class and multi-class classifiers, provided that the class features can be separated by a linear decision surface. For many domains, it is possible to find a "kernel" function that transforms the data into a space where the two classes are separate.

Instead of a decision surface:

$$ g(\vec{X}) = \vec{W}^T \vec{X} + b $$

we will use a decision surface of the form:

$$ g(\vec{X}) = \vec{W}^T f(\vec{X}) + b $$

where

$$ \vec{W} = \sum_{m=1}^{M} a_m y_m f(\vec{X}_m) $$

is learned from the transformed training data, and $a_m$ is a coefficient learned from the training data that is $a_m > 0$ for support vectors and 0 for all others.

The function $f(\vec{X})$ provides an implicit non-linear decision surface for the original data.

Definition

Formally, a kernel function is any function $K(\vec{Z}, \vec{X})$

$$ K : X \times X \rightarrow \mathbb{R} $$

that satisfies "Mercer's condition". Essentially, Mercer's condition tells whether a function is a vector product in some space. (The definition of this condition is beyond the scope of this class. See Wikipedia for a discussion of Mercer's condition if you are curious.)

Mercer's condition tells whether, for $K(\vec{Z}, \vec{X})$, there exists a function $f(\vec{Z})$ such that

$$ K(\vec{Z}, \vec{X}) = \langle f(\vec{Z}), f(\vec{X}) \rangle $$

Obviously, Mercer's condition is satisfied by inner products (dot products):

$$ K(\vec{Z}, \vec{X}) = \vec{Z}^T \vec{X} = \langle \vec{Z}, \vec{X} \rangle = \sum_{d=1}^{D} z_d x_d $$

Thus $K(\vec{W}, \vec{X}) = \vec{W}^T \vec{X}$ is a valid (but trivial) kernel function, known as the linear kernel.

We can learn the discriminant in an inner product space $K(\vec{Z}, \vec{X}) = f(\vec{Z})^T f(\vec{X})$, where $\vec{W}$ will be learned from the training data.

This will give us

$$ g(\vec{X}) = \vec{W}^T f(\vec{X}) + b $$

Note that Mercer's condition can be satisfied by many other functions. Popular kernel functions include:
• Polynomial Kernels
• Radial Basis Functions
• Fisher Kernels
• Text intersection
• Bayesian Kernels

Kernel functions provide an implicit feature space. We will see that we can learn in the kernel space, and then recognize without explicitly computing the position in this implicit space!
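As an illustration of a kernel acting as an inner product in an implicit feature space, the sketch below compares a polynomial kernel with an explicit feature map. The choice of the homogeneous degree-2 kernel $K(\vec{Z}, \vec{X}) = (\vec{Z}^T \vec{X})^2$ on 2-D inputs, and the names `poly2_kernel` and `poly2_features`, are assumptions made for this example only.

```python
import math

def dot(z, x):
    """Inner product <z, x> of two equal-length sequences."""
    return sum(zd * xd for zd, xd in zip(z, x))

def poly2_kernel(z, x):
    """Homogeneous polynomial kernel of degree 2: K(z, x) = (z^T x)^2."""
    return dot(z, x) ** 2

def poly2_features(x):
    """Explicit feature map f(x) for 2-D input such that K(z, x) = <f(z), f(x)>."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

z, x = (1.0, 2.0), (3.0, -1.0)   # arbitrary illustrative points
k_implicit = poly2_kernel(z, x)                          # computed in the input space
k_explicit = dot(poly2_features(z), poly2_features(x))   # computed in the feature space
print(k_implicit, k_explicit)
```

Both values agree up to floating-point rounding, although `poly2_kernel` never computes the position of either point in the 3-D feature space.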

Radial Basis Function (RBF)

Radial functions of the form $f(\|\vec{X} - \vec{X}_n\|)$ are popular for use as kernel functions. In this case, each support vector acts as a dimension in the new feature space.

The RBF is a function $f(\cdot) : \mathbb{R}^N \rightarrow \mathbb{R}$. Typically, the function is used with the Euclidean norm $\|\cdot\|$ between a set of points to provide an approximation/interpolation of the form:

$$ s(\vec{X}) = \sum_{n=1}^{N} W_n f(\|\vec{X} - \vec{X}_n\|) $$

where $W_n$ and $\vec{X}_n$ are learned from training data.

This can be used to learn a discriminant function:

$$ g(\vec{X}) = \sum_{n=1}^{N} W_n f(\|\vec{X} - \vec{X}_n\|) $$

where the N points $\vec{X}_n$ can be derived from the training data.

The term $\|\vec{X} - \vec{X}_n\|$ is the Euclidean distance from each of the points $\{\vec{X}_n\}$.

The distance can be normalized by dividing by a value σ:

$$ g(\vec{X}) = \sum_{n=1}^{N} W_n f\left(\frac{\|\vec{X} - \vec{X}_n\|}{\sigma}\right) \quad \text{or} \quad g(\vec{X}) = \sum_{n=1}^{N} W_n f\left(\frac{\|\vec{X} - \vec{X}_n\|^2}{2\sigma^2}\right) $$

The vectors $\vec{X}_n$ act as center points for defining bases. The sigma parameter acts as a smoothing parameter that determines the influence of each of the basis vectors, $\vec{X}_n$. The zero-crossings of $g(\vec{X})$ define the decision surface.

The Gaussian function is a popular Radial Basis Function, and is often used as a kernel for support vector machines:

$$ f(\|\vec{x} - \vec{c}\|) = e^{-\frac{\|\vec{x} - \vec{c}\|^2}{2\sigma^2}} $$
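The Gaussian radial basis function can be sketched in a few lines of plain Python. The function names `gaussian_rbf` and `gram_matrix` and the three sample points are hypothetical, chosen only to show how the kernel value depends on the distance between a point and a center.

```python
import math

def gaussian_rbf(x, c, sigma):
    """Gaussian radial basis function exp(-||x - c||^2 / (2 sigma^2))."""
    sq_dist = sum((xd - cd) ** 2 for xd, cd in zip(x, c))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(points, sigma):
    """Kernel values K[i][j] = f(||X_i - X_j||) for all pairs of points."""
    return [[gaussian_rbf(p, q, sigma) for q in points] for p in points]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # illustrative center points
K = gram_matrix(points, sigma=1.0)
for row in K:
    print(["%.4f" % value for value in row])
```

The diagonal entries are 1 (zero distance), and the entries decrease toward 0 as the distance between two points grows relative to σ.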

We can use a subset of the training data to define a discriminant function, where the support vectors are drawn from the M training samples. This gives a discriminant function

$$ g(\vec{X}) = \sum_{m=1}^{M} a_m y_m f(\|\vec{X} - \vec{X}_m\|) + b $$

The training samples for which $a_m \ne 0$ are the support vectors.

The distance can be normalized by dividing by σ:

$$ g(\vec{X}) = \sum_{m=1}^{M} a_m y_m f\left(\frac{\|\vec{X} - \vec{X}_m\|}{\sigma}\right) + b $$

Depending on σ, this can provide a good fit or an overfit to the data. If σ is large compared to the distance between the classes, this can give an overly flat discriminant surface. If σ is small compared to the distance between classes, this will overfit the samples. A good choice for σ will be comparable to the distance between the closest members of the two classes.
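The influence of σ can be seen on a toy 1-D problem. This is a minimal sketch, assuming a Gaussian basis function, one sample per class at x = -1 and x = +1, and hand-picked coefficients $a_m = 1$ and b = 0 (these values are not the result of training).

```python
import math

def rbf_discriminant(x, centers, labels, coeffs, sigma, b=0.0):
    """g(x) = sum_m a_m y_m exp(-(x - x_m)^2 / (2 sigma^2)) + b for scalar x."""
    return b + sum(
        a * y * math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))
        for c, y, a in zip(centers, labels, coeffs)
    )

centers = [-1.0, 1.0]   # one sample per class, 2 units apart
labels = [-1, 1]
coeffs = [1.0, 1.0]     # hypothetical a_m, not learned

for sigma in (0.1, 2.0, 100.0):
    g = rbf_discriminant(0.5, centers, labels, coeffs, sigma)
    print(f"sigma={sigma:6.1f}  g(0.5)={g:.6f}")
```

At x = 0.5, which lies on the positive side, g is close to zero for σ = 0.1 (the response is concentrated on the samples) and for σ = 100 (the surface is nearly flat), and clearly positive for σ = 2, the distance between the two samples.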

(Images from "A Computational Biology Example using Support Vector Machines", Suzy Fei, 2009.)

Each Radial Basis Function is a dimension in a high dimensional basis space.

Kernel Functions for Symbolic Data

Kernel functions can be defined over graphs, sets, strings and text!

Consider, for example, a non-vector space composed of a set of words {W}. We can select a subset of discriminant words {S} ⊂ {W}. Now, given a set of words (a probe) {A} ⊂ {W}, we can define a kernel function of A and S using the intersection operation:

$$ k(A, S) = 2^{|A \cap S|} $$

where | . | denotes the cardinality (the number of elements) of a set.
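The intersection kernel maps directly onto Python sets. In this small sketch the function name `intersection_kernel` and the word lists are hypothetical.

```python
def intersection_kernel(a, s):
    """k(A, S) = 2^{|A ∩ S|} for two sets of words."""
    return 2 ** len(set(a) & set(s))

discriminant_words = {"kernel", "margin", "vector"}          # hypothetical subset S
probe = {"the", "margin", "of", "a", "support", "vector"}    # hypothetical probe A

# The probe shares two words ("margin", "vector") with S, so k(A, S) = 2^2.
print(intersection_kernel(probe, discriminant_words))   # 4
```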


Support Vector Machines with Kernels

Let us assume training data composed of M training samples $\{\vec{X}_m\}$ and their indicator variables $\{y_m\}$, where $y_m$ is -1 or +1.

We will seek a linear decision surface $g(\vec{X}) = \vec{W}^T f(\vec{X}) + b$ such that the training data fall into two separable classes. That is

$$ \forall m : y_m (\vec{W}^T f(\vec{X}_m) + b) > 0 $$

If we assume that the data is separable, then for all training samples:

$$ y_m g(\vec{X}_m) > 0 $$

For any training sample $\vec{X}_m$ the perpendicular distance to the decision surface is:

$$ d_m = \frac{y_m g(\vec{X}_m)}{\|\vec{W}\|} = \frac{y_m (\vec{W}^T f(\vec{X}_m) + b)}{\|\vec{W}\|} $$

The margin is the smallest distance from the decision surface:

$$ \gamma = \min_m \{ d_m \} = \frac{1}{\|\vec{W}\|} \min_m \{ y_m (\vec{W}^T f(\vec{X}_m) + b) \} $$
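The distance and margin computations can be checked numerically. The sketch below assumes the identity feature map f(X) = X, four hand-made 2-D samples, and a hypothetical decision surface $x_1 + x_2 = 0$.

```python
import math

def g(x, w, b):
    """Linear discriminant g(x) = w^T f(x) + b, with f taken as the identity."""
    return sum(wd * xd for wd, xd in zip(w, x)) + b

def margin(samples, labels, w, b):
    """Smallest signed distance y_m g(x_m) / ||w|| over the training samples."""
    norm_w = math.sqrt(sum(wd * wd for wd in w))
    return min(y * g(x, w, b) / norm_w for x, y in zip(samples, labels))

samples = [(2.0, 2.0), (3.0, 0.0), (-1.0, -1.0), (0.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 1.0), 0.0   # hypothetical surface x1 + x2 = 0
gamma = margin(samples, labels, w, b)
print(gamma)
```

The closest sample is (-1, -1), at distance √2 from the surface, so this sample determines the margin.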

For a decision surface $(\vec{W}, b)$, the support vectors are the subset $\{\vec{X}_s\}$ of the training samples, $\{\vec{X}_s\} \subset \{\vec{X}_m\}$, that lie on the margin, $\gamma$.

Our problem is to choose the support vectors $\{\vec{X}_s\} \subset \{\vec{X}_m\}$ that maximize the margin.

We will seek to maximize the margin by finding the $\{\vec{X}_s\}$ training samples that maximize:

$$ (\vec{W}, b) = \arg\max_{\vec{W}, b} \left\{ \frac{1}{\|\vec{W}\|} \min_m \{ y_m (\vec{W}^T f(\vec{X}_m) + b) \} \right\} $$

The factor $\frac{1}{\|\vec{W}\|}$ can be taken outside the minimization over m because $\vec{W}$ does not depend on m.

Direct solution can be difficult because we do not always know how many support vectors will be required, notably with radial basis functions as kernels.

Fortunately the problem can be converted to an equivalent problem. Note that rescaling $\vec{W}$ and b by a common positive factor does not change the decision surface. Thus we will scale the equation such that for the sample that is closest to the decision surface (smallest margin):

$$ y_m (\vec{W}^T f(\vec{X}_m) + b) = 1, \quad \text{that is:} \quad y_m g(\vec{X}_m) = 1 $$

For all other sample points:

$$ y_m (\vec{W}^T f(\vec{X}_m) + b) \ge 1 $$

This is known as the Canonical Representation for the decision hyperplane.

The training samples where $y_m (\vec{W}^T f(\vec{X}_m) + b) = 1$ are said to be the "active" constraints. All other training samples are "inactive". By definition there is always at least one active constraint.
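The rescaling to canonical form can be sketched as follows, again assuming the identity feature map and a hypothetical separating pair (w, b); the helper name `canonical_form` is invented for this example.

```python
def canonical_form(samples, labels, w, b):
    """Rescale (w, b) so that the closest sample satisfies y_m g(x_m) = 1."""
    scores = [y * (sum(wd * xd for wd, xd in zip(w, x)) + b)
              for x, y in zip(samples, labels)]
    scale = min(scores)              # y_m g(x_m) of the closest sample
    if scale <= 0:
        raise ValueError("(w, b) does not separate the training data")
    return tuple(wd / scale for wd in w), b / scale

samples = [(2.0, 2.0), (3.0, 0.0), (-1.0, -1.0), (0.0, -3.0)]
labels = [1, 1, -1, -1]
w_c, b_c = canonical_form(samples, labels, w=(1.0, 1.0), b=0.0)
print(w_c, b_c)
```

After rescaling, the sample (-1, -1) gives $y_m g(\vec{X}_m) = 1$ (the active constraint) and every other sample gives a value greater than 1. The decision surface itself is unchanged.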

Thus the problem of maximizing the margin $1/\|\vec{W}\|$ is equivalent to minimizing $\|\vec{W}\|^2$:

$$ \arg\min_{\vec{W}, b} \left\{ \frac{1}{2} \|\vec{W}\|^2 \right\} $$

subject to the constraints $y_m (\vec{W}^T f(\vec{X}_m) + b) \ge 1$. The factor of ½ is a convenience for later analysis.

To solve this problem, we will use Lagrange multipliers, $a_m \ge 0$, with one multiplier for each constraint. The method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints, for instance to maximize f(x, y) subject to g(x, y) = 0. The functions f and g must have continuous first partial derivatives. The technique introduces a new variable (λ) called a Lagrange multiplier, sets up a function L() called the Lagrangian, and finds a solution by setting its derivatives to zero. See Wikipedia for an accessible discussion of Lagrange multipliers.

For our problem, we have a Lagrangian function:

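$$ L(\vec{W}, b, \vec{a}) = \frac{1}{2} \|\vec{W}\|^2 - \sum_{m=1}^{M} a_m \{ y_m (\vec{W}^T f(\vec{X}_m) + b) - 1 \} $$

Setting the derivatives to zero, we obtain:

$$ \frac{\partial L}{\partial \vec{W}} = 0 \Rightarrow \vec{W} = \sum_{m=1}^{M} a_m y_m f(\vec{X}_m) $$

$$ \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{m=1}^{M} a_m y_m = 0 $$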

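Eliminating $\vec{W}$ and b from $L(\vec{W}, b, \vec{a})$ we obtain:

$$ L(\vec{a}) = \sum_{m=1}^{M} a_m - \frac{1}{2} \sum_{n=1}^{M} \sum_{m=1}^{M} a_n a_m y_n y_m f(\vec{X}_n)^T f(\vec{X}_m) $$

$$ = \sum_{m=1}^{M} a_m - \frac{1}{2} \sum_{n=1}^{M} \sum_{m=1}^{M} a_n a_m y_n y_m K(\vec{X}_n, \vec{X}_m) $$

with constraints:

$$ a_m \ge 0 \text{ for } m = 1, \ldots, M \quad \text{and} \quad \sum_{m=1}^{M} a_m y_m = 0 $$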
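The dual function $L(\vec{a})$, which is maximized over $\vec{a}$, only needs the matrix of kernel values. The sketch below evaluates it for the smallest possible problem: two samples, one per class, with a linear kernel. The sample coordinates and the name `dual_objective` are assumptions for illustration; for this two-sample case the maximum can be found by hand, as noted in the comments.

```python
def dual_objective(a, y, K):
    """L(a) = sum_m a_m - 1/2 sum_n sum_m a_n a_m y_n y_m K[n][m]."""
    M = len(a)
    quad = sum(a[n] * a[m] * y[n] * y[m] * K[n][m]
               for n in range(M) for m in range(M))
    return sum(a) - 0.5 * quad

# Two samples, one per class, with the linear kernel K(z, x) = z^T x.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
K = [[sum(p * q for p, q in zip(xn, xm)) for xm in X] for xn in X]

# The constraint sum_m a_m y_m = 0 forces a_1 = a_2 = alpha, so that
# L(alpha) = 2 alpha - 1/2 alpha^2 ||X_1 - X_2||^2,
# which is maximal at alpha = 2 / ||X_1 - X_2||^2.
sq_dist = sum((p - q) ** 2 for p, q in zip(X[0], X[1]))
alpha = 2.0 / sq_dist
best = dual_objective([alpha, alpha], y, K)
print(alpha, best)
```

For larger problems the maximization is carried out by a quadratic programming solver rather than by hand.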

The primal problem is a quadratic programming problem in $D_k$ variables (the dimension of the kernel space), which would normally take $O(D_k^3)$ computations. In going to the dual formulation, we have converted this to a problem over M data points, requiring $O(M^3)$ computations. This can appear to be a problem when M is large, but the solution only depends on a small number of points $M_s \ll M$.

To classify a new observed point, we evaluate:

$$ g(\vec{X}) = \sum_{m=1}^{M} a_m y_m f(\vec{X}_m)^T f(\vec{X}) + b = \sum_{m=1}^{M} a_m y_m K(\vec{X}_m, \vec{X}) + b $$
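Classification of a new point only requires kernel evaluations against the training samples with non-zero $a_m$. The following sketch assumes a linear kernel and a hypothetical trained model with two support vectors, $a_m = 0.25$ and b = 0; the function names are invented for this example.

```python
def linear_kernel(z, x):
    """K(z, x) = z^T x."""
    return sum(zd * xd for zd, xd in zip(z, x))

def discriminant(x, support_vectors, labels, coeffs, b, kernel):
    """g(x) = sum_m a_m y_m K(X_m, x) + b, summed over the support vectors only."""
    return b + sum(a * y * kernel(xm, x)
                   for xm, y, a in zip(support_vectors, labels, coeffs))

# Hypothetical trained model: two support vectors with a_m = 0.25 and b = 0.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
coeffs = [0.25, 0.25]
b = 0.0

g_new = discriminant((2.0, 0.5), support_vectors, labels, coeffs, b, linear_kernel)
predicted_class = 1 if g_new > 0 else -1
print(g_new, predicted_class)
```

Any other kernel with the same two-argument signature can be passed in place of `linear_kernel` without changing `discriminant`.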

The solution to optimization problems of this form satisfies the "Karush-Kuhn-Tucker" (KKT) conditions, requiring:

$$ a_m \ge 0 $$

$$ y_m g(\vec{X}_m) - 1 \ge 0 $$

$$ a_m \{ y_m g(\vec{X}_m) - 1 \} = 0 $$

For every observation in the training set $\{\vec{X}_m\}$, either

$$ a_m = 0 \quad \text{or} \quad y_m g(\vec{X}_m) = 1 $$
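These conditions are easy to test for a candidate solution. The sketch below is only a checker, not a solver; the values of $a_m$ and $g(\vec{X}_m)$ are hypothetical, and the tolerance `tol` is an assumption to absorb floating-point rounding.

```python
def kkt_satisfied(a, y, g_values, tol=1e-9):
    """Check a_m >= 0, y_m g(X_m) - 1 >= 0 and a_m (y_m g(X_m) - 1) = 0 for all m."""
    for a_m, y_m, g_m in zip(a, y, g_values):
        slack = y_m * g_m - 1.0
        if a_m < -tol or slack < -tol or abs(a_m * slack) > tol:
            return False
    return True

y = [1, -1, 1]
g_values = [1.0, -1.0, 2.5]     # hypothetical values of g(X_m)
ok = kkt_satisfied([0.25, 0.25, 0.0], y, g_values)    # third sample is inactive
bad = kkt_satisfied([0.25, 0.25, 0.1], y, g_values)   # a_3 > 0 although y_3 g(X_3) > 1
print(ok, bad)
```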

Any point for which $a_m = 0$ does not contribute to

$$ g(\vec{X}) = \sum_{m=1}^{M} a_m y_m f(\vec{X}_m)^T f(\vec{X}) + b = \sum_{m=1}^{M} a_m y_m K(\vec{X}_m, \vec{X}) + b $$

and thus is not used (is not active).

The remaining $M_s$ samples for which $a_m \ne 0$ are the support vectors. These points lie on the margin at $y_m g(\vec{X}_m) = 1$ of the maximum margin hyperplane. Once the model is trained, all other points can be discarded!

Let us define the support vectors as the set $\{\vec{X}_s\}$.

Now that we have solved for $\{\vec{X}_s\}$ and $\vec{a}$, we can solve for b. We note that for any active training sample m in $\{\vec{X}_s\}$:

$$ y_m \left( \sum_{n \in S} a_n y_n K(\vec{X}_n, \vec{X}_m) + b \right) = 1 $$

Averaging over all support vectors in $\{\vec{X}_s\}$ gives:

$$ b = \frac{1}{M_S} \sum_{m \in S} \left( y_m - \sum_{n \in S} a_n y_n K(\vec{X}_n, \vec{X}_m) \right) $$

From Bishop p 331.
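The averaging formula for b can be sketched directly. The example assumes a linear kernel and two hypothetical support vectors placed so that the decision surface does not pass through the origin; the coefficient 0.25 is the hand-derived solution of the two-sample dual problem for these points, not the output of a solver.

```python
def linear_kernel(z, x):
    """K(z, x) = z^T x."""
    return sum(zd * xd for zd, xd in zip(z, x))

def bias(support_vectors, labels, coeffs, kernel):
    """b = (1/M_S) sum_{m in S} ( y_m - sum_{n in S} a_n y_n K(X_n, X_m) )."""
    total = 0.0
    for xm, ym in zip(support_vectors, labels):
        total += ym - sum(an * yn * kernel(xn, xm)
                          for xn, yn, an in zip(support_vectors, labels, coeffs))
    return total / len(support_vectors)

# Hypothetical support vectors, one per class.
support_vectors = [(2.0, 2.0), (0.0, 0.0)]
labels = [1, -1]
coeffs = [0.25, 0.25]    # a_m = 2 / ||X_1 - X_2||^2 for a two-sample problem
b = bias(support_vectors, labels, coeffs, linear_kernel)
print(b)
```

With this b, both support vectors satisfy $y_m g(\vec{X}_m) = 1$, as required for points on the margin.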


Soft Margin SVM's - Non-separable training data

So far we have assumed that the data are linearly separable in $f(\vec{X})$. For many problems some training data may overlap. The problem is that the error function goes to ∞ for any point on the wrong side of the decision surface. This is called a "hard margin" SVM. We will relax this by adding a "slack" variable, $z_m \ge 0$, for each training sample.

We will define
$z_m = 0$ for training samples on the correct side of the margin, and
$z_m = |y_m - g(\vec{X}_m)|$ for other training samples.

For a sample inside the margin, but on the correct side of the decision surface: $0 < z_m \le 1$
For a sample on the decision surface: $z_m = 1$
For a sample on the wrong side of the decision surface: $z_m > 1$

Soft margin SVM: Bishop p 332 (note the use of $\xi_n$ in place of $z_n$)
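The slack variable can be computed from the label and the discriminant value alone. In this sketch the discriminant values are hypothetical, and the helper `describe` simply maps a slack value onto the cases listed above.

```python
def slack(y_m, g_m):
    """z_m = 0 on the correct side of the margin, |y_m - g(X_m)| otherwise."""
    return 0.0 if y_m * g_m >= 1.0 else abs(y_m - g_m)

def describe(z_m):
    """Map a slack value to the cases listed above."""
    if z_m == 0.0:
        return "on or outside the margin"
    if z_m < 1.0:
        return "inside the margin, correctly classified"
    if z_m == 1.0:
        return "on the decision surface"
    return "misclassified"

# Hypothetical discriminant values g(X_m) for four samples of class y_m = +1.
for g_m in (1.5, 0.25, 0.0, -0.5):
    z_m = slack(1, g_m)
    print(g_m, z_m, describe(z_m))
```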


This is called a soft margin SVM. To softly penalize points on the wrong side, we minimize:

$$ C \sum_{m=1}^{M} z_m + \frac{1}{2} \|\vec{W}\|^2 $$

where C > 0 controls the tradeoff between slack variables and the margin.

Because any misclassified point has $z_m > 1$, the sum $\sum_{m=1}^{M} z_m$ is an upper bound on the number of misclassified points.

C acts as an inverse regularization factor. (Note that C = ∞ is the SVM with hard margins.)
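The role of C can be seen by evaluating the soft margin objective for two candidate solutions. All numbers below are hypothetical and chosen only to illustrate the tradeoff: a wide margin (small $\|\vec{W}\|$) that tolerates some slack, and a narrow margin with no slack.

```python
def soft_margin_objective(w, z, C):
    """C * sum_m z_m + 1/2 ||w||^2."""
    return C * sum(z) + 0.5 * sum(wd * wd for wd in w)

z = [0.0, 0.75, 1.5]            # hypothetical slack values; one sample is misclassified
misclassified = sum(1 for z_m in z if z_m > 1.0)
bound_holds = misclassified <= sum(z)   # the sum of slacks bounds the number of errors

wide = soft_margin_objective((0.5, 0.5), z, C=1.0)
narrow = soft_margin_objective((2.0, 2.0), [0.0, 0.0, 0.0], C=1.0)
wide_large_C = soft_margin_objective((0.5, 0.5), z, C=10.0)
print(wide, narrow, wide_large_C, bound_holds)
```

With C = 1 the wide margin has the lower cost, while with C = 10 the slack is penalized so heavily that the narrow, slack-free solution becomes preferable, which is the behavior expected as C grows toward the hard margin case.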

To solve for the SVM we write the Lagrangian:

$$ L(\vec{W}, b, \vec{z}, \vec{a}, \vec{\mu}) = \frac{1}{2} \|\vec{W}\|^2 + C \sum_{m=1}^{M} z_m - \sum_{m=1}^{M} a_m \{ y_m g(\vec{X}_m) - 1 + z_m \} - \sum_{m=1}^{M} \mu_m z_m $$

where $\{a_m \ge 0\}$ and $\{\mu_m \ge 0\}$ are the Lagrange multipliers.

The KKT conditions are

$$ a_m \ge 0 $$

$$ y_m g(\vec{X}_m) - 1 + z_m \ge 0 $$

$$ a_m \{ y_m g(\vec{X}_m) - 1 + z_m \} = 0 $$

$$ \mu_m \ge 0 $$

$$ z_m \ge 0 $$

$$ \mu_m z_m = 0 $$

We optimize for $\vec{W}$, b, and $\{z_m\}$, using

$$ g(\vec{X}) = \vec{W}^T f(\vec{X}) + b $$

Setting the derivatives of $L(\vec{W}, b, \vec{z}, \vec{a}, \vec{\mu})$ to zero gives


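$$ \frac{\partial L}{\partial \vec{W}} = 0 \Rightarrow \vec{W} = \sum_{m=1}^{M} a_m y_m f(\vec{X}_m) $$

$$ \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{m=1}^{M} a_m y_m = 0 $$

$$ \frac{\partial L}{\partial z_m} = 0 \Rightarrow a_m = C - \mu_m $$

Using these to eliminate $\vec{W}$, b and $\{z_m\}$ from the Lagrangian, we obtain

$$ L(\vec{a}) = \sum_{m=1}^{M} a_m - \frac{1}{2} \sum_{m=1}^{M} \sum_{n=1}^{M} a_m a_n y_m y_n f(\vec{X}_m)^T f(\vec{X}_n) $$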

This appears to be the same as before, except that the constraints are different:

$$ 0 \le a_m \le C \quad \text{and} \quad \sum_{m=1}^{M} a_m y_m = 0 $$

(The first of these is referred to as a "box" constraint.) The solution is a quadratic programming problem, with complexity $O(M^3)$. However, as before, a large subset of training samples have $a_m = 0$, and thus do not contribute to the discriminant function.

For the remaining points, $y_m g(\vec{X}_m) = 1 - z_m$.

For samples ON the margin, $a_m < C$, hence $\mu_m > 0$, requiring that $z_m = 0$.

For samples INSIDE the margin, $a_m = C$, with $z_m \le 1$ if correctly classified and $z_m > 1$ if misclassified.

As before, to solve for b we note that for the support vectors with $0 < a_m < C$:

$$ y_m \left( \sum_{n \in S} a_n y_n f(\vec{X}_n)^T f(\vec{X}_m) + b \right) = 1 $$

Averaging over these support vectors gives:

$$ b = \frac{1}{M_N} \sum_{m \in N} \left( y_m - \sum_{n \in S} a_n y_n f(\vec{X}_n)^T f(\vec{X}_m) \right) $$

where N denotes the set of support vectors such that $0 < a_n < C$, and $M_N$ is the number of vectors in N.