Support vector machines, Kernel methods, and Applications in bioinformatics

[email protected] Ecole des Mines de Paris Computational Biology group Machine Learning in Bioinformatics conference, October 17th, 2003, Brussels, Belgium


Overview

1. Support Vector Machines and kernel methods
2. Application: Protein remote homology detection
3. Application: Extracting pathway activity from gene expression data

Part 1

Support Vector Machines (SVM) and Kernel Methods

The pattern recognition problem

• Learn from labelled examples a discrimination rule
• Use it to predict the class of new points

Pattern recognition examples

• Medical diagnosis (e.g., from microarrays)
• Druggability/activity of chemical compounds
• Gene function, structure, localization
• Protein interactions

Support Vector Machines for pattern recognition

• Object x represented by the vector $\vec{\Phi}(x)$ in the feature space
• Linear separation with large margin in the feature space

Large margin separation

[Figure: separating hyperplane H with margin m; points labelled e1, ..., e5]

$$\min_{H,\,m} \left\{ \frac{1}{m^2} + C \sum_i e_i \right\}$$

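Dual formulation

The classification of a new point x is the sign of:

$$f(x) = \sum_i \alpha_i y_i K(x, x_i) + b,$$

where b is the offset of the hyperplane and the $\alpha_i$ solve:

$$\max_{\vec{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to:

$$0 \le \alpha_i \le C \quad \forall i = 1, \dots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0,$$

with the notation:

$$K(x, x') = \vec{\Phi}(x) \cdot \vec{\Phi}(x').$$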

The kernel trick for SVM

• The separation can be found without knowing $\vec{\Phi}(x)$. Only the kernel matters: $K(x, y) = \vec{\Phi}(x) \cdot \vec{\Phi}(y)$
• Simple kernels K(x, y) can correspond to complex $\vec{\Phi}$
• SVMs work with any sort of data as soon as a kernel is defined
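A minimal sketch of the kernel trick, assuming 2-D inputs and the homogeneous degree-2 polynomial kernel (both chosen here for illustration, not taken from the slides). The helper names `poly2_kernel` and `phi` are hypothetical. The kernel value computed directly matches the dot product of the explicit feature vectors.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: K(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs whose dot product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly2_kernel(x, y)      # computed without ever building phi
k_explicit = dot(phi(x), phi(y))     # computed in the 3-D feature space
```

Both quantities agree up to rounding, which is why the feature map never needs to be built explicitly.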

Kernel examples

• Linear: $K(x, x') = x \cdot x'$
• Polynomial: $K(x, x') = (x \cdot x' + c)^d$
• Gaussian RBF: $K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$
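The three kernels can be sketched in a few lines of Python. The default values of `c`, `d` and `sigma` are arbitrary choices for illustration, and the Gaussian RBF kernel is written with its standard negative exponent.

```python
import math

def dot(x, y):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    """K(x, x') = x . x'"""
    return dot(x, y)

def polynomial_kernel(x, y, c=1.0, d=2):
    """K(x, x') = (x . x' + c)^d"""
    return (dot(x, y) + c) ** d

def gaussian_rbf_kernel(x, y, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, y = (1.0, 0.0), (0.0, 1.0)
values = (linear_kernel(x, y), polynomial_kernel(x, y), gaussian_rbf_kernel(x, y))
```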

Kernels

For any set X, a function K : X × X → R is a kernel iff:
• it is symmetric: K(x, y) = K(y, x),
• it is positive semi-definite: $\sum_{i,j} a_i a_j K(x_i, x_j) \ge 0$ for all $a_i \in \mathbb{R}$ and $x_i \in X$.
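The two conditions can be spot-checked numerically. The sketch below assumes a Gaussian kernel on real numbers and random test points (illustrative choices); it checks symmetry of the Gram matrix and evaluates the quadratic form for random coefficients. This is only a sanity check on one sample, not a proof of positive semi-definiteness.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on real numbers, used here as a known valid kernel."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(kernel, points):
    """Matrix of all pairwise kernel values K(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(K, a):
    """sum_{i,j} a_i a_j K(x_i, x_j)"""
    n = len(a)
    return sum(a[i] * a[j] * K[i][j] for i in range(n) for j in range(n))

random.seed(0)
points = [random.uniform(-3, 3) for _ in range(6)]
K = gram_matrix(gaussian_kernel, points)

is_symmetric = all(K[i][j] == K[j][i] for i in range(6) for j in range(6))
forms = [quadratic_form(K, [random.gauss(0, 1) for _ in range(6)])
         for _ in range(200)]
```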

Advantages of SVM

• Works well on real-world applications
• Large dimensions, noise OK (?)
• Can be applied to any kind of data as soon as a kernel is available

Examples: SVM in bioinformatics

• Gene functional classification from microarray: Brown et al. (2000), Pavlidis et al. (2001)
• Tissue classification from microarray: Mukherjee et al. (1999), Furey et al. (2000), Guyon et al. (2001)
• Protein family prediction from sequence: Jaakkola et al. (1998)
• Protein secondary structure prediction: Hua et al. (2001)
• Protein subcellular localization prediction from sequence: Hua et al. (2001)

Kernel methods

Let K(x, y) be a given kernel. Then it is possible to perform other linear algorithms implicitly in the feature space, such as:
• Compute the distance between points
• Principal component analysis (PCA)
• Canonical correlation analysis (CCA)

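Compute the distance between objects

[Figure: objects g1 and g2 mapped to Φ(g1) and Φ(g2) in the feature space, at distance d]

$$\begin{aligned}
d(g_1, g_2)^2 &= \|\vec{\Phi}(g_1) - \vec{\Phi}(g_2)\|^2 \\
&= \left(\vec{\Phi}(g_1) - \vec{\Phi}(g_2)\right) \cdot \left(\vec{\Phi}(g_1) - \vec{\Phi}(g_2)\right) \\
&= \vec{\Phi}(g_1) \cdot \vec{\Phi}(g_1) + \vec{\Phi}(g_2) \cdot \vec{\Phi}(g_2) - 2\,\vec{\Phi}(g_1) \cdot \vec{\Phi}(g_2) \\
&= K(g_1, g_1) + K(g_2, g_2) - 2K(g_1, g_2)
\end{aligned}$$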
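The last line of the derivation translates directly into code. This sketch assumes a linear kernel on 2-D points for the demonstration, so the result can be compared with the ordinary Euclidean distance; `kernel_distance` is a hypothetical helper name.

```python
import math

def kernel_distance(kernel, g1, g2):
    """Feature-space distance computed from kernel values only."""
    sq = kernel(g1, g1) + kernel(g2, g2) - 2.0 * kernel(g1, g2)
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative rounding errors

def linear_kernel(x, y):
    """With the linear kernel the feature space is the input space itself."""
    return sum(a * b for a, b in zip(x, y))

d = kernel_distance(linear_kernel, (0.0, 0.0), (3.0, 4.0))
```

With the linear kernel the value is the usual Euclidean distance between the two points; any other kernel can be passed in the same way.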

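Distance to the center of mass

[Figure: Φ(g1) and the center of mass m in the feature space]

Center of mass: $\vec{m} = \frac{1}{N} \sum_{i=1}^{N} \vec{\Phi}(g_i)$, hence:

$$\begin{aligned}
\|\vec{\Phi}(g_1) - \vec{m}\|^2 &= \vec{\Phi}(g_1) \cdot \vec{\Phi}(g_1) - 2\,\vec{\Phi}(g_1) \cdot \vec{m} + \vec{m} \cdot \vec{m} \\
&= K(g_1, g_1) - \frac{2}{N} \sum_{i=1}^{N} K(g_1, g_i) + \frac{1}{N^2} \sum_{i,j=1}^{N} K(g_i, g_j)
\end{aligned}$$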
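A small sketch of this formula, again assuming a linear kernel on 2-D points so that the answer can be verified by hand. The function name `sq_distance_to_center` and the sample points are illustrative.

```python
def sq_distance_to_center(kernel, g, objects):
    """Squared feature-space distance from g to the center of mass of objects."""
    n = len(objects)
    self_term = kernel(g, g)
    cross_term = sum(kernel(g, gi) for gi in objects) / n
    mass_term = sum(kernel(gi, gj) for gi in objects for gj in objects) / n ** 2
    return self_term - 2.0 * cross_term + mass_term

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

# Four corners of a square: the center of mass is (1, 1).
objects = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
value = sq_distance_to_center(linear_kernel, (0.0, 0.0), objects)
```

Here the center of mass is (1, 1), so the squared distance from the corner (0, 0) is 2, and the center itself never has to be computed.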

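Principal component analysis

[Figure: principal directions PC1 and PC2]

PCA in the feature space amounts to finding the eigenvectors of

$$K = \left[ \vec{\Phi}(g_i) \cdot \vec{\Phi}(g_j) \right]_{i,j=1 \dots N} = \left[ K(g_i, g_j) \right]_{i,j=1 \dots N}$$

Useful to project the objects onto low-dimensional spaces (feature extraction).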

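Canonical correlation analysis

[Figure: canonical directions CCA1 and CCA2 in each of the two representations]

K1 and K2 are two kernels for the same objects. CCA can be performed by solving the following generalized eigenvalue problem:

$$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \vec{\xi} = \rho \begin{pmatrix} K_1^2 & 0 \\ 0 & K_2^2 \end{pmatrix} \vec{\xi}$$

Useful to find correlations between different representations of the same objects (e.g., genes, ...)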


Part 3

Local alignment kernel for strings (with S. Hiroto, N. Ueda, T. Akutsu, preprint 2003)

Motivations

• Develop a kernel for strings adapted to protein / DNA sequences
• Several methods have been adopted in bioinformatics to measure the similarity between sequences... but are not valid kernels
• How to mimic them?

Related work

• Spectrum kernel (Leslie et al.):

$$K(x_1 \dots x_m, y_1 \dots y_n) = \sum_{i=1}^{m-k} \sum_{j=1}^{n-k} \delta(x_i \dots x_{i+k}, y_j \dots y_{j+k})$$

• Fisher kernel (Jaakkola et al.): given a statistical model $p_\theta$, $\theta \in \Theta \subset \mathbb{R}^d$, take $\phi(x) = \nabla_\theta \log p_\theta(x)$ and use the Fisher information matrix.
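The spectrum kernel can be sketched by counting shared substrings. In this sketch the parameter `length` is the number of letters per substring (an assumption about parametrization: with the indexing above, x_i ... x_{i+k} has k + 1 letters). The function name is illustrative.

```python
from collections import Counter

def spectrum_kernel(x, y, length):
    """Number of position pairs (i, j) at which x and y carry the same
    substring of the given length."""
    count_x = Counter(x[i:i + length] for i in range(len(x) - length + 1))
    count_y = Counter(y[j:j + length] for j in range(len(y) - length + 1))
    # Each shared substring contributes (occurrences in x) * (occurrences in y).
    return sum(count_x[s] * count_y[s] for s in count_x)

value = spectrum_kernel("ABAB", "BABA", 2)
```

Counting occurrences first gives the same value as the double sum over positions, since each matching pair of positions is counted once.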

Local alignment

• For two strings x and y, a local alignment π with gaps is:

ABCD EF−−−G−HI JKL
 MNO EEPQRGS−I TUVWX

• The score is: s(x, y, π) = s(E, E) + s(F, F) + s(G, G) + s(I, I) − s(gaps)

Smith-Waterman (SW) score

$$SW(x, y) = \max_{\pi \in \Pi(x, y)} s(x, y, \pi)$$

• Computed by dynamic programming
• Not a kernel in general
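A minimal sketch of the dynamic programming computation. The slides do not specify the substitution scores or the gap function, so the match/mismatch scores and the linear gap penalty below are assumptions made only for illustration.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings x and y.

    Assumes a simple match/mismatch substitution score and a linear gap
    penalty (illustrative choices, not taken from the slides).
    """
    rows, cols = len(x) + 1, len(y) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at (i, j)
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                          H[i - 1][j] + gap,    # gap in y
                          H[i][j - 1] + gap)    # gap in x
            best = max(best, H[i][j])
    return best

score = smith_waterman("ABCD", "XBCY")
```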

Convolution kernels (Haussler 99)

• Let K1 and K2 be two kernels for strings
• Their convolution is the following valid kernel:

$$K_1 \star K_2 (x, y) = \sum_{x_1 x_2 = x,\; y_1 y_2 = y} K_1(x_1, y_1)\, K_2(x_2, y_2)$$
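The definition can be evaluated by brute force over all ways of cutting x and y in two. The sketch below is meant only to make the sum concrete; `convolve`, `constant_kernel` and `equality_kernel` are hypothetical names, and the two toy kernels are illustrative.

```python
def convolve(k1, k2):
    """Return the convolution kernel k1 * k2 on strings, by brute force."""
    def kernel(x, y):
        total = 0.0
        for i in range(len(x) + 1):        # x = x[:i] + x[i:]
            for j in range(len(y) + 1):    # y = y[:j] + y[j:]
                total += k1(x[:i], y[:j]) * k2(x[i:], y[j:])
        return total
    return kernel

def constant_kernel(x, y):
    """K0(x, y) = 1 for all strings."""
    return 1.0

def equality_kernel(x, y):
    """1 if the two strings are identical, 0 otherwise."""
    return 1.0 if x == y else 0.0

n_splits = convolve(constant_kernel, constant_kernel)("ab", "c")
n_matching_splits = convolve(equality_kernel, equality_kernel)("ab", "ab")
```

With the constant kernel the convolution simply counts the decompositions: 3 cut positions in "ab" times 2 in "c".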

3 basic kernels

• For the unaligned parts: $K_0(x, y) = 1$.
• For aligned residues:

$$K_a^{(\beta)}(x, y) = \begin{cases} 0 & \text{if } |x| \ne 1 \text{ or } |y| \ne 1, \\ \exp\left(\beta s(x, y)\right) & \text{otherwise.} \end{cases}$$

• For gaps: $K_g^{(\beta)}(x, y) = \exp\left[\beta \left(g(|x|) + g(|y|)\right)\right]$

Combining the kernels

• Detecting local alignments of exactly n residues:

$$K_{(n)}^{(\beta)}(x, y) = K_0 \star \left( K_a^{(\beta)} \star K_g^{(\beta)} \right)^{(n-1)} \star K_a^{(\beta)} \star K_0.$$

• Considering all possible local alignments:

$$K_{LA}^{(\beta)} = \sum_{i=0}^{\infty} K_{(i)}^{(\beta)}.$$

Properties

$$K_{LA}^{(\beta)}(x, y) = \sum_{\pi \in \Pi(x, y)} \exp\left(\beta s(x, y, \pi)\right),$$

$$\lim_{\beta \to +\infty} \frac{1}{\beta} \ln K_{LA}^{(\beta)}(x, y) = SW(x, y).$$
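The limit can be illustrated numerically on a hypothetical list of alignment scores (the numbers below are made up; enumerating real alignments is not attempted here). As β grows, (1/β) ln Σ exp(β s) approaches the maximum score, which plays the role of SW(x, y).

```python
import math

def soft_max_score(scores, beta):
    """(1/beta) * ln(sum_pi exp(beta * s_pi)), computed stably by
    factoring out the largest score."""
    top = max(scores)
    return top + math.log(sum(math.exp(beta * (s - top)) for s in scores)) / beta

# Hypothetical scores s(x, y, pi) of four candidate alignments.
scores = [3.0, 5.0, 4.5, -1.0]
approximations = [soft_max_score(scores, b) for b in (0.1, 1.0, 10.0, 100.0)]
```

The values decrease towards max(scores) = 5 as β increases, mirroring how the local alignment kernel recovers the Smith-Waterman score in the limit.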

Kernel computation

[Figure: state diagram with states B, X0, X, X2, M, Y0, Y, Y2, E and transitions labelled d and e]

Application: remote homology detection

[Figure: sequence similarity axis divided into close homologs, twilight zone, and unrelated proteins]

• Same structure/function but sequence diverged
• Remote homology cannot be found by direct sequence similarity

SCOP database

[Figure: SCOP hierarchy (fold, superfamily, family), with remote homologs and close homologs marked]

A benchmark experiment

• Can we predict the superfamily of a domain if we have not seen any member of its family before?
• During learning: remove a family and learn the difference between the superfamily and the rest
• Then, use the model to test each domain of the family removed
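The protocol above can be sketched as a data split. The domain identifiers and labels below are hypothetical, and the function only builds the training set and the held-out family described in the bullets; it does not reproduce the actual benchmark files.

```python
def leave_family_out(domains, superfamily, family):
    """Split (domain_id, superfamily, family) records for one benchmark task.

    Training examples are labelled +1 inside the target superfamily and -1
    outside; domains of the removed family are held out for testing.
    """
    train, test = [], []
    for domain_id, sf, fam in domains:
        if sf == superfamily and fam == family:
            test.append(domain_id)                  # never seen during learning
        else:
            label = +1 if sf == superfamily else -1
            train.append((domain_id, label))
    return train, test

# Hypothetical toy annotation, for illustration only.
domains = [("d1", "sfA", "fam1"), ("d2", "sfA", "fam1"),
           ("d3", "sfA", "fam2"), ("d4", "sfB", "fam3")]
train, test = leave_family_out(domains, "sfA", "fam1")
```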

SCOP superfamily recognition benchmark

[Figure: number of families with given performance (y-axis, 0 to 60) versus ROC50 (x-axis, 0 to 1), for SVM-LA, SVM-pairwise, SVM-Mismatch and SVM-Fisher]


Part 4

Detecting pathway activity from microarray data

Genes encode proteins which can catalyse chemical reactions

[Figure: nicotinamide mononucleotide adenylyltransferase with bound NAD+]


Chemical reactions are often parts of pathways

From http://www.genome.ad.jp/kegg/pathway


Microarray technology monitors mRNA quantity

(From Spellman et al., 1998)

Comparing gene expression and pathway databases

[Figure: gene expression data vs. pathway database]

• Detect active pathways?
• Denoise expression data?
• Denoise pathway database?
• Find new pathways?
• Are there "correlations"?

A useful first step

[Figure: the same genes g1, ..., g8 shown in two representations]

Using microarray only

[Figure: expression profiles of g1, ..., g8 with the first principal direction PC1]

PCA finds the directions (profiles) explaining the largest amount of variation among expression profiles.

PCA formulation

• Let $f_v(i)$ be the projection of the i-th profile onto v.
• The amount of variation captured by $f_v$ is:

$$h_1(v) = \sum_{i=1}^{N} f_v(i)^2$$

• PCA finds an orthonormal basis by solving successively:

$$\max_v h_1(v)$$
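A small sketch of h1, assuming the profiles are already centered and that v is normalized to unit length before projecting (both assumptions; the toy profiles are invented for illustration).

```python
import math

def project(profiles, v):
    """f_v(i): dot product of the i-th profile with the direction v."""
    return [sum(a * b for a, b in zip(p, v)) for p in profiles]

def h1(profiles, v):
    """Variation captured along v, after scaling v to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    unit = [c / norm for c in v]
    return sum(f * f for f in project(profiles, unit))

# Toy centered profiles that vary mostly along the first coordinate.
profiles = [(2.0, 0.1), (-2.0, -0.1), (1.0, -0.1), (-1.0, 0.1)]
along_first = h1(profiles, (1.0, 0.0))
along_second = h1(profiles, (0.0, 1.0))
```

The first direction captures far more variation than the second, so it is the one PCA would select first on this toy data.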

Issues with PCA

• PCA is useful if there is a small number of strong signals
• In concrete applications, we observe a noisy superposition of many events
• Using prior knowledge of metabolic networks can help denoise the information detected by PCA

The metabolic gene network

[Figure: a metabolic pathway (Glucose, Glucose-6P, Glucose-1P, Fructose-6P, Fructose-1,6P2) annotated with the genes catalyzing its reactions (GAL10, HKA, HKB, GLK1, PGM1, PGM2, PGT1, PFK1, PFK2, FBP1, FBA1), and the gene network derived from it]

Link two genes when they can catalyze two successive reactions
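The linking rule can be sketched as follows. The reactions, compounds and gene names below are hypothetical placeholders, not the pathway from the figure, and "successive" is taken to mean that the product of one reaction is the substrate of the next (an assumption).

```python
from itertools import product

def gene_network(reactions):
    """Build undirected gene-gene edges from (substrate, product, genes) records.

    Two genes are linked when one catalyzes a reaction whose product is the
    substrate of a reaction catalyzed by the other.
    """
    edges = set()
    for (_, prod1, genes1), (sub2, _, genes2) in product(reactions, repeat=2):
        if prod1 == sub2:
            for g1, g2 in product(genes1, genes2):
                if g1 != g2:
                    edges.add(frozenset((g1, g2)))
    return edges

# Hypothetical linear pathway A -> B -> C -> D.
reactions = [("A", "B", ["gene1", "gene2"]),
             ("B", "C", ["gene3"]),
             ("C", "D", ["gene4"])]
edges = gene_network(reactions)
```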

Mapping fv to the metabolic gene network

[Figure: the values of fv placed on the nodes of the gene network]

g1 −0.8, g8 +0.8, g5 +0.2, g6 +0.4, g2 −0.7, g7 +0.5, g3 −0.4, g4 +0.1

Does it look interesting or not?

Important hypothesis

If v is related to a metabolic activity, then fv should vary "smoothly" on the graph

[Figure: a smooth function and a rugged function on the graph]

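Graph Laplacian L = D − A

[Figure: graph on nodes 1 to 5 with edges 1-3, 2-3, 3-4 and 4-5]

With D the diagonal matrix of node degrees and A the adjacency matrix:

$$L = \begin{pmatrix} 1 & 0 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 & 0 \\ -1 & -1 & 3 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 1 \end{pmatrix}$$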
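A short sketch that builds L = D − A from an edge list. The edges are those that can be read off the off-diagonal entries of the matrix above (1-3, 2-3, 3-4, 4-5); the function name is illustrative.

```python
def graph_laplacian(n_nodes, edges):
    """Laplacian L = D - A of an undirected graph with nodes numbered from 1."""
    L = [[0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        i, j = u - 1, v - 1
        L[i][j] -= 1   # minus the adjacency matrix A
        L[j][i] -= 1
        L[i][i] += 1   # plus the degree matrix D
        L[j][j] += 1
    return L

edges = [(1, 3), (2, 3), (3, 4), (4, 5)]
L = graph_laplacian(5, edges)
```

Each row of L sums to zero, so constant functions on the graph are mapped to zero.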

Smoothness quantification

$$h_2(f) = \frac{f^\top \exp(-\beta L) f}{f^\top f}$$

is large when f is smooth

[Figure: two functions on the graph, labelled h(f) = 2.5 and h(f) = 34.2]
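A sketch of h2 on a 5-node graph with edges 1-3, 2-3, 3-4, 4-5, using L = D − A. The matrix exponential is approximated by a truncated Taylor series, which is adequate only for small matrices like this one; β = 1 and the two test functions are illustrative choices. The values printed on the slide are not reproduced here.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(M, terms=40):
    """Matrix exponential by truncated Taylor series (small matrices only)."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, M)]  # M^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def h2(f, L, beta=1.0):
    """Smoothness score f^T exp(-beta L) f / f^T f."""
    K = expm([[-beta * v for v in row] for row in L])
    Kf = [sum(K[i][j] * f[j] for j in range(len(f))) for i in range(len(f))]
    return sum(a * b for a, b in zip(f, Kf)) / sum(a * a for a in f)

L = [[1, 0, -1, 0, 0],
     [0, 1, -1, 0, 0],
     [-1, -1, 3, -1, 0],
     [0, 0, -1, 2, -1],
     [0, 0, 0, -1, 1]]
smooth = h2([1.0, 1.0, 1.0, 1.0, 1.0], L)    # constant on the graph
rugged = h2([1.0, 1.0, -1.0, 1.0, -1.0], L)  # sign flips along edges
```

The constant function reaches the largest possible value of this ratio, while the function that flips sign along edges scores lower.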

Motivation

For a candidate profile v,
• h1(fv) is large when v captures a lot of natural variation among profiles
• h2(fv) is large when fv is smooth on the graph

Try to maximize both terms at the same time

Problem reformulation

Find a function $f_v$ and a function $f_2$ such that:
• $h_1(f_v)$ is large
• $h_2(f_2)$ is large
• $\mathrm{corr}(f_v, f_2)$ is large

by solving:

$$\max_{(f_2,\, v)} \; \mathrm{corr}(f_v, f_2) \times \frac{h_1(f_v)}{h_1(f_v) + \delta} \times \frac{h_2(f_2)}{h_2(f_2) + \delta}$$

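Solving the problem

This formulation is equivalent to a generalized form of CCA (Kernel-CCA, Bach and Jordan, 2002), which is solved by the following generalized eigenvector problem:

$$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \rho \begin{pmatrix} K_1^2 + \delta K_1 & 0 \\ 0 & K_2^2 + \delta K_2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$

where $[K_1]_{i,j} = e_i^\top e_j$ and $K_2 = \exp(-L)$. Then, $f_v = K_1 \alpha$ and $f_2 = K_2 \beta$.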

The kernel point of view...

[Figure: the genes g1, ..., g8 represented through a linear kernel and through a diffusion kernel, with kernel CCA relating the two representations]

Data

• Gene network: two genes are linked if they catalyze successive reactions in the KEGG database (669 yeast genes)
• Expression profiles: 18 time-series measurements for the 6,000 genes of yeast, during two cell cycles

First pattern of expression

[Figure: expression (y-axis) versus time (x-axis)]

Related metabolic pathways

50 genes with highest s2 − s1 belong to:
• Oxidative phosphorylation (10 genes)
• Citrate cycle (7)
• Purine metabolism (6)
• Glycerolipid metabolism (6)
• Sulfur metabolism (5)
• Selenoaminoacid metabolism (4), etc...

Related genes

[Figures: related genes, shown over three slides]

Opposite pattern

[Figure: expression (y-axis) versus time (x-axis)]

Related genes

• RNA polymerase (11 genes)
• Pyrimidine metabolism (10)
• Aminoacyl-tRNA biosynthesis (7)
• Urea cycle and metabolism of amino groups (3)
• Oxidative phosphorylation (3)
• ATP synthesis (3), etc...

Related genes

[Figures: related genes, shown over three slides]

Second pattern

[Figure: expression (y-axis) versus time (x-axis)]

Extensions

• Can be used to extract features from expression profiles (preprint 2002)
• Can be generalized to more than 2 datasets and other kernels
• Can be used to extract clusters of genes (e.g., operon detection, ISMB 03 with Y. Yamanishi, A. Nakaya and M. Kanehisa)

Conclusion

• Kernels offer a versatile framework to represent biological data
• SVM and kernel methods work well on real-life problems, in particular in high dimension and with noise
• Encouraging results on real-world applications
• Many opportunities in developing kernels for particular applications