2008-12-07

The Distant Segments Kernel: a tutorial

This post is a tutorial on how to use the distant segments kernel. String kernels were recently introduced as a more precise way to perform pairwise comparisons. First, I will define the concept of a kernel. A kernel is a function that takes two objects from an input space, maps them to a vectorial feature space, and computes their inner product there. In other words, a kernel associates a real number to a pair of instances. String kernels are a particular case of kernels: their input space is the set of strings, where a string is a sequence generated with a given alphabet. The distant segments kernel is a string kernel. For the distant segments kernel, the feature vector associated to a string is the distribution of its distant segments. See this paper for more details.
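
To make the idea of a kernel concrete, here is a minimal toy sketch in C++ (my own illustration, not the distant segments kernel): the feature vector of a string is simply its letter counts, and the kernel of two strings is the dot product of those feature vectors. The feature space of the distant segments kernel is far richer, but the principle is the same.

#include <iostream>
#include <map>
#include <string>

// Toy feature map: count how many times each letter occurs in the string.
std::map<char, double> featureMap(const std::string& s) {
    std::map<char, double> phi;
    for (size_t i = 0; i < s.size(); i++)
        phi[s[i]] += 1.0;
    return phi;
}

// Toy string kernel: the dot product of the two feature vectors.
double toyKernel(const std::string& x, const std::string& y) {
    std::map<char, double> phiX = featureMap(x);
    std::map<char, double> phiY = featureMap(y);
    double sum = 0.0;
    for (std::map<char, double>::iterator i = phiX.begin(); i != phiX.end(); i++) {
        if (phiY.count(i->first) > 0)
            sum += i->second * phiY[i->first];
    }
    return sum;
}

int main() {
    // The kernel associates a real number to the pair of strings.
    std::cout << toyKernel("gattaca", "gatacca") << std::endl;
    return 0;
}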

In this tutorial, command lines are shown in red. First, download the source code.


seb@ubuntu:~$ wget http://boisvert.info/software/PermutationDSKernel.cpp

This program computes the kernel matrix of a set of strings. Now, compile the C++ file with g++.

seb@ubuntu:~$ g++ PermutationDSKernel.cpp -O4 -Wall -o DSK

Executing the program without providing parameters will output the usage.

seb@ubuntu:~$ ./DSK
Usage:
DSkernel l sequenceFile delta_m theta_m out

Basically, sequenceFile is a file containing sequences, one per line. Note that this is not a FASTA file.
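
The sequences used below are already in this one-sequence-per-line format. If your own data is in FASTA format, a small conversion is needed first. The following is a minimal sketch (the program name fasta2lines and the file names are only examples; any equivalent script would do).

#include <fstream>
#include <iostream>
#include <string>

// Minimal FASTA flattener: drop the header lines (those starting with '>')
// and join the sequence lines of each record onto a single output line.
int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "usage: fasta2lines proteins.fasta > sequences.txt" << std::endl;
        return 1;
    }
    std::ifstream in(argv[1]);
    std::string line;
    bool inRecord = false;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            if (inRecord)
                std::cout << std::endl; // terminate the previous sequence
            inRecord = true;
        } else {
            std::cout << line; // append the sequence characters
        }
    }
    if (inRecord)
        std::cout << std::endl;
    return 0;
}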

In this tutorial, I will use the file sequences.txt. You can copy the content below into a file named sequences.txt on your computer. The sequences.txt file contains protein sequences from the SCOP database.

seb@ubuntu:~$ cat sequences.txt
svydaaaqltadvkkdlrdswkvigsdkkgngvalmttlfadnqetigyfkrlgnvsqgmandklrghsitlmyalqnfidqldnpddlvcvvekfavnhitrkisaaefgkingpikkvlasknfgdkyanawaklvavvqaal
lsaaqkdnvksswakasaawgtagpeffmalfdahddvfakfsglfsgaakgtvkntpemaaqaqsfkglvsnwvdnldnagalegqcktfaanhkargisagqleaafkvlagfmksyggdegawtavagalmgmirpdm
glsaaqrqviaatwkdiagadngagvgkkclikflsahpqmaavfgfsgasdpgvaalgakvlaqigvavshlgdegkmvaqmkavgvrhkgygnkhikaqyfeplgasllsamehriggkmnaaakdawaaayadisgalisglqs
vlsegewqlvlhvwakveadvaghgqdilirlfkshpetlekfdrfkhlkteaemkasedlkkhgvtvltalgailkkkghheaelkplaqshatkhkipikylefiseaiihvlhsrhpgdfgadaqgamnkalelfrkdiaakykelgy
xslsaaeadlagkswapvfanknangldflvalfekfpdsanffadfkgksvadikaspklrdvssriftrlnefvnnaanagkmsamlsqfakehvgfgvgsaqfenvrsmfpgfvasvaappagadaawtklfgliidalkaaga
lsadqistvqasfdkvkgdpvgilyavfkadpsimakftqfagkdlesikgtapfethanrivgffskiigelpnieadvntfvashkprgvthdqlnnfragfvsymkahtdfagaeaawgatldtffgmifskm
seb@ubuntu:~$ wc -l sequences.txt
6 sequences.txt

To perform a pairwise comparison, one can use string kernels. Pairwise comparisons are usually done with alignment software; here, instead of relying on an alignment, we will use the program that we just compiled.

seb@ubuntu:~$ ./DSK 6 sequences.txt 1000 3 KernelMatrix-DSK-1000-3.txt

This performed the computation for the 6 sequences in sequences.txt with the distant segments kernel. The hyperparameters delta_m and theta_m were set to 1000 and 3, respectively. See the paper for a clear explanation of the role of these hyperparameters in the distant segments kernel. The last argument is simply the output file.

The output file contains a square matrix: for each pair of sequences, it gives the value of the distant segments kernel. Let us look at the content of this file.

seb@ubuntu:~$ cat KernelMatrix-DSK-1000-3.txt
x 0 1 2 3 4 5
0 95925 5707 6163 5114 6218 4130
1 5707 93627 9197 5396 9088 5900
2 6163 9197 103266 6711 8617 5547
3 5114 5396 6711 105790 5833 4359
4 6218 9088 8617 5833 101886 6256
5 4130 5900 5547 4359 6256 83565

As we can see, the matrix is diagonally dominant and the inner products are large. The distant segments kernel has broad applicability. One application is to use it as the similarity operator of the support vector machine (SVM). SVMlight, an implementation of the SVM algorithm, provides ways to plug in any kernel. You can download this package to use a precomputed matrix with SVMlight. Note that you need to download SVMlight separately here, and that SVMlight is free for academic use.
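
Whatever SVM package is used, the precomputed matrix has to be read back in at some point. The following sketch parses the format shown above (a header line, then one line per sequence beginning with the sequence index); the file name and the hard-coded size of 6 only match this tutorial.

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Read a kernel matrix written by DSK: a header line ("x 0 1 2 ..."),
// then one line per sequence starting with its index, followed by n values.
std::vector<std::vector<double> > readKernelMatrix(const char* fileName, int n) {
    std::ifstream in(fileName);
    std::string header;
    std::getline(in, header); // skip the header line
    std::vector<std::vector<double> > k(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; i++) {
        int index;
        in >> index; // row label
        for (int j = 0; j < n; j++)
            in >> k[i][j];
    }
    return k;
}

int main() {
    std::vector<std::vector<double> > k = readKernelMatrix("KernelMatrix-DSK-1000-3.txt", 6);
    std::cout << "k(0,1) = " << k[0][1] << std::endl; // prints 5707 with the matrix above
    return 0;
}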

For real-world bioinformatics applications, the kernel matrix needs to be normalized. One reason is that proteins share subsequences but have varying lengths. One way to normalize the matrix is to transform all associated feature vectors into vectors of unit norm. In this last part of the tutorial, I show how to normalize the kernel matrix of the distant segments kernel. Let us compile the program Normalize.

seb@ubuntu:~$ wget http://genome.ulaval.ca/dav/boiseb01/pub/software/svmoptimize-v1.zip
seb@ubuntu:~$ unzip svmoptimize-v1.zip
seb@ubuntu:~$ gcc -O4 -Wall svmoptimize-v1/Normalize.c -o svmoptimize-v1/Normalize -lm

Again, executing the program without arguments will show the correct usage.

seb@ubuntu:~$ ./svmoptimize-v1/Normalize
usage: program matrix n
k'(x,y) <- k(x,y)/sqrt(k(x,x) k(y,y))
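
In other words, each kernel value is divided by the product of the norms of the two feature vectors, and those norms are read from the diagonal of the matrix. Here is a minimal sketch of the same operation, assuming the matrix has already been loaded into a 2D array as in the reader sketch above.

#include <cmath>
#include <vector>

// Normalize a kernel matrix in place: k'(x,y) = k(x,y) / sqrt(k(x,x) * k(y,y)).
// Afterwards every feature vector implicitly has unit norm and the diagonal is 1.
void normalizeKernelMatrix(std::vector<std::vector<double> >& k) {
    int n = (int)k.size();
    std::vector<double> norms(n);
    for (int i = 0; i < n; i++)
        norms[i] = std::sqrt(k[i][i]);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            k[i][j] /= norms[i] * norms[j];
}

For example, entry (0,1) of the matrix above becomes 5707 / sqrt(95925 * 93627) ≈ 0.060220, which matches the normalized output shown below.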

The normalized matrix is generated and written to KernelMatrix-DSK-1000-3.txt.1.

seb@ubuntu:~$ ./svmoptimize-v1/Normalize KernelMatrix-DSK-1000-3.txt 6 > KernelMatrix-DSK-1000-3.txt.1
seb@ubuntu:~$ cat KernelMatrix-DSK-1000-3.txt.1
x 0 1 2 3 4 5
0 1.000000 0.060220 0.061922 0.050766 0.062897 0.046129
1 0.060220 1.000000 0.093533 0.054219 0.093049 0.066702
2 0.061922 0.093533 1.000000 0.064208 0.084008 0.059713
3 0.050766 0.054219 0.064208 1.000000 0.056184 0.046361
4 0.062897 0.093049 0.084008 0.056184 1.000000 0.067800
5 0.046129 0.066702 0.059713 0.046361 0.067800 1.000000

Normalization keeps the angles between proteins in the feature space but sets their norms to 1. Presumably, the angle between two proteins is more informative than their norms. The normalized matrix is suitable for supervised learning with the SVM, provided labels are available. This concludes the tutorial.



References:

Sébastien Boisvert, Mario Marchand, François Laviolette, and Jacques Corbeil. HIV-1 coreceptor usage prediction without multiple alignments: an application of string kernels. Retrovirology, 5(1):110, Dec 2008.

2008-12-02

My research activities

I started my master's degree in September. I also have a new web site. If everything goes as planned, I should start a PhD with professor Jacques Corbeil and professor Mario Marchand at Université Laval by next year. So far, in my first two internships I focused on microarrays for gene expression (Ubeda et al., Rochette et al.). Last year, in my third internship, I also had the chance to develop a new method for the prediction of HIV coreceptor usage with Mario Marchand, François Laviolette and Jacques Corbeil. Our manuscript was recently accepted for publication in the journal Retrovirology (to appear).

Since existing methods for this particular task all require the V3 loops (a component of HIV) to be aligned, we saw a good opportunity to apply string kernels. Recall that a kernel is a similarity measure that maps objects to a feature space in order to perform a dot product, and that string kernels are the family of kernels whose input space is the set of strings. Alignments can cause many problems. To name one, alignments break the i.i.d. hypothesis (each example is identically and independently distributed according to an unknown but constant distribution). In particular, cross-validation cannot be performed on a set of sequences that were aligned to each other, because the i.i.d. hypothesis is broken. Consequently, we developed the distant segments kernel (the source of inspiration for the name of my blog), a kernel with a very large feature space. Its feature space is constructed by counting the distances between pairs of segments inside the primary structure of a protein.

I also tested this kernel on the SCOP data set. With SVMlight (C=10) and the distant segments kernel (theta_m=3, delta_m=1000), I obtained very competitive results (e.g. ROC AUC). The mean AUC is 0.9265 for the 54 families. This is impressive because I did not apply any twisted changes to the diagonal of the kernel matrix. The kernel matrix was generated and the feature vectors were implicitly normalized to unit norm. Also, positive examples were duplicated in the training sets until a balance (i.e. 1:1) was achieved. However, I don't think that the results on this data set are significant, for any algorithm, because the training and test sets are not built from the same distribution (see this paper).

During the next year, I want to assign biological functions to some of the trypanosomatid proteins with unknown function. To tackle this task, I will presumably employ kernel methods. However, the best approach has not yet been determined by our marvelous team. This problem can be interpreted as a ranking problem, as a supervised learning problem (see this data set), or even as a semi-supervised learning problem. Currently, I am working on the assembly and annotation of the genome of Leishmania tarentolae. Also, I maintain an online bibliography.

Jean-Philippe Vert

Jean-Philippe Vert is giving two talks in Québec this week. The first will be about contributions of machine learning to bioinformatics. His second talk will be about biological networks.

Talk 1
Some contributions of machine learning in bioinformatics
December third, 2008, 10h30, room PLT-2744 (Pavillon Adrien-Pouliot)
see http://www.ulaval.ca/Al/interne/plan/AdrienPouliot/reference.htm

Many problems in bioinformatics can be formulated as pattern recognition problems on non-standard objects, such as strings, graphs or high-dimensional vectors with particular structure. They have triggered many original developments in machine learning recently, in particular in the way data are represented and prior knowledge is introduced in the algorithm. In this talk I will present some of these developments through several examples in microarray data analysis, virtual screening, and inference of biological networks.

Talk 2
Inferring and using biological networks
December fourth, 2008, 15h00, Amphithéâtre Fisher, CRCHUL, Porte TR-54
see http://maps.google.ca/maps?f=q&hl=en&geocode=&q=46.769017,+-71.281705&ie=UTF8&ll=46.769017,-71.281705&spn=0.001911,0.003648&t=h&z=18&iwloc=addr&om=1

Protein and gene networks have become important tools recently to represent and think about the complexity of biological processes involving interactions among many molecules. Two problems must nevertheless be solved in order to make them useful: (i) how to obtain large-scale networks of good quality, and (ii) how to use them in order to extract useful information from, e.g., expression data. In this talk I will address both issues with a machine learning approach.