2008-12-02

My research activities

I started my master's degree in September. I also have a new web site. If everything goes as planned, I should start a PhD with Professor Jacques Corbeil and Professor Mario Marchand at the Université Laval by next year.

So far, in my first two internships, I focused on microarrays for gene expression (Ubeda et al., Rochette et al.). Last year, in my third internship, I also had the chance to develop a new method for predicting HIV coreceptor usage with Mario Marchand, François Laviolette and Jacques Corbeil. Our manuscript was recently accepted for publication in the journal Retrovirology (to appear).

Since existing methods for this particular task all require each V3 loop (a component of HIV) to be aligned, we saw a good opportunity to apply string kernels. Recall that a kernel is a similarity measure that implicitly maps objects to a feature space and computes a dot product there. Recall also that string kernels are the family of kernels whose input space is the set of strings.

Alignments can cause many problems. To name just one, alignments break the i.i.d. hypothesis (the examples are independently and identically distributed according to an unknown but fixed distribution). In particular, cross-validation cannot be performed on a set of sequences that were aligned to each other, because the i.i.d. hypothesis is broken. Consequently, we developed the distant segments kernel (the source of inspiration for naming my blog), a kernel with a very large feature space. Its feature space is constructed by counting the distances between pairs of segments inside the primary structure of a protein.

I also tested this kernel on the SCOP data set. With SVMlight (C=10) and the distant segments kernel (theta_m=3, delta_m=1000), I obtained very competitive results as measured by ROC AUC: the mean AUC is 0.9265 over the 54 families. This is impressive because I did not apply any tricks to the diagonal of the kernel matrix.
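
To make the idea of counting pairs of segments and their distances more concrete, here is a minimal sketch in Python. It is a simplification written for illustration and not the exact definition from the manuscript: I assume a feature is a triple (first segment, second segment, distance between their start positions), that segments have length at most theta_m, that the distance is at most delta_m, and that segments may overlap. The function names are hypothetical, and the feature map is built explicitly, which is only practical for short sequences.

```python
from collections import Counter


def distant_segment_features(sequence, theta_m=3, delta_m=1000):
    """Count (segment, segment, distance) triples in a sequence.

    Simplified illustration: the distance is measured between the start
    positions of the two segments, and segments are allowed to overlap.
    """
    features = Counter()
    n = len(sequence)
    for i in range(n):
        # j - i is the distance, kept between 1 and delta_m
        for j in range(i + 1, min(n, i + delta_m + 1)):
            for len_a in range(1, min(theta_m, n - i) + 1):
                first = sequence[i:i + len_a]
                for len_b in range(1, min(theta_m, n - j) + 1):
                    second = sequence[j:j + len_b]
                    features[(first, second, j - i)] += 1
    return features


def ds_kernel_sketch(s, t, theta_m=3, delta_m=1000):
    """Dot product of the two explicit feature vectors."""
    fs = distant_segment_features(s, theta_m, delta_m)
    ft = distant_segment_features(t, theta_m, delta_m)
    # A Counter returns 0 for a missing key, so unshared features add nothing.
    return sum(count * ft[key] for key, count in fs.items())


if __name__ == "__main__":
    # Toy strings, not real V3 loops.
    print(ds_kernel_sketch("ACDEFG", "ACDFFG"))
```

Note that no alignment is needed: the two strings can have different lengths, and the kernel value only depends on which segment pairs they share at which distances.
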
The kernel matrix was generated and feature vectors were implicitly normalized to unit norm. Also, positive examples were duplicated in the training sets until a balance (i.e., 1:1) was achieved. However, I don't think that the results on this data set are significant, for any algorithm, because the training and test sets are not drawn from the same distribution (see this paper).

During the next year, I want to assign biological functions to some of the trypanosomatid proteins with unknown function. To tackle this task, I will presumably employ kernel methods. However, the best approach has not yet been determined by our marvelous team. This problem can be interpreted as a ranking problem, as a supervised learning problem (see this data set), or even as a semi-supervised learning problem. Currently, I am working on the assembly and annotation of the genome of Leishmania tarentolae. Also, I maintain an online bibliography.
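
For the SCOP experiment mentioned above, the two preprocessing steps (implicit normalization to unit norm and duplication of positive examples) can be sketched as follows. This is an illustration under my own assumptions, not the code used for the experiment: the function names are hypothetical, the kernel matrix is a plain list of lists with a strictly positive diagonal, positive examples carry the label 1, and positives are duplicated in a simple round-robin order.

```python
from math import sqrt


def normalize_kernel_matrix(K):
    """Return K'[i][j] = K[i][j] / sqrt(K[i][i] * K[j][j]).

    This equals the dot product of the feature vectors after scaling each
    one to unit norm, without ever building the vectors explicitly.
    Assumes every diagonal entry is strictly positive.
    """
    n = len(K)
    diag = [K[i][i] for i in range(n)]
    return [[K[i][j] / sqrt(diag[i] * diag[j]) for j in range(n)]
            for i in range(n)]


def balance_by_duplication(examples, labels, positive=1):
    """Duplicate positive examples until positives and negatives are 1:1."""
    pos = [i for i, y in enumerate(labels) if y == positive]
    neg = [i for i, y in enumerate(labels) if y != positive]
    if not pos or len(pos) >= len(neg):
        return list(examples), list(labels)
    # Cycle through the positives until the counts match.
    extra = [pos[k % len(pos)] for k in range(len(neg) - len(pos))]
    indices = list(range(len(labels))) + extra
    return [examples[i] for i in indices], [labels[i] for i in indices]
```

After normalization every diagonal entry of the matrix is 1, so long and short sequences contribute on the same scale. Duplication should be applied to the training set only, as in the experiment described above.
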
