Description : | Scientific background
Thanks to the progress of sequencing technologies, it is now possible to collect and access sequencing and genomic information for large cohorts of patients, paving the way for the use of these technologies for many biotechnological and medical applications.
Patient stratification and disease grading based on sequencing results are therefore becoming the standard of care in several diseases, and in cancer in particular. Despite extensive research on the topic, discovering biomarkers from transcriptomic data remains challenging: patient stratification based on transcriptomic data often leads to results that are difficult to replicate, and the genes selected as biomarkers are highly unstable [1].
Several factors make biomarker discovery particularly difficult in this context. First, despite the increase in cohort sizes, the number of explanatory variables remains much larger than the number of patients, a setting well known to be statistically difficult. A second difficulty arises from the high level of correlation between genes: many biological processes induce correlations between the expression levels of different genes. As a result, it becomes difficult to distinguish genes that are truly causally associated with the outcome of interest from those that appear associated with the outcome solely because they are correlated with a causal gene.
To tackle this question, we recently proposed to use a method called Knockoffs (KO), which was developed specifically to improve variable selection in high dimension [2], in particular when input variables are correlated. We showed that, in a simulated framework, KO improves marker gene selection compared with the state of the art, drastically reducing the number of false discoveries. Despite these promising results, when applied to real data the method tends to discover very few or no marker genes. In addition, the method fails when the outcome variable (i.e. the health status of the patient) depends non-linearly on the input gene expression.
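For readers unfamiliar with the selection step of the knockoff filter, the following is a minimal sketch, assuming feature statistics W have already been computed (large positive W_j suggests that gene j matters more than its knockoff copy). The function names and the example values are hypothetical and are not taken from our pipeline. The sketch uses the "knockoff+" threshold, i.e. the smallest t such that (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q for a target false discovery rate q.

```python
import numpy as np


def knockoff_plus_threshold(W, q):
    """Return the knockoff+ threshold for statistics W at target FDR q.

    The threshold is the smallest t among the non-zero |W_j| such that
    (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q, or +inf if no such t exists.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct non-zero magnitudes, in increasing order.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        fdp_estimate = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_estimate <= q:
            return float(t)
    return float("inf")


def select_features(W, q):
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_plus_threshold(W, q))


# Hypothetical statistics for 12 genes (illustrative values only).
W_example = [5, 4, 3, 2.5, 2, 1.5, 1.2, 1.1, 1.0, 0.9, -0.5, 0.3]
selected = select_features(W_example, q=0.1)
```

With this rule, the estimate can only fall below q if at least 1/q features are selected, so a dataset with only a handful of strong signals yields an empty selection at small q. This may be one contributing factor to the small number of discoveries on real data, although we do not claim it is the only one.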
PhD Objectives
The goal of this PhD project is to build on our previous results to develop a KO-based gene selection method that can select genes even when the dependency between genes and outcome is not linear, as such non-linear scenarios are very likely to occur in real datasets. In addition, due to its high computational cost, KO is currently difficult to apply to very large datasets, in particular when the number of features is high. In our experiments, KO can handle at most about 1,000 genes, far from the roughly 20,000 genes of the human genome.
To overcome these limitations, we propose to develop a KO method that selects groups of genes together, instead of single genes as in our current procedure. Work on this topic already exists [3,4], but its applicability to classification problems on genome-wide transcriptomic data remains to be determined. We further propose to group genes based on their common functional roles, thus incorporating prior biological information into the gene selection process. Selecting groups of genes together would effectively reduce the number of input features, and thus make it possible to apply KO to larger datasets.
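As an illustration of what selecting groups of genes could look like, the sketch below aggregates per-gene importances into one statistic per group. It assumes that importances (for example absolute model coefficients) are available for each gene and for its knockoff copy, and that each gene belongs to exactly one functional group. The group-level statistic used here (sum of importances of the original genes minus that of their knockoffs) is one possible choice, assumed for illustration; it is not necessarily the construction of [3,4] nor the one the project will adopt. The function name, the group labels and the values are hypothetical.

```python
import numpy as np


def group_statistics(importance, importance_knockoff, groups):
    """Aggregate per-gene importances into one knockoff statistic per group.

    importance, importance_knockoff: per-gene scores for the original genes
    and their knockoff copies (absolute values are taken).
    groups: one group label per gene.
    Returns (labels, W) where W[g] = sum of original scores in group g
    minus sum of knockoff scores in group g.
    """
    imp = np.abs(np.asarray(importance, dtype=float))
    imp_ko = np.abs(np.asarray(importance_knockoff, dtype=float))
    labels, index = np.unique(np.asarray(groups), return_inverse=True)
    n_groups = len(labels)
    W = (np.bincount(index, weights=imp, minlength=n_groups)
         - np.bincount(index, weights=imp_ko, minlength=n_groups))
    return labels, W


# Four hypothetical genes assigned to two hypothetical functional groups.
labels, W_groups = group_statistics(
    importance=[1.0, -2.0, 0.0, 0.5],
    importance_knockoff=[0.5, 0.0, 1.0, 0.0],
    groups=["pathway_A", "pathway_A", "pathway_B", "pathway_B"],
)
```

The resulting vector has one entry per group rather than one per gene, which is the sense in which grouping reduces the number of input features seen by the selection step.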
Required skills:
Candidates should have a background in statistics and machine learning or bioinformatics, and a strong interest in both method development and biological and medical applications.
Scientific environment
The Center for Computational Biology (CBIO) is a research center at Mines Paris; it is affiliated with its Department « Mathematics and Systems » and the joint unit "Computational Oncology (U1331)" with Institut Curie and INSERM. The CBIO develops methods in artificial intelligence, machine learning, and computer vision for applications in life sciences, covering a wide range of applications from fundamental biology to clinical applications. CBIO's collaborations allow it to work on data from various sources, such as DNA sequencing technologies, spatial transcriptomics, protein structures, large-scale microscopy, medical imaging, and electronic health records. The CBIO develops innovative mathematical methods and algorithms to analyze these massive, heterogeneous, and complex data, thus addressing biological or clinical questions. The CBIO is involved in several major initiatives in France, both for methodological development in AI and its applications in health.
Supervision
The PhD would be co-supervised by C.-A. Azencott, Professor in Machine Learning for Genomics at Ecole des Mines, and F. Massip, chargé de recherche in Bioinformatics at Ecole des Mines.
Funding
The opening of the position is not guaranteed and depends on obtaining a scholarship through the funding schemes of Ecole des Mines.
Refs:
[1] Anne-Claire Haury, Pierre Gestraud, Jean-Philippe Vert, The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures, PLoS ONE, (2011)
[2] Emmanuel Candès, Yingying Fan, Lucas Janson, Jinchi Lv, Panning for Gold: ‘Model-X’ Knockoffs for High Dimensional Controlled Variable Selection, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 80, Issue 3, June 2018, Pages 551–577, https://doi.org/10.1111/rssb.12265
[3] Chu BB, Gu J, Chen Z, Morrison T, Candès E, He Z, Sabatti C. Second-order group knockoffs with applications to genome-wide association studies. Bioinformatics. 2024 Oct 1;40(10):btae580. doi: 10.1093/bioinformatics/btae580.
[4] Guangyu Zhu, Tingting Zhao (2021), Deep-gKnock: Nonlinear group-feature selection with deep neural networks, Neural Networks, 135, 139-147. doi: 10.1016/j.neunet.2020.12.004 |