I develop bioinformatic and statistical methods to analyse genomic data and address health-related questions. In recent years, I have focused on three main research questions:
Horizontal Gene Transfer in Bacteria
Recent advances in sequencing technologies have made it possible to analyse microbial communities by sequencing the metagenomes of a wide variety of environments, further highlighting the crucial importance of microorganisms for human health. The huge amount of data generated this way is a tremendous opportunity to improve our understanding of microbial biology, but it also requires the development of efficient dedicated methods.
The ability of bacteria belonging to different species to exchange genetic material via Horizontal Gene Transfer (HGT) is a key mechanism of microbial evolution and an important source of genetic novelty for microbial species. On short timescales, it is the primary reason for the spread of antibiotic resistance in bacteria (1) and plays an important role in the evolution of virulence (2); as such, it is of crucial importance for human health. However, HGT events occurring between distantly related organisms remain poorly studied. Notably, the biological conditions that make long-distance HGT possible, and that influence the rates at which species exchange material, are not understood. Finally, the routes that genes take to spread via HGT across the tree of life are mostly unknown.
My work aims at developing methods to study those questions. To do so, I focus on analysing the statistical properties of long Maximal Exact Matches (Sheinman et al.). The rationale is that long DNA sequences (>300bp) that are exactly identical between pairs of bacteria of different species are extremely unlikely to arise via classical vertical inheritance. Hence, such long exact matches must result from a recent HGT event. Using efficient algorithms that detect exact matches at low computational cost thus makes it possible to identify HGT in very large datasets.
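To make the idea concrete, here is a minimal sketch of how long exact matches between two sequences could be listed. It is an illustration only, not the method of Sheinman et al.: the function name `maximal_exact_matches` and its parameters are hypothetical, the scan is a naive diagonal walk with quadratic cost (real analyses rely on much faster index-based algorithms), and reverse-complement matches are ignored.

```python
def maximal_exact_matches(a, b, min_len=300):
    """Return (start_a, start_b, length) for every maximal exact match
    between strings a and b that is at least min_len characters long.

    Naive O(len(a) * len(b)) scan, for illustration only.
    """
    matches = []
    n, m = len(a), len(b)
    # Each diagonal of the comparison matrix is identified by offset = j - i.
    for offset in range(-(n - 1), m):
        i = max(0, -offset)
        j = i + offset
        run = 0  # length of the current stretch of identical characters
        while i < n and j < m:
            if a[i] == b[j]:
                run += 1
            else:
                # The stretch just ended: keep it if it is long enough.
                if run >= min_len:
                    matches.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A stretch may reach the end of the diagonal.
        if run >= min_len:
            matches.append((i - run, j - run, run))
    return matches


if __name__ == "__main__":
    # Toy sequences with a short threshold; real data would use min_len=300.
    seq_a = "TTACGTACGTAA"
    seq_b = "GGACGTACGTCC"
    for start_a, start_b, length in maximal_exact_matches(seq_a, seq_b, min_len=8):
        print(start_a, start_b, length, seq_a[start_a:start_a + length])
```

With the 300bp threshold mentioned above, any tuple returned for two genomes from different species would be a candidate recent HGT event under the stated rationale.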
Pursuing this avenue of research, we recently demonstrated that studying the length distribution of exact sequence matches observed in whole-genome alignments of two bacterial species makes it possible to efficiently separate sequence similarities inherited from their common ancestor from those acquired via horizontal gene transfer. Based on this result, we proposed a novel concept, the “mosaic molecular clock”, and developed a mathematical framework to accurately calibrate bacterial phylogeny using solely genomic data. The full study can be found here.
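As a rough illustration of the quantity involved, the sketch below counts the lengths of exact-match runs in a pairwise alignment. It assumes the alignment is given as two gapped strings of equal length with `-` marking gaps; the function name `match_length_distribution` is hypothetical, and the statistical step that separates vertically inherited matches from transferred ones is not reproduced here.

```python
from collections import Counter


def match_length_distribution(aligned_a, aligned_b):
    """Count runs of identical, non-gap columns in a pairwise alignment.

    Returns a Counter mapping run length -> number of runs of that length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    counts = Counter()
    run = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == y and x != "-":
            run += 1
        else:
            # A mismatch or a gap closes the current exact match.
            if run:
                counts[run] += 1
            run = 0
    if run:
        counts[run] += 1
    return counts


if __name__ == "__main__":
    dist = match_length_distribution("ACGTTACG-TA", "ACGATACGGTA")
    for length in sorted(dist):
        print(length, dist[length])
```

The resulting histogram of match lengths is the kind of summary on which such an analysis could be built.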
Early Detection of Lung Cancer
Lung cancer is one of the major causes of cancer-related deaths in the western world. A direct connection to lifestyle risk in the form of cigarette smoke has long been established, and ex-smokers remain at risk, since >50% of all lung cancers occur in former smokers. On the other hand, only 10-20% of heavy smokers develop lung cancer in the course of their lifetime, raising questions about the factors that explain such variability. It has been postulated that individual differences in smoke injury response might modulate personal cancer risk, and that cancer onset results from the combination of environmental exposure and a disadvantageous genetic background. Early studies of the gene expression dynamics induced by smoke exposure provided a broad characterisation of the genes involved (3). However, the lack of statistical power of these studies has prevented the discovery of individual differences among smokers. As a consequence, the role of subject-specific genetic background in the response to smoke injury remains poorly understood.
We have recently developed a classifier to improve early detection of lung cancer (De Biase et al.). To do so, we collected nasal swabs from 500 subjects to conduct transcriptomic analysis. We also genotyped the subjects in order to understand whether patients’ genetic background could influence lung cancer risk. Our study demonstrates that transcriptomic analysis of nasal swabs can serve as an efficient early detection tool for lung cancer diagnosis. We further provide evidence from systems biology approaches that demonstrates systematic dysregulation of immune regulatory networks in smokers and its link to increased lung cancer risk. These results are also supported by the analysis of germline variants, thereby providing evidence for a risk-related connection between germline, environmental and transcriptional responses.
DNA Replication Origins
During each cell cycle, the genome must be accurately replicated to ensure the faithful transmission of the genetic material to daughter cells. In vertebrates, DNA replication starts at specific sites, called replication origins. The positions of replication origins have been identified in a handful of eukaryotic genomes (human, mouse, chicken, Drosophila and Leishmania major). However, the determinants of replication origin positions are still poorly understood.
To study the determinants of replication origins, I conducted an evolutionary analysis of vertebrate replication origins (Massip et al.), and I am currently analysing the link between the accumulation of somatic mutations in cancers and the position of replication origins.
(1) : https://doi.org/10.2147/IDR.S48820
(2) : https://doi.org/10.1177/0300985813511131
(3) : https://doi.org/10.1186/gb-2007-8-9-r201