Thesis defence - Lucas Etourneau - FDR control and missing value imputation for the analysis of mass spectrometry-based proteomics data

On January 24, 2024

Abstract

Proteomics involves characterizing the proteome of a biological sample, that is, the set of proteins it contains, and doing so as exhaustively as possible. By identifying and quantifying protein fragments that are analyzable by mass spectrometry (known as peptides), proteomics provides access to the level of gene expression at a given moment. This is crucial information for improving the understanding of the molecular mechanisms at play within living organisms. These experiments produce large amounts of data, which are often complex to interpret and subject to various biases. They require reliable data processing methods that ensure a certain level of quality control, so as to guarantee the relevance of the resulting biological conclusions.

The work of this thesis focuses on improving this data processing, and specifically on the following two major points:

The first is controlling the false discovery rate (FDR) when identifying either (1) peptides or (2) quantitatively differential biomarkers between a tested biological condition and its negative control. Our contributions focus on establishing links between the empirical methods stemming from proteomic practice and other, theoretically supported methods. This notably allows us to provide directions for improving the FDR control methods used for peptide identification.

The second point focuses on managing missing values, which are often numerous and complex in nature, making them impossible to ignore. Specifically, we have developed a new imputation algorithm, named Pirat, that leverages the specificities of proteomics data. Pirat has been tested and compared to other methods on multiple datasets and according to various metrics, and it generally achieves the best performance. Moreover, it is the first algorithm that allows imputation following the trending "multi-omics" paradigm: when relevant to the experiment, it can impute more reliably by relying on transcriptomic information, which quantifies the level of messenger RNA expression in the sample. Finally, Pirat is implemented in a freely available software package, making it easy to use for the proteomics community.

This thesis was supervised by:

- Thomas Burger, Directeur de recherche CNRS (EDyP/BGE/IRIG/CEA Grenoble)

- Nelle Varoquaux, Chargée de recherche CNRS (TrEE/TIMC)

The invited members of the jury are:

- Nataliya Sokolovska, Professeure des universités, Sorbonne Université

- Julie Josse, Advanced Researcher, INRIA

- Adeline Leclercq-Samson, Professeure des universités, Université Grenoble-Alpes

- Guillaume Fertin, Professeur des universités, Nantes Université

- Quentin Giai-Gianetto, Ingénieur de recherche, Institut Pasteur


A reception will then be organized in the cafeteria of the TIMC at the Taillefer pavilion.
Published on January 16, 2024
Updated on January 16, 2024