Automating metabolomics analysis. Machine learning models and software development for mass spectrometry data analysis.

In short

PhD defence
  • 28 November 2025
  • 10.30 - 12.00 h
  • Auditorium Omnia, building 105, Wageningen Campus
  • Livestream available

Summary

Mass spectrometry plays a central role in metabolomics research by enabling the detection of a wide range of small molecules in biological samples. Yet, the structural annotation of mass spectra remains a major bottleneck; only a fraction of metabolites can be correctly annotated, leaving the majority unidentified. This thesis presents novel computational methods that enhance mass spectrometry data analysis. These new methods make analysis faster, easier and more informative, thereby facilitating new biological discoveries.

A central theme of this work is the application of machine learning to predict relationships between mass spectra. One major obstacle is the limited availability of high-quality, machine-readable public datasets. Many spectral libraries lack standardized metadata, making it difficult to directly train machine learning models. To overcome this, we have developed a flexible and reproducible pipeline for cleaning and harmonizing both the spectra and their associated metadata. A second challenge is the lack of established benchmarks for evaluating new methods. Without clear standards, assessing model performance can be inconsistent and subjective. In this thesis, we address this by outlining key principles to ensure fair, representative, and reproducible benchmarking of computational approaches in mass spectrometry.
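As an illustration of what harmonizing spectrum metadata can involve, here is a minimal sketch in Python. It is not the pipeline developed in the thesis: the function name `harmonize_metadata`, the alias table, and the accepted ion mode spellings are all hypothetical and chosen only for this example.

```python
# Hypothetical sketch: the key aliases and value rules below are assumptions
# for illustration, not the rules used by the thesis pipeline.
KEY_ALIASES = {
    "precursormz": "precursor_mz",
    "precursor_mass": "precursor_mz",
    "pepmass": "precursor_mz",
    "ion_mode": "ionmode",
    "ionization_mode": "ionmode",
    "name": "compound_name",
}

IONMODE_VALUES = {
    "positive": "positive", "pos": "positive", "+": "positive",
    "negative": "negative", "neg": "negative", "-": "negative",
}


def harmonize_metadata(raw: dict) -> dict:
    """Return a copy of `raw` with normalized keys and cleaned values."""
    clean = {}
    for key, value in raw.items():
        # Normalize the key spelling, then map known aliases to one name.
        key = key.strip().lower().replace(" ", "_")
        key = KEY_ALIASES.get(key, key)
        if isinstance(value, str):
            value = value.strip()
        clean[key] = value

    # Map the many spellings of ion mode onto a fixed vocabulary.
    if "ionmode" in clean:
        clean["ionmode"] = IONMODE_VALUES.get(
            str(clean["ionmode"]).lower(), "unknown"
        )

    # Store the precursor m/z as a number, or None if it cannot be parsed.
    if "precursor_mz" in clean:
        try:
            clean["precursor_mz"] = float(clean["precursor_mz"])
        except (TypeError, ValueError):
            clean["precursor_mz"] = None
    return clean
```

Applied to an entry such as `{"PrecursorMZ": " 301.14 ", "Ion Mode": "Pos"}`, the sketch yields consistent key names, a numeric precursor m/z, and a controlled ion mode value, which is the kind of uniform input a machine learning model needs.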

A major contribution of this thesis is MS2Query, a tool designed to predict not only exact library matches but also analogues: structurally similar compounds that are not present in the reference library. Searching for closely related metabolites substantially expands the number of metabolites for which predictions can be made. Previous implementations of analogue searching, however, carried a high risk of predicting incorrect analogues. MS2Query is substantially more accurate than analogue searching based on MS2DeepScore or the modified cosine score. This improved accuracy makes analogue searching with MS2Query a valuable tool for exploring mass spectrometry datasets.
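The difference between an exact library match and an analogue search can be sketched in a few lines of Python. This is only a conceptual illustration under simplifying assumptions: it ranks candidates with a plain cosine score on binned peaks and does not reproduce the MS2Query model. The function names, the dictionary layout of a spectrum, and the thresholds are hypothetical.

```python
import math


def bin_peaks(peaks, bin_width=1.0):
    """Sum the intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz // bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned


def cosine(a, b):
    """Cosine similarity between two binned spectra stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, library, mode="exact", mz_tolerance=0.01, min_score=0.7):
    """Rank library entries against a query spectrum.

    mode="exact":    only entries whose precursor m/z matches the query.
    mode="analogue": any entry may match, so structurally related compounds
                     with a different mass can be returned.
    """
    binned_query = bin_peaks(query["peaks"])
    hits = []
    for entry in library:
        mz_diff = abs(entry["precursor_mz"] - query["precursor_mz"])
        if mode == "exact" and mz_diff > mz_tolerance:
            continue
        score = cosine(binned_query, bin_peaks(entry["peaks"]))
        if score >= min_score:
            hits.append((entry["name"], score))
    # Best-scoring candidates first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Dropping the precursor mass filter is what widens the search to analogues, and it is also what raises the risk of false positives, since the similarity score alone must then separate related from unrelated compounds.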

We also introduce MS2DeepScore 2.0, a unified model capable of predicting chemical similarity both within and across ion modes. Previous methods required separate handling of positive and negative ion mode spectra. By bridging this gap, we enable seamless integration between ion modes, thereby making data analysis of mass spectra easier and more informative.

Altogether, the methods developed in this thesis advance how we organize and interpret mass spectrometry data. By introducing new computational methods and concepts, such as analogue searching and cross-ionization mode similarity prediction, this work expands the analytical possibilities beyond traditional structural annotation. These contributions not only enhance our ability to extract meaningful insights from complex metabolomics datasets but also provide robust, reusable software foundations that can accelerate further innovations in computational mass spectrometry.

PhD candidate

Candidate of the PhD defence "Automating metabolomics analysis. Machine learning models and software development for mass spectrometry data analysis."

NF (Niek) de Jonge, MSc

About the PhD defence

Date

Fri 28 November 2025 10:30 - 12:00

Duration description

10.30 - 12.00 h

Organisational unit

Wageningen University & Research, Bioinformatics (BIF), EPS

Location

Omnia - Building 105

PhD candidate

NF (Niek) de Jonge, MSc

Promoters

prof.dr.ir. D (Dick) de Ridder

Co-promoters

dr. JJJ (Justin) van der Hooft

External promoters

Florian Huber