MSc. Thesis Defense: Ahmed Mohamed Mahmoud Elmoselhy Salem
Active Learning for Drug Blood-Brain Barrier Permeability Prediction
Ahmed Mohamed Mahmoud Elmoselhy Salem
Data Science, MSc. Thesis, 2024
Thesis Jury
Assoc. Prof. Öznur Taştan (Thesis Advisor),
Asst. Prof. Nur Mustafaoğlu Varol,
Asst. Prof. Gülden Olgun.
Date & Time: July 22, 2024 – 11:00 AM
Place: FENS 2019
Zoom Link: https://sabanciuniv.zoom.us/j/8589218567
Keywords: Active Learning, Blood-Brain Barrier, Molecular Scaffolds, Scaffold Splitting, QSAR
Abstract
The blood-brain barrier (BBB) is a highly selective semipermeable border that regulates the transfer of chemicals between the circulatory system and the central nervous system (CNS). Assessing whether a compound can permeate the BBB is critical in drug development for treating CNS disorders, as it determines the compound's ability to reach targets within the brain.
The chemical space is vast, and traditional methods for measuring a compound's BBB permeability are time-consuming and costly. With the availability of open datasets of compounds with experimentally verified permeability labels, several machine learning (ML) models have been proposed to accelerate BBB permeability assessment. Supervised machine learning requires a large pool of labeled data for the model to learn from. However, the available labeled datasets are still far from covering the enormous chemical space. Thus, traditional supervised passive learning procedures fall short. The active learning (AL) framework offers an alternative. By strategically selecting which examples to label in each iteration, an active learner can iteratively obtain a high-accuracy classifier with fewer label requests than passive learning.
In this thesis, we explore various AL strategies for predicting the BBB permeability of chemical compounds and compare their effect on the performance of machine learning models. Specifically, we examine the following sampling strategies: random sampling, uncertainty-based sampling, dissimilarity-based sampling, and a combined strategy that integrates uncertainty and dissimilarity. We compare AL methods against passive learning in two separate setups: one in which the data are split with stratification on the class label, and another in which the data are split by the molecular scaffold of the compounds, which is a harder evaluation setup. Our results show that the scaffold splitting setup yields lower performance than the label-stratified setup in both the passive and active learning paradigms.
Furthermore, our experiments revealed that the active learning approaches we implemented matched passive learning performance on almost every metric we tested, typically after using only 10-65% of the labeled data, depending on the specific metric. This highlights the potential of AL methods to reduce the need for large labeled datasets while maintaining high performance in predicting BBB permeability.
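To make the four sampling strategies named in the abstract concrete, the following is a minimal sketch of how one query step might rank unlabeled compounds. It is an illustration under stated assumptions, not the implementation used in the thesis: fingerprints are assumed to be given as sets of on-bit indices, dissimilarity is assumed to be one minus the maximum Tanimoto similarity to the labeled set, the combined score is assumed to be a weighted sum with a hypothetical weight `alpha`, and all function names are hypothetical.

```python
import random

STRATEGIES = ("random", "uncertainty", "dissimilarity", "combined")

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def uncertainty(p):
    """Binary-classifier uncertainty: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1."""
    return 1.0 - 2.0 * abs(p - 0.5)

def dissimilarity(fp, labeled_fps):
    """Assumed definition: 1 minus the highest similarity to any labeled compound."""
    if not labeled_fps:
        return 1.0
    return 1.0 - max(tanimoto(fp, other) for other in labeled_fps)

def select_batch(probs, pool_fps, labeled_fps, k,
                 strategy="combined", alpha=0.5, seed=0):
    """Return indices of up to k pool compounds to send for labeling.

    probs       -- predicted P(permeable) for each unlabeled compound
    pool_fps    -- fingerprints of the unlabeled pool (sets of ints)
    labeled_fps -- fingerprints of already labeled compounds
    alpha       -- hypothetical weight on uncertainty in the combined score
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    indices = list(range(len(probs)))
    if strategy == "random":
        rng = random.Random(seed)
        return sorted(rng.sample(indices, min(k, len(indices))))

    def score(i):
        u = uncertainty(probs[i])
        d = dissimilarity(pool_fps[i], labeled_fps)
        if strategy == "uncertainty":
            return u
        if strategy == "dissimilarity":
            return d
        return alpha * u + (1.0 - alpha) * d  # combined

    # Highest score first; ties keep the original pool order.
    return sorted(indices, key=score, reverse=True)[:k]

# Toy pool of four compounds and one labeled compound.
probs = [0.95, 0.52, 0.10, 0.60]
pool_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
labeled_fps = [{1, 2, 3}]

picked_uncertain = select_batch(probs, pool_fps, labeled_fps, 2, "uncertainty")
picked_dissimilar = select_batch(probs, pool_fps, labeled_fps, 2, "dissimilarity")
picked_combined = select_batch(probs, pool_fps, labeled_fps, 2, "combined")
```

In a full active learning loop, the selected compounds would be labeled, moved from the pool into the training set, and the classifier retrained before the next query. In this toy pool the uncertainty strategy favors the compounds whose predictions are closest to 0.5, while the dissimilarity strategy favors the compound that shares no bits with the labeled set.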