MSc. Thesis Defense: Emine Ayşe Sunar
Improving DeepKinZero with Protein Language Models and Transductive Learning
Emine Ayşe Sunar
Computer Science and Engineering, MSc. Thesis, 2024
Thesis Jury
Assoc. Prof. Öznur Taştan (Thesis Advisor),
Assoc. Prof. Arzucan Özgür, Asst. Prof. Onur Varol
Date & Time: July 23rd, 2024 – 11:00 AM
Place: Fens L027
Keywords: Benchmark Dataset, Protein Language Models, Kinases, Phosphorylation, Zero-Shot Learning, Transductive Learning
Abstract
Phosphorylation is a critical post-translational modification that regulates numerous cellular processes, including cell signaling. Kinases are the enzymes that catalyze phosphorylation events. Because of their essential roles in the cell, kinases are major drug targets. The amino acid residue in the substrate protein that receives the phosphate is termed a phosphosite. While high-throughput experimental techniques can detect phosphosites, identifying the specific kinases that phosphorylate these sites remains challenging. Computational methods, which typically rely on supervised learning and existing training data, fall short for understudied kinases, also known as dark kinases, because too few training examples are available for them.
Our research group previously addressed this data limitation by framing the prediction of dark kinases as a zero-shot learning problem and introduced DeepKinZero. DeepKinZero takes the phosphosite with its surrounding sequence, together with kinase attributes, and transfers knowledge from well-studied kinases to understudied kinases to make predictions. In this thesis, we aim to enhance DeepKinZero in several respects. First, we present a new evaluation setup in which the splitting strategy takes into account not only the zero-shot nature of the problem but also kinase group memberships and kinase sequence similarities. The resulting dataset, DARKIN, serves as a challenging benchmark designed to accurately assess zero-shot learning performance on dark kinase-phosphosite prediction tasks.
Second, we improve the protein sequence representation by evaluating various protein language models on this task. As part of this study, we present two zero-shot models, a zero-shot k-NN model and a zero-shot bilinear model, to benchmark the representation power of protein language models. Third, we demonstrate that using kinase active sites can be as effective as using the entire kinase domain; models using these active sites slightly surpass the performance of the original DeepKinZero model. Additionally, we explore a transductive approach and pseudo-labeling strategies to leverage the known sequences of unlabeled phosphosites.
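As a rough illustration of the zero-shot bilinear idea mentioned above, the sketch below scores a phosphosite embedding x against each candidate kinase embedding a_k with a compatibility function x^T W a_k and predicts the highest-scoring kinase. This is a minimal sketch under stated assumptions, not the model developed in the thesis: the function names, the embedding dimensions, the random placeholder embeddings, and the untrained matrix W are all hypothetical. In practice the embeddings would come from a protein language model and W would be learned on well-studied kinases.

```python
import numpy as np

def bilinear_scores(site_embs, W, kinase_embs):
    """Compatibility scores x^T W a_k for every (site, kinase) pair.

    site_embs:   (n_sites, d_site) phosphosite embeddings (assumed given)
    W:           (d_site, d_kin) compatibility matrix (assumed already trained)
    kinase_embs: (n_kinases, d_kin) embeddings of the candidate kinases
    Returns an (n_sites, n_kinases) score matrix.
    """
    return site_embs @ W @ kinase_embs.T

def predict_zero_shot(site_embs, W, kinase_embs, kinase_names):
    """Assign each site to the candidate kinase with the highest score."""
    scores = bilinear_scores(site_embs, W, kinase_embs)
    return [kinase_names[i] for i in np.argmax(scores, axis=1)]

# Hypothetical toy data: random vectors stand in for real embeddings.
rng = np.random.default_rng(0)
d_site, d_kin = 8, 5
W = rng.normal(size=(d_site, d_kin))            # placeholder, not trained
unseen_names = ["kinase_A", "kinase_B", "kinase_C"]
unseen_embs = rng.normal(size=(3, d_kin))       # kinases never seen in training
sites = rng.normal(size=(4, d_site))            # four unlabeled phosphosites

predictions = predict_zero_shot(sites, W, unseen_embs, unseen_names)
print(predictions)
```

Because the candidate kinases enter only through their embeddings, the same W can in principle score kinases that had no training examples, which is what makes the setup zero-shot.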