Abstract: Protein Language Models (PLMs) are Transformer-based models trained on massive sets of protein sequences (represented as text) in an attempt to learn the biological 'grammar' of proteins. These models have a broad range of applications, thanks to their generative and embedding abilities.
In this workshop, we will get more familiar with this type of model, how they differ from their NLP counterparts, and the tasks they can address. We will also get a short overview of existing open-source models and datasets.
During the hands-on session, we will start from a pre-trained language model and develop a basic example of a multi-label protein function classifier. We will then develop, compare, and benchmark different classification approaches, including a simple retrieval-augmented enhancement and fine-tuning.
Part 1: Introduction to Protein Language Models
An overview of protein language models (PLMs) and their significance in bioinformatics for synthetic biology use cases. This part includes an explanation of how PLMs differ from traditional natural language processing models, and examples of popular PLMs such as ProtBERT, ProtTrans, and Ankh.
Part 2: Hands-On Exercise
Part 2.1: Setup and first interactions
After setting up a Google Colab environment (Python), you will get familiar with protein language models by retrieving one from the Hugging Face Hub and start working with a protein dataset from the literature. This includes exploring the dataset content, getting to know more about the protein function prediction task, and experimenting with simple encoding/decoding of protein sequences.
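The encoding/decoding step above can be sketched as follows. This is a minimal, pure-Python stand-in for a real PLM tokenizer: ESM-style tokenizers treat each amino-acid letter as one token, which a character-level vocabulary mimics. The special-token names and the example sequence are illustrative, not taken from any specific library.

```python
# Character-level sketch of protein sequence encoding/decoding.
# Real PLM tokenizers (e.g. for ESM-2) work similarly: one token per
# amino acid, bracketed by special tokens.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Special tokens mimic the <cls>/<pad>/<eos>/<unk> conventions of real tokenizers.
VOCAB = ["<cls>", "<pad>", "<eos>", "<unk>"] + list(AMINO_ACIDS)
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}

def encode(sequence: str) -> list[int]:
    """Map a protein sequence to token IDs, bracketed by <cls> ... <eos>."""
    ids = [TOKEN_TO_ID["<cls>"]]
    ids += [TOKEN_TO_ID.get(aa, TOKEN_TO_ID["<unk>"]) for aa in sequence]
    ids.append(TOKEN_TO_ID["<eos>"])
    return ids

def decode(ids: list[int]) -> str:
    """Invert encode(), dropping special and unknown tokens."""
    return "".join(ID_TO_TOKEN[i] for i in ids if ID_TO_TOKEN[i] in AMINO_ACIDS)

seq = "MKTAYIAK"  # toy sequence for illustration
ids = encode(seq)
print(ids)
print(decode(ids))  # round-trips back to "MKTAYIAK"
```

Round-tripping a sequence through `encode`/`decode` is a quick sanity check before feeding tokenized batches to a model.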
Part 2.2: Classification and fine-tuning
In this session, you will develop a protein function classifier by fine-tuning a small pre-trained protein language model. Building on this fine-tuned model, we will then leverage the new embeddings to enhance performance with retrieval-augmented classification.
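The retrieval-augmented step can be sketched as a nearest-neighbour vote in embedding space. In the workshop the embeddings would come from the fine-tuned model's encoder; here they are toy 3-dimensional vectors and hypothetical function labels so the example stays self-contained.

```python
# Sketch of retrieval-augmented classification: classify a query protein by
# retrieving its nearest neighbours in embedding space and voting over their
# labels. Embeddings and labels below are toy placeholders.
from collections import Counter
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_predict(query, index, k=3):
    """index: list of (embedding, label) pairs; returns the majority label among the top-k neighbours."""
    ranked = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy reference set: embeddings of proteins with known function labels.
index = [
    ([1.0, 0.1, 0.0], "kinase"),
    ([0.9, 0.2, 0.1], "kinase"),
    ([0.0, 1.0, 0.1], "transporter"),
    ([0.1, 0.9, 0.0], "transporter"),
]
print(knn_predict([0.95, 0.15, 0.05], index, k=3))  # → kinase
```

The design choice is that retrieval adds no trainable parameters: improving the embedding model (e.g. by fine-tuning) directly improves the classifier.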
Part 2.3: Benchmarking and development of a small app
Compare the results and deploy the best solution in a small Streamlit app that lets users get real-time predictions of a protein's function.
How to use Hugging Face's Trainer with Python/PyTorch to fine-tune an open-source protein language model on a dataset from the scientific literature.
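A hedged sketch of that fine-tuning setup, using the ESM-2 (8M) checkpoint linked below. The two sequences, the multi-hot labels, and the number of labels are placeholders (the workshop dataset will be provided separately); hyperparameters are kept minimal rather than tuned.

```python
# Sketch: fine-tuning ESM-2 (8M) for multi-label protein function
# classification with Hugging Face's Trainer. Sequences, labels, and
# num_labels are toy placeholders for the workshop dataset.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "facebook/esm2_t6_8M_UR50D"
num_labels = 3  # hypothetical number of function labels

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    problem_type="multi_label_classification",  # BCE loss over label vector
)

# Tiny in-memory dataset: (sequence, multi-hot label vector) pairs.
sequences = ["MKTAYIAKQR", "GSHMLEDPVA"]
labels = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
encodings = tokenizer(sequences, padding=True, return_tensors="pt")

class ProteinDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(sequences)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in encodings.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

args = TrainingArguments(
    output_dir="esm2-function",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    logging_steps=1,
    report_to="none",  # disable external experiment loggers
)
trainer = Trainer(model=model, args=args, train_dataset=ProteinDataset())
trainer.train()
```

In the workshop, the toy dataset is replaced by the provided one and `num_labels` by its actual label count; everything else stays structurally the same.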
Requires some basic knowledge of Python and PyTorch.
Model -> https://huggingface.co/facebook/esm2_t6_8M_UR50D
Dataset -> I will provide (from the literature: https://academic.oup.com/bioinformatics/article/38/24/5368/6795008)
Bio: With over 8 years of experience in data science and machine learning, I am passionate about developing and applying innovative solutions to real-world problems. I am currently working as a senior researcher within the Biotechnology Research Center of the Technology Innovation Institute, where I have worked on several projects related to Large Language Models, including LLM applications for source code and protein generation.