Introduction to Protein Language Models for Synthetic Biology


Protein language models are Transformer-like models trained on massive sets of protein sequences (represented as text) in an attempt to learn the biological 'grammar' of proteins. Thanks to their generative and embedding abilities, these models have a broad range of applications.

In this workshop, we will become more familiar with this type of model, how it differs from its NLP counterparts, and the tasks it can address. We will also get a short overview of the existing open-source models and datasets.

During the hands-on session, we will start from a pre-trained language model and develop a basic example of a multi-label protein function classifier. We will then develop, compare, and benchmark different classification approaches, including a simple retrieval-augmented enhancement and fine-tuning.

Session Outline:

Part 1: Introduction to Protein Language Models
An overview of protein language models (PLMs) and their significance in bioinformatics for synthetic biology use cases. This part includes an explanation of how PLMs differ from traditional natural language processing models, and examples of popular PLMs such as ProtBERT, ProtTrans, and Ankh.

Part 2: Hands-On Exercise

Part 2.1: Setup and first interactions
After setting up a Google Colab environment (Python), you will get familiar with protein language models by retrieving one from Hugging Face's repository and starting to work with a protein dataset from the literature. This includes exploring the dataset's content, learning more about the protein function prediction task, and experimenting with simple encoding and decoding of protein sequences.
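To make the encoding and decoding step concrete, here is a minimal sketch of a character-level protein tokenizer in plain Python. It is not the tokenizer of any particular model: the vocabulary order, the special-token names (`<pad>`, `<cls>`, `<eos>`, `<unk>`), and the helper names `encode` and `decode` are assumptions made for illustration. Real protein language models ship their own tokenizer, but most follow the same idea of one token per amino acid.

```python
# Illustrative character-level tokenizer for protein sequences.
# The vocabulary layout and special tokens are assumptions, not those of a specific model.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>"]
VOCAB = SPECIAL_TOKENS + list(AMINO_ACIDS)
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode(sequence, max_length=None):
    """Map a protein sequence to token ids, wrapped in <cls> ... <eos>."""
    ids = [TOKEN_TO_ID["<cls>"]]
    # Any residue outside the 20 standard amino acids maps to <unk>.
    ids += [TOKEN_TO_ID.get(residue, TOKEN_TO_ID["<unk>"]) for residue in sequence.upper()]
    ids.append(TOKEN_TO_ID["<eos>"])
    if max_length is not None:
        ids = ids[:max_length]  # naive truncation; may drop <eos>
        ids += [TOKEN_TO_ID["<pad>"]] * (max_length - len(ids))
    return ids


def decode(ids):
    """Map token ids back to a sequence, dropping special tokens (<unk> becomes 'X')."""
    residues = []
    for token_id in ids:
        token = VOCAB[token_id]
        if token == "<unk>":
            residues.append("X")
        elif token not in SPECIAL_TOKENS:
            residues.append(token)
    return "".join(residues)


if __name__ == "__main__":
    ids = encode("MKT", max_length=8)
    print(ids)
    print(decode(ids))
```

Padding to a fixed length, as done with `max_length`, is what allows sequences of different sizes to be batched together before being passed to a model.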

Part 2.2: Classification and fine-tuning
In this session, you will develop a protein function classifier by fine-tuning a small pre-trained protein language model. Building on this fine-tuned model, we will try to leverage the new embeddings and enhance performance with retrieval-augmented classification.
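One possible reading of retrieval-augmented classification is sketched below: the classifier's per-label probabilities are blended with the labels of the nearest annotated proteins in embedding space. This is an assumption about the general idea, not necessarily the exact method used in the workshop. The function names, the blending weight `alpha`, the number of neighbours `k`, and the toy vectors are all hypothetical; in practice the embeddings would come from the fine-tuned model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_label_scores(query, index_embeddings, index_labels, k=3):
    """Average the multi-hot label vectors of the k most similar indexed proteins."""
    similarities = [(cosine(query, emb), i) for i, emb in enumerate(index_embeddings)]
    top = sorted(similarities, reverse=True)[:k]
    n_labels = len(index_labels[0])
    return [sum(index_labels[i][j] for _, i in top) / len(top) for j in range(n_labels)]


def retrieval_augmented_predict(model_probs, query, index_embeddings, index_labels,
                                k=3, alpha=0.5, threshold=0.5):
    """Blend classifier probabilities with retrieved label frequencies, then threshold."""
    retrieved = retrieve_label_scores(query, index_embeddings, index_labels, k=k)
    blended = [alpha * p + (1 - alpha) * r for p, r in zip(model_probs, retrieved)]
    predictions = [int(score >= threshold) for score in blended]
    return predictions, blended


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" and 2 function labels (made-up values).
    index_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    index_labels = [[1, 0], [1, 1], [0, 1]]
    preds, scores = retrieval_augmented_predict(
        [0.4, 0.2], [1.0, 0.0], index_embeddings, index_labels, k=2
    )
    print(preds, scores)
```

With `alpha=1.0` the retrieval step is ignored and the output reduces to the plain classifier, which makes the two approaches easy to compare under the same threshold.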

Part 2.3: Benchmarking and development of a small app
Compare the results and implement the best solution in a small Streamlit app that lets users get real-time predictions of a protein's function.
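As a rough illustration of the comparison step, the sketch below ranks several approaches by micro-averaged F1, a common metric for multi-label classification. The choice of metric, the function names, and the toy labels are assumptions for illustration only; the values are not workshop results.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (protein, label) pairs of multi-hot vectors."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def benchmark(y_true, predictions_by_approach):
    """Return (approach name, micro-F1) pairs sorted from best to worst."""
    scores = {name: micro_f1(y_true, preds) for name, preds in predictions_by_approach.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy ground truth and predictions (made-up values, 2 proteins x 3 labels).
    y_true = [[1, 0, 1], [0, 1, 0]]
    predictions = {
        "approach_a": [[1, 0, 0], [0, 1, 0]],
        "approach_b": [[1, 1, 1], [0, 1, 1]],
    }
    for name, score in benchmark(y_true, predictions):
        print(f"{name}: micro-F1 = {score:.2f}")
```

Micro-averaging pools all label decisions before computing F1, so frequent functions weigh more than rare ones; a macro average would be an alternative if rare functions matter equally.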

You will learn how to use Hugging Face's Trainer with Python/PyTorch to fine-tune an open-source protein language model on a dataset from the scientific literature.

Background Knowledge:

Requires some basic knowledge of Python and PyTorch.

Model ->
Dataset -> I will provide (from the literature:


With over 8 years of experience in data science and machine learning, I am passionate about developing and applying innovative solutions to real-world problems. I am currently working as a senior researcher within the Biotechnology Research Center of the Technology Innovation Institute, where I have worked on several projects related to Large Language Models, including LLM applications for source code and protein generation.

Open Data Science




