Machine Learning in Processing Genomic Sequence Information (XGBoost/R)


This workshop walks through a real-world application of machine learning (ML) in genomics: building a predictor of a particular type of DNA structure from DNA sequence information alone. We shall cover all the major decision points in the project, up to the production-level run and model deployment. Along the way, we shall introduce Gradient Boosting Machines (GBMs), one of the top-performing stand-alone ML methodologies alongside deep learning. We shall provide an intuitive way of understanding the major hyperparameters that need to be tuned to optimise the model architecture in computational genomics. Specific usage examples, code snippets and hints will be provided. We shall conclude with the “dos and don’ts” of applying ML to biological (DNA/RNA) sequence information. The workshop will be useful for anyone wanting to learn more about GBMs and ML in applied genomics.

Session Outline
Module 1. A brief introduction to Gradient Boosting Machines: an intuitive approach to understanding the concept and intricacies of the methodology.

Gain an essential understanding of the core machine learning methodology employed in the example project.
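The core idea behind gradient boosting can be sketched in a few lines: start from a constant prediction, then repeatedly fit a small tree to the current residuals and add a damped version of it to the ensemble. The sketch below is a toy illustration in plain Python, assuming squared-error loss, a single numeric feature and one-split trees (stumps). It is not XGBoost and not the code used in the session; the function names and the toy data are hypothetical.

```python
# Toy gradient boosting for regression with squared-error loss.
# Illustrative only: hypothetical names, one feature, one-split trees.

def fit_stump(xs, residuals):
    """Find the threshold split on x that best fits the residuals."""
    best = None
    # Exclude the largest x so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit n_rounds stumps, each one to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # initial constant prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_gbm(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: a noisy step function.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

for rounds in (0, 1, 50):
    model = fit_gbm(xs, ys, n_rounds=rounds)
    print(rounds, "rounds, training MSE:", mse(model, xs, ys))
```

The number of rounds and the learning rate in this sketch play the same role as the number of boosting iterations and the shrinkage hyperparameters that GBM libraries expose: a smaller learning rate needs more rounds to reach the same training error.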

Module 2. Real-world case study of the machine learning model development process in genomics.

Get an insider look at all the key stages and decisions involved in developing a machine learning model in genomics. Some knowledge of DNA sequences and genomes, though not essential, will help.
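To make "sequence information alone" concrete: a tree-based model needs a fixed-length numeric vector per sequence. One common way to obtain one is to count overlapping k-mers. The sketch below is an assumption for illustration only; the abstract does not say which features the case study uses, and the function name is hypothetical.

```python
from itertools import product


def kmer_counts(sequence, k=2):
    """Count overlapping k-mers in a DNA sequence.

    Returns a dict with one entry per possible k-mer over A/C/G/T, in a
    fixed order, so every sequence maps to a vector of the same length.
    """
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with N or other ambiguity codes
            counts[window] += 1
    return counts


features = kmer_counts("ACGTAC", k=2)
print(list(features.values()))  # a 16-element feature vector
```

For k = 2 this gives 16 features per sequence; the vector length grows as 4^k, so the choice of k trades resolution against dimensionality.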

Module 3. Demo of GBM usage in R (via the XGBoost and caret packages).

Follow a live example of what the machine learning components of the code look like. Some knowledge of R and caret/XGBoost, though not essential, will help.

Module 4. “Dos and don’ts” of machine learning applied to genomic sequence data.

Learn about the major pitfalls of machine learning projects that operate on genomic DNA/RNA sequences, and get tips on how to avoid common mistakes.

Background Knowledge
Basic familiarity with R, machine learning methodologies, DNA sequences and molecular biology, XGBoost, caret and GBMs (all helpful, but not essential).


Alex is a principal investigator at the University of Oxford, leading a group focused on integrative computational biology and machine learning in the MRC Weatherall Institute of Molecular Medicine, Radcliffe Department of Medicine. Previously, he completed his undergraduate studies in pharmaceutical sciences with a research focus on quantum/structural chemistry and NMR spectroscopy. He then moved to the University of Cambridge, first obtaining an MPhil degree in computational biology (Department of Applied Mathematics and Theoretical Physics), followed by a PhD in theoretical chemical biology (Department of Chemistry). He subsequently became an interdisciplinary research fellow in computational genomics and epigenetics (Department of Chemistry and Cancer Research UK Cambridge Institute), before joining the University of Oxford. His research aims to combine machine learning, computational biology, computational chemistry, and data from experimental genomics and biophysical techniques to reach a new level of precision in biology at both the genome and proteome levels.

Open Data Science




One Broadway
Cambridge, MA 02142
