Machine Learning in Drug Discovery: How Not to Lie with Computational Models?

Abstract: 

In recent years, predictive toxicity models in drug discovery have progressed remarkably, driven by the availability of extensive molecular data and the rapid evolution of machine learning (ML) techniques. However, there is no comprehensive benchmark tailored to the conditional parameters that shape toxicity, such as concentration and pharmacokinetics, and this gap has hindered the development and fair comparison of novel ML algorithms. In response, we present a versatile, machine learning-ready benchmark dataset curated from diverse sources, including ChEMBL, PubChem, FDA datasets, and other scientific publications. The benchmark is designed for ML researchers aiming to make impactful contributions to real-world drug discovery. It offers a diverse array of predictive tasks, along with parameters such as human pharmacokinetics, dose, concentration, and cell line data. It integrates in vitro data essential for toxicity prediction, such as hERG inhibition and microsomal stability, and extends to in vivo outcomes, such as cardiotoxicity labels encompassing both arrhythmia and structural heart damage. It also offers curated datasets for protein target prediction that emphasize diverse protein functions beyond inhibition, pharmacokinetics data such as plasma concentration, environmental toxicity data covering the ecological footprint of drug compounds, and a dataset on natural compounds’ protein binding. We set out specific challenges for classification and regression tasks, as well as for multitask and transfer learning models, along with recommended dataset splits for validation that cover both random splits and out-of-distribution splits. Each task is tailored to mirror real-world drug discovery challenges and aims to bridge the gap between machine learning predictions and practical drug development outcomes.
We provide preprocessed molecular features from a wide range of modalities, such as structural features, cell imaging, and gene expression, which can be used as model inputs. This presentation is a collaborative endeavor that pools insights from industry and academia to offer ML researchers a benchmark dataset for making meaningful contributions to real-world drug discovery.
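
To illustrate the difference between a random split and an out-of-distribution split, here is a minimal Python sketch. The abstract does not say how the benchmark's recommended splits are built, so everything here is an assumption for illustration: the grouping key (a chemical scaffold label), the record layout, and the function name `group_split` are all hypothetical. The idea is that every compound sharing a group lands on the same side of the split, so the test set contains only groups the model never saw in training.

```python
from collections import defaultdict


def group_split(items, group_of, test_fraction=0.2):
    """Split items so that each group is entirely in train or entirely in test.

    `group_of` maps an item to its group key (assumed here to be a scaffold
    label). This is one common way to build an out-of-distribution split; it
    is not taken from the benchmark itself.
    """
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    # Largest groups fill the training set first; the smaller, rarer groups
    # that no longer fit are held out for testing.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train_target = len(items) - int(round(len(items) * test_fraction))

    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical records: (compound id, scaffold label).
records = [
    ("c1", "A"), ("c2", "A"), ("c3", "A"), ("c4", "A"),
    ("c5", "B"), ("c6", "B"),
    ("c7", "C"),
    ("c8", "D"),
]

train, test = group_split(records, group_of=lambda r: r[1], test_fraction=0.25)
train_scaffolds = {scaffold for _, scaffold in train}
test_scaffolds = {scaffold for _, scaffold in test}
print("train scaffolds:", sorted(train_scaffolds))
print("test scaffolds:", sorted(test_scaffolds))
```

A random split would instead shuffle the records and cut at a fixed index, which typically places compounds with the same scaffold on both sides and tends to give a more optimistic performance estimate.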

Bio: 

Srijit Seal is a researcher specializing in chemoinformatics at the Imaging Platform at the Broad Institute of MIT and Harvard. His work focuses particularly on modeling and interpreting the Cell Painting assay. Previously, his research at the University of Cambridge centered on using machine learning techniques to predict drug bioactivity, safety, and toxicity. Seal actively engages in academic outreach, promoting the understanding of artificial intelligence and delivering seminars on its applications in drug discovery.

Open Data Science
One Broadway
Cambridge, MA 02142
info@odsc.com
