Using Open-Source Tools in Support of Neglected Diseases Drug Discovery

Abstract: We present our efforts to generate and validate multiple predictive machine learning models (e.g., random forests, k-nearest neighbors, support vector machines, self-organizing maps, naïve Bayes) in support of drug discovery for neglected tropical diseases. Screening data retrieved from ChEMBL-NTD (https://www.ebi.ac.uk/chemblntd/) were used to construct and validate the models. Programs written in the R software ecosystem (http://www.r-project.org/) and the Python software ecosystem (https://www.python.org/) were used for data retrieval, curation, visualization, analysis, mining, and reporting. End-to-end workflows using both ecosystems will be presented. We demonstrate how one might access these models using Shiny, a web application framework for R (http://shiny.rstudio.com/), and Jupyter notebooks (http://jupyter.org/). Each model is collected into a compendium, a ‘container’ for all the elements that make up a model and its associated description: the primary data, the annotated computational code, figures, tables, and derived data, together with textual documentation and conclusions. These compendia are meant to enable the practice of reproducible research: one may re-run the analyses, run them with new data sets, or modify the code for other purposes. The primary purpose of this work is to make the functionality of the R and Python scripts (i.e., the predictive models) available to interested parties, regardless of their knowledge of R or Python. Free and open access to predictive models supporting neglected diseases drug discovery is meant to complement the research activities of all investigators, in particular those with limited access to computational tools and algorithms.
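
As a rough illustration of the "construct and validate" step described in the abstract, the following minimal sketch trains and validates a k-nearest-neighbors classifier using only the Python standard library. It is not the authors' workflow: the abstract does not say which descriptors, libraries, or validation scheme were used, so the Tanimoto similarity on fingerprint bit sets, the leave-one-out validation, the function names, and the toy data below are all assumptions made for illustration. No ChEMBL-NTD data are used.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_predict(query, training, k=3):
    """Majority vote among the k training compounds most similar to `query`.

    `training` is a list of (fingerprint_bits, label) pairs.
    """
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(dataset, k=3):
    """Hold out each compound in turn, predict it from the rest, report accuracy."""
    hits = 0
    for i, (fingerprint, label) in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]
        hits += knn_predict(fingerprint, rest, k) == label
    return hits / len(dataset)


# Hypothetical toy data: made-up fingerprint bit sets with made-up activity labels.
DATA = [
    (frozenset({1, 2, 3, 4}), "active"),
    (frozenset({1, 2, 3, 5}), "active"),
    (frozenset({1, 2, 4, 5}), "active"),
    (frozenset({2, 3, 4, 5}), "active"),
    (frozenset({10, 11, 12, 13}), "inactive"),
    (frozenset({10, 11, 12, 14}), "inactive"),
    (frozenset({10, 11, 13, 14}), "inactive"),
    (frozenset({11, 12, 13, 14}), "inactive"),
]

if __name__ == "__main__":
    print("Leave-one-out accuracy:", leave_one_out_accuracy(DATA, k=3))
    print("Prediction for a new compound:", knn_predict(frozenset({1, 2, 3}), DATA))
```

In practice one would replace the toy bit sets with fingerprints computed from curated screening structures and compare several model types under the same validation scheme, as the abstract describes.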

Bio: Paul received his PhD in Physical Chemistry from Rensselaer Polytechnic Institute and held a postdoctoral fellowship with the IBM Data Systems Division. He previously worked as a computational chemist (QSAR, QSPR, ligand-based and structure-based pharmacophore development, cheminformatics) at Sterling Winthrop Research Institute, Procept, Pfizer, and Scynexis, and is currently the Senior Data Scientist at Solvay.