From Stored Data To Data Stories: Building Data Narratives With Open-source Tools

Abstract: Literate computing weaves a narrative directly into an interactive computation. Text, code, and results are combined into a narrative that relies equally on textual explanations and computational components. Insights are extracted from data using computational tools and communicated in the form of a narrative that resonates with the audience. Literate computing lends itself to the practice of reproducible research: one may re-run the analyses, run them with new data sets, or modify the code for other purposes.

This workshop will take one through the steps associated with literate computing: data retrieval; data curation; model construction, evaluation, and selection; and reporting. Particular attention will be paid to reporting, i.e., building a narrative. Examples will be presented demonstrating how one might generate multiple output formats (e.g., HTML pages, presentation slides, PDF documents) starting with the same code base.

As a specific example, a data narrative will be built showing how one might construct predictive models for the solubility of organic molecules. Reports will be presented as (1) an HTML file, (2) a PDF document (in a format acceptable for journal submission), and (3) a slide presentation.

The workshop will have three main foci:
1. infrastructure: instantiating the computational environment; loading packages; loading data
2. computation: data curation, transformation, and analysis; model construction and evaluation
3. communication: creating tables, charts, and graphs; weaving all components into a data narrative
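
The three foci above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the workshop's actual code: the data values, the column names (logp, logs), and every function name are hypothetical, invented for this example, and a single-descriptor least-squares line stands in for whatever models the workshop builds. The point it shows is that one computed result can feed more than one output format.

```python
import csv
import html
import io
import statistics

# Hypothetical toy data, invented for illustration only (not real measurements).
# "logp" is a made-up descriptor column and "logs" a made-up response column.
RAW_CSV = """name,logp,logs
mol_a,0.0,1.0
mol_b,1.0,0.0
mol_c,2.0,-1.0
mol_d,,0.5
mol_e,3.0,-2.0
"""


def load_rows(text):
    """Infrastructure: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def curate(rows):
    """Computation: drop rows with a missing value and convert to floats."""
    return [
        (float(r["logp"]), float(r["logs"]))
        for r in rows
        if r["logp"] and r["logs"]
    ]


def fit_line(pairs):
    """Computation: ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def build_report(text):
    """Run the whole pipeline once and collect the numbers to be reported."""
    rows = load_rows(text)
    pairs = curate(rows)
    slope, intercept = fit_line(pairs)
    return {
        "rows read": len(rows),
        "rows used": len(pairs),
        "slope": slope,
        "intercept": intercept,
    }


def to_markdown(report):
    """Communication: render the report as a Markdown table."""
    lines = ["| quantity | value |", "| --- | --- |"]
    lines += [f"| {key} | {value:g} |" for key, value in report.items()]
    return "\n".join(lines)


def to_html(report):
    """Communication: render the same report as an HTML table."""
    body = "".join(
        f"<tr><td>{html.escape(key)}</td><td>{value:g}</td></tr>"
        for key, value in report.items()
    )
    return f"<table>{body}</table>"


report = build_report(RAW_CSV)
print(to_markdown(report))
print(to_html(report))
```

Both renderers read the same report dictionary, so a change to the data or the model propagates to every output without editing the reporting code.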

While the workshop’s example comes from the field of cheminformatics, the computational tools used and the exercises presented are applicable to any field where an investigator is interested in building predictive models and describing these models to colleagues and associates.

At the workshop’s conclusion, attendees will have worked through exercises that may serve as templates to be used with their own data as they build their own data narratives.

The R and Python ecosystems will be used throughout. All data, code, and text will be made available.

Bio: Paul Kowalczyk is a Senior Data Scientist at Solvay. There, Paul uses a variety of toolchains and machine learning workflows to visualize, analyze, mine, and report data, and to generate actionable insights from it. Paul is particularly interested in democratizing data science, working to put data products into the hands of his colleagues. His experience includes using computational chemistry, cheminformatics, and data science in the biopharmaceutical and agrochemical industries. Paul received his PhD from Rensselaer Polytechnic Institute and was a Postdoctoral Research Fellow with IBM’s Data Systems Division.