Data Operations for Research Quality Health Data



A mature data operations strategy for ensuring high-quality health data is critical to the development of patient-level prediction and machine learning models. In this tutorial session, attendees will learn how a set of open-source tools can be leveraged to perform standardization, characterization, and data quality assessment for various health data sources. Open-source tools including Synthea, ETL-Synthea, Achilles, Data Quality Dashboard, and Ares will be reviewed and demonstrated in a data operations pipeline. We will demonstrate how the global health information community leverages this strategy to ensure research-ready health data.

“Synthea is an open-source, synthetic patient generator that models the medical history of synthetic patients and provides high-quality, synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions, enabling research that is otherwise legally or practically unavailable.”[1] We use synthetic data to demonstrate this data operations strategy using freely available, open-source data and methods. However, the strategy is also widely applied to real-world data and evidence generation around the globe.
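Synthea itself is a Java application driven by disease-progression modules, so the sketch below is purely illustrative and not Synthea's actual model: it shows the general idea of seeded, reproducible synthetic patient generation with a few hypothetical fields and condition names.

```python
import random

def generate_patients(n, seed=42):
    """Toy synthetic patient generator (illustrative only; Synthea's
    module-based clinical models are far more sophisticated)."""
    rng = random.Random(seed)  # fixed seed -> reproducible cohort
    genders = ["M", "F"]
    conditions = ["hypertension", "type 2 diabetes", "asthma"]
    patients = []
    for i in range(n):
        patients.append({
            "person_id": i + 1,
            "gender": rng.choice(genders),
            "year_of_birth": rng.randint(1930, 2010),
            # each toy patient gets 0-2 synthetic conditions
            "conditions": rng.sample(conditions, rng.randint(0, 2)),
        })
    return patients

cohort = generate_patients(5)
```

Because the generator is seeded, repeated runs yield identical cohorts, which is convenient when exercising a downstream pipeline.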

In the overall data operations pipeline, Synthea is used to generate an array of synthetic data sets, which are then standardized to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) by the ETL-Synthea package.[2] As the name suggests, ETL-Synthea performs the extraction, transformation, and loading of the Synthea data sets from the Synthea export format to the OMOP CDM. “The OMOP Common Data Model allows for the systematic analysis of disparate observational databases. The concept behind this approach is to transform data contained within those databases into a common format (data model) as well as a common representation (terminologies, vocabularies, coding schemes), and then perform systematic analyses using a library of standard analytic routines that have been written based on the common format.”[3]
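ETL-Synthea is an R package; as a simplified conceptual sketch of the transformation step it performs, the toy function below maps a hypothetical source patient record onto an OMOP-CDM-style PERSON row. The gender concept IDs 8507 (male) and 8532 (female) are the standard OMOP concepts; the field names and unmapped-value handling here are otherwise simplified assumptions.

```python
# Standard OMOP gender concept IDs: 8507 = MALE, 8532 = FEMALE.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}

def to_omop_person(src):
    """Toy ETL step: source record -> OMOP-CDM-style PERSON row.
    Unmappable values fall back to concept_id 0, the OMOP convention
    for 'no matching concept'."""
    return {
        "person_id": src["person_id"],
        "gender_concept_id": GENDER_CONCEPTS.get(src["gender"], 0),
        "year_of_birth": src["year_of_birth"],
    }

row = to_omop_person({"person_id": 1, "gender": "F", "year_of_birth": 1968})
```

The key idea this illustrates is the quote above: both the *format* (table and column names) and the *representation* (source codes replaced by standard concept IDs) are normalized.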

The standardized data can then be characterized by ACHILLES (Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems), an open-source package from the Observational Health Data Sciences and Informatics (OHDSI) community.[4] Achilles provides descriptive statistics on an OMOP CDM database, including typical analyses such as concomitant medication, comorbidity, and demographics reporting.
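Achilles is an R package that runs a large battery of SQL analyses against a CDM database; the toy sketch below conveys only the flavor of its demographics reporting, computing simple counts over OMOP-style PERSON rows (the summary structure is a hypothetical simplification, not Achilles output).

```python
from collections import Counter

def characterize(persons):
    """Achilles-flavored descriptive statistics (toy): person count,
    counts by gender concept, and counts by decade of birth."""
    by_gender = Counter(p["gender_concept_id"] for p in persons)
    by_decade = Counter((p["year_of_birth"] // 10) * 10 for p in persons)
    return {
        "persons": len(persons),
        "by_gender": dict(by_gender),
        "by_decade": dict(by_decade),
    }

persons = [
    {"gender_concept_id": 8507, "year_of_birth": 1955},
    {"gender_concept_id": 8532, "year_of_birth": 1958},
    {"gender_concept_id": 8532, "year_of_birth": 1972},
]
stats = characterize(persons)
```

Because the input is the standardized CDM representation, the same characterization logic runs unchanged against any conformant database, which is the point of standardizing first.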

The standardized data is also assessed by over three thousand quality checks executed by the Data Quality Dashboard package.[5] The quality checks are organized according to the Kahn framework,[6] which uses a system of categories and contexts representing strategies for assessing data quality.
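The Data Quality Dashboard is an R package whose checks are parameterized SQL; purely to illustrate the Kahn categories it uses, the sketch below hand-rolls three toy checks, one per category (conformance, completeness, plausibility), over OMOP-style rows. The check names and thresholds are assumptions for illustration.

```python
VALID_GENDER_CONCEPTS = {8507, 8532}  # standard OMOP gender concepts

def run_dq_checks(persons):
    """Toy checks in the spirit of the Kahn framework categories.
    Each result is (category, description, passed)."""
    results = []
    for p in persons:
        # Conformance: does the value conform to the standard vocabulary?
        results.append(("conformance", "gender_concept_id is a standard concept",
                        p["gender_concept_id"] in VALID_GENDER_CONCEPTS))
        # Completeness: is the required field populated?
        results.append(("completeness", "year_of_birth is populated",
                        p.get("year_of_birth") is not None))
        # Plausibility: is the value believable?
        results.append(("plausibility", "year_of_birth within 1900-2025",
                        p.get("year_of_birth") is not None
                        and 1900 <= p["year_of_birth"] <= 2025))
    failed = [r for r in results if not r[2]]
    return results, failed

sample = [
    {"gender_concept_id": 8532, "year_of_birth": 1968},
    {"gender_concept_id": 0, "year_of_birth": 1850},  # unmapped and implausible
]
results, failed = run_dq_checks(sample)
```

The real package scales this pattern to thousands of checks and reports pass/fail rates per check, table, and field.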

Upon completion of characterization and data quality assessment, the resulting outputs are integrated into a set of data indices by the AresIndexer package.[7] The integrated results are then reviewed in the ARES application.[8] ARES provides access to characterization and data quality assessment results for entire data networks, across the successive updates to a data source, or for a specific release of a data source.
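AresIndexer is an R package that writes the actual index files ARES reads; the toy sketch below only illustrates the organizing idea described above: keying each characterization/data-quality summary by data source and release, so a viewer can browse a whole network, one source's history, or one release. The field names are hypothetical.

```python
def build_index(summaries):
    """Toy ARES-style index: {source: {release: summary}}."""
    index = {}
    for s in summaries:
        index.setdefault(s["source"], {})[s["release"]] = {
            "persons": s["persons"],
            "dq_failures": s["dq_failures"],
        }
    return index

summaries = [
    {"source": "synthea_demo", "release": "2024-01", "persons": 1000, "dq_failures": 12},
    {"source": "synthea_demo", "release": "2024-06", "persons": 1200, "dq_failures": 7},
]
index = build_index(summaries)
```

Keeping every release side by side is what makes trend review possible, e.g. spotting a data quality regression introduced by a new source refresh.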

When used in combination these open source tools provide a mature data operations pipeline for ensuring health data is research ready.

[6] Kahn, M.G., et al., A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. EGEMS (Wash DC), 2016. 4(1): p. 1244


Frank DeFalco is the Director of Epidemiology Analytics at Janssen Research and Development, where he architects software solutions and data platforms for the analysis and application of observational data sources. He is currently the leader and Benevolent Dictator of the OHDSI open-source architecture working group. Frank is a presenter and panelist at OHDSI symposia and has served as faculty for OHDSI symposium tutorial classes on architecture and the common data model vocabulary.

In addition to leading the OHDSI Architecture working group, Frank initiated development of a standardized platform for observational analytics known as ATLAS. He is an active contributor to the open-source software repositories developed and released by OHDSI, including ATLAS, WebAPI, Achilles, Circe, Arachne, Visualizations, Hermes, Helios, and others. Frank’s areas of expertise include computational epidemiology, large-scale data platforms, software development and architecture, data visualization, and informatics.

Prior to joining Janssen Research and Development, Frank held the position of Senior Principal and Director of Collaboration and Analytics at British Telecom where he was a strategic advisor for multiple Fortune 100 companies across sectors including Consumer Products, Telecommunications and Pharmaceuticals. Frank received his undergraduate degrees in Computer Science and Psychology at Rutgers University.

Open Data Science



