Blockchain and Data Governance – Validating Information for Data Science

Abstract: Data validation and data governance are key to any data science project, especially in regulated industries such as healthcare and finance. In biomedical research, scientific fraud has been a recurring problem in both the academic and commercial sectors. While improving study reproducibility and data transparency may require a multifaceted approach, emerging cryptographic technologies may reduce the risk of fraudulent data practices and increase confidence in the conclusions reached by the scientific community. The recent expansion of blockchain technology offers a novel way to rapidly deploy cryptographically secure data validation and audit trails using open-source technology.

In this session, we will demonstrate how blockchain can serve as a central component of data acquisition workflows to ensure data authenticity, enforce permissions, and support an efficient governance pipeline. Together with workshop participants, we will implement an integrated private/public blockchain application using MultiChain and the Ethereum network. Using the open-source data flow application NiFi, we will then architect a platform that can provide a robust data governance structure for nearly any big data project. We will cover general blockchain architecture and approaches, the specific technical implementation of the blockchain-based data governance application with Python code samples, and a demonstration web interface for visualizing data and reports within the platform, which participants can quickly adapt to many industries. We will conclude with additional examples of high-value use cases for blockchain in big data and data science projects.
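
To give a flavor of the audit-trail idea described above, the following is a minimal sketch in Python, not the session's actual implementation. It uses only the standard library and does not call MultiChain, Ethereum, or NiFi. The names `AuditTrail` and `record_hash`, and the sample field names, are hypothetical and chosen for illustration. Each entry stores the hash of the previous entry, so altering any earlier record changes its hash and breaks verification of the chain.

```python
import hashlib
import json


def record_hash(payload: dict, prev_hash: str) -> str:
    """Hash a record together with the previous entry's hash (hypothetical helper)."""
    # sort_keys gives a stable serialization, so equal payloads hash identically.
    body = json.dumps({"payload": payload, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class AuditTrail:
    """Append-only, hash-chained log (illustrative stand-in for a blockchain ledger)."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    def append(self, payload: dict) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry = {
            "payload": payload,
            "prev_hash": prev_hash,
            "hash": record_hash(payload, prev_hash),
        }
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash in order; any edited record breaks the chain."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != record_hash(entry["payload"], prev_hash):
                return False
            prev_hash = entry["hash"]
        return True


# Usage: record two measurements, then tamper with the first one.
trail = AuditTrail()
trail.append({"sample_id": "S-001", "value": 4.2})
trail.append({"sample_id": "S-002", "value": 3.9})
print(trail.verify())  # True

trail.entries[0]["payload"]["value"] = 9.9  # simulate a fraudulent edit
print(trail.verify())  # False
```

In a real deployment the chain would live on a distributed ledger rather than in a Python list, so that no single party could rewrite the history and recompute the hashes; this sketch only shows the tamper-evidence property itself.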