
Abstract: Every day, virtual mountains of data are collected and stored at unfathomable speeds. As data volumes grow exponentially, data workflows become more complex, and the avalanche of data makes it challenging to identify, cleanse, mine, pivot and use it for both insights and AI-powered product features.
To derive the most value from their data, data professionals must be able to set up their workflows in a way that maximizes not only their own efficiency and productivity but also data reproducibility. To do this, data teams borrow many best practices from software engineering, such as testing, version control, documentation and continuous integration and deployment (CI/CD). But there are important differences in how these practices are implemented in data workflows, and those differences often hamper the success of data teams.
This presentation will outline the specific challenges to adopting software engineering best practices for data and analytics workflows, why they exist, and how data scientists can craft environments to best address common pitfalls and encourage reproducibility.
Specifically:
- what it really means to ‘version control’ data sets (and why it's not what most people think). The session will contrast the motivations for version control in production software environments with those in production data environments, and discuss best practices for versioning in a data-specific workflow.
- what CI/CD needs to look like to best enable a team to collaborate on a data workflow: for example, how to design data integration tests that go beyond checking for null or unexpected values, so that a team can make changes to a data workflow without compromising downstream dependencies (see the sketch after this list).
- “the why” behind several canonical data modeling best practices that are prevalent today (e.g., a staging layer in your data model), and how they contribute to reproducibility in data workflows.
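
To make the integration-testing point above concrete, here is a minimal, hypothetical sketch (not taken from the session itself): tests that assert cross-table invariants, such as referential integrity and row-count reconciliation between a staging table and the model built from it, catch breakage that single-column null checks miss. The table names and the in-memory database are illustrative assumptions.

```python
# Hypothetical data integration tests: instead of only scanning one column
# for nulls, assert invariants that span pipeline stages, so an upstream
# change cannot silently break downstream dependencies.
import sqlite3

# In-memory stand-ins for a staging table, a fact model, and a dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE fct_orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE dim_customers (customer_id INTEGER PRIMARY KEY);
    INSERT INTO dim_customers VALUES (1), (2);
    INSERT INTO stg_orders VALUES (10, 1), (11, 2);
    INSERT INTO fct_orders VALUES (10, 1), (11, 2);
""")

def test_referential_integrity(conn):
    # Every row in the fact model must point at a known customer.
    orphans = conn.execute("""
        SELECT COUNT(*) FROM fct_orders f
        LEFT JOIN dim_customers d ON f.customer_id = d.customer_id
        WHERE d.customer_id IS NULL
    """).fetchone()[0]
    assert orphans == 0, f"{orphans} orders reference unknown customers"

def test_row_count_reconciliation(conn):
    # The downstream model must neither drop nor duplicate staged rows.
    staged = conn.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
    modeled = conn.execute("SELECT COUNT(*) FROM fct_orders").fetchone()[0]
    assert staged == modeled, f"staging has {staged} rows, model has {modeled}"

test_referential_integrity(conn)
test_row_count_reconciliation(conn)
print("integration tests passed")
```

In practice, checks like these would run in CI against a disposable build of the warehouse, so a failing invariant blocks a change before any downstream dependency sees it.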
The session will cover specific actions leaders can take and offer real-life examples and use cases. Attendees will walk away with a deeper understanding of how to avoid common pitfalls and how to improve team collaboration and reproducibility in data workflows.
Bio: Anna Filippova tends the dbt Community garden of more than 25,000 members as Director of Community at dbt Labs. Prior to dbt Labs, Anna built the first Analytics Engineering team at GitHub. Today, she writes about the intersection of modern data tools and open source in the Analytics Engineering Roundup.
In her past life, Anna published research on building, maintaining and sustaining open source communities. She also studied how distributed and open source communities worked, fought and learned during a postdoc at Carnegie Mellon, and earned a PhD in Communication and Media from the National University of Singapore. From time to time you can find Anna traveling the coast of California and working from her campervan, and she is always open to an AMA session.