Abstract: Data Science is easy* when your data fit in memory, your functions are stateless, and everything is version controlled. As things grow, however, the data pipeline can become its own tangled mess – if you change a preprocessing step, does your model need to be retrained? What about predictions – did you rerun them after finding the best parameters for the model? Which teammates have which versions of your output?
Developers have juggled “Dependency Hell” issues for decades, and many tools exist to help keep our environments functional and complete. Yet there are few, if any, standard tools for the artifacts of a data science workflow, or even for the outputs of an ETL pipeline. Functions are expected to pick up “the most recent” input, and we rely on eventual consistency despite the obvious risks and frequent failures.
At Solaria Labs, these problems are especially acute because we are constantly developing brand new products from the ground up. In this workshop, I will discuss how we leverage dataflow programming libraries, such as Dask and Luigi, to structure and simplify our approach. I will also introduce the Salted Graph, a concept that tracks data lineage rigorously within the code framework of choice, much as Git does for source code, and so provides a ‘controlled version’ of our data outputs.
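As a rough illustration of the Git-like idea, the sketch below gives each task a "salt" derived from its own version and from the salts of everything upstream, and then embeds that salt in the output name. This is only a minimal sketch under stated assumptions, not the implementation presented in the workshop: the `Task` class, the `salt` and `output_path` helpers, the task names, and the `.parquet` suffix are all hypothetical, and no Dask or Luigi API is used.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical pipeline step: a name, a code version, and upstream tasks."""
    name: str
    version: str
    requires: tuple = ()


def salt(task):
    """Hash this task's identity together with the salts of all its upstream tasks."""
    h = hashlib.sha256()
    for dep in task.requires:
        # Folding in each upstream salt means any upstream change propagates down.
        h.update(salt(dep).encode())
    h.update(f"{task.name}:{task.version}".encode())
    return h.hexdigest()


def output_path(task):
    """Embed a short salt in the output name so stale outputs are never reused."""
    return f"{task.name}-{salt(task)[:8]}.parquet"


# Same training code, but the preprocessing step was bumped from 1.0 to 1.1.
clean_v1 = Task("clean", "1.0")
clean_v2 = Task("clean", "1.1")
train_on_v1 = Task("train", "1.0", (clean_v1,))
train_on_v2 = Task("train", "1.0", (clean_v2,))

print(output_path(train_on_v1))
print(output_path(train_on_v2))
```

Under these assumptions, the two printed paths differ even though the training step itself is unchanged, which answers the "does my model need to be retrained?" question by construction: the output for the new lineage simply does not exist yet.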
Bio: Zhengwei (William) Ma is a Data Scientist at Solaria Labs in Boston. Prior to joining Solaria, William worked at Liberty Mutual on Solvency II reporting. He holds a BS in Financial Math from UCLA.