Abstract: Data Science is easy* when your data fit in memory, your functions are stateless, and everything is version controlled. As things grow, however, the data pipeline can become its own tangled mess – if you change a preprocessing step, does your model need to be retrained? What about predictions – did you rerun them after finding the best parameters for the model? Which teammates have which versions of your output?
Developers have juggled “Dependency Hell” issues for decades, and many tools exist to help keep our environments functional and complete. Yet there are few or no standard tools for the artifacts of a data science workflow, or even outputs from an ETL pipeline. Functions are expected to get “the most recent” input and we rely on eventual consistency, despite the obvious risks and prevalent failures.
At Solaria Labs, these problems are especially acute because we are constantly developing brand-new products from the ground up. In this workshop, I will discuss how we leverage dataflow programming libraries, such as Dask and Luigi, to structure and simplify our approach. I will also introduce the Salted Graph – a concept that allows rigorous tracking of data lineage within the code framework of choice, in a manner similar to Git, to provide a ‘controlled version’ for our data outputs.
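To make the Salted Graph idea concrete, here is a minimal sketch in plain Python. It assumes that a task's "salt" is a hash of its own name, version, and parameters combined with the salts of everything upstream, much as a Git commit hash depends on its parents. The `Task` class, the `salt` and `output_path` helpers, and the file naming scheme are all hypothetical illustrations, not the Luigi or Dask API and not necessarily the implementation presented in the workshop.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A hypothetical pipeline step (not a Luigi or Dask class)."""
    name: str
    version: str
    params: tuple = ()    # (key, value) pairs
    requires: tuple = ()  # upstream Task objects


def salt(task: Task) -> str:
    """Hash the task's own identity together with the salts of its inputs."""
    h = hashlib.sha256()
    # Upstream salts go in first, so any change upstream propagates downstream.
    for upstream in task.requires:
        h.update(salt(upstream).encode())
    h.update(f"{task.name}:{task.version}:{task.params!r}".encode())
    return h.hexdigest()


def output_path(task: Task) -> str:
    """Embed a short salt in the file name (an assumed naming convention)."""
    return f"data/{task.name}-{salt(task)[:8]}.parquet"


# Two pipelines that differ only in the version of the preprocessing step.
clean_v1 = Task("clean", "1.0")
train_v1 = Task("train", "1.0", params=(("lr", 0.1),), requires=(clean_v1,))

clean_v2 = Task("clean", "1.1")
train_v2 = Task("train", "1.0", params=(("lr", 0.1),), requires=(clean_v2,))

print(output_path(train_v1))
print(output_path(train_v2))
```

Under these assumptions, the two `train` tasks resolve to different output paths even though the training code itself did not change, so a stale model can never be mistaken for one trained on the new preprocessing. Rebuilding an identical graph reproduces the same salts, so existing outputs can be found and reused.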
Bio: Scott Gorlin is the Director of Applied Science for Solaria Labs, an incubation arm within Liberty Mutual Innovation focused on exploring emerging technologies and non-traditional business opportunities. Prior to joining Solaria in June of 2017, he led research and development for an ad-tech startup focused on automated campaign management and performance optimization. Scott earned a Ph.D. in Systems and Computational Neuroscience from the Massachusetts Institute of Technology and has long been an advocate and practitioner of repeatable and scalable science through code.