Salted Graphs – A (Delicious) Approach to Repeatable Data Science

Abstract: 

Data Science is easy* when your data fit in memory, your functions are stateless, and everything is version controlled. As things grow, however, the data pipeline can become its own tangled mess – if you change a preprocessing step, does your model need to be retrained? What about predictions – did you rerun them after finding the best parameters for the model? Which teammates have which versions of your output?

Developers have juggled “Dependency Hell” issues for decades, and many tools exist to help keep our environments functional and complete. Yet there are few or no standard tools for the artifacts of a data science workflow, or even outputs from an ETL pipeline. Functions are expected to get “the most recent” input and we rely on eventual consistency, despite the obvious risks and prevalent failures.

At Solaria Labs, these problems are even more real as we are constantly developing brand new products from the ground up. In this workshop, I will discuss how we leverage dataflow programming libraries, such as Dask and Luigi, to structure and simplify our approach. I will also introduce the Salted Graph – a concept which allows rigorous tracking of data lineage within the code framework of choice, in a manner similar to Git, to provide a ‘controlled version’ for our data outputs.
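The Git-like idea behind a Salted Graph can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation presented in the workshop: the `Task` class, the `salt` and `output_path` functions, the version strings, and the `data/...parquet` paths are all hypothetical, and the sketch uses plain Python rather than the Dask or Luigi APIs. Each task's salt is a hash of its own name and version combined with the salts of everything upstream, so a change to any step gives every downstream output a new name.

```python
import hashlib


class Task:
    """Hypothetical task node: a name, a version string, and upstream tasks."""

    def __init__(self, name, version, requires=()):
        self.name = name
        self.version = version
        self.requires = tuple(requires)


def salt(task):
    """Hash the task's own identity together with the salts of all upstream tasks."""
    h = hashlib.sha256()
    for upstream in task.requires:
        h.update(salt(upstream).encode())
    h.update(f"{task.name}:{task.version}".encode())
    return h.hexdigest()[:8]  # short prefix, in the spirit of an abbreviated Git hash


def output_path(task):
    """Embed the salt in the file name so stale outputs are never reused by accident."""
    return f"data/{task.name}-{salt(task)}.parquet"


def build(preprocess_version):
    """Assemble a three-step pipeline: preprocess -> train -> predict."""
    preprocess = Task("preprocess", preprocess_version)
    train = Task("train", "1.0", requires=[preprocess])
    predict = Task("predict", "1.0", requires=[train])
    return preprocess, train, predict


old = build("1.0")
new = build("1.1")  # only the preprocessing step changed

for before, after in zip(old, new):
    print(output_path(before), "->", output_path(after))
```

In this sketch, bumping only the preprocessing version changes the output paths of the training and prediction steps as well, which answers the "does my model need to be retrained?" question by construction. Rebuilding the pipeline with identical versions reproduces identical paths, so teammates can tell from a file name whether they hold the same output.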

Bio: 

Zhengwei (William) Ma is a Data Scientist at Solaria Labs in Boston. Prior to joining Solaria, William worked at Liberty Mutual on Solvency II reporting. He holds a BS in Financial Math from UCLA.

Open Data Science