Managing Data Projects Like a Software Engineer

Abstract: 

In this talk we’ll go over how to write code that is reproducible and easy for other people to work with.

We’ll start by talking about virtual environments. Virtual environments allow you to define the dependencies for your projects (such as NumPy or Matplotlib) and to keep these dependencies separated between projects. We’ll also outline some choices you have about how to manage your virtual environments.

Next we’ll talk about version control and why you should be using it even if you’re the only contributor to a project. Version control creates a log of what work was done and why, and lets you go back when you inevitably make a change to your project that you can’t figure out how to undo.

Then we’ll discuss project structure by reviewing DrivenData’s Cookiecutter Data Science template. The template encourages a number of best practices, and anyone familiar with it will be reasonably well oriented when looking at your code for the first time.

Finally, we’ll briefly cover why you should establish coding styles and always use a linter.

Bio: 

Michael is a data engineer at Amazon in San Diego. He works in the Buyer Risk Prevention team, whose mission is to keep Amazon stores safe and trustworthy by protecting customer accounts from takeover, fraud, and abuse. Before joining a “big tech” company, he worked on an enterprise data warehouse migration project at Petco and helped build the data science team at a startup called Classy. He’s most passionate about doing work that makes a positive impact and helps give everyone in the world equal opportunity to do what they love.