What to Do When Your Data Gets Big

Abstract: 

Regardless of where you are in your data science career, you will eventually be confronted with datasets that cannot fit into the memory of a single machine, and with the problems that often come with this situation. In this talk, we will review key strategies that will help you adapt to your growing datasets. Importantly, we will consider when you might choose one strategy over another.


We will discuss different approaches you can take to adapt your data so that it fits in your existing analysis framework. Then we will review the steps you can take when the analysis is simply too big to fit in the RAM of a single machine. We will examine how you might speed up calculations by using parallel processes, GPUs, or both, and by using frameworks such as Dask for Python and the future package for R.


This discussion will equip you with strategies to tackle larger datasets. More data does not have to mean more problems!

Session Outline
Introduction

Two broad problems with larger datasets: memory and speed

An overview of strategies to address these problems

Strategy 1: Make the data smaller - sample, summarize, and/or optimize your data to make it fit on your machine(s).

Strategy 2: Buy your way out - use cloud resources to solve the problem and keep your code the same.

Strategy 3: Analyze the data in smaller chunks on a single node and combine the results.

Strategy 4: Analyze the data in smaller chunks on a multi-node cluster using a big data framework.

An example workflow in Saturn Cloud illustrating these strategies

An overview of other strategies to note - JIT compilation and code optimization
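
As a rough illustration of Strategy 3, the sketch below computes a column mean by reading rows in fixed-size chunks, keeping only a running sum and count, and combining them at the end. It is a minimal standard-library example, not the workflow shown in the talk: the function name chunked_mean, the column name fare, and the tiny in-memory CSV are all hypothetical stand-ins for a file too large to load at once.

```python
import csv
import io
from itertools import islice


def chunked_mean(rows, column, chunk_size=1000):
    """Mean of `column`, holding at most `chunk_size` rows in memory at once."""
    rows = iter(rows)  # make sure islice advances instead of restarting
    total, count = 0.0, 0
    while True:
        chunk = list(islice(rows, chunk_size))  # pull the next chunk of rows
        if not chunk:
            break
        # Partial result for this chunk: a sum and a row count.
        values = [float(row[column]) for row in chunk]
        total += sum(values)
        count += len(values)
    if count == 0:
        raise ValueError("no rows to average")
    # Combine step: the partial sums and counts give the overall mean.
    return total / count


# Hypothetical in-memory stand-in for a large CSV file on disk.
sample_csv = io.StringIO("trip_id,fare\n1,10.0\n2,20.0\n3,30.0\n4,40.0\n")
reader = csv.DictReader(sample_csv)
mean_fare = chunked_mean(reader, "fare", chunk_size=2)
print(mean_fare)
```

The same split, partial result, combine pattern is what frameworks such as Dask aim to automate, on one machine or across a cluster. It works directly for statistics that can be built from per-chunk partial results (sums, counts, minima, maxima); statistics such as an exact median need a different approach.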

Bio: 

Nathan Ballou is a Senior Data Scientist at Saturn Cloud, a cloud workspace for the whole data science team. Prior to working at Saturn Cloud, Nathan worked as a data science consultant and as an operations research analyst. When Nathan’s not evangelizing machine learning at Saturn Cloud, he can be found rowing on the Patapsco River in Baltimore.

Open Data Science
