Enough data engineering for a Data Scientist – “How I Learned to Stop Worrying and Love the Data Scientists”

Abstract: 

So how much data engineering should a Data Scientist know? To get to the fun part of their job, a Data Scientist normally has to do a bit of data engineering first: onboarding data and doing a little bit of “wrangling”. Only then do they get to the fun part - the Data Science! In most cases this preparatory work takes up 50%-80% of their time.

Then comes handing the work over to the Data Engineering team to put it into production (of course via dev, test, and QA). This is when a “little bit” of contention happens. In most cases the Data Engineering team will have to do “some” modification/rewriting/head shaking/hand wringing to get the code production ready and meet the SLAs defined by the business, because there is a disconnect in how Data Scientists and Data Engineers develop code and models (I get a front row seat to this all the time).

In this talk I’ll take the Data Scientist on a journey: onboarding data, and how different data/object stores can help; understanding and choosing the right data format for the data assets; exploring some different query engines, and some basic query tuning for each; explaining how a distributed streaming platform works, and how you can take advantage of it; and lastly covering some good coding practices. This will give the Data Scientist new skills to help them be more productive, so that they can get to the fun part faster! Plus it will reduce the contention with the Data Engineering team, and make them say - “How I Learned to Stop Worrying and Love the Data Scientists”!

The topics that are going to be covered:
Onboarding data
  Load into Data/Object Stores
  Load into Memory
  Partition Strategies
Data Formats
  Text
  Avro
  Parquet
  ORC
  Schema Evolution
Query Engines
  Initial “Create Table”
  Hive
  Impala
  Presto
  Spark SQL
  Explain plan on SQL / SQL tuning
Distributed message bus / streaming platform
  Stream processing
  Partition Strategies
Good coding practices
  Source control
  Unit tests
  Continuous Integration
  Catching errors
  Alerting & Monitoring

Bio: 

Stephen O’Sullivan is the owner of Data Whisperers. He is an expert in data architecture, infrastructure, and technical operations. Mr. O’Sullivan has deep experience in Hadoop usage and architecture and cutting-edge open source solutions for Big Data. He brings more than 25 years of experience creating enterprise applications and data management solutions for high availability and scale to his current position.

Prior to Data Whisperers, Mr. O’Sullivan was VP, Engineering at Silicon Valley Data Science, where he led the data engineering team to help SVDS clients become data driven and achieve their business goals utilizing data. Prior to SVDS, he created and led the next generation data platform team at Walmart Labs as a senior director. He and his team architected and designed the data platform that will be used by all of Walmart’s e-commerce business units. At Walmart Labs he spent time evaluating big data / database / datastore / data management vendors, from big name companies to stealth startups, as to how they would be used within Walmart’s e-commerce and store infrastructure. Mr. O’Sullivan evaluated, made recommendations and built solutions to address Walmart’s needs in security, high availability, scalability and performance.
