
Abstract: Data scientists can spend 60 to 80% of their time exploring and cleaning data. When they're given an updated data set, this process should be repeated, but often it isn't. This can lead to a model that poorly describes the system it represents. However, there is something that you can do about this.
The "feature type" system in OCI Data Science’s Accelerated Data Science (ADS) SDK classifies data based on what they represent, not how they're stored in memory. It also gives you the tools to compute custom statistics, create visualizations, use a validator and a warning system, and select columns based on the feature types.
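To make the idea concrete, here is a minimal, library-free sketch of what a feature type could look like. It is not the ADS SDK API: the `FeatureType` class, its `stats` and `warnings` methods, the `select_columns` helper, and the sample columns are all hypothetical names invented for illustration. The point it shows is that two columns stored identically (as strings) can carry different meanings, and that validation, statistics, warnings, and column selection can all hang off that meaning.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FeatureType:
    """Hypothetical feature type: describes what values mean, not how they are stored."""
    name: str
    is_valid: Callable[[str], bool]  # validator for a single value

    def stats(self, values: List[str]) -> Dict[str, int]:
        # Custom statistic: how many values pass this type's validator.
        valid = sum(1 for v in values if self.is_valid(v))
        return {"count": len(values), "valid": valid, "invalid": len(values) - valid}

    def warnings(self, values: List[str], max_invalid_ratio: float = 0.1) -> List[str]:
        # Warning system: flag a column when too many values fail validation.
        if not values:
            return []
        invalid = self.stats(values)["invalid"]
        if invalid / len(values) > max_invalid_ratio:
            return [f"{self.name}: {invalid} of {len(values)} values are invalid"]
        return []


# Both types are stored as strings; only their meaning differs.
ZIP_CODE = FeatureType("zip_code", lambda v: re.fullmatch(r"\d{5}", v) is not None)
EMAIL = FeatureType("email", lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None)

# Made-up sample data, keyed by column name.
columns = {
    "home_zip": ["90210", "1234", "10001"],
    "contact": ["a@example.com", "not-an-email"],
    "work_zip": ["94105", "60601"],
}
column_types = {"home_zip": ZIP_CODE, "contact": EMAIL, "work_zip": ZIP_CODE}


def select_columns(types: Dict[str, FeatureType], wanted: FeatureType) -> List[str]:
    # Select columns by what they represent rather than by storage dtype.
    return [name for name, ft in types.items() if ft is wanted]


zip_columns = select_columns(column_types, ZIP_CODE)
report = {name: column_types[name].warnings(columns[name]) for name in columns}
```

Because the checks live on the type rather than in ad hoc notebook cells, rerunning them on an updated data set amounts to rebuilding `report` with the new columns.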
Session Outline
Attend this presentation to:
- Learn how to speed up your exploratory data analysis (EDA).
- Create custom feature types.
- Make your data cleaning and validation process reproducible.
- Develop the skills to have confidence in the quality of your data.
Bio: A modern polymath, John holds advanced degrees in mechanical engineering, kinesiology, and data science, with a focus on solving novel and ambiguous problems. As a senior applied data scientist at Amazon, John worked closely with engineering to create machine learning models for chatbot skill arbitration, entity resolution, search, and personalization.
As a principal data scientist for Oracle Cloud Infrastructure, he is now defining tooling for data science at scale. John frequently gives talks on best practices and reproducible research. To that end, he has developed an approach to improve validation and reliability by using data unit tests and has pioneered Data Science Design Thinking. He also coordinates SoCal RUG, the largest R meetup group in Southern California.