Training Sessions

– Taught by World-Class Data Scientists –

Learn the latest data science concepts, tools, and techniques from the best. Forge a connection with these rockstars from industry and academia, who are passionate about molding the next generation of data scientists.

Highly Experienced Instructors

Our instructors are highly regarded in data science, coming from both academia and notable companies.

Real World Applications

Gain the skills and knowledge to use data science in your career and business, without breaking the bank.

Cutting Edge Subject Matter

Find training sessions offered on a wide variety of data science topics from machine learning to data visualization.

ODSC Training Includes

Form a working relationship with some of the world’s top data scientists for follow up questions and advice.

Additionally, your ticket includes access to 50+ talks and workshops.

High quality recordings of each session, exclusively available to premium training attendees.

Equivalent training at other conferences costs much more.

Professionally prepared learning materials, custom tailored to each course.

Opportunities to connect with other ambitious like-minded data scientists.

10+ reasons people are attending ODSC West 2018

See Reasons

A Few of Our Training Sessions and Workshops

More sessions coming soon.

Training: Introduction to Machine Learning

Machine learning has become an indispensable tool across many areas of research and commercial applications. From text-to-speech on your phone to detecting the Higgs boson, machine learning excels at extracting knowledge from large amounts of data. This talk will give a general introduction to machine learning, as well as introduce practical tools for you to apply machine learning in your research. We will focus on one particularly important subfield of machine learning, supervised learning. The goal of supervised learning is to “learn” a function that maps inputs x to an output y, using a collection of training data consisting of input-output pairs. We will walk through formalizing a problem as a supervised machine learning problem, creating the necessary training data, and applying and evaluating a machine learning algorithm. The talk should give you all the necessary background to start using machine learning yourself.
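
For a flavor of this workflow, here is a minimal scikit-learn sketch, illustrative rather than taken from the course materials, of fitting and evaluating a supervised model on input-output pairs:

```python
# A minimal supervised-learning sketch with scikit-learn: learn a function
# mapping inputs X to outputs y from input-output pairs, then evaluate it.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Training data: input-output pairs (X, y)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Learn" the mapping from the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out data
print("test accuracy:", model.score(X_test, y_test))
```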

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Intermediate Machine Learning with scikit-learn

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning. Apart from metrics for model evaluation, we will cover how to evaluate model complexity, how to tune parameters with grid search and randomized parameter search, and what their trade-offs are. We will also cover out-of-core text feature processing via feature hashing.
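
As a taste of the pipeline and parameter-search topics, here is a short illustrative sketch, not the course materials themselves, combining a scikit-learn Pipeline with GridSearchCV:

```python
# A sketch of a scikit-learn pipeline with a grid search over its parameters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain preprocessing and model so cross-validation sees the whole workflow
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Parameters of pipeline steps are addressed as <step>__<parameter>
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test score:", grid.score(X_test, y_test))
```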

Instructor Bio

Andreas is lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to machine learning with Python,” which describes a practical approach to machine learning with python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he have been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science and democratize the access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Machine Learning in R Part I

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the GLM. We look at how to assess model quality with traditional measures and cross-validation, and how to visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to boosted decision trees utilizing xgboost. Attendees should have a good understanding of linear models and classification, and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Machine Learning in R Part II

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the GLM. We look at how to assess model quality with traditional measures and cross-validation, and how to visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to boosted decision trees utilizing xgboost. Attendees should have a good understanding of linear models and classification, and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Advanced Machine Learning with scikit-learn Part I

Abstract Coming Soon

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Advanced Machine Learning with scikit-learn Part II

Abstract Coming Soon

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Apache Spark with Python for Data Science and Machine Learning at Scale Part I

We’ll start with the basics of machine learning on Apache Spark: when to use it, how it works, and how it compares to all of your other favorite data science tooling.

You’ll learn to use Spark (with Python) for statistics, modeling, inference, and model tuning. But you’ll also get a peek behind the APIs: see why the pieces are arranged as they are, how to get the most out of the docs, open source ecosystem, third-party libraries, and solutions to common challenges.

We will then look at some of the newest features in Spark that allow elegant, high-performance integration with popular Python tooling; distributed scheduling for popular libraries like XGBoost and TensorFlow; and fast inference.

By lunch, you will understand when, why, and how Spark fits into the data science world, and you’ll be comfortable doing your own feature engineering and modeling with Spark.

By the end of the day, you will be caught up on the latest, easiest, fastest, and most user-friendly ways of applying Apache Spark in your job and/or research.
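
As a small illustration of what Spark feature engineering and modeling code looks like, here is a hedged pyspark sketch; the column names and tiny inline data are assumptions for the example, not course materials:

```python
# A sketch of a Spark ML pipeline in Python: assemble features, fit a model.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("example").getOrCreate()

# Assume a DataFrame with numeric feature columns and a binary "label" column
df = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, -1.0, 1), (0.1, 1.3, 0)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show()
```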

Instructor Bio

Adam Breindel consults and teaches widely on Apache Spark, big data engineering, and machine learning/AI/deep learning. He supports instructional initiatives and teaches as a senior instructor at Databricks, teaches classes on Apache Spark and on deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures.
 
Adam’s first full-time job in tech was on neural-net-based fraud detection deployed at North America’s largest banks … in 1998. Since then, he’s worked with numerous startups (3 successful exits), where he enjoyed getting to build the future (e.g., universal mobile check-in for 2 of America’s 5 biggest airlines … in 2004, 3 years before the iPhone’s release). He has also worked in entertainment, insurance, and retail banking, on web, embedded, and server apps, as well as on clustering architectures, APIs, and streaming analytics.

Adam Breindel

Featured Apache Spark Instructor, Data Science Trainer and Consultant

Training: Apache Spark with Python for Data Science and Machine Learning at Scale Part II

We’ll start with the basics of machine learning on Apache Spark: when to use it, how it works, and how it compares to all of your other favorite data science tooling.

You’ll learn to use Spark (with Python) for statistics, modeling, inference, and model tuning. But you’ll also get a peek behind the APIs: see why the pieces are arranged as they are, how to get the most out of the docs, open source ecosystem, third-party libraries, and solutions to common challenges.

We will then look at some of the newest features in Spark that allow elegant, high-performance integration with popular Python tooling; distributed scheduling for popular libraries like XGBoost and TensorFlow; and fast inference.

By lunch, you will understand when, why, and how Spark fits into the data science world, and you’ll be comfortable doing your own feature engineering and modeling with Spark.

By the end of the day, you will be caught up on the latest, easiest, fastest, and most user-friendly ways of applying Apache Spark in your job and/or research.

Instructor Bio

Adam Breindel consults and teaches widely on Apache Spark, big data engineering, and machine learning/AI/deep learning. He supports instructional initiatives and teaches as a senior instructor at Databricks, teaches classes on Apache Spark and on deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures.
 
Adam’s first full-time job in tech was on neural-net-based fraud detection deployed at North America’s largest banks … in 1998. Since then, he’s worked with numerous startups (3 successful exits), where he enjoyed getting to build the future (e.g., universal mobile check-in for 2 of America’s 5 biggest airlines … in 2004, 3 years before the iPhone’s release). He has also worked in entertainment, insurance, and retail banking, on web, embedded, and server apps, as well as on clustering architectures, APIs, and streaming analytics.

Adam Breindel

Featured Apache Spark Instructor, Data Science Trainer and Consultant

Training: Network Analysis Made Simple Part I

Have you ever wondered how those data scientists at Facebook and LinkedIn make friend recommendations? Or how epidemiologists track down patient zero in an outbreak? If so, then this tutorial is for you. In this tutorial, we will use a variety of datasets to help you understand the fundamentals of network thinking, with a particular focus on constructing, summarizing, and visualizing complex networks.

This tutorial is for Pythonistas who want to understand relationship problems – as in, data problems that involve relationships between entities. Participants should already have a grasp of for loops and basic Python data structures (lists, tuples and dictionaries). By the end of the tutorial, participants will have learned how to use the NetworkX package in the Jupyter environment, and will be comfortable visualizing large networks using Circos plots. Other plots will be introduced as well.
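
For a sense of the basics, here is a small NetworkX sketch on a made-up friendship graph; it is illustrative only, and the tutorial’s own notebooks go much further (including Circos plots via nxviz):

```python
# A small taste of "network thinking": build a friendship graph, summarize it,
# and compute naive friend-recommendation candidates.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "alice"), ("carol", "dave")])

# Summarize: how big is the network, and who is most connected?
print(nx.number_of_nodes(G), "nodes,", nx.number_of_edges(G), "edges")
print(sorted(G.degree, key=lambda kv: kv[1], reverse=True))

# Friends-of-friends who aren't connected yet, scored by shared neighbors
recs = [(u, v) for u, v, score in nx.jaccard_coefficient(G) if score > 0]
print("candidate new connections:", recs)
```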

Instructor Bio

Eric is an Investigator in the Scientific Data Analysis team at the Novartis Institutes for Biomedical Research, where he solves biological problems using machine learning. He obtained his Doctor of Science (ScD) from the Department of Biological Engineering, MIT, and was an Insight Health Data Fellow in the summer of 2017. He has taught Network Analysis at a variety of data science venues, including PyCon USA, SciPy, PyData and ODSC, and has also co-developed the Python Network Analysis curriculum on DataCamp. As an open source contributor, he has made contributions to PyMC3, matplotlib and bokeh. He has also led the development of the graph visualization package nxviz, and the data cleaning package pyjanitor (a Python port of the R package janitor).

Eric Ma, ScD

DS Investigator, Novartis Institutes, Author of nxviz Package

Training: Network Analysis Made Simple Part II

Have you ever wondered how those data scientists at Facebook and LinkedIn make friend recommendations? Or how epidemiologists track down patient zero in an outbreak? If so, then this tutorial is for you. In this tutorial, we will use a variety of datasets to help you understand the fundamentals of network thinking, with a particular focus on constructing, summarizing, and visualizing complex networks.

This tutorial is for Pythonistas who want to understand relationship problems – as in, data problems that involve relationships between entities. Participants should already have a grasp of for loops and basic Python data structures (lists, tuples and dictionaries). By the end of the tutorial, participants will have learned how to use the NetworkX package in the Jupyter environment, and will be comfortable visualizing large networks using Circos plots. Other plots will be introduced as well.

Instructor Bio

Eric is an Investigator in the Scientific Data Analysis team at the Novartis Institutes for Biomedical Research, where he solves biological problems using machine learning. He obtained his Doctor of Science (ScD) from the Department of Biological Engineering, MIT, and was an Insight Health Data Fellow in the summer of 2017. He has taught Network Analysis at a variety of data science venues, including PyCon USA, SciPy, PyData and ODSC, and has also co-developed the Python Network Analysis curriculum on DataCamp. As an open source contributor, he has made contributions to PyMC3, matplotlib and bokeh. He has also led the development of the graph visualization package nxviz, and the data cleaning package pyjanitor (a Python port of the R package janitor).

Eric Ma, ScD

DS Investigator, Novartis Institutes, Author of nxviz Package

Training: Introduction to RMarkdown in Shiny

Markdown Primer (45 minutes): Structure Documents with Sections and Subsections, Format Text, Create Ordered and Unordered Lists, Make Links, Number Sections, Include a Table of Contents

Integrate R Code (30 minutes): Insert Code Chunks, Hide Code, Set Chunk Options, Draw Plots, Speed Up Code with Caching

Build RMarkdown Slideshows (20 minutes): Understand Slide Structure, Create Sections, Set Background Images, Include Speaker Notes, Open Slides in Speaker Mode

Develop Flexdashboards (30 minutes): Start with the Flexdashboard, Lay Out Columns and Rows, Use Multiple Pages, Create Social Sharing, Include Code

Work with Shiny: Shiny Inputs (Drop Downs, Text, Radio Buttons, Checkboxes), Shiny Outputs (Text, Tables, Plots), Reactive Expressions, HTML Widgets (Interactive Plots, Interactive Maps, Interactive Tables), Shiny Layouts (UI and Server Files, User Interface)

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Intermediate RMarkdown in Shiny

Markdown Primer (45 minutes): Structure Documents with Sections and Subsections, Format Text, Create Ordered and Unordered Lists, Make Links, Number Sections, Include a Table of Contents

Integrate R Code (30 minutes): Insert Code Chunks, Hide Code, Set Chunk Options, Draw Plots, Speed Up Code with Caching

Build RMarkdown Slideshows (20 minutes): Understand Slide Structure, Create Sections, Set Background Images, Include Speaker Notes, Open Slides in Speaker Mode

Develop Flexdashboards (30 minutes): Start with the Flexdashboard, Lay Out Columns and Rows, Use Multiple Pages, Create Social Sharing, Include Code

Work with Shiny: Shiny Inputs (Drop Downs, Text, Radio Buttons, Checkboxes), Shiny Outputs (Text, Tables, Plots), Reactive Expressions, HTML Widgets (Interactive Plots, Interactive Maps, Interactive Tables), Shiny Layouts (UI and Server Files, User Interface)

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Programming with Data: Python and Pandas

Whether in R, MATLAB, Stata, or Python, modern data analysis, for many researchers, requires some kind of programming. The preponderance of tools and specialized languages for data analysis suggests that general-purpose programming languages like C and Java do not readily address the needs of data scientists; something more is needed.

In this workshop, you will learn how to accelerate your data analyses using the Python language and Pandas, a library specifically designed for interactive data analysis. Pandas is a massive library, so we will focus on its core functionality, specifically loading, filtering, grouping, and transforming data. Having completed this workshop, you will understand the fundamentals of Pandas, be aware of common pitfalls, and be ready to perform your own analyses.
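
As a preview, here is a minimal Pandas sketch of that core functionality; the tiny inline table and column names stand in for a real data set:

```python
# Load, filter, group, and transform data with Pandas.
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF"],
    "year": [2017, 2018, 2017, 2018],
    "sales": [10.0, 12.5, 8.0, 9.5],
})
# In practice, data usually comes from a file: df = pd.read_csv("sales.csv")

recent = df[df["year"] >= 2018]                 # filter rows
by_city = df.groupby("city")["sales"].sum()     # group and aggregate
df["share"] = df["sales"] / df["sales"].sum()   # transform: add a column

print(recent, by_city, df, sep="\n\n")
```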

Instructor Bio

Daniel Gerlanc has worked as a data scientist for more than a decade and written software professionally for 15 years. He spent 5 years as a quantitative analyst with two Boston hedge funds before starting Enplus Advisors. At Enplus, he works with clients on data science and custom software development, with a particular focus on projects requiring expertise in both areas. He teaches data science and software development at introductory through advanced levels. He has coauthored several open source R packages, published in peer-reviewed journals, and is active in local predictive analytics groups.

Daniel Gerlanc

Data Science and Software Engineering Instructor, President, Enplus Advisors

Workshop: Latest Developments in GANs

Generative adversarial networks (GANs) are widely considered one of the most interesting developments in machine learning and AI in the last decade. In this wide-ranging talk, we’ll start by covering the fundamentals of how and why they work, reviewing basic neural network and deep learning terminology in the process; we’ll then cover the latest applications of GANs, from generating art from drawings, to advancing research areas such as Semi-Supervised Learning, and even generating audio. We’ll also examine the progress on improving GANs themselves, showing the tricks researchers have used to increase the realism of the images GANs generate.

Throughout, we’ll touch on many related topics, such as different ways of scoring GANs, and many of the Deep Learning-related tricks that have been found to improve training. Finally, we’ll close with some speculation from the leading minds in the field on where we are most likely to see GANs applied next.

Attendees will leave with a better understanding of the latest developments in this exciting area and the technical innovations that made those developments possible. Emphasis will be placed throughout on illuminating why the latest achievements have worked, not just what they are. Furthermore, a link to a clean, documented GitHub repo with a working GAN will be provided for attendees to see how to code one up. Attendees will thus leave feeling more confident and empowered to apply these same tricks to solve problems they face in personal projects or at work.
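
For attendees who want a concrete starting point before the session, here is a compact GAN sketch in PyTorch; it is an illustrative toy on 1-D data and an assumption of ours, not the repo mentioned above:

```python
# A minimal GAN: the generator learns to mimic samples from N(4, 1.25).
import torch
import torch.nn as nn

# Generator maps 8-D noise to a 1-D sample; discriminator scores realness.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

for step in range(2000):
    real = torch.randn(64, 1) * 1.25 + 4.0   # "real" data drawn from the target
    fake = G(torch.randn(64, 8))             # generated samples
    # Discriminator step: label real as 1, fake as 0
    d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to make the discriminator output 1 on fakes
    g_loss = bce(D(fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print("generated mean/std:", fake.mean().item(), fake.std().item())
```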

Instructor Bio

Seth loves teaching and learning cutting-edge machine learning concepts, applying them to solve companies’ problems, and teaching others to do the same. Seth discovered data science and machine learning while working in consulting in early 2014. After taking virtually every course Udacity and Coursera had to offer on data science, he joined Trunk Club as their first data scientist in December 2015. There, he worked on lead scoring, recommenders, and other projects before joining Metis in April 2017 as a Senior Data Scientist teaching the Chicago full-time course. Over the past six months, he has developed a passion for neural nets and deep learning, writing a neural net library from scratch and sharing what he has learned with others via blog posts (on sethweidman.com), as well as by speaking at Meetups and conferences.

Seth Weidman

Senior Data Scientist, Metis

Workshop: The Power of Monotonicity to Make ML Make Sense

The key to machine learning is getting the right flexibility. For many ML problems, we have prior knowledge about global trends the model should be capturing, like that predicted travel time should go up if traffic gets worse. But flexible models like DNNs and RFs can have a hard time capturing such global trends given noisy training data, which limits their ability to extrapolate well when you run a model on examples different from your training data. TensorFlow’s new TensorFlow Lattice tools let you create flexible models that can respect the global trends you request, producing easier-to-debug models that generalize well. TF Lattice provides new TF Estimators that make capturing your global trends easy, and we’ll also explain the underlying new TF Lattice operators that you can use to create your own deeper lattice networks.
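
TensorFlow Lattice itself is best learned in the session; as a hedged illustration of the underlying idea of a monotonicity constraint, here is a toy sketch using xgboost’s monotone_constraints parameter (a different library, shown only to make the concept concrete):

```python
# Not TF Lattice: a toy demonstration of enforcing a global monotonic trend
# (travel time never decreases as traffic worsens) despite noisy labels.
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
traffic = rng.uniform(0, 1, (500, 1))
# Noisy training data, but travel time should truly rise with traffic
travel_time = 10 + 5 * traffic[:, 0] + rng.normal(0, 2, size=500)

# "(1)" requests a non-decreasing relationship with the first feature
params = {"monotone_constraints": "(1)", "max_depth": 3, "eta": 0.1}
booster = xgb.train(params, xgb.DMatrix(traffic, label=travel_time),
                    num_boost_round=100)

grid = np.linspace(0, 1, 5).reshape(-1, 1)
print(booster.predict(xgb.DMatrix(grid)))  # never decreases along the grid
```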

Instructor Bio

Maya Gupta leads Google’s Glassbox Machine Learning R&D team, which focuses on designing and developing controllable and interpretable machine learning algorithms that solve Google product needs. Prior to Google, Gupta was an Associate Professor of Electrical Engineering at the University of Washington from 2003 to 2013. Her PhD is from Stanford, and she holds a BS in EE and a BA in Econ from Rice.

Maya Gupta, PhD

Glassbox ML R&D Team Lead at Google

Workshop: Multivariate Time Series Forecasting Using Statistical and Machine Learning Models

Time series data is ubiquitous: weekly initial unemployment claims, the daily term structure of interest rates, tick-level stock prices, weekly company sales, daily foot traffic recorded by mobile devices, and daily step counts recorded by a wearable, just to name a few.

Some of the most important and commonly used data science techniques in time series forecasting were developed in the fields of machine learning and statistics. Data scientists should have at least a few basic time series statistical and machine learning modeling techniques in their toolkit.

This lecture discusses the formulation of Vector Autoregressive (VAR) models, one of the most important classes of multivariate time series statistical models, and neural network-based techniques, which have received a lot of attention in the data science community in the past few years; demonstrates how they are implemented in practice; and compares their advantages and disadvantages in practice. Real-world applications, demonstrated using Python, are used throughout the lecture to illustrate these techniques. While not the focus of this lecture, exploratory time series data analysis using histograms, kernel density plots, time series plots, scatterplot matrices, plots of autocorrelation (i.e., correlograms), plots of partial autocorrelation, and plots of cross-correlations will also be included in the demo.
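
As a preview of the statistical side, here is a brief sketch of fitting and forecasting a VAR model with statsmodels on synthetic data; the two series are invented for illustration, not the lecture’s own examples:

```python
# Fit a VAR model to two related series and forecast a few steps ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.RandomState(42)
n = 200
x = np.cumsum(rng.normal(size=n))
y = 0.5 * np.roll(x, 1) + rng.normal(size=n)   # y depends on lagged x
data = pd.DataFrame({"x": x, "y": y})

results = VAR(data).fit(maxlags=4, ic="aic")   # lag order chosen by AIC
print(results.summary())
print(results.forecast(data.values[-results.k_ar:], steps=5))
```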

Instructor Bio

Jeffrey is the Chief Data Scientist at AllianceBernstein, a global investment firm managing over $500 billion. He is responsible for building and leading the data science group, partnering with investment professionals to create investment signals using data science, and collaborating with sales and marketing teams to analyze clients. He graduated with a Ph.D. in economics from the University of Pennsylvania and has taught statistics, econometrics, and machine learning courses at UC Berkeley, Cornell, NYU, the University of Pennsylvania, and Virginia Tech. Previously, Jeffrey held advanced analytics positions at Silicon Valley Data Science, Charles Schwab Corporation, KPMG, and Moody’s Analytics.

Jeffrey Yau, PhD

Chief Data Scientist, AllianceBernstein

Workshop: Applying Deep Learning to Article Embedding for Fake News Evaluation

In this talk we explore real-world applications of automated “Fake News” evaluation using contemporary deep learning article vectorization and tagging. We begin with the use case and an evaluation of the appropriate contexts for applying various deep learning techniques to fake news evaluation. Technical material will review several methodologies for article vectorization with classification pipelines, ranging from traditional to advanced deep architecture techniques. We close with a discussion of troubleshooting and performance optimization when consolidating and evaluating these various techniques on active data sets.
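
To make the “traditional” end of that spectrum concrete, here is a hedged baseline sketch of TF-IDF article vectorization feeding a linear classifier; the tiny corpus and labels are invented for illustration:

```python
# A traditional article-vectorization + classification pipeline baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

articles = [
    "Scientists publish peer-reviewed study on climate data",
    "Miracle cure doctors don't want you to know about",
    "Central bank releases quarterly inflation report",
    "Shocking secret celebrity story you won't believe",
]
labels = [0, 1, 0, 1]  # toy labels: 0 = credible, 1 = fake

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(articles, labels)
print(clf.predict(["You won't believe this one weird trick"]))
```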

Instructor Bio

Mike serves as Head of Data Science at Uber ATG, UC Berkeley Data Science faculty, and head of Skymind Labs, the machine learning research lab affiliated with DeepLearning4J. He has led teams of data scientists in the Bay Area as Chief Data Scientist for InterTrust and Takt, Director of Data Sciences for MetaScale/Sears, and CSO for Galvanize, where he founded the galvanizeU-UNH accredited Master of Science in Data Science degree and oversaw the company’s transformation from co-working space to data science organization.

Michael Tamir, PhD

Head of Data Science, Uber

Workshop: Model Evaluation in the Land of Deep Learning

Model evaluation metrics are typically tied to the predictive learning task. There are different metrics for classification (ROC-AUC, confusion matrix), regression (RMSE, R2 score), ranking (precision-recall, F1 score), and so on. These metrics, coupled with cross-validation or hold-out validation techniques, might help analysts and data scientists select a performant model. However, model performance decays over time because of variability in the data. At that point, point-estimate-based metrics are not enough, and a better understanding of the why, what, and how of the categorization process is needed.

Evaluating model decisions might still be easy for linear models but gets difficult in the world of deep neural networks (DNNs). This complexity might increase multifold for use cases related to computer vision (image classification, image captioning, visual QnA (VQA)), text classification, sentiment analysis, or topic modeling. ResNets, a recently published state-of-the-art class of DNNs, can have over 200 layers. Interpreting input features and their effect on output categorization across so many layers is challenging. The lack of decomposability and intuitiveness associated with DNNs prevents their widespread adoption even with their superior performance compared to more classical machine learning approaches. Faithful interpretation of DNNs will not only provide insight into failure modes (false positives and false negatives) but will also help the humans in the loop evaluate the robustness of the model against noise. This brings trust and transparency to the predictive algorithm.

In this workshop, I will share how to enable class-discriminative visualizations for computer vision and NLP problems when using convolutional neural networks (CNNs), and an approach to help enable transparency of CNNs by not only capturing metrics during the validation step but also highlighting the salient features in the image or text that are driving a prediction.
I will also talk briefly about the open source project Skater (https://github.com/datascienceinc/Skater) and how it can help in solving our interpretation needs.
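
As one concrete example of a class-discriminative visualization, here is a minimal gradient-based saliency-map sketch in PyTorch; this is a generic technique shown for illustration, and the workshop’s own approach and the Skater API may differ:

```python
# Saliency map: gradient of the top class score w.r.t. the input pixels.
# The tiny CNN and the random "image" are placeholders.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)

image = torch.rand(1, 3, 32, 32, requires_grad=True)
scores = cnn(image)
scores[0, scores.argmax()].backward()   # backprop the top class score

# Pixels with large gradients drove the prediction the most
saliency = image.grad.abs().max(dim=1).values  # shape (1, 32, 32)
print(saliency.shape, saliency.max())
```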

Instructor Bio

Pramit Choudhary is an applied machine learning research scientist. Currently, he is the Lead Data Scientist at DataScience.com (R&D labs; acquired by Oracle). He focuses on optimizing and applying machine learning to solve real-world problems. His research areas include scaling and optimizing machine learning algorithms. Currently, he is exploring better ways to explain a model’s learned decision policies, to reduce the chaos in building effective models and to close the gap between prototype and operationalized models.

Pramit Choudhary

Lead Data Scientist at datascience.com

Workshop: Raise your own Pandas Cub

A typical data scientist’s workflow in Python consists of firing up a Jupyter Notebook, importing NumPy, Pandas, Matplotlib, and Scikit-Learn into the workspace and then completing a data analysis. The APIs from these libraries are well-known, mostly stable, and provide a powerful and flexible way of analyzing data. These libraries have contributed an enormous amount to the success of Python as a language of choice for doing data science as well as increasing productivity for the data scientists that use them.

For those data scientists interested in learning how to develop their own data science tools, relying on these popular, easy-to-use libraries hides the complexities and underlying Python code. In fact, it is so easy to produce data science results in Python that one only needs to know the very basics of the language along with knowledge of the library’s API.

In this hands-on tutorial, we will build our own data analysis package from scratch. Specifically, our package will contain a DataFrame class with a Pandas-like API. We will make heavy use of the Python data model, which contains special methods to help our DataFrame work with Python operators. By the end of the tutorial, we will have built a Python package that you can import into your workspace, capable of performing the most important operations available in Pandas.
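
As a hypothetical taste of what such a package might look like, here is a tiny DataFrame class built on Python’s special methods; it is far simpler than what the tutorial actually constructs:

```python
# A toy DataFrame leaning on the Python data model (special methods),
# the way Pandas does under the hood.
import numpy as np

class DataFrame:
    def __init__(self, data):
        # data: dict mapping column name -> sequence of values
        self._data = {col: np.asarray(vals) for col, vals in data.items()}

    def __len__(self):                 # enables len(df)
        return len(next(iter(self._data.values())))

    def __getitem__(self, col):        # enables df["col"]
        return self._data[col]

    def __repr__(self):                # a friendly notebook display
        return "\n".join(f"{c}: {v}" for c, v in self._data.items())

df = DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(len(df), df["a"].sum())
```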

Instructor Bio

Ted Petrou is the author of Pandas Cookbook and founder of both Dunder Data and the Houston Data Science Meetup group. He worked as a data scientist at Schlumberger where he spent the vast majority of his time exploring data. Ted received his Master’s degree in statistics from Rice University and used his analytical skills to play poker professionally and teach math before becoming a data scientist.

Ted Petrou

Pandas Author, Founder at Dunder Data

Workshop: Visual Elements of Data Science

“Above all, show the data” (Edward Tufte)

Data Visualization is fundamental not only to data exploration, but to addressing data science problems in general. It is a key technique in descriptive statistics (e.g., boxplots, histograms, distribution charts, heatmaps), diagnostics (e.g., scatterplots, Geiger counter charts, digital elevation models) and predictive layers (e.g., decision trees, artificial neural networks) of the data science stack. For example, visualization is a means to understand relationships between variables, to recognize patterns, to detect outliers and to break down complexity. Effective ways to describe and summarize data sets are also very helpful in communicating with clients and collaborators in a more quantitative and rational way. Therefore, implementing and utilizing data visualizations is a key skill that every data scientist must have in their repertoire.

While enterprises and businesses across industries now widely use dashboards and other (often commercial) business intelligence software to generate data visualizations, data scientists usually still rely heavily on creating charts from scratch in scripting languages and other open source coding environments. This is because they need to not only explore raw data and data aggregates, but also review model outputs visually and prepare charts for presentations and publications. The most widely used tools currently include ggplot2, plotly, and shiny (R), as well as matplotlib, Seaborn, and Bokeh (Python).

This session reviews key elements of the effective use of data visualizations in data science industry applications. These include (1) a narrative, a story to tell about the data; (2) simplicity; and (3) conciseness, through balancing information and complexity and avoiding too much decoration (the aesthetics concept). It also addresses how to choose the right chart for a given data set, depending on different contexts and questions. What are some simple rules to follow for a good graphic, and which common errors need to be avoided? How do you know if your graph accurately represents the underlying data set? This is particularly important for high-dimensional data sets and growing data volumes in the age of Big Data.

In this workshop, state-of-the-art scripts and packages in R and Python will be used to demo how to plot heatmaps, time series charts, and network graphs, as well as representations and maps for geospatial data sets.
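
As a small preview of the demo, here is an illustrative matplotlib sketch of two of those chart types on synthetic data; the workshop’s own scripts go considerably further:

```python
# Two of the chart types mentioned above: a heatmap and a time series chart.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(1)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))

# Heatmap: e.g., a correlation-style matrix
im = ax1.imshow(rng.rand(8, 8), cmap="viridis")
fig.colorbar(im, ax=ax1)
ax1.set_title("Heatmap")

# Time series: a random walk as a stand-in for real measurements
ax2.plot(np.cumsum(rng.normal(size=200)))
ax2.set_title("Time series")

fig.tight_layout()
plt.show()
```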

Instructor Bio

Olaf Menzer is a Data Scientist in the Decision Analytics team at Pacific Life in Newport Beach, California. His focus areas are around enabling business process improvements and the generation of insights through data synthesis, the application of advanced analytics and technology more broadly. He is also a Visiting Researcher at the University of California, Santa Barbara, contributing to primary research articles and statistical applications in Ecosystem Science.

Prior to working at Pacific Life, Olaf was a Predictive Analyst at Ingram Micro, designing, implementing and testing sales forecasting models, lead generation engines and product recommendation algorithms for cross-selling millions of technology products. He also held different Research Assistant roles at the Lawrence Berkeley National Lab and the Max Planck Institute in Germany, where he supported scientific computing, data analysis and machine learning applications.

Olaf was a speaker at the INFORMS Business Analytics conference in 2016, Predictive Analytics World in 2018 and at several academic conferences in the past. He received an M.Sc. in Bioinformatics from Friedrich Schiller University in Germany (2011), and a Ph.D. in Geographic Information Science from University of California, Santa Barbara (2015).

Olaf Menzer, PhD

Senior Data Scientist at Pacific Life

Workshop: Machine Learning for Digital Identity

There are tens of billions of online profiles today, each associated with some identity, on diverse platforms including social networks, online marketplaces, dating sites and financial institutions. Every platform needs to understand, validate and verify these identities.

The landscape of identity challenges, available data, and machine-learning technology has evolved over the years. However, identity remains a notoriously hard problem. While we’ve made a lot of progress in academia and industry, there still are several unsolved problems. In this session, we will talk through three core, interconnected problems: (1) identity authentication/validation; (2) identity matching; (3) identity verification. We will discuss our work on effectively using machine learning technology to solve these problems, along with an analysis of popular techniques used on different platforms.

Identity authentication and validation ensure high-quality attributes, which affect all downstream identity processes. The challenge of identity authentication is determining whether an input identity or attribute is a valid value. While identity validation solutions need to be tailored to the attribute type, we will share some of the common techniques applicable across all attribute types: (1) canonicalizing attribute values, and then (2) looking them up against constructed datasets of the universe of all possible values. We will also discuss how some of these generic techniques are applied to the validation of two different types of attributes: names and government-issued IDs.

Identity matching is fundamental for two main applications: detecting duplicates and joining with other, often external, data sources to create a richer identity. We will describe the typical identity matching pipeline which is composed of 4 steps: (1) extraction of relevant attributes from structured and unstructured sources, (2) iterative identity enrichment of the input, (3) fuzzy matching of attribute pairs, (4) building a model to compute a match confidence using similarity and uniqueness.
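
As a toy illustration of step (3) of this pipeline, fuzzy matching of attribute pairs, here is a sketch using only the Python standard library; real systems combine far richer similarity and uniqueness signals:

```python
# Fuzzy matching of name attributes with a normalized similarity score.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] after simple canonicalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

pairs = [
    ("Jonathan Smith", "Jon Smith"),
    ("Jonathan Smith", "Maria Garcia"),
]
for a, b in pairs:
    score = similarity(a, b)
    verdict = "match" if score > 0.7 else "no match"
    print(f"{a!r} vs {b!r}: {score:.2f} -> {verdict}")
```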

Identity verification is the process of confirming that an online/digital identity accurately reflects the offline identity of the person who created it. The key insight we will dive deep into is verifying one piece of the online identity, and then applying coherence across various identity attributes to verify all the other attributes of the online identity.

This session is geared towards product, data science, and engineering leaders who would like to introduce state-of-the-art machine-learning techniques to solve identity problems at their respective companies or fortify their existing solutions. Some familiarity with machine learning techniques is preferred, but not required. We will cover relevant background/fundamentals wherever necessary.

Instructor Bio

Coming Soon.

Anish Das Sarma, PhD

Engineering Manager at Airbnb

Workshop: Machine Learning for Digital Identity

There are tens of billions of online profiles today, each associated with some identity, on diverse platforms including social networks, online marketplaces, dating sites and financial institutions. Every platform needs to understand, validate and verify these identities.

The landscape of identity challenges, available data, and machine-learning technology has evolved over the years. However, identity remains a notoriously hard problem. While we’ve made a lot of progress in academia and industry, there still are several unsolved problems. In this session, we will talk through three core, interconnected problems: (1) identity authentication/validation; (2) identity matching; (3) identity verification. We will discuss our work on effectively using machine learning technology to solve these problems, along with an analysis of popular techniques used on different platforms.

Identity authentication and validation ensure high-quality attributes, which affect all downstream identity processes. The challenge of identity authentication is determining whether an input identity or attribute is a valid value. While identity validation solutions need to be tailored to the attribute type, we will share some of the common techniques applicable across all attribute types: (1) canonicalizing attribute values, and then (2) looking them up against constructed datasets of the universe of all possible values. We will also discuss how some of these generic techniques are applied to the validation of two different types of attributes: names and government-issued IDs.

Identity matching is fundamental for two main applications: detecting duplicates and joining with other, often external, data sources to create a richer identity. We will describe the typical identity matching pipeline which is composed of 4 steps: (1) extraction of relevant attributes from structured and unstructured sources, (2) iterative identity enrichment of the input, (3) fuzzy matching of attribute pairs, (4) building a model to compute a match confidence using similarity and uniqueness.

Identity verification is the process of confirming that an online/digital identity accurately reflects the offline identity of the person who created it. The key insight we will dive deep into is verifying one piece of the online identity, and then applying coherence across various identity attributes to verify all the other attributes of the online identity.

This session is geared towards product, data science, and engineering leaders who would like to introduce state-of-the-art machine-learning techniques to solve identity problems at their respective companies or fortify their existing solutions. Some familiarity with machine learning techniques is preferred, but not required. We will cover relevant background/fundamentals wherever necessary.

Instructor Bio

Coming Soon

Sukhada Palkar

Software Engineer at Airbnb

Workshop: Introduction to Clinical Natural Language Processing: Predicting Hospital Readmission with Discharge Summaries

Clinical notes from physicians and nurses contain a vast wealth of knowledge and insight that can be utilized in predictive models to improve patient care and hospital workflow. In this workshop, we will introduce a few Natural Language Processing techniques for building a machine learning model in Python with clinical notes. As an example, we will focus on predicting unplanned hospital readmissions from discharge summaries using the MIMIC III data set. After completing this tutorial, the audience will know how to prepare data for a machine learning project, preprocess unstructured notes using a bag-of-words approach, build a simple predictive model, assess the quality of the model, and strategize how to improve it. Note to the audience: the MIMIC III data set requires requesting access in advance, so please request access as early as possible.
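
As a preview of the bag-of-words approach, here is a hedged skeleton with invented stand-in notes; the tutorial itself uses MIMIC III discharge summaries, which require access approval:

```python
# Bag-of-words text features feeding a simple readmission classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "patient stable at discharge follow up in two weeks",
    "chronic heart failure poorly controlled multiple prior admissions",
    "wound healing well no complications reported",
    "discharged against medical advice medication noncompliance noted",
]
readmitted = [0, 1, 0, 1]  # toy labels: 1 = readmitted within 30 days

model = make_pipeline(CountVectorizer(stop_words="english"),
                      LogisticRegression())
model.fit(notes, readmitted)
print(model.predict_proba(["frequent readmissions heart failure"])[:, 1])
```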

Instructor Bio

Andrew Long is a Data Scientist at Fresenius Medical Care North America (FMCNA). Andrew holds a PhD in biomedical engineering from Johns Hopkins University and a Master’s degree in mechanical engineering from Northwestern University. Andrew joined FMCNA last year after participating in the Insight Health Data Fellows Program. At FMCNA, he is responsible for building predictive models using machine learning to improve the quality of life of every patient who receives dialysis from FMCNA. He is currently creating a model to predict which patients are at the highest risk of imminent hospitalization.

Andrew Long, PhD

Data Scientist, Fresenius Medical Care

Sign Up for ODSC West | Oct 31st - Nov 3rd 2018

Register Now
Open Data Science Conference