Training Sessions

– Taught by World-Class Data Scientists –

Learn the latest data science concepts, tools, and techniques from the best. Forge a connection with these rockstars from industry and academia, who are passionate about molding the next generation of data scientists.

Highly Experienced Instructors

Our instructors are highly regarded in data science, coming from both academia and notable companies.

Real World Applications

Gain the skills and knowledge to use data science in your career and business, without breaking the bank.

Cutting Edge Subject Matter

Find training sessions offered on a wide variety of data science topics from machine learning to data visualization.

ODSC Training Includes

Form a working relationship with some of the world’s top data scientists for follow-up questions and advice.

Additionally, your ticket includes access to 50+ talks and workshops.

High quality recordings of each session, exclusively available to premium training attendees.

Equivalent training at other conferences costs much more.

Professionally prepared learning materials, custom tailored to each course.

Opportunities to connect with other ambitious like-minded data scientists.

10+ reasons people are attending ODSC West 2018

See Reasons

A Few of Our Training Sessions and Workshops

More sessions coming soon.

Training: Introduction to Machine Learning

Machine learning has become an indispensable tool across many areas of research and commercial applications. From text-to-speech for your phone to detecting the Higgs boson, machine learning excels at extracting knowledge from large amounts of data. This talk will give a general introduction to machine learning, as well as introduce practical tools for you to apply machine learning in your research. We will focus on one particularly important subfield of machine learning, supervised learning. The goal of supervised learning is to “learn” a function that maps inputs x to an output y, by using a collection of training data consisting of input-output pairs. We will walk through formalizing a problem as a supervised machine learning problem, creating the necessary training data, and applying and evaluating a machine learning algorithm. The talk should give you all the necessary background to start using machine learning yourself.
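For a flavor of the workflow this session builds up to, here is a minimal, hedged scikit-learn sketch; the dataset and estimator are chosen for illustration and are not necessarily what the session uses:

```python
# Minimal supervised-learning sketch: learn a function mapping inputs X to
# outputs y from input-output pairs, then evaluate on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                       # training data (X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)               # "learn" the mapping
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))    # evaluate
```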

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Intermediate Machine Learning with scikit-learn

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning. Apart from metrics for model evaluation, we will cover how to evaluate model complexity and how to tune parameters with grid search and randomized parameter search, and what their trade-offs are. We will also cover out-of-core text feature processing via feature hashing.
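As a taste of the pipeline and parameter search material, a minimal sketch; the dataset and parameter grid are illustrative assumptions, not the session’s exact examples:

```python
# Chain preprocessing and a model into a Pipeline, then grid-search over it
# so that cross-validation always sees the full preprocessing + model chain.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# Parameter names use the "step__parameter" syntax
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test score:", grid.score(X_test, y_test))
```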

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Machine Learning in R Part I

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the glm. We look at how to assess model quality with traditional measures and cross-validation, and how to visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to Boosted Decision Trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.
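For reference, the Elastic Net fit by `glmnet` minimizes a penalized least-squares objective in which α = 1 gives the lasso and α = 0 gives ridge; this is the standard formulation, included here only for orientation:

$$\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 \;+\; \lambda\Bigl(\alpha\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2\Bigr)$$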

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City; the organizer of the New York Open Statistical Programming Meetup and the New York R Conference; and an adjunct professor of statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Machine Learning in R Part II

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the glm. We look at how to assess model quality with traditional measures and cross-validation, and how to visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to Boosted Decision Trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City; the organizer of the New York Open Statistical Programming Meetup and the New York R Conference; and an adjunct professor of statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Advanced Machine Learning with scikit-learn Part I

Abstract Coming Soon

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Advanced Machine Learning with scikit-learn Part II

Abstract Coming Soon

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Apache Spark with Python for Data Science and Machine Learning at Scale Part I

We’ll start with the basics of machine learning on Apache Spark: when to use it, how it works, and how it compares to all of your other favorite data science tooling.

You’ll learn to use Spark (with Python) for statistics, modeling, inference, and model tuning. But you’ll also get a peek behind the APIs: see why the pieces are arranged as they are, how to get the most out of the docs, open source ecosystem, third-party libraries, and solutions to common challenges.
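To make that concrete, a minimal sketch of a Spark ML pipeline in Python; the file and column names here are hypothetical, and the session provides its own materials:

```python
# Assemble raw columns into the single feature vector Spark ML expects,
# then fit a model through a Pipeline so the steps travel together.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("sketch").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="y")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)
model.transform(df).select("y", "prediction").show(5)
```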

We will then look at some of the newest features in Spark that allow elegant, high-performance integration with popular Python tooling; distributed scheduling for popular libraries like XGBoost and TensorFlow; and fast inference.

By lunch, you will understand when, why, and how Spark fits into the data science world, and you’ll be comfortable doing your own feature engineering and modeling with Spark.

By the end of the day, you will be caught up on the latest, easiest, fastest, and most user-friendly ways of applying Apache Spark in your job and/or research.

Instructor Bio

Adam Breindel consults and teaches widely on Apache Spark, big data engineering, and machine learning/AI/deep learning. He supports instructional initiatives and teaches as a senior instructor at Databricks, teaches classes on Apache Spark and on deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures.
 
Adam’s first full-time job in tech was on neural-net-based fraud detection deployed at North America’s largest banks … in 1998. Since then, he’s worked with numerous startups (3 successful exits) where he enjoyed getting to build the future (e.g., universal mobile check-in for 2 of America’s 5 biggest airlines … in 2004, 3 years before the iPhone release). He has also worked in entertainment, insurance, and retail banking, on web, embedded, and server apps as well as on clustering architectures, APIs, and streaming analytics.

Adam Breindel

Featured Apache Spark Instructor, Data Science Trainer and Consultant

Training: Apache Spark with Python for Data Science and Machine Learning at Scale Part II

We’ll start with the basics of machine learning on Apache Spark: when to use it, how it works, and how it compares to all of your other favorite data science tooling.

You’ll learn to use Spark (with Python) for statistics, modeling, inference, and model tuning. But you’ll also get a peek behind the APIs: see why the pieces are arranged as they are, how to get the most out of the docs, open source ecosystem, third-party libraries, and solutions to common challenges.

We will then look at some of the newest features in Spark that allow elegant, high-performance integration with popular Python tooling; distributed scheduling for popular libraries like XGBoost and TensorFlow; and fast inference.

By lunch, you will understand when, why, and how Spark fits into the data science world, and you’ll be comfortable doing your own feature engineering and modeling with Spark.

By the end of the day, you will be caught up on the latest, easiest, fastest, and most user-friendly ways of applying Apache Spark in your job and/or research.

Instructor Bio

Adam Breindel consults and teaches widely on Apache Spark, big data engineering, and machine learning/AI/deep learning. He supports instructional initiatives and teaches as a senior instructor at Databricks, teaches classes on Apache Spark and on deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures.
 
Adam’s first full-time job in tech was on neural-net-based fraud detection deployed at North America’s largest banks … in 1998. Since then, he’s worked with numerous startups (3 successful exits) where he enjoyed getting to build the future (e.g., universal mobile check-in for 2 of America’s 5 biggest airlines … in 2004, 3 years before the iPhone release). He has also worked in entertainment, insurance, and retail banking, on web, embedded, and server apps as well as on clustering architectures, APIs, and streaming analytics.

Adam Breindel

Featured Apache Spark Instructor, Data Science Trainer and Consultant

Training: Network Analysis Made Simple Part I

Have you ever wondered about how those data scientists at Facebook and LinkedIn make friend recommendations? Or how epidemiologists track down patient zero in an outbreak? If so, then this tutorial is for you. In this tutorial, we will use a variety of datasets to help you understand the fundamentals of network thinking, with a particular focus on constructing, summarizing, and visualizing complex networks.

This tutorial is for Pythonistas who want to understand relationship problems – as in, data problems that involve relationships between entities. Participants should already have a grasp of for loops and basic Python data structures (lists, tuples and dictionaries). By the end of the tutorial, participants will have learned how to use the NetworkX package in the Jupyter environment, and will become comfortable in visualizing large networks using Circos plots. Other plots will be introduced as well.
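As a preview, a minimal NetworkX sketch along those lines; the data here is a toy friendship graph, and the tutorial’s datasets are richer:

```python
# Construct a small network, summarize it, and make a naive friend
# recommendation: neighbors-of-neighbors who are not yet friends.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")])

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
centrality = nx.degree_centrality(G)
print(sorted(centrality.items(), key=lambda kv: -kv[1]))   # most connected first

friends = set(G.neighbors("dan"))
recs = {n for f in friends for n in G.neighbors(f)} - friends - {"dan"}
print("suggest for dan:", recs)                            # {'ann', 'bob'}
```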

Instructor Bio

Eric is an Investigator on the Scientific Data Analysis team at the Novartis Institutes for Biomedical Research, where he solves biological problems using machine learning. He obtained his Doctor of Science (ScD) from the Department of Biological Engineering at MIT, and was an Insight Health Data Fellow in the summer of 2017. He has taught network analysis at a variety of data science venues, including PyCon USA, SciPy, PyData, and ODSC, and has also co-developed the Python network analysis curriculum on DataCamp. As an open source contributor, he has made contributions to PyMC3, matplotlib, and bokeh. He has also led the development of the graph visualization package nxviz and the data cleaning package pyjanitor (a Python port of the R package janitor).

Eric Ma, ScD

DS Investigator, Novartis Institutes, Author of nxviz Package

Training: Network Analysis Made Simple Part II

Have you ever wondered about how those data scientists at Facebook and LinkedIn make friend recommendations? Or how epidemiologists track down patient zero in an outbreak? If so, then this tutorial is for you. In this tutorial, we will use a variety of datasets to help you understand the fundamentals of network thinking, with a particular focus on constructing, summarizing, and visualizing complex networks.

This tutorial is for Pythonistas who want to understand relationship problems – as in, data problems that involve relationships between entities. Participants should already have a grasp of for loops and basic Python data structures (lists, tuples and dictionaries). By the end of the tutorial, participants will have learned how to use the NetworkX package in the Jupyter environment, and will become comfortable in visualizing large networks using Circos plots. Other plots will be introduced as well.

Instructor Bio

Eric is an Investigator on the Scientific Data Analysis team at the Novartis Institutes for Biomedical Research, where he solves biological problems using machine learning. He obtained his Doctor of Science (ScD) from the Department of Biological Engineering at MIT, and was an Insight Health Data Fellow in the summer of 2017. He has taught network analysis at a variety of data science venues, including PyCon USA, SciPy, PyData, and ODSC, and has also co-developed the Python network analysis curriculum on DataCamp. As an open source contributor, he has made contributions to PyMC3, matplotlib, and bokeh. He has also led the development of the graph visualization package nxviz and the data cleaning package pyjanitor (a Python port of the R package janitor).

Eric Ma, ScD

DS Investigator, Novartis Institutes, Author of nxviz Package

Training: Introduction to RMarkdown in Shiny

Markdown Primer (45 minutes): Structure Documents with Sections and Subsections, Format Text, Create Ordered and Unordered Lists, Make Links, Number Sections, Include a Table of Contents

Integrate R Code (30 minutes): Insert Code Chunks, Hide Code, Set Chunk Options, Draw Plots, Speed Up Code with Caching

Build RMarkdown Slideshows (20 minutes): Understand Slide Structure, Create Sections, Set Background Images, Include Speaker Notes, Open Slides in Speaker Mode

Develop Flexdashboards (30 minutes): Start with the Flexdashboard Layout, Design Columns and Rows, Use Multiple Pages, Create Social Sharing, Include Code

Build Shiny Apps: Shiny Inputs (Drop-Downs, Text, Radio Buttons, Checkboxes), Shiny Outputs (Text, Tables, Plots), Reactive Expressions, HTML Widgets (Interactive Plots, Maps, and Tables), Shiny Layouts, UI and Server Files

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference and an Adjunct Professor of Statistics at Columbia University. With a masters from Columbia University in statistics and a bachelors from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fund raising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, data management and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R Programming geared toward Data Scientists and Non-Statisticians alike and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Intermediate RMarkdown in Shiny

Markdown Primer (45 minutes): Structure Documents with Sections and Subsections, Format Text, Create Ordered and Unordered Lists, Make Links, Number Sections, Include a Table of Contents

Integrate R Code (30 minutes): Insert Code Chunks, Hide Code, Set Chunk Options, Draw Plots, Speed Up Code with Caching

Build RMarkdown Slideshows (20 minutes): Understand Slide Structure, Create Sections, Set Background Images, Include Speaker Notes, Open Slides in Speaker Mode

Develop Flexdashboards (30 minutes): Start with the Flexdashboard Layout, Design Columns and Rows, Use Multiple Pages, Create Social Sharing, Include Code

Build Shiny Apps: Shiny Inputs (Drop-Downs, Text, Radio Buttons, Checkboxes), Shiny Outputs (Text, Tables, Plots), Reactive Expressions, HTML Widgets (Interactive Plots, Maps, and Tables), Shiny Layouts, UI and Server Files

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference and an Adjunct Professor of Statistics at Columbia University. With a masters from Columbia University in statistics and a bachelors from Muhlenberg College in mathematics, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fund raising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, data management and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R Programming geared toward Data Scientists and Non-Statisticians alike and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Programming with Data: Python and Pandas

Whether in R, MATLAB, Stata, or Python, modern data analysis, for many researchers, requires some kind of programming. The preponderance of tools and specialized languages for data analysis suggests that general-purpose programming languages like C and Java do not readily address the needs of data scientists; something more is needed.

In this workshop, you will learn how to accelerate your data analyses using the Python language and Pandas, a library specifically designed for interactive data analysis. Pandas is a massive library, so we will focus on its core functionality: loading, filtering, grouping, and transforming data. Having completed this workshop, you will understand the fundamentals of Pandas, be aware of common pitfalls, and be ready to perform your own analyses.
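A minimal sketch of those four core operations; the file and column names are hypothetical:

```python
# Load, filter, group, and transform with Pandas.
import pandas as pd

df = pd.read_csv("sales.csv")                    # loading

# Filtering with a boolean mask; .copy() sidesteps the common
# SettingWithCopy pitfall when we assign a new column later.
recent = df[df["year"] >= 2017].copy()

# Grouping: total and mean revenue per region
summary = recent.groupby("region")["revenue"].agg(["sum", "mean"])

# Transforming: group-wise z-scores stay aligned with the original rows
recent["rev_z"] = recent.groupby("region")["revenue"].transform(
    lambda s: (s - s.mean()) / s.std()
)
print(summary)
```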

Instructor Bio

Daniel Gerlanc has worked as a data scientist for more than a decade and written software professionally for 15 years. He spent 5 years as a quantitative analyst with two Boston hedge funds before starting Enplus Advisors. At Enplus, he works with clients on data science and custom software development, with a particular focus on projects requiring expertise in both areas. He teaches data science and software development at introductory through advanced levels. He has coauthored several open source R packages, published in peer-reviewed journals, and is active in local predictive analytics groups.

Daniel Gerlanc

Data Science and Software Engineering Instructor, President, Enplus Advisors

Workshop: Latest Developments in GANs

Generative adversarial networks (GANs) are widely considered one of the most interesting developments in machine learning and AI in the last decade. In this wide-ranging talk, we’ll start by covering the fundamentals of how and why they work, reviewing basic neural network and deep learning terminology in the process; we’ll then cover the latest applications of GANs, from generating art from drawings, to advancing research areas such as Semi-Supervised Learning, and even generating audio. We’ll also examine the progress on improving GANs themselves, showing the tricks researchers have used to increase the realism of the images GANs generate.
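As a one-line refresher for the fundamentals portion, the original GAN objective (Goodfellow et al., 2014) trains the generator G and discriminator D as a two-player minimax game:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr] + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]$$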

Throughout, we’ll touch on many related topics, such as different ways of scoring GANs, and many of the Deep Learning-related tricks that have been found to improve training. Finally, we’ll close with some speculation from the leading minds in the field on where we are most likely to see GANs applied next.

Attendees will leave with a better understanding of the latest developments in this exciting area and the technical innovations that made those developments possible. Emphasis will be placed throughout on illuminating why the latest achievements have worked, not just what they are. Furthermore, a link to a clean, documented GitHub repo with a working GAN will be provided for attendees to see how to code one up. Attendees will thus leave feeling more confident and empowered to apply these same tricks to solve problems they face in personal projects or at work.

Instructor Bio

Seth loves teaching and learning cutting-edge machine learning concepts, applying them to solve companies’ problems, and teaching others to do the same. Seth discovered data science and machine learning while working in consulting in early 2014. After taking virtually every course Udacity and Coursera had to offer on data science, he joined Trunk Club as their first data scientist in December 2015. There, he worked on lead scoring, recommenders, and other projects before joining Metis in April 2017 as a Senior Data Scientist, teaching the Chicago full-time course. Over the past six months, he has developed a passion for neural nets and deep learning, writing a neural net library from scratch and sharing what he has learned with others via blog posts (on sethweidman.com), as well as speaking at Meetups and conferences.

Seth Weidman

Senior Data Scientist, Metis

Workshop: The Power of Monotonicity to Make ML Make Sense

The key to machine learning is getting the right flexibility. For many ML problems, we have prior knowledge about global trends the model should be capturing, like that predicted travel time should go up if traffic gets worse. But flexible models like DNNs and random forests can have a hard time capturing such global trends given noisy training data, which limits their ability to extrapolate well when you run a model on examples different from your training data. TensorFlow’s new TensorFlow Lattice tools let you create flexible models that can respect the global trends you request, producing easier-to-debug models that generalize well. TF Lattice provides new TF Estimators that make capturing your global trends easy, and we’ll also explain the underlying new TF Lattice operators that you can use to create your own deeper lattice networks.
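TF Lattice itself is covered in the workshop; as a simpler stand-in for the core idea, here is a sketch using scikit-learn’s IsotonicRegression, a different, one-dimensional technique shown only to illustrate fitting under a known global trend:

```python
# Fit a function constrained to be monotonically increasing, so noise in
# the training data cannot produce a dip in predicted travel time.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
traffic = np.sort(rng.uniform(0, 10, 100))                 # congestion level
travel_time = 10 + 2 * traffic + rng.normal(0, 3, 100)     # noisy, trending up

iso = IsotonicRegression(increasing=True)
predicted = iso.fit_transform(traffic, travel_time)        # monotone fit
```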

Instructor Bio

Maya Gupta leads Google’s Glassbox Machine Learning R&D team, which focuses on designing and developing controllable and interpretable machine learning algorithms that solve Google product needs. Prior to Google, Gupta was an Associate Professor of Electrical Engineering at the University of Washington from 2003 to 2013. Her PhD is from Stanford, and she holds a BS in Electrical Engineering and a BA in Economics from Rice.

Maya Gupta, PhD

Glassbox ML R&D Team Lead at Google

Workshop: Multivariate Time Series Forecasting Using Statistical and Machine Learning Models

Time series data is ubiquitous: weekly initial unemployment claim, daily term structure of interest rates, tick level stock prices, weekly company sales, daily foot traffic recorded by mobile devices, and daily number of steps taken recorded by a wearable, just to name a few.

Some of the most important and commonly used data science techniques in time series forecasting are those developed in the field of machine learning and statistics. Data scientists should have at least a few basic time series statistical and machine learning modeling techniques in their toolkit.

This lecture discusses the formulation of Vector Autoregressive (VAR) models, one of the most important classes of multivariate time series statistical models, as well as neural network-based techniques, which have received a lot of attention in the data science community in the past few years. It demonstrates how these models are implemented in practice and compares their advantages and disadvantages. Real-world applications, demonstrated using Python, are used throughout the lecture to illustrate these techniques. While not the focus of this lecture, exploratory time series data analysis using histograms, kernel density plots, time-series plots, scatterplot matrices, plots of autocorrelation (i.e., correlograms), plots of partial autocorrelation, and plots of cross-correlations will also be included in the demo.
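As a small taste of the statistical side, a hedged sketch of fitting a VAR model with statsmodels; the data is simulated, while the lecture uses real-world series and covers lag selection in depth:

```python
# Fit a bivariate VAR and forecast a few steps ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.RandomState(0)
levels = rng.normal(size=(200, 2)).cumsum(axis=0)           # two random walks
data = pd.DataFrame(levels, columns=["claims", "rates"]).diff().dropna()

model = VAR(data)
results = model.fit(2)          # 2 lags here; an information criterion could pick the order
print(results.summary())
print(results.forecast(data.values[-results.k_ar:], steps=5))
```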

Instructor Bio

Jeffrey is the Chief Data Scientist at AllianceBernstein, a global investment firm managing over $500 billion. He is responsible for building and leading the data science group, partnering with investment professionals to create investment signals using data science, and collaborating with sales and marketing teams to analyze clients. He holds a Ph.D. in economics from the University of Pennsylvania and has taught statistics, econometrics, and machine learning courses at UC Berkeley, Cornell, NYU, the University of Pennsylvania, and Virginia Tech. Previously, Jeffrey held advanced analytics positions at Silicon Valley Data Science, Charles Schwab Corporation, KPMG, and Moody’s Analytics.

Jeffrey Yau, PhD

Chief Data Scientist, AllianceBernstein

Workshop: Applying Deep Learning to Article Embedding for Fake News Evaluation

In this talk, we explore real-world use cases for automated “Fake News” evaluation using contemporary deep learning article vectorization and tagging. We begin with the use case and an evaluation of the appropriate contexts for various deep learning applications in fake news evaluation. Technical material will review several methodologies for article vectorization with classification pipelines, ranging from traditional to advanced deep-architecture techniques. We close with a discussion of troubleshooting and performance optimization when consolidating and evaluating these various techniques on active data sets.
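For orientation, a hedged sketch of the traditional end of that spectrum, a TF-IDF bag-of-words baseline with a linear classifier; the data is toy, and the talk’s deep embedding methods go well beyond this:

```python
# Vectorize articles with TF-IDF, then fit a linear classifier.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled articles: 1 = fake, 0 = legitimate
articles = ["shocking cure doctors hate", "fed raises interest rates",
            "aliens endorse candidate", "city council passes budget"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(articles, labels)
print(clf.predict(["senate passes interest rate bill"]))
```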

Instructor Bio

Mike serves as Head of Data Science at Uber ATG, is on the UC Berkeley Data Science faculty, and heads Skymind Labs, the machine learning research lab affiliated with DeepLearning4J. He has led teams of data scientists in the Bay Area as Chief Data Scientist for InterTrust and Takt, Director of Data Sciences for MetaScale/Sears, and CSO for Galvanize, where he founded the galvanizeU-UNH accredited Master of Science in Data Science degree and oversaw the company’s transformation from co-working space to data science organization.

Michael Tamir, PhD

Head of Data Science, Uber

Workshop: Model Evaluation in the Land of Deep Learning

Model evaluation metrics are typically tied to the predictive learning task. There are different metrics for classification (ROC-AUC, confusion matrix), regression (RMSE, R2 score), ranking (precision-recall, F1 score), and so on. These metrics, coupled with cross-validation or hold-out validation techniques, might help analysts and data scientists select a performant model. However, model performance decays over time because of variability in the data, and at that point, point-estimate-based metrics are not enough; a better understanding of the why, what, and how of the categorization process is needed.
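A minimal sketch of those point-estimate metrics, computed with scikit-learn on synthetic data:

```python
# Compute the standard evaluation metrics on a toy classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(confusion_matrix(y_te, pred))
print("F1:", f1_score(y_te, pred))
print("ROC-AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```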

Evaluating model decisions might still be easy for linear models but gets difficult in the world of deep neural networks (DNNs). This complexity might increase multifold for use cases related to computer vision (image classification, image captioning, or visual Q&A (VQA)), text classification, sentiment analysis, or topic modeling. ResNet, a recently published state-of-the-art DNN, has over 200 layers. Interpreting input features and output categorizations over that many layers is challenging. The lack of decomposability and intuitiveness associated with DNNs prevents widespread adoption even with their superior performance compared to more classical machine learning approaches. Faithful interpretation of DNNs will not only provide insight into failure modes (false positives and false negatives) but will also help the humans in the loop evaluate the robustness of the model against noise. This brings trust and transparency to the predictive algorithm.

In this workshop, I will share how to enable class-discriminative visualizations for computer vision and NLP problems when using convolutional neural networks (CNNs), and an approach to help enable transparency of CNNs by not only capturing metrics during the validation step but also highlighting the salient features in images and text that drive predictions.
I will also talk briefly about the open source project Skater (https://github.com/datascienceinc/Skater) and how it can help solve our interpretation needs.

Instructor Bio

Pramit Choudhary is an applied machine learning research scientist. Currently, he is the Lead Data Scientist at datascience.com (R&D labs), which was acquired by Oracle. He focuses on optimizing and applying machine learning to solve real-world problems, and his research areas include scaling and optimizing machine learning algorithms. He is currently exploring better ways to explain a model’s learned decision policies, to reduce the chaos in building effective models and close the gap between a prototype and an operationalized model.

Pramit Choudhary

Lead Data Scientist at datascience.com

Workshop: Raise Your Own Pandas Cub

A typical data scientist’s workflow in Python consists of firing up a Jupyter Notebook, importing NumPy, Pandas, Matplotlib, and Scikit-Learn into the workspace and then completing a data analysis. The APIs from these libraries are well-known, mostly stable, and provide a powerful and flexible way of analyzing data. These libraries have contributed an enormous amount to the success of Python as a language of choice for doing data science as well as increasing productivity for the data scientists that use them.

For those data scientists who are interested in learning how to develop their own data science tools, relying on these popular, easy-to-use libraries hides the complexities and the underlying Python code. In fact, it is so easy to produce data science results in Python that one only needs to know the very basics of the language along with knowledge of the library’s API.

In this hands-on tutorial, we will build our own data analysis package from scratch. Specifically, our package will contain a DataFrame Class with a Pandas-like API. We will make heavy use of the Python data model, which contains special methods to help our DataFrame work with Python operators. By the end of the tutorial, we will have built a Python package that you can import into your workspace capable of performing the most important operations available in Pandas.
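To preview the central idea, a hedged sketch of a tiny DataFrame class whose behavior comes from Python data model (dunder) methods; the tutorial’s actual class is far more capable:

```python
# A minimal DataFrame: columns are NumPy arrays, and Python's special
# methods make it work with len(), [], and arithmetic on columns.
import numpy as np

class DataFrame:
    def __init__(self, data):
        # data: dict mapping column name -> sequence of values
        self._data = {col: np.asarray(vals) for col, vals in data.items()}

    def __len__(self):                      # len(df) -> number of rows
        return len(next(iter(self._data.values())))

    def __getitem__(self, col):             # df["a"] -> column array
        return self._data[col]

    def __setitem__(self, col, values):     # df["c"] = ... adds a column
        self._data[col] = np.asarray(values)

    def __repr__(self):
        return f"DataFrame(columns={list(self._data)}, rows={len(self)})"

df = DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
df["c"] = df["a"] + df["b"]                 # NumPy ufuncs do the arithmetic
print(df, df["c"])
```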

Instructor Bio

Ted Petrou is the author of Pandas Cookbook and founder of both Dunder Data and the Houston Data Science Meetup group. He worked as a data scientist at Schlumberger, where he spent the vast majority of his time exploring data. Ted received his Master’s degree in statistics from Rice University and used his analytical skills to play poker professionally and teach math before becoming a data scientist.

Ted Petrou

Pandas Author, Founder at Dunder Data

Workshop: Visual Elements of Data Science

“Above all, show the data” (Edward Tufte)

Data Visualization is fundamental not only to data exploration, but to addressing data science problems in general. It is a key technique in the descriptive statistics (e.g., boxplots, histograms, distribution charts, heatmaps), diagnostics (e.g., scatterplots, Geiger counter charts, digital elevation models) and predictive layers (e.g., decision trees, artificial neural networks) of the data science stack. For example, visualization is a means to understand relationships between variables, to recognize patterns, to detect outliers, and to break down complexity. Effective ways to describe and summarize data sets are also very helpful in communicating with clients and collaborators in a more quantitative and rational way. Therefore, implementing and utilizing data visualizations is a key skill that every data scientist must have in their repertoire.

While enterprises and businesses across industries now widely use dashboards and other (often commercial) business intelligence software to generate data visualizations, data scientists usually still rely heavily on creating charts from scratch in scripting languages and other open source coding environments. This is because they need to not only explore raw data and data aggregates, but also review model outputs visually and prepare charts for presentations and publications. The most widely used tools currently include ggplot2, plotly, and shiny (R), as well as matplotlib, Seaborn, and Bokeh (Python).

This session reviews key elements of the effective use of data visualizations in data science industry applications. These include (1) a narrative, a story to tell about the data; (2) simplicity; and (3) conciseness, balancing information and complexity while avoiding too much decoration (the aesthetics concept). It also addresses how to choose the right chart for a given data set, depending on different contexts and questions. What are some simple rules to follow for a good graphic, and which common errors need to be avoided? How do you know if your graph is accurately representing the underlying data set? This is particularly important for high-dimensional data sets and growing data volumes in the age of Big Data.

In this workshop, state-of-the-art scripts and packages in R and Python will be used to demo how to plot heatmaps, time series charts, and network graphs, as well as representations and maps for geospatial data sets.

Instructor Bio

Olaf Menzer is a Data Scientist in the Decision Analytics team at Pacific Life in Newport Beach, California. His focus areas are around enabling business process improvements and the generation of insights through data synthesis, the application of advanced analytics and technology more broadly. He is also a Visiting Researcher at the University of California, Santa Barbara, contributing to primary research articles and statistical applications in Ecosystem Science.

Prior to working at Pacific Life, Olaf was a Predictive Analyst at Ingram Micro, designing, implementing and testing sales forecasting models, lead generation engines and product recommendation algorithms for cross-selling millions of technology products. He also held different Research Assistant roles at the Lawrence Berkeley National Lab and the Max Planck Institute in Germany where he supported scientific computing, data analysis and machine learning applications.

Olaf was a speaker at the INFORMS Business Analytics conference in 2016, Predictive Analytics World in 2018 and at several academic conferences in the past. He received a M.Sc. in Bioinformatics from Friedrich Schiller University in Germany (2011), and a Ph.D. in Geographic Information Science from University of California, Santa Barbara (2015).

Olaf Menzer, PhD

Senior Data Scientist at Pacific Life

Workshop: Machine Learning for Digital Identity

There are tens of billions of online profiles today, each associated with some identity, on diverse platforms including social networks, online marketplaces, dating sites and financial institutions. Every platform needs to understand, validate and verify these identities.

The landscape of identity challenges, available data, and machine-learning technology have evolved over the years. However, identity still remains a notoriously hard problem. While we’ve made a lot of progress in academia and industry, there still are several unsolved problems. In this session, we will talk through three core, interconnected problems: (1) identity authentication/validation; (2) identity matching; (3) identity verification. We will discuss our work on effectively using machine learning technology to solve these problems, along with an analysis of popular techniques used on different platforms.

Identity authentication and validation ensure high-quality attributes that affect all downstream identity processes. The challenge of identity authentication is determining whether an input identity/attribute is a valid value. While identity validation solutions need to be tailored to the attribute type, we will share some of the common techniques applicable across all attribute types: (1) canonicalizing attribute values, and then (2) looking them up against constructed datasets of the universe of all possible values. We will also discuss how some of these generic techniques are applied to the validation of two different types of attributes: names and government-issued IDs.

Identity matching is fundamental for two main applications: detecting duplicates and joining with other, often external, data sources to create a richer identity. We will describe the typical identity matching pipeline which is composed of 4 steps: (1) extraction of relevant attributes from structured and unstructured sources, (2) iterative identity enrichment of the input, (3) fuzzy matching of attribute pairs, (4) building a model to compute a match confidence using similarity and uniqueness.
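As a minimal illustration of step (3), a fuzzy-matching sketch using only the Python standard library; production systems use far richer similarity models:

```python
# Normalized edit-based similarity between two attribute values.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("Jonathan Q. Smith", "Jon Smith"))      # likely a match
print(similarity("Jonathan Q. Smith", "Maria Garcia"))   # likely not
```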

Identity verification is the process of confirming that an online/digital identity accurately reflects the offline identity of the person who created it. The key insight we will dive deep into is verification of one piece of the online identity, and then applying coherence across various identity attributes to verify all other attributes of the online identity.

This session is geared towards product, data science, and engineering leaders who would like to introduce state-of-the-art machine-learning techniques to solve identity problems at their respective companies or fortify their existing solutions. Some familiarity with machine learning techniques is preferred, but not required. We will cover relevant background/fundamentals wherever necessary.

Instructor Bio

Coming Soon.

Anish Das Sarma, PhD

Engineering Manager at Airbnb

Workshop: Machine Learning for Digital Identity

There are tens of billions of online profiles today, each associated with some identity, on diverse platforms including social networks, online marketplaces, dating sites and financial institutions. Every platform needs to understand, validate and verify these identities.

The landscape of identity challenges, available data, and machine-learning technology have evolved over the years. However, identity still remains a notoriously hard problem. While we’ve made a lot of progress in academia and industry, there still are several unsolved problems. In this session, we will talk through three core, interconnected problems: (1) identity authentication/validation; (2) identity matching; (3) identity verification. We will discuss our work on effectively using machine learning technology to solve these problems, along with an analysis of popular techniques used on different platforms.

Identity authentication and validation ensure high-quality attributes that affect all downstream identity processes. The challenge of identity authentication is determining whether an input identity/attribute is a valid value. While identity validation solutions need to be tailored to the attribute type, we will share some of the common techniques applicable across all attribute types: (1) canonicalizing attribute values, and then (2) looking them up against constructed datasets of the universe of all possible values. We will also discuss how some of these generic techniques are applied to the validation of two different types of attributes: names and government-issued IDs.

Identity matching is fundamental for two main applications: detecting duplicates and joining with other, often external, data sources to create a richer identity. We will describe the typical identity matching pipeline which is composed of 4 steps: (1) extraction of relevant attributes from structured and unstructured sources, (2) iterative identity enrichment of the input, (3) fuzzy matching of attribute pairs, (4) building a model to compute a match confidence using similarity and uniqueness.

Identity verification is the process of confirming that an online/digital identity accurately reflects the offline identity of the person who created it. The key insight we will dive deep into is verification of one piece of the online identity, and then applying coherence across various identity attributes to verify all other attributes of the online identity.

This session is geared towards product, data science, and engineering leaders who would like to introduce state-of-the-art machine-learning techniques to solve identity problems at their respective companies or fortify their existing solutions. Some familiarity with machine learning techniques is preferred, but not required. We will cover relevant background/fundamentals wherever necessary.

Instructor Bio

Coming Soon

Sukhada Palkar

Software Engineer at Airbnb

Workshop: Introduction to Clinical Natural Language Processing: Predicting Hospital Readmission with Discharge Summaries

Clinical notes from physicians and nurses contain a vast wealth of knowledge and insight that can be utilized for predictive models to improve patient care and hospital workflow. In this workshop, we will introduce a few Natural Language Processing techniques for building a machine learning model in Python with clinical notes. As an example, we will focus on predicting unplanned hospital readmission with discharge summaries using the MIMIC III data set. After completing this tutorial, the audience will know how to prepare data for a machine learning project, preprocess unstructured notes using a bag-of-words approach, build a simple predictive model, assess the quality of the model and strategize how to improve the model. Note to the audience: the MIMIC III data set requires requesting access in advance, so please request access as early as possible.
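A hedged sketch of the bag-of-words approach described above, on synthetic note snippets; the workshop itself uses the access-controlled MIMIC III data:

```python
# Bag-of-words features from note text, feeding a simple predictive model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = ["discharged home stable condition",
         "patient remains short of breath on exertion",
         "wound healing well follow up in two weeks",
         "recurrent chest pain prior to discharge"]
readmitted = [0, 1, 0, 1]        # hypothetical labels

model = make_pipeline(CountVectorizer(stop_words="english"),
                      LogisticRegression())
model.fit(notes, readmitted)
print(model.predict_proba(["chest pain and shortness of breath"])[:, 1])
```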

Instructor Bio

Andrew Long is a Data Scientist at Fresenius Medical Care North America (FMCNA). Andrew holds a PhD in biomedical engineering from Johns Hopkins University and a Master’s degree in mechanical engineering from Northwestern University. Andrew joined FMCNA last year after participating in the Insight Health Data Fellows Program. At FMCNA, he is responsible for building predictive models using machine learning to improve the quality of life of every patient who receives dialysis from FMCNA. He is currently creating a model to predict which patients are at the highest risk of imminent hospitalization.

Andrew Long, PhD

Data Scientist, Fresenius Medical Care

Training: Engineering For Data Science

Practicing data scientists typically spend the bulk of their time developing models for a particular inference or prediction application, likely giving substantially less time to the equally complex problems stemming from system infrastructure. We might trivially think of these two often orthogonal concerns as the modeling problem and the engineering problem. The typical data scientist is trained to solve the former, often in an extremely rigorous manner, but can wind up developing a series of ad hoc solutions to the latter. This talk will discuss Docker as a tool for the data scientist, in particular in conjunction with the popular interactive programming platform Jupyter and the cloud computing platform Amazon Web Services (AWS). Using Docker, Jupyter, and AWS, the data scientist can take control of their environment configuration, prototype scalable data architectures, and trivially clone their work toward replicability and communication. This talk will work toward developing a set of best practices for engineering for data science.

Instructor Bio

Coming soon

Joshua Cook

Principal Lecturer in Data Science at UCLA Extension

Training: Matrix Algorithms at Scale: Randomization and using Alchemist to bridge the Spark-MPI gap

Linear algebra problems form the heart of many machine learning computations, but the demands on linear algebra from scientific machine learning problems can be different than for internet and social media applications. In particular, the need for efficient and scalable numerical linear algebra and machine learning implementations continues to grow with the increasing importance of big data analytics. Since its introduction, Apache Spark has become an integral tool in this field, with attractive features such as ease of use, interoperability with the Hadoop ecosystem, and fault tolerance. However, it has been shown that numerical linear algebra routines implemented using MPI, a tool for parallel programming commonly used in high-performance computing, can outperform the equivalent Spark routines by an order of magnitude or more.

We will describe these evaluations, which explore the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity), and CX (for data interpretability). We apply these methods to terabyte-sized problems in particle physics, climate modeling, and bioimaging, as use cases where interpretable analytics is of interest. Many of these algorithms use randomization in novel ways, and we will describe some of the underlying randomized linear algebra techniques.

Finally, we’ll describe Alchemist, a system for interfacing between Spark and existing MPI libraries that is designed to address this performance gap. The libraries can be called from a Spark application with little effort, and we illustrate how the resulting system leads to efficient and scalable performance on large datasets. We describe use cases from scientific data analysis that motivated the development of Alchemist and that benefit from this system. We’ll also describe related work on communication-avoiding machine learning, optimization-based methods that can call these algorithms, and extending Alchemist to provide an ipython notebook <=> MPI interface.
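As a small illustration of the randomized linear algebra theme, a sketch using scikit-learn’s randomized SVD on a synthetic low-rank matrix; this is not Alchemist or MPI code, just the core primitive:

```python
# Randomized SVD: sketch a large matrix down to a low-rank factorization.
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
# A 2000 x 500 matrix with numerical rank ~50
A = rng.normal(size=(2000, 50)) @ rng.normal(size=(50, 500))

U, s, Vt = randomized_svd(A, n_components=50, random_state=0)

approx = (U * s) @ Vt
print("relative error:", np.linalg.norm(A - approx) / np.linalg.norm(A))
```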

Instructor Bio

Michael Mahoney is at the University of California at Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics, and he has worked and taught at Yale University in the mathematics department, at Yahoo Research, and at Stanford University in the mathematics department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), he was on the National Research Council’s Committee on the Analysis of Massive Data, he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets, and he spent fall 2013 at UC Berkeley co-organizing the Simons Foundation’s program on the Theoretical Foundations of Big Data Analysis.

Michael Mahoney, PhD

Professor at UC Berkeley

Workshop: Scalable Data Science and Deep Learning with R

We provide an overview of the tools available to data scientists using R for Spark and TensorFlow, then discuss the latest developments at the intersections of these ecosystems. We organize the conversation around a diverse selection of use cases, such as ad hoc analysis on distributed datasets, building machine learning models for low latency scoring, and developing deep learning models for research, and demonstrate sample workflows. Various open source R packages will be featured, including the sparklyr, keras, and tensorflow projects.

Instructor Bio

Kevin is a software engineer at RStudio developing open source packages for big data analytics and machine learning. He has held data science positions across different industries, and has experience executing the end-to-end analytics process, from data engineering to model deployment and change management. Prior to RStudio, he was a principal data scientist at Honeywell, and also held roles at KPMG and Citi.

Kevin Kuo
Software Engineer at RStudio

Workshop: Stanned Up: Bayesian Methods Using Stan

Many introductory tutorials on Bayesian inference are not as simple as the authors purport them to be. In this workshop, we provide intuitive and straightforward marketing examples, using the open source probabilistic programming language Stan, showing where traditional linear regression and even other machine learning methods fall short.

We will first show how a hierarchical model improves forecasting for situations where you have categories with a mix of counts, including counts that are traditionally too low for statistical significance. We will use a pay-per-click example for this first tutorial. The second example will demonstrate how early stopping can reduce the cost and time of test-and-learn situations (aka marketing tests, experiments, etc.).
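As a sketch of the first example’s structure (notation ours, not necessarily the workshop’s exact Stan model), a hierarchical binomial model partially pools click-through rates θ_j across categories j:

$$\text{clicks}_j \sim \operatorname{Binomial}(\text{impressions}_j,\ \theta_j), \qquad \operatorname{logit}(\theta_j) \sim \operatorname{Normal}(\mu,\ \sigma)$$

with weakly informative hyperpriors on μ and σ. Categories with few clicks are shrunk toward the shared mean μ, which is what rescues estimates that would otherwise be too noisy to use on their own.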

For both examples, we will walk through the code, and if wifi allows and attendees have pre-installed R, RStudio, and rstan, they can follow along.

Instructor Bio

Curt studied computer science at the University of Illinois at Urbana-Champaign and mathematics at the University of Minnesota. After building too many websites and client/server systems, he turned to data mining and statistics and never looked back. A good day is spent building models in R and Stan.

Curt Bergmann
Senior Data Scientist at Elicit, LLC

Training: Machine Learning with Big Data and TensorFlow on Google Cloud Part I

This training session will be conducted on Google Cloud Platform (GCP) and will use GCP to run TensorFlow. All you need is a laptop with a modern browser.
In the session, you will walk through the process of building a complete machine learning pipeline, covering ingest, exploration, training, evaluation, deployment, and prediction:

Data pipelines and data processing: You will learn how to explore and split large data sets; for this part of the session you will use SQL and Pandas on BigQuery and Cloud Datalab (see the sketch after this list).

Model building: The machine learning models in TensorFlow will be developed locally on a small sample. The preprocessing operations will be implemented using Apache Beam, so that the same preprocessing can be applied in streaming mode as well. The preprocessing and training of the model will be carried out on GCP.

Model inference and deployment: The trained model will be deployed as a REST microservice, and predictions will be invoked from a web application.
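A hedged sketch of that exploration step, pulling a BigQuery sample into Pandas; a public sample table is shown, and the session supplies its own projects and queries:

```python
# Query BigQuery and explore the result in Pandas.
# Assumes GCP credentials are already configured in the environment.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT departure_delay, arrival_delay
    FROM `bigquery-samples.airline_ontime_data.flights`
    LIMIT 10000
"""
df = client.query(sql).to_dataframe()   # results land in a Pandas DataFrame
print(df.describe())
```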

Instructor Bio

Carl is a program manager focused on helping Google’s customers and business partners get trained and certified to run machine learning and data analytics workloads on Google Cloud. With over 16 years of experience in the IT industry, Carl has worked with the world’s leading technology companies across the United States and Europe, including in leadership roles on programs and projects in the areas of big data, cloud computing, service-oriented architecture, machine learning, and computational natural language processing. Carl is the author of over 20 articles in professional, trade, and academic journals, an inventor with 6 USPTO patents, and the holder of 3 corporate awards from IBM for his innovative work. You can find out more about Carl on his blog: http://www.cloudswithcarl.com

Carl Osipov

Staff Program Manager at Google

Training: Machine Learning with Big Data and TensorFlow on Google Cloud Part II

This training session will be conducted on Google Cloud Platform (GCP) and will use GCP to run TensorFlow. All you need is a laptop with a modern browser.
In the session, you will walk through the process of building a complete machine learning pipeline covering ingest, exploration, training, evaluation, deployment, and prediction:
Data pipelines and data processing: You will learn how to explore and split large data sets – for this part of the session you will be using SQL and Pandas on BigQuery and Cloud Datalab.
Model building: The machine learning models in TensorFlow will be developed on a small sample locally. The preprocessing operations will be implemented using Apache Beam, so that the same preprocessing can be applied in streaming mode as well. The preprocessing and training of the model will be carried out on GCP.
Model Inference and Deployment: The trained model will be deployed as a REST microservice and predictions invoked from a web application.

Instructor Bio

Carl is a program manager focused on helping Google’s customers and business partners get trained and certified to run machine learning and data analytics workloads on Google Cloud. With over 16 years of experience in the IT industry, Carl has worked with the world’s leading technology companies across the United States and Europe, including in leadership roles on programs and projects in the areas of big data, cloud computing, service-oriented architecture, machine learning, and computational natural language processing. Carl is the author of over 20 articles in professional, trade, and academic journals, an inventor with 6 USPTO patents, and holds 3 corporate awards from IBM for his innovative work. You can find out more about Carl on his blog http://www.cloudswithcarl.com

Carl Osipov

Staff Program Manager at Google

Workshop: An introduction to Julia for machine learning

In this workshop, we assume no prior exposure to Julia, and will show you why Julia is a fantastic language for machine learning. It should be accessible and useful to data scientists and engineers of all levels, as well as anyone else with technical computing needs and an interest in machine learning. Our goal is that attendees will leave the workshop with an understanding of how easy it is to start programming in Julia, what makes Julia special, and how using Julia for machine learning applications will improve your workflow as a data scientist.

All workshop materials will be provided on juliabox.com so that attendees can operate in a common environment and code along with the instructor.

The first thirty minutes of the workshop will cover language basics and show you how easy it is to pick up Julia’s high-level syntax. To get you up and running with Julia, we will go over syntax for function declarations, loops, conditionals, and linear algebra operations.

The second thirty minutes of the workshop will highlight Julia’s performance. Attendees will learn how to benchmark, see first-hand how quickly Julia code runs compared to C and Python, and learn to take advantage of special features from Julia’s linear algebra infrastructure. Finally, attendees will come to understand how multiple dispatch, a key feature of Julia’s design, helps to make Julia both high-level and performant.

In the last 45 minutes, we will cover special tools for data science and machine learning, where you will see how easy it is to recognize letters in your own handwriting using Flux.

 

Instructor Bio

Jane Herriman is Director of Diversity and Outreach at Julia Computing and a PhD student at Caltech. She is a Julia, dance, and strength training enthusiast and is excited for the opportunity to teach you Julia.

Jane Herriman
Director of Diversity and Outreach at Julia Computing

Workshop: TBD

Coming soon

 

Instructor Bio

Yunus Genes is completing his master’s in computer science and continuing his part-time PhD at the University of Central Florida. His research focuses on applied machine learning, social media behavior, and misinformation detection and diffusion, a field he has worked in for over 4 years. He is currently working on a DARPA-funded project to simulate social media (the SocialSim project), teaches data science to Fortune 50 professionals, and has previously held data science positions in Silicon Valley and in the Orlando, Florida area.

Yunus Genes, PhD

Data Scientist at Royal Caribbean Cruise Line

Training: Real-Time, Continuous ML/AI Model Training, Optimizing, and Predicting with PipelineAI, Scikit-Learn, MXNet, Spark ML, GPU, TPU, Kafka, and Kubernetes (and some TensorFlow)

Chris Fregly, Founder @ PipelineAI, will walk you through a complete, real-world, end-to-end pipeline-optimization example. We highlight hyper-parameters – and model pipeline phases – that have never before been exposed.

While most hyper-parameter optimizers stop at the training phase (i.e. learning rate, tree depth, EC2 instance type, etc.), we extend model validation and tuning into a new post-training optimization phase, including 8-bit reduced-precision weight quantization and neural network layer fusing – among many other framework- and hardware-specific optimizations.
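For a flavor of what post-training 8-bit weight quantization looks like in practice, here is a generic sketch using TensorFlow Lite's converter; this illustrates the technique only, not PipelineAI's own tooling, and the export path is an assumption.

# Post-training weight quantization with TensorFlow Lite: load an exported
# SavedModel, enable the default optimizations (which include 8-bit weight
# quantization), and write out a smaller, faster model for inference.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)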

Next, we introduce hyper-parameters at the prediction phase including request-batch sizing and chipset (CPU v. GPU v. TPU). We’ll continuously learn from all phases of our pipeline – including the prediction phase. And we’ll update our model in real-time using data from a Kafka stream.

Lastly, we determine a PipelineAI Efficiency Score of our overall Pipeline including Cost, Accuracy, and Time. We show techniques to maximize this PipelineAI Efficiency Score using our massive PipelineDB along with the Pipeline-wide hyper-parameter tuning techniques mentioned in this talk.

Instructor Bio

Chris Fregly is Founder at PipelineAI, a Real-Time Machine Learning and Artificial Intelligence Startup based in San Francisco.

He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, and author of the O’Reilly training and video series “High Performance TensorFlow in Production with Kubernetes and GPUs.”

Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.

Chris Fregly

Founder and Research Scientist, Apache Spark Contributor

Workshop: pomegranate: Fast and Flexible Probabilistic Modeling in Python

Pomegranate is a Python package for fast and flexible probabilistic modeling. The basic unit is the probability distribution, and distributions can be combined into compositional models such as hidden Markov models, mixture models, and Bayesian networks. These more complicated models can themselves be used as components of still larger models, such as a mixture of Bayesian networks, or a Bayes classifier of hidden Markov models for classifying sentences instead of fixed feature sets. This format for specifying models is augmented by a variety of sophisticated training strategies, such as multi-threaded parallelism, GPU support, semi-supervised learning, support for missing values, mini-batch learning, out-of-core learning for massive data sets, and any combination of the above. This tutorial will give a high-level overview of the features of pomegranate, the design rationale, a brief comparison to other packages, and an application to practical examples.
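As a taste of that compositional style, here is a minimal sketch using the pomegranate 0.x API (the data is synthetic and purely illustrative; check the current documentation, since the API may have changed).

# Fit a two-component Gaussian mixture directly from samples, using the
# probability distribution as the basic building block.
import numpy as np
from pomegranate import NormalDistribution, GeneralMixtureModel

X = np.concatenate([np.random.normal(0, 1, (500, 1)),
                    np.random.normal(5, 1, (500, 1))])

model = GeneralMixtureModel.from_samples(NormalDistribution, n_components=2, X=X)
print(model.predict_proba(np.array([[0.5], [4.8]])))  # component memberships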

Instructor Bio

Jacob Schreiber is a fifth year Ph.D. student and NSF IGERT big data fellow in the Computer Science and Engineering department at the University of Washington. His primary research focus is on the application of machine learning methods, primarily deep learning ones, to the massive amount of data being generated in the field of genome science. His research projects have involved using convolutional neural networks to predict the three dimensional structure of the genome and using deep tensor factorization to learn a latent representation of the human epigenome. He routinely contributes to the Python open source community, currently as the core developer of the pomegranate package for flexible probabilistic modeling, and in the past as a developer for the scikit-learn project. Future projects include graduating.

Jacob Schreiber
PhD Candidate at University of Washington

Workshop: Automating Trend Discovery on Streaming Datasets with Spark 2.3

In this session we will start off with a deep dive into effective data modeling, then explore some unique methods for automatically surfacing interesting patterns in your data, all using Spark SQL. More importantly, we will discuss how to do this in batch mode, and then how to easily migrate from batch mode to streaming using Spark Structured Streaming.

We will walk through techniques for reducing the memory footprint of statistical aggregations, giving you the ability to more efficiently scale out your systems to handle many millions of records (all in memory) while maintaining a relatively small footprint, all via the use of data sketching. The idea here is to leverage quantile sketches to auto-analyze the change in shape and behavior of seemingly disparate datasets, and to find common dimensions (features) across many different metrics.

We will also go over how to handle common serialization problems with respect to the storage and retrieval of partially aggregated data when updating your streaming applications. Lastly, we will finish off by talking about how to use windowed statistical aggregations and rollups to automatically detect trends in your data while also being able to handle the dreaded issue of data seasonality.
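The session's examples are in Scala, but the same windowed-rollup pattern reads naturally in PySpark; the sketch below uses the built-in rate source as a stand-in for a real Kafka stream, so the schema and window sizes are assumptions for illustration.

# Windowed statistical aggregations in Spark Structured Streaming, with a
# watermark so late (seasonal or delayed) data is bounded in state.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trend-rollups").getOrCreate()

# The `rate` source emits (timestamp, value) rows, a stand-in for Kafka.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

rollup = (events
          .withWatermark("timestamp", "1 minute")
          .groupBy(F.window("timestamp", "30 seconds", "10 seconds"))
          .agg(F.avg("value").alias("mean"),
               F.stddev("value").alias("stddev"),
               F.count("value").alias("n")))

query = (rollup.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()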

This session will cover some best practices and patterns for writing streaming applications with Apache Spark 2.3, including how to write effective unit tests to ensure your applications can handle live updates in production. A working application will be made available at the start of the presentation. Knowledge of Spark and Scala is a must in order to take full advantage of this information.

Instructor Bio

Scott Haines is a Principal Software Engineer / Tech Lead on the Voice Insights team at Twilio. His focus has been on the architecture and development of a real-time (sub-250ms), highly available, trustworthy analytics system. His team provides near real-time analytics, processing, aggregating, and analyzing multiple terabytes of global sensor data daily. Scott helped drive Apache Spark adoption at Twilio and actively teaches and consults with teams internally. Previously, Scott worked at Yahoo!, where he built a real-time recommendation engine and targeted ranking / ratings analytics that helped serve personalized page content for millions of customers of Yahoo Games. He also built a real-time click / install tracking system that helped deliver customized push marketing and ad attribution for Yahoo Sports. Scott finished his tenure at Yahoo working for Flurry Analytics, where he wrote an auto-regressive smart alerting and notification system integrated into the Flurry mobile app for iOS and Android.

Scott Haines
Principal Software Engineer at Twilio

Workshop: Balancing ML accuracy, interpretability and costs when building a model

As data scientists we strive to deliver high-performance models, but in the real world the best model possible is usually not the best model for the business. If a model is not interpretable by the business, you will be unable to get the buy-in necessary to put it into production. Additionally, you are always fighting two cost-related battles: the opportunity cost of delivering a perfect model tomorrow instead of a good one today, and the operational cost of the most superior model compared to the next best one. This workshop will use real-world coding examples in Python to demonstrate how to be mindful of these constraints when developing your models.
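As a minimal sketch of that trade-off (with synthetic data standing in for the workshop's real-world examples), compare an interpretable linear model against a more opaque ensemble on the same task:

# If the accuracy gap between the two is small, the interpretable model may
# be the better business choice once explainability and operating costs are
# weighed in.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

models = [("logistic regression (interpretable)",
           LogisticRegression(solver="lbfgs", max_iter=1000)),
          ("gradient boosting (opaque)", GradientBoostingClassifier())]

for name, model in models:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")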

Instructor Bio

Marc Fridson is the Principal Data Scientist of Cross Brand Digital @ Carnival Cruise Line, a Part-Time Lecturer for the Applied Analytics Master’s Program @ Columbia University, and the founder of tech start-up Instant Analytics.

Marc has previously worked as a Technology Consultant for Accenture, as an Engineer for the Boeing Company, as AVP of Metrics and Reporting for Capital One, and as Manager of Analytics at CB Richard Ellis for JP Morgan Chase’s Real Estate Management. Previous consulting clients include: Morgan Stanley, Capital One, The College Board, Anthem Blue Cross, Verizon and Time Warner Cable.

He has helped these companies measure, analyze, and automate their processes through data analysis and by developing technological tools to enable process improvement/automation.

He holds a B.S. in Industrial and Systems Engineering from Rutgers University.

Marc Fridson
Principal Data Scientist at Carnival

Workshop: Building an image search service from scratch

Many products fundamentally appeal to our perception. When browsing through outfits on clothing sites, looking for a vacation rental on Airbnb, or choosing a pet to adopt, the way something looks is often an important factor in our decision. The way we perceive things is a strong predictor of what kind of items we will like, and therefore a valuable quality to measure.

However, making computers understand images the way humans do has been a computer science challenge for quite some time. Since 2012, Deep Learning has slowly started overtaking classical methods such as Histograms of Oriented Gradients (HOG) in perception tasks like image classification or object detection. One of the main reasons often credited for this shift is deep learning’s ability to automatically extract meaningful representations when trained on a large enough dataset.

This is why many teams — like at Pinterest, StitchFix, and Flickr — started using Deep Learning to learn representations of their images, and provide recommendations based on the content users find visually pleasing. Similarly, Fellows at Insight have used deep learning to build models for applications such as helping people find cats to adopt, recommending sunglasses to buy, and searching for art styles.

Many recommendation systems are based on collaborative filtering: leveraging user correlations to make recommendations (“users that liked the items you have liked have also liked…”). However, these models require a significant amount of data to be accurate, and struggle to handle new items that have not yet been viewed by anyone. Item representation can be used in what’s called content-based recommendation systems, which do not suffer from the problem above.

In addition, these representations allow consumers to efficiently search photo libraries for images that are similar to the selfie they just took (querying by image), or for photos of particular items such as cars (querying by text). Common examples of this include Google Reverse Image Search, as well as Google Image Search.

Based on our experience providing technical mentorship for many semantic understanding projects, we are bringing a workshop to ODSC on how to build your own representations, both for image and text data, and efficiently perform similarity search. By the end of this workshop, you should be able to build a quick semantic search model from scratch, no matter the size of your dataset.
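A hedged sketch of the core recipe (embed images with a pretrained CNN, then answer queries with nearest neighbors) might look like this in Python; the choice of VGG16 and the file names are assumptions for illustration, not the workshop's exact materials.

# Embed each image as a compact vector using a pretrained CNN, then use
# nearest-neighbor search to find visually similar items.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]  # a 512-dimensional semantic vector

library = ["cat1.jpg", "cat2.jpg", "shoe.jpg"]  # your image collection
index = NearestNeighbors(n_neighbors=2).fit(np.stack([embed(p) for p in library]))

_, idx = index.kneighbors([embed("query.jpg")])
print([library[i] for i in idx[0]])  # visually closest items first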

Instructor Bio

Coming Soon

Matthew Rubashkin, PhD
AI Program Director

Training: Coming Soon

Coming Soon

Instructor Bio

Emmanuel has a profile at the intersection of Artificial Intelligence and Business, having earned an MSc in Artificial Intelligence, an MSc in Computer Engineering, and an MSc in Management from three of France’s top schools. Recently, Emmanuel has worked on implementing and scaling out predictive analytics and machine learning solutions for Local Motion and Zipcar. He is currently an AI Program Director and Machine Learning Engineer at Insight, where he has led dozens of AI products from ideation to polished implementation.

Emmanuel Ameisen
AI Program Director and Machine Learning Engineer at Insight Data Science

Training: Deep Learning Research to Production - A hands-on approach using Apache MXNet

Deep Learning (DL) has become ubiquitous in everyday software applications and services. A solid understanding of DL’s foundational principles is necessary for researchers and modern-day engineers alike to successfully adapt state-of-the-art DL research to business applications.

Researchers require a DL framework to quickly prototype and transform their ideas into models, while engineers need a framework that allows them to efficiently deploy these models to production without losing performance. We will show how to use the Gluon API in Apache MXNet to quickly prototype models and then deploy them without losing performance using MXNet Model Server (MMS).
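For a sense of how quick Gluon prototyping is, here is a brief sketch; the toy architecture below (a small CNN over 48x48 grayscale faces with 7 emotion classes) is an illustrative assumption, not the session's actual model.

# Prototype a small CNN with MXNet's Gluon API, then hybridize it so the
# same network can be exported as a static graph for serving.
from mxnet import gluon, init, nd

net = gluon.nn.HybridSequential()
net.add(gluon.nn.Conv2D(channels=32, kernel_size=3, activation="relu"),
        gluon.nn.MaxPool2D(pool_size=2),
        gluon.nn.Flatten(),
        gluon.nn.Dense(64, activation="relu"),
        gluon.nn.Dense(7))            # e.g., 7 emotion classes
net.initialize(init.Xavier())
net.hybridize()                       # compile to a deployable static graph

out = net(nd.random.uniform(shape=(1, 1, 48, 48)))  # dummy grayscale face batch
print(out.shape)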

In this workshop, you will learn to apply Convolutional Neural Networks (CNNs) to Computer Vision (CV) tasks and Recurrent Neural Networks (RNNs) to Natural Language Processing (NLP) tasks using Apache MXNet – two fields in which Deep Learning has achieved state-of-the-art results.

To learn how DL applies to CV problems, we will get hands-on by building a Facial Emotion Recognition (FER) model. We will also build a sentiment analysis model to understand the application of DL in NLP. As we build the models, we will learn common practical limitations, pitfalls, best practices, and the tips and tricks used by practitioners. Finally, we will conclude the workshop by showing how to deploy with MMS for online/real-time inference and how to use Apache Spark + MXNet for offline batch inference on large datasets.

We will provide Jupyter notebooks to get hands-on and solidify the concepts.

Instructor Bio

Naveen is a Senior Software Engineer and a member of Amazon AI at AWS, where he works on Apache MXNet. He began his career building large-scale distributed systems and has spent the last 10+ years designing and developing them. He has delivered tech talks at AMLC, Spark Summit, and ApacheCon, and loves to share knowledge. His current focus is to make Deep Learning easily accessible to software developers without the need for a steep learning curve. In his spare time, he loves to read books, spend time with his family, and watch his little girl grow.

Naveen Swamy
Software Developer at Amazon AI – AWS

Training: Understanding the PyTorch Framework with Applications to Deep Learning

Over the past couple of years, PyTorch has been increasing in popularity in the Deep Learning community. What was initially a tool for Deep Learning researchers has been making headway in industry settings.

In this session, we will cover how to create Deep Neural Networks using the PyTorch framework on a variety of examples. The material will range from beginner – understanding what is going on “under the hood”, coding the layers of our networks, and implementing backpropagation – to more advanced material on RNNs, CNNs, LSTMs, and GANs.
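A minimal sketch of that beginner material might look like the following: define a small network, run a forward pass, and let autograd handle backpropagation (the dimensions and random data are illustrative).

# A tiny PyTorch network trained with a plain gradient-descent loop.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):
        return self.layers(x)

model = TinyNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

X = torch.randn(64, 10)            # a batch of 64 examples
y = torch.randint(0, 2, (64,))     # binary labels

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()                # backpropagation via autograd
    optimizer.step()
print(loss.item())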

Attendees will leave with a better understanding of the PyTorch framework – in particular, how it differs from Keras and TensorFlow. Furthermore, a link to a clean, documented GitHub repo with the solutions to the examples covered will be provided.

Instructor Bio

Robert loves to break deep technical concepts down to be as simple as possible, but no simpler.

Robert has data science experience in companies both large and small. He is currently an Adjunct Professor at Santa Clara University’s Leavey School of Business and a Senior Data Scientist at Metis, where he teaches Data Science and Machine Learning. At Intel, he used his knowledge to tackle problems in data center optimization using cluster analysis, enriched market-sizing models by implementing sentiment analysis from social media feeds, and improved data-driven decision making in one of the top 5 global supply chains. At Tamr, he built models to unify large amounts of messy data across multiple silos for some of the largest corporations in the world. He earned a PhD in Applied Mathematics from Arizona State University, where his research spanned image reconstruction, dynamical systems, mathematical epidemiology, and oncology.

Robert Alvarez, PhD
Sr. Data Scientist at Metis

Workshop: Coming soon

Coming soon

Instructor Bio

At Metis, Andrew has taught the fundamentals of Machine Learning and Data Science in a 3-month bootcamp to over 100 students and advised nearly 500 student projects. Andrew came to Metis from LinkedIn, where he worked as a Data Scientist on the Education, Skills, and then NLP teams. He is passionate about helping people make rational decisions and building cool data products. Prior to that he worked on fraud modeling at IMVU (the lean startup) and studied applied physics at Cornell. He loves snowboarding, traveling, scotch, and reading about all kinds of nerdy topics.

Andrew Blevins
Data Science Instructor at Metis

Workshop: Visual Search: The Next Frontier of Search

Visual search is a rapidly emerging trend that is ideal for retail segments, such as fashion and home design, because they are largely driven by visual content, and style is often difficult to describe using text search alone. Visual search allows you to replace your keyboard with your camera phone by using images instead of text to search for things. Many people believe that visual search will change the way we search, as evidenced by the following quote from Pinterest co-founder and CEO Ben Silbermann in a CNBC interview, “A lot of the future of search is going to be about pictures instead of keywords.”

Through a technique called distance metric learning, a neural network can transform any image into a compact, information-rich vector of numbers. In this tutorial/session, you will hear from visual search experts at Clarifai, eBay, Wayfair, and Walmart Labs/Jet.com. We’ll look at how you can use distance metric learning for visual similarity search within massive product catalogs – up to 1.1 billion items in eBay’s case.
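As a hedged sketch of the idea, distance metric learning is often trained with a triplet loss: the embedding network is pushed to place an anchor image closer to a matching image than to a non-matching one. The tiny linear "embedder" and random tensors below stand in for a real CNN and real product photos.

# One triplet-loss training step in PyTorch.
import torch
import torch.nn as nn

embedder = nn.Sequential(nn.Linear(512, 128))  # stand-in for a CNN backbone
loss_fn = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(embedder.parameters(), lr=1e-3)

anchor = embedder(torch.randn(32, 512))     # e.g., product photos
positive = embedder(torch.randn(32, 512))   # same items, different photos
negative = embedder(torch.randn(32, 512))   # different items

loss = loss_fn(anchor, positive, negative)  # pull positives in, push negatives out
loss.backward()
optimizer.step()
print(loss.item())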

If you are part of an in-house team of experts in machine learning and data science, you will learn:

* The latest state-of-the art visual search research and techniques as the speakers will share their in-depth knowledge on the subject

* How to scale your visual search solution to address the billion-scale problem

* How to train models that provide more specific and accurate results for visually rich categories

If you don’t have a team of in-house machine learning or data science experts but are interested in implementing visual search, you will learn about a solution that:

* Allows you to leverage visual search without having to do any training on your own dataset

* For those who want to train their own custom models, makes it easy to do so with fewer than 10 data examples, minimal code, and no special infrastructure

Instructor Bio

Coming Soon

George Williams
Director of Data Science

Training: Analyzing Data Efficiently with Pandas and Python

Learn how to use pandas and Python together to quickly and easily analyze large amounts of data. We will teach you how to use pandas to carry out your entire data workflow, from basic data cleaning and munging to quickly creating visualizations through simple calls from pandas. We will work with real datasets and show you how to work with various data inputs and types of data, including dates and timestamps. We’ll also discuss more complex operations, such as groupby commands and working with text data.
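A small sketch of that workflow is below; the CSV and column names are assumptions for illustration.

# Load, clean, parse dates, group, and plot, all through pandas.
import pandas as pd

df = pd.read_csv("sales.csv")
df["order_date"] = pd.to_datetime(df["order_date"])  # parse timestamps
df = df.dropna(subset=["amount"])                    # basic cleaning

# Monthly revenue by region via groupby, plotted directly from pandas.
monthly = (df.groupby([df["order_date"].dt.to_period("M"), "region"])
             ["amount"].sum()
             .unstack("region"))
monthly.plot(kind="line", title="Monthly revenue by region")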

Instructor Bio

Jose Marcial Portilla has a BS and MS in Mechanical Engineering from Santa Clara University and years of experience as a professional instructor and trainer for Data Science and programming. He has publications and patents in various fields such as microfluidics, materials science, and data science technologies. Over the course of his career he has developed a skill set in analyzing data, and he hopes to use his experience in teaching and data science to help other people learn the power of programming and the ability to analyze and present data in clear and beautiful visualizations. Currently he works as the Head of Data Science for Pierian Data Inc. and provides in-person data science and Python programming training courses to employees working at top companies, including General Electric, Cigna, The New York Times, Credit Suisse, and many more. Feel free to contact him on LinkedIn for more information on in-person training sessions.

Jose Portilla
Head of Data Science at Pierian Data Inc.

Workshop: Coming soon

Coming soon

Instructor Bio

Nathaniel earned his AB/SM in Computer Science from Harvard. He previously worked as a Quant and Trader at Jane Street and Goldman Sachs before transitioning into the pure tech industry. Nathaniel worked as a Data Scientist at Facebook, a Product Manager at Microsoft and a Software Engineer at Google before joining Vicarious. He is an avid reader and learner. He teaches part time at General Assembly and is developing open source teaching material for data science, machine learning, and web development.

Nathaniel Tucker
Lead Instructor Data Science and Analytics at General Assembly

Workshop: Using Data Science for Good

AI for Earth has one simple but huge ambition – to fundamentally transform the way we monitor, model, and manage Earth’s natural resources using AI. At the same time, deep learning innovations and breakthroughs are happening in both academia and industry at a breathtaking pace. By leveraging these innovations and breakthroughs, many AI for Earth grantees have been using AI to solve some of Earth’s toughest challenges – ranging from precision agriculture and precision conservation to understanding and protecting biodiversity, and more.

Join Wee Hyong in this talk as he shares many of the exciting projects that AI for Earth has been working on, and how AI is used to amplify human ingenuity. Through the lens of these exciting use cases, you will also learn about cutting-edge AI trends and opportunities, and how you can get started with AI today.

Instructor Bio

Wee Hyong Tok is a principal data science manager with the AI CTO office at Microsoft, where he leads the engineering and data science team for the AI for Earth program. Wee Hyong has worn many hats in his career, including developer, program and product manager, data scientist, researcher, and strategist, and his track record of leading successful engineering and data science teams has given him unique superpowers to be a trusted AI adviser to customers. Wee Hyong coauthored several books on artificial intelligence, including Deep Learning on Azure and Predictive Analytics Using Azure Machine Learning. Wee Hyong holds a PhD in computer science from the National University of Singapore.

Wee Hyong Tok, PhD
Principal Data Science Manager, AI & Research

Workshop: Coming soon

Coming soon

Instructor Bio

Magnus Hyttsten is a Senior Staff Developer Advocate for TensorFlow at Google. He focuses on all things TensorFlow – from making sure that the developer community is happy to helping develop the product. Magnus has spoken at many major events, including Google I/O, AnDevCon, and machine learning meetups. Right now, he is fanatically and joyfully focused on TensorFlow for Mobile as well as creating Reinforcement Learning models.

Magnus Hyttsten
Senior Staff Developer Advocate

Training: Coming soon

Coming soon

Instructor Bio

Coming Soon

Ted Kwartler

Training: Coming soon

Coming soon

Instructor Bio

Coming Soon

Lukas Biewald

Training: Coming soon

Coming soon

Instructor Bio

Coming Soon

Skipper Seabold

Training: Coming soon

Coming soon

Instructor Bio

Coming Soon

Michael Schmidt

Training: Coming soon

Coming soon

Instructor Bio

Coming Soon

Todd Cioffi

Workshop: Are you sure you're an ethical data scientist? Build your ethical imagination

This 90-minute workshop reveals researchers’ hidden attribution biases and equips them with an ethical imagination to do data science better.

As an Ethics for Data Science instructor, I was struck by the disjunction between students’ excellent ability to find ethical gaps in other people’s projects and the blind spots they exhibited when critiquing their own work.

I developed a two-part, 90-minute workshop to reduce this good-intention bias. First, we interactively review three well-known cases, including the Facebook/Cambridge Analytica case. These lively cases demonstrate how the basic tenets of research ethics can be adapted for data science.

Next, teams of 3-5 are led through a 60-minute Predicting College Failure case. This case is drawn from reality: a college has purchased an algorithm to predict which students will leave before graduation and why. We address ethical questions that arise during data collection, data cleaning, model development, intervention design, and FERPA compliance.

We also explore questions such as: how accurate does a model have to be before deployment? Is it fair to give students who show signs of financial hardship more aid if it reduces the amount available for students who are not visibly struggling? How do we manage the risk that labeling students high risk for departure may become a self-fulfilling prophecy? How do we weigh the collective impact on the school’s retention capacity against the unique needs of individual students? Teams will produce a strategy for maximizing the social benefit of the intervention and minimizing negative impacts.

Instructor Bio

Laura Norén is a data science ethicist and researcher currently working in cybersecurity at Obsidian Security in Newport Beach. She holds undergraduate degrees from MIT and a PhD from NYU, where she recently completed a postdoc in the Center for Data Science. Her work has been covered in The New York Times, Canada’s Globe and Mail, American Public Media’s Marketplace program, and numerous academic journals and international conferences. Dr. Norén is a champion of open source software and those who write it.

Laura Noren, PhD

Director of Research | Professor

Workshop: Deep Learning on Mobile

Over the last few years, convolutional neural networks (CNN) have risen in popularity, especially in the area of computer vision. Many mobile applications running on smartphones and wearable devices would potentially benefit from the new opportunities enabled by deep learning techniques. However, CNNs are by nature computationally and memory intensive, making them challenging to deploy on a mobile device.

This workshop explains how to practically bring the power of convolutional neural networks and deep learning to memory- and power-constrained devices like smartphones. You will learn various strategies to circumvent obstacles and build mobile-friendly shallow CNN architectures that significantly reduce the memory footprint and are therefore easier to store on a smartphone. The workshop also dives into how to use a family of model compression techniques to prune the network size for live image processing, enabling you to build a CNN version optimized for inference on mobile devices. Along the way, you will learn practical strategies to preprocess your data in a manner that makes the models more efficient in the real world.
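One concrete way to see the size/accuracy trade-off behind mobile-friendly architectures is Keras's MobileNet width multiplier (alpha), which uniformly shrinks every layer; this is an illustrative sketch, and the workshop's own techniques may differ.

# Compare parameter counts for a full-width and a slimmed MobileNet.
from tensorflow.keras.applications import MobileNet

full = MobileNet(weights=None, alpha=1.0)    # baseline width
slim = MobileNet(weights=None, alpha=0.25)   # same design, far fewer weights

print(f"full: {full.count_params():,} parameters")
print(f"slim: {slim.count_params():,} parameters")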

Following a step by step example of building an iOS deep learning app, we will discuss tips and tricks, speed and accuracy trade-offs, and benchmarks on different hardware to demonstrate how to get started developing your own deep learning application suitable for deployment on storage- and power-constrained mobile devices. You can also apply similar techniques to make deep neural nets more efficient when deploying in a regular cloud-based production environment, thus reducing the number of GPUs required and optimizing on cost.

Instructor Bio

Coming Soon

Anirudh Koul

Head of AI & Research

Workshop: The Big Reveal: How visualization can unearth the secrets of data

What’s the relationship between human and computer-recorded numbers? When we talk about data, data is not just numbers; behind every single number is human behavior. In my talk, I try to answer the following questions: How can data visualization design help us see the unseen faces of culture? How can we use open data to explore the differences between design and art colleges around the world? And how can we use data visualization to unlock the secrets of winning art awards? I will cover the whole process, from collecting the data to processing it.

Instructor Bio

Ying He is a computational designer who currently works for Pershing, a BNY Mellon company, and was previously at The Metropolitan Museum of Art. She holds a master’s degree from ITP, Tisch School of the Arts. Her research interests include visual representation, creative design, and cultural hacking, such as artistic data visualization, mapping, and interactive experience. She is fascinated by synthesizing graphic design and technology as a telescope into unseen reality.

Ying He

Computational Designer

Workshop: Coming soon

Coming soon

Instructor Bio

Coming Soon

Sean Patrick Gorman, PhD

Workshop: Making Data Science: AIG, Amazon, Albertsons

Developing an internal data science capability requires a cultural shift, a strategic mapping process that aligns with existing business objectives, a technical infrastructure that can host new processes, and an organizational structure that can alter business practice to create measurable impact on business functions. This workshop will take you through ways to consider the vast opportunities for data science, identify and prioritize what will add the most value to your organization, and then budget and hire into those commitments. Learn the most effective ways to establish data science objectives from a business perspective, including recruiting, retention, goal-setting, and improving the business.

Instructor Bio

Haftan Eckholdt, PhD, is Chief Data Science Officer at Plated. His career began with research professorships in Neuroscience, Neurology, and Psychiatry, followed by industrial research appointments at companies like Amazon and AIG. He holds graduate degrees in Biostatistics and Developmental Psychology from Columbia and Cornell Universities. In his spare time he thinks about things like chess and cooking and cross country skiing and jogging and reading. When things get really, really busy, he actually plays chess and cooks delicious meals and jogs a lot. Born and raised in Baltimore, Haftan has been a resident of Kings County, New York since the late 1900s.

Haftan Eckholdt, PhD

Chief Data Science Officer

Training: Introduction to Text Analytics

Text analytics or text mining is an important branch of analytics that allows machines to break down text data. As a data scientist, I often use text-specific techniques to interpret data that I’m working with for my analysis. During this workshop, I plan to walk through an end-to-end project covering text pre-processing techniques, machine learning techniques and Python libraries for text analysis.

Text pre-processing techniques include data cleaning and tokenization. Once in a standard format, various machine learning techniques can be applied to better understand the data. This includes using popular modeling techniques to classify emails as spam or not, or to score the sentiment of a tweet on Twitter. In addition, unsupervised learning techniques such as topic modeling with Latent Dirichlet Allocation or matrix factorization can be applied to text data to pull out hidden themes in the text. Other techniques such as text generation can be applied using Markov chains or deep learning.
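As a compact sketch of the spam-classification step described above, a bag-of-words model with Naive Bayes in scikit-learn fits in a few lines; the tiny corpus here is invented purely for illustration.

# Vectorize raw text, then train and apply a Naive Bayes spam classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting moved to 3pm",
         "free money click here", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))  # -> [1]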

We will walk through an example in Jupyter Notebook that goes through all of the steps of a text analysis project, using several text analysis libraries in Python including NLTK, TextBlob and gensim along with the standard machine learning libraries including pandas and scikit-learn.

Instructor Bio

Alice Zhao is currently a Senior Data Scientist at Metis, where she teaches 12-week data science bootcamps. Previously, she worked at Cars.com, where she started as the company’s first data scientist, supporting multiple functions from Marketing to Technology. During that time, she also co-founded a data science education startup, Best Fit Analytics Workshop, teaching weekend courses to professionals at 1871 in Chicago. Prior to becoming a data scientist, she worked at Redfin as an analyst and at Accenture as a consultant. She has her M.S. in Analytics and B.S. in Electrical Engineering, both from Northwestern University. She blogs about analytics and pop culture on A Dash of Data. Her blog post, “How Text Messages Change From Dating to Marriage” made it onto the front page of Reddit, gaining over half a million views in the first week. She is passionate about teaching and mentoring, and loves using data to tell fun and compelling stories.

Alice Zhao

Senior Data Scientist

Workshop: Scaling Interactive Data Science and AI with Ray

The next generation of AI applications will continuously interact with the environment and learn from these interactions. To develop these applications, data scientists and engineers will need to seamlessly scale their work from running interactively to production clusters. In this talk we introduce Ray, a high-performance distributed execution engine, and its libraries for data science and AI development. We cover each Ray library in turn, and also show how the Ray API allows these traditionally separate workflows to be composed and run together as one distributed application.

Ray is an open source project being developed at the RISE Lab in UC Berkeley for interactive data processing, scalable hyperparameter optimization, distributed deep learning, and reinforcement learning. We focus on the following libraries in this tutorial:

MODIN: With Modin, you can make your Pandas workflows faster by changing only a single line of code. Modin uses Ray to provide interactive analysis on multi-core machines (e.g., your laptop), and also to scale to large clusters.
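That single-line change is literally the import; the CSV name below is an assumption.

# import pandas as pd           # before
import modin.pandas as pd       # after: same API, parallel execution via Ray

df = pd.read_csv("large_dataset.csv")  # now reads in parallel
print(df.groupby("category").size())   # familiar pandas operations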

TUNE: Tune is a scalable hyperparameter optimization framework for reinforcement learning and deep learning. Go from running one experiment on a single machine to running on a large cluster with efficient search algorithms without changing your code. Unlike existing hyperparameter search frameworks, Tune targets long-running, compute-intensive training jobs that may take many hours or days to complete, and includes many resource-efficient algorithms designed for this setting.

RLLIB: RLlib is an open-source library for reinforcement learning that offers both a collection of reference algorithms and scalable primitives for composing new ones. In this tutorial we discuss using RLlib to tackle both classic benchmark and applied problems, RLlib’s primitives for scalable RL, and how RL workflows can be integrated with data processing and hyperparameter optimization.

Instructor Bio

Richard Liaw is a PhD student in BAIR/RISELab at UC Berkeley working with Joseph Gonzalez, Ion Stoica, and Ken Goldberg. He has worked on a variety of different areas, ranging from robotics to reinforcement learning to distributed systems. He is currently actively working on Ray, a distributed execution engine for AI applications; RLlib, a scalable reinforcement learning library; and Tune, a distributed framework for model training.

Richard Liaw
AI Researcher

Workshop: A Deeper Stack for Deep Learning: Adding Visualizations and Data Abstractions to Your Workflow

In this training session I introduce a new layer of Python software, called ConX, which sits on top of Keras, which sits on a backend (like TensorFlow). Do we really need a deeper stack of software for deep learning? Backends like TensorFlow can be thought of as “assembly language” for deep learning. Keras helps, but is more like “C++” for deep learning. ConX is designed to be “Python” for deep learning. So, yes, this layer is needed.

ConX is a carefully designed library that includes tools for network, weight, and activation visualizations; data and network abstractions; and an intuitive interactive and programming interface. Especially developed for the Jupyter notebook, ConX enhances the workflow of designing and training artificial neural networks by providing interactive visual feedback early in the process, and reducing cognitive load in developing complex networks.

This session will start small and move to advanced recurrent networks for images, text, and other data. Participants are encouraged to have samples of their own data so that they can explore a real and meaningful project.

A basic understanding of Python and a laptop is all that is required. Many example deep learning models will be provided in the form of Jupyter notebooks.

Documentation: https://conx.readthedocs.io/en/latest/

Instructor Bio

Douglas Blank is a professor of Computer Science at Bryn Mawr College outside of Philadelphia, PA. He has been working with neural networks for over 20 years, and has been developing easy-to-use software for even longer. He is one of the core developers of ConX.

Douglas Blank, PhD

Professor of Computer Science | Core Developer of ConX

Workshop: Making Data Great Again

The advent of big data means that the time has come for change in the way in which we collect and use data on human beings. However, that change needs to be effected in a thoughtful, careful way so that we don’t jump out of the frying pan into the fire.

There is enormous potential to use such data to improve decision making at all levels of government. The barriers are complex but at their core stem from (i) a lack of local capacity to access and use data and (ii) a lack of evidence of the value. Much can be gained when local stakeholders develop use cases and create value from data sources specific to a jurisdiction. The combination of human and technical approaches is critical to success.

We have been developing a multi-step approach to foster data-driven decision making in regional development efforts so that locally driven efforts can grow into a robust, scalable system. Each phase serves the dual purposes of (a) building local capacity and (b) developing useful data and analytic products.

The analytics training programs are delivered in a secure remote access environment. They include a mix of targeted introductory material appropriate for a wide audience and tailored sessions specific to jurisdictions. The focus is on facilitating the creation of national standards from the bottom up, directly via (i) a secure analytics computing platform in which the underlying code and data itself can be shared and evaluated, (ii) conferences and workshops to convene key stakeholders across the community, and (iii) ongoing support to provide continuity of methodologies across jurisdictions.

Instructor Bio

Julia Lane is a Professor in the Wagner School of Public Policy at New York University. She is also a Provostial Fellow in Innovation Analytics and a Professor in the Center for Urban Science and Policy at NYU. Dr. Lane is an economist and has authored over 65 refereed articles and edited or authored seven books. She has been working with a number of national governments to document the results of their science investments. Her work has been featured in several publications including Science and Nature. While at the National Science Foundation (as Senior Program Director of the Science of Science and Innovation Policy Program), Dr. Lane started an effort to quantify the results of federal stimulus spending, which is the basis of the new Institute for Research on Innovation and Science at the University of Michigan. Dr. Lane has had leadership positions in a number of policy and data science initiatives in her other previous appointments, which include Senior Managing Economist at the American Institutes for Research; Senior Vice President and Director, Economics Department at NORC/University of Chicago; various consultancy roles at The World Bank; and Assistant, Associate, and Full Professor at American University. Dr. Lane received her PhD in Economics and Master’s in Statistics from the University of Missouri.

Julia Lane, PhD
Professor

Workshop: Data visualization in the web setting with a focus on D3

The D3 JavaScript library utilizes standard web technologies to facilitate interactive data visualization in the browser. This session will cover the principles behind D3 and will use examples to introduce core ideas and concepts in the library. It will also highlight some of the differences between versions as the library has evolved to its current state and discuss how D3 fits into the landscape of data visualization tools and web frameworks.

At its core, D3 establishes a connection between the data behind a visualization and the graphical elements shown on screen. By providing methods to manipulate elements at a low level, it allows every aspect of a visualization to be customized according to standard technologies. Selections, meanwhile, allow users to modify large groups of elements at once, based on the data items they are connected with. Another important feature of D3 is support for interactive visualizations, which can help show new data being added or existing data being filtered; D3 adds straightforward methods to help transition between different states. To make the most of this session, attendees should be familiar with JavaScript.

Instructor Bio

Coming Soon

David Koop, PhD
Assistant Professor

Training: Cloud Native Data Science with Dask

Python has become a great language for data science. Libraries like NumPy, pandas, and Scikit-Learn provide high-performance, pleasant APIs for analyzing data. However, they’re focused on single-core, in-memory analytics, and so don’t scale out to very large datasets or clusters of machines. That’s where Dask comes in.

Dask is a library that natively scales Python. It works with libraries like NumPy, pandas, and Scikit-Learn to operate on datasets in parallel, potentially distributed on a cluster.

Moving to a cloud-native data science workflow will make you and your team more productive. You’ll be able to more quickly iterate on the data collection, visualization, modeling, testing, and deployment cycle.

Attendees will learn the high-level user interfaces Dask provides, like dask.array and dask.dataframe. These let you write regular Python, NumPy, or pandas code that is then executed in parallel on datasets that may be larger than memory. We’ll learn through hands-on exercises. Each attendee will be provided with their own Dask cluster to develop and run their solutions.
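A short sketch of the dask.dataframe interface is below: pandas-style code over a collection of CSVs that may not fit in memory. The file pattern and column names are assumptions.

# Many CSV files become one logical dataframe; operations build a lazy task
# graph that executes in parallel on .compute().
import dask.dataframe as dd

df = dd.read_csv("logs-2018-*.csv")
result = df.groupby("user_id")["bytes"].mean()
print(result.compute())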

Dask is a flexible parallelization framework; we’ll demonstrate that flexibility with some machine-learning workloads. We’ll use Dask to easily distribute a large scikit-learn grid search to run on a cluster of machines. We’ll use Dask-ML to work with larger-than-memory datasets.

We’ll see how Dask can be deployed on Kubernetes, taking advantage of features like auto-scaling, where new worker pods are automatically created or destroyed based on the current workload.

Instructor Bio

Tom is a Data Scientist and developer at Anaconda, where he works on open source projects including Dask and pandas. Tom’s current focus is on scaling out Python’s machine learning ecosystem to larger datasets and larger models.

Tom Augspurger
Data Scientist

Sign Up for ODSC West | Oct 31st - Nov 3rd 2018

Register Now
Open Data Science Conference