Training & Workshop Sessions

– Taught by World-Class Data Scientists –

Learn the latest data science concepts, tools and techniques from the best. Forge a connection with these rockstars from industry and academia, who are passionate about molding the next generation of data scientists.

ODSC Training Includes

Form a working relationship with some of the world’s top data scientists for follow-up questions and advice.

Additionally, your ticket includes access to 50+ talks and workshops.

Get hands-on with the latest frameworks and breakthroughs in data science.

Equivalent training at other conferences costs much more.

Professionally prepared learning materials, custom tailored to each course.

Opportunities to connect with other ambitious, like-minded data scientists.

Training & Workshop Sessions

Training sessions are 3.5 to 4 hours in duration. Workshops are 1.5 hours. See the training sessions below, and scroll down or click here for the workshop sessions.

Additional sessions added weekly

Training Sessions

Training: Introduction to Machine Learning

Machine learning has become an indispensable tool across many areas of research and commercial applications. From text-to-speech for your phone to detecting the Higgs boson, machine learning excels at extracting knowledge from large amounts of data. This talk will give a general introduction to machine learning, as well as introduce practical tools for you to apply machine learning in your research. We will focus on one particularly important subfield of machine learning, supervised learning. The goal of supervised learning is to “learn” a function that maps inputs x to an output y, using a collection of training data consisting of input-output pairs. We will walk through formalizing a problem as a supervised machine learning problem, creating the necessary training data, and applying and evaluating a machine learning algorithm. The talk should give you all the necessary background to start using machine learning yourself.
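For a concrete picture of that workflow, here is a minimal scikit-learn sketch; the dataset and model choice are illustrative assumptions, not the course materials.

```python
# Minimal supervised-learning sketch: learn a mapping from inputs x to outputs y
# using input-output pairs, then evaluate on data the model has not seen.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # inputs x and outputs y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)                            # "learn" the function
print(accuracy_score(y_test, model.predict(X_test)))   # evaluate on held-out data
```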

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Intermediate Machine Learning with scikit-learn

Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners. This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning. Apart from metrics for model evaluation, we will cover how to evaluate model complexity, how to tune parameters with grid search and randomized parameter search, and what their trade-offs are. We will also cover out-of-core text feature processing via feature hashing.
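As a rough illustration of the pipeline-plus-parameter-search pattern described above, here is a hedged sketch; the dataset, hashing size, and parameter grid are assumptions for the example, not the session’s notebooks.

```python
# Sketch: a scikit-learn Pipeline with feature hashing, tuned via grid search.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

pipe = Pipeline([
    ("hash", HashingVectorizer(n_features=2**18)),   # feature hashing for text
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(max_iter=1000)),
])

# Search over a small (illustrative) grid of regularization strengths.
grid = GridSearchCV(pipe, {"clf__alpha": [1e-5, 1e-4, 1e-3]}, cv=3)
grid.fit(data.data, data.target)
print(grid.best_params_, grid.best_score_)
```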

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Machine Learning in R Part I

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the GLM. We look at how to assess model quality with traditional measures and cross-validation, and how to visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to boosted decision trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City; the organizer of the New York Open Statistical Programming Meetup and the New York R Conference; and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Machine Learning in R Part II

Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today’s incredible computing power. This two-part course focuses on the available methods for implementing machine learning algorithms in R, and will examine some of the underlying theory behind the curtain. We start with the foundation of it all, the linear model and its generalization, the GLM. We look at how to assess model quality with traditional measures and cross-validation, and how to visualize models with coefficient plots. Next we turn to penalized regression with the Elastic Net. After that we turn to boosted decision trees utilizing xgboost. Attendees should have a good understanding of linear models and classification and should have R and RStudio installed, along with the `glmnet`, `xgboost`, `boot`, `ggplot2`, `UsingR` and `coefplot` packages.

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City; the organizer of the New York Open Statistical Programming Meetup and the New York R Conference; and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Advanced Machine Learning with scikit-learn Part I

Abstract Coming Soon

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Advanced Machine Learning with scikit-learn Part II

Abstract Coming Soon

Instructor Bio

Andreas is a lecturer at the Data Science Institute at Columbia University and author of the O’Reilly book “Introduction to Machine Learning with Python,” which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library, and he has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor. In the past, he worked at the NYU Center for Data Science on open source and open science, and as a Machine Learning Scientist at Amazon. Andreas’s mission is to create open tools to lower the barrier of entry for machine learning applications, promote reproducible science, and democratize access to high-quality machine learning algorithms.

Andreas Mueller, PhD

Author, Lecturer, and Core contributor to scikit-learn

Training: Apache Spark with Python for Data Science and Machine Learning at Scale Part I

We’ll start with the basics of machine learning on Apache Spark: when to use it, how it works, and how it compares to all of your other favorite data science tooling.

You’ll learn to use Spark (with Python) for statistics, modeling, inference, and model tuning. But you’ll also get a peek behind the APIs: see why the pieces are arranged as they are, how to get the most out of the docs, open source ecosystem, third-party libraries, and solutions to common challenges.
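A hedged PySpark sketch of the kind of modeling workflow described above; the file name, column names, and model choice are placeholders rather than the session’s actual examples.

```python
# Sketch: load a dataset, assemble features, and fit a Spark ML model.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-example").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical file
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

model = LogisticRegression(labelCol="label", featuresCol="features").fit(train)
print(model.evaluate(test).areaUnderROC)   # evaluate on the held-out split
```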

We will then look at some of the newest features in Spark that allow elegant, high-performance integration with popular Python tooling; distributed scheduling for popular libraries like XGBoost and TensorFlow; and fast inference.

By lunch, you will understand when, why, and how Spark fits into the data science world, and you’ll be comfortable doing your own feature engineering and modeling with Spark.

By the end of the day, you will be caught up on the latest, easiest, fastest, and most user-friendly ways of applying Apache Spark in your job and/or research.

Instructor Bio

Adam Breindel consults and teaches widely on Apache Spark, big data engineering, and machine learning/AI/deep learning. He supports instructional initiatives and teaches as a senior instructor at Databricks, teaches classes on Apache Spark and on deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures.
 
Adam’s first full-time job in tech was on neural-net-based fraud detection deployed at North America’s largest banks … in 1998. Since then, he’s worked with numerous startups (3 successful exits) where he enjoyed getting to build the future (e.g., universal mobile check-in for 2 of America’s 5 biggest airlines … in 2004, 3 years before the iPhone release). He has also worked in entertainment, insurance, and retail banking, on web, embedded, and server apps as well as on clustering architectures, APIs, and streaming analytics.

Adam Breindel

Featured Apache Spark Instructor, Data Science Trainer and Consultant

Training: Apache Spark with Python for Data Science and Machine Learning at Scale Part II

We’ll start with the basics of machine learning on Apache Spark: when to use it, how it works, and how it compares to all of your other favorite data science tooling.

You’ll learn to use Spark (with Python) for statistics, modeling, inference, and model tuning. But you’ll also get a peek behind the APIs: see why the pieces are arranged as they are, how to get the most out of the docs, open source ecosystem, third-party libraries, and solutions to common challenges.

We will then look at some of the newest features in Spark that allow elegant, high-performance integration with popular Python tooling; distributed scheduling for popular libraries like XGBoost and TensorFlow; and fast inference.

By lunch, you will understand when, why, and how Spark fits into the data science world, and you’ll be comfortable doing your own feature engineering and modeling with Spark.

By the end of the day, you will be caught up on the latest, easiest, fastest, and most user-friendly ways of applying Apache Spark in your job and/or research.

Instructor Bio

Adam Breindel consults and teaches widely on Apache Spark, big data engineering, and machine learning/AI/deep learning. He supports instructional initiatives and teaches as a senior instructor at Databricks, teaches classes on Apache Spark and on deep learning for O’Reilly, and runs a business helping large firms and startups implement data and ML architectures.
 
Adam’s first full-time job in tech was on neural-net-based fraud detection deployed at North America’s largest banks … in 1998. Since then, he’s worked with numerous startups (3 successful exits) where he enjoyed getting to build the future (e.g., universal mobile check-in for 2 of America’s 5 biggest airlines … in 2004, 3 years before the iPhone release). He has also worked in entertainment, insurance, and retail banking, on web, embedded, and server apps as well as on clustering architectures, APIs, and streaming analytics.

Adam Breindel

Featured Apache Spark Instructor, Data Science Trainer and Consultant

Training: Network Analysis Made Simple Part I

Have you ever wondered about how those data scientists at Facebook and LinkedIn make friend recommendations? Or how epidemiologists track down patient zero in an outbreak? If so, then this tutorial is for you. In this tutorial, we will use a variety of datasets to help you understand the fundamentals of network thinking, with a particular focus on constructing, summarizing, and visualizing complex networks.

This tutorial is for Pythonistas who want to understand relationship problems – as in, data problems that involve relationships between entities. Participants should already have a grasp of for loops and basic Python data structures (lists, tuples and dictionaries). By the end of the tutorial, participants will have learned how to use the NetworkX package in the Jupyter environment, and will become comfortable in visualizing large networks using Circos plots. Other plots will be introduced as well.
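As a taste of the material, here is a minimal NetworkX sketch that constructs, summarizes, and visualizes a small network; the toy edge list is invented for illustration.

```python
# Build a tiny friendship graph, summarize it, and draw it.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "alice"), ("carol", "dave")])

print(G.number_of_nodes(), G.number_of_edges())        # basic summary
print(nx.degree_centrality(G))                         # who is most connected?
print(list(nx.common_neighbors(G, "alice", "dave")))   # friend-recommendation idea

nx.draw(G, with_labels=True)
plt.show()
```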

Instructor Bio

Eric is an Investigator in the Scientific Data Analysis team at the Novartis Institutes for Biomedical Research, where he solves biological problems using machine learning. He obtained his Doctor of Science (ScD) from the Department of Biological Engineering, MIT, and was an Insight Health Data Fellow in the summer of 2017. He has taught Network Analysis at a variety of data science venues, including PyCon USA, SciPy, PyData and ODSC, and has also co-developed the Python Network Analysis curriculum on DataCamp. As an open source contributor, he has made contributions to PyMC3, matplotlib and bokeh. He has also led the development of the graph visualization package nxviz and the data cleaning package pyjanitor (a Python port of the R package of the same name).

Eric Ma, ScD

DS Investigator, Novartis Institutes, Author of nxviz Package

Training: Network Analysis Made Simple Part II

Have you ever wondered about how those data scientists at Facebook and LinkedIn make friend recommendations? Or how epidemiologists track down patient zero in an outbreak? If so, then this tutorial is for you. In this tutorial, we will use a variety of datasets to help you understand the fundamentals of network thinking, with a particular focus on constructing, summarizing, and visualizing complex networks.

This tutorial is for Pythonistas who want to understand relationship problems – as in, data problems that involve relationships between entities. Participants should already have a grasp of for loops and basic Python data structures (lists, tuples and dictionaries). By the end of the tutorial, participants will have learned how to use the NetworkX package in the Jupyter environment, and will become comfortable in visualizing large networks using Circos plots. Other plots will be introduced as well.

Instructor Bio

Eric is an Investigator in the Scientific Data Analysis team at the Novartis Institutes for Biomedical Research, where he solves biological problems using machine learning. He obtained his Doctor of Science (ScD) from the Department of Biological Engineering, MIT, and was an Insight Health Data Fellow in the summer of 2017. He has taught Network Analysis at a variety of data science venues, including PyCon USA, SciPy, PyData and ODSC, and has also co-developed the Python Network Analysis curriculum on DataCamp. As an open source contributor, he has made contributions to PyMC3, matplotlib and bokeh. He has also led the development of the graph visualization package nxviz and the data cleaning package pyjanitor (a Python port of the R package of the same name).

Eric Ma, ScD

DS Investigator, Novartis Institutes, Author of nxviz Package

Training: Introduction to RMarkdown in Shiny

Markdown Primer (45 minutes): Structure Documents with Sections and Subsections, Format Text, Create Ordered and Unordered Lists, Make Links, Number Sections, Include a Table of Contents

Integrate R Code (30 minutes): Insert Code Chunks, Hide Code, Set Chunk Options, Draw Plots, Speed Up Code with Caching

Build RMarkdown Slideshows (20 minutes): Understand Slide Structure, Create Sections, Set Background Images, Include Speaker Notes, Open Slides in Speaker Mode

Develop Flexdashboards (30 minutes): Start with the Flexdashboard, Lay Out and Design Columns and Rows, Use Multiple Pages, Create Social Sharing, Include Code

Shiny: Inputs (Drop Downs, Text, Radio Buttons, Checkboxes), Outputs (Text, Tables, Plots), Reactive Expressions, HTML Widgets, Interactive Plots, Interactive Maps, Interactive Tables, Shiny Layouts (UI and Server Files, User Interface)

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City; the organizer of the New York Open Statistical Programming Meetup and the New York R Conference; and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Intermediate RMarkdown in Shiny

Markdown Primer (45 minutes): Structure Documents with Sections and Subsections, Format Text, Create Ordered and Unordered Lists, Make Links, Number Sections, Include a Table of Contents

Integrate R Code (30 minutes): Insert Code Chunks, Hide Code, Set Chunk Options, Draw Plots, Speed Up Code with Caching

Build RMarkdown Slideshows (20 minutes): Understand Slide Structure, Create Sections, Set Background Images, Include Speaker Notes, Open Slides in Speaker Mode

Develop Flexdashboards (30 minutes): Start with the Flexdashboard, Lay Out and Design Columns and Rows, Use Multiple Pages, Create Social Sharing, Include Code

Shiny: Inputs (Drop Downs, Text, Radio Buttons, Checkboxes), Outputs (Text, Tables, Plots), Reactive Expressions, HTML Widgets, Interactive Plots, Interactive Maps, Interactive Tables, Shiny Layouts (UI and Server Files, User Interface)

Instructor Bio

Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City; the organizer of the New York Open Statistical Programming Meetup and the New York R Conference; and an Adjunct Professor of Statistics at Columbia University. With a master’s in statistics from Columbia University and a bachelor’s in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts.

He specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.

Jared Lander

Statistics Professor, Columbia University, Author of R for Everyone

Training: Programming with Data: Python and Pandas

Whether in R, MATLAB, Stata, or Python, modern data analysis, for many researchers, requires some kind of programming. The preponderance of tools and specialized languages for data analysis suggests that general-purpose programming languages like C and Java do not readily address the needs of data scientists; something more is needed.

In this workshop, you will learn how to accelerate your data analyses using the Python language and Pandas, a library specifically designed for interactive data analysis. Pandas is a massive library, so we will focus on its core functionality, specifically, loading, filtering, grouping, and transforming data. Having completed this workshop, you will understand the fundamentals of Pandas, be aware of common pitfalls, and be ready to perform your own analyses.
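A small sketch of the core operations named above (loading, filtering, grouping, and transforming); the file name and column names are hypothetical.

```python
# Core Pandas workflow on an assumed CSV of sales records.
import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["date"])       # loading
recent = df[df["date"] >= "2018-01-01"]                   # filtering
by_region = recent.groupby("region")["revenue"].sum()     # grouping
df["revenue_zscore"] = (
    (df["revenue"] - df["revenue"].mean()) / df["revenue"].std()
)                                                          # transforming
print(by_region.head())
```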

Instructor Bio

Daniel Gerlanc has worked as a data scientist for more than decade and written software professionally for 15 years. He spent 5 years as a quantitative analyst with two Boston hedge funds before starting Enplus Advisors. At Enplus, he works with clients on data science and custom software development with a particular focus on projects requiring an expertise in both areas. He teaches data science and software development at introductory through advanced levels. He has coauthored several open source R packages, published in peer-reviewed journals, and is active in local predictive analytics groups.

Daniel Gerlanc

Data Science and Software Engineering Instructor, President, Enplus Advisors

Training: Data Science for Finance Bootcamp

The goal of this course is to offer data science and fintech enthusiasts a hands-on, practical case study to understand the power of data science in finance. We will use the Lending Club data set to build a credit risk model using machine learning techniques. Python experience is not required to attend the course, but it is useful. We will illustrate how to build applications using Python packages such as scikit-learn and Keras. Techniques such as k-means, t-SNE, regression, random forests, and neural networks will be covered.
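For orientation, here is a hedged sketch of what a simple credit-risk classifier might look like in scikit-learn; the file, feature columns, and label are placeholder assumptions, not the actual Lending Club schema used in the course.

```python
# Toy credit-risk model: predict default probability from a few numeric features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

loans = pd.read_csv("lending_club_sample.csv")     # hypothetical extract
X = loans[["loan_amnt", "annual_inc", "dti"]]      # assumed numeric features
y = loans["default"]                               # assumed binary label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```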

Instructor Bio

Sri Krishnamurthy, CFA, CAP is the founder of QuantUniversity.com, a data and Quantitative Analysis Company and the creator of the Analytics Certificate program and Fintech Certificate program. Sri has more than 15 years of experience in analytics, quantitative analysis, statistical modeling and designing large-scale applications. Prior to starting QuantUniversity, Sri has worked at Citigroup, Endeca, MathWorks and with more than 25 customers in the financial services and energy industries. He has trained more than 1000 students in quantitative methods, analytics and big data in the industry and at Babson College, Northeastern University and Hult International Business School. Sri is leading development efforts in creating a platform called QuSandbox for adopting open source and analytics solutions within regulated industries.

Sri Krishnamurthy

Chief Data Scientist and President

Training: Designing Modern Streaming Data Applications

Many industry segments have been grappling with fast data (high-volume, high-velocity data). The enterprises in these industry segments need to process this fast data just in time to derive insights and act upon it quickly. Such tasks include but are not limited to enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. In order to realize these results, an enterprise needs to build an end-to-end data processing system, from data acquisition, data ingestion, data processing, and model building to serving and sharing the results. This presents a significant challenge, due to the presence of multiple messaging frameworks and several streaming computing frameworks and storage frameworks for real-time data.

In this tutorial, we shall lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline: messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. We shall also share case studies from IoT, gaming, and healthcare, as well as our experience operating these systems at internet scale at Twitter and Yahoo. We shall conclude by offering perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of messaging systems, streaming systems, storage systems for streaming data, and reinforcement learning-based systems that will power fast processing and analysis of a large (potentially on the order of hundreds of millions) set of data streams.

Instructor Bio

Coming Soon

Arun Kejariwal

Engineering Leader at Facebook

Training: Bringing Your Deep Learning Algorithms to Life: From Experiments to Production Use

Co-presenters: Sindhu Ghanta and Drew Roselli.

In this hands-on workshop, attendees will learn how to take Deep Learning programs and monitor their health in a production environment. This workshop is targeted at data scientists with some basic knowledge of Deep Learning algorithms who would like to learn how to bring their promising experimental results on DL algorithms into production with confidence. Attendees will learn about potential production issues with DL algorithms and how to monitor for these in a production environment using TensorFlow. They will take a sample program in TensorFlow and learn how to deploy it in a production environment. They will learn how to instrument Convolutional Neural Network algorithms in TensorFlow and then deploy their chosen algorithm and instrumentation into production use. They will learn how to monitor the behavior of Deep Learning algorithms in production and approaches to optimizing production DL behavior via retraining and transfer learning.

Attendees should have basic knowledge of ML and DL algorithm types. Deep mathematical knowledge of algorithm internals is not required. All experiments will use Python. Environments will be provided in Azure for hands-on use by all attendees. Each attendee will receive an account for use during the workshop and access to the TensorFlow engines as well as an ML lifecycle management environment. Sample algorithms and public data sets will be provided for Image Classification and Text Recognition.

Instructor Bio

Coming Soon

Nisha Talagala

CTO/VP of Engineering at ParallelM

Training: Engineering For Data Science

Practicing data scientists typically spend the bulk of their time developing models for a particular inference or prediction application, likely giving substantially less time to the equally complex problems stemming from system infrastructure. We might trivially think of these two often orthogonal concerns as the modeling problem and the engineering problem. The typical data scientist is trained to solve the former, often in an extremely rigorous manner, but can often wind up developing a series of ad hoc solutions to the latter. This talk will discuss Docker as a tool for the data scientist, in particular in conjunction with the popular interactive programming platform Jupyter and the cloud computing platform Amazon Web Services (AWS). Using Docker, Jupyter, and AWS, the data scientist can take control of their environment configuration, prototype scalable data architectures, and trivially clone their work toward replicability and communication. This talk will work toward developing a set of best practices for Engineering for Data Science.

Instructor Bio

Coming soon

Joshua Cook

Principal Lecturer in Data Science at UCLA Extension

Training: Matrix Algorithms at Scale: Randomization and using Alchemist to bridge the Spark-MPI gap

Linear algebra problems form the heart of many machine learning computations, but the demands on linear algebra from scientific machine learning problems can be different than for internet and social media applications. In particular, the need for efficient and scalable numerical linear algebra and machine-learning implementations continues to grow with the increasing importance of big data analytics. Since its introduction, Apache Spark has become an integral tool in this field, with attractive features such as ease of use, interoperability with the Hadoop ecosystem, and fault tolerance. However, it has been shown that numerical linear algebra routines implemented using MPI, a tool for parallel programming commonly used in high-performance computing, can outperform the equivalent Spark routines by an order of magnitude or more.

We will describe these evaluations, which consist of exploring the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity), and CX (for data interpretability). We apply these methods to terabyte-sized problems in particle physics, climate modeling, and bioimaging, as use cases where interpretable analytics is of interest. Many of these algorithms use randomization in novel ways, and we will describe some of the underlying randomized linear algebra techniques.

Finally, we’ll describe Alchemist, a system for interfacing between Spark and existing MPI libraries that is designed to address this performance gap. The libraries can be called from a Spark application with little effort, and we illustrate how the resulting system leads to efficient and scalable performance on large datasets. We describe use cases from scientific data analysis that motivated the development of Alchemist and that benefit from this system. We’ll also describe related work on communication-avoiding machine learning, optimization-based methods that can call these algorithms, and extending Alchemist to provide an ipython notebook <=> MPI interface.
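As a small, self-contained illustration of the randomized linear algebra idea mentioned above, here is a hedged sketch using scikit-learn’s randomized SVD on a synthetic matrix (not Alchemist, Spark, or MPI; the matrix size and rank are arbitrary choices).

```python
# Randomized SVD: approximate a large matrix with a low-rank factorization.
import numpy as np
from sklearn.utils.extmath import randomized_svd

A = np.random.rand(10000, 500)                  # stand-in for a large data matrix
U, s, Vt = randomized_svd(A, n_components=10, random_state=0)
print(U.shape, s.shape, Vt.shape)               # low-rank factors approximating A
```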

Instructor Bio

Michael Mahoney is at the University of California at Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics, and he has worked and taught at Yale University in the mathematics department, at Yahoo Research, and at Stanford University in the mathematics department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), he was on the National Research Council’s Committee on the Analysis of Massive Data, he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets, and he spent fall 2013 at UC Berkeley co-organizing the Simons Foundation’s program on the Theoretical Foundations of Big Data Analysis.

Michael Mahoney, PhD

Professor at UC Berkeley

Training: Machine Learning with Big Data and TensorFlow on Google Cloud Part I

This training session will be conducted on Google Cloud Platform (GCP) and will use GCP to run TensorFlow. All you need is a laptop with a modern browser.
In the session, you will walk through the process of building a complete machine learning pipeline covering ingest, exploration, training, evaluation, deployment, and prediction:
Data pipelines and data processing: You will learn how to explore and split large data sets – for this part of the session you will be using SQL and Pandas on BigQuery and Cloud Datalab.
Model building: The machine learning models in TensorFlow will be developed on a small sample locally. The preprocessing operations will be implemented using Apache Beam, so that the same preprocessing can be applied in streaming mode as well. The preprocessing and training of the model will be carried out on GCP.
Model Inference and Deployment: The trained model will be deployed as a REST microservice and predictions invoked from a web application.
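As a rough sketch of the explore-and-split step described above, using pandas on BigQuery (this assumes the pandas-gbq package is installed; the query, table, and project ID are placeholders, not the session’s materials):

```python
# Pull a sample from BigQuery into pandas, then make a repeatable train/eval split.
import pandas as pd

query = "SELECT * FROM `my_dataset.my_table` LIMIT 100000"   # hypothetical table
df = pd.read_gbq(query, project_id="my-gcp-project", dialect="standard")

train = df.sample(frac=0.8, random_state=42)    # 80% for training
evaluation = df.drop(train.index)               # remaining 20% for evaluation
print(len(train), len(evaluation))
```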

Instructor Bio

Coming Soon

Alex Hanna, Ph.D.  
Technical Curriculum Developer, Machine Learning at Google Cloud

Training: Machine Learning with Big Data and TensorFlow on Google Cloud Part II

This training session will be conducted on Google Cloud Platform (GCP) and will use GCP to run TensorFlow. All you need is a laptop with a modern browser.
In the session, you will walk through the process of building a complete machine learning pipeline covering ingest, exploration, training, evaluation, deployment, and prediction:
Data pipelines and data processing: You will learn how to explore and split large data sets – for this part of the session you will be using SQL and Pandas on BigQuery and Cloud Datalab.
Model building: The machine learning models in TensorFlow will be developed on a small sample locally. The preprocessing operations will be implemented using Apache Beam, so that the same preprocessing can be applied in streaming mode as well. The preprocessing and training of the model will be carried out on GCP.
Model Inference and Deployment: The trained model will be deployed as a REST microservice and predictions invoked from a web application.

Instructor Bio

Coming Soon

Alex Hanna, Ph.D.  
Technical Curriculum Developer, Machine Learning at Google Cloud

Training: Real-Time, Continuous ML/AI Model Training, Optimizing, and Predicting with PipelineAI, Scikit-Learn, MXNet, Spark ML, GPU, TPU, Kafka, and Kubernetes (and some TensorFlow)

Chris Fregly, Founder @ PipelineAI, will walk you through a real-world, complete end-to-end Pipeline-optimization example. We highlight hyper-parameters – and model pipeline phases – that have never been exposed until now.

While most Hyper-parameter Optimizers stop at the training phase (i.e., learning rate, tree depth, EC2 instance type, etc.), we extend model validation and tuning into a new post-training optimization phase including 8-bit reduced-precision weight quantization and neural network layer fusing – among many other framework- and hardware-specific optimizations.

Next, we introduce hyper-parameters at the prediction phase including request-batch sizing and chipset (CPU v. GPU v. TPU). We’ll continuously learn from all phases of our pipeline – including the prediction phase. And we’ll update our model in real-time using data from a Kafka stream.

Lastly, we determine a PipelineAI Efficiency Score of our overall Pipeline including Cost, Accuracy, and Time. We show techniques to maximize this PipelineAI Efficiency Score using our massive PipelineDB along with the Pipeline-wide hyper-parameter tuning techniques mentioned in this talk.

Instructor Bio

Chris Fregly is Founder at PipelineAI, a Real-Time Machine Learning and Artificial Intelligence Startup based in San Francisco.

He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, author of the O’Reilly Training and Video Series titled, “High Performance TensorFlow in Production with Kubernetes and GPUs.”

Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.

Chris Fregly

Founder and Research Scientist, Apache Spark Contributor

Training: Coming Soon

Coming Soon

Instructor Bio

Emmanuel has a profile at the intersection of Artificial Intelligence and Business, having earned an MSc in Artificial Intelligence, an MSc in Computer Engineering, and an MSc in Management from three of France’s top schools. Recently, Emmanuel has worked on implementing and scaling out predictive analytics and machine learning solutions for Local Motion and Zipcar. He is currently an AI Program Director and Machine Learning Engineer at Insight, where he has led dozens of AI products from ideation to polished implementation.

Emmanuel Ameisen
AI Program Director and Machine Learning Engineer at Insight Data Science

Training: Deep Learning Research to Production - A Hands-on Approach Using Apache MXNet

Deep Learning (DL) has become ubiquitous in every day software applications and services. A solid understanding of DL foundational principles is necessary for researchers and modern-day engineers alike to successfully adapt the state of the art research in DL to business applications.

Researchers require a DL framework to quickly prototype and transform their ideas into models, and engineers need a framework that allows them to efficiently deploy these models to production without losing performance. We will show how to use the Gluon APIs in Apache MXNet to quickly prototype models and also deploy them to production without losing performance using MXNet Model Server (MMS).
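As a minimal illustration of that quick-prototyping idea, here is a hedged Gluon sketch; the tiny network and synthetic batch are toy assumptions, not the workshop notebooks.

```python
# Define, initialize, and take one training step with a small Gluon network.
import mxnet as mx
from mxnet import nd, autograd, gluon
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(128, activation="relu"),
        nn.Dense(10))                            # e.g. a 10-class classifier
net.initialize(mx.init.Xavier())

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), "adam", {"learning_rate": 1e-3})

X = nd.random.uniform(shape=(32, 784))           # fake batch of flattened images
y = nd.arange(32) % 10                           # fake class labels

with autograd.record():                          # record the forward pass
    loss = loss_fn(net(X), y)
loss.backward()                                  # backpropagate
trainer.step(32)                                 # update parameters
print(loss.mean().asscalar())
```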

In this workshop, you will learn to apply Convolutional Neural Networks (CNNs) to Computer Vision (CV) tasks and Recurrent Neural Networks (RNNs) to Natural Language Processing (NLP) tasks using Apache MXNet – the two fields in which Deep Learning has achieved state-of-the-art results.

To learn how to apply DL to CV problems, we will get hands-on by building a Facial Emotion Recognition (FER) model using advances of deep learning in CV. We will also build a sentiment analysis model to understand the application of DL in Natural Language Processing (NLP). As we build the models, we will learn about common practical limitations, pitfalls, best practices, and tips and tricks used by practitioners. Finally, we will conclude the workshop by showing how to deploy models using MMS for online/real-time inference and how to use Apache Spark + MXNet for offline batch inference on large datasets.

We will provide Jupyter notebooks to get hands-on and solidify the concepts.

Instructor Bio

Naveen is a Senior Software Engineer and a member of Amazon AI at AWS, where he works on Apache MXNet. He began his career building large-scale distributed systems and has spent the last 10+ years designing and developing them. He has delivered various tech talks at AMLC, Spark Summit, and ApacheCon, and loves to share knowledge. His current focus is to make Deep Learning easily accessible to software developers without the need for a steep learning curve. In his spare time, he loves to read books, spend time with his family, and watch his little girl grow.

Naveen Swamy
Software Developer at Amazon AI – AWS

Training: Understanding the PyTorch Framework with Applications to Deep Learning

Over the past couple of years, PyTorch has been increasing in popularity in the Deep Learning community. What was initially a tool for Deep Learning researchers has been making headway in industry settings.

In this session, we will cover how to create Deep Neural Networks using the PyTorch framework on a variety of examples. The material will range from beginner – understanding what is going on “under the hood”, coding the layers of our networks, and implementing backpropagation – to more advanced material on RNNs, CNNs, LSTMs, and GANs.
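To make the “coding the layers and implementing backpropagation” part concrete, here is a minimal PyTorch sketch; the tiny network and fake batch are illustrative assumptions.

```python
# Define a small network, run a forward pass, and backpropagate one step.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 784)                  # fake batch
y = torch.randint(0, 10, (32,))           # fake labels

loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()                           # backpropagation
optimizer.step()
print(loss.item())
```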

Attendees will leave with a better understanding of the PyTorch framework – in particular, how it differs from Keras and TensorFlow. Furthermore, a link to a clean, documented GitHub repo with the solutions to the examples covered will be provided.

Instructor Bio

Robert loves to break deep technical concepts down to be as simple as possible, but no simpler.

Robert has data science experience in companies both large and small. He is currently an Adjunct Professor at Santa Clara University’s Leavey School of Business and a Senior Data Scientist at Metis, where he teaches Data Science and Machine Learning. At Intel, he used his knowledge to tackle problems in data center optimization using cluster analysis, enriched market sizing models by implementing sentiment analysis from social media feeds, and improved data-driven decision making in one of the top 5 global supply chains. At Tamr, he built models to unify large amounts of messy data across multiple silos for some of the largest corporations in the world. He earned a PhD in Applied Mathematics from Arizona State University, where his research spanned image reconstruction, dynamical systems, mathematical epidemiology, and oncology.

Robert Alvarez, PhD
Sr. Data Scientist at Metis

Training: Analyzing Data Efficiently with Pandas and Python

Learn how to use pandas and Python together to quickly and easily analyze large amounts of data. We will teach you how to use pandas to carry out your entire data workflow, from dealing with basic data cleaning and munging to quickly creating visualizations through simple calls from pandas. We will work with real datasets and show you how to work with various data inputs and types of data, including dates and timestamps. We’ll also discuss more complex operations, such as groupby commands and working with text data.
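A tiny sketch of the date, text, and groupby operations mentioned above, on a made-up DataFrame:

```python
# Work with timestamps, text columns, and groupby aggregations in pandas.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-01-01", "2018-01-15", "2018-02-03"]),
    "category": ["news", "sports", "news"],
    "comment": ["Great article!", "what a game", "Thorough and well written"],
})

df["month"] = df["timestamp"].dt.to_period("M")           # work with dates
df["word_count"] = df["comment"].str.split().str.len()    # work with text
print(df.groupby(["month", "category"])["word_count"].mean())
```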

Instructor Bio

Jose Marcial Portilla has a BS and MS in Mechanical Engineering from Santa Clara University and years of experience as a professional instructor and trainer for data science and programming. He has publications and patents in various fields such as microfluidics, materials science, and data science technologies. Over the course of his career he has developed a skill set in analyzing data, and he hopes to use his experience in teaching and data science to help other people learn the power of programming and the ability to analyze data, as well as present data in clear and beautiful visualizations. Currently he works as the Head of Data Science for Pierian Data Inc. and provides in-person data science and Python programming training courses to employees working at top companies, including General Electric, Cigna, The New York Times, Credit Suisse, and many more. Feel free to contact him on LinkedIn for more information on in-person training sessions.

Jose Portilla
Head of Data Science at Pierian Data Inc.

Training: Introduction to Data Science

Curious about Data Science? Self-taught on some aspects, but missing the big picture? Well, you’ve got to start somewhere, and this session is the place to do it. This session will cover, at a layman’s level, some of the basic concepts of data science. In a conversational format, we will discuss: What are the differences between Big Data and Data Science – and why aren’t they the same thing? What distinguishes descriptive, predictive, and prescriptive analytics? What purpose do predictive models serve in a practical context? What kinds of models are there and what do they tell us? What is the difference between supervised and unsupervised learning? What are some common pitfalls that turn good ideas into bad science? During this session, attendees will learn the difference between k-nearest neighbors and k-means clustering, understand the reasons why we normalize data and avoid overfitting, and grasp the meaning of No Free Lunch.
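For the curious, a tiny, hedged scikit-learn snippet illustrates the k-nearest neighbors vs. k-means distinction mentioned above: one is supervised (it needs labels), the other is unsupervised (it ignores them). The dataset choice is an assumption for illustration.

```python
# Supervised k-NN classification versus unsupervised k-means clustering.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)    # uses the labels y
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)   # ignores the labels

print(knn.predict(X[:3]))         # predicted classes
print(kmeans.labels_[:3])         # cluster assignments (not class labels)
```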

Instructor Bio

For more than 20 years, Todd has been highly respected as both a technologist and a trainer. As a tech, he has seen that world from many perspectives: “data guy” and developer; architect, analyst and consultant. As a trainer, he has designed and covered subject matter from operating systems to end-user applications, with an emphasis on data and programming. As a strong advocate for knowledge sharing, he combines his experience in technology and education to impart real-world use cases to students and users of analytics solutions across multiple industries. He is a regular contributor to the community of analytics and technology user groups in the Boston area, writes and teaches on many topics, and looks forward to the next time he can strap on a dive mask and get wet. Todd is a Data Science Evangelist at DataRobot.

Todd Cioffi
Data Science Evangelist at DataRobot

Training: A Deeper Stack for Deep Learning: Adding Visualizations and Data Abstractions to Your Workflow

In this training session I introduce a new layer of Python software, called ConX, which sits on top of Keras, which sits on a backend (like TensorFlow). Do we really need a deeper stack of software for deep learning? Backends, like TensorFlow, can be thought of as “assembly language” for deep learning. Keras helps, but is more like “C++” for deep learning. ConX is designed to be “Python” for deep learning. So, yes, this layer is needed.

ConX is a carefully designed library that includes tools for network, weight, and activation visualizations; data and network abstractions; and an intuitive interactive and programming interface. Especially developed for the Jupyter notebook, ConX enhances the workflow of designing and training artificial neural networks by providing interactive visual feedback early in the process, and reducing cognitive load in developing complex networks.

This session will start small and move to advanced recurrent networks for images, text, and other data. Participants are encouraged to have samples of their own data so that they can explore a real and meaningful project.

A basic understanding of Python and a laptop is all that is required. Many example deep learning models will be provided in the form of Jupyter notebooks.

Documentation: https://conx.readthedocs.io/en/latest/

Instructor Bio

Douglas Blank is a professor of Computer Science at Bryn Mawr College outside of Philadelphia, PA. He has been working with neural networks for over 20 years, and developing easy to use software for even longer. He is one of the core developers of ConX.

Douglas Blank, PhD

Professor of Computer Science | Core Developer of ConX at Bryn Mawr College

Training: Hands-on Introduction to LSTMs in Keras/TensorFlow

This is a very hands-on introduction to LSTMs in Keras and TensorFlow. We will build a language classifier, a generator, and a translating sequence-to-sequence model. We will talk about debugging models and explore various related architectures, like GRUs, bidirectional LSTMs, etc., to see how well they work.
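As a preview, here is a hedged Keras sketch of a small LSTM text classifier in the spirit of this session; the vocabulary size, sequence length, and random data are assumptions, not the session’s corpus.

```python
# Tiny LSTM classifier: embed token ids, run them through an LSTM, predict a label.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, seq_len = 5000, 40
model = Sequential([
    Embedding(vocab_size, 64, input_length=seq_len),
    LSTM(64),
    Dense(1, activation="sigmoid"),        # binary language/label classifier
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.randint(0, vocab_size, size=(256, seq_len))   # fake token ids
y = np.random.randint(0, 2, size=(256,))                    # fake labels
model.fit(X, y, epochs=1, batch_size=32)
```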

Instructor Bio

Lukas Biewald is the founder and Chief Data Scientist of CrowdFlower. Founded in 2009, CrowdFlower is a data enrichment platform that taps into an on-demand workforce to help companies collect training data and do human-in-the-loop machine learning.

Following his graduation from Stanford University with a B.S. in Mathematics and an M.S. in Computer Science, Lukas led the Search Relevance Team for Yahoo! Japan. He then worked as a senior data scientist at Powerset, acquired by Microsoft in 2008. Lukas was featured in Inc Magazine’s 30 Under 30 list.

Lukas is also an expert level Go player.

Lukas Biewald

Training: Introduction to Deep Learning for Engineers

We will build and tweak several vision classifiers together, starting with perceptrons and building up to transfer learning and convolutional neural networks. We will investigate the practical implications of tweaking loss functions, gradient descent algorithms, network architectures, data normalization, data augmentation, and so on. This class is super hands-on and practical and requires no math or experience with deep learning.
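A rough sketch of the transfer-learning step mentioned above, using a pretrained Keras application model; the choice of base model and classifier head is an illustrative assumption, not the class’s exact exercise.

```python
# Freeze a pretrained vision backbone and attach a small custom classifier head.
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

base = MobileNetV2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # freeze pretrained features

x = GlobalAveragePooling2D()(base.output)
out = Dense(5, activation="softmax")(x)      # e.g. a 5-class custom classifier
model = Model(base.input, out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```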

Instructor Bio

Lukas Biewald is the founder and Chief Data Scientist of CrowdFlower. Founded in 2009, CrowdFlower is a data enrichment platform that taps into an on-demand workforce to help companies collect training data and do human-in-the-loop machine learning.

Following his graduation from Stanford University with a B.S. in Mathematics and an M.S. in Computer Science, Lukas led the Search Relevance Team for Yahoo! Japan. He then worked as a senior data scientist at Powerset, acquired by Microsoft in 2008. Lukas was featured in Inc Magazine’s 30 Under 30 list.

Lukas is also an expert level Go player.

Lukas Biewald

Training: Getting Started with TensorFlow

Bring your laptops and get started with TensorFlow! In this training, you will get an introduction to using TensorFlow. We will go through the basics, and by the time we are finished, you will know how to build models on your own. No previous experience of machine learning is needed.
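For a sense of what “building models on your own” can look like, here is a minimal sketch using the Keras API that ships with TensorFlow; the tiny synthetic dataset is an assumption.

```python
# A first model in TensorFlow: a small binary classifier on toy data.
import numpy as np
import tensorflow as tf

X = np.random.rand(200, 4).astype("float32")
y = (X.sum(axis=1) > 2).astype("float32")          # toy binary target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.evaluate(X, y, verbose=0))
```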

Prerequisites: Install TensorFlow on your own computer, or use one of the Google Cloud instances of TensorFlow that we will provide (no installation required). If you have a Google Cloud account, we can also share a TensorFlow cloud image that you can use.

Instructor Bio

Magnus Hyttsten is a Senior Staff Developer Advocate for TensorFlow at Google. He focuses on all things TensorFlow – from making sure that the developer community is happy to helping develop the product. Magnus has been speaking at many major events including Google I/O, AnDevCon, Machine Learning meetups, etc. Right now, he is fanatically & joyfully focusing on TensorFlow for Mobile as well as creating Reinforcement Learning models.

Magnus Hyttsten

Senior Staff Developer Advocate

Training: Coming soon

Coming soon

Instructor Bio

Coming Soon

Skipper Seabold

Training: Feature Engineering for Time Series Data

Most machine learning algorithms today are not time-aware and are not easily applied to time series and forecasting problems. Leveraging algorithms like XGBoost, or even linear models, typically requires substantial data preparation and feature engineering – for example, creating lagged features, detrending the target, and detecting periodicity. The preprocessing required becomes more difficult in the common case where the problem requires predicting a window of multiple future time points. As a result, most practitioners fall back on classical methods, such as ARIMA or trend analysis, which are time-aware but often less expressive. This talk covers practices for solving this challenge and explores the potential to automate this process in order to apply advanced machine learning algorithms to time series problems.
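A small pandas sketch of the kind of feature engineering described above (lagged features, crude detrending, and a periodicity feature), built on a made-up daily series:

```python
# Turn a raw time series into features a non-time-aware model can use.
import numpy as np
import pandas as pd

idx = pd.date_range("2018-01-01", periods=120, freq="D")
df = pd.DataFrame({"y": np.random.rand(120).cumsum()}, index=idx)

for lag in (1, 7, 14):                        # lagged features
    df[f"y_lag_{lag}"] = df["y"].shift(lag)

df["y_detrended"] = df["y"] - df["y"].rolling(28).mean()   # crude detrending
df["dayofweek"] = df.index.dayofweek                        # periodicity feature
print(df.dropna().head())
```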

Instructor Bio

Michael Schmidt is the Chief Scientist at DataRobot and has been featured in the Forbes list of the world’s top 7 data scientists. He has won awards for research in AI, with publications ranking in the 99th percentile of all tracked research. In 2012, Michael founded Nutonian and led the development of Eureqa, a machine learning application and service used by over 80,000 users globally (later acquired by DataRobot). In 2015, he was selected by MIT for the most innovative 35-under-35 award. Michael has also appeared in several media outlets such as the New York Times, NPR’s RadioLab, the Science Channel, and Communications of the ACM. Most recently, his work focuses on automated machine learning, feature engineering, and advanced time series prediction.

Michael Schmidt, PhD

Chief Scientist

Training: Introduction to Text Analytics

Text analytics or text mining is an important branch of analytics that allows machines to break down text data. As a data scientist, I often use text-specific techniques to interpret data that I’m working with for my analysis. During this workshop, I plan to walk through an end-to-end project covering text pre-processing techniques, machine learning techniques and Python libraries for text analysis.

Text pre-processing techniques include data cleaning and tokenization. Once in a standard format, various machine learning techniques can be applied to better understand the data. This includes using popular modeling techniques to classify emails as spam or not, or to score the sentiment of a tweet on Twitter. In addition, unsupervised learning techniques such as topic modeling with Latent Dirichlet Allocation or matrix factorization can be applied to text data to pull out hidden themes in the text. Other techniques such as text generation can be applied using Markov chains or deep learning.

We will walk through an example in Jupyter Notebook that goes through all of the steps of a text analysis project, using several text analysis libraries in Python including NLTK, TextBlob and gensim along with the standard machine learning libraries including pandas and scikit-learn.
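As a rough preview of that preprocessing-to-topic-modeling path, here is a hedged scikit-learn sketch on a tiny invented corpus (the session itself also uses NLTK, TextBlob, and gensim; this example assumes a recent scikit-learn where `get_feature_names_out` is available).

```python
# Clean/tokenize with a vectorizer, then pull out hidden themes with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the game in overtime",
    "stocks fell as the market reacted to earnings",
    "the coach praised the players after the match",
    "investors worry about interest rates and inflation",
]

vec = CountVectorizer(stop_words="english")           # cleaning + tokenization
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(k, [terms[i] for i in topic.argsort()[-5:]])   # top words per topic
```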

Instructor Bio

Alice Zhao is currently a Senior Data Scientist at Metis, where she teaches 12-week data science bootcamps. Previously, she worked at Cars.com, where she started as the company’s first data scientist, supporting multiple functions from Marketing to Technology. During that time, she also co-founded a data science education startup, Best Fit Analytics Workshop, teaching weekend courses to professionals at 1871 in Chicago. Prior to becoming a data scientist, she worked at Redfin as an analyst and at Accenture as a consultant. She has her M.S. in Analytics and B.S. in Electrical Engineering, both from Northwestern University. She blogs about analytics and pop culture on A Dash of Data. Her blog post, “How Text Messages Change From Dating to Marriage” made it onto the front page of Reddit, gaining over half a million views in the first week. She is passionate about teaching and mentoring, and loves using data to tell fun and compelling stories.

Alice Zhao

Senior Data Scientist

Training: Cloud Native Data Science with Dask

Python has become a great language for data science. Libraries like NumPy, pandas, and Scikit-Learn provide high-performance, pleasant APIs for analyzing data. However, they’re focused on single-core, in-memory analytics, and so don’t scale out to very large datasets or clusters of machines. That’s where Dask comes in.

Dask is a library that natively scales Python. It works with libraries like NumPy, pandas, and Scikit-Learn to operate on datasets in parallel, potentially distributed on a cluster.

Moving to a cloud-native data science workflow will make you and your team more productive. You’ll be able to more quickly iterate on the data collection, visualization, modeling, testing, and deployment cycle.

Attendees will learn the high-level user interfaces Dask provides, like dask.array and dask.dataframe. These let you write regular Python, NumPy, or Pandas code that is then executed in parallel on datasets that may be larger than memory. We’ll learn through hands-on exercises. Each attendee will be provided with their own Dask cluster to develop and run their solutions.
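A minimal dask.dataframe sketch of that larger-than-memory, pandas-like workflow; the file pattern and column names are placeholders (reading from S3 additionally assumes s3fs is installed).

```python
# Lazily read many CSV partitions, aggregate with a pandas-like API,
# then trigger parallel execution with .compute().
import dask.dataframe as dd

df = dd.read_csv("s3://my-bucket/logs-2018-*.csv")    # hypothetical partitioned data
total_bytes = df.groupby("user_id")["bytes"].sum()    # lazy, pandas-like API
print(total_bytes.nlargest(10).compute())             # executes in parallel
```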

Dask is a flexible parallelization framework; we’ll demonstrate that flexibility with some machine-learning workloads. We’ll use Dask to easily distribute a large scikit-learn grid search to run on a cluster of machines. We’ll use Dask-ML to work with larger-than-memory datasets.

We’ll see how Dask can be deployed on Kubernetes, taking advantage of features like auto-scaling, where new worker pods are automatically created or destroyed based on the current workload.

Instructor Bio

Tom is a Data Scientist and developer at Anaconda and works on open source projects including dask and pandas. Tom’s current focus is on scaling out Python’s machine learning ecosystem to larger datasets and larger models.

Tom Augspurger
Data Scientist

Training: Good, Fast, Cheap: How to do Data Science with Missing Data

If you’ve never heard of the “good, fast, cheap” dilemma, it goes something like this: You can have something good and fast, but it won’t be cheap. You can have something good and cheap, but it won’t be fast. You can have something fast and cheap, but it won’t be good. In short, you can pick two of the three but you can’t have all three.
 
If you’ve done a data science problem before, I can all but guarantee that you’ve run into missing data. How do we handle it? Well, we can avoid, ignore, or try to account for missing data. The problem is, none of these strategies are good, fast, *and* cheap.
 
We’ll start by visualizing missing data and identifying the three different types of missing data, which will allow us to see how each type affects whether we should avoid, ignore, or account for the missing data. We will walk through the advantages and disadvantages of each approach, as well as how to visualize and implement each one. We’ll wrap up with practical tips for working with missing data and recommendations for integrating them into your workflow!
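For orientation, here is a small sketch, independent of the course materials, of inspecting missingness with pandas and applying two of the strategies above (dropping rows versus simple imputation with scikit-learn); the toy dataframe is invented.

# Inspect and handle missing values with pandas and scikit-learn.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan],
    "income": [72000, 65000, np.nan, 88000, 54000],
})

# How much is missing, and where?
print(df.isna().mean())          # fraction missing per column

# "Avoid": drop rows with any missing value (fast and cheap, loses data).
dropped = df.dropna()

# "Account for": impute with a simple strategy (here, the column mean).
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)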

Instructor Bio

Matt currently leads instruction for GA’s Data Science Immersive in Washington, D.C. and most enjoys bridging the gap between theoretical statistics and real-world insights. Matt is a recovering politico, having worked as a data scientist for a political consulting firm through the 2016 election. Prior to his work in politics, he earned his Master’s degree in statistics from The Ohio State University. Matt is passionate about making data science more accessible and putting the revolutionary power of machine learning into the hands of as many people as possible. When he isn’t teaching, he’s thinking about how to be a better teacher, falling asleep to Netflix, and/or cuddling with his pug.

Matt Brems
Global Lead Data Science Instructor

Training: Data Visualization: From Square One to Interactivity

As data scientists, we are expected to be experts in machine learning, programming, and statistics. However, our audiences might not be! Whether we’re working with peers in the office, trying to convince our bosses to take some sort of action, or communicating results to clients, there’s nothing more clear or compelling than an effective visual to make our point. Let’s leverage the Python libraries Matplotlib and Bokeh along with visual design principles to make our point as clearly and as compellingly as possible!
 
This talk is designed for a wide audience. If you haven’t worked with Matplotlib or Bokeh before or if you (like me!) don’t have a natural eye for visual design, that’s OK! This will be a hands-on training designed to make visualizations that best communicate what you want to communicate. We’ll cover different types of visualizations, how to generate them in Matplotlib, how to reduce clutter and guide your user’s eye, and how (and when!) to add interactivity with Bokeh.
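As a flavor of the hands-on portion, here is a minimal Matplotlib sketch in the spirit of the session: a simple chart with clutter reduced and values labeled directly. The data is made up, and the styling choices are illustrative rather than prescriptive.

# A simple bar chart with some clutter removed and direct value labels.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
signups = [120, 150, 90, 180]

fig, ax = plt.subplots()
ax.bar(months, signups, color="steelblue")

# Reduce clutter: drop the box, label values directly above each bar.
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
for x, y in zip(months, signups):
    ax.text(x, y + 3, str(y), ha="center")

ax.set_title("Monthly signups")
plt.show()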

 

Instructor Bio

Matt currently leads instruction for GA’s Data Science Immersive in Washington, D.C. and most enjoys bridging the gap between theoretical statistics and real-world insights. Matt is a recovering politico, having worked as a data scientist for a political consulting firm through the 2016 election. Prior to his work in politics, he earned his Master’s degree in statistics from The Ohio State University. Matt is passionate about making data science more accessible and putting the revolutionary power of machine learning into the hands of as many people as possible. When he isn’t teaching, he’s thinking about how to be a better teacher, falling asleep to Netflix, and/or cuddling with his pug.

Matt Brems
Global Lead Data Science Instructor

Workshop Sessions

More sessions added weekly

Workshop: Raise your own Pandas Cub

A typical data scientist’s workflow in Python consists of firing up a Jupyter Notebook, importing NumPy, Pandas, Matplotlib, and Scikit-Learn into the workspace and then completing a data analysis. The APIs from these libraries are well-known, mostly stable, and provide a powerful and flexible way of analyzing data. These libraries have contributed an enormous amount to the success of Python as a language of choice for doing data science as well as increasing productivity for the data scientists that use them.

For those data scientists that are interested in learning how to develop their own data science tools, relying on these popular, easy-to-use libraries hides the complexities and underlying Python code. In fact, it is so easy to produce data science results in Python, that one only needs to know the very basics of the language along with knowledge of the library’s API.

In this hands-on tutorial, we will build our own data analysis package from scratch. Specifically, our package will contain a DataFrame Class with a Pandas-like API. We will make heavy use of the Python data model, which contains special methods to help our DataFrame work with Python operators. By the end of the tutorial, we will have built a Python package that you can import into your workspace capable of performing the most important operations available in Pandas.
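To illustrate the role of the Python data model, here is a toy sketch, far simpler than the package built in the tutorial, in which special methods such as __len__ and __getitem__ give a tiny DataFrame class a pandas-like feel.

# A toy DataFrame-like class built on the Python data model.
import numpy as np

class MiniFrame:
    def __init__(self, data):
        # data: dict mapping column name -> sequence of values
        self._data = {col: np.asarray(vals) for col, vals in data.items()}

    def __len__(self):
        # len(df) returns the number of rows
        return len(next(iter(self._data.values()))) if self._data else 0

    def __getitem__(self, col):
        # df["col"] selects a single column as a NumPy array
        return self._data[col]

    def __repr__(self):
        cols = ", ".join(self._data)
        return f"MiniFrame({len(self)} rows: {cols})"

df = MiniFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(len(df), df["a"].sum(), df)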

Instructor Bio

Ted Petrou is the author of Pandas Cookbook and founder of both Dunder Data and the Houston Data Science Meetup group. He worked as a data scientist at Schlumberger where he spent the vast majority of his time exploring data. Ted received his Master’s degree in statistics from Rice University and used his analytical skills to play poker professionally and teach math before becoming a data scientist.

Ted Petrou

Pandas Author, Founder at Dunder Data

Workshop: Introduction to Clinical Natural Language Processing: Predicting Hospital Readmission with Discharge Summaries

Clinical notes from physicians and nurses contain a vast wealth of knowledge and insight that can be utilized for predictive models to improve patient care and hospital workflow. In this workshop, we will introduce a few Natural Language Processing techniques for building a machine learning model in Python with clinical notes. As an example, we will focus on predicting unplanned hospital readmission with discharge summaries using the MIMIC III data set. After completing this tutorial, the audience will know how to prepare data for a machine learning project, preprocess unstructured notes using a bag-of-words approach, build a simple predictive model, assess the quality of the model and strategize how to improve the model. Note to the audience: the MIMIC III data set requires requesting access in advance, so please request access as early as possible.
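For a sense of the approach (though not of the data), here is a hedged, toy-size sketch of the bag-of-words pipeline described above; the notes and labels are invented stand-ins, since real work would use MIMIC-III discharge summaries.

# Bag-of-words text features feeding a simple predictive model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "patient discharged home in stable condition",
    "readmitted previously, shortness of breath persists",
    "wound healing well, follow up in two weeks",
    "noncompliant with medication, chf exacerbation likely",
]
readmitted = [0, 1, 0, 1]  # fake labels for illustration only

model = make_pipeline(
    CountVectorizer(stop_words="english"),  # bag of words
    LogisticRegression(),
)
model.fit(notes, readmitted)
print(model.predict(["persistent shortness of breath, missed medications"]))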

Instructor Bio

Andrew Long is a Data Scientist at Fresenius Medical Care North America (FMCNA). Andrew holds a PhD in biomedical engineering from Johns Hopkins University and a Master’s degree in mechanical engineering from Northwestern University. Andrew joined FMCNA last year after participating in the Insight Health Data Fellows Program. At FMCNA, he is responsible for building predictive models using machine learning to improve the quality of life of every patient who receives dialysis from FMCNA. He is currently creating a model to predict which patients are at the highest risk of imminent hospitalization.

Andrew Long, PhD

Data Scientist, Fresenius Medical Care

Workshop: Visual Search: The Next Frontier of Search

Visual search is a rapidly emerging trend that is ideal for retail segments, such as fashion and home design, because they are largely driven by visual content, and style is often difficult to describe using text search alone. Visual search allows you to replace your keyboard with your camera phone by using images instead of text to search for things. Many people believe that visual search will change the way we search, as evidenced by the following quote from Pinterest co-founder and CEO Ben Silbermann in a CNBC interview, “A lot of the future of search is going to be about pictures instead of keywords.”

Through a technique called distance metric learning, a neural network can transform any image into a compact, information rich vector of numbers. In this tutorial/session, you will hear from visual search experts at Clarifai, eBay, Wayfair, and Walmart Labs/Jet.com. We’ll look at how you can use distance metric learning for visual similarity search within massive product catalogs – up to 1.1 billion items in eBay’s case.
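As a rough illustration of the retrieval step, here is a minimal sketch of nearest-neighbor search over item embeddings; the vectors are random stand-ins for the output of a metric-learning model, and the catalog size is arbitrary.

# Nearest-neighbor search over (stand-in) item embedding vectors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
catalog_vectors = rng.normal(size=(100_000, 64)).astype("float32")  # stand-in embeddings
query = rng.normal(size=(1, 64)).astype("float32")

index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(catalog_vectors)
distances, ids = index.kneighbors(query)
print(ids[0], distances[0])   # the five most visually similar catalog items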

If you are part of an in-house team of experts in machine learning and data science, you will learn:

* The latest state-of-the art visual search research and techniques as the speakers will share their in-depth knowledge on the subject

* How to scale your visual search solution to address the billion-scale problem

* How to train models that provide more specific and accurate results for visually rich categories

If you don’t have a team of in-house machine learning or data science experts but are interested in implementing visual search, you will learn about a solution that:

* Allows you to leverage visual search without having to do any training on your own dataset

* For those who want to train their own custom models, makes it easy to do so with fewer than 10 examples, minimal code, and no special infrastructure

Instructor Bio

Coming Soon

George Williams
Director of Data Science

Workshop: Multivariate Time Series Forecasting Using Statistical and Machine Learning Models

Time series data is ubiquitous: weekly initial unemployment claim, daily term structure of interest rates, tick level stock prices, weekly company sales, daily foot traffic recorded by mobile devices, and daily number of steps taken recorded by a wearable, just to name a few.

Some of the most important and commonly used data science techniques in time series forecasting are those developed in the field of machine learning and statistics. Data scientists should have at least a few basic time series statistical and machine learning modeling techniques in their toolkit.

This lecture discusses the formulation of Vector Autoregressive (VAR) models, one of the most important classes of multivariate time series statistical models, and of neural network-based techniques, which have received a lot of attention in the data science community in the past few years; it demonstrates how they are implemented in practice and compares their advantages and disadvantages. Real-world applications, demonstrated using Python, are used throughout the lecture to illustrate these techniques. While not the focus of this lecture, exploratory time series data analysis using histograms, kernel density plots, time-series plots, scatterplot matrices, plots of autocorrelation (i.e. correlograms), plots of partial autocorrelation, and plots of cross-correlations will also be included in the demo.
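As a small illustration of one of the techniques above, here is a hedged sketch of fitting a VAR model with statsmodels on simulated data; the series, lag order, and forecast horizon are assumptions for demonstration.

# Fit a small VAR model on two simulated, related series with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(42)
n = 200
y1 = rng.normal(size=n)
y2 = 0.6 * np.r_[0, y1[:-1]] + rng.normal(size=n)   # y2 follows y1 with a one-period lag
data = pd.DataFrame({"y1": y1, "y2": y2})

# Fit a VAR(2); in practice the lag order is chosen with an information criterion.
results = VAR(data).fit(2)
print(results.summary())

# Forecast the next 5 periods from the last observed lags.
print(results.forecast(data.values[-results.k_ar:], steps=5))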

Instructor Bio

Jeffrey is the Chief Data Scientist at AllianceBernstein, a global investment firm managing over $500 billions. He is responsible for building and leading the data science group, partnering with investment professionals to create investment signals using data science, and collaborating with sales and marketing teams to analyze clients. Graduated with a Ph.D. in economics from the University of Pennsylvania, he has also taught statistics, econometrics, and machine learning courses at UC Berkeley, Cornell, NYU, the University of Pennsylvania, and Virginia Tech. Previously, Jeffrey held advanced analytic positions at Silicon Valley Data Science, Charles Schwab Corporation, KPMG, and Moody’s Analytics.

Jeffrey Yau, PhD

Chief Data Scientist, Alliance Bernstein

Workshop: Model Evaluation in the land of deep learning

Model evaluation metrics are typically tied to the predictive learning task. There are different metrics for classification (ROC-AUC, confusion matrix), regression (RMSE, R2 score), ranking (precision-recall, F1 score), and so on. These metrics, coupled with cross-validation or hold-out validation techniques, might help analysts and data scientists select a performant model. However, model performance decays over time because of variability in the data. At that point, point-estimate-based metrics are not enough, and a better understanding of the why, what, and how of the categorization process is needed.

Evaluating model decisions might still be easy for linear models but gets difficult in the world of deep neural networks (DNNs). This complexity might increase multifold for use cases related to computer vision (image classification, image captioning, or visual QnA (VQA)), text classification, sentiment analysis, or topic modeling. ResNet, a recently published state-of-the-art DNN, has over 200 layers. Interpreting the input features and their effect on the output categorization across that many layers is challenging. The lack of decomposability and intuitiveness associated with DNNs prevents their widespread adoption even with their superior performance compared to more classical machine learning approaches. Faithful interpretation of DNNs will not only provide insight into failure modes (false positives and false negatives) but will also help the humans in the loop evaluate the robustness of the model against noise. This brings trust and transparency to the predictive algorithm.

In this workshop, I will share how to enable class-discriminative visualizations for computer vision and NLP problems when using convolutional neural networks (CNNs), and an approach that helps make CNNs more transparent by not only capturing metrics during the validation step but also highlighting the salient features in the image or text that drive a prediction.
I will also talk briefly about the open source project “Skater” (https://github.com/datascienceinc/Skater) and how it can help solve our interpretation needs.
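By way of illustration (and not using the Skater API itself), here is a hedged sketch of gradient-based saliency, one of the simpler class-discriminative visualization techniques in this space; the tiny untrained CNN and the random “image” are stand-ins.

# Gradient-based saliency: how much does each input pixel influence a class score?
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

image = tf.convert_to_tensor(np.random.rand(1, 32, 32, 3).astype("float32"))
target_class = 3

with tf.GradientTape() as tape:
    tape.watch(image)
    probs = model(image)
    class_score = probs[0, target_class]

# Pixels with large |d score / d pixel| are the most influential for this class.
saliency = tf.reduce_max(tf.abs(tape.gradient(class_score, image)), axis=-1)
print(saliency.shape)  # (1, 32, 32) heatmap over the input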

Instructor Bio

Pramit Choudhary is an applied machine learning research scientist. He focuses on optimizing and applying Machine Learning to solve real-world problems. His research area includes scaling and optimizing machine learning algorithms. Currently, he is exploring better ways to explain a model’s learned decision policies to reduce the chaos in building effective models to close the gap between a prototype and operationalized models.

Pramit Choudhary

Lead Data Scientist at datascience.com

Workshop: Stanned Up: Bayesian Methods Using Stan

Many introductory tutorials on Bayesian inference are not as simple as the authors purport them to be. In this workshop, we provide intuitive and straightforward marketing examples, using the open source probabilistic programming language Stan, showing where traditional linear regression and even other machine learning methods fall short.

We will first show how a hierarchical model improves forecasting for situations where you have categories with a mix of counts including counts that are traditionally too low for statistical significance. We will use a pay per click example for this first tutorial. The second example will demonstrate how early stopping can reduce the cost and time for test-and-learn situations (aka marketing tests, experiments, etc.)

For both examples, we will walk through the code, and if wifi allows and attendees have pre-installed R, RStudio, and rstan, they can follow along.

Instructor Bio

Curt studied computer science at the University of Illinois at Urbana-Champaign and mathematics at the University of Minnesota. After building too many websites and client/server systems, he turned to data mining and statistics and never looked back. A good day is spent building models in R and Stan.

Curt Bergmann
Senior Data Scientist at Elicit, LLC

Workshop: The Power of Monotonicity to Make ML Make Sense

The key to machine learning is getting the right flexibility. For many ML problems, we have prior knowledge about global trends the model should be capturing, such as that predicted travel time should go up if traffic gets worse. But flexible models like DNNs and RFs can have a hard time capturing such global trends given noisy training data, which limits their ability to extrapolate well when you run a model on examples different from your training data. TensorFlow’s new TensorFlow Lattice tools let you create flexible models that respect the global trends you request, producing easier-to-debug models that generalize well. TF Lattice provides new TF Estimators that make capturing your global trends easy, and we’ll also explain the underlying new TF Lattice operators that you can use to create your own deeper lattice networks.
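TensorFlow Lattice itself is best learned from its own documentation, but the core idea of a monotonicity constraint can be illustrated with scikit-learn’s gradient boosting, which also accepts monotonic constraints; this hedged sketch uses synthetic, noisy travel-time data and is not TF Lattice.

# Monotonicity constraints: predicted travel time must not decrease as traffic grows.
# Requires scikit-learn >= 1.0 (older versions need the experimental enable flag).
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
traffic = rng.uniform(0, 10, size=(500, 1))
travel_time = 5 + 2 * traffic[:, 0] + rng.normal(scale=4, size=500)  # noisy trend

unconstrained = HistGradientBoostingRegressor().fit(traffic, travel_time)
monotone = HistGradientBoostingRegressor(monotonic_cst=[1]).fit(traffic, travel_time)

grid = np.linspace(0, 10, 11).reshape(-1, 1)
print(np.round(unconstrained.predict(grid), 1))  # may wiggle downward in places
print(np.round(monotone.predict(grid), 1))       # guaranteed non-decreasing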

Instructor Bio

Maya Gupta leads Google’s Glassbox Machine Learning R&D team, which focuses on designing and developing controllable and interpretable machine learning algorithms that solve Google product needs. Prior to Google, Gupta was an Associate Professor of Electrical Engineering at the University of Washington from 2003 to 2013. Her PhD is from Stanford, and she holds a BS EE and BA Econ from Rice.

Maya Gupta, PhD

Glassbox ML R&D Team Lead at Google

Workshop: Applying Deep Learning to Article Embedding for Fake News Evaluation

In this talk we explore real world use case applications for automated “Fake News” evaluation using contemporary deep learning article vectorization and tagging. We begin with the use case and an evaluation of the appropriate context applications for various deep learning applications in fake news evaluation. Technical material will review several methodologies for article vectorization with classification pipelines, ranging from traditional to advanced deep architecture techniques. We close with a discussion on troubleshooting and performance optimization when consolidating and evaluating these various techniques on active data sets.
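As a baseline at the “traditional” end of the spectrum described above, here is a hedged sketch of TF-IDF article vectorization feeding a linear classifier; the articles and labels are invented, whereas real work would use a labeled corpus.

# TF-IDF article vectorization plus a linear classification pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

articles = [
    "scientists confirm vaccine effective in large trial",
    "celebrity reveals miracle cure doctors don't want you to know",
    "central bank holds interest rates steady amid slow growth",
    "shocking proof the moon landing was staged, insiders say",
]
labels = [0, 1, 0, 1]  # 1 = flagged as fake (toy labels for illustration)

pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
pipeline.fit(articles, labels)
print(pipeline.predict(["experts say miracle cure hidden from the public"]))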

Instructor Bio

Mike serves as Head of Data Science at Uber ATG, UC Berkeley Data Science faculty, and head of Skymind Labs, the machine learning research lab affiliated with DeepLearning4J. He has led teams of data scientists in the Bay Area as Chief Data Scientist for InterTrust and Takt, Director of Data Sciences for MetaScale/Sears, and CSO for Galvanize, where he founded the galvanizeU-UNH accredited Master of Science in Data Science degree and oversaw the company’s transformation from co-working space to data science organization.

Michael Tamir, PhD

Head of Data Science, Uber

Workshop: Scalable data science and deep learning with R

We provide an overview of the tools available to data scientists using R for Spark and TensorFlow, then discuss the latest developments at the intersections of these ecosystems. We organize the conversation around a diverse selection of use cases, such as ad hoc analysis on distributed datasets, building machine learning models for low latency scoring, and developing deep learning models for research, and demonstrate sample workflows. Various open source R packages will be featured, including the sparklyr, keras, and tensorflow projects.

Instructor Bio

Kevin is a software engineer at RStudio developing open source packages for big data analytics and machine learning. He has held data science positions across different industries, and has experience executing the end-to-end analytics process, from data engineering to model deployment and change management. Prior to RStudio, he was a principal data scientist at Honeywell, and also held roles at KPMG and Citi.

Kevin Kuo
Software Engineer at RStudio

Workshop: Latest Developments in GANS

Generative adversarial networks (GANs) are widely considered one of the most interesting developments in machine learning and AI in the last decade. In this wide-ranging talk, we’ll start by covering the fundamentals of how and why they work, reviewing basic neural network and deep learning terminology in the process; we’ll then cover the latest applications of GANs, from generating art from drawings, to advancing research areas such as Semi-Supervised Learning, and even generating audio. We’ll also examine the progress on improving GANs themselves, showing the tricks researchers have used to increase the realism of the images GANs generate.

Throughout, we’ll touch on many related topics, such as different ways of scoring GANs, and many of the Deep Learning-related tricks that have been found to improve training. Finally, we’ll close with some speculation from the leading minds in the field on where we are most likely to see GANs applied next.

Attendees will leave with a better understanding of the latest developments in this exciting area and the technical innovations that made those developments possible. Emphasis will be placed throughout on illuminating why the latest achievements have worked, not just what they are. Furthermore, a link to a clean, documented GitHub repo with a working GAN will be provided for attendees to see how to code one up. Attendees will thus leave feeling more confident and empowered to apply these same tricks to solve problems they face in personal projects or at work.
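For readers who want a concrete anchor before the session, here is a compact, hedged sketch of the adversarial training loop on a toy 1-D problem (a generator learning to mimic a Gaussian); image GANs follow the same pattern with convolutional networks, and the architecture and hyperparameters here are arbitrary.

# A toy GAN: generator vs. discriminator on 1-D Gaussian data.
import tensorflow as tf

latent_dim = 8
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(1),
])
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-3)
d_opt = tf.keras.optimizers.Adam(1e-3)

for step in range(2000):
    real = tf.random.normal((64, 1), mean=4.0, stddev=1.5)   # "real" data
    noise = tf.random.normal((64, latent_dim))

    # Train the discriminator to separate real samples from generated ones.
    with tf.GradientTape() as tape:
        fake = generator(noise)
        d_loss = (bce(tf.ones((64, 1)), discriminator(real)) +
                  bce(tf.zeros((64, 1)), discriminator(fake)))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))

    # Train the generator to fool the discriminator.
    with tf.GradientTape() as tape:
        g_loss = bce(tf.ones((64, 1)), discriminator(generator(noise)))
    g_opt.apply_gradients(zip(tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))

samples = generator(tf.random.normal((1000, latent_dim))).numpy()
print(samples.mean(), samples.std())  # should drift toward ~4.0 and ~1.5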

Instructor Bio

Seth loves teaching and learning cutting edge machine learning concepts, applying them to solve companies’ problems, and teaching others to do the same. Seth discovered Data Science and machine learning while working in consulting in early 2014. After taking virtually every course Udacity and Coursera had to offer on Data Science, he joined Trunk Club as their first Data Scientist in December 2015. There, he worked on lead scoring, recommenders, and other projects, before joining Metis in April 2017 as a Senior Data Scientist, teaching the Chicago full time course. Over the past six months, he has developed a passion for neural nets and deep learning, working on writing a neural net library from scratch and sharing what he has learned with others via blog posts (on sethweidman.com), as well as speaking at Meetups and conferences.

Seth Weidman

Senior Data Scientist, Metis

Workshop: An introduction to Julia for machine learning

In this workshop, we assume no prior exposure to Julia and will show you why Julia is a fantastic language for machine learning. It should be accessible and useful to data scientists and engineers of all levels, as well as anyone else with technical computing needs and an interest in machine learning. Our goal is that attendees will leave the workshop with an understanding of how easy it is to start programming in Julia, what makes Julia special, and how using Julia for machine learning applications will improve your workflow as a data scientist.

All workshop materials will be provided on juliabox.com so that attendees can operate in a common environment and code along with the instructor.

The first thirty minutes of the workshop will cover language basics and show you how easy it is to pick up Julia’s high-level syntax. To get you up and running with Julia, we will go over syntax for function declarations, loops, conditionals, and linear algebra operations.

The second thirty minutes of the workshop will highlight Julia’s performance. Attendees will learn how to benchmark, see first-hand how quickly Julia code runs compared to C and Python, and learn to take advantage of special features from Julia’s linear algebra infrastructure. Finally, attendees will come to understand how multiple dispatch, a key feature of Julia’s design, helps to make Julia both high-level and performant.

In the last 45 minutes, we will cover special tools for data science and machine learning, where you will see how easy it is to recognize letters in your own handwriting using Flux.

 

Instructor Bio

Jane Herriman is Director of Diversity and Outreach at Julia Computing and a PhD student at Caltech. She is a Julia, dance, and strength training enthusiast and is excited for the opportunity to teach you Julia.

Jane Herriman
Director of Diversity and Outreach at Julia Computing

Workshop: Machine Learning for Digital Identity

There are tens of billions of online profiles today, each associated with some identity, on diverse platforms including social networks, online marketplaces, dating sites and financial institutions. Every platform needs to understand, validate and verify these identities.

The landscape of identity challenges, available data, and machine-learning technology have evolved over the years. However, identity still remains a notoriously hard problem. While we’ve made a lot of progress in academia and industry, there still are several unsolved problems. In this session, we will talk through three core, interconnected problems: (1) identity authentication/validation; (2) identity matching; (3) identity verification. We will discuss our work on effectively using machine learning technology to solve these problems, along with an analysis of popular techniques used on different platforms.

Identity authentication and validation ensures high-quality attributes, which affect all downstream identity processes. The challenge of identity authentication is determining whether an input identity/attribute is a valid value. While identity validation solutions need to be tailored to the attribute type, we will share some of the common techniques applicable across all attribute types: (1) canonicalizing attribute values, and then (2) looking them up against constructed datasets of the universe of all possible values. We will also discuss how some of these generic techniques are applied to the validation of two different types of attributes: names and government-issued IDs.

Identity matching is fundamental for two main applications: detecting duplicates and joining with other, often external, data sources to create a richer identity. We will describe the typical identity matching pipeline which is composed of 4 steps: (1) extraction of relevant attributes from structured and unstructured sources, (2) iterative identity enrichment of the input, (3) fuzzy matching of attribute pairs, (4) building a model to compute a match confidence using similarity and uniqueness.
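As a micro-example of step (3) above, here is a hedged sketch of fuzzy matching of attribute pairs with a simple string-similarity ratio from the Python standard library; production systems use richer features and learned match-confidence models.

# Fuzzy matching of attribute pairs with a simple string-similarity ratio.
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-1 similarity score between two attribute strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Jonathan A. Smith", "Jon Smith"),
    ("123 Main St, Apt 4", "123 Main Street #4"),
    ("jsmith@example.com", "j.smith@example.org"),
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {similarity(a, b):.2f}")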

Identity verification is the process of confirming that an online/digital identity accurately reflects the offline identity of the person who created it. The key insight we will dive deep into is verifying one piece of the online identity, and then applying coherence across various identity attributes to verify all other attributes of the online identity.

This session is geared towards product, data science, and engineering leaders who would like to introduce state-of-the-art machine-learning techniques to solve identity problems at their respective companies or fortify their existing solutions. Some familiarity with machine learning techniques is preferred, but not required. We will cover relevant background/fundamentals wherever necessary.

Instructor Bio

Coming Soon

Sukhada Palkar

Software Engineer at Airbnb

Workshop: Machine Learning for Digital Identity

There are tens of billions of online profiles today, each associated with some identity, on diverse platforms including social networks, online marketplaces, dating sites and financial institutions. Every platform needs to understand, validate and verify these identities.

The landscape of identity challenges, available data, and machine-learning technology have evolved over the years. However, identity still remains a notoriously hard problem. While we’ve made a lot of progress in academia and industry, there still are several unsolved problems. In this session, we will talk through three core, interconnected problems: (1) identity authentication/validation; (2) identity matching; (3) identity verification. We will discuss our work on effectively using machine learning technology to solve these problems, along with an analysis of popular techniques used on different platforms.

Identity authentication and validation ensures high-quality attributes, which affect all downstream identity processes. The challenge of identity authentication is determining whether an input identity/attribute is a valid value. While identity validation solutions need to be tailored to the attribute type, we will share some of the common techniques applicable across all attribute types: (1) canonicalizing attribute values, and then (2) looking them up against constructed datasets of the universe of all possible values. We will also discuss how some of these generic techniques are applied to the validation of two different types of attributes: names and government-issued IDs.

Identity matching is fundamental for two main applications: detecting duplicates and joining with other, often external, data sources to create a richer identity. We will describe the typical identity matching pipeline which is composed of 4 steps: (1) extraction of relevant attributes from structured and unstructured sources, (2) iterative identity enrichment of the input, (3) fuzzy matching of attribute pairs, (4) building a model to compute a match confidence using similarity and uniqueness.

Identity verification is the process of confirming that an online/digital identity accurately reflects the offline identity of the person who created it. The key insight we will dive deep into is verifying one piece of the online identity, and then applying coherence across various identity attributes to verify all other attributes of the online identity.

This session is geared towards product, data science, and engineering leaders who would like to introduce state-of-the-art machine-learning techniques to solve identity problems at their respective companies or fortify their existing solutions. Some familiarity with machine learning techniques is preferred, but not required. We will cover relevant background/fundamentals wherever necessary.

Instructor Bio

Coming Soon.

Anish Das Sarma, PhD

Engineering Manager at Airbnb

Workshop: Building an image search service from scratch

Many products fundamentally appeal to our perception. When browsing through outfits on clothing sites, looking for a vacation rental on Airbnb, or choosing a pet to adopt, the way something looks is often an important factor in our decision. The way we perceive things is a strong predictor of what kind of items we will like, and therefore a valuable quality to measure.

However, making computers understand images the way humans do has been a computer science challenge for quite some time. Since 2012, Deep Learning has slowly started overtaking classical methods such as Histograms of Oriented Gradients (HOG) in perception tasks like image classification or object detection. One of the main reasons often credited for this shift is deep learning’s ability to automatically extract meaningful representations when trained on a large enough dataset.

This is why many teams — like at Pinterest, StitchFix, and Flickr — started using Deep Learning to learn representations of their images, and provide recommendations based on the content users find visually pleasing. Similarly, Fellows at Insight have used deep learning to build models for applications such as helping people find cats to adopt, recommending sunglasses to buy, and searching for art styles.

Many recommendation systems are based on collaborative filtering: leveraging user correlations to make recommendations (“users that liked the items you have liked have also liked…”). However, these models require a significant amount of data to be accurate, and struggle to handle new items that have not yet been viewed by anyone. Item representation can be used in what’s called content-based recommendation systems, which do not suffer from the problem above.

In addition, these representations allow consumers to efficiently search photo libraries for images that are similar to the selfie they just took (querying by image), or for photos of particular items such as cars (querying by text). Common examples of this include Google Reverse Image Search, as well as Google Image Search.

Based on our experience providing technical mentorship for many semantic understanding projects, we are bringing a workshop to ODSC on how you would go about building your own representations, both for image and text data, and efficiently doing similarity search. By the end of this workshop, you should be able to build a quick semantic search model from scratch, no matter the size of your dataset.
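For a concrete starting point, here is a hedged sketch of the core idea: embed images with a pretrained CNN and retrieve visually similar items by nearest-neighbor search. It uses a stock Keras model rather than a metric-learned one, and the image paths are placeholders.

# Embed images with a pretrained CNN, then search by cosine distance.
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.neighbors import NearestNeighbors

model = MobileNetV2(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    # Turn one image file into a compact embedding vector.
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), 0))
    return model.predict(x)[0]

catalog = ["img/sofa1.jpg", "img/sofa2.jpg", "img/lamp1.jpg"]  # placeholder paths
vectors = np.stack([embed(p) for p in catalog])

index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
_, idx = index.kneighbors(embed("img/query.jpg").reshape(1, -1))
print([catalog[i] for i in idx[0]])   # the most visually similar catalog items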

Instructor Bio

Coming Soon

Matthew Rubashkin, PhD
AI Program Director

Workshop: Visual Elements of Data Science

“Above all, show the data” (Edward Tufte)

Data Visualization is fundamental not only to data exploration, but to addressing data science problems in general. It is a key technique in descriptive statistics (e.g., boxplots, histograms, distribution charts, heatmaps), diagnostics (e.g., scatterplots, Geiger counter charts, digital elevation models) and predictive layers (e.g., decision trees, artificial neural networks) of the data science stack. For example, visualization is a means to understand relationships between variables, to recognize patterns, to detect outliers and to break down complexity. Effective ways to describe and summarize data sets are also very helpful in communicating with clients and collaborators in a more quantitative and rational way. Therefore, implementing and utilizing data visualizations is a key skill that every data scientist must have in their repertoire.

While enterprises and businesses across industries are now widely using dashboards and other (often commercial) business intelligence software to generate data visualizations, data scientists usually still heavily rely on creating charts in scripting languages and other open source coding environments from scratch. This is because they need to not only explore raw data and data aggregates, but also review model outputs visually and prepare charts for presentations and publications. The currently most widely used tools include ggplot2, plotly and shiny (R); as well as matplotlib, Seaborn and Bokeh (python).

This session reviews key elements of the effective use of data visualizations in Data Science industry applications. These include (1) a narrative, or story to tell about the data, (2) simplicity, and (3) conciseness, achieved by balancing information and complexity and avoiding too much decoration (the aesthetics concept). It also addresses how to choose the right chart for a given data set, depending on context and the question at hand. What are some simple rules to follow for a good graphic, and which common errors need to be avoided? How do you know if your graph is accurately representing the underlying data set? This is particularly important for high-dimensional data sets and growing data volumes in the age of Big Data.

In this workshop, state-of-the-art scripts and packages in R and Python will be used to demonstrate how to plot heatmaps, time series charts, and network graphs, as well as representations and maps for geospatial data sets.
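As one small Python example of the chart types above, here is a hedged sketch of a correlation heatmap with pandas and seaborn on synthetic data; the column names and induced relationship are invented.

# A correlation heatmap on synthetic data with pandas and seaborn.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["temp", "rainfall", "traffic", "sales"])
df["sales"] += 0.8 * df["traffic"]  # induce one strong relationship

sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap (synthetic data)")
plt.show()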

Instructor Bio

Olaf Menzer is a Data Scientist in the Decision Analytics team at Pacific Life in Newport Beach, California. His focus areas are around enabling business process improvements and the generation of insights through data synthesis, the application of advanced analytics and technology more broadly. He is also a Visiting Researcher at the University of California, Santa Barbara, contributing to primary research articles and statistical applications in Ecosystem Science.

Prior to working at Pacific Life, Olaf was a Predictive Analyst at Ingram Micro, designing, implementing and testing sales forecasting models, lead generation engines and product recommendation algorithms for cross-selling millions of technology products. He also held different Research Assistant roles at the Lawrence Berkeley National Lab and the Max Planck Institute in Germany where he supported scientific computing, data analysis and machine learning applications.

Olaf was a speaker at the INFORMS Business Analytics conference in 2016, Predictive Analytics World in 2018 and at several academic conferences in the past. He received a M.Sc. in Bioinformatics from Friedrich Schiller University in Germany (2011), and a Ph.D. in Geographic Information Science from University of California, Santa Barbara (2015).

Olaf Menzer, PhD

Senior Data Scientist at Pacific Life

Workshop: Using Data Science for Good

AI for Earth has one simple but huge ambition – to fundamentally transform the way one monitors, models and manages Earth’s natural resources using AI. At the same time, deep learning innovations and breakthroughs are happening in both academia and industry at a breathtaking pace. By leveraging these innovations and breakthroughs, many AI for Earth grantees have been using AI to solve some of Earth’s toughest challenges, ranging from precision agriculture and precision conservation to understanding and protecting biodiversity, and more.

Join David Smith in this talk as he shares many of the exciting projects that AI for Earth has been working on, and how AI is used to amplify human ingenuity. Through the lens of these exciting use cases, you will also learn about cutting-edge AI trends and opportunities, and how you can get started with AI today.

Instructor Bio

David Smith is a Cloud Developer Advocate for Microsoft, specializing in the topics of artificial intelligence and machine learning. With a background in statistics and data science, he is the editor of the Revolutions blog (http://blog.revolutionanalytics.com) where he has written about applications of data science with a focus on the programming language “R” since 2009. He is also a co-author of the R manual “Introduction to R”, and a member of the board of directors for the R Consortium. He lives with his husband and two Jack Russell terriers in Chicago. Follow David on Twitter as @revodavid.

David Smith

Cloud Developer Advocate at Microsoft

Workshop: Are you sure you're an ethical data scientist? Build your ethical imagination

This 90-minute workshop reveals researchers’ hidden attribution biases and equips them with an ethical imagination to do data science better.

As an Ethics for Data Science instructor, I was struck by the disjunction between students’ excellent ability to find ethical gaps in other peoples’ projects and the blind spots they exhibited when critiquing their own work.

I developed a two-part, 90-minute workshop to reduce this good-intention bias. First, we interactively review three well-known cases, including the Facebook/Cambridge Analytica case. These lively cases demonstrate how the basic tenets of research ethics can be adapted for data science.

Next, teams of 3-5 are led through a 60-minute Predicting College Failure case. This case is drawn from reality: a college has purchased an algorithm to predict which students will leave before graduation and why. We address ethical questions that arise during data collection, data cleaning, model development, intervention design, and FERPA compliance.

We also explore questions such as: how accurate does a model have to be before deployment? Is it fair to give students who show signs of financial hardship more aid if it reduces the amount available for students who are not visibly struggling? How do we manage the risk that labeling students high risk for departure may become a self-fulfilling prophecy? How do we weigh the collective impact on the school’s retention capacity against the unique needs of individual students? Teams will produce a strategy for maximizing the social benefit of the intervention and minimizing negative impacts.

Instructor Bio

Laura Norén is a data science ethicist and researcher currently working in cybersecurity at Obsidian Security in Newport Beach. She holds undergraduate degrees from MIT, a PhD from NYU where she recently completed a postdoc in the Center for Data Science. Her work has been covered in The New York Times, Canada’s Globe and Mail, American Public Media’s Marketplace program, in numerous academic journals and international conferences. Dr. Norén is a champion of open source software and those who write it.

Laura Noren, PhD

Director of Research | Professor

Workshop: From the Presidential Campaign Trail to the Enterprise: Building Effective Data-Driven Teams

The 2012 Obama Campaign ran the first personalized presidential campaign in history. The data team was made up of people from diverse backgrounds who embraced data science in service of the goal. With seed funding from Eric Schmidt, Civis Analytics emerged from this team and today enables organizations to use the same methods outside politics. We will explore the changes Civis made—particularly for creating effective data-driven teams—that allowed it to continue to deliver the same caliber of work in a business setting. The company has applied many of these same lessons to bring data-driven decision making to some of the largest organizations in the country.

Working on a political campaign is not that different from working in a consumer-focused organization. The pressure is high, the timelines are tight, and there are often shifting priorities. The data and the modeling need to happen on a national scale. Noisy data comes in rapidly and needs to be assimilated into existing models and simulations. The best methods to use are often ambiguous at best. The analytics focus is on the actions of the individual. To be effective in this environment, a data scientist must use a myriad of technologies, and these technologies may need to serve different needs than those of the engineering team, such as enabling high-throughput writes or helping analysts serve dashboards.

This environment heavily informed the technology stack in the early days of Civis. We’ll start this session by presenting on work we did in our early days for a national healthcare non-profit. After the passage of the Affordable Care Act, this non-profit needed to run a national-level campaign to inform people about and, ultimately, influence them to sign up for healthcare. We will look at how we solved all aspects of this problem from data collection to modeling, message testing, and consumer outreach.

As we grew following this work, we found that our technical solutions and processes didn’t scale. The rest of this session will focus on the lessons we learned to allow us to continue to deliver the same caliber of work. We’ve applied many of these same lessons to bring data-driven decision making to some of the largest organizations in the country. Enabling effective data-driven teams starts with building trust around the process — from the team itself to the decision-makers and the IT team that, in the end, controls access to the data — while also building efficiencies in the team and the organization.

Attendees will walk away understanding

  • how to build trust within an organization around data-driven decision making,

  • how to enable existing data science teams to provide incremental and sustained value to the business,

  • how to make sure that these efforts continue at an institutional level, and

  • that some aspects of data-driven decision making are solved with technology and some with social and cultural changes.

If you are a data scientist wanting to understand how to have a larger impact on your organization, or a decision-maker wanting to understand how to elicit sustained value from your data science team, this session is for you.

Instructor Bio

Katie Malone is Director of Data Science at Civis Analytics, a data science software and services company. At Civis she leads the Data Science Research and Development department, which tackles some of Civis’ most novel and challenging data science consulting engagements as well as writing the core data science code that powers the Civis Data Science Platform. A physicist by training, Katie spent her PhD searching for the Higgs boson at CERN and is also the instructor for Udacity’s Introduction to Machine Learning course. As a side project she hosts a weekly podcast about data science and machine learning, Linear Digressions.

Katie Malone

Director of Data Science R&D

Workshop: Deep Learning on Mobile

Over the last few years, convolutional neural networks (CNN) have risen in popularity, especially in the area of computer vision. Many mobile applications running on smartphones and wearable devices would potentially benefit from the new opportunities enabled by deep learning techniques. However, CNNs are by nature computationally and memory intensive, making them challenging to deploy on a mobile device.

This workshop explains how to practically bring the power of convolutional neural networks and deep learning to memory- and power-constrained devices like smartphones. You will learn various strategies to circumvent obstacles and build mobile-friendly shallow CNN architectures that significantly reduce the memory footprint and therefore make models easier to store on a smartphone. The workshop also dives into how to use a family of model compression techniques to prune the network size for live image processing, enabling you to build a CNN version optimized for inference on mobile devices. Along the way, you will learn practical strategies to preprocess your data in a manner that makes the models more efficient in the real world.

Following a step by step example of building an iOS deep learning app, we will discuss tips and tricks, speed and accuracy trade-offs, and benchmarks on different hardware to demonstrate how to get started developing your own deep learning application suitable for deployment on storage- and power-constrained mobile devices. You can also apply similar techniques to make deep neural nets more efficient when deploying in a regular cloud-based production environment, thus reducing the number of GPUs required and optimizing on cost.
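To make one of the compression ideas above concrete, here is a hedged illustration of magnitude-based weight pruning on a stand-in weight matrix; real mobile pipelines use dedicated toolchains for pruning and quantization, but the intuition is the same.

# Magnitude-based pruning: zero out the smallest weights of a trained layer.
import numpy as np

def prune_weights(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights until `sparsity` of them are 0."""
    threshold = np.quantile(np.abs(weights).ravel(), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(0)
layer = rng.normal(size=(256, 128))          # stand-in for a dense layer's kernel
pruned = prune_weights(layer, sparsity=0.8)

print("non-zero before:", np.count_nonzero(layer))
print("non-zero after: ", np.count_nonzero(pruned))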

Instructor Bio

Coming Soon

Anirudh Koul

Head of AI & Research

Workshop: Intro to Technical Financial Evaluation with R

In this entry level workshop you will learn how to download and evaluate equities with the TTR (technical trading rules) package. We will evaluate an equity according to three basic indicators and introduce you to backtesting for more sophisticated analyses on your own. Next we will model a financial market’s risk versus reward to identify the best possible individual investments in the market. Lastly, we will explore a non-traditional market, simulate the reward in the market and put our findings to an actual test in a highly speculative environment.

Instructor Bio

Ted works with the innovation group at Liberty Mutual Insurance on the overall vision and strategy for next-generation vehicles, including self-driving cars. He leverages his data science expertise and creativity to cultivate internal thought leadership and external partnerships with startups related to vehicles.

Ted Kwartler

Data Scientist

Workshop: Enough data engineering for a Data Scientist - “How I Learned to Stop Worrying and Love the Data Scientists”

So how much data engineering should a Data Scientist know? For a Data Scientist to get to the fun part of their job, they normally have to do a bit of data engineering first: onboarding data and doing a little bit of “wrangling” before they get to the fun part, the Data Science! In most cases this takes 50%-80% of their time.

Then comes handing it over to the Data Engineering team to put it into production (of course via dev, test, and QA). This is when a “little bit” of contention happens, because in most cases the Data Engineering team will have to do “some” modification/re-write/head shaking/hand wringing to get the code to be production ready and meet the SLAs defined by the business. There is a disconnect in how Data Scientists and Data Engineers develop code and models (I get a front-row seat to this all the time). In this talk I’ll take the Data Scientist on a journey: from onboarding data, and how different data/object stores can help; understanding and choosing the right data format for your data assets; exploring some different query engines, with some basic query tuning for each; explaining how a distributed streaming platform works, and how you can take advantage of it; and lastly covering some good coding practices. This will give the Data Scientist new skills to help them be more productive, so they can get to the fun part faster! Plus it will reduce the contention with the Data Engineering team, and make them say, “How I Learned to Stop Worrying and Love the Data Scientists”!
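As one concrete example of the “right data format” point, here is a hedged sketch of writing a pandas dataframe to columnar Parquet instead of CSV and reading back only the columns needed; it assumes pyarrow (or fastparquet) is installed, and the file name is a placeholder.

# Columnar Parquet instead of CSV: smaller on disk, faster analytical reads.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": np.arange(1_000_000),
    "event": np.random.choice(["view", "click", "buy"], size=1_000_000),
    "value": np.random.rand(1_000_000),
})

df.to_parquet("events.parquet", index=False)   # columnar and compressed

# Read back only the columns this query actually needs.
subset = pd.read_parquet("events.parquet", columns=["event", "value"])
print(subset.groupby("event")["value"].mean())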

Instructor Bio

Stephen O’Sullivan is the owner of Data Whisperers. He is an expert in data architecture, infrastructure, and technical operations. Mr. O’Sullivan has deep experience in Hadoop usage and architecture and cutting-edge open source solutions for Big Data. He brings more than 25 years of experience creating enterprise applications and data management solutions for high availability and scale to his current position.

Prior to Data Whisperers, Mr. O’Sullivan was VP of Engineering at Silicon Valley Data Science, where he led the data engineering team in helping SVDS clients become data-driven and achieve their business goals using data. Prior to SVDS, he created and led the next-generation data platform team at Walmart Labs as a senior director. He and his team architected and designed the data platform that will be used by all of Walmart’s e-commerce business units. At Walmart Labs he spent time evaluating big data / database / datastore / data management vendors, from big-name companies to stealth startups, as to how they would be used within Walmart’s eCommerce and store infrastructure. Mr. O’Sullivan evaluated, made recommendations, and built solutions to address Walmart’s needs in security, high availability, scalability, and performance.

Stephen O’Sullivan

Founder

Workshop: The Big Reveal: How visualization can unearth the secrets of data

What’s the relationship between humans and computer-recorded numbers? When we talk about data, data is not just numbers; behind every single number is human behavior. In my talk, I try to answer the following questions: How can data visualization design help us see the unseen faces of culture? How can we use open data to explore the differences between design and art colleges around the world? And how can we use data visualization to unlock the secrets of winning art awards? I will cover the whole process, from collecting the data to processing it.

Instructor Bio

Ying He currently works for TCG (The Creative Group) and was previously at Pershing of BNY Mellon and The Metropolitan Museum of Art. She holds a master’s degree from ITP, Tisch School of the Arts. Her research interests include visual representation, creative design, and cultural hacking, such as artistic data visualization, mapping, and interactive experience. She is fascinated by synthesizing graphic design and technology as a telescope into unseen reality.

Ying He

Computational Designer at TCG (The Creative Group)

Workshop: Scaling Interactive Data Science and AI with Ray

The next generation of AI applications will continuously interact with the environment and learn from these interactions. To develop these applications, data scientists and engineers will need to seamlessly scale their work from running interactively to production clusters. In this talk we introduce Ray, a high-performance distributed execution engine, and its libraries for data science and AI development. We cover each Ray library in turn, and also show how the Ray API allows these traditionally separate workflows to be composed and run together as one distributed application.

Ray is an open source project being developed at the RISE Lab in UC Berkeley for interactive data processing, scalable hyperparameter optimization, distributed deep learning, and reinforcement learning. We focus on the following libraries in this tutorial:

MODIN: With Modin, you can make your Pandas workflows faster by changing only a single line of code. Modin uses Ray to provide interactive analysis on multi-core machines (e.g., your laptop), and also scale to large clusters.
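A minimal, hedged sketch of that one-line change is below; the CSV path and column names are placeholders.

# The only code change is the import; pandas operations then run in parallel.
import modin.pandas as pd   # instead of `import pandas as pd`

df = pd.read_csv("data/big_file.csv")           # hypothetical file
summary = df.groupby("category")["amount"].sum()
print(summary.head())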

TUNE: Tune is a scalable hyperparameter optimization framework for reinforcement learning and deep learning. Go from running one experiment on a single machine to running on a large cluster with efficient search algorithms without changing your code. Unlike existing hyperparameter search frameworks, Tune targets long-running, compute-intensive training jobs that may take many hours or days to complete, and includes many resource-efficient algorithms designed for this setting.

RLLIB: RLlib is an open-source library for reinforcement learning that offers both a collection of reference algorithms and scalable primitives for composing new ones. In this tutorial we discuss using RLlib to tackle both classic benchmark and applied problems, RLlib’s primitives for scalable RL, and how RL workflows can be integrated with data processing and hyperparameter optimization.

Instructor Bio

Richard Liaw is a PhD student in BAIR/RISELab at UC Berkeley working with Joseph Gonzalez, Ion Stoica, and Ken Goldberg. He has worked on a variety of different areas, ranging from robotics to reinforcement learning to distributed systems. He is currently actively working on Ray, a distributed execution engine for AI applications; RLlib, a scalable reinforcement learning library; and Tune, a distributed framework for model training.

Richard Liaw
AI Researcher

Workshop: Making Data Science: AIG, Amazon, Albertsons

Developing an internal data science capability requires a cultural shift, a strategic mapping process that aligns with existing business objectives, a technical infrastructure that can host new processes, and an organizational structure that can alter business practice to create measurable impact on business functions. This workshop will take you through ways to consider the vast opportunities for data science, to identify and prioritize what will add the most value to your organization, and then to budget and hire against those commitments. Learn the most effective ways to establish data science objectives from a business perspective, including recruiting, retention, goal setting, and improving the business.

Instructor Bio

Haftan Eckholdt, PhD, is Chief Data Science Officer at Plated. His career began with research professorships in Neuroscience, Neurology, and Psychiatry, followed by industrial research appointments at companies like Amazon and AIG. He holds graduate degrees in Biostatistics and Developmental Psychology from Columbia and Cornell Universities. In his spare time he thinks about things like chess and cooking and cross country skiing and jogging and reading. When things get really, really busy, he actually plays chess and cooks delicious meals and jogs a lot. Born and raised in Baltimore, Haftan has been a resident of Kings County, New York since the late 1900’s.

Haftan Eckholdt, PhD

Chief Data Science Officer

Workshop: Making Data Great Again

The advent of big data means that the time has come for change in the way in which we collect and use data on human beings. However, that change needs to be effected in a thoughtful, careful way so that we don’t jump out of the frying pan into the fire.

There is enormous potential to use such data to improve decision making at all levels of government. The barriers are complex but at their core stem from (i) a lack of local capacity to access and use data and (ii) a lack of evidence of the value. Much can be gained when local stakeholders develop use cases and create value from data sources specific to a jurisdiction. The combination of human and technical approaches is critical to success.

We have been developing a multi-step approach to foster data-driven decision making in regional development efforts so that locally driven efforts can grow into a robust, scalable system. Each phase serves the dual purposes of (a) building local capacity while (b) developing useful data and analytic products.
The analytics training programs are delivered in a secure remote access environment. They include a mix of targeted introductory material appropriate for a wide audience and tailored sessions specific to jurisdictions. The focus is on facilitating the creation of national standards from the bottom up, directly via (i) a secure, analytics computing platform in which the underlying code and data itself can be shared and evaluated, (ii) conferences and workshops to convene key stakeholders across the community, and (iii) ongoing support to provide continuity of methodologies across jurisdictions.

Instructor Bio

Julia Lane is a Professor in the Wagner School of Public Policy at New York University. She is also a Provostial Fellow in Innovation Analytics and a Professor in the Center for Urban Science and Policy at NYU. Dr. Lane is an economist and has authored over 65 refereed articles and edited or authored seven books. She has been working with a number of national governments to document the results of their science investments. Her work has been featured in several publications including Science and Nature. Dr. Lane started at the National Science Foundation (as Senior Program Director of the Science of Science and Innovation Policy Program) to quantify the results of federal stimulus spending, which is the basis of the new Institute for Research on Innovation and Science at the University of Michigan. Dr. Lane has had leadership positions in a number of policy and data science initiatives at her other previous appointments, which include Senior Managing Economist at the American Institutes for Research; Senior Vice President and Director, Economics Department at NORC/University of Chicago; various consultancy roles at The World Bank; and Assistant, Associate and Full Professor at American University. Dr. Lane received her PhD in Economics and Master’s in Statistics from the University of Missouri.

Julia Lane, PhD
Professor

Workshop: MacroBase: Prioritizing Human Attention in Big Data

MacroBase is a new analytic monitoring engine designed to prioritize human attention in large-scale datasets and data streams. Unlike a traditional analytics engine, MacroBase is specialized for one task: finding and explaining unusual or interesting trends in data. With its unique feature selection functionality, MacroBase has found and explained the cause of previously unknown behaviors in several domains, including online services, mobile devices, user analytics, automotive telemetry, and manufacturing.

In this workshop, we’ll describe how to write MacroBase queries and analyze your own data using MacroBase SQL, an extension of SQL that incorporates our new MacroBase operators. We’ll also show how to query large-scale datasets using MacroBase SQL on Spark, our distributed version of MacroBase that integrates seamlessly with other Spark APIs. We’ll provide sample datasets to play with, but we highly encourage you to bring your own datasets to analyze with MacroBase!

MacroBase is an ongoing research project in the Stanford FutureData Systems Group and the Stanford DAWN Project—for more information, check out https://macrobase.stanford.edu/. Installation instructions can be found at https://macrobase.stanford.edu/docs/sql/setup/.
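To give a feel for what MacroBase automates, here is a minimal pandas sketch of the underlying idea (this is not MacroBase’s actual API; the column names, data, and threshold are hypothetical): flag unusual records, then rank attribute values by how much more often they occur among outliers than inliers.

```python
import pandas as pd

# Hypothetical telemetry: each row carries attributes plus a latency measurement.
df = pd.DataFrame({
    "app_version": ["v1", "v1", "v2", "v2", "v2", "v1"],
    "device":      ["ios", "android", "ios", "ios", "ios", "android"],
    "latency_ms":  [40, 55, 900, 870, 910, 60],
})

# Crude outlier rule for illustration; MacroBase supports smarter,
# percentile-based outlier classifiers.
df["outlier"] = df["latency_ms"] > 500

# Rank each attribute value by how over-represented it is among outliers.
for col in ["app_version", "device"]:
    out_share = df.loc[df["outlier"], col].value_counts(normalize=True)
    in_share = df.loc[~df["outlier"], col].value_counts(normalize=True)
    risk_ratio = (out_share / in_share).dropna().sort_values(ascending=False)
    print(col, risk_ratio.to_dict())
```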

Instructor Bio

Coming Soon

Firas Abuzaid
Ph.D. Researcher at Stanford University

Workshop: Data visualization in the web setting with a focus on D3

The D3 JavaScript library utilizes standard web technologies to facilitate interactive data visualization in the browser. This session will cover the principles behind D3 and will use examples to introduce core ideas and concepts in the library. It will also highlight some of the differences between versions as the library has evolved to its current state and discuss how D3 fits into the landscape of data visualization tools and web frameworks.

At its core, D3 establishes a connection between the data behind a visualization and the graphical elements shown on screen. By providing methods to manipulate elements at a low level, it allows every aspect of a visualization to be customized using standard web technologies. Selections, meanwhile, allow users to modify large groups of elements at once based on the data items they are bound to. Another important feature of D3 is support for interactive visualizations, which can help show new data being added or existing data being filtered; D3 provides straightforward methods to transition between these states. To make the most of this session, attendees should be familiar with JavaScript.

Instructor Bio

Coming Soon

David Koop, PhD
Assistant Professor

Workshop: The Big Reveal: How visualization can unearth the secrets of data

What is the relationship between humans and computer-recorded numbers? Data is not just numbers; behind every single number is human behavior. In this talk, I try to answer the following questions: How can data visualization design help us see the unseen faces of culture? How can we use open data to explore the differences between design and art colleges around the world? And how can we use data visualization to unlock the secrets of winning art awards? I will cover the entire process, from collecting the data to processing it.

Instructor Bio

Ying He is a computational designer who currently works for Pershing, a BNY Mellon company, and was previously at The Metropolitan Museum of Art. She holds a master’s degree from ITP, Tisch School of the Arts. Her research interests include visual representation, creative design, and cultural hacking such as artistic data visualization, mapping, and interactive experience. She is fascinated by synthesizing graphic design and technology as a telescope into unseen reality.

Ying He

Computational Designer at Pershing, BNY Mellon

Workshop: Balancing ML accuracy, interpretability and costs when building a model

As data scientists we strive to deliver high-performing models, but in the real world the best model possible is not usually the best model for the business. If a model is not interpretable by the business, you will be unable to get the buy-in necessary to move it into production. Additionally, you are always fighting two cost-related battles: the opportunity cost of delivering a perfect model tomorrow instead of a good one today, and the operational cost of the most sophisticated model compared to the next best one. This workshop will use real-world coding examples in Python to demonstrate how to be mindful of these constraints when developing your models.
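As a rough illustration of this trade-off (not the workshop’s actual materials; the dataset and models are stand-ins), the scikit-learn sketch below compares an interpretable baseline against a more complex model on both accuracy and training time:

```python
import time
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic regression (interpretable baseline)": LogisticRegression(max_iter=5000),
    "gradient boosting (more complex)": GradientBoostingClassifier(),
}

for name, model in models.items():
    start = time.time()
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    elapsed = time.time() - start
    # The "right" model is the one whose accuracy gain justifies its added
    # opacity and operational cost, not necessarily the top scorer.
    print(f"{name}: accuracy={accuracy:.3f}, time={elapsed:.2f}s")
```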

Instructor Bio

Marc Fridson is the Principal Data Scientist of Cross Brand Digital at Carnival Cruise Line, a part-time lecturer in the Applied Analytics Master’s Program at Columbia University, and the founder of tech start-up Instant Analytics.

Marc has previously worked as a Technology Consultant for Accenture, as an Engineer for the Boeing Company, AVP of Metrics and Reporting for Capital One, and as Manager of Analytics at CB Richard Ellis for JP Morgan Chase’s Real Estate Management. Previous consulting clients include: Morgan Stanley, Capital One, The College Board, Anthem Blue Cross, Verizon and Time Warner Cable.

He has helped these companies measure, analyze, and automate their processes through data analysis and by developing technological tools to enable process improvement/automation.

He holds a B.S. in Industrial and Systems Engineering from Rutgers University.

Marc Fridson
Principal Data Scientist at Carnival

Workshop: Automating Trend Discovery on Streaming Datasets with Spark 2.3

In this session we will start off with a deep dive into effective data modeling and continue on to explore methods for automatically surfacing unique and interesting patterns in your data, all using Spark SQL. More importantly, we will discuss how to do this in batch mode, and then how to easily migrate from batch to streaming using Spark Structured Streaming.

We will walk through techniques for reducing the memory footprint of statistical aggregations, giving you the ability to more efficiently scale out your systems to handle many millions of records (all in memory) while maintaining a relatively small footprint, all via the use of data sketching. The idea here is to leverage quantile sketches to auto-analyze the change in shape and behavior of seemingly disparate datasets and to find common dimensions (features) across many different metrics.

We will also go over how to handle common serialization problems with respect to the storage and retrieval of partially aggregated data when updating your streaming applications. Lastly, we will finish off by talking about how to use windowed statistical aggregations and rollups to automatically detect trends in your data while also handling the dreaded issue of data seasonality.

This session will cover best practices and patterns for writing streaming applications with Apache Spark 2.3, including how to write effective unit tests to ensure your applications can handle live updates in production. A working application will be made available at the start of the presentation. Knowledge of Spark and Scala is a must in order to take full advantage of this information.
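The session itself assumes Scala, but as a rough PySpark analogue (column names and the toy data below are hypothetical), a windowed statistical rollup looks like the following; the same groupBy/agg pattern carries over from batch to Structured Streaming once a streaming source and watermark are attached:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trend-rollups").getOrCreate()

# Hypothetical events: (timestamp, metric name, value).
events = spark.createDataFrame(
    [("2018-10-31 10:01:00", "latency_ms", 42.0),
     ("2018-10-31 10:04:00", "latency_ms", 55.0),
     ("2018-10-31 10:12:00", "latency_ms", 480.0)],
    ["ts", "metric", "value"],
).withColumn("ts", F.to_timestamp("ts"))

# Windowed statistical rollup: the same groupBy/agg works on a streaming
# DataFrame (with a watermark) under Structured Streaming.
rollup = (events
          .groupBy(F.window("ts", "10 minutes"), "metric")
          .agg(F.avg("value").alias("mean"),
               F.stddev("value").alias("stddev"),
               F.expr("percentile_approx(value, 0.95)").alias("p95")))

rollup.show(truncate=False)
```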

Instructor Bio

Scott Haines is a Principal Software Engineer / Tech Lead on the Voice Insights team at Twilio. His focus has been on the architecture and development of a real-time (sub-250ms), highly available, trustworthy analytics system. His team provides near-real-time analytics that processes, aggregates, and analyzes multiple terabytes of global sensor data daily. Scott helped drive Apache Spark adoption at Twilio and actively teaches and consults for teams internally. Previously, Scott worked at Yahoo!, where he built a real-time recommendation engine and targeted ranking/ratings analytics that helped serve personalized page content for millions of customers of Yahoo Games. He also built a real-time click/install tracking system that helped deliver customized push marketing and ad attribution for Yahoo Sports, and he finished his tenure at Yahoo working for Flurry Analytics, where he wrote an auto-regressive smart alerting and notification system integrated into the Flurry mobile app for iOS and Android.

Scott Haines
Principal Software Engineer at Twilio

Workshop: pomegranate: Fast and Flexible Probabilistic Modeling in Python

Pomegranate is a Python package for fast and flexible probabilistic modeling. The basic unit is the probability distribution, which can be combined into compositional models such as hidden Markov models, mixture models, and Bayesian networks. These more complicated models can themselves be used as components of larger models, such as a mixture of Bayesian networks, or a Bayes classifier of hidden Markov models for classifying sentences instead of fixed feature sets. This format for specifying models is augmented by a variety of sophisticated training strategies, such as multi-threaded parallelism, GPU support, semi-supervised learning, support for missing values, mini-batch learning, out-of-core learning for massive data sets, and any combination of the above. This tutorial will give a high-level overview of the features of pomegranate, the design rationale, a brief comparison to other packages, and applications to practical examples.
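As a minimal sketch of the API (assuming the pomegranate 0.x releases current at the time of this workshop; later versions changed the interface), fitting a two-component Gaussian mixture directly from samples looks like this:

```python
import numpy as np
from pomegranate import GeneralMixtureModel, NormalDistribution

# Toy 1-D data drawn from two overlapping Gaussian clusters.
X = np.concatenate([np.random.normal(0, 1, (500, 1)),
                    np.random.normal(6, 1, (500, 1))])

# Fit a two-component Gaussian mixture from samples; the same from_samples
# pattern applies to HMMs, Bayesian networks, and Bayes classifiers.
model = GeneralMixtureModel.from_samples(NormalDistribution, 2, X)

print(model.predict(X[:5]))        # hard component assignments
print(model.predict_proba(X[:5]))  # posterior probability per component
```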

Instructor Bio

Jacob Schreiber is a fifth-year Ph.D. student and NSF IGERT big data fellow in the Computer Science and Engineering department at the University of Washington. His primary research focus is the application of machine learning methods, primarily deep learning, to the massive amounts of data being generated in the field of genome science. His research projects have involved using convolutional neural networks to predict the three-dimensional structure of the genome and using deep tensor factorization to learn a latent representation of the human epigenome. He routinely contributes to the Python open source community, currently as the core developer of the pomegranate package for flexible probabilistic modeling, and in the past as a developer for the scikit-learn project. Future projects include graduating.

Jacob Schreiber
PhD Candidate at University of Washington

Workshop: Mastering A/B Testing: From Design to Analysis

Co-presenter – Guillaume Saint-Jacques

Embark on a journey into the realm of experimentation. Learn how to run scientific experiments, synthesize insights from data and deliver impactful recommendations. There is no doubt that experimentation, or A/B testing, has become a driving force of innovation in the online world. In this workshop, you will advance your understanding of both theoretical and practical issues in A/B testing. You will learn about cutting-edge research, novel approaches to A/B testing, and will be able to apply them. Through the workshop, you will master new ways to design and analyze experiments and maximize your data-driven impact.

You will learn to:
* Build the foundation of an experiment: design of experiments
* Ensure the trustworthiness of A/B reports
* Ramp experiments while balancing speed, quality, and risk
* Amplify insights with variance reduction (a minimal sketch follows below)
* and more…
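As a taste of the variance-reduction item above, here is a minimal numpy sketch of a CUPED-style covariate adjustment on simulated data (not the instructors’ actual materials; the data-generating process is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated experiment: pre-period engagement (x) predicts in-experiment
# engagement (y); treatment adds a small lift on top.
n = 10_000
x = rng.normal(10, 3, n)          # pre-experiment covariate
treat = rng.integers(0, 2, n)     # random assignment to control/treatment
y = 0.8 * x + rng.normal(0, 1, n) + 0.1 * treat

# CUPED-style adjustment: remove the part of y explained by the covariate.
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_adj = y - theta * (x - x.mean())

for label, metric in [("raw", y), ("CUPED-adjusted", y_adj)]:
    lift = metric[treat == 1].mean() - metric[treat == 0].mean()
    se = np.sqrt(metric[treat == 1].var(ddof=1) / (treat == 1).sum()
                 + metric[treat == 0].var(ddof=1) / (treat == 0).sum())
    print(f"{label}: estimated lift={lift:.3f}, std. error={se:.3f}")
```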

Instructor Bios

Iavor Bojinov received his Ph.D. in Statistics from Harvard University in 2018. His research advances both theory and methodology in social science applications, with a particular focus on design and analysis of experiments in complex settings, observational studies, and missing data. He has recently published articles in the Journal of the American Statistical Association, Biometrika, and Sociological Methods & Research.

Guillaume Saint-Jacques received his PhD from MIT Sloan and specializes in economics of technology as well as inference on social networks. At LinkedIn, he designs and implements experimental methodologies that the company uses to measure the network effects of new features and interventions. Guillaume also focuses on the computational social science aspects of fairness and inequality research at LinkedIn.

Iavor Bojinov

Software Engineer

Workshop: How LinkedIn navigates Streams Infrastructure using Cruise Control

Kafka has been widely adopted as the data streaming backbone of many leading companies. However, this growing adoption has introduced new challenges. In particular, growing cluster sizes, increasing volume and diversity of user traffic, and the age of network and server components have led to significant management overhead. Getting near-optimal performance from such an infrastructure service, maintaining its availability in the face of cascading failures, and achieving these objectives with minimal overhead are critical but non-trivial tasks. Hence, human intervention alone tends to be insufficient to provide both reactive and proactive mitigation measures.

In this talk, we will share our work and experiences towards alleviating the management overhead of large-scale Kafka clusters using Cruise Control at LinkedIn. The talk will consist of three parts: The first part will provide an overview of Cruise Control, including the operational challenges that it solves, its high-level architecture, and some evaluation results from real-world scenarios. The second part will go through a hands-on tutorial to demonstrate how we can manage a real Kafka cluster using Cruise Control. Finally, we will also have time for a Q&A session at the end of the discussion.

Instructor Bio

Adem Efe Gencer received his PhD in Computer Science from Cornell University in 2017. His PhD research focused on improving the scalability of blockchain technologies. The protocols introduced in his research (e.g., Bitcoin-NG and Aspen) were adopted by Waves Platform, Aeternity, Cypherium, Enecuum, Ergo Platform, and Legalthings, and are actively being incorporated into other systems.

Efe develops Apache Kafka and the ecosystem around it, and supports their operation at LinkedIn. In particular, he works on the design, development, and maintenance of Cruise Control, a system for alleviating the management overhead of large-scale Kafka clusters at LinkedIn.

Adem Efe Gencer
Software Engineer

Workshop: Image Recognition Primer: ImageNet AlexNet to Mask R-CNN, R-CNN and Fast R-CNN

This Image Recognition Primer, from AlexNet to Fast R-CNN and Mask R-CNN applications, aims to bring together researchers in the fields of deep learning and image recognition. The workshop will address recent advances in theory, methodologies, and related applications, and is meant to promote the advancement, discussion, and presentation of new research and development in deep learning and image recognition. Of special interest is how image recognition has evolved from AlexNet to advanced convolutional architectures.

Instructor Bio

Bhairav Mehta is a Senior Data Scientist at Apple Inc. with extensive professional experience and a strong academic background.

Bhairav is an experienced engineer, business professional, and seasoned statistician/programmer with 19 years of combined progressive experience spanning data science in the consumer electronics industry (7 years at Apple Inc.), yield engineering in semiconductor manufacturing (6 years at Qualcomm and an MIT startup), and quality engineering in the automotive industry (OEM and Tier 2 suppliers, Ford Motor Company; 3 years). In 2014 he founded DataInquest Inc., a start-up specializing in training and consulting in artificial intelligence, machine learning, blockchain, and data science.

Bhairav holds an MBA from the Johnson School of Management at Cornell University, a master’s in Computer Science from Georgia Tech (expected 2018), a master’s in Statistics from Cornell University, a master’s in Industrial Systems Engineering from the Rochester Institute of Technology, and a BS in Production Engineering from Mumbai University.

Bhairav Mehta
Data Science Manager

Workshop: Coming soon

Coming soon

Instructor Bio

Nathaniel earned his AB/SM in Computer Science from Harvard. He previously worked as a Quant and Trader at Jane Street and Goldman Sachs before transitioning into the pure tech industry. Nathaniel worked as a Data Scientist at Facebook, a Product Manager at Microsoft and a Software Engineer at Google before joining Vicarious. He is an avid reader and learner. He teaches part time at General Assembly and is developing open source teaching material for data science, machine learning, and web development.

Nathaniel Tucker
Lead Instructor, Data Science and Analytics at General Assembly

Workshop: Coming soon

Coming soon

Instructor Bio

At Metis, Andrew has taught the fundamentals of machine learning and data science in a 3-month bootcamp to over 100 students and advised nearly 500 student projects. Andrew came to Metis from LinkedIn, where he worked as a Data Scientist on the Education, Skills, and then NLP teams. He is passionate about helping people make rational decisions and building cool data products. Prior to that he worked on fraud modeling at IMVU (the lean startup) and studied applied physics at Cornell. He loves snowboarding, traveling, scotch, and reading about all kinds of nerdy topics.

Andrew Blevins
Data Science Instructor at Metis

Workshop: Knowledge Graphs: A lingua franca for humans and machines?

Knowledge graphs have recently emerged as a powerful way to represent knowledge in multiple communities, including data mining, natural language processing and machine learning. Large-scale knowledge graphs like Wikidata and DBpedia are openly available, while in industry, the Google Knowledge Graph is a good example of proprietary knowledge that continues to fuel impressive advances in Google’s semantic search capabilities. Knowledge graphs are also intuitive, and it is possible to understand the basic concepts underlying a knowledge graph without much technical background. This workshop will cover knowledge graphs from a broad perspective. Starting from plain English documents, we will construct simple knowledge graphs both by hand (at a small scale) and using NLP techniques like named entity recognition (at a larger scale). We will also look at the data to see why automatically constructing knowledge graphs is problematic, and consider practical techniques for ‘refining’ the knowledge graph further. We’ll close the workshop by showing that, even when noisy, knowledge graphs can be useful in a data science pipeline for deriving and representing knowledge from human-produced text.
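As a small, hedged illustration of the “NLP techniques at a larger scale” step (the workshop’s actual tooling is not specified here), one could extract entities with spaCy and link sentence-level co-occurrences in a networkx graph; co-occurrence edges are a crude stand-in for real relation extraction and would need the ‘refining’ discussed above:

```python
import itertools
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline with NER

text = ("Wikidata and DBpedia are large open knowledge graphs. "
        "Google uses its Knowledge Graph to power semantic search.")

G = nx.Graph()
for sent in nlp(text).sents:
    entities = [ent.text for ent in sent.ents]
    # Link entities that co-occur in a sentence; real pipelines would refine
    # these noisy edges into typed relations.
    for a, b in itertools.combinations(entities, 2):
        G.add_edge(a, b, relation="co-occurs_with")

print(G.nodes())
print(G.edges(data=True))
```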

Instructor Bio

Mayank Kejriwal is a research scientist and lecturer at the University of Southern California’s Information Sciences Institute (ISI). He received his Ph.D. from the University of Texas at Austin. His dissertation involved Web-scale data linking, and in addition to being published as a book, was recently recognized with an international Best Dissertation award in his field. His research is highly applied and sits at the intersection of knowledge graphs, social networks, Web semantics, network science, data integration and AI for social good. He has contributed to systems that are being used by both DARPA and by law enforcement, and he has active collaborations in both academia and industry. He is currently co-authoring a textbook on knowledge graphs (MIT Press, 2018), and has delivered tutorials and demonstrations at numerous conferences and venues, including KDD, AAAI, and ISWC.

Mayank Kejriwal, Ph.D

Research Scientist and Lecturer at the University of Southern California’s Information Sciences Institute (ISI)

Workshop: Topic Modeling Practices And Its Value For Businesses

As we are all aware, we have more data than ever before, and with the amount growing every day we work with many different types of it. Data can be broadly categorized into three types: numbers, text, and images. But not everything is as easy and straightforward to work with as numbers; everything has to be converted into numbers before we can work with it. In this session, we will discuss “Topic Modeling Practices and Its Value for Businesses.”

Text data can be very unstructured and may include a lot of noise. Even with current algorithms, successful analysis requires cleaning and preparing the data, and with text in particular we need to make sure we understand the algorithms and their results, and how to use them to bring value to our companies with feasibility and innovation in mind.

In this workshop, we will start by talking about the different types of text data a company may have. Then we will cover Term Frequency–Inverse Document Frequency (TF-IDF), Bag of Words (BOW), Word2vec, Latent Dirichlet Allocation (LDA), and more…

At the end, we will see a real example of cleaning and structuring text data, run a topic modeling algorithm on it, and discuss the results and their value to the business. We will also talk about the pros and cons of the approach and how to bring value to your business and job with text data. If you want to know how to take advantage of and analyze your text data, this may be the session for you.
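For a concrete sense of the LDA step, here is a minimal scikit-learn sketch on a toy corpus (the documents and parameters are made up for illustration; newer scikit-learn releases rename get_feature_names to get_feature_names_out):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for a company's text data (tickets, reviews, ...).
docs = [
    "the delivery was late and the package arrived damaged",
    "late shipment and damaged box, very slow delivery",
    "great customer service, the support agent was helpful",
    "support resolved my issue quickly, friendly service",
]

# Bag of words, then LDA on the resulting document-term matrix.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}: {top_terms}")
```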


Instructor Bio

Yunus Genes is completing his master’s in Computer Science and continuing his part-time PhD at the University of Central Florida. His research focuses on applied machine learning, social media behavior, and misinformation detection and diffusion, a field he has worked in for over four years. He is currently working on a DARPA-funded project to simulate social media under the SocialSim program, teaches data science to Fortune 50 professionals, and has previously held data science positions in Silicon Valley and the Orlando, Florida area.

Yunus Genes, PhD

Data Scientist at Royal Caribbean Cruise Line

Workshop: Open Source Random Variables: Building a Prediction Web

Roar Data is an experiment in collective intelligence undertaken by J.P. Morgan with academic and industry partners. The premise is that streaming data, machine learning, microservices, and cryptography support powerful new ways of organizing predictive analytics. The engineering ambition is a Prediction Web open to all, furthering the democratization of data science started by MOOCs and open source software.

Co-author of session: Rusty Conover

Instructor Bio

Peter Cotton is an Executive Director at J.P. Morgan in the Data Science group. Previously he founded Benchmark Solutions, an enterprise data company sold to Bloomberg. Peter received his Ph.D. from Stanford’s Department of Mathematics and began his career at Morgan Stanley.

Peter Cotton

Executive Director – Data Science group at JP Morgan – Roar Data

Training: Building Recommendation Engines and Deep Learning Models Using Python, R and SAS®

Deep learning is the newest area of machine learning and has become ubiquitous in predictive modeling. The complex, brainlike structure of deep learning models is used to find intricate patterns in large volumes of data. These models have heavily improved the performance of general supervised models, time series, speech recognition, object detection and classification, and sentiment analysis.

Factorization machines are a relatively new and powerful tool for modeling high-dimensional and sparse data. Most commonly they are used as recommender systems by modeling the relationship between users and items. For example, factorization machines can be used to recommend your next Netflix binge based on how you and other streamers rate content.

In this session, participants will use recurrent neural networks to analyze sequential data and improve the forecast performance of time series data, and use convolutional neural networks for image classification. Participants will also use a genetic algorithm to efficiently tune the hyperparameters of both deep learning models. Finally, students will use factorization machines to model the relationship between movies and viewers to make recommendations.
Demonstrations are provided in both R and Python, and will be administered from a Jupyter notebook. Students will use the open source SWAT package (SAS Wrapper for Analytics Transfer) to access SAS CAS (Cloud Analytic Services) in order to take advantage of the in-memory distributed environment. CAS provides a fast and scalable environment to build complex models and analyze big data by using algorithms designed for parallel processing.
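For intuition about the factorization-machine piece (independent of the SAS CAS tooling used in class), the numpy sketch below shows how a second-order factorization machine scores a user–item pair once biases and latent factors have been learned; all parameters here are made-up stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, k = 100, 50, 8        # k latent factors per feature

# Made-up "learned" parameters; in practice these are fit to observed ratings.
w0 = 3.2                                 # global bias (average rating)
w_user = rng.normal(0, 0.1, n_users)     # per-user bias
w_item = rng.normal(0, 0.1, n_items)     # per-item bias
v_user = rng.normal(0, 0.1, (n_users, k))
v_item = rng.normal(0, 0.1, (n_items, k))

def predict_rating(user, item):
    """Second-order FM score for a one-hot (user, item) pair:
    biases plus the pairwise interaction <v_user, v_item>."""
    return w0 + w_user[user] + w_item[item] + v_user[user] @ v_item[item]

print(predict_rating(user=7, item=42))
```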

Instructor Bio

Jordan Bakerman holds a Ph.D. in Statistics from North Carolina State University. His dissertation centered on using social media to forecast real-world events, such as civil unrest and influenza rates. As an intern at SAS, Jordan wrote the SAS Programming for R Users course to help students efficiently transition from R to SAS using a cookbook-style approach. As an employee, Jordan has developed courses demonstrating how to integrate open source software with SAS products. He is passionate about statistics, programming, and helping others become better statisticians.

Jordan Bakerman

Analytical Training Consultant at SAS

Training: Building Recommendation Engines and Deep Learning Models Using Python, R and SAS®

See the full course description under the preceding entry of the same title.

Instructor Bio

Robert teaches machine learning for SAS and specializes in neural networks. Before joining SAS, Robert worked under the Senior Vice Provost at North Carolina State University where he built models pertaining to student success, faculty development and resource management. Prior to working in academia, Robert was a member of the research and development group on the Workforce Optimization team at Travelers Insurance. His models at Travelers focused on forecasting and optimizing resources. Robert graduated with a master’s degree in Business Analytics and Project Management from the University of Connecticut and a master’s degree in Applied and Resource Economics from East Carolina University.

Robert Blanchard

Sr. Analytical Training Consultant at SAS

Sign Up for ODSC West | Oct 31st - Nov 3rd 2018

Register Now

Highly Experienced Instructors

Our instructors are highly regarded in data science, coming from both academia and notable companies.

Real World Applications

Gain the skills and knowledge to use data science in your career and business, without breaking the bank.

Cutting Edge Subject Matter

Find training sessions offered on a wide variety of data science topics from machine learning to data visualization.

10+ reasons people are attending ODSC West 2018

See Reasons
Open Data Science Conference