ODSC East 2022

April 19th-21st

More sessions coming soon 


East 2022 Preliminary Schedule

We are delighted to announce our East 2022 Schedule!

ODSC East Trainings & Workshops
Tuesday, 19th April
Wednesday, 20th April
Thursday, 21st April
ODSC East Talks
Tuesday, 19th April
Wednesday, 20th April
Thursday, 21st April
Engineering a Performant Machine Learning Pipeline: From Dask to Kubeflow

Training | In-Person | AI for Engineers | Machine Learning | Intermediate-Advanced

 

The lifecycle of any machine learning model, regular or deep, consists of (a) the pre-processing, transformation, and augmentation of data, (b) the training of the model with different hyper-parameter values and learning rates, and (c) the computing of results on new data or test sets. Whether you are using transfer learning or a from-scratch model, this process requires a large amount of computation, management of your experimental process, and the quick perusal of results from your experiments. In this workshop, we will learn how to combine off-the-shelf clustering software such as Kubernetes and Dask with learning systems such as TensorFlow, PyTorch, and scikit-learn, on cloud infrastructure such as AWS, Google Cloud, or Azure, to construct a machine learning system for your data science team. We’ll start with an understanding of Kubernetes, move on to analysis pipelines in scikit-learn and Dask, and finally arrive at Kubeflow. Participants should install minikube on their laptops (https://kubernetes.io/docs/tasks/tools/install-minikube/), and create accounts on the Google Cloud…

Dr. Rahul Dave
Chief Scientist | univ.ai, lxprior.com and Harvard University
Richard Kim
Founder and CEO | Markov Lab
Statistics for Data Science

Bootcamp | Virtual | Deep Learning | All Levels

 

The emergence of data science as a discipline has impacted businesses in a range of different ways. One primary impact has been to elevate the use of data in decision-making by using statistical methods to assess the ever-growing datasets companies are collecting. This workshop will review and introduce statistical techniques and touch on more advanced methods for dealing with noisy data and applying real-world constraints to analyses. This workshop assumes a working knowledge of standard statistical methods and will aim to connect theory to practice using real-world examples.

Lesson 1: Descriptive statistics and exploring data statistically

– (Re)familiarize yourself with basic descriptive statistics
– Use simple data exploration techniques to identify problems and limitations of a new dataset

Lesson 2: Statistical analyses…

Andrew Zirm, PhD
Senior Data Scientist | Greenhouse Software
Evolution of NLP and its Underpinnings

Tutorial | Virtual | NLP | Beginner – Intermediate

 

This talk aims to give an overview of the suite of NLP methods grounded in neural-network architectures, including recurrent neural networks (RNNs), transformers, and convolutional neural networks (CNNs). We will connect them by diving into their similarities and differences. You will come away from the talk with an overall picture of NLP and a grasp of the theoretical essence that underpins NLP methods. The talk aims to equip you with foundational NLP knowledge and lower the barrier to jumpstarting your own NLP projects.

Chengyin Eng
Senior Data Scientist | Databricks
Network Analysis Made Simple

Training | In-Person | Data Visualization | Machine Learning | Intermediate

 

Upon completing this tutorial, you will be:
– familiar with how to use the NetworkX and nxviz Python packages for modelling and rationally visualizing networks,
– able to load node and edge data from a Pandas dataframe,
– familiar with object-oriented and matrix-oriented representations of graphs,
– able to find paths between nodes, interesting structures in graphs, and projections of bipartite graphs,
– (if time permits) able to use matrix operations to simulate diffusion of information on networks…

Eric Ma, PhD
Author of nxviz Package
Mastering Gradient Boosting with CatBoost

Workshop | In-Person | Machine Learning | Intermediate

 

This workshop will feature a comprehensive tutorial on using the CatBoost library. We will walk you through all the steps of building a good predictive model…

Anna Veronika Dorogush
ML Lead | Yandex
Data Science for Digital Forensics & Incident Response (DFIR)

Training | In-Person | Machine Learning Safety and Security | Intermediate

 

This workshop is directed at both a Data Science audience who may want to learn DFIR and a DFIR audience who may want to learn Data Science. Jess Garcia will explain the fundamentals of Data Science and DFIR, and will lead the audience through all the different steps of an end-to-end investigation using exclusively Data Science tools and techniques. In the process, Jess will introduce multiple forensic artifacts and will explain the value they provide to the overall investigation…

Jess Garcia
CEO, Security & Forensics Analyst, Incident Responder | Senior Instructor at One eSecurity | SANS Institute
An Introduction to Drift Detection

Workshop | In-Person | MLOps & Data Engineering | Beginner-Intermediate

 

Although powerful, modern machine learning models can be sensitive. Seemingly subtle changes in a data distribution can destroy the performance of otherwise state-of-the-art models, which can be especially problematic when ML models are deployed in production. In this workshop, we will give a hands-on overview of drift detection, the discipline focused on detecting such changes. We will start by building an understanding of the ways in which drift can occur, and why it pays to detect it. We’ll then explore the anatomy of a drift detector, and learn how such detectors can be used to detect drift in a principled manner.

Ed Shee
Head of Developer Relations | Seldon
Tutorial: Building and Deploying Machine Learning Models with TensorFlow and Keras

Half-Day Training | Virtual | MLOps & Data Engineering | Intermediate

 

Applying machine learning models to different engineering areas has been of particular interest to many data science practitioners and software engineers. As two of the most popular machine learning frameworks, TensorFlow and Keras have been widely used in many production environments for their robustness and scalability. In this session, we will provide a tutorial on TensorFlow and Keras, and guide you through a series of hands-on model-building examples ranging from the basic MNIST dataset to time series processing. We will also cover data input and output processing with TensorFlow, from processing simple CSV files to cloud data warehouse services such as Google Cloud BigQuery. As a bonus, we will also cover the integration of TensorFlow with Apache Kafka, to illustrate the streaming data pipeline that is used broadly across the industry.

Yong Tang, PhD
Director of Engineering | MobileIron
Data Operations for Research Quality Health Data

Tutorial | In-Person | Data Analytics | Beginner – Intermediate 

 

In this tutorial session, attendees will learn how a set of open source tools can be leveraged to perform standardization, characterization, and data quality assessment for various health data sources. Open source tools including Synthea, ETL-Synthea, Achilles, Data Quality Dashboard, and Ares will be reviewed and demonstrated in a data operations pipeline. We will demonstrate how the global health information community leverages this strategy to ensure research-ready health data.

 

 

Frank DeFalco
Director, Observational Health Data Analytics | Janssen Research & Development
Data Visualization with ggplot2

Workshop | In-Person | Data Visualization | Intermediate-Advanced

 

Data visualization is a powerful tool for facilitating confident, informed decision-making. ggplot2 is one of the most popular data visualization packages in use today. Based on a comprehensive grammar and syntax, ggplot2 gives you the ability to create data visualizations quickly and iteratively, whether it’s a simple bar chart or a complicated network analysis. This workshop will teach you how to manipulate and structure your data for visualizations, the elements of a graph and their associated terminology, how to select the appropriate graph based on your data, and how to avoid common graphing mistakes. You will also learn how to customize data visualizations and give them the ‘personal touches’ that make them memorable to your audience.

Martin Frigaard
Senior Clinical Programmer | BioMarin
Bridge the Gap between Data Scientist and Business Users

Workshop | Virtual | Machine Learning | Intermediate

 

Today, data scientists use a rapidly evolving and diverse set of tools and platforms to build advanced analytic and machine learning models. Once built, these models need to be available to business users so they can draw insights and make better decisions that generate value for the business. As a result, enabling business users to interact with these sophisticated models to create their own on-demand analytics is critical to reducing time to value for any business. During this workshop, we demonstrate how data scientists can leverage the Tableau Analytics Extensions API as a model-agnostic deployment platform that enables business users to interact dynamically with bespoke ML models and build their own on-demand analytics in a self-service manner.

Amir Meimand, PhD
Data Science | ML Solution Engineer | Salesforce
SQL for Data Science

Bootcamp | Virtual | Open-source | Beginner

 

By completing this workshop, you will develop an understanding of relational models of data, how SQL is used to retrieve that data, and how to join tables, aggregate information, and answer data science questions. You will also become familiar with many of the common types of SQL databases, how to access information in a database from the command line, and how to integrate database access from within Python…

Mona Khalil
Senior Data Scientist | Greenhouse Software
Machine Learning for Trading

Training | In-Person | Machine Learning | Beginner-Intermediate

 

The rapid progress in machine learning (ML) and the massive increase in the availability and diversity of data have enabled novel approaches to quantitative investment. They have also increased the demand for the application of data science to develop both discretionary and algorithmic trading strategies.
In this workshop, we will cover popular use cases for ML in the investment industry, and how data science and ML fit into the workflow of developing a trading and investment strategy, from the identification and combination of alpha factors to strategy backtesting and asset allocation.
The workshop uses Python and various standard data science and machine learning libraries like pandas, scikit-learn, gensim, and spaCy, as well as TensorFlow and Keras. The code examples will be presented using Jupyter notebooks and are based on my book ‘Machine Learning for Algorithmic Trading’…

Stefan Jansen
Founder & Lead Data Scientist | Applied Artificial Intelligence
Streaming Decision Intelligence and Predictive Analytics with Spark 3

Workshop | In-Person | Deep Learning | Intermediate – Advanced

 

The speed at which intelligent decision-making can take place is changing the very fabric of modern industry. From autonomous routing decisions for real-time traffic avoidance to intelligent integrated systems for customer service and assisted analysis or forecasting, the time to correct decisions can be a huge differentiator. During the course of this workshop, you will learn how to harness the power of Apache Spark to build up a data ingestion pipeline that can process and learn from streaming data. These learnings can then be applied to make simple decisions using SparkSQL in a parallel streaming system…

Scott Haines
Software Architect | Twilio
Dealing with Bias in Machine Learning

Tutorial | In-Person | Machine Learning | Responsible AI | Beginner-Intermediate

 

Bias is everywhere – in data, in algorithms, in humans. Dealing with it is therefore a difficult issue, and ever more important now that data is growing exponentially in nearly all dimensions (velocity, volume, veracity) and machine learning systems trained on these data sets are becoming omnipresent. It is not straightforward to accomplish, however, as you have to start with awareness and take care of data quality while being proficient in measuring and monitoring your algorithms.

Thomas Kopinski, PhD
Professor for Data Science | University of South Westphalia
Multi-Task Reinforcement Learning

Tutorial | In-Person | Beginner-Intermediate

 

In this tutorial, we will look at some of the recent advances in multi-task RL. We will start by discussing the nuances of the relationship between the single-task and multi-task setups. For example, a multi-task RL problem can be modeled as a single-task RL problem. In that setup, all the multi-task environments are considered different parts of one large single-task environment. We will see how this observation can be used to develop some general components and strategies that can convert any single-task RL algorithm into a multi-task agent.

Shagun Sodhani
Research Engineer | Facebook AI Research Group
Painting with Data: Introduction to d3.js

Training | Virtual | Data Visualization | Intermediate-Advanced

 

In this workshop, we will build an interactive data visualization from scratch using d3.js in the browser. The possibilities shown in d3 examples are exciting, but the API surface of d3 and the various browser standards like HTML, CSS, SVG, and JavaScript can be overwhelming. Think of this workshop as a guided tour that will point out the important things to pay attention to as we go step-by-step from CSV file to interactive visualization…

Ian Johnson
Data Visualization Developer | Observable
Leading Data Science Teams: A framework to help guide data science project managers

Workshop | In-Person | All Levels

 

 

 

Jeffrey Saltz, PhD
Associate Professor | Syracuse University
End to end Machine Learning with XGBoost

Training | In-Person | Intermediate

 

This tutorial will show how to use XGBoost. It will demonstrate model creation, model tuning, model evaluation, and model interpretation.

Matt Harrison
Python & Data Science Corporate Trainer | Consultant | MetaSnake
Spark NLP for Healthcare: Modular Approach to Solve Problems at Scale in Healthcare NLP

Training | In-Person | Intermediate-Advanced

 

Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant, and accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 3000+ pretrained pipelines and models in 200+ languages. It supports nearly all NLP tasks, with modules that can be used seamlessly in a cluster. Downloaded more than 1 million times every month and experiencing 20x growth over the last year, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise. In this talk, Veysel will conduct a hands-on session to go over the library’s healthcare components and teach how to solve any NLP problem in healthcare with state-of-the-art methods and practices across the industry. He will also explain the best practices for building production-grade solutions around the latest research.

Veysel Kocaman, PhD
Senior Data Scientist | John Snow Labs
Introducing Model Validation Toolkit

Talk | Virtual | MLOps & Data Engineering | Beginner – Intermediate

 

Many details surrounding a typical ML pipeline are commonly swept under the rug. How will we monitor production data for concept drift? How do we measure the false negative rate in production? How confident can we be of our performance assessments with a small test set, and how should they be modified when faced with biased data? How can we ensure our model follows reasonable assumptions? We introduce a new general-purpose tool, the Model Validation Toolkit, for common tasks involved in model validation, interpretability, and monitoring. Our utility has submodules and accompanying tutorials on measuring concept drift, assigning and updating optimal thresholds, determining the credibility of performance metrics, compensating for data bias, and performing sensitivity analysis. In this session, we will give a tour of the framework’s core functionality and some associated use cases.

Alex Eftimiades
Senior Data Scientist | FINRA
Natural Language Processing in Accelerating Business Growth

Business Talk | Virtual | Cross Industry | Beginner

 

In a talk that guides the audience through the key strategies, strengths and challenges around NLP, Dr. Sameer Maskey will deliver the following key takeaways:
– Impact and application of NLP by enterprise across various industries
– Key NLP techniques that are being leveraged to accelerate business success
– The do’s and don’ts of adopting and deploying NLP techniques and how it can take businesses to the next level.

 

Sameer Maskey
Founder & CEO | Fusemachines
Next-Generation Big Data Pipelines With Prefect and Dask

Talk | In-Person | Intermediate

 

Dask is the leading Python-native framework for distributed computing with a growing open source community. Dask is used by commercial enterprises and scientific communities for scheduling and coordinating task execution for data engineering and data science pipelines.

 

David Chudzicki
Senior Software Engineer | Coiled
Can We Let AI be Great? Practical Considerations in Designing Effective and Ethical AI products.

Talk | Virtual | Cross-Industry | Beginner-Intermediate

 

This talk will draw on information from the following two frameworks – IEEE 7000-2021: IEEE Standard Model Process for Addressing Ethical Concerns during System Design, and Section A of the Technical Best Practices from the Foundation for Best Practices in Machine Learning. In melding the more technically focused IEEE framework with the more business-oriented section from the Technical Best Practices framework, this talk aims to provide participants with practical and actionable guidance on designing effective and ethical AI products. The talk will focus on the initial stage of the AI product lifecycle. Specifically, we’ll focus on the following areas of the AI design process: team composition and roles, problem statement and solution mapping, integrating context, organizational capacity, and defining the product and outcomes. Participants will leave this session with a broad understanding of two prevailing ethical design frameworks and a practical understanding of the initial steps organizations can take to design ethical and effective AI systems.

 

Masheika Allgood
Founder | AllAI Consulting, LLC
What Analytics Leaders Should Know About Human in the Loop

Business Talk | Virtual | Cross-Industry | Beginner-Intermediate

 

This 30-minute business and strategy-oriented talk aims to educate data scientists and analytics leaders about the role and benefits of human-in-the-loop (HITL), presented from the standpoint of a 25+ year veteran of traditional machine learning applications who has recently been immersed in this new world. The session will include brief examples in application areas as diverse as Medical AI, Ag Tech, and autonomous vehicles.

 

Keith McCormick
Chief Data Science Advisor | CloudFactory
Deep Learning Enables a New View in the Agriculture Industry

Talk | In-Person | Deep Learning | Intermediate

 

Although initially a slow adopter of machine learning and computer vision, agriculture has become an important domain for these approaches. Computer vision is now a key element of agricultural systems to determine crop type, count plants, guide harvesting robots, identify issues like crop stress and weeds, and forecast yield. Adoption and extension of these approaches are critical due to the challenges facing global agriculture: the world’s population is predicted to reach 9.7 billion by 2050, water supply is expected to fall 40% short of global needs by 2030, and climate change produces significant challenges and uncertainty. Fortunately, advances in deep learning and remote sensing technologies have unlocked unprecedented opportunities for precision agriculture. However, challenges remain in applying common SOTA approaches, which are often developed for natural scene imagery like ImageNet, to remote sensing data; this data may be massive, spatiotemporal, and multispectral, may contain very small objects, and may possess fundamentally different statistics than natural scene images. In this session we’ll explore some of these challenges around remote sensing data for precision agriculture and approaches for addressing them, including spatiotemporal modeling, self-supervised and contrastive learning, and multi-task learning within a deep learning framework.

 

 

Jennifer Hobbs
Director of Machine Learning | Intelinair
Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot

Talk | Virtual | Deep Learning | All Levels 

 

Most evaluation suites for multi-agent reinforcement learning do not assess generalization to novel situations as their primary objective (unlike supervised-learning benchmarks). The subject of this talk, Melting Pot, is an evaluation suite that fills this gap, and is scalable because it uses reinforcement learning to reduce the human labor required to create novel test scenarios. This works because one agent’s behavior constitutes (part of) another agent’s environment. Melting Pot currently contains 85 unique test scenarios covering a broad range of topics such as social dilemmas, reciprocity, resource sharing, and task partitioning. Melting Pot is open source and free to use for anyone.

Joel Z. Leibo, PhD
Research Scientist | DeepMind
Vector Databases

Talk | MLOps & Data Engineering

Bob van Luijt
CEO & Co-Founder | SeMI Technologies
Just Machine Learning

Talk | In-Person | Machine Learning | All Levels

 

Risk assessment is a popular task when machine learning is used for automated decision making. For example, Jack’s risk of defaulting on a loan is 8, Jill’s is 2; Ed’s risk of recidivism is 9, Peter’s is 1. We know that this task definition comes with impossibility results for group fairness, where one cannot simultaneously satisfy desirable probabilistic measures of fairness. I will highlight recent findings in terms of these impossibility results. Next, I will present work on how machine learning can be used to generate aspirational data (i.e., data that are free from biases of real-world data). Such data are useful for recognizing and detecting sources of unfairness in machine learning models besides biased data. Time-permitting, I will discuss steps in measuring our algorithmically infused societies.

Tina Eliassi-Rad, PhD
Professor | Core Faculty Northeastern University | Network Science Institute
Trustworthy AI

Talk | Virtual | Responsible AI | Machine Learning Safety and Security | Intermediate

 

Under the umbrella of trustworthy computing, employing formal methods for ensuring trust properties such as reliability and security has led to scalable success. Just as for trustworthy computing, formal methods could be an effective approach for building trust in AI-based systems. However, we would need to extend the set of properties to include fairness, robustness, and interpretability, etc.; and to develop new verification techniques to handle new kinds of artifacts, e.g., data distributions and machine-learned models. This talk poses a new research agenda, from a formal methods perspective, for us to increase trust in AI systems.

Jeannette M. Wing, PhD
Avanessians Director of the Data Science Institute and Professor of Computer Science | Columbia University
Data Science and Contextual Approaches to Palliative Care Need Prediction

Talk | In-Person | Machine Learning | Responsible AI | Intermediate

 

In this talk I will discuss how predictive analytics and machine learning can be utilized to identify patients appropriate for a palliative care path. I will then describe a particular use case – developing a palliative care referral mechanism that works well for both commercially insured adults and Medicare Advantage plan members – and use it to illustrate how data scientists can modify their modeling processes for success. I will discuss several approaches to modeling and demonstrate how each can be adjusted to work well within the bounds of multiple variations of one problem. I will also illustrate the limits of several common predictive algorithms in building niche models. Along the way I will give several deep dives into the practical challenges associated with predicting rare events such as palliative care need, and discuss the limits of using claims data to develop measurable proxies for palliative care need and other health-related events.