ODSC West 2021

West Preliminary Session Schedule

West Talks – PST
17th November, Wednesday
18th November, Thursday
West Workshops & Training – PST
15th November, Monday
16th November, Tuesday
17th November, Wednesday
18th November, Thursday
Data engineering workflows with Jupyter notebooks

Talk | MLOps & Data Engineering

Michelle Ufford
CEO & Co-Founder at Noteable
Building Label-Efficient Models & Machine-actionable Knowledge from Natural Language Data
Xiang Ren, PhD
Assistant Professor | Research Team Leader | Director | Information Director at USC Computer Science | USC ISI | USC INK Research Lab | SIGKDD
Incident Response in a World of Evolving Threats

Talk | Cybersecurity

This session will examine the evolution of threat actor techniques, tactics, and procedures, and take a look at where threats are heading. It will cover, through use cases, the basic strategies every incident response plan should contain and, in particular, examine the roles that situational awareness and information sharing play in helping organizations prepare to respond to incidents before, during, and after they happen.

Denise Anderson
President at H-ISAC
Large-Scale Video Analytics with Ease

Talk | Deep Learning | Intermediate

Tracking objects is a foundational task for video analysis. It is the engine for smart cities, autonomous driving, and building management. However, although many methods for the task are proposed at top-tier research conferences each year, most of our production systems use techniques that are half a decade old. In this talk, I will explain the reason behind the gap between research and production, with intuition and experimental results. Further, I will introduce our recent work to address the issues. Our new methods can easily leverage large-scale datasets and learn to track objects in diverse scenarios. In the end, I will provide tools for industry practitioners to build trackers on their data without diving into complicated parameter tuning and expensive optimization. You will learn to make robust, simple, and performant tracking modules to supercharge your video analysis engines.

Fisher Yu
Postdoctoral Researcher at UC Berkeley
Seeing the Unseen: Inferring Unobserved Information from Limited Sensory Data

Talk | Machine Learning | Intermediate-Advanced

Adriana Romero
Research Scientist at Facebook AI Research
Deep Probabilistic Programming with Pyro

Talk | Research Frontiers | Beginner-Intermediate

In this talk, we will learn about Pyro (http://pyro.ai), a probabilistic programming language (PPL) built on PyTorch. We will discuss what probabilistic programming is, and how we can integrate it with deep learning to tackle open machine learning problems in generative modeling. We will talk about approximate inference techniques such as variational inference, and walk through some of the tools and examples that make inference on models automatic. If you are a data scientist, an ML engineer, or an ML researcher, this talk will be of interest to you!

JP Chen
Machine Learning Researcher at Facebook
Data-driven Modeling Approaches in Computational Drug Discovery

Tutorial | Machine Learning

Hiranmayi Ranganathan
Machine Learning Researcher at the Lawrence Livermore National Laboratory
DataOps For the Modern Computer Vision Stack

Talk | Deep Learning

James Le
DataOps/MLOps Practitioner | AI Safety Researcher at Superb AI Inc.
Practical individual fairness algorithms

Talk | Machine Learning | Responsible AI | Intermediate

Individual Fairness (IF) is a very intuitive and desirable notion of fairness: we want ML models to treat similar individuals similarly, that is, to be fair for every person. For example, two resumes that differ only in the individual's name and gender pronouns should be treated similarly by the model. Despite the intuition, training ML/AI models that abide by this rule in theory and in practice poses several challenges. In this talk, I will introduce a notion of Distributional Individual Fairness (DIF), highlighting similarities and differences with the original notion of IF introduced by Dwork et al. in 2011. DIF suggests a transport-based regularizer that is easy to incorporate into modern training algorithms while controlling the fairness-accuracy tradeoff by varying the regularization strength. The corresponding algorithm is guaranteed in theory to train certifiably fair ML models and achieves individual fairness in practice on a variety of tasks. DIF can also be readily extended to other ML problems, such as Learning to Rank.
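
The resume example above can be sketched as a simple check. This is only an illustration of the original "similar individuals, similar outputs" condition, not the DIF regularizer from the talk; the field names, the two scoring functions, the distance, and the `lipschitz` constant are all hypothetical.

```python
def fair_distance(a, b, protected=("name", "pronoun")):
    """Count differing fields, ignoring protected ones (hypothetical metric)."""
    return sum(1 for key in a if key not in protected and a[key] != b[key])


def violates_individual_fairness(score, a, b, lipschitz=0.1):
    """True if the scores differ by more than lipschitz * fair_distance."""
    return abs(score(a) - score(b)) > lipschitz * fair_distance(a, b)


def biased_score(resume):
    # Hypothetical model that (wrongly) rewards one pronoun.
    bonus = 0.2 if resume["pronoun"] == "he" else 0.0
    return 0.5 + 0.01 * resume["years"] + bonus


def blind_score(resume):
    # Hypothetical model that ignores the protected fields.
    return 0.5 + 0.01 * resume["years"]


# Two hypothetical resumes that differ only in name and pronoun.
resume_a = {"name": "John", "pronoun": "he", "years": 5, "degree": "BSc"}
resume_b = {"name": "Joan", "pronoun": "she", "years": 5, "degree": "BSc"}

biased_flag = violates_individual_fairness(biased_score, resume_a, resume_b)
blind_flag = violates_individual_fairness(blind_score, resume_a, resume_b)
```

Here the two resumes are at distance zero under the assumed metric, so any difference in score counts as a violation: the biased scorer is flagged and the blind one is not.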

Mikhail Yurochkin, PhD
Research Staff Member | IBM Research and MIT-IBM Watson AI Lab
Artificial Intelligence for Conservation and Sustainability: From the Local to the Global

Talk | Responsible AI | Beginner

How can artificial intelligence and open data science tackle the twin challenges of climate change and reversing biodiversity loss? Come hear about some of the successes in addressing these challenges and get some tips on how you can help. Topics include case studies on how the World Wildlife Fund and others are using AI to help predict deforestation, monitor the health of protected areas, and map carbon stocks, going into some detail about the artificial intelligence techniques applied. The talk will also address difficulties involved with AI applications due to the sensitive nature of much of the relevant data. Pointers for further exploration will be given throughout the talk, and it will close with some open questions you can help answer.

Dave Thau, PhD
Data and Technology Global Lead Scientist at WWF
Machine Learning With Graphs: Going Beyond Tabular Data

Talk | Deep Learning | Machine Learning | Intermediate

Machine learning has traditionally relied on creating models around data that can be represented in tabular format, such as SQL tables, Pandas dataframes, and the like. Inherent in this data is the assumption that there is no relationship between the entries (rows) of the data. In certain cases this is an accurate assumption. However, there are many common use cases for machine learning where this assumption is not entirely accurate. In these cases, by considering the relationships among those individual data points, models can be significantly enhanced and measurable improvements can be made to the appropriate metrics of that model. Such use cases include common data science and machine learning tasks such as churn prediction and automated recommendation engines.

In this talk we will compare and contrast models created with individual data points to those made entirely with graphs and hybrids of the two. We will explore a variety of techniques that are used for creating graph embeddings, the vectors for representing graphs that are created in a similar fashion to the feature engineering and vector embeddings associated with traditional machine learning. We will focus on the optimization of the graph embeddings and explore some real-world examples of their use individually and in conjunction with the traditional types of machine learning embeddings. Special emphasis will be placed on the benefits of using graph embeddings with significant class imbalance. We will also discuss the use of these embeddings with traditional machine learning packages and workflows, such as through the use of scikit-learn and TensorFlow.
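
As a rough illustration of combining relationship information with tabular rows, the sketch below appends a single graph-derived number (node degree) to each feature vector. Degree is a crude stand-in for a learned graph embedding, and the customers, features, and edges are hypothetical.

```python
from collections import defaultdict


def add_degree_feature(rows, edges):
    """Append each entity's number of neighbours to its tabular feature vector."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {key: features + [len(neighbours[key])] for key, features in rows.items()}


# Hypothetical customers: [monthly_spend, tenure_months]
rows = {"ann": [30.0, 12], "bob": [45.0, 3], "cy": [20.0, 24]}
# Hypothetical "referred" relationships between customers
edges = [("ann", "bob"), ("ann", "cy")]

enriched = add_degree_feature(rows, edges)
```

The enriched vectors can then be passed to a conventional estimator in the same way as any other engineered feature.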

Dr. Clair Sullivan
Graph Data Science Advocate | Neo4j
Towards More Energy-Efficient Neural Networks? Use Your Brain!

Talk | Machine Learning | Research Frontiers | Intermediate

In the last decade, many different types of neural networks have been developed. They have shown us the amazing power and opportunities of machine learning. All over the world, processes are being replaced by ML algorithms, people are matched with their dream jobs, products are recommended, and cars drive themselves. It is truly amazing what we can do with such models. On the other hand, when you take a critical look, the whole training process is not that efficient. We have to feed models millions of labeled images or text inputs to make sure the algorithm performs well. And think of what happens in this training process: each input goes through many layers where multiplications and ReLU or sigmoid functions are applied to each item of the input, forward and backward, because of backpropagation. Of course, with all the available compute in the form of GPUs, this is not really an issue. However, it costs a lot of energy. With that in mind, we do know that neural networks are loosely based on the way humans learn, except that the human brain is much more energy efficient. Could we achieve that same level of energy efficiency in artificial neural networks? The answer is yes!

In this talk I will show you what is often called the third generation of neural networks: Spiking Neural Networks. Based on biological processes in the brain, this kind of neural network uses discrete spikes and sparse communication to learn. I will give a short introduction to some biological processes in the human brain, and from there we will define spiking neural networks. We will discuss their downsides compared to artificial neural networks, which stem from their discontinuous nature, and I will show a resolution to that. Their accuracy may still fall short of that of artificial neural networks, but the field is evolving and I will show the great potential of these networks. You will also get an overview of some existing frameworks based on PyTorch.
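
To make "discrete spikes" concrete, here is a minimal sketch of a leaky integrate-and-fire neuron, a common textbook spiking model. The talk may use a different neuron model; the threshold, leak factor, and input values are illustrative assumptions.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Simulate a discrete-time leaky integrate-and-fire neuron.

    Returns a list of 0/1 spikes, one per time step.
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        # The membrane potential decays (leaks), then integrates the new input.
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)       # emit a spike ...
            potential = reset      # ... and reset the membrane potential
        else:
            spikes.append(0)
    return spikes


spikes = simulate_lif([0.5, 0.5, 0.5, 0.0, 0.0, 1.2])
```

The neuron stays silent on most steps, which is where the sparse communication comes from. The hard threshold is also the source of the discontinuity mentioned above: it has no useful gradient for ordinary backpropagation.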

Let’s go for great accuracy and major energy savings.

Olaf de Leeuw
Data Scientist at Dataworkz
Practical MLOps: Automation Journey

Talk | Machine Learning | Data Engineering and MLOps | Intermediate

At YooMoney, we use ML models extensively for tasks ranging from anti-fraud to NLP.

We started with a data scientist who worked in Jupyter, then copy-pasted the model code into Flask with pickle and zipped it for production. This was a labor-intensive and hardly scalable process, so we began to introduce MLOps.

My talk will cover MLOps practices—a way to streamline the model development process and automate it as much as possible. In general, at least some of them are an attempt to use Software Development practices in Machine Learning experimentation and production.

From one point of view, this task was relatively easy for our company: we already have CI/CD in place for regular applications, so why not just use it for ML purposes?

But when it comes to implementation, one realizes that it is not such a straightforward process. I will go through the main stages of the MLOps pipeline, explaining the challenges and the solutions we used to overcome them.

The first stage is Model Development. On the one hand, it looks like regular software development (writing some code); on the other hand, it does not, because it requires access to many datasets and DWHs (preferably with live data, which can be challenging for fintech or medical data), so we have to solve security issues, such as introducing IDM interfaces for “sets of datasets”.

There are also issues with code-writing tools at this stage: data scientists often write code in Jupyter notebooks instead of IDEs like IDEA/Eclipse/VS, which is not directly suitable for creating a standalone application and requires additional effort in the merge phase during commits. We managed to solve this with the jupytext module, which syncs both ways between py and ipynb files, storing the py files as the main reference in git.

The next stage is Preparing for Production. It starts with Model Risk Evaluation: we will briefly mention the probability-impact matrix and name the risks that can be mitigated using MLOps, such as operational risk or data drift.

QA (testing practices suitable for ML) also plays an important role in the MLOps process. As we aim to automate the model’s lifecycle, a short testing period will definitely help. The cheapest and most straightforward solution here is to use a tool like pytest, but it only works until other platforms like Scala applications are introduced, so alternatives like Kotlin autotests should be considered. As for the testing strategy, a few solutions can be used here (testing on a reference dataset, accessing ground truth, checking business metrics, and so on).
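
Testing on a reference dataset could look roughly like the pytest-style sketch below. The model, the reference cases, and the 0.95 accuracy floor are hypothetical stand-ins, not YooMoney's actual tests.

```python
def predict_is_fraud(amount, country_mismatch):
    """Hypothetical stand-in for a trained anti-fraud model."""
    return amount > 1000 and country_mismatch


# Hypothetical reference dataset: (amount, country_mismatch, expected_label)
REFERENCE_CASES = [
    (50, False, False),
    (5000, True, True),
    (5000, False, False),
    (20, True, False),
]


def reference_accuracy(model, cases):
    """Fraction of reference cases on which the model matches the expected label."""
    hits = sum(model(amount, mismatch) == expected
               for amount, mismatch, expected in cases)
    return hits / len(cases)


def test_model_on_reference_dataset():
    # pytest collects functions named test_*; a plain assert is all it needs.
    assert reference_accuracy(predict_is_fraud, REFERENCE_CASES) >= 0.95
```

Because the check is an ordinary function, the same reference cases can be reused in a CI job that gates each new model version.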

After that, we proceed to building the environment for real-time inference. As O’Reilly’s “Introduction to MLOps” suggests, this part of the process should come as early as possible, before the model is prepared (or even before model development starts). With basic modules like scikit-learn it is a relatively simple process, but with TensorFlow, for example, the situation changes (who has tried to simply run the latest version of TensorFlow with the latest version of Python and all the libraries? How often does it work on the first attempt?). So here we have to solve platform-specific problems (such as whether it is better to use virtualenv or Docker for dependencies) and more general ones: what kind of tools will be used for inference platform management? Should it be a single machine, a Kubernetes cluster, or something else? Our current solution is horizontal scaling with a load balancer, and we’re aiming to use Docker under Kubernetes as the target platform.

Deploying also involves a lot of things to consider. We start with building artifacts for deployment. What should come with a model: just a serialized object, or a reproducible research set including data? We started with pickle / hdf5, and they suited us perfectly until Scala models were introduced. Now we have to switch to another technology, choosing among formats such as PMML, PFA, ONNX, or POJO (the main pros and cons will be discussed during the presentation).

Release (which is not the same as deployment). At this stage, we start using the new model. Some Risks discussed above might be mitigated here: for example, the operational one can be dealt with using Canary releases or blue-green releases.
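
A canary release can be sketched as routing a small, stable share of traffic to the new model. The hashing scheme, the `request_id` field, and the percentages below are illustrative assumptions rather than the setup described in the talk.

```python
import hashlib


def pick_model(request_id, canary_percent):
    """Route a stable share of requests to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "new" if bucket < canary_percent else "current"


# Send roughly 10% of (hypothetical) requests to the new model version.
routes = [pick_model(f"req-{i}", 10) for i in range(10000)]
canary_share = routes.count("new") / len(routes)
```

Hashing the request identifier, instead of drawing a random number, keeps the routing decision repeatable, so the same request always reaches the same model version while the canary percentage is unchanged.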

Last but not least, we might combine a few of the techniques discussed above in Monitoring. Here we need a Model Repository or some other way of detecting the model version (writing this version to a log works well too). Testing practices can also be used to ensure that the model performs well in the production environment (from heartbeat/ping requests to human evaluation of a portion of requests).

The main idea of MLOps is similar to that of DevOps: identify the labor-intensive parts of the model lifecycle, choose the ones that can be automated, and match appropriate tools to these parts (examples were given above). This approach makes model development more predictable and ensures that highly qualified people like Data Scientists or Subject Matter Experts can focus on their specific tasks, leaving almost all infrastructure-related tasks to automated tools.

Evgenii Vinogradov, PhD
Head of Data Engineering and Data Science | Director, Analytical Solutions Department at YooMoney
Develop and Deploy a Machine Learning Pipeline in 45 Minutes with Ploomber

Talk | MLOps & Data Engineering | Intermediate

Development tools such as Jupyter are prevalent among data scientists because they provide an environment to explore data visually and interactively. However, when deploying a project, we must ensure the analysis can run reliably in a production environment like Airflow or Argo; this causes data scientists to move code back and forth between their notebooks and these production tools. Furthermore, data scientists have to learn an unfamiliar framework and write pipeline code, which severely delays the deployment process.

Ploomber solves this problem by providing:

1. A workflow orchestrator that automatically infers task execution order using static analysis.
2. A sensible layout to bootstrap projects.
3. A development environment integrated with Jupyter.
4. Capabilities to export to production systems (Airflow and Argo) without code changes.
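
To illustrate what "inferring task execution order" means, the sketch below orders tasks from a dependency map with Python's standard `graphlib` module. It shows only the ordering step: Ploomber derives dependencies by static analysis of the code, whereas here the map is written by hand, the task names are hypothetical, and nothing is implied about Ploomber's internals.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "load": set(),
    "clean": {"load"},
    "features": {"clean"},
    "train": {"features"},
    "report": {"train", "clean"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```

A cycle in the map raises `graphlib.CycleError`, which is a convenient early signal that a pipeline definition cannot be executed.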

This talk develops and deploys a Machine Learning pipeline in 30 minutes to demonstrate how Ploomber streamlines the Machine Learning development and deployment process.

Who and why

This talk is for data scientists (with experience developing Machine Learning projects) looking to enhance their workflow. Experience with production tools such as Airflow or Argo is not necessary.

The talk has two objectives:

1. Advocate for more development-friendly tools that let data scientists focus on analyzing data by taking away the overhead of popular production tools.
2. Demonstrate an example workflow using Ploomber where a pipeline is developed interactively (using Jupyter) and deployed without code changes.

GitHub: https://github.com/ploomber/ploomber

Eduardo Blancas
Data Scientist at Fidelity Investments
Applications of Modern Survival Modeling with Python

Talk | Machine Learning | Intermediate

Survival models describe how long it will take for some important event to occur. Because they account for censored data and avoid arbitrary binarization thresholds, they are a natural fit for many applications. In churn prediction, for example, it can be more useful to model the time until a subscriber churns, instead of the probability of churn over some arbitrary time window. Similarly, for hardware engineering, it can be more valuable to know the likely time until a piece fails instead of the probability of failure over a fixed time frame.

Despite the name survival, not all applications involve negative outcome events; in some cases, we want to reduce the time-to-event. For example, we can use survival models to decide between competing sales strategies; the better policy is the one with the shortest time-to-conversion.

While conceptually elegant, survival analysis has historically been popular only within a handful of applied domains, especially clinical research. The machine learning community has recently taken notice, however, and survival analysis is gaining traction within research, applications, and software circles.

Our goal in this talk is to help the audience add survival modeling to their working data science tool belt. We’ll first introduce basic concepts of survival analysis like censored data, duration matrices, and survival curves. We’ll then show when to consider using survival models instead of other methods, how to use popular Python survival analysis tools Lifelines, Scikit-survival, and Convoys, and how to interpret model results for either prediction or decision-making. Throughout, we’ll emphasize the machine learning perspectives on the topic.
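
As a small illustration of how censored data enters a survival curve, here is a from-scratch sketch of the Kaplan-Meier estimator. It is not taken from the talk or from any of the libraries named above, and the subscriber data is hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: time until the event or until censoring, one per subject
    observed:  True if the event happened, False if the subject was censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Subjects whose event happened exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical subscribers: months until churn, or until we stopped observing.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

Censored subscribers never count as churn events, but they stay in the at-risk group until they drop out of observation. On this toy data the estimated survival falls to about 0.8, 0.6, and 0.3 at months 2, 3, and 5.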

Brian Kent, PhD
Data Scientist | Founder at The Crosstab Kite
Using Change Detection Algorithms for Detecting Anomalous Behavior in Large Systems

Talk | Machine Learning | Beginner-Intermediate

It is important to efficiently determine the health of large complex systems by detecting anomalous behavior, where anomalies in the system data can indicate a failure or an impending failure. The goal is to detect anomalous behavior before it escalates to severe service degradation or a service-impacting outage.

In this talk, using sequential multivariate system performance data, we present the application of multivariate change detection algorithms and visual analytics methods for detecting and diagnosing anomalous behavior with low latency in a large networking system. A brief overview of anomaly detection concepts will also be presented.

Multivariate change detection algorithms based on non-parametric change detection methods are applied to the data to detect anomalies and present diagnostic information at fine time granularity. We identify whether a change point is a single time stamp (pointwise anomaly) or a collection of time stamps (collective anomaly) that does not conform with the general pattern of data.

Two unsupervised change point detection methods are used, namely, the Bayesian approach and the distance-based approach. For the Bayesian approach, we deploy the following R packages: changepoint.mv and anomaly. The R package ecp is selected for the distance-based change detection approach. An advantage of the changepoint.mv package is that it also provides diagnostic capability in terms of explicitly identifying both the change point location and the variables associated with the change point.
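
For readers new to change detection, the sketch below finds a single shift in the mean of a univariate series by least squares. It is a much simpler stand-in written in Python, not the algorithms implemented in the R packages named above, and the latency readings are hypothetical.

```python
def best_mean_shift(series):
    """Return the index where splitting the series into two constant-mean
    segments gives the smallest total squared error."""
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    candidates = range(1, len(series))  # both segments must be non-empty
    return min(candidates, key=lambda k: sse(series[:k]) + sse(series[k:]))


# Hypothetical latency readings with a shift starting at index 5.
latency = [10, 11, 10, 9, 10, 20, 21, 19, 20, 20]
change_index = best_mean_shift(latency)
```

The multivariate methods in the talk additionally have to decide how many change points exist and which variables are involved, which this sketch does not attempt.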

The R packages used for change detection will be described in terms of their capabilities and characteristics, and the R code used for the analysis will be shared. In addition, the use of self-organizing maps (using the R kohonen package) for visual analytics will be presented. We demonstrate our methods with real data.

Veena Mendiratta, PhD
Adjunct Faculty | Network Reliability and Analytics Researcher at Northwestern University
Responsible AI; From Principles to Practice

Talk | Responsible AI | Beginner-Intermediate

AI has made amazing technological advances possible; as the field matures, the question for AI practitioners has shifted from “can we do it?” to “should we do it?”. In this talk, Dr. Tempest van Schaik will share her Responsible AI (RAI) journey, from ethical concerns in AI projects, to turning high-level RAI principles into code, and the foundation of an RAI review board that oversees projects for the team. She will share some of the practical RAI tools and techniques that can be used throughout the AI lifecycle, special RAI considerations for healthcare, and the experts she looks to as she continues in this journey.

Tempest Van Schaik, PhD
Senior Machine Learning Biomedical Engineer at Microsoft CSE
Using Deep Learning to Understand Documents

Talk | Deep Learning | Machine Learning | Beginner-Intermediate

Extracting key fields from a variety of document types remains a challenging problem. Services such as AWS, Google Cloud, and open-source alternatives provide text extraction to “digitize” images or PDFs, returning phrases, words, and characters. Processing these outputs is unscalable and error-prone, as varied documents require different heuristics, rules, or models, and new types are uploaded daily. In addition, a performance ceiling exists because downstream models rely on good yet imperfect OCR algorithms upstream.

We propose an end-to-end solution utilizing image-based deep learning to automatically extract important text fields from documents of various templates and sources. Computer vision algorithms utilizing deep learning produce state-of-the-art classification accuracy and generalizability through training on millions of images. We compared the in-house model's accuracy, processing time, and cost with third-party services and found favorable results for automatically extracting important fields from documents.

Bill.com is working to build a paperless future. We process millions of documents a year, including invoices, contracts, receipts, and a variety of others. Understanding those documents is critical to building intelligent products for our users.

Eitan Anzenberg, PhD
Chief Data Scientist | Bill.com
Using AI to Overcome Bias & Make Hiring More Equitable

Talk | Responsible AI | Beginner-Intermediate

Recruitment and hiring have inherent bias that hinders companies from hiring the candidates who are the best fit for the job. This hurts not only individual career paths and diversity efforts, but also the growth and success of organizations. By leveraging AI and career data from over 1 billion people, Ashutosh helped build a system free of bias that is shifting how companies recruit, hire, retain, and grow talent. Equal parity algorithms and audit and monitoring processes create a transparent system that is independent of race, gender, ethnicity, age, and other characteristics. Ashutosh will go into detail about how AI removes fairness concerns by increasing transparency and mitigating the risk of unconscious bias. He’ll delve into how this system is helping remap the traditional career journey, increasing diversity at every level and helping Americans get back to work post-pandemic. Finally, Ashutosh will share case studies of how HR departments at leading organizations are using these technologies to make hiring more equitable and inclusive.

Ashutosh Garg, PhD
CEO and Founder at Eightfold.ai
Machine Learning and Robotics in Healthcare Devices and Rehabilitation

Business Talk | AI for Healthcare

In the upcoming stages of the Fourth Industrial Revolution, we are going to experience a paradigm shift in how we use Artificial Intelligence (AI) and Robotics to improve processes and enhance healthcare. During her presentation, Alishba will discuss various applications of AI, efforts to develop Artificial General Intelligence (AGI), soft robots, and how these technologies can be used to facilitate and enhance mental health practices and to improve prosthetics and rehabilitation devices for recovering stroke patients. She will demonstrate this by walking through her work with San Jose State University, where she used 3D printing and AI to develop a cheaper prosthetic that costs $700, compared with the current price of $10k. She will also provide insights by highlighting her work with Hanson Robotics on Sophia the Robot to improve manipulation techniques for robots and develop the next wave of humanoid robots with human-like intelligence. She will also highlight her own research and research by labs such as the Harvard Biodesign Lab on using soft robotics and machine learning to develop low-cost, easier-to-use portable rehabilitation gloves for recovering stroke patients. In addition, Alishba will speak on her work with Kindred.Ai using imitation learning and telerobotics to develop more intelligent and safe human-robot interactions that can be used in medical and manufacturing settings. This interactive session will highlight current cutting-edge research in robotics and AI while diving deep into the concepts powering it and how these are being applied to key problem areas in the medical and rehabilitation industries.

Alishba Imran
Machine Learning Developer at Hanson Robotics | Kindred.Ai
New Frontiers in Deep Generative Learning