I have always appreciated the unusual, unexpected, and surprising in science and in data. As the famous science author Isaac Asimov is often quoted as saying, “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ (I found it) but ‘That’s funny!’” This is the primary reason that I encouraged most of the doctoral students that I mentored at GMU to work on some variation of Novelty Discovery (or Surprise Discovery) for their Ph.D. dissertations.
“Surprise discovery” for me is a much more positive, exciting phrase than “outlier detection” or “anomaly detection”, and it is much richer in meaning, in algorithms, and in new opportunities. Finding the surprising, unexpected thing in your data is what inspires the exclamation “That’s funny!”, which may signal a great discovery (about your data’s quality, about your data pipeline’s deficiencies, or about some wholly new scientific concept). As the famous astronomer Vera Rubin said, “Science progresses best when observations force us to alter our preconceptions.”
My two training sessions will look at two different topics from a common perspective that reflects the theme of “novelty” through the study of some uncommon examples. Specifically, some (hopefully, most) of these examples may alter participants’ preconceptions (in a positive way) about their data science applications and the typical machine learning algorithms that they use every day. Each of the training sessions will present a series of examples (approximately 10 each) to demonstrate the overarching idea represented in the title of the corresponding session.
My training sessions will focus on novel approaches and ways of thinking about common machine learning techniques and algorithms that data scientists frequently use. These include Bayes’ theorem, independent component analysis, Markov modeling, recommender engines, K-means clustering, K-nearest neighbors, neural networks, deep learning, TensorFlow, knowledge graphs, and more.
The machine learning cold-start problem is the focus of my first session. It will explore examples of meta-learning and optimization when there is very little initial knowledge about where to start in model hyperparameter space. This is a frequent challenge in data science applications, encountered either when there is too little labeled data to adequately train a supervised learning model or when our goal is to figure out what the data is saying to us (i.e., applying unsupervised learning to explore the data without the added baggage of our preconceptions as to what we think it is revealing). We will review backpropagation and TensorFlow in this same context.
My second training session will examine atypical applications of some typical machine learning algorithms. This will include predicting tropical storm intensification using retail market basket analysis, and it will include predicting solar storm impact on astronauts in space using customer journey mapping techniques. It will even include examples from Formula 1 racing and finding a cure for cancer. The most surprising example might be the one where a company achieved a 100,000% ROI on a data analytics investment to reduce customer churn – and they used perhaps the simplest algorithm in the known Universe.
When we take a novel look at the methods and algorithms that we use every day, which then leads to unexpected and surprising discoveries in data, that should get us excited for each new day with data.
Note: Kirk will present two training sessions at the ODSC East 2021 Virtual Conference. One will focus on “Solving the Data Scientist’s Cold-Start Problem with Machine Learning Examples” and the other will look at “Atypical Applications of Typical Machine Learning Algorithms.”
When AI architects think about ML Serving, they focus primarily on speeding up the inference function in the Serving layer. Worried about performance, they optimize towards overcapacity, leading to an expensive end-to-end solution. When the solution is deployed, the cost of serving alarms those responsible for budgets, and the solution ends up being abandoned. Keeping costs down is an important goal for a practical architecture and a successful AI solution, and asynchronous architectures are one way to achieve it.
The default architecture that architects come up with is a synchronous one. A simplified version of this architecture is provided here:
An ML Service API, typically a REST API, sits in front of the serving layer. It takes care of standard API functions like authentication and load balancing. A cluster of ML Serving nodes is set up behind this service API. Each node provides an inference function that takes input feature variables and returns the required prediction. The Service API layer balances requests across the Serving nodes.
What are some of the problems of this synchronous architecture? Typically, the serving load fluctuates across time intervals. If the solution needs to handle the maximum load, it must be provisioned for maximum capacity. Whenever the system is not running at its peak, the provisioned resources sit idle. ML Serving may use expensive resources like GPUs, and keeping them idle is a waste of money. Alternatively, if we provision the solution to handle only average or slightly above-average loads, the Service Layer needs to reject new requests beyond provisioned capacity to prevent back-pressure on the serving nodes. This amounts to a denial of service. Neither scenario is desirable. While new developments in elastic scaling in the cloud help alleviate some of these concerns, they introduce other issues.
A good alternative to synchronous serving is asynchronous serving, which helps optimize ML Serving node resources and enables additional capabilities. Here is how a typical asynchronous serving architecture would work:
When the ML Service API receives the request, instead of directly sending it to the ML Serving node, it places the request in a Request Queue, which is a publish-subscribe Queue. Apache Kafka is a great technology for this option. ML Serving nodes are subscribers to this queue. They listen to the queue and pull in new requests when they have free capacity. The results are then pushed to a Response Queue, which is another publish-subscribe queue. The Service API subscribes to the response queue. As responses are received, the ML Service API then returns them to the clients. Clients can either wait for the response in the same request connection or can provide a callback endpoint to receive the responses.
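The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: in-process `queue.Queue` objects stand in for the publish-subscribe queues (a real deployment would use a broker such as Apache Kafka), threads stand in for ML Serving nodes, and `predict`, `serving_node`, and `serve` are hypothetical names invented for this sketch.

```python
import queue
import threading


def predict(features):
    """Placeholder inference function (hypothetical): sums the features."""
    return sum(features)


def serving_node(node_id, request_queue, response_queue):
    """Pull a request whenever this node is free, then publish the result."""
    while True:
        message = request_queue.get()
        if message is None:  # sentinel: shut this node down
            break
        request_id, features = message
        response_queue.put((request_id, node_id, predict(features)))


def serve(requests, n_nodes=2):
    """Play the role of the ML Service API: enqueue requests, collect responses."""
    request_queue = queue.Queue()   # stand-in for the Request Queue
    response_queue = queue.Queue()  # stand-in for the Response Queue
    nodes = [
        threading.Thread(target=serving_node, args=(i, request_queue, response_queue))
        for i in range(n_nodes)
    ]
    for node in nodes:
        node.start()

    # The API does not call a node directly; it only publishes to the queue.
    for request_id, features in requests.items():
        request_queue.put((request_id, features))

    # Responses may arrive in any order, so they are matched by request id.
    results = {}
    for _ in requests:
        request_id, _node_id, prediction = response_queue.get()
        results[request_id] = prediction

    for _ in nodes:
        request_queue.put(None)  # one shutdown sentinel per node
    for node in nodes:
        node.join()
    return results


if __name__ == "__main__":
    print(serve({"r1": [1.0, 2.0], "r2": [0.5, 0.5], "r3": [3.0]}))
```

Because each response carries its request id, the API layer can hand results back to waiting clients or forward them to a callback endpoint regardless of which node produced them.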
There are multiple advantages to the asynchronous approach:
- The ML Serving nodes can be provisioned for average loads. When there is a sudden spike in load, the queues absorb the back-pressure temporarily. Care should be taken to understand load patterns and provision enough resources so that catch-up happens within acceptable thresholds.
- Any pre-processing needed for requests can be handled using streaming jobs on the request queue. Again, this ensures even distribution and scaling. The same applies to post-processing on the response queue.
- Message Queue technologies provide capabilities like persistence and fault tolerance, so these don’t have to be built out separately.
- Message queues can also provide the same input to other subscribers, like reporting and analytics.
- This architecture provides loose coupling between services, which allows them to evolve independently and enables Agile development and deployment.
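To make the first advantage concrete, here is a minimal sketch of how a queue absorbs a load spike when capacity is sized between the average and the peak. The arrival numbers, the capacity, and the function names `simulate_backlog` and `ticks_to_catch_up` are all illustrative assumptions, not measurements from a real system.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick):
    """Return the queue depth after each tick for a fixed serving capacity."""
    backlog = 0
    history = []
    for arrivals in arrivals_per_tick:
        # Whatever cannot be served in this tick stays in the queue.
        backlog = max(0, backlog + arrivals - capacity_per_tick)
        history.append(backlog)
    return history


def ticks_to_catch_up(history):
    """Ticks from the first non-empty backlog until the queue drains again."""
    start = next((i for i, depth in enumerate(history) if depth > 0), None)
    if start is None:
        return 0  # the queue never built up
    for i in range(start, len(history)):
        if history[i] == 0:
            return i - start
    return None  # did not catch up within the simulated window


if __name__ == "__main__":
    # Hypothetical load: 100 requests per tick with a two-tick spike of 250.
    arrivals = [100, 100, 250, 250, 100, 100, 100, 100, 100, 100]
    history = simulate_backlog(arrivals, capacity_per_tick=150)
    print(history)
    print(ticks_to_catch_up(history))
```

Running the same simulation with different capacities is a quick way to check whether catch-up stays within the threshold the business can accept before committing to a provisioning level.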
An immediate concern raised by this approach is the latency of responses. In theory, the asynchronous hops introduce additional latency, but in most practical cases this extra latency is still within the response times required for the solution. Set up correctly, an asynchronous solution can provide the same average response times as its synchronous equivalent.
While architecting the solution, the suitability of the asynchronous architecture should be evaluated against business needs. If end-user experience is blocked until a prediction is made, and the prediction is critical, then synchronous may be the way to go, despite higher costs. An example here would be operating critical machinery based on predictions. On the other hand, if the user experience can proceed and a delay in predictions is not critical, asynchronous is the way to go. An example here would be providing recommendations to a user on an e-commerce website, while the user is browsing a catalog.
Asynchronous pipelines are a great tool in the architect’s toolset. It is still the architect’s call whether they are appropriate for the use case in question. Do catch my talk about experiences in building an ML platform at ODSC West (https://odsc.com/speakers/building-a-ml-serving-platform-at-scale-for-natural-language-processing/), where I will be discussing similar asynchronous architecture options.
Kumaran Ponnambalam is an AI and Big Data leader with 15+ years of experience. He is currently the Senior AI Architect for Webex Contact Center at Cisco. He focuses on creating robust, scalable AI platforms and models to drive effective customer engagements. In his current and previous roles, he has built data pipelines, ML models, analytics and integrations around customer engagement. He has also authored several courses on the LinkedIn Learning Platform in Machine Learning and Big Data areas. He holds an MS in Information Technology and advanced certificates in Deep Learning and Data Science.
This is the second part of my blog posts on machine learning monitoring. In the first part, we listed the four questions we are trying to address in a model monitoring setup. We discussed the first two on how to detect functionality degradation of a model in production, as well as how to detect applications of the model on non-optimal populations or even completely inapplicable zones. In this post, we will look at how to detect a change of learned relationships between the input variables and target for a supervised learning model, as well as how to discover new relationships to continuously improve the model.
3. Has the learned relationship changed?
Supervised learning models are trained to learn the statistical correlations between the input variables and the target. Taking our credit risk model again as an example, we are learning the correlations between the information we had at the time of loan approval and whether the borrower repaid in full at the end of the term. These correlations are not guaranteed to be static over time, and the model performance metrics can degrade if the learned relationship changes.
At a high level, we can periodically monitor performance metrics such as ROC AUC or accuracy on the observed outcomes of the application data. As discussed in the first blog post, the metrics can change as the input variables change their distributions, even if the learned relationships are static. Therefore, we compare the observed metrics against the kNN reweighted metrics from the benchmark sample (typically the training dataset). If the observed value is worse than the reweighted expectation, it is an indication that the relationship has changed.
To further monitor the relationship of individual input variables with the target, we also regularly carry out residual analysis. The residual is defined as the deviation between the prediction and the outcome. On the training dataset, we would expect the average residual to be close to zero across the range of every input variable. On the application data, we might detect a non-zero overall residual, which indicates a change of relationship. We can also identify which input variables have changed their relationship with the outcome, because the residuals will exhibit correlations with the values of those variables.
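A minimal sketch of this residual check, in plain Python, might look as follows. It takes the residual as prediction minus outcome, and uses equal-width bins and the Pearson correlation as the measure of association; those choices and the function names are assumptions made for illustration, not the author's actual implementation.

```python
from math import sqrt


def residuals(predictions, outcomes):
    """Residual = prediction minus observed outcome, per record."""
    return [p - y for p, y in zip(predictions, outcomes)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def mean_residual_by_bin(values, resid, n_bins=4):
    """Average residual within equal-width bins of one input variable."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for a constant variable
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for value, r in zip(values, resid):
        idx = min(int((value - lo) / width), n_bins - 1)
        sums[idx] += r
        counts[idx] += 1
    # Empty bins are reported as None rather than as a zero residual.
    return [s / c if c else None for s, c in zip(sums, counts)]
```

On the training data, both the per-bin averages and the correlation should sit near zero for every input. A variable whose bins drift away from zero on application data, or whose values correlate with the residuals, is a candidate for a changed relationship.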
To mitigate the model degradation, we should try to find out the cause of the relationship change. Sometimes it is due to a change in the meaning of an input variable, e.g. reduced spending during the pandemic. Such a change would typically also trigger the distributional change monitoring described in the previous blog post. Other times, it is merely a quantitative drift of the relationship between the inputs and the outcome. For example, as lenders scale back in anticipation of a recession, borrowers relying heavily on credit (a typical input variable) might be more likely to have cash flow problems (and hence fail to repay).
If we believe that the statistical correlation will be stable going forward, we can re-train the model completely with data collected after the change. Alternatively, we can derive an adjustment function of the affected variable based on the residuals. While the former is straightforward in most cases, the latter requires less data and minimizes the statistical variance from the existing model.
If instead we think the variable is going to be too volatile in the future, the easiest solution might be a refit with this variable excluded. This would lead to a slight loss of model performance, but it can be achieved with just the original training data.
4. Is there any new signal emerging?
New causative factor
The change of statistical correlation between the model inputs and the outcome might also be spurious if it’s just the confounding effect of a new causative factor not accounted for in the model. After all, machine learning models are just learning the statistical correlations, while the input variables are often mere proxies of underlying causes of the outcome. Instead of looking for the causation of model degradation from the input variables, sometimes we might find a more likely explanation among variables not included in the model yet.
That’s why we also run the residual analysis over our entire feature store, which contains thousands of variables as potential input to the model. If any of the variables has extra explanatory power, the residuals would exhibit a correlation with its value.
The complication here is that there is a lot of covariance among this large pool of variables (including existing model inputs). It’s likely that we will find a not-so-small collection of variables correlated with the residuals. Few of them, if any at all, would be the causative factor. It’s entirely possible that the causative factor is not observable to us, and we just have to find the best proxy for our statistical modeling.
Therefore, we’d typically go through an iterative feature selection among the variables detected by the residual analysis. In each round, we adjust the model by adding the variable with the most explanatory power over the residuals, before running the residual analysis again. Typically, we need only a few extra variables to eliminate the residuals over every dimension.
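The iterative loop above can be sketched as a greedy procedure. This is a simplified stand-in, not the author's method: "adjusting the model" is approximated by fitting a straight line of the residuals on the chosen variable and subtracting it, explanatory power is measured by absolute Pearson correlation, and the names `select_features`, `threshold`, and `tol` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares for ys = slope * xs + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def select_features(resid, candidates, threshold=0.3, max_rounds=5, tol=1e-9):
    """Greedily pick the candidate most correlated with the remaining residuals."""
    resid = list(resid)
    remaining = dict(candidates)
    selected = []
    for _ in range(max_rounds):
        # Stop once nothing is left to explain or no candidates remain.
        if not remaining or max(abs(r) for r in resid) < tol:
            break
        name = max(remaining, key=lambda k: abs(pearson(remaining[k], resid)))
        if abs(pearson(remaining[name], resid)) < threshold:
            break  # no remaining variable explains the residuals well enough
        xs = remaining.pop(name)
        slope, intercept = fit_line(xs, resid)
        # Remove the part of the residual explained by this variable, then repeat.
        resid = [r - (slope * x + intercept) for x, r in zip(xs, resid)]
        selected.append(name)
    return selected, resid
```

Re-running the correlation on the updated residuals in each round is what keeps highly covariant proxies from all being selected: once one of them has absorbed the signal, the others no longer correlate with what is left.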
Occasionally, the change of relationship might also be the result of changes in the underlying mechanism of the outcome, e.g. when payments did not happen because of the payment holidays lenders were obliged to offer during the pandemic. In other words, some borrowers stop paying not because they are unable to, but as a precaution against recession. Such a change would usually also manifest as a variation in the outcome distribution, which we also monitor closely.
In this scenario, we should review the definition of the outcome according to the business context. We might assume that all borrowers taking payment holidays are going to repay after the crisis and keep using the existing model to predict long-term insolvency. Alternatively, we might want to predict the take-up of payment holidays as another type of hazard, either by extending our model into a multi-class or multi-target one, or by training a separate model for the new outcome.
In these two blog posts, we have looked at the four types of questions that need to be addressed in machine learning model monitoring. Green lights on all of these questions would give us great assurance of business continuity. As the world is changing at a rapid pace, it is pragmatic for data scientists to anticipate some deterioration and be prepared to find mitigation quickly. I hope this article is useful for you, and I look forward to further discussions on this topic at the upcoming ODSC Europe in September.
Editor’s note: Dr. Jiahang Zhong is a speaker for ODSC Europe 2020. Check out his talk, “Can Your Model Survive the Crisis: Monitoring, Diagnosis and Mitigation,” there! In his session, he will share some experience of model monitoring and diagnosis from a leading UK fintech company.
About the author/ODSC Europe speaker:
Dr. Jiahang Zhong is the leader of the data science team at Zopa, one of the UK’s earliest fintech companies. He has broad experience in data science projects in credit risk, operational optimization, and marketing, with keen interests in machine learning, optimization algorithms, and big data technologies. Prior to Zopa, he worked as a PhD and Postdoctoral researcher on the Large Hadron Collider project at CERN, with a focus on data analysis, statistics, and distributed computing.