Let me tell you a stereotypical data story, common in many industries today and simplified here for brevity. “Corey” is a fresh graduate from a great data school, hired right away by a company with lots of money and lots of data. Great, right? Corey is young but already an experienced machine learning practitioner who loves deep learning and natural language processing. The company needs his help to extract value from all the text data sitting in its massive data lake. So what about this text data?

We could be talking about doctors’ diagnostic reports, customer care emails, or attached messages of wire transfers. However, for the sake of this argument, we actually do not care. What we do care about is that Corey cannot train any supervised models simply because the data are not labeled.

If we took the healthcare example, we could say that all Corey has is the diagnostic report texts but no disease type attached, or in the case of the financial example he has wire transfer messages but no fraud label available.

Whatever the use case, Corey cannot train a document classifier, complex or simple, unless all those documents are first labeled manually by a domain expert, and in a short amount of time. We are talking about thousands if not millions of confidential documents that might require deep domain knowledge to label. And domain knowledge is expensive. What now?

Corey is not defeated yet, because he has heard about active learning, an old strategy that can be used to train his super deep RNN model or, really, any supervised model, even a simple logistic regression. Active learning obtains the training labels by asking the expensive domain expert to label only a subset of the data, typically the documents the current model is least certain about. Deep learning models need a larger manually labeled subset, but that is still better than labeling the entire dataset.
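
To make the idea concrete, here is a minimal sketch of a pool-based active learning loop. It is not the application from the talk. It assumes uncertainty sampling as the query strategy, a tiny hand-written logistic regression as the model, and synthetic 2D points standing in for documents. The `expert` function is a hypothetical stand-in for the human domain expert, and all function names are illustrative.

```python
import math
import random


def predict_proba(w, x):
    """Probability of class 1; w holds one weight per feature plus a trailing bias."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a tiny logistic regression with plain stochastic gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                w[j] -= lr * err * xj
            w[-1] -= lr * err
    return w


def most_uncertain(w, candidates, k):
    """Indices of the k candidates whose predicted probability is closest to 0.5."""
    ranked = sorted(range(len(candidates)),
                    key=lambda i: abs(predict_proba(w, candidates[i]) - 0.5))
    return ranked[:k]


def active_learning_loop(pool, oracle, seed_idx, rounds=5, batch=4):
    """Alternate between training and asking the oracle for a few more labels."""
    labeled = {i: oracle(pool[i]) for i in seed_idx}
    for _ in range(rounds):
        w = train_logreg([pool[i] for i in labeled], list(labeled.values()))
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        picks = most_uncertain(w, [pool[i] for i in unlabeled], batch)
        for p in picks:
            idx = unlabeled[p]
            labeled[idx] = oracle(pool[idx])  # the only place the expert is consulted
    # Final retrain on everything the expert has labeled so far.
    w = train_logreg([pool[i] for i in labeled], list(labeled.values()))
    return w, labeled


# Synthetic stand-in for a pool of unlabeled documents (assumption: 2 numeric features).
rng = random.Random(0)
pool = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]


def expert(x):
    """Hypothetical domain expert: knows the true rule, but is expensive to ask."""
    return 1 if x[0] + x[1] > 0 else 0


weights, labeled = active_learning_loop(pool, expert, seed_idx=[0, 1, 2, 3])
accuracy = sum((predict_proba(weights, x) > 0.5) == bool(expert(x)) for x in pool) / len(pool)
print(f"labels requested: {len(labeled)} of {len(pool)}, accuracy: {accuracy:.2f}")
```

The expert is consulted only for the seed items and for the handful of points the model is least sure about in each round, so the number of requested labels stays a small fraction of the pool. In a real setting, the call to `oracle` would be replaced by an interactive labeling step.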

Corey thus needs a web-based interactive application where the domain expert can provide labels in small doses, i.e. just for the critical data.

The thing is that “usually,” creating an active learning application requires skills outside the typical data scientist’s skillset. Neither Python nor R will, on its own, get you a complex web application in which frontend interactivity and backend model training are tightly combined. Instead of labeling documents for months, you find yourself shouting at full-stack developers for years. I said “usually,” didn’t I?

Whether you are a “Corey” or not, join my talk on October 31 about active learning. I will show my free and open-source blueprint guided analytics application, which you can download to train a document classifier starting with no labels.

Editor’s Note: Interested in learning about the problem of: lots of data, no labels? See Paolo’s talk “Guiding AI to Generate the Labels we do not have with Active Learning” at ODSC West 2019, Thursday, October 31, from 12:00 pm-12:45 pm.

Originally posted on OpenDataScience.com