Natural Language Processing (NLP) is a part of many day-to-day applications we use. Popular depictions of NLP are full of algorithms, state-of-the-art neural network architectures, and so on. While that is not far from reality, it gives an incomplete picture.

A typical NLP system development pipeline may look like the figure below:

[Figure: a typical NLP system development pipeline]

(source: Figure 2-1, chapter 2 of Practical Natural Language Processing, O’Reilly, 2020)
As the figure shows, the starting point of any NLP system is data. However, in many research and real-world scenarios, when we encounter a new problem, we don’t have such ready-made datasets that suit our needs. What should we do, then? What kind of data do we really need for NLP, anyway? How do we collect such data?

What kind of data do we really need for NLP? Different kinds of NLP systems need different kinds of data. Sometimes, all we need are large collections of documents without any additional information (e.g., for tasks such as language modeling or topic modeling). But in many cases, we need large collections of labeled data, i.e., input → output pairs. Here are some examples:

  1. sentence → translated sentence pairs (machine translation)
  2. spam/non-spam emails (an example of text classification)
  3. question–answer pairs (question answering)
  4. sentence → the names of entities in it, the relations between them, etc. (information extraction)
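To make the input → output idea concrete, here is a minimal sketch of what labeled data for a spam-classification task might look like in Python. The examples and the `labeled_data` name are hypothetical illustrations, not a real dataset:

```python
# Hypothetical toy labeled dataset for spam classification:
# each example is an (input text, output label) pair.
labeled_data = [
    ("Win a FREE prize, click now!!!", "spam"),
    ("Meeting moved to 3pm tomorrow", "not_spam"),
    ("Lowest prices on meds, buy today", "spam"),
    ("Can you review my draft by Friday?", "not_spam"),
]

# Split the pairs into model inputs and target labels.
texts = [text for text, label in labeled_data]
labels = [label for text, label in labeled_data]
print(len(texts), labels.count("spam"))  # → 4 2
```

Real datasets follow the same shape, just with thousands or millions of such pairs.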

Okay, so we really need labeled data for NLP. But what do we do when we encounter a new NLP problem and have no data to start with? Cloud service providers such as Google, Microsoft, Amazon, and IBM offer a wide range of NLP services, and they are always a good starting point to explore. If one of their offerings meets our needs, we may not have to build our own NLP pipeline at all. Often, however, the problems we encounter are custom problems that need custom solutions, which brings us back to the question: how do we start collecting data? Let us look at some paths one can take:

  • Use publicly available resources: Such datasets are sometimes shared by other companies or research groups. Depending on availability, we can also scrape data from the web that suits our needs. 
  • Product intervention: The AI team can work together with the product team to collect real-world data that suits their needs.  
  • Manual data annotation: You can set up small data-collection experiments in which a group of annotators (one to three people, or more) labels your data manually, or opt for crowdsourcing, depending on the nature of your data-sharing arrangements. 
  • Automatic data labeling: One can “bootstrap” a labeled dataset using domain knowledge and heuristics, by writing labeling functions that create noisy/imperfect labels. There are methods for learning efficiently from such noisy data. 
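The automatic-labeling idea can be sketched in plain Python, without any special library: each labeling function encodes one heuristic and may abstain, and a simple majority vote combines their noisy votes. The keywords and function names below are hypothetical illustrations, not a production rule set:

```python
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

# Each labeling function encodes one domain heuristic and
# abstains when the heuristic does not apply.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN

def lf_mentions_meeting(text):
    return NOT_SPAM if "meeting" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_mentions_meeting]

def weak_label(text):
    """Combine noisy votes by majority; None means no heuristic fired."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return None
    return max(set(votes), key=votes.count)

print(weak_label("Get your FREE gift now!!!"))  # → 1 (spam)
print(weak_label("Weekly meeting agenda"))      # → 0 (not spam)
print(weak_label("How are you?"))               # → None (still unlabeled)
```

The labels produced this way are noisy, which is exactly why the weak-supervision methods discussed below exist: they model the accuracy and conflicts of such heuristics rather than trusting any single one.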

What do you do once you have a small amount of labeled data?

Often, the data created by the means described above is not large enough to build a sophisticated model on its own. So, what are our options?

  • Weak supervision creates noisy labeled data through heuristics and learns a reliable prediction model from it. Snorkel is a popular tool for training models with weak supervision, and here is an article in which its creators explain how and why weak supervision works. 
  • Transfer learning takes an existing model trained on some other, similar task and adapts it to the current data/problem. HuggingFace is a popular NLP library with some easy-to-use transfer learning options. Here is an article in which they explain how it works.
  • Semi-supervised learning uses a small amount of labeled data to build a “not so good” model, then iteratively retrains it with more and more data taken from the most confident predictions of the previous model. Here is a recent article on how Google uses semi-supervised learning “at scale”.
  • Active learning puts a human in the loop: some of the model’s predictions are sent for human labeling, and model performance gradually improves. I found the book by Robert Munro very useful on this topic. Here is an article on how Amazon uses active learning to improve Alexa’s performance. 
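The semi-supervised "self-training" loop described above can be sketched end to end in plain Python. The word-overlap "model" and confidence threshold here are deliberately crude, hypothetical stand-ins for a real classifier; the point is the loop structure: train on the labeled seed, pseudo-label only the confident predictions, retrain, repeat:

```python
def tokens(text):
    return set(text.lower().split())

def train(examples):
    """Build a per-label vocabulary from (text, label) pairs."""
    vocab = {}
    for text, label in examples:
        vocab.setdefault(label, set()).update(tokens(text))
    return vocab

def predict(vocab, text):
    """Return (best_label, confidence) by word overlap — a crude
    stand-in for a real probabilistic model's top class and margin."""
    scores = {label: len(tokens(text) & words) for label, words in vocab.items()}
    best = max(scores, key=scores.get)
    others = [s for label, s in scores.items() if label != best]
    confidence = scores[best] - max(others) if others else scores[best]
    return best, confidence

# Small labeled seed set plus a pool of unlabeled texts (all hypothetical).
labeled = [("free prize click now", "spam"), ("team meeting at noon", "ham")]
unlabeled = ["claim your free prize", "meeting notes attached", "hello there"]

# Self-training loop: pseudo-label confident predictions, retrain, repeat.
for _ in range(2):
    model = train(labeled)
    still_unlabeled = []
    for text in unlabeled:
        label, confidence = predict(model, text)
        if confidence >= 2:  # keep only confident pseudo-labels
            labeled.append((text, label))
        else:
            still_unlabeled.append(text)
    unlabeled = still_unlabeled

print(sorted(label for _, label in labeled))  # → ['ham', 'spam', 'spam']
```

Note that "hello there" and the low-confidence "meeting notes attached" never get pseudo-labeled; refusing to label uncertain examples is what keeps the loop from amplifying its own mistakes.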

All of the above-mentioned strategies are commonly used in NLP research and application scenarios, along with state-of-the-art machine/deep learning methods. 

Here is a useful article by Eugene Yan on different strategies to address this “lack of labeled datasets” scenario, tackling more general use cases beyond NLP.

Editor’s note: Sowmya is a speaker for ODSC APAC 2021! Check out her talk, “How to do NLP When You Don’t Have a Labeled Dataset?” More on this session on working without readymade NLP datasets:

The lack of readymade NLP datasets is a common scenario in industry projects involving NLP. It is also a situation that researchers venturing into new problems or new languages often encounter. However, traditional textbooks, as well as tutorials and workshops, primarily focus on modeling and deploying models. In this workshop, I will introduce some strategies for creating labeled datasets for a new task and building your first models with that data. At the end of this session, participants should have some ideas for solving the data bottleneck in their organization. The target audience is data scientists, as well as those involved in requirements gathering for a given NLP problem.

About the author:

Sowmya Vajjala works as an NLP Researcher at the National Research Council, Canada, and is a co-author of “Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems,” published by O’Reilly Media (2020).