Best Practices for Data Annotation at Scale


The long-term success of machine learning relies on consistently labeled, high-quality data. While most machine learning initiatives begin in the lab, they take on a life of their own and can create significant challenges once they scale. ML practitioners and data ops managers can find themselves consumed by the logistics of data annotation and data management instead of focusing on the science.

Wherever you are in your team’s machine learning journey, it’s helpful to think about evolving towards large-scale production. Proactively planning a data process can generate progressively better results during development, but it requires some thought and stakeholder buy-in. A key ingredient of this journey is your data labeling and annotation framework.

In this talk, we describe best practices for building a scalable and repeatable data labeling pipeline with the right balance of tools and humans in the loop. Through collaboration with peers, managers, and machine learning experts, annotators refine their skills, mastering tasks traditionally beyond the expertise of crowdsourcing. In a collaborative framework, annotators and ML experts negotiate and create meaning through an iterative feedback process as they identify new concepts and nuances in the data. Concepts such as designing to break, edge-case knowledge management, and multiple-workflow management are discussed in detail.

A pipeline designed for human judgement and incremental training on edge cases can provide the last mile of acceptability needed to roll out a machine learning solution in production. We discuss the implications of an ongoing production environment where data is live and can significantly impact the customer experience. We also outline upcoming trends and challenges in combining humans with the machine learning pipeline.


Jai Natarajan is the Vice President, Strategic Business Development at iMerit, a global AI data solutions company delivering high-quality data that powers machine learning and artificial intelligence applications for Fortune 500 companies. Bringing more than 24 years of experience, Jai works with more than 5,500 data experts who label and enrich data at scale to help customers get better results from their machine learning algorithms. Jai works with iMerit’s partner ecosystem to develop iMerit’s solutions for its customers and provides strategic input to the company.

Previously, Jai worked at Lucasfilm and Sony, and founded Xentrix, an Emmy-winning animation studio. He is a board member of the Anudip Foundation.

Jai has an M.S. in Computer Science from UCLA and undergraduate degrees from the Birla Institute of Technology and Science.

Open Data Science




