Creating Data to Enable Multilingual AI: What Can Go Wrong and Ways to Mitigate It


Multilingual data is essential for enabling global conversational AI, with a broad range of use cases in operations, production, and supply chain management. As businesses expand their global reach, they need to collect and develop data for AI systems that understand text and speech in multiple languages. Developing this data, however, requires staying mindful of everything that can go wrong along the way.

In this session, Olga Beregovaya, VP, AI Innovation at Welocalize, Inc., will discuss what can go wrong when creating data to enable multilingual AI, as well as ways to mitigate it.

Problems that can arise:

- Language-wise: a dataset can be limited to certain demographics (ethnicity, age, location, and education level), which will significantly limit customer engagement with your conversational AI product or voice search engine. When developing for local audiences, you can also introduce or carry over cultural phenomena from your English dataset that are not relevant to that geography.

- Engineering-wise: developing datasets for various locales raises multiple challenges. These include code-switching, multiple glyph sets within a single language, issues specific to bi-directional languages, homophones, and the handling of user errors such as repeated words and typos in text datasets, as well as challenges specific to voice datasets. Many other challenges arise with each iteration of model training and may not be possible to predict.

- Bias and Inclusion-wise: your models may produce gender, racial, age, and other bias, a phenomenon that has recently received a lot of attention because of the issues it has caused. However, several techniques for managing your data and tweaking your algorithms can help you control and reduce this bias.
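
To make a few of the engineering issues above concrete, here is a minimal Python sketch of the kind of checks one might run over text utterances before training. It is an illustration only, not a method described in the session: the function names (`scripts_in`, `has_rtl`, `collapse_repeats`, `audit`) are hypothetical, and the script detection is a rough heuristic that assumes the first word of a character's Unicode name identifies its script.

```python
import re
import unicodedata


def scripts_in(text):
    """Return the set of script labels found among the letters of `text`.

    Heuristic (an assumption of this sketch): the first word of a
    character's Unicode name, e.g. "LATIN" or "CYRILLIC", is its script.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def has_rtl(text):
    """True if any character has a right-to-left bidirectional class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def collapse_repeats(text):
    """Collapse immediately repeated words ("play play" becomes "play")."""
    return re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)


def audit(utterance):
    """Summarize the checks above for one utterance."""
    scripts = scripts_in(utterance)
    return {
        "scripts": sorted(scripts),
        "mixed_script": len(scripts) > 1,  # may indicate code-switching or several glyph sets
        "has_rtl": has_rtl(utterance),
        "deduplicated": collapse_repeats(utterance),
    }
```

A flag such as `mixed_script` only marks an utterance for review. Some languages legitimately mix several glyph sets, and an immediately repeated word can be intentional, so these checks are better suited to routing data to human annotators than to filtering it automatically.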

The good news is Olga will discuss various ways to mitigate these problems based on solutions Welocalize has developed over the course of collaborating with its clients’ data scientists.


A seasoned professional with over 20 years of leadership experience in language technology, NLP, ML, and AI data generation and annotation, Olga is the VP, AI Innovation at Welocalize. She is passionate about growing business by driving change and innovation, and is an expert in building things from scratch and bringing them to measurable success. Olga has experience on both the buyer and the supplier side, giving her a unique perspective on establishing strategic buyer/supplier alliances and designing cost-effective Global Content Lifecycle Programs. She has built and managed global production, engineering, and development teams of up to 300 members specializing in NLP and broader ML and AI.

Open Data Science




