Abstract: Distributed representations of words have proven successful in addressing the drawbacks of symbolic representations, which treat words as atomic units of meaning. A symbolic representation treats each word as an island, unable to capture similarity and relatedness between words. Word representations such as Word2Vec, GloVe, and fastText, on the other hand, build on distributional semantics and learn compressed representations of words by accounting for contextual information. These representations are able to learn meaningful analogical and lexical relationships, which makes them a popular choice in downstream NLP tasks.
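To make the analogy property concrete, here is a minimal sketch with tiny, made-up vectors (real Word2Vec/GloVe embeddings have hundreds of dimensions and are learned from corpora; the words and values below are hypothetical, chosen only to illustrate the vector-arithmetic idea):

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" for illustration only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
    "apple": np.array([0.1, 0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # "queen" with these toy vectors
```

Libraries such as gensim expose the same idea through nearest-neighbor queries over learned embedding tables.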
While these representations are useful, they come with their own set of drawbacks. For instance, complete reliance on natural language corpora amplifies the vocabulary bias inherently present in datasets. Vocabulary bias shows up as skewed word usage: some words, often morphologically complex ones, are used less frequently than other words or phrases with the same meaning. For example, the word destroy is more likely to be used than annihilate. This tendency leads to embeddings that model less frequent words inaccurately.
Another drawback of word embeddings is their inability to handle polysemy. Polysemy is an important feature of language in which words take on different meanings depending on the context they occur in. For example, the word bank can refer to a financial institution or to the land on either side of a river. Word embeddings assign a single vector representation to a word type, irrespective of polysemy. A large amount of work has already gone into developing word sense disambiguation systems that identify the correct sense of a word based on its context. The availability of disambiguation systems, coupled with the growing reliance of NLP systems on distributional semantics, has led to increasing interest in obtaining powerful sense representations.
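The type-level limitation can be sketched in a few lines, using a hypothetical one-entry embedding table: a static lookup returns the identical vector for bank in both senses, so the downstream model sees no difference between the two contexts:

```python
import numpy as np

# Hypothetical static embedding table: exactly one vector per word *type*.
emb = {"bank": np.array([0.3, 0.7])}

def embed(sentence):
    """Look up each known word; context plays no role in the lookup."""
    return [emb[w] for w in sentence.split() if w in emb]

v_finance = embed("deposit money at the bank")[0]
v_river = embed("sat on the river bank")[0]
print(np.array_equal(v_finance, v_river))  # True: one vector, both senses
```

Sense representations, by contrast, would map the two occurrences to different vectors once a disambiguation step has assigned each occurrence a sense.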
In this post, we talk about our work, which focuses on solving both of these problems inherent in embedding spaces.
Bio: Sanjana started off leveraging machine learning algorithms for political data, where she worked on drawing inferences from expenditure data for the presidential election cycle. She received her Master's in Computer Science with a specialization in Artificial Intelligence and spent a year on her thesis, which used neural networks to identify writers from their handwriting. Sanjana now works on Conversational AI, researching and developing NLP techniques at Mya Systems. Having applied machine/deep learning to political and forensic data, she intends to identify and work on unique problems that can be solved with deep learning and machine learning.