Abstract: Finding common topics discussed in a set of text responses is often performed using techniques that learn how often words occur together in a set of responses.
In real-world cases, the number of words in each response may be small, as when finding which common topics are discussed in social media comments that mention a particular brand. Short responses pose a problem for these traditional methods of topic modeling: if two words with similar meanings rarely appear together in the dataset, the model will not be able to learn that they represent a common topic. Here, using a pre-trained large language model (LLM) can help. Because LLMs are trained on far more text, they contain richer information about how words are typically used together than a limited dataset can provide. The LLM converts the responses into high-dimensional vector embeddings, without requiring any expensive re-training of the model. Once embeddings have been generated, a clustering algorithm like K-means or HDBSCAN can group the data into discrete sets of documents that share semantic similarity. Though measuring the distance between high-dimensional datapoints is easy, visualizing high-dimensional relationships is challenging. Luckily, there are several techniques for reducing high-dimensional data to a more digestible number of dimensions. In particular, the UMAP algorithm does a good job of capturing both global and local structure in a 2D or 3D reduction that can be easily plotted and inspected. In this talk, I will show how to find topics in brief text responses and create interactive visualizations of the results, using several free open-source Python packages.
Light familiarity with Python is enough: all of the tools will be introduced, and the Python code will not be complicated.
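To make the clustering step described in the abstract concrete, here is a minimal sketch of K-means grouping short responses by their embeddings. It is not the code from the talk. The texts and the 3-dimensional vectors are hypothetical, hand-made stand-ins for the high-dimensional embeddings a pre-trained model would produce. The farthest-point initialization is a simplification chosen to keep the example deterministic. It uses only the Python standard library.

```python
import math

# Hypothetical toy data: hand-made 3-D vectors standing in for the
# high-dimensional embeddings a pre-trained model would produce.
texts = [
    "love the new flavor",
    "tastes great",
    "delicious",
    "shipping took forever",
    "package arrived late",
    "slow delivery",
]
embeddings = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.10, 0.90, 0.10],
    [0.05, 0.85, 0.20],
    [0.15, 0.95, 0.00],
]


def farthest_point_init(points, k):
    """Start with the first point, then repeatedly add the point that is
    farthest from every centroid chosen so far (a deterministic simplification)."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return [list(c) for c in centroids]


def kmeans(points, k, max_iter=50):
    """Plain K-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until assignments stop changing."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


labels, centroids = kmeans(embeddings, k=2)
for text, label in zip(texts, labels):
    print(label, text)
```

With real data, the embeddings would come from a pre-trained model, and a library implementation of K-means or HDBSCAN would replace the hand-written loop. A reduction such as UMAP would be applied only for plotting.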
Bio: Matt Bezdek is a Senior Data Scientist at Elder Research. In his work, he empowers commercial clients to make better business decisions, with expertise in machine learning, forecast modeling, natural language processing, and visualization. He has a PhD in Cognitive Psychology from Stony Brook University and has conducted neuroimaging research at Georgia Tech and Washington University in St. Louis.