A Comparison of Topic Modeling Methods in Python


We compare three topic modeling methods in Python, using tools from the scikit-learn and gensim packages: (1) K-Means Clustering, (2) Latent Dirichlet Allocation, and (3) Non-negative Matrix Factorization. We show how each method can be applied to the same data set, together with the common preprocessing steps involved. We discuss some of the advantages and drawbacks of each method, concentrating on the central question: "How many topics are contained in the documents in the data set?"


Russell Martin is a data scientist in residence at the Data Incubator, where he instructs fellows, teaches online courses, and leads training courses with corporate partners. Russ lived and worked in the UK for 17 years, including at Warwick University and the University of Liverpool, where he taught in the Department of Computer Science. He holds a PhD in applied mathematics from the Georgia Institute of Technology.