A Comparison of Topic Modeling Methods in Python


We compare three topic modeling methods in Python, using tools from the scikit-learn and gensim packages: (1) K-Means Clustering, (2) Latent Dirichlet Allocation, and (3) Non-negative Matrix Factorization. We apply all three methods to the same data set and describe the common preprocessing steps used in the analysis. We discuss the advantages and drawbacks of each method, focusing on the central question: how many topics do the documents in the data set contain?
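As a rough illustration of what fitting the three methods to one corpus can look like, here is a minimal sketch that uses scikit-learn only. The toy documents, the choice of two topics, the use of TF-IDF features for K-Means and NMF with raw counts for LDA, and the helper `top_terms` are all assumptions made for this example; they are not taken from the analysis itself.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular household pets",
    "my dog chased the neighbour's cat",
    "stock markets fell as investors sold shares",
    "the central bank raised interest rates again",
    "investors expect the bank to cut rates",
]

# Assumed value; every method below needs the number of topics up front.
n_topics = 2

# Shared preprocessing: tokenize, lowercase, drop English stop words.
tfidf_vec = TfidfVectorizer(stop_words="english")
count_vec = CountVectorizer(stop_words="english")
X_tfidf = tfidf_vec.fit_transform(docs)    # weighted features
X_counts = count_vec.fit_transform(docs)   # raw term counts

# (1) K-Means: each document is assigned to exactly one cluster.
kmeans = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(X_tfidf)

# (2) LDA: each document gets a probability distribution over topics.
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topic_lda = lda.fit_transform(X_counts)

# (3) NMF: each document gets non-negative topic weights.
nmf = NMF(n_components=n_topics, init="nndsvda", random_state=0, max_iter=500)
doc_topic_nmf = nmf.fit_transform(X_tfidf)


def top_terms(weights, vocabulary, n=3):
    """Return the n highest-weighted terms for each row (topic) of weights."""
    # Order the terms by their column index in the feature matrix.
    terms = sorted(vocabulary, key=vocabulary.get)
    return [[terms[i] for i in row.argsort()[::-1][:n]] for row in weights]


print("K-Means:", top_terms(kmeans.cluster_centers_, tfidf_vec.vocabulary_))
print("LDA:    ", top_terms(lda.components_, count_vec.vocabulary_))
print("NMF:    ", top_terms(nmf.components_, tfidf_vec.vocabulary_))
```

In this sketch `n_topics` has to be fixed before any of the three models is fitted, which is why the number of topics is the question that matters most when comparing them.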


