TLDR – automatic text summarization of documents at scale


Imagine you are responsible for summarizing the information contained in tens of thousands of text documents. Perhaps they are product feedback, patent documents, or technical specs, and you face the daunting task of sifting through the entire pile. Natural language processing and machine learning can solve this problem by extracting text summaries programmatically.

In this talk, I will present an approach based on topic modeling. Specifically, we’ll use Latent Dirichlet Allocation (LDA) to generate short, human-readable summaries spanning a range of discovered topics. We’ll describe the algorithm in the context of summarizing public feedback on pending government regulations and demonstrate how it can be incorporated into an automated production pipeline. Along the way, we’ll cover a brief history of text summarization, introduce topic modeling, and discuss LDA, including other use cases and interesting visualizations.
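To make the idea concrete, here is a minimal sketch of topic-based extractive summarization. It is not the pipeline presented in the talk. It assumes collapsed Gibbs sampling for LDA inference and assumes that the "summary" for each topic is simply the document most dominated by that topic. The function names (`tokenize`, `fit_lda`, `summarize`), the stopword list, the hyperparameter values, and the sample comments are all hypothetical and chosen for illustration only.

```python
import random
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "on",
             "this", "that", "will", "it", "are", "be", "our", "we", "from"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on a list of token lists.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

def summarize(texts, n_topics=2, **lda_kwargs):
    """Return one (top_words, representative_text) pair per topic."""
    docs = [tokenize(t) for t in texts]
    doc_topic, topic_word = fit_lda(docs, n_topics, **lda_kwargs)
    summary = []
    for k in range(n_topics):
        # Representative document: largest share of tokens assigned to topic k.
        best = max(range(len(texts)),
                   key=lambda d: doc_topic[d][k] / max(len(docs[d]), 1))
        top_words = [w for w, c in topic_word[k].most_common(3) if c > 0]
        summary.append((top_words, texts[best]))
    return summary

# Made-up public comments, for illustration only.
comments = [
    "The proposed rule will raise compliance costs for small businesses.",
    "Small businesses cannot afford the compliance costs of this rule.",
    "Cleaner air will improve public health in our city.",
    "Public health benefits from cleaner air and lower pollution.",
]

for words, text in summarize(comments, n_topics=2):
    print(", ".join(words), "->", text)
```

Each printed line pairs a topic's most frequent words with the comment chosen to represent it. For collections of realistic size, a library implementation of LDA (for example in scikit-learn or gensim) would likely replace the hand-written sampler, and the representative text could be selected at the sentence level rather than the document level.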


Guilherme is a Data Scientist at Dataiku. He works out of the headquarters in NYC where he helps customers build and deploy predictive applications. Before joining Dataiku, he was a fellow at the Insight Data Science Fellowship program, and prior to that he worked in quantitative finance. He holds a PhD in applied mathematics.

Open Data Science




