If you’re familiar with conspiracy theories regarding COVID-19 vaccines, Barack Obama’s birthplace, or Hillary Clinton’s ties to “Pizzagate”, you’ve experienced only a fraction of contemporary disinformation. With the rise of social media and its capacity for viral posts and fake news, disinformation campaigns have become a common strategy for undermining public trust and discrediting trustworthy sources. 

In academia and in industry, researchers are developing new methods for finding, classifying, and understanding the nature of disinformation. But what constitutes “disinformation”? And how is it different from terms like “misinformation” and “fake news”?

Disinformation: content that is verifiably incorrect and is intended to mislead or harm

Example: “5G causes coronavirus”

Misinformation: content that is incorrect, but may have been created or shared without knowing any better

Example: Sharing outdated statistics about COVID-19 deaths

Current technical approaches achieve mixed results when identifying disinformation online. Most methods perform well on curated data that has been thoroughly cleaned and standardized, but they lose accuracy when deployed in real-world settings on “messier” data. These approaches typically fall into three broad categories: language-based, machine learning, and network analysis.

Language-Based Approaches

These methods involve examinations of the text itself. They are not typically concerned with understanding the larger context surrounding potential sources of disinformation, but rather, they are interested in underlying structural patterns associated with false information. These techniques work to evaluate the syntax, parts of speech, or language complexity of a text in order to find similarities between types of documents – credible or otherwise.
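To make the idea of structural patterns concrete, here is a minimal sketch of surface-level feature extraction using only the Python standard library. The function name `stylometric_features`, the four features, and the sample sentence are all assumptions chosen for illustration; real language-based systems typically rely on richer signals such as part-of-speech tags or parse trees.

```python
import re

def stylometric_features(text: str) -> dict:
    """Compute a few surface-level features of a text (illustrative only)."""
    # Rough sentence split: break on whitespace that follows ., !, or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercased word tokens; an inner apostrophe keeps "don't" as one token.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words or not sentences:
        return {
            "avg_sentence_len": 0.0,
            "avg_word_len": 0.0,
            "type_token_ratio": 0.0,
            "exclamations_per_sentence": 0.0,
        }
    return {
        # Words per sentence: a crude proxy for syntactic complexity.
        "avg_sentence_len": len(words) / len(sentences),
        # Characters per word: a crude proxy for vocabulary difficulty.
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Unique words divided by total words: lexical diversity.
        "type_token_ratio": len(set(words)) / len(words),
        # Exclamation marks per sentence: a crude proxy for emotive tone.
        "exclamations_per_sentence": text.count("!") / len(sentences),
    }

# Invented sample text, not taken from any real dataset.
sample = "Wake up! They are hiding the truth! Share this now."
features = stylometric_features(sample)
for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

Features like these would normally be computed for many documents and then compared across credible and non-credible sources; on their own, they say nothing about whether a single text is true.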

Machine Learning Approaches

Machine learning approaches are popular and diverse, relying on tabular and unstructured data sets to train algorithms that classify or group texts. These techniques include deep learning models, decision trees, and clustering methods, and they require data tailored to particular types of texts and content to train the models adequately.
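As a rough illustration of what training a text classifier involves, the sketch below implements a small multinomial Naive Bayes model with add-one smoothing in plain Python. Naive Bayes is swapped in here because it fits in a few lines; it is not one of the techniques named above. The class name, the labels "suspect" and "credible", and the four training sentences are invented for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        # Number of training documents per label, used for the prior.
        self.doc_counts = Counter(labels)
        # Word frequencies per label, used for the likelihood.
        self.word_counts = {label: Counter() for label in self.doc_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Log prior: share of training documents with this label.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Invented toy data; far too small for any real use.
train_texts = [
    "5G causes coronavirus, share before they delete this",
    "secret cure they do not want you to know",
    "health agency publishes updated vaccination schedule",
    "study reports trial results for new vaccine",
]
train_labels = ["suspect", "suspect", "credible", "credible"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = model.predict("they do not want you to share this secret")
print(prediction)
```

A model trained on four sentences only memorizes their vocabulary, which mirrors the point above: a classifier is only as good as the data it was tailored to.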

Network Analysis Approaches

Network-based analyses are concerned with identifying who is producing and circulating certain narratives. This provides a way to examine how specific groups or digital communities respond to the dissemination of disinformation narratives. Because this approach requires a lot of contextual information about the network of interest, it can become difficult to track new sources of questionable texts. It does, however, allow for a better understanding of connections between entities on social media and makes it possible to trace information back to its source.
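The idea of tracing a narrative back to its source can be sketched with a small reshare graph. The example below assumes a hypothetical data format in which each record names an account and the account it reshared from; the account names and the helper functions `trace_to_source` and `reach` are invented for illustration, and real platform data is rarely this clean.

```python
from collections import defaultdict, deque

# Hypothetical reshare records: (account, account it reshared from).
# None marks an original post. All account names are invented.
reshares = [
    ("origin_a", None),
    ("user_1", "origin_a"),
    ("user_2", "user_1"),
    ("user_3", "user_1"),
    ("origin_b", None),
    ("user_4", "origin_b"),
]

# Upstream lookup: who did each account reshare from?
parent = {account: source for account, source in reshares}

def trace_to_source(account):
    """Follow reshare links upstream until reaching an original post."""
    seen = set()
    while parent.get(account) is not None:
        if account in seen:  # guard against cyclic records
            raise ValueError("cycle in reshare records")
        seen.add(account)
        account = parent[account]
    return account

# Downstream lookup: who reshared from each account?
children = defaultdict(list)
for account, source in reshares:
    if source is not None:
        children[source].append(account)

def reach(source):
    """Count the accounts downstream of source using breadth-first search."""
    queue = deque([source])
    visited = {source}
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return len(visited) - 1  # exclude the source itself

print(trace_to_source("user_3"))
print(reach("origin_a"), reach("origin_b"))
```

The sketch also shows why contextual information matters: without the "reshared from" field, neither the upstream trace nor the downstream reach could be computed.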

Combining Approaches: How can we apply these ideas?

When used independently, language-based, machine learning, and network analysis methods have seen some success. However, disinformation is complex and constantly evolving, requiring practitioners and researchers to develop new methods that can be just as adaptable. To do this, many new approaches are focused on combining existing techniques to form composite models and analysis pipelines, giving us the right tools to tackle highly niche problems.

In our upcoming talk at ODSC East (April 2021), “Narrative Extraction for Disinformation Detection,” we will outline one of many emerging ways to do this by sharing our strategy of combining deep learning classifiers with topic modeling to identify disinformation in articles and fake news. We plan to walk through our methodology and explore a use case where we examine articles about the 2020 U.S. Presidential Election.

While no single approach will perfectly identify disinformation, continuing to develop new ways of thinking about and detecting it can only be helpful for maintaining media platforms that are dedicated to providing reliable information.

About the authors/ODSC East 2021 speakers:

Amber Chin is a Machine Learning Engineer at Novetta and a student at the University of Texas at San Antonio. Her primary interests are in natural language processing and applying deep learning to analyze digital communication. Prior to starting at Novetta, Amber worked in UTSA’s Digital Politics Studio examining the effects of political incivility on social media platforms and developing methods for detecting uncivil language online. Amber will graduate from UTSA this spring with Bachelor’s degrees in Psychology and English.

Carlos Martinez is a Machine Learning Engineer at Novetta and a graduate student in Computational Linguistics at Montclair State University. His primary interests are in natural language processing, linguistics, and voice application development. Prior to working at Novetta, Carlos worked as a voice app developer for Voicefirst Tech and as an NLP research assistant for the Montclair State University research lab detecting censored language on Chinese social media.