- Challenges of topic modeling on microblogs
- 1. No common definition of what short-form text is.
- 2. Lack of context.
- 3. Need of extensive configuration
- 4. Developing bias in the model as a result of human interaction.
- 5. Need for extensive data pre-processing.
- 6. Vulnerability of overfitting.
- 7. Lack of consideration of abbreviations and slang.
- 8. Lack of gold standards in academic research leads to ambiguity of results.
- Final Thoughts
Challenges of topic modeling on microblogs
Short-form text is typically user-generated, defined by lack of structure, presence of noise, and lack of context, causing difficulty for machine learning modeling.
Topic modeling aims to identify patterns in a text corpus and extract the main themes, entities, or topics, depending on how they are referred to in a particular model.
Topic modeling is notoriously more challenging to do when the text is shorter. For instance, when the corpus consists of tweets as opposed to news articles.
In this article, I want to go over what are the main reasons why topic modeling algorithms tank on short, user-generated text. I compiled it as a result of a systematic literature review on topic modeling and sentiment analysis I did as part of my MSc.
1. No common definition of what short-form text is.
There exists no common definition of what short-form text is in academic literature.
This means that when reading papers about innovations in the area, you need to know that the scientists are working with datasets with different lengths of text, causing some models to perform better than others.
What are the most common types of data used in such studies:
- user product and service reviews
- UGC data from social media (Tweets, Reddit posts, and comments, Facebook posts, YouTube video comments — all of which vary greatly in their length limits)
- instant messages
- short message exchanges
- public forum comments
- news headlines
2. Lack of context.
The short text is challenging for the tasks of topic detection and sentiment extraction as it lacks contextual information, which leads to a problem of data sparsity.
As a result, general models such as bag-of-words become unsuitable for semantic analysis of short texts as they ignore the order and semantic relationships between words.
Even so, a review of text analysis studies in financial markets demonstrates that the bag-of-words approach is used in the majority of the reviewed sample as means of feature selection. While somewhat inefficient, some approaches remain popular in the academic community.
3. Need of extensive configuration
Currently, the topic model quality depends on manipulation and refinement, which is often manual and requires time-consuming fine-tuning of model parameters.
One of the most considerable challenges in topic modeling is the issue of configuration.
Prior to running a topic modeling algorithm, data pre-processing should occur, a step from which involves removing stop words and topic general words (TGWs). Topic-general word removal is typically done manually, hence challenging and time-consuming.
TGWs are problematic as they can alter the results of topic modeling as they are more probabilistic to occur in the corpus, thus more likely to be paired with other words, reducing the validity of word pair topics identified. Automation is possible, which can potentially improve the effectiveness of the topic model.
Li, Zhang, and colleagues propose the entropy weighting (EW) scheme, which is based on conditional entropy measured by word co-occurrences, combined with existing term weighting schemes. This can automatically reward informative words, resulting in assigning meaningless words to lower weights, improving topic modeling performance.
4. Developing bias in the model as a result of human interaction.
A 2019 study discusses how human interaction with topic models can also be considered another research challenge.
After doing two individual experiments with non-expert users, the scientists propose that human-in-the-loop topic modeling is developed as a form of mixed-initiative interaction, where the system and the user work collaboratively with the goal of topic model optimization.
5. Need for extensive data pre-processing.
Choosing efficiently the pre-processing technique is considered a research priority, with studies being devoted to the topic. These studies show through comparative analysis methods to improve the effectiveness of machine learning models using Tweets as data.
Exploration and preparation of data involve but are not limited to writing functions for filtration of noise from data, setting up the development environment, scaling, and encoding.
Such a process is commonly referred to as pre-processing and it broadly includes three main steps: term/object standardization, noise reduction, and word normalization, each of which consists of various text analysis operations that must be performed
Twitter and other social media also present a challenge of irrelevant data collected as part of the dataset, which impacts the performance of the model. This can be resolved by implementing additional steps in filtering the dataset from irrelevant entries.
6. Vulnerability of overfitting.
A 2018 study also argues that sentiment analysis using topic-level and word-level models is vulnerable to overfitting as a result of data sparsity.
This is a direct result of the data characteristics, namely the linguistic complexity of user-generated short texts and their irregularity.
7. Lack of consideration of abbreviations and slang.
Additionally, microblogging involves using flexible language, including abbreviations and slang as opposed to structured sentences.
This is unanimously considered more challenging than traditional text for algorithmic analysis.
Part of the challenges in language interpretation is also the use of sarcasm, imagery, metaphors, similes, humor, and figurative language, which relies on previous knowledge and/or context as they impact the accuracy of classification of both topics and sentiment.
8. Lack of gold standards in academic research leads to ambiguity of results.
The lack of gold standards and annotated data in the fields of topic modeling and sentiment analysis result in a reduction of the academic rigor of many studies due to subjectivity and ambiguity.
Annotation in itself is time-consuming and complex, which is why the majority of studies deploy unsupervised learning algorithms.
Final Thoughts
Recognizing the limitations of published research is considered vital for providing an accurate representation of current knowledge on the topic.
While there is a considerable number of significant studies that approach both topic modeling and sentiment analysis of short text there is still a need for model refinement and optimization to improve accuracy and optimize output.
Topic modeling can become a competitive advantage for businesses, seeking to utilize NLP techniques for improved predictive analytics, hence why understanding how to do it efficiently on user-generated text is a crucial step in social understanding.
The same kind of challenges can be met on short-form text in the context of SEO, such as titles, headings, anchor text, etc. If you’d like to learn more about the tools you can use for overcoming some of the challenges mentioned in this post, check out my guide on internal linking using machine learning.