Rock is not dead: NLP experiments on rock news articles

Overview

This repository contains multiple NLP experiments on rock news articles from the web. The text corpus comprises 20K unlabelled rock news headlines and descriptions (for demonstration purposes, a random subset of 1K articles is available in this repository). The data was retrieved between January 2022 and September 2023 from 6 specialized rock websites: Loudersound, Loudwire, Ultimate Classic Rock, Kerrang!, Planet Rock and The New York Times.

Table of Contents

Dictionary-based Named Entity Recognition
Rule-based text classification
Topic modelling experiments
- LDA model using Scikit-learn
- LDA model using Gensim
Rule-based text classification Vs. Machine Learning classification: final thoughts and further research
Visualization portfolio
References

Data architecture diagram

Dictionary-based Named Entity Recognition

Goals

The purpose of this model is to identify and extract the names of rock artists and rock artist members from the headlines and descriptions of the above-mentioned text corpus. A custom dictionary-based named entity recognition (NER) approach is tested. The pre-built dictionary is based on data that a web scraper gathered from Wikipedia lists of rock, metal and punk bands (the rock artist master data available in this repository is restricted to artists starting with the letter "A").

Challenges

One or more rock artist names and/or rock artist member names might be mentioned in a news headline and/or news description. Hence, the headline and description texts were combined before searching for rock artists and rock artist members. Additionally, only whole/compound words are matched to avoid inaccurate labelling. The pre-built dictionaries of rock artist and artist member names contain 38,663 records. Since the ultimate purpose is to assign a list of identified rock artist/artist member names to every single text of the corpus, performance is critical. Several methods were assessed, including vectorization, Flashtext, regex and a whole-word search approach proposed on Stack Overflow (question 5319922, user200783). The latter, in conjunction with a text preprocessing step that removes special characters and a set of rock artist/artist member names, proved to be the fastest and most effective.
Acronyms are used to mention some rock artists (A7X, RHCP, RATM, GN'R), and the definite article "The" is sometimes dropped from artist names that start with it. A stop word removal approach would work well for bands like "The Beatles" or "The Rolling Stones", but not for bands such as "The Who" (the name would be removed completely, as "The" and "Who" are both stop words) or "The Doors" ("Doors" would then match news articles about both "The Doors" and "Three Doors Down"). Moreover, popular songs or albums are often mentioned in the headlines and/or descriptions without reference to the rock artist. Misspellings of rock artist names were also identified in the headlines.
Every word of the news headlines from the websites Loudwire and Ultimate Classic Rock starts with a capital letter.
Band names like "Yes", "HIM", "Sweet" or "The Band" are also common words and lead to misleading labels. Additional text preprocessing actions were required to ensure accurate outcomes.
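The whole-word matching idea described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names (`preprocess`, `contains_whole_name`, `find_artists`), the exact character filter and the case-sensitive comparison are all assumptions made for the example.

```python
import re


def preprocess(text):
    """Replace special characters with spaces and collapse whitespace.

    Apostrophes are kept so that names such as "GN'R" survive (an assumption).
    """
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def contains_whole_name(text, name):
    """Whole-word test: pad both strings with spaces and check for a substring."""
    return f" {name} " in f" {text} "


def find_artists(headline, description, names):
    """Return the dictionary names found in the combined headline and description."""
    text = preprocess(f"{headline} {description}")
    return [name for name in names if contains_whole_name(text, preprocess(name))]


# Hypothetical mini-dictionary and article, for illustration only.
names = ["The Doors", "Three Doors Down", "Metallica", "Yes"]
found = find_artists(
    "Metallica announce new tour",
    "Support comes from Three Doors Down. Yesterday's show sold out.",
    names,
)
print(found)
```

Padding with spaces keeps "Yes" from matching inside "Yesterday" and "The Doors" from matching inside "Three Doors Down", which is the kind of partial match the whole-word constraint is meant to prevent.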

Top 50 rock artists and rock artist members

A closer look at the identified rock artist and rock artist members
All tables presented in this repository are based on stratified random samples of 10 news articles (website is the stratifying field).
The headline and the description were combined to perform the extraction of the named entities.

Headline

Description

(back to top)

Rule-based text classification

Goals

This rule-based text classification model is intended to identify keywords and assign both topic labels and publication type categories to the unlabelled rock news headlines. A set of pre-defined rules was manually created for this purpose. The core keywords of the rock news headlines' semantic landscape are the following: "album", "single", "song", "show", "tour" and "video". These keywords are the basis for deriving the classification rules and assigning human-readable contextualized tags.

Challenges

All semantically relevant keywords on which the set of classification rules is based must be kept in the cleaned text corpus when extracting common nouns and verbs. A function was designed for this purpose, combining the selection of the mentioned part-of-speech (POS) tags with a list of all relevant keywords.
Considering the target POS tags, the previously identified rock artist names were initially replaced by a unique word, "Bandname", to mitigate inaccuracies in the POS tagging performed at a later stage. The word "Bandname" was later removed from the text corpus.
Stemming was the most effective text normalization technique to prepare the text corpus for further processing. This was particularly significant when dealing with verb tenses. For example, as "think" and "say" are relevant keywords and irregular verbs, their past simple forms were replaced by the present simple.
A dictionary was created to ensure synonyms of relevant keywords are accurately standardized, considering the specific meaning these keywords take in the context of rock news. For instance, the verbs "drop", "unleash", "share", "premier" and "launch" are generally related to music releases, whilst "unveil" and "reveal" tend to be associated with announcements.
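The synonym standardization and keyword rules described above could look roughly like the sketch below. The synonym verbs and core keywords come from the text; the canonical forms ("release", "announce"), the label wording and the function names are hypothetical, and the input is assumed to be already stemmed, lower-cased tokens with the "Bandname" placeholder removed.

```python
# Synonyms mapped to a canonical keyword (canonical forms are assumptions).
SYNONYMS = {
    "drop": "release",
    "unleash": "release",
    "share": "release",
    "premier": "release",
    "launch": "release",
    "unveil": "announce",
    "reveal": "announce",
}

CORE_KEYWORDS = ("album", "single", "song", "show", "tour", "video")

# Hypothetical mapping from canonical verbs to label suffixes.
ACTION_LABELS = {"release": "release", "announce": "announcement"}


def normalize(tokens):
    """Replace each synonym by its canonical keyword."""
    return [SYNONYMS.get(token, token) for token in tokens]


def classify(tokens):
    """Assign a simple topic label from the first core keyword and action verb."""
    tokens = normalize(tokens)
    keyword = next((t for t in tokens if t in CORE_KEYWORDS), None)
    if keyword is None:
        return "diverse topics"
    action = next((t for t in tokens if t in ACTION_LABELS), None)
    return f"{keyword} {ACTION_LABELS[action]}" if action else keyword


print(classify(["drop", "new", "album"]))       # album release
print(classify(["unveil", "tour", "date"]))     # tour announcement
print(classify(["guitarist", "talk", "gear"]))  # diverse topics
```

A real rule set would need many more rules and priorities between keywords; the point here is only that standardizing synonyms first keeps the number of rules manageable.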

Alluvial diagram on the rock news articles

Generated keywords, topic labels and publication type categories by headline

(back to top)

Topic modelling experiments

Goals

Along with the rule-based text classification model, an unsupervised machine learning method for topic modelling was tested: Latent Dirichlet Allocation (LDA). Two models were developed using the Python libraries (1) Scikit-learn and (2) Gensim. A stratified 80/20 split-sample validation was implemented using the website as the stratifying field. The text preprocessing methodological choices have already been detailed in the Rule-based text classification chapter.
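For readers unfamiliar with stratified splitting, the sketch below shows the idea with the standard library only: each website contributes roughly 20% of its articles to the test set. The record layout, field name and function name are assumptions; in practice, Scikit-learn's `train_test_split` with its `stratify` argument is the more usual route.

```python
import random
from collections import defaultdict


def stratified_split(records, key, test_size=0.2, seed=0):
    """Split records into train/test so each stratum keeps the same test share."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    train, test = [], []
    for stratum in sorted(groups):
        items = list(groups[stratum])
        rng.shuffle(items)
        n_test = round(len(items) * test_size)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Hypothetical records: 10 Loudwire and 5 Kerrang! headlines.
records = [{"website": "Loudwire", "headline": f"headline {i}"} for i in range(10)]
records += [{"website": "Kerrang!", "headline": f"headline {i}"} for i in range(5)]

train, test = stratified_split(records, key="website")
print(len(train), len(test))
```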

Challenges

As the LDA algorithm is stochastic and its output differs on every run, the random state parameter was set to 0 to ensure the reproducibility of the Scikit-learn and Gensim LDA models.
The results obtained with term frequency-inverse document frequency (TF-IDF) weighting were not aligned with expectations in either model. Although TF-IDF is meant to scale down the impact of predominant tokens, the resulting topics were less coherent and comprehensible than those obtained from raw term frequencies.
The hyperparameters "n_components" and "learning_decay" of the Scikit-learn LDA model were optimized using grid search. The search range for "n_components" was restricted to values higher than 4 to ensure reliable interpretability of topics.
A Gensim Ensemble LDA was implemented to overcome the instability and non-reproducibility of the sklearn and "gensim lda"/"ldamulticore" approaches.
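The Scikit-learn choices listed above (raw term counts, a fixed random state and a grid search over "n_components" and "learning_decay") can be sketched as follows. The toy headlines, the candidate values, `max_iter`, `learning_method` and `cv` are assumptions chosen to keep the example small and fast; they are not the settings used in this repository.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy, already-preprocessed headlines (hypothetical data for illustration only).
headlines = [
    "release new album",
    "announce tour date",
    "share new single video",
    "play live show",
    "release song video",
    "announce album tour",
    "cancel show tour",
    "share live song",
]

# Raw term counts rather than TF-IDF weights.
counts = CountVectorizer().fit_transform(headlines)

# A fixed random_state makes repeated runs return the same topics.
lda = LatentDirichletAllocation(learning_method="online", random_state=0, max_iter=5)

# Only topic counts higher than 4 are searched, as described above.
param_grid = {"n_components": [5, 6], "learning_decay": [0.7, 0.9]}

# Without an explicit scorer, GridSearchCV ranks candidates with the
# estimator's own score(), i.e. the approximate log-likelihood.
search = GridSearchCV(lda, param_grid, cv=2)
search.fit(counts)

best_lda = search.best_estimator_
print(search.best_params_)
```

Note that the grid search optimizes the likelihood score only, so the selected model is not necessarily the most interpretable one.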

1. LDA model using Scikit-learn

Results

LDA evaluation model metrics in Scikit-learn
Perplexity and likelihood score are conventional performance metrics available in the Scikit-learn library to diagnose an LDA model. These statistical measures evaluate the predictive accuracy of a model on unseen texts. According to the available literature, the lower the perplexity, the more accurate the model. Conversely, a higher likelihood score indicates a better fit. However, there is no pre-defined threshold for what counts as a low perplexity score or a high likelihood score. According to Blei et al. (2003), perplexity tends to decrease as the number of topics increases. On the other hand, a study conducted by Chang et al. (2009) suggested there is no relationship between perplexity and human interpretation. As the chart below shows, the 5-topic model ranks second in terms of lower perplexity and higher likelihood score. This model was the most consistent with human interpretation, as shown by a manual validation of randomly selected topic assignments, despite recurrent inaccurate tagging. Surprisingly, the perplexity score only decreased in the models with 7, 9 and 12 topics.

Perplexity and Likelihood score over number of topics in the test set

Perplexity = 901.4
Likelihood score = -108,604.3
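The two metrics are closely related. The sketch below assumes that Scikit-learn's perplexity is the exponential of the negative per-token log-likelihood bound, i.e. exp(-score / number of tokens); this reflects our reading of the library and should be checked against the installed version. The function name and the example numbers are hypothetical.

```python
import math


def perplexity_from_score(log_likelihood, n_tokens):
    """Perplexity as exp of the negative per-token log-likelihood (assumed relation)."""
    return math.exp(-log_likelihood / n_tokens)


# Hypothetical numbers: a total log-likelihood of -600 over 100 tokens.
print(perplexity_from_score(-600.0, 100))
```

Under this relation, a higher (less negative) likelihood score on the same test set always corresponds to a lower perplexity, which is why the two metrics rank models in the same order.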

Interactive topic model visualization with pyLDAvis
To get a visual overview of the LDA model, the Python library "pyLDAvis", based on the R package "LDAvis" developed by Sievert C. & Shirley K. (2014), was used. The left panel of the visualization is intended to clarify both the prevalence of each topic of the model and the interconnection between topics. In this case, the left-hand side chart shows 5 big bubbles distributed across the quadrants and far apart from each other (an interactive version of the topic model visualization is also available). Such a visual representation is usually indicative of a good model. Despite this, as stated above, incorrect labelling was a critical issue, mainly for news articles assigned the category "diverse topics" by the manual rule-based classifier. To a certain extent, the topics generated by the LDA model portray the previously mentioned semantic landscape of the rock news headlines:

Topic 0: album and single releases;
Topic 1: tour announcement;
Topic 2: live performance and song related;
Topic 3: miscellaneous;
Topic 4: video and festival.

Topic relevance by headline in the test set

Labelling accuracy of the topic "album and single releases"

2. LDA model using Gensim

Replicability and instability are two major issues of topic modelling. The Ensemble LDA method aims to mitigate these issues by "finding and generating stable topics from the results of multiple topic models" and remove topics "that are noise and are not reproducible" (Řehůřek, 2022b). Additionally, there is no "need to know the exact number of topics ahead of time" (Řehůřek, 2022a).

Results

The Ensemble LDA returned 7 topics which represent the semantic landscape of the rock news headlines more effectively:

Topic 0: tour announcement;
Topic 1: live performance related;
Topic 2: single and video releases;
Topic 3: album announcement;
Topic 4: song related;
Topic 5: video and movie related;
Topic 6: artist death related.

Top 10 words by topic

Topic relevance by headline in the test set

LDA evaluation model metrics in Gensim
The UMass coherence score and perplexity were used to evaluate the Ensemble LDA model. Within the topic modelling context, "a set of statements or facts is said to be coherent, if they support each other" (Röder et al., 2015). In simple terms, coherence is the "humans’ semantic appreciation of a topic represented by its N top words" (Trenquier, 2018). The UMass coherence score relies on document frequency and considers the order among the top words of a topic (Röder et al., 2015). It peaks at 0, which means that topics are perfectly coherent. UMass was chosen over the C_V metric as the latter is not recommended "when it is used for randomly generated word sets" (Roeder, 2018). A UMass score of -14.9 was obtained, which indicates that topics are not fully coherent. However, the 7-topic model was the most consistent with human interpretation (significantly better than the LDA Scikit-learn model). Once again, a manual validation of randomly selected topic assignments was conducted to assess the accuracy of the models. Incorrect labelling was detected, mostly in news articles tagged with the manual rule-based category "diverse topics". The perplexity value (formula: 2^(-bound)) is considered acceptable as it is the lowest of the 3 models returned by the Ensemble LDA. In contrast to the LDA Scikit-learn model, the perplexity value consistently decreased as the number of topics increased.

Perplexity and Coherence score over number of topics in the test set

Perplexity = 163.2
UMass coherence score = -14.9
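To make the two metrics more concrete, here is a minimal sketch of a UMass-style coherence score and of the 2^(-bound) perplexity formula mentioned above. It follows one common formulation (average log of smoothed co-document frequency over document frequency, conditioning on the higher-ranked word); Gensim's implementation may differ in smoothing and aggregation details. The function names, the `epsilon` value and the toy documents are assumptions.

```python
import math


def umass_coherence(top_words, documents, epsilon=1e-12):
    """Average UMass-style score over ordered pairs of a topic's top words.

    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            scores.append(math.log((doc_freq(w_i, w_j) + epsilon) / doc_freq(w_j)))
    return sum(scores) / len(scores)


def perplexity_from_bound(per_word_bound):
    """Perplexity from a per-word log2 likelihood bound: 2^(-bound)."""
    return 2 ** (-per_word_bound)


# Hypothetical tokenized headlines.
docs = [["album", "tour"], ["album"], ["tour", "video"]]
print(umass_coherence(["album", "tour"], docs))
print(perplexity_from_bound(-7.0))
```

Words that always appear together score close to 0, while words that rarely share a document push the score towards large negative values, which is how a score of -14.9 should be read.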

Labelling accuracy of the topic "album announcement"

(back to top)

Rule-based text classification Vs. Machine Learning classification: final thoughts and further research

Natural language is intrinsically ambiguous in its lexical, semantic or syntactic form (Yadav et al., 2021). The text corpus of this study, composed of unlabelled rock news headlines and descriptions, clearly illustrates this ambiguity. Ambiguity brings additional subjectivity and complexity, as a news article can be assigned to multiple categories, which makes this a multi-label text classification challenge.
On the whole, the rule-based text classification model is more reliable, flexible, accurate and consistent with human interpretation. However, the analysis required to set up and maintain a manual rule-based text classifier is demanding and time-consuming. Furthermore, the manual rule-based text classifier developed here also generates inaccurate assignments.
Instability and non-reproducibility are two well-known issues of the LDA algorithm. With respect to reliability, flexibility and accuracy, the rule-based text classifier outperformed both unsupervised machine learning models tested. The diversity of the rock news headlines' semantic landscape was better captured by the LDA Gensim Ensemble model than by the LDA Scikit-learn model. Inaccurate assignments were recurrent in both LDA models but more critical in the Scikit-learn model.
A hybrid text classification method will be further developed from the generated labelled data of the rule-based text classification model.
Following the approach of M. Kelechava (2019), the LDA Gensim Ensemble model will be used as a basis for a machine learning supervised model.

Manual rule-based text classification Vs. Unsupervised Machine Learning classification
The alluvial diagram below is based on the test set.
Headlines whose main Sklearn or Gensim LDA topic had a relevance below 40% were categorized as "multi-category".
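The 40% rule can be sketched as a small helper. The function name and the example probabilities are hypothetical; the topic labels are those of the Scikit-learn model listed earlier.

```python
def main_topic_label(topic_probs, labels, threshold=0.40):
    """Return the label of the dominant topic, or "multi-category" if it is below the threshold."""
    best = max(range(len(topic_probs)), key=topic_probs.__getitem__)
    if topic_probs[best] < threshold:
        return "multi-category"
    return labels[best]


labels = [
    "album and single releases",
    "tour announcement",
    "live performance and song related",
    "miscellaneous",
    "video and festival",
]

# Hypothetical topic distributions for two headlines.
print(main_topic_label([0.70, 0.10, 0.10, 0.05, 0.05], labels))
print(main_topic_label([0.30, 0.30, 0.20, 0.10, 0.10], labels))
```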

(back to top)

Visualization portfolio

Metallica: the monster still lives
Infographic based on the text corpus

Tableau Public
The layout of the vizzes below has been set for desktop. On a phone or tablet, the dashboards may not display optimally.

(back to top)

References

(back to top)
