Text-Mining---Cluster-Analysis

This project uses several text mining techniques to assess whether Shakespeare was a single genius or a team of playwrights.

Development setup

Several libraries are used throughout the project.

Sentiment Analysis libraries:

pip install textblob

pip install vaderSentiment

Text pipeline and NLP packages

pip install --user -U nltk

Cluster analysis

pip install -U scikit-learn

LDA Model

pip install gensim

pip install pyldavis

pip install pprintpp

The Model

To determine whether Shakespeare was a single genius or a team of playwrights, we applied several tools learned in class to inspect his masterpieces.

To begin with, we checked whether there was any particular sentiment pattern, both within each text and across all of the pieces. We applied TextBlob to each line, used its polarity score to label the line Positive, Negative or Neutral, and calculated the proportion of each sentiment in each piece. The results are shown in a bar plot: the “Neutral” sentiment dominates, and none of the texts exhibits a clear “Positive” or “Negative” tendency.
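The proportion step can be sketched as follows. This is a minimal illustration, not the project's notebook code: it assumes the polarity scores have already been computed (for example with `TextBlob(line).sentiment.polarity`), and the function names and the zero threshold are assumptions made for the example.

```python
from collections import Counter


def label_polarity(score, threshold=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"


def sentiment_proportions(scores, threshold=0.0):
    """Return the share of Positive, Negative and Neutral lines."""
    counts = Counter(label_polarity(s, threshold) for s in scores)
    total = len(scores)
    return {
        label: (counts[label] / total if total else 0.0)
        for label in ("Positive", "Negative", "Neutral")
    }


# Hypothetical polarity scores for five lines of one play.
example_scores = [0.0, 0.5, 0.0, -0.2, 0.0]
print(sentiment_proportions(example_scores))
```

Raising `threshold` widens the Neutral band, which is one way to check how sensitive the bar plot is to the labeling rule.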

Afterwards, we performed a similar analysis on each text using the VADER sentiment analysis tool. Once again, we see no clear pattern or tendency (please see the Jupyter code for the rest of the texts).

Since the first step of the analysis did not allow us to reach any conclusion, we decided to apply TF-IDF (Term Frequency-Inverse Document Frequency). This statistical technique translates Shakespeare's pieces into one large matrix by assigning each word a numeric weight that reflects how relevant it is to each text. This allows us to detect the most distinctive words in each text and then run a cosine similarity analysis. The cosine similarity between two pieces is closer to one the more similar they are. As can be seen from the cosine_similarity matrix, the results show no clear or distinctive patterns among the texts.
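To make the two steps concrete, here is a from-scratch sketch of TF-IDF weighting followed by cosine similarity, using only the standard library. It is an illustration under stated assumptions rather than the project's pipeline: the whitespace tokenizer, the smoothed IDF formula and the function names are all choices made for this example, and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build one TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents standing in for full plays.
docs = [
    "to be or not to be",
    "to be or not to be",
    "friends romans countrymen",
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # identical texts
print(cosine_similarity(vectors[0], vectors[2]))  # no shared words
```

Identical texts score one (up to floating-point rounding) and texts with no words in common score zero; real plays fall somewhere in between.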

We continued our analysis by running PCA on the matrix that resulted from TF-IDF. PCA is a linear dimensionality reduction method that builds orthogonal projections capturing most of the variability in the matrix under study. By plotting the projections on scatter plots, we were able to detect potential outliers. The PCA analysis suggests that the following plays are outliers:

• A Comedy of Errors

• Julius Caesar

• Antony and Cleopatra

• Titus Andronicus
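The idea of projecting onto the direction of greatest variability and looking for points that sit far from the rest can be sketched as below. This is a simplified illustration, not the project's code: it finds only the first principal component, uses power iteration on the covariance matrix instead of a library routine, and runs on hypothetical two-dimensional data.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue, provided
    # the starting vector is not orthogonal to it.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Projection of each centered row onto the component.
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores


# Hypothetical data: four nearby points and one far away.
data = [[1, 2], [2, 4], [3, 6], [4, 8], [20, 40]]
direction, scores = first_principal_component(data)
print(direction)
print(scores)  # the last point projects far from the others
```

On a scatter plot of such scores, a play whose projection lies far from the main group is a candidate outlier.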

Therefore, we ran K-means to test and confirm this evidence. K-means is an unsupervised learning method, used to look for patterns in data when there is no particular target feature or dependent variable. It is a simple and elegant approach for partitioning a data set into K distinct, non-overlapping clusters. The K-means problem is solved using Lloyd’s algorithm, which partitions the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible. Silhouette scores can be used to help evaluate the appropriate number of clusters in the data.
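Lloyd's algorithm alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The sketch below shows that loop in plain Python. It is an illustration only: taking the first K points as starting centroids is a simplification chosen to keep the example deterministic (library implementations typically use random or k-means++ initialization), and the data is hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, iterations=100):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    # Simplifying assumption: the first k points are the starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-dimensional data with two visible groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = lloyd_kmeans(data, k=2)
print(labels)
print(centroids)
```

Each pass can only lower (or keep) the total within-cluster variation, which is why the loop stops once the assignments no longer change; the result is a local optimum that can depend on the starting centroids.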

After running K-means on our dataset, we found that the number of clusters with the highest silhouette score (0.53) is three.
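For reference, the silhouette score compares how close each point is to its own cluster (a) with how close it is to the nearest other cluster (b), as (b - a) / max(a, b), averaged over all points. The following standard-library sketch computes it for a given labeling; the function names and the toy data are assumptions for illustration, and the 0.53 reported above comes from the project's own data, not from this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a point alone in its cluster contributes 0
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(euclidean(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical points with an obvious split into left and right pairs.
data = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(data, [0, 0, 1, 1]))  # sensible labeling, close to 1
print(silhouette_score(data, [0, 1, 0, 1]))  # poor labeling, negative
```

Choosing the number of clusters then amounts to running the clustering for several values of K and keeping the one whose labeling scores highest.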

We then applied hierarchical clustering. The result of hierarchical clustering is a tree-based representation of the objects, also known as a dendrogram. Each node represents a group.
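A bottom-up (agglomerative) version of this idea can be sketched in a few lines: start with every object in its own group, repeatedly merge the two closest groups, and stop when the desired number of groups remains, which is equivalent to cutting the dendrogram at that level. The sketch assumes average linkage and Euclidean distance, neither of which is stated by the project, and uses hypothetical points.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise distance between two clusters of point indices."""
    total = sum(
        euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_clusters(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # (left, right, distance): the dendrogram from the bottom up
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical points: two tight pairs and one distant point.
data = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerative_clusters(data, n_clusters=3)
print(sorted(sorted(c) for c in clusters))
```

The recorded `merges` list holds the same information a dendrogram displays: which groups were joined and at what distance.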

Both clustering techniques show similar results. We decided to explore these three clusters further and see whether each of them shows a distinctive writing pattern.

After looking at the dendrogram, we decided to re-run the K-means and hierarchical analyses, this time excluding the play “A Comedy of Errors”. Since K-means reported the highest silhouette score with three clusters, we cut the hierarchical tree into three clusters. Counting the excluded play as a cluster of its own, this second pass gave us four clusters, which we discuss further below.

Before doing that, we wanted to check whether there was any hint that these clusters were associated in time. To do this, we plotted the PCA projections and labeled each play with the year in which it was written according to Wikipedia. As we can see from the plots below, there does not seem to be any relation between the year in which the plays were written and the clusters we obtained above. However, if Shakespeare was indeed a single person, it is clear that he had a great talent, since the majority of the plays were written in a short span of time.

Finally, we realized that the plays within each cluster share common traits. Each cluster has a broad associated theme: tragedies, stories related to kings, love stories or comedies. This may suggest that Shakespeare was a group of people who each focused on a different writing style. To get deeper insight into this, I decided to run LDA on each of the clusters and check the dominant themes and words. The results can be seen below.

Cluster 1: Tragedy

['Hamlet', 'Coriolanus', 'Cymbeline', 'Antony and Cleopatra', 'King Lear', 'Othello', 'Troilus and Cressida', 'A Winters Tale', 'Henry VIII', 'Alls well that ends well', 'Measure for measure', 'Loves Labours Lost', 'Merry Wives of Windsor', 'As you like it', 'Merchant of Venice', 'Julius Caesar', 'Much Ado about nothing', 'Twelfth Night', 'Pericles']

Cluster 2: Kings

['Richard III', 'Henry V', 'Henry VI Part 2', 'Henry IV', 'Henry VI Part 3', 'Henry VI Part 1', 'Richard II', 'King John', 'Titus Andronicus', 'Timon of Athens', 'macbeth', 'The Tempest', 'A Midsummer nights dream']

Cluster 3: Love, arranged marriages

['Romeo and Juliet', 'Taming of the Shrew', 'Two Gentlemen of Verona']

Cluster 4: Comedy, short story

['A Comedy of Errors']

I could have explored further and excluded words that do not seem to add information to the LDA analysis, but due to time constraints I stopped the analysis here. A next step would be to check, within each text, whether there is any pattern among the characters of the texts that belong to the same cluster. That could be done by first identifying the lines of each character and then reorganizing the LDA analysis by character rather than by text. The result would show whether characters share common patterns, and could lead us to conclude whether writing patterns change among characters and clusters.

Articles

"Sentiment Analysis on the Texts of Harry Potter", Greg Rafferty, https://shorturl.at/dfinC
"Brazilian Heavy Metal: An Exploratory Data Analysis using NLP and LDA", Flávio Clésio, https://shorturl.at/lsGMU
https://www.nltk.org/install.html
https://scikit-learn.org/stable/modules/clustering.html#clustering
https://pypi.org/project/gensim/
https://pypi.org/project/pyLDAvis/
https://pypi.org/project/pprintpp/

Contributing

Fork this project
Create your feature branch
Commit your changes
Push to the branch
Create a new Pull Request
