Text-Mining---Cluster-Analysis

This project uses several text mining techniques to assess whether Shakespeare was a single genius or a team of playwrights.

Development setup

The project relies on several libraries.

  • Sentiment Analysis libraries:

pip install textblob

pip install vaderSentiment

  • Text pipeline and NLP packages

pip install --user -U nltk

  • Cluster analysis

pip install -U scikit-learn

  • LDA Model

pip install gensim

pip install pyldavis

pip install pprintpp

The Model

To determine whether Shakespeare was a single genius or a team of playwrights, we applied several tools learned in class to examine the masterpieces.

To begin with, we checked whether there was any particular sentiment pattern, both within each text and across all of the plays. We applied TextBlob to each line and, using its polarity score, calculated the proportion of lines in each play that fall into each sentiment class (Positive, Negative, Neutral). The results are shown in a bar plot; the “Neutral” sentiment dominates and none of the texts exhibits a clear “Positive” or “Negative” tendency.
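The per-play proportion step can be sketched as follows. This is a minimal illustration, not the notebook's code: `label_polarity`, `sentiment_proportions`, the zero threshold, and the tiny `toy_polarity` lexicon scorer are all assumptions introduced here so the example runs without extra dependencies.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

LABELS = ("Positive", "Negative", "Neutral")


def label_polarity(score: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"


def sentiment_proportions(
    lines: Iterable[str],
    polarity_fn: Callable[[str], float],
    threshold: float = 0.0,
) -> Dict[str, float]:
    """Return the share of lines that fall into each sentiment class."""
    counts = Counter(label_polarity(polarity_fn(line), threshold) for line in lines)
    total = sum(counts.values())
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}


# Hypothetical stand-in scorer (a two-word lexicon), used only to keep the
# sketch self-contained; a real run would plug in a sentiment library here.
POSITIVE_WORDS = {"love", "sweet"}
NEGATIVE_WORDS = {"hate", "death"}


def toy_polarity(line: str) -> float:
    words = line.lower().split()
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    return (pos - neg) / max(1, len(words))


sample_lines = ["I love thee", "I hate thee", "Enter Hamlet", "Exit"]
props = sentiment_proportions(sample_lines, toy_polarity)
print(props)
```

With TextBlob installed, `polarity_fn` could be `lambda s: TextBlob(s).sentiment.polarity`. For VADER, `SentimentIntensityAnalyzer().polarity_scores(s)["compound"]` with a threshold of 0.05 is a common choice, though the thresholds actually used in the notebook are not stated here.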

Afterwards, we performed a similar analysis on each text using the VADER sentiment analysis tool. Once again, we see no clear pattern or tendency (see the Jupyter notebook for the remaining texts).

Screen Shot 2021-04-20 at 7 26 28 PM

Screen Shot 2021-04-20 at 7 28 39 PM

Since the first step of the analysis did not allow us to reach any conclusion, we decided to apply TF-IDF (Term Frequency-Inverse Document Frequency). This statistical technique translates each play into a numeric vector by weighting every word according to how often it appears in that play and how rare it is across the others; stacking these vectors produces one large matrix. This should allow us to detect the most distinctive words in each text and then run a cosine similarity analysis. The cosine similarity between two plays is closer to one the more similar they are. As can be seen from the cosine_similarity matrix, the results show no clear or distinctive patterns among the texts.
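A minimal sketch of the TF-IDF and cosine similarity step, assuming scikit-learn's `TfidfVectorizer` and `cosine_similarity`. The three toy documents and the English stop-word setting are placeholders chosen for illustration; the notebook's actual preprocessing may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the full play texts (hypothetical content).
docs = {
    "play_a": "the king rides to war and the king returns",
    "play_b": "the king rides to battle and the crown returns",
    "play_c": "two lovers meet in a garden at night",
}

# Rows are documents, columns are terms, values are TF-IDF weights.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(list(docs.values()))

# Pairwise cosine similarity: 1.0 on the diagonal, near 0 for unrelated texts.
sim = cosine_similarity(tfidf)

names = list(docs)
for i, name in enumerate(names):
    print(name, [round(float(v), 2) for v in sim[i]])
```

Here `play_a` and `play_b` share several content words and so score higher with each other than either does with `play_c`.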

We continued our analysis by running PCA on the matrix that resulted from running TF-IDF. PCA is a linear dimensionality reduction method that builds orthogonal projections capturing most of the variability in the matrix under study. By plotting the projections on scatter plots, we were able to detect potential outliers. The PCA suggests that the following plays are outliers:

• A Comedy of Errors

• Julius Caesar

• Antony and Cleopatra

• Titus Andronicus
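The PCA projection described above can be sketched with scikit-learn. The data here is synthetic (random rows standing in for a dense TF-IDF matrix, with one row shifted to act as an outlier), and ranking points by distance from the centroid is an assumption added for illustration; the outliers in this project were read off the scatter plots.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a dense TF-IDF matrix: 10 "plays", 50 "terms".
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
X[0] += 5.0  # shift one row so it behaves like an outlier

# Project onto the two orthogonal directions of greatest variance.
pca = PCA(n_components=2)
proj = pca.fit_transform(X)

# Simple outlier heuristic (an assumption): distance from the centroid
# of the 2-D projection.
dist = np.linalg.norm(proj - proj.mean(axis=0), axis=1)
outlier = int(np.argmax(dist))

print("explained variance ratio:", pca.explained_variance_ratio_)
print("most distant point:", outlier)
```

With a sparse matrix from `TfidfVectorizer`, calling `.toarray()` first is one way to obtain the dense input that `PCA` expects.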

Screen Shot 2021-04-20 at 7 33 34 PM

Screen Shot 2021-04-20 at 7 33 44 PM

We therefore ran K-means to test and confirm this evidence. K-means is an unsupervised learning method, used to look for patterns in data when there is no particular target feature or dependent variable. It is a simple and elegant approach for partitioning a data set into K distinct, non-overlapping clusters. The K-means problem is commonly solved using Lloyd’s algorithm, which alternates between assigning observations to the nearest cluster centre and recomputing the centres, seeking to make the total within-cluster variation, summed over all K clusters, as small as possible (it converges to a local minimum). Silhouette scores can be used to help evaluate the appropriate number of clusters in the data.

After running K-means on our dataset, we found that the number of clusters that reaches the highest silhouette score (0.53) is three.
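Choosing K by silhouette score can be sketched as below. The blob data, the candidate range of K, and the `n_init` and `random_state` values are assumptions made for a reproducible toy example; the 0.53 score reported above comes from the project's own data, not from this snippet.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (stand-in for the
# projected TF-IDF features).
X, _ = make_blobs(
    n_samples=60,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit K-means for several values of K and record the silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()})
print("best K:", best_k)
```

The silhouette score ranges from -1 to 1; values near 1 indicate tight, well-separated clusters.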

We then applied hierarchical clustering. The result of hierarchical clustering is a tree-based representation of the objects, known as a dendrogram. Each node represents a group.
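A small sketch of building the tree and cutting it into a fixed number of clusters, assuming SciPy's `linkage`, `fcluster` and `dendrogram`. Ward linkage and the six toy points are assumptions; the linkage method used in the notebook is not stated here.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Six toy points forming three obvious pairs.
X = np.array(
    [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]],
    dtype=float,
)

# Agglomerative clustering: each row of Z records one merge.
Z = linkage(X, method="ward")

# Cut the tree so that at most three flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")

# Compute the dendrogram layout without drawing it (no matplotlib needed).
tree = dendrogram(Z, no_plot=True)

print("cluster labels:", labels.tolist())
print("leaf order:", tree["ivl"])
```

Dropping `no_plot=True` draws the dendrogram with matplotlib, which gives the tree view described above.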

Screen Shot 2021-04-20 at 7 37 29 PM

Both clustering techniques show similar results. We decided to explore these three clusters further to see whether each of them exhibits a distinctive writing pattern.

After looking at the dendrogram, we decided to re-run the K-means and hierarchical analyses, this time excluding the play “A Comedy of Errors”. Since K-means reported the highest silhouette score with three clusters, we cut the hierarchical tree into three. Together with the excluded play, which forms a cluster of its own, this second pass yields four clusters, which we discuss further below.

Before doing that, we wanted to check whether there was any hint that these clusters were associated in time. To do this, we plotted the PCA projections and labeled each play with the year in which it was written, according to Wikipedia. As we can see from the plots below, there does not seem to be any relation between the year in which the plays were written and the clusters we obtained above. However, if Shakespeare was indeed a single person, he clearly had great talent, since the majority of the plays were written within a short span of time.

Screen Shot 2021-04-20 at 7 39 21 PM

Finally, we realized that the plays within each resulting cluster share common traits. Each cluster has a broad theme associated with it: tragedies, stories related to kings, love stories, or comedies. This suggests that Shakespeare may have been a group of people, each focused on a different writing style. To gain deeper insight into this, I ran LDA on each cluster and checked the dominant themes and words. The results can be seen below.

Cluster 1: Tragedy

['Hamlet', 'Coriolanus', 'Cymbeline', 'Antony and Cleopatra', 'King Lear', 'Othello', 'Troilus and Cressida', 'A Winters Tale', 'Henry VIII', 'Alls well that ends well', 'Measure for measure', 'Loves Labours Lost', 'Merry Wives of Windsor', 'As you like it', 'Merchant of Venice', 'Julius Caesar', 'Much Ado about nothing', 'Twelfth Night', 'Pericles']

Screen Shot 2021-04-20 at 7 40 25 PM

Cluster 2: Kings

['Richard III', 'Henry V', 'Henry VI Part 2', 'Henry IV', 'Henry VI Part 3', 'Henry VI Part 1', 'Richard II', 'King John', 'Titus Andronicus', 'Timon of Athens', 'macbeth', 'The Tempest', 'A Midsummer nights dream']

Screen Shot 2021-04-20 at 7 44 40 PM

Cluster 3: Love, arranged marriages

['Romeo and Juliet', 'Taming of the Shrew', 'Two Gentlemen of Verona']

Screen Shot 2021-04-20 at 7 45 25 PM

Cluster 4: Comedy, short story

A Comedy of Errors

I could have explored further and excluded words that do not seem to add information to the LDA analysis, but due to time constraints I stopped the analysis here. A next step would be to check, within each cluster, whether there is any pattern among the characters of its texts. That could be done by first identifying the lines of each character and then reorganizing the LDA analysis by character rather than by text. The result of this analysis would show whether characters share common patterns, and could tell us whether writing patterns change among characters and clusters.

Contributing

  • Fork this project
  • Create your feature branch
  • Commit your changes
  • Push to the branch
  • Create a new Pull Request
