- Tweets labelled '1' are "depressing" in nature; those labelled '0' are not.
- The dataset has 1.5M tweets; to speed up processing, we drop columns not relevant to classification (ItemID and Source of the Tweet).
- The dataset has no null values and has nearly zero skew between classes.
- Tweets contain punctuation (, . - ! ? : ; ' "), which is not useful for further analysis, so we drop it.
- We drop URLs, to avoid training the model on tokens that appear inside links.
- All tweets are converted to lowercase.
Fun fact : The difference between the ASCII values of an uppercase letter and its lowercase counterpart is 32.
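The cleaning steps above can be sketched as follows (a minimal sketch; `clean_tweet` and the sample tweet are illustrative, not from the notebook):

```python
import re
import string

def clean_tweet(text):
    """Minimal cleaning sketch: lowercase, drop URLs, strip punctuation."""
    text = text.lower()  # ord('A') + 32 == ord('a'), hence the fun fact above
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    return text.translate(str.maketrans("", "", string.punctuation))

print(clean_tweet("Check THIS out: https://t.co/abc, so sad!"))
```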
We have a nearly equal number of datapoints from both classes, thereby showing that the dataset has little to no skew.
#Code for above visualization
#ax.bar_label requires matplotlib >= 3.4
import matplotlib.pyplot as plt

ax = df['Sentiment'].value_counts().plot.bar(color='pink', figsize=(6, 4))
ax.bar_label(ax.containers[-1])  # annotate each bar with its count
ax.margins(y=0.1)
plt.show()
Stop words are common words that add little meaning to a sentence; we drop them to focus on rarer, more informative words.
Because stop words occur in abundance, they provide little unique information that can be used for classification or clustering.
NLTK (Natural Language Toolkit) ships a list of English stop words (179 in recent versions), including: "a", "an", "the", "of", "in", etc.
Read more about stopwords :
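Stop-word removal can be sketched like this; in practice the list would come from NLTK via `stopwords.words('english')`, but a small hard-coded subset keeps the sketch self-contained:

```python
# Illustrative subset; the full list would come from:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words('english'))
stop_words = {"a", "an", "the", "of", "in", "is", "am", "i", "to", "and"}

def remove_stopwords(text):
    """Keep only the words that are not in the stop-word set."""
    return " ".join(w for w in text.split() if w not in stop_words)

print(remove_stopwords("i am feeling a bit low in the morning"))
```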
Before removing stopwords from tweets :
Code for KDE plots of length of tweets before removing stopwords
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 3))
sns.kdeplot(df.loc[df['Sentiment'] == 0, 'length'], color='crimson', label='Not depressed', ax=ax)
sns.kdeplot(df.loc[df['Sentiment'] == 1, 'length'], color='blue', label='Depressed', ax=ax)
ax.legend()
plt.tight_layout()
plt.title('Before removing StopWords')
plt.show()
After pre-processing of tweets:
Code for KDE plots of length of tweets after removing stopwords
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 3))
sns.kdeplot(df.loc[df['Sentiment'] == 0, 'len after'], color='crimson', label='Not depressed', ax=ax)
sns.kdeplot(df.loc[df['Sentiment'] == 1, 'len after'], color='blue', label='Depressed', ax=ax)
ax.legend()
plt.tight_layout()
plt.title('After Removing Stopwords')
plt.show()
No difference in tweet-length distribution is seen between the two classes, which indicates that length might not be a good classification feature.
- For the complete dataset :
- For only tweets labelled as not-depressing :
- For only tweets labelled as depressing :
- "amp" is an HTML encoder that is used to embed a Tweet, was parsed incorretly by scraper
- Many garbled characters were present in SentimentText because emojis were not parsed correctly
These were as follows :
® © ¯ ª ¿ ¾ ¨ à ¸ £ ˆ ‡ • ‰ ž « ” ¢ — µ ¡ › ¥ ‚ – ð Ÿ ™ á º · ã ¹ » ± ³ € ¬ ‹ ¤ § ° ì š í † ë ¦ „ ¼ ´ ² ½
- Some tweets have length > 140, which Twitter does not allow; such instances were rare, about 2 in a million (Total: 3)
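One way to handle these artifacts is to decode HTML entities, strip non-ASCII characters, and drop the rare over-length tweets (a sketch; the sample strings below are made up):

```python
import html
import re

def scrub(text):
    """Decode HTML entities, then drop non-ASCII mojibake."""
    text = html.unescape(text)                # "&amp;" -> "&"
    text = re.sub(r"[^\x00-\x7f]", "", text)  # drop unparsed emoji bytes
    return text.strip()

tweets = ["I love tea &amp; coffee ®©", "x" * 150]
tweets = [scrub(t) for t in tweets]
tweets = [t for t in tweets if len(t) <= 140]  # remove the rare over-length tweets
print(tweets)
```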
- Line graph depicting the change in accuracy as the number of epochs increases
- Graph showing the change in loss as the number of iterations/epochs increases
Classification Report :

              precision    recall  f1-score   support

    Negative       0.80      0.84      0.81    157391
    Positive       0.83      0.79      0.81    158319

    accuracy                           0.81    315710
   macro avg       0.81      0.81      0.81    315710
weighted avg       0.81      0.81      0.81    315710
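A report like the one above is typically produced with `sklearn.metrics.classification_report`; each per-class row reduces to precision, recall, and F1 computed from confusion-matrix counts. A minimal sketch (the counts below are hypothetical, not the notebook's):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one class: 800 true positives,
# 200 false positives, 150 false negatives
p, r, f = prf(tp=800, fp=200, fn=150)
print(round(p, 2), round(r, 2), round(f, 2))
```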