- Tweets labelled '1' are "depressing" in nature; those labelled '0' are not.
- The dataset has 1.5M tweets; to speed up processing, we drop columns not relevant to classification (ItemID and Source of the Tweet).
- The dataset has no null values and has nearly zero skew between classes.
- Tweets contain punctuation (, . - ! ? : ; ' "), which is not useful for further analysis, so we drop it.
- We drop URLs, to avoid training the model on tokens that appear inside links.
- All tweets are converted to lowercase.
Fun fact : The difference between the ASCII values of an uppercase letter and its lowercase counterpart is 32.
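The cleaning steps above can be sketched as follows (a minimal sketch; `clean_tweet` and the sample tweet are illustrative, not from the notebook):

```python
import re
import string

def clean_tweet(text):
    """Minimal cleaning sketch: lowercase, drop URLs, strip punctuation."""
    text = text.lower()  # ord('A') + 32 == ord('a'), hence the fun fact above
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    return text.translate(str.maketrans("", "", string.punctuation))

print(clean_tweet("Check THIS out: https://t.co/abc, so sad!"))
```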
We have a nearly equal number of datapoints from both classes, thereby showing that the dataset has little to no skew.
#Code for above visualization
#ax.bar_label requires matplotlib >= 3.4
import matplotlib.pyplot as plt

ax = df['Sentiment'].value_counts().plot.bar(color='pink', figsize=(6, 4))
ax.bar_label(ax.containers[-1])  # annotate each bar with its count
ax.margins(y=0.1)
plt.show()
Stop words are common words that add little meaning to a sentence; we drop them to focus on rarer, more informative words.
Because stop words occur in abundance, they provide little unique information that can be used for classification or clustering.
NLTK (Natural Language Toolkit) ships a list of English stop words (179 in recent versions), including: "a", "an", "the", "of", "in", etc.
Read more about stopwords :
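Stop-word removal can be sketched like this; in practice the list would come from NLTK via `stopwords.words('english')`, but a small hard-coded subset keeps the sketch self-contained:

```python
# Illustrative subset; the full list would come from:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words('english'))
stop_words = {"a", "an", "the", "of", "in", "is", "am", "i", "to", "and"}

def remove_stopwords(text):
    """Keep only the words that are not in the stop-word set."""
    return " ".join(w for w in text.split() if w not in stop_words)

print(remove_stopwords("i am feeling a bit low in the morning"))
```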
Before removing stopwords from tweets :
Code for KDE plots of length of tweets before removing stopwords
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 3))
sns.kdeplot(df.loc[df['Sentiment'] == 0, 'length'], color='crimson', label='Not depressed', ax=ax)
sns.kdeplot(df.loc[df['Sentiment'] == 1, 'length'], color='blue', label='Depressed', ax=ax)
ax.legend()
plt.tight_layout()
plt.title('Before removing StopWords')
plt.show()
After pre-processing of tweets:
Code for KDE plots of length of tweets after removing stopwords
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 3))
sns.kdeplot(df.loc[df['Sentiment'] == 0, 'len after'], color='crimson', label='Not depressed', ax=ax)
sns.kdeplot(df.loc[df['Sentiment'] == 1, 'len after'], color='blue', label='Depressed', ax=ax)
ax.legend()
plt.tight_layout()
plt.title('After Removing Stopwords')
plt.show()
No difference in tweet-length distribution is seen between the two classes, which indicates that length might not be a good classification feature.
- For the complete dataset :
- For only tweets labelled as not-depressing :
- For only tweets labelled as depressing :
- "amp" is an HTML encoder that is used to embed a Tweet, was parsed incorretly by scraper
- Many garbled characters were present in SentimentText because emojis were not parsed correctly
These were as follows :
® © ¯ ª ¿ ¾ ¨ à ¸ £ ˆ ‡ • ‰ ž « ” ¢ — µ ¡ › ¥ ‚ – ð Ÿ ™ á º · ã ¹ » ± ³ € ¬ ‹ ¤ § ° ì š í † ë ¦ „ ¼ ´ ² ½
- Some tweets have length > 140, which Twitter does not allow; such instances were rare, about 2 in a million (Total: 3)
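One way to handle these artifacts is to decode HTML entities, strip non-ASCII characters, and drop the rare over-length tweets (a sketch; the sample strings below are made up):

```python
import html
import re

def scrub(text):
    """Decode HTML entities, then drop non-ASCII mojibake."""
    text = html.unescape(text)                # "&amp;" -> "&"
    text = re.sub(r"[^\x00-\x7f]", "", text)  # drop unparsed emoji bytes
    return text.strip()

tweets = ["I love tea &amp; coffee ®©", "x" * 150]
tweets = [scrub(t) for t in tweets]
tweets = [t for t in tweets if len(t) <= 140]  # remove the rare over-length tweets
print(tweets)
```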
- Line graph depicting the change in accuracy as the number of epochs increases
- Graph showing the change in loss as the number of iterations/epochs increases
Classification Report :

              precision    recall  f1-score   support

    Negative       0.80      0.84      0.81    157391
    Positive       0.83      0.79      0.81    158319

    accuracy                           0.81    315710
   macro avg       0.81      0.81      0.81    315710
weighted avg       0.81      0.81      0.81    315710
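A report like the one above is typically produced with `sklearn.metrics.classification_report`; each per-class row reduces to precision, recall, and F1 computed from confusion-matrix counts. A minimal sketch (the counts below are hypothetical, not the notebook's):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one class: 800 true positives,
# 200 false positives, 150 false negatives
p, r, f = prf(tp=800, fp=200, fn=150)
print(round(p, 2), round(r, 2), round(f, 2))
```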