aryan1113 / moodscope-nlp

NLP project for Winter in Data Science, conducted by the Analytics Club, IIT Bombay

Home Page: https://colab.research.google.com/drive/1hQM18tYFcuMqSWnRA7MaXi2UADIiMCML

Jupyter Notebook 97.87% Python 2.13%
natural-language-processing neural-network

moodscope-nlp's Introduction

EDA

During pre-processing of data, some trends were observed which are listed below

  • Tweets labelled '1' are considered "depressing" in nature; tweets labelled '0' are not.
  • The dataset has about 1.5 million tweets. To speed up processing, we drop the non-relevant columns (ItemID and the source of the tweet).
  • The dataset has no null values and nearly zero skew between classes.
  • Tweets contain punctuation, which is not useful for further analysis, so we drop it (, . - ! ? : ; ' " ").
  • URLs are dropped so that the model is not trained on words that appear in them.
  • All tweets are converted to lowercase.
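The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the helper name `clean_tweet`, the URL regex and the exact punctuation set are assumptions.

```python
import re

# Assumed URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Punctuation characters named in the list above (assumed to be the full set).
PUNCT_TABLE = str.maketrans("", "", ",.-!?:;'\"")

def clean_tweet(text: str) -> str:
    """Remove URLs, strip punctuation, lowercase, and collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)    # drop URLs first, while "://" is intact
    text = text.translate(PUNCT_TABLE)   # delete the listed punctuation marks
    text = text.lower()
    return " ".join(text.split())        # collapse runs of whitespace

cleaned = clean_tweet("Feeling LOW today... see http://example.com/x !!")
print(cleaned)  # feeling low today see
```

With pandas, a helper like this could be applied as `df['SentimentText'].apply(clean_tweet)`; URLs are removed before punctuation so that the pattern still matches them.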

Fun fact : The difference between the ASCII values of an uppercase letter and its lowercase form is 32.
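A quick way to check this in Python (the helper `ascii_lower` is only an illustration; in practice `str.lower()` does the job):

```python
# Each ASCII uppercase letter sits exactly 32 code points below its lowercase form.
offset = ord("a") - ord("A")
print(offset)  # 32

def ascii_lower(ch: str) -> str:
    """Lowercase one ASCII letter by adding the offset; other characters pass through."""
    return chr(ord(ch) + offset) if "A" <= ch <= "Z" else ch

lowered = "".join(ascii_lower(c) for c in "MoodScope")
print(lowered)  # moodscope
```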

Skew or no Skew

We have a nearly equal number of data points from both classes, which shows that the dataset has little to no skew.

Bar plot showing distribution of tweets, 1 represents tweet that is classified as depressed, 0 represents otherwise

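# Code for the above visualization
# written for matplotlib 3.6.2 (Axes.bar_label needs 3.4 or newer)
import matplotlib.pyplot as plt

ax = df['Sentiment'].value_counts().plot.bar(color='pink', figsize=(6, 4))
ax.bar_label(ax.containers[-1])  # print the count on top of each bar
ax.margins(y=0.1)                # leave headroom so the labels are not clipped
plt.show()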
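Besides the bar plot, the balance can be checked numerically. The sketch below uses toy labels rather than the project's data, and the function name `class_balance` is hypothetical.

```python
from collections import Counter

def class_balance(labels):
    """Return each class's share of the data and the majority/minority ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio

# Toy labels for illustration only.
shares, ratio = class_balance([0, 1, 1, 0, 1, 0, 0, 1])
print(shares, ratio)
```

A ratio close to 1 means the classes are balanced. On a DataFrame, `df['Sentiment'].value_counts(normalize=True)` gives the same shares directly.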

What are stopwords ?

Stop words are common words that add little to no meaning to a sentence; we drop them to focus on the rarer, more informative words.
Because stop words occur in abundance, they provide little to no unique information that can be used for classification or clustering.
NLTK (Natural Language Toolkit) ships a default English stop-word list of well over a hundred words, including: “a”, “an”, “the”, “of”, “in”, etc.
Read more about stopwords :
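Stop-word removal itself is a simple token filter. The sketch below uses a tiny hand-written set so that it runs without any downloads; that set and the helper name `remove_stopwords` are assumptions, not the list used in the notebook.

```python
# Small illustrative stop-word set, standing in for NLTK's much longer list.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "to", "and", "i", "am"}

def remove_stopwords(text: str, stop_words=STOP_WORDS) -> str:
    """Drop every whitespace-separated token that is a stop word."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)

filtered = remove_stopwords("i am tired of the rain in the morning")
print(filtered)  # tired rain morning
```

With NLTK installed, the set can be replaced by `set(stopwords.words('english'))` from `nltk.corpus`, after running `nltk.download('stopwords')` once. The input is assumed to be lowercased already, since the comparison is case-sensitive.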

Comparing length of tweets before and after removing stopwords

  • Before removing stopwords from tweets :

    KDE plot showing length of tweets before any pre-processing

    Code for KDE plots of length of tweets before removing stopwords

    import seaborn as sns
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 3))
    # One density curve per class, drawn on the same axes
    sns.kdeplot(df.loc[df['Sentiment'] == 0, 'length'],
                color='crimson', label='Not depressed', ax=ax)
    sns.kdeplot(df.loc[df['Sentiment'] == 1, 'length'],
                color='blue', label='Depressed', ax=ax)
    ax.set_title('Before removing StopWords')
    ax.legend()
    plt.tight_layout()  # called last so the title is part of the layout
    plt.show()
  • After pre-processing of tweets:

    KDE plot of length of tweets after processing

    Code for KDE plots of length of tweets after removing stopwords

    import seaborn as sns
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 3))
    # Same plot as before, using the length measured after stop-word removal
    sns.kdeplot(df.loc[df['Sentiment'] == 0, 'len after'],
                color='crimson', label='Not depressed', ax=ax)
    sns.kdeplot(df.loc[df['Sentiment'] == 1, 'len after'],
                color='blue', label='Depressed', ax=ax)
    ax.set_title('After Removing Stopwords')
    ax.legend()
    plt.tight_layout()  # called last so the title is part of the layout
    plt.show()

The length distributions of the two classes show no noticeable difference, which suggests that tweet length might not be a good classification feature.
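One way to back the visual comparison with a number is to compare the average length per class. This is a sketch on toy tweets, not the project's data, and `mean_length_by_class` is a hypothetical helper.

```python
from statistics import mean

def mean_length_by_class(tweets, labels):
    """Average character length of tweets, grouped by label."""
    lengths_by_class = {}
    for text, label in zip(tweets, labels):
        lengths_by_class.setdefault(label, []).append(len(text))
    return {label: mean(lengths) for label, lengths in lengths_by_class.items()}

# Toy data for illustration only.
tweets = ["so tired", "great day", "cant sleep", "love this"]
labels = [1, 0, 1, 0]
averages = mean_length_by_class(tweets, labels)
print(averages)
```

If the per-class averages (and the KDE curves) are close, length carries little signal for the classifier. On a DataFrame the equivalent is `df.groupby('Sentiment')['length'].mean()`.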

Word Clouds showing most frequent words

  1. For complete dataset :

    Word cloud for the complete dataset, requires 98s to run

  2. For only tweets labelled as not-depressing :

    Word cloud for tweets with Sentiment 0, requires 50-55s to run

  3. For only tweets labelled as depressing :

    Word cloud for tweets with Sentiment 1, requires 50-55s to run

    Errors in Pre-processing (now resolved) :

    • "amp" is left over from "&amp;", the HTML entity for an ampersand, which the scraper parsed incorrectly
    • Many stray characters were present in SentimentText because emojis were not parsed correctly.
      They were as follows :
      ® © ¯ ª ¿ ¾ ¨ à ¸ £ ˆ ‡ • ‰ ž « ” ¢ — µ ¡ › ¥ ‚ – ð Ÿ ™ á º · ã ¹ » ± ³ € ¬ ‹ ¤ § ° ì š í † ë ¦ „ ¼ ´ ² ½
    • Some tweets are longer than 140 characters, which Twitter did not allow; such instances were rare, around 2 in a million (Total: 3)
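A hedged sketch of how such issues can be handled: HTML entities are decoded with `html.unescape`, and the stray characters look like UTF-8 bytes that were read as Windows-1252, which can often be reversed by re-encoding. The function names, the cp1252 assumption and the 140-character constant are illustrative, not taken from the notebook.

```python
import html

MAX_TWEET_LENGTH = 140  # limit mentioned above

def repair_text(text: str) -> str:
    """Decode HTML entities, then try to undo UTF-8 text misread as Windows-1252."""
    text = html.unescape(text)  # "&amp;" -> "&"
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake: leave the text unchanged

def is_valid_length(text: str) -> bool:
    """True when the tweet fits within the character limit."""
    return len(text) <= MAX_TWEET_LENGTH

# Build a garbled emoji the same way a wrong decoder would.
garbled = "😀".encode("utf-8").decode("cp1252")
print(repair_text("rock &amp; roll"))   # rock & roll
print(repair_text(garbled) == "😀")     # True
```

Text that is already clean passes through unchanged, because the round trip either reproduces it or raises an error that is caught.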

    Creating a Model

    Results :

    • Line graph showing how accuracy changes with the number of epochs run
    • Model Accuracy graph

    • Graph showing how loss changes as the number of iterations/epochs increases
    • Loss vs Epoch graph

    Classification Report :

                     precision    recall  f1-score   support

        Negative      0.80      0.84      0.81    157391
        Positive      0.83      0.79      0.81    158319

        accuracy                          0.81    315710
       macro avg      0.81      0.81      0.81    315710
    weighted avg      0.81      0.81      0.81    315710
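For reference, the per-class columns in such a report follow from the counts of true positives, false positives and false negatives. The sketch below uses made-up counts, not the model's actual predictions, and the function name is hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```

The macro average is the plain mean of the per-class values, while the weighted average weights each class by its support.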
    
