epidural

  1. Scraping: We scraped around 717,000 tweets using four queries: "epidural," "medication-free birth," "natural birth," and "unmedicated birth" (see the notebook scrapey.ipynb).

  2. Sentiment analysis: We then ran sentiment analysis on all of those tweets with the model "cardiffnlp/twitter-roberta-base-sentiment-latest". Notes: it is important to pre-process the data the same way the model's training data was pre-processed; see the notebook "sentiment.ipynb" for how this was done, or view the Hugging Face repo for the model. Additionally, I filter the dataset with LDA topic modeling after sentiment analysis. The filtering probably could have been done before sentiment analysis, but because Google Colab was not sufficient for topic modeling, I figured I'd run the sentiment analysis first even if we throw away some of those tweets.

  3. Topic modeling: This is a common NLP method for finding topics within a corpus of documents (for us, the documents are tweets). Various topic modeling approaches exist, but we went with LDA (we attempted BERTopic, but the results were nonsensical). Topic modeling is helpful for filtering because scraping Twitter returns a lot of unneeded results. For example, a lot of tweets were in Spanish but included the term "epidural," and it would be difficult to filter out all Spanish tweets with single-keyword lexical matches. LDA models typically do require heavy pre-processing (e.g. filtering out stop words, lowercasing, getting rid of URLs, retweets, etc.; I elected not to lemmatize, for reasons that are probably beyond the scope of this email). I generated multiple models and selected the best one based on coherence score (0.502 on a 0 to 1 scale; 0.5 is considered good) and perplexity score (-7.800; lower is better).
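The two pre-processing steps mentioned in items 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code from sentiment.ipynb or the topic-modeling script: the function names, the tiny stop-word list, and the "RT @" retweet check are assumptions made for the example. The first function follows the mention and link masking described on the model's Hugging Face page; check the notebook for what was actually run.

```python
import re

# Tiny illustrative stop-word list (assumption); the list used in the
# original notebooks is not shown in this write-up.
STOP_WORDS = {"a", "an", "and", "the", "i", "to", "of", "for", "is", "it", "my", "was", "in"}


def preprocess_for_sentiment(text: str) -> str:
    """Mask user mentions and links before passing a tweet to the sentiment model."""
    tokens = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"
        elif token.startswith("http"):
            token = "http"
        tokens.append(token)
    return " ".join(tokens)


def is_retweet(text: str) -> bool:
    """Heuristic retweet check (assumption: retweets start with 'RT @')."""
    return text.startswith("RT @")


def preprocess_for_lda(text: str) -> list[str]:
    """Lowercase, drop URLs, tokenize, and remove stop words (no lemmatization)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # get rid of URLs
    tokens = re.findall(r"[a-z]+", text)       # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    tweet = "@Jane loved my epidural https://t.co/abc"
    print(preprocess_for_sentiment(tweet))
    print(preprocess_for_lda(tweet))
```

Retweets would be dropped before tokenizing, for example with `[t for t in tweets if not is_retweet(t)]`.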

You can view the results of the LDA model here. From these results, you can see that topic 1 contains the tweets we are interested in. I apply a filter to the full tweet dataset to keep only tweets from topic 1. Finally, also within this script, to get rid of generic epidural tweets, I manually remove any tweets that lexically match the keywords "surgery," "stimulator," "steroid," or "block." After LDA and manual filtering, we are left with 431,524 tweets that pertain to topics involving epidural, labor, birth, and natural/unmedicated birth, each with sentiment scores. The file for this dataset is in the repo and is titled filtered_df.csv.
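The topic and keyword filtering described above could look something like the sketch below. It uses plain Python dictionaries to stay dependency-free; the field names `text` and `dominant_topic` are hypothetical, and case-insensitive substring matching is an assumption about what "lexical matching" meant in the original script.

```python
# Keywords that mark generic (non-birth) epidural tweets, taken from the text above.
EXCLUDED_KEYWORDS = ("surgery", "stimulator", "steroid", "block")


def keep_tweet(row: dict, target_topic: int = 1) -> bool:
    """Keep a tweet only if it belongs to the target topic and has no excluded keyword."""
    if row["dominant_topic"] != target_topic:
        return False
    text = row["text"].lower()
    # Substring match, so "steroids" and "blocked" are also excluded (assumption).
    return not any(keyword in text for keyword in EXCLUDED_KEYWORDS)


def filter_tweets(rows: list, target_topic: int = 1) -> list:
    """Apply the topic filter and the keyword filter to every row."""
    return [row for row in rows if keep_tweet(row, target_topic)]


if __name__ == "__main__":
    sample = [
        {"text": "Got my epidural during labor", "dominant_topic": 1},
        {"text": "Epidural steroid injection for back pain", "dominant_topic": 1},
        {"text": "Anestesia epidural para la cirugia", "dominant_topic": 3},
    ]
    for row in filter_tweets(sample):
        print(row["text"])
```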

What needs to happen next: Analyze the sentiments for the total dataset. I would also suggest comparing sentiment scores for the total dataset against subsets of tweets that include terms like "natural" or "unmedicated" birth/labor, with and without the word "epidural." This can be done with simple logical lexical matching (pseudocode: if tweet contains epidural AND natural; if tweet contains epidural AND NOT natural). You could use a similar approach with TF-IDF to make sure the keywords for the sub-datasets are important to the topic.
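As a starting point, the lexical matching in the pseudocode above might be sketched like this. The field names `text` and `sentiment_score`, the subset labels, and the use of a single numeric score per tweet are all assumptions for illustration; the actual columns in filtered_df.csv may differ.

```python
from statistics import mean

# Terms that mark "natural birth" tweets (taken from the suggestion above).
NATURAL_TERMS = ("natural", "unmedicated")


def subset_label(text: str) -> str:
    """Assign a tweet to a subset using simple case-insensitive lexical matching."""
    lowered = text.lower()
    has_epidural = "epidural" in lowered
    has_natural = any(term in lowered for term in NATURAL_TERMS)
    if has_epidural and has_natural:
        return "epidural_and_natural"
    if has_epidural:
        return "epidural_not_natural"
    if has_natural:
        return "natural_not_epidural"
    return "neither"


def mean_sentiment_by_subset(rows: list) -> dict:
    """Group tweets by subset label and average their sentiment scores."""
    groups = {}
    for row in rows:
        groups.setdefault(subset_label(row["text"]), []).append(row["sentiment_score"])
    return {label: mean(scores) for label, scores in groups.items()}


if __name__ == "__main__":
    sample = [
        {"text": "Natural birth, no epidural for me", "sentiment_score": 0.5},
        {"text": "The epidural was a lifesaver", "sentiment_score": 1.0},
        {"text": "Planning an unmedicated birth", "sentiment_score": 0.25},
    ]
    print(mean_sentiment_by_subset(sample))
```

The same grouping could be compared against the mean over all rows to see how each subset differs from the total dataset.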

epidural's People

Contributors

kswanjitsu
