LingoQuartet

LingoQuartet: Unraveling political and linguistic themes using unsupervised learning clustering and other NLP techniques

Background

News sources are the main medium through which people receive information on current events, so the amount of coverage an issue receives and the positive or negative sentiment of that coverage can shape public opinion. This project used a large dataset of articles from the politics sections of the main US news sources to implement a series of embedding processes, clustering methods, and sentiment evaluation methods, with the goal of understanding which methods are effective for measuring topics and sentiment in the news.

For more information on methods and to see our results, see our final presentation.

Code Breakdown

Each team member's work can be found in the notebooks/ folder, in files that begin with that member's name. Cleaned-up, documented scripts can be found under lingo_quartet/src. Branch names are also prefixed with team member names, and those branches may contain more in-progress work.

Brief Notebook Descriptions:

megans_notebook.ipynb - Exploratory data analysis and cleaning
megans_preprocessing.ipynb - Data preprocessing for clustering
megans_embeddings.ipynb - Creating embeddings for hierarchical clustering
megan_k_means.ipynb - Implementation of early k-means model
megans_hierarchical_part_2.ipynb - Implementation, tuning, and evaluation of hierarchical clustering
jackies_notebook.ipynb - Exploratory data analysis and data preprocessing for clustering
jackie_sentiment_preprocessing.ipynb - Data preprocessing for sentiment analysis
jackie_sentiment_analysis.ipynb - Sentiment analysis on titles, full text, abbreviated text, etc. by source and regex-derived topics
jackie_berttopicmodel.ipynb - Implementation, tuning, and evaluation of BERTopic
jpm_kmeans.ipynb - Embedding creation, implementation, tuning, and evaluation of k-means clustering
jpm_sentiment_analysis_covid_cluster.ipynb - Sentiment analysis on clusters
lisette_text_BERT_analysis.ipynb - BERT sentiment model implementation (text)
lisette_title_BERT_analysis.ipynb - BERT sentiment model implementation (title)
lisette_cluster_analysis.ipynb - Sentiment analysis on clusters
lisettes_notebook.ipynb - Exploratory data analysis
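
The naming convention described above (each notebook begins with its owner's name) makes it possible to group files by team member. The following is a minimal sketch and is not part of the repository: `group_by_owner` is a hypothetical helper, and it assumes the owner is the text before the first underscore, with an optional trailing "s" (as in `megans_` versus `megan_`).

```python
from collections import defaultdict


def group_by_owner(filenames):
    """Group notebook filenames by the owner prefix before the first underscore.

    Assumption (not from the repository): a trailing "s" on the prefix is a
    possessive marker, so "megans_..." and "megan_..." share one owner.
    """
    groups = defaultdict(list)
    for name in filenames:
        prefix = name.split("_", 1)[0]
        owner = prefix[:-1] if prefix.endswith("s") else prefix
        groups[owner].append(name)
    return dict(groups)


# A few of the notebook names from the list above.
notebooks = [
    "megans_notebook.ipynb",
    "megan_k_means.ipynb",
    "jackies_notebook.ipynb",
    "jackie_berttopicmodel.ipynb",
    "jpm_kmeans.ipynb",
    "lisette_cluster_analysis.ipynb",
    "lisettes_notebook.ipynb",
]

for owner, files in sorted(group_by_owner(notebooks).items()):
    print(owner, files)
```

In a local checkout, the same helper could be fed the names of the files in the notebooks/ folder instead of a hard-coded list. The trailing-"s" rule is a simplification that would mislabel an owner whose name actually ends in "s".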

Scripts refactored from notebooks for clarity:

  • lingo_quartet/src/step01_exploration.py
  • lingo_quartet/src/step02_preprocessing.py
  • lingo_quartet/src/step03_bert_embeddings.py
  • lingo_quartet/src/step04_fasttext_embeddings.py
  • lingo_quartet/src/step05_hierarchical_clusters.py

Environment Usage

The dependencies for this project are based on the CAPP30255 conda environment used for other assignments in the course. To track dependencies that were not included in the original CAPP30255 base environment, we use the following process from the root of the repository, with the current environment activated.

To capture all conda and pip dependencies:

$ conda env export > ./env_yamls/<ENVIRONMENT_NAME>.yml 

And to create a new environment that starts with the recorded dependencies:

# make sure you don't have any environment currently activated by running
$ conda deactivate

# then create a new environment
$ conda env create -n <NEW_ENV_NAME> -f env_yamls/<OLD_ENV_NAME>.yml

# and then start using it
$ conda activate <NEW_ENV_NAME>

Contributors

meganhmoore, lisette-solis, jpmartinezclaeys, jackieglasheen
