longtv02 / news-analysis-nlp

Forked from 54mir/news-analysis-nlp.

Data analysis on 1400 news articles. Uses natural language processing to explore how news sources differ in terms of their sentiment, lexical density, reading level, topic modeling, frequency of named persons, etc.


news-analysis-nlp's Introduction

SET-UP INSTRUCTIONS:

This is a computational text analysis of news articles. It relies on two external packages, including Stanford CoreNLP; Maven resolves both automatically (see pom.xml).

Our analysis is set up as a Maven project, so there is no need to download any .jar files by hand. To run the analysis, either clone this repository or download all of the files in the FinalProject folder, and set up the project as a Maven project in your IDE of choice (the pom.xml file is included). More info on setting up Maven projects can be found here: http://maven.apache.org/guides/getting-started/

A note if you choose not to clone the repo: the src folder (within the FinalProject folder) contains all of our code. The other essential non-Java files are articleMetricsArray_hold.ser and newSourcesSAMPLE1.csv.

RUNNING THE ANALYSIS:

Once the project is set up, open ProjectRunner.java and run the main method. Console prompts will guide you through the rest.

As you will see below, we have processed the data on 1400 articles and saved it to disk. Our analysis runs on this stored data rather than reprocessing the articles each time the project is run. However, if you would like to see the processing part of the project, the console prompt will let you run it on a small subset of the data. *NOTE* Do not be alarmed by the red text stating SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". This does not affect the analysis or the running of the program; it is just an artifact of CoreNLP. When you proceed with the analysis, the project will generate charts and present them in a window.

THE NITTY-GRITTY - HOW THE ANALYSIS IS DESIGNED:

Part 1 – Creating a Usable Dataset and Building Charts

We begin with a .csv file that contains 1400 articles from 14 sources, along with metadata such as author, source, and date of publication. The RawDocumentReader class reads in this .csv file, makes the necessary CoreNLP annotations, and creates an Article object for each article. (The Article class computes and stores various metrics, which we use later in our data analysis.) The articles are then written to a .ser file, where they can be stored and shared with others who might be interested in running a similar analysis. The abstract GenericChart class reads the .ser file into memory. The GenericChart superclass is extended by the following child classes:

  • LevelAndDensityCategoryChart (creates charts 1-4)
  • SentimentChart (creates charts 5-8)
  • LengthDensityAndLevelXYChart (creates charts 9-11)
  • FrequencyChart (creates chart 12)

The ProjectRunner class serves to create these chart objects and display them in a simple Swing window. It also gives users the option to see a demo implementation of the RawDocumentReader class. This demo creates a mini .ser file from a small sample of the full dataset.
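The .ser mechanism described above is standard Java object serialization. As a rough sketch (ArticleMetrics and its fields here are illustrative stand-ins, not the project's actual Article API):

```java
import java.io.*;
import java.util.*;

// Illustrative stand-in for the project's Article class: any object
// written to a .ser file must implement Serializable.
class ArticleMetrics implements Serializable {
    private static final long serialVersionUID = 1L;
    final String source;
    final double lexicalDensity;
    ArticleMetrics(String source, double lexicalDensity) {
        this.source = source;
        this.lexicalDensity = lexicalDensity;
    }
}

public class SerDemo {
    // Write a list of metrics objects to a .ser file, as RawDocumentReader does.
    static void save(List<ArticleMetrics> articles, File f) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(new ArrayList<>(articles));
        }
    }

    // Read the stored list back into memory, as GenericChart does.
    @SuppressWarnings("unchecked")
    static List<ArticleMetrics> load(File f) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            return (List<ArticleMetrics>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("metrics", ".ser");
        save(List.of(new ArticleMetrics("Reuters", 0.52)), f);
        List<ArticleMetrics> loaded = load(f);
        System.out.println(loaded.get(0).source); // prints "Reuters"
        f.delete();
    }
}
```

In this pattern, anyone with the .ser file and the matching class definitions can load the precomputed metrics without re-running the (slow) CoreNLP annotation step.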

Part 2 - Running the Analysis and Displaying the Charts

AN OVERVIEW:

To run the analysis, run the main method of the ProjectRunner class. In the console, you will be asked whether you would like to create a .ser file from a sample of the full dataset. We have included this option so that you (the user, the TA, etc.) can see our RawDocumentReader class in action. When this class is run on the full dataset, it takes around four hours, due to the computationally intensive tasks of sentiment analysis and named-entity recognition.

After the (optional) creation of the test .ser file, the full analysis will begin to run. When it is complete, a bare-bones Swing window will display the following graphs:

  • Charts 1-4: These charts display the average reading level and average lexical density for each news source in the corpus, as well as the z-scores for these averages. The goal of these charts is to give the user a sense of which sources might be written at a higher level, and which sources might contain more information (i.e., have a higher "lexical density" score). We also chose to plot the z-scores of each of these metrics to show how much variance there is in these two metrics across the different sources in the corpus, i.e., which sources really stand out as extra-dense or extra-hard, or alternatively, extra-fluffy or extra-easy.
  • Chart 5: This chart displays, for each news source, what percent of the sentiment in articles from that source is negative, positive, or neutral. This graph, perhaps unsurprisingly, shows that across all sources, reporting tends to err on the side of negative sentiment.
  • Charts 6, 7, and 8: In creating these charts, we took the normalized sums for each type of sentiment for every article in the corpus, and looked at what percent each source contributed to the total amount of each sentiment type. These charts show, for example, that the New York Post contributes much more positive sentiment to our corpus than does, say, Reuters. Overall, however, no source sticks out as contributing a dramatically disproportionate amount of positive, negative, or neutral sentiment.
  • Charts 9, 10, and 11: These charts display the relationships between an article's length, reading level, and lexical density. They also let the viewer see how different sources (and certain outlier articles) tend to be written, and how tightly "clumped" they are along these parameters. For example, most Atlantic articles tend to be tightly clumped in reading level and density, yet spread across a relatively wide range of lengths.
  • Chart 12: This chart shows the trend, over time, in the mentions of a variety of politicians. Perhaps unsurprisingly, Obama and Trump/Donald Trump mentions have the largest spikes.
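The z-scores behind charts 1-4 simply express each source's average as a number of standard deviations from the corpus-wide mean. A minimal sketch (this uses the population standard deviation; the project may compute it differently):

```java
import java.util.Arrays;

public class ZScores {
    // Convert per-source averages into z-scores: (x - mean) / stddev.
    static double[] zScores(double[] values) {
        double mean = Arrays.stream(values).average().orElse(0.0);
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        double sd = Math.sqrt(variance);
        double[] z = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            z[i] = (values[i] - mean) / sd;
        }
        return z;
    }

    public static void main(String[] args) {
        // Hypothetical average reading levels for three sources.
        double[] levels = {9.0, 11.0, 13.0};
        System.out.println(Arrays.toString(zScores(levels)));
    }
}
```

A source with a z-score near zero is typical of the corpus; a large positive or negative z-score marks the "extra-hard" or "extra-easy" outliers the charts are meant to surface.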
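The percent-contribution figures behind charts 6-8 can be sketched as follows (the source names and sentiment sums below are hypothetical, chosen only to illustrate the arithmetic):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SentimentShare {
    // Given each source's summed (normalized) score for one sentiment type,
    // compute the percent each source contributes to the corpus total --
    // the quantity plotted in charts 6-8.
    static Map<String, Double> percentContribution(Map<String, Double> sums) {
        double total = sums.values().stream().mapToDouble(Double::doubleValue).sum();
        Map<String, Double> pct = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            pct.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return pct;
    }

    public static void main(String[] args) {
        // Hypothetical positive-sentiment sums for two sources.
        Map<String, Double> sums = new LinkedHashMap<>();
        sums.put("New York Post", 3.0);
        sums.put("Reuters", 1.0);
        System.out.println(percentContribution(sums)); // {New York Post=75.0, Reuters=25.0}
    }
}
```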

news-analysis-nlp's People

Contributors: samiritor, abigailella, 54mir
