longtv02 / news-analysis-nlp

Forked from 54mir/news-analysis-nlp.

Data analysis on 1400 news articles. Uses natural language processing to explore how news sources differ in terms of their sentiment, lexical density, reading level, topic modeling, frequency of named persons, etc.


news-analysis-nlp's Introduction

SET-UP INSTRUCTIONS:

This is a computational text analysis of news articles. It relies on two external packages, including Stanford CoreNLP; Maven resolves both automatically (see pom.xml).

Our analysis is set up as a Maven project, so there is no need to download any .jar files by hand. To run the analysis, either clone this repository or download all of the files in the FinalProject folder, and set up the project as a Maven project in your IDE of choice (the pom.xml file is included). More info on setting up Maven projects can be found here: http://maven.apache.org/guides/getting-started/

A note if you choose not to clone the repo: the src folder (within the FinalProject folder) contains all of our code. The other essential non-Java files are articleMetricsArray_hold.ser and newSourcesSAMPLE1.csv.

RUNNING THE ANALYSIS:

Once the project is set up, open ProjectRunner.java and run the main method. Console prompts will guide you through the rest.

As you will see below, we have processed the data on 1400 articles and saved it to disk. Our analysis runs on this stored data rather than reprocessing the articles each time the project is run. However, if you would like to see the processing part of the project, the console prompt will let you run it on a small subset of the data. *NOTE* Do not be alarmed by the red text stating SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". This does not affect the analysis or the running of the program; it is just an artifact of CoreNLP. When you proceed with the analysis, the project will generate charts and present them in a window.

THE NITTY-GRITTY - HOW THE ANALYSIS IS DESIGNED:

Part 1 – Creating a Usable Dataset and Building Charts

We begin with a .csv file that contains 1400 articles from 14 sources, along with metadata such as author, source, and date of publication. The RawDocumentReader class reads in this .csv file, makes the necessary CoreNLP annotations, and creates an Article object for each article. (The Article class computes and stores various metrics, which we use later in our data analysis.) The articles are then written to a .ser file, where they can be stored and shared with others who might be interested in running a similar analysis. The abstract GenericChart class reads the .ser file into memory. The GenericChart superclass is extended by the following child classes:

  • LevelAndDensityCategoryChart (creates charts 1-4)
  • SentimentChart (creates charts 5-8)
  • LengthDensityAndLevelXYChart (creates charts 9-11)
  • FrequencyChart (creates chart 12)

The ProjectRunner class serves to create these chart objects and display them in a simple Swing window. It also gives users the option to see a demo implementation of the RawDocumentReader class. This demo creates a mini .ser file from a small sample of the full dataset.
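The .ser mechanism described above is standard Java object serialization. As a rough sketch (ArticleMetrics and its fields here are illustrative stand-ins, not the project's actual Article API):

```java
import java.io.*;
import java.util.*;

// Illustrative stand-in for the project's Article class: any object
// written to a .ser file must implement Serializable.
class ArticleMetrics implements Serializable {
    private static final long serialVersionUID = 1L;
    final String source;
    final double lexicalDensity;
    ArticleMetrics(String source, double lexicalDensity) {
        this.source = source;
        this.lexicalDensity = lexicalDensity;
    }
}

public class SerDemo {
    // Write a list of metrics objects to a .ser file, as RawDocumentReader does.
    static void save(List<ArticleMetrics> articles, File f) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(new ArrayList<>(articles));
        }
    }

    // Read the stored list back into memory, as GenericChart does.
    @SuppressWarnings("unchecked")
    static List<ArticleMetrics> load(File f) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            return (List<ArticleMetrics>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("metrics", ".ser");
        save(List.of(new ArticleMetrics("Reuters", 0.52)), f);
        List<ArticleMetrics> loaded = load(f);
        System.out.println(loaded.get(0).source); // prints "Reuters"
        f.delete();
    }
}
```

In this pattern, anyone with the .ser file and the matching class definitions can load the precomputed metrics without re-running the (slow) CoreNLP annotation step.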

Part 2 - Running the Analysis and Displaying the Charts

AN OVERVIEW:

To run the analysis, run the main method of the ProjectRunner class. In the console, you will be asked whether you would like to create a .ser file from a sample of the full dataset. We have included this option so that you (the user, the TA, etc.) can see our RawDocumentReader class in action. When this class is run on the full dataset, it takes around four hours, due to the computationally intensive tasks of sentiment analysis and named-entity recognition.

After the (optional) creation of the test .ser file, the full analysis will begin to run. When it is complete, a bare-bones Swing window will display the following graphs:

  • Charts 1-4: These charts display the average reading level and average lexical density for each news source in the corpus, as well as the z-scores for these averages. The goal of these charts is to give the user a sense of which sources might be written at a higher level, and which sources might contain more information (i.e., have a higher "lexical density" score). We also chose to plot the z-scores of each of these metrics to show how much variance there is in these two metrics across the different sources in the corpus, i.e., which sources really stand out as extra-dense or extra-hard, or alternatively, extra-fluffy or extra-easy.
  • Chart 5: This chart displays, for each news source, what percent of the sentiment in articles from that source is negative, positive, or neutral. This graph, perhaps unsurprisingly, shows that across all sources, reporting tends to err on the side of negative sentiment.
  • Charts 6, 7, and 8: In creating these charts, we took the normalized sums for each type of sentiment for every article in the corpus, and looked at what percent each source contributed to the total amount of each sentiment type. These charts show, for example, that the New York Post contributes much more positive sentiment to our corpus than does, say, Reuters. Overall, however, no source sticks out as contributing a dramatically disproportionate amount of positive, negative, or neutral sentiment.
  • Charts 9, 10, and 11: These charts display the relationships between an article's length, reading level, and lexical density. They also let the viewer see how different sources (and certain outlier articles) tend to be written, and how tightly "clumped" they are along these parameters. For example, most Atlantic articles tend to be tightly clumped in reading level and density, yet spread across a relatively wide range of lengths.
  • Chart 12: This chart shows the trend, over time, in the mentions of a variety of politicians. Perhaps unsurprisingly, Obama and Trump/Donald Trump mentions have the largest spikes.
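The z-scores behind charts 1-4 simply express each source's average as a number of standard deviations from the corpus-wide mean. A minimal sketch (this uses the population standard deviation; the project may compute it differently):

```java
import java.util.Arrays;

public class ZScores {
    // Convert per-source averages into z-scores: (x - mean) / stddev.
    static double[] zScores(double[] values) {
        double mean = Arrays.stream(values).average().orElse(0.0);
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        double sd = Math.sqrt(variance);
        double[] z = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            z[i] = (values[i] - mean) / sd;
        }
        return z;
    }

    public static void main(String[] args) {
        // Hypothetical average reading levels for three sources.
        double[] levels = {9.0, 11.0, 13.0};
        System.out.println(Arrays.toString(zScores(levels)));
    }
}
```

A source with a z-score near zero is typical of the corpus; a large positive or negative z-score marks the "extra-hard" or "extra-easy" outliers the charts are meant to surface.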
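The percent-contribution figures behind charts 6-8 can be sketched as follows (the source names and sentiment sums below are hypothetical, chosen only to illustrate the arithmetic):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SentimentShare {
    // Given each source's summed (normalized) score for one sentiment type,
    // compute the percent each source contributes to the corpus total --
    // the quantity plotted in charts 6-8.
    static Map<String, Double> percentContribution(Map<String, Double> sums) {
        double total = sums.values().stream().mapToDouble(Double::doubleValue).sum();
        Map<String, Double> pct = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            pct.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return pct;
    }

    public static void main(String[] args) {
        // Hypothetical positive-sentiment sums for two sources.
        Map<String, Double> sums = new LinkedHashMap<>();
        sums.put("New York Post", 3.0);
        sums.put("Reuters", 1.0);
        System.out.println(percentContribution(sums)); // {New York Post=75.0, Reuters=25.0}
    }
}
```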

news-analysis-nlp's People

Contributors: samiritor, abigailella, 54mir
