Human rights in the news

A global heads-up dashboard of ongoing and emerging human rights issues will rely, at least in part, on ingesting, classifying, and geolocating human rights-related news stories from around the world. This code is a Minimum Viable Product demonstration of such a pipeline. It serves as a first step towards the far more sophisticated approaches that projects such as GDELT (https://www.gdeltproject.org/) have already implemented for general news stories.

For a full discussion of the use case, end user, and product development perspective of this project, please refer to: https://www.garethwalker.me/microsoft-ai-for-good

An outline of the pipeline

The data pipeline is as follows:

RSS feeds relating to human rights news are ingested using feedparser (https://github.com/kurtmckee/feedparser). Note: in addition to sources such as Amnesty International and the UNHRC, custom RSS feeds based on human rights-related search terms are used for both Google and Bing news feeds.
The title and body text are classified and evaluated using a Support Vector Machine text classifier (trained on US State Department reports).
Location entities are extracted using the entity detection module of the spaCy package. Stories for which locations cannot be identified are dropped.
Location entities are passed to the Google Maps API to obtain latitude and longitude coordinates. Locations for which latitude and longitude cannot be found are dropped.
The news story title, date, human rights category, location, and coordinates are stored in a pandas DataFrame and exported as CSV.
The CSV is ingested into a Tableau visualization.
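
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function names, the row field names, the choice of the spaCy labels GPE and LOC, and the decision to emit one row per geocoded location are all assumptions. The classifier, location extractor, and geocoder are passed in as plain callables so that the drop logic can be read (and run) on its own, and the standard-library csv module stands in for the pandas export.

```python
import csv

# spaCy entity labels treated as locations (assumption: geopolitical entities and other places).
LOCATION_LABELS = {"GPE", "LOC"}
# Hypothetical column names for the exported CSV.
FIELDS = ["title", "date", "category", "location", "latitude", "longitude"]


def fetch_stories(feed_urls):
    """Yield one dict per RSS entry across all feeds."""
    import feedparser  # third-party; imported here so the rest of the sketch runs without it

    for url in feed_urls:
        for entry in feedparser.parse(url).entries:
            yield {
                "title": entry.get("title", ""),
                "summary": entry.get("summary", ""),
                "date": entry.get("published", ""),
            }


def extract_locations(nlp, text):
    """Return the location entity strings found by a loaded spaCy pipeline `nlp`."""
    return [ent.text for ent in nlp(text).ents if ent.label_ in LOCATION_LABELS]


def build_rows(stories, classify, locate, geocode):
    """Classify and geolocate stories, dropping any that cannot be placed on a map."""
    rows = []
    for story in stories:
        text = f"{story['title']} {story['summary']}"
        category = classify(text)
        for place in locate(text):  # a story with no locations contributes no rows
            coords = geocode(place)
            if coords is None:  # location could not be geocoded: drop it
                continue
            rows.append({
                "title": story["title"],
                "date": story["date"],
                "category": category,
                "location": place,
                "latitude": coords[0],
                "longitude": coords[1],
            })
    return rows


def write_csv(rows, handle):
    """Write rows to an open text file (pandas.DataFrame(rows).to_csv would be the pandas route)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

With spaCy installed, `locate` could be something like `lambda text: extract_locations(nlp, text)`, where `nlp` is a loaded English pipeline such as `spacy.load("en_core_web_sm")`; which spaCy model the project uses is not stated here.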

Model evaluation

When applied to the training data, the model achieves an accuracy, precision, and recall of 76%. However, when applied to news stories, accuracy drops to 67%, recall to 60%, and precision to 62%. This is most likely due to the model overfitting to the domain-specific language of the training corpus (official reports) and subsequently struggling to interpret the more general language of news stories. Some measures were taken to limit this overfitting, including limiting the number of features held within the TF-IDF vectorizer to the 400 most frequent terms (after stop words are removed).
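
For illustration, the feature cap and the three reported metrics could be set up along these lines. This is a sketch under assumptions, not the project's training code: the use of scikit-learn's LinearSVC (the SVM kernel is not stated here), the built-in English stop-word list, and macro averaging of precision and recall across categories are all choices made for this example.

```python
from collections import Counter


def build_classifier(max_features=400):
    """TF-IDF limited to the most frequent terms (stop words removed), feeding a linear SVM."""
    # scikit-learn is imported lazily so the metric helper below runs without it.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    return make_pipeline(
        TfidfVectorizer(stop_words="english", max_features=max_features),
        LinearSVC(),
    )


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro-averaged precision, macro-averaged recall)."""
    labels = sorted(set(y_true) | set(y_pred))
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    actual = Counter(y_true)

    accuracy = sum(correct.values()) / len(y_true)
    # Per-label precision and recall, counted as 0 when a label was never predicted or never occurs.
    precision = sum(correct[l] / predicted[l] if predicted[l] else 0.0 for l in labels) / len(labels)
    recall = sum(correct[l] / actual[l] if actual[l] else 0.0 for l in labels) / len(labels)
    return accuracy, precision, recall
```

Calling `macro_scores` once on held-out report text and once on a hand-labelled sample of news stories gives the kind of side-by-side comparison described above.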

Using the model

Custom RSS feeds can be defined within the code. The output will be a CSV file containing the news story title, date, human rights category, location, and coordinates. This is then ready to be ingested by a data visualization or GIS platform. For an example, see the following Tableau visualization:

https://public.tableau.com/profile/gareth.walker#!/vizhome/MappingHumanRightsNews-Updated/HumanRightsinthenews
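
As a rough idea of what defining custom search-based feeds might look like, the sketch below builds RSS URLs from search terms. The URL patterns are assumptions about how Google News and Bing News expose search results as RSS, not something taken from this project's code, and the search terms and function names are hypothetical.

```python
from urllib.parse import urlencode


def google_news_feed(query):
    """Build a Google News RSS search URL (assumed URL pattern)."""
    return "https://news.google.com/rss/search?" + urlencode({"q": query})


def bing_news_feed(query):
    """Build a Bing News RSS search URL (assumed URL pattern)."""
    return "https://www.bing.com/news/search?" + urlencode({"q": query, "format": "rss"})


# Hypothetical search terms; replace with the terms relevant to the issues being tracked.
SEARCH_TERMS = ["human rights violation", "arbitrary detention"]

# One Google feed and one Bing feed per search term.
FEEDS = [make_feed(term) for term in SEARCH_TERMS for make_feed in (google_news_feed, bing_news_feed)]
```

The resulting list can then be passed to feedparser alongside any fixed feeds from organizations such as Amnesty International.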

Training Data Requirement

The following file must be placed in 'Data/':

'us_state_dept_reports_1999_2018.csv'

It can be downloaded here: https://drive.google.com/file/d/1DhSlWI_2nKozM8Gcn7fy34kSBwzHjMSm/view?usp=sharing

API requirements

A Google Maps API key is required for geolocating news stories.
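
A minimal sketch of how the key might be used, assuming the Google Maps Geocoding web service is called directly over HTTP. The project may well use a client library instead; the environment variable name GOOGLE_MAPS_API_KEY and the function names are hypothetical.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def parse_geocode(payload):
    """Return (lat, lng) from a Geocoding API response, or None if nothing was found."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(place, api_key=None):
    """Look up coordinates for a place name; returns None when no match is found."""
    # Hypothetical variable name: keeps the key out of the source code.
    api_key = api_key or os.environ["GOOGLE_MAPS_API_KEY"]
    url = GEOCODE_ENDPOINT + "?" + urlencode({"address": place, "key": api_key})
    with urlopen(url, timeout=10) as response:
        return parse_geocode(json.load(response))
```

Returning None for unmatched places makes it straightforward to drop locations whose coordinates cannot be found.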
