svm_human_rights's Introduction

Human rights in the news

A global heads-up dashboard of ongoing and emerging human rights issues will rely at least in part on ingesting, classifying, and geolocating human rights related news stories from around the world. This code represents a Minimum Viable Product demonstration of such a pipeline. The model serves as a first step towards far more sophisticated approaches which have already been implemented for general news stories by projects such as GDELT (https://www.gdeltproject.org/).

For a full discussion on the use case, end user, and product development perspective of this project, please refer to: https://www.garethwalker.me/microsoft-ai-for-good

An outline of the pipeline

The data pipeline is as follows:

  1. RSS feeds relating to human rights news are ingested using feedparser (https://github.com/kurtmckee/feedparser). Note: in addition to feeds from sources such as Amnesty International and the UNHRC, custom RSS feeds built from human rights related search terms are used for both Google and Bing news.

  2. Title and body text are classified and evaluated using a Support Vector Machine text classifier (trained on US State Department reports; see the training data file named under "Training Data Requirement").

  3. Location entities are extracted using the entity detection module of the spaCy package. Stories for which no location can be identified are dropped.

  4. Location entities are passed to the Google Maps API to obtain latitude and longitude coordinates. Locations for which latitude and longitude cannot be found are dropped.

  5. News story title, date, human rights category, location, and coordinates are stored in a pandas dataframe and exported as CSV.

  6. The CSV is ingested into a Tableau visualization.
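
The control flow of steps 2 to 5 can be sketched as follows. This is a minimal illustration, not the project's code: the classifier, location extractor, and geocoder are passed in as hypothetical callables standing in for the SVM model, spaCy, and the Google Maps API, and the standard-library csv module is swapped in for pandas so that the sketch has no dependencies. The input field names (title, date, body) are also assumptions.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical stand-in signatures for the three external components.
Classifier = Callable[[str], str]                            # step 2: SVM in the project
LocationExtractor = Callable[[str], Optional[str]]           # step 3: spaCy in the project
Geocoder = Callable[[str], Optional[Tuple[float, float]]]    # step 4: Google Maps in the project

FIELDS = ["title", "date", "category", "location", "latitude", "longitude"]


def run_pipeline(
    stories: Iterable[Dict[str, str]],
    classify: Classifier,
    extract_location: LocationExtractor,
    geocode: Geocoder,
) -> List[Dict[str, object]]:
    """Classify, locate, and geocode stories; drop any that fail step 3 or 4."""
    rows = []
    for story in stories:
        text = f"{story['title']} {story['body']}"
        category = classify(text)
        location = extract_location(text)
        if location is None:
            continue  # no location entity found: story is dropped
        coords = geocode(location)
        if coords is None:
            continue  # location could not be geocoded: story is dropped
        rows.append({
            "title": story["title"],
            "date": story["date"],
            "category": category,
            "location": location,
            "latitude": coords[0],
            "longitude": coords[1],
        })
    return rows


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize the surviving rows as CSV text with a header line (step 5)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing the three components as arguments keeps the drop logic easy to test with stubs before any API key or trained model is available.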

Outline of approach

Model evaluation

When applied to the training data, the model achieves 76% on accuracy, precision, and recall. However, when applied to news stories, accuracy drops to 67%, recall to 60%, and precision to 62%. This is most likely due to the model overfitting to the domain-specific language of the training corpus (official reports) and subsequently struggling to interpret the more general language of news stories. Some measures were taken to limit this overfitting, including limiting the number of features held by the TF-IDF vectorizer to the 400 most frequent terms (after stop words are removed).
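
To make the feature cap concrete, the sketch below builds a vocabulary restricted to the most frequent terms after stop-word removal, using only the standard library. It illustrates the idea rather than reproducing the project's vectorizer: the stop-word list and tokenizer are simplified assumptions. If the project uses scikit-learn (also an assumption), the same cap is exposed there as the max_features parameter of TfidfVectorizer.

```python
import re
from collections import Counter
from typing import Iterable, List

# Tiny illustrative stop-word list; an assumption, not the list used by the project.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "were", "by"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text, keep alphabetic tokens, and remove stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_vocabulary(documents: Iterable[str], max_features: int = 400) -> List[str]:
    """Return the max_features most frequent terms across the whole corpus."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    # most_common sorts by descending count; terms outside the cap are discarded
    return [term for term, _ in counts.most_common(max_features)]
```

Rare, report-specific terms fall outside the cap, which leaves the classifier fewer corpus-specific features to latch onto.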

Model performance

Using the model

Custom RSS feeds can be defined within the code. The output is a CSV file containing the news story title, date, human rights category, location, and coordinates, ready to be ingested by a data visualization or GIS platform. For an example, see the following:

https://public.tableau.com/profile/gareth.walker#!/vizhome/MappingHumanRightsNews-Updated/HumanRightsinthenews
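
As an illustration of how custom search-based feeds might be defined, the sketch below builds RSS URLs for Google News and Bing News from a list of search terms. The URL templates and the example search terms are assumptions for illustration and are not taken from the project's code; check them against the current behaviour of each service before relying on them.

```python
from typing import List
from urllib.parse import urlencode

# Example search terms; hypothetical, not the terms used by the project.
SEARCH_TERMS = ["human rights violation", "arbitrary detention"]


def google_news_feed(query: str) -> str:
    """Build a Google News RSS search URL (assumed URL template)."""
    return "https://news.google.com/rss/search?" + urlencode({"q": query})


def bing_news_feed(query: str) -> str:
    """Build a Bing News RSS search URL (assumed URL template)."""
    return "https://www.bing.com/news/search?" + urlencode({"q": query, "format": "rss"})


def build_feed_list(terms: List[str]) -> List[str]:
    """Return one Google and one Bing feed URL for every search term."""
    return [build(term) for term in terms for build in (google_news_feed, bing_news_feed)]


feeds = build_feed_list(SEARCH_TERMS)
```

Each resulting URL could then be handed to feedparser.parse in the same way as a fixed feed from an organization's website.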

Training Data Requirement

The following file must be placed in 'Data/':

'us_state_dept_reports_1999_2018.csv'

It can be downloaded here: https://drive.google.com/file/d/1DhSlWI_2nKozM8Gcn7fy34kSBwzHjMSm/view?usp=sharing

API requirements

A Google Maps API key is required for geolocating news stories.
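
Both requirements can be checked before running the pipeline. The sketch below is one possible preflight check; the environment variable name GOOGLE_MAPS_API_KEY is a hypothetical choice, since how the project expects the key to be supplied is not specified here.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional, Union

TRAINING_FILE = Path("Data") / "us_state_dept_reports_1999_2018.csv"
API_KEY_ENV_VAR = "GOOGLE_MAPS_API_KEY"  # hypothetical variable name


def check_requirements(
    data_file: Union[str, Path] = TRAINING_FILE,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    if not Path(data_file).is_file():
        problems.append(f"missing training data: {data_file}")
    if not env.get(API_KEY_ENV_VAR):
        problems.append(f"environment variable {API_KEY_ENV_VAR} is not set")
    return problems
```

Calling check_requirements() at start-up and stopping when the returned list is non-empty avoids discovering a missing key only after the feeds have been fetched and classified.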
