data4democracy / internal-displacement
Studying news events and internal displacement.
Take any useful code from Scraper in scrape_articles.py and refactor it into scraper.py.
Initialize an Article from a URL, save it to the DB and set its status to new / pending.
In some contexts, information about IDPs is highly politicized, which could be problematic if you're drawing from media reports. You'd want to be very careful in selecting which sources you used for info about the Rohingya in Myanmar, for example.
It would be good to be able to score an article for reliability in order to help analysts as they analyze and interpret the extracted data.
In some cases, news sources may be government run, 'fake news' or have poor sources / track record, and so any data reported by and extracted from these sources should be identifiable as having potential issues.
On the front end, this could include a filter for analysts to use, whereby they can select all articles, or only those with a reliability score above a certain threshold.
Some thoughts for implementation include:
Need an initial DB schema to capture information about documents and facts.
Proposal:
Tables:
Currently, the CSV data files are stored in my personal S3 bucket. We need to be able to download them locally or load them directly into a dataframe.
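A minimal sketch of the loading side, assuming the files are reachable over HTTPS (the bucket name and key below are placeholders, not the real locations):

```python
# Load one of the CSV data files straight into a DataFrame.
# The URL is hypothetical; substitute the actual S3 location.
import pandas as pd

CSV_URL = "https://s3.amazonaws.com/example-bucket/input_urls.csv"

def load_dataset(url=CSV_URL):
    """Download the CSV over HTTPS and return it as a pandas DataFrame."""
    return pd.read_csv(url)

if __name__ == "__main__":
    df = load_dataset()
    print(df.head())
```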
This is the third filtering requirement from the competition guidelines, to eliminate articles which mention the word 'mobility' but are unrelated to human mobility.
As per @milanoleonardo, a possible approach:
this can be done by looking at the dependency trees of the sentences in the text to make sure there is a link between a “reporting term” and a “reporting unit” (see challenge for details). This would definitely remove all documents reporting on “hip displacement” or sentences like “displaced the body of people” etc.
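A rough sketch of that idea with spaCy is below. The reporting-term and reporting-unit lists are illustrative only; the real lists come from the challenge guidelines, and the model must be downloaded separately (`python -m spacy download en_core_web_sm`):

```python
# Dependency-tree filter sketch: keep a document only if a reporting term
# is linked (via its subtree) to a reporting unit.
import spacy

nlp = spacy.load("en_core_web_sm")

REPORTING_TERMS = {"displace", "displaced", "evacuate", "flee"}   # illustrative lemmas
REPORTING_UNITS = {"people", "person", "family", "household", "resident"}

def mentions_human_mobility(text):
    """Return True if any sentence links a reporting term to a reporting unit."""
    doc = nlp(text)
    for sent in doc.sents:
        for token in sent:
            if token.lemma_.lower() in REPORTING_TERMS:
                subtree_lemmas = {t.lemma_.lower() for t in token.subtree}
                if subtree_lemmas & REPORTING_UNITS:
                    return True
    return False

print(mentions_human_mobility("Floods displaced thousands of people."))  # True
print(mentions_human_mobility("She had surgery for a displaced hip."))   # False
```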
There is a placeholder function sample_urls in scrape_articles.py. This should be updated to return a random subsample of the URLs in a list.
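One possible implementation, assuming the function takes the full URL collection and a sample size (the signature is a guess, not taken from the repo):

```python
import random

def sample_urls(urls, size=100, seed=None):
    """Return up to `size` URLs drawn at random, without replacement, as a list."""
    if seed is not None:
        random.seed(seed)
    urls = list(urls)
    return random.sample(urls, min(size, len(urls)))
```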
The master input, extended and training datasets all contain URLs. For initial exploration and later analysis, it would be nice to build functionality to scrape, strip, and store the article information.
The extraction of information from unstructured texts currently relies on the ability of spaCy to identify number-like substrings within text.
Specifically, the like_num attribute of the spaCy Token class is used (https://spacy.io/docs/api/token).
This attribute fails to detect approximate numerical terms (e.g. hundreds, thousands, dozens, few). While these terms are approximate, they are still useful for establishing orders of magnitude and for comparison across reports.
If someone could create an improved method to determine if a spacy token is like a number, that would significantly improve our ability to extract information from texts.
More details on this issue are included in this notebook.
https://github.com/Data4Democracy/internal-displacement/blob/master/notebooks/DependencyTreeExperiments3.ipynb
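A hedged sketch of a broader number-likeness test: fall back to spaCy's built-in Token.like_num and additionally accept approximate terms. The word list and the magnitudes assigned to them are assumptions, intended only for order-of-magnitude comparisons:

```python
# Approximate terms and the rough values assigned to them (assumed, not canonical).
APPROXIMATE_TERMS = {
    "few": 3,
    "several": 5,
    "dozens": 24,
    "hundreds": 200,
    "thousands": 2000,
    "millions": 2_000_000,
}

def like_number(token):
    """Return True if a spaCy token looks like an exact or approximate number."""
    return token.like_num or token.lower_ in APPROXIMATE_TERMS

def approximate_value(token):
    """Rough order-of-magnitude value for approximate terms, else None."""
    return APPROXIMATE_TERMS.get(token.lower_)
```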
Currently, scraper.py only scrapes HTML pages. Can we also scrape and parse PDFs?
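A minimal sketch of PDF text extraction that could sit alongside the HTML scraper. It assumes the pdfminer.six package and downloads the file with requests; how the result plugs into Scraper is left open:

```python
import io
import requests
from pdfminer.high_level import extract_text

def scrape_pdf(url):
    """Download a PDF and return its plain text."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return extract_text(io.BytesIO(resp.content))
```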
Each parsed article needs to be classified into one of three categories:
Additionally, the chosen classifier / model must allow for easy online-learning or re-training in the future using new or larger datasets.
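For illustration only, one way to meet the online-learning requirement is scikit-learn's SGDClassifier with a stateless HashingVectorizer, updated via partial_fit. The label set below is a placeholder, not the project's final category list:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

CLASSES = ["conflict/violence", "disaster", "other"]  # illustrative labels

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier()

def train_batch(texts, labels):
    """Update the model with a new batch of labelled articles (online learning)."""
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=CLASSES)

def predict(texts):
    """Predict categories for unseen article texts."""
    return clf.predict(vectorizer.transform(texts))
```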
In Pipeline.process_url we make multiple calls to article.update_status(). The update_status method may raise UnexpectedArticleStatusException if it appears that the status has been changed in the meantime. process_url should be prepared to deal with this exception.
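One possible pattern is a small wrapper that process_url can call at each status transition; the logging behaviour and the assumption that a concurrent change means another worker has taken over are both mine, not the repo's:

```python
import logging

# Assumes UnexpectedArticleStatusException is importable from the project's article module.
logger = logging.getLogger(__name__)

def safe_update_status(article, new_status):
    """Try to advance an article's status; return False if another process
    changed it in the meantime, so the caller can skip further work."""
    try:
        article.update_status(new_status)
        return True
    except UnexpectedArticleStatusException:
        logger.warning("Status of article %s changed concurrently; skipping", article.url)
        return False
```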
Prior to saving reports to the database, quantities need to be converted to integers.
In some cases this is trivial (e.g. 500, 684); however, there are other cases that need more work, for example 'thousand', 'hundreds', etc.
Probably the best place to implement the conversion is in Report.__init__
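A rough sketch of the conversion that could be called from Report.__init__. The mapping for vague terms is an assumption and only aims at order of magnitude:

```python
WORD_VALUES = {
    "dozen": 12, "dozens": 12,
    "hundred": 100, "hundreds": 100,
    "thousand": 1000, "thousands": 1000,
    "million": 1_000_000, "millions": 1_000_000,
}

def quantity_to_int(quantity):
    """Convert a quantity string such as '500', '684' or 'thousands' to an int,
    or return None if no sensible conversion exists."""
    if quantity is None:
        return None
    text = str(quantity).strip().lower().replace(",", "")
    if text.isdigit():
        return int(text)
    return WORD_VALUES.get(text)
```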
Lots of the URLs are broken, or contain information that can't be parsed as text (e.g. videos, images). How can we filter them out and tag them as such?
@jlln You used async tasks to increase the speed of the beautifulsoup scraper. Can we do the same with the newspaper one?
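Since newspaper's download/parse calls are blocking, one option is to run them in a thread pool from asyncio, similar in spirit to the async BeautifulSoup scraper. The function names here are illustrative, not the repo's:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from newspaper import Article  # newspaper3k

def scrape_one(url):
    """Blocking scrape of a single URL with newspaper."""
    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}

async def scrape_many(urls, max_workers=10):
    """Run the blocking scrapes concurrently in a thread pool."""
    loop = asyncio.get_event_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tasks = [loop.run_in_executor(pool, scrape_one, u) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

# results = asyncio.get_event_loop().run_until_complete(scrape_many(url_list))
```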
Integrate work from notebooks into the codebase in an attempt to extract information from each article's content.
Given an Article, run the classifier and update its category: conflict/violence, disaster, both or other.
Given an Article, check and update its language.
The docker-compose.yml and docker.env files are currently set up with local development in mind. We'll want a production-friendly config.
We would like the front end to be able to submit new URLs for processing by writing an article row into the DB with a status of NEW. We need a process that runs on the back end that looks for such rows and kicks off the scraping & interpretation pipeline.
Because it takes a while to bring up the interpretation environment (loading dependencies & the model), it probably makes sense to have a long-running process that spends most of its time sleeping and occasionally (once every 60s? configurable?) wakes up and looks for new DB rows to process.
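A minimal sketch of that poller, under the assumption that the back end exposes a session and the existing Pipeline; the query helper is hypothetical and the 60-second interval is made configurable as suggested:

```python
import time

def run_worker(session, pipeline, poll_interval=60):
    """Wake up periodically, look for articles with status NEW and process them."""
    while True:
        new_articles = fetch_articles_with_status(session, "NEW")  # hypothetical helper
        for article in new_articles:
            pipeline.process_url(article.url)
        time.sleep(poll_interval)
```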
The ultimate aim of this project is to make a visualization tool that can:
To get started, datasets to play with can be found here.
Can we extract items such as the title and date published from a PDF?
This code in master breaks production:
//if not using docker
//create a pgConfig.js file in the same directory and put your credentials there
const connectionObj = require('./pgConfig');
nodejs_1 | [0] Error: Cannot find module './pgConfig'
nodejs_1 | [0] at Function.Module._resolveFilename (module.js:470:15)
nodejs_1 | [0] at Function.Module._load (module.js:418:25)
nodejs_1 | [0] at Module.require (module.js:498:17)
nodejs_1 | [0] at require (internal/module.js:20:19)
nodejs_1 | [0] at Object.<anonymous> (/internal-displacement-web/server/pgDB/index.js:5:23)
nodejs_1 | [0] at Module._compile (module.js:571:32)
nodejs_1 | [0] at Object.Module._extensions..js (module.js:580:10)
nodejs_1 | [0] at Module.load (module.js:488:32)
nodejs_1 | [0] at tryModuleLoad (module.js:447:12)
nodejs_1 | [0] at Function.Module._load (module.js:439:3)
nodejs_1 | [0] [nodemon] app crashed - waiting for file changes before starting...
Can this be made optional? If the file pgConfig exists, require it, otherwise use the environment variables?
Tag which language scraped content is written in.
Maybe langdetect is useful?
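A quick sketch using langdetect, as suggested; it returns ISO 639-1 codes (e.g. 'en', 'es'), and seeding the detector makes results deterministic for short texts:

```python
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # deterministic results for short/ambiguous texts

def tag_language(content):
    """Return a language code for scraped content, or None if detection fails."""
    try:
        return detect(content)
    except Exception:
        return None
```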
Enhance the country_code function in interpreter.py in order to more reliably recognize countries.
For example, it currently fails for 'the United States' vs 'United States'.
It would also be good to try to detect countries even when the name is not explicitly mentioned, e.g. from city names.
The Mordecai library may be an option; however, it requires its own NLP parsing, and I was wondering if there is a simpler way to do this without using two NLP libraries + trained models.
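One lighter-weight improvement, assuming the pycountry package: normalise the phrase (drop a leading 'the') and fall back to fuzzy search. This does not handle inference from city names, which would still need a geocoder such as Mordecai:

```python
import pycountry

def country_code(name):
    """Return an ISO 3166-1 alpha-3 code for a country name, or None."""
    if not name:
        return None
    cleaned = name.strip()
    if cleaned.lower().startswith("the "):
        cleaned = cleaned[4:]
    try:
        return pycountry.countries.search_fuzzy(cleaned)[0].alpha_3
    except LookupError:
        return None

print(country_code("the United States"))  # 'USA'
```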
Under Project Components -> Visualizer, the README says "Visualise relability of".
Did you mean to say "Visualise reliability of machine learning [something, e.g. model for article classification]"?
Thanks!
There is a training dataset here that can be used to train a classifier to tag articles with "violence" or "disaster" as the cause of displacement. It is quite a short dataset, but using the URL scraping functionality in scrape_articles.py we should be able to make a start at training a classification algorithm.
Here's a sketch of an infrastructure plan:
- Scrapers run locally (on a developer machine) in Docker for prototyping (internal-displacement repo)
  - Write to a local DB in Docker
  - Can read scrape requests from the database, but most scrapes will be triggered manually (through notebooks or scripts)
- Web app runs locally in Docker for prototyping (internal-displacement-web repo)
  - Reads the local DB in Docker
  - Writes scrape requests to the database
- Scraper and web app Docker containers deployed to an AWS instance or similar cloud-hosted infrastructure
  - Read and write to an Amazon RDS database
  - Large batches of scrape requests are input into the database, read from there and processed by the Scraper(s)
Complete the unit tests for the scraper and pipeline components
PDF scraping could be fairly hard-disk intensive and could slow down scraping when doing a bulk load of URLs. Can we:
Copied from @jlln
Nice work with the parser. I have looked into incorporating it into the scraper but I have encountered the issue of identifying PDFs:
- How to distinguish a URL returning a PDF from one returning HTML?
- The URL alone is not sufficient to identify the type of the returned object.
- I have tried using the Python requests module to pull the header and examine its content type. However, this doesn't always work, because some URLs return an HTML page that contains a PDF in an iframe, e.g. http://erccportal.jrc.ec.europa.eu/getdailymap/docId/1125
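A sketch of a best-effort check: look at the Content-Type header first, then sniff the leading bytes for the %PDF magic number. As noted above, pages that embed a PDF in an iframe will still look like HTML to this test:

```python
import requests

def looks_like_pdf(url):
    """Best-effort check for whether a URL serves a PDF."""
    resp = requests.get(url, stream=True, timeout=30)
    resp.raise_for_status()
    content_type = resp.headers.get("Content-Type", "").lower()
    if "pdf" in content_type:
        return True
    first_bytes = next(resp.iter_content(chunk_size=5), b"")
    return first_bytes.startswith(b"%PDF")
```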
Scatter plot (or other) visualising information related to events. Could include
among others.
An Article may have a publication date in datetime format. Dates extracted from text can often be relative or vague, e.g. "last Saturday".
Write a function to combine the article.Article publication date with dates interpreted in report.Report in order to convert dates extracted from text into datetimes.
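A hedged sketch using the dateparser package: interpret a phrase extracted from the text relative to the article's publication date via the RELATIVE_BASE setting. The surrounding function name is illustrative:

```python
from datetime import datetime
import dateparser

def resolve_report_date(phrase, publication_date):
    """Turn a phrase like 'last Saturday' into a datetime anchored to the
    article's publication date; return None if it cannot be parsed."""
    return dateparser.parse(phrase, settings={"RELATIVE_BASE": publication_date})

print(resolve_report_date("last Saturday", datetime(2017, 3, 15)))
```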
Write a function in article.Article that calculates the percentage of scraped fields which are returned empty.
We may consider expanding the definition of scraping reliability later, so suggestions welcome.
Write a function that calculates the percentage of missing fields in report.Report after an article has been interpreted.
We may expand this later to include weighting or other factors. Discussion welcome.
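A simple sketch of the metric for report.Report; the same idea works for the scraping-reliability measure on article.Article. The field names are assumptions, not taken from model.py:

```python
REPORT_FIELDS = ["event_term", "subject_term", "quantity", "locations", "date_times"]  # assumed

def missing_field_ratio(report, fields=REPORT_FIELDS):
    """Percentage of fields on a Report that are empty or None after interpretation."""
    missing = sum(1 for f in fields if not getattr(report, f, None))
    return 100.0 * missing / len(fields)
```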
Given an Article, if it is in English, make a call to Interpreter.process_article_new to obtain the Reports; for each Report returned, save it in the Report table; if there are no Reports, then set the article's relevance to False.
The scraper puts out a list of dictionaries with the contents and metadata from a webpage. We need to be able to save this as a CSV. Bonus points if you can append it back to the original CSV.
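A possible approach with pandas: dump the scraper output (a list of dicts) to CSV, and optionally append it to an existing file. The file name is a placeholder:

```python
import os
import pandas as pd

def save_scrape_results(records, path="scraped_articles.csv", append=False):
    """Write a list of dicts to CSV; with append=True, add rows to an existing file."""
    df = pd.DataFrame(records)
    if append and os.path.exists(path):
        df.to_csv(path, mode="a", header=False, index=False)
    else:
        df.to_csv(path, index=False)
```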
During scraping, can we tag whether something is text/video/image/PDF? Extra dessert if you can discern between news/blog etc.
Integrate code from notebooks that classifies an article as "conflict/disaster/both/other"
Haven't looked too deeply into newspaper's handling of datetimes, but if they vary from site to site, we will need to make them consistent. Maybe even have comma-separated values for the date, month and year published.
Make sure the pipeline is working with PDF articles for different scenarios:
Ideally include some tests in tests/test_Pipeline.py
World map(s) that show some combination of:
Bonus if we can filter the visualised points based on type, reporting unit, etc.
Make a call to Scraper.scrape for a given URL and update the relevant attributes in the database with the result; add the content to the Content table.
The extracted location could be either a Country, a country sub-division (e.g. a province or state) or a City.
We need to modify the schema for the Location class in model.py to ensure that we can capture these different options:
CREATE TABLE location (
    id SERIAL PRIMARY KEY,
    description TEXT,
    city TEXT,
    state TEXT,
    country CHAR(3) REFERENCES country ON DELETE CASCADE,
    latlong TEXT
);
We want people to play around with our tool and do some data analysis and visualisations. Start a notebook, see what you can make or break and let us know below.
The competition deliverables include:
- A brief document describing the functionalities, such as a user guide.
- A document describing the steps to maintain and update the tool with further features, such as an admin guide.
Although there is not a lot to say at this point in time, it is worth keeping in mind rather than leaving it to the last minute.
From the articles, we need to extract whether individuals or households are being displaced (the reporting unit), how many of them, the date the article was published, and which reporting term is most appropriate.
There is a barebones function to extract text from the URLs. However, this hasn't been tested across many different URLs and does not necessarily do the best job of extracting the main body of relevant text.
Should encapsulate all of the dependencies.