
internal-displacement's People

Contributors

alexanderrich, arnold-jr, coldfashioned, domingohui, frenski, jlln, simonb83, wanderingstar, wwymak

internal-displacement's Issues

Generate a reliability score for a given article

In some contexts, information about IDPs is highly politicized, which could be problematic if you're drawing from media reports. You'd want to be very careful in selecting which sources you used for info about the Rohingya in Myanmar, for example.

It would be good to be able to score an article for reliability in order to help analysts as they analyze and interpret the extracted data.
In some cases, news sources may be government-run, 'fake news', or have poor sourcing or a poor track record, so any data reported by and extracted from these sources should be identifiable as potentially problematic.

On the front end, this could include a filter that lets analysts select either all articles or only those with a reliability score above a certain threshold.

Some thoughts for implementation include:

  1. A maintainable list of known problematic sources
  2. Measuring similarity of reported facts between sources
  3. A maintainable list of highly trusted and common 'core' news sources and anything from these sources automatically gets a high reliability rating.
  4. New or unknown sources automatically get a lower rating unless their facts are similar enough to a report from a highly trusted source etc.
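
A minimal sketch of how the source-list approach (points 1 and 3 above) might work; the domain lists and score values below are hypothetical placeholders:

    from urllib.parse import urlparse

    # Hypothetical, maintainable source lists (could live in config or in the database).
    TRUSTED_SOURCES = {"bbc.co.uk", "reuters.com"}
    PROBLEMATIC_SOURCES = {"example-fake-news.com"}

    def source_reliability(url):
        """Return a rough reliability score in [0, 1] based on the article's domain."""
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        if domain in TRUSTED_SOURCES:
            return 1.0
        if domain in PROBLEMATIC_SOURCES:
            return 0.1
        # Unknown sources start lower until cross-checked against trusted reports (point 4).
        return 0.5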

Database schema for documents

Need an initial DB schema to capture information about documents and facts.

Proposal:
Tables:

  • Article (id, URL, retrieval date, source, publication date, title, authors, language, analyzer, analysis date) -- metadata extracted from a retrieved article
  • Full Text (article, content) -- full text of article
  • Analysis (id, article, reason, location, reporting term, reporting unit, number, metrics, analyzer, analysis date) -- analysis of retrieved article
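
If the tables end up in model.py as SQLAlchemy models, a rough sketch of the proposal above might look like this (column names and types are assumptions, not the final schema):

    from sqlalchemy import Column, DateTime, ForeignKey, Integer, Text
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Article(Base):
        __tablename__ = "article"
        id = Column(Integer, primary_key=True)
        url = Column(Text)
        retrieval_date = Column(DateTime)
        source = Column(Text)
        publication_date = Column(DateTime)
        title = Column(Text)
        authors = Column(Text)
        language = Column(Text)
        analyzer = Column(Text)
        analysis_date = Column(DateTime)

    class FullText(Base):
        __tablename__ = "full_text"
        article_id = Column(Integer, ForeignKey("article.id"), primary_key=True)
        content = Column(Text)

    class Analysis(Base):
        __tablename__ = "analysis"
        id = Column(Integer, primary_key=True)
        article_id = Column(Integer, ForeignKey("article.id"))
        reason = Column(Text)
        location = Column(Text)
        reporting_term = Column(Text)
        reporting_unit = Column(Text)
        number = Column(Integer)
        metrics = Column(Text)
        analyzer = Column(Text)
        analysis_date = Column(DateTime)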

Pull data from S3 bucket

Currently, the CSV data files are stored in my personal S3 bucket. We need to be able to download them locally or load them directly into a dataframe.
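
A possible approach using boto3 and pandas (the bucket and key names below are placeholders):

    import io
    import boto3
    import pandas as pd

    # Placeholder bucket/key; the real values depend on the S3 bucket mentioned above.
    BUCKET = "internal-displacement-data"
    KEY = "input_dataset.csv"

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=BUCKET, Key=KEY)
    df = pd.read_csv(io.BytesIO(obj["Body"].read()))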

Implement filtering of documents not reporting on human mobility

This is the third filtering requirement from the competition guidelines, to eliminate articles which mention the word 'mobility' but are unrelated to human mobility.

As per @milanoleonardo, a possible approach:

this can be done by looking at the dependency trees of the sentences in the text to make sure there is a link between a “reporting term” and a “reporting unit” (see challenge for details). This would definitely remove all documents reporting on “hip displacement” or sentences like “displaced the body of people” etc.
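
A rough sketch of that idea with spaCy's dependency parse; the term and unit word lists below are illustrative only, not the challenge's official lists:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    REPORTING_TERMS = {"displaced", "evacuated", "fled"}               # illustrative
    REPORTING_UNITS = {"people", "persons", "families", "households"}  # illustrative

    def mentions_human_mobility(text):
        """Return True if a reporting term is syntactically linked to a reporting unit."""
        doc = nlp(text)
        for token in doc:
            if token.lower_ in REPORTING_TERMS:
                # Check the token's syntactic neighbours for a reporting unit.
                neighbours = list(token.children) + [token.head]
                if any(t.lower_ in REPORTING_UNITS for t in neighbours):
                    return True
        return False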

Scrape and store article content from URLs

The master input, extended and training datasets all contain URLs. For initial exploration and later analysis, it would be nice to build functionality to scrape, strip, and store the article information.
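
One possible starting point is the newspaper library, roughly along these lines (a sketch, not the final scraper):

    from newspaper import Article

    def scrape(url):
        """Download and parse an article, returning its text and basic metadata."""
        article = Article(url)
        article.download()
        article.parse()
        return {
            "url": url,
            "title": article.title,
            "authors": article.authors,
            "publish_date": article.publish_date,
            "text": article.text,
        }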

Improve detection of numbers by spacy.

The extraction of information from unstructured texts currently relies on the ability of spaCy to identify number-like substrings within text.
Specifically, the like_num attribute of the spaCy Token class is used: https://spacy.io/docs/api/token.

This check fails to detect approximate numerical terms (e.g. hundreds, thousands, dozens, few). While these terms are approximate, they are still useful for establishing orders of magnitude and for comparison across reports.

If someone could create an improved method for determining whether a spaCy token is like a number, that would significantly improve our ability to extract information from texts.

More details on this issue are included in this notebook.
https://github.com/Data4Democracy/internal-displacement/blob/master/notebooks/DependencyTreeExperiments3.ipynb
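
A minimal sketch of an improved check; the set of approximate terms is illustrative and would need extending:

    # Illustrative set of approximate quantity words not caught by token.like_num.
    APPROXIMATE_TERMS = {"few", "several", "dozens", "hundreds", "thousands", "millions"}

    def is_number_like(token):
        """Return True if a spaCy token looks like a number, including approximate terms."""
        return token.like_num or token.lower_ in APPROXIMATE_TERMS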

Deal with update_status errors

In Pipeline.process_url we make multiple calls to article.update_status().

The update_status method may raise UnexpectedArticleStatusException if it appears that the status has been changed in the meantime.

process_url should be prepared for dealing with this exception.
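
One option, sketched below, is to treat the exception as a signal that another worker already owns the article (whether to retry or skip is up for discussion; safe_update_status is a hypothetical helper):

    def safe_update_status(article, new_status):
        """Update the article status; return False if it was changed concurrently."""
        try:
            article.update_status(new_status)
            return True
        except UnexpectedArticleStatusException:
            # Another process changed the status in the meantime; skip further work.
            return False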

Number-like entities to integers

Prior to saving reports to the database, quantities need to be converted to integers.

In some cases this is trivial (e.g. 500, 684); however, other cases need more work, for example 'thousand', 'hundreds', etc.

Probably the best place to implement the conversion is in Report.__init__
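
A first pass at the conversion might look like this (the word-to-value mapping is illustrative; approximate terms map to order-of-magnitude estimates):

    # Illustrative mapping; approximate terms get order-of-magnitude estimates.
    WORD_TO_NUMBER = {
        "dozens": 12,
        "hundred": 100, "hundreds": 100,
        "thousand": 1000, "thousands": 1000,
        "million": 1000000, "millions": 1000000,
    }

    def quantity_to_int(quantity):
        """Convert a number-like string (e.g. '500', 'thousands') to an int, or None."""
        cleaned = quantity.strip().lower().replace(",", "")
        if cleaned.isdigit():
            return int(cleaned)
        return WORD_TO_NUMBER.get(cleaned)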

Scraper - Tag broken URLs

Lots of the URLs are broken, or contain information that can't be parsed as text (e.g. videos, images). How can we filter them out and tag them as such?

Integrate event fact extraction

Integrate work from the notebooks into the codebase in order to extract

  • The reporting term (e.g. destroyed, displaced, etc.)
  • The reporting unit (e.g. houses, people, villages, etc.)
  • The quantity referenced (e.g. 500, thousands, tens)
  • The date of the event (e.g. Saturday 09 May 2015, last Saturday)
  • The location of the event

from each article's content.

Classify article

Given an Article, run the classifier and update its category: conflict/violence/disaster/both/other.

Config for running in AWS

The docker-compose.yml and docker.env files are currently set up with local development in mind. We'll want a production-friendly config.

  • Don't run localdb
  • DB config refers to an AWS RDS instance instead of localdb (please do not check credentials into git)
  • Node.js runs production version, instead of development version

Python process to check for new URLs and run the pipeline on them

We would like the front end to be able to submit new URLs for processing by writing an article row into the DB with a status of NEW. We need a back-end process that looks for such rows and kicks off the scraping & interpretation pipeline.

Because it takes a while to bring up the interpretation environment (loading dependencies & model), it probably makes sense to have a long-running process that spends most of its time sleeping and occasionally (once every 60s? configurable?) wakes up and looks for new DB rows to process.
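
A sketch of such a long-running worker; the query, status value, and pipeline call are assumptions about the eventual API:

    import time

    POLL_INTERVAL = 60  # seconds; could be made configurable

    def run_worker(session, pipeline):
        """Poll the DB for articles with status NEW and run the pipeline on each."""
        while True:
            # Hypothetical ORM query; the real code would use the project's session/model.
            new_articles = session.query(Article).filter(Article.status == "NEW").all()
            for article in new_articles:
                pipeline.process_url(article.url)
            time.sleep(POLL_INTERVAL)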

Visualization discussion

The ultimate aim of this project is to make a visualization tool that can:

  • Map the displacement figures and locations, identify hotspots and trends.
  • Visualize reporting frequency and statistics for a selected region (using histogram or other such charts)
  • Display excerpts of documents where the relevant information is reported (either by looking at the map or browsing the list of URLs).
  • Visualize anything else you can think of!

To get started, datasets to play with can be found here.

pgConfig change breaks production config

This code in master breaks production:

//if not using docker
//create a pgConfig.js file in the same directory and put your credentials there
const connectionObj = require('./pgConfig');
nodejs_1   | [0] Error: Cannot find module './pgConfig'
nodejs_1   | [0]     at Function.Module._resolveFilename (module.js:470:15)
nodejs_1   | [0]     at Function.Module._load (module.js:418:25)
nodejs_1   | [0]     at Module.require (module.js:498:17)
nodejs_1   | [0]     at require (internal/module.js:20:19)
nodejs_1   | [0]     at Object.<anonymous> (/internal-displacement-web/server/pgDB/index.js:5:23)
nodejs_1   | [0]     at Module._compile (module.js:571:32)
nodejs_1   | [0]     at Object.Module._extensions..js (module.js:580:10)
nodejs_1   | [0]     at Module.load (module.js:488:32)
nodejs_1   | [0]     at tryModuleLoad (module.js:447:12)
nodejs_1   | [0]     at Function.Module._load (module.js:439:3)
nodejs_1   | [0] [nodemon] app crashed - waiting for file changes before starting...

Can this be made optional? If the pgConfig file exists, require it; otherwise, use the environment variables.

Enhance country detection in article content

Enhance the country_code function in interpreter.py in order to more reliably recognize countries.
For example, it currently fails for 'the United States' vs. 'United States'.

It would also be good to try to detect countries even when the name is not explicitly mentioned, e.g. by inferring the country from city names.

The Mordecai library may be an option; however, it requires its own NLP parsing, and I wonder whether there is a simpler way to do this without using two NLP libraries plus trained models.
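
One lighter-weight option (an assumption, not a settled choice) is to normalise the string and look it up with pycountry before reaching for a full geoparser:

    import pycountry

    def country_code(name):
        """Return the ISO alpha-3 code for a country name, tolerating a leading 'the'."""
        cleaned = name.strip()
        if cleaned.lower().startswith("the "):
            cleaned = cleaned[4:]
        try:
            return pycountry.countries.lookup(cleaned).alpha_3
        except LookupError:
            return None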

Train classifier on training dataset

There is a training dataset here that can be used to train a classifier to tag articles with "violence" or "disaster" as the cause of displacement. It is quite a short dataset, but using the URL scraping functionality in scrape_articles.py we should be able to make a start at training a classification algorithm.
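
A simple baseline might be TF-IDF features plus a linear classifier, roughly as below (the CSV filename and column names are assumptions about the scraped training data):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Assumed columns: 'text' (scraped article body) and 'label' (violence/disaster).
    df = pd.read_csv("training_dataset_scraped.csv")
    X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.2)

    clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))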

Infrastructure Plan

Here's a sketch of an infrastructure plan:

Development

Scrapers run locally (on developer machine) in Docker for prototyping (internal-displacement repo):

  • Write to local DB in Docker
  • Can read scrape requests from the database, but most scrapes will be triggered manually (through notebooks or scripts)

Web app runs locally in Docker for prototyping (internal-displacement-web repo):

  • Reads local DB in Docker
  • Writes scrape requests to the database

IDETECT Preparation

Scraper and web app Docker containers deployed to an AWS instance or similar cloud-hosted infrastructure:

  • Read and write to an Amazon RDS database
  • Large batch of scrape requests input into the database, read from there and processed by the Scraper(s)

Manage PDF scraping

PDF scraping could be hard-disk intensive and could slow down scraping when bulk-loading URLs. Can we:

  1. Have the option to turn off PDF scraping. What part of the code should control this?
  2. Delete each PDF as soon as it has been downloaded and parsed?

Detect URLs with PDF

Copied from @jlln

Nice work with the parser. I have looked into incorporating it into the scraper but I have encountered the issue of identifying PDFs:

  • How to distinguish a URL returning a PDF from one returning HTML?
    - The URL alone is not sufficient to identify the returned content type.
    - I have tried using the Python requests module to pull the header and examine the Content-Type. However, this doesn't always work, because some URLs return an HTML page that contains a PDF in an iframe, e.g. http://erccportal.jrc.ec.europa.eu/getdailymap/docId/1125
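
A partial workaround is to inspect the Content-Type header and sniff the first bytes of the body; as noted above, this still misses PDFs embedded in iframes:

    import requests

    def looks_like_pdf(url):
        """Best-effort check for whether a URL serves a PDF directly."""
        resp = requests.get(url, stream=True, timeout=10)
        if "application/pdf" in resp.headers.get("Content-Type", "").lower():
            return True
        # Fall back to checking for the PDF magic bytes at the start of the body.
        first_bytes = next(resp.iter_content(chunk_size=5), b"")
        return first_bytes.startswith(b"%PDF-")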

Plot of events

Scatter plot (or other) visualising information related to events. Could include

  • Location
  • Number of events
  • Magnitude of events

among others.

Convert relative dates to absolute datetimes

An Article may have a publication date in datetime format. Dates extracted from text can often be relative or vague, e.g. "last Saturday".

Write a function that combines the article.Article publication date with dates interpreted in report.Report, in order to convert dates extracted from text into absolute datetimes.
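
One option for the simple cases is the dateparser library (an assumption; it is not currently a project dependency), which can resolve relative phrases against a base date:

    import dateparser

    def to_absolute_datetime(date_text, publication_date):
        """Convert an extracted date phrase (e.g. 'last Saturday') to a datetime,
        interpreting relative phrases against the article's publication date."""
        return dateparser.parse(date_text, settings={"RELATIVE_BASE": publication_date})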

Scraping reliability score

Write a function in article.Article that calculates the percentage of scraped fields which are returned empty.

We may consider expanding the definition of scraping reliability later, so suggestions welcome.

Reliability score for report interpretation

Write a function that calculates the percentage of missing fields in report.Report after an article has been interpreted.

We may expand this later to include weighting or other factors. Discussion welcome.

Articles to reports

Given an Article, if it is in English, call Interpreter.process_article_new to obtain its Reports. Save each returned Report in the Report table; if no Reports are returned, set the article's relevance to False.

Pipeline - save data to csv

The scraper outputs a list of dictionaries with the content and metadata from a webpage. We need to be able to save this as a CSV. Bonus points if you can append it back to the original CSV.
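
A minimal version with pandas; appending back to the original CSV is then just a concat-and-rewrite (field names come from whatever the scraper emits):

    import pandas as pd

    def save_scrape_results(results, out_path, original_csv=None):
        """Save the scraper's list of dicts as a CSV, optionally appending to the original."""
        df = pd.DataFrame(results)
        if original_csv:
            df = pd.concat([pd.read_csv(original_csv), df], ignore_index=True)
        df.to_csv(out_path, index=False)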

Scraper - Tag content type

During scraping, can we tag whether something is text/video/image/PDF? Extra dessert if you can discern between news, blog, etc.

Pipeline - consistent date and time

Haven't looked too deeply into newspaper's handling of datetimes, but if they vary from site to site we will need to make them consistent. Maybe even store comma-separated values for the date, month and year published.

Pipeline testing for pdf articles

Make sure the pipeline works with PDF articles in different scenarios:

  • Non-existent / broken URL
  • Non-English
  • Irrelevant
  • Relevant

Ideally include some tests in tests/test_Pipeline.py

Map of events/flow

World map(s) that show some combination of:

  • Number of events
  • Magnitude of events
  • Conflict/Violence labels

Bonus if we can filter the visualised points based on type, reporting unit, etc.

Scraped content to database

Make a call to Scraper.scrape for a given URL, update the relevant attributes in the database with the result, and add the content to the Content table.

Modify Location Schema to be able to distinguish between cities and country sub-divisions

The extracted location could be a Country, a country sub-division (e.g. a province or state), or a City.

We need to modify the schema for the Location class in model.py to ensure that we can capture these different options:

CREATE TABLE location (
    id SERIAL PRIMARY KEY,
    description TEXT,
    city TEXT,
    state TEXT,
    country CHAR(3) REFERENCES country ON DELETE CASCADE,
    latlong TEXT
);

Explore refugee data in Jupyter Notebooks

We want people to play around with our tool and do some data analysis and visualisations. Start a notebook, see what you can make or break and let us know below.

Create, maintain and update user guide / admin guide.

The competition deliverables include:

  • A brief document describing the functionalities, such as a user guide.
  • A document describing the steps to maintain and update the tool with further features, such as an admin guide.

Although there is not a lot to say at this point, it is worth keeping this in mind rather than leaving it to the last minute.

Best NLP approach to extract useful info from articles

From the articles, we need to extract whether individuals or households are being displaced (the reporting unit), how many of them, the date the article was published, and which reporting term is most appropriate.

Improve text extraction from URLs with beautifulsoup

There is a barebones function to extract text from the URLs. However, this hasn't been tested across many different URLs and does not necessarily do the best job of extracting the main body of relevant text.
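
A slightly more careful extraction sketch with BeautifulSoup (an illustration, not the existing function): drop script/style tags and keep only paragraph text.

    import requests
    from bs4 import BeautifulSoup

    def extract_text(url):
        """Fetch a URL and return the concatenated paragraph text from its HTML."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()
        paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
        return "\n".join(paragraphs)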
