data4democracy / internal-displacement
Studying news events and internal displacement.
Take any useful code from Scraper in scrape_articles.py and refactor it into scraper.py.
Initialize an Article from a URL, save it to the DB and set its status to new / pending.
In some contexts, information about IDPs is highly politicized, which could be problematic if you're drawing from media reports. You'd want to be very careful in selecting which sources you used for info about the Rohingya in Myanmar, for example.
It would be good to be able to score an article for reliability in order to help analysts as they analyze and interpret the extracted data.
In some cases, news sources may be government run, 'fake news' or have poor sources / track record, and so any data reported by and extracted from these sources should be identifiable as having potential issues.
On the front end, this could include a filter for analysts to use, whereby they can select all articles, or only those with a reliability score above a certain threshold.
Some thoughts for implementation include:
Need an initial DB schema to capture information about documents and facts.
Proposal:
Tables:
Currently, the CSV data files are stored in my personal S3 bucket. We need to be able to download them locally or load them directly into a dataframe.
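A minimal sketch of the loading side, assuming the files are reachable over HTTPS (the bucket name and key below are placeholders, not the real locations):

```python
# Load one of the CSV data files straight into a DataFrame.
# The URL is hypothetical; substitute the actual S3 location.
import pandas as pd

CSV_URL = "https://s3.amazonaws.com/example-bucket/input_urls.csv"

def load_dataset(url=CSV_URL):
    """Download the CSV over HTTPS and return it as a pandas DataFrame."""
    return pd.read_csv(url)

if __name__ == "__main__":
    df = load_dataset()
    print(df.head())
```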
This is the third filtering requirement from the competition guidelines, to eliminate articles which mention the word 'mobility' but are unrelated to human mobility.
As per @milanoleonardo, a possible approach:
this can be done by looking at the dependency trees of the sentences in the text to make sure there is a link between a “reporting term” and a “reporting unit” (see challenge for details). This would definitely remove all documents reporting on “hip displacement” or sentences like “displaced the body of people” etc.
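A rough sketch of that idea with spaCy is below. The reporting-term and reporting-unit lists are illustrative only; the real lists come from the challenge guidelines, and the model must be downloaded separately (`python -m spacy download en_core_web_sm`):

```python
# Dependency-tree filter sketch: keep a document only if a reporting term
# is linked (via its subtree) to a reporting unit.
import spacy

nlp = spacy.load("en_core_web_sm")

REPORTING_TERMS = {"displace", "displaced", "evacuate", "flee"}   # illustrative lemmas
REPORTING_UNITS = {"people", "person", "family", "household", "resident"}

def mentions_human_mobility(text):
    """Return True if any sentence links a reporting term to a reporting unit."""
    doc = nlp(text)
    for sent in doc.sents:
        for token in sent:
            if token.lemma_.lower() in REPORTING_TERMS:
                subtree_lemmas = {t.lemma_.lower() for t in token.subtree}
                if subtree_lemmas & REPORTING_UNITS:
                    return True
    return False

print(mentions_human_mobility("Floods displaced thousands of people."))  # True
print(mentions_human_mobility("She had surgery for a displaced hip."))   # False
```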
There is a placeholder function sample_urls in scrape_articles.py. This should be updated to return a random subsample of the URLs in a list.
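One possible implementation, assuming the function takes the full URL collection and a sample size (the signature is a guess, not taken from the repo):

```python
import random

def sample_urls(urls, size=100, seed=None):
    """Return up to `size` URLs drawn at random, without replacement, as a list."""
    if seed is not None:
        random.seed(seed)
    urls = list(urls)
    return random.sample(urls, min(size, len(urls)))
```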
The master input, extended and training datasets all contain URLs. For initial exploration and later analysis, it would be nice to build functionality to scrape, strip, and store the article information.
The extraction of information from unstructured texts currently relies on the ability of spaCy to identify number-like substrings within text.
Specifically, the like_num attribute of the spaCy Token class is used (https://spacy.io/docs/api/token).
This attribute fails to detect approximate numerical terms (e.g. hundreds, thousands, dozens, few). While these terms are approximate, they are still useful for establishing orders of magnitude and for comparison across reports.
If someone could create an improved method to determine if a spacy token is like a number, that would significantly improve our ability to extract information from texts.
More details on this issue are included in this notebook.
https://github.com/Data4Democracy/internal-displacement/blob/master/notebooks/DependencyTreeExperiments3.ipynb
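A hedged sketch of a broader number-likeness test: fall back to spaCy's built-in Token.like_num and additionally accept approximate terms. The word list and the magnitudes assigned to them are assumptions, intended only for order-of-magnitude comparisons:

```python
# Approximate terms and the rough values assigned to them (assumed, not canonical).
APPROXIMATE_TERMS = {
    "few": 3,
    "several": 5,
    "dozens": 24,
    "hundreds": 200,
    "thousands": 2000,
    "millions": 2_000_000,
}

def like_number(token):
    """Return True if a spaCy token looks like an exact or approximate number."""
    return token.like_num or token.lower_ in APPROXIMATE_TERMS

def approximate_value(token):
    """Rough order-of-magnitude value for approximate terms, else None."""
    return APPROXIMATE_TERMS.get(token.lower_)
```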
Currently, scraper.py only scrapes HTML pages. Can we also scrape and parse PDFs?
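A minimal sketch of PDF text extraction that could sit alongside the HTML scraper. It assumes the pdfminer.six package and downloads the file with requests; how the result plugs into Scraper is left open:

```python
import io
import requests
from pdfminer.high_level import extract_text

def scrape_pdf(url):
    """Download a PDF and return its plain text."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return extract_text(io.BytesIO(resp.content))
```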
Each parsed article needs to be classified into one of three categories:
Additionally, the chosen classifier / model must allow for easy online-learning or re-training in the future using new or larger datasets.
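For illustration only, one way to meet the online-learning requirement is scikit-learn's SGDClassifier with a stateless HashingVectorizer, updated via partial_fit. The label set below is a placeholder, not the project's final category list:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

CLASSES = ["conflict/violence", "disaster", "other"]  # illustrative labels

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier()

def train_batch(texts, labels):
    """Update the model with a new batch of labelled articles (online learning)."""
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=CLASSES)

def predict(texts):
    """Predict categories for unseen article texts."""
    return clf.predict(vectorizer.transform(texts))
```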
In Pipeline.process_url we make multiple calls to article.update_status(). The update_status method may raise UnexpectedArticleStatusException if it appears that the status has been changed in the meantime. process_url should be prepared to deal with this exception.
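One possible pattern is a small wrapper that process_url can call at each status transition; the logging behaviour and the assumption that a concurrent change means another worker has taken over are both mine, not the repo's:

```python
import logging

# Assumes UnexpectedArticleStatusException is importable from the project's article module.
logger = logging.getLogger(__name__)

def safe_update_status(article, new_status):
    """Try to advance an article's status; return False if another process
    changed it in the meantime, so the caller can skip further work."""
    try:
        article.update_status(new_status)
        return True
    except UnexpectedArticleStatusException:
        logger.warning("Status of article %s changed concurrently; skipping", article.url)
        return False
```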
Prior to saving reports to the database, quantities need to be converted to integers.
In some cases this is trivial (e.g. 500, 684); however, there are other cases that need more work, for example 'thousand', 'hundreds', etc.
Probably the best place to implement the conversion is in Report.__init__
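A rough sketch of the conversion that could be called from Report.__init__. The mapping for vague terms is an assumption and only aims at order of magnitude:

```python
WORD_VALUES = {
    "dozen": 12, "dozens": 12,
    "hundred": 100, "hundreds": 100,
    "thousand": 1000, "thousands": 1000,
    "million": 1_000_000, "millions": 1_000_000,
}

def quantity_to_int(quantity):
    """Convert a quantity string such as '500', '684' or 'thousands' to an int,
    or return None if no sensible conversion exists."""
    if quantity is None:
        return None
    text = str(quantity).strip().lower().replace(",", "")
    if text.isdigit():
        return int(text)
    return WORD_VALUES.get(text)
```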
Lots of the URLs are broken, or contain information that can't be parsed as text (e.g. videos, images). How can we filter them out and tag them as such?
@jlln You used async tasks to increase the speed of the beautifulsoup scraper. Can we do the same with the newspaper one?
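Since newspaper's download/parse calls are blocking, one option is to run them in a thread pool from asyncio, similar in spirit to the async BeautifulSoup scraper. The function names here are illustrative, not the repo's:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from newspaper import Article  # newspaper3k

def scrape_one(url):
    """Blocking scrape of a single URL with newspaper."""
    article = Article(url)
    article.download()
    article.parse()
    return {"url": url, "title": article.title, "text": article.text}

async def scrape_many(urls, max_workers=10):
    """Run the blocking scrapes concurrently in a thread pool."""
    loop = asyncio.get_event_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        tasks = [loop.run_in_executor(pool, scrape_one, u) for u in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)

# results = asyncio.get_event_loop().run_until_complete(scrape_many(url_list))
```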
Integrate work from notebooks into the codebase in an attempt to extract information from each article's content.
Given an Article, run the classifier and update its category: conflict/violence, disaster, both or other.
Given an Article, check and update its language.
The docker-compose.yml and docker.env files are currently set up with local development in mind. We'll want a production-friendly config.
We would like the front end to be able to submit new URLs for processing by writing an article row into the DB with a status of NEW. We need a process that runs on the back end that looks for such rows and kicks off the scraping & interpretation pipeline.
Because it takes a while to bring up the interpretation environment (loading dependencies & the model), it probably makes sense to have a long-running process that spends most of its time sleeping and occasionally (once every 60s? configurable?) wakes up and looks for new DB rows to process.
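A minimal sketch of that poller, under the assumption that the back end exposes a session and the existing Pipeline; the query helper is hypothetical and the 60-second interval is made configurable as suggested:

```python
import time

def run_worker(session, pipeline, poll_interval=60):
    """Wake up periodically, look for articles with status NEW and process them."""
    while True:
        new_articles = fetch_articles_with_status(session, "NEW")  # hypothetical helper
        for article in new_articles:
            pipeline.process_url(article.url)
        time.sleep(poll_interval)
```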
The ultimate aim of this project is to make a visualization tool that can:
To get started, datasets to play with can be found here.
Can we extract items such as the title and date published from a PDF?
This code in master breaks production:
//if not using docker
//create a pgConfig.js file in the same directory and put your credentials there
const connectionObj = require('./pgConfig');
nodejs_1 | [0] Error: Cannot find module './pgConfig'
nodejs_1 | [0] at Function.Module._resolveFilename (module.js:470:15)
nodejs_1 | [0] at Function.Module._load (module.js:418:25)
nodejs_1 | [0] at Module.require (module.js:498:17)
nodejs_1 | [0] at require (internal/module.js:20:19)
nodejs_1 | [0] at Object.<anonymous> (/internal-displacement-web/server/pgDB/index.js:5:23)
nodejs_1 | [0] at Module._compile (module.js:571:32)
nodejs_1 | [0] at Object.Module._extensions..js (module.js:580:10)
nodejs_1 | [0] at Module.load (module.js:488:32)
nodejs_1 | [0] at tryModuleLoad (module.js:447:12)
nodejs_1 | [0] at Function.Module._load (module.js:439:3)
nodejs_1 | [0] [nodemon] app crashed - waiting for file changes before starting...
Can this be made optional? If the file pgConfig exists, require it, otherwise use the environment variables?
Tag which language scraped content is written in.
Maybe langdetect is useful?
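A quick sketch using langdetect, as suggested; it returns ISO 639-1 codes (e.g. 'en', 'es'), and seeding the detector makes results deterministic for short texts:

```python
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # deterministic results for short/ambiguous texts

def tag_language(content):
    """Return a language code for scraped content, or None if detection fails."""
    try:
        return detect(content)
    except Exception:
        return None
```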
Enhance the country_code function in interpreter.py in order to more reliably recognize countries.
For example, it currently fails for 'the United States' vs 'United States'.
It would also be good to try to detect countries even when the name is not explicitly mentioned, e.g. from city names.
The Mordecai library may be an option; however, it requires its own NLP parsing, and I was wondering if there is a simpler way to do this without using two NLP libraries + trained models.
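One lighter-weight improvement, assuming the pycountry package: normalise the phrase (drop a leading 'the') and fall back to fuzzy search. This does not handle inference from city names, which would still need a geocoder such as Mordecai:

```python
import pycountry

def country_code(name):
    """Return an ISO 3166-1 alpha-3 code for a country name, or None."""
    if not name:
        return None
    cleaned = name.strip()
    if cleaned.lower().startswith("the "):
        cleaned = cleaned[4:]
    try:
        return pycountry.countries.search_fuzzy(cleaned)[0].alpha_3
    except LookupError:
        return None

print(country_code("the United States"))  # 'USA'
```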
Under Project Components -> Visualizer, the README says "Visualise relability of".
Did you mean to say "Visualise reliability of machine learning [something, e.g. model for article classification]"?
Thanks!
There is a training dataset here that can be used to train a classifier to tag articles with "violence" or "disaster" as the cause of displacement. It is quite a short dataset, but using the URL scraping functionality in scrape_articles.py we should be able to make a start at training a classification algorithm.
Here's a sketch of an infrastructure plan:
- Scrapers run locally (on a developer machine) in Docker for prototyping (internal-displacement repo)
  - Write to a local DB in Docker
  - Can read scrape requests from the database, but most scrapes will be triggered manually (through notebooks or scripts)
- Web app runs locally in Docker for prototyping (internal-displacement-web repo)
  - Reads the local DB in Docker
  - Writes scrape requests to the database
- Scraper and web app Docker containers deployed to an AWS instance or similar cloud-hosted infrastructure
  - Read and write to an Amazon RDS database
  - Large batches of scrape requests are input into the database, read from there and processed by the Scraper(s)
Complete the unit tests for the scraper and pipeline components
PDF scraping could be fairly hard-disk intensive and could slow down scraping when doing a bulk load of URLs. Can we:
Copied from @jlln
Nice work with the parser. I have looked into incorporating it into the scraper but I have encountered the issue of identifying PDFs:
- How to distinguish a URL returning a PDF from one returning HTML?
- The URL alone is not sufficient to identify the type of the returned object.
- I have tried using the Python requests module to pull the header and examine its content type. However, this doesn't always work, because some URLs return an HTML page that contains a PDF in an iframe, e.g. http://erccportal.jrc.ec.europa.eu/getdailymap/docId/1125
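A sketch of a best-effort check: look at the Content-Type header first, then sniff the leading bytes for the %PDF magic number. As noted above, pages that embed a PDF in an iframe will still look like HTML to this test:

```python
import requests

def looks_like_pdf(url):
    """Best-effort check for whether a URL serves a PDF."""
    resp = requests.get(url, stream=True, timeout=30)
    resp.raise_for_status()
    content_type = resp.headers.get("Content-Type", "").lower()
    if "pdf" in content_type:
        return True
    first_bytes = next(resp.iter_content(chunk_size=5), b"")
    return first_bytes.startswith(b"%PDF")
```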
Scatter plot (or other) visualising information related to events. Could include
among others.
An Article may have a publication date in datetime format. Dates extracted from text can often be relative or vague, e.g. "last Saturday".
Write a function to combine the article.Article publication date with dates interpreted in report.Report in order to convert dates extracted from text into datetimes.
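A hedged sketch using the dateparser package: interpret a phrase extracted from the text relative to the article's publication date via the RELATIVE_BASE setting. The surrounding function name is illustrative:

```python
from datetime import datetime
import dateparser

def resolve_report_date(phrase, publication_date):
    """Turn a phrase like 'last Saturday' into a datetime anchored to the
    article's publication date; return None if it cannot be parsed."""
    return dateparser.parse(phrase, settings={"RELATIVE_BASE": publication_date})

print(resolve_report_date("last Saturday", datetime(2017, 3, 15)))
```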
Write a function in article.Article that calculates the percentage of scraped fields which are returned empty.
We may consider expanding the definition of scraping reliability later, so suggestions welcome.
Write a function that calculates the percentage of missing fields in report.Report after an article has been interpreted.
We may expand this later to include weighting or other factors. Discussion welcome.
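A simple sketch of the metric for report.Report; the same idea works for the scraping-reliability measure on article.Article. The field names are assumptions, not taken from model.py:

```python
REPORT_FIELDS = ["event_term", "subject_term", "quantity", "locations", "date_times"]  # assumed

def missing_field_ratio(report, fields=REPORT_FIELDS):
    """Percentage of fields on a Report that are empty or None after interpretation."""
    missing = sum(1 for f in fields if not getattr(report, f, None))
    return 100.0 * missing / len(fields)
```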
Given an Article, if it is in English, make a call to Interpreter.process_article_new to obtain the Reports; for each Report returned, save it in the Report table; if there are no Reports, then set the article's relevance to False.
The scraper puts out a list of dictionaries with the contents and metadata from a webpage. We need to be able to save this as a CSV. Bonus points if you can append it back to the original CSV.
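A possible approach with pandas: dump the scraper output (a list of dicts) to CSV, and optionally append it to an existing file. The file name is a placeholder:

```python
import os
import pandas as pd

def save_scrape_results(records, path="scraped_articles.csv", append=False):
    """Write a list of dicts to CSV; with append=True, add rows to an existing file."""
    df = pd.DataFrame(records)
    if append and os.path.exists(path):
        df.to_csv(path, mode="a", header=False, index=False)
    else:
        df.to_csv(path, index=False)
```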
During scraping, can we tag whether something is text/video/image/PDF? Extra dessert if you can discern between news/blog etc.
Integrate code from notebooks that classifies an article as "conflict/disaster/both/other"
Haven't looked too deeply into newspaper's handling of datetimes, but if they vary from site to site, we will need to make them consistent. Maybe even have comma-separated values for the date, month and year published.
Make sure the pipeline is working with PDF articles for different scenarios:
Ideally include some tests in tests/test_Pipeline.py
World map(s) that show some combination of:
Bonus if we can filter the visualised points based on type, reporting unit, etc.
Make a call to Scraper.scrape for a given URL and update the relevant attributes in the database with the result; add the content to the Content table.
The extracted location could be either a Country, a country sub-division (e.g. a province or state) or a City.
We need to modify the schema for the Location class in model.py to ensure that we can capture these different options:
CREATE TABLE location (
    id SERIAL PRIMARY KEY,
    description TEXT,
    city TEXT,
    state TEXT,
    country CHAR(3) REFERENCES country ON DELETE CASCADE,
    latlong TEXT
);
We want people to play around with our tool and do some data analysis and visualisations. Start a notebook, see what you can make or break and let us know below.
The competition deliverables include:
- A brief document describing the functionalities, such as a user guide.
- A document describing the steps to maintain and update the tool with further features, such as an admin guide.
Although there is not a lot to say at this point in time, it is worth keeping in mind rather than leaving it to the last minute.
From the articles, we need to extract whether individuals or households are being displaced (the reporting unit), how many of them, the date the article was published, and which reporting term is most appropriate.
There is a barebones function to extract text from the URLs. However, this hasn't been tested across many different URLs and does not necessarily do the best job of extracting the main body of relevant text.
Should encapsulate all of the dependencies.