Name: Luke Waninger
Date: 24 October 2016
HCDE Fall 2018 - A2
Throughout this assignment I will explore an existing bias in the Wikipedia dataset: specifically, bias in politician coverage, measured by the number and quality of articles relative to each country's population. I will gather data from two sources: one provides population data and the other contains the relevant article metadata. For each article, I will query a machine learning service to estimate the article's quality. Finally, I will generate a few tables and visualizations to display the bias.
To create the tables and visualizations we will draw from two data sources:
- Wikipedia Article Data found on Figshare.
field | data type | description |
---|---|---|
page | str | article title |
country | str | full country name |
rev_id | int | revision identification number |
- Population Data found at a random DropBox location.
field | data type | description |
---|---|---|
Geography | str | full country name |
Population mid-2018 (millions) | str | population in millions recorded in mid-2018 |
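The two sources above can be loaded with pandas. A minimal sketch follows; the file names are assumptions (adjust to the actual download locations), and the `clean_population` helper is hypothetical, included because the population column arrives as a string that may contain thousands separators:

```python
import pandas as pd

def clean_population(value):
    """Convert a population string such as '1,284.0' (millions) to a float."""
    return float(str(value).replace(",", ""))

# Hypothetical file names -- adjust to wherever the downloads were saved.
# articles = pd.read_csv("page_data.csv")         # page, country, rev_id
# population = pd.read_csv("WPDS_2018_data.csv")  # Geography, Population mid-2018 (millions)
# population["population_millions"] = (
#     population["Population mid-2018 (millions)"].apply(clean_population)
# )
```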
The environment dependencies can be installed via `pip install -r requirements.txt`. I recommend installing in a Conda environment running Python version 3.7.
Documentation for Python can be found here: https://docs.python.org/3.7/
Documentation for Jupyter Notebook can be found here: http://jupyter-notebook.readthedocs.io/en/latest/
The following Python packages were used and their documentation can be found at the accompanying links:
io
itertools
json
multiprocessing
requests
ipycache
- must be installed from the most recent GitHub content: `pip install git+https://github.com/rossant/ipycache.git`
IPython
numpy
pandas
plotly
synapse
zipfile
The Objective Revision Evaluation Service (ORES)
This API will be used to estimate the quality of each article drawn from source 1 above. Data will be gathered from this source through API requests issued from the notebook. ORES will return an estimated article quality and probabilities for six different rankings.
- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article
You may also notice two additional categories, not to be confused with the six designated above. These are error results from the ORES API and may be either 'Text Deleted' or 'Revision not Found'.
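A request to ORES can be sketched as below. The endpoint and the batched `revids` query string follow the public ORES v3 API; `extract_prediction` is a hypothetical helper showing how a missing score can be mapped to one of the error labels above (the exact label logic in the notebook may differ):

```python
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"

def build_ores_url(rev_ids, model="wp10"):
    """Build a batched ORES request URL for the given revision ids."""
    return f"{ORES_ENDPOINT}?models={model}&revids={'|'.join(map(str, rev_ids))}"

def extract_prediction(response, rev_id, model="wp10"):
    """Pull the predicted quality class out of a parsed ORES JSON response,
    falling back to an error label when no score is present."""
    score = response["enwiki"]["scores"].get(str(rev_id), {}).get(model, {})
    if "score" not in score:
        return "Revision not Found"
    return score["score"]["prediction"]
```

In the notebook the URL would be fetched with `requests.get(...).json()` and the result passed to `extract_prediction` for each revision id in the batch.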
This notebook creates one CSV file of data extracted and compiled as part of this analysis.
name | data type | description |
---|---|---|
country | str | the country being referenced |
article name | str | name of Wikipedia article |
revision_id | int | the revision id used when querying ORES for the article quality prediction |
article_quality | str | the predicted quality of the article |
population | int | population count in millions of people |
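Producing that file amounts to joining the article rows to the country populations and keeping the columns in the schema above. A minimal sketch, assuming the input frames use the column names listed in the two source tables (with the population column already cleaned to a numeric `population`):

```python
import pandas as pd

def build_output(articles, population):
    """Inner-join article metadata to country populations and reshape
    to the output schema described above (column names are assumptions)."""
    merged = articles.merge(
        population, left_on="country", right_on="Geography", how="inner"
    )
    out = merged[["country", "page", "rev_id", "article_quality", "population"]]
    return out.rename(columns={"page": "article name", "rev_id": "revision_id"})

# build_output(articles, population).to_csv("article_quality_by_country.csv", index=False)
```

An inner join drops countries that appear in only one of the two sources, which keeps every output row complete.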
This assignment code is released under the MIT license. Data Source Licenses:
- Figshare - CC BY 4.0
- DropBox - None Specified in Document Source
I began this project expecting that bias would be easy to identify, for a number of reasons. First, each country has a different level of access to internet resources. Second, many countries filter the content their populations can access; the most pervasive of these is North Korea, and this is reflected in the dataset. Third, Wikipedia itself has a different level of presence in each country. And lastly, the education level of a population varies across countries. This is by no means an exhaustive list, and only one of these factors (government filtering) relates directly to ethical issues. All of them, however, could potentially be a source of variance in the resulting dataset. That said, bias was found, and I was surprised at how drastically the data is skewed. First, check out the ten countries with the most articles per million people.
country | articles_per_million |
---|---|
Tuvalu | 5500 |
Nauru | 5300 |
San Marino | 2733.3 |
Monaco | 1000 |
Liechtenstein | 725 |
Tonga | 630 |
Marshall Islands | 616.7 |
Iceland | 515 |
Andorra | 425 |
Federated States Of Micronesia | 380 |
And the ten with the least.
country | articles_per_million |
---|---|
Vietnam | 2 |
Bangladesh | 1.9 |
Thailand | 1.7 |
Korea, North | 1.5 |
Zambia | 1.5 |
Ethiopia | 1 |
Uzbekistan | 0.9 |
China | 0.8 |
Indonesia | 0.8 |
India | 0.7 |
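The articles-per-million metric behind both tables can be computed with a groupby over the merged output. A sketch, assuming a frame with the `country`, `article_quality`, and numeric `population` (in millions) columns described earlier; the function name is my own:

```python
import pandas as pd

def articles_per_million(df):
    """Count articles per country and scale by population in millions."""
    counts = df.groupby("country").agg(
        n_articles=("article_quality", "size"),
        population=("population", "first"),
    )
    counts["articles_per_million"] = counts["n_articles"] / counts["population"]
    return counts.sort_values("articles_per_million", ascending=False)
```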
The ten countries with the highest percentage of quality articles.
country | percent_quality |
---|---|
Korea, North | 0.179 |
Saudi Arabia | 0.134 |
Central African Republic | 0.118 |
Romania | 0.115 |
Mauritania | 0.096 |
Bhutan | 0.091 |
Tuvalu | 0.091 |
Dominica | 0.083 |
United States | 0.075 |
Benin | 0.074 |
And the least quality.
country | percent_quality |
---|---|
Namibia | 0.006 |
Sierra Leone | 0.006 |
Brazil | 0.005 |
Bolivia | 0.005 |
Fiji | 0.005 |
Morocco | 0.005 |
Lithuania | 0.004 |
Nigeria | 0.004 |
Peru | 0.003 |
Tanzania | 0.002 |
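The percent-quality metric treats an article as "quality" when ORES predicts FA or GA. A minimal sketch over the same merged frame (the helper name is my own):

```python
import pandas as pd

def percent_quality(df):
    """Share of each country's articles that ORES rated FA or GA."""
    is_quality = df["article_quality"].isin(["FA", "GA"])
    return is_quality.groupby(df["country"]).mean().sort_values(ascending=False)
```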
The most surprising thing to me was the quality of the articles. North Korea has only 1.5 articles per million people, but of those articles, a high percentage are of good quality. I don't know who wrote these articles; perhaps the high percentage is because the originating authors come from Western sources, or perhaps it is a result of North Korean governance. The ten lowest-quality countries do not necessarily surprise me: these countries are neither English-speaking nor known for strong public education systems that would lead to high-quality articles being written. The tables alone don't convey how far outlying some of these data are. With no bias, I would expect a more symmetric distribution; these, however, are incredibly right-skewed. See the static visualizations below.
The analysis in this notebook leads to several more questions regarding where this bias comes from. Clearly, a strong bias exists. Joining this dataset with world development indicators could bring to light more meaningful insights into the problem.