Name: Luke Waninger
Date: 24 October 2016
HCDE Fall 2018 - A2
Throughout this assignment I will explore an existing bias in the Wikipedia dataset: specifically, bias in politician coverage, measured by the number and quality of articles relative to each country's population. I will gather data from two sources: one provides population data and the other contains the relevant article metadata. For each article, I will query a machine learning service to estimate the article's quality. Finally, I will generate a few tables and visualizations to display the bias.
To create the tables and visualizations we will draw from two data sources:
- Wikipedia Article Data found on Figshare.
field | data type | description |
---|---|---|
page | str | article title |
country | str | full country name |
rev_id | int | revision identification number |
- Population Data found at a random DropBox location.
field | data type | description |
---|---|---|
Geography | str | full country name |
Population mid-2018 (millions) | str | population in millions recorded in mid-2018 |
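The two sources above can be loaded with pandas. A minimal sketch follows; the file names are assumptions (adjust to the actual download locations), and the `clean_population` helper is hypothetical, included because the population column arrives as a string that may contain thousands separators:

```python
import pandas as pd

def clean_population(value):
    """Convert a population string such as '1,284.0' (millions) to a float."""
    return float(str(value).replace(",", ""))

# Hypothetical file names -- adjust to wherever the downloads were saved.
# articles = pd.read_csv("page_data.csv")         # page, country, rev_id
# population = pd.read_csv("WPDS_2018_data.csv")  # Geography, Population mid-2018 (millions)
# population["population_millions"] = (
#     population["Population mid-2018 (millions)"].apply(clean_population)
# )
```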
The environment dependencies can be installed via `pip install -r requirements.txt`. I recommend installing in a Conda environment running Python version 3.7.
Documentation for Python can be found here: https://docs.python.org/3.7/
Documentation for Jupyter Notebook can be found here: http://jupyter-notebook.readthedocs.io/en/latest/
The following Python packages were used and their documentation can be found at the accompanying links:
io
itertools
json
multiprocessing
requests
ipycache
- must be installed from the most recent GitHub content: `pip install git+https://github.com/rossant/ipycache.git`
IPython
numpy
pandas
plotly
synapse
zipfile
The Objective Revision Evaluation Service (ORES)
This API will be used to estimate the quality of each article drawn from source 1 above. Data will be gathered from this source through API requests issued from the notebook. ORES will return an estimated article quality and probabilities for six different rankings.
- FA - Featured article
- GA - Good article
- B - B-class article
- C - C-class article
- Start - Start-class article
- Stub - Stub-class article
You may also notice two additional categories, not to be confused with the six designated above. These are error results from the ORES API and may be either 'Text Deleted' or 'Revision not Found'.
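A request to ORES can be sketched as below. The endpoint and the batched `revids` query string follow the public ORES v3 API; `extract_prediction` is a hypothetical helper showing how a missing score can be mapped to one of the error labels above (the exact label logic in the notebook may differ):

```python
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"

def build_ores_url(rev_ids, model="wp10"):
    """Build a batched ORES request URL for the given revision ids."""
    return f"{ORES_ENDPOINT}?models={model}&revids={'|'.join(map(str, rev_ids))}"

def extract_prediction(response, rev_id, model="wp10"):
    """Pull the predicted quality class out of a parsed ORES JSON response,
    falling back to an error label when no score is present."""
    score = response["enwiki"]["scores"].get(str(rev_id), {}).get(model, {})
    if "score" not in score:
        return "Revision not Found"
    return score["score"]["prediction"]
```

In the notebook the URL would be fetched with `requests.get(...).json()` and the result passed to `extract_prediction` for each revision id in the batch.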
This notebook creates one CSV file of data extracted and compiled as part of this analysis.
name | data type | description |
---|---|---|
country | str | the country being referenced |
article name | str | name of Wikipedia article |
revision_id | int | the revision id used when querying ORES for the article quality prediction |
article_quality | str | the predicted quality of the article |
population | int | population count in millions of people |
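Producing that file amounts to joining the article rows to the country populations and keeping the columns in the schema above. A minimal sketch, assuming the input frames use the column names listed in the two source tables (with the population column already cleaned to a numeric `population`):

```python
import pandas as pd

def build_output(articles, population):
    """Inner-join article metadata to country populations and reshape
    to the output schema described above (column names are assumptions)."""
    merged = articles.merge(
        population, left_on="country", right_on="Geography", how="inner"
    )
    out = merged[["country", "page", "rev_id", "article_quality", "population"]]
    return out.rename(columns={"page": "article name", "rev_id": "revision_id"})

# build_output(articles, population).to_csv("article_quality_by_country.csv", index=False)
```

An inner join drops countries that appear in only one of the two sources, which keeps every output row complete.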
This assignment code is released under the MIT license. Data Source Licenses:
- Figshare - CC BY 4.0
- DropBox - None Specified in Document Source
I began this project expecting that bias would be easy to identify, for a number of reasons. First, each country has a different level of access to internet resources. Second, many countries filter the content their populations can access; the most pervasive of these is North Korea, and this is reflected in the dataset. Third, Wikipedia itself has a different level of presence in each country. And lastly, the education level of a population varies across countries. This is by no means an exhaustive list, and only one of these factors (government filtering) relates directly to ethical issues. All of them, however, could potentially be a source of variance in the resulting dataset. That said, bias was found, and I was surprised at how drastically the data is skewed. First, check out the ten countries with the most articles per million people.
country | articles_per_million |
---|---|
Tuvalu | 5500 |
Nauru | 5300 |
San Marino | 2733.3 |
Monaco | 1000 |
Liechtenstein | 725 |
Tonga | 630 |
Marshall Islands | 616.7 |
Iceland | 515 |
Andorra | 425 |
Federated States Of Micronesia | 380 |
And the ten with the least.
country | articles_per_million |
---|---|
Vietnam | 2 |
Bangladesh | 1.9 |
Thailand | 1.7 |
Korea, North | 1.5 |
Zambia | 1.5 |
Ethiopia | 1 |
Uzbekistan | 0.9 |
China | 0.8 |
Indonesia | 0.8 |
India | 0.7 |
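The articles-per-million metric behind both tables can be computed with a groupby over the merged output. A sketch, assuming a frame with the `country`, `article_quality`, and numeric `population` (in millions) columns described earlier; the function name is my own:

```python
import pandas as pd

def articles_per_million(df):
    """Count articles per country and scale by population in millions."""
    counts = df.groupby("country").agg(
        n_articles=("article_quality", "size"),
        population=("population", "first"),
    )
    counts["articles_per_million"] = counts["n_articles"] / counts["population"]
    return counts.sort_values("articles_per_million", ascending=False)
```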
The ten countries with the highest percentage of quality articles.
country | percent_quality |
---|---|
Korea, North | 0.179 |
Saudi Arabia | 0.134 |
Central African Republic | 0.118 |
Romania | 0.115 |
Mauritania | 0.096 |
Bhutan | 0.091 |
Tuvalu | 0.091 |
Dominica | 0.083 |
United States | 0.075 |
Benin | 0.074 |
And the least quality.
country | percent_quality |
---|---|
Namibia | 0.006 |
Sierra Leone | 0.006 |
Brazil | 0.005 |
Bolivia | 0.005 |
Fiji | 0.005 |
Morocco | 0.005 |
Lithuania | 0.004 |
Nigeria | 0.004 |
Peru | 0.003 |
Tanzania | 0.002 |
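The percent-quality metric treats an article as "quality" when ORES predicts FA or GA. A minimal sketch over the same merged frame (the helper name is my own):

```python
import pandas as pd

def percent_quality(df):
    """Share of each country's articles that ORES rated FA or GA."""
    is_quality = df["article_quality"].isin(["FA", "GA"])
    return is_quality.groupby(df["country"]).mean().sort_values(ascending=False)
```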
The most surprising thing to me was the quality of the articles. North Korea has only 1.5 articles per million people, but of those articles, a high percentage are of good quality. I don't know who wrote these articles; perhaps the high percentage is because the originating authors come from Western sources, or perhaps it is a result of North Korean governance. The ten lowest-quality countries do not necessarily surprise me: these countries are neither English-speaking nor known for strong public education systems that would lead to high-quality articles being written. The tables alone don't convey how far outlying some of these data are. With no bias, I would expect a more symmetric distribution; these, however, are incredibly right-skewed. See the static visualizations below.
The analysis in this notebook leads to several more questions regarding where this bias comes from. Clearly, a strong bias exists. Joining this dataset with world development indicators could bring to light more meaningful insights into the problem.