hcds-a2-bias's Introduction

Bias in Data

Name: Luke Waninger

Date: 24 October 2016

Goal

In this assignment I explore an existing bias in Wikipedia data: how the number and quality of articles about politicians relate to the population of their home countries. We gather the data from two sources: one provides population data and the other contains the relevant article metadata. For each article, we query a machine learning service to estimate the quality of the article. Finally, we generate a few tables and visualizations to display the bias. HCDE Fall 2018 - A2

Data sources used

To create the tables and visualizations we will draw from two data sources:

  1. Wikipedia article data, found on Figshare.

| field | data type | description |
| --- | --- | --- |
| page | str | article title |
| country | str | full country name |
| rev_id | int | revision identification number |

  2. Population data, hosted at a Dropbox location.

| field | data type | description |
| --- | --- | --- |
| Geography | str | full country name |
| Population mid-2018 (millions) | str | population in millions recorded in mid-2018 |

Resources used

The environment dependencies can be installed via pip install -r requirements.txt. I recommend installing in a Conda environment running Python version 3.7.

Documentation for Python can be found here: https://docs.python.org/3.7/

Documentation for Jupyter Notebook can be found here: http://jupyter-notebook.readthedocs.io/en/latest/

The following Python packages were used and their documentation can be found at the accompanying links:

The Objective Revision Evaluation Service (ORES)

This API is used to estimate the quality of each article drawn from source 1 above. The notebook gathers this data by sending API requests to ORES, which returns an estimated article quality along with probabilities for six quality classes:

  1. FA - Featured article
  2. GA - Good article
  3. B - B-class article
  4. C - C-class article
  5. Start - Start-class article
  6. Stub - Stub-class article

You may also notice two additional categories not to be confused with the six designated above. They are error results from the ORES API and may be either 'Text Deleted' or 'Revision not Found'.
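As a rough illustration of the request and response handling described above, here is a minimal sketch. The endpoint URL, the `wp10` model name, and the response layout are assumptions based on the ORES v3 API as it existed around the time of this assignment; they are not taken from the notebook and the service may have changed since. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed ORES v3 endpoint for English Wikipedia (not confirmed by this README).
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"


def build_ores_url(rev_ids, model="wp10"):
    """Build a scoring URL for a batch of revision ids (joined with '|')."""
    query = urlencode({
        "models": model,
        "revids": "|".join(str(rev_id) for rev_id in rev_ids),
    })
    return f"{ORES_ENDPOINT}?{query}"


def parse_ores_scores(payload, model="wp10", wiki="enwiki"):
    """Map each revision id to its predicted class, or to an error type.

    Assumes the response nests results as
    payload[wiki]["scores"][rev_id][model] with either a "score" or an
    "error" entry.
    """
    results = {}
    for rev_id, models in payload[wiki]["scores"].items():
        entry = models.get(model, {})
        if "score" in entry:
            results[int(rev_id)] = entry["score"]["prediction"]
        else:
            # Error results (deleted text, missing revision) carry no prediction.
            results[int(rev_id)] = entry.get("error", {}).get("type", "Unknown")
    return results
```

The URL from `build_ores_url` could be fetched with any HTTP client and the decoded JSON passed to `parse_ores_scores`. Keeping the parsing separate from the network call makes it easy to check against a saved response.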

Files Created

This notebook creates one CSV file of data extracted and compiled as part of this analysis.

| name | data type | description |
| --- | --- | --- |
| country | str | the country being referenced |
| article name | str | name of the Wikipedia article |
| revision_id | int | the revision id referenced when querying ORES for the article quality prediction |
| article_quality | str | the predicted quality of the article |
| population | int | population count in millions of people |
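One way the output file could be assembled is sketched below using only the standard library. This is not the notebook's code: the function names, the choice to skip rows with a missing population or quality prediction, and the sample inputs are all assumptions made for illustration. The column names follow the table above.

```python
import csv

# Column names as listed in the README's output table.
FIELDS = ["country", "article name", "revision_id", "article_quality", "population"]


def build_rows(articles, qualities, populations):
    """Join article metadata with predicted quality and country population.

    articles:    iterable of dicts with "page", "country", "rev_id"
    qualities:   dict mapping rev_id -> predicted quality label
    populations: dict mapping country -> population in millions
    Rows lacking a population or a prediction are skipped (an assumption).
    """
    rows = []
    for article in articles:
        country, rev_id = article["country"], article["rev_id"]
        if country not in populations or rev_id not in qualities:
            continue
        rows.append({
            "country": country,
            "article name": article["page"],
            "revision_id": rev_id,
            "article_quality": qualities[rev_id],
            "population": populations[country],
        })
    return rows


def write_csv(rows, handle):
    """Write the joined rows, with a header line, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

When writing to a real file, open it with `newline=""` so the `csv` module controls line endings.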

License

This assignment code is released under the MIT license.

Data source licenses:

  • Figshare - CC BY 4.0
  • DropBox - None Specified in Document Source

Results

I began this project with the notion that bias would be easily identified, for a number of reasons. First, each country has a different level of access to internet resources. Second, many countries filter the content their populations can access. The most pervasive example is North Korea, which is reflected in the dataset. Third, Wikipedia itself has a different level of presence within each country. And lastly, the education level of the population varies across countries. This list is by no means exhaustive, and only one item (government filtering) relates directly to ethical issues. Any of these could be a source of variance in the resulting dataset. That being said, bias was found, but I was surprised at how drastically the data is skewed. First, check out the ten countries with the most articles per million people.

| country | articles_per_million |
| --- | --- |
| Tuvalu | 5500 |
| Nauru | 5300 |
| San Marino | 2733.3 |
| Monaco | 1000 |
| Liechtenstein | 725 |
| Tonga | 630 |
| Marshall Islands | 616.7 |
| Iceland | 515 |
| Andorra | 425 |
| Federated States Of Micronesia | 380 |

And the ten with the fewest.

| country | articles_per_million |
| --- | --- |
| Vietnam | 2 |
| Bangladesh | 1.9 |
| Thailand | 1.7 |
| Korea, North | 1.5 |
| Zambia | 1.5 |
| Ethiopia | 1 |
| Uzbekistan | 0.9 |
| China | 0.8 |
| Indonesia | 0.8 |
| India | 0.7 |
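The articles-per-million figure is the number of articles for a country divided by its population in millions. A minimal sketch of that calculation and of the top and bottom rankings follows. The function names and the sample numbers are hypothetical, and excluding countries with no articles is an assumption, not something the README states.

```python
from collections import Counter


def articles_per_million(article_countries, populations_millions):
    """Return {country: article count / population in millions}.

    article_countries:    one country name per article
    populations_millions: dict mapping country -> population in millions
    Countries with no articles or a non-positive population are left out.
    """
    counts = Counter(article_countries)
    return {
        country: counts[country] / population
        for country, population in populations_millions.items()
        if population > 0 and counts[country] > 0
    }


def rank(metric, n=10, highest=True):
    """Return the n (country, value) pairs with the highest or lowest values."""
    return sorted(metric.items(), key=lambda item: item[1], reverse=highest)[:n]


# Hypothetical inputs: 4 articles for a country of 0.5 million people,
# 3 articles for a country of 30 million people.
sample = articles_per_million(["A"] * 4 + ["B"] * 3, {"A": 0.5, "B": 30.0})
top = rank(sample, n=1)
bottom = rank(sample, n=1, highest=False)
```

Dividing a small article count by a population well under one million inflates the ratio, which is one reason very small countries can dominate the top of such a ranking.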

The ten countries with the highest percentage of quality articles.

| country | percent_quality |
| --- | --- |
| Korea, North | 0.179 |
| Saudi Arabia | 0.134 |
| Central African Republic | 0.118 |
| Romania | 0.115 |
| Mauritania | 0.096 |
| Bhutan | 0.091 |
| Tuvalu | 0.091 |
| Dominica | 0.083 |
| United States | 0.075 |
| Benin | 0.074 |

And the ten with the lowest.

| country | percent_quality |
| --- | --- |
| Namibia | 0.006 |
| Sierra Leone | 0.006 |
| Brazil | 0.005 |
| Bolivia | 0.005 |
| Fiji | 0.005 |
| Morocco | 0.005 |
| Lithuania | 0.004 |
| Nigeria | 0.004 |
| Peru | 0.003 |
| Tanzania | 0.002 |
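The quality share per country can be computed as the number of high-quality articles divided by the total number of articles for that country. The sketch below assumes that "quality" means a predicted class of FA or GA; the README does not state the threshold, so treat that set, along with the function name and sample rows, as hypothetical.

```python
from collections import Counter

# Assumption: an article counts as "quality" when ORES predicts FA or GA.
HIGH_QUALITY = {"FA", "GA"}


def quality_share(rows):
    """Return {country: high-quality articles / all articles}.

    rows: iterable of (country, predicted_quality) pairs.
    The result is a fraction between 0 and 1.
    """
    totals, high = Counter(), Counter()
    for country, quality in rows:
        totals[country] += 1
        if quality in HIGH_QUALITY:
            high[country] += 1
    return {country: high[country] / totals[country] for country in totals}


# Hypothetical rows: two of four articles for "A" are high quality.
shares = quality_share([
    ("A", "FA"), ("A", "Stub"), ("A", "GA"), ("A", "C"),
    ("B", "Stub"),
])
```

Because the denominator is the country's own article count, a country with very few articles can reach a high share with only a handful of FA or GA pages.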

The most surprising thing to me was the quality of the articles. North Korea has only 1.5 articles per million people, yet a high percentage of those articles are of good quality. I don't know who wrote these articles. Perhaps the percentage is so high because the original authors are from Western sources, or perhaps it is a result of North Korean governance. The ten countries with the lowest quality do not necessarily surprise me. These countries are neither English speaking nor known for strong public education systems that would lead to high-quality articles being written. The tables alone don't show how far some of these data points lie from the rest. With no bias, I would expect a more symmetric distribution. These distributions, however, are strongly right skewed. See the static visualizations below.

[Figure: Boxplot of Counts]

[Figure: Boxplot of Quality]
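The right skew described above can also be expressed as a number. The sketch below computes the Fisher-Pearson skewness coefficient, which is near zero for a symmetric sample and positive when a long tail extends to the right. This is an illustration of the idea rather than a step from the notebook, and the sample values are hypothetical.

```python
import statistics


def sample_skewness(values):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5.

    m2 and m3 are the second and third central moments, each divided by n.
    """
    n = len(values)
    mean = statistics.mean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Hypothetical samples: one symmetric, one with a single large outlier.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
symmetric_skew = sample_skewness(symmetric)      # close to zero
right_skew = sample_skewness(right_skewed)       # positive
```

A quick informal check points the same way: in a right-skewed sample the mean sits above the median, as it does for the second list here.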

Future Work

The analysis in this notebook raises several more questions about where this bias comes from. A strong bias clearly exists. Joining this dataset with the world indicators could bring more meaningful insights into the problem to light.
