
Exploring Bias in Wikipedia's Political Articles by Country

Goal

This repository contains all files required for Human Centered Data Science's (DATA 512) "Assignment 2: Bias in Data." More information about the assignment is available here.

The purpose of this assignment is to explore potential biases in Wikipedia’s political coverage by country. I explore this using two metrics:

  • Coverage: the number of political articles as a proportion of a country's population.
  • Quality: the proportion of a country's political articles that are high-quality.
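The two metrics boil down to simple ratios. A minimal sketch (the function names and sample numbers are mine for illustration, not values from the notebook):

```python
# Illustrative sketch of the two bias metrics; inputs are hypothetical.
def coverage_pct(article_count, population):
    """Political articles as a percentage of a country's population."""
    return 100 * article_count / population

def quality_pct(high_quality_count, article_count):
    """FA/GA articles as a percentage of all political articles."""
    return 100 * high_quality_count / article_count

# Example: 10 articles for a population of 1,000 gives 1% coverage.
print(coverage_pct(10, 1000))
print(quality_pct(1, 4))
```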

Article quality is determined using a machine learning service called the Objective Revision Evaluation Service (ORES). For each article revision, it predicts one of the following quality classes:

  • FA - Featured article
  • GA - Good article
  • B - B-class article
  • C - C-class article
  • Start - Start-class article
  • Stub - Stub-class article

A high-quality article is one rated either Featured Article (FA) or Good Article (GA), the top two of Wikipedia's six article quality classes.
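Predictions can be fetched in batches from the ORES scoring API. The sketch below assumes the ORES v3 endpoint and the `wp10` English Wikipedia article-quality model as they existed circa 2018; verify both before relying on them:

```python
# Hedged sketch of batch-scoring revisions with ORES (v3 API, "wp10"
# model assumed). Network access is needed for a real request.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"

def ores_url(rev_ids, model="wp10"):
    """Build a batch scoring URL for a list of revision IDs."""
    return f"{ORES_ENDPOINT}?models={model}&revids=" + "|".join(map(str, rev_ids))

def extract_predictions(response_json, model="wp10"):
    """Map rev_id -> predicted class (FA, GA, B, C, Start, or Stub)."""
    scores = response_json["enwiki"]["scores"]
    return {rev: data[model]["score"]["prediction"] for rev, data in scores.items()}

# Usage (requires network), e.g. with requests:
#   predictions = extract_predictions(requests.get(ores_url(rev_ids)).json())
```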

I then report and reflect on the highest- and lowest-ranked countries for each metric to assess potential bias. The Jupyter Notebook walks you through the full analysis.

Directory

data-512-a2/

| Filename | Purpose |
| --- | --- |
| data/... | Contains the final merged dataset. |
| LICENSE | A standard MIT license. |
| README.md | What you're currently reading. |
| hcds-a2-bias.ipynb | The source code. |

Data Acquisition

Two datasets were used in this analysis.

  1. World Population Data, available on DropBox.

     | Column | Datatype |
     | --- | --- |
     | Geography | String |
     | Population mid-2018 (millions) | String |

  2. Wikipedia's Political Articles data, available on Figshare.

     | Column | Datatype |
     | --- | --- |
     | page | String |
     | country | String |
     | rev_id | Integer |

Licensing

The source datasets are not included in this repository due to the following licensing and copyright concerns.

  1. The World Population dataset does not provide explicit licensing information, so it falls under DropBox's copyright policy.
  2. The Wikipedia page dataset on Figshare is licensed under CC-BY-SA 4.0.

The code in this repository is licensed under an MIT license.

Output

After merging the population data, the Wikipedia page data, and the predicted page qualities, I produced the following dataset.

| Column | Datatype |
| --- | --- |
| country | String |
| article_name | String |
| revision_id | Integer |
| article_quality | String |
| population | Integer |
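The merge step can be sketched with pandas. Column names follow the source and output tables above; the inline sample rows and the `quality` mapping are stand-ins for the real downloaded data:

```python
import pandas as pd

# Hedged sketch of the merge; sample rows and the quality mapping are
# illustrative, but column names match the README's schema tables.
pages = pd.DataFrame({
    "country": ["Example"],
    "page": ["Example politician"],
    "rev_id": [123],
})
population = pd.DataFrame({
    "Geography": ["Example"],
    "Population mid-2018 (millions)": ["1.0"],
})
quality = {123: "Stub"}  # rev_id -> ORES prediction

# Join on country name, attach predictions, and convert population
# from "millions" (string) to an absolute integer count.
merged = pages.merge(population, left_on="country", right_on="Geography")
merged["article_quality"] = merged["rev_id"].map(quality)
merged["population"] = (
    merged["Population mid-2018 (millions)"].astype(float) * 1_000_000
).astype(int)
final = merged.rename(columns={"page": "article_name", "rev_id": "revision_id"})[
    ["country", "article_name", "revision_id", "article_quality", "population"]
]
```

Countries present in only one of the two sources drop out of the inner join, which is why the analysis below covers 180 rather than 195 countries.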

Reproducibility

This work aims to be reproducible; however, due to data licensing and changes in ORES's page-quality output, exact values may differ. To fully reproduce the embedded figures, you'll need to import the merged dataset I provided and perform the steps after that file's creation in the Jupyter Notebook. The merged dataset was created with the data available on 10/31/2018.

Reflection

When first starting this assignment, I assumed there would be unequal political coverage globally, for multiple reasons. One assumption was based on internet access availability, whether limited by censorship or infrastructure, not being uniform; this would imply higher political article coverage in more developed countries and in countries without internet censorship. Another bias I expected stems from the fact that a large fraction of English Wikipedia's editors are from the US, Canada, Europe, and Australia. I thought this would translate to higher coverage in the countries where the prevalence of editors is high, or even in the countries those editors take a strong political interest in. Additionally, I assumed countries with larger government systems would have larger article counts, and countries with smaller political systems lower counts (democracy vs. ************ or monarchy). Lastly, I didn't really know what biases to expect regarding quality after reading about the ORES quality model, which bases quality on structural components rather than tone and good writing, although it does note that structural quality correlates with good writing.

In the analysis I wasn't able to cover all 195 countries; instead I used only the 180 countries for which we had both population and article data. Given more time, I would have liked to identify the missing countries to see which were excluded and whether that pointed to anything. The top ten countries for political coverage were all near Australia and Europe, which supports my idea that areas with a higher concentration of editors would have higher coverage. The ten lowest-ranked countries for coverage were in Asia and Africa, which both have a low prevalence of editors (except India), again supporting my original assumption. India accounts for 3% of English Wikipedia editors, yet it ranked lowest for coverage; this, I imagine, is due to its large population.

One result I was surprised to see was that the United States wasn't in the top ten in coverage. The United States has only 1098 political articles in total. Given that the current House of Representatives alone has 435 members, this count seems surprisingly low. One follow-up question I had was what qualifies as a political article (i.e., would articles on state and local governors count)? I don't think my original government-size theory holds much water after the analysis, since coverage would more likely relate to government size in comparison with population size.

North Korea, Saudi Arabia, and the United States were in the top ten in quality, and of the 180 countries we have articles for, 36 had zero high-quality articles. After removing those, the next-lowest ten had about one high-quality article each, so the range of quality percentages over the 180 countries was 0% to 17.9%. I was surprised that quality maxed out at 17.9 percent, because I would have assumed political articles tended to be more professional and well written.

With more time, I would have liked to explore the data graphically to better see trends and distributions because it is hard to theorize from short lists.
