Bias in Data

The template for this README.md file is taken from the instructor's tutorial repo here.

Goal

This project is the second assignment of the DATA 512 Ethics course. The assignment studies bias in data by looking at Wikipedia articles about politicians from different countries. More specifically, it examines how the coverage of politicians and the quality of the articles about them vary across countries.

The analysis calculates the following metrics for each country from the Wikipedia articles about its political figures:

  • The ratio of the number of Wikipedia articles about the country's political figures to the country's population.

  • The percentage of Wikipedia articles about the country's political figures that are "high-quality". Article quality is measured using the ORES API, Wikipedia's machine learning service.
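The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's actual code: the function name `country_metrics` and the `HIGH_QUALITY` set are hypothetical, and treating the `FA` and `GA` classes as "high-quality" is an assumption, since this README does not define the term.

```python
from collections import defaultdict

# Assumption: these quality classes count as "high-quality".
HIGH_QUALITY = {"FA", "GA"}


def country_metrics(rows):
    """Compute per-country coverage and quality metrics.

    `rows` is an iterable of dicts with the keys country, article_quality
    and population (one dict per article).
    """
    totals = defaultdict(int)   # number of articles per country
    high = defaultdict(int)     # number of high-quality articles per country
    population = {}
    for row in rows:
        country = row["country"]
        totals[country] += 1
        if row["article_quality"] in HIGH_QUALITY:
            high[country] += 1
        population[country] = float(row["population"])

    metrics = {}
    for country, n_articles in totals.items():
        metrics[country] = {
            # Ratio of article count to population.
            "articles_per_capita": n_articles / population[country],
            # Share of the country's articles that are high-quality, in percent.
            "high_quality_pct": 100.0 * high[country] / n_articles,
        }
    return metrics
```

Each input row corresponds to one article, so the same population value repeats for every article of a country.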

The goal is to gain a better understanding of biases in data and their consequences through this exercise.

The specific guideline for the assignment, from which this README.md file borrows its goals, can be found here.

Data Sources

The data comes from two different sources:

The politicians by country dataset, hosted on figshare, contains the name and revision ID of each politician's Wikipedia article, along with the country name. It was downloaded as the CSV file page_data.csv.

The country population dataset comes from PRB International Data. It contains each country's name and population. It was downloaded as the CSV file WPDS_2018_data.csv.

The ORES API was used to obtain the article quality feature.

Resources

This analysis was done using Python 3.6 running in a Jupyter Notebook.

The documentation for Python 3.6 can be found here, and the documentation for Jupyter Notebook can be found here.

Object Revision Evaluation Service (ORES) was used to obtain the article quality feature. The documentation for using this API can be found here.

The API call function in the iPython notebook used a code block from the tutorial written by the course instructor Jonathan Morgan:

https://github.com/Ironholds/data-512-a2/blob/master/hcds-a2-bias_demo.ipynb
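As a rough illustration of what such an API call involves, here is a minimal sketch using only the standard library. It is not the tutorial's code. The endpoint layout, the `wp10` model name, and the response structure reflect the ORES v3 API as best understood here and should be checked against the ORES documentation; the function names and the `user_agent` parameter are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: ORES v3 endpoint layout and the "wp10" article quality model.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{project}/"


def build_ores_url(rev_ids, project="enwiki", model="wp10"):
    """Return the request URL for a batch of revision IDs."""
    query = urlencode({"models": model,
                       "revids": "|".join(str(r) for r in rev_ids)})
    return ORES_ENDPOINT.format(project=project) + "?" + query


def parse_predictions(response, project="enwiki", model="wp10"):
    """Map each revision ID to its predicted class, or None on an error entry."""
    predictions = {}
    for rev_id, models in response[project]["scores"].items():
        # Revisions that could not be scored carry "error" instead of "score".
        score = models[model].get("score")
        predictions[int(rev_id)] = score["prediction"] if score else None
    return predictions


def fetch_predictions(rev_ids, user_agent):
    """Call the API once for a batch of revision IDs (needs network access)."""
    request = Request(build_ores_url(rev_ids),
                      headers={"User-Agent": user_agent})
    with urlopen(request) as reply:
        return parse_predictions(json.load(reply))
```

Splitting URL construction and response parsing from the network call keeps the parsing logic checkable offline, for example against a saved response.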

Final Data File

Running the iPython notebook with the page_data.csv and WPDS_2018_data.csv files produces a final table with the following schema (one example row shown):

| country | article_name | revision_id | article_quality | population |
|---------|--------------|-------------|-----------------|------------|
| Chad | Bir I of Kanem | 355319463 | Stub | 15400000.0 |

The table is saved to a CSV file named final_data.csv.
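A minimal sketch of how a table with this schema could be assembled and written is shown below. It is an illustration under assumptions, not the notebook's code: the function names, the shape of the inputs (dicts keyed by the final column names), and the choice to skip articles that lack a population or quality value are all hypothetical.

```python
import csv

# Column order of the final table, as listed in the schema.
FIELDS = ["country", "article_name", "revision_id",
          "article_quality", "population"]


def build_final_rows(pages, populations, qualities):
    """Join the three inputs into rows matching the final schema.

    pages: iterable of dicts with keys country, article_name, revision_id
    populations: dict mapping country name -> population
    qualities: dict mapping revision ID -> predicted quality class (or None)
    Assumption: articles without a population or quality value are skipped.
    """
    rows = []
    for page in pages:
        population = populations.get(page["country"])
        quality = qualities.get(page["revision_id"])
        if population is None or quality is None:
            continue
        rows.append({
            "country": page["country"],
            "article_name": page["article_name"],
            "revision_id": page["revision_id"],
            "article_quality": quality,
            "population": population,
        })
    return rows


def write_final_csv(rows, path="final_data.csv"):
    """Write the joined rows, with a header line, to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Country names must match exactly between the two source files for the join to succeed, so any mismatches would need to be reconciled or dropped beforehand.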

License/Terms of Use

The code in this repository is licensed under MIT License.

For licensing information about the politicians by country data source, please refer to figshare's webpage and their terms of use policy.

For licensing information about the country population data source, please refer to PRB's webpage for their site-wide general policy.

The content accessed via Wikimedia's API is licensed under the CC-BY-SA 3.0 and GFDL licenses, and you irrevocably agree to release modifications or additions made through this API under these licenses.

When reproducing the results, please take a look at the Wikimedia Terms of Use for more information.

Contributors

ryanbae89
