A2-hcds-hcc

The goal of this project is to acquire, process, analyze, and publish a data set of the monthly traffic on Wikipedia.

Data source

The Wikimedia Foundation REST API is used as the data source. The Terms and Conditions for the Wikimedia Foundation REST API can be found here: Terms and Conditions. The content accessed via this API is licensed under the CC-BY-SA 3.0 and GFDL licenses, and thus all data produced throughout this project follows the same licensing policy.

To get a comprehensive data set, two different APIs need to be called:

  1. The Legacy Pagecounts API (documentation, endpoint) provides access to desktop and mobile traffic data from December 2007 through July 2016.
  2. The Pageviews API (documentation, endpoint) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.
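The two calls can be sketched as follows. This is a minimal sketch, not the notebook's actual code: the URL templates and the access values (`desktop-site`, `mobile-web`, ...) follow the public Wikimedia REST API documentation as I understand it, and the helper names, the `User-Agent` string, and the `items` key of the response are assumptions that should be checked against the linked documentation.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base URL of the Wikimedia REST API metrics endpoints.
BASE = "https://wikimedia.org/api/rest_v1/metrics"


def legacy_pagecounts_url(project, access_site, start, end, granularity="monthly"):
    """Build a Legacy Pagecounts URL; start/end are YYYYMMDDHH strings."""
    parts = [project, access_site, granularity, start, end]
    return f"{BASE}/legacy/pagecounts/aggregate/" + "/".join(quote(p, safe="") for p in parts)


def pageviews_url(project, access, start, end, agent="user", granularity="monthly"):
    """Build a Pageviews URL; agent="user" leaves out spiders/crawlers."""
    parts = [project, access, agent, granularity, start, end]
    return f"{BASE}/pageviews/aggregate/" + "/".join(quote(p, safe="") for p in parts)


def fetch_items(url, user_agent="a2-hcds-hcc-example (replace with your contact)"):
    """Fetch one endpoint and return its list of records (assumed key: "items")."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return json.load(response)["items"]


if __name__ == "__main__":
    # Printing the URLs needs no network access; fetch_items() does.
    print(legacy_pagecounts_url("en.wikipedia.org", "desktop-site", "2007120100", "2016080100"))
    print(pageviews_url("en.wikipedia.org", "mobile-web", "2015070100", "2020110100"))
```

Keeping URL construction separate from the network call makes it easy to inspect a request before sending it.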

Results

The resulting CSV-formatted data file en-wikipedia_traffic_200712-202010.csv can be found in the folder clean_data. It contains the following fields:

| name | description |
| --- | --- |
| year | Year with the format YYYY |
| month | Month with the format MM |
| pagecount_all_views | Desktop and mobile views in the specific period fetched via the Pagecounts API |
| pagecount_desktop_views | Desktop views in the specific period fetched via the Pagecounts API |
| pagecount_mobile_views | Mobile views in the specific period fetched via the Pagecounts API |
| pageview_all_views | Desktop and mobile views in the specific period fetched via the Pageviews API |
| pageview_desktop_views | Desktop views in the specific period fetched via the Pageviews API |
| pageview_mobile_views | Mobile views in the specific period fetched via the Pageviews API |
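Reading the file could look like the sketch below. It assumes the CSV header matches the field names listed above and that every `*_views` cell holds a number or is empty; how the real file encodes months without data is not stated here, so treat that as an assumption. The sample rows are made up for illustration and are not real traffic figures.

```python
import csv
import io

# Hypothetical sample with invented numbers; the real file is in clean_data/.
SAMPLE = """year,month,pagecount_all_views,pagecount_desktop_views,pagecount_mobile_views,pageview_all_views,pageview_desktop_views,pageview_mobile_views
2015,06,300,200,100,,,
2015,07,330,210,120,280,180,100
"""


def load_traffic(handle):
    """Parse rows, keeping year/month as strings and converting views to int."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {"year": raw["year"], "month": raw["month"]}
        for name, value in raw.items():
            if name.endswith("_views"):
                # Empty cells become None; "300.0" style values are accepted too.
                row[name] = int(float(value)) if value else None
        rows.append(row)
    return rows


rows = load_traffic(io.StringIO(SAMPLE))
print(rows[1]["year"], rows[1]["month"], rows[1]["pageview_all_views"])
```

To read the actual data, pass an open file instead of the sample, for example `open("clean_data/en-wikipedia_traffic_200712-202010.csv", newline="")`.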

Known issues & special considerations

Wikimedia Foundation REST APIs

Because the data comes from two different sources, the values they report are not directly comparable. For example, the Pageviews API can exclude spiders/crawlers, while data from the Pagecounts API does not. As a result, the two sources may report different values even for the same period. The two sources also overlap: for the period from July 2015 to July 2016, both provide data about the monthly traffic on Wikipedia.
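The overlap can be checked with a few lines of Python. This is only an illustrative sketch: `month_range` is a hypothetical helper, and the end of the Pageviews range is pinned to October 2020, the last month in the data file name, instead of "last month".

```python
def month_range(start, end):
    """Return all (year, month) pairs from start to end, both inclusive."""
    year, month = start
    months = []
    while (year, month) <= end:
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


# Coverage of each source as described in this README.
pagecounts_months = set(month_range((2007, 12), (2016, 7)))
pageviews_months = set(month_range((2015, 7), (2020, 10)))

# Months for which both sources report traffic.
overlap = sorted(pagecounts_months & pageviews_months)
print(len(overlap), overlap[0], overlap[-1])  # 13 (2015, 7) (2016, 7)
```

For these 13 months, each row of the data file can carry values from both sources, which is where the differences described above become visible.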

Getting started

Prerequisites

To use this project (especially the Jupyter notebook), please ensure that you have Python 3.6.1 or newer, a working installation of Poetry, and git installed.

Setup

  1. Clone this repository (or use SSH) and move into the repo root.

    git clone https://github.com/marisanest/A2-hcds-hcc.git
    cd A2-hcds-hcc

  2. Install the dependencies in the repo root.

    poetry install

  3. Create a subshell within the virtual environment by running:

    poetry shell

  4. Open the project with Jupyter in your browser.

    jupyter notebook

