
UpdatePages

I completed this project in three weeks as a Data Engineering fellow in the Insight Data Engineering Fellows Program in NYC, June 2019.


Project Summary:

Today we're living in a "content marketing boom," with more and more businesses starting up their own blogs. For tech-related blogs in particular, credibility depends on how well the information keeps up with rapidly changing technologies and tools. Maintaining up-to-date articles is therefore essential to being a reliable source for readers.

In this project, I created a dashboard that identifies pages that need to be updated yet are popular within the website. Popular here means the page is cited many times internally, i.e., other pages on the site use it as a source via a hyperlink.

To achieve this, I used the Wikipedia data dump, which is publicly available and fits the goal of my project statement. Since Wikipedia page editing is a purely volunteer effort, some pages can remain outdated. Though Wikipedia has a page on articles in need of updating, it is hard to keep track of all the pages as data accumulates over time.

UpdatePages provides a dashboard for analyzing the current state of updates across the main pages of English Wikipedia. I analyze 45 million Wikipedia pages in compressed XML format, process them in Spark, and store the results in PostgreSQL.

Data Set:

Wiki dumps can be downloaded for any project/language, with data going back to early 2009. To access the latest versions of all Wikipedia pages, go to this page and download the files with the prefix "enwiki-latest-pages-meta-current" [1-27]; Wikipedia publishes the full site in 27 parts. Wikipedia offers other options for accessing its data; see the full description here.
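
As a rough illustration of this ingestion step (the project itself used a shell script, under src/dataingestion/), the sketch below lists the latest-dump index and fetches the meta-current parts. The index URL and regex are assumptions based on the public dumps site; the actual file names also include page-ID ranges, which is why they are discovered from the index rather than hard-coded.

```python
import re
import urllib.request

BASE = "https://dumps.wikimedia.org/enwiki/latest/"

# Read the dump index page and keep the bz2 parts with the prefix mentioned above.
index = urllib.request.urlopen(BASE).read().decode("utf-8")
parts = sorted(set(re.findall(
    r'href="(enwiki-latest-pages-meta-current\d+\.xml[^"]*\.bz2)"', index)))

for name in parts:
    print("downloading", name)
    urllib.request.urlretrieve(BASE + name, name)  # then push each file to S3, e.g. with boto3
```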

Data Pipeline:

[Pipeline architecture diagram]

UpdatePages is a batch processing pipeline over a large volume of data.

Using a shell script, I downloaded all current pages of English Wikipedia, in the form of bz2-compressed XML files, to an S3 bucket. The Databricks spark-xml package was used to read and parse these input files into a DataFrame. Data cleaning and processing were done in Spark, and the final tables were written to PostgreSQL. Finally, an interactive website was created with Plotly Dash, which queries the PostgreSQL database.
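
A minimal PySpark sketch of this read-parse-write path is shown below; the S3 bucket, database name, and credentials are placeholders, and the real logic lives in src/batch_pocess/parse_xml.py.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parse_xml").getOrCreate()

# spark-xml turns every <page> element of the dump into one row; Hadoop's codecs
# handle the bz2 compression transparently.
pages = (spark.read
         .format("com.databricks.spark.xml")
         .option("rowTag", "page")
         .load("s3a://<bucket>/enwiki-latest-pages-meta-current*.xml.bz2"))

# ... cleaning and link aggregation happen here ...

# Write the final tables for the Dash app to query.
(pages.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://<host>:5432/<db>")
      .option("dbtable", "pages")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("driver", "org.postgresql.Driver")
      .mode("overwrite")
      .save())
```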

Directory                        Description of Contents
src/dash/*                       HTML and CSS that queries the DB and builds the UI
src/batch_pocess/parse_xml.py    Reads from S3, unzips, parses, and writes into PostgreSQL
src/dataingestion/*              Shell script to download the data set from the Wikipedia data dump
test                             Unit test for a smaller dataset

Cluster set up

This project used the following EC2 nodes for the Spark and Hadoop setup:

  • master node: m4.large
  • three worker nodes: m4.2xlarge

Environment

Install the AWS CLI and Pegasus, Insight's automated cluster-creation tool. Set the configuration in workers.yml and master.yml (3 workers and 1 master), then use Pegasus commands to spin up the cluster and install Hadoop and Spark. Clone the Databricks spark-xml package and follow the setup instructions it provides.

Technology    Version
Hadoop        v2.7.6
Spark         v2.12
spark-xml     v2.12
Postgres      v10.6

Project Challenge

When parsing files with the spark-xml library, you provide the XML tag to treat as a row, and each record under that tag must be read by a single Java Virtual Machine (JVM). Because of this requirement, I encountered JVM out-of-memory errors: the text was partitioned unevenly across my Spark machines, since some Wikipedia articles are much larger than others. However, I was able to read all of the metadata by tuning the worker and driver node settings as well as the EC2 instance type.
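
For context, tuning of this kind is usually expressed through Spark memory configuration; the values below are illustrative assumptions sized for m4.2xlarge workers (8 vCPUs, 32 GiB), not the project's exact settings.

```python
from pyspark.sql import SparkSession

# Illustrative values only: leave headroom for the OS and Hadoop daemons while
# giving each executor enough heap to hold the largest <page> records.
spark = (SparkSession.builder
         .appName("parse_xml")
         .config("spark.executor.memory", "20g")
         .config("spark.executor.cores", "4")
         .config("spark.driver.memory", "4g")
         .getOrCreate())
```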

After reading the metadata for the 45 million pages, I preprocessed and cleaned it by filtering to keep only original articles, based on their namespace.
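
In the English Wikipedia dump, original articles live in the main namespace (namespace 0), exposed as the <ns> element of each page. A one-line sketch of that filter, assuming spark-xml surfaces it as a column named ns:

```python
from pyspark.sql import functions as F

# Keep only pages in the main (article) namespace.
articles = pages.filter(F.col("ns") == 0)
```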

Another challenge was quantifying each page's number of incoming links, which I used as a ranking metric when showing the outdated articles. The hyperlinks in each page were extracted with a regex inside a UDF (see the sketch after the table below), producing the following DB structure:

ID   Title   Timestamp   Hyperlinks
1    "X"     2019        ["Y", "Z"]
2    "Y"     2018        ["X", "Z"]

The number of incoming links to each page was then computed by aggregating this data in Spark (a sketch follows the second table below).

ID   Title   Timestamp   Hyperlinks    Incoming links
1    "X"     2019        ["Y", "Z"]    "Y"
2    "Y"     2018        ["X", "Z"]    "X"

Demo

Presentation Slides

Dash UI Demo

YouTube Demo

