
BDAM_Diaosi

Big Data Analysis and Management Course Project

Author: floodfill, ConanChou

Task description

Identify pages that have infoboxes. You will scan the Wikipedia pages and generate a CSV file named infobox.csv in which each line corresponds to a page containing an infobox, in the format: page_id, infobox_text. If a page does not have an infobox, it has no line in the CSV file. The infobox_text should contain the full text of the infobox, including the template name, attribute names, and attribute values. Note that the id of a Wikipedia page is its title.
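The extraction step can be sketched as a small helper. This is a hypothetical sketch, not the project's actual code: it assumes an infobox starts with `{{Infobox` and is delimited by balanced double braces.

```java
public class InfoboxExtract {
    // Return the full infobox text (template name, attribute names, and
    // values) from raw wiki markup, or null if the page has no infobox.
    // Double braces are matched so nested templates stay inside the result.
    public static String extractInfobox(String wikitext) {
        int start = wikitext.indexOf("{{Infobox");
        if (start < 0) {
            return null; // page has no infobox
        }
        int depth = 0;
        for (int i = start; i < wikitext.length() - 1; i++) {
            if (wikitext.charAt(i) == '{' && wikitext.charAt(i + 1) == '{') {
                depth++;
                i++; // skip the second brace of the pair
            } else if (wikitext.charAt(i) == '}' && wikitext.charAt(i + 1) == '}') {
                depth--;
                i++;
                if (depth == 0) {
                    return wikitext.substring(start, i + 1);
                }
            }
        }
        return null; // unbalanced braces: no usable infobox
    }
}
```

Brace matching (rather than a simple regex) matters because infoboxes routinely embed other `{{...}}` templates in their attribute values.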

How to run it

The simplest way

Create a new job flow in Amazon Elastic MapReduce with the following configuration:

  1. Hadoop version: 1.0.3
  2. Custom jar file: s3n://diaosi-mapreduce/pro1.jar
  3. Jar arguments: s3n://diaosi-mapreduce/raw_data s3n://<your-bucket-name>/<your-output-folder>

More sophisticated

  1. Create a new Java project in Eclipse
  2. Import the src folder into the project
  3. Add the dependency libraries (provided in the libs folder) to the build path
  4. Export a runnable jar with com.github.diaosi.BDAM.mapreduce.InfoboxGetter as the launch class
  5. Upload the jar you just generated to Amazon S3
  6. (Optional; we have already done this) Extract the Bzip2-compressed Wikipedia dumps to raw XML files and upload them to Amazon S3, or use s3n://diaosi-mapreduce/raw_data as the input
  7. In Amazon Elastic MapReduce, create a new job flow and run the jar with the input path as the first argument and the output path as the second
  8. Keep your fingers crossed while the job flow runs until the results are generated

Check the result

  1. Switch to your S3 bucket and find the results in the output folder you specified earlier
  2. They are CSV files that can be opened by any spreadsheet software
  3. Note that Microsoft Excel may not open UTF-8-encoded CSV files correctly
  4. Also note that two line separators are used in the generated CSV files: \r\n separates records, while \n appears inside the infobox text
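The separator convention above can be sketched with hypothetical helpers for reading the generated files back (these are illustrations, not part of the project; parsing assumes the page id itself contains no ", " sequence):

```java
public class ResultReader {
    // Records in the generated files are separated by \r\n, while \n is kept
    // verbatim inside infobox_text, so splitting on \r\n is safe.
    public static String[] splitRecords(String csv) {
        return csv.split("\r\n");
    }

    // Each record is "page_id, infobox_text". Split on the first ", " only,
    // since the infobox text itself may contain commas. This assumes the
    // page title (the id) contains no ", " sequence.
    public static String[] parseRecord(String record) {
        return record.split(", ", 2);
    }
}
```

Splitting records on \n instead of \r\n would wrongly break apart multi-line infobox text, which is why the two separators must not be confused.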

Phase II

The Phase II code is under src/com/github/diaosi/BDAM/mapreduce (Hadoop code), src/com/github/diaosi/BDAM/utils (single-node code), and visualization (visualization code).
