Giter Site home page Giter Site logo

cc-pyspark's Introduction

Common Crawl Logo

Common Crawl PySpark Examples

This project provides examples how to process the Common Crawl dataset with Apache Spark and Python:

Setup

To develop and test locally, you will need to install

pip install -r requirements.txt

Compatibility and Requirements

Tested with Spark 2.1.0 – 2.4.3 in combination with Python 2.7 or 3.5 and 3.6.

Get Sample Data

To develop locally, you'll need at least three data files – one for each format used in at least one of the examples. They can be fetched from the following links:

Alternatively, running get-data.sh will download the sample data. It also writes input files containing

  • sample input as file:// URLs
  • all input of one monthly crawl as s3:// URLs

Note that the sample data is from an older crawl (CC-MAIN-2017-13 run in March 2017). If you want to use more recent data, please visit the Common Crawl site.

Running locally

First, point the environment variable SPARK_HOME to your Spark installation. Then submit a job via

$SPARK_HOME/bin/spark-submit ./server_count.py \
	--num_output_partitions 1 --log_level WARN \
	./input/test_warc.txt servernames

This will count web server names sent in HTTP response headers for the sample WARC input and store the resulting counts in the SparkSQL table "servernames" in your warehouse location defined by spark.sql.warehouse.dir (usually in your working directory as ./spark-warehouse/servernames).

The output table can be accessed via SparkSQL, e.g.,

$SPARK_HOME/bin/pyspark
>>> df = sqlContext.read.parquet("spark-warehouse/servernames")
>>> for row in df.sort(df.val.desc()).take(10): print(row)
... 
Row(key=u'Apache', val=9396)
Row(key=u'nginx', val=4339)
Row(key=u'Microsoft-IIS/7.5', val=3635)
Row(key=u'(no server in HTTP header)', val=3188)
Row(key=u'cloudflare-nginx', val=2743)
Row(key=u'Microsoft-IIS/8.5', val=1459)
Row(key=u'Microsoft-IIS/6.0', val=1324)
Row(key=u'GSE', val=886)
Row(key=u'Apache/2.2.15 (CentOS)', val=827)
Row(key=u'Apache-Coyote/1.1', val=790)

See also

Running in Spark cluster over large amounts of data

As the Common Crawl dataset lives in the Amazon Public Datasets program, you can access and process it on Amazon AWS without incurring any transfer costs. The only cost that you incur is the cost of the machines running your Spark cluster.

  1. spinning up the Spark cluster: AWS EMR contains a ready-to-use Spark installation but you'll find multiple descriptions on the web how to deploy Spark on a cheap cluster of AWS spot instances. See also launching Spark on a cluster.

  2. choose appropriate cluster-specific settings when submitting jobs and also check for relevant command-line options (e.g., --num_input_partitions or --num_output_partitions, see below)

  3. don't forget to deploy all dependencies in the cluster, see advanced dependency management

  4. also the the file sparkcc.py needs to be deployed or added as argument --py-files sparkcc.py to spark-submit. Note: some of the examples require further Python files as dependencies.

Command-line options

All examples show the available command-line options if called with the parameter --help or -h, e.g.

$SPARK_HOME/bin/spark-submit ./server_count.py --help

Overwriting Spark configuration properties

There are many Spark configuration properties which allow to tune the job execution or output, see for example see tuning Spark or EMR Spark memory tuning.

It's possible to overwrite Spark properties when submitting the job:

$SPARK_HOME/bin/spark-submit \
    --conf spark.sql.warehouse.dir=myWareHouseDir \
    ... (other Spark options, flags, config properties) \
    ./server_count.py \
	... (program-specific options)

Credits

Examples are originally ported from Stephen Merity's cc-mrjob with the following changes and upgrades:

  • based on Apache Spark (instead of mrjob)
  • boto3 supporting multi-part download of data from S3
  • warcio a Python 2 and Python 3 compatible module to access WARC files

Further inspirations are taken from

License

MIT License, as per LICENSE

cc-pyspark's People

Contributors

sebastian-nagel avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.