# MongoDB Connector for Hadoop

## Purpose

The MongoDB Connector for Hadoop is a library which allows MongoDB (or backup files in its data format, BSON) to be used as an input source, or output destination, for Hadoop MapReduce tasks. It is designed to allow greater flexibility and performance and make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem.

Current stable release: 1.3.0

Features

  • Can create data splits to read from standalone, replica set, or sharded configurations
  • Source data can be filtered with queries using the MongoDB query language
  • Supports Hadoop Streaming, to allow job code to be written in any language (python, ruby, nodejs currently supported)
  • Can read data from MongoDB backup files residing on S3, HDFS, or local filesystems
  • Can write data out in .bson format, which can then be imported to any MongoDB database with mongorestore
  • Works with BSON/MongoDB documents in other Hadoop tools such as Pig and Hive.
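As a rough illustration of the query filtering feature, the sketch below assembles a job submission that sets the connector's mongo.input.uri, mongo.output.uri, and mongo.input.query properties. The jar name, main class, URIs, and query are placeholders, not part of this project. It also assumes the job's driver accepts -D generic options (for example through Hadoop's ToolRunner). The script only prints the command so it can be inspected before it is run.

```shell
# Assumed values: replace the URIs, database, and collection names with your own.
INPUT_URI="mongodb://localhost:27017/mydb.input_collection"
OUTPUT_URI="mongodb://localhost:27017/mydb.output_collection"
# Filter the source data with a MongoDB query (here: documents where "age" > 30).
INPUT_QUERY='{"age": {"$gt": 30}}'

# Hypothetical job jar and main class. The -D options are only honored when
# the job's driver parses Hadoop generic options.
print_job_command() {
  printf 'hadoop jar %s %s -D mongo.input.uri=%s -D mongo.output.uri=%s -D mongo.input.query=%s\n' \
    "my-job.jar" "com.example.MyJob" "$INPUT_URI" "$OUTPUT_URI" "'$INPUT_QUERY'"
}

print_job_command
```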

Download

See the release page.

Building

The mongo-hadoop connector currently supports the following versions of Hadoop: 0.23, 1.0, 1.1, 2.2, 2.3, 2.4, and CDH 4 and 5. The default build targets the latest supported Apache Hadoop release (currently 2.4). To build against a specific version of Hadoop, pass -PclusterVersion=<your version> to gradlew when building.

Run ./gradlew jar to build the jars. The jars are placed in the build/libs directory of each module; for example, the core module's jar is generated in core/build/libs.
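A minimal sketch of a build against a specific Hadoop version is shown below. The CLUSTER_VERSION variable and its cdh5 default are illustrative choices; any value from the table of supported distributions can be used. The guard around ./gradlew is there only so the snippet does nothing harmful when run outside a mongo-hadoop checkout.

```shell
# Pick the Hadoop version to build against (for example 2.4, 1.1, cdh4, cdh5).
CLUSTER_VERSION="${CLUSTER_VERSION:-cdh5}"

# Only attempt the build from the root of a mongo-hadoop checkout.
if [ -x ./gradlew ]; then
  ./gradlew jar -PclusterVersion="$CLUSTER_VERSION"
  # List the jars produced for the core module.
  ls core/build/libs/*.jar
else
  echo "run this from the mongo-hadoop source root (no ./gradlew here)"
fi
```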

After successfully building, you must copy the jars to the lib directory on each node in your hadoop cluster. This is usually one of the following locations, depending on which Hadoop release you are using:

  • $HADOOP_HOME/lib/
  • $HADOOP_HOME/share/hadoop/mapreduce/
  • $HADOOP_HOME/share/hadoop/lib/
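The copy step could be scripted roughly as follows. The function name and both arguments are hypothetical. The target directory is passed explicitly rather than guessed, because the right location depends on your Hadoop release.

```shell
# Copy every jar from each module's build/libs directory into a Hadoop lib directory.
#   $1: root of the mongo-hadoop source tree (after ./gradlew jar)
#   $2: lib directory for your Hadoop release, e.g. "$HADOOP_HOME/share/hadoop/mapreduce"
install_connector_jars() {
  src_root="$1"
  lib_dir="$2"
  if [ ! -d "$lib_dir" ]; then
    echo "lib directory not found: $lib_dir" >&2
    return 1
  fi
  for jar in "$src_root"/*/build/libs/*.jar; do
    # Skip the unexpanded pattern when no jars have been built yet.
    [ -e "$jar" ] || continue
    cp "$jar" "$lib_dir"/ && echo "copied $(basename "$jar")"
  done
}

# Example invocation (repeat on each node in the cluster):
# install_connector_jars /path/to/mongo-hadoop "$HADOOP_HOME/lib"
```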

Supported Distributions of Hadoop

| Hadoop Version | Build Parameter |
| --- | --- |
| Apache Hadoop 0.23 | -PclusterVersion='0.23' |
| Apache Hadoop 1.0 | -PclusterVersion='1.0' |
| Apache Hadoop 1.1 | -PclusterVersion='1.1' |
| Apache Hadoop 2.2 | -PclusterVersion='2.2' |
| Apache Hadoop 2.3 | -PclusterVersion='2.3' |
| Apache Hadoop 2.4 | -PclusterVersion='2.4' |
| Cloudera Distribution for Hadoop 4 | -PclusterVersion='cdh4' |
| Cloudera Distribution for Hadoop 5 | -PclusterVersion='cdh5' |

Configuration

Configuration

Streaming

Streaming

Hive

Hive

Pig

Pig

Examples

Examples

Usage with static .bson (mongo backup) files

BSON Usage

Usage with Amazon Elastic MapReduce

Amazon Elastic MapReduce is a managed Hadoop framework that allows you to submit jobs to a cluster of customizable size and configuration, without needing to deal with provisioning nodes and installing software.

Using EMR with the MongoDB Connector for Hadoop allows you to run MapReduce jobs against MongoDB backup files stored in S3.

Submitting jobs that use the MongoDB Connector for Hadoop to EMR simply requires that the bootstrap actions fetch the dependencies (MongoDB Java driver, mongo-hadoop-core libs, etc.) and place them into the Hadoop distribution's lib folders.
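A bootstrap action along these lines might look like the sketch below. It is not the project's own bootstrap script. The LIB_DIR default, the function name, and the jar URLs (passed as arguments) are all assumptions that must be adapted to your EMR release and to wherever you host the jars.

```shell
# Hypothetical bootstrap script: download each jar URL given as an argument
# into LIB_DIR. Both the URLs and LIB_DIR are placeholders you must supply.
LIB_DIR="${LIB_DIR:-/home/hadoop/lib}"

fetch_jars() {
  for url in "$@"; do
    # -f fails on HTTP errors, -L follows redirects, -O keeps the remote file name.
    (cd "$LIB_DIR" && curl -fLO "$url") || return 1
  done
}

# Example: fetch_jars "$MONGO_JAVA_DRIVER_URL" "$MONGO_HADOOP_CORE_URL"
fetch_jars "$@"
```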

For a full example (running the enron example on Elastic MapReduce) please see here.

Usage with Pig

Documentation on Pig with the MongoDB Connector for Hadoop.

For examples on using Pig with the MongoDB Connector for Hadoop, also refer to the examples section.

Notes for Contributors

If your code introduces new features, add tests that cover them if possible and make sure that ./gradlew check still passes. If you're not sure how to write a test for a feature or have trouble with a test failure, please post on the google-groups with details and we will try to help. Note: Until findbugs updates its dependencies, running ./gradlew check on Java 8 will fail.

Maintainers

Justin Lee ([email protected])

Contributors

Support

Issue tracking: https://jira.mongodb.org/browse/HADOOP/

Discussion: http://groups.google.com/group/mongodb-user/
