hadoop-java-example's Introduction

Hadoop Map-Reduce Example in Java

Get up and running in less than 5 minutes

Overview

This program demonstrates Hadoop's Map-Reduce concept in Java using a very simple example. The input is raw data files listing earthquakes by region, magnitude and other information.

    nc,71920701,1,"Saturday, January 12, 2013 19:43:18 UTC",38.7865,-122.7630,1.5,1.10,27,"Northern California"

The two fields of interest are the magnitude of the quake (1.5 in the line above) and the name of the region where the reading was taken ("Northern California"). The goal is to process all input files and find the maximum magnitude reading for every region listed. The output is in the form:

    "region_name"      <maximum magnitude of earthquake recorded> 
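
The map and reduce steps described above can be sketched in plain Java, without any Hadoop classes. This is not the project's actual Mapper or Reducer; the class name `MaxMagnitudeSketch`, its helper methods, and the second sample line are hypothetical and exist only for illustration. It assumes Java 9 or later and that every record has 10 fields, with the magnitude in the 7th and the region in the last, as in the sample line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the map and reduce steps; no Hadoop classes are used.
class MaxMagnitudeSketch {

    // "Map" step: turn one CSV line into a (region, magnitude) pair.
    // Returns null for lines that do not match the assumed 10-field layout.
    static Map.Entry<String, Double> map(String line) {
        List<String> fields = splitCsv(line);
        if (fields.size() != 10) {
            return null;
        }
        try {
            double magnitude = Double.parseDouble(fields.get(6)); // 7th field
            String region = fields.get(9);                        // last field
            return Map.entry(region, magnitude);
        } catch (NumberFormatException e) {
            return null; // for example, a header row
        }
    }

    // "Reduce" step: keep the largest magnitude seen for each region.
    static Map<String, Double> maxByRegion(List<String> lines) {
        Map<String, Double> max = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Double> pair = map(line);
            if (pair != null) {
                max.merge(pair.getKey(), pair.getValue(), Math::max);
            }
        }
        return max;
    }

    // Minimal CSV splitter: commas inside double quotes are kept as data,
    // and the quote characters themselves are dropped.
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            // Sample line from the overview above.
            "nc,71920701,1,\"Saturday, January 12, 2013 19:43:18 UTC\",38.7865,-122.7630,1.5,1.10,27,\"Northern California\"",
            // Hypothetical second reading, made up for this sketch.
            "nc,00000001,1,\"Saturday, January 12, 2013 20:00:00 UTC\",38.0000,-122.0000,2.3,1.00,10,\"Northern California\"");
        maxByRegion(lines).forEach((region, magnitude) ->
            System.out.println("\"" + region + "\"\t" + magnitude));
    }
}
```

In a real Hadoop job the framework groups the mapper's pairs by key and hands each group to the reducer; here a single `TreeMap` plays both roles, which is enough to show the idea on a handful of lines.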

The raw data files are in the input/ folder. You will notice a compressed file named input.tar.gz there. It is included to demonstrate that Hadoop MapReduce can automatically uncompress the archive to process the files.

Instructions for Setting Up Hadoop

  1. Download the Hadoop 1.1.1 binary from a mirror.

  2. Extract it to a folder on your computer:

     $ tar xvfz hadoop-1.1.1.tar.gz
    
  3. Set the JAVA_HOME environment variable to point to the directory where Java is installed. On my Mac OS X machine, I did the following:

     $ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
    

Note: If you are running Lion, you may want to set JAVA_HOME from the java_home command, which prints Java's home directory:

     $ export JAVA_HOME=$(/usr/libexec/java_home)

  4. Set the HADOOP_INSTALL environment variable to point to the directory where you extracted the Hadoop binary in step 2:

     $ export HADOOP_INSTALL=/Users/umermansoor/Documents/hadoop-1.1.1
    
  5. Add Hadoop's bin directory to the PATH environment variable:

     $ export PATH=$PATH:$HADOOP_INSTALL/bin
    

Alternatively, add these variables to your shell startup script, for example ~/.bash_profile on Mac OS X.

Instructions for Running the Sample

  1. Clone the project:

     $ git clone git@github.com:umermansoor/hadoop-java-example.git
    
  2. Change to the project directory:

     $ cd hadoop-java-example
    
  3. Build the project:

     $ mvn clean install
    
  4. Set the HADOOP_CLASSPATH environment variable to tell Hadoop where to find the Java classes for the sample:

     $ export HADOOP_CLASSPATH=target/classes/
    
  5. Run the sample. The output directory must not exist; otherwise the job will fail.

     $ hadoop com.umermansoor.App input/ output
    

Note: the output will go to the output/ folder which Hadoop will create when run. The output will be in a file called part-r-00000.
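
To inspect the result from Java rather than by opening the file, a small reader along these lines may help. It is only a sketch: the class name `OutputReaderSketch` is hypothetical, and it assumes each line of part-r-00000 holds the region and the magnitude separated by a tab (Hadoop's default text output separator), with the region possibly wrapped in double quotes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a reader for the job's text output; not part of the project.
class OutputReaderSketch {

    // Parses lines assumed to look like: "region_name"<TAB>magnitude
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip lines without a tab separator
            }
            // Drop surrounding whitespace and any quote characters.
            String region = line.substring(0, tab).trim().replace("\"", "");
            double magnitude = Double.parseDouble(line.substring(tab + 1).trim());
            result.put(region, magnitude);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the path named in the note above; pass another path to override.
        Path file = Paths.get(args.length > 0 ? args[0] : "output/part-r-00000");
        parse(Files.readAllLines(file)).forEach((region, magnitude) ->
            System.out.println(region + " -> " + magnitude));
    }
}
```

If the job was configured with a different key/value separator, adjust the character passed to `lastIndexOf` accordingly.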

Common Errors:

  1. Exception: java.lang.NoClassDefFoundError. Cause: The HADOOP_CLASSPATH environment variable is not set, so Hadoop cannot find the Java classes for the sample. Resolution: Set HADOOP_CLASSPATH to point to the target/classes/ folder:

     $ export HADOOP_CLASSPATH=target/classes/
    
  2. Exception: org.apache.hadoop.mapred.FileAlreadyExistsException or 'Output directory output already exists'. Cause: The output directory already exists. Hadoop requires that the output directory does not exist when the job starts. Resolution: Choose a different output directory or remove the existing one:

     $ hadoop com.umermansoor.App input/input.csv output_new 
    

Note: Hadoop failing if the output folder already exists is a good thing: it ensures that you don't accidentally overwrite your previous output, as typical Hadoop jobs take hours to complete.
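
When the previous output is known to be disposable, removing the old directory can also be done from Java before re-running the sample. The following is a minimal sketch under stated assumptions: the class name `OutputDirCleaner` is hypothetical, and it only handles a directory on the local filesystem (as in the standalone run shown here), not one stored in HDFS.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

// Sketch of a local cleanup helper; not part of the project.
class OutputDirCleaner {

    // Deletes dir and everything under it.
    // Returns false if dir did not exist, true if it was removed.
    static boolean deleteIfPresent(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits children before their parent directories.
            Iterable<Path> deepestFirst = paths.sorted(Comparator.reverseOrder())::iterator;
            for (Path path : deepestFirst) {
                Files.delete(path);
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the "output" directory used in the run step.
        Path dir = Paths.get(args.length > 0 ? args[0] : "output");
        System.out.println(deleteIfPresent(dir)
            ? "Removed " + dir
            : "Nothing to remove at " + dir);
    }
}
```

Keeping this as a separate, explicit step preserves the safety the note describes: the job itself still refuses to overwrite existing output.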
