apache-hadoop

This repository demonstrates the Apache Hadoop Big Data framework, using:

  • HDFS,
  • Pig,
  • Hive,
  • MapReduce.

User's Manual

Part 1.

  • To display the users and their corresponding directories, open the terminal and type:

    • hdfs dfs -ls PATH // shows the folders and files in the specified path
    • hdfs dfs -ls /user // lists the users and their home directories

Part 2.

  • First, create a folder in Cloudera with the following command:

    • hdfs dfs -mkdir /user/cloudera/folder_name
  • Then create two folders, input1 and input2, inside the folder you created above (folder_name):

    • hdfs dfs -mkdir /user/cloudera/folder_name/input1 /user/cloudera/folder_name/input2
  • To load our csv file into input1 with the indicated attributes (blocksize = 32 MB, replication factor = 3), use the following command:

    • hdfs dfs -D dfs.blocksize=33554432 -D dfs.replication=3 -put 2459501.csv /user/cloudera/2459501/input1
  • To load our csv file into input2 with the indicated attributes (blocksize = 64 MB, replication factor = 2), use the following command:

    • hdfs dfs -D dfs.blocksize=67108864 -D dfs.replication=2 -put 2459501.csv /user/cloudera/2459501/input2

Part 3.

Create Project:

  • Create a java project (File->New->Project) using Eclipse.

  • Select Java Project and click "Next".

  • Enter your project name (Sensors) and click "Finish".

  • To create a class in the project you created, right-click "src" and select New->Class.

  • Enter the class name (e.g. ProcSensors) and click Finish.

  • Add dependencies: right-click the project and select "Build Path->Configure Build Path".

  • Click "Add External JARs…" and select all necessary jars (e.g. hadoop-common-*.*.*.jar, hadoop-mapreduce-client-core-*.*.*.jar, etc.).

  • Set the run configurations for the project: click "Run->Run Configurations…", browse to the current project (Sensors) as the project, and select the project's main class (i.e. ProcSensors). To set the input and output locations, open the "Arguments" tab, then write the input path, a space, and the output path (folder_name/input1/2459501.csv output). Click "Apply" to save these changes.

Create Classes:

  • A MapReduce program needs a Mapper class (SensorMapper) that runs on every single block (machine), a Reducer class (SensorReducer) that collects and processes the Mapper's outputs, and a main class (ProcSensors) that runs the program. Optionally, a Combiner class (SensorCombiner) can sit between the Mapper and Reducer, and a MapResult class can keep multiple partial results in a single object.

  • The SensorMapper class extends the Mapper class. Its map function reads the input and transforms it into "key, value" pairs.
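The per-record logic of that map step can be sketched in plain Java. This is a sketch only: the class and method names below are illustrative (not from the repository), and it assumes each csv row has the layout `place,temperature`.

```java
// Sketch of the per-record parsing inside SensorMapper.map().
// In the real class this runs inside Hadoop's Mapper and emits
// Writable types; here it is plain Java so the logic stands alone.
public class SensorMapSketch {
    // Turns one csv line ("place,temperature") into a key/value pair.
    public static Object[] toKeyValue(String line) {
        String[] fields = line.split(",");
        String place = fields[0].trim();                            // key: place name
        double temperature = Double.parseDouble(fields[1].trim());  // value: reading
        return new Object[] { place, temperature };
    }
}
```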

  • The SensorCombiner class extends the Reducer class. Its reduce function takes the map output (key: place, value: a MapResult object holding a temperature reading) and emits the place as key and, as value, a MapResult containing the occurrence count, temperature sum, and minimum and maximum temperatures for that place.

  • The SensorReducer class extends the Reducer class. Its reduce function takes the SensorCombiner output (key: place, value: a MapResult containing the occurrence count, temperature sum, and minimum and maximum temperatures) and emits, as Text, the place together with its occurrence count, minimum temperature, maximum temperature, and average temperature.
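The aggregation performed across the Combiner/Reducer stages can be sketched without Hadoop. A minimal sketch, assuming a non-empty list of readings for one place (class and method names are illustrative):

```java
import java.util.List;

// Sketch of the per-place aggregation the SensorCombiner/SensorReducer
// pair performs: count, minimum, maximum, and average temperature.
public class SensorReduceSketch {
    // Aggregates all temperature readings for one place.
    // Returns { count, min, max, average }; assumes temps is non-empty.
    public static double[] aggregate(List<Double> temps) {
        long count = 0;
        double sum = 0, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double t : temps) {
            count++;
            sum += t;
            min = Math.min(min, t);
            max = Math.max(max, t);
        }
        return new double[] { count, min, max, sum / count };
    }
}
```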

  • ProcSensors is the main class of the program. It handles the project configuration: it sets up the input and output file locations and builds the MapReduce structure by creating a job instance and registering the Mapper, Reducer, and Combiner classes.
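That driver wiring typically follows the standard Hadoop job-setup pattern below. This is a sketch of the job configuration only, not the repository's actual code: it assumes the Hadoop client jars are on the classpath and reuses the class names from this document.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProcSensors {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sensors");
        job.setJarByClass(ProcSensors.class);

        // Wire up the three MapReduce roles described above.
        job.setMapperClass(SensorMapper.class);
        job.setCombinerClass(SensorCombiner.class);
        job.setReducerClass(SensorReducer.class);

        // Intermediate pairs: place -> MapResult (MapResult must implement Writable).
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(MapResult.class);
        // Final output: the Text description emitted by SensorReducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // args[0] and args[1] come from the Run Configuration's Arguments tab.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```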

  • MapResult is a class containing count, total, minimum, and maximum variables.
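A plain-Java sketch of such a value object follows. The merge and average methods are illustrative additions (the document only names the four fields); in the actual project the class would also implement Hadoop's Writable interface so it can travel between Mapper, Combiner, and Reducer.

```java
// Sketch of the MapResult value object: count, total, minimum, maximum.
public class MapResult {
    long count;
    double total, min, max;

    public MapResult(long count, double total, double min, double max) {
        this.count = count;
        this.total = total;
        this.min = min;
        this.max = max;
    }

    // Folds another partial result into this one -- the operation the
    // Combiner performs per block and the Reducer performs globally.
    public void merge(MapResult other) {
        count += other.count;
        total += other.total;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
    }

    public double average() {
        return total / count;
    }
}
```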

Part 4.

  • Define a table in Hive using the following query. You can use Hue for convenience.
  • The query below creates a table whose rows are parsed as comma-separated values, with the header row at the top skipped.

Create Table

CREATE EXTERNAL TABLE sensortable(place STRING, temperature FLOAT)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
TBLPROPERTIES("skip.header.line.count" = "1");
  • Then load the data by indicating its path as follows:
    LOAD DATA INPATH '/user/cloudera/folder_name/input1/2459501.csv' INTO TABLE sensortable;
  • Finally, write a query as follows to select the occurrences, minimum temperature, maximum temperature, and average temperature for every place. (The MapReduce over the HDFS file will be handled by Hive itself.)
SELECT place,
count(temperature) as occurrences,
min(temperature) as mintemp,
max(temperature) as maxtemp,
avg(temperature) as avgtemp
FROM sensortable
GROUP BY place;
  • The results are as follows:

(query output screenshot)

Contributors

  • beratuna
