
mahmoudparsian / data-algorithms-book

1.1K stars · 132 watchers · 662 forks · 406.13 MB

MapReduce, Spark, Java, and Scala for Data Algorithms Book

Home Page: http://mapreduce4hackers.com

License: Other

Shell 3.13% Java 87.66% Python 0.04% R 0.05% Awk 0.05% Scala 6.34% HTML 2.74%
hadoop-mapreduce java distributed-computing scala mapreduce data-algorithms python machine-learning pyspark distributed-algorithms

data-algorithms-book's Introduction

Git Repository

The book's codebase can also be downloaded from the git repository at:

git clone https://github.com/mahmoudparsian/data-algorithms-book.git

Data Algorithms Book

How To Run Python Programs

To run the Python programs, just submit them with spark-submit, passing the program's arguments on the command line.
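For example (a hedged sketch; the script path and arguments are placeholders, not a specific program from the book):

$SPARK_HOME/bin/spark-submit /path/to/your_program.py <arg1> <arg2>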

Questions/Comments

Thank you!

best regards,
Mahmoud Parsian



data-algorithms-book's Issues

Second Edition

I am wondering when the second edition will be out. Thanks!

API out of date

ClientArguments only takes one argument in the current version. Can you update this example? I'm not sure it's possible to do it this way anymore; ack or nack?

Thanks
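For reference, a hedged alternative sketch (not from the book): newer Spark releases expose org.apache.spark.launcher.SparkLauncher as a public API for submitting applications programmatically, which avoids depending on the ClientArguments constructor that changed between versions. The Spark home, jar path, class name, and master below are illustrative:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SubmitWithSparkLauncher {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
            .setSparkHome("/opt/spark")                              // illustrative Spark installation
            .setAppResource("/path/to/spark-examples.jar")           // application jar
            .setMainClass("org.apache.spark.examples.JavaSparkPi")   // main class of the application
            .setMaster("yarn")
            .setDeployMode("cluster")
            .addAppArgs("10")                                        // arguments passed to the application
            .startApplication();

        // poll until the application reaches a final state (FINISHED, FAILED, or KILLED)
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
        System.out.println("Final state: " + handle.getState());
    }
}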

Hi, I am facing an issue with submitting a job from Java.

Hi, when I invoke the class from
https://github.com/mahmoudparsian/data-algorithms-book/blob/master/misc/how-to-submit-spark-job-to-yarn-from-java-code.md
from a spark-submit run locally, it is invoked and is able to submit the job to the YARN cluster.

But when I invoke the same class from a spark-submit that is itself submitted to YARN, the application from how-to-submit-spark-job-to-yarn-from-java-code.md is accepted but never moves to the RUNNING state, and it fails with the following error:

Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster

Application application_1493671618562_0072 failed 5 times due to AM Container for appattempt_1493671618562_0072_000005 exited with exitCode: 1
For more detailed output, check the application tracking page: http://headnode.internal.cloudapp.net:8088/cluster/app/application_1493671618562_0072 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e02_1493671618562_0072_05_000001
Exit code: 1
Exception message: /mnt/resource/hadoop/yarn/local/usercache/helixuser/appcache/application_1493671618562_0072/container_e02_1493671618562_0072_05_000001/launch_container.sh: line 26: $PWD:$PWD/spark_conf:$PWD/spark.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/:/usr/hdp/current/hadoop-client/lib/:/usr/hdp/current/hadoop-hdfs-client/:/usr/hdp/current/hadoop-hdfs-client/lib/:/usr/hdp/current/hadoop-yarn-client/:/usr/hdp/current/hadoop-yarn-client/lib/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/:$PWD/mr-framework/hadoop/share/hadoop/common/:$PWD/mr-framework/hadoop/share/hadoop/common/lib/:$PWD/mr-framework/hadoop/share/hadoop/yarn/:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
Stack trace: ExitCodeException exitCode=1: /mnt/resource/hadoop/yarn/local/usercache/helixuser/appcache/application_1493671618562_0072/container_e02_1493671618562_0072_05_000001/launch_container.sh: line 26: $PWD:$PWD/spark_conf:$PWD/spark.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/
:/usr/hdp/current/hadoop-client/lib/:/usr/hdp/current/hadoop-hdfs-client/:/usr/hdp/current/hadoop-hdfs-client/lib/:/usr/hdp/current/hadoop-yarn-client/:/usr/hdp/current/hadoop-yarn-client/lib/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/:$PWD/mr-framework/hadoop/share/hadoop/common/:$PWD/mr-framework/hadoop/share/hadoop/common/lib/:$PWD/mr-framework/hadoop/share/hadoop/yarn/:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
at org.apache.hadoop.util.Shell.run(Shell.java:844)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.

Thank you for your help.

Thanks,
Ankush Reddy.
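A hedged note on the error above: the "bad substitution" in launch_container.sh points at an unresolved ${hdp.version} placeholder in the YARN classpath, which is a known quirk of HDP clusters rather than a problem in the book's code. A commonly reported workaround is to define hdp.version explicitly for the driver and the YARN application master, for example in spark-defaults.conf (the version string below is a placeholder for the cluster's actual HDP build):

spark.driver.extraJavaOptions    -Dhdp.version=2.x.x.x-xxxx
spark.yarn.am.extraJavaOptions   -Dhdp.version=2.x.x.x-xxxx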

BUILD FAILED

UserdeMacBook-Pro:data-algorithms-book user$ ant
Buildfile: /Users/user/Documents/Hadoop/data-algorithms-book/build.xml

init:
[mkdir] Created dir: /Users/user/Documents/Hadoop/data-algorithms-book/build
[mkdir] Created dir: /Users/user/Documents/Hadoop/data-algorithms-book/dist
[taskdef] Could not load definitions from resource scala/tools/ant/antlib.xml. It could not be found.

build_jar:
[echo] compiling scala src...

BUILD FAILED
/Users/user/Documents/Hadoop/data-algorithms-book/build.xml:34: Problem: failed to create task or type scalac
Cause: The name is undefined.
Action: Check the spelling.
Action: Check that any custom tasks/types have been declared.
Action: Check that any <presetdef>/<macrodef> declarations have taken place.
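A hedged sketch of the usual fix: the scalac task is defined by the resource scala/tools/ant/antlib.xml, which ships inside the Scala compiler jar, so the taskdef in build.xml needs the Scala jars on its classpath. The scala.home property below is an assumed pointer to a local Scala installation:

<property name="scala.home" value="/usr/local/scala"/>
<taskdef resource="scala/tools/ant/antlib.xml">
    <classpath>
        <pathelement location="${scala.home}/lib/scala-compiler.jar"/>
        <pathelement location="${scala.home}/lib/scala-library.jar"/>
        <pathelement location="${scala.home}/lib/scala-reflect.jar"/>
    </classpath>
</taskdef>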

Can you provide a sandbox in which the source code can run?

I want to run this source code, but it needs a special development environment. So can you
provide a sandbox, like the CDH sandbox, in which the source code can run?

If you can provide a sandbox, I would appreciate it. Thanks.

There are some empty packages

I read this book up to chapter 10 and tried my best to write the code myself, but I failed. When I went to read the source code for chapter 10 (MapReduce), I found it is empty, which is disappointing. Can you implement this code and push it?

Some other chapters are in the same situation, such as chapter 13, chapter 14, and so on.

Issue submitting Spark Job from code when Spark Job is a Python program.

When submitting a Java or Scala program, everything works fine. When submitting a Python program, it gets to the ACCEPTED state and then stalls. It eventually times out, but it never gets picked up to run. Is this interface just for Java/Scala programs/jobs, or should it be able to submit PySpark/Python jobs as well?

I am trying to invoke the pi.py sample program that comes with Spark 1.6.0.

Below is the Java program that I am testing with. I'm new to Spark, so apologies for any "newbie" errors.

import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.apache.hadoop.conf.Configuration;
// import org.apache.log4j.Logger;

/**
 * This class submits SparkPi to YARN from a Java client (as opposed
 * to submitting a Spark job from a shell command line using spark-submit).
 *
 * To accomplish submitting a Spark job from a Java client, we use
 * the org.apache.spark.deploy.yarn.Client class described below:
 *
 * Usage: org.apache.spark.deploy.yarn.Client [options]
 * Options:
 *   --jar JAR_PATH           Path to your application's JAR file (required in yarn-cluster mode)
 *   --class CLASS_NAME       Name of your application's main class (required)
 *   --primary-py-file        A main Python file
 *   --arg ARG                Argument to be passed to your application's main class.
 *                            Multiple invocations are possible, each will be passed in order.
 *   --num-executors NUM      Number of executors to start (Default: 2)
 *   --executor-cores NUM     Number of cores per executor (Default: 1)
 *   --driver-memory MEM      Memory for driver (e.g. 1000M, 2G) (Default: 512 MB)
 *   --driver-cores NUM       Number of cores used by the driver (Default: 1)
 *   --executor-memory MEM    Memory per executor (e.g. 1000M, 2G) (Default: 1G)
 *   --name NAME              The name of your application (Default: Spark)
 *   --queue QUEUE            The hadoop queue to use for allocation requests (Default: 'default')
 *   --addJars jars           Comma separated list of local jars that want SparkContext.addJar to work with
 *   --py-files PY_FILES      Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps
 *   --files files            Comma separated list of files to be distributed with the job
 *   --archives archives      Comma separated list of archives to be distributed with the job
 *
 * How to call this program, for example:
 *
 *   export SPARK_HOME="/Users/mparsian/spark-1.6.0"
 *   java -DSPARK_HOME="$SPARK_HOME" org.dataalgorithms.client.SubmitSparkPiToYARNFromJavaCode 10
 */
public class SubmitSparkPiToYARNFromJavaCode {

    public static void main(String[] args) throws Exception {
        long startTime = System.currentTimeMillis();

        // this is passed to the SparkPi program
        //THE_LOGGER.info("Slices Passed=" + args[0]);
        String slices = args[0];
        // String slices = "10";
        //
        // String SPARK_HOME = System.getProperty("SPARK_HOME");
        String SPARK_HOME = "/opt/spark/spark-1.6.0";
        // THE_LOGGER.info("SPARK_HOME=" + SPARK_HOME);

        //
        pi(SPARK_HOME, slices); // ... the code being measured ...
        //
        long elapsedTime = System.currentTimeMillis() - startTime;
        // THE_LOGGER.info("elapsedTime (millis)=" + elapsedTime);
    }

    static void pi(String SPARK_HOME, String slices) throws Exception {
        //
        String[] args = new String[]{
            "--name",
            "Submit-SparkPi-To-Yarn",
            //
            "--driver-memory",
            "512MB",
            //
            "--jar",
            SPARK_HOME + "/examples/target/spark-examples_2.11-1.6.0.jar",
            //
            "--class",
            "org.apache.spark.examples.JavaSparkPi",

            // argument 1 to my Spark program
            "--arg",
            slices,

            // argument 2 to my Spark program (helper argument to create a proper JavaSparkContext object)
            "--arg",
            "yarn-cluster"
        };

        Configuration config = new Configuration();
        //
        System.setProperty("SPARK_YARN_MODE", "true");
        //
        SparkConf sparkConf = new SparkConf();
        ClientArguments clientArgs = new ClientArguments(args, sparkConf);
        Client client = new Client(clientArgs, config, sparkConf);

        client.run();
        // done!
    }
}

Thanks,

-Scott
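A hedged guess, based on the Client options listed in the usage comment above (untested): for a Python job such as pi.py, the same Client would be driven with --primary-py-file (plus --py-files for any dependencies) instead of --jar and --class, for example by replacing the args array with something like:

String[] pyArgs = new String[]{
    "--name",
    "Submit-SparkPi-Python-To-Yarn",
    // the main Python file takes the place of --jar/--class
    "--primary-py-file",
    SPARK_HOME + "/examples/src/main/python/pi.py",
    // argument passed to pi.py
    "--arg",
    slices
};

Whether this moves past ACCEPTED on a given cluster would still need to be verified; a stall in ACCEPTED can also be a YARN resource or queue issue.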

Regarding input parameters

I am new to Spark and I want to implement the KNN algorithm for my college project. So please tell me what R and S stand for. If possible, please tell me what they look like. I want to use a CSV file as the target database. Thanks in advance.

Hello, dear author

I found that the chapter 12 MapReduce source is not there. Was it deleted?

Thanks for your reply.

Dependencies cause CVEs in your execution path

Your project uses some dependencies with CVEs. I found that the buggy methods of those CVEs are on your project's execution path, which puts your project at risk. I have suggested some version updates. Details are listed below:

  • Vulnerable Dependency: org.apache.hadoop : hadoop-common : 2.6.3

  • Call Chain to Buggy Methods:

    • Some files in your project call the library method org.apache.hadoop.fs.Path.getFileSystem(org.apache.hadoop.conf.Configuration), which can reach the buggy method of CVE-2017-15713.

      • Files in your project:
        src/main/java/org/dataalgorithms/chap29/combinesmallfilesbyhadoop/CustomRecordReader.java, src/main/java/org/dataalgorithms/chap24/mapreduce/FastaRecordReader.java, src/main/java/org/dataalgorithms/chap24/mapreduce/FastaInputFormat.java
      • One of the possible call chains:
      org.apache.hadoop.fs.Path.getFileSystem(org.apache.hadoop.conf.Configuration)
      org.apache.hadoop.fs.FileSystem.get(java.net.URI,org.apache.hadoop.conf.Configuration)
      org.apache.hadoop.fs.FileSystem.getDefaultUri(org.apache.hadoop.conf.Configuration)
      org.apache.hadoop.conf.Configuration.get(java.lang.String,java.lang.String)
      org.apache.hadoop.conf.Configuration.substituteVars(java.lang.String) [buggy method]
      
    • Some files in your project call the library method org.apache.hadoop.conf.Configuration.getInt(java.lang.String,int), which can reach the buggy method of CVE-2017-15713.

      • Files in your project:
        src/main/java/org/dataalgorithms/chap05/mapreduce/RelativeFrequencyMapper.java (the remaining 21 files are hidden)
      • One of the possible call chains:
      org.apache.hadoop.conf.Configuration.getInt(java.lang.String,int)
      org.apache.hadoop.conf.Configuration.getTrimmed(java.lang.String)
      org.apache.hadoop.conf.Configuration.get(java.lang.String)
      org.apache.hadoop.conf.Configuration.substituteVars(java.lang.String) [buggy method]
      
    • Some files in your project call the library method org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,java.io.OutputStream,org.apache.hadoop.conf.Configuration,boolean), which can reach the buggy method of CVE-2017-15713.

      • Files in your project:
        src/main/java/org/dataalgorithms/chap29/combinesmallfilesbybuckets/BucketThread.java
      • One of the possible call chains:
      org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,java.io.OutputStream,org.apache.hadoop.conf.Configuration,boolean)
      org.apache.hadoop.conf.Configuration.getInt(java.lang.String,int)
      org.apache.hadoop.conf.Configuration.getTrimmed(java.lang.String)
      org.apache.hadoop.conf.Configuration.get(java.lang.String)
      org.apache.hadoop.conf.Configuration.substituteVars(java.lang.String) [buggy method]
      
    • Some files in your project call the library method org.apache.hadoop.io.SequenceFile.createWriter(org.apache.hadoop.fs.FileSystem,org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.lang.Class,java.lang.Class), which can reach the buggy method of CVE-2017-15713.

      • Files in your project:
        src/main/java/org/dataalgorithms/util/SequenceFileWriterDemo.java, src/main/java/org/dataalgorithms/chap03/mapreduce/SequenceFileWriterForTopN.java
      • One of the possible call chains:
      org.apache.hadoop.io.SequenceFile.createWriter(org.apache.hadoop.fs.FileSystem,org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.lang.Class,java.lang.Class)
      org.apache.hadoop.io.SequenceFile.createWriter(org.apache.hadoop.conf.Configuration,org.apache.hadoop.io.SequenceFile$Writer$Option[])
      org.apache.hadoop.io.SequenceFile.getDefaultCompressionType(org.apache.hadoop.conf.Configuration)
      org.apache.hadoop.conf.Configuration.get(java.lang.String)
      org.apache.hadoop.conf.Configuration.substituteVars(java.lang.String) [buggy method]
      
  • Update suggestion: version 3.2.1
    3.2.1 is a safe version without CVEs. From 2.6.3 to 3.2.1, 17 of the APIs (called 84 times in your project) were modified.
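If the build is Maven-based, the suggested upgrade would look roughly like the snippet below (a hedged sketch; the project would still need to be verified against the Hadoop 3.x APIs before adopting it):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.2.1</version>
</dependency>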

Chap 14, the output of Builder.scala is not available for classifier.scala

Dear mahmoudparsian,
Sorry to bother you.
It is known that two methods can be used to save the output when a Scala Spark program finishes. As you do in NaiveBayesClassifierBuilder.scala, the pt table is saved as part-* files in HDFS. However, my issue is related to this: the RDD method saveAsObjectFile seems to return NULL first and then write a SequenceFile as output, so in the second Spark program (NaiveBayesClassifier.scala) a NullPointerException is thrown. On the other hand, if I use saveAsTextFile, the second Spark program throws an exception saying that a SequenceFile is required. So I am not sure how to deal with this issue in your Scala program. Could you give me any tips?

Best Wishes,
WeiWei HE
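A hedged note on the save/load pairing (illustrated in Java; the path and element type are placeholders): an RDD written with saveAsObjectFile is stored as a SequenceFile of serialized objects and must be read back with objectFile, while output written with saveAsTextFile must be read back with textFile. Mixing the two is a common source of exactly these exceptions:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ObjectFileRoundTrip {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("pt-demo").setMaster("local[2]"));

        JavaRDD<String> pt = sc.parallelize(Arrays.asList("a", "b", "c"));
        // builder side: writes a SequenceFile of serialized objects under /tmp/pt
        pt.saveAsObjectFile("/tmp/pt");

        // classifier side: read it back with objectFile, not textFile
        JavaRDD<String> loaded = sc.objectFile("/tmp/pt");
        System.out.println(loaded.count());

        sc.stop();
    }
}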

FastaRecordReader for huge fasta files

Hi,

I have a question about the FastaRecordReader class (data-algorithms-book/src/main/java/org/dataalgorithms/chap24/mapreduce/FastaRecordReader.java).

I have been trying to use it for large genomes (FASTA files much larger than an HDFS block, e.g. ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.fna.gz), but I am getting wrong sequences.

Is it possible that using these classes from Spark with the newAPIHadoopFile method does not work for very large files? Or am I perhaps missing something?

Regards, and thank you very much for your time.

Jose M. Abuin
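A hedged idea for narrowing this down (it assumes FastaInputFormat extends Hadoop's FileInputFormat, which the package layout suggests but which should be checked): wrong sequences for files much larger than a block can come from the file being split across HDFS blocks while the record reader assumes it sees whole records. Making the format unsplittable forces each file to be read by a single reader, at the cost of parallelism, and is a quick way to test that hypothesis:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.dataalgorithms.chap24.mapreduce.FastaInputFormat;

// hypothetical test subclass; not part of the book's code
public class NonSplittableFastaInputFormat extends FastaInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // never split a FASTA file, so one record reader sees the whole file
        return false;
    }
}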
