
mahmoudparsian / data-algorithms-book

1.1K stars · 132 watchers · 662 forks · 406.13 MB

MapReduce, Spark, Java, and Scala for Data Algorithms Book

Home Page: http://mapreduce4hackers.com

License: Other

Shell 3.13% Java 87.66% Python 0.04% R 0.05% Awk 0.05% Scala 6.34% HTML 2.74%
hadoop-mapreduce java distributed-computing scala mapreduce data-algorithms python machine-learning pyspark distributed-algorithms

data-algorithms-book's Introduction

Git Repository

The book's codebase can also be downloaded from the git repository at:

git clone https://github.com/mahmoudparsian/data-algorithms-book.git

Data Algorithms Book

How To Run Python Programs

To run the Python programs, just submit them with spark-submit, passing the program's arguments on the command line.
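For example (a hedged sketch; the script path and arguments are placeholders, not a specific program from the book):

$SPARK_HOME/bin/spark-submit /path/to/your_program.py <arg1> <arg2>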

Questions/Comments

Thank you!

best regards,
Mahmoud Parsian



data-algorithms-book's Issues

Second Edition

I am wondering when the second edition will be out. Thanks!

API out of date

ClientArguments only takes one argument in the current version. Can you update this example? I'm not sure it's possible to do it this way anymore; ack or nack?

Thanks
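For reference, a hedged alternative sketch (not from the book): newer Spark releases expose org.apache.spark.launcher.SparkLauncher as a public API for submitting applications programmatically, which avoids depending on the ClientArguments constructor that changed between versions. The Spark home, jar path, class name, and master below are illustrative:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SubmitWithSparkLauncher {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
            .setSparkHome("/opt/spark")                              // illustrative Spark installation
            .setAppResource("/path/to/spark-examples.jar")           // application jar
            .setMainClass("org.apache.spark.examples.JavaSparkPi")   // main class of the application
            .setMaster("yarn")
            .setDeployMode("cluster")
            .addAppArgs("10")                                        // arguments passed to the application
            .startApplication();

        // poll until the application reaches a final state (FINISHED, FAILED, or KILLED)
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
        System.out.println("Final state: " + handle.getState());
    }
}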

Hi, I am facing an issue with submitting a job from Java.

Hi, when I invoke the class from
https://github.com/mahmoudparsian/data-algorithms-book/blob/master/misc/how-to-submit-spark-job-to-yarn-from-java-code.md
from a spark-submit run locally, it is invoked and is able to submit the job to the YARN cluster.

But when I invoke the same class from a spark-submit that is itself submitted to YARN, the application from how-to-submit-spark-job-to-yarn-from-java-code.md is accepted but never moves to the RUNNING state, and it fails with the following error:

Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster

Application application_1493671618562_0072 failed 5 times due to AM Container for appattempt_1493671618562_0072_000005 exited with exitCode: 1
For more detailed output, check the application tracking page: http://headnode.internal.cloudapp.net:8088/cluster/app/application_1493671618562_0072 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e02_1493671618562_0072_05_000001
Exit code: 1
Exception message: /mnt/resource/hadoop/yarn/local/usercache/helixuser/appcache/application_1493671618562_0072/container_e02_1493671618562_0072_05_000001/launch_container.sh: line 26: $PWD:$PWD/spark_conf:$PWD/spark.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/:/usr/hdp/current/hadoop-client/lib/:/usr/hdp/current/hadoop-hdfs-client/:/usr/hdp/current/hadoop-hdfs-client/lib/:/usr/hdp/current/hadoop-yarn-client/:/usr/hdp/current/hadoop-yarn-client/lib/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/:$PWD/mr-framework/hadoop/share/hadoop/common/:$PWD/mr-framework/hadoop/share/hadoop/common/lib/:$PWD/mr-framework/hadoop/share/hadoop/yarn/:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
Stack trace: ExitCodeException exitCode=1: /mnt/resource/hadoop/yarn/local/usercache/helixuser/appcache/application_1493671618562_0072/container_e02_1493671618562_0072_05_000001/launch_container.sh: line 26: $PWD:$PWD/spark_conf:$PWD/spark.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/
:/usr/hdp/current/hadoop-client/lib/:/usr/hdp/current/hadoop-hdfs-client/:/usr/hdp/current/hadoop-hdfs-client/lib/:/usr/hdp/current/hadoop-yarn-client/:/usr/hdp/current/hadoop-yarn-client/lib/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/:$PWD/mr-framework/hadoop/share/hadoop/common/:$PWD/mr-framework/hadoop/share/hadoop/common/lib/:$PWD/mr-framework/hadoop/share/hadoop/yarn/:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
at org.apache.hadoop.util.Shell.runCommand(Shell.java:933)
at org.apache.hadoop.util.Shell.run(Shell.java:844)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1123)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:225)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.

Thank you for your help.

Thanks,
Ankush Reddy.
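A hedged note on the error above: the "bad substitution" in launch_container.sh points at an unresolved ${hdp.version} placeholder in the YARN classpath, which is a known quirk of HDP clusters rather than a problem in the book's code. A commonly reported workaround is to define hdp.version explicitly for the driver and the YARN application master, for example in spark-defaults.conf (the version string below is a placeholder for the cluster's actual HDP build):

spark.driver.extraJavaOptions    -Dhdp.version=2.x.x.x-xxxx
spark.yarn.am.extraJavaOptions   -Dhdp.version=2.x.x.x-xxxx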

BUILD FAILED

UserdeMacBook-Pro:data-algorithms-book user$ ant
Buildfile: /Users/user/Documents/Hadoop/data-algorithms-book/build.xml

init:
[mkdir] Created dir: /Users/user/Documents/Hadoop/data-algorithms-book/build
[mkdir] Created dir: /Users/user/Documents/Hadoop/data-algorithms-book/dist
[taskdef] Could not load definitions from resource scala/tools/ant/antlib.xml. It could not be found.

build_jar:
[echo] compiling scala src...

BUILD FAILED
/Users/user/Documents/Hadoop/data-algorithms-book/build.xml:34: Problem: failed to create task or type scalac
Cause: The name is undefined.
Action: Check the spelling.
Action: Check that any custom tasks/types have been declared.
Action: Check that any <presetdef>/<macrodef> declarations have taken place.
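A hedged sketch of the usual fix: the scalac task is defined by the resource scala/tools/ant/antlib.xml, which ships inside the Scala compiler jar, so the taskdef in build.xml needs the Scala jars on its classpath. The scala.home property below is an assumed pointer to a local Scala installation:

<property name="scala.home" value="/usr/local/scala"/>
<taskdef resource="scala/tools/ant/antlib.xml">
    <classpath>
        <pathelement location="${scala.home}/lib/scala-compiler.jar"/>
        <pathelement location="${scala.home}/lib/scala-library.jar"/>
        <pathelement location="${scala.home}/lib/scala-reflect.jar"/>
    </classpath>
</taskdef>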

Can you provide a sandbox in which the source code can run?

I want to run this source code, but it needs a special development environment. So can you
provide a sandbox, like the CDH sandbox, in which the source code can run?

If you can provide a sandbox, I would appreciate it. Thanks.

There are some empty packages

I read this book up to chapter 10 and tried my best to write the code myself, but I failed. When I went to read the source code for chapter 10 (MapReduce), I found it is empty, which is disappointing. Can you implement this code and push it?

Some other chapters are in the same situation, such as chapter 13, chapter 14, and so on.

Issue submitting Spark Job from code when Spark Job is a Python program.

When submitting a Java or Scala program, everything works fine. When submitting a Python program, it gets to the ACCEPTED state and then stalls. It eventually times out, but it never gets picked up to run. Is this interface just for Java/Scala programs/jobs, or should it be able to submit PySpark/Python jobs as well?

I am trying to invoke the pi.py sample program that comes with Spark 1.6.0.

Below is the Java program that I am testing with. I'm new to Spark, so apologies for any "newbie" errors.

import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.apache.hadoop.conf.Configuration;
// import org.apache.log4j.Logger;

/**
 * This class submits SparkPi to YARN from a Java client (as opposed
 * to submitting a Spark job from a shell command line using spark-submit).
 *
 * To accomplish submitting a Spark job from a Java client, we use
 * the org.apache.spark.deploy.yarn.Client class described below:
 *
 * Usage: org.apache.spark.deploy.yarn.Client [options]
 * Options:
 *   --jar JAR_PATH           Path to your application's JAR file (required in yarn-cluster mode)
 *   --class CLASS_NAME       Name of your application's main class (required)
 *   --primary-py-file        A main Python file
 *   --arg ARG                Argument to be passed to your application's main class.
 *                            Multiple invocations are possible, each will be passed in order.
 *   --num-executors NUM      Number of executors to start (Default: 2)
 *   --executor-cores NUM     Number of cores per executor (Default: 1)
 *   --driver-memory MEM      Memory for driver (e.g. 1000M, 2G) (Default: 512 MB)
 *   --driver-cores NUM       Number of cores used by the driver (Default: 1)
 *   --executor-memory MEM    Memory per executor (e.g. 1000M, 2G) (Default: 1G)
 *   --name NAME              The name of your application (Default: Spark)
 *   --queue QUEUE            The hadoop queue to use for allocation requests (Default: 'default')
 *   --addJars jars           Comma separated list of local jars that want SparkContext.addJar to work with
 *   --py-files PY_FILES      Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps
 *   --files files            Comma separated list of files to be distributed with the job
 *   --archives archives      Comma separated list of archives to be distributed with the job
 *
 * How to call this program, for example:
 *
 *   export SPARK_HOME="/Users/mparsian/spark-1.6.0"
 *   java -DSPARK_HOME="$SPARK_HOME" org.dataalgorithms.client.SubmitSparkPiToYARNFromJavaCode 10
 */
public class SubmitSparkPiToYARNFromJavaCode {

    public static void main(String[] args) throws Exception {
        long startTime = System.currentTimeMillis();

        // this is passed to the SparkPi program
        //THE_LOGGER.info("Slices Passed=" + args[0]);
        String slices = args[0];
        // String slices = "10";
        //
        // String SPARK_HOME = System.getProperty("SPARK_HOME");
        String SPARK_HOME = "/opt/spark/spark-1.6.0";
        // THE_LOGGER.info("SPARK_HOME=" + SPARK_HOME);

        //
        pi(SPARK_HOME, slices); // ... the code being measured ...
        //
        long elapsedTime = System.currentTimeMillis() - startTime;
        // THE_LOGGER.info("elapsedTime (millis)=" + elapsedTime);
    }

    static void pi(String SPARK_HOME, String slices) throws Exception {
        //
        String[] args = new String[]{
            "--name",
            "Submit-SparkPi-To-Yarn",
            //
            "--driver-memory",
            "512MB",
            //
            "--jar",
            SPARK_HOME + "/examples/target/spark-examples_2.11-1.6.0.jar",
            //
            "--class",
            "org.apache.spark.examples.JavaSparkPi",

            // argument 1 to my Spark program
            "--arg",
            slices,

            // argument 2 to my Spark program (helper argument to create a proper JavaSparkContext object)
            "--arg",
            "yarn-cluster"
        };

        Configuration config = new Configuration();
        //
        System.setProperty("SPARK_YARN_MODE", "true");
        //
        SparkConf sparkConf = new SparkConf();
        ClientArguments clientArgs = new ClientArguments(args, sparkConf);
        Client client = new Client(clientArgs, config, sparkConf);

        client.run();
        // done!
    }
}

Thanks,

-Scott
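A hedged guess, based on the Client options listed in the usage comment above (untested): for a Python job such as pi.py, the same Client would be driven with --primary-py-file (plus --py-files for any dependencies) instead of --jar and --class, for example by replacing the args array with something like:

String[] pyArgs = new String[]{
    "--name",
    "Submit-SparkPi-Python-To-Yarn",
    // the main Python file takes the place of --jar/--class
    "--primary-py-file",
    SPARK_HOME + "/examples/src/main/python/pi.py",
    // argument passed to pi.py
    "--arg",
    slices
};

Whether this moves past ACCEPTED on a given cluster would still need to be verified; a stall in ACCEPTED can also be a YARN resource or queue issue.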

Regarding input parameters

I am new to Spark and I want to implement the KNN algorithm for my college project. So please tell me what R and S stand for. If possible, please tell me what they look like. I want to use a CSV file as the target database. Thanks in advance.

Hello, dear author

I found that the chapter 12 MapReduce source is not there. Was it deleted?

Thanks for your reply.

Dependencies cause CVEs in your execution path

Your project uses some dependencies with CVEs. I found that the buggy methods of those CVEs are on your project's execution path, which puts your project at risk. I have suggested some version updates. Details are listed below:

  • Vulnerable Dependency: org.apache.hadoop : hadoop-common : 2.6.3

  • Call Chain to Buggy Methods:

    • Some files in your project call the library method org.apache.hadoop.fs.Path.getFileSystem(org.apache.hadoop.conf.Configuration), which can reach the buggy method of CVE-2017-15713.

      • Files in your project:
        src/main/java/org/dataalgorithms/chap29/combinesmallfilesbyhadoop/CustomRecordReader.java, src/main/java/org/dataalgorithms/chap24/mapreduce/FastaRecordReader.java, src/main/java/org/dataalgorithms/chap24/mapreduce/FastaInputFormat.java
      • One of the possible call chains:
      org.apache.hadoop.fs.Path.getFileSystem(org.apache.hadoop.conf.Configuration)
      org.apache.hadoop.fs.FileSystem.get(java.net.URI,org.apache.hadoop.conf.Configuration)
      org.apache.hadoop.fs.FileSystem.getDefaultUri(org.apache.hadoop.conf.Configuration)
      org.apache.hadoop.conf.Configuration.get(java.lang.String,java.lang.String)
      org.apache.hadoop.conf.Configuration.substituteVars(java.lang.String) [buggy method]
      
    • Some files in your project call the library method org.apache.hadoop.conf.Configuration.getInt(java.lang.String,int), which can reach the buggy method of CVE-2017-15713.

      • Files in your project:
        src/main/java/org/dataalgorithms/chap05/mapreduce/RelativeFrequencyMapper.java (the remaining 21 files are hidden)
      • One of the possible call chains:
      org.apache.hadoop.conf.Configuration.getInt(java.lang.String,int)
      org.apache.hadoop.conf.Configuration.getTrimmed(java.lang.String)
      org.apache.hadoop.conf.Configuration.get(java.lang.String)
      org.apache.hadoop.conf.Configuration.substituteVars(java.lang.String) [buggy method]
      
    • Some files in your project call the library method org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,java.io.OutputStream,org.apache.hadoop.conf.Configuration,boolean), which can reach the buggy method of CVE-2017-15713.

      • Files in your project:
        src/main/java/org/dataalgorithms/chap29/combinesmallfilesbybuckets/BucketThread.java
      • One of the possible call chains:
      org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream,java.io.OutputStream,org.apache.hadoop.conf.Configuration,boolean)
      org.apache.hadoop.conf.Configuration.getInt(java.lang.String,int)
      org.apache.hadoop.conf.Configuration.getTrimmed(java.lang.String)
      org.apache.hadoop.conf.Configuration.get(java.lang.String)
      org.apache.hadoop.conf.Configuration.substituteVars(java.lang.String) [buggy method]
      
    • Some files in your project call the library method org.apache.hadoop.io.SequenceFile.createWriter(org.apache.hadoop.fs.FileSystem,org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.lang.Class,java.lang.Class), which can reach the buggy method of CVE-2017-15713.

      • Files in your project:
        src/main/java/org/dataalgorithms/util/SequenceFileWriterDemo.java, src/main/java/org/dataalgorithms/chap03/mapreduce/SequenceFileWriterForTopN.java
      • One of the possible call chains:
      org.apache.hadoop.io.SequenceFile.createWriter(org.apache.hadoop.fs.FileSystem,org.apache.hadoop.conf.Configuration,org.apache.hadoop.fs.Path,java.lang.Class,java.lang.Class)
      org.apache.hadoop.io.SequenceFile.createWriter(org.apache.hadoop.conf.Configuration,org.apache.hadoop.io.SequenceFile$Writer$Option[])
      org.apache.hadoop.io.SequenceFile.getDefaultCompressionType(org.apache.hadoop.conf.Configuration)
      org.apache.hadoop.conf.Configuration.get(java.lang.String)
      org.apache.hadoop.conf.Configuration.substituteVars(java.lang.String) [buggy method]
      
  • Update suggestion: version 3.2.1
    3.2.1 is a safe version without CVEs. From 2.6.3 to 3.2.1, 17 of the APIs (called 84 times in your project) were modified.
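If the build is Maven-based, the suggested upgrade would look roughly like the snippet below (a hedged sketch; the project would still need to be verified against the Hadoop 3.x APIs before adopting it):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.2.1</version>
</dependency>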

Chap 14, the output of Builder.scala is not available for classifier.scala

Dear mahmoudparsian,
Sorry to bother you.
It is known that two methods can be used to save the output when a Scala Spark program finishes. As you do in NaiveBayesClassifierBuilder.scala, the pt table is saved as part-* files in HDFS. However, my issue is related to this: the RDD method saveAsObjectFile seems to return NULL first and then write a SequenceFile as output, so in the second Spark program (NaiveBayesClassifier.scala) a NullPointerException is thrown. On the other hand, if I use saveAsTextFile, the second Spark program throws an exception saying that a SequenceFile is required. So I am not sure how to deal with this issue in your Scala program. Could you give me any tips?

Best Wishes,
WeiWei HE
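A hedged note on the save/load pairing (illustrated in Java; the path and element type are placeholders): an RDD written with saveAsObjectFile is stored as a SequenceFile of serialized objects and must be read back with objectFile, while output written with saveAsTextFile must be read back with textFile. Mixing the two is a common source of exactly these exceptions:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ObjectFileRoundTrip {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("pt-demo").setMaster("local[2]"));

        JavaRDD<String> pt = sc.parallelize(Arrays.asList("a", "b", "c"));
        // builder side: writes a SequenceFile of serialized objects under /tmp/pt
        pt.saveAsObjectFile("/tmp/pt");

        // classifier side: read it back with objectFile, not textFile
        JavaRDD<String> loaded = sc.objectFile("/tmp/pt");
        System.out.println(loaded.count());

        sc.stop();
    }
}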

FastaRecordReader for huge fasta files

Hi,

I have a question about the FastaRecordReader class (data-algorithms-book/src/main/java/org/dataalgorithms/chap24/mapreduce/FastaRecordReader.java).

I have been trying to use it for large genomes (FASTA files much larger than an HDFS block, e.g. ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.fna.gz), but I am getting wrong sequences.

Is it possible that using these classes from Spark with the newAPIHadoopFile method does not work for very large files? Or am I perhaps missing something?

Regards, and thank you very much for your time.

Jose M. Abuin
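A hedged idea for narrowing this down (it assumes FastaInputFormat extends Hadoop's FileInputFormat, which the package layout suggests but which should be checked): wrong sequences for files much larger than a block can come from the file being split across HDFS blocks while the record reader assumes it sees whole records. Making the format unsplittable forces each file to be read by a single reader, at the cost of parallelism, and is a quick way to test that hypothesis:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.dataalgorithms.chap24.mapreduce.FastaInputFormat;

// hypothetical test subclass; not part of the book's code
public class NonSplittableFastaInputFormat extends FastaInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // never split a FASTA file, so one record reader sees the whole file
        return false;
    }
}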
