magpie's Introduction

Magpie

Magpie contains a number of scripts for running Big Data software in HPC environments. Thus far, Hadoop, Spark, Hbase, Storm, Pig, Phoenix, Kafka, Zeppelin, Zookeeper, and Alluxio are supported. It currently supports running over the parallel file system Lustre and over any generic network filesystem. There is scheduler/resource manager support for Slurm, Moab, Torque, LSF, and Flux.

Some of the features presently supported:

  • Run jobs interactively or via scripts.
  • Run against a number of filesystem options, such as HDFS, HDFS over Lustre, HDFS over a generic network filesystem, Lustre directly, or a generic network filesystem.
  • Take advantage of SSDs/NVRAM for local caching if available.
  • Make decent optimizations for your hardware.

Experimental support for several distributed machine learning frameworks has also been added. Presently TensorFlow, TensorFlow with Horovod, and Ray are supported.

Basic Idea

The basic idea behind these scripts is to:

  1. Submit a Magpie batch script to allocate nodes on a cluster using your HPC scheduler/resource manager. Slurm, Slurm+mpirun, Moab+Slurm, Moab+Torque, LSF+mpirun, and Flux are currently supported. (A minimal submission sketch follows this list.)

  2. The batch script will create configuration files for all appropriate projects (Hadoop, Spark, etc.). The configuration files will be set up so the rank 0 node is the "master". All compute nodes will have configuration files created that point to the node designated as the master server.

    The configuration files will be populated with values for your filesystem choice and the hardware that exists in your cluster. Reasonable attempts are made to determine optimal values for your system and hardware (they are almost certainly better than the default values). A number of options exist in the batch scripts to adjust these values for individual jobs.

  3. Launch daemons on all nodes. The rank 0 node will run master daemons, such as the Hadoop Namenode. All remaining nodes will run appropriate worker daemons, such as the Hadoop Datanodes.

  4. Now you have a mini big data cluster at your disposal. You can log into the master node and interact with it directly, or you can have Magpie run a script that executes your big data calculation for you.

  5. When your job completes or your allocation time has run out, Magpie will clean up your job by tearing down daemons. When appropriate, Magpie may also do some additional cleanup work to make re-execution on later runs cleaner and faster.
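
For example, on a Slurm system the submission in step 1 might look roughly like the following. This is a hedged sketch: the template name and the variable names in the comments are illustrative and vary between Magpie versions, so treat doc/README as authoritative.

  # Copy one of Magpie's submission templates and edit it for your site and job
  cp magpie.sbatch-hadoop my-hadoop-job.sbatch

  # Inside the script, point Magpie at your installation and filesystem choice,
  # e.g. (variable names illustrative):
  #   export HADOOP_VERSION="3.3.4"
  #   export HADOOP_FILESYSTEM_MODE="hdfsoverlustre"
  #   export HADOOP_HDFSOVERLUSTRE_PATH="/lustre/${USER}/hdfsoverlustre"

  # Submit it like any other batch job; Magpie performs steps 2-5 inside the job
  sbatch my-hadoop-job.sbatch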

Supported Packages & Versions

For a complete list of supported package versions and dependencies, please see doc/README. The following can be considered a summary of support.

Hadoop - 2.2.0, 2.3.0, 2.4.X, 2.5.X, 2.6.X, 2.7.X, 2.8.X, 2.9.X, 3.0.X, 3.1.X, 3.2.X, 3.3.X

Spark - 1.1.X, 1.2.X, 1.3.X, 1.4.X, 1.5.X, 1.6.X, 2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X, 3.0.X, 3.1.X, 3.2.X, 3.3.X, 3.4.X, 3.5.X

Hbase - 1.0.X, 1.1.X, 1.2.X, 1.3.X, 1.4.X, 1.5.X, 1.6.X

Hive - 2.3.0

Pig - 0.13.0, 0.14.0, 0.15.0, 0.16.0, 0.17.0

Zookeeper - 3.4.X

Storm - 0.9.X, 0.10.X, 1.0.X, 1.1.X, 1.2.X

Phoenix - 4.5.X, 4.6.0, 4.7.0, 4.8.X, 4.9.0, 4.10.1, 4.11.0, 4.12.0, 4.13.X, 4.14.0

Kafka - 2.11-0.9.0.0

Zeppelin - 0.6.X, 0.7.X, 0.8.X

Alluxio - 2.3.0

TensorFlow - 1.9, 1.12

Ray - 0.7.0

Older Supported Packages & Features

Some packages and features were dropped due to lack of interest, because the software became old or deprecated, or because they were only ever experimental additions to Magpie. If you are interested in them, please look at older versions of Magpie for supported versions and documentation. If you would like support for one of them restored in current versions of Magpie beyond an experimental nature, please submit a support request and we can reconsider adding it back in.

Removed in Magpie 2.0

  • Hadoop 1.X support
  • Tachyon
  • UDA/uda-plugin for Hadoop
  • HDFS Federation in Hadoop
  • IntelLustre option for a Hadoop Filesystem
  • MagpieNetworkFS option for a Hadoop Filesystem

Removed in Magpie 3.0

  • Spark 0.9.X support
  • Hbase 0.98.X and 0.99.X support
  • Mahout

Documentation

All documentation is in the 'doc' subdirectory. Please see the doc/README file as a starting point. It provides general instructions as well as pointers to documentation for each project, setup requirements, how to do local configurations, tips & tricks, and more.

Release

Magpie is released under a GPL license. For more information, see the COPYING file.

LLNL-CODE-644248

magpie's People

Contributors

akantak, chu11, cmd-ntrf, gijshendriksen, hoeze, ianlee1521, joshuata, milinda, nealepetrillo, sammuli, xunpan


magpie's Issues

Teardown Script

I think it would be nice to move all the teardown logic to a file like magpie-teardown. This would allow one to quickly and safely shut down the cluster while in interactive mode. I'm not sure of all the logistics of how to get all the variables automatically, but I think it wouldn't be too difficult.

I wanted to run it by you and see if you had any thoughts on it before I attempted to do it.
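
For illustration only, a standalone helper along those lines might boil down to something like the sketch below (the script name is hypothetical, and the per-project stop scripts shown are the stock Hadoop/Spark ones, not Magpie's actual teardown code):

  #!/bin/bash
  # magpie-teardown (hypothetical): stop daemons started for this allocation.
  # Assumes the job environment (HADOOP_HOME, SPARK_HOME, ...) is still loaded.

  if [ -n "${HADOOP_HOME}" ]; then
      ${HADOOP_HOME}/sbin/stop-yarn.sh
      ${HADOOP_HOME}/sbin/stop-dfs.sh
  fi

  if [ -n "${SPARK_HOME}" ]; then
      ${SPARK_HOME}/sbin/stop-all.sh
  fi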

latex-ize README

The README is getting gigantic and ridiculous. Need to latex-ize it, generate HTML, PDFs, etc.

hbase quorum

Is there a variable that sets hbase.zookeeper.quorum? When I run interactively, even when I set the conf dir for hbase, this doesn't seem to get set internally. I must set it manually like so:

conf = {"hbase.zookeeper.quorum": host,  # list of nodes
"hbase.mapred.outputtable": table,
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}

I was hoping I didn't need to set anything, because it requires a lookup.

I tried setting the classpath with --driver-class-path and at first thought it was working. However it does not seem to be the case.
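
As an interim workaround for the quorum lookup, the value can usually be scraped out of the hbase-site.xml that Magpie generates rather than hardcoded. A sketch, assuming HBASE_CONF_DIR points at the job's generated configuration directory and the file uses the usual one-tag-per-line layout:

  # Pull hbase.zookeeper.quorum out of the generated hbase-site.xml
  quorum=$(grep -A1 '<name>hbase.zookeeper.quorum</name>' \
               "${HBASE_CONF_DIR}/hbase-site.xml" \
           | grep '<value>' \
           | sed -e 's/.*<value>//' -e 's|</value>.*||')
  echo "hbase.zookeeper.quorum=${quorum}"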

Documentation?

Hi,

Since I cannot find how to ask this question anywhere else, I will do it here. Feel free to delete it if this is not the right place for it.

I am looking into the code and there is no documentation showing how users should use magpie. I do it like this: in the script-templates folder I use the Makefile for my cluster customizations. Then with magpie run I try to run sparkpi with the resulting configs. I am sure I am doing it wrong, since I am changing so many of the files myself that I have now ended up with scripts of my own.

Will there be any documentation for running and customizing Spark, Hadoop, and Yarn? I would like to use it so that I can have it as a "module add" for other users.

Thanks

Srun error on Cray system

I am trying to run magpie on a Cray XC40 under Slurm 14.03.7.
I did the installation using the misc/magpie-apache-download-and-setup.sh script.
Then I customized the magpie.sbatch-hadoop submission script. But when trying to sbatch -k customized.sh, the job ends with the following error:

srun: error: Unable to create job step: Requested node configuration is not available

There is no way to make the error disappear by modifying the #SBATCH special comments in the script.

The wisdom here is that you should never use srun on Cray and should instead use aprun, because the Cray version of Slurm is not well integrated with their machines. Unfortunately these commands are not equivalent (for example, changing the first srun --no-kill -W 0 to aprun -B makes the first script, magpie-check-inputs, complain about missing SLURM_* variables).

Before trying to hack magpie scripts, I would ask:

  1. Is this a known error? If yes, is there any workaround or debugging tip (besides adding -v four times to srun)?
  2. Have you run magpie on any Cray machine?
  3. On which slurm version have you tested magpie?
  4. If it is not much trouble, could I have a working submission script for slurm and the log generated adding -v -v -v -v to the first srun command?

Thanks for helping!
mario

Ing. Mario Valle
Swiss National Supercomputing Centre (CSCS) | http://mariovalle.name/
v. Trevano 131, 6900 Lugano, Switzerland | Tel: +41 (91) 610.82.60

Support some sort of Monitoring Software

It would be great if we added something like Ambari so that we could check the status of our nodes. In one case, all of my HBase region servers went down and my Spark job just hung around waiting for them to come back up (which they never did, due to an unrelated issue). I had no idea that this had occurred. It would be nice to be able to see what is happening all in one place.

I'm not sure what else is out there other than Ambari at this point.

develop mechanism to access HDFS over Lustre/networkfs w/o launching a job

I believe that through a series of scripting tricks this would be doable. Hypothetically, launch X hdfs datanode daemons as processes and launch a namenode process on the same node. Configure them to use appropriate paths in Lustre/networkfs as their "local drive". Make sure each has separate ports so they can communicate with each other on the same node.
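
A rough sketch of that idea, assuming Hadoop 3.x (hdfs --daemon start), one configuration directory per daemon instance, and illustrative paths. Each datanode conf dir would set distinct dfs.datanode.address/http.address/ipc.address ports, its own dfs.datanode.data.dir under Lustre, and its own HADOOP_PID_DIR/HADOOP_LOG_DIR:

  # Namenode whose dfs.namenode.name.dir points at Lustre
  HADOOP_CONF_DIR=/lustre/${USER}/standalone-conf/nn hdfs --daemon start namenode

  # Several datanode processes on the same node, one conf dir each
  for i in 0 1 2 3; do
      HADOOP_CONF_DIR=/lustre/${USER}/standalone-conf/dn${i} hdfs --daemon start datanode
  done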

Add script/mechanism for downloading packages and patching them

A script/mechanism to download the latest supported versions of Hadoop, Hbase, Zookeeper, etc. and apply patches would be useful for users setting up Magpie for the first time.

In addition, after the download, scripts could be pre-seeded with paths to appropriate locations for the projects.
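
A minimal sketch of what such a helper might do for a single package (the version, URL, and patch path are placeholders):

  #!/bin/bash
  # Download, unpack, and patch one package (illustrative values throughout)
  PACKAGE="hadoop-3.3.4"
  URL="https://archive.apache.org/dist/hadoop/common/${PACKAGE}/${PACKAGE}.tar.gz"
  INSTALL_DIR="${HOME}/bigdata"

  mkdir -p "${INSTALL_DIR}" && cd "${INSTALL_DIR}"
  wget -q "${URL}"
  tar -xzf "${PACKAGE}.tar.gz"

  # Apply any Magpie-provided patches for this version (path is a placeholder)
  (cd "${PACKAGE}" && patch -p1 < /path/to/magpie/patches/hadoop/${PACKAGE}.patch)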

Enhancement proposal: making timeout behavior more dynamic

While testing, I found that when the setup time was shorter than expected, the job was unable to use all the walltime available because of the shutdown timeout.

Quickly, I see two ways to fix this.
A - Make MAGPIE_STARTUP_TIME dynamic instead of a user-set fixed variable. For Moab, we could use the walltime reported by checkjob as the startup time in the Magpie_wait_script function. Something along these lines:

  # Parse HH:MM:SS from checkjob and convert to whole minutes, rounding up
  walltime=$(checkjob ${MOAB_JOBID} | grep -Po '(?<=WallTime:  \s).*' | cut -d' ' -f1)
  startuptime=$((10#${walltime:0:2}*60 + 10#${walltime:3:2} + (10#${walltime:6:2} > 0)))
  scriptsleepamounttemp=`expr ${MAGPIE_TIMELIMIT_MINUTES} - ${startuptime}`

B - Replace Magpie_wait_script with a signal-catching mechanism. Moab can be told to send a pre-termination signal a desired amount of time before the job's wall clock limit expires, for example:

-l signal=SIGHUP@5:00

This signal could be caught with the bash trap command and the script terminated cleanly.

Both solutions are specific to Moab for now, but I think the mechanisms used are provided by most schedulers.

If I had to choose, I would opt for solution B, which I find more elegant as it removes the need for a timeout stopwatch in Magpie.
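
A minimal sketch of what solution B could look like inside the batch script, assuming the -l signal=SIGHUP@5:00 request above and a placeholder function for the actual work:

  cleanup() {
      echo "Caught pre-termination signal; tearing down daemons"
      # ... invoke Magpie's existing teardown logic here ...
      exit 0
  }
  trap cleanup SIGHUP

  # Run the main work in the background and wait on it, so the trap can fire
  # as soon as the scheduler delivers the signal
  run_big_data_job &    # placeholder for the real job
  wait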

Output clearer error message on hdfs over lustre issues

A minimally easy idea: store the Hadoop version in the path, and on subsequent runs give the user a better error message about what they have to do (upgrade HDFS, etc.). Is it possible to read the HDFS version from files within the path? Maybe, or get the version via hadoop commands and store it.

Similarly, store the number of datanodes in the path, and on subsequent runs give a better error message if the user runs with a lower number of nodes.
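
One possible shape for the "store it in the path" idea, as a sketch; the marker filename and node-count variable are made up, and the other variables assume an HDFS-over-Lustre style setup:

  # On first setup, record what this path was formatted with
  marker="${HADOOP_HDFSOVERLUSTRE_PATH}/magpie.hdfs-setup-info"
  if [ ! -f "${marker}" ]; then
      echo "hadoop_version=${HADOOP_VERSION}" > "${marker}"
      echo "datanode_count=${node_count}" >> "${marker}"
  else
      # On later runs, compare and give a clear error instead of a cryptic failure
      grep -q "hadoop_version=${HADOOP_VERSION}" "${marker}" \
          || echo "Error: HDFS here was set up with a different Hadoop version;" \
                  "upgrade HDFS or point the path elsewhere" >&2
  fi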

MAGPIE_PRE_JOB_RUN environment

What would be the best way to source the PRE_JOB_RUN script? It appears it runs within its own environment from magpie-pre-run. For instance, in my pre-run script I need to source a few scripts that set up my environment, including MPI.

I would like to just change that line to source ${MAGPIE_PRE_JOB_RUN}, but I feel like that is not the best way to do this.
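
For reference, the distinction matters because exports and module loads only persist in the caller when the script is sourced rather than executed as a child process. A sketch with an illustrative pre-job script:

  # my-pre-job.sh (illustrative) contains lines like:
  #   module load gcc openmpi
  #   export MPI_HOME=/path/to/mpi

  # Executed as a child process, its exports vanish when it exits:
  ${MAGPIE_PRE_JOB_RUN}

  # Sourced, its exports and module changes remain in the current shell:
  source ${MAGPIE_PRE_JOB_RUN}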

Set java.io.tmpdir appropriately

so scratch data is sent to the LOCAL_DIR directories and not /tmp by default

This especially affects the no-local-dir case, because various things assume they can dump into /tmp.
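
A sketch of the kind of settings involved; the local scratch path shown is illustrative, not Magpie's actual variable:

  # Point JVM temporary files at the job's local scratch instead of /tmp
  export HADOOP_OPTS="-Djava.io.tmpdir=/path/to/local/scratch/tmp ${HADOOP_OPTS}"

  # Spark equivalent via extra Java options in spark-defaults.conf:
  #   spark.driver.extraJavaOptions   -Djava.io.tmpdir=/path/to/local/scratch/tmp
  #   spark.executor.extraJavaOptions -Djava.io.tmpdir=/path/to/local/scratch/tmp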

option to run daemons on separate nodes

Consider an option to run daemons on separate nodes?

i.e. hbase daemons on 4 nodes, spark daemons on 4 different nodes

Would we want to split masters onto different nodes too?
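
Under Slurm, the split itself is straightforward to compute; a sketch using the 4/4 example above (scontrol show hostnames expands the allocation's node list):

  # Split the allocation into two groups of 4 nodes (example sizes)
  nodes=( $(scontrol show hostnames ${SLURM_JOB_NODELIST}) )
  hbase_nodes=( "${nodes[@]:0:4}" )
  spark_nodes=( "${nodes[@]:4:4}" )
  echo "hbase daemons on: ${hbase_nodes[*]}"
  echo "spark daemons on: ${spark_nodes[*]}"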

hbase tar.gz path has been archived

My workaround was to apply the patch below; however, I think the best idea would be to use a newer hbase, so I didn't submit the patch as a pull request.

diff --git a/scripts/misc/magpie-apache-download-and-setup.sh b/scripts/misc/magpie-apache-download-and-setup.sh
index 2b4ec9a..aa6a5a1 100755
--- a/scripts/misc/magpie-apache-download-and-setup.sh
+++ b/scripts/misc/magpie-apache-download-and-setup.sh
@@ -109,9 +109,13 @@ fi

 if [ "${HBASE_DOWNLOAD}" == "Y" ]
 then
-    APACHE_DOWNLOAD_HBASE="${APACHE_DOWNLOAD_BASE}/${HBASE_PACKAGE}"
-
-    HBASE_DOWNLOAD_URL=`wget -q -O - ${APACHE_DOWNLOAD_HBASE} | grep "${HBASE_PACKAGE}" | head -n 1 | grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'`
+         # Package has been archived
+   if [ "${HBASE_PACKAGE}" == "hbase/hbase-0.98.9/hbase-0.98.9-hadoop2-bin.tar.gz" ]; then
+       HBASE_DOWNLOAD_URL="http://archive.apache.org/dist/${HBASE_PACKAGE}"
+   else
+       APACHE_DOWNLOAD_HBASE="${APACHE_DOWNLOAD_BASE}/${HBASE_PACKAGE}"
+       HBASE_DOWNLOAD_URL=`wget -q -O - ${APACHE_DOWNLOAD_HBASE} | grep "${HBASE_PACKAGE}" | head -n 1 | grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'`
+    fi

     echo "Downloading from ${HBASE_DOWNLOAD_URL}"
