
Seismic Hadoop

Introduction

Seismic Hadoop combines Seismic Unix with Cloudera's Distribution including Apache Hadoop to make it easy to execute common seismic data processing tasks on a Hadoop cluster.

Build and Installation

You will need to install Seismic Unix on both your client machine and the servers in your Hadoop cluster.

To create the jar file that coordinates job execution, run mvn package.

This will create a seismic-0.1.0-job.jar file in the target/ directory, which includes all of the necessary dependencies for running a Seismic Unix job on a Hadoop cluster.
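For example, from the project root (assuming Maven and a JDK are already installed):

mvn package
ls target/seismic-0.1.0-job.jar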

Running Seismic Hadoop

The suhdp script in the bin/ directory may be used as a shortcut for running the commands described below. It requires that the HADOOP_HOME environment variable be set on the client machine.
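For example, you might set the following on the client machine before invoking suhdp (the paths below are illustrative and depend on your installation):

export HADOOP_HOME=/usr/lib/hadoop   # Hadoop installation used by the suhdp script
export CWPROOT=/usr/local/su         # Seismic Unix installation; used when loading SEG-Y and for X Windows commands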

Writing SEG-Y or SU data files to the Hadoop Cluster

The load command to suhdp will take SEG-Y or SU formatted files on the local machine, format them for use with Hadoop, and copy them to the Hadoop cluster.

suhdp load -input <local SEG-Y/SU files> -output <HDFS target> [-cwproot <path>]

The cwproot argument only needs to be specified if the CWPROOT environment variable is not set on the client machine. Seismic Hadoop will use the segyread command to parse a local file unless it ends with ".su".
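For example, loading a local SEG-Y file might look like the following (the file names are hypothetical):

suhdp load -input /data/shots.segy -output shots.su -cwproot /usr/local/su

Because the input file does not end in ".su", segyread is used to parse it before the traces are written to HDFS.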

Reading SU data files from the Hadoop Cluster

The unload command will read Hadoop-formatted data files from the Hadoop cluster and write them to the local machine.

suhdp unload -input <SU file/directory of files on HDFS> -output <local file to write>
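For example, to copy a processed dataset from HDFS back to the local machine (the paths are hypothetical):

suhdp unload -input sorted.su -output /tmp/sorted_local.su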

Running SU Commands on data in the Hadoop cluster

The run command will execute a series of Seismic Unix commands on data stored in HDFS by converting the commands to a series of MapReduce jobs.

suhdp run -command "seismic | unix | commands" -input <HDFS input path> -output <HDFS output path> \
    -cwproot <path to SU on the cluster machines>

For example, we might run:

suhdp run -command "sufilter f=10,20,30,40 | suchw key1=gx,cdp key2=offset,gx key3=sx,sx b=1,1 c=1,1 d=1,2 | susort cdp gx" \
    -input aniso.su -output sorted.su -cwproot /usr/local/su
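For reference, a sketch of the equivalent single-machine Seismic Unix pipeline (run without Hadoop, assuming $CWPROOT/bin is on the PATH) would be roughly:

sufilter f=10,20,30,40 < aniso.su | \
    suchw key1=gx,cdp key2=offset,gx key3=sx,sx b=1,1 c=1,1 d=1,2 | \
    susort cdp gx > sorted.su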

For this example, Seismic Hadoop runs a MapReduce job that applies the sufilter and suchw commands to each trace during the Map phase, sorts the data by the CDP field in the trace header during the Shuffle phase, and performs a secondary sort on the receiver locations within each CDP gather in the Reduce phase. There are a few things to note about running SU commands on the cluster:

  1. Most SU commands are run as-is by the system. The most notable exception is susort, which is performed by the framework but is designed to be compatible with the standard susort command.
  2. If the last SU command in the command argument is an X Windows command (e.g., suximage, suxwigb), the system will stream the results of the pipeline to the client machine, where the X Windows command is executed locally (see the example after this list). Make sure that the CWPROOT environment variable is set on the client machine in order to support this option.
  3. Certain commands that are not trace parallel (e.g., suop2) will not work correctly on Seismic Hadoop. Also, commands that take additional input files will not work properly because the system will not copy those input files to the jobs running on the cluster. We plan to fix this limitation soon.
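For example, ending the pipeline with suximage streams the processed traces back to the client machine, where the viewer is launched locally (the parameters and paths here are illustrative):

suhdp run -command "sufilter f=10,20,30,40 | suximage perc=95" \
    -input aniso.su -output filtered.su -cwproot /usr/local/su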

seismichadoop's People

Contributors

jwills, mortenbpost


seismichadoop's Issues

Processed file not getting re-loaded

Hi,

I'm using an Apache Hadoop cluster + seismic-0.1.0-job.jar.

A SEG-Y file loads properly, and I'm able to perform seismic operations on it, e.g. a Hilbert transform + whitening.

When I unloaded this processed file to the local file system as AGC_Hilbert.segy (signifying that a Hilbert transform has been performed on it) and tried to reload it, I got an error:

/Loading processed file/
./suhdp load -input /home/hd/omkar/AGC_Hilbert.segy -output /sufiles/AGC_Hilbert.su /home/hd/seismicunix

Reading input file: /home/hd/omkar/AGC_Hilbert.segy
BIG_ENDIAN :BIG_ENDIAN

/home/hd/seismicunix/bin/segyread: format not SEGY standard (1, 2, 3, 5, or 8)
1+0 records in
6+1 records out
3200 bytes (3.2 kB) copied, 0.000103235 s, 31.0 MB/s
Bytes read: 0
Callback list size 1
Bytes written: 0
path: /sufiles/AGC_Hilbert.su
path: hdfs://172.25.38.87:9000/user/hd
path: hdfs://172.25.38.87:9000/user/hd
parent: /sufiles

I looked through the code and I think the issue exists because (not sure!) in SegyUnloader.java the part files produced by processing are written directly to a DataOutputStream for the local file WITHOUT enforcing big-endian byte order, unlike SUReader.java.

Thanks and regards !!!

Processing error

Hey,

I was trying to run some commands with seismichadoop and encountered the following error.
It seems there is some mismatch between classes.

Can you please help me resolve it?

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.crunch.hadoop.mapreduce.lib.output.CrunchMultipleOutputs.getNamedOutputsList(CrunchMultipleOutputs.java:210)
at org.apache.crunch.hadoop.mapreduce.lib.output.CrunchMultipleOutputs.checkNamedOutputName(CrunchMultipleOutputs.java:197)
at org.apache.crunch.hadoop.mapreduce.lib.output.CrunchMultipleOutputs.addNamedOutput(CrunchMultipleOutputs.java:256)
at org.apache.crunch.io.impl.FileTargetImpl.configureForMapReduce(FileTargetImpl.java:65)
at org.apache.crunch.io.impl.FileTargetImpl.configureForMapReduce(FileTargetImpl.java:50)
at org.apache.crunch.impl.mr.plan.MSCROutputHandler.configure(MSCROutputHandler.java:63)
at org.apache.crunch.io.impl.FileTargetImpl.accept(FileTargetImpl.java:71)
at org.apache.crunch.impl.mr.plan.MSCROutputHandler.configureNode(MSCROutputHandler.java:51)
at org.apache.crunch.impl.mr.plan.JobPrototype.build(JobPrototype.java:138)
at org.apache.crunch.impl.mr.plan.JobPrototype.getCrunchJob(JobPrototype.java:114)
at org.apache.crunch.impl.mr.plan.MSCRPlanner.plan(MSCRPlanner.java:111)
at org.apache.crunch.impl.mr.MRPipeline.plan(MRPipeline.java:144)
at org.apache.crunch.impl.mr.MRPipeline.run(MRPipeline.java:154)
at org.apache.crunch.impl.mr.MRPipeline.done(MRPipeline.java:183)
at com.cloudera.seismic.crunch.SUPipeline.run(SUPipeline.java:152)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.cloudera.seismic.segy.Main.main(Main.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Experiences with large segy files?

Hi,

I'm playing around with this again, and this time I've brought some larger files, more specifically SEG-Y files > 100 GB each. When loading these files using suhdp load it's insanely slow. I'm getting write speeds to HDFS of around 500-600 KB/s. I've tried on vanilla Hadoop, CDH, and also on a CDH cluster of xlarge instances on EC2 with the SEG-Y located locally on the node loading it.

Anyone with similar experience?

Thanks! :)

Log4J vulnerability

Hi all,
A vulnerability (CVE-2021-44228) was announced by the Apache Foundation in their Log4J2 (a.k.a. Log4J v2) logging library. This library is a common way to write log files in Java applications.
Does the Seissee software use Log4J, and is it affected by CVE-2021-44228?
If yes, which versions are impacted?
Does this software require internet access to function?
Regards,
