ngageoint / mrgeo Goto Github PK

MrGeo is a geospatial toolkit designed to provide raster-based geospatial capabilities that can be performed at scale. MrGeo is built upon Apache Spark and the Hadoop ecosystem to leverage the storage and processing of hundreds of commodity computers. See the wiki for more details.

Home Page: https://github.com/ngageoint/mrgeo/wiki

License: Apache License 2.0

Shell 1.82% Java 74.47% Python 3.93% Scala 18.77% FreeMarker 0.05% Jupyter Notebook 0.53% Scheme 0.44%

mrgeo's Introduction

	Apache
Version	2.6.0	2.7.1
Status

	CDH
Version	5.5.2	5.6.0	5.7.1
Status

	RPM	pyMrGeo
Version	1.1.0	0.0.7
Status

	Amazon EMR
Version	4.7.1	5.0.0
Status

##Origin

MrGeo was developed at the National Geospatial-Intelligence Agency (NGA) in collaboration with DigitalGlobe. The government has "unlimited rights" and is releasing this software to increase the impact of government investments by providing developers with the opportunity to take things in new directions. The software use, modification, and distribution rights are stipulated within the Apache 2.0 license.

###Pull Requests

All pull request contributions to this project will be released under the Apache 2.0 license.

Software source code previously released under an open source license and then modified by NGA staff is considered a "joint work" (see 17 USC 101); it is partially copyrighted, partially public domain, and as a whole is protected by the copyrights of the non-government authors and must be released according to the terms of the original open source license.

###MrGeo in Action See YouTube explainer

###MrGeo in the News NGA Press Release

DigitalGlobe Press Release

MrGeo got a mention in an article on NGA Opensourcing from Reuters

###MrGeo Overview

MrGeo (pronounced "Mister Geo") is an open source geospatial toolkit designed to provide raster-based geospatial processing capabilities performed at scale. MrGeo enables global geospatial big data image processing and analytics.

MrGeo is built upon the Apache Spark distributed processing frarmework to leverage the storage and processing of 100’s of commodity computers. Functionally, MrGeo stores large raster datasets as a collection of individual tiles stored in Hadoop to enable large-scale data and analytic services. The co-location of data and analytics offers the advantage of minimizing the movement of data in favor of bringing the computation to the data; a more favorable compute method for Geospatial Big Data. This framework has enabled the servicing of terabyte scale raster databases and performed terrain analytics on databases exceeding 100’s of gigabytes in size.

MrGeo has been fully deployed and tested in Amazon EMR.

See Wiki for detailed documentation

Unique features/solutions of MrGeo:

Scalable storage and processing of raster data
Application ready data: data is stored in MrGeo in a format that is ready for computation, eliminating several data pre-processing steps from production workflows.
A suite of robust Spark analytics that that include algebraic math operations, focal operations (e.g. slope and gaussian)
A third generation data storage model that
- Maintains data locality via spatial indexing.
- An abstraction layer between the analytics and storage methods to enables a diverse set of cloud storage options such as HDFS, Accumulo, HBASE etc.
A Map algebra interface that enables the development of custom algorithms in a simple scripting API
A plugin architecture that facilitates a modular software development and deployment strategies
Data and Analytic capabilities provisioned by OGC and REST service end points

Exemplar MrGeo Use Cases:

Raster Storage and Provisioning: MrGeo has been used to store, index, tile, and pyramid multi-terabyte scale image databases. Once stored, this data is made available through a simple Tiled Map Services (TMS) and Web Mapping Services (WMS) and can be made available through GeoServer via a MrGeo plugin.
Large Scale Batch Processing and Serving: MrGeo has been used to pre-compute global 1 ArcSecond (nominally 30 meters) elevation data (300+ GB) into derivative raster products : slope, aspect, relative elevation, terrain shaded relief (collectively terabytes in size), and Tobler and Pingel friction surfaces
Global Computation of Cost Distance: Given all pub locations in OpenStreetMap, compute 2 hour drive times from each location. The full resolution is 1 ArcSecond (30 meters nominally)

mrgeo's People

Contributors

Stargazers

Watchers

Forkers

lowtalker jwsy klavigne gijs carecoalition meiyoumingzia abenrob 14mmm pliguori willtemperley techchrj brendancol dlindenbaum gitter-badger vsingh58 derekmburgess ttislerdg rabbaby djohnson729 lmvenegas aashish24 sky727 bradh johnjo55 alant4ng giserh akarmas tulika-chaterjee feihugis dizzykc apsaltis ericwood73 geoslegend damianblanck xiaozan-pku mbrukman theolivenbaum jdunbar921 wsf1990 arthurmartinez dvntucker kaydoh granturing tspannhw dsrebro 13903596952 nefuniepei grseb9s samiazmi westybsa hufh vaquarkhan soxueren borishouenou gisdevelope cherylhughey supriyatambe17 williamzcy haytastan zhhongsh zhongshuiyuan peter-mcclonski mdmubarak

mrgeo's Issues

Accumulo Data Provider Exclude Tables List

For the Accumulo Data Provider, there needs to be a way to make sure that some tables inside Accumulo can be excluded from any type of interaction. This is true for the accumulo.root and accumulo.metadata tables.

Geowave data provider scanning all features to compute bounds

It seems that the Geowave data provider is scanning all features when obtaining the source bounds for the vector metadata. There is code in place to read the bounding box from Geowave's metadata, but it doesn't seem to work anymore.

Add query support to GeoWave data provider

In map algebra, when a GeoWave data source is used, allow a query to be included to filter the data that GeoWave reads. The GeoWave api allows the use of Geotools CQL syntax for querying data. Modify the map algebra syntax so that a CQL query can be specified.

For example, assuming that the mrgeo.conf has a GeoWave data source defined named "geowave", and vector data named "roads" has been ingested into GeoWave, filter "roads" to a BBOX AOI and rasterize them to a MrsPyramid:

aoiRoads = [geowave:roads; BBOX(the_geom, -120.0, 20.0, -110.0, 30.0)];
aoiRoadsRaster = RasterizeVector(aoiRoads, "MASK", "12z");

Update data provider for Spark setup

When running a Spark job in cluster mode, the "driver" code runs on a slave machine rather than the master node (where MrGeo is installed). As a result, mrgeo.conf is not available when the "driver" code executes. We need to do at least the following:

Set the values from mrgeo.conf into the Spark Configuration before submitting the job. These settings should read from mrgeo.conf, have their key names prefixed to avoid name collisions, and set them into the Spark configuration.
Change the data provider interface to include a setupConfiguration in addition to the setupJob method. The reason for this is that setupJob cannot be called for Spark jobs because there is no Hadoop Job object to pass in, and we don't need to setup the InputFormat class in there like setupJob does. We only need to set configuration values. In Spark, the InputFormat class is specified in the SparkContext.newAPIHadoopRDD method.
Push the partitioning logic down into the HDFS data provider since Accumulo has no need for that functionality. The use of a partitioner needs to be exposed through the data provider interface.

Build documentation.

Building, rebuilding, cleaning, and getting ready for development all have to be documented. These are just simple elements for documentation but they will help people use our MrGeo capabilities.

Initial code upload to new repository

We need to populate the initial code repository with MrGeo.

Investigate data quality issues with ingested elevation data

There are large gaps in the ingested Aster GDEM above a certain latitude. I'll look into whether that is in the raw source data on S3 or the result of a bug during ingest/build pyramids.

It would be great to have an HBase implementation for Tile Storage

attach-effective-pom fails with mrgeo-mapalgebra

I'm having some issues building the mapalgebra modules. I'm using Maven 3.0.5 as this is the only version I can get to work with the pomtools-maven-plugin (newer versions complain of a missing class).

I get the following error after all the child mapalgebra modules build, and I can't figure out why it's happening. I wonder if someone could give me some pointers here please? Thanks.

Failed to execute goal org.codehaus.mojo:build-helper-maven-plugin:1.9.1:attach-artifact (attach-effective-pom) on project mrgeo: Execution attach-effective-pom of goal org.codehaus.mojo:build-helper-maven-plugin:1.9.1:attach-artifact failed: For artifact {org.mrgeo:mrgeo:0.5.0-cdh5.1.0-SNAPSHOT:pom}: An attached artifact must have a different ID than its corresponding main artifact

MrGEO does not build when using latest version of Maven (3.3.1)

Could not get the source tree to build using Maven 3.3.1 without failing. Had to roll my maven version back to 3.0.5 to get it to build the jars properly.

Generate base data layers - SRTM

Generate the following layers globally for SRTM

Slope (rad)
Aspect (rad)
Hillshade
Tobler
Water bodies

Generate base data layers - ASTER

Generate the following layers globally for ASTER

Slope (rad)
Aspect (rad)
Hillshade
Tobler
Water bodies

Provide details on which web servers mrgeo will run on

It would be great if someone could provide details of which webservers mrgeo can be run on. I've tried a few and I'm not having much luck. With JBOSS AS 7 and 8 I get "Failed to start service jboss.module.service."deployment.mrgeo-0.5.0-cdh5.1.3-SNAPSHOT-cdh5.1.3.war".main" and with JBOSS AS6 I get issues with Java 7 compatibility.

If someone could give me some pointers I would be happy to write a wiki page on the WAR deployment process and requirements.

geotiff.geotiff-jai version 0.0 Not Found

the file geotiff-jai-0.0.jar does not appear to be available in any public repository.

[ERROR] Failed to execute goal on project mrgeo-core: Could not resolve dependencies for project org.mrgeo:mrgeo-core:jar:0.5.0-SNAPSHOT: Could not find artifact geotiff:geotiff-jai:jar:0.0 in

getSplits in HdfsMrsImagePyramidInputFormat is very slow

The getSplits method opens up each index in each partition to find the tile id where that partition starts. This is extremely slow when using an image stored in S3.

One approach to solving this problem is store the start and end tile id's in the splits file instead of just the end tile id. However, the code should be smart enough to still work correctly with older splits files that only contain the end tile id (even though it would obviously be slower). Maybe we should consider writing a new command that converts older splits files to the newer format. At the same time we make this change, we should consider converting the splits files to text rather than base64 encoded. This would make it easier to diagnose problems with splits if needed.

Make the new Spark code that saves images work with older versions of Spark

Some of the API's used in the Spark code that saves the RDD as a MrsPyramid don't exist in older versions of Spark. Modify this code to use reflection to find the newer API methods so that it can continue to compile and run against older Spark versions.

Slow processing at end of ingest to S3

When ingesting an image within an EMR cluster and outputting the MrsPyramid to S3, there is slow processing at the end that seems to be related to listing files, possibly as part of handling splits files.

org.geotools version 10-DG not found

the required artifacts under org.geotools version 10.0-DG do not appear to be available in any public repository

[ERROR] Failed to execute goal on project mrgeo: Could not resolve dependencies for project org.mrgeo:mrgeo:pom:0.5.0-SNAPSHOT: The following artifacts could not be resolved: org.geotools:gt-epsg-hsql:jar:10.0-DG, org.geotools:gt-epsg-wkt:jar:10.0-DG, org.geotools:gt-referencing:jar:10.0-DG, org.geotools:gt-imageio-ext-gdal:jar:10.0-DG, org.geotools:gt-image:jar:10.0-DG, org.geotools:gt-geotiff:jar:10.0-DG: Could not find artifact org.geotools:gt-epsg-hsql:jar:10.0-DG in...

GeoWave as a data source for MrGeo

Implement a vector data provider for GeoWave to allow MrGeo to make use of vector data stored in GeoWave. Initially, we should support reading data out of GeoWave and being able to specify a GeoWave data source as an input format for a map/reduce job in MrGeo. An initial use case should be the ability to run a RasterizeVector operation against a GeoWave vector source.

Use band-specific no data values of source image during ingest and build pyramid

From a review of the ingest code, if the user does not specify a no data override value on the command line, we use the no data value of band 0 for all of the bands. We should consider changing this to use the actual no data value for each individual band separately.

Similarly, when building the pyramid, we use the no data value for band 0 as the no data value for all of the bands. We should consider using the no data value for each individual band.

Unable to ingest Aster from HDFS

Multiple attempts to ingest aster from HDFS have resulted in "Invocation target error". I've tried the ingest with a memoryintensive.multiplier value of 2,8,12, and 32. The raw aster was copied to hdfs using distcp and I verified that all data copied over correctly.

The error happens seemingly randomly, sometimes failing after 15min, sometimes not for an hour. The mrgeo command is:

mrgeo ingest -nd -9999 -sk -sp -z 12 -o /mrgeo/images/aster-hdfs-30m /mrgeo/raw-images/aster-30m

Ingest from S3 runs successfully with a multiplier of 12.

Resolve dependency on memory intensive multiplier

Need a better solution for setting the memory intensive multiplier in mrgeo.conf. The process now is inefficient in that it requires adjusting the setting until Ingest and/or BuildPyramid will run. These processes do not fail quickly, often running for over an hour before crashing. The setting varies by cluster type/size and input data size/format.

A first step could be adding a mrgeo command to set various conf variables so that it does not have to be done manually. Eventually an automatically assigned value based on the cluster configuration would be preferable.

Add command for launching MrGeo web services in an embedded web server

Look into starting an embedded instance of jetty from the command line to run the MrGeo web services.

Documentation on MrGeo conventions

There are a number of conventions used inside MrGeo that we need to document for this project. For example, how we label tiles and what is expected when passing around the TileId.

Update cost distance to work with CDH5 and YARN

The current version of MrGeo includes a cost distance operation based on the Giraph graph processing API. Issues with that approach are:

Problems that arose when porting cost distance for use in CDH5 and YARN
Giraph required that the entire graph be loaded in memory across the cluster at once, thus making cluster requirements for large-scale cost distance computation prohibitive

To resolve these issues, look into re-implementing the algorithm based on a distributed computing architecture like Apache Spark. The GraphX API included in Apache Spark does not require the entire graph to be memory-based, thus improving scalability, and Spark jobs can run under YARN and coexist nicely with other jobs.

An additional benefit of this effort is having a reusable approach for submitting Spark jobs from MrGeo so that we can take advantage of the simplifying distributed programming paradigm offered by Spark beyond just map/reduce.

Accumulo Data Provider Create Tables

When using the Accumulo Data Provider as the primary data provider, the provider needs to be able to create tables when jobs are run. This will enable faster work flows. Right now, the provider has to have the table created before it will work with it.

Resolve dependency on input dataset tile scheme

Once the multiplier in mrgeo.conf is set correctly, ingest works perfectly for aster and srtm tiled datasources. However, I've not been able to ingest any other data sources including global landcover and tiled geotiffs of srtm water bodies:

Single 300m, ~300mb globcover raster
Tiled 300m globcover raster
Tiled, resampled 30m globcover raster
Tried the above with various multiplier settings - 2,4,8,12,16,32

skip-all-tests profile doesn't work

It would look like creating a profile called "skip-all-tests" should help in skipping the tests at mvn build time.
Unfortunately it doesn't or at least I didn't find the right way to do it.
How I made it skipping the tests was actually setting 2 variables called mrgeo.core.test.skiptests and mrgeo.core.integration.skiptests to true.

Clean code to use MrGeoConstants

There are lots of places still not using MrGeoConstants. Clean these up to use it.

Accumulo Data Provider Build Pyramid world scale or export problem

Confirm that the Accumulo Data Provider can correctly build pyramids at world scale. This may be a problem with the export command when Accumulo is the source. The command:
mrgeo export -z 3 -s -o /home/andrew/mask83 mask8
May be a problem for the Accumulo Data Provider.

spark-yarn package doesn't exist in CDH repo

this is the definition for the cdh532 build:
<spark.version>1.2.0-cdh5.3.2</spark.version>

but inside the cloudera artifactory repository it doesn't exist.

The same applies for the cdh530 or cdh520 builds.

Slow processing at end of building pyramids for an image in S3

After building pyramids for an image stored in S3, the cleanup processing that occurs after the job is complete is very slow because it calls FileSystem.listStatus multiple times to cleanup the _SUCCESS files that Hadoop creates.

From a quick scan online, it seems that we should be able to set
mapreduce.fileoutputcommitter.marksuccessfuljobs=false
in the jjob configuration to prevent the creation of the _SUCCESS files. And then we can eliminate the call to HadoopFileUtils.cleanupDirectory at the end of build pyramids. The same should be done for other hadoop jobs.

Data Provider design and developer documentation

We need to show how the data provider interface is developed inside MrGeo. Additionally, we need to show how a data provider is recognized inside MrGeo. This will help the next developer build an interface.

Need delimited text vector data provider

Need to develop a vector data provider for delimited text. At the same time, investigate whether it makes sense to re-factor existing code to use the vector data provider now.

Convert WMS servlet to jersey

The WMS servlet is not currently jersey-based, so the URI to make WMS requests differs from the URI to make TMS requests. Convert the WMS servlet to jersey so it shares the same context root as the TMS servlet.

Additionally, the WMS servlet is not wired into the stand-alone web server available from the command line. Ensure that once it's jersey-based, it is available from the stand-alone web server.

Accumulo data provider lazy table list

Consider not searching the list of Accumulo tables when the data provider is initialized (to look for available images), but rather do that only when needed (like if listImages is called).

ImageStats collection during MapAlgebra

I have been recently running the crop map algebra command and come upon and error inside the mrgeo-core/src/main/java/org/mrgeo/mapreduce/OpChainMapper.java with the statistics built up.

The problem is around line 171 and 172 where the calls are to ImageStats.

I got this one first:

java.lang.ArrayIndexOutOfBoundsException: 1
at org.mrgeo.image.ImageStats.computeAndUpdateStats(ImageStats.java:152)
at org.mrgeo.image.ImageStats.computeStats(ImageStats.java:173)
at org.mrgeo.mapreduce.OpChainMapper.map(OpChainMapper.java:171)
at org.mrgeo.mapreduce.OpChainMapper.map(OpChainMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)

Then as I captured the ArrayIndexOutOfBoundsException from the first one, the second one came along:

java.lang.ArrayIndexOutOfBoundsException: 1
at org.mrgeo.image.ImageStats.aggregateStats(ImageStats.java:97)
at org.mrgeo.mapreduce.OpChainMapper.map(OpChainMapper.java:172)
at org.mrgeo.mapreduce.OpChainMapper.map(OpChainMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)

It looks like that array of "new double[] { noData }" is probably the culprit. I go around this to do crop by catching exceptions. But, I wanted to check with you about this to see if there is something wrong. The OpChainMapper has not changed since the initial check-in.

The data I was using was the Kathmandu image. Using crop with the Paris image does not start a mapreduce job.

Tile boundary artifacts after ingest of SRTM data

Seeing missing data for every tile boundary in the ingest, screenshots below. To reproduce:

mrgeo ingest -o s3://mrgeo/images/srtm-elevation -sk -sp -z 10 -v s3://mrgeo-source/srtm-90
mrgeo buildpyramid s3://mrgeo/images/srtm-elevation

500 m3.xlarge
Ingest - 9:30.51elapsed
BuildPyramid - 11:16.55elapsed

404 on Maven Plugin

There is also a custom Maven plugin the team built to make discovering jars easier. Download it from our git wiki and manually install it to your local repo as follows (making sure you are not in the MrGeo source directory):

There is no file there.

Add PNG output to export command

Also allow a color scale to be specified on the command line

ingest accepts image but then fails with ExitCodeException exitCode=1: at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)

Using Cloudera QuickStart CDH 5.3 VirtualBox, I was able to build the project for CDH5.3 and yarn, but cannot ingest an image (about 450MB geotif). The ingest command says "accepted", but the YARN job fails:

[cloudera@quickstart mrgeo]$ mrgeo-cmd/src/main/scripts/mrgeo ingest River09Q100B.tif/River09Q100B.tif --output River09Q100B.pyramid
/home/cloudera/git/mrgeo/mrgeo-cmd/mrgeo-cmd-distribution/target:/home/cloudera/git/mrgeo/mrgeo-cmd/mrgeo-cmd-distribution/target/lib/*
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/cloudera/git/mrgeo/mrgeo-cmd/mrgeo-cmd-distribution/target/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
*** checking River09Q100B.tif/River09Q100B.tif accepted ***
15/03/24 05:27:34 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
15/03/24 05:31:29 ERROR ingest.IngestImage: IngestImage exited with error
[cloudera@quickstart mrgeo]$

The verbose mode has some additional info:

[cloudera@quickstart mrgeo]$ mrgeo-cmd/src/main/scripts/mrgeo ingest River09Q100B.tif/River09Q100B.tif --output River09Q100B.pyramid --verbose
/home/cloudera/git/mrgeo/mrgeo-cmd/mrgeo-cmd-distribution/target:/home/cloudera/git/mrgeo/mrgeo-cmd/mrgeo-cmd-distribution/target/lib/*
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/cloudera/git/mrgeo/mrgeo-cmd/mrgeo-cmd-distribution/target/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
*** checking River09Q100B.tif/River09Q100B.tif15/03/27 02:17:37 INFO geotools.GeotoolsRasterUtils: Loading missing epsg codes
15/03/27 02:17:41 INFO data.DataProviderFactory: Skipping image ingest data provider org.mrgeo.data.accumulo.ingest.AccumuloImageIngestDataProviderFactory because isValid returned false
15/03/27 02:17:41 INFO data.DataProviderFactory: Skipping mrs image data provider org.mrgeo.data.accumulo.image.AccumuloMrsImageDataProviderFactory because isValid returned false
15/03/27 02:17:42 WARN imageio.gdalframework: Failed to load the GDAL native libs. This is not a problem unless you need to use the GDAL plugins: they won't be enabled.
java.lang.UnsatisfiedLinkError: no gdaljni in java.library.path
15/03/27 02:21:00 INFO HSQLDB4C5A88DA50.ENGINE: dataFileCache open start
15/03/27 02:21:00 INFO HSQLDB4C5A88DA50.ENGINE: Checkpoint start
15/03/27 02:21:00 INFO HSQLDB4C5A88DA50.ENGINE: Checkpoint end
 accepted ***
15/03/27 02:24:25 INFO ingest.IngestImage: Ingest inputs (1)
15/03/27 02:24:25 INFO ingest.IngestImage:    River09Q100B.tif/River09Q100B.tif
15/03/27 02:24:26 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
15/03/27 02:24:26 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
15/03/27 02:24:29 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
15/03/27 02:24:30 INFO Configuration.deprecation: io.sort.spill.percent is deprecated. Instead, use mapreduce.map.sort.spill.percent
15/03/27 02:24:30 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
15/03/27 02:24:30 INFO format.IngestImageSplittingInputFormat: Spill size for splitting is: 79691776b
15/03/27 02:24:30 INFO format.IngestImageSplittingInputFormat:   reading: hdfs://quickstart.cloudera:8020/user/cloudera/River09Q100B.tif/River09Q100B.tif
15/03/27 02:24:30 INFO format.IngestImageSplittingInputFormat:     zoomlevel: 14
15/03/27 02:28:40 INFO mapreduce.JobSubmitter: number of splits:2450
15/03/27 02:28:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1427193016034_0007
15/03/27 02:28:41 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
15/03/27 02:28:41 INFO impl.YarnClientImpl: Submitted application application_1427193016034_0007
15/03/27 02:28:41 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1427193016034_0007/
15/03/27 02:28:41 INFO mapreduce.Job: Running job: job_1427193016034_0007
15/03/27 02:29:06 INFO mapreduce.Job: Job job_1427193016034_0007 running in uber mode : false
15/03/27 02:29:06 INFO mapreduce.Job:  map 0% reduce 0%
15/03/27 02:29:06 INFO mapreduce.Job: Job job_1427193016034_0007 failed with state FAILED due to: Application application_1427193016034_0007 failed 2 times due to AM Container for appattempt_1427193016034_0007_000002 exited with  exitCode: 1 due to: Exception from container-launch.
Container id: container_1427193016034_0007_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1
.Failing this attempt.. Failing the application.
15/03/27 02:29:06 INFO mapreduce.Job: Counters: 0
15/03/27 02:29:06 ERROR ingest.IngestImage: IngestImage exited with error

I am not sure if my build is not correct (I had to make many tweaks to get it done) or if there is something wrong with my Hadoop config (see "No job jar file set. " warning). Anybody there who managed to get this done with CDH 5.3 and YARN? As I said I am using QS VirtualBox with no extra settings, running as user cloudera with sudo rights, etc.

Accumulo Data Provider Bulk Ingest User Resolution

When producing files for bulk ingest, the Accumulo data provider fails when there is a different user running accumulo then the user running the job for ingest. This can be resolved by running the ingest with the accumulo user but it is better to make the job work correctly.

Image query capability

Add the ability to query both spatially and by attribute values with logical expressions for images within map algebra. The idea is query for a list of images matching some criteria, and then be able to write some analytic map algebra that operates on the resulting list.

Add color scales for base layers

Need basic color scales for global base layers as well as a few to cover friction/cost surfaces.

Some POM.xml files have execute permissions

Artifacts in output of slope operations

I'm seeing linear artifacts in the slope output computed from the global aster layer.

To reproduce:

Compute slope on s3://mrgeo/images/aster-30m in radians
mrgeo mapalgebra -e "result = slope([s3://mrgeo/images/aster-30m],\"rad\")" -o s3://mrgeo/images/slope-rad
Build Pyramids
Visualize with any color scale

Screenshots to follow

Accumulo Data Provider working with MapAlgebra

The Accumulo Data Provider needs to work with the MapAlgebra interface for MrGeo. Protection levels for data has to be preserved when using a single source (or allowed to go to a higher level). Also, the derived product must have the protection level of the sources used (or allowed to go to a higher level).