
azavea / raster-vision


An open source library and framework for deep learning on satellite and aerial imagery.

Home Page: https://docs.rastervision.io

License: Other

Shell 1.46% Python 96.73% Dockerfile 0.64% Jupyter Notebook 1.17%
deep-learning computer-vision remote-sensing geospatial object-detection semantic-segmentation classification machine-learning pytorch

raster-vision's Introduction

Raster Vision Logo  


Raster Vision is an open source Python library and framework for building computer vision models on satellite, aerial, and other large imagery sets (including oblique drone imagery).

It has built-in support for chip classification, object detection, and semantic segmentation with backends using PyTorch.

Examples of chip classification, object detection and semantic segmentation

As a library, Raster Vision provides a full suite of utilities for dealing with all aspects of a geospatial deep learning workflow: reading geo-referenced data, training models, making predictions, and writing out predictions in geo-referenced formats.

As a low-code framework, Raster Vision allows users (who don't need to be experts in deep learning!) to quickly and repeatably configure experiments that execute a machine learning pipeline including: analyzing training data, creating training chips, training models, creating predictions, evaluating models, and bundling the model files and configuration for easy deployment.

Overview of Raster Vision workflow

Raster Vision also has built-in support for running experiments in the cloud using AWS Batch.

See the documentation for more details.

Installation

For more details, see the Setup documentation.

Install via pip

You can install Raster Vision directly via pip.

pip install rastervision

Use Pre-built Docker Image

Alternatively, you may use a Docker image. Docker images are published to quay.io (see the tags tab).

We publish a new tag per merge into master, tagged with the first 7 characters of the commit hash. To use the most recent build, pull the image with the latest suffix, e.g. raster-vision:pytorch-latest. Git tags are also published, with the GitHub tag name as the Docker tag suffix.

Build Docker Image

You can also build a Docker image from scratch yourself. After cloning this repo, run docker/build, and then run the container using docker/run.

Usage Examples and Tutorials

Non-developers may find it easiest to use Raster Vision as a low-code framework where Raster Vision handles all the complexities and the user only has to configure a few parameters. The Quickstart guide is a good entry-point into this. More advanced examples can be found on the Examples page.

For developers and those looking to dive deeper or combine Raster Vision with their own code, the best starting point is Usage Overview, followed by Basic Concepts and Tutorials.

Contact and Support

You can ask questions and talk to developers (let us know what you're working on!) at:

Contributing

For more information, see the Contribution page.

We are happy to take contributions! It is best to get in touch with the maintainers about larger features or design changes before starting the work, as it will make the process of accepting changes smoother.

Everyone who contributes code to Raster Vision will be asked to sign the Azavea CLA, which is based on the Apache CLA.

  1. Download a copy of the Raster Vision Individual Contributor License Agreement or the Raster Vision Corporate Contributor License Agreement

  2. Print out the CLAs and sign them, or use PDF software that allows placement of a signature image.

  3. Send the CLAs to Azavea by one of:

  • Scanning and emailing the document to [email protected]
  • Faxing a copy to +1-215-925-2600.
  • Mailing a hardcopy to: Azavea, 990 Spring Garden Street, 5th Floor, Philadelphia, PA 19107 USA

Licenses

Raster Vision is licensed under the Apache 2 license. See license here.

3rd party licenses for all dependencies used by Raster Vision can be found here.

raster-vision's People

Contributors

adeelh, ameier3, ammarsdc, citerana, colekettler, dependabot[bot], dustymugs, echeipesh, giswqs, jamesmcclain, jeromemaleski, jisantuc, jmorrison1847, jpolchlo, lewfish, lmbak, lossyrob, mbertrand, mccalluc, mmcs-work, nholeman, notthatbreezy, nripeshn, perliedman, pomadchin, rbreslow, simonkassel, theoway, tnation14, uribo


raster-vision's Issues

TypeError: super() takes at least 1 argument (0 given)

Hello! I've been trying to deploy and run the code, but I've been running into this type error -

I'm currently trying to run the code from the VM using Python 2.7.11

Traceback (most recent call last):
  File "/opt/conda/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/opt/conda/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/src/rastervision/semseg/data/factory.py", line 72, in <module>
    factory = SemsegDataGeneratorFactory()
  File "/opt/src/rastervision/semseg/data/factory.py", line 16, in __init__
    super().__init__([POTSDAM, VAIHINGEN], [IMAGE, NUMPY])
TypeError: super() takes at least 1 argument (0 given)

From what I've found on Stack Overflow, it's a syntax difference between Python 2 and 3. I tried a workaround by altering the super().__init__ call, but it led to more complications down the road.
Does anybody know a solution to this? Should the script be running conda with Python 3?

Thanks!
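
The traceback indicates the code targets Python 3, where the zero-argument super() is valid; under Python 2 the class and instance must be passed explicitly. A minimal sketch of the two forms (the constants and base class below are stand-ins for the real ones in factory.py, not the project's actual code):

# Placeholders standing in for the real constants and base class in factory.py.
POTSDAM, VAIHINGEN, IMAGE, NUMPY = 'potsdam', 'vaihingen', 'image', 'numpy'

class DataGeneratorFactory(object):
    def __init__(self, datasets, formats):
        self.datasets = datasets
        self.formats = formats

class SemsegDataGeneratorFactory(DataGeneratorFactory):
    def __init__(self):
        # Python 3 only: super().__init__(...)
        # Works on both Python 2 and 3: pass the class and instance explicitly.
        super(SemsegDataGeneratorFactory, self).__init__(
            [POTSDAM, VAIHINGEN], [IMAGE, NUMPY])

That said, the simplest fix is likely to run the scripts under Python 3 rather than 2.7.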

Preprocessing problem: TIFF reading error

After running "rastervision.semseg.data.factory isprs/vaihingen all preprocess" for a while, the size of the files in "gts_for_participants" and "dsm" increased a lot. After reaching the 4 GB size limit of TIFF images, I got the following error:

rasterio._err.CPLE_AppDefined: TIFFReadDirectory:Failed to read directory at offset 4294655492

It seems to be a similar problem to:
raster-foundry/raster-foundry#209

But the solution there is not working for this issue.
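
One possible workaround is to write the preprocessed rasters as BigTIFF, which is not subject to the 4 GB limit of classic TIFF. A minimal sketch using rasterio (the paths are illustrative, not the project's actual preprocessing code):

import rasterio

# Copy a raster, asking GDAL's GTiff driver to write BigTIFF so the output
# is not constrained by the 4 GB classic-TIFF limit.
with rasterio.open('input.tif') as src:
    profile = src.profile
    data = src.read()

profile.update(BIGTIFF='YES')
with rasterio.open('output.tif', 'w', **profile) as dst:
    dst.write(data)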

Generate negative chips

The tiff_chipper.py script only generates chips that contain at least one object. We should make it so it attempts to generate some number of negative chips that contain no objects. I don't think this is typically needed, but it seems like it might help with the ships dataset, since the ships always have the same surroundings (sea or docks) and so the network never sees land.
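
A rough sketch of how negative chips could be sampled, assuming the image is a NumPy array and object boxes are (ymin, xmin, ymax, xmax) tuples (function and argument names are illustrative, not the existing tiff_chipper.py API):

import random

def boxes_intersect(a, b):
    ay0, ax0, ay1, ax1 = a
    by0, bx0, by1, bx1 = b
    return not (ax1 <= bx0 or bx1 <= ax0 or ay1 <= by0 or by1 <= ay0)

def sample_negative_chips(img, object_boxes, chip_size, n_chips, max_tries=1000):
    # Randomly place windows and keep only those overlapping no object box.
    h, w = img.shape[:2]
    chips = []
    tries = 0
    while len(chips) < n_chips and tries < max_tries:
        tries += 1
        y = random.randint(0, h - chip_size)
        x = random.randint(0, w - chip_size)
        window = (y, x, y + chip_size, x + chip_size)
        if not any(boxes_intersect(window, b) for b in object_boxes):
            chips.append(img[y:y + chip_size, x:x + chip_size])
    return chips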

Out of memory when restarting training process

In theory, if you run train_ec2.sh and exit before training completes, and then restart the job, it should pick up where it left off. But this doesn't actually work because on the second run, TF emits an out of memory error. We should isolate the exact conditions when this occurs and file an issue in the repo for TF Object Detection. We should also check to see if there's an issue already there.

Terminating AWS Batch jobs broken

If you terminate an AWS Batch job via the console or AWS CLI, it doesn't work. This happens when running the train_ec2.sh script. The only way to kill it is to kill the underlying spot instance. We think this is due to a bug in Batch, and should submit a bug report to AWS.

Cannot make train_ratio 1.0

To maximize performance, sometimes one would like to train a model using the entire development dataset, in other words, using a train_ratio of 1.0. This causes the program to crash. As a workaround, we have been using a train_ratio of 0.99.

FileNotFoundError: [Errno 2] No such file or directory: 'aws'

I am running the code locally on my machine

python3 -m rastervision.run experiments/semseg/4_20_17/fcn_0.json

but got an error saying

  File "/home/sizhexi/keras/raster-vision/src/rastervision/common/utils.py", line 145, in s3_sync
    call(['aws', 's3', 'sync', src_path, dst_path])
  File "/usr/lib/python3.4/subprocess.py", line 537, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/lib/python3.4/subprocess.py", line 859, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.4/subprocess.py", line 1457, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'aws'

how can I resolve this issue?
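
The FileNotFoundError means the aws executable is not on the PATH, so subprocess cannot launch it; installing the AWS CLI (for example with pip install awscli) or skipping S3 syncing for local runs should resolve it. A defensive sketch of the check (the function mirrors the s3_sync in the traceback, but the guard itself is illustrative):

import shutil
from subprocess import call

def s3_sync(src_path, dst_path):
    # shutil.which returns None when 'aws' is not on the PATH (Python 3.3+).
    if shutil.which('aws') is None:
        raise RuntimeError(
            'The AWS CLI was not found; install it (e.g. pip install awscli) '
            'or run without S3 syncing.')
    call(['aws', 's3', 'sync', src_path, dst_path])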

Prediction problem: eval_target_size

When the "eval_target_size" is set to [2000, 2000], the testing images are evaluated for the full resolution [6000, 6000], but only the results of the last tile [4001:6000, 4001:6000] are saved. Can somebody please fix it?

Not using best_model.h5

The train_model task saves the best model (according to the validation loss) as best_model.h5. However, this model is not used by tasks that run subsequently during the same invocation of the program. Instead, the model as trained by the final epoch is used by subsequent tasks, unless the program is run again, at which point the best_model.h5 file will be loaded.

Use same validation set each epoch

Currently the validation set is shuffled and randomly augmented, which adds to the variance of the validation loss each epoch. It makes more sense to use the same validation set for each epoch so that epochs are more directly comparable.
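
A minimal sketch of one way to do this: draw the split once with a fixed seed and reuse the same indices every epoch, with augmentation disabled for the validation generator (the sizes below are just examples):

import numpy as np

num_samples, num_val = 1000, 200  # example sizes
rng = np.random.RandomState(42)   # fixed seed -> same split every run
indices = rng.permutation(num_samples)
val_indices = indices[:num_val]    # identical validation set each epoch
train_indices = indices[num_val:]  # the rest can still be shuffled per epoch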

Experiment running problem

I am trying to test the code with a JSON file from 4_20_17. The variable "epochs" is defined within "train_stages" in this JSON, but rastervision/common/options.py uses the epochs variable without looking inside train_stages. So when I run "python -m rastervision.run experiments/....", it gives the following error:

root@0a7cd1dcc9ef:/opt/src# python -m rastervision.run experiments/semseg/4_20_17/fcn_0.json
Using TensorFlow backend.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.5/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/src/rastervision/run.py", line 47, in <module>
    run_tasks()
  File "/opt/src/rastervision/run.py", line 34, in run_tasks
    options = make_options(options_dict)
  File "/opt/src/rastervision/options.py", line 15, in make_options
    options = SemsegOptions(options_dict)
  File "/opt/src/rastervision/semseg/options.py", line 16, in __init__
    super().__init__(options)
  File "/opt/src/rastervision/common/options.py", line 18, in __init__
    self.epochs = options['epochs']
KeyError: 'epochs'

Can somebody please help to look at it?
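
The JSON files under 4_20_17 nest "epochs" inside "train_stages", while common/options.py expects a top-level "epochs" key. Either add a top-level value to the JSON, or make the options code fall back to the stages. A sketch of the latter (the fallback of summing per-stage epochs is an assumption about the intended semantics):

def get_epochs(options_dict):
    # Prefer a top-level 'epochs'; otherwise sum the values declared per stage.
    if 'epochs' in options_dict:
        return options_dict['epochs']
    stages = options_dict.get('train_stages', [])
    return sum(stage.get('epochs', 0) for stage in stages)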

./scripts/infra destroy does not terminate instances

After running ./scripts/infra destroy, the spot fleet and associated instances should be terminated. However, it only seems to terminate the spot fleet, leaving the instances running. The following error message is printed:

aws_spot_fleet_request.gpu_worker: Still destroying... (5m0s elapsed)
Error applying plan:

1 error(s) occurred:

* aws_spot_fleet_request.gpu_worker: fleet still has (1) running instances

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

Ensemble run error

I was trying to run an ensemble experiment and got the following error. "load_options" is missing. Can somebody please help?

Traceback (most recent call last):
  File "/opt/conda/lib/python3.5/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/src/rastervision/run.py", line 47, in <module>
    run_tasks()
  File "/opt/src/rastervision/run.py", line 37, in run_tasks
    runner.run_tasks(options, args.tasks)
  File "/opt/src/rastervision/common/run.py", line 67, in run_tasks
    self.run_path, self.options, self.generator, use_best=True)
  File "/opt/src/rastervision/common/models/factory.py", line 40, in get_model
    model = self.make_model(options, generator)
  File "/opt/src/rastervision/semseg/models/factory.py", line 88, in make_model
    models, active_input_inds_list = self.load_ensemble_models(options)
  File "/opt/src/rastervision/semseg/models/factory.py", line 28, in load_ensemble_models
    from ..options import load_options
ImportError: cannot import name 'load_options'

Validation tasks fail on ensemble_avg when using validation folds of different sizes

In an ensemble experiment that uses folds to get full coverage of the data set by using different validation sets, if the validation sets are not all of the same expected size then all the validation tasks will fail in the ensemble's aggregation job. This means that we cannot complete the following tasks: validation_probs, train_thresholds, train_predict, validation_predict or test_predict.

Failing unittest for test_predict compute_predictions2

Running ./scripts/test on develop yields this test failure:

----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/src/rastervision/tagging/tasks/test/test_predict.py", line 48, in test_compute_predictions2
    self.assertTrue(np.array_equal(y_pred, y))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 11 tests in 0.012s

FAILED (failures=1)

Switch to windowed reading in make_windows.py

This script generates a bunch of windows of a TIFF file by loading the whole image into memory and then slicing it. This won't work for very large images, so we should use rasterio's ability to do windowed reading.
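
A minimal sketch of windowed reading with rasterio, which reads one chip-sized window at a time instead of the whole image (the chip size and path are illustrative):

import rasterio
from rasterio.windows import Window

chip_size = 256
with rasterio.open('image.tif') as src:
    for row in range(0, src.height, chip_size):
        for col in range(0, src.width, chip_size):
            window = Window(col, row,
                            min(chip_size, src.width - col),
                            min(chip_size, src.height - row))
            # Only this window is read into memory.
            chip = src.read(window=window)
            # ... save the chip or hand it to the chipping logic ...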

Make predict.py script use >1 image per batch

If running on GPU, we can probably get a big speedup by feeding in > 1 image per batch when making predictions. However, I'm not convinced we'll want to use GPUs for prediction in batch mode considering how fast it runs on CPU and the overhead for booting up the instance.
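
A sketch of what batched prediction might look like, assuming a Keras-style model.predict and a list of same-sized window arrays (the batch size is illustrative):

import numpy as np

def predict_in_batches(model, windows, batch_size=16):
    preds = []
    for i in range(0, len(windows), batch_size):
        # One forward pass per batch of windows instead of one per window.
        batch = np.stack(windows[i:i + batch_size])
        preds.extend(model.predict(batch))
    return preds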

Jobs stuck in Runnable

We've noticed that sometimes jobs get stuck in a runnable state on Batch. I just logged into the instance for such a job and found that the ecs-agent is not running as it is supposed to. (See http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-introspection.html)

[ec2-user@ip-172-31-45-73 ecs]$ curl http://localhost:51678/v1/metadata
curl: (7) Failed to connect to localhost port 51678: Connection refused

I also looked at the ecs-agent log, which contains error messages which I don't currently understand.

[ec2-user@ip-172-31-45-73 ecs]$ pwd
/var/log/ecs
[ec2-user@ip-172-31-45-73 ecs]$ cat ecs-init.log.2017-07-06-20
2017-07-06T20:21:28Z [INFO] Network error connecting to docker, backing off for '1.14777941s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:29Z [INFO] Network error connecting to docker, backing off for '2.282153551s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:31Z [INFO] Network error connecting to docker, backing off for '4.466145821s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:36Z [INFO] Network error connecting to docker, backing off for '5.235010051s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:41Z [INFO] Network error connecting to docker, backing off for '5.287113937s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:46Z [ERROR] dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:46Z [INFO] Network error connecting to docker, backing off for '1.14777941s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:21:48Z [INFO] Network error connecting to docker, backing off for '2.282153551s', error: dial unix /var/run/docker.sock: connect: no such file or directory
2017-07-06T20:22:17Z [INFO] post-stop
2017-07-06T20:22:17Z [INFO] Cleaning up the credentials endpoint setup for Amazon EC2 Container Service Agent
2017-07-06T20:22:17Z [ERROR] Error performing action 'delete' for credentials proxy endpoint route: exit status 1; raw output: iptables: No chain/target/match by that name.

2017-07-06T20:22:17Z [ERROR] Error performing action 'delete' for credentials proxy endpoint route: exit status 1; raw output: iptables: No chain/target/match by that name.

pre-trained weights

Hi --

I am looking to do semantic segmentation (no need for tagging, detection, or object recognition at this stage) and I was wondering if anyone has the pre-trained weights available for download? It can be on any of the models, just want to test it out for now.

Thanks vm

Add inception/xception model

At the end of the Planet Kaggle competition, we found that adding Inception to the ensemble improved the score. Unfortunately, the code we were using was problematic because it doesn't assign a unique name to each layer, so we can't use model.load_weights with it. The automatically generated layer names aren't consistent each time you create a new model (the names contain a counter that is globally incremented), so they can't be used after the "best" model is loaded from disk once training finishes. We can fix this in a few ways: fix the underlying problem in Keras or report an issue, add unique layer names to the inception code, or use the Xception model built into Keras, which appears to be an improved version of Inception and has unique layer names.
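
If we go the Keras-builtin route, a minimal sketch of pulling in Xception as a backbone (the arguments are the standard keras.applications ones; the input shape is just an example):

from keras.applications import Xception

# include_top=False drops the ImageNet classification head so the network can
# be used as a backbone; every layer has a stable, unique name, so weights can
# be reloaded with model.load_weights after training.
model = Xception(include_top=False, weights='imagenet',
                 input_shape=(256, 256, 3))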

Run experiments in parallel on AWS

Currently, we can run experiments in parallel by spinning up some instances and then manually SSHing into each one, and running a command for each experiment. This doesn't scale well, so we would like to find a way of automating this. Some ideas include:

OpenAI has a blog post about their infrastructure setup which we should mine for ideas https://openai.com/blog/infrastructure-for-deep-learning/

Specify order of channels in TIFF

Currently, we assume that the channels in TIFFs are ordered as BGR-IR. To make this more general, we should be able to specify the order of the channels as a command line argument. That is, unless BGR-IR is standard and the assumption is safe.
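
A sketch of what the command-line option might look like, reordering bands after reading (the flag name and default are illustrative; the default below maps BGR-IR files to R, G, B, IR):

import argparse
import rasterio

parser = argparse.ArgumentParser()
parser.add_argument('--channel-order', type=int, nargs='+', default=[2, 1, 0, 3],
                    help='Band indices mapping file order to R, G, B, IR '
                         '(default assumes BGR-IR files).')
args = parser.parse_args()

with rasterio.open('image.tif') as src:
    bands = src.read()                  # shape: (bands, height, width)
    bands = bands[args.channel_order]   # reorder to R, G, B, IR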

The dataset required in README.txt may be incorrect?

Hi, the dataset required in /keras-semantic-segmentation-develop/src/model_training/README.txt may be wrong. I tried to run your code and got this error: /opt/data/datasets/potsdam/4_Ortho_RGBIR/top_potsdam_2_10_RGBIR.tif: No such file or directory. It seems that the program needs the 4_Ortho_RGBIR dataset instead of 3_Ortho_IRRG as written in the README file.

Bad behavior when generators try to download files on local machines

Related to #66

The way we use "done.txt" can get really out of sorts if you download the data and place it in your local directory outside of the process.
Also if a download fails, it writes the done.txt as if it succeeded.

We need to refactor this part of the code to be more robust.

Validation accuracy

Validation accuracy in score.json is different from the one in log.txt/stdout.txt.

BTW, the avg_accuracy in the validation_eval.py should be called overall_accuracy.

Allow easily changing size of training and validation sets

Currently, if you want to change the size of the training and validation sets, you need to run the preprocess.py script again, which puts files into train and validation directories. Now that we aren't using the Keras data generator there's no need to keep the files in separate directories. Instead, each data generator could be given a list of files it can use.

Debug plot not showing all detections

There are more detections that show up when viewing the GeoJSON file in QGIS than show up in the debug plot generated by aggregate_predictions.py.

Fine Tuning ...

Hello,

Can you guide me on how to fine-tune the model for a different dataset and, naturally, a different number of classes?

I suppose I need to create the dataset in your format first.

Downloading of files for different runs can not happen when needed

In download_dataset, the code checks if the data directory is there, and if not makes it and downloads data. If it is, it considers the data already downloaded. For machines that are being reused in the ECS cluster, it could run multiple trainings. If one run needs different files from another, it can skip downloading important data. We should check for the existence of the files at a per-file level (which presents a slight challenge because zip files are deleted for good reason).
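
A sketch of the per-file check (the manifest of expected files would have to come from each dataset's configuration; the names here are illustrative):

import os

def missing_files(data_dir, required_files):
    # Check each required file instead of only the directory or done.txt,
    # so a run needing different files still triggers a download.
    return [f for f in required_files
            if not os.path.isfile(os.path.join(data_dir, f))]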

Setup script fails

Running ./scripts/setup fails with the following. Running vagrant provision works, though.

fatal: [raster_vision]: FAILED! => {"changed": false, "failed": true, "msg": "dpkg --force-confdef --force-confold -i /tmp/nvidia-docker.deb failed", "stderr": "start: Job failed to start\ninvoke-rc.d: initscript nvidia-docker, action \"start\" failed.\ndpkg: error processing package nvidia-docker (--install):\n subprocess installed post-installation script returned error exit status 1\nErrors were encountered while processing:\n nvidia-docker\n", "stdout": "Selecting previously unselected package nvidia-docker.\n(Reading database ... 65558 files and directories currently installed.)\nPreparing to unpack /tmp/nvidia-docker.deb ...\nUnpacking nvidia-docker (1.0.0~rc.3-1) ...\nSetting up nvidia-docker (1.0.0~rc.3-1) ...\nConfiguring user\nSetting up permissions\nProcessing triggers for ureadahead (0.100.0-16) ...\n", "stdout_lines": ["Selecting previously unselected package nvidia-docker.", "(Reading database ... 65558 files and directories currently installed.)", "Preparing to unpack /tmp/nvidia-docker.deb ...", "Unpacking nvidia-docker (1.0.0~rc.3-1) ...", "Setting up nvidia-docker (1.0.0~rc.3-1) ...", "Configuring user", "Setting up permissions", "Processing triggers for ureadahead (0.100.0-16) ..."]}

Infrastructure improvements

@lewfish and I discussed some potential areas for infrastructure improvements:

Reducing EC2 Boot times

  • Build docker-images locally and push them to quay.io/ECR, rather than copying the entire local workspace up to EC2 for builds.
  • Get latest source code onto the EC2 instance by cloning this repository using cloud-init or a command run over SSH
  • Replace cloud-config installation of nvidia-docker with our own AMI, based on ami-50b4f047, that has nvidia-docker installed.

Optimizations for multi-user collaborations

  • Identify a user's EC2 instance via key-pair name: scripts/run uses aws ec2 wait to determine when Spot Fleet requests are complete. However, if multiple users are running the script at the same time, the script may wait for the wrong Spot Fleet request to finish. One way to avoid this is to allow users to use their own (named) key-pairs, and add key-name as an additional filter to aws ec2 wait instance-running.

  • Use the AWS CLI to terminate instances once the jobs have finished. Ideally we'd be able to run this from inside the container, but that would either require stored credentials in the container (a security risk), or access to the EC2 metadata service.

Concurrent processing across instances

We want to be able to run the same command with different parameters, simultaneously, across all available workers. We settled on the following:

  • Add instance name/index $INSTANCE_ID as an environment variable via cloud-config
  • Store the command parameters in files namespaced by instance ID. Instances would access a file like $INSTANCE_ID-command-options.json.

Cloud init fails occasionally

Intermittently, the cloud-init fails due to a problem with installing packages. The following is from the log file. The result is that the data and docker image aren't downloaded to the instance.

Get:17 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial/main amd64 libwebp5 amd64 0.4.4-1 [165 kB]
Get:18 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial/main amd64 libwebpmux1 amd64 0.4.4-1 [14.2 kB]
Get:19 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial/main amd64 python3-pil amd64 3.1.2-0ubuntu1 [312 kB]
Get:20 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial/main amd64 python3-pygments all 2.1+dfsg-1 [520 kB]
Get:21 http://security.ubuntu.com/ubuntu xenial-security/main amd64 libtiff5 amd64 4.0.6-1ubuntu0.1 [146 kB]
Get:22 http://us-east-1.ec2.archive.ubuntu.com/ubuntu xenial/main amd64 unzip amd64 6.0-20ubuntu1 [158 kB]
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
Fetched 3,703 kB in 0s (9,560 kB/s)
dpkg: error: dpkg status database is locked by another process
E: Sub-process /usr/bin/dpkg returned an error code (2)
/var/lib/cloud/instance/scripts/part-001: line 9: aws: command not found
/var/lib/cloud/instance/scripts/part-001: line 10: pushd: data/datasets: No such file or directory
/var/lib/cloud/instance/scripts/part-001: line 11: unzip: command not found
/var/lib/cloud/instance/scripts/part-001: line 12: popd: directory stack empty
Cloning into 'keras-semantic-segmentation'...
/var/lib/cloud/instance/scripts/part-001: line 18: aws: command not found
Using default tag: latest
Pulling repository 002496907356.dkr.ecr.us-east-1.amazonaws.com/keras-semantic-segmentation-gpu
unauthorized: authentication required
Cloud-init v. 0.7.8 running 'modules:final' at Mon, 27 Feb 2017 19:26:15 +0000. Up 65.95 seconds.
2017-02-27 19:26:48,013 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [1]
2017-02-27 19:26:48,015 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2017-02-27 19:26:48,016 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed

Problem starting `./scripts/run --gpu`

After starting up 10 EC2 instances, I ran ./scripts/run --gpu on each of them. On 4 of the instances, it hung for a bit and then produced the following error message.

docker: Error response from daemon: create nvidia_driver_375.51: Post http://%2Frun%2Fdocker%2Fplugins%2Fnvidia-docker.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0.
See 'docker run --help'.

Invoking the command a second time worked.

Model improvements

Right now we get 85.8% accuracy on the Potsdam dataset using a single U-Net-like model trained from scratch on RGBIRD. From reading papers, it seems that this is about the best you can do with a single off-the-shelf model trained from scratch. To improve accuracy, we might explore the following ideas:

  • Do some hyperparameter tuning once we have the ability to run lots of experiments in parallel.
  • Implement a more proper version of U-Net. The version we have takes some shortcuts for ease of implementation that might result in lower accuracy.
  • Use a more state-of-the-art model like 100 Layer Tiramisu, which is like a U-Net but uses DenseNets as its base network. https://arxiv.org/abs/1611.09326
  • Train a bunch of models and combine them as an ensemble (a minimal averaging sketch follows this list). Top entries in the contest do this.
  • Use a pre-trained model on the RGB channels and then fuse with a trained-from-scratch model on the IR and D channels. How and where you do the fusing probably matters. Top entries in the contest do this.
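
For the ensemble idea above, the combination step can be as simple as averaging per-pixel class probabilities; a minimal sketch assuming Keras-style models:

import numpy as np

def ensemble_predict(models, batch):
    # Average each model's per-pixel class probabilities; the argmax of the
    # mean is the ensemble's prediction.
    probs = np.mean([m.predict(batch) for m in models], axis=0)
    return np.argmax(probs, axis=-1)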
