brinkmanlab / galaxy-container

Galaxy container generation and cloud deployment recipes. Uses Buildah, Ansible, and Terraform.

License: MIT License
galaxy-container/destinations/k8s/visualizations.tf
Lines 45 to 89 in 035b92f
after galaxyproject/galaxy#11057 is resolved
Example for k8s: galaxyproject/pulsar@f892f71
Galaxy's job runners poorly expose the underlying runner interfaces (Docker/k8s). A template runner that takes a templatised configuration data structure and hands it off directly to the underlying library for that compute resource
would allow fine-grained control of the underlying library's features without having to implement every one of them as an interface in Galaxy.
This will likely require modifying the k8s runner in Galaxy.
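As a sketch of the idea (the template fields, names, and rendering mechanism are illustrative, not existing Galaxy APIs): the deployment supplies a templatised manifest covering any k8s feature it needs, and the runner only substitutes the job specifics before handing the result to the k8s client library.

```python
import json
from string import Template

# Hypothetical deployment-provided template. Everything that is not a
# placeholder is plain configuration, so any k8s Job field can be set
# without a matching option having to exist in Galaxy.
JOB_TEMPLATE = Template(json.dumps({
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "$job_name"},
    "spec": {
        "backoffLimit": 0,
        "template": {"spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "tool",
                "image": "$image",
                "command": ["/bin/sh", "-c", "$command"],
            }],
        }},
    },
}))

def render_job(job_name: str, image: str, command: str) -> dict:
    """Substitute the Galaxy job specifics into the template and return the
    manifest that would be handed directly to the k8s client library."""
    rendered = JOB_TEMPLATE.substitute(job_name=job_name, image=image, command=command)
    return json.loads(rendered)

manifest = render_job("galaxy-173121", "quay.io/biocontainers/feature_merge:1.3", "feature_merge -v")
# A real runner would now pass `manifest` to the k8s library, e.g.
# kubernetes.client.BatchV1Api().create_namespaced_job("galaxy", manifest)
```

Because the template lives in the deployment configuration, new k8s features cost a template edit rather than a Galaxy code change.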
Once kubernetes/autoscaler#3916 is completed, add the ability to specify per-tool limits on scaling. We may also want to add per-user limits on scaling based on fair share.
For now, it is failing at the
terraform init
stage because there is no connection to registry.terraform.io.
Is there any way to download or otherwise provide the needed plugins to Terraform manually, without running our own corporate Terraform registry?
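Yes: since Terraform 0.13 you can pre-download providers on a machine that can reach the registry, copy the directory across, and point the CLI at it via a filesystem mirror. A sketch (paths are illustrative):

```hcl
# On a connected machine, from the working directory:
#   terraform providers mirror /opt/tf-mirror
# Copy /opt/tf-mirror to the isolated host, then add to the CLI config
# (~/.terraformrc on Linux):
provider_installation {
  filesystem_mirror {
    path    = "/opt/tf-mirror"
    include = ["registry.terraform.io/*/*"]
  }
  direct {
    exclude = ["registry.terraform.io/*/*"]
  }
}
```

With that in place, terraform init resolves providers from the local directory and never contacts registry.terraform.io.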
The htcondor/mini Docker container might be useful as a simple job manager.
Galaxy creates a control queue for each of its processes, but might not be cleaning them up properly despite setting auto_delete=True in the related code.
They can be manually cleaned up using
aws sqs list-queues --output text | grep control | cut -f 2 | xargs -i aws sqs delete-queue --queue-url {} --output text
but this will also clobber the queues of the currently active processes, which may then require restarting the app.
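A safer cleanup would skip the queues of live processes instead of deleting everything the grep matches. A minimal Python sketch against the boto3 SQS client interface (the client is injected so a stub can stand in for it here; that control queue names start with "control" is assumed from the grep above):

```python
class _StubSQS:
    """Stand-in for the boto3 SQS client (list_queues / delete_queue),
    letting the cleanup logic run without touching AWS."""
    def __init__(self, urls):
        self.urls = list(urls)
        self.deleted = []

    def list_queues(self, QueueNamePrefix=""):
        # boto3's list_queues filters on the queue *name*, not the full URL.
        return {"QueueUrls": [u for u in self.urls
                              if u.rsplit("/", 1)[-1].startswith(QueueNamePrefix)]}

    def delete_queue(self, QueueUrl):
        self.deleted.append(QueueUrl)

def delete_stale_control_queues(sqs, active_urls):
    """Delete control queues that do not belong to a live Galaxy process.

    Unlike the raw `aws sqs ... | xargs` pipeline, the queues of currently
    active processes (passed in `active_urls`) are skipped, so no restart
    is needed afterwards.
    """
    removed = []
    for url in sqs.list_queues(QueueNamePrefix="control").get("QueueUrls", []):
        if url not in active_urls:
            sqs.delete_queue(QueueUrl=url)
            removed.append(url)
    return removed

sqs = _StubSQS(["https://sqs/control-stale", "https://sqs/control-live"])
removed = delete_stale_control_queues(sqs, active_urls={"https://sqs/control-live"})
```

The live queue URLs could be collected from the running Galaxy processes before invoking the cleanup.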
This requires separating out the prep/staging and cleanup/metadata steps into their own entry points.
Pods can then be configured with init-containers as the prep and compute, with cleanup as the main container.
This will require a significant contribution to the Galaxy codebase.
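The resulting pod layout would look roughly like this (image names and entry points are hypothetical, since the split entry points do not exist in Galaxy yet):

```yaml
apiVersion: v1
kind: Pod
spec:
  # initContainers run sequentially: staging first, then the tool itself.
  initContainers:
    - name: stage-in
      image: galaxy/job-prep            # hypothetical image
      command: ["galaxy-job", "stage", "173121"]   # hypothetical entry point
    - name: compute
      image: quay.io/biocontainers/feature_merge:1.3
      command: ["/bin/sh", "-c", "feature_merge -v"]
  # cleanup/metadata collection as the main container, running last.
  containers:
    - name: cleanup
      image: galaxy/job-cleanup         # hypothetical image
      command: ["galaxy-job", "finish", "173121"]
  restartPolicy: Never
```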
galaxy.jobs.runners.kubernetes DEBUG 2021-01-22 01:40:08,715 Starting queue_job for job 173121
galaxy.tool_util.deps.containers INFO 2021-01-22 01:40:09,008 Checking with container resolver [ExplicitContainerResolver[]] found description [None]
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
galaxy.tool_util.deps.container_resolvers.mulled INFO 2021-01-22 01:40:09,126 Call to `docker images` failed, configured container resolution may be broken
galaxy.tool_util.deps.containers INFO 2021-01-22 01:40:09,126 Checking with container resolver [CachedMulledDockerContainerResolver[namespace=biocontainers]] found description [None]
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
galaxy.tool_util.deps.container_resolvers.mulled INFO 2021-01-22 01:40:09,199 Call to `docker images` failed, configured container resolution may be broken
galaxy.tool_util.deps.containers INFO 2021-01-22 01:40:09,199 Checking with container resolver [CachedMulledDockerContainerResolver[namespace=local]] found description [None]
galaxy.tool_util.deps.containers INFO 2021-01-22 01:40:11,716 Checking with container resolver [MulledDockerContainerResolver[namespace=biocontainers]] found description [None]
galaxy.tool_util.deps.conda_util ERROR 2021-01-22 01:40:11,803 Could not execute: '['docker', 'run', 'continuumio/miniconda3:latest', 'conda', 'search', '--full-name', '--json', '--platform', 'linux-64', '--override-channels', '--channel', 'conda-forge', '--channel', 'bioconda', 'feature_merge']'
.
Executing: /srv/galaxy/involucro -v=3 -f /srv/galaxy/lib/galaxy/tool_util/deps/mulled/invfile.lua -set CHANNELS=conda-forge,bioconda -set TARGETS=feature_merge=1.3 -set REPO=quay.io/local/feature_merge:1.3 -set BINDS=build/dist:/usr/local/ -set DEST_BASE_IMAGE=bgruening/busybox-bash:0.1 -set TEST=true build-and-test
galaxy.tool_util.deps.containers ERROR 2021-01-22 01:40:11,804 Could not get container description for tool 'toolshed.g2.bx.psu.edu/repos/brinkmanlab/feature_merge/feature-merge/1.3'
Traceback (most recent call last):
File "/srv/galaxy/lib/galaxy/tool_util/deps/containers.py", line 243, in find_best_container_description
resolved_container_description = self.resolve(enabled_container_types, tool_info, **kwds)
File "/srv/galaxy/lib/galaxy/tool_util/deps/containers.py", line 265, in resolve
container_description = container_resolver.resolve(enabled_container_types, tool_info, install=install, resolution_cache=resolution_cache, session=session)
File "/srv/galaxy/lib/galaxy/tool_util/deps/container_resolvers/mulled.py", line 498, in resolve
mull_targets(
File "/srv/galaxy/lib/galaxy/tool_util/deps/mulled/mulled_build.py", line 286, in mull_targets
ret = involucro_context.exec_command(involucro_args)
File "/srv/galaxy/lib/galaxy/tool_util/deps/mulled/mulled_build.py", line 338, in exec_command
os.mkdir('./build')
PermissionError: [Errno 13] Permission denied: './build'
galaxy.jobs.command_factory INFO 2021-01-22 01:40:11,893 Built script [/data/jobs_directory/000/173/173121/tool_script.sh] for tool command [feature_merge -v > /data/jobs_directory/000/173/173121/outputs/COMMAND_VERSION 2>&1; feature_merge -i -m append -t 500 /data/database/files/000/240/dataset_240310.dat /data/database/files/000/240/dataset_240311.dat /data/database/files/000/240/dataset_240312.dat /data/database/files/000/240/dataset_240313.dat /data/database/files/000/240/dataset_240314.dat /data/database/files/000/240/dataset_240315.dat /data/database/files/000/240/dataset_240316.dat /data/database/files/000/240/dataset_240317.dat > /data/jobs_directory/000/173/173121/outputs/galaxy_dataset_9d7c66fe-8b1b-4fa1-9fda-ffceef935e1d.dat]
Terraform recursively cloning the entire repo to import a module is expensive. It is better to publish artifacts that contain only the relevant docker or k8s module and import those using https://www.terraform.io/docs/language/modules/sources.html#fetching-archives-over-http
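For example (URL illustrative), a published archive containing just one module can then be imported directly:

```hcl
module "galaxy_k8s" {
  # Terraform fetches and unpacks only this artifact instead of
  # cloning the whole repository.
  source = "https://example.com/artifacts/galaxy-container-k8s.tar.gz"
}
```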
By default, Git on Windows doesn't convert symlinks to the Windows equivalent. This causes an "Argument or block definition required" error during terraform init.
To enable this run:
git config --global core.symlinks true
This should be added to the documentation of this repo and all dependent projects.
Jobs can be scheduled with a criterion that makes them always unschedulable. A service that polls the Docker Swarm/k8s queue for new jobs can then take the scheduling user and other metadata from the job and remove the unschedulable criterion based on fair share.
Once the jobs are complete, Galaxy can be told not to delete the job. This service can then collect CPU times and any other metrics before cleaning up the job.
This is ideally separate from the Galaxy job runner so that it can be re-used by other projects potentially not using Galaxy, or across multiple Galaxy instances.
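The fair-share decision at the core of such a service can be very small. A minimal sketch (names are illustrative; the service would then patch the chosen job to drop the unschedulable criterion):

```python
def pick_next_job(pending, cpu_seconds_used):
    """Pick the pending job whose owner has accumulated the least CPU time --
    the simplest possible fair-share policy.

    `pending` is a list of (user, job_id) tuples read back from the
    unschedulable queue; `cpu_seconds_used` maps user -> CPU seconds
    collected from completed jobs. Users with no history sort first.
    """
    return min(pending, key=lambda entry: cpu_seconds_used.get(entry[0], 0.0))

pending = [("alice", "job-1"), ("bob", "job-2"), ("carol", "job-3")]
usage = {"alice": 3600.0, "bob": 120.0}   # carol has run nothing yet
user, job_id = pick_next_job(pending, usage)
```

A production version would fold in configurable shares and decay of old usage, but the queue-polling loop only needs this one decision function.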
After brinkmanlab/cloud_recipes#2 forward the options provided in the deployment example recipes.
Customise the 503 error page and convert 504 responses to 503, showing the page when the uWSGI endpoint can't be reached.
It should say something along the lines of "the service is busy or restarting, try again later".
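In Nginx this is a one-directive change, since error_page can override the response code (the page path is illustrative):

```nginx
# Serve the friendly maintenance page for backend failures, and present
# gateway timeouts (504) and bad gateways (502) as 503 as well:
error_page 502 503 504 =503 /503.html;

location = /503.html {
    root /srv/galaxy/static/errors;   # illustrative path to the custom page
    internal;
}
```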
Change to a remote-exec provisioner that runs the same command in the app container. Locate any other 'wait' resources and change them to provisioners.
Also remove
https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/cron_job
https://github.com/galaxyproject/galaxy/tree/dev/scripts/cleanup_datasets
https://github.com/galaxyproject/gxadmin
https://galaxyproject.org/admin/config/performance/purge-histories-and-datasets/
/srv/galaxy/scripts/cleanup_datasets/pgcleanup.py -o $maxage -l /var/log/galaxy/cleanup-{{ galaxy_db_name }} delete_datasets delete_userless_histories purge_datasets purge_deleted_hdas purge_deleted_histories purge_historyless_hdas purge_hdas_of_purged_histories update_hda_purged_flag delete_exported_histories
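One way to wire this up with the linked kubernetes cron_job resource (schedule, image, and pgcleanup arguments are illustrative; verify the nested schema against the provider docs):

```hcl
resource "kubernetes_cron_job" "cleanup_datasets" {
  metadata { name = "galaxy-cleanup-datasets" }
  spec {
    schedule = "0 2 * * *"   # nightly
    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "Never"
            container {
              name  = "pgcleanup"
              image = "brinkmanlab/galaxy-app:latest"   # hypothetical app image
              command = [
                "python", "/srv/galaxy/scripts/cleanup_datasets/pgcleanup.py",
                "-o", "30", "purge_datasets",
              ]
            }
          }
        }
      }
    }
  }
}
```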
This requires configuring a minio container for the docker deployment and S3 for AWS.
AWS SQS queues need to have a configurable prefix to avoid name collisions from multiple deployments.
The queues should also be pre-created and handed to Galaxy.
This depends on galaxyproject/galaxy#14634
Buildah was chosen early on as a solution on recommendation, but having Ansible drive the container creation is turning out to be clunky.
The recipes should be refactored into Dockerfiles, using multi-stage builds to compile the client and render the config files with Ansible, respectively.
For the app container:
For the web container:
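A rough multi-stage sketch of the proposed split (base images, paths, and build commands are illustrative):

```dockerfile
# Stage 1: compile the Galaxy client
FROM node:18 AS client
WORKDIR /galaxy
COPY client/ client/
RUN cd client && yarn install && yarn build

# Stage 2: render config files with Ansible
FROM python:3.10 AS config
RUN pip install ansible
COPY playbook.yml roles/ ./
RUN ansible-playbook playbook.yml   # writes rendered configs into ./config

# Final stage: only the built artifacts, no node/ansible toolchain
FROM python:3.10-slim
COPY --from=client /galaxy/client/dist /srv/galaxy/static
COPY --from=config /config /srv/galaxy/config
```

The same pattern applies to both the app and web containers, with each final stage copying only what that container serves.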
Currently, long-running tasks are executed in the app process, causing the web service to hang, or the task to be interrupted if the app restarts. Galaxy can now delegate those tasks to a Celery queue:
We need to add a deployment for Celery workers and integrate the cloud provided message queue when available.
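On the Galaxy side this is configuration; a minimal galaxy.yml sketch (option names taken from current Galaxy docs, so verify against the deployed release):

```yaml
galaxy:
  enable_celery_tasks: true
  # Broker for internal messaging; replace with the cloud provider's
  # queue URL (e.g. SQS) when available.
  amqp_internal_connection: "sqs://"
```

The Celery worker deployment then runs alongside the app containers, consuming from the same broker.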
Galaxy now supports working with TUS uploads.
Create a tusd service with HPA and configure Galaxy and Nginx to accept uploads from tusd.
https://training.galaxyproject.org/training-material/topics/admin/tutorials/tus/tutorial.html
https://hub.docker.com/r/tusproject/tusd
https://docs.galaxyproject.org/en/latest/admin/nginx.html#receiving-files-via-the-tus-protocol
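The Nginx side, roughly following the linked tutorial (the tusd hostname/port and location path should be verified against the current Galaxy docs):

```nginx
# Proxy resumable uploads straight to tusd so large files never
# buffer through the Galaxy app process.
location /api/upload/resumable_upload {
    proxy_http_version 1.1;
    proxy_request_buffering off;
    proxy_buffering off;
    client_max_body_size 0;          # TUS handles its own size limits
    proxy_pass http://tusd:1080/files;   # tusd service name is illustrative
}
```

With an HPA on the tusd Deployment, upload capacity scales independently of the Galaxy web handlers.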
https://pypi.org/project/python-nomad/
https://github.com/hashicorp/levant - relevant but not directly useable
The runner should take a HCL template and populate it with the job specifics. This way the template can be swapped out with the deployment and every possible parameter for running a job on Nomad doesn't need to be forwarded via the Galaxy config.
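A sketch of the templating half (the HCL skeleton and field values are illustrative, and submitting the rendered job via python-nomad is left out):

```python
from string import Template

# Hypothetical deployment-provided HCL template: $job_id, $image and
# $command are the only values the runner injects; everything else is
# deployment configuration and can be swapped without touching Galaxy.
NOMAD_TEMPLATE = Template('''
job "galaxy-$job_id" {
  datacenters = ["dc1"]
  type        = "batch"
  group "tool" {
    task "run" {
      driver = "docker"
      config {
        image   = "$image"
        command = "/bin/sh"
        args    = ["-c", "$command"]
      }
    }
  }
}
''')

def render_nomad_job(job_id: str, image: str, command: str) -> str:
    """Fill the deployment's HCL template with the job specifics."""
    return NOMAD_TEMPLATE.substitute(job_id=job_id, image=image, command=command)

hcl = render_nomad_job("173121", "quay.io/biocontainers/feature_merge:1.3", "feature_merge -v")
# The rendered HCL would then be parsed and registered through python-nomad.
```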
Currently the entire deployment is baked into a single container layer.
The galaxy_env role and the Galaxy files/app roles can be separated into different layers to allow for more compact uploads of the container.
The Galaxy container resolver currently has quay.io/bioconda hardcoded as the only container source.
This needs to be rewritten so that additional sources are configurable via Galaxy's container_resolvers_conf.yml.
The current implementation also makes direct Docker registry v1 HTTP requests to quay.io; this needs to be swapped out for the official Docker SDK. That is required to improve forwards compatibility and to upgrade to v2 of the Docker registry API spec. Currently it is not possible to configure a pull-through proxy, as such proxies are not compatible with v1 of the spec.
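A sketch of the configurable-source half (function name and registry list are illustrative; the pull itself would go through the official Docker SDK, e.g. docker.from_env().images.pull, rather than hand-rolled registry HTTP calls):

```python
def candidate_images(image_name, registries):
    """Expand an image name into one candidate reference per configured
    registry, instead of assuming the single hardcoded quay.io source.
    A resolver would try to pull each candidate in order and use the
    first that succeeds."""
    return [f"{registry.rstrip('/')}/{image_name}" for registry in registries]

candidates = candidate_images(
    "biocontainers/feature_merge:1.3",
    ["quay.io", "registry.example.org/mirror"],   # second entry illustrative
)
```

The registry list would come from container_resolvers_conf.yml, making a local pull-through mirror a one-line configuration change.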