Giter Site home page Giter Site logo

Comments (20)

kislyuk avatar kislyuk commented on August 27, 2024 2

I'm sorry @jamestwebber, I did not properly test my fix, and managed to have a bug in the fix as well. Please try again with v2.6.9, which I think I tested properly and with success.

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024 1

Hi @sunitj, thank you for reporting this. I am working on a fix for this issue, which was also reported in the idseq project. Your report will cause me to allocate more of my time to the fix to accelerate it. I'll update here ASAP.

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024 1

Thanks, I did not anticipate this error condition. The issue is not about xvdz (the script automatically chose the next available letter, xvdy), but that the failure to attach (due to too many concurrent requests) causes the volume ID to not be properly printed and saved for cleaning it up. I'll have a fix for you to try later today.

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024

In March 2017, EC2 gained the ability to tag resources at creation time.

In the shellcode for self-managed EBS volume attachment, the volume is created and tags are applied before the cleanup exit trap is set: https://github.com/kislyuk/aegea/blob/master/aegea/util/aws/batch.py#L20.

Plan:

  • Combine the create volume call and the tagging call into one
  • Move the cleanup trap up to be set before the initial create volume call
  • Modify the default shellcode to make use of instance local storage if available and running in privileged mode
  • Advise users to set up their Batch CE to use nvme-enabled instances to automatically provide scratch space on /mnt

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024

This should be fixed in the latest release. The job should detach and destroy the volume upon any failure. I also recommend that instead of using EBS volumes, you look into using NVMe instance-attached storage. Automatic utilization of instance-attached storage is now available using the aegea batch submit --mount-instance-storage [mountpoint] option (where mountpoint is /mnt by default).

If you want to switch from EBS to instance storage, the steps are as follows:

  • Locate suitable instance types at https://ec2instances.info. Enter "nvme" into the filter above the "Instance Storage" column. I recommend using instances from the m5d, r5d, or c5d family.

  • Create a new compute environment, or delete and re-create an existing compute environment, to utilize the instance types you selected (they don't support updating the instance types in-place).

  • If you've never worried about compute environments, and just ran aegea batch submit with the default settings, then aegea created a compute environment called aegea_batch for you. The new release of aegea uses spot m5d/r5d/c5d instances by default. To delete and re-create aegea_batch, run:

    aegea batch delete-queue aegea_batch
    aegea batch delete-compute-environment aegea_batch
    aegea batch submit --command "df -h" --mount-instance-storage --watch
    

    That last command will re-create a new aegea_batch compute environment, launch a test job, format and mount your instance storage, and print this free space report as the last log line, which you can use to verify success:

    2019-09-09 05:05:52+00:00 /dev/md0         92G   61M   87G   1% /mnt
    

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024

This issue should be resolved now, please reopen it if you still encounter difficulties with orphaned volumes.

from aegea.

sunitj avatar sunitj commented on August 27, 2024

Hi Andrey!
Thank you for the quick turnaround! I updated to v2.6.6 today and retried my test job. However, the job failed again due to a (possibly) related error/bug. Here is the snippet of the error:

2019-09-16 18:10:08+00:00   Found existing installation: awscli 1.11.13
2019-09-16 18:10:08+00:00     Not uninstalling awscli at /usr/lib/python3/dist-packages, outside environment /usr
2019-09-16 18:10:10+00:00 Successfully installed aegea-2.6.6 argcomplete-1.10.0 asn1crypto-0.24.0 awscli-1.16.238 babel-2.7.0 bcrypt-3.1.7 boto3-1.9.228 botocore-1.12.228 certifi-2019.9.11 cffi-1.12.3 chardet-3.0.4 cryptography-2.7 dnspython-1.16.0 idna-2.8 ipwhois-1.1.0 keymaker-1.0.9 paramiko-2.6.0 pycparser-2.19 pynacl-1.3.0 python-dateutil-2.8.0 pytz-2019.2 pyyaml-5.1.2 requests-2.22.0 s3transfer-0.2.1 tweak-1.0.3 uritemplate-3.0.0 urllib3-1.25.3
2019-09-16 18:10:56+00:00 Traceback (most recent call last):
2019-09-16 18:10:56+00:00   File "/usr/local/bin/aegea", line 23, in <module>
2019-09-16 18:10:56+00:00     aegea.main()
2019-09-16 18:10:56+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/__init__.py", line 89, in main
2019-09-16 18:10:56+00:00     result = parsed_args.entry_point(parsed_args)
2019-09-16 18:10:56+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 65, in create
2019-09-16 18:10:56+00:00     return attach(parser_attach.parse_args([res["VolumeId"]], namespace=args))
2019-09-16 18:10:56+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 139, in attach
2019-09-16 18:10:56+00:00     logger.info("Formatting %s (%s)", args.volume_id, find_devnode(args.volume_id))
2019-09-16 18:10:56+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 109, in find_devnode
2019-09-16 18:10:56+00:00     raise Exception("Could not find devnode for {}".format(volume_id))
2019-09-16 18:10:56+00:00 Exception: Could not find devnode for vol-00db2784fc73d40a3
2019-09-16 18:10:56+00:00 Detaching EBS volume
2019-09-16 18:10:57+00:00 usage: aegea ebs detach [-h] [--max-col-width MAX_COL_WIDTH] [--json]
2019-09-16 18:10:57+00:00                         [--log-level {CRITICAL,DEBUG,WARNING,ERROR,INFO}]
2019-09-16 18:10:57+00:00                         [--unmount] [--delete] [--force] [--dry-run]
2019-09-16 18:10:57+00:00                         volume_id
2019-09-16 18:10:57+00:00 aegea ebs detach: error: the following arguments are required: volume_id

The complete error log is attached.
job.log
I checked my volumes and vol-00db2784fc73d40a3 was in fact created.
image

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024

This feature is currently only compatible with m5/c5/r5 instance type families. I will look into making it compatible with other instance types, but I can't promise it. Please take a look at the suggested list of steps above.

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024

OK, I tracked down the problem and released a fix in v2.6.8. Please try again - sorry about the breakage.

from aegea.

jamestwebber avatar jamestwebber commented on August 27, 2024

Just tried it and got the same error.

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024

Can you double check the output of aegea --version to make sure it says v2.6.8?

Are you running this command? aegea batch submit --queue aegea_batch --vcpus 16 --memory 64000 --ecr-image aligner --storage /mnt=500 --command 'PATH=$HOME/anaconda/bin:$PATH; cd utilities; git pull; git checkout master; python setup.py install; python -m utilities.alignment.run_star_and_htseq --taxon mm10-plus --num_partitions 100 --partition_id 0 --s3_input_path s3://czb-seqbot/fastqs/190906_A00111_0366_AHNKGFDSXX/ --s3_output_path s3://czb-maca/Plate_seq/parabiosis/190906_A00111_0366_AHNKGFDSXX/mm10/'

And your instance types are m3/c3/r3?

from aegea.

jamestwebber avatar jamestwebber commented on August 27, 2024

aegea is version 2.6.8 but this job queue is not m3, it's set to optimal.

from aegea.

sunitj avatar sunitj commented on August 27, 2024

Just tried v2.6.9 with the same commands as before. Instances were launched successfully. Thanks!

from aegea.

sunitj avatar sunitj commented on August 27, 2024

Hi @kislyuk
It looks like the orphan volumes issue wasn't resolved completely in v2.6.9. Some of my jobs were killed due to the following error, and the volumes associated with them were left behind as available:

...
...
Successfully installed aegea-2.6.9 argcomplete-1.10.0 asn1crypto-0.24.0 awscli-1.16.240 babel-2.7.0 bcrypt-3.1.7 boto3-1.9.230 botocore-1.12.230 certifi-2019.9.11 cffi-1.12.3 chardet-3.0.4 cryptography-2.7 dnspython-1.16.0 idna-2.8 ipwhois-1.1.0 keymaker-1.0.9 paramiko-2.6.0 pycparser-2.19 pynacl-1.3.0 python-dateutil-2.8.0 pytz-2019.2 pyyaml-5.1.2 requests-2.22.0 s3transfer-0.2.1 tweak-1.0.3 uritemplate-3.0.0 urllib3-1.25.3
2019-09-18 05:29:34+00:00 WARNING:aegea:BDM node xvdz is already in use, looking for next available node
2019-09-18 05:29:35+00:00 Traceback (most recent call last):
2019-09-18 05:29:35+00:00   File "/usr/local/bin/aegea", line 23, in <module>
2019-09-18 05:29:35+00:00     aegea.main()
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/__init__.py", line 89, in main
2019-09-18 05:29:35+00:00     result = parsed_args.entry_point(parsed_args)
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 65, in create
2019-09-18 05:29:35+00:00     attach(parser_attach.parse_args([res["VolumeId"]], namespace=args))
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 126, in attach
2019-09-18 05:29:35+00:00     res = attach_volume(args)
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/aegea/ebs.py", line 87, in attach_volume
2019-09-18 05:29:35+00:00     Device=args.device)
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 357, in _api_call
2019-09-18 05:29:35+00:00     return self._make_api_call(operation_name, kwargs)
2019-09-18 05:29:35+00:00   File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 661, in _make_api_call
2019-09-18 05:29:35+00:00     raise error_class(parsed_response, operation_name)
2019-09-18 05:29:35+00:00 botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the AttachVolume operation (reached max retries: 4): Request limit exceeded.
2019-09-18 05:29:35+00:00 Detaching EBS volume
2019-09-18 05:29:36+00:00 usage: aegea ebs detach [-h] [--max-col-width MAX_COL_WIDTH] [--json]
2019-09-18 05:29:36+00:00                         [--log-level {ERROR,CRITICAL,WARNING,INFO,DEBUG}]
2019-09-18 05:29:36+00:00                         [--unmount] [--delete] [--force] [--dry-run]
2019-09-18 05:29:36+00:00                         volume_id
2019-09-18 05:29:36+00:00 aegea ebs detach: error: the following arguments are required: volume_id
INFO:aegea:Job a0ac748d-da20-4b73-b63e-47330224c12d: Essential container in task exited

Here, is the detailed log
a0ac748d-da20-4b73-b63e-47330224c12d.log

In this case, it possible that the warning about the selected device (xvdz) not being available has something to do with it?

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024

Released v2.6.11, which should address this issue by always saving the volume ID and trying to detach it even if errors are encountered in the volume management process. Please try again.

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024

Closing due to inactivity.

from aegea.

sunitj avatar sunitj commented on August 27, 2024

Hello @kislyuk,
Apologies for the delay, I wasn't running many jobs since my last comment and hence could not test the updates.
However, recently, I started noticing an increase in the orphan volumes again. It seems like, if there is an error during execution, aegea seems to catch the exception and tries to detach the correct volume, but is unable to and eventually leaves an orphaned volume behind with the message:

2019-10-09 22:36:57+00:00 Detaching EBS volume vol-0fc487d57716a4ac2
2019-10-09 22:37:07+00:00 Traceback (most recent call last):
2019-10-09 22:37:07+00:00   File "/opt/aegea-venv/bin/aegea", line 23, in <module>
2019-10-09 22:37:07+00:00     aegea.main()
2019-10-09 22:37:07+00:00   File "/opt/aegea-venv/lib/python3.5/site-packages/aegea/__init__.py", line 89, in main
2019-10-09 22:37:07+00:00     result = parsed_args.entry_point(parsed_args)
2019-10-09 22:37:07+00:00   File "/opt/aegea-venv/lib/python3.5/site-packages/aegea/ebs.py", line 186, in detach
2019-10-09 22:37:07+00:00     Force=args.force)
2019-10-09 22:37:07+00:00   File "/opt/aegea-venv/lib/python3.5/site-packages/botocore/client.py", line 357, in _api_call
2019-10-09 22:37:07+00:00     return self._make_api_call(operation_name, kwargs)
2019-10-09 22:37:07+00:00   File "/opt/aegea-venv/lib/python3.5/site-packages/botocore/client.py", line 661, in _make_api_call
2019-10-09 22:37:07+00:00     raise error_class(parsed_response, operation_name)
2019-10-09 22:37:07+00:00 botocore.exceptions.ClientError: An error occurred (RequestLimitExceeded) when calling the DetachVolume operation (reached max retries: 4): Request limit exceeded.

Here is the detailed log and the command passed to the job (let me know if you need an unedited version)
aegea version: 2.7.9

For comparison, here is a similar example in v2.6.11 where it seems to have worked: command, log

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024

I suspect this has to do with concurrently launching many jobs, but I'll need to look at the logs more closely or experiment to confirm that.

Short of staggering the job launch, the main workaround for RequestLimitExceeded that I can think of is to change the botocore config to increase retries.

With that said, I encourage you to stop using EBS volumes for your jobs, and to use instance attached storage instead. Going forward I intend to focus on only supporting instance storage as scratch space. Is there a specific reason preventing you from using instance storage for scratch space using the options I described above?

from aegea.

sunitj avatar sunitj commented on August 27, 2024

@kislyuk I haven't tested instance attached storage yet. I guess my concern was the limited instance family support and the need to setup another compute environment with these instance families (we are nearing our hard-limit of environments).

Neither of these are problems that we can't overcome. So, if that's where aegea is headed, then I'm happy to update my environment.

from aegea.

kislyuk avatar kislyuk commented on August 27, 2024

Closing this - please reopen if you still need support.

from aegea.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.