Comments (14)

sean-smith avatar sean-smith commented on June 11, 2024

@nyue My guess is you're running the update script without having first deployed the stack. Take a look at More Regions (click to expand) in the README and click on the ca-central-1 launch stack button:

[Screenshot: ca-central-1 launch stack button]

from pcluster-manager.

nyue avatar nyue commented on June 11, 2024

I tried clicking on ca-central-1 and it started creating the stack. After monitoring it (IN_PROGRESS) for about 30 minutes, I got a ROLLBACK_COMPLETE.

It looks like it has an issue creating the stack.

[Screenshot: pcluster-manager-rollback-compelete]


sean-smith avatar sean-smith commented on June 11, 2024

@nyue Can you check the stack events and find the event that failed? You might need to filter by FAILED to find the correct stack.
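If the console filter is awkward, the same check can be sketched in Python. This is only an illustration, not part of pcluster-manager: `failed_events` and `fetch_stack_events` are hypothetical helper names, the sample events are made up, and the fetch step assumes boto3 is installed and AWS credentials are configured.

```python
from datetime import datetime, timezone


def failed_events(events):
    """Return events whose ResourceStatus ends in FAILED, oldest first."""
    failures = [e for e in events if e.get("ResourceStatus", "").endswith("FAILED")]
    return sorted(failures, key=lambda e: e["Timestamp"])


def fetch_stack_events(stack_name, region):
    """Fetch all events for a stack (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so failed_events() works without boto3

    client = boto3.client("cloudformation", region_name=region)
    paginator = client.get_paginator("describe_stack_events")
    events = []
    for page in paginator.paginate(StackName=stack_name):
        events.extend(page["StackEvents"])
    return events


if __name__ == "__main__":
    # Made-up sample events, shaped like the StackEvents entries boto3 returns.
    sample = [
        {"LogicalResourceId": "ApiFunction", "ResourceStatus": "DELETE_COMPLETE",
         "Timestamp": datetime(2022, 1, 1, 12, 30, tzinfo=timezone.utc)},
        {"LogicalResourceId": "EcrImage", "ResourceStatus": "CREATE_FAILED",
         "ResourceStatusReason": "example reason",
         "Timestamp": datetime(2022, 1, 1, 12, 20, tzinfo=timezone.utc)},
    ]
    for event in failed_events(sample):
        print(event["Timestamp"], event["LogicalResourceId"],
              event["ResourceStatus"], event.get("ResourceStatusReason", ""))
```

The first entry in the returned list is the earliest failure, which is usually the one worth reading; later failures and the DELETE events tend to be consequences of it.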


nyue avatar nyue commented on June 11, 2024

I found the failure; it mentions ECR:

Embedded stack arn:aws:cloudformation:ca-central-1:083230063072:stack/pcluster-manager-ParallelClusterApi-HAUEC9DAO8LE/099a6170-a96b-11ec-9e63-024e96f96c22 was not successfully created: The following resource(s) failed to create: [EcrImage].

[Screenshot: pcluster_ecr_failed]


cpollard0 avatar cpollard0 commented on June 11, 2024

@nyue - Can you please share the error with the embedded stack? It should have information specific to the ECR failure.


nyue avatar nyue commented on June 11, 2024

I am not sure how to get to an embedded stack. Is that the same as a nested stack?

I am using this ID to match: 099a6170-a96b-11ec-9e63-024e96f96c22

[Screenshot]


cpollard0 avatar cpollard0 commented on June 11, 2024

You've got the right stack!

Yes, embedded stacks are synonymous with nested stacks.
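For anyone matching IDs the same way, the ARN in the parent stack's error message already carries the nested stack's name and ID. A minimal sketch of pulling them apart follows; `parse_stack_arn` is a hypothetical helper written for this thread, not an AWS or pcluster-manager function.

```python
def parse_stack_arn(arn):
    """Split a CloudFormation stack ARN into region, account, name, and ID."""
    prefix, _, resource = arn.partition(":stack/")
    parts = prefix.split(":")
    if len(parts) != 5 or parts[2] != "cloudformation" or not resource:
        raise ValueError(f"not a CloudFormation stack ARN: {arn!r}")
    name, _, stack_id = resource.partition("/")
    return {"region": parts[3], "account": parts[4], "name": name, "id": stack_id}


if __name__ == "__main__":
    # ARN copied from the error message earlier in this thread.
    arn = ("arn:aws:cloudformation:ca-central-1:083230063072:stack/"
           "pcluster-manager-ParallelClusterApi-HAUEC9DAO8LE/"
           "099a6170-a96b-11ec-9e63-024e96f96c22")
    info = parse_stack_arn(arn)
    print(info["name"])  # the nested stack's name
    print(info["id"])    # the ID being matched in the console
```

Because a rolled-back nested stack is deleted, lookups by name alone may not find it; CloudFormation expects the full stack ID (the ARN) when querying deleted stacks.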

Once CloudFormation finds a failure it will delete all the resources that it has created up to that point, so seeing all the resources as DELETE_COMPLETE is expected.

Can you please click on the Events tab of the stack and share information regarding the first failures you see?

FWIW I just deployed to ca-central-1 by clicking on the link @sean-smith sent and PCluster manager successfully installed.


nyue avatar nyue commented on June 11, 2024

Here are the events

[Screenshot]


cpollard0 avatar cpollard0 commented on June 11, 2024

@nyue - Can you please scroll down the list of events and grab a screenshot of the point at which you see the first error? The delete events happened after the error.


nyue avatar nyue commented on June 11, 2024

Here is the part where the creation failure message is visible. It mentions the SSM Agent.

[Screenshot]


cpollard0 avatar cpollard0 commented on June 11, 2024

Thanks for the detail! I think Image Builder uses the default VPC to build an image. What does that VPC look like in your environment? If there are private subnets, do all their route tables have a NAT gateway entry? Are any NACLs in place? Are there any VPC or gateway endpoints with policies in place?


nyue avatar nyue commented on June 11, 2024

I am not an AWS power user and never touch the networking stuff, so I am struggling to comprehend the details and to find answers to the questions (even though I vaguely understand the questions and they look important).

I have been setting up MPI clusters via terraform and was hoping to do so via parallelcluster v3, seeing that the configuration has moved to YAML.

I am looking to pcluster-manager to make it simple for users like me to launch an MPI cluster, run some MPI-aware applications (be they rendering, genomics, or visualization), and shut them down once the work is done.

I am happy to learn about the VPC/networking stuff but am unable to provide useful answers at this juncture.

Cheers


sean-smith avatar sean-smith commented on June 11, 2024

Thanks @nyue

This is exactly the type of use case we're building PCM to solve. We want to make it easy to deploy and manage HPC clusters without having to be an expert in AWS networking (which I'm not).

Specifically for this issue, I think the root of the problem comes from #55 which will be merged in the next few weeks.

Basically the default VPC (which you can find here) needs to exist and have a route to the internet. Once #55 is merged you'll be able to specify another VPC/Subnet outside of the default one to launch in.
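For those who would rather check this from a script than the console, here is a rough sketch. It is an assumption-laden illustration, not the check pcluster-manager performs: `has_internet_route` and `default_vpc_route_tables` are hypothetical helpers, the sample route tables are made up, and the lookup step assumes boto3 and AWS credentials. It only asks whether any route table in the VPC has an active default route through an internet or NAT gateway, so it does not prove every subnet can reach the internet.

```python
def has_internet_route(route_tables):
    """True if any table has an active 0.0.0.0/0 route via an IGW or NAT gateway."""
    for table in route_tables:
        for route in table.get("Routes", []):
            if route.get("DestinationCidrBlock") != "0.0.0.0/0":
                continue
            if route.get("State", "active") != "active":
                continue
            if route.get("GatewayId", "").startswith("igw-") or route.get("NatGatewayId"):
                return True
    return False


def default_vpc_route_tables(region):
    """Return the default VPC's route tables, or None if there is no default VPC."""
    import boto3  # imported lazily so has_internet_route() works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    vpcs = ec2.describe_vpcs(
        Filters=[{"Name": "is-default", "Values": ["true"]}]
    )["Vpcs"]
    if not vpcs:
        return None
    return ec2.describe_route_tables(
        Filters=[{"Name": "vpc-id", "Values": [vpcs[0]["VpcId"]]}]
    )["RouteTables"]


if __name__ == "__main__":
    # Made-up route tables, shaped like the RouteTables entries boto3 returns.
    local_only = [{"Routes": [
        {"DestinationCidrBlock": "172.31.0.0/16", "GatewayId": "local", "State": "active"},
    ]}]
    with_igw = [{"Routes": [
        {"DestinationCidrBlock": "172.31.0.0/16", "GatewayId": "local", "State": "active"},
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0example", "State": "active"},
    ]}]
    print(has_internet_route(local_only))
    print(has_internet_route(with_igw))
```

If `default_vpc_route_tables("ca-central-1")` returns None, the default VPC is missing in that region; if it returns tables but `has_internet_route` is false, the route to the internet is the likely gap.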


sean-smith avatar sean-smith commented on June 11, 2024

Closing since #55 was merged.

