aws-samples / pcluster-manager

Manage AWS ParallelCluster through an easy to use web interface

Home Page: https://pcluster.cloud

License: Apache License 2.0

Dockerfile 0.12% Python 15.60% CSS 0.01% JavaScript 0.35% Shell 2.66% TypeScript 80.85% Ruby 0.29% HTML 0.13%
aws hpc parallel-computing

pcluster-manager's Introduction

Warning

ParallelCluster Manager has become an official feature of AWS ParallelCluster under the name AWS ParallelCluster UI. Therefore, this repository is no longer supported. For the latest features and updates, we encourage customers to refer to the new AWS ParallelCluster UI repository and documentation.

pcluster-manager's People

Contributors

amazon-auto, barcomasile, bollig, chambersaj, charlesg3, cpollard0, dependabot[bot], eantonin, enrico-usai, ermanno, lazyoft, lukeseawalker, maurizp, mendaomn, mtfranchetto, psacc, rkilpadi, sean-smith, stephenmsachs, tmscarla

pcluster-manager's Issues

ca-central-1 setup failed

First-time user, launching from Canada, but setup failed. Is the Docker/EC2 image ID hard-coded for us-east-1?

./scripts/update.sh --region ca-central-1
Logging in to docker...
Login Succeeded
latest: Pulling from n0x0o5k1/pcluster-manager-awslambda
69e3a639bf87: Pull complete 
e5ad02396619: Pull complete 
8cce4e515b2a: Pull complete 
03ac043af787: Pull complete 
65e284713a7d: Pull complete 
eaacfb3c821c: Pull complete 
e8c62d590dcd: Pull complete 
3060f5d03faa: Pull complete 
27ef4d8fb12f: Pull complete 
9cef9f9d3a96: Pull complete 
8498e5ec098a: Pull complete 
88b8884bf765: Pull complete 
Digest: sha256:770178a924c2d260e98926e8d7292b569a34f53757d626b9648286c09cb986a1
Status: Downloaded newer image for public.ecr.aws/n0x0o5k1/pcluster-manager-awslambda:latest
public.ecr.aws/n0x0o5k1/pcluster-manager-awslambda:latest
Error parsing reference: "083230063072.dkr.ecr.ca-central-1.amazonaws.com/None:latest" is not a valid repository/tag: invalid reference format: repository name must be lowercase
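
The repository segment of the failing reference is the literal text "None", which is what a Python None looks like once it is formatted into a string. One plausible reading (an assumption, not confirmed by the log) is that a repository lookup came back empty for ca-central-1 and was interpolated anyway. A minimal sketch of a guard that fails earlier with a clearer message; the function name and the simplified name pattern are hypothetical and are not taken from update.sh:

```python
import re

# Simplified pattern for a lowercase repository name. The real Docker
# reference grammar is richer; this is an illustrative assumption.
_REPO_RE = re.compile(r"^[a-z0-9]+(?:[._/-][a-z0-9]+)*$")


def ecr_image_uri(account_id, region, repo_name, tag="latest"):
    """Build an ECR image URI, failing early on a missing or invalid repo."""
    if not repo_name:
        # Covers None and "" before they can leak into the reference string.
        raise ValueError(f"no ECR repository resolved for region {region!r}")
    if not _REPO_RE.match(repo_name):
        raise ValueError(f"invalid repository name: {repo_name!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:{tag}"
```

With a check like this, an unresolved repository surfaces as "no ECR repository resolved for region 'ca-central-1'" instead of a Docker parsing error about lowercase names.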

pcluster-manager : creation failure ?

I was about to demo the setup to some folks and encountered this failure so I had to cancel the demo

I noticed the new UI does not have the VPC drop down

image

Missing buttons

image

Seems like I'm missing some buttons here. I don't see shell, DCV, or anything else in my deployed manager. Any thoughts here?

efa for hpc6a

The EFA toggle needs to be enabled for hpc6a; it is also disabled for c5n.9xlarge and other supported instance types.

        - Name: queue0-hpc6a24xlarge
          MinCount: 0
          MaxCount: 10
          InstanceType: hpc6a.48xlarge
          DisableSimultaneousMultithreading: true
          Efa:
            Enabled: true

Submit Job Error, no debug information shown

The new Submit Job dialog returns just an "Error: undefined" message, with a corresponding error 500 on the API call.
CloudWatch Logs is not much help, as there is no info or error message other than the below:

[ERROR] 2022-03-15T19:05:54.980Z 20825d25-cc35-457b-b808-724c73f6edf1 Exception on /manager/submit_job [POST]
File "/var/task/app.py", line 141, in submit_job_
return submit_job()
File "/var/task/api/PclusterApiHandler.py", line 245, in submit_job 
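
An "Error: undefined" banner is consistent with a 500 response whose body carries no message field for the UI to display, although the log above does not confirm that. A minimal sketch, under that assumption, of wrapping a handler so the exception text is both logged and returned; run_handler and the response shape are hypothetical and are not the project's actual code:

```python
import logging

logger = logging.getLogger("pcluster_manager_sketch")  # hypothetical logger name


def run_handler(handler, *args):
    """Call a handler and convert any exception into a readable 500 body."""
    try:
        return {"status": 200, "body": handler(*args)}
    except Exception as exc:
        # logging.exception records the full traceback in the log stream.
        logger.exception("handler %s failed", getattr(handler, "__name__", handler))
        return {"status": 500, "body": {"message": f"{type(exc).__name__}: {exc}"}}
```

The client can then show body["message"] rather than an undefined value.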

t3.medium in pulldown

It would be terrific to see the t3 instances in the pulldown for headnode instance types.

VPC Endpoint Support

I'm in a locked-down GovCloud environment with no NAT gateway or public IPs. I do have egress through an internet gateway out to the public internet. With this being said, the nested CloudFormation stack labeled (NESTED pcluster-manager-ParallelClusterApi-132J3P9XLSR4J) is failing. I traced the problem back to ImageBuilder, where there seems to be some sort of SSM agent timeout. I'm not entirely sure what is happening, but I suspect the SSM service is unable to communicate with the agent on the container. I poked through the CloudFormation template a bit but didn't see anywhere I could configure VPC endpoints, which I suspect is what is needed.

The error I get from ImageBuilder is this:

"SSM execution '8787352-dcvd-4325-5235-32525gd' failed with status = 'TimedOut' in state = 'BUILDING' and failure message = 'SSM Agent verification failed'"

New hosted zone every time a cluster is created ?

I am learning PCM and creating/destroying clusters frequently.

I noticed that I get charged for something called HostedZone every time I create a cluster.

Is there a way to reuse it? It gets expensive for short-lived cluster jobs.

image

Cognito Standalone template

Where/how does one enter existing Cognito information on the template? I can't get past this error.....

Parameters: [CognitoUserPoolId, UserPoolAuthDomain] do not exist in the template

ssm doc only applies to us-east-1

I am switching between us-east-1 and us-east-2 and noticed that the SSM doc only applies to the region where the CFN stack is deployed. It would be nice to have a consistent experience across regions from the single manager pane.

Limit EC2 instance choices based on region and AZ

Generally, not all instance types, particularly GPU instances, are available in each AZ. As such, when a user creates a queue, there is no check to confirm if a selected EC2 instance class is available in the selected subnet.

Also, there is no check to see if instances are available at a region level. Example: create a cluster in us-east-1, but HPC6a is listed as an option.
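
A minimal sketch of the check being requested, assuming the offerings list has the shape returned by the EC2 DescribeInstanceTypeOfferings API (entries with InstanceType, LocationType, and Location). The sample data in the usage note is hypothetical; in practice the list would come from a call such as boto3's ec2.describe_instance_type_offerings, queried per region or per availability zone:

```python
def types_offered_in(offerings, location):
    """Return the sorted instance types offered in one region or AZ."""
    return sorted({o["InstanceType"] for o in offerings if o["Location"] == location})


def is_offered(offerings, instance_type, location):
    """True if the instance type is offered in the given region or AZ."""
    return instance_type in types_offered_in(offerings, location)
```

For example, given hypothetical offerings of c5n.9xlarge in us-east-1a and hpc6a.48xlarge in us-east-2b, is_offered(offerings, "hpc6a.48xlarge", "us-east-1a") is False, so the UI could hide that option for a subnet in us-east-1a.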

Job Scheduling - Error: undefined

Receiving "Error: undefined" when selecting "Job Scheduling" menu under a cluster. This seemed to be fixed by adding an additional policy to the head node

  • Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

After adding that, it still does not show the Slurm queues after clicking the "Submit Job" button.

Shows "Queue" and "[ANY]"

image

Invalid cluster configuration: Cannot find official ParallelCluster AMI

I've deployed the CloudFormation stack provided in the README into our account. Whenever I log in to the pcluster-manager URL and try to deploy our first cluster, I get this error:

Invalid cluster configuration: Cannot find official ParallelCluster AMI

Everything deployed related to parallelcluster was done via the cloudformation stack in the readme.

Any help would be greatly appreciated.

Slurm Accounting

I'm following the instructions to add Slurm accounting and am unable to locate the SecretsManagerPolicy, only SecretsManagerReadWrite.

blank login screen

I successfully ran the pcluster-manager CloudFormation template but when logging into the created "UserPoolDomain", all I get is a blank screen. This happens both when I'm on the vpn and when I'm not.

pcluster manager - shutdown procedure ?

Hi,

I have successfully launched pcluster manager.

As there are many nested resources, is there a single point of entry where we can do a clean shutdown, to ensure that all the resources spun up for a given pcluster are gone and do not continue to be charged to our account?

Cheers

Filter custom AMIs by date

Custom images cannot be listed by the creation date property.

This feature would be useful to identify the latest images or locate images based on the creation date without looking at the details for every AMI.
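
A minimal sketch of the requested ordering, assuming each image record carries a CreationDate string in the format that EC2 DescribeImages returns (for example "2022-03-15T19:05:54.000Z"). Whether the manager's image list exposes the field under that name is an assumption:

```python
from datetime import datetime


def newest_first(images):
    """Sort image records by their CreationDate, most recent first."""
    def created(image):
        # Parse the UTC timestamp, e.g. "2022-03-15T19:05:54.000Z".
        return datetime.strptime(image["CreationDate"], "%Y-%m-%dT%H:%M:%S.%fZ")

    return sorted(images, key=created, reverse=True)
```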

DCV settings fields editable?

The web UI provides 2 fields with pre-filled values (they are not greyed out)

Are the values editable ?

Not sure if it is a bug or this requires some other settings enabled/disable for the fields to be editable.

image

Sanity checking on root volume size

The manager let me enter an arbitrary value (e.g. 20 GB), and then cluster creation fails.

  LocalStorage:
    RootVolume:
      Size: 20
Invalid cluster configuration.
Validation Errors:
ConfigSchemaValidator: [('HeadNode', {'LocalStorage': {'RootVolume': {'Size': ['Root volume size 20 is invalid. It must be at least 35.']}}})]

Would be preferable to have sanity checking on the input fields to prevent this.
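
A minimal sketch of such an input check. The minimum of 35 is taken from the validator message quoted above; the real minimum may depend on the AMI in use, and the function name is hypothetical:

```python
MIN_ROOT_VOLUME_GIB = 35  # from the ConfigSchemaValidator message above


def root_volume_errors(size_gib):
    """Return a list of validation messages; an empty list means the size is OK."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(size_gib, bool) or not isinstance(size_gib, int):
        return ["Root volume size must be a whole number of GiB."]
    if size_gib < MIN_ROOT_VOLUME_GIB:
        return [
            f"Root volume size {size_gib} is invalid. "
            f"It must be at least {MIN_ROOT_VOLUME_GIB}."
        ]
    return []
```

Running this on the form value before submission would catch the 20 GB case without a round trip to the cluster validator.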

DCV Function from main cluster page

Attempting to use the DCV function from the main cluster page results in a blank screen (trying to update). Tried both on the vpn and not on the vpn. The shell button works fine. Can't find any red flags in the logs but I will keep looking.

Ubuntu 20.04 cluster - shell error

I created an Ubuntu 20.04 compute node cluster

When I shell (using the web ui) into the head node, I get the following errors

if [ -d '/opt/parallelcluster' ]; then source /opt/parallelcluster/cfnconfig; sudo su - $cfn_cluster_user; fi; /bin/bash
$ sh: 1: source: not found

Manually ssh into the headnode is fine.

Cheers
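
The "sh: 1: source: not found" message is what dash prints, and dash is the default /bin/sh on Ubuntu. The source builtin is a bash extension; the POSIX spelling is a single dot. A minimal sketch of building the same login command in a portable form, assuming the command is assembled as a string before being sent to the node (the function name is hypothetical):

```python
def login_command(user_var="cfn_cluster_user"):
    """Build the head-node login command using the POSIX '.' builtin."""
    return (
        "if [ -d /opt/parallelcluster ]; then "
        ". /opt/parallelcluster/cfnconfig; "  # '.' works in dash and bash alike
        f"sudo su - ${user_var}; fi; /bin/bash"
    )
```

The resulting string differs from the one in the report only in replacing "source" with ".", so it behaves the same under bash.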

Create User : bad request ?

On the user tab/page, when I click on Create User, a red banner flashes by mentioning a bad request.

Is this a bug, or is there some other setup I need to perform before I am allowed to create a user?

Cheers

Update from CFN console

I would love to have the ability to upgrade to the latest pcluster-manager by providing a new version number in the CFN console:

image

Resource handler returned message: "Resource dependency error: The resource ARN 'arn:aws:imagebuilder:us-east-2:XXXXXXX:image/importpublicecrimage-3-1-1-YYYYYYY/3.1.1/1' has a running workflow. (Service: Imagebuilder, Status Code: 400, Request ID: ZZZZZZZ, Extended Request ID: null)" (RequestToken: ZZZZZZ, HandlerErrorCode: GeneralServiceException)

Custom Images - Building Wizard ?

Are there plans to have a wizard or a more elaborate UI to help with building a custom AMI?

Currently, the UI just asks for a YAML configuration.

I think a wizard would also help spread best practices through the YAML it generates.

Cheers

SecretsManager IAM Policy

For Slurm accounting, an AWS managed policy, SecretsManagerReadWrite, is attached to allow for the injection of the Slurm database password into the Slurm configuration.

Suggest we create a policy that is more restrictive (it only needs to be read-only) and attach that as part of the accounting multi-script runner option.
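
A minimal sketch of what such a read-only policy document could look like, scoped to a single secret. The two actions are standard Secrets Manager read actions; the helper name and the ARN in the usage note are hypothetical, and whether DescribeSecret is actually needed by the accounting script is an assumption:

```python
import json


def read_only_secret_policy(secret_arn):
    """Return an IAM policy document granting read-only access to one secret."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                ],
                "Resource": secret_arn,
            }
        ],
    }


def policy_json(secret_arn):
    """Serialize the policy for pasting into a template or the console."""
    return json.dumps(read_only_secret_policy(secret_arn), indent=2)
```

For example, policy_json("arn:aws:secretsmanager:us-east-2:111122223333:secret:slurm-db-password-AbCdEf") yields a document that grants no write or delete permissions.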

Custom AMI Images - persistence ?

Hi,

If I go through the process of creating a custom AMI with all the required tools and configuration, I believe I should be able to create all subsequent clusters from that custom AMI.

However, if I were to shut down the pcluster manager stack, what will happen to those custom AMIs? Will they be deleted? Or will they appear the next time I create a new pcluster manager stack?

If the custom AMIs built by pcluster manager are not persisted, does it make sense to build custom images outside of pcluster manager? How will pcluster manager know of their existence?

Cheers

Bad Request in email link

Hi,

I created a stack in the ca-central region yesterday

I have tried connecting via the link sent via the email from AWS/Cloudformation

I have still been unable to utilize the pcluster manager service

I get the following HTTP error when connecting to the link

image

Cheers

Number of IAM Policies

For the Slurm accounting in the post-install-scripts branch, we need to attach an additional IAM policy (SecretsManagerReadWritePolicy) to the role used by PclusterManagerFunction (see documentation here: https://aws-samples.github.io/pcluster-manager/02-tutorials/02-slurm-accounting.html)

That role has 10 policies already attached, which is the default max. This can be increased, but suggest trying to consolidate some of the policies.

Alternatively, the documentation should be updated to reflect the required limit increase.

parallelcluster version

Is there a way to pick the version of parallelcluster when launching with pcluster-manager?

mssh connection : how to specify ssh key?

I am trying out a connection via

image

It requires my ssh key but there is no option to provide that via the mssh command line tool

Maybe I am going about it the wrong way.

DCV : Direct connection from my local machine ?

I am able to utilise DCV via the DCV button on pcluster manager. It opens up DCV in the browser and I can interact graphically with it.

However, I was wondering whether I can also connect to my head node via DCV from the NICE DCV application on my local desktop.

I tried but I get the following error

image

head node status implication on cluster restart ?

I created a cluster head-node + 2 x compute-nodes

I stopped the cluster via the web ui

I stopped the head-node via the web ui

Is it required (by design?) to start the head-node first (via web ui) before re-starting the cluster (via web ui)

or merely re-starting the cluster will also trigger the restarting of the head-node ?

Cheers

Remove CentOS 7 Option in GovCloud

image

CentOS 7 isn't available in GovCloud; modify the drop-down to offer just alinux and ubuntu.

If you select CentOS 7, it fails validation with a cryptic message:

image

sinfo : command not found, why ?

I just spun up a cluster and chose Slurm.

I ssh to the head node to run sinfo to check on the cluster.

It says sinfo not found.

I created different clusters about 12 hours ago and a couple of days ago, and sinfo was installed on those.

Not sure if this is a regression or something else.

Cheers

s3 bucket access denied

image

Not clear which role I need to attach S3 actions to. Can you point me in the right direction?

pcluster manager - cost

Hi,

If I only have a cluster manager running (no cluster created), what is the approximate cost (per hour or per day)?

Is there also a breakdown available of one-off (setup) and ongoing (EC2/ECR) costs?

Cheers
