metaflow-tools's Introduction

Metaflow Tools

Devops tools and utilities for operating Metaflow in the cloud.

This git repository is archived, but the tools continue to be actively maintained at outerbounds/metaflow-tools.

metaflow-tools's People

Contributors

crk-codaio, greghilstonhop, jasonge27, jimmycfa, kldavis4, msavela, oavdeev, queueburt, romain-intel, savingoyal

metaflow-tools's Issues

Task crashed due to CannotPullContainerError: invalid reference format

Hi,
While testing tutorial 05, I found a problem with (I think) the Docker step.
The attached image shows a problem with the Docker image when executing python 05-helloaws/helloaws.py run.

I configured Metaflow with metaflow configure aws.
The configuration:
{
  "METAFLOW_DEFAULT_DATASTORE": "s3",
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://stackmetaflow-metaflows3bucket-[REMOVED_BY_SECURITY]/metaflow",
  "METAFLOW_DATATOOLS_SYSROOT_S3": "s3://stackmetaflow-metaflows3bucket-[REMOVED_BY_SECURITY]/metaflow/data",
  "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:us-west-2:[REMOVED_BY_SECURITY]:job-queue/job-queue-stackMetaflow",
  "METAFLOW_BATCH_CONTAINER_IMAGE": "3.9.0a2-alpine3.10",
  "METAFLOW_BATCH_CONTAINER_REGISTRY": "https://index.docker.io/v1/",
  "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::[REMOVED_BY_SECURITY]6:role/stackMetaflow-BatchS3TaskRole-[REMOVED_BY_SECURITY]",
  "METAFLOW_DEFAULT_METADATA": "service",
  "METADATA_SERVICE_URL": "https://[REMOVED_BY_SECURITY].execute-api.us-west-2.amazonaws.com/api/"
}
Although the documentation says that both the container image and the registry are optional, the config tool requires both.

I suppose the configurator expects the tag of a Python image.

The registry URL was obtained from the output of docker info.
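A plausible cause of the CannotPullContainerError: invalid reference format (an assumption, not confirmed in this thread) is that METAFLOW_BATCH_CONTAINER_IMAGE holds only a tag rather than a full image reference; Docker cannot resolve a bare tag. A minimal sketch of the relevant settings, assuming the intent was the official python image on Docker Hub:

```json
{
  "METAFLOW_BATCH_CONTAINER_IMAGE": "python:3.9.0a2-alpine3.10",
  "METAFLOW_BATCH_CONTAINER_REGISTRY": "https://index.docker.io/v1/"
}
```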

Add some release tags

To be able to use Terraform modules directly from git, it would be nice to have some version tags in the repo.
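Once tags exist, a module source could be pinned to one. A hypothetical sketch; the module subdirectory and the tag name are assumptions, not actual paths or releases of this repo:

```hcl
module "metaflow" {
  # "?ref=" pins the module to a specific git tag instead of the default branch
  source = "git::https://github.com/Netflix/metaflow-tools.git//aws/terraform?ref=v0.1.0"
}
```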

Have a nice standalone terraform module for Metaflow

Right now we have a pretty good terraform config in #26; however, a couple of things would make it even easier to use:

  • Make all the generic bits like VPC resources separate and optional. Many orgs already have a VPC in place, so the Metaflow terraform should play nicely with those. For example, imagine someone already using terraform-aws-vpc who now wants to deploy Metaflow.
  • To extend on that, it would be great if the Metaflow terraform were a standalone module usable by just importing it from GitHub. I think we're 95% there; just some bits need to be moved from infra to metaflow.
  • (I imagine Sagemaker stuff would still be optional and separate from the "main" module)

cfn template error

Template contains errors.: Template format error: YAML not well-formed. (line 103, column 65)
This error occurs when I create a stack.

Lack of disk space on default AWS Batch ComputeEnvironment

Running training jobs on AWS Batch often results in a lack of disk space, as the default EBS volume size is quite low (8 GB). Would it be possible to add an input parameter to increase the root EBS volume of the EC2 instances? I have tested it out, and it requires adding an EC2 Launch Template to the ComputeEnvironment setup. Happy to create a PR if required.
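The change could look roughly like the following. This is a minimal sketch, assuming a CloudFormation template in the style of the one quoted later in this page; the resource name, the 100 GiB size, and the /dev/xvda device name are assumptions that would need to match the AMI in use:

```yaml
  BatchLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        BlockDeviceMappings:
          # Enlarge the root volume; the device name must match the ECS-optimized AMI
          - DeviceName: /dev/xvda
            Ebs:
              VolumeSize: 100   # GiB; in practice exposed as a template parameter
              VolumeType: gp2
  # Then reference it from ComputeEnvironment -> ComputeResources:
  #       LaunchTemplate:
  #         LaunchTemplateId: !Ref BatchLaunchTemplate
```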

CloudFormation template failure: Invalid IamInstanceProfile: ecsInstanceRole

Attempting to run the template produces an error on a brand new AWS account. The error is

Operation failed, ComputeEnvironment went INVALID with error: CLIENT_ERROR - Invalid IamInstanceProfile: ecsInstanceRole

when attempting to set up the ComputeEnvironment with Batch. Googling led me to this question:

https://stackoverflow.com/questions/46265278/aws-batch-client-error-invalid-iaminstanceprofile

This indicated that the Instance Role: setting needs to reference an AWS::IAM::InstanceProfile rather than an AWS::IAM::Role.

Following the advice there and adding an InstanceProfile that referenced that role seems to have worked:

  BatchInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: '/'
      Roles:
        - !Ref BatchExecutionRole
  ComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      ServiceRole: !GetAtt 'BatchExecutionRole.Arn'
      ComputeEnvironmentName: !Join [ "-", [ 'batch-compute-env', !Ref 'AWS::StackName' ] ]
      ComputeResources:
        MaxvCpus: !Ref MaxVCPUBatch
        SecurityGroupIds:
          - !Ref SagemakerSecurityGroup
        Type: EC2
        Subnets:
          - !Ref Subnet1
          - !Ref Subnet2
        MinvCpus: !Ref MinVCPUBatch
        InstanceRole: !GetAtt BatchInstanceProfile.Arn
        InstanceTypes:
          - c4.large
          - c4.xlarge
          - c4.2xlarge
          - c4.4xlarge
          - c4.8xlarge
        DesiredvCpus: !Ref DesiredVCPUBatch
      State: ENABLED

However, given my relative inexperience with CloudFormation, I figured I would bring this up as an issue instead of a PR.

Using CFn auto-generated resource name in CFn template

Problem

CloudFormation (CFn) has a limitation: when an AWS resource name is hard-coded instead of letting CFn auto-generate it, updating the template via a change set can fail, as described in https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-custom-name/.

While working on Netflix/metaflow#251, I added p2/p3 instances to the CFn template, and re-running the change set failed due to this limitation.

Request

Please consider letting CFn auto-generate the AWS resource names.
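Concretely, this would mean dropping name properties such as ComputeEnvironmentName so CFn can generate a unique name itself. A minimal sketch, assuming a ComputeEnvironment like the one quoted elsewhere on this page:

```yaml
  ComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: MANAGED
      # No ComputeEnvironmentName: CloudFormation generates a unique name,
      # so change sets that require resource replacement can succeed.
      ServiceRole: !GetAtt BatchExecutionRole.Arn
```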

CloudFormation template fails.

31 UTC+0100 MetaflowStack ROLLBACK_IN_PROGRESS The following resource(s) failed to create: [DefaultGateway, BatchS3TaskRole, ComputeEnvironment, NLB, Subnet1RTA, SageMakerExecutionRole, MetadataSvcECSTaskRole, Subnet2RTA]. . Rollback requested by user.
15:58:30 UTC+0100 ComputeEnvironment CREATE_FAILED An error occurred (ClientException) when calling the CreateComputeEnvironment operation: Error executing request, Exception : Instance type can only be one of [m5d.8xlarge, r4, r5, optimal, r5d.24xlarge, r4.16xlarge, r5a.2xlarge, m5.xlarge, m5ad.12xlarge, r5ad.large, i3.4xlarge, r5ad.12xlarge, m5a.24xlarge, m5.large, c5d.large, r5.large, m5d.2xlarge, c5.large, g4dn.4xlarge, g4dn.2xlarge, c5.metal, m5.2xlarge, r5d.4xlarge, c5d.24xlarge, r5d.metal, x1.32xlarge, c5, g4dn.xlarge, m5d.12xlarge, m5.metal, m5ad.xlarge, m5.12xlarge, m5a.8xlarge, c5d.9xlarge, m5ad.4xlarge, c5d.2xlarge, r5ad.4xlarge, d2, m5ad.24xlarge, r5d.2xlarge, r5ad, c5.24xlarge, r5.16xlarge, r4.xlarge, r5.xlarge, m5a.2xlarge, m5ad.large, r4.8xlarge, m5d.xlarge, m5.8xlarge, c5.2xlarge, m5d.16xlarge, m5a.xlarge, r4.2xlarge, i3.2xlarge, r5ad.24xlarge, m5, m5d.4xlarge, r5d.large, m5a.12xlarge, c5d, c5d.xlarge, i3.xlarge, r5ad.xlarge, r5a.8xlarge, d2.4xlarge, r5a.16xlarge, m5.16xlarge, m5d.large, r5a.xlarge, c5.12xlarge, r5.8xlarge, r5d.xlarge, c5.4xla

CFN template fails in China regions

Issue description:
For China regions:

  1. API Gateway does not support the Edge-optimized endpoint configuration, which causes the error Endpoint Configuration type EDGE is not supported in this region: cn-northeast-1
  2. For new AWS accounts, the default allowed notebook instance types are ml.t2.medium and ml.t3.medium, so this template will fail due to the notebook instance limitation.
  3. Wrong endpoint URL formats for API Gateway and the SageMaker notebook.

Solutions:

  1. Specify the EndpointConfiguration under the Api / Properties section:

  Api:
    DependsOn: VpcLink
    Type: 'AWS::ApiGateway::RestApi'
    Properties:
      EndpointConfiguration:
        Types:
          - REGIONAL

  2. Update the instance types:

Parameters:
  SagemakerInstance:
    Type: String
    Default: ml.t3.medium
    AllowedValues: ['ml.t3.medium', 'ml.t2.medium']
    Description: 'Instance type for Sagemaker Notebook.'

  3. Fix the endpoint URL formats:
    • apigw: https://${Api}.execute-api.${AWS::Region}.amazonaws.com.cn/api/
    • sagemaker notebook: https://${SageMakerNotebookInstance.NotebookInstanceName}.notebook.${AWS::Region}.sagemaker.com.cn/tree

Missing IAM permissions to run Metaflow commands from SageMaker notebook

I was attempting to use the SageMaker notebook instance configured as part of the Metaflow setup on AWS to run the Metaflow tutorial commands; however, it lacked the permissions to make AWS Batch and Step Functions API calls.

Not sure if this is by design, as those commands are all supposed to be run from a local machine.

Invalid bucket name.

Hi,
Deploying the CloudFormation template and testing tutorial 05, I get:

Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9-]{1,63}$"). Retrying 7 more times..

The bucket name is created from the name of the stack.
The ARN is obtained as in the picture below.

The ARN is: arn:aws:s3:::stackmetaflow-metaflows3bucket-[REMOVED-BY-SECURITY]

I have configured the AWS credentials (Linux client using aws2) as well as the bucket (metaflow configure aws).

What am I doing wrong?
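One plausible cause (an assumption, not confirmed in this issue): METAFLOW_DATASTORE_SYSROOT_S3 expects an s3:// URL, not a bucket ARN, and an empty or ARN-valued setting would produce Invalid bucket name "". A sketch of the expected form, with a placeholder bucket name:

```json
{
  "METAFLOW_DEFAULT_DATASTORE": "s3",
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://stackmetaflow-metaflows3bucket-EXAMPLE/metaflow"
}
```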

METAFLOW_SERVICE_URL always return "Missing Authentication Token" when accessed in browser

I'm currently experimenting with Metaflow. I followed the documentation and was able to deploy an AWS setup with the given CloudFormation template.
My question is: why am I always getting

              message: "Missing Authentication Token"

when I access METAFLOW_SERVICE_URL in the browser, even though I made sure that APIBasicAuth was set to false during the creation of the CFn stack?
Shouldn't this setting make the metadata service accessible without the authentication/API key?
How can I resolve this? Or is this expected, i.e. I cannot really view the metadata service URL via a browser?
Thanks in advance.
