
microsoft / frameworkcontroller


General-Purpose Kubernetes Pod Controller

License: MIT License

Shell 7.95% Dockerfile 2.39% Go 89.66%
frameworklauncher kubernetes-controller kubernetes container-orchestration container-management containers go tensorflow

frameworkcontroller's Introduction

This project is now in maintenance mode: it is read-only and will be updated only for critical fixes.

Microsoft OpenPAI FrameworkController


As a standalone component of Microsoft OpenPAI, FrameworkController (FC) is built to orchestrate all kinds of applications on Kubernetes with a single controller, especially Deep Learning applications.

These kinds of applications include, but are not limited to:

Why Need It

Problem

The open source community has many specialized Kubernetes Pod controllers, each built for a specific kind of application, such as Kubernetes StatefulSet Controller, Kubernetes Job Controller, KubeFlow TensorFlow Operator, and KubeFlow PyTorch Operator. However, none of them is built for all kinds of applications, and even a combination of the existing ones cannot support some kinds of applications. As a result, we have to learn, use, develop, deploy and maintain many Pod controllers.

Solution

Build a General-Purpose Kubernetes Pod Controller: FrameworkController.

We then get the following benefits from it:

Architecture

Feature

Framework Feature

A Framework represents an application with a set of Tasks:

  1. Executed by Kubernetes Pod
  2. Partitioned to different heterogeneous TaskRoles which share the same lifecycle
  3. Ordered in the same homogeneous TaskRole by TaskIndex
  4. With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName
  5. With fine-grained ExecutionType to Start/Stop the whole Framework
  6. With fine-grained RetryPolicy for each Task and the whole Framework
  7. With fine-grained FrameworkAttemptCompletionPolicy for each TaskRole
  8. With PodGracefulDeletionTimeoutSec for each Task to tune Consistency vs Availability
  9. With fine-grained Status for each TaskAttempt/Task, each TaskRole and the whole FrameworkAttempt/Framework
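
As a rough illustration of how the items above map onto a Framework object, here is a minimal sketch. The field names follow the project's published examples as best recalled, and the API version v1, the names demo, worker and main, the image, and all values are assumptions chosen for illustration.

```yaml
# Sketch only: names, image and values are hypothetical.
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
  name: demo                      # FrameworkName
spec:
  executionType: Start            # switch to Stop to stop the whole Framework
  retryPolicy:                    # RetryPolicy for the whole Framework
    fancyRetryPolicy: false
    maxRetryCount: 0
  taskRoles:
  - name: worker                  # TaskRoleName
    taskNumber: 2                 # TaskIndex 0 and 1, so Pods demo-worker-0 and demo-worker-1
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: 2
    task:
      retryPolicy:                # RetryPolicy for each Task
        fancyRetryPolicy: false
        maxRetryCount: 0
      podGracefulDeletionTimeoutSec: 600
      pod:                        # an ordinary Kubernetes Pod template
        spec:
          restartPolicy: Never
          containers:
          - name: main
            image: ubuntu:22.04
            # FC_TASKROLE_NAME and FC_TASK_INDEX are injected by the controller
            command: ["sh", "-c", "echo running $FC_TASKROLE_NAME-$FC_TASK_INDEX"]
```

With these names, the Pods would be expected to follow the {FrameworkName}-{TaskRoleName}-{TaskIndex} pattern described in item 4.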

Controller Feature

  1. Highly generalized as it is built for all kinds of applications
  2. Light-weight as it is only responsible for Pod orchestration
  3. Well-defined Framework Consistency vs Availability, State Machine and Failure Model
  4. Tolerates unexpected Pod/ConfigMap deletion and Node/Network/FrameworkController/Kubernetes failure
  5. Supports specifying how to classify and summarize Pod failures
  6. Supports ScaleUp/ScaleDown of a Framework with Strong Safety Guarantee
  7. Supports exposing Framework and Pod history snapshots to external systems
  8. Easy to leverage FrameworkBarrier to achieve light-weight Gang Execution and Service Discovery
  9. Easy to leverage HiveDScheduler to achieve GPU Topology-Aware, Multi-Tenant, Priority and Gang Scheduling
  10. Compatible with other Kubernetes features, such as Kubernetes Service, GPU Scheduling, Volume, Logging
  11. Idiomatic with Kubernetes official controllers, such as Pod Spec
  12. Aligned with Kubernetes Controller Design Guidelines and API Conventions

Prerequisite

  1. A Kubernetes cluster, v1.16.15 or above, on-cloud or on-premise.

Quick Start

  1. Run Controller
  2. Submit Framework

Doc

  1. Deep Dive Slides
  2. User Manual
  3. Known Issue and Upcoming Feature
  4. FAQ
  5. Release Note

Official Image

Related Project

Third Party Controller Wrapper

A specialized wrapper can be built on top of FrameworkController to optimize for a specific kind of application:

Recommended Kubernetes Scheduler

FrameworkController can directly leverage many Kubernetes Schedulers. Among them, we recommend these as the best fits:

Similar Offering On Other Cluster Manager

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

frameworkcontroller's People

Contributors

microsoft-github-policy-service[bot], microsoftopensource, msftgits, scarlett2018, warmchang, xudifsd, yqwang-ms

frameworkcontroller's Issues

[Question] Node selector for framework

I am working with NNI on k8s.

Thanks to frameworkcontroller, working with NNI has become much easier.

Recently I have been working hard on building a k8s training system, and I have a question about frameworkcontroller.

Frameworkcontroller consumes Frameworks to create Pods for NNI, so I edited the Framework format in the NNI code to schedule Pods onto the desired nodes.

To do this, I added a node selector to the Framework format in the NNI code, but it did not work.

Is there a guide or a way to schedule Pods from Frameworks onto desired nodes, like a node selector?
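
Because each Task carries an ordinary Kubernetes Pod template, a node selector would normally sit inside the Task's Pod spec. The fragment below is a sketch of only the relevant part of a Framework; the label key and value, the container name and the command are hypothetical, and whether NNI passes such a field through to the Framework it generates is not confirmed here.

```yaml
# Fragment of a Framework spec (sketch; label and names are hypothetical).
spec:
  taskRoles:
  - name: worker
    taskNumber: 1
    task:
      pod:
        spec:
          # Standard Pod-level field: schedule only onto nodes carrying this label.
          nodeSelector:
            kubernetes.io/hostname: gpu-node-1
          restartPolicy: Never
          containers:
          - name: framework
            image: msranni/nni:latest
            command: ["sh", "-c", "echo scheduled on the selected node"]
```

One way to check the result is the Node-Selectors field in the output of kubectl describe pod for the created Pod.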

BarrierUnknownFailed. nni pod remains permanently in Init state

https://nni.readthedocs.io/en/stable/TrainingService/FrameworkControllerMode.html
I followed these instructions to train my model with NNI + Kubernetes.

The nniexp pod remains in the Init state, and the experiment never starts.

kubectl describe pod nniexp returns the following messages.

Controlled By:  ConfigMap/nniexpshhbauzetrialhm50l-attempt
Init Containers:
  frameworkbarrier:
    Container ID:  docker://e4cc0c27a6ca206698dbcee976010a4b039f7d0e0aed8e3eaf1ebe32a186caca
    Image:         frameworkcontroller/frameworkbarrier
    Image ID:      docker-pullable://frameworkcontroller/frameworkbarrier@sha256:4f56b0f70d060ab610bc72d994311432565143cd4bb2613916425f8f3e80c69f
    Port:          <none>
    Host Port:     <none>
    State:         Running
      Started:     Fri, 03 Sep 2021 13:43:04 +0900
    Last State:    Terminated
      Reason:      Error
      Message:     Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexpshhbauzetrialhm50l" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 04:42:40.232306       9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexpshhbauzetrialhm50l" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 04:42:50.259713       9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexpshhbauzetrialhm50l" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 04:43:00.259436       9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexpshhbauzetrialhm50l" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
W0903 04:43:00.260439       9 barrier.go:253] Failed to get Framework object from ApiServer: frameworks.frameworkcontroller.microsoft.com "nniexpshhbauzetrialhm50l" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
E0903 04:43:00.260455       9 barrier.go:283] BarrierUnknownFailed: frameworks.frameworkcontroller.microsoft.com "nniexpshhbauzetrialhm50l" is forbidden: User "system:serviceaccount:default:default" cannot get resource "frameworks" in API group "frameworkcontroller.microsoft.com" in the namespace "default"
E0903 04:43:00.260466       9 barrier.go:470] ExitCode: 1: Exit with unknown failure to tell controller to retry within maxRetryCount.

      Exit Code:    1
      Started:      Fri, 03 Sep 2021 13:33:00 +0900
      Finished:     Fri, 03 Sep 2021 13:43:00 +0900
    Ready:          False
    Restart Count:  4
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexpshhbauzetrialhm50l
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexpshhbauzetrialhm50l-attempt
      FC_POD_NAME:                        nniexpshhbauzetrialhm50l-worker-0
      FC_FRAMEWORK_UID:                   68bb7a39-2209-407c-9c37-5e4e449a9afb
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_acfe3e48-99ad-44fb-88bc-9024c99b01c3
      FC_CONFIGMAP_UID:                   acfe3e48-99ad-44fb-88bc-9024c99b01c3
      FC_TASKROLE_UID:                    c57048a6-0c6b-11ec-b7f1-0242ac110006
      FC_TASK_UID:                        c5704927-0c6b-11ec-b7f1-0242ac110006
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wthsw (ro)
Containers:
  framework:
    Container ID:
    Image:         msranni/nni:latest
    Image ID:
    Port:          4000/TCP
    Host Port:     0/TCP
    Command:
      sh
      /tmp/mount/nni/SHhbAuZE/hm50L/run_worker.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexpshhbauzetrialhm50l
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexpshhbauzetrialhm50l-attempt
      FC_POD_NAME:                        nniexpshhbauzetrialhm50l-worker-0
      FC_FRAMEWORK_UID:                   68bb7a39-2209-407c-9c37-5e4e449a9afb
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_acfe3e48-99ad-44fb-88bc-9024c99b01c3
      FC_CONFIGMAP_UID:                   acfe3e48-99ad-44fb-88bc-9024c99b01c3
      FC_TASKROLE_UID:                    c57048a6-0c6b-11ec-b7f1-0242ac110006
      FC_TASK_UID:                        c5704927-0c6b-11ec-b7f1-0242ac110006
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /tmp/mount from nni-vol (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wthsw (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True

I pasted only part of the messages. If you need the full log, I can upload it later.

I executed the programs in this order:

  1. launch minikube with --driver=none, api-server=127.0.0.1
  2. launch k8s-nvidia-daemonset
  3. launch framework controller
  4. launch nnictl

My minikube config is here:

apiVersion: v1
clusters:
- cluster:
    certificate-authority: /home/gwlee/.minikube/ca.crt
    extensions:
    - extension:
        last-update: Fri, 03 Sep 2021 09:23:56 KST
        provider: minikube.sigs.k8s.io
        version: v1.22.0
      name: cluster_info
    server: https://localhost:8443
  name: minikube
contexts:
- context:
    cluster: minikube
    extensions:
    - extension:
        last-update: Fri, 03 Sep 2021 09:23:56 KST
        provider: minikube.sigs.k8s.io
        version: v1.22.0
      name: context_info
    namespace: default
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: minikube
  user:
    client-certificate: /home/gwlee/.minikube/profiles/minikube/client.crt
    client-key: /home/gwlee/.minikube/profiles/minikube/client.key
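
The forbidden error in the log above says that the ServiceAccount the Pod runs as, system:serviceaccount:default:default, is not allowed to get frameworks. The following is a minimal sketch of RBAC objects that would grant that read access, assuming it is acceptable to grant it to the default ServiceAccount of the default namespace. The binding name frameworkbarrier-default is hypothetical; the role mirrors the get, list and watch verbs used for frameworkbarrier elsewhere on this page.

```yaml
# Sketch: read-only access to Framework objects for the ServiceAccount named in the error.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: frameworkbarrier
rules:
- apiGroups: ["frameworkcontroller.microsoft.com"]
  resources: ["frameworks"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: frameworkbarrier-default   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: frameworkbarrier
subjects:
- kind: ServiceAccount
  name: default                    # the account shown in the error message
  namespace: default
```

An alternative with the same effect would be to run the trial Pods under a dedicated ServiceAccount that already has this role bound, rather than widening the default one.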

[Question] How to raise the size of shared memory when running NNI frameworkcontroller?

I use frameworkcontroller as the NNI training platform, so this question may not really belong here. Sorry about that, but I really want to find the answer.

As you may know, the PyTorch DataLoader uses shared memory, and a Pod on k8s gets only 64MB of shared memory.

That is not sufficient in real scenarios, so I want to increase the size of shared memory.

NNI has no option for this, so I tried editing frameworkcontroller.yaml as in the YAML below, but it did not work.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller-shm
  namespace: default
spec:
  serviceName: frameworkcontroller-shm
  selector:
    matchLabels:
      app: frameworkcontroller-shm
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller-shm
    spec:
      # Using the ServiceAccount with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller-shm
      containers:
      - name: frameworkcontroller-shm
        image: frameworkcontroller/frameworkcontroller
        volumeMounts:
        - name: shm-volume
          mountPath: /dev/shm
      volumes:
        - name: shm-volume
          emptyDir:
            medium: Memory

Is there any solution or idea to solve this problem?
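
Note that the manifest above changes the controller's own Pod, and volumes are defined per Pod, so the /dev/shm mount there would not reach the trial Pods. A sketch of where the same volume would go instead, in the Pod template of a Framework Task, is shown below. The sizeLimit value, the container name and the command are hypothetical, and whether NNI lets you inject these fields into the Framework it generates is not confirmed here.

```yaml
# Fragment of a Framework spec (sketch; names and sizes are hypothetical).
spec:
  taskRoles:
  - name: worker
    taskNumber: 1
    task:
      pod:
        spec:
          restartPolicy: Never
          containers:
          - name: framework
            image: msranni/nni:latest
            command: ["sh", "-c", "df -h /dev/shm"]   # print the shared memory size
            volumeMounts:
            - name: shm-volume
              mountPath: /dev/shm
          volumes:
          - name: shm-volume
            emptyDir:
              medium: Memory        # memory-backed, replaces the small default /dev/shm
              sizeLimit: 2Gi        # hypothetical upper bound
```

Memory used by a memory-backed emptyDir counts against the container's memory limit, so the limit may need to grow along with the shared memory size.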

Having trouble when tried NNI with FrameworkController on-premise k8s

I ran into a problem when I tried NNI with FrameworkController on an on-premise k8s cluster.

I followed this document:
https://nni.readthedocs.io/en/stable/TrainingService/FrameworkControllerMode.html

  • Create the ServiceAccount and ClusterRoleBinding:
kubectl create serviceaccount frameworkcontroller --namespace default
kubectl create clusterrolebinding frameworkcontroller \
  --clusterrole=cluster-admin \
  --user=system:serviceaccount:default:frameworkcontroller
  • Then set up the StatefulSet (framework-config.yml):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the ServiceAccount with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        #env:
        #- name: KUBE_APISERVER_ADDRESS
        #  value: {http[s]://host:port}
        #- name: KUBECONFIG
        #  value: ~/.kube/config

kubectl apply -f framework-config.yml

Then I ran kubectl get po:

[image]

and kubectl logs frameworkcontroller-0:

[image]

It said:

updateRemoteFrameworkStatus: 
Failed: Framework.frameworkcontroller.microsoft.com "nniexpqfwzydmkenvfpwjw" is invalid:
spec.taskRoles.task.podGracefulDeletionTimeoutSec: Invalid value: "null":
spec.taskRoles.task.podGracefulDeletionTimeoutSec in body must be of type integer: "null"

What should I do to solve this problem?

Having trouble nni with frameworkcontroller on k8s again

Hi, I ran into a new problem using NNI with frameworkcontroller on k8s and opened an issue at the link below,

but it has not received an answer for a long time.
Is there anyone who can help solve it?

Thanks!

microsoft/nni#4588 (comment)


Details

Describe the issue:
When I tried nni with frameworkcontroller on k8s, I used these yaml files

  • I tried nfs

for nni config
config_framework.yml

authorName: default
experimentName: example_mnist_pytorch
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai, kubeflow
trainingServicePlatform: frameworkcontroller
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
assessor:
  builtinAssessorName: Medianstop
  classArgs:
    optimize_mode: maximize
trial:
  codeDir: .
  taskRoles:
    - name: worker
      taskNum: 1
      command: python3 mnist.py
      gpuNum: 0
      cpuNum: 1
      memoryMB: 8192
      image: msranni/nni:latest
      frameworkAttemptCompletionPolicy:
        minFailedTaskCount: 3
        minSucceededTaskCount: 1
frameworkcontrollerConfig:
  storage: nfs
  nfs:
    # Your NFS server IP, like 10.10.10.10
    server: 192.168.1.106
    # Your NFS server export path, like /var/nfs/nni
    path: /home/mj_lee/mount
  serviceAccountName: frameworkcontroller

The frameworkcontroller StatefulSet (frameworkcontroller-with-default-config.yaml):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the ServiceAccount with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        env:
        #- name: KUBE_APISERVER_ADDRESS
        #  value: {http[s]://host:port}
          - name: KUBECONFIG
            value: ~/.kube/config

Then I ran the command below to create the StatefulSet:

kubectl apply -f frameworkcontroller-with-default-config.yaml

frameworkcontroller-0 then went into the Running state:

[image]

Next I ran the nnictl command:

nnictl create --config config_framework.yml

A new experiment worker pod was created,
but it failed to run:

[image]

When I checked the logs with kubectl logs nniexp~:

[image]

So I checked the NFS mount directory.
There is no nni directory in it, only an envs directory and a run.sh file:

[image]

I think NNI should have created nni/experiment_id/run.sh in the mount folder.

Here is the kubectl describe output for the nniexp-worker-0 pod:

Name:         nniexpr2ys5f9aenvzchoa-worker-0
Namespace:    default
Priority:     0
Node:         zerooneai-p210908-4/192.168.1.104
Start Time:   Fri, 25 Feb 2022 14:33:07 +0900
Labels:       FC_FRAMEWORK_NAME=nniexpr2ys5f9aenvzchoa
              FC_TASKROLE_NAME=worker
              FC_TASK_INDEX=0
Annotations:  FC_CONFIGMAP_NAME: nniexpr2ys5f9aenvzchoa-attempt
              FC_CONFIGMAP_UID: 0be50971-55f8-434a-bfe4-6b47d64212eb
              FC_FRAMEWORK_ATTEMPT_ID: 0
              FC_FRAMEWORK_ATTEMPT_INSTANCE_UID: 0_0be50971-55f8-434a-bfe4-6b47d64212eb
              FC_FRAMEWORK_NAME: nniexpr2ys5f9aenvzchoa
              FC_FRAMEWORK_NAMESPACE: default
              FC_FRAMEWORK_UID: 2c55ec33-69b8-43a4-a643-84ff4e0604b2
              FC_POD_NAME: nniexpr2ys5f9aenvzchoa-worker-0
              FC_TASKROLE_NAME: worker
              FC_TASKROLE_UID: 751bf95b-c6bd-4dd0-aafe-e160f9c10220
              FC_TASK_ATTEMPT_ID: 0
              FC_TASK_INDEX: 0
              FC_TASK_UID: 27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
              cni.projectcalico.org/podIP: 10.0.243.33/32
              cni.projectcalico.org/podIPs: 10.0.243.33/32
Status:       Running
IP:           10.0.243.33
IPs:
  IP:           10.0.243.33
Controlled By:  ConfigMap/nniexpr2ys5f9aenvzchoa-attempt
Init Containers:
  frameworkbarrier:
    Container ID:   docker://b05885b647cdb41dba4587f6f93eeb5bd19a390641687012bc017d73cc21aa79
    Image:          frameworkcontroller/frameworkbarrier
    Image ID:       docker-pullable://frameworkcontroller/frameworkbarrier@sha256:9d95e31152460e3cc5c7ad2b09738c1fdb540ff7a50abc72b2f8f9d0badb87da
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 25 Feb 2022 14:33:12 +0900
      Finished:     Fri, 25 Feb 2022 14:33:22 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexpr2ys5f9aenvzchoa
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexpr2ys5f9aenvzchoa-attempt
      FC_POD_NAME:                        nniexpr2ys5f9aenvzchoa-worker-0
      FC_FRAMEWORK_UID:                   2c55ec33-69b8-43a4-a643-84ff4e0604b2
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_CONFIGMAP_UID:                   0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_TASKROLE_UID:                    751bf95b-c6bd-4dd0-aafe-e160f9c10220
      FC_TASK_UID:                        27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Containers:
  framework:
    Container ID:  docker://dc9ff6579c67e8bc394c734e8add70fbdb581d014541044d1877a9e5d888f828
    Image:         msranni/nni:latest
    Image ID:      docker-pullable://msranni/nni@sha256:8985fb134204ef523e113ac4a572ae7460cd246a5ff471df413f7d17dd917cd1
    Port:          4000/TCP
    Host Port:     0/TCP
    Command:
      sh
      /tmp/mount/nni/r2ys5f9a/run.sh
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   sh: 0: Can't open /tmp/mount/nni/r2ys5f9a/run.sh

      Exit Code:    127
      Started:      Fri, 25 Feb 2022 14:36:43 +0900
      Finished:     Fri, 25 Feb 2022 14:36:43 +0900
    Ready:          False
    Restart Count:  5
    Limits:
      cpu:     1
      memory:  8Gi
    Requests:
      cpu:     1
      memory:  8Gi
    Environment:
      FC_FRAMEWORK_NAMESPACE:             default
      FC_FRAMEWORK_NAME:                  nniexpr2ys5f9aenvzchoa
      FC_TASKROLE_NAME:                   worker
      FC_TASK_INDEX:                      0
      FC_CONFIGMAP_NAME:                  nniexpr2ys5f9aenvzchoa-attempt
      FC_POD_NAME:                        nniexpr2ys5f9aenvzchoa-worker-0
      FC_FRAMEWORK_UID:                   2c55ec33-69b8-43a4-a643-84ff4e0604b2
      FC_FRAMEWORK_ATTEMPT_ID:            0
      FC_FRAMEWORK_ATTEMPT_INSTANCE_UID:  0_0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_CONFIGMAP_UID:                   0be50971-55f8-434a-bfe4-6b47d64212eb
      FC_TASKROLE_UID:                    751bf95b-c6bd-4dd0-aafe-e160f9c10220
      FC_TASK_UID:                        27edb2eb-a5ca-4e65-8a2c-57dd15cabccb
      FC_TASK_ATTEMPT_ID:                 0
      FC_POD_UID:                          (v1:metadata.uid)
      FC_TASK_ATTEMPT_INSTANCE_UID:       0_$(FC_POD_UID)
    Mounts:
      /mnt/frameworkbarrier from frameworkbarrier-volume (rw)
      /tmp/mount from nni-vol (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from frameworkcontroller-token-7sw6q (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  nni-vol:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    192.168.1.106
    Path:      /home/zerooneai/mj_lee/mount
    ReadOnly:  false
  frameworkbarrier-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  frameworkcontroller-token-7sw6q:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  frameworkcontroller-token-7sw6q
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  6m19s                 default-scheduler  Successfully assigned default/nniexpr2ys5f9aenvzchoa-worker-0 to zerooneai-p210908-4
  Normal   Pulling    6m18s                 kubelet            Pulling image "frameworkcontroller/frameworkbarrier"
  Normal   Pulled     6m15s                 kubelet            Successfully pulled image "frameworkcontroller/frameworkbarrier" in 3.364620261s
  Normal   Created    6m14s                 kubelet            Created container frameworkbarrier
  Normal   Started    6m14s                 kubelet            Started container frameworkbarrier
  Normal   Pulled     6m1s                  kubelet            Successfully pulled image "msranni/nni:latest" in 2.375328373s
  Normal   Pulled     5m56s                 kubelet            Successfully pulled image "msranni/nni:latest" in 4.709013579s
  Normal   Pulled     5m36s                 kubelet            Successfully pulled image "msranni/nni:latest" in 2.373976028s
  Normal   Pulling    5m9s (x4 over 6m4s)   kubelet            Pulling image "msranni/nni:latest"
  Normal   Created    5m7s (x4 over 6m1s)   kubelet            Created container framework
  Normal   Pulled     5m7s                  kubelet            Successfully pulled image "msranni/nni:latest" in 2.484752039s
  Normal   Started    5m6s (x4 over 6m1s)   kubelet            Started container framework
  Warning  BackOff    71s (x22 over 5m54s)  kubelet            Back-off restarting failed container

Please let me know how to solve this problem. Thanks!

Environment:

  • NNI version: 2.6
  • Training service (local|remote|pai|aml|etc): frameworkcontroller
  • Client OS: ubuntu 18.04
  • Server OS (for remote mode only):
  • Python version: 3.6.9
  • PyTorch/TensorFlow version: 1.10.1+cu102

Failed to put CRD error, frameworkcontroller frequently restarts

I set up a Kubernetes cluster with kubeadm.

I have one master node and one worker node.

I use Calico as the pod network.

I ran into a problem when setting up frameworkcontroller.

Frameworkcontroller restarts frequently (about every minute), and I found this message in the output of kubectl logs frameworkcontroller-0:

I1028 02:01:51.888234      10 controller.go:207] Initializing frameworkcontroller
I1028 02:01:51.888637      10 controller.go:210] With Config:
kubeApiServerAddress: https://localhost:40443
kubeConfigFilePath: ""
kubeClientQps: 200
kubeClientBurst: 300
workerNumber: 500
largeFrameworkCompression: true
crdEstablishedCheckIntervalSec: 1
crdEstablishedCheckTimeoutSec: 60
objectLocalCacheCreationTimeoutSec: 300
frameworkCompletedRetainSec: 2592000
frameworkMinRetryDelaySecForTransientConflictFailed: 60
frameworkMaxRetryDelaySecForTransientConflictFailed: 900
logObjectSnapshot:
  framework:
    onFrameworkRetry: true
    onFrameworkDeletion: true
  task:
    onTaskRetry: true
    onTaskDeletion: true
  pod:
    onPodDeletion: true
podFailureSpec: []
I1028 02:01:51.889430      10 controller.go:427] Recovering frameworkcontroller
E1028 02:02:21.890073      10 runtime.go:69] Observed a panic: &errors.errorString{s:"Failed to put CRD: Get https://localhost:40443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/frameworks.frameworkcontroller.microsoft.com: dial tcp: i/o timeout"} (Failed to put CRD: Get https://localhost:40443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/frameworks.frameworkcontroller.microsoft.com: dial tcp: i/o timeout)
/go/src/github.com/microsoft/frameworkcontroller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/microsoft/frameworkcontroller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/microsoft/frameworkcontroller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:522
/go/src/github.com/microsoft/frameworkcontroller/pkg/internal/utils.go:66
/go/src/github.com/microsoft/frameworkcontroller/pkg/controller/controller.go:428
/go/src/github.com/microsoft/frameworkcontroller/cmd/frameworkcontroller/main.go:35
/usr/local/go/src/runtime/proc.go:200
/usr/local/go/src/runtime/asm_amd64.s:1337
E1028 02:02:21.890143      10 panic.go:522] Stopping frameworkcontroller
panic: Failed to put CRD: Get https://localhost:40443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/frameworks.frameworkcontroller.microsoft.com: dial tcp: i/o timeout [recovered]
        panic: Failed to put CRD: Get https://localhost:40443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/frameworks.frameworkcontroller.microsoft.com: dial tcp: i/o timeout

goroutine 1 [running]:
github.com/microsoft/frameworkcontroller/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/microsoft/frameworkcontroller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x11f84e0, 0xc0003f8110)
        /usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/microsoft/frameworkcontroller/pkg/internal.PutCRD(0xc000338960, 0xc0003438c0, 0xc000047590, 0xc000047598)
        /go/src/github.com/microsoft/frameworkcontroller/pkg/internal/utils.go:66 +0x173
github.com/microsoft/frameworkcontroller/pkg/controller.(*FrameworkController).Run(0xc000107290, 0xc0000e4960)
        /go/src/github.com/microsoft/frameworkcontroller/pkg/controller/controller.go:428 +0x157
main.main()
        /go/src/github.com/microsoft/frameworkcontroller/cmd/frameworkcontroller/main.go:35 +0x47

I used the example YAML from the frameworkcontroller guide: https://github.com/Microsoft/frameworkcontroller/tree/master/example/run

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: frameworkcontroller
  namespace: default
spec:
  serviceName: frameworkcontroller
  selector:
    matchLabels:
      app: frameworkcontroller
  replicas: 1
  template:
    metadata:
      labels:
        app: frameworkcontroller
    spec:
      # Using the ServiceAccount with granted permission
      # if the k8s cluster enforces authorization.
      serviceAccountName: frameworkcontroller
      containers:
      - name: frameworkcontroller
        image: frameworkcontroller/frameworkcontroller
        # Using k8s inClusterConfig, so usually, no need to specify
        # KUBE_APISERVER_ADDRESS or KUBECONFIG
        env:
        - name: KUBE_APISERVER_ADDRESS
          value: https://localhost:40443 #{http[s]://host:port}
        #- name: KUBECONFIG
        #  value: {Pod Local KubeConfig File Path}
        command: [
          "bash", "-c",
          "cp /frameworkcontroller-config/frameworkcontroller.yaml . &&
          ./start.sh"]
        volumeMounts:
        - name: frameworkcontroller-config
          mountPath: /frameworkcontroller-config
      volumes:
      - name: frameworkcontroller-config
        configMap:
          name: frameworkcontroller-config
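
The manifest above sets KUBE_APISERVER_ADDRESS to https://localhost:40443, and the panic reports a dial timeout on that same address. Inside a Pod that does not use host networking, localhost refers to the Pod itself rather than to the node, so one thing worth checking (an assumption, not a diagnosis confirmed in this thread) is whether the configured address points at a loopback host. A minimal Go sketch of that check follows; pointsAtLoopback is a hypothetical helper and not part of frameworkcontroller, and 10.96.0.1 is only an illustrative in-cluster address.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// pointsAtLoopback reports whether the host part of an API server address
// is "localhost" or a loopback IP such as 127.0.0.1 or ::1.
// Hypothetical helper for illustration; not part of frameworkcontroller.
func pointsAtLoopback(address string) (bool, error) {
	u, err := url.Parse(address)
	if err != nil {
		return false, err
	}
	host := u.Hostname()
	if host == "" {
		// Happens, for example, when the scheme (http:// or https://) is missing.
		return false, fmt.Errorf("no host found in %q", address)
	}
	if strings.EqualFold(host, "localhost") {
		return true, nil
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback(), nil
}

func main() {
	// The first address is the one from the manifest above;
	// the second is an assumed in-cluster address used only for contrast.
	for _, addr := range []string{"https://localhost:40443", "https://10.96.0.1:443"} {
		loopback, err := pointsAtLoopback(addr)
		fmt.Println(addr, loopback, err)
	}
}
```

If the check reports a loopback address, removing the KUBE_APISERVER_ADDRESS entry so that the controller falls back to the inClusterConfig, as the comment in the manifest suggests, may be worth trying.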

Additionally, frameworkbarrier cannot find the custom resource frameworks. My guess is that frameworkcontroller is not running properly, so it never creates the frameworks custom resource.

I am also including the entire script I use to launch frameworkcontroller:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

sleep 10

kubectl create serviceaccount frameworkcontroller --namespace default
kubectl create clusterrolebinding frameworkcontroller \
  --clusterrole=cluster-admin \
  --user=system:serviceaccount:default:frameworkcontroller

sleep 5

#kubectl create -f frameworkcontroller-with-default-config.yaml

# custom config
kubectl create -f frameworkcontroller-customized-config.yaml
kubectl create -f frameworkcontroller-with-customized-config.yaml

sleep 15

kubectl create serviceaccount frameworkbarrier --namespace default
kubectl create clusterrole frameworkbarrier --verb=get,list,watch --resource=frameworks
kubectl create clusterrolebinding frameworkbarrier --clusterrole=frameworkbarrier --user=system:serviceaccount:default:frameworkbarrier

Does frameworkcontroller work on kubeadm (not minikube)?

I set up my Kubernetes cluster with kubeadm.

I had already succeeded in setting up frameworkcontroller with minikube.

So I first set up the master node and joined the worker node to it.

Then I ran these commands:

#kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml


kubectl create serviceaccount frameworkcontroller --namespace default
kubectl create clusterrolebinding frameworkcontroller \
  --clusterrole=cluster-admin \
  --user=system:serviceaccount:default:frameworkcontroller


kubectl create -f frameworkcontroller-with-default-config.yaml


kubectl create serviceaccount frameworkbarrier --namespace default
kubectl create clusterrole frameworkbarrier --verb=get,list,watch --resource=frameworks
kubectl create clusterrolebinding frameworkbarrier --clusterrole=frameworkbarrier --user=system:serviceaccount:default:frameworkbarrier

After running these commands, the frameworkcontroller pod falls into the CrashLoopBackOff state.

This is the output of kubectl describe pod frameworkcontroller-0:

Name:         frameworkcontroller-0
Namespace:    default
Priority:     0
Node:         mofl-c246-wu4/192.168.0.28
Start Time:   Mon, 27 Sep 2021 16:58:34 +0900
Labels:       app=frameworkcontroller
              controller-revision-hash=frameworkcontroller-7697d48ff7
              statefulset.kubernetes.io/pod-name=frameworkcontroller-0
Annotations:  <none>
Status:       Running
IP:           172.17.0.8
IPs:
  IP:           172.17.0.8
Controlled By:  StatefulSet/frameworkcontroller
Containers:
  frameworkcontroller:
    Container ID:   docker://c69a07541214fe32fae92a148255d8582d632c01bceb5cd2f887d4dafb86d7ea
    Image:          frameworkcontroller/frameworkcontroller
    Image ID:       docker-pullable://frameworkcontroller/frameworkcontroller@sha256:27674c36f3e5da2cac1249fb6c4fc318e1ab60227c0680cef82695152b442738
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 27 Sep 2021 17:00:16 +0900
      Finished:     Mon, 27 Sep 2021 17:00:16 +0900
    Ready:          False
    Restart Count:  4
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pdcf2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-pdcf2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  3m11s                 default-scheduler  Successfully assigned default/frameworkcontroller-0 to mofl-c246-wu4
  Normal   Pulled     3m6s                  kubelet            Successfully pulled image "frameworkcontroller/frameworkcontroller" in 2.62030994s
  Normal   Pulled     3m1s                  kubelet            Successfully pulled image "frameworkcontroller/frameworkcontroller" in 2.645024761s
  Normal   Pulled     2m44s                 kubelet            Successfully pulled image "frameworkcontroller/frameworkcontroller" in 2.582812868s
  Normal   Pulled     2m15s                 kubelet            Successfully pulled image "frameworkcontroller/frameworkcontroller" in 2.58364693s
  Normal   Created    2m14s (x4 over 3m5s)  kubelet            Created container frameworkcontroller
  Normal   Started    2m14s (x4 over 3m4s)  kubelet            Started container frameworkcontroller
  Warning  BackOff    106s (x8 over 2m59s)  kubelet            Back-off restarting failed container
  Normal   Pulling    93s (x5 over 3m9s)    kubelet            Pulling image "frameworkcontroller/frameworkcontroller"

Failed to put CRD: the server could not find the requested resource

I'm using the default StatefulSet config on an AKS cluster, Kubernetes version 1.22.4.

Pod "frameworkcontroller-0" keeps failing.

Log:

W0114 06:20:47.365287      10 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0114 06:20:47.366946      10 controller.go:427] Recovering frameworkcontroller
I0114 06:20:47.398683      10 utils.go:108] Create CRD frameworks.frameworkcontroller.microsoft.com
E0114 06:20:47.409570      10 runtime.go:69] Observed a panic: &errors.errorString{s:"Failed to put CRD: the server could not find the requested resource"} (Failed to put CRD: the server could not find the requested resource)
/go/src/github.com/microsoft/frameworkcontroller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/microsoft/frameworkcontroller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/microsoft/frameworkcontroller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:522
/go/src/github.com/microsoft/frameworkcontroller/pkg/internal/utils.go:66
/go/src/github.com/microsoft/frameworkcontroller/pkg/controller/controller.go:428
/go/src/github.com/microsoft/frameworkcontroller/cmd/frameworkcontroller/main.go:35
/usr/local/go/src/runtime/proc.go:200
/usr/local/go/src/runtime/asm_amd64.s:1337
E0114 06:20:47.409597      10 panic.go:522] Stopping frameworkcontroller
panic: Failed to put CRD: the server could not find the requested resource [recovered]
        panic: Failed to put CRD: the server could not find the requested resource

goroutine 1 [running]:
github.com/microsoft/frameworkcontroller/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/microsoft/frameworkcontroller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x11f84e0, 0xc0003fc260)
        /usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/microsoft/frameworkcontroller/pkg/internal.PutCRD(0xc00032e3c0, 0xc00033f080, 0xc000341098, 0xc0003410a0)
        /go/src/github.com/microsoft/frameworkcontroller/pkg/internal/utils.go:66 +0x173
github.com/microsoft/frameworkcontroller/pkg/controller.(*FrameworkController).Run(0xc0000b93f0, 0xc0000a1140)
        /go/src/github.com/microsoft/frameworkcontroller/pkg/controller/controller.go:428 +0x157
main.main()
        /go/src/github.com/microsoft/frameworkcontroller/cmd/frameworkcontroller/main.go:35 +0x47
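
The report above runs on Kubernetes 1.22.4, and Kubernetes stopped serving the apiextensions.k8s.io/v1beta1 CustomResourceDefinition API in v1.22. An earlier log in this thread shows the controller requesting a v1beta1 CRD path, so if this image still creates its CRD through v1beta1 (an assumption, not verified for this exact image), the API server would answer that it could not find the requested resource. A minimal Go sketch of that version check follows; v1beta1CRDRemoved is a hypothetical helper and not part of frameworkcontroller.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// v1beta1CRDRemoved reports whether a Kubernetes version such as "1.22.4"
// or "v1.21.0" is at or past v1.22, the release that stopped serving
// apiextensions.k8s.io/v1beta1 CustomResourceDefinition.
// Hypothetical helper for illustration; not part of frameworkcontroller.
func v1beta1CRDRemoved(version string) (bool, error) {
	parts := strings.SplitN(strings.TrimPrefix(version, "v"), ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected version %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, fmt.Errorf("bad major version in %q: %v", version, err)
	}
	// Some providers report the minor version with a trailing "+", e.g. "22+".
	minor, err := strconv.Atoi(strings.TrimRight(parts[1], "+"))
	if err != nil {
		return false, fmt.Errorf("bad minor version in %q: %v", version, err)
	}
	return major > 1 || (major == 1 && minor >= 22), nil
}

func main() {
	// "1.22.4" is the version from the report above; "1.21.0" is for contrast.
	for _, v := range []string{"1.22.4", "1.21.0"} {
		removed, err := v1beta1CRDRemoved(v)
		fmt.Println(v, removed, err)
	}
}
```

When the check returns true, a controller build that registers its CRD through apiextensions.k8s.io/v1 would be needed; whether such a build exists for this project is not stated in this thread.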
