
Comments (7)

withsmilo commented on August 27, 2024

@dcrankshaw
This issue is critical for our product deployment system. We resolved it by applying a 'blue-green' deployment policy to our system. I implemented our own SwarmContainerManager, inspired by DockerContainerManager. The deployment flow is:

  • we call python_deployer.deploy_python_closure().
    • [python_deployer] calls build_and_deploy_model() of clipper_admin.
      • [clipper_admin] calls self.build_model().
      • [clipper_admin] calls self.deploy_model().
      • [clipper_admin] calls deploy_model() of SwarmContainerManager.
        • [SwarmContainerManager] creates a new swarm service.
        • [SwarmContainerManager] checks whether
          • all tasks of the new swarm service are in the 'running' state, and
          • the replica count of the new swarm service matches the predefined replica count.
        • [SwarmContainerManager] adds it to the metric config.
      • [clipper_admin] calls self.register_model() to register the new model.
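The readiness check in the steps above can be sketched as follows. This is only an illustration, not the actual SwarmContainerManager code: `service_is_ready`, `wait_for_service`, and the `get_task_states` callback are hypothetical names. The callback is assumed to return the current state string of each task of the new service; how it is obtained from Swarm is left to the caller.

```python
import time
from typing import Callable, List


def service_is_ready(task_states: List[str], expected_replicas: int) -> bool:
    """True when the task count matches and every task reports 'running'."""
    return (
        len(task_states) == expected_replicas
        and all(state == "running" for state in task_states)
    )


def wait_for_service(
    get_task_states: Callable[[], List[str]],  # hypothetical callback
    expected_replicas: int,
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
) -> bool:
    """Poll until the service is ready; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if service_is_ready(get_task_states(), expected_replicas):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

Only when `wait_for_service` returns True would the caller go on to register the new model; on False it could remove the new service and keep the old one serving.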

In my experience, the time needed to initialize a new swarm service varies widely, so I think Clipper may need a routine that checks a new model's status before registering it with Clipper.


rmdort commented on August 27, 2024

Maybe clipper_manager could have functions to undeploy old models. I was thinking that when a new model is successfully deployed, Clipper could automatically stop the old containers/models.

Alternatively, undeploy could be called manually from the Clipper manager.

Maybe a few new APIs:

  1. Undeploy a model
  2. Remove application
  3. Pause application


dcrankshaw commented on August 27, 2024

Yeah, agreed. Have you found that Swarm's container status is sufficient to indicate whether a container is actually ready? We've found that for several deep learning models, especially when running on a GPU, a container needs a non-trivial amount of time after it has started to initialize the model and connect to Clipper. Using the underlying container manager to detect when the containers are ready, as you do, would be relatively simple to implement, but I was worried that it would not be sufficient.


dcrankshaw commented on August 27, 2024

@chester-leung For a first version of a fix, you should modify the container_manager.deploy_model function to block until the container is actually running. Right now, we start the container and return immediately instead of waiting until it is fully running. You'll need to implement this for both the Kubernetes container manager and the Docker container manager.
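For the Docker side, the blocking behaviour could look roughly like the sketch below. It is not Clipper's actual code: `wait_until_running` is a hypothetical helper, and it assumes only that the container object has a `reload()` method and a `status` attribute, as container objects in the Docker SDK for Python do.

```python
import time


def wait_until_running(container, timeout_s=30.0, poll_interval_s=0.5):
    """Block until the container reports 'running'.

    Raises RuntimeError if the container stops first and TimeoutError if it
    is still not running when the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        container.reload()  # refresh the cached state
        if container.status == "running":
            return
        if container.status in ("exited", "dead"):
            raise RuntimeError(
                "container stopped before running: %s" % container.status
            )
        if time.monotonic() >= deadline:
            raise TimeoutError("container not running after %.1fs" % timeout_s)
        time.sleep(poll_interval_s)
```

Raising on 'exited' or 'dead' lets deploy_model fail fast instead of waiting out the full timeout for a container that has already crashed.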


withsmilo commented on August 27, 2024

@dcrankshaw
Thank you for your advice. I agree with you. How about using Docker's healthcheck option in the Dockerfile and then checking the health status in our *ContainerManager to decide whether a container is ready?

According to the Docker reference:

When a container has a healthcheck specified, it has a health status in addition to its normal status. This status is initially starting. Whenever a health check passes, it becomes healthy (whatever state it was previously in). After a certain number of consecutive failures, it becomes unhealthy.
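A minimal sketch of how a container manager could read that health status, assuming it has the container's inspect data as a dict (for example `container.attrs` in the Docker SDK for Python). The helper names `health_status` and `is_ready` are hypothetical, and falling back to the plain 'running' state when no healthcheck is configured is a design assumption, not something decided in this thread.

```python
def health_status(inspect_attrs):
    """Return 'starting', 'healthy', or 'unhealthy'.

    Returns None when the container has no healthcheck configured, in which
    case the inspect data has no State.Health entry.
    """
    health = inspect_attrs.get("State", {}).get("Health")
    if not health:
        return None
    return health.get("Status")


def is_ready(inspect_attrs):
    """Treat a container as ready once its healthcheck passes."""
    status = health_status(inspect_attrs)
    if status is None:
        # Assumption: without a healthcheck, fall back to the container state.
        return inspect_attrs.get("State", {}).get("Status") == "running"
    return status == "healthy"
```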


dcrankshaw commented on August 27, 2024

That's a good idea. @chester-leung for a first step, let's get a version working that just looks at the container state. As a second step, we can modify the RPC implementation to write a file somewhere once the container has connected to Clipper, and have the healthcheck test for that file.
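The second step might be sketched like this. Everything here is an assumption for illustration: the function names, the marker path, and the point in the RPC code where `mark_connected` would be called are not taken from Clipper. A Dockerfile healthcheck such as `HEALTHCHECK CMD test -f /model_ready` (path hypothetical) could then test for the file.

```python
import os
import tempfile


def mark_connected(marker_path):
    """Create the marker file once the container has connected to Clipper.

    The file is written to a temporary name and then renamed, so a
    healthcheck never observes a partially written marker.
    """
    directory = os.path.dirname(marker_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write("connected\n")
    os.replace(tmp_path, marker_path)


def is_connected(marker_path):
    """What the healthcheck would test: does the marker file exist?"""
    return os.path.isfile(marker_path)
```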


chester-leung commented on August 27, 2024

So far, I've implemented a fix that makes the container manager sleep until all added containers are deemed ready. This should fix the problem of a model being queried in a new container before that container is fully functional.

