fastmachinelearning / gw-iaas

Deep learning inference-as-a-service tools and pipelines for gravitational wave physics

License: MIT License

Languages: Python 88.31%, Makefile 0.27%, Batchfile 0.33%, Jupyter Notebook 9.76%, CSS 0.23%, Shell 0.45%, Dockerfile 0.64%
Topics: deep-learning, inference, gravitational-waves, ml-infrastructure

Contributors: alecgunny, drankincms, ethanmarx

gw-iaas's Issues

Simplify and organize container builds

Right now, container builds for individual projects work as follows: each project's Dockerfile lives in the project's root directory, and the root of the whole repository is used as the build context so that local dependencies can be copied into the image for installation via Poetry. These dependencies, including the code for the project itself, need to be added explicitly in multiple COPY statements. The advantages here are that

  • Dockerfiles get to live with the applications they're intended to execute, keeping things organized
  • Rebuilding is only required when one of the dependency directories changes
  • Dependency code can be volume-mapped from the host into the container at runtime for easier development
  • Each project can use the Poetry/Python version required for its purposes (also potentially a disadvantage, see below)

However, the disadvantages are that

  • Individual Dockerfiles are less clear, since the COPY statements will be relative to the build context root and not the directory containing the Dockerfile (which is not obvious unless you inspect the CI yamls)
  • Specifying each project as a dependency to itself is redundant. Even having to specify the local dependencies is redundant, since technically we should already know these from the project's pyproject.toml or poetry.lock (see the sketch after this list)
  • Images are needlessly bloated by requiring that all the source code be added and live in the container forever, even though in production all we need are the built wheels
  • As the code base grows, sending the entire repo as the build context to the Docker engine could become really onerous
  • No guarantees that projects are built against the same Poetry and Python versions
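
For illustration, here's a minimal sketch of how those local dependencies could be discovered programmatically rather than enumerated by hand (assuming Python >= 3.11 for tomllib; the function name is hypothetical):

import pathlib
import tomllib


def local_dependencies(project: pathlib.Path) -> list[pathlib.Path]:
    with open(project / "pyproject.toml", "rb") as f:
        config = tomllib.load(f)

    # Poetry marks local dependencies with a "path" key, e.g.
    # hermes = { path = "../../libs/hermes", develop = true }
    deps = config["tool"]["poetry"]["dependencies"]
    return [
        (project / dep["path"]).resolve()
        for dep in deps.values()
        if isinstance(dep, dict) and "path" in dep
    ]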

Possible Solutions

Use Makefiles as outlined here

Advantages

  • Necessary dependencies and applications themselves are installed into containers automatically
  • make ensures that rebuilds only happen when the relevant libraries change
  • Build contexts are isolated to project directories, reducing the size of the context
  • Only copying built libraries reduces the size of the image
  • Projects are all built against the same (local to build) Python and Poetry versions

Disadvantages

  • Extra dependency on make and familiarity with Makefile syntax
  • Makefile syntax in tools makes certain assumptions about the relative directory depths of applications and libraries
  • Dockerfiles depend on products of local builds, defeating the purpose of isolated container environments
  • Dockerfiles provide almost no clarity about what's going into them

Global base image, project-specific base and build images

Build begins with a global build image which adds libraries and installs the desired Poetry version

ARG PYTHON_TAG
FROM python:${PYTHON_TAG}
ARG POETRY_VERSION
RUN python -m pip install poetry==${POETRY_VERSION}
COPY libs /opt/gw-iaas/libs

built by

docker build -t build --build-arg PYTHON_TAG=<tag> --build-arg POETRY_VERSION=<version> .

Then for individual projects, the build starts with a Python script that builds all the dependency wheels via something like the following (making docker a dependency in the root pyproject.toml):

import argparse
import re
import pathlib
from io import BytesIO

import docker


parser = argparse.ArgumentParser()
parser.add_argument("--project", required=True, type=str, help="Path to project")
args = parser.parse_args()
project = pathlib.Path(args.project)

dockerfile = """
FROM build
COPY . /opt/build
RUN set +x \
        \
        && mkdir /opt/lib \
"""

with open(project / "poetry.lock", "r") as f:
    lockfile = f.read()

def add(line):
    # append a blank continuation line plus the command itself,
    # continuing the RUN instruction in the template above
    global dockerfile
    dockerfile += start + "\\"
    dockerfile += start + f"&& {line} \\"

start = "\n" + " " * 8
root = "/opt/gw-iaas/libs"
for dep in re.findall("<regex for local deps>", lockfile):
    add("cd {root}/{dep}")
    add("poetry build")
    add("cp dist/*.whl /opt/lib")

add("cd /opt/build")
add("poetry build")
# drop the trailing " \" left over from the final add()
dockerfile = dockerfile[:-2]

client = docker.from_env()
build_image, _ = client.images.build(
    fileobj=BytesIO(dockerfile.encode()),
    tag=f"{project}:build"
)

client.images.build(
    path=project / "Dockerfile",
    tag=str(project)
)

client.images.remove(build_image.id)

then individual project Dockerfiles would include lines like

COPY --from=<project>:build /opt/lib/*.whl .
RUN pip install *.whl && rm *.whl

Advantages

  • Unifies and isolates Poetry and Python environments used for builds
  • Automates addition of dependencies and project code
  • Project builds have local contexts and COPY paths are relative to Dockerfile location
  • Only installing wheels reduces the size of images

Disadvantages

  • Addition of extra host dependencies
  • Easy for CI, but local builds become more complicated (could solve with a Makefile?)
  • Haven't tested this so no idea if it will actually work
  • The Python script obscures what's going into the container and makes builds less reproducible (the script depends on the host environment)

`BrokenPipeError` in `hermes.stillwater.process.PipelineProcess`

Trying to close a hermes.stillwater.process.PipelineProcess too quickly will cause a BrokenPipeError from the self.in_q.close() call in the __exit__ method, due to python/cpython#80025. This should be fixed and backported as of python/cpython#31913, but I'm not sure most releases have picked it up yet, so for now it might be worth inserting a time.sleep, as the linked issue suggests, to avoid this error. Once the fix lands, I don't think we'll need to clear the queue manually anymore (if we ever did...).
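
A minimal sketch of that workaround; the surrounding method structure here is illustrative, not the actual source:

import time


class PipelineProcess:
    ...

    def __exit__(self, exc_type, exc_value, traceback):
        # give the queue's feeder thread a moment to flush its
        # buffer before closing; closing too quickly can raise
        # BrokenPipeError (python/cpython#80025). The 0.1s value
        # is a guess and may need tuning.
        time.sleep(0.1)
        self.in_q.close()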

Incidentally, we're also not closing the out_q of PipelineProcess objects during __exit__. This is probably related to the asynchronous communication between processes: if we're __exit__ing due to an error, we'd like to raise that error rather than try to put on a closed queue and get a ValueError that we don't know whether to catch or raise, because it might indicate a real problem. I think it makes sense to close all the queues and then do some logic (sketched after the list below) around

  1. Why we're __exit__ing (is there an error, and was it raised by this process?)
  2. Why a get or a put on a queue might raise a ValueError (are we stopped now? Or should we always assume it's due to another process exiting, and trust that users won't manually close these queues accidentally?)
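
One possible shape for item 2, with all names hypothetical:

class PipelineProcess:
    ...

    def safe_put(self, q, item):
        try:
            q.put(item)
        except ValueError:
            # a put on a closed queue raises ValueError
            if self.stopped:
                # we're shutting down anyway, so this is expected
                return
            # otherwise assume a peer process exited and begin
            # shutting this one down too
            self.stop()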

Add `TimeLag` output model

Rather than averaging over predictions, one server-side streaming output option could be a TimeLag model that returns a slice of the kernel of some length update_size, taken lag seconds in from the edge of the kernel rather than at the edge itself. This could be useful for models like DeepClean, where there's not much benefit to aggregating near the centers of kernels but prediction quality is poorer near the edges.
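
A minimal sketch of what this might look like, assuming kernels shaped (batch, channels, time) and with lag already converted to samples:

import torch


class TimeLag(torch.nn.Module):
    def __init__(self, update_size: int, lag: int) -> None:
        super().__init__()
        self.update_size = update_size
        # lag in samples, i.e. lag seconds times the sample rate
        self.lag = lag

    def forward(self, kernel):
        # take an update_size-long slice ending `lag` samples in
        # from the right (newest) edge of the kernel, rather than
        # the newest samples at the edge itself
        stop = kernel.shape[-1] - self.lag
        return kernel[..., stop - self.update_size : stop]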

Add tests for `cloudbreak` library

Need both unit tests for verifying basic functionality, as well as integration tests to ensure that appropriate resources are spun up/down on corresponding clouds. Should these latter tests be performed in PR CI, or will this become too costly?

Support windows for weighted averages on output streaming model

Some models with output timeseries like DeepClean experience non-trivial drops in predictive performance near the edges of kernels. It would be useful if the output aggregation model in hermes.quiver supported specification of window functions or other weighting schemes to downweight the contributions of predictions made closer to the edge of the kernel.
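
As an offline sketch of the weighting itself (the actual streaming model would fold this into its snapshot state), overlap-adding consecutive predictions with a Hann window might look like:

import torch


def windowed_average(kernels: torch.Tensor, stride: int) -> torch.Tensor:
    # kernels: (num_kernels, kernel_size), consecutive predictions
    # offset from one another by `stride` samples
    num_kernels, kernel_size = kernels.shape

    # a Hann window is one natural choice; any nonnegative
    # weighting scheme could be swapped in here
    window = torch.hann_window(kernel_size)

    length = (num_kernels - 1) * stride + kernel_size
    num = torch.zeros(length)
    den = torch.zeros(length)
    for i in range(num_kernels):
        start = i * stride
        num[start : start + kernel_size] += kernels[i] * window
        den[start : start + kernel_size] += window

    # the Hann window is zero at its very edges, so clamp to
    # avoid dividing by zero there
    return num / den.clamp(min=1e-9)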

Use updated Triton stateful backend behavior for snapshotter

Triton's documentation now contains a description of a state management feature which should help improve the efficiency of the snapshotter model by removing the need to store and update the snapshot as a weight in the model itself. This in turn decouples the snapshotter from needing to be implemented in TensorFlow. A simple implementation of this might look like

from collections.abc import Iterable

import torch


class Snapshotter(torch.nn.Module):
    def __init__(self, snapshot_size: int, channels: Iterable[int]) -> None:
        super().__init__()
        self.snapshot_size = snapshot_size
        self.channels = channels

    def forward(self, update, snapshot):
        # prepend the new update along the time dimension, dropping
        # the oldest samples to keep the snapshot length fixed
        # (this assumes the newest samples live at the left edge)
        snapshot = torch.cat(
            [update, snapshot[:, :, : -update.shape[-1]]], dim=2
        )
        # split channel-wise into the per-model snapshot states
        snapshots = torch.split(snapshot, list(self.channels), dim=1)
        return tuple(snapshots) + (snapshot,)

And the model config would include a section that looks something like

sequence_batching {
  state [
    {
      input_name: "old_snapshot"
      output_name: "new_snapshot"
      data_type: TYPE_INT32
      dims: [ -1 ]
      initial_state: {
        data_type: TYPE_INT32
        dims: [ 1 ]
        zero_data: true
        name: "initial state"
      }
    }
  ]
}

The actual state naming mechanism still needs to be worked out.

This will require using the latest version of the Triton container, but this is something we should probably move to anyway now that NVIDIA/TensorRT#1587 has been resolved in the latest releases of TensorRT and Triton.
