fastmachinelearning / gw-iaas

Deep learning inference-as-a-service tools and pipelines for gravitational wave physics

License: MIT License

Languages: Python 88.31%, Makefile 0.27%, Batchfile 0.33%, Jupyter Notebook 9.76%, CSS 0.23%, Shell 0.45%, Dockerfile 0.64%
Topics: deep-learning, inference, gravitational-waves, ml-infrastructure

Contributors: alecgunny, drankincms, ethanmarx

gw-iaas's Issues

Simplify and organize container builds

Right now, container builds for individual projects work as follows: each project's Dockerfile lives in the project's root directory, and the root of the whole repository is used as the build context so that local dependencies can be copied into the image for installation via Poetry. These dependencies, including the code for the project itself, need to be added explicitly in multiple COPY statements. The advantages here are that

  • Dockerfiles get to live with the applications they're intended to execute, keeping things organized
  • Rebuilding is only required when one of the dependency directories changes
  • Dependency code can be volume-mapped from the host into the container at runtime for easier development
  • Each project can use the Poetry/Python version required for its purposes (also potentially a disadvantage, see below)

However, the disadvantages are that

  • Individual Dockerfiles are less clear, since the COPY statements will be relative to the build context root and not the directory containing the Dockerfile (which is not obvious unless you inspect the CI yamls)
  • Specifying each project as a dependency to itself is redundant. Even having to specify the local dependencies is redundant, since technically we should already know these from the project's pyproject.toml or poetry.lock (see the sketch after this list)
  • Images are needlessly bloated by requiring that all the source code be added and live in the container forever, even though in production all we need are the built wheels
  • As the code base grows, sending the entire repo as the build context to the Docker engine could become really onerous
  • No guarantees that projects are built against the same Poetry and Python versions
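
For illustration, here's a minimal sketch of how those local dependencies could be discovered programmatically rather than enumerated by hand (assuming Python >= 3.11 for tomllib; the function name is hypothetical):

import pathlib
import tomllib


def local_dependencies(project: pathlib.Path) -> list[pathlib.Path]:
    with open(project / "pyproject.toml", "rb") as f:
        config = tomllib.load(f)

    # Poetry marks local dependencies with a "path" key, e.g.
    # hermes = { path = "../../libs/hermes", develop = true }
    deps = config["tool"]["poetry"]["dependencies"]
    return [
        (project / dep["path"]).resolve()
        for dep in deps.values()
        if isinstance(dep, dict) and "path" in dep
    ]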

Possible Solutions

Use Makefiles as outlined here

Advantages

  • Necessary dependencies and applications themselves are installed into containers automatically
  • make ensures that rebuilds only happen when the relevant libraries change
  • Build contexts are isolated to project directories, reducing the size of the context
  • Only copying built libraries reduces the size of the image
  • Projects are all built against the same (local to build) Python and Poetry versions

Disadvantages

  • Extra dependency on make and familiarity with Makefile syntax
  • Makefile syntax in tools makes certain assumptions about the relative directory depths of applications and libraries
  • Dockerfiles depend on products of local builds, defeating the purpose of isolated container environments
  • Dockerfiles provide almost no clarity about what's going into them

Global base image, project-specific base and build images

Build begins with a global build image which adds libraries and installs the desired Poetry version

ARG PYTHON_TAG
FROM python:${PYTHON_TAG}
ARG POETRY_VERSION
RUN python -m pip install poetry==${POETRY_VERSION}
COPY libs /opt/gw-iaas/libs

built by

docker build -t build --build-arg PYTHON_TAG=<tag> --build-arg POETRY_VERSION=<version> .

Then for individual projects, the build starts with a Python script that builds all the dependency wheels via something like the following (making docker a dependency in the root pyproject.toml):

import argparse
import re
import pathlib
from io import BytesIO

import docker


parser = argparse.ArgumentParser()
parser.add_argument("--project", required=True, type=str, help="Path to project")
args = parser.parse_args()
project = pathlib.Path(args.project)

dockerfile = """
FROM build
COPY . /opt/build
RUN set +x \
        \
        && mkdir /opt/lib \
"""

with open(project / "poetry.lock", "r") as f:
    lockfile = f.read()

def add(line):
    # append a blank continuation line plus the command itself,
    # continuing the RUN instruction in the template above
    global dockerfile
    dockerfile += start + "\\"
    dockerfile += start + f"&& {line} \\"

start = "\n" + " " * 8
root = "/opt/gw-iaas/libs"
for dep in re.findall("<regex for local deps>", lockfile):
    add("cd {root}/{dep}")
    add("poetry build")
    add("cp dist/*.whl /opt/lib")

add("cd /opt/build")
add("poetry build")
# drop the trailing " \" left over from the final add()
dockerfile = dockerfile[:-2]

client = docker.from_env()
build_image, _ = client.images.build(
    fileobj=BytesIO(dockerfile.encode()),
    tag=f"{project}:build"
)

client.images.build(
    path=project / "Dockerfile",
    tag=str(project)
)

client.images.remove(build_image.id)

then individual project Dockerfiles would include lines like

COPY --from=<project>:build /opt/lib/*.whl .
RUN pip install *.whl && rm *.whl

Advantages

  • Unifies and isolates Poetry and Python environments used for builds
  • Automates addition of dependencies and project code
  • Project builds have local contexts and COPY paths are relative to Dockerfile location
  • Only installing wheels reduces the size of images

Disadvantages

  • Addition of extra host dependencies
  • Easy for CI, but local builds become more complicated (could solve with a Makefile?)
  • Haven't tested this so no idea if it will actually work
  • The Python script obscures what's going into the container and makes builds less reproducible (the script depends on the host environment)

`BrokenPipeError` in `hermes.stillwater.process.PipelineProcess`

Trying to close a hermes.stillwater.process.PipelineProcess too quickly will cause a BrokenPipeError from the self.in_q.close() call in the __exit__ method, due to python/cpython#80025. This should be fixed and backported as of python/cpython#31913, but I'm not sure most releases have picked it up yet, so for now it might be worth inserting a time.sleep, as the linked issue suggests, to avoid this error. Once the fix lands, I don't think we'll need to clear the queue manually anymore (if we ever did...).
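
A minimal sketch of that workaround; the surrounding method structure here is illustrative, not the actual source:

import time


class PipelineProcess:
    ...

    def __exit__(self, exc_type, exc_value, traceback):
        # give the queue's feeder thread a moment to flush its
        # buffer before closing; closing too quickly can raise
        # BrokenPipeError (python/cpython#80025). The 0.1s value
        # is a guess and may need tuning.
        time.sleep(0.1)
        self.in_q.close()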

Incidentally, we're also not closing the out_q of PipelineProcess objects during __exit__. This is probably related to the asynchronous communication between processes: if we're __exit__ing due to an error, we'd like to raise that error rather than try to put on a closed queue and get a ValueError that we don't know whether to catch or raise, because it might indicate a real problem. I think it makes sense to close all the queues and then do some logic (sketched after the list below) around

  1. Why we're __exit__ing (is there an error, and was it raised by this process?)
  2. Why a get or a put on a queue might raise a ValueError (are we stopped now? Or should we always assume it's due to another process exiting, and trust that users won't manually close these queues accidentally?)
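
One possible shape for item 2, with all names hypothetical:

class PipelineProcess:
    ...

    def safe_put(self, q, item):
        try:
            q.put(item)
        except ValueError:
            # a put on a closed queue raises ValueError
            if self.stopped:
                # we're shutting down anyway, so this is expected
                return
            # otherwise assume a peer process exited and begin
            # shutting this one down too
            self.stop()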

Add `TimeLag` output model

Rather than averaging over predictions, one server-side streaming output option could be a TimeLag model that returns a slice of the kernel of some length update_size, taken lag seconds in from the edge of the kernel rather than at the edge itself. This could be useful for models like DeepClean, where there's not much benefit to aggregating near the centers of kernels but prediction quality is poorer near the edges.
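
A minimal sketch of what this might look like, assuming kernels shaped (batch, channels, time) and with lag already converted to samples:

import torch


class TimeLag(torch.nn.Module):
    def __init__(self, update_size: int, lag: int) -> None:
        super().__init__()
        self.update_size = update_size
        # lag in samples, i.e. lag seconds times the sample rate
        self.lag = lag

    def forward(self, kernel):
        # take an update_size-long slice ending `lag` samples in
        # from the right (newest) edge of the kernel, rather than
        # the newest samples at the edge itself
        stop = kernel.shape[-1] - self.lag
        return kernel[..., stop - self.update_size : stop]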

Add tests for `cloudbreak` library

Need both unit tests for verifying basic functionality, as well as integration tests to ensure that appropriate resources are spun up/down on corresponding clouds. Should these latter tests be performed in PR CI, or will this become too costly?

Support windows for weighted averages on output streaming model

Some models with output timeseries like DeepClean experience non-trivial drops in predictive performance near the edges of kernels. It would be useful if the output aggregation model in hermes.quiver supported specification of window functions or other weighting schemes to downweight the contributions of predictions made closer to the edge of the kernel.
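
As an offline sketch of the weighting itself (the actual streaming model would fold this into its snapshot state), overlap-adding consecutive predictions with a Hann window might look like:

import torch


def windowed_average(kernels: torch.Tensor, stride: int) -> torch.Tensor:
    # kernels: (num_kernels, kernel_size), consecutive predictions
    # offset from one another by `stride` samples
    num_kernels, kernel_size = kernels.shape

    # a Hann window is one natural choice; any nonnegative
    # weighting scheme could be swapped in here
    window = torch.hann_window(kernel_size)

    length = (num_kernels - 1) * stride + kernel_size
    num = torch.zeros(length)
    den = torch.zeros(length)
    for i in range(num_kernels):
        start = i * stride
        num[start : start + kernel_size] += kernels[i] * window
        den[start : start + kernel_size] += window

    # the Hann window is zero at its very edges, so clamp to
    # avoid dividing by zero there
    return num / den.clamp(min=1e-9)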

Use updated Triton stateful backend behavior for snapshotter

Triton's documentation now contains a description of a state management feature which should help improve the efficiency of the snapshotter model by removing the need to store and update the snapshot as a weight in the model itself. This in turn decouples the snapshotter from needing to be implemented in TensorFlow. A simple implementation of this might look like

from collections.abc import Iterable

import torch


class Snapshotter(torch.nn.Module):
    def __init__(self, snapshot_size: int, channels: Iterable[int]) -> None:
        super().__init__()
        self.snapshot_size = snapshot_size
        self.channels = channels

    def forward(self, update, snapshot):
        # prepend the new update along the time dimension, dropping
        # the oldest samples to keep the snapshot length fixed
        # (this assumes the newest samples live at the left edge)
        snapshot = torch.cat(
            [update, snapshot[:, :, : -update.shape[-1]]], dim=2
        )
        # split channel-wise into the per-model snapshot states
        snapshots = torch.split(snapshot, list(self.channels), dim=1)
        return tuple(snapshots) + (snapshot,)

And the model config would include a section that looks something like

sequence_batching {
  state [
    {
      input_name: "old_snapshot"
      output_name: "new_snapshot"
      data_type: TYPE_INT32
      dims: [ -1 ]
      initial_state: {
        data_type: TYPE_INT32
        dims: [ 1 ]
        zero_data: true
        name: "initial state"
      }
    }
  ]
}

The actual state naming mechanism still needs to be worked out.

This will require using the latest version of the Triton container, but this is something we should probably move to anyway now that NVIDIA/TensorRT#1587 has been resolved in the latest releases of TensorRT and Triton.
