fastmachinelearning / gw-iaas

Deep learning inference-as-a-service tools and pipelines for gravitational wave physics
License: MIT License
Right now, the structure for container builds for individual projects is to have the project Dockerfile live in the project's root directory, then to use the root of the whole repository as the build context in order to copy local dependencies into the image for installation via poetry. These dependencies need to be added explicitly in multiple `COPY` statements, including the code for the project itself.

The advantages here are that:

- `make` ensures that rebuilds only happen when the relevant libraries change, i.e. when a `pyproject.toml` or `poetry.lock` changes

However, the disadvantages are that:

- `COPY` statements will be relative to the build context root and not the directory containing the Dockerfile (which is not obvious unless you inspect the CI yamls)
- it requires `make` and familiarity with `Makefile` syntax
- the tooling makes certain assumptions about the relative directory depths of applications and libraries

Instead, the proposed build begins with a global build image which adds the local libraries and installs the desired Poetry version:
```dockerfile
ARG PYTHON_TAG
FROM python:${PYTHON_TAG}

ARG POETRY_VERSION
RUN python -m pip install poetry==${POETRY_VERSION}

COPY libs /opt/gw-iaas/libs
```
built with `docker build -t build .`.
Then for individual projects, the build starts with a Python script that builds all the dependency wheels via something like the following (making `docker` a dependency in the root `pyproject.toml`):
```python
import argparse
import pathlib
import re
from io import BytesIO

import docker

parser = argparse.ArgumentParser()
parser.add_argument("--project", required=True, type=str, help="Path to project")
args = parser.parse_args()
project = pathlib.Path(args.project)

# start from the global build image and copy in the project code
dockerfile = """
FROM build
COPY . /opt/build
RUN set +x \\
        && mkdir /opt/lib \\"""

with open(project / "poetry.lock", "r") as f:
    lockfile = f.read()

start = "\n" + " " * 8


def add(line: str) -> None:
    """Append another command to the chained RUN instruction."""
    global dockerfile
    dockerfile += start + f"&& {line} \\"


root = "/opt/gw-iaas/libs"

# build a wheel for each local dependency found in the lockfile,
# then for the project itself
for dep in re.findall("<regex for local deps>", lockfile):
    add(f"cd {root}/{dep}")
    add("poetry build")
    add("cp dist/*.whl /opt/lib")
add("cd /opt/build")
add("poetry build")
# stage the project's own wheel alongside its dependencies
add("cp dist/*.whl /opt/lib")

# strip the trailing line continuation
dockerfile = dockerfile[:-2]

client = docker.from_env()

# build an intermediate image containing all the wheels
build_image, _ = client.images.build(
    fileobj=BytesIO(dockerfile.encode()), tag=f"{project}:build"
)

# then build the project image proper from its own Dockerfile
client.images.build(path=str(project), tag=str(project))
client.images.remove(build_image.id)
```
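Such a script might be invoked as, e.g., `python build.py --project projects/<project>` (the script name is hypothetical and `<project>` is a placeholder).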
Then individual project Dockerfiles would include lines like

```dockerfile
COPY --from=<project>:build /opt/lib/*.whl .
RUN pip install *.whl && rm *.whl
```
so that `COPY` paths are relative to the Dockerfile's location.

If `model_version=-1` when initializing `hermes.stillwater.client.InferenceClient`, use model metadata to automatically map to the latest model version. How will this keep track of new versions that get deployed on the server?
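One way to do the initial mapping, sketched here with the plain `tritonclient` gRPC API rather than whatever `InferenceClient` actually wraps (the helper name is made up): ask the server for the model's metadata, whose `versions` field lists the available versions, and take the largest.

```python
import tritonclient.grpc as triton


def resolve_latest_version(url: str, model_name: str) -> int:
    """Hypothetical helper: map model_version=-1 to the newest version."""
    client = triton.InferenceServerClient(url)
    metadata = client.get_model_metadata(model_name)
    # metadata.versions lists the model's available versions as strings
    return max(map(int, metadata.versions))
```

This only resolves the version once, though; picking up versions deployed after initialization would presumably require re-querying the metadata periodically, which is the open question above.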
Need to finish converting the paper code for orchestrating cloud resources to the newer framework.
Trying to close a `hermes.stillwater.process.PipelineProcess` too quickly will cause a `BrokenPipeError` from the `self.in_q.close()` call in the `__exit__` method, because of python/cpython#80025. This should be fixed and backported as of python/cpython#31913, but I'm not sure most releases have had time to pick it up yet, so for now it might be worth inserting a `time.sleep` as the linked issue suggests to avoid this error. Once the fix is implemented, I don't think we'll need to clear the queue manually anymore (if we ever did...).
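A minimal sketch of that stopgap, assuming `__exit__` is where `in_q` gets closed (the sleep duration is an arbitrary placeholder):

```python
import time


class PipelineProcess:
    ...

    def __exit__(self, exc_type, exc_value, traceback):
        # stopgap for python/cpython#80025: give the queue's feeder
        # thread a moment to flush its buffer before closing
        time.sleep(0.1)
        self.in_q.close()
```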
Incidentally, we're also not closing the `out_q` of `PipelineProcess` objects during `__exit__`. This is probably related to the asynchronous communication between processes: if we're `__exit__`ing due to an error, we'd like to raise that error rather than attempt a `put` on a closed `q` and get a `ValueError` that we don't know whether to catch or to raise as a real problem. I think it makes sense to close all the `q`s and then do some logic around both `__exit__`ing (is there an error, and was it raised by this process?) and the places where a `get` or a `put` on a `q` might cause a `ValueError` (are we `stopped` now? Or should we always assume that this is due to another process exiting, and trust that users won't manually close these `q`s accidentally?).

Rather than averaging over predictions, one server-side streaming output option could be a `TimeLag` model that just takes a subset of the kernel of some length `update_size`, but `lag` seconds from the edge of the kernel rather than at the edge itself. This could be useful for models like DeepClean, where there's not much benefit to aggregating near the centers of kernels, but poorer-quality predictions near the edges.
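A minimal torch sketch of the idea (the class and argument names are taken from the description above, but with `lag` expressed in samples rather than seconds; converting would just require the sample rate):

```python
import torch


class TimeLag(torch.nn.Module):
    def __init__(self, update_size: int, lag: int) -> None:
        super().__init__()
        self.update_size = update_size
        self.lag = lag  # in samples here, not seconds

    def forward(self, kernel: torch.Tensor) -> torch.Tensor:
        # kernel: (batch, channels, time). Return update_size samples
        # ending `lag` samples in from the kernel's trailing edge,
        # rather than the newest samples at the edge itself.
        stop = kernel.shape[-1] - self.lag
        return kernel[:, :, stop - self.update_size : stop]
```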
Need both unit tests verifying basic functionality and integration tests ensuring that the appropriate resources get spun up/down on the corresponding clouds. Should the latter be performed in PR CI, or will that become too costly?
Create another `cloud` backend in `cloudbreak` for AWS.
Some models with output timeseries, like DeepClean, experience non-trivial drops in predictive performance near the edges of kernels. It would be useful if the output aggregation model in `hermes.quiver` supported specification of window functions or other weighting schemes to downweight the contributions of predictions made closer to the edge of the kernel.
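As a rough sketch of the idea (assuming a torch module and a Hann window; the actual aggregation model in `hermes.quiver` may be structured differently, and the normalization would need to happen wherever overlapping predictions get summed):

```python
import torch


class WindowedOutput(torch.nn.Module):
    """Hypothetical: downweight predictions near the kernel edges."""

    def __init__(self, kernel_size: int) -> None:
        super().__init__()
        # Hann window: ~0 weight at the kernel edges, 1 at the center
        self.register_buffer("window", torch.hann_window(kernel_size))

    def forward(self, prediction: torch.Tensor) -> torch.Tensor:
        # windowed predictions would then be overlap-added and
        # normalized by the summed window weights downstream
        return prediction * self.window
```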
Triton's documentation now contains a description of a state management feature which should help to improve the efficiency of the snapshotter model by removing the need for updating the snapshot weight in the model itself. This in turn decouples the snapshotter from needing to be implemented in TensorFlow. A simple implementation of this might look like
```python
from collections.abc import Iterable

import torch


class Snapshotter(torch.nn.Module):
    def __init__(self, snapshot_size: int, channels: Iterable[int]) -> None:
        super().__init__()
        self.snapshot_size = snapshot_size
        self.channels = list(channels)

    def forward(self, update, snapshot):
        # slide the snapshot along its time dimension: prepend the new
        # update and drop the oldest samples (snapshot is newest-first)
        snapshot = torch.cat(
            [update, snapshot[:, :, : -update.shape[-1]]], dim=2
        )
        # split the snapshot channel-wise into per-model outputs
        snapshots = torch.split(snapshot, self.channels, dim=1)
        return tuple(snapshots) + (snapshot,)
```
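A quick shape check of the sketch (the sizes here are arbitrary):

```python
snapshotter = Snapshotter(snapshot_size=100, channels=[2, 3])
update = torch.zeros(8, 5, 10)     # (batch, channels, update_size)
snapshot = torch.zeros(8, 5, 100)  # (batch, channels, snapshot_size)
*outputs, new_snapshot = snapshotter(update, snapshot)
print([tuple(o.shape) for o in outputs])  # [(8, 2, 100), (8, 3, 100)]
print(tuple(new_snapshot.shape))          # (8, 5, 100)
```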
And the model config would include a section that looks something like
sequence_batching {
state [
{
input_name: "old_snapshot"
output_name: "new_snapshot"
data_type: TYPE_INT32
dims: [ -1 ]
initial_state: {
data_type: TYPE_INT32
dims: [ 1 ]
zero_data: true
name: "initial state"
}
}
]
}
The actual state naming mechanism still needs to be worked out.
This will require using the latest version of the Triton container, but this is something we should probably move to anyway now that NVIDIA/TensorRT#1587 has been resolved in the latest releases of TensorRT and Triton.