pennsignals / dsdk
Data Science Deploy Kit
License: MIT License
YAML 1.1 has a loose interpretation of y|n|on|off|yes|no as boolean, even when used as a key. Use a YAML 1.2 parser if possible.
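A minimal sketch of the gotcha, assuming PyYAML (which follows YAML 1.1 scalar resolution); ruamel.yaml is one parser that supports YAML 1.2 rules:

```
import yaml  # PyYAML resolves YAML 1.1 booleans

# The bare key `on` deserializes as the boolean True, not the string "on".
print(yaml.safe_load("on: workflow_dispatch"))    # {True: 'workflow_dispatch'}
print(yaml.safe_load("'on': workflow_dispatch"))  # {'on': 'workflow_dispatch'}
```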
We already have a way to import dsdk components using the (sub-)module without ambiguity, and without duplicating the submodule names in the class names:

```
from dsdk import mssql, postgres, Service

class CDISepsis(mssql.Mixin, postgres.Mixin, Service):
    pass
```
The reason I typically use `from dsdk.postgres import Mixin as PostgresMixin` instead of importing submodules (or modules, e.g. `import dsdk.mssql`, `import re`, `import os.path`) has to do with how module imports expose implementation details. For example, if I `from os import environ` inside `dsdk/mssql.py`, `environ` can be used as `mssql.environ` by importers of `dsdk.mssql` -- unless shielded by a submodule directory and a dunder-init file.

The `as` form is used in `import numpy as np`, yet this form "gets a pass" even though it has overly wide scope. For client code, I don't care; module code has a different bar. `import re` gets a pass too, even though this form is less specific and less deliberate than `from dsdk.postgres import Mixin as PostgresMixin` or `from re import compile as re_compile`.

`PennSignalsDSDKPostresMixin`. Seriously! In the immunehealth repos, "immunehealth" (or worse, arbitrary abbreviations) must be specified at least three times: in the org, in the repo name, and in the class names. Why do we prefix `penn` on `signals` everywhere when it is part of the organization? pennsignals.uphs.upenn.edu -- it is right there in "upenn"!

So if I were going to "fix" DSDK, I would add directories and dunder-init files that narrow the scope imported by `from dsdk import mssql` or `import dsdk.mssql`, instead of duplicating the submodule namespace in the class names.
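A minimal sketch of that fix; the package layout and `_mssql` module name are assumptions, not dsdk's actual structure:

```
# dsdk/mssql/__init__.py -- the dunder-init narrows the public surface.
"""Public surface of dsdk.mssql."""

from ._mssql import Mixin  # hypothetical private module holding the implementation

__all__ = ["Mixin"]  # star-imports expose only Mixin, not os.environ and friends
```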
Add `isort` back into the `.pre-commit-config.yaml`. Add isort options to `setup.cfg` or `pyproject.toml` compatible with `black`.

Add `mypy` into the `.pre-commit-config.yaml`. Add `mypy` options to `setup.cfg`.

Move tool configuration out of `tox.ini` into config files: `setup.cfg`, `pyproject.toml`, etc. Bare invocations of `black`, `flake8`, `isort`, `pylint`, and `pytest` will then be configured the same way remote and local. Command invocations by `pre-commit` will be configured the same way as direct invocations of the tools. Layer the macros to ease debugging.

Understanding pep518: https://snarky.ca/clarifying-pep-518/
Add logging to dsdk; additionally, update the cookiecutter template to make this work as expected.
The .cookiecutterrc still includes references to Mike Becker's upstream project. Like:
Line 30 in acf5fa1
Fix these references.
If the cookiecutter patcher is intended to be used, ensure that it regenerates target files correctly: i.e. no `py_modules` in `setup.py`, and `setup.py` is alphabetized and uses constant tuples where possible.
Shortly we will be able to remove the 3.7 and 3.8 builds.
It looks like numpy may have updated. This nightly build is desirable for exactly this sort of discovery.
Allow for the expansion of multiple columns in CTEs from each kwarg in keys:

```
with panel as (
    select cast(null as int) as pat_id, cast(null as int) as dept_id
    union all select 1, 5
    union all select 2, 5
) ...
```

This will allow us to pass in related columns instead of a single-dimension list. This accommodates the need for pat_id AND a primary_department in pcsnap.
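A sketch of how the expansion could work; `render_cte` is a hypothetical helper, not dsdk API:

```
from typing import Sequence, Tuple

def render_cte(name: str, columns: Sequence[str], rows: Sequence[Tuple[int, ...]]) -> str:
    """Render a CTE seeded with typed nulls, one `union all` per row."""
    header = ", ".join(f"cast(null as int) as {column}" for column in columns)
    body = "".join(
        "\n    union all select " + ", ".join(str(value) for value in row)
        for row in rows
    )
    return f"with {name} as (\n    select {header}{body}\n) ..."

print(render_cte("panel", ("pat_id", "dept_id"), [(1, 5), (2, 5)]))
```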
I've avoided duplicating the dependency list in `tox.ini`, `setup.py`, and `requirements.txt`; the dependency lists have been added as extras to `setup.py` and used in `tox.ini` like so:

```
[testenv]
commands =
    python -m pip install .[lint] # install
    pre-commit run --all-files
    python -m pip install .[test] # install
    {posargs:pytest}
```

However, this will be difficult to debug because there is no separation between a dependency failure (pip install failure) and a test failure. The original duplicated code used the `deps =` clause above the `commands =` clause.

Find a way to create dependency lists from `setup.py` and use the `deps =` clause with the extras from `setup.py`. Better, find out how to use dependencies from `pyproject.toml` without `setup.py`, as `pyproject` is about declaring dependencies instead of imperatively running `setup` to install dependencies.
The Asset class is used to load text files from a directory tree.
By design this is done once, as part of yaml deserialization when the configuration is first loaded, to separate early configuration failures from late runtime failures: the dev-ops principle of shifting all errors to the left.
This is problematic when working on sql queries in a notebook, since components must be deserialized again after every sql file change.
Create a development version of Asset, registered as `!asset` for notebook use, that dynamically loads content from the filesystem.
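A minimal sketch of such a development Asset; the attribute-per-file lookup is an assumption about how Asset exposes its files:

```
from pathlib import Path

class DevAsset:
    """Re-read sql files on every access so notebook edits are picked up
    without re-deserializing the configuration."""

    def __init__(self, path: str, ext: str = ".sql") -> None:
        self.path = Path(path)
        self.ext = ext

    def __getattr__(self, name: str) -> str:
        # Called only for names not set in __init__; re-reads on each access.
        return (self.path / f"{name}{self.ext}").read_text()
```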
Right now we're using timezone-unaware `datetime`s. This causes issues when you try to compare timestamps that come out of a mongo `ObjectId` (via `generation_time`). It might work when comparing against `datetime`s that come out of a mongo document (i.e. fields other than _id that are passed in as a `datetime` to mongo). I'm not even sure what will happen with `datetime`s that come out of mssql.
This task entails implementing a `datetime` timezone strategy that works for at least all of these cases.
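A sketch of one such strategy, normalizing everything to aware UTC; note that bson's `generation_time` is already timezone-aware:

```
from datetime import datetime, timezone

from bson.objectid import ObjectId  # assumes pymongo's bson is installed

now = datetime.now(timezone.utc)       # aware; prefer this over utcnow()
oid_time = ObjectId().generation_time  # aware UTC out of the ObjectId
print(oid_time <= now)                 # comparable: both are aware

naive = datetime.utcnow()              # naive: comparing with `now` raises TypeError
repaired = naive.replace(tzinfo=timezone.utc)  # repair values known to be UTC
```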
Enable a DSDK application to enter a predictive value for a list of patients into each patient's PennSignals flowsheet.
For every new Penn Signals application, we can enter a Service Request to create a new flowsheet ID for the PennSignals flowsheet.
Here is the staging url: https://secureappstest.uphs.upenn.edu/PennChartX-Stage/api/v1/epic/addFlowSheetValue
These are the parameters you can pass in:

```
{
    "projectName": "",
    "uid": "",
    "csn": "",
    "value": ""
}
```

The projectName parameter takes 1 of 3 values:
- MaternalMorbidity
- Sepsis
- VentLiberation
To test this I logged into Epic-POC and found a patient encounter. From that I found the UID and CSN. I called it 3 times, passing in different project names and values. Here is a screenshot of the results:
Here is the curl for one of the calls (granted, you will have to create and change the Bearer token):

```
curl -X POST "https://secureappstest.uphs.upenn.edu/PennChartX-Stage/api/v1/epic/addFlowSheetValue" -H "accept: application/json" -H "authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Ikl2WFhNY3ViNldMYVBsMVdwTS1pT29WZ3J5RSIsImtpZCI6Ikl2WFhNY3ViNldMYVBsMVdwTS1pT29WZ3J5RSJ9.eyJpc3MiOiJodHRwczovL3VwaHNuZXR0ZXN0MjAxMi51cGhzLnVwZW5uLmVkdS9NZWRWaWV3SWRlbnRpdHlTZXJ2ZXJfU3RhZ2UvaWRlbnRpdHkiLCJhdWQiOiJodHRwczovL3VwaHNuZXR0ZXN0MjAxMi51cGhzLnVwZW5uLmVkdS9NZWRWaWV3SWRlbnRpdHlTZXJ2ZXJfU3RhZ2UvaWRlbnRpdHkvcmVzb3VyY2VzIiwiZXhwIjoxNTcxNDI0NTU0LCJuYmYiOjE1NzE0MjA5NTQsImNsaWVudF9pZCI6InN3YWdnZXJfdWkiLCJzY29wZSI6InBlbm5jaGFydHgtYXBpLXNjb3BlIiwic3ViIjoiQTQ1MDgzNTMtRTAyQi00MEUyLUE4NDgtNkU3NkY4MTM1RDNCIiwiYXV0aF90aW1lIjoxNTcxNDIwOTU0LCJpZHAiOiJpZHNydiIsIlBlbm5JRCI6IjgxMDA5NzA3IiwibmFtZSI6IkxBTkcsIEVSSUsgRCIsInJvbGUiOiJETUVFOiBDbGluaWNhbDogQ2FuIFZpZXcgQWxsIiwiYW1yIjpbInBhc3N3b3JkIl19.mOV5_XOu58nJuZHgSqmiNxyf0_1LWuH-FDzie7_r0T-LEEAr2Q0Rf44vvFFopqMy0rm_T8w7hp3NMNud7D5im09D0t69_exTe4JHIcFXUz2X8PdRmidEq13NUmD7ONfNhTtYMRcRHiYhHK3khbD3eMkGmpOZHitvZ_eIliFSPf_3jKq6Al4a3aL-70baRfw7cGvWhhgjbLEQX6Z2-I35GidoTfruc23G7ydwX1BnZUykqtJNkBWQWtkF-VgqHcmPaylZXEbM-fvpPONhZ1H-maf4x2u3-rF_HWE5dKyoCJ96sWA8fJakoSmLgXzctYweivcVRYVscP52WNTymBmJuQ" -H "Content-Type: application/json" -d '{ "projectName": "VentLiberation", "uid": "8642014385", "csn": "100040977", "value": "Another important note"}'
```
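For reference, a hedged Python equivalent of that call, assuming the requests library; the token is a placeholder:

```
import requests

response = requests.post(
    "https://secureappstest.uphs.upenn.edu/PennChartX-Stage/api/v1/epic/addFlowSheetValue",
    headers={
        "accept": "application/json",
        "authorization": "Bearer <token>",  # create and substitute your own token
    },
    json={
        "projectName": "VentLiberation",
        "uid": "8642014385",
        "csn": "100040977",
        "value": "Another important note",
    },
)
response.raise_for_status()
```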
It isn't clear to me what the purpose of the third message is. Under Tara's quote, there is a note: "We will likely not send the Recipient message." It seems that the value corresponds to a pre-defined group id. It also seems that the flowsheetID should be different; otherwise, how does Epic know it isn't a text alert message (the second type of message) with a numeric value?
Ok, based on the uncertainty around the third message, we can start testing the endpoint with the first 2 messages. I have moved the endpoint into stage, so you can start testing with it. Currently, it is pointed at the Epic POC environment.
The original cookiecutter template: gh:ionelmc/cookiecutter-pylibrary
...intended to make both the `tox.ini` and `.travis` files not just scaffolded, but regenerated from templates in `/ci`, configured by `.cookiecutterrc` and some details in `setup.cfg`. It did not do a good job of using `setup.cfg`, nor did it attempt to use the more appropriate `pyproject.toml`.
Edit the original template from upstream and recreate a `tox.ini` template that regenerates the appropriate files, or better, find a way to reduce the `.cookiecutterrc` and use options from `pyproject.toml` to generate the `tox.ini` and `.travis` files.
"Noise" in the error logs:
/root/.local/lib/python3.9/site-packages/dsdk/postgres.py:299: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["run_id"] = run_id
The run_id column is being added to all rows. Find the correct way to do this. Very likely, the execution of the noisy code is setting the value for each row one at a time.
This could be as simple as:
df.loc[:, ("run_id")] = run_id
Experiment in a jupyterlab notebook and ensure that the run_id is set for all rows using your own DataFrame that has multiple columns.
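A minimal notebook experiment, assuming pandas; the warning usually means `df` is itself a slice of another frame, so an explicit copy is the other common fix:

```
import pandas as pd

source = pd.DataFrame({"pat_id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})
df = source[source.score > 0.1].copy()  # explicit copy: no SettingWithCopyWarning
df["run_id"] = "run-42"                 # one vectorized assignment for all rows
assert (df["run_id"] == "run-42").all()
```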
Motivation
Trading an increase in both planned and unplanned redeployment and on-call time for a decrease in planned initial deployment time is not a trade-off that gives us time.
Trading an increase in runtime and maintenance complexity for initial ease of implementation is not a trade-off that gives us simplicity.
On-Call
Is an outage due to an upstream/downstream/database/fileserver system failure? Add a smoke test that runs before the pipeline. Run the smoke test independent from the pipeline at any time to check the state of all dependencies.
Is the server deployment/redeployment failure due to misconfiguration of consul/nomad/vault? The smoke test runs all configuration and initialization, followed by the dependent systems check.
Is it a bug in the code and/or unexpected data or is it configuration? Cleanly separate the configuration and validation from the pipeline init and run. Ensure that code and data problems are not conflated with bad configuration problems that pass through init. Problems detected by the smoke test after deployment generally do not need a full service build locally to reproduce the problem.
Software Specification:
Startup sequence is: set up static dependencies, initialization, set up dynamic dependencies, and finally complete any other required startup tasks. An example of a required startup task is writing a 'start run' record to a data store.
Shutdown sequence is: complete any other required teardown tasks, tear down dynamic dependencies, and tear down static dependencies. An example of a required teardown task is writing an 'end run' record to a data store.
Static dependencies are: argument definition, argument parsing, configuration, and validation. Argument parsing and configuration will include model unpickle, options file read, environment variable read, and secrets read. Configuration will be a hierarchical structure allowing the same class to be configured by different parts of the hierarchy without additional subclassing needed solely for location in the hierarchy. Validation will include creating typed parameters for `__init__`.
Initialization will be completed by typed and validated parameters to `__init__`, not unvalidated configuration collections like dicts, lists, or json-supported types. See the primitive obsession anti-pattern.
Dynamic dependencies (database connections, sockets) will be created after initialization, with lifetimes and state managed by method scope (context managers), not callbacks with shared state like `setup` and `teardown`. Dynamic dependencies will have as narrow a scope as possible, limited to the Task/Block in which they are required. Dynamic dependencies will be separated by kind of use (input vs output; read vs read+write), not mere type (mongo vs mssql vs pgsql): it is very common to have mongo collections read from production while writing to staging or development. Dynamic dependencies will be separated by data type, not mere source: just because all mongo ingest data lives on a single server doesn't mean that it will continue to do so. A method will be added to check runtime dependencies without having to run the Service/Pipeline.
Discuss the actual existence of dynamic dependencies for the batch jobs. If auto reconnect, or reconnect on failure, are not desired features, then database connections and sockets may be treated as static pipeline dependencies to be passed in as initialization parameters. These resources will still be managed by context managers.
Consequences:
- `__init__` methods do not require any dependencies like the filesystem.
- `__init__` methods are declarative.
- The software has seams supporting the open-closed principle.

Refactor Demonstration:
```
from __future__ import annotations

import pickle
from argparse import Namespace

# ArgParser, Batch, Run, and Service are assumed dsdk/ConfigArgParse-style
# types for this demonstration sketch.
"""
Concepts:
+ See separation of concerns.
+ See liskov substitution principle.
+ See open-closed principle.
+ See refused bequest anti-pattern.
+ See primitive obsession anti-pattern vs domain driven design
+ See dependency injection. How do you validate injected dependencies?
Operations Requirements:
+ Error as early as possible.
Prefer __init__ before __call__/run
Prefer static dependencies before __init__
Prefer configuration before __init__
Prefer arg parsing before configuration.
As a last resort allow contextmanager to
control dynamic dependencies after __init__
Do not pass through unvalidated configuration to __call__/run
Do not pass through primitive dict as a runtime lookup
+ Add a smoke test that may be run independently from the pipeline job.
+ Add the smoke test code to run before __call__/run.
Maintenance Requirements:
+ Descriptive method signatures.
+ Separate argument parsing, from configuration from initialization, from run.
+ Separate smoke test before run.
+ Units tests are code too.
+ Prefer composition to inheritance.
"""
class PredictModel:
    """PredictModel."""


class PredictStage:
    """PredictStage."""

    def __init__(self) -> None:
        """__init__.

        Bad:
        - Subclasses must call super().__init__().
          __init__ cannot be overridden by subclasses, merely extended.
          __init__ has non-essential dependencies, violates open-closed principle.
        - Must use the same file name, a single file, and is tied to the filesystem.
        - Mocks patch shared(!) implementation detail `pickle.load`.
        - Init is configuration, violates separation of concerns.
        """
        with open("predict.pkl", "rb") as data:
            self.model = pickle.load(data)
class FirstRefactorForPredictStage:
    """Add a seam."""

    def __init__(self) -> None:
        """__init__.

        Problem:
        - `__init__` signature doesn't declare true dependencies.
        """
        self.model = self.configure()

    def configure(self, model: str = "predict.pkl") -> PredictModel:
        """Configure.

        Good:
        + Subclasses can override `configure`, and default `path`.
        + Mocks may patch intent by overriding `configure` instead of a
          shared implementation detail.
        Is the initialization of self complete when this method is called? No...
        Problem:
        - Methods get a partially initialized self, violates the __init__
          method contract.
        - Configuration errors are conflated with initialization errors,
          violates separation of concerns.
        """
        with open(model, "rb") as data:
            return pickle.load(data)
class SecondRefactorForPredictStage:
    """Add an alternate no-arg/default-arg constructor.

    Note that the default constructor `__new__(cls, ...)` is an undecorated
    classmethod, but it is still a classmethod:
    `a = SecondRefactorForPredictStage(...)`
    """

    @classmethod
    def configure(cls, model: str = "predict.pkl") -> SecondRefactorForPredictStage:
        """Configure.

        Notice that we "traded" the type of the `model` variable:
        a str primitive for a PredictModel.
        Good:
        + Subclasses can still override `configure`, default `path`.
        + Filesystem availability and model load are optional,
          the true static dependency is the model itself.
        + Configuration errors happen here.
        Problem:
        - Handle more than just model path.
        """
        with open(model, "rb") as data:
            return cls(model=pickle.load(data))

    def __init__(self, model: PredictModel) -> None:
        """__init__.

        Good:
        + Interface to `__init__` documents dependencies with their types.
        + Tests no longer need mocks nor patches to create PredictStage,
          they may avoid using the alternate constructor.
        + Parameters provided to __init__ are fully initialized.
        + Mocks for model are simply passed in without patching or the need
          for an alternate pkl file.
        + Pickle is optional.
        + No io in __init__.
        + Initialization errors happen here.
        """
        self.model = model
pipeline_cfg = {
    "pipeline": {
        "extract": {
            "clarity": {
                "tables": [
                    "hl7_patients",
                    "hl7_labs",
                ],
            },
        },
        "predict": {
            "model": "./models/predict.pkl",
        },
    },
}
class ThirdRefactorForPredictStage:
    """Add argument parser and alternate constructor to configure from parsed args.

    The argument parser is not required for creation of a PredictStage.
    The argument parser is not required for creation of a cfg dict either.
    The model path *may* be overridden by the arg parser.
    The argument parser *may* be used for the creation of a cfg dict.
    Bad:
    - Two passes/four steps for argument parsing:
      Traverse classes adding args.
      Parse args.
      Load root configuration for the microservice/pipeline from the root cfg arg.
      Traverse pipeline classes applying args and returning Stage(s).
    - Coupling between add_args_to(parser) and configure_from(args).
    - Duplication of default value for model.
    - Configuration and pipeline are nested, while add_args_to parsing is flat.
    """

    @classmethod
    def add_args_to(cls, parser: ArgParser) -> None:
        """Add args to parser."""
        parser.add(
            name="model",
            arg="--model",
            # default="predict.pkl"
            help="Path for the model pickle file",
            env_var="MODEL",
        )

    @classmethod
    def configure_from(
        cls, cfg: dict, args: Namespace = None
    ) -> ThirdRefactorForPredictStage:
        """Configure from.

        Argument and cfg dict structure errors happen here.
        Bad:
        - Root cfg dict problems are conflated with arg patching problems.
        """
        if cfg is None:
            cfg = {}
        if args:
            for key, value in (("model", args.model),):
                if value is not None:
                    cfg[key] = value
        return cls.configure(model=cfg["model"])

    @classmethod
    def configure(cls, model="predict.pkl") -> ThirdRefactorForPredictStage:
        """Configure."""
        with open(model, "rb") as data:
            return cls(model=pickle.load(data))

    def __init__(self, model: PredictModel) -> None:
        """__init__."""
        self.model = model
def load_pickle(path: str) -> PredictModel:
    """Trade a path primitive for the unpickled model."""
    with open(path, "rb") as data:
        return pickle.load(data)


class Foo:
    """Foo."""

    def __init__(self, identity: int):
        self.identity = identity

    @classmethod
    def add_args_to(cls, parser: ArgParser) -> None:
        """Add args to parser (assumed protocol for nested components)."""
        parser.add(name="foo", arg="--foo", type=int, env_var="FOO")

    @classmethod
    def from_args(cls, cfg, args: Namespace):
        """Patch args into the sub-cfg; still returns primitives."""
        value = getattr(args, "foo", None)
        return cfg if value is None else value

    @classmethod
    def from_cfg(cls, cfg: int) -> Foo:
        """Trade the primitive cfg for a Foo."""
        return cls(identity=cfg)


class Bar:
    """Bar."""

    def __init__(self, uri: str):
        self.uri = uri
class FifthRefactorForPredictStage:
    """Simplify and chain the constructors.

    Bad:
    - Assumes the constructors can be chained and that no common
      parameters are pushed down from the root configuration to each stage,
    - Changing arg and env variable names requires(!) subclassing,
      but parent components could pass the cfg key into
      add_args_to(parser) and from_args(cfg, args).
    Possible Solutions:
    + Pass parent components to run/__call__ method.
    + Pass cfg key or a key list to add_args_to(parser) and from_args(cfg, args)
      so these methods can alter the argument names depending on the location
      of their configuration within the root cfg file.
    ConfigArgParse:
    - Please see the design choices section: https://pypi.org/project/ConfigArgParse/
    - Produces a non-hierarchical flat config due to command line option equivalency.
    + Extensible ConfigFileParser.
    - Only helps with from_args, does not do the validation/trading of from_cfg,
      still requires add_args_to.
    """

    @classmethod
    def add_args_to(cls, parser: ArgParser) -> None:
        """Add args to parser."""
        parser.add(
            name="model",
            arg="--model",
            type=str,  # type of the primitive option, not type of the model
            # default="predict.pkl"
            help="Path for the model pickle file",
            env_var="MODEL",
        )
        Foo.add_args_to(parser)

    @classmethod
    def from_args(cls, cfg: dict, args: Namespace) -> FifthRefactorForPredictStage:
        """Return instance from args patched into root cfg: PredictStage.from_args(cfg, args).

        Good:
        + Arg patching and arg errors happen here.
        + Arg parsing errors happen just before in parser.parse().
        """
        if cfg is None:
            cfg = {}
        for key, from_args in (("foo", Foo.from_args),):
            cfg[key] = from_args(cfg.get(key), args)
        for key, value in (("model", args.model),):
            if value is not None:
                cfg[key] = value
        return cls.from_cfg(cfg)

    @classmethod
    def from_cfg(cls, cfg: dict) -> FifthRefactorForPredictStage:
        """Return instance from cfg: PredictStage.from_cfg(cfg).

        Good:
        + Validation errors happen here while trading primitive types for instances.
        + Pass through only legitimate primitives.
        """
        for key, from_cfg in (("model", load_pickle), ("foo", Foo.from_cfg), ("bar", Bar)):
            # from_cfg could be any alternate, single-arg constructor or adapter function
            cfg[key] = from_cfg(cfg[key])  # KeyError if an expected key is missing.
        # Trim to only expected parameters, ignore unexpected parameters.
        kwargs = {key: cfg[key] for key in ("model", "foo", "bar")}
        return cls(**kwargs)

    def __init__(self, model: PredictModel, foo: Foo, bar: Bar) -> None:
        """__init__.

        Good:
        + Initialization errors happen here.
        """
        self.model = model
        self.foo = foo
        self.bar = bar

    def __call__(self, batch: Batch, run: Run, service: Service) -> Batch:
        """__call__."""
```
The github workflow trigger seems to triplicate the lint, test, and publish tasks. Look into this and perhaps tighten the constraints on the trigger.
Quay had a nice feature that automated security scans of docker images. Automate the same functionality using github workflows.
The dockerfile code scans might be equivalent here.
Add mongo context manager.
Ensure that the database connection is opened immediately to catch credential errors early.
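A minimal sketch, assuming pymongo; the `ping` command forces the round trip so credential errors surface at open rather than on first use:

```
from contextlib import contextmanager

from pymongo import MongoClient

@contextmanager
def open_mongo(uri: str):
    """Yield a connected client; connect eagerly to catch credential errors early."""
    client = MongoClient(uri)
    try:
        client.admin.command("ping")  # forces connection and authentication now
        yield client
    finally:
        client.close()
```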
Update DSDK to install cfgenvy from pypi.
Create a project from the DSDK cookiecutter to explore how the cookiecutter works. Review example notebooks to better understand how they work.
Tox local tests never worked correctly. Ensure that local tox runs like tox on travis.
Pull the retry decorator up from cdi sepsis.
Fix:
Add:
- a `sleep=time.sleep` function parameter, to replace the duration in tests.
- a `warn=logger.warn` function parameter, to replace logging.

Logging that appears to come from `dsdk.util.retry` may not be useful. Passing the client module's `warn=logger.warn` during decoration allows the retry attempt to be logged in the client module's stderr or stdout stream. A sketch follows.
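A minimal sketch of such a decorator with injectable `sleep` and `warn`; names and defaults are illustrative, not cdi sepsis's actual implementation:

```
import time
from functools import wraps
from logging import getLogger

logger = getLogger(__name__)

def retry(retries=3, delay=1.0, sleep=time.sleep, warn=logger.warning):
    """Retry the wrapped callable, sleeping `delay` seconds between attempts.

    Tests pass sleep=lambda _: None to skip real delays; clients pass their
    own logger's warn so retry logging lands in the client module's stream.
    """
    def decorator(function):
        @wraps(function)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return function(*args, **kwargs)
                except Exception as error:
                    warn("Retry %s of %s after: %s", attempt + 1, retries, error)
                    sleep(delay)
            return function(*args, **kwargs)  # final attempt raises through
        return wrapper
    return decorator
```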
Right now there are no real tests for this project. Add a test that will exercise the Batch and Block paradigms to get things started.
Move coverage configuration from `setup.cfg` to `pyproject.toml`.
See the feature here: nedbat/coveragepy#664
Figure out how to represent `exclude_lines` in the toml file, because toml has a structured syntax unlike `setup.cfg`.
Remove the `.coveragerc` file.
Add to `setup.cfg`:
```
[coverage:paths]
source =
    src
    */site-packages

[coverage:report]
exclude_lines =
    raise NotImplementedError()
    pragma: nocover
    if __name__ == .__main__.:
precision = 2
show_missing = true

[coverage:run]
branch = true
source =
    src
    test
parallel = true
```
Add a card to move coverage into `pyproject.toml` when `coverage` supports toml. Add any other `exclude_lines` as needed to reduce the `pragma: nocover` noise in the code.
The generated sql is currently written to a temp file.
Checksum this file along with the host of the db connection to rapidly return results for what should be the same query.
Cache with something durable against datatype rot (no json): pickle the dataframe, and see if we can compress the pickle at all.
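A sketch of the cache keyed on the sql text plus db host, pickling the DataFrame with gzip compression; names and layout are illustrative:

```
import gzip
import hashlib
import pickle
from pathlib import Path

def cached_query(sql: str, host: str, run, cache_dir: str = ".cache"):
    """Return the cached DataFrame for (sql, host); on a miss, call run(sql)."""
    key = hashlib.sha256(f"{host}\n{sql}".encode()).hexdigest()
    path = Path(cache_dir) / f"{key}.pkl.gz"
    if path.exists():
        with gzip.open(path, "rb") as stream:
            return pickle.load(stream)  # pickle preserves dtypes, unlike json
    df = run(sql)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wb") as stream:
        pickle.dump(df, stream)
    return df
```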
When pylint 2.5 is released, remove `.pylintrc` and add:

```
[tool.pylint]

[tool.pylint.BASIC]
good-names = ["a", "b", "c", "d", "df", "i", "logger", "n", "tz"]

[tool.pylint.MASTER]
ignore = ["ci", "docs"]

[tool.pylint.'MESSAGES CONTROL']
disable = ["C0330", "duplicate-code"]

[tool.pylint.MISCELLANEOUS]
# List of note tags to take in consideration, separated by a comma.
notes = ["FIXME", "XXX"]

[tool.pylint.SIMILARITIES]
# Minimum lines number of a similarity.
min-similarity-lines = 4
# Ignore comments when computing similarities.
ignore-comments = "yes"
# Ignore docstrings when computing similarities.
ignore-docstrings = "yes"
# Ignore imports when computing similarities.
ignore-imports = "yes"
```

...to `pyproject.toml`.
Previous applications had a stdout format which was able to be ingested by grafana through loki and promtail log scraping and pushing.
Currently, dsdk does not log to stdout in a format that can be used to track application execution.
Create a new stdout format that is:

```
datetime - service_name - LEVEL - {JSON payload}
```

For example:

```
2020-04-11 15:25:19,238 - event_writer - INFO - {"messages": 7716, "buffer": 0, "sets": 7716, "events": 7716, "in": 102.04260312020779, "rate": 151.23095185861598, "reconnects": 4}
2020-04-20 18:09:34,358 - ventcue.micro - INFO - {"buffer": 0, "inputs": {"flowsheet": {"ready": true, "recorded_on": 1587420540, "recv": 4487, "sets": 4487, "events": {"n": 437, "s": 32409.919143662788, "rate": 0.013483526387798718}, "queries": {"n": 26902, "s": 85463.01181396982, "rate": 0.3147794517066457}, "models": {"n": 349980, "s": 86717.0700424281, "rate": 4.035883590494526}}, "location": {"ready": true, "recorded_on": 1587420526, "recv": 5039, "sets": 5039, "events": {"n": 4047, "s": 38378.57613998442, "rate": 0.10544945662493363}, "queries": {"n": 344, "s": 482.0784293974284, "rate": 0.7135768352672015}, "models": {"n": 53080, "s": 482.3068911090959, "rate": 110.05440929516703}}, "mar": {"ready": true, "recorded_on": 1587420484, "recv": 4487, "sets": 4487, "events": {"n": 437, "s": 21854.630461717723, "rate": 0.019995762489120248}, "queries": {"n": 5702, "s": 30855.604431888787, "rate": 0.18479625030800148}, "models": {"n": 9205, "s": 31152.971045301296, "rate": 0.2954774357352462}}}, "outputs": {"cue": {"ready": true, "buffer": 0, "sets": 7757, "events": 41849, "sends": 7757, "reconnects": 0}}, "caches": {"n": 437, "s": 40283.425938990666, "rate": 0.01084813393632005}, "entries": 6516, "runs_in": 119072.81125657004}
```
See prototype for details
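A minimal sketch producing that format with stdlib logging; the service name rides on the logger name and the payload is serialized JSON:

```
import json
import logging

logging.basicConfig(
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("ventcue.micro")
logger.info(json.dumps({"buffer": 0, "entries": 6516}))
# 2020-04-20 18:09:34,358 - ventcue.micro - INFO - {"buffer": 0, "entries": 6516}
```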
Requirements:
- `PostgresPersistor` to add a `store_evidence` method.
- `execute_batch`: when it throws a DBError, go through row-by-row and log each row that failed. Also keep going through the full dataset to find other bad data points. Roll back afterwards so nothing gets committed. Then throw an exception (or re-raise the original exception).
- `insert = getattr(sql, name).insert` will throw an AttributeError if `name` is wrong. This would be better as a more user-friendly error.
- Add `run_id` on batch creation.
- Move `store_evidence` to `MongoPersistor` instead of `Service`.
- `register_adapter` for all standard pandas types (see the sketch after this list). Use `pandas.to_sql` as a reference for the types we need to support. This needs 100% coverage with real postgres. The test should write and then re-read (into a DataFrame).
- Compile a list of items for project code review: `# pragma: nocover`
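A minimal sketch of `register_adapter` for numpy scalar types, assuming psycopg2; the exact list of types to cover should follow pandas.to_sql:

```
import numpy as np
from psycopg2.extensions import AsIs, register_adapter

def adapt_numpy_scalar(value):
    """Adapt a numpy scalar by rendering its plain Python value."""
    return AsIs(repr(value.item()))

for numpy_type in (np.int16, np.int32, np.int64, np.float32, np.float64):
    register_adapter(numpy_type, adapt_numpy_scalar)
```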
The test_flowsheets are not picking up the test data, and are not running correctly.
Store the assert_on datetime as epoch time in mongo.
Add utility functions for dsdk users to return the correct epoch time for storage in a db (milestone 1); `time.time_ns()` seems like a good candidate.
Also store the batch at the end of the pipeline instead of the beginning. This will require us to get an OID from mongo at the beginning of the pipeline and use it again later. The assertion_time
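A small sketch of candidate helpers, assuming mongo's millisecond datetime precision; the names are illustrative:

```
import time
from datetime import datetime, timezone

def epoch_ms() -> int:
    """Current epoch time in milliseconds, derived from time.time_ns()."""
    return time.time_ns() // 1_000_000

def from_epoch_ms(ms: int) -> datetime:
    """Recover an aware UTC datetime from stored epoch milliseconds."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```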
DSDK testing has been simplified. After activating a python 3.10 virtual environment, use:

```
pytest
```

...to show coverage.
Look for ways to add coverage without adding code, perhaps by updating configuration files, or loading additional configuration files in the tests.
Familiarize yourself with pytest.raises, pytest.mark.parametrize, pytest.mark, logger.debug, conftest.py, and running `pytest --log-cli-level=debug`.
Review the lines not covered and pick the easy coverage first. Avoid brittle tests that break when the implementation changes.
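A minimal sketch of the pytest features named above; the function under test is hypothetical:

```
import logging

import pytest

logger = logging.getLogger(__name__)

def reciprocal(n: float) -> float:
    """Hypothetical function under test."""
    return 1 / n

@pytest.mark.parametrize("n,expected", [(1, 1.0), (2, 0.5), (4, 0.25)])
def test_reciprocal(n, expected):
    logger.debug("n=%s", n)  # visible with: pytest --log-cli-level=debug
    assert reciprocal(n) == expected

def test_reciprocal_of_zero():
    with pytest.raises(ZeroDivisionError):
        reciprocal(0)
```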
Corey was looking for a way to add a verify/validation step after the pipeline runs to examine postgres predictions.
This cannot be a task in the pipeline, because the predictions and run are only written to postgres after the pipeline is completed.
The intent is to provide an optional post-pipeline task (or tasks) that can verify/validate without the race conditions and complexity invited by:
@mdbecker Let me know if you have other ideas to handle the validation systems test: ingest -> predict -> postgres. I don't want to test the whole integration: ingest -> predict -> postgres -> epic. The validation/verification is to check the model (not the entire integration pipeline). Ideally we could exclude ingest errors as well, but not at this point.
─────────────────────────────▀███
────────────────────────────────█
───────────────▄▄▄▄▄────────────█
──────────────▀▄────▀▄──────────█
──────────▄▀▀▀▄─█▄▄▄▄█▄▄─▄▀▀▀▄──█
─────────█──▄──█────────█───▄─█─█
─────────▀▄───▄▀────────▀▄───▄▀─█
──────────█▀▀▀────────────▀▀▀─█─█
──────────█───────────────────█─█
▄▀▄▄▀▄────█──▄█▀█▀█▀█▀█▀█▄────█─█
█▒▒▒▒█────█──█████████████▄───█─█
█▒▒▒▒█────█──██████████████▄──█─█
█▒▒▒▒█────█───██████████████▄─█─█
█▒▒▒▒█────█────██████████████─█─█
█▒▒▒▒█────█───██████████████▀─█─█
█▒▒▒▒█───██───██████████████──█─█
▀████▀──██▀█──█████████████▀──█▄█
──██───██──▀█──█▄█▄█▄█▄█▄█▀──▄█▀
──██──██────▀█─────────────▄▀▓█
──██─██──────▀█▀▄▄▄▄▄▄▄▄▄▀▀▓▓▓█
──████────────█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──███─────────█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──██──────────█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──██──────────█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──██─────────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──██────────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──██───────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█▌
──██──────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█▌
──██─────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█▌
──██────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█▌
Set up pypi integration and figure out how to upload packages automatically.
PRs should not check out as ready unless the github lint and test workflows pass. Look into how these checks can be integrated into the PR process as required.
Lines 34 to 39 in 86b358a
Remove the `secrets.yaml` and `local.yaml` options here.
The primary purpose of the `./local` and `./secrets` directories is that the `.gitignore` in each directory establishes a default policy to exclude all files in those directories from git, except for explicit example files.
The other purpose of the `./local` and `./secrets` directories is to maintain parity between local testing with docker-compose and running on nomad. The docker-compose files map `./local` and `./secrets` to the container's `/local` and `/secrets` directories to maintain this parity. Nomad/Consul writes configuration files to `/local` and Nomad/Vault writes secrets files to `/secrets`.
These two directories (along with `/alloc`) have special significance and handling by nomad: https://www.nomadproject.io/docs/runtime/environment.html#task-directories
I understand that we don't want the nomad job to process `secrets.env` as an env file and expose the secrets to the container. However, we can simply remove `env = true` from the nomad job and require the microservice to read the file.
We should discuss the work required to assemble multiple secrets into `secrets.yaml` inside the nomad job, and how we maintain the same type of secrets handling in docker-compose and local development.
The as-of time is typically a past date such that results from clarity are generally stable. However, I forgot to include the datetime that the gold file was created in the gold file itself.
Alter on_create_gold to include the current non-naive (timezone-aware) datetime in the generated gold file.
Lines 32 to 36 in 55cbf87

The `Task.run` method has an explicit backreference to `self.batch` that had to be wired up post-initialization for each Task. Replace the shared state with parameters to `Task.run(self, batch)` and eliminate the Feature Envy anti-pattern. Replace all references to `self.batch` in the apps with simply `batch`. Lint will complain about `no-self-use` because the Task methods aren't object-oriented: they use the Batch as an Accumulating Parameter, just as a non-method procedure would.

```
name = task.name
evidence = task.run(self)
if evidence:
    self.evidence[name] = evidence
```
- [x] Correct the mixin implementation to the canonical forms of mixin and dependency-injection. See test/test_mixin.py. Correct logical and programming errors indicated by type checking. Apply Guido's recommendations for working with Mixins and type checking.

Add a `Service` class separate from `Batch`. In accordance with Accumulating Parameter, consider turning the Batch into a dataclass, and moving the dynamic, configured, or mixed-in properties to a Service class, separating the mutable Batch from immutable configuration and runtime connections. Batch only exists during `run`, not during the smoke test or `check`, while Service is needed for both.
Replace `Service.run` and `Task.run` with `Service.__call__` and `Task.__call__`. Since tasks are typically `no-self-use`, this interface change may allow bare functions to be used as tasks.
```
class Service:
    def __call__(self):
        with self.open_batch(...) as batch:
            for task in self.pipeline():
                task(batch, self)
```
Use a `MongoEvidenceMixin` to use the `MongoMixin.open_mongo_database` connection contextmanager, or a `PgsqlEvidenceMixin` to use the `PgsqlMixin.open_pgsql_database` contextmanager, or extend or override `Service.open_batch` and `Service.store_evidence`. Remove the `@mongo.store_evidence` decorator from all the Tasks. Remove the monkey patches from the unit tests. Since tasks customize how they store evidence (e.g. excluding columns of the df from storage), have tasks call `Service.store_evidence` directly and remove the evidence code from the loop inside `Service.__call__`.
Replace `batch = Batch()` with a contextmanager `Service.open_batch(...)` to manage the open and close of the batch.
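A minimal sketch of `open_batch` as a contextmanager; the start/end record calls are assumptions about the bookkeeping, not dsdk's actual API:

```
from contextlib import contextmanager

class Service:

    @contextmanager
    def open_batch(self, **kwargs):
        """Manage the open and close of a batch around the pipeline run."""
        batch = Batch(**kwargs)   # Batch assumed from dsdk
        self.store_start(batch)   # hypothetical 'start run' record
        try:
            yield batch
        finally:
            self.store_end(batch)  # hypothetical 'end run' record
```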
Add a basic implementation for `Model`.
Get the cdisepsis project into a good state and then use it to make a cookiecutter template for future projects. Some things that would be good to address before we do this:
The current wheel build is not working for pymssql in Python 3.10. Resolve how to get the wheel to build properly for pip install.
Create documentation, using `from __future__ import annotations`. If it turns out that readthedocs is not easy to set up, split that off into a separate card. Timebox this effort to 2 days max.
Either via a context manager, a decorator, or a Batch mixin: rather than depending on `datetime.utcnow`, pass this into the application via some other mechanism (command-line arguments?) to assist in supporting replay.
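A minimal sketch of injecting the as-of time via the command line to support replay; the flag name is illustrative:

```
from argparse import ArgumentParser
from datetime import datetime, timezone

parser = ArgumentParser()
parser.add_argument(
    "--as-of",
    type=datetime.fromisoformat,
    default=None,
    help="Replay as of this ISO-8601 datetime; defaults to now.",
)
args = parser.parse_args()
as_of = args.as_of or datetime.now(timezone.utc)  # injected, not datetime.utcnow()
```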