pennsignals / dsdk Goto Github PK

View Code? Open in Web Editor NEW

7.0 7.0 7.0 600 KB

Data Science Deploy Kit

License: MIT License

Python 79.28% PLpgSQL 10.15% Shell 10.57%

python

dsdk's People

Contributors

Stargazers

Watchers

Forkers

brittanyistenes dancampos1 fxdgear hessam61 dmexs mdbecker

dsdk's Issues

Look into using a yaml 1.2 parser

Yaml 1.1 has a loose interpretation of y|n|on|off|yes|no as boolean even when used as a key. Use a yaml 1.2 if possible.

Submodule naming conventions

We already have a way to import dsdk components using the (sub-)module without ambiguity, and without duplicating the submodule names in the class names:```
from dsdk import mssql, postgres, Service

class CDISepsis(mssql.Mixin, postgres.Mixin, Service):
pass```

The reason I typically use from dsdk.postgres import Mixin as PostgresMixin instead of importing submodules (or modules e.g. import dsdk.mssql, import re , import os.path) has to do with how module imports expose implementation details.

For, example, if I from os import environ inside dsdk/mssql.py, environ can be used as mssql.environ by importers of dsdk.mssql --unless shielded by a submodule directory and a dunderinit file--.

People are familiar/comfortable with as used in import numpy as np, yet this form "gets a pass" even though it has overly wide scope. For client code, I don't care. Module code has a different bar.
People are familiar/comfortable with nearly-bare imports import re even though this form is less specific and less deliberate than from dsdk.postgres import Mixin as PostgresMixin or from re import compile as re_compile.
Submodules, orgs, and repos provide perfectly good namespaces already. as is designed to disambiguation collisions. If these namespaces aren't good enough then this becomes PennSignalsDSDKPostresMixin. Seriously! In the immunehealth repos "immunehealth" (or worse arbitrary abbreviations) must be specified at least three times: the org, included in the repo name, included in the class names. Why do we prefix penn on signals everywhere when it is part of the organization? pennsignals.uphs.upenn.edu It is right there in "upenn"!
Open closed principle, and cohesion of the submodules would indicate that any aliasing needed by the module happens at the module level in dunderinit.

So if I was going to "fix" DSDK, I would add directories and dunderinit files that narrow the scope imported by from dsdk import mssql or import dsdk.mssql instead of duplicating the submodule namespace in the class names.

Unify remote ci and local lint and test workflows

Add isort back into the .pre-commit-config.yaml. Add isort options to setup.cfg or pyproject.toml compatible with black.
Add mypy into the .pre-commit-config.yaml. Add mypy options to setup.cfg.
Move command line options for all tools from tox.ini into config files setup.cfg, pyproject.toml, etc. Bare invocations of black, flake8, isort, pylint, pytest will be configured the same way remote and local. Command invocations by pre-commit will be configured the same way as direct invocations to the tools. Layer the macros to ease debugging.

Understanding pep518: https://snarky.ca/clarifying-pep-518/

Add logging

Add logging to dsdk, additionally update the cookiecutter template to make this work as expected.

Update .cookiecutterrc

The .cookiecutterrc still includes references to Mike Becker's upstream project. Like:

dsdk/.cookiecutterrc

Line 30 in acf5fa1

    
           codacy_projectid:          '[Get ID from https://app.codacy.com/app/mdbecker/dsdk/settings]'

Fix these references.

If the cookie cutter patcher is intended to be used. Ensure that it regenerates target files correctly. i.e. no pymodules in setup.py, setup.py is alphabetized and uses constant tuples where possible.

setup readthedocs

Remove setup.py

Fix build

Failing for python 3.10

Shortly we will be able to remove the 3.7, and 3.8 builds.

It looks like numpy may have updated. This nightly build is desirable for exactly this sort of discovery.

Allow keys in df_from_query to be DataFrame with multiple columns

Allow for the expansion of multiple columns in CTEs from each kwarg in keys:

with panel as (
    select cast(null as int) as pat_id, cast(null as int) as dept_id
    union all select 1, 5
    union all select 2, 5
) ...

This will allow us to pass in related columns instead of a single dimension list.

This accommodates the need to pat_id AND an primary_department in pcsnap.

Declare python module dependencies in tox.ini without using commands

I've avoided duplicating dependency list in tox.ini and setup.py and requirements.txt, the dependency lists have been added as extras to setup.py and used in tox.ini like so:

[testenv]
command =
    python -m pip install .[lint]  # install
    pre-commit run-all-files
    python -m pip install .[test]  # install
    {posargs:pytest}

... However, this will be difficult to debug because there is no separation between a dependency failure (pip install failure) and a test failure. The original duplicate code used the dep = clause above command = clause.

Find a way to create dependency lists from setup.py and use the dep = clause from the extras in setup.py Better, find out how to use dependencies from pyproject.toml without setup.py as pyproject is about declaring dependencies instead of imperatively running setup to install dependencies.

Add a dynamic asset to pick up file changes

The Asset class is used to load text files from a directory tree.

By design this is done once as part of yaml deserialization when the configuration is first loaded to separate early configuration failures from late runtime failures: dev-ops principle to shift all error to the left.

This is problematic for working on sql queries in a notebook since components must be deserialized again after every sql file change.

Create a development version of Asset registered as !asset for notebook use that dynamically loads content from the filesystem.

Properly handle timezones

Right now we're using timezone unaware datetimes. This causes issues when you try to compare timestamps that come out of a mongo ObjectId (via generation_time). It might work when comparing against datetimes that come out of a mongo document (i.e. fields other than _id that are passed in as a datetime to mongo). I'm not even sure what will happen with datetimes that come out of mssql.

This task entails implementing a datetime timezone strategy that works for at least all of these cases.

Create an interface to EPIC Flowsheets to allow batch jobs to enter a predictive value for a list of patients.

Intro

Enable DSDK application to enter a predictive value for a list of patients into each patient's PennSignals flowsheet.

For every new Penn Signals application, we can enter a Service Request to create a new flowsheet ID for the PennSignals flowsheet.

API End Point

Here is the staging url: https://secureappstest.uphs.upenn.edu/PennChartX-Stage/api/v1/epic/addFlowSheetValue

These are the parameters you can pass in:
"projectName": "",
"uid": "",
"csn": "",
"value": ""

The projectName parameters take 1 of 3 values:
MaternalMorbidity:
Sepsis:
VentLiberation:

To test this I logged into Epic-POC and found a patient encounter. From that I found the UID and CSN. I called it 3 times passing in different project names and values. Here is a screenshot of the results:

Example

Here is the cUrl for one of the calls (Granted you will have to create and change the Bearer token.):

curl -X POST "https://secureappstest.uphs.upenn.edu/PennChartX-Stage/api/v1/epic/addFlowSheetValue" -H "accept: application/json" -H "authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6Ikl2WFhNY3ViNldMYVBsMVdwTS1pT29WZ3J5RSIsImtpZCI6Ikl2WFhNY3ViNldMYVBsMVdwTS1pT29WZ3J5RSJ9.eyJpc3MiOiJodHRwczovL3VwaHNuZXR0ZXN0MjAxMi51cGhzLnVwZW5uLmVkdS9NZWRWaWV3SWRlbnRpdHlTZXJ2ZXJfU3RhZ2UvaWRlbnRpdHkiLCJhdWQiOiJodHRwczovL3VwaHNuZXR0ZXN0MjAxMi51cGhzLnVwZW5uLmVkdS9NZWRWaWV3SWRlbnRpdHlTZXJ2ZXJfU3RhZ2UvaWRlbnRpdHkvcmVzb3VyY2VzIiwiZXhwIjoxNTcxNDI0NTU0LCJuYmYiOjE1NzE0MjA5NTQsImNsaWVudF9pZCI6InN3YWdnZXJfdWkiLCJzY29wZSI6InBlbm5jaGFydHgtYXBpLXNjb3BlIiwic3ViIjoiQTQ1MDgzNTMtRTAyQi00MEUyLUE4NDgtNkU3NkY4MTM1RDNCIiwiYXV0aF90aW1lIjoxNTcxNDIwOTU0LCJpZHAiOiJpZHNydiIsIlBlbm5JRCI6IjgxMDA5NzA3IiwibmFtZSI6IkxBTkcsIEVSSUsgRCIsInJvbGUiOiJETUVFOiBDbGluaWNhbDogQ2FuIFZpZXcgQWxsIiwiYW1yIjpbInBhc3N3b3JkIl19.mOV5_XOu58nJuZHgSqmiNxyf0_1LWuH-FDzie7_r0T-LEEAr2Q0Rf44vvFFopqMy0rm_T8w7hp3NMNud7D5im09D0t69_exTe4JHIcFXUz2X8PdRmidEq13NUmD7ONfNhTtYMRcRHiYhHK3khbD3eMkGmpOZHitvZ_eIliFSPf_3jKq6Al4a3aL-70baRfw7cGvWhhgjbLEQX6Z2-I35GidoTfruc23G7ydwX1BnZUykqtJNkBWQWtkF-VgqHcmPaylZXEbM-fvpPONhZ1H-maf4x2u3-rF_HWE5dKyoCJ96sWA8fJakoSmLgXzctYweivcVRYVscP52WNTymBmJuQ" -H "Content-Type: application/json" -d "{ "projectName": "VentLiberation", "uid": "8642014385", "csn": "100040977", "value": "Another important note"}"

Open questions -Third message

It isn’t clear to me what the purpose of the third message is. Under Tara’s quote, there is a note: "We will likely not send the Recipient message.” It seems that the value corresponds to a pre-defined group id. It also seems that the flowsheetID should be different, otherwise, how does Epic know it isn’t a text alert message (the second type of message) with a numeric value?

Ok, based on the uncertainty of the third message, we can start testing out the endpoint based on the first 2 messages. I have moved the endpoint into stage, so you can start testing with it. Currently, it is pointed towards the Epic POC environment.

Add ci template to regenerate tox.ini back into project

The original cookie cutter template: gh:ionelmc/cookiecutter-pylibrary

... intended to make both the tox.ini and .travis files not just scaffold-ed, but regenerated from templates in /ci configured by .cookiecutterrc and some details in setup.cfg. It did not do a good job of using setup.cfg nor did it attempt to use the more appropriate pyproject.toml.

Edit the original template from upstream and recreate a tox.ini template that regenerated the appropriate, or better, find a way to reduce the .cookiecutterrc and use options from .pyproject.toml to generate the tox.ini and .travis files.

Fix code to silence warning

"Noise" in the error logs:

/root/.local/lib/python3.9/site-packages/dsdk/postgres.py:299: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["run_id"] = run_id

The run_id column is being added to all rows. Find the correct way to do this. Very likely, the execution of the noisy code is setting the value for each row one at a time.

This could be as simple as:

df.loc[:, ("run_id")] = run_id

Experiment in a jupyterlab notebook and ensure that the run_id is set for all rows using your own DataFrame that has multiple columns.

Separation of concerns

Separation of Concerns

Motivation

Trading an increase in both planned and unplanned redeployment and on-call time for a decrease in planned initial deployment time is not a trade off that gives us time.
Trading an increase in runtime and maintenance complexity for initial ease of implementation is not a trade off that gives us simplicity.

On-Call

Is an outage due to an upstream/downstream/database/fileserver system failure? Add a smoke test that runs before the pipeline. Run the smoke test independent from the pipeline at any time to check the state of all dependencies.
Is the server deployment/redeployment failure due to misconfiguration of consul/nomad/vault? The smoke test runs all configuration and initialization, followed by the dependent systems check.
Is it a bug in the code and/or unexpected data or is it configuration? Cleanly separate the configuration and validation from the pipeline init and run. Ensure that code and data problems are not conflated with bad configuration problems that pass through init. Problems detected by the smoke test after deployment generally do not need a full service build locally to reproduce the problem.

Software Specification:

Startup sequence is: setup static dependencies, initialization, setup dynamic dependencies, and finally complete any other required startup tasks. An example of a required startup task is writing a 'start run' record to a data store.

Shutdown sequence is: complete any other required teardown tasks, teardown dynamic dependencies. , and teardown static dependencies. An example of a required teardown task is writing a 'end run' record to a datastore.

Static dependencies are: argument definition, argument parsing, configuration, and validation. Argument parsing and configuration will include model unpickle, options file read, environment variable read, secrets read. Configuration will be a hierarchical structure allowing the same class to be configured by different parts of the hierarchical structure without additional subclassing needed solely for location in the hierarchy. Validation will include creating typed parameters for __init__.

Initialization will be completed by typed and validated parameters to __init__, not unvalidated configuration collections like dicts, or lists, or json-supported types. See primitive obsession anti-pattern

Dynamic dependencies (database connections, sockets) will be created after initialization with lifetimes and state managed by method scope (context managers), not callbacks with shared state like setup and teardown. Dynamic dependencies will have as narrow a scope as possible, limited to the Task/Block in which they are required. Dynamic dependencies will be separated by kind of use (input vs output; read vs read+write), not mere type (mongo vs mssql, pgsql): it is very common to have mongo collections read from production while writing to staging or development. Dynamic dependencies will be separated by data type, not mere source: just because all types mongo ingest data lives on a single server doesn't mean that it will continue to do so. A method will be added to check runtime dependencies without having to run the Service/Pipeline.

Discuss the actual existence of dynamic dependencies for the batch jobs. If auto reconnect, or reconnect on failure are not desired features, then database connections and sockets may be treated as static pipeline dependencies to be passed in as initialization parameters. These resources will still be managed by context managers.

Consequences:

Tests for static dependencies are separate from tests of dynamic behavior. Tests checking runtime behavior are not conflated to include startup argument definition, argument parsing, configuration, nor validation --these are separate concerns--.
Mocks are only needed for dynamic behavior. Mocks do not have to work around the concerns that they are not intended to test.
__init__ methods do not require any dependencies like the filesystem.
Block/Task/Stage subclasses and tests do not need to work around unwanted superclass behavior. See refused bequest anti-pattern.
Managing static dependencies is decoupled from initialization. Initialization is decoupled from dynamic dependencies. __init__ methods are declarative. The software has seams supporting the open-closed principal
Misconfigured static dependencies will be caught early without having to run the Service/Pipeline, or resulting in a partial run.
Misconfigured dynamic dependencies will be caught early without having to run the Service/Pipeline, or resulting in a partial run.

Refactor Demonstration:

from __future__ import annotations

from typing import List

"""
Concepts:
    + See separation of concerns.
    + See liskov substitution principle.
    + See open-closed principle.
    + See refused bequest anti-pattern.
    + See primitive obsession anti-pattern vs domain driven design
    + See dependency injection. How do you validate injected dependencies?
    
Operations Requirements:
    + Error as early as possible.
        Prefer __init__ before __call__/run
        Prefer static dependencies before __init__
            Prefer configuration before __init__
            Prefer arg parsing before configuration.
        As a last resort allow contextmanager to
            control dynamic dependencies after __init__
        Do not pass through unvalidated configuration to __call__/run
        Do not pass through primitive dict as a runtime lookup
    + Add a smoke test that may be run independently from the pipeline job.
    + Add the smoke test code to run before __call__/run.
    
Maintenance Requirements:
    + Descriptive method signatures.
    + Separate argument parsing, from configuration from initialization, from run.
    + Separate smoke test before run.
    + Units tests are code too.
    + Prefer composition to inheritance.
"""

class PredictModel:
    """PredictModel."""

class PredictStage:
    """PredictStage."""
    
    def __init__(self) -> None:
        """__init__.
        
        Bad:
        - Subclasses must call super().__init__()
            __init__ cannot be overridden by subclasses, merely extended.
            __init__ has non-essential dependencies, violates open-closed principle.
        - Must use the same file name, a single file, and are tied to the filesystem.
        - Mocks patch shared(!) implementation detail `pickle.loads`
        - Init is configuration, violates separation of concerns.
        """
        self.model = pickle.loads('predict.pkl')

class FirstRefactorForPredictStage:
    """
    Add a seam.
    """
    
    def __init__(self) -> None:
        """__init__.
        
        Problem:
        - `__init__` signature doesn't declare true dependencies.
        """
        self.model = self.configure()
        
    def configure(self, model: str='predict.pkl') -> PredictModel:
        """Configure.
        
        Good:
        + Subclasses can override `configure`, and default `path`.
        + Mocks may patch intent by overriding `configure` instead of shared implementation detail.
        
        Is the initialization of self complete when this method is called? No...
        
        Problem:
        - Methods get a partially initialized self, violates the __init__ method contract.
        - Configuration errors are conflated with initialization errors, violates separation of concerns.
        """
        return pickle.load(model)

class SecondRefactorForPredictStage:
    """
    Add an alternate no-arg/default-arg constructor.
    
    Note that the default constructor `__new__(cls, ...)` is an undecorated classmethod,
        but it is still a classmethod: `a = SecondRefactorForPipelineStage(...)`
    """
    
    @classmethod
    def configure(cls, model: str='predict.pkl') -> SecondRefactorForPredictStage:
        """Configure.
        
        Notice that we "traded" the type of the `model` variable:
            A str primitive for a PredictModel.
        
        Good:
        + Subclasses can still override `configure`, default `path`.
        + Filesystem availability and model load are optional,
            the true static dependency is the model itself.
        + Configuration errors happen here.
        
        Problem:
        - Handle more than just model path.
        """
        return cls(model=pickle.load(model))
    
    def __init__(self, model: PredictModel) -> None:
        """__init__.
        
        Good:
        + Interface to `__init__` documents dependencies with their types.
        + Tests no longer need mocks nor patches to create PredictStage,
            they may avoid using the alternate constructor.
        + Parameters provided to __init__ are fully initialized.
        + Mocks for model are simply passed in without patching or the need for a alternate pkl file.
        + Pickle is optional.
        + No io in __init__.
        + Initialization errors happen here.
        """
        self.model = model

pipeline_cfg = {
    "pipeline": {
        "extract": {
            "clarity": {
                "tables": [
                    "hl7_patients",
                    "hl7_labs",
                ],
            },
        },
        "predict": {
            "model": "./models/predict.pkl",
        },
    },
}


class ThirdRefactorForPredictStage:
    """
    Add argument parser and alternate constructor to configure from parsed args.
    
    The argument parser is not a required for creation of a PredictStage.
    The argument parser is not a required for creation of a cfg dict either.
    The model path *may* be overridden by the arg parser.
    The argument parser *may* be used for the creation of a cfg dict.
    
    Bad:
    - Two pass/four steps for argument parsing:
        Traverse classes adding args.
        Parse args.
        Load root configuration for the microservice/pipeline from the root cfg arg.
        Traverse pipeline classes apply args and returning Stage(s).
    - Coupling between add_args_to(parser) and configure_from(args).
    - Duplication of default value for model.
    - Configuration and pipeline is nested, while add_to_args parsing is flat.
    """
    
    @classmethod
    def add_args_to(cls, parser: ArgParser) -> None:
        """Add args to parser."""
        parser.add(
            name='model',
            arg='--model',
            # default='predict.pkl'
            help="Path for the model pickle file",
            env_var="MODEL",
        )
        
    @classmethod
    def configure_from(cls, cfg: dict, args: Namespace=None) -> ThirdRefactorForPredictStage:
        """Configure from.
        
        Argument and cfg dict structure errors happen here.
        
        Bad:
        - Root cfg dict problems are conflated with arg patching problems.
        """
        if cfg is None:
            cfg = {}
        
        if args:
            for key, value in (('model', args.model),):
                if value is not None:
                    cfg[key] = args.model
        
        return cls.configure(model=cfg['model'])
    
    @classmethod
    def configure(cls, model='predict.pkl') -> ThirdRefactorForPredictStage:
        """Configure."""
        return cls(model=pickle.loads(cfg['model']))
                   
    def __init__(self, model: PredictModel) -> None:
        """__init__."""
        self.model = model

class Foo:
    """Foo."""
    
    def __init__(self, identity: int):
        self.identity = identity
    
class Bar:
    """Bar."""
    
    def __init__(self, uri: str):
        self.uri = uri
        
    
class FifthRefactorForPredictStage:
    """
    Simplify and chain the constructors.
    
    Bad:
    - Assumes the constructors can be chained and that no common
        parameters are pushed from up to each stage the root configuration,
        
    - Changing arg and env variable names requires(!) subclassing.
        but parent components could pass into the cfg key to
        add_args_to(parser) and from_args(cfg, args).
        
    Possible Solutions:
    + Pass parent components to run/__call__ method.
    + Pass cfg key or a key list to add_args_to(parser) and from_args(cfg, args)
        so these methods can alter the argument names depending on the location
        of their configuration within the root cfg file.

    ConfigArgParse:
    - Please see the design choices section: https://pypi.org/project/ConfigArgParse/
    - Produces a non-heirarchical flat config due to command line option equivalency.   + Extensible ConfigFileParser
    - Only helps with from_args, does not do the validation/trading of from_cfg, still requires add_args_to.
    """

    @classmethod
    def add_args_to(cls, parser: ArgParser) -> None:
        """Add args to parser."""
        parser.add(
            name='model',
            arg='--model',
            type=str,  # type of the primitive option, not type of the model
            # default='predict.pkl'
            help="Path for the model pickle file",
            env_var="MODEL",
        )
        Foo.add_args_to(parser)
        
    @classmethod
    def from_args(cls, cfg: dict, args: Namespace) -> FifthRefactorForPredictStage:
        """Return instance from args patched into root cfg: PredictStage.from_args(cfg, args).
        
        Good:
        + Arg patching and arg errors happen here.
        + Arg parsing errors happen just before in parser.parse().
        """
        if cfg is None:
            cfg = {}
            
        for key, from_args in (('foo', Foo.from_cfg)):
            cfg[key] = from_args(cfg.get(key), args)
            
        for key, value in (('model', args.model),):
            if value is not None:
                cfg[key] = args.model
        return cls.from_cfg(cfg)
    
    @classmethod
    def from_cfg(cls, cfg: dict) -> FifthRefactorForPredictStage:
        """Return instance from cfg: PredictStage.from_cfg(cfg).
        
        Good:
        + Validation errors happen here while trading primitive types for instances.
        + Pass through only legitimate primitives.
        """
        
        for key, from_cfg in (('model', pickle.loads), ('foo', Foo.from_cfg), ('bar', Bar)):
            # from_cfg could be any alternate, single-arg constructor or adapter function
            cfg[key] = from_cfg(cfg[key])  # KeyError if expected is missing.
            
        # Trim to only expected parameters, ignore unexpected parameters.
        kwargs = {key: cfg[key] for key in ('model', 'foo', 'bar')}
        return cls(**kwargs)
                   
    def __init__(self, model: PredictModel, foo: Foo, bar: Bar) -> None:
        """__init__.
        
        Good:
        + Initialization errors happen here.
        """
        self.model = model
        self.foo = foo
        self.bar = bar
        
    def __call__(self, batch: Batch, run: Run, service: Service) -> Batch:
        """__call__."""

Fix github workflow trigger

The github workflow trigger seems to triplicate the lint, test and publish tasks. Look into this and perhaps tighten the constrains on the trigger.

Fix automated security scans

Quay had a nice feature that automated security scans of docker images. Automate the same functionality using github workflows.

The dockerfile code scans might be equivalent here.

Add mongo context manager

Add mongo context manager.

Ensure that the database connection is opened immediately to catch credential errors early.

Install cfgenvy dependency from pypi

Update DSDK to install cfgenvy from pypi.

Create Project from dsdk cookie/review examples

Create a project from DSDK cookie to explore how the cookie cuter works. Review example notebooks to understand better how they work.

Fix tox local

Tox local tests never worked correctly. Ensure that local tox runs like tox on travis.

Add retry decorator

Pull the retry decorator up from cdi sepsis.

Fix:

for loop inside exception handler after initial attempt.
spurious delay after the final retry attempt.
accumulated floating point errors in the delay.

Add:

default argument for sleep=time.sleep function to replace duration in test.
default argument for warn=logger.warn function to replace
stderr/stdout during test.

Logging appearing to come from dsdk.util.retry may not be useful. Passing the client module's warn=logger.warn during decoration allows the retry attempt to be logged in the client module's stderr or stdout stream.

Create successful tests

Right now there are no real tests for this project. Add a test that will exercise the Batch and and Block paradigms to get things started.

Update pyproject.toml because coverage 5.0 has been released

Move coverage configuration from setup.cfg to pyproject.toml.

See feature here: nedbat/coveragepy#664

Figure out how to represent exclude_lines in the toml file, because toml has a structured syntax unlike setup.cfg.

Configure coverage

Remove the .coveragerc file.
Add to setup.cfg:

[coverage:paths]
source =
    src
    */site-packages
[coverage:report]
exclude_lines =
    raise NotImplementedError()
    pragma: nocover
    if __name__ == .__main__.:
precision = 2
show_missing = true
[coverage:run]
branch = true
source =
    src
    test
parallel = true

Add a card to move coverage into pyproject.toml when converage supports toml.

Add any other exclude_lines as needed to reduce the pragma: nocover noise in the code.

Add caching for df_from_query

The generates sql is currently written to a temp file.

Checksum this file along with the host of the db connection to rapidly return results for what should be the same query.

Cache with something durable against datatype rot (no json). (pickle the dataframe. See if we can compress pickle at all)

Update pyproject.toml when pylint 2.5 is released

When pylint 2.5 is released, remove .pylintrc and add:

pylint 2.5
[tool.pylint]

[tool.pylint.BASIC]
good-names = ["a", "b", "c", "d", "df", "i", "logger", "n", "tz"]

[tool.pylint.MASTER]
ignore = ["ci", "docs"]

[tool.pylint.'MESSAGES CONTROL']
disable = ["C0330", "duplicate-code"]

[tool.pylint.MISCELLANEOUS]
# List of note tags to take in consideration, separated by a comma.
notes = ["FIXME", "XXX"]

[tool.pylint.SIMILARITIES]
# Minimum lines number of a similarity.
min-similarity-lines = 4
# Ignore comments when computing similarities.
ignore-comments = "yes"
# Ignore docstrings when computing similarities.
ignore-docstrings = "yes"
# Ignore imports when computing similarities.
ignore-imports = "yes"

... to pyproject.toml

Standardize stdout logging

Summary

Previous applications had some stdout format which was able to be ingested by grafana through loki and promtail log scraping and pushing.

Problem

Currently, dsdk does not log stdout in a format that can be used to track application execution.

Solution

Create a new stdout format that is:

human readable (ASCII)
can be modularly ingested by log scrapping services (promtail)
formatted in JSON
writes application cycle metrics (to be used for timing and sizing analysis)

Current stdout format (non-dsdk)

datetime - service_name - LEVEL - {JSON payload}

2020-04-11 15:25:19,238 - event_writer - INFO - {"messages": 7716, "buffer": 0, "sets": 7716, "events": 7716, "in": 102.04260312020779, "rate": 151.23095185861598, "reconnects": 4}

2020-04-20 18:09:34,358 - ventcue.micro - INFO - {"buffer": 0, "inputs": {"flowsheet": {"ready": true, "recorded_on": 1587420540, "recv": 4487, "sets": 4487, "events": {"n": 437, "s": 32409.919143662788, "rate": 0.013483526387798718}, "queries": {"n": 26902, "s": 85463.01181396982, "rate": 0.3147794517066457}, "models": {"n": 349980, "s": 86717.0700424281, "rate": 4.035883590494526}}, "location": {"ready": true, "recorded_on": 1587420526, "recv": 5039, "sets": 5039, "events": {"n": 4047, "s": 38378.57613998442, "rate": 0.10544945662493363}, "queries": {"n": 344, "s": 482.0784293974284, "rate": 0.7135768352672015}, "models": {"n": 53080, "s": 482.3068911090959, "rate": 110.05440929516703}}, "mar": {"ready": true, "recorded_on": 1587420484, "recv": 4487, "sets": 4487, "events": {"n": 437, "s": 21854.630461717723, "rate": 0.019995762489120248}, "queries": {"n": 5702, "s": 30855.604431888787, "rate": 0.18479625030800148}, "models": {"n": 9205, "s": 31152.971045301296, "rate": 0.2954774357352462}}}, "outputs": {"cue": {"ready": true, "buffer": 0, "sets": 7757, "events": 41849, "sends": 7757, "reconnects": 0}}, "caches": {"n": 437, "s": 40283.425938990666, "rate": 0.01084813393632005}, "entries": 6516, "runs_in": 119072.81125657004}

Add postgres evidence storage to dsdk

See prototype for details

Requirements:

Update PostgresPersistor to add a store_evidence method
- Try execute_batch, when it throws an DBError, go through row-by-row and log each one that failed. Also keep going through the full dataset to find other bad data points. Rollback after so it doesn't get committed. Then throw an exception (or raise the OG exception).
- insert = getattr(sql, name).insert will throw an AttributeError if name is wrong. This would be better if it was a more user friendly error.
Update batch to add run_id on batch creation
Move mongo store_evidence to MongoPersistor instead of Service.
register_adapter for all standard pandas types. Use pandas.to_sql as a reference for types we need to support. This need 100% coverage with real postgres. Test should write and then re-read (into a DataFrame).

Code review items for projects

Compile a list of items for project code review:

fstrings instead of %s or format
fstrings for logging -- test how these behave
type annotations
judicious or generalizable additions to setup.cfg and pyproject.toml to avoid adding # pragma: nocover
Handle epoch_ns to non-naive datetime conversions.
Handle closed-open intervals for time --avoid requiring systems to agree on the chronon width.
UI's must use epoch_ns and non-naive datetime as well.

Broken VCR tests

The test_flowsheets are not picking up the test data, and are not running correctly.

Publish wheel to pypi

Matrix test supported version of python
Update dsdk pip install to use wheel instead of source code

Store datetimes as epoch time in mongo

Store assert_on datetime as epoch time in mongo

Utility functions for dsdk users to return the correct epoch time for storage in a db (milestone 1)

time.time_ns() seems like a good candidate.

Also store that batch at the end of the pipeline instead of the beginning. This will require us to get an OID from mongo at the beginning of the pipeline and use it again later. The assertion_time

Improve test coverage

DSDK testing has been simplified. After activating a python 3.10 virtual environment, use:

pytest

... to show coverage.

Look for ways to add coverage without adding code perhaps by updating configuration files, or loading additional configuration files in the tests.

Familiarize with pytest.raises, pytest.parametrize, pytest.mark, logger.debug, conftest.py, and running pytest --log-cli-level=debug.

Review the lines not covered and pick the easy coverage first. Avoid brittle tests that break when the implementation changes.

Add pre and post run steps

Corey was looking for a way to add a verify/validation step after the pipeline runs to examine postgres predictions.

This cannot be a task in the pipeline because the predictions, and run are only written toipostgres after the pipeline is completed.

The intent is to provide an optional post-pipeline task/tasks that can verify/validate without the race conditions and complexity invited by:

an additional script that must be called after a run is complete.
an additional microservice that must be triggered or listen for the run to finish.

@mdbecker Let me know if you have other ideas to handle validation systems test: ingest -> predict -> postgres. I don't want to test the whole integration: ingest -> predict -> postgres -> epic. The validation/verification is to check the model (not the entire integration pipeline). Ideally we could exclude ingest errors as well, but not at this point.

100% test coverage

─────────────────────────────▀███
────────────────────────────────█
───────────────▄▄▄▄▄────────────█
──────────────▀▄────▀▄──────────█
──────────▄▀▀▀▄─█▄▄▄▄█▄▄─▄▀▀▀▄──█
─────────█──▄──█────────█───▄─█─█
─────────▀▄───▄▀────────▀▄───▄▀─█
──────────█▀▀▀────────────▀▀▀─█─█
──────────█───────────────────█─█
▄▀▄▄▀▄────█──▄█▀█▀█▀█▀█▀█▄────█─█
█▒▒▒▒█────█──█████████████▄───█─█
█▒▒▒▒█────█──██████████████▄──█─█
█▒▒▒▒█────█───██████████████▄─█─█
█▒▒▒▒█────█────██████████████─█─█
█▒▒▒▒█────█───██████████████▀─█─█
█▒▒▒▒█───██───██████████████──█─█
▀████▀──██▀█──█████████████▀──█▄█
──██───██──▀█──█▄█▄█▄█▄█▄█▀──▄█▀
──██──██────▀█─────────────▄▀▓█
──██─██──────▀█▀▄▄▄▄▄▄▄▄▄▀▀▓▓▓█
──████────────█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──███─────────█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──██──────────█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──██──────────█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──██─────────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──██────────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█
──██───────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█▌
──██──────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█▌
──██─────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█▌
──██────▐█▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓█▌

setup pypi

setup pypi integration and figure out how to upload packages automatically

Fix PR checks to include lint and test

PRs should not checkout as ready unless the github lint and test workflows pass. Look into how these check can be integration into the PR process as required.

Make docs work for newly built projects (those built from cookiecutter templates).

Avoid adding secrets and local to git

dsdk/src/dsdk/utils.py

Lines 34 to 39 in 86b358a

    
           default_config_files=[ 
        
               "/local/config.yaml", 
        
               "/secrets/config.yaml", 
        
               "secrets.yaml", 
        
               "local.yaml", 
        
           ],

Remove the secrets.yaml and local.yaml options here.

The primary purpose of the ./local and ./secrets directories is that the .gitignore in each directory establishes a default policy to exclude all files from git in those directories except for explicit example files.

The other purpose of the ./local and ./secrets directories is to maintain parity between local testing with docker-compose and running on nomad. The docker-compose files maps ./local and ./secrets to the container's /local and /secrets directories to maintain this parity. Nomad/Consul writes configuration files to /local and Nomad/Vault writes secrets files to /secrets

These two directories (along with /alloc) have special significance and handling by nomad: https://www.nomadproject.io/docs/runtime/environment.html#task-directories

I understand that we don't want the nomad job to process secrets.env as an env file and expose the secrets to the container. However, we simply remove env = true from the nomad job and require the microservice to read the file.

We should discuss the work required to assemble multiple secrets into secrets.yaml inside the nomad job, and how we maintain the same type of secrets handling in docker-compose and local development.

Create gold must also embed the current utc non naive datetime

The as-of time is typically a past date such that results from clarity are generally stable. However, I forgot to unclude the datetime that the gold file was created in the gold file itself.

Alter on_create_gold to include the current non naive datetime in the generated gold file.

Update Task run parameters

dsdk/src/dsdk/__init__.py

Lines 32 to 36 in 55cbf87

    
           def run(self): 
        
               """Run.""" 
        
               for block in self.pipeline: 
        
                   print(type(block).__name__)  # TODO: logging 
        
                   self.evidence[block.name] = block.run()

The Task.run method has an explicit backreference to self.batch that had to be wired up post initialization for each Task. Replace the shared state with parameters to Task.run(self, batch) and eliminate the Feature Envy anti-pattern
Replace all references to self.batch in the apps with simply batch. Lint will complain about no-self-use because the Task methods aren't object-oriented. They use the Batch as an Accumulating Parameter just as a non-method procedure would.

name = task.name
evidence = task.run(self)
if evidence:
    self.evidence[name] = evidence

-[x] Correct the mixin implementation to the canonical forms of mixin, and dependency-injection. See test/test_mixin.py Correct logical and programming errors indicated by type checking. Apply Guido's recommendations for working with Mixins and type checking.

Add a Service class separate from Batch. In accordance with Accumulating Parameter, consider turning the Batch into a dataclass, and moving the dynamic, configured, or mixed in properties to a Service class to separate mutable (Batch) from immutable configuration, and runtime connections. Batch only exists during run, not the smoke test or check while Service is needed for both.
Replace Service.run, Task.run with Service.__call__ and Task.__call__. Since tasks are typically no-self-use, this interface change may allow bare functions to be used as tasks.

class Service:

    def __call__(self):
        with self.open_batch(...) as batch:
            for task in self.pipeline():
                task(batch, self)

Use a MongoEvidenceMixin to use MongoMixin.open_mongo_database connection contextmanager, or aPgsqlEvidenceMixin to use PgsqlMixin.open_pgsql_database contextmanager or extend or override Service.open_batch and Service.store_evidence. Remove the @mongo.store_evidence decorator from all the Tasks. Remove the monkey patches from the unit tests. Since tasks customize how they store evidence (exclude columns from the df from storage), have tasks call Service.store_evidence directly and remove the evidence code from the loop inside Service.__call__.
Replace batch = Batch() with a contextmanager Service.open_batch(...) to manage the open and close of the batch.
Add a basic implementation for Model.

cookiecutter template for new projects

Get the cdisepsis project into a good state and then use it to make a cookiecutter template for future projects. Some things that would be good to address before we do this:

Get rid of tox, replace with pytest
- Or not... we might want to keep tox to manage our environments.
Make sure versioning works as expected (seems broken now)
Verify CICD works
Provide separate dockerfiles for testing vs deployment
Update the readme with steps required to get the project in a working state and how to use it

Resolve pymssql for Python 3.10

Current wheel build is not working for pymssql in Python 3.10. Resolve how to get wheel to build properly for pip install

Documentation

Create documentation

Ensure every file/function/class has a docstring
- Use 3.7 type annotations with from __future__ import annotations
- Use 3.8 type annotations without the future import
Make sure we can build the docs
Setup readthedocs
Update readme and project description to point to the docs

If it turns out that readthedocs is not easy to setup, split that off into a separate card.

Timebox this effort to 2 days max.

	default_config_files=[
	"/local/config.yaml",
	"/secrets/config.yaml",
	"secrets.yaml",
	"local.yaml",
	],

	def run(self):
	"""Run."""
	for block in self.pipeline:
	print(type(block).__name__) # TODO: logging
	self.evidence[block.name] = block.run()