
provenance-lib's Introduction

qiime2 (the QIIME 2 framework)

Source code repository for the QIIME 2 framework.

QIIME 2™ is a powerful, extensible, and decentralized microbiome bioinformatics platform that is free, open source, and community developed. With a focus on data and analysis transparency, QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.

Visit https://qiime2.org to learn more about the QIIME 2 project.

Installation

Detailed instructions are available in the documentation.

Users

Head to the user docs for help getting started, core concepts, tutorials, and other resources.

Just have a question? Please ask it in our forum.

Developers

Please visit the contributing page for more information on contributions, documentation links, and more.

Citing QIIME 2

If you use QIIME 2 for any published research, please include the following citation:

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, and Caporaso JG. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37:852–857. https://doi.org/10.1038/s41587-019-0209-9

provenance-lib's People

Contributors

chriskeefe, colinvwood, ebolyen, gregcaporaso, lizgehret, q2d2, thermokarst


provenance-lib's Issues

Remove DeprecationWarnings filter (VERSION parser)

This filter exists to suppress arcane deprecation warnings that pop up in the repeated encoding and decoding of the _VERSION_MATCHER regex. It should be removed, either by handling the encoding/decoding properly, or by factoring out the regex entirely, as Greg suggested during an unrelated conversation.

Additional target interfaces

Targeting additional interfaces with new ReplayUsage drivers will improve analysis interpretability and collaboration across different levels of computational expertise. Some important targets include:

  • replay to Jupyter notebooks (.ipynb) for interactive use (Python 3 API)
  • replay to Jupyter notebooks (.ipynb) with a bash kernel (lower priority than the above)
  • Galaxy
  • R someday
  • Jupyter Book someday - could include viewing of QZAs inline, markdown annotation of methods, etc. In many ways, this could become the skeleton of a reproducibility manifest (#75)

Use ExecutionUsage to test validity of written usage examples

The ExecutionUsage driver allows us to confirm that a usage example is valid - that, if nothing else, it "goes". This can be used to confirm that the usage examples replay templates out are fundamentally viable, at least in the context of "strict" (original-data, original-metadata) replay. We should test with it.

NOTE: This doesn't guarantee our rendered examples will work. The ReplayPythonUsage driver, for example, renders dummy lines that look something like Metadata.load(<your_metadata.tsv>) - these obviously will not go unless there's a <your_metadata.tsv> in the current working directory.

Treat `provenance_is_valid` like an exit code

The provenance_is_valid attribute on ProvDAG objects should be more nuanced than a boolean. Some options follow:

This could be handled in the exit code style. E.g.

  • 0 -> provenance is valid
  • 1 -> predates checksums.md5 so assume good
  • 2 -> user opted out of validation
  • 3 -> provenance not valid

Or in a way that preserves truthiness

  • 0 -> provenance not valid
  • 1 -> user opted out of validation
  • 2 -> predates checksums.md5 so assume good
  • 3 -> provenance is valid

Or maybe use a registry of strings so the value is very readable, and all of the options live in one place.
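
A minimal sketch of the truthiness-preserving option, assuming an IntEnum (the class and member names here are hypothetical):

from enum import IntEnum

class ValidationCode(IntEnum):
    # ordered so that truthiness still works: only INVALID is falsy
    INVALID = 0
    VALIDATION_OPTOUT = 1
    PREDATES_CHECKSUMS = 2
    VALID = 3

Existing `if dag.provenance_is_valid:` checks keep working, and `dag.provenance_is_valid.name` doubles as the readable, registry-style value.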

ReplayPythonUsage should optionally capture all results from each action

Currently, ReplayPythonUsage assumes that outputs not included in the replayed artifacts' provenance are not of interest. They are templated out as `_` variables, so they cannot generally be interrogated or saved to disk. This behavior is inconsistent with ReplayCLIUsage because of the different requirements of those two interfaces, but we should allow users to replay all results via the Python 3 API.

Use ProvNodes as DiGraph nodes?

As currently implemented, our DiGraph nodes are string-literal UUIDs, with attributes added manually at creation. These UUIDs are (probably trivially) easier to query than ProvNodes used directly as node objects.

If our nodes are ProvNodes, can we directly query ProvNode attributes with the NetworkX query API? If so, this may give us a more readable/maintainable API - users can just look at the ProvNode class docs when they are thinking through a query.

If class attributes aren't directly query-able, and we need to add attributes to the node "manually", that's dumb. But it creates a situation in which keeping ProvNode, _Action, and our other intermediary classes around may not be useful. Moving all of that logic into parser classes might simplify things.
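
A quick empirical check of the question above, using a hypothetical frozen dataclass as a stand-in for ProvNode:

import networkx as nx
from dataclasses import dataclass, field

@dataclass(frozen=True)  # frozen -> hashable, so instances can be nodes
class FakeProvNode:
    uuid: str
    plugin: str = field(default='', compare=False)

dag = nx.DiGraph()
dag.add_edge(FakeProvNode('uuid-a', 'demux'), FakeProvNode('uuid-b', 'dada2'))

# NetworkX's query API (e.g. dag.nodes(data=True)) exposes node *attribute
# dicts*, which are empty here; querying the node objects' own attributes
# still requires a plain comprehension:
dada2_nodes = [n for n in dag if n.plugin == 'dada2']

So object nodes don't make class attributes visible to NetworkX's attribute-query API by themselves, though the n.plugin comprehension style may still read better than indexing into attribute dicts.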

Test a .qza?

Shouldn't be anything substantially different at this time, but it's probably not a bad idea.

ReplayCLIUsage: Support lumping outputs into output_dir

All outputs are templated out individually at this time. Lumping them into an output_dir argument would change their location on disk, and would probably require us to update the stored filepath in our UUID-keyed usage variable store.

Add DOIs to citations for common methods.

replay citations deduplicates based on DOI, and citations without DOIs sometimes show up with annoying frequency. Where possible, we can just add DOIs to citations. After that, if things are still gross, we can make deduplication more robust.

DiGraph attribute interface

This is less an issue than a scratch pad where I can think about design:
What provenance data should be available (and therefore directly query-able) as DiGraph node attributes? What data should be available only by indexing into the payload dict's intermediary collections?

Some provenance attributes are arguably more interesting than others, and these are given their own node-level attributes when the DiGraph is constructed. This puts simplicity above comprehensive access, which I like, but users who want to prod "non-select" values will have to index directly into the payload dict, rather than querying nodes directly through NetworkX (both access patterns are sketched after the list below).

I can imagine there are:

  • some attributes that deserve top-level billing (e.g. plugin, action, uuid)
  • some that make sense as top-level collections (e.g. python packages)
  • and some that are unimportant enough to remain unexposed in the data payload
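
A sketch of both access patterns, with hypothetical attribute and payload key names:

import networkx as nx

dag = nx.DiGraph()
payload = {'environment': {'python-packages': {'scikit-learn': '0.24.1'}}}
dag.add_node('uuid-a', plugin='dada2', action='denoise-single', payload=payload)

# 'select' values are directly query-able through NetworkX
plugins = {p for _, p in dag.nodes(data='plugin')}

# 'non-select' values require indexing into the payload
pkgs = dag.nodes['uuid-a']['payload']['environment']['python-packages']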

A performant approach to loading multiple Results

IO is expensive, and Results will frequently be duplicated across multiple Artifacts (e.g. the same FeatureTable will show up in all downstream Artifacts).

When loading a bunch of Artifacts at once, we can reduce IO by generating a ProvDAG from one Artifact, checking whether the root UUID of the next artifact is in first_provdag.nodes, and parsing and unioning only those artifacts that aren't subsets of the already-loaded DAG.
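
A rough sketch of that flow; ProvDAG, peek_root_uuid, and union are all hypothetical names here:

def load_all(artifact_fps):
    # parse the first artifact fully, then skip any artifact whose
    # root UUID is already present in the growing DAG
    dag = ProvDAG(artifact_fps[0])
    for fp in artifact_fps[1:]:
        root_uuid = peek_root_uuid(fp)  # hypothetical cheap, metadata-only read
        if root_uuid not in dag.nodes:
            dag = union(dag, ProvDAG(fp))  # hypothetical union (see ProvDAG Union)
    return dag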

Multiple views of nested provenance

QIIME 2's provenance capture supports multiple views of provenance graphs. Under the hood, Pipelines run other Actions, possibly including Pipelines. Provenance is captured for each inner-Action result, and a redundant terminal result (an alias) is also recorded for the pipeline itself, allowing tools like q2view to hide inner Actions. For details, see the dev docs.

We should expose at least two (and probably three) views of provenance:

  • Complete: all Results are visible, including aliases, because it's easy and allows interfaces to make their own choices
  • Nested: like q2view, where inner nodes are hidden behind terminal/alias nodes
  • Inner Actions: terminal nodes are hidden, showing a comprehensive collection of Actions performed
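
A sketch of how the latter two views might be derived on a plain DiGraph, assuming hypothetical is_inner/is_alias boolean node attributes recorded at parse time (note the subclassing problems with subgraph_view reported under "ProvDAG constructor may not support some inherited functions" below):

import networkx as nx

nested = nx.subgraph_view(
    dag, filter_node=lambda n: not dag.nodes[n].get('is_inner', False))
inner_actions = nx.subgraph_view(
    dag, filter_node=lambda n: not dag.nodes[n].get('is_alias', False))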

Overwrite `artifact_passed_as_metadata` filler type with actual Semantic Types

Because Artifacts passed as metadata don't have Semantic Type information associated with them in a Result's action.yaml, ProvDAGs currently represent these with the "filler type" artifact_passed_as_metadata.

This consistent filler type, along with captured UUIDs, should allow ProvDAG to find/replace the filler with the artifact's original Semantic Type, if that's the right choice. Alternately, we could write this as Metadata. Very open to feedback on what makes the most sense.

Make checksum validation optional

Checksum validation requires re-hashing almost all of the data in the zip archive, and may be computationally expensive enough to negatively impact large-scale projects (e.g. QIITA). This should be optional, but should probably validate by default, as most users won't be negatively impacted.

Mandatory replay features

  • Replay for multiple/disconnected graphs (This probably just needs testing)
  • Handle no-provenance nodes
  • Some kind of environment checker/management tool (e.g. diffing installed plugin versions against the versions captured in provenance and required for replay)

Replay produces a methods manifest

Methods-section writing in computational biology papers requires authors to find a balance between completeness and simplicity. Few readers are likely to care about the details of every method in a large-scale project, but a reader attempting to reproduce a study benefits significantly from a plaintext description of every step. When researchers don't write about every computational method, they are also less likely to provide attribution to low-level or non-terminal methods, creating disparities in citation counts that may not reflect actual patterns of use. (Many publications disallow citations of work not referenced in the text.)

A methods manifest solves both of these problems by providing brief descriptions of QIIME 2 actions, registered via the plugin registration API, alongside the name of the plugin and action that were used, and reference keys that map to the keys in an output bibliography. Depending on the complexity of the tooling required, these could be simple numerical keys managed in Python, or this report and the reference list itself could be produced with LaTeX/BibTeX. Optionally, each action may be associated with data on the computational environment in which it was performed, or a reference key to that information.

By including the methods manifest for publication as an appendix, authors can defer comprehensive methods descriptions to the manifest as needed, while still providing complete attribution to the authors of the methods applied during the analysis.
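
One possible shape for a single manifest entry, sketched as a dataclass (all field names are hypothetical):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ManifestEntry:
    plugin: str                      # e.g. 'dada2'
    action: str                      # e.g. 'denoise-paired'
    description: str                 # plain-language summary registered by the plugin
    citation_keys: List[str] = field(default_factory=list)  # keys into the output bibliography
    environment_key: Optional[str] = None  # optional pointer to environment details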

Simplify sample_metadata loading

Provenance doesn't know much about where sample metadata came from, so it has no way to indicate whether a sample metadata file passed at step 0 is the same as the one passed at step N.

We're currently dealing with this by including an init_metadata() call every time a Metadata is needed. It would be nice to clean this up, as it can be a bit much in the rendered output.

citation for provenance replay paper

Add a BibTeX citation alongside the framework citation. This will be exported with citation lists by anyone who uses provenance_lib to export citations.

Checksum Validation

By validating checksums.md5, we can provide users with reasonable feedback on the integrity of their archives, and the usefulness of their provenance data. ("Hacked" archives may have incorrect provenance)

Users who want to hack on their data can and should export and re-import it, rather than modifying archive contents in place and breaking checksums. Re-importing will, of course, break the provenance chain, but that's not our concern - users can track their own methods while outside of QIIME 2.

validate_checksums() will probably help. @ebolyen recommends copy/paste rather than depending on the framework for now.
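
A standalone, copy/paste-friendly sketch, assuming md5sum-style lines of '<digest>  <relative path>' in a checksums.md5 at the archive root:

import hashlib
import zipfile

def diff_checksums(archive_fp):
    with zipfile.ZipFile(archive_fp) as zf:
        root = zf.namelist()[0].split('/')[0]
        listed = {}
        for line in zf.read(f'{root}/checksums.md5').decode().splitlines():
            digest, _, path = line.partition('  ')
            listed[path.strip()] = digest
        diffs = {}
        for path, expected in listed.items():
            observed = hashlib.md5(zf.read(f'{root}/{path}')).hexdigest()
            if observed != expected:
                diffs[path] = (expected, observed)
    return diffs  # an empty dict means the archive appears intact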

Simple checksum validation tool for CLI?

replay validate-provenance or similar

Checksum-based validation of archives is run by default on all parsing commands. Though low-priority, it could be useful to expose a validation-only tool for the CLI.

Think about ProvDAG equality criteria

From inline notes

ProvDAG equality checks are currently based on class identity and graph isomorphism.

Is this a reasonable way to define ProvDAG equality? It doesn't take into account the equality of some ProvDAG attributes, but leans on DiGraph isomorphism.
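
The current criteria, as a minimal sketch (the helper name is hypothetical):

import networkx as nx

def provdags_equal(a, b) -> bool:
    # class identity plus DiGraph isomorphism; other ProvDAG attributes
    # (checksum diffs, validation status, etc.) are not compared
    return type(a) is type(b) and nx.is_isomorphic(a, b)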

Questions

Are two isomorphic ProvDAGs unequal if one was created without checksum validation?

How about if one has a checksum diff because it's been tinkered with?

If in the future we have a method that produces non-identical results from two identical DiGraphs based on _parsed_artifact_uuids, are those ProvDAGs still equal?

Memoize topological sort, and related questions

Sorts can be costly (though parsing is definitely our highest-priority optimization target). At some point, it might make sense to memoize the topological sort we perform on a dag in build_usage_examples. Questions include:

  • do we assume the sort will always be over the collapsed view of provenance?
  • does group_by_action then just get the dag as an arg, and grab the sorted view from a dag property?
  • is there any reason group_by_action would not want a topologically sorted view? We're currently flexible on that, but could lock it in.
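
If we lock the sort to the collapsed view (per the first question above), a cached_property is probably the simplest memoization. A sketch, with a hypothetical property name:

import networkx as nx
from functools import cached_property

class ProvDAG(nx.DiGraph):
    @cached_property
    def sorted_view(self):
        # computed once per instance; would need invalidation (e.g.
        # del self.__dict__['sorted_view']) if the graph ever mutates
        return list(nx.topological_sort(self))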

Improve replay from captured metadata files

Replay using recorded metadata could go one of two ways:

  1. "Touchless" replay asks the user after parsing for appropriate data inputs, and then runs the replay with them.
  2. "Hands on" writes a script (like we do now), but incorporates the captured metadata files by dumping them, and then loading them appropriately from their dumped location on disk.

If we decide to build a "touchless" replay that uses recorded metadata, init_md_from_recorded_md needn't render anything, but needs to load properly.

If we go "hands on", we're going to need to render the Python API to insert an actual filepath, rather than producing some comment. This is probably easiest to handle with another variant driver.

Rewire ProvDAG to use ProvNodes as nodes

Currently, nx is using UUID strings as nodes, with all data stored as node attributes. Refactoring to use ProvNodes as nx.DiGraph nodes might clean up the logic in ProvDAG.__init__() a bit.

Make EmptyParser a ParserResultsParser?

Empty ProvDAGs aren't very useful. Maybe we should refactor this as a ParserResults parser. This would mean tools like Union that are basically constructing ParserResults "manually" don't need to create an empty ProvDAG and then overwrite its fields. Instead, they create a ParserResults and, once the data's there, they throw it at ProvDAG().

In favor

By requiring tools to actually write ParserResults, we ensure they create all required data, and mypy can check it's correctly typed. This approach may be slightly more efficient, too.

I also don't love that passing no args to ProvDAG is possible, because it seems to encourage this somewhat useless behavior.

Against

The "create an empty object and populate it" idiom is common and familiar... 🤷

OSX may break replay

When replaying an extracted zip archive of files that was compressed on OSX, parsing fails because the included __MACOSX directory contains non-zipfiles named ._something.qz*.

We need to:

  • confirm that this only impacts compressed and then decompressed zip archives
  • check whether this behavior also occurs on MacOS - details below
  • fix, possibly by ignoring hidden files, ignoring files within __MACOSX in the fp, removing __MACOSX, or catching BadZipFile("File is not a zip file") errors and looking more closely at them.
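
The ignore-hidden-files fix might start with a filter like this sketch:

import os

def looks_like_archive(fp: str) -> bool:
    # skip OSX resource-fork debris: anything under __MACOSX, and
    # AppleDouble files named ._<original name>
    parts = fp.split(os.sep)
    return '__MACOSX' not in parts and not os.path.basename(fp).startswith('._')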

Improve input sanitization

If we're going to consider running generated usage examples through the ExecutionDriver sight unseen by users, we also need to consider more robust input sanitization.

We're already using yaml.safe_load, but we should learn more about what responsible input handling really looks like here. Schema validation could be an option to consider.

An algebraic approach to provenance replay

A user of provenance replay may only care about analysis downstream of some specified result(s) (e.g. I want this whole analysis after the FeatureTable). Similarly, a user might only care about an analysis upstream of some specified result(s).

The ideal future use case basically looks like this:

  • user dumps a bunch of artifacts into the parser, generating a ProvDAG that looks like this:
           B -- C
I -- A <            > G -- H
           D -- E
                  \-- F
  • user visualizes the ProvDAG graph in a fancy future GUI, and says "I want to replay provenance from this FeatureTable (A) through these Visualizations (F & G), disregarding the rest."
  • The interface captures UUIDs for those nodes and passes them off to Replay.

The core idea here is that a user might want to replay only (F | G) - A. We can probably use graph views to handle this, writing predicates flexibly based on what a user wants to include and exclude. Computationally, this could be a little spendy, as we're probably actually composing all of the unioned UUIDs and their ancestors, before truncating A and all of its ancestors.
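
A sketch of the (F | G) - A computation over a DiGraph of UUID strings, using nx.ancestors and a subgraph view:

import networkx as nx

def replay_slice(dag, sources, sinks):
    # keep the sinks and everything upstream of them...
    wanted = set(sinks)
    for sink in sinks:
        wanted |= nx.ancestors(dag, sink)
    # ...then truncate the sources and everything upstream of them
    unwanted = set(sources)
    for source in sources:
        unwanted |= nx.ancestors(dag, source)
    return dag.subgraph(wanted - unwanted)

# e.g. replay_slice(dag, sources=['A'], sinks=['F', 'G'])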

ProvDAG Union

In order to parse full analyses from multiple .qza/.qzv files, a graph union operation will be required. We can use nx's union operator, but must work around its requirement that the joined graphs be disjoint.

Probably the simplest approach here is to write our own union, which removes common nodes and then calls nx.union():

  • common_nodes = G & H
  • Drop common nodes (or edges?) from H (or G, it doesn't matter)
  • do we need to drop impacted edges?
  • call nx.union() on the resulting disjoint graphs
  • do we need to wire up parent relationships again explicitly? (related to dropping edges, above)

Finding common nodes gives us some easy optimizations:

import networkx as nx

def provdag_union(G, H):
    if G == H:
        # no union needed: the graphs are identical
        return G
    common_nodes = set(G.nodes) & set(H.nodes)
    if common_nodes == set(G.nodes):
        # G is a proper subset/subgraph of H: G is already in H
        return H
    if common_nodes == set(H.nodes):
        # as above, with the roles reversed
        return G
    if common_nodes:
        # drop common nodes, skipping this preprocessing if the
        # graphs are already disjoint
        H = H.copy()
        H.remove_nodes_from(common_nodes)
    unioned = nx.union(G, H)
    if common_nodes:
        # re-wire parent relationship edges dropped along with the common
        # nodes; this too can hide behind the common_nodes check, b/c no
        # common nodes implies no common edges
        ...  # wire up edges
    return unioned

Notes:

  • nx does offer a more permissive disjoint union operator, but it forces integer node labels, so would require relabeling with UUIDs in addition to "manual" deduplication of nodes in order to bring us back to a provenance graph with UUID node labels
  • If this doesn't scale adequately, we may have to consider a graph library that isn't pure python.

Broad analysis metadata reporting

Reporting some additional high-level metadata may make provenance replay more useful and easier to use. Useful items for an analysis metadata supplement could include:

  • a summary of data artifact UUIDs and/or filepath names (encourage data sharing by helping users identify the files they need to share)
  • details on the size and type of input data and artifact data (e.g. for estimating storage costs or compute times)
  • overall and per-action summaries of runtime/hardware provisioning (could also be implemented in #75)

ProvDAG constructor may not support some inherited functions

summary

Some nx functions fail with ProvDAGs (probably those which create new graphs based on an existing graph object).

details

  • nx.reverse_view() fails with the traceback below.
  • nx.subgraph_view(dag, filter_node = lambda x: x in nested_nodes) fails
  • nx.relabel_nodes also fails, unless copy=False.

I suspect this is an issue with the way we've reimplemented __init__: these functions appear to create new graphs/graph views of the same type as the "parent" and then fill in the data, but they choke on the way ProvDAG is initialized. A few quick attempts at a fix didn't yield results.

None of these functions is quite critical at this time, but I'll have to come back to this. Union may also be affected if we're asking nx to create a new ProvDAG on its own. If addressing this directly proves problematic, we may be able to sidestep the issue by making ProvDAGs (or something with a better name) that simply have a DiGraph, and implementing helpers that call DiGraph functions, as sketched below.
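
The composition sidestep might look like this sketch (parsing elided; _parse is a hypothetical stand-in):

import networkx as nx

class ProvDAG:
    # holds a DiGraph rather than subclassing it, so nx internals can
    # freely call G.__class__() on the wrapped graph
    def __init__(self, archive_fp=None):
        self.dag = nx.DiGraph()
        if archive_fp is not None:
            self._parse(archive_fp)  # hypothetical

    @property
    def nodes(self):
        # delegate the parts of the DiGraph API we need, explicitly
        return self.dag.nodes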

examples

Call:

nx.subgraph_view(dag, filter_node = lambda x: x in nested_nodes)

traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-e778f4be6599> in <module>
----> 1 nx.subgraph_view(dag, filter_node = lambda x: x in nested_nodes)
      2 nx.draw(dag, with_labels=True, font_weight="bold")

~/miniconda/envs/prov/lib/python3.8/site-packages/networkx/classes/graphviews.py in subgraph_view(G, filter_node, filter_edge)
    142     EdgeView([(0, 1), (1, 2), (2, 3)])
    143     """
--> 144     newG = nx.freeze(G.__class__())
    145     newG._NODE_OK = filter_node
    146     newG._EDGE_OK = filter_edge

TypeError: __init__() missing 1 required positional argument: 'archive_fp'

Call:

dag = ProvDAG(archive_fp=qzv)
reversed = nx.reverse_view(dag)

Traceback:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-a11c5f1b7b87> in <module>
      6 contents = dag.parser_results.archive_contents
      7 nodes = list(contents.values())
----> 8 reversed = nx.reverse_view(dag)

<decorator-gen-128> in reverse_view(G)

~/miniconda/envs/prov/lib/python3.8/site-packages/networkx/utils/decorators.py in _not_implemented_for(not_implement_for_func, *args, **kwargs)
     76             raise nx.NetworkXNotImplemented(msg)
     77         else:
---> 78             return not_implement_for_func(*args, **kwargs)
     79 
     80     return _not_implemented_for

~/miniconda/envs/prov/lib/python3.8/site-packages/networkx/classes/graphviews.py in reverse_view(G)
    201     OutEdgeView([(2, 1), (3, 2)])
    202     """
--> 203     newG = generic_graph_view(G)
    204     newG._succ, newG._pred = G._pred, G._succ
    205     newG._adj = newG._succ

~/miniconda/envs/prov/lib/python3.8/site-packages/networkx/classes/graphviews.py in generic_graph_view(G, create_using)
     42 def generic_graph_view(G, create_using=None):
     43     if create_using is None:
---> 44         newG = G.__class__()
     45     else:
     46         newG = nx.empty_graph(0, create_using)

~/src/provenance_py/provenance_lib/parse.py in __init__(self, cfg, archive_fp)
    129                         type = tuple(parent.keys())[0]
    130                         parent_uuid = tuple(parent.values())[0]
--> 131                         ebunch.append((parent_uuid, node_id,
    132                                        {'type': type}))
    133             self.add_edges_from(ebunch)

~/miniconda/envs/prov/lib/python3.8/zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
   1267         try:
   1268             if mode == 'r':
-> 1269                 self._RealGetContents()
   1270             elif mode in ('w', 'x'):
   1271                 # set the modified flag so central directory gets written

~/miniconda/envs/prov/lib/python3.8/zipfile.py in _RealGetContents(self)
   1330         fp = self.fp
   1331         try:
-> 1332             endrec = _EndRecData(fp)
   1333         except OSError:
   1334             raise BadZipFile("File is not a zip file")

~/miniconda/envs/prov/lib/python3.8/zipfile.py in _EndRecData(fpin)
    262 
    263     # Determine file size
--> 264     fpin.seek(0, 2)
    265     filesize = fpin.tell()
    266 

AttributeError: 'NoneType' object has no attribute 'seek'

References

The DiGraph source has a subclassing example near the end of the docstring.

Handle action.yaml variation with schema validation?

Different types of Action track different types of data. For example, the action.yaml schema differs in meaningful ways between "true" Actions (method, pipeline, visualizer) and import actions (which have no plugin, no parent inputs, etc).

Currently, _Action handles this variability by patching 'interesting' missing keys (e.g. plugin) using properties, and leaving 'raw' provenance data accessible in the full_ProvNode_payload dict.

Sentinel values like action_type exist, which let us differentiate between Actions and Imports. If we decide to look more closely at action.yaml's less charismatic values as the queryable ProvNode API firms up, it could be useful to write and validate schemas explicitly. Cerberus does lightweight schema validation, if we go there.
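
A partial Cerberus sketch for import actions, if we go there (the schema keys are illustrative, not a complete model of action.yaml):

from cerberus import Validator

import_schema = {
    'action': {
        'type': 'dict',
        'schema': {
            'type': {'type': 'string', 'allowed': ['import']},
            'format': {'type': 'string', 'nullable': True},
        },
    },
}

parsed = {'action': {'type': 'import', 'format': 'FastqGzFormat'}}
v = Validator(import_schema, allow_unknown=True)  # tolerate keys we haven't modeled yet
if not v.validate(parsed):
    print(v.errors)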

Thoughts: Replay UI Options

This tool is currently implemented as an executable file generator, which lets the user review the executable but requires them to run it themselves. This is reasonably secure and focuses the user on reading/interpreting, but might be clunkier than we would always prefer.

Another approach worth considering might look like generating a dag first (so we can identify the number and type of raw data inputs required). The user could then Replay that dag, passing their inputs with the replay command, without ever having to run the commands themselves.

This would take the pressure off of having a pretty rendering, but may run into issues because, e.g. metadata inputs are somewhat ambiguous in provenance, and we don't know whether the same sample metadata was passed to all commands in an analysis. If the goal is to replay with captured metadata, this is a non-issue. If the goal is replay with new, user-supplied metadata, it will require some finesse.

Handle in-memory Results

As currently implemented (because we don't know anything about QIIME 2 Results objects), the provenance parser assumes a zipfile-based archive structure. If in the future we allow dependencies on the Framework, we might need to refactor significantly to allow the loading of Results from .qza/.qzv files, and the creation of ProvDAGs from those objects.

Report plugin dependencies missing from the environment

RESCRIPt is used in building our ready-made taxonomic classifiers, but is not shipped with the "core" distribution.

This is going to blow up EVERYONE's replay, making a provenance-aware replay-package-installation tool absolutely critical. We could probably patch this by abusing the plugin manager on a special-case basis for RESCRIPt, but that's gross.

Edit: the following loses us more than it gains us.
The other approach here, which may be worth pursuing, is cutting the plugin manager out of replay entirely in the local Usage drivers, and letting replay do the best it can from provenance. This will be more permissive, but will leave users with no idea about which parameter names, for example, have changed.

Gracefully handle corrupted archives

The checksum validator warns of missing files, but allows the program to continue.

Expected files missing at the node level ('action.yaml', 'VERSION', etc.), however, raise errors. If we decide to handle these gracefully (e.g. so that high-throughput users don't get failures when a single bad Archive is passed), we'll need to decide how. Two reasonable paths:

  • capture as much data as we can (which may be complex)
  • capture minimal data for the node (e.g. a UUID, which we can get from zipfile filepaths), and flag the DAG as not provenance_is_valid

Alternately, this could be handled at a higher level, by catching those errors in cases where more than one Archive is being parsed.

Clean up the `accepted_data_types` implementation in Parsers

Parsers have an accepted_data_types field, which is used in the error message raised if an unsupported payload is passed in for parsing. This information is stored as a string, which works great for error messaging, but isn't otherwise useful. Plus, it feels a little hokey?

This probably doesn't deserve attention, but maybe I'll think about it on a boring Tuesday someday.

Inner view of nested provenance

A followup to #20, this issue targets the creation of an "inner" view of nested provenance. In this view, outer (pipeline) nodes are hidden, showing instead a graph of all of the Methods and Visualizers run within the pipelines.

This view has potential utility as a tool for deeper inspection of analytical workflows, allowing users to study, modify, and re-run all of the fundamental Actions being run in an analysis.

Improve support for replay in upstream Usage driver code

replay.py monkeypatches parts of these drivers that don't currently support replay adequately. Wherever possible, these patches should be removed in favor of support in the home repositories.

Some issues

The ArtifactAPIDriver doesn't check parameter names against registered functions, so it may fail silently, producing normal-looking results with bad parameter names. This should probably be addressed.

The CLIDriver does check param names against their registered names, which causes failures when param names change between capture and the currently installed version.

The ArtifactAPIDriver does not allow replay unless all outputs are passed to Usage.action().

EDIT: My thinking has changed on this over time. There are enough differences between the expectations of replay drivers and existing Usage drivers that upstream support is probably not worth pursuing in most cases. One possible exception I see is in the lack of support in Usage.action for replaying archive versions that don't capture output-name, described below. Usage.action is far enough up the chain of inheritance to make working around it ugly, and a reasonable improvement wouldn't take much effort.

I'm going to keep this issue open as a reference to the framework issue, but have renamed it appropriately.

Consider supporting a minimal ProvNode definition

build_no_provenance_node_usage has a messy signature because it has to be able to handle either a ProvNode or None as the arg to node (the result of dag.get_node_data). If None, it needs other context to do its job.

None is used because some dag nodes don't actually have underlying ProvNodes. Consider refactoring ProvNode so that the minimal node has only a UUID. This would probably require ProvNode properties with Optional returns and more `is not None` checks, but it would allow us to drop some `Optional[ProvNode]`s too. The semantics feel pretty good: a node is, at least, a UUID, after all.
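
A sketch of that minimal definition (field names beyond uuid are hypothetical):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ProvNode:
    uuid: str                            # a node is, at least, a UUID
    action: Optional['_Action'] = None   # absent for no-provenance nodes
    semantic_type: Optional[str] = None

    @property
    def has_provenance(self) -> bool:
        return self.action is not None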

Dependency-aware node ordering for replay

In order to successfully replay an analysis, we must order Actions such that all dependencies of an Action are satisfied before that Action may be run - beginning with nodes of in-degree 0, and ending with nodes of out-degree 0.

I think a topological sort will do this for us.

Topological sorts are not by definition unique, so we may get varying sort results as things change (e.g. this Python version change -> sort order change). lexicographical_topological_sort will let us impose a more nuanced ordering, at the cost of a little bit of complexity. This commit from the same issue is the best example I've found to date of what a key function might look like.
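
A minimal demonstration with toy labels, keying on the node's string so that ties always break the same way:

import networkx as nx

dag = nx.DiGraph([('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')])
ordered = list(nx.lexicographical_topological_sort(dag, key=str))
# -> ['a', 'b', 'c', 'd'], regardless of insertion order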

ProvDAG should check its own "guaranteed" properties

In #32 we outsourced the creation of ParserResults to the parsers, and ProvDAG no longer defines the behavior that produces its data locally. This is way more flexible and extensible, but separates the data collection from the ProvDAG object that guarantees certain data be present.

Should ProvDAG vet that e.g. every node has non-null node_data and has_provenance attributes? Alternately, maybe we can enforce this in the Parser ABC. The only machinery currently making ProvNodes is in archive_parser, so there's no real risk right now, but because arbitrary parsers can now be written and plugged in, the door is no longer closed to breaches of that guarantee.
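
If ProvDAG does the vetting itself, a sketch (assuming the composed self.dag shape discussed elsewhere) could be as simple as:

def _check_guarantees(self):
    # run after any Parser returns, regardless of which parser produced it
    for uuid, attrs in self.dag.nodes(data=True):
        if attrs.get('node_data') is None or 'has_provenance' not in attrs:
            raise ValueError(f'parser produced an invalid node: {uuid}')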

efficient replay for multiple interfaces

Currently, replay acts on a ProvDAG directly, grabbing what it needs and passing it to a single Usage driver for rendering. Gathering this data into a persistent structure would allow us to run it through multiple drivers without having to collect data repeatedly.
