bids-standard / bep028_bidsprov

Organizing and coordinating BIDS extension proposal 28 : BIDS Provenance

Home Page: https://bids.neuroimaging.io/bep028

License: Creative Commons Attribution 4.0 International

Python 79.41% Makefile 0.18% MATLAB 14.63% Shell 5.78%

bep028_bidsprov's People

Contributors

bclenet, cmaumet, cyril-data, hermann74, omar-rifai, remi-gau, remiadon, satra, thomasbtnfr, tiborauer, yarikoptic, yibeichan

bep028_bidsprov's Issues

Storing RRIDs

This is an update proposal for BIDS Prov (BEP028).

Problem Statement

As discussed at the BIDS-Prov meeting (November 8, 2021), in BIDS-Prov we need a way to store the RRID of a software agent.

Note: RRIDs are unique identifiers for software agents, as defined on the scicrunch.org website. More info at: scicrunch.org Getting Started and in this paper (10.1016/j.neuron.2016.04.030). Importantly, RRIDs are not version-specific (e.g. SPM5 and SPM12 have the same RRID:SCR_007037).

Rationale

Currently in the draft BIDS-Prov

Currently in the BIDS-Prov specification, RRIDs are stored with a specific attribute rrid:


2.3 Agent (Optional)

Including an Agent record is OPTIONAL. If included, each Agent record has the following fields:

Key name Description
@id REQUIRED. UUID. A (randomly assigned) identifier for the software (this identifier will be used to associate activities with this software).
rrid OPTIONAL. URI. URI of the RRID for this software package (cf. scicrunch).
...

Pros:

  • A specific tag for rrids will make it very easy to query and retrieve this information
  • A specific tag for rrids prevents using other types of identifiers

Cons:

  • A specific RRID tag means that we will have to create a term (or reuse an existing one if already available)
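
As a rough illustration of the current draft, an Agent record with a dedicated rrid attribute could be built and queried as follows. This is only a sketch: the helper name make_agent_record, the label field, and the resolver URL prefix are assumptions made for the example, not part of the specification.

```python
import uuid

# Assumed resolver prefix; the draft only says the value is a URI (cf. scicrunch).
RRID_RESOLVER = "https://scicrunch.org/resolver/"


def make_agent_record(label, rrid=None):
    """Build an Agent record with a random UUID @id and an optional rrid URI."""
    record = {"@id": str(uuid.uuid4()), "label": label}
    if rrid is not None:
        record["rrid"] = RRID_RESOLVER + rrid
    return record


agents = [
    make_agent_record("SPM", "RRID:SCR_007037"),
    make_agent_record("custom script"),  # no RRID available
]

# With a dedicated key, retrieving the RRIDs is a one-line filter.
with_rrid = [agent for agent in agents if "rrid" in agent]
print(with_rrid[0]["rrid"])
```

Because RRIDs are not version-specific, a record like this would presumably still need a separate field to carry the software version.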

Other alternatives

Dandi

... uses an "identifier" term, which is separate from the "@id"

[Screenshot: the "identifier" term in the DANDI schema]

Pros:

  • A specific tag for identifiers will make it very easy to query and retrieve this information
  • A generic "identifier" tag allows for using other types of identifiers

Cons:

  • When querying for the identifier, it might not be clear which type of identifier is retrieved (RRID or something else?), although this might not be entirely true if a URI is used as the value, in which case the base URL can indicate the type of identifier.
  • id and identifier are very close words and might be confused

Question: has the dandi model been released? Can we directly reuse the terms?

multiline definitions for spm_parser.py

Is your feature request related to a problem? Please describe.
When running

curl -LJO https://raw.githubusercontent.com/incf-nidash/nidmresults-examples/master/spm_group_ols/batch.m

the downloaded batch.m contains the following lines

matlabbatch{1}.spm.stats.factorial_design.des.t1.scans = {
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-01/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-02/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-03/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-04/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-05/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-06/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-07/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-08/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-09/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-10/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-11/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-12/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-13/con_0001.nii,1'
                                                          '/storage/essicd/data/NIDM-Ex/BIDS_Data/RESULTS/EXAMPLES/ds011/SPM/LEVEL1/sub-14/con_0001.nii,1'
                                                          };

Commit 564df3a introduced a quick fix: we just ignore those lines.
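
Rather than ignoring these lines, one option would be to merge each multiline definition into a single logical statement before parsing. The following is a minimal sketch, not the code in spm_parser.py: the function name join_multiline_statements is hypothetical, and it assumes that quoted strings never span several lines.

```python
import re

# Single-quoted MATLAB strings are removed before counting braces,
# so that a brace inside a file path cannot confuse the counter.
_QUOTED = re.compile(r"'[^']*'")


def join_multiline_statements(lines):
    """Merge physical lines of a batch.m file into logical statements."""
    statements, buffer, depth = [], [], 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        buffer.append(line)
        code = _QUOTED.sub("", line)
        depth += code.count("{") - code.count("}")
        if depth <= 0:  # all cell arrays opened so far are closed
            statements.append(" ".join(buffer))
            buffer, depth = [], 0
    if buffer:  # unterminated definition at end of file
        statements.append(" ".join(buffer))
    return statements


batch = [
    "matlabbatch{1}.spm.stats.factorial_design.des.t1.scans = {",
    "    'sub-01/con_0001.nii,1'",
    "    'sub-02/con_0001.nii,1'",
    "    };",
    "matlabbatch{1}.spm.stats.factorial_design.masking.im = 1;",
]
for statement in join_multiline_statements(batch):
    print(statement)
```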

Finalize proposal submission

  • First pass at updating the BEP028 spec to be compliant with latest discussions.

Add usage examples of this spec for the following:

  • 1- simplest model with one activity and its parameters,
  • 2- workflow with more than 1 activity,
  • 3- encoding the environment (Matlab, docker, bash, etc.).
  • 4- an example with hierarchies of activities (use isPartOf from ProvONE)
  • 5- an example that uses the ontology of processings / image types
  • 6- an example discussing file-level prov VS dataset-level prov

globbing to describe collections of files

For now we use globbing to represent collections of files.
Quoting the W3C PROV documentation: collections are defined as entities providing structure on top of other entities. In the context of file enumeration, I found it easier to use a syntax that many users are familiar with.

Another aspect of entities in our framework is the "sha" field, which is used for quick equality checking between entities. For a single file, the simple solution is to call a sha function on that file. To fill the "sha" field for a collection of files, we can chain sha functions, i.e. apply a sha function to the list of individual sha results.

With many images in a directory named 'fM00223', this gives

sha1sum fM00223/*.img | cut -d " " -f 1 | sha1sum | cut -d " " -f 1  # "cut" is used to trim filenames infos returned by "sha1sum"

which yields a single value for all .img files in this directory

This proposal aims to facilitate integration with existing software (e.g. globbing is used in the SPM GUI to select files) while keeping our prov files as concise as possible.
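
For tools that cannot shell out, the same chained checksum can be sketched in Python. This is an illustration under stated assumptions: the function names are hypothetical, and files are sorted by path, which may differ from the shell's glob ordering under some locales (in which case the two results would not match).

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of one file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha1_of_collection(directory, pattern):
    """Hash the newline-terminated list of per-file hashes, like the pipeline above."""
    files = sorted(Path(directory).glob(pattern))
    listing = "".join(sha1_of_file(path) + "\n" for path in files)
    return hashlib.sha1(listing.encode("ascii")).hexdigest()


if __name__ == "__main__":
    print(sha1_of_collection("fM00223", "*.img"))
```

Note that the result depends on file order, so any implementation has to fix the ordering for two collections to compare equal.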

Provide more human-readable labels for activities

Is your feature request related to a problem? Please describe.
Currently the labels for the activities are automatically extracted from the keys in the matlabbatch, this is great but could be improved to have more human-readable labels.

Describe the solution you'd like
We could add a mapping between the keys in the matlabbatch and a human-readable name/label. This could be stored in a parameter file very similarly to what is done to add inputs to activities.

What do you think?
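
A minimal sketch of such a mapping is given below. The dictionary content and the function name human_label are hypothetical; in practice the mapping would presumably be loaded from the parameter file mentioned above.

```python
# Hypothetical mapping from matlabbatch key prefixes to display labels.
LABELS = {
    "spm.spatial.realign": "Realign",
    "spm.spatial.coreg": "Coregister",
    "spm.spatial.preproc": "Segment",
}


def human_label(batch_key, labels=LABELS):
    """Return the label of the longest matching key prefix, else the raw key."""
    matches = [
        prefix
        for prefix in labels
        if batch_key == prefix or batch_key.startswith(prefix + ".")
    ]
    if not matches:
        return batch_key  # fall back to the automatically extracted label
    return labels[max(matches, key=len)]


print(human_label("spm.spatial.realign.estwrite"))
print(human_label("spm.stats.fmri_spec"))
```

Falling back to the raw key means that unmapped modules keep the current behaviour.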

implement Digest Attributes for entity (file)

From the BIDS-Prov specification:
Digest RECOMMENDED. Dict. For files, this would include checksums of files. It would take the form {"<checksum-name>": "value"}.
-> one SHA checksum for each file (input or output)
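
As a sketch of what filling this field could look like, the helper below returns a dict of the form {"<checksum-name>": "value"} for one file. The function name file_digest and the default choice of sha256 are assumptions; the specification text quoted above does not prescribe an algorithm.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return {"<checksum-name>": "<hex value>"} for the file at path."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {algorithm: digest.hexdigest()}


def with_digest(entity, path):
    """Return a copy of an entity record with its Digest field filled in."""
    return {**entity, "Digest": file_digest(path)}
```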

Follow up Copenhagen BIDS-Prov meeting

This issue is open to keep track of discussions / things to do following BIDS-derivatives meeting :

  • We need an example in BIDS-Prov spec on how to include custom code for some of the steps
  • The file-level provenance that we had in an older version can be more intuitive when writing the provenance of a single file and should probably be included in the spec again
  • Question: could the BIDS-Prov file be in .prov.json and not .prov.jsonld, to be simpler for developers who know JSON? (suggestion by Arnault)
  • We should consider making it possible to use BIDS url as identifiers

BIDS-Prov meeting: Nov 2, 3pm UTC

Hi everyone!

We are very happy to announce the first meeting to discuss BIDS-Prov, which will be held on November 2nd, 3pm UTC, i.e. 8am CA / 11am ET / 5pm Paris.

In this meeting we will discuss BIDS-Prov examples for SPM and FSL and how to make them more concise (similarly to the approach taken in reproschema).

Camille & @satra

BIDS-Prov meeting: Jan 24

Dear BIDS-Prov folks,

Thanks for joining us at our last BIDS-Prov meeting. Here is a brief summary of what happened and what we would like to focus on next.

First, the minutes of our two last calls are available at:

On our last call, we agreed on reviewing the BIDS-Prov specification (Google doc) and adding comments for any questions by our next call.

Thank you all, and looking forward to seeing you at our next meeting on February 7. In preparation for the meeting, please feel free to add the points you would like to discuss directly to the agenda.

As always, I'd be very happy to answer any questions. Your contributions are very much appreciated!

Camille


Note: We meet every two weeks by videoconference on Mondays at 7-8am PDT / 10am-11am EDT / 3-4pm BST. The group is always open to new contributors interested in neuroimaging data sharing. To join the call or to ask any question, please email us at [email protected].

Participate in BrainWeb virtual hackathon [3 days]

Join the BrainWeb, take part in the kickoff on April 6 and attend the virtual hackathon.

Deliverable: a write-up about this experience of joining a virtual hackathon (possibly also about the project(s) you joined). This will be published on the Empenn blog/website.

clear explanation of an Activity

as discussed with @cmaumet , @dbkeator and @satra on November 2nd

What do we mean by Activity ?
Currently the examples encapsulate everything related to the run : an activity is the call of an Agent on a specific set of entities, and includes parameters (as defined in a batch.m for example)

Perhaps we should rather separate user-oriented graph definition from a more complete description, which would contain Activities, and the full set of parameters

Multiple Entities as input/output

Update proposal for BIDS Prov (BEP028)

Problem Statement

Defining a pipeline usually consists of linking functions (Activities) to their inputs/outputs (Entities), knowing the context (Agents).
Allowing only one input/output pair per Activity will probably lead to an artificially high number of activities and quickly become cumbersome.

Rationale

As an example, the segment activity in the SPM default example takes a single entity as input (a .nii updated header) and generates 5 distinct tissue files, so we need to allow multiple inputs/outputs to be declared. This way, all of them can link to the same activity, which is appropriate for reading/querying.

Minimal example

Here is the entity definition for spm_default/coreg_and_segment.json

      {
        "@id": "niiri:fsiud1",
        "label": "tissue1",
        "wasAttributedTo": "RRID:SCR_007037",
        "wasGeneratedBy": "niiri:sdfsdofjiosdf",
        "derivedFrom": "niiri:fsiudfqsoi938409283409fdskj",
        "prov:atLocation": "$HOME/spm12/tpm/TPM.nii,1"
      },
      {
        "@id": "niiri:fsiud2",
        "label": "tissue2",
        "wasAttributedTo": "RRID:SCR_007037",
        "wasGeneratedBy": "niiri:sdfsdofjiosdf",
        "derivedFrom": "niiri:fsiudfqsoi938409283409fdskj",
        "prov:atLocation": "$HOME/spm12/tpm/TPM.nii,2"
      },
...

and here is the associated subgraph
[Figure: subgraph linking the tissue entities to the activity that generated them]
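
To illustrate why linking several outputs to the same activity is convenient for querying, here is a small sketch that groups the two entities shown above by the activity that generated them. The function name outputs_by_activity is hypothetical; the records are copied from the example, keeping only the fields used.

```python
from collections import defaultdict

entities = [
    {
        "@id": "niiri:fsiud1",
        "label": "tissue1",
        "wasGeneratedBy": "niiri:sdfsdofjiosdf",
    },
    {
        "@id": "niiri:fsiud2",
        "label": "tissue2",
        "wasGeneratedBy": "niiri:sdfsdofjiosdf",
    },
]


def outputs_by_activity(records):
    """Map each activity id to the ids of the entities it generated."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["wasGeneratedBy"]].append(record["@id"])
    return dict(grouped)


print(outputs_by_activity(entities))
```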

Create an issue template that we will use to propose updates on the spec [2h]

Choose one feature that is not currently in the spec and use it as an example to create an issue template for proposal of updates on the spec.

This issue template will include the following: "Minimal example before/after", "rationale" and probably more.

Look at examples of issues in https://github.com/bids-standard/bids-specification/ to see if there is a common structure or existing issue templates.

Deliverable: a markdown file (in the current repo) as Github issue template.

parameters encoding

Update proposal for BIDS Prov (BEP028)

Problem Statement

At OHBM we already had a few questions about how one should encode parameters.

For the moment we allow passing any json-compliant values into the attributes field of an Activity

We should not encode the parameters ourselves, but provide a way to encode them.
My suggestion is that we define a new type named Parameter.

Checklist

  • update new_features.md at the root of this project

add clear explanations for spec update [5D]

  • update new_features.md
  • create issue for type indexing
  • create issue for "Activity definitions"
  • create issue for "Activities attributes"
  • create issue for Multiple entities as input
  • validate

First blog article

@cmaumet we should also think about a first article to be posted by the end of this month, as discussed in meeting.

It can be very simple (eg. just showcasing what's been done), but we should seek a formal definition of what's inside

Cheers,
Rémi

High Level Example

add an example that is high level, i.e. where a node in the graph encapsulates a call to a docker container, as in fMRIPrep

SPM Parser for BIDSProv

Is your feature request related to a problem? Please describe.
Right now we provide short examples that rely on our model.
To speed up the inclusion and discussion of new examples, we need to automate their translation into .jsonld or .turtle files that respect BEP028.

Describe the solution you'd like
We want a parser that

  • takes .m files as input, as those provided in nidm-results
  • outputs a valid .jsonld file

This parser will have to be updated as the specification evolves.
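
The overall shape of such a parser can be sketched as follows. This is not the actual spm_parser.py: the function name batch_to_jsonld, the labelling rule, and the restriction to simple one-line assignments are all assumptions made to keep the example short, and the output only contains the activity records.

```python
import json
import re
import uuid

# Matches simple assignments such as: matlabbatch{2}.spm.spatial.smooth.fwhm = [8 8 8];
_ASSIGNMENT = re.compile(r"^matlabbatch\{(\d+)\}\.([\w.]+)\s*=\s*(.*?);?\s*$")


def batch_to_jsonld(statements):
    """Turn matlabbatch assignments into one activity record per batch cell."""
    activities = {}
    for statement in statements:
        match = _ASSIGNMENT.match(statement.strip())
        if match is None:
            continue  # indexed or otherwise unsupported statements are skipped
        cell, key, value = int(match.group(1)), match.group(2), match.group(3)
        if cell not in activities:
            activities[cell] = {
                "@id": uuid.uuid4().urn,
                # Assumed labelling rule: first three components of the key.
                "label": ".".join(key.split(".")[:3]),
                "attributes": [],
            }
        # Values are kept as raw MATLAB text; no type conversion is attempted.
        activities[cell]["attributes"].append([key, value])
    return {"prov:Activity": [activities[cell] for cell in sorted(activities)]}


example = [
    "matlabbatch{1}.spm.spatial.realign.estwrite.eoptions.quality = 0.9;",
    "matlabbatch{1}.spm.spatial.realign.estwrite.eoptions.sep = 4;",
    "matlabbatch{2}.spm.spatial.smooth.fwhm = [8 8 8];",
]
print(json.dumps(batch_to_jsonld(example), indent=2))
```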

TODO

  • get spm_default and spm_groups_ols
  • get a first example, with each cell in the original .m file having an activity in the produced sidecar file
  • write a showcase example of a pipeline, e.g. parsing | visualisation
  • write a GitHub action to make sure the parser does not fail on a set of .m files (regression testing)

Choose names to replace Activity/Entity/Agent in the BIDS-Prov skeleton

(This issue is opened following progress made on the specification at the OHBM Brainhack.)

Problem Statement

In the BIDS-Prov skeleton, we are currently referring directly to PROV terms (Activity/Entity/Agent).

Those should be replaced by subtypes that will be specific to BIDS-Prov (but generic enough to encompass any type of object).

Rationale

As a starting point "Activity" could be replaced by "Processing" as discussed w/ @ssaneei and Michael Dayan. "Entity" by "InputOutput" and "Agent" by "SoftwarePackage"?

Minimal example

{
  "@context": "https://purl.org/nidash/bidsprov/context.json",
  "BIDSProvVersion": "1.0.0",
  "records": {
    "SoftwarePackage": [
      {
        ...
      }
    ],
    "Processing": [
      {
        ...
      },
      {
        ...
      }
    ],
    "InputOutput": [
      {
        ...
      },
      {
        ...
      }
    ]
  }
}

Log of related discussion on Gdoc:

[Screenshots of the related discussion in the Google doc, 2021-06-22]

type indexing explained

Update proposal for BIDS Prov (BEP028)

Problem Statement

BIDS-prov provides a framework to describe any neuroimaging pipeline as a graph of operations, defined over digital entities
For our description to be generic enough we use 3 main concepts: Activities, Entities, and Agents.

Once our graph is built, we want to allow a broad range of operations on it. The most basic operation we could think of is querying the graph, e.g. to search for an entity given part of its name.

Rationale

JSON-LD does not impose any constraint on how we should define a graph; all we have to do is respect the JSON syntax. Activities, Entities, and Agents could be defined anywhere, in any order, which makes the graph harder to investigate.

For our queries to be written easily and run fast, we have to find a compromise between respecting the JSON syntax and setting up constraints on the structure of our graph.

For this purpose we use type indexing, which consists in using the types of the digital objects we describe as the primary key. This gives a very simple structure to our graph (a key for Agents, one for Entities, and one for Activities), yet allowing flexible definitions to correspond to those keys.

Minimal example

Here is an extract from examples/spm_default/realign.json

    ...
    "prov:Agent": [
      {
        "@id": "RRID:SCR_007037",
        "@type": "prov:SoftwareAgent",
        "label": "SPM"
      }
    ],
    "prov:Activity": [
      {
        "@id": "niiri:fdskjfnskjndflqkjndl",
        "label": "realign",
        "wasAssociatedWith": "RRID:SCR_007037",
        "startedAtTime": "10/10/2020 00:00:00",
        "endedAtTime": "10/10/2020 01:00:00",
        "used": "niiri:sjhgdqd",
        "attributes": [
          ["eoptions.quality", 0.9],
          ["eoptions.sep", 4],
          ["eoptions.fwhm", 5]
        ]
      }
    ],
    "prov:Entity": [
      {
        "@id": "niiri:fdsjnflqj12381U39fdskjnf",
        "wasAttributedTo": "RRID:SCR_007037",
        "wasGeneratedBy": "niiri:fdskjfnskjndflqkjndl",
        "derivedFrom": "niiri:sjhgdqd",
        "label": "Realigned func"
      }
    ]
  }

and here is how it would look WITHOUT TYPE INDEXING

    ...
     [
      {
        "@id": "RRID:SCR_007037",
        "@type": "prov:Agent",
        "label": "SPM"
      },
      {
        "@id": "niiri:fdskjfnskjndflqkjndl",
        "@type" : "prov:Activity",
        "label": "realign",
        "wasAssociatedWith": "RRID:SCR_007037",
        "startedAtTime": "10/10/2020 00:00:00",
        "endedAtTime": "10/10/2020 01:00:00",
        "used": "niiri:sjhgdqd",
        "attributes": [
          ["eoptions.quality", 0.9],
          ["eoptions.sep", 4],
          ["eoptions.fwhm", 5]
        ]
      },
      {
        "@id": "niiri:fdsjnflqj12381U39fdskjnf",
        "@type" : "prov:Entity",
        "wasAttributedTo": "RRID:SCR_007037",
        "wasGeneratedBy": "niiri:fdskjfnskjndflqkjndl",
        "derivedFrom": "niiri:sjhgdqd",
        "label": "Realigned func"
      }
    ]
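
The difference between the two layouts shows up in the query mentioned in the problem statement, searching for an entity given part of its name. The sketch below uses hypothetical function names and trimmed copies of the records above: with type indexing the query goes straight to the "prov:Entity" key, whereas the flat layout requires testing the @type of every record.

```python
indexed = {
    "prov:Agent": [{"@id": "RRID:SCR_007037", "label": "SPM"}],
    "prov:Activity": [{"@id": "niiri:fdskjfnskjndflqkjndl", "label": "realign"}],
    "prov:Entity": [
        {"@id": "niiri:fdsjnflqj12381U39fdskjnf", "label": "Realigned func"}
    ],
}

flat = [
    {"@id": "RRID:SCR_007037", "@type": "prov:Agent", "label": "SPM"},
    {"@id": "niiri:fdskjfnskjndflqkjndl", "@type": "prov:Activity", "label": "realign"},
    {
        "@id": "niiri:fdsjnflqj12381U39fdskjnf",
        "@type": "prov:Entity",
        "label": "Realigned func",
    },
]


def find_entities_indexed(document, fragment):
    """Search entity labels in a type-indexed document."""
    fragment = fragment.lower()
    return [
        record["@id"]
        for record in document.get("prov:Entity", [])
        if fragment in record.get("label", "").lower()
    ]


def find_entities_flat(records, fragment):
    """Same search on a flat list: every record's @type must be checked."""
    fragment = fragment.lower()
    return [
        record["@id"]
        for record in records
        if record.get("@type") == "prov:Entity"
        and fragment in record.get("label", "").lower()
    ]


print(find_entities_indexed(indexed, "realign"))
print(find_entities_flat(flat, "realign"))
```

Note that without the @type test the flat search would also return the "realign" activity, which is the kind of ambiguity type indexing avoids.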

Checklist

  • links to related existing issues and/or PR

BEP update

Hey BEP028!

Happy new year! I am opening up this issue to inquire if you all may have a status update to share? These updates are shared on our website. I have included a couple points to guide the update.

BEP status update:

  1. Status update on BEP 028
  2. Sharing the blocking items or sticking points
  3. Items left to discuss and clarify

Thank you!

OHBM abstract

  • get resources / examples from @cmaumet
  • write v1
  • feedback on v1 (meeting) with @cmaumet
  • write v2 and iterate over
  • share with BIDSProv committee

URN and UUID

Update proposal for BIDS Prov (BEP028)

Problem Statement

At the BIDS-Prov meeting (November 8, 2021), we agreed to include the following in BIDS-Prov:

  • For UUIDs that we can resolve : we'll use a specific prefix (similar to e.g. dandiasset for DANDI)
  • For UUIDs that we cannot resolve from any service in any way : use a urn prefix

Looking more closely at urn, it looks like those have to be accompanied by a registered namespace identifier, cf. https://en.wikipedia.org/wiki/Uniform_Resource_Name.

We could directly reuse the already registered UUID namespace identifier, e.g. "urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66"

@satra: can you confirm that the latter (i.e. using urn:uuid) is what you had in mind?

Once we have converged on this, TODO:

  • Update the spec and examples to replace all instances of "niiri:" with "urn:uuid:"
  • Add a section in the BIDS-Prov spec explaining how identifiers are chosen (two options as described above)
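
For reference, Python's standard uuid module already produces identifiers in this form. The sketch below shows one possible way to mint them and to migrate existing examples; the function names are hypothetical. It assumes that current "niiri:" identifiers are not valid UUIDs (as in the examples, e.g. "niiri:fsiud1"), so each one is mapped to a freshly minted value rather than having its prefix swapped.

```python
import uuid


def mint_identifier():
    """Return a new random identifier such as 'urn:uuid:6e8bc430-...'."""
    return uuid.uuid4().urn


def migrate_identifier(old_id, mapping):
    """Replace a 'niiri:' id by a urn:uuid, reusing the mapping for repeated ids."""
    if not old_id.startswith("niiri:"):
        return old_id  # e.g. RRIDs and other resolvable identifiers are kept
    if old_id not in mapping:
        mapping[old_id] = mint_identifier()
    return mapping[old_id]


mapping = {}
print(migrate_identifier("niiri:fsiud1", mapping))
print(migrate_identifier("RRID:SCR_007037", mapping))
```

Keeping the mapping is what preserves links such as wasGeneratedBy when every record of a file is migrated.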

validator module

One thing we might want to do at some point in providing a provenance framework is providing a validator for it
If an institution or a user creates prov files, we should provide them a program to check the validity of those files within the framework

This program should :

  • take *.json files as input
  • return a boolean value at the very end
  • raise warnings and errors along the way to enlighten the user about ways to fix those issues, in a clear and understandable manner.

In other words running this program acts as a sanity check.


For warnings and errors, one option would be the Python logging module, but that looks a bit tedious for this purpose. For a V1, I think we can use the warnings module: raise a warning if anything looks invalid, and just return False in any other situation.
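
A V1 along these lines could look like the sketch below. It is only an illustration: the function name validate_prov_file and the list of expected top-level keys are assumptions, and the sketch returns True only when no warning was raised, which is one possible reading of the behaviour described above.

```python
import json
import warnings

# Assumed minimal requirements; the real list would come from the specification.
EXPECTED_KEYS = ("@context", "records")


def validate_prov_file(path):
    """Return True if the file passes all checks, warning about each problem found."""
    try:
        with open(path, encoding="utf-8") as handle:
            document = json.load(handle)
    except (OSError, json.JSONDecodeError) as error:
        warnings.warn(f"{path}: cannot be read as JSON ({error})")
        return False
    if not isinstance(document, dict):
        warnings.warn(f"{path}: top-level value must be a JSON object")
        return False
    valid = True
    for key in EXPECTED_KEYS:
        if key not in document:
            warnings.warn(f"{path}: missing top-level key {key!r}")
            valid = False
    return valid
```

Collecting all warnings before returning, rather than stopping at the first one, lets the user fix several problems in one pass.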

User stories

Until now we have discussed pros and cons of different concepts and features in BEP028. The few use cases implemented so far are sidecar .jsonld files corresponding to standard examples. To go a little deeper and reach a broader range of users, we would like to formulate real-life examples and discuss their implementation with the current standard.

First set of user stories

  1. As a researcher I'd like to find out which realignment algorithm was applied in order to understand how it affects my final results
  2. As an SPM developer implementing the BIDS-Prov export I'd like to get a list of all activities in order to verify that it is consistent with my matlabbatch script.
  3. As an SPM user I'd like to visualize the BIDS-Prov graph corresponding to my matlabbatch file in order to get a visual representation of my pipeline (for example to be shared in a paper).
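
The second story, for instance, could be served by a query like the one sketched below. It assumes a type-indexed document with "prov:Agent" and "prov:Activity" keys, as in the examples of this repository; the function name list_activities and the sample document are made up for the illustration.

```python
def list_activities(document):
    """Return (activity label, agent label) pairs in document order."""
    agents = {
        agent["@id"]: agent.get("label", agent["@id"])
        for agent in document.get("prov:Agent", [])
    }
    rows = []
    for activity in document.get("prov:Activity", []):
        agent_id = activity.get("wasAssociatedWith")
        # Fall back to the raw id when the agent is not described in the file.
        rows.append((activity.get("label"), agents.get(agent_id, agent_id)))
    return rows


document = {
    "prov:Agent": [{"@id": "RRID:SCR_007037", "label": "SPM"}],
    "prov:Activity": [
        {"@id": "niiri:a1", "label": "realign", "wasAssociatedWith": "RRID:SCR_007037"},
        {"@id": "niiri:a2", "label": "segment", "wasAssociatedWith": "RRID:SCR_007037"},
    ],
}
for label, agent in list_activities(document):
    print(f"{label}: {agent}")
```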

Activity attributes as DataElements

  • provide an example using nidm:DataElement
  • rewrite SPM and FSL examples if needed
  • provide a query example
  • update new_features.md

An example using DataElement in turtle format here
