
gen3_util's Introduction

Gen3 Tracker

Utilities to manage Gen3 schemas, projects and submissions.

Quick Start

Installation


$ pip install gen3_tracker

$ g3t version
version: 0.0.1


Use

$ g3t --help
Usage: g3t [OPTIONS] COMMAND [ARGS]...

  Gen3 Tracker: manage FHIR metadata and files.

Options:
  --format [yaml|json|text]  Result format. G3T_FORMAT  [default: yaml]
  --profile TEXT             Connection name. G3T_PROFILE See
                             https://bit.ly/3NbKGi4

  --version
  --help                     Show this message and exit.

Commands:
  ping          Verify gen3-client and test connectivity.
  init          Create project, both locally and on remote.
  add           Add file to the index.
  commit        Record changes to the project.
  diff          Show new/changed metadata since last commit.
  push          Submit committed changes to commons.
  status        Show the working tree status.
  clone         Clone meta and files from remote.
  pull          Download latest meta and data files.
  update-index  Update the index from the META directory.
  rm            Remove project.
  utilities     Useful utilities.


User Guide

Contributing


gen3_util's Issues

documentation/clarify-gen3-init-instructions

As a user, I would like clearer explanations of how to use g3t init so that I don't run into confusing errors downstream. Specifically, I noticed that...

  • g3t init --help describes the flags as "--project_id TEXT Gen3 program-project G3T_PROJECT_ID", which is not very clear to me. I could see something like “--project_id project ID formatted as myprogram-myproject”
  • When running g3t init program-project without a profile, I appreciate that there are multiple warnings that no profile is set, one being "No profile set. Continuing in disconnected mode. Use set profile <profile>". However, it would be great to have an explanation of disconnected mode (no mention of it in the docs) and to suggest a command like export G3T_PROJECT_ID=myprogram-myproject

Data Governance Capabilities

As an ACED stakeholder, in order to enable flexible, secure data sharing, I need a way to:

  • request creation of a project
  • request adding a user with read or write permissions to a project
  • approve creation of a project
  • approve adding a user to a project

These abilities should be scoped to an organization [ohsu, ucl, manchester, etc.]

delete_file_locations support

https://ohsucomputationalbio.slack.com/archives/C043HPV0VMY/p1700260682334969

Liam Beckman
2:38 PM
Hi all, we’re plugging along on the aced-idp.org data portal and are running into an issue when attempting to delete indexd records using the Gen3 SDK. I’ve included additional info below on that, but let us know if we can add anything else or try to reproduce the issue, thank you!

Issue:
When attempting to delete an indexd record using the delete_file_locations() method in the Gen3 SDK, we encounter a “The AWS Access Key Id you provided does not exist” error. The indexd record is in a non-AWS S3 bucket (MinIO endpoint), which is specified by the endpoint_url of the bucket in the Fence config.

Expected Behavior:
The delete_file_locations() method should delete both the file and the indexd record for both AWS and non-AWS S3-compatible buckets.
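
A minimal reproduction sketch, assuming current Gen3 SDK conventions; the credentials path and GUID below are placeholders, not values from our deployment:

# Reproduction sketch (assumptions: gen3 SDK installed; credentials file
# and GUID are placeholders).
from gen3.auth import Gen3Auth
from gen3.file import Gen3File

auth = Gen3Auth(refresh_file="credentials.json")  # API key downloaded from the portal
file_client = Gen3File(auth_provider=auth)

# Expected: deletes the stored object and its indexd record.
# Observed against a MinIO-backed bucket: "The AWS Access Key Id you
# provided does not exist".
response = file_client.delete_file_locations("<GUID>")
print(response)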

fix: suppress stack trace when indexd document already exists

As an ACED user, when I re-upload files, the helpful error message is obscured by the stack trace.

Helpful:

[ERROR] gen3_util.files.manifest indexd record already exists, consider using --overwrite. 0f371f8f-a26c-5d56-a166-ab832f720f50 409 Client Error: did "0f371f8f-a26c-5d56-a166-ab832f720f50" already exists for url: https://aced-idp.org/index/index/

Not helpful:

  0%|                                                                                           | 0/3 [00:00<?, ?it/s]
[2024-02-28 09:51:46,680] [ERROR] gen3_util.repo 409 Client Error: did "0f371f8f-a26c-5d56-a166-ab832f720f50" already exists for url: https://aced-idp.org/index/index/
Traceback (most recent call last):
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/repo/cli.py", line 247, in push_cli
    raise e
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/repo/cli.py", line 236, in push_cli
    push(config, restricted_project_id=restricted_project_id, overwrite_index=overwrite_index,
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/repo/pusher.py", line 35, in push
    manifest_entries = upload_commit_to_indexd(
                       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/files/manifest.py", line 209, in upload_commit_to_indexd
    _ = _write_indexd(
        ^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/files/manifest.py", line 160, in _write_indexd
    raise e
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/files/manifest.py", line 147, in _write_indexd
    response = index_client.create_record(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/backoff/_sync.py", line 94, in retry
    ret = target(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3/index.py", line 420, in create_record
    rec = self.client.create(
          ^^^^^^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/indexclient/client.py", line 286, in create
    resp = self._post(
           ^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/indexclient/client.py", line 41, in timeout
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/indexclient/client.py", line 412, in _post
    handle_error(resp)
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/indexclient/client.py", line 35, in handle_error
    resp.raise_for_status()
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: did "0f371f8f-a26c-5d56-a166-ab832f720f50" already exists for url: https://aced-idp.org/index/index/
msg: '409 Client Error: did "0f371f8f-a26c-5d56-a166-ab832f720f50" already exists
  for url: https://aced-idp.org/index/index/'
exception: '409 Client Error: did "0f371f8f-a26c-5d56-a166-ab832f720f50" already exists
  for url: https://aced-idp.org/index/index/'

Install of gen3_util: libmagic issue

Installing as per README.md results in

ImportError: failed to find libmagic.  Check your installation

Suggestions online include pip install python-magic==0.4.15, or pip uninstall python-magic followed by pip install python-magic-bin==0.4.14, to resolve this issue. libmagic should be added to requirements.txt.

backlog/git-lite

Epic

As a release manager, I need a feature tracking system to identify and prioritize missing features with real value for the users.

Missing / incomplete features

LOE: level of effort

easy:

  • rm (remove a file) - we can do it via utilities file rm; however, this is not really "git like". LOE: easy
  • some commands depend on the server [init, utilities access sign, status]; we should immediately log "working" to the console (in case the server is slow). LOE: easy
  • log (show commit log) - we can do it via utilities file ls --is_metadata; however, this is not really "git like". LOE: easy
  • diff - no equivalent. LOE: moderate ✅

moderate:

  • pull "some" - currently pull retrieves all the files, need to add some ability to pull only some files based on some parameters ✅ TBD: e.g.

    • pull
    • pull -object_ids
    • pull -patient_ids
      • -specimen_ids
  • reset commit - no equivalent. LOE: moderate

    • remove 1 or more commit records from completed, save in pending
    • re-run push's publish step

hard:

  • orphaned metadata: analyse server database(s), flag orphaned records, and allow the user to remove them or automatically purge them. LOE: hard

unknown:

  • g3t status - when performed after push, each time the user executes g3t status, the command checks the k8s job if the job status is in [Unknown, Running]. Once the job status changes, the response is cached and not checked again. However, if the user does not execute g3t status for more than N minutes, the job logs expire and we have no way of knowing the status; in addition, the gen3 client library logs a bunch of errata to the screen.
    @lbeckman314: is there a workaround for this at the k8s level, i.e. long-lived logs? See here and the k8s doc.

test-plan/git-lite/submitter

Epic

As a release manager, I want a test script to ensure comprehensive and repeatable testing of the new feature(s).

Use case

As a testing engineer, I want a data submission, validation, and upload script to ensure accurate and secure processing of user-submitted data.

Definition of Done:

  • Test script is created and documented.
  • The script is reviewed and approved by the testing team.
  • (optional) The script is integrated into the testing process and automated frameworks.
  • The script is executed and all acceptance criteria are met.

Considerations

  • The system should generate clear error messages for users in case of invalid data submissions, guiding them on how to correct the issues.
  • It should perform thorough validation of data integrity to prevent corruption or loss during the upload process.
  • Security measures should be implemented to protect against potential data breaches or unauthorized access during the submission and upload process.
  • It must log relevant information, including successful uploads and any errors encountered, for auditing and debugging purposes.
  • The system should be version-controlled to track changes and updates over time.
  • The validation and upload process should be easily integrable into automated testing frameworks for continuous integration.

Script

submitter test script

# Use case: As a data submitter, I will need to create a project.
## test should work with or without environment variables
#export G3T_PROFILE=local
#export G3T_PROJECT_ID=ohsu-test002b
#g3t init
unset G3T_PROJECT_ID
unset G3T_PROFILE
g3t --profile local init ohsu-test001b

# Use case: As an institution data steward, I need to approve the project before it can be shared.
g3t utilities access sign

# Use case: As an ACED administrator, I need to create projects in sheepdog so that submissions can take place.
g3t utilities projects ls
## test: the project should be listed as incomplete
g3t utilities projects create
## test: the project should be listed as complete

# Use case: As a data submitter, I will need to add files to the project and associate them with a subject (patient).
g3t add tests/fixtures/dir_to_study/file-1.txt  --patient P1
g3t utilities meta create
## test meta generation:  META should have 4 files
g3t commit  -m "commit-1"
## test the commit: g3t status should return commit info - was the message added?
#  resource_counts:
#      DocumentReference: 1
#      Patient: 1
#      ResearchStudy: 1
#      ResearchSubject: 1

# Use case: when subjects are added to the study, I need to add them to the project.
g3t add tests/fixtures/dir_to_study/file-2.csv  --patient P2
g3t status
## test add: should return one entry in "uncommitted_manifest:"
g3t utilities meta create
## test meta generation: META should have 4 files; Patient, ResearchSubject, and DocumentReference should have 1 new record each
g3t commit -m "commit-2"
## test the commit: g3t status should return commit info - was the message added? there should be only the three new records
#    resource_counts:
#      DocumentReference: 1
#      Patient: 1
#      ResearchSubject: 1
#    manifest_files:
#    - tests/fixtures/dir_to_study/file-2.csv

# Use case: some subjects have specimens; I need to add them to the project.
g3t add tests/fixtures/dir_to_study/sub-dir/file-3.pdf --patient P3 --specimen S3
g3t utilities meta create
## test should create a Specimen.ndjson file in META
# Created 4 new records.
wc -l META/Specimen.ndjson
#       1 META/Specimen.ndjson
g3t diff
## test diff: should show new records
g3t commit -m "commit-3"
## test the commit: g3t status should return commit info - was the message added? 4 new records
#    message: commit-3
#    resource_counts:
#      DocumentReference: 1
#      Patient: 1
#      ResearchSubject: 1
#      Specimen: 1
#    manifest_files:
#    - tests/fixtures/dir_to_study/sub-dir/file-3.pdf

# Use case: I'm ready to share my data
## push to remote
g3t push
## test:  the system should respond with reasonable, informative messages without too much verbosity
## I need to know the status of my project. During job execution, I should be able to query the status.
g3t status
## test: After job execution, I should have detailed information about the results.
#  pushed_commits:
#  - published_timestamp: 2024-01-19T09:45:47.018426
#    published_job:
#      output:
#        uid: 82322961-8d2a-47e4-8833-af0e299aa393
#        name: fhir-import-export-ohiwi
#        status: Completed
#    commits:
#    - d050c8f931bab152279ff18e0a21434f commit-1
#    - 2f77cf6017ec3b0485b7493ebe459f53 commit-2
#    - a550281b43713937ce684e3cab13639f commit-3

## test: Once complete, the remote counts should reconcile with my activity
#remote:
#  resource_counts:
#    DocumentReference: 3
#    Patient: 3
#    ResearchStudy: 1
#    ResearchSubject: 3
#    Specimen: 1
wc -l META/*.ndjson
#       3 META/DocumentReference.ndjson
#       3 META/Patient.ndjson
#       1 META/ResearchStudy.ndjson
#       3 META/ResearchSubject.ndjson
#       1 META/Specimen.ndjson

## If I want more detailed information, I should be able to query it
## get UID from status -> local.pushed_commits.published_job.output.uid
g3t utilities jobs get UID
# ....


# Use case: As a data submitter, when I know more about meta, I should be able to add it.
# e.g. alter a patient record
sed -i.bak 's/"P1"}]}/"P1"}], "gender": "male"}/' META/Patient.ndjson
# see https://stackoverflow.com/a/22084103
rm META/Patient.ndjson.bak
g3t diff
## test diff: should show changed records
g3t commit -m "commit-4"
## test: the commit should process only one patient record
#resource_counts:
#  Patient: 1

## Use case: I should be able to publish a 'meta only' change
g3t push

## Use case: As a human being, I make mistakes; the system should prevent me from committing `no changes`
g3t commit -m "commit-5 has no changes"
## test: the system should reject the commit
# msg: No resources changed in META

## Use case: As a human being, I make mistakes; the system should prevent me from committing `invalid fhir`
sed -i.bak 's/"gender"/"foobar"/' META/Patient.ndjson
# see https://stackoverflow.com/a/22084103
rm META/Patient.ndjson.bak
g3t commit -m "commit-6 has invalid fhir"
## test: should fail validation, the response should be informative and give me enough information to fix the problem

Feature Request: add support for additional file hashes (e.g. etag)

Background

Multiple hashes are allowed for the importing of files into the indexd service, including etags:

import re

ACCEPTABLE_HASHES = {
    "md5": re.compile(r"^[0-9a-f]{32}$").match,
    "sha1": re.compile(r"^[0-9a-f]{40}$").match,
    "sha256": re.compile(r"^[0-9a-f]{64}$").match,
    "sha512": re.compile(r"^[0-9a-f]{128}$").match,
    "crc": re.compile(r"^[0-9a-f]{8}$").match,
    "etag": re.compile(r"^[0-9a-f]{32}(-\d+)?$").match,
}
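
As a quick illustration, both single-part and multipart etag forms pass the etag pattern above while the multipart form fails the stricter md5 pattern (a minimal sketch; the example values are made up):

import re

# Example values (made up): a 32-hex-char etag and a multipart-upload etag.
etag_single = "9bb58f26192e4ba00f01e2e7b136bbd8"
etag_multi = "9bb58f26192e4ba00f01e2e7b136bbd8-13"

etag = re.compile(r"^[0-9a-f]{32}(-\d+)?$")
md5 = re.compile(r"^[0-9a-f]{32}$")

assert etag.match(etag_single) and etag.match(etag_multi)
assert md5.match(etag_single) and not md5.match(etag_multi)  # multipart etags are not md5s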

Current Behavior

Currently the g3t command requires the md5 hash of the file to be provided in order for the file to be uploaded to the indexd service. In the case where this hash is not available (i.e. importing files from an existing S3 endpoint), it can take a long time to both download the file and calculate its md5 hash.

New Behavior

Adding support for additional hashes like etag would allow for greater efficiency when uploading files where the md5 hash is not immediately available or not yet calculated.

For remote files already registered in an S3 bucket, the etag hash can be fetched with the MinIO client as follows:

➜ mc stat -r example-s3/example-bucket --json
{
 "status": "success",
 "name": "example-bucket/example-file",
 "lastModified": "2024-01-01T00:59:20-08:00",
 "size": 123,
 "etag": "4pophfvzd8eo8pir7i2sgzn4nifz88jho-1234",   <--- example etag hash
 "type": "file",
 "metadata": {
  "Content-Type": "application/gzip"
 }
}
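
The same value can also be fetched programmatically; below is a minimal sketch assuming boto3, with placeholder endpoint, bucket, key, and credentials:

import boto3

# Sketch: look up the etag of an object already registered in an
# S3-compatible bucket (all names and credentials are placeholders).
s3 = boto3.client(
    "s3",
    endpoint_url="https://example-s3",      # MinIO or other S3-compatible endpoint
    aws_access_key_id="<ACCESS-KEY>",
    aws_secret_access_key="<SECRET-KEY>",
)
head = s3.head_object(Bucket="example-bucket", Key="example-file")
print(head["ETag"].strip('"'))              # boto3 wraps the ETag in quotes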

Steps for Implementing

Environment

  • Rancher Desktop version: 1.11.1
  • Helm version: v3.13.1
  • Gen3 Chart version: 0.1.25

add .gitignore to .g3t/ and META/ directories

Add a .gitignore and README.md to each directory.

See https://github.com/ACED-IDP/gen3_util/blob/development/gen3_util/config/__init__.py#L236-L239

META/.gitignore

*
!README.md

.g3t/state/.gitignore

*
!README.md

META/README.md and .g3t/README.md


# Data Directory

Welcome to the data directory! This repository contains important data files for our project. Before you proceed, please take note of the following guidelines to ensure the security and integrity of our data.

## Important Note: Do Not Check in Protected Files

Some files in this directory are considered protected and contain sensitive information. **DO NOT** check in or commit these protected files to the version control system (e.g., Git). This is crucial to prevent unauthorized access and to comply with security and privacy policies.
 

## Usage Guidelines:

1. **Read-Only Access:** Unless you have explicit permission to modify or update the data, treat this directory as read-only.

2. **Data Integrity:** Ensure the integrity of the data by following proper procedures for reading, updating, and managing files.

3. **Security Awareness:** Be aware of the sensitivity of the data stored here and take necessary precautions to protect it from unauthorized access.

## How to Obtain Access:

If you need access to these files, please contact the project administrator for access to idp.cbds.ohsu.edu.

Thank you for your cooperation in maintaining the security and confidentiality of our data.

Simplify CLI user interface

from https://ohsucomputationalbio.slack.com/archives/D0AQV57D2/p1703194001745479

Metadata operations

Copy project meta to local storage

gen3_util meta pull --profile=aced --project_id=aced-test <PATH-TO-FHIR>

FHIR to TSV

gen3_util meta to_tabular <PATH-TO-FHIR> <PATH-TO-TABULAR>

FHIR to Excel

gen3_util meta to_tabular --excel <PATH-TO-FHIR> <PATH-TO-TABULAR>

TSV to FHIR

gen3_util meta from_tabular <PATH-TO-TABULAR> <PATH-TO-FHIR>

Validate local files

gen3_util meta validate <PATH-TO-FHIR>

Push local FHIR to Gen3 instance

gen3_util meta push --profile=aced --project_id=aced-test <PATH-TO-FHIR>

File Operations

List files in a project

gen3_util files ls-remote

Remove files from a project index and bucket

gen3_util files rm-remote <REMOTE-ID>

or

gen3_util files rm --remote <REMOTE-ID>

Add file meta information to current index

gen3_util files add <LOCAL-PATH>

Read working index

gen3_util files status

Upload working index and files

gen3_util files push

Remove local file(s) from working index

gen3_util files rm <LOCAL-PATH>

test-plan/git-lite/consumer

Epic

As a release manager, I want a test script to ensure comprehensive and repeatable testing of the new feature(s).

Use case

As a testing engineer, I want a data download and replication script to facilitate efficient testing of data synchronization and replication processes.

Definition of Done:

  • Test script is created and documented.
  • The script is reviewed and approved by the testing team.
  • (optional) The script is integrated into the testing process and automated frameworks.
  • The script is executed and all acceptance criteria are met.

Considerations

  • The system should generate clear error messages for users in case of invalid data submissions, guiding them on how to correct the issues.
  • It should perform thorough validation of data integrity to prevent corruption or loss during the upload process.
  • Security measures should be implemented to protect against potential data breaches or unauthorized access during the submission and upload process.
  • It must log relevant information, including successful uploads and any errors encountered, for auditing and debugging purposes.
  • The system should be version-controlled to track changes and updates over time.
  • The validation and upload process should be easily integrable into automated testing frameworks for continuous integration.

consumer test script

# Use case: As a data consumer, I will need to download a project.

## test should work with or without environment variables

#export G3T_PROFILE=local
#export G3T_PROJECT_ID=ohsu-test002b
#g3t clone

unset G3T_PROJECT_ID
unset G3T_PROFILE
g3t --profile local clone --project_id ohsu-test001b

## test: the project should exist
cd ohsu-test001b
## test: the meta data should be in place with the latest changes
grep male META/Patient.ndjson |  jq '[.id, .gender]'
#"20d7d7eb-46f9-5175-b474-cb504f66e10e"
## test by default, the files should not be downloaded
ls tests
# ls: tests: No such file or directory

## Use case: I should be able to download files
g3t pull
## test directory should now contain
tree tests
#tests
#└── fixtures
#    └── dir_to_study
#        ├── file-1.txt
#        ├── file-2.csv
#        └── sub-dir
#            └── file-3.pdf
#

feature/improve-validation-missing-references

This improvement would flag structures like the following as invalid:

{
  "resourceType": "Specimen",
  "id": "XXXXXXXXX-8058-57d1-aaf2-c3fcc564125f",
  "identifier": [
    {
      "system": "http://XXXX.YYY/ZZZ/specimen",
      "value": "ABC"
    }
  ]
}

See

resources[parse_result.resource.resource_type] += 1

Pseudo-code:


REFERENCE_REQUIRED_EXEMPTIONS = ['Patient', 'ResearchStudy', 'Substance']  # this is not an exhaustive list
if parse_result.resource.resource_type not in REFERENCE_REQUIRED_EXEMPTIONS and len(nested_references) == 0:
    parse_result.exception = Exception(
        f"Resource has no references {parse_result.resource.resource_type}/{parse_result.resource.id}"
    )
    exceptions.append(parse_result)
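
For illustration, here is a self-contained sketch of the intended check applied to the Specimen above; it walks the resource dict for 'reference' keys (names and structure are illustrative, not the validator's actual internals):

import json

REFERENCE_REQUIRED_EXEMPTIONS = {'Patient', 'ResearchStudy', 'Substance'}

def find_references(node):
    """Recursively collect FHIR reference values ('reference' keys)."""
    refs = []
    if isinstance(node, dict):
        for key, value in node.items():
            refs.extend([value] if key == 'reference' else find_references(value))
    elif isinstance(node, list):
        for item in node:
            refs.extend(find_references(item))
    return refs

resource = json.loads('''{
  "resourceType": "Specimen",
  "id": "XXXXXXXXX-8058-57d1-aaf2-c3fcc564125f",
  "identifier": [{"system": "http://XXXX.YYY/ZZZ/specimen", "value": "ABC"}]
}''')

if resource['resourceType'] not in REFERENCE_REQUIRED_EXEMPTIONS and not find_references(resource):
    print(f"invalid: {resource['resourceType']}/{resource['id']} has no references")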

bug: project name should be unique within program, currently unique across all programs

symptoms

log into etl pod on development

sheepdog_development=> select _props->>'code' as code  from node_project where node_id in (select src_id from edge_projectmemberofprogram where dst_id = (select node_id from node_program where _props->>'name' =  'ohsu')) ;
   code    
-----------
 demo
 myproject
 dev
 aws

however,

sheepdog_development=> select _props->>'code' as code  from node_project where node_id in (select src_id from edge_projectmemberofprogram where dst_id = (select node_id from node_program where _props->>'name' =  'ohsu_two')) ;
 code 
------
(0 rows)

Similar results for program cbds:

select _props->>'code' as code  from node_project where node_id in (select src_id from edge_projectmemberofprogram where dst_id = (select node_id from node_program where _props->>'name' =  'cbds')) ;
 code 
------
(0 rows)

Drilling a little more into the sheepdog db…

select node_id, _props->>'code' as code  from node_project ;
               node_id                |        code         
--------------------------------------+---------------------
 07ed9016-2ae5-5078-a244-c6c79ece33fc | eimages
 e3998e35-baf2-5720-9227-6b3007160020 | demo
 217daa38-587d-599f-aec0-8a02436009e9 | myproject
 a77f549b-c74b-563e-80bb-570b5a4dde88 | test
 e42f7315-90c0-5b97-bfd0-8c1d11fa75a8 | sower_test2
 aa957b86-3aca-515c-ad75-06d0ec7f684f | test_sower3
 dea0c744-e454-5822-82f1-06735ca37018 | dev
 a1b72d7a-cd62-589b-99be-c52ba48f5049 | test_sower_brian2
 cd300876-5179-5d97-8440-8b76383866a0 | test_sower_liam4
 e655e907-e59a-5fa9-bdd2-cd36f0d3cad6 | test_sower_matthew
 c1b8c16a-b803-5ca0-ae14-30649de7a7c7 | test_sower_liam7
 ae7be7ea-8938-57a4-910e-48ecbbe62e6d | test_sower_matthew7
 84fdfb93-87d3-551d-9a1f-a67dda89cafc | OMOPTEST
 9c3e7832-43f3-57d9-8e6d-770fc6f4bf3d | aws
(14 rows)

But when we look at requestor (arborist), we see there are several:

# in arborist_users 
 grep "/dev$" DATA/aced-commons-development/user.yaml | uniq
    - /programs/aced/projects/dev
    - /programs/ohsu/projects/dev
    - /programs/cbds/projects/dev
    - /programs/ohsu_two/projects/dev
    - /programs/wasabi/projects/dev
    - /programs/aws/projects/dev

Looking into the submission code, I think the strategy here will always match only on project.code, not considering program:
https://github.com/ACED-IDP/submission/blob/a663e15e1c8eb94d00bc63ebaee173c05ef38006/aced_submission/meta_graph_load.py#L332

As a result, the test case "Same project.code used in multiple programs" will only create the first program specified.
This is completely untested, but something like this should work:

select node_id, _props->>'code' as code  from node_project where node_id in (select src_id from edge_projectmemberofprogram where dst_id = (select node_id from node_program where _props->>'name' =  '{program}')) and _props->>'code' = '{project}' ;

git brainstorming

From 2023-01-03 discussion


Use cases:

init

As an ACED user, when I want to start a new project, I need a simple way to create project structure.

[x] done

gen3_util init

Usage: gen3_util init [OPTIONS]

  Create project, both locally and on remote.

Options:
  --project_id TEXT  Gen3 program-project
  --help             Show this message and exit.

  • create common directory structure [DATA/, META/, .g3t/]
  • localize config file and state directory in hidden dir
  • issue requestor commands to create project in remote, (ready for signing)

clone

As an ACED user, when I want to work with a project locally, I need a simple way to retrieve meta data and store configuration in well known locations.

[x] done

gen3_util clone

Usage: gen3_util clone [OPTIONS]

  Clone meta and files from remote.

Options:
  --project_id TEXT             Gen3 program-project
  --data_type [meta|files|all]  Clone meta and/or files from remote.
                                [default: all]

  • creates directories, the same as init
  • downloads meta, files

commit

As an ACED user, when I think my contributions are complete, I need a single command to record a comment and run validation and sanity checks.

gen3_util commit

status

As an ACED user, when I'm making changes, I need a quick way to summarize my changes versus the remote project.

gen3_util status

log

As an ACED user, when I need to see the history of a project, I need a quick way to summarize the contributions.

gen3_util log

push

As an ACED user, when I'm ready to publish my contributions, I need a single command to upload files and meta data.

gen3_util push

As an ACED user, I may be using gen3_util on a system with the data already on the file system - or I may be working on a system that will retrieve the data. It would be useful if the project structure could incorporate symlinks.

Possible implementation points: file add, clone, init, pull

feature/manifest support "no bucket", i.e. upload: no-op, download: scp or symlink

User Story

As a DevOps architect, I want to implement a file manifest collection system to efficiently organize and track files, where each file is represented by a URL using a symlink or the SCP protocol. This will enable us to streamline the process of managing metadata and indexing files where we cannot move the data to a managed bucket.

Acceptance Criteria

  • As a user, I should be able to initiate the collection of a manifest of files indicating that the files should NOT be uploaded.

    • Add a parameter to the g3t add command --no-upload
    • The path added should still be a relative path from the project directory
    • The add command should populate the DocumentReference's 'source_path' extension with an scp://<hostname>/<full-path> style URL
  • On push, the system should:

    • upload all file meta data to indexd
    • continue to upload files added without the --no-upload parameter to the project bucket via gen3-client
    • skip uploading any files added with the --no-upload parameter
  • On pull or clone, the system should:

    • continue to download all files from the project bucket as it does now
    • when the current hostname == the URL's scp://<hostname>, the system should create a symlink in the project directory. The symlink creation process should reflect the file's original structure, maintaining the directory hierarchy specified in the manifest.
    • when the current hostname != the URL's scp://<hostname>, the system should initiate a multi-threaded scp download, ideally outsourcing this job to a utility package on the user's machine (see the sketch after this list). The system should provide a mechanism for the user to authenticate and authorize access to the files using SCP credentials.
  • The system should handle errors gracefully and provide informative messages for troubleshooting, especially in cases of connection issues or authentication failures.
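
A minimal sketch of the pull-time decision under these criteria, assuming a hypothetical manifest entry carrying an scp:// source_path URL (the helper and the path mapping are illustrative, not the implementation):

import os
import socket
import subprocess
from urllib.parse import urlparse

def materialize(source_path: str, project_dir: str) -> None:
    """Symlink or scp-download one --no-upload manifest entry (illustrative)."""
    url = urlparse(source_path)  # e.g. scp://host.example.org/data/file-1.txt
    # Mapping the remote path into the project directory is illustrative.
    destination = os.path.join(project_dir, url.path.lstrip("/"))
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    if url.hostname == socket.gethostname():
        # Same host: the data is already on this file system; link it,
        # preserving the directory hierarchy from the manifest.
        os.symlink(url.path, destination)
    else:
        # Different host: fall back to scp (the user's ssh credentials
        # handle authentication; a real implementation would parallelize).
        subprocess.run(["scp", f"{url.hostname}:{url.path}", destination], check=True)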

dataframer: include condition in Observation dataframe

Looking to add cancer type / biopsy location to the FHIR labkey metadata. Looking through the labkey data, it seems this information can be obtained by joining the enrollment table to the sequencing table via patient id. How would you approach this?

Assumptions:

  • a Condition MAY exist - creating the condition is the responsibility of the data submitter; the condition may be created in any fashion: by hand, by a g3t_etl transformer, or another mechanism.

Approach:

  • The subject of the observation is called here if the subject is a Patient
  • Note that the condition is not mapped directly from the observation. There can be several paths from Observation to Condition via Observation.focus:

Effectively the "join" is observation.subject == condition.subject :

  • Items from the patient should be mapped to the observation, similar to how fields from a linked Procedure are mapped
  • This may be overly simplistic, as the patient can have multiple conditions; however, it will get the basic observation dataframe populated (see the sketch below)
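
A pandas sketch of that join, assuming hypothetical observation and condition dataframes that each carry a subject column (all column names and values are illustrative):

import pandas as pd

# Illustrative frames: one row per Observation / Condition, each keyed by
# the shared subject reference (e.g. "Patient/P1").
observations = pd.DataFrame({
    "observation_id": ["O1", "O2"],
    "subject": ["Patient/P1", "Patient/P2"],
    "value": [1.2, 3.4],
})
conditions = pd.DataFrame({
    "condition_id": ["C1"],
    "subject": ["Patient/P1"],
    "code_display": ["example cancer type"],
})

# Effectively observation.subject == condition.subject; a left join keeps
# observations whose patient has no Condition, and duplicates rows when a
# patient has several Conditions (the noted simplification).
df = observations.merge(conditions, on="subject", how="left")
print(df)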
