
gen3_util's Introduction

Gen3 Tracker

Utilities to manage Gen3 schemas, projects and submissions.

Quick Start

Installation


$ pip install gen3_tracker

$ g3t version
version: 0.0.1


Use

$ g3t --help
Usage: g3t [OPTIONS] COMMAND [ARGS]...

  Gen3 Tracker: manage FHIR metadata and files.

Options:
  --format [yaml|json|text]  Result format. G3T_FORMAT  [default: yaml]
  --profile TEXT             Connection name. G3T_PROFILE See
                             https://bit.ly/3NbKGi4

  --version
  --help                     Show this message and exit.

Commands:
  ping          Verify gen3-client and test connectivity.
  init          Create project, both locally and on remote.
  add           Add file to the index.
  commit        Record changes to the project.
  diff          Show new/changed metadata since last commit.
  push          Submit committed changes to commons.
  status        Show the working tree status.
  clone         Clone meta and files from remote.
  pull          Download latest meta and data files.
  update-index  Update the index from the META directory.
  rm            Remove project.
  utilities     Useful utilities.


User Guide

Contributing


gen3_util's Issues

documentation/clarify-gen3-init-instructions

As a user, I would like clearer explanations of how to use g3t init so that I don't run into confusing errors downstream. Specifically, I noticed that...

  • g3t init --help describes the flags as "--project_id TEXT Gen3 program-project G3T_PROJECT_ID", which is not very clear to me. I could see something like “--project_id project ID formatted as myprogram-myproject”
  • When running g3t init program-project without a profile, I appreciate that there are multiple warnings that no profile is set, one being "No profile set. Continuing in disconnected mode. Use set profile <profile>". However, it would be great to have an explanation of disconnected mode (no mention of it in the docs) and to suggest a command like export G3T_PROJECT_ID=myprogram-myproject

Data Governance Capabilities

As an ACED stakeholder, in order to enable flexible, secure data sharing, I need a way to:

  • request creation of a project
  • request adding a user with read or write permissions to a project
  • approve creation of a project
  • approve adding a user to a project

These abilities should be scoped to an organization [ohsu, ucl, manchester, etc.]

delete_file_locations support

https://ohsucomputationalbio.slack.com/archives/C043HPV0VMY/p1700260682334969

Liam Beckman
2:38 PM
Hi all, we’re plugging along on the aced-idp.org data portal and are running into an issue when attempting to delete indexd records using the Gen3 SDK. I’ve included additional info below on that, but let us know if we can add anything else or try to reproduce the issue, thank you!

Issue:
When attempting to delete an indexd record using the delete_file_locations() method in the Gen3 SDK, we encounter a “The AWS Access Key Id you provided does not exist” error. The indexd record is in a non-AWS S3 bucket (MinIO endpoint), which is specified by the endpoint_url of the bucket in the Fence config.

Expected Behavior:
The delete_file_locations() method should delete both the file and the indexd record for both AWS and non-AWS S3-compatible buckets.
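
A minimal reproduction sketch, assuming current Gen3 SDK conventions; the credentials path and GUID below are placeholders, not values from our deployment:

# Reproduction sketch (assumptions: gen3 SDK installed; credentials file
# and GUID are placeholders).
from gen3.auth import Gen3Auth
from gen3.file import Gen3File

auth = Gen3Auth(refresh_file="credentials.json")  # API key downloaded from the portal
file_client = Gen3File(auth_provider=auth)

# Expected: deletes the stored object and its indexd record.
# Observed against a MinIO-backed bucket: "The AWS Access Key Id you
# provided does not exist".
response = file_client.delete_file_locations("<GUID>")
print(response)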

fix: suppress stack trace when indexd document already exists

As an ACED user, when I re-upload files, the helpful error message is obscured by the stack trace.

Helpful:

[ERROR] gen3_util.files.manifest indexd record already exists, consider using --overwrite. 0f371f8f-a26c-5d56-a166-ab832f720f50 409 Client Error: did "0f371f8f-a26c-5d56-a166-ab832f720f50" already exists for url: https://aced-idp.org/index/index/

Not helpful:

  0%|                                                                                           | 0/3 [00:00<?, ?it/s]
[2024-02-28 09:51:46,680] [ERROR] gen3_util.repo 409 Client Error: did "0f371f8f-a26c-5d56-a166-ab832f720f50" already exists for url: https://aced-idp.org/index/index/
Traceback (most recent call last):
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/repo/cli.py", line 247, in push_cli
    raise e
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/repo/cli.py", line 236, in push_cli
    push(config, restricted_project_id=restricted_project_id, overwrite_index=overwrite_index,
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/repo/pusher.py", line 35, in push
    manifest_entries = upload_commit_to_indexd(
                       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/files/manifest.py", line 209, in upload_commit_to_indexd
    _ = _write_indexd(
        ^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/files/manifest.py", line 160, in _write_indexd
    raise e
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3_util/files/manifest.py", line 147, in _write_indexd
    response = index_client.create_record(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/backoff/_sync.py", line 94, in retry
    ret = target(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/gen3/index.py", line 420, in create_record
    rec = self.client.create(
          ^^^^^^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/indexclient/client.py", line 286, in create
    resp = self._post(
           ^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/indexclient/client.py", line 41, in timeout
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/indexclient/client.py", line 412, in _post
    handle_error(resp)
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/indexclient/client.py", line 35, in handle_error
    resp.raise_for_status()
  File "/home/users/leejor/.conda/envs/py3-11.third/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: did "0f371f8f-a26c-5d56-a166-ab832f720f50" already exists for url: https://aced-idp.org/index/index/
msg: '409 Client Error: did "0f371f8f-a26c-5d56-a166-ab832f720f50" already exists
  for url: https://aced-idp.org/index/index/'
exception: '409 Client Error: did "0f371f8f-a26c-5d56-a166-ab832f720f50" already exists
  for url: https://aced-idp.org/index/index/'

Install of gen3_util: libmagic issue

Installing as per README.md results in

ImportError: failed to find libmagic.  Check your installation

Suggestions online include pip install python-magic==0.4.15, or pip uninstall python-magic followed by pip install python-magic-bin==0.4.14, to resolve this issue. libmagic should be added to requirements.txt.

backlog/git-lite

Epic

As a release manager, I need a feature tracking system to identify and prioritize missing features with real value for the users.

Missing / incomplete features

LOE: level of effort

easy:

  • rm (remove a file) - we can do it via utilities file rm; however, this is not really "git like". LOE: easy
  • some commands depend on the server [init, utilities access sign, status]; we should immediately log "working" to the console (in case the server is slow). LOE: easy
  • log (show commit log) - we can do it via utilities file ls --is_metadata; however, this is not really "git like". LOE: easy
  • diff - no equivalent. LOE: moderate ✅

moderate:

  • pull "some" - currently pull retrieves all the files, need to add some ability to pull only some files based on some parameters ✅ TBD: e.g.

    • pull
    • pull -object_ids
    • pull -patient_ids
      • -specimen_ids
  • reset commit - no equivalent. LOE: moderate

    • remove 1 or more commit records from completed, save in pending
    • re-run push's publish step

hard:

  • orphaned metadata: analyse server database(s), flag orphaned records, and allow the user to remove them or automatically purge them. LOE: hard

unknown:

  • g3t status - when performed after push, each time the user executes g3t status, the command checks the k8s job if the job status is in [Unknown, Running]. Once the job status changes, the response is cached and not checked again. However, if the user does not execute g3t status for more than N minutes, the job logs expire and we have no way of knowing the status; in addition, the gen3 client library logs a bunch of errata to the screen.
    @lbeckman314: is there a workaround for this at the k8s level, i.e. long-lived logs? See here and the k8s doc.

test-plan/git-lite/submitter

Epic

As a release manager, I want a test script to ensure comprehensive and repeatable testing of the new feature(s).

Use case

As a testing engineer, I want a data submission, validation, and upload script to ensure accurate and secure processing of user-submitted data.

Definition of Done:

  • Test script is created and documented.
  • The script is reviewed and approved by the testing team.
  • (optional) The script is integrated into the testing process and automated frameworks.
  • The script is executed and all acceptance criteria are met.

Considerations

  • The system should generate clear error messages for users in case of invalid data submissions, guiding them on how to correct the issues.
  • It should perform thorough validation of data integrity to prevent corruption or loss during the upload process.
  • Security measures should be implemented to protect against potential data breaches or unauthorized access during the submission and upload process.
  • It must log relevant information, including successful uploads and any errors encountered, for auditing and debugging purposes.
  • The system should be version-controlled to track changes and updates over time.
  • The validation and upload process should be easily integrable into automated testing frameworks for continuous integration.

Script

submitter test script

# Use case: As a data submitter, I will need to create a project.
## test should work with or without environment variables
#export G3T_PROFILE=local
#export G3T_PROJECT_ID=ohsu-test002b
#g3t init
unset G3T_PROJECT_ID
unset G3T_PROFILE
g3t --profile local init ohsu-test001b

# Use case: As an institution data steward, I need to approve the project before it can be shared.
g3t utilities access sign

# Use case: As an ACED administrator, I need to create projects in sheepdog so that submissions can take place.
g3t utilities projects ls
## test: the project should be listed as incomplete
g3t utilities projects create
## test: the project should be listed as complete

# Use case: As a data submitter, I will need to add files to the project and associate them with a subject (patient).
g3t add tests/fixtures/dir_to_study/file-1.txt  --patient P1
g3t utilities meta create
## test meta generation:  META should have 4 files
g3t commit  -m "commit-1"
## test the commit: g3t status should return commit info - was the message added?
#  resource_counts:
#      DocumentReference: 1
#      Patient: 1
#      ResearchStudy: 1
#      ResearchSubject: 1

# Use case: when subjects are added to the study, I need to add them to the project.
g3t add tests/fixtures/dir_to_study/file-2.csv  --patient P2
g3t status
## test add: should return one entry in "uncommitted_manifest:"
g3t utilities meta create
## test meta generation: META should have 4 files; Patient, ResearchSubject, and DocumentReference should have 1 new record each
g3t commit -m "commit-2"
## test the commit: g3t status should return commit info - was the message added? there should be only the three new records
#    resource_counts:
#      DocumentReference: 1
#      Patient: 1
#      ResearchSubject: 1
#    manifest_files:
#    - tests/fixtures/dir_to_study/file-2.csv

# Use case: some subjects have specimens; I need to add them to the project.
g3t add tests/fixtures/dir_to_study/sub-dir/file-3.pdf --patient P3 --specimen S3
g3t utilities meta create
## test should create a Specimen.ndjson file in META
# Created 4 new records.
wc -l META/Specimen.ndjson
#       1 META/Specimen.ndjson
g3t diff
## test diff: should show new records
g3t commit -m "commit-3"
## test the commit: g3t status should return commit info - was the message added? 4 new records
#    message: commit-3
#    resource_counts:
#      DocumentReference: 1
#      Patient: 1
#      ResearchSubject: 1
#      Specimen: 1
#    manifest_files:
#    - tests/fixtures/dir_to_study/sub-dir/file-3.pdf

# Use case: I'm ready to share my data
## push to remote
g3t push
## test:  the system should respond with reasonable, informative messages without too much verbosity
## I need to know the status of my project. During job execution, I should be able to query the status.
g3t status
## test: After job execution, I should have detailed information about the results.
#  pushed_commits:
#  - published_timestamp: 2024-01-19T09:45:47.018426
#    published_job:
#      output:
#        uid: 82322961-8d2a-47e4-8833-af0e299aa393
#        name: fhir-import-export-ohiwi
#        status: Completed
#    commits:
#    - d050c8f931bab152279ff18e0a21434f commit-1
#    - 2f77cf6017ec3b0485b7493ebe459f53 commit-2
#    - a550281b43713937ce684e3cab13639f commit-3

## test: Once complete, the remote counts should reconcile with my activity
#remote:
#  resource_counts:
#    DocumentReference: 3
#    Patient: 3
#    ResearchStudy: 1
#    ResearchSubject: 3
#    Specimen: 1
wc -l META/*.ndjson
#       3 META/DocumentReference.ndjson
#       3 META/Patient.ndjson
#       1 META/ResearchStudy.ndjson
#       3 META/ResearchSubject.ndjson
#       1 META/Specimen.ndjson

## If I want more detailed information, I should be able to query it
## get UID from status -> local.pushed_commits.published_job.output.uid
g3t utilities jobs get UID
# ....


# Use case: As a data submitter, when I know more about meta, I should be able to add it.
# e.g. alter a patient record
sed -i.bak 's/"P1"}]}/"P1"}], "gender": "male"}/' META/Patient.ndjson
# see https://stackoverflow.com/a/22084103
rm META/Patient.ndjson.bak
g3t diff
## test diff: should show changed records
g3t commit -m "commit-4"
## test: the commit should process only one patient record
#resource_counts:
#  Patient: 1

## Use case: I should be able to publish a 'meta only' change
g3t push

## Use case: As a human being, I make mistakes; the system should prevent me from committing `no changes`
g3t commit -m "commit-5 has no changes"
## test: the system should reject the commit
# msg: No resources changed in META

## Use case: As a human being, I make mistakes; the system should prevent me from committing `invalid fhir`
sed -i.bak 's/"gender"/"foobar"/' META/Patient.ndjson
# see https://stackoverflow.com/a/22084103
rm META/Patient.ndjson.bak
g3t commit -m "commit-6 has invalid fhir"
## test: should fail validation, the response should be informative and give me enough information to fix the problem

Feature Request: add support for additional file hashes (e.g. etag)

Background

Multiple hashes are allowed for the importing of files into the indexd service, including etags:

import re

ACCEPTABLE_HASHES = {
    "md5": re.compile(r"^[0-9a-f]{32}$").match,
    "sha1": re.compile(r"^[0-9a-f]{40}$").match,
    "sha256": re.compile(r"^[0-9a-f]{64}$").match,
    "sha512": re.compile(r"^[0-9a-f]{128}$").match,
    "crc": re.compile(r"^[0-9a-f]{8}$").match,
    "etag": re.compile(r"^[0-9a-f]{32}(-\d+)?$").match,
}
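
As a quick illustration, both single-part and multipart etag forms pass the etag pattern above while the multipart form fails the stricter md5 pattern (a minimal sketch; the example values are made up):

import re

# Example values (made up): a 32-hex-char etag and a multipart-upload etag.
etag_single = "9bb58f26192e4ba00f01e2e7b136bbd8"
etag_multi = "9bb58f26192e4ba00f01e2e7b136bbd8-13"

etag = re.compile(r"^[0-9a-f]{32}(-\d+)?$")
md5 = re.compile(r"^[0-9a-f]{32}$")

assert etag.match(etag_single) and etag.match(etag_multi)
assert md5.match(etag_single) and not md5.match(etag_multi)  # multipart etags are not md5s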

Current Behavior

Currently the g3t command requires the md5 hash of the file to be provided in order for the file to be uploaded to the indexd service. In the case where this hash is not available (i.e. importing files from an existing S3 endpoint), it can take a long time to both download the file and calculate its md5 hash.

New Behavior

Adding support for additional hashes like etag would allow for greater efficiency when uploading files where the md5 hash is not immediately available or not yet calculated.

For remote files already registered in an S3 bucket, the etag hash can be fetched with the MinIO client as follows:

➜ mc stat -r example-s3/example-bucket --json
{
 "status": "success",
 "name": "example-bucket/example-file",
 "lastModified": "2024-01-01T00:59:20-08:00",
 "size": 123,
 "etag": "4pophfvzd8eo8pir7i2sgzn4nifz88jho-1234",   <--- example etag hash
 "type": "file",
 "metadata": {
  "Content-Type": "application/gzip"
 }
}
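
The same value can also be fetched programmatically; below is a minimal sketch assuming boto3, with placeholder endpoint, bucket, key, and credentials:

import boto3

# Sketch: look up the etag of an object already registered in an
# S3-compatible bucket (all names and credentials are placeholders).
s3 = boto3.client(
    "s3",
    endpoint_url="https://example-s3",      # MinIO or other S3-compatible endpoint
    aws_access_key_id="<ACCESS-KEY>",
    aws_secret_access_key="<SECRET-KEY>",
)
head = s3.head_object(Bucket="example-bucket", Key="example-file")
print(head["ETag"].strip('"'))              # boto3 wraps the ETag in quotes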

Steps for Implementing

Environment

  • Rancher Desktop version: 1.11.1
  • Helm version: v3.13.1
  • Gen3 Chart version: 0.1.25

add .gitignore to .g3t/ and META/ directories

Add a .gitignore and README.md to each directory.

See https://github.com/ACED-IDP/gen3_util/blob/development/gen3_util/config/__init__.py#L236-L239

META/.gitignore

*
!README.md

.g3t/state/.gitignore

*
!README.md

META/README.md and .g3t/README.md


# Data Directory

Welcome to the data directory! This repository contains important data files for our project. Before you proceed, please take note of the following guidelines to ensure the security and integrity of our data.

## Important Note: Do Not Check in Protected Files

Some files in this directory are considered protected and contain sensitive information. **DO NOT** check in or commit these protected files to the version control system (e.g., Git). This is crucial to prevent unauthorized access and to comply with security and privacy policies.
 

## Usage Guidelines:

1. **Read-Only Access:** Unless you have explicit permission to modify or update the data, treat this directory as read-only.

2. **Data Integrity:** Ensure the integrity of the data by following proper procedures for reading, updating, and managing files.

3. **Security Awareness:** Be aware of the sensitivity of the data stored here and take necessary precautions to protect it from unauthorized access.

## How to Obtain Access:

If you need access to these files, please contact the project administrator for access to idp.cbds.ohsu.edu.

Thank you for your cooperation in maintaining the security and confidentiality of our data.

Simplify CLI user interface

from https://ohsucomputationalbio.slack.com/archives/D0AQV57D2/p1703194001745479

Metadata operations

Copy project meta to local storage

gen3_util meta pull --profile=aced --project_id=aced-test <PATH-TO-FHIR>

FHIR to TSV

gen3_util meta to_tabular <PATH-TO-FHIR> <PATH-TO-TABULAR>

FHIR to Excel

gen3_util meta to_tabular --excel <PATH-TO-FHIR> <PATH-TO-TABULAR>

TSV to FHIR

gen3_util meta from_tabular <PATH-TO-TABULAR> <PATH-TO-FHIR>

Validate local files

gen3_util meta validate <PATH-TO-FHIR>

Push local FHIR to Gen3 instance

gen3_util meta push --profile=aced --project_id=aced-test <PATH-TO-FHIR>

File Operations

List files in a project

gen3_util files ls-remote

Remove files from a project index and bucket

gen3_util files rm-remote <REMOTE-ID>

or

gen3_util files rm --remote <REMOTE-ID>

Add file meta information to current index

gen3_util files add <LOCAL-PATH>

Read working index

gen3_util files status

Upload working index and files

gen3_util files push

Remove local file(s) from working index

gen3_util files rm <LOCAL-PATH>

test-plan/git-lite/consumer

Epic

As a release manager, I want a test script to ensure comprehensive and repeatable testing of the new feature(s).

Use case

As a testing engineer, I want a data download and replication script to facilitate efficient testing of data synchronization and replication processes.

Definition of Done:

  • Test script is created and documented.
  • The script is reviewed and approved by the testing team.
  • (optional) The script is integrated into the testing process and automated frameworks.
  • The script is executed and all acceptance criteria are met.

Considerations

  • The system should generate clear error messages for users in case of invalid data submissions, guiding them on how to correct the issues.
  • It should perform thorough validation of data integrity to prevent corruption or loss during the upload process.
  • Security measures should be implemented to protect against potential data breaches or unauthorized access during the submission and upload process.
  • It must log relevant information, including successful uploads and any errors encountered, for auditing and debugging purposes.
  • The system should be version-controlled to track changes and updates over time.
  • The validation and upload process should be easily integrable into automated testing frameworks for continuous integration.

consumer test script

# Use case: As a data consumer, I will need to download a project.

## test should work with or without environment variables

#export G3T_PROFILE=local
#export G3T_PROJECT_ID=ohsu-test002b
#g3t clone

unset G3T_PROJECT_ID
unset G3T_PROFILE
g3t --profile local clone --project_id ohsu-test001b

## test: the project should exist
cd ohsu-test001b
## test: the meta data should be in place with the latest changes
grep male META/Patient.ndjson |  jq '[.id, .gender]'
#"20d7d7eb-46f9-5175-b474-cb504f66e10e"
## test by default, the files should not be downloaded
ls tests
# ls: tests: No such file or directory

## Use case: I should be able to download files
g3t pull
## test directory should now contain
tree tests
#tests
#└── fixtures
#    └── dir_to_study
#        ├── file-1.txt
#        ├── file-2.csv
#        └── sub-dir
#            └── file-3.pdf
#

feature/improve-validation-missing-references

This improvement would flag structures like the following as invalid:

{
  "resourceType": "Specimen",
  "id": "XXXXXXXXX-8058-57d1-aaf2-c3fcc564125f",
  "identifier": [
    {
      "system": "http://XXXX.YYY/ZZZ/specimen",
      "value": "ABC"
    }
  ]
}

See

resources[parse_result.resource.resource_type] += 1

Pseudo-code:


REFERENCE_REQUIRED_EXEMPTIONS = ['Patient', 'ResearchStudy', 'Substance']  # this is not an exhaustive list
if parse_result.resource.resource_type not in REFERENCE_REQUIRED_EXEMPTIONS and len(nested_references) == 0:
    parse_result.exception = Exception(
        f"Resource has no references {parse_result.resource.resource_type}/{parse_result.resource.id}"
    )
    exceptions.append(parse_result)
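
For illustration, here is a self-contained sketch of the intended check applied to the Specimen above; it walks the resource dict for 'reference' keys (names and structure are illustrative, not the validator's actual internals):

import json

REFERENCE_REQUIRED_EXEMPTIONS = {'Patient', 'ResearchStudy', 'Substance'}

def find_references(node):
    """Recursively collect FHIR reference values ('reference' keys)."""
    refs = []
    if isinstance(node, dict):
        for key, value in node.items():
            refs.extend([value] if key == 'reference' else find_references(value))
    elif isinstance(node, list):
        for item in node:
            refs.extend(find_references(item))
    return refs

resource = json.loads('''{
  "resourceType": "Specimen",
  "id": "XXXXXXXXX-8058-57d1-aaf2-c3fcc564125f",
  "identifier": [{"system": "http://XXXX.YYY/ZZZ/specimen", "value": "ABC"}]
}''')

if resource['resourceType'] not in REFERENCE_REQUIRED_EXEMPTIONS and not find_references(resource):
    print(f"invalid: {resource['resourceType']}/{resource['id']} has no references")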

bug: project name should be unique within program, currently unique across all programs

symptoms

log into etl pod on development

sheepdog_development=> select _props->>'code' as code  from node_project where node_id in (select src_id from edge_projectmemberofprogram where dst_id = (select node_id from node_program where _props->>'name' =  'ohsu')) ;
   code    
-----------
 demo
 myproject
 dev
 aws

however,

sheepdog_development=> select _props->>'code' as code  from node_project where node_id in (select src_id from edge_projectmemberofprogram where dst_id = (select node_id from node_program where _props->>'name' =  'ohsu_two')) ;
 code 
------
(0 rows)

Similar results for program cbds:

select _props->>'code' as code  from node_project where node_id in (select src_id from edge_projectmemberofprogram where dst_id = (select node_id from node_program where _props->>'name' =  'cbds')) ;
 code 
------
(0 rows)

Drilling a little more into the sheepdog db…

select node_id, _props->>'code' as code  from node_project ;
               node_id                |        code         
--------------------------------------+---------------------
 07ed9016-2ae5-5078-a244-c6c79ece33fc | eimages
 e3998e35-baf2-5720-9227-6b3007160020 | demo
 217daa38-587d-599f-aec0-8a02436009e9 | myproject
 a77f549b-c74b-563e-80bb-570b5a4dde88 | test
 e42f7315-90c0-5b97-bfd0-8c1d11fa75a8 | sower_test2
 aa957b86-3aca-515c-ad75-06d0ec7f684f | test_sower3
 dea0c744-e454-5822-82f1-06735ca37018 | dev
 a1b72d7a-cd62-589b-99be-c52ba48f5049 | test_sower_brian2
 cd300876-5179-5d97-8440-8b76383866a0 | test_sower_liam4
 e655e907-e59a-5fa9-bdd2-cd36f0d3cad6 | test_sower_matthew
 c1b8c16a-b803-5ca0-ae14-30649de7a7c7 | test_sower_liam7
 ae7be7ea-8938-57a4-910e-48ecbbe62e6d | test_sower_matthew7
 84fdfb93-87d3-551d-9a1f-a67dda89cafc | OMOPTEST
 9c3e7832-43f3-57d9-8e6d-770fc6f4bf3d | aws
(14 rows)

But when we look at requestor (arborist), we see there are several:

# in arborist_users 
 grep "/dev$" DATA/aced-commons-development/user.yaml | uniq
    - /programs/aced/projects/dev
    - /programs/ohsu/projects/dev
    - /programs/cbds/projects/dev
    - /programs/ohsu_two/projects/dev
    - /programs/wasabi/projects/dev
    - /programs/aws/projects/dev

Looking into the submission code, I think the strategy here will always match only on project.code, not considering program:
https://github.com/ACED-IDP/submission/blob/a663e15e1c8eb94d00bc63ebaee173c05ef38006/aced_submission/meta_graph_load.py#L332

As a result, the test case "Same project.code used in multiple programs" will only create the first program specified.
This is completely untested, but something like this should work:

select node_id, _props->>'code' as code  from node_project where node_id in (select src_id from edge_projectmemberofprogram where dst_id = (select node_id from node_program where _props->>'name' =  '{program}')) and _props->>'code' = '{project}' ;

git brainstorming

From 2023-01-03 discussion


Use cases:

init

As an ACED user, when I want to start a new project, I need a simple way to create project structure.

[x] done

gen3_util init

Usage: gen3_util init [OPTIONS]

  Create project, both locally and on remote.

Options:
  --project_id TEXT  Gen3 program-project
  --help             Show this message and exit.

  • create common directory structure [DATA/, META/, .g3t/]
  • localize config file and state directory in hidden dir
  • issue requestor commands to create project in remote, (ready for signing)

clone

As an ACED user, when I want to work with a project locally, I need a simple way to retrieve meta data and store configuration in well known locations.

[x] done

gen3_util clone

Usage: gen3_util clone [OPTIONS]

  Clone meta and files from remote.

Options:
  --project_id TEXT             Gen3 program-project
  --data_type [meta|files|all]  Clone meta and/or files from remote.
                                [default: all]

  • creates directories, the same as init
  • downloads meta, files

commit

As an ACED user, when I think my contributions are complete, I need a single command to record a comment and run validation and sanity checks.

gen3_util commit

status

As an ACED user, when I'm making changes, I need a quick way to summarize my changes versus the remote project.

gen3_util status

log

As an ACED user, when I need to see the history of a project, I need a quick way to summarize the contributions.

gen3_util log

push

As an ACED user, when I'm ready to publish my contributions, I need a single command to upload files and meta data.

gen3_util push

As an ACED user, I may be using gen3_util on a system with the data already on the file system - or I may be working on a system that will retrieve the data. It would be useful if the project structure could incorporate symlinks.

Possible implementation points: file add, clone, init, pull

feature/manifest support "no bucket", i.e. upload: no-op, download: scp or symlink

User Story

As a DevOps architect, I want to implement a file manifest collection system to efficiently organize and track files, where each file is represented by a URL using a symlink or the SCP protocol. This will enable us to streamline the process of managing metadata and indexing files where we cannot move the data to a managed bucket.

Acceptance Criteria

  • As a user, I should be able to initiate the collection of a manifest of files indicating that the files should NOT be uploaded.

    • Add a parameter to the g3t add command --no-upload
    • The path added should still be a relative path from the project directory
    • The add command should populate the DocumentReference's 'source_path' extension with an scp://<hostname>/<full-path> style URL
  • On push, the system should:

    • upload all file meta data to indexd
    • continue to upload files added without the --no-upload parameter to the project bucket via gen3-client
    • skip uploading any files added with the --no-upload parameter
  • On pull or clone, the system should:

    • continue to download all files from the project bucket as it does now
    • when the current hostname == the URL's scp://<hostname>, the system should create a symlink in the project directory. The symlink creation process should reflect the file's original structure, maintaining the directory hierarchy specified in the manifest.
    • when the current hostname != the URL's scp://<hostname>, the system should initiate a multi-threaded scp download, ideally outsourcing this job to a utility package on the user's machine (see the sketch after this list). The system should provide a mechanism for the user to authenticate and authorize access to the files using SCP credentials.
  • The system should handle errors gracefully and provide informative messages for troubleshooting, especially in cases of connection issues or authentication failures.
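
A minimal sketch of the pull-time decision under these criteria, assuming a hypothetical manifest entry carrying an scp:// source_path URL (the helper and the path mapping are illustrative, not the implementation):

import os
import socket
import subprocess
from urllib.parse import urlparse

def materialize(source_path: str, project_dir: str) -> None:
    """Symlink or scp-download one --no-upload manifest entry (illustrative)."""
    url = urlparse(source_path)  # e.g. scp://host.example.org/data/file-1.txt
    # Mapping the remote path into the project directory is illustrative.
    destination = os.path.join(project_dir, url.path.lstrip("/"))
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    if url.hostname == socket.gethostname():
        # Same host: the data is already on this file system; link it,
        # preserving the directory hierarchy from the manifest.
        os.symlink(url.path, destination)
    else:
        # Different host: fall back to scp (the user's ssh credentials
        # handle authentication; a real implementation would parallelize).
        subprocess.run(["scp", f"{url.hostname}:{url.path}", destination], check=True)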

dataframer: include condition in Observation dataframe

Looking to add cancer type / biopsy location to the FHIR labkey metadata. Looking through the labkey data, it seems this information can be obtained by joining the enrollment table to the sequencing table via patient id. How would you approach this?

Assumptions:

  • a Condition MAY exist - creating the condition is the responsibility of the data submitter; the condition may be created in any fashion: by hand, by a g3t_etl transformer, or another mechanism.

Approach:

  • The subject of the observation is called here if the subject is a Patient
  • Note that the condition is not mapped directly from the observation. There can be several paths from Observation to Condition via Observation.focus:

Effectively the "join" is observation.subject == condition.subject :

  • Items from the patient should be mapped to the observation, similar to how fields from a linked Procedure are mapped
  • This may be overly simplistic, as the patient can have multiple conditions; however, it will get the basic observation dataframe populated (see the sketch below)
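
A pandas sketch of that join, assuming hypothetical observation and condition dataframes that each carry a subject column (all column names and values are illustrative):

import pandas as pd

# Illustrative frames: one row per Observation / Condition, each keyed by
# the shared subject reference (e.g. "Patient/P1").
observations = pd.DataFrame({
    "observation_id": ["O1", "O2"],
    "subject": ["Patient/P1", "Patient/P2"],
    "value": [1.2, 3.4],
})
conditions = pd.DataFrame({
    "condition_id": ["C1"],
    "subject": ["Patient/P1"],
    "code_display": ["example cancer type"],
})

# Effectively observation.subject == condition.subject; a left join keeps
# observations whose patient has no Condition, and duplicates rows when a
# patient has several Conditions (the noted simplification).
df = observations.merge(conditions, on="subject", how="left")
print(df)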
