ga4gh / fasp-scripts
License: Apache License 2.0
The number of scripts that now exist means that simply numbering the scripts is no longer informative.
A different approach to naming can help, but it is expecting too much of a naming convention to provide adequate information about what a script does. Simple metadata about the scripts is helpful too. The table on the readme page addresses that by providing information about which clients are used in a given script. Those clients are, at the moment, specific to implementations*, so one can also tell which implementations are used.
The logging performed by FASPRunner captures the necessary metadata automatically. The data in the log provided an easy way to create the table for the readme.
Beyond the metadata currently being captured, it would be useful to know which data sources are queried, which workflows are run, etc.
* This specificity is open to the criticism that the clients should be more general purpose. In some cases the specific client does no more than wrap the host URL of the service; that makes things slightly more convenient for script writers, which seems worthwhile. In other cases the specificity reflects the different authentication and authorization behaviors of different implementations. Over time we might expect those differences to disappear, at which point common clients will work.
The SRAExample notebook might be a good test for the 'DRS plus Passport' design, in that it goes all the way through to using a second DRS server to get results files. The question is: who would the passport broker be for authz for the second DRS server?
The authz one is looking for is likely some construct within the WES server – that construct would be 'project' in the SB case – though ultimately the authorization is to access storage owned by that project. Currently, obtaining that authorization is not a problem: the WES instance knows about the relationship internally, so authenticating to the DRS server with the same credentials as for WES gives you access to the files.
Answering "who would the passport broker be for authz for the second DRS server?": it seems to me the 80% answer would be the passport broker for the WES service. But are passport brokers for WES servers in scope?
Might be worth running that through the 'DRS plus Passport' design.
The other 20% (a guess) would be cases where WES writes the results out to some other DRS server. I haven't seen or tried that yet, and I don't think it's the first urgency. Again, it seems to me your design would handle it. Not ruling out that it should be a test case.
The Registry currently exists only as JSON returned from https://registry.ga4gh.org/v1/services. That is not very consumable by humans.
GA4GHRegistryClient already provides a python interface to at least some Registry capabilities. It should be fairly straightforward to create an iPython notebook that generates consumable output.
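As a starting point, here is a standard-library sketch of turning that JSON into something human-readable; the field names follow the GA4GH Service Registry schema, and GA4GHRegistryClient's own interface is not assumed here:

```python
import json
from urllib.request import urlopen

REGISTRY_URL = "https://registry.ga4gh.org/v1/services"

def format_service(svc):
    """One human-readable line per registry entry: name, artifact, url."""
    artifact = svc.get("type", {}).get("artifact", "?")
    return f"{svc.get('name', '?'):40} {artifact:12} {svc.get('url', '')}"

def list_services():
    """Fetch the registry and return one formatted line per service."""
    with urlopen(REGISTRY_URL) as resp:
        return [format_service(svc) for svc in json.load(resp)]
```

In a notebook, `print("\n".join(list_services()))` in a cell would give a quick table; a fancier version could emit a pandas DataFrame.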
@jb-adams feel free to suggest/contribute as you see fit.
The main change is expected to be to the DiscoverySearchClient which will be renamed DataConnectClient.
The fasp.search sub-package will retain its current name. Its siblings are named fasp.loc and fasp.workflow and roughly correspond to DRS and WES/TES functions respectively. However, the packages include equivalent functionality from other sources and are named to reflect the functionality covered rather than the specific GA4GH specification used.
Jonathan worked up the Search examples in the attachment from queries Brian Walsh had done at
https://colab.research.google.com/drive/1HhEEB3MJ8LbMP2ta946s8OARPc5RflHu?usp=sharing#scrollTo=nM-GHd3IWeqF
Jonathan wrote
A Search implementation with an appropriate connection to FHIR could take care of everything before the "Clinical / Care Management Example Query" step, leaving just the interesting query parts to the data consumer. Those queries were written for SQLite. Here's the same query adapted to PrestoSQL (now Trino), with the table name and filters adjusted so it returns some rows from the kidsfirst FHIR data at http://ga4gh-search-adapter-presto-public.staging.dnastack.com/search:
These would be useful to have in a Jupyter notebook. The queries Jonathan created are attached.
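For reference, a sketch of submitting such a query over the Search API; the endpoint is the one quoted above, and the placeholder SQL stands in for the adapted query in the attachment:

```python
import json
from urllib.request import Request, urlopen

# Endpoint from the issue text; a staging server, so it may move.
SEARCH_URL = "http://ga4gh-search-adapter-presto-public.staging.dnastack.com/search"

def build_search_request(sql):
    """Build the POST request the GA4GH Search API expects: a JSON body {"query": <sql>}."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return Request(SEARCH_URL, data=body, headers={"Content-Type": "application/json"})

def run_search(sql):
    """Submit the query and return the rows from the response's "data" field."""
    with urlopen(build_search_request(sql)) as resp:
        return json.load(resp).get("data", [])
```

A notebook cell could then call `run_search("SELECT ... FROM ...")` (placeholder SQL) and render the rows.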
After merging in https://github.com/DNAstack/plenary-resources-2020, the plenary-resources folder was moved to the top level.
That seems more appropriate at the top level, but I suggest we need a more considered rethink.
Some considerations
The initial aim was to get working examples up and running and evolve a more thorough architecture as we go.
At this point clients for search, object location/retrieval and workflow execution were included under the fasp package. Long term, one might expect these clients to be in different packages/repos. Recognizing this, separate packages were created for each of these functions.
Also
@ianfore From our Connect call, I'm thinking it might be helpful to create a "getting started guide" as an entrypoint to being able to run these scripts. As we want more of the community to use and contribute to these scripts, we'll want to provide an easy path for them to working with this repo.
This guide could take the form of a one-pager within the repo that explains how to go about getting registered for the various services/platforms, how to configure keys locally, test scripts with expected output, etc. It would take a user/researcher from having no pre-existing identity with any of the FASP platforms to being able to run most, or all, of the scripts. Since I fall into this category (I only have an ID with CGC and Cavatica), I'm happy to take notes on the process and collate them into a getting started guide.
Does this sound useful?
Remove automatic login during metaresolver initialization
Authorize only when a specific DRS Server is used
Started with SBCGC_WES_Example.ipynb.
Posting here as I know others (@mbarkley) are working on using the SB WES service. I'm modifying the WES client used to submit the GWAS script from the plenary demo that we just merged in.
Currently getting a 415 error. @mattions, do you have any insights on this?
RAS login is now the default login for at least some subset of Gen3 and Seven Bridges nodes running WES and DRS.
Check that tokens available from RAS logins grant proper access and authentication for the expected datasets and resources.
Links for access to relevant portals are available here.
See direct FHIR Search examples using the NCPI FHIR Server which were explored in a Nov 2020 codeathon here.
Some Resources on that server can return DRS ids which could be resolved to URLs by the Kids First DRS Server.
A long-standing goal of the FASP GA4GH group has been to have a federated workflow demo that anyone can run. In this context, a federated workflow means:
Relevant to #14: more detailed search of GTEx subject and specimen (phenotypic) data is possible via data which may be exported from Anvil.
Obtaining GTEx subject and specimen data in this way is described here.
A user might load these data into private BigQuery tables. A pull request supporting this has been submitted to pypfb.
Such tables could be incorporated into a FASP workflow using BigQuerySearchClient.
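The BigQuery step could look roughly like this; the project, dataset, table, and column names are hypothetical, and this uses the google-cloud-bigquery library directly rather than the repo's BigQuerySearchClient (whose interface isn't shown here):

```python
# Hypothetical private table holding the exported GTEx specimen data.
SQL = """
SELECT submitted_subject_id, body_site
FROM `my-project.gtex_private.specimen`
WHERE body_site = @site
"""

def run_query(site):
    # Imported lazily so the SQL above can be inspected without the dependency.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("site", "STRING", site)]
    )
    return [dict(row) for row in client.query(SQL, job_config=job_config).result()]
```

The result rows could then feed the subject/specimen selection step of a FASP workflow.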
Alternatively a private GA4GH Server could be set up to access these data.
See the GTEX_TCGA_Federated_Analysis notebook for an iPython workflow.
The overall flow of the script/notebook is probably more illustrative than directly usable in NextFlow. A good question is how to do the equivalent work from CloudOS and NextFlow where that is the platform of preference, e.g. as it is for the JAX team. If you are interested in exploring that, we could work together to get started.
I have no experience of writing NextFlow; I'll leave that to you. Nevertheless, the nf script that Sangram had shared previously gave me a sense of how you structure things.
https://github.com/lifebit-ai/sra-dbgap-datafetch/blob/main/main.nf
One of the questions is which capabilities it makes sense to do from within NextFlow, and which should live in some library that NextFlow calls. Either way, I think we could do some useful experimentation. Some of the fasp package may be useful; if not, I could modify it to help.
See #15 for a possible additional step.
In order to illustrate the effective use of Search schema, please update the schema on the public DNAStack server for search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru
to one derived from the dbGaP XML data dictionary.
See fasp/data/dbgap for the data_dict.xml to use.
The notebook that uses that table is https://github.com/ga4gh/fasp-scripts/blob/master/notebooks/search/mapping1-manual.ipynb
This table is accessed by FASPNotebook02, but it contains controlled access data. The notebook is useful for "getting started" purposes. A scrambled (i.e. synthetic) copy of the data would not need to be controlled access and would therefore serve the purpose.
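One simple way to produce such a scrambled copy is to shuffle each column independently, which preserves the marginal distributions while breaking every real record. This is only an illustration; a real de-identification step would need a disclosure-risk review before release:

```python
import random

def scramble(rows):
    """rows: list of dicts with identical keys; returns a column-shuffled copy."""
    if not rows:
        return []
    scrambled = [dict(r) for r in rows]  # leave the originals untouched
    for key in rows[0]:
        values = [r[key] for r in rows]
        random.shuffle(values)  # permute each column independently
        for row, value in zip(scrambled, values):
            row[key] = value
    return scrambled
```

Applied to an export of the table above, this would yield rows that look plausible individually but correspond to no real subject.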
Possibly all of them?
There are some variations in how the dbGaP data_dict.xml files define data sources.
Work to address this includes validating every dbGaP data_dict for conformance to an XML schema. After five iterations of that schema, 16,000 of approximately 20,000 data dictionaries validate. Simple changes to the XML schema yield significant increases in the number of data dictionaries that pass, which suggests this approach gives significant leverage on the problem.
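As a lightweight illustration of what such a conformance check looks for, here is a standard-library sketch; the element names (data_table, variable, name) follow the common dbGaP layout but are assumptions here, and the real validation described above uses a full XML Schema rather than hand-written checks:

```python
import xml.etree.ElementTree as ET

def check_data_dict(xml_text):
    """Return a list of conformance problems found in a data_dict document."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "data_table":
        problems.append(f"unexpected root element: {root.tag}")
    for i, var in enumerate(root.findall("variable")):
        if var.find("name") is None:
            problems.append(f"variable #{i} missing <name>")
    return problems

# A toy document with one conforming and one non-conforming variable.
sample = """<data_table id="pht001">
  <variable id="phv001"><name>SUBJECT_ID</name></variable>
  <variable id="phv002"><description>missing name</description></variable>
</data_table>"""
```

Running `check_data_dict(sample)` flags only the second variable.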
To do: put that code in GitHub.
See checkResolution() in https://github.com/ga4gh/fasp-scripts/blob/master/fasp/loc/drs_metaresolver.py
That demonstrates CURIE resolution. The same needs to be done for DRS URIs.
While the hackathon title suggests the objective is to produce a (Jupyter) notebook, the real point is to understand the factors driving different approaches to different styles of DRS ids, including the use of prefixing.
The use of the two different styles of DRS ids is also something to explore. I'll revive a previous discussion of this and link or post it below.
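The two id styles can be sketched as follows; the CURIE prefix and its host mapping are hypothetical, while the /ga4gh/drs/v1/objects/{id} path is the standard DRS endpoint:

```python
from urllib.parse import urlparse

# Hypothetical prefix registry; a real metaresolver would consult identifiers.org
# or a maintained prefix table.
PREFIX_MAP = {"examplens": "drs.example.org"}

def to_object_url(drs_id):
    """Resolve either DRS id style to the HTTPS object URL."""
    if drs_id.startswith("drs://"):
        # Hostname-based style: drs://<host>/<object_id>
        parsed = urlparse(drs_id)
        host, object_id = parsed.netloc, parsed.path.lstrip("/")
    else:
        # Compact (CURIE) style: <prefix>:<accession>
        prefix, object_id = drs_id.split(":", 1)
        host = PREFIX_MAP[prefix]
    return f"https://{host}/ga4gh/drs/v1/objects/{object_id}"
```

The interesting design questions are all in how PREFIX_MAP gets populated and governed, which is what the prefixing discussion is about.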
I was informed that DNAStack are replacing the WES server we used for the 2020 plenary. That server was used by DNAStackWESclient in this repo, so the code will need to be updated. One revision will be to the approach used to obtain tokens. (Passport? @mbarkley :-)
Creating a ticket now in anticipation of the revised WES server.
Scripts to be converted or retired.
A pipeline job composed in Python and submitted via pipelines().run fails.
See runstats() method in gcpls_samtools.py
The parameters for the run match those for a job submitted via the command line. The details appear to be the same in the task list interface.
For now the workaround is to submit the job via subprocess.run().
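A minimal sketch of that workaround; the actual command and flags used in gcpls_samtools.py are not reproduced here, so a harmless stand-in command is shown:

```python
import subprocess

def run_via_cli(cmd):
    """Submit a pipeline job through its CLI instead of the Python API.

    cmd is the full argument list, e.g. the gcloud invocation that is known
    to work from the command line. Raises CalledProcessError on failure.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

# Stand-in for the real CLI submission:
output = run_via_cli(["echo", "pipeline submitted"])
```

Since the same parameters succeed from the command line, routing through the CLI sidesteps whatever the Python submission path is doing differently.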