fasp-scripts's Issues

Review FaspScript naming and documentation

The number of scripts that now exist means that simple numbering of the scripts is no longer informative.

A different approach to naming can help, but it's expecting too much of a naming convention to provide adequate information about what a script does. Simple metadata about the scripts is helpful too. The table on the readme page addresses that by providing information about which clients are used in a given script. Those clients are, at the moment, specific to implementations* so one can also tell which implementations are used.

The logging performed by FASPRunner captures the necessary metadata automatically. The data in the log provided an easy way to create the table for the readme.
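As a rough illustration of how a readme table could be generated from logged metadata, here is a minimal sketch. The record layout and the field names (`script`, `searchClient`, `drsClient`, `workClient`) are assumptions made for this example, as are the client names; they are not taken from the actual FASPRunner log format.

```python
def log_to_readme_table(records, columns):
    """Render log records as a pipe-delimited table, one row per script.

    `records` is a list of dicts. The keys used here are hypothetical,
    not the real FASPRunner log fields.
    """
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    rows = []
    for record in records:
        # A missing field becomes an empty cell rather than an error.
        cells = [str(record.get(col, "")) for col in columns]
        rows.append("| " + " | ".join(cells) + " |")
    return "\n".join([header, divider] + rows)


# Hypothetical log records; client names are placeholders.
records = [
    {"script": "FASPScript1", "searchClient": "ExampleSearchClient",
     "drsClient": "ExampleDRSClient", "workClient": "ExampleWESClient"},
    {"script": "FASPScript2", "searchClient": "ExampleSearchClient",
     "drsClient": "ExampleDRSClient"},
]
columns = ["script", "searchClient", "drsClient", "workClient"]
table = log_to_readme_table(records, columns)
print(table)
```

Additional metadata, such as the data sources queried or the workflows run, would simply become extra columns if the log recorded them.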

Beyond the metadata currently being captured it would be useful to know which data sources are queried, which workflows are being run etc.

* This specificity is open to the criticism that the clients should be more general purpose. In some cases the specific client is doing no more than wrapping the host url of the service. That does make things slightly more convenient for script writers. That does seem worthwhile. In other cases the specificity has to do with the different authentication and authorization behaviors of different implementations. Over time we might expect those differences to disappear and common clients will work.

Evaluate 'DRS plus Passport' design for use in results retrieval

The SRAExample notebook might be a good test for the 'DRS plus Passport' design in that it goes all the way through to using a second DRS server to get results files. Question is, who the passport broker would be for authz for the second DRS server?

The authz one is looking for is likely some construct within the WES server – that construct would be ‘project’ in the SB case – though ultimately the authorization is to access storage owned by that project. Currently getting that authorization is not a problem. The WES instance knows about the relationship internally – so authentication to the DRS server, using the same credentials as for the WES, gives you access to the files.

Answering “who the passport broker would be for authz for the second DRS server?”: it seems to me the 80% answer would be the passport broker for the WES service. But are Passport brokers for WES servers in scope?

Might be worth running that through the 'DRS plus Passport' design.

The other 20% (a guess) would be where WES is writing the results out to some other DRS server. Haven't seen/tried that yet. Not thinking it's the first urgency. Again seems to me your design would handle it. Not ruling out it should be a test case?

Rename classes to reflect Discovery Search name change to Data Connect

The main change is expected to be to the DiscoverySearchClient which will be renamed DataConnectClient.

The fasp.search sub-package will retain its current name. Its siblings are named fasp.loc and fasp.workflow and roughly correspond to DRS and WES/TES functions respectively. However, the packages include equivalent functionality from other sources and are named to reflect the functionality covered rather than the specific GA4GH specification used.

Add Bulk-FHIR Connectathon examples to Jupyter Notebook

Jonathan worked up the Search examples in the attachment from queries Brian Walsh had done as
https://colab.research.google.com/drive/1HhEEB3MJ8LbMP2ta946s8OARPc5RflHu?usp=sharing#scrollTo=nM-GHd3IWeqF

Jonathan wrote

A Search implementation with an appropriate connection to FHIR could take care of everything before "Clinical / Care Management Example Query" step, leaving just the interesting query parts up to the data consumer. Those queries were written for sqlite. Here's the same query adapted to PrestoSQL (now Trino), with table name and filters adjusted so it returns some rows from the kidsfirst FHIR data at http://ga4gh-search-adapter-presto-public.staging.dnastack.com/search:

These would be useful to have in a Jupyter notebook. I have attached the queries Jonathan created.

fhir_query_examples.txt

Review repository/package structure before Jan 2021 hackathon

After merging in https://github.com/DNAstack/plenary-resources-2020, I moved the plenary-resources folder to the top level.

Seems more appropriate at the top level but I suggest we need a more considered rethink.

Some considerations
The initial aim was to get working examples up and running and evolve a more thorough architecture as we go.
At this point clients for search, object location/retrieval and workflow execution were included under the fasp package. Long term, one might expect these clients to be in different packages/repos. Recognizing this, separate packages were created for each of these functions

  • search
  • loc
  • workflow

Also

  • runner
    Generalize the fasp pattern to allow scripts simply to pick the required clients
  • scripts
    contains scripts which call the appropriate selection of clients needed for specific data and their locations

Getting started guide

@ianfore From our Connect call, I'm thinking it might be helpful to create a "getting started guide" as an entrypoint to being able to run these scripts. As we want more of the community to use and contribute to these scripts, we'll want to provide an easy path for them to working with this repo.

This guide could take the form of a one-pager within the repo that explains how to go about getting registered for the various services/platforms, how to configure keys locally, test scripts with expected output, etc. It would take a user/researcher with no pre-existing identity with any of the FASP platforms to being able to run most, or all scripts. Since I fall into this category (I only have an ID with CGC and Cavatica), I'm happy to take notes on the process and collate into a getting started guide.

Does this sound useful?

@briandoconnor @mbarkley

Check RAS logins for relevant DRS and WES Servers

RAS login is now the default login for at least some subset of Gen3 and Seven Bridges nodes running WES and DRS.

Check that tokens available for RAS logins gain proper access and authentication for the expected datasets and resources.

Links for access to relevant portals are available here
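One possible first step in checking a token is to look at its claims before trying it against a server. The sketch below assumes the token is a standard three-part JWT and only decodes the payload; it does not verify the signature and says nothing about whether a given DRS or WES server will actually grant access. The function names are illustrative, not part of this repo.

```python
import base64
import json
import time


def decode_jwt_claims(token):
    """Return the (unverified) claims from a three-part JWT string."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with header.payload.signature")
    payload = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def is_expired(claims, now=None):
    """True if the standard `exp` claim is present and in the past."""
    now = time.time() if now is None else now
    return "exp" in claims and claims["exp"] <= now
```

A token that decodes cleanly and has not expired still has to be exercised against each DRS and WES endpoint to confirm it reaches the expected datasets.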

Add script that runs parts of a single workflow with different files on different WES implementations

Motivation

A long-standing goal of the FASP GA4GH group has been to have a federated workflow demo that anyone can run. In this context, a federated workflow means:

  1. Multiple compute environments are used for a single computational analysis
  2. Different data is analysed in each compute environment
  3. A single script drives the analysis using GA4GH standards, where possible

Goals

  1. Automate as much of this task as possible in scripts in this repo
  2. Use WES API to control and monitor workflows
  3. Use public or synthetic data to make this more accessible for other people to try

Todo

  • Identify some data (preferably with mirrors in different regions)
  • Fix script using DNAstack WES
  • Write script using ELIXIR WES
  • Write script using Seven Bridges WES
  • Write single script calling DNAstack, ELIXIR, and SB scripts on different subsets of files
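The shape of the final driver script might look something like the sketch below. It assumes each WES implementation is wrapped in a client exposing two methods, here called `run_workflow` and `get_state`; those names are hypothetical and do not correspond to the existing client classes in this repo. The terminal states are taken from the WES state enumeration.

```python
import time
from typing import Dict, List, Protocol

# States after which a WES run will not change any further.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


class WESClient(Protocol):
    """Hypothetical minimal interface for a per-implementation WES wrapper."""

    def run_workflow(self, file_ids: List[str]) -> str: ...
    def get_state(self, run_id: str) -> str: ...


def run_federated(clients: Dict[str, WESClient],
                  assignment: Dict[str, List[str]],
                  poll_seconds: float = 30.0,
                  max_polls: int = 1000) -> Dict[str, str]:
    """Submit one run per environment, then poll until all have finished.

    `assignment` maps an environment name to the files analysed there.
    Returns the last observed state for each environment.
    """
    runs = {name: clients[name].run_workflow(files)
            for name, files in assignment.items() if files}
    states = {name: "UNKNOWN" for name in runs}
    for _ in range(max_polls):
        for name, run_id in runs.items():
            if states[name] not in TERMINAL_STATES:
                states[name] = clients[name].get_state(run_id)
        if all(state in TERMINAL_STATES for state in states.values()):
            break
        time.sleep(poll_seconds)
    return states
```

Passing the file assignment in explicitly keeps the decision about which data runs where (for example, by region) outside the polling loop.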

Access GTEX PFB data via Search

Relevant to #14: a more detailed search of GTEx subject and specimen (phenotypic) data is possible via data which may be exported from Anvil.

Obtaining GTEx subject and specimen data in this way is described here.

A user might load these data into private BigQuery tables. A pull request has been submitted to pypfb which supports this.

Such tables could be incorporated into a FASP workflow using BigQuerySearchClient.

Alternatively a private GA4GH Server could be set up to access these data.

Use CloudOS and NextFlow to integrate TCGA and GTEx data

See GTEX_TCGA_Federated_Analysis notebook for an iPython workflow.

The overall flow of the script/notebook is probably more illustrative than directly usable in NextFlow. A good question is how to do the equivalent work from CloudOS and NextFlow where that is the platform of preference, e.g. as it is for the JAX team. If it is of interest to you to explore that, we could work together to get started.

I have no experience of writing NextFlow, so I’ll leave that to you. Nevertheless, I was able to get a sense from the nf script that Sangram had shared previously of how you structure things.
https://github.com/lifebit-ai/sra-dbgap-datafetch/blob/main/main.nf

One of the questions is which capabilities it makes sense to do from within NextFlow, and where those capabilities should be in some library which NextFlow calls. Either way, I think we could do some useful experimentation. Some of the fasp package may be useful; if not, I could modify it to help.

See #15 for a possible additional step.

Create scrambled copy of COPDGene_Phenotypes_drs

This table is accessed by FASPNotebook02 but it contains controlled access data. The notebook is useful for "Getting started" purposes. A scrambled (aka synthetic) copy of the data would not need to be controlled access and would therefore serve the purpose.

Review variations in dbGaP data_dicts

There are some variations in how the dbGaP data_dict.xml files define data sources.

Some work to address this includes validating every dbGaP data_dict for conformance to an XML schema. After five iterations of that schema, 16,000 of approx 20,000 data dictionaries validate. Simple changes to the XML schema make significant increases in the number of data dictionaries that pass. This suggests this approach gives significant leverage on the problem.
To do: put that code in GitHub.
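Until that code is available, one lightweight way to see which structural variants exist is to group files by the set of parent/child element pairs they contain. This is a structural tally using only the standard library, not XML schema validation, and it is not the validation code described above. The element names in the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_signature(xml_text):
    """Return the set of (parent tag, child tag) pairs found in a document."""
    root = ET.fromstring(xml_text)
    pairs = set()
    for parent in root.iter():
        for child in parent:
            pairs.add((parent.tag, child.tag))
    return frozenset(pairs)


def tally_signatures(documents):
    """Count how many documents share each structural signature."""
    return Counter(structure_signature(doc) for doc in documents)


# Illustrative documents only; real data_dict files are not reproduced here.
samples = [
    "<table><variable><name>a</name><type>int</type></variable></table>",
    "<table><variable><name>b</name><type>str</type></variable></table>",
    "<table><variable><name>c</name><unit>kg</unit></variable></table>",
]
tally = tally_signatures(samples)
for signature, count in tally.most_common():
    print(count, sorted(signature))
```

The most common signatures indicate which structures a schema should accept first; rare ones point at the variations worth reviewing by hand.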

Add notebook to demonstrate DRS resolution and authentication

See checkResolution() in https://github.com/ga4gh/fasp-scripts/blob/master/fasp/loc/drs_metaresolver.py
That demonstrates CURIE resolution. Need to do the same for DRS URIs.

While the hackathon title suggests the objective is to produce a (Jupyter) notebook, the real point is to understand the factors driving the different styles of DRS ids, including the use of prefixing.

The use of the two different styles of DRS ids is also something to explore. I'll revive a previous discussion of this and link or post it below.
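For reference, the two styles in the DRS specification are hostname-based URIs (`drs://<hostname>/<id>`) and compact-identifier-based URIs (`drs://<prefix>:<accession>`). The sketch below only tells them apart and builds the object URL for the hostname-based case. It is not the logic in drs_metaresolver.py, the function name is made up, and resolving a compact identifier still requires a registry lookup that is not attempted here.

```python
from urllib.parse import quote


def classify_drs_uri(uri):
    """Classify a DRS URI and, for the hostname style, build its object URL.

    Assumption: a colon after 'drs://' marks a compact identifier, so
    hostnames carrying a port are not handled by this sketch.
    """
    scheme = "drs://"
    if not uri.startswith(scheme):
        raise ValueError("not a DRS URI: " + uri)
    rest = uri[len(scheme):]
    if ":" in rest:
        prefix, _, accession = rest.partition(":")
        return {"style": "compact", "prefix": prefix, "accession": accession}
    host, _, object_id = rest.partition("/")
    if not host or not object_id:
        raise ValueError("expected drs://<hostname>/<id>: " + uri)
    url = "https://{}/ga4gh/drs/v1/objects/{}".format(host, quote(object_id, safe=""))
    return {"style": "hostname", "url": url}


# Placeholder identifiers, not real objects.
print(classify_drs_uri("drs://drs.example.org/abc123"))
print(classify_drs_uri("drs://example.prefix:abc123"))
```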

Update DNAStackWESClient for new methods to obtain tokens

Was informed that DNAStack are replacing the WES Server we used for the 2020 plenary. That server was used by DNAStackWESClient in this repo so the code will need to be updated. One revision will be to the approach to obtain tokens. (Passport? @mbarkley :-)

Creating a ticket now in anticipation of the revised WES server.

Complete migration of scripts to notebooks or other fates

Scripts to be converted or retired.

  • FASPScript1
  • FASPScript2
  • FASPScript3
  • FASPScript4
  • FASPScript5
  • FASPScript6
  • FASPScript7
  • FASPScript8
  • FASPScript9
  • FASPScript10
  • FASPScript11
  • FASPScript12
  • FASPScript13
  • FASPScript14
  • FASPScript15
  • FASPScript16
  • FASPScript17
  • FASPScript18

Google Cloud Pipeline jobs composed in python fail

Pipeline job composed in python submitted via pipelines().run fails.
See runstats() method in gcpls_samtools.py

The parameters for the run match those for a job submitted via the command line. The details appear to be the same in the task list interface.

For now the workaround is to submit the job via subprocess.run()
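A minimal sketch of that workaround pattern follows. The helper name `submit_via_cli` is made up, and the command here is a harmless stand-in; in practice `argv` would be the exact command line that already works from a shell. This is not the code in gcpls_samtools.py.

```python
import subprocess
import sys


def submit_via_cli(argv):
    """Run an already-working command line and return its standard output.

    `argv` is a list of arguments, so no shell quoting is involved.
    check=True raises CalledProcessError on a non-zero exit status.
    """
    completed = subprocess.run(argv, check=True, capture_output=True, text=True)
    return completed.stdout


# Stand-in command for illustration; replace with the working CLI invocation.
output = submit_via_cli([sys.executable, "-c", "print('submitted')"])
print(output.strip())
```

Capturing stdout lets the calling script pull out any job identifier the command prints, in whatever format the command line tool uses.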
