ga4gh / fasp-scripts
License: Apache License 2.0
The number of scripts that now exist means that simply numbering the scripts is no longer informative.
A different approach to naming can help, but it is expecting too much of a naming convention to provide adequate information about what a script does. Simple metadata about the scripts is helpful too. The table on the readme page addresses that by providing information about which clients are used in a given script. Those clients are, at the moment, specific to implementations*, so one can also tell which implementations are used.
The logging performed by FASPRunner captures the necessary metadata automatically. The data in the log provided an easy way to create the table for the readme.
Beyond the metadata currently being captured, it would be useful to know which data sources are queried, which workflows are run, etc.
* This specificity is open to the criticism that the clients should be more general purpose. In some cases the specific client does no more than wrap the host URL of the service; that makes things slightly more convenient for script writers, which seems worthwhile. In other cases the specificity reflects the different authentication and authorization behaviors of different implementations. Over time we might expect those differences to disappear, at which point common clients will work.
The SRAExample notebook might be a good test for the 'DRS plus Passport' design, in that it goes all the way through to using a second DRS server to get results files. The question is: who would the passport broker be for authz for the second DRS server?
The authz one is looking for is likely some construct within the WES server – that construct would be 'project' in the SB case – though ultimately the authorization is to access storage owned by that project. Currently, obtaining that authorization is not a problem: the WES instance knows about the relationship internally, so authenticating to the DRS server with the same credentials as for WES gives you access to the files.
Answering "who would the passport broker be for authz for the second DRS server?": it seems to me the 80% answer would be the passport broker for the WES service. But are passport brokers for WES servers in scope?
Might be worth running that through the 'DRS plus Passport' design.
The other 20% (a guess) would be cases where WES writes the results out to some other DRS server. I haven't seen or tried that yet, and I don't think it's the first urgency. Again, it seems to me your design would handle it. Not ruling out that it should be a test case.
The Registry currently exists only as JSON returned from https://registry.ga4gh.org/v1/services. That is not very consumable by humans.
GA4GHRegistryClient already provides a python interface to at least some Registry capabilities. It should be fairly straightforward to create an iPython notebook that generates consumable output.
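As a starting point, here is a standard-library sketch of turning that JSON into something human-readable; the field names follow the GA4GH Service Registry schema, and GA4GHRegistryClient's own interface is not assumed here:

```python
import json
from urllib.request import urlopen

REGISTRY_URL = "https://registry.ga4gh.org/v1/services"

def format_service(svc):
    """One human-readable line per registry entry: name, artifact, url."""
    artifact = svc.get("type", {}).get("artifact", "?")
    return f"{svc.get('name', '?'):40} {artifact:12} {svc.get('url', '')}"

def list_services():
    """Fetch the registry and return one formatted line per service."""
    with urlopen(REGISTRY_URL) as resp:
        return [format_service(svc) for svc in json.load(resp)]
```

In a notebook, `print("\n".join(list_services()))` in a cell would give a quick table; a fancier version could emit a pandas DataFrame.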
@jb-adams feel free to suggest/contribute as you see fit.
The main change is expected to be to the DiscoverySearchClient which will be renamed DataConnectClient.
The fasp.search sub-package will retain its current name. Its siblings are named fasp.loc and fasp.workflow and roughly correspond to DRS and WES/TES functions respectively. However, the packages include equivalent functionality from other sources and are named to reflect the functionality covered rather than the specific GA4GH specification used.
Jonathan worked up the Search examples in the attachment from queries Brian Walsh had done at
https://colab.research.google.com/drive/1HhEEB3MJ8LbMP2ta946s8OARPc5RflHu?usp=sharing#scrollTo=nM-GHd3IWeqF
Jonathan wrote
A Search implementation with an appropriate connection to FHIR could take care of everything before the "Clinical / Care Management Example Query" step, leaving just the interesting query parts to the data consumer. Those queries were written for SQLite. Here's the same query adapted to PrestoSQL (now Trino), with the table name and filters adjusted so it returns some rows from the kidsfirst FHIR data at http://ga4gh-search-adapter-presto-public.staging.dnastack.com/search:
These would be useful to have in a Jupyter notebook. The queries Jonathan created are attached.
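For reference, a sketch of submitting such a query over the Search API; the endpoint is the one quoted above, and the placeholder SQL stands in for the adapted query in the attachment:

```python
import json
from urllib.request import Request, urlopen

# Endpoint from the issue text; a staging server, so it may move.
SEARCH_URL = "http://ga4gh-search-adapter-presto-public.staging.dnastack.com/search"

def build_search_request(sql):
    """Build the POST request the GA4GH Search API expects: a JSON body {"query": <sql>}."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return Request(SEARCH_URL, data=body, headers={"Content-Type": "application/json"})

def run_search(sql):
    """Submit the query and return the rows from the response's "data" field."""
    with urlopen(build_search_request(sql)) as resp:
        return json.load(resp).get("data", [])
```

A notebook cell could then call `run_search("SELECT ... FROM ...")` (placeholder SQL) and render the rows.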
After merging in https://github.com/DNAstack/plenary-resources-2020, the plenary-resources folder was moved to the top level.
That seems more appropriate at the top level, but I suggest we need a more considered rethink.
Some considerations
The initial aim was to get working examples up and running and evolve a more thorough architecture as we go.
At this point clients for search, object location/retrieval and workflow execution were included under the fasp package. Long term, one might expect these clients to be in different packages/repos. Recognizing this, separate packages were created for each of these functions.
Also
@ianfore From our Connect call, I'm thinking it might be helpful to create a "getting started guide" as an entrypoint to being able to run these scripts. As we want more of the community to use and contribute to these scripts, we'll want to provide an easy path for them to working with this repo.
This guide could take the form of a one-pager within the repo that explains how to go about getting registered for the various services/platforms, how to configure keys locally, test scripts with expected output, etc. It would take a user/researcher from having no pre-existing identity with any of the FASP platforms to being able to run most, or all, of the scripts. Since I fall into this category (I only have an ID with CGC and Cavatica), I'm happy to take notes on the process and collate them into a getting started guide.
Does this sound useful?
Remove automatic login during metaresolver initialization
Authorize only when a specific DRS Server is used
Started with SBCGC_WES_Example.ipynb.
Posting here as I know others (@mbarkley) are working on using the SB WES service. I'm modifying the WES client used to submit the GWAS script from the plenary demo that we just merged in.
Currently getting a 415 error. @mattions, do you have any insights on this?
RAS login is now the default login for at least some subset of Gen3 and Seven Bridges nodes running WES and DRS.
Check that tokens available from RAS logins grant proper access and authentication for the expected datasets and resources.
Links for access to relevant portals are available here.
See direct FHIR Search examples using the NCPI FHIR Server which were explored in a Nov 2020 codeathon here.
Some Resources on that server can return DRS ids which could be resolved to URLs by the Kids First DRS Server.
A long-standing goal of the FASP GA4GH group has been to have a federated workflow demo that anyone can run. In this context, a federated workflow means:
Relevant to #14: more detailed search of GTEx subject and specimen (phenotypic) data is possible via data which may be exported from Anvil.
Obtaining GTEx subject and specimen data in this way is described here.
A user might load these data into private BigQuery tables. A pull request supporting this has been submitted to pypfb.
Such tables could be incorporated into a FASP workflow using BigQuerySearchClient.
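The BigQuery step could look roughly like this; the project, dataset, table, and column names are hypothetical, and this uses the google-cloud-bigquery library directly rather than the repo's BigQuerySearchClient (whose interface isn't shown here):

```python
# Hypothetical private table holding the exported GTEx specimen data.
SQL = """
SELECT submitted_subject_id, body_site
FROM `my-project.gtex_private.specimen`
WHERE body_site = @site
"""

def run_query(site):
    # Imported lazily so the SQL above can be inspected without the dependency.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("site", "STRING", site)]
    )
    return [dict(row) for row in client.query(SQL, job_config=job_config).result()]
```

The result rows could then feed the subject/specimen selection step of a FASP workflow.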
Alternatively a private GA4GH Server could be set up to access these data.
See the GTEX_TCGA_Federated_Analysis notebook for an iPython workflow.
The overall flow of the script/notebook is probably more illustrative than directly usable in NextFlow. A good question is how to do the equivalent work from CloudOS and NextFlow where that is the platform of preference, e.g. as it is for the JAX team. If you are interested in exploring that, we could work together to get started.
I have no experience of writing NextFlow; I'll leave that to you. Nevertheless, the nf script that Sangram had shared previously gave me a sense of how you structure things.
https://github.com/lifebit-ai/sra-dbgap-datafetch/blob/main/main.nf
One of the questions is which capabilities it makes sense to do from within NextFlow, and which should live in some library that NextFlow calls. Either way, I think we could do some useful experimentation. Some of the fasp package may be useful; if not, I could modify it to help.
See #15 for a possible additional step.
In order to illustrate the effective use of Search schema, please update the schema on the public DNAStack server for search_cloud.cshcodeathon.organoid_profiling_pc_subject_phenotypes_gru
to one derived from the dbGaP XML data dictionary.
See fasp/data/dbgap for the data_dict.xml to use.
The notebook that uses that table is https://github.com/ga4gh/fasp-scripts/blob/master/notebooks/search/mapping1-manual.ipynb
This table is accessed by FASPNotebook02, but it contains controlled access data. The notebook is useful for "getting started" purposes. A scrambled (i.e. synthetic) copy of the data would not need to be controlled access and would therefore serve the purpose.
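One simple way to produce such a scrambled copy is to shuffle each column independently, which preserves the marginal distributions while breaking every real record. This is only an illustration; a real de-identification step would need a disclosure-risk review before release:

```python
import random

def scramble(rows):
    """rows: list of dicts with identical keys; returns a column-shuffled copy."""
    if not rows:
        return []
    scrambled = [dict(r) for r in rows]  # leave the originals untouched
    for key in rows[0]:
        values = [r[key] for r in rows]
        random.shuffle(values)  # permute each column independently
        for row, value in zip(scrambled, values):
            row[key] = value
    return scrambled
```

Applied to an export of the table above, this would yield rows that look plausible individually but correspond to no real subject.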
Possibly all of them?
There are some variations in how the dbGaP data_dict.xml files define data sources.
Work to address this includes validating every dbGaP data_dict for conformance to an XML schema. After five iterations of that schema, 16,000 of approximately 20,000 data dictionaries validate. Simple changes to the XML schema yield significant increases in the number of data dictionaries that pass, which suggests this approach gives significant leverage on the problem.
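As a lightweight illustration of what such a conformance check looks for, here is a standard-library sketch; the element names (data_table, variable, name) follow the common dbGaP layout but are assumptions here, and the real validation described above uses a full XML Schema rather than hand-written checks:

```python
import xml.etree.ElementTree as ET

def check_data_dict(xml_text):
    """Return a list of conformance problems found in a data_dict document."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "data_table":
        problems.append(f"unexpected root element: {root.tag}")
    for i, var in enumerate(root.findall("variable")):
        if var.find("name") is None:
            problems.append(f"variable #{i} missing <name>")
    return problems

# A toy document with one conforming and one non-conforming variable.
sample = """<data_table id="pht001">
  <variable id="phv001"><name>SUBJECT_ID</name></variable>
  <variable id="phv002"><description>missing name</description></variable>
</data_table>"""
```

Running `check_data_dict(sample)` flags only the second variable.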
To do: put that code in GitHub.
See checkResolution() in https://github.com/ga4gh/fasp-scripts/blob/master/fasp/loc/drs_metaresolver.py
That demonstrates CURIE resolution. The same needs to be done for DRS URIs.
While the hackathon title suggests the objective is to produce a (Jupyter) notebook, the real point is to understand the factors driving different approaches to different styles of DRS ids, including the use of prefixing.
The use of the two different styles of DRS ids is also something to explore. I'll revive a previous discussion of this and link or post it below.
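The two id styles can be sketched as follows; the CURIE prefix and its host mapping are hypothetical, while the /ga4gh/drs/v1/objects/{id} path is the standard DRS endpoint:

```python
from urllib.parse import urlparse

# Hypothetical prefix registry; a real metaresolver would consult identifiers.org
# or a maintained prefix table.
PREFIX_MAP = {"examplens": "drs.example.org"}

def to_object_url(drs_id):
    """Resolve either DRS id style to the HTTPS object URL."""
    if drs_id.startswith("drs://"):
        # Hostname-based style: drs://<host>/<object_id>
        parsed = urlparse(drs_id)
        host, object_id = parsed.netloc, parsed.path.lstrip("/")
    else:
        # Compact (CURIE) style: <prefix>:<accession>
        prefix, object_id = drs_id.split(":", 1)
        host = PREFIX_MAP[prefix]
    return f"https://{host}/ga4gh/drs/v1/objects/{object_id}"
```

The interesting design questions are all in how PREFIX_MAP gets populated and governed, which is what the prefixing discussion is about.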
I was informed that DNAStack are replacing the WES server we used for the 2020 plenary. That server was used by DNAStackWESclient in this repo, so the code will need to be updated. One revision will be to the approach used to obtain tokens. (Passport? @mbarkley :-)
Creating a ticket now in anticipation of the revised WES server.
Scripts to be converted or retired.
A pipeline job composed in Python and submitted via pipelines().run fails.
See runstats() method in gcpls_samtools.py
The parameters for the run match those for a job submitted via the command line. The details appear to be the same in the task list interface.
For now the workaround is to submit the job via subprocess.run().
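A minimal sketch of that workaround; the actual command and flags used in gcpls_samtools.py are not reproduced here, so a harmless stand-in command is shown:

```python
import subprocess

def run_via_cli(cmd):
    """Submit a pipeline job through its CLI instead of the Python API.

    cmd is the full argument list, e.g. the gcloud invocation that is known
    to work from the command line. Raises CalledProcessError on failure.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

# Stand-in for the real CLI submission:
output = run_via_cli(["echo", "pipeline submitted"])
```

Since the same parameters succeed from the command line, routing through the CLI sidesteps whatever the Python submission path is doing differently.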