pubseq / bh20-seq-resource
Tool to upload SARS-CoV-2 sequences to BH20 Arvados instance and orchestrate analysis
License: Apache License 2.0
Over 12K sequences available. Needs some form of annotation to be useful:
ftp://download.big.ac.cn/Genome/Viruses/Coronaviridae/
and
I'm seeing this error when running the import:
[2020-07-07 20:43:19] WARNING 'MT385461.1 uploaded by unknown@50f4c4f28070 from 3.89.224.155' (lugli-4zz18-nb1luabe2d62v9k) has validation errors: Testing <http://arvados.org/keep:67ad83a2d78ebaa3142fda891604051d+126/metadata.yaml> against shape https://raw.githubusercontent.com/arvados/bh20-seq-resource/master/bh20sequploader/bh20seq-shex.rdf#submissionShape
Testing _:b1 against shape https://raw.githubusercontent.com/arvados/bh20-seq-resource/master/bh20sequploader/bh20seq-shex.rdf#submitterShape
_:b1 context:
<http://arvados.org/keep:67ad83a2d78ebaa3142fda891604051d+126/metadata.yaml> MainSchema:submitter _:b1 .
_:b1 sio:SIO_000116 "Data Science" .
_:b1 sio:SIO_000172 "Chan-Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158, USA" .
No matching triples found for predicate obo:NCIT_C42781
I've installed bh20-seq-resource with a combination of pip and Guix. I ran 'guix environment --ad-hoc python curl python-pycurl' to get an environment with python3 and python-pycurl and then ran 'pip3 install --user git+https://github.com/arvados/bh20-seq-resource.git@master'. ~/.local/bin/bh20-seq-uploader --help gives me the output:
Traceback (most recent call last):
  File "/home/efraimf/.local/bin/bh20-seq-uploader", line 11, in <module>
    load_entry_point('bh20-seq-uploader==1.0.20200410122633', 'console_scripts', 'bh20-seq-uploader')()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2793, in load_entry_point
    return ep.load()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2411, in load
    return self.resolve()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2417, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/efraimf/.local/lib/python3.7/site-packages/bh20sequploader/main.py", line 9, in <module>
    import qc_metadata
ModuleNotFoundError: No module named 'qc_metadata'
The URIs in the database should be resolvable, e.g.
http://arvados.org/keep:00a6af865453564f6a59b3d2c81cc7c1+123/sequence.fasta
We should allow only a limited number of viral species. This requires a homology check at the FASTA sequence level.
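To make the idea concrete, here is a very rough sketch of a sequence-level check. It uses k-mer Jaccard similarity against a reference as a crude stand-in for a real homology check; a production check would more likely use an alignment tool. The function names, the k-mer size and the threshold are all illustrative assumptions, not anything in the uploader today.

```python
def kmers(sequence, k=8):
    """Return the set of k-mers in a sequence (whitespace removed, upper-cased)."""
    seq = "".join(sequence.split()).upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_like_reference(sequence, reference, k=8, threshold=0.5):
    """Crude proxy for homology: share enough k-mers with the reference.

    The threshold is a hypothetical value and would need tuning on real data.
    """
    return jaccard(kmers(sequence, k), kmers(reference, k)) >= threshold
```

A submission that fails this kind of check could be rejected before it reaches the workflows.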
From discussion 27 August 2020:
sample_id should be the same in the metadata file and the FASTA header from the upload, and this should be validated.
Originally on the list, but I don't think we're doing this right now:
For namespacing identifiers, sample_id should be a URI. Add command line option to uploader to give URI prefix. Give instructions to put your institution's web page if you don't know what else to use. Validate that sample_id is a valid URI.
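A minimal sketch of what these two checks could look like. It assumes the FASTA header carries the local identifier as its first token and that the URI prefix from the proposed command line option is prepended to it; the function names and the uri_prefix parameter are hypothetical.

```python
from urllib.parse import urlparse


def fasta_header_id(fasta_text):
    """Return the first whitespace-delimited token of the first FASTA header."""
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            return tokens[0] if tokens else ""
    raise ValueError("no FASTA header line found")


def is_valid_uri(value):
    """Accept only absolute URIs that have both a scheme and a network location."""
    parsed = urlparse(value)
    return bool(parsed.scheme) and bool(parsed.netloc)


def check_sample_id(sample_id, fasta_text, uri_prefix=""):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    full_id = uri_prefix + sample_id
    if not is_valid_uri(full_id):
        problems.append(f"sample_id {full_id!r} is not a valid URI")
    header_id = fasta_header_id(fasta_text)
    if header_id != sample_id:
        problems.append(
            f"FASTA header id {header_id!r} does not match sample_id {sample_id!r}"
        )
    return problems
```

Note that this URI test is deliberately strict: it rejects identifiers without a host part, such as URNs, which may or may not be what we want.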
@ekg is working on an MSA workflow
The validation step needs to run before disabling the button.
EBI has some 2K sequences we could bring in too. @AndreaGuarracino can you take a look at:
Can we create a subdirectory for the JSON files? If we lead people to the latest results the current list is confusing.
Currently, pangenome-generate.cwl fails when trying to pull jerven/spodgi; looking at Dockerhub, it does not have the latest tag. Explicitly pulling jerven/spodgi:0.0.5 fixes the problem.
Better solution would be to add the latest tag to the image on dockerhub.
The predicate in the schema for lab_address is http://purl.obolibrary.org/obo/OBI_0600047, which is a typo because that's actually the predicate for sample_sequencing_technology. Need to determine the correct predicate and fix the schema.
We need to add a permaid (see #103). I think submitter and submission are missing too. I'll check.
At least:
@LLTommy is adding metadata and validation
I'm getting the following after trying to install the package from GH:
Collecting git+https://github.com/arvados/bh20-seq-resource.git
  Cloning https://github.com/arvados/bh20-seq-resource.git to /tmp/pip-req-build-sx_tz19n
  Running command git clone -q https://github.com/arvados/bh20-seq-resource.git /tmp/pip-req-build-sx_tz19n
Collecting arvados-python-client
  Using cached arvados-python-client-2.0.2.tar.gz (182 kB)
Collecting schema-salad
  Using cached schema_salad-5.0.20200416112825-py3-none-any.whl (457 kB)
Collecting python-magic
  Using cached python_magic-0.4.15-py2.py3-none-any.whl (5.5 kB)
Collecting pyshex
  Using cached PyShEx-0.7.14-py3-none-any.whl (50 kB)
Collecting ciso8601>=2.0.0
  Using cached ciso8601-2.1.3.tar.gz (15 kB)
Collecting future
  Using cached future-0.18.2.tar.gz (829 kB)
Collecting google-api-python-client<1.7,>=1.6.2
  Using cached google_api_python_client-1.6.7-py2.py3-none-any.whl (56 kB)
Collecting httplib2>=0.9.2
  Using cached httplib2-0.17.3-py3-none-any.whl (95 kB)
Processing /home/bonface/.cache/pip/wheels/40/ae/bd/3e7d7af6588020c7e993f6f114fb708d966276dbc2f224d3f9/pycurl-7.43.0.5-cp38-cp38-linux_x86_64.whl
Collecting ruamel.yaml<=0.15.77,>=0.15.54
  Using cached ruamel.yaml-0.15.77.tar.gz (312 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/bonface/projects/bh20-seq-resource/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py'"'"'; __file__='"'"'/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-h7g8lchz/ruamel.yaml/pip-egg-info
         cwd: /tmp/pip-install-h7g8lchz/ruamel.yaml/
    Complete output (11 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 211, in <module>
        pkg_data = _package_data(__file__.replace('setup.py', '__init__.py'))
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 184, in _package_data
        data = literal_eval("".join(lines))
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 156, in literal_eval
        return _convert(node_or_string)
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 95, in _convert
        if isinstance(node, Str):
    NameError: name 'Str' is not defined
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Steps to reproduce:
virtualenv --python python3 venv
. venv/bin/activate
pip install git+https://github.com/arvados/bh20-seq-resource.git
Should be able to install the package without any problems
REDCap has a clinical, HIPAA-compliant database. We should share a field that refers to clinical patient information: https://redcap-covid19.elixir-luxembourg.org/redcap/. One strain may have multiple records in REDCap.
Need to check the recompute of fastq from long reads and short reads. I think it is a good idea to focus on ONT initially.
We should capture the metadata on workflows somehow. As per @LLTommy's suggestion
https://docs.dockstore.org/en/develop/advanced-topics/best-practices/best-practices.html
Make sure each CWL workflow of interest is listed in https://github.com/arvados/bh20-seq-resource/blob/master/.dockstore.yml (the pangenome-generate workflow is listed, as of 2020-10-06) -> https://dockstore.org/my-workflows/github.com/arvados/bh20-seq-resource/Pangenome%20Generator
Note, the pangenome-generator has also been published to https://workflowhub.eu/workflows/63
When submitting a sequence and metadata we can compute a hash value over the submission to make sure it is not already in the database. People will accidentally resubmit stuff and there is no reason to trigger the pipeline. Or, @tetron, is this automatic in Arvados?
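As a rough illustration of the hashing idea, the sketch below computes a SHA-256 digest over a lightly normalised FASTA plus the metadata serialised with sorted keys. The function names, the normalisation rules and the in-memory set of seen digests are assumptions made for the example; it says nothing about what Arvados itself does.

```python
import hashlib
import json


def submission_digest(fasta_text, metadata):
    """Hash the normalised sequence plus metadata into one hex digest."""
    lines = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Keep header lines as they are, upper-case the sequence lines
        lines.append(line if line.startswith(">") else line.upper())
    # Sorted keys make the serialisation independent of key order
    meta = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    digest.update("\n".join(lines).encode("utf-8"))
    digest.update(b"\0")  # separator between sequence and metadata
    digest.update(meta.encode("utf-8"))
    return digest.hexdigest()


def is_duplicate(digest, seen):
    """Return True if the digest was seen before; otherwise record it."""
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

This only catches resubmissions that are identical after normalisation; the same sequence wrapped at a different line width would still hash differently unless the sequence lines are also joined.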
We need a script added to pangenome-generate-cwl to remove the following sequences from the public sequence resource prior to creating a GFA output:
00ef4c4427c0881a0030f7f400ce1ed0+123/sequence.fasta
1a191370cb868f80c824d93f9169599a+126/sequence.fasta
9e6fe32c3f7d281332ba958b5f62d109+123/sequence.fasta
bafb25a84fa5167d5a049fa43d607a44+126/sequence.fasta
9fe51f2847f3e8e3060c9ddebf3a41e5+123/sequence.fasta
d637278d9b95bbd1a5ef0bcd17a95c21+123/sequence.fasta
53fa57b401f3695feb0facf498f60871+123/sequence.fasta
392451211d0b7500ebaaa4e3182838be+123/sequence.fasta
bc7dcac01570c2fb81f16f76b98add9d+126/sequence.fasta
898c212f7a9d4984c382d782bad53fd4+123/sequence.fasta
f8001cec2144c59cbd851706b898ddfe+123/sequence.fasta
71063763aabd91e0b33d6861294bdff6+123/sequence.fasta
57dca4995c2186b11b67ab1cff0b005b+126/sequence.fasta
f95a298c57718bf290d9facdda59eb66+123/sequence.fasta
71da768110cd21ff99f5664bc335a4ec+126/sequence.fasta
06f5726c45483d0e8fdea3004f2c4adf+123/sequence.fasta
f9cea932bff8e83a2cb490c3bd694742+123/sequence.fasta
5914683bbe1ff047a163b3e57110f11b+126/sequence.fasta
27bb9a654a5f46e08888f55021d37b17+126/sequence.fasta
a9be2d60f66fd03a75418b40306ededc+126/sequence.fasta
aa1d1c497dabed0589c8ea6423179441+123/sequence.fasta
c6f8550cf6940591fea7de5f2159d88b+123/sequence.fasta
ab9c2241bda0599d20877ece1e1bc04e+126/sequence.fasta
5caa10de623c2384a31160c72a8f4f9c+126/sequence.fasta
0f24420528d58bff3468084aca3d7328+123/sequence.fasta
4887cadadce95997fed59d129e47b47b+126/sequence.fasta
e8e00929537a550b0989be12147d6241+126/sequence.fasta
7ebbc05a6949a6ce0637fa692af183ad+126/sequence.fasta
6566c86da5313159640092f16ac8a0cb+123/sequence.fasta
d04a38579335168796dd8d25f362ff8f+123/sequence.fasta
810d1e1012cbc4f63226159bd8b1fa08+123/sequence.fasta
4d40985616d6975a41a117c41fd38145+123/sequence.fasta
d2062c46515c5fffed7d27b95a9e32c9+126/sequence.fasta
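A possible shape for such a script, as a sketch only. It assumes the blocklist is a whitespace-separated text of entries in the form shown above, and that the workflow's inputs are available as locator strings (either bare or as keep: URIs); the function names are hypothetical and the wiring into pangenome-generate is not shown.

```python
def collection_hash(locator):
    """Extract '<hash>+<size>' from 'HASH+SIZE/file' or '...keep:HASH+SIZE/file'."""
    tail = locator.rsplit("keep:", 1)[-1]
    return tail.split("/", 1)[0]


def load_blocklist(text):
    """Parse whitespace-separated blocklist entries into a set of collection hashes."""
    return {collection_hash(entry) for entry in text.split()}


def filter_sequences(locators, blocklist):
    """Keep only the locators whose collection hash is not on the blocklist."""
    return [loc for loc in locators if collection_hash(loc) not in blocklist]
```

Matching on the collection hash rather than the full path means the same entry is excluded whether it is referenced as a bare path or as an http://arvados.org/keep: URI.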
Our uploader should be able to prepare EBI/NCBI submissions. At the least, it should go some way toward making that really easy.
Be nice to show when sequences were sampled.
@BonfaceKilz having a small issue with the grid not aligning under the intro box. I can't figure it out. Can you see if you can fix it? Running at http://covid-19.genenetwork.org/
Be good to present some output and GFA visualisation (for example)
Need to check before go-live
A random remark field is probably a good idea. It can lead to additions to the schema. Also I think we can allow for additional RDF if people want that. At least give a proposal field.
We have a script which can update virtuoso. It checks Arvados for updates of the metadata.ttl. I still need to:
From now on all data should be guaranteed to remain in Arvados for the lifetime of the project.
For multi-field options the [+] button is not working.
It is possible the same sequence gets multiple metadata entries (e.g. for clinical outcome). So we need to be able to add metadata several times. Furthermore it is important to be able to update existing metadata - I think simply by adding versions.
Request from the pangenome viewer team
Ideally based on SPARQL queries.
To support bulk uploads we don't want to trigger the workflows at every step. I think we should have a switch that prevents the workflows from running on individual submissions. Just add the sequence and metadata to Keep.
We can soon add @BonfaceKilz' news feed.