pubseq / bh20-seq-resource

Tool to upload SARS-CoV-2 sequences to BH20 Arvados instance and orchestrate analysis

License: Apache License 2.0

Python 54.04% Dockerfile 0.45% HTML 15.68% Makefile 0.02% TeX 0.42% Common Workflow Language 13.49% Shell 1.53% CSS 3.79% JavaScript 4.63% Ruby 5.95%

biohackcovid20

bh20-seq-resource's People

Contributors

pjotrp adamnovak cp-weiland bonfacekilz stain inutano ambarishk dcgenomics daniwelter bio-ontology-research-group mandosoft gitter-badger mr-c sravani2000hub proccaserra prasunanand urbanslug mady1258 svonworl

bh20-seq-resource's Issues

Add CC0 license as an option

https://creativecommons.org/share-your-work/public-domain/cc0

Add seq similarity and overlap to metadata

Embed CN resource

Over 12K sequences available. Needs some form of annotation to be useful:

ftp://download.big.ac.cn/Genome/Viruses/Coronaviridae/

and

https://bigd.big.ac.cn/ncov?lang=en

submitterShape error

I'm seeing this error running import

[2020-07-07 20:43:19] WARNING 'MT385461.1 uploaded by unknown@50f4c4f28070 from 3.89.224.155' (lugli-4zz18-nb1luabe2d62v9k) has validation errors:   Testing <http://arvados.org/keep:67ad83a2d78ebaa3142fda891604051d+126/metadata.yaml> against shape https://raw.githubusercontent.com/arvados/bh20-seq-resource/master/bh20sequploader/bh20seq-shex.rdf#submissionShape
    Testing _:b1 against shape https://raw.githubusercontent.com/arvados/bh20-seq-resource/master/bh20sequploader/bh20seq-shex.rdf#submitterShape
    _:b1 context:
      <http://arvados.org/keep:67ad83a2d78ebaa3142fda891604051d+126/metadata.yaml> MainSchema:submitter _:b1 .
         _:b1 sio:SIO_000116 "Data Science" .
         _:b1 sio:SIO_000172 "Chan-Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158, USA" .

         No matching triples found for predicate obo:NCIT_C42781

ModuleNotFoundError: No module named 'qc_metadata'

I've installed bh20-seq-resource with a combination of pip and Guix. I ran 'guix environment --ad-hoc python curl python-pycurl' to get an environment with python3 and python-pycurl and then ran 'pip3 install --user git+https://github.com/arvados/bh20-seq-resource.git@master'. ~/.local/bin/bh20-seq-uploader --help gives me the output:

Traceback (most recent call last):
  File "/home/efraimf/.local/bin/bh20-seq-uploader", line 11, in <module>
    load_entry_point('bh20-seq-uploader==1.0.20200410122633', 'console_scripts', 'bh20-seq-uploader')()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2793, in load_entry_point
    return ep.load()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2411, in load
    return self.resolve()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2417, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/efraimf/.local/lib/python3.7/site-packages/bh20sequploader/main.py", line 9, in <module>
    import qc_metadata
ModuleNotFoundError: No module named 'qc_metadata'

Make Keep links referrable for sequence files and metadata files

The URIs in the database should be resolvable, e.g.

http://arvados.org/keep:00a6af865453564f6a59b3d2c81cc7c1+123/sequence.fasta

Add edit links to documents

Add check to pipeline for SARS-CoV-2 origin of sequence

We should allow only a limited number of viral species. This requires a homology check at the FASTA sequence level.
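As a rough illustration of what a sequence-level check could look like, here is a minimal sketch that measures how many of a query's k-mers also occur in a reference genome. This is a crude stand-in for a real homology check (which would more likely use an aligner); the function names, `k=8`, and the `0.8` threshold are all assumptions for illustration and are not part of the existing pipeline.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of an upper-cased sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_containment(query, reference, k=8):
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        # Query shorter than k: nothing to compare, treat as no similarity.
        return 0.0
    return len(query_kmers & kmers(reference, k)) / len(query_kmers)


def resembles_reference(query, reference, k=8, threshold=0.8):
    """Hypothetical acceptance rule: enough of the query's k-mers match."""
    return kmer_containment(query, reference, k) >= threshold
```

A submission would be accepted only if `resembles_reference(uploaded_sequence, reference_sequence)` is true for one of the allowed viral species; suitable values of `k` and the threshold would need to be tuned on real data.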

Add more clinical metadata

Add Markdown support

Better sample id management

From discussion 27 August 2020:

- sample_id should be the same in the metadata file and in the FASTA header from the upload; this should be validated.
- If an uploaded, valid sequence has the same sample id as an existing validated sequence, copy the new sequence/metadata to the existing collection. Enable versioning on Arvados.
- Cleaning up:
  - Clean up existing sequences.
  - Merge based on sequence_label and take the latest (most recent created_at).
  - Revalidate previously validated samples that have invalid dates or specimen fields.
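The first and third points can be sketched as follows. This is only an illustration under stated assumptions: the sample id is taken to be the first whitespace-delimited token of the FASTA header, records are plain dicts with `sequence_label` and `created_at` keys, and all `created_at` values are uniformly formatted ISO 8601 UTC strings (so string order matches time order). The function names are hypothetical and do not come from the uploader.

```python
def fasta_sample_id(fasta_text):
    """Return the first token of the first FASTA header line (without '>')."""
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            return tokens[0] if tokens else ""
    raise ValueError("no FASTA header found")


def sample_ids_match(metadata_sample_id, fasta_text):
    """True when the metadata sample id equals the id in the FASTA header."""
    return fasta_sample_id(fasta_text) == metadata_sample_id


def latest_per_label(records):
    """Keep, for each sequence_label, the record with the greatest created_at.

    Assumes created_at strings share one ISO 8601 format and timezone,
    so comparing them as strings orders them chronologically.
    """
    latest = {}
    for record in records:
        label = record["sequence_label"]
        if label not in latest or record["created_at"] > latest[label]["created_at"]:
            latest[label] = record
    return latest
```

How the sample id is read out of the metadata file is left to the caller here, since that depends on the metadata schema.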

Originally on the list, but I don't think we're doing this right now:

For namespacing identifiers, sample_id should be a URI. Add command line option to uploader to give URI prefix. Give instructions to put your institution's web page if you don't know what else to use. Validate that sample_id is a valid URI.
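A minimal sketch of the validation and prefixing described above, using only the standard library. It assumes, more narrowly than the text, that an acceptable URI is an http(s) URL with a host; the function names and the example prefix are hypothetical, and no such command line option is claimed to exist in the uploader.

```python
from urllib.parse import quote, urlparse


def is_valid_uri(value):
    """Accept only http(s) URIs that include a host (an assumed, narrow rule)."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def namespaced_sample_id(sample_id, uri_prefix):
    """Return sample_id unchanged if it is already a URI, else prepend the prefix."""
    if is_valid_uri(sample_id):
        return sample_id
    if not is_valid_uri(uri_prefix):
        raise ValueError("URI prefix is not a valid http(s) URI: %r" % uri_prefix)
    # Percent-encode the bare id so that it is safe as a single path segment.
    return uri_prefix.rstrip("/") + "/" + quote(sample_id, safe="")
```

For example, `namespaced_sample_id("MT385461.1", "https://example.org/samples/")` yields `https://example.org/samples/MT385461.1`.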

Add MSA workflow

@ekg is working on an MSA workflow

Web page: submit button should only be disabled when submitting

The validation step needs to run before disabling the button.

Give nice error messages on qc_metadata fail

Add EBI sequences and metadata?

EBI has some 2K sequences we could bring in too. @AndreaGuarracino can you take a look at:

https://www.covid19dataportal.org/sequences?db=embl

Run pipeline on SPARQL query

Move 10,000 JSON files into subdirectory

Can we create a subdirectory for the JSON files? If we lead people to the latest results the current list is confusing.

Package uploader in pip

Docker image needs 'latest' tag

Currently, pangenome-generate.cwl fails when trying to pull jerven/spodgi; looking at Docker Hub, the image does not have a latest tag. Explicitly pulling jerven/spodgi:0.0.5 fixes the problem.

A better solution would be to add the latest tag to the image on Docker Hub.

wrong predicate for lab_address

The predicate in the schema for lab_address is http://purl.obolibrary.org/obo/OBI_0600047 which is a typo because that's actually the predicate for sample_sequencing_technology. Need to determine the correct predicate and fix the schema.

Search box stopped working (HTML disabled on demo page)

Automated uploads

We need to add a permaid (see #103). I think submitter and submission are missing too. I'll check.

Ontology for assemblies

At least:

- pangenome from only de-novo assemblies (should be of higher quality)
- pangenome from de-novo assemblies and read mapping experiments (reference biased)

Add pangenome browser

Add phylogeny workflow

Metadata

@LLTommy is adding metadata and validation

[Build] Unable to install the package from Github

Description:

I'm getting the following after trying to install the package from GH:

Collecting git+https://github.com/arvados/bh20-seq-resource.git
  Cloning https://github.com/arvados/bh20-seq-resource.git to /tmp/pip-req-build-sx_tz19n
  Running command git clone -q https://github.com/arvados/bh20-seq-resource.git /tmp/pip-req-build-sx_tz19n
Collecting arvados-python-client
  Using cached arvados-python-client-2.0.2.tar.gz (182 kB)
Collecting schema-salad
  Using cached schema_salad-5.0.20200416112825-py3-none-any.whl (457 kB)
Collecting python-magic
  Using cached python_magic-0.4.15-py2.py3-none-any.whl (5.5 kB)
Collecting pyshex
  Using cached PyShEx-0.7.14-py3-none-any.whl (50 kB)
Collecting ciso8601>=2.0.0
  Using cached ciso8601-2.1.3.tar.gz (15 kB)
Collecting future
  Using cached future-0.18.2.tar.gz (829 kB)
Collecting google-api-python-client<1.7,>=1.6.2
  Using cached google_api_python_client-1.6.7-py2.py3-none-any.whl (56 kB)
Collecting httplib2>=0.9.2
  Using cached httplib2-0.17.3-py3-none-any.whl (95 kB)
Processing /home/bonface/.cache/pip/wheels/40/ae/bd/3e7d7af6588020c7e993f6f114fb708d966276dbc2f224d3f9/pycurl-7.43.0.5-cp38-cp38-linux_x86_64.whl
Collecting ruamel.yaml<=0.15.77,>=0.15.54
  Using cached ruamel.yaml-0.15.77.tar.gz (312 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/bonface/projects/bh20-seq-resource/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py'"'"'; __file__='"'"'/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-h7g8lchz/ruamel.yaml/pip-egg-info
         cwd: /tmp/pip-install-h7g8lchz/ruamel.yaml/
    Complete output (11 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 211, in <module>
        pkg_data = _package_data(__file__.replace('setup.py', '__init__.py'))
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 184, in _package_data
        data = literal_eval("".join(lines))
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 156, in literal_eval
        return _convert(node_or_string)
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 95, in _convert
        if isinstance(node, Str):
    NameError: name 'Str' is not defined
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

To Reproduce:

Steps to reproduce:

virtualenv --python python3 venv
. venv/bin/activate
pip install git+https://github.com/arvados/bh20-seq-resource.git

Expected Behaviour:

Should be able to install the package without any problems

Environment setup:

OS: Arch Linux
Python Version: python3.8.2

Share identifiers Redcap

Redcap has a clinical HIPAA compliant database. We should share a field that refers to clinical patient information https://redcap-covid19.elixir-luxembourg.org/redcap/. One strain may have multiple records in Redcap.

Add fastq support

Need to check the recompute of fastq from long reads and short reads. I think it is a good idea to focus on ONT initially.

Add metadata on workflows

We should capture the metadata on workflows somehow. As per @LLTommy's suggestion

https://docs.dockstore.org/en/develop/advanced-topics/best-practices/best-practices.html

Make sure each CWL workflow of interest is listed in https://github.com/arvados/bh20-seq-resource/blob/master/.dockstore.yml (the pangenome-generate workflow is listed, as of 2020-10-06) -> https://dockstore.org/my-workflows/github.com/arvados/bh20-seq-resource/Pangenome%20Generator

Note, the pangenome-generator has also been published to https://workflowhub.eu/workflows/63

Compute HASH on inputs

When submitting a sequence and metadata we can compute a hash value over the submission to make sure it is not already in the database. People will accidentally resubmit stuff and there is no reason to trigger the pipeline again. Or, @tetron, is this automatic in Arvados?
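One way the hash could be computed is sketched below with SHA-256 from the standard library. The function names and the in-memory set of known digests are assumptions for illustration; whether Arvados already deduplicates identical content is the open question above and is not addressed here.

```python
import hashlib


def submission_digest(fasta_bytes, metadata_bytes):
    """Return a hex SHA-256 digest over the sequence and metadata bytes.

    Each part is prefixed with its length so that different splits of the
    same concatenated bytes cannot produce the same digest.
    """
    digest = hashlib.sha256()
    for part in (fasta_bytes, metadata_bytes):
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


def is_duplicate(digest, known_digests):
    """True when this digest was already recorded for an earlier submission."""
    return digest in known_digests
```

The comparison is byte-exact, so a resubmission that differs only in whitespace or line endings would not be detected unless the inputs are normalised first.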

Add navigation tabs to web site

Add metadata about the workflows themselves

https://docs.dockstore.org/en/develop/advanced-topics/best-practices/best-practices.html

Make sure each CWL workflow of interest is listed in https://github.com/arvados/bh20-seq-resource/blob/master/.dockstore.yml

Filter out list of sequences before building GFA

We need a script added to pangenome-generate-cwl to remove the following sequences from the public sequence resource prior to creating a GFA output:

00ef4c4427c0881a0030f7f400ce1ed0+123/sequence.fasta
1a191370cb868f80c824d93f9169599a+126/sequence.fasta
9e6fe32c3f7d281332ba958b5f62d109+123/sequence.fasta
bafb25a84fa5167d5a049fa43d607a44+126/sequence.fasta
9fe51f2847f3e8e3060c9ddebf3a41e5+123/sequence.fasta
d637278d9b95bbd1a5ef0bcd17a95c21+123/sequence.fasta
53fa57b401f3695feb0facf498f60871+123/sequence.fasta
392451211d0b7500ebaaa4e3182838be+123/sequence.fasta
bc7dcac01570c2fb81f16f76b98add9d+126/sequence.fasta
898c212f7a9d4984c382d782bad53fd4+123/sequence.fasta
f8001cec2144c59cbd851706b898ddfe+123/sequence.fasta
71063763aabd91e0b33d6861294bdff6+123/sequence.fasta
57dca4995c2186b11b67ab1cff0b005b+126/sequence.fasta
f95a298c57718bf290d9facdda59eb66+123/sequence.fasta
71da768110cd21ff99f5664bc335a4ec+126/sequence.fasta
06f5726c45483d0e8fdea3004f2c4adf+123/sequence.fasta
f9cea932bff8e83a2cb490c3bd694742+123/sequence.fasta
5914683bbe1ff047a163b3e57110f11b+126/sequence.fasta
27bb9a654a5f46e08888f55021d37b17+126/sequence.fasta
a9be2d60f66fd03a75418b40306ededc+126/sequence.fasta
aa1d1c497dabed0589c8ea6423179441+123/sequence.fasta
c6f8550cf6940591fea7de5f2159d88b+123/sequence.fasta
ab9c2241bda0599d20877ece1e1bc04e+126/sequence.fasta
5caa10de623c2384a31160c72a8f4f9c+126/sequence.fasta
0f24420528d58bff3468084aca3d7328+123/sequence.fasta
4887cadadce95997fed59d129e47b47b+126/sequence.fasta
e8e00929537a550b0989be12147d6241+126/sequence.fasta
7ebbc05a6949a6ce0637fa692af183ad+126/sequence.fasta
6566c86da5313159640092f16ac8a0cb+123/sequence.fasta
d04a38579335168796dd8d25f362ff8f+123/sequence.fasta
810d1e1012cbc4f63226159bd8b1fa08+123/sequence.fasta
4d40985616d6975a41a117c41fd38145+123/sequence.fasta
d2062c46515c5fffed7d27b95a9e32c9+126/sequence.fasta
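The filtering step itself could be as small as the sketch below. It assumes the exclusion list is available as whitespace-separated text in the same `hash+size/sequence.fasta` form as above and that the workflow's input paths use exactly that form; both function names are hypothetical, and how the script would be wired into pangenome-generate is not shown.

```python
def parse_exclusions(text):
    """Turn whitespace-separated paths into a set for fast membership tests."""
    return set(text.split())


def filter_sequences(paths, excluded):
    """Return paths in their original order, dropping any that are excluded."""
    excluded = set(excluded)
    return [path for path in paths if path not in excluded]
```

If the workflow passes paths with a prefix such as `keep:`, they would need to be normalised to the listed form before the comparison.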

Prepare uploading to EBI and/or NCBI resource (BOSC2020)

Our uploader should be able to prepare EBI/NCBI submissions. At the least, it should go some of the way toward making such submissions really easy.

Demo: list by time stamp

Be nice to show when sequences were sampled.

Web page does not display correctly in Chromium

@BonfaceKilz having a small issue with the grid not aligning under the intro box. I can't figure it out. Can you see if you can fix it? Running at http://covid-19.genenetwork.org/

Create a nice website for users

Be good to present some output and GFA visualisation (for example)

Connect output of workflows

Need to check before go-live

Add remark field to metadata

A random remark field is probably a good idea. It can lead to additions to the schema. Also I think we can allow for additional RDF if people want that. At least give a proposal field.

Updated virtuoso instance

We have a script which can update virtuoso. It checks Arvados for updates of the metadata.ttl. I still need to:

- Run as a cron job
- Clear the old graph before updates (requires permissions)
- Add the update time stamp to the store
- Perhaps bring in graph versioning