pubseq / bh20-seq-resource

Tool to upload SARS-CoV-2 sequences to BH20 Arvados instance and orchestrate analysis

License: Apache License 2.0

Python 54.04% Dockerfile 0.45% HTML 15.68% Makefile 0.02% TeX 0.42% Common Workflow Language 13.49% Shell 1.53% CSS 3.79% JavaScript 4.63% Ruby 5.95%
biohackcovid20

bh20-seq-resource's People

Contributors

adamnovak, andreaguarracino, bonfacekilz, daniwelter, dcgenomics, gitter-badger, heuermh, inutano, lltommy, mr-c, pjotrp, proccaserra, stain, tetron, uniqueg, urbanslug

bh20-seq-resource's Issues

submitterShape error

I'm seeing this error when running the import:

[2020-07-07 20:43:19] WARNING 'MT385461.1 uploaded by unknown@50f4c4f28070 from 3.89.224.155' (lugli-4zz18-nb1luabe2d62v9k) has validation errors:
  Testing <http://arvados.org/keep:67ad83a2d78ebaa3142fda891604051d+126/metadata.yaml> against shape https://raw.githubusercontent.com/arvados/bh20-seq-resource/master/bh20sequploader/bh20seq-shex.rdf#submissionShape
    Testing _:b1 against shape https://raw.githubusercontent.com/arvados/bh20-seq-resource/master/bh20sequploader/bh20seq-shex.rdf#submitterShape
    _:b1 context:
      <http://arvados.org/keep:67ad83a2d78ebaa3142fda891604051d+126/metadata.yaml> MainSchema:submitter _:b1 .
         _:b1 sio:SIO_000116 "Data Science" .
         _:b1 sio:SIO_000172 "Chan-Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158, USA" .

         No matching triples found for predicate obo:NCIT_C42781

ModuleNotFoundError: No module named 'qc_metadata'

I've installed bh20-seq-resource with a combination of pip and Guix. I ran 'guix environment --ad-hoc python curl python-pycurl' to get an environment with python3 and python-pycurl and then ran 'pip3 install --user git+https://github.com/arvados/bh20-seq-resource.git@master'. ~/.local/bin/bh20-seq-uploader --help gives me the output:

Traceback (most recent call last):
  File "/home/efraimf/.local/bin/bh20-seq-uploader", line 11, in <module>
    load_entry_point('bh20-seq-uploader==1.0.20200410122633', 'console_scripts', 'bh20-seq-uploader')()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2793, in load_entry_point
    return ep.load()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2411, in load
    return self.resolve()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2417, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/efraimf/.local/lib/python3.7/site-packages/bh20sequploader/main.py", line 9, in <module>
    import qc_metadata
ModuleNotFoundError: No module named 'qc_metadata'

Better sample id management

From discussion 27 August 2020:

  • sample_id should be the same in the metadata file and the FASTA header of an upload, and this should be validated.
  • If an uploaded, valid sequence has the same sample id as an existing validated sequence, copy the new sequence/metadata to the existing collection. Enable versioning on Arvados.
  • Cleaning up
    • Clean up existing sequences.
    • Merge based on sequence_label and take the latest (most recent created_at).
    • Revalidate previously validated samples that have invalid dates or specimen fields.
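
As a rough illustration of the first and fifth points, the sketch below compares the first token of a FASTA header with a sample id taken from the metadata, and keeps the newest record per sequence_label. The function names, the record layout (plain dicts with sequence_label and created_at keys), and the ISO 8601 timestamp format are assumptions made for this example, not the uploader's actual code or schema.

```python
from datetime import datetime


def fasta_sample_id(fasta_text):
    """Return the first whitespace-delimited token of the first FASTA header."""
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            return tokens[0] if tokens else ""
    raise ValueError("no FASTA header found")


def sample_ids_match(fasta_text, metadata_sample_id):
    """True when the FASTA header id equals the id given in the metadata."""
    return fasta_sample_id(fasta_text) == metadata_sample_id


def latest_per_label(records):
    """Keep, for each sequence_label, the record with the newest created_at.

    Assumes created_at is an ISO 8601 string without a trailing 'Z'.
    """
    latest = {}
    for rec in records:
        label = rec["sequence_label"]
        created = datetime.fromisoformat(rec["created_at"])
        if label not in latest or created > latest[label][0]:
            latest[label] = (created, rec)
    return [rec for _, rec in latest.values()]
```

A mismatch between the two ids would be a reason to reject the upload; how the id is read from metadata.yaml is left out here on purpose.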

Originally on the list, but I don't think we're doing this right now:

For namespacing identifiers, sample_id should be a URI. Add command line option to uploader to give URI prefix. Give instructions to put your institution's web page if you don't know what else to use. Validate that sample_id is a valid URI.
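
A minimal sketch of such a check, using only the standard library. It accepts only absolute http(s) URIs, which is narrower than the full URI grammar and is an assumption of this example; the helper names and the prefix handling are hypothetical and do not describe an existing uploader option.

```python
from urllib.parse import urlparse


def is_valid_uri(value):
    """Loose check: an absolute http(s) URI with a host part."""
    try:
        parsed = urlparse(value)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def namespaced_sample_id(sample_id, uri_prefix):
    """Prefix a bare sample id with uri_prefix; leave valid URIs untouched."""
    if is_valid_uri(sample_id):
        return sample_id
    return uri_prefix.rstrip("/") + "/" + sample_id
```

For example, namespaced_sample_id("S1", "https://example.org/samples/") yields "https://example.org/samples/S1", where example.org stands in for the institution's web page.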

Docker image needs 'latest' tag

Currently, pangenome-generate.cwl fails when trying to pull jerven/spodgi; looking at Docker Hub, the image does not have the latest tag. Explicitly pulling jerven/spodgi:0.0.5 fixes the problem.

A better solution would be to add the latest tag to the image on Docker Hub.

Automated uploads

We need to add a permaid (see #103). I think submitter and submission are missing too. I'll check.

Ontology for assemblies

At least:

  • pangenome from only de-novo assemblies (should be of higher quality)
  • pangenome from de-novo assemblies and read mapping experiments (reference biased)

[Build] Unable to install the package from Github

Description:

I'm getting the following after trying to install the package from GH:

Debug info:
Collecting git+https://github.com/arvados/bh20-seq-resource.git
  Cloning https://github.com/arvados/bh20-seq-resource.git to /tmp/pip-req-build-sx_tz19n
  Running command git clone -q https://github.com/arvados/bh20-seq-resource.git /tmp/pip-req-build-sx_tz19n
Collecting arvados-python-client
  Using cached arvados-python-client-2.0.2.tar.gz (182 kB)
Collecting schema-salad
  Using cached schema_salad-5.0.20200416112825-py3-none-any.whl (457 kB)
Collecting python-magic
  Using cached python_magic-0.4.15-py2.py3-none-any.whl (5.5 kB)
Collecting pyshex
  Using cached PyShEx-0.7.14-py3-none-any.whl (50 kB)
Collecting ciso8601>=2.0.0
  Using cached ciso8601-2.1.3.tar.gz (15 kB)
Collecting future
  Using cached future-0.18.2.tar.gz (829 kB)
Collecting google-api-python-client<1.7,>=1.6.2
  Using cached google_api_python_client-1.6.7-py2.py3-none-any.whl (56 kB)
Collecting httplib2>=0.9.2
  Using cached httplib2-0.17.3-py3-none-any.whl (95 kB)
Processing /home/bonface/.cache/pip/wheels/40/ae/bd/3e7d7af6588020c7e993f6f114fb708d966276dbc2f224d3f9/pycurl-7.43.0.5-cp38-cp38-linux_x86_64.whl
Collecting ruamel.yaml<=0.15.77,>=0.15.54
  Using cached ruamel.yaml-0.15.77.tar.gz (312 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/bonface/projects/bh20-seq-resource/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py'"'"'; __file__='"'"'/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-h7g8lchz/ruamel.yaml/pip-egg-info
         cwd: /tmp/pip-install-h7g8lchz/ruamel.yaml/
    Complete output (11 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 211, in <module>
        pkg_data = _package_data(__file__.replace('setup.py', '__init__.py'))
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 184, in _package_data
        data = literal_eval("".join(lines))
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 156, in literal_eval
        return _convert(node_or_string)
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 95, in _convert
        if isinstance(node, Str):
    NameError: name 'Str' is not defined
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

To Reproduce:

Steps to reproduce:

virtualenv --python python3 venv
. venv/bin/activate
pip install git+https://github.com/arvados/bh20-seq-resource.git

Expected Behaviour:

Should be able to install the package without any problems

Environment setup:

  • OS: Arch Linux
  • Python Version: python3.8.2

Add fastq support

Need to check the recompute of fastq from long reads and short reads. I think it is a good idea to focus on ONT initially.

Add metadata on workflows

We should capture the metadata on workflows somehow, as per @LLTommy's suggestion:

https://docs.dockstore.org/en/develop/advanced-topics/best-practices/best-practices.html

Make sure each CWL workflow of interest is listed in https://github.com/arvados/bh20-seq-resource/blob/master/.dockstore.yml (the pangenome-generate workflow is listed, as of 2020-10-06) -> https://dockstore.org/my-workflows/github.com/arvados/bh20-seq-resource/Pangenome%20Generator

Note, the pangenome-generator has also been published to https://workflowhub.eu/workflows/63

Compute HASH on inputs

When submitting a sequence and metadata, we can compute a hash value over the submission to make sure it is not already in the database. People will accidentally resubmit data, and there is no reason to trigger the pipeline again. Or, @tetron, is this automatic in Arvados?
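
One way this could look, as a sketch only: hash the sequence and metadata bytes together with SHA-256 and compare against digests already seen. The function names and the in-memory set of known digests are assumptions for illustration; whether Arvados already provides equivalent deduplication is the open question above.

```python
import hashlib


def submission_digest(sequence_bytes, metadata_bytes):
    """SHA-256 over both parts; each part is length-prefixed so that
    (b"ab", b"c") and (b"a", b"bc") produce different digests."""
    h = hashlib.sha256()
    for part in (sequence_bytes, metadata_bytes):
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


def is_duplicate(digest, seen_digests):
    """True when this submission digest has been recorded before."""
    return digest in seen_digests
```

Note that a byte-level hash treats whitespace or key-order changes in the metadata as a new submission; normalising the inputs first would be a separate design decision.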

Filter out list of sequences before building GFA

We need a script added to pangenome-generate-cwl to remove the following sequences from the public sequence resource prior to creating a GFA output:

00ef4c4427c0881a0030f7f400ce1ed0+123/sequence.fasta
1a191370cb868f80c824d93f9169599a+126/sequence.fasta
9e6fe32c3f7d281332ba958b5f62d109+123/sequence.fasta
bafb25a84fa5167d5a049fa43d607a44+126/sequence.fasta
9fe51f2847f3e8e3060c9ddebf3a41e5+123/sequence.fasta
d637278d9b95bbd1a5ef0bcd17a95c21+123/sequence.fasta
53fa57b401f3695feb0facf498f60871+123/sequence.fasta
392451211d0b7500ebaaa4e3182838be+123/sequence.fasta
bc7dcac01570c2fb81f16f76b98add9d+126/sequence.fasta
898c212f7a9d4984c382d782bad53fd4+123/sequence.fasta
f8001cec2144c59cbd851706b898ddfe+123/sequence.fasta
71063763aabd91e0b33d6861294bdff6+123/sequence.fasta
57dca4995c2186b11b67ab1cff0b005b+126/sequence.fasta
f95a298c57718bf290d9facdda59eb66+123/sequence.fasta
71da768110cd21ff99f5664bc335a4ec+126/sequence.fasta
06f5726c45483d0e8fdea3004f2c4adf+123/sequence.fasta
f9cea932bff8e83a2cb490c3bd694742+123/sequence.fasta
5914683bbe1ff047a163b3e57110f11b+126/sequence.fasta
27bb9a654a5f46e08888f55021d37b17+126/sequence.fasta
a9be2d60f66fd03a75418b40306ededc+126/sequence.fasta
aa1d1c497dabed0589c8ea6423179441+123/sequence.fasta
c6f8550cf6940591fea7de5f2159d88b+123/sequence.fasta
ab9c2241bda0599d20877ece1e1bc04e+126/sequence.fasta
5caa10de623c2384a31160c72a8f4f9c+126/sequence.fasta
0f24420528d58bff3468084aca3d7328+123/sequence.fasta
4887cadadce95997fed59d129e47b47b+126/sequence.fasta
e8e00929537a550b0989be12147d6241+126/sequence.fasta
7ebbc05a6949a6ce0637fa692af183ad+126/sequence.fasta
6566c86da5313159640092f16ac8a0cb+123/sequence.fasta
d04a38579335168796dd8d25f362ff8f+123/sequence.fasta
810d1e1012cbc4f63226159bd8b1fa08+123/sequence.fasta
4d40985616d6975a41a117c41fd38145+123/sequence.fasta
d2062c46515c5fffed7d27b95a9e32c9+126/sequence.fasta
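
A possible shape for such a script, assuming the exclusion list is kept as whitespace-separated text and the candidate sequences are available as a list of path strings in the same hash+size/sequence.fasta form. Both function names are hypothetical, and how the step would be wired into the CWL workflow is not shown.

```python
def load_blocklist(text):
    """Parse whitespace-separated entries into a set for fast lookup."""
    return set(text.split())


def filter_sequences(paths, blocklist):
    """Return the paths that are not on the blocklist, preserving order."""
    return [p for p in paths if p not in blocklist]


if __name__ == "__main__":
    blocked = load_blocklist(
        "00ef4c4427c0881a0030f7f400ce1ed0+123/sequence.fasta "
        "1a191370cb868f80c824d93f9169599a+126/sequence.fasta"
    )
    candidates = [
        "00ef4c4427c0881a0030f7f400ce1ed0+123/sequence.fasta",
        "ffffffffffffffffffffffffffffffff+123/sequence.fasta",  # made-up entry
    ]
    for path in filter_sequences(candidates, blocked):
        print(path)
```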

Add remark field to metadata

A free-form remark field is probably a good idea. It can lead to additions to the schema. I also think we can allow for additional RDF if people want that. At the very least, provide a proposal field.

Updated virtuoso instance

We have a script which can update virtuoso. It checks Arvados for updates of the metadata.ttl. I still need to:

  • Run as a CRON job
  • Clear the old graph before updates (requires permissions)
  • Add the update time stamp to the store
  • Perhaps bring in graph versioning

Allow for metadata updates

It is possible for the same sequence to get multiple metadata entries (e.g. for clinical outcome), so we need to be able to add metadata several times. Furthermore, it is important to be able to update existing metadata, I think simply by adding versions.

Add support for bulk uploads

To support bulk uploads we don't want to trigger the workflows at every step. I think we should have a switch that prevents the workflows from running on individual submissions. Just add the sequence and metadata to Keep.
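
The switch could be a plain command-line flag. The sketch below uses argparse; the flag name --no-workflow, the positional arguments, and the plan_actions helper are all hypothetical and only illustrate the control flow, not the uploader's real interface.

```python
import argparse


def build_parser():
    """Command-line parser with a hypothetical bulk-upload switch."""
    parser = argparse.ArgumentParser(description="Upload a sequence with metadata")
    parser.add_argument("metadata", help="path to the metadata file")
    parser.add_argument("sequence", help="path to the sequence file")
    parser.add_argument(
        "--no-workflow",
        action="store_true",
        help="store the submission in Keep without triggering analysis",
    )
    return parser


def plan_actions(args):
    """Return the steps that would run for the parsed arguments."""
    actions = ["store sequence and metadata in Keep"]
    if not args.no_workflow:
        actions.append("trigger analysis workflow")
    return actions
```

A bulk uploader would then pass --no-workflow for every file and start the workflows once at the end.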
