genialis / resolwe-bio

Bioinformatics pipelines for Resolwe

License: Apache License 2.0

Python 97.88% R 1.04% HTML 0.53% Shell 0.32% PLpgSQL 0.24%

resolwe-bio's Introduction

Resolwe Bioinformatics

Bioinformatics pipelines for the Resolwe dataflow package for the Django framework.

Docs & Help

Read about getting started and how to write processes in the documentation.

To chat with developers or ask for help, join us on Slack.

Install

Prerequisites

Make sure you have Python 3.6 installed on your system. If you don't have it yet, follow these instructions.

Resolwe requires PostgreSQL (9.4+). Many Linux distributions already include the required version of PostgreSQL (e.g. Fedora 22+, Debian 8+, Ubuntu 15.04+), so you can simply install it via your distribution's package manager. Otherwise, follow these instructions.

Additionally, installing some (indirect) dependencies from PyPI will require having a C compiler (e.g. GCC) as well as Python development files installed on the system.

Note

The preferred way to install the C compiler and Python development files is to use your distribution's packages, if they exist. For example, on a Fedora/RHEL-based system, that would mean installing gcc and python3-devel packages.

Using PyPI

pip install resolwe-bio

To install a pre-release, use:

pip install --pre resolwe-bio

Using source on GitHub

pip install --pre https://github.com/genialis/resolwe-bio/archive/<git-tree-ish>.tar.gz

where <git-tree-ish> can represent any commit SHA, branch name, tag name, etc. in Resolwe Bioinformatics' GitHub repository. For example, to install the latest Resolwe Bioinformatics from the master branch, use:

pip install --pre https://github.com/genialis/resolwe-bio/archive/master.tar.gz

Contribute

We welcome new contributors. To learn more, read the Contributing section of the documentation.

resolwe-bio's Issues

Sample API resource is inconsistent with the rest of the API

It would be more consistent with the rest of the API if the Sample resource were accessible via /api/sample, with annotated as a filter query argument. Queries would then look like this:

/api/sample?annotated=1 instead of /api/sample/annotated
/api/sample?annotated=0 instead of /api/sample/unannotated

This would make it more consistent for use in the GenJs frontend API.
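
The proposed mapping could be sketched as a small helper that rewrites the old paths into the filtered form. The function name and the pass-through behaviour for other paths are assumptions made for illustration; they are not part of the Resolwe API.

```python
from urllib.parse import urlencode

# Old-style endpoints from this issue and the filter value each one implies.
LEGACY_PATHS = {
    "/api/sample/annotated": 1,
    "/api/sample/unannotated": 0,
}


def to_filtered_url(path):
    """Rewrite a legacy Sample path to the proposed query-argument form.

    Hypothetical helper: any path that is not a legacy Sample path is
    returned unchanged.
    """
    if path not in LEGACY_PATHS:
        return path
    return "/api/sample?" + urlencode({"annotated": LEGACY_PATHS[path]})
```

With this sketch, `/api/sample/unannotated` becomes `/api/sample?annotated=0`, matching the second example above.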

Remove obsolete Mongo escaping

There is some left-over escaping for MongoDB syntax from the times that Resolwe used MongoDB. It should be removed as it may cause problems.

Function in question, which is used in some places and should be removed:

resolwe-bio/resolwe_bio/tools/utils.py

Lines 11 to 13 in 9518106

    
    def escape_mongokey(key):
        """Escape keys when serializing database entries."""
        return key.replace('$', u'\uff04').replace('.', u'\uff0e').replace(' ', '_')

Feature query fails for many genes

Issue moved from genialis/resolwe-bio-py#78

To reproduce, run:

import resdk
res = resdk.Resolwe(url='https://qa.genialis.com')

res.feature.filter(source="NCBI", query=range(300)) # works
res.feature.filter(source="NCBI", query=range(400)) # fails

Elasticsearch traceback:

Traceback:

File "/srv/genialis/venv/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response
  149.                     response = self.process_exception_by_middleware(e, request)

File "/srv/genialis/venv/lib/python2.7/site-packages/django/core/handlers/base.py" in get_response
  147.                     response = wrapped_callback(request, *callback_args, **callback_kwargs)

File "/srv/genialis/venv/lib/python2.7/site-packages/django/views/decorators/csrf.py" in wrapped_view
  58.         return view_func(*args, **kwargs)

File "/srv/genialis/venv/lib/python2.7/site-packages/rest_framework/viewsets.py" in view
  83.             return self.dispatch(request, *args, **kwargs)

File "/srv/genialis/venv/lib/python2.7/site-packages/rest_framework/views.py" in dispatch
  477.             response = self.handle_exception(exc)

File "/srv/genialis/venv/lib/python2.7/site-packages/rest_framework/views.py" in handle_exception
  437.             self.raise_uncaught_exception(exc)

File "/srv/genialis/venv/lib/python2.7/site-packages/rest_framework/views.py" in dispatch
  474.             response = handler(request, *args, **kwargs)

File "/srv/genialis/venv/lib/python2.7/site-packages/resolwe/elastic/viewsets.py" in list_with_post
  158.         return self.paginate_response(search)

File "/srv/genialis/venv/lib/python2.7/site-packages/resolwe/elastic/viewsets.py" in paginate_response
  120.         return Response(serializer.data)

File "/srv/genialis/venv/lib/python2.7/site-packages/rest_framework/serializers.py" in data
  725.         ret = super(ListSerializer, self).data

File "/srv/genialis/venv/lib/python2.7/site-packages/rest_framework/serializers.py" in data
  262.                 self._data = self.to_representation(self.instance)

File "/srv/genialis/venv/lib/python2.7/site-packages/rest_framework/serializers.py" in to_representation
  643.             self.child.to_representation(item) for item in iterable

File "/srv/genialis/venv/lib/python2.7/site-packages/elasticsearch_dsl/search.py" in __iter__
  233.         return iter(self.execute())

File "/srv/genialis/venv/lib/python2.7/site-packages/elasticsearch_dsl/search.py" in execute
  627.                     **self._params

File "/srv/genialis/venv/lib/python2.7/site-packages/elasticsearch/client/utils.py" in _wrapped
  69.             return func(*args, params=params, **kwargs)

File "/srv/genialis/venv/lib/python2.7/site-packages/elasticsearch/client/__init__.py" in search
  539.             doc_type, '_search'), params=params, body=body)

File "/srv/genialis/venv/lib/python2.7/site-packages/elasticsearch/transport.py" in perform_request
  327.                 status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)

File "/srv/genialis/venv/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py" in perform_request
  109.             self._raise_error(response.status, raw_data)

File "/srv/genialis/venv/lib/python2.7/site-packages/elasticsearch/connection/base.py" in _raise_error
  113.         raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)

Exception Type: TransportError at /api/kb/feature/search
Exception Value: TransportError(500, u'search_phase_execution_exception', u'maxClauseCount is set to 1024')
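
Until the server-side query is fixed, a client-side workaround might be to split the ID list into smaller batches and merge the results. The sketch below assumes a batch size of 300 only because that size is reported as working above; the helper names are hypothetical and `fetch` stands in for the actual `res.feature.filter(...)` call.

```python
def batched(values, size):
    """Yield consecutive lists of at most `size` items from `values`."""
    values = list(values)
    for start in range(0, len(values), size):
        yield values[start:start + size]


def filter_in_batches(fetch, ids, batch_size=300):
    """Call `fetch` once per batch of IDs and concatenate the results.

    `fetch` is any callable that takes a list of IDs, for example (assumed):
        lambda q: res.feature.filter(source="NCBI", query=q)
    """
    results = []
    for batch in batched(ids, batch_size):
        results.extend(fetch(batch))
    return results
```

This only keeps each request under the clause limit; it does not address why 400 IDs already expand to more than 1024 clauses.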

Fix Pylint warning about overridden create methods in feature and mapping view sets

The newest version of Pylint (1.8.3) detects that the parameters of the overridden create() in FeatureViewSet and MappingViewSet differ from those of the method in DRF:

linters runtests: commands[1] | pylint resolwe_bio .scripts/check_large_files.py
Using config file /var/lib/jenkins/jobs/genialis-github/jobs/resolwe-bio/branches/PR-519/workspace/.pylintrc
************* Module resolwe_bio.kb.views
W:110, 4: Parameters differ from overridden 'create' method (arguments-differ)
W:184, 4: Parameters differ from overridden 'create' method (arguments-differ)

------------------------------------------------------------------
Your code has been rated at 9.99/10 (previous run: 9.99/10, -0.00)

Probably, the overridden methods should also handle *args and **kwargs?
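
A minimal sketch of that idea, using a stand-in base class so it runs without DRF installed. The base class only mirrors the `create(self, request, *args, **kwargs)` signature of DRF's `CreateModelMixin`; the bodies are placeholders, not the actual view set code.

```python
class BaseViewSet:
    """Stand-in for the DRF base class; only the create() signature matters here."""

    def create(self, request, *args, **kwargs):
        return ("created", request, args, kwargs)


class FeatureViewSet(BaseViewSet):
    """Override that keeps the parent's signature, so arguments-differ is not raised."""

    def create(self, request, *args, **kwargs):
        # Custom handling would go here; extra arguments are passed through unchanged.
        return super().create(request, *args, **kwargs)
```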

Make our custom HTML page template work with Sphinx 1.6.1+

Building documentation with the latest version of Sphinx (1.6.1) fails with:

$ python setup.py build_sphinx --fresh-env --warning-is-error
running build_sphinx
Running Sphinx v1.6.1
loading intersphinx inventory from https://docs.python.org/3/objects.inv...
loading intersphinx inventory from https://resolwe.readthedocs.io/en/latest/objects.inv...
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 10 source files that are out of date
updating environment: 10 added, 0 changed, 0 removed
reading sources... [100%] ref                                                                                                               
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [ 10%] CHANGELOG                                                                                                          
Exception occurred:
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx_rtd_theme/layout.html", line 45, in top-level template code
    {% for cssfile in css_files %}
TypeError: 'NoneType' object is not iterable
The full traceback has been saved in /tmp/sphinx-err-453cydcj.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!

Full traceback in /tmp/sphinx-err-453cydcj.log is:

Traceback (most recent call last):
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx/setup_command.py", line 192, in run
    app.build(force_all=self.all_files)
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx/application.py", line 338, in build
    self.builder.build_update()
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx/builders/__init__.py", line 328, in build_update
    'out of date' % len(to_build))
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx/builders/__init__.py", line 394, in build
    self.write(docnames, list(updated_docnames), method)
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx/builders/__init__.py", line 431, in write
    self._write_serial(sorted(docnames))
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx/builders/__init__.py", line 440, in _write_serial
    self.write_doc(docname, doctree)
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx/builders/html.py", line 556, in write_doc
    self.handle_page(docname, ctx, event_arg=doctree)
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx/builders/html.py", line 940, in handle_page
    output = self.templates.render(templatename, ctx)
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx/jinja2glue.py", line 176, in render
    return self.environment.get_template(template).render(context)
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/jinja2/environment.py", line 1008, in render
    return self.environment.handle_exception(exc_info, True)
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/jinja2/environment.py", line 780, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/jinja2/_compat.py", line 37, in reraise
    raise value.with_traceback(tb)
  File "/home/tadej/Genialis/resolwe-bio/docs/_templates/page.html", line 3, in top-level template code
    {% set css_files = css_files + ["_static/css/custom.css"] %}
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx/themes/basic/page.html", line 10, in top-level template code
    {%- extends "layout.html" %}
  File "/home/tadej/.virtualenvs/resolwe-bio/lib/python3.5/site-packages/sphinx_rtd_theme/layout.html", line 45, in top-level template code
    {% for cssfile in css_files %}
TypeError: 'NoneType' object is not iterable

It looks like something is wrong in our docs/_templates/page.html file on line:

{% set css_files = css_files + ["_static/css/custom.css"] %}

Write new GO enrichment process

New process slug: go-enrichment.

Inputs:

ontology (data:ontology:obo)
gene_ids (list:basic:string)
source (basic:string)
gaf (data:gaf, optional)
pvalue_threshold (basic:decimal, default=0.1)
genes_in_term_threshold (basic:integer, default=1)

Algorithm

You will have to use resdk to access the GAF data object and map genes.

If the gaf input is not given:
    If the source input is equal to the source of any GAF data object on the platform (query data resources): download the GAF file of the latest data:gaf object (use resdk), then RUN GOTEA.
    If the source input has to be mapped: find out which of the GAF sources the gene IDs map to (by query), map them, then RUN GOTEA.

If the gaf input is given:
    If the gaf source is equal to the source input: RUN GOTEA.
    If the gaf source is not equal to the source input: try to map the gene_ids input to the gaf source.
        If nothing maps, raise the error “Input genes did not map to GAF gene ids.”
        If the mapping is successful: RUN GOTEA.

How should we implement the mappings: with queries? I suggest Domen works on the mapping part of the process.
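
The branching above could be sketched roughly as follows. All names are hypothetical: `map_ids` stands in for whatever knowledge-base query ends up doing the mapping, and raising the same error when no platform GAF source matches is an assumption, since the description only specifies the error for the case where a gaf input is given.

```python
def resolve_gene_ids(gene_ids, source, gaf_source=None,
                     platform_gaf_sources=(), map_ids=None):
    """Return (gaf_source_to_use, gene_ids_to_use) for the GOTEA run.

    `map_ids(gene_ids, source, target)` is a placeholder for the mapping
    query and returns the list of mapped IDs (empty if nothing maps).
    """
    map_ids = map_ids or (lambda ids, src, target: [])

    if gaf_source is None:
        # No GAF given: use a platform GAF with a matching source if one exists.
        if source in platform_gaf_sources:
            return source, list(gene_ids)
        # Otherwise try each platform GAF source until some IDs map.
        candidates = list(platform_gaf_sources)
    else:
        if gaf_source == source:
            return gaf_source, list(gene_ids)
        candidates = [gaf_source]

    for target in candidates:
        mapped = map_ids(gene_ids, source, target)
        if mapped:
            return target, list(mapped)
    raise ValueError("Input genes did not map to GAF gene ids.")
```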

Set the default log2(FC) value to 1 in DE object descriptor schema

Use `get_or_create` in `insert_features` management command

Using try/except is not recommended in scenarios where failure is expected, because handling exceptions takes much more time than an explicit check. It would be much faster to use a get_or_create query and check the second element of the returned tuple, which tells whether the object was created or not.
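
The pattern can be illustrated without a database. `Store` below is a made-up in-memory stand-in whose `get_or_create` mimics the `(object, created)` return contract of Django's `QuerySet.get_or_create`; the field names are assumptions and not taken from the actual command.

```python
class Store:
    """Made-up in-memory stand-in for a Django model manager."""

    def __init__(self):
        self._rows = {}

    def get_or_create(self, defaults=None, **lookup):
        """Return (row, created), like Django's QuerySet.get_or_create."""
        key = tuple(sorted(lookup.items()))
        if key in self._rows:
            return self._rows[key], False
        row = dict(lookup, **(defaults or {}))
        self._rows[key] = row
        return row, True


def insert_features(store, features):
    """Insert features and count how many were new versus already present."""
    created_count = existing_count = 0
    for feature in features:
        _, created = store.get_or_create(
            source=feature["source"],
            feature_id=feature["feature_id"],
            defaults={"name": feature["name"]},
        )
        if created:
            created_count += 1
        else:
            existing_count += 1
    return created_count, existing_count
```

In the real command, the call would go through the model's manager instead of `store`, and the second element of the tuple replaces the try/except around the insert.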

Bioconductor fails to build on Ubuntu 17.10 and 18.04

This problem affects resolwebio/base Docker image which our other images are derived from. Consequently, many processes cannot be updated. This is a known problem https://support.bioconductor.org/p/101833/. Even the official bioconductor/release-base Docker image uses an older version of R.

Improve and enable test_amplicon_report test

Currently, resolwe_bio.tests.processes.test_generate_report.ReportProcessorTestCase.test_amplicon_report test is disabled due to requiring a custom Docker image.

I think we need to improve the following:

The test is quite slow. It takes ~35s on my machine. I profiled the test line-by-line and obtained the following result:

Timer unit: 1e-06 s

Total time: 35.5463 s
File: /home/tadej/Genialis/resolwe-bio/resolwe_bio/tests/processes/test_generate_report.py
Function: test_amplicon_report at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                               @skipDockerFailure("Processor requires a custom Docker image.")
     9                                               @profile
    10                                               def test_amplicon_report(self):
    11         1       933503 933503.0      2.6          template = self.run_process('upload-file', {'src': 'report_template.tex'})
    12         1       898266 898266.0      2.5          logo = self.run_process('upload-file', {'src': 'genialis_logo.pdf'})
    13                                           
    14         1       936879 936879.0      2.6          bam = self.run_process('upload-bam', {'src': '56GSID_10k_trimmed.bam'})
    15         1       924495 924495.0      2.6          bed = self.run_process('upload-bed', {'src': '56g_targets_small.bed'})
    16                                           
    17         1      2076697 2076697.0      5.8          coverage = self.run_process('coveragebed', {'alignment': bam.id, 'bed': bed.id})
    18                                           
    19         1      2835931 2835931.0      8.0          genome = self.run_process('upload-genome', {'src': 'hs_b37_chr22_frag.fasta.gz'})
    20         1       890908 890908.0      2.5          bed_picard = self.run_process('upload-bed', {'src': '56g_targets_picard_small.bed'})
    21                                           
    22         1            4      4.0      0.0          inputs = {'src': 'Mills_and_1000G_gold_standard.indels.b37.chr22_small.vcf.gz'}
    23         1       905177 905177.0      2.5          indels = self.run_process('upload-variants-vcf', inputs)
    24                                           
    25         1       869405 869405.0      2.4          dbsnp = self.run_process('upload-variants-vcf', {'src': 'dbsnp_138.b37.chr22_small.vcf.gz'})
    26                                           
    27                                                   inputs = {
    28         1            5      5.0      0.0              'alignment': bam.id,
    29         1            1      1.0      0.0              'bed': bed_picard.id,
    30         1            1      1.0      0.0              'genome': genome.id,
    31         1            1      1.0      0.0              'known_indels': [indels.id],
    32         1            3      3.0      0.0              'known_vars': [dbsnp.id]
    33                                                   }
    34                                           
    35         1     21350158 21350158.0     60.1          preprocess_bam = self.run_process('vc-preprocess-bam', inputs)
    36                                           
    37                                                   report_inputs = {
    38         1            2      2.0      0.0              'bam': preprocess_bam.id,
    39         1            0      0.0      0.0              'coverage': coverage.id,
    40         1            2      2.0      0.0              'template': template.id,
    41         1            1      1.0      0.0              'logo': logo.id
    42                                                   }
    43                                           
    44         1      2924870 2924870.0      8.2          self.run_process('amplicon-report', report_inputs)

This revealed the following issues:

60 % of the time is spent running the vc-preprocess-bam process which is already covered by the resolwe_bio.tests.processes.test_variant_calling.VariantCallingTestCase.test_vc_preprocess_bam test.
Only 8 % of the time is actually spent calling the amplicon-report process which means the test could be a lot faster.

The test requires a custom Docker image only due to one of its prerequisites requiring a custom Docker image. The amplicon-report process doesn't actually require a custom Docker image.

I suggest you rewrite the test to only run the amplicon-report process and provide it with the 4 pre-computed inputs. This will speed up the test significantly and remove the requirement for a custom Docker image.

Fix resolwebio/utils Docker image

Currently, building this Docker image on Docker Hub fails with:

Downloading 'ncbi/sra-toolkit' version '2.8.2-1'...

Fetching 'https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz'...

Verifying package...

ERROR: SHA256 digest mismatch.

Here is an example build log.

Fix Diff Exp data names

Remove "Differential expression" from the data name.

Bug in Subread process (data:alignment:bam:subread)

resolwe-bio/resolwe_bio/processes/alignment/subread.yml

Lines 167 to 168 in f7a80b5

    
    -S {{ PE_options.reads_orientation }} \
    -S {{ PE_options.reads_orientation }}

The second line should be

-p {{ PE_options.consensus_subreads }}

Display sample names in the Cuffnorm box plot (not relation IDs)

One of the Cuffnorm outputs is a box plot. It should display sample names instead of replicate numbers.

The example is on BCM server: https://bcm.genialis.com/rna-seq/bioinformatics/collection/p4936_ms2_rna/data/slug/cuffnorm-p4936_rna_ms2_2hr_r2-p4936_rna_ms2_8hr_r3-p4936_rna_ms2_8hr_r2-p4936_rna_ms2_8hr_r1-2?_b=15c8ff39-d90b-487f-a40c-12912ba5c838&_s=36d81108-56f8-4d3d-8f88-995631216903

cuffnorm_exprs_boxplot (5).pdf

Update geneset processors and generators to remove duplicated genes

Processors:

create-geneset
upload-geneset
create-geneset-venn

Also add a warning if there are duplicated genes.
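
A minimal sketch of the deduplication step, assuming the genes arrive as a flat list of IDs and that the first occurrence of each gene should keep its position. The function name and warning text are illustrative only; the processes may need to report the warning through their own mechanism instead.

```python
import warnings


def remove_duplicate_genes(genes):
    """Return genes with duplicates dropped, keeping first occurrences in order."""
    genes = list(genes)
    # dict keys preserve insertion order (guaranteed from Python 3.7).
    unique = list(dict.fromkeys(genes))
    dropped = len(genes) - len(unique)
    if dropped:
        warnings.warn(
            "Removed {} duplicated gene(s) from the gene set.".format(dropped)
        )
    return unique
```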

Fix whitespace handling in filenames where "list:data:fastq:" objects are inputs

Make EnrichmentProcessorTestCase independent from external API server

Currently, EnrichmentProcessorTestCase needs to access the gene knowledge base on an external API server, which makes the test prone to failures and inconsistencies.
Replace this with Resolwe's custom LiveServerTestCase.

Prerequisites:

Implement Resolwe's custom LiveServerTestCase.
Finish implementation of Resolwe's elastic app.
Port Resolwe Bioinformatics gene knowledge base from Haystack to the new Resolwe's elastic app.
