bimsbbioinfo / pigx_scrnaseq
Pipeline for analysis of Dropseq single cell data
Home Page: http://bioinformatics.mdc-berlin.de/pigx
loom2sce takes quite a lot of time and is called twice.
It can be optimized.
Currently, when jobs crash after cluster submission, nothing is printed to the log files; instead, we need to fish out the execution command and submit it manually before the log can be read.
We should figure out why.
The first question is whether the same jobs print logs on local submission.
The dropseq tools are written in Java and bundle a handful of unpackaged dependencies. Package them all for Guix.
Slots in the SingleCellExperiment object should be named.
Implement interactive exploration of single cell data with iSEE
https://github.com/csoneson/iSEE
Error messages for YAML formatting problems should be more descriptive, not just raw Python error messages.
Currently the pipeline outputs only the UMI matrix; it should also output the gene count matrix.
This requires adapting the get_umi_matrix rule.
Important stuff:
Hello,
I ran into this error while running your build tests of what is tagged as version 1.1.6.
$ cat /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Log/WT_HEK_4h_br1.hg19.convert_matrix_from_mtx_to_loom.log
beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
Parsing gene ids from gtf file /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Annotation/hg19/hg19.gtf
Reading input files ...
Counts
Features ...
Traceback (most recent call last):
File "/home/moeller/git/med-team/pigx/pigx-scrnaseq/scripts/convert_matrix_from_mtx_to_loom.py", line 102, in <module>
genes.columns = ['gene_id','gene_id2']
File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 5287, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 661, in _set_axis
self._data.set_axis(axis, labels)
File "/usr/lib/python3/dist-packages/pandas/core/internals/managers.py", line 177, in set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements
which was invoked in this context:
Error in rule convert_matrix_from_mtx_to_loom:
jobid: 0
output: /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_hg19_UMI.matrix.loom
log: /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Log/WT_HEK_4h_br1.hg19.convert_matrix_from_mtx_to_loom.log (check log file(s) for error message)
RuleException:
CalledProcessError in line 37 of /home/moeller/git/med-team/pigx/pigx-scrnaseq/scripts/Accessory_Functions.py:
Command 'set -euo pipefail; /usr/bin/python3 /home/moeller/git/med-team/pigx/pigx-scrnaseq/scripts/convert_matrix_from_mtx_to_loom.py --sample_id WT_HEK_4h_br1 --input_dir /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_Solo.out --gtf_file /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Annotation/hg19/hg19.gtf --star_output_types_keys Gene GeneFull Velocyto Velocyto --star_output_types_vals Counts GeneFull Spliced Unspliced --output_file /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_hg19_UMI.matrix.loom --sample_sheet_file /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/sample_sheet.csv --path_script /home/moeller/git/med-team/pigx/pigx-scrnaseq/scripts &> /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Log/WT_HEK_4h_br1.hg19.convert_matrix_from_mtx_to_loom.log' returned non-zero exit status 1.
File "/usr/lib/python3/dist-packages/snakemake/executors/__init__.py", line 2168, in run_wrapper
File "/home/moeller/git/med-team/pigx/pigx-scrnaseq/Snake_Dropseq.py", line 769, in __rule_convert_matrix_from_mtx_to_loom
File "/home/moeller/git/med-team/pigx/pigx-scrnaseq/scripts/Accessory_Functions.py", line 37, in print_shell
File "/usr/lib/python3/dist-packages/snakemake/executors/__init__.py", line 529, in _callback
File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
File "/usr/lib/python3/dist-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
File "/usr/lib/python3/dist-packages/snakemake/executors/__init__.py", line 2199, in run_wrapper
Exiting because a job execution failed. Look above for error message
Would you have any immediate idea for me on what to check or change? As a disclaimer, I am running this with Debian packages, not with Guix, but all dependencies should be very recent and build-tested themselves.
Many thanks and regards
Steffen
At the very least we should describe the format of the settings.yaml and the sample sheet.
The printshellcmds flag is recognized but doesn't actually do anything (as of the current tip of the cluster_compatibility branch; I didn't notice this until rebasing onto master).
Manually set parameters should override the parameters determined during the pipeline run.
Example: the number of genes per cell determined in the find_absolute_read_cutoff rule should be overridden by parameters in the config.
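One way to implement the override (a sketch; the config key names here are hypothetical, not taken from settings.yaml):

```python
def resolve_cutoff(config, computed_cutoff, key='reads_cutoff'):
    """Prefer a value set explicitly in the settings file; otherwise
    fall back to the cutoff computed by find_absolute_read_cutoff."""
    manual = config.get('execution', {}).get(key)
    return manual if manual is not None else computed_cutoff
```

The same pattern generalizes to any computed parameter: check the config first, compute only as a fallback.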
Develop parsers for the different single-cell technologies, which have different cell and UMI adapter locations in R1.
A parser should be able to convert this kind of data into Drop-seq-like adapters.
inDrop uses variable-length adapters. Adapter specification:
inDrop UMI description:
sequence W1 adapter: AAGGCGTCACAAGCAATCACTC
B1 anatomy: BBBBBBBB[BBB]WWWWWWWWWWWWWWWWWWWWWWCCCCCCCCUUUUUUTTTTTTTTTT______________
B = Barcode1, can be 8, 9, 10 or 11 bases long.
W = 'W1' sequence, specified below
C = Barcode2, always 8 bases
U = UMI, always 6 bases
T = Beginning of polyT tail.
_ = Either sequencing survives across the polyT tail, or the signal starts dropping off
(and bases start being anything, likely with poor quality)
minimal_polyT_len_on_R1 = 7
hamming_threshold_for_W1_matching = 3
w1 = "GAGTGATTGCTTGTGACGCCTT"
rev_w1 = "AAGGCGTCACAAGCAATCACTC" #Hard-code so we don't recompute on every one of millions of calls
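A minimal sketch of locating the W1 adapter in an inDrop R1 read, using the constants from the notes above (the function name is hypothetical; whether the forward or reverse-complement W1 sequence appears in R1 depends on the protocol, so the constant may need swapping). Since Barcode1 can be 8-11 bp, the parser scans the four possible offsets and accepts a match within the Hamming threshold:

```python
W1 = "GAGTGATTGCTTGTGACGCCTT"   # w1 constant from the spec above
HAMMING_THRESHOLD = 3

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def parse_indrop_r1(seq):
    """Split an inDrop R1 read into (barcode1, barcode2, umi), or None
    if the W1 adapter is not found within the Hamming threshold."""
    for bc1_len in (8, 9, 10, 11):            # Barcode1 is 8-11 bp
        candidate = seq[bc1_len:bc1_len + len(W1)]
        if len(candidate) == len(W1) and hamming(candidate, W1) <= HAMMING_THRESHOLD:
            rest = seq[bc1_len + len(W1):]
            return seq[:bc1_len], rest[:8], rest[8:14]  # BC2: 8 bp, UMI: 6 bp
    return None
```

A real parser would also enforce the minimal polyT length after the UMI and emit the fields in Drop-seq order; this only shows the variable-offset adapter search.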
There should be a bit more explanation in the reports. A couple of sentences explaining what is shown in the plots would be helpful. This could also be a simple enough update to get the interns started.
Adapt the pipeline to make a UMI matrix of intronic reads
Velocyto folder structure changed - spliced/unspliced matrices are now split into two files
Add set of whitelisted cell barcodes as an optional input to the pipeline
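A sketch of how the optional whitelist could be applied when building the matrix (function and parameter names are hypothetical):

```python
def whitelist_indices(barcodes, whitelist=None):
    """Indices of cell barcodes to keep. With no whitelist supplied,
    keep all barcodes; otherwise keep only whitelisted ones."""
    if whitelist is None:
        return list(range(len(barcodes)))
    allowed = set(whitelist)
    return [i for i, bc in enumerate(barcodes) if bc in allowed]
```

The returned indices can then be used to subset the matrix columns and the barcode annotation consistently.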
Append reports into rules for downstream analysis
The pipeline dies on the sort-bam step if tmp files exist.
Add a check for existing tmp files and delete them before sorting.
The error message when the genome file is not defined is never printed.
The pipeline should check whether the FASTA and GTF contain corresponding chromosome names; if not, it should fail.
Optionally, take the intersection of chromosome names.
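The check could look like this (a sketch; it assumes the chromosome names have already been extracted from the FASTA headers and the first GTF column):

```python
def check_chromosomes(fasta_names, gtf_names, allow_intersection=False):
    """Fail unless the GTF chromosome names are covered by the FASTA.
    With allow_intersection=True, return the shared names instead of
    failing (the optional behavior proposed above)."""
    fasta_names, gtf_names = set(fasta_names), set(gtf_names)
    missing = gtf_names - fasta_names
    if not missing:
        return gtf_names
    if allow_intersection:
        return gtf_names & fasta_names
    raise ValueError(
        f"GTF chromosomes missing from FASTA: {sorted(missing)}")
```

Failing early here saves a long mapping run that would otherwise produce empty counts.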
Hi BIMSB team,
I've been working on getting your pipeline up and running as a non-root user on our group's compute cluster. Unfortunately, installing Guix isn't an option in my case, but after some trial and error I was able to put together a conda environment with the listed dependencies that gets through the configure/make process.
However, when I go to run the pipeline on the test dataset, the pipeline fails with the following message:
./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml
Error in job convert_matrix_from_mtx_to_loom while creating output file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_hg19_UMI.matrix.loom.
RuleException:
CalledProcessError in line 37 of /users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/Accessory_Functions.py:
Command '/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/bin/python /users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/convert_matrix_from_mtx_to_loom.py --sample_id WT_HEK_4h_br1 --input_dir /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_Solo.out --gtf_file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Annotation/hg19/hg19.gtf --star_output_types_keys Gene GeneFull Velocyto Velocyto --star_output_types_vals Counts GeneFull Spliced Unspliced --output_file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_hg19_UMI.matrix.loom --sample_sheet_file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/sample_sheet.csv --path_script /users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts &> /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Log/WT_HEK_4h_br1.hg19.convert_matrix_from_mtx_to_loom.log' returned non-zero exit status 1.
File "/users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/Snake_Dropseq.py", line 769, in __rule_convert_matrix_from_mtx_to_loom
File "/users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/Accessory_Functions.py", line 37, in print_shell
File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
I then checked the log referenced in the error message:
cat /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Log/WT_HEK_0h_br1.hg19.convert_matrix_from_mtx_to_loom.log
Parsing gene ids from gtf file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Annotation/hg19/hg19.gtf
Reading input files ...
Counts
Features ...
Traceback (most recent call last):
File "/users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/convert_matrix_from_mtx_to_loom.py", line 102, in <module>
genes.columns = ['gene_id','gene_id2']
File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/site-packages/pandas/core/generic.py", line 5287, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/site-packages/pandas/core/generic.py", line 661, in _set_axis
self._data.set_axis(axis, labels)
File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 178, in set_axis
f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements
Looks like some issue with pandas, where there are 2 elements instead of the expected 3? I'm not as experienced with Python, so I'm having trouble figuring this out. Perhaps it's some dependency I missed during my manual local install? If you could provide any advice on what might be going on here or where to look for more troubleshooting info, it would be most appreciated. I'm running in a CentOS Linux 7 (Core) environment, with tools and packages installed by Conda where possible and manually otherwise.
Thanks for the help!
-Tom
Reset the htmlwidgets seed and respect SOURCE_DATE_EPOCH to reset timestamps.
Seurat has mitochondrial RNA content estimation; add it to the pipeline.
Write an R script that uses the UMI matrix and gene matrix to calculate read statistics (e.g. duplication rate).
Error discovered yesterday: something about how Java builds a virtual machine but can't allocate a stack of a certain size. If the memory values are too low, we simply run out of memory; if they are too high, we hit this error.
```
jobid: 0
output: /home/bosberg/projects/pigx_scrnaseq_masterbak/tests/out/Mapped/WT_HEK_4h_br1/WT_HEK_4h_br1.fastq.bam
log: /home/bosberg/projects/pigx_scrnaseq_masterbak/tests/out/Log/WT_HEK_4h_br1.merge_fastq_to_bam.log
```
Log file contents:
```
$ tail /home/bosberg/projects/pigx_scrnaseq_masterbak/tests/out/Log/WT_HEK_4h_br1.merge_fastq_to_bam.log
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VariantAccumulatorExecutor$MultiThreadedChunkBased$MultiException$1java.lang.NoClassDefFoundError: com/google/common/base/Function
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VariantIteratorProducer$Threadsafe$3java.lang.NoClassDefFoundError: com/google/common/base/Function
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VariantIteratorProducer$Threadsafe$NonUniqueVariantPredicatejava.lang.NoClassDefFoundError: com/google/common/base/Predicate
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VariantIteratorProducer$Threadsafe$OverlapsPredicatejava.lang.NoClassDefFoundError: com/google/common/base/Predicate
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VcfFileSegmentGenerator$1$1java.lang.NoClassDefFoundError: com/google/common/base/Predicate
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VcfFileSegmentGenerator$ByWholeContig$1java.lang.NoClassDefFoundError: com/google/common/base/Function
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VcfFileSegmentGenerator$WidthLimitingDecorator$1java.lang.NoClassDefFoundError: com/google/common/base/Function
[Thu Mar 22 17:56:54 CET 2018] picard.sam.FastqToSam FASTQ=/home/bosberg/projects/pigx_scrnaseq_masterbak/tests/sample_data/reads/HEK_4h_br1_R1.fastq.gz FASTQ2=/home/bosberg/projects/pigx_scrnaseq_masterbak/tests/sample_data/reads/HEK_4h_br1_R2.fastq.gz QUALITY_FORMAT=Standard OUTPUT=/home/bosberg/projects/pigx_scrnaseq_masterbak/tests/out/Mapped/WT_HEK_4h_br1/WT_HEK_4h_br1.fastq.bam SAMPLE_NAME=WT_HEK_4h_br1 SORT_ORDER=queryname USE_SEQUENTIAL_FASTQS=false READ_GROUP_NAME=A MIN_Q=0 MAX_Q=93 STRIP_UNPAIRED_MATE_NUMBER=false ALLOW_AND_IGNORE_EMPTY_LINES=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json```
It seems that all of the rules that run with Java have this problem, and the latest efforts at diagnosis indicate that it is related to the old kernel on the MAX cluster.
Check the permissions of all directories before running the scripts, and output a verbose error message.
Convert the SingleCellExperiment to a Seurat object and provide it as an additional output file.
The old m4 macro leads to configure errors under some circumstances. It should be updated, e.g. to the version used by pigx-rnaseq.
Check whether it would be possible to map the data using HISAT2; it has much lower memory requirements than STAR.
Hello there Pigx team!
I'm a junior bioinformatician in charge of our research group's analyses, and I'd like to test PiGx for analyzing some FASTQ files for RNA velocity estimation (I was originally going to use the Drop-seq pipeline). However, I'm having some issues installing it, since the GNU Guix installation was troublesome on our HPC cluster.
I then tried to install via conda. However, the environment.yml file is missing from the git source (cloned via git clone https://github.com/BIMSBbioinfo/pigx_scrnaseq.git). I need it to run conda env create -f environment.yml and proceed through the conda installation of PiGx.
Any chance this can receive an update?
Cheers!
When there are multiple files that belong to the same sample, merge them automatically.
Check whether the fastq file contains a barcode; if so, remove it before merging.
The STAR mm matrix for the Velocyto output has three columns of counts: spliced, unspliced, rest.
The mm parser from scipy does not allow such a format, so the Velocyto counts in the pigx output are currently wrong (they contain spliced instead of unspliced counts).