bimsbbioinfo / pigx_scrnaseq
Pipeline for analysis of Dropseq single cell data
Home Page: http://bioinformatics.mdc-berlin.de/pigx
loom2sce takes quite a lot of time and is called twice.
It can be optimized.
Currently, when jobs crash after cluster submission, nothing is printed to the log files; instead, we need to fish out the execution command and submit it manually before the log can be read.
We should figure out why.
The first question is whether the same jobs print logs on local submission.
The dropseq tools are written in Java and bundle a handful of unpackaged dependencies. Package them all for Guix.
Slots in the SingleCellExperiment object should be named.
Implement interactive exploration of single cell data with iSEE
https://github.com/csoneson/iSEE
Error messages for YAML formatting problems should be more descriptive, not just raw Python error messages.
Currently the pipeline outputs only the UMI matrix; it should also output the gene count matrix.
This requires adapting the get_umi_matrix rule.
Important stuff:
Hello,
I ran into this error while running your build tests of what is tagged as version 1.1.6.
$ cat /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Log/WT_HEK_4h_br1.hg19.convert_matrix_from_mtx_to_loom.log
beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
Parsing gene ids from gtf file /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Annotation/hg19/hg19.gtf
Reading input files ...
Counts
Features ...
Traceback (most recent call last):
File "/home/moeller/git/med-team/pigx/pigx-scrnaseq/scripts/convert_matrix_from_mtx_to_loom.py", line 102, in <module>
genes.columns = ['gene_id','gene_id2']
File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 5287, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 661, in _set_axis
self._data.set_axis(axis, labels)
File "/usr/lib/python3/dist-packages/pandas/core/internals/managers.py", line 177, in set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements
which was invoked in this context:
Error in rule convert_matrix_from_mtx_to_loom:
jobid: 0
output: /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_hg19_UMI.matrix.loom
log: /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Log/WT_HEK_4h_br1.hg19.convert_matrix_from_mtx_to_loom.log (check log file(s) for error message)
RuleException:
CalledProcessError in line 37 of /home/moeller/git/med-team/pigx/pigx-scrnaseq/scripts/Accessory_Functions.py:
Command 'set -euo pipefail; /usr/bin/python3 /home/moeller/git/med-team/pigx/pigx-scrnaseq/scripts/convert_matrix_from_mtx_to_loom.py --sample_id WT_HEK_4h_br1 --input_dir /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_Solo.out --gtf_file /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Annotation/hg19/hg19.gtf --star_output_types_keys Gene GeneFull Velocyto Velocyto --star_output_types_vals Counts GeneFull Spliced Unspliced --output_file /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_hg19_UMI.matrix.loom --sample_sheet_file /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/sample_sheet.csv --path_script /home/moeller/git/med-team/pigx/pigx-scrnaseq/scripts &> /home/moeller/git/med-team/pigx/pigx-scrnaseq/tests/out/Log/WT_HEK_4h_br1.hg19.convert_matrix_from_mtx_to_loom.log' returned non-zero exit status 1.
File "/usr/lib/python3/dist-packages/snakemake/executors/__init__.py", line 2168, in run_wrapper
File "/home/moeller/git/med-team/pigx/pigx-scrnaseq/Snake_Dropseq.py", line 769, in __rule_convert_matrix_from_mtx_to_loom
File "/home/moeller/git/med-team/pigx/pigx-scrnaseq/scripts/Accessory_Functions.py", line 37, in print_shell
File "/usr/lib/python3/dist-packages/snakemake/executors/__init__.py", line 529, in _callback
File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
File "/usr/lib/python3/dist-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
File "/usr/lib/python3/dist-packages/snakemake/executors/__init__.py", line 2199, in run_wrapper
Exiting because a job execution failed. Look above for error message
Would you have any immediate idea for me on what to check or change? As a disclaimer, I am running this with Debian packages, not with Guix, but all dependencies should be very recent and build-tested themselves.
Many thanks and regards
Steffen
At the very least we should describe the format of the settings.yaml and the sample sheet.
The printshellcmds flag is recognized but doesn't actually do anything (as of the current tip of the cluster_compatibility branch; I didn't notice this until rebasing onto master).
Manually set parameters should override the parameters determined during the pipeline run.
Example: the number of genes per cell determined in the find_absolute_read_cutoff rule should be overridden by parameters in the config.
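One way to implement the override (a sketch; the config key names here are hypothetical, not taken from settings.yaml):

```python
def resolve_cutoff(config, computed_cutoff, key='reads_cutoff'):
    """Prefer a value set explicitly in the settings file; otherwise
    fall back to the cutoff computed by find_absolute_read_cutoff."""
    manual = config.get('execution', {}).get(key)
    return manual if manual is not None else computed_cutoff
```

The same pattern generalizes to any computed parameter: check the config first, compute only as a fallback.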
Develop parsers for the different single-cell technologies, which have different cell and UMI adapter locations in R1.
A parser should be able to convert this kind of data into Drop-seq-like adapters.
inDrop uses variable-length adapters. Adapter specification:
inDrop UMI description:
sequence W1 adapter: AAGGCGTCACAAGCAATCACTC
B1 anatomy: BBBBBBBB[BBB]WWWWWWWWWWWWWWWWWWWWWWCCCCCCCCUUUUUUTTTTTTTTTT______________
B = Barcode1, can be 8, 9, 10 or 11 bases long.
W = 'W1' sequence, specified below
C = Barcode2, always 8 bases
U = UMI, always 6 bases
T = Beginning of polyT tail.
_ = Either sequencing survives across the polyT tail, or the signal starts dropping off
(and bases start being anything, likely with poor quality)
minimal_polyT_len_on_R1 = 7
hamming_threshold_for_W1_matching = 3
w1 = "GAGTGATTGCTTGTGACGCCTT"
rev_w1 = "AAGGCGTCACAAGCAATCACTC" #Hard-code so we don't recompute on every one of millions of calls
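A minimal sketch of locating the W1 adapter in an inDrop R1 read, using the constants from the notes above (the function name is hypothetical; whether the forward or reverse-complement W1 sequence appears in R1 depends on the protocol, so the constant may need swapping). Since Barcode1 can be 8-11 bp, the parser scans the four possible offsets and accepts a match within the Hamming threshold:

```python
W1 = "GAGTGATTGCTTGTGACGCCTT"   # w1 constant from the spec above
HAMMING_THRESHOLD = 3

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def parse_indrop_r1(seq):
    """Split an inDrop R1 read into (barcode1, barcode2, umi), or None
    if the W1 adapter is not found within the Hamming threshold."""
    for bc1_len in (8, 9, 10, 11):            # Barcode1 is 8-11 bp
        candidate = seq[bc1_len:bc1_len + len(W1)]
        if len(candidate) == len(W1) and hamming(candidate, W1) <= HAMMING_THRESHOLD:
            rest = seq[bc1_len + len(W1):]
            return seq[:bc1_len], rest[:8], rest[8:14]  # BC2: 8 bp, UMI: 6 bp
    return None
```

A real parser would also enforce the minimal polyT length after the UMI and emit the fields in Drop-seq order; this only shows the variable-offset adapter search.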
There should be a bit more explanation in the reports. A couple of sentences explaining what is shown in the plots would be helpful. This could also be a simple enough update to get the interns started.
Adapt the pipeline to make a UMI matrix of intronic reads
Velocyto folder structure changed - spliced/unspliced matrices are now split into two files
Add set of whitelisted cell barcodes as an optional input to the pipeline
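A sketch of how the optional whitelist could be applied when building the matrix (function and parameter names are hypothetical):

```python
def whitelist_indices(barcodes, whitelist=None):
    """Indices of cell barcodes to keep. With no whitelist supplied,
    keep all barcodes; otherwise keep only whitelisted ones."""
    if whitelist is None:
        return list(range(len(barcodes)))
    allowed = set(whitelist)
    return [i for i, bc in enumerate(barcodes) if bc in allowed]
```

The returned indices can then be used to subset the matrix columns and the barcode annotation consistently.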
Append reports into rules for downstream analysis
The pipeline dies on the sort-bam step if tmp files exist.
Add a check for existing tmp files and delete them before sorting.
The error message when the genome file is not defined is never printed.
The pipeline should check whether the FASTA and GTF contain corresponding chromosome names; if not, it should fail.
Optionally, take the intersection of chromosome names.
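The check could look like this (a sketch; it assumes the chromosome names have already been extracted from the FASTA headers and the first GTF column):

```python
def check_chromosomes(fasta_names, gtf_names, allow_intersection=False):
    """Fail unless the GTF chromosome names are covered by the FASTA.
    With allow_intersection=True, return the shared names instead of
    failing (the optional behavior proposed above)."""
    fasta_names, gtf_names = set(fasta_names), set(gtf_names)
    missing = gtf_names - fasta_names
    if not missing:
        return gtf_names
    if allow_intersection:
        return gtf_names & fasta_names
    raise ValueError(
        f"GTF chromosomes missing from FASTA: {sorted(missing)}")
```

Failing early here saves a long mapping run that would otherwise produce empty counts.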
Hi BIMSB team,
I've been working on getting your pipeline up and running as a non-root user on our group's compute cluster. Unfortunately, installing Guix isn't an option in my case, but after some trial and error I was able to put together a conda environment with the listed dependencies that gets through the configure/make process.
However, when I go to run the pipeline on the test dataset, the pipeline fails with the following message:
./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml
Error in job convert_matrix_from_mtx_to_loom while creating output file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_hg19_UMI.matrix.loom.
RuleException:
CalledProcessError in line 37 of /users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/Accessory_Functions.py:
Command '/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/bin/python /users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/convert_matrix_from_mtx_to_loom.py --sample_id WT_HEK_4h_br1 --input_dir /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_Solo.out --gtf_file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Annotation/hg19/hg19.gtf --star_output_types_keys Gene GeneFull Velocyto Velocyto --star_output_types_vals Counts GeneFull Spliced Unspliced --output_file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_hg19_UMI.matrix.loom --sample_sheet_file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/sample_sheet.csv --path_script /users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts &> /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Log/WT_HEK_4h_br1.hg19.convert_matrix_from_mtx_to_loom.log' returned non-zero exit status 1.
File "/users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/Snake_Dropseq.py", line 769, in __rule_convert_matrix_from_mtx_to_loom
File "/users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/Accessory_Functions.py", line 37, in print_shell
File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
I then checked the log referenced in the error message:
cat /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Log/WT_HEK_0h_br1.hg19.convert_matrix_from_mtx_to_loom.log
Parsing gene ids from gtf file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Annotation/hg19/hg19.gtf
Reading input files ...
Counts
Features ...
Traceback (most recent call last):
File "/users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/convert_matrix_from_mtx_to_loom.py", line 102, in <module>
genes.columns = ['gene_id','gene_id2']
File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/site-packages/pandas/core/generic.py", line 5287, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/site-packages/pandas/core/generic.py", line 661, in _set_axis
self._data.set_axis(axis, labels)
File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 178, in set_axis
f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements
Looks like some issue with pandas, where there are 2 elements instead of the expected 3? I'm not as experienced with Python, so I'm having trouble figuring this out. Perhaps it's some dependency I missed during my manual local install? If you could provide any advice on what might be going on here or where to look for more troubleshooting info, it would be most appreciated. I'm running in a CentOS Linux 7 (Core) environment, with tools and packages installed by Conda where possible and manually otherwise.
Thanks for the help!
-Tom
Reset the htmlwidgets seed and respect SOURCE_DATE_EPOCH to reset timestamps.
Seurat has mitochondrial RNA content estimation; add it to the pipeline.
Write an R script that uses the UMI matrix and gene matrix to calculate read statistics (e.g. duplication rate).
Error discovered yesterday: something about how Java builds a virtual machine but can't allocate a stack of a certain size. If the memory values are too low, we simply run out of memory; if they are too high, we hit this error.
```
jobid: 0
output: /home/bosberg/projects/pigx_scrnaseq_masterbak/tests/out/Mapped/WT_HEK_4h_br1/WT_HEK_4h_br1.fastq.bam
log: /home/bosberg/projects/pigx_scrnaseq_masterbak/tests/out/Log/WT_HEK_4h_br1.merge_fastq_to_bam.log
```
Log file contents:
```
$ tail /home/bosberg/projects/pigx_scrnaseq_masterbak/tests/out/Log/WT_HEK_4h_br1.merge_fastq_to_bam.log
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VariantAccumulatorExecutor$MultiThreadedChunkBased$MultiException$1java.lang.NoClassDefFoundError: com/google/common/base/Function
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VariantIteratorProducer$Threadsafe$3java.lang.NoClassDefFoundError: com/google/common/base/Function
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VariantIteratorProducer$Threadsafe$NonUniqueVariantPredicatejava.lang.NoClassDefFoundError: com/google/common/base/Predicate
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VariantIteratorProducer$Threadsafe$OverlapsPredicatejava.lang.NoClassDefFoundError: com/google/common/base/Predicate
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VcfFileSegmentGenerator$1$1java.lang.NoClassDefFoundError: com/google/common/base/Predicate
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VcfFileSegmentGenerator$ByWholeContig$1java.lang.NoClassDefFoundError: com/google/common/base/Function
DEBUG 2018-03-22 17:56:54 ClassFinder could not load class: picard.vcf.processor.VcfFileSegmentGenerator$WidthLimitingDecorator$1java.lang.NoClassDefFoundError: com/google/common/base/Function
[Thu Mar 22 17:56:54 CET 2018] picard.sam.FastqToSam FASTQ=/home/bosberg/projects/pigx_scrnaseq_masterbak/tests/sample_data/reads/HEK_4h_br1_R1.fastq.gz FASTQ2=/home/bosberg/projects/pigx_scrnaseq_masterbak/tests/sample_data/reads/HEK_4h_br1_R2.fastq.gz QUALITY_FORMAT=Standard OUTPUT=/home/bosberg/projects/pigx_scrnaseq_masterbak/tests/out/Mapped/WT_HEK_4h_br1/WT_HEK_4h_br1.fastq.bam SAMPLE_NAME=WT_HEK_4h_br1 SORT_ORDER=queryname USE_SEQUENTIAL_FASTQS=false READ_GROUP_NAME=A MIN_Q=0 MAX_Q=93 STRIP_UNPAIRED_MATE_NUMBER=false ALLOW_AND_IGNORE_EMPTY_LINES=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json```
It seems that all of the rules that run with Java have this problem, and the latest efforts at diagnosis indicate that it is related to the old kernel on the MAX cluster.
Check the permissions of all directories before running the scripts, and output a verbose error message.
Convert the SingleCellExperiment to a Seurat object and provide it as an additional output file.
The old m4 macro leads to configure errors under some circumstances. It should be updated, e.g. to the version used by pigx-rnaseq.
Check whether it would be possible to map the data using HISAT2; it has much lower memory requirements than STAR.
Hello there Pigx team!
I'm a junior bioinformatician in charge of our research group's analyses, and I'd like to test PiGx for analyzing some FASTQ files for RNA velocity estimation (I was originally going to use the Drop-seq pipeline). However, I'm having some issues installing it, since the GNU Guix installation was troublesome on our HPC cluster.
I then tried to install via conda. However, the environment.yml file is missing from the git source (cloned via git clone https://github.com/BIMSBbioinfo/pigx_scrnaseq.git). I need it to run conda env create -f environment.yml and proceed through the conda installation of PiGx.
Any chance this can receive an update?
Cheers!
When there are multiple files that belong to the same sample, merge them automatically.
Check whether the fastq file contains a barcode; if so, remove it before merging.
The STAR mm matrix for the Velocyto output has three columns of counts: spliced, unspliced, rest.
The mm parser from scipy does not allow such a format, so the Velocyto counts in the pigx output are currently wrong (they contain spliced instead of unspliced counts).