cancerit / dockstore-cgpmap
License: GNU Affero General Public License v3.0
Mapping using PCAP
With dockstore-cgpmap_3.3.0.sif, is it necessary to unalign BAM files before mapping, or is this adequately catered for by your pipeline? Empirically it does not seem to be required. Please can you clarify this, and whether it is true for both bwamem and bwamem2, with or without bwakit? Thanks, Marian
Dear Colleagues
I hope you are all well.
I would like to run 'cgpmap' on a compute cluster using singularity, and I would like to use multiple cores to align genomes faster. Even if I specify multiple cores using the "--threads" parameter (e.g. 16), only one core is used when bwa is started. The first few steps of 'cgpmap' spawn multiple processes that collectively use more than one CPU, but no single process uses more than 100% of a CPU. Bwa ends up using only one core and runs for a long time. On a standalone machine with Docker, I had no problem running bwa with 16 cores.
Have you encountered similar behavior with singularity before, and is there something I can do to use more cores for mapping?
Best
Dominik
Dear Colleagues,
I am recurrently getting errors when mapping ~80x WGS samples, using the singularity image. I wonder if you could advise any checks I could do on the process or parameters.
I am running the image similarly to:
export CGPMAP_VER=3.0.0
singularity pull docker://quay.io/wtsicgp/dockstore-cgpmap:$CGPMAP_VER
singularity exec \
 --workdir /.../workspace \
 --home /.../workspace:/home \
 --bind /.../ref/human:/var/spool/ref:ro \
 --bind /.../example_data/cgpmap/insilico_21:/var/spool/data:ro \
 dockstore-cgpmap-${CGPMAP_VER}.simg \
 ds-cgpmap.pl \
 -r /var/spool/ref/core_ref_GRCh37d5.tar.gz \
 -i /var/spool/ref/bwa_idx_GRCh37d5.tar.gz \
 -s SOMENAME \
 -t 12 \
 /var/spool/data/*.fastq
The errors I get are similar to this one:
more PCAP_Bwa_bwa_mem.2.err
+ set -o pipefail
+ /opt/wtsi-cgp/bin/bwa mem -v 1 -p -R \"@RG\tID:2\tSM:M00026\" -t 6 /home/reference_files/genome.fa /home/tmpMap_M00026/split/2/i.fastq.gz
+ /opt/wtsi-cgp/bin/reheadSQ -d /home/reference_files/genome.fa.dict
+ /opt/wtsi-cgp/biobambam2/bin/bamsort fixmate=1 inputformat=sam level=1 tmpfile=/home/tmpMap_M00026/bamsort.2_tmp O=/home/tmpMap_M00026/sorted/2_i.fastq.gz_sorted.bam outputthreads=5 calmdnm=1 calmdnmrecompindetonly=1 calmdnmreference=/home/reference_files/genome.fa
[V] Reading alignments from source.
[V] 1M
[V] 2M
[V] 3M
[V] 4M
[V] 5M
[V] 6M
[V] 7M
/home/tmpMap_M00026/logs/PCAP_Bwa_bwa_mem.2.sh: line 3: 19166 Broken pipe /opt/wtsi-cgp/bin/bwa mem -v 1 -p -R \"@RG\tID:2\tSM:M00026\" -t 6 /home/reference_files/genome.fa /home/tmpMap_M00026/split/2/i.fastq.gz
 19167 | /opt/wtsi-cgp/bin/reheadSQ -d /home/reference_files/genome.fa.dict
 19168 Segmentation fault (core dumped) | /opt/wtsi-cgp/biobambam2/bin/bamsort fixmate=1 inputformat=sam level=1 tmpfile=/home/tmpMap_M00026/bamsort.2_tmp O=/home/tmpMap_M00026/sorted/2_i.fastq.gz_sorted.bam outputthreads=5 calmdnm=1 calmdnmrecompindetonly=1 calmdnmreference=/home/reference_files/genome.fa
Command exited with non-zero status 139
3594.45user 12.26system 9:55.32elapsed 605%CPU (0avgtext+0avgdata 6201544maxresident)k
On smaller fastq files the jobs sometimes complete without errors and sometimes fail. On larger fastq files they consistently fail, but after a variable amount of time.
I also previously successfully aligned a WGS 40x case using this code.
Any advice on how to successfully execute the mapping process is really appreciated. Kind regards
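As a reading aid for the log above: shells conventionally report a process killed by a signal as exit status 128 plus the signal number, so status 139 corresponds to signal 11 (SIGSEGV), which matches the "Segmentation fault" reported for bamsort. The "Broken pipe" on bwa is then probably a consequence of the downstream process dying rather than a separate fault, although the log alone does not prove that. A minimal sketch of the decoding, where `describe_exit_status` is a hypothetical helper and not part of cgpmap:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status.

    By shell convention, a status above 128 means the process was
    terminated by signal number (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


# Status taken from "Command exited with non-zero status 139" in the log.
print(describe_exit_status(139))
```

This only identifies which process crashed and how; it says nothing about why bamsort segfaulted (memory limits and temporary disk space are typical things to check on a cluster, but that is an assumption, not something the log shows).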
Attempt building on alpine for secure and small image.
Alpine may not be possible due to underlying software used but worth a try.
Hello Colleagues,
I realise that the source of my downstream problem is that the bam files aligned with cgpmap always have zeros in RNEXT and PNEXT.
Eg A00125:128:HG5TCDSXX:1:1101:11388:1141 0 X 74361818 60 151M * 0 0 ATAGAATATAATTAACAT...
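The record above can be inspected field by field. In the SAM format the second column is FLAG, whose 0x1 bit marks a read as paired, and columns 7 to 9 are RNEXT, PNEXT and TLEN. A minimal sketch, assuming the record is split on whitespace because it was pasted here with spaces (real SAM is tab-delimited); `mate_fields` is a hypothetical helper name:

```python
PAIRED_FLAG = 0x1  # SAM FLAG bit: template has multiple segments


def mate_fields(sam_line: str) -> dict:
    """Extract the pairing-related mandatory fields from one SAM record."""
    fields = sam_line.split()
    if len(fields) < 9:
        raise ValueError("expected at least 9 SAM fields")
    flag = int(fields[1])
    return {
        "qname": fields[0],
        "flag": flag,
        "paired": bool(flag & PAIRED_FLAG),
        "rnext": fields[6],
        "pnext": int(fields[7]),
        "tlen": int(fields[8]),
    }


# Record from the report; the sequence is truncated there and stays truncated here.
record = "A00125:128:HG5TCDSXX:1:1101:11388:1141 0 X 74361818 60 151M * 0 0 ATAGAATATAATTAACAT"
info = mate_fields(record)
print(info["paired"], info["rnext"], info["pnext"])
```

For this record FLAG is 0, so the paired bit is not set. That is consistent with the reads having been aligned as single-end, which fits the guess in the report.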
I am guessing bwa does not see the reads from the two fastq files as pairs. I checked the read names in the two fastq files, and they match. E.g. matching lines from the two fastq files are:
Fastq1
@a00155:110:HGCVKDSXX:1:1101:16107:7435 1:N:0:ACGCACCT+GGTGAAGG
TTGGA...
Fastq2
@a00155:110:HGCVKDSXX:1:1101:16107:7435 2:N:0:ACGCACCT+GGTGAAGG
ATCCA
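The manual read-name comparison described here can be automated. The sketch below is one way to do it, assuming well-formed four-line FASTQ records; `normalise` and `first_mismatch` are hypothetical helper names. It drops the comment after the first whitespace (the "1:N:0:..." part) and an optional "/1" or "/2" suffix before comparing names.

```python
import gzip
from itertools import zip_longest


def normalise(header: str) -> str:
    """Reduce a FASTQ header line to a bare read name."""
    parts = header.split()
    name = parts[0] if parts else ""
    if name.startswith("@"):
        name = name[1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def read_names(path: str):
    """Yield normalised read names from a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 0:  # header is the first line of each record
                yield normalise(line)


def first_mismatch(path1: str, path2: str):
    """Return (record_index, name1, name2) for the first difference, else None."""
    pairs = zip_longest(read_names(path1), read_names(path2))
    for index, (name1, name2) in enumerate(pairs):
        if name1 != name2:
            return index, name1, name2
    return None
```

A `None` result from `first_mismatch` means both files list the same reads in the same order. It does not by itself show that the mapper treated the two files as a pair.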
Could there be something wrong with my fastq files? Or with the way I am running cgpmap (shown below)?
I appreciate your help. Dominik
export CGPMAP_VER=3.0.0
singularity exec \
 --workdir /home/workspace \
 --home /home/dglodzik/:/home \
 --bind /home/dglodzik/referenceData:/var/spool/ref:ro \
 --bind /home/dglodzik/testData:/var/spool/data:ro \
 dockstore-cgpmap-${CGPMAP_VER}.simg \
 ds-cgpmap.pl \
 -r /var/spool/ref/core_ref_GRCh37d5.tar.gz \
 -i /var/spool/ref/bwa_idx_GRCh37d5.tar.gz \
 -s testSample \
 -t 6 \
 /var/spool/data/test_*.fastq.gz
error while loading shared libraries: libhts.so.2: cannot open shared object file: No such file or directory
Please suggest how to deal with this error.
The complete error file follows.
/opt/wtsi-cgp/biobambam2/bin/../lib/libmaus2.so.2(_ZN8libmaus24util10StackTraceC1Ev+0x4c) [0x7f8224ff9f8c]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq(_ZN8libmaus29exception16LibMausExceptionC1Ev+0x20) [0x41a680]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x43e2c2]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x43e3f4]
/opt/wtsi-cgp/biobambam2/bin/../lib/libstdc++.so.6(_ZNSt15basic_streambufIcSt11char_traitsIcEE5uflowEv+0x26) [0x7f822445d416]
/opt/wtsi-cgp/biobambam2/bin/../lib/libstdc++.so.6(_ZNSi3getEv+0x66) [0x7f8224437f76]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x42fd1f]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x480721]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x481124]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x48772c]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x48e205]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x415f73]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x4169fe]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x41210a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f82219b4830]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x412912]
Command exited with non-zero status 1
0.01user 0.10system 0:00.20elapsed 61%CPU (0avgtext+0avgdata 5360maxresident)k
6890inputs+16outputs (42major+2100minor)pagefaults 0swaps
Also add install of cgpBigWig
GROUPINFO is supposed to be mmqc
dockstore-cgpmap/scripts/mapping.sh, line 99 in 7e7c682
In the near future the elements specific to PCAP-core will be built in a separate container which this one will add the "helper scripts" to. This is to aid in our internal testing cycle.
{
  "reference": {
    "path": "/path/to/core_ref_GrCh38.tar.gz",
    "class": "File"
  },
  "bwa_idx": {
    "path": "/path/to/bwa_idx_GrCh38.tar.gz",
    "class": "File"
  },
  "sample": "sim",
  "seq_in": [
    {
      "path": "/path/to/read_1.fastq",
      "class": "File"
    },
    {
      "path": "/path/to/read_2.fastq",
      "class": "File"
    }
  ],
  "out_bam": {
    "path": "/path/to/mapped.bam",
    "class": "File"
  },
  "bwa": " -Y -K 100000000",
  "mmqc": false,
  "mmqcfrac": 0.05
}
See conversation at https://gitter.im/ga4gh/dockstore?at=5d8beae234a7236bf5bef131
When the input sequencing files are in fastq format, using the yaml file is tricky. For example, the paired fastq file names have to end with _1.fq.gz and _2.fq.gz, but these conventions are not mentioned anywhere.
Our code does not validate the file properly either. A few cases below:
If the SM tag is missing from the yaml file, cgpmap runs without complaint, but ignores the RG ID in the yaml file and generates a random one.
I think it would be better to have a flag option specifically for single-ended fastqs. Cgpmap would then assume that inputs are paired-end and complain if they are neither interleaved nor paired; if the input really is single-ended, the user would need to label it as such explicitly. Currently it just carries on silently with its own assumptions.
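The stricter behaviour proposed in this issue could look roughly like the sketch below. It is an illustration only and not cgpmap's code: `pair_fastqs` and its `single_end` flag are hypothetical, the accepted suffixes are an assumption (the issue only mentions _1.fq.gz and _2.fq.gz; .fastq and uncompressed files are added here), and interleaved input is not handled.

```python
import re
from collections import defaultdict

# Assumed naming convention: <stem>_1.fq.gz / <stem>_2.fq.gz (also .fastq, optional .gz).
PAIR_RE = re.compile(r"^(?P<stem>.+)_(?P<read>[12])\.(?:fq|fastq)(?:\.gz)?$")


def pair_fastqs(names, single_end=False):
    """Group FASTQ file names into (read1, read2) pairs, failing loudly otherwise."""
    if single_end:
        # The user has explicitly declared single-ended input.
        return [(name,) for name in sorted(names)]

    groups = defaultdict(dict)
    for name in names:
        match = PAIR_RE.match(name)
        if match is None:
            raise ValueError(f"{name}: does not follow the _1/_2 naming convention")
        groups[match["stem"]][match["read"]] = name

    pairs = []
    for stem in sorted(groups):
        reads = groups[stem]
        if set(reads) != {"1", "2"}:
            raise ValueError(f"{stem}: mate file is missing")
        pairs.append((reads["1"], reads["2"]))
    return pairs


print(pair_fastqs(["sample_2.fq.gz", "sample_1.fq.gz"]))
```

The design choice is to refuse ambiguous input instead of guessing: unpaired or oddly named files raise an error unless the caller opts in to single-ended handling.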
Easy to lose, as it isn't needed to compile. It should be added in the final stage of the build:
apt-get install -yq libtcmalloc-minimal4
Build and add <idxbase>.alt to the ref bundle.
https://github.com/lh3/bwa/tree/master/bwakit
run-gen-ref supports GRCh37d5: https://github.com/lh3/bwa/blob/master/bwakit/run-gen-ref#L6
0.4.2 fixes a problem in handling contig names that include ":", as found in GRCh38.
I want to use CRAM files as inputs. Where do I give CGPMAP the reference that the CRAM was mapped to? I cannot find this in your documentation; I'm sure I'm being very unobservant.
thanks,
Marian
For travis and quay
Need to expose this info. It is not needed in most cases, but it should be documented.
Works better with containers
Add info for use of CPU to override the actual CPU count when running on a cluster rather than in a docker instance.
To enable better support for running on traditional compute environments like singularity on LSF.
It is now possible to specify the maximum CPU count with the environment variable CPU; otherwise all CPU resources are potentially used, which is not farm friendly.
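The behaviour described here can be illustrated as follows. This is a sketch under stated assumptions and not the logic used inside the container: only the environment variable name CPU comes from the text above, while `resolve_threads` and the way it combines CPU with a requested thread count are hypothetical.

```python
import os


def resolve_threads(requested=None, environ=os.environ):
    """Pick a thread count, capped by the CPU environment variable if set.

    Assumption: without CPU, every core the machine reports may be used,
    which is the "not farm friendly" default described in the text.
    """
    cap = environ.get("CPU")
    available = int(cap) if cap else (os.cpu_count() or 1)
    if requested is None:
        return available
    return max(1, min(requested, available))


# A job scheduler granting 4 cores could export CPU=4 before starting the container.
print(resolve_threads(16, {"CPU": "4"}))
```

On a cluster the CPU value would normally be set to the number of cores requested from the scheduler, so the job cannot spread over cores it was not allocated.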