cancerit / dockstore-cgpmap



Mapping using PCAP

License: GNU Affero General Public License v3.0

Shell 19.64% Perl 31.31% Common Workflow Language 47.34% Dockerfile 1.71%

bwa-mem biobambam2

dockstore-cgpmap's People

gitter-badger ink1 jwkaiqi denis-yuen garyluu lutsik kathy-t issymeade elisabethgoldman

dockstore-cgpmap's Issues

Is unaligning BAMs necessary?

With dockstore-cgpmap_3.3.0.sif, is it necessary to unalign BAM files before mapping, or does your pipeline already cater for this? Empirically it does not seem to be required. Please can you clarify this, and whether it holds for both bwamem and bwamem2, with or without bwakit? Thanks, Marian

cgpmap on singularity

Dear Colleagues

I hope you are all well.

I would like to run 'cgpmap' on a compute cluster using singularity, with multiple cores to align genomes faster. Even if I specify multiple cores using the "--threads" parameter (e.g. 16), only one core is used once bwa starts. The first few steps of 'cgpmap' spawn multiple processes that collectively use more than one CPU, but no single process goes above 100% CPU. Bwa ends up using only one core and runs for a long time. On a standalone machine with Docker, I had no problem running bwa with 16 cores.

Have you encountered similar behavior with singularity before, and is there something I can do to use more cores for mapping?

Best
Dominik
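
One check that may help narrow this down is to compare how many CPUs a process is allowed to use inside the cluster job with how many the node has. This is a minimal sketch, assuming a Linux node with GNU coreutils `nproc`; `cpu_report` is a hypothetical helper name, and a low "allowed" count would only hint at a scheduler or affinity limit rather than prove it.

```shell
# Report the CPUs this process may run on (Linux, GNU coreutils assumed).
# `nproc` honours the CPU affinity mask; `nproc --all` reports installed CPUs.
cpu_report() {
  allowed=$(nproc)
  installed=$(nproc --all)
  echo "allowed=$allowed installed=$installed"
  # The kernel's view of the exact cores this process may be scheduled on.
  grep Cpus_allowed_list /proc/self/status || true
}

cpu_report
```

Running it once directly in the batch job and once through `singularity exec` with the same image shows whether the two environments see the same number of cores.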

Segmentation faults

Dear Colleagues,

I am recurrently getting errors when mapping ~80x WGS samples, using the singularity image. I wonder if you could advise any checks I could do on the process or parameters.

I am running the image similarly to:

export CGPMAP_VER=3.0.0
singularity pull docker://quay.io/wtsicgp/dockstore-cgpmap:$CGPMAP_VER
singularity exec \
  --workdir /.../workspace \
  --home /.../workspace:/home \
  --bind /.../ref/human:/var/spool/ref:ro \
  --bind /.../example_data/cgpmap/insilico_21:/var/spool/data:ro \
  dockstore-cgpmap-${CGPMAP_VER}.simg \
  ds-cgpmap.pl \
  -r /var/spool/ref/core_ref_GRCh37d5.tar.gz \
  -i /var/spool/ref/bwa_idx_GRCh37d5.tar.gz \
  -s SOMENAME \
  -t 12 \
  /var/spool/data/*.fastq

The errors I get are similar to this one:
more PCAP_Bwa_bwa_mem.2.err
+ set -o pipefail
+ /opt/wtsi-cgp/bin/bwa mem -v 1 -p -R \"@RG\tID:2\tSM:M00026\" -t 6 /home/reference_files/genome.fa /home/tmpMap_M00026/split/2/i.fastq.gz
+ /opt/wtsi-cgp/bin/reheadSQ -d /home/reference_files/genome.fa.dict
+ /opt/wtsi-cgp/biobambam2/bin/bamsort fixmate=1 inputformat=sam level=1 tmpfile=/home/tmpMap_M00026/bamsort.2_tmp O=/home/tmpMap_M00026/sorted/2_i.fastq.gz_sorted.bam outputthreads=5 calmdnm=1 calmdnmrecompindetonly=1 calmdnmreference=/home/reference_files/genome.fa
[V] Reading alignments from source.
[V] 1M
[V] 2M
[V] 3M
[V] 4M
[V] 5M
[V] 6M
[V] 7M
/home/tmpMap_M00026/logs/PCAP_Bwa_bwa_mem.2.sh: line 3: 19166 Broken pipe             /opt/wtsi-cgp/bin/bwa mem -v 1 -p -R \"@RG\tID:2\tSM:M00026\" -t 6 /home/reference_files/genome.fa /home/tmpMap_M00026/split/2/i.fastq.gz
     19167                       | /opt/wtsi-cgp/bin/reheadSQ -d /home/reference_files/genome.fa.dict
     19168 Segmentation fault      (core dumped) | /opt/wtsi-cgp/biobambam2/bin/bamsort fixmate=1 inputformat=sam level=1 tmpfile=/home/tmpMap_M00026/bamsort.2_tmp O=/home/tmpMap_M00026/sorted/2_i.fastq.gz_sorted.bam outputthreads=5 calmdnm=1 calmdnmrecompindetonly=1 calmdnmreference=/home/reference_files/genome.fa
Command exited with non-zero status 139
3594.45user 12.26system 9:55.32elapsed 605%CPU (0avgtext+0avgdata 6201544maxresident)k

On smaller fastq files the jobs sometimes complete without errors and sometimes fail. On larger fastq files they consistently fail, but after a variable amount of time.
I previously aligned a 40x WGS sample successfully using this code.

Any advice on how to successfully execute the mapping process is really appreciated. Kind regards
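
For reading the log above: a shell exit status greater than 128 conventionally means the process was ended by signal number (status - 128), so status 139 corresponds to signal 11, which on Linux is SIGSEGV. That matches the "Segmentation fault" reported against the bamsort process (19168); the "Broken pipe" on bwa (19166) is what an upstream writer typically sees once a downstream reader has died. A minimal sketch of the arithmetic, where `decode_status` is a hypothetical helper name:

```shell
# Decode a shell exit status into the signal that ended the process, if any.
decode_status() {
  status=$1
  if [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    # `kill -l N` prints the name of signal N without the SIG prefix.
    echo "status $status = 128 + $sig (SIG$(kill -l "$sig"))"
  else
    echo "status $status: not a signal exit"
  fi
}

decode_status 139   # on Linux prints: status 139 = 128 + 11 (SIGSEGV)
```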

alpine build

Attempt building on Alpine for a smaller, more secure image.

Alpine may not be possible due to the underlying software used, but it is worth a try.

Mate pairs and RNEXT, PNEXT fields in the result bam file

Hello Colleagues,

I realise that the source of my downstream problem is that the bam files aligned with cgpmap always have zeros in RNEXT and PNEXT.
Eg A00125:128:HG5TCDSXX:1:1101:11388:1141 0 X 74361818 60 151M * 0 0 ATAGAATATAATTAACAT...

I am guessing bwa does not see the reads from the two fastq files as pairs. I checked the read names in the two fastq files, and they match. E.g. matching lines from the two fastq files are:
Fastq1
@a00155:110:HGCVKDSXX:1:1101:16107:7435 1:N:0:ACGCACCT+GGTGAAGG
TTGGA...
Fastq2
@a00155:110:HGCVKDSXX:1:1101:16107:7435 2:N:0:ACGCACCT+GGTGAAGG
ATCCA

Could there be something wrong with my fastq files? Or with the way I am running cgpmap (shown below)?

I appreciate your help. Dominik

export CGPMAP_VER=3.0.0
singularity exec \
  --workdir /home/workspace \
  --home /home/dglodzik/:/home \
  --bind /home/dglodzik/referenceData:/var/spool/ref:ro \
  --bind /home/dglodzik/testData:/var/spool/data:ro \
  dockstore-cgpmap-${CGPMAP_VER}.simg \
  ds-cgpmap.pl \
  -r /var/spool/ref/core_ref_GRCh37d5.tar.gz \
  -i /var/spool/ref/bwa_idx_GRCh37d5.tar.gz \
  -s testSample \
  -t 6 \
  /var/spool/data/test_*.fastq.gz
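
A note on the example record in this issue: its FLAG (second SAM column) is 0. In the SAM format, bit 0x1 of FLAG marks a read as part of a multi-segment (paired) template, so a FLAG of 0 means the record was written as an unpaired read, which is consistent with RNEXT and PNEXT being unset. The sketch below tests that bit on a single tab-separated SAM record; `is_paired` is a hypothetical helper name and the sample record is shortened for illustration (SEQ and QUAL replaced by `*`).

```shell
# Report whether a SAM record's FLAG has the "paired" bit (0x1) set.
is_paired() {
  flag=$(printf '%s\n' "$1" | cut -f2)   # FLAG is the 2nd tab-separated field
  if [ $(( flag & 1 )) -eq 1 ]; then
    echo paired
  else
    echo unpaired
  fi
}

# Shortened record with the same FLAG/RNEXT/PNEXT pattern as the one in the issue.
rec=$(printf 'read1\t0\tX\t74361818\t60\t151M\t*\t0\t0\t*\t*')
is_paired "$rec"   # prints: unpaired
```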

HG38 build for bwamem2 references

samtools error

error while loading shared libraries: libhts.so.2: cannot open shared object file: No such file or directory

Please suggest how to deal with this error.

The complete error file follows.

/opt/wtsi-cgp/bin/samtools view -F 2816 -T /home/reference_files/genome.fa -u /home/tmpMap_Hu_333/links/Hu_333_normal.bam
/opt/wtsi-cgp/biobambam2/bin/bamtofastq exclude=QCFAIL,SECONDARY,SUPPLEMENTARY tryoq=1 gz=1 level=1 outputperreadgroup=1 outputperreadgroupsuffixF=_i.fq outputperreadgroupsuffixF2=_i.fq T=/home/tmpMap_Hu_333/bamtofastq.1 outputdir=/home/tmpMap_Hu_333/split/1 split=2000000000000
/opt/wtsi-cgp/bin/samtools: error while loading shared libraries: libhts.so.2: cannot open shared object file: No such file or directory
BgzfInflateBase::readData(): unexpected eof

/opt/wtsi-cgp/biobambam2/bin/../lib/libmaus2.so.2(_ZN8libmaus24util10StackTraceC1Ev+0x4c) [0x7f8224ff9f8c]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq(_ZN8libmaus29exception16LibMausExceptionC1Ev+0x20) [0x41a680]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x43e2c2]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x43e3f4]
/opt/wtsi-cgp/biobambam2/bin/../lib/libstdc++.so.6(_ZNSt15basic_streambufIcSt11char_traitsIcEE5uflowEv+0x26) [0x7f822445d416]
/opt/wtsi-cgp/biobambam2/bin/../lib/libstdc++.so.6(_ZNSi3getEv+0x66) [0x7f8224437f76]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x42fd1f]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x480721]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x481124]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x48772c]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x48e205]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x415f73]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x4169fe]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x41210a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f82219b4830]
/opt/wtsi-cgp/biobambam2/bin/bamtofastq() [0x412912]

Command exited with non-zero status 1
0.01user 0.10system 0:00.20elapsed 61%CPU (0avgtext+0avgdata 5360maxresident)k
6890inputs+16outputs (42major+2100minor)pagefaults 0swaps
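
The first failure in this log is the dynamic loader not finding libhts.so.2 for samtools; the bamtofastq "unexpected eof" follows because it received no input from the failed samtools step. One way to see which shared libraries a binary cannot resolve is `ldd`. This is a minimal sketch, assuming a glibc-based Linux image where `ldd` is available; `missing_libs` is a hypothetical helper name.

```shell
# List the shared libraries a binary needs but the dynamic loader cannot find.
# ldd prints unresolved entries as "<library> => not found".
missing_libs() {
  ldd "$1" 2>/dev/null | awk '/not found/ {print $1}'
}

# Example, to be run inside the container:
#   missing_libs /opt/wtsi-cgp/bin/samtools
```

An empty result means every dependency resolved in that environment; any name printed is a library the loader could not locate with the current search path.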

Versions for Bio::DB::HTS upgrade

Also add install of cgpBigWig

seems a wrong comment

GROUPINFO is supposed to be mmqc

dockstore-cgpmap/scripts/mapping.sh

Line 99 in 7e7c682

# if GROUPINFO set

Bulk of container build will be removed

In the near future the elements specific to PCAP-core will be built in a separate container which this one will add the "helper scripts" to. This is to aid in our internal testing cycle.

Pre-release for fragment+readQC

Create feature/fragementQc to link to pre-release of PCAP-core

Looks like parameter file needs update past 2.0.3

{
  "reference": {
    "path": "/path/to/core_ref_GrCh38.tar.gz",
    "class": "File"
  },
  "bwa_idx": {
    "path": "/path/to/bwa_idx_GrCh38.tar.gz",
    "class": "File"
  },
  "sample": "sim",
  "seq_in": [
    {
      "path": "/path/to/read_1.fastq",
      "class": "File"
    },
    {
      "path": "/path/to/read_2.fastq",
      "class": "File"
    }
  ],
  "out_bam": {
    "path": "/path/to/mapped.bam",
    "class": "File"
  },
  "bwa": " -Y -K 100000000",
  "mmqc": false,
  "mmqcfrac": 0.05
}

See conversation at https://gitter.im/ga4gh/dockstore?at=5d8beae234a7236bf5bef131

no intro on how to use the group meta yaml file and the tool does not validate yaml properly

When the input sequencing files are in fastq format, using the yaml file is tricky. For example, paired fastq file names have to end with _1.fq.gz and _2.fq.gz, but such conventions are not mentioned anywhere.

Our code does not validate the file properly either. A few cases:

If the SM tag is missing from the yaml file, cgpmap runs without complaint, but ignores the RG ID in the yaml file and generates a random one.
Identical fastq file names in the file do not trigger any error or warning.
When an input file is missing from the file, cgpmap does not complain.

I think it would be better to have a flag specifically for single-end fastqs. cgpmap would then assume inputs are paired-end and complain if they are neither interleaved nor paired; if the input really is single-end, the user would need to label it explicitly. Currently it just carries on silently with its own assumptions.
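
The kind of up-front checks described above could look roughly like the following. This is only a sketch and not how cgpmap or PCAP-core works: it does not parse the yaml file, it takes a plain list of fastq paths, it assumes the paired suffixes are `_1.fq.gz` and `_2.fq.gz`, and `check_fastq_inputs` is a hypothetical name.

```shell
# Sanity-check a list of paired fastq inputs before mapping.
# Prints one line per problem and returns non-zero if any was found.
check_fastq_inputs() {
  rc=0
  # The same path listed more than once.
  dups=$(printf '%s\n' "$@" | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "duplicate input(s): $dups"
    rc=1
  fi
  for f in "$@"; do
    # The file must exist on disk.
    [ -e "$f" ] || { echo "missing file: $f"; rc=1; }
    # Work out the expected mate name from the assumed suffix convention.
    case "$f" in
      *_1.fq.gz) mate="${f%_1.fq.gz}_2.fq.gz" ;;
      *_2.fq.gz) mate="${f%_2.fq.gz}_1.fq.gz" ;;
      *) echo "not named _1.fq.gz/_2.fq.gz: $f"; rc=1; continue ;;
    esac
    # The mate must also be in the input list.
    found=0
    for g in "$@"; do
      [ "$g" = "$mate" ] && found=1
    done
    [ "$found" -eq 1 ] || { echo "no mate listed for: $f"; rc=1; }
  done
  return "$rc"
}
```

Called with both mates of each pair it prints nothing and returns 0; a lone `_1.fq.gz`, a repeated path, or a path that does not exist each produces a message, so the run can stop instead of continuing on silent assumptions.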

gperftools - fast malloc seems to have been lost

Easy to lose, as it isn't needed to compile. It should be added to the final stage of the build:

apt-get install -yq libtcmalloc-minimal4

Update ref bundle

Build and add <idxbase>.alt to ref bundle.

https://github.com/lh3/bwa/tree/master/bwakit

run-gen-ref supports GRCh37d5: https://github.com/lh3/bwa/blob/master/bwakit/run-gen-ref#L6