mathiasbio commented on August 15, 2024

The first issue was observed during development of this PR: #1358

All Picard tools were crashing with errors like this:

INVALID_UNALIGNED_MATE_START
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR::INVALID_UNALIGNED_MATE_START:Record 6, Read name UMI-ATG-CTA-8ebe1c0d-A01901:175:H23G3DSX7:4:2103:25003:36119-D1, The unaligned mate start position is 10004, should be 0

This could be fixed by adding samtools collate and samtools fixmate.

But during further testing in the downstream branch, #1429, where more samples were included, I encountered more issues with the bamfiles. These only occurred for a few reads in a few samples.

These errors made those few samples fail in Picard again:

ERROR:INVALID_FLAG_FIRST_OF_PAIR
ERROR:INVALID_FLAG_SECOND_OF_PAIR

After running ValidateSamFile I discovered further errors in the bamfile. None of them seemed to affect variant calling or the Picard tools, so they may be irrelevant.

ERROR:MATE_CIGAR_STRING_INVALID_PRESENCE
WARNING:MISSING_TAG_NM

But I figured at this point it was best to be thorough, so I decided to aim for an error-free ValidateSamFile output.

The rare error turned out to be a bug that appears when Dedup has difficulty deciding whether two read-pairs come from the same molecule or not. This is what Don Freed wrote:

Hi Mathias,

Thank you for digging into this! I hope that you are having a nice weekend now ☺️

For cases like this, the software needs to choose between one of two unlikely scenarios; (1) the reads are derived from two distinct molecules but still have the same start position (when accounting for the soft-clips), insert size (when accounting for the soft-clips), and umi-tag. Or (2) the reads are derived from one molecule, but have a large sequence divergence. Both scenarios are unlikely, but the software chose to treat the reads as originating from the same molecule. My guess is the reads do originate from the same molecule, but there is a sequencing artifact that causes a large number of base errors at the end of one of the second read pairs.

This bug seems to be in the creation of the consensus read name. The software decides to create a consensus read, from both pairs of "A00621:509:H5LFJDSX2:4:1678:20600:4163" and "A00621:509:H5LFJDSX2:4:2677:19072:32111", but it gives the two ends of the consensus read different read names ("UMI-CTC-GCG-39d8d216-A00621:509:H5LFJDSX2:4:2677:19072:32111-D1" and "UMI-CTC-GCG-39d8d216-A00621:509:H5LFJDSX2:4:1678:20600:4163-D1" in my test). If these consensus reads had the same name, then the output file would be correct (although the output would still contain other incorrect mate information).

Best regards,
Don
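
Don's point about mismatched consensus read names can be checked directly on SAM text. The following is a minimal sketch and not part of the original pipeline: it groups primary records by read name and reports names for which only one end is present, which is what one would expect to see if the two ends of a consensus pair received different names. The function name find_orphan_names and the input format (SAM records as plain text lines, for example from samtools view) are assumptions made for illustration.

```python
from collections import defaultdict

FIRST = 0x40           # first segment in the template
SECOND = 0x80          # last segment in the template
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment


def find_orphan_names(sam_lines):
    """Return sorted read names that lack either a first or a second end.

    Hypothetical helper: expects SAM records as text lines and only
    considers primary alignments.
    """
    ends = defaultdict(set)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        if flag & FIRST:
            ends[qname].add("first")
        if flag & SECOND:
            ends[qname].add("second")
    # A complete pair has both ends recorded under the same name.
    return sorted(n for n, seen in ends.items() if seen != {"first", "second"})
```

If both "...2677:19072:32111-D1" and "...1678:20600:4163-D1" show up in the result for the same UMI prefix, that would be consistent with the naming bug described in the email.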

In the end I added the following post-processing steps to the TGA bamfiles after Dedup with UMI consensus collapse:

samtools collate -@ {threads} -O -u {input.bam} {params.tmpdir}/collate/{wildcards.sample}_collate_tmp |
samtools fixmate -O SAM --reference {input.ref} -@ {threads} - - |
awk -f {params.postprocess_fixmate_script} - |
samtools sort -@ {threads} -m 4G -O BAM -T {params.tmpdir}/sort/{wildcards.sample}_sort_tmp - |
samtools calmd -@ {threads} -b - {input.ref} > {output.bam} ;
samtools index {output.bam} ;

Explained briefly:

  • collate, because samtools fixmate needs name-grouped input
  • fixmate, to fix the first issue (INVALID_UNALIGNED_MATE_START)
  • an awk script to add the read-pair flag to reads that carry read-pair information, and to remove the cause of MATE_CIGAR_STRING_INVALID_PRESENCE, which samtools fixmate introduces (but Picard doesn't like...)
  • sort on position, then run calmd to add the missing NM fields
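
The awk script itself is not shown in this comment. As an illustration only, the two record-level fixes described above could look roughly like the sketch below. The function name fix_record is hypothetical, and both conditions are assumptions: the paired flag (0x1) is set whenever the first-of-pair (0x40) or second-of-pair (0x80) flag is present, and the mate CIGAR tag (MC) is dropped when the read is unpaired or its mate is flagged as unmapped (0x8). The real script may use different rules.

```python
PAIRED = 0x1
MATE_UNMAPPED = 0x8
FIRST = 0x40
SECOND = 0x80


def fix_record(line):
    """Return a SAM record (without trailing newline) with consistent pair flags.

    Hypothetical sketch; header lines are returned unchanged.
    """
    line = line.rstrip("\n")
    if line.startswith("@"):
        return line
    fields = line.split("\t")
    flag = int(fields[1])
    # Assumption: a read marked first/second of pair must also be marked paired.
    if flag & (FIRST | SECOND):
        flag |= PAIRED
    fields[1] = str(flag)
    # Assumption: MC is only valid on a paired read whose mate is mapped.
    if not (flag & PAIRED) or (flag & MATE_UNMAPPED):
        optional = [f for f in fields[11:] if not f.startswith("MC:Z:")]
        fields = fields[:11] + optional
    return "\t".join(fields)
```

In the pipeline above, a step like this sits between fixmate (which writes SAM text) and sort, so it only needs to read and write one line at a time.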
