Comments (6)

pjbriggs commented on July 21, 2024

Hello @shiltemann thanks for raising this issue.

If I understand correctly, your suggestion is to automatically set either -phred33 or -phred64 within the Trimmomatic tool based on the datatype of the input FASTQ. If so, I think this would be technically possible; however, I'm not sure how useful it would be in practice.

My reasoning for this is that Galaxy appears to offer only fastqsanger, fastqillumina, or fastqsolexa (alongside the generic fastq and .gz and .bz variants of each of these) - but according to https://en.wikipedia.org/wiki/FASTQ_format#Encoding there are subvariants of the Illumina FASTQ format which can be either Phred+33 (Illumina 1.8+) or Phred+64 (Illumina 1.3+ and 1.5+). As the Galaxy datatypes are unable to capture these FASTQ format variants, it would seem that knowing the type wouldn't generally be sufficient to determine the encoding within the tool.

However, I'm interested to know more about the specifics of the failure that you described. It sounds like you explicitly set a value for the quality encoding within the Trimmomatic tool (rather than leaving it as the default "nothing selected"); in this case I would have expected the presence of either -phred33 or -phred64 on the Trimmomatic command line to override the autodetection within the program itself. Since you say the autodetection failed, I'm not completely clear on what issue you experienced.

So please let me know what you think about my comments on automatically setting the encoding based on datatype. Also, if you could provide more details on the specific failure you encountered, that would be very helpful too. Thanks!

from galaxy-tools.

shiltemann commented on July 21, 2024

Thanks @pjbriggs!

Galaxy EU actually did not have the latest version of the tool installed, so the quality encoding parameter was not available to me and the tool always defaulted to autodetection. Trimmomatic would sometimes fail and sometimes succeed in autodetecting the encoding on the same dataset (I guess it does some random subsetting of reads for the determination?). In my case it could have deduced the encoding from the Galaxy datatype, and I assumed this was true in general (I thought this was the entire reason for the different sub-datatypes, actually), but I was not aware of the Illumina issue.

I guess the ideal solution would be for Galaxy to distinguish between Illumina 1.8+ and earlier versions in its FASTQ datatypes before this can work in a generic way.

I will ask Galaxy EU to update Trimmomatic to the latest version so I can explicitly set the encoding. That would at least solve my immediate issue (I am using it for a tutorial).

pjbriggs commented on July 21, 2024

Hello again @shiltemann

Thanks for the clarification re the quality encoding option not being available on the instance you were using. Hopefully if they can update the tool, then as you say at least you will have a workaround for the immediate issue.

I hadn't been aware of Trimmomatic's own encoding auto-detection failing before. However, I would assume, as you suggest, that it takes some subset of reads and tries to deduce the encoding from that, so if e.g. the subset isn't very representative of the rest of the FASTQ data then I can see it going wrong for sure.
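To illustrate how sampling-based detection could go wrong, here is a minimal sketch in Python. It is not Trimmomatic's actual detection code, which I have not checked; the function name `guess_phred_offset`, the sample size, and the two thresholds are assumptions chosen purely for illustration.

```python
def guess_phred_offset(quality_lines, sample_size=100):
    """Guess the quality offset (33 or 64) from the first few reads.

    Hypothetical heuristic, not Trimmomatic's implementation.
    Returns 33, 64, or None when the sample is ambiguous.
    """
    sample = quality_lines[:sample_size]
    codes = [ord(ch) for line in sample for ch in line]
    if not codes:
        return None
    # Characters below ';' (ASCII 59) fall outside both the Solexa+64
    # and Phred+64 ranges, so they point to an offset of 33.
    if min(codes) < 59:
        return 33
    # Characters above 'J' (ASCII 74) would mean Q > 41 under Phred+33,
    # which is not typical for Illumina 1.8+ data (assumption), so they
    # point to an offset of 64.
    if max(codes) > 74:
        return 64
    # Every character lies in the overlap of the two ranges.
    return None


print(guess_phred_offset(["IIII#III"]))  # contains a low-quality '#'
print(guess_phred_offset(["hhhhgggg"]))  # only high ASCII codes
print(guess_phred_offset(["IIIIHHHH"]))  # only high-quality Phred+33-like codes
```

With this heuristic, a sample that contains only high-quality bases (for example from a heavily filtered dataset) sits entirely in the overlap between the two encodings and cannot be classified, which is one plausible way for detection to fail on some subsets and succeed on others.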

Also, like you, I had assumed that there would be Galaxy datatypes to distinguish between different Illumina FASTQ encoding versions. I didn't do a thorough check so I could have it wrong - I only looked at what options the uploader on Galaxy main offered when I presented it with a FASTQ file. I'd imagine that all recent FASTQs would be Phred+33 encoded; however, the advice I generally give people is to run FastQC on their data first, as that should report the correct encoding (though I wonder now if that also ever goes wrong? ;).

I'm not sure if there's much more I can help you with at this time, however if there is something specific that you can think of then please let me know.

pjbriggs commented on July 21, 2024

I just stumbled on something that may be relevant, relating to the BioPython SeqIO module (https://biopython.org/wiki/SeqIO). Apologies in advance for the very wordy nature of what follows.

In the SeqIO documentation there is a table of file format names, which includes a few that map onto the Galaxy datatypes. Specifically there are fastq-sanger, fastq-solexa, and fastq-illumina formats, along with more specific explanations of what each of these is. In BioPython, fastq-sanger assumes Phred+33, whereas fastq-solexa and fastq-illumina both assume Phred+64.
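As a rough sketch of what those offsets mean in practice, the snippet below decodes a quality string under each format name. It uses only the standard library rather than BioPython itself; the `OFFSETS` table and the `decode_quality` helper are names made up for this illustration, and for fastq-solexa the decoded numbers are Solexa scores rather than true Phred scores.

```python
# ASCII offsets for the three SeqIO format names mentioned above.
OFFSETS = {
    "fastq-sanger": 33,
    "fastq-solexa": 64,    # values are Solexa scores, not Phred scores
    "fastq-illumina": 64,
}


def decode_quality(quality, fmt):
    """Convert a FASTQ quality string into integer scores (hypothetical helper)."""
    offset = OFFSETS[fmt]
    return [ord(ch) - offset for ch in quality]


# The same score of 40 is written as 'I' with offset 33 and 'h' with offset 64.
print(decode_quality("II", "fastq-sanger"))
print(decode_quality("hh", "fastq-illumina"))
# Reading offset-33 data with the wrong offset gives implausible scores.
print(decode_quality("II", "fastq-illumina"))
```

The last call shows why the choice matters: the same characters decode to very different scores depending on which offset is assumed.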

It seems likely to me that the datatypes in Galaxy were originally taken from SeqIO, and using these definitions would provide a clear mapping between datatypes and the quality encoding that could be used in the Trimmomatic tool. However, I still have a concern about the fastq-illumina/fastqillumina datatype, and whether a Galaxy user would know or otherwise assume that this refers to the older (Phred+64) format. For myself, prior to reading this I would have assumed the opposite, i.e. that fastqillumina would actually refer to the more recent (Phred+33) format.

Given the potential for confusion (and resulting datatype misassignment) sadly I feel it would be unwise to rely on the datatype within the Trimmomatic tool to set the encoding automatically. It seems to me that the "least bad" thing in this case is allowing the user to set the encoding explicitly if it's known, or else trust in Trimmomatic's automatic detection (which we acknowledge could fail)?
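A minimal sketch of that "least bad" behaviour is below. It is not the Galaxy tool wrapper itself; the function name `encoding_args` and the choice labels `"phred33"` and `"phred64"` are hypothetical, and only the `-phred33` and `-phred64` flags come from Trimmomatic.

```python
def encoding_args(user_choice=None):
    """Return the Trimmomatic encoding flag for an explicit user choice.

    Hypothetical helper: an empty list means no flag is passed, which
    leaves the decision to Trimmomatic's own autodetection.
    """
    flags = {"phred33": "-phred33", "phred64": "-phred64"}
    if user_choice is None:
        return []
    if user_choice not in flags:
        raise ValueError(f"unknown quality encoding: {user_choice!r}")
    return [flags[user_choice]]


print(encoding_args())           # nothing selected
print(encoding_args("phred33"))  # explicit choice
```

The Galaxy datatype is deliberately not an input here, so a misassigned fastqillumina dataset cannot silently select the wrong encoding.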

shiltemann commented on July 21, 2024

Thanks for the discussion and the info about SeqIO. I did not know this, but it does sound like that is what the Galaxy datatypes are based on. However, many people might rely on FastQC's encoding detection, which will give them values like Illumina 1.5 or Illumina 1.8+. These are different encodings, but I would not blame users for choosing fastqillumina in Galaxy in both cases, so it does seem unwise to rely on the datatype too much.

With the parameter to tell Trimmomatic explicitly which encoding to use, it should be possible to cover the case where autodetection fails (hopefully that is rare, and in my case perhaps due to heavily cleaned/subsetted training datasets).

pjbriggs commented on July 21, 2024

Thanks @shiltemann - sorry that it wasn't more straightforward!
