Comments (6)
Hello @shiltemann thanks for raising this issue.
If I understand correctly, your suggestion is to automatically set either -phred33 or -phred64 within the Trimmomatic tool based on the type of the input FASTQ. If so then I think this would be technically possible; however, I'm not sure how useful it would be in practice.
My reasoning for this is that Galaxy appears to offer only fastqsanger, fastqillumina, or fastqsolexa (alongside the generic fastq and .gz and .bz variants of each of these) - but according to https://en.wikipedia.org/wiki/FASTQ_format#Encoding there are subvariants of the Illumina FASTQ format which can be either Phred+33 (Illumina 1.8+) or Phred+64 (Illumina 1.3+ and 1.5+). As the Galaxy datatypes are unable to capture these FASTQ format variants, it would seem that knowing the type wouldn't generally be sufficient to determine the encoding within the tool.
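To make the Phred+33 versus Phred+64 distinction concrete, here is a minimal sketch of how one quality string decodes under each offset. The function name and the example string are hypothetical and chosen only for illustration.

```python
from typing import List


def decode_quality(quality: str, offset: int) -> List[int]:
    """Convert a FASTQ quality string to integer scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in quality]


# Hypothetical quality string, not taken from any real dataset.
quality = "hhhhIIII"

# The same characters give very different scores depending on the offset,
# which is why the encoding has to be known (or guessed) before trimming.
print(decode_quality(quality, 33))
print(decode_quality(quality, 64))
```

Under offset 33 the character I decodes to 40, whereas under offset 64 it decodes to 9, so a wrong assumption would change which bases fall below a quality threshold.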
However, I'm interested to know more about the specifics of the failure that you described. It sounds like you explicitly set a value for the quality encoding within the Trimmomatic tool (rather than leaving it as the default "nothing selected"); in this case I would have expected that the presence of either -phred33 or -phred64 on the Trimmomatic command line would have overridden the autodetection within the program itself. As you say the autodetection failed, I'm not completely clear on what the issue was that you experienced.
So please let me know what you think about my comments on automatically setting the encoding based on datatype. Also, if you could provide more details on the specific failure you encountered, that would be very helpful too. Thanks!
from galaxy-tools.
Thanks @pjbriggs!
Galaxy EU actually did not have the latest version of the tool installed, so the quality encoding parameter was not available to me and the tool always defaulted to autodetection. Trimmomatic would sometimes fail and sometimes succeed in autodetecting the encoding on the same dataset (I guess it does some random subsetting of reads for determination?). In my case it could have deduced the encoding from the Galaxy datatype, and I assumed this was true in general (I thought this was the entire reason for the different sub-datatypes, actually), but I was not aware of the Illumina issue.
I guess the ideal solution would be for Galaxy to distinguish between Illumina 1.8+ and earlier versions in its FASTQ datatypes, before this can work in a generic way.
I will ask Galaxy EU to update trimmomatic to the latest version so I can explicitly set the encoding and that would solve my immediate issue at least (using it for a tutorial).
from galaxy-tools.
Hello again @shiltemann
Thanks for the clarification re the quality encoding option not being available on the instance you were using. Hopefully if they can update the tool, then as you say at least you will have a workaround for the immediate issue.
I hadn't been aware of Trimmomatic's own encoding auto-detection failing before, however I would assume as you suggest that it takes some subset of reads and tries to deduce it from that - so if e.g. the subset isn't very representative of the rest of the FASTQ data then I can see it going wrong for sure.
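The failure mode described above can be sketched as follows. This is only an illustration of a subset-based guess and not Trimmomatic's actual detection logic, which is not described in this thread; the function name, the sample size, and the thresholds (59 and 74) are assumptions.

```python
from itertools import islice
from typing import Iterable, Optional


def guess_offset(quality_lines: Iterable[str], max_reads: int = 1000) -> Optional[int]:
    """Guess the ASCII offset from the first max_reads quality strings.

    Returns 33 or 64, or None when the sample does not settle the question.
    """
    codes = [ord(ch) for line in islice(quality_lines, max_reads) for ch in line.strip()]
    if not codes:
        return None
    if min(codes) < 59:
        # Characters below ';' (ASCII 59) cannot occur in a +64 encoding.
        return 33
    if max(codes) > 74:
        # Assumption: Phred+33 Illumina data rarely exceeds 'J' (score 41).
        return 64
    # Only characters in the overlapping range were seen: ambiguous.
    return None


# A sample containing low-quality bases is unambiguous...
print(guess_offset(["#5?DI", "IIIII"]))
# ...but a heavily cleaned sample with only high scores is not.
print(guess_offset(["IIIII", "GHIJJ"]))
```

A heavily filtered or subsetted dataset tends to look like the second case, which may be one reason a guess of this kind succeeds on some runs and fails on others.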
Also like you I had assumed that there would be Galaxy datatypes to distinguish between different Illumina FASTQ encoding versions. I didn't do a thorough check so I could have it wrong - I only looked at what options the uploader on Galaxy main offered when I presented it with a FASTQ file. I'd imagine that all recent FASTQs would be Phred33 encoded, however the advice I generally give people is to do FastQC on their data first as that should report the correct encoding (though I wonder now if that also ever goes wrong? ;).
I'm not sure if there's much more I can help you with at this time, however if there is something specific that you can think of then please let me know.
from galaxy-tools.
I just stumbled on something that may be relevant, relating to the BioPython SeqIO module (https://biopython.org/wiki/SeqIO). Apologies in advance for the very wordy nature of what follows.
In the SeqIO documentation there is a table of file format names, which includes a few that map onto the Galaxy datatypes. Specifically there are fastq-sanger, fastq-solexa, and fastq-illumina formats, along with more specific explanations of what each of these is. In BioPython, fastq-sanger assumes Phred+33, whereas fastq-solexa and fastq-illumina both assume Phred+64.
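If the Galaxy datatypes did follow these SeqIO definitions, the mapping onto the Trimmomatic options might look like the sketch below. The dictionary and function names are hypothetical, and the lookup is an assumption for illustration rather than anything the tool wrapper currently does.

```python
from typing import Optional

# Offsets as described above for the SeqIO format names, keyed by the
# Galaxy datatype spelling (no hyphen). Note that Solexa scores use a
# different scale from Phred scores, so the offset is not the whole story.
DATATYPE_OFFSETS = {
    "fastqsanger": 33,
    "fastqsolexa": 64,
    "fastqillumina": 64,
}


def phred_flag_for(datatype: str) -> Optional[str]:
    """Return '-phred33' or '-phred64' for a datatype name, or None if unknown."""
    name = datatype.lower().replace("-", "")  # accept 'fastq-sanger' too
    offset = DATATYPE_OFFSETS.get(name)
    if offset is None:
        # Generic 'fastq' (or anything unrecognised): leave the choice open.
        return None
    return "-phred{}".format(offset)


for datatype in ("fastqsanger", "fastq-illumina", "fastq"):
    print(datatype, phred_flag_for(datatype))
```

The generic fastq type deliberately maps to None here, since nothing in the datatype name indicates an encoding.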
It seems likely to me that the datatypes in Galaxy were originally taken from SeqIO, and using these definitions would provide a clear mapping between datatypes and the quality encoding that could be used in the Trimmomatic tool. However I still have a concern about the fastq-illumina/fastqillumina datatype, and whether a Galaxy user would know or otherwise assume that this refers to the older (Phred+64) format (for myself, prior to reading this I would have assumed the opposite, i.e. that fastqillumina would actually refer to the more recent (Phred+33) format).
Given the potential for confusion (and resulting datatype misassignment) sadly I feel it would be unwise to rely on the datatype within the Trimmomatic tool to set the encoding automatically. It seems to me that the "least bad" thing in this case is allowing the user to set the encoding explicitly if it's known, or else trust in Trimmomatic's automatic detection (which we acknowledge could fail)?
from galaxy-tools.
Thanks for the discussion and the info about SeqIO. I did not know this, but it does sound like that is what the Galaxy datatypes are based on. However, many people rely on FastQC's encoding detection, which gives them values like Illumina 1.5 or Illumina 1.8+. These are different encodings, yet I would not blame users for choosing fastqillumina in Galaxy in both cases, so it indeed seems unwise to rely on the datatype too much.
The parameter that tells Trimmomatic explicitly which encoding to use should be sufficient to cover the case where autodetection fails (hopefully that is rare, and in my case perhaps due to heavily cleaned/subsetted training datasets).
from galaxy-tools.
Thanks @shiltemann - sorry that it wasn't more straightforward!
from galaxy-tools.