Generating interleaved FASTQ for bam2FastQ? about bamutil HOT 5 CLOSED

statgen commented on August 13, 2024

Generating interleaved FASTQ for bam2FastQ?

from bamutil.

Comments (5)

mktrost commented on August 13, 2024

It was not something I was planning, but if it would be useful, it
shouldn't be too difficult to add it.

Just a few questions (probably the hardest part of implementing it):

Would both ends as well as any unpaired reads (if any) all get written to 1
file in this case?
Based on my quick read of the bwa page, I think this is a no.
Would a user want the unpaired reads at all?
Should there be an option to disable outputing unpaired reads?

Filename extension question:
Typically the naming convention is:
first in pair is outBase_1.fastq
2nd in pair is outBase_2.fastq
unpaired in pair is outBase.fastq
What would you want the naming convention to be for the interleaved file
(if not written to stdout)?

I think that it makes sense to always write the first in pair first. The
logic already supports this.

Would you like a flag, like --interleave? I'm open to flag name
suggestions.
Or would you rather just specify --firstOut & --secondOut to be the same
filename?
Either way should be easy enough to implement (and maybe both should be
allowed).

I think for implementation, it would be checking the flag/filenames and if
any are the same, then only call ifopen once and set the pointers, then
only close once.

Thanks for the recommendation. Below I have some roughed out code that I
am currently thinking might work. That way if I go to implement it later I
remember what I am thinking now or if you want to implement it, you can see
what I'm thinking.

Mary Kate

If you just do --firstOut - --secondOut -
and add the following code, it might work

Instead of the current ifopen logic, use:

myUnpairedFile = ifopen(unpairedOut, "w");
if(myFirstFile == myUnpairedFile)
{
    myFirstFile = myUnpairedFile;
}
else
{
    myFirstFile = ifopen(firstOut, "w");
}
if(mySecondFile == myUnpairedFile)
{
    mySecondFile = myUnpairedFile;
}
else if(mySecondFile == myFirstFile)
{
    mySecondFile = myFirstFile;
}
else
{
    mySecondFile = ifopen(secondOut, "w");
}

Or if it never makes sense to interleave unpaired, then just check the
first & second files against each other.

Then I'd make a closeFiles() method:

if((mySecondFile != myFirstFile) && (mySecondFile != myUnpairedFile) &&
(mySecondFile != NULL))
{
ifclose(mySecondFile);
mySecondFile = NULL;
}
if((myFirstFile != myUnpairedFile) && (mySecondFile != NULL))
{
ifclose(myFirstFile);
myFirstFile = NULL;
}
if(myUnpairedFile != NULL)
{
ifclose(myUnpairedFile);
myUnpairedFile = NULL;
}

Then call closeFiles() in the destructor and at the end of execute rather
than manually closing each one.

On Fri, Sep 20, 2013 at 3:15 PM, Alex Lancaster [email protected]:

Any chance of adding the option to bam2FastQ to generate an interleaved
FASTQ file for paired-end reads? Currently it defaults to 2 FASTQ files.
Some tools (e.g. bwa) will handle this form of data, e.g. the "-p" option
for the "bwa mem" command (http://bio-bwa.sourceforge.net/bwa.shtml). It
also makes it a lot easier to pipe the output from bam2FastQ directly into
bwa without writing intermediate files. If it's not a feature that's
currently planned, I'd be willing to look into what would be required to
create a patch. Thanks!

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/10
.

from bamutil.

alexlancaster commented on August 13, 2024

I think it would have to put unpaired reads in perhaps a separate file by default, maybe add an option to append them? For naming convention maybe outBase_interleaved.fastq by default? Yes, definitely the first pair first.

I am also testing "htscmd bam2fq" which is part of https://github.com/samtools/htslib which also generates interleaved FASTQ, but in testing it seems it does not necessarily interleave them using the (2i), (2i+1) convention as outlined in the "bwa mem" docs. In other words, if I use "bam2FastQ", I get the following:

$ grep '^\@F' 111582.test_1.fastq |head -2
@FCC0837ACXX:5:2307:12026:198869#/1
@FCC0837ACXX:5:1202:10619:55648#/1
$ grep '^\@F' 111582.test_2.fastq |head -2
@FCC0837ACXX:5:2307:12026:198869#/2
@FCC0837ACXX:5:1202:10619:55648#/2

whereas using the output from "htscmd bam2fq -a" on the same original bam I get this:

$ grep '^\@F' 111582.test.sorted.interleaved.fastq |head -4
@FCC0837ACXX:5:1307:4114:17474#/2
@FCD0JP0ACXX:6:1101:11361:95577#/1
@FCC0837ACXX:5:1101:8706:79404#/2
@FCD0JP0ACXX:6:1104:14809:174913#/2

so ideally Bam2FastQ would interleave them correctly.

from bamutil.

mktrost commented on August 13, 2024

Ok,

I have updated bam2Fastq. Can you do a "git pull" to pull in the latest
code?

I added a --merge option. (--interleave didn't work due to a naming
conflict with the --in option).
It produces a file with _interleaved.fastq extension.

You can change the name of the interleaved file by using --firstOut.

bin/bam bam2FastQ --in input.sam --merge
Should create: input.fastq (unpaired) & input_interleaved.fastq (1st & 2nd
in pair)

So if you want interleaved written to stdout:

bin/bam bam2FastQ --in input.sam --firstOut - --merge

It will write input.fastq containing the unpaired files.

Let me know if this works for you, if you have any questions, or
suggestions to improve this new feature.

Mary Kate

On Thu, Sep 26, 2013 at 4:46 PM, Alex Lancaster [email protected]:

I think it would have to put unpaired reads in perhaps a separate file by
default, maybe add an option to append them? For naming convention maybe
outBase_interleaved.fastq by default? Yes, definitely the first pair first.

I am also testing "htscmd bam2fq" which is part of
https://github.com/samtools/htslib which also generates interleaved
FASTQ, but in testing it seems it does not necessarily interleave them
using the (2i), (2i+1) convention as outlined in the "bwa mem" docs. In
other words, if I use "bam2FastQ", I get the following:

$ grep '^@f' 111582.test_1.fastq |head -2
@FCC0837ACXX:5:2307:12026:198869#/1
@FCC0837ACXX:5:1202:10619:55648#/1
$ grep '^@f' 111582.test_2.fastq |head -2
@FCC0837ACXX:5:2307:12026:198869#/2
@FCC0837ACXX:5:1202:10619:55648#/2

whereas using the output from "htscmd bam2fq -a" on the same original bam
I get this:

$ grep '^@f' 111582.test.sorted.interleaved.fastq |head -4
@FCC0837ACXX:5:1307:4114:17474#/2
@FCD0JP0ACXX:6:1101:11361:95577#/1
@FCC0837ACXX:5:1101:8706:79404#/2
@FCD0JP0ACXX:6:1104:14809:174913#/2

so ideally Bam2FastQ would interleave them correctly.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/10#issuecomment-25202307
.

from bamutil.

alexlancaster commented on August 13, 2024

Great!!

Sorry for only getting back to this now, so the new option is --merge? I think it seems to be working for us now. Let me run some more tests then I'll close.

from bamutil.

mktrost commented on August 13, 2024

Correct, it is called "--merge" (I couldn't use "--interleave" because it
caused confusion with the "--in" parameter.)
Yes, please let me know how your testing goes.
Also, let me know if you need/notice anything else.

On Fri, Mar 14, 2014 at 4:21 PM, Alex Lancaster [email protected]:

Great!!

Sorry for only getting back to this now, so the new option is --merge? I
think it seems to be working for us now. Let me run some more tests then
I'll close.

Reply to this email directly or view it on GitHubhttps://github.com//issues/10#issuecomment-37691165
.

from bamutil.

Generating interleaved FASTQ for bam2FastQ? about bamutil HOT 5 CLOSED

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent