holtjma / fmlrc

A long-read error correction tool using the multi-string Burrows-Wheeler Transform

License: MIT License

C++ 97.47% Makefile 0.61% Shell 1.91%

fmlrc's Issues

Error during msbwt

Dear Holtjma,

I've got this error from the first step.
$ gunzip ~/scratch/data/Zeh/Won/PEreads/*.gz | awk "NR % 4 == 2" | sort -T ./temp | tr NT TN | ropebwt2/ropebwt2 -LR | tr NT TN | msbwt convert ./zeh_msbwt
Can you please check?

Won

File "/data/gpfs/assoc/pgl/bin/python/virtualenv-15.2.0/python_default/bin/msbwt", line 6, in <module>
CommandLineInterface.mainRun()
File "/data/gpfs/home/wyim/scratch/bin/python/virtualenv-15.2.0/python_default/lib/python2.7/site-packages/MUS/CommandLineInterface.py", line 242, in mainRun
CompressToRLE.compressInput(args.inputTextFN, args.dstDir)
File "CompressToRLE.pyx", line 100, in MUSCython.CompressToRLE.compressInput (MUSCython/CompressToRLE.c:2459)
TypeError: cannot concatenate 'str' and 'int' objects

Erroneous all-adenine contigs introduced

Hello there,

I'm currently working on polishing some RNA-seq data obtained from nanopore sequencing, using a huge amount of paired-end Illumina data. Although your program runs blazingly fast, I encountered a problem. Sometimes (ca. 1-2% of total nanopore reads) the nanopore read is changed dramatically to a frantically screaming, all-lysine transcript that looks something like this:

>cch_R_01730
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Do you know what the problem could be? Maybe low coverage? I also encountered this in a test run where I used only several hundred Illumina reads instead of the final amount.
Thanks for the program. And also thank you for writing an implementation in Rust, I cannot wait to try that one out!
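
To put a number on how many corrected reads are affected, a small filter can list the records that are dominated by a single base. This is only a minimal sketch: flag_homopolymer_reads is a hypothetical helper, and the 0.9 default threshold is an arbitrary assumption, not something fmlrc defines.

```shell
# Hypothetical helper: print the IDs of FASTA records in which one base
# accounts for at least min_frac (default 0.9, an arbitrary choice) of the
# sequence. Multi-line FASTA records are supported.
flag_homopolymer_reads() {
    fasta=$1
    min_frac=${2:-0.9}
    awk -v min_frac="$min_frac" '
        function report(   i, n, best, c, base) {
            n = length(seq)
            if (id == "" || n == 0) return
            split("", count)            # clear the per-record base counts
            best = 0
            for (i = 1; i <= n; i++) {
                base = toupper(substr(seq, i, 1))
                c = ++count[base]
                if (c > best) best = c
            }
            if (best / n >= min_frac) print id
        }
        /^>/ { report(); id = substr($1, 2); seq = ""; next }
             { seq = seq $0 }
        END  { report() }
    ' "$fasta"
}

# Example (file names are placeholders):
# flag_homopolymer_reads corrected_reads.fa > suspicious_ids.txt
```

Comparing the flagged IDs against the uncorrected input would show whether those reads were already poly-A rich before correction.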

../fmlrc-convert: No such file or directory

Hi,

I am trying to run the example. When I ran:
if [ ! -f ${DATADIR}/ecoli_comp_msbwt.npy ]; then
mkdir temp
gunzip -c ${DATADIR}/ERR022075_?.fastq.gz | awk "NR % 4 == 2" | sort -T ./temp | tr NT TN | ./ropebwt2/ropebwt2 -LR | tr NT TN | ../fmlrc-convert ${DATADIR}/ecoli_comp_msbwt.npy
fi

the output shows: ../fmlrc-convert: No such file or directory

I looked at the path and I cannot find the binary file fmlrc-convert anywhere. Can you please help me with this?
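
One thing worth ruling out first: the repository is mostly C++ with a Makefile, so fmlrc-convert is presumably a compiled binary that only exists after building. That is an assumption here, as is the idea that running make in the repository root produces it. A minimal sketch for checking which tools the example script can actually find (missing_tools is a hypothetical helper name):

```shell
# Hypothetical helper: report every argument that is neither an executable
# file (when given as a path) nor a command found on PATH.
missing_tools() {
    for tool in "$@"; do
        case $tool in
            */*) [ -x "$tool" ] ;;                      # explicit path
            *)   command -v "$tool" >/dev/null 2>&1 ;;  # looked up on PATH
        esac || echo "missing: $tool"
    done
}

# Example, using the paths from the command in this issue:
# missing_tools ./ropebwt2/ropebwt2 ../fmlrc-convert
```

If ../fmlrc-convert is reported as missing, the path in the example script is relative to the directory the script is run from, so that is the next thing to check.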

Question about final command

Hello, my question is about the final command:
./fmlrc [options] <comp_msbwt.npy> <long_reads.fa> <corrected_reads.fa>
Is the last argument, <corrected_reads.fa>, where we define the output?

Correction at the ends of reads is incomplete

The tool works great for correcting nanopore cDNA reads, but the last 200-300 bp at the ends of the reads do not get corrected, while the rest does. It might be that the poly-A tail and primer are still attached, so I will trim these and try again. Still, I am wondering whether there is some limitation on correction at the ends of a sequence, even though mapping there should still be good.

Does the number of iterations affect the results?

Hi,
I have a question about the short-read correction process.
I built the short-read BWT and then corrected the long reads:
fmlrc -p 20 ngs_msbwt.npy long.fasta

and I got corrected1_long_reads.

Then I used the same ngs_msbwt.npy to correct the output of the previous run again and got corrected2_long_reads; there was little change in size. So how many times should I iterate?
And what metric should I use to evaluate the corrected results?
By the way, is it meaningful to use different short reads to correct the same long reads?

sincerely
Yun
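
For the "little change in the size" comparison, a crude convergence check is to compare read counts and total bases between passes. This is a sketch only: fasta_stats is a hypothetical helper, and size alone says nothing about accuracy, which would need something like alignment against a trusted reference.

```shell
# Hypothetical helper: print the number of records and the total number of
# sequence characters in a FASTA file (multi-line records supported).
fasta_stats() {
    awk '
        /^>/ { reads++; next }
             { bases += length($0) }
        END  { printf "reads=%d bases=%.0f\n", reads, bases }
    ' "$1"
}

# Example (file names are placeholders for the two correction passes):
# fasta_stats corrected1_long_reads.fa
# fasta_stats corrected2_long_reads.fa
```

If the two passes report nearly identical totals, a further pass with the same BWT is unlikely to change much by this measure.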

Index no output

Hello, I am trying to use the following command to build a comp_msbwt.npy file yet I have an error somewhere:

The command:

ilR1=input/filtered_Illumina_R1.fastq.gz
ilR2=input/filtered_Illumina_R2.fastq.gz
ropebwt2=/home/me/anaconda3/envs/fmlrc/bin/ropebwt2
fmlrc_convert=/home/me/anaconda3/envs/fmlrc/bin/fmlrc-convert
idx=out/path/comp_msbwt.npy
mkdir temp
pigz -p 20 -dc $ilR1 $ilR2 | awk 'NR % 4 == 2' | sort -T temp | tr NT TN | $ropebwt2 -LR | tr NT TN | $fmlrc_convert $idx

The output (the command runs yet no file appears in the /out/path directory) :

/out/path/comp_msbwt.npy: No such file or directory

Could this be due to the sort command? When I run the pipeline only up to sort, no output appears in the terminal. The command is still running; should I wait until all the data is uncompressed?

Thanks a million :)
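
Two observations that may help. First, sort cannot emit anything until it has read all of its input, so an empty terminal while it runs is expected. Second, the script sets idx=out/path/comp_msbwt.npy (relative) while the error mentions /out/path/comp_msbwt.npy, so one plausible cause, and this is only a guess, is that the target directory does not exist when the converter tries to write. A minimal sketch that creates it up front (ensure_parent_dir is a hypothetical helper):

```shell
# Hypothetical helper: create the directory that will contain the given
# output file, including any missing parent directories.
ensure_parent_dir() {
    mkdir -p "$(dirname "$1")"
}

# Usage with the variables from the command above (pipefail needs bash):
# set -o pipefail
# ensure_parent_dir "$idx"
# pigz -p 20 -dc "$ilR1" "$ilR2" | awk 'NR % 4 == 2' | sort -T temp | tr NT TN \
#     | "$ropebwt2" -LR | tr NT TN | "$fmlrc_convert" "$idx"
```

With pipefail set, a failure in any stage makes the whole pipeline return a non-zero status instead of being masked by the last command.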

No output file

Hello, I finished the fmlrc run, however, there was no output file. This might be a really dumb mistake on my part, do you have to specify an output file name? Are outputs only in .fa format?

Thanks
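
Going by the usage line quoted elsewhere on this page (fmlrc [options] <comp_msbwt.npy> <long_reads.fa> <corrected_reads.fa>), the output file is the third positional argument. A minimal sketch of a wrapper that refuses to start without it; run_fmlrc and the FMLRC_BIN variable are hypothetical names introduced here, not part of fmlrc:

```shell
# Hypothetical wrapper: takes the three positional files first, then any
# options, and passes them to fmlrc in the order of its usage line.
# FMLRC_BIN is an assumed variable naming the executable (default: fmlrc).
run_fmlrc() {
    if [ "$#" -lt 3 ]; then
        echo "usage: run_fmlrc <comp_msbwt.npy> <long_reads.fa> <corrected_reads.fa> [options]" >&2
        return 2
    fi
    bwt=$1
    long_reads=$2
    corrected=$3
    shift 3
    "${FMLRC_BIN:-fmlrc}" "$@" "$bwt" "$long_reads" "$corrected"
}

# Example (file names are placeholders):
# run_fmlrc comp_msbwt.npy long_reads.fa corrected_reads.fa -p 16
```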

How to generate RLE BWT format for paired-end illumina data

Hi
I have paired-end Illumina data in FASTQ format. I want to use it to correct nanopore long reads with fmlrc. How do I generate the RLE BWT format from the X_1 and X_2 reads? Should I combine X_1.fastq and X_2.fastq into one FASTQ file?

Xin
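
The example pipeline quoted in an earlier issue on this page reads ERR022075_?.fastq.gz, a glob that matches both mate files, and keeps only the sequence line of each record, so pairing information is dropped by construction. A minimal sketch of that first stage (fastq_gz_to_seqs is a hypothetical helper; it assumes standard four-line FASTQ records):

```shell
# Hypothetical helper: stream one or more gzip-compressed FASTQ files and
# print only the sequence line (line 2 of every 4-line record).
fastq_gz_to_seqs() {
    gunzip -c "$@" | awk 'NR % 4 == 2'
}

# Usage, following the pipeline quoted on this page
# (X_1.fastq.gz and X_2.fastq.gz are placeholder names):
# fastq_gz_to_seqs X_1.fastq.gz X_2.fastq.gz | sort -T ./temp | tr NT TN \
#     | ropebwt2 -LR | tr NT TN | fmlrc-convert comp_msbwt.npy
```

So there is no need to merge the two files on disk first; passing both to the decompression step has the same effect.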

Comment on % PacBio sequences corrected?

Hi,

I know this will depend on the dataset quality, but I will ask anyway. You compared quite extensively with the tool LoRDEC. In our (brief) experience LoRDEC is very quick and useful, but only keeps around 25-30% of the PacBio reads. In addition, many of the PacBio reads are highly truncated. Short PacBio reads are then not useful for assembling complex plant genomes.

Maybe I missed it, but are there any stats on how your tool compares to LoRDEC for some of your test datasets? i.e. % PacBio reads corrected, % excluded, % of initial read length corrected, etc.

Thanks,
Colin

Error corrupted size vs. prev_size

hello,
I am running an fmlrc error correction of nanopore data using illumina short-reads.
The full command I set to run the software was:
/cluster/home/gabrieha/software/fmlrc/fmlrc -p 16 -i BWT/comp_msbwt.npy Rabiosa_ONT_181026_split.0.fa Rabiosa_ONT_errorcor0.fa
The computing resources I have available are restricted to 24 cores and 5120 MB RAM per core (122880 MB total). I occasionally checked resource usage while it was running, and it never used more than 70% of the available resources.
Now, when the software has processed about 1/4 of the data it crashes giving me an output error of *** Error in `/cluster/home/gabrieha/software/fmlrc/fmlrc': corrupted size vs. prev_size: 0x00002ba614016630 ***
Have you encountered something similar before or do you have an idea what could have caused the error?

I attached the report in case it could help to explain.
Thank you and all the best!

report_corrpsize_vs_prevsize.txt

Segmentation fault: 11

Hi there,

I got the following error message after running. How should I deal with it?

loaded bwt with 87963178 compressed values
Segmentation fault: 11

Thank you!

fastq format long read file input?

Hi,

I guess that fmlrc does not accept a fastq format file for long reads because I have the following message in a fmlrc run
ERROR: input long reads must be in FASTA format - file must end in '.fasta' or '.fa'

I usually correct sequence assemblies in fasta format. But, I need to polish raw long reads before assembling to locate genes in the reads. Is there an easier way of polishing long reads using fmlrc?

One thing I could do is:

1. convert the FASTQ file to a FASTA file
2. polish the long reads in the FASTA file
3. put the polished long reads back into the starting FASTQ file.

But I am not sure whether there is such a tool.

Thank you,

Sang
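
The first step of that plan can be done with awk alone. A minimal sketch, assuming standard four-line FASTQ records (fastq_to_fasta is a hypothetical helper name):

```shell
# Hypothetical helper: convert a 4-line-per-record FASTQ file to FASTA by
# keeping the header (with '@' replaced by '>') and the sequence line.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }' "$1"
}

# Example (file names are placeholders; the .fa suffix matters because of
# the error message quoted above):
# fastq_to_fasta long_reads.fastq > long_reads.fa
```

The third step is harder than it looks: correction can insert or delete bases, so the original quality strings would most likely no longer line up with the polished sequences.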

[fmlrc-convert] ERROR - unexpected symbol in input: char: " ", hex: "20"

Hi,

when I run the
gunzip -c reads.sorted.txt.gz | tr NT TN | /opt/ropebwt2/ropebwt2 -LR | tr NT TN | /opt/fmlrc-1.0.0/fmlrc-convert comp_msbwt.npy
command, I get the following error:
ropebwt2: mrope.c:268: mr_insert_multi: Assertion `len > 0 && s[len-1] == 0' failed.
[fmlrc-convert] ERROR - unexpected symbol in input: char: " ", hex: "20"
Any idea what the source of the error might be?
BR,
Pezhman
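
Both messages point at the content of reads.sorted.txt.gz: the converter complains about a space character, and the ropebwt2 assertion mentions len > 0, which may (this is a guess) indicate an empty line. A minimal sketch for locating such lines; find_bad_read_lines is a hypothetical helper, and the accepted alphabet A, C, G, T, N is an assumption based on the symbol list ropebwt2 prints in its log:

```shell
# Hypothetical helper: list (with line numbers) every line of a gzipped
# one-read-per-line file that is empty or contains anything other than
# the uppercase characters A, C, G, T, N.
find_bad_read_lines() {
    gunzip -c "$1" | grep -n -v -E '^[ACGTN]+$'
}

# Example:
# find_bad_read_lines reads.sorted.txt.gz | head
```

Note that lowercase bases are also reported by this check, and that grep exits with status 1 when every line is clean.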

Correcting Nanopore long reads with PacBio HiFi

Dear author,

It happens that I have PacBio HiFi reads of about ~11 Kb avg read length and Nanopore >30 Kb avg read length, and I am wondering how could I polish my Nanopore reads with HiFi? It would be gold if this was possible since I suspect that the correction would be more accurate than when using short reads given the differences at mapping. Thanks.

Rom

correcting low cov reads from heterozygous genomes

Hello,
I have about 20x (per allele) PromethION data of a highly heterozygous plant genome. I also have plenty of short read data to align to it for the error correction.
I wonder if with FMLRC the small allelic variants (substitutions and indels) will be washed away at the error correction step: I want to remove errors, but keep allelism so that I can assemble separately the two alleles - we know that this is already possible with Illumina, I want to do it with long reads now.
Did you ever try your tool with low (~20x raw data) ONT coverage from a heterozygous genome? Do you have any suggestions (k and K size, T, ...) on how not to lose allelic variation?
thanks!

About short reads: insert size and type

Hi,

In order to correct my long reads, I have short reads of different insert size and different types (paired end and mate pair).

How is this handled by the program?

For example, for the first part, Building the short-read BWT, with this guide: https://github.com/holtjma/fmlrc/wiki/Converting-to-the-fmlrc-RLE-BWT-format

Should I just concatenate everything into a single file (so treating my paired reads as single reads)? Can we use only paired-end reads, or also mate pairs?
What is the optimal coverage for long reads and short reads?

Thanks a lot for your help.

Are RNA long reads suitable for correction?

FMLRC (Or Ropebwt) not working correctly

Hello,
I am trying to run the FMLRC program on Drosophila melanogaster data. Since not all the short reads are exactly 100 bp (some are shorter due to trimming, and some are a few bases longer), I am using ropebwt2. I had to change the command line compared to the guidelines, because the counts for ($, A, C, G, T, N) were incoherent:
$ awk 'NR % 4 == 2' Dmel_data/Shortreads/Short_*_trimmed.fastq | sort | ropebwt2 -LR | tr NT TN | msbwt convert ./msbwt6

The info from Ropebwt performance is:

[M::main_ropebwt2] inserted 10415295755 symbols in 764.260 sec, 1747.782 CPU sec
[M::main_ropebwt2] inserted 6587840007 symbols in 443.204 sec, 1198.430 CPU sec
[M::main_ropebwt2] constructed FM-index in 1346.672 sec, 3046.289 CPU sec
[M::main_ropebwt2] symbol counts: ($, A, C, G, T, N) = (171420988, 5079437921, 3339642692, 3349827151, 5058703850, 4103160)
[M::main] Version: r187
[M::main] CMD: ropebwt2 -LR
[M::main] Real time: 4022.990 sec; CPU: 3399.186 sec

AND

[2017-08-25 12:29:16] INFO: Input: stdin
[2017-08-25 12:29:16] INFO: Output: ./msbwt6
[2017-08-25 12:29:16] INFO: Beginning conversion...
[2017-08-25 13:36:01] INFO: Finished conversion.

It looks to me as though ropebwt2 reads a file containing the correct number of sequences ($ = 171420988) and that #G ~ #C and #A ~ #T (which was not the case with the added "tr NT TN").
I do have a 2 GB file in msbwt6 (comp_msbwt.npy), but when I run FMLRC as recommended ($ fmlrc -p 14 -V ./msbwt6/comp_msbwt.npy Dmel_data/PacBio_reads/Pacreads.fasta ./Dmel_PB_corrected_FMLRC6.fasta) on this file, the FASTA output looks incoherent. Here is the head of the output file:

$ head Dmel_PB_corrected_FMLRC6.fasta
Usage: fmlrc [options] <comp_msbwt.npy> <long_reads.fa> <corrected_reads.fa>
NNTNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNTNNNNNNNNNNNNNNNNANNNNNNTNNNNNNNNNNNNNNTNNN
NNNANNNNNNNNNNNNNNNNNNANNTNNNNNNNNNNNNNNNNNNNTNNNN
NANGNNNNNNNNNNNNNNNNNANNTNNNNNNNNNTNNNNNNNNNNNNGNN
NNANNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNCNNNNCTNNNNTNNNA
NNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNANNTNNNTANTNNNTNNN
NNNANNTNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNANNTNNNNN
NNNTNNNNNNANNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNANN
NNNTNNNNNNNNNNCNNNTNTNNCNNNNNNNNANNATNNNNNNANNTNNN

So I have several questions to figure out how to debug this:

1/ What are the accepted formats for input data?
2/ Can we feed two different files as input (i.e., 1.fastq and 2.fastq, for paired-end reads)?
3/ How can we tell whether the ropebwt2 step or the FMLRC step went wrong, i.e., how can I validate the comp_msbwt.npy file?

Note that even in verbose mode, FMLRC produces scarce info on what happens.
Thank you very much for your help.
Best regards,
Coline
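
A note on the two tr NT TN calls, since the modified command keeps only the second one. tr NT TN exchanges N and T, so applying it twice restores the input. With the first call in place, ropebwt2 sees swapped symbols and its reported T and N counts are exchanged, which would explain the counts looking incoherent. With only the second call, the stored BWT ends up with T and N exchanged, which might account for the N-heavy output. The underlying reason for the swap is presumably a difference in symbol ordering between ropebwt2 and the msbwt format; that part is an assumption, not something stated on this page. A minimal sketch of the swap (swap_nt is a hypothetical name):

```shell
# Hypothetical helper: exchange N and T in a stream. Applying it twice
# returns the original text, which is why the guide's pipeline uses it
# once before and once after ropebwt2.
swap_nt() {
    tr NT TN
}

# Example:
# printf 'ANTGN\n' | swap_nt            # prints ATNGT
# printf 'ANTGN\n' | swap_nt | swap_nt  # prints ANTGN
```

Rebuilding the index with both tr calls, as in the guide, would be the first thing to try before looking for problems in fmlrc itself.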