
jiaolaboratory / craq


Identification of errors in draft genome assemblies with single-base pair resolution for quality assessment and improvement

Home Page: https://doi.org/10.1038/s41467-023-42336-w

License: MIT License

Shell 35.12% Reason 0.28% Perl 60.21% Python 3.87% Raku 0.52%
quality-assessment bioinformatics genome-assembly

craq's People

Contributors: jiaolaboratory


craq's Issues

Seeking Insights on Error Discrepancies in Combined ONT and HiFi Sequencing Data

Hello,

Following your suggestion, I merged the BAM files of the ONT and HiFi data to generate the image below. The plant was sequenced with both second- and third-generation technologies, and the assembly was built from the ONT and HiFi data. Notably, the third-generation data in the image shows fewer errors than the second-generation data. This discrepancy raises questions about the reliability of the error calls from the second-generation data, and I would appreciate your advice on how to interpret these results.

Thank you for your guidance!
[image attachment]

Out of Memory

Hi, dear author,
The run failed on my machine with 128 GB of RAM. My genome is 350 Mb; the PacBio CLR long-read BAM is 21 GB and the short-read BAM is 11 GB. Memory usage increased slowly and continuously until the process was finally killed. Is this normal?

Best wishes

craq -g genome.fa -sms lgs.sort.bam -ngs sgs.sort.bam --sms_coverage 150 --ngs_coverage 65 -t 32
Running CRAQ analysis .........
PARAMETERS:
Genome sequence(-g): genome.nextpolish.fa
SMS input(-sms): lgs.sort.bam
NGS input(-ngs): sgs.sort.bam
Minimum NGS clipped-reads (-sn): 2
Minimum SMS clipped-reads (-ln): 2
Clipping coverRate(-sf): 0.75
Lower clipping rate for heterozygous allele(-hmin): 0.4
Upper clipping rate for heterozygous allele(-hmax): 0.6
Block score benchmarking (-rw): 1000000
Gap[N] is treated with (-gm): 1:CRE
Minimum gapsize(-mgs): 10
Break chimera fragments (-b): F
Search error region (-ser): T
Mapping SMS reads use (-x): map-hifi
Mapping quality (-q): 20
Window size for error normalizing (-nw): 35560
Plot CRAQ metrics (-pl): F
Alignment thread(-t): 32
Current working at : /mnt/dataset/moth/species/cs/0assembly/9polish/4polish
CRAQ output dir(-D): /mnt/dataset/moth/species/cs/0assembly/9polish/4polish/output
-------------------------Start Running-------------------------

Running SMS long-reads CRAQ analysis ......
CMD: /mnt/data/miniconda3/envs/craq/bin/../share/CRAQ/src/runLR.sh -g genome.nextpolish.fa -x map-hifi -z seq.size -1 lgs.sort.bam -q 20 -m 2 -f 0.4 -h 0.6 -r 0.75 -a 150 -d 50000 -v F -t 32
worker_pipeline::
Skipping alignment::
[M::worker_pipeline:: Filtering bamfiles]
[M::worker_pipeline:: Compute effective SMS coverage]
[M::worker_pipeline:: Extract SMS clipping signal]
[M::worker_pipeline:: Collect potential CSE|H]
[M::worker_pipeline:: Collect potential CRE|H]

The run went out of memory and was killed at this point.
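
For reference, a minimal sketch of one generic workaround for peak-memory problems with very high SMS coverage: subsample the long-read BAM with samtools before rerunning, and lower --sms_coverage to match. This is not from the CRAQ documentation; the 0.33 fraction is only an illustrative value that would bring ~150x down to roughly 50x, and file names are placeholders.

samtools view -b -s 0.33 -@ 8 lgs.sort.bam > lgs.sub.sort.bam
samtools index lgs.sub.sort.bam

Subsampling preserves the coordinate order of the input, so the subsampled BAM can be indexed and passed to -sms directly.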

Can CRAQ be used to detect haplotype assembly errors? Can it break apart erroneous assemblies?

For the assembly of homologous polyploids, phasing presents the greatest challenge. Is it possible to identify phasing errors using ONT data and to break apart erroneously assembled regions? We hope you can assist us with such issues, as haplotype resolution in polyploids is critically important. Since the accuracy of ONT reads appears to be only around Q18, can such false identifications be avoided at the parameter level?

Thank you very much for your attention to this matter.

Sincerely,
Tuo Yang

low_confidence.bed empty file

Hi,
Thanks very much for the great tool. We have been trying CRAQ in several of our assembly projects, and we sometimes run into strange behavior where the analysis seems to finish OK, but low_confidence.bed is empty and the Low-confident.Rate in out_final.Report is 0. We know for sure that this is not the case (from quality scores), and a parallel run on a different machine gave >100 low-confidence points. The FASTA file after breaks is also not produced. Would there be a chance to look into it? It is a 1.1 Gb genome and we are using a machine with 500 GB of memory and 28 CPUs, so resources should be plentiful.

Running CRAQ analysis .........
PARAMETERS:
Genome sequence(-g): yahs.out_scaffolds_final.fa
SMS input(-sms): lr.fq.gz
NGS input(-ngs): SRR10382366_1.fastq  SRR10382366_2.fastq
Minimum NGS clipped-reads (-sn): 2
Minimum SMS clipped-reads (-ln): 2
NGS clipping coverRate(-sf): 0.75
SMS clipping coverRate(-lf): 0.75
Lower clipping rate for heterozygous allele(-hmin): 0.4
Upper clipping rate for heterozygous allele(-hmax): 0.6
Block score benchmarking (-rw): 500000
Gap[N] is treated with (-gm): 1:CRE
Minimum gapsize(-mgs): 10
Break chimera fragments (-b): T
Search error region (-ser): T
Mapping SMS reads use (-x): map-hifi
Mapping quality (-q): 20
Window size for error normalizing (-nw): 103347
Plot CRAQ metrics (-pl): F
Alignment thread(-t): 24
Current working at : /vol/volume/agolicz/napus_microc/noUL/stringent3
CRAQ output dir(-D): /vol/volume/agolicz/napus_microc/noUL/stringent3/output
-------------------------Start Running-------------------------
Running NGS short-reads CRAQ analysis ......
CMD: /vol/volume/agolicz/CRAQ/bin/../src/runSR.sh -g yahs.out_scaffolds_final.fa  -z seq.size -1  SRR10382366_1.fastq -2 SRR10382366_2.fastq -q 20 -m 2 -f 0.4 -h 0.6 -r 0.75 -a 20 -t 24
worker_pipeline::
worker_pipeline:: NGS reads aligning and filtering
[M::worker_pipeline:: Sort bamfiles]
[M::worker_pipeline:: Compute effective NGS coverage]
[M::worker_pipeline:: Collect potential CRE|RH]
SR clipping analysis completed. Check current directory SRout for final results!

Running CRAQ benchmark analysis ......
CMD: /vol/volume/agolicz/CRAQ/bin/../src/runAQI_NGS.sh -g  yahs.out_scaffolds_final.fa  -z seq.size   -e SRout/SR_eff.size  -c SRout/SR_putative.RE.RH  -d SRout/SR_sort.depth  -r 0.75 -p 0.4 -q 0.6  -n 10 -s 103347 -w 500000    -j 1 -u T -v F -y F -x /vol/volume/agolicz/napus_microc/noUL/stringent3/output/seq.size
[M::worker_pipeline:: Collect potential CRE|H]
[M::worker_pipeline:: Get putative CRE]
[M::worker_pipeline:: Search noisy region]
[M::worker_pipeline:: Quality benchmarking]
[M::worker_pipeline:: Create regional metrics]
[M::worker_pipeline:: Create final report]
CRAQ analysis is finished. Check current directory runAQI_out for final results!

Citation!

Hello!
This project seems to be very useful. How can one cite it in a paper?

Best regards
Asan

interpretation of results

Hello,
thanks for developing a very useful tool. I have a small question about using it and would like to ask for your advice.
My command is as follows:

perl  ./CRAQ/bin/craq -g chr.fasta -sms hifi.sorted.bam -pl T 

After the run completes normally, I mainly check two files: out_final.CRE.bed and out_final.CSE.bed.
The result of out_final.CRE.bed is as follows:

m01	1	1	m01:1	CRE
m01	60253887	60253887	m01:60253887	CRE
m02	1	1	m02:1	CRE
m02	44039054	44039054	m02:44039054	CRE
m02	54796608	54796608	m02:54796608	CRE
m03	1	1	m03:1	CRE
m03	11057839	11057839	m03:11057839	CRE
m03	49202667	49202667	m03:49202667	CRE
m04	1	1	Gm04:1	CRE
m04	36798166	36798166	m04:36798166	CRE
m04	55614844	55614844	m04:55614844	CRE

The result of out_final.CSE.bed is as follows:

m08	15154	15155	m08:15154	CSE
m11	27340675	27340676	m11:27340675	CSE
m17	22122332	22122333	m17:22122332	CSE
m17	24023509	24023510	m17:24023509	CSE
m17	24032958	24032959	m17:24032958	CSE

I don't understand why the region length in my BED file is 1.
Hope to get some help,
thank you
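
For reference, a quick way to check the reported interval widths directly from a BED file (a minimal sketch assuming standard tab-separated BED columns; the file name matches the output above):

awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $3-$2, $5}' out_final.CSE.bed

BED intervals are conventionally half-open, so an entry such as m08 15154 15155 spans a single base; this is consistent with the repository description of error identification at single-base-pair resolution.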

Can I use ONT and HiFi data at the same time?

Hello,
Thank you for developing a very useful tool. I have some doubts about how to use it, so I would like to ask you for advice:

$ craq -g assembly.fa -sms SMS.fa.gz -ngs NGS_R1.fa.gz,NGS_R2.fa.gz -x map-hifi
Can I input both ONT and HiFi data at the same time in this step via -ngs?

$ craq -g assembly.fa -sms SMS_sort.bam -ngs NGS_sort.bam
Or can I input the BAM files of both ONT and HiFi data at the same time in this step via -ngs?
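
For reference, a minimal sketch of one way to combine two coordinate-sorted BAMs into a single sorted, indexed BAM with samtools (file names are placeholders; whether CRAQ accepts a merged ONT+HiFi BAM for a given option is a question for the authors):

samtools merge -@ 8 ont_hifi.merged.bam ont.sort.bam hifi.sort.bam
samtools index ont_hifi.merged.bam

Merging coordinate-sorted inputs yields a coordinate-sorted output, so the merged file can be indexed directly.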

Illegal division by zero

Hello! I ran CRAQ on a small partial genome with SMS reads only, and I ran into this error.

Running CRAQ benchmark analysis ......
CMD: /data/wenhuaming/software/CRAQ/bin/../src/runAQI_SMS.sh -g good_zhou_haploid.fa -z seq.size -e LRout/LR_eff.size -C LRout/LR_putative.SE.SH -D LRout/LR_sort.depth -r 0.75 -p 0.4 -q 0.6 -R 0.75 -P 0.4 -Q 0.6 -n 10 -s 391 -w 500000 -j 1 -b F -y -t -x /data/wenhuaming/data/tomato/CRAQ/original/haploid/output/seq.size -v F
[M::worker_pipeline:: Filter putative LER]
[M::worker_pipeline:: Quality benchmarking
[M::worker_pipeline:: Create CRAQ metrics]
[M::worker_pipeline:: Create final report]
Illegal division by zero at /data/wenhuaming/software/CRAQ/src/final_short_report_minlen.pl line 23, line 12.
Illegal division by zero at /data/wenhuaming/software/CRAQ/src/final_short_report_minlen.pl line 23, line 12.
CRAQ analysis is finished. Check current directory runAQI_out for final results!

The alignment seems fine, but out_final.Report has no content other than the header. I am sure my genome has no zero-length contigs. It seems all the intermediate files are deleted, so I cannot debug it on my own.

By the way, the -t option is not working: when I set it to 64, the alignment still ran with the default 10 threads. The -pl option is not working either; I did not see any plot generated when it was on. These are not related to this issue, but I thought you might want to know.

Best wishes!

Why is there no out_correct.fa file?

The runAQI_out directory contains the files shown below:
[image attachment]
My script: perl /nfs/software/CRAQ/bin/craq -g rex.genome.fa -sms rex.fasta.gz -ngs rex_L1_clean.fa.gz,rex_L2_clean.fa.gz -avgl 60 -avgs 100 -t 40 -D output

Default values of --sms_coverage and --ngs_coverage

    --sms_coverage|-avgl            Average SMS coverage. Default: 100
    --ngs_coverage|-avgs            Average NGS coverage. Default: 100

Great job, and thank you to the development team; I am using this software in my project. I would like to ask whether "coverage" here refers to sequencing depth. If so, is the default of 100x considered high? Would it be more suitable to lower it to, for example, 30?
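
For reference, a minimal sketch for estimating average depth from a coordinate-sorted, indexed BAM with samtools and awk (the file name is a placeholder; this is a generic check, not taken from the CRAQ documentation):

samtools depth -a sorted.bam | awk '{sum += $3} END {if (NR > 0) print sum/NR}'

The -a flag includes zero-coverage positions, so the result is an average over all reference positions.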
