lh3 / yak Goto Github PK

View Code? Open in Web Editor NEW

110.0 8.0 8.0 88 KB

Yet another k-mer analyzer

License: MIT License

C 93.88% Makefile 1.15% JavaScript 3.89% Perl 1.08%

bioinformatics k-mer

yak's People

Contributors

Stargazers

Watchers

Forkers

mphschmitt lutfia95 xfengnefx kdm9 skyclub3 pythseq radygenomics hlilab

yak's Issues

yak count question with paired reads

Under "getting started" on readme page, it describes:

for paired end: to provide two identical streams

./yak count -b37 -t32 -o sr.yak <(zcat sr*.fq.gz) <(zcat sr*.fq.gz)

What is the reason to feed the same reads to yak twice as identical streams? Or is this meant to be R1 and R2 read pairs as separate streams?

how does yak count match paired end reads in the two identical streams?

Since yak count requires two identical streams for paired end reads, is it necessary to have forward/reverse reads separated by file or labeled as /R1 /R2? I'm regularly converting paired end short read bams -> fasta for yak count and wondered if I need to bother sorting and separating reads by R1/R2.

Is yak inspect -p still supported

In the readme I see

print k-mer histogram
./yak inspect sr.yak > sr.hist
print k-mers (warning: large output)
./yak inspect -p sr.yak > sr.kmers

But actually both these variants just print k-mer histogram. Is there any way to get k-mers themselves from the yak file?

Description of how yak works

Hello and thank you for developing Yak, it is very useful to me!

I was just wondering if there was a place where I could find information about how it works, to compare the methodology to Merqury's for example.

how to interpret yak inspect to evaluate the completeness of an assembly

Thank you for this tool. After running:

yak count -b37 -t4 -o sr.yak myreads.fq.gz
yak count -b37 -t4 -o scaffold.yak scaffold.fa
yak inspect sr.yak scaffold.yak > sr-scaffold.kqv.txt

I obtain these results in sr-scaffold.kqv.txt:

...
SN      5       29997294        2943642 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
SN      4       54814313        5427049 0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
SN      3       119820206       12359127        0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
SN      2       577560021       57889442        0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
QV      20      1686753367      1686676513      4.915   -1.000
QV      19      1806056255      1805978021      4.893   -1.000
QV      18      1918241646      1918162190      4.874   -1.000
QV      17      2021819785      2021739232      4.857   -1.000
QV      16      2115831944      2115750327      4.843   -1.000
QV      15      2199727376      2199644874      4.831   -1.000
QV      14      2273569360      2273486096      4.821   -1.000
QV      13      2337885508      2337801575      4.813   -1.000
QV      12      2393381989      2393297514      4.806   -1.000
QV      11      2441048817      2440963948      4.799   -1.000
QV      10      2481791146      2481705971      4.794   -1.000
QV      9       2516588006      2516502574      4.789   -1.000
QV      8       2546760834      2546675206      4.785   -1.000
QV      7       2573349003      2573263248      4.781   -1.000
QV      6       2600045275      2599959418      4.778   -1.000
QV      5       2630042569      2629956665      4.773   -1.000
QV      4       2684856882      2684770932      4.764   -1.000

How should I interpret these results (this is a human assembly from pacbio, and the short reads are 80x illumina).

Bioconda packaging

Hi,

Have you concider to package yak in Bioconda. It will be easier to install yak on shared structure and will follow the FAIR principe and especially the reproductibility aspect.

First of all, please add a new tag to the current state of the repository.

Documentation for trio eval

Dear Heng,

Trying out the trio eval tool, it's a great addition.

Can you share the meaning of S/W/H records?

I'm guessing an S is a sequence?

S ctg.000223F 13 0 12 0 0 0
S ctg.000227F 27 3 23 3 3 0
S ctg.356 10 136 3 7 7 128
W 152511 1638448 0.093083
H 267877 1638666 0.163473

Higher QV in plant trio HiFi assembly using paternal kmer than offspring kmer

Hi, @lh3

We use the hifiasm trio mode to assemble a high het plant genome (320Mb, 1.8% het, pat and mat 50x illumina, F1 have 100x Illumina and HiFi ). In the yak trioeval result, the switch error and hamming error is low (pat: 0.38%/0.43%; mat: 1.1%/0.54%). It is a dioecy plant, but unlike human,it is usually hunderads of kb region for sex (sex-determing region in autosome).

By the way, the HiC mode have a good performance on the plant genome (hap1: 0.51% / 3.3% ; hap2: 0.49%/2.61% )

Type	Hap	Switch	Hamming
Trio	hap1	0.49%	0.59%
Trio	hap2	0.56%	1.13%
Trio-check	hap1	0.44%	0.39%
Trio-check	hap2	0.55%	1.12%
HiC	hap1	0.50%	2.62%
HiC	hap2	0.52%	3.31%

Trio version have manual check in the HiC heatmap (https://github.com/baozg/phased-assembly-check) and manually resuce three unitig (chhylp123/hifiasm#115 (comment))

But when we use the yak qv, the offspring shows a lower qv than using the matched parent‘s illumina data.

Assembly	QV_kmer	QV
hap1 (pat)	pat	46.475
hap1 (pat)	F1	39.931
hap1 (mat)	mat	42.899
hap1 (mat)	F1	38.986

Here is the command I use:

CentOS 7.5

## hifiasm 

### yak count
yak count -k31 -b37 -t16 -o pat_k31.yak <(cat pat.Clean.R1.fq.gz pat.Clean.R2.fq.gz) <(cat pat.Clean.R1.fq.gz pat.Clean.R2.fq.gz)

### trio mode (0.14.2-r315)

hifiasm -o F1.asm -t32 -1 pat.yak -2 mat.yak F1.ccs.fq.gz

### HiC mode (0.15-r327)
hifiasm -o F1.asm -t32 --h1 F1.Clean.R1.fq.gz --h2 F1.Clean.R2.fq.gz F1.ccs.fq.gz


### trioeval (yak 0.1-r62-dirty)

yak trioeval pat.yak mat.yak F1.asm.trio.hap1.p_ctg.fasta > hap1.hic.trioeval
yak trioeval pat.yak mat.yak F1.asm.trio.hap2.p_ctg.fasta > hap2.hic.trioeval


### qv 

yak qv -t 32 -p -K2g -l 100k F1.yak F1.asm.trio.hap1.fasta > F1.hap1.qv
yak qv -t 32 -p -K2g -l 100k F1.yak F1.asm.trio.hap2.fasta > F1.hap2.qv
yak qv -t 32 -p -K2g -l 100k pat.yak F1.asm.trio.hap1.fasta > pat.hap1.qv
yak qv -t 32 -p -K2g -l 100k mat.yak F1.asm.trio.hap2.fasta > mat.hap2.qv

core dumped yak qv

Hi, thanks for writing such a useful tool!

I am hitting a core dump when running yak qv.

$ yak version
0.1-r58-dirty

$ yak count -t 8 -o myfile.k31.yak myfile.30X.fastq
[M::worker_pipeline::0.301*3.60] processed 852 sequences; 8921223 distinct k-mers in the hash table
...
[M::worker_pipeline::76.182*4.75] processed 726 sequences; 304800752 distinct k-mers in the hash table
[M::yak_ch_dump] dumpped the hash table to file 'myfile.k31.yak'.
[M::main] Version: 0.1-r58-dirty
[M::main] CMD: yak count -t 8 -o myfile.k31.yak myfile.30X.fastq
[M::main] Real time: 92.484 sec; CPU: 377.137 sec; Peak RSS: 4.307 GB

$ yak qv -t 8 myfile.k31.yak hifiasm.asm.p_ctg.fasta > hifiasm.asm.p_ctg.k31.qv.txt
[M::yak_ch_restore_core] inserted 304800752 k-mers, of which 304800752 are new
[M::yak_qv_cb] read 358 sequences
[M::yak_qv_cb] read 0 sequences
[M::[email protected]*1.70] processed 358 sequences
[M::yak_qv_cb] read 0 sequences
yak: qv.c:232: yak_qv_solve: Assertion `adj_sum <= (double)qs->tot' failed.
Aborted (core dumped)

The input fastq file is here. Assembly file is here.

Thank you,
Sarah

dumpped the hash table to file ".yak"

Hi everybody,
I encountered with this issue several times. Do you have any idea about this:

[M::worker_pipeline::837.296*10.33] processed 94 sequences; 2333500880 distinct k-mers in the hash table
[M::main_count] 2333322325 distinct k-mers after shrinking
[M::yak_ch_dump] dumpped the hash table to file 'ccs.yak'.
[M::main] Version: 0.1-r56
[M::main] CMD: yak count -b37 -t32 -o ccs.yak ccs.fa
[M::main] Real time: 963.551 sec; CPU: 8858.973 sec; Peak RSS: 48.784 GB
yak: main.c:108: main_qv: Assertion ch' failed. /var/spool/slurmd/job62811721/slurm_script: line 16: 32703 Aborted (core dumped) yak qv -t32 -p -K3.2g -l100k sr.yak "/project/6049291/karimi81/Pacbio_data/wtdbg/mink.srp.fa" > asm-sr.qv.txt yak: main.c:108: main_qv: Assertion ch' failed.
/var/spool/slurmd/job62811721/slurm_script: line 17: 32706 Aborted (core dumped) yak qv -t32 -p sr.yak "/project/6049291/karimi81/Pacbio_data/lib1/ccs.fastq" > ccs-sr.qv.txt
/var/spool/slurmd/job62811721/slurm_script: line 20: 32707 Segmentation fault (core dumped) yak inspect ccs.yak sr.yak > ccs-sr.kqv.txt
[M::yak_ch_restore_core] inserted 2132336103 k-mers, of which 2132336103 are new

Re-alignment and call the variants with T2T-CHM13

Hi Heng,

According to your estimation, when re-alignment and call the variants with T2T-CHM13 can be done for 1000 Genome and other related projects?

Thanks.

Shicheng

Yak trioeval for polypoidy plant genome

Hi, @lh3

We have fully-phased polyploidy plant genome using the genetics map. It has 4 haplotype totally. We want to use the gentics grouping info as ground truth as trio in the diploidy genome. Can I use the yak trioeval for switch error and hamming error?

hap1 -> father
hap2+hap3+hap4 assembly (HiFi only) -> mother
evaluate the hap1 assembly).

Is it reasonable ? Or the yak trioeval has the assumption of polyploidy?

Thanks.

zcat stream error

Dear developer,

Thanks for the software.

When submitting the command in a script via SGE cluster, it failed and the log said:

/sge/default/spool/c5/job_scripts/1124529: line 1: syntax error near unexpected token `('
/sge/default/spool/c5/job_scripts/1124529: line 1: `yak count -o k41.yak -k 41 -b 37 -t 14 <(zcat NGS.reads.r1.fq.gz NGS.reads.r2.fq.gz) <(zcat NGS.reads.r1.fq.gz NGS.reads.r2.fq.gz)'

Could you please give some advies how to deal with the issue?

Best regards.

Updated output documentation for yak triobin

yak/triobin.c

Line 176 in 6de3aff

    
           //		fprintf(stderr, "Output: ctg err strongMixed sPat sMat weakMixed wPat1 wMat1 wPat2 wMat2\n");

Do you have any updated documentation for the output of yak triobin? I'm looking at the output of verison r43, which has 13 columns as opposed to the 10 columns documented in the help text. I'm especially trying to understand column 2, which has values m, p, a, and 0.

Is Hifi and Short Read Combined Counting Possible?

Hi! The question is basically in the title. I was wondering what the command might be if one would like to create a data base from both HiFi and Illumina short reads. Is there a way to merge two yak files after counting? Should they be input all together to the same yak count call?

How to cite yak?

Hi Dr. Li,
I used yak in my paper to calculate QV and I'd like to cite yak. However, I haven't found any information about how to cite yak. Could you please tell me how to cite yak?
Thank you!
Chujia.

Is Assembly and Reads Combined Counting Possible?

Hi! I have another question about combining different data sources into one database. I was wondering what the command might be if one would like to create a Yak database using both an assembly fasta file and paired-end short read fastqs. I know this is a rather unusual request, but I was wondering if this is something that would be possible with Yak, and which setting sone should use in this case, or how can one combine Yak databases after counting each separately.

Large -K is not respected

Hi,

This is basically a duplicate/reminder of lh3/minimap2#491 but for yak. In short if -K is too large it only processes one sequence at a time.

Will process many at once:

 ./yak count -K 2000000000  mat_ill.fasta -o  /dev/null

Will only process one at a time:

 ./yak count -K 20000000000  mat_ill.fasta -o  /dev/null

Best,
Mitchell

trioeval stuff

continuing from #17, I have tried to use trio eval. It too has a bunch of undocumented output fields but lets ignore that.

When you say "father and mother specific counts", you mean count of kmers right? If so, something seems wrong. Meryl results suggest that about 8% of the kmers in my HiC reads are haplotype specific. Yak trioeval is only reporting <0.01% as being haplotype specific. Any idea what is happening here?

To be clear, I have two assemblies, hap1 and hap2. I want to use those for getting haplotype specific kmers, then score each hic read that I have with counts of: hap1 specific, hap2 specific, both, neither.

triobin documentation

I'm currently trying to bin HiC reads in order to attempt to do haplotype specific scaffolding of a phased hifiasm assembly. I was hoping I could use yaks triobinning functionality to get hapmer counts associated with each read, so that I could then attempt to bin the read pairs.

The issue I'm running into is that triobin has no meaningful documentation at the moment. It produces an output table with 10 columns. The first is a contig id. The second is yaks binning result which is useless to me. I'm hoping that the actual counts are somewhere in the remaining 8 columns.

So what are the columns in triobin output?

Understanding output of yak inspect for completeness

As mentioned in the manual, I have checked the read k-mer profile vs assembly k-mer profile using yak inspect. The output file contains lines starting with SN and QV.

SN      2       588933275       66784153        0.0007  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000 0.0000   0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
QV      20      2507879712      2505578138      6.467   -1.000
QV      19      2526259992      2523952802      6.464   -1.000
QV      18      2542258425      2539946329      6.461   -1.000
QV      17      2555656121      2553339972      6.459   -1.000
QV      16      2566448967      2564129690      6.457   -1.000
QV      15      2574901954      2572580226      6.456   -1.000
QV      14      2581364686      2579041194      6.455   -1.000
QV      13      2586262378      2583937639      6.454   -1.000
QV      12      2590120298      2587794660      6.453   -1.000
QV      11      2593472460      2591146198      6.452   -1.000
QV      10      2596865531      2594538820      6.452   -1.000
QV      9       2600840467      2598513407      6.451   -1.000
QV      8       2606142321      2603814977      6.450   -1.000
QV      7       2614002259      2611674634      6.448   -1.000
QV      6       2626876357      2624548450      6.445   -1.000
QV      5       2650645186      2648316963      6.439   -1.000
QV      4       2702958740      2700630199      6.428   -1.000
QV      3       2849803532      2847474729      6.395   -1.000
QV      2       3438736807      3436407745      6.282   -1.000

Could you please help understand the output fields and how to calculate the completeness? Other outputs (qv, trioeval) from yak have header describing these fields.

Thank you

yak inspect "-p" unused

Hi Dr. Li,

Thanks so much for developing this tool. I'm currently exploring the usage of yak to see if I could incorporate it into a pipeline. In the pipeline, I'd need to get a list a kmers (preferably one I could read). To to this with yak,

I downloaded the latest branch:
"
git clone https://github.com/lh3/yak.git
cd yak && make
"

I tried to use yak to generate a list of kmers via these two commands:
""
./yak count -b37 -t6 -o kmers.yak reads.fq.gz
./yak inspect -p kmers.yak > list.kmers
""

However, the output is the exact same as the command to develop a histogram, i.e.
"
./yak inspect kmers.yak > list.kmers

HS 1023 0 139628 139628
HS 1022 0 140 139768
HS 1021 0 149 139917
HS 1020 0 137 140054
HS 1019 0 160 140214
HS 1018 0 120 140334
HS 1017 0 129 140463
"

Admittedly, "./yak inspect -h" only presents a single flag "-m" for use, which perhaps means that "-p" has been discontinued? Am I misunderstanding what the program is supposed to be doing (highly likely), or is this a bug? Please advise.

Thanks in advance!
Best,
bjp

Aborted (core dumped)

Hello professors:
Thanks for providing such a good tool!

I got a issue when running yak qv, the message is shown below.

[M::yak_ch_restore_core] inserted 1145802121 k-mers, of which 1145802121 are new
CC CT kmer_occurrence short_read_kmer_count raw_input_kmer_count adjusted_input_kmer_count
CC FR fpr_lower_bound fpr_upper_bound
CC ER total_input_kmers adjusted_error_kmers
CC CV coverage
CC QV raw_quality_value adjusted_quality_value
CC
[M::yak_qv_cb] read 125 sequences
[M::yak_qv_cb] read 0 sequences
[M::[email protected]*3.77] processed 125 sequences
[M::yak_qv_cb] read 0 sequences
yak: qv.c:232: yak_qv_solve: Assertion `adj_sum <= (double)qs->tot' failed.
Aborted (core dumped)

Could you please give me some advice about how to solve the problem? Thank you so much.

Look foward to your reply!

yak on Hifi parental ?

Hello,

Does yak work with Hifi parental data or only with illumina?