
pangenie's People

Contributors: eblerjana, yonghaoy

pangenie's Issues

Which version of CHM13 reference genome to use

Hi,

Thanks for developing this great tool. I would like to use the "HPRC-CHM13 (88 haplotypes)" variant reference. Based on the name, I think the reference genome I should use for genotyping is CHM13, but I am not sure which version of CHM13 to use. Could you provide links to the reference genome you used when generating the variant reference vcf?

Thanks,
Han

High FP call after running Truvari on HG002

Hi @eblerjana,

I've been working with PanGenie more often since my last issue. I was keen to assess its performance for genome inference on HG002, so I benchmarked my PanGenie outputs for this sample – obtained after running the tool on both my pangenome graphs (built with minigraph-CACTUS and PGGB) – against the truth set from GIAB.

However, while the recall is reasonable (though not as good as with the HPRC graph), the precision is quite poor. After talking with the HPRC team here in Santa Cruz, they pointed out that most of the false positives lowering the precision likely fall in satellite and other difficult regions, and explained that much of the precision improvement in a PanGenie run comes from post-processing filtering with an ML approach (I've been reading #14).

Therefore, my question is: how can I run this script for just my single HG002 sample? What I gathered from issue #14 is that this approach is particularly beneficial for large cohorts with trio data; is there a way to apply it to a single sample? You also pointed to the concept of genotype quality, which might work in the absence of that information; how can I assess that?

Thanks in advance. Looking online, I came across these two links:

  1. https://github.com/slowkoni/rfmix
  2. https://github.com/AI-sandbox/gnomix

Are they in any way related to how the ML approach works, or what it is based on? This is just for my understanding. Thanks again!

update help/readme: uncompressed input

The command line help is accurate (reads.fa/fq), but since uncompressed read files are uncommon, it would be helpful to add that requirement in a more obvious way, e.g.

-i VAL	sequencing reads in FASTA/FASTQ format or Jellyfish database in jf format (required).
        NOTE: INPUT FASTA/Q FILE MUST NOT BE COMPRESSED

FastaReader::parse_file: reference file cannot be opened

Hi, pangenie group,
I ran into a problem when running PanGenie with Singularity: it seems it cannot open the reference FASTA file, even though I can read the file myself. The error is listed below. In addition, I'd like to ask how to select an appropriate reference FASTA file; is it also part of our pangenome?

singularity exec /share/apps/gene/pangenie-1.0.1/pangenie-new-gcc.sif cat /metadata/jellyfish.lib.version
libjellyfish-2.0-2:amd64 2.3.0-4build1 libjellyfish-2.0-dev:amd64 2.3.0-4build1

singularity exec /share/apps/gene/pangenie-1.0.1/pangenie-new-gcc.sif cat /metadata/pangenie.git.version
6e0da25

singularity exec /share/apps/gene/pangenie-1.0.1/pangenie-new-gcc.sif PanGenie -i $input.fastq -r $reference.fa -v $graph.vcf -o test

program: PanGenie - genotyping and phasing based on kmer-counting and known haplotype sequences.
author: Jana Ebler

Files and parameters used:
-c	0
-d	0
-e	3000000000
-g	1
-i	$input.fastq
-j	1
-k	31
-o	test
-p	0
-r	$reference.fa
-s	sample
-t	1
-u	0
-v	$graph.vcf
Determine allele sequences ...
terminate called after throwing an instance of 'std::runtime_error'
  what():  FastaReader::parse_file: reference file cannot be opened.
Aborted

How to process large paired-end fastq files?

Hello,
I have about 200 samples. There are two fastq.gz files (sample.R1.fq.gz and sample.R2.fq.gz) for every sample. Each file is very large (more than 50 GB compressed). How can I unzip them and combine the paired-end files into one file?

Thank you !
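For what it's worth, since PanGenie works on k-mer counts, read pairing and order should not matter, so one approach is simply to decompress and concatenate both mates into a single uncompressed FASTQ (filenames taken from the question; this is a sketch, not an official recommendation):

```shell
# decompress and concatenate both mates into one uncompressed FASTQ;
# k-mer counting does not depend on read pairing or order
zcat sample.R1.fq.gz sample.R2.fq.gz > sample.fq
```

Note that the combined uncompressed file will be several times larger than the two .gz inputs, so check available disk space first.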

How is Jellyfish database in jf format created for PanGenie?

The PanGenie command line help indicates that a jellyfish database in jf format is an option for input.

-i VAL sequencing reads in FASTA/FASTQ format or Jellyfish database in jf format.
NOTE: INPUT FASTA/Q FILE MUST NOT BE COMPRESSED. (required).

What is the recommended Jellyfish command for creating the jf file? How does the jf-formatted file compare in size to the uncompressed FASTA or FASTQ file?
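The help text doesn't give a canonical command, but a working invocation appears in a bug report further down this page; a minimal sketch for plain reads might look like this (`-m 31` matches the default k-mer size shown in PanGenie's logs, `-C` counts canonical k-mers, and the `-s` initial hash size is an arbitrary placeholder to adjust for your data):

```shell
# count canonical 31-mers from the reads into a Jellyfish database
jellyfish count -m 31 -s 3G -C -o reads.jf reads.fastq
```

On size: the .jf file stores distinct k-mers and their counts, so its size depends on k-mer diversity rather than the number of reads; for high-coverage data it is typically much smaller than the uncompressed FASTQ.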

Normalization of genotype likelihoods from subsampling steps

Currently, genotype likelihoods from subsampling steps are added and normalized after each step. A better approach would be to first add up all likelihoods, and then do the normalization once at the end. Otherwise, the likelihoods are not weighted equally, which might lead to genotyping errors.

// normalize the likelihoods after they have been combined
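The effect is easy to see with toy numbers: when each step is normalized before adding, a step with little total weight contributes as much as a heavily weighted one. A small illustration in awk (likelihood values invented for the example, not PanGenie's actual numbers):

```shell
awk 'BEGIN {
  # toy likelihoods for two genotypes from two subsampling steps
  s1_0 = 8; s1_1 = 2;   # step 1, total weight 10
  s2_0 = 1; s2_1 = 1;   # step 2, total weight 2
  # add first, then normalize once (proposed approach)
  a0 = s1_0 + s2_0; a1 = s1_1 + s2_1;
  printf "add-then-normalize: %.2f %.2f\n", a0 / (a0 + a1), a1 / (a0 + a1);
  # normalize each step, then add and renormalize (current approach)
  n0 = s1_0 / 10 + s2_0 / 2; n1 = s1_1 / 10 + s2_1 / 2;
  printf "normalize-then-add: %.2f %.2f\n", n0 / (n0 + n1), n1 / (n0 + n1);
}'
```

This prints 0.75/0.25 versus 0.65/0.35: normalizing early lets the low-weight second step pull the combined genotype likelihoods toward its own flat distribution.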

jellyfish version?

INSTALL

pangenie]$ cd build/
build]$ cmake ..
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /xxx/software/miniconda3/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /xxx/software/miniconda3/bin/x86_64-conda-linux-gnu-c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Checking for module 'jellyfish-2.0'
-- Found jellyfish-2.0, version 2.2.10
-- /Parastor300s_G30S/zhangjj/software/miniconda3/envs/pangenie/include/jellyfish-2.2.10
-- /parastor300/work01/zhangjj/software/PanGenie/pangenie/build/src
-- Configuring done
-- Generating done
-- Build files have been written to: /parastor300/work01/zhangjj/software/PanGenie/pangenie/build
build]$ make
Scanning dependencies of target PanGenieLib
[ 1%] Building CXX object src/CMakeFiles/PanGenieLib.dir/emissionprobabilitycomputer.cpp.o
[ 2%] Building CXX object src/CMakeFiles/PanGenieLib.dir/copynumber.cpp.o
[ 4%] Building CXX object src/CMakeFiles/PanGenieLib.dir/commandlineparser.cpp.o
[ 5%] Building CXX object src/CMakeFiles/PanGenieLib.dir/columnindexer.cpp.o
[ 7%] Building CXX object src/CMakeFiles/PanGenieLib.dir/dnasequence.cpp.o
[ 8%] Building CXX object src/CMakeFiles/PanGenieLib.dir/fastareader.cpp.o
[ 10%] Building CXX object src/CMakeFiles/PanGenieLib.dir/genotypingresult.cpp.o
[ 11%] Building CXX object src/CMakeFiles/PanGenieLib.dir/histogram.cpp.o
[ 13%] Building CXX object src/CMakeFiles/PanGenieLib.dir/hmm.cpp.o
...
[ 38%] Linking CXX executable PanGenie-graph
/xxx/software/miniconda3/bin/../lib/gcc/x86_64-conda-linux-gnu/9.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -ljellyfish-2.0
collect2: error: ld returned 1 exit status
make[2]: *** [src/CMakeFiles/PanGenie-graph.dir/build.make:104: src/PanGenie-graph] Error 1
make[1]: *** [CMakeFiles/Makefile2:166: src/CMakeFiles/PanGenie-graph.dir/all] Error 2
make: *** [Makefile:114: all] Error 2

build]$ jellyfish --version
jellyfish 2.2.10

Does the version of Jellyfish have to be 2.0?

Should I hard-mask the assemblies first?

Hi !
I used your pangenome-graph-from-assemblies pipeline to construct a VCF file, but I noticed that some of the variants have an unusually high number of alleles. See below:
1 18300 . GACACACAACACACAC GACAGACAGACACACACAC,GACAGACAGACAGACAGACACAC,GACAGACAGACAGACAGACAGACACACACACACACACACAC,G,GACACACACACACACACACACAC,GACAGACAGACAGACACACACACACACAC
1 19171 . CAAAAAAAA CAA,CAAA,CAAAAAAAAAAAA,CAAAAAAAAAAAAAAAAAAAA,CAAAAAAAAAAAAAAAAAAAAA,CAAAAAAAAAAAAAAAAAAAAAA,CAAAAA,CAAAA,CAAAAAAAAAA,C,CA,CAAAAAAAAAAA
I suspect this may be due to repetitive sequences in my input assemblies, which were taken directly from hifiasm assembly results without being masked.

[question] Can the variants VCF be a single chromosome?

Hi, @eblerjana

Can the variants VCF contain just a single chromosome? For parallelizing large-scale samples, can we use a per-chromosome VCF for genotyping/phasing, or does PanGenie need whole-genome variants for more precise kmer counting and genotyping?

Thanks.
Zhigui Bao

How to represent structural variants in input VCF?

Hi,

Thank you very much for such a nice tool.

I would like to represent not only snps but also structural variants in the input VCF.

Could you please tell me how to do that?

This is my toy example:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1>
##ALT=<ID=DEL,Description="Deletion">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00732 NA12878
1 15951 . T <DEL> . PASS . GT 0|0 1|1


And the error message:

VariantReader: skip variant at 1:15950 since alleles contain undefined nucleotides:

Best,
Mehmet

Can pangenie also work with GBS data? Genotyping-by-sequencing?

Hi pangenie team,

First of all, thank you very much for developing such a beautiful tool for the community.

I am wondering: instead of whole-genome sequencing data, can PanGenie also work with genotyping-by-sequencing (GBS) data? In plant breeding, people normally do GBS to save money, since there might be hundreds of plants that need to be genotyped.

Looking forward to your answer and thank you very much in advance.

Best wishes,
Yutang

Genome graph construction, variant filtering & phasing

Hi,

Thanks for your work on this program. The method seems really elegant and we've had some really exciting results from it so far.

Here is a brief outline of our preliminary analysis:

  • We have a well assembled reference genome plus 8 individuals sequenced with HiFi, two of the individuals are properly phased with parent short read data, whereas the others are just mosaic haplotype assemblies.
  • We then have a large number of short read population samples, which we'd like to genotype.
  • There is a specific region of the genome that we are interested in, which has an elevated concentration of large SVs that overlap, including an insertion of >20Kbp. Getting accurate genotyping results for this region from short-reads would be really useful for our subsequent analyses.
  • Currently, the genotyping of some of the TE sequences in this region has fairly low GQ scores and so doesn't seem completely reliable.

I have a couple of questions to make sure I'm getting the most out of the program:

  1. I initially had good results with minigraph-cactus for generating a comprehensive vcf file. However, I read on the PanGenie issues page that this is not a recommended program. There was a reference to vcfbub, but it's not clear to me what steps need to be taken to go from the reference + long-read assemblies to an appropriate pangenome graph. I didn't notice any obvious issues with the subsequent short-read genotyping results from minigraph-cactus, but I'm worried that I'm not getting the best results.

  2. I was trying to be quite conservative with the genotyping results, I used the GQ flag as mentioned in the paper, and then also applied a minimum KC of 7 per sample. Is this a reasonable filtering approach?

  3. I noticed that there is an experimental phasing option. We are quite interested in the haplotype structure of our data, and initially we used Shapeit (using the HiFi individuals as reference haplotypes) to phase the population data. Are any more details available about PanGenie phasing? Is it worth applying the PanGenie phasing prior to Shapeit, perhaps? Is it worth combining PanGenie genotyping with read-based phasing programs like WhatsHap?
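A per-sample filtering step like the one described in question 2 can be sketched with standard tools; the snippet below is only an illustration, assuming a single-sample PanGenie output VCF whose FORMAT is GT:GQ:GL:KC (as in the genotyping VCFs shown elsewhere on this page), with the GQ and KC thresholds chosen arbitrarily:

```shell
# keep records where the sample's GQ and KC pass the chosen cutoffs;
# FORMAT is assumed to be GT:GQ:GL:KC, so GQ is subfield 2 and KC is subfield 4
awk 'BEGIN { FS = OFS = "\t" }
/^#/ { print; next }
{
  split($10, fmt, ":")
  if (fmt[2] + 0 >= 200 && fmt[4] + 0 >= 7) print
}' sample_genotyping.vcf > sample_filtered.vcf
```

If bcftools is available, an expression such as `bcftools view -i 'FMT/GQ>=200 && FMT/KC>=7'` should achieve the same more robustly, since it parses the FORMAT keys rather than relying on field positions.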

Counting kmers from reads uses too much memory compared to jf

Hi Jana,

I've noticed that the memory usage when counting kmers directly from reads (rather than from a .jf) is incredibly high, and it looks like PanGenie reads the entire fastq file into memory and then counts the kmers. I think this also leads to under-reporting of usage, because the PanGenie log for a run says

Count kmers in genome ...
#### Memory usage until now: 61.7867 GB ####
Determine unique kmers ...
#### Memory usage until now: 79.1394 GB ####
....
Total maximum memory usage: 84.4493 GB

Meanwhile, the cluster reports a peak usage of 223.6 GB. At one point during kmer counting it was 176.3 GB, so the kmer counting definitely seems like a big bottleneck, especially for moderate/high-coverage samples.

Just running Jellyfish by itself to produce the .jf file peaked at only 42.1 GB, so that seems like the better approach currently.

I tried, and failed, to implement a streaming feature in Jellyfish. It looks like they have some code set up for that, but I didn't get very far.

update help/readme: output prefix

It is not obvious how PanGenie handles the output prefix if the path includes non-existent folders. By looking at the runtime errors, I assume PanGenie does not create non-existent folders.
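If that is indeed the case, creating the directories up front sidesteps the error; a minimal sketch (paths are hypothetical):

```shell
out=results/run1/sample1            # output prefix; parent folders may not exist yet
mkdir -p "$(dirname "$out")"        # create any missing folders before running PanGenie
PanGenie -i reads.fq -r ref.fa -v vars.vcf -o "$out"
```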

VariantReader: skip too many short variants from minigraph-cactus

PanGenie is a user-friendly and fast genotyper.

I generated vcf from minigraph-cactus.
When I used PanGenie v2.1.0, it produced results, but too many (nearly 36%) of the variants were skipped. The command is

~/tool/pangenie-2.1.0/build/src/PanGenie -i $fq -r $ref -v $mcvcf -j $th -s $sample -o $sample -t 20

The log output looks like:

VariantReader: skip variant at Chr01:5201 since it is contained in a previous one.
...

The source vcf file had been changed to the diploid version. Here are the first and second variants.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1       Sample2   Sample3     Sample4     Sample5     Sample6   Sample7       Sample8     Sample9       Sample10
Chr01   5194    >1008>1119      TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC      TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC,TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTAAACCCTAAACCCTAAACTCTAAACCCTAAACCCTAAACTCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTGAACCCTAAATCCTAAACCCTAAACCCTAAACCCTAAACCC,TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCTGAACCCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC,TAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCTAAACCCTAAACCCTAACCCTAAGCCCTAAACCCTAAACCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAACCCTAAACCCTAAACCCTAAGCCCTAAGCCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAGCCCTAAGCCCTAAGCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC,TATT 60      .       
AC=1,1,1,1,1;AF=0.166667,0.166667,0.166667,0.166667,0.166667;AN=6;AT=>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1104>1106>1108>1109>1110>1112>1113>1115>1116>1118>1119,>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1104>1106>1107>1109>1110>1112>1114>1115>1116>1118>1119,>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1063>1065>1066>1068>1069>1071>1072>1074>1075>1077>1078>1080>1081>1083>1084>1086>1087>1089>1090>1092>1093>1095>1097>1099>1100>1102>1103>1104>1106>1107>1109>1111>1112>1113>1115>1117>1118>1119,>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1041>1043>1044>1046>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1063>1065>1066>1068>1069>1071>1072>1074>1075>1077>1078>1080>1081>1083>1084>1086>1087>1089>1090>1092>1093>1095>1096>1097>1098>1100>1101>1103>1104>1106>1107>1109>1110>1112>1113>1115>1116>1118>1119,>1008>1009>1012>1013>1015>1016>1018>1019>1021>1022>1024>1025>1027>1028>1030>1031>1033>1034>1036>1037>1039>1040>1042>1043>1045>1046>1047>1049>1050>1052>1053>1055>1056>1058>1059>1061>1062>1064>1065>1067>1068>1070>1071>1073>1074>1076>1077>1079>1080>1082>1083>1085>1086>1088>1089>1091>1092>1094>1095>1097>1099>1100>1101>1103>1104>1105>1106>1107>1109>1110>1112>1114>1115>1116>1118>1119,>1008>1009>1010>1119;NS=6;LV=0     GT      5|5     .|.     3|3     1|1     0|0     2|2     .|.     4|4     .|.     .|.
Chr01   5202    >1013>1016      G       A       60      .       AC=1;AF=0.2;AN=5;AT=>1013>1014>1016,>1013>1015>1016;NS=5;LV=1;PS=>1008>1119     GT      .|.     .|.     0|0     0|0     0|0     0|0     .|.     1|1     .|.     .|.
...

I checked the REF/ALT sequences and found the second variant to be redundant:
REF[Sample5]: the base at position 5202 can be inferred from the first variant. TGAACCCT[g]AA
ALT[Sample8]: the base at position 5202 can be inferred from the first variant as well. TAAACCCT[a]AA

I know that PanGenie uses non-overlapping variants from the vcf file, but I wonder how to check whether a vcf file can be used as input to PanGenie. Are there suggested steps to pre-process the vcf file?
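As a rough sanity check before genotyping, one can count records that start inside the reference span of an earlier record on the same chromosome; a sketch in awk (filename hypothetical; it only tracks the furthest REF end seen so far, so treat the count as approximate):

```shell
# count VCF records whose start lies inside the REF span of an earlier record
awk 'BEGIN { FS = "\t" }
!/^#/ {
  e = $2 + length($4) - 1                  # end of this record on the reference
  if ($1 == chrom && $2 <= end) n++        # starts before a previous record ends
  if ($1 != chrom) { chrom = $1; end = e }
  else if (e > end) end = e
}
END { print n + 0, "overlapping records" }' graph.vcf
```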

Segmentation fault

Hi
I started to install pangenie using the following code:
git clone https://github.com/eblerjana/pangenie.git
cd pangenie
conda activate pangenie
mkdir build; cd build; cmake .. ; make
Since I use an old Linux HPC with an old glibc, I got this message:

cmake: /lib64/libc.so.6: version `GLIBC_2.17' not found

I installed glibc2.17 using this code:
conda install -c fridh glibc
Then I reactivated the pangenie environment and ran
cmake
in the pangenie directory.
I get:

Segmentation fault

Thank you

Does pangenie support ARM architecture?

I want to install pangenie on an ARM server, but it always raises errors. I don't know whether pangenie supports the ARM architecture; if you know, please tell me. Thanks.

Can long reads be used for <reads.fa/fq>?

Hi,
Thank you very much for such a nice tool. usage: PanGenie [options] -i <reads.fa/fq> -r <reference.fa> -v <variants.vcf>
I found that the paper (https://www.biorxiv.org/content/10.1101/2020.11.11.378133v1) says: we propose a new algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference together with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation....

May I ask whether PanGenie can only use short reads? Can long reads (such as Nanopore or PacBio reads) be used for <reads.fa/fq>? What is the advantage over second-generation sequencing?
The genotype in the output file is the genotype of the sample whose reads are passed via -i <reads.fa/fq>, right?

Thanks.

Error while launching from Snakefile

Hi there,

My aim is to get a genome inference for a human sample (HG005) to benchmark my graphs in terms of SV calling using PanGenie. I assembled pangenomes both with minigraph-CACTUS (which I'm testing right now) and with pggb.

That said, despite the pangenomes being quite small (only five individuals), after correctly outputting the .vcf files for both of them, I run into an error when launching from the Snakefile. Below is the output for the command:

File "Snakefile", line 13
    rule all:
           ^
SyntaxError: invalid syntax

Now, the reason I'm doing this is the non-overlapping variants requirement. Since my .vcf files weren't generated using your pipeline, I'm assuming they don't conform to the specific standards you mentioned (particularly the third one); hence why I proceeded this way.

One last thing: I'm not very experienced with Snakefiles, and after looking online I've seen they are launched by invoking python3; is that the correct way? Let me know, thanks in advance!

P.S. Attached is a version of the .yaml edited to match the inputs for the script (config.log)
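For context on the SyntaxError above: a Snakefile is not plain Python, so running it with python3 directly fails exactly at `rule all:`. Snakemake workflows are executed through the `snakemake` command, which reads `./Snakefile` by default; a sketch (the core count and config filename are arbitrary examples):

```shell
# run the workflow defined in ./Snakefile with 4 parallel cores
snakemake --cores 4

# or point at a specific Snakefile and config file explicitly
snakemake -s Snakefile --configfile config.yaml --cores 4
```

Adding `-n` performs a dry run, which is a quick way to check that the Snakefile parses before committing compute time.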

pggb vcf are empty

Hi again,

I'm having serious trouble processing the PGGB .vcf files... At first the tool pointed out that the haplotypes were un-phased; I fixed this by collapsing the columns for each individual's haplotypes into a single column per sample using the following awk command:

awk -F '\t' '/^#/ {print;next;} {OFS="\t";for(i=j=10; i < NF; i+=2) {$j = $i"|"$(i+1); j++} NF=j-1}1' pangenome_ref_guided_GRCh38.vcf > pangenome_ref_guided_GRCh38-phased.vcf 

In fact, compared to minigraph-CACTUS, PGGB outputs a .vcf file where the haplotypes for each individual in the graph are split into two columns, which leads PanGenie to think they are un-phased. I then proceeded to fix the column headers for the respective samples, but despite running to completion, the process still returns an empty .vcf.

I'm not sure what caused it, but after talking with Glenn and looking online I realized I was missing some pre-processing steps, namely removing complex bubbles and decomposing nested alleles with vcfbub and vcfwave, respectively.
I ran the following command:

vcfbub -l 0 -r 10000 -i pangenome_ref_guided_GRCh38-phased.vcf.gz | vcfwave -L 10000 -t 16 -n | bgzip -c -@ 16 > decomposed_and_cleaned.vcf.gz

Unfortunately, I got the same result... So, going back to one of my hypotheses, this could be linked to having # in the sequence names, as per the PanSN naming standard used by PGGB.
I tested removing everything but the string chr(number) from the #CHROM column; however, even after this attempt the result is always an empty .vcf file.
I thought it could be helpful to re-run the vcfbub and vcflib commands, but this time I got the following error:

error: more sample names in header than sample fields
samples: CHM13 HG002 HG00514 HG00733 HG03492 NA19240
line: chr1 418412 >11061496>11061499 T C 60.0 . AC=1;AF=1;AN=1;AT=>11061496>11061497>11061499,>11061496>11061498>11061499;NS=1;LV=0 GT .|. .|. .|. .|. .|1
thread 'main' panicked at 'called Result::unwrap() on an Err value: IoError(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })', src/main.rs:199:46
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

This is strange, as I didn't get that error before; maybe it's because I ran the pre-processing before editing with awk, I don't know for sure...

Please, if you have any clue about what is happening, I would really appreciate some help. I've tried several approaches, but all failed, and I wonder whether there is something I'm missing.
That said, I'm happy to attach a screenshot of the original .vcf file if needed; sorry for the long text.

Thanks in advance!

P.S. I always made sure the contig names in the reference and the #CHROM entries in the .vcf were consistent in all trials

AK field might be misleading

Variants in the input VCF that are less than the kmer size apart are internally merged by PanGenie and treated as a single variant. Alleles of this merged variant are composed of combinations of the merged variants (defined by the paths). The number of unique kmers per allele is computed only for this merged variant. When writing the output VCF, these merged variants are converted back to their original representation (creating the original entries). However, the AK counts are computed for each original allele by summing up the unique kmers of all combined alleles that include that particular allele. The problem is that the same kmer can be counted several times when occurring on several combined alleles, leading to numbers of AK that exceed UK. There is not really any proper way of computing AK counts for the original alleles.

Error in vcf-merging/pangenome-graph-from-assemblies/Snakemake file

Hi,

When I tried to reproduce your pangenome graph, I noticed that there were several errors in vcf-merging/pangenome-graph-from-assemblies/Snakemake file, mainly about generating trio sample list and ped file.

According to function parse_trios in scripts/mendelian-consistency.py,

def parse_trios(ped_file, samples):
	samples_to_include = set()
	with open(samples, 'r') as listing:
		for line in listing:
			samples_to_include.add(line.strip())
	trios = {}
	for line in open(ped_file, 'r'):
		if line.startswith('#'):
			continue
		fields = line.split()
		# map trio_name -> [child, parent1, parent2, nr consistent, nr_inconsistent, nr_untyped, nr total]
		if any([s not in samples_to_include for s in fields[1:4]]):
			continue
		trios[(fields[1], fields[2], fields[3])] = [fields[1], fields[2], fields[3], 0, 0, 0, 0]
	return trios

trio sample list (results/trio-samples.txt) should be like:

HG00731
HG00732
HG00733
HG00512
HG00513
HG00514
NA19238
NA19239
NA19240

However, the Snakemake file generated a file like:

Puerto Rican
HG00731
HG00732
HG00733
Chinese
HG00512
HG00513
HG00514
Yoruban
NA19238
NA19239
NA19240

And ped (results/trios.ped) file should look like:

Puerto_Rican	HG00733	HG00731	HG00732
Chinese	HG00514	HG00512	HG00513
Yoruban	NA19240	NA19238	NA19239

instead of:

Puerto Rican	Puerto Rican	HG00731	HG00732
Chinese	Chinese	HG00512	HG00513
Yoruban	Yoruban	NA19238	NA19239

(Also, trio names should not contain spaces)

To solve the problem above, you may need to modify some lines in Snakemake file:

#################################################################
#  5) Check mendelian consistency for trios and construct graph
#################################################################


# generate a file specifying the trio relationships
rule generate_ped_file:
	output:
		"{outdir}trios.ped"
	run:
		with open(output[0], "w") as ped_output:
			for trio in config['trios']:
				father=config['trios'][trio][0]
				mother=config['trios'][trio][1]
				#Original
				#ped_output.write('\t'.join([trio, trio, father, mother]) + '\n')
				#MODIFIED#
				child=config['trios'][trio][2]
				ped_output.write('\t'.join([trio, child, father, mother]) + '\n')
				
rule generate_samples_file:
	output:
		"{outdir}trio-samples.txt"
	run:
		with open(output[0], "w") as sample_output:
			for trio in config['trios']:
				#sample_output.write(trio + '\n') #MODIFIED#
				for sample in config['trios'][trio]:
					sample_output.write(sample + '\n')

How to use paired end fastq file

Hello,
My sequencing files (.fastq) are paired-end, but I didn't find a parameter in PanGenie that fits this kind of file. Can you give me an example of using paired-end files?

Thanks!

Push singularity image to Sylabs Cloud

Any chance of pushing the singularity image to a cloud service?
It would be great for those of us who try to integrate this tool into nextflow genotyping pipelines.

Thank you.

Which step used the graph pangenome?

Hello, authors,

In the demo, 'test-variants.vcf' was a regular variant vcf file. I am confused about which step uses the pangenome graph format shown in Extended Data Fig. 1. I used the 'pangenome-graph-from-assemblies' pipeline to generate a callset.vcf file. What should I do next? Should I use vcfbub to convert it into a bubble file and then perform genotyping?

I would appreciate your help.

Tian Tian

Filtering variants

Hi there,

Thanks for this great tool. I noticed that, in both the HPRC reference release paper and the original paper describing this tool, you employ complex pre- and post-PanGenie variant filtering:

"We used vcfbub (version 0.1.0, (Garrison, 2021)) with parameters -l 0 and -r 100000 in order to filter the VCF..."
"We additionally remove all records for which more than 80% of all 88 haplotypes carry a missing allele..."
"we have implemented a decomposition approach which aims at detecting all variant alleles nested inside of multi-allelic top-level records..."
"[We] defined a high quality subset of SV alleles that we can reliably genotype. For this purpose, we applied a machine learning approach similar to what we have presented previously..."

How important are these methods to the quality/reliability of the final genotypes, especially for large SVs (>50 bp)?
Does this package implement these filters, and if not, is there code that does?
Do you have pre-trained models to use for the machine learning/regression based filter?

Just trying to get high-quality SV genotypes on my diverse short-read dataset, so any help is greatly appreciated.

Thanks again!
-Joe

Joint-genotyping across samples

Is there a way to generate a joint-genotyped vcf with pangenie? If I understand the software correctly, only the variants in the pangenome reference are called across all the genotyped samples while the non-pangenome variants detected are called within individual samples.

PanGenie nondeterministic when executed with a Jellyfish database and multiple threads

Running PanGenie with a Jellyfish database in jf format and multiple threads (-t) produces randomized output.
Executing it with either a .jf file or multiple threads works fine, but not with both parameters provided.

Steps to Reproduce

jellyfish count -m 31 -s 3G -p 126 -c 7 -C -L 1 --if <path-segments.fasta> <input.fasta> -o input.jf
PanGenie -i input.jf -r <reference.fa> -v <variants.vcf>  -t 2 -o V1
PanGenie -i input.jf -r <reference.fa> -v <variants.vcf>  -t 2 -o V2
diff V1_genotyping.vcf V2_genotyping.vcf | wc -l

Expected Result

Both genotyping files should be the same.

diff V1_genotyping.vcf V2_genotyping.vcf | wc -l
0

Actual Result

However, there are many (small) differences:

diff V1_genotyping.vcf V2_genotyping.vcf | wc -l
10790792

How to Fix

The problem seems to be within the JellyfishReader class. I'm not familiar with the Jellyfish source code, but I presume that the line abundance = this->db->check(jelly_kmer); in the function getKmerAbundance cannot be parallelized as is.

I tried guarding the line with a mutex, and this does resolve the problem. However, the overhead of the mutex makes the "determining unique kmers" section almost as slow as serial execution, so it's definitely not an optimal solution.

Remark: The bug cannot be observed with the provided test data, because that VCF file only contains a single chromosome. In that case PanGenie will only launch a single thread, regardless of how many were specified on the command line.

Evaluating Pangenie vcf output with Truvari

Hi,

Thank you for developing PanGenie! Recently I've been trying to evaluate PanGenie's output vcf with Truvari, but it turns out that entries with multiple sequences (separated by commas) in the ALT field are misclassified as false positives by Truvari. Is there any way or criterion to choose only one sequence from the ALT field, or to find the most likely variant corresponding to such an entry?

For example, given a record like this:

chr1 899960 . TGGGAGGGGTCCGCGCGTCCGCAGTGGGGATGTGCTGCGGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGA TGGGAGGGGTCCACGCGTCCGCAGTGGGGCTGCCGGGAGGGGGGCCGGGTCCACAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCGCAGTGGGGCTGTGCTGCCGGGA,TGGGAGGGGTCCGCGCGTCCGCAGTGGGGCTGCCGGGAGGGGGGCCGGGTCCACAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCGCAGTGGGGCTGTGCTGCCGGGA,TGGGAGGGGTCCGCGCGTCCGCAGTGGGGCTGTGCTGCGGGAAG,CCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCGCAGTGGGGATGTGCTGCCGGGA,TGGGAGGGGTCCGCGCGTCCGCAGTGGGGCTGTGCTGCGGGAAGAGGGGGGCCGGGTCCGCAGTGGGGATGTGCTGCCGGGA,CGGGGA,CGGGGAGGGGGGCGCGGGTCCGTAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCACAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAA . PASS AF=0.05,0.05,0.15,0.05,0.1,0.15,0.05;UK=301;AK=52,112,95,44,47,70,54,97;MA=8;ID=chr1-899972-SNV-0-1:chr1-899987-DEL-0-5:chr1-899998-SNV-0-1:chr1-900009-DEL-1-1:chr1-900018-SNV-0-1:chr1-900112-SNV-0-1:chr1-900115-SNV-0-1:chr1-900123-SNV-0-1:chr1-900124-INS-0-1:chr1-900141-SNV-0-1,chr1-899987-DEL-0-5:chr1-899998-SNV-0-1:chr1-900009-DEL-1-1:chr1-900018-SNV-0-1:chr1-900112-SNV-0-1:chr1-900115-SNV-0-1:chr1-900123-SNV-0-1:chr1-900124-INS-0-1:chr1-900141-SNV-0-1,chr1-899989-SNV-0-1:chr1-900001-SNV-0-1:chr1-900003-DEL-0-151,chr1-899955-DEL-0-113:chr1-900112-SNV-0-1:chr1-900115-SNV-0-1:chr1-900123-SNV-0-1:chr1-900124-INS-0-1,chr1-899989-SNV-0-1:chr1-900000-INS-0-2:chr1-900009-DEL-2-115,chr1-899955-DEL-1-189:chr1-900150-SNV-0-1,chr1-899955-DEL-2-37:chr1-900019-SNV-0-1:chr1-900036-SNV-0-1:chr1-900039-SNV-0-1:chr1-900047-SNV-0-1:chr1-900048-INS-0-1:chr1-900056-SNV-0-1:chr1-900150-SNV-0-1:chr1-900153-SNV-0-1 GT:GQ:GL:KC 
3/3:10000:-82.8146,-359.1,-264.5,-291,-281.3,-185.3,-93.41,-258.7,-197.2,0,-138.8,-309.4,-199.5,-35.39,-57.8,-169.7,-335,-273.5,-101.3,-122.4,-72.88,-187.1,-362.2,-315.3,-117.7,-191.2,-194,-120.8,-247.5,-409.4,-342.2,-178.1,-270.1,-210.3,-201.2,-202.4:52

How should I interpret it and convert it (if possible) to simple variant records that each have only one sequence in both REF and ALT?

Thank you very much

Best
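For what it's worth, the ID field in the record above already encodes a decomposition: it holds one comma-separated entry per ALT allele, and each entry is a colon-separated list of the simple variants (formatted chrom-pos-type-…) that compose that allele. A rough awk sketch to list them per ALT allele (filename hypothetical; assumes INFO is the 8th column of a standard VCF):

```shell
# print the underlying simple variant IDs that compose each ALT allele
awk 'BEGIN { FS = "\t" } !/^#/ {
  n = split($8, info, ";")
  ids = ""
  for (i = 1; i <= n; i++)
    if (info[i] ~ /^ID=/) ids = substr(info[i], 4)   # strip the "ID=" prefix
  m = split(ids, per_alt, ",")                       # one entry per ALT allele
  for (a = 1; a <= m; a++)
    printf "%s:%s ALT allele %d -> %s\n", $1, $2, a, per_alt[a]
}' genotyped.vcf
```

Combined with the genotype (here 3/3, i.e. both haplotypes carry ALT allele 3), this recovers the set of simple variants the sample was genotyped with; whether that representation then satisfies Truvari is a separate question.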
