eblerjana / pangenie
Pangenome-based genome inference
License: MIT License
Hi,
Thanks for developing this great tool. I would like to use the "HPRC-CHM13 (88 haplotypes)" variant reference. Based on the name, I think the reference genome I should use for genotyping is CHM13, but I am not sure which version of CHM13 to use. Could you provide links to the reference genome you used when generating the variant reference vcf?
Thanks,
Han
Hi @eblerjana,
I've been working with PanGenie
more often since my last issue. I was keen to assess its performance for genome inference of HG002, so I benchmarked my PanGenie
outputs for this sample – obtained after running the tool on both of my pangenome graphs (built with minigraph-CACTUS and PGGB) – against the truth set from GIAB.
However, while the recall is reasonable (though not as good as the HPRC one), the precision is outright bad... After talking with the HPRC team here in Santa Cruz, they pointed out that most of the false positives lowering the precision are likely in satellite or other difficult regions (I guess?), and explained that most of the precision improvement of a PanGenie
run is due to the post-processing filtering with an ML approach (I've been reading this #14).
Therefore, my question is: how can I run this script for just my single HG002 sample? What I got from issue #14 is that this is particularly beneficial for large cohorts with trio data; is there a way to apply it to a single sample? Also, you pointed to the concept of genotype quality, which might work in the absence of the former information; how can I assess that, eventually?
Thanks in advance! Also, looking online I came across these two links:
Are they in any way related to how the ML approach works or what it is based on? This is just for my understanding. Thanks again!
The command line help is accurate (reads.fa/fq), but since uncompressed read files are uncommon, it would be helpful to state that requirement more prominently, e.g.
-i VAL sequencing reads in FASTA/FASTQ format or Jellyfish database in jf format (required).
NOTE: INPUT FASTA/Q FILE MUST NOT BE COMPRESSED
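Since the reads must be uncompressed, gzipped inputs need to be decompressed before being passed to -i. A minimal sketch with a tiny placeholder read file (filenames are illustrative):

```shell
# PanGenie's -i option rejects gzip-compressed reads, so decompress first.
# A tiny placeholder FASTQ stands in for a real read file here.
printf '@r1\nACGT\n+\nIIII\n' | gzip > reads.fq.gz
zcat reads.fq.gz > reads.fq   # pass the uncompressed reads.fq to PanGenie -i
```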
Hi, pangenie group,
I ran into a problem when running PanGenie with Singularity: it seems that it cannot open the reference FASTA file, even though I can read it. The error is listed below. In addition, I'd like to ask how to select an appropriate reference FASTA file; is it also a part of our pan-genome?
singularity exec /share/apps/gene/pangenie-1.0.1/pangenie-new-gcc.sif cat /metadata/jellyfish.lib.version
libjellyfish-2.0-2:amd64 2.3.0-4build1 libjellyfish-2.0-dev:amd64 2.3.0-4build1
singularity exec /share/apps/gene/pangenie-1.0.1/pangenie-new-gcc.sif cat /metadata/pangenie.git.version
6e0da25
singularity exec /share/apps/gene/pangenie-1.0.1/pangenie-new-gcc.sif PanGenie -i $input.fastq -r $reference.fa -v $graph.vcf -o test
program: PanGenie - genotyping and phasing based on kmer-counting and known haplotype sequences.
author: Jana Ebler
Files and parameters used:
-c 0
-d 0
-e 3000000000
-g 1
-i $input.fastq
-j 1
-k 31
-o test
-p 0
-r $reference.fa
-s sample
-t 1
-u 0
-v $graph.vcf
Determine allele sequences ...
terminate called after throwing an instance of 'std::runtime_error'
what(): FastaReader::parse_file: reference file cannot be opened.
Aborted
Hello,
I have about 200 samples. There are 2 fastq.gz files (sample.R1.fq.gz and sample.R2.fq.gz) for every sample. Each file is very large (more than 50 GB compressed). How can I unzip them and combine the paired-end files into one file?
Thank you !
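A sketch of one way to do this with standard tools (the filenames below are placeholders): zcat decompresses and concatenates in a single pass, with no need to unzip to disk first, and since PanGenie genotypes from k-mer counts, simple concatenation of the two mates is sufficient:

```shell
# Two tiny gzipped FASTQs stand in for sample.R1.fq.gz / sample.R2.fq.gz.
printf '@r1/1\nACGT\n+\nIIII\n' | gzip > sample.R1.fq.gz
printf '@r1/2\nTTTT\n+\nIIII\n' | gzip > sample.R2.fq.gz
# Decompress and concatenate in one step; the result is a single
# uncompressed FASTQ suitable for PanGenie's -i option.
zcat sample.R1.fq.gz sample.R2.fq.gz > sample.combined.fq
```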
The PanGenie command line help indicates that a jellyfish database in jf format is an option for input.
-i VAL sequencing reads in FASTA/FASTQ format or Jellyfish database in jf format.
NOTE: INPUT FASTA/Q FILE MUST NOT BE COMPRESSED. (required).
What is the recommended Jellyfish command for creating the jf file? How does the jf-formatted file compare in size to the uncompressed FASTA or FASTQ file?
Currently, genotype likelihoods from subsampling steps are added and normalized after each step. A better approach would be to first add up all likelihoods, and then do the normalization once at the end. Otherwise, the likelihoods are not weighted equally, which might lead to genotyping errors.
Line 96 in ae47956
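A toy calculation (made-up likelihood values for two genotypes across two subsampling steps) illustrating why the order of summation and normalization matters:

```shell
# Toy numbers only: normalizing per step and then adding gives a different
# posterior than adding the raw likelihoods and normalizing once at the end.
awk 'BEGIN {
  a0 = 0.2; a1 = 0.6   # step 1 likelihoods for genotypes G0, G1
  b0 = 0.9; b1 = 0.1   # step 2 likelihoods for genotypes G0, G1
  # normalize each step, then add:
  p0 = a0/(a0+a1) + b0/(b0+b1); p1 = a1/(a0+a1) + b1/(b0+b1)
  # add first, normalize once:
  q0 = (a0+b0)/(a0+a1+b0+b1);   q1 = (a1+b1)/(a0+a1+b0+b1)
  printf "per-step: %.3f %.3f\nonce: %.3f %.3f\n", p0/(p0+p1), p1/(p0+p1), q0, q1
}' > norm-demo.txt
# The two lines in norm-demo.txt differ: a low-depth step gets the same
# weight as a high-depth one when each step is normalized on its own.
```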
pangenie]$ cd build/
build]$ cmake ..
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /xxx/software/miniconda3/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /xxx/software/miniconda3/bin/x86_64-conda-linux-gnu-c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Checking for module 'jellyfish-2.0'
-- Found jellyfish-2.0, version 2.2.10
-- /Parastor300s_G30S/zhangjj/software/miniconda3/envs/pangenie/include/jellyfish-2.2.10
-- /parastor300/work01/zhangjj/software/PanGenie/pangenie/build/src
-- Configuring done
-- Generating done
-- Build files have been written to: /parastor300/work01/zhangjj/software/PanGenie/pangenie/build
build]$ make
Scanning dependencies of target PanGenieLib
[ 1%] Building CXX object src/CMakeFiles/PanGenieLib.dir/emissionprobabilitycomputer.cpp.o
[ 2%] Building CXX object src/CMakeFiles/PanGenieLib.dir/copynumber.cpp.o
[ 4%] Building CXX object src/CMakeFiles/PanGenieLib.dir/commandlineparser.cpp.o
[ 5%] Building CXX object src/CMakeFiles/PanGenieLib.dir/columnindexer.cpp.o
[ 7%] Building CXX object src/CMakeFiles/PanGenieLib.dir/dnasequence.cpp.o
[ 8%] Building CXX object src/CMakeFiles/PanGenieLib.dir/fastareader.cpp.o
[ 10%] Building CXX object src/CMakeFiles/PanGenieLib.dir/genotypingresult.cpp.o
[ 11%] Building CXX object src/CMakeFiles/PanGenieLib.dir/histogram.cpp.o
[ 13%] Building CXX object src/CMakeFiles/PanGenieLib.dir/hmm.cpp.o
...
[ 38%] Linking CXX executable PanGenie-graph
/xxx/software/miniconda3/bin/../lib/gcc/x86_64-conda-linux-gnu/9.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -ljellyfish-2.0
collect2: error: ld returned 1 exit status
make[2]: *** [src/CMakeFiles/PanGenie-graph.dir/build.make:104: src/PanGenie-graph] Error 1
make[1]: *** [CMakeFiles/Makefile2:166: src/CMakeFiles/PanGenie-graph.dir/all] Error 2
make: *** [Makefile:114: all] Error 2
build]$ jellyfish --version
jellyfish 2.2.10
Does the version of Jellyfish have to be 2.0?
Hi !
I used your pipeline pangenome-graph-from-assemblies
to construct a VCF file, but I noticed that some of the variants have an unusually high number of alleles; see below:
1 18300 . GACACACAACACACAC GACAGACAGACACACACAC,GACAGACAGACAGACAGACACAC,GACAGACAGACAGACAGACAGACACACACACACACACACAC,G,GACACACACACACACACACACAC,GACAGACAGACAGACACACACACACACAC
1 19171 . CAAAAAAAA CAA,CAAA,CAAAAAAAAAAAA,CAAAAAAAAAAAAAAAAAAAA,CAAAAAAAAAAAAAAAAAAAAA,CAAAAAAAAAAAAAAAAAAAAAA,CAAAAA,CAAAA,CAAAAAAAAAA,C,CA,CAAAAAAAAAAA
I suspect that this may be due to repetitive sequences in my input assemblies, which were directly obtained from hifiasm assembly results without being masked.
Hi, @eblerjana
Can the variants VCF cover just a single chromosome? To parallelize across large-scale samples, can we use one chromosome's VCF for genotyping/phasing? Or does PanGenie need whole-genome variants for more precise k-mer counting and genotyping?
Thanks.
Zhigui Bao
Hi,
Thank you very much for such a nice tool.
I would like to represent not only snps but also structural variants in the input VCF.
Could you please tell me how to do that?
This is my toy example:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1>
##ALT=<ID=DEL,Description="Deletion">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00732 NA12878
1 15951 . T <DEL> . PASS . GT 0|0 1|1
End error message:
VariantReader: skip variant at 1:15950 since alleles contain undefined nucleotides:
Best,
Mehmet
Hi pangenie team,
First of all, thank you very much for developing such a beautiful tool for the community.
I am wondering, instead of whole genome sequencing data, can pangenie also work with genotyping-by-sequencing (GBS) data? In plant breeding, people normally do GBS to save money since there might be hundreds of plants needed to be genotyped.
Looking forward to your answer and thank you very much in advance.
Best wishes,
Yutang
Hi,
Thanks for your work on this program. The method seems really elegant and we've had some really exciting results from it so far.
Here is a brief outline of our preliminary analysis:
I have a couple of questions to make sure I'm getting the most out of the program:
I initially had good results with minigraph-cactus for generating a comprehensive VCF file. However, I read on the PanGenie issues page that this is not a recommended program. There was a reference to vcfbub, but it's not clear to me what steps need to be taken to go from the reference + long-read assemblies to an appropriate pangenome graph. I didn't notice any obvious issues with the subsequent short-read genotyping results from minigraph-cactus, but I'm worried that I'm not getting the best results.
I was trying to be quite conservative with the genotyping results, I used the GQ flag as mentioned in the paper, and then also applied a minimum KC of 7 per sample. Is this a reasonable filtering approach?
I noticed that there is an experimental phasing option. We are quite interested in the haplotype structure of our data, and initially we used Shapeit (using the HiFi individuals as reference haplotypes) to phase the population data. Are any more details available about PanGenie phasing? Is it perhaps worth applying the PanGenie phasing prior to Shapeit? Is it worth combining PanGenie genotyping with read-based phasing programs like WhatsHap?
Hi Jana,
I've noticed that the memory usage when counting k-mers directly from reads (rather than a .jf file) is incredibly high, and it looks like PanGenie reads the entire FASTQ file into memory and then counts the k-mers. I think this also leads to under-reporting of usage, because the PanGenie log for a run says
Count kmers in genome ...
#### Memory usage until now: 61.7867 GB ####
Determine unique kmers ...
#### Memory usage until now: 79.1394 GB ####
....
Total maximum memory usage: 84.4493 GB
Meanwhile, the same cluster reports a peak usage of 223.6 GB. At one point during k-mer counting it was 176.3 GB, so the k-mer counting definitely seems to be a big bottleneck, especially for moderate/high-coverage samples.
Just running Jellyfish by itself to produce the .jf file peaked at only 42.1 GB, so that currently seems like the better approach.
I tried, and very thoroughly failed, to implement some streaming feature in Jellyfish. It looks like they have some code set up for that, but I didn't last long.
It is not obvious how PanGenie handles the output prefix if the path includes non-existent folders. Judging by the runtime errors, I assume PanGenie does not create non-existent folders.
PanGenie is a user-friendly and fast genotyper.
I generated a VCF from minigraph-cactus.
When I used PanGenie v2.1.0, it produced output, but too many (nearly 36%) of the variants were skipped. The command is:
~/tool/pangenie-2.1.0/build/src/PanGenie -i $fq -r $ref -v $mcvcf -j $th -s $sample -o $sample -t 20
The log output looks like:
VariantReader: skip variant at Chr01:5201 since it is contained in a previous one.
...
The source vcf file had been changed to the diploid version. Here are the first and second variants.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10
Chr01 5194 >1008>1119 TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC,TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTAAACCCTAAACCCTAAACTCTAAACCCTAAACCCTAAACTCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTGAACCCTAAATCCTAAACCCTAAACCCTAAACCCTAAACCC,TGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCTGAACCCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCTGAACCCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTGAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC,TAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCTAAACCCTAAACCCTAACCCTAAGCCCTAAACCCTAAACCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAACCCTAAACCCTAAACCCTAAGCCCTAAGCCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAGCCCTAAACCCTAAGCCCTAAGCCCTAAGCCCTAAGCCCTAAGCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC,TATT 60 . 
AC=1,1,1,1,1;AF=0.166667,0.166667,0.166667,0.166667,0.166667;AN=6;AT=>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1104>1106>1108>1109>1110>1112>1113>1115>1116>1118>1119,>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1104>1106>1107>1109>1110>1112>1114>1115>1116>1118>1119,>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1063>1065>1066>1068>1069>1071>1072>1074>1075>1077>1078>1080>1081>1083>1084>1086>1087>1089>1090>1092>1093>1095>1097>1099>1100>1102>1103>1104>1106>1107>1109>1111>1112>1113>1115>1117>1118>1119,>1008>1011>1012>1013>1014>1016>1017>1019>1020>1022>1023>1025>1026>1028>1029>1031>1032>1034>1035>1037>1038>1040>1041>1043>1044>1046>1047>1048>1050>1051>1053>1054>1056>1057>1059>1060>1062>1063>1065>1066>1068>1069>1071>1072>1074>1075>1077>1078>1080>1081>1083>1084>1086>1087>1089>1090>1092>1093>1095>1096>1097>1098>1100>1101>1103>1104>1106>1107>1109>1110>1112>1113>1115>1116>1118>1119,>1008>1009>1012>1013>1015>1016>1018>1019>1021>1022>1024>1025>1027>1028>1030>1031>1033>1034>1036>1037>1039>1040>1042>1043>1045>1046>1047>1049>1050>1052>1053>1055>1056>1058>1059>1061>1062>1064>1065>1067>1068>1070>1071>1073>1074>1076>1077>1079>1080>1082>1083>1085>1086>1088>1089>1091>1092>1094>1095>1097>1099>1100>1101>1103>1104>1105>1106>1107>1109>1110>1112>1114>1115>1116>1118>1119,>1008>1009>1010>1119;NS=6;LV=0 GT 5|5 .|. 3|3 1|1 0|0 2|2 .|. 4|4 .|. .|.
Chr01 5202 >1013>1016 G A 60 . AC=1;AF=0.2;AN=5;AT=>1013>1014>1016,>1013>1015>1016;NS=5;LV=1;PS=>1008>1119 GT .|. .|. 0|0 0|0 0|0 0|0 .|. 1|1 .|. .|.
...
I checked the REF/ALT sequences, and found the second variant to be redundant:
REF[Sample5]: the base at position 5202 can be inferred from the first variant: TGAACCCT[g]AA
ALT[Sample8]: the base at position 5202 can be inferred from the first variant as well: TAAACCCT[a]AA
I know that PanGenie uses non-overlapping variants from the VCF file, but how can I check whether a VCF file can be used as PanGenie input? Are there suggested steps to pre-process the VCF file?
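As a quick, unofficial sanity check, records whose POS lies within the REF span of the previous record on the same chromosome are the ones PanGenie skips as "contained in a previous one". A minimal awk sketch on toy data (the real VCF body would be substituted for toy.vcf):

```shell
# Toy 3-record VCF body (no header) mimicking the overlap described above.
printf 'Chr01\t5194\t.\tTGAACCCTGA\tT\nChr01\t5202\t.\tG\tA\nChr01\t5300\t.\tC\tT\n' > toy.vcf
# Report records starting before the previous record's REF interval ends.
awk -F'\t' '$1 == chr && $2 < end { print "overlap at " $1 ":" $2 }
            { chr = $1; end = $2 + length($4) }' toy.vcf > overlaps.txt
# overlaps.txt now flags Chr01:5202, which lies inside the span 5194-5203.
```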
Hi
I started to install pangenie using the following code:
git clone https://github.com/eblerjana/pangenie.git
cd pangenie
conda activate pangenie
mkdir build; cd build; cmake .. ; make
Since I use an old Linux HPC with an old glibc, I got this message:
cmake: /lib64/libc.so.6: version `GLIBC_2.17' not found
I installed glibc 2.17 using this command:
conda install -c fridh glibc
Then I reactivated the pangenie environment and ran
cmake
in the pangenie directory.
I get:
Segmentation fault
Thank you
It would be useful to be able to work with, e.g., exome data. If variants outside the given region set could be imputed, that would be even cooler.
I want to install PanGenie on an ARM server, but it always raises errors. I don't know whether PanGenie supports the ARM architecture; if you know, please tell me. Thanks.
Hi,
Thank you very much for such a nice tool. Usage: PanGenie [options] -i <reads.fa/fq> -r <reference.fa> -v <variants.vcf>
I found that the paper (https://www.biorxiv.org/content/10.1101/2020.11.11.378133v1) says: we propose a new algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference together with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation...
May I ask if the PanGenie software can only use short reads? Can long reads (such as Nanopore or PacBio reads) be used for <reads.fa/fq>? What is the advantage over second-generation sequencing?
The genotype in the output file is the genotype of the sample the reads (i.e. -i <reads.fa/fq>) correspond to, right?
Thanks.
Hi there,
My aim is to get a genome inference for a human sample (HG005) to benchmark my graph in terms of SV calling using PanGenie.
I assembled it both with minigraph-CACTUS (which I'm testing right now) and with pggb.
With that said, although the pangenomes are quite small (only five individuals), after correctly outputting the .vcf files for both of them, I run into an error when launching from the Snakefile. Below is the output for the command:
File "Snakefile", line 13
rule all:
^
SyntaxError: invalid syntax
Now, the reason I'm doing this is the non-overlapping variants requirement. Since my .vcf files weren't generated using your pipeline, I'm assuming they are not in accordance with the specific standard you mentioned (particularly the third one); hence why I proceeded this way.
One last thing: I'm not very experienced with Snakefiles, and after looking online I've seen they are launched by invoking python3; is that the correct way? Let me know, thanks in advance!
P.S. Attached is a version of the .yaml edited to conform to the inputs for the script (config.log)
Hi again,
I'm having serious trouble processing the PGGB .vcf files... At first the tool complained about haplotypes being unphased; this is a problem I fixed by collapsing the columns for each individual haplotype into a single one per sample, using the following awk command:
awk -F '\t' '/^#/ {print;next;} {OFS="\t";for(i=j=10; i < NF; i+=2) {$j = $i"|"$(i+1); j++} NF=j-1}1' pangenome_ref_guided_GRCh38.vcf > pangenome_ref_guided_GRCh38-phased.vcf
In fact, compared to minigraph-CACTUS, PGGB outputs a .vcf file where the haplotypes of each individual in the graph are split into two columns, which leads PanGenie to think they are unphased. Anyway, I then proceeded to fix the column headers for the respective samples, but despite running to completion, the process still returns an empty .vcf.
I'm not sure what caused it, and after talking with Glenn and looking online I realized I was missing some pre-processing steps, namely removing complex bubbles and decomposing nested alleles with vcfbub and vcfwave, respectively.
I ran the following command:
vcfbub -l 0 -r 10000 -i pangenome_ref_guided_GRCh38-phased.vcf.gz | vcfwave -L 10000 -t 16 -n | bgzip -c -@ 16 > decomposed_and_cleaned.vcf.gz
Unfortunately, I got the same result... So, going back to one of my hypotheses, this could be linked to having # in the sequence names, as per the PanSN naming standard used by PGGB.
I tested removing everything but the string chr(number) from the #CHROM column; however, even after this attempt the result is always an empty .vcf file.
I thought it could be helpful to re-run the vcfbub and vcflib commands, but this time I got the following error:
error: more sample names in header than sample fields
samples: CHM13 HG002 HG00514 HG00733 HG03492 NA19240
line: chr1 418412 >11061496>11061499 T C 60.0 . AC=1;AF=1;AN=1;AT=>11061496>11061497>11061499,>11061496>11061498>11061499;NS=1;LV=0 GT .|. .|. .|. .|. .|1
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: IoError(Os { code: 32, kind: BrokenPipe, message: "Broken pipe" })', src/main.rs:199:46
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
This is strange, as I didn't get that error previously; maybe because back then I ran the pre-processing before editing with awk, I don't know for sure...
Please, if you have any clue/idea of what is happening I would really appreciate some help. I tried several approaches but all failed and I wonder whether there is something I'm missing.
With that said, I'm happy to attach a screenshot of the original .vcf file if needed and I'm sorry for the long text.
Thanks in advance!
P.S. I always made sure the contig names in the reference and the #CHROM values in the .vcf were consistent in all trials.
Hi, is it possible to add PanGenie to Bioconda instead of having to build the tool manually?
Variants in the input VCF that are less than the kmer size apart are internally merged by PanGenie and treated as a single variant. Alleles of this merged variant are composed of combinations of the merged variants (defined by the paths). The number of unique kmers per allele is computed only for this merged variant. When writing the output VCF, these merged variants are converted back to their original representation (creating the original entries). However, the AK counts are computed for each original allele by summing up the unique kmers of all combined alleles that include that particular allele. The problem is that the same kmer can be counted several times when occurring on several combined alleles, leading to numbers of AK that exceed UK. There is not really any proper way of computing AK counts for the original alleles.
Hi,
When I tried to reproduce your pangenome graph, I noticed several errors in the vcf-merging/pangenome-graph-from-assemblies/Snakemake file, mainly in generating the trio sample list and ped file.
According to function parse_trios in scripts/mendelian-consistency.py,
def parse_trios(ped_file, samples):
    samples_to_include = set()
    with open(samples, 'r') as listing:
        for line in listing:
            samples_to_include.add(line.strip())
    trios = {}
    for line in open(ped_file, 'r'):
        if line.startswith('#'):
            continue
        fields = line.split()
        # map trio_name -> [child, parent1, parent2, nr consistent, nr_inconsistent, nr_untyped, nr total]
        if any([s not in samples_to_include for s in fields[1:4]]):
            continue
        trios[(fields[1], fields[2], fields[3])] = [fields[1], fields[2], fields[3], 0, 0, 0, 0]
    return trios
trio sample list (results/trio-samples.txt) should be like:
HG00731
HG00732
HG00733
HG00512
HG00513
HG00514
NA19238
NA19239
NA19240
However, the Snakemake file generated a file like:
Puerto Rican
HG00731
HG00732
HG00733
Chinese
HG00512
HG00513
HG00514
Yoruban
NA19238
NA19239
NA19240
And ped (results/trios.ped) file should look like:
Puerto_Rican HG00733 HG00731 HG00732
Chinese HG00514 HG00512 HG00513
Yoruban NA19240 NA19238 NA19239
instead of:
Puerto Rican Puerto Rican HG00731 HG00732
Chinese Chinese HG00512 HG00513
Yoruban Yoruban NA19238 NA19239
(Also, trio names should not contain spaces)
To solve the problem above, you may need to modify some lines in Snakemake file:
#################################################################
# 5) Check mendelian consistency for trios and construct graph
#################################################################
# generate a file specifying the trio relationships
rule generate_ped_file:
    output:
        "{outdir}trios.ped"
    run:
        with open(output[0], "w") as ped_output:
            for trio in config['trios']:
                father = config['trios'][trio][0]
                mother = config['trios'][trio][1]
                # Original:
                # ped_output.write('\t'.join([trio, trio, father, mother]) + '\n')
                # MODIFIED:
                child = config['trios'][trio][2]
                ped_output.write('\t'.join([trio, child, father, mother]) + '\n')

rule generate_samples_file:
    output:
        "{outdir}trio-samples.txt"
    run:
        with open(output[0], "w") as sample_output:
            for trio in config['trios']:
                # sample_output.write(trio + '\n')  # MODIFIED
                for sample in config['trios'][trio]:
                    sample_output.write(sample + '\n')
Hello,
My sequencing files (.fastq) are paired-end, but I didn't find a parameter in PanGenie that fits this kind of file. Can you give me an example usage for paired-end files?
Thanks!
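As far as the command line help shows, PanGenie takes a single read file via -i; since genotyping is based on k-mer counts, pairing information is not needed and the two mates can simply be concatenated. A sketch with placeholder files (the PanGenie call itself is shown as a comment, with placeholder paths):

```shell
# Placeholder paired-end FASTQs; real R1/R2 files would be used here.
printf '@read1/1\nACGTACGT\n+\nIIIIIIII\n' > sample_R1.fastq
printf '@read1/2\nTGCATGCA\n+\nIIIIIIII\n' > sample_R2.fastq
# Concatenate both mates into the single (uncompressed) file for -i.
cat sample_R1.fastq sample_R2.fastq > sample_all.fastq
# Then run (paths are placeholders):
# PanGenie -i sample_all.fastq -r reference.fa -v variants.vcf -s sample -o sample
```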
Any chance of pushing the singularity image to a cloud service?
It would be great for those of us who try to integrate this tool into nextflow genotyping pipelines.
Thank you.
Hello, authors,
In the demo, 'test-variants.vcf' was a regular VCF file of variants. I am confused about which step uses the pangenome graph format shown in Extended Data Fig. 1. I used the 'pangenome-graph-from-assemblies' pipeline to generate a callset.vcf file. What should I do next? Should I use vcfbub to convert it into a bubble file and then perform genotyping?
I would appreciate your help.
Tian Tian
Hi there,
Thanks for this great tool. I noticed that, in both the HPRC reference release paper and the original paper describing this tool, you employ complex pre- and post-PanGenie variant filtering:
"We used vcfbub (version 0.1.0, (Garrison, 2021)) with parameters -l 0 and -r 100000 in order to filter the VCF..."
"We additionally remove all records for which more than 80% of all 88 haplotypes carry a missing allele..."
"we have implemented a decomposition approach which aims at detecting all variant alleles nested inside of multi-allelic top-level records..."
"[We] defined a high quality subset of SV alleles that we can reliably genotype. For this purpose, we applied a machine learning approach similar to what we have presented previously..."
How important are these methods to the quality/reliability of your final genotypes, especially for large SV (> 50bp)?
Does this package implement these filters, and if not, is there code that does?
Do you have pre-trained models to use for the machine learning/regression based filter?
Just trying to get high-quality SV genotypes on my diverse short-read dataset, so any help is greatly appreciated.
Thanks again!
-Joe
Is there a way to generate a joint-genotyped VCF with PanGenie? If I understand the software correctly, only the variants in the pangenome reference are genotyped across all samples, while non-pangenome variants are called within individual samples.
Running PanGenie with a Jellyfish database in jf format and multiple threads (-t) produces randomized output.
Executing it with either a .jf file or multiple threads alone works fine, but not with both parameters provided.
jellyfish count -m 31 -s 3G -p 126 -c 7 -C -L 1 --if <path-segments.fasta> <input.fasta> -o input.jf
PanGenie -i input.jf -r <reference.fa> -v <variants.vcf> -t 2 -o V1
PanGenie -i input.jf -r <reference.fa> -v <variants.vcf> -t 2 -o V2
diff V1_genotyping.vcf V2_genotyping.vcf | wc -l
Both genotyping files should be the same.
diff V1_genotyping.vcf V2_genotyping.vcf | wc -l
0
However, there are many (small) differences:
diff V1_genotyping.vcf V2_genotyping.vcf | wc -l
10790792
The problem seems to be within the JellyfishReader class. I'm not familiar with the Jellyfish source code, but I presume that the line abundance = this->db->check(jelly_kmer);
in the function getKmerAbundance
cannot be parallelized as is.
I tried guarding the line with a mutex, and this does resolve the problem. However, the overhead of the mutex makes the "determine unique kmers" section almost as slow as serial execution, so it's definitely not an optimal solution.
Remark: the bug cannot be observed with the provided test data, because that VCF file only contains a single chromosome. In that case PanGenie will only launch a single thread, regardless of how many were specified on the command line.
Hi,
Thank you for developing PanGenie! Recently, I have been trying to evaluate PanGenie's output VCF with Truvari, but it turns out that entries with multiple sequences (separated by commas) in the ALT field are misclassified as false positives by Truvari. Is there any way or criterion to choose only one sequence from the ALT field, or to find the most likely variant that corresponds to such an entry?
For example, given a record like this:
chr1 899960 . TGGGAGGGGTCCGCGCGTCCGCAGTGGGGATGTGCTGCGGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGA TGGGAGGGGTCCACGCGTCCGCAGTGGGGCTGCCGGGAGGGGGGCCGGGTCCACAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCGCAGTGGGGCTGTGCTGCCGGGA,TGGGAGGGGTCCGCGCGTCCGCAGTGGGGCTGCCGGGAGGGGGGCCGGGTCCACAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCGCAGTGGGGCTGTGCTGCCGGGA,TGGGAGGGGTCCGCGCGTCCGCAGTGGGGCTGTGCTGCGGGAAG,CCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCGCAGTGGGGATGTGCTGCCGGGA,TGGGAGGGGTCCGCGCGTCCGCAGTGGGGCTGTGCTGCGGGAAGAGGGGGGCCGGGTCCGCAGTGGGGATGTGCTGCCGGGA,CGGGGA,CGGGGAGGGGGGCGCGGGTCCGTAGTGGGGATGTGCTGCGGGAAGGGGGGGGCCGGGTCCACAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCCGGGAGGGGGGCGCGGGTCCGCAGTGGGGATGTGCTGCGGGAA . PASS AF=0.05,0.05,0.15,0.05,0.1,0.15,0.05;UK=301;AK=52,112,95,44,47,70,54,97;MA=8;ID=chr1-899972-SNV-0-1:chr1-899987-DEL-0-5:chr1-899998-SNV-0-1:chr1-900009-DEL-1-1:chr1-900018-SNV-0-1:chr1-900112-SNV-0-1:chr1-900115-SNV-0-1:chr1-900123-SNV-0-1:chr1-900124-INS-0-1:chr1-900141-SNV-0-1,chr1-899987-DEL-0-5:chr1-899998-SNV-0-1:chr1-900009-DEL-1-1:chr1-900018-SNV-0-1:chr1-900112-SNV-0-1:chr1-900115-SNV-0-1:chr1-900123-SNV-0-1:chr1-900124-INS-0-1:chr1-900141-SNV-0-1,chr1-899989-SNV-0-1:chr1-900001-SNV-0-1:chr1-900003-DEL-0-151,chr1-899955-DEL-0-113:chr1-900112-SNV-0-1:chr1-900115-SNV-0-1:chr1-900123-SNV-0-1:chr1-900124-INS-0-1,chr1-899989-SNV-0-1:chr1-900000-INS-0-2:chr1-900009-DEL-2-115,chr1-899955-DEL-1-189:chr1-900150-SNV-0-1,chr1-899955-DEL-2-37:chr1-900019-SNV-0-1:chr1-900036-SNV-0-1:chr1-900039-SNV-0-1:chr1-900047-SNV-0-1:chr1-900048-INS-0-1:chr1-900056-SNV-0-1:chr1-900150-SNV-0-1:chr1-900153-SNV-0-1 GT:GQ:GL:KC 
3/3:10000:-82.8146,-359.1,-264.5,-291,-281.3,-185.3,-93.41,-258.7,-197.2,0,-138.8,-309.4,-199.5,-35.39,-57.8,-169.7,-335,-273.5,-101.3,-122.4,-72.88,-187.1,-362.2,-315.3,-117.7,-191.2,-194,-120.8,-247.5,-409.4,-342.2,-178.1,-270.1,-210.3,-201.2,-202.4:52
how should I interpret it, and how can I convert it (if possible) to a simple variant record that has only one sequence in both REF and ALT?
Thank you very much
Best
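One pragmatic option (a sketch, not PanGenie functionality) is to keep, per record, only the ALT allele that the sample's genotype actually selects. The awk below does this for a simplified toy record with a homozygous non-ref GT; a real VCF would need a proper parser (for instance, bcftools norm -m- splits multi-allelic records into biallelic ones, though that alone does not pick an allele):

```shell
# Toy multi-allelic record: FORMAT=GT only, sample GT=3/3 selects the third ALT.
printf 'chr1\t899960\t.\tTGG\tTA,TC,TG\t.\tPASS\t.\tGT\t3/3\n' > multi.vcf
awk -F'\t' '{
  split($5, alts, ","); split($10, gt, "/")
  # homozygous non-ref: emit a biallelic record with only the chosen ALT
  if (gt[1] == gt[2] && gt[1] > 0)
    print $1, $2, $4, alts[gt[1]], "1/1"
}' multi.vcf > reduced.txt
```

Heterozygous genotypes with two different ALT alleles would need two output records instead, which is where splitting to biallelic records first helps.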