This repo describes pangenomes produced by the Human Pangenome Reference Consortium from year 1 data. For information about data reuse and publishing with HPRC data, please see the HPRC's Data Use Protocol.
Note: The pangenomes and resultant files referred to in this repo have not been fully QC'd, are not published, and may have known issues.
Graphs built with three different strategies are available. The strategies are summarized in the table below and described in the relevant sections:
Feature | Minigraph | Minigraph-Cactus | PGGB |
---|---|---|---|
sequence comparison | reference-based, progressive | reference-based, progressive | symmetric, all-vs-all |
resolution | SV only | base-level (via abPOA) | base-level (via abPOA) |
scope | full assemblies | non-centromeric | full assemblies |
cyclic paths | no | non-reference | all |
short read mapping | untested | yes (fast) | untested |
long read mapping | yes (fastest) | yes | yes (slowest) |
assembly mapping | yes (direct) | untested | yes (via injection) |
Index files listing file locations for download with the AWS CLI can be found in the indexes folder of this repository. Alternatively, tables are listed below in each graph creation strategy's section. Note that the index files list file locations as s3:// URIs, whereas the tables use http:// URLs.
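If you need to turn an s3:// URI from an index file into a web URL, the conversion can be sketched as below. This is a minimal sketch, not part of the HPRC tooling: it assumes the generic AWS virtual-hosted-style form (`https://<bucket>.s3.amazonaws.com/<key>`), which may differ from the host form used in the tables of this repo, and the bucket and key in the usage note are hypothetical.

```python
from urllib.parse import quote


def s3_uri_to_https(uri: str) -> str:
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL.

    Assumption: the bucket is publicly readable and reachable through the
    generic AWS endpoint; adjust the host if the data is served elsewhere.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI needs both a bucket and a key: {uri!r}")
    # Percent-encode the key but keep the path separators intact.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key, safe='/')}"
```

For example, `s3_uri_to_https("s3://example-bucket/pangenomes/graph.gfa.gz")` returns `https://example-bucket.s3.amazonaws.com/pangenomes/graph.gfa.gz`.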
Information about the source assemblies can be found in the HPRC Assembly GitHub repository. Of the 47 samples assembled (94 assemblies) in year 1, all but three samples were included in graph constructions (HG002, HG005 and NA19240 were excluded for evaluation purposes). GRCh38 and CHM13 were added to make the total number of haplotypes included 90.
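The haplotype count above follows from simple arithmetic, spelled out here as a short check. The variable names are illustrative only, and the sketch assumes two haplotype assemblies per sample, as implied by 47 samples yielding 94 assemblies.

```python
samples_assembled = 47
held_out = {"HG002", "HG005", "NA19240"}  # excluded for evaluation purposes
haplotypes_per_sample = 2                 # 94 assemblies / 47 samples
references = ["GRCh38", "CHM13"]

samples_in_graph = samples_assembled - len(held_out)
assembled_haplotypes = samples_in_graph * haplotypes_per_sample
total_haplotypes = assembled_haplotypes + len(references)

print(samples_in_graph, assembled_haplotypes, total_haplotypes)  # 44 88 90
```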
Minigraph is a very fast generalization of minimap2 that builds the graph iteratively. It aligns sequences to approximate locations and can be used to call structural variants (>50 nt). Graphs were built with both GRCh38 and CHM13+Y (found here) used as reference sequences.
Description | GRCh38 Graph | CHM13 Graph |
---|---|---|
graph | graph | graph |
bed | bed index | bed index |
The CACTUS pangenome pipeline adds base-level alignments to the minigraph graphs above (so both GRCh38- and CHM13-based graphs are available).
Graphs and associated files are summarized below.
Description | GRCh38 Graph | CHM13 Graph |
---|---|---|
graph | gfa | gfa |
VCF | VCF VCF index | VCF VCF index VCF(CHM13) VCF(CHM13) index |
multiple alignment | HAL | HAL |
sequences clipped out before alignment | masking | masking |
VG indexes | xg snarls trans | xg snarls trans |
Giraffe indexes | dist min gg gbwt | dist min gg gbwt |
The graphs are available in GFA format alongside other graph and index files. Information about the associated file formats can be found at the following links:
- graph formats: xg/gg
- index formats: gbwt/dist/min
- snarls format: snarls
The Giraffe short read mapper relies on the graph's snarl decomposition. The versions of the Minigraph-Cactus graphs released here contain some spurious large deletion edges that make this decomposition less efficient, which increases Giraffe runtime. Furthermore, we have found that when calling small variants with the Giraffe-DeepVariant pipeline, accuracy improves if all alleles with frequency < 10% are removed from the graph before indexing. Two filtered versions of each of the two Minigraph-Cactus graphs are available here. The graphs with `maxdel.10mb` in the name (recommended to speed up general mapping experiments) were created by removing edges that imply deletions > 10 Mb, and the graphs with `minaf.0.1` in the name (recommended for use with DeepVariant) were created by removing, in addition to those deletions, nodes that are covered by fewer than 9 haplotypes.
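The relationship between the 10% frequency cutoff and the 9-haplotype threshold can be illustrated with a toy calculation. This is only a sketch of the arithmetic, assuming the 90 haplotypes described earlier; it is not the procedure used to produce the released graphs, and the function names and node IDs are hypothetical.

```python
import math
from fractions import Fraction


def min_haplotype_support(total_haplotypes: int, min_af: str) -> int:
    """Smallest haplotype count whose frequency is at least min_af.

    min_af is passed as a decimal string (e.g. "0.1") so that the
    threshold is computed exactly rather than with binary floats.
    """
    return math.ceil(Fraction(min_af) * total_haplotypes)


def filter_nodes(node_support: dict[str, int], threshold: int) -> dict[str, int]:
    """Keep only nodes covered by at least `threshold` haplotypes."""
    return {node: n for node, n in node_support.items() if n >= threshold}


threshold = min_haplotype_support(90, "0.1")       # 9
toy_support = {"n1": 90, "n2": 9, "n3": 8}          # hypothetical node coverage
kept = filter_nodes(toy_support, threshold)         # n3 (< 9 haplotypes) is dropped
```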
Highly repetitive sequence, such as that found in centromeres, was excluded from the Minigraph-Cactus graphs using the following process. dna-brnn was first run with its default parameters and model to identify alpha satellite and hsat 2/3 regions >100 kb, which were clipped out of the input fasta files. Gaps >100 kb between minigraph mappings were likewise removed. Any remaining contigs or contig fragments that could not be assigned to a reference chromosome were excluded. Finally, gaps >10 kb left unaligned after Cactus were removed. Each removed interval, along with the step that removed it, is available:
- regions removed from GRCh38-based graph: hprc-v1.0-mc-grch38.clipped-intervals.bed.gz
- regions removed from CHM13-based graph: hprc-v1.0-mc-chm13.clipped-intervals.bed.gz
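To get a feel for how much sequence each clipping step removed, the interval files can be summarized with a few lines of Python. This sketch assumes a tab-separated BED layout with the step label in the fourth column; that layout is an assumption, so check the files themselves before relying on it. The function names and the labels in the usage note are hypothetical.

```python
import gzip
from collections import Counter
from typing import Iterable


def removed_bases_by_step(lines: Iterable[str]) -> Counter:
    """Sum interval lengths per step label (assumed to be BED column 4)."""
    totals: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        start, end = int(fields[1]), int(fields[2])
        step = fields[3] if len(fields) > 3 else "unknown"
        totals[step] += end - start  # BED intervals are half-open
    return totals


def removed_bases_from_file(path: str) -> Counter:
    """Same summary, reading directly from a gzip-compressed BED file."""
    with gzip.open(path, "rt") as handle:
        return removed_bases_by_step(handle)
```

For instance, `removed_bases_by_step(["chr1\t0\t100\tstepA\n", "chr1\t200\t250\tstepB\n"])` yields a counter with 100 bases for `stepA` and 50 for `stepB`.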
The Pangenome Graph Builder pipeline (PGGB) creates an all-vs-all graph with base-level alignments and no clipping of mitochondrial or centromeric regions.
Graphs and associated files are summarized below.
Description | Location |
---|---|
graph | gfa |
untangle | delta paf |
VCFs | chm13.1-22+X chm13.M grch38.1-22+X grch38.M grch38.Y |
Graph chromosome files and images can be found here and here.
* Dec 03, 2021: updated minigraph-cactus VCFs to fix headers (thanks to Wen-Wei)