HPRC Pangenome Resources

This repo describes pangenomes produced by the Human Pangenome Reference Consortium (HPRC) from year 1 data. For information about data reuse and publishing with HPRC data, please see the HPRC's Data Use Protocol.

Note: The pangenomes and resultant files referred to in this repo have not been fully QC'd, are not published, and may have known issues.

Background Information

Graph Creation Strategies

Graphs are available from three different strategies summarized in the table (and relevant sections) below:

| | Minigraph | Minigraph-Cactus | PGGB |
|---|---|---|---|
| sequence comparison | reference-based, progressive | reference-based, progressive | symmetric, all-vs-all |
| resolution | SV only | base-level (via abPOA) | base-level (via abPOA) |
| scope | full assemblies | non-centromeric | full assemblies |
| cyclic paths | no | non-reference | all |
| short read mapping | untested | yes (fast) | untested |
| long read mapping | yes (fastest) | yes | yes (slowest) |
| assembly mapping | yes (direct) | untested | yes (via injection) |

Index files listing file locations for download with the AWS CLI can be found in the indexes folder of this repository. Alternatively, tables are listed below in each graph creation strategy's section. Note that the index files list the file locations as s3:// URIs, whereas the tables use http:// URLs.
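If you only have an s3:// URI from an index file and want a URL for a browser or plain HTTP client, the conversion can be sketched as below. This is a minimal sketch that assumes the bucket is publicly readable and uses the generic S3 virtual-hosted-style URL form; the URLs in the tables may be written in a different (but equivalent) style. The bucket and key in the example are hypothetical.

```python
from urllib.parse import urlparse


def s3_uri_to_https(uri: str) -> str:
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    return f"https://{bucket}.s3.amazonaws.com/{key}"


# Hypothetical bucket and key, for illustration only.
example = s3_uri_to_https("s3://example-bucket/pangenomes/example.gfa.gz")
print(example)
```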

Assembly Inputs

Information about the source assemblies can be found in the HPRC Assembly GitHub repository. Of the 47 samples assembled (94 assemblies) in year 1, all but three samples were included in graph constructions (HG002, HG005 and NA19240 were excluded for evaluation purposes). GRCh38 and CHM13 were added to make the total number of haplotypes included 90.
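The haplotype count above follows from simple arithmetic, spelled out here as a short check. The only inputs are the numbers stated in this section: each sample contributes two assemblies, and each reference counts as one haplotype.

```python
SAMPLES_ASSEMBLED = 47
HELD_OUT = ["HG002", "HG005", "NA19240"]  # excluded for evaluation
ASSEMBLIES_PER_SAMPLE = 2                 # 94 assemblies / 47 samples
REFERENCES = ["GRCh38", "CHM13"]

samples_in_graph = SAMPLES_ASSEMBLED - len(HELD_OUT)
assembled_haplotypes = samples_in_graph * ASSEMBLIES_PER_SAMPLE
total_haplotypes = assembled_haplotypes + len(REFERENCES)

print(samples_in_graph, assembled_haplotypes, total_haplotypes)
```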

Graphs

Minigraph

Minigraph is a very fast generalization of minimap2 that builds the graph iteratively. Minigraph aligns to approximate locations and can be used to call structural variants (>50 nt). Graphs were built with both GRCh38 and CHM13+Y (found here) as reference sequences.

| Description | GRCh38 Graph | CHM13 Graph |
|---|---|---|
| graph | graph | graph |
| bed | bed, index | bed, index |

Minigraph/CACTUS

The CACTUS pangenome pipeline adds base-level alignments to the minigraph graphs above (so both GRCh38- and CHM13-based graphs are available).

Graphs and associated files are summarized below.

| Description | GRCh38 Graph | CHM13 Graph |
|---|---|---|
| graph | gfa | gfa |
| VCF | VCF, VCF index | VCF, VCF index, VCF(CHM13), VCF(CHM13) index |
| multiple alignment | HAL | HAL |
| sequences clipped out before alignment | masking | masking |
| VG indexes | xg, snarls, trans | xg, snarls, trans |
| Giraffe indexes | dist, min, gg, gbwt | dist, min, gg, gbwt |

The graphs are available in gfa format alongside other graph and index files. Information about the associated file formats can be found:
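For orientation, GFA is a tab-separated text format in which the first field of each line is a record type (for example `H` for the header, `S` for segments, and `L` for links). A quick sanity check on a downloaded graph is to tally those record types. The sketch below is illustrative only: the toy graph is made up, and the function is not part of any HPRC tooling.

```python
from collections import Counter
from typing import Iterable


def count_gfa_records(lines: Iterable[str]) -> Counter:
    """Tally GFA lines by record type (the first tab-separated field)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        record_type = line.split("\t", 1)[0]
        counts[record_type] += 1
    return counts


# Made-up toy graph: two segments joined by one link.
toy_gfa = "\n".join([
    "H\tVN:Z:1.0",
    "S\ts1\tACGT",
    "S\ts2\tGG",
    "L\ts1\t+\ts2\t+\t0M",
])
counts = count_gfa_records(toy_gfa.splitlines())
print(dict(counts))
```

For a real file, the same function can be given an open file handle; if the download is gzip-compressed (an assumption about how the files are stored), open it with `gzip.open(path, "rt")` first.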

Filtered Graphs

The Giraffe short read mapper relies on the graph's snarl decomposition. The versions of the Minigraph/Cactus graphs released here contain some spurious large deletion edges that make this decomposition less efficient, which increases Giraffe runtime. Furthermore, we have found that for calling small variants with the Giraffe-DeepVariant pipeline, accuracy improves if all alleles with frequency < 10% are removed from the graph before indexing. Two filtered versions of each of the two Minigraph/Cactus graphs are available here. The graphs with maxdel.10mb in the name (recommended to speed up general mapping experiments) were created by removing edges that imply deletions > 10 Mb. The graphs with minaf.0.1 in the name (recommended for use with DeepVariant) were created by removing, in addition to those deletion edges, nodes that are covered by fewer than 9 haplotypes.
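The 10% frequency cutoff and the "fewer than 9 haplotypes" rule agree if the denominator is the 90 haplotypes in the graph, since 9 / 90 = 10%. The sketch below illustrates that rule on made-up data. It is not the filtering code used to produce the released graphs: the function name, the coverage dictionary, and the choice of 90 as the denominator are assumptions for illustration.

```python
from typing import Dict, Set


def nodes_to_keep(
    coverage: Dict[str, int],
    total_haplotypes: int = 90,
    min_af_percent: int = 10,
) -> Set[str]:
    """Return node IDs covered by at least min_af_percent of haplotypes."""
    # Integer arithmetic avoids floating-point rounding at the threshold:
    # keep a node if n_haps / total_haplotypes >= min_af_percent / 100.
    return {
        node
        for node, n_haps in coverage.items()
        if n_haps * 100 >= min_af_percent * total_haplotypes
    }


# Hypothetical per-node haplotype coverage.
toy_coverage = {"n1": 90, "n2": 9, "n3": 8, "n4": 1}
kept = nodes_to_keep(toy_coverage)
print(sorted(kept))
```

With the defaults, a node covered by exactly 9 haplotypes is kept and one covered by 8 is dropped.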

Masked Sequence

Highly repetitive sequence, such as that found in centromeres, was excluded from the Minigraph/Cactus graphs using the following process. dna-brnn was first run with its default parameters and model to identify alpha satellite and hsat 2/3 regions >100kb, which were clipped out of the input fasta files. Gaps >100kb between minigraph mappings were likewise removed. Any remaining contigs or contig fragments that could not be assigned to a reference chromosome were excluded. Finally, gaps >10kb left unaligned after Cactus were removed. Each removed interval, as well as the step that removed it, is available:
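The length thresholds in this process amount to a simple interval filter. The sketch below shows the idea for the >100kb cutoff on made-up intervals. It is not the actual masking code; the BED-like half-open coordinates, the reading of "100kb" as 100,000 bases, and all names are assumptions for illustration.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    contig: str
    start: int  # 0-based, inclusive (BED-like, assumed)
    end: int    # exclusive

    @property
    def length(self) -> int:
        return self.end - self.start


def intervals_to_clip(
    intervals: List[Interval], min_length: int = 100_000
) -> List[Interval]:
    """Select intervals strictly longer than min_length bases."""
    return [iv for iv in intervals if iv.length > min_length]


# Hypothetical candidate regions on made-up contigs.
candidates = [
    Interval("contigA", 0, 250_000),        # 250,000 bases: clipped
    Interval("contigA", 400_000, 450_000),  # 50,000 bases: kept
    Interval("contigB", 10, 100_010),       # exactly 100,000 bases: kept
]
clipped = intervals_to_clip(candidates)
print(clipped)
```

The same function with `min_length=10_000` would mirror the final step applied to gaps left unaligned after Cactus.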

PGGB

The Pangenome Graph Builder pipeline (PGGB) creates an all-vs-all graph with base-level alignments and no clipping of mitochondrial or centromeric regions.

Graphs and associated files are summarized below.

| Description | Location |
|---|---|
| graph | gfa |
| untangle | delta, paf |
| VCFs | chm13.1-22+X, chm13.M, grch38.1-22+X, grch38.M, grch38.Y |

Graph chromosome files and images can be found here and here.

Change Log

* Dec 03, 2021: updated minigraph-cactus VCFs to fix headers (thanks to Wen-Wei)
