The Cancer Predisposition Sequencing Reporter (CPSR) is a computational workflow that interprets germline variants identified from next-generation sequencing in the context of cancer predisposition. The workflow is integrated with the framework that underlies the Personal Cancer Genome Reporter (PCGR), utilizing the Docker environment for encapsulation of code and software dependencies. While PCGR is intended for reporting and analysis of somatic variants detected in a tumor, CPSR is intended for reporting and ranking of germline variants in protein-coding genes that are implicated in cancer predisposition and inherited cancer syndromes.
CPSR accepts a query file with raw germline variant calls encoded in the VCF format (i.e. analyzing SNVs/InDels). The software performs extensive variant annotation and produces an interactive HTML report, in which the user can investigate three main sets of variants identified in the query set:
- Germline variants in a configurable set of cancer predisposition genes that have previously been reported as pathogenic or likely pathogenic in ClinVar (with no conflicting interpretations)
- Unclassified variants: germline variants within the configurable cancer predisposition gene list that are either:
  - registered as a variant of uncertain significance (VUS) in ClinVar, or
  - a novel protein-coding variant (i.e. not reported in ClinVar, and not found in the user-defined gnomAD or 1000 Genomes Project population datasets), or
  - a rare protein-coding variant (e.g. minor allele frequency (MAF) < 0.001 in the user-defined gnomAD or 1000 Genomes Project population datasets)
    - The upper MAF threshold (e.g. 0.001) for listing of unclassified variants can be configured by the user
- Variants overlapping with previously identified hits in genome-wide association studies (GWAS) of cancer phenotypes (i.e. low to moderate risk-conferring alleles), using the NHGRI-EBI Catalog of published genome-wide association studies as the underlying source
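The three criteria for the unclassified set can be sketched as a simple predicate. This is a minimal illustration, not CPSR's actual implementation: the `Variant` record, its field names and the `is_unclassified` helper are hypothetical, and the sketch assumes that a variant carrying any other ClinVar classification is handled outside the unclassified set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    # Hypothetical record; field names are assumptions made for this sketch.
    clinvar_significance: Optional[str]  # None if the variant is not in ClinVar
    is_protein_coding: bool
    population_maf: Optional[float]  # None if absent from the chosen population dataset


def is_unclassified(v: Variant, maf_upper_threshold: float = 0.001) -> bool:
    """Return True if the variant meets one of the three criteria listed above."""
    # Criterion 1: registered as VUS in ClinVar.
    if v.clinvar_significance == "uncertain_significance":
        return True
    # Assumption: variants with another ClinVar classification are not unclassified.
    if v.clinvar_significance is not None:
        return False
    # Criteria 2 and 3 apply to protein-coding variants only.
    if not v.is_protein_coding:
        return False
    # Criterion 2: novel (not in ClinVar, not in the population dataset).
    if v.population_maf is None:
        return True
    # Criterion 3: rare (MAF below the user-configurable upper threshold).
    return v.population_maf < maf_upper_threshold
```

Raising or lowering `maf_upper_threshold` corresponds to the user-configurable MAF threshold mentioned in the list.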
The (classified and unclassified) variant sets can be interactively explored and ranked further through different types of filters (associated phenotypes, genes, variant consequences, population MAF etc.). Importantly, the unclassified variants are assigned a pathogenicity score and ranked accordingly. The score aggregates evidence according to previously established ACMG criteria as well as cancer-specific criteria, as outlined in several previous studies (Huang et al., Cell, 2018; Maxwell et al., Am J Hum Genet., 2016; Amendola et al., Am J Hum Genet., 2016). See also Related work below.
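To make the idea of aggregating evidence into a single score concrete, the sketch below sums per-criterion contributions for each variant and ranks the variants by the total. The criterion codes are standard ACMG labels, but the numeric weights and the helper names are placeholders invented for this illustration; they are not the values or functions that CPSR uses.

```python
from typing import Dict, Iterable, List, Tuple

# Placeholder weights (assumption): positive values for pathogenic evidence,
# negative values for benign evidence. Not taken from CPSR or the cited studies.
ILLUSTRATIVE_WEIGHTS: Dict[str, float] = {
    "PVS1": 8.0,
    "PS1": 7.0,
    "PM5": 2.0,
    "BP1": -1.0,
    "BA1": -8.0,
}


def pathogenicity_score(evidence: Iterable[str]) -> float:
    """Sum the weights of all evidence codes assigned to one variant."""
    return sum(ILLUSTRATIVE_WEIGHTS.get(code, 0.0) for code in evidence)


def rank_variants(variants: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Return (variant_id, score) pairs sorted from highest to lowest score."""
    scored = [(vid, pathogenicity_score(codes)) for vid, codes in variants.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

An additive scheme like this keeps each contribution visible, so the final score of a variant can be traced back to the individual evidence items that produced it.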
We have compiled a comprehensive list of genes that are implicated in cancer predisposition and cancer syndromes. Three different sources were combined:
- A list of 152 genes that were curated and established within TCGA’s pan-cancer study (Huang et al., Cell, 2018)
- A list of 107 protein-coding genes that have been manually curated in COSMIC’s Cancer Gene Census v86
- A list of 148 protein-coding genes established by experts within the Norwegian Cancer Genomics Consortium (http://cancergenomics.no)
The combination of the three sources resulted in a non-redundant set of 209 protein-coding genes of relevance for predisposition to tumor development. We want to make it explicit that this list of 209 genes is by no means regarded as an international consensus, but should rather be subject to continuous update by the international community that carries expertise on genetic risk factors for cancer.
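The combined list is smaller than the sum of the three sources (152 + 107 + 148 = 407) because many genes appear in more than one source. A minimal sketch of such a non-redundant merge is shown below; the gene names are placeholders and do not reflect the actual content of any of the three lists.

```python
# Toy stand-ins for the three sources; membership is illustrative only.
tcga_pancancer = {"GENE_A", "GENE_B", "GENE_C"}
cosmic_cgc = {"GENE_B", "GENE_C", "GENE_D"}
ncgc = {"GENE_C", "GENE_E"}

# Set union drops genes that are listed in more than one source.
combined = sorted(tcga_pancancer | cosmic_cgc | ncgc)

total_entries = len(tcga_pancancer) + len(cosmic_cgc) + len(ncgc)
print(f"{total_entries} entries across sources, {len(combined)} unique genes")
```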
- VEP v94 - Variant Effect Predictor (GENCODE version 28/19 (grch38/grch37) as the gene reference dataset), includes gnomAD r2, dbSNP build 151/150, 1000 Genomes Project - phase3
- dbNSFP v3.5 - Database of non-synonymous functional predictions (August 2017)
- ClinVar - Database of clinically related variants (November 2018)
- DisGeNET - Database of gene-disease associations (v5.0, May 2017)
- UniProt/SwissProt KnowledgeBase 2018_10 - Resource on protein sequence and functional information (November 2018)
- Pfam v32 - Database of protein families and domains (September 2018)
- CancerMine v6 - Literature-derived database of tumor suppressor genes/proto-oncogenes (November 2018)
- NHGRI-EBI GWAS catalog - GWAS catalog for cancer phenotypes (October 29th 2018)
- November 19th 2018: 0.3.0 pre-release
  - Bug fixing and bundle update
- November 12th 2018: 0.2.1 pre-release
  - Improved ACMG classification transparency
- November 6th 2018: 0.2.0 pre-release
  - Adjustments of ACMG classification criteria
    - Mechanisms of disease for cancer susceptibility genes (GoF vs. LoF) retrieved from Maxwell et al., Am J Hum Genet, 2016
    - Exceptions for HFE/SERPINA1 with respect to high population MAF (BA1)
    - Threshold for genes with "primarily truncations" set to 90% pathogenic truncations (BP1)
    - Consider only pathogenic variants (not likely pathogenic) when checking for novel peptide changes at pathogenic loci (PS1/PM5)
- October 27th 2018: 0.1.1 pre-release
  - Added documentation of ACMG evidence items in report output
  - Inclusion of GWAS hits is now optional
- October 5th 2018: 0.1.0 pre-release
  - Initial release of CPSR - reporting of germline variants for cancer predisposition
Make sure you have a working installation of PCGR (dev version) and the accompanying dev data bundle(s) (walk through steps 0-2).
Download the pre-release of cpsr (run script and configuration file).
A few elements of the workflow can be configured using the cpsr configuration file, encoded in TOML (an easy-to-read file format).
The initial step of the workflow performs VCF validation on the input VCF file. This procedure is very strict, and often causes the workflow to return an error due to various violations of the VCF specification. If the user trusts that the most critical parts of the input VCF are properly encoded, a setting in the configuration file (vcf_validation = false) can be used to turn off VCF validation.
An exhaustive, predefined list of 209 cancer predisposition/syndrome genes can also be configured.
Run the workflow with cpsr.py, which takes the following arguments and options:
usage: cpsr.py [-h] [--input_vcf INPUT_VCF] [--force_overwrite] [--version]
               [--basic] [--docker-uid DOCKER_USER_ID] [--no-docker]
               pcgr_base_dir output_dir {grch37,grch38} configuration_file
               sample_id

Cancer Predisposition Sequencing Reporter (CPSR) - report of cancer-predisposing
germline variants

positional arguments:
  pcgr_base_dir         Directory that contains the PCGR data bundle
                        directory, e.g. ~/pcgr-dev
  output_dir            Output directory
  {grch37,grch38}       Genome assembly build: grch37 or grch38
  configuration_file    Configuration file (TOML format)
  sample_id             Sample identifier - prefix for output files

optional arguments:
  -h, --help            show this help message and exit
  --input_vcf INPUT_VCF
                        VCF input file with somatic query variants
                        (SNVs/InDels). (default: None)
  --force_overwrite     By default, the script will fail with an error if any
                        output file already exists. You can force the
                        overwrite of existing result files by using this flag
                        (default: False)
  --version             show program's version number and exit
  --basic               Run functional variant annotation on VCF through
                        VEP/vcfanno, omit report generation (STEP 4) (default:
                        False)
  --docker-uid DOCKER_USER_ID
                        Docker user ID. Default is the host system user ID. If
                        you are experiencing permission errors, try setting
                        this up to root (`--docker-uid root`) (default: None)
  --no-docker           Run the CPSR workflow in a non-Docker mode (see
                        install_no_docker/ folder for instructions (default:
                        False)
The cpsr software bundle contains an example VCF file. It also contains a configuration file (cpsr.toml).
Analysis of the example VCF can be performed by the following command:
python ~/cpsr-0.3.0/cpsr.py --input_vcf ~/cpsr-0.3.0/example.vcf.gz \
    ~/pcgr-dev ~/cpsr-0.3.0 grch37 ~/cpsr-0.3.0/cpsr.toml example
Note that the example command also refers to the PCGR directory (pcgr-dev), which contains the data bundle that is necessary for both PCGR and CPSR.
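When the workflow is launched from another Python script, the same command can be assembled as an argument list. The sketch below follows the argument order shown in the usage text; the `build_cpsr_command` helper is hypothetical and not part of CPSR, and the actual launch is left commented out.

```python
import os
import shlex


def build_cpsr_command(cpsr_script, input_vcf, pcgr_base_dir, output_dir,
                       genome_assembly, configuration_file, sample_id):
    """Assemble the cpsr.py call in the order given by the usage text."""
    if genome_assembly not in ("grch37", "grch38"):
        raise ValueError(f"Unsupported genome assembly: {genome_assembly}")
    return [
        "python", cpsr_script,
        "--input_vcf", input_vcf,
        pcgr_base_dir,
        output_dir,
        genome_assembly,
        configuration_file,
        sample_id,
    ]


cpsr_dir = os.path.expanduser("~/cpsr-0.3.0")
cmd = build_cpsr_command(
    cpsr_script=os.path.join(cpsr_dir, "cpsr.py"),
    input_vcf=os.path.join(cpsr_dir, "example.vcf.gz"),
    pcgr_base_dir=os.path.expanduser("~/pcgr-dev"),
    output_dir=cpsr_dir,
    genome_assembly="grch37",
    configuration_file=os.path.join(cpsr_dir, "cpsr.toml"),
    sample_id="example",
)
print(shlex.join(cmd))
# To actually run the workflow:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues for paths that contain spaces.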
This command will run the Docker-based cpsr workflow and produce the following output files in the cpsr folder:
- example.cpsr.grch37.pass.vcf.gz (.tbi) - Bgzipped VCF file with functional/clinical annotations
- example.cpsr.grch37.pass.tsv.gz - Compressed TSV file (generated with vcf2tsv) with functional/clinical annotations
- example.cpsr.grch37.html - Interactive HTML report with clinically relevant variants in cancer predisposition genes organized into tiers
- example.cpsr.grch37.json.gz - Compressed JSON dump of HTML report content
- example.cpsr.snvs_indels.tiers.grch37.tsv - TSV file with most important annotations of tier-structured SNVs/InDels
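The output file names follow a regular pattern built from the sample identifier and the genome assembly. The sketch below derives the expected names from that pattern and reports any that are missing from an output directory; both helper functions are hypothetical, and the `.tbi` entry assumes that the "(.tbi)" note above refers to an index file next to the bgzipped VCF.

```python
from pathlib import Path
from typing import List


def expected_output_files(sample_id: str, genome_assembly: str) -> List[str]:
    """List the output file names according to the pattern shown above."""
    prefix = f"{sample_id}.cpsr"
    return [
        f"{prefix}.{genome_assembly}.pass.vcf.gz",
        f"{prefix}.{genome_assembly}.pass.vcf.gz.tbi",  # assumed index file name
        f"{prefix}.{genome_assembly}.pass.tsv.gz",
        f"{prefix}.{genome_assembly}.html",
        f"{prefix}.{genome_assembly}.json.gz",
        f"{prefix}.snvs_indels.tiers.{genome_assembly}.tsv",
    ]


def missing_outputs(output_dir: str, sample_id: str, genome_assembly: str) -> List[str]:
    """Return the expected file names that are not present in output_dir."""
    base = Path(output_dir).expanduser()
    return [name for name in expected_output_files(sample_id, genome_assembly)
            if not (base / name).exists()]
```

A call such as `missing_outputs("~/cpsr-0.3.0", "example", "grch37")` should return an empty list after a successful run of the example command.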