broadinstitute / exome-results-browsers

Results browsers for case-control studies of psychiatric diseases done at the Broad Institute

License: BSD 3-Clause "New" or "Revised" License

JavaScript 73.64% Dockerfile 0.51% Python 24.37% Shell 1.05% HTML 0.43%

exome-results-browsers's Introduction

Exome Results Browsers

Results browsers for case-control studies of psychiatric diseases done at the Broad Institute.

Schizophrenia - SCHEMA

The Schizophrenia Exome Sequencing Meta-analysis (SCHEMA) consortium is a large multi-site collaboration dedicated to aggregating, generating, and analyzing high-throughput sequencing data of schizophrenia patients to improve our understanding of disease architecture and advance gene discovery. The first results of this study have provided genome-wide significant results associating rare variants in individual genes with risk of schizophrenia, and later releases are planned with a larger number of samples that will further increase power.

Epilepsy - Epi25

The Epi25 collaborative is a global collaboration committed to aggregating, sequencing, and deep-phenotyping up to 25,000 epilepsy patients to advance epilepsy genetics research. The Epi25 whole-exome sequencing (WES) case-control study is one of the collaborative's ongoing endeavors that aims to characterize the contribution of rare genetic variation to a spectrum of epilepsy syndromes to identify individual risk genes.

Autism - ASC

Founded in 2010, the Autism Sequencing Consortium (ASC) is an international group of scientists who share autism spectrum disorder (ASD) samples and genetic data. This portal displays variant and gene-level data from the most recent ASC exome sequencing analysis.

Bipolar Disorder - BipEx

The Bipolar Exome (BipEx) sequencing project is a collaboration between multiple institutions across the globe, which aims to increase our understanding of the disease architecture of bipolar disorder.

exome-results-browsers's Issues

Repurposing exome-results-browser

Hello @nawatts

I am working on a non-human genomics project where I would like to display variants in much the same way as the exome-results-browser currently does. Basically, for every gene in a genome, I would like to display the variant counts/frequencies in two tracks for cases and controls, and possibly the result of the χ² test.

I previously tried using the gnomad-browser directly for my project and managed to get a prototype partly working (the VariantTable, with columns for counts in cases/controls) before realizing that my use case was much better handled by exome-results-browser. For the record, when using gnomad-browser I replaced the Elasticsearch API with an API that queries an SQL database directly, and I removed the caching part too.

For exome-results-browser, if I understood correctly, the API serves data stored directly in JSON files. You generate the JSON files for your different datasets with the Hail scripts in data_pipeline, and these JSON files are then served. You have some project-specific implementations of the browser, but I think I can ignore those for now, as I would like to concentrate my efforts on the per-gene page, with the case and control variant tracks and the VariantTable.

I think a number of things need to be done:

Replacing the gene models. My genome is really simple (prokaryotic, so no alternative transcripts, no introns, only single-exon genes), so I imagine it should not be too hard to feed this in instead of a complicated human genome with transcript tracks and so on.
Generating the JSON for each gene in the genome. I think you go through Hail as an intermediary for this. I would like to try generating the JSON directly from my SQL database. As you are using GraphQL for the queries, my feeling is that I would have to modify the queries to fit my simpler gene models. For the variant query, I will start with a simpler model using the variantId, HGVS nomenclature, and counts.

Does my plan make sense to you? Do you have any recommendations or previous experience in repurposing gnomad/exome-results-browser for other organisms?

As a side note, did you recently move the common gnomad browser components out of each repo? They are all now in the gnomad-browser-toolkit, is that right?

Thank you very much for open sourcing this suite of tools and your help

LoF label changed to PTV

Not sure how best to do this systematically just for the SCHEMA browser, but on the gene page we have LoF for:

The variant selection button
Constraint definition

For some reason, we have decided to use PTV in the SCHEMA project. What is the easiest way to make the label consistently PTV?

Use field types in dataset metadata to set default render functions for table columns

Integer fields should default to renderCount, floats to renderExponential, etc.
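
A minimal sketch of how such defaults might be wired up. Only the names renderCount and renderExponential come from this issue; their bodies, the type strings ("int", "float"), the column shape, and the renderText fallback are assumptions for illustration, not the browser's actual code.

```javascript
// Hypothetical render functions. Only the names renderCount and
// renderExponential appear in the issue; the bodies are illustrative.
const renderCount = (value) => (value == null ? "" : String(value));
const renderExponential = (value) =>
  value == null ? "" : Number(value).toExponential(2);
const renderText = (value) => (value == null ? "" : String(value));

// Assumed field type names in the dataset metadata.
const DEFAULT_RENDERERS = {
  int: renderCount,
  float: renderExponential,
};

// Keep a column's explicit render function if it has one; otherwise
// pick a default based on the field type, falling back to plain text.
const withDefaultRender = (column) => ({
  ...column,
  render: column.render || DEFAULT_RENDERERS[column.type] || renderText,
});

const columns = [
  { key: "n_cases", type: "int" },
  { key: "pval", type: "float" },
  { key: "gene_name", type: "string" },
].map(withDefaultRender);

console.log(columns[1].render(0.000123));
```

With this approach a browser would only need to set render on columns that deviate from the default for their type.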

Document browser configuration

Document browser configuration in the "Adding a new browser" section of CONTRIBUTING.md.

Create new exome results browser for inflammatory bowel disease (IBD)

Emailed them a while back describing how they need to prepare their data for us.

Gene-level analysis is still underway.

Points of contact: Hailiang, Kai, and Mingrui

See: https://airtable.com/appUwtp7HmTkfdYyg/tbljmTZELaOMWrSPv/viwB6C8zwpd2MCJ3L?blocks=hide

Add pipeline to prepare all datasets

Currently, prepare_dataset has to be run individually on each dataset. There should be a pipeline to prepare all datasets based on the list of datasets in pipeline_config.ini.

Render error on /gene/ENSG00000251801

Entering "SNORD112" in the gene search page and choosing any of the Ensembl transcripts that pop up results in an error.

Constraint table labels

Genic constraint metrics: metrics quantifying intolerance to protein-truncating variation, as calculated by the gnomAD consortium. For more information, please visit the gnomAD browser. Please note that insertions and deletions are excluded from the aggregated counts and calculated metrics.

[hover]o/e ratio: ratio of the observed / expected (oe) number of loss-of-function variants in that gene. The expected counts are based on a mutational model that takes sequence context, coverage and methylation into account.
[hover]Exp. SNVs: expected number of loss-of-function variants
[hover]Obs. SNVs: observed number of loss-of-function variants
[hover]pLI: probability of being loss-of-function intolerant (pLI). A score closer to 1 indicates more intolerance to protein-truncating variation. For a set of transcripts intolerant of protein-truncating variation, we suggest pLI ≥ 0.9.

//TODO Change LoF to PTV.

Split up other studies component

Currently, the OtherStudies component is the only thing in the "base" directory that contains browser-specific information. It should be split up into the individual browser directories.

Inconsistent gene symbols/names

The gene symbols/names shown on the all gene results page and those shown on the individual gene pages are sometimes inconsistent. The ones shown on the all gene results page come from the gene results table. The ones shown on the gene pages come from the gene models based on Gencode/HGNC data.

Remove the gene symbol/name requirement from the data format (leave only Ensembl gene ID) and update data preparation steps to annotate gene symbol/name from the gene models.

Gene model positions off by one

The ExAC browser's code for importing gene models from a Gencode GTF added one to start and stop positions.
https://github.com/konradjk/exac_browser/blob/a212465c5b75752abe8990cf6aa581295835ab58/parsing.py#L284-L285

That was kept when the import was converted to Hail. However, according to https://genome.ucsc.edu/FAQ/FAQformat.html#format3, GTF files are one indexed. Thus, all gene/transcript/exon positions are off by one.
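
The pipeline itself imports gene models with Hail, so the following is only an illustrative sketch, in JavaScript, of the coordinate convention described above: GTF start and end are already 1-based and inclusive, so they can be used without adding one. The function name and the shape of the returned object are assumptions.

```javascript
// GTF columns: seqname, source, feature, start, end, score, strand,
// frame, attributes. start and end are 1-based and inclusive, so they
// are taken as-is (no "+ 1" adjustment).
const parseGtfLine = (line) => {
  const [chrom, , featureType, start, end, , strand] = line.split("\t");
  return {
    chrom,
    featureType,
    start: Number(start),
    stop: Number(end),
    strand,
  };
};

// With inclusive 1-based coordinates, length is stop - start + 1.
const featureLength = (feature) => feature.stop - feature.start + 1;

// Made-up example line in GTF layout.
const exampleLine = [
  "1", "HAVANA", "exon", "11869", "12227", ".", "+", ".",
  'gene_id "ENSG00000223972.4";',
].join("\t");

const exon = parseGtfLine(exampleLine);
console.log(exon.start, exon.stop, featureLength(exon));
```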

QQ plot y-axis relabel

Can actual -log10(p) be changed to Observed -log10(p)? How easy is it to change the 10 to a subscript? Thanks!

Move CSV export server side

Instead of generating CSVs in the client, add routes that respond with gene/variant results in CSV format. These could take an analysis group as a query parameter.

This would remove the need to configure renderForCSV per browser. Field types in dataset metadata could be used to set reasonable defaults (#6).
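
A rough sketch of what a server-side CSV route could look like, assuming an Express-style request/response object. The helper names (escapeCsvValue, toCsv, makeCsvHandler), the fetchResults callback, and the analysis_group query parameter name are all hypothetical.

```javascript
// Quote a value if it contains a comma, double quote, or line break,
// doubling any embedded double quotes.
const escapeCsvValue = (value) => {
  if (value === null || value === undefined) return "";
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
};

// Build CSV text from result rows; `columns` lists the keys to export.
const toCsv = (rows, columns) => {
  const header = columns.map(escapeCsvValue).join(",");
  const lines = rows.map((row) =>
    columns.map((key) => escapeCsvValue(row[key])).join(",")
  );
  return [header, ...lines].join("\r\n") + "\r\n";
};

// Hypothetical route handler factory: `fetchResults` stands in for
// whatever function loads gene or variant results for an analysis group.
const makeCsvHandler = (fetchResults, columns) => async (req, res) => {
  const rows = await fetchResults(req.query.analysis_group);
  res.set("Content-Type", "text/csv");
  res.send(toCsv(rows, columns));
};
```

A handler built this way could be mounted on a route of the server's choosing; the column list could come from the dataset metadata rather than per-browser configuration.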

Only cache results of successful gene result queries

Queries for all gene results are cached in the API. However, there is no check that the query succeeds. Thus, an initial failed query is never retried.

exome-results-browsers/src/server/schema/geneResult.js

Lines 66 to 91 in c387efd

    
    const geneResultsCache = new Map()

    export const fetchAllGeneResultsForAnalysisGroup = (ctx, analysisGroup) => {
      if (geneResultsCache.has(analysisGroup)) {
        return geneResultsCache.get(analysisGroup)
      }

      const request = fetchAllSearchResults(ctx.database.elastic, {
        index: browserConfig.elasticsearch.geneResults.index,
        type: browserConfig.elasticsearch.geneResults.type,
        size: 10000,
        body: {
          query: {
            bool: {
              filter: {
                term: { analysis_group: analysisGroup },
              },
            },
          },
        },
      }).then(hits => hits.map(hit => shapeGeneResult(hit._source))) // eslint-disable-line no-underscore-dangle

      geneResultsCache.set(analysisGroup, request)

      return request
    }
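
One possible way to address this, sketched under assumptions: keep caching the in-flight promise so concurrent callers share a single query, but evict the entry if the promise rejects. The makeCachedFetcher wrapper and its fetchFn argument are hypothetical names, not part of the repository.

```javascript
// Wrap an async fetch function with a cache keyed by its argument.
// The promise is cached immediately so concurrent callers share one
// request; if it rejects, the entry is removed so a later call retries.
const makeCachedFetcher = (fetchFn) => {
  const cache = new Map();
  return (key) => {
    if (cache.has(key)) {
      return cache.get(key);
    }
    const request = fetchFn(key).catch((error) => {
      cache.delete(key);
      throw error; // still surface the failure to the current caller
    });
    cache.set(key, request);
    return request;
  };
};
```

The existing query could then be passed in as fetchFn, with the analysis group as the key.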

Add option to proxy to remote API in development

Render error on /gene/ENSG00000083168

While viewing variants in KAT6A, I unchecked the LoF filter and got this error message.

Render error on /gene/ENSG00000092108

Loaded gene page in SCHEMA for SCFD1. Page appeared to load fine.

However, starting to type "13" into the "Search variant table" field resulted in an error page. This was reproducible in both Safari and Chrome on a MacBook Pro running High Sierra (Mac OS 10.13.6).

I tried it with other genes and got the same crash.

Handle Ensembl gene IDs in search box

Currently, entering an Ensembl gene ID in the search box returns "No results found."
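
A small sketch of how the search box could recognize an Ensembl gene ID before falling back to a symbol search. The resolveSearch function and its return shape are assumptions; the /gene/<ID> path follows the URLs that appear in other issues here.

```javascript
// Human Ensembl gene IDs are "ENSG" followed by 11 digits, optionally
// with a ".<version>" suffix.
const ENSEMBL_GENE_ID = /^ENSG\d{11}(\.\d+)?$/i;

// Hypothetical resolver: go straight to the gene page for an Ensembl
// gene ID, otherwise treat the input as a gene symbol query.
const resolveSearch = (query) => {
  const text = query.trim();
  if (ENSEMBL_GENE_ID.test(text)) {
    const geneId = text.toUpperCase().split(".")[0];
    return { type: "gene", url: `/gene/${geneId}` };
  }
  return { type: "symbol", query: text };
};

console.log(resolveSearch("ENSG00000083168"));
console.log(resolveSearch("KAT6A"));
```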

Hover over description

In the gene results page, the Description (gene name) is truncated - is there a way of displaying the full name when hovering over it? https://schema.broadinstitute.org/results

Allow sorting group results in variant details

Currently, the group results table in the variant details modal is sorted so that the default group is first and the others are unordered. The group results table should allow choosing a column to sort on, the same as the gene and variant results tables.
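
A minimal sketch of a sort helper the modal could use once a sort column is chosen. The function name, the row shape, and the choice to place missing values last are assumptions.

```javascript
// Return a sorted copy of `results` ordered by `sortKey`.
// Rows with a missing value for the key are placed last in either
// direction; strings compare by locale, everything else numerically.
const sortGroupResults = (results, sortKey, ascending = true) => {
  const direction = ascending ? 1 : -1;
  return [...results].sort((a, b) => {
    const x = a[sortKey];
    const y = b[sortKey];
    if (x == null && y == null) return 0;
    if (x == null) return 1;
    if (y == null) return -1;
    if (typeof x === "string") return direction * x.localeCompare(String(y));
    return direction * (x - y);
  });
};

// Made-up rows for illustration only.
const groupResults = [
  { group: "b", pval: 0.2 },
  { group: "a", pval: null },
  { group: "c", pval: 0.1 },
];

console.log(sortGroupResults(groupResults, "pval").map((r) => r.group));
```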

Updated variant annotation table for SCHEMA

gs://schizophrenia/browser-5/2020-09-10_schema-browser-variant-annotation-table.ht

I've added a column called canonical_term which is the string label for the consequence. However, I kept canonical_csq such that you know which variants are lof, mis, etc.

It would be good to have missense variants still separated into MPC 2 - 3 and MPC > 3. I think you do that as part of your processing pipeline?

Let me know if this makes sense.

Thanks!

Updated variant files for SCHEMA

I found a few things that required fixing in the listed variants in the browser. The format of the data has not changed.

Thanks!

gs://schizophrenia/browser-5/schema-browser-variant-annotation-table.ht
gs://schizophrenia/browser-5/schema-browser-variant-results-table-meta-rare-denovos-common-merged.ht

Document sources for reference data

Document where/how to obtain reference data files.

exome-results-browsers/data_pipeline/pipeline_config.ini

Lines 28 to 33 in 1d2e66e

    
    [reference_data]
    grch37_gencode_path = gs://exome-results-browsers/reference/gencode.v19.gtf.bgz
    grch38_gencode_path = gs://exome-results-browsers/reference/gencode.v29.gtf.bgz
    grch37_canonical_transcripts_path = gs://exome-results-browsers/reference/gnomad_2.1.1_vep85_canonical_transcripts.tsv.bgz
    grch38_canonical_transcripts_path = gs://exome-results-browsers/reference/gnomad_3.0_vep95_canonical_transcripts.tsv.bgz
    hgnc_path = gs://exome-results-browsers/reference/hgnc.tsv

Precompress static files

Currently, the compression package is used to serve data gzip encoded. Since most responses serve static files, those files can be compressed at build time.
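
A sketch of a build-time step using Node's built-in zlib module. The function name, the default extension list, and the choice to write a .gz file next to each source file are assumptions; the server would also need to be changed to send the .gz variant with a Content-Encoding: gzip header, which is not shown here.

```javascript
const fs = require("fs");
const path = require("path");
const zlib = require("zlib");

// Write a gzip-compressed copy (<name>.gz) next to every regular file
// in `directory` whose extension is in `extensions`. Not recursive.
const precompressDirectory = (
  directory,
  extensions = [".json", ".js", ".html"]
) => {
  const written = [];
  for (const name of fs.readdirSync(directory)) {
    if (!extensions.includes(path.extname(name))) continue;
    const source = path.join(directory, name);
    if (!fs.statSync(source).isFile()) continue;
    const compressed = zlib.gzipSync(fs.readFileSync(source), {
      level: zlib.constants.Z_BEST_COMPRESSION,
    });
    fs.writeFileSync(`${source}.gz`, compressed);
    written.push(`${source}.gz`);
  }
  return written;
};
```

Compressing once at build time allows the highest compression level without adding per-request CPU cost.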

Refactor data pipeline output

Currently, all outputs of the data pipeline are written to the output.staging_path specified in pipeline_config.ini.

exome-results-browsers/data_pipeline/pipeline_config.ini

Lines 45 to 47 in 86c8b62

    
    [output]
    # Path for intermediate Hail files.
    staging_path = gs://exome-results-browsers/data/200911

Thus, preserving older versions of the combined Hail table requires changing the staging path setting every time data is updated. This in turn requires keeping multiple copies of gene models and individual dataset files.

Instead, gene models could be output separately, individual dataset Hail tables written to staging path, and combined Hail tables written to timestamped paths. This way, updating one dataset would require running prepare_dataset only on that one dataset and then generating a new combined Hail table.
