statgen / pheweb Goto Github PK
View Code? Open in Web Editor NEWA tool to build a website to browse hundreds or thousands of GWAS.
License: MIT License
A tool to build a website to browse hundreds or thousands of GWAS.
License: MIT License
Before I can merge py3
into master
, I need to:
Update documentation
pip3
python3 -m venv ~/venv3
Figure out how to use wsgi-py3
with /var/www/pheweb-dev/pheweb
apt-get install libapache2-mod-wsgi-py3
. Install py3
in /var/www/pheweb3
and make all phewebs use that.
<script>
s not lookup until 400ms? What's slowing things down?<script>
s?console.time('foo'); console.timeEnd('foo');
$ cat ~/.pheweb/cache/genes.bed | cut -f4 | sort | uniq -c | grep -v '^\s*1\b' | wc -l
103
$ wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
$ gunzip -c gencode.v19.annotation.gtf.gz | perl -F'\t' -nale 'print if $F[2] eq "gene"' | grep -E '; gene_type "(protein_coding|IG_V_gene|TR_V_gene|TR_J_gene|IG_C_gene|IG_D_gene|IG_J_gene|TR_C_gene|TR_D_gene)";' | grep ZNF788
chr19 HAVANA gene 12203078 12248050 . + . gene_id "ENSG00000214189.4"; transcript_id "ENSG00000214189.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ZNF788"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "ZNF788"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000156416.3";
chr19 ENSEMBL gene 12203078 12225491 . + . gene_id "ENSG00000188474.6"; transcript_id "ENSG00000188474.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ZNF788"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "ZNF788"; level 3; tag "ncRNA_host";
Sometimes genes with the same name overlap, but sometimes they don't. Maybe common gene names are just hopeless. Anyways, I made it so that the same gene name will never be displayed twice at the gene-annotation stage.
Then move all phewebs into /var/phewebs/<dataset_name>{a,b}/
and make /var/phewebs/.git
with .gitignore
for autogenerated/
.
After that, use hardlinking for a cleaner workflow.
On Manhattan Plot page when you click a variant have an option to go eitherto the LZ page or to the PheWAS page (to make navigation to the PheWAS plot less clunky).
References:
panel.qqconf
.
n = length(pvalues) + 1
, other params are defaults.Use a bootstrap data-toggle="collapse"
.
CI depends on #pvals, so each MAF-range should have a similar number of pvals.
At the page http://pheweb.sph.umich.edu/top_hits, gene names should link to LZ.
There are two options for how to do that:
I make URLs like /region/<phenocode>/gene/<genename>
and the server does the work.
When serving top_hits.json
I add in regions for each gene (either alongside of in a mapping) and the rendering code injects the links.
Later, should I build /gene/<genename>
, where, again, the server does the work? Then the schema will be:
/random
/top_hits
/variant/1:3432423-A-G
/pheno/BMI
/region/BMI/1:34234-34323
/region/BMI/gene/IRF4
/gene/IRF4
(This would have caught bug #46)
Follow this tutorial. Maybe use firefox to access error log. Block GA.
version
should always get incremented and end with -{$(git rev-parse --short -q HEAD)}
.
Add a .git/hooks/pre-commit
hook that updates pheweb/version.py
.
make pheweb _r
update pheweb/version.py
(so that I don't have to install the hook everywhere). Neither of these need to be robust, because I'm the only one who will use them.
http://www.ebi.ac.uk/gwas/search?query=rs6025
That site also has a download page: http://www.ebi.ac.uk/gwas/docs/downloads
Lars said that UCSC is one of the easiest ways to get an GRCh37/hg19 version:
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gwasCatalog.sql
CREATE TABLE `gwasCatalog` (
`bin` smallint(5) unsigned NOT NULL,
`chrom` varchar(255) NOT NULL,
`chromStart` int(10) unsigned NOT NULL,
`chromEnd` int(10) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`pubMedID` int(10) unsigned NOT NULL,
`author` varchar(255) NOT NULL,
`pubDate` varchar(255) NOT NULL,
`journal` varchar(255) NOT NULL,
`title` varchar(1024) NOT NULL,
`trait` varchar(255) NOT NULL,
`initSample` longblob NOT NULL,
`replSample` longblob NOT NULL,
`region` varchar(255) NOT NULL,
`genes` longblob NOT NULL,
`riskAllele` varchar(255) NOT NULL,
`riskAlFreq` varchar(255) NOT NULL,
`pValue` varchar(255) NOT NULL,
`pValueDesc` varchar(255) NOT NULL,
`orOrBeta` varchar(255) NOT NULL,
`ci95` varchar(255) NOT NULL,
`platform` varchar(255) NOT NULL,
`cnv` enum('Y','N') NOT NULL,
KEY `chrom` (`chrom`,`bin`),
KEY `name` (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gwasCatalog.txt.gz
603 chr1 2392647 2392648 rs2477686 22197933 Hu Z 2011-12-25 Nat Genet A genome-wide association study in Chinese men identifies three risk loci for non-obstructive azoospermia. Non-obstructive azoospermia 981 Han Chinese ancestry cases, 1,657 Han Chinese ancestry controls 1,946 Han Chinese ancestry cases, 4,077 Han Chinese ancestry controls 1p36.32 MMEL1, PEX10 rs2477686-? NR 6E-12 1.39 [1.26-1.52] Affymetrix [587,347] N
603 chr1 2452978 2452979 rs12073504 23251661 Comuzzie AG 2012-12-04 PLoS One Novel genetic loci identified for the pathophysiology of childhood obesity in the Hispanic population. Obesity-related traits 815 Hispanic children from 263 families NA 1p36.32 PANK4 rs12073504-G 0.033 5E-6 (IGFBP-3 ) 0.03 [NR] (ng/mL increase) Illumina [899,892] N
604 chr1 2501337 2501338 rs10797432 23128233 Jostins L 2012-11-01 Nature Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Ulcerative colitis Up to 12,924 European ancestry cases, up to 21,442 European ancestry controls Up to 25,683 European ancestry cases, up to 17,015 European ancestry controls 1p36.32 MMEL1, PLCH2, TNFRSF14 rs10797432-C 0.522 3E-12 1.078 [1.041-1.116] Affymetrix & Illumina [1.23 million] (imputed) N
604 chr1 2513215 2513216 rs734999 21297633 Anderson CA 2011-02-06 Nat Genet Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47. Ulcerative colitis 6,687 European ancestry cases, 19,718 European ancestry controls 9,628 European ancestry cases, 12,917 European ancestry controls 1p36.32 MMEL1, C1orf93, PLCH2, TNFRSF14 rs734999-C 0.52 3E-9 1.05 [1.01-1.09] Affymetrix & Illumina [~1.1 million] (imputed) N
604 chr1 2526745 2526746 rs3748816 20190752 Dubois PC 2010-02-28 Nat Genet Multiple common variants for celiac disease influencing immune gene expression. Celiac disease 4,533 European ancestry cases, 10,750 European ancestry controls 4,918 European ancestry cases, 5,684 European ancestry controls 1p36.32 MMEL1, TNFRSF14 rs3748816-? 0.66 3E-9 1.12 [1.09-1.18] Illumina [292,387] N
604 chr1 2553623 2553624 rs3890745 24449572 Orozco G 2014-01-01 Arthritis Rheumatol Novel rheumatoid arthritis susceptibility locus at 22q12 identified in an extended UK genome-wide association study. Rheumatoid arthritis 3,034 European ancestry cases, 5,271 European ancestry controls 4,726 European ancestry cases, 2,625 European ancestry controls 1p36.32 TNFRSF14 rs3890745-? 1E-6 1.18 [1.1-1.27] Affymetrix & Illumina [1,831,729] (Imputed) N
...
With marisa_trie
, trie.iteritems(key)
returns items from longest-to-shortest. I'm using marisa_trie
for:
[key] + [f'{key}{suffix}' for suffix in range(10)]
TP53
doesn't make the first page for its own suggestions.)import marisa_trie
t = marisa_trie.Trie(['a'*i for i in range(1, 5)] + ['z', 'az', 'za', 'aza'])
t.keys('a')
['a',
'aa',
'aaa',
'aaaa',
'az',
'aza']
I want results sorted by length and then lexicographic:
['a',
'aa',
'az',
'aaa',
'aza',
'aaaa']
In the "Top Hits" table, give users the option:
(1) to hide/unhide columns
(2) to sort columns
Also, add the following columns:
(1) add OR/BETA column
This page explains that the file PheWAS_code_translation_v1_2.txt
is not using the same phewas-icd9-mapping as we are. We need to use phemap.rda and pheinfo.rda.
Should we still pull our ICD9 strings from the v1.2 file, or look for them somewhere else?
Convert phenolist.py
-> phenolist/__init__.py
Make phenolist/icd9-phewas.json
which contains the information we need
Make pheweb phenolist icd9-phewas
. It checks for columns in pheno-list.json
that look like phewas codes, and then augments the file if it finds any.
/net/dumbo/home/larsf/PheWAS/PheWAS_code_translation_v1_2.txt
(15k lines):
"icd9" "icd9_string" "phewas_code" "phewas_string" "exclude_range" "rollup_bool" "Ignore_bool" "sex" "pregnancy_related" "category" "category_string"
"001" "Cholera" "008" "Intestinal infection" "001-009.99" 1 0 "" 0 1 "infectious diseases"
"001.0" "Cholera due to Vibrio cholerae" "008" "Intestinal infection" "001-009.99" 1 0 "" 0 1 "infectious diseases"
"001.1" "Cholera due to Vibrio cholerae el tor" "008" "Intestinal infection" "001-009.99" 1 0 "" 0 1 "infectious diseases"
"001.9" "Cholera NOS" "008" "Intestinal infection" "001-009.99" 1 0 "" 0 1 "infectious diseases"
"002" "Typhoid and paratyphoid fevers" "008" "Intestinal infection" "001-009.99" 1 0 "" 0 1 "infectious diseases"
"002.0" "Typhoid fever" "008.5" "Bacterial enteritis" "001-009.99" 1 0 "Both" 0 1 "infectious diseases"
...
"021.1" "Enteric tularemia" "008" "Intestinal infection" "001-009.99" 1 0 "" 0 1 "infectious diseases"
"021.2" "Pulmonary tularemia" 480.1 "Bacterial pneumonia" "480-488.99" 1 0 "Both" 0 9 "respiratory"
"021.3" "Oculoglandular tularemia" 369 "Infection of the eye" "369-374.99" 1 0 "" 0 7 "sense organs"
"021.8" "Other specified tularemia" "041" "Bacterial infection NOS" "010-041.99" 1 0 "" 0 1 "infectious diseases"
"021.9" "Tularemia NOS" "041" "Bacterial infection NOS" "010-041.99" 1 0 "" 0 1 "infectious diseases"
...
/net/dumbo/home/larsf/PheWAS/PheWAS_code_v1_2.txt
(1488+1 lines):
phewas_code2 phewas_code phewas_string category_string Cases Controls NR
008 8 Intestinal infection infectious diseases 460 17893 1
008.5 8.5 Bacterial enteritis infectious diseases 244 17893 2
...
008.6 8.6 Viral Enteritis infectious diseases 114 17893 5
010 10 Tuberculosis infectious diseases 39 16924 7
038 38 Septicemia infectious diseases 410 16924 10
...
/net/dumbo/home/larsf/PheWAS/MATCHED/PheWAS_${NR}_MATCHED.epacts
:
#CHROM BEGIN END MARKER_ID NS AC CALLRATE MAF PVALUE BETA SEBETA CHISQ NS.CASE NS.CTRL AF.CASE AF.CTRL
1 1005806 1005806 1:1005806_C/T_1:1005806 5060 1561.2 1 0.15427 0.48329 -0.06802 0.097689 0.49144 460 4600 0.14605 0.1551
1 1079198 1079198 1:1079198_T/C_1:1079198 5060 2627 1 0.25959 0.93072 -0.0079779 0.091803 0.0075582 460 4600 0.25836 0.25971
1 1247494 1247494 1:1247494_T/C_1:1247494 5060 8226.7 1 0.18708 0.51949 -0.059428 0.091832 0.4149 460 4600 0.80563 0.81365
1 1723031 1723031 1:1723031_G/A_1:1723031 5060 5027.6 1 0.4968 0.12474 -0.10694 0.069707 2.3568 460 4600 0.47283 0.49919
1 2069172 2069172 1:2069172_C/T_1:2069172 5060 2826.2 1 0.27927 0.45051 -0.058743 0.078132 0.56938 460 4600 0.2685 0.28034
1 2069681 2069681 1:2069681_C/T_1:2069681 5060 626.91 1 0.061948 0.84678 0.027622 0.14258 0.037335 460 4600 0.063029 0.06184
1 2205581 2205581 1:2205581_C/A_1:2205581 5060 4262.4 1 0.42119 0.4292 -0.058128 0.073623 0.62498 460 4600 0.40942 0.42237
1 2387101 2387101 1:2387101_C/T_1:2387101 5060 4937.6 1 0.48791 0.40808 -0.057252 0.069225 0.6844 460 4600 0.4749 0.48921
1 2392648 2392648 1:2392648_G/C_1:2392648 5060 5017 1 0.49575 0.57311 -0.038945 0.069117 0.31751 460 4600 0.48688 0.49663
...
1 82074852 82074852 1:82074852_G/A_1:82074852 5060 0 1 0 NA NA NA NA 460 4600 0 0
...
What shall we do about those NA
s? Not insert them?
Where should we get colors for each category
? I can't find them online. We either make them up, or pull them out of Vanderbilt's code.
Click the first of the four little icons in a row at https://phewas.mc.vanderbilt.edu/datatable.
Lars used the R package PheWAS.
Lars built interactive http://csg.sph.umich.edu/larsf/Example_PheWAS_Plot.html using Bokeh in Python.
Test cases:
<pre>
containing any results.See chromosome 12 on http://browser.sph.umich.edu:5000/pheno/705.8. There's no way to figure out which variants are significant.
Once we get LocusZoom, that will allow people to distinguish the points. If two points overlap so fully that you can't even tell, then who cares– it's imputed data anyways.
Even with a binning threshold of pval=1e-4, the variant bins are the slow part (~500ms).
Improving the code speeds things up, but I think we're mostly constrained by the cost of drawing ~50k circles in the DOM.
Ways to bin variants:
Option 1: When there's a string of neglog10_pvals
, replace them with a [start, end]
in the json and a line with stroke-linecap:round
in the svg. I suspect that this will eliminate 80% of circles.
Option 2: The circles have radius 2.3. We have ~1000 position bins in a plot >=700px wide. Columns overlap their neighbors at browser widths < 3200px. Exploit that.
Eg, http://pheweb.sph.umich.edu/region/008/1:523918-923918 shows FAM87B, but my the tooltips don't, because we only include protein_coding
and {IG,TR}_{C,D,J,V}_gene
(immunoglobulin and T-cell receptors). (Gene biotypes are defined here)
LZ folks said I could add a gene_biotype
column to the genes postgres table and figure out how to get that into the response for the genes panel tooltips. Or I could set up my own genes API.
add a little question-mark-tooltip to the definition-list on the variant page explaining that this only counts genes that are protein-coding or immunoglobulin or T-cell receptor (and never pseudo).
Replace nearest_genes
with genes
, which is a json-serialized list of strings like "coding in PCSK9" or "3'UTR in PCSK9" or "3kb upstream of PCSK9". (should these be in a machine-readable format? strings will work for now.) Always include all coding and UTR entries and up/downstream < 1kb, but if none of those are found, include the nearest gene <1Mb.
It seems to be broken. Even for genes with signifiant traits are not coming up in the table. The LZ plot shows, but the table is always empty.
Only about the top 500 variants are useful in a LZ plot. Make /api/region
only return the 500 most pvaluable variants. It should also return the p-value of the best variant that it excluded. Then LZ should draw a gray box up to that p-value to show that those variants aren't loaded. That box should be behind the recombination line.
Later, I could add some sort of binning, which would allow LZ plots of several Mb. At that point the genes panel will be too full and the genes API will be the slowest part. I'm not sure what to do about that. Perhaps the variants API should also return the genes, and it can use the variants' pvalues to choose which ones to show? But it can't just choose the closest genes, because that will just make a messy cluster. Could I assign an interestingness score to each gene based on Clinvar/OMIM/ExAC-constraint or something like that?
5e-8 is used by:
/random
window.model
1e-6 is used by:
(See master...phuwanat:master for reference.)
so,
Replace 5e-8
with conf.genome_wide_sig_threshold
.
Replace 1e-6
with conf.peak_pval_cutoff
.
best-phenos-for-gene.json
. Tabix it.hits-1e2.bed
) which agglomerates 100 variants at a time. Tabix it.hits-1e4.bed
) which agglomerates 10k variants at a time.See http://browser.sph.umich.edu:5000/pheno/378, where no variants reach the significance cutoff. Should we force the extent to contain it?
Ryan has a sqlite3 db that maps rsid <-> cpra. He also has a table mapping old rsids to current ones. Use those to fix up association files that are missing some cpras.
I think I would just display all GWAS Catalog hits that pass some threshold (pval, sample size, &c?).
Would I just highlight any variant at the same position as a GWAS Catalog hit? Include it in their tooltip? Or mark the x-axis with it?
Option 1: In ggplot2
, ggrepel
fixes this. I'll search for something similar for d3, or else try to copy their algorithm.
Option 2: Maybe I can try a simple approach, like moving labels to the left or up or down until they no longer conflict, and drawing a line if the label moves too far from the point.
Option 3: Do simple things, and if it fails just give up and either (a) leave a mess or (b) don't draw the label.
Hello,
It seems like there is no top hits shown in my pheno page. Is there any known issues about this?
Regards,
Phuwanat
The GWAS Catalog has 25k entries.
Each entry has:
Is it worth automating? Let's do some by hand and then decide.
product_{each row}(P(OR_{obs} | OR_{GWAS}, SE)) / the null
Use a <clippath>
to solve this.
Get rid the bold info below the table. This should be easy when we switch to the new table architecture.
Understand
LocusZoom.DataLayer.prototype.applyStatusBehavior
workslayout.panels[0].data_layers[1].selected.onclick
is accessed.nvm.
(extracted from another issue)
Sarah:
add an annotation column (for example: intronic; exonic-nonsynonymous; exonic-splice; exonic-synonynmous; intergenic; 5' UTR; 3' UTR)
Peter:
Annotation: I want to be quite careful not to introduce problems into sites.tsv
, since matrixify
has to parse it. Check version numbers of VEP/snpEff, and then have PheWeb manage everything. If VEP or snpEff can add exactly the annotations I want to a vcf-like file, this should be easy– I just annotate sites.tsv
, pass along all columns into augmented_phenos_gz/*.gz
and matrix.tsv.gz
, convince my API to load and serve the annotations, and set up my default templates to render whatever's available.
Sarah:
VEP only is fine.
Annotation: VEP can write into a VCF file (--vcf flag) http://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_vcf
You can can filter the VCF output http://useast.ensembl.org/info/docs/tools/vep/script/vep_filter.html
Peter:
VEP adds this to the end of INFO
:
;CSQ=T|stop_gained|HIGH|CCT8L2|ENSG00000198445|Transcript|ENST00000359963|protein_coding|1/1||||1354|1094|365|W/*|tGg/tAg|||-1||HGNC|15553,
documented by this header:
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">
in pairs:
Allele : T
Consequence : stop_gained
IMPACT : HIGH
SYMBOL : CCT8L2
Gene : ENSG00000198445
Feature_type : Transcript
Feature : ENST00000359963
BIOTYPE : protein_coding
EXON : 1/1
INTRON :
HGVSc :
HGVSp :
cDNA_position : 1354
CDS_position : 1094
Protein_position : 365
Amino_acids : W/*
Codons : tGg/tAg
Existing_variation :
DISTANCE :
STRAND : -1
FLAGS :
SYMBOL_SOURCE : HGNC
HGNC_ID : 15553
From this, I would display consequence: stopgained
in tables and maybe use IMPACT: HIGH
to color Manhattan/LZ. Most of the rest could get dumped at the bottom of the PheWAS page. If I'm happy with the way SYMBOL
and Feature
are decided, I could use those instead of my gene annotation.
As I read the VEP annotation from a VCF, I'll take the most deleterious consequence
and the highest IMPACT
. Then assert that CSQ["alt"] == alt
.
Later: Figure out whether we can run ensembl-vep
on sites.tsv
, or if not what I must add. Maybe add an empty INFO
column or rename rsids
to ID
.
Sarah:
add to tables:
add to tooltips:
With the pval_threshold 1e-4 and bin sizes 0.05 and 3e6, the Manhattan Plot data is ~1MB. (2/3 unbinned_variants
and 1/3 variant_bins
.) Gzipped, it's <100KB. (4/5 unbinned_variants
and 1/5 variant_bins
.)
Option 1: Just use Flask-Compress.
Option 2: For maximum speed, we'll pre-gzip (-9
!) our static files and directly serve them. The headers for gzipped data can be found at http://flask.pocoo.org/snippets/122/. https://spoqa.github.io/flask-s3/_modules/flask_s3.html might also be interesting. But how do I attach headers with send_from_directory
or send_file
(which it uses)? They return response
s, right? Be sure to pass a long cache_timeout
and to handle the mimetype guessing.
Sometimes the tooltip is at the wrong place by about 20 pixels, but it still shows the correct information. It's not pointing at any other point.
It seems to happen most often to wide tooltips near the right side. The tooltip usually moves right and upward, but not always. It affects points and phenotype labels.
I didn't notice this issue before adding .show(d, this)
code.
To reproduce:
Once we integrate LocusZoom, we'll want lock-on-click tooltips, which we'll take from LZ, which shouldn't have this problem.
show_gene
, which I lost in 029baa8 and 9b23bc8 .pheweb/load/summarize/*.py
(where pheweb/load/summarize/__init__.py
has def run()
)?LocusZoom requires me to specify which attributes will exist (layout.panels[#id=association].data_layers[#id=associationpvalues].fields
). But I want users to be able to use whichever attributes they want. For now, I'll have a constant set of attributes for each pheweb, but later I'll likely have a different set for each phenotype within a pheweb.
I'd like to skip straight to dynamically setting fields
for each phenotype.
Options:
Know which attributes we have before we serve our javascript, and modify it server-side.
On the client, modify things after receiving data. .getData(...)
is probably the best place to do that. Chris didn't like that idea, and I didn't understand why not. Look into parseResponse
, especially the one in GeneSource
.
LocusZoom tooltip templating only allows substituting values, and nothing more complicated. Different phewebs have different attributes to show. Soon even different phenotypes within the same pheweb will have different attributes. Later, within a single phenotype, different variants will have different attributes. (eg, GWAS Catalog hits). On the LZ-PheWAS, different points have different attributes.
I'd love to just have a function that takes a variant and returns HTML. What if the tooltip just included an id and <script>foo(variant)</script>
?
On http://pheweb.sph.umich.edu:5002/variant/1-169519049-T-C click download
. The SVG has the wrong fonts and background color. Both of those are controlled by a css rules, rather than inline styles. Why isn't the export-as-svg code bringing those styles along?
EXPLAIN ANALYZE
SELECT (pheweb.categories.name, pheweb.phenos.num_cases, pheweb.phenos.num_controls, pheweb.phenos.phewas_code, pheweb.phenos.phewas_string, pheweb.associations.pval)
FROM pheweb.associations
JOIN pheweb.phenos ON pheweb.associations.pheno_id = pheweb.phenos.id
JOIN pheweb.categories ON pheweb.categories.id = pheweb.phenos.category_id
WHERE variant_id = 1
ORDER BY pheweb.phenos.phewas_code;
Sort (cost=574520.98..574524.62 rows=1458 width=83) (actual time=2312.447..2312.610 rows=1488 loops=1)
Sort Key: phenos.phewas_code
Sort Method: quicksort Memory: 256kB
-> Nested Loop (cost=0.43..574444.36 rows=1458 width=83) (actual time=0.065..2310.648 rows=1488 loops=1)
-> Nested Loop (cost=0.28..574134.89 rows=1458 width=59) (actual time=0.062..2308.497 rows=1488 loops=1)
-> Seq Scan on associations (cost=0.00..573274.20 rows=1458 width=16) (actual time=0.058..2304.954 rows=1488 loops=1)
Filter: (variant_id = 1)
Rows Removed by Filter: 26240362
-> Index Scan using pheno_pkey on phenos (cost=0.28..0.58 rows=1 width=59) (actual time=0.002..0.002 rows=1 loops=1488)
Index Cond: (id = associations.pheno_id)
-> Index Scan using category_pkey on categories (cost=0.15..0.20 rows=1 width=40) (actual time=0.001..0.001 rows=1 loops=1488)
Index Cond: (id = phenos.category_id)
Planning time: 0.252 ms
Execution time: 2312.732 ms
(14 rows)
As .svg for example
Make pheweb.HandledError which won't be printed by bin/pheweb
and maybe can bubble up through multiprocessing.
When an error occurs, drop debugging info into ~/pheweb/error-logs/$(date).txt
or ~/pheweb/error-logs/$(date)-process$i.txt
. Later, have all subprocesses push their error logs into a queue to that the master process can write them all into a single file and print its path.
pysam.tabix_index()
instead of tabix -p vcf <file>
.augmented_pheno_gz
):
#
in augmented_phenos/*
, and then just use pysam.tabix_compress(infile, outfile)
.BgzipWriter
via cffi.Goncalo says that all you need is r
between that variant and each other for your data.
Set up an LD server that references the raw data, probably by copying Daniel's HVCF. Having raw data means that security gets complicated, and I think I don't want that.
Pre-compute r
for all variant pairs within 300kb from the raw data. ie, for each variant, store r
for all variants for the next 300kb.
Ways to store it:
r
s until they hit 300kb. Easy to make/store, decent to use, probably just 2 sigfigs, should be smaller than matrix.tsv.gz
. So, look up the first variant, and iterate through r
s at the same time as iterating through sites.tsv.gz
until you hit the second variant.id
, chr-pos-ref-alt
], another of [variant1_id
, variant2_id
, r
]?A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.