Giter Site home page Giter Site logo

pheweb's Introduction

For a list of available instances of PheWeb, navigate here. For a walk-through demo see here. If you have questions or comments, check out our Google Group.

screenshot of PheWAS plot

How to Cite PheWeb

If you use the PheWeb code base for your work, please cite our paper:

Gagliano Taliun, S.A., VandeHaar, P. et al. Exploring and visualizing large-scale genetic associations by using PheWeb. Nat Genet 52, 550–552 (2020).

How to Build a PheWeb for your Data

If this is broken, open an issue on github and hopefully I can help.

1. Install PheWeb

pip3 install pheweb

2. Create a directory and config.py for your new dataset

mkdir ~/my-new-pheweb && cd ~/my-new-pheweb

This directory will store all the files pheweb makes for your dataset. All pheweb ... commands should be run in this directory.

Make config.py in this directory. In it, either set hg_build_number = 19 or hg_build_number = 38. Other options you can set are listed here.

3. Check that your GWAS summary statistics files will work

You need one file for each phenotype. Most common GWAS file formats should work. Here are the requirements:

  • It needs a header row.
  • Columns can be delimited by tabs, spaces, or commas.
  • It needs a column for the reference allele (which must always match the bases on the reference genome that you specified with hg_build_number) and a column for the alternate allele. If you have a MARKER_ID column like 1:234_C/G, that's okay too. If you have an allele1 and allele2, and sometimes one or the other is the reference, then you'll need to modify your files.
  • It can be gzipped if you want.
  • Variants must be sorted by chromosome and position, with chromosomes in the order [1-22,X,Y,MT].

The file must have columns for:

column description name other allowed column names allowed values
chromosome chrom #chrom, chr 1-22, X, Y, M, MT, chr1, etc
position pos beg, begin, bp integer
reference allele ref reference must match reference genome
alternate allele alt alternate anything
p-value pval pvalue, p, p.value number in [0,1]

You may also have columns for:

column description name other allowed column names allowed values
minor allele frequency maf number in (0,0.5]
allele frequency (of alternate allele) af a1freq, frq number in (0,1)
AF among cases case_af af.cases number in (0,1)
AF among controls control_af af.controls number in (0,1)
allele count ac integer
effect size (of alternate allele) beta number
standard error of effect size sebeta se number
odds ratio (of alternate allele) or number
R2 r2 number
number of samples num_samples ns, n integer, must be the same for every variant in its phenotype
number of controls num_controls ns.ctrl, n_controls integer, must be the same for every variant in its phenotype
number of cases num_cases ns.case, n_cases integer, must be the same for every variant in its phenotype

Column names are case-insensitive. If your file has a different column name, set field_aliases = {"column_name": "field_name"} in config.py. For example, field_aliases = {'P_BOLT_LMM_INF': 'pval', 'NSAMPLES': 'num_samples'}.

Any field can be null if it is one of ['', '.', 'NA', 'N/A', 'n/a', 'nan', '-nan', 'NaN', '-NaN', 'null', 'NULL']. If a required field is null, the variant gets dropped.

If your pval is log10 (like in REGENIE output), then set these variables in config.py: pval_is_neglog10 = True and field_aliases = {'LOGP':'pval'}.

4. Make a list of your phenotypes

Inside of your data directory, you need a file named pheno-list.json that looks like this:

[
 {
  "assoc_files": ["/home/peter/data/ear-length.gz"],
  "phenocode": "ear-length"
 },
 {
  "assoc_files": ["/home/peter/data/a1c.X.gz","/home/peter/data/a1c.autosomal.gz"],
  "phenocode": "A1C"
 }
]

Each phenotype needs assoc_files (a list of paths to association files) and phenocode (a string representing your phenotype that is used in filenames and URLs, comprised of [A-Za-z0-9_~-]).

If you want, you can also include:

  • phenostring (string): a name for the phenotype. Shown in tables and tooltips and page headers.
  • category (string): groups together phenotypes in the PheWAS plot. Shown in tables and tooltips.
  • num_cases, num_controls, and/or num_samples (number): if your input data only has AC or MAC, this will be used to calculated AF or MAF. Shown in tooltips. If your input data has correctly-named columns for these, the command pheweb phenolist read-info-from-association-files will add them into your existing pheno-list.json.
  • anything else you want, but you'll have to modify templates to use it.

You can use a csv by running:

pheweb phenolist import-phenolist "/path/to/pheno-list.csv"

or you can make one from scratch by running:

pheweb phenolist glob --star-is-phenocode "/home/peter/data/*.gz"

You can see other methods here.

5. Load your association files

Run pheweb process.

To distribute jobs across a cluster, follow these instructions.

To include VEP annotations, follow these instructions.

If something breaks and you can't understand the error message or it's something that PheWeb should support by default, open an issue on github or email me.

6. Serve the website

Run pheweb serve --open.

That command should either open a browser to your new PheWeb, or it should give you a URL that you can open in your browser to access your new PheWeb. If it doesn't, follow the directions for hosting a PheWeb and accessing it from your browser.

More options:

To run pheweb through systemd, see sample file here. To use Apache2 or Nginx, see instructions here. To require login via OAuth, see instructions here. To track page views with Google Analytics, see instructions here. To reduce storage use, see instructions here. To customize page contents, see instructions here.

PheWeb can display genetic correlations generated by another tool. To use this feature, set show_correlations = True in config.py and place the output of the rg pipeline as pheno-correlations.txt in the same folder as pheno-list.json.

To hide the button for downloading summary stats, add download_pheno_sumstats = "secret" and SECRET_KEY = "your random string" in config.py. That will make a secret page (printed to the console when you start the server) to share summary stats. To hide the button for downloading top hits and phenotypes, add download_top_hits = "hide" and download_phenotypes = "hide" respectively.

To allow dynamically filtering the manhattan plot, run pheweb best-of-pheno and set show_manhattan_filter_button=True in config.py.

Modifying PheWeb

See instructions here. See documentation about the files in generated-by-pheweb/ here.

pheweb's People

Contributors

abought avatar frankp-0 avatar hanayik avatar pjvandehaar avatar sergey-knyazev avatar sgagliano avatar shicheng-guo avatar vchapdelainet avatar zhilizheng avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pheweb's Issues

External links for variant

Test cases:

Remove non-python dependencies

  • wget: use wget from pypi instead of /usr/{,local/}bin/wget
  • tabix: use pysam.tabix_index() instead of tabix -p vcf <file>.
  • bgzip (for augmented_pheno_gz):
    • option 1: include the leading # in augmented_phenos/*, and then just use pysam.tabix_compress(infile, outfile).
    • option 2: use BgzipWriter via cffi.

make flexible tooltips for LZ

different sets of attributes for each phenotype

LocusZoom requires me to specify which attributes will exist (layout.panels[#id=association].data_layers[#id=associationpvalues].fields). But I want users to be able to use whichever attributes they want. For now, I'll have a constant set of attributes for each pheweb, but later I'll likely have a different set for each phenotype within a pheweb.

I'd like to skip straight to dynamically setting fields for each phenotype.

Options:

  1. Know which attributes we have before we serve our javascript, and modify it server-side.

  2. On the client, modify things after receiving data. .getData(...) is probably the best place to do that. Chris didn't like that idea, and I didn't understand why not. Look into parseResponse, especially the one in GeneSource.

tooltips

LocusZoom tooltip templating only allows substituting values, and nothing more complicated. Different phewebs have different attributes to show. Soon even different phenotypes within the same pheweb will have different attributes. Later, within a single phenotype, different variants will have different attributes. (eg, GWAS Catalog hits). On the LZ-PheWAS, different points have different attributes.

I'd love to just have a function that takes a variant and returns HTML. What if the tooltip just included an id and <script>foo(variant)</script>?

Conversation with Chris:

  • // just follow the docs at https://github.com/statgen/locuszoom
  • // plot.on('data_rendered', () => { modify this.**.tooltip })
  • // but .fields is gonna be a problem
  • // I could hackishly modify it in the .getData
  • // or I can make the parseResponse &c act more forgiving like in GeneSource.

Make points clickable in LZ

Understand

  • how LocusZoom.DataLayer.prototype.applyStatusBehavior works
  • where layout.panels[0].data_layers[1].selected.onclick is accessed.

Hide tracebacks for known errors

Make pheweb.HandledError which won't be printed by bin/pheweb and maybe can bubble up through multiprocessing.

When an error occurs, drop debugging info into ~/pheweb/error-logs/$(date).txt or ~/pheweb/error-logs/$(date)-process$i.txt. Later, have all subprocesses push their error logs into a queue to that the master process can write them all into a single file and print its path.

Link from gene names to LZ

At the page http://pheweb.sph.umich.edu/top_hits, gene names should link to LZ.

There are two options for how to do that:

  • I make URLs like /region/<phenocode>/gene/<genename> and the server does the work.

  • When serving top_hits.json I add in regions for each gene (either alongside of in a mapping) and the rendering code injects the links.

Later, should I build /gene/<genename>, where, again, the server does the work? Then the schema will be:

/random
/top_hits
/variant/1:3432423-A-G
/pheno/BMI
/region/BMI/1:34234-34323
/region/BMI/gene/IRF4
/gene/IRF4

QQ CI

@pjvandehaar

  • In SardiNIA QQ plots- check the Confidence Intervals
  • Also for the lowest bound replace 0.00 with 0.01 to reflect the cut-off applied for the plotting

Make Consortia-PheWeb

Ryan has a sqlite3 db that maps rsid <-> cpra. He also has a table mapping old rsids to current ones. Use those to fix up association files that are missing some cpras.

Then:

  • use it on Daniel's copy of Christian's data

Load the GWAS Catalog

http://www.ebi.ac.uk/gwas/search?query=rs6025

That site also has a download page: http://www.ebi.ac.uk/gwas/docs/downloads

Lars said that UCSC is one of the easiest ways to get an GRCh37/hg19 version:

  • http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gwasCatalog.sql

    CREATE TABLE `gwasCatalog` (
    `bin` smallint(5) unsigned NOT NULL,
    `chrom` varchar(255) NOT NULL,
    `chromStart` int(10) unsigned NOT NULL,
    `chromEnd` int(10) unsigned NOT NULL,
    `name` varchar(255) NOT NULL,
    `pubMedID` int(10) unsigned NOT NULL,
    `author` varchar(255) NOT NULL,
    `pubDate` varchar(255) NOT NULL,
    `journal` varchar(255) NOT NULL,
    `title` varchar(1024) NOT NULL,
    `trait` varchar(255) NOT NULL,
    `initSample` longblob NOT NULL,
    `replSample` longblob NOT NULL,
    `region` varchar(255) NOT NULL,
    `genes` longblob NOT NULL,
    `riskAllele` varchar(255) NOT NULL,
    `riskAlFreq` varchar(255) NOT NULL,
    `pValue` varchar(255) NOT NULL,
    `pValueDesc` varchar(255) NOT NULL,
    `orOrBeta` varchar(255) NOT NULL,
    `ci95` varchar(255) NOT NULL,
    `platform` varchar(255) NOT NULL,
    `cnv` enum('Y','N') NOT NULL,
    KEY `chrom` (`chrom`,`bin`),
    KEY `name` (`name`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
    
  • http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/gwasCatalog.txt.gz

    603  chr1  2392647  2392648   rs2477686  22197933         Hu Z  2011-12-25            Nat Genet                   A genome-wide association study in Chinese men identifies three risk loci for non-obstructive azoospermia.  Non-obstructive azoospermia            981 Han Chinese ancestry cases, 1,657 Han Chinese ancestry controls          1,946 Han Chinese ancestry cases, 4,077 Han Chinese ancestry controls  1p36.32                     MMEL1, PEX10   rs2477686-?     NR  6E-12               1.39            [1.26-1.52]                            Affymetrix [587,347]  N 
    603  chr1  2452978  2452979  rs12073504  23251661  Comuzzie AG  2012-12-04             PLoS One                       Novel genetic loci identified for the pathophysiology of childhood obesity in the Hispanic population.       Obesity-related traits                                        815 Hispanic children from 263 families                                                                             NA  1p36.32                            PANK4  rs12073504-G  0.033   5E-6  (IGFBP-3 )   0.03  [NR] (ng/mL increase)                              Illumina [899,892]  N 
    604  chr1  2501337  2501338  rs10797432  23128233    Jostins L  2012-11-01               Nature                                Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease.           Ulcerative colitis  Up to 12,924 European ancestry cases, up to 21,442 European ancestry controls  Up to 25,683 European ancestry cases, up to 17,015 European ancestry controls  1p36.32           MMEL1, PLCH2, TNFRSF14  rs10797432-C  0.522  3E-12              1.078          [1.041-1.116]  Affymetrix & Illumina [1.23 million] (imputed)  N 
    604  chr1  2513215  2513216    rs734999  21297633  Anderson CA  2011-02-06            Nat Genet  Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47.           Ulcerative colitis               6,687 European ancestry cases, 19,718 European ancestry controls               9,628 European ancestry cases, 12,917 European ancestry controls  1p36.32  MMEL1, C1orf93, PLCH2, TNFRSF14    rs734999-C   0.52   3E-9               1.05            [1.01-1.09]  Affymetrix & Illumina [~1.1 million] (imputed)  N 
    604  chr1  2526745  2526746   rs3748816  20190752    Dubois PC  2010-02-28            Nat Genet                                              Multiple common variants for celiac disease influencing immune gene expression.               Celiac disease               4,533 European ancestry cases, 10,750 European ancestry controls                4,918 European ancestry cases, 5,684 European ancestry controls  1p36.32                  MMEL1, TNFRSF14   rs3748816-?   0.66   3E-9               1.12            [1.09-1.18]                              Illumina [292,387]  N 
    604  chr1  2553623  2553624   rs3890745  24449572     Orozco G  2014-01-01  Arthritis Rheumatol         Novel rheumatoid arthritis susceptibility locus at 22q12 identified in an extended UK genome-wide association study.         Rheumatoid arthritis                3,034 European ancestry cases, 5,271 European ancestry controls                4,726 European ancestry cases, 2,625 European ancestry controls  1p36.32                         TNFRSF14   rs3890745-?          1E-6               1.18             [1.1-1.27]     Affymetrix & Illumina [1,831,729] (Imputed)  N 
    ...
    

QQ

Show CI "trumpet"

References:

Show 1%ile and 0.1%ile lambda

Show at top right w/ popout

Use a bootstrap data-toggle="collapse".

Split by MAF

CI depends on #pvals, so each MAF-range should have a similar number of pvals.

  1. Sort by MAF.
  2. Make 4 MAF-ranges and their lists of pvals.
  3. For each of those, make QQ points.
  4. Draw them in 4 colors with a legend showing MAF-ranges like "0.1 < MAF < 0.34" (two sigfigs).

Add icd9 codes to PheWAS

This page explains that the file PheWAS_code_translation_v1_2.txt is not using the same phewas-icd9-mapping as we are. We need to use phemap.rda and pheinfo.rda.

Should we still pull our ICD9 strings from the v1.2 file, or look for them somewhere else?

Add VEP annotation

(extracted from another issue)


Sarah:
add an annotation column (for example: intronic; exonic-nonsynonymous; exonic-splice; exonic-synonynmous; intergenic; 5' UTR; 3' UTR)


Peter:

  1. Would you want VEP or snpEff or something else? Is it okay if PheWeb only gives users one choice?

Annotation: I want to be quite careful not to introduce problems into sites.tsv, since matrixify has to parse it. Check version numbers of VEP/snpEff, and then have PheWeb manage everything. If VEP or snpEff can add exactly the annotations I want to a vcf-like file, this should be easy– I just annotate sites.tsv, pass along all columns into augmented_phenos_gz/*.gz and matrix.tsv.gz, convince my API to load and serve the annotations, and set up my default templates to render whatever's available.


Sarah:
VEP only is fine.

Annotation: VEP can write into a VCF file (--vcf flag) http://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_vcf
You can can filter the VCF output http://useast.ensembl.org/info/docs/tools/vep/script/vep_filter.html


Peter:
VEP adds this to the end of INFO:

;CSQ=T|stop_gained|HIGH|CCT8L2|ENSG00000198445|Transcript|ENST00000359963|protein_coding|1/1||||1354|1094|365|W/*|tGg/tAg|||-1||HGNC|15553,

documented by this header:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">

in pairs:

Allele : T
Consequence : stop_gained
IMPACT : HIGH
SYMBOL : CCT8L2
Gene : ENSG00000198445
Feature_type : Transcript
Feature : ENST00000359963
BIOTYPE : protein_coding
EXON : 1/1
INTRON :
HGVSc :
HGVSp :
cDNA_position : 1354
CDS_position : 1094
Protein_position : 365
Amino_acids : W/*
Codons : tGg/tAg
Existing_variation :
DISTANCE :
STRAND : -1
FLAGS :
SYMBOL_SOURCE : HGNC
HGNC_ID : 15553

From this, I would display consequence: stopgained in tables and maybe use IMPACT: HIGH to color Manhattan/LZ. Most of the rest could get dumped at the bottom of the PheWAS page. If I'm happy with the way SYMBOL and Feature are decided, I could use those instead of my gene annotation.

As I read the VEP annotation from a VCF, I'll take the most deleterious consequence and the highest IMPACT. Then assert that CSQ["alt"] == alt.

Later: Figure out whether we can run ensembl-vep on sites.tsv, or if not what I must add. Maybe add an empty INFO column or rename rsids to ID.


Sarah:

add to tables:

  • Beta (discussed above)
  • "VT" SNP or Indel
  • "GENO" 0=imputed; 1=genotyped
  • VEP Impact

add to tooltips:

  • "R2" Imputation quality (or genotyped)
  • VEP Consequence

Get best-other-phenos for arbitrary regions, not just genes

  1. make a file that contains the top 3-N phenotypes for every variant. (including the same information as the current best-phenos-for-gene.json. Tabix it.
  2. make a bed file (hits-1e2.bed) which agglomerates 100 variants at a time. Tabix it.
  3. make a bed file (hits-1e4.bed) which agglomerates 10k variants at a time.
  4. ... continue until a bed file only contains 100 bins.
  5. when someone requests a region from LZ, I traverse down the bed-tree all the way to the variants file collecting all phenotypes that are significant in that region. Maybe they need to include pvalues so that I'll know how to order them?

Copy changes from Encore

  • Finish copying QQ changes
  • Copy some of make_top_hits_json to replace show_gene, which I lost in 029baa8 and 9b23bc8 .
  • Copy QQ js to get this
  • Make Manhattan, QQ, and pheno_top_hits at the same time.
    • in separate classes? or in pheweb/load/summarize/*.py (where pheweb/load/summarize/__init__.py has def run())?

d3-tip tooltips are sometimes in the wrong place

Sometimes the tooltip is at the wrong place by about 20 pixels, but it still shows the correct information. It's not pointing at any other point.

It seems to happen most often to wide tooltips near the right side. The tooltip usually moves right and upward, but not always. It affects points and phenotype labels.

I didn't notice this issue before adding .show(d, this) code.

To reproduce:

  • Hover over a variant at the far left near the bottom, and then over a variant on the far right near the bottom.
  • Hover over a variant at the far right near the bottom, and then over any variant at all, or even a phenotype label.

Once we integrate LocusZoom, we'll want lock-on-click tooltips, which we'll take from LZ, which shouldn't have this problem.

Show GWAS Catalog info on [LZ, PheWAS]

The GWAS Catalog

The GWAS Catalog has 25k entries.
Each entry has:

  • a study name (eg, "Newly identified genetic risk variants for celiac disease related to the immune response.")
  • one of 1400 "Disease/Trait" labels (eg, "Type 2 diabetes", "Systemic lupus erythematosus")
  • a link to an abstract (eg http://www.ncbi.nlm.nih.gov/pubmed/18204446).

Approaches to selecting GWAS Catalog entries for each phewas code:

  1. For each phewas code, manually search for matching GWAS Catalog label.
    • I would need help with this, because I need to understand the diseases.
  2. Then search for matching study names that were missed.

Is it worth automating? Let's do some by hand and then decide.

Goncalo's Approach for BDSI

  1. Filter to rs#s that match nicely between MGI and GWAS Catalog (MAF, etc).
  2. For each phewas_code, find the GWAS Catalog trait that makes our data most likely:

product_{each row}(P(OR_{obs} | OR_{GWAS}, SE)) / the null

Load the data

Data to import

/net/dumbo/home/larsf/PheWAS/PheWAS_code_translation_v1_2.txt (15k lines):

 "icd9"                            "icd9_string"  "phewas_code"            "phewas_string"  "exclude_range"  "rollup_bool"  "Ignore_bool"   "sex"  "pregnancy_related"  "category"      "category_string"
  "001"                                "Cholera"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
"001.0"         "Cholera due to Vibrio cholerae"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
"001.1"  "Cholera due to Vibrio cholerae el tor"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
"001.9"                            "Cholera NOS"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
  "002"         "Typhoid and paratyphoid fevers"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
"002.0"                          "Typhoid fever"        "008.5"      "Bacterial enteritis"     "001-009.99"              1              0  "Both"                    0           1  "infectious diseases"
    ...
"021.1"                      "Enteric tularemia"          "008"     "Intestinal infection"     "001-009.99"              1              0      ""                    0           1  "infectious diseases"
"021.2"                    "Pulmonary tularemia"          480.1      "Bacterial pneumonia"     "480-488.99"              1              0  "Both"                    0           9          "respiratory"
"021.3"               "Oculoglandular tularemia"            369     "Infection of the eye"     "369-374.99"              1              0      ""                    0           7         "sense organs"
"021.8"              "Other specified tularemia"          "041"  "Bacterial infection NOS"     "010-041.99"              1              0      ""                    0           1  "infectious diseases"
"021.9"                          "Tularemia NOS"          "041"  "Bacterial infection NOS"     "010-041.99"              1              0      ""                    0           1  "infectious diseases"
...

/net/dumbo/home/larsf/PheWAS/PheWAS_code_v1_2.txt (1488+1 lines):

phewas_code2  phewas_code         phewas_string      category_string  Cases  Controls  NR
         008            8  Intestinal infection  infectious diseases    460     17893   1
       008.5          8.5   Bacterial enteritis  infectious diseases    244     17893   2
...
       008.6          8.6       Viral Enteritis  infectious diseases    114     17893   5
         010           10          Tuberculosis  infectious diseases     39     16924   7
         038           38            Septicemia  infectious diseases    410     16924  10
...

/net/dumbo/home/larsf/PheWAS/MATCHED/PheWAS_${NR}_MATCHED.epacts:

#CHROM     BEGIN       END                  MARKER_ID    NS      AC  CALLRATE       MAF   PVALUE        BETA    SEBETA      CHISQ  NS.CASE  NS.CTRL   AF.CASE  AF.CTRL
     1   1005806   1005806    1:1005806_C/T_1:1005806  5060  1561.2         1   0.15427  0.48329    -0.06802  0.097689    0.49144      460     4600   0.14605   0.1551
     1   1079198   1079198    1:1079198_T/C_1:1079198  5060    2627         1   0.25959  0.93072  -0.0079779  0.091803  0.0075582      460     4600   0.25836  0.25971
     1   1247494   1247494    1:1247494_T/C_1:1247494  5060  8226.7         1   0.18708  0.51949   -0.059428  0.091832     0.4149      460     4600   0.80563  0.81365
     1   1723031   1723031    1:1723031_G/A_1:1723031  5060  5027.6         1    0.4968  0.12474    -0.10694  0.069707     2.3568      460     4600   0.47283  0.49919
     1   2069172   2069172    1:2069172_C/T_1:2069172  5060  2826.2         1   0.27927  0.45051   -0.058743  0.078132    0.56938      460     4600    0.2685  0.28034
     1   2069681   2069681    1:2069681_C/T_1:2069681  5060  626.91         1  0.061948  0.84678    0.027622   0.14258   0.037335      460     4600  0.063029  0.06184
     1   2205581   2205581    1:2205581_C/A_1:2205581  5060  4262.4         1   0.42119   0.4292   -0.058128  0.073623    0.62498      460     4600   0.40942  0.42237
     1   2387101   2387101    1:2387101_C/T_1:2387101  5060  4937.6         1   0.48791  0.40808   -0.057252  0.069225     0.6844      460     4600    0.4749  0.48921
     1   2392648   2392648    1:2392648_G/C_1:2392648  5060    5017         1   0.49575  0.57311   -0.038945  0.069117    0.31751      460     4600   0.48688  0.49663
...
     1  82074852  82074852  1:82074852_G/A_1:82074852  5060       0         1         0       NA          NA        NA         NA      460     4600         0        0
...

What shall we do about those NAs? Not insert them?

Where should we get colors for each category? I can't find them online. We either make them up, or pull them out of Vanderbilt's code.

Merge py3 into master

Before I can merge py3 into master, I need to:

Speed up LZ

Only about the top 500 variants are useful in a LZ plot. Make /api/region only return the 500 most pvaluable variants. It should also return the p-value of the best variant that it excluded. Then LZ should draw a gray box up to that p-value to show that those variants aren't loaded. That box should be behind the recombination line.

Later, I could add some sort of binning, which would allow LZ plots of several Mb. At that point the genes panel will be too full and the genes API will be the slowest part. I'm not sure what to do about that. Perhaps the variants API should also return the genes, and it can use the variants' pvalues to choose which ones to show? But it can't just choose the closest genes, because that will just make a messy cluster. Could I assign an interestingness score to each gene based on Clinvar/OMIM/ExAC-constraint or something like that?

Improve switching between plots

On Manhattan Plot page when you click a variant have an option to go eitherto the LZ page or to the PheWAS page (to make navigation to the PheWAS plot less clunky).

The Manhattan Plot is too slow

Even with a binning threshold of pval=1e-4, the variant bins are the slow part (~500ms).

Improving the code speeds things up, but I think we're mostly constrained by the cost of drawing ~50k circles in the DOM.

Ways to bin variants:

Option 1: When there's a string of neglog10_pvals, replace them with a [start, end] in the json and a line with stroke-linecap:round in the svg. I suspect that this will eliminate 80% of circles.

Option 2: The circles have radius 2.3. We have ~1000 position bins in a plot >=700px wide. Columns overlap their neighbors at browser widths < 3200px. Exploit that.

Update pheweb's version for each commit

version should always get incremented and end with -{$(git rev-parse --short -q HEAD)}.

Option 1:

Add a .git/hooks/pre-commit hook that updates pheweb/version.py.

Option 2:

make pheweb _r update pheweb/version.py (so that I don't have to install the hook everywhere). Neither of these need to be robust, because I'm the only one who will use them.

Gene Search Table

It seems to be broken. Even for genes with signifiant traits are not coming up in the table. The LZ plot shows, but the table is always empty.

Autocomplete should suggest shortest rather than longest

With marisa_trie, trie.iteritems(key) returns items from longest-to-shortest. I'm using marisa_trie for:

  • rsid (where I mitigate this behavior by manually checking [key] + [f'{key}{suffix}' for suffix in range(10)]
  • chr-pos-ref-alt
  • gene name (where TP53 doesn't make the first page for its own suggestions.)

Example:

import marisa_trie
t = marisa_trie.Trie(['a'*i for i in range(1, 5)] + ['z', 'az', 'za', 'aza'])
t.keys('a')

['a',
'aa',
'aaa',
'aaaa',
'az',
'aza']

I want results sorted by length and then lexicographic:

['a',
'aa',
'az',
'aaa',
'aza',
'aaaa']

Allow conditioning on a variant

Goncalo says that all you need is r between that variant and each other for your data.

Option 1:

Set up an LD server that references the raw data, probably by copying Daniel's HVCF. Having raw data means that security gets complicated, and I think I don't want that.

Option 2: (Probably)

Pre-compute r for all variant pairs within 300kb from the raw data. ie, for each variant, store r for all variants for the next 300kb.

Ways to store it:

  • in a tabixed file containing tab-separated rs until they hit 300kb. Easy to make/store, decent to use, probably just 2 sigfigs, should be smaller than matrix.tsv.gz. So, look up the first variant, and iterate through rs at the same time as iterating through sites.tsv.gz until you hit the second variant.
  • in sqlite3. Two tables, one of [id, chr-pos-ref-alt], another of [variant1_id, variant2_id, r]?

Highlight/label GWAS Catalog hits in LZ

I think I would just display all GWAS Catalog hits that pass some threshold (pval, sample size, &c?).

Would I just highlight any variant at the same position as a GWAS Catalog hit? Include it in their tooltip? Or mark the x-axis with it?

Phenotype labels overlap on PheWAS

Option 1: In ggplot2, ggrepel fixes this. I'll search for something similar for d3, or else try to copy their algorithm.

Option 2: Maybe I can try a simple approach, like moving labels to the left or up or down until they no longer conflict, and drawing a line if the label moves too far from the point.

Option 3: Do simple things, and if it fails just give up and either (a) leave a mess or (b) don't draw the label.

ZNF788 is in `genes.bed` twice, and some variants have it twice.

$ cat ~/.pheweb/cache/genes.bed | cut -f4 | sort | uniq -c | grep -v '^\s*1\b' | wc -l
103
$ wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
$ gunzip -c gencode.v19.annotation.gtf.gz | perl -F'\t' -nale 'print if $F[2] eq "gene"' |  grep -E '; gene_type "(protein_coding|IG_V_gene|TR_V_gene|TR_J_gene|IG_C_gene|IG_D_gene|IG_J_gene|TR_C_gene|TR_D_gene)";' | grep ZNF788
chr19   HAVANA  gene    12203078        12248050        .       +       .       gene_id "ENSG00000214189.4"; transcript_id "ENSG00000214189.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ZNF788"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "ZNF788"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000156416.3";
chr19   ENSEMBL gene    12203078        12225491        .       +       .       gene_id "ENSG00000188474.6"; transcript_id "ENSG00000188474.6"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ZNF788"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "ZNF788"; level 3; tag "ncRNA_host";

Sometimes genes with the same name overlap, but sometimes they don't. Maybe common gene names are just hopeless. Anyways, I made it so that the same gene name will never be displayed twice at the gene-annotation stage.

Improve tables

In the "Top Hits" table, give users the option:
(1) to hide/unhide columns
(2) to sort columns

Also, add the following columns:
(1) add OR/BETA column

API speed

EXPLAIN ANALYZE 
SELECT (pheweb.categories.name, pheweb.phenos.num_cases, pheweb.phenos.num_controls, pheweb.phenos.phewas_code, pheweb.phenos.phewas_string, pheweb.associations.pval) 
FROM pheweb.associations 
JOIN pheweb.phenos ON pheweb.associations.pheno_id = pheweb.phenos.id 
JOIN pheweb.categories ON pheweb.categories.id = pheweb.phenos.category_id 
WHERE variant_id = 1 
ORDER BY pheweb.phenos.phewas_code;

 Sort  (cost=574520.98..574524.62 rows=1458 width=83) (actual time=2312.447..2312.610 rows=1488 loops=1)
   Sort Key: phenos.phewas_code
   Sort Method: quicksort  Memory: 256kB
   ->  Nested Loop  (cost=0.43..574444.36 rows=1458 width=83) (actual time=0.065..2310.648 rows=1488 loops=1)
         ->  Nested Loop  (cost=0.28..574134.89 rows=1458 width=59) (actual time=0.062..2308.497 rows=1488 loops=1)
               ->  Seq Scan on associations  (cost=0.00..573274.20 rows=1458 width=16) (actual time=0.058..2304.954 rows=1488 loops=1)
                     Filter: (variant_id = 1)
                     Rows Removed by Filter: 26240362
               ->  Index Scan using pheno_pkey on phenos  (cost=0.28..0.58 rows=1 width=59) (actual time=0.002..0.002 rows=1 loops=1488)
                     Index Cond: (id = associations.pheno_id)
         ->  Index Scan using category_pkey on categories  (cost=0.15..0.20 rows=1 width=40) (actual time=0.001..0.001 rows=1 loops=1488)
               Index Cond: (id = phenos.category_id)
 Planning time: 0.252 ms
 Execution time: 2312.732 ms
(14 rows)

The set of genes in the annotations and the genes panel are different

Why they differ

Eg, http://pheweb.sph.umich.edu/region/008/1:523918-923918 shows FAM87B, but my the tooltips don't, because we only include protein_coding and {IG,TR}_{C,D,J,V}_gene (immunoglobulin and T-cell receptors). (Gene biotypes are defined here)

The genes panel is fine as it is

LZ folks said I could add a gene_biotype column to the genes postgres table and figure out how to get that into the response for the genes panel tooltips. Or I could set up my own genes API.

TODO:

  • add a little question-mark-tooltip to the definition-list on the variant page explaining that this only counts genes that are protein-coding or immunoglobulin or T-cell receptor (and never pseudo).

  • Replace nearest_genes with genes, which is a json-serialized list of strings like "coding in PCSK9" or "3'UTR in PCSK9" or "3kb upstream of PCSK9". (should these be in a machine-readable format? strings will work for now.) Always include all coding and UTR entries and up/downstream < 1kb, but if none of those are found, include the nearest gene <1Mb.

Gzip outgoing data

With the pval_threshold 1e-4 and bin sizes 0.05 and 3e6, the Manhattan Plot data is ~1MB. (2/3 unbinned_variants and 1/3 variant_bins.) Gzipped, it's <100KB. (4/5 unbinned_variants and 1/5 variant_bins.)

Option 1: Just use Flask-Compress.

Option 2: For maximum speed, we'll pre-gzip (-9!) our static files and directly serve them. The headers for gzipped data can be found at http://flask.pocoo.org/snippets/122/. https://spoqa.github.io/flask-s3/_modules/flask_s3.html might also be interesting. But how do I attach headers with send_from_directory or send_file (which it uses)? They return responses, right? Be sure to pass a long cache_timeout and to handle the mimetype guessing.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.