neurogenomics / rare_disease_celltyping Goto Github PK

View Code? Open in Web Editor NEW

6.0 3.0 0.0 510.66 MB

Code, data and results associated with the "Rare diseases cell-typing" project.

R 0.15% HTML 80.18% JavaScript 4.69% Lua 0.01% TeX 14.24% Jupyter Notebook 0.75%

rare-disease cell-type-deconvolution genetics enrichment-analysis

rare_disease_celltyping's Introduction

Identification of cell types underlying thousands of rare diseases and subtraits

Kitty B. Murphy, Bobby Gordon-Smith, Jai Chapman, Momoko Otani, Brian M. Schilder, Nathan G. Skene

Introduction

Rare diseases (RDs) are uncommon as individual diagnoses, but as a group contribute to an enormous disease burden globally. However, partly due the low prevalence and high diversity of individual RDs, this category of diseases is understudied and under-resourced. The advent of large, standardised genetics databases has enabled high-throughput, comprehensive approaches that uncover new insights into the multi-scale aetiology of thousands of diseases. Here, using the Human Phenotype Ontology (9,677 annotated phenotypes) and multiple single-cell transcriptomic atlases (77 human cell types and 38 mouse cell types), we conducted >688,000 enrichment tests (x100,000 bootstrap iterations each) to identify >13,888 genetically supported cell type-phenotype associations.

This repository contains the data and code needed to replicate the analyses in our preprint [insert link to preprint], as well as links to the R packages required (see below).

KGExplorer: Imports and analyses large-scale biomedical knowledge graphs and ontologies.
HPOExplorer: Contains extensive functions for easily importing, annotating, filtering, and visualising the Human Phenotype Ontology (HPO) at the disease, phenotype, and gene levels.
MSTExplorer: Systematically identifies, prioritises, and visualises cell-type-specific gene therapy targets across the phenome.

rare_disease_celltyping's People

Contributors

Stargazers

Watchers

rare_disease_celltyping's Issues

Figures must be legible when reduced to 5.5cm or 12cm wide, with resolution of 150 to 300 dots per inch (dpi)

See section under "Preperation of Figures" at https://www.science.org/content/page/instructions-preparing-initial-manuscript

For the recurrent infection plots, set scale_y=free

It will help shrink the figure smaller

Make figures / text based around the cell type specificity of obvious facial markers visible from birth

Jai had the figure showing that leaf nodes of HPO are more specific. Bobby to add an extra one showing the leaf nodes are more likely to only have one cell type association.

This principle can be followed through to the idea that facial morphological defects should be cell type specific.

E.g. there is a broad "abnormal lip morphology" which is associated with excitatory, inhibitory neurons, astrocytes and amacrine cells.

But a subphenotype of that, "short philtrum" is specific to excitatory neurons.

And a subphenotype "cleft upper lip" is astrocyte specific.

Can suggest then that obvious facial defects can be used to gain insight into the underlying cell types.

Drop files that are now outdated / no longer required

Drop code files which are no longer used, e.g. process_hpo_to_genelists.r. That file is replaced by process_hpo_to_genelists-wholebody.r... we want the repo to be neat and tidy and only have the latest version of files.

If it is not called by the makefile, remove it (if outdated) or move it to a new repo (if it just doesn't belong there anymore)

Assess pLI in HPO genes

Originally discussed here:

neurogenomics/MSTExplorer#16 (comment)

We should asses the relationship between HPO genes and probability of Loss Intolerance (pLI) scores.

Questions

Do HPO genes have greater pLI than randomly sampled background genes from the human genome?
Is there a relationship between cell-type specificity scores (from our CTDs) and pLI?
Do phenotypes with stronger cell type enrichment (in terms of fold-change) tend to have genes with greater pLI scores?
Is there a relationship between pLI and GenCC evidence scores? (use HPOExplorer::get_gencc() to get the GenCC data)

@KittyMurphy it would be great if you could start exploring these questions and upload them as rmarkdown reports in this repo.
Perhaps start by just uploading a script showing me how to get the pLI scores for all genes, so I can start exploring this a bit as well.

Also just came across this paper:

https://onlinelibrary.wiley.com/doi/10.1002/humu.23763

Update website with the results from the new scRNA-seq datasets

How do we deal with having two scRNA-seq datasets?

Improve function coding practices

This is more of a personal preference, but i find it much easier to read functions when the arguments are separated by newlines.

Also, it's good practice to make all function arguments explicit, bc if the order of the args ever changes, the function won't break.

Lastly, it's not essential to for every single function, but I think including @examples for a greater proportion of functions would be helpful.

Setup the document so that the text can be edited in authorea + figures regenerated from a markdown file

Basic idea for this is described here:
https://www.authorea.com/users/2191/articles/278550-a-step-by-step-guide-to-using-authorea-github-and-jupyter-for-reproducible-dynamic-collaborative-data-science-in-public-health

This is the best guide for how it can be done that I've seen:
https://21docs.com/users/244977/articles/324602-how-to-combine-r-with-authorea-to-collaboratively-write-reproducible-scientific-papers

Authorea allows collaborative editing of the markdown document + references

Authorea documents can be linked to a github repo

Then we keep a markdown document within the same repo. This is used to (re)generate all the figures as required. The markdown document probably wouldn't perform the whole (computationally intensive) analysis, it would be for e.g. generating figures. The figures are then saved as *.jpeg and read into the Authorea document.

Remove CTD_code for processing the Trapnell datasets to another repository

This repo should be specifically for running the rare disease analysis, and should not have preprocessing code for cell type datasets

Graphs showing relationship between celltype enrichment strengths + ontology level, should use violin plots

Currently these just show the mean value. Would be better to show the distribution at each level with violin plots.

Create a makefile that runs the whole analysis from scratch

(1) Extract HPO gene sets to gene lists using process_hpo_to_genelists-wholebody.r

(1) Run code in Results-Generation-100000-MC-WithSaving.rmd

(2) Knit rare_disease_ewce/Markdown_Draft.Rmd at master · neurogenomics/rare_disease_ewce

Find regulator genes

@Al-Murphy MultiEWCE::prioritise_targets could be further extended by identifying key regulators of gene sets. This way, we can target all candidate genes by targeting a single regulator gene:

Useful tools:

@Al-Murphy let me know if you have any other suggestions!

Document all the functions in Roxygen style https://github.com/ovrhuman/RD_EWCE_Website_and_apps

In Fig 3a,b,c, remove the "other" line... it isn't required and isn't really in contrast to the phenotypes

Merge dev_descartes into master (this has Bobby's make file & CTD dir removed etc)

Add the markdown document of MSc Report to the github repo so we can start editing that

Show that cell type enrichments become purer as you move towards leaf nodes

Currently we hint at this but don't actually show it

Could you add to this row a plot with the distribution (violin plot) for number of associated celltypes at each ontology level?

Cleanly separate functions from scripts + ensure all functions pass dev::check()

At the moment some function files have scripts at the top: remove this code, and have a master script/markdown file which calls functions. E.g.

Improve app speed

I'm combing through the code of all the apps and looking for places where we could improve speed.

Here's a couple of things I'll be on the lookout for:

Use data.table wherever possible (much more efficient than standard data.frame)
Onlylibrary(pkg) packages that you need all functions from. Otherwise use pkg::function in the code itself, and/or @importfrom pkg function in the Roxygen notes.
Remove variables when they're no long being used.
Replace for loops with lapply or mclapply.
Introduce parallelization where possible.
Pre-calculate results as much as possible.
Look for functions that could be replaced with more efficient versions.

I'll work on the bschilder_dev branch and then we can merged afterwards.

Does the likelihood of a phenotype being associated with multiple cell types increase as you move up the HPO ontology?

Jai created this figure:

It shows that at the leaves of the network, enrichments are more significant. I'd bet that they get more significant because they are not split across multiple cell types. E.g. "morphology of the nervous system" is probably associated with oligos, astrocytes, microglia etc... but a leaf node of that would only be associated with oligodendrocytes.

Assess our results against known phenotype-celltype links

The Monarch knowledge graph (KG) include well-established links between phenotypes/diseases and celltypes.
Of those, 103 are phenotypes within the HPO with associations across 79 different cell types. We can use this to validate our own MultiEWCE enrichment results. However, only 64 phenotypes were included in our results, likely due to the lack of >=4 genes for certain phenotypes.

kg_res.csv

See the rendered validation report here.
https://neurogenomics.github.io/rare_disease_celltyping/code/validation

Programmatic checking

To do this programmatically, we need the following in both the KG and the MultiEWCE enrichment results to have the following:

HPO IDs ✅ : already present in both
Cell Ontology IDs ❌ : present in KG, but not our enrichment results, since we used the freeform annotations provided by the authors.
- As it turns out, CellxGene has since provided harmonised versions of both DescartesHuman and HumanCellLandscape, complete with Cell Ontology (CL)-aligned annotations (alongside the freeform ones).
Cell Ontology IDs at the same level ❌ :
- That said, even with the CL Ids, the annotations are not guarantees to be defined at the same level within the ontology (e.g. "neurons" vs. "interneurons"). There is a package by Irene Papatheodorou's lab called scOntoMatch that is designed to make sets of CL IDs comparable by finding their shared common ancestors.
- After trying this a number of different ways, I had trouble getting too far up the CL ontology to the point of many annotations becoming meaningless (e.g. "stellate cell" --> "cell"). I may be missing something, and the tool seems conceptually useful, so I'll reach out to the package author to see if i can get some more info.
- Another issue with this approach is that the CL doesn't seem to always consider developmental precursors (hematopoetic stem cells) in addition to ontological ancestors.

Manual checking

Thus, I ultimately took an approach where I merged the KG with our significant results (DecartesHuman + HumanCellLandscape; FDR<5%), joining on the HPO ID column only. I then manually compared the KG celltype annotations with those in our results, looking up cell type definitions where needed.

After manual inspection I found that 65.6% (42/64) of known causal phenotype-cell type associations were recapitulated by our MultiEWCE results. In many instances, our results further resolved the specific cell subtype(s) underlying the phenotype (e.g. "neuron" to "visceromotor neuron", or "muscle cell" to "ventricular cardiac muscle cell").

For several KG cell types (sperm, germ cell), we didn't have an equivalent celltype in either of our CTD references and thus could not compare our results.

What's up with these missing phenotype-celltype associations?

Strangely, almost all of the known phenotype-celltype associations we missed are the most obvious ones!
Things like

HP:0100494 | Abnormal mast cell morphology (HPO)
HP:0100705 | Abnormal glial cell morphology (HPO)
HP:0012757 | Abnormal neuron morphology (HPO)
etc. see here for the full list: kg_missing.csv

Some possible reasons

The way our specificity matrix is constructed, we may be missing some of the markers for neurons in general (instead, focusing on neuronal subtype markers). But as both our CTDs are multi-system atlases, I'd expect that kind of effect to be attenuated.
Incomplete genes lists in the HPO.
Stringent multiple testing correction may have killed some weaker signals (tho not sure why these would be quite so weak). All 22 missing phenotypes-associations are significant at an uncorrected p-value (<0.05). Though I'd expect the enrichment to be quite strong for these phenotypes.

But are we simply recalling these associations due to massive amounts of testing?

It seems not! Using FDR<5% seems to be sufficient to keep the number of enriched celltypes per phenotype reasonable. In the KG annotations they have 1 cell type per phenotype. In ours, we have a median of 4 cell types per phenotype. DescartesHuman has 77 unique celltypes while HCL has 124, for a total of 201 cell types that we tested across all phenotypes. If we assume a single "correct" celltype per phenotype (from the KG annotations), this would mean we have a 4/201 (2%) probability of picking the correct celltype by chance.

Now, one might argue this isn't quite right since multiple celltypes can count as the "correct" celltype, including a direct match , a developmental precursor, or an ontological ancestor. But even if you triple our number of "draws", that would still only be 12/201 (6%) probability of getting the correct celltype by chance.

Conclusions

I think this provides some decent evidence to our claim that our enrichment results represent the real causal celltypes underlying these phenotypes. But the imperfect recall rate (especially for obvious associations) leaves some concerns.

Look into use of Snakemake for the make file: could it also be used in place of the EWCEwrapper functions?

https://snakemake.readthedocs.io/en/stable/tutorial/basics.html

Are congenital phenotypes more enriched for fetal cell types?

Are congenital phenotypes are more strongly enriched for fetal cell types than congenital phenotypes?
- Strategy: Identify cell types that only occur in fetal samples (and not adult samples)
Are congenital phenotypes are more strongly enriched for fetal cell types vs. adult cell types?
- Strategy: Identify cell types that occur in both fetal and adult samples, and compare their relative enrichment levels in congenital vs. non-congenital disease.

We can assess these questions with the Human Cell Landscape as it contains both fetal and adult cell types.

We can use the "congenital_onset" column from GPT annotations to categorise each phenotype.
neurogenomics/RareDiseasePrioritisation#19

Make this plot with log(fold change) to reduce effect of outliers

Group plots into larger figures

What do you think of the rough grouping i have here, and what ones should i exclude?

https://github.com/neurogenomics/rare_disease_ewce/blob/master/figure_group_planning.md

@NathanSkene @bschilder

Create plot showing number of cell type enrichments associated with phenotypes that are descendants of "recurrent infections" HP:0002719

Then add this to the markdown.

After we'll want similar plots for e.g.:

https://hpo.jax.org/app/browse/term/HP:0025015 (abnormal vasculature)

Remove datasets like `GeneLists-WholeBody` from the repo as it should be possible to easily regenerate from scratch

Engage with rare disease community

Identify ways to share our findings, and get members of the rare disease community to engage with our resources (e.g. web apps).

Stakeholders

Researchers
Healthcare providers (doctors, nurses, physician assistants)
Rare disease patients
Family of rare disease patients

Rare disease networks

Nord
Rare Disease UK
Genetic and Rare Diseases (GARD) / NIH
Rare Diseases International
The UK Rare Diseases Framework
NIH AMP: Bespoke Gene Therapy Consortium
Lupus UK
Foundation for ARID1B
Parkinson's UK (for monogenic forms of PD)
The British Society for Genetic Medicine
Rare Trait Hope Fund

Contacts

Caroline Olshewsky: head of Lupus UK and Foundation for ARID1B. Met her at Crick Institute RD conference and she was very excited to discuss our results.

Get the plots for Heart, Circulating and Glucose plotting on the same axes, with the bars dodged in different colours

Write introduction section for the markdown report

Add to report table showing phenotypes most associated with T-cells that are not significant at all (even without BH correction) in B-cells

and vica-versa

Work with Alan Murphy to move `RunEWCE` and `EWCEWrapper` into the main EWCE package

Currently they are within https://github.com/neurogenomics/rare_disease_ewce/blob/master/TabulaMuris_Results/Results-Generation-100000-MC-WithSaving.R

Identify variant-level mechanisms of each rare disease

We can use the Monarch knowledge graph to extract info on how each gene is linked to each rare disease phenotype; i.e. missense mutation, splicing mutations, disrupted promoter, methylation, etc.

Move the HPO input file to a permanent DOI'd address using figshare

I believe the current input file is:

http://purl.obolibrary.org/obo/hp/hpoa/genes_to_phenotype.txt

This has moved several times. We don't want to have to keep updating the code every time they change the location / format.

So upload it to figshare, and use the DOI'd address throughout this repo

Cut the words in the paper back to 2500 to fit with requirements for science report submission

Here's the word limit info: https://www.science.org/content/page/science-information-authors

Reports (up to ~2500 words including references, notes and captions–corresponds to ~3 printed pages in the journal) present important new research results of broad significance. Reports should include an abstract, an introductory paragraph, up to four figures or tables, and about 30 references. Materials and Methods should be included in supplementary materials, which should also include information needed to support the paper's conclusions.

Create a markdown file that summarises some of the interesting files you've generated to date

Show plots for celltypes associated with phenotypes containing:

Heart
Circulating
Glucose

Compare these to celltypes associated with all phenotypes

Try 'pruning' the terms associated with 'Pancreatic Stellate Cells'... if all leaf nodes of a parent are enriched, then drop the leaf terms. Then we can see a cleaner list of what phenotypes are associated with it

It's interesting because they have more phenotypes associated with them than any other cell type...

But they are not associated with 'abnormality of the pancreas' (though all other major classes of pancreatic cell type are associated)

Without pruning the phenotypes associated with it just look like a random mixed bag

Currently when looping through the phenotypes, it crashes and has to be restarted whenever there is an error (e.g. not enough genes in the list). Add try-catch statements and make a log record.

Describe Human Cell Landscape CTD levels

Script to preprocess the HCL here:
https://github.com/neurogenomics/model_celltype_conservation/blob/4bcd4d569775a5f17c70ff314ad68accc9e47e4b/scripts/prepare_HumanCellLandscape.R#L52C2-L60C47

Define levels

The main thing to note is that I used the following combinations of metadata variables from the original study to create 5 levels:

> save_path="/shared/bms20/projects/model_celltype_conservation/raw_data/scRNAseq/standardized/HumanCellLandscape.orth.h5ad"
> adat2 <- prepare_metadata(adat = adat,
>                           save_path = save_path,
>                           species = "human",
>                           level_cols = list(level1="tissue",
>                                             level2="celltype",
>                                             level3=c("stage","celltype"),
>                                             level4=c("tissue","celltype"),
>                                             level5=c("tissue","celltype","cluster")
>                                             )

Assess preprocessed HCL scRNA-seq data

> adat
AnnData object with n_obs × n_vars = 599926 × 27341
    obs: 'batch', 'tissue', 'n_genes', 'n_counts', 'id', 'cluster', 'stage', 'donor', 'celltype', 'cellid', 'level1', 'level2', 'level3', 'level4', 'level5', 'species'
    var: 'n_cells'
    uns: 'levels_key'

Check each level

Within each level (the metadata column i use to generate the CTD aferwards), asses the number of unique values and show 10 example values.

> lapply(grep("^level",names(adat$obs),value = TRUE), function(x)head(unique(adat$obs[[x]]),10))
[[1]]
 [1] AdultAdipose         AdultAdrenalGland    AdultArtery          AdultAscendingColon  AdultBladder        
 [6] AdultBoneMarrow      AdultCerebellum      AdultCervix          AdultTransverseColon AdultDuodenum       
59 Levels: AdultAdipose AdultAdrenalGland AdultArtery AdultAscendingColon AdultBladder ... Placenta

[[2]]
 [1] Macrophage             Monocyte               Stromal cell           Smooth muscle cell     Dendritic cell        
 [6] B cell (Plasmocyte)    Erythroid cell         T cell                 Endothelial cell (APC) Fibroblast            
63 Levels: AT2 cell Adrenal gland inflammatory cell Antigen presenting cell (RPS high) Astrocyte ... hESC

[[3]]
 [1] Adult.Macrophage             Adult.Monocyte               Adult.Stromal cell           Adult.Smooth muscle cell    
 [5] Adult.Dendritic cell         Adult.B cell (Plasmocyte)    Adult.Erythroid cell         Adult.T cell                
 [9] Adult.Endothelial cell (APC) Adult.Fibroblast            
124 Levels: Adult.AT2 cell Adult.Antigen presenting cell (RPS high) Adult.Astrocyte ... HESC.hESC

[[4]]
 [1] AdultAdipose.Macrophage             AdultAdipose.Monocyte               AdultAdipose.Stromal cell          
 [4] AdultAdipose.Smooth muscle cell     AdultAdipose.Dendritic cell         AdultAdipose.B cell (Plasmocyte)   
 [7] AdultAdipose.Erythroid cell         AdultAdipose.T cell                 AdultAdipose.Endothelial cell (APC)
[10] AdultAdipose.Fibroblast            
1361 Levels: AdultAdipose.B cell (Plasmocyte) AdultAdipose.Dendritic cell ... Placenta.hESC

[[5]]
 [1] AdultAdipose.Macrophage.cluster2             AdultAdipose.Monocyte.cluster13             
 [3] AdultAdipose.Stromal cell.cluster27          AdultAdipose.Smooth muscle cell.cluster42   
 [5] AdultAdipose.Dendritic cell.cluster22        AdultAdipose.B cell (Plasmocyte).cluster3   
 [7] AdultAdipose.Erythroid cell.cluster75        AdultAdipose.T cell.cluster6                
 [9] AdultAdipose.Endothelial cell (APC).cluster8 AdultAdipose.Fibroblast.cluster18           
1864 Levels: AdultAdipose.B cell (Plasmocyte).cluster3 ... Placenta.hESC.cluster87

Add some graphs showing the range of ontology terms associated with some cell types

E.g. at the moment there are the plots showing that cardiomyocytes are most assocaited with cardiovascular terms

Jai also tinkered with these plots for breaking down what the 'Other' terms are... for this example it nicely highlighted that they still make good sense

Replace rda files with rds

rds file are preferable bc you know exactly what the variable name will be.

Add to the markdown report result about leaf nodes having more significant enrichments than parent nodes

The total number of phenotype-cell associations w/ p<0.05 in the whole results dataframe is 5509, of which 1,113 are leaf terms (if this worked)
and just because this could be interesting, the average Z score for significant phenotype-cell associations for non-leaf terms is ~2.7, whereas for the leaf terms it's ~5.14

Setup a better project folder structure

Read e.g.

https://www.rostrum.blog/2019/06/11/r-repo-template/

Organizing your projects - Reproducible research

Rearrange accordingly

There's a list of ~400 phenotypes that didn't run succesfully. Bobby thinks it's because of e.g. hyphens in the name. Need to figure out why this was and get them working.

@ovrhuman can you upload the list here

See if there is a way of plotting the ontology associated with a cell type, e.g. show all nodes that are significantly enriched, and the relationship between them

This would be good if we wanted to build a website showing the results associated with each cell type

Try using this package: https://cran.r-project.org/web/packages/ontologyPlot/vignettes/plotting-ontological-terms.html

Regenerate manuscript figures with new results

Original figures

Code for generating figures:
https://github.com/neurogenomics/rare_disease_celltyping/blob/f7701e7e9f859a947d5d83cb5c62cc3b4afaf80b/figures/manuscript_figures_july_2022.Rmd

Fig 1

https://github.com/neurogenomics/rare_disease_celltyping/blob/f7701e7e9f859a947d5d83cb5c62cc3b4afaf80b/figures/manuscript_fig_1.pdf

Fig 2

https://github.com/neurogenomics/rare_disease_celltyping/blob/f7701e7e9f859a947d5d83cb5c62cc3b4afaf80b/figures/manuscript_fig_2.pdf

Fig 3

https://github.com/neurogenomics/rare_disease_celltyping/blob/f7701e7e9f859a947d5d83cb5c62cc3b4afaf80b/figures/manuscript_fig_3.pdf

Recreate entire study in one Rmarkdown

Recreate entire Rare Disease Celltyping study (analyses, results, plots) in one Rmarkdown.

This will incorporate elements of:

Cell_select_interactive: `Error:`loops` is `FALSE`, but `x` contains loops. `

When i launch the Cell_select_interactive shiny app locally this error appears. Have you encountered this before @ovrhuman ?:

I'm running Rstudio within a docker container, so i'm viewing the ap through a Chrome tab.

Session info

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] jsonlite_1.7.2       httr_1.4.2           sna_2.6              statnet.common_4.5.0 ggnet_0.1.0          cowplot_1.1.1       
 [7] stringr_1.4.0        ontologyPlot_1.6     ontologyIndex_2.7    Rgraphviz_2.37.2     graph_1.71.2         BiocGenerics_0.39.1 
[13] DT_0.18              plotly_4.9.4.9000    shinythemes_1.2.0    shiny_1.6.0          reshape2_1.4.4       plyr_1.8.6          
[19] network_1.17.1       viridis_0.6.1        viridisLite_0.4.0    ggnetwork_0.5.10     ggplot2_3.3.5       

loaded via a namespace (and not attached):
 [1] fs_1.5.0            usethis_2.0.1       devtools_2.4.2      rprojroot_2.0.2     tools_4.1.0         bslib_0.2.5.1      
 [7] utf8_1.2.2          R6_2.5.0            DBI_1.1.1           lazyeval_0.2.2      colorspace_2.0-2    withr_2.4.2        
[13] tidyselect_1.1.1    gridExtra_2.3       prettyunits_1.1.1   processx_3.5.2      curl_4.3.2          compiler_4.1.0     
[19] cli_3.0.1           desc_1.3.0          sass_0.4.0          scales_1.1.1        callr_3.7.0         digest_0.6.27      
[25] pkgconfig_2.0.3     htmltools_0.5.1.1   sessioninfo_1.1.1   fastmap_1.1.0       htmlwidgets_1.5.3   rlang_0.4.11       
[31] rstudioapi_0.13     jquerylib_0.1.4     generics_0.1.0      paintmap_1.0        crosstalk_1.1.1     dplyr_1.0.7        
[37] magrittr_2.0.1      Rcpp_1.0.7          munsell_0.5.0       fansi_0.5.0         lifecycle_1.0.0     stringi_1.7.3      
[43] yaml_2.2.1          pkgbuild_1.2.0      promises_1.2.0.1    crayon_1.4.1        lattice_0.20-44     ps_1.6.0           
[49] pillar_1.6.2        stats4_4.1.0        pkgload_1.2.1       glue_1.4.2          data.table_1.14.0   remotes_2.4.0      
[55] BiocManager_1.30.16 vctrs_0.3.8         httpuv_1.6.1        testthat_3.0.4      gtable_0.3.0        purrr_0.3.4        
[61] tidyr_1.1.3         assertthat_0.2.1    cachem_1.0.5        mime_0.11           xtable_1.8-4        coda_0.19-4        
[67] later_1.2.0         tibble_3.1.3        memoise_2.0.0       ellipsis_0.3.2

`phenomix`: Exploring more efficient methods for celltype enrichment

@NathanSkene @KittyMurphy @Al-Murphy

While I was waiting for the latest phenotype-celltype enrichment results to finish running on HPC, I starting exploring whether we could run phenotype-celltype association tests in a more efficient manner. This was implemented within my package phenomix.

Short answer: yes, I was able to run all HPO x Descartes celltype enrichment tests but orders of magnitude faster using only my laptop.

All code to run the tests and benchmarking can be found here. Will generate a proper HTML report with figures as well.

Approach

In this approach, I:

use the GenCC evidence scores for each disease-gene link-
merge that with the disease-phenotype-gene data in HPO
construct a gene x phenotype matrix (filled with the GenCC scores)
run regression on every combination of phenotype and celltype in the level 2 specificity matrix of the Descartes CTD (via phenomix::iterate_lm)
apply multiple-testing correction

Fundamentally, the biggest differences between MultiEWCE and phenomix are:

MultiEWCE treats it as a gene set enrichment problem, while phenomix treats it as a linear regression problem.
MultiEWCE uses bootstrapping (100,000 iterations per phenotype-celltype test), while phenomix only needs to run one rep per phenotype-celltype test.

Speed

`MultiEWCE`:

N tests: 6916 phenotypes x 77 celltypes (532,532 tests)
Time: Completed overnight (don't have a precise estimate)
Resources: 6916 array subjobs with 4 cores (27,664 cores total) and 128Gb mem per subjob

`phenomix`:

N tests: 10,948 phenotypes x 77 celltypes (844,613 tests)
Time: 30.5 min
Resources: Local Macbook Pro, paralellised across 10 cores. Only ever used a small fraction of my 64Gb of memory.

Benchmarking.

I benchmarked these new results against Bobby's old MultiEWCE results. Will repeat once i finish the updated MultiEWCE tests using the fixed Descartes CTD (and the Human Cell Landscape CTD).
In the end, there is 96.66% concordance between the top most enriched celltype (the one with the highest fold-change after filtering by q<0.05 within the MultiEWCE results) between MultiEWCE and my new method (implemented in phenomix).
Will need to explore exactly how much the GenCC scores help these analyses. ie. does making the gene links binary degrade performance substantially? (1=associated with the disease/phenotype, 0=not).

Conclusion

This new method will still require more benchmarking to ensure it's robust, and I think for the sake of the rare disease celltyping paper we should stick to our original MultiEWCE approach (partly for the sake of not having to rewrite it..). But wanted to document this here to let us know we have a new approach in our back pocket.

How to plot a figure showing that most phenotypes associated with 'cardiac muscle cells' are associated with heart phenotypes

Plot fraction of the hits that are from 'abnormality of the heart' for cardiac muscle cells.

E.g. if the highest z-score is 25, and that is from that term, then fraction =1