The Monarch knowledge graph (KG) includes well-established links between phenotypes/diseases and cell types.
Of those, 103 are phenotypes within the HPO, with associations across 79 different cell types. We can use these to validate our own MultiEWCE enrichment results. However, only 64 of these phenotypes were included in our results, likely because the remaining phenotypes have fewer than 4 associated genes.
kg_res.csv
See the rendered validation report here: https://neurogenomics.github.io/rare_disease_celltyping/code/validation
Programmatic checking
To do this programmatically, both the KG and the MultiEWCE enrichment results need to have the following:
- HPO IDs ✅ : already present in both
- Cell Ontology IDs ❌ : present in the KG but not in our enrichment results, since we used the freeform annotations provided by the authors.
  - As it turns out, CellxGene has since provided harmonised versions of both DescartesHuman and HumanCellLandscape, complete with Cell Ontology (CL)-aligned annotations (alongside the freeform ones).
- Cell Ontology IDs at the same level ❌ :
  - That said, even with the CL IDs, the annotations are not guaranteed to be defined at the same level within the ontology (e.g. "neurons" vs. "interneurons"). There is a package by Irene Papatheodorou's lab called scOntoMatch that is designed to make sets of CL IDs comparable by finding their shared common ancestors.
  - After trying this a number of different ways, I kept ending up too far up the CL hierarchy, to the point where many annotations became meaningless (e.g. "stellate cell" --> "cell"). I may be missing something, and the tool seems conceptually useful, so I'll reach out to the package author for more information.
  - Another issue with this approach is that the CL doesn't always seem to consider developmental precursors (e.g. hematopoietic stem cells) in addition to ontological ancestors.
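To illustrate the "too far up the ontology" problem, here is a minimal sketch of a nearest-shared-ancestor lookup on a toy hierarchy. This is not how scOntoMatch is implemented, and the terms and edges in `TOY_PARENTS` are made up for illustration rather than taken from the real Cell Ontology.

```python
from collections import deque

# Toy is_a hierarchy (child -> parents). Hypothetical edges for illustration
# only; they are NOT copied from the real Cell Ontology.
TOY_PARENTS = {
    "interneuron": ["neuron"],
    "visceromotor neuron": ["neuron"],
    "neuron": ["neural cell"],
    "neural cell": ["cell"],
    "stellate cell": ["cell"],
    "cell": [],
}


def ancestor_distances(term, parents):
    """Map the term and each of its ancestors to the shortest number of steps up."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def nearest_common_ancestor(a, b, parents):
    """Return the shared ancestor with the smallest combined distance, or None."""
    dist_a = ancestor_distances(a, parents)
    dist_b = ancestor_distances(b, parents)
    shared = set(dist_a) & set(dist_b)
    if not shared:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    return min(shared, key=lambda t: (dist_a[t] + dist_b[t], t))


for pair in [("interneuron", "neuron"), ("stellate cell", "interneuron")]:
    print(pair, "->", nearest_common_ancestor(*pair, TOY_PARENTS))
```

In this toy hierarchy, closely related terms resolve to a useful shared term, while terms on distant branches only meet at the root, which is the kind of uninformative match described above.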
Manual checking
Thus, I ultimately merged the KG with our significant results (DescartesHuman + HumanCellLandscape; FDR<5%), joining on the HPO ID column only. I then manually compared the KG cell type annotations with those in our results, looking up cell type definitions where needed.
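The merge step can be sketched roughly as follows. This is a stdlib-only illustration, not the code used for the analysis: the column names (`hpo_id`, `kg_celltype`, `celltype`, `q`) are assumptions, and the rows are made-up toy data with placeholder IDs.

```python
from collections import defaultdict

# Toy inputs with hypothetical column names and placeholder HPO IDs.
kg_rows = [
    {"hpo_id": "HP:TOY0001", "kg_celltype": "neuron"},
    {"hpo_id": "HP:TOY0002", "kg_celltype": "muscle cell"},
    {"hpo_id": "HP:TOY0003", "kg_celltype": "mast cell"},
]
result_rows = [
    {"hpo_id": "HP:TOY0001", "celltype": "visceromotor neuron", "q": 0.001},
    {"hpo_id": "HP:TOY0001", "celltype": "hepatocyte", "q": 0.20},
    {"hpo_id": "HP:TOY0002", "celltype": "ventricular cardiac muscle cell", "q": 0.01},
    {"hpo_id": "HP:TOY0003", "celltype": "mast cell", "q": 0.30},
]


def merge_on_hpo(kg_rows, result_rows, fdr=0.05):
    """Left-join KG rows to significant enrichment results on HPO ID only."""
    significant = defaultdict(list)
    for row in result_rows:
        if row["q"] < fdr:  # keep only results passing the FDR threshold
            significant[row["hpo_id"]].append(row["celltype"])
    return [
        {
            "hpo_id": kg["hpo_id"],
            "kg_celltype": kg["kg_celltype"],
            "enriched_celltypes": sorted(significant.get(kg["hpo_id"], [])),
        }
        for kg in kg_rows
    ]


merged = merge_on_hpo(kg_rows, result_rows)
for row in merged:
    print(row)
```

Each merged row puts the KG cell type next to the enriched cell types for the same phenotype, which is the side-by-side view needed for manual comparison. A KG phenotype with no significant result keeps an empty list rather than being dropped.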
After manual inspection I found that 65.6% (42/64) of known causal phenotype-cell type associations were recapitulated by our MultiEWCE results. In many instances, our results further resolved the specific cell subtype(s) underlying the phenotype (e.g. "neuron" to "visceromotor neuron", or "muscle cell" to "ventricular cardiac muscle cell").
For several KG cell types (sperm, germ cell), we didn't have an equivalent celltype in either of our CTD references and thus could not compare our results.
What's up with these missing phenotype-celltype associations?
Strangely, almost all of the known phenotype-celltype associations we missed are the most obvious ones!
Things like
- HP:0100494 | Abnormal mast cell morphology (HPO)
- HP:0100705 | Abnormal glial cell morphology (HPO)
- HP:0012757 | Abnormal neuron morphology (HPO)
- etc. See the full list here: kg_missing.csv
Some possible reasons
- The way our specificity matrix is constructed, we may be missing some of the markers for neurons in general (instead, focusing on neuronal subtype markers). But as both our CTDs are multi-system atlases, I'd expect that kind of effect to be attenuated.
- Incomplete gene lists in the HPO.
- Stringent multiple testing correction may have killed some weaker signals (though I'm not sure why these would be quite so weak). All 22 missing phenotype-cell type associations are significant at an uncorrected p-value threshold (p<0.05). Even so, I'd expect the enrichment to be quite strong for these phenotypes.
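To make the multiple-testing point concrete, here is a minimal sketch of how a nominally significant p-value can fail an FDR threshold. It assumes a Benjamini-Hochberg correction, which may not match the exact procedure used in our pipeline, and the p-values are made up.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


# Made-up example: one modest hit among 200 null-like tests.
pvals = [0.03] + [0.5] * 200
qvals = benjamini_hochberg(pvals)
print(f"raw p = {pvals[0]}, BH q = {qvals[0]}")
```

In this toy case the first test passes p<0.05 but is nowhere near FDR<5% once the other 200 tests are accounted for, which is the pattern described for the 22 missing associations.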
But are we simply recalling these associations due to massive amounts of testing?
It seems not! Using FDR<5% seems to be sufficient to keep the number of enriched cell types per phenotype reasonable. The KG annotations have 1 cell type per phenotype. In ours, we have a median of 4 cell types per phenotype. DescartesHuman has 77 unique cell types while HumanCellLandscape (HCL) has 124, for a total of 201 cell types that we tested across all phenotypes. If we assume a single "correct" cell type per phenotype (from the KG annotations), this would mean we have a 4/201 (2%) probability of picking the correct cell type by chance.
Now, one might argue this isn't quite right, since multiple cell types can count as the "correct" cell type, including a direct match, a developmental precursor, or an ontological ancestor. But even if you triple our number of "draws", that would still only be a 12/201 (6%) probability of getting the correct cell type by chance.
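The chance calculation above can be written out as a small sketch. The per-phenotype probability follows directly from the numbers in the text. The binomial tail is an extra back-of-the-envelope step that assumes phenotypes are independent and share the same chance probability, which is a simplification and not something tested here.

```python
from math import comb

N_CELLTYPES = 77 + 124  # DescartesHuman + HumanCellLandscape cell types


def p_hit_by_chance(n_draws, n_total=N_CELLTYPES, n_correct=1):
    """P(at least one correct cell type among n_draws picked at random without replacement)."""
    return 1 - comb(n_total - n_correct, n_draws) / comb(n_total, n_draws)


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_median = p_hit_by_chance(4)    # median of 4 enriched cell types per phenotype
p_tripled = p_hit_by_chance(12)  # tripled "draws" to allow precursors/ancestors
print(f"chance per phenotype: {p_median:.3f} (4 draws), {p_tripled:.3f} (12 draws)")

# Assumption: 64 independent phenotypes, each with the more generous 12/201 chance.
print(f"P(>= 42 of 64 by chance) = {binom_tail(42, 64, p_tripled):.3g}")
```

With one correct cell type, `p_hit_by_chance` reduces to draws/total, matching the 4/201 and 12/201 figures. Under the stated independence assumption, recovering 42 of 64 associations by chance alone would be extremely unlikely even with the tripled per-phenotype probability.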
Conclusions
I think this provides some decent evidence for our claim that our enrichment results represent the real causal cell types underlying these phenotypes. But the imperfect recall rate (especially for obvious associations) remains a concern.