How do taxonomy classifiers perform when they encounter a query sequence that is not represented in the reference database? To what degree do they "overclassify"?
For this test, "novel taxa" consist of query sequences randomly drawn from a reference database (source). Taxonomy assignment is then performed using a modified reference database (ref), which consists of the source minus the novel taxa AND all seqs with matching taxonomy annotations at the taxonomic level (L) being tested (species, genus, family, etc.). Also remove any taxa from the ref that do not have near neighbors at level L, e.g., other species in the same genus.
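To make the construction concrete, here is a minimal sketch of one way the source/ref split could be done. Everything in it is an assumption for illustration: the function name `split_novel_taxa`, the representation of the source as a dict of sequence ID to a tuple of rank labels (index 0 = highest rank, all lineages at full depth), and the choice to apply the near-neighbor check to the source before the novel taxa are removed, drawing at most one novel taxon per parent so the parent always stays in the ref.

```python
import random
from collections import defaultdict


def split_novel_taxa(source, level, n_novel, seed=0):
    """Return (queries, ref) for a novel-taxa test at rank index `level`.

    source: dict of seq_id -> tuple of rank labels (index 0 = highest rank).
    level:  index of the rank L being tested.
    """
    # Group sequence IDs by lineage truncated at L (the taxon under test).
    members = defaultdict(list)
    for seq_id, lineage in source.items():
        members[lineage[:level + 1]].append(seq_id)

    # Group taxa by parent lineage (L - 1) to find near neighbors.
    siblings = defaultdict(list)
    for taxon in members:
        siblings[taxon[:level]].append(taxon)

    # Assumption: a taxon has a near neighbor if its parent holds > 1 taxon.
    parents = sorted(p for p, taxa in siblings.items() if len(taxa) > 1)
    eligible = {t for p in parents for t in siblings[p]}

    # Assumption: at most one novel taxon per parent, so L - 1 stays in ref.
    rng = random.Random(seed)
    rng.shuffle(parents)
    novel = {rng.choice(sorted(siblings[p])) for p in parents[:n_novel]}

    # One randomly drawn query sequence per novel taxon.
    queries = {}
    for taxon in sorted(novel):
        seq_id = rng.choice(sorted(members[taxon]))
        queries[seq_id] = source[seq_id]

    # ref = source minus novel taxa and minus taxa without near neighbors.
    ref = {
        seq_id: lineage
        for seq_id, lineage in source.items()
        if lineage[:level + 1] in eligible and lineage[:level + 1] not in novel
    }
    return queries, ref
```

With genus-then-species lineages, `split_novel_taxa(source, level=1, n_novel=10)` would hold out up to ten species, each from a different genus, and return a ref in which every held-out species is absent but its genus is still represented.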
Following this method, each assignment falls into one of three categories:
Match: assignment == L - 1 (e.g., a novel species is assigned the correct genus)
overclassification: assignment == L (e.g., correct genus but assigned to a near neighbor)
misclassification: incorrect assignment at L - 1 (e.g., wrong genus-level assignment)
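The categories above could be scored roughly as follows. This is only a sketch under stated assumptions: the function name `score_assignment` and the tuple-of-ranks representation (index 0 = highest rank) are hypothetical. Two behaviors go beyond the definitions given here and are my own assumptions: an assignment that stops short of L - 1 but is correct as far as it goes gets a separate "underclassification" label so the function always returns something, and a shallow assignment that is wrong at any assigned rank counts as a misclassification.

```python
def score_assignment(expected, observed, level):
    """Classify one assignment for a query that is novel at rank index `level`.

    expected: true lineage of the query, tuple with index 0 = highest rank.
    observed: lineage returned by the classifier (may be shorter or longer).
    level:    index of rank L; expected[:level] is the lineage through L - 1.
    """
    expected = tuple(expected)
    observed = tuple(observed)

    # Compare only the ranks that were assigned, up to and including L - 1.
    depth = min(len(observed), level)
    if observed[:depth] != expected[:depth]:
        return "misclassification"

    if len(observed) < level:
        # Assumption: correct but shallower than L - 1 (not defined above).
        return "underclassification"
    if len(observed) == level:
        # Correct through L - 1 and no deeper.
        return "match"
    # Correct through L - 1 but assigned at L; the true taxon is absent from
    # ref, so this can only be a near neighbor.
    return "overclassification"
```

For a novel species with true lineage `("Fam", "Gen", "sp1")` and `level=2`, the observed lineage `("Fam", "Gen")` scores as a match, `("Fam", "Gen", "sp2")` as overclassification, and `("Fam", "Other")` as misclassification.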
One question: is it worth also defining underclassification, i.e., assignment < L - 1 (e.g., correct family but no genus)? My gut feeling is NO, since this will complicate matters and we are left asking at which level L this becomes irrelevant (e.g., if species X is assigned to the correct phylum but wrong class, is this still underclassification, and is that important?). Unlike with overclassification, I also question whether this would yield a meaningful interpretation.