It would be nice to also have some kind of leaderboard associated with each of the listed datasets.
One (bad) way of doing it would be to use numbers reported in papers. It is bad for several reasons:
- one cannot be sure the exact same evaluation protocol was used
- one cannot be sure the exact same metric was used
A better way of doing it would be to ask authors to provide their output files and run the evaluation for them (possibly automatically, on each pull request), though this would not solve the first problem. We could use pyannote.metrics for that, as in the sketch below.
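A minimal sketch of what that evaluation could look like, assuming submissions come as diarization output that we load into pyannote.core annotations; the segments, labels, and collar value here are purely illustrative.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# reference annotation (ground truth for one file)
reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk1"
reference[Segment(10.0, 20.0)] = "spk2"

# hypothesis built from the output file submitted by the authors
hypothesis = Annotation()
hypothesis[Segment(0.0, 12.0)] = "A"
hypothesis[Segment(12.0, 20.0)] = "B"

# the same metric configuration would be applied to every submission
metric = DiarizationErrorRate(collar=0.25, skip_overlap=False)
der = metric(reference, hypothesis)
print(f"DER = {100 * der:.2f}%")
```

Running the same script on every pull request would at least guarantee that the metric and its configuration are identical across submissions, even if the underlying evaluation protocol still depends on how each system was trained and tuned.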
An even better way would be to ask them to provide runnable pre-trained systems and run those for them, but this would require a lot of work, both from the authors and to set up on our side.
A utopian way would be to ask them to provide trainable systems.
Anyway, maybe this is too much to ask, and existing challenges like DIHARD and Albayzin are probably enough...