support for transcripts and entity linking annotation about pyannote-database HOT 19 CLOSED

pyannote commented on August 11, 2024

support for transcripts and entity linking annotation

from pyannote-database.

Comments (19)

hbredin commented on August 11, 2024 1

* add path to transcripts and entities here https://github.com/pyannote/pyannote-database/blob/develop/pyannote/database/custom.py#L319 ?
* do I need to use another protocol than `SpeakerDiarizationProtocol`?

I think the main question is: do we want to share the PLUMCOT corpus via a custom protocol or via a pyannote.db.* plugin?

Solution 1: custom protocol

Pros: installing the corpus is as easy as

git clone ...
export PYANNOTE_DATABASE_CONFIG=Plumcot/data/database.yml

Cons: implies a lot of changes to the pyannote.database.custom module that will (most likely) only be used for the Plumcot corpus.

Solution 2: plugin

Pros: since everything is done in a Python class, you can choose to create your own PlumcotProtocol class (e.g. as a subclass of SpeakerDiarizationProtocol) and add to yielded ProtocolFile objects whichever field you see fit (e.g. transcript).

Cons: an additional Python package needs to be maintained and deployed to PyPI which might be tricky in the long term.

My point of view

I'd personally go for solution 2 as a first step -- and some of the code might be re-used later if we finally realize that solution 1 might be interesting to integrate into pyannote.database.

I think it really depends on whether we do more NLP stuff in the future (e.g. a pyannote.text package?) For now, solution 1 feels a bit overkill.

I'll discuss the spaCy data structure question in here as this is a slightly different problem.


PaulLerner commented on August 11, 2024 1

Underscore does it!

In [1]: import yaml

In [2]: with open('foo.yml') as file:
   ...:     print(yaml.load(file, Loader=yaml.SafeLoader))
   ...:
{'foo': '_bar'}

In [3]: cat foo.yml
foo: _bar

NB: our latest comments are more relevant to #51


PaulLerner commented on August 11, 2024

Ok, I might need some help to set up the protocol though... Is this still up to date?
There's already some (messy) code in https://github.com/PaulLerner/pyannote-db-plumcot/blob/video/Plumcot/__init__.py


PaulLerner commented on August 11, 2024

Also, one difficulty is that Plumcot doesn't have one but 16 (+ some extra) protocols...


hbredin commented on August 11, 2024

FYI, I am currently looking into Solution 1. I'll keep you posted.


PaulLerner commented on August 11, 2024

I was about to message you!
I started implementing solution 1 (because of the aforementioned issues) to not waste time as I thought we could always re-use the code.
So here it is:

plumcot formats PaulLerner@7ba0b94
spacy doc (TODO fix this PaulLerner@d5c5598#diff-a20fd6de8844df2e02690b99f2c12ecbR173-R178)


hbredin commented on August 11, 2024

Nice! As I was saying earlier, I have been working in parallel on refactoring pyannote.database.custom (and the introduction of custom data loaders) to make it much more flexible. Work is in progress in branch custom (or pull request #51).

I am pretty sure most of your new code could become a custom data loader.
This would allow defining protocols like this:

Protocols:
  MyDatabase:
    SpeakerDiarization:
      MyProtocol:
        train:
          uri: /path/to/uris.lst
          annotation: /path/to/speaker.rttm
          transcript: @/path/to/{uri}.ctm

The @ above is not a typo: it is meant to indicate that the path must be considered as a template with placeholders.

For now, only SpeakerDiarization protocols are supported but this is basically just a name and you can add as many keys as you want, as long as you provide the corresponding custom data loader.
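
To make the placeholder idea concrete, here is a minimal sketch of how such a template could be expanded for one file. The helper name `resolve_template` and the way the marker is handled are assumptions made for illustration; this is not the actual pyannote.database implementation.

```python
from pathlib import Path

# Marker used in the configuration above to flag a value as a template.
# (Assumption: only a leading marker is significant.)
TEMPLATE_MARKER = "@"


def resolve_template(value: str, current_file: dict) -> Path:
    """Expand a configuration value into a concrete path (hypothetical helper).

    Values starting with TEMPLATE_MARKER are treated as templates whose
    {placeholders} are filled from the keys of `current_file`. Any other
    value is returned unchanged as a path.
    """
    if value.startswith(TEMPLATE_MARKER):
        template = value[len(TEMPLATE_MARKER):]
        return Path(template.format(**current_file))
    return Path(value)


# One shared annotation file, but one transcript file per uri.
print(resolve_template("/path/to/speaker.rttm", {"uri": "episode01"}))
print(resolve_template("@/path/to/{uri}.ctm", {"uri": "episode01"}))
```

With this convention, a plain path points to a single file shared by all uris, while a templated path yields a different file for each uri.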


PaulLerner commented on August 11, 2024

Great, I'll look into it

The @ above is not a typo: it is meant to indicate that the path must be considered as a template with placeholders.

Cool, that solves this hack PaulLerner@d5c5598#diff-a20fd6de8844df2e02690b99f2c12ecbR284-R286 :)


PaulLerner commented on August 11, 2024

Mmh... I'm not sure how I can merge two different annotations (in my case: entities and forced-alignment) as I did here PaulLerner@d5c5598#diff-a20fd6de8844df2e02690b99f2c12ecbR238-R239


hbredin commented on August 11, 2024

If your configuration file looks like this,

Protocols:
  MyDatabase:
    SpeakerDiarization:
      MyProtocol:
        train:
          uri: ...
          entity: ...
          transcription: ...

then you can assume that the "entity" key is available to the TranscriptionLoader:

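from pyannote.database import ProtocolFile


class EntityLoader:
    def __init__(self, entities):
        # value given for the "entity" field in the configuration file
        self.entities = entities

    def __call__(self, current_file: ProtocolFile):
        # load and return the entity annotation of current_file
        return ...


class TranscriptionLoader:
    def __init__(self, transcription):
        # value given for the "transcription" field in the configuration file
        self.transcription = transcription

    def __call__(self, current_file: ProtocolFile):
        # current_file has an "entity" key
        entity = current_file["entity"]
        # merge the entity annotation with the transcription here
        return ...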

Does it answer your question?

This behavior is brought to you by ProtocolFile lazy keys ™️


PaulLerner commented on August 11, 2024

Yes thanks, I was about to ask about that (I was not sure about the iterating order of the fields)

It's quite dangerous as inverting the order of the fields "entity" and "transcription" would lead to the annotations not being merged but I don't have a better solution 😅

This behavior is brought to you by ProtocolFile lazy keys ™️

👍


hbredin commented on August 11, 2024

The order actually does not matter.

current_file["entity"] is computed (using EntityLoader.__call__) the first time you access it.

The only thing that would not work is if EntityLoader.__call__ needs the transcription key and TranscriptionLoader.__call__ needs the entity key: this would create a circular dependency that cannot be resolved.
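
As an illustration of why the order does not matter, here is a toy sketch of lazily computed keys. `LazyFile` and the two loader functions are hypothetical stand-ins written for this example; they are not the real ProtocolFile code, and the explicit cycle check is an assumption about how such a dependency could be reported.

```python
class LazyFile(dict):
    """Toy stand-in for a file whose keys are computed on first access."""

    def __init__(self, precomputed, loaders):
        super().__init__(precomputed)
        self._loaders = dict(loaders)  # key -> callable(current_file)
        self._in_progress = set()      # keys currently being computed

    def __missing__(self, key):
        # dict.__getitem__ calls this only when `key` has no stored value yet.
        if key not in self._loaders:
            raise KeyError(key)
        if key in self._in_progress:
            raise RuntimeError(f"circular dependency on key {key!r}")
        self._in_progress.add(key)
        try:
            value = self._loaders[key](self)
        finally:
            self._in_progress.discard(key)
        self[key] = value  # cache the value so the loader runs only once
        return value


def load_entity(current_file):
    return [f"entity-of-{current_file['uri']}"]


def load_transcription(current_file):
    # Accessing "entity" here triggers load_entity if it has not run yet.
    return {"words": ["hello"], "entities": current_file["entity"]}


# "transcription" is declared before "entity" on purpose.
current_file = LazyFile(
    {"uri": "episode01"},
    {"transcription": load_transcription, "entity": load_entity},
)
print(current_file["transcription"])
```

In this sketch, two loaders that each read the other's key raise a RuntimeError instead of recursing forever, which mirrors the unresolvable case described above.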


PaulLerner commented on August 11, 2024

Oh ok, thanks for clearing that up!


PaulLerner commented on August 11, 2024

This would allow defining protocols like this:
Protocols:
  MyDatabase:
    SpeakerDiarization:
      MyProtocol:
        train:
          uri: /path/to/uris.lst
          annotation: /path/to/speaker.rttm
          transcript: @/path/to/{uri}.ctm
The @ above is not a typo: it is meant to indicate that the path must be considered as a template with placeholders.

Note that you should use double quotes (e.g. transcript: "@/path/to/{uri}.ctm") to avoid the following yaml error:

ScannerError: while scanning for the next token
found character '@' that cannot start any token
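
The difference can be checked directly with PyYAML. The snippet below is a minimal sketch: `try_load` is a helper written for this example, and the underscore variant is included only as one alternative marker that YAML accepts at the start of an unquoted value.

```python
import yaml


def try_load(text):
    """Parse `text` with PyYAML; return the data, or None if it is rejected."""
    try:
        return yaml.safe_load(text)
    except yaml.YAMLError:
        return None


candidates = {
    "unquoted @": "transcript: @/path/to/{uri}.ctm",
    "quoted @": 'transcript: "@/path/to/{uri}.ctm"',
    "underscore": "transcript: _/path/to/{uri}.ctm",
}

results = {name: try_load(text) for name, text in candidates.items()}
for name, result in results.items():
    print(f"{name}: {result}")
```

Only the unquoted `@` line is rejected. The quoted form and the underscore form both load as plain strings, with the braces of the placeholder left untouched.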


hbredin commented on August 11, 2024

Ah. Dammit. I guess we should switch to another character, then.
Any suggestion?


PaulLerner commented on August 11, 2024

% ?


hbredin commented on August 11, 2024

I understand from the documentation that you might get the same error with %. Did you try?


PaulLerner commented on August 11, 2024

No, just a thought 😅


hbredin commented on August 11, 2024

Perfect, _ it is. Can you push it to your other fix?
