
Comments (19)

hbredin commented on August 11, 2024
* add path to transcripts and entities here https://github.com/pyannote/pyannote-database/blob/develop/pyannote/database/custom.py#L319 ?
* do I need to use another protocol than `SpeakerDiarizationProtocol`?

I think the main question is: do we want to share the PLUMCOT corpus via a custom protocol or via a pyannote.db.* plugin?

Solution 1: custom protocol

Pros: installing the corpus is as easy as

git clone ...
export PYANNOTE_DATABASE_CONFIG=Plumcot/data/database.yml

Cons: implies a lot of changes to the pyannote.database.custom module that will (most likely) only be used for the Plumcot corpus.

Solution 2: plugin

Pros: since everything is done in a Python class, you can choose to create your own PlumcotProtocol class (e.g. as a subclass of SpeakerDiarizationProtocol) and add to the yielded ProtocolFile objects whichever fields you see fit (e.g. transcript).

Cons: an additional Python package needs to be maintained and deployed to PyPI which might be tricky in the long term.

My point of view

I'd personally go for solution 2 as a first step -- and some of the code might be re-used later if we finally realize that solution 1 might be interesting to integrate into pyannote.database.

I think it really depends on whether we do more NLP stuff in the future (e.g. a pyannote.text package?) For now, solution 1 feels a bit overkill.

I'll discuss the spaCy data structure question in here as this is a slightly different problem.

from pyannote-database.

PaulLerner commented on August 11, 2024

Underscore does it!

In [1]: import yaml

In [2]: with open('foo.yml') as file:
   ...:     print(yaml.load(file, Loader=yaml.SafeLoader))
   ...:
{'foo': '_bar'}

In [3]: cat foo.yml
foo: _bar

NB: our latest comments are more relevant to #51

PaulLerner commented on August 11, 2024

Ok, I might need some help to set up the protocol though... Is this still up to date?
There's already some (messy) code in https://github.com/PaulLerner/pyannote-db-plumcot/blob/video/Plumcot/__init__.py

PaulLerner commented on August 11, 2024

Also, one difficulty is that Plumcot doesn't have one but 16 (+ some extra) protocols...

hbredin commented on August 11, 2024

FYI, I am currently looking into Solution 1. I'll keep you posted.

PaulLerner commented on August 11, 2024

I was about to message you!
I started implementing solution 1 (because of the aforementioned issues) to not waste time as I thought we could always re-use the code.
So here it is:

hbredin commented on August 11, 2024

Nice! As I was saying earlier, I have been working in parallel on refactoring pyannote.database.custom (and the introduction of custom data loaders) to make it much more flexible. Work is in progress in branch custom (or pull request #51).

I am pretty sure most of your new code could become a custom data loader.
This would allow defining protocols like this:

Protocols:
  MyDatabase:
    SpeakerDiarization:
      MyProtocol:
        train:
          uri: /path/to/uris.lst
          annotation: /path/to/speaker.rttm
          transcript: @/path/to/{uri}.ctm

The @ above is not a typo: it is meant to indicate that the path must be considered as a template with placeholders.

For now, only SpeakerDiarization protocols are supported, but this is basically just a name and you can add as many keys as you want, as long as you provide the corresponding custom data loader.
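
To make the template idea concrete, here is a minimal sketch of how a marked path could be turned into a per-file path. It is only an illustration, not the code from the custom branch: the function name resolve_path and its marker parameter are hypothetical, and the only assumption taken from the comment above is that a leading marker means "fill in the {uri} placeholder".

```python
from pathlib import PurePosixPath


def resolve_path(value: str, uri: str, marker: str = "@") -> PurePosixPath:
    """Return the concrete path for one file (hypothetical helper).

    A value starting with `marker` is treated as a template whose
    {uri} placeholder is filled in; any other value is used unchanged.
    """
    if value.startswith(marker):
        template = value[len(marker):]
        return PurePosixPath(template.format(uri=uri))
    return PurePosixPath(value)


# One template path (per-file) and one plain path (shared by all files).
print(resolve_path("@/path/to/{uri}.ctm", uri="episode01"))
print(resolve_path("/path/to/speaker.rttm", uri="episode01"))
```

Keeping the marker as a parameter means the same logic still applies if another prefix character ends up being chosen.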

PaulLerner commented on August 11, 2024

Great, I'll look into it

> The @ above is not a typo: it is meant to indicate that the path must be considered as a template with placeholders.

Cool, that solves this hack PaulLerner@d5c5598#diff-a20fd6de8844df2e02690b99f2c12ecbR284-R286 :)

PaulLerner commented on August 11, 2024

Mmh... I'm not sure how I can merge two different annotations (in my case: entities and forced-alignment) as I did here PaulLerner@d5c5598#diff-a20fd6de8844df2e02690b99f2c12ecbR238-R239

hbredin commented on August 11, 2024

If your configuration file looks like this,

Protocols:
  MyDatabase:
    SpeakerDiarization:
      MyProtocol:
        train:
          uri: ...
          entity: ...
          transcription: ...

then you can assume that the "entity" key is available to the TranscriptionLoader:

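from pyannote.database import ProtocolFile

class EntityLoader:
    def __init__(self, entities):
        # value configured under the "entity" key
        self.entities = entities
    def __call__(self, current_file: ProtocolFile):
        # load and return the entities of current_file
        return ...

class TranscriptionLoader:
    def __init__(self, transcription):
        # value configured under the "transcription" key
        self.transcription = transcription
    def __call__(self, current_file: ProtocolFile):
        # current_file has an "entity" key
        entity = current_file["entity"]
        # merge entity with the transcription here
        return ...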

Does it answer your question?

This behavior is brought to you by ProtocolFile lazy keys ™️

PaulLerner commented on August 11, 2024

Yes thanks, I was about to ask about that (I was not sure about the iteration order of the fields)

It's quite dangerous, as inverting the order of the fields "entity" and "transcription" would lead to the annotations not being merged, but I don't have a better solution 😅

> This behavior is brought to you by ProtocolFile lazy keys ™️

👍

hbredin commented on August 11, 2024

The order actually does not matter.

current_file["entity"] is computed (using EntityLoader.__call__) the first time you access it.

The only thing that would not work is if EntityLoader.__call__ needs the transcription key and TranscriptionLoader.__call__ needs the entity key: this would create a circular dependency that cannot be resolved.
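
As a rough illustration of why the order does not matter, here is a self-contained sketch of lazy keys. The LazyFile class is a hypothetical stand-in written for this example, not the actual ProtocolFile implementation, and the loaders return made-up values; the explicit cycle check is also an assumption about how one might report the circular case.

```python
class LazyFile:
    """Dict-like file whose missing keys are computed on first access."""

    def __init__(self, eager, loaders):
        self._values = dict(eager)     # keys that are already known
        self._loaders = dict(loaders)  # key -> callable taking the file
        self._in_progress = set()      # keys currently being computed

    def __getitem__(self, key):
        if key in self._values:
            return self._values[key]
        if key not in self._loaders:
            raise KeyError(key)
        if key in self._in_progress:
            # the loader for `key` ended up asking for `key` again
            raise RuntimeError(f"circular dependency on key {key!r}")
        self._in_progress.add(key)
        try:
            value = self._loaders[key](self)
        finally:
            self._in_progress.discard(key)
        self._values[key] = value  # cache: computed only once
        return value


class EntityLoader:
    def __call__(self, current_file):
        return [f"entity-of-{current_file['uri']}"]


class TranscriptionLoader:
    def __call__(self, current_file):
        # triggers EntityLoader.__call__ if "entity" is not computed yet
        entity = current_file["entity"]
        return {"words": ["hello"], "entities": entity}


# "transcription" is declared before "entity" on purpose.
loaders = {"transcription": TranscriptionLoader(), "entity": EntityLoader()}
current_file = LazyFile({"uri": "episode01"}, loaders)
merged = current_file["transcription"]
print(merged)
```

In this sketch, two loaders that each read the other's key raise a RuntimeError instead of recursing forever, which corresponds to the circular dependency described above.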

PaulLerner commented on August 11, 2024

Oh ok, thanks for clearing that up!

PaulLerner commented on August 11, 2024

> This would allow to define protocols like that:
>
> Protocols:
>   MyDatabase:
>     SpeakerDiarization:
>       MyProtocol:
>         train:
>           uri: /path/to/uris.lst
>           annotation: /path/to/speaker.rttm
>           transcript: @/path/to/{uri}.ctm
>
> The @ above is not a typo: it is meant to indicate that the path must be considered as a template with placeholders.

Note that you should use double quotes (e.g. transcript: "@/path/to/{uri}.ctm") to avoid the following yaml error:

ScannerError: while scanning for the next token
found character '@' that cannot start any token
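
For context, the YAML specification reserves a set of indicator characters, including @, % and the backtick, that cannot start an unquoted (plain) scalar. The sketch below is a simplified first-character check based on that list, useful for comparing candidate marker characters; can_start_plain_scalar is a hypothetical helper, it only looks at how the value starts, and it is not a substitute for running a real YAML parser.

```python
# Indicator characters listed by the YAML 1.2 specification.
INDICATORS = set("-?:,[]{}#&*!|>'\"%@`")
# These three may still start a plain scalar when the next
# character is not whitespace (e.g. "-1" or ":foo").
CONDITIONAL = set("-?:")


def can_start_plain_scalar(value: str) -> bool:
    """Check only the start of `value`; the rest is not validated."""
    if not value or value[0].isspace():
        return False
    first = value[0]
    if first not in INDICATORS:
        return True
    if first in CONDITIONAL:
        return len(value) > 1 and not value[1].isspace()
    return False


for prefix in "@%_":
    candidate = f"{prefix}/path/to/{{uri}}.ctm"
    print(prefix, can_start_plain_scalar(candidate))
```

Under this check, a value starting with @ or % needs quoting, while one starting with an underscore does not.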

hbredin commented on August 11, 2024

Ah. Dammit. I guess we should switch to another character, then.
Any suggestion?

PaulLerner commented on August 11, 2024

% ?

hbredin commented on August 11, 2024

I understand from the documentation that you might get the same error with %. Did you try?

PaulLerner commented on August 11, 2024

No, just a thought 😅

hbredin commented on August 11, 2024

Perfect, _ it is. Can you push it to your other fix?

