Comments (19)
* add path to transcripts and entities here https://github.com/pyannote/pyannote-database/blob/develop/pyannote/database/custom.py#L319 ?
* do I need to use another protocol than `SpeakerDiarizationProtocol`?
I think the main question is: do we want to share the PLUMCOT corpus via a custom protocol or via a pyannote.db.* plugin?
Solution 1: custom protocol
Pros: installing the corpus is as easy as
git clone ...
export PYANNOTE_DATABASE_CONFIG=Plumcot/data/database.yml
Cons: implies a lot of changes to the pyannote.database.custom module that will (most likely) only be used for the Plumcot corpus.
Solution 2: plugin
Pros: since everything is done in a Python class, you can choose to create your own PlumcotProtocol class (e.g. as a subclass of SpeakerDiarizationProtocol) and add to yielded ProtocolFile objects whichever field you see fit (e.g. transcript).
Cons: an additional Python package needs to be maintained and deployed to PyPI which might be tricky in the long term.
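To make Solution 2 more concrete, here is a minimal sketch of a protocol class that attaches a transcript field to each yielded file. It does not import pyannote.database: SpeakerDiarizationProtocolStub is a hypothetical stand-in for the real base class, and the URIs and transcripts are made-up in-memory data, so this only illustrates the general shape of a plugin.

```python
from typing import Dict, Iterator


class SpeakerDiarizationProtocolStub:
    """Hypothetical stand-in for the real base class (not the pyannote API)."""

    def train_iter(self) -> Iterator[Dict]:
        raise NotImplementedError

    def train(self) -> Iterator[Dict]:
        # The real base class does more work; this stub only forwards files.
        yield from self.train_iter()


class PlumcotProtocol(SpeakerDiarizationProtocolStub):
    """Sketch of a plugin protocol that adds a transcript to each file."""

    # Made-up data; a real plugin would read these from the corpus on disk.
    _TRANSCRIPTS = {"episode01": "hello there", "episode02": "see you soon"}

    def train_iter(self) -> Iterator[Dict]:
        for uri, transcript in self._TRANSCRIPTS.items():
            # Any extra field can be added next to the mandatory "uri".
            yield {"uri": uri, "transcript": transcript}


files = list(PlumcotProtocol().train())
for current_file in files:
    print(current_file["uri"], "->", current_file["transcript"])
```

The point of the design is that the extra field lives in ordinary Python code, so nothing in the shared configuration format has to change.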
My point of view
I'd personally go for solution 2 as a first step, and some of the code might be re-used later if we eventually decide that solution 1 is worth integrating into pyannote.database.
I think it really depends on whether we do more NLP stuff in the future (e.g. a pyannote.text package?). For now, solution 1 feels a bit overkill.
I'll discuss the spaCy data structure question in here as this is a slightly different problem.
from pyannote-database.
Underscore does it!
In [1]: import yaml
In [2]: with open('foo.yml') as file:
   ...:     print(yaml.load(file, Loader=yaml.SafeLoader))
   ...:
{'foo': '_bar'}
In [3]: cat foo.yml
foo: _bar
NB: our latest comments are more relevant to #51
Ok, I might need some help to set up the protocol though... Is this still up to date?
There's already some (messy) code in https://github.com/PaulLerner/pyannote-db-plumcot/blob/video/Plumcot/__init__.py
Also, one difficulty is that Plumcot doesn't have one but 16 (+ some extra) protocols...
FYI, I am currently looking into Solution 1. I'll keep you posted.
I was about to message you!
I started implementing solution 1 (because of the aforementioned issues) to not waste time as I thought we could always re-use the code.
So here it is:
- plumcot formats PaulLerner@7ba0b94
- spacy doc (TODO fix this PaulLerner@d5c5598#diff-a20fd6de8844df2e02690b99f2c12ecbR173-R178)
Nice! As I was saying earlier, I have been working in parallel on refactoring pyannote.database.custom (and the introduction of custom data loaders) to make it much more flexible. Work is in progress in branch custom (or pull request #51).
I am pretty sure most of your new code could become a custom data loader.
This would allow defining protocols like this:
Protocols:
  MyDatabase:
    SpeakerDiarization:
      MyProtocol:
        train:
          uri: /path/to/uris.lst
          annotation: /path/to/speaker.rttm
          transcript: @/path/to/{uri}.ctm
The @ above is not a typo: it is meant to indicate that the path must be considered as a template with placeholders.
For now, only SpeakerDiarization protocols are supported but this is basically just a name and you can add as many keys as you want, as long as you provide the corresponding custom data loader.
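As a rough illustration of the template idea, the sketch below resolves a configured path for a given uri. The function name resolve_path and the error message are hypothetical and not taken from pyannote.database; the marker is kept as a parameter so another character could be swapped in.

```python
from pathlib import PurePosixPath

TEMPLATE_MARKER = "@"  # assumption: the marker character proposed in this comment


def resolve_path(value: str, uri: str, marker: str = TEMPLATE_MARKER) -> PurePosixPath:
    """Expand the {uri} placeholder only when `value` starts with the marker."""
    if value.startswith(marker):
        template = value[len(marker):]
        if "{uri}" not in template:
            raise ValueError("template path must contain the {uri} placeholder")
        return PurePosixPath(template.format(uri=uri))
    # No marker: the same file is shared by every uri of the subset.
    return PurePosixPath(value)


shared = resolve_path("/path/to/speaker.rttm", uri="episode01")
per_file = resolve_path("@/path/to/{uri}.ctm", uri="episode01")
print(shared)
print(per_file)
```

Without the marker the path is used as-is, which matches the single speaker.rttm file in the configuration; with the marker, one file per uri is expected.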
Great, I'll look into it
> The @ above is not a typo: it is meant to indicate that the path must be considered as a template with placeholders.
Cool that solves this hack PaulLerner@d5c5598#diff-a20fd6de8844df2e02690b99f2c12ecbR284-R286 :)
Mmh... I'm not sure how I can merge two different annotations (in my case: entities and forced-alignment) as I did here PaulLerner@d5c5598#diff-a20fd6de8844df2e02690b99f2c12ecbR238-R239
If your configuration file looks like this,
Protocols:
  MyDatabase:
    SpeakerDiarization:
      MyProtocol:
        train:
          uri: ...
          entity: ...
          transcription: ...
then you can assume that the "entity" key is available to the TranscriptionLoader:
from pyannote.database import ProtocolFile

class EntityLoader:
    def __init__(self, entities):
        self.entities = entities

    def __call__(self, current_file: ProtocolFile):
        # load and return the entities of current_file
        return ...

class TranscriptionLoader:
    def __init__(self, transcription):
        self.transcription = transcription

    def __call__(self, current_file: ProtocolFile):
        # current_file has an "entity" key,
        # computed by EntityLoader on first access
        entity = current_file["entity"]
        return ...
Does it answer your question?
This behavior is brought to you by ProtocolFile lazy keys ™️
Yes thanks, I was about to ask about that (I was not sure about the iteration order of the fields).
It's quite dangerous, as inverting the order of the fields "entity" and "transcription" would lead to the annotations not being merged, but I don't have a better solution 😅
> This behavior is brought to you by ProtocolFile lazy keys ™️
👍
The order actually does not matter.
current_file["entity"] is computed (using EntityLoader.__call__) the first time you access it.
The only thing that would not work is if EntityLoader.__call__ needs the transcription key and TranscriptionLoader.__call__ needs the entity key: this would create a circular dependency that cannot be resolved.
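The following sketch shows why declaration order is irrelevant with lazily computed keys, and what a circular dependency looks like. LazyFile is a hypothetical stand-in written for this example, not the actual ProtocolFile implementation; the loader functions and the RuntimeError are likewise assumptions.

```python
from typing import Any, Callable, Dict


class LazyFile(dict):
    """Hypothetical stand-in for ProtocolFile: values are computed on first access."""

    def __init__(self, precomputed: Dict[str, Any],
                 lazy: Dict[str, Callable[["LazyFile"], Any]]):
        super().__init__(precomputed)
        self._lazy = dict(lazy)
        self._evaluating = set()

    def __getitem__(self, key: str) -> Any:
        if dict.__contains__(self, key):
            return dict.__getitem__(self, key)  # already computed (or precomputed)
        if key not in self._lazy:
            raise KeyError(key)
        if key in self._evaluating:
            raise RuntimeError(f"circular dependency while computing {key!r}")
        self._evaluating.add(key)
        try:
            value = self._lazy[key](self)  # the loader may access other lazy keys
        finally:
            self._evaluating.discard(key)
        dict.__setitem__(self, key, value)  # cache for later accesses
        return value


def load_entity(current_file):
    return ["entity-of-" + current_file["uri"]]


def load_transcription(current_file):
    # Accessing "entity" here triggers load_entity on demand.
    return {"words": ["hello"], "entities": current_file["entity"]}


# Same loaders, declared in opposite orders.
first = LazyFile({"uri": "ep01"},
                 {"entity": load_entity, "transcription": load_transcription})
second = LazyFile({"uri": "ep01"},
                  {"transcription": load_transcription, "entity": load_entity})

# Two loaders that need each other cannot be resolved.
cyclic = LazyFile({}, {"a": lambda f: f["b"], "b": lambda f: f["a"]})
try:
    cyclic["a"]
    message = None
except RuntimeError as error:
    message = str(error)
```

Both first["transcription"] and second["transcription"] give the same result because each value is only computed when something asks for it, whereas the cyclic case fails as soon as either key is accessed.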
Oh ok, thanks for clearing that up!
> This would allow to define protocols like that:
>   Protocols:
>     MyDatabase:
>       SpeakerDiarization:
>         MyProtocol:
>           train:
>             uri: /path/to/uris.lst
>             annotation: /path/to/speaker.rttm
>             transcript: @/path/to/{uri}.ctm
> The @ above is not a typo: it is meant to indicate that the path must be considered as a template with placeholders.
Note that you should use double quotes (e.g. transcript: "@/path/to/{uri}.ctm") to avoid the following yaml error:
ScannerError: while scanning for the next token
found character '@' that cannot start any token
Ah. Dammit. I guess we should switch to another character, then.
Any suggestion?
%?
I understand from the documentation that you might get the same error with %. Did you try?
No, just a thought 😅
Perfect, _ it is. Can you push it to your other fix?
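For reference, the candidate marker characters discussed in this thread can be compared with a short PyYAML check. This is only a sketch: the helper try_load is hypothetical, and it assumes PyYAML is installed (it is the same yaml module used in the IPython session earlier in the thread).

```python
import yaml


def try_load(text: str):
    """Return the parsed document, or the name of the YAML error that was raised."""
    try:
        return yaml.safe_load(text)
    except yaml.YAMLError as error:
        return type(error).__name__


candidates = {
    "unquoted @": "transcript: @/path/to/{uri}.ctm",
    "quoted @": 'transcript: "@/path/to/{uri}.ctm"',
    "unquoted %": "transcript: %/path/to/{uri}.ctm",
    "unquoted _": "transcript: _/path/to/{uri}.ctm",
}

results = {name: try_load(text) for name, text in candidates.items()}
for name, result in results.items():
    print(f"{name}: {result}")
```

An unquoted value can start with an underscore because it is not a YAML indicator character, which is why it avoids the need for double quotes.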