bminixhofer / wtpsplit

Code for Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation

Home Page: https://aclanthology.org/2023.acl-long.398

License: MIT License

Python 99.97% Shell 0.03%
sentence-boundary-detection python deep-learning machine-learning pretrained-models sentence-segmentation sentence-segmenter natural-language-processing

wtpsplit's Introduction

wtpsplit🪓

Code for the paper Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation with Jonas Pfeiffer and Ivan Vulić, accepted at ACL 2023.

This repository contains wtpsplit, a package for robust and adaptable sentence segmentation across 85 languages, as well as the code and configs to reproduce the experiments in the paper.

Installation

pip install wtpsplit

Usage

from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")
# optionally run on GPU for better performance
# also supports TPUs via e.g. wtp.to("xla:0"), in that case pass `pad_last_batch=True` to wtp.split
wtp.half().to("cuda")

# returns ["Hello ", "This is a test."]
wtp.split("Hello This is a test.")

# returns an iterator yielding a list of sentences for every text
# do this instead of calling wtp.split on each text individually for much better performance
wtp.split(["Hello This is a test.", "And some more texts..."])

# if you're using a model with language adapters, also pass a `lang_code`
wtp.split("Hello This is a test.", lang_code="en")

# depending on your use case, adaptation to e.g. the Universal Dependencies style may give better results
# this always requires a language code
wtp.split("Hello This is a test.", lang_code="en", style="ud")

ONNX support

You can enable ONNX inference for the wtp-bert-* models:

wtp = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])

This requires onnxruntime and onnxruntime-gpu. It should give a good speedup on GPU!

>>> from wtpsplit import WtP
>>> texts = ["This is a sentence. This is another sentence."] * 1000

# PyTorch GPU
>>> model = WtP("wtp-bert-mini")
>>> model.half().to("cuda")
>>> %timeit list(model.split(texts))
272 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# onnxruntime GPU
>>> model = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])
>>> %timeit list(model.split(texts))
198 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Notes:

  • The wtp-canine-* models are currently not supported with ONNX because the pooling done by CANINE is not trivial to export. Ideas to solve this are very welcome!
  • This does not work with Python 3.7 because onnxruntime does not support the opset we need for py37.

Available Models

Pro tips: I recommend wtp-bert-mini for speed-sensitive applications, otherwise wtp-canine-s-12l. The *-no-adapters models provide a good tradeoff between speed and performance. You should probably not use wtp-bert-tiny.

Model                        | English Score | English Score (adapted) | Multilingual Score | Multilingual Score (adapted)
wtp-bert-tiny                | 83.8          | 91.9                    | 79.5               | 88.6
wtp-bert-mini                | 91.8          | 95.9                    | 84.3               | 91.3
wtp-canine-s-1l              | 94.5          | 96.5                    | 86.7               | 92.8
wtp-canine-s-1l-no-adapters  | 93.1          | 96.4                    | 85.1               | 91.8
wtp-canine-s-3l              | 94.4          | 96.8                    | 86.7               | 93.4
wtp-canine-s-3l-no-adapters  | 93.8          | 96.4                    | 86.0               | 92.3
wtp-canine-s-6l              | 94.5          | 97.1                    | 87.0               | 93.6
wtp-canine-s-6l-no-adapters  | 94.4          | 96.8                    | 86.4               | 92.8
wtp-canine-s-9l              | 94.8          | 97.0                    | 87.7               | 93.8
wtp-canine-s-9l-no-adapters  | 94.3          | 96.9                    | 86.6               | 93.0
wtp-canine-s-12l             | 94.7          | 97.1                    | 87.9               | 94.0
wtp-canine-s-12l-no-adapters | 94.5          | 97.0                    | 87.1               | 93.2

The scores are the macro-average F1 score across all available datasets for "English", and the macro-average F1 score across all datasets and languages for "Multilingual". "adapted" means adaptation via WtP Punct; check out the paper for details.
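For clarity, the macro-average is simply the unweighted mean of the per-dataset F1 scores; a minimal sketch (the dataset names and scores below are made up for illustration):

# macro-average F1 is the unweighted mean over datasets
# (dataset names and scores are hypothetical, for illustration only)
scores = {"ud": 0.94, "opus100": 0.92, "ersatz": 0.95}
macro_f1 = sum(scores.values()) / len(scores)  # ~0.937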

For comparison, here are the English scores of some other tools:

Model                      | English Score
SpaCy (sentencizer)        | 86.8
PySBD                      | 69.8
SpaCy (dependency parser)  | 93.1
Ersatz                     | 91.6
Punkt (nltk.sent_tokenize) | 92.5

Paragraph Segmentation

Since WtP models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.

# returns a list of paragraphs, each containing a list of sentences
# adjust the paragraph threshold via the `paragraph_threshold` argument.
wtp.split(text, do_paragraph_segmentation=True)
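For example, the nested output can be consumed like this (a minimal sketch; `text` is any input string):

# each paragraph is a list of sentences
for paragraph in wtp.split(text, do_paragraph_segmentation=True):
    for sentence in paragraph:
        print(sentence)
    print()  # blank line between paragraphs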

Adaptation

WtP can adapt to the Universal Dependencies, OPUS100 or Ersatz corpus segmentation style in many languages by punctuation adaptation (preferred) or threshold adaptation.

Punctuation Adaptation

# this requires a `lang_code`
# check the paper or `wtp.mixtures` for supported styles
wtp.split(text, lang_code="en", style="ud")

This also allows changing the threshold, but the threshold values are inherently higher since it is no longer the raw newline probability being thresholded:

wtp.split(text, lang_code="en", style="ud", threshold=0.7)

To get the default threshold for a style:

wtp.get_threshold("en", "ud", return_punctuation_threshold=True)

Threshold Adaptation

threshold = wtp.get_threshold("en", "ud")

wtp.split(text, threshold=threshold)
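Since the threshold is applied directly to the newline probability, lower values split more aggressively and higher values more conservatively; a quick sketch with arbitrary values:

wtp.split(text, threshold=0.05)  # more, shorter segments
wtp.split(text, threshold=0.9)   # fewer, longer segments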

Advanced Usage

Get the newline or sentence boundary probabilities for a text:

# returns newline probabilities (supports batching!)
wtp.predict_proba(text)

# returns sentence boundary probabilities for the given style
wtp.predict_proba(text, lang_code="en", style="ud")

Load a WtP model in HuggingFace transformers:

# import wtpsplit.models to register the custom models 
# (character-level BERT w/ hash embeddings and canine with language adapters)
import wtpsplit.models
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("benjamin/wtp-bert-mini") # or some other model name

** NEW ** Adapt to your own corpus using WtP_Punct:

Clone the repository:

git clone https://github.com/bminixhofer/wtpsplit
cd wtpsplit

Create your data:

import torch

torch.save(
    {
        "en": {
            "sentence": {
                "dummy-dataset": {
                    "meta": {
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ]
                }
            }
        }
    },
    "dummy-dataset.pth"
)

Run adaptation:

python3 wtpsplit/evaluation/adapt.py --model_path=benjamin/wtp-bert-mini --eval_data_path dummy-dataset.pth --include_langs=en

This should print something like

en dummy-dataset U=0.500 T=0.667 PUNCT=0.667
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.52it/s]
Wrote mixture to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini.skops
Wrote results to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini_intrinsic_results.json

i.e. it runs adaptation on your data and saves the mixtures and evaluation results. You can then load and use the mixture like this:

from wtpsplit import WtP
import skops.io as sio

wtp = WtP(
    "wtp-bert-mini",
    mixtures=sio.load(
        "wtpsplit/.cache/wtp-bert-mini.skops",
        ["numpy.float32", "numpy.float64", "sklearn.linear_model._logistic.LogisticRegression"],
    ),
)

wtp.split("your text here", lang_code="en", style="dummy-dataset")

... and adjust the dataset name, language and model in the above to your needs.

Reproducing the paper

configs/ contains the configs for the runs from the paper. We trained on a TPUv3-8. Launch training like this:

python wtpsplit/train/train.py configs/<config_name>.json

In addition:

  • wtpsplit/data_acquisition contains the code for obtaining evaluation data and raw text from the mC4 corpus.
  • wtpsplit/evaluation contains the code for:
    • intrinsic evaluation (i.e. sentence segmentation results) via adapt.py. The raw intrinsic results in JSON format are also at evaluation_results/
    • extrinsic evaluation on Machine Translation in extrinsic.py
    • baseline (PySBD, nltk, etc.) intrinsic evaluation in intrinsic_baselines.py
    • punctuation annotation experiments in punct_annotation.py and punct_annotation_wtp.py

Supported Languages

iso Name
af Afrikaans
am Amharic
ar Arabic
az Azerbaijani
be Belarusian
bg Bulgarian
bn Bengali
ca Catalan
ceb Cebuano
cs Czech
cy Welsh
da Danish
de German
el Greek
en English
eo Esperanto
es Spanish
et Estonian
eu Basque
fa Persian
fi Finnish
fr French
fy Western Frisian
ga Irish
gd Scottish Gaelic
gl Galician
gu Gujarati
ha Hausa
he Hebrew
hi Hindi
hu Hungarian
hy Armenian
id Indonesian
ig Igbo
is Icelandic
it Italian
ja Japanese
jv Javanese
ka Georgian
kk Kazakh
km Central Khmer
kn Kannada
ko Korean
ku Kurdish
ky Kirghiz
la Latin
lt Lithuanian
lv Latvian
mg Malagasy
mk Macedonian
ml Malayalam
mn Mongolian
mr Marathi
ms Malay
mt Maltese
my Burmese
ne Nepali
nl Dutch
no Norwegian
pa Panjabi
pl Polish
ps Pushto
pt Portuguese
ro Romanian
ru Russian
si Sinhala
sk Slovak
sl Slovenian
sq Albanian
sr Serbian
sv Swedish
ta Tamil
te Telugu
tg Tajik
th Thai
tr Turkish
uk Ukrainian
ur Urdu
uz Uzbek
vi Vietnamese
xh Xhosa
yi Yiddish
yo Yoruba
zh Chinese
zu Zulu

Citation

Please cite wtpsplit as

@inproceedings{minixhofer-etal-2023-wheres,
    title = "Where{'}s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation",
    author = "Minixhofer, Benjamin  and
      Pfeiffer, Jonas  and
      Vuli{\'c}, Ivan",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.398",
    pages = "7215--7235",
    abstract = "Many NLP pipelines split text into sentences as one of the crucial preprocessing steps. Prior sentence segmentation tools either rely on punctuation or require a considerable amount of sentence-segmented training data: both central assumptions might fail when porting sentence segmenters to diverse languages on a massive scale. In this work, we thus introduce a multilingual punctuation-agnostic sentence segmentation method, currently covering 85 languages, trained in a self-supervised fashion on unsegmented text, by making use of newline characters which implicitly perform segmentation into paragraphs. We further propose an approach that adapts our method to the segmentation in a given corpus by using only a small number (64-256) of sentence-segmented examples. The main results indicate that our method outperforms all the prior best sentence-segmentation tools by an average of 6.1{\%} F1 points. Furthermore, we demonstrate that proper sentence segmentation has a point: the use of a (powerful) sentence segmenter makes a considerable difference for a downstream application such as machine translation (MT). By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points over the best prior segmentation tool, as well as massive gains over a trivial segmenter that splits text into equally-sized blocks.",
}

Acknowledgments

Ivan Vulić is supported by a personal Royal Society University Research Fellowship ‘Inclusive and Sustainable Language Technology for a Truly Multilingual World’ (no 221137; 2022–). Research supported with Cloud TPUs from Google’s TPU Research Cloud (TRC). We thank Christoph Minixhofer for advice in the initial stage of this project. We also thank Sebastian Ruder and Srini Narayanan for helpful feedback on a draft of the paper.

Previous Version

This repository previously contained nnsplit, the precursor to wtpsplit. You can still use the nnsplit branch (or the nnsplit PyPI releases) for the old version; however, this is highly discouraged and no longer maintained! Please let me know if you have a use case which nnsplit can solve but wtpsplit cannot.

wtpsplit's People

Contributors

bminixhofer, dependabot[bot], kornelski, lvaughn, the0nix, wallies


wtpsplit's Issues

Error when loading model

When I run this command:
splitter = NNSplit.load('en')
This error occurred:

thread '<unnamed>' panicked at 'Once instance has previously been poisoned', library/std/src/sync/once.rs:394:21
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pyo3_runtime.PanicException: Once instance has previously been poisoned

EN model training in Google Colab

Hello, using Google Colab I was able to train a model for the Russian language.
But when I start training a model for the English language with trainer.fit(model), it floods the output with hundreds of messages:

[W108] The rule-based lemmatizer did not find POS annotation for the token ')'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'Marjorie'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'But'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'and'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'Daw'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'It'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token '('. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
[W108] The rule-based lemmatizer did not find POS annotation for the token 'made'. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.

and the connection to the runtime is lost.

Do you have an idea how I could get rid of these output messages?

Link to Colab: https://colab.research.google.com/drive/1xjrD1ZvkzLypbuYaywkf7yHy9UAjYq-g?usp=sharing
(the model is a bit changed to have an int8 input)

Python bindings frequently segfault

Python bindings frequently cause a segfault. This has apparently been resolved by PyO3/pyo3@7b1e8a6 which is not yet released.

Current solution: Depend on a pyo3@master:

pyo3 = {git = "https://github.com/PyO3/pyo3", rev = "e6f8fa7"}

and a rust-numpy fork which uses the same rev:

numpy = {git = "https://github.com/bminixhofer/rust-numpy"}

in bindings/python/Cargo.toml.

This should be changed as soon as PyO3 0.10.x is released.

[Bug] Pandas not listed in `install_requires`, resulting in import error

When I tried to import wtpsplit to try it out, the program failed with an import error.

It appears that wtpsplit uses pandas (and imports it at the top level), but does not list it in setup.py, so it doesn't automatically get installed when wtpsplit is installed.

I can make a PR if you'd like (though I know it is a very small thing, lol).

Apple Silicon / Arm support

What will it take to get support on Apple Silicon / ARM? I'm happy to help out with testing if that can be useful.

How to use Universal Dependencies style ?

I load the model from a local path:

from wtpsplit import WtP
wtp = WtP("/data/share/HuggingFace/custom/benjamin/wtp-bert-mini/")

wtp.split("This is a test This is another test.", lang_code="en", style="ud")

it returns:
ValueError: This model does not have any associated mixtures. Maybe they are missing from the model directory?

where can I download associated mixtures?

rust installation failure on 0.3.1

Followed your Rust instructions and am hitting this failure:

   Compiling nnsplit v0.3.1
error: couldn't read /home/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/nnsplit-0.3.1/src/../../models.csv: No such file or directory (os error 2)
  --> /home/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/nnsplit-0.3.1/src/model_loader.rs:10:23
   |
10 |         let raw_csv = include_str!("../../models.csv");
   |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to previous error

error: could not compile `nnsplit`.

When I look inside .cargo/registry/src/github.com-1ecc6299db9ec823/nnsplit-0.3.1/ I see there is no models.csv. If I manually add it by copying down raw, it still fails on the path.

Inconsistent results with same sentences

I found that the usage of your splitter model gives very inconsistent results. Take for example the Amharic language (lang_code="am").

If I take for example these two sentences from the flores 200 test dataset:

  1. ሪንግ ከተፎካካሪ የደህንነት ኩባንያም ADT ኮርፖሬሽን፣ ጋር ክስ መስርቷል።
  2. አንደ የሙከራ ክትባት የኢቦላን ገዳይነት ቢቀንስም፣ እስካሁን፣ ነባር በሽታዎችን እንዲያክም አመቺ ሆኖ የቀረበ ምንም መድሃኒት የለም።

If I concatenate these two strings and feed them into wtp.split(), it will produce 10 sentences:

  1. ሪንግ
  2. ከተፎካካሪ
  3. የደህንነት ኩባንያም ADT
  4. ኮርፖሬሽን፣ ጋር ክስ መስርቷል።
  5. አንደ የሙከራ ክትባት
  6. የኢቦላን ገዳይነት ቢቀንስም፣
  7. እስካሁን፣
  8. ነባር በሽታዎችን እንዲያክም
  9. አመቺ ሆኖ የቀረበ
  10. ምንም መድሃኒት የለም።

However if I give the algorithm more (7) sentences and concatenate them into a string it splits them all perfectly:
(Notice that the two sentences in the example above are included in the text below, sentence 3 and 4)

1.ፓናሉ ለንግድ መጀመር ገንዘብ በተከለከለበት በ2013 በሻርክ ታንክ ምዕራፍ ላይ ከቀረበ ወዲህ ሽያጭ እንደጨመረ ሲሚኖፍ ተነግሯል።
2. በ2017 መጨረሻ ላይ፣ ሲሚኖፍ በሽያጭ የቴሌቪዥን ጣቢያ ላይ ቀርቦ ነበር።
3. ሪንግ ከተፎካካሪ የደህንነት ኩባንያም ADT ኮርፖሬሽን፣ ጋር ክስ መስርቷል።
4. አንደ የሙከራ ክትባት የኢቦላን ገዳይነት ቢቀንስም፣ እስካሁን፣ ነባር በሽታዎችን እንዲያክም አመቺ ሆኖ የቀረበ ምንም መድሃኒት የለም።
5. አንድ የጸረ እንግዳ አካል፣ ZMapp፣ በዚህ መስክ ላይ ተስፋን አሳይቶ ነበር፣ ግን መደበኛ ጥናቶች ሞትን ለመከላከል ከተፈለገው ጥቅም ያነሰ እንዳለው ያሳያል።
6. በPALM ሙከራ፣ ZMapp እንደ መቆጣጠሪያ ያገለግል ነበር፣ ማለት ተመራማሪዎች እንደ መነሻ ይጠቀሙበት እና ከሌሎች ሶስት ህክምናዎች ጋር ያነጻጽሩታል።
7. የአሜሪካ ጂምናስቲ የዩናይትድ ስቴትስ ኦሎፒክ ኮሚቴ ደብዳቤ ይደግፋል እናም በሙሉ አስፈላጊነት የኦሎምፒክ ቤተሰብ ደህንነቱ የተጠበቀ አካባቢ ለሁሉም አትሌቶቻችን ማስተዋወቅ እንዳለበት ይቀበላል።

Can you explain this behaviour? It makes your algorithm very unpredictable, to be honest, and I fear this problem is also present in other languages, if I did not make any mistake. I called the splitter with the appropriate language at all times. Let me know what you think of this.

Build Python 3.9 Wheels

When trying to install into Python 3.9, it will not install a version later than 0.2.2. I am not certain, but I believe this is because wheels are only built for versions 3.6, 3.7 and 3.8. Would it be possible to add wheels for the 3.9 version?

The split does not look right for this particular case.

Hi,

My sentence is as shown below:
What's working and what needs to change? Not everybody Dr.Jones, has the opportunity to watch themselves after they've had a date to see what they're doing right or wrong, so that you will only know what to do in the next day. Yeah, but it's such an important exercise that they needed to do. Last week they went on their first date, which is a huge step for our single wives, and a great time for us to watch your dates..

When I split it using nnsplit the split sentences are shown below:

  • What's working and what needs to change?
  • Not everybody Dr.
  • Jones, has the opportunity to watch themselves after they've had a date to see what they're doing right or wrong, so that you will only know what to do in the next day.
  • Yeah, but it's such an important exercise that they needed to do.
  • Last week they went on their first date, which is a huge step for our single wives, and a great time for us to watch your dates..

I don't think this is right. Will you please let me know if these splits can be improved.

Can't run it on macOS

I want to give the module a try, but I am getting the following error by running the basic example on the readme. Module was installed with pip.

splitter = NNSplit.load("en")
AttributeError: type object 'NNSplit' has no attribute 'load'

Incorrect splits

Please report here issues similar to #18, i.e. text where it is easy for humans to see the correct split but NNSplit gets it wrong.

I'm not entirely satisfied with the quality of the models yet, such cases might help improve it.

Control where the model is downloaded to?

Hi,
This is more of a minor feature request. I'm trying to use NNSplit in a container, which has a read-only file system except for the /tmp dir. It would be groovy if one could provide a local path to load the model from/download to. Perhaps this is in the Python interface already but I couldn't see it.

I know you can specify a path when calling NNSplit(), but this gets more complicated as I'm including it in a module that then gets included in another project.

Anyway, nice work and thanks!

`AttributeError: 'InferenceSession' object has no attribute '_providers' Segmentation fault (core dumped)`

I was trying to segment sentences for my transcribing program, but I ran into this error when I first tried using it.

Full Error

Traceback (most recent call last):
  File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 280, in __init__
    self._create_inference_session(providers, provider_options)
  File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 307, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
RuntimeError: /onnxruntime_src/onnxruntime/core/platform/posix/env.cc:142 onnxruntime::{anonymous}::PosixThread::PosixThread(const char*, int, unsigned int (*)(int, Eigen::ThreadPoolInterface*), Eigen::ThreadPoolInterface*, const onnxruntime::ThreadOptions&) pthread_setaffinity_np failed


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "speech.py", line 436, in <module>
    transcript, source_align_data = transcript_audio(input_path, True, transcript_path, granularity=granularity)
  File "speech.py", line 271, in transcript_audio
    sentence_segmenter = NNSplit.load("en")
  File "backend.py", line 6, in create_session
  File "/home/runner/Voice-Synthasizer/venv/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 283, in __init__
    print("EP Error using {}".format(self._providers))
AttributeError: 'InferenceSession' object has no attribute '_providers'
Segmentation fault (core dumped)

Opus100 FR not in mixtures

Hi,

Table 8 in the paper indicates that the training data includes Opus100 FR. However, it seems not to be present in the mixtures I checked.

wtp = WtP("wtp-canine-s-12l")
wtp.split("Bonjour", lang_code="fr", style="opus100")

PanicException with 0.4.*

After installing the new version, I get the following exception when running NNSplit.load("en").

PanicException: called `Result::unwrap()` on an `Err` value: PyErr { type: Py(0x5632cc1b1140, PhantomData) }

This occurs both with and without onnxruntime-gpu installed.

Performance of Rust crate

I am getting fairly poor performance in release mode (CPU)… 2kB/s.

Is there a guide on using the GPU?

ValueError when using wtp-canine-s-12l-no-adapters on Danish

When using wtp-canine-s-12l-no-adapters for Danish with style "ud", I encounter a ValueError on one specific text.

Specs:

Python version: 3.9.15

Steps to reproduce:

In a clean environment, I only install wtpsplit (and missing requirement pandas).

text = 'Vinderne af Club Syds quiz er fundet\n06 februar 2012 kl. 16.58\nVinderne af Club Syds quiz er fundet. Stort tillykke til de tre vindere af en iPad. Quizzen fortsætter i denne uge, hvor præmierne er tre flotte fladskærms-TV.\nSidste uges rigtige svar var:\nFredericia Stadion (Monjasa Park)\nPræmierne er en iPad til hver af de heldige vindere, og de er nu på vej til:\nJørgen Ladegaard\ni Asperup\nIngelise Smith Hansen\ni Haderslev\nog \nGudrun Zederkof\nLunderskov\n'
model = WtP("wtp-canine-s-12l-no-adapters")
sents = model.split(text, lang_code="da", style="ud")

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.env/lib/python3.9/site-packages/wtpsplit/__init__.py", line 285, in split
    return next(
  File "/.env/lib/python3.9/site-packages/wtpsplit/__init__.py", line 365, in _split
    for text, probs in zip(
  File "/.env/lib/python3.9/site-packages/wtpsplit/__init__.py", line 232, in _predict_proba
    outer_batch_logits = extract(
  File "/.env/lib/python3.9/site-packages/wtpsplit/extract.py", line 175, in extract
    out = model(
  File "/.env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/.env/lib/python3.9/site-packages/transformers/models/canine/modeling_canine.py", line 1521, in forward
    outputs = self.canine(
  File "/.env/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/.env/lib/python3.9/site-packages/transformers/models/canine/modeling_canine.py", line 1145, in forward
    molecule_attention_mask = self._downsample_attention_mask(
  File "/.env/lib/python3.9/site-packages/transformers/models/canine/modeling_canine.py", line 1061, in _downsample_attention_mask
    batch_size, char_seq_len = char_attention_mask.shape
ValueError: too many values to unpack (expected 2)

InconsistentVersionWarning Issue everytime I start wtp

Hi, everytime I run a wtp object I get the following warning;

InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.2.2 when using version 1.3.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(

and ends. Any ideas on how to tackle this?
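One way to silence the warning, as a sketch (assuming scikit-learn >= 1.3, where the warning class lives in sklearn.exceptions; this hides the message but does not resolve the underlying version mismatch):

import warnings
from sklearn.exceptions import InconsistentVersionWarning

# suppress the unpickling version-mismatch warning
warnings.filterwarnings("ignore", category=InconsistentVersionWarning)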

Scoring metric, does definition make sense?

I looked more into the scoring metric and noticed something. You score based on the indices of predicted sentences. However, if you for example split two sentences and predict two arbitrary indices (true indices, that is), let's say [23, 83], the scoring is only based on the index 23. Why is that? Because we score the splits: two sentences equal one split, so while 23 marks the split, 83 only marks the end of the sentence. This makes sense in a way ... or maybe not, I am not sure. Because if you think about it, even if the algorithm does not recognize the last symbol as the end of a sentence, it will still give the index 83, since it is given by [len(s) for s in predicted_sentences].

Let's assume you have three sentences now, with the true indices [23, 83, 140, 158], and let's say for some reason wtpsplit can't recognize the middle sentence. It would return [23, 140, 158] and a smaller F1 score. However, if I input the sentences separately, like [23, 83] and [140, 158], the F1 score would be 1, because 83 and 158 are never considered for scoring. This makes the score dependent on the number of sentences. For example, if I score a dataset by aggregating two lines (each representing a sentence) in a loop, the results would be much better than if I did it with 5 lines or even 10. There is also a risk of losing data, unless you carry the last sentence of each iteration into the next.

Sorry for the text blob, but maybe you guys know a best practice for such a problem :)
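The grouping effect described above can be made concrete with a toy boundary-index F1 (my own sketch, not the repository's evaluation code):

# toy boundary-index F1 to illustrate the grouping effect
def boundary_f1(pred, true):
    pred, true = set(pred), set(true)
    if not pred or not true:
        return 0.0
    tp = len(pred & true)
    p, r = tp / len(pred), tp / len(true)
    return 2 * p * r / (p + r) if p + r else 0.0

# three sentences scored together: the missed middle boundary costs F1
print(boundary_f1([23, 140, 158], [23, 83, 140, 158]))  # ~0.857

# scored as two pairs with the text-final indices (83, 158) excluded,
# the miss is invisible and F1 is perfect
print(boundary_f1([23], [23]), boundary_f1([140], [140]))  # 1.0 1.0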

Publish 0.3.x python wheels for Linux/non-macOS platforms

After running into #13, I tried to use the Python bindings instead. It worked, but I noticed that it installed version 0.2.2 (I saw it didn't match up with the documentation in the README).

After digging into it a little bit, I saw that 0.2.2 was the last release with a platform-agnostic wheel available. All 0.3.x wheels seem to be built specifically for macOS, and are not installable on my Linux/Ubuntu machine.

I'm wondering if there are some easy adjustments that could be made to make publishing wheels for all platforms again possible (or at least Linux/Ubuntu 😇 )?

Unusual splits in short sentence

Hello, thank you for your great work!

I noticed unusual splits in a short sentence. I assume this is due to the name.

Is there any way to detect this?

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")

issue = """‘Make sure it does,’ Vaughn said."""
expected = ["""‘Make sure it does,’ Vaughn said."""]

wtp.split(issue, lang_code="en")

# wrong ['‘Make sure it does,’ ', 'Vaughn ', 'said.']

wtp.split(issue,  lang_code="en",  style="ud")

# wrong ['‘Make sure it does,’ ', 'Vaughn said.']

wtp.split(issue,  lang_code="en",  style="opus100")

# correct ['‘Make sure it does,’ Vaughn said.']

wtp.split(issue,  lang_code="en",  style="ersatz")

# wrong ['‘Make sure it does,’ ', 'Vaughn said.']

wtp.split(issue,  lang_code="en", threshold=0.99)

# correct ['‘Make sure it does,’ Vaughn said.']

Tested: Version 1.0.1 , colab CPU

ImportError in Python (NNSplit)

Hi, I was trying the simple example in Python from the documentation and I'm getting an ImportError:

from nnsplit import NNSplit
splitter = NNSplit.load("en")

# returns `Split` objects
splits = splitter.split(["This is a test This is another test."])[0]

# a `Split` can be iterated over to yield smaller splits or stringified with `str(...)`.
for sentence in splits:
   print(sentence)

When executing this example I'm getting the following error:

Traceback (most recent call last):
  File "nnsplit.py", line 1, in <module>
    from nnsplit import NNSplit
  File "G:\OneDrive\projects\s\nnsplit.py", line 1, in <module>
    from nnsplit import NNSplit
ImportError: cannot import name 'NNSplit' from partially initialized module 'nnsplit' (most likely due to a circular import) (G:\OneDrive\projects\s\nnsplit.py)

I have installed the packages in a new conda environment, executing pip list installed I have:

pip list installed
Package         Version
--------------- -------------------
certifi         2020.12.5
nnsplit         0.5.7.post0
numpy           1.20.3
onnxruntime     1.7.0
onnxruntime-gpu 1.7.0
pip             21.1.1
protobuf        3.17.1
setuptools      52.0.0.post20210125
six             1.16.0
tqdm            4.61.0
wheel           0.36.2
wincertstore    0.2

Can nnsplit use an http proxy?

For some reasons I can't directly fetch the resources required by nnsplit. For example,

splitter = NNSplit.load("fr")
nnsplit.ResourceError: network error fetching "model.onnx" for "fr"

I'm pretty sure this is the local network issue because when I switch to other networks, it works.

So I'm wondering if there's any method to use an http proxy instead of directly sending a network request? I've tried to set the environment variables like http_proxy and https_proxy on windows and they didn't work.

Porting to Android

Hi, I am trying to run the onnx model on Android and have started with the steps described here: https://github.com/onnx/tutorials/blob/master/tutorials/PytorchCaffe2MobileSqueezeNet.ipynb

import onnx
import caffe2.python.onnx.backend
from onnx import helper

# Load the ONNX GraphProto object. Graph is a standard Python protobuf object
model = onnx.load("model.onnx")

Unfortunately, I receive an error:

---------------------------------------------------------------------------
DecodeError                               Traceback (most recent call last)
<ipython-input-8-0e15f43f99e0> in <module>()
      1 # Load the ONNX GraphProto object. Graph is a standard Python protobuf object
----> 2 model = onnx.load("model.onnx")
      3 

2 frames
/usr/local/lib/python3.6/dist-packages/onnx/__init__.py in _deserialize(s, proto)
     95                          '\ntype is {}'.format(type(proto)))
     96 
---> 97     decoded = cast(Optional[int], proto.ParseFromString(s))
     98     if decoded is not None and decoded != len(s):
     99         raise google.protobuf.message.DecodeError(

DecodeError: Error parsing message

Could you please advise what the issue could be? I use the EN model and Google Colab.

use concrete error types

There is a bit of an issue with the error bounds in rust when being as lax as Box<dyn Error> - most error frameworks expect the error type bounded to be Error + Send + 'static.

For a library it's common to implement a custom error type which is then exposed to the user, which wraps all possible internal error types. Currently the tool of choice (imho) is thiserror.

Moving to concrete error types rather than dyn boxes would be a much appreciated step.

Model(s) use word capitalisation to segment

Hi,

The models tested in English and a few other languages seem to rely on capitalisation to detect sentence boundaries. On our dataset, if the capitalisation at the start of target sentences is retained, the F1 score is as high as 0.90 for certain model+style+threshold combinations. If the sentence boundary starts are lowercased, then the best F1 score drops to 0.3.

Example:
with 'wtp-bert-mini' the sentence 'We are running a test We should should get two sentences' will split but 'We are running a test we should should get two sentence' won't split.
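A minimal script to reproduce the reported behaviour (a sketch based on the description above, using the reporter's exact inputs):

from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")
# reportedly splits into two sentences:
print(wtp.split("We are running a test We should should get two sentences"))
# reportedly stays as one sentence:
print(wtp.split("We are running a test we should should get two sentence"))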

I am not sure if this is expected behaviour or an issue.

Thanks

show progress

I'm currently using nnsplit on a fairly big dataset. Is it possible to track progress on a long list of inputs?
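As far as I know, nnsplit does not expose a progress callback; a workaround sketch is to feed the inputs in chunks and wrap the loop in tqdm (`texts`, `splitter` and the chunk size are placeholders):

from tqdm import tqdm

chunk_size = 256
results = []
# progress advances once per processed chunk
for i in tqdm(range(0, len(texts), chunk_size)):
    results.extend(splitter.split(texts[i:i + chunk_size]))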

Recursion in __init__

Hi - hopefully this is just something I am constructing incorrectly, but I am getting recursion in __init__ which results in an error with wtpsplit==1.2.0.

My code is running inside joblib but is just doing:

self.sentence_splitter = WtP("wtp-canine-s-12l")
for sentence in self.sentence_splitter.split(text, lang_code=self.language):
    yield sentence

And I get:

process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.8/dist-packages/wtpsplit/__init__.py", line 115, in __getattr__
    return getattr(self.model, name)
  File "/usr/local/lib/python3.8/dist-packages/wtpsplit/__init__.py", line 115, in __getattr__
    return getattr(self.model, name)
  File "/usr/local/lib/python3.8/dist-packages/wtpsplit/__init__.py", line 115, in __getattr__
    return getattr(self.model, name)
  [Previous line repeated 988 more times]
RecursionError: maximum recursion depth exceeded

Let me know if any thoughts

Thanks

Jon

Missing file in NPM package?

I'm trying to import nnsplit in a JavaScript project, and webpack is failing with:

./node_modules/nnsplit/nnsplit.bundle/nnsplit_javascript_bg.wasm
Module not found: Can't resolve './nnsplit_javascript_bg.js' in '/tmp/experiment/node_modules/nnsplit/nnsplit.bundle'

Looking in node_modules/nnsplit/nnsplit.bundle, indeed the file nnsplit_javascript_bg.js is referenced by package.json, but missing from the filesystem.

(Not sure though whether that's the real culprit, as the nodejs example seems to work as intended.)

Use ONNX models everywhere due to TorchScript instability

Hey, there! I was trying to run the Rust example from the README, but got the following error on a cargo run:

Error: Compat { error: TorchError { c_error: "The following operation failed in the TorchScript interpreter.\nTraceback of TorchScript, serialized code (most recent call last):\n  File \"code/__torch__/torch/nn/quantized/dynamic/modules/rnn.py\", line 195, in __setstate__\n    state: Tuple[Tuple[Tensor, Optional[Tensor]], bool]) -> None:\n    _72, _73, = (state)[0]\n    _74 = ops.quantized.linear_prepack(_72, _73)\n          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE\n    self.param = _74\n    self.training = (state)[1]\n\nTraceback of TorchScript, original code (most recent call last):\n  File \"/usr/local/lib/python3.6/dist-packages/torch/nn/quantized/dynamic/modules/rnn.py\", line 29, in __setstate__\n    @torch.jit.export\n    def __setstate__(self, state):\n        self.param = torch.ops.quantized.linear_prepack(*state[0])\n                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE\n        self.training = state[1]\nRuntimeError: Didn\'t find engine for operation quantized::linear_prepack NoQEngine\n" } }

Let me know it there is any more info you need for debugging!

Simplified Chinese model does not detect sentence boundaries correctly

Hi,

I have tried the Simplified Chinese model on the demo page and it seems that sentence boundary and token detection are not correct.

I have 2 ideas why that could happen:

  1. Period in Chinese is 。
  2. There are no white spaces between words. Possibly it is better to use something like https://github.com/voidism/pywordseg to split on words as a preprocessing step

It looks like issue 2 causes tokens to not be detected correctly either. I have compared with https://github.com/voidism/pywordseg results and they do not match. But I am not sure here, because I have compared spaCy, pywordseg and the Stanford Word Segmenter and all of them provide different results.

Hi, pip install nnsplit doesn't work

Hello, first of all, nnsplit is really cool, it's really great stuff. :)
I'd really like to run nnsplit on my local computer, but an error occurs when I try to pip install nnsplit:

ERROR: Could not find a version that satisfies the requirement nnsplit (from versions: 0.0.1, 0.1.0, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.2.0, 0.2.1, 0.2.2)
ERROR: No matching distribution found for nnsplit

Can I get some help?

Language wishlist

A list of languages currently considered for training and adding to the Repo:

  • Swedish
  • Norwegian
  • French
  • Turkish
  • Simplified Chinese
  • Russian
  • Ukrainian
  • Catalan
  • Dutch
  • Farsi
  • Italian
  • Portuguese
  • Spanish
  • Vietnamese
  • Traditional Chinese

I'll see if I can train models for languages on this list. If you want to speed it up, just train it yourself following https://github.com/bminixhofer/nnsplit/blob/master/train/train.ipynb :)

Unable to use own trained onnx models

Hello and first of all: thank you for a great library!

I've tried to train my own model using an unusual input data format, following the training Python notebook you've provided. However, after training, when trying to load the custom model via the NNSplit.load("en/model.onnx") call in the Python bindings, I get this:

nnsplit.ResourceError: model not found: "en/model.onnx"

I may be wrong, but it seems the current logic of model_loader.rs does not allow custom local paths, only the ones that are listed in the models.csv:

https://github.com/bminixhofer/nnsplit/blob/a5a15815382029bf5c3438fd4753f644847d4dbf/nnsplit/src/model_loader.rs#L59

Effectively limiting the available models to the pretrained ones.

For GPU, ONNX WtP model is around 2x slower than PyTorch.

import time
from wtpsplit import WtP

wtp = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"])


def make_sentence(seg):
  sentences = wtp.split(seg, lang_code="en", style="ud", threshold=0.975)
  sentences = [x.strip() for x in sentences]
  return sentences

timelist_fox = []

for i in range(20):
  start = time.time()
  input_text = "The quick brown fox jumps over the lazy dog. El zorro marrón rápido salta sobre el perro perezoso. I went to see the p. t. barnum circus today!"
  sentences = make_sentence(input_text)
  end = time.time()
  print(sentences)
  print("Runtime for sentence segmentation", end - start)
  timelist_fox.append(end - start)

print()
# Get average runtime
print("Average runtime for sentence segmentation", sum(timelist_fox)/len(timelist_fox))

And I get

['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.09200406074523926
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.15698647499084473
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.07426166534423828
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.07866954803466797
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.09979438781738281
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.08975934982299805
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.09622359275817871
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.09634947776794434
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.07365036010742188
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.10837149620056152
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.0805506706237793
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.08892273902893066
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.04485893249511719
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.10665750503540039
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.05337262153625488
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.040402889251708984
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.03861117362976074
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.04022550582885742
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.03824734687805176
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.0443730354309082

Average runtime for sentence segmentation 0.07711464166641235

Whereas if I replace the line wtp = WtP("wtp-bert-mini", ort_providers=["CUDAExecutionProvider"]) with

wtp = WtP("wtp-bert-mini")
wtp.half().to("cuda")

I get

['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 3.6466240882873535
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.021060943603515625
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.014858007431030273
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.015185832977294922
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.021528959274291992
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.014949560165405273
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.014830350875854492
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.013895034790039062
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.013033628463745117
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.017659902572631836
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.018916606903076172
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.013854742050170898
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.025988340377807617
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.01674795150756836
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.015290498733520508
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.012728214263916016
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.016968250274658203
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.01860976219177246
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.021869421005249023
['The quick brown fox jumps over the lazy dog.', 'El zorro marrón rápido salta sobre el perro perezoso.', 'I went to see the p. t. barnum circus today!']
Runtime for sentence segmentation 0.01503300666809082

Average runtime for sentence segmentation 0.19848165512084961

Although the PyTorch implementation is slower on average because of the outlier from the first run, removing that initial outlier makes it faster on average than the ONNX run.

I see the inputs are not bound to the GPU in (https://github.com/bminixhofer/wtpsplit/blob/main/wtpsplit/extract.py). Could you please try binding them to see if it is faster?
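For reference, a generic onnxruntime IO-binding sketch (the model path and the input/output names here are hypothetical; the actual names depend on the exported WtP graph):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
binding = sess.io_binding()

# copy the input to the GPU once instead of on every run ("input_ids" is assumed)
x = np.zeros((1, 512), dtype=np.int64)
x_ort = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
binding.bind_ortvalue_input("input_ids", x_ort)
binding.bind_output("logits", "cuda")  # keep the output on the GPU too

sess.run_with_iobinding(binding)
logits = binding.get_outputs()[0].numpy()  # copies back to CPU only here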

node.js: unload wasm module

Hello, thanks for this project!
I'm trying to unload the wasm module in my node code:

const nnsplit = require("nnsplit");

function run() {
    nnsplit.NNSplit.new("/root/nnsplit_models/en/model.onnx")
    .then(splitter => {
        return splitter.split(["This is a test This is another test."])
    })
    .then(results => {
        let splits = results[0];
        console.log(splits.parts.map((x) => x.text)); // to log sentences, or x.parts to get the smaller subsplits
    })
    .catch(error => {
        console.error(error);
    })
}
run();

When running this script, the console stays open, since some resource must be released. I assume it is the tractjs model that should be released in some way (likely using destroy()).

Thanks

Async - Skops import is failing

I am trying with 1.2.1 and 1.2.3 but I have issues like:

1.2.1:

  File "/usr/local/lib/python3.10/site-packages/exorde/prepare_batch.py", line 10, in <module>
    from wtpsplit import WtP
  File "/usr/local/lib/python3.10/site-packages/wtpsplit/__init__.py", line 11, in <module>
    import skops.io as sio
ModuleNotFoundError: No module named 'skops.io'

or with just your latest version 1.2.3:


   from wtpsplit import WtP
  File "/usr/local/lib/python3.10/site-packages/wtpsplit/__init__.py", line 11, in <module>
    import skops.io as sio
  File "/usr/local/lib/python3.10/site-packages/skops/io/__init__.py", line 1, in <module>
    from ._persist import dump, dumps, get_untrusted_types, load, loads
  File "/usr/local/lib/python3.10/site-packages/skops/io/_persist.py", line 22, in <module>
    module = importlib.import_module(module_name, package="skops.io")
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.10/site-packages/skops/io/_general.py", line 16, in <module>
    from ._trusted_types import (
  File "/usr/local/lib/python3.10/site-packages/skops/io/_trusted_types.py", line 17, in <module>
    SCIPY_UFUNC_TYPE_NAMES = get_public_type_names(module=scipy.special, oftype=np.ufunc)
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 230, in get_public_type_names
    {
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 234, in <setcomp>
    and (type_name := get_type_name(obj)).startswith(module_name)
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 179, in get_type_name
    return f"{get_module(t)}.{t.__name__}"
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 86, in get_module
    return whichmodule(obj, obj.__name__)
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 49, in whichmodule
    if _getattribute(module, name)[0] is obj:
  File "/usr/local/lib/python3.10/site-packages/skops/io/_utils.py", line 24, in _getattribute
    obj = getattr(obj, subpath)
TypeError: __getattr__() missing 1 required positional argument: 'name'

I am using Python 3.10.11
Any ideas? I can't seem to simply import your lib.

Error loading model to GPU

Version: 1.2.0

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l-no-adapters")
wtp.to("cuda")

throws,

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
test.ipynb Cell 6 in 4
      [1] from wtpsplit import WtP
      [3] wtp = WtP("wtp-canine-s-12l-no-adapters")
----> [4] wtp.to("cuda")

File (site-packages/wtpsplit/__init__.py:115), in WtP.__getattr__(self, name)
    114 def __getattr__(self, name):
--> 115     return getattr(self.model, name)

AttributeError: 'PyTorchWrapper' object has no attribute 'to'

The issue is due to the nested wtp.model.model not being handled by the __getattr__ method.
Calling wtp.model.model.to("cuda") works.
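In full, the workaround looks like this:

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l-no-adapters")
# address the nested model directly until __getattr__ handles it
wtp.model.model.to("cuda")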

More language support.

Hi, a lot of thanks for your project.

In the README, it says:

Alternatively, you can also load your own model.

Where can I find models for other languages besides English and German? Or could you tell me how to train my own model for other languages step by step? I'm happy to contribute by providing more models.

Thank you,
Guangrui Wang

get_threshold does not work

Hi!
I'm trying to test functionality from the README.md, at this step:

from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l")

wtp.get_threshold("en", "ud")
AttributeError                            Traceback (most recent call last)

<ipython-input-41-b7dd80e9f417> in <cell line: 1>()
----> 1 wtp.get_threshold("en", "ud")

1 frames

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in __getattr__(self, name)
   1612             if name in modules:
   1613                 return modules[name]
-> 1614         raise AttributeError("'{}' object has no attribute '{}'".format(
   1615             type(self).__name__, name))
   1616 

AttributeError: 'LACanineForTokenClassification' object has no attribute 'get_threshold'

Colab:
torch 2.0.1+cu118
huggingface-hub-0.15.1
safetensors-0.3.1
skops-0.7.post0
tokenizers-0.13.3
transformers-4.30.2
wtpsplit-1.0.1

Any string whose length isn't a multiple of 4 causes an assert failure

Hi,

Any string whose length isn't a multiple of 4 causes an assert failure at line 548 in models.py:
"assert char_encoding.shape[1] % self.conv.stride[0] == 0"

stride is initialised to config.downsampling_rate (4) in modeling_canine.py in the transformers lib.

Sample code causing assert failure (length of input string is 35):
from wtpsplit import WtP
wtp = WtP("wtp-canine-s-12l")
wtp.split("This is a test This is another test", lang_code="en")

Sample code that works (with an added full stop that makes the length of the input string 36):
from wtpsplit import WtP
wtp = WtP("wtp-canine-s-12l")
wtp.split("This is a test This is another test.", lang_code="en")

ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found

I run it on CentOS 7 and Python 3.8.3, and I just want to run it on CPU, not GPU. I get the following error:

>>> import nnsplit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /home/zyb/miniconda3/lib/python3.8/site-packages/nnsplit.cpython-38-x86_64-linux-gnu.so)

Could not find a mixture for the Universal Dependencies (UD) style in Thai language

I have been trying to use wtpsplit in the Thai language with the 'ud' style:

# specify language code to be 'th' and style='ud' according to the paper
wtp.split(text, lang_code="th", style='ud')

However, it returned an error:

ValueError: Could not find a mixture for the style 'ud'.

I also checked in the language_info.csv file and found that the UD style is also supported in the Thai language as UD_Thai-PUD

I have tried another supported style, such as OPUS100, and found that it is usable; only the UD style returns an error. Is this a bug or did I misunderstand something?

Thank you
