Comments (12)
The situation is a bit better in v2.0, but there are still plenty of errors. After a discussion with @honnibal on Twitter, it turned out the French model was trained on the shuffled Sequoia treebank version as provided by the UD people, so the segmentation learning process was likely messed up because of that. It's also a small treebank, so there aren't enough data points to learn segmentation accurately.
from spacy.
Hey,
the sentence segmenter for French is very bad :(
It cuts sentences in the middle, right after an adjective, or doesn't follow basic rules (a period followed by \n, for example), even if I replace each \n with \n\n.
I'm using this kind of template:
nlp = spacy.load('fr')
USTR_mydoc = USTR_get_1string_from_file("/dev/stdin")
doc = nlp(USTR_mydoc)
print([(w.text, w.pos_) for w in doc])
(USTR_get_1string_from_file is just a function that returns a whole file as a string.)
Is there any way to train the sentence segmentation on custom data? If so, it would be great if someone could provide examples as well.
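One possible workaround in the meantime (a sketch, not official spaCy training guidance): spaCy lets you override sentence boundaries by setting token.is_sent_start yourself. The rule below — "every newline token starts a new sentence" — is purely hypothetical, chosen to match the \n complaint above; it uses a blank French pipeline so no trained model is needed.

```python
import spacy

# Sketch, assuming spaCy v2+ (spacy.blank and Token.is_sent_start exist).
# Hypothetical rule for illustration: every "\n" token starts a new sentence.
nlp = spacy.blank("fr")  # tokenizer only, no trained model required
doc = nlp("Bonjour tout le monde.\nComment allez-vous ?")

for token in doc[:-1]:
    if token.text == "\n":
        # Mark the token after the newline as a sentence start.
        doc[token.i + 1].is_sent_start = True

print([sent.text for sent in doc.sents])
```

In a real pipeline with a parser you would wrap this in a pipeline component added before the parser, so the parser respects the pre-set boundaries instead of overriding them.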
You could do something like this.
In [34]: text = u'''Python packaging is awkward at the best of times, and it’s particularly tricky with C extensions, built via Cython, requiring large data files. So, please report issues as you encounter them, and bear with me :)'''
In [35]: tokens = nlp(text, parse=True)
In [36]: for s in tokens.sents:
   ....:     print ''.join(tokens[x].string for x in range(*s))
   ....:
Python packaging is awkward at the best of times, and it’s particularly tricky with C extensions, built via Cython, requiring large data files.
So, please report issues as you encounter them, and bear with me :)
I think it might be a good idea to make sentence segmentation more visible; at first glance, people seem to assume it isn't easy to do, or even possible.
e.g. http://tech.grammarly.com/blog/posts/How-to-Split-Sentences.html
@honnibal what would be the spaCy way of implementing a sentence splitter?
The example that Jim posted above is the current solution --- the tokens.sents attribute gives a (start, end) iterator. Unfortunately the Tokens.__getitem__ method doesn't accept a slice at the moment, so use range(start, end). This will be fixed in the next version.
The segmentation works a little differently from others. It uses the syntactic structure, not just the surface clues from the punctuation.
spaCy is unique among current parsers in parsing whole documents, instead of splitting first into sentences, and then parsing the resulting strings. This is possible because the algorithm is linear time, whereas a lot of previous parsers use polynomial time parsing algorithms.
This means the sentence boundary detection is often robust to difficult cases:
>>> tokens = nlp(u"If Harvard doesn't come through, I'll take the test to get into Yale. many parents set goals for their children, or maybe they don't set a goal.")
>>> for start, end in tokens.sents:
... print ''.join(tokens[i].string for i in range(start, end))
... print
...
If Harvard doesn't come through, I'll take the test to get into Yale.
many parents set goals for their children, or maybe they don't set a goal.
I just tried this example from the Grammarly examples, and it works correctly. Because spaCy parses this as two clauses, it puts the sentence break in the correct place, even though "many" is lower-cased.
I haven't highlighted this yet because I still haven't sorted out better training and evaluation data. I need to train models on more web text; the current model is based on Wall Street Journal text.
All of this has been delayed by me dropping everything to do a demo for a major client. If I secure this, then I can say that spaCy's development is secure for the immediate future, and I'll even be able to hire additional help. But for the last month, things have been less smooth than I'd like. I hope to have all this sorted out soon.
Hmm, I see. Thanks for the response. I was curious because I'd love to contribute, but I still need to do a bit of reading on Cython and go over this project in more depth.
Improved sentence segmentation now included in the latest release. Docs are updated with usage.
I have a similar issue with Spanish
text = """
En estadística, un error sistemático es aquel que se produce de igual modo en todas las mediciones que se realizan de una magnitud. Puede estar originado en un defecto del instrumento, en una particularidad del operador o del proceso de medición, etc. Se contrapone al concepto de error aleatorio.
"""
nlp = spacy.load("es")
doc = nlp(text)
for span in doc.sents:
    print("#> span:", span)
this gives
#> span: En estadística, un error sistemático es
#> span: aquel que se produce de igual modo en todas las mediciones que se realizan de una magnitud.
...
as you can see the first "span" isn't a full sentence.
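If the parser-based boundaries are unreliable for a language, one possible workaround (a sketch, not an official fix) is spaCy's rule-based sentencizer, which splits on sentence-final punctuation instead of the dependency parse. The version check below is hedged to cover both the v2 and v3 add_pipe APIs, and a blank Spanish pipeline is used so no trained model is required:

```python
import spacy

# Sketch: rule-based splitting, no trained Spanish model required.
nlp = spacy.blank("es")
try:
    nlp.add_pipe("sentencizer")                    # spaCy v3 API
except Exception:
    nlp.add_pipe(nlp.create_pipe("sentencizer"))   # spaCy v2 API

text = ("En estadística, un error sistemático es aquel que se produce de "
        "igual modo en todas las mediciones que se realizan de una magnitud. "
        "Se contrapone al concepto de error aleatorio.")
doc = nlp(text)
for span in doc.sents:
    print("#> span:", span.text)
```

Note the trade-off: this fixes mid-sentence breaks like the one above, but since it only looks at punctuation it won't recover sentence boundaries that lack it.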
How exactly does the sentence separation work? Is it based on regular expressions (considering punctuation like full stops and question marks)?
Hey @syllog1sm, I am experimenting with your approach but I am getting this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-25-edf97381b54c> in <module>()
3 range1 = lambda start, end: range(start, end+1)
4
----> 5 for start, end in en_doc.sents:
6 print(''.join(tokens[i].string for i in range(start, end)))
ValueError: too many values to unpack (expected 2)
Do you have any idea why? I am no expert in Python.
I am using Python 3.6 and spaCy 2.0.5.
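The unpack error is most likely an API change: in spaCy v2, Doc.sents yields Span objects rather than (start, end) tuples, so `for start, end in en_doc.sents` has too many values to unpack. Iterating over the spans directly should work. A minimal runnable sketch (using the rule-based sentencizer so it works without a trained model; with the "en" model the parser would set the boundaries instead):

```python
import spacy

# Sketch, assuming spaCy v2+: Doc.sents yields Span objects, not tuples.
nlp = spacy.blank("en")
try:
    nlp.add_pipe("sentencizer")                    # spaCy v3 API
except Exception:
    nlp.add_pipe(nlp.create_pipe("sentencizer"))   # spaCy v2 API

en_doc = nlp("I'd love to contribute. I still need to read up on Cython.")
for sent in en_doc.sents:   # each item is a Span; no unpacking needed
    print(sent.text)
```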
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.