
Sentence segmentation example? · spacy (closed, 12 comments)

yang commented on May 1, 2024
Sentence segmentation example?


Comments (12)

dseddah commented on May 1, 2024

The situation is a bit better in v2.0, but there are still plenty of errors. After a discussion with @honnibal on Twitter, it turned out the French model was trained on the shuffled version of the Sequoia treebank as provided by the UD people, so it's likely that the segmentation learning process was thrown off by that. Also, it's a small treebank, so there aren't enough data points to learn segmentation accurately.


dseddah commented on May 1, 2024

Hey,
the sentence segmenter for French is very bad :(
It cuts sentences in the middle, right after an adjective, or doesn't follow basic rules (a period followed by \n, for example), even if I replace each \n with \n\n.
I'm using this kind of template:

import spacy

nlp = spacy.load('fr')
# read the whole input file into one string
USTR_mydoc = USTR_get_1string_from_file("/dev/stdin")
doc = nlp(USTR_mydoc)
# print the token/POS pairs
print([(w.text, w.pos_) for w in doc])

(USTR_get_1string_from_file is just a function that returns a whole file as a single string)


sirjan13 commented on May 1, 2024

Is there any way to train the sentence segmentation on custom data? If so, it would be great if someone could provide examples as well.

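There isn't an official recipe for retraining the segmenter alone, but in spaCy v2 you can override boundaries without any training by adding a component that sets is_sent_start before the parser runs. A minimal sketch, assuming an English model is installed (the model name and the semicolon rule are just examples):

import spacy

def set_custom_boundaries(doc):
    # Example rule: force a sentence start after every semicolon.
    # The parser respects boundaries that are already set.
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load('en_core_web_sm')  # assumed model name
nlp.add_pipe(set_custom_boundaries, before='parser')
doc = nlp(u"This is one clause; this will now be its own sentence.")
print([sent.text for sent in doc.sents])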

geovedi commented on May 1, 2024

You could do something like this.

In [34]: text = u'''Python packaging is awkward at the best of times, and it’s particularly tricky with C extensions, built via Cython, requiring large data files. So, please report issues as you encounter them, and bear with me :)'''

In [35]: tokens = nlp(text, parse=True)

In [36]: for s in tokens.sents:
   ....:     print ''.join(tokens[x].string for x in range(*s))
   ....:
Python packaging is awkward at the best of times, and it’s particularly tricky with C extensions, built via Cython, requiring large data files.
So, please report issues as you encounter them, and bear with me :)


metasyn commented on May 1, 2024

I think it might be a good idea to make sentence segmentation more visible, since at first glance people seem to assume that it might not be easy to do, or even possible.

e.g. http://tech.grammarly.com/blog/posts/How-to-Split-Sentences.html

@honnibal what would be the spaCy way of implementing a sentence splitter?


syllog1sm commented on May 1, 2024

The example that Jim posted above is the current solution --- the tokens.sents attribute gives a (start, end) iterator. Unfortunately the Tokens.__getitem__ method doesn't accept a slice at the moment, so use range(start, end). This will be fixed in the next version.

The segmentation works a little differently from other tools. It uses the syntactic structure, not just the surface clues from the punctuation.

spaCy is unique among current parsers in parsing whole documents, instead of splitting first into sentences, and then parsing the resulting strings. This is possible because the algorithm is linear time, whereas a lot of previous parsers use polynomial time parsing algorithms.

This means the sentence boundary detection is often robust to difficult cases:

>>> tokens = nlp(u"If Harvard doesn't come through, I'll take the test to get into Yale. many parents set goals for their children, or maybe they don't set a goal.")
>>> for start, end in tokens.sents:
...     print ''.join(tokens[i].string for i in range(start, end))
...     print
...
If Harvard doesn't come through, I'll take the test to get into Yale. 

many parents set goals for their children, or maybe they don't set a goal.

I just tried this example from the Grammarly examples, and it works correctly. Because spaCy parses this as two clauses, it puts the sentence break in the correct place, even though "many" is lower-cased.

I haven't highlighted this yet because I still haven't sorted out better training and evaluation data. I need to train models on more web text; the current model is based on Wall Street Journal text.

All of this has been delayed by me dropping everything to do a demo for a major client. If I secure this, then I can say that spaCy's development is secure for the immediate future, and I'll even be able to hire additional help. But for the last month, things have been less smooth than I'd like. I hope to have all this sorted out soon.


metasyn commented on May 1, 2024

Hmm. I see. Thanks for the response. I was curious because I'd love to contribute, but I still need to do a bit of reading on Cython and go over this project in more depth.


honnibal commented on May 1, 2024

Improved sentence segmentation now included in the latest release. Docs are updated with usage.

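For later readers: the documented usage is to iterate doc.sents directly. A minimal sketch (the model name is just an example; any installed model with a parser works):

import spacy

nlp = spacy.load('en_core_web_sm')  # assumed model name
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:  # each item is a Span
    print(sent.text)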

mansilla commented on May 1, 2024

I have a similar issue with Spanish:

text = """
En estadística, un error sistemático es aquel que se produce de igual modo en todas las mediciones que se realizan de una magnitud. Puede estar originado en un defecto del instrumento, en una particularidad del operador o del proceso de medición, etc. Se contrapone al concepto de error aleatorio.
"""
nlp = spacy.load("es")
doc = nlp(text)
for span in doc.sents:
    print("#> span:", span)

this gives

#> span: En estadística, un error sistemático es
#> span: aquel que se produce de igual modo en todas las mediciones que se realizan de una magnitud.
...

As you can see, the first "span" isn't a full sentence.

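If the parse-based boundaries misfire like this, one workaround is the rule-based sentencizer on a blank pipeline, which splits on punctuation only. A sketch, assuming a v2 release (not tested against the es model):

from spacy.lang.es import Spanish

nlp = Spanish()  # blank Spanish pipeline, no parser
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # rule-based, punctuation only
doc = nlp(u"En estadística, un error sistemático es aquel que se produce de igual modo en todas las mediciones. Se contrapone al concepto de error aleatorio.")
for sent in doc.sents:
    print(sent.text)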

pramod2157 commented on May 1, 2024

How exactly does the sentence separation work? Is it based on regular expressions (considering punctuation like full stops and question marks)?

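As explained above, it's not regex-based: the boundaries come from the dependency parse, which is why lower-cased sentence starts can still be split correctly. A quick way to see this, assuming an installed English model:

import spacy

nlp = spacy.load('en_core_web_sm')  # assumed model name
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner'] -- no regex splitter
doc = nlp(u"He left early. many people stayed.")
# The parse, not a punctuation rule, decides the boundary here,
# even though "many" is lower-cased.
print([sent.text for sent in doc.sents])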

xzegga commented on May 1, 2024

Hey @syllog1sm, I am experimenting with your approach but I am getting this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-edf97381b54c> in <module>()
      3 range1 = lambda start, end: range(start, end+1)
      4 
----> 5 for start, end in en_doc.sents:
      6     print(''.join(tokens[i].string for i in range(start, end)))

ValueError: too many values to unpack (expected 2)

Do you have an idea why? I am no expert in Python.
I am using Python 3.6 and spaCy 2.0.5.

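The unpacking fails because of an API change: in spaCy v2, doc.sents yields Span objects, not the (start, end) tuples used in the early examples above. The v2 equivalent of that loop, reusing en_doc from the traceback:

for sent in en_doc.sents:  # each item is a Span in v2, not a (start, end) pair
    print(sent.text)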

lock commented on May 1, 2024

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.


