Comments (12)
The situation is a bit better in v2.0, but there are still plenty of errors. After a discussion with @honnibal on Twitter, it turned out the French model was trained on the shuffled Sequoia treebank version as provided by the UD people, so the segmentation learning process was likely messed up because of that. It's also a small treebank, so there aren't enough data points to learn segmentation accurately.
from spacy.
Hey,
the sentence segmenter for French is very bad :(
It cuts sentences in the middle, right after an adjective, or doesn't follow basic rules (a period followed by \n, for example), even if I replace each \n with \n\n.
I'm using this kind of template:
nlp = spacy.load('fr')
USTR_mydoc = USTR_get_1string_from_file("/dev/stdin")
doc = nlp(USTR_mydoc)
print([(w.text, w.pos_) for w in doc])
(USTR_get_1string_from_file is just a function that returns a whole file as a string.)
Is there any way to train the sentence segmentation on custom data? If so, it would be great if someone could provide examples as well.
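One possible workaround in the meantime (a sketch, not official spaCy training guidance): spaCy lets you override sentence boundaries by setting token.is_sent_start yourself. The rule below — "every newline token starts a new sentence" — is purely hypothetical, chosen to match the \n complaint above; it uses a blank French pipeline so no trained model is needed.

```python
import spacy

# Sketch, assuming spaCy v2+ (spacy.blank and Token.is_sent_start exist).
# Hypothetical rule for illustration: every "\n" token starts a new sentence.
nlp = spacy.blank("fr")  # tokenizer only, no trained model required
doc = nlp("Bonjour tout le monde.\nComment allez-vous ?")

for token in doc[:-1]:
    if token.text == "\n":
        # Mark the token after the newline as a sentence start.
        doc[token.i + 1].is_sent_start = True

print([sent.text for sent in doc.sents])
```

In a real pipeline with a parser you would wrap this in a pipeline component added before the parser, so the parser respects the pre-set boundaries instead of overriding them.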
You could do something like this.
In [34]: text = u'''Python packaging is awkward at the best of times, and it’s particularly tricky with C extensions, built via Cython, requiring large data files. So, please report issues as you encounter them, and bear with me :)'''
In [35]: tokens = nlp(text, parse=True)
In [36]: for s in tokens.sents:
   ....:     print ''.join(tokens[x].string for x in range(*s))
   ....:
Python packaging is awkward at the best of times, and it’s particularly tricky with C extensions, built via Cython, requiring large data files.
So, please report issues as you encounter them, and bear with me :)
I think it might be a good idea to make sentence segmentation more visible; at first glance, people seem to assume it isn't easy to do, or even possible.
e.g. http://tech.grammarly.com/blog/posts/How-to-Split-Sentences.html
@honnibal what would be the spaCy way of implementing a sentence splitter?
The example that Jim posted above is the current solution --- the tokens.sents attribute gives a (start, end) iterator. Unfortunately the Tokens.__getitem__ method doesn't accept a slice at the moment, so use range(start, end). This will be fixed in the next version.
The segmentation works a little differently from others. It uses the syntactic structure, not just the surface clues from the punctuation.
spaCy is unique among current parsers in parsing whole documents, instead of splitting first into sentences, and then parsing the resulting strings. This is possible because the algorithm is linear time, whereas a lot of previous parsers use polynomial time parsing algorithms.
This means the sentence boundary detection is often robust to difficult cases:
>>> tokens = nlp(u"If Harvard doesn't come through, I'll take the test to get into Yale. many parents set goals for their children, or maybe they don't set a goal.")
>>> for start, end in tokens.sents:
... print ''.join(tokens[i].string for i in range(start, end))
... print
...
If Harvard doesn't come through, I'll take the test to get into Yale.
many parents set goals for their children, or maybe they don't set a goal.
I just tried this example from the Grammarly examples, and it works correctly. Because spaCy parses this as two clauses, it puts the sentence break in the correct place, even though "many" is lower-cased.
I haven't highlighted this yet because I still haven't sorted out better training and evaluation data. I need to train models on more web text; the current model is based on Wall Street Journal text.
All of this has been delayed by me dropping everything to do a demo for a major client. If I secure this, then I can say that spaCy's development is secure for the immediate future, and I'll even be able to hire additional help. But for the last month, things have been less smooth than I'd like. I hope to have all this sorted out soon.
Hmm, I see. Thanks for the response. I was curious because I'd love to contribute, but I still need to do a bit of reading on Cython and go over this project in more depth.
Improved sentence segmentation now included in the latest release. Docs are updated with usage.
I have a similar issue with Spanish
text = """
En estadística, un error sistemático es aquel que se produce de igual modo en todas las mediciones que se realizan de una magnitud. Puede estar originado en un defecto del instrumento, en una particularidad del operador o del proceso de medición, etc. Se contrapone al concepto de error aleatorio.
"""
nlp = spacy.load("es")
doc = nlp(text)
for span in doc.sents:
    print("#> span:", span)
this gives
#> span: En estadística, un error sistemático es
#> span: aquel que se produce de igual modo en todas las mediciones que se realizan de una magnitud.
...
as you can see the first "span" isn't a full sentence.
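If the parser-based boundaries are unreliable for a language, one possible workaround (a sketch, not an official fix) is spaCy's rule-based sentencizer, which splits on sentence-final punctuation instead of the dependency parse. The version check below is hedged to cover both the v2 and v3 add_pipe APIs, and a blank Spanish pipeline is used so no trained model is required:

```python
import spacy

# Sketch: rule-based splitting, no trained Spanish model required.
nlp = spacy.blank("es")
try:
    nlp.add_pipe("sentencizer")                    # spaCy v3 API
except Exception:
    nlp.add_pipe(nlp.create_pipe("sentencizer"))   # spaCy v2 API

text = ("En estadística, un error sistemático es aquel que se produce de "
        "igual modo en todas las mediciones que se realizan de una magnitud. "
        "Se contrapone al concepto de error aleatorio.")
doc = nlp(text)
for span in doc.sents:
    print("#> span:", span.text)
```

Note the trade-off: this fixes mid-sentence breaks like the one above, but since it only looks at punctuation it won't recover sentence boundaries that lack it.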
How exactly does the sentence separation work? Is it based on regular expressions (considering punctuation like full stops and question marks)?
Hey @syllog1sm, I am experimenting with your approach but I am getting this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-25-edf97381b54c> in <module>()
3 range1 = lambda start, end: range(start, end+1)
4
----> 5 for start, end in en_doc.sents:
6 print(''.join(tokens[i].string for i in range(start, end)))
ValueError: too many values to unpack (expected 2)
Do you have any idea why? I am no expert in Python.
I am using Python 3.6 and spaCy 2.0.5.
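The unpack error is most likely an API change: in spaCy v2, Doc.sents yields Span objects rather than (start, end) tuples, so `for start, end in en_doc.sents` has too many values to unpack. Iterating over the spans directly should work. A minimal runnable sketch (using the rule-based sentencizer so it works without a trained model; with the "en" model the parser would set the boundaries instead):

```python
import spacy

# Sketch, assuming spaCy v2+: Doc.sents yields Span objects, not tuples.
nlp = spacy.blank("en")
try:
    nlp.add_pipe("sentencizer")                    # spaCy v3 API
except Exception:
    nlp.add_pipe(nlp.create_pipe("sentencizer"))   # spaCy v2 API

en_doc = nlp("I'd love to contribute. I still need to read up on Cython.")
for sent in en_doc.sents:   # each item is a Span; no unpacking needed
    print(sent.text)
```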
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.