I'm pleased to say that there's now excellent support for German, thanks to the great work of our first NLP employee, Wolfgang Seeker. We're still finishing up the blog post etc., but the model is uploaded and can be used from spaCy 0.100.7:
import spacy
de_nlp = spacy.load('de')  # needs the German model that ships with spaCy 0.100.7+
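Continuing from the snippet above, a quick sanity check (the example sentence and printed attributes are just an illustration):
doc = de_nlp('Ich bin ein Berliner.')
print([(token.text, token.pos_) for token in doc])  # token texts with coarse POS tags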
We're still refactoring and working on better processes for adding more languages. I'm going to close this issue because it's old and most of the information here is now out of date.
Obviously, there's still a lot to do to support more languages. And I think the idea of partial language support is important and overdue. But --- progress :)
Does anybody plan to add support for Russian?
I wanted to understand how much effort is involved and what road-map I should plan for adding Russian support, but unfortunately the page http://spacy.io/tutorials/add-a-language/ is not available.
Could you please point me to the corresponding doc/guide?
I'm also reading the pull requests about adding German support; do you think the work for adding Russian support would be similar?
http://spacy.io/tutorials/add-a-language/ is not working. It seems the page is not available (404 error).
Thanks everyone. I'll say a bit more about what's blocking this.
The lexemes.bin data file has been constructed in a way that depends on various intermediate data files --- for instance, I processed an unannotated corpus into a list of word counts, and then smoothed the counts with another script, and then consumed the smoothed probabilities with the current quick-and-dirty make_lexicon.py script, which isn't even in this repository yet. I also need to set up a program to generate Brown clusters, and configure word2vec to generate word vectors.
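To make the counts-to-probabilities step concrete, here is a hypothetical sketch; the real make_lexicon.py pipeline isn't public, and add-one (Laplace) smoothing is just a stand-in for whatever the actual smoothing script does:
from collections import Counter
import math

def smoothed_log_probs(tokens, alpha=1.0):
    # Count word frequencies over the unannotated corpus, then apply
    # add-one smoothing and return a log-probability per word type.
    counts = Counter(tokens)
    total = sum(counts.values()) + alpha * len(counts)
    return {word: math.log((count + alpha) / total)
            for word, count in counts.items()}

probs = smoothed_log_probs('the cat sat on the mat'.split())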
Finally, I need to document the process, and document the tokenizer file formats, so that I can describe what you'll actually need to do to add new languages.
Instead of doing these things, I've mostly been doing bug-fixes, improving the API docs, and trying to improve my deployment process, which at the moment feels very error-prone.
I'll say a little bit about what will be required to add new languages.
- Select an unannotated corpus. This will probably be Wikipedia --- it's a nice way to streamline things across languages. Another nice solution would be to run a language identification program over Common Crawl dumps, so that we can get text from wider genres.
- Select an annotated corpus. This will define the tokenization standards we have to target. I'll be licensing the data, so you probably won't have direct access to it --- I'll have to do the actual training. It would probably be nice to give you a web API, so you can run things. If the API you call just ends up emailing me "hey, train this model", well... that sucks.
- Define tokenization rules. This is mostly a list of prefix rules, a list of suffix rules, and a list of special cases, which are exact-matched; a toy sketch follows this list. The "How It Works" page says a little bit more about this, but not enough.
- Write a lemmatizer and morphological analyser. If there's a WordNet for the language you're targeting, and it's any good, I would prefer to lemmatize to WordNet sense-keys. The BabelNet project is probably a useful way to go about this.
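To illustrate the tokenization-rules point above, a toy sketch of prefix/suffix/special-case handling. The names, character sets, and special cases are all made up for illustration; the real spaCy data files have their own formats.
PREFIX_CHARS = set('("\'«')       # characters split off the start of a token
SUFFIX_CHARS = set(')"\'».,!?')   # characters split off the end of a token
SPECIAL_CASES = {"don't": ["do", "n't"], "U.K.": ["U.K."]}  # exact matches

def tokenize(text):
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel punctuation off both ends until an exact match or a bare word remains.
        while chunk and chunk not in SPECIAL_CASES:
            if chunk[0] in PREFIX_CHARS:
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in SUFFIX_CHARS:
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        tokens += prefixes
        tokens += SPECIAL_CASES.get(chunk, [chunk] if chunk else [])
        tokens += suffixes
    return tokens

print(tokenize('She said "don\'t go."'))
# ['She', 'said', '"', 'do', "n't", 'go', '.', '"']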
Having written all this, I'm thinking it might be nice to interoperate closely with Gensim on this. Gensim will give us the word2vec implementation, and would be a good way to handle the bootstrapping problem: if getting spaCy to work on a new language initially depends on processing a bunch of unannotated data with spaCy, then things are awkward.
I'll think more about this, and probably reach out to Radim about it.
Just to push towards adding additional languages: I would be willing to work on Estonian support, as I have experience developing https://github.com/estnltk/estnltk, an NLP library specific to that language.
http://spacy.io/tutorials/add-a-language/ is 404'ing.
@kaustubhn Sounds great – for more info on how to add languages, see the new link posted above.
There's also an open discussion on getting started with the Indic language tokenizers in #641, so this would probably be the best place to talk more about this.
Same question here: where should one begin in order to add support for other languages? I'd be glad to lend a hand...
+1. I would gladly lend a hand.
When I was still using redshift, I used word2vec and gensim to generate word clusters with window_size=2. It was fast, but I wasn't quite happy with the results until I found Percy Liang's word clustering tool. It took days to generate, since I have a large corpus.
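For reference, a minimal gensim sketch of the small-window vector training described above; the file paths and other hyperparameters are placeholders, and the clustering step on top is omitted:
from gensim.models import Word2Vec

# Train word vectors with a small context window (window=2), which tends
# to emphasise syntactic rather than topical similarity.
sentences = [line.split() for line in open('corpus.txt', encoding='utf8')]
model = Word2Vec(sentences, window=2, min_count=5, workers=4)
model.wv.save_word2vec_format('vectors.txt')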
If you need them, I have a lot of resources for Portuguese (word2vec, Stanford NE Extractor, CoNLL Floresta Sintática, POS taggers (trigram, bigram, n-gram)...) that I trained for my project.
@brunoalano thanks, but actually I am interested in Turkish to begin with.
Working on docs for this here: http://spacy.io/tutorials/add-a-language/
+1
Ideally we could develop the concept of explicit partial support for a new language. For example, tokenisation already works for most languages.
I would be very interested in spaCy support for German, especially official support. Thanks for the documentation on adding languages!
See Issue #124
> Select an unannotated corpus. This will probably be Wikipedia --- it's a nice way to streamline things across languages. Another nice solution would be to run a language identification program over Common Crawl dumps, so that we can get text from wider genres.
Perhaps the corpora from http://corporafromtheweb.org/ are useful here. They're large, come in several different languages (Dutch, English, French, German, Spanish, Swedish), and are tokenized pretty well. They're also POS-tagged and lemmatized, but I don't know whether that's needed here.
I'm also glad to help in adding Portuguese tools to spaCy.
By the way, does anyone know if there are corpora for training dependency parsers, or just PoS-taggers, for Dutch?
Yes there are corpora for that: Lassy-Klein, Lassy-Groot, and SoNaR. Probably the best thing is to contact the TST-centrale. They are undergoing some changes in management, so they might be slow to respond.
For questions about Lassy, you can email Gertjan van Noord.
For questions about SoNaR you can email Nelleke Oostdijk.
I trained a POS-tagger on the NLCOW14 corpus, which was automatically tagged; if you need something fast, here is the repository.
If you just need a parser, try Alpino.
Thanks for your reply! One more question: do the licenses associated with those datasets allow them to be incorporated into spaCy, or used to process Dutch text outside a research context?
I think if you're a researcher yourself, and you don't contribute to spaCy for money, it should be OK.
If you're a spaCy employee it might technically be commercial use, and the terms and conditions from the TST-centrale are different: I just checked, and the license for Lassy-klein for commercial use costs 2000 euros. Same for SoNaR-klein. The larger corpora cost more. Then again, spaCy is open-source, so you can always try. If it doesn't work, contact Gertjan to see whether he can help you.
Otherwise there's also this free smaller treebank on his website. That should at least get you started. And the people at the University of Groningen would probably be happy to see another parser for Dutch, so they can compare Alpino to it :)
Also look around on Maarten van Gompel's GitHub here. He's worked on memory-based POS tagging for Dutch, among other things.
Dank u wel (thank you) for all the clarifications!
You're welcome, but I just realized I forgot to mention http://universaldependencies.org/ which also covers Dutch, and seems to overlap with Lassy. Sorry!
I'm new here but have been lurking for a while; I'd be glad to add Bahasa support.
@geovedi, were you working on Bahasa?
@syllog1sm yes, it's been a while though. I'll catch up with the latest commits and regenerate the model.
@geovedi do you have a fork or branch I can check out?
Same question as @korobool -- I would be interested in adding support for Bulgarian.
Hi, I am interested in adding a language. I have corpora for 4 Indian languages and would be willing to help add a spaCy model.
P.S. the URL is not working.
Thanks
https://spacy.io/docs/usage/adding-languages
Hi @davidsbatista. Do you have some code to train a model in Portuguese?
Train what exactly? For PoS tagging and syntactic parsing there is publicly available data; for named-entity recognition it's not so easy, as there is no publicly available dataset.
The training would be what's mentioned in the docs above. At this point PoS would be enough for me. Were you able to implement this in spaCy?
I could start it, although at the moment I have a bit too much on my plate; but if you give me some guidelines on how to start, I can try to plug the open PoS corpora into spaCy.
got it: https://spacy.io/docs/usage/adding-languages
:)
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.