Additional Language Support about spacy HOT 36 CLOSED

explosion avatar explosion commented on May 1, 2024
Additional Language Support

from spacy.

Comments (36)

honnibal avatar honnibal commented on May 1, 2024 7

I'm pleased to say that there's now excellent support for German, thanks to the great work from our first NLP employee Wolfgang Seeker. We're still finishing up the blog post etc, but the model is uploaded and can be used from spaCy 0.100.7.

import spacy
de_nlp = spacy.load('de')

We're still refactoring and working on better processes for adding more languages. I'm going to close this issue because it's old and most of the information here is now out of date.

Obviously, there's still a lot to do to support more languages. And I think the idea of partial language support is important and overdue. But --- progress :)

korobool avatar korobool commented on May 1, 2024 6

Does anybody plan to add support for Russian?

I wanted to understand how much effort it would take and what road-map I should plan for adding Russian support, but unfortunately the page http://spacy.io/tutorials/add-a-language/ is not available.
Could you please point me to the corresponding doc/guide?

I'm also reading the pull requests about adding German support. Do you think the work for adding Russian support is going to be similar?

Biswajit2902 avatar Biswajit2902 commented on May 1, 2024 4

http://spacy.io/tutorials/add-a-language/

is not working. It seems the page is not available (404 error).

honnibal avatar honnibal commented on May 1, 2024 1

Thanks everyone. I'll say a bit more about what's blocking this.

The lexemes.bin data file has been constructed in a way that depends on various intermediate data files --- for instance, I processed an unannotated corpus into a list of word counts, and then smoothed the counts with another script, and then consumed the smoothed probabilities with the current quick-and-dirty make_lexicon.py script, which isn't even in this repository yet. I also need to set up a program to generate Brown clusters, and configure word2vec to generate word vectors.
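As a rough illustration of the counting and smoothing steps described above (this is not the actual make_lexicon.py, which isn't published here), the sketch below counts whitespace-separated words and applies add-alpha smoothing. The function names, the toy corpus, and the choice of add-alpha smoothing are all assumptions made for the example; the smoothing method actually used is not stated in this thread.

```python
import math
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def smooth_log_probs(counts, alpha=1.0):
    """Add-alpha smoothing (an assumed method, chosen for simplicity).

    Returns a dict of log-probabilities for seen words and a single
    log-probability shared by all unseen words. One extra vocabulary
    slot is reserved for the unseen-word bucket.
    """
    vocab_size = len(counts) + 1
    total = sum(counts.values()) + alpha * vocab_size
    log_probs = {w: math.log((c + alpha) / total) for w, c in counts.items()}
    unseen = math.log(alpha / total)
    return log_probs, unseen


# Toy corpus standing in for a Wikipedia dump.
corpus = ["the cat sat", "the dog sat", "the end"]
counts = count_words(corpus)
log_probs, unseen_log_prob = smooth_log_probs(counts)
```

In a real pipeline the counts would come from a properly tokenized corpus rather than `str.split`, which is exactly the bootstrapping problem mentioned later in this comment.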

Finally, I need to document the process, and document the tokenizer file formats, so that I can describe what you'll actually need to do to add new languages.

Instead of doing these things, I've mostly been doing bug-fixes, improving the API docs, and trying to improve my deployment process, which at the moment feels very error-prone.

I'll say a little bit about what will be required to add new languages.

  1. Select an unannotated corpus. This will probably be Wikipedia --- it's a nice way to streamline things across languages. Another nice solution would be to run a language identification program over Common Crawl dumps, so that we can get text from wider genres.
  2. Select an annotated corpus. This will define the tokenization standards that we have to target. I'll be licensing the data, so you probably won't have direct access to it --- I'll have to do the actual training. It will probably be nice to give you a web API, so you can run things. If the API you call is emailing me "hey, train this model", well...That sucks.
  3. Define tokenization rules. This is mostly a list of prefixes to split off, a list of suffixes to split off, and a list of special cases, which are exact-matched. The "How It Works" page says a little bit more about this, but not enough.
  4. Write a lemmatizer and morphological analyser. If there's a WordNet for the language you're targeting, and it's any good, I would prefer to lemmatize to WordNet sense-keys. The BabelNet project is probably a useful way to go about this.
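To make step 3 more concrete, here is a minimal sketch of a tokenizer driven by a prefix list, a suffix list, and exact-matched special cases. It is only an illustration of the idea, not spaCy's implementation or file format; the rule tables and the function names `tokenize_chunk` and `tokenize` are hypothetical.

```python
# Hypothetical rule tables for English; a new language would supply its own.
PREFIXES = ("(", '"', "'", "[")
SUFFIXES = (")", '"', "'", "]", ",", ".", "!", "?", ":", ";")
SPECIAL_CASES = {"don't": ["do", "n't"], "U.S.": ["U.S."], "e.g.": ["e.g."]}


def tokenize_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    prefixes, suffixes = [], []
    while chunk:
        if chunk in SPECIAL_CASES:
            # Exact match: use the listed tokens and stop stripping.
            middle = SPECIAL_CASES[chunk]
            break
        if chunk.startswith(PREFIXES):
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif chunk.endswith(SUFFIXES):
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        else:
            middle = [chunk]
            break
    else:
        # The chunk was consumed entirely by prefix/suffix stripping.
        middle = []
    # Suffixes were collected outside-in, so reverse them.
    return prefixes + list(middle) + suffixes[::-1]


def tokenize(text):
    """Tokenize text by splitting on whitespace, then applying the rules."""
    return [tok for chunk in text.split() for tok in tokenize_chunk(chunk)]


tokens = tokenize('"I don\'t live in the U.S.," she said.')
```

Checking the special cases before each stripping step is what keeps the final period of "U.S." attached while still splitting off the comma and quote that follow it.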

Having written all this, I'm thinking it might be nice to inter-operate closely with Gensim on this. Gensim will give us the word2vec implementation, and would be a good way to handle the boot-strapping problem: if getting spaCy to work on a new language initially depends on processing a bunch of unannotated data with spaCy, then things are awkward.

I'll think more about this, and probably reach out to Radim about it.

tpetmanson avatar tpetmanson commented on May 1, 2024 1

Just to push towards adding additional languages: I would be willing to work on Estonian support, as I have experience developing https://github.com/estnltk/estnltk, which is an NLP library specific to the language.

jasonmhead avatar jasonmhead commented on May 1, 2024 1

http://spacy.io/tutorials/add-a-language/
is 404'ing

ines avatar ines commented on May 1, 2024 1

@kaustubhn Sounds great – for more info on how to add languages, see the new link posted above.

There's also an open discussion on getting started with the Indic language tokenizers in #641, so this would probably be the best place to talk more about this.

chasseurmic avatar chasseurmic commented on May 1, 2024

Same question here: where should one begin in order to add support for other languages? I would be glad to lend a hand...

scari avatar scari commented on May 1, 2024

+1. I would gladly lend a hand.

geovedi avatar geovedi commented on May 1, 2024

When I was still using redshift, I used word2vec and gensim to generate word clusters with window_size=2. It was fast, but I wasn't quite happy with the result until I found Percy Liang's word clustering tool. That took days to run, since I have a large corpus.

brunoalano avatar brunoalano commented on May 1, 2024

If you need them, I have a lot of resources for Portuguese (word2vec, Stanford NE Extractor, CoNLL Floresta Sintática, POS taggers (trigram, bigram, n-gram)...) that I've trained for my project.

aikinci avatar aikinci commented on May 1, 2024

@brunoalano thanks, but actually I am interested in Turkish to begin with.

honnibal avatar honnibal commented on May 1, 2024

Working on docs for this here: http://spacy.io/tutorials/add-a-language/

bittlingmayer avatar bittlingmayer commented on May 1, 2024

+1
Ideally we could develop the concept of explicit partial support of a new language. For example, tokenisation really works already for most languages.

dpk avatar dpk commented on May 1, 2024

I would be very interested in spaCy support for German, especially official support. Thanks for the documentation on adding languages!

honnibal avatar honnibal commented on May 1, 2024

See Issue #124

evanmiltenburg avatar evanmiltenburg commented on May 1, 2024

> Select an unannotated corpus. This will probably be Wikipedia --- it's a nice way to streamline things across languages. Another nice solution would be to run a language identification program over Common Crawl dumps, so that we can get text from wider genres.

Perhaps the corpora from http://corporafromtheweb.org/ are useful here. They're large, come in several different languages (Dutch, English, French, German, Spanish, Swedish), and are tokenized pretty well. They're also POS-tagged and lemmatized, but I don't know whether that's needed here.

davidsbatista avatar davidsbatista commented on May 1, 2024

I'm also glad to help in adding Portuguese tools to spaCy.

By the way, does anyone know if there are corpora for training dependency parsers, or just PoS taggers, for Dutch?

evanmiltenburg avatar evanmiltenburg commented on May 1, 2024

Yes there are corpora for that: Lassy-Klein, Lassy-Groot, and SoNaR. Probably the best thing is to contact the TST-centrale. They are undergoing some changes in management, so they might be slow to respond.

For questions about Lassy, you can email Gertjan van Noord.

For questions about SoNaR you can email Nelleke Oostdijk.

I trained a POS-tagger on the NLCOW14 corpus, which was automatically tagged. But if you need something fast, then here is the repository.

If you just need a parser, try Alpino.

davidsbatista avatar davidsbatista commented on May 1, 2024

Thanks for your reply! One more question: does the license associated with those datasets allow them to be incorporated into spaCy, or to be used to process Dutch text outside a research context?

evanmiltenburg avatar evanmiltenburg commented on May 1, 2024

I think if you're a researcher yourself, and you don't contribute to spaCy for money, it should be OK.

If you're a spaCy employee, it might technically be commercial use, and the terms and conditions from the TST-centrale are different: I just checked, and the license for Lassy-Klein for commercial use costs 2000 euros. Same for SoNaR-klein. The larger corpora cost more. Then again, spaCy is open-source, so you can always try. If it doesn't work, contact Gertjan to see whether he can help you.

Otherwise, there's also this free smaller treebank on his website. That should at least get you started. And the people at the University of Groningen would probably be happy to see another parser for Dutch so they can compare Alpino to it :)

Also look around on Maarten van Gompel's GitHub here. He's worked on memory-based POS tagging for Dutch, among other things.

davidsbatista avatar davidsbatista commented on May 1, 2024

Dank u wel for all the clarifications!

evanmiltenburg avatar evanmiltenburg commented on May 1, 2024

You're welcome, but I just realized I forgot to mention http://universaldependencies.org/ which also covers Dutch, and seems to overlap with Lassy. Sorry!

kaizenx avatar kaizenx commented on May 1, 2024

I'm new here but have been lurking for a while. I would be glad to add Bahasa support.

syllog1sm avatar syllog1sm commented on May 1, 2024

@geovedi , were you working on Bahasa?

geovedi avatar geovedi commented on May 1, 2024

@syllog1sm yes. it's been awhile tho. will catch up with the latest commit and regenerate the model.

kaizenx avatar kaizenx commented on May 1, 2024

@geovedi do you have a fork or branch i can check out?

savkov avatar savkov commented on May 1, 2024

Same question as @korobool -- I would be interested in adding support for Bulgarian.

kaustubhn avatar kaustubhn commented on May 1, 2024

Hi, I am interested in adding a language. I have corpora for 4 Indian languages and would be willing to help add a spaCy model.

P.S: the URL is not working.
Thanks

ausiddiqui avatar ausiddiqui commented on May 1, 2024

https://spacy.io/docs/usage/adding-languages

martinbel avatar martinbel commented on May 1, 2024

Hi @davidsbatista. Do you have some code to train a model for Portuguese?

davidsbatista avatar davidsbatista commented on May 1, 2024

Train what exactly? For PoS tagging and syntactic parsing there is publicly available data. For named-entity recognition it is not so easy: there is no publicly available dataset.

martinbel avatar martinbel commented on May 1, 2024

The training would be what's mentioned in the docs above. At this point PoS would be enough for me. Were you able to implement this in spaCy?

davidsbatista avatar davidsbatista commented on May 1, 2024

I could start on it, although at the moment I have a bit too much on my plate. If you give me some guidelines on how to start, I can try to plug the open PoS corpora into spaCy.

davidsbatista avatar davidsbatista commented on May 1, 2024

got it: https://spacy.io/docs/usage/adding-languages

:)

lock avatar lock commented on May 1, 2024

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
