Romanian-NLP-tools

A list of Natural Language Processing Tools for the Romanian language.
I try to keep this list up-to-date, as technologies evolve over time.

Any feedback on this repository is greatly appreciated. All suggestions are welcome!

Stemmer

Use the package manager pip to install nltk.

pip install nltk

Usage

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("romanian")
print(stemmer.stem("alergare"))
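To see how the stemmer treats several inflected forms at once, a minimal sketch like the one below may help. The word list is an illustrative assumption, not part of the original README, and no expected stems are shown because they can differ between NLTK versions.

```python
from nltk.stem.snowball import SnowballStemmer

# The Snowball stemmer is rule-based, so no model or corpus download is needed.
stemmer = SnowballStemmer("romanian")

# Hypothetical sample: inflected forms related to "a alerga" (to run) and "carte" (book).
words = ["alergare", "alergând", "alergători", "cărți", "cărților"]

# Map each word to its stem; stems are truncated forms, not dictionary lemmas.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:<12} -> {stem}")
```

Because a stem is not guaranteed to be a valid word, a lemmatiser is usually the better fit when dictionary forms are needed.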

Tokeniser, Lemmatiser and POS (Part-of-Speech) Tagger

Use the package manager pip to install spacy and spacy-stanza.

pip install spacy spacy-stanza

Usage

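import stanza
import spacy_stanza

# Download the Romanian Stanza models (only needed once)
stanza.download("ro")
# spacy-stanza >= 1.0 replaces StanzaLanguage with load_pipeline
nlp = spacy_stanza.load_pipeline("ro")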

doc = nlp("Această propoziție este în limba română.")
for token in doc:
    print(token.text, token.lemma_, token.pos_)

For more info visit https://spacy.io/universe/project/spacy-stanza.

SpaCy

Create a Doc object and inspect its tokens:

from spacy.lang.ro import Romanian
nlp = Romanian()
doc = nlp("Aceasta este propoziția mea: eu am 7 mere, ce să fac cu ele?")
print("Index: ", [token.i for token in doc])
print("Text: ", [token.text for token in doc])
print("is alpha: ", [token.is_alpha for token in doc])
print("is punctuation: ", [token.is_punct for token in doc])
print("is like_num: ", [token.like_num for token in doc])

Output:

Index:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Text:  ['Aceasta', 'este', 'propoziția', 'mea', ':', 'eu', 'am', '7', 'mere', ',', 'ce', 'să', 'fac', 'cu', 'ele', '?']
is alpha:  [True, True, True, True, False, True, True, False, True, False, True, True, True, True, True, False]
is punctuation:  [False, False, False, False, True, False, False, False, False, True, False, False, False, False, False, True]
is like_num:  [False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False]
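The boolean flags above are mostly useful as filters. The following sketch, which reuses the same sentence, separates alphabetic tokens from number-like ones; the variable names are illustrative.

```python
from spacy.lang.ro import Romanian

# A blank Romanian pipeline: tokenizer and lexical attributes only, no trained model.
nlp = Romanian()
doc = nlp("Aceasta este propoziția mea: eu am 7 mere, ce să fac cu ele?")

# Keep alphabetic tokens as "words" and collect number-like tokens separately.
words = [token.text for token in doc if token.is_alpha]
numbers = [token.text for token in doc if token.like_num]

print("Words:  ", words)
print("Numbers:", numbers)
```

Punctuation drops out of both lists because it is neither alphabetic nor number-like.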

Search for POS and dependencies:

import spacy
# Load the pre-trained Romanian model (install it first with: python -m spacy download ro_core_news_sm)
nlp = spacy.load("ro_core_news_sm")
doc = nlp("Ea a mâncat pizza")
for token in doc:
    print('{:<12}{:<10}{:<10}{:<10}'.format(token.text, token.pos_, token.dep_, token.head.text))

Output:

Ea          PRON      nsubj     mâncat    
a           AUX       aux       mâncat    
mâncat      VERB      ROOT      mâncat    
pizza       ADV       obj       mâncat 
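Dependency labels like these can be turned into simple subject-verb-object triples. The sketch below is one possible approach, not the only one: the helper `subject_verb_object` is hypothetical, and the parse is rebuilt by hand from the output above (using the spaCy v3 `Doc` constructor) so that it runs without the trained model.

```python
from spacy.lang.ro import Romanian
from spacy.tokens import Doc


def subject_verb_object(doc):
    """Yield (subject, verb, object) text triples for each verb that has both."""
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [child.text for child in token.children if child.dep_ == "nsubj"]
        objects = [child.text for child in token.children if child.dep_ == "obj"]
        for subj in subjects:
            for obj in objects:
                yield subj, token.text, obj


# Hand-built parse mirroring the output above; heads are absolute token indices.
vocab = Romanian().vocab
doc = Doc(
    vocab,
    words=["Ea", "a", "mâncat", "pizza"],
    pos=["PRON", "AUX", "VERB", "ADV"],
    heads=[2, 2, 2, 2],
    deps=["nsubj", "aux", "ROOT", "obj"],
)

print(list(subject_verb_object(doc)))
```

With a real pipeline, the same function can be called on the `doc` returned by `nlp(...)`.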

Predict Named Entities:

import spacy

nlp = spacy.load("ro_core_news_sm")

doc = nlp("Iulia Popescu, cea din Constanta, s-a dus la Lidl să cumpere pâine. Pe drum și-a dat seama că are nevoie de 50 de lei așa că a trecut și pe la bancomat înainte.")

for ent in doc.ents:
    print(ent.text, ent.label_)

Output:

Iulia Popescu PERSON
Constanta GPE
Lidl LOC
50 de lei MONEY
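For longer texts it is often handier to group entities by label. In this sketch `group_entities` is a hypothetical helper, and the entity is set by hand on a blank pipeline as a stand-in for a model prediction, so the example runs without downloading ro_core_news_sm.

```python
from collections import defaultdict

from spacy.lang.ro import Romanian
from spacy.tokens import Span


def group_entities(doc):
    """Return a dict mapping each entity label to the list of entity texts."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-in for a model prediction: the entity span is assigned manually.
nlp = Romanian()
doc = nlp("Iulia Popescu s-a dus la Lidl.")
doc.ents = [Span(doc, 0, 2, label="PERSON")]

print(group_entities(doc))
```

With the trained model loaded, `group_entities(nlp(text))` works the same way on predicted entities.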

Rule-based Matching

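Patterns match on token attributes such as LEMMA, POS, TEXT, IS_DIGIT, IS_PUNCT, LOWER and IS_UPPER. The OP key is an operator (quantifier) that controls how many times a token pattern may match.
OP can take the following values:

  • '!' = negate the pattern: it must match zero times
  • '?' = optional: match zero or one time
  • '+' = match one or more times
  • '*' = match zero or more times

import spacy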
from spacy.matcher import Matcher
# Load the pre-trained Romanian model
nlp = spacy.load('ro_core_news_sm')
# Create the matcher
matcher = Matcher(nlp.vocab)
# Create the Doc object
doc = nlp("Caracteristicile aplicației includ un design frumos, căutare inteligentă, etichete automate și răspunsuri vocale opționale.")
# Pattern: a noun followed by one or two adjectives
pattern = [{'POS': 'NOUN'}, {'POS': 'ADJ'}, {'POS': 'ADJ', 'OP': '?'}]
# Add the pattern to the matcher
matcher.add('QUALITIES', [pattern])
# Apply the matcher to the doc
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

Output:

design frumos
căutare inteligentă
etichete automate
răspunsuri vocale opționale
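An optional token ('?') can make a pattern match both a shorter and a longer span at the same position. The sketch below illustrates this on a hand-tagged three-word Doc (a stand-in for model output, built with the spaCy v3 `Doc` constructor) and shows the `greedy="LONGEST"` option of `Matcher.add`, which keeps only the longest of the overlapping matches.

```python
from spacy.lang.ro import Romanian
from spacy.matcher import Matcher
from spacy.tokens import Doc

vocab = Romanian().vocab

# Hand-tagged stand-in for a model-tagged Doc, so no trained model is needed.
doc = Doc(
    vocab,
    words=["răspunsuri", "vocale", "opționale"],
    pos=["NOUN", "ADJ", "ADJ"],
)

pattern = [{"POS": "NOUN"}, {"POS": "ADJ"}, {"POS": "ADJ", "OP": "?"}]

# Default behaviour: every span that satisfies the pattern is returned.
matcher_all = Matcher(vocab)
matcher_all.add("QUALITIES", [pattern])
all_matches = [doc[start:end].text for _, start, end in matcher_all(doc)]

# greedy="LONGEST" keeps only the longest of any overlapping matches.
matcher_longest = Matcher(vocab)
matcher_longest.add("QUALITIES", [pattern], greedy="LONGEST")
longest = [doc[start:end].text for _, start, end in matcher_longest(doc)]

print(all_matches)
print(longest)
```

Whether the shorter span shows up in practice depends on the tags the model assigns, so results on real text may differ from this toy case.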

RoWordNet

Use the package manager pip to install rowordnet.

pip install rowordnet

Usage

import rowordnet

wordnet = rowordnet.RoWordNet()
word = 'arbore'
synset_ids = wordnet.synsets(literal=word)
wordnet.print_synset(synset_ids[0])

For more info visit https://github.com/dumitrescustefan/RoWordNet.

BERT for Romanian

import torch
from transformers import AutoModel, AutoTokenizer
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
# Tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
# Get the encoding
last_hidden_states = outputs[0]  # The last hidden state is the first element of the model output

For more info visit https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1.
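The last hidden state holds one vector per token. A common way to reduce it to a single sentence vector is mean pooling over the non-padded tokens. This is a general technique, not something the model card prescribes; `mean_pool` is a hypothetical helper, and the tensors below are synthetic stand-ins for real model output.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    # Shape (batch, tokens, 1), so the mask broadcasts over the hidden dimension.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts


# Synthetic stand-in for model output: batch of 2, 3 tokens each, hidden size 4.
hidden = torch.ones(2, 3, 4)
hidden[1, 2] = 100.0  # value at a padded position, which the mask should exclude
mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

sentence_vectors = mean_pool(hidden, mask)
print(sentence_vectors.shape)
```

With the real model, the inputs would typically come from `tokenizer(sentences, padding=True, return_tensors="pt")`, passing the first element of the model output and the `attention_mask` entry to `mean_pool`.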

Word Vectors

fastText

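import fasttext
import fasttext.util
# Downloads cc.ro.300.bin into the current working directory (skipped if the file already exists)
fasttext.util.download_model('ro', if_exists='ignore')
# Adjust the path if the file is stored elsewhere
ft = fasttext.load_model('cc.ro.300.bin')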

or download from here.
More info on usage here: https://fasttext.cc/docs/en/crawl-vectors.html.
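Word vectors are usually compared with cosine similarity. The sketch below implements it with the standard library only; the three-dimensional vectors are made-up toy values standing in for real 300-dimensional embeddings, which a loaded model would return from `ft.get_word_vector(word)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values), not taken from the fastText model.
toy_vectors = {
    "rege": [0.9, 0.1, 0.0],
    "regină": [0.8, 0.2, 0.1],
    "măr": [0.0, 0.1, 0.9],
}

for other in ("regină", "măr"):
    score = cosine_similarity(toy_vectors["rege"], toy_vectors[other])
    print(f"rege vs {other}: {score:.3f}")
```

For nearest-neighbour queries on the real model, fastText also provides `ft.get_nearest_neighbors(word)`.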

word2vec

Pre-trained word2vec embeddings are available from https://github.com/senisioi/ro_resources.

Other Linguistic Resources

  • List of (hopefully all) Romanian words - from here
  • List of prefixes - from here
  • List of suffixes - from here
  • RoSentiWordNet - download from here

RoSentiWordNet is a lexical resource in which each RoWordNet synset is associated with three numerical scores, Obj(s), Pos(s) and Neg(s), describing how objective, positive and negative the terms contained in the synset are. It was created by translating SentiWordNet into Romanian using the googletrans Python library.
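As a rough illustration of how the three scores relate, the sketch below assumes that RoSentiWordNet keeps the SentiWordNet convention in which Pos(s), Neg(s) and Obj(s) sum to 1. The `SentiScores` class, the threshold and the example values are hypothetical and do not reflect the actual file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentiScores:
    """Positive and negative scores of one synset, each in [0, 1] with pos + neg <= 1."""

    pos: float
    neg: float

    @property
    def obj(self) -> float:
        # Assumed SentiWordNet convention: the three scores sum to 1.
        return 1.0 - (self.pos + self.neg)

    def polarity(self, threshold: float = 0.0) -> str:
        """Label the synset by the sign of pos - neg."""
        diff = self.pos - self.neg
        if diff > threshold:
            return "positive"
        if diff < -threshold:
            return "negative"
        return "neutral"


# Hypothetical scores, not taken from the actual RoSentiWordNet data.
example = SentiScores(pos=0.625, neg=0.125)
print(example.obj, example.polarity())
```

A non-zero threshold can be used to treat weakly polarised synsets as neutral.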

Contributors

alegzandra
