Welcome to my page! I'm Sergey, an NLP Engineer from 🇷🇺 Moscow, Russia, where I currently live.
Projects:
ruTS
Speech-to-Text-Russian
DataScience-Roadmap
Word-to-Number-Russian
BERTopic-as-service
A library for extracting statistics from Russian-language texts.
Home Page: https://sergeyshk.github.io/ruTS/
License: MIT License
I propose replacing the tokenizer from nltk.tokenize with razdel.tokenize.
The original issue links to an example of how to use razdel.
This also applies to sentence segmentation.
In my experience, razdel tokenizes 🇷🇺 Russian text better than nltk does.
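A minimal sketch of what the proposed switch could look like. `razdel.tokenize` and `razdel.sentenize` are razdel's public functions and yield spans with a `text` attribute; the wrapper names `tokenize_ru` and `split_sentences_ru` are hypothetical, and the regex fallback is only a stand-in so the snippet still runs where razdel is not installed. How ruTS itself would wire this in is not shown in the issue.

```python
import re

try:
    # razdel is a third-party package: pip install razdel
    from razdel import sentenize, tokenize
    HAVE_RAZDEL = True
except ImportError:
    HAVE_RAZDEL = False


def tokenize_ru(text):
    """Return token strings for Russian text (hypothetical helper)."""
    if HAVE_RAZDEL:
        return [token.text for token in tokenize(text)]
    # Stand-in only: runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences_ru(text):
    """Return sentence strings for Russian text (hypothetical helper)."""
    if HAVE_RAZDEL:
        return [sentence.text for sentence in sentenize(text)]
    # Stand-in only: split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


tokens = tokenize_ru("Бальзам хороший, но пришёл один.")
sentences = split_sentences_ru("Бальзам хороший. Но пришёл один.")
print(tokens)
print(sentences)
```

Keeping the tokenizer behind small helpers like these would let the library swap implementations without touching the statistics code.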
I noticed some strange behavior: all the values are different.
t1 = 'Бальзам хороший, но пришёл один а не два, как написано '
import ruts
ds = ruts.DiversityStats(t1)
ds.get_stats()
{'ttr': 1.0,
'rttr': 3.162277660168379,
'cttr': 2.23606797749979,
'httr': 1.0,
'sttr': 0,
'mttr': 0.0,
'dttr': 0,
'mattr': 1.0,
'msttr': 1.0,
'mtld': 0.0,
'mamtld': 1.0,
'hdd': -1,
'simpson_index': 0,
'hapax_index': 0}
vs
print('ttr', ruts.diversity_stats.calc_ttr(t1))
print('rttr', ruts.diversity_stats.calc_rttr(t1))
print('cttr', ruts.diversity_stats.calc_cttr(t1))
print('httr', ruts.diversity_stats.calc_httr(t1))
print('sttr', ruts.diversity_stats.calc_sttr(t1))
print('mttr', ruts.diversity_stats.calc_mttr(t1))
print('dttr', ruts.diversity_stats.calc_dttr(t1))
print('mattr', ruts.diversity_stats.calc_mattr(t1))
print('msttr', ruts.diversity_stats.calc_msttr(t1))
print('mtld', ruts.diversity_stats.calc_mtld(t1))
print('mamtld', ruts.diversity_stats.calc_mamtld(t1))
print('hdd', ruts.diversity_stats.calc_hdd(t1))
print('simpson_index', ruts.diversity_stats.calc_simpson_index(t1))
print('hapax_index', ruts.diversity_stats.calc_hapax_index(t1))
ttr 0.4
rttr 2.9664793948382653
cttr 2.0976176963403033
httr 0.7713465066366824
sttr 0.5314553128319692
mttr 0.1313826679597258
dttr 7.611354035728222
mattr 0.41
msttr 0.42
mtld 14.338133470257823
mamtld 12.708333333333334
hdd 0.4587105249530551
simpson_index 15.0
hapax_index 319.06649307394474
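One possible reading of the mismatch, offered as an assumption rather than a confirmed diagnosis: the class-level numbers match a calculation over the 10 words of `t1`, while the function-level numbers match a calculation over its 55 individual characters, as if the `calc_*` functions iterate over whatever sequence they are given. The sketch below re-derives `ttr` and `rttr` both ways in plain Python without calling ruTS; the regex tokenizer and the helper names are stand-ins, not the library's own.

```python
import math
import re

t1 = 'Бальзам хороший, но пришёл один а не два, как написано '


def ttr(items):
    """Type-token ratio: unique items divided by total items."""
    return len(set(items)) / len(items)


def rttr(items):
    """Root TTR: unique items divided by the square root of total items."""
    return len(set(items)) / math.sqrt(len(items))


# Word-level view: a simple regex tokenizer as a stand-in for the library's.
words = re.findall(r"\w+", t1.lower())

# Character-level view: what iterating over the raw string yields.
chars = list(t1)

print('words:', len(words), ttr(words), rttr(words))
print('chars:', len(chars), ttr(chars), rttr(chars))
```

For this input the word-level values come out as ttr 1.0 and rttr of about 3.162, and the character-level values as ttr 0.4 and rttr of about 2.966, which line up with the two outputs above. If that reading is right, passing a list of words rather than the raw string to the `calc_*` functions may bring the two sets of numbers closer.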
A simple example:
from ruts import DiversityStats
ds = DiversityStats('саид, ты опять абдулле насолил?').get_stats()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/dist-packages/ruts/diversity_stats.py", line 163, in get_stats
'dttr': self.dttr,
File "/usr/local/lib/python3.6/dist-packages/ruts/diversity_stats.py", line 119, in dttr
return calc_dttr(self.words)
File "/usr/local/lib/python3.6/dist-packages/ruts/diversity_stats.py", line 300, in calc_dttr
return log10(n_words)**2 / (log10(n_words) - log10(n_lexemes))
ZeroDivisionError: float division by zero
Tested on version 0.5.0.
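The traceback points at the denominator `log10(n_words) - log10(n_lexemes)`, which is zero whenever every word in the text is unique, as in the five-word example above. A minimal sketch of one way to guard against that follows. `safe_dttr` is a hypothetical name, the formula is copied from the traceback, and returning `0.0` in the degenerate case is an assumption, not the library's documented behavior.

```python
from math import log10


def safe_dttr(n_words, n_lexemes, default=0.0):
    """Hypothetical guarded variant of the formula shown in the traceback."""
    # log10 is undefined for non-positive counts (e.g. empty text).
    if n_words <= 0 or n_lexemes <= 0:
        return default
    denominator = log10(n_words) - log10(n_lexemes)
    # All words unique: the denominator is zero, so fall back to the default.
    if denominator == 0:
        return default
    return log10(n_words) ** 2 / denominator


print(safe_dttr(5, 5))     # every word unique, as in the failing example
print(safe_dttr(100, 50))  # ordinary case
```

Whether the fallback should be `0.0`, `float('inf')`, or `None` is a design decision for the maintainer; the point is only that the all-unique case needs explicit handling.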
I suggest adding an option to report most of the statistics in BasicStats() as normalized (relative) values. Every word count except the total number of words would be divided by that total, and likewise for the character counts.
c_letters, c_syllables, n_complex_words, n_monosyllable_words, n_polysyllable_words, n_long_words, n_simple_words, n_unique_words: divide (normalize) by n_words.
n_letters, n_punctuations, n_spaces: divide (normalize) by n_chars.
It is more convenient to get these values directly in the output than to do the division yourself.
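A rough sketch of the proposed normalization, applied to a plain dictionary shaped like the statistics listed above. The function name `normalize_stats`, the sample numbers, and the handling of dictionary-valued entries (each value divided by the total, in case `c_letters` and `c_syllables` are mappings) are assumptions for illustration, not part of ruTS.

```python
# Keys taken from the proposal: these are divided by n_words ...
WORD_KEYS = (
    "c_letters", "c_syllables", "n_complex_words", "n_monosyllable_words",
    "n_polysyllable_words", "n_long_words", "n_simple_words", "n_unique_words",
)
# ... and these by n_chars.
CHAR_KEYS = ("n_letters", "n_punctuations", "n_spaces")


def _divide(value, total):
    """Divide a number, or every value of a mapping, by total."""
    if isinstance(value, dict):
        return {key: count / total for key, count in value.items()}
    return value / total


def normalize_stats(stats):
    """Return a copy of stats with relative values (hypothetical helper)."""
    result = dict(stats)
    for keys, total_key in ((WORD_KEYS, "n_words"), (CHAR_KEYS, "n_chars")):
        total = stats.get(total_key, 0)
        if not total:
            continue  # avoid dividing by zero on empty text
        for key in keys:
            if key in stats:
                result[key] = _divide(stats[key], total)
    return result


# Made-up sample values, not real ruTS output.
sample = {
    "n_words": 4, "n_chars": 20, "n_unique_words": 3, "n_long_words": 1,
    "n_letters": 16, "n_spaces": 3, "n_punctuations": 1,
    "c_letters": {3: 2, 5: 2},
}
normalized = normalize_stats(sample)
print(normalized)
```

The totals `n_words` and `n_chars` are left untouched so the absolute values can still be recovered from the normalized output.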