seqtolang

seqtolang is a Python library for identifying the languages in multi-language documents.

See this post for implementation details.

Getting Started

Install from source:

$ git clone https://github.com/hiredscorelabs/seqtolang
$ cd seqtolang
$ python setup.py install

or from PyPI:

$ pip install seqtolang

Basic usage:

from seqtolang import Detector

detector = Detector()
text = "In Chinese, the French phrase 'Je rentre chez moi Je rentre chez moi' will be '我正在回家'"
languages = detector.detect(text)
print(languages)

>>> [('fr', 0.499), ('en', 0.437), ('zh', 0.062)]


tokens = detector.detect(text, aggregated=False)
print(tokens)

>>> ['eng', 'eng', 'eng', 'eng', 'eng', 'fra', 'fra', 'fra', 'fra', 'fra', 'fra', 'fra', 'fra', 'eng', 'eng', 'zho']
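One way to relate the two outputs is to count how often each label appears in the per-token list. The sketch below does this with plain Python. The helper name `token_shares` is hypothetical, and this is not a description of how seqtolang itself aggregates. For the example above the token shares land close to the aggregated scores, but that is not guaranteed in general.

```python
from collections import Counter

def token_shares(labels):
    """Return (label, share) pairs sorted by descending token count.

    Hypothetical helper for illustration; not part of seqtolang.
    """
    counts = Counter(labels)
    total = len(labels)
    return [(label, count / total) for label, count in counts.most_common()]

# Per-token labels copied from the example output above.
labels = ['eng', 'eng', 'eng', 'eng', 'eng',
          'fra', 'fra', 'fra', 'fra', 'fra', 'fra', 'fra', 'fra',
          'eng', 'eng', 'zho']

shares = token_shares(labels)
print(shares)  # [('fra', 0.5), ('eng', 0.4375), ('zho', 0.0625)]
```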

seqtolang supports 36 languages:

['afr', 'eus', 'bel', 'ben', 'bul', 'cat', 'zho', 'ces', 'dan', 'nld', 'eng', 'est', 'fin', 'fra', 
'glg', 'deu', 'ell', 'hin', 'hun', 'isl', 'ind', 'gle', 'ita', 'jpn', 'kor', 'lat', 'lit', 'pol', 
'por', 'ron', 'rus', 'slk', 'spa', 'swe', 'ukr', 'vie']

Docker Example

To make the library easier to try out, a runnable Docker image is also provided. To test it:

$> docker build . -t seqtolang
$> docker run -e SEQTOLANG_TEXT="Good boy in chinese is 好孩子" seqtolang
['Good', 'boy', 'in', 'chinese', 'is', '好孩子']
['eng', 'eng', 'eng', 'eng', 'eng', 'zho']
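The two lists printed above line up one-to-one, so they can be merged into contiguous language spans. The following is a minimal sketch, assuming you already have the token list and the label list as Python lists. The helper `group_spans` is hypothetical and not part of seqtolang, and joining tokens with a space is a simplification.

```python
from itertools import groupby

def group_spans(tokens, labels):
    """Merge consecutive tokens that share a label into (label, text) spans.

    Hypothetical helper for illustration; not part of seqtolang.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    spans = []
    # groupby only merges adjacent items, which is what a span needs.
    for label, run in groupby(zip(labels, tokens), key=lambda pair: pair[0]):
        spans.append((label, " ".join(token for _, token in run)))
    return spans

# Lists copied from the Docker output above.
tokens = ['Good', 'boy', 'in', 'chinese', 'is', '好孩子']
labels = ['eng', 'eng', 'eng', 'eng', 'eng', 'zho']

spans = group_spans(tokens, labels)
print(spans)  # [('eng', 'Good boy in chinese is'), ('zho', '好孩子')]
```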

Support

Getting Help

You can ask questions and join the development discussion on GitHub Issues.

License

Apache License 2.0

seqtolang's Issues

German and Hindi Count issue

from seqtolang import Detector

detector = Detector()
words = "'Good Morning', 'guten Morgen', 'Bonjour', 'おはようございます', 'शुभ प्रभात' "
multi_lang = detector.detect(words)
print(multi_lang)
Output:
[('hin', 0.4620698094367981), ('eng', 0.22681103646755219), ('jpn', 0.09090376645326614), ('fra', 0.08465294539928436), ('deu', 0.054195597767829895)]

tokens = detector.detect(words, aggregated=False)
print(tokens)
Output:
['eng', 'eng', 'deu', 'eng', 'fra', 'jpn', 'hin', 'hin', 'hin', 'hin', 'hin']

The German word 'Morgen' is labeled as English ('eng').
The two Hindi words 'शुभ प्रभात' are counted as five tokens.

Training data

Hello, and thanks for releasing this repository.
Can you tell us what training data the model was built on?

AttributeError: module 'numpy' has no attribute 'int'.

I experienced the following error:

"module 'numpy' has no attribute 'int'.
np.int was a deprecated alias for the built-in int. To avoid this error in existing code, use int by itself. This does not change any behavior and is safe. When replacing np.int, you can use, for example, np.int64 or np.int32 to specify the precision. To review your current usage, see the release note link for more information.
The aliases were originally deprecated in NumPy 1.20; see the original release note at the following link for more details and guidance:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations"

It is located in the following line:
File ".../seqtolang/vectorizer.py", line 47, in vectorize
return np.array(padded_texts, dtype=np.int)

I changed this to "dtype=np.int64", and now it works.
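The fix can be sketched in isolation as follows. Only the final `np.array(..., dtype=...)` call is taken from the traceback; the function name `pad_and_stack`, the padding logic, and the `pad_value` parameter are assumptions made for illustration and do not reproduce the actual contents of vectorizer.py.

```python
import numpy as np

def pad_and_stack(sequences, pad_value=0):
    """Right-pad integer sequences to equal length and stack them.

    Hypothetical stand-in for the failing code path; only the dtype
    change mirrors the fix reported in this issue.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    padded_texts = [list(seq) + [pad_value] * (max_len - len(seq))
                    for seq in sequences]
    # np.int64 is an explicit fixed-width type, unlike the removed np.int alias.
    return np.array(padded_texts, dtype=np.int64)

batch = pad_and_stack([[1, 2, 3], [4, 5]])
print(batch.dtype)     # int64
print(batch.tolist())  # [[1, 2, 3], [4, 5, 0]]
```

Using the built-in `int` as the dtype also avoids the error, as the message itself suggests; `np.int64` makes the precision explicit.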
