This project forked from bond005/neuro_tagger

A text tagger based on a recurrent neural network. It can be used as a named entity recognizer (NER), dependency parser, morphological analyzer, etc.

License: Apache License 2.0

Python 100.00%

neuro_tagger

The goal of this project is to create a simple Python package with an sklearn-like interface for solving various text tagging tasks (named entity recognition, dependency parsing, etc.) when the number of labeled texts is very small (no more than several thousand). Special word embeddings called ELMo (Embeddings from Language Models) make this possible: because these embeddings are contextual, they yield a simpler and more separable feature space for the words in a text.

ELMo embeddings are used as word features, and different neural network architectures (BiLSTM, hybrid BiLSTM-CRF, or pure CRF) can be used as the final classifier (tagger). I recommend using the TensorFlow Hub ELMo for English NLP tasks and the DeepPavlov ELMo (http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz) for the same tasks in Russian.

Getting Started

Installing

To install this project on your local machine, run the following commands in a terminal:

git clone https://github.com/bond005/neuro_tagger.git
cd neuro_tagger
sudo python setup.py install

You can also run the tests:

python setup.py test

However, I recommend installing this package from PyPI with pip:

pip install neuro_tagger

or

sudo pip install neuro_tagger
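
After installation, a quick way to confirm that the package is visible to your interpreter is to look it up without importing it. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical, and neuro_tagger is assumed to be the import name because it is the module name used later in this README.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "neuro_tagger" is the import name used elsewhere in this README.
    print("neuro_tagger installed:", is_installed("neuro_tagger"))
```

If this reports False right after a successful pip run, pip and python most likely belong to different environments.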

Usage

There are two examples of neuro_tagger usage in the demo subdirectory:

  1. demo_brat.py - an example of creating a neuro_tagger and estimating its quality with cross-validation on a labeled text corpus in the brat format (brat is a popular tool for manual text annotation);
  2. demo_factrueval.py - an example of experiments on the FactRuEval-2016 text corpus, which is part of a competition devoted to named entity recognition and fact extraction in Russian (it is described in the paper "FactRuEval 2016: Evaluation of Named Entity Recognition and Fact Extraction Systems for Russian"). Using the ELMo from the iPavlov project with a CRF as the final classifier achieves the best result on the first track of this competition (the F1-score is greater than 0.89, which is a state-of-the-art result for named entity recognition in Russian).
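
To make the F1-score mentioned above concrete, here is a simplified sketch of an exact-match, entity-level F1 computed over labeled spans. It is an illustration only: the competition uses its own scoring procedure, and the function name span_f1, the span representation, and the sample data are all assumptions made for this example.

```python
from typing import Set, Tuple

# A span is (start, end, entity_type); this representation is an assumption.
Span = Tuple[int, int, str]


def span_f1(true_spans: Set[Span], pred_spans: Set[Span]) -> float:
    """Exact-match F1: a prediction counts only if boundaries and type agree."""
    if not true_spans and not pred_spans:
        return 1.0
    true_positives = len(true_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(true_spans) if true_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up data: the second predicted entity has the wrong type.
gold = {(0, 5, "PER"), (10, 16, "LOC"), (20, 28, "ORG")}
pred = {(0, 5, "PER"), (10, 16, "ORG"), (20, 28, "ORG")}
score = span_f1(gold, pred)
print(round(score, 3))
```

Here two of the three predictions match exactly, so precision and recall are both 2/3 and so is the F1-score.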

Note

Use short texts such as sentences or small paragraphs, because long texts are processed worse. If you train neuro_tagger on a corpus of long texts, training may converge slowly. If you apply a neuro_tagger trained on short texts to a long text, only the initial words of that text will be tagged, and the remaining words at the end will be ignored by the algorithm. In addition, processing long texts requires a very large amount of RAM.

To solve this problem, you can split long texts into shorter sentences using well-known NLP libraries such as NLTK or SpaCy. If you also need to split a long text correctly together with its tag labels, you can use the special function tokenize_all_by_sentences from the module neuro_tagger.dataset_loading.
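
The idea of splitting a text together with its labels can be sketched as follows. This is not the package's tokenize_all_by_sentences, whose signature is not shown here: the function names, the label format (entity type mapped to character spans), and the naive regex splitter standing in for NLTK or SpaCy are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int]          # (start, end) character offsets, end exclusive
Labels = Dict[str, List[Span]]  # entity type -> spans (assumed format)

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
# A real splitter (NLTK, SpaCy) handles abbreviations and similar cases.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_bounds(text: str) -> List[Span]:
    """Return (start, end) character offsets of each sentence in the text."""
    bounds, start = [], 0
    for match in _SENTENCE_END.finditer(text):
        bounds.append((start, match.start()))
        start = match.end()
    if start < len(text):
        bounds.append((start, len(text)))
    return bounds


def split_with_labels(text: str, labels: Labels) -> List[Tuple[str, Labels]]:
    """Split a text into sentences and shift label offsets to each sentence."""
    result = []
    for sent_start, sent_end in sentence_bounds(text):
        local: Labels = {}
        for entity_type, spans in labels.items():
            for start, end in spans:
                # Keep only spans lying entirely inside this sentence;
                # spans crossing a sentence boundary are dropped.
                if sent_start <= start and end <= sent_end:
                    shifted = (start - sent_start, end - sent_start)
                    local.setdefault(entity_type, []).append(shifted)
        result.append((text[sent_start:sent_end], local))
    return result


text = "Anna lives in Paris. She works at Acme."
labels = {"PER": [(0, 4)], "LOC": [(14, 19)], "ORG": [(34, 38)]}
parts = split_with_labels(text, labels)
for sentence, sentence_labels in parts:
    print(sentence, sentence_labels)
```

Each resulting sentence carries offsets relative to its own start, so the short texts can be passed on independently of the original long text.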

Acknowledgment

This work was supported by the National Technology Initiative and PAO Sberbank, project ID 0000000007417F630002.

Contributors

bond005, trellixvulnteam
