artursstaf / nlp-example

This project forked from pmarcis/nlp-example

The code here provides a simple example of several NLP tasks for plain-text processing in English and Latvian.

License: MIT License

NLP Example

This is an example of how to:

  • Train neural network-based sentence breaking, tokenisation, tagging and parsing models with the UDPipe toolkit. The example shows how to build models for English and Latvian using the default parameters.
  • Use pre-trained UDPipe models in Python for sentence breaking, tokenisation, tagging and parsing of text documents.
  • Use the Python library NLTK for sentence breaking, tokenisation, tagging and parsing of English text documents with the Stanford Tagger and Stanford Parser.
  • Visualise syntactically parsed data in the CoNLL-U format using conllu.js.

Note: The example is meant for learning purposes! It uses publicly available tools and resources and runs default workflows, so the results will probably lag behind the state of the art for each language. Also, not all resources have open licenses (e.g., the Latvian syntactically annotated corpus cannot be used in real projects due to its restrictive license).

Installation

The tools have been tested only on Linux (Ubuntu 16.04) with 4GB of RAM and at least 15GB of free storage, and the installation script is meant only for Linux. That being said, this is what you need to do:

  • You know the drill: git clone https://github.com/pmarcis/nlp-example.git and cd nlp-example.
  • To create a Python 2.7 environment, use Conda: conda create -n nlp-example python=2.7 and conda activate nlp-example.
  • Next, for UDPipe, execute the set-up-udpipe.sh script. You may want to read through it before you run it! The script will (hopefully) install all required dependencies, download the relevant data and train UDPipe models for English and Latvian. It will also install the dependencies for visualising syntactically parsed data. For more details, see the script! It may take an hour (more or less) to complete.
  • Next, for NLTK, execute the set-up-nltk-with-stanford-tools.sh script. You may want to read through it before you run it! The script will (hopefully) install all required Python dependencies and download the Stanford Tagger and Stanford Parser. For more details, see the script!

Note: Make sure you have the build dependencies installed: sudo apt install build-essential swig python-dev. If the compilation of UDPipe fails, try compiling it with MODE=debug.

Usage

Once installed, you can execute processing of text documents as follows:

python process-text-with-udpipe.py -i test_en.txt -o test_en_out.txt -l en
python process-text-with-udpipe.py -i test_lv.txt -o test_lv_out.txt -l lv
python process-text-with-nltk.py -i test_en.txt -o test_en_nltk_out.txt -l en
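
For orientation, here is a minimal sketch of how a pre-trained UDPipe model can be called from Python through the ufal.udpipe bindings. It is not the code of process-text-with-udpipe.py, and the function name parse_to_conllu is made up for illustration. The model file name depends on what the set-up script produced, so it is passed in as a parameter.

```python
from __future__ import print_function


def parse_to_conllu(model_path, text):
    """Sentence-break, tokenise, tag and parse `text`; return CoNLL-U text.

    Hypothetical helper, not part of this repository.
    """
    # Imported lazily so this file can be loaded where ufal.udpipe is absent.
    from ufal.udpipe import Model, Pipeline, ProcessingError

    # Model.load gives back an empty value when the file cannot be loaded.
    model = Model.load(model_path)
    if not model:
        raise IOError("Cannot load UDPipe model from %s" % model_path)

    # "tokenize" selects the tokeniser stored in the model, DEFAULT keeps the
    # model's own tagger and parser settings, and "conllu" sets the output format.
    pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT,
                        Pipeline.DEFAULT, "conllu")
    error = ProcessingError()
    conllu = pipeline.process(text, error)
    if error.occurred():
        raise RuntimeError(error.message)
    return conllu
```

Calling parse_to_conllu with the path of a trained model and a plain-text string should return the parsed document as one CoNLL-U string, which can then be written to the output file.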

The resulting CoNLL-U data in the files test_en_out.txt and test_lv_out.txt can be visualised (so that a human can see what has happened) in visualize.html: copy the output from the files and paste it into the text box that appears after you click on edit. Make sure you have installed all dependencies from set-up-udpipe.sh. Otherwise you will be disappointed by the visualisation!
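
If you would rather inspect the output programmatically, the CoNLL-U format is easy to read by hand: one token per line with ten tab-separated columns, comment lines starting with #, and a blank line after each sentence. The sketch below uses only the standard library. The names read_conllu and Token and the sample sentence are made up for illustration and do not come from this repository.

```python
from __future__ import print_function
from collections import namedtuple

# The ten CoNLL-U columns, in the order they appear on each token line.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]
Token = namedtuple("Token", FIELDS)


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of Token tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped.
            continue
        else:
            columns = line.split("\t")
            if len(columns) == len(FIELDS):
                current.append(Token(*columns))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample in place of the contents of test_en_out.txt.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

for sentence in read_conllu(SAMPLE):
    for token in sentence:
        # HEAD is the ID of the governing token; 0 marks the root.
        print(token.id, token.form, token.upos, token.head, token.deprel)
```

To use it on real output, read one of the output files into a string and pass that to read_conllu instead of SAMPLE.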

For example purposes, the scripts have been written so that they talk a lot. If you want to use the tools for your own projects, I suggest commenting out everything that writes to standard output or standard error output.
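
An alternative to commenting lines out is to route the chatter through Python's logging module and add a switch that silences it. The sketch below is an assumption about how that could look, not a description of the existing scripts: the -q option, the logger name and both function names are hypothetical, while -i, -o and -l mirror the options used above.

```python
from __future__ import print_function
import argparse
import logging
import sys


def build_parser():
    """Command-line options: -i, -o and -l as above, plus a hypothetical -q."""
    parser = argparse.ArgumentParser(description="Sketch of a quiet switch.")
    parser.add_argument("-i", dest="input", required=True)
    parser.add_argument("-o", dest="output", required=True)
    parser.add_argument("-l", dest="language", default="en")
    parser.add_argument("-q", dest="quiet", action="store_true",
                        help="only report warnings and errors")
    return parser


def configure_logging(quiet):
    """Return a logger that writes to standard error; -q hides INFO messages."""
    logger = logging.getLogger("nlp-example")  # hypothetical logger name
    logger.setLevel(logging.WARNING if quiet else logging.INFO)
    if not logger.handlers:
        # Attach the handler once, even if this function is called again.
        logger.addHandler(logging.StreamHandler(sys.stderr))
    return logger


# Parse a fixed argument list here so the sketch runs on its own.
args = build_parser().parse_args(
    ["-i", "test_en.txt", "-o", "test_en_out.txt", "-l", "en", "-q"])
log = configure_logging(args.quiet)
log.info("Loading the %s model", args.language)  # suppressed because of -q
```

With this arrangement, every progress message becomes a log.info call, and a single option decides whether it is shown.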

License

The code here is licensed under the MIT license (i.e., do whatever you like with it). However, the dependencies are licensed under various different licenses. Therefore, do not assume that you will get a Holy Grail here!

Contributors

pmarcis, artursstaf, tomsbergmanis
