google-research / nisaba

Finite-state script normalization and processing utilities

License: Apache License 2.0

Starlark 14.85% Python 47.78% C++ 37.38%
finite-state finite-state-transducers finite-state-automata grammars indic-languages brahmic-scripts writing-systems unicode unicode-normalization pynini

nisaba's Introduction

Nisaba

Named after Nisaba — the Sumerian goddess of writing and scribe of the gods (𒀭𒉀).

About

Collection of finite-state transducer-based (FST) tools for visual normalization, well-formedness, transliteration and NFC normalization of various scripts from South Asia and beyond. Nisaba provides these APIs in Python and C++. Currently supported script families:
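
As a rough illustration of what NFC normalization means for Brahmic text, the sketch below uses only Python's standard `unicodedata` module. It does not call Nisaba's API, which is not shown on this page; the helper names `nfc` and `codepoints` are made up for the example.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the Unicode NFC form of `text` (standard library, not Nisaba)."""
    return unicodedata.normalize("NFC", text)


def codepoints(text: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Bengali KA + VOWEL SIGN E + VOWEL SIGN AA: the two vowel parts compose
# into the single code point VOWEL SIGN O (U+09CB) under NFC.
bengali_split = "\u0995\u09c7\u09be"
bengali_nfc = nfc(bengali_split)

# Devanagari QA (U+0958) is a composition exclusion, so NFC keeps it
# decomposed as KA + NUKTA.
devanagari_qa = "\u0958"
devanagari_nfc = nfc(devanagari_qa)

print(codepoints(bengali_nfc))     # U+0995 U+09CB
print(codepoints(devanagari_nfc))  # U+0915 U+093C
```

The two cases go in opposite directions (one composes, one decomposes), which is why normalization is usually applied before comparing or indexing text in these scripts.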

Nisaba primarily relies on OpenGrm Pynini, which is a Python toolkit for finite-state grammar development. OpenGrm Pynini, like its C++ counterpart Thrax, compiles grammars expressed as strings, regular expressions, and context-dependent rewrite rules into weighted finite-state transducers (WFSTs). It uses the OpenFst library and its Python extension to create, access and manipulate compiled grammars.
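
To give a feel for what a context-dependent rewrite rule compiled into a transducer does, here is a toy sketch in plain Python. It is not Pynini or OpenFst code and is unweighted; the rule (rewrite `a` as `b` when the preceding input symbol is `c`), the alphabet, and the names `build_rule` and `apply_rule` are all assumptions chosen for illustration.

```python
from typing import Dict, Tuple

# Arc table: (state, input symbol) -> (output symbol, next state).
# State 0: the previous input symbol was not "c".
# State 1: the previous input symbol was "c".
Arcs = Dict[Tuple[int, str], Tuple[str, int]]


def build_rule(alphabet: str) -> Arcs:
    """Build a toy transducer for the rule  a -> b / c __ ."""
    arcs: Arcs = {}
    for state in (0, 1):
        for sym in alphabet:
            out = "b" if (state == 1 and sym == "a") else sym
            arcs[(state, sym)] = (out, 1 if sym == "c" else 0)
    return arcs


def apply_rule(arcs: Arcs, text: str) -> str:
    """Run the transducer over `text`, collecting the output symbols."""
    state = 0
    out = []
    for sym in text:
        emitted, state = arcs[(state, sym)]
        out.append(emitted)
    return "".join(out)


rule = build_rule("abc")
print(apply_rule(rule, "cacaab"))  # cbcbab
```

A real grammar toolkit builds such machines from the rule notation automatically and supports composition, weights, and much larger alphabets; the table above only shows the underlying idea of states plus input/output arcs.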

Building and testing

This library will build on any system that supports Bazel, a versatile multiplatform build and test tool. The following examples assume a Debian Linux distribution, but should also apply with minor modifications to other Linux and non-Linux platforms that Bazel supports.

Prerequisites

Bazel or Bazelisk

Your operating system may permit an easy installation of a pre-built Bazel package, as in the Debian-specific example below:

sudo apt-get install bazel

Alternatively, e.g., on macOS, a user-friendly Bazel launcher called Bazelisk can be installed:

BAZEL=bazelisk-darwin-amd64
curl -LO "https://github.com/bazelbuild/bazelisk/releases/latest/download/$BAZEL"
chmod +x $BAZEL

When using Bazelisk, replace the command bazel in the examples below with ./$BAZEL (or move the downloaded binary to a directory on your PATH and use $BAZEL).

C++ and Python

Nisaba requires a modern C++ compiler that supports the C++17 standard (e.g., the GCC 10 release series) and Python 3. Assuming these are already present, the remaining dependencies are the Python 3 development headers and the Python 3 package installer pip.

sudo apt-get install python3-dev
sudo apt-get install python3-pip

Example Debian configuration: gcc (10.2.0), bazel (3.7.2), python3 (3.8.6) and pip (20.1.1).
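
Before building, it can help to confirm that the prerequisites are visible on your PATH. The following is a minimal sketch using only the standard library; the executable names it looks for (`bazelisk`, `clang++`, `pip`) are assumptions beyond the names mentioned above, and it checks for presence only, not for versions.

```python
import shutil
import sys
from typing import Dict, Optional

# Prerequisites from the section above; alternatives are tried in order.
# "bazelisk" and "clang++" are assumed alternative names, adjust as needed.
TOOLS = {
    "bazel": ("bazel", "bazelisk"),
    "c++ compiler": ("g++", "clang++"),
    "pip": ("pip3", "pip"),
}


def find_tools() -> Dict[str, Optional[str]]:
    """Map each prerequisite to the first matching executable on PATH, or None."""
    found: Dict[str, Optional[str]] = {}
    for label, candidates in TOOLS.items():
        paths = (shutil.which(name) for name in candidates)
        found[label] = next((p for p in paths if p), None)
    return found


if __name__ == "__main__":
    print(f"python: {sys.version.split()[0]}")
    for label, path in find_tools().items():
        print(f"{label}: {path or 'NOT FOUND'}")
```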

Getting and building the code

  1. Locally, make sure you are in some sort of a virtual environment (venv, virtualenv, conda, etc).

  2. Clone the repository (this example clones the main repository rather than a fork, but a forked repo can be used as well):

    git clone https://github.com/google-research/nisaba.git
    cd nisaba
  3. Build all the targets using Bazel (this example uses optimized mode):

    bazel build -c opt ...

    The above command will build Nisaba artifacts using all the remote repository dependencies, including OpenFst, Pynini and Thrax, that are specified in the Bazel WORKSPACE file. The resulting artifacts are located in the bazel-bin/nisaba directory.

    If the above command fails due to missing Python prerequisites, please install them using the pip Python package manager and try again:

    pip3 install --upgrade pip
    pip3 install -r requirements.txt
  4. Make sure the small unit tests are passing:

    bazel test -c opt --test_size_filters=-large,-enormous ...

    The above command should produce something along the following lines:

      ...
      //nisaba/scripts/brahmic:cc_test                                                 PASSED in 0.4s
      //nisaba/scripts/brahmic:far_cc_test                                             PASSED in 0.2s
      //nisaba/scripts/brahmic:far_test                                                PASSED in 2.0s
      //nisaba/scripts/brahmic:fixed_test                                              PASSED in 0.2s
      //nisaba/scripts/brahmic:fst_properties_test                                     PASSED in 2.3s
      //nisaba/scripts/brahmic:iso_test                                                PASSED in 0.3s
      //nisaba/scripts/brahmic:nfc_test                                                PASSED in 0.2s
      //nisaba/scripts/brahmic:nfc_utf8_test                                           PASSED in 0.2s
      //nisaba/scripts/brahmic:py_test                                                 PASSED in 2.1s
      //nisaba/scripts/brahmic:util_test                                               PASSED in 1.9s
      //nisaba/scripts/brahmic:visual_norm_test                                        PASSED in 0.3s
      //nisaba/scripts/brahmic:visual_norm_utf8_test                                   PASSED in 0.3s
      //nisaba/scripts/brahmic:wellformed_test                                         PASSED in 0.2s
      //nisaba/scripts/brahmic:wellformed_utf8_test                                    PASSED in 0.2s
      ...

    You may also want to run all the tests, but depending on your host configuration these may take a long time:

    bazel test -c opt ...
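
Step 1 above asks for a virtual environment. One way to check this from Python is sketched below; it is a heuristic rather than anything the build requires, and it treats any activated conda environment (including `base`, via the `CONDA_PREFIX` variable) as a virtual environment.

```python
import os
import sys


def in_virtual_env() -> bool:
    """Heuristic: True inside venv/virtualenv or an activated conda environment."""
    # venv and recent virtualenv releases make sys.prefix differ from sys.base_prefix.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    # conda sets CONDA_PREFIX when an environment is activated.
    in_conda = bool(os.environ.get("CONDA_PREFIX"))
    return in_venv or in_conda


print("virtual environment active:", in_virtual_env())
```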

Contributions

NOTE: We don't accept pull requests (PRs) at the moment.

License

Nisaba is licensed under the terms of the Apache license. See LICENSE for more information.

Citation

If you use this software in a publication, please cite the accompanying paper from EACL 2021:

@inproceedings{nisaba-eacl2021,
    title = {Finite-state script normalization and processing utilities: The {N}isaba {B}rahmic library},
    author = {Cibu Johny and Lawrence Wolf-Sonkin and Alexander Gutkin and Brian Roark},
    booktitle = {16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations},
    address = {[Online], Kyiv, Ukraine},
    month = apr,
    year = {2021},
    pages = {14--23},
    publisher = {Association for Computational Linguistics},
    doi = {10.18653/v1/2021.eacl-demos.3},
    url = {https://www.aclweb.org/anthology/2021.eacl-demos.3},
}

Mandatory disclaimer

This is not an official Google product.

nisaba's People

Contributors

agutkin, annakatanova, cibu, derekmauro, isingoo, jerub, jesseemond, katre, kylebgorman, lwolfsonkin, mauricioa, rickeylev, roark-google, yilei

nisaba's Issues

Using Nisaba via Pynini: cannot find rules_python

Hello,

I followed the installation instructions in the README to install and build Nisaba on my personal MacOS machine. Bazel is working and the small unit tests are passing.

However, when I try to use Nisaba FSTs in Pynini following the examples here, imports from rules_python fail. I am getting the same error with both the Anaconda Python 3.7 and the system Python 2.7.

Is there some further disconnect between my Python and Bazel, or am I maybe missing a step here? Thanks!

ISSUE WITH BENGALI

In addition to the regular ABNF formalism governing the Brahmi syllable, Bengali admits two more combinations that are unique to the language:

  1. Candrabindu and anusvara (bindu) can occur together. The word হ্যাঁংচা (hancha, a sweet) is cited as evidence:
    হ ্ যা ঁ ং চা 09B9 09CD 09AF 09BE 0981 0982 099A 09BE
  2. A nasal can be followed by a visarga. This combination is attested in Vedic but not in a neo-Brahmi syllable. An example:
    হ্যাঁঃ হ ্ যা ঁ ঃ 09B9 09CD 09AF 09BE 0981 0983

Both cases are rare and have been validated by the Bengali experts whom I contacted.
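
The two sequences reported above can be inspected with Python's standard `unicodedata` module. The sketch below rebuilds them from the listed code points and reports where candrabindu is immediately followed by anusvara or visarga; the helper names are made up for illustration, and this is not how Nisaba's well-formedness grammars are implemented.

```python
import unicodedata


def from_hex(codepoints: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(cp, 16)) for cp in codepoints.split())


# Code point sequences exactly as listed in the two cases above.
CASE_1 = from_hex("09B9 09CD 09AF 09BE 0981 0982 099A 09BE")
CASE_2 = from_hex("09B9 09CD 09AF 09BE 0981 0983")

CANDRABINDU, ANUSVARA, VISARGA = "\u0981", "\u0982", "\u0983"


def stacked_modifiers(text: str) -> list:
    """Return name pairs where candrabindu is directly followed by anusvara or visarga."""
    return [
        (unicodedata.name(a), unicodedata.name(b))
        for a, b in zip(text, text[1:])
        if a == CANDRABINDU and b in (ANUSVARA, VISARGA)
    ]


for case in (CASE_1, CASE_2):
    print(stacked_modifiers(case))
```

A grammar that allows at most one such sign per syllable would reject both inputs, which is the behavior the report asks to relax for Bengali.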

USE OF MODIFYING LETTER [STRESS MARKER] IN BORO, MAITHILI, DOGRI

Boro, Maithili and Dogri are mildly tonal languages and admit a stress marker.
"The orthographies of the Bodo, Dogri, and Maithili languages
of India make use of U+02BC “ ’ ” modifier letter apostrophe, either as a tone mark or
as a length mark. In Bodo and Dogri, this character functions as a tone mark, called gojau
kamaa in Bodo and sur chinha in Dogri. In Dogri, the tone mark occurs after short vowels,
including inherent vowels, and indicates a high-falling tone. After Dogri long vowels, a
high-falling tone is written instead using U+0939 devanagari letter ha.
In Maithili, U+02BC “ ’ ” modifier letter apostrophe is used to indicate the prolongation
of a short a and to indicate the truncation of words. This sign is called bikari kaamaa."
PROVENANCE: UNICODE CHAPTER 12. pp. 484-485
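
The character properties explain why U+02BC matters for script processing: unlike the visually similar quotation marks, it is classified as a letter. A small sketch with the standard `unicodedata` module (the comparison characters are chosen here for illustration and are not part of the report):

```python
import unicodedata

# U+02BC from the quotation above, plus two look-alikes chosen for comparison.
CANDIDATES = {
    "modifier letter apostrophe": "\u02bc",
    "right single quotation mark": "\u2019",
    "ASCII apostrophe": "\u0027",
}


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for label, ch in CANDIDATES.items():
    print(label, describe(ch))
```

U+02BC has general category `Lm` (modifier letter), while the two look-alikes are punctuation (`Pf` and `Po`), so tokenizers and word-boundary rules treat them differently.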

Bazel Build Issue GRM2

Problem:
bazel test -c opt ...
Bazel build is failing due to issues with the grm2 thrax package. There is no thrax subdirectory in the grm2 directory, but a file depends on it.

Solutions tried:
Tried pulling thrax separately and putting it in the grm2 directory within the folder structure.
Tried putting it in the local directory and changing the path to the grm-manager.h file in the .cc and .h files.
Tried on a different machine.

Python 3.8.10
Ubuntu 20.04.6 LTS

Error:
In file included from nisaba/scripts/brahmic/grammar.cc:15:
./nisaba/scripts/brahmic/grammar.h:23:10: fatal error: grm2/thrax/grm-manager.h: No such file or directory
   23 | #include "grm2/thrax/grm-manager.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
1696873372.322851622: src/main/tools/linux-sandbox-pid1.cc:538: wait returned pid=2, status=0x100
1696873372.322862493: src/main/tools/linux-sandbox-pid1.cc:556: child exited normally with code 1
1696873372.323221847: src/main/tools/linux-sandbox.cc:233: child exited normally with code 1
INFO: Elapsed time: 1521.160s, Critical Path: 158.03s
INFO: 2647 processes: 448 internal, 2199 linux-sandbox.
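
One way to narrow down an error like this is to check where, if anywhere, the missing header exists among the fetched dependencies. The sketch below is a generic diagnostic, not a fix; `find_header` is a hypothetical helper, and which directory to search is an assumption (for example, the `external` directory under the path printed by `bazel info output_base`).

```python
from pathlib import Path
from typing import List


def find_header(root: Path, suffix: str = "thrax/grm-manager.h") -> List[Path]:
    """Return every file under `root` whose path ends with `suffix`."""
    parts = tuple(suffix.split("/"))
    return sorted(
        p
        for p in root.rglob(parts[-1])
        if p.is_file() and p.parts[-len(parts):] == parts
    )
```

If the header turns up under a prefix other than `grm2/`, the include path in the failing source and the layout of the fetched dependency disagree, which is useful detail to add to a bug report.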

Bazel Build Failure

Problem:
bazel test -c opt ...
Bazel build is failing due to issues with pynini package.

Python 3.8.10
Ubuntu 20.04.6 LTS

Installed pynini separately
Tried including a colon in the problem file
Different conda environment
Tried gcc versions: 9.4.0/10.2.0

Error Message 1

bazel test -c opt ...
ERROR: Skipping '...': error loading package under directory '': error loading package 'nisaba/scripts/abjad_alphabet': at /home/ayush/work/nisaba/nisaba/nisaba/scripts/utils/grammars.bzl:22:6: Label '//speech/fst/testing:build_defs.bzl' is invalid because 'speech/fst/testing' is not a package; perhaps you meant to put the colon here: '//:speech/fst/testing/build_defs.bzl'?
ERROR: error loading package under directory '': error loading package 'nisaba/scripts/abjad_alphabet': at /home/ayush/work/nisaba/nisaba/nisaba/scripts/utils/grammars.bzl:22:6: Label '//speech/fst/testing:build_defs.bzl' is invalid because 'speech/fst/testing' is not a package; perhaps you meant to put the colon here: '//:speech/fst/testing/build_defs.bzl'?
INFO: Elapsed time: 0.124s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
    currently loading: nisaba/scripts/brahmic ... (7 packages)
ERROR: Couldn't start the build. Unable to run tests

Error Message 2 (after adding a colon in the grammars.bzl file)

ERROR: /home/ayush/work/nisaba/nisaba/nisaba/scripts/utils/grammars.bzl:22:6: in load statement: invalid target name 'speech/fst/testing:build_defs.bzl': target names may not contain ':'
INFO: Repository rules_proto instantiated at:
  /home/ayush/work/nisaba/nisaba/WORKSPACE.bazel:108:14: in <toplevel>
  /home/ayush/.cache/bazel/_bazel_ayush/994007203a8fe7aea0910ee47f910bb9/external/com_google_protobuf/protobuf_deps.bzl:56:21: in protobuf_deps
Repository rule http_archive defined at:
  /home/ayush/.cache/bazel/_bazel_ayush/994007203a8fe7aea0910ee47f910bb9/external/bazel_tools/tools/build_defs/repo/http.bzl:372:31: in <toplevel>
INFO: Repository bazel_skylib instantiated at:
  /home/ayush/work/nisaba/nisaba/WORKSPACE.bazel:81:20: in <toplevel>
  /home/ayush/.cache/bazel/_bazel_ayush/994007203a8fe7aea0910ee47f910bb9/external/org_opengrm_pynini/bazel/workspace.bzl:44:17: in pynini_repositories
Repository rule http_archive defined at:
  /home/ayush/.cache/bazel/_bazel_ayush/994007203a8fe7aea0910ee47f910bb9/external/bazel_tools/tools/build_defs/repo/http.bzl:372:31: in <toplevel>
ERROR: error loading package under directory '': error loading package 'nisaba/scripts/abjad_alphabet': module 'nisaba/scripts/utils/grammars.bzl' has invalid load statements
INFO: Elapsed time: 1.260s
INFO: 0 processes.
ERROR: Couldn't start the build. Unable to run tests
FAILED: Build did NOT complete successfully (0 packages loaded)
    currently loading: nisaba/scripts/utils ... (31 packages)
    Fetching repository @nisaba_deps; Restarting.
    Fetching repository @rules_cc; starting
    Fetching https://github.com/bazelbuild/bazel-skylib/releases/download/1.3.0/bazel-skylib-1.3.0.tar.gz
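
The two messages follow from how Bazel splits a label into a package (the part before the colon, which must be a directory containing a BUILD file) and a target (the part after it). The toy parser below mimics that split to show why moving or adding a colon does not help; it is a simplified sketch, not Bazel's actual label parser, and `parse_label` is a made-up name.

```python
from typing import NamedTuple


class Label(NamedTuple):
    repo: str      # text between "@" and "//"; empty for the main repository
    package: str   # directory that must contain a BUILD file
    target: str    # file or rule name inside that package


def parse_label(label: str) -> Label:
    """Split an absolute label such as '@repo//pkg/path:target' into its parts."""
    repo, sep, rest = label.partition("//")
    if not sep:
        raise ValueError(f"not an absolute label: {label!r}")
    package, colon, target = rest.partition(":")
    if not colon:
        # Shorthand '//pkg/name' means target 'name' in package 'pkg/name'.
        target = package.rsplit("/", 1)[-1]
    if ":" in target:
        raise ValueError(f"target names may not contain ':': {target!r}")
    return Label(repo.lstrip("@"), package, target)


print(parse_label("//speech/fst/testing:build_defs.bzl"))
```

The original label parses cleanly into package `speech/fst/testing` and target `build_defs.bzl`, so the first error is about that package directory not existing in the workspace, not about syntax. Writing `//:speech/fst/testing:build_defs.bzl` leaves a second colon inside the target name, which this sketch rejects with the same complaint as Error Message 2.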

[MacBook/M1 chip] Unable to generate FSTs

Hi,

Thanks for this useful project and the nice documentation.

I am trying to run your example. However, I am not able to generate the .far files due to issues with Bazel in Mac M1.

Would it be possible for you to provide the .far files too? Or do you have them somewhere online where I can download them? That said, issue #31 helped to some extent.

Thanks.

Nukta adjoined to Vowel letter /अ/ in Ol Ciki

Ol Ciki/Santhali/Cemet admits अ़: the vowel letter अ /a/ followed by nukta (U+0905 U+093C).
Example:
अ़पारम अंजान भाषा: an unknown language
In the example above, अ़ (अ with nukta) represents the phoneme /ə/ in the Santhali language.
Expertise: Dept. of tribal & regional languages, Ranchi University, Ranchi.
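
The sequence in this report has no precomposed code point, so it survives NFC as two code points and any grammar has to accept the nukta directly after the vowel letter. The sketch below checks this with the standard `unicodedata` module; the NA + nukta comparison is added here for contrast and is not part of the report.

```python
import unicodedata

A_WITH_NUKTA = "\u0905\u093c"   # DEVANAGARI LETTER A + DEVANAGARI SIGN NUKTA
NA_WITH_NUKTA = "\u0928\u093c"  # DEVANAGARI LETTER NA + DEVANAGARI SIGN NUKTA


def composes_under_nfc(seq: str) -> bool:
    """True if NFC replaces the sequence with a single precomposed code point."""
    return len(unicodedata.normalize("NFC", seq)) == 1


print([unicodedata.name(ch) for ch in A_WITH_NUKTA])
print("A + nukta composes:", composes_under_nfc(A_WITH_NUKTA))    # False
print("NA + nukta composes:", composes_under_nfc(NA_WITH_NUKTA))  # True
```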
