Giter Site home page Giter Site logo

garvys / rustfst Goto Github PK

View Code? Open in Web Editor NEW
140.0 9.0 16.0 6.58 MB

Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.

Home Page: https://docs.rs/rustfst

License: Other

Rust 81.43% Shell 0.91% C 0.04% C++ 7.28% Python 10.34%
fst wfst rust rust-lang rust-crate transducers finite-state-transducers finite-state-acceptors automata openfst speech-recognition fsts asr tokenizer transducer graph shortest-path composition kaldi kaldi-asr

rustfst's Introduction

Rustfst

License: MIT/Apache-2.0 Maintenance Github tag

Rust

rustc >= 1.51.0 Native Linux test status Documentation

Python

PyPI version PyPI download month PyPI pyversions

Rust implementation of Weighted Finite States Transducers.

Rustfst is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). Weighted finite-state transducers are automata where each transition has an input label, an output label, and a weight. The more familiar finite-state acceptor is represented as a transducer with each transition's input and output label equal. Finite-state acceptors are used to represent sets of strings (specifically, regular or rational sets); finite-state transducers are used to represent binary relations between pairs of strings (specifically, rational transductions). The weights can be used to represent the cost of taking a particular transition.

FSTs have key applications in speech recognition and synthesis, machine translation, optical character recognition, pattern matching, string processing, machine learning, information extraction and retrieval among others. Often a weighted transducer is used to represent a probabilistic model (e.g., an n-gram model, pronunciation model). FSTs can be optimized by determinization and minimization, models can be applied to hypothesis sets (also represented as automata) or cascaded by finite-state composition, and the best results can be selected by shortest-path algorithms.

fst

Overview

For a basic example see the section below.

Some simple and commonly encountered types of FSTs can be easily created with the macro [fst] or the functions acceptor and transducer.

For more complex cases you will likely start with the VectorFst type, which will be imported in the [prelude] along with most everything else you need. VectorFst<TropicalWeight> corresponds directly to the OpenFST StdVectorFst, and can be used to load its files using read or read_text.

Because "iteration" over an FST can mean many different things, there are a variety of different iterators. To iterate over state IDs you may use states_iter, while to iterate over transitions out of a state, you may use get_trs. Since it is common to iterate over both, this can be done using fst_iter or fst_into_iter. It is also very common to iterate over paths accepted by an FST, which can be done with paths_iter, and as a convenience for generating text, string_paths_iter. Alternately, in the case of a linear FST, you may retrieve the only possible path with decode_linear_fst.

Note that iterating over paths is not the same thing as finding the shortest path or paths, which is done with shortest_path (for a single path) or shortest_path_with_config (for N-shortest paths).

For the complete list of algorithms, see the [algorithms] module.

You may now be wondering, especially if you have previously used such linguist-friendly tools as pyfoma, "what if I just want to transduce some text???" The unfriendly answer is that rustfst is a somewhat lower-level library, designed for implementing things like speech recognizers. The somewhat more helpful answer is that you would do this by constructing an acceptor for your input, which you will compose with a transducer, then project the result to itsoutput, and finally iterate over the paths in the resulting FST.

References

Implementation heavily inspired from Mehryar Mohri's, Cyril Allauzen's and Michael Riley's work :

The API closely resembles that of OpenFST, with some simplifications and changes to make it more idiomatic in Rust, notably the use of Tr instead of Arc. See Differences fromOpenFST for more information.

Example

use anyhow::Result;
use rustfst::prelude::*;
use rustfst::algorithms::determinize::{DeterminizeType, determinize};
use rustfst::algorithms::rm_epsilon::rm_epsilon;

fn main() -> Result<()> {
    // Creates a empty wFST
    let mut fst = VectorFst::<TropicalWeight>::new();

    // Add some states
    let s0 = fst.add_state();
    let s1 = fst.add_state();
    let s2 = fst.add_state();

    // Set s0 as the start state
    fst.set_start(s0)?;

    // Add a transition from s0 to s1
    fst.add_tr(s0, Tr::new(3, 5, 10.0, s1))?;

    // Add a transition from s0 to s2
    fst.add_tr(s0, Tr::new(5, 7, 18.0, s2))?;

    // Set s1 and s2 as final states
    fst.set_final(s1, 31.0)?;
    fst.set_final(s2, 45.0)?;

    // Iter over all the paths in the wFST
    for p in fst.paths_iter() {
         println!("{:?}", p);
    }

    // A lot of operations are available to modify/optimize the FST.
    // Here are a few examples :

    // - Remove useless states.
    connect(&mut fst)?;

    // - Optimize the FST by merging states with the same behaviour.
    minimize(&mut fst)?;

    // - Copy all the input labels in the output.
    project(&mut fst, ProjectType::ProjectInput);

    // - Remove epsilon transitions.
    rm_epsilon(&mut fst)?;

    // - Compute an equivalent FST but deterministic.
    fst = determinize(&fst)?;

    Ok(())
}

Differences from OpenFST

Here is a non-exhaustive list of ways in which Rustfst's API differs from OpenFST:

  • The default epsilon symbol is <eps> and not <epsilon>.
  • Functions and methods follow Rust naming conventions, e.g. add_state rather than AddState, but are otherwise mostly equivalent, except that:
  • Transitions are called Tr and not Arc, because Arc has a rather different and well-established meaning in Rust, and rustfst uses it (std::sync::Arc, that is) to reference-count symbol tables. All associated functions also use tr.
  • Final states are not indicated by a final weight of zero. You can test for finality using is_final, and final_weight returns an [Option]. This requires some care when converting OpenFST code.
  • Transitions can be accessed directly as a slice rather than requiring an iterator.
  • Semiring operations are expressed as plain old methods rather than strange C++ things. So write w1.plus(w2) rather than Plus(w1, w2), for instance.
  • Weights have in-place operations for βŠ• (plus_assign) and βŠ— (times_assign).
  • Most of the type aliases (which would be trait aliases in Rust) such as StdArc, StdFst, and so forth, are missing, but type inference allows us to avoid explicit type arguments in most cases, such as when calling [Tr::new], for instance.
  • State IDs are unsigned, with [NO_STATE_ID] used for a missing value. They are also 32 bits by default (presumably, 4 billion states is enough for most applications). This means you must take care to cast them to [usize] when using them as indices, and vice-versa, preferably checking for overflows
  • Symbol IDs are also unsigned and 32-bits, with [NO_LABEL] used for a missing value.
  • Floating-point weights are not generic, so are always single-precision.

Benchmark with OpenFST

I did a benchmark some time ago on almost every linear fst algorithm and compared the results with OpenFst. You can find the results here :

Spoiler alert: Rustfst is faster on all those algorithms πŸ˜…

Documentation

The documentation of the last released version is available here : https://docs.rs/rustfst

Release process

  1. Use the script update_version.sh to update the version of every package.
  2. Push
  3. Push a new tag with the prefix rustfst-v

Example :

./update_version.sh 0.9.1-alpha.6
git commit -am "Release 0.9.1-alpha.6"
git push
git tag -a rustfst-v0.9.1-alpha.6 -m "Release rustfst 0.9.1-alpha.6"  
git push --tags

Optionally, if this is a major release, create a GitHub release in the UI.

Projects contained in this repository

This repository contains two main projects:

  • rustfst is the Rust re-implementation.
    • Crate available on crates.io here
    • Documentation available on docs.rs here
  • rustfst-python is the python binding of rustfst.
    • Package available on Pypi here
    • Documentation available on Github Pages here

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

rustfst's People

Contributors

adrienball avatar alexandrecauliersonos avatar awelkie avatar boxdot avatar dependabot[bot] avatar dhdaines avatar dzhu avatar emricksinisonos avatar garvys avatar hdlj avatar hubertdelajonquieresonos avatar jonasdepoixsonos avatar kali avatar llogiq avatar mooreniemi avatar noeddl avatar riccardobassanisonos avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

rustfst's Issues

Support LogWeight in C and Python

Writing code in Rust is fun, but writing it in Python is more fun... but for many use cases (i.e. training or constructing WFSTs for language modeling) we need the log semiring which is only accessible from Rust.

Evidently to support it elsewhere requires it to be added to rustfst-ffi which implies duplicating all the classes and functions everywhere. Since log and tropical semiring are nearly the same thing this seems a bit wasteful to me!

As an aside, I wonder whether the OpenFST approach of templating everything based on the weight type is really a good idea, given that it duplicates all the object code simply to have a different implementation of a handful of simple functions, but this is typical of the C++ mentality underlying OpenFST...

Bug in minimization

I'm using the latest version of Rustfst from Git (1ddd1da). This file contains an FST that appears to be turned into a non-equivalent FST by minimize(). The following test program should demonstrate the difference (save the linked file as bug.txt):

use rustfst::prelude::*;

fn main() {
    let path = fst_path![97, 98, 97, 100, 32];

    let mut fst: VectorFst<TropicalWeight> = VectorFst::read_text("bug.txt").unwrap();
    let accept1 = check_path_in_fst(&fst, &path);
    minimize(&mut fst).unwrap();
    let accept2 = check_path_in_fst(&fst, &path);

    assert_ne!(accept1, accept2);
}

The given path is rejected by the original FST but accepted by the minimized one.

I believe the issue arises because, in AcyclicMinimizer::refine, the partition is cloned into the comparator and used without being changed, while the original partition is updated as the function runs. As a result, two states with the same height that have the same number of transitions, where all the corresponding transitions have the same labels and point to states that were originally in the same partition class (i.e., they have the same height) are merged, even if the pointed-to states have since been distinguished and moved into different partition classes by earlier iterations of the loop.

Simply moving the declaration of state_cmp inside the loop (so that the comparator always has an up-to-date partition) gives correct behavior on my test case and still passes all of the existing test cases in the repo, which gives me some confidence that this idea is correct. Unfortunately, that does mean the clone happens even more times than beforeβ€”I'm not sure if there's a good way to avoid that.

`tr_unique` doesn't work in some cases

The tr_unique function doesn't remove Tr duplicates if they are not consecutive within the list of trs of a state. The same behaviour happens in OpenFST so adding a new test case for this situation will break comparison tests.
The following example can reproduce the issue:

    #[test]
    fn test_tr_map_unique_1() -> Result<()> {
        let mut fst_in = VectorFst::<ProbabilityWeight>::new();

        let s1 = fst_in.add_state();
        let s2 = fst_in.add_state();

        fst_in.add_tr(s1, Tr::new(1, 2, ProbabilityWeight::new(1.0), s2))?;
        fst_in.add_tr(s1, Tr::new(1, 2, ProbabilityWeight::new(2.0), s2))?;
        fst_in.add_tr(s1, Tr::new(1, 2, ProbabilityWeight::new(1.0), s2))?;

        fst_in.set_start(s1)?;
        fst_in.set_final(s2, ProbabilityWeight::one())?;

        let mut fst_out = VectorFst::<ProbabilityWeight>::new();

        let s1 = fst_out.add_state();
        let s2 = fst_out.add_state();

        fst_out.add_tr(s1, Tr::new(1, 2, ProbabilityWeight::new(1.0), s2))?;
        fst_out.add_tr(s1, Tr::new(1, 2, ProbabilityWeight::new(2.0), s2))?;

        fst_out.set_start(s1)?;
        fst_out.set_final(s2, ProbabilityWeight::one())?;

        tr_unique(&mut fst_in);

        assert_eq!(fst_in, fst_out);

        Ok(())
    }

It can easily be solved by adding a check on the weight value of a transition in tr_compare :

pub(crate) fn tr_compare<W: Semiring>(tr_1: &Tr<W>, tr_2: &Tr<W>) -> Ordering {
    if tr_1.ilabel < tr_2.ilabel {
        return Ordering::Less;
    }
    if tr_1.ilabel > tr_2.ilabel {
        return Ordering::Greater;
    }
    if tr_1.olabel < tr_2.olabel {
        return Ordering::Less;
    }
    if tr_1.olabel > tr_2.olabel {
        return Ordering::Greater;
    }
    if tr_1.weight < tr_2.weight {
        return Ordering::Less;
    }
    if tr_1.weight > tr_2.weight {
        return Ordering::Greater;
    }
    if tr_1.nextstate < tr_2.nextstate {
        return Ordering::Less;
    }
    if tr_1.nextstate > tr_2.nextstate {
        return Ordering::Greater;
    }
    Ordering::Equal
}

Python 3.11 support?

PyPI says that 3.11 is supported, but I don't see any 3.11 versions in the list of files. I can't install the latest version on my machine, but I'm not sure if the problem is the python version (3.11) or the MacOS version (13).

Support FAR format

Another enhancement, perhaps to be accompanied by a pull request in the near future...

It would be nice to (at least minimally) support the FAR format for collections of (usually linear) FSTs, as this is needed to reimplement OpenGRM functionality or interoperate with OpenGRM.

Gallic semiring information

This is a somehow "documentation oriented" issue! I was trying to understand determinization for FSTs and I'm having difficulties in finding literature regarding the "Gallic" semiring. I tried to look into the OpenFST documentation as well but I don't find any reference. Do you perhaps know any paper/book where this is described in detail? Thank you!

ConstFstBuilder

Currently, the only way to create a ConstFst object is to convert it from a MutableFst which makes some copies along the way.
If the number of states, and arcs are known in advance, theoritically, it could be possible to create directly the ConstFst.

serializable `IntegerWeight`

I see that TropicalWeight has SerializableSemiring implemented, but not IntegerWeight:

impl SerializableSemiring for TropicalWeight {
    fn weight_type() -> String {
        "tropical".to_string()
    }

Purposeful? Or can I submit a PR with IntegerWeight too?

Without this fst.write doesn't work.

Composition and O_LABEL_SORTED

Thanks for producing an incredible piece of software!

I am attempting to compose two (moderately sized) wFSTs (I have used compose in other contexts without a problem). In this case, the program panics with the following message:

Properties are not known : O_LABEL_SORTED | NOT_O_LABEL_SORTED.

I assume that this is because compose doesn't know whether the transitions of the first wFST have been sorted by their output labels. They have. Is there a straightforward way of updating the FstProperties of the wFST to communicate this?

Determinization issues: Unequal arguments : non-functional FST

When attempting to call determinize on some wFSTs, I get error messages like the following:

Unequal arguments : non-functional FST ? w1 = StringWeightRestrict { value: Labels([45, 62]) } w2 = StringWeightRestrict { value: Labels([45, 40]) }

As far as I can tell, this not very informative error message goes back to OpenFST, but even there, its meaning seems unclear. What could the problem be?

I haven't been able to reproduce this error with smaller FSTs, but I'll try.

[Python] Printing a VectorFst crashes

>>> import rustfst
>>> from rustfst import VectorFst, Tr, SymbolTable
>>> fst = VectorFst()
>>> s1 = fst.add_state()
>>> print(s1)
0
>>> print(fst)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alexandre.caulier/Documents/rustfst/venv94/lib/python3.7/site-packages/rustfst/fst/__init__.py", line 262, in __str__
    return self.text()
AttributeError: 'VectorFst' object has no attribute 'text'

Bug in optimize

I have an issue when I run the optimize algorithm on this FST.

The algorithm panics because of an unwrap() occurring in the decode function that happens just after a minimize. More exactly, decode(tr.ilabel) is called with an ilabel == 0 which implies calling a get on a negative value here.

The issue seems to be introduced by this PR #166 as the algorithm doesn't panic in the latest commit before that change (8f7f486).

Some Mutable operations on ConstFst

Some algorithms don't change much the Fst.
This is the case for ArcMap for instance.
For instance, inverting input labels and output labels could be done on a ConstFst but currently it requires a VectorFst

Type arguments to `compose` are horrendous

Not sure if this is rusfst's fault or something inherent to the Rust language, but I really don't feel like it should be necessary to write a straightforward compose operation this way (https://github.com/dhdaines/rustfst-g2p/blob/main/src/g2p.rs#L148):

// fst and self.model are both VectorFst<TropicalWeight>
let fst: VectorFst<TropicalWeight> = compose::<
    TropicalWeight,
    VectorFst<TropicalWeight>,
    VectorFst<TropicalWeight>,
    _,
    _,
    _,
>(fst, &self.model)?;

Is there something that can be done to make rustc properly infer types W, F1, and F2?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.