yhirose / cpp-fstlib

A single-file, header-only C++17 library implementing Minimal Acyclic Subsequential Transducers (finite state transducers)

License: MIT License


cpp-fstlib

C++17 header-only FST (finite state transducer) library. It can be used as a trie data structure. The library is built with the "Minimal Acyclic Subsequential Transducers" construction algorithm.

Play cpp-fstlib with cli

> git clone https://github.com/yhirose/cpp-fstlib
> cd cpp-fstlib
> mkdir build && cd build
> cmake .. && make
> ./cmd/fst compile /usr/share/dict/words words.fst

> ./cmd/fst search words.fst hello
83713

> ./cmd/fst prefix words.fst helloworld
h: 81421
he: 82951
hell: 83657
hello: 83713

> ./cmd/fst longest words.fst helloworld
hello: 83713

> ./cmd/fst predictive words.fst predictiv
predictive: 153474
predictively: 153475
predictiveness: 153476

> ./cmd/fst fuzzy words.fst fuzzy -ed 2   # edit distance 2
Suzy: 195759
buzz: 28064
buzzy: 28076
...

> ./cmd/fst spellcheck words.fst thier
their: 0.946667
thir: 0.762667
tier: 0.752
thief: 0.736
trier: 0.704

API reference

namespace fst {

enum class Result { Success, EmptyKey, UnsortedKey, DuplicateKey };

std::pair<Result, size_t /* error input index */> compile<uint32_t>(
  const std::vector<std::pair<std::string, uint32_t>> &input,
  std::ostream &os,
  bool sorted
);

std::pair<Result, size_t /* error input index */> compile<std::string>(
  const std::vector<std::pair<std::string, std::string>> &input,
  std::ostream &os,
  bool sorted
);

std::pair<Result, size_t /* error input index */> compile(
  const std::vector<std::string> &key_only_input,
  std::ostream &os,
  bool need_output, // true: map, false: set
  bool sorted
);

template <typename output_t> class map {
public:
  map(const char *byte_code, size_t byte_code_size);

  operator bool() const;

  bool contains(std::string_view sv) const;

  output_t operator[](std::string_view sv) const;

  output_t at(std::string_view sv) const;

  bool exact_match_search(std::string_view sv, output_t &output) const;

  std::vector<std::pair<size_t /* length */, output_t /* output */>>
  common_prefix_search(std::string_view sv) const;

  size_t longest_common_prefix_search(std::string_view sv, output_t &output) const;

  std::vector<std::pair<std::string, output_t>>
  predictive_search(std::string_view sv) const;

  std::vector<std::pair<std::string, output_t>>
  edit_distance_search(std::string_view sv, size_t max_edits) const;

  std::vector<std::tuple<double, std::string, output_t>>
  suggest(std::string_view word) const;
};

class set {
public:
  set(const char *byte_code, size_t byte_code_size);

  operator bool() const;

  bool contains(std::string_view sv) const;

  std::vector<size_t> common_prefix_search(std::string_view sv) const;

  size_t longest_common_prefix_search(std::string_view sv) const;

  std::vector<std::string> predictive_search(std::string_view sv) const;

  std::vector<std::string>
  edit_distance_search(std::string_view sv, size_t max_edits) const;

  std::vector<std::pair<double, std::string>>
  suggest(std::string_view word) const;
};

} // namespace fst
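
The Result values suggest the kinds of input problems compile can report, together with the index of the offending entry. The following is a minimal sketch of such a pre-check for input that is expected to be already sorted. The function check_input, the local copy of the enum, the order of the checks, and the index returned on success are all assumptions made for illustration; the library's actual behavior may differ.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Local stand-in for fst::Result so the sketch compiles on its own.
enum class Result { Success, EmptyKey, UnsortedKey, DuplicateKey };

// Hypothetical pre-check: returns the first problem found and the index of
// the entry that caused it. On success the index is input.size().
template <typename Output>
std::pair<Result, std::size_t>
check_input(const std::vector<std::pair<std::string, Output>> &input) {
  for (std::size_t i = 0; i < input.size(); ++i) {
    const std::string &key = input[i].first;
    if (key.empty()) return {Result::EmptyKey, i};
    if (i > 0) {
      const std::string &prev = input[i - 1].first;
      if (key == prev) return {Result::DuplicateKey, i};
      if (key < prev) return {Result::UnsortedKey, i}; // byte-wise order
    }
  }
  return {Result::Success, input.size()};
}
```

Running a check like this before calling compile with sorted set to true makes it easier to see which entry is out of place.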

API usage

const std::vector<std::pair<std::string, std::string>> items = {
  {"hello", "こんにちは!"},
  {"world", "世界!"},
  {"hello world", "こんにちは世界!"}, // incorrect sort order entry...
};

std::stringstream out;
auto sorted = false; // ask fst::compile to sort entries
auto [result, error_line] = fst::compile<std::string>(items, out, sorted);

if (result == fst::Result::Success) {
  const auto& byte_code = out.str();
  fst::map<std::string> matcher(byte_code.data(), byte_code.size());

  if (matcher) {
    assert(matcher.contains("hello world"));
    assert(!matcher.contains("Hello World"));
    assert(matcher["hello"] == "こんにちは!");

    auto prefixes = matcher.common_prefix_search("hello world!");
    assert(prefixes.size() == 2);
    assert(prefixes[0].first == 5);
    assert(prefixes[0].second == "こんにちは!");
    assert(prefixes[1].first == 11);
    assert(prefixes[1].second == "こんにちは世界!");

    std::string output;
    auto length = matcher.longest_common_prefix_search("hello world!", output);
    assert(length == 11);
    assert(output == "こんにちは世界!");

    auto predictives = matcher.predictive_search("he");
    assert(predictives.size() == 2);
    assert(predictives[0].first == "hello");
    assert(predictives[0].second == "こんにちは!");
    assert(predictives[1].first == "hello world");
    assert(predictives[1].second == "こんにちは世界!");

    std::cout << "[Edit distance 1]" << std::endl;
    for (auto [k, o]: matcher.edit_distance_search("hellow", 1)) {
      std::cout << "key: " << k << " output: " << o << std::endl;
    }

    std::cout << "[Suggestions]" << std::endl;
    for (auto [r, k, o]: matcher.suggest("hellow")) {
      std::cout << "ratio: " << r << " key: " << k << " output: " << o << std::endl;
    }
  }
}

[Edit distance 1]
key: hello output: こんにちは!
[Suggestions]
ratio: 0.810185 key: hello output: こんにちは!
ratio: 0.504132 key: hello world output: こんにちは世界!
ratio: 0.0962963 key: world output: 世界!

License

MIT license (© 2022 Yuji Hirose)

cpp-fstlib's Issues

How to dynamically add new items after fst::compile has been called?

const std::vector<std::pair<std::string, std::string>> items = {
  {"hello", "a"},
  {"hello world", "b"},
  {"world", "c"}
};

std::stringstream out;
auto sorted = false; // ask fst::compile to sort entries
auto [result, error_line] = fst::compile<std::string>(items, out, sorted);

const auto &byte_code = out.str();
fst::map<std::string> matcher(byte_code.data(), byte_code.size());

// do something
// At this point I have more items to add. How can I do that?
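
The API reference above lists no insertion method on fst::map, so the compiled byte code appears to be read-only. One possible workaround, offered as an assumption rather than a documented feature, is to keep the source items, merge the new ones in, and compile again. The sketch below covers only the merge step; merge_items and the Items alias are hypothetical names, and letting the newer value win on a duplicate key is an arbitrary choice.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Items = std::vector<std::pair<std::string, std::string>>;

// Merge `extra` into `base`, producing keys that are sorted and unique.
// When a key appears in both, the value from `extra` is kept.
Items merge_items(const Items &base, const Items &extra) {
  // New items go first so that stable_sort keeps them ahead of old duplicates.
  Items all(extra);
  all.insert(all.end(), base.begin(), base.end());
  std::stable_sort(all.begin(), all.end(),
                   [](const auto &a, const auto &b) { return a.first < b.first; });
  // std::unique keeps the first element of each run of equal keys.
  all.erase(std::unique(all.begin(), all.end(),
                        [](const auto &a, const auto &b) { return a.first == b.first; }),
            all.end());
  return all;
}
```

The merged vector can then be passed to fst::compile<std::string> with a fresh output stream, and a new fst::map constructed from the resulting byte code.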

Conditional logic bug when calculating output

if (OutputTraits<output_t>::empty(state_output)) {

With the current condition, state_output is not added to the final output in the current frame.
I tried to fix it like this:

        if (ope.data.final && atm.is_match()) {
            if (prefix.empty() || (prefix.size() == 1 && prefix.front() == arc)) {
                if (OutputTraits<output_t>::type() == OutputType::none_t ||
                        OutputTraits<output_t>::empty(state_output)) {
                    accept(word, output);
                } else {
                    accept(word, output + state_output);
                }
            }
        }

How to support prefix search in cpp-fstlib?

I used the common_prefix_search API, but the result is not what I expected.
I used the example from the tutorial.
For example:

// input
const std::vector<std::pair<std::string, std::string>> items = {
  {"hello", "こんにちは!"},
  {"hello world", "こんにちは世界!"}, // incorrect sort order entry...
  {"world", "世界!"},
};
// Initialization of the fst::map is omitted.
// ...

// Search with the prefix "h"; no result is returned.
auto prefixes = matcher.common_prefix_search("h");

// prints 0
cout << prefixes.size() << endl;

I expected the result to be "こんにちは!" and "こんにちは世界!".
Which interface returns the expected result? And what does common_prefix_search mean?
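
Judging from the API usage section, common_prefix_search returns the keys that are themselves prefixes of the query (for "hello world!" it finds "hello" and "hello world"), while predictive_search returns the keys that begin with the query (for "he" it finds both "hello" keys). The sketch below illustrates the two directions with a plain std::map standing in for the FST. The function names are hypothetical and this is not how the library is implemented.

```cpp
#include <map>
#include <string>
#include <string_view>
#include <vector>

using Dict = std::map<std::string, std::string>;

// Keys that are prefixes of `query` (the direction of common_prefix_search).
std::vector<std::string> keys_prefixing(const Dict &dict, std::string_view query) {
  std::vector<std::string> found;
  for (const auto &entry : dict) {
    const std::string &key = entry.first;
    if (query.substr(0, key.size()) == std::string_view(key)) found.push_back(key);
  }
  return found;
}

// Keys that start with `query` (the direction of predictive_search).
std::vector<std::string> keys_extending(const Dict &dict, std::string_view query) {
  std::vector<std::string> found;
  // In a sorted map, all keys sharing a prefix form one contiguous range.
  for (auto it = dict.lower_bound(std::string(query)); it != dict.end(); ++it) {
    if (std::string_view(it->first).substr(0, query.size()) != query) break;
    found.push_back(it->first);
  }
  return found;
}
```

With the three keys from the question, keys_prefixing(dict, "h") is empty because no key is a prefix of "h", whereas keys_extending(dict, "h") yields "hello" and "hello world". This suggests predictive_search is the call that matches the expectation in the question.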

Problem making this library support uint64_t

Hello, I am trying to make fstlib support the uint64_t type. When I simply change the classes instantiated for the std::string and uint32_t types to uint64_t, some errors are produced in the test file. Can you give me a hint about where I should make changes to support uint64_t?
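
Without changing the library, one possible workaround is to keep the 64-bit values in a side table and store only a uint32_t index as the FST output. This is a sketch under that assumption, not a fix for the instantiation errors described above. The names IndexedInput and to_indexed are hypothetical, and the approach only works while the number of entries fits in 32 bits.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct IndexedInput {
  // Key plus 32-bit index, in the shape compile<uint32_t> accepts.
  std::vector<std::pair<std::string, std::uint32_t>> keys;
  // Side table holding the real 64-bit values, indexed by the stored output.
  std::vector<std::uint64_t> values;
};

// Split (key, uint64_t) pairs into (key, index) pairs plus a value table.
IndexedInput to_indexed(
    const std::vector<std::pair<std::string, std::uint64_t>> &input) {
  IndexedInput out;
  for (const auto &entry : input) {
    out.keys.emplace_back(entry.first,
                          static_cast<std::uint32_t>(out.values.size()));
    out.values.push_back(entry.second);
  }
  return out;
}
```

After a lookup in the compiled map returns an index, the 64-bit value is read from values at that index.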
