
pytries / marisa-trie


Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.

Home Page: https://marisa-trie.readthedocs.io/en/latest/

License: MIT License

Python 49.42% Shell 0.08% Cython 50.50%
trie tree-structure cython-wrapper marisa marisa-trie python python3 python310 python311 python37

marisa-trie's People

Contributors

bobotig, daa, dependabot[bot], dfuhry, duilio, dymil, fried, hickford, hugovk, hulikau, kmike, liori, lucidfrontier45, matham, puretryout, samuelsmal, superbobry, vermut


marisa-trie's Issues

Always downloading package punkt of NLTK

Whenever I use marisa_trie, my program prints the following on every run:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.

User should be able to add strings to a Trie after instantiation

As a user, I want to create an instance of Trie and add words to it one-at-a-time, so that I can use a Trie in a streaming environment (in which strings arrive on-the-fly). For example,

>>> from marisa_trie import Trie
>>> trie = Trie()
>>> trie.add(u'key1')
>>> trie.add(u'key12')
>>> u'key1' in trie
True
>>> u'key12' in trie
True
>>> u'key2' in trie
False

This also makes Trie behave more like a set of strings.
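
One workaround that needs no library change is to buffer incoming keys and rebuild the static structure lazily. The sketch below is illustrative only: `AppendableKeySet` and its `factory` parameter are hypothetical names, and `frozenset` stands in for the trie so the example runs without the extension installed. Passing `marisa_trie.Trie` as the factory is expected to behave the same way, at the cost of a full rebuild after each batch of additions.

```python
class AppendableKeySet:
    """Buffers added keys and rebuilds a static container on demand.

    Hypothetical helper, not part of marisa-trie.
    """

    def __init__(self, factory=frozenset):
        self._factory = factory      # e.g. marisa_trie.Trie (assumption)
        self._keys = set()           # every key seen so far
        self._static = factory(())   # current immutable snapshot
        self._dirty = False          # True when the snapshot is stale

    def add(self, key):
        if key not in self._keys:
            self._keys.add(key)
            self._dirty = True

    def __contains__(self, key):
        if self._dirty:
            # Rebuild once per batch of additions, not once per key.
            self._static = self._factory(sorted(self._keys))
            self._dirty = False
        return key in self._static


trie = AppendableKeySet()
trie.add("key1")
trie.add("key12")
```

Membership tests then behave like the session above, with the rebuild cost paid on the first lookup after new keys arrive.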

Install failing

 $ pip install marisa-trie==0.7.3
Collecting marisa-trie==0.7.3
  Using cached marisa-trie-0.7.3.tar.gz
Building wheels for collected packages: marisa-trie
  Running setup.py bdist_wheel for marisa-trie ... error
  Complete output from command /Users/fredmailhot/anaconda/envs/marisa_test/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/h4/p3tnqg5n1rg54phgp1_g_8s00000gp/T/pip-build-qDhwKQ/marisa-trie/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /var/folders/h4/p3tnqg5n1rg54phgp1_g_8s00000gp/T/tmpA7YRglpip-wheel- --python-tag cp27:
  running bdist_wheel
  running build
  running build_clib
  building 'libmarisa-trie' library
  creating build
  creating build/temp.macosx-10.7-x86_64-2.7
  creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie
  creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib
  creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa
  creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire
  creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/io
  creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/trie
  creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/vector
  gcc -fno-strict-aliasing -I/Users/fredmailhot/anaconda/envs/marisa_test/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Imarisa-trie/lib -Imarisa-trie/include -c marisa-trie/lib/marisa/agent.cc -o build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/agent.o
  marisa-trie/lib/marisa/agent.cc:3:10: fatal error: 'marisa/agent.h' file not found
  #include "marisa/agent.h"
           ^
  1 error generated.
  error: command 'gcc' failed with exit status 1

  ----------------------------------------
  Failed building wheel for marisa-trie
  Running setup.py clean for marisa-trie
Failed to build marisa-trie
Installing collected packages: marisa-trie
  Running setup.py install for marisa-trie ... error
    Complete output from command /Users/fredmailhot/anaconda/envs/marisa_test/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/h4/p3tnqg5n1rg54phgp1_g_8s00000gp/T/pip-build-qDhwKQ/marisa-trie/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /var/folders/h4/p3tnqg5n1rg54phgp1_g_8s00000gp/T/pip-vupQHV-record/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_clib
    building 'libmarisa-trie' library
    creating build
    creating build/temp.macosx-10.7-x86_64-2.7
    creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie
    creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib
    creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa
    creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire
    creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/io
    creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/trie
    creating build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/grimoire/vector
    gcc -fno-strict-aliasing -I/Users/fredmailhot/anaconda/envs/marisa_test/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Imarisa-trie/lib -Imarisa-trie/include -c marisa-trie/lib/marisa/agent.cc -o build/temp.macosx-10.7-x86_64-2.7/marisa-trie/lib/marisa/agent.o
    marisa-trie/lib/marisa/agent.cc:3:10: fatal error: 'marisa/agent.h' file not found
    #include "marisa/agent.h"
             ^
    1 error generated.
    error: command 'gcc' failed with exit status 1

Allow to use arbitrary sequences as elements, not only strings

I tried to construct the following trie:

trie = marisa_trie.Trie([('New', 'York'), ('New', 'Castle')])

Which gave me AttributeError: 'tuple' object has no attribute 'encode'. So I suppose the library accepts only strings, but sometimes you want other structures.
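
Until sequence keys are supported, one option is to join each tuple into a single string with a separator that cannot occur in the elements. A minimal sketch, assuming the unit separator `\x1f` never appears in the data; `encode_key` and `decode_key` are hypothetical helpers, and a plain `set` stands in for the trie:

```python
SEP = "\x1f"  # assumed to be absent from every element


def encode_key(parts):
    """Join a sequence of strings into one string key."""
    return SEP.join(parts)


def decode_key(key):
    """Split a joined key back into the original tuple."""
    return tuple(key.split(SEP))


keys = [encode_key(p) for p in [("New", "York"), ("New", "Castle")]]
# The encoded strings could be passed to marisa_trie.Trie(keys);
# a set is used here so the sketch runs without the extension.
index = set(keys)
```

Because the first element plus the separator forms a string prefix, prefix queries on the encoded keys still group tuples by their leading elements.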

mingw support

Bug report by lazarou.

"Maybe I'm asking something very silly, but here goes. I get this error when trying to install the package (using Win 7 64-bit and the latest version of mingw):

C:\MinGW\bin\gcc.exe -mdll -O -Wall -Ilib -IC:\Python32\include -IC:\Python32\PC -c lib/marisa/grimoire/io\mapper.cc -o build\temp.win32-3.2\Release\lib\marisa\grimoire\io\mapper.o
lib/marisa/grimoire/io\mapper.cc: In member function 'void marisa::grimoire::io::Mapper::open_(const char*)':
lib/marisa/grimoire/io\mapper.cc:110:19: error: aggregate 'marisa::grimoire::io::Mapper::open(const char*)::_stat64 st' has incomplete type and cannot be defined
lib/marisa/grimoire/io\mapper.cc:111:3: error: '::_stat64' has not been declared
error: command 'gcc' failed with exit status 1
"

Deprecate ``read`` and ``write``

I think these two should be deprecated in favour of their path-based friends.

Three reasons:

  • API should be as small as possible to be useful in 90% of the cases.

  • The methods only work on file objects and produce ugly error messages when called on e.g. BytesIO:

    >>> import io
    >>> import marisa_trie
    >>> marisa_trie.Trie().write(io.BytesIO())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "src/marisa_trie.pyx", line 193, in marisa_trie._Trie.write (src/marisa_trie.cpp:4201)
        self._trie.write(f.fileno())
    io.UnsupportedOperation: fileno
  • A related method mmap lacks a file-based version.

@kmike, what do you think?
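
The failure comes from `fileno()`: in-memory streams have no OS-level file descriptor, so any API built on it rejects them. The standard-library snippet below shows only that behaviour and does not call marisa-trie:

```python
import io
import tempfile

buffer = io.BytesIO()
try:
    buffer.fileno()
    has_descriptor = True
except io.UnsupportedOperation:
    # BytesIO lives purely in memory, so there is no descriptor to hand over.
    has_descriptor = False

# A real file does expose a descriptor, which is why descriptor-based
# methods only work with ordinary files.
with tempfile.TemporaryFile() as real_file:
    real_descriptor = real_file.fileno()
```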

Needs find-longest key method

Or we could just call it a "longest" method - or "prefix" method (singular). Is there an efficient way to find the longest key that is a prefix of a string that I'm overlooking? It could be done with the current implementation by simply using prefixes, then finding the longest match, but there should be a much more efficient way possible by taking advantage of the Trie properties.

I can give it a go if you like - unless it's already implemented and I'm simply missing it.
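
For comparison, here is what a single-walk longest-prefix lookup looks like on a plain nested-dict trie. This is a pure-Python illustration of the idea, not the MARISA data structure; `build_trie`, `longest_prefix` and the `END` marker are names made up for the sketch:

```python
END = object()  # marks that a key ends at this node


def build_trie(keys):
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_prefix(root, text):
    """Return the longest stored key that is a prefix of text, or None."""
    node, best = root, None
    if END in node:
        best = ""
    for i, ch in enumerate(text):
        node = node.get(ch)
        if node is None:
            break
        if END in node:
            best = text[: i + 1]   # remember the deepest match so far
    return best


root = build_trie(["key", "key1", "key12", "kite"])
```

The walk visits at most one node per character of the query and never materialises the shorter matches, which is the saving over collecting all prefixes and taking the longest.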

Trailing \x02 byte on restore_key result

Like so:

In [46]: marisa_trie.Trie([u'foo', u'bar']).restore_key(0)
Out[46]: u'bar\x02'

This doesn't happen if I first get the key_id for that key:

In [48]: t = marisa_trie.Trie([u'foo', u'bar'])

In [49]: t.key_id(u'bar')
Out[49]: 0

In [50]: t.restore_key(0)
Out[50]: u'bar'

If it's part of the contract that key_id must be called before restore_key, then this should be documented. Ideally, violating the contract would raise an exception rather than silently return an incorrect result.

Add has_keys_with_prefix method

Hi. Great library. Would it be possible to add a has_keys_with_prefix method? Datrie has one.

set(dir(datrie.Trie)) - set(dir(marisa_trie.Trie))   
set(['__delitem__', 'setdefault', '__getitem__', 'prefix_values', 'items', 'longest_prefix', 'has_keys_with_prefix', 'longest_prefix_value', 'is_dirty', '__setitem__', 'values', 'iter_prefix_items', 'longest_prefix_item', 'iter_prefix_values', '_delitem', 'prefix_items'])
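
In the meantime the check can be approximated with the existing `iterkeys(prefix)` generator by stopping at the first hit. A sketch, assuming `iterkeys` accepts a prefix argument; `SortedKeys` is a stand-in class so the example runs without the extension:

```python
import bisect


def has_keys_with_prefix(trie, prefix):
    """True if at least one key starts with prefix."""
    for _ in trie.iterkeys(prefix):
        return True        # stop at the first match
    return False


class SortedKeys:
    """Stand-in exposing iterkeys(prefix) over a sorted list."""

    def __init__(self, keys):
        self._keys = sorted(keys)

    def iterkeys(self, prefix=""):
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key


keys = SortedKeys(["key1", "key12", "kite"])
```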

BytesTries saved with `Trie.save` cannot be loaded with `Trie.load`

Using marisa-trie 0.7.4, on Python 3.5.1:

>>> t1 = marisa_trie.BytesTrie([('a', b'a'), ('ab', b'b'), ('ac', b'c')])
>>> t1.save('/tmp/t1.marisa')
/home/rspeer/.virtualenvs/lum/bin/ipython:1: DeprecationWarning: Trie.write is deprecated and will be removed in marisa_trie 0.8.0. Please use Trie.save instead.
  #!/home/rspeer/.virtualenvs/lum/bin/python3.5
>>> t2 = marisa_trie.Trie()
>>> t2.load('/tmp/t1.marisa')
/home/rspeer/.virtualenvs/lum/bin/ipython:1: DeprecationWarning: Trie.save is deprecated and will be removed in marisa_trie 0.8.0. Please use Trie.load instead.
  #!/home/rspeer/.virtualenvs/lum/bin/python3.5
<marisa_trie.Trie object at 0x7f4e9ea552d0>
>>> t2.keys()
Traceback (most recent call last):
  File "<ipython-input-23-28b5ae76f2b3>", line 1, in <module>
    t2.keys()
  File "src/marisa_trie.pyx", line 267, in marisa_trie._Trie.keys (src/marisa_trie.cpp:6279)
  File "src/marisa_trie.pyx", line 278, in marisa_trie._Trie.keys (src/marisa_trie.cpp:6172)
  File "src/marisa_trie.pyx", line 403, in marisa_trie._UnicodeKeyedTrie._get_key (src/marisa_trie.cpp:8108)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte
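
The decode error is consistent with value-carrying tries storing each entry as key bytes, a separator byte, then the value. The sketch assumes the separator is `0xff` (which matches the byte named in the traceback but is not confirmed by this report) and reproduces the error with the standard library alone:

```python
VALUE_SEPARATOR = b"\xff"  # assumption based on the traceback above

# How a (key, value) pair such as ('ab', b'b') would be laid out.
stored = "ab".encode("utf-8") + VALUE_SEPARATOR + b"b"

try:
    stored.decode("utf-8")   # what a unicode-keyed trie does with raw keys
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

Under that assumption, loading the file with the same class that saved it (`BytesTrie` here) avoids the mismatch.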

MARISA_SIZE_ERROR: buf_.size() > MARISA_UINT32_MAX

Hello,

I recently inherited some code from a developer who had departed. It is safe to say that the amount of data flowing into the trie has increased over time. This bug looks like an overflow.

Stack trace:

File "marisa_trie.pyx", line 422, in marisa_trie.BytesTrie.__init__ (src/marisa_trie.cpp:7670)
File "marisa_trie.pyx", line 127, in marisa_trie._Trie.build (src/marisa_trie.cpp:2768)
RuntimeError: lib/marisa/grimoire/trie/tail.cc:192: MARISA_SIZE_ERROR: buf_.size() > MARISA_UINT32_MAX

Missing wheels?

On PyPI I see there are many wheels available for 0.7.4, but the latest version only has a wheel for macOS, thus requiring other platforms to build from source, which means they need to have the Python development headers installed, etc. For casual users who are installing some package that depends on marisa-trie, this can be quite a burden.

Is it possible to set up a CD workflow that generates the wheels for various platforms?

input file format documentation missing

Edit: nvm, misunderstood purpose of Trie.load.

The documentation does not explain what format is expected for files passed to load. I tried the following:

In [1]: import marisa_trie

In [2]: trie = marisa_trie.Trie()

In [3]: trie.load('data')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-d813dec57587> in <module>()
----> 1 trie.load('data')

src/marisa_trie.pyx in marisa_trie._Trie.load()

src/marisa_trie.pyx in marisa_trie._Trie.load()

RuntimeError: marisa-trie/lib/marisa/grimoire/trie/header.h:26: MARISA_FORMAT_ERROR: !test_header(buf)

Documentation does not specify what the file should be formatted like:

In [4]: trie.load?
Docstring:
_Trie.load(self, path)
Load a trie from a specified path.
Type:      builtin_function_or_method

Build fails under Python 3.7 and macOS Catalina

Running into the following build error:

gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/johannes/miniconda3/include -arch x86_64 -I/Users/johannes/miniconda3/include -arch x86_64 -Imarisa-trie/lib -Imarisa-trie/include -c marisa-trie/lib/marisa/trie.cc -o build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/trie.o
clang: warning: include path for libstdc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
In file included from marisa-trie/lib/marisa/trie.cc:1:
marisa-trie/include/marisa/stdio.h:4:10: fatal error: 'cstdio' file not found
#include <cstdio>
         ^~~~~~~~
1 error generated.
error: command 'gcc' failed with exit status 1

I tried to update the cpp files, but no luck either. Any hints welcome.

Please make a new release on pypi

Would you please make a new release on PyPI with an updated patch-level version that includes the fixes for deprecation warnings on non-deprecated methods? I'm tired of seeing those warnings from our regular process, and I would rather not patch marisa-trie locally. Thank you.

specify weight when building

Thanks for building such a solid library wrapper! The marisa_trie C++ seems to expose a way of specifying order of nodes returned for a prefix, given the weight parameter.

Is there a way to expose this parameter in your lib?

Or alternatively, if you could provide some guarantee that nodes will return for a prefix in the same order that they had in the list that built the marisa tree, that would be fine too.

Fuzzy Matching

I see that this data structure supports prefix lookups. Does it also support fuzzy lookups (i.e. all records within a given Levenshtein distance)? If that's not supported in this package or this data structure, do you know of any other packages that would let me do in-memory fuzzy searching?
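
For small key sets a brute-force scan can serve as a baseline. The sketch below is plain Python, independent of marisa-trie; `levenshtein` and `fuzzy_lookup` are illustrative names, and the scan is linear in the number of keys, so it will not scale to large tries:

```python
def levenshtein(a, b):
    """Edit distance between two strings (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_lookup(keys, query, max_distance):
    """Return every key within max_distance edits of query."""
    return [k for k in keys if levenshtein(k, query) <= max_distance]
```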

~ Ben

Pypy Compilation Error

It cannot be installed with PyPy.
The following error is shown:
error: use of undeclared identifier 'PyByteArray_FromStringAndSize';

Pack unicode strings

I have the following variables:

keys = [u'1', u'12', u'13', u'123', u'132', u'1234']
vals = [u'a', u'b', u'c', u'd', u'e', u'f']
fmt = "s"
trie = marisa_trie.RecordTrie(fmt, zip(keys, vals))

But I keep getting the error: argument for 's' must be a string

Any help?
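
The message comes from the `struct` module: the `s` format packs bytes, not text, and a bare `s` holds exactly one byte. A standard-library sketch of encoding the values first (whether a fixed-width record or a `BytesTrie` suits better depends on the data, and is not settled here):

```python
import struct

keys = ["1", "12", "13", "123", "132", "1234"]
vals = ["a", "b", "c", "d", "e", "f"]
fmt = "s"

# Each record must be a tuple of fields, and "s" fields must be bytes.
records = [(v.encode("utf-8"),) for v in vals]
packed = [struct.pack(fmt, *record) for record in records]

# zip(keys, records) is the shape of data a record-style trie
# would expect alongside fmt.
pairs = list(zip(keys, records))
```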

`Trie.load` raises a DeprecationWarning that doesn't make sense

There's a nonsensical DeprecationWarning in Trie.load:

>>> import marisa_trie
>>> t = marisa_trie.Trie()
>>> t.load('data/language_names.marisa')
/home/rspeer/.virtualenvs/lum/bin/ipython:1: DeprecationWarning: Trie.save is deprecated and will be removed in marisa_trie 0.8.0. Please use Trie.load instead.
  #!/home/rspeer/.virtualenvs/lum/bin/python3.5
<marisa_trie.Trie at 0x7f4e9fefb0f0>

Some things that are wrong with this:

  • I didn't use Trie.save, I used Trie.load, which is exactly what it's telling me to use.
  • It doesn't sound right that Trie.load would be able to replace Trie.save.
  • The warning is actually raised by Trie.read, not Trie.save.
  • Trie.load is implemented by using Trie.read, so there is no way to avoid the DeprecationWarning.

Expected behavior: if I use the function that the DeprecationWarning tells me I should use, I should not get a DeprecationWarning.

It's inconvenient that the IDs of marisa_trie.Trie are unpredictable and keys can't be retrieved in ID order

See below:

>>> import marisa_trie
>>> trie = marisa_trie.Trie(['zeroth', 'first', 'second', 'third', 'fourth', 'fifth'])
>>> # IDs aren't ordered the same as original input list:
... trie.get('zeroth')
2
>>> # IDs aren't ordered the same as iteration order of the trie, either:
... for word, ID in trie.items():
...     print(word, ID)
... 
fifth 4
first 5
fourth 3
second 0
third 1
zeroth 2

This is inconvenient given that one possible use case, actually encouraged in the README, is to

use the returned ID to store a value in a separate data structure (e.g. in a python list

Ideally I'd like to be able to loop over my elements in ID order to construct such a list. I guess I can create a list of the right length and then assign into it, but couldn't this be made easier, either by assigning IDs according to the order the words were passed to Trie() in, or by having iteration over the trie iterate in ID order?
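
The preallocate-and-assign approach mentioned above is short in practice. The sketch uses the `(word, ID)` pairs printed in this report as stand-in data, so it runs without the extension; with a real trie the pairs would come from `trie.items()`:

```python
# (word, ID) pairs copied from the output above.
items = [
    ("fifth", 4), ("first", 5), ("fourth", 3),
    ("second", 0), ("third", 1), ("zeroth", 2),
]

# One slot per ID, then drop each word into the slot its ID names.
by_id = [None] * len(items)
for word, key_id in items:
    by_id[key_id] = word
```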

Memory Efficient Trie Creation

Hi,

When creating a RecordTrie, the superclass _UnpackTrie unpacks all key value pairs in memory. So if I am correct, creating a Trie is not memory efficient at all. Is there a simple way to create large Tries more efficiently?

Thx,
joe

Build failure under Python-3.9

I tried to build marisa-trie with Python 3.9 and it failed with the following error:

gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -Imarisa-trie/include -I/opt/python3.9/include/python3.9 -c src/marisa_trie.cpp -o build/temp.linux-x86_64-3.9/src/marisa_trie.o
  src/marisa_trie.cpp: In function ‘int __Pyx_modinit_type_init_code()’:
  src/marisa_trie.cpp:17944:34: error: ‘PyTypeObject {aka struct _typeobject}’ has no member named ‘tp_print’
     __pyx_type_11marisa_trie__Trie.tp_print = 0;

According to the documentation, tp_print was removed in Python 3.9: https://docs.python.org/dev/whatsnew/3.9.html#id3

builtins.RuntimeError: Unknown exception

Got this on Windows 64:

trie.keys()[3]
Traceback (most recent call last):
  File "c:\Users\Administrator\Desktop\fuck.py", line 1, in <module>
    import hashlib
  File "c:\Python34\Lib\site-packages\marisa_trie.pyd", line 516, in marisa_trie.BytesTrie.keys (src\marisa_trie.cpp:9045)
  File "c:\Python34\Lib\site-packages\marisa_trie.pyd", line 527, in marisa_trie.BytesTrie.keys (src\marisa_trie.cpp:8865)

builtins.RuntimeError: Unknown exception

Build under Python 3.9 failed

python3 -m pip install --user marisa-trie
There is no member PyTypeObject->tp_print in python 3.9.
Python 3.9.2
Linux SPPI 5.10.46-v7l+ #1432 SMP Fri Jul 2 21:17:20 BST 2021 armv7l GNU/Linux

src/marisa_trie.cpp: In function ‘int __Pyx_modinit_type_init_code()’:
    src/marisa_trie.cpp:17944:34: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    17944 |   __pyx_type_11marisa_trie__Trie.tp_print = 0;
          |                                  ^~~~~~~~
    src/marisa_trie.cpp:17968:39: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    17968 |   __pyx_type_11marisa_trie_BinaryTrie.tp_print = 0;
          |                                       ^~~~~~~~
    src/marisa_trie.cpp:17981:46: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    17981 |   __pyx_type_11marisa_trie__UnicodeKeyedTrie.tp_print = 0;
          |                                              ^~~~~~~~
    src/marisa_trie.cpp:17995:33: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    17995 |   __pyx_type_11marisa_trie_Trie.tp_print = 0;
          |                                 ^~~~~~~~
    src/marisa_trie.cpp:18014:38: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18014 |   __pyx_type_11marisa_trie_BytesTrie.tp_print = 0;
          |                                      ^~~~~~~~
    src/marisa_trie.cpp:18039:40: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18039 |   __pyx_type_11marisa_trie__UnpackTrie.tp_print = 0;
          |                                        ^~~~~~~~
    src/marisa_trie.cpp:18052:39: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18052 |   __pyx_type_11marisa_trie_RecordTrie.tp_print = 0;
          |                                       ^~~~~~~~
    src/marisa_trie.cpp:18070:57: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18070 |   __pyx_type_11marisa_trie___pyx_scope_struct____init__.tp_print = 0;
          |                                                         ^~~~~~~~
    src/marisa_trie.cpp:18076:57: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18076 |   __pyx_type_11marisa_trie___pyx_scope_struct_1_genexpr.tp_print = 0;
          |                                                         ^~~~~~~~
    src/marisa_trie.cpp:18082:58: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18082 |   __pyx_type_11marisa_trie___pyx_scope_struct_2_iterkeys.tp_print = 0;
          |                                                          ^~~~~~~~
    src/marisa_trie.cpp:18088:63: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18088 |   __pyx_type_11marisa_trie___pyx_scope_struct_3_iter_prefixes.tp_print = 0;
          |                                                               ^~~~~~~~
    src/marisa_trie.cpp:18094:59: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18094 |   __pyx_type_11marisa_trie___pyx_scope_struct_4_iteritems.tp_print = 0;
          |                                                           ^~~~~~~~
    src/marisa_trie.cpp:18100:63: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18100 |   __pyx_type_11marisa_trie___pyx_scope_struct_5_iter_prefixes.tp_print = 0;
          |                                                               ^~~~~~~~
    src/marisa_trie.cpp:18106:59: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18106 |   __pyx_type_11marisa_trie___pyx_scope_struct_6_iteritems.tp_print = 0;
          |                                                           ^~~~~~~~
    src/marisa_trie.cpp:18112:58: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18112 |   __pyx_type_11marisa_trie___pyx_scope_struct_7___init__.tp_print = 0;
          |                                                          ^~~~~~~~
    src/marisa_trie.cpp:18118:57: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18118 |   __pyx_type_11marisa_trie___pyx_scope_struct_8_genexpr.tp_print = 0;
          |                                                         ^~~~~~~~
    src/marisa_trie.cpp:18124:59: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18124 |   __pyx_type_11marisa_trie___pyx_scope_struct_9_iteritems.tp_print = 0;
          |                                                           ^~~~~~~~
    src/marisa_trie.cpp:18130:59: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18130 |   __pyx_type_11marisa_trie___pyx_scope_struct_10_iterkeys.tp_print = 0;
          |                                                           ^~~~~~~~
    src/marisa_trie.cpp:18136:59: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18136 |   __pyx_type_11marisa_trie___pyx_scope_struct_11___init__.tp_print = 0;
          |                                                           ^~~~~~~~
    src/marisa_trie.cpp:18142:58: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18142 |   __pyx_type_11marisa_trie___pyx_scope_struct_12_genexpr.tp_print = 0;
          |                                                          ^~~~~~~~
    src/marisa_trie.cpp:18148:60: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18148 |   __pyx_type_11marisa_trie___pyx_scope_struct_13_iteritems.tp_print = 0;
          |                                                            ^~~~~~~~
    src/marisa_trie.cpp:18154:58: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
    18154 |   __pyx_type_11marisa_trie___pyx_scope_struct_14_genexpr.tp_print = 0;
          |                                                          ^~~~~~~~

Possible to walk the trie?

I don't see any methods that would allow for just proceeding one edge in the trie, for example:

trie = marisa_trie.Trie([u'key1', u'key2', u'kite'])
trie.edges('k') #would return [u'ke', u'ki']

Using the prefixes method for something like this would be very expensive if the trie is big and the prefix is short. Is there some technical detail I'm missing for why implementing a function for this would be costly, or some other reason this isn't implemented? It seems like it's a necessary step in the traversal with the prefixes() method anyway, and quite useful for predictive lookup operations.
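
The requested behaviour can be emulated, inefficiently, on top of a prefix scan, which also makes the cost concern concrete: every key under the prefix is visited just to learn the next character. `edges` is the hypothetical method from the example above, written here as a free function over any iterable of keys:

```python
def edges(keys, prefix):
    """Return the distinct one-character extensions of prefix."""
    n = len(prefix)
    found = set()
    for key in keys:
        if key.startswith(prefix) and len(key) > n:
            found.add(key[: n + 1])
    return sorted(found)


all_keys = ["key1", "key2", "kite"]
```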

Does not install with pip for python3.5

When I try pip install marisa-trie I get the following error:

Could not find a version that satisfies the requirement install (from versions: )
No matching distribution found for install

Can't get value using key with null char (marisa_trie.Trie)

I might be missing something here, but I believe there is a consistency issue in how the Trie implementation handles keys that contain null characters.

Take a look at the following code snippet:

key = 'Random\x00Key'
python_dict = { key : 'random_value' }
key in python_dict # prints True
python_dict.get(key) # returns 'random_value'

std_trie = marisa_trie.Trie(python_dict)
key in std_trie # prints True
std_trie.keys() # prints ['Random\x00Key']
std_trie.get(key) # should return 'random_value'
std_trie.key_id(key) # should return the key id

What happens is that std_trie.get(key) actually returns None, and std_trie.key_id(key) throws a KeyError exception with the following trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/marisa_trie.pyx", line 409, in marisa_trie.Trie.key_id
  File "src/marisa_trie.pyx", line 417, in marisa_trie.Trie.key_id
KeyError: 'Random\x00Key'

Apparently, the RecordTrie implementation is immune to this consistency issue.

r_trie = marisa_trie.RecordTrie('<H', zip([key], [(1,)]))
key in r_trie # prints True
r_trie.keys() # prints ['Random\x00Key']
r_trie.get(key) # prints [(1,)]

By the way, if you confirm this issue as an actual bug, also check DAWG. I haven't used it extensively, but calling dawg.DAWG(python_dict) should throw the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dawg.pyx", line 45, in dawg.DAWG.__init__ (src\dawg.cpp:2147)
  File "dawg.pyx", line 70, in dawg.DAWG._build_from_iterable (src\dawg.cpp:2570)
dawg.Error: Can't insert key b'Random\x00Key' (with value 0)

I'm running Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)] on win32 and using marisa-trie 0.7.5 and DAWG 0.7.8.

Weight Ordering Doesn't Seem to Work?

I've tried this with both RecordTrie and BytesTrie and a number of different formats, but can't get things to work no matter what I do.
keys = ['foo', 'foo1', 'foobar', 'bar']
values = [(1,), (2,), (3,), (4,)]
fmt = str("<H")
trie = marisa_trie.RecordTrie(fmt, zip(keys, values), order=marisa_trie.WEIGHT_ORDER)
trie.items(u'')
>>> [(u'foo1', (4,)), (u'foobar', (3,)), (u'foo', (1,)), (u'bar', (2,))]

I tried adding a weights parameter as well, which isn't documented, but does seem supported in the code. It looks like it does something, because if an iterable with items that can't be converted to floats is passed in, it breaks. Nevertheless, values are still not returned in weight order:

trie = marisa_trie.RecordTrie(fmt, zip(keys, values), order=marisa_trie.WEIGHT_ORDER, weights=[1,2,3,4])
trie.items()
>>> [(u'foo1', (4,)), (u'foobar', (3,)), (u'foo', (1,)), (u'bar', (2,))]

Am I missing something here? Is there some other way to set the weight?

Repeated keys

import marisa_trie
a = [(u'1', '1'), (u'1', '2')]
tr = marisa_trie.BytesTrie(a)
print tr.keys()

This will output [u'1', u'1'].

I think this function should return [u'1'].

I'm ready to fix it if someone considers this a bug.
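
Callers who need unique keys today can deduplicate the result themselves. A standard-library sketch using the output shown above as input:

```python
keys = ["1", "1"]  # what tr.keys() returned in the report

# dict preserves insertion order, so this drops repeats without reordering.
unique_keys = list(dict.fromkeys(keys))
```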

ImportError missing symbols on macOS

I get the following error when trying to import on macOS Catalina 10.15.4. I have version 0.7.5 installed via pipenv. Super simple Python shell transcript below.

14:40:23 ❯ pipenv run python
Python 3.7.3 (default, Mar  6 2020, 22:34:30)
[Clang 11.0.3 (clang-1103.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import marisa_trie
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: dlopen(/Users/ianthetechie/.local/share/virtualenvs/python-json-apis-b20-K_lB/lib/python3.7/site-packages/marisa_trie.cpython-37m-darwin.so, 2): Symbol not found: __ZN6marisa4Trie4mmapEPKc
  Referenced from: /Users/ianthetechie/.local/share/virtualenvs/python-json-apis-b20-K_lB/lib/python3.7/site-packages/marisa_trie.cpython-37m-darwin.so
  Expected in: flat namespace
 in /Users/ianthetechie/.local/share/virtualenvs/python-json-apis-b20-K_lB/lib/python3.7/site-packages/marisa_trie.cpython-37m-darwin.so

PyPI package out of date

Just wanted to make a suggestion to publish a recent version of this code on PyPI. The latest published PyPI version is 0.7.2, which is from April 2015.

Trie objects don't support comparison

A trivial example:

>>> from marisa_trie import Trie
>>> Trie() == Trie()
False
>>> Trie(["foo", "bar"]) == Trie(["foo", "bar"])
False

There's one more interesting property: different tries seem to hash to the same value:

>>> hash(Trie())
271079393
>>> hash(Trie(["foo", "bar"]))
271079393
>>> hash(Trie(["foo", "bar", "boo"]))
271079393

This might be due to a free list-based allocation, but anyway the behaviour is confusing.

Starting an iterator in the middle of a values list in a label-ordered RecordTrie

Hi Mike,

Thanks for building this wrapper and providing the additional Bytes and RecordTrie classes, they're great and extremely useful and have been largely easy to build additional features into.

I would like to implement a RecordTrie feature as follows: if there are key-value pairs (u'a', (1, N_1)), (u'a', (2, N_2)), (u'a',(3, N_3)), ..., (u'a', (i, N_i)), and I know that I only need the values stored in N_p through N_q. If i is very large (say, a million), calling my_list = my_trie[u'a'] is prohibitively slow, and so I would like to start the loop at the pth value, that is, at b_prefix = <bytes>u'a'.encode('utf8') + self._b_value_separator + <bytes>bytes(struct(">I", p)).

Of course, if I set Agent key to this, the loop will not continue on to (u'a', (p+1, N_{p+1})), and I cannot figure out how to "trick" the predictive search into continuing to loop past that specific prefix and on to anything with prefix simply <bytes>u'a'.encode('utf8') + self._b_value_separator without resetting the entire loop from the top.

Do you have any idea of how this might be accomplished, or if it is a limitation of the marisa-trie base library?

Thanks,
-George

Win64 Problem

The following code breaks under 64-bit Windows. 32-bit Windows is OK, and 64-bit Linux also works.

I've compiled the win-64bit extension with MSVC 2010.


marisa_trie.Trie([u'Das', u'Lahnth.al', u'mit', u'seinen', u'Heilquellen']).keys()

d:\vls-trunk\env-win64\Python27\lib\site-packages\marisa_trie.pyd in marisa_trie._Trie.keys (src\marisa_trie.cpp:4199)()

d:\vls-trunk\env-win64\Python27\lib\site-packages\marisa_trie.pyd in marisa_trie._Trie.keys (src\marisa_trie.cpp:4061)()

RuntimeError: Unknown exception

possible to add keys?

Is it possible to add keys to the Trie after it's been created? I've seen the restore facility (e.g. trie.restore_key(1)) and looked into the source, but haven't seen anything offering this. Something like:

t = Trie([u'one'])
t._add_key(u'two')
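
Since the project describes its structures as static, one workaround, assuming a rebuild is affordable, is to construct a new trie from the old keys plus the new ones. The sketch below is not part of the library: rebuild_with is a hypothetical helper, and KeyList is a hypothetical stand-in class so the snippet runs without the package installed.

```python
class KeyList(object):
    """Hypothetical stand-in for an immutable trie; only keys() is modelled."""

    def __init__(self, keys=()):
        self._keys = tuple(keys)

    def keys(self):
        return list(self._keys)


def rebuild_with(trie_cls, trie, new_keys):
    """Build a fresh trie_cls holding the old keys plus new_keys."""
    merged = set(trie.keys())   # existing keys, de-duplicated
    merged.update(new_keys)     # add the new ones
    return trie_cls(sorted(merged))


t = KeyList([u"one"])
t2 = rebuild_with(KeyList, t, [u"two"])
print(t2.keys())
```

With the real package, trie_cls would presumably be the Trie class, whose constructor takes a list of keys as in the examples in these issues. Note that key IDs in the rebuilt trie need not match those in the old one.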

Getting UnicodeDecodeError accessing trie read from file

Hi, I'm consistently getting the following error when trying to access a trie that was loaded or read from a file.

./read_trie_test.py
Traceback (most recent call last):
  File "./read_trie_test.py", line 18, in <module>
    print(t.restore_key(0))
  File "marisa_trie.pyx", line 324, in marisa_trie.Trie.restore_key (src/marisa_trie.cpp:6365)
  File "marisa_trie.pyx", line 334, in marisa_trie.Trie.restore_key (src/marisa_trie.cpp:6299)
  File "marisa_trie.pyx", line 62, in marisa_trie._get_key (src/marisa_trie.cpp:1615)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 10: invalid start byte

I get the same error if the following code is used...

  for k in t.keys():
      print(k)

and again the same error if I use:

  t['someKey']  # or t[u'somekey']

The trie file reads in without any error. I've written the file using both trie.save() and trie.write(), and when writing the file I've used codecs.open() and write() to force UTF-8 encoding.

I'm not sure whether this is similar to issue #10.
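
The thread does not establish the cause. One thing worth ruling out, stated here as an assumption rather than a diagnosis, is the text codec: a serialized trie is binary data, and binary data generally does not survive a UTF-8 text layer. The sketch below uses made-up bytes, not a real trie file, to show that a binary-mode round trip preserves a 0xff byte while decoding the same bytes as UTF-8 fails.

```python
import os
import tempfile

# Made-up payload containing a byte (0xff) that is never valid in UTF-8.
payload = b"header\x00\x01\xffkeys"

path = os.path.join(tempfile.mkdtemp(), "demo.bin")

# Binary mode writes and reads the bytes unchanged.
with open(path, "wb") as f:
    f.write(payload)
with open(path, "rb") as f:
    round_trip = f.read()

# Treating the same bytes as UTF-8 text raises UnicodeDecodeError.
try:
    payload.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print(exc)

print(round_trip == payload, decoded_ok)
```

If the file was produced through a text codec, rewriting it in binary mode and loading it again would be a cheap first check.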

Build fails on MacOS Mojave, January 2019

I don't know whether this is user error, a return of Issue #34, or something new (since #34 seems to have been closed as resolved), but I just tried and failed to build marisa-trie under MacOS Mojave. Details below.

Vombatus:SciFi djb$ pip install marisa-trie
Collecting marisa-trie
  Using cached https://files.pythonhosted.org/packages/20/95/d23071d0992dabcb61c948fb118a90683193befc88c23e745b050a29e7db/marisa-trie-0.7.5.tar.gz
Building wheels for collected packages: marisa-trie
  Running setup.py bdist_wheel for marisa-trie ... error
  Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-install-f2s2mhpn/marisa-trie/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-wheel-syviv4xc --python-tag cp37:
  running bdist_wheel
  running build
  running build_clib
  building 'libmarisa-trie' library
  creating build
  creating build/temp.macosx-10.7-x86_64-3.7
  creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie
  creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib
  creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa
  creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire
  creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/io
  creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/trie
  creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/vector
  gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -Imarisa-trie/lib -Imarisa-trie/include -c marisa-trie/lib/marisa/trie.cc -o build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/trie.o
  warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
  In file included from marisa-trie/lib/marisa/trie.cc:1:
  marisa-trie/include/marisa/stdio.h:4:10: fatal error: 'cstdio' file not found
  #include <cstdio>
           ^~~~~~~~
  1 warning and 1 error generated.
  error: command 'gcc' failed with exit status 1

  ----------------------------------------
  Failed building wheel for marisa-trie
  Running setup.py clean for marisa-trie
Failed to build marisa-trie
Installing collected packages: marisa-trie
  Running setup.py install for marisa-trie ... error
    Complete output from command /anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-install-f2s2mhpn/marisa-trie/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-record-539k0cjv/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_clib
    building 'libmarisa-trie' library
    creating build
    creating build/temp.macosx-10.7-x86_64-3.7
    creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie
    creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib
    creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa
    creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire
    creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/io
    creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/trie
    creating build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/grimoire/vector
    gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda3/include -arch x86_64 -I/anaconda3/include -arch x86_64 -Imarisa-trie/lib -Imarisa-trie/include -c marisa-trie/lib/marisa/trie.cc -o build/temp.macosx-10.7-x86_64-3.7/marisa-trie/lib/marisa/trie.o
    warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
    In file included from marisa-trie/lib/marisa/trie.cc:1:
    marisa-trie/include/marisa/stdio.h:4:10: fatal error: 'cstdio' file not found
    #include <cstdio>
             ^~~~~~~~
    1 warning and 1 error generated.
    error: command 'gcc' failed with exit status 1

    ----------------------------------------
Command "/anaconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-install-f2s2mhpn/marisa-trie/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-record-539k0cjv/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/46/5fcnvzts14527hn_3ryrqr100000gn/T/pip-install-f2s2mhpn/marisa-trie/

no marisa

Despite the repository having marisa in the name, I could not find a single Touhou reference.
Please add one, at least in a comment or something, or rename the project.

OSX 10.10.3 Build Failure

Just want to let you know that this project fails to compile on OS X 10.10.3 with CLI Tools 6.3 because of the missing <__debug> header. Check out this StackOverflow article for more info.

I've worked around this by simply downgrading to the previous CLI tools.

libmarisa-trie fails on macOS Big Sur and Python 3.8

  • pip3 install marisa-trie is working fine for me, but when using it I get the following:
  running build_clib
  building 'libmarisa-trie' library
  creating build/temp.macosx-10.14.6-x86_64-3.8
  creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie
  creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie
  creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib
  creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa
  creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa/grimoire
  creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa/grimoire/io
  creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa/grimoire/trie
  creating build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa/grimoire/vector
  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -iwithsysroot/System/Library/Frameworks/System.framework/PrivateHeaders -iwithsysroot/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/Headers -arch arm64 -arch x86_64 -Imarisa-trie/marisa-trie/lib -Imarisa-trie/marisa-trie/include -c marisa-trie/marisa-trie/lib/marisa/trie.cc -o build/temp.macosx-10.14.6-x86_64-3.8/marisa-trie/marisa-trie/lib/marisa/trie.o
  marisa-trie/marisa-trie/lib/marisa/trie.cc:1:10: fatal error: 'marisa/stdio.h' file not found
  #include "marisa/stdio.h"
           ^~~~~~~~~~~~~~~~
  1 error generated.
  error: command 'clang' failed with exit status 1
  ----------------------------------------
  • I tried the steps in here but it didn't resolve the issue: #50

  • Any tips welcome, thanks!

higher than expected memory usage

I'm finding the memory usage to be much higher than you suggest. Does anything strike you as odd about this?

import random
import string
import marisa_trie

keys = []
fmt = "<I"  # unused in this snippet
for i in range(int(3e6)):
    # build a random 10-letter uppercase key
    key = "".join(random.choice(string.ascii_uppercase) for j in range(10))
    keys.append(key)

t = marisa_trie.Trie(keys)
