wrengr / bytestring-trie Goto Github PK

An efficient finite map from (byte)strings to values.

License: BSD 3-Clause "New" or "Revised" License

Haskell 100.00%

bytestring-trie's Introduction

bytestring-trie

The bytestring-trie package provides an efficient implementation of tries mapping ByteString to values. The implementation is based on Okasaki's big-endian patricia trees, à la IntMap. We first trie on the elements of ByteString and then trie on the big-endian bit representation of those elements. Patricia trees have efficient algorithms for union and other merging operations, but they're also quick for lookups and insertions.

If you are only interested in being able to associate individual ByteStrings to values, then you may prefer the hashmap package which is faster for those only needing a map-like structure. This package is intended for those who need the extra capabilities that a trie-like structure can offer (e.g., structure sharing to reduce memory costs for highly redundant keys, taking the submap of all keys with a given prefix, contextual mapping, extracting the minimum and maximum keys, etc.)

Install

This is a simple package and should be easy to install. You should be able to use the standard:

$> cabal install bytestring-trie

Portability

The implementation is quite portable, relying only on a few basic language extensions. The complete list of extensions used by the library is:

CPP
MagicHash -- Only if using GHC
NoImplicitPrelude

The test suite uses a few additional extensions:

MultiParamTypeClasses
FlexibleInstances
FlexibleContexts

bytestring-trie's People

Contributors

Stargazers

Watchers

Forkers

ffwng obsidiansystems ulyssesdotcodes netom unhammer hdgarrood michaeljklein lukel97 hasufell kkysen tmcgilchrist volesen yokodake prikhi remexre agentultra jcranch

bytestring-trie's Issues

Consider additions from @ffwng's fork

The disjoin function of ffwng@735b605 seems generally useful, and would also fit in with resolving issue #7 and PR #10. Unrelated to those issues/PRs, the cut function could also be helpful (though it needs a rename at the very least).

Native support for bytestring >=0.11

bytestring-0.11.0.0 (September 2020) removed the offset parameter to the ByteString data constructor: haskell/bytestring#175 . They provide a PS pattern-synonym for backward compatibility (https://github.com/haskell/bytestring/blob/master/Changelog.md) and while CI indicates that's sufficient for now, we should surely update things to use the new internal representation directly.

Compact version?

Using individual bits seems wasteful. Why not use HashMap-like sparse SmallArrays to increase the branching factor? The Prefix thing looks like it's on its way to being path compression, but it's a bit on the small side.

Why I'm thinking about any of this: I think it would be interesting to attempt path-compressed generalized tries. The process of addressing a key in a generalized trie is very serialization-like. Working in preorder, each step produces a statically known number of bits representing a constructor. The easy thing is to just stick a node there, but the path compression way would say to put together bits until there's a branch (or some limit is reached). If you have any thoughts on that, I'd be very interested in hearing them.

Asymptotic complexity of keyed functions.

All the functions which reconstruct keys (traverseWithKey, mapBy, contextualMapBy, foldrWithKey, toList, keys, etc) suffer from a quadratic slowdown due to how we reconstruct the keys. I can make some changes to reduce the non-asymptotic factors; but so far as I can tell, actually fixing the bug is irreconcilable under the current design.

We could easily eliminate the slowdown by instead constructing a reversed variant of lazy bytestrings (reversed, because we need snoc-lists of strict bytestrings, whereas the standard lazy bytestrings are cons-lists). However, if the caller then converts those to standard bytestrings, doing so will incur the quadratic cost again. (Going from reversed-lazy-bytestrings to lazy-bytestrings is only quadratic in the number of chunks, whereas going to strict-bytestrings is quadratic in the length.) Thus, while this approach would technically solve our problems, it only does so by pushing the problem off onto the user, which is unacceptable.

I do, at least, have some ideas for how we could solve this by storing extra metadata in the trie. But so far I have no idea how great the overhead of that would be. Ideally, if the metadata can be made cheap enough to compute on the fly, then we could just compute it when we need it. Each call to the key-reconstructing functions would require two passes over the trie, but that's far better than the current approach. However, if the user wants to call these functions often, then it makes more sense to keep the metadata around; but that introduces the burden of updating it whenever changes are made to the trie. And if the cost of those updates is too great, then that pushes us towards forking the datatype so that users can decide which cost model they want —which I'd really rather not do, if it can be avoided.

Under GHC 8, use runRW# primop

The UnsafeST module says it's all about working around [https://ghc.haskell.org/trac/ghc/ticket/5916]. That ticket has now been closed thanks to the new runRW# primop. You should probably use that when available. You should also be able to remove the -fno-full-laziness from things using that module.

Asymptotic complexity of functions deleting values

When using the internal functions arc, arc_, arcNN, or prepend to rebuild the spine after deleting values, under certain circumstances we can end up chaining those together and thus encounter an asymptotic slowdown similar to #25 (albeit greatly reduced in scope and severity). However, unlike #25, the bug here is not intrinsic to the data structure. We should be able to adjust those functions to behave more like a zipper, and only do the bytestring concatenation once an arc is "complete" (akin to the use of RevLazyByteString in the priority-queue functions).

Add minimal `match` function

The current match/match_ functions return the longest prefix; we may also want a version that returns the shortest prefix.

Since matches/matches_ is a good-producer, it should be cheap enough to simply take the head of that list. (If for some reason the list overhead isn't fully erased, then we'll want to see if we can't find some way to combine the implementations of match_, minMatch_, and matchFB_.)

intersectBy

I've made an intersectBy function that I'm using (for faster intersection than filter.merge) at https://github.com/unhammer/bytestring-trie-0.2.4/pull/1/files

If this seems at all useful for anyone else, it'd be very nice to have in the official package, so I don't have to remember to cabal add-source all the time :) It's for 0.2.4.1 though; I haven't had a chance to look into all the changes in 0.3 (which I guess is not quite done yet).

Add `differenceBy`

Like intersectBy, we can add a primitive differenceBy which is much more efficient than the circuitous route of using disunion twice.

text-trie issue: Bug in `match`

bytestring-trie (correct):

Trie.match (Trie.fromList [("foo", "0"), ("foob", "1")]) "foobar"
-- Just ("foob", True, "ar")

text-trie (incorrect):

Trie.match (Trie.fromList [("foo", "0"), ("foob", "1")]) "foobar"
-- Just ("f", True, "oobar")

Issues are disabled on text-trie for some weird reason, so raising it here and pinging @michaeljklein

Consider: deleteSubmap and other prefix operations

The submap function is extremely useful for retrieving a map for all strings (or suffixes of strings, rather) that contain a specified prefix, and does it very efficiently.

Logically, it is possible to produce a map with all strings with a given prefix removed, and to do this efficiently. The currently provided functions do not seem to allow for this however, and instead require using alterBy or similar to delete every key that has the prefix.

Given that prefixes are such a fundamental part of the logic of tries, I think some more useful prefix operations would be valuable in this library. Consider, for example, the ability to add a common prefix to all keys in the trie, or delete the longest common prefix. Tries make these options inherently efficient, but currently the user does not have the tools to do this efficiently.

Does `matches` return the list of matches in any particular order?

E.g., the longest match first

Conversion to another Trie internal representation

I'd like to convert Tries to another internal representation. I can do it by using toList and writing code to build the NaiveTrie from the list, but that's basically throwing away all the work that your code does. It would be a lot simpler and smaller code if I had access to the internals:

data NaiveTrie a = NaiveTrie !ByteString !(Maybe a) ![NaiveTrie a] deriving Show

toNaiveTrie :: Trie a -> NaiveTrie a
toNaiveTrie Empty              = NaiveTrie "" Nothing []
toNaiveTrie (Arc b v t)        = NaiveTrie b v (dfs t)
toNaiveTrie (Branch p m t1 t2) = NaiveTrie "" Nothing ((dfs t1) ++ (dfs t2))

dfs :: Trie a -> [NaiveTrie a]
dfs Empty              = []
dfs arc@(Arc _ _ _)    = [toNaiveTrie arc]
dfs (Branch _ _ t1 t2) = (dfs t1) ++ (dfs t2)

Morally, the internal Trie structure is currently being exposed via showTrie, so exposing the structure completely won't leak any more information. On the other hand, allowing users to construct Tries directly would allow Tries that violate the invariants.

How willing would you be to accept a patch providing a datatype isomorphic to Trie, to which Trie can be converted? It would expose the structure I need while not allowing the isomorphic structure to be used as input to your module's functions.

`Binary.get` can break invariants

The current implementation for Binary.get allows constructing tries that violate our invariants!

Releasing 0.2.7.1 to Hackage?

Appaently there is already version 0.2.7.1 that supports GHC 9.4, but it's not available on Hackage. Could you upload the latest version to Hackage as well?