Giter Site home page Giter Site logo

bytestring-trie's Introduction

bytestring-trie

Hackage version Build Status Dependencies

The bytestring-trie package provides an efficient implementation of tries mapping ByteString to values. The implementation is based on Okasaki's big-endian patricia trees, à la IntMap. We first trie on the elements of ByteString and then trie on the big-endian bit representation of those elements. Patricia trees have efficient algorithms for union and other merging operations, but they're also quick for lookups and insertions.

If you are only interested in being able to associate individual ByteStrings to values, then you may prefer the hashmap package which is faster for those only needing a map-like structure. This package is intended for those who need the extra capabilities that a trie-like structure can offer (e.g., structure sharing to reduce memory costs for highly redundant keys, taking the submap of all keys with a given prefix, contextual mapping, extracting the minimum and maximum keys, etc.)

Install

This is a simple package and should be easy to install. You should be able to use the standard:

$> cabal install bytestring-trie

Portability

The implementation is quite portable, relying only on a few basic language extensions. The complete list of extensions used by the library is:

  • CPP
  • MagicHash -- Only if using GHC
  • NoImplicitPrelude

The test suite uses a few additional extensions:

  • MultiParamTypeClasses
  • FlexibleInstances
  • FlexibleContexts

Links

bytestring-trie's People

Contributors

dependabot[bot] avatar jcranch avatar unhammer avatar wrengr avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

bytestring-trie's Issues

Consider additions from @ffwng's fork

The disjoin function of ffwng@735b605 seems generally useful, and would also fit in with resolving issue #7 and PR #10. Unrelated to those issues/PRs, the cut function could also be helpful (though it needs a rename at the very least).

Compact version?

Using individual bits seems wasteful. Why not use HashMap-like sparse SmallArrays to increase the branching factor? The Prefix thing looks like it's on its way to being path compression, but it's a bit on the small side.

Why I'm thinking about any of this: I think it would be interesting to attempt path-compressed generalized tries. The process of addressing a key in a generalized trie is very serialization-like. Working in preorder, each step produces a statically known number of bits representing a constructor. The easy thing is to just stick a node there, but the path compression way would say to put together bits until there's a branch (or some limit is reached). If you have any thoughts on that, I'd be very interested in hearing them.

Asymptotic complexity of keyed functions.

All the functions which reconstruct keys (traverseWithKey, mapBy, contextualMapBy, foldrWithKey, toList, keys, etc) suffer from a quadratic slowdown due to how we reconstruct the keys. I can make some changes to reduce the non-asymptotic factors; but so far as I can tell, actually fixing the bug is irreconcilable under the current design.

We could easily eliminate the slowdown by instead constructing a reversed variant of lazy bytestrings (reversed, because we need snoc-lists of strict bytestrings, whereas the standard lazy bytestrings are cons-lists). However, if the caller then converts those to standard bytestrings, doing so will incur the quadratic cost again. (Going from reversed-lazy-bytestrings to lazy-bytestrings is only quadratic in the number of chunks, whereas going to strict-bytestrings is quadratic in the length.) Thus, while this approach would technically solve our problems, it only does so by pushing the problem off onto the user, which is unacceptable.

I do, at least, have some ideas for how we could solve this by storing extra metadata in the trie. But so far I have no idea how great the overhead of that would be. Ideally, if the metadata can be made cheap enough to compute on the fly, then we could just compute it when we need it. Each call to the key-reconstructing functions would require two passes over the trie, but that's far better than the current approach. However, if the user wants to call these functions often, then it makes more sense to keep the metadata around; but that introduces the burden of updating it whenever changes are made to the trie. And if the cost of those updates is too great, then that pushes us towards forking the datatype so that users can decide which cost model they want —which I'd really rather not do, if it can be avoided.

Under GHC 8, use runRW# primop

The UnsafeST module says it's all about working around [https://ghc.haskell.org/trac/ghc/ticket/5916]. That ticket has now been closed thanks to the new runRW# primop. You should probably use that when available. You should also be able to remove the -fno-full-laziness from things using that module.

Asymptotic complexity of functions deleting values

When using the internal functions arc, arc_, arcNN, or prepend to rebuild the spine after deleting values, under certain circumstances we can end up chaining those together and thus encounter an asymptotic slowdown similar to #25 (albeit greatly reduced in scope and severity). However, unlike #25, the bug here is not intrinsic to the data structure. We should be able to adjust those functions to behave more like a zipper, and only do the bytestring concatenation once an arc is "complete" (akin to the use of RevLazyByteString in the priority-queue functions).

Add minimal `match` function

The current match/match_ functions return the longest prefix; we may also want a version that returns the shortest prefix.

Since matches/matches_ is a good-producer, it should be cheap enough to simply take the head of that list. (If for some reason the list overhead isn't fully erased, then we'll want to see if we can't find some way to combine the implementations of match_, minMatch_, and matchFB_.)

intersectBy

I've made an intersectBy function that I'm using (for faster intersection than filter.merge) at https://github.com/unhammer/bytestring-trie-0.2.4/pull/1/files

If this seems at all useful for anyone else, it'd be very nice to have in the official package, so I don't have to remember to cabal add-source all the time :) It's for 0.2.4.1 though; I haven't had a chance to look into all the changes in 0.3 (which I guess is not quite done yet).

Add `differenceBy`

Like intersectBy, we can add a primitive differenceBy which is much more efficient than the circuitous route of using disunion twice.

text-trie issue: Bug in `match`

bytestring-trie (correct):

Trie.match (Trie.fromList [("foo", "0"), ("foob", "1")]) "foobar"
-- Just ("foob", True, "ar")

text-trie (incorrect):

Trie.match (Trie.fromList [("foo", "0"), ("foob", "1")]) "foobar"
-- Just ("f", True, "oobar")

Issues are disabled on text-trie for some weird reason, so raising it here and pinging @michaeljklein

Consider: deleteSubmap and other prefix operations

The submap function is extremely useful for retrieving a map for all strings (or suffixes of strings, rather) that contain a specified prefix, and does it very efficiently.

Logically, it is possible to produce a map with all strings with a given prefix removed, and to do this efficiently. The currently provided functions do not seem to allow for this however, and instead require using alterBy or similar to delete every key that has the prefix.

Given that prefixes are such a fundamental part of the logic of tries, I think some more useful prefix operations would be valuable in this library. Consider, for example, the ability to add a common prefix to all keys in the trie, or delete the longest common prefix. Tries make these options inherently efficient, but currently the user does not have the tools to do this efficiently.

Conversion to another Trie internal representation

I'd like to convert Tries to another internal representation. I can do it by using toList and writing code to build the NaiveTrie from the list, but that's basically throwing away all the work that your code does. It would be a lot simpler and smaller code if I had access to the internals:

data NaiveTrie a = NaiveTrie !ByteString !(Maybe a) ![NaiveTrie a] deriving Show

toNaiveTrie :: Trie a -> NaiveTrie a
toNaiveTrie Empty              = NaiveTrie "" Nothing []
toNaiveTrie (Arc b v t)        = NaiveTrie b v (dfs t)
toNaiveTrie (Branch p m t1 t2) = NaiveTrie "" Nothing ((dfs t1) ++ (dfs t2))

dfs :: Trie a -> [NaiveTrie a]
dfs Empty              = []
dfs arc@(Arc _ _ _)    = [toNaiveTrie arc]
dfs (Branch _ _ t1 t2) = (dfs t1) ++ (dfs t2)

Morally, the internal Trie structure is currently being exposed via showTrie, so exposing the structure completely won't leak any more information. On the other hand, allowing users to construct Tries directly would allow Tries that violate the invariants.

How willing would you be to accept a patch providing a datatype isomorphic to Trie, to which Trie can be converted? It would expose the structure I need while not allowing the isomorphic structure to be used as input to your module's functions.

Releasing 0.2.7.1 to Hackage?

Appaently there is already version 0.2.7.1 that supports GHC 9.4, but it's not available on Hackage. Could you upload the latest version to Hackage as well?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.