Giter Site home page Giter Site logo

trie's Introduction

Summary

Over holiday 2023, I decided to try to play around to find an efficient way to store paths in a prefix tree from .NET Core.

The point of this is to make a little C# one that will take tons of file system paths and keep them compressed yet searchable in memory. Ideally one should be able to reference any given leaf and work back up to get the whole path when necessary. However, most consumers I've seen who want to look at paths anyway don't want the actual full string in the first place. They want to do StartsWith, EndsWith, or Contains on the paths to find ones that fall under a directory, have a specific extension, or have a partial filename match. All of these should be possible without fully resynthesizing the trie nodes to a string first. Though I haven't implemented all of them up front, they should be relatively easy.

The other thing I want to encompass here is absolutely minimizing allocations. The other examples I've seen of this often don't care about how many nodes or substrings or whatnot that they allocate. As such, I created several block pools that make big honking chunks of bytes or whatnot and dole them out instead of new new newing everything up and adding all that overhead to heap management and the garbage collector.

I was also attempting to use structures where I could to avoid allocations completely and in the main Node structure, I intentionally used 4 ints so it would be 128-bits big (or rather a nice multiple of 32/64-bit word size and not waste bits there).

Prior Research

I didn't really find an off-the-shelf implementation that looked to do what I wanted. Either my brain turned to mush trying to read the implementation, it was in C++ instead of C# and I really want to avoid compiling for a whole matrix of things (Linux/Windows, x86/amd64, blah blah blah) by leveraging .NET Core, it was heap allocating/freeing a lot, or something of that ilk.

Performance Testing

  • I dumped my entire filesystem paths to a file using the one little ignored unit test that's in here. It was 59,001KB with 575567 lines (paths).
  • I loaded it in with the other unit test into this little trie. When I was done optimizing, it seemed to take the test a little over 1 second to read all the paths from the file and completely build the trie on a Framework Laptop 13 i7-1280p variety. If I recall correctly, it consumed a little over 20MB of RAM when I was staring at it under the Visual Studio .NET Allocations memory profiler and JetBrains' dotMemory.

Conclusion

In lieu of losing this somewhere in my pile of computers, I threw it on GitHub. Chances that it is effectively "abandoned" like many of the other tries out there is high, but if I do end up using it for something, I might put the bonus tweaks back here. However, maybe it'll help the next person scouring the internet for something like this. Or future me.

trie's People

Contributors

miniksa avatar

Stargazers

Drew O'Meara avatar well.james avatar Laurent Kempé avatar 咖喱咖喱 avatar Eli Belash avatar Icarus avatar  avatar Andrejs Agejevs avatar Demin Dmitriy avatar Lee Seung Hu avatar Louis avatar 夜莺 avatar Kirill Osenkov avatar Dan Thompson avatar

Watchers

 avatar

trie's Issues

Performance sucks when you end up with tens of thousands of children under one node

Theoretically there can be tens of thousands of files under one path. But then performance of GetChild sucks as it linearly searches through all the nodes.

I had a think about improving that... but I think what's probably a better solution is to make this into a real compressed tree instead of just by the path separator. Many of the file names under a 10,000+ file folder likely have name redundancy in them, so we could perhaps just adjust the trie construction to not break on separators but rather break up existing nodes as divergence is found to eliminate the problem.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.