imagibee / gigantor

Works in conjunction with System.Text.RegularExpressions.Regex to boost performance, add a replace function, and support gigantic files that exceed RAM

License: MIT License

C# 100.00%
large-files regex regular-expression search performance

gigantor's People

Contributors

dynamicbutter

Forkers

dynamicbutter

gigantor's Issues

Benchmarks are invalidated by F_NOCACHE

Turning the file cache off with F_NOCACHE while a large part of the 32 GB test file is still resident in memory causes subsequent tests to run faster, because a big chunk of their data is already cached.

Steps to reproduce the behavior:

  1. dotnet test -c Release --framework net7.0 --filter FileReadThroughputTest
  2. Notice that some results exceed the 3.4 GB/s sequential read throughput of this device

optimize UnsafeByteToString

Instead of using a 1-byte pointer to unpack one byte at a time, use a 32-bit or 64-bit pointer to unpack a word at a time.
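A minimal sketch of the word-at-a-time idea, assuming the goal is to widen each byte into one UTF-16 char. `ByteWidening.WidenToString` is a hypothetical name, not Gigantor's `UnsafeByteToString`, and the sketch uses safe span APIs (`string.Create`, `BinaryPrimitives`) rather than raw pointers. Whether it is faster in practice would need to be benchmarked.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration; not part of Gigantor.
static class ByteWidening
{
    // Widens the first `length` bytes into chars, one char per byte,
    // reading four bytes per iteration as a single 32-bit word.
    public static string WidenToString(byte[] bytes, int length)
    {
        if (length < 0 || length > bytes.Length)
            throw new ArgumentOutOfRangeException(nameof(length));

        return string.Create(length, bytes, static (chars, src) =>
        {
            int i = 0;
            // Main loop: one 32-bit read yields four chars.
            for (; i + 4 <= chars.Length; i += 4)
            {
                uint word = BinaryPrimitives.ReadUInt32LittleEndian(src.AsSpan(i, 4));
                chars[i] = (char)(word & 0xFF);
                chars[i + 1] = (char)((word >> 8) & 0xFF);
                chars[i + 2] = (char)((word >> 16) & 0xFF);
                chars[i + 3] = (char)(word >> 24);
            }
            // Tail loop: the remaining zero to three bytes.
            for (; i < chars.Length; i++)
                chars[i] = (char)src[i];
        });
    }
}
```

A 64-bit variant would follow the same pattern with `ReadUInt64LittleEndian` and eight chars per iteration.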

Match results wrong for larger values of overlap

Steps to reproduce

  1. Run RegexSearcherTests.MaxMatchCountTest with overlap set to pattern.Length and observe it fails

Note: I suspect that deduping does not correctly handle the case where multiple matches are present in the overlap window.

markdown examples contain errors

As a user I want the markdown examples to be correct and functional, and as a developer I find it error-prone to write examples directly in markdown. Change the example workflow to create tests for the examples and paste the tested code into the markdown.

limit SearchIndexer MatchCount

As a user of SearchIndexer I am concerned about memory usage if MatchCount is allowed to grow without bounds. Provide a parameter to the user that enables them to select an upper limit on MatchCount for their application.

LineIndexer deadlock

There is a good chance the LineIndexer will deadlock if maxWorkers is not set to a value of 1.

support compressed files

As a user I would like to use Gigantor on compressed files without decompressing them to disk, but Gigantor currently only works on files that are already decompressed. See this Stack Overflow post.

add code coverage to tests

As a developer I need to evaluate if sufficient testing is done. Add code coverage to tests and analyze lines that are not covered.

add nuget package

As a user of Gigantor I would like to get the distribution from nuget.org

make it faster

Wanted: PRs that result in measurable performance gains ;)

line to fpos mapping

As a user of LineIndexer I want a mapping between a line and its fpos to be provided so that I don't have to re-invent it. The mapping should work in either direction.
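One way such a two-way mapping could work is sketched below, assuming lines are separated by `\n` and the whole input fits in a byte array. `LineMap` and its members are hypothetical names for illustration, not the LineIndexer API.

```csharp
using System.Collections.Generic;

// Hypothetical type for illustration; not part of Gigantor.
sealed class LineMap
{
    // lineStarts[i] is the fpos of the first byte of line i + 1.
    private readonly List<long> lineStarts = new() { 0 };

    public LineMap(byte[] data)
    {
        for (int i = 0; i < data.Length; i++)
        {
            // A new line starts after every '\n' that is not the final byte.
            if (data[i] == (byte)'\n' && i + 1 < data.Length)
                lineStarts.Add(i + 1);
        }
    }

    public int LineCount => lineStarts.Count;

    // 1-based line number -> fpos of its first byte.
    public long FposFromLine(int line) => lineStarts[line - 1];

    // Non-negative fpos -> 1-based number of the line that contains it.
    public int LineFromFpos(long fpos)
    {
        int index = lineStarts.BinarySearch(fpos);
        // On a miss, BinarySearch returns the bitwise complement of the
        // index of the next larger start, which equals the line number.
        return index >= 0 ? index + 1 : ~index;
    }
}
```

Keeping the line starts sorted makes the line-to-fpos direction an array lookup and the fpos-to-line direction a binary search.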

add readlines

As a user I commonly need to read the next line starting at a given position and don't want to reinvent the wheel. While System.IO.StreamReader looks promising, it has known issues with read, seek, read sequences, which are exactly what would be useful. Create a Gigantor StreamReader based on BinaryReader that can do read, seek, read sequences.
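A rough sketch of a read, seek, read helper built on BinaryReader, assuming UTF-8 text and `\n` or `\r\n` line endings. `LineReading.ReadLineAt` is a hypothetical name, not an existing Gigantor type. Because `BinaryReader.ReadByte` does not read ahead, the stream position stays accurate after each call.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of Gigantor.
static class LineReading
{
    // Seeks to fpos, reads up to the next '\n', and leaves the stream
    // positioned at the first byte of the following line.
    public static string ReadLineAt(Stream stream, long fpos)
    {
        stream.Seek(fpos, SeekOrigin.Begin);
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);
        using var line = new MemoryStream();
        while (stream.Position < stream.Length)
        {
            byte b = reader.ReadByte();
            if (b == (byte)'\n') break;
            line.WriteByte(b);
        }
        byte[] bytes = line.ToArray();
        int length = bytes.Length;
        // Drop the '\r' of a "\r\n" line ending.
        if (length > 0 && bytes[length - 1] == (byte)'\r') length--;
        return Encoding.UTF8.GetString(bytes, 0, length);
    }
}
```

Reading one byte at a time keeps the sketch short. A practical version would likely read in blocks and track the position itself.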

dedup search results

As a user I do not want any duplicate search results. Because of the way RegexSearcher uses overlapping buffers, there is a possibility of duplicate matches. Find and remove duplicates prior to Running being set to False.
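As an illustration only, duplicates produced by overlapping buffers could be removed by keying each match on its start position. The `Hit` record and `Dedup.RemoveDuplicates` are hypothetical stand-ins, not Gigantor's match type or its actual dedup logic.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a match record; not Gigantor's type.
record Hit(long StartFpos, string Value);

// Hypothetical helper for illustration; not part of Gigantor.
static class Dedup
{
    // Keeps the first hit seen at each start position and returns
    // the survivors sorted by position.
    public static List<Hit> RemoveDuplicates(IEnumerable<Hit> hits)
    {
        var seen = new HashSet<long>();
        return hits
            .Where(h => seen.Add(h.StartFpos)) // Add returns false for repeats
            .OrderBy(h => h.StartFpos)
            .ToList();
    }
}
```

Keying on the start position alone assumes two buffers never report different match lengths at the same position. If a buffer boundary could truncate a match, the key would also need to take the length into account.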

support match groups

As a user of search I would like to be able to take advantage of match groups for more sophisticated searches. Add Group structure to the returned data structure and populate it from the matches.

RegexSearcher matches have a wrong value of StartFpos

The value of MatchData.StartFpos is slightly off at the beginning of the file and gets progressively worse the further into the file the search goes. This is easily reproduced by seeking to the match's StartFpos and comparing the text there to the Value of the match data.
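The check described above could be sketched as follows, assuming a UTF-8 encoded file. `FposCheck.MatchesAt` is a hypothetical helper that takes the start position and expected text as plain parameters rather than a Gigantor MatchData.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of Gigantor.
static class FposCheck
{
    // Returns true when the bytes at startFpos equal the UTF-8
    // encoding of the expected match text.
    public static bool MatchesAt(Stream stream, long startFpos, string value)
    {
        byte[] expected = Encoding.UTF8.GetBytes(value);
        byte[] actual = new byte[expected.Length];
        stream.Seek(startFpos, SeekOrigin.Begin);

        // Stream.Read may return fewer bytes than requested, so loop.
        int total = 0;
        while (total < actual.Length)
        {
            int n = stream.Read(actual, total, actual.Length - total);
            if (n == 0) return false; // hit end of stream early
            total += n;
        }
        return actual.AsSpan().SequenceEqual(expected);
    }
}
```

Running this over every reported match would show where in the file the positions start to drift.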

add line cache

As a text file viewer application I need to show a window full of text lines, but I am only concerned with the current window. There should be a way to select the number of lines in the cache, seek to a line, and advance the cache by +/- N lines, where N is an integer.

solve readme image issue with nuget.org

As a nuget.org user I want to see the images in the README from nuget.org. Find a solution that allows GitHub hosted readme images to be displayed in nuget.org.

Background.StartAndWait fails in certain cases

Background.StartAndWait fails...

Steps to reproduce

This fails

    DuplicateChecker dc = new(enwik9, enwik9Copy1, progress);
    Background.StartAndWait(dc, progress, (_) => {});

But this passes

    DuplicateChecker dc = new(enwik9, enwik9Copy1, progress);
    dc.Start();
    Background.Wait(dc, progress, (_) => {});
    Assert.AreEqual(true, dc.Identical);

add file duplication detection

As a user I would like to be able to quickly determine if two very large files are duplicates of one another. This should work for binary or text files.
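A minimal, single-threaded sketch of the comparison, assuming both streams are positioned at their start. `FileCompare.AreIdentical` is a hypothetical name and is not the DuplicateChecker implementation. It compares raw bytes, so it works for binary and text files alike.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of Gigantor.
static class FileCompare
{
    public static bool AreIdentical(Stream a, Stream b, int bufferSize = 1 << 16)
    {
        // Cheap early exit: different lengths can never be identical.
        if (a.CanSeek && b.CanSeek && a.Length != b.Length) return false;

        byte[] bufA = new byte[bufferSize];
        byte[] bufB = new byte[bufferSize];
        while (true)
        {
            int nA = ReadFull(a, bufA);
            int nB = ReadFull(b, bufB);
            if (nA != nB) return false;
            if (nA == 0) return true; // both streams exhausted
            if (!bufA.AsSpan(0, nA).SequenceEqual(bufB.AsSpan(0, nB))) return false;
        }
    }

    // Fills the buffer unless the stream ends first; returns bytes read.
    private static int ReadFull(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

For very large files, the same chunked comparison could be split across workers that each handle a different region of the two files.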
