imagibee / gigantor

Works in conjunction with System.Text.RegularExpressions.Regex to boost performance, add a replace function, and support gigantic files that exceed RAM

License: MIT License

C# 100.00%

large-files regex regular-expression search performance

gigantor's People

dynamicbutter

gigantor's Issues

Benchmarks are invalidated by F_NOCACHE

Turning the file cache off with F_NOCACHE while a large part of the 32 GB test file is already resident in memory causes subsequent tests to run faster, because a big chunk of their data is still served from memory.

Steps to reproduce the behavior:

dotnet test -c Release --framework net7.0 --filter FileReadThroughputTest
Notice how some results exceed the 3.4 GB/s sequential read throughput for this device

optimize UnsafeByteToString

Instead of using a 1-byte pointer to unpack one byte at a time, use a 32- or 64-bit pointer to unpack a word at a time.
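A rough sketch of the word-at-a-time idea, written with safe spans instead of raw pointers. `WordUnpack.WidenBytes` is a hypothetical name, and the sketch assumes that UnsafeByteToString widens each byte to one char, which this issue does not confirm. Whether it is actually faster would need to be measured.

```csharp
using System;
using System.Buffers.Binary;

public static class WordUnpack
{
    // Hypothetical helper: widens each input byte to one UTF-16 char.
    public static string WidenBytes(ReadOnlySpan<byte> src)
    {
        var dst = new char[src.Length];
        int i = 0;

        // Main loop: load 8 bytes with a single 64-bit read, then unpack them.
        for (; i + 8 <= src.Length; i += 8)
        {
            ulong word = BinaryPrimitives.ReadUInt64LittleEndian(src.Slice(i, 8));
            for (int k = 0; k < 8; k++)
            {
                // In a little-endian load, byte k sits at bit offset 8 * k.
                dst[i + k] = (char)((word >> (8 * k)) & 0xFF);
            }
        }

        // Tail loop: handle the remaining 0 to 7 bytes one at a time.
        for (; i < src.Length; i++)
        {
            dst[i] = (char)src[i];
        }

        return new string(dst);
    }
}
```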

Match results wrong for larger values of overlap

Steps to reproduce

Run RegexSearcherTests.MaxMatchCountTest with overlap set to pattern.Length and observe it fails

Note: suspect that deduping is not correctly handling the case where multiple matches are present in the overlap window

markdown examples contain errors

As a user I want the markdown examples to be correct and functional, and as a developer I find it error-prone to write examples directly in markdown. Change the example workflow to create tests for the examples and paste the tested code into the markdown.

limit SearchIndexer MatchCount

As a user of SearchIndexer I am concerned about memory usage if MatchCount is allowed to grow without bounds. Provide a parameter to the user that enables them to select an upper limit on MatchCount for their application.

LineIndexer deadlock

There is a good chance the LineIndexer will deadlock if maxWorkers is not set to a value of 1.

Multiple regex per partition

support compressed files

As a user I would like to use Gigantor on compressed files without decompressing them to disk, but Gigantor currently only works on files that are already decompressed. See this Stack Overflow post.

add code coverage to tests

As a developer I need to evaluate if sufficient testing is done. Add code coverage to tests and analyze lines that are not covered.

add nuget package

As a user of Gigantor I would like to get the distribution from nuget.org

make it faster

Wanted: PRs that result in measurable performance gains ;)

test it on Linux

Wanted: help running the functional tests on Linux

line to fpos mapping

As a user of LineIndexer I want a mapping between a line and its fpos to be provided so that I don't have to re-invent it. The mapping should work in either direction.
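One possible shape for such a mapping, sketched under the assumption that the starting fpos of every line is already available as a sorted array. `LineMap` and its members are hypothetical names, not part of LineIndexer. Line numbers are 1-based here.

```csharp
using System;

public sealed class LineMap
{
    // lineStarts[i] is the fpos of the first byte of line i + 1 (sorted ascending).
    private readonly long[] lineStarts;

    public LineMap(long[] sortedLineStarts)
    {
        lineStarts = sortedLineStarts;
    }

    // Line number (1-based) to the fpos where that line starts.
    public long FposOfLine(int line)
    {
        return lineStarts[line - 1];
    }

    // Fpos to the 1-based number of the line that contains it.
    public int LineOfFpos(long fpos)
    {
        if (lineStarts.Length == 0 || fpos < lineStarts[0])
        {
            throw new ArgumentOutOfRangeException(nameof(fpos));
        }
        int idx = Array.BinarySearch(lineStarts, fpos);
        if (idx < 0)
        {
            // ~idx is the index of the first start greater than fpos,
            // so the containing line starts one entry earlier.
            idx = ~idx - 1;
        }
        return idx + 1;
    }
}
```

The reverse lookup is a binary search, so it stays O(log n) even for files with many millions of lines.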

test it on Windows

Wanted: help running the functional tests on Windows

add readlines

As a user I commonly need to read the next line starting at a given position and don't want to reinvent the wheel. While System.IO.StreamReader looks promising, it has known issues with read, seek, read sequences, which are exactly what would be useful here. Create a Gigantor StreamReader based on BinaryReader that can do read, seek, read sequences.
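A minimal sketch of the read, seek, read pattern. For brevity it reads from a seekable `Stream` directly rather than through BinaryReader, and it assumes UTF-8 text with `\n` or `\r\n` line endings. `LineReader.ReadLineAt` is a hypothetical name.

```csharp
using System;
using System.IO;
using System.Text;

public static class LineReader
{
    // Reads the line that starts at fpos, without its terminator.
    // nextFpos receives the position just past the terminator, so it can be
    // passed back in to read the following line.
    public static string ReadLineAt(Stream stream, long fpos, out long nextFpos)
    {
        stream.Seek(fpos, SeekOrigin.Begin);
        var bytes = new MemoryStream();
        int b;
        // Collect raw bytes until a line feed or the end of the stream.
        while ((b = stream.ReadByte()) != -1 && b != '\n')
        {
            bytes.WriteByte((byte)b);
        }
        nextFpos = stream.Position;
        // Decode only after the whole line is collected, then strip
        // trailing carriage returns.
        return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
    }
}
```

Because the position is tracked in bytes on the underlying stream, there is no decoder-side buffering to get out of sync after a seek.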

dedup search results

As a user I do not want any duplicate search results. Because of the way RegexSearcher uses overlapping buffers, there is a possibility of duplicate matches. Find and remove duplicates prior to Running being set to False.
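A sketch of one way to remove such duplicates, assuming that the same match found in two overlapping buffers is reported with the same absolute start position and the same text. `MatchInfo` and `MatchDedup` are hypothetical stand-ins, not the library's own types.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a match record; value equality comes from the record struct.
public readonly record struct MatchInfo(long StartFpos, string Value);

public static class MatchDedup
{
    // Drops repeated (StartFpos, Value) pairs and returns matches in file order.
    public static List<MatchInfo> Dedup(IEnumerable<MatchInfo> matches)
    {
        return matches
            .Distinct()                  // uses the record struct's value equality
            .OrderBy(m => m.StartFpos)   // restore file order across partitions
            .ToList();
    }
}
```

Comparing the full set rather than only adjacent entries also covers the case where several matches fall inside one overlap window.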

support match groups

As a user of search I would like to be able to take advantage of match groups for more sophisticated searches. Add Group structure to the returned data structure and populate it from the matches.
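A sketch of how group data could be copied out of a `System.Text.RegularExpressions.Match`. `GroupInfo` and `GroupExtractor` are hypothetical names. Note that `Group.Index` is relative to the string that was searched, so translating it to a file position would be a separate step.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for the per-group data returned to the caller.
public readonly record struct GroupInfo(string Name, int Index, string Value);

public static class GroupExtractor
{
    // Copies every group that participated in the match, including group "0"
    // (the whole match), into plain data that outlives the Match object.
    public static List<GroupInfo> Extract(Match match)
    {
        var result = new List<GroupInfo>();
        foreach (Group g in match.Groups)
        {
            if (g.Success)
            {
                result.Add(new GroupInfo(g.Name, g.Index, g.Value));
            }
        }
        return result;
    }
}
```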

RegexSearcher matches have a wrong value of StartFpos

The value of MatchData.StartFpos is slightly off near the beginning of the file and gets progressively worse the further into the file the search goes. This is easily reproduced by seeking to the match's StartFpos and comparing the text there to the Value of the match data.

code duplication in LineIndexer, RegexSearcher, and DuplicateChecker

As a developer I want to avoid code duplication because of the associated maintenance cost. I have noticed that there is a lot of code duplication in LineIndexer, RegexSearcher, and DuplicateChecker. Consolidate this overlapping functionality into common code.

add line cache

As a text file viewer application I need to show a window full of text lines, but I'm only really concerned with the current window. There should be a way to select the number of lines in the cache, seek to a line, and advance the cache by +/- N lines, where N is an integer.

This fails

    DuplicateChecker dc = new(enwik9, enwik9Copy1, progress);
    Background.StartAndWait(dc, progress, (_) => {});

But this passes

    DuplicateChecker dc = new(enwik9, enwik9Copy1, progress);
    dc.Start();
    Background.Wait(dc, progress, (_) => {});
    Assert.AreEqual(true, dc.Identical);

add file duplication detection

As a user I would like to be able to quickly determine if two very large files are duplicates of one another. This should work for binary or text files.
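A simple single-threaded sketch of the comparison, working on raw bytes so it applies to binary and text files alike. `StreamCompare.AreIdentical` is a hypothetical name and is not how DuplicateChecker is implemented. A faster version would presumably split the files into partitions and compare them in parallel.

```csharp
using System;
using System.IO;

public static class StreamCompare
{
    // Returns true when both streams yield exactly the same bytes.
    public static bool AreIdentical(Stream a, Stream b, int chunkSize = 64 * 1024)
    {
        var bufA = new byte[chunkSize];
        var bufB = new byte[chunkSize];
        while (true)
        {
            int nA = Fill(a, bufA);
            int nB = Fill(b, bufB);
            if (nA != nB) return false;   // one stream ended before the other
            if (nA == 0) return true;     // both ended with no difference found
            if (!bufA.AsSpan(0, nA).SequenceEqual(bufB.AsSpan(0, nB))) return false;
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the stream ends.
    private static int Fill(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

For files on disk, comparing lengths first would allow an early exit before any data is read.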