imagibee / gigantor
Works in conjunction with System.Text.RegularExpressions.Regex to boost performance, add a replace function, and support gigantic files that exceed RAM
License: MIT License
Turning the file cache off using F_NOCACHE while a large part of the 32 GB test file is still resident in memory causes subsequent tests to run faster, because a big chunk of their data is already cached.
Steps to reproduce the behavior:
Instead of using a 1-byte pointer to unpack one byte at a time, use a 32- or 64-bit pointer to unpack a word at a time.
Steps to reproduce
Note: I suspect that deduping does not correctly handle the case where multiple matches are present in the overlap window.
As a user I want the markdown examples to be correct and functional, and as a developer I find it error-prone to write examples directly in markdown. Change the example workflow to create tests for the examples and paste the tested code into the markdown.
As a user of SearchIndexer I am concerned about memory usage if MatchCount is allowed to grow without bounds. Provide a parameter to the user that enables them to select an upper limit on MatchCount for their application.
There is a good chance the LineIndexer will deadlock if maxWorkers is not set to a value of 1.
As a user I would like to use Gigantor on compressed files without decompressing them to disk, but Gigantor currently works only on files that are already decompressed. See this Stack Overflow post.
As a developer I need to evaluate if sufficient testing is done. Add code coverage to tests and analyze lines that are not covered.
As a user of Gigantor I would like to get the distribution from nuget.org
Wanted: PRs that result in measurable performance gains ;)
Wanted: help running the functional tests on Linux
As a user of LineIndexer I want a mapping between a line and its fpos to be provided so that I don't have to re-invent it. The mapping should work in either direction.
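A two-way mapping like the one requested can be built from a sorted list of line-start offsets. The following is only a minimal sketch of that idea, not LineIndexer's API: the class `LineMap` and its method names are hypothetical, line numbers are assumed to be 1-based, and the whole input is assumed to fit in a byte array for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Gigantor): maps 1-based line numbers
// to byte offsets (fpos) and back, using sorted line-start offsets.
public sealed class LineMap
{
    // Line 1 always starts at offset 0
    private readonly List<long> _starts = new List<long> { 0 };

    public LineMap(byte[] data)
    {
        for (int i = 0; i < data.Length; i++)
        {
            // A new line starts right after each '\n', unless the data ends there
            if (data[i] == (byte)'\n' && i + 1 < data.Length)
            {
                _starts.Add(i + 1);
            }
        }
    }

    public int LineCount => _starts.Count;

    // Line number -> offset of the first byte of that line
    public long FposFromLine(int line)
    {
        if (line < 1 || line > _starts.Count)
            throw new ArgumentOutOfRangeException(nameof(line));
        return _starts[line - 1];
    }

    // Offset -> number of the line that contains it
    public int LineFromFpos(long fpos)
    {
        if (fpos < 0) throw new ArgumentOutOfRangeException(nameof(fpos));
        int idx = _starts.BinarySearch(fpos);
        // When not found, BinarySearch returns the bitwise complement of the
        // index of the next larger element; step back one to get the
        // last line start that is <= fpos.
        if (idx < 0) idx = ~idx - 1;
        return idx + 1;
    }
}
```

Because the line starts are stored in ascending order, the offset-to-line lookup is a binary search and the line-to-offset lookup is a plain index.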
Wanted: help running the functional tests on Windows
As a user I commonly need to read the next line starting at a given position and don't want to reinvent the wheel. While System.IO.StreamReader looks promising, it has known issues with read, seek, read sequences, which are exactly what would be useful. Create a Gigantor StreamReader that is based on BinaryReader and can do read, seek, read sequences.
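One way to get a reliable read, seek, read sequence is to bypass the decoder's internal buffering and work on raw bytes. The sketch below is an assumption-laden illustration, not the reader the issue asks for: the class `LineAt` and its method are hypothetical, it reads from a plain `Stream` rather than a `BinaryReader`, and it assumes UTF-8 text with `\n` or `\r\n` line endings.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of Gigantor)
public static class LineAt
{
    // Seeks to fpos, then reads bytes up to (not including) the next '\n'
    // or the end of the stream, and decodes them as UTF-8.
    public static string ReadLine(Stream stream, long fpos)
    {
        stream.Seek(fpos, SeekOrigin.Begin);
        using var buffer = new MemoryStream();
        int b;
        while ((b = stream.ReadByte()) != -1 && b != '\n')
        {
            buffer.WriteByte((byte)b);
        }
        byte[] bytes = buffer.ToArray();
        int length = bytes.Length;
        // Drop a trailing '\r' so Windows line endings are handled too
        if (length > 0 && bytes[length - 1] == (byte)'\r') length--;
        return Encoding.UTF8.GetString(bytes, 0, length);
    }
}
```

Since every call seeks explicitly and no decoded characters are buffered between calls, calls can be made in any order on the same stream.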
As a user I do not want any duplicate search results. Because of the way RegexSearcher uses overlapping buffers, there is a possibility of duplicate matches. Find and remove duplicates prior to Running being set to False.
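To illustrate why overlapping buffers produce duplicates and how they can be removed, here is a minimal sketch that searches a string in overlapping chunks and keys each match by its absolute start offset. It is not RegexSearcher's implementation: the class `OverlapSearch`, its parameters, and the use of an in-memory string are all assumptions made for illustration. It also assumes no match is longer than the overlap and that a match's start offset is enough to identify it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Gigantor)
public static class OverlapSearch
{
    // Returns the sorted absolute start offsets of all matches, each once.
    public static List<int> FindAll(string text, Regex regex, int chunkSize, int overlap)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0) throw new ArgumentOutOfRangeException(nameof(overlap));

        // A set keyed by absolute offset silently drops repeated matches
        var starts = new SortedSet<int>();
        for (int offset = 0; offset < text.Length; offset += chunkSize)
        {
            // Each chunk extends `overlap` characters into the next one so
            // that matches straddling a chunk boundary are not lost.
            int length = Math.Min(chunkSize + overlap, text.Length - offset);
            string chunk = text.Substring(offset, length);
            foreach (Match m in regex.Matches(chunk))
            {
                // Convert the chunk-relative index to an absolute offset;
                // Add returns false when the offset was already recorded.
                starts.Add(offset + m.Index);
            }
        }
        return new List<int>(starts);
    }
}
```

A match that lies entirely inside an overlap window is found by two adjacent chunks, which is the duplicate the set removes. Patterns whose match boundaries depend on where scanning starts would need more care than this sketch provides.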
As a user of search I would like to be able to take advantage of match groups for more sophisticated searches. Add Group structure to the returned data structure and populate it from the matches.
The value of the MatchData.StartFpos is slightly off at the beginning of the file and progressively gets worse the further into the file the search goes. Easily reproducible by seeking to the match StartFpos and comparing the text there to the text Value of the match data.
As a developer I want to avoid code duplication because of the associated maintenance cost. I have noticed that there is a lot of code duplication in LineIndexer, RegexSearcher, and DuplicateChecker. Consolidate this overlapping functionality into common code.
As a text file viewer application I need to show a window full of text lines but I'm only really concerned about a current window. There should be a way to select the number of lines in the cache, seek to a line, advance the cache by +/- N lines where N is an integer.
Both file and stream mode can improve performance by using a singly allocated TLS buffer to reduce memory allocations
As a nuget.org user I want to see the images in the README from nuget.org. Find a solution that allows GitHub hosted readme images to be displayed in nuget.org.
Background.StartAndWait fails...
Steps to reproduce
This fails
DuplicateChecker dc = new(enwik9, enwik9Copy1, progress);
Background.StartAndWait(dc, progress, (_) => {});
But this passes
DuplicateChecker dc = new(enwik9, enwik9Copy1, progress);
dc.Start();
Background.Wait(dc, progress, (_) => {});
Assert.AreEqual(true, dc.Identical);
As a user I would like to be able to quickly determine if two very large files are duplicates of one another. This should work for binary or text files.
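A byte-level comparison works equally well for binary and text files. The following single-threaded sketch shows the basic idea under stated assumptions: the class `StreamCompare` and its methods are hypothetical and are not DuplicateChecker's API, and no attempt is made at the parallelism a very large file would benefit from.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Gigantor)
public static class StreamCompare
{
    // Returns true when both streams yield exactly the same bytes.
    public static bool AreIdentical(Stream a, Stream b, int bufferSize = 81920)
    {
        var bufA = new byte[bufferSize];
        var bufB = new byte[bufferSize];
        while (true)
        {
            int nA = ReadFull(a, bufA);
            int nB = ReadFull(b, bufB);
            if (nA != nB) return false;   // one stream ended early
            if (nA == 0) return true;     // both ended with no difference
            if (!bufA.AsSpan(0, nA).SequenceEqual(bufB.AsSpan(0, nB)))
                return false;
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the stream ends.
    private static int ReadFull(Stream s, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

For files on disk, comparing the two file lengths first would let unequal files be rejected without reading any data.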