dazinator / dotnet.glob

A fast globbing library for .NET / .NETStandard applications. Outperforms Regex.

License: MIT License

C# 95.54% Batchfile 0.40% PowerShell 3.99% Shell 0.07%

glob glob-pattern globbing-library csharp

dotnet.glob's Introduction

DotNet.Glob

A fast (probably the fastest) globbing library for .NET.

If you like this library, please consider giving back with a small donation.

This library does not use Regex; I wanted to make something faster. The latest benchmarks show that DotNet.Glob outperforms Regex, which was my goal for this library. The benchmarks use BenchmarkDotNet and are located inside this repo; just dotnet run them. Some benchmark results have also been published on the wiki: https://github.com/dazinator/DotNet.Glob/wiki/Benchmarks-(vs-Compiled-Regex)

Usage

Install the NuGet package: Install-Package DotNet.Glob
Add a using statement: using DotNet.Globbing;
Parse a glob from a pattern

 var glob = Glob.Parse("p?th/*a[bcd]b[e-g]a[1-4][!wxyz][!a-c][!1-3].*");
 var isMatch = glob.IsMatch("pAth/fooooacbfa2vd4.txt"); // You can also use ReadOnlySpan<char> on supported platforms.

Build a glob fluently

You can also use the GlobBuilder class if you wish to build up a glob using a fluent syntax. This is also more efficient as it avoids having to parse the glob from a string pattern.

So to build the following glob pattern: /foo?\\*[abc][!1-3].txt:

  var glob = new GlobBuilder()
                .PathSeparator()
                .Literal("foo")
                .AnyCharacter()
                .PathSeparator(PathSeparatorKind.BackwardSlash)
                .Wildcard()
                .OneOf('a', 'b', 'c')
                .NumberNotInRange('1', '3')
                .Literal(".txt")
                .ToGlob();

   var isMatch = glob.IsMatch(@"/fooa\\barrra4.txt"); // returns true.

Patterns

The following patterns are supported (from Wikipedia):

Wildcard	Description	Example	Matches	Does not match
*	matches any number of any characters including none	Law*	Law, Laws, or Lawyer
?	matches any single character	?at	Cat, cat, Bat or bat	at
[abc]	matches one character given in the bracket	[CB]at	Cat or Bat	cat or bat
[a-z]	matches one character from the range given in the bracket	Letter[0-9]	Letter0, Letter1, Letter2 up to Letter9	Letters, Letter or Letter10
[!abc]	matches one character that is not given in the bracket	[!C]at	Bat, bat, or cat	Cat
[!a-z]	matches one character that is not from the range given in the bracket	Letter[!3-5]	Letter1, Letter2, Letter6 up to Letter9 and Letterx etc.	Letter3, Letter4, Letter5 or Letterxx

In addition, DotNet Glob also supports:

Wildcard	Description	Example	Matches	Does not match
`**`	matches any number of path / directory segments. When used, it must be the only content of a segment.	/**/some.*	/foo/bar/bah/some.txt, /some.txt, or /foo/some.txt

Escaping special characters

Wrap special characters ?, *, [ in square brackets in order to escape them. You can also use negation when doing this.

Here are some examples:

Pattern	Description	Matches
`/foo/bar[[].baz`	match a `[` after bar	`/foo/bar[.baz`
`/foo/bar[!!].baz`	match any character except `!` after bar	`/foo/bar7.baz`
`/foo/bar[!]].baz`	match any character except a `]` after bar	`/foo/bar7.baz`
`/foo/bar[?].baz`	match a `?` after bar	`/foo/bar?.baz`
`/foo/bar[*]].baz`	match either a `*` or a `]` after bar	`/foo/bar*.baz`,`/foo/bar].baz`
`/foo/bar[*][]].baz`	match `*]` after bar	`/foo/bar*].baz`

ReadOnlySpan

ReadOnlySpan<char> is supported as of version 3.0.0 of this library. You can read more about Span here: https://msdn.microsoft.com/en-us/magazine/mt814808.aspx

You must be targeting a platform that supports ReadOnlySpan<T> for this API to become available. These are currently:

.NET Core 2.1
Platforms that implement .NET Standard 2.1

Usage remains very similar, except you can use the overload that takes a ReadOnlySpan<char> as opposed to a string:

    var glob = Globbing.Glob.Parse("p?th/*a[bcd]b[e-g]a[1-4][!wxyz][!a-c][!1-3].*");
    var span = "pAth/fooooacbfa2vd4.txt".AsSpan();
    Assert.True(glob.IsMatch(span));

There should be some performance benefits in using this in conjunction with the other Span-based APIs being added to .NET Framework / .NET Standard.

Advanced Usages

Options.

DotNet.Glob allows you to set options at a global level, however you can also override these options on a per glob basis, by passing in your own GlobOptions instance to a glob.

To set global options, use GlobOptions.Default.

For example:

    // Override the default options globally for all matches:
    GlobOptions.Default.Evaluation.CaseInsensitive = true;
    DotNet.Globbing.Glob.Parse("foo").IsMatch("Foo"); // true;

Or, override any global default options, by passing in your own instance of GlobOptions:

    GlobOptions options = new GlobOptions();
    options.Evaluation.CaseInsensitive = false;
    DotNet.Globbing.Glob.Parse("foo", options).IsMatch("Foo"); // false;

Case Sensitivity (Available as of version >= 2.0.0)

By default, evaluation is case-sensitive unless you specify otherwise.

    GlobOptions options = new GlobOptions();
    options.Evaluation.CaseInsensitive = true;
    DotNet.Globbing.Glob.Parse("foo*", options).IsMatch("FOo"); // true;

Setting CaseInsensitive has an impact on:

Letter Ranges. Any letter range (e.g. '[A-Z]') will now match both lower and upper case characters.
Character Lists. Any character list (e.g. '[ABC]') will now match both lower and upper case characters.
Literals. Any literal (e.g. 'foo') will now match both lower and upper case characters, so FoO will match foO etc.

Match Generation

Given a glob, you can generate random matches, or non matches, for that glob. For example, given the glob pattern /f?o/bar/**/*.txt you could generate matching strings like /foo/bar/ajawd/awdaw/adw-ad.txt or random non matching strings.

  var pattern = "/f?o/bar/**/*.txt"; // the example pattern from the text above
  var dotnetGlob = Glob.Parse(pattern);
  var generator = new GlobMatchStringGenerator(dotnetGlob.Tokens);

  for (int i = 0; i < 10; i++)
  {
          // generate a match.
          var testString = generator.GenerateRandomMatch();
          var result = dotnetGlob.IsMatch(testString);
          // result is always true.

          // generate a non match.
          testString = generator.GenerateRandomNonMatch();
          result = dotnetGlob.IsMatch(testString);
          // result is always false.
  }

Give Back

If this library has helped you, even in a small way, please consider a small donation via https://opencollective.com/darrell-tunnell. It really would be greatly appreciated.

dotnet.glob's Issues

Glob Builder

Create a glob builder so that globs can be built up fluently.

So to build a glob: /foo?\\*[abc][!1-3].txt

 var glob = new GlobBuilder()
                .PathSeparator()
                .Literal("foo")
                .AnyCharacter()
                .PathSeparator(PathSeparatorKind.BackwardSlash)
                .Wildcard()
                .OneOf('a', 'b', 'c')
                .NumberNotInRange('1', '3')
                .Literal(".txt")
                .ToGlob();

Glob Match Generator

To facilitate testing, create a generator, that given a glob, can generate random strings en-masse that will match that glob.

Wildcard pattern giving false positive match

Reported by @cocowalla here: #25
The pattern *ave 12 is matching the string "HKEY_LOCAL_MACHINE\SOFTWARE\Adobe\Shockwave 12"

Needs fixing.

Performance Comparison Tests

Do some performance comparison against https://github.com/kthompson/glob which currently uses Regex. I have dubbed this a fast globbing library, so better fact check that claim. If this library is dismally slower then scrap it.

Add more benchmarks against compiled Regex

Add more benchmarks versus a compiled regex.

DirectoryWildcards

Need to add support for directory wildcards: **/

IsMatch throws IndexOutOfRangeException

If I run:

var file = "x";
var glob = "**/y";
Glob.Parse(glob).IsMatch(file);

The following exception is thrown by IsMatch

Unhandled Exception: System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at DotNet.Globbing.Evaluation.WildcardDirectoryTokenEvaluator.IsMatch(String allChars, Int32 currentPosition, Int32& newPosition)
   at DotNet.Globbing.Evaluation.CompositeTokenEvaluator.IsMatch(String allChars, Int32 currentPosition, Int32& newPosition)
   at DotNet.Globbing.Glob.IsMatch(String subject)

It seems that if the length of the string to match is the same as what follows the **/ the exception is thrown.

Additional information:

λ dotnet --version
2.1.4

Update / Fix AppVeyor build to vs2017

In the develop branch, we have upgraded the solution to VS2017. Now we need to upgrade the AppVeyor build accordingly.

/DIR1/DIR2/file.txt won't match glob /DIR1/*/*

Code example:

Glob glob = Glob.Parse(@"/DIR1/*/*");
MatchInfo matchInfo = glob.Match(@"/DIR1/DIR2/file.txt");
Console.Out.WriteLine("matchInfo.Success = {0}", matchInfo.Success);

gives:
matchInfo.Success = False

match assembly version and nuget version

I noticed that all of the DLLs in the nuget have the version 1.0.0.0 in their Win32 version resources and the .NET assembly names. It would be nice if this would be the same as the nuget version.

Is the * operator greedy or non-greedy?

In other words, does it make the longest possible or shortest possible match?

If it is greedy, is it possible to specify a non-greedy match (like *? in regex)?

Add Benchmarks

IndexOutOfRangeException

The following lines generate an IndexOutOfRangeException:

var glob = Glob.Parse("C:\\Bin\\*.pdb");
glob.Match("C:\\Bin\\.vs");

** Directory wildcard not matching correctly

At the moment /**/some.* doesn't match /some.txt due to the fact that the / is matching, and then the directory wildcard ** is matching from position 1 with subtokens that match /some.* against text from position 1 which is some.txt. /some.* doesn't match some.txt.

This all boils down to the fact that ** token needs to also know if it has a trailing / so that I can omit a / token for the trailing slash in its place. That way rather than the glob /**/ being tokenised into

/ token
** token
/ token

it can be tokenised into just:

/ token
** token (with information to say the token has a trailing / character).

This will result in the trailing slash in /**/ being omitted as a token, so then /**/some.* will match /some.txt.

/**/some.* will not, however, match some.txt, as the first / still expects to match because it is a path separator token. I think this is ok, as to match "some.txt" in any directory you could use **/some.txt, and then this won't require a leading slash.
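The tokenisation change described above can be sketched as follows. This is a minimal, self-contained illustration and not the library's tokeniser: the class name TokenizerSketch, the Tokenize method and the use of plain strings as tokens are all assumptions made for the example. Only /, * and ** are recognised; everything else (including ? and [...]) is treated as literal text to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: not part of DotNet.Glob.
public static class TokenizerSketch
{
    // Splits a pattern into coarse tokens: "/" separators, "*" wildcards,
    // "**" directory wildcards and runs of literal text.
    public static List<string> Tokenize(string pattern)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < pattern.Length)
        {
            char c = pattern[i];
            if (c == '*' && i + 1 < pattern.Length && pattern[i + 1] == '*')
            {
                i += 2;
                // The trailing separator is folded into the "**" token
                // instead of being emitted as a separate "/" token.
                bool hasTrailingSeparator = i < pattern.Length && pattern[i] == '/';
                if (hasTrailingSeparator)
                {
                    i++;
                }
                tokens.Add(hasTrailingSeparator ? "**/" : "**");
            }
            else if (c == '*' || c == '/')
            {
                tokens.Add(c.ToString());
                i++;
            }
            else
            {
                // Consume a run of literal characters.
                int start = i;
                while (i < pattern.Length && pattern[i] != '*' && pattern[i] != '/')
                {
                    i++;
                }
                tokens.Add(pattern.Substring(start, i - start));
            }
        }
        return tokens;
    }
}
```

With this scheme, /**/some.* yields the tokens "/", "**/", "some." and "*", so no separate / token is left after the directory wildcard to demand a second slash in /some.txt.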

Path separator insensitive with wildcards

I'm having some trouble using wildcard globs in situations where I get mixed forward and backward slashes (cross-platform vscode extension, it's a nightmare, vscode gives you wonderful things like //c/users)

"**/gfx/*.gfx" seems to work fine on mixed slashes.

"**/gfx/**/*.gfx" only seems to work on paths with forward slashes.

"**\\gfx\\**\\*.gfx" only seems to work on paths with backwards slashes.

Is this working as designed, or am I missing something?

Defect: Spaces After Comma Doesn't Match

The following results seem to indicate a defect:

DotNet.Globbing.Glob.Parse("Stuff,*").IsMatch("Stuff, x");: true
DotNet.Globbing.Glob.Parse("Stuff *").IsMatch("Stuff x");: true
DotNet.Globbing.Glob.Parse("Stuff, *").IsMatch("Stuff, x");: false

Am I missing something?

Token parsing and AllowInvalidPathCharacters = true

Discovered whilst investigating #46

When setting the following:


options.Parsing.AllowInvalidPathCharacters = true;

and tokenising **/foo/

It causes the tokeniser to parse the literal foo as foo/. This causes a literal match on the path separator, which is problematic if mixed slashes are used.

endless loop in version 1.6.6

try this code:
localPattern="C:\sources\COMPILE*\MSVC120.DLL"
localInput="C:\sources\COMPILE\ANTLR3.RUNTIME.DLL"

var glob = Glob.Parse( localPattern ); return glob.IsMatch( localInput );

endless loop in "WildcardDirectoryTokenEvaluator.IsMatch":

currentPosition=51
maxPos=51
isMatch=false

// Match until maxpos, is reached.
while (currentPosition <= maxPos)
{
    // Test at current position.
    isMatch = _subEvaluator.IsMatch(allChars, currentPosition, out newPosition);
    if (isMatch)
    {
        return isMatch;
    }

    // Iterate until we hit a seperator or maxPos.
    while (currentPosition < maxPos)
    {
        currentPosition = currentPosition + 1;
        currentChar = allChars[currentPosition];
        if (currentChar == '/' || currentChar == '\\')
        {
            // advance past the seperator.
            currentPosition = currentPosition + 1;
            break;
        }
    }
}
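In the loop above, once currentPosition equals maxPos and the sub-evaluator does not match, the inner loop no longer runs and the outer loop re-tests the same position forever. The sketch below shows one way to guarantee termination. It is an illustration only and not necessarily how the library fixed the issue: the class and method names are made up, subMatch is a hypothetical stand-in for _subEvaluator.IsMatch, and maxPos is assumed to be a valid index into allChars.

```csharp
using System;

// Hypothetical sketch: not part of DotNet.Glob.
public static class DirectoryWildcardSketch
{
    public static bool MatchAtSegmentStarts(
        string allChars, int currentPosition, int maxPos, Func<string, int, bool> subMatch)
    {
        while (currentPosition <= maxPos)
        {
            // Test at the current position.
            if (subMatch(allChars, currentPosition))
            {
                return true;
            }

            // Guard: at maxPos the inner loop below cannot advance any
            // further, so give up instead of re-testing the same position.
            if (currentPosition >= maxPos)
            {
                return false;
            }

            // Advance until just past the next separator, or until maxPos.
            while (currentPosition < maxPos)
            {
                currentPosition++;
                char currentChar = allChars[currentPosition];
                if (currentChar == '/' || currentChar == '\\')
                {
                    currentPosition++;
                    break;
                }
            }
        }
        return false;
    }
}
```

Every pass through the outer loop now either returns or moves currentPosition forward by at least one character, so the loop cannot spin.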

Escaping glob patterns

Is escaping glob patterns supported?

For example, let's say I have a literal path like:

/my*files/more[stuff]/is-there-more?/

Is there a supported mechanism by which *, [, ] and ? can be escaped, such that they will be treated as literals instead of glob characters? For example, can I escape characters with a backslash?

/my\*files/more\[stuff\]/is-there-more\?/

Single asterisk behaviour

I seem to get odd results when using the single asterisk wildcard *. For example, let's take the following string:

HKEY_LOCAL_MACHINE\SOFTWARE\Adobe\Shockwave 12

The pattern *ave 12 matches, as expected, but *ave*2 (which uses 2 asterisks) does not, and neither does Shock* 12 (which uses a single asterisk, but not as the first character in the pattern).

Is this the expected behaviour? If so, what's the rationale behind it?

More Performance Improvement Ideas and Heuristics

When parsing a glob pattern I can do the following:

Calculate the minimum required char length that a string needs in order to match the pattern overall.

For example, this pattern would require a string at least 4 characters long in order to match:
**/*.txt

This would require 9:
*/[a-z][!1-9]/f.txt

This is because certain tokens require at least a single character in order to match, so you can sum those up when analysing the glob pattern and get the minimum length that a string must have in order to have any possibility of matching.

Some tokens (* and **) will match against 0 or many characters so they won't add any weight to the minimum required length.

With this information available, when matching strings against the Glob using Glob.IsMatch(somestring) I can allow the glob to fail much faster on certain strings using a simple string length check.

For example:

var glob = Glob.Parse("*some/fol?er/p*h/file.*");
glob.IsMatch("aaaaaaaasome/foo");

That can fail pretty much straight away, because the computed min length of a matching string is 20 chars, and the test string is only 16 chars long. I can fail this without even attempting to match any of the tokens.

I expect to use this new length information for the IsMatch() evaluation, where you just want a bool result quickly. The Match() method is different because it returns a more in-depth analysis of the match, including which tokens failed to match and why. For example, in the case of a string aaaaaaaasome/foo and a pattern *some/fol?er/p*h/file.* it might be important to know that "*some/" actually matches "aaaaaaaasome/" but that "fol" fails to match "foo", and the closest it came to matching was "fo". This kind of in-depth match analysis can only be returned if the match is actually attempted, which means not failing fast due to length checks. However, failing early is desirable when doing IsMatch() because a boolean result is all you want to know.

Once I have the min required length computed, I can also put in some improvements for the Wildcard(*) evaluator and WildcardDirectory(**) evaluators.

Those evaluators will now be able to match against characters only within a range where it doesn't take them past minimum length required for the remaining tokens to be matched.
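The length heuristic described in this issue can be sketched as below. This is a simplified, self-contained illustration rather than the library's implementation: GlobLengthSketch, MinRequiredLength and CanPossiblyMatch are hypothetical names, a separator directly after ** is assumed to belong to the directory wildcard, and bracket parsing is deliberately naive (escaping edge cases such as [*]] are not modelled).

```csharp
using System;

// Hypothetical sketch: not part of DotNet.Glob.
public static class GlobLengthSketch
{
    // Minimum number of characters a string needs in order to have any
    // chance of matching the pattern.
    public static int MinRequiredLength(string pattern)
    {
        int min = 0;
        int i = 0;
        while (i < pattern.Length)
        {
            char c = pattern[i];
            if (c == '*')
            {
                // "*" and "**" can match zero characters, so they add nothing.
                bool isDirectoryWildcard = i + 1 < pattern.Length && pattern[i + 1] == '*';
                i += isDirectoryWildcard ? 2 : 1;
                // Assumption: a separator right after "**" is part of the
                // directory wildcard and adds no required length either.
                if (isDirectoryWildcard && i < pattern.Length
                    && (pattern[i] == '/' || pattern[i] == '\\'))
                {
                    i++;
                }
            }
            else if (c == '[')
            {
                int j = i + 1;
                if (j < pattern.Length && pattern[j] == '!')
                {
                    j++; // skip the negation marker
                }
                j++; // the first member is taken literally, even if it is ']'
                int close = j < pattern.Length ? pattern.IndexOf(']', j) : -1;
                if (close < 0)
                {
                    throw new FormatException("Unterminated '[' in pattern.");
                }
                min += 1; // a list or range matches exactly one character
                i = close + 1;
            }
            else
            {
                min += 1; // '?', separators and literal characters
                i++;
            }
        }
        return min;
    }

    // Cheap pre-check: strings shorter than the minimum can never match.
    public static bool CanPossiblyMatch(string pattern, string input)
    {
        return input.Length >= MinRequiredLength(pattern);
    }
}
```

For the patterns in this issue the sketch gives 4 for **/*.txt, 9 for */[a-z][!1-9]/f.txt and 20 for *some/fol?er/p*h/file.*, so the 16-character string aaaaaaaasome/foo is rejected by the length check alone.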

Glob Formatter

Write a formatter that can iterate a tokenised glob pattern and output the relevant glob string.

This will be useful when building up a glob programmatically.

wrong match for "C:\name.ext" to glob pattern "C:\name\**"

given the following test code ...

var glob = DotNet.Globbing.Glob.Parse( @"C:\name\**" );
bool result = glob.IsMatch( @"C:\name.ext" );
result = glob.IsMatch( @"C:\name_longer.ext" );

... the result of IsMatch() is true in both cases. To my mind this is wrong. The result should be false. I am using DotNet.Glob-1.6.1 from nuget.org.

Update Readme with some sample code.

Pattern **/ not working

The pattern "**/app*.js" for example should match when a path looks like dist/app.js or dist/app.a72ka8234.js. The issue is it is not evaluating the "**/" part of the glob pattern as the documentation only mentions "/**/" as a valid pattern. Any plan to implement this soon?

Comparison with Microsoft.Extensions.FileSystemGlobbing

Now that Microsoft has a globbing library, it would be useful to see a comparison of features and performance in the wiki.

Case sensitivity option

I like the project, works well so far for me. However, am missing an option to make glob case insensitive to mimic how Windows treats paths. Currently I have to make both glob pattern and file names lower case to achieve this. Seems like this could easily be baked in.

Keep up the good work!

Defect: Patterns with unsupported non-alphanumeric characters fail to match

It appears that patterns with escape sequences fail:

DotNet.Globbing.Glob.Parse("\"Stuff*").IsMatch("\"Stuff"): false
DotNet.Globbing.Glob.Parse("\0Stuff*").IsMatch("\0Stuff"): false
DotNet.Globbing.Glob.Parse("\nStuff*").IsMatch("\nStuff"): false
DotNet.Globbing.Glob.Parse("\r\nStuff*").IsMatch("\r\nStuff"): false

Glob.Match InvalidOperationException

Calling Glob.Match() on certain globs/input strings will produce an InvalidOperationException with the message "Index was outside the bounds of the array". Repro for v1.6.9:

Glob.Parse("*://*wikia.com/**").Match("chrome://extensions");

Infinite loop?

I'm using the latest Nuget package, 1.7.0-unstable0022, and one of my tests found an issue whereby the call to IsMatch never returns (I assume due to an infinite loop).

Pattern: C:\Test\**\*.txt
Test String: C:\Test\file.dat

This issue was not present in 1.7.0-unstable0018

Add a license

The current 1-line license file is just a tad ambiguous :)

It would be good if you could choose a more explicit license.

Glob should override ToString and return pattern

Benchmark Non Matches

I have some benchmarks that benchmark glob.IsMatch() for loads of successful matches.

However, I also need to benchmark glob.IsMatch() for unsuccessful matches, as DotNet.Glob should be highly efficient at evaluating an unsuccessful match and returning a result as fast as possible.

Unexpected match with **

Hi,

I just found an unexpected match when fiddling around with your library. I am not sure if this is a bug, or if my understanding is lacking:

var glob = DotNet.Globbing.Glob.Parse("Bumpy/**/AssemblyInfo.cs");

// success - expected
Assert.IsTrue(glob.IsMatch("Bumpy/Properties/AssemblyInfo.cs"));

// failure - unexpected
Assert.IsFalse(glob.IsMatch("Bumpy.Test/Properties/AssemblyInfo.cs"));

add target framework "net47"

<TargetFrameworks>netstandard1.1;net45;net46;net47;net4</TargetFrameworks>

"C:\THIS_IS_A_DIR\**\somefile.txt" matches wrongly to "C:\THIS_IS_A_DIR\awesomefile.txt"

See the test "DotNet.Glob.Tests.GlobTests.Does_Not_Match".
It should not match, but it does.

[Theory]
[InlineData( "C:\\THIS_IS_A_DIR\\**\\somefile.txt", "C:\\THIS_IS_A_DIR\\awesomefile.txt" )]
public void Does_Not_Match(string pattern, params string[] testStrings)
{
    var glob = Globbing.Glob.Parse(pattern);
    foreach (var testString in testStrings)
    {
        Assert.False(glob.IsMatch(testString));
    }
}

Extending globbing patterns

This is a new feature to add support for a set of extended globbing patterns - documented here:

https://www.linuxjournal.com/content/bash-extended-globbing

I see this as an opt-in feature - so I'll add another property on the options class, so you can opt-in like so:

GlobParseOptions.Default.Evaluation.EnableExtendedPatterns = true;

Once enabled, you can use the following additional patterns as supported by bash:

?(pattern-list) Matches zero or one occurrence of the given patterns
*(pattern-list) Matches zero or more occurrences of the given patterns
+(pattern-list) Matches one or more occurrences of the given patterns
@(pattern-list) Matches one of the given patterns
!(pattern-list) Matches anything except one of the given patterns

Here a pattern-list is a list of items separated by a vertical bar "|" (aka the pipe symbol).

For example, the following pattern would match all the JPEG and GIF files that start with either "ab" or "def":

+(ab|def)*+(.jpg|.gif)

Characters missing from list of allowable path characters

The readme says:

By default, when your glob pattern is parsed, DotNet.Glob will only allow literals which are valid for path / directory names. These are:
Any Letter (A-Z, a-z) or Digit
., , !, #, -, ;, =, @, ~, _, :

Maybe I'm misunderstanding this section, but on all of Windows, Linux and macOS, lots of other characters are valid in file system paths, such as:

Any printable Unicode character 你好！
{ } [ ] ( ) + ; % ? *

Also, on Windows : is not valid.

Does not recognise ~ as a valid character

As raised by @ming4883

Consider..

http://stackoverflow.com/questions/188892/glob-pattern-matching-in-net

Index outside bounds of array

Exception when matching

             [InlineData("/*file.txt", "/folder")]

The NuGet package has no strong name

I'm getting error:
System.IO.FileLoadException: 'Could not load file or assembly 'DotNet.Glob, Version=2.0.1.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. A strongly-named assembly is required.
How can I use your package now? Or when will you fix it?

Add back in net4 and net46 targets.

The current stable release supports net4, 4.5, 4.6 and netstandard 1.1.
After the latest dev changes, the current unstable NuGet package only supports net45 and netstandard 1.1.

I am going to add back in the other targets.

Add benchmark results to wiki

As per request from @cocowalla

Spanification

With the recent 'spanification' of .NET Core, in particular the addition of the Span-based FileSystemEnumerable and System.IO.Path Span-based methods, I'm wondering if it would be possible to add Span-based methods to Glob? The point would be to avoid allocating strings when performing Glob-based matching.

Glob object model support for **

Your globbing library is pretty nice. I was wondering if there’s a way to glob a pattern like this using the GlobBuilder: [a-zA-Z0-9]**

I can do it with string parsing, Glob.Parse(); but I seek high performance and was wondering if it would be possible to use the GlobBuilder to achieve the same.
new GlobBuilder().LetterInRange('a', 'z').Wildcard().WildCard() -> [a-z]**

.Wildcard().WildCard() doesn't seem to be equivalent to "**".

Also, another thing that’s problematic is being able to do, [a-zA-Z0-9]**.

Finally, I want to glob files with a certain extension for e.g., *.pdb this doesn't seem to work at the moment?

Match vs IsMatch

The Match method is different from IsMatch in that, rather than returning a simple bool, it returns information about how a match progressed, i.e. which tokens matched at which positions of the string, which is useful if you need to analyze the match. However, its implementation is not consistent with the IsMatch implementation and it also has some bugs. My first choice is to make this method obsolete and then eventually remove it. If people want this method, I'll add a message to the obsolete directive asking for feedback on this issue, and if there is demand to keep it I'll refactor it rather than remove it.