benjamin-hodgson / pidgin

827.0 24.0 60.0 3.24 MB

A lightweight and fast parsing library for C#.

Home Page: https://www.benjamin.pizza/Pidgin

License: MIT License

C# 99.57% F# 0.43%

parsing dotnet dotnet-core csharp parser-combinators parse parser

pidgin's Issues

How to signal parsing failure from within Map?

Background: I need a parser that can consume an ISO-8601 string into a DateTime. I'm trying to avoid reinventing the wheel for all the valid date and time formats, so I'm leaning on DateTime.Parse(). I need to parse out a string with Pidgin to send to DateTime.Parse(), but I don't actually know if that string is fully valid until I feed it through that parse method (code below).

Unfortunately there doesn't seem to be a way to indicate failure from within the "combine parser outputs" function that's passed to Map, so an invalid date string will cause parsing to fail with a System.FormatException. Of course there's TryParse(), but once I know parsing has failed, there doesn't seem to be anything I can do with that information.

Is there a recommended way to handle this type of situation?

internal static readonly Parser<char, IFilterExpression> DateTimeLiteral =
    SharedParsers.Token(
        Map(
            (year, rest) =>
            {
                var input = new string(year.ToArray()) + rest;
                var result = DateTime.Parse(
                    input,
                    CultureInfo.InvariantCulture,
                    DateTimeStyles.RoundtripKind
                );

                return (IFilterExpression) new DateTimeLiteral(result);
            },
            Digit.Repeat(4),
            Char('-')
                .Then(
                    Try(Digit)
                        .Or(OneOf('-', ':', '.', 'T', 'Z'))
                        .ManyString(),
                    (head, tail) => head + tail
                )
        )
    );
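One way to turn a failed conversion into a parse failure instead of an exception is to move the conversion out of Map and into Bind, which can pick the next parser based on the text already parsed. The following is a minimal sketch, assuming Pidgin's Bind, Parser<char>.Return and Parser<char>.Fail<T>. The names DateTimeSketch, RawText and IsoDateTime and the simplified character set are illustrative and not taken from the code above.

```csharp
using System;
using System.Globalization;
using Pidgin;

public static class DateTimeSketch
{
    // Grab the run of characters that could belong to an ISO-8601 literal.
    private static readonly Parser<char, string> RawText =
        Parser<char>.Token(c => char.IsDigit(c) || "-:.TZ".IndexOf(c) >= 0)
            .AtLeastOnceString();

    // Bind lets the conversion itself decide between success and failure.
    public static readonly Parser<char, DateTime> IsoDateTime =
        RawText.Bind(text =>
            DateTime.TryParse(
                text,
                CultureInfo.InvariantCulture,
                DateTimeStyles.RoundtripKind,
                out var value)
                ? Parser<char>.Return(value)
                : Parser<char>.Fail<DateTime>("invalid ISO-8601 date/time"));
}
```

Note that the failure is reported after RawText has consumed input, so if this parser is one of several alternatives it would probably need to be wrapped in Try.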

Many? - question

Let's have the following (simplified) example:

static readonly Parser<char, IEnumerable<string>> Foo = OneOf(Map(c => "<>", Token('|')), Token(c => c != '|').ManyString()).Many();

void Main()
{
	Foo.Parse("|bar|foo").Dump();
}

It fails with Many() used with a parser which consumed no input. What's wrong with that?
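A likely cause: Token(c => c != '|').ManyString() succeeds without consuming anything when the next character is '|' or the input has ended, and Many() refuses to loop over a parser that can succeed on empty input. A minimal sketch of one possible fix, which requires at least one character per segment (the class name ManySketch is illustrative):

```csharp
using System.Collections.Generic;
using Pidgin;
using static Pidgin.Parser;

public static class ManySketch
{
    public static readonly Parser<char, IEnumerable<string>> Foo =
        OneOf(
            // A bar becomes the marker string.
            Char('|').ThenReturn("<>"),
            // AtLeastOnceString fails, rather than returning "", when nothing matches.
            Parser<char>.Token(c => c != '|').AtLeastOnceString()
        ).Many();
}
```

With this change every iteration of Many() either consumes input or fails cleanly, so the loop terminates at the end of the input.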

How to parse escaped string?

I'm trying to parse a quoted string with an escaped quote in it. I've tried a few variations of this with no success:

    private static readonly Parser<char, string> QuotedString =
        (String("\\\"").Or(Token(c => c != '"').ManyString())).Many().Select(System.String.Concat).Between(Quote);

Any ideas?
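One possible approach is to work one character at a time: either a backslash followed by any character, or any character that is neither a quote nor a backslash. This avoids calling Many() on a ManyString() parser, which can succeed without consuming input. A minimal sketch (the class and field names are illustrative, and Quote is assumed to be Char('"')):

```csharp
using Pidgin;
using static Pidgin.Parser;

public static class QuotedStringSketch
{
    private static readonly Parser<char, char> Quote = Char('"');

    // A backslash followed by any character yields that character.
    private static readonly Parser<char, char> Escaped =
        Char('\\').Then(Parser<char>.Any);

    // Any character that neither ends the string nor starts an escape.
    private static readonly Parser<char, char> Plain =
        Parser<char>.Token(c => c != '"' && c != '\\');

    public static readonly Parser<char, string> QuotedString =
        Escaped.Or(Plain).ManyString().Between(Quote);
}
```

This sketch treats every escaped character literally; mapping sequences such as \n to a newline would need an extra Select step.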

Streaming input?

(Sorry if I misunderstand the goal of this library)

The parsers I found in the examples are all string-based. I'm wondering whether it is possible to create functionality using this library that takes a Stream as an input and that outputs parsed items through events when new data is received through the input stream.

The use case for this is that I want to connect the incoming data Stream from a SerialPort object to my parser functionality, and I want it to emit parsed pieces of information to my GUI via events.

Pidgin any doesn't exist

This is throwing an error because Any doesn't exist

using NUnit.Framework;
using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;

namespace Pidgin_Basic_Tests
{
    public class Pidgin
    {

        [Test]
        public void Any()
        {
           Assert.AreEqual('a', Any.ParseOrThrow("a"));
           Assert.AreEqual('b', Any.ParseOrThrow("b"));
        }
    }
}

There is, however, an Any(), but it is a method and returns void.

EDIT: renaming my Pidgin file and class to PidginTest doesn't seem to work.

EDIT2: so this..

using NUnit.Framework;
using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;

namespace Pidgin_Basic_Tests
{
    public class PidginTest
    {

        [Test]
        public void Any()
        {
           Assert.AreEqual('a', Parser<char>.Any.ParseOrThrow("a"));
           Assert.AreEqual('b', Parser<char>.Any.ParseOrThrow("b"));
        }
    }
}

works, and using Pidgin; is now treated as a required using statement, but the other two are still reported as unnecessary.

EDIT3: Odd, it works fine in a different file. Maybe it's due to the NUnit tests?
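A likely explanation is C# name lookup rather than NUnit: inside a class that declares a method called Any, the simple name Any resolves to that method before the compiler considers members brought in by using static. That would also explain why the fully qualified Parser<char>.Any works. A minimal sketch of the workaround, which is simply to give the test method a different name (PidginTest and ParseSingleChar are illustrative names, and NUnit is left out to keep the example self-contained):

```csharp
using Pidgin;
using static Pidgin.Parser<char>;

public class PidginTest
{
    // The method is not called Any, so the statically imported
    // Parser<char>.Any property is no longer hidden.
    public static char ParseSingleChar(string input) => Any.ParseOrThrow(input);
}
```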

New token streams

Hi,

we would like to use Pidgin for parsing data received using ReadOnlySequence<T>. I'm experimenting with implementation in beho/Pidgin/read-only-sequence.

I was forced to fork this repository because ITokenStream, the essential extension point for creating new token streams, is marked internal. I would much rather use your build of Pidgin (and update by simply updating the NuGet package) than have to merge new commits into a fork.

So the question is: is there any plan to publish ITokenStream for external consumers, or is it something you would be willing to consider? Or, from the opposite perspective, is there any reason preventing you from making this interface public?

Thanks for the great work!

beho

Resumable Parsing

I would love to be able to resume parsing once I finish parsing something. For example, it would be nice to do something like this:

IEnumerable<Token> ParseStream(TextReader stream) {
    var result = this.Parser.Parse(stream);
    while (result.Success) {
        yield return result.Value;
        result = result.ParseAgain(); // This would probably be implemented differently
    }
}

This would allow us to parse from an input stream as needed, without needing to store all the output tokens in memory. For example, we could process a 2GB+ file and write the output into another file without using very much memory.

Sprache has an equivalent to this, but does not support parsing from a stream:

IEnumerable<Token> ParseStream(string input) {
    var result = this.Parser.TryParse(input);
    while (result.WasSuccessful) {
        yield return result.Value;
        result = this.Parser(result.Remainder);
    }
}

One possible approach is to keep a reference to the ParseState<T> inside the object that Parse returns.

[Question] Is Map the best way to build an object from multiple parsers?

First, thank you for this library. I have some monster regexes that I can make more maintainable with this =D

But, I have some difficulty composing a parser that emits a complex type.

I have a string like this

Bob Saget : (1234) 'Actor'

And an object like this

class Person
{
    public string Name { get; set; }
    public int Id { get; set; }
    public string Title { get; set; }
}

And a few parsers defined like this

var Colon = Char(':');
var SingleQuote = Char('\'');
var Name = OneOf(Letter, Whitespace).ManyString();
var IdNumber = Num.Between(Char('('), Char(')'));
var Title = OneOf(Letter, Whitespace).Many().Between(SingleQuote).Select(chars => string.Concat(chars));

But I am having a hard time chaining my primitive parsers into a single complex parser without losing the data from early stages in the pipeline.

Ideally I would like a Result<char, Person>. To accomplish this I have awkwardly 'mapped' them together like this.

Person MakePerson(string name, char _, char _, int id, char _, string title)
{ 
    return new Person {
      Name = name, 
      Id = id,
      Title = title
    };
}

var personParser = Map(MakePerson, Name, Colon, Whitespace, IdNumber, Whitespace, Title);

It works, but if I had 10 properties or more discards I would be out of luck using Map.

The given sequencing primitives (Then and Before) assume one of the two captures is not useful, so if I don't have throwaway characters to burn I am not sure how to use them. I might just be bad at parser combinators, but this has me kind of stumped.

I have worked around this by making a fluent builder that flows state, very similar to an F# computation expression, like this.

public static class Builder
{
    public static Builder<TToken, TComplex, TToken> Create<TToken, TComplex>(IEnumerable<TToken> tokenStream, TComplex item) => new Builder<TToken, TComplex, TToken>(tokenStream, item);
}

public struct Builder<TToken, TComplex, TLast>
{
    public IEnumerator<TToken> Enumerator { get; }
    public TComplex Item { get; }

    public Result<TToken, TLast>? LastResult { get; }
    public Builder(IEnumerable<TToken> tokenStream, TComplex item)
    {
        Enumerator = tokenStream.GetEnumerator();
        Item = item;
        LastResult = null;
    }

    private Builder(IEnumerator<TToken> enumerator, TComplex item, Result<TToken, TLast> lastResult)
    {
        Enumerator = enumerator;
        Item = item;
        LastResult = lastResult;
    }

    private bool ResultOk => LastResult == null || LastResult.Value.Success;

    public Builder<TToken, TComplex, T2> Capture<T2>(Parser<TToken, T2> parser, Action<TComplex, T2> assignment)
    {
        if (!ResultOk)
            return new Builder<TToken, TComplex, T2>(Enumerator, Item, new Result<TToken, T2>());

        var newResult = parser.Parse(Enumerator);

        if (newResult.Success)
            assignment(Item, newResult.Value);

        return new Builder<TToken, TComplex, T2>(Enumerator, Item, newResult);
    }

    public Builder<TToken, TComplex, T2> Skip<T2>(Parser<TToken, T2> parser)
    {
        if (!ResultOk)
            return new Builder<TToken, TComplex, T2>(Enumerator, Item, new Result<TToken, T2>());

        return new Builder<TToken, TComplex, T2>(Enumerator, Item, parser.Parse(Enumerator));
    }


    public Result<TToken, TComplex> Done
    {
        get
        {
            Parser<TToken, TComplex> ret;
            if (ResultOk)
                ret = Parser<TToken>.Return(Item);
            else
                ret = Parser<TToken>.Fail<TComplex>();

            return ret.Parse(Enumerable.Empty<TToken>());
        }
    }
}

Which I then use like this

var result =
    Builder
    .Create(testSubject, new Person())
    .Capture(Name, (e, name) => e.Name = name)
    .Skip(Colon)
    .Skip(Whitespace)
    .Capture(IdNumber, (e, id) => e.Id = id)
    .Skip(Whitespace)
    .Capture(Title, (e, title) => e.Title = title)
    .Done;

My questions are

Is there a more idiomatic way to chain simple parsers into complex parsers?
Is this fluent builder a bad idea or catastrophic for perf or something?
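One commonly used pattern is to attach each throwaway separator to its neighbour with Before, so that Map only receives the values that matter. A minimal sketch based on the parsers above (the class name PersonSketch is illustrative, and the Trim call is an assumption added because Name also consumes the space before the colon):

```csharp
using Pidgin;
using static Pidgin.Parser;

public class Person
{
    public string Name { get; set; }
    public int Id { get; set; }
    public string Title { get; set; }
}

public static class PersonSketch
{
    private static readonly Parser<char, char> Colon = Char(':');
    private static readonly Parser<char, char> SingleQuote = Char('\'');
    private static readonly Parser<char, string> Name =
        OneOf(Letter, Whitespace).ManyString();
    private static readonly Parser<char, int> IdNumber =
        Num.Between(Char('('), Char(')'));
    private static readonly Parser<char, string> Title =
        OneOf(Letter, Whitespace).ManyString().Between(SingleQuote);

    // Before keeps the left-hand result and discards the separator,
    // so Map only sees the three captured values.
    public static readonly Parser<char, Person> PersonParser =
        Map(
            (name, id, title) => new Person { Name = name.Trim(), Id = id, Title = title },
            Name.Before(Colon).Before(Whitespace),
            IdNumber.Before(Whitespace),
            Title);
}
```

For objects with many properties, LINQ query syntax (from ... in ... select) is another option, since each from clause can bind or ignore a value without changing the arity of a Map call.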

Get the span of a parser match?

It would be nice to be able to get, as a result, a Span in the original string/buffer that covers the matched extent.

Use cases

Re-use some very efficient parsers once I know a substring matches the expected format.

It is quite a lot of code to assemble some literal types such as double, DateTimeOffset or TimeSpan, especially if you're going to support all optional parts and variations.

The framework already has some very efficient parsers for those and it's relatively easy to just check if the input matches a format you're supporting, then letting the framework do the actual parsing.

Sometimes you want to validate a complex format, but then just keep the result as a string.
Of course, you can Concat all the parts, but it seems to me it would be easier and more efficient to just skip intermediate results and get the span in the original string.

Proposed API

Predicate.SpanResult<T>(Func<Span<string>, T> selector)

Examples

Want to parse a phone number literal that should look like (0xx) xx xx, and then keep it as a string?

String("(0")
  .Then(Digit.Repeat(2))
  .Then(String(") "))
  .Then(Digit.Repeat(2))
  .Then(Char(' '))
  .Then(Digit.Repeat(2))
  .SpanResult(s => s.ToString()); // Result is the full matched string

Want to parse a decimal?

Char('-').Optional()
  .Then(Digit.SkipAtLeastOnce())
  .Then(Char('.').Then(Digit.SkipMany()).Optional())
  .SpanResult(s => decimal.Parse(s)); // Result is the parsed decimal number

Compare that with the work required to build up the decimal yourself.

New parser: Regex

Based on that idea that sometimes you just want to grab the matched substring, I think a regex parser building block that returns a Span when it matches would make the scenarios above even simpler.

var phoneParser = Regex(@"\(0\d\d\) \d\d \d\d");

var decimalParser = Regex(@"-?\d+(\.\d*)?", s => decimal.Parse(s));
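For comparison, here is a sketch of the decimal example written against MapWithInput, assuming the installed Pidgin version provides that method (it passes the consumed input as a ReadOnlySpan together with the parser's result). The class name SpanSketch is illustrative, and the span is converted to a string before parsing only to keep the example independent of the target framework.

```csharp
using System.Globalization;
using Pidgin;
using static Pidgin.Parser;

public static class SpanSketch
{
    public static readonly Parser<char, decimal> DecimalLiteral =
        Char('-').Optional()
            .Then(Digit.SkipAtLeastOnce())
            .Then(Char('.').Then(Digit.SkipMany()).Optional())
            // span covers everything the parsers above consumed;
            // the intermediate result is ignored.
            .MapWithInput((span, _) =>
                decimal.Parse(span.ToString(), CultureInfo.InvariantCulture));
}
```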

Parsing input containing '\r', '\n'

When I parse an actual XML file with the XmlParser from the sample code, it fails.
What should I do to parse input containing '\r' and '\n'?
Does code for this already exist?

xml File:

        public void Test()
        {
            string xmlContents = "<?xml version=\"1.0\"?>\r\n" +
                                 "<note>\r\n" +
                                 "<from>Jani</from>\r\n" +
                                 "<to>Tove</to>\r\n" +
                                 "<message>Norwegian: aa. French: eee</message>\r\n" +
                                 "</note>";

            var result = XmlParser.Parse(xmlContents);

        }

result's Success is false.

Sample code used:
XmlParser.cs in the Pidgin.Examples project

Thanks, good day~

Parsing a list of unknown words

Hi,

Sorry if this isn't the right place for questions, but how would you modify your XmlParser example so that it could parse <foo>some text</foo>. Basically, adding an InnerText or Text property?

How to parse Quoted Quote?

I'd like to parse something like

"Hi I'm ""Joonhwan"""

{quote} + {string:Hi I'm "Joonhwan"} + {quote}

Please note that the "Joonhwan" part is quoted inside the string. Yes, weird, but I need this.
The quote mark itself is escaped using a quote.

How do I create a parser for this?

var result = Sequence('"', '"').Select(_ => '"').Or(Token(c => c!='"')).Many().Between(Char('"')).Parse("\"\"\"abcd\"\"1234.556\"");

That's what I tried, but it failed.
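A probable reason the attempt fails is that Sequence('"', '"') consumes the closing quote before discovering there is no second one, and a failure after consuming input stops Or from trying the alternative. Wrapping the two-quote parser in Try lets it rewind. A minimal sketch (the class and field names are illustrative):

```csharp
using Pidgin;
using static Pidgin.Parser;

public static class DoubledQuoteSketch
{
    private static readonly Parser<char, char> Quote = Char('"');

    // Two quotes in a row stand for one literal quote.
    // Try rewinds when only a single quote is present.
    private static readonly Parser<char, char> EscapedQuote =
        Try(String("\"\"")).ThenReturn('"');

    private static readonly Parser<char, char> Plain =
        Parser<char>.Token(c => c != '"');

    public static readonly Parser<char, string> QuotedString =
        EscapedQuote.Or(Plain).ManyString().Between(Quote);
}
```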

Parsing code that defines new operators?

Hi Benjamin, how would one go about parsing a piece of code that defines new operators?
A motivating example is e.g. Prolog where one can:

:- op(200, xfy, =>>).
40 =>> 20.

I can see that ExpressionParser.Build takes a collection of operators. Would it work to update the contents of the collection as the parsing progresses and we encounter new operator definitions?

parse to UInt?

How would I parse a uint?

//sounds like it would repeat until Letter or Symbol and then stop but it wants a digit, letter or symbol at the end.
public static Parser<char, uint>   uInt  { get; protected set; } = Digit.AtLeastOnceUntil(Letter.Or(Symbol)).Cast<uint>();

EDIT:
I also tried using a cast hack, but it fails:

public static Parser<char, uint>   uInt  { get; protected set; } = Digit.AtLeastOnce().Cast<uint>().Labelled("uint");
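Cast<uint>() tries to cast the parser's result (a sequence of characters) directly to uint, which cannot succeed. One possible alternative is to collect the digits into a string and convert it with Select. A minimal sketch (the class name UIntSketch is illustrative; uint.Parse will throw an OverflowException for values above uint.MaxValue, which this sketch does not handle):

```csharp
using System.Globalization;
using Pidgin;
using static Pidgin.Parser;

public static class UIntSketch
{
    public static readonly Parser<char, uint> UInt =
        Digit.AtLeastOnceString()
            // Convert the collected digits; stops at the first non-digit.
            .Select(digits => uint.Parse(digits, CultureInfo.InvariantCulture))
            .Labelled("uint");
}
```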

Name change requested

Can you change the name of this?

There has already been a Pidgin for many years (formerly named Gaim):

https://pidgin.im/

Parsing failure on multi-character sequences

Hi,

In the expression parser, a (partially) invalid string takes a long time to fail. Ultimately the failure is reasonable, but the time it takes to fail may not be. I am looking for advice on the structure of multi-character sequences. Is there a better way?

Having lots of fun with this excellent tool!

Applicable Expression parsers
private static readonly Parser<char, Func<IExpr, IExpr, IExpr>> EqualTo
= Binary(Tok("=").Then(String("=")).ThenReturn(BinaryOperatorType.EqualTo)); // "=="

Input string
1=2

Expected
1==2

Exception message
(reasonable)
Exception has occurred: CLR/Pidgin.ParseException
Exception thrown: 'Pidgin.ParseException' in Pidgin.dll: 'Parse error.
unexpected 2
expected expression
at line 1, col 3'
at Pidgin.ParserExtensions.GetValueOrThrow[TToken,T](Result`2 result)
at Pidgin.ParserExtensions.ParseOrThrow[T](Parser`2 parser, String input, Func`3 calculatePos)
at ApplicationSupport.Parsers.ExprParser.ParseOrThrow(String input) in /Users/mustik/Projects/ReservationCheck/ReservationCheck/Support/Parsers/ExprParser.cs:line 155

XmlParser does not parse nested tags

Hello.
I tried running XmlParser in Pidgin.Examples and found that it does not parse nested tags. It only parses simple tags.
Can you please update the example or tell me how I can fix it? Thanks

Assert does not exist

Am I missing something?
Visual Studio 2017 can't seem to find Assert.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;

namespace ScriptingLanguage
{
    class Program
    {
        static void Main(string[] args)
        {
            Assert.AreEqual('a', Any.ParseOrThrow("a"));
        }

    }
}

Using postfix operators for chained function calls doesn't work?

I've been trying to implement chained function calls, i.e. foo(bar)(baz)
It seemed like the obvious way to do this is to use a postfix operator but I haven't managed to get it to work. The first call will parse (foo(bar)), but the operator won't chain. I looked at the source for the expression builder, and it looks like, unlike with binary operators, unary operators are not recursively applied and aggregated.

Should the unary operators be recursively applied? Can you suggest how I should get this to work?

In your conversation with kswaroop1 re: his "expression example" pull request you suggest doing exactly this for function call operators (including the foo(bar)(baz) example), but there isn't enough of an example for me to understand how to make this work.

IgnoreResult introduces new failure label

Is this intended behaviour?

var p1 = Fail<Unit>().Labelled("fail1");
var p2 = p1.Where(r => true);
var p3 = p2.IgnoreResult();
Console.WriteLine(p1.Parse("abc").Error.Expected.Single().Label); // fail1
Console.WriteLine(p2.Parse("abc").Error.Expected.Single().Label); // fail1
Console.WriteLine(p3.Parse("abc").Error.Expected.Single().Label); // result satisfying assertion

It was certainly unexpected, since I didn't see why ignoring the result would then change the failure expectation.

Parsing double

Hi, thanks for such a great library. I noticed you have both DecimalNum and Num returning the same int type. Is this a "copy paste feature"?

Is there any Parser to parse a decimal number and return a double?

I am trying to parse a sequence of numbers like below:
114365.0 121429.0 6.2 4.0 30357.0 117165.0

It's fine if you don't; I will write a new one. But since you have created DecimalNum, I thought it would be worth asking.

    //
    // Summary:
    //     A parser which parses a base-10 integer with an optional sign. The resulting
    //     int is not checked for overflow.
    //
    // Returns:
    //     A parser which parses a base-10 integer with an optional sign
    public static Parser<char, int> DecimalNum { get; }

    //
    // Summary:
    //     A parser which parses a base-10 integer with an optional sign. The resulting
    //     int is not checked for overflow.
    //
    // Returns:
    //     A parser which parses a base-10 integer with an optional sign
    public static Parser<char, int> Num { get; }

Thanks in Advance,
Paulo.
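Judging by the documentation comments quoted above, "Decimal" in DecimalNum refers to base 10 rather than to a fractional number. A hand-written parser for values like 6.2 could be sketched as follows (the names DoubleSketch, DoubleLiteral and Doubles are illustrative, and exponents are not handled). Some later Pidgin releases may also ship a built-in floating-point parser, so it is worth checking the API reference for your version first.

```csharp
using System.Collections.Generic;
using System.Globalization;
using Pidgin;
using static Pidgin.Parser;

public static class DoubleSketch
{
    private static readonly Parser<char, string> Digits = Digit.AtLeastOnceString();

    // Optional sign, integer part, optional fractional part.
    public static readonly Parser<char, double> DoubleLiteral =
        Map(
            (sign, whole, fraction) =>
                double.Parse(sign + whole + fraction, CultureInfo.InvariantCulture),
            String("-").Or(Parser<char>.Return("")),
            Digits,
            Char('.').Then(Digits).Select(f => "." + f).Or(Parser<char>.Return("")));

    // A space-separated sequence such as "114365.0 121429.0 6.2".
    public static readonly Parser<char, IEnumerable<double>> Doubles =
        DoubleLiteral.Separated(Char(' '));
}
```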

Update build to netstandard2.0

I'm experimenting with your parser (Celin.AIS.Data) but got stuck on a bulk of 'is defined in an assembly that is not referenced...' errors. After hours of googling and trial and error I finally pulled the Pidgin source and saw that its target is netstandard1.3. After building it with netstandard2.0 as the target, so it matches my library, the errors dissipated.

get current token in until?

Is there a way to do this?

Parser<char, IEnumerable<char>> While(Predicate<string> pred) => Any.Until(!pred.Invoke(<UNKNOWNSTRING>));
Parser<char, IEnumerable<char>> Until(Predicate<string> pred) => Any.Until(pred.Invoke(<UNKNOWNSTRING>));

[Question] What is the best way to skip all input except a set of matching Parser

I'm trying to parse a big and complex database log file. There are a lot of event types written to this log file and each has its own syntax. Some of the events are just a single line with some properties, but others are more complicated an can even be nested.
I need to parse just some of the event types, because currently I do not care about the rest of the log file. I was able to get the event type parsers to work, but I struggled a lot to make a parser that cherry-picks just a few parsers and skips the rest of the input. I'm wondering how you would solve this problem.

Here is the problem in a nutshell:

using System;
using System.Collections.Generic;
using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;

namespace parsertest
{
    static class NoisyParser
    {
        private static readonly Parser<char, string> ComplexParserA = String("AA");
        private static readonly Parser<char, string> ComplexParserB = String("BB");

        private static readonly Parser<char, string> RealData = 
            ComplexParserA
                .Or(ComplexParserB);

        private static readonly Parser<char, Unit> Skip = 
            Any.SkipUntil(Lookahead(Try(RealData)));

        private static readonly Parser<char, string> RealDataWithNoiseBefore =
            from _ in Skip
            from traceData in RealData
            select traceData;

        private static readonly Parser<char, IEnumerable<string>> RealDataWithNoise = 
            from traceData in RealDataWithNoiseBefore.Until(Not(Lookahead(Skip)))
            from __ in Any.Until(End)
            select traceData;

        public static Result<char, IEnumerable<string>> Parse(string input) => 
            RealDataWithNoise.Parse(input);
    }

    class Program
    {
        static void Main(string[] args)
        {
            var veryNoisyInput = "asdfask asdf ASA a BB asdkfjAAaAa";
            //                                       ^        ^
            //                                       |        Should match
            //                                       Should match

            var parserResult = NoisyParser.Parse(veryNoisyInput);

            if (parserResult.Success)
            {
                foreach (var match in parserResult.Value)
                {
                    Console.WriteLine(match);
                }
            }
            else
            {
                Console.WriteLine($"Error.EOF: {parserResult.Error.EOF}");
                Console.WriteLine($"Error.ErrorPos: {parserResult.Error.ErrorPos}");
                Console.WriteLine($"Error.Message: {parserResult.Error.Message}");
            }
        }
    }
}

Is there a better way to solve this? I've got the feeling I'm missing something essential...
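One simpler formulation is to treat the input as a sequence of steps, where each step is either a real match or a single skipped character, and then filter out the skipped steps. A minimal sketch under that assumption (NoisySketch and MatchOrSkip are illustrative names, and null is used as the "skipped" marker purely for brevity):

```csharp
using System.Collections.Generic;
using System.Linq;
using Pidgin;
using static Pidgin.Parser;

public static class NoisySketch
{
    // Try makes each alternative rewind after a partial match such as "AS".
    private static readonly Parser<char, string> RealData =
        Try(String("AA")).Or(Try(String("BB")));

    // Each step yields either a match or null for one skipped character.
    private static readonly Parser<char, string> MatchOrSkip =
        RealData.Or(Parser<char>.Any.ThenReturn((string)null));

    public static readonly Parser<char, IEnumerable<string>> RealDataWithNoise =
        MatchOrSkip.Many().Select(items => items.Where(item => item != null));
}
```

Every step consumes at least one character, so Many() runs to the end of the input without needing Lookahead or Until.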

Scripting language samples?

Can we have some examples of how to make a scripting language?
Some things I would like to see would be:

lua
javascript
c

Does the DllRewriter break the PDB?

add two parsers together?

Is there a way to add two parsers together?

using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;
using System;
using System.Collections.Generic;
using System.Text;

namespace Pidgin_Basic_Tests
{
    public class TNumber
    {
        float value;

        public static Parser<char, string> DotOp { get; protected set; } = String(".");
        public static Parser<char, string> HexOp { get; protected set; } = String("0x");
        public static Parser<char, string> DecOp { get; protected set; } = String("0d");
        public static Parser<char, string> OctOp { get; protected set; } = String("0o");
        public static Parser<char, string> BinOp { get; protected set; } = String("0b");

        //trying to get either 0d+integer_number or integer_number+'.'+integer_number or integer_number
        public static Parser<char, string> Dec { get; protected set; }  = Try(DecOp.Or(DecimalNum.Cast<string>()));
    }
}

InvalidOperationException when calling Many

I tried to use Many(), however it threw an exception.

System.InvalidOperationException: 'Many() used with a parser which consumed no input'

Reproduction Code:
var x = Try(WhitespaceString.Many()).Parse(" ");

More details:
var x = Try(WhitespaceString).Parse(" "); works just fine.
However, if I want to skip whitespaces or comments, I have to write something along the lines of var x = Try(WhitespaceString.Or(CommentString)).Parse(" ");. This won't quite do the trick yet, since it can't handle a string that has whitespace and comments. To solve that, I tried using .Many() to apply "the current parser zero or more times". var x = Try(WhitespaceString.Or(CommentString).Many()).Parse(" ");
But that results in an exception.

[Question] Use of separator with sequenced parser

What is the right way to apply the Separated method with a sequenced parser? The following code returns the original string.

private static readonly Parser<char, char> RBracket = Char(']');
private static readonly Parser<char, char> Asterisk = Char('*');
static readonly Parser<char, Unit> EndDocument =
RBracket.Then(Asterisk).Then(RBracket).ThenReturn(Unit.Value);

var test = Any
.ManyString()
.Separated(EndDocument)
.Parse("fd]*]vb]*]gf");

Error with sequencing parsers on .NET Core 2.1

This test

[Theory]
[InlineData("<>", 1)]
[InlineData("<", 2)]
[InlineData(">", 3)]
[InlineData("=", 4)]
public void TryParse(string source, int expected)
{
    Parser<char, int> parser =
        Parser.Char('<').Then(_ => Parser.Char('>')).ThenReturn(1)
        .Or(Parser.Char('<').ThenReturn(2))
        .Or(Parser.Char('>').ThenReturn(3))
        .Or(Parser.Char('=').ThenReturn(4));
     
    Assert.Equal(expected, parser.ParseOrThrow(source));
}

fails with the following error

Failed PidginTests.PidginTests.TryParse(source: "<", expected: 2)
Error Message:
Pidgin.ParseException : Parse error.
unexpected EOF
expected "<"
at line 1, col 2
Stack Trace:
at Pidgin.ParserExtensions.GetValueOrThrow[TToken,T](Result`2 result)
at Pidgin.ParserExtensions.ParseOrThrow[T](Parser`2 parser, String input, Func`3 calculatePos)
at PidginTests.PidginTests.TryParse(String source, Int32 expected)

The other cases in the theory passed.
Verified on Windows 10 and MacOS.
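This looks like the usual backtracking rule rather than a platform issue: for the input "<", the first alternative consumes '<' before failing on the missing '>', and Or does not try the next alternative once input has been consumed. Wrapping the two-character alternative in Try rewinds the input on failure. A minimal sketch (the names ComparisonSketch and ComparisonOperator are illustrative):

```csharp
using Pidgin;
using static Pidgin.Parser;

public static class ComparisonSketch
{
    public static readonly Parser<char, int> ComparisonOperator =
        // Try undoes the consumed '<' when no '>' follows.
        Try(String("<>")).ThenReturn(1)
            .Or(Char('<').ThenReturn(2))
            .Or(Char('>').ThenReturn(3))
            .Or(Char('=').ThenReturn(4));
}
```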

[Question] Parser for range of bytes

Hi,

First time using this library, so I'm a novice and hence this might be a stupid question.

Problem
I want to write a parser that reads, for example [64kB, 5MB), and can produce the resulting tuple (64*1024, 5*1024^2) of integers.

So far

using System;
using Pidgin;
using static Pidgin.Parser;

namespace Parser
{
    public class Range
    {
        public int Lower { get; set; }

        public int Upper { get; set; }
    }

    public static class RangeParser
    {
        public static readonly char[] Suffix =
        {
            'k', 'M', 'G'
        };

        private static readonly Parser<char, char> LBracket = Char('[');
        private static readonly Parser<char, char> RBracket = Char(']');
        private static readonly Parser<char, char> LParenthesis = Char('(');
        private static readonly Parser<char, char> RParenthesis = Char(')');
        private static readonly Parser<char, char> Comma = Char(',');
        private static readonly Parser<char, char> Byte = Char('B');
        private static readonly Parser<char, char> Kilo = Char('k');
        private static readonly Parser<char, char> Mega = Char('M');
        private static readonly Parser<char, char> Giga = Char('G');

        private static readonly Parser<char, Range> Parser =
            OneOf(LBracket, LParenthesis).Then(DecimalNum).Then(OneOf(Kilo, Mega, Giga))
                .Then(Byte)
                .Separated(Comma)
                .Then(DecimalNum)
                .Then(OneOf(Kilo, Mega, Giga))
                .Then(Byte)
                .Then(OneOf(RBracket, RParenthesis))
                .Select(c => new Range());

        public static int SuffixMap(char suffix)
        {
            switch(suffix)
            {
                case 'k': return 1000;
                case 'M': return 1000 * 1000;
                case 'G': return 1000 * 1000 * 1000;
                default:
                    throw new ArgumentOutOfRangeException(nameof(suffix),
                        "The suffix is not supported. Valid values are [" + string.Join(",", Suffix) + "]");
            }
        }


        public static Result<char, Range> Parse(string input) => Parser.Parse(input);
    }
}

I can't work out how to use Map, or the second argument to Then, properly.

Thanks
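A minimal sketch of how Map and the two-argument Then could fit together here. The second argument to Then is a function that combines the results of both parsers, and Map does the same for several parsers at once. The names ByteRange, RangeSketch, Multiplier and Size are illustrative, the multipliers follow the 1024-based example in the problem statement rather than the 1000-based SuffixMap, and the sketch discards whether each bound is inclusive or exclusive.

```csharp
using Pidgin;
using static Pidgin.Parser;

public class ByteRange
{
    public int Lower { get; set; }
    public int Upper { get; set; }
}

public static class RangeSketch
{
    // Unit suffix mapped to its multiplier, followed by the mandatory 'B'.
    private static readonly Parser<char, int> Multiplier =
        OneOf(
            Char('k').ThenReturn(1024),
            Char('M').ThenReturn(1024 * 1024),
            Char('G').ThenReturn(1024 * 1024 * 1024)
        ).Before(Char('B'));

    // Then's second argument combines the number with the multiplier.
    private static readonly Parser<char, int> Size =
        DecimalNum.Then(Multiplier, (number, multiplier) => number * multiplier);

    // Map combines the two sizes; brackets, comma and spaces are discarded.
    public static readonly Parser<char, ByteRange> Range =
        Map(
            (lower, upper) => new ByteRange { Lower = lower, Upper = upper },
            OneOf('[', '(').Then(Size).Before(Char(',')).Before(SkipWhitespaces),
            Size.Before(OneOf(']', ')')));
}
```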

F# port

Hey Benjamin,

I couldn't find any other way to contact you, that's why I created an issue.

Just wanted to mention that after your NDC talk I got curious what the syntax of Pidgin would look like in F#, so I started to work on a port, which now has enough functionality to implement a simple Json parser. The implementation closely follows the primitive and complex parsers of Pidgin, although a couple of them had to be renamed due to reserved F# keywords (for example I used after instead of then).

Many things are still missing, mainly some of the more finicky parsers (SeparatedAndOptionallyTerminated, etc.), error messages, backtracking, and any other parser state than strings. Also, I started working on it mainly to learn more about F#, so it's not really intended for production use (for which there is already FParsec anyway).

So far what I really liked about doing it in F#:

Pattern matching (with warnings if it's not exhaustive)
The partial application and function composition leads to some really terse syntax
Type inference, mainly not having to specify the argument types and the generic type arguments
Defining mutually recursive parsers is really easy with the and syntax

What I didn't like:

No function overloading, so there is no way to have map with different number of arguments
The current syntax of advancing the parser state (let newState = advance state) suggests that the state is immutable, which is true for a string state but wouldn't be true in the case of a stream (not implemented yet). I'm not sure yet what the nicest way to express this would be.

If you're interested, you can take a look here.

Cheers,
Mark

Behavior of `Labelled` if the wrapped parser consumed input

Hi! Very nice library, thanks for open-sourcing it.

I noticed some odd behavior of Parser.Labelled(), for example, given this parser:

var tupleParser = LetterOrDigit.AtLeastOnceString()
                    .Separated(Char(','))
                    .Between(Char('('), Char(')'))
                    .Labelled("Tuple");

tupleParser.ParseOrThrow("(1,2,!!,3)");

The error message is:

Pidgin.ParseException : Parse error.
    unexpected !
    expected Tuple
    at line 1, col 6

Which is misleading IMHO, since a tuple isn't actually expected at col 6. A possible fix would be to change WithExpectedParser.Parse() as follows:

            internal override InternalResult<T> Parse(ref ParseState<TToken> state)
            {
                state.BeginExpectedTran();
                var result = _parser.Parse(ref state);
                state.EndExpectedTran(commit: result.ConsumedInput);
                if (!result.Success && !result.ConsumedInput)
                {
                    state.AddExpected(_expected);
                }
                return result;
            }

I only just started looking at the Codebase, so I hope I understood the Expected transaction mechanism correctly :) With this change, the message is as if the Labelled wasn't there:

Pidgin.ParseException : Parse error.
    unexpected !
    expected letter or digit
    at line 1, col 6

Maybe a better fix would be to include a sort of stacktrace of Labelleds that enclosed the error, e.g. "Error while parsing Tuple: ...", but this would be a larger change. Not sure if it should go into the error message, the Expecteds, or some new construct.

Add benchmark results to readme.md

Hey there,

Since Pidgin positions itself as the fastest parser combinator library, it would be really useful to see the output of https://github.com/benjamin-hodgson/Pidgin/tree/master/Pidgin.Bench in the readme.md for quick access.

[Question] 'Separated' parser succeeds when input has a trailing separator

I'm working on writing a parser for the Microsoft API filtering language, and as a warmup exercise I'm doing a parser for their very simple sort syntax. The problem I'm running into is that my parser will succeed when there's a trailing separator, e.g. "foo,", or "foo desc, bar,".

Here's my parser. I have to be doing something wrong, because the PropertyAccess parser does fail when there's a trailing dot, but the main SortExpression parse doesn't. Any guidance on what I've screwed up?

public static class SortExpressionParser
{
    internal static readonly Parser<char, char> Comma = Char(',');
    internal static readonly Parser<char, char> Dot = Char('.');

    internal static Parser<char, T> Token<T>(Parser<char, T> parser) =>
            Try(parser).Before(SkipWhitespaces);

    internal static Parser<char, string> Token(string s) => Token(String(s));
    
    internal static readonly Parser<char, string> PropertyName =
        Token(Letter.Then(LetterOrDigit.ManyString(), (head, tail) => head + tail));

    internal static readonly Parser<char, IImmutableList<string>> PropertyAccess =
        PropertyName.Separated(Token(Dot))
            .Select<IImmutableList<string>>(names => names.ToImmutableArray());
            
    internal static readonly Parser<char, SortDirection> SortDirectionModifier =
        OneOf(
            Token("asc").ThenReturn(SortDirection.Ascending),
            Token("desc").ThenReturn(SortDirection.Descending)
        );

    internal static readonly Parser<char, SortDirective> SortStatement =
        Map(
            (propertyAccess, sortDirection) => new SortDirective(
                propertyAccess,
                sortDirection.GetValueOrDefault(SortDirection.Ascending)
            ),
            SharedParsers.PropertyAccess,
            SortDirectionModifier.Optional()
        );

    public static readonly Parser<char, IImmutableList<SortDirective>> SortExpression =
        SortStatement.Separated(Token(Comma))
            .Select<IImmutableList<SortDirective>>(list => list.ToImmutableArray())
            .Before(End);
}

[Question] Is there an easy way to get a breakdown of parsing performance?

What I mean is - get a breakdown of how many times each parser was called, how long each individual parser took, the total time spent for each parser, etc. I tried profiling a parsing run with JetBrains dotTrace but it's, uh, not very helpful since it's just a massive chain of parser calls.

Is there a clever way to at least log how many times each parser is called by injecting a side effect somewhere? I can fill in the other numbers from that.

Any plans to make use of new Span support?

https://msdn.microsoft.com/en-us/magazine/mt814808.aspx

Performance issue in ExpressionParser.Build

We are trying to build an expression parser using Pidgin, based on the provided example code. I have two questions:

Using the following build

expr = ExpressionParser.Build(
    term,
    new[]
    {
        Operator.PostfixChainable(call),
        Operator.Prefix(Neg).And(Operator.Prefix(Complement)).And(Operator.Prefix(UPlus)),
        Operator.InfixL(Multiply).And(Operator.InfixL(Divide)),
        Operator.InfixL(Plus).And(Operator.InfixL(Minus)),
        Operator.InfixL(EqualTo).And(Operator.InfixL(NotEqualTo))
    }
).Labelled("expression");

first-time initialization takes tens of seconds. As you can see, we have only added a few operators. Is this to be expected? Is there a better way to structure this?

We are interested in floating point support. This was discussed in another thread ... any progress?

Thanks in advance.

Mark

Parsing all input?

This may be a stupid question, but I'm looking for something like End() in Sprache and have had no luck finding it.

For example, with the code below:

int num = Parser.Num.ParseOrThrow("1234aa");

This parses 1234 successfully, but I need the parse to fail because of the trailing "aa".
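A minimal sketch of one way to reject trailing input, assuming the installed Pidgin version exposes the end-of-input parser as the `End` property on `Parser<char>` (the same `End` used in the `SortExpression` example elsewhere on this page; older releases may spell it differently). `wholeNumber` is a name introduced here for illustration.

```csharp
using System;
using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;

// Num alone stops after the digits; Before(End) additionally requires end of input.
Parser<char, int> wholeNumber = Num.Before(End);

Console.WriteLine(wholeNumber.ParseOrThrow("1234"));   // 1234

try
{
    wholeNumber.ParseOrThrow("1234aa");
}
catch (ParseException e)
{
    // The trailing "aa" is now reported as a parse error.
    Console.WriteLine(e.Message);
}
```

`Before` keeps the integer result and discards the result of `End`, so the parser's type stays `Parser<char, int>`.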

Parsing Function Call with empty parameters fails

I have been able to create a reasonably complex expression evaluator. The only issue I am having is that when the object text contains a function call with no arguments, parsing fails (it spins inside the DLL). I suspect Between is not working when there are no arguments.

Expression that fails:
TEST()

Expression that works fine:
TEST(X)
TEST(X,X ...)

Parsing logic (directly from your test application)

private static Parser<char, T> Parenthesised<T>(Parser<char, T> parser)
    => parser.Between(Tok("("), Tok(")"));
...
var call = Parenthesised(Rec(() => expr).Separated(Tok(",")))
    .Select<Func<IExpr, IExpr>>(args => method => new Call(method, args.ToImmutableArray()))
    .Labelled("function call");

Unfortunately, I have not been able to debug the DLL, so I can't give you more details.

Thanks in advance.

Binary Parsing - Null Terminated ASCII with Max bytes

I am attempting to parse a field that is a potentially Null Terminated ASCII string with a max of 64 bytes. If there is no null terminator at the 64th byte then I would like to terminate the string automatically.

This is what I have for a Null Terminated ASCII string and it appears to work fine. I just need to find a way to stop parsing past 64 bytes if there is no null terminator. Ideally in an efficient manner.

static Parser<byte, IEnumerable<byte>> NullTerminatedStringBytes = SingleByte.Until(Token(b => b == 0x00).Labelled("Null Terminated"));

        
public static Parser<byte, string> NullTerminatedString = Map((nullString) =>
{
    var buffer = nullString.ToArray();
    var result = Encoding.ASCII.GetString(buffer);
    return result;
}, NullTerminatedStringBytes);

Thoughts?

NotAnyOf

I am trying to match an identifier where all chars except a fixed list of chars are legal. I would like to have a NotAnyOf, or possibly some sort of negation parser where you could write Not(AnyOf('a', 'b', 'c')). Is there any way to do this now without using a Token lambda?
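One option, sketched under the assumption that the installed Pidgin version provides `Parser.AnyCharExcept(params char[])`: it accepts any single character that is not in the given list, which avoids writing a `Token` lambda. The excluded characters and the name `identifier` below are made up for the example.

```csharp
using System;
using Pidgin;
using static Pidgin.Parser;

// Hypothetical rule: an identifier is one or more characters other than these four.
Parser<char, string> identifier =
    AnyCharExcept(' ', ',', '(', ')').AtLeastOnceString();

// Parsing stops at the first excluded character.
Console.WriteLine(identifier.ParseOrThrow("foo_bar(1)"));   // foo_bar
```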

help with wrapper parser that is easier to understand?

I am having trouble understanding these parser builders.
I would like to make a wrapper around Pidgin's parsers that is easier for me to understand and use, but can do the same things.

Can someone help me implement it?

using System;
using System.Collections.Generic;
using System.Text;
using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;

namespace Pidgin_Basic_Tests
{
   
    public class Parser<TToken, T>
    {
        Pidgin.Parser<TToken, T> value;

        Parser<TToken, T> Tag(string name) => value.Labelled(name);

        ParseValue<char, IEnumerable<char>> Any        (        ) => Any ( );            //anything
        ParseValue<char,             char > CharOnce   (char   n) => Char(n);            //requires a n
        ParseValue<char,             char > CharMany   (char[] n) => ???;                //requires one of n
        ParseValue<char,             char > Whitespace (        ) => Parser.Whitespace;  //requires whitespace


        public static implicit operator Parser<TToken, T>(Pidgin.Parser<TToken, T> parser)
        {
            return new Parser<TToken, T>() { value = parser };
        }
    }

    public class ParseValue<TToken, T>
    {
        Pidgin.Parser<TToken, T> value;

        ParseValue   <TToken, T> Tag(string name) => value.Labelled(name);

        ParseOperator<char, char> AtMin        (int times);  //Must occur at least "times" times
        ParseOperator<char, char> AtMax        (int times);  //Must occur at most "times" times
        ParseOperator<char, char> Until<T>     (T       c);  //Must occur until "c"
        ParseOperator<char, char> UntilAnyOf<T>(T[]     c);  //Must occur until any of "c"
        ParseOperator<char, char> UntilAllOf<T>(T[]     c);  //Must occur until all of "c"
        ParseOperator<char, char> UntilEnd();                //Must occur until end

        public static implicit operator ParseValue<TToken, T>(Pidgin.Parser<TToken, T> parser)
        {
            return new ParseValue<TToken, T>() { value = parser };
        }
    }

    public class ParseOperator<TToken, T>
    {
        Pidgin.Parser<TToken, T> value;

        ParseOperator<TToken, T> Tag(string name) => value.Labelled(name);
        
        //And
        Parser<TToken, T> ReqAND(ParseValue value); //required AND 
        Parser<TToken, T> ReqNAND(ParseValue value); //required NAND
        Parser<TToken, T> ReqXAND(ParseValue value);   //required XAND/XNOR 

        Parser<TToken, T> OptAND(ParseValue value);   //optional AND    
        Parser<TToken, T> OptNAND(ParseValue value);   //optional NAND
        Parser<TToken, T> OptXAND(ParseValue value);  //optional XAND/XNOR    

        //Or
        Parser<TToken, T> ReqOR(ParseValue value);    //required OR
        Parser<TToken, T> ReqNOR(ParseValue value);    //required NOR
        Parser<TToken, T> ReqXOR(ParseValue value);   //required XOR/XNAND

        Parser<TToken, T> OptOR(ParseValue value);  //optional OR
        Parser<TToken, T> OptNOR(ParseValue value);   //optional NOR
        Parser<TToken, T> OptXOR(ParseValue value);  //optional XOR/XNAND
        Parser<TToken, T> Result();       //End
        Parser<TToken, T> Result(T v1, T v2);       //End

        public static implicit operator ParseOperator<TToken, T>(Pidgin.Parser<TToken, T> parser)
        {
            return new ParseOperator<TToken, T>() { value = parser };
        }
    }
}

Parsing a non-delimited flat-file fixed-field-length string messages?

What if I don't have a delimiter, only a flat-file-style string of fixed-length fields? Imagine getting data like this from a TCP stream, for example:

"\0\0\0j\0\0\0\vT3A1111 2999BOSH 2100021 399APV 2100022 "

I cannot rely on a delimiter here. The string above represents a message received from a server, with the following layout:

4  byte long message length      ("\0\0\0j") . THIS IS HEX value
4  byte long message id          ("\0\0\0\v"). THIS IS HEX value, the rest of values below are ASCII
1  byte long message type        ("T")
1  byte long message sequence    ("3")
8  byte long car Id              ("A1111   ")  
9  byte long part-1 price        ("     2999")
30 byte long part-1 manufacturer ("BOSH                          ")
9  byte long part#               ("2100021  ")
9  byte long part-2 price        ("      399")
30 byte long part-2 manufacturer ("APV                           ")
9  byte long part#               ("2100022  ")

How can I parse a message like this?
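A minimal sketch of how the first few fields could be read by width alone, assuming `Any`, `Repeat(int)`, `RepeatString(int)` and the five-parser overload of `Map` are available in the installed Pidgin version. It treats each byte as a `char`, as the quoted string does; a real TCP stream would more likely be parsed with `byte` tokens. `int32`, `Field` and `header` are hypothetical names, and the sample input covers only the fields up to the car Id.

```csharp
using System;
using System.Linq;
using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;

// Big-endian 32-bit integer read from four one-byte characters.
Parser<char, int> int32 =
    Any.Repeat(4).Select(cs => cs.Aggregate(0, (acc, c) => (acc << 8) | c));

// Fixed-width ASCII field with its space padding trimmed.
Parser<char, string> Field(int width) =>
    Any.RepeatString(width).Select(s => s.Trim());

// Each parser consumes exactly its field width, so no delimiter is needed.
var header = Map(
    (length, id, type, sequence, carId) => (length, id, type, sequence, carId),
    int32, int32, Any, Any, Field(8));

var h = header.ParseOrThrow("\0\0\0j\0\0\0\vT3A1111   ");
Console.WriteLine($"{h.length} {h.id} {h.type} {h.sequence} [{h.carId}]");   // 106 11 T 3 [A1111]
```

The remaining price, manufacturer and part-number fields would follow the same pattern with `Field(9)` and `Field(30)`.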

Many() usage on a given parser

Hi.

For a given parser class...

    public class SutParser
    {
        public static Parser<char, char> TabOrSpace 
            = Token(c => c == ' ' || c == '\t');

        public static Parser<char, string> TextField
            = Token(c => !char.IsWhiteSpace(c))
                .AtLeastOnceString()
                .Between(TabOrSpace.Many());

        public static Parser<char, IEnumerable<string>> ControlValue = 
                TextField
                .AtLeastOnce()
                .Between(String("~ControlValue"), EndOfLine)
            ;

        public static Parser<char, IEnumerable<IEnumerable<string>>> ControlValues
            = ControlValue.Many();
    }

and the string to be parsed ...

~ControlValue  1 55.7 51.0 46.4 41.8 37.1
~ControlValue  2 50.6 46.4 42.2 38.0 33.8
~ControlValue  3 55.7 51.0 46.4 41.8 37.1
~ControlValue  4 50.6 46.4 42.2 38.0 33.8
~ControlValue  5 77.2 70.7 64.3 57.9 51.4
~ControlValue  6 88.6 81.2 73.8 66.4 59.0
~Key     10
~Contrast 10 32.5 23

I assigned the text above to a string variable named testdata.

The ControlValue parser parses it successfully and returns a single list of strings:

var singleResult = SutParser.ControlValue.ParseOrThrow(testdata);
// singleResult == ["1", "55.7", "51.0", "46.4", "41.8", "37.1"]

Then I tried to extend it to all the lines using ControlValues, but it failed:

var multipleResults = SutParser.ControlValues.ParseOrThrow(testdata);

It complained with:

{Parse error.
    unexpected K
    expected "~ControlValue"
    at line 7, col 2}

How do I make the ControlValues parser parse only up to the ~Key line in the text above?
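A likely cause, offered as a sketch rather than a confirmed diagnosis: `String("~ControlValue")` consumes the leading `~` of `~Key` before it fails, and `Many()` reports an error when its parser fails after consuming input. Wrapping the prefix in `Try` makes that failure backtrack, so `Many()` stops cleanly at the first line that is not a control value. The shortened `testdata` below is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;

Parser<char, char> tabOrSpace = Token(c => c == ' ' || c == '\t');

Parser<char, string> textField =
    Token(c => !char.IsWhiteSpace(c))
        .AtLeastOnceString()
        .Between(tabOrSpace.Many());

// Try(...) rewinds the input if the prefix only partially matches (as in "~Key").
Parser<char, IEnumerable<string>> controlValue =
    textField
        .AtLeastOnce()
        .Between(Try(String("~ControlValue")), EndOfLine);

Parser<char, IEnumerable<IEnumerable<string>>> controlValues = controlValue.Many();

var testdata = "~ControlValue  1 55.7 51.0\n~ControlValue  2 50.6 46.4\n~Key     10\n";
var rows = controlValues.ParseOrThrow(testdata).Select(r => r.ToList()).ToList();

Console.WriteLine(rows.Count);   // 2
```

The `~Key` line is left unconsumed, so a following parser can pick it up.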

[Question] What is the best way to enhance function call expression parser to use an enumeration for function name instead of an expression.

This question is based on the example expression parser.

For the binary operator expression BinaryOp the parser is able to create a BinaryOp object with the BinaryOperatorType enumeration.

Because my use case for Pidgin has a finite number of "function calls", I was trying to design the function call expression so that it holds an enumeration of the function call types, instead of an IExpr, to represent which function is being called.

For example something like this:

public enum FunctionCallType
{
    Contains,   // function name string to match it with: "contains"
    StartsWith, // function name string to match it with: "startswith"
    EndsWith    // function name string to match it with: "endswith"
}

public class Call : IExpr
{
    public FunctionCallType Type { get; }
    public ImmutableArray<IExpr> Arguments { get; }

    public Call(FunctionCallType type, ImmutableArray<IExpr> arguments)
    {
        Type = type;
        Arguments = arguments;
    }
}

I spent hours trying to come up with something that would give me a "warm fuzzy". I do have a design working, but it is not what I have above and it seems dirty to me. I am curious how this could be done in an easy/clean way based on the Call design above.
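A minimal sketch of the name-matching half of this, using only combinators that appear elsewhere on this page (`OneOf`, `String`, `Try`, `ThenReturn`). `functionName` is a hypothetical name; the argument list and the `Call` constructor are left out.

```csharp
using System;
using Pidgin;
using static Pidgin.Parser;

// Each alternative maps a fixed function name to its enum member.
// Try(...) lets OneOf move on to the next alternative after a partial match.
Parser<char, FunctionCallType> functionName = OneOf(
    Try(String("contains")).ThenReturn(FunctionCallType.Contains),
    Try(String("startswith")).ThenReturn(FunctionCallType.StartsWith),
    Try(String("endswith")).ThenReturn(FunctionCallType.EndsWith)
);

Console.WriteLine(functionName.ParseOrThrow("startswith(name, 'a')"));   // StartsWith

public enum FunctionCallType
{
    Contains,
    StartsWith,
    EndsWith
}
```

The resulting `FunctionCallType` could then be combined with the parenthesised argument list, for example via `Map`, to build the `Call` shown above.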

Ignore whitespace, either by default or setting

I have been finding it arduous to constantly add SkipWhitespaces, via the Before or Between methods, to various parsers. Would it make sense for char-based parsing to ignore whitespace automatically, either by default or via a setting, to simplify parser creation?

Access Start/End SourcePos after successful parse

Is there a way to access the source begin/end positions?

I was hoping that Result<TToken, T> would reveal something, but I don't see anything. If not, is this planned?

The use case is that I would be able to highlight parts of the parsed source while processing the resulting tree.

I suppose the ideal API would expose the starting/ending SourcePos from a successful result instance.

Apologies if you've addressed this question before; I didn't find any similar questions.
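A minimal sketch of one way to capture positions, assuming `Parser<char>.CurrentPos` (a parser that returns the current `SourcePos` without consuming input) is available in the installed version. `WithSpan` is a hypothetical helper name, not part of Pidgin.

```csharp
using System;
using Pidgin;
using static Pidgin.Parser;
using static Pidgin.Parser<char>;

// Records the position before and after the wrapped parser runs.
Parser<char, (SourcePos Start, T Value, SourcePos End)> WithSpan<T>(Parser<char, T> parser) =>
    Map((start, value, end) => (Start: start, Value: value, End: end),
        CurrentPos, parser, CurrentPos);

var word = WithSpan(Letter.AtLeastOnceString()).ParseOrThrow("hello");
Console.WriteLine($"{word.Value}: col {word.Start.Col} to col {word.End.Col}");
```

Wrapping individual sub-parsers this way lets each node of the resulting tree carry its own start and end position for highlighting.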

StackOverflowException in a specific scenario.

Hi Benjamin,

If a debug build of Pidgin.dll is linked with a console app targeting .NET Framework 4.x, the app is terminated due to StackOverflowException.

It looks like given the described configuration, CLR tries to eagerly initialize Parser.Optional._returnNothing and gets in an endless loop. Please take a look at the screenshot for an example.

Here is a minimal solution to reproduce the issue.
ParserOptionalStackoverflow.zip

I took the liberty of including binary files (Pidgin.dll and friends) within the archive.
If you prefer to remove them and rebuild yourself, here are the build commands.

dotnet restore
dotnet build --configuration Debug

Curiously enough:

If a release build of Pidgin.dll is linked to the same console app, the bug does not reproduce.
If a debug build of Pidgin.dll is linked to a .netcoreapp, the bug does not reproduce.

One way to correct this issue might be by preventing eager initialization in the following way

    public abstract partial class Parser<TToken, T>
    {
        private static Parser<TToken, Maybe<T>> _returnNothing;
        private static Parser<TToken, Maybe<T>> returnNothing => 
            _returnNothing ?? (_returnNothing = Parser<TToken>.Return(Maybe.Nothing<T>()));

        // Rest of the code left out for clarity
    }

The motivation to use a debug build of Pidgin is, while learning the library, it gives a way to figure out what is going on inside of Pidgin if something does not work as I expect it to.
By the way, thank you for a great talk -- https://www.youtube.com/watch?v=lsUgwfK9XIM

Thank you.
Mykola.

Issue with parsing for one character and whitespace

Hi,

I have the following test case:

        [TestMethod]
        public void Bug()
        {
            var mins= OneOf(Tok("minutes"), Tok("mins"), Tok("min")).Trace(x => $"mins={x}");
            var meters=OneOf(Tok("meters"), CIString("m").Before(WhitespaceString)).Trace(x => $"meters={x}");

            meters.ParseOrThrow("min Run");
        }

I would expect the meters parser to throw, because it should be looking for "m " and failing. Am I doing something wrong here?
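A possible explanation, sketched under two assumptions: that `WhitespaceString` matches zero or more whitespace characters, and that `Tok` (not shown in the issue) is defined as below. Under those assumptions `CIString("m").Before(WhitespaceString)` accepts a bare "m", so "min Run" succeeds. Requiring at least one whitespace character gives the expected failure.

```csharp
using System;
using Pidgin;
using static Pidgin.Parser;

// Tok is not shown in the issue; this definition is an assumption.
Parser<char, string> Tok(string s) => Try(String(s)).Before(SkipWhitespaces);

// WhitespaceString also matches the empty string, so "m" followed directly by "i" passes.
var lenient = OneOf(Tok("meters"), CIString("m").Before(WhitespaceString));

// At least one whitespace character must follow the "m".
var strict = OneOf(Tok("meters"), CIString("m").Before(Whitespace.AtLeastOnceString()));

Console.WriteLine(lenient.Parse("min Run").Success);   // True
Console.WriteLine(strict.Parse("min Run").Success);    // False
Console.WriteLine(strict.Parse("m Run").Success);      // True
```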