dashbitco / nimble_parsec


A simple and fast library for text-based parser combinators

Elixir 100.00%

binary elixir parser-combinator


nimble_parsec's Issues

Introduce eof combinator

So we don't need to explicitly check that we are done every time.

Add `ok` and `error` records for matching on the parser result

Add a state system for context-sensitive languages

The user should have access to the full tuple that parser combinators operate on, and that tuple should include a state element that the user can get and set. Something like this:

{:ok, acc, stack, state, rest, line, column}

If the user is meant to pattern match on this tuple directly, the tuple should probably be a record, which is easier to pattern match on.

How to parse escaped string containing \" characters?

I would like to parse something like the following string:

'<foo="bar",baz="xxx">'

into the following list:

["foo", "bar", "baz", "xxx"]

This works fine, as long as the "bar" and "baz" do not contain escaped strings. The following example does not work:

'<foo="baaa\"rrrrr",baz="xxx\"substring\"xxx">'

because it contains a \" and my combinator is defined as utf8_string([not: ?"], min: 1).

Is there a recommended way of parsing strings containing escaped characters, specifically \" characters?
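One possible approach, sketched here rather than taken from the thread: try the two-character sequence \" before falling back to any character that is not a quote. The module name AttrParser, the parser name attrs, and the lowercase-only key format are assumptions made for illustration; the snippet also assumes Elixir 1.12+ (for Mix.install/1) and a 1.x release of nimble_parsec.

```elixir
# Assumption: Elixir 1.12+ and nimble_parsec 1.x are available.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule AttrParser do
  import NimbleParsec

  # A quoted value: an escaped quote (\") is replaced by a plain quote,
  # any other character except a quote is taken as-is.
  quoted =
    ignore(ascii_char([?"]))
    |> repeat(
      choice([
        string(~S(\")) |> replace(?"),
        utf8_char(not: ?")
      ])
    )
    |> ignore(ascii_char([?"]))
    |> reduce({List, :to_string, []})

  # key="value" (keys are assumed to be lowercase ASCII letters)
  pair =
    ascii_string([?a..?z], min: 1)
    |> ignore(string("="))
    |> concat(quoted)

  defparsec :attrs,
            ignore(string("<"))
            |> concat(pair)
            |> repeat(ignore(string(",")) |> concat(pair))
            |> ignore(string(">"))
end

{:ok, result, "", _context, _position, _offset} =
  AttrParser.attrs(~S(<foo="baaa\"rrrrr",baz="xxx\"substring\"xxx">))

IO.inspect(result)
```

Because choice/2 tries its alternatives in order, the escaped form has to come first; otherwise the backslash would be consumed as an ordinary character and the following quote would end the value.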

As discussed here, NimbleParsec should have an expect combinator for better error reporting. I suggest something like this: expect(previous \\ [], expected). This combinator would raise an error if expected didn't match.

My use case: I'm currently working on a parser for the ICU Message Format. Currently, NimbleParsec backtracks so much that I get rather useless error messages. Suppose we have something like this:

{variable, plural, one {...} two {...} abcd {...}}

The above is invalid, because abcd is not a supported option for the plural argument type. As soon as I have parsed "{variable, plural,", I know that abcd won't be supported, and I'd like to emit an error as soon as I find the abcd option. Currently there is no way to do this: NimbleParsec just backtracks and fails with an error message that doesn't really help anyone (it actually says it expects an end of string at character 0).

An expect combinator would allow me to emit the correct error message.

Remove compiler warnings from nimble_parsec.compile

mix nimble_parsec.compile is great for removing nimble_parsec as a dependency for a package (as is currently the case for ex_cldr). However compiling the resultant code can produce a lot of compiler unused variable warnings which I think is not a good thing for a package if it can be avoided.

I have explored a little and can remove many of the warnings by replacing in NimbleParsec.Compiler:

  defp build_proxy_to(name, next, n) do
    args = quote(do: [rest, acc, stack, context, line, offset])
    [_, acc, stack, _, _, _] = args

    body =
      quote do
        unquote(build_acc_depth(n, acc, stack)) = stack
        unquote(next)(rest, acc, stack, context, line, offset)
      end

    {name, args, true, body}
  end

with

  defp build_proxy_to(name, next, n) do
    args = quote(do: [rest, acc, stack, context, line, offset])
    [_, acc, stack, _, _, _] = args

    body =
      quote do
        # Is removed in a compiler optimisation pass
        _ = {acc, stack}

        unquote(build_acc_depth(n, acc, stack)) = stack
        unquote(next)(rest, acc, stack, context, line, offset)
      end

    {name, args, true, body}
  end

However, there is still one case I haven't nailed down yet, which repeatedly produces:

warning: variable "context" is unused
  lib/cldr/language_tag/rfc5646_parser.ex:4116

warning: variable "line" is unused
  lib/cldr/language_tag/rfc5646_parser.ex:4116

warning: variable "offset" is unused
  lib/cldr/language_tag/rfc5646_parser.ex:4116

warning: variable "rest" is unused
  lib/cldr/language_tag/rfc5646_parser.ex:4116

These warnings come from code generated with the following shape:

  defp language_tag__35(
         inner_rest,
         inner_acc,
         [{rest, acc, context, line, offset} | stack],
         inner_context,
         inner_line,
         inner_offset
       ) do
    language_tag__27(
      inner_rest,
      [],
      [{inner_rest, inner_acc ++ acc, inner_context, inner_line, inner_offset} | stack],
      inner_context,
      inner_line,
      inner_offset
    )
  end

I am happy to make a PR for this, but could also use some guidance on where to hunt down this last case since the pattern [{rest, acc, context, line, offset} | stack] is used in 5 places for different combinator compilations.

Generic parser combinators that can operate on a list/stream of anything instead of only text

This takes away from the nimble brand, but I think Elixir lacks a good parser combinator library in the style of NimbleParsec. Most libraries are specific to strings (which is weird because they don't actually leverage the BEAM's pattern matching capabilities). The most generic one I know is ExSpirit, and although it's very extensible it's not as pleasant to use.

Would you like to add generic combinators to NimbleParsec or would you prefer that as a separate package?

Common Combinators

Many parser combinator libraries maintain a set of the common combinators, like whitespace and word. Would you accept pull requests to add those kinds of combinators? Usually they are organized by some kind of domain, like NimbleParsec.Text and NimbleParsec.Number.

Large generated code size, long compilation time

Over the last week or so I've been experimenting with using nimble_parsec as the basis for a new GraphQL parser for the absinthe project.

What I've built so far is here: https://github.com/bruce/absinthe/blob/nimble-parsec-experiment/lib/absinthe/parser.ex

At this point I have the rough (and probably very naive) implementation of (almost) the entire GraphQL specification built out as nimble_parsec combinators. The experience has been great; I've really enjoyed using the package and feel like this is working towards much more maintainable, enjoyable parser code than our current leex/yecc implementation.

I've gotten it to the point that it compiles, although it doesn't actually do a whole lot yet (I'm, for example, only using map/traverse in a few places to generate a few of the resulting structs we'll want eventually, and I'm not skipping ignored input in most of the combinators yet). I've exposed a few defparsecs for a few tests (and some necessary recursion), and it's feeling pretty good.

Except for the compile time (a few minutes on my new XPS 13) and the size of the resulting .beam, which is... 23MB. (Reducing the defparsec usage to those strictly needed for recursion and the main entry point reduces the .beam filesize to 22MB.)

It's quite possible that I'm doing something boneheaded with the way this is structured/built at this point, but on the off chance I'm not [completely] and this can help serve as a sample of a larger grammar to track down an issue (with generation or documentation), I'd love it if someone could take a peek at what I've built so far.

$ elixir --version
Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]

Elixir 1.6.4 (compiled with OTP 20)

Long utf8_strings never seem to compile.

If I do the following:

defmodule MarcParser do
  import NimbleParsec

  defparsec :parse_marc,
    utf8_string([], 24)
end

and run mix compile, it never finishes.

Track byte offset

Let's keep {line, last_line_byte} and the current_byte.

Help creating a macro that uses defparsec

I'm trying to write a super-parser, something like this:

  defmacro named_parser(name, parser, tag) do
    defparsec(
      name,
      symbol
      |> map({:extract, []})
      |> ignore(spaces)
      |> optional(int_t |> tag(:tag))
      |> ignore(spaces)
      |> ignore(ascii_char([?:]))
      |> ignore(spaces1)
      |> concat(parser)
      |> tag(tag)
      |> map({:data_field_prettify, []})
    )
  end

but it doesn't work.
So I tried this:

  defmacro named_parser(name, parser, tag) do
    quote do
      defparsec(
        unquote(name),
        symbol
        |> map({:extract, []})
        |> ignore(spaces)
        |> optional(int_t |> tag(:tag))
        |> ignore(spaces)
        |> ignore(ascii_char([?:]))
        |> ignore(spaces1)
        |> concat(unquote(parser))
        |> tag(unquote(tag))
        |> map({:data_field_prettify, []})
      )
    end
  end

but I get this error:

== Compilation error in file lib/rulec/rulec_parser.ex ==
** (CompileError) lib/rulec/rulec_parser.ex:456: undefined function symbol/0
    (stdlib) lists.erl:1354: :lists.mapfoldl/3
    (stdlib) lists.erl:1354: :lists.mapfoldl/3
    (elixir) expanding macro: Kernel.|>/2

Any suggestion?

Monadic bind

Hi, might be boneheaded but can't see any way to do a monadic bind. I'm writing a Bencoding library and e.g. for byte strings, it would be useful to parse the length and then construct a parser that only parses that number of characters.

defparsec should define the main function first

Currently, if I have something like this:

  @impl Makeup.Lexer
  defparsec :root,
    root_combinator

I get this:

warning: function root__0/6 is private, @impl attribute is always discarded for private functions/macros
  test/lexer/fixtures/token_lexer.exs:31

It looks like this happens because nimble_parsec doesn't generate the def root(...) clause first, generating some helper function clauses instead. This makes it hard to annotate the function with docs or @impl attributes.

defparsec/3: debug: true and inline: true at the same time causes a compile error

Options debug: true and inline: true at the same time in defparsec/3 causes the compile error:

== Compilation error in file lib/my_parser.ex ==
** (ArgumentError) cannot convert the given list to a string.
To be converted to a string, a list must contain only:
  * strings
  * integers representing Unicode codepoints
  * or a list containing one of these three elements
Please check the given list or call inspect/1 to get the list representation, got:
[datetime__2: 6]
    (elixir) lib/list.ex:826: List.to_string/1
    lib/nimble_parsec/recorder.ex:66: NimbleParsec.Recorder.format_defs/3
    lib/nimble_parsec/recorder.ex:27: NimbleParsec.Recorder.record/6
    lib/my_parser.ex:19: (module)

Is this problem related to the note about defparsec/3 in the documentation?

It (:inline) is disabled by default because of a bug in Elixir v1.5 and v1.6 where unused functions that are inlined cause a compilation error

Forward declaration of parser

Reading the docs:
https://github.com/plataformatec/nimble_parsec/blob/8005d1439e2479dcbcc13cb8c7bbd0fb833af875/lib/nimble_parsec.ex#L36

it seems it is not possible to call a "normal" function from defparsec.
Following the suggestion in the docs, I started using module variables, but this seems to be limited.
For example, I'm not able to represent mutually recursive parsers, something like:

    lambda_t =
      ignore(string("fn"))
      |> ignore(ascii_char([?(]))
      |> concat(optional(
        symbol
        |> repeat(
          ignore(ascii_char([?,]))
          |> concat(symbol)
          )
        |> tag(:parameters)
      ))
      |> ignore(ascii_char([?)]))
      |> ignore(ascii_char([?{]))
      |> concat(
          statements
          |> tag(:body)
      )
      |> ignore(ascii_char([?}]))
      |> tag(:lambda_t)

    statements = choice([basic_types, lambda_t]) ...

statements could contain a lambda_t, and in order to use it we would have to declare it before lambda_t, so this approach is not easy to follow.

Any idea or suggestion on how to achieve this?
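A hedged sketch of one way to express mutual recursion, assuming a nimble_parsec release that provides parsec/2 and defcombinatorp/3 (the 1.x series does): name each rule and refer to it with parsec(:name) instead of a module variable, so a rule can be referenced before it is defined. The toy grammar below (integers and parenthesised, space-separated lists) and the module name Nested are illustrative assumptions, not the grammar from the question.

```elixir
# Assumption: Elixir 1.12+ and nimble_parsec 1.x are available.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule Nested do
  import NimbleParsec

  # value refers to list, which is only defined further down.
  defcombinatorp :value,
                 choice([
                   integer(min: 1),
                   parsec(:list)
                 ])

  # list refers back to value, closing the cycle.
  defcombinatorp :list,
                 ignore(string("("))
                 |> repeat(parsec(:value) |> optional(ignore(string(" "))))
                 |> ignore(string(")"))
                 |> wrap()

  defparsec :parse, parsec(:value) |> eos()
end

{:ok, result, "", _context, _position, _offset} = Nested.parse("(1 (2 3))")
IO.inspect(result)
```

parsec(:name) compiles to a call to the named rule, which is why the order of the definitions inside the module does not matter.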

Public API for code generation

I'd like to ship a parser with a Hex package but want to avoid runtime dependency on NimbleParsec. Another use case recently mentioned is to build a parser and ship it as part of Elixir.

Here's a proof-of-concept project that does this:

Is there/could there be a public API that makes such code generation easier?

Using `tag` in conjunction with `traverse` causes compilation error

I'm building a basic signed integer combinator. This (likely naive) attempt seems to work fine:

  int_value =
    optional(ascii_char([?-]))
    |> integer(min: 1)
    |> traverse({:sign_int_value, []})

  defp sign_int_value(_rest, [int, _neg], context, _, _) do
    {[int * -1], context}
  end
  defp sign_int_value(_rest, res, context, _, _) do
    {res, context}
  end

If I try to tag the result, however, by turning it into:

  int_value =
    optional(ascii_char([?-]))
    |> integer(min: 1)
    |> traverse({:sign_int_value, []})
    |> tag(:int_value)

I see an error:

undefined function when/2

It's reported on the line in my parser module that I'm using defparsec. Using debug: true, here's what's generated:

defp __int_value____0(rest, acc, stack, context, line, offset) do
  __int_value____1(rest, [], [acc | stack], context, line, offset)
end

defp __int_value____1(<<x0::integer, rest::binary>>, acc, stack, context, comb__line, comb__offset) when x0 === 45 do
  __int_value____2(rest, [x0] ++ acc, stack, context, comb__line, comb__offset + 1)
end

defp __int_value____1(rest, acc, stack, context, line, offset) do
  __int_value____2(rest, acc, stack, context, line, offset)
end

defp __int_value____2(rest, acc, stack, context, line, offset) do
  __int_value____3(rest, [], [acc | stack], context, line, offset)
end

defp __int_value____3(<<x0::integer, rest::binary>>, acc, stack, context, comb__line, comb__offset) when x0 >= 48 and x0 <= 57 do
  __int_value____4(rest, [(x0 - 48) * 1] ++ acc, stack, context, comb__line, comb__offset + 1)
end

defp __int_value____3(rest, _acc, _stack, context, line, offset) do
  {:error, "expected byte in the range ?0..?9", rest, context, line, offset}
end

defp __int_value____4(<<x0::integer, rest::binary>>, acc, stack, context, comb__line, comb__offset) when x0 >= 48 and x0 <= 57 do
  __int_value____6(rest, [x0] ++ acc, stack, context, comb__line, comb__offset + 1)
end

defp __int_value____4(rest, acc, stack, context, line, offset) do
  __int_value____5(rest, acc, stack, context, line, offset)
end

defp __int_value____6(rest, acc, stack, context, line, offset) do
  __int_value____4(rest, acc, stack, context, line, offset)
end

defp __int_value____5(rest, user_acc, [acc | stack], context, line, offset) do
  __int_value____7(rest, (
  [head | tail] = :lists.reverse(user_acc)
  [:lists.foldl(fn x, acc -> x - 48 + acc * 10 end, head, tail)]
) ++ acc, stack, context, line, offset)
end

defp __int_value____7(rest, user_acc, [acc | stack], context, line, offset) do
  case(with({acc, context} when is_list(acc) <- sign_int_value(rest, user_acc, context, line, offset), {acc, context} when is_list(acc) <- {[int_value: :lists.reverse(acc)], context}) do
  {acc, context} when is_list(acc)
end) do
  {user_acc, context} when is_list(user_acc) ->
    __int_value____8(rest, user_acc ++ acc, stack, context, line, offset)
  {:error, reason} ->
    {:error, reason, rest, context, line, offset}
end
end

defp __int_value____8(rest, acc, _stack, context, line, offset) do
  {:ok, acc, rest, context, line, offset}
end

I don't immediately see the use of the mythical when/2 that's causing the issue, but I'd guess it's somewhere in __int_value____7/6 in the midst of the case + with complexities.

Now, it's possible I'm missing something in my reading of the docs with regard to the compatibility of tag and traverse, but if they are incompatible for some reason, it seems to me that it's still worth reporting this bad code generation.

(It does appear I can just set the tag as part of the return value from traverse as a temporary workaround, but that feels like giving the sign_int_value too much responsibility.)

I'm submitting this issue on the not-so-unlikely chance someone knows exactly what's going on and can fix it faster. Otherwise, I am happy to dig in deeper tomorrow, and learn more about the package in the process (which I'm finding a real joy to use!).

Feature: Incremental Parsing

It would be nice to be able to stream data into nimble_parsec as it arrives over the network.

Current Support: None (as far as I can tell)

Things to consider:

Not every user requires this and supporting it where it's not needed would have performance implications, so it should probably become a defparsec option
Doneness is not always decidable in incremental mode

Introduce Regex combinator

Hi everyone,

The idea of a regex combinator is to make it easier to implement rules that are naturally expressed as a regex, such as the names of XML elements, which follow strict naming rules.

As an example, Scala has a Regex parser, which makes parsing names of XML elements as easy as [a-zA-Z_:][a-zA-Z0-9\.-_:]*.

I'm not sure if it's viable to implement this, since we cannot pattern match on regexes on the BEAM, but I think it is worthwhile to consider.

between and sep_by combinators

Parsec has some handy combinators between and sepBy.

Having these in the package should make it easy to parse bracketed / delimited text, eg:

array_of_ints = between(string("["), string("]"), sep_by(",",  integer()))

or perhaps with the order of arguments reversed:

array_of_ints = 
  integer()
  |> sep_by(",")
  |> between(string("["), string("]"))
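Until such combinators exist, the same shape can be sketched with combinators the library already has (ignore, repeat, optional, concat). The module name ArrayParser is an assumption for illustration, and the snippet assumes Elixir 1.12+ and nimble_parsec 1.x.

```elixir
# Assumption: Elixir 1.12+ and nimble_parsec 1.x are available.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule ArrayParser do
  import NimbleParsec

  int = integer(min: 1)

  # "sep_by"-style: zero or more integers separated by commas.
  ints = optional(int |> repeat(ignore(string(",")) |> concat(int)))

  # "between"-style: the separated items enclosed in brackets.
  defparsec :array_of_ints,
            ignore(string("["))
            |> concat(ints)
            |> ignore(string("]"))
end

{:ok, result, "", _context, _position, _offset} = ArrayParser.array_of_ints("[1,22,333]")
IO.inspect(result)
```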

Feature: Add sized_binary(combinator, size_combinator)

It reads size_combinator from the input and then it reads N bytes from the binary. For example:

sized_binary(combinator, integer(min: 1))

We will also add:

sized_binary(combinator, bytes(4))

The combinator must always return a list with one element. This means this is possible:

sized_binary(combinator, integer(min: 1) |> ignore(string(":")))

Customize doc of generated function?

I'd like to customize the doc of the entry function generated by defparsec, so I can include some examples of things it can and cannot parse. Is there a way to do this?

Error is not returned when `parsec`/`repeat` is used

Hello,

I've recently been working on parsing https://projectfluent.org/ (details and a failing spec are below). I'm trying to improve error messages and add more validations, but I'm struggling with returning errors.

Failing spec:

  describe "errors from combinators" do
    defcombinatorp(:errors,
      empty()
      |> post_traverse(:set_error)
    )

    defcombinatorp(:function_reference,
      repeat(
        empty()
        |> parsec(:errors)
      )
    )

    defparsec :parse_function_reference, parsec(:function_reference)

    defp set_error(_rest, _, _context, _line, _offset) do
      {:error, "something wrong"}
    end

    test "returns ok/error" do
      assert parse_function_reference("") == {:error, "something wrong", "", %{}, {1, 0}, 0}
    end
  end

  1) test errors from combinators returns ok/error (NimbleParsec.IntegrationTest)
     test/integration_test.exs:121
     Assertion with == failed
     code:  assert parse_function_reference("") == {:error, "something wrong", "", %{}, {1, 0}, 0}
     left:  {:ok, [], "", %{}, {1, 0}, 0}
     right: {:error, "something wrong", "", %{}, {1, 0}, 0}
     stacktrace:
       test/integration_test.exs:122: (test)

It will pass if I either:

Remove repeat in function_reference

Switch parsec(:errors) to post_traverse(:set_error).

Is there any way to deal with that and ensure that errors are returned or is this expected behavior that I can deal with somehow differently? Of course, my combinators are more complex than this example.

Thank you for any help.

Make an exact parser

I am using the following code

  import NimbleParsec
  # Type ::= "int" "[" "]" | "boolean" | "int" | Identifier
  array_int = string("int []") |> label("int[]")
  int = string("int") |> label("int")
  bool = string("boolean") |> label("boolean")

  defparsec(:types, choice([array_int, int, bool]))

When I test it using int it says ok, but when I test it using intd it still says ok. I want to match the exact word int and reject any variation like intd, intuiw, intc... How can I achieve this?
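One way this could be handled, assuming the whole input is supposed to be a single type name: append eos() so the parser only succeeds when nothing is left over. The module name TypeParser is an assumption; the snippet also assumes Elixir 1.12+ and nimble_parsec 1.x.

```elixir
# Assumption: Elixir 1.12+ and nimble_parsec 1.x are available.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule TypeParser do
  import NimbleParsec

  # Type ::= "int" "[" "]" | "boolean" | "int" | Identifier
  array_int = string("int []") |> label("int[]")
  int = string("int") |> label("int")
  bool = string("boolean") |> label("boolean")

  # eos() fails unless the input has been consumed completely.
  defparsec :types, choice([array_int, int, bool]) |> eos()
end

IO.inspect(TypeParser.types("int"))
IO.inspect(TypeParser.types("intd"))
```

If the type name is embedded in a larger input, eos() is too strict; a negative lookahead on identifier characters, for example lookahead_not(ascii_char([?a..?z])) after each keyword, may be a better fit there.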

Compiling generates unused variable warnings

Nimble Parsec Version: 0.5.3
Elixir Version: 1.10.0

Today I made a seemingly simple change to the cron parser in oban. Here is the complete diff:

+  whitespace = ascii_string([?\s, ?\t], min: 1)
+
   defparsec(
     :cron,
     minutes
-    |> ignore(string(" "))
+    |> ignore(whitespace)
     |> concat(hours)
-    |> ignore(string(" "))
+    |> ignore(whitespace)
     |> concat(days)
-    |> ignore(string(" "))
+    |> ignore(whitespace)
     |> concat(months)
-    |> ignore(string(" "))
+    |> ignore(whitespace)
     |> concat(weekdays)
   )

The new parser works exactly as expected, but compiling it generates this list of unused variable warnings:

warning: variable "user_acc" is unused (if the variable is not meant to be used, prefix it with an underscore)
  lib/oban/crontab/parser.ex:797: Oban.Crontab.Parser.cron__105/6

warning: variable "user_acc" is unused (if the variable is not meant to be used, prefix it with an underscore)
  lib/oban/crontab/parser.ex:801: Oban.Crontab.Parser.cron__107/6

warning: variable "user_acc" is unused (if the variable is not meant to be used, prefix it with an underscore)
  lib/oban/crontab/parser.ex:1558: Oban.Crontab.Parser.cron__213/6

warning: variable "user_acc" is unused (if the variable is not meant to be used, prefix it with an underscore)
  lib/oban/crontab/parser.ex:1562: Oban.Crontab.Parser.cron__215/6

warning: variable "user_acc" is unused (if the variable is not meant to be used, prefix it with an underscore)
  lib/oban/crontab/parser.ex:2319: Oban.Crontab.Parser.cron__321/6

warning: variable "user_acc" is unused (if the variable is not meant to be used, prefix it with an underscore)
  lib/oban/crontab/parser.ex:2323: Oban.Crontab.Parser.cron__323/6

warning: variable "user_acc" is unused (if the variable is not meant to be used, prefix it with an underscore)
  lib/oban/crontab/parser.ex:3266: Oban.Crontab.Parser.cron__443/6

warning: variable "user_acc" is unused (if the variable is not meant to be used, prefix it with an underscore)
  lib/oban/crontab/parser.ex:3270: Oban.Crontab.Parser.cron__445/6

The function definitions are rather non-descript. Here is a sample from the warning at line 797:

  defp cron__105(rest, user_acc, [acc | stack], context, line, offset) do
    cron__107(rest, acc, stack, context, line, offset)
  end

In this particular instance I commit the compiled version, so I'm able to manually fix the deprecation warnings. I expect that the compiled code doesn't have any compilation errors, though.

Dialyzer errors

Environment

Elixir 1.8.1
OTP 21
nimble_parsec 0.5

Issue description

I am struggling with dialyzer (as usual) and it appears all my combinators that use unwrap_and_tag/2 generate the following error:

lib/cldr/language_tag/rfc5646_grammar.ex:81:call_without_opaque
The call NimbleParsec.unwrap_and_tag('Elixir.NimbleParsec':t(),'script') does not have an opaque term of type 'Elixir.NimbleParsec':t() in 2nd.

Example code

The relevant combinator is below. It is one of many examples and it appears that all calls to unwrap_and_tag/2 cause the error.

  def script do
    alpha4()
    |> unwrap_and_tag(:script)
    |> label("a script id of four alphabetic character")
  end

Is it a typing error, or a misunderstanding on my side?

Feature request: easy way to match string in case insensitive fashion

https://tools.ietf.org/html/rfc3501 requires a lot of case insensitive string matching for keywords like "INBOX", "ALL", "BODY", "FROM". And all strings defined in ABNF are case insensitive by default (https://tools.ietf.org/html/rfc5234#page-5). Is there an easy way to do case insensitive string matching that I've been missing? If there isn't one, is it possible to add this feature?

Thanks a lot.
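A minimal sketch of a user-land helper, not a library feature: build one ascii_char per character that accepts both the lower and upper case form, then join the matched characters back into a string. The names AnycaseHelpers, anycase_string and ImapKeywords are hypothetical, the helper only handles ASCII, and the snippet assumes Elixir 1.12+ and nimble_parsec 1.x.

```elixir
# Assumption: Elixir 1.12+ and nimble_parsec 1.x are available.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule AnycaseHelpers do
  import NimbleParsec

  # Hypothetical helper: matches `word` regardless of ASCII letter case.
  def anycase_string(word) do
    word
    |> String.to_charlist()
    |> Enum.reduce(empty(), fn char, acc ->
      <<lower>> = String.downcase(<<char>>)
      <<upper>> = String.upcase(<<char>>)
      # Non-letters have identical lower/upper forms, hence Enum.uniq/1.
      ascii_char(acc, Enum.uniq([lower, upper]))
    end)
    |> reduce({List, :to_string, []})
  end
end

defmodule ImapKeywords do
  import NimbleParsec
  import AnycaseHelpers

  defparsec :inbox, anycase_string("INBOX")
end

IO.inspect(ImapKeywords.inbox("InBox"))
```

The helper lives in its own module because functions called while building a combinator must already be compiled when defparsec runs. The result keeps the casing found in the input; adding replace/3 on top would normalise it if that is preferred.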

Lookahead parsers

I think NimbleParsec doesn't currently support positive or negative lookahead. I'm not using them much at the moment, but they're useful sometimes.

feature: possibly empty sequence of iterations of A

Hi,

a helper function that matches a possibly empty sequence of iterations of a combinator would be useful. I wrote a simple implementation.

defmodule LexerHelper do
  import NimbleParsec

  def possible(comb, to_poss) do
    comb
    |> repeat(to_poss)
    |> lookahead_not(to_poss)
  end
end

digit = ascii_char [?0..?9]
num = digit
    |> possible(
      concat(
        optional(ascii_char([?_])),
        digit
      )
    )
defparsec :num, num

iex(1)> Lexer.num "123"
{:ok, '123', "", %{}, {1, 0}, 3}
iex(2)> Lexer.num "123_234"
{:ok, '123_234', "", %{}, {1, 0}, 7}
iex(3)> Lexer.num "123_234_"
{:ok, '123_234', "_", %{}, {1, 0}, 7}
iex(4)> Lexer.num "123_234 34"
{:ok, '123_234', " 34", %{}, {1, 0}, 7}

Remove literal in favor of just passing binaries

"Position" has changed, but generated documentation was not changed?

The generated doc for an entry point states

Returns {:ok, [token], rest, context, line, byte_offset} or {:error, reason,
rest, context, line, byte_offset}.

which is not correct, because in the position of line I see a tuple. I suspect it is {line where the accumulated result begins, column where it begins}?

Parsing "123" as integer

I've defined a parser in this way:

int_t = integer(min: 1) |> tag(:int_t)
defparsec :parse_inttype, int_t

When I try to use it in this way:

RuleC.parse_inttype("123")

I get this:

{:ok, [int_t: '{'], "", %{}, {1, 0}, 3}

instead of:

{:ok, [int_t: 123], "", %{}, {1, 0}, 3}

But other cases work:

> RuleC.parse_inttype("213")
{:ok, [int_t: [213]], "", %{}, {1, 0}, 3}

I'm not an Elixir expert, but it seems that the internal type of the integer is a char instead of an int.
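A likely explanation, offered as a reading of the output above rather than something confirmed in the thread: Elixir prints a list of integers as a charlist when every element is a printable ASCII codepoint, and 123 is the codepoint of "{". The parsed value would then already be the integer 123, only displayed differently. This can be checked with plain Elixir, no parser involved:

```elixir
# A list holding the integer 123 is the same term as the charlist for "{".
true = [123] == [?{]

# 213 is outside the printable ASCII range, so [213] is shown as a list,
# while [123] is shown as a charlist by default.
IO.inspect([213])
IO.inspect([123])

# Asking inspect/2 to print charlists as plain lists reveals the number.
IO.puts(inspect([int_t: [123]], charlists: :as_lists))
```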

Clarification on library's intended input data type

Firstly, this is a brilliant library and I'm excited to find a use case for it. The composition abilities are really attractive given how unwieldy parsing can get sometimes.

To the point now, though: given the nature of strings in Elixir, it wasn't entirely clear to me that this library is strictly for parsing text data. I read a non-trivial amount of the high level documentation before deciding to try and use it for parsing raw binary data. It wasn't until I wrote my first integer combinator that I realized I had misunderstood:

iex(1)> App.Parser.version(<<1, 6>>)
{:error,
 "expected byte in the range ?0..?9, followed by byte in the range ?0..?9",
 <<1, 1>>, %{}, {1, 0}, 0}

I was disappointed since this rules out my current use case, but it made sense in hindsight after getting more familiar with the library.

I started to submit a PR for the docs that would clarify this important detail up front so that others wouldn't have the same confusion that I did, but then it occurred to me that I should probably ask before I presume this library will always be strictly for parsing text data. Are there hopes or plans to expand its use case to generic binary parsing? If not, I'll go ahead and submit that PR! :)

utf8_string/2 and ascii_string/2 should support min: 0

Motivation: say I want to parse a line which might be empty.

Currently, I have to do this:

line = choice([utf8_string([not: ?\n], min: 1), string("")]) |> ignore(string("\n"))

I'd like to be able to write:

line = utf8_string([not: ?\n], min: 0) |> ignore(string("\n"))

I propose defining utf8_string(chars, min: 0) as equivalent to choice([utf8_string(chars, min: 1), string("")]), although implemented in a more efficient way if possible.
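In the meantime, a slightly shorter workaround than the choice above can be sketched with optional/2. It is not an exact equivalent: for an empty line it yields no token at all instead of an empty string. The module name LineParser is an assumption, and the snippet assumes Elixir 1.12+ and nimble_parsec 1.x.

```elixir
# Assumption: Elixir 1.12+ and nimble_parsec 1.x are available.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule LineParser do
  import NimbleParsec

  # The line content is optional; the trailing newline is required.
  defparsec :line,
            optional(utf8_string([not: ?\n], min: 1))
            |> ignore(string("\n"))
end

IO.inspect(LineParser.line("abc\n"))
IO.inspect(LineParser.line("\n"))
```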

Missing v.0.5.1 git tag

The tag for v0.5.1 was not pushed to GitHub, so clicking "View Source" in the documentation results in a 404.

Undefined function defparsec/1

I created a project using mix new and added the dependency in mix.exs. I found this example in the documentation:

defmodule MyParser do
  import NimbleParsec

  defparsec integer(min: 1) |> tag(:integer)
end
MyParser.integer("1234")

When I run this example from the project directory using iex -S mix, I get this error message:
** (CompileError) iex:12: undefined function defparsec/1

Elixir 1.9.1
NimbleParsec 0.5.1

lookahead_not() not matching correctly for non-trivial combinators

I have the following small code, running with the nimble_parsec dependency at version 0.6:

defmodule Foo do
  import NimbleParsec

  defparsec :test,
    (
      string("START ")
      |> lookahead_not(string("MIDDLE/"))    # Works
      # |> lookahead_not(concat(ascii_string([?A..?Z], min: 1), string("/")))   # Fails
      # |> lookahead_not(ascii_string([?A..?Z], min: 1) |> string("/"))   # Fails
      |> ascii_string([?A..?Z], [min: 1, max: 7])
    )
end

This test parsec is supposed to match "START " followed by some A-Z characters that are NOT followed by a "/".

It works for the lookahead_not() commented with "# Works" (that is, it tells me "did not expect string ...") but does not work with the lookahead_not()s commented with "# Fails" (that is, it matches the A-Z up to but not including the "/", while instead it should tell me "did not expect string ..."):

With lookahead_not(string("MIDDLE/")) active:

4▶ Foo.test("START MIDDLE END") 
{:ok, ["START ", "MIDDLE"], " END", %{}, {1, 0}, 12}
5▶ Foo.test("START MIDDLE/ END")
{:error, "did not expect string \"MIDDLE/\"", "MIDDLE/ END", %{}, {1, 0}, 6}

With lookahead_not(concat(ascii_string([?A..?Z], min: 1), string("/"))) active:

7▶ Foo.test("START MIDDLE END") 
{:ok, ["START ", "MIDDLE"], " END", %{}, {1, 0}, 12}
8▶ Foo.test("START MIDDLE/ END")
{:ok, ["START ", "MIDDLE"], "/ END", %{}, {1, 0}, 12}

With lookahead_not(ascii_string([?A..?Z], min: 1) |> string("/")) active:

10▶ Foo.test("START MIDDLE END") 
{:ok, ["START ", "MIDDLE"], " END", %{}, {1, 0}, 12}
11▶ Foo.test("START MIDDLE/ END")
{:ok, ["START ", "MIDDLE"], "/ END", %{}, {1, 0}, 12}

As you can see, in the second and third test, the "/" is seemingly not considered/detected/accounted for by the lookahead_not(), or the lookahead_not() doesn't seem to have any effect.

Using ascii_string([?A..?Z] including with other/more ?x notations inside it has worked perfectly fine for the rest of the project, which is quite a lot of using it, so it doesn't strike me as a problem with encoding or similar, but more as a bug in the lookahead_not() combinator.

Let me know if there's anything I can do to try narrowing this down better!

Edit, in case it's relevant:

14▶ runtime_info

## System and architecture 

Elixir version:     1.10.2
Erlang/OTP version: 22
ERTS version:       10.7
Compiled for:       x86_64-apple-darwin18.7.0
Schedulers:         4
Schedulers online:  4

The call ... does not have an opaque term of type 'Elixir.NimbleParsec':t() as 1st argument

I don't know if this is a nimble_parsec bug, or an Elixir bug, or a Dialyzer bug, or just me doing something wrong, but here goes:

I'm playing with NimbleParsec, and I'm attempting to put something together that recognises boolean algebra. At the moment, I'm simply trying to get it to recognise two booleans catenated, and I've come up with this:

defmodule Exlox.Parser.Boolean do
  import NimbleParsec

  def boolean(combinator \\ empty()) do
    true_p = string("true") |> replace(true)
    false_p = string("false") |> replace(false)

    choice(combinator, [true_p, false_p])
  end
end

defmodule Exlox.Parser do
  import NimbleParsec
  import Exlox.Parser.Boolean

  defparsec :expression, boolean() |> boolean()
end

That is: it recognises "truefalse", "truetrue", "falsefalse" and "falsetrue". This is OK (though I wonder whether this is the best way to do this...).

But dialyzer doesn't like it:

lib/parser.ex:4: The call 'Elixir.Exlox.Parser.Boolean':boolean([]) does not have an opaque term of type 'Elixir.NimbleParsec':t() as 1st argument

What am I doing wrong?

Use the ?x codepoints instead of "byte n" when using ascii_char/2

If I use ascii_char([?\n]) to expect a newline and there's no newline, the error that I get is "expected byte 10". I would expect the error to name the character instead, something like "expected ASCII character \n".
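Until the default message changes, one possible workaround is to wrap the combinator in label/2 so the error names the character. This is only a sketch: it assumes a recent nimble_parsec release fetched through Mix.install, and the module name, parser name and label text are made up for illustration.

```elixir
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule NewlineSketch do
  import NimbleParsec

  # label/2 replaces the generated "byte 10" description with our own text.
  defparsec :newline, ascii_char([?\n]) |> label("a newline character")
end

# On a mismatch, the second tuple element is the error message built from the label.
IO.inspect(NewlineSketch.newline("x"))
```

The label only changes the wording of the error, not what the combinator accepts.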

Difference between concat and plain piping

I don't have much experience with parsers. Given a very simple Markdown headline like # Heading, I'm wondering why the first of the following examples doesn't work, while the second one does:

  heading_1 =
    string("#")
    |> ignore()
    |> utf8_string([], min: 1)

  defparsec(:markdown, heading_1)

  heading_1 =
    string("#")
    |> ignore()
    |> concat(utf8_string([], min: 1))

  defparsec(:markdown, heading_1)

nimble_parsec causes makeup compile error

Hello,

nimble_parsec 0.5.0 causes a Phoenix app (or rather makeup) to fail during compilation.

==> makeup
Compiling 43 files (.ex)

== Compilation error in file lib/makeup/lexer/combinators.ex ==
** (CompileError) lib/makeup/lexer/combinators.ex:116: undefined function repeat_until/2
    (stdlib) lists.erl:1338: :lists.foreach/2
    (stdlib) erl_eval.erl:680: :erl_eval.do_apply/6
could not compile dependency :makeup, "mix compile" failed. You can recompile this dependency with "mix deps.compile makeup", update it with "mix deps.update makeup" or clean it with "mix deps.clean makeup"

Switching back to 0.4.0 resolves the error.

traverse cannot be second in pipe

The following code works OK.

defmodule MyParser do
  @moduledoc false

  @debug false

  import NimbleParsec

  white_space =
    ignore(ascii_string([9, 32], min: 1))

  language_code =
    optional(white_space) |>      # Needs to be here in order to make traverse work
    ascii_string([?a..?z], 2) |>
    traverse(:check_language_code) |>
    tag(:language)

  territory_code =
    ascii_string([?A..?Z], 2) |>
    tag(:territory)

  locale =
    choice([language_code, territory_code]) |>
    ignore(string(":"))

  locale_text =
    optional(locale) |>
    optional(white_space) |>
    utf8_string([], min: 1)

  defparsec :locale_text, locale_text, debug: @debug

  defp check_language_code(_rest, [language_code], context, _line, _offset) do
    case Cldr.validate_locale(language_code) do
      {:ok, %Cldr.LanguageTag{cldr_locale_name: cldr_locale_name}} -> {cldr_locale_name |> String.to_atom |> List.wrap, context}
      {:error, _} -> {:error, "Not a valid language code (#{language_code})"}
    end
  end
end

MyParser.locale_text("en:A text") works and returns a valid language code and MyParser.locale_text("qq:A text") fails with an error.

But in order to make it work, I had to put an extra part in front of ascii_string([?a..?z], 2) in language_code above, otherwise the compiler failed.

I have no idea of the difference and for me it does not matter, but it is not clear to me why traverse fails being second in the pipe chain.

Typespec issue in `choice/2`

With this code,

def some_parser do
  choice(empty(), [
        string("+62"),
        string("0")
  ])
end

Dialyzer complains that choice/2 breaks the contract (t(), t()) -> t(). It seems that choice/2 expects [t] instead of just t, so I think the typespec should be: @spec choice(t, [t]) :: t.

Is this correct?

Minor usability issue with repeat and optional?

Hey, I ran into this issue while writing a language parser in nimble_parsec. The issue was that chaining the repeat and optional combinators like so: repeat(optional(...)) never terminates execution. The reason is pretty clear: it repeatedly chooses the "None" option of optional, and just keeps going. A full replication of the issue can be found in this gist.

The symptoms of this bug are that the combinator never terminates, so as long as the user does rudimentary testing of the parser, they will identify the issue. It never terminates on any input, including inputs which can never match. Although I identified the issue quickly, a newer user might have real trouble with it. I somewhat doubt this error can be detected statically, but a warning in the docs might help save some people some time when they encounter this issue.
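A minimal sketch of the point above, assuming a recent nimble_parsec release fetched through Mix.install (module and parser names are made up for illustration). repeat/2 already succeeds with zero matches, so the inner combinator does not need optional to make the whole thing skippable, and dropping it leaves a parser that terminates.

```elixir
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule RepeatSketch do
  import NimbleParsec

  # repeat/2 stops as soon as the inner combinator fails, and zero
  # repetitions still count as success, so optional/2 is not needed here.
  defparsec :letters, repeat(string("a"))

  # By contrast, repeat(optional(string("a"))) keeps "succeeding" without
  # consuming input, which is the non-terminating case described above.
end

IO.inspect(RepeatSketch.letters("aaab"))
IO.inspect(RepeatSketch.letters("b"))
```

Both calls return an :ok tuple; the second one simply has an empty result list and leaves "b" unconsumed.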

Additional Dialyzer warnings

José, I've tried to work out these issues so I could send a PR, but my dialyzer-fu is poor and my understanding of the architecture of nimble_parsec is primitive at best.

Environment

Elixir 1.8.1
OTP 21
nimble_parsec from git master

Errors

I have omitted what I believe are duplicated errors to leave as simple a case as possible. The source links directly to where the error is indicated.

lib/cldr/language_tag/rfc5646_grammar.ex:13:no_return
Function langtag/0 has no local return.

Source

________________________________________________________________________________
lib/cldr/language_tag/rfc5646_grammar.ex:59:call_without_opaque
The call NimbleParsec.concat('Elixir.NimbleParsec':t(),nonempty_maybe_improper_list()) does not have an opaque term of type 'Elixir.NimbleParsec':t() in 2nd.

Source

________________________________________________________________________________
lib/cldr/language_tag/rfc5646_grammar.ex:82:call_with_opaque
The call NimbleParsec.label('Elixir.NimbleParsec':t(),<<_:320>>) contains an opaque term in 1st argument when terms of different types are expected in these positions}.

Source

________________________________________________________________________________

lib/cldr/language_tag/rfc5646_grammar.ex:146:call_without_opaque
The call NimbleParsec.unwrap_and_tag(nonempty_maybe_improper_list(),'type') does not have an opaque term of type 'Elixir.NimbleParsec':t() in 1st.

Source

________________________________________________________________________________
lib/cldr/language_tag/rfc5646_grammar.ex:271:call_without_opaque
The call NimbleParsec.choice([nonempty_maybe_improper_list(),...]) does not have a term of type ['Elixir.NimbleParsec':t(),...] (with opaque subterms) in 1st.

Source

________________________________________________________________________________

Syntax abuse

Hi, I played a bit with quoting and the following code parses just fine using Code.string_to_quoted:

  parser parseName do
    many1 letter
  end

  parser parseAbs do
    string "abs"
    spaces
    name <- parseName
    spaces
    body <- parseTerm
    return {:abs, name, body}
  end

  parser parseTerm do
    chainl1 (try(parseAbs) or parseVar or parens(parseTerm)), parseApp
  end

  parser parseVar do
    name <- parseName
    return {:var, name}
  end

  parser parens p do
    char '('
    p_ <- p
    char ')'
    return p_
  end

  parser parseApp do
    spaces
    return (fn f, a -> {:app, f, a} end)
  end

Which looks basically the same as using Parsec in Haskell. Not suggesting anything, just wanted to share.
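For anyone who wants to check the claim, here is a small sketch using only the Elixir standard library. It feeds one of the blocks above to Code.string_to_quoted/1 and inspects the resulting AST; the variable names are illustrative. The code is only parsed, never compiled, so parser, return and <- do not need to be defined anywhere.

```elixir
source = """
parser parseVar do
  name <- parseName
  return {:var, name}
end
"""

# string_to_quoted/1 only runs the Elixir parser, so undefined macros are fine.
{:ok, ast} = Code.string_to_quoted(source)

# The block becomes a call to `parser` with the name and a do-block as arguments.
{:parser, _meta, [name_ast, [do: body]]} = ast

IO.inspect(name_ast, label: "parser name")
IO.inspect(body, label: "body")
```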

Make it possible to invoke a parser from another module

Looking at the documentation of the parsec function, it seems that it is not possible to invoke a parser defined in a different module.

To distribute the parsing logic across several files, and to allow macros built on top of defparsec (a superset of it, for instance), I guess we need to expand what defparsec can do.
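For reference, later nimble_parsec releases document a {module, function} form of parsec/2 for this purpose. The following is a sketch under that assumption, with the library fetched through Mix.install; the module and combinator names are hypothetical and not taken from the issue.

```elixir
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule SharedCombinators do
  import NimbleParsec

  # defcombinator defines a public combinator without a parsing entry point.
  defcombinator :digits, ascii_string([?0..?9], min: 1)
end

defmodule NumberParser do
  import NimbleParsec

  # The tuple form tells parsec/1 to call the combinator in another module.
  defparsec :number, parsec({SharedCombinators, :digits}) |> tag(:number)
end

IO.inspect(NumberParser.number("123"))
```

Splitting combinators across modules this way also lets the compiler build the modules separately, which can help with large grammars.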

Feature: unmatchable combinator

Rationale: the choice combinator currently accepts a list of two or more combinators. However, when writing a macro that can accept a variable number of choices, it is useful to have a combinator that never matches to handle the case of zero choices provided.

See here for a practical example:
https://github.com/derpibooru/philomena/blob/c66fe0ca39cd8d4f51b21c8fe939489c24bba892/lib/philomena/search/lexer.ex#L262-L285

Parsing IPv6 addresses

I implemented a combinator (roughly) as per the ABNF described by the IETF:

        IPv4address = d8 "." d8 "." d8 "." d8

        d8          = DIGIT               ; 0-9
                    / %x31-39 DIGIT       ; 10-99
                    / "1" 2DIGIT          ; 100-199
                    / "2" %x30-34 DIGIT   ; 200-249
                    / "25" %x30-35        ; 250-255

        IPv6address =                          6(h16 ":") ls32
                    /                     "::" 5(h16 ":") ls32
                    / [             h16 ] "::" 4(h16 ":") ls32
                    / [ *1(h16 ":") h16 ] "::" 3(h16 ":") ls32
                    / [ *2(h16 ":") h16 ] "::" 2(h16 ":") ls32
                    / [ *3(h16 ":") h16 ] "::"   h16 ":"  ls32
                    / [ *4(h16 ":") h16 ] "::"            ls32
                    / [ *5(h16 ":") h16 ] "::"             h16
                    / [ *6(h16 ":") h16 ] "::"

        ls32        = h16 ":" h16 / IPv4address

        h16         = 1*4HEXDIG

  ipv4_octet =
    ascii_string([?0..?9], min: 1, max: 3)

  ipv4_address =
    times(ipv4_octet |> string("."), 3)
    |> concat(ipv4_octet)

  ipv6_hexadectet =
    ascii_string('0123456789abcdefABCDEF', min: 1, max: 4)

  ipv6_ls32 =
    choice([
      ipv6_hexadectet |> string(":") |> concat(ipv6_hexadectet),
      ipv4_address
    ])

  ipv6_fragment =
    ipv6_hexadectet |> string(":")

  ipv6_address =
    choice([
      times(ipv6_fragment, 6) |> concat(ipv6_ls32),
      string("::") |> times(ipv6_fragment, 5) |> concat(ipv6_ls32),
      optional(ipv6_hexadectet) |> string("::") |> times(ipv6_fragment, 4) |> concat(ipv6_ls32),
      optional(times(ipv6_fragment, max: 1) |> concat(ipv6_hexadectet)) |> string("::") |> times(ipv6_fragment, 3) |> concat(ipv6_ls32),
      optional(times(ipv6_fragment, max: 2) |> concat(ipv6_hexadectet)) |> string("::") |> times(ipv6_fragment, 2) |> concat(ipv6_ls32),
      optional(times(ipv6_fragment, max: 3) |> concat(ipv6_hexadectet)) |> string("::") |> concat(ipv6_fragment) |> concat(ipv6_ls32),
      optional(times(ipv6_fragment, max: 4) |> concat(ipv6_hexadectet)) |> string("::") |> concat(ipv6_ls32),
      optional(times(ipv6_fragment, max: 5) |> concat(ipv6_hexadectet)) |> string("::") |> concat(ipv6_hexadectet),
      optional(times(ipv6_fragment, max: 6) |> concat(ipv6_hexadectet)) |> string("::")
    ])

I have checked this combinator over and over again and I am fairly sure it is implemented exactly per the BNF, but I cannot get it to match addresses like fe80::362c:b162:1a49:bf12 or 2000:4000:6000:8000::a. What's going wrong?
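One plausible explanation, offered as a sketch rather than a confirmed diagnosis: the ABNF rule [ *n(h16 ":") h16 ] "::" relies on backtracking inside the repetition, while times/2 keeps whatever it has consumed. On fe80::..., the fragment takes "fe80:", the trailing h16 fails on the second ":", and the optional falls back to matching nothing, so "::" is then tried at the very start of the input. The code below reproduces that on a reduced prefix rule and shows one way to restructure it. It assumes a recent nimble_parsec release fetched through Mix.install, and all names are made up for illustration.

```elixir
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule Ipv6PrefixSketch do
  import NimbleParsec

  h16 = ascii_string([?0..?9, ?a..?f, ?A..?F], min: 1, max: 4)

  # Shape used in the post: (h16 ":") repeated, then h16, then "::".
  # On "fe80::1" the fragment takes "fe80:", the trailing h16 then fails on
  # ":", the optional falls back to nothing, and "::" is tried at "fe80".
  defparsec :greedy_prefix,
            optional(times(h16 |> string(":"), max: 2) |> concat(h16))
            |> string("::")

  # Alternative shape: h16 first, then (":" h16) groups. The lookahead_not
  # refuses a ":" that starts "::", so nothing has to be handed back later.
  colon_h16 = string(":") |> lookahead_not(string(":")) |> concat(h16)

  defparsec :prefix,
            optional(h16 |> times(colon_h16, max: 2))
            |> string("::")
end

IO.inspect(Ipv6PrefixSketch.greedy_prefix("fe80::1"))
IO.inspect(Ipv6PrefixSketch.prefix("fe80::1"))
IO.inspect(Ipv6PrefixSketch.prefix("2000:4000::a"))
```

The same restructuring would have to be applied to each alternative of ipv6_address that starts with an optional prefix. The tail of the address after "::" is a separate question and is not covered by this sketch.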
