Comments (3)

josevalim commented on July 28, 2024

Hi @electricshaman! We could easily make it work for bytes. However, would it really be useful? When working with bytes, it is hard to beat the readability and efficiency of the low-level bitstring constructor.

Also, I was not aware of protocols that encode bytes directly as integers, such as <<1, 6>>, to encode the number 16. Do you have a link? I would love to read more about it and see how good is the fit of such protocol with nimble_parsec.

from nimble_parsec.

electricshaman commented on July 28, 2024

Hi @josevalim! For background, I work with @mobileoverlord at LT. We are currently writing an IPP client to handle printing from Nerves-based kiosks in our fulfillment center. You won't find a bigger fan of the bitstring constructor syntax than me. I've used it over the years to deal with some pretty gnarly protocols and it's one of my favorite things about Erlang and Elixir.

> We could easily make it work for bytes. However, would it really be useful? When working with bytes, it is hard to beat the readability and efficiency of the low-level bitstring constructor.

I agree with you. I would say for 95% of the protocol work that I've done over the years, the bitstring constructor is all I've ever needed for parsing raw binary. I saw your announcement about nimble_parsec on the Elixir forums in March, and it's taken me this long to even experiment with it because I didn't want to give up the expressiveness and efficiency of the bitstring constructor that you mentioned.

Yet somehow, here I am. So I spent some time today thinking about why I was toying with the idea of trading off those qualities by using nimble_parsec, and two things stand out as what captivated me about this library:

  1. Parsing rule composition
  2. Low-level optimization guarantees (performance and safety)

These two qualities are particularly attractive in the remaining 5% of cases when I'm dealing with a protocol (IPP, for example) which is full of legacy baggage and has been over-designed by committee. For example, in IPP, there are different types of attributes, each with an associated syntax. The syntax defines how the attribute is encoded. A few examples of a syntax: range of integer, name, print resolution, version, and keyword. These in turn are built on lower-level syntaxes such as octet-string, integer, etc. There are probably more than 30 other types of syntax defined in the protocol (and that's not to mention the extensions). More examples can be found here in the IPP decoding module. Each by itself is fairly simple. But now that I've got a module full of these patterns for all the different attributes, it's feeling a bit unmanageable and I naturally started to see lots of places where common definitions could be reused. This is what led me to investigate how to break the parsing rules into their constituent parts and then combine them back into more manageable definitions. And knowing that the resulting rules would still be optimized by the runtime is also a huge factor.

All that to say, I agree that a parser combinator like nimble_parsec is not right for generic binary parsing. The bitstring constructor is the way to go. I think it's probably worthwhile to state that up front in the documentation for nimble_parsec to avoid confusing folks like me. :)

However, I do think there is something else here that could be investigated as a potential way of composing separate parsing rules with bitstring syntax and doing it in a way that is still optimized. Maybe the answer is that I should just define macros in my project and be done with it. I've done that before on a smaller scale (for example, defining a uint32 macro for unsigned-integer-size(32), etc) and it worked well. Do you have any other suggestions?
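
As a minimal sketch of the macro approach mentioned above: each macro expands to a bitstring segment spec, so shared definitions can be reused across patterns while the match itself is still an ordinary bitstring pattern. The module and function names (BinaryTypes, Example, decode_pair, encode_pair) are hypothetical and chosen only for illustration.

```elixir
defmodule BinaryTypes do
  # Hypothetical helper module: each macro expands to a bitstring segment spec.
  defmacro uint32 do
    quote do: unsigned - integer - size(32)
  end

  defmacro uint16 do
    quote do: unsigned - integer - size(16)
  end
end

defmodule Example do
  import BinaryTypes

  # The macros are invoked with parentheses inside <<>>.
  # Matches two big-endian unsigned 32-bit integers followed by any remainder.
  def decode_pair(<<a::uint32(), b::uint32(), rest::binary>>), do: {:ok, {a, b}, rest}
  def decode_pair(_), do: :error

  # The same macros work when constructing a binary.
  def encode_pair(a, b), do: <<a::uint32(), b::uint32()>>
end
```

Because the macros are expanded at compile time, the resulting clauses should compile to the same kind of code as hand-written segment specs.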

Sorry for unloading this much text on you. This is something I've kind of felt was a missing part of the bitstring puzzle for a while: some slightly higher-level construct to manage piecing together various bitstrings from lower-level parts (almost like an ABNF grammar definition but taking advantage of the idiomatic bitstring constructor syntax).

> Also, I was not aware of protocols that encode bytes directly as integers, such as <<1, 6>>, to encode the number 16. Do you have a link? I would love to read more about it and see how good is the fit of such protocol with nimble_parsec.

That was me trying to parse IPP's version attribute syntax in one rule (before I realized nimble_parsec was strictly for text). They define it as follows:

version-number       = major-version-number minor-version-number
major-version-number = SIGNED-BYTE
minor-version-number = SIGNED-BYTE
SIGNED-BYTE          = BYTE
BYTE                 = %x00-ff
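
For comparison, the grammar above maps directly onto a single bitstring pattern. This is only a sketch: the module name IPPVersion and the return shape are assumptions made for illustration, not part of any existing IPP library.

```elixir
defmodule IPPVersion do
  # version-number = major-version-number minor-version-number,
  # each encoded as one SIGNED-BYTE.
  def decode(<<major::signed-integer-size(8), minor::signed-integer-size(8), rest::binary>>) do
    {:ok, {major, minor}, rest}
  end

  # Fewer than two bytes available.
  def decode(_), do: :error
end
```

With this definition, the two bytes <<1, 1>> decode to the version tuple {1, 1}, and any trailing bytes are returned untouched for the next rule.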

josevalim commented on July 28, 2024

Thanks for the reply @electricshaman!

I think we are in agreement. I will go ahead and clarify the README.

Defining macros such as uint32 is what we do in projects like Postgrex.

My main concerns with a composable library for bitstrings are two:

  1. It will certainly be more verbose, so the composition really has to pay off.
  2. Things like <<n, value::integer-size(n)>> may be hard to express.

So if I were to define something that composes binaries, I would maybe try to allow those rules to be defined using binaries themselves and then allow composition on the more mechanical parts. Maybe something like this:

# rule is made by the matching binary and the expression
# to be executed once matched
var_integer =
  rule <<n, int::integer-size(n)>> do
    int
  end

# ..more rules...
# ...

# rules are composed using then (could be called concat)
var_integer |> then(another_rule) |> then(yet_another_rule)

# multiple choices are done with choose
choose([rule1, rule2, rule3])

So this at least removes the mechanical aspect of binary matching, such as appending the parsed value to a list, writing cases, making sure the binary is the first argument, etc.

I have added then/concat and choose above, but you will at least need something recursive and a traverse combinator too.
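
To make the idea concrete, here is a rough runtime sketch of such rules using plain anonymous functions. It is not the macro-based design described above and gives none of the compile-time optimizations; every name in it (BinRules, var_integer, byte, concat, choose) is hypothetical. The sequencing function is called concat rather than then to avoid clashing with Kernel.then/2 in Elixir 1.12 and later.

```elixir
defmodule BinRules do
  # A rule is a function: binary -> {:ok, [values], rest} | :error

  # A length byte n followed by an n-bit integer.
  def var_integer do
    fn
      <<n, int::integer-size(n), rest::binary>> -> {:ok, [int], rest}
      _ -> :error
    end
  end

  # A single unsigned byte.
  def byte do
    fn
      <<b, rest::binary>> -> {:ok, [b], rest}
      _ -> :error
    end
  end

  # Runs left, then right on whatever left did not consume.
  def concat(left, right) do
    fn bin ->
      with {:ok, a, rest1} <- left.(bin),
           {:ok, b, rest2} <- right.(rest1) do
        {:ok, a ++ b, rest2}
      else
        _ -> :error
      end
    end
  end

  # Tries each rule in order and returns the first success.
  def choose(rules) do
    fn bin ->
      Enum.find_value(rules, :error, fn rule ->
        case rule.(bin) do
          {:ok, _, _} = ok -> ok
          :error -> nil
        end
      end)
    end
  end
end
```

A composed parser is then just a function call, for example `BinRules.concat(BinRules.var_integer(), BinRules.byte())` applied to a binary. The accumulation of parsed values and the threading of the remaining input are handled once, in concat, instead of in every clause.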

How much you generalize will depend on the features you want to get out of this. For example, nimble_parsec keeps a byte offset and a line count, but for binary parsing you would need a byte offset at best. If you need to pass state around, you may need a mechanism to handle that too. It is quite likely this is a much simpler endeavor than nimble_parsec, but it really depends on what you want to get out of it.
