Giter Site home page Giter Site logo

Support for parser combinators about duckling HOT 7 OPEN

facebook avatar facebook commented on April 26, 2024
Support for parser combinators

from duckling.

Comments (7)

niteria avatar niteria commented on April 26, 2024

Hi @axman6,

You bring up an interesting question. I've been thinking about this and I have some ideas, but I will need a bit of time to express them in more detail.

The TL;DR is:

  1. Using Regexps in Duckling sets the barrier to contribution quite low. Duckling benefited a lot from this, previously in the Clojure incarnation and now already as a Haskell project. Most of the rules were contributed.

  2. There's some optimizations we can do with Regexps that we cannot with parser combinators. I have a change in progress that compiles a giant NFA from all the heads of rules, which can filter out rules that don't have a chance of applying. On the test cases that I've tested it with it halves the cost of Duckling. This change basically makes it so that the cost of Duckling doesn't increase with each added rule.

  3. There's room for a combinator like constructs in Duckling, but probably not using some stock combinator library. Using a Free Applicative we could make the rules expressible with:

{-# LANGUAGE ApplicativeDo #-}
ruleURL :: Rule
ruleURL = mkRule "url" $ do
  m:_:_protocol:_:domain:_:_:_port:_path:_query:_ <- regex 
    "((([a-zA-Z]+)://)?(w{2,3}[0-9]*\\.)?(([\\w_-]+\\.)+[a-z]{2,4})(:(\\d+))?(/[^?\\s#]*)?(\\?[^\\s#]+)?)"
  return $ Token Url $ url m domain
  1. The shortcomings of parsing URLs today are orthogonal to the choice of underlying parsing library. We could make it work with Regexps. Or we can have a Regexp that overapproximates and use parseURI :: String -> Maybe URI to filter out false positives.

I hope to elaborate on 2) and 3) more in the future, this is just a sketch.

Anyways, thanks for thinking about this, if you have ideas we're happy to hear them.

from duckling.

s4s0l avatar s4s0l commented on April 26, 2024

what is NFA?

from duckling.

niteria avatar niteria commented on April 26, 2024

NFA - https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
Compiling regexps down to NFAs gives predictable polynomial performance.

from duckling.

dmvianna avatar dmvianna commented on April 26, 2024

Would you consider supporting both regexes and parser combinators? I take your point in the simple cases of URLs and emails. However this machinery and contribution model could easily encompass, say, street addresses. I find many parallels between Duckling's and libpostal's architecture, but I find it unlikely we could achieve a sufficiently good street address parser using regex, even with community support.

This might be a non-goal, but it is a worthy one.

from duckling.

patapizza avatar patapizza commented on April 26, 2024

@dmvianna I wouldn't discard the idea, though the added value should gain on the complexity cost.
You're right about the use of regexes for street addresses; it doesn't work properly. At this point postal addresses are out of scope, as they require a resolution based on an external datasource.

It seems like libpostal is a good candidate for that, so I wouldn't jump in reinventing the wheel. What are the limitations that could be overcome by Duckling?

from duckling.

dmvianna avatar dmvianna commented on April 26, 2024

Recursion. As far as I understand, libpostal is not a good candidate for parsing free text. For example, let's say I want to parse street addresses out of Patents Abstracts (I have some toy code and data here). Or other cases I routinely find at work, such as 1-25, 30-40 and 150 Collins St or even 2 Fictional Rd and 3B Epic Av. libpostal simply won't parse multiple addresses in one string properly. And guess what, even regex can parse common cases for these examples. Parser combinators can improve on that.

What naïve code won't do is to assign St to Street or Saint according to local context. libpostal sorta does that by returning all alternatives. Duckling as I understand does implement a probabilistic parser, so it could potentially solve both problems.

from duckling.

patapizza avatar patapizza commented on April 26, 2024

@dmvianna I'm open to look at a prototype for US postal addresses to see if we can get something useful even without the resolution part.

from duckling.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.