Comments (7)
Hi @axman6,
You bring up an interesting question. I've been thinking about this and I have some ideas, but I will need a bit of time to express them in more detail.
The TL;DR is:
-
Using Regexps in Duckling sets the barrier to contribution quite low. Duckling benefited a lot from this, previously in the Clojure incarnation and now already as a Haskell project. Most of the rules were contributed.
-
There's some optimizations we can do with Regexps that we cannot with parser combinators. I have a change in progress that compiles a giant NFA from all the heads of rules, which can filter out rules that don't have a chance of applying. On the test cases that I've tested it with it halves the cost of Duckling. This change basically makes it so that the cost of Duckling doesn't increase with each added rule.
-
There's room for a combinator like constructs in Duckling, but probably not using some stock combinator library. Using a Free Applicative we could make the rules expressible with:
{-# LANGUAGE ApplicativeDo #-}
ruleURL :: Rule
ruleURL = mkRule "url" $ do
m:_:_protocol:_:domain:_:_:_port:_path:_query:_ <- regex
"((([a-zA-Z]+)://)?(w{2,3}[0-9]*\\.)?(([\\w_-]+\\.)+[a-z]{2,4})(:(\\d+))?(/[^?\\s#]*)?(\\?[^\\s#]+)?)"
return $ Token Url $ url m domain
- The shortcomings of parsing URLs today are orthogonal to the choice of underlying parsing library. We could make it work with Regexps. Or we can have a Regexp that overapproximates and use
parseURI :: String -> Maybe URI
to filter out false positives.
I hope to elaborate on 2) and 3) more in the future, this is just a sketch.
Anyways, thanks for thinking about this, if you have ideas we're happy to hear them.
from duckling.
what is NFA?
from duckling.
NFA - https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton
Compiling regexps down to NFAs gives predictable polynomial performance.
from duckling.
Would you consider supporting both regexes and parser combinators? I take your point in the simple cases of URLs and emails. However this machinery and contribution model could easily encompass, say, street addresses. I find many parallels between Duckling's and libpostal's architecture, but I find it unlikely we could achieve a sufficiently good street address parser using regex, even with community support.
This might be a non-goal, but it is a worthy one.
from duckling.
@dmvianna I wouldn't discard the idea, though the added value should gain on the complexity cost.
You're right about the use of regexes for street addresses; it doesn't work properly. At this point postal addresses are out of scope, as they require a resolution based on an external datasource.
It seems like libpostal is a good candidate for that, so I wouldn't jump in reinventing the wheel. What are the limitations that could be overcome by Duckling?
from duckling.
Recursion. As far as I understand, libpostal
is not a good candidate for parsing free text. For example, let's say I want to parse street addresses out of Patents Abstracts (I have some toy code and data here). Or other cases I routinely find at work, such as 1-25, 30-40 and 150 Collins St
or even 2 Fictional Rd and 3B Epic Av
. libpostal
simply won't parse multiple addresses in one string properly. And guess what, even regex can parse common cases for these examples. Parser combinators can improve on that.
What naïve code won't do is to assign St
to Street
or Saint
according to local context. libpostal
sorta does that by returning all alternatives. Duckling as I understand does implement a probabilistic parser, so it could potentially solve both problems.
from duckling.
@dmvianna I'm open to look at a prototype for US postal addresses to see if we can get something useful even without the resolution part.
from duckling.
Related Issues (20)
- Support Persian(fa) Languages HOT 4
- [Persian][Numeral] Extract incorrect number from text
- ghc: panic! when stack build.
- Error building on MacOS with M1 Pro HOT 4
- Can't build Docker image HOT 2
- [it_IT] add time corpus/rules
- Default log files are created even if different paths are provided via options
- [ERROR] Network.Socket.sendBuf: resource vanished HOT 1
- Time intervals are too large by one grain HOT 1
- [Question] Which elements are relevant for Test-Failures? HOT 3
- Will not run on new Macbook HOT 1
- Reject common phrases like "guten morgen"
- Up Text and Hashable constraints
- Ne dimension is not working. HOT 1
- Custom Dimension doesn't return the actual value!
- Intervals with leap year dates return incorrect years
- how can I detect the date and time separately? anyone knows?
- Inconsistent date range parsing HOT 1
- Incorrect abbreviation of 'millions' in locale DE HOT 1
- overalaping body in the extracted entities
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from duckling.