
document implementation about earley HOT 10 OPEN

ollef avatar ollef commented on August 15, 2024
document implementation

from earley.

Comments (10)

ollef avatar ollef commented on August 15, 2024

I'm aware that the internals aren't exactly easy to follow, and I've been meaning to write something up about it. This issue just might be the push I need to actually do it. :)

Some quick remarks about the things you mentioned:
Both State and Cont represent Earley states. They differ in the types, since we're in a typed setting and are constructing the results as we parse. Cont needs an argument, i.e. a parse result, to become a State.

The double references look a bit peculiar in Haskell, but are there for a reason. I'll start with the one in Rule:
We only want a rule to be expanded once per position in the input. In traditional Earley parsing this is implicit in the usage of sets of states, since adding duplicates is then a no-op. We can't hold sets of states because states contain things of unknown type and are thus not in Eq or Ord.

So we instead make sure to only expand each rule once per position, which has the same effect. When a rule is expanded, we also have to keep track of a continuation, i.e. the productions to parse after a successful parse of the rule. But since we only expand a rule once per position, what happens when we encounter a rule a second time at the same position? This is where these pointers come in. At this point we look up the rule's continuation pointer (i.e. dereference the outer pointer layer), and add ourselves to the list of continuations (i.e. update the inner pointer). Now any states (e.g. from the first time the rule was expanded) that refer to this continuation (the inner pointer) are automatically updated!
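The mechanism can be sketched with plain IORefs (a toy model in IO, not the library's actual code or reference types): states keep hold of the inner reference, so pushing a new continuation through the outer reference is visible to every state that shares it.

```haskell
import Data.IORef

main :: IO ()
main = do
  -- first expansion of the rule at some position: allocate the inner
  -- pointer (the list of continuations) and the outer pointer (the one
  -- stored in the Rule itself)
  inner <- newIORef ["continuation from first expansion"]
  outer <- newIORef inner
  -- a State created at that first expansion holds on to `inner`
  let stateConts = inner
  -- second encounter of the rule at the same position: dereference the
  -- outer layer, then add ourselves to the inner list
  shared <- readIORef outer
  modifyIORef shared ("continuation from second encounter" :)
  -- the earlier state sees the new continuation automatically
  readIORef stateConts >>= mapM_ putStrLn
```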

At each position we have to make sure that the outer reference points to a fresh continuation pointer. This is what the reset computation does.

We have a similar situation when there are alternatives, i.e. one state that branches into two states. In this case both branches will have the same continuation (e.g. c in (a <|> b) <*> c). If both branches produce a result at the same position we want to merge the branches; otherwise we would suddenly have several states at the same position (at c in the example) and parsing would become exponential. This is what the contsArgs field that you referred to is for; it works roughly the same way as the rule mechanism. The inner reference gives us a way to inject more results into the continuation state even when that state has already been processed at a position.

from earley.

sboosali avatar sboosali commented on August 15, 2024

oh yeah that helps.

feel free to link me to any drafts. sooner or later I'll learn how it works
(it's just a few lines too). until then, I can provide "fresh eyes".


ollef avatar ollef commented on August 15, 2024

I've started doing this. https://github.com/ollef/Earley/blob/master/docs/implementation.md

Please let me know if anything needs clarification or if there are mistakes, typos, or whatever.

Most of it is background so far, but there is already some meaty information in there.


sboosali avatar sboosali commented on August 15, 2024

this is amazing!


greglearns avatar greglearns commented on August 15, 2024

The implementation.md file helps a bunch. However, is there any explanation of how to write the grammars themselves? You have the examples, but don't really explain what the different symbols are used for or how to think about writing a grammar from scratch, or how to handle whitespace effectively, etc. I'd really like to use this! (thank you, by the way, for your work so far!)


sboosali avatar sboosali commented on August 15, 2024

(this implementation file helps explain the internals only.)

have you used other parser combinator libraries? the API is similar. also,
what's a grammar you'd like to define? maybe I can help, and we can then
address these issues in the documentation.


greglearns avatar greglearns commented on August 15, 2024

I've done parsing in Ruby, Java, and C -- I'm pretty new to Haskell (sorry!
I know that complicates things)

Your offer to help is very kind. I'd be happy to help out in terms of
documentation if you wish once I understand things better, so that I can
give back.

Here's an example of a line I'd like to parse, and the resulting structure
I'd like to produce (ultimately as JSON)

  input = " .. [next x] some note #other #for:bob, mary and (betty or larry) /6m/12/40h"

  var expected = {
    numberOfInitialDots: 2,
    boxed_word: 'next',
    boxed_letter: 'x',
    text: 'some note',
    tags: {
      other: '',
      for: 'bob, mary and (betty or larry)'
    },
    numbers: [
      { value: 6, letter: 'm' },
      { value: 12, letter: '' },
      { value: 40, letter: 'h' }
    ]
  }


sboosali avatar sboosali commented on August 15, 2024

cool.

here's a tutorial for parsing log files with more explanation, using a different parser combinator library:

https://www.fpcomplete.com/school/starting-with-haskell/libraries-and-frameworks/text-manipulation/attoparsec

(I think we should link to general parser combinator tutorials from the readme).

for your problem, the "Haskell way" is to first define a custom type. we construct this type as soon as possible whenever consuming data, whether from a parser or a database or whatever, and only print/serialize it back at the very end. the benefit is that it's much easier to manipulate structured data than raw text, and easier to manipulate typed data than JSON.

{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}

import GHC.Generics (Generic)
import Data.Aeson (ToJSON)  -- from the `aeson` package

data GregLine = GregLine
 { numberOfInitialDots :: Int
 , boxed_word          :: String
 , boxed_letter        :: Char
 , text                :: String
 , tags                :: [Tag]
 , numbers             :: [Number]
 } deriving (Generic, ToJSON) -- this one line lets us serialize our datatype into json with `encode`.

type Tag = (String, Maybe String)   -- the Maybe provides explicit optionality...

type Number = (Int, Maybe Char)     -- ...and will also simplify the parsers.
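given those definitions, the Generic deriving gives us encode for free. for your example input, the value would be built something like this (a sketch, not checked; I'm using Nothing for the empty tag value since Tag carries a Maybe):

```haskell
import Data.Aeson (encode)
import qualified Data.ByteString.Lazy.Char8 as BL

printExample :: IO ()
printExample = BL.putStrLn . encode $ GregLine
  { numberOfInitialDots = 2
  , boxed_word          = "next"
  , boxed_letter        = 'x'
  , text                = "some note"
  , tags                = [("other", Nothing), ("for", Just "bob, mary and (betty or larry)")]
  , numbers             = [(6, Just 'm'), (12, Nothing), (40, Just 'h')]
  }
```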

Haskell has many "combinator libraries". in particular, "parser combinator libraries" like this one. the basic idea is that the library exports a type (call it Combinator), some "primitives" that construct values of that type (like aPrimitive :: String -> Combinator), and "combinators" that transform values of that type (like aCombinator :: Integer -> Combinator -> Combinator). if we wanted to build a simple "arithmetic combinator library" (sounds cooler than it is), the primitives might be 0 and 1, and the combinators might be + and *.

so first, let's import the primitives from Text.Earley and the combinators from Control.Applicative (the latter is in the standard library, yay code reuse!)

import Text.Earley
import Control.Applicative (many, some, optional, (<$>), (<*>), (<*), (*>))

if we specialize the type signatures to Prod, we get:

optional :: Prod r e t a -> Prod r e t (Maybe a)
many     :: Prod r e t a -> Prod r e t [a]
some     :: Prod r e t a -> Prod r e t [a]
(<*)     :: Prod r e t a -> Prod r e t b -> Prod r e t a
(*>)     :: Prod r e t b -> Prod r e t a -> Prod r e t a

we can read them as:

optional  -- zero or one, or regex "?"
many      -- zero or more, or regex "*"  
some      -- one or more, or regex "+"
(<*)      -- parse the left and keep its result, then parse the right and ignore its result 
(*>)      -- parse the left and ignore its result, then parse the right and keep its result 

we can read the type constructor Prod r e t as "consumes some tokens and then returns", so an expression of type pInt :: Prod r e t Int means that the expression "consumes some tokens and then returns an Int". thus, if we wanted to increment the result of a parse, we couldn't write (+1) pInt, since that would apply (+1) to the parser itself rather than to the number it parses. we would write (+1) <$> pInt instead.
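for instance (a small made-up example; gSucc and pInt are names I'm inventing here, but rule, satisfy, parser, and fullParses are from Text.Earley):

```haskell
import Control.Applicative (some)
import Data.Char (isDigit)
import Text.Earley

-- parse an integer, then increment the *result* of the parse
gSucc :: Grammar r (Prod r String Char Int)
gSucc = do
  pInt <- rule $ read <$> some (satisfy isDigit)
  return ((+1) <$> pInt)   -- (+1) <$> pInt, not (+1) pInt

main :: IO ()
main = print . fst $ fullParses (parser gSucc) "41"  -- prints [42]
```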

anyways, following is the basic scaffold for your problem, using the library. it reads like BNF, hopefully. or more like yacc, where you can insert "interpretations" around productions.

(the g stands for "grammar" and the p stands for "parser". I haven't tried type checking this)

-- | can parse @" .. [next x] some note #other #for:bob, mary and (betty or larry) /6m/12/40h"@
gGregLine :: Grammar r (Prod r e Char GregLine)                  -- the tokens are characters
gGregLine = do
 pWhitespace          <- rule$ some (symbol ' ')
 pNumberOfInitialDots <- rule$ length <$> some (symbol '.')      -- length "interprets" the parse result
                                                                 -- of "one or more" dots
 pBoxed_word          <- rule$ _                                 -- uses <* and *>
 pBoxed_letter        <- rule$ _
 pText                <- rule$ _
 pTags                <- rule$ _                                 -- uses optional
 pNumbers             <- rule$ _            
 pGregLine            <- rule$ GregLine <$> (pWhitespace *> pNumberOfInitialDots)  -- skip initial whitespace 
                                        <*> pBoxed_word
                                        <*> pBoxed_letter
                                        <*> pText
                                        <*> pTags
                                        <*> pNumbers

 return pGregLine
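for example, here's one way the pNumbers hole might be filled for the /6m/12/40h part (an untested sketch; pNumber is a name I'm making up):

```haskell
import Control.Applicative (optional, some, (<$>), (<*>), (*>))
import Data.Char (isAlpha, isDigit)
import Text.Earley

-- one number: a '/', then digits, then an optional letter, e.g. "/40h"
pNumber :: Prod r e Char (Int, Maybe Char)
pNumber = (,) <$> (symbol '/' *> (read <$> some (satisfy isDigit)))
              <*> optional (satisfy isAlpha)

-- "/6m/12/40h" is one or more of those
pNumbers :: Prod r e Char [(Int, Maybe Char)]
pNumbers = some pNumber
```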

let me know if you're unfamiliar with "Applicative"s. if so, for now, you can think of f <$> x <*> y as similar to normal pure function application f x y, only "with effects". here, the "effects" are "matching tokens and parsing them into some value".

and you can ignore rule for now: it just says that its argument is a grammatical production with a left-hand side, rather than just a grammatical expression with only a right-hand side.

(note: read wants a String, but satisfy isDigit produces a single Char, so we use digitToInt from Data.Char here)

pInt <- rule$ digitToInt <$> satisfy isDigit
        |     |              |_____________|
        |     |              (:: Prod r e Char Char)
        |     |____________________________|
        |     (:: Prod r e Char Int)
        |__________________________________|
        (:: Grammar r (Prod r e Char Int))

by the way, how did you come across this package?

and let me know if you need more info.


ollef avatar ollef commented on August 15, 2024

@sboosali: I've added you as a collaborator to the repository so you can add this to the documentation if you're up for it. I suggest adding it to the docs directory and linking it somewhere appropriate in the README. I can help flesh it out when I find the time. A minor issue from a quick look is that the types of a few of the combinators are incorrect: the list and maybe types should be "pushed into" the return types.


sboosali avatar sboosali commented on August 15, 2024

thanks, will do (the type typo is noted).

also, I'll try to clearly separate "new to parser combinators" from "new to this package".

