Giter Site home page Giter Site logo

Comments (12)

lihaoyi avatar lihaoyi commented on August 28, 2024

I wonder if it's a consequence of the way StringIn is implemented? A lot of
our bitsets are implemented as Char predicates, including those used in
StringIn, and those initialize a 65000-pointer long lookup array the first
time they're initialized. This may be causing the slowdown, but I can't
imagine it taking 3 minutes...

Could you repeat the benchmark moving the initialization (construction of
the parser object) outside of the timing block? I thought the
initialization would be relatively fast, but maybe it's not, and we should
reconsider this approach.

The lookup array implementation lives in the trie here:

https://github.com/lihaoyi/fastparse/blob/master/utils/shared/src/main/scala/fastparse/Utils.scala#L138-L183

On Fri, Aug 7, 2015 at 12:00 AM, tannerezell [email protected]
wrote:

Tested on 0.2.1:

sealed trait Valcase class Var(value : String) extends Valval startTime = new Date()val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq val variables : Parser[Var] = P ( StringIn(vars:_*).! ).map(Var)val Result.Success(myVars, _) = variables.parse("SA")val endTime = new Date()val dateDiff = new SimpleDateFormat("mm:ss").format(new Date(endTime.getTime - startTime.getTime))
println (s"myVars = $myVars")
println (s"took $dateDiff to parse")

results in the following output:

myVars = Var(SA)
took 03:11 to parse

This only happens on long strings, shorter ones process almost instantly.
I've tried various combinations of the long string and it always results in
an exceedingly long parsing time.


Reply to this email directly or view it on GitHub
#33.

from fastparse.

tannerezell avatar tannerezell commented on August 28, 2024

It looks like when parse is called, rather than when the parser is setup:

sealed trait Val
case class Var(value : String) extends Val

def time[R](unit : => R) : R = {
    val startTime = new Date()
    val result = unit
    val endTime = new Date()
    val dateDiff = new SimpleDateFormat("mm:ss:SS").format(new Date(endTime.getTime - startTime.getTime))
    println (s"took $dateDiff to parse")
    unit
  }


  val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq
  val variables : Parser[Var] = time { P ( StringIn(vars:_*).! ).map(Var) }
  val Result.Success(myVars, _) = time { variables.parse("SA") }
  println (s"myVars = $myVars")

and the result:

took 00:00:01 to parse
took 03:27:331 to parse
myVars = Var(SA)

That seems crazy right?

from fastparse.

lihaoyi avatar lihaoyi commented on August 28, 2024

Initialization is lazy IIRC; what if you call parse over and over?

On Fri, Aug 7, 2015 at 5:14 AM, tannerezell [email protected]
wrote:

It looks like when parse is called, rather than when the parser is setup:

sealed trait Valcase class Var(value : String) extends Val
def time[R](unit : => R) : R = {
val startTime = new Date()
val result = unit
val endTime = new Date()
val dateDiff = new SimpleDateFormat("mm:ss:SS").format(new Date(endTime.getTime - startTime.getTime))
println (s"took $dateDiff to parse")
unit
}

val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq
val variables : Parser[Var] = time { P ( StringIn(vars:_*).! ).map(Var) }
val Result.Success(myVars, _) = time { variables.parse("SA") }
println (s"myVars = $myVars")

and the result:

took 00:00:01 to parse
took 03:27:331 to parse
myVars = Var(SA)

That seems crazy right?


Reply to this email directly or view it on GitHub
#33 (comment).

from fastparse.

tannerezell avatar tannerezell commented on August 28, 2024
  val vars = List("\r\n", "\n", "hl", "SA", "avxavxavxavxavxavx").toSeq
  val variables : Parser[Var] = time { P ( StringIn(vars:_*).! ).map(Var) }
  val Result.Success(myVars, _) = time { variables.parse("SA") }
  val Result.Success(myVars2, _) = time { variables.parse("SA") }
  val Result.Success(myVars3, _) = time { variables.parse("SA") }
  val Result.Success(myVars4, _) = time { variables.parse("SA") }
  println (s"myVars = $myVars")
took 00:00:00 to parse
took 03:08:530 to parse
took 00:00:00 to parse
took 00:00:00 to parse
took 00:00:00 to parse
myVars = Var(SA)

Seems like only the first time, then its pretty fast

from fastparse.

tannerezell avatar tannerezell commented on August 28, 2024

I tried my hand at writing an alternate 'StringIn' implementation:

case class ContainsIn(str : Seq[String]) extends Parser[Unit] {
    val sortedList = str.sortBy(_.length).reverse
    override def parseRec(cfg: ParseCtx, index: Int): Mutable[Unit] = {
      def length : Int = { sortedList foreach { p => if (cfg.input.startsWith(p, index)) return p.length }; -1 }
      if (length != -1) success(cfg.success, (), index + length + 0, Nil, cut = false)
      else fail(cfg.failure, index)
    }
    override def toString = {
      s"ContainsIn(${sortedList.map(literalize(_)).mkString(", ")})"
    }
  }

For the most part it works and is very very fast. For grins and giggles I tried to swap it in place of StringIn and ran the tests, it was very unhappy to say the least. I ran it against the examples in the documentation, ran it against a couple variations and it seems to work like I would expect. Maybe you can provide some insight as to why this implementation breaks on the tests.

from fastparse.

lihaoyi avatar lihaoyi commented on August 28, 2024

I don't know why it wouldn't work off the top of my head; you'll have to go and minimize the problem

from fastparse.

Alain-Bearez avatar Alain-Bearez commented on August 28, 2024

I have been experimenting with a mixed approach, on pull request #54. However this is not ideal, as detailed in this review comment.

The real challenge is to have a solution which is efficient for long running parsers and not to costly on initialisation for one-shot parsers.

from fastparse.

lihaoyi avatar lihaoyi commented on August 28, 2024

I don't understand the problem in detail and I don't understand what your solution is trying to do. For this sort of non-completely-trivial patch, you're going to need to explain what's going on =P

from fastparse.

shawjef3 avatar shawjef3 commented on August 28, 2024

90 seconds to compile StringIn with a 17 character string is way too long.

println("string length,time to compile StringIn (ms)")

for (subset <- "aaaaaaaaaaaaaaaaa".tails.toSeq.reverse) {

  val start = System.currentTimeMillis()

  StringIn(subset)

  val end = System.currentTimeMillis()

  println(s"${subset.size},${end - start}")
}
string length,time to compile StringIn (ms)
0,16
1,5
2,1
3,2
4,4
5,10
6,16
7,17
8,23
9,52
10,79
11,155
12,438
13,1274
14,3264
15,10069
16,29726
17,90797

from fastparse.

lihaoyi avatar lihaoyi commented on August 28, 2024

Yes it is. It's totally screwed and someone needs to fix it.

Anyone want to chip in with an implementation of a Double Array Trie? I've never used one but it seems compact, fast to initialize and fast to query

from fastparse.

shawjef3 avatar shawjef3 commented on August 28, 2024

I'm not so sure you need a fresh implementation. How about the one from Apache commons?

from fastparse.

lihaoyi avatar lihaoyi commented on August 28, 2024

That's plausible. We'd need to...

  • Port it to Scala, or find an alternate implementation in Javascript so we can run it on Scala.js.
  • Make sure it's semi-rigorously unit tested itself, and the Fastparse/Scalaparse/Pythonparse tests all pass
  • Run some perf comparisons on both initialization (which we know is slow for status quo) and retrieval/matching (which I think the status quo does pretty well?) to see how it compares.

Nothing fundamentally difficult about this, someone just needs to do it. And I'm spending all day/week/month/year refactoring legacy python code so that person probably won't be me ^_^

from fastparse.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.