Comments (17)

neroux avatar neroux commented on June 10, 2024 1

That should make it reproducible

// assumed imports (hedged): jodd.lagarto.LagartoParser, jodd.lagarto.TagVisitorChain

new LagartoParser("<!---->").parse(new TagVisitorChain()); // valid empty comment - parses fine
new LagartoParser("<!--->").parse(new TagVisitorChain());  // malformed empty comment - reproduces the error

from jodd.

slandelle avatar slandelle commented on June 10, 2024 1

I will disable the flag by default, as I believe it is no longer used in Internet Explorer anyway.

+1: the last version supporting conditional comments was IE9

neroux avatar neroux commented on June 10, 2024 1

I will disable the flag by default, as I believe it is no longer used in Internet Explorer anyway.

I second that too. As of IE 10 they are no longer supported, and even IE 10 itself has been completely out of support since January this year. I am not saying Jodd should drop the support altogether, but changing the default might be a good idea.

Lagarto (visitors) should emit errors as defined by the spec. Note that you don't usually see these in browsers, so to check correctness some official tool should be used, if such a tool exists.

https://validator.w3.org/nu/?doc=https%3A%2F%2Fwww.koaci.com%2F :)

igr avatar igr commented on June 10, 2024

Awesome! Will check myself and resolve in no time! Thank you very much!

neroux avatar neroux commented on June 10, 2024

It would seem this is because both pages contain invalid empty HTML comments (<!---> instead of <!---->)

  • koaci.com in lines 1812 and 1816
  • cofc.edu in line 695

Seems Jodd doesn't like that too much and commentStart is -1 here when it shouldn't be (assuming the broken HTML comment is still considered a comment)

emitComment(commentStart, ndx - 2);
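Per the HTML5 spec, `<!-->` and `<!--->` hit the "abrupt closing of empty comment" parse error: a conforming tokenizer still emits an (empty) comment token rather than bailing out. As a self-contained illustration of the tolerant behaviour (a hypothetical helper, not Jodd's actual code), a scanner that accepts the abrupt forms might look like this:

```java
import java.util.ArrayList;
import java.util.List;

class CommentScanner {
    // Extract comment bodies from html, tolerating the abrupt forms
    // "<!-->" and "<!--->" that the HTML5 tokenizer still accepts as
    // (empty) comments while reporting a parse error.
    static List<String> comments(String html) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while ((i = html.indexOf("<!--", i)) >= 0) {
            int start = i + 4;
            // abrupt closing: "<!-->" or "<!--->"
            if (html.startsWith(">", start) || html.startsWith("->", start)) {
                out.add("");
                i = start + (html.startsWith(">", start) ? 1 : 2);
                continue;
            }
            int end = html.indexOf("-->", start);
            if (end < 0) {              // unterminated comment: take the rest
                out.add(html.substring(start));
                break;
            }
            out.add(html.substring(start, end));
            i = end + 3;
        }
        return out;
    }
}
```

The design point is simply that the abrupt forms are recognized before the scanner searches for the full `-->` terminator, so the malformed comment is still reported as an (empty) comment instead of corrupting the scan position.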

igr avatar igr commented on June 10, 2024

Thanx @neroux for the deduction!

igr avatar igr commented on June 10, 2024

@marekhudyma if you can, please share the whole bundle, to test it myself

igr avatar igr commented on June 10, 2024

@marekhudyma let's fix all the issues :) Thank you again, it's a great idea.

marekhudyma avatar marekhudyma commented on June 10, 2024

Here I uploaded 2 files:

  1. The top 1 million URLs on the internet - the Alexa ranking used to publish this list. It may be outdated; some pages no longer work.
    There are sites where you can buy 250 million URLs for $25.
    https://drive.google.com/file/d/1D4OPZ0FAdZxjMxw5i_rx9eEz8Ex4_vR2/view?usp=sharing

  2. I downloaded around 200k pages - 3.5 GB. I think you could download even more and test with that ;-) Of course, downloading that many pages takes time, and running tests on that many pages is a big job. But it is a chance to make your library really good.
    https://drive.google.com/file/d/11l7mUuWd1KCYMHsWUEeSqQ7k8DxU35HD/view?usp=sharing

Please let me know when you have downloaded them, so I can free up my Google Drive.

igr avatar igr commented on June 10, 2024

@marekhudyma Sent requests from [email protected]

igr avatar igr commented on June 10, 2024

@marekhudyma I think I got them all! Will run tests today

marekhudyma avatar marekhudyma commented on June 10, 2024

Shared.

marekhudyma avatar marekhudyma commented on June 10, 2024

I also ran some performance tests. I am only interested in extracting links from pages - I am writing a crawler. The exact test is not so important; what matters is the comparison. For a full round, these are the times:

  • HtmlCleaner: 35 minutes,
  • Jsoup: 6.5 minutes,
  • Jodd: 3.5 minutes,
  • my own custom implementation: 2.5 minutes.

My own solution is pretty simple (but still not complete). I just analyze the input character by character and use a state machine - an enum value tells me which state I am in.
From a theoretical point of view it is not rocket science. But I try to stick to the same behaviour as HtmlCleaner, so I can run tests on thousands of pages and compare the results. I found that HtmlCleaner is the slowest, but it fixes broken HTML syntax best.
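The character-by-character state machine described above can be sketched in a few lines. This is a minimal, self-contained illustration (hypothetical code, not the actual crawler; it only handles double-quoted href attributes):

```java
import java.util.ArrayList;
import java.util.List;

class LinkExtractor {
    // The enum that tells us "where we are" in the input.
    private enum State { TEXT, TAG, HREF_VALUE }

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        State state = State.TEXT;
        StringBuilder value = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            switch (state) {
                case TEXT:                       // plain text between tags
                    if (c == '<') state = State.TAG;
                    i++;
                    break;
                case TAG:                        // inside a tag, look for href="
                    if (html.regionMatches(true, i, "href=\"", 0, 6)) {
                        state = State.HREF_VALUE;
                        i += 6;
                    } else if (c == '>') {
                        state = State.TEXT;
                        i++;
                    } else {
                        i++;
                    }
                    break;
                case HREF_VALUE:                 // collect until the closing quote
                    if (c == '"') {
                        links.add(value.toString());
                        value.setLength(0);
                        state = State.TAG;
                    } else {
                        value.append(c);
                    }
                    i++;
                    break;
            }
        }
        return links;
    }
}
```

A real crawler would of course need more states (single-quoted and unquoted attributes, comments, script/style raw text), which is exactly where the HTML5 tokenizer's full state set comes in.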

I see there are many libraries on the market, but most of them have not been maintained for 5+ years.

Do you know of other fast HTML parsing libraries?

neroux avatar neroux commented on June 10, 2024

what matters is the comparison

The important question here is: are these four libraries doing the very same thing? If, for example, HtmlCleaner does a lot more than the other three, it would not be surprising that it takes more time. Similarly, if your custom implementation is optimised for your particular use case, that might also explain why it is faster.

Strictness and error tolerance are certainly also factors, but at this point this probably warrants its own issue if there are concerns regarding Jodd :)

igr avatar igr commented on June 10, 2024

@marekhudyma that is exactly what Jodd is doing :) Jodd follows the HTML5 state machine and the specification; it has a class (instead of an enum value) for each state. Jodd follows http://html.spec.whatwg.org/multipage/parsing.html (states are named the same, etc.), although I haven't checked for updates.

If you use Lagarto with a visitor, you will get maximum speed. You can turn off some settings, like the one for conditional comments (ah!), and that will give a bit more speed.
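A hedged sketch of that visitor-mode setup with both flags discussed in this thread turned off. The class and setter names (`LagartoParserConfig`, `setEnableConditionalComments`, `setEnableRawTextModes`, `EmptyTagVisitor`) are taken from this thread and from memory of the Jodd 5-era API; check the javadoc of your version, as the exact constructor and setter shapes may differ:

```java
import jodd.lagarto.EmptyTagVisitor;
import jodd.lagarto.LagartoParser;
import jodd.lagarto.LagartoParserConfig;
import jodd.lagarto.Tag;

public class FastLinkVisitor {
    public static void main(String[] args) {
        LagartoParserConfig cfg = new LagartoParserConfig();
        cfg.setEnableConditionalComments(false); // IE-only feature, costs speed
        cfg.setEnableRawTextModes(false);        // skip CDATA detection (XML parsing)

        LagartoParser parser = new LagartoParser(cfg, "<a href='/x'>link</a>");
        parser.parse(new EmptyTagVisitor() {     // override only what you need
            @Override
            public void tag(Tag tag) {
                CharSequence href = tag.getAttributeValue("href");
                if (href != null) {
                    System.out.println(href);    // extracted link
                }
            }
        });
    }
}
```

This cannot run without the Jodd Lagarto jar on the classpath; it is only meant to show where the two speed-related flags live relative to the visitor API.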

Like @neroux said, if you are writing just a link extractor, I believe you can write faster code, as in that case you care less about the correctness of the remaining HTML. But as you can see, the spec is quite big, although not complex, and implementing it in full has some performance impact.

If you wish, you can share some of the performance results so I can check whether we can tweak something, or we can work together on this parser :)

marekhudyma avatar marekhudyma commented on June 10, 2024

I think I will not join the project; I am too busy.
I have to admit that after switching off enableConditionalComments, the speed increases. I see there is a flag enableRawTextModes = true, but I don't know how it affects the speed.
It is so fast that I will drop my implementation and use your library.

What I could do for you:

  • performance tests. I saw that in microbenchmarks there is no big difference. The biggest difference appears when you compare the time to parse 1 million HTML documents in a row, because the parser then has to deal with GC, which takes a lot of time. Of course, that doesn't mean I want to read 1M HTML documents from files; I want to read 1k (they can fit into memory) and run them 1k times (so no I/O is counted).
    I was testing Jsoup and HtmlCleaner, because the other implementations look abandoned. Do you know any other good competitor?

  • I would like to run correctness tests. I am downloading 1 million pages (right now I have around 0.5M). Of course, it will end up being around 0.9M, because many pages no longer exist on the internet.
    I can run your parser against all the pages and confirm that it doesn't throw any exceptions.
    The biggest problem is how to confirm that your results are correct. In an ideal world, there would be another library that returns the same results, only slower. Did you compare your results with some other library?
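The in-memory benchmark loop described in the performance-test bullet above can be sketched like this (a hypothetical harness; `parse` is a placeholder for whichever library is under test, and real runs should use a proper tool such as JMH):

```java
import java.util.List;
import java.util.function.Consumer;

class ParseBench {
    // Time `rounds` passes over an in-memory page set; I/O is excluded
    // because the pages are loaded before the clock starts.
    static long benchNanos(List<String> pages, int rounds, Consumer<String> parse) {
        for (String p : pages) parse.accept(p);   // one untimed pass to warm up the JIT
        long t0 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String p : pages) parse.accept(p);
        }
        return System.nanoTime() - t0;
    }
}
```

Running many rounds over the same 1k pages keeps the working set in memory while still generating enough garbage for GC behaviour to show up in the totals, which is the effect described above.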

igr avatar igr commented on June 10, 2024

@marekhudyma Yeah, conditional comments are a PITA; actually, I will disable the flag by default, as I believe it is no longer used in Internet Explorer anyway.

enableRawTextModes is faster when false. It is for detecting raw CDATA sections - Jodd Lagarto can parse XML, too.

I don't know much about the competitors... quite busy as well :) I only know about JSoup. Moreover, Jodd Lagarto has two modes, visitor and DOM, and on top of them there is Jerry (https://jodd.org/jerry/), a jQuery-like API, which is handy.

Any help with Lagarto is welcome! The goal is to have a fast and precise library.

Now, regarding correctness, there are a few things to mention:

  • Lagarto (visitors) should emit errors as defined by the spec. Note that you don't usually see these in browsers, so to check correctness some official tool should be used, if such a tool exists.
  • LagartoDOM is a DOM structure built on top of the visitor, and it mimics browsers, meaning: if you want to e.g. clean some HTML code, you should use LagartoDOM, load the HTML and produce it back. LagartoDOM is slower, as the DOM needs to be built in memory, and there is some additional code that resolves invalid ordering of tags and so on (that is a different part of the HTML spec).
  • Finally, one idea to check correctness might be to parse a bunch of files with e.g. jQuery and then with Jerry - e.g. extract all links, or text, or... something :) Just an idea.

So, there are two different places that have to be correct: the visitor (i.e. parsing HTML) and the DOM (order of elements).

Again, please feel free to report, test, and use Lagarto as much as you want and have time for; I am here to help and to make it work in the best possible way... Maybe it will even become a separate project after all.
