Comments (2)

m-heider commented on May 24, 2024

The reason for this behavior is that your HTML is invalid.

Every start tag must have a matching end tag, aside from a few exceptions listed here:
https://html.spec.whatwg.org/multipage/syntax.html#optional-tags

Also, custom element names must contain a hyphen (e.g. <a-player>), but JSoup does not seem to enforce this.
https://html.spec.whatwg.org/multipage/custom-elements.html#custom-elements-core-concepts
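
To illustrate the hyphen rule, here is a minimal sketch of a name check. It is an ASCII-only simplification of the spec's definition (the real production also allows a range of non-ASCII characters), and the class and method names are made up for this example. They are not part of JSoup.

```java
import java.util.Set;
import java.util.regex.Pattern;

class CustomElementNames {
    // Names the HTML spec reserves even though they contain a hyphen.
    private static final Set<String> RESERVED = Set.of(
            "annotation-xml", "color-profile", "font-face", "font-face-src",
            "font-face-uri", "font-face-format", "font-face-name", "missing-glyph");

    // ASCII-only approximation (an assumption of this sketch):
    // starts with a lowercase letter, contains at least one hyphen,
    // and uses only lowercase letters, digits, '.', '_' and '-'.
    private static final Pattern ASCII_NAME =
            Pattern.compile("[a-z][a-z0-9._-]*-[a-z0-9._-]*");

    // Hypothetical helper: true if the name could be a custom element name.
    static boolean looksLikeCustomElementName(String name) {
        return ASCII_NAME.matcher(name).matches() && !RESERVED.contains(name);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCustomElementName("a-player")); // true
        System.out.println(looksLikeCustomElementName("player"));   // false
    }
}
```

By this check, <player> is not a valid custom element name, while <a-player> is.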

I am not aware of any setting that ignores custom tags, but there are two other options:

  1. You escape the angle brackets in <player>:
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;

org.jsoup.nodes.Document doc;
String output;

doc = Jsoup.parse("""
                  <!DOCTYPE html>
                  <html lang="en">
                    <head><title>Title</title></head>
                    <body>
                      <u>Oh, hello! You must be the person I've been waiting on all morning. </u><strong><u>You wouldn't happen to be &lt;player&gt; would you?</u></strong>
                    </body>
                  </html>
                  """);

output = Parser.unescapeEntities(doc.select("body").html(), true);

Output:

<u>Oh, hello! You must be the person I've been waiting on all morning. </u><strong><u>You wouldn't happen to be <player> would you?</u></strong>

  2. You deal with the end tag in your program, disable JSoup's pretty-printing (which adds the extra line breaks), and hope that future versions of JSoup will enforce neither the rules about end tags nor the hyphen requirement for custom elements:
import org.jsoup.Jsoup;

org.jsoup.nodes.Document doc;
String output;
org.jsoup.nodes.Document.OutputSettings outputSettings;

doc = Jsoup.parse("""
                  <!DOCTYPE html>
                  <html lang="en">
                    <head><title>Title</title></head>
                    <body>
                      <u>Oh, hello! You must be the person I've been waiting on all morning. </u><strong><u>You wouldn't happen to be <player> would you?</u></strong>
                    </body>
                  </html>
                  """);
                  
outputSettings = new org.jsoup.nodes.Document.OutputSettings();
outputSettings.prettyPrint(false);
doc.outputSettings(outputSettings);

output = doc.select("body").html().trim();

Output:

<u>Oh, hello! You must be the person I've been waiting on all morning. </u><strong><u>You wouldn't happen to be <player> would you?</player></u></strong>
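
If escaping placeholders by hand (the first option) is impractical, the escaping could be automated before the string reaches JSoup. The following is only a rough sketch using plain java.util.regex. The allowlist KNOWN_TAGS and the class PlaceholderEscaper are hypothetical names for this example, not JSoup API, and a regex like this will not cope with every HTML edge case (comments, '>' inside attribute values, and so on).

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderEscaper {
    // Hypothetical allowlist: tags that should remain real HTML.
    private static final Set<String> KNOWN_TAGS =
            Set.of("u", "strong", "em", "b", "i", "p", "br", "span");

    // Matches a start or end tag: optional slash, tag name, rest of the tag.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9-]*)([^<>]*)>");

    // Escapes the angle brackets of every tag whose name is not allowlisted.
    static String escapeUnknownTags(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            String replacement = KNOWN_TAGS.contains(name)
                    ? m.group()
                    : "&lt;" + m.group(1) + m.group(2) + m.group(3) + "&gt;";
            // quoteReplacement keeps '$' and '\' in the text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeUnknownTags(
                "<u>You wouldn't happen to be <player> would you?</u>"));
        // prints: <u>You wouldn't happen to be &lt;player&gt; would you?</u>
    }
}
```

The result can then be passed to Jsoup.parse and unescaped afterwards, as in the first option.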

from jsoup.

SkyAphid commented on May 24, 2024

Thank you for the response, and I apologize for my late one.

I was aware that it was identifying it as a tag and trying to treat it as such. Thankfully, I was able to work around it in my program and circumvent the entire thing. Unfortunately, it ended up massively overcomplicating my code.

The problem with this API, in my opinion, is that there is no way to turn off the autocorrection in the parse function. I'm not asking for the API to ignore these tags entirely, but there should be a way to have JSoup parse the strings without calling whatever function is inserting text into my Strings without my permission. It's made worse by the fact that I have no control whatsoever. Even a callback fired when it edits the string would be nice, mostly so I could override it and have it not touch the string.

If this project is ever updated, I suggest the feature work something like this:
JSoup.setFixErrors(false);

If this is set to false, then the code that inserts the end tag automatically will simply not be called, and the text will not be parsed by the system. Ideally, it'd also include an optional callback that catches the "error" and feeds it into the function.

If I could please be directed to the code in this API that handles this autocorrecting functionality, perhaps I could look into adding the support to help out, or at least have the change locally.

Thank you again for your time!
