jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

Home Page: https://jsoup.org

License: MIT License

Java 84.24% HTML 15.76%
jsoup html java dom css java-html-parser css-selectors xml xpath parser

jsoup's Introduction

jsoup: Java HTML Parser

jsoup is a Java library that makes it easy to work with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS, and xpath selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers.

  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe-list, to prevent XSS attacks
  • output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag soup, jsoup will create a sensible parse tree.

See jsoup.org for downloads and the full API documentation.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the In the News section into a list of Elements:

Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
  log("%s\n\t%s", 
    headline.attr("title"), headline.absUrl("href"));
}

Online sample, full source.

Open source

jsoup is an open source project distributed under the liberal MIT license. The source code is available on GitHub.

Getting started

  1. Download the latest jsoup jar (or add it to your Maven/Gradle build)
  2. Read the cookbook
  3. Enjoy!

Android support

When used in Android projects, core library desugaring with the NIO specification should be enabled to support Java 8+ features.

Development and support

If you have any questions on how to use jsoup, or have ideas for future development, please get in touch via jsoup Discussions.

If you find any issues, please file a bug after checking for duplicates.

The colophon talks about the history of and tools used to build jsoup.

Status

jsoup is in general release and is stable.

jsoup's People

Contributors

benbenw, cketti, cromoteca, dependabot[bot], hannibal218bc, hazendaz, isira-seneviratne, jairideout, jaredstehler, jhy, kno10, kovacstamasx, krystiangorecki, kzn, legioth, mccxj, mitemitreski, morokosi, offa, pascalschumacher, schmid-michael, sebkur, sedran, steinarb, suarez12138, talgatakhm, tc, tipabu, travisfw, zjiajun

jsoup's Issues

"charset=latin-1" is not properly detected

From the mailing list: http://groups.google.com/group/jsoup/browse_thread/thread/09d8325e0e5a46c6#

I just downloaded jsoup 1.3.3 and gave it a try. It works great for
UTF-8 encoded websites, but dies for LATIN-1 encoded sites.
The site that caused the error below is:
http://www.macupdate.com
In the html source you'll find this line:

Here is the full stack trace:
Exception in thread "main"
java.nio.charset.UnsupportedCharsetException: LATIN-1
at java.nio.charset.Charset.forName(Charset.java:505)
at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:58)
at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:376)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:122)
at rgse.test.Main.main(Main.java:15)
System:
Mac OS X 10.5
Java 1.6
jsoup 1.3.3

Reason:
The general problem:
In DataUtil.java, line 56, the charset name is identified as
"LATIN-1". That name is handed to Charset.forName(). However,
"LATIN-1" does not seem to be recognized as a valid character set alias
as defined in http://www.iana.org/assignments/character-sets
The correct character set alias for "LATIN-1" should be "latin1". I
wrote a small test program; of the following two lines, the first runs without problems and the second fails:
Charset c = Charset.forName("latin1"); // WORKS
Charset c = Charset.forName("LATIN-1"); // FAILS
Solution:
Maybe somewhere in DataUtil.getCharsetFromContentType()? At least this
is where the character set is parsed and turned into all uppercase
(which breaks for latin1).
Thanks!
Rico
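
The failure reported above comes from handing an unrecognised name straight to Charset.forName(). A minimal sketch of a more defensive lookup, using only the JDK, might look like the following. The class name CharsetLookup and the method resolve are hypothetical (they are not part of jsoup), and falling back to a caller-supplied default is an assumption about the desired behaviour, not the fix that was adopted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not jsoup's API: resolve a charset name taken from a
// Content-Type header or meta tag without letting a bad name abort parsing.
final class CharsetLookup {
    static Charset resolve(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            // Charset.forName matches canonical names and aliases case-insensitively.
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException and UnsupportedCharsetException
            // are both subclasses of IllegalArgumentException.
            return fallback;
        }
    }
}
```

With this sketch, CharsetLookup.resolve("latin1", StandardCharsets.UTF_8) yields ISO-8859-1, while a name the JVM does not know yields the fallback instead of an exception.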

Add #html() method to Elements

Add a collecting html() method to Elements, to align with text().

Also think about supporting Elements#html(String). Not sure we want to do this (effectively you'd use it to avoid getting a single element via first() and setting HTML on that). Still, it would support some use cases. If we do, we should also implement the prepend and wrap methods.

Issue with <tr>

When calling append to add a table row the resulting tr gets wrapped in a table even though I appended to an existing table.

Incorrect normalisation on headless body

Parsing <html><body><span class="foo">bar</span> creates <html><body><span class="foo">bar</span><head></head></html>: in the normalisation process, the head element is appended to the html element, instead of prepended.

Thanks to Patrick Smith @ ucsc.edu for reporting the issue.

Normalise document after parse

Add a post-parse document normalisation phase.

In particular, move text nodes that aren't in #body (i.e. those in #root, #html, #head) into body.

Add a textNode#isWhitespace method to check if textnode should be moved.
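
As a rough illustration of the check proposed above, the sketch below tests whether a text node's content consists only of whitespace. The class TextChecks and its static method are hypothetical stand-ins operating on a plain String; the exact set of characters treated as whitespace is an assumption.

```java
// Hypothetical stand-in for a TextNode#isWhitespace check, working on the
// node's raw text. Assumes space, tab, newline, form feed and carriage
// return are the characters that count as whitespace.
final class TextChecks {
    static boolean isWhitespace(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ' && c != '\t' && c != '\n' && c != '\f' && c != '\r') {
                return false; // found real content, so the node would need moving
            }
        }
        return true; // empty or whitespace-only
    }
}
```

A normalisation pass could then leave whitespace-only nodes where they are and move only nodes for which this returns false.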

302 redirects are not followed

Not sure if this is a bug or done intentionally, but HTTP 302 redirects are not followed. It'd be great if they could be.

-edit-

I saw "// todo: error handling options, allow user to get !200 without exception" in HttpConnection, so maybe this is more of a feature request...

uppercase umlauts get replaced by lowercase umlaut entities

The line

System.out.println(Jsoup.clean("<h1>Überschrift</h1>", Whitelist.none()));

should print

&Uuml;berschrift

but prints

&uuml;berschrift

This used to work correctly in v0.3.1, but fails in v1.2.3.

While baseArray in Entities.java distinguishes between lowercase and uppercase umlauts, the above call yields the wrong result.
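
To make the expected behaviour concrete, here is a small sketch of a case-preserving escape for the two characters in the report. UmlautEscaper is a hypothetical class written for illustration and says nothing about how Entities.java is actually implemented; it only handles Ü and ü and leaves every other character untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a lookup keyed on the exact character,
// so upper- and lowercase umlauts map to different entity names.
final class UmlautEscaper {
    private static final Map<Character, String> NAMES = new HashMap<>();
    static {
        NAMES.put('\u00DC', "Uuml"); // uppercase U with diaeresis
        NAMES.put('\u00FC', "uuml"); // lowercase u with diaeresis
    }

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                out.append('&').append(name).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The point of the sketch is that the lookup must happen before any case folding of the input; lowercasing the text first would collapse both characters onto the same entity.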

Page results in malformed tree

The page I will attach results in a Jsoup tree with two body elements, neither of which is a direct child of the html element.

You will find the page in "[email protected]:bimargulies/Misc.git" under the jsoup-tc directory.

Modify Elements#attr to get from first element with match

Currently, Elements#attr pulls the attribute from the first element. But Elements#hasAttr scans all of the elements in the collection to check if one has an attribute. So these do not align.

Modify Elements#attr to scan for the first Element that hasAttr, and return the value from that element.
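
The proposed lookup can be sketched with plain collections. In the code below each element is represented by a Map of its attributes, which is a simplification made for this example; FirstAttr and its method are hypothetical names, and returning an empty string when nothing matches is an assumption.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each Map stands in for one element's attributes.
final class FirstAttr {
    // Returns the value from the first element that has the attribute,
    // or "" if no element has it (assumed convention).
    static String attr(List<Map<String, String>> elements, String key) {
        for (Map<String, String> attributes : elements) {
            if (attributes.containsKey(key)) {
                return attributes.get(key);
            }
        }
        return "";
    }
}
```

With this behaviour, attr returns a non-empty result in the same cases where a scan of all elements for the attribute would report true, provided the attribute value itself is not empty.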

Selector for data attributes in HTML5

Hi Jhy,

is it possible to consume data elements?

    <li class="user" data-name="John Resig" data-city="Boston"
      data-lang="js" data-food="Bacon">
      <b>John says:</b> <span>Hello, how are you?</span>
    </li>

Jsoup.parse(document).select("[data]"); doesn't work for me.

I really love jsoup, thanks for your awesome work.

lower cased html attributes

As I already stated in a previous post, we are using JSTL tags (Java custom tags) and we require the attributes to be camel-cased to match some methods in our Java code. Is it possible to provide an option that leaves attribute names as they are, rather than converting them to lower case?

e.g. so that <abc:ourtag returnUrl="http://abc.com" /> is not changed to <abc:ourtag returnurl="http://abc.com" />

Thanks!

toString NPE for orphans

I'm working on code that frequently calls 'remove' and then re-adds an element. While the element is in a detached state, toString throws something, so Eclipse prints only an 'invocation target exception.' It would be nice if this were not so.

Should treat unknown tags as inline, not block

See: http://groups.google.com/group/jsoup/browse_thread/thread/711fb6d0c4818ead?hl=en_US#

We should probably treat unknown tags as inline, rather than block tags. Otherwise an unknown tag within a <p> causes the auto-closer to close the P, so <p><custom>Test</custom></p> parses to <p></p><custom>Test</custom>.

Need to think about what impact that would have on unknown tags that should be blocks.

Thanks to François Goldgewicht (http://francois.goldgewicht.com) for reporting the issue.

Add support for Element class manipulation

Add support for Element addClass, removeClass, toggleClass (hasClass, classNames exist, this adds convenience)

Also include in Elements: addClass / removeClass / toggleClass act on all elements; hasClass returns true on the first match.
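
One way to picture these convenience methods is as set operations on the whitespace-separated class attribute. The sketch below works on the attribute string directly; ClassNames and its methods are hypothetical and are not the signatures that were eventually added to Element.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: class manipulation as set operations on the
// value of the "class" attribute. Insertion order is preserved.
final class ClassNames {
    static Set<String> parse(String classAttr) {
        Set<String> names = new LinkedHashSet<>();
        for (String name : classAttr.trim().split("\\s+")) {
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    static String addClass(String classAttr, String name) {
        Set<String> names = parse(classAttr);
        names.add(name);
        return String.join(" ", names);
    }

    static String removeClass(String classAttr, String name) {
        Set<String> names = parse(classAttr);
        names.remove(name);
        return String.join(" ", names);
    }

    static String toggleClass(String classAttr, String name) {
        return parse(classAttr).contains(name)
                ? removeClass(classAttr, name)
                : addClass(classAttr, name);
    }
}
```

Using a set keeps addClass idempotent, so adding a class that is already present leaves the attribute unchanged.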

StringIndexOutOfBoundsException when testing whether String content is valid HTML

If I try to parse a tag with an equals sign (an empty attribute) but without any single or double quotes around an attribute value, then I get a StringIndexOutOfBoundsException. The stack trace is pasted below.

An example String would be "<a =a"

The following JUnit test case should not throw a StringIndexOutOfBoundsException:

import static org.junit.Assert.assertTrue;

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
import org.junit.Test;

public class BadAttributeTest {
    @Test
    public void aTagWithABadAttributeIsValid() throws Exception {
        // Should return a result rather than throw StringIndexOutOfBoundsException.
        assertTrue(Jsoup.isValid("<a =a", Whitelist.relaxed()));
    }
}

java.lang.StringIndexOutOfBoundsException: String index out of range: 13
at java.lang.String.charAt(String.java:686)
at org.jsoup.parser.TokenQueue.consume(TokenQueue.java:130)
at org.jsoup.parser.Parser.parseAttribute(Parser.java:207)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:142)
at org.jsoup.parser.Parser.parse(Parser.java:91)
at org.jsoup.parser.Parser.parseBodyFragment(Parser.java:64)
at org.jsoup.Jsoup.parseBodyFragment(Jsoup.java:99)
at org.jsoup.Jsoup.isValid(Jsoup.java:155)
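
The trace points at a charAt call that runs past the end of the input. As a hedged illustration of the kind of guard that avoids this, the sketch below reads an unquoted attribute value and stops at the end of the string as well as at whitespace or '>'. AttributeScanner is a hypothetical class written for this example; it does not reproduce TokenQueue or Parser.

```java
// Hypothetical sketch of a bounds-checked scan for an unquoted attribute
// value. Every charAt call is guarded by a length check.
final class AttributeScanner {
    static String consumeUnquotedValue(String input, int pos) {
        // Clamp the start position so a bad offset cannot throw.
        int start = Math.min(Math.max(pos, 0), input.length());
        int end = start;
        while (end < input.length()) {
            char c = input.charAt(end);
            if (Character.isWhitespace(c) || c == '>') {
                break; // value ends at whitespace or the end of the tag
            }
            end++;
        }
        return input.substring(start, end);
    }
}
```

For the input from the report, "<a =a", scanning from index 4 returns "a" and simply stops when the input runs out.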

Wrong html parsing (probably) due to isEmptyElement

Look at this original HTML code
----HTML START---

        <p id="pivot">
            <span style="font-weight:bold;">
                <table width="1" align="left" class="foto-v-left">
                    <tr>
                        <td>
                            <img align="left" alt="x" title="y" border="0" width="140" height="180" src="http://foo.org/iPhoneApp1.jpg"></td>
                        </tr>
                        <tr>
                            <td>Txt1</td>
                        </tr>
                    </table>
        Txt2 - </span>
        Txt3 
        </p>
</body>
---HTML END---

Try to parse it! The toString() of the resulting org.jsoup.nodes.Document looks like:

---ToString start---

La nuova «app» per iPhone della Notte della Taranta
Txt1
Txt2 - Txt3

---ToString end---

As you can see, the documents differ in structure. For example, "Txt2" and "Txt3" are not children of the "p" element; they are children of a "div".

Suggestion: new method Elements.parents()

A function similar to jQuery's parents() - http://api.jquery.com/parents/ - would be a nice addition. The function would return all parent elements of the current Element. Or, given an optional parameter, would filter based on that.

So if, for example, you selected all bold text with Elements elems = doc.select("b"), you could then find all bold tags that were in paragraphs with elems.parents("p"), and that would select the paragraphs themselves if you wanted to do some processing on them.

You could also add the optional selector to the parent() function too - although it is as easy in this case to simply select the parent and check if the tag or class etc matches.

Page results in malformed tree

The page I will attach results in a Jsoup tree with two body elements, neither of which is a direct child of the html element.

However, I can't see how to attach a file to an issue here.

Unadorned text following data-only tags doesn't parse properly

This HTML, when parsed and immediately printed out, does not round-trip correctly. Input:

<html>
<body>
<script type="text/javascript">
var inside = true;
</script>
this should be outside.
</body>
</html>

Results:

<html>
<head>
</head>
<body>
<script type="text/javascript">
var inside = true;

this should be outside.

</script>
</body>
</html>

Note how "this should be outside" ends up inside the <script> tag, instead of following it. From what I can tell, this only happens to data-only tags.

Can't get text of a <link></link> node

    String html = "<link>http://www.google.com</link><link1>http://link1.com</link1>";
    Document doc = Jsoup.parse(html);
    String link = doc.select("link").first().text();
    System.out.println("Link: " + link);
    String link1 = doc.select("link1").first().text();
    System.out.println("Link1: " + link1);

The result is :

    Link: 
    Link1: http://link1.com

It seems the content of the <link> node is ignored.

StringIndexOutOfBoundsException when parsing link http://news.yahoo.com/s/nm/20100831/bs_nm/us_gm_china

java.lang.StringIndexOutOfBoundsException: String index out of range: 1
at java.lang.String.charAt(String.java:686)
at java.util.regex.Matcher.appendReplacement(Matcher.java:711)
at org.jsoup.nodes.Entities.unescape(Entities.java:69)
at org.jsoup.nodes.TextNode.createFromEncoded(TextNode.java:95)
at org.jsoup.parser.Parser.parseTextNode(Parser.java:222)
at org.jsoup.parser.Parser.parse(Parser.java:94)
at org.jsoup.parser.Parser.parse(Parser.java:54)
at org.jsoup.Jsoup.parse(Jsoup.java:30)

Set USER_AGENT

It would be good to be able to set the user agent on the fly for Jsoup.parse(url). Many sites block a java user_agent and return a 403.

Suggestion: operators at the start of a selector

In jQuery, when doing further DOM selection on an element (e.g. using find), you can use operators at the start of the query to filter based on the current element.

For example, this jQuery: $('table.data > tbody > tr').find('> td') will select td elements that are direct children of the rows found in the first query. It will not select td elements from any nested tables.

With JSoup, this would be something like:

Elements tableRows = doc.select( "table.data>tbody>tr" );
for ( Element tr : tableRows )
{
    // do something with tr here
    tr.select(">td");
}

I currently get this error: Could not parse query >td

options tags not properly normalised from ugly HTML

After parsing a large HTML document from the wild, unclosed <option> tags are not being automatically closed when a second <option> tag (or finishing </select> tag) is met.

Example:

Element node:
DetailsTurnsCRXP ... etc. Then there is another element node containing the first <option> tag (value="title") and onward. Within that element node exists a single data node: DetailsTurnsCRXP ... etc. Nothing else follows.

JSoup cannot parse IDs with dash

If I try to use the following expression
doc.select("#expandable-nav");
I get the following error:
Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query #expandable-nav

Add option to output non-pretty-printed HTML

Add an option to output HTML that is formatted (spaces / newlines / indentation) as the original source, and not force pretty-printed.

Implement with a switch in Document, to force preserve whitespace on all nodes. Will require Nodes to have a direct accessor to their parent Document

Cleaner.isValid improvement idea for form processing

Hi,
I'm using jsoup behind some wicket form processing. I really like it.
Maybe there could be a way to call e.g. Cleaner.isValid(String input, Whitelist list),
which returns false on the first tag that would be removed.
Of course it could be coded manually, but I think that might be a nice feature.

What do you think?

Problem with <td> tag

Hello

Doing the following:

final Elements rows = doc.select("body > table > tr");
for (Element row : rows) {
    final Element date = row.child(0); // select("td").first();
}

For the first <tr>, this returns <td class="company">...; the first child is ignored.
For the second <tr>, it returns <td>21-Feb...</td>, which is correct.

see comments beside tags in html below

This html:

< table cellspacing="0" cellpadding="0" border="0">



    <col width="12%">
    <col>
</colgroup>
<tbody>
    <tr>
        <th class="tl">Date Posted</th>
        <th>Details Preview</th>
        <th>Type</th>
        <th>Amount</th>
        <th class="tr">Location</th>
    </tr>
    <tr>    <!-- if inspect code then displayed as <tr td=""> and first child is <td class="company">...</td>    -->  
        <td>21-Feb-2010 10:44</td>
        <td class="company">
            <h2>
                1.
                <a id="AdvertTitleForRow1"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="TITLE"
                >Title title title</a>
            </h2>
            <p>
                vText Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text .....
                <a
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="MORE_ADVERT_INFO"
                >More</a>
            </p>
            <p>Advertiser : AAAA Services</p>
        </td>
        <td>Contract</td>
        <td class="viewItem">
            United Kingdom,City of London
            <div class="view_advert_link">
                <a id="view_advert_link_7801464"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="view_advert_link"
                >View </a>
            </div>
        </td>
    </tr>
    <tr class="alternate"> <!-- BUT in this row all ok. First child is <td>21-Feb...</td>-->
        <td>21-Feb-2010 10:44</td>
        <td class="company">
            <h2>
                1.
                <a id="AdvertTitleForRow1"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="TITLE"
                >Title title title</a>
            </h2>
            <p>
                vText Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text Text 
                Text Text Text Text Text Text Text Text Text Text Text Text .....
                <a
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="MORE_ADVERT_INFO"
                >More</a>
            </p>
            <p>Advertiser : AAAA Services</p>
        </td>
        <td>Contract</td>
        <td class="viewItem">
            United Kingdom,City of London
            <div class="view_advert_link">
                <a id="view_advert_link_7801464"
                    href="/7801464/en/?source=Search&amp;SearchTerms=&amp;LocationSearchTerms=&amp;DatePostedFilter=2&amp;Page=1&amp;OrderBy=0&amp;CountryId=0&amp;nocache=1266753038"
                    name="view_advert_link"
                >View </a>
            </div>
        </td>
    </tr>

</tbody>

JSoup cannot CSS select IDs with a colon

If I try to use the following expression
doc.select ( "#" + pageId );
where pageId happens to be 'PlugIn100:PlugIn0_ManageMailStoreUserMultipleSelectionsPage' in one case I use, I get the following error:

Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query '#PlugIn100:PlugIn0_ManageMailStoreUserMultipleSelectionsPage': unexpected token at ':PlugIn0_ManageMailStoreUserMultipleSelectionsPage'

I know this issue has come up with underscores and dashes, but I thought I would bring it to your attention that it happens with colons as well.

IndexOutOfBoundsException in HttpConnection when the response contains empty headers

I get this exception, because a response header is empty.

    java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.jsoup.helper.HttpConnection$Response.setupFromConnection(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:338)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:132)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:121)
at org.jsoup.Jsoup.parse(Jsoup.java:133)

JSoup cannot parse IDs with underscores

Example:
Elements id = doc.select( "#An_ID_name" );

Error output:
Could not parse query #An_ID_name

Underscores are valid characters for IDs, but JSoup seems to choke on them. Regular IDs are working fine. There are other valid characters that I haven't tested, like dashes - these should all be accepted.
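
To show what accepting these characters could look like in a selector tokenizer, here is a small sketch that consumes an identifier made of letters, digits, dashes and underscores. IdentifierScanner is hypothetical and is not jsoup's selector parser; the accepted character set is an assumption based on the characters named in this report.

```java
// Hypothetical sketch of identifier scanning for an "#id" selector.
final class IdentifierScanner {
    static boolean isIdentChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_';
    }

    // Consumes identifier characters starting at 'start' and returns them.
    static String consumeIdentifier(String query, int start) {
        int end = start;
        while (end < query.length() && isIdentChar(query.charAt(end))) {
            end++;
        }
        return query.substring(start, end);
    }
}
```

Called with start index 1 (just past the '#'), the sketch returns the whole of An_ID_name and expandable-nav, and stops at the first character outside the accepted set.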

JSoup unable to extract text from paragraphs

I have the following test case for a CNN url: http://pastebin.com/yqZ1fbY1

if you look at the output you'll be able to see that it doesn't print most of the paragraphs, in fact the second paragraph of the story is rendered as: http://pastebin.com/Hh8KyRwD

expected output would be the text from the 2nd paragraph
"We will continue to highlight the Democratic Party's role in strengthening it and the Republican Party's role in opposing it," etc..........

Html entities containing digits are not unescaped correctly

Some html entities (such as sup1, sup2) are not unescaped correctly by Entities.unescape because they contain digits.

The problem is the pattern Entities.unescapePattern. I changed it to '&(#(x|X)?([0-9a-fA-F]+)|[0-9a-zA-Z]+);?', and it worked fine for me. But there might be side effects ...

You can see my changes here : clementdenis@d65387c
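
The effect of allowing digits in entity names can be checked with java.util.regex alone. In the sketch below, WITH_DIGITS is the pattern quoted in this report; LETTERS_ONLY is a hypothetical variant written here purely for contrast and is an assumption, not a copy of the original Entities.unescapePattern. The class name EntityPatternDemo is also made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityPatternDemo {
    // Pattern proposed in the report: named entities may contain digits.
    static final Pattern WITH_DIGITS =
            Pattern.compile("&(#(x|X)?([0-9a-fA-F]+)|[0-9a-zA-Z]+);?");

    // Hypothetical letters-only variant, for comparison only.
    static final Pattern LETTERS_ONLY =
            Pattern.compile("&(#(x|X)?([0-9a-fA-F]+)|[a-zA-Z]+);?");

    // Returns the first substring matched by the pattern, or null if none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }
}
```

On the input "x&sup2;y", WITH_DIGITS matches the full "&sup2;", whereas the letters-only variant stops at "&sup", which is one way an entity such as sup2 could end up unescaped incorrectly.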
