Non-strict parser (xmlpath issue, closed)

go-xmlpath avatar go-xmlpath commented on July 30, 2024
Non-strict parser

from xmlpath.

Comments (6)

Netherdrake avatar Netherdrake commented on July 30, 2024

The line that causes the crash is a JavaScript <script> tag:

<script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o?o:n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({D5DuLP:[function(t,n){function e(t,n){var e=r[t];return e?e.apply(this,n):(o[t]||(o[t]=[]),void o[t].push(n))}var r={},o={};n.exports=e,e.queues=o,e.handlers=r},{}],handle:[function(t,n){n.exports=t("D5DuLP")},{}],G9z0Bl:[function(t,n){function e(){var t=l.info=NREUM.info;if(t&&t.agent&&t.licenseKey&&t.applicationID&&p&&p.body){l.proto="https"===f.split(":")[0]||t.sslForHttp?"https://":"http://",i("mark",["onload",a()]);var n=p.createElement("script");n.src=l.proto+t.agent,p.body.appendChild(n)}}function r(){"complete"===p.readyState&&o()}function o(){i("mark",["domContent",a()])}function a(){return(new Date).getTime()}var i=t("handle"),u=window,p=u.document,s="addEventListener",c="attachEvent",f=(""+location).split("?")[0],l=n.exports={offset:a(),origin:f,features:[]};p[s]?(p[s]("DOMContentLoaded",o,!1),u[s]("load",e,!1)):(p[c]("onreadystatechange",r),u[c]("onload",e)),i("mark",["firstbyte",a()])},{handle:"D5DuLP"}],loader:[function(t,n){n.exports=t("G9z0Bl")},{}]},{},["G9z0Bl"]);</script>

niemeyer avatar niemeyer commented on July 30, 2024

The cause, as you've already figured, is that xmlpath currently uses the xml package to parse the HTML code. Although we can loosen some bolts so that it parses more than strict XML, that's still not enough to parse general HTML without errors.

The good news is that I've already been working on xmlpath v2, which will use a real HTML parser to avoid such issues. The bad news is that it will take a couple of weeks before I can finish this.

If you want to solve the problem right away, one hack I've done before is to use the regexp package to get rid of the content within such script tags, before handing it off to xmlpath.
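A minimal sketch of that workaround, using only the standard library. The names `scriptRe`, `stripScripts`, and `parses` are hypothetical, and the sample page is made up for illustration. Since xmlpath is described above as relying on the xml package, `encoding/xml` stands in for it here to show the parse failing before the scripts are stripped and succeeding afterwards. It is a hack under stated assumptions, not a general HTML sanitizer: for example, a `</script>` sequence inside a JavaScript string literal would end the match early.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// scriptRe matches a whole <script> element, including its content.
// (?is) makes the match case-insensitive and lets "." span newlines;
// ".*?" is non-greedy so each script element is matched on its own.
var scriptRe = regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script\s*>`)

// stripScripts removes every script element from the page
// (hypothetical helper, not part of xmlpath).
func stripScripts(page string) string {
	return scriptRe.ReplaceAllString(page, "")
}

// parses tokenizes doc with the strict encoding/xml decoder and returns
// the first error it hits, or nil if the whole input was consumed.
func parses(doc string) error {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		_, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// Made-up page: the inline script contains "&&" and "<", which are
	// not valid as raw character data in XML.
	page := `<html><body><script type="text/javascript">` +
		`if(t&&t.agent){for(var o=0;o<e.length;o++)r(e[o]);}` +
		`</script><p>hello</p></body></html>`

	fmt.Println("raw page error:     ", parses(page))
	fmt.Println("stripped page error:", parses(stripScripts(page)))
}
```

The cleaned string can then be wrapped in a `strings.NewReader` and handed to xmlpath in place of the original response body.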

I'll leave this issue open and report here once the problem is solved.

Netherdrake avatar Netherdrake commented on July 30, 2024

Using regex on html reminds me of this:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

I will give it a try.

Can't wait for the v2 :) Keep up the good work 👍

Netherdrake avatar Netherdrake commented on July 30, 2024

I have one more question. In the future version of xmlpath, will it be possible to get the raw content of a node?

For instance, if I have:

<p>
foo <br> bar
</p>

I would like to get the raw contents of p, HTML tags and all (in this case, I'd like the <br> tag left unstripped, to retain formatting). That is, querying //p should return:
foo <br> bar

niemeyer avatar niemeyer commented on July 30, 2024

It's unlikely that you'll be able to get the unmodified raw content. The parser will generally alter it to make it well formed.

niemeyer avatar niemeyer commented on July 30, 2024

xmlpath.v2 is out, fixing this issue: http://goo.gl/a6d0MG
