html5ever's Introduction

html5ever

API Documentation

html5ever is an HTML parser developed as part of the Servo project.

It can parse and serialize HTML according to the WHATWG specs (aka "HTML5"). Its actual behavior currently differs from the specs in some ways, most of which are documented in the bug tracker. html5ever passes all tokenizer tests from html5lib-tests, and most tree builder tests apart from those that cover unimplemented features. The goal is to pass all html5lib tests, while also providing all hooks needed by a production web browser, e.g. document.write.

Note that the HTML syntax is very similar to XML, but the two are not the same. For correct parsing of XHTML, use an XML parser. (That said, many XHTML documents in the wild are serialized in an HTML-compatible form.)

html5ever is written in Rust, so it avoids the notorious memory-safety problems that come with using C. Rust also gives the library the performance you would expect from an HTML parser written in C, without needing a garbage collector or other heavy runtime.

Getting started in Rust

Add html5ever as a dependency in your Cargo.toml file:

[dependencies]
html5ever = "0.26"

You should also take a look at examples/html2html.rs, examples/print-rcdom.rs, and the API documentation.

Getting started in other languages

Bindings for Python and other languages are much desired.

Working on html5ever

To fetch the test suite, you need to run

git submodule update --init

Run cargo doc in the repository root to build local documentation under target/doc/.

Details

html5ever uses callbacks to manipulate the DOM, so it does not impose any DOM tree representation of its own.
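
The callback approach can be illustrated with a small sketch. Everything here is hypothetical: `MiniSink`, `ArenaSink`, and `build_paragraph` are invented for illustration and are much simpler than html5ever's real tree-sink interface. The point is only that the "parser" side never sees a concrete node type, just an opaque handle.

```rust
/// Hypothetical, simplified stand-in for a tree-sink interface.
/// html5ever's real trait has more methods and different signatures.
pub trait MiniSink {
    /// Opaque reference to a node; the parser never looks inside it.
    type Handle: Clone;

    fn document(&self) -> Self::Handle;
    fn create_element(&mut self, name: &str) -> Self::Handle;
    fn append_child(&mut self, parent: &Self::Handle, child: Self::Handle);
    fn append_text(&mut self, parent: &Self::Handle, text: &str);
}

/// One possible sink: nodes live in flat vectors and handles are indices.
pub struct ArenaSink {
    pub names: Vec<String>,        // names[i] is the label of node i
    pub children: Vec<Vec<usize>>, // children[i] lists the children of node i
    pub text: Vec<String>,         // text[i] is the text appended to node i
}

impl ArenaSink {
    /// Create a sink that already contains the document node at index 0.
    pub fn new() -> Self {
        ArenaSink {
            names: vec!["#document".to_string()],
            children: vec![Vec::new()],
            text: vec![String::new()],
        }
    }
}

impl MiniSink for ArenaSink {
    type Handle = usize;

    fn document(&self) -> usize {
        0
    }

    fn create_element(&mut self, name: &str) -> usize {
        self.names.push(name.to_string());
        self.children.push(Vec::new());
        self.text.push(String::new());
        self.names.len() - 1
    }

    fn append_child(&mut self, parent: &usize, child: usize) {
        self.children[*parent].push(child);
    }

    fn append_text(&mut self, parent: &usize, text: &str) {
        self.text[*parent].push_str(text);
    }
}

/// Stand-in for the parser: it drives any sink purely through callbacks.
pub fn build_paragraph<S: MiniSink>(sink: &mut S, text: &str) -> S::Handle {
    let doc = sink.document();
    let p = sink.create_element("p");
    sink.append_text(&p, text);
    sink.append_child(&doc, p.clone());
    p
}
```

A caller would construct `ArenaSink::new()`, pass it to `build_paragraph`, and then inspect the vectors. A different sink could build reference-counted nodes or a browser DOM without any change to the driver.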

html5ever exclusively uses UTF-8 to represent strings. In the future it will support other document encodings (and UCS-2 document.write) by converting input.

The code is cross-referenced with the WHATWG syntax spec, and eventually we will have a way to present code and spec side-by-side.

html5ever builds against the official stable releases of Rust, though some optimizations are only supported on nightly releases.

html5ever's Issues

Did not compile with latest rustc

➜ parser git:(master) ✗ cargo build
Compiling html5ever v0.0.0 (https://github.com/kmcallister/html5ever#974b5ac1)
/Users/mac/.cargo/git/checkouts/html5ever-122db04116ffdbdc/master/src/tree_builder/actions.rs:41:1: 43:2 error: the parameter type Handle may not live long enough; consider adding an explicit lifetime bound Handle:'a...
/Users/mac/.cargo/git/checkouts/html5ever-122db04116ffdbdc/master/src/tree_builder/actions.rs:41 pub struct ActiveFormattingIter<'a, Handle> {
/Users/mac/.cargo/git/checkouts/html5ever-122db04116ffdbdc/master/src/tree_builder/actions.rs:42 iter: Rev<Enumerate<slice::Items<'a, FormatEntry>>>,
/Users/mac/.cargo/git/checkouts/html5ever-122db04116ffdbdc/master/src/tree_builder/actions.rs:43 }
/Users/mac/.cargo/git/checkouts/html5ever-122db04116ffdbdc/master/src/tree_builder/actions.rs:41:1: 43:2 note: ...so that the reference type core::slice::Items<'a,tree_builder::types::FormatEntry<Handle>> does not outlive the data it points at
/Users/mac/.cargo/git/checkouts/html5ever-122db04116ffdbdc/master/src/tree_builder/actions.rs:41 pub struct ActiveFormattingIter<'a, Handle> {
/Users/mac/.cargo/git/checkouts/html5ever-122db04116ffdbdc/master/src/tree_builder/actions.rs:42 iter: Rev<Enumerate<slice::Items<'a, FormatEntry>>>,
/Users/mac/.cargo/git/checkouts/html5ever-122db04116ffdbdc/master/src/tree_builder/actions.rs:43 }
error: aborting due to previous error
Could not compile html5ever.

➜ parser git:(master) ✗ rustc -v
rustc 0.12.0-nightly (9a2286d3a 2014-10-03 07:33:26 +0000)

Accept raw bytes

Reads from the network will not necessarily align with UTF-8 codepoint boundaries.
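
As a rough illustration of the problem, the sketch below buffers the incomplete tail of each chunk until the rest of the character arrives. `Utf8Chunker` is a hypothetical helper, not part of html5ever, and replacing invalid bytes with U+FFFD is an assumption about the desired policy. It uses only `std::str::from_utf8` and the `valid_up_to` / `error_len` methods of `Utf8Error`.

```rust
use std::str;

/// Hypothetical helper: holds back the tail of a chunk that ends mid-character.
#[derive(Default)]
pub struct Utf8Chunker {
    carry: Vec<u8>, // bytes of an incomplete trailing sequence
}

impl Utf8Chunker {
    /// Decode as much of `chunk` as possible, keeping an incomplete tail for later.
    pub fn feed(&mut self, chunk: &[u8]) -> String {
        let mut bytes = std::mem::take(&mut self.carry);
        bytes.extend_from_slice(chunk);

        let mut out = String::new();
        let mut rest: &[u8] = &bytes;
        loop {
            match str::from_utf8(rest) {
                Ok(s) => {
                    out.push_str(s);
                    rest = &[];
                    break;
                }
                Err(e) => {
                    let (valid, after) = rest.split_at(e.valid_up_to());
                    // `valid` was just checked by from_utf8, so this cannot fail.
                    out.push_str(str::from_utf8(valid).unwrap());
                    match e.error_len() {
                        // Definitely invalid bytes: substitute and continue.
                        Some(n) => {
                            out.push('\u{FFFD}');
                            rest = &after[n..];
                        }
                        // Input ended mid-character: wait for the next chunk.
                        None => {
                            rest = after;
                            break;
                        }
                    }
                }
            }
        }
        self.carry = rest.to_vec();
        out
    }
}
```

Feeding `b"caf\xC3"` and then `b"\xA9!"` yields `"caf"` followed by `"é!"`, because the lone `0xC3` byte is held back until its continuation byte arrives.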

C API

We should have a C API. Then any language can call into a high-performance, memory-safe HTML parser.

Implement the "XML5" syntax

A parser for well-formed XML 1.0 is a very different beast from an HTML parser. But there's this XML5 spec that is quite similar to HTML parsing and could re-use a lot of the machinery, if not the actual parse rules. Servo could potentially use this for all XML parts of the Web platform. See servo/servo#3319.

Document the C API

There are lots of details such as pointer ownership and character encoding.

Do fewer linear searches of element stacks

e.g. the open elements or active formatting elements. We search these when looking for elements in various scopes, and #77 adds more.

We could track open elements with a bitvector for each scope. Checking if one of a set of elements is open in a particular scope would be a simple bitmask test. We can choose bit patterns to match the static atom indices as well.

We should find an abstraction that makes it hard to screw up the book-keeping.
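
One possible shape for such an abstraction, as a sketch only: keep a second stack of bitmasks in lockstep with the element stack, where each mask records which elements are visible from that depth. The element ids, the abbreviated boundary list, and the `OpenElements` type are all hypothetical, only a single scope is modeled, and ids must stay below 64.

```rust
/// Hypothetical element ids; a real implementation might reuse static atom indices.
pub const HTML: u8 = 0;
pub const TABLE: u8 = 1;
pub const P: u8 = 2;
pub const B: u8 = 3;

/// Elements that terminate the scope in this sketch (an abbreviated list).
const SCOPE_BOUNDARIES: u64 = (1u64 << HTML) | (1u64 << TABLE);

#[derive(Default)]
pub struct OpenElements {
    elems: Vec<u8>,
    /// in_scope[i] is the set of ids reachable from the top when elems[i] is the top.
    in_scope: Vec<u64>,
}

impl OpenElements {
    pub fn push(&mut self, elem: u8) {
        let bit = 1u64 << elem;
        let below = self.in_scope.last().copied().unwrap_or(0);
        // A boundary element hides everything beneath it; others add to the set.
        let mask = if bit & SCOPE_BOUNDARIES != 0 { bit } else { below | bit };
        self.elems.push(elem);
        self.in_scope.push(mask);
    }

    /// Popping needs no recomputation: the previous mask is still on the stack.
    pub fn pop(&mut self) -> Option<u8> {
        self.in_scope.pop();
        self.elems.pop()
    }

    /// True if any element in `set` is in scope; a single bitmask test.
    pub fn any_in_scope(&self, set: u64) -> bool {
        self.in_scope.last().map_or(false, |mask| mask & set != 0)
    }
}
```

Because both stacks are only modified together inside `push` and `pop`, callers cannot let the masks drift out of sync with the elements. Removing an element from the middle of the stack, which the tree builder sometimes does, would require recomputing the masks above it and is not covered here.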

Support document.write

See servo/servo#3704.

The argument to document.write is a sequence of UCS-2 code units and we need a way to interface this with the UTF-8 parser. My plan is:

(Edit: Largely superseded by this proposal)

  • Convert to UTF-8 as soon as possible.
  • Convert invalid surrogate sequences to U+FFFD 'REPLACEMENT CHARACTER'. This is a deviation from the spec, but nobody has objected strongly in the course of various discussions. There was even talk of amending the spec to allow this behavior, since it's currently written under the assumption that all parsers use UCS-2 natively.
  • If a document.write input ends with a leading surrogate, we can't convert it yet, so save this single u16 in the BufferQueue alongside the UTF-8 buffers.
  • If a document.write input starts with a trailing surrogate, and there's a saved leading surrogate in the BufferQueue, then replace both with the appropriate Unicode character as UTF-8.
  • If the parser receives any other input and there's a saved leading surrogate, drop the saved surrogate and prepend U+FFFD to the input. (This means that a script split an invalid surrogate sequence across multiple document.write calls, or wrote a lone leading surrogate and then finished.)
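
The plan above can be sketched roughly as follows. `Utf16Input` is a hypothetical stand-in for the relevant part of the BufferQueue, and returning a fresh `String` per call is a simplification; only `std::char::decode_utf16` and `REPLACEMENT_CHARACTER` come from the standard library.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Hypothetical stand-in for the part of the BufferQueue that remembers a
/// leading surrogate left over from the previous document.write call.
#[derive(Default)]
pub struct Utf16Input {
    pending_lead: Option<u16>,
}

fn is_leading_surrogate(unit: u16) -> bool {
    (0xD800..=0xDBFF).contains(&unit)
}

impl Utf16Input {
    /// Handle one document.write argument, given as UTF-16 code units.
    pub fn write(&mut self, units: &[u16]) -> String {
        let mut all: Vec<u16> = Vec::with_capacity(units.len() + 1);
        // Re-attach a saved leading surrogate so it can pair with this input.
        all.extend(self.pending_lead.take());
        all.extend_from_slice(units);

        // A leading surrogate at the very end cannot be converted yet.
        if all.last().copied().map_or(false, is_leading_surrogate) {
            self.pending_lead = all.pop();
        }

        // Any surrogate that is still unpaired becomes U+FFFD.
        decode_utf16(all.iter().copied())
            .map(|r| r.unwrap_or(REPLACEMENT_CHARACTER))
            .collect()
    }

    /// Handle input from any other source (already UTF-8).
    pub fn other_input(&mut self, utf8: &str) -> String {
        let mut out = String::new();
        if self.pending_lead.take().is_some() {
            out.push(REPLACEMENT_CHARACTER);
        }
        out.push_str(utf8);
        out
    }
}
```

Writing `[0x0061, 0xD83D]` and then `[0xDE00, 0x0062]` produces `"a"` followed by U+1F600 and `"b"`, since the saved leading surrogate pairs with the trailing one that opens the second call.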

Make Namespace a newtype around Atom

After discussion with @SimonSapin we concluded that there's no reason for a custom enum. An HTML parser can only output a fixed set of namespaces, but XHTML parsing and DOM manipulation can produce arbitrary namespaces, so Servo must represent namespaces as interned strings anyway.

I think we still want Namespace to be a separate type, not a synonym, so that it's harder to mix up namespaces and other atoms. (Does that mean we should also have e.g. TagName and AttrName newtypes?)
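
A minimal sketch of the newtype idea, under stated assumptions: the `Atom` below is a plain `Rc<str>` stand-in rather than the real interned type from the string-cache crate, and `is_html_element` is a hypothetical helper that exists only to show the type checking.

```rust
use std::rc::Rc;

/// Stand-in for an interned string. It is not actually interned; equality
/// here compares string contents.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Atom(Rc<str>);

impl Atom {
    pub fn new(s: &str) -> Atom {
        Atom(Rc::from(s))
    }
}

/// Distinct wrapper types: same representation, different meaning.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Namespace(pub Atom);

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct TagName(pub Atom);

/// Hypothetical helper. Swapping the first two arguments is a compile error,
/// which it would not be if both were plain atoms.
pub fn is_html_element(ns: &Namespace, name: &TagName, tag: &str) -> bool {
    ns.0 == Atom::new("http://www.w3.org/1999/xhtml") && name.0 == Atom::new(tag)
}
```

With a type synonym instead of a newtype, a call such as `is_html_element(&name, &ns, "div")` would compile and silently return the wrong answer.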

build errors with html5ever as cargo crate

rustc 0.13.0-nightly (6f4c11be3 2014-12-05 20:23:10 +0000)
cargo 0.0.1-pre-nightly (da789a6 2014-11-30 08:14:16 +0000)

Steps to Reproduce

cargo new project --bin
cd project

Add html5ever to Cargo.toml

[dependencies.html5ever]
git = "https://github.com/servo/html5ever"

Add html5ever to main.rs

#![feature(phase)]

extern crate html5ever;
use html5ever::sink::common::{Document, Doctype, Text, Comment, Element};

Attempt build

cargo build
    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling html5ever_macros v0.0.0 (https://github.com/servo/html5ever#e7f74b65)
   Compiling time v0.1.0
   Compiling string_cache_macros v0.0.0 (https://github.com/servo/string-cache#40f25a17)
/Users/ozten/.cargo/git/checkouts/html5ever-1ab8707684fb3258/master/macros/src/named_entities.rs:37:9: 37:21 error: unresolved enum variant, struct or const `Object`
/Users/ozten/.cargo/git/checkouts/html5ever-1ab8707684fb3258/master/macros/src/named_entities.rs:37         json::Object(m) => m,
                                                                                                            ^~~~~~~~~~~~
error: aborting due to previous error
   Compiling phf v0.2.0
/Users/ozten/.cargo/git/checkouts/string-cache-628f0438d3df3ef7/master/macros/src/atom/mod.rs:18:25: 18:30 error: unresolved import `std::slice::Found`. There is no `Found` in `std::slice`
/Users/ozten/.cargo/git/checkouts/string-cache-628f0438d3df3ef7/master/macros/src/atom/mod.rs:18 use std::slice::{Items, Found, NotFound};
                                                                                                                         ^~~~~
/Users/ozten/.cargo/git/checkouts/string-cache-628f0438d3df3ef7/master/macros/src/atom/mod.rs:18:32: 18:40 error: unresolved import `std::slice::NotFound`. There is no `NotFound` in `std::slice`
/Users/ozten/.cargo/git/checkouts/string-cache-628f0438d3df3ef7/master/macros/src/atom/mod.rs:18 use std::slice::{Items, Found, NotFound};
                                                                                                                                ^~~~~~~~
error: aborting due to 2 previous errors
Build failed, waiting for other jobs to finish...
Could not compile `html5ever_macros`.

To learn more, run the command again with --verbose.

I'm new to using cargo, so maybe I'm doing something wrong. The README states that this project tracks nightly Rust. Thanks for your help!

Display code next to spec

Many parts of the parser code closely match sections of the HTML5 syntax spec. We should include machine-readable comments describing this correspondence, and then generate an HTML document with syntax-highlighted Rust code next to spec excerpts.

Build on Rust stable branch

Not for 1.0 necessarily, but at some point in 1.x.

The biggest obstacle is syntax extensions. I have no clever ideas there except running rustc-nightly --pretty expanded and shipping that as the 1.x "source".

We should catalog all of the other obstacles here.

Servo itself will be on nightlies for a looooong time.

Provide source spans for tokens and DOM nodes

We can use something like libsyntax's Span and Spanned types to track positions in the input stream.

The tokenizer will remember its current position and the position at certain events, e.g. start tag, start attribute name. The tree builder will call a tree sink method (with an empty default) to annotate the DOM with span information.

Then we can write a command-line HTML validator with the same output UI as rustc :)

Note that eventually it will be possible for a single document's nodes to come from multiple text sources, e.g. with document.write.
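
To make the idea concrete, here is a small sketch of span tracking. The `Span` and `Spanned` types are modeled loosely on the concept, not on libsyntax's actual definitions, and `tag_spans` is a toy scanner rather than an HTML tokenizer. It assumes a single text source, so the multi-source case mentioned above is out of scope.

```rust
/// Half-open byte range [lo, hi) into the input.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Span {
    pub lo: usize,
    pub hi: usize,
}

/// A value together with the input range it came from.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Spanned<T> {
    pub node: T,
    pub span: Span,
}

/// Toy scanner: record the position of every `<...>` run in the input.
pub fn tag_spans(input: &str) -> Vec<Spanned<String>> {
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, b) in input.bytes().enumerate() {
        match (b, start) {
            // Remember the position at the "start tag" event.
            (b'<', None) => start = Some(i),
            (b'>', Some(lo)) => {
                out.push(Spanned {
                    node: input[lo..=i].to_string(),
                    span: Span { lo, hi: i + 1 },
                });
                start = None;
            }
            _ => {}
        }
    }
    out
}

/// 1-based (line, column) of a byte offset, as a diagnostic would print it.
/// The column counts bytes, and `offset` must lie on a character boundary.
pub fn line_col(input: &str, offset: usize) -> (usize, usize) {
    let before = &input[..offset];
    let line = before.matches('\n').count() + 1;
    let col = before.rfind('\n').map_or(offset, |nl| offset - nl - 1) + 1;
    (line, col)
}
```

A validator could combine the two: take the span stored on a token or DOM node, convert `span.lo` with `line_col`, and print the location in front of the message.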

Implement a parent-ownership DOM

This would be another built-in TreeSink, suitable for simple static parsing tasks, and potentially faster than RcDom.

During parsing, the TreeSink operates on

type TentativeNodePtr = *mut UnsafeCell<TentativeNode>;

struct TentativeNode {
    node: NodeEnum,
    parent: TentativeNodePtr,
    children: Vec<TentativeNodePtr>,
    // plus kind opt-out markers?
}

The TentativeNodes are allocated and owned by the TreeSink itself. They all live as long as the TreeSink does, so reference counting is not necessary.

The get_result method consumes the TreeSink and transmutes the root node to

pub struct Node {
    pub node: NodeEnum,
    // Occupies the slot of TentativeNode's parent pointer (usize replaces the old uint type).
    _parent_not_accessible: usize,
    pub children: Vec<Box<Node>>,
}

thus transferring ownership of each node to its parent, without actually doing anything at runtime. get_result then walks the tree once and destroys any TentativeNodes that didn't make it into the final tree.

We should statically assert that the respective layouts of TentativeNode and Node are really compatible.
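
One way such a static check might look in current Rust, as a sketch: constant assertions on size and alignment. The `NodeEnum` below is a placeholder, `usize` stands in for the older `uint`, and `#[repr(C)]` is added here as an assumption so that field order is fixed. Matching size and alignment is necessary but not sufficient for the transmute to be sound, so this catches gross mismatches only.

```rust
use std::cell::UnsafeCell;
use std::mem::{align_of, size_of};

/// Placeholder payload; the real NodeEnum is not specified in this issue.
pub enum NodeEnum {
    Text(String),
    Element(String),
}

type TentativeNodePtr = *mut UnsafeCell<TentativeNode>;

#[repr(C)]
pub struct TentativeNode {
    node: NodeEnum,
    parent: TentativeNodePtr,
    children: Vec<TentativeNodePtr>,
}

#[repr(C)]
pub struct Node {
    pub node: NodeEnum,
    _parent_not_accessible: usize,
    pub children: Vec<Box<Node>>,
}

// Compile-time checks: the build fails if any of these conditions is false.
const _: () = assert!(size_of::<TentativeNode>() == size_of::<Node>());
const _: () = assert!(align_of::<TentativeNode>() == align_of::<Node>());
const _: () = assert!(size_of::<TentativeNodePtr>() == size_of::<Box<Node>>());
const _: () = assert!(size_of::<TentativeNodePtr>() == size_of::<usize>());
```

If a later change adds a field to one struct but not the other, compilation stops at the first failed assertion instead of producing a corrupt tree at runtime.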
