cloudflare / lol-html

Low output latency streaming HTML parser/rewriter with CSS selector-based API

Home Page: https://crates.io/crates/lol-html

License: BSD 3-Clause "New" or "Revised" License

Rust 86.91% C 12.81% Shell 0.28%

css-selectors html parser rewriting rust stream streaming

lol-html's Introduction

LOL HTML

Low Output Latency streaming HTML rewriter/parser with CSS-selector based API.

It is designed to modify HTML on the fly with minimal buffering. It can quickly handle very large documents, and operate in environments with limited memory resources. More details can be found in the blog post.

The crate serves as a back-end for the HTML rewriting functionality of Cloudflare Workers, but can be used as a standalone library with a convenient API for a wide variety of HTML rewriting/analysis tasks.

Documentation

https://docs.rs/lol_html/

Bindings for other programming languages

C
Lua
Go (unofficial, not coming from Cloudflare)
Ruby (unofficial, not coming from Cloudflare)

Example

Rewrite insecure hyperlinks:

use lol_html::{element, HtmlRewriter, Settings};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut output = vec![];

    let mut rewriter = HtmlRewriter::new(
        Settings {
            element_content_handlers: vec![
                element!("a[href]", |el| {
                    let href = el
                        .get_attribute("href")
                        .expect("href was required")
                        .replace("http:", "https:");

                    el.set_attribute("href", &href)?;

                    Ok(())
                })
            ],
            ..Settings::default()
        },
        |c: &[u8]| output.extend_from_slice(c)
    );

    rewriter.write(b"<div><a href=")?;
    rewriter.write(b"http://example.com>")?;
    rewriter.write(b"</a></div>")?;
    rewriter.end()?;

    assert_eq!(
        String::from_utf8(output)?,
        r#"<div><a href="https://example.com"></a></div>"#
    );
    Ok(())
}

License

BSD licensed. See the LICENSE file for details.

lol-html's Issues

release from master?

Hi! I filed #82 and although I still would love to see an example, it turns out that #69 at least partly addresses my confusion. Is there a timetable for a release from master? Would it be better to make a temporary lol-html-master on crates.io?

Thanks!

Document C API

The C API needs a README and a mention in all other docs.

Require `ContentHandlerError` to implement Send and Sync

I tried using rewrite_str with failure, but the error it returns doesn't implement Send and Sync:

error[E0277]: `(dyn std::error::Error + 'static)` cannot be sent between threads safely
   --> src/utils/html.rs:47:5
    |
47  |     lol_html::rewrite_str(html, settings).map_err(Into::into)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ `(dyn std::error::Error + 'static)` cannot be sent between threads safely
    | 
   ::: /home/joshua/.local/lib/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/convert/mod.rs:288:1
    |
288 | pub trait Into<T>: Sized {
    | ------------------------ required by this bound in `std::convert::Into::into`
    |
    = help: the trait `std::marker::Send` is not implemented for `(dyn std::error::Error + 'static)`
    = note: required because of the requirements on the impl of `std::marker::Send` for `std::ptr::Unique<(dyn std::error::Error + 'static)>`
    = note: required because it appears within the type `std::boxed::Box<(dyn std::error::Error + 'static)>`
    = note: required because it appears within the type `lol_html::rewriter::RewritingError`
    = note: required because of the requirements on the impl of `failure::Fail` for `lol_html::rewriter::RewritingError`
    = note: required because of the requirements on the impl of `std::convert::From<lol_html::rewriter::RewritingError>` for `failure::error::Error`
    = note: required because of the requirements on the impl of `std::convert::Into<failure::error::Error>` for `lol_html::rewriter::RewritingError`

error[E0277]: `(dyn std::error::Error + 'static)` cannot be shared between threads safely
   --> src/utils/html.rs:47:5
    |
47  |     lol_html::rewrite_str(html, settings).map_err(Into::into)
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ `(dyn std::error::Error + 'static)` cannot be shared between threads safely
    | 
   ::: /home/joshua/.local/lib/rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/convert/mod.rs:288:1
    |
288 | pub trait Into<T>: Sized {
    | ------------------------ required by this bound in `std::convert::Into::into`
    |
    = help: the trait `std::marker::Sync` is not implemented for `(dyn std::error::Error + 'static)`
    = note: required because of the requirements on the impl of `std::marker::Sync` for `std::ptr::Unique<(dyn std::error::Error + 'static)>`
    = note: required because it appears within the type `std::boxed::Box<(dyn std::error::Error + 'static)>`
    = note: required because it appears within the type `lol_html::rewriter::RewritingError`
    = note: required because of the requirements on the impl of `failure::Fail` for `lol_html::rewriter::RewritingError`
    = note: required because of the requirements on the impl of `std::convert::From<lol_html::rewriter::RewritingError>` for `failure::error::Error`
    = note: required because of the requirements on the impl of `std::convert::Into<failure::error::Error>` for `lol_html::rewriter::RewritingError`

error: aborting due to 2 previous errors

failure has a workaround for missing Sync (i.e. a mutex), but not for missing Send. It would be great if ContentHandlerError implemented both. It looks like this is a user-provided error, so this would be a breaking change, but I think the increase in flexibility is worth it.

(Yes, I'm aware failure is deprecated, but docs.rs won't be switching any time soon and this is useful even without failure)
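
To make the request concrete, here is a minimal std-only sketch. `HandlerError`, `run_handler` and `run_on_thread` are hypothetical names for illustration, not part of lol-html. It shows that an error boxed as `Box<dyn Error + Send + Sync>` can cross a thread boundary, which a plain `Box<dyn Error>` cannot.

```rust
use std::error::Error;
use std::fmt;
use std::thread;

/// Hypothetical error type standing in for whatever a content handler returns.
#[derive(Debug)]
struct HandlerError(String);

impl fmt::Display for HandlerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "handler failed: {}", self.0)
    }
}

impl Error for HandlerError {}

/// The extra `Send + Sync` bounds are what the issue asks for.
type SendSyncError = Box<dyn Error + Send + Sync>;

/// Hypothetical handler that always fails.
fn run_handler() -> Result<(), SendSyncError> {
    Err(Box::new(HandlerError("bad attribute".to_string())))
}

/// Runs the handler on another thread and returns the error message.
/// This compiles only because the boxed error is `Send`.
fn run_on_thread() -> String {
    let result = thread::spawn(run_handler).join().expect("thread panicked");
    result.unwrap_err().to_string()
}
```

Dropping `+ Send + Sync` from the alias makes `thread::spawn(run_handler)` fail to compile with an E0277 error of the same kind as the one quoted above.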

Get the text content of a HTML element?

If I have the following handler, is it possible to get the text content / innerHTML?

                    element!("title", |el| {
                        let page_title = // some way to get the text content of `el`
                        println!("Page Title: {:?}", page_title);
                        Ok(())
                    })

Return &str from `get_attribute`

Most of the time I'm looking at an attribute, I don't actually need to copy it. But get_attribute() makes a copy anyway. It would be nice to return a borrow instead.

This would also require changing Bytes::as_string() to return a borrow.

Implement WASM-based JS API

Remove parser comparison benchmarks

Benchmarks for html5ever and LazyHTML were needed to see where we stand compared to other parsers. Now that we have this information, we can keep only the cool_thing benchmarks to catch regressions. This will allow us to get rid of the LazyHTML dependency completely.

We'll need to commit previous Criterion results to have reference values for the regression benchmarks.

Speed up text decoding by not decoding chunks that contain only bytes in ASCII range

Currently, we need to decode all the incoming text chunks in the selector scope if a text handler is assigned. This requirement comes from the fact that chunks fed to the rewriter may not contain a full sequence of valid bytes in the given encoding.

However, considering that we operate only with ASCII-compatible encodings, we can perform a check for non-ASCII characters in the parser. If we don't find any, we can just use str::from_utf8_unchecked instead of actual decoding.
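
The fast path described above could look roughly like the following std-only sketch. `decode_chunk` is a hypothetical name, and the slow path uses lossy UTF-8 decoding purely as a stand-in for the rewriter's real encoding-aware decoder.

```rust
use std::borrow::Cow;

/// Decodes one text chunk, skipping real decoding when every byte is ASCII.
///
/// Assumption: the input encoding is ASCII-compatible, and the slow path
/// (lossy UTF-8) is only a placeholder for the actual decoder.
fn decode_chunk(chunk: &[u8]) -> Cow<'_, str> {
    if chunk.is_ascii() {
        // SAFETY: a slice that contains only ASCII bytes is always valid UTF-8.
        Cow::Borrowed(unsafe { std::str::from_utf8_unchecked(chunk) })
    } else {
        // Placeholder slow path for chunks with non-ASCII bytes.
        String::from_utf8_lossy(chunk)
    }
}
```

An ASCII-only chunk is returned as a borrow with no copying. A chunk that ends in the middle of a multi-byte sequence still goes through the slow path, which is where a real decoder would carry state over to the next chunk.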

Unable to find match using selectors with a colon in the attribute value

I'm attempting to replace a Google Fonts stylesheet URL with the contents of the stylesheet. I was using HTMLRewriter about a week ago and it was working fine, but I noticed today, after I published to my worker, that it stopped working as expected even though I had not made any changes that would have made a difference.

So, I started testing out different selectors. This seems to work fine:

rewriter.on(`link[href*="fonts.googleapis.com/css"]`, ...)

But when a colon was added to the attribute value, it stopped working and the handler was not called:

rewriter.on(`link[href*=":fonts.googleapis.com/css"]`, ...)

I was having issues with these two cases as well. This works fine:

rewriter.on(`[href*="fonts.googleapis.com/css?family"]`, ...)

But, this does not:

rewriter.on(`[href*=fonts.googleapis.com/css?family=Lato"]`, ...)

It feels like the last two cases could be an issue with the equals sign.

Implement self-closing flag event in tag scanner

For proper CSS selector matching we need to maintain an open element stack. When the parser parses foreign content (<svg> or <math> elements and their content), the self-closing flag of start tags has an effect on the structure of the DOM tree (it is ignored for regular HTML).

So, currently we have no choice but to switch to the lexer when we are in foreign content, because we need to consume the whole start tag to check the self-closing flag. This leads to unnecessary buffering, and SVG tags tend to be quite big.

We can avoid this situation by providing an additional event in the tag scanner, invoked when the scanner encounters the end of a start tag, that tells whether the self-closing flag is present.

Remove `force_quirks` flag from doctype

We don't use it anywhere besides tests

`element_content_handlers` is difficult to use

The docs suggest using element!, so let's use element!:

    let settings = RewriteStrSettings {
        element_content_handlers: vec![element!("head", head_handler), element!("body", body_handler)],
        ..RewriteStrSettings::default()
    };

    lol_html::rewrite_str(html, settings)

error[E0716]: temporary value dropped while borrowed
  --> src/utils/html.rs:39:40
   |
39 |         element_content_handlers: vec![element!("head", head_handler), element!("body", body_handler)],
   |                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ creates a temporary which is freed while still in use
40 |         ..RewriteStrSettings::default()
41 |     };
   |      - temporary value is freed at the end of this statement
42 | 
43 |     lol_html::rewrite_str(html, settings)
   |                                 -------- borrow later used here
   |
   = note: consider using a `let` binding to create a longer lived value
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

Hmm, that's weird, I don't see any temporaries. Let's try adding a let binding like the compiler suggests:

    let element_content_handlers = vec![element!("head", head_handler), element!("body", body_handler)];
    let settings = RewriteStrSettings {
        element_content_handlers,
        ..RewriteStrSettings::default()
    };

error[E0716]: temporary value dropped while borrowed
  --> src/utils/html.rs:38:41
   |
38 |     let element_content_handlers = vec![element!("head", head_handler), element!("body", body_handler)];
   |                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                 - temporary value is freed at the end of this statement
   |                                         |
   |                                         creates a temporary which is freed while still in use
39 |     let settings = RewriteStrSettings {
40 |         element_content_handlers,
   |         ------------------------ borrow later used here
   |
   = note: consider using a `let` binding to create a longer lived value
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

Ok, it seems element! is creating a temporary: https://docs.rs/lol_html/0.2.0/src/lol_html/rewriter/settings.rs.html#111. Here is the version that works:

    let (head_selector, body_selector) = ("head".parse().unwrap(), "body".parse().unwrap());
    let head = (
        &head_selector,
        ElementContentHandlers::default().element(head_handler),
    );
    let body = (
        &body_selector,
        ElementContentHandlers::default().element(body_handler),
    );
    let settings = RewriteStrSettings {
        element_content_handlers: vec![head, body],
        ..RewriteStrSettings::default()
    };

    lol_html::rewrite_str(html, settings)
}

This is a lot more verbose than the original code!

:nth-child

Hi there,

we're trying to generate CSS selectors automatically and use them with HTMLRewriter (on Cloudflare Workers) / lol-html.
For guaranteed selector generation we need to use :nth-child selectors in some cases. I completely understand that this could be impossible due to the streaming nature of lol-html.
Have you come across such a case (and maybe have a workaround for it)?

Thanks and all the best

Alex

How to get all element attributes

I tried the code below, but it does not return the attributes:

rewriter.on(selector, {
  element(element) {
    for (var key in element.attributes) {
      console.log(key) // it prints only 'next'
    }
  }
})

Refactor C API tests

Split into separate files and move the harness to a header.

Introduce innerHTML handlers

Request extracted from #40. We want innerHTML handlers that allow users to modify the raw HTML of an element, similar to how they are already able to modify the element's text contents with text handlers. See this comment.

Add (unofficial) Go bindings to README

Hello,

Impressed by lol_html months ago, I wrote a Go binding for it here: https://github.com/coolspring8/go-lolhtml.

The Go binding now has documentation and several ported examples, and is ~80% covered by tests. However, the binding has not been tested in the wild, and is a personal "unofficial" work.

Would you like to add it to the "Bindings for other programming languages" part in README?

Restrict HTML emission to specific nodes

Feature request: I'd love the ability to extract the HTML matched by a CSS selector. Currently it seems there's no good way to do so. Perhaps like so:

const extractedHtml = new HTMLRewriter().on('#my-id', {
  element(e) {
    e.extract();
  }
}).transform(response);

Not sure that'd be the most appropriate syntax given the nature of extracting vs. modifying/removing.

More context: https://community.cloudflare.com/t/htmlrewriter-extract-and-serve-single-dom-node/136769

Emit events for closing elements

A scenario that is currently impossible to do is adding child nodes to an element based on the content of the existing child nodes. What I am looking for is similar to the SAX Parser endElement.

As an example, I am currently trying to add certain <head> tags if they aren't already present in the head, but there is no way to signal that we are at the closing </head> tag so I can insert the missing tags before the closing tag is sent.

If all the content was known ahead it would be possible to just output them straight away and remove them from the stream when I encounter the stream versions. But that won't work if there are defaults, but the content is variable.

My current work-around is to swallow all <head> tags and all their child nodes and save them in memory. Then when the body element event comes, I figure out the missing tags, add them & serialise the entire <head> before the entire body.

This makes for brittle, harder to understand code and unnecessarily delays the delivery of the <head>, which potentially contains very performance sensitive information like preload and preconnect headers.

Support :empty pseudo class?

Hi, thanks for this library it is quite impressive and I enjoyed reading the blog post.

I am using it to do some basic rewriting where I need access to the text content of the nodes in order to rewrite. So I want to convert:

<h1>Some text</h1>

Into:

<h1 id="some-text">Some text</h1>

I got it all working by doing two passes: the first with the selector on text elements, buffering the contents into vectors, and the second on the elements to rewrite the attributes. But there is one minor issue: when an element is empty (<h1></h1>), the text handler never fires but the element handler does (which is expected and in many ways correct, as there is no text to parse). I tried amending the selector to include :not(:empty), which in theory would fix the problem, but :empty is not a supported pseudo class. Would it be possible to support the :empty pseudo class?

Thanks!
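
For illustration, the id derivation from this report and an explicit check for the empty-heading case can be sketched without the rewriter. `slugify` and `heading_id` are hypothetical helpers, and treating every non-alphanumeric character as a separator is an assumption.

```rust
/// Turns heading text into an id-friendly slug (hypothetical helper).
/// Assumption: any non-alphanumeric character acts as a word separator.
fn slugify(text: &str) -> String {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty())
        .map(|word| word.to_lowercase())
        .collect::<Vec<_>>()
        .join("-")
}

/// Returns the `id` to set on a heading, or `None` when the buffered text is
/// empty, as happens for `<h1></h1>` where no text handler ever fires.
fn heading_id(buffered_text: &str) -> Option<String> {
    let slug = slugify(buffered_text);
    if slug.is_empty() {
        None
    } else {
        Some(slug)
    }
}
```

With a helper like this, the second pass can skip elements whose buffered text yields `None` instead of relying on `:not(:empty)` in the selector.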

Don't switch parser to lexer if end tag handler is specified

Currently, if an element modification involves rewriting the end tag (e.g. el.set_tag_name), we implicitly create an end tag handler. This causes the parser to run in lexer mode while it searches for the end tag.

This is a problem when the element whose end tag we modify covers the majority of the page's content. E.g., if el is a body element:

el.append("foo", ContentType::Text);

will cause the parser to run in lexer mode for almost the whole page, even though we are interested only in the end tag, to insert content before it.

If we need only an end tag, we can keep the parser running in tag scanner mode and switch it to the lexer only when the current end tag is matched by a selector and we need to invoke a handler for it.

Question about mixing text and element handlers

Hey guys,

I have a case where I need to put an HTML <div> element after or before a <p> element, based on how many words occurred in previous <p> elements.

I am now trying to handle it this way:

    let mut words_count: usize = 0;
    let mut previous_words_count: usize = 0;
    let html = rewrite_str(
        post.description.as_str(),
        RewriteStrSettings {
            element_content_handlers: vec![
                text!("p", |t| {
                    previous_words_count = words_count.clone();
                    words_count += t.as_str().split(" ").count();

                    match html_elems
                        .iter()
                        .find(|x| previous_words_count < x.words_count && x.words_count <= words_count) {
                        None => (),
                        Some(x) => {
                            let platform_elem_string = x.by_platform(platform_str.as_str());
                            if !platform_elem_string.is_empty() {
                                t.after(platform_elem_string, ContentType::Html);
                            }
                        }
                    };

                    Ok(())
                }),
            ],
            ..RewriteStrSettings::default()
        }
    ).unwrap();

Sadly, t.after and t.before in this case insert these elements inside the given element. I need to place them after or before the <p>.
Using the element! macro here would be advisable, but then I cannot count words, right?
Any idea how I can resolve this? Is it even possible at the moment?

Thanks!
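
The threshold logic from the snippet above can be isolated from the rewriter and checked on its own. This is a std-only sketch: `Insert` and `plan_inserts` are hypothetical stand-ins for the issue's `html_elems` and handler body, and each string is assumed to be the complete text of one `<p>`. It does not address where the HTML ends up relative to the `<p>`.

```rust
/// Hypothetical stand-in for one entry of the issue's `html_elems`.
struct Insert {
    words_count: usize,
    html: &'static str,
}

/// For each paragraph, returns the HTML (if any) whose word threshold is
/// crossed while reading that paragraph.
fn plan_inserts(paragraphs: &[&str], inserts: &[Insert]) -> Vec<Option<&'static str>> {
    let mut words_count = 0usize;
    paragraphs
        .iter()
        .map(|text| {
            let previous_words_count = words_count;
            // `split_whitespace` skips repeated spaces, unlike `split(" ")`,
            // which would count empty strings as words.
            words_count += text.split_whitespace().count();
            inserts
                .iter()
                .find(|x| previous_words_count < x.words_count && x.words_count <= words_count)
                .map(|x| x.html)
        })
        .collect()
}
```

Keeping the counting in a plain function like this makes it possible to test the thresholds separately from the question of which handler performs the insertion.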

Is this crate performant if you don't want to rewrite the HTML, just extract data?

I'd like to use this crate for the following tasks:

Extracting all readable text from the <body> of an HTML document (text not inside script/style tags)
Extracting all links from an HTML page

I notice the example in the README demonstrates the rewriting capabilities of this crate, however, I have no need to rewrite HTML, just extract some data. Is this crate still performant for such a task?

Hope this question makes sense.

Get rid of requirement to return failure::Error from content handlers

Currently the return type for content handlers is Result<(), failure::Error>, which forces users to install the failure crate.

We need to get rid of the failure crate, switch to Box<dyn std::error::Error> as the return value, and use https://github.com/dtolnay/thiserror for a convenient Display implementation for our own error.
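
As a rough illustration of the proposed return type, the std-only sketch below uses hypothetical names (`RewriteError`, `parse_port_handler`, `check_selector`). The `Display` implementation is written by hand here; it is the part that thiserror would generate.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical library-side error.
#[derive(Debug)]
enum RewriteError {
    EmptySelector,
}

// Hand-written `Display`; deriving this is what thiserror is for.
impl fmt::Display for RewriteError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RewriteError::EmptySelector => write!(f, "selector must not be empty"),
        }
    }
}

impl Error for RewriteError {}

/// Hypothetical content handler with the proposed return type.
fn parse_port_handler(input: &str) -> Result<(), Box<dyn Error>> {
    // Any std error converts through `?`, with no extra crate needed.
    let _port: u16 = input.parse()?;
    Ok(())
}

/// Hypothetical check that returns the library's own error in a box.
fn check_selector(selector: &str) -> Result<(), Box<dyn Error>> {
    if selector.is_empty() {
        return Err(RewriteError::EmptySelector.into());
    }
    Ok(())
}
```

User code can then return its own error types from handlers without depending on any particular error-handling crate.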

Update `selectors` and `cssparser`

Currently, adding lol-html as a dependency to cargo-deadlinks adds 50 more dependencies (190 -> 240). cargo tree -d shows a large number of duplicate dependencies coming from old versions of selectors and cssparser:

├── cssparser v0.25.9
│   ├── lol_html v0.2.0 (/home/joshua/src/rust/lol-html)
│   └── selectors v0.21.0
│       └── lol_html v0.2.0 (/home/joshua/src/rust/lol-html)
└── selectors v0.21.0 (*)

phf v0.8.0
└── markup5ever v0.10.0
    └── html5ever v0.25.1
        [dev-dependencies]
        └── lol_html v0.2.0 (/home/joshua/src/rust/lol-html)

phf_codegen v0.7.24
└── cssparser-macros v0.3.6
    └── cssparser v0.25.9 (*)
[build-dependencies]
└── selectors v0.21.0 (*)

phf_codegen v0.8.0
[build-dependencies]
└── markup5ever v0.10.0 (*)

phf_generator v0.7.24
└── phf_codegen v0.7.24 (*)

phf_generator v0.8.0
├── phf_codegen v0.8.0 (*)
└── string_cache_codegen v0.5.1
    [build-dependencies]
    └── markup5ever v0.10.0 (*)

phf_shared v0.7.24
├── phf v0.7.24 (*)
├── phf_codegen v0.7.24 (*)
└── phf_generator v0.7.24 (*)

phf_shared v0.8.0
├── phf v0.8.0 (*)
├── phf_codegen v0.8.0 (*)
├── phf_generator v0.8.0 (*)
├── string_cache v0.8.0
│   └── markup5ever v0.10.0 (*)
└── string_cache_codegen v0.5.1 (*)

rand v0.5.6
[dev-dependencies]
└── lol_html v0.2.0 (/home/joshua/src/rust/lol-html)

rand v0.6.5
└── phf_generator v0.7.24 (*)

rand v0.7.3
├── phf_generator v0.8.0 (*)
└── tempfile v3.1.0
    └── cargo-fuzz v0.8.0
        [dev-dependencies]
        └── lol_html v0.2.0 (/home/joshua/src/rust/lol-html)

rand_chacha v0.1.1
└── rand v0.6.5 (*)

rand_chacha v0.2.2
└── rand v0.7.3 (*)

rand_core v0.3.1
├── rand v0.5.6 (*)
├── rand_chacha v0.1.1 (*)
├── rand_hc v0.1.0
│   └── rand v0.6.5 (*)
├── rand_isaac v0.1.1
│   └── rand v0.6.5 (*)
└── rand_xorshift v0.1.1
    └── rand v0.6.5 (*)

rand_core v0.4.2
├── rand v0.6.5 (*)
├── rand_core v0.3.1 (*)
├── rand_jitter v0.1.4
│   └── rand v0.6.5 (*)
├── rand_os v0.1.3
│   └── rand v0.6.5 (*)
└── rand_pcg v0.1.2
    └── rand v0.6.5 (*)

rand_core v0.5.1
├── criterion v0.3.0
│   [dev-dependencies]
│   └── lol_html v0.2.0 (/home/joshua/src/rust/lol-html)
├── rand v0.7.3 (*)
├── rand_chacha v0.2.2 (*)
├── rand_os v0.2.2
│   └── criterion v0.3.0 (*)
├── rand_pcg v0.2.1
│   └── rand v0.7.3 (*)
└── rand_xoshiro v0.3.1
    └── criterion v0.3.0 (*)

rand_os v0.1.3 (*)

rand_os v0.2.2 (*)

rand_pcg v0.1.2 (*)

rand_pcg v0.2.1 (*)

siphasher v0.2.3
└── phf_shared v0.7.24 (*)

siphasher v0.3.3
└── phf_shared v0.8.0 (*)

It would be nice to update these so there aren't multiple duplicate dependencies.

Expose text decoding buffer size in MemorySettings

We can expose the buffer size that we use for decoding text chunks.

A smaller size means the text handler is called more times with smaller values; a bigger size means the text handler is called fewer times with bigger values.

One idea would be to use a LazyCell that will eventually hold the buffer. It integrates with the memory_limiter (which means allocation can fail). If the worker doesn't register any text handler, the buffer and its allocation will be skipped.
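
A minimal std-only sketch of the lazy, fallible allocation idea follows. `TextDecoderBuffer` is a hypothetical name, `std::cell::OnceCell` is used in place of the LazyCell mentioned above, and `Vec::try_reserve_exact` stands in for the memory_limiter check.

```rust
use std::cell::OnceCell;
use std::collections::TryReserveError;

/// Hypothetical holder for the text decoding buffer.
struct TextDecoderBuffer {
    size: usize,
    buffer: OnceCell<Vec<u8>>,
}

impl TextDecoderBuffer {
    /// Records the configured size without allocating anything.
    fn new(size: usize) -> Self {
        Self { size, buffer: OnceCell::new() }
    }

    /// Allocates the buffer on first use; the allocation may fail.
    fn get(&self) -> Result<&Vec<u8>, TryReserveError> {
        if self.buffer.get().is_none() {
            let mut buffer = Vec::new();
            // Stand-in for asking the memory limiter for `size` bytes.
            buffer.try_reserve_exact(self.size)?;
            // Cannot fail: the cell was empty a moment ago and is not shared
            // between threads.
            let _ = self.buffer.set(buffer);
        }
        Ok(self.buffer.get().expect("buffer was just initialised"))
    }

    /// True once a text handler has caused the buffer to be allocated.
    fn is_allocated(&self) -> bool {
        self.buffer.get().is_some()
    }
}
```

If no text handler ever calls `get`, the buffer is never allocated, which matches the behavior described above.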

Can't add custom attribute to <desc> element

AFAIU, this parser is what HTMLRewriter in Cloudflare Workers uses under the hood.
I have a simple worker which transforms the HTML received as the request body and adds a new attribute to each node. It seems that there is a bug, and the attribute change is not applied to the <desc> element.

Example:

class EnumerationElementHandler {
    constructor() {
        this.counter = 0;
    }
    element(element) {
        const c = this.counter++;
        element.setAttribute('data-custom', c.toString())
    }
}

const rewriter = new HTMLRewriter().on("*", new EnumerationElementHandler())

export default {
    async fetch(request) {
        return rewriter.transform(new Response(request.body, {status: 200}));
    },
};

Now, I send a POST request with the following body:

<div><svg viewBox="0 0 460 271.2" width="70px"
    height="41px" aria-labelledby="title desc">
    <title>
    <span>X</span></title>
    <desc><span>Y</span>
    </desc>
    <path id="flare" fill="#fff"
    d="M370.9,150.2l-40.7-24.7c-0.6-0.1-4.4,0.3-6.4-0.7c-1.4-0.7-2.5-1.9-3.2-4c-3.2,0-175.5,0-175.5,0    v91.8h225.8V150.2z"></path>
    
</svg></div>

Output:

<div data-custom="0"><svg viewBox="0 0 460 271.2" width="70px" height="41px" aria-labelledby="title desc" data-custom="1">
    <title data-custom="2">
    <span data-custom="3">X</span></title>
    <desc><span data-custom="4">Y</span>
    </desc>
    <path id="flare" fill="#fff" d="M370.9,150.2l-40.7-24.7c-0.6-0.1-4.4,0.3-6.4-0.7c-1.4-0.7-2.5-1.9-3.2-4c-3.2,0-175.5,0-175.5,0    v91.8h225.8V150.2z" data-custom="5"></path>
    
</svg></div>

As you can see, data-custom attribute is added to every node but <desc>.

Refactor HTMLRewriter Settings to deprecate "try_new"

HTMLRewriter's creation interface is a bit clunky, since it requires calling try_new even if you know that an encoding is valid (for example with Default). Encoding should be checked ahead of time via a wrapper type and try_new should be deprecated and replaced with an infallible new to simplify the API.

See the API guidelines on static parameter enforcement.
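
The wrapper-type idea can be sketched as below, using only std. `ValidEncoding` and `Rewriter` are hypothetical types, and rejecting UTF-16 labels is an assumption made for the example rather than the crate's actual validation rule.

```rust
/// Hypothetical wrapper: holding a value proves the label was validated.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ValidEncoding(&'static str);

impl ValidEncoding {
    /// The only fallible step. Assumption: UTF-16 labels are rejected
    /// because they are not ASCII-compatible.
    fn new(label: &'static str) -> Option<Self> {
        if label.to_ascii_lowercase().starts_with("utf-16") {
            None
        } else {
            Some(Self(label))
        }
    }
}

impl Default for ValidEncoding {
    fn default() -> Self {
        Self("utf-8")
    }
}

/// Hypothetical rewriter that only stores its encoding.
struct Rewriter {
    encoding: ValidEncoding,
}

impl Rewriter {
    /// Infallible: the type of `encoding` already guarantees validity.
    fn new(encoding: ValidEncoding) -> Self {
        Self { encoding }
    }

    fn encoding_label(&self) -> &'static str {
        self.encoding.0
    }
}
```

Callers who use the default never see a `Result` or `Option`; only code that parses an arbitrary label has to handle the failure case.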

Try implement async content handlers

Figure out the performance consequences of making content handlers async

HtmlRewriter::end() should take `self`, not `&mut self`

As documented, both end() and write() will panic if you do anything with the rewriter after you call end(). This can be enforced statically by having end() consume the writer.
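
A small std-only sketch of the consuming `end()` follows. `Rewriter` here is a hypothetical stand-in that only buffers its input, and `rewrite_all` is a hypothetical driver function.

```rust
/// Hypothetical rewriter that only buffers what it is given.
struct Rewriter {
    output: Vec<u8>,
}

impl Rewriter {
    fn new() -> Self {
        Self { output: Vec::new() }
    }

    /// Borrows mutably, so it can be called any number of times.
    fn write(&mut self, chunk: &[u8]) {
        self.output.extend_from_slice(chunk);
    }

    /// Takes `self` by value, so the rewriter cannot be used afterwards.
    fn end(self) -> Vec<u8> {
        self.output
    }
}

/// Feeds all chunks to a rewriter and finishes it.
fn rewrite_all(chunks: &[&[u8]]) -> Vec<u8> {
    let mut rewriter = Rewriter::new();
    for chunk in chunks {
        rewriter.write(chunk);
    }
    let output = rewriter.end();
    // rewriter.write(b"more"); // would not compile: `rewriter` was moved by `end()`
    output
}
```

With this signature, misuse after `end()` becomes a compile-time "use of moved value" error instead of a runtime panic.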

Breaks on AMP pages

I am not sure if this is the right place to report bugs for HTMLRewriter on Workers...

Given this (bare bones and valid) AMP HTML page:

<!doctype html><html ⚡ i-amphtml-layout i-amphtml-no-boilerplate transformed="self;v=1"><head><meta data-auto charset="utf-8"><style amp-runtime i-amphtml-version="012003101714470">html{overflow-x:hidden!important}html.i-amphtml-fie{height:100%!important;width:100%!important}html:not([amp4ads]),html:not([amp4ads]) body{height:auto!important}html:not([amp4ads]) body{margin:0!important}body{-webkit-text-size-adjust:100%;-moz-text-size-adjust:100%;-ms-text-size-adjust:100%;text-size-adjust:100%}html.i-amphtml-singledoc.i-amphtml-embedded{-ms-touch-action:pan-y;touch-action:pan-y}html.i-amphtml-fie>body,html.i-amphtml-singledoc>body{overflow:visible!important}html.i-amphtml-fie:not(.i-amphtml-inabox)>body,html.i-amphtml-singledoc:not(.i-amphtml-inabox)>body{position:relative!important}html.i-amphtml-webview>body{overflow-x:hidden!important;overflow-y:visible!important;min-height:100vh!important}html.i-amphtml-ios-embed-legacy>body{overflow-x:hidden!important;overflow-y:auto!important;position:absolute!important}html.i-amphtml-ios-embed{overflow-y:auto!important;position:static}#i-amphtml-wrapper{overflow-x:hidden!important;overflow-y:auto!important;position:absolute!important;top:0!important;left:0!important;right:0!important;bottom:0!important;margin:0!important;display:block!important}html.i-amphtml-ios-embed.i-amphtml-ios-overscroll,html.i-amphtml-ios-embed.i-amphtml-ios-overscroll>#i-amphtml-wrapper{-webkit-overflow-scrolling:touch!important}#i-amphtml-wrapper>body{position:relative!important;border-top:1px solid transparent!important}#i-amphtml-wrapper+body{visibility:visible}#i-amphtml-wrapper+body .i-amphtml-lightbox-element,#i-amphtml-wrapper+body[i-amphtml-lightbox]{visibility:hidden}#i-amphtml-wrapper+body[i-amphtml-lightbox] .i-amphtml-lightbox-element{visibility:visible}#i-amphtml-wrapper.i-amphtml-scroll-disabled,.i-amphtml-scroll-disabled{overflow-x:hidden!important;overflow-y:hidden!important}amp-instagram{padding:54px 0px 
0px!important;background-color:#fff}amp-iframe iframe{box-sizing:border-box!important}[amp-access][amp-access-hide]{display:none}[subscriptions-dialog],body:not(.i-amphtml-subs-ready) [subscriptions-action],body:not(.i-amphtml-subs-ready) [subscriptions-section]{display:none!important}amp-experiment,amp-live-list>[update],amp-share-tracking{display:none}.i-amphtml-jank-meter{position:fixed;background-color:rgba(232,72,95,0.5);bottom:0;right:0;color:#fff;font-size:16px;z-index:1000;padding:5px}amp-list[resizable-children]>.i-amphtml-loading-container.amp-hidden{display:none!important}amp-list[load-more] [load-more-button],amp-list[load-more] [load-more-end],amp-list[load-more] [load-more-failed],amp-list[load-more] [load-more-loading]{display:none}amp-story-page,amp-story[standalone]{min-height:1px!important;display:block!important;height:100%!important;margin:0!important;padding:0!important;overflow:hidden!important;width:100%!important}amp-story[standalone]{background-color:#202125!important;position:relative!important}amp-story-page{background-color:#757575}amp-story .amp-active>div{display:none!important}amp-story-page:not(:first-of-type):not([distance]):not([active]){transform:translateY(1000vh)!important}amp-autocomplete{position:relative!important;display:inline-block!important}amp-autocomplete>input,amp-autocomplete>textarea{padding:0.5rem;border:1px solid rgba(0,0,0,0.33)}.i-amphtml-autocomplete-results,amp-autocomplete>input,amp-autocomplete>textarea{font-size:1rem;line-height:1.5rem}[amp-fx^=fly-in]{visibility:hidden}
/*# sourceURL=/css/ampdoc.css*/[hidden]{display:none!important}.i-amphtml-element{display:inline-block}.i-amphtml-blurry-placeholder{transition:opacity 0.3s cubic-bezier(0.0,0.0,0.2,1)!important}[layout=nodisplay]:not(.i-amphtml-element){display:none!important}.i-amphtml-layout-fixed,[layout=fixed][width][height]:not(.i-amphtml-layout-fixed){display:inline-block;position:relative}.i-amphtml-layout-responsive,[layout=responsive][width][height]:not(.i-amphtml-layout-responsive),[width][height][sizes]:not(.i-amphtml-layout-responsive){display:block;position:relative}.i-amphtml-layout-intrinsic{display:inline-block;position:relative;max-width:100%}.i-amphtml-intrinsic-sizer{max-width:100%;display:block!important}.i-amphtml-layout-container,.i-amphtml-layout-fixed-height,[layout=container],[layout=fixed-height][height]{display:block;position:relative}.i-amphtml-layout-fill,[layout=fill]:not(.i-amphtml-layout-fill){display:block;overflow:hidden!important;position:absolute;top:0;left:0;bottom:0;right:0}.i-amphtml-layout-flex-item,[layout=flex-item]:not(.i-amphtml-layout-flex-item){display:block;position:relative;-ms-flex:1 1 auto;flex:1 1 auto}.i-amphtml-layout-fluid{position:relative}.i-amphtml-layout-size-defined{overflow:hidden!important}.i-amphtml-layout-awaiting-size{position:absolute!important;top:auto!important;bottom:auto!important}i-amphtml-sizer{display:block!important}.i-amphtml-blurry-placeholder,.i-amphtml-fill-content{display:block;height:0;max-height:100%;max-width:100%;min-height:100%;min-width:100%;width:0;margin:auto}.i-amphtml-layout-size-defined .i-amphtml-fill-content{position:absolute;top:0;left:0;bottom:0;right:0}.i-amphtml-layout-intrinsic 
.i-amphtml-sizer{max-width:100%}.i-amphtml-replaced-content,.i-amphtml-screen-reader{padding:0!important;border:none!important}.i-amphtml-screen-reader{position:fixed!important;top:0px!important;left:0px!important;width:4px!important;height:4px!important;opacity:0!important;overflow:hidden!important;margin:0!important;display:block!important;visibility:visible!important}.i-amphtml-screen-reader~.i-amphtml-screen-reader{left:8px!important}.i-amphtml-screen-reader~.i-amphtml-screen-reader~.i-amphtml-screen-reader{left:12px!important}.i-amphtml-screen-reader~.i-amphtml-screen-reader~.i-amphtml-screen-reader~.i-amphtml-screen-reader{left:16px!important}.i-amphtml-unresolved{position:relative;overflow:hidden!important}.i-amphtml-select-disabled{-webkit-user-select:none!important;-moz-user-select:none!important;-ms-user-select:none!important;user-select:none!important}.i-amphtml-notbuilt,[layout]:not(.i-amphtml-element){position:relative;overflow:hidden!important;color:transparent!important}.i-amphtml-notbuilt:not(.i-amphtml-layout-container)>*,[layout]:not([layout=container]):not(.i-amphtml-element)>*{display:none}.i-amphtml-ghost{visibility:hidden!important}.i-amphtml-element>[placeholder],[layout]:not(.i-amphtml-element)>[placeholder]{display:block}.i-amphtml-element>[placeholder].amp-hidden,.i-amphtml-element>[placeholder].hidden{visibility:hidden}.i-amphtml-element:not(.amp-notsupported)>[fallback],.i-amphtml-layout-container>[placeholder].amp-hidden,.i-amphtml-layout-container>[placeholder].hidden{display:none}.i-amphtml-layout-size-defined>[fallback],.i-amphtml-layout-size-defined>[placeholder]{position:absolute!important;top:0!important;left:0!important;right:0!important;bottom:0!important;z-index:1}.i-amphtml-notbuilt>[placeholder]{display:block!important}.i-amphtml-hidden-by-media-query{display:none!important}.i-amphtml-element-error{background:red!important;color:#fff!important;position:relative!important}.i-amphtml-element-error:before{content:attr(error-messa
ge)}i-amp-scroll-container,i-amphtml-scroll-container{position:absolute;top:0;left:0;right:0;bottom:0;display:block}i-amp-scroll-container.amp-active,i-amphtml-scroll-container.amp-active{overflow:auto;-webkit-overflow-scrolling:touch}.i-amphtml-loading-container{display:block!important;pointer-events:none;z-index:1}.i-amphtml-notbuilt>.i-amphtml-loading-container{display:block!important}.i-amphtml-loading-container.amp-hidden{visibility:hidden}.i-amphtml-element>[overflow]{cursor:pointer;position:relative;z-index:2;visibility:hidden}.i-amphtml-element>[overflow].amp-visible{visibility:visible}template{display:none!important}.amp-border-box,.amp-border-box *,.amp-border-box :after,.amp-border-box :before{box-sizing:border-box}amp-pixel{display:none!important}amp-analytics,amp-story-auto-ads{position:fixed!important;top:0!important;width:1px!important;height:1px!important;overflow:hidden!important;visibility:hidden}html.i-amphtml-fie>amp-analytics{position:initial!important}[visible-when-invalid]:not(.visible),amp-list [fetch-error],form [submit-error],form [submit-success],form [submitting]{display:none}amp-accordion{display:block!important}amp-accordion>section{float:none!important}amp-accordion>section>*{float:none!important;display:block!important;overflow:hidden!important;position:relative!important}amp-accordion,amp-accordion>section{margin:0}amp-accordion>section>:last-child{display:none!important}amp-accordion>section[expanded]>:last-child{display:block!important}</style><meta data-auto name="viewport" content="width=device-width,minimum-scale=1,initial-scale=1"><link rel="preload" href="https://cdn.ampproject.org/v0.js" as="script"><script data-auto async src="https://cdn.ampproject.org/v0.js"></script><title>Thanos</title><link data-auto rel="canonical" href="."></head><body>
<h1>test</h1>
<div><p>asdf</p></div>
</body></html>

This will not detect the <div> with: HTMLRewriter().on('div', new ElementHandler())

I guess it has something to do with the <style> tag; however, the above 👆 is the output of AMP Optimizer, which is the last thing one would run before sending the response to the client.

Add blanket `impl<T: Write> OutputSink for T`

Currently, HtmlRewriter::try_new takes only closures and nothing else. For the common case of 'I have a buffer to write to' this causes the code to be more complicated than necessary:

    let mut output = vec![];

    // NOTE: never panics because encoding is always "utf-8".
    let mut rewriter = HtmlRewriter::try_new(settings.into(), |c: &[u8]| {
        output.extend_from_slice(c);
    })
    .unwrap();

If OutputSink took any io::Write impl, it could be simpler:

    let mut output = vec![];

    // NOTE: never panics because encoding is always "utf-8".
    let mut rewriter = HtmlRewriter::try_new(settings.into(), &mut output).unwrap();
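
A minimal sketch of how this could look, using only the standard library. The `OutputSink` trait below is a local stand-in whose shape (a single `handle_chunk` method plus a blanket impl for closures) is an assumption about the crate, and `WriteSink` is a hypothetical name. A second blanket `impl<T: Write> OutputSink for T` would overlap with the closure impl under Rust's coherence rules, so the sketch uses a newtype adapter instead:

```rust
use std::io::{self, Write};

/// Local stand-in for the crate's `OutputSink` trait (assumed shape).
pub trait OutputSink {
    fn handle_chunk(&mut self, chunk: &[u8]);
}

/// Closures keep working as sinks, as in the first snippet above.
impl<F: FnMut(&[u8])> OutputSink for F {
    fn handle_chunk(&mut self, chunk: &[u8]) {
        self(chunk);
    }
}

/// Hypothetical adapter that forwards every chunk to an `io::Write`.
pub struct WriteSink<W: Write> {
    writer: W,
    // `handle_chunk` cannot return an error, so the first one is stored here.
    error: Option<io::Error>,
}

impl<W: Write> WriteSink<W> {
    pub fn new(writer: W) -> Self {
        WriteSink { writer, error: None }
    }

    /// Returns the writer, or the first I/O error seen while writing.
    pub fn finish(self) -> io::Result<W> {
        match self.error {
            Some(e) => Err(e),
            None => Ok(self.writer),
        }
    }
}

impl<W: Write> OutputSink for WriteSink<W> {
    fn handle_chunk(&mut self, chunk: &[u8]) {
        // Stop writing after the first failure so the error is not masked.
        if self.error.is_none() {
            if let Err(e) = self.writer.write_all(chunk) {
                self.error = Some(e);
            }
        }
    }
}
```

With something like this, a `Vec<u8>`, a `File`, or a `TcpStream` could be wrapped in `WriteSink::new(...)` and passed where a closure is expected today.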

Advice on using lol-html purely as html parser

Would it be possible to disable the rewriting functionality and use this library purely as a fast HTML parser/selector?
I could simply ignore the output, but some support from the library for this use case would be nice to have.

Add license information to the README

Cache values and return `&str` in rewritable unit getters

Remove leftover workaround for Rc<RefCell<FnOnce()>>

https://github.com/cloudflare/cool-thing/blob/2b64a1d094ff3c9446926b549439ad156c50af00/src/rewritable_units/element.rs#L467-L484

Spans for elements

For deadlinks/cargo-deadlinks#14 I want to provide the line number of a broken link. Currently, there's no way to get that from an Element.
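
To illustrate what a caller would do with span information, here is a minimal sketch that maps a byte offset to a line and column. It assumes the library could report the byte offset of an element's start tag, which the issue says is not available today. `LineIndex` and `line_col` are hypothetical names, and the whole input is assumed to be in memory:

```rust
/// Maps byte offsets in a document to 1-based (line, column) pairs.
/// Columns are counted in bytes, not characters.
pub struct LineIndex {
    /// Byte offset at which each line starts; the first entry is always 0.
    line_starts: Vec<usize>,
}

impl LineIndex {
    pub fn new(source: &[u8]) -> Self {
        let mut line_starts = vec![0];
        for (i, &byte) in source.iter().enumerate() {
            if byte == b'\n' {
                // The next line starts right after the newline.
                line_starts.push(i + 1);
            }
        }
        LineIndex { line_starts }
    }

    pub fn line_col(&self, offset: usize) -> (usize, usize) {
        // Find the last line start that is <= offset.
        let line = match self.line_starts.binary_search(&offset) {
            Ok(exact) => exact,
            // Never 0 here, because `line_starts[0]` is 0.
            Err(insert_at) => insert_at - 1,
        };
        (line + 1, offset - self.line_starts[line] + 1)
    }
}
```

For a streaming input, the same idea would need a running newline counter instead of a prebuilt index.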

Removing the doctype

I'm not sure whether this is already possible and I'm just missing it, but it would be great to be able to remove the document's doctype.

Support Reading of HTML without Rewriting

In a tool I'm writing that bundles JS, I have to make multiple passes over the HTML; it's necessary to get a list of all the <script> elements with a src attribute so that I can bundle them before actually doing the rewriting. This forces me to run rewrite_str twice, and in the first pass, I don't even modify the HTML. This is actually doing the job just fine, but I can't help but think there should be a way to use the read-only features LOL HTML has without duplicating the input HTML and then throwing out the result.

Adjacent Sibling Combinator

Is there any plan to support the adjacent sibling combinator? I'm working on something with an HTML-to-text component, and selecting adjacent <br> tags would be immensely useful for this.

Rename all tokens to rewritable units, rename lexeme to token

The current terminology might be a bit confusing. So, let's rename tokens to rewritable units (StartTag will still be a rewritable unit, just not exposed in the public API).

This allows us to rename lexeme to token.

Informational Issue: `lol-async` crate

Hey there, thanks for lol-html! Thought I'd open an issue and let you know about a crate I created for using lol-html in an async Rust context: lol-async. Happy to take any suggestions for improvements! Feel free to close this issue immediately 👋

Implement latency decreasing heuristics

Currently we have a latency problem in the rewriter: we try to do as little work as possible in the parser and flush content only when we produce tokens. This introduces latency: if a chunk contains no content that we need to rewrite, we wait for the whole chunk to be parsed before flushing it.

We can use a few heuristics to improve this: for the first chunk, flush on </head>; for the remaining chunks, flush based on the time elapsed since the last flush.

We need to make all these heuristics adjustable via settings, so that we can experiment with different parameters based on RUM. @sejoker is working on that at the moment.
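
The two heuristics could be sketched roughly as follows. All names here (`FlushSettings`, `FlushPolicy`, `should_flush`, `chunk_finished`) are hypothetical and not part of the crate. The byte search for `</head>` stands in for whatever signal the parser would actually provide:

```rust
use std::time::Duration;

/// Hypothetical tunable settings for the flush heuristics.
pub struct FlushSettings {
    /// In the first chunk, flush once `</head>` has been parsed.
    pub flush_on_head_end: bool,
    /// In later chunks, flush once this much time has passed since the last flush.
    pub max_flush_interval: Duration,
}

pub struct FlushPolicy {
    settings: FlushSettings,
    in_first_chunk: bool,
    head_flushed: bool,
}

impl FlushPolicy {
    pub fn new(settings: FlushSettings) -> Self {
        FlushPolicy { settings, in_first_chunk: true, head_flushed: false }
    }

    /// `parsed` is the part of the current chunk parsed so far;
    /// `since_last_flush` is the time elapsed since the previous flush.
    pub fn should_flush(&mut self, parsed: &[u8], since_last_flush: Duration) -> bool {
        if self.in_first_chunk {
            // First chunk: flush exactly once, when `</head>` shows up.
            if self.settings.flush_on_head_end
                && !self.head_flushed
                && contains_head_end(parsed)
            {
                self.head_flushed = true;
                return true;
            }
            return false;
        }
        // Remaining chunks: purely time-based.
        since_last_flush >= self.settings.max_flush_interval
    }

    /// Call when a whole input chunk has been consumed.
    pub fn chunk_finished(&mut self) {
        self.in_first_chunk = false;
    }
}

/// Case-insensitive byte search; a real parser would use its own tag events.
fn contains_head_end(bytes: &[u8]) -> bool {
    bytes.windows(7).any(|w| w.eq_ignore_ascii_case(b"</head>"))
}
```

Passing the elapsed time in as an argument keeps the policy deterministic, which makes it easier to try different parameter values in tests.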

Take any `Into<Settings>` for rewrite_str()

Currently, rewrite_str only takes a RewriteStrSettings, so you can't e.g. configure the memory limit like you could with the full settings. It's possible to use HtmlRewriter directly, but this is a lot more code and duplicates almost everything in rewrite_str. It would be better to take settings: impl Into<Settings> in rewrite_str(), which makes the API a lot more flexible.
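
A rough sketch of the proposed signature, using simplified stand-in types rather than the crate's real ones. The fields shown (`max_allowed_memory_usage`, `strict`) and the pass-through body are assumptions for illustration only:

```rust
/// Stand-in for the full settings (only two illustrative fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Settings {
    pub max_allowed_memory_usage: usize,
    pub strict: bool,
}

/// Stand-in for the reduced settings that `rewrite_str` accepts today.
#[derive(Debug, Clone, PartialEq)]
pub struct RewriteStrSettings {
    pub strict: bool,
}

impl From<RewriteStrSettings> for Settings {
    fn from(settings: RewriteStrSettings) -> Self {
        Settings {
            // The reduced settings have no memory limit, so default to "unlimited".
            max_allowed_memory_usage: usize::MAX,
            strict: settings.strict,
        }
    }
}

/// Accepts either settings type. The body is a placeholder that only
/// enforces the memory limit and returns the input unchanged.
pub fn rewrite_str(html: &str, settings: impl Into<Settings>) -> Result<String, String> {
    let settings: Settings = settings.into();
    if html.len() > settings.max_allowed_memory_usage {
        return Err("memory limit exceeded".to_string());
    }
    Ok(html.to_string())
}
```

Existing callers that pass `RewriteStrSettings` would keep compiling, while callers that need a memory limit could pass the full `Settings` directly.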

Add profiling scripts

Generating selector from document and element

Hi,

Would it be possible in any way to generate the selector itself using lol-html? The input would be:
a) the document that will be processed by HTMLRewriter in a worker later on;
b) the element that needs to be selected (as generated by 'click' document event listener);

The output needed would be the (most effective) selector to use for selecting this element when the worker executes.

Thanks!

P.S. I am not familiar with the internals of lol-html, so this is a bit of a long-shot question. Thanks in advance for looking into it and replying.

[Feature Request] Allow memory introspection of HtmlRewriter

Allowing users to see the memory that HtmlRewriter used in the last rewrite cycle (.end() to .end(), basically) would be fantastic for applications that want to know more about their resource usage.

I got the idea for this from docs.rs/#930 (comment). Having this monitoring would allow us to adjust our memory cap to whatever is realistic, since our file sizes vary widely.

Rewriter reuse example?

I'm trying to understand how and if users are intended to express a set of element handlers to be reused on multiple inputs, across multiple threads. In particular, I'm interested in applying a given transformation to a bunch of html byte streams, but the lifetimes and borrows in the public interface make it difficult to see how to achieve that. Is there example code out there, or a part of the documentation that's relevant to this usage pattern?

Thanks!

publish up-to-date version on crates

It looks like some of the code changes made as far back as 26 Nov (at least since 1aedf36) have never made it to the version on crates.io. It might be a good idea to update the version hosted there.
