
rust-html2text's People

Contributors

alatiera, dependabot[bot], djahandarie, jugglerchris, kpagacz, liushuyu, nurelin, robinkrahl, sardinefish, sftse, sgtatham, spencerwi, strawberry-choco, zakaluka

rust-html2text's Issues

Support definition lists

It would be great if you could add support for definition lists:

<dl>
  <dt>Topic</dt>
  <dd>Definition</dd>
</dl>
Topic
Definition

The topic could be rendered bold and the definition could be indented by two or four spaces.
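As a rough illustration of the requested output, here is a minimal sketch that works on already-extracted (term, definition) pairs. The function name `render_definition_list` is made up for this example and is not part of html2text; `**` merely stands in for whatever "bold" means in the chosen decorator.

```rust
/// Render (term, definition) pairs as plain text.
/// Hypothetical helper: terms are wrapped in `**` as a stand-in for bold,
/// and every line of the definition is indented by `indent` spaces.
fn render_definition_list(items: &[(&str, &str)], indent: usize) -> String {
    let pad = " ".repeat(indent);
    let mut out = String::new();
    for (term, definition) in items {
        out.push_str("**");
        out.push_str(term);
        out.push_str("**\n");
        for line in definition.lines() {
            out.push_str(&pad);
            out.push_str(line);
            out.push('\n');
        }
    }
    out
}
```

Calling it with `[("Topic", "Definition")]` and an indent of 2 yields the term on one line and the definition indented on the next.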

Decoupling the html2text rendering pipeline

I’ve spent some time using html2text, reading its source code and even writing small patches. Still, I haven’t really grasped the complete rendering process that html2text performs. At the same time, I have some specific requirements like #27 or #36 that cannot be realized with html2text and maybe don’t even belong in a generic HTML rendering library.

Therefore, I am wondering: Would it be possible and would it make sense to decouple the html2text rendering pipeline into steps that can be customized by the user? This would make it easier to understand the rendering process, and it might make it possible to implement some of the requirements I mentioned earlier without having to re-implement the entire rendering stack.

From my point of view, these are the steps of the rendering pipeline (while I’m quite confident that steps 1–3 are correct, I’m not really sure about 4 and 5.):

  1. Parsing the HTML document (src/lib.rs).
  2. Transforming the HTML document into a render tree (src/lib.rs).
  3. Estimating the size of the elements of the render tree (src/lib.rs).
  4. Laying out the elements of the render tree into lines (src/text_renderer.rs?).
  5. Rendering the elements into text (src/text_renderer.rs?).
  6. Annotating the lines using a TextDecorator (src/text_renderer.rs).

It would be especially nice if the user would be able to customize step 5 without having to re-implement everything else.

Is my understanding of the rendering process roughly correct? What do you think?
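To make the question more concrete, here is a toy sketch of what a swappable step 5 could look like. Everything in it (`Node`, `LineRenderer`, `PlainRenderer`, `run_pipeline`) is hypothetical and much simpler than html2text's real render tree; it only shows the shape of the idea, not a proposed API.

```rust
/// Hypothetical, simplified render tree (not html2text's RenderNode).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Block(Vec<Node>),
}

/// Hypothetical hook for step 5: turning tree nodes into lines of text.
trait LineRenderer {
    fn render(&self, node: &Node, width: usize, out: &mut Vec<String>);
}

/// Default behaviour: greedy word wrapping, children rendered in order.
struct PlainRenderer;

impl LineRenderer for PlainRenderer {
    fn render(&self, node: &Node, width: usize, out: &mut Vec<String>) {
        match node {
            Node::Text(text) => out.extend(wrap(text, width)),
            Node::Block(children) => {
                for child in children {
                    self.render(child, width, out);
                }
            }
        }
    }
}

/// Greedy word wrap. Uses byte lengths, so it is only accurate for ASCII.
fn wrap(text: &str, width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        if !current.is_empty() && current.len() + 1 + word.len() > width {
            lines.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}

/// The rest of the pipeline stays fixed; only the renderer is pluggable.
fn run_pipeline(tree: &Node, width: usize, renderer: &dyn LineRenderer) -> String {
    let mut lines = Vec::new();
    renderer.render(tree, width, &mut lines);
    lines.join("\n")
}
```

A user with special requirements would implement `LineRenderer` themselves and pass it to `run_pipeline`, while parsing, tree building and size estimation stay untouched.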

Update dependencies

Looking at the Cargo.toml file, it seems like the dependencies haven't been touched in 2 years.

There are major changes and improvements to be had from updating them. For example, html5ever is currently locked at version 0.9.0 while the current version is 0.22.

It would be really nice if downstream crates did not need to vendor such an old version of html5ever and its dependencies if they want to use html2text.

For example, here is how my Cargo.lock file looks at the moment:

[[package]]
name = "html2text"
version = "0.1.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
 "backtrace 0.2.3 (registry+https://github.com/rust-lang/crates.io-index)",
 "html5ever 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)",
 "html5ever-atoms 0.1.3 (registry+https://github.com/rust-lang/crates.io-index)",
 "string_cache 0.2.29 (registry+https://github.com/rust-lang/crates.io-index)",
 "unicode-width 0.1.5 (registry+https://github.com/rust-lang/crates.io-index)",
]

[[package]]
name = "html5ever"
version = "0.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
 "html5ever-atoms 0.1.3 (registry+https://github.com/rust-lang/crates.io-index)",
 "log 0.4.3 (registry+https://github.com/rust-lang/crates.io-index)",
 "mac 0.1.1 (registry+https://github.com/rust-lang/crates.io-index)",
 "phf 0.7.22 (registry+https://github.com/rust-lang/crates.io-index)",
 "phf_codegen 0.7.22 (registry+https://github.com/rust-lang/crates.io-index)",
 "quote 0.3.15 (registry+https://github.com/rust-lang/crates.io-index)",
 "rustc-serialize 0.3.24 (registry+https://github.com/rust-lang/crates.io-index)",
 "syn 0.9.2 (registry+https://github.com/rust-lang/crates.io-index)",
 "tendril 0.2.4 (registry+https://github.com/rust-lang/crates.io-index)",
]

[[package]]
name = "html5ever"
version = "0.22.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
 "log 0.4.3 (registry+https://github.com/rust-lang/crates.io-index)",
 "mac 0.1.1 (registry+https://github.com/rust-lang/crates.io-index)",
 "markup5ever 0.7.2 (registry+https://github.com/rust-lang/crates.io-index)",
 "proc-macro2 0.3.8 (registry+https://github.com/rust-lang/crates.io-index)",
 "quote 0.5.2 (registry+https://github.com/rust-lang/crates.io-index)",
 "syn 0.13.11 (registry+https://github.com/rust-lang/crates.io-index)",
]

[[package]]
name = "html5ever-atoms"
version = "0.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
 "string_cache 0.3.0 (registry+https://github.com/rust-lang/crates.io-index)",
 "string_cache_codegen 0.3.1 (registry+https://github.com/rust-lang/crates.io-index)",
]

Cheers

Empty new lines ignored in <pre> blocks

Using the latest version of html2text:

[package]
name = "testhtml2text"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
html2text = "0.4.4"

I have this example that reproduces what I am talking about:

fn main() {
    let input = b"
<pre>
This is a preformatted block of text.

It has newlines and renders as-is written in HTML.
</pre>";
    println!(
        "{}",
        html2text::from_read_with_decorator(
            &input[..],
            120,
            html2text::render::text_renderer::TrivialDecorator::new()
        )
    );
}

The program outputs:

This is a preformatted block of text.
It has newlines and renders as-is written in HTML.

As I use html2text to parse descriptions of algorithmic problems, I would like the empty new lines to be preserved in the output because sometimes they are crucial to understanding a problem's example. E.g. https://adventofcode.com/2020/day/4

I don't have any issues with html2text ignoring empty new lines anywhere else but <pre> blocks. These tags are understood explicitly as already preformatted, and all browsers would render them as written in the source code. html2text is not a browser. Still, I believe preserving the intent of the HTML is beneficial in this case.

As a workaround, I am inserting <br> tags at the end of all lines in the <pre> blocks of HTML I am downloading, but that seems suboptimal at best.
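For reference, the workaround amounts to roughly the following pre-processing step. This is a naive sketch: `add_br_in_pre` is my own helper name, it only matches the literal lowercase tags `<pre>` and `</pre>` (no attributes, no nesting), and it relies on the behaviour I observed that `<br>` inside `<pre>` forces a line break in the output.

```rust
/// Append `<br>` to every line inside `<pre>...</pre>` so that empty
/// lines survive rendering. Naive string scan, not a real HTML parser.
fn add_br_in_pre(html: &str) -> String {
    const OPEN: &str = "<pre>";
    const CLOSE: &str = "</pre>";
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find(OPEN) {
        let body_start = start + OPEN.len();
        // Copy everything up to and including the opening tag unchanged.
        out.push_str(&rest[..body_start]);
        let after = &rest[body_start..];
        // An unterminated <pre> is treated as running to the end of input.
        let body_len = after.find(CLOSE).unwrap_or(after.len());
        let body = &after[..body_len];
        // HTML parsers drop a newline directly after <pre>, so keep it as-is.
        let body = match body.strip_prefix('\n') {
            Some(stripped) => {
                out.push('\n');
                stripped
            }
            None => body,
        };
        out.push_str(&body.replace('\n', "<br>\n"));
        rest = &after[body_len..];
    }
    out.push_str(rest);
    out
}
```

The result is then passed to html2text instead of the original document.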

Panic in tree_map_reduce

The unwrap() in the following code from fn tree_map_reduce at lib.rs:554 seems to result in a panic on certain inputs:

        // Get the next child node to process
        let next_node = pending_stack.last_mut()
                                     .unwrap()
                                     .to_process
                                     .next();

I'm working on coming up with a test case that doesn't have sensitive information in it, but submitting this issue in advance.

Here is a backtrace:

  11:     0x55fee3b5bc90 - core::option::Option<T>::unwrap::hfa52bb5a7cdb86d0
                               at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/macros.rs:12
  12:     0x55fee3cac769 - html2text::tree_map_reduce::h0eeb44a03fc94bd1
                               at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:553
  13:     0x55fee3cb4f9c - html2text::dom_to_render_tree::ha0f5d421fcdfeabf
                               at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:594
  14:     0x55fee3cb6036 - html2text::children_to_render_nodes::{{closure}}::ha8fb859e9d4ccd77
                               at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:397
  15:     0x55fee3ca8294 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once::h429e02aa062973b1
                               at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/ops/function.rs:279
  16:     0x55fee3b4d65b - core::option::Option<T>::map::h9cb8c6fa389220bd
                               at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/option.rs:416
  17:     0x55fee3b9d094 - <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::next::h3e67bc3ec8205108
                               at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/iter/adapters/mod.rs:570
  18:     0x55fee3b29597 - <core::iter::adapters::flatten::FlattenCompat<I,U> as core::iter::traits::iterator::Iterator>::next::hd76b5e6cad2d2858
                               at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/iter/adapters/flatten.rs:219
  19:     0x55fee3b27cfb - <core::iter::adapters::flatten::FlatMap<I,U,F> as core::iter::traits::iterator::Iterator>::next::h1ff61594eb8d230e
                               at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/iter/adapters/flatten.rs:49
  20:     0x55fee3c8f846 - <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T,I>>::from_iter::h8595be8b7c7b1a21
                               at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/liballoc/vec.rs:1883
  21:     0x55fee3c980ad - <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter::hc732d37a466aab80
                               at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/liballoc/vec.rs:1796
  22:     0x55fee3b2c89d - core::iter::traits::iterator::Iterator::collect::h3601df414b19d4b1
                               at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/iter/traits/iterator.rs:1466
  23:     0x55fee3cb5f4a - html2text::children_to_render_nodes::h1489c6b97e9e00fc
                               at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:394
  24:     0x55fee3cb6874 - html2text::list_children_to_render_nodes::h842485128c34486e
                               at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:411
  25:     0x55fee3cb0ff2 - html2text::process_dom_node::hf0aa74c8f1662c45
                               at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:792
  26:     0x55fee3cb5000 - html2text::dom_to_render_tree::{{closure}}::he3666d29e519f1ac
                               at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:595
  27:     0x55fee3cac929 - html2text::tree_map_reduce::h0eeb44a03fc94bd1
                               at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:559
  28:     0x55fee3cb4f9c - html2text::dom_to_render_tree::ha0f5d421fcdfeabf
                               at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:594
  29:     0x55fee3cb62db - html2text::from_read_with_decorator::haafad1b14c75eddb
                               at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:1183
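
Until the root cause is known, one defensive option would be to surface the empty stack as an error instead of unwrapping. The sketch below uses toy stand-ins (`Pending`, `EmptyStack`, `next_child` are all hypothetical and not the types in lib.rs); it only shows the pattern of matching on `last_mut()`.

```rust
/// Hypothetical stand-in for one entry of `pending_stack`.
struct Pending<I: Iterator<Item = u32>> {
    to_process: I,
}

/// Hypothetical error reported when there is nothing left on the stack.
#[derive(Debug, PartialEq)]
struct EmptyStack;

/// Get the next child node to process from the top of the stack.
/// Returns `Err(EmptyStack)` rather than panicking when the stack is empty,
/// and `Ok(None)` when the top entry has no children left.
fn next_child<I: Iterator<Item = u32>>(
    pending_stack: &mut [Pending<I>],
) -> Result<Option<u32>, EmptyStack> {
    match pending_stack.last_mut() {
        Some(top) => Ok(top.to_process.next()),
        None => Err(EmptyStack),
    }
}
```

Whether an empty stack here is a recoverable condition or a logic bug elsewhere in the traversal is something only the real code can answer.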

Raw text (not markdown) formatting from html

Sorry if I did not find the relevant information by perusing the documentation but is there an easy way to produce a raw text output (without any markdown formatting at all)? It seems to me I can implement my own thing based on traversing a RenderNode or by implementing a custom TextDecorator? But it seems also a very common use-case, especially when pre-processing documents from the web for NLP pipelines.

If it does not yet exist, can I contribute it to this library? If so, do you have any guidance on the correct way to do so?

CSS support for formatting styles

There are situations in which it would be useful for html2text to understand at least a small amount of CSS.

An occasional annoyance I find with some web pages is that they use different classes of <span> (or <div>, depending on preference) for all their formatting, including both paragraph separation and inline style changes such as emphasis. Then they rely on CSS to make some of those span classes behave like <p>, some like <em>, some like <code> and so on.

html2text can't render a document of that kind sensibly without having to speak enough CSS to at least know which classes of <span> it should treat like which normal tags. You end up with a huge megaparagraph, or alternatively no end of spurious newlines (depending on whether the author went all-spans or all-divs).

I don't have a real-world example handy, but here's one I mocked up manually:

<head>
<title>Demo of the 'spans-everywhere' school of HTML</title>
<style type="text/css">
.p { display: block; margin-bottom: 1em; }
.em { font-style: italic; }
.code { font-family: monospace; }
</style>
</head>
<body>
<span class="p">Paragraph one, containing <span class="em">emphasis</span>.</span><span class="p">Paragraph two, containing <span class="code">code</span>.</span>
</body>
</html>

@jugglerchris mentioned that another use case is pages that use display: none.

Link syntax in markdown

Hi,

Currently, links are output as [link text][link number].

For example:

This is [a link][1]

[1] https://google.com

However, for this to work as a Markdown reference link, a colon is required after the label in the link definition:

This is [a link][1]

[1]: https://google.com
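
A minimal sketch of the requested footer format, assuming the link targets are available as a list in order of appearance (`format_link_references` is a name invented for this example, not an html2text function):

```rust
/// Format numbered link targets as Markdown reference definitions,
/// i.e. `[n]: url` with a colon after the label. Numbering starts at 1.
fn format_link_references(urls: &[&str]) -> String {
    urls.iter()
        .enumerate()
        .map(|(i, url)| format!("[{}]: {}", i + 1, url))
        .collect::<Vec<_>>()
        .join("\n")
}
```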

Handle wide documents without widening all paragraphs

Motivation

If the input to html2text can't be rendered in the specified display width, one option (in some client UIs) is to render it anyway at a larger width, and do some other kind of compromise when presenting it to the user, such as apologetically telling them they need to widen their terminal, or providing keystrokes to pan the window left and right across the wider logical canvas that you rendered the document on.

In the current API, you can do this by handling Error::TooNarrow: if you try to render at the physical terminal width and get TooNarrow, you can increase the render width and try again, until you find a width at which the document successfully renders. (And then maybe you binary-search between the largest failing width and the smallest successful one to find the cutoff? And then maybe you increase it by 20% or 50% or something, to avoid one-character-wide table cells? But these are client-side decisions.)

But the downside of doing this is that if the document contains one table that needs at least 500 columns, then all the ordinary paragraphs alongside that table are also wrapped at 500 columns, making them very hard to read. It would be nicer if all the parts that don't have to be ultra-wide could still be made to fit into my (say) 80-column terminal.

In particular, if the wide table is something that not all readers of the document will need to bother with at all, then that way, I can read the introductory paragraphs easily, and use them to decide whether I need to bother widening my window to read the table!
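
For concreteness, the retry strategy described above (widen until rendering succeeds, then binary-search for the cutoff) might look like the sketch below. It takes the renderer as a closure so that it stays independent of the real API; `RenderError` is a local stand-in for the library's `Error::TooNarrow`, and the function assumes that once a width works, every larger width works too.

```rust
/// Local stand-in for the library's `Error::TooNarrow`.
#[derive(Debug, PartialEq)]
enum RenderError {
    TooNarrow,
}

/// Find the smallest width in `start..=max` at which `render` succeeds.
/// Assumes success is monotonic in the width.
fn smallest_working_width<F>(render: F, start: usize, max: usize) -> Option<usize>
where
    F: Fn(usize) -> Result<String, RenderError>,
{
    if render(start).is_ok() {
        return Some(start);
    }
    let mut lo = start; // largest width known to fail
    let mut hi = start; // candidate width, doubled each round
    loop {
        if hi >= max {
            return None; // nothing up to `max` works
        }
        hi = hi.saturating_mul(2).max(hi + 1).min(max);
        if render(hi).is_ok() {
            break;
        }
        lo = hi;
    }
    // Invariant: `lo` fails, `hi` succeeds. Narrow the gap to one column.
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if render(mid).is_ok() {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    Some(hi)
}
```

The client can then add whatever safety margin it likes on top of the returned width before doing the final render.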

Suggestions

There's more than one way that that high-level requirement might be achieved. Two thoughts that have sprung to mind (which may still not be the only possible approaches):

1. Client application tells html2text to render at my actual terminal width, but with a flag of some kind that says that that width limit is 'strongly advised' rather than 'mandatory'. Then if a table can't help being wider than specified, fine, it can do that. But anything that can fit in the width should.

After setting this flag, the client would never expect to get Error::TooNarrow back from the renderer. But, in return, it accepts that the output might contain overlong lines, and must find some way of dealing with them itself if so.

2. Client application gives html2text two width parameters. One is the physical display width, just as now. The other is a 'maximum wrap width'. Each paragraph is wrapped to min(maximum wrap width, however much space is physically available in the layout).

The idea is that the client starts by setting both widths to the physical terminal width. If the document can't be rendered at that width, then the renderer still returns Error::TooNarrow, and the client responds the same way I suggested above, by cranking up the display width until the error stops happening. But it leaves the maximum wrap width at its original value.

Pros and cons

Option 1 has the advantage of only needing one call to the renderer, instead of O(log n) calls to zero in on a good render width.

Option 2 gives the client more control, because it still gets to decide how much larger than the absolute minimum possible width it wants to make things.

Fiddly edge cases

In both cases, there's a question of what happens to a single very wide table cell. (Say, one row of your table has an enormous number of tiny cells, and another has a single cell with colspan=lots covering the same horizontal space as all of them, so that that one cell is forced to be wider than the physical display.) With my option 2, the very wide cell would still have its paragraphs limited to the max wrap width, because that max width is applied to every paragraph anywhere in the whole layout. So if the client application is handling wide documents by providing keystrokes to pan left and right across the canvas, there will be some pan position at which the text of that table cell is readable. Perhaps option 1 might do the same thing?

With the max-wrap-width approach, there's also a question of how you measure it for indented things like <li> or <blockquote>: from the left margin of the text, or of the containing column? For example, which of these would you see, in a set of nested lists? This, which lets you read the whole list without having to pan your display at all?

plain para at the
full screen width

* narrower bullet
  point

  * even narrower
    bullet point

    * smaller yet
      and so on

Or this, which avoids the bullet points getting squashed up against the right-hand edge, making use of the fact that they physically do have space to expand into?

plain para at the
full screen width

* bullet point uses
  same width so its
  margin is 2 chars
  further right

  * nested bullets in
    turn move 2 chars
    right each time

    * result: you never
      get text squashed
      too narrow

The CSS max-width property takes the latter view, on the basis that it expects that you've widened your browser window and the whole wide window is visible. But in a context where you might be using a "pan left and right" UI, perhaps the former makes more sense?

Or perhaps some further compromise involving a minimum wrap width, so that once the bullet points get absolutely too silly squashed up against the right margin, there's a fallback available?

As you can see, I don't have all the answers here :-)
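
To pin down the two ways of measuring the max wrap width for indented blocks, here is a small sketch. The names `WrapOrigin` and `text_width` are invented for illustration; nothing like them exists in html2text today.

```rust
/// Where the max wrap width is measured from (hypothetical option).
#[derive(Clone, Copy)]
enum WrapOrigin {
    /// From the left edge of the canvas: nested blocks get narrower.
    CanvasEdge,
    /// From the block's own left margin: nested blocks keep their width
    /// and extend further to the right.
    BlockMargin,
}

/// Width available for text in a block indented by `indent` columns.
/// Implements min(max wrap width, physically available space).
fn text_width(
    display_width: usize,
    max_wrap_width: usize,
    indent: usize,
    origin: WrapOrigin,
) -> usize {
    let available = display_width.saturating_sub(indent);
    let limit = match origin {
        WrapOrigin::CanvasEdge => max_wrap_width.saturating_sub(indent),
        WrapOrigin::BlockMargin => max_wrap_width,
    };
    limit.min(available)
}
```

With a 500-column canvas, a max wrap width of 80 and a 4-column indent, `CanvasEdge` gives 76 columns of text (the first picture) and `BlockMargin` gives 80 (the second picture). A minimum wrap width would be one more `max` applied to `limit`.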

How to not override RichAnnotations colours when CSS Feature is enabled

Introduction

I'm using the html2text example tool to convert HTML emails for a TUI email client.

In issue #134, I learned that using the CSS feature/option can hide HTML elements with max-height: 0 and display: none styles, which is really useful for email preview divs.

But, enabling CSS also overrides the RichAnnotations colours that are defined by design.

Issue

I would like to use the CSS feature for the layout but without overriding RichAnnotations.

It doesn't look like the overriding was implemented in "examples/html2text.rs" but directly in the library.

How can I do that in "examples/html2text.rs"?

Cannot use feature css

I am trying to render some CSS in the HTML, and I believe I need the css feature for this to work, but I cannot add it in my Cargo.toml file due to this:

# Cargo.toml
html2text = { version = "0.12.4", features = ["css"]}
error: failed to select a version for the requirement `lightningcss = "^1.0.0-alpha.54"`
candidate versions found which didn't match: 1.0.0-alpha.52, 1.0.0-alpha.51, 1.0.0-alpha.50, ...
location searched: crates.io index
required by package `html2text v0.12.4`
    ... which satisfies dependency `html2text = "^0.12.4"` (locked to 0.12.4) of package `weather v0.1.0 (/home/abhishek/quick-test/weather-rs)`
if you are looking for the prerelease package it needs to be specified explicitly
    lightningcss = { version = "1.0.0-alpha.52" }
perhaps a crate was updated and forgotten to be re-vendored?

Here's my code which isn't working due to the above:

let s = config::rich()
        .add_css()
        .string_from_read(&mut reader, 150)
        .context("Render failed")?;

Feature request: provide the URL of an image

I've built a TUI Miniflux client that uses this lib for converting contents of RSS feed entries into something readable in the terminal, and it's great.

The only thing missing is that I'd love a way to just spit out the URL for an image when one is present (or do some other sort of processing with that URL, like giving it to the user in another way). Currently, the RichAnnotation::Image enum member doesn't provide any way to get the image's src attribute, so I can't actually do that (instead, I either show alt text or nothing).

Feature request: Comply with CommonMark specification

Your library looks very promising. Unfortunately I can not use it because:

  1. html2text's output is not sufficiently CommonMark compliant yet,
  2. the HTML's metadata is not converted into a YAML metadata block (see pandoc's yaml_metadata_block)

Let me explain my use case in more detail: I tested whether I could replace the following in my toolchain:

pandoc --standalone -f html -t markdown_strict+yaml_metadata_block+pipe_tables

with:

html2text

This would allow me to do this:

curl $(xclip -o) | html2text | tp-note

and even integrate your library into tp-note. Then the above would look like this:

curl $(xclip -o) | tp-note

Tp-Note comes with a document viewer that renders the content with pulldown-cmark which is compliant with the CommonMark specification.

As the de facto official specification for Markdown is CommonMark, making html2text compatible with it would open up a wider range of use cases (mine included).
Another advantage: CommonMark has a validation test suite.

What do you think?

Text decorator functions for all elements?

Is it possible to implement text decorator functions for all elements? The current implementations for plain and rich text are already very helpful, but I would like to tweak the appearance of some specific elements. Let’s say I want to print headings with a bold typeface or I want to set the color based on the element’s class. This could be done with functions like:

fn decorate_element_start(&self, name: &str, attrs: HashMap<String, String>) -> Self::Annotation
fn decorate_element_end(&self, name: &str) -> Self::Annotation
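
To show how I would use such hooks, here is a self-contained sketch. `ElementDecorator`, `Style` and `HeadingDecorator` are all hypothetical and separate from the existing TextDecorator trait; I also pass the attributes by reference here rather than by value, and using the class name directly as a colour is just a placeholder for a real lookup.

```rust
use std::collections::HashMap;

/// Hypothetical trait carrying the two proposed hooks.
trait ElementDecorator {
    type Annotation;
    fn decorate_element_start(
        &self,
        name: &str,
        attrs: &HashMap<String, String>,
    ) -> Self::Annotation;
    fn decorate_element_end(&self, name: &str) -> Self::Annotation;
}

/// Hypothetical annotation type for this example.
#[derive(Debug, PartialEq)]
enum Style {
    Bold,
    Colour(String),
    Plain,
    Reset,
}

/// Bold headings; otherwise pick a colour from the element's class.
struct HeadingDecorator;

impl ElementDecorator for HeadingDecorator {
    type Annotation = Style;

    fn decorate_element_start(&self, name: &str, attrs: &HashMap<String, String>) -> Style {
        if matches!(name, "h1" | "h2" | "h3" | "h4" | "h5" | "h6") {
            Style::Bold
        } else if let Some(class) = attrs.get("class") {
            // Placeholder: a real decorator would map classes to colours.
            Style::Colour(class.clone())
        } else {
            Style::Plain
        }
    }

    fn decorate_element_end(&self, _name: &str) -> Style {
        Style::Reset
    }
}
```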

Hitting 'Got character ...' errors

I'm hitting the error in

 pub fn add_preformatted_text(&mut self, text: &str, tag_main: &T, tag_wrapped: &T) {
        ...
        for c in text.chars() {
            if let Some(charwidth) = UnicodeWidthChar::width(c) {
                ...
            } else {
                match c {
                    '\n' => {
                        self.force_flush_line();
                        self.pre_wrapped = false;
                    }
                    '\t' => {
                        ...
                    }
                    _ => {
                        eprintln!("Got character: {:?}", c);
                    }
                }
            }
            html_trace_quiet!("  Added char {:?}", c);
        }
    }

Can this be silenced?

Support RTL

Like this one:

Found 1 items, similar to سلام.
-->Moin
-->سلام

<p align=right dir=rtl>(سَ) [<font color="green"> ع.</font> ] (<font color="green">مص ل.</font>)<br><font color="#7030a0">۱-</font> درود گفتن.<br><font color="#7030a0">۲-</font> بی گزند شدن.<br><font color="#7030a0">۳-</font> گردن نهادن.<br>~ علیک درود بر تو باد.<br>~ علیکم درود بر شما.</p>

Convert <br> to new lines?

Is there an option to convert <br /> tags to newline characters in the output?

Currently it seems to ignore all <br> tags in the output.

[Feature Request] Delete Email preview or convert HTML entities used in Emails preview hacks

Introduction

In Email, it was a common practice to add a separator indicating the end of the preview text.

This method does not work on all clients and is subject to regular changes. I came across this article on the subject with a few examples.

Issue

The problem is that the invisible characters used for this hack are badly converted by html2text, which leaves many "COMBINING GRAPHEME JOINER" entities in the output.

This entity is crudely displayed in pagers or TUI mail clients (dotted circle in a dotted square).

The first complication is that there are many variants of the sequences used; some are presented in the article above. In my own inbox, this week, I found at least 2 variants:

  • a repetition of &#847;&zwnj;&nbsp;
  • a repetition of ­͏ ‌ &nbsp, which you can reproduce with echo -ne '\u00ad\u034f \u200c &nbsp;'

Another complication is that the sequence is sometimes formatted in columns, with newlines at different places in the sequence, like here:

image

This example would currently be converted by html2text like this:

image

Solution

For now I was piping the source file to sed first to delete those sequences:

sed -e 's/&#847;&zwnj;&nbsp;/\n/g' -e "s/$(echo -ne '\u00ad\u034f')[; ^C]*$(echo -ne '\u200c')[; \n]*&nbsp[; \n]*/\n/g"

I updated this script 3 times and I was going to find a solution for the cases with columns/newlines.

But I've just realized that the best solution would be to get rid of all the text before this sequence in the body, since it's a preview, a repetition of the text present in the rest of the document.

I think that such a deletion would impact only email documents and be a good addition to the library even by default.
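
As a stopgap that is less fragile than my sed script, the filler could be stripped character by character rather than as fixed sequences, which would also cope with the column/newline variants. The sketch below is only that, a sketch: `strip_preview_filler` and the two constant lists are names I made up, and the lists only cover the variants quoted above. It deliberately leaves &nbsp; alone because that entity is legitimate elsewhere.

```rust
/// Invisible characters seen in the preview padding quoted above:
/// soft hyphen, combining grapheme joiner, zero-width non-joiner.
const FILLER_CHARS: [char; 3] = ['\u{00ad}', '\u{034f}', '\u{200c}'];

/// Entity spellings of the same padding (&#847; is U+034F).
const FILLER_ENTITIES: [&str; 2] = ["&#847;", "&zwnj;"];

/// Remove preview-padding characters and entities from an HTML source.
/// Position-independent, so newlines inside the padding do not matter.
fn strip_preview_filler(html: &str) -> String {
    let mut out = html.to_string();
    for entity in FILLER_ENTITIES {
        out = out.replace(entity, "");
    }
    out.retain(|c| !FILLER_CHARS.contains(&c));
    out
}
```

This does not implement the "drop everything before the separator" idea; it only removes the characters that render badly.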

Rendering tables crashes

Even after the fix in #64 some tables still crash, hitting the assert!(width > 0) in text_renderer.rs.

A reduced example is below

<!DOCTYPE html>
<html>
  <table><tbody><tr><td>
          <ol><li>ဘိန်းမုန့်</li></ol></td>
        <td>
          <ol><li>မုန့်ကြာစိ</li>
            <li>မုန့်တီ</li>
            <li>ခိုတောင်မုန့်တီ></li>
            <li>မုန့်ဟင်းခ</li>
            <li>တစ်ပင်တိုင်မုန့်ဟင်းခါး</li>
            <li>မုန့်စိမ်းပေါင်း</li>
            <li>မြိတ်ကတ်ကြေးကိုက်</li>
            <li>မြီးရှည်</li>
          </ol>
        </td>
        <td>
          <ol>
            <li></li></ol></td>
      </tr>
    </tbody>
  </table>
</html>

I have not further investigated and may not be able to for a while, but removing the assert condition leads to this example being processed reasonably, while processing the whole html this originated from appears to hit an infinite loop filling memory until OOM.

Deduplicate whitespace by default

The tests indicate that it was intentional for html2text to preserve whitespace, but is there a reason for this? I've run into html that expects the default behavior of browsers that deduplicate whitespace. If this is not deduplicated it messes up the formatting of a larger table.

<td style="margin: 0px; padding: 0px;
                    -webkit-print-color-adjust: exact;" class="">Your
                    AT Conference Account is Past Due - Suspension
                    Notice</td>
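
What I would expect for the cell above is roughly the following normalisation. This is a simplified sketch with a made-up helper name: it collapses runs of ASCII whitespace (space, tab, LF, FF, CR) into one space and also trims both ends, which is a little more aggressive than what browsers do at element boundaries.

```rust
/// Collapse every run of ASCII whitespace into a single space and trim
/// the ends. Non-breaking spaces (U+00A0) are left untouched.
fn collapse_whitespace(text: &str) -> String {
    text.split_ascii_whitespace().collect::<Vec<_>>().join(" ")
}
```

Applied to the text of the cell, this gives "Your AT Conference Account is Past Due - Suspension Notice" on a single line.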

Panic at 'attempt to divide by zero'

The render table code is susceptible to panicking after attempting to divide by zero, as it just happened to me at this line of code:

thread 'main' panicked at 'attempt to divide by zero', [..]/html2text-0.6.0/src/lib.rs:1382:13

There are probably more places where this can happen.
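
I have not checked what exactly is being divided at that line, so the following is only a generic sketch of the kind of guard I mean, with a hypothetical helper name: `checked_div` returns `None` for a zero divisor instead of panicking, and the caller decides on a fallback.

```rust
/// Split `total_width` evenly across `columns`.
/// Returns `None` instead of panicking when `columns` is zero.
fn column_width(total_width: usize, columns: usize) -> Option<usize> {
    total_width.checked_div(columns)
}
```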

Thanks

Return raw text instead of formatted table

Hello. I've been using your library for my RSS reader for a long time.

It has a couple of glitches, one of which is the rendering of HTML tables. For example,

        let description = "<table cellpadding='10'>\n<tr>\n<td valign='top' align='center'><a href='https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/' title='Software engineer hacked webcams to spy on girls—Here's how to protect yourself'><img src='https://blog.malwarebytes.com/wp-content/uploads/2022/01/GettyImages-1199960659.jpg' border='0'  width='300px'  /></a></td>\n</tr>\n<tr>\n<td valign='top' align='left'>Yes, hackers can and will use your webcams against you if they see an opportunity. Don't let them.</p>\n<p>Categories: <a href=\"https://blog.malwarebytes.com/category/cybercrime/\" rel=\"category tag\">Cybercrime</a></p>\n<p>Tags: <a href=\"https://blog.malwarebytes.com/tag/andrew-shorrock/\" rel=\"tag\">Andrew Shorrock</a><a href=\"https://blog.malwarebytes.com/tag/catfishing/\" rel=\"tag\">catfishing</a><a href=\"https://blog.malwarebytes.com/tag/hacker-jailed/\" rel=\"tag\">hacker jailed</a><a href=\"https://blog.malwarebytes.com/tag/national-crime-agency/\" rel=\"tag\">National Crime Agency</a><a href=\"https://blog.malwarebytes.com/tag/nca/\" rel=\"tag\">NCA</a><a href=\"https://blog.malwarebytes.com/tag/robert-davies/\" rel=\"tag\">Robert Davies</a><a href=\"https://blog.malwarebytes.com/tag/software-engineer-hacker/\" rel=\"tag\">software engineer hacker</a><a href=\"https://blog.malwarebytes.com/tag/voyuerism/\" rel=\"tag\">voyuerism</a><a href=\"https://blog.malwarebytes.com/tag/webcam-security/\" rel=\"tag\">webcam security</a></p>\n<table width='100%'>\n<tr>\n<td align=right>\n<p><b>(<a href='https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/' title='Software engineer hacked webcams to spy on girls—Here's how to protect yourself'>Read more...</a>)</b></p>\n</td>\n</tr>\n</table>\n</td>\n</tr>\n</table>\n<p>The post <a rel=\"nofollow\" 
href=\"https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/\">Software engineer hacked webcams to spy on girls—Here&#8217;s how to protect yourself</a> appeared first on <a rel=\"nofollow\" href=\"https://blog.malwarebytes.com\">Malwarebytes Labs</a>.</p>\n";

        eprintln!(
            "raw : {:?}",
            html2text::from_read(description.as_bytes(), 2000)
                .trim()
                .to_string(),
        );
        
        

The result is:

raw : "─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────\n[][1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
         \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
           \n[1] https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
             \n─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────────────\nYes, hackers can and will use your webcams against you if they see an opportunity. Don't let them.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                 \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                   \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                     \nCategories: [Cybercrime][1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                       \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                         \nTags: [Andrew Shorrock][2][catfishing][3][hacker jailed][4][National Crime Agency][5][NCA][6][Robert Davies][7][software engineer hacker][8][voyuerism][9][webcam security][10]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                           \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                             \n─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────────────────────────────\n([Read more...][1])                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                 \n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                   \n[1] https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                     \n─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                         \n[1] https://blog.malwarebytes.com/category/cybercrime/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                           \n[2] https://blog.malwarebytes.com/tag/andrew-shorrock/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                             \n[3] https://blog.malwarebytes.com/tag/catfishing/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                               \n[4] https://blog.malwarebytes.com/tag/hacker-jailed/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                 \n[5] https://blog.malwarebytes.com/tag/national-crime-agency/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                   \n[6] https://blog.malwarebytes.com/tag/nca/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                     \n[7] https://blog.malwarebytes.com/tag/robert-davies/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                       \n[8] https://blog.malwarebytes.com/tag/software-engineer-hacker/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                         \n[9] https://blog.malwarebytes.com/tag/voyuerism/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                           \n[10] https://blog.malwarebytes.com/tag/webcam-security/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                             \n─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────\n\nThe post [Software engineer hacked webcams to spy on girls—Here’s how to protect yourself][1] appeared first on [Malwarebytes Labs][2].\n\n[1] https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/\n[2] https://blog.malwarebytes.com"

Is it possible to return just text instead of this rudimentary table?

Behavior when failing to parse?

What is the behavior of this package when it fails to parse HTML? I don't see a Result type being returned from from_read, so am I correct in assuming that this panics if it is asked to parse invalid HTML?
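
Until the failure behavior is documented, a caller can guard against panics defensively. The following is a minimal sketch using only the standard library; `convert` is a hypothetical stand-in for a conversion call that returns a `String` rather than a `Result`, not html2text's actual API. It only helps when the program is built with unwinding panics, and it cannot catch aborts or stack overflows.

```rust
use std::panic;

// Hypothetical stand-in for a conversion call that returns a String rather
// than a Result; it panics on one input so the wrapper has something to catch.
fn convert(input: &str) -> String {
    if input.contains('\0') {
        panic!("simulated failure inside the converter");
    }
    input.trim().to_string()
}

// Run the conversion and turn a panic into an Err instead of letting it
// unwind through the caller.
fn convert_checked(input: &str) -> Result<String, String> {
    panic::catch_unwind(|| convert(input)).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            msg.to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

Calling `convert_checked` with ordinary input yields `Ok`, while an input that triggers the simulated panic yields `Err` carrying the panic message.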

Support custom rich text decorators

Currently, I can either implement a custom TextDecorator and produce a string using from_read_with_decorator, or I can produce annotated text with RichDecorator using from_read_rich. I’d like to be able to implement a custom TextDecorator and produce annotated text. As far as I see, that is not possible at the moment.

What do you think about adding a from_read_rich_with_decorator function?

Panic when parsing a particular HTML string

Hey, I noticed the following code panics:

    let html = r#"<table><td><p data-aid="133347338" id="p5">3,266</p>"#;
    let decorator = html2text::render::text_renderer::PlainDecorator::new();
    let text = html2text::from_read_with_decorator(html.as_bytes(), usize::MAX, decorator.clone());
    println!("{}", text);

thread 'main' panicked at 'attempt to multiply with overflow', /Users/ephraimkunz/.cargo/registry/src/github.com-1ecc6299db9ec823/html2text-0.4.2/src/lib.rs:1408:29
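
The message suggests that the `usize::MAX` width ends up in a multiplication somewhere; what exactly is computed at that line is not shown in this report. As a sketch under that assumption, the snippet below shows why any multiplication on `usize::MAX` overflows and how a caller could clamp the width before passing it in. `MAX_WIDTH`, `clamp_width`, and `scaled_width` are hypothetical names for illustration, not part of html2text.

```rust
// Hypothetical upper bound for a "do not wrap" width; not taken from html2text.
const MAX_WIDTH: usize = 1 << 20;

// Clamp a caller-supplied width so later arithmetic on it has headroom.
fn clamp_width(requested: usize) -> usize {
    requested.min(MAX_WIDTH)
}

// Scale a width by a column count, reporting overflow as None instead of
// panicking the way unchecked `*` does in debug builds.
fn scaled_width(width: usize, columns: usize) -> Option<usize> {
    width.checked_mul(columns)
}
```

With the clamp in place, `scaled_width(clamp_width(usize::MAX), 2)` succeeds, whereas `scaled_width(usize::MAX, 2)` returns `None`.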

Preformatted text not passed to the decorator

pre tags will not cause a call to the decorate_preformat_* methods of the TextDecorator. Apparently, the Renderer::add_preformatted_block method is never called.

println!("{:?}", html2text::parse("<pre>test</pre>".as_bytes()).render_rich(100).into_lines());

[TaggedLine { v: [Str(TaggedString { s: "test", tag: [] })] }]

Enhanced RichDecorator/--colour features: References and Border Colors

Intro

I tried many solutions to display HTML emails in mutt/neomutt's internal pager (elinks, readability tools, pandoc, html2text tools), but there were always some issues with the encoding, colors, references or parsing.

After a few tests today, rust-html2text seems to be the way to go: it's fast, and the parsing, formatting, and encoding are spot on.

Request

I changed the color and style options in the html2text example to fit my tastes, but I would like to add three features to the RichDecorator, which is used by --colour in the example:

  • Links listed as references, like in the PlainDecorator.
  • References wrap at --wrap-width, to help with long links osc 8.
  • Assign a color to borders and horizontal lines, to dim them.

Would you have time to implement those or give me some leads? (I'm just starting with Rust.)

Extra

My changes to the example (using only Reset for colors and styles was not working correctly for me):

diff --git a/examples/html2text.rs b/examples/html2text.rs
index 2c14ddf..4ee56b6 100644
--- a/examples/html2text.rs
+++ b/examples/html2text.rs
@@ -22,41 +22,41 @@ fn default_colour_map(annotations: &[RichAnnotation], s: &str) -> String {
         match annotation {
             Default => {}
             Link(_) => {
-                start.push(format!("{}", termion::style::Underline));
-                finish.push(format!("{}", termion::style::Reset));
+                start.push(format!("{}{}", Fg(AnsiValue(153)), termion::style::Underline));
+                finish.push(format!("{}{}", Fg(White), termion::style::NoUnderline));
             }
             Image(_) => {
                 if !have_explicit_colour {
-                    start.push(format!("{}", Fg(Blue)));
-                    finish.push(format!("{}", Fg(Reset)));
+                    start.push(format!("{}{}", Fg(AnsiValue(225)), termion::style::Italic));
+                    finish.push(format!("{}{}", Fg(White), termion::style::NoItalic));
                 }
             }
             Emphasis => {
-                start.push(format!("{}", termion::style::Bold));
-                finish.push(format!("{}", termion::style::Reset));
+                start.push(format!("{}", termion::style::Italic));
+                finish.push(format!("{}", termion::style::NoItalic));
             }
             Strong => {
                 if !have_explicit_colour {
-                    start.push(format!("{}", Fg(LightYellow)));
-                    finish.push(format!("{}", Fg(Reset)));
+                    start.push(format!("{}", termion::style::Bold));
+                    finish.push(format!("{}", termion::style::NoBold));
                 }
             }
             Strikeout => {
                 if !have_explicit_colour {
-                    start.push(format!("{}", Fg(LightBlack)));
-                    finish.push(format!("{}", Fg(Reset)));
+                    start.push(format!("{}{}", Fg(AnsiValue(7)), termion::style::CrossedOut));
+                    finish.push(format!("{}{}", Fg(White), termion::style::NoCrossedOut));
                 }
             }
             Code => {
                 if !have_explicit_colour {
-                    start.push(format!("{}", Fg(Blue)));
-                    finish.push(format!("{}", Fg(Reset)));
+                    start.push(format!("{}{}", Bg(AnsiValue(25)), Fg(AnsiValue(222))));
+                    finish.push(format!("{}{}", Bg(Reset) ,Fg(White)));
                 }
             }
             Preformat(_) => {
                 if !have_explicit_colour {
-                    start.push(format!("{}", Fg(Blue)));
-                    finish.push(format!("{}", Fg(Reset)));
+                    start.push(format!("{}{}", Bg(AnsiValue(25)), Fg(AnsiValue(229))));
+                    finish.push(format!("{}{}", Bg(Reset), Fg(White)));
                 }
             }
             Colour(c) => {

Specific HTML can cause `from_read` to hang indefinitely

While testing against a large batch of HTML samples I found that one appeared to cause an infinite loop when calling from_read.

Sample attached. I have tested this both via the library call and via the command line.
infinite.zip

I think it would be nice to be able to have some kind of limiting factor because HTML in the wild can be very weird and malformed.
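
One caller-side way to get such a limit, without any change to the library, is to run the conversion on a helper thread and stop waiting after a deadline. This is a minimal sketch using only the standard library; `run_with_timeout` is a hypothetical helper, and the closure passed to it would wrap the real conversion call. Note that a thread stuck in an infinite loop is not stopped by this; it is only abandoned.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `work` on a helper thread and wait at most `limit` for its result.
// On timeout the helper thread is NOT stopped; it keeps running detached.
fn run_with_timeout<T, F>(work: F, limit: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}
```

A result of `None` means the work did not finish (or panicked) within the limit. For untrusted input in bulk, running the conversion in a separate process that can be killed is the more robust variant of the same idea.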

Support syntax highlighting for <pre> blocks

The reason why I wanted a fix for preformatted blocks (#32) was that I wanted to implement syntax highlighting for code blocks for rusty-man. With the new release v0.2.1, I was able to implement basic syntax highlighting using syntect. My current implementation could still be improved: Context is very important for proper syntax highlighting, but I cannot identify the block a string with a preformatted annotation belongs to. Therefore I just assume that there is only one preformatted string per line, and that preformatted strings in adjacent lines belong to the same block. Of course, this does not hold for tables or for subsequent code blocks.

Do you think html2text could make it easier to highlight code blocks?
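
The adjacency heuristic described above can be sketched as follows. `Line` is a hypothetical simplification (one text string plus a flag saying whether the line carried a preformatted annotation), not html2text's own tagged-line type.

```rust
// One rendered line: its text and whether it carried a preformatted annotation.
// This struct is a hypothetical simplification, not html2text's TaggedLine.
struct Line {
    text: String,
    preformatted: bool,
}

// Group runs of adjacent preformatted lines into blocks, each joined with '\n'.
fn group_pre_blocks(lines: &[Line]) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for line in lines {
        if line.preformatted {
            current.push(line.text.as_str());
        } else if !current.is_empty() {
            // A non-preformatted line ends the block collected so far.
            blocks.push(current.join("\n"));
            current.clear();
        }
    }
    // Flush a block that runs to the end of the input.
    if !current.is_empty() {
        blocks.push(current.join("\n"));
    }
    blocks
}
```

The sketch shares the limitation named in the issue: two code blocks that directly follow each other are merged into one, and code inside table cells is not handled.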

Deeply nested test fails on x86_64-pc-windows-gnu

The CI tests on AppVeyor fail, but only on x86_64-pc-windows-gnu - not on i686-pc-windows-gnu or x86_64-pc-windows-msvc.

Apparently test_deeply_nested_table overflows its stack on that platform but no others.
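
Why only this target overflows is not established here; stack limits and frame sizes can differ between targets. One way to make a deeply recursive test independent of the default stack size is to run it on a thread with an explicitly chosen size. The sketch below uses only the standard library; `nested_depth` is a hypothetical stand-in for the recursive rendering work, and `run_with_stack` is an illustrative helper.

```rust
use std::thread;

// Recursive stand-in for a renderer walking nested tables (hypothetical workload).
fn nested_depth(levels: u32) -> u32 {
    if levels == 0 {
        0
    } else {
        1 + nested_depth(levels - 1)
    }
}

// Run `work` on a new thread whose stack size is set explicitly, then
// return its result to the caller.
fn run_with_stack<T, F>(stack_bytes: usize, work: F) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(work)
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}
```

For example, `run_with_stack(64 * 1024 * 1024, || nested_depth(100_000))` gives the recursion a 64 MiB stack regardless of the platform default.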

Panics found by fuzzing

Hi, I am fuzzing this crate with afl.rs, and my fuzzer reports some panics. I list the code snippets and panic information below. All code snippets can be run directly. I hope you can check whether these panics are bugs.

The first case panics with 'capacity overflow':

    let _local0 = html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator>::new(11787190863583748771, html2text::render::text_renderer::RichDecorator{});
    let mut _local1 = <html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> as html2text::render::Renderer>::new_sub_renderer(&_local0, 11791448176899352125);
    let _ = <html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> as html2text::render::Renderer>::add_horizontal_border(&mut _local1);
thread 'main' panicked at 'capacity overflow', library/alloc/src/raw_vec.rs:518:5
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: alloc::raw_vec::capacity_overflow
   3: alloc::raw_vec::RawVec<T,A>::allocate_in
             at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/raw_vec.rs:178:27
   4: alloc::raw_vec::RawVec<T,A>::with_capacity_in
             at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/raw_vec.rs:131:9
   5: alloc::vec::Vec<T,A>::with_capacity_in
             at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/vec/mod.rs:673:20
   6: <T as alloc::vec::spec_from_elem::SpecFromElem>::from_elem
             at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/vec/spec_from_elem.rs:15:21
   7: alloc::vec::from_elem
             at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/vec/mod.rs:2566:5
   8: html2text::render::text_renderer::BorderHoriz::new
             at ./src/render/text_renderer.rs:630:23
   9: <html2text::render::text_renderer::SubRenderer<D> as html2text::render::Renderer>::add_horizontal_border
             at ./src/render/text_renderer.rs:1010:41
  10: replay_html2text161::test_function161
             at ./fuzz_target/build/replay_html2text161/src/main.rs:36:13
  11: replay_html2text161::main
             at ./fuzz_target/build/replay_html2text161/src/main.rs:69:5
  12: core::ops::function::FnOnce::call_once
             at /home/jjf/Fuzzing-Target-Generator/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
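
The trace ends in a vector being filled to the requested width. A minimal sketch of a fallible alternative, assuming the border is a vector with one cell per column: `border_row` is a hypothetical function, not the crate's `BorderHoriz::new`.

```rust
/// Builds a row of `width` border cells, or returns `None` if a vector
/// of that size cannot be allocated.
fn border_row(width: usize) -> Option<Vec<char>> {
    let mut row = Vec::new();
    // try_reserve_exact reports capacity overflow and allocation failure
    // as an Err instead of panicking or aborting.
    row.try_reserve_exact(width).ok()?;
    // Capacity is already reserved, so this does not reallocate.
    row.resize(width, '-');
    Some(row)
}
```

A sanity cap on the width would still be advisable, since a huge but allocatable width would otherwise be filled in full.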

The second case panics with 'attempt to subtract with overflow':

    let data=[60, 116, 97, 98, 108, 101, 62, 60, 116, 114, 62, 60, 116, 100, 62, 120, 105, 60, 48, 62, 0, 0, 0, 60, 116, 97, 98, 108, 101, 62, 58, 58, 58, 62, 58, 62, 62, 62, 58, 60, 112, 32, 32, 32, 32, 32, 32, 32, 71, 87, 85, 78, 16, 16, 62, 60, 15, 16, 16, 16, 16, 16, 16, 15, 38, 16, 16, 16, 15, 1, 16, 16, 16, 16, 16, 16, 162, 111, 107, 99, 91, 112, 57, 64, 94, 100, 60, 111, 108, 47, 62, 127, 60, 108, 73, 62, 125, 109, 121, 102, 99, 122, 110, 102, 114, 98, 60, 97, 32, 104, 114, 101, 102, 61, 98, 111, 103, 32, 105, 100, 61, 100, 62, 60, 111, 15, 15, 15, 15, 15, 15, 15, 39, 15, 15, 15, 106, 102, 59, 99, 32, 32, 32, 86, 102, 122, 110, 104, 93, 108, 71, 114, 117, 110, 100, 96, 121, 57, 60, 107, 116, 109, 247, 62, 60, 32, 60, 122, 98, 99, 98, 97, 32, 119, 127, 127, 62, 60, 112, 62, 121, 116, 60, 47, 116, 100, 62, 62, 60, 111, 98, 62, 123, 110, 109, 97, 101, 105, 119, 60, 112, 101, 101, 122, 102, 63, 120, 97, 62, 60, 101, 62, 60, 120, 109, 112, 32, 28, 52, 55, 50, 50, 49, 52, 185, 150, 99, 62, 255, 112, 76, 85, 60, 112, 62, 73, 100, 116, 116, 60, 75, 50, 73, 116, 120, 110, 127, 255, 118, 32, 42, 40, 49, 33, 112, 32, 36, 107, 57, 60, 5, 163, 62, 49, 55, 32, 33, 118, 99, 63, 60, 109, 107, 43, 119, 100, 62, 60, 104, 58, 101, 163, 163, 163, 163, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 1, 107, 117, 107, 108, 44, 102, 58, 60, 116, 101, 97, 106, 98, 59, 60, 115, 109, 52, 58, 115, 98, 62, 232, 110, 114, 32, 60, 117, 93, 120, 112, 119, 111, 59, 98, 120, 61, 206, 19, 61, 206, 19, 59, 1, 110, 102, 60, 115, 0, 242, 64, 203, 8, 111, 50, 59, 121, 122, 32, 42, 35, 32, 37, 101, 120, 104, 121, 0, 242, 59, 63, 121, 231, 130, 130, 130, 170, 170, 1, 32, 0, 0, 0, 28, 134, 200, 90, 119, 48, 60, 111, 108, 118, 119, 116, 113, 59, 100, 60, 117, 43, 110, 99, 9, 216, 157, 137, 216, 157, 246, 167, 62, 60, 104, 61, 43, 28, 134, 200, 105, 119, 48, 60, 122, 110, 0, 242, 61, 61, 114, 231, 130, 130, 130, 170, 170, 170, 233, 222, 222, 162, 163, 163, 163, 163, 163, 163, 163, 85, 
100, 116, 99, 61, 60, 163, 163, 163, 163, 163, 220, 220, 1, 109, 112, 105, 10, 59, 105, 220, 215, 10, 59, 122, 100, 100, 121, 97, 43, 43, 43, 102, 122, 100, 60, 62, 114, 116, 122, 115, 61, 60, 115, 101, 62, 215, 215, 215, 215, 215, 98, 59, 60, 109, 120, 57, 60, 97, 102, 113, 229, 43, 43, 43, 43, 43, 43, 43, 43, 43, 35, 43, 43, 101, 58, 60, 116, 98, 101, 107, 98, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 98, 99, 62, 60, 112, 102, 59, 124, 107, 111, 97, 98, 108, 118, 60, 116, 102, 101, 104, 97, 62, 60, 255, 127, 46, 60, 116, 101, 62, 60, 105, 102, 63, 116, 116, 60, 47, 116, 101, 62, 62, 60, 115, 98, 62, 123, 109, 108, 97, 100, 119, 118, 60, 111, 99, 97, 103, 99, 62, 60, 255, 127, 46, 60, 103, 99, 62, 60, 116, 98, 63, 60, 101, 62, 60, 109, 109, 231, 130, 130, 130, 213, 213, 213, 233, 222, 222, 59, 101, 103, 58, 60, 100, 111, 61, 65, 114, 104, 60, 47, 101, 109, 62, 60, 99, 99, 172, 97, 97, 58, 60, 119, 99, 64, 126, 118, 104, 100, 100, 107, 105, 60, 120, 98, 255, 255, 255, 0, 60, 255, 127, 46, 60, 113, 127];
    let _local0: html2text::RenderTree = html2text::parse(&data[..]);
    let _local1: html2text::RenderedText::<html2text::render::text_renderer::RichDecorator> = html2text::RenderTree::render(_local0, 1, html2text::render::text_renderer::RichDecorator{});
thread 'main' panicked at 'attempt to subtract with overflow', /home/jjf/Fuzzing-Target-Generator/experiments/rust-html2text/src/lib.rs:1305:65
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: html2text::do_render_node::{{closure}}
             at ./src/lib.rs:1305:65
   4: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/boxed.rs:2001:9
   5: core::ops::function::impls::<impl core::ops::function::Fn<A> for &F>::call
             at /home/jjf/Fuzzing-Target-Generator/library/core/src/ops/function.rs:262:13
   6: html2text::tree_map_reduce::{{closure}}
             at ./src/lib.rs:780:30
   7: core::option::Option<T>::map
             at /home/jjf/Fuzzing-Target-Generator/library/core/src/option.rs:925:29
   8: html2text::tree_map_reduce
             at ./src/lib.rs:775:13
   9: html2text::render_tree_to_string
             at ./src/lib.rs:1128:5
  10: html2text::RenderTree::render
             at ./src/lib.rs:1542:23
  11: replay_html2text10::test_function10
             at ./fuzz_target/build/replay_html2text10/src/main.rs:39:95
  12: replay_html2text10::main
             at ./fuzz_target/build/replay_html2text10/src/main.rs:74:5
  13: core::ops::function::FnOnce::call_once
             at /home/jjf/Fuzzing-Target-Generator/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
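
The trace does not show what lib.rs:1305 subtracts. Assuming it is the render width (1 in this reproduction) minus some reserved columns, a checked subtraction turns the panic into a value the caller can handle. `inner_width` is a hypothetical helper.

```rust
/// Width left for content after reserving `reserved` columns.
/// Returns `None` when nothing usable is left, instead of underflowing.
fn inner_width(total: usize, reserved: usize) -> Option<usize> {
    total.checked_sub(reserved).filter(|w| *w > 0)
}
```

With a width of 1 and two reserved columns this yields `None`, which the renderer could map to an error or to a minimum width.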

The third case panics on Option::unwrap:

    let mut _local0: html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::TrivialDecorator> = html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::TrivialDecorator>::new(18446744073709551615, html2text::render::text_renderer::TrivialDecorator{});
    let _local1_param0_helper1 = &mut (_local0);
    let _ = <html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::TrivialDecorator> as html2text::render::Renderer>::end_strikeout(_local1_param0_helper1);
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /home/jjf/Fuzzing-Target-Generator/experiments/rust-html2text/src/render/text_renderer.rs:1377:38
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: core::option::Option<T>::unwrap
             at /home/jjf/Fuzzing-Target-Generator/library/core/src/option.rs:778:21
   4: <html2text::render::text_renderer::SubRenderer<D> as html2text::render::Renderer>::end_strikeout
             at ./src/render/text_renderer.rs:1377:9
   5: replay_html2text59::test_function59
             at ./fuzz_target/build/replay_html2text59/src/main.rs:34:13
   6: replay_html2text59::main
             at ./fuzz_target/build/replay_html2text59/src/main.rs:66:5
   7: core::ops::function::FnOnce::call_once
             at /home/jjf/Fuzzing-Target-Generator/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

The fourth case panics with 'Attempt to end a preformatted block which wasn't opened.':

    let _local0: html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> = html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator>::new(14829735431805717965, html2text::render::text_renderer::RichDecorator{});
    let _local1_param0_helper1 = &(_local0);
    let mut _local1: html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> = <html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> as html2text::render::Renderer>::new_sub_renderer(_local1_param0_helper1, 14829735431805717810);
    let _local2_param0_helper1 = &mut (_local1);
    let _ = <html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> as html2text::render::Renderer>::end_pre(_local2_param0_helper1);
thread 'main' panicked at 'Attempt to end a preformatted block which wasn't opened.', /home/jjf/Fuzzing-Target-Generator/experiments/rust-html2text/src/render/text_renderer.rs:1027:13
stack backtrace:
   0: std::panicking::begin_panic
             at /home/jjf/Fuzzing-Target-Generator/library/std/src/panicking.rs:607:12
   1: <html2text::render::text_renderer::SubRenderer<D> as html2text::render::Renderer>::end_pre
             at ./src/render/text_renderer.rs:1027:13
   2: replay_html2text164::test_function164
             at ./fuzz_target/build/replay_html2text164/src/main.rs:36:13
   3: replay_html2text164::main
             at ./fuzz_target/build/replay_html2text164/src/main.rs:69:5
   4: core::ops::function::FnOnce::call_once
             at /home/jjf/Fuzzing-Target-Generator/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
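
Both this case and the end_strikeout case above call an end method with no matching start. A minimal sketch of how such unbalanced calls could be reported as errors rather than panics; `StyleStack`, `Style`, and `StyleError` are hypothetical types, not the crate's renderer.

```rust
#[derive(Debug, PartialEq)]
enum Style {
    Strikeout,
    Pre,
}

#[derive(Debug, PartialEq)]
enum StyleError {
    /// `end` was called for a style that is not the innermost open one.
    NotOpen(Style),
}

#[derive(Default)]
struct StyleStack {
    open: Vec<Style>,
}

impl StyleStack {
    fn start(&mut self, style: Style) {
        self.open.push(style);
    }

    /// Closes `style` only if it is the innermost open style.
    fn end(&mut self, style: Style) -> Result<(), StyleError> {
        if self.open.last() == Some(&style) {
            self.open.pop();
            Ok(())
        } else {
            Err(StyleError::NotOpen(style))
        }
    }
}
```

Whether these low-level methods should be callable in this order at all is a separate API question; the sketch only shows the non-panicking variant.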

unable to use ANSI sequences with from_read_with_decorator

Hi,

I'm trying to use from_read_with_decorator with my own TextDecorator to produce terminal output with colors and text styles. Unfortunately, from_read_with_decorator seems to strip part of each escape sequence (the leading \e), which breaks the terminal formatting.

I use termion for colors and styles:

impl TextDecorator for ContentDecorator {
    type Annotation = RichAnnotation;

    fn decorate_link_start(&mut self, url: &str) -> (String, Self::Annotation) {
        self.0.push(url.to_string());
        (
            format!(
                "{}{}{}* {}{}",
                Italic,
                Fg(Black),
                self.0.len() + 1,
                StyleReset,
                Fg(Blue)
            ),
            RichAnnotation::Link(url.to_string()),
        )
    }
// ...

Then

let output = from_read_with_decorator(html.as_bytes(), term_width, ContentDecorator(vec![]));

Input and output:

HTML: Google has <a href="https://blog.chromium.org/2021/01/limiting-private-api-availability-in.html">announced</a> that they are going to block

from_read_with_decorator output:
"Google has [3m[38;5;0m2* [m[38;5;4mannounced[39m that they are going to block

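In the output above the escape byte is missing while the rest of each sequence survives. One way around this, sketched without any html2text API, is to keep the decorator strings plain and add the escape codes after rendering, based on the annotations. `Tag` and `colourise` are hypothetical names, and the span list stands in for whatever annotated output the renderer provides.

```rust
/// Hypothetical annotation attached to a span of already laid-out text.
enum Tag {
    Plain,
    Link,
}

/// Wraps each span in ANSI SGR codes after layout, so the escape bytes
/// never pass through the renderer or count toward the line width.
fn colourise(spans: &[(Tag, &str)]) -> String {
    let mut out = String::new();
    for (tag, text) in spans {
        match tag {
            Tag::Plain => out.push_str(text),
            Tag::Link => {
                out.push_str("\x1b[34m"); // blue foreground
                out.push_str(text);
                out.push_str("\x1b[39m"); // default foreground
            }
        }
    }
    out
}
```

Styling after layout also keeps line wrapping correct, because zero-width escape sequences are never measured as text.
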
Make links global

Links are currently recursively rendered into tables, which can take much more space than the text, distorting the table in the process.

#81 is a prototype of how to collect the links globally and append them only at the very end, although you might have a better idea. The PR turns the former BuilderStack into the new TextRenderer, which can additionally carry state shared among the stack of renderers.
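
The idea of shared link state can be sketched roughly like this. `LinkCollector` is a hypothetical type for illustration and not the code in the PR.

```rust
/// Shared state: every link target seen so far, in first-seen order.
#[derive(Default)]
struct LinkCollector {
    urls: Vec<String>,
}

impl LinkCollector {
    /// Records `url` and returns its 1-based footnote number.
    /// A URL seen before reuses its existing number.
    fn add(&mut self, url: &str) -> usize {
        if let Some(pos) = self.urls.iter().position(|u| u == url) {
            return pos + 1;
        }
        self.urls.push(url.to_string());
        self.urls.len()
    }

    /// Renders the footnote list to append once, after the whole document.
    fn footer(&self) -> String {
        self.urls
            .iter()
            .enumerate()
            .map(|(i, u)| format!("[{}]: {}\n", i + 1, u))
            .collect()
    }
}
```

Each sub-renderer, including those for table cells, would only emit the short `[n]` marker, so link targets no longer widen the cells.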
