jugglerchris / rust-html2text
Rust library to render HTML as text.
License: MIT License
It would be great if you could add support for definition lists:
<dl>
<dt>Topic</dt>
<dd>Definition</dd>
</dl>
The topic could be rendered in bold and the definition could be indented by two or four spaces.
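As a rough illustration of the requested layout (not html2text's actual implementation), the sketch below renders term/definition pairs with the term wrapped in `**` as a stand-in for bold and the definition indented. The function name `render_definition_list` and its parameters are hypothetical.

```rust
/// Hypothetical helper: render (term, definition) pairs as plain text.
/// Terms are wrapped in `**` as a stand-in for bold; each definition
/// line is indented by `indent` spaces.
fn render_definition_list(items: &[(&str, &str)], indent: usize) -> String {
    let pad = " ".repeat(indent);
    let mut out = String::new();
    for (term, definition) in items {
        out.push_str("**");
        out.push_str(term);
        out.push_str("**\n");
        // Indent every line of the definition, not just the first one.
        for line in definition.lines() {
            out.push_str(&pad);
            out.push_str(line);
            out.push('\n');
        }
    }
    out
}
```

Calling `render_definition_list(&[("Topic", "Definition")], 4)` yields the term on one line and the definition indented by four spaces on the next.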
I’ve spent some time using html2text, reading its source code and even writing small patches. Still, I haven’t really grasped the complete rendering process that html2text performs. At the same time, I have some specific requirements like #27 or #36 that cannot be realized with html2text and maybe don’t even belong in a generic HTML rendering library.
Therefore, I am wondering: Would it be possible and would it make sense to decouple the html2text rendering pipeline into steps that can be customized by the user? This would make it easier to understand the rendering process, and it might make it possible to implement some of the requirements I mentioned earlier without having to re-implement the entire rendering stack.
From my point of view, these are the steps of the rendering pipeline (while I’m quite confident that steps 1–3 are correct, I’m not really sure about 4 and 5):
1. … (src/lib.rs).
2. … (src/lib.rs).
3. … (src/lib.rs).
4. … (src/text_renderer.rs?).
5. … (src/text_renderer.rs?).
6. … TextDecorator (src/text_renderer.rs).
It would be especially nice if the user would be able to customize step 5 without having to re-implement everything else.
Is my understanding of the rendering process roughly correct? What do you think?
Looking at the Cargo.toml file, it seems like the dependencies haven't been touched in 2 years.
There are major changes and improvements to be gained from updating them. For example, html5ever is currently locked at version 0.9.0 while the current version is 0.22.
It would be really nice if downstream crates did not need to vendor such an old version of html5ever and its dependencies if they want to use html2text.
For example, here is how my Cargo.lock file looks at the moment:
[[package]]
name = "html2text"
version = "0.1.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
"backtrace 0.2.3 (registry+https://github.com/rust-lang/crates.io-index)",
"html5ever 0.9.0 (registry+https://github.com/rust-lang/crates.io-index)",
"html5ever-atoms 0.1.3 (registry+https://github.com/rust-lang/crates.io-index)",
"string_cache 0.2.29 (registry+https://github.com/rust-lang/crates.io-index)",
"unicode-width 0.1.5 (registry+https://github.com/rust-lang/crates.io-index)",
]
[[package]]
name = "html5ever"
version = "0.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
"html5ever-atoms 0.1.3 (registry+https://github.com/rust-lang/crates.io-index)",
"log 0.4.3 (registry+https://github.com/rust-lang/crates.io-index)",
"mac 0.1.1 (registry+https://github.com/rust-lang/crates.io-index)",
"phf 0.7.22 (registry+https://github.com/rust-lang/crates.io-index)",
"phf_codegen 0.7.22 (registry+https://github.com/rust-lang/crates.io-index)",
"quote 0.3.15 (registry+https://github.com/rust-lang/crates.io-index)",
"rustc-serialize 0.3.24 (registry+https://github.com/rust-lang/crates.io-index)",
"syn 0.9.2 (registry+https://github.com/rust-lang/crates.io-index)",
"tendril 0.2.4 (registry+https://github.com/rust-lang/crates.io-index)",
]
[[package]]
name = "html5ever"
version = "0.22.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
"log 0.4.3 (registry+https://github.com/rust-lang/crates.io-index)",
"mac 0.1.1 (registry+https://github.com/rust-lang/crates.io-index)",
"markup5ever 0.7.2 (registry+https://github.com/rust-lang/crates.io-index)",
"proc-macro2 0.3.8 (registry+https://github.com/rust-lang/crates.io-index)",
"quote 0.5.2 (registry+https://github.com/rust-lang/crates.io-index)",
"syn 0.13.11 (registry+https://github.com/rust-lang/crates.io-index)",
]
[[package]]
name = "html5ever-atoms"
version = "0.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
dependencies = [
"string_cache 0.3.0 (registry+https://github.com/rust-lang/crates.io-index)",
"string_cache_codegen 0.3.1 (registry+https://github.com/rust-lang/crates.io-index)",
]
Cheers
Using the latest version of html2text:
[package]
name = "testhtml2text"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
html2text = "0.4.4"
I have this example that reproduces what I am talking about:
fn main() {
    let input = b"
<pre>
This is a preformatted block of text.
It has newlines and renders as-is written in HTML.
</pre>";
    println!(
        "{}",
        html2text::from_read_with_decorator(
            &input[..],
            120,
            html2text::render::text_renderer::TrivialDecorator::new()
        )
    );
}
The program outputs:
This is a preformatted block of text.
It has newlines and renders as-is written in HTML.
As I use html2text to parse descriptions of algorithmic problems, I would like the empty new lines to be preserved in the output because sometimes they are crucial to understanding a problem's example. E.g. https://adventofcode.com/2020/day/4
I don't have any issues with html2text ignoring empty new lines anywhere else but <pre> blocks. These tags are understood explicitly as already preformatted, and all browsers would render them as written in the source code. html2text is not a browser. Still, I believe preserving the intent of the HTML is beneficial in this case.
As a workaround, I am inserting <br> tags at the end of all lines in the <pre> blocks of HTML I am downloading, but that seems suboptimal at best.
Calling a function like this caused my program to run out of memory (OOM) and crash, and it took me ages to work out because I didn't know what was causing it:
println!(
    "{}",
    html2text::parse(s.content.as_bytes())
        .render_plain(0)
        .into_string()
);
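Until the library itself rejects a zero width, a caller could validate the width before rendering. The following is only a sketch of such a guard. `checked_width` and `min_width` are hypothetical names and not part of html2text.

```rust
/// Hypothetical guard: refuse render widths below a caller-chosen minimum
/// (for example 1) before handing them to a renderer.
fn checked_width(requested: usize, min_width: usize) -> Result<usize, String> {
    if requested < min_width {
        Err(format!(
            "render width {} is below the minimum of {}",
            requested, min_width
        ))
    } else {
        Ok(requested)
    }
}
```

With this in place, the width passed to the renderer would be `checked_width(width, 1)?` rather than a raw value that might be 0.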
Thank you for this library. I'm using it in https://github.com/ayrat555/el_monitorro to remove HTML from data feeds.
Is it possible to also remove invisible Unicode characters from text, for example the zero-width space at https://unicode-table.com/en/200B/?
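As a workaround on the caller's side, the rendered text can be post-processed. This is a minimal sketch that assumes a small, hand-picked set of invisible code points. Both the set and the function names are illustrative and not something html2text provides.

```rust
/// Returns true for a hand-picked set of invisible code points:
/// soft hyphen, combining grapheme joiner, zero-width space/non-joiner/joiner,
/// word joiner, and the byte-order mark (zero-width no-break space).
fn is_invisible(c: char) -> bool {
    matches!(
        c,
        '\u{00AD}' | '\u{034F}' | '\u{200B}'..='\u{200D}' | '\u{2060}' | '\u{FEFF}'
    )
}

/// Remove the code points above from `text`, keeping everything else.
fn strip_invisible(text: &str) -> String {
    text.chars().filter(|c| !is_invisible(*c)).collect()
}
```

Note that U+200C and U+200D carry meaning in some scripts and in emoji sequences, so a real filter may want to keep them.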
The unwrap() in the following code from fn tree_map_reduce at lib.rs:554 seems to result in a panic on certain inputs:
// Get the next child node to process
let next_node = pending_stack.last_mut()
.unwrap()
.to_process
.next();
I'm working on coming up with a test case that doesn't have sensitive information in it, but submitting this issue in advance.
Here is a backtrace:
11: 0x55fee3b5bc90 - core::option::Option<T>::unwrap::hfa52bb5a7cdb86d0
at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/macros.rs:12
12: 0x55fee3cac769 - html2text::tree_map_reduce::h0eeb44a03fc94bd1
at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:553
13: 0x55fee3cb4f9c - html2text::dom_to_render_tree::ha0f5d421fcdfeabf
at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:594
14: 0x55fee3cb6036 - html2text::children_to_render_nodes::{{closure}}::ha8fb859e9d4ccd77
at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:397
15: 0x55fee3ca8294 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once::h429e02aa062973b1
at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/ops/function.rs:279
16: 0x55fee3b4d65b - core::option::Option<T>::map::h9cb8c6fa389220bd
at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/option.rs:416
17: 0x55fee3b9d094 - <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::next::h3e67bc3ec8205108
at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/iter/adapters/mod.rs:570
18: 0x55fee3b29597 - <core::iter::adapters::flatten::FlattenCompat<I,U> as core::iter::traits::iterator::Iterator>::next::hd76b5e6cad2d2858
at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/iter/adapters/flatten.rs:219
19: 0x55fee3b27cfb - <core::iter::adapters::flatten::FlatMap<I,U,F> as core::iter::traits::iterator::Iterator>::next::h1ff61594eb8d230e
at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/iter/adapters/flatten.rs:49
20: 0x55fee3c8f846 - <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T,I>>::from_iter::h8595be8b7c7b1a21
at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/liballoc/vec.rs:1883
21: 0x55fee3c980ad - <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter::hc732d37a466aab80
at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/liballoc/vec.rs:1796
22: 0x55fee3b2c89d - core::iter::traits::iterator::Iterator::collect::h3601df414b19d4b1
at /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/iter/traits/iterator.rs:1466
23: 0x55fee3cb5f4a - html2text::children_to_render_nodes::h1489c6b97e9e00fc
at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:394
24: 0x55fee3cb6874 - html2text::list_children_to_render_nodes::h842485128c34486e
at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:411
25: 0x55fee3cb0ff2 - html2text::process_dom_node::hf0aa74c8f1662c45
at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:792
26: 0x55fee3cb5000 - html2text::dom_to_render_tree::{{closure}}::he3666d29e519f1ac
at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:595
27: 0x55fee3cac929 - html2text::tree_map_reduce::h0eeb44a03fc94bd1
at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:559
28: 0x55fee3cb4f9c - html2text::dom_to_render_tree::ha0f5d421fcdfeabf
at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:594
29: 0x55fee3cb62db - html2text::from_read_with_decorator::haafad1b14c75eddb
at /home/darius/.cargo/git/checkouts/rust-html2text-62cf1d00222373b4/ac26264/src/lib.rs:1183
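One general way to avoid this class of panic is to let the loop end when the stack is empty instead of unwrapping its top element. The sketch below shows that pattern on a simplified stand-in: `Frame` and `drain_frames` are hypothetical and much simpler than the real tree_map_reduce.

```rust
/// Simplified stand-in for a pending stack entry: just the children
/// that still have to be processed.
struct Frame {
    to_process: std::vec::IntoIter<u32>,
}

/// Visit all pending nodes, top frame first, without ever unwrapping.
fn drain_frames(mut pending_stack: Vec<Frame>) -> Vec<u32> {
    let mut seen = Vec::new();
    // `while let` stops cleanly on an empty stack, where
    // `last_mut().unwrap()` would panic.
    while let Some(top) = pending_stack.last_mut() {
        match top.to_process.next() {
            Some(node) => seen.push(node),
            None => {
                // This frame is exhausted; drop it and continue with its parent.
                pending_stack.pop();
            }
        }
    }
    seen
}
```

Whether an empty stack at that point is a legitimate end state or a logic error in the real code is a separate question; the sketch only shows how to avoid the panic.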
Sorry if I missed the relevant information while perusing the documentation, but is there an easy way to produce raw text output (without any markdown formatting at all)? It seems to me I could implement my own thing based on traversing a RenderNode or by implementing a custom TextDecorator. But it also seems to be a very common use case, especially when pre-processing documents from the web for NLP pipelines.
If it does not yet exist, can I contribute it to this library? If so, do you have any guidance on the correct way to do so?
There are situations in which it would be useful for html2text to understand at least a small amount of CSS.
An occasional annoyance I find with some web pages is that they use different classes of <span> (or <div>, depending on preference) for all their formatting, including both paragraph separation and inline style changes such as emphasis. Then they rely on CSS to make some of those span classes behave like <p>, some like <em>, some like <code> and so on.
html2text can't render a document of that kind sensibly without having to speak enough CSS to at least know which classes of <span> it should treat like which normal tags. You end up with a huge megaparagraph, or alternatively no end of spurious newlines (depending on whether the author went all-spans or all-divs).
I don't have a real-world example handy, but here's one I mocked up manually:
<html>
<head>
<title>Demo of the 'spans-everywhere' school of HTML</title>
<style type="text/css">
.p { display: block; margin-bottom: 1em; }
.em { font-style: italic; }
.code { font-family: monospace; }
</style>
</head>
<body>
<span class="p">Paragraph one, containing <span class="em">emphasis</span>.</span><span class="p">Paragraph two, containing <span class="code">code</span>.</span>
</body>
</html>
@jugglerchris mentioned that another use case is pages that use display: none.
Hi,
Currently, links are output as [link text][link number].
For example:
This is [a link][1]
[1] https://google.com
However, for this to work as a Markdown reference link, a colon is required after the label in the second part of the link:
This is [a link][1]
[1]: https://google.com
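For illustration, here is a small sketch of how the footnote list could be emitted in the colon form. `format_link_definitions` is a hypothetical helper, not an html2text function.

```rust
/// Hypothetical helper: format collected link targets as Markdown
/// reference definitions, numbered from 1, one per line.
fn format_link_definitions(urls: &[&str]) -> String {
    urls.iter()
        .enumerate()
        .map(|(i, url)| format!("[{}]: {}", i + 1, url))
        .collect::<Vec<_>>()
        .join("\n")
}
```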
If the input to html2text can't be rendered in the specified display width, one option (in some client UIs) is to render it anyway at a larger width, and do some other kind of compromise when presenting it to the user, such as apologetically telling them they need to widen their terminal, or providing keystrokes to pan the window left and right across the wider logical canvas that you rendered the document on.
In the current API, you can do this by handling Error::TooNarrow: if you try to render at the physical terminal width and get TooNarrow, you can increase the render width and try again, until you find a width at which the document successfully renders. (And then maybe you binary-search between the largest failing width and the smallest successful one to find the cutoff? And then maybe you increase it by 20% or 50% or something, to avoid one-character-wide table cells? But these are client-side decisions.)
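The binary-search step described above could look roughly like the sketch below. It assumes that rendering is monotonic in the width (if a width works, every larger width works too) and abstracts the renderer behind a caller-supplied closure; `smallest_fitting_width` and `fits` are hypothetical names, with `fits(w)` standing in for "rendering at width `w` did not return Error::TooNarrow".

```rust
/// Find the smallest width in `lo..=hi` for which `fits(width)` is true,
/// assuming that once a width fits, all larger widths fit as well.
/// Returns None if even `hi` does not fit (or the range is empty).
fn smallest_fitting_width(
    lo: usize,
    hi: usize,
    fits: impl Fn(usize) -> bool,
) -> Option<usize> {
    if lo > hi || !fits(hi) {
        return None;
    }
    let (mut lo, mut hi) = (lo, hi);
    // Invariant: `hi` fits, and every width below `lo` does not.
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if fits(mid) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    Some(hi)
}
```

A client would typically call this with `lo` set to the terminal width and `hi` set to the first width found to succeed, then add whatever safety margin it prefers.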
But the downside of doing this is that if the document contains one table that needs at least 500 columns, then all the ordinary paragraphs alongside that table are also wrapped at 500 columns, making them very hard to read. It would be nicer if all the parts that don't have to be ultra-wide could still be made to fit into my (say) 80-column terminal.
In particular, if the wide table is something that not all readers of the document will need to bother with at all, then that way, I can read the introductory paragraphs easily, and use them to decide whether I need to bother widening my window to read the table!
There's more than one way that that high-level requirement might be achieved. Two thoughts that have sprung to mind (which may still not be the only possible approaches):
1. Client application tells html2text to render at my actual terminal width, but with a flag of some kind that says that that width limit is 'strongly advised' rather than 'mandatory'. Then if a table can't help being wider than specified, fine, it can do that. But anything that can fit in the width should.
After setting this flag, the client would never expect to get Error::TooNarrow back from the renderer. But, in return, it accepts that the output might contain overlong lines, and must find some way of dealing with them itself if so.
2. Client application gives html2text two width parameters. One is the physical display width, just as now. The other is a 'maximum wrap width'. Each paragraph is wrapped to min(maximum wrap width, however much space is physically available in the layout).
The idea is that the client starts by setting both widths to the physical terminal width. If the document can't be rendered at that width, then the renderer still returns Error::TooNarrow, and the client responds the same way I suggested above, by cranking up the display width until the error stops happening. But it leaves the maximum wrap width at its original value.
Option 1 has the advantage of only needing one call to the renderer, instead of O(log n) calls to zero in on a good render width.
Option 2 gives the client more control, because it still gets to decide how much larger than the absolute minimum possible width it wants to make things.
In both cases, there's a question of what happens to a single very wide table cell. (Say, one row of your table has an enormous number of tiny cells, and another has a single cell with colspan=lots covering the same horizontal space as all of them, so that that one cell is forced to be wider than the physical display.) With my option 2, the very wide cell would still have its paragraphs limited to the max wrap width, because that max width is applied to every paragraph anywhere in the whole layout. So if the client application is handling wide documents by providing keystrokes to pan left and right across the canvas, there will be some pan position at which the text of that table cell is readable. Perhaps option 1 might do the same thing?
With the max-wrap-width approach, there's also a question of how you measure it for indented things like <li> or <blockquote>: from the left margin of the text, or of the containing column? For example, which of these would you see, in a set of nested lists? This, which lets you read the whole list without having to pan your display at all?
plain para at the
full screen width
* narrower bullet
  point
  * even narrower
    bullet point
    * smaller yet
      and so on
Or this, which avoids the bullet points getting squashed up against the right-hand edge, making use of the fact that they physically do have space to expand into?
plain para at the
full screen width
* bullet point uses
  same width so its
  margin is 2 chars
  further right
  * nested bullets in
    turn move 2 chars
    right each time
    * result: you never
      get text squashed
      too narrow
The CSS max-width property takes the latter view, on the basis that it expects that you've widened your browser window and the whole wide window is visible. But in a context where you might be using a "pan left and right" UI, perhaps the former makes more sense?
Or perhaps some further compromise involving a minimum wrap width, so that once the bullet points get absolutely too silly squashed up against the right margin, there's a fallback available?
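The two measuring policies, plus the minimum-wrap-width fallback, can be written down as simple arithmetic. This is only a sketch to make the trade-off concrete; the function and parameter names are hypothetical and nothing here reflects an existing html2text option.

```rust
/// Policy A: the wrap limit is measured from the left edge of the screen,
/// so indented text gets narrower, but never narrower than `min_wrap`.
fn wrap_width_from_screen_edge(max_wrap: usize, indent: usize, min_wrap: usize) -> usize {
    max_wrap.saturating_sub(indent).max(min_wrap)
}

/// Policy B: the wrap limit is measured from the text's own left margin,
/// so indented text keeps its width as long as the display has room for it.
fn wrap_width_from_text_margin(max_wrap: usize, indent: usize, display_width: usize) -> usize {
    max_wrap.min(display_width.saturating_sub(indent))
}
```

With a wrap limit of 17 and an indent of 4, policy A leaves 13 columns of text while policy B still allows 17, provided the display is wide enough.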
As you can see, I don't have all the answers here :-)
TextDecorator::finalise returns a Vec<TaggedString<T>>, but in TextRenderer::into_lines, the annotations are ignored and only the string value is rendered. I’ll try to prepare a fix later today.
I'm using the html2text example tool to convert HTML emails for a TUI email client.
In issue #134, I learned that using the CSS feature/option can hide HTML elements with max-height: 0 and display: none styles, which is really useful for email preview divs.
But enabling CSS also overrides the RichAnnotations colours that are defined by design.
I would like to use the CSS feature for the layout, but without overriding RichAnnotations.
It doesn't look like the overriding was implemented in "examples/html2text.rs", but directly in the library.
How can I do that in "examples/html2text.rs"?
I am trying to render some CSS in the HTML, and I believe I need the css feature for this to work, but I cannot add it in my Cargo.toml file due to this:
# Cargo.toml
html2text = { version = "0.12.4", features = ["css"]}
error: failed to select a version for the requirement `lightningcss = "^1.0.0-alpha.54"`
candidate versions found which didn't match: 1.0.0-alpha.52, 1.0.0-alpha.51, 1.0.0-alpha.50, ...
location searched: crates.io index
required by package `html2text v0.12.4`
... which satisfies dependency `html2text = "^0.12.4"` (locked to 0.12.4) of package `weather v0.1.0 (/home/abhishek/quick-test/weather-rs)`
if you are looking for the prerelease package it needs to be specified explicitly
lightningcss = { version = "1.0.0-alpha.52" }
perhaps a crate was updated and forgotten to be re-vendored?
Here's my code which isn't working due to the above:
let s = config::rich()
.add_css()
.string_from_read(&mut reader, 150)
.context("Render failed")?;
I've built a TUI Miniflux client that uses this lib for converting contents of RSS feed entries into something readable in the terminal, and it's great.
The only thing missing is that I'd love a way to just spit out the URL for an image when one is present (or do some other sort of processing with that URL, like giving it to the user in another way). Currently, the RichAnnotation::Image enum member doesn't provide any way to get the image's src attribute, so I can't actually do that (instead, I either show alt text or nothing).
Your library looks very promising. Unfortunately, I cannot use it because:
- … (yaml_metadata_block)
Let me explain my use case in more detail. I tested whether I could replace the following in my toolchain:
pandoc --standalone -f html -t markdown_strict+yaml_metadata_block+pipe_tables
with:
html2text
This would allow me to do this:
curl $(xclip -o) | html2text | tp-note
and even integrate your library into tp-note. Then the above would look like this:
curl $(xclip -o) | tp-note
Tp-Note comes with a document viewer that renders the content with pulldown-cmark, which is compliant with the CommonMark specification.
As CommonMark is the de facto official specification for Markdown, making html2text compatible with it would open up a wider range of use cases (mine included).
Another advantage: CommonMark has a validation test suite.
What do you think?
Is it possible to implement text decorator functions for all elements? The current implementations for plain and rich text are already very helpful, but I would like to tweak the appearance of some specific elements. Let’s say I want to print headings with a bold typeface or I want to set the color based on the element’s class. This could be done with functions like:
fn decorate_element_start(&self, name: &str, attrs: HashMap<String, String>) -> Self::Annotation
fn decorate_element_end(&self, name: &str) -> Self::Annotation
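To make the proposal a little more concrete, here is a self-contained sketch of what such a hook could look like and how a "bold headings" decorator might implement it. The trait `ElementDecorator` and the type `BoldHeadings` are hypothetical and are not part of html2text; the attributes are passed by reference here to avoid cloning, which differs slightly from the signatures above.

```rust
use std::collections::HashMap;

/// Hypothetical per-element hook, modelled on the two functions proposed above.
trait ElementDecorator {
    type Annotation;
    fn decorate_element_start(
        &self,
        name: &str,
        attrs: &HashMap<String, String>,
    ) -> Self::Annotation;
    fn decorate_element_end(&self, name: &str) -> Self::Annotation;
}

fn is_heading(name: &str) -> bool {
    matches!(name, "h1" | "h2" | "h3" | "h4" | "h5" | "h6")
}

/// Example decorator: annotate headings with ANSI bold on/off sequences.
struct BoldHeadings;

impl ElementDecorator for BoldHeadings {
    type Annotation = &'static str;

    fn decorate_element_start(
        &self,
        name: &str,
        _attrs: &HashMap<String, String>,
    ) -> &'static str {
        if is_heading(name) { "\x1b[1m" } else { "" }
    }

    fn decorate_element_end(&self, name: &str) -> &'static str {
        if is_heading(name) { "\x1b[0m" } else { "" }
    }
}
```

A class-based colour scheme would work the same way, by looking up `attrs.get("class")` in the start hook.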
I'm hitting the error in
pub fn add_preformatted_text(&mut self, text: &str, tag_main: &T, tag_wrapped: &T) {
    ...
    for c in text.chars() {
        if let Some(charwidth) = UnicodeWidthChar::width(c) {
            ...
        } else {
            match c {
                '\n' => {
                    self.force_flush_line();
                    self.pre_wrapped = false;
                }
                '\t' => {
                    ...
                }
                _ => {
                    eprintln!("Got character: {:?}", c);
                }
            }
        }
        html_trace_quiet!(" Added char {:?}", c);
    }
}
Can this be silenced?
Like this one:
Found 1 items, similar to سلام.
-->Moin
-->سلام
<p align=right dir=rtl>(سَ) [<font color="green"> ع.</font> ] (<font color="green">مص ل.</font>)<br><font color="#7030a0">۱-</font> درود گفتن.<br><font color="#7030a0">۲-</font> بی گزند شدن.<br><font color="#7030a0">۳-</font> گردن نهادن.<br>~ علیک درود بر تو باد.<br>~ علیکم درود بر شما.</p>
Is there an option to convert <br /> tags to newline characters in the output?
Currently it seems to ignore all <br /> tags in the output.
In Email, it was a common practice to add a separator indicating the end of the preview text.
This method does not work on all clients and is subject to regular changes. I came across this article on the subject with a few examples.
The problem is that the invisible characters used for this hack are badly converted by html2text, which leaves many "COMBINING GRAPHEME JOINER" entities in the output.
This entity is crudely displayed in pagers or TUI mail clients (dotted circle in a dotted square).
The first complication is that there are many variants of the sequences used; some are presented in the article above. In my own inbox, this week, I found at least 2 variants:
͏‌
͏  
, which you can reproduce with echo -ne '\u00ad\u034f \u200c '
Another complication is that the sequence is sometimes formatted in columns, with newlines at different places in the sequence, like here:
This example would currently be converted by html2text like this:
For now, I have been piping the source file through sed first to delete those sequences:
sed -e 's/͏‌ /\n/g' -e "s/$(echo -ne '\u00ad\u034f')[; ^C]*$(echo -ne '\u200c')[; \n]* [; \n]*/\n/g"
I have updated this script 3 times and was going to look for a solution for the cases with columns/newlines.
But I've just realized that the best solution would be to get rid of all the text before this sequence in the body, since it's a preview, a repetition of the text present in the rest of the document.
I think that such a deletion would only affect email documents and would be a good addition to the library, even by default.
Even after the fix in #64, some tables still crash, hitting the assert!(width > 0) in text_renderer.rs.
A reduced example is below:
<!DOCTYPE html>
<html>
<table><tbody><tr><td>
<ol><li>ဘိန်းမုန့်</li></ol></td>
<td>
<ol><li>မုန့်ကြာစိ</li>
<li>မုန့်တီ</li>
<li>ခိုတောင်မုန့်တီ></li>
<li>မုန့်ဟင်းခ</li>
<li>တစ်ပင်တိုင်မုန့်ဟင်းခါး</li>
<li>မုန့်စိမ်းပေါင်း</li>
<li>မြိတ်ကတ်ကြေးကိုက်</li>
<li>မြီးရှည်</li>
</ol>
</td>
<td>
<ol>
<li></li></ol></td>
</tr>
</tbody>
</table>
</html>
I have not further investigated and may not be able to for a while, but removing the assert condition leads to this example being processed reasonably, while processing the whole html this originated from appears to hit an infinite loop filling memory until OOM.
For use cases like reading HTML email in TUI email clients, it would be nice to have inline hyperlinks using OSC 8.
If you think that makes sense, I can try to implement it.
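For reference, an OSC 8 hyperlink wraps the visible text between an "open" sequence carrying the URL and a "close" sequence with an empty URL, each terminated by ST (ESC \). The helper below is a minimal sketch of that encoding; `osc8_link` is a hypothetical name and not an html2text function.

```rust
/// Wrap `text` in an OSC 8 hyperlink pointing at `url`.
/// Layout: ESC ] 8 ; params ; URL ST  text  ESC ] 8 ; ; ST
/// (params left empty, ST written as ESC \).
fn osc8_link(url: &str, text: &str) -> String {
    format!("\x1b]8;;{}\x1b\\{}\x1b]8;;\x1b\\", url, text)
}
```

Terminals that do not support OSC 8 generally ignore the sequences and show only the text, but that depends on the terminal.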
The tests indicate that it was intentional for html2text to preserve whitespace, but is there a reason for this? I've run into HTML that expects the default behavior of browsers, which collapse runs of whitespace. If the whitespace is not collapsed, it messes up the formatting of a larger table.
<td style="margin: 0px; padding: 0px;
-webkit-print-color-adjust: exact;" class="">Your
AT Conference Account is Past Due - Suspension
Notice</td>
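As a point of comparison, collapsing whitespace runs in a text fragment takes only a few lines. This sketch is a simplification of what browsers do: it also trims leading and trailing whitespace, and it knows nothing about <pre> or CSS white-space settings. `collapse_whitespace` is a hypothetical helper.

```rust
/// Replace every run of whitespace (spaces, tabs, newlines) with a single
/// space, dropping leading and trailing whitespace entirely.
fn collapse_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Applied to the cell content above, this yields a single line: "Your AT Conference Account is Past Due - Suspension Notice".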
The render table code is susceptible to panicking after attempting to divide by zero, as it just happened to me at this line of code:
thread 'main' panicked at 'attempt to divide by zero', [..]/html2text-0.6.0/src/lib.rs:1382:13
There are probably more places where this can happen.
Thanks
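One defensive pattern for this kind of failure is to use checked division wherever a divisor could be zero, so the caller has to handle the empty case explicitly. A minimal sketch, with the hypothetical helper `width_per_column` standing in for whatever the table code actually computes:

```rust
/// Split `total_width` evenly across `columns`.
/// Returns None instead of panicking when there are no columns.
fn width_per_column(total_width: usize, columns: usize) -> Option<usize> {
    total_width.checked_div(columns)
}
```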
Hello. I've been using your library for my RSS reader for a long time.
It has a couple of glitches, one of which is the rendering of HTML tables. For example,
let description = "<table cellpadding='10'>\n<tr>\n<td valign='top' align='center'><a href='https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/' title='Software engineer hacked webcams to spy on girls—Here's how to protect yourself'><img src='https://blog.malwarebytes.com/wp-content/uploads/2022/01/GettyImages-1199960659.jpg' border='0' width='300px' /></a></td>\n</tr>\n<tr>\n<td valign='top' align='left'>Yes, hackers can and will use your webcams against you if they see an opportunity. Don't let them.</p>\n<p>Categories: <a href=\"https://blog.malwarebytes.com/category/cybercrime/\" rel=\"category tag\">Cybercrime</a></p>\n<p>Tags: <a href=\"https://blog.malwarebytes.com/tag/andrew-shorrock/\" rel=\"tag\">Andrew Shorrock</a><a href=\"https://blog.malwarebytes.com/tag/catfishing/\" rel=\"tag\">catfishing</a><a href=\"https://blog.malwarebytes.com/tag/hacker-jailed/\" rel=\"tag\">hacker jailed</a><a href=\"https://blog.malwarebytes.com/tag/national-crime-agency/\" rel=\"tag\">National Crime Agency</a><a href=\"https://blog.malwarebytes.com/tag/nca/\" rel=\"tag\">NCA</a><a href=\"https://blog.malwarebytes.com/tag/robert-davies/\" rel=\"tag\">Robert Davies</a><a href=\"https://blog.malwarebytes.com/tag/software-engineer-hacker/\" rel=\"tag\">software engineer hacker</a><a href=\"https://blog.malwarebytes.com/tag/voyuerism/\" rel=\"tag\">voyuerism</a><a href=\"https://blog.malwarebytes.com/tag/webcam-security/\" rel=\"tag\">webcam security</a></p>\n<table width='100%'>\n<tr>\n<td align=right>\n<p><b>(<a href='https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/' title='Software engineer hacked webcams to spy on girls—Here's how to protect yourself'>Read more...</a>)</b></p>\n</td>\n</tr>\n</table>\n</td>\n</tr>\n</table>\n<p>The post <a rel=\"nofollow\" 
href=\"https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/\">Software engineer hacked webcams to spy on girls—Here’s how to protect yourself</a> appeared first on <a rel=\"nofollow\" href=\"https://blog.malwarebytes.com\">Malwarebytes Labs</a>.</p>\n";
eprintln!(
"raw : {:?}",
html2text::from_read(description.as_bytes(), 2000)
.trim()
.to_string(),
);
The result is:
raw : "─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────\n[][1] \n \n[1] https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/ \n─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\nYes, hackers can and will use your webcams against you if they see an opportunity. Don't let them. \n \n \nCategories: [Cybercrime][1] \n \nTags: [Andrew Shorrock][2][catfishing][3][hacker jailed][4][National Crime Agency][5][NCA][6][Robert Davies][7][software engineer hacker][8][voyuerism][9][webcam security][10] \n \n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n([Read more...][1]) \n \n[1] https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/ \n────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n \n[1] https://blog.malwarebytes.com/category/cybercrime/ \n[2] https://blog.malwarebytes.com/tag/andrew-shorrock/ \n[3] https://blog.malwarebytes.com/tag/catfishing/ \n[4] https://blog.malwarebytes.com/tag/hacker-jailed/ \n[5] https://blog.malwarebytes.com/tag/national-crime-agency/ \n[6] https://blog.malwarebytes.com/tag/nca/ \n[7] https://blog.malwarebytes.com/tag/robert-davies/ \n[8] https://blog.malwarebytes.com/tag/software-engineer-hacker/ \n[9] https://blog.malwarebytes.com/tag/voyuerism/ \n[10] https://blog.malwarebytes.com/tag/webcam-security/ 
\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
──\n\nThe post [Software engineer hacked webcams to spy on girls—Here’s how to protect yourself][1] appeared first on [Malwarebytes Labs][2].\n\n[1] https://blog.malwarebytes.com/cybercrime/2022/01/software-engineer-hacked-webcams-to-spy-on-girls-heres-how-to-protect-yourself/\n[2] https://blog.malwarebytes.com"
Is it possible to return just text instead of this rudimentary table?
What is the behavior of this package when it fails to parse HTML? I don't see a Result type being returned from from_read, so am I correct in assuming that it panics if it is asked to parse invalid HTML?
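For callers who want a Result regardless of how the library behaves, one option is to catch panics at the call site. The following is a minimal sketch using only the standard library; `convert` is a hypothetical stand-in for a conversion call that may panic, not html2text's API, and `try_convert` is an illustrative name.

```rust
use std::panic;

// Hypothetical stand-in for a conversion call that may panic on some inputs.
// It only mimics the shape "returns a String, may panic".
fn convert(input: &str, width: usize) -> String {
    if width == 0 {
        panic!("width must be non-zero");
    }
    input.chars().take(width).collect()
}

// Wrap the call so that a panic becomes an Err instead of unwinding
// further into the caller.
fn try_convert(input: &str, width: usize) -> Result<String, String> {
    panic::catch_unwind(|| convert(input, width)).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}

fn main() {
    println!("{:?}", try_convert("hello", 80));
    println!("{:?}", try_convert("hello", 0));
}
```

This only works when the program is built with unwinding panics (the default); with panic = "abort" the process still terminates.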
It looks like the way dom_to_render_tree (and probably render_tree_to_string too, but it doesn't get that far) is written results in a very deep stack when you have a deep HTML tree. This could probably be avoided if the function were CPS-transformed, or perhaps by some other solution.
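One such "other solution" is to replace recursion with an explicit work stack kept on the heap. The sketch below shows the idea on a hypothetical `Node` tree; it is not the crate's data structure or its dom_to_render_tree implementation.

```rust
// A hypothetical tree type standing in for a DOM.
struct Node {
    text: String,
    children: Vec<Node>,
}

// Concatenate the text of all nodes in document order without recursion:
// pending nodes live in a heap-allocated Vec instead of on the call stack.
fn collect_text(root: &Node) -> String {
    let mut out = String::new();
    let mut stack: Vec<&Node> = vec![root];
    while let Some(node) = stack.pop() {
        out.push_str(&node.text);
        // Push children in reverse so the first child is processed first.
        for child in node.children.iter().rev() {
            stack.push(child);
        }
    }
    out
}

// Build a chain of `depth` nested nodes iteratively.
fn deep_chain(depth: usize) -> Node {
    let mut node = Node { text: "x".to_string(), children: Vec::new() };
    for _ in 1..depth {
        node = Node { text: "x".to_string(), children: vec![node] };
    }
    node
}

fn main() {
    // The default Drop for this tree is still recursive, so the demo
    // keeps the depth modest.
    let tree = deep_chain(1_000);
    println!("{}", collect_text(&tree).len());
}
```

The traversal itself uses constant call-stack depth; a real implementation would also need an iterative way to build and drop the tree.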
Currently, I can either implement a custom TextDecorator and produce a string using from_read_with_decorator, or I can produce annotated text with RichDecorator using from_read_rich. I’d like to be able to implement a custom TextDecorator and produce annotated text. As far as I can see, that is not possible at the moment.
What do you think about adding a from_read_rich_with_decorator function?
Hey, I noticed the following code panics:
let html = r#"<table><td><p data-aid="133347338" id="p5">3,266</p>"#;
let decorator = html2text::render::text_renderer::PlainDecorator::new();
let text = html2text::from_read_with_decorator(html.as_bytes(), usize::MAX, decorator.clone());
println!("{}", text);
thread 'main' panicked at 'attempt to multiply with overflow', /Users/ephraimkunz/.cargo/registry/src/github.com-1ecc6299db9ec823/html2text-0.4.2/src/lib.rs:1408:29
pre tags will not cause a call to the decorate_preformat_* methods of the TextDecorator. Apparently, the Renderer::add_preformatted_block method is never called.
println!("{:?}", html2text::parse("<pre>test</pre>".as_bytes()).render_rich(100).into_lines());
[TaggedLine { v: [Str(TaggedString { s: "test", tag: [] })] }]
I tried many solutions to display HTML emails in mutt/neomutt's internal pager (elinks, readability tools, pandoc, html2text tools), but there were always some issues with the encoding, colors, references or parsing.
After a few tests today, rust-html2text seems to be the way to go: it's fast, and the parsing, formatting, and encoding are spot on.
I changed the color and style options in the html2text example to fit my tastes, but I would like to add 3 features to the RichDecorator, used by --colour in the example:
PlainDecorator.
--wrap-width, to help with long links osc 8.
Would you have time to implement those or give me some leads? (I'm just starting Rust.)
My changes to the example (only Reset for colors and style was not working correctly for me):
diff --git a/examples/html2text.rs b/examples/html2text.rs
index 2c14ddf..4ee56b6 100644
--- a/examples/html2text.rs
+++ b/examples/html2text.rs
@@ -22,41 +22,41 @@ fn default_colour_map(annotations: &[RichAnnotation], s: &str) -> String {
match annotation {
Default => {}
Link(_) => {
- start.push(format!("{}", termion::style::Underline));
- finish.push(format!("{}", termion::style::Reset));
+ start.push(format!("{}{}", Fg(AnsiValue(153)), termion::style::Underline));
+ finish.push(format!("{}{}", Fg(White), termion::style::NoUnderline));
}
Image(_) => {
if !have_explicit_colour {
- start.push(format!("{}", Fg(Blue)));
- finish.push(format!("{}", Fg(Reset)));
+ start.push(format!("{}{}", Fg(AnsiValue(225)), termion::style::Italic));
+ finish.push(format!("{}{}", Fg(White), termion::style::NoItalic));
}
}
Emphasis => {
- start.push(format!("{}", termion::style::Bold));
- finish.push(format!("{}", termion::style::Reset));
+ start.push(format!("{}", termion::style::Italic));
+ finish.push(format!("{}", termion::style::NoItalic));
}
Strong => {
if !have_explicit_colour {
- start.push(format!("{}", Fg(LightYellow)));
- finish.push(format!("{}", Fg(Reset)));
+ start.push(format!("{}", termion::style::Bold));
+ finish.push(format!("{}", termion::style::NoBold));
}
}
Strikeout => {
if !have_explicit_colour {
- start.push(format!("{}", Fg(LightBlack)));
- finish.push(format!("{}", Fg(Reset)));
+ start.push(format!("{}{}", Fg(AnsiValue(7)), termion::style::CrossedOut));
+ finish.push(format!("{}{}", Fg(White), termion::style::NoCrossedOut));
}
}
Code => {
if !have_explicit_colour {
- start.push(format!("{}", Fg(Blue)));
- finish.push(format!("{}", Fg(Reset)));
+ start.push(format!("{}{}", Bg(AnsiValue(25)), Fg(AnsiValue(222))));
+ finish.push(format!("{}{}", Bg(Reset) ,Fg(White)));
}
}
Preformat(_) => {
if !have_explicit_colour {
- start.push(format!("{}", Fg(Blue)));
- finish.push(format!("{}", Fg(Reset)));
+ start.push(format!("{}{}", Bg(AnsiValue(25)), Fg(AnsiValue(229))));
+ finish.push(format!("{}{}", Bg(Reset), Fg(White)));
}
}
Colour(c) => {
While testing against a large batch of HTML samples, I found one that appeared to cause an infinite loop when calling from_read.
Sample attached. I have tested this both with the library call and on the command line.
infinite.zip
I think it would be nice to be able to have some kind of limiting factor because HTML in the wild can be very weird and malformed.
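Until the library offers such a limit, a caller can at least bound how long it waits. The sketch below runs the work on a separate thread and stops waiting after a deadline; `run_with_timeout` is a hypothetical helper, and the closures stand in for a conversion call.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `work` on a separate thread and give up waiting after `limit`.
// Note: the worker thread is not killed on timeout; it keeps running in
// the background, so this bounds the caller's wait, not the CPU usage.
fn run_with_timeout<T, F>(work: F, limit: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error that occurs if the receiver already gave up.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}

fn main() {
    // Stand-in for a conversion that finishes quickly.
    let fast = run_with_timeout(|| "done".to_string(), Duration::from_millis(500));
    println!("{:?}", fast);

    // Stand-in for a conversion that takes too long.
    let slow = run_with_timeout(
        || {
            thread::sleep(Duration::from_millis(300));
            "late".to_string()
        },
        Duration::from_millis(20),
    );
    println!("{:?}", slow);
}
```

Because a runaway worker cannot be cancelled from outside, a limit inside the library (for example on nesting depth or iteration count) would still be the more robust fix.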
This is important - text struck out changes the meaning of text significantly!
The reason why I wanted a fix for preformatted blocks (#32) was that I wanted to implement syntax highlighting for code blocks for rusty-man. With the new release v0.2.1, I was able to implement basic syntax highlighting using syntect. My current implementation could still be improved: context is very important for proper syntax highlighting, but I cannot identify the block that a string with a preformatted annotation belongs to. Therefore, I just assume that there is only one preformatted string per line, and that preformatted strings in adjacent lines belong to the same block. Of course, this does not hold for tables or for consecutive code blocks.
Do you think html2text could make it easier to highlight code blocks?
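The adjacency heuristic described above can be sketched as follows. `Line` is a hypothetical, simplified stand-in for a rendered line (its text plus a flag for a preformatted annotation), not the crate's TaggedLine type.

```rust
// Hypothetical, simplified stand-in for a rendered line.
struct Line {
    text: String,
    preformatted: bool,
}

// Small constructor to keep the example short.
fn line(text: &str, preformatted: bool) -> Line {
    Line { text: text.to_string(), preformatted }
}

// Group consecutive preformatted lines into blocks, one String per block.
fn code_blocks(lines: &[Line]) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for l in lines {
        if l.preformatted {
            current.push(&l.text);
        } else if !current.is_empty() {
            // A non-preformatted line ends the current block.
            blocks.push(current.join("\n"));
            current.clear();
        }
    }
    if !current.is_empty() {
        blocks.push(current.join("\n"));
    }
    blocks
}

fn main() {
    let lines = vec![
        line("Some prose.", false),
        line("fn main() {", true),
        line("}", true),
        line("More prose.", false),
        line("let x = 1;", true),
    ];
    for block in code_blocks(&lines) {
        println!("---\n{}", block);
    }
}
```

As the issue notes, this merges two code blocks that follow each other with no line in between, which is why a per-block identifier in the annotation would help.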
see
Line 953 in 4b3081d
The CI tests on AppVeyor fail, but only on x86_64-pc-windows-gnu - not on i686-pc-windows-gnu or x86_64-pc-windows-msvc.
Apparently test_deeply_nested_table overflows its stack on that platform but no others.
Hi, I am fuzzing this crate with afl.rs, and my fuzzer reports some panics. I list the code snippets and panic information below. All code snippets can be run directly. I hope you can check whether these panics are bugs.
The first case panics with 'capacity overflow':
let _local0 = html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator>::new(11787190863583748771, html2text::render::text_renderer::RichDecorator{});
let mut _local1 = <html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> as html2text::render::Renderer>::new_sub_renderer(&_local0, 11791448176899352125);
let _ = <html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> as html2text::render::Renderer>::add_horizontal_border(&mut _local1);
thread 'main' panicked at 'capacity overflow', library/alloc/src/raw_vec.rs:518:5
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: alloc::raw_vec::capacity_overflow
3: alloc::raw_vec::RawVec<T,A>::allocate_in
at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/raw_vec.rs:178:27
4: alloc::raw_vec::RawVec<T,A>::with_capacity_in
at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/raw_vec.rs:131:9
5: alloc::vec::Vec<T,A>::with_capacity_in
at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/vec/mod.rs:673:20
6: <T as alloc::vec::spec_from_elem::SpecFromElem>::from_elem
at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/vec/spec_from_elem.rs:15:21
7: alloc::vec::from_elem
at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/vec/mod.rs:2566:5
8: html2text::render::text_renderer::BorderHoriz::new
at ./src/render/text_renderer.rs:630:23
9: <html2text::render::text_renderer::SubRenderer<D> as html2text::render::Renderer>::add_horizontal_border
at ./src/render/text_renderer.rs:1010:41
10: replay_html2text161::test_function161
at ./fuzz_target/build/replay_html2text161/src/main.rs:36:13
11: replay_html2text161::main
at ./fuzz_target/build/replay_html2text161/src/main.rs:69:5
12: core::ops::function::FnOnce::call_once
at /home/jjf/Fuzzing-Target-Generator/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
The second case panics with 'attempt to subtract with overflow':
let data=[60, 116, 97, 98, 108, 101, 62, 60, 116, 114, 62, 60, 116, 100, 62, 120, 105, 60, 48, 62, 0, 0, 0, 60, 116, 97, 98, 108, 101, 62, 58, 58, 58, 62, 58, 62, 62, 62, 58, 60, 112, 32, 32, 32, 32, 32, 32, 32, 71, 87, 85, 78, 16, 16, 62, 60, 15, 16, 16, 16, 16, 16, 16, 15, 38, 16, 16, 16, 15, 1, 16, 16, 16, 16, 16, 16, 162, 111, 107, 99, 91, 112, 57, 64, 94, 100, 60, 111, 108, 47, 62, 127, 60, 108, 73, 62, 125, 109, 121, 102, 99, 122, 110, 102, 114, 98, 60, 97, 32, 104, 114, 101, 102, 61, 98, 111, 103, 32, 105, 100, 61, 100, 62, 60, 111, 15, 15, 15, 15, 15, 15, 15, 39, 15, 15, 15, 106, 102, 59, 99, 32, 32, 32, 86, 102, 122, 110, 104, 93, 108, 71, 114, 117, 110, 100, 96, 121, 57, 60, 107, 116, 109, 247, 62, 60, 32, 60, 122, 98, 99, 98, 97, 32, 119, 127, 127, 62, 60, 112, 62, 121, 116, 60, 47, 116, 100, 62, 62, 60, 111, 98, 62, 123, 110, 109, 97, 101, 105, 119, 60, 112, 101, 101, 122, 102, 63, 120, 97, 62, 60, 101, 62, 60, 120, 109, 112, 32, 28, 52, 55, 50, 50, 49, 52, 185, 150, 99, 62, 255, 112, 76, 85, 60, 112, 62, 73, 100, 116, 116, 60, 75, 50, 73, 116, 120, 110, 127, 255, 118, 32, 42, 40, 49, 33, 112, 32, 36, 107, 57, 60, 5, 163, 62, 49, 55, 32, 33, 118, 99, 63, 60, 109, 107, 43, 119, 100, 62, 60, 104, 58, 101, 163, 163, 163, 163, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 1, 107, 117, 107, 108, 44, 102, 58, 60, 116, 101, 97, 106, 98, 59, 60, 115, 109, 52, 58, 115, 98, 62, 232, 110, 114, 32, 60, 117, 93, 120, 112, 119, 111, 59, 98, 120, 61, 206, 19, 61, 206, 19, 59, 1, 110, 102, 60, 115, 0, 242, 64, 203, 8, 111, 50, 59, 121, 122, 32, 42, 35, 32, 37, 101, 120, 104, 121, 0, 242, 59, 63, 121, 231, 130, 130, 130, 170, 170, 1, 32, 0, 0, 0, 28, 134, 200, 90, 119, 48, 60, 111, 108, 118, 119, 116, 113, 59, 100, 60, 117, 43, 110, 99, 9, 216, 157, 137, 216, 157, 246, 167, 62, 60, 104, 61, 43, 28, 134, 200, 105, 119, 48, 60, 122, 110, 0, 242, 61, 61, 114, 231, 130, 130, 130, 170, 170, 170, 233, 222, 222, 162, 163, 163, 163, 163, 163, 163, 163, 85, 100, 
116, 99, 61, 60, 163, 163, 163, 163, 163, 220, 220, 1, 109, 112, 105, 10, 59, 105, 220, 215, 10, 59, 122, 100, 100, 121, 97, 43, 43, 43, 102, 122, 100, 60, 62, 114, 116, 122, 115, 61, 60, 115, 101, 62, 215, 215, 215, 215, 215, 98, 59, 60, 109, 120, 57, 60, 97, 102, 113, 229, 43, 43, 43, 43, 43, 43, 43, 43, 43, 35, 43, 43, 101, 58, 60, 116, 98, 101, 107, 98, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 43, 98, 99, 62, 60, 112, 102, 59, 124, 107, 111, 97, 98, 108, 118, 60, 116, 102, 101, 104, 97, 62, 60, 255, 127, 46, 60, 116, 101, 62, 60, 105, 102, 63, 116, 116, 60, 47, 116, 101, 62, 62, 60, 115, 98, 62, 123, 109, 108, 97, 100, 119, 118, 60, 111, 99, 97, 103, 99, 62, 60, 255, 127, 46, 60, 103, 99, 62, 60, 116, 98, 63, 60, 101, 62, 60, 109, 109, 231, 130, 130, 130, 213, 213, 213, 233, 222, 222, 59, 101, 103, 58, 60, 100, 111, 61, 65, 114, 104, 60, 47, 101, 109, 62, 60, 99, 99, 172, 97, 97, 58, 60, 119, 99, 64, 126, 118, 104, 100, 100, 107, 105, 60, 120, 98, 255, 255, 255, 0, 60, 255, 127, 46, 60, 113, 127];
let _local0: html2text::RenderTree = html2text::parse(&data[..]);
let _local1: html2text::RenderedText::<html2text::render::text_renderer::RichDecorator> = html2text::RenderTree::render(_local0, 1, html2text::render::text_renderer::RichDecorator{});
thread 'main' panicked at 'attempt to subtract with overflow', /home/jjf/Fuzzing-Target-Generator/experiments/rust-html2text/src/lib.rs:1305:65
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic
3: html2text::do_render_node::{{closure}}
at ./src/lib.rs:1305:65
4: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
at /home/jjf/Fuzzing-Target-Generator/library/alloc/src/boxed.rs:2001:9
5: core::ops::function::impls::<impl core::ops::function::Fn<A> for &F>::call
at /home/jjf/Fuzzing-Target-Generator/library/core/src/ops/function.rs:262:13
6: html2text::tree_map_reduce::{{closure}}
at ./src/lib.rs:780:30
7: core::option::Option<T>::map
at /home/jjf/Fuzzing-Target-Generator/library/core/src/option.rs:925:29
8: html2text::tree_map_reduce
at ./src/lib.rs:775:13
9: html2text::render_tree_to_string
at ./src/lib.rs:1128:5
10: html2text::RenderTree::render
at ./src/lib.rs:1542:23
11: replay_html2text10::test_function10
at ./fuzz_target/build/replay_html2text10/src/main.rs:39:95
12: replay_html2text10::main
at ./fuzz_target/build/replay_html2text10/src/main.rs:74:5
13: core::ops::function::FnOnce::call_once
at /home/jjf/Fuzzing-Target-Generator/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
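A panic of this kind typically comes from subtracting a fixed number of columns from a very small width. The sketch below shows the general pattern of guarding such a subtraction; `inner_width` and its parameters are hypothetical and do not reproduce the code at src/lib.rs:1305.

```rust
// Hypothetical helper: the width left for content after a prefix of
// `prefix_len` columns (for example a list or quote marker).
// Returns an error instead of panicking when the width is too small.
fn inner_width(width: usize, prefix_len: usize) -> Result<usize, String> {
    match width.checked_sub(prefix_len) {
        Some(w) if w > 0 => Ok(w),
        _ => Err(format!(
            "width {} is too narrow for a prefix of {} columns",
            width, prefix_len
        )),
    }
}

fn main() {
    println!("{:?}", inner_width(80, 2));
    println!("{:?}", inner_width(1, 2));
}
```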
This case panics on Option::unwrap:
let mut _local0: html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::TrivialDecorator> = html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::TrivialDecorator>::new(18446744073709551615, html2text::render::text_renderer::TrivialDecorator{});
let _local1_param0_helper1 = &mut (_local0);
let _ = <html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::TrivialDecorator> as html2text::render::Renderer>::end_strikeout(_local1_param0_helper1);
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /home/jjf/Fuzzing-Target-Generator/experiments/rust-html2text/src/render/text_renderer.rs:1377:38
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic
3: core::option::Option<T>::unwrap
at /home/jjf/Fuzzing-Target-Generator/library/core/src/option.rs:778:21
4: <html2text::render::text_renderer::SubRenderer<D> as html2text::render::Renderer>::end_strikeout
at ./src/render/text_renderer.rs:1377:9
5: replay_html2text59::test_function59
at ./fuzz_target/build/replay_html2text59/src/main.rs:34:13
6: replay_html2text59::main
at ./fuzz_target/build/replay_html2text59/src/main.rs:66:5
7: core::ops::function::FnOnce::call_once
at /home/jjf/Fuzzing-Target-Generator/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
This case panics with 'Attempt to end a preformatted block which wasn't opened.':
let _local0: html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> = html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator>::new(14829735431805717965, html2text::render::text_renderer::RichDecorator{});
let _local1_param0_helper1 = &(_local0);
let mut _local1: html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> = <html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> as html2text::render::Renderer>::new_sub_renderer(_local1_param0_helper1, 14829735431805717810);
let _local2_param0_helper1 = &mut (_local1);
let _ = <html2text::render::text_renderer::SubRenderer::<html2text::render::text_renderer::RichDecorator> as html2text::render::Renderer>::end_pre(_local2_param0_helper1);
thread 'main' panicked at 'Attempt to end a preformatted block which wasn't opened.', /home/jjf/Fuzzing-Target-Generator/experiments/rust-html2text/src/render/text_renderer.rs:1027:13
stack backtrace:
0: std::panicking::begin_panic
at /home/jjf/Fuzzing-Target-Generator/library/std/src/panicking.rs:607:12
1: <html2text::render::text_renderer::SubRenderer<D> as html2text::render::Renderer>::end_pre
at ./src/render/text_renderer.rs:1027:13
2: replay_html2text164::test_function164
at ./fuzz_target/build/replay_html2text164/src/main.rs:36:13
3: replay_html2text164::main
at ./fuzz_target/build/replay_html2text164/src/main.rs:69:5
4: core::ops::function::FnOnce::call_once
at /home/jjf/Fuzzing-Target-Generator/library/core/src/ops/function.rs:251:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
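The last two cases both call an end_* method without a matching start. One way to make such misuse recoverable is to track nesting depth and return an error on an unmatched end. The sketch below uses a hypothetical `Tracker` type, not the crate's SubRenderer.

```rust
#[derive(Debug, PartialEq)]
enum StateError {
    // An end_* call arrived without a matching start_* call.
    NotOpen(&'static str),
}

// Hypothetical miniature of a renderer's nesting state.
#[derive(Default)]
struct Tracker {
    strikeout_depth: usize,
    pre_depth: usize,
}

impl Tracker {
    fn start_strikeout(&mut self) {
        self.strikeout_depth += 1;
    }

    fn end_strikeout(&mut self) -> Result<(), StateError> {
        self.strikeout_depth = self
            .strikeout_depth
            .checked_sub(1)
            .ok_or(StateError::NotOpen("strikeout"))?;
        Ok(())
    }

    fn start_pre(&mut self) {
        self.pre_depth += 1;
    }

    fn end_pre(&mut self) -> Result<(), StateError> {
        self.pre_depth = self
            .pre_depth
            .checked_sub(1)
            .ok_or(StateError::NotOpen("pre"))?;
        Ok(())
    }
}

fn main() {
    let mut t = Tracker::default();
    println!("{:?}", t.end_pre()); // unmatched end is reported, not a panic
    t.start_pre();
    println!("{:?}", t.end_pre());
    t.start_strikeout();
    println!("{:?}", t.end_strikeout());
}
```

Whether an unmatched end on a public trait method should be an error, a no-op, or a documented precondition is a design decision for the maintainer.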
Hi,
I'm trying to use from_read_with_decorator with my own TextDecorator to output to the terminal with some color and text style. Unfortunately, from_read_with_decorator seems to remove part of the escape sequences, e.g. \e, which prevents creating nice terminal output.
I use termion for colors and style:
impl TextDecorator for ContentDecorator {
    type Annotation = RichAnnotation;
    fn decorate_link_start(&mut self, url: &str) -> (String, Self::Annotation) {
        self.0.push(url.to_string());
        (
            format!(
                "{}{}{}* {}{}",
                Italic,
                Fg(Black),
                self.0.len() + 1,
                StyleReset,
                Fg(Blue)
            ),
            RichAnnotation::Link(url.to_string()),
        )
    }
    // ...
Then
let output = from_read_with_decorator(html.as_bytes(), term_width, ContentDecorator(vec![]))
output:
HTML: Google has <a href="https://blog.chromium.org/2021/01/limiting-private-api-availability-in.html">announced</a> that they are going to block
from_read_with_decorator output:
"Google has [3m[38;5;0m2* [m[38;5;4mannounced[39m that they are going to block
Links are currently rendered recursively into tables, which can take much more space than the text, distorting the table in the process.
#81 is a prototype of how to collect the links globally and only append them at the very end, although you might have a better idea. The PR makes the former BuilderStack the new TextRenderer, which can additionally carry state shared among the stack of renderers.
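The idea of collecting links globally can be sketched independently of the crate. `LinkCollector` below is a hypothetical piece of shared state, not the type introduced in #81; reusing the number for a repeated URL is an extra design choice made for this illustration.

```rust
// Hypothetical shared state for footnote-style links.
#[derive(Default)]
struct LinkCollector {
    urls: Vec<String>,
}

impl LinkCollector {
    // Record a URL and return its 1-based reference number.
    // A URL seen before keeps its original number.
    fn add(&mut self, url: &str) -> usize {
        if let Some(pos) = self.urls.iter().position(|u| u == url) {
            return pos + 1;
        }
        self.urls.push(url.to_string());
        self.urls.len()
    }

    // Render all collected links once, to be appended after the body text.
    fn footer(&self) -> String {
        self.urls
            .iter()
            .enumerate()
            .map(|(i, u)| format!("[{}] {}", i + 1, u))
            .collect::<Vec<_>>()
            .join("\n")
    }
}

fn main() {
    let mut links = LinkCollector::default();
    let a = links.add("https://example.com/a");
    let b = links.add("https://example.com/b");
    println!("See [first][{}] and [second][{}].", a, b);
    println!("\n{}", links.footer());
}
```

With this shape, table cells only contain the short `[n]` markers, so their widths no longer depend on URL lengths.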