wooorm / markdown-rs

CommonMark compliant markdown parser in Rust with ASTs and extensions

Home Page: https://docs.rs/markdown/1.0.0-alpha.17/markdown/

License: MIT License

Rust 100.00%
commonmark compiler gfm markdown parse render rust tokenize

markdown-rs's Introduction





markdown-rs

👉 Note: this is a new crate that reuses an old name. The old crate (0.3.0 and lower) has a bunch of problems. Make sure to use the new crate, currently in alpha at 1.0.0-alpha.16.

CommonMark compliant markdown parser in Rust with ASTs and extensions.

Feature highlights

  • compliant (100% to CommonMark)
  • extensions (100% GFM, 100% MDX, frontmatter, math)
  • safe (100% safe Rust, also 100% safe HTML by default)
  • robust (2300+ tests, 100% coverage, fuzz testing)
  • ast (mdast)

When should I use this?

  • If you just want to turn markdown into HTML (with maybe a few extensions)
  • If you want to do really complex things with markdown

What is this?

markdown-rs is an open source markdown parser written in Rust. It’s implemented as a state machine (#![no_std] + alloc) that emits concrete tokens, so that every byte is accounted for, with positional info. The API then exposes this information as an AST, which is easier to work with, or it compiles directly to HTML.

While most markdown parsers work towards compliance with CommonMark (or GFM), this project goes further by following how the reference parsers (cmark, cmark-gfm) work, which is confirmed with thousands of extra tests.

Other than CommonMark and GFM, this project also supports common extensions to markdown such as MDX, math, and frontmatter.

This Rust crate has a sibling project in JavaScript: micromark (and mdast-util-from-markdown for the AST).

P.S. if you want to compile MDX, use mdxjs-rs.

Install

With Rust (rust edition 2018+, ±version 1.56+), install with cargo:

cargo add markdown@1.0.0-alpha.16

Use

fn main() {
    println!("{}", markdown::to_html("## Hello, *world*!"));
}

Yields:

<h2>Hello, <em>world</em>!</h2>

Extensions (in this case GFM):

fn main() -> Result<(), String> {
    println!(
        "{}",
        markdown::to_html_with_options(
            "* [x] contact@example.com ~~strikethrough~~",
            &markdown::Options::gfm()
        )?
    );

    Ok(())
}

Yields:

<ul>
  <li>
    <input checked="" disabled="" type="checkbox" />
    <a href="mailto:contact@example.com">contact@example.com</a>
    <del>strikethrough</del>
  </li>
</ul>

Syntax tree (mdast):

fn main() -> Result<(), String> {
    println!(
        "{:?}",
        markdown::to_mdast("# Hey, *you*!", &markdown::ParseOptions::default())?
    );

    Ok(())
}

Yields:

Root { children: [Heading { children: [Text { value: "Hey, ", position: Some(1:3-1:8 (2-7)) }, Emphasis { children: [Text { value: "you", position: Some(1:9-1:12 (8-11)) }], position: Some(1:8-1:13 (7-12)) }, Text { value: "!", position: Some(1:13-1:14 (12-13)) }], position: Some(1:1-1:14 (0-13)), depth: 1 }], position: Some(1:1-1:14 (0-13)) }

API

markdown-rs exposes to_html, to_html_with_options, to_mdast, Options, and a few other structs and enums.

See the crate docs for more info.

Extensions

markdown-rs supports extensions to CommonMark. These extensions are maintained in this project. They are not enabled by default but can be turned on with options.

  • frontmatter
  • GFM
    • autolink literal
    • footnote
    • strikethrough
    • table
    • tagfilter
    • task list item
  • math
  • MDX
    • ESM
    • expressions
    • JSX

It is not a goal of this project to support lots of different extensions. It’s instead a goal to support very common and mostly standardized extensions.

Project

markdown-rs is maintained as a single monolithic crate.

Overview

The process to parse markdown looks like this:

                    markdown-rs
+-------------------------------------------------+
|            +-------+         +---------+--html- |
| -markdown->+ parse +-events->+ compile +        |
|            +-------+         +---------+-mdast- |
+-------------------------------------------------+

File structure

The files in src/ are as follows:

  • construct/*.rs — CommonMark, GFM, and other extension constructs used in markdown
  • util/*.rs — helpers often needed when parsing markdown
  • event.rs — things with meaning happening somewhere
  • lib.rs — public API
  • mdast.rs — syntax tree
  • parser.rs — turn a string of markdown into events
  • resolve.rs — steps to process events
  • state.rs — steps of the state machine
  • subtokenize.rs — handle content in other content
  • to_html.rs — turns events into a string of HTML
  • to_mdast.rs — turns events into a syntax tree
  • tokenizer.rs — glue the states of the state machine together
  • unist.rs — point and position, used in mdast

Test

markdown-rs is tested with the ~650 CommonMark tests and more than 1k extra tests confirmed with CM reference parsers. Then there are even more tests for GFM and other extensions. These tests reach all branches in the code, which means that this project has 100% code coverage. Fuzz testing is used to check for things that might fall through coverage.

The following bash scripts are useful when working on this project:

  • generate code (latest CM tests and Unicode info):
    cargo run --manifest-path generate/Cargo.toml
  • run examples:
    RUST_BACKTRACE=1 RUST_LOG=trace cargo run --features log --example lib
  • format:
    cargo fmt && cargo fix --all-targets
  • lint:
    cargo fmt --check && cargo clippy --examples --tests --benches --all-features
  • test:
    RUST_BACKTRACE=1 cargo test
  • docs:
    cargo doc --document-private-items
  • fuzz:
    cargo install cargo-fuzz
    cargo install honggfuzz
    cargo +nightly fuzz run markdown_libfuzz
    cargo hfuzz run markdown_honggfuzz

Version

markdown-rs follows SemVer.

Security

The typical security aspect discussed for markdown is cross-site scripting (XSS) attacks. Markdown itself is safe if it does not include embedded HTML or dangerous protocols in links/images (such as javascript: or data:). markdown-rs makes any markdown safe by default, even if HTML is embedded or dangerous protocols are used, as it encodes or drops them. Turning on the allow_dangerous_html or allow_dangerous_protocol options for user-provided markdown opens you up to XSS attacks.

An aspect related to XSS for security is syntax errors: markdown itself has no syntax errors. Some syntax extensions (specifically, only MDX) do include syntax errors. For that reason, to_html_with_options returns Result<String, String>, of which the error is a simple string indicating where the problem happened, what occurred, and what was expected instead. Make sure to handle your errors when using MDX.

Another security aspect is denial-of-service (DoS) attacks. For example, an attacker could throw a 100 MB file at markdown-rs, in which case it’s going to take a long while to finish. It is also possible to crash markdown-rs with smaller payloads, notably when thousands of links, images, emphasis, or strong are opened but not closed. It is wise to cap the accepted size of input (500 KB can hold a big book) and to process content in a different thread so that it can be stopped when needed.
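The advice above (cap the input size, do the work off the calling thread) can be sketched with the standard library alone. This is a minimal sketch, not part of markdown-rs: `render_guarded` and `RenderError` are hypothetical names, and the `render` parameter stands in for a call such as `markdown::to_html`.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Reasons the guarded render gave up (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum RenderError {
    /// The input exceeded the configured byte limit.
    TooLarge { len: usize, max: usize },
    /// The worker did not answer within the time budget.
    TimedOut,
    /// The worker ended (for example by panicking) without sending a result.
    WorkerFailed,
}

/// Runs `render` on `input` in a worker thread, refusing oversized input
/// and giving up waiting after `timeout`.
pub fn render_guarded(
    input: &str,
    max_bytes: usize,
    timeout: Duration,
    render: fn(&str) -> String,
) -> Result<String, RenderError> {
    // Cheap check first: never hand oversized input to the parser.
    if input.len() > max_bytes {
        return Err(RenderError::TooLarge { len: input.len(), max: max_bytes });
    }

    let owned = input.to_owned();
    let (sender, receiver) = mpsc::channel();

    // The worker owns its copy of the input.
    thread::spawn(move || {
        // If the caller already gave up, the send fails and is ignored.
        let _ = sender.send(render(&owned));
    });

    match receiver.recv_timeout(timeout) {
        Ok(html) => Ok(html),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(RenderError::TimedOut),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(RenderError::WorkerFailed),
    }
}
```

Note that a timeout here only stops the caller from waiting; the worker thread keeps running until `render` returns. If the work must really be halted, running it in a separate process that can be killed is one option.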

For more information on markdown sanitation, see improper-markup-sanitization.md by @chalker.

Contribute

See contributing.md for ways to help. See support.md for ways to get help. See code-of-conduct.md for how to communicate in and around this project.

Sponsor

Support this effort and give back by sponsoring:

Thanks

Special thanks go out to:

Related

  • micromark — same as markdown-rs but in JavaScript
  • mdxjs-rs — wraps markdown-rs to compile MDX to JavaScript

License

MIT © Titus Wormer

markdown-rs's People

Contributors

barafael, christianmurphy, firelightflagboy, hocdoc, kyle-mccarthy, lovasoa, mickvangelderen, pinkforest, rotmoded, sheremetyev, squili, wooorm

markdown-rs's Issues

API for creating extensions?

Hi, I found this crate when looking for a markdown parser, but my use case has needs beyond those of CommonMark and the provided extensions. Is there, or if not could there be, some sort of API to allow creation of custom syntax rules, e.g., callouts, <ins>/<del> tags, and whatever else people have need for on a more local basis?

MDAST: serde Serialization/Deserialization not working

The serialization of the mdast fails to deserialize because of a duplicate field type

use markdown::mdast::Node;

#[test]
fn test_serde() {
    let markdown = "This is a **test**";
    // Parse, serialize to JSON, then deserialize and compare the two trees.
    let tree = markdown::to_mdast(markdown, &markdown::ParseOptions::default()).unwrap();
    let json = serde_json::to_string(&tree).unwrap();
    let tree2: Node = serde_json::from_str(&json).unwrap();
    assert!(tree == tree2);
}

I get the following error from serde

Error("duplicate field `type`", line: 1, column: 21)

This is the generated json. It has duplicate type fields everywhere.

{
   "type":"Root",
   "type":"root",
   "children":[
      {
         "type":"Code",
         "type":"code",
         "value":"This is a **test**",
         "position":{
            "start":{
               "line":2,
               "column":1,
               "offset":1
            },
            "end":{
               "line":3,
               "column":5,
               "offset":28
            }
         },
         "lang":null,
         "meta":null
      }
   ],
   "position":{
      "start":{
         "line":1,
         "column":1,
         "offset":0
      },
      "end":{
         "line":3,
         "column":5,
         "offset":28
      }
   }
}

Looking at the AST enum and related structs they all have serde(tag, rename) macro configs, which seem to be conflicting.

#[derive(Clone, Eq, PartialEq)]
#[cfg_attr(
    feature = "serde",
    derive(serde::Serialize, serde::Deserialize),
    serde(tag = "type", rename = "type")
)]
pub enum Node {
    // Document:
    /// Root.
    Root(Root),
//..
}

#[derive(Clone, Debug, Eq, PartialEq)]
#[cfg_attr(
    feature = "serde",
    derive(serde::Serialize, serde::Deserialize),
    serde(tag = "type", rename = "root")
)]
pub struct Root {
    // Parent.
    /// Content model.
    pub children: Vec<Node>,
    /// Positional info.
    pub position: Option<Position>,
}

Markdown in beginning of a row in a Table with no outer pipes

Hi,

Thank you for this Rust-native markdown parser. It is really helpful in one of my projects!
I want to report what I believe is a bug: in a table without outer pipes, a row that begins with markdown styling is not parsed as part of the table. Here is an example:

Markdown:

Markdown | Less | Pretty
--- | --- | ---
*Still* | `renders` | **nicely**
1 | 2 | 3

GitHub:

Markdown Less Pretty
Still renders nicely
1 2 3

GitHub renders it nicely, but markdown-rs renders a table containing only the header row, and everything from the row that starts with inline markdown onward becomes a paragraph. (I believe this is a bug, because stackedit, dillinger and markdownpreview render this nicely too.)

I am using the to_mdast function and I have the following mdast structure: (positions omitted for clarity).
I am using the latest version 1.0.0-alpha.13

Root {
    children: [
        Table {
            children: [
                TableRow {
                    children: [
                        TableCell {
                            children: [
                                Text {
                                    value: "Markdown",
                                    position: ...
                                },
                            ],
                            position: ...,
                        },
                        TableCell {
                            children: [
                                Text {
                                    value: "Less",
                                    position: ...,
                                },
                            ],
                            position: ...,
                        },
                        TableCell {
                            children: [
                                Text {
                                    value: "Pretty",
                                    position: ...,
                                },
                            ],
                            position: ...,
                        },
                    ],
                    position: ...,
                },
            ],
            position: ...,
            align: [
                None,
                None,
                None,
            ],
        },
        Paragraph {
            children: [
                Emphasis {
                    children: [
                        Text {
                            value: "Still",
                            position: ...,
                        },
                    ],
                    position: ...,
                },
                Text {
                    value: " | ",
                    position: ...,
                },
                InlineCode {
                    value: "renders",
                    position: ...,
                },
                Text {
                    value: " | ",
                    position: ...,
                },
                Strong {
                    children: [
                        Text {
                            value: "nicely",
                            position: ...,
                        },
                    ],
                    position: ...,
                },
                Text {
                    value: "\n1 | 2 | 3",
                    position: ...,
                },
            ],
            position: ...,
        },
    ],
    position: ...,
}

Syntax tree abstractness

My impression was that an abstract syntax tree represents the logical/semantic content of a document, rather than formatting specifics.

Since markdown ignores things like single new lines from a rendering perspective, I'd expect those to be removed/normalized out in the ast as well.

I think that could be made clear (i.e. text nodes have newlines, etc), or (ideally) the tree would contain abstract data... maybe as an option

Implement file_to_html

The old crate with the same name had this convenience function that would read the content of a file and then call to_html on it.
It would be nice to add it to the new crate as well.
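Until such a function exists, the convenience described above is small enough to write by hand. The following is a sketch only: `file_to_html_with` is a hypothetical helper, not an API of the old or new crate, and it takes the renderer as a parameter so that it depends on nothing but the standard library.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads the file at `path` as UTF-8 and passes its contents to `render`.
///
/// I/O and encoding problems surface as `io::Error`; the renderer itself
/// is assumed to be infallible, like `markdown::to_html`.
pub fn file_to_html_with<P, F>(path: P, render: F) -> io::Result<String>
where
    P: AsRef<Path>,
    F: FnOnce(&str) -> String,
{
    let source = fs::read_to_string(path)?;
    Ok(render(&source))
}
```

With the crate as a dependency, a call would presumably look like `file_to_html_with("README.md", markdown::to_html)`.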

Numbered list items start at 2

I don't have a minimal reproduction, but I have this code

        let ast = markdown::to_mdast(source, &markdown::ParseOptions {
            constructs: markdown::Constructs { ..Default::default() },
            ..Default::default()
        }).map_err(|e| anyhow!("{}", e))?;

and

        Node::List(x) => {
            match &x.start {
                Some(i) => {
                    for (j, child) in x.children.iter().enumerate() {
                        println!("numbered list, i {} j {}", i, j);
                        if j > 0 {
                            line.write_newline(state, out);
                        }
                        recurse_write(
                            state,
                            out,
                            line.clone_indent(Some(format!("{}. ", *i as usize + j)), "   ".into(), false),
                            child,
                        );
                    }
                },
                None => {
                    for (i, child) in x.children.iter().enumerate() {
                        if i > 0 {
                            line.write_newline(state, out);
                        }
                        recurse_write(state, out, line.clone_indent(Some("* ".into()), "   ".into(), false), child);
                    }
                },
            };
        },

Which produces this output:

formatting md:
 1. list item one
 2. list item 2
numbered list, i 2 j 0
numbered list, i 2 j 1

You can see x.start is 2 rather than the expected 1

Network requests in build.rs and heavy build dependencies

The alpha release of the crate requires tokio and reqwest during build time which are very heavy dependencies. More importantly it seems to be doing HTTP requests in build.rs which is a bad idea.

Could this be implemented in different ways?

[bug] When parsing as MDX, multiple lists are not handled correctly

👋 Howdy, fun project!

It appears that when parsing as MDX, any subsequent lists after the first one that is rendered are not parsed correctly.

    fn multiple_lists() {
        let node = &markdown::to_mdast(
            r#"* list 1

Extra paragraph

* list 2
* list 3"#,
            &markdown::ParseOptions::mdx(),
        )
        .unwrap();

        println!("{:?}", node)
    }

outputs:

Root { children: [List { children: [ListItem { children: [Paragraph { children: [Text { value: "list 1", position: Some(1:3-1:9 (2-8)) }], position: Some(1:3-1:9 (2-8)) }], position: Some(1:1-2:1 (0-9)), spread: false, checked: None }], position: Some(1:1-2:1 (0-9)), ordered: false, start: None, spread: false }, Paragraph { children: [Text { value: "Extra paragraph", position: Some(3:1-3:16 (10-25)) }], position: Some(3:1-3:16 (10-25)) }, Paragraph { children: [Text { value: "* list 2\n* list 3", position: Some(5:1-6:9 (27-44)) }], position: Some(5:1-6:9 (27-44)) }], position: Some(1:1-6:9 (0-44)) }

version:

markdown = "1.0.0-alpha.1"

I would expect an additional List {} element containing list items for List 2 and List 3. Let me know if I can provide any additional details. 🙇

Stronger types

Hi. I started using markdown-rs and noticed I had to add a panic into my code just because everything is a Node, but List can only really contain ListItems. Since 1.x is still in alpha, I figured this would be a good time to speak up.

Have you considered making the AST types stronger? I think List.children should be Vec<ListItem>, and similarly, e.g., Strong cannot contain a Paragraph or a Heading.
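To make the proposal concrete, here is a rough sketch of what stronger typing could look like. These types are hypothetical and heavily simplified (a list item holds only paragraphs here, which is narrower than mdast allows); they are not the crate's actual definitions.

```rust
/// Inline content: the only thing `Strong` and `Paragraph` may contain.
#[derive(Debug, Clone, PartialEq)]
pub enum Phrasing {
    Text(String),
    Strong(Vec<Phrasing>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Paragraph {
    pub children: Vec<Phrasing>,
}

/// Simplification for this sketch: a list item holds paragraphs only.
#[derive(Debug, Clone, PartialEq)]
pub struct ListItem {
    pub children: Vec<Paragraph>,
}

/// A list can only ever contain list items.
#[derive(Debug, Clone, PartialEq)]
pub struct List {
    pub children: Vec<ListItem>,
}

/// Concatenates the text of inline nodes, descending into `Strong`.
fn plain_text(nodes: &[Phrasing]) -> String {
    nodes
        .iter()
        .map(|node| match node {
            Phrasing::Text(value) => value.clone(),
            Phrasing::Strong(children) => plain_text(children),
        })
        .collect()
}

/// Returns the plain text of every item; no fallback arm or panic is needed
/// because the types rule out anything other than `ListItem` children.
pub fn item_texts(list: &List) -> Vec<String> {
    list.children
        .iter()
        .map(|item| {
            item.children
                .iter()
                .map(|paragraph| plain_text(&paragraph.children))
                .collect::<Vec<String>>()
                .join("\n")
        })
        .collect()
}
```

The trade-off is that generic tree walkers get harder to write, since there is no single node type to recurse over.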

mdast supports this line of thinking:

and so on.

I should be able to contribute code too, if you're interested and breaking the API is ok.

Get marker delimitation

Hello,

I am writing a WYSIWYG Markdown editor focused on math and science, and I want to use Markdown as the base format. The problem I am going to describe is present in many other Markdown parsers; as a result, I decided to write a completely new parser from scratch (in C++) and make some modifications to the Markdown standard to fit my own needs (this is the result).

The prototype I wrote was working okay, but now I've decided to rewrite the whole application in Rust, and also decided to not maintain my own parser which is much more prone to bugs and crashes.

The marker delimitation problem

I am rewriting what I wrote here: https://github.com/jokteur/ab-parser#the-delimitation-marker-problem.

For my WYSIWYG application, I need to know where the markers of a specific block / span are, to temporarily display to the user the markers, like on this demo here: https://github.com/wooorm/markdown-rs/assets/25845695/420c1496-7306-4c69-b7ca-74059ec95886

Let's say that we have the following Markdown example:

- >> [abc
  >> def](example.com)

This example would generate an abstract syntax tree (AST) like:

DOC
  UL
    LI
      QUOTE
        QUOTE
          P
            URL
              TEXT

How do we attribute each non-text marker (like -, >, [, ...) to the correct block / span?

My parser was created to solve this specific problem, while keeping reasonable performance. To do this, each object (BLOCK or SPAN) is represented by a vector of boundaries. A boundary is defined as follows:

struct Boundary {
    line_number: usize,
    pre: usize,
    beg: usize,
    end: usize,
    post: usize,
}

This struct designates offsets in the raw text which form its structure. line_number is the line number in the raw text on which the boundary is currently operating. Offsets between pre and beg are the pre-delimiters, and offsets between end and post are the post-delimiters. Everything between beg and end is the content of the block / span.

Here is a simple example. Suppose we have the following text: _italic_, which starts at line 0 and offset 0 then the boundary struct would look like {0, 0, 1, 7, 8}.
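The `_italic_` example can be checked with a few accessor methods on the struct. This is a sketch under one assumption that the description leaves open: the offsets are treated as byte offsets into the raw text passed in. The method names are made up for illustration.

```rust
/// Offsets delimiting one object (block or span) on one line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Boundary {
    pub line_number: usize,
    pub pre: usize,
    pub beg: usize,
    pub end: usize,
    pub post: usize,
}

impl Boundary {
    /// Text between `pre` and `beg`: the opening marker(s).
    /// Returns `None` if the offsets do not fit `raw`.
    pub fn pre_delimiter<'a>(&self, raw: &'a str) -> Option<&'a str> {
        raw.get(self.pre..self.beg)
    }

    /// Text between `beg` and `end`: the content of the block / span.
    pub fn content<'a>(&self, raw: &'a str) -> Option<&'a str> {
        raw.get(self.beg..self.end)
    }

    /// Text between `end` and `post`: the closing marker(s).
    pub fn post_delimiter<'a>(&self, raw: &'a str) -> Option<&'a str> {
        raw.get(self.end..self.post)
    }
}
```

For `_italic_` with the boundary `{0, 0, 1, 7, 8}`, the three accessors yield `_`, `italic`, and `_` respectively.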

Going back to the first example, we now use the following notation to illustrate ownership of markers: x indicates a delimiter, _ indicates content, and . indicates a position outside the boundary. Here is the ownership for each block and span:

- >> [abc
  >> def](example.com)

UL:
_________
______________________

LI:
xx_______
xx____________________

QUOTE (1st):
..x______
..x___________________

QUOTE (2nd):
...xx____
...xx_________________

P:
.....____
....._________________

URL:
.....x___
.....___xxxxxxxxxxxxxx

TEXT:
......___
.....___..............

Is there any simple way to retrieve this kind of information?

Currently, markdown-rs provides positional information like this:

Text { value: "abc\ndef", position: Some(1:7-2:10 (6-19)) }

I may have a workaround to retrieve this kind of information (after parsing, start from the leaf nodes, compare their text with the raw text, check which characters are part of the node or not, and attribute the remaining ones to the parent). This workaround may be slow, but it is okay for my usage because I only need marker delimitation information where the cursor is (not on the whole document).

I don't really know how markdown-rs works internally. How difficult would it be to have this information built into the parser?

Test

  • (1) Check positional info
  • (3) Share tests with micromark-js
  • (3) Add tests for a zillion attention markers, tons of lists, tons of labels, etc?

Should json be default feature ?

$ cargo update

    Updating crates.io index
      Adding itoa v1.0.5
    Updating markdown v1.0.0-alpha.5 -> v1.0.0-alpha.6
      Adding ryu v1.0.12
      Adding serde v1.0.152
      Adding serde_json v1.0.92

I am using mdast just to parse it and then process the mdast structures, e.g. I don't need json or any serialization out.

Would there be openness to having it as optional considering there are different ways to serialize out ?

Thanks for the great work btw ! 💜

Most of the ecosystem has serde as optional feature

GFM with allow_dangerous_html panics when a tag contains a newline after its name

fn main() {
    let source = r#"
<div
>
>/div>
    "#;
    let _md = markdown::to_html_with_options(source, &markdown::Options {
        parse: markdown::ParseOptions::gfm(),
        compile: markdown::CompileOptions {
            allow_dangerous_html: true,
            ..markdown::CompileOptions::gfm()
        },
    }).unwrap();
}

The above code panics in version 1.0.0-alpha.12 at

matches!(bytes[name_end], b'\t' | b'\n' | 12 /* `\f` */ | b'\r' | b' ' | b'/' | b'>') &&

Here's the full RUST_BACKTRACE=1:

thread 'main' panicked at 'index out of bounds: the len is 4 but the index is 4', /home/gustav/.cargo/registry/src/index.crates.io-6f17d22bba15001f/markdown-1.0.0-alpha.12/src/util/gfm_tagfilter.rs:55:26
stack backtrace:
   0: rust_begin_unwind
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:67:14
   2: core::panicking::panic_bounds_check
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:162:5
   3: markdown::util::gfm_tagfilter::gfm_tagfilter
             at /home/gustav/.cargo/registry/src/index.crates.io-6f17d22bba15001f/markdown-1.0.0-alpha.12/src/util/gfm_tagfilter.rs:55:26
   4: markdown::to_html::on_exit_html_data
             at /home/gustav/.cargo/registry/src/index.crates.io-6f17d22bba15001f/markdown-1.0.0-alpha.12/src/to_html.rs:1304:17
   5: markdown::to_html::exit
             at /home/gustav/.cargo/registry/src/index.crates.io-6f17d22bba15001f/markdown-1.0.0-alpha.12/src/to_html.rs:426:52
   6: markdown::to_html::handle
             at /home/gustav/.cargo/registry/src/index.crates.io-6f17d22bba15001f/markdown-1.0.0-alpha.12/src/to_html.rs:308:9
   7: markdown::to_html::compile
             at /home/gustav/.cargo/registry/src/index.crates.io-6f17d22bba15001f/markdown-1.0.0-alpha.12/src/to_html.rs:283:13
   8: markdown::to_html_with_options
             at /home/gustav/.cargo/registry/src/index.crates.io-6f17d22bba15001f/markdown-1.0.0-alpha.12/src/lib.rs:125:8
   9: markdown_repro::main
             at ./src/main.rs:7:15
  10: core::ops::function::FnOnce::call_once
             at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/ops/function.rs:250:5
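The backtrace points at an unchecked `bytes[name_end]` index, which fails when the tag name runs to the end of the slice (`<div` is 4 bytes, so index 4 is out of bounds). A bounds-safe form of that check can be sketched as follows. `ends_tag_name` is a hypothetical helper, not the crate's code or its eventual fix, and treating end-of-input as a terminator is an assumption.

```rust
/// Returns whether the byte at `name_end` terminates a tag name,
/// without panicking when `name_end` is past the end of `bytes`.
pub fn ends_tag_name(bytes: &[u8], name_end: usize) -> bool {
    match bytes.get(name_end) {
        // Same byte set as the `matches!` in the report (12 is `\f`).
        Some(b'\t' | b'\n' | 12 | b'\r' | b' ' | b'/' | b'>') => true,
        Some(_) => false,
        // Assumption: running out of input also ends the name.
        None => true,
    }
}
```

Using `slice::get` turns the out-of-bounds case into an explicit `None` arm, so the decision about what end-of-input means is visible in the code.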

Docs

Some small things around docs:

  • (3) Write comparison to other parsers
  • (1) Write examples, improve sponsor
  • (0) Coverage, badges, logo when this is public
  • (0) Badges, links when this is published

Add support for wiki links

First, I'm thankful for this fantastic work!
I'm trying to build a web app to render markdown coming from Obsidian or similar software, but there is one feature I'm missing: wikilinks.

It would be nice if I could get [[article_name|click here]] blocks in the abstract syntax tree, because it is quite common now in some tools that use markdown.

One potential challenge is that there is no "right answer" to create an html equivalent to this type of link, it depends on the situation.

Would it make sense to add wiki-style links as an option for the abstract syntax tree, even if there is no way to compile them? Maybe creating an HTML link to ./article would be the best default to convert [[article]]?
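To illustrate the shape of the request, here is a standalone sketch that recognizes a single `[[target|label]]` token and derives the `./target` default suggested above. It is not a markdown-rs extension or API: `WikiLink`, `parse_wiki_link`, and `default_href` are hypothetical names, and the syntax rules (no nesting, non-empty target) are assumptions.

```rust
/// A parsed wiki-style link (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub struct WikiLink<'a> {
    pub target: &'a str,
    pub label: Option<&'a str>,
}

/// Parses exactly one `[[target]]` or `[[target|label]]` token.
/// Returns `None` for anything else, including nested brackets.
pub fn parse_wiki_link(input: &str) -> Option<WikiLink<'_>> {
    let inner = input.strip_prefix("[[")?.strip_suffix("]]")?;
    if inner.contains("[[") || inner.contains("]]") {
        return None;
    }
    // Everything after the first `|` is the label.
    let (target, label) = match inner.split_once('|') {
        Some((target, label)) => (target, Some(label)),
        None => (inner, None),
    };
    if target.is_empty() {
        return None;
    }
    Some(WikiLink { target, label })
}

/// The default suggested in the request: link to `./target`.
/// No escaping is done here; real HTML output would need it.
pub fn default_href(link: &WikiLink) -> String {
    format!("./{}", link.target)
}
```

An application with different routing could swap `default_href` for its own resolver, which matches the point that there is no single right HTML equivalent.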

Thanks

panic on input `* `

Hello,

This input "* " (without the double quotes) causes markdown::to_html to panic.

e.g.

fn main() {
    println!("{}", markdown::to_html("* "));
}

Just in case, I've pasted the stacktrace below.

Cheers, let me know if I can provide anything else to help.

thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', /home/me/.cargo/registry/src/github.com-1ecc6299db9ec823/markdown-0.3.0/src/parser/block/unordered_list.rs:84:44
stack backtrace:
0: rust_begin_unwind
at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/panicking.rs:575:5
1: core::panicking::panic_fmt
at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/core/src/panicking.rs:65:14
2: core::panicking::panic_bounds_check
at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/core/src/panicking.rs:150:5
3: <usize as core::slice::index::SliceIndex<[T]>>::index
at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/core/src/slice/index.rs:259:10
4: core::slice::index::<impl core::ops::index::Index for [T]>::index
at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/core/src/slice/index.rs:18:9
5: <alloc::vec::Vec<T,A> as core::ops::index::Index>::index
at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/alloc/src/vec/mod.rs:2736:9
6: markdown::parser::block::unordered_list::parse_unordered_list
at /home/me/.cargo/registry/src/github.com-1ecc6299db9ec823/markdown-0.3.0/src/parser/block/unordered_list.rs:84:44
7: markdown::parser::block::parse_block
at /home/me/.cargo/registry/src/github.com-1ecc6299db9ec823/markdown-0.3.0/src/parser/block/mod.rs:71:5
8: markdown::parser::block::parse_blocks
at /home/me/.cargo/registry/src/github.com-1ecc6299db9ec823/markdown-0.3.0/src/parser/block/mod.rs:27:15
9: markdown::parser::parse
at /home/me/.cargo/registry/src/github.com-1ecc6299db9ec823/markdown-0.3.0/src/parser/mod.rs:43:5
10: markdown::to_html
at /home/me/.cargo/registry/src/github.com-1ecc6299db9ec823/markdown-0.3.0/src/lib.rs:29:18
11: markdown_bug::main
at ./frontend/examples/markdown_bug.rs:2:20
12: core::ops::function::FnOnce::call_once
at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/core/src/ops/function.rs:513:5

Refactor to clean code

Some ideas on improving the code:

Refactor

  • (1) Add helper to get byte at, get char before/after, bytes to chars, etc.
  • (1) Improve interrupt, concrete, lazy, pierce fields somehow?
  • (?) Remove last box: the one around the child tokenizer?

Math flow parsed inline in the syntax tree

Here is a small reproducible example:

use markdown::{to_mdast, ParseOptions, Constructs};

fn main() {
    let options = ParseOptions {
              constructs: Constructs {
                math_text: true,
                ..Constructs::default()
              },
              ..ParseOptions::default()
            };
    let ast = to_mdast("$$math$$", &options).unwrap();
    println!("{:?}", ast);
}

I get

Root { children: [Paragraph { children: [InlineMath { value: "math", position: Some(1:1-1:9 (0-8)) }], position: Some(1:1-1:9 (0-8)) }], position: Some(1:1-1:9 (0-8)) }

This was not what I was expecting

After a few tests, it seems that math is parsed "inline" when there is no newline between the $$ and the content, but I think this is almost never the expected behavior: a math equation between $$ on a single line should not be rendered "inline".

Does it add complexity to the implementation or was it a deliberate decision ?

Can't deserialize mdast

I serialized it using serde_json (also tried bincode, but anyway it's serde-based), but when I deserialize it, I have errors saying that it's expecting 'Root' instead of 'root' etc. I forked the project and removed all serde attributes related to field renaming and it fixed the issue. But I believe those attributes are there not just because you wanted them to be there :)

Add support for automatic hard breaks

Some very popular markdown tools (I know about obsidian.md, but it is probably not the only one) have an option to toggle true hard breaks, meaning any return character in the markdown will produce an HTML <br> tag.

It would be really nice to have that option in markdown-rs too !

'json' feature should be called `serde` and `serde-json` should not be a dependency

Hi, thanks for this crate, it's got great potential!

I don't quite understand the json feature. The way I think it is done:

The json feature enables dependencies for serde and serde-json. In the code, the serializers/deserializers are derived via serde, if the json feature is enabled. The serde-json crate is not actually used.

Here is how this could be done better (I tested locally):
The serde feature enables a dependency on serde with the derive feature, as before. In the code, the serializers/deserializers are derived, as before. The serde-json crate appears nowhere.

I think this is the conventional and idiomatic way to do this. Are there things I am overlooking here?

EDIT: renamed, was "'json' feature seems half baked" which is vague and unnecessarily insulting. Sorry :)

to_mdast Math parsing error

The parser treats $$something$$ as inline math:

#[test]
fn math_test() {
    println!(
        "{:?}",
        to_mdast(
            "$$\nnot inline\n$$\n$$not inline$$\n$inline$",
            &ParseOptions {
                constructs: Constructs {
                    math_flow: true,
                    math_text: true,
                    ..Constructs::gfm()
                },
                ..Default::default()
            }
        )
    )
}

output:

Ok(Root { children: [Math { value: "not inline", position: Some(1:1-3:3 (0-16)), meta: None }, Paragraph { children: [InlineMath { value: "not inline", position: Some(4:1-4:15 (17-31)) }, Text { value: "\n", position: Some(4:15-5:1 (31-32)) }, InlineMath { value: "inline", position: Some(5:1-5:9 (32-40)) }], position: Some(4:1-5:9 (17-40)) }], position: Some(1:1-5:9 (0-40)) })

Future-proof Options before 1.0

Hey! I was having a look at markdown, and noticed that the latest 1.0 alpha has Options that are structs.

I'm just coming to say, if you make it into a builder then it'll allow you to add options with minor releases, whereas with a struct like right now any option addition would require a full major release.

(This is also something I'd probably need: I'll probably soon-ish want to compile inline-only markdown to "markdown-html" (HTML that retains the markdown tags), and if I want any chance to upstream it I'll need the possibility of adding both parse and compile options. That said, I'm not even using this crate yet, so this may never happen.)

For the same reason you may want to make the Node enum non-exhaustive, though I'm less sure about this one. But the more stuff you can make non-exhaustive / made out of builder patterns with private members without hindering user experience, the less likely you'll need major updates in the future 😅
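To make the suggestion concrete, here is a minimal sketch of the two ideas: a builder with private fields, and a non-exhaustive enum. The type names `SketchParseOptions` and `SketchNode` and the two getters are hypothetical and only for illustration; this is not the crate's actual API, just one way such a design could look.

```rust
/// Hypothetical options type for illustration; not the crate's `ParseOptions`.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct SketchParseOptions {
    // Private fields: other crates cannot build this with struct-literal
    // syntax, so adding a field later does not break them.
    math_text_single_dollar: bool,
    gfm_strikethrough_single_tilde: bool,
}

impl SketchParseOptions {
    /// Start from the defaults (everything off in this sketch).
    pub fn new() -> Self {
        Self::default()
    }

    /// Chainable setter: consumes the builder and returns the updated value.
    pub fn math_text_single_dollar(mut self, value: bool) -> Self {
        self.math_text_single_dollar = value;
        self
    }

    /// Chainable setter for the second illustrative option.
    pub fn gfm_strikethrough_single_tilde(mut self, value: bool) -> Self {
        self.gfm_strikethrough_single_tilde = value;
        self
    }

    /// Read-only accessor (hypothetical name).
    pub fn allows_single_dollar_math(&self) -> bool {
        self.math_text_single_dollar
    }

    /// Read-only accessor (hypothetical name).
    pub fn allows_single_tilde_strikethrough(&self) -> bool {
        self.gfm_strikethrough_single_tilde
    }
}

/// Hypothetical node enum: `#[non_exhaustive]` forces `match` expressions in
/// other crates to carry a wildcard arm, so new variants are not breaking.
#[non_exhaustive]
#[derive(Clone, Debug, PartialEq)]
pub enum SketchNode {
    Text(String),
    Paragraph(Vec<SketchNode>),
}
```

A caller would write something like `SketchParseOptions::new().math_text_single_dollar(true)`, and a new option added in a minor release would simply be one more setter.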

Serializing mdast to markdown

(Whoops accidentally hit enter before drafting a content for this question, my apologies for the noise!)

Hi! I have a perhaps newbie question! I can use markdown::to_mdast to go from &str -> Node. Is it possible/is there a function to go back to a string (Node -> &str) in a way that roundtrips?

I came across Node::to_string, and it does seem to convert nodes into a string, but it also drops the links, titles, and most other AST nodes, so re-parsing the result gives a different AST. I'm unsure whether this question is reasonable/within the scope of this crate, but is there an alternate function elsewhere that is round-trippable between &str and Node? I am also happy to take a stab at implementing this "roundtrippable" unparser function myself, but was wondering if a function like it already existed.

For further clarification, by "roundtripping" I mean that a property-based test like markdown::to_mdast(to_string(node)) == node would hold for all nodes.

Thanks!

log spamming

At debug level, this crate logs one line per byte of input, which makes the debug logs quite hard to read.

[2023-11-19T17:27:14Z DEBUG markdown::tokenizer] feed:    byte U+0020 to ParagraphInside
[2023-11-19T17:27:14Z DEBUG markdown::tokenizer] feed:    byte `s` (U+0073) to ParagraphInside
[2023-11-19T17:27:14Z DEBUG markdown::tokenizer] feed:    byte `e` (U+0065) to ParagraphInside
[2023-11-19T17:27:14Z DEBUG markdown::tokenizer] feed:    byte `c` (U+0063) to ParagraphInside
[2023-11-19T17:27:14Z DEBUG markdown::tokenizer] feed:    byte `r` (U+0072) to ParagraphInside
[2023-11-19T17:27:14Z DEBUG markdown::tokenizer] feed:    byte `e` (U+0065) to ParagraphInside

Maybe this log could be moved to TRACE level?

GFM strikethrough causes nested attention sequences to be considered just text data

Here is the test to show what I am talking about.

markdown from test ~~foo __*a*__~~ = foo a

assert_eq!(
    to_html_with_options("~~foo __*a*__~~", &Options::gfm())?,
    "<p><del>foo <strong><em>a</em></strong></del></p>",
    "should support emphasis within strong emphasis within strikethrough w/ `*` in `_`"
);

I've already written the code to fix this, but I also refactored a portion of the code to help me understand exactly where I was getting it wrong, so I am opening an issue to share findings and discuss. I eventually landed on a reimplementation of how an AttentionSequence's open/close state is determined, using something maybe too close to the GFM spec's wording.

The root cause of the above test failing was tildes not being considered punctuation. I added a kind_before_index which copied its sibling (that already considered ascii punctuation as valid, which includes tilde).

After this, though, some other tests broke. After a bit of debugging, I found that my implementation was technically correct: GitHub's UI handles something that their own spec doesn't necessarily support, and this repo tests for it, I imagine precisely because GitHub's UI supports it.

markdown from test e ***~~xxx~~***yyy = e xxxyyy
other parsers (my ide, and some live example sites) were showing me: e ***xxx***yyy

assert_eq!(
    to_html_with_options("e ***~~xxx~~***yyy", &Options::gfm())?,
    "<p>e <em><strong><del>xxx</del></strong></em>yyy</p>",
    "interplay"
);

Then I went through my new implementation to see if I could get this test to pass while keeping the test at the top passing. I tweaked one of the refactor's new functions, is_right_flanking_delimiter_run, to treat preceding tildes as Other; everything else follows the new flow that includes ASCII punctuation.
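For readers following along, the tweak can be sketched roughly as below. This is not the code from the refactor: `CharKind`, `classify`, and `is_right_flanking` are hypothetical names, and the sketch simplifies the spec by looking only at ASCII punctuation (the spec's definition of punctuation is broader). It applies the CommonMark rule that a right-flanking run is not preceded by whitespace and, if preceded by punctuation, must be followed by whitespace or punctuation.

```rust
/// Hypothetical character classes, mirroring the spec's three-way split.
#[derive(Clone, Copy, Debug, PartialEq)]
enum CharKind {
    Whitespace,
    Punctuation,
    Other,
}

/// Classify the character next to a delimiter run.
/// `None` means start or end of line, which counts as whitespace.
/// Assumption: only ASCII punctuation is recognized in this sketch.
fn classify(ch: Option<char>) -> CharKind {
    match ch {
        None => CharKind::Whitespace,
        Some(c) if c.is_whitespace() => CharKind::Whitespace,
        Some(c) if c.is_ascii_punctuation() => CharKind::Punctuation,
        Some(_) => CharKind::Other,
    }
}

/// Can a delimiter run with these neighbours close emphasis?
fn is_right_flanking(before: Option<char>, after: Option<char>) -> bool {
    // The tweak described above: a tilde *before* the run counts as Other,
    // while a tilde *after* the run still counts as punctuation.
    let before_kind = if before == Some('~') {
        CharKind::Other
    } else {
        classify(before)
    };
    let after_kind = classify(after);

    match before_kind {
        CharKind::Whitespace => false,
        CharKind::Other => true,
        CharKind::Punctuation => after_kind != CharKind::Other,
    }
}
```

With this, the closing `__` in `~~foo __*a*__~~` (preceded by `*`, followed by `~`) is right-flanking, and so is the closing `***` in `e ***~~xxx~~***yyy` (preceded by `~`, followed by `y`).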

Happy to open the PR if asked, just didn't want to surprise anyone with an out of the blue PR.

Enable custom plugins

I'm a huge fan of this library, especially the MDX support 🎉, but I'd like to be able to provide my own extensions (such as super-fancy code blocks with all the bells and whistles and support for fun things like Mermaid diagrams). As far as I can tell, there isn't currently a way to do that. If I'm wrong and there is a way to do that, please let me know and I'll close this 😄 But if this isn't currently possible, I'd like to know if you consider that a non-goal for the project. If it is a non-goal, I'm happy to close this and explore other options. But if you're open to that, I'm happy to help.

Option to only produce tags for explicit markdown

Hello! I've been loving using this crate in a project of mine, it's really amazing. I have another segment of my project (translations) that would really benefit from being able to have its markdown only emit new HTML for explicit style... requests? For lack of a better word? I checked Constructs and it states that paragraphs cannot be turned off, which are what is currently causing trouble for me as it's placed around everything.

Thanks for making this library!

Inlines in Image

I was wondering if it may be possible to have a list of inlines in Node::Image and Node::ImageRef for the alt property.
That would be similar to how links work, and would make it possible to support figures in the same manner as Pandoc does.

Internal panic with nested links

The strings
[![]()]()
![![]()]()
cause panics when passed to to_mdast!

I'm surprised this didn't get detected by fuzzing - you might want to double check the fuzzer

Field-variants rather than tuple-variants for Block, Span

For the two enums

pub enum Block {
    Header(Vec<Span>, usize),
    Paragraph(Vec<Span>),
    Blockquote(Vec<Block>),
    CodeBlock(Option<String>, String),
    OrderedList(Vec<ListItem>, OrderedListType),
    UnorderedList(Vec<ListItem>),
    Raw(String),
    Hr,
}
pub enum Span {
    Break,
    Text(String),
    Code(String),
    Link(String, String, Option<String>),
    Image(String, String, Option<String>),
    Emphasis(Vec<Span>),
    Strong(Vec<Span>),
}

There are several variants, which are unnecessarily unclear.
Could Block::CodeBlock, Span::Link, Span::Image, be replaced by field variants?
Or could the separate components be documented?
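As a rough illustration of what field variants could look like, here is a small sketch. The enum name `SketchSpan` and the field names `text`, `url`, and `title` are guesses for illustration: the 0.3.0 docs do not say what each positional `String` means, which is exactly the problem.

```rust
/// Hypothetical field-variant version of part of `Span`.
/// The field names are assumptions; the original tuple variants are undocumented.
#[derive(Clone, Debug, PartialEq)]
pub enum SketchSpan {
    Text(String),
    Link {
        text: String,
        url: String,
        title: Option<String>,
    },
}

/// With named fields, call sites say which component they want
/// instead of relying on tuple position.
pub fn link_url(span: &SketchSpan) -> Option<&str> {
    match span {
        SketchSpan::Link { url, .. } => Some(url.as_str()),
        _ => None,
    }
}
```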

https://docs.rs/markdown/0.3.0/src/markdown/parser/mod.rs.html#31-40
https://docs.rs/markdown/0.3.0/src/markdown/parser/mod.rs.html#10-20

Just writing what #[allow(missing_docs)] is allowing to be missing in these two cases.

Thank you.

Add more options to ParseOptions

I think it would be nice to be able to configure ParseOptions to enable/disable features.

For example, some features I would want in the future are strikethrough support, table support, etc.

mdast with frontmatter

Is it even possible to use mdast with options? And can we get frontmatter in the syntax tree?

Allow both gfm (for tables) and allow embedding HTML with allow_dangerous_html

I wanted to both allow Markdown tables and allow the embedding of HTML tags (specifically, I wanted to use an iframe to embed YouTube videos).

This is what worked for me:

    let content = markdown::to_html_with_options(
        &content,
        &markdown::Options {
            compile: markdown::CompileOptions {
                allow_dangerous_html: true,
                ..markdown::CompileOptions::default()
            },
            ..markdown::Options::gfm()
        },
    )
    .unwrap();

Is this the correct and recommended way? It is unclear to me what should be inside the CompileOptions and what should be outside.

Would it be a good idea to add such an example to this example?

Markdown in HTML tags don't work

Hello, thanks for this cool crate. I noticed that markdown inside HTML tags does not work unless there is at least one character of text before the opening tag, for each HTML tag. I also noticed it behaves like this on GitHub. Is this intended, and could there be a way to make it work?

any text <p style="text-align: center;">
	**hello**
<p>

^ works as expected, centering and bolding "hello" with "any text" above it

any text 
<p style="text-align: center;">
	**hello**
<p>

^ Centering "hello" works, but bolding does not and the ** are visible.

Add serde annotations for configuration

The title pretty much explains it all, I'd love if it were possible to deserialise configuration with serde so that it can be exposed directly to things like WebAssembly

I'm happy to do this myself, I am just throwing this issue here before I get into it in case there are reasons not to do it, or to do it in a specific way

:)

Extensions

The extensions below are listed from top to bottom from more important to less important.

Add options so that a JavaScript-aware (or theoretically, Rust-aware) parser can wrap this

micromark-rs can parse MDX. MDX is markdown (minus some features) plus some features.

MDX includes expressions (and expressions inside JSX), that can be parsed either:

  1. agnostic to a programming language (so this project is not aware of a programming language by default), in which case braces are counted: {xxx} is a whole expression, and so is the same substring in {xxx}yyy}. This is useful for people who want to use this project with components and support only variables (props?), without any code being evaluated.
  2. gnostic (“aware”) of a particular programming language (most likely JavaScript through SWC, but theoretically also Rust or so), in which case something wrapping this must pass parsing functions in, because {xxx} is valid, and so is {'a}b'} (if JS-aware), but {'a}b'} would be an exception for Rust-aware expressions.
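The agnostic mode from point 1 can be sketched as plain brace counting. The function name `expression_end` is hypothetical and this is only an illustration of the idea, not how the tokenizer is implemented:

```rust
/// Hypothetical helper: `input` is expected to start at an opening `{`.
/// Returns the byte index just past the matching `}`, or `None` if the
/// input does not start with `{` or the braces never balance.
fn expression_end(input: &str) -> Option<usize> {
    let mut depth = 0usize;

    for (index, byte) in input.bytes().enumerate() {
        match byte {
            b'{' => depth += 1,
            b'}' => {
                // A closing brace with nothing open is not an expression.
                depth = depth.checked_sub(1)?;
                if depth == 0 {
                    return Some(index + 1);
                }
            }
            _ => {
                // Any other byte before the first `{` means no expression here.
                if depth == 0 {
                    return None;
                }
            }
        }
    }

    // Ran out of input while braces were still open.
    None
}
```

Because no language is involved, `{'a}b'}` ends after `{'a}` in this mode; only a JS-aware parser knows that the inner `}` sits inside a string.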

MDX also includes ESM, which only makes sense if “gnostic to JS”. In the future we could support maybe Rust keywords for that instead of import/export if we need that.

Here’s the direction of an API I’m thinking of, as pseudo-code:

enum Signal {
    /// A syntax error.
    /// `micromark-rs` will crash with error message `String`, and convert the
    /// `usize` (offset into `&str`) to where it happened in the whole document.
    /// E.g., `Unexpected `"`, expected identifier`.
    Error((String, usize)),
    /// An “error” at the end of the (partial?) expression
    /// `micromark-rs` will either crash with error message `String` if it
    /// doesn’t have any more text, or it will try again later when more text
    /// is available
    /// E.g., `Unexpected end of file in string literal`.
    Eof(String),
    /// Done, `micromark-rs` knows that this is the end of a valid
    /// expression/esm and continues with markdown.
    Ok,
}

enum Kind {
    /// For `# {Math.PI}` and `{Math.PI}`.
    Expression,
    /// For `<a {...b}>`
    AttributeExpression,
    /// For `<a b={c}>`.
    AttributeValueExpression,
}

/// * If `kind` is `Kind::AttributeExpression`, SWC can pass an error
///   back if there is no spread (but it can also do that later when making the
///   AST).
/// * If `kind` is `Kind::AttributeValueExpression`, SWC can pass an
///   error back if the expression is nothing/whitespace-only/comments-only (but
///   it can also do that later when making the AST).
parse_expression(expression: &str, kind: Kind) -> Signal;

/// * SWC can pass errors back when there is non-ESM found (e.g.,
///   `export var a = 1\nvar b = 2`), or do it when building the AST
/// * When building the AST, SWC needs to throw errors if identifiers are used
///   in different ESM blocks (`export var a = 1\n\n# hi\n\nexport var a = 2`)
parse_esm(program: &str) -> Signal;

/// micromark-rs will then call these “hooks” when it encounters expressions/esm
/// to pass off parsing to SWC.
/// For example, taking this markdown:
/// 
/// ```
/// export function a() {
///   return `b
///
///   c`
/// }
/// ```
/// 
/// …`parse_esm` will first be called with:
/// ``"export function a() {\nreturn `b"``.
/// ⏎ SWC will then pass back:
/// `Signal::Eof("Unexpected end of file in template literal, expected closing backtick".to_string())`
/// `micromark-rs` will then continue, and call it again with:
/// ``"export function a() {\nreturn `b\n\nc`}"``.
/// ⏎ SWC will then pass back `Signal::Ok`.
/// 
/// Two big questions:
/// * If SWC is “resumable” on EOF errors, `micromark-rs` for the 2nd call
///   could pass ``"\n\nc`}"``.
///   I know that Acorn doesn’t support this though, and it might get a bit
///   complex
/// * I am not sure how to do this with Rust, but we need to find a way to
///   “save” the result of SWC partial ASTs for each expression/esm.
///   One way of thinking, is for SWC to define a sort of `Ok<T>`, and
///   `micromark-rs` saving that in an array or on events or so?
///   Another way is for `parse_expression`/`parse_esm` to be called with some
///   unique identifier/start position/incremented number, and then SWC needs
///   to store those partial ASTs somewhere?

Build issue with alpha.6 release - serde attribute and crate confusion

Hello! Thanks for sharing this beautiful project. Since the latest release 1.0.0-alpha.6 I get the following error when building my project:

error: cannot find attribute `serde` in this scope
    --> /home/remo/.cargo/registry/src/github.com-1ecc6299db9ec823/markdown-1.0.0-alpha.6/src/mdast.rs:1299:5
     |
1299 |     serde(tag = "type", rename = "mdxJsxFlowElement")
     |     ^^^^^
     |
     = note: `serde` is in scope, but it is a crate, not an attribute

It is easy to reproduce, just create a new project cargo new test and add

[dependencies]
markdown = "=1.0.0-alpha.6"

This has been tested with rustc 1.67.0 (fc594f156 2023-01-24) and cargo 1.67.0 (8ecd4f20a 2023-01-10) on Ubuntu 22.04.

Automatically add IDs to headings

Hi!

In older versions of the crate, HTML headings would automatically be assigned an id based on the markdown heading

For example,

# Hello World

would become

<h1 id="hello_world">Hello World</h1>

Is there an option for this? If not, would it be possible to support it?
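In case it helps the discussion, this is roughly the transformation I mean, as a standalone sketch. `heading_id` is a hypothetical helper, not something the crate provides, and the exact rules of the old crate (for example how it handled punctuation or duplicate headings) are an assumption here:

```rust
/// Hypothetical helper: derive an `id` value from heading text.
/// Letters and digits are lowercased; any run of other characters
/// becomes a single underscore.
fn heading_id(text: &str) -> String {
    let mut id = String::new();

    for ch in text.trim().chars() {
        if ch.is_alphanumeric() {
            id.extend(ch.to_lowercase());
        } else if !id.is_empty() && !id.ends_with('_') {
            id.push('_');
        }
    }

    // Drop a separator left behind by trailing punctuation.
    id.trim_end_matches('_').to_string()
}
```

For the heading above, `heading_id("Hello World")` gives `hello_world`. A real implementation would also need to make repeated headings unique.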

Thanks :)

Add support for directives

I have a use case where we need to support the commonmark directives (https://github.com/remarkjs/remark-directive), and I am wondering if there are plans to support this, or if it's something I could contribute to.

For my usecase, markdown-rs is used as a library and it's fine to write rust code to define the directives directly, or as a fallback, even just render the directives as a web component format <directive-name attr=attrVal> and then the user can supply the webcomponent implementations themselves.

How to get math working

Hello, I need help getting the math extension working. I'm using the following options:

Options {
    parse: ParseOptions {
        constructs: Constructs {
            attention: true,
            autolink: true,
            block_quote: true,
            character_escape: true,
            character_reference: true,
            code_indented: true,
            code_fenced: true,
            code_text: true,
            definition: true,
            frontmatter: false,
            gfm_autolink_literal: true,
            gfm_footnote_definition: true,
            gfm_label_start_footnote: true,
            gfm_strikethrough: true,
            gfm_table: true,
            gfm_task_list_item: true,
            hard_break_escape: true,
            hard_break_trailing: true,
            heading_atx: true,
            heading_setext: true,
            html_flow: true,
            html_text: true,
            label_start_image: true,
            label_start_link: true,
            label_end: true,
            list_item: true,
            math_flow: true,
            math_text: true,
            mdx_esm: false,
            mdx_expression_flow: false,
            mdx_expression_text: false,
            mdx_jsx_flow: false,
            mdx_jsx_text: false,
            thematic_break: true,
        },
        gfm_strikethrough_single_tilde: false,
        math_text_single_dollar: true,
        mdx_esm_parse: None,
        mdx_expression_parse: None,
    },
    compile: CompileOptions {
        allow_dangerous_html: false,
        allow_dangerous_protocol: false,
        default_line_ending: LineEnding::LineFeed,
        gfm_footnote_label: None,
        gfm_footnote_label_tag_name: None,
        gfm_footnote_label_attributes: None,
        gfm_footnote_back_label: None,
        gfm_footnote_clobber_prefix: None,
        gfm_task_list_item_checkable: false,
        gfm_tagfilter: false,
    },
}

Notably math_flow: true and math_text: true.

The options are doing something because writing

$$
\frac{1}{2}
$$

generates the following HTML:

<pre><code class="language-math math-display">\frac{1}{2}</code></pre>

I guess from here I'd have to style the language-math and math-display classes on my own? Any example CSS I can use?

From issue #1 I see that the math extension is (inspired by?) https://github.com/micromark/micromark-extension-math, which has a section on CSS. But adding their stylesheet like <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css"> didn't help either.

Also just wanted to add that this crate is awesome and perfect for my use case 😁. Thanks for your work!
