soasme / nim-markdown

A Beautiful Markdown Parser in the Nim World.

Home Page: https://www.soasme.com/nim-markdown/

License: MIT License

Nim 100.00%

markdown markdown-to-html markdown-parser nim nim-lang nim-language

nim-markdown's People

treeform benjif enthus1ast zedeus akavel drkameleon hoijui pietroppeter dut3062796s yardanico ee7

nim-markdown's Issues

Bad performance with larger documents

I didn't really want to just post this without also providing a good fix, but it's causing me problems, and I lack the understanding of the large amount of code in this necessary to do much to it.

It seems that this library parses and renders Markdown quite slowly, especially on larger documents. I noticed this while testing with ~10KB documents but have mostly tried testing it on ~100KB ones, for which rendering takes about 4 seconds, while a Python CommonMark implementation seems to manage 400ms and a JavaScript one ~50ms. This is proving problematic for an application I want to use this in, and while one fix might be to switch to cmark bindings, they don't provide options to customize parsing, so I would have to reimplement a lot of logic.

By switching most of the internals away from doubly linked lists, I was able to get a roughly 25% improvement (very rough, as I just ran time over it a few times with a big test file as input) but I think bigger changes would be needed to get a significant difference; I think it needs to avoid allocating as many objects (maybe use an ADT or something instead) and to handle chunks of unformatted text better. This also breaks quite a few of the test cases (mostly only spacing in lists as far as I can tell): https://gist.github.com/osmarks/c7d8db89896047368d6512f6284cd7c2

Here is a test input file (without any formatting) and profiling output from it:

out2.txt
profile_results.txt

Add footnote support

I see that this is in the roadmap. It'd be nice if footnotes were properly supported! I'm writing a small static site generator using this library and there are a couple of existing markdown posts I have that use footnotes.

At the moment they seem to turn into links that point to strange paths.

Unable to compile with threading turned on

The compiler message is:

/home/johnd/.nimble/pkgs/markdown-0.7.2/markdown.nim(2317, 6) Warning: 'applyInlineParsers' is not GC-safe as it performs an indirect call via 'inlineParser' [GcUnsafe2]

Looking into it, it makes sense. The MarkdownConfig.inlineParsers field holds a list of pointers to procedures held dynamically. In a single-threaded app, that works. I suspect, but can't confirm, the biggest problem is that the compiler can't check for GC safety because the proc references are only known at runtime.

Simply adding a {.gcsafe.} pragma might fix it if you are confident there won't be any runtime problems.

To replicate, add --threads:on to the nim compiler parameters and call markdown from a threaded-context such as a route in Jester.
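
A minimal standalone sketch of the idea, not the library's actual code: if the proc type stored in the config carries the gcsafe pragma itself, the compiler can verify indirect calls through it. The names InlineParser, skipSpaces and applyInlineParsers below are hypothetical stand-ins chosen for illustration.

```nim
type
  # Hypothetical stand-in for the library's parser callback type.
  # Marking the proc *type* as gcsafe means every proc stored in it
  # must itself be GC-safe, so indirect calls through it are too.
  InlineParser = proc (doc: string, start: int): int {.gcsafe, nimcall.}

proc skipSpaces(doc: string, start: int): int =
  ## Returns how many spaces begin at `start`.
  var i = start
  while i < doc.len and doc[i] == ' ':
    inc i
  i - start

proc applyInlineParsers(parsers: seq[InlineParser], doc: string): int {.gcsafe.} =
  ## Sums the characters consumed by each parser at position 0.
  for parser in parsers:
    result += parser(doc, 0)

var parsers: seq[InlineParser]
parsers.add skipSpaces
echo applyInlineParsers(parsers, "   text")
```

With this shape, a parser that touches GC'd global state is rejected when it is added to the list, rather than producing a warning at the call site.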

API docs missing?

Hi, I can't find a working link to the API docs (I see they are committed in the repo at https://github.com/soasme/nim-markdown/tree/master/docs/htmldocs). The only link I can find is https://www.soasme.com/nim-markdown/, which does not contain API documentation.

Warning: method has lock level <unknown>, but another method has 0

When installing, it reports

Warning: method has lock level <unknown>, but another method has 0

How to produce AST?

I've been reading through the docs and am lost trying to produce the AST described in the docs:

Document()
+-Heading(level=1)
  +-Text("H")
  +-Text("e")
  +-Text("l")
  +-Text("l")
  +-Text("o")
  ...
+-Paragraph()
  +-Text("W")
  +-Text("e")
  ...
  +-Em()
    +-Text("n")
    +-Text("i")
    +-Text("m")
    ...
  +-Text(".")
  ...

Given a string, such as # Hello World\This is a **bold** word., how would I go about generating that object? If possible, can that be more explicitly explained in the docs? This project looks to be exactly what I need so I would really appreciate the help.

Improvement: some parsing functions can be replaced by parseutils.

https://nim-lang.org/docs/parseutils.html
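
As a rough illustration of what such a replacement could look like, here is a sketch that recognises an ATX heading using only std/parseutils. The proc name parseAtxHeading is hypothetical and not part of nim-markdown, and the sketch ignores closing # sequences and trailing whitespace.

```nim
import std/parseutils

proc parseAtxHeading(line: string): tuple[level: int, text: string] =
  ## Returns (0, "") when `line` is not an ATX heading.
  var pos = skipWhile(line, {' '})          # leading indentation
  if pos > 3: return (0, "")
  let level = skipWhile(line, {'#'}, pos)   # length of the '#' run
  if level notin 1..6: return (0, "")
  pos += level
  let gap = skipWhitespace(line, pos)
  if gap == 0 and pos < line.len:
    return (0, "")                          # "#foo" is not a heading
  pos += gap
  var text: string
  discard parseUntil(line, text, {'\n'}, pos)
  (level, text)

let heading = parseAtxHeading("## Hello World")
echo heading.level, " ", heading.text
```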

[feature] API to operate on parsed markdown AST

pretty much the same motivation as described in http://pandoc.org/filters.html

How would you modify your regular expression to handle these cases? It would be hairy, to say the least. What we need is a real parser. Well, pandoc has a real markdown parser, the library function readMarkdown. This transforms markdown text to an abstract syntax tree (AST) that represents the document structure. Why not manipulate the AST directly in a short Haskell script, then convert the result back to markdown using writeMarkdown?

I just want to open a discussion on whether this could be supported, what the API would look like, etc.

use cases

enabling application writers to use nim-markdown API to write these:

writing linters for markdown files
listing urls / links
transform markdown file in custom ways

basically this is related to exposing a more modular API, as in pandoc, even if input/output is limited to markdown/html ; in particular input:markdown => output:markdown would be a natural extension (ie, markdown transformers)

Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document (an abstract syntax tree or AST), and a set of writers, which convert this native representation into a target format

using pandoc -f markdown... -t markdown... can have surprisingly useful applications. As a demo, this file is generated by
...

GFM and CommonMark compatibility

The roadmap mentions correctness, but doesn't further explain what that means. There is no consensus on what correct handling of markdown is, see for example Babelmark which compares the output of 20 different markdown implementations. Here's just one example where the issue becomes obvious.

Have you considered making this an implementation of the GFM spec? GFM is an extension of the CommonMark spec made by GitHub, which includes support for tables and several other non-standard markdown features. By making it possible to enable/disable the extensions in the API, it would also be an implementation of CommonMark itself.

Enhance post-processing

As a user, I want an enhanced post-processing API for the parsed AST, so that I can customize the parsed result to support more use cases, such as creating table of contents or adding slugs to headers, etc.

See discussion #67.

htmlEntityToUtf8 adds around 600 kb to binary size with -d:release on Windows

https://github.com/soasme/nim-markdown/blob/master/src/markdownpkg/entities.nim

Tested by manually removing its use in my local Nimble instance:

# markdown.nim
proc escapeHTMLEntity*(doc: string): string =
  var entities = doc.findAll(re"&([^;]+);")
  result = doc
  for entity in entities:
    if not IGNORED_HTML_ENTITY.contains(entity):
      let utf8Char = entity.htmlEntityToUtf8

Size of small website builder compiled with -d:release:

    if not IGNORED_HTML_ENTITY.contains(entity):
      let utf8Char = entity#.htmlEntityToUtf8

Same compilation settings:

Converting this to a constant table should save ~~a large amount of~~ space. A build option to turn it off might work as a temporary option though, like -d:markdownNoEntities

Update: Tried changing it to a hash table, it apparently does not save much space:

This makes sense because of the way case/of is optimized (case/of itself is probably faster than a hash table), but I expected it to have a bigger impact. My mistake.

What does save a little more space than that though is using an array of tuples and checking for equality every single time instead of hashes, sacrificing speed:

This is just a bad idea for performance. I would really rather just not have all this in my binary.

Forgot to mention this is on Nim 1.0.4.
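
The build option suggested above could be sketched roughly as follows. This is an assumption about how a -d:markdownNoEntities switch might be wired up, not the library's code: entityToUtf8 is a hypothetical name, and the table holds only a tiny illustrative subset of the real entity list.

```nim
import std/tables

when defined(markdownNoEntities):
  proc entityToUtf8(name: string): string =
    ## Entity decoding compiled out: every entity is reported as unknown.
    ""
else:
  # Tiny illustrative subset; the real entity list is far larger.
  const entities = {"amp": "&", "lt": "<", "gt": ">", "quot": "\""}.toTable

  proc entityToUtf8(name: string): string =
    ## Returns the decoded character, or "" for an unknown entity.
    entities.getOrDefault(name, "")

echo entityToUtf8("amp")
```

Compiling with -d:markdownNoEntities selects the first branch, so the table never reaches the binary.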

Improve the Performance of nim-markdown

This is an umbrella issue for tracking the efforts on improving the performance of nim-markdown.

Below are potential bottlenecks:

Use normal sequence, instead of doublylinkedlist for token sequence.
HtmlBlockParser.parse is very slow. It has nested loop runs which can be optimized. #52
Consume less memory/gc by assigning slices instead of string objects to tokens. #54, #55, #56
Pre-chop the string by lines, instead of ad-hoc splitLines.
Use kind: XXX, instead of object inheritance.
Ignore parsing content when an html comment is matched.
remove bottleneck since() calls. #53
remove bottleneck firstLine() & restLines() calls. #56
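
Two of the items above (a plain seq for tokens, and kind: XXX instead of object inheritance) can be sketched together. The type and field names are hypothetical and much simpler than the library's real token types; the point is only the shape of the data structure.

```nim
type
  TokenKind = enum
    tkText, tkEm, tkHeading

  # One object variant replaces a hierarchy of `ref object of Token`
  # subclasses; dispatch becomes a `case` instead of a method call.
  Token = object
    children: seq[Token]       # plain seq instead of DoublyLinkedList
    case kind: TokenKind
    of tkText:
      text: string
    of tkEm:
      discard
    of tkHeading:
      level: int

proc render(tok: Token): string =
  ## Renders a token and its children to HTML.
  var inner = ""
  for child in tok.children:
    inner.add render(child)
  case tok.kind
  of tkText: tok.text
  of tkEm: "<em>" & inner & "</em>"
  of tkHeading: "<h" & $tok.level & ">" & inner & "</h" & $tok.level & ">"

let doc = Token(kind: tkHeading, level: 1, children: @[
  Token(kind: tkText, text: "Hello "),
  Token(kind: tkEm, children: @[Token(kind: tkText, text: "nim")])])
echo render(doc)
```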

[GFM] Support Disallowed Raw HTML (extension)

https://github.github.com/gfm/#disallowed-raw-html-extension-

discussion: feasibility of adding nim-markdown in nim documentation pipeline?

originally asked there nim-lang/Nim#9487 but moving that specific question here:

how could we use nim-markdown for markdown=>html generation in ./koch web ? options:

add a nimble dependency on nim-markdown for koch (is that even feasible, or would that cause circular dependencies ?)
[preferred] copy nim-markdown sources to Nim/ tests/deps/ ; we already do this for these:
jester-#head opengl-1.1.0 x11-1.0 zip-0.2.1

the copied sources would be (regularly) updated as needed from upstream nim-markdown (and not meant to be locally modified in Nim repo, only meant as a stale copy)

@soasme what do you think?

Strikethrough in list doesn't work

import markdown

echo markdown("* ~~da~~")

Output:

<ul>
<li>~~da~~</li>
</ul>

[GFM] Support Autolink (extension)

https://github.github.com/gfm/#autolinks-extension-

[GFM] Support Task List Items

Support parsing text like - [x] todo into a rendered task list item (a checked checkbox followed by the text):

todo
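
A small sketch of just the detection step, assuming the GFM rule that a task marker is [ ] or [x] followed by whitespace at the start of a list item. The proc name taskListItem is hypothetical, and rendering the checkbox is left out.

```nim
import std/strutils

proc taskListItem(line: string): tuple[isTask, checked: bool, text: string] =
  ## Detects "- [ ] text" / "- [x] text" (also with * or + bullets).
  let s = line.strip(trailing = false)
  if s.len >= 6 and s[0] in {'-', '*', '+'} and s[1] == ' ' and
     s[2] == '[' and s[3] in {' ', 'x', 'X'} and s[4] == ']' and s[5] == ' ':
    return (true, s[3] != ' ', s[6 .. ^1])
  (false, false, line)

let item = taskListItem("- [x] todo")
echo item.isTask, " ", item.checked, " ", item.text
```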

support file transclusion (including contents of a file)

basically this feature request: github/markup#346

use cases

including a csv file (rendering it as a table)
including another markdown (cf rst's include)
including a LICENSE.txt
DRY docs: allows including a version number for example
makes it easier to transform Nim's rst files (which have a few include) to markdown
embed video

syntax

as proposed here: https://talk.commonmark.org/t/transclusion-or-including-sub-documents-for-reuse/270/3

{{ my_file }} -> include the file and parse it as markdown
{{ my_file[start:end] }} -> include the lines comprised between start and end and parse them as markdown.

there are alternatives that have been proposed in various markdown flavors.

Ideally, it should allow optionally specifying the file type (overriding guessing it from file extension if needed), eg: jpg, csv, md, txt, codeblock(?)

caveats

github doesn't seem to support this feature
Github doesn't provide this feature even for reStructuredText (rst) which has include directive in the official language spec : github/markup#172
I've seen somewhere that github doesn't support it because of security concerns; however, I'd like to understand that concern better; it doesn't seem relevant as far as nim-markdown is concerned
different flavors of markdown use a different syntax for this feature
other syntax I've seen:

[![Watch the video](https://raw.github.com/GabLeRoux/WebMole/master/ressources/WebMole_Youtube_Video.png)](http://youtu.be/vt5fpE0bzSY)

other syntax I've seen here https://github.com/sethen/markdown-include

#include "another-markdown-file.md"

links

https://github.com/iainc/Markdown-Content-Blocks

CSV files are embedded as tables, source code files become code blocks, and embedded text files help writers structure their work:

table rendering broken maybe?

❯ nim --version
Nim Compiler Version 1.2.6 [Linux: amd64]
Compiled at 2020-07-29

import markdown

let j = """
| Header 1 | Header 2 | Header 3 | Header 4 |
| :------: | -------: | :------- | -------- |
| Cell 1   | Cell 2   | Cell 3   | Cell 4   |
| Cell 5   | Cell 6   | Cell 7   | Cell 8   |
"""

echo markdown(j)

gives

<p>| Header 1 | Header 2 | Header 3 | Header 4 |
| :------: | -------: | :------- | -------- |
| Cell 1   | Cell 2   | Cell 3   | Cell 4   |
| Cell 5   | Cell 6   | Cell 7   | Cell 8   |</p>

Paragraph continuation broken in lists

Paragraph rules (no new paragraph until repeated horizontal or vertical whitespace) are not processed correctly in lists.

An example:

import markdown

let md = """
Normal paragraph

* bullet item one

* bullet starts here
 and continues here
and this is also the bullet, not a new paragraph

This is a new paragraph though
"""

echo markdown(md, config = initGfmConfig())

Produces:

<p>Normal paragraph</p>
<ul>
<li>
<p>bullet item one</p>
</li>
<li>
<p>bullet starts here
and continues here</p>
</li>
</ul>
<p>and this is also the bullet, not a new paragraph</p>
<p>This is a new paragraph though</p>

Where the second to last paragraph lives outside the list. However, this is what GFM renders:

Normal paragraph

bullet item one
bullet starts here
and continues here
and this is also the bullet, not a new paragraph

This is a new paragraph though

build fails `markdown.nim(842, 81) Error: \u not allowed in character literal`

v0.5.0 build is failing with:

[david@eb ~]$ nimble install https://github.com/soasme/nim-markdown
Downloading https://github.com/soasme/nim-markdown using git
  Verifying dependencies for markdown@0.5.0
 Installing markdown@0.5.0
   Building markdown/markdown using c backend
    Prompt: Build failed for 'https://github.com/soasme/nim-markdown@0.5.0', would you like to try installing 'https://github.com/soasme/nim-markdown@#head' (latest unstable)? [y/N]
    Answer: y
Downloading https://github.com/soasme/nim-markdown using git
  Verifying dependencies for markdown@#head
 Installing markdown@#head
   Building markdown/markdown using c backend
       Tip: 5 messages have been suppressed, use --verbose to show them.
     Error: Build failed for package: markdown
        ... Details:
        ... Execution failed with exit code 1
        ... Command: "/home/david/Nim/bin/nim" c --noBabelPath -d:release -o:"/tmp/nimble_26030/githubcom_soasmenimmarkdown_#head/markdown" "/tmp/nimble_26030/githubcom_soasmenimmarkdown_#head/src/markdown.nim"
        ... Output: Hint: used config file '/home/david/Nim/config/nim.cfg' [Conf]
        ... Hint: system [Processing]
        ... Hint: markdown [Processing]
        ... Hint: re [Processing]
        ... Hint: pcre [Processing]
        ... Hint: strutils [Processing]
        ... Hint: parseutils [Processing]
        ... Hint: math [Processing]
        ... Hint: bitops [Processing]
        ... Hint: algorithm [Processing]
        ... Hint: unicode [Processing]
        ... Hint: rtarrays [Processing]
        ... Hint: strformat [Processing]
        ... Hint: macros [Processing]
        ... Hint: tables [Processing]
        ... Hint: hashes [Processing]
        ... Hint: sequtils [Processing]
        ... Hint: uri [Processing]
        ... Hint: htmlparser [Processing]
        ... Hint: streams [Processing]
        ... Hint: parsexml [Processing]
        ... Hint: lexbase [Processing]
        ... Hint: xmltree [Processing]
        ... Hint: strtabs [Processing]
        ... Hint: os [Processing]
        ... Hint: times [Processing]
        ... Hint: options [Processing]
        ... Hint: typetraits [Processing]
        ... Hint: posix [Processing]
        ... Hint: ospaths [Processing]
        ... Hint: lists [Processing]
        ... markdown.nim(842, 81) Error: \u not allowed in character literal

I've tried updating my Nim installation (I'm now on the latest version), but I get the same error.

links in lists not working

1. [foo](https://nim-lang.org)
2. [baa](https://nim-lang.org/installation)

- [foo](https://nim-lang.org)
- [baa](https://nim-lang.org/installation)

get rendered like so:

1. [foo](https://nim-lang.org)
2. [baa](https://nim-lang.org/installation)

- [foo](https://nim-lang.org)
- [baa](https://nim-lang.org/installation)

but hyperlinks should be created.

javascript support

My plan was to use nim-markdown within a Karax app (js target), but compilation fails with:

../../../bin/nim-repo/lib/impure/re.nim(100, 3) Error: undeclared identifier: 'copyMem'

It looks like js support is currently blocked by this: nim-lang/Nim#7640

HTML to Markdown

Any plans on reversing from HTML => Markdown?

Consider making `config` more strict/strongly-typed

problem

The config parameter to markdown() is currently a regex-parsed string that's prone to both user & dev error, can blow up at runtime rather than compile time, and isn't great for documentation or autocomplete. TLDR it's "stringly typed" and that's not ideal.

nim-markdown/src/markdown.nim

Lines 865 to 868 in 9348402

    
proc markdown*(doc: string, config: string = """
Escape: true
KeepHTML: false
"""): string =

solution

If the parameters are all flags (and they currently are) then it's probably best to use a set[enum] instead. This is a pattern I've used pretty often:

type
  MarkdownOption* {.pure.} = enum
    Escape, KeepHtml

  MarkdownOptions* = set[MarkdownOption]

const
  defaultMarkdownOptions* = {Escape}

proc markdown* (doc: string, config = defaultMarkdownOptions): string

This way if a user typos and passes Escaep they'll get a compile time error, or might even avoid it entirely because of code suggestions. If you want to go this route I'm willing to make a PR.
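
To make the proposal concrete, here is a self-contained sketch of how such a set-based option could be consumed. The renderText helper is hypothetical and only escapes three HTML metacharacters; it stands in for the real markdown proc, whose body is not shown here.

```nim
import std/strutils

type
  MarkdownOption {.pure.} = enum
    Escape, KeepHtml
  MarkdownOptions = set[MarkdownOption]

proc renderText(text: string,
                config: MarkdownOptions = {MarkdownOption.Escape}): string =
  ## Hypothetical helper: escapes HTML metacharacters only when asked to.
  if MarkdownOption.Escape in config:
    text.multiReplace(("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"))
  else:
    text

echo renderText("a < b")        # default options: escaping on
echo renderText("a < b", {})    # empty set: escaping off
```

A misspelled option such as MarkdownOption.Escaep fails at compile time, which is the benefit the proposal is after.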

alternative

If future options might be non-boolean they wouldn't be covered by this and maybe something like an object + proc combo would be needed:

type
  MarkdownConfig* = object
    escape, keepHtml: bool
    someConfigurableString: string

proc initMarkdownConfig* (
  escape = true,
  keepHtml = false,
  someConfigurableString = ""
): MarkdownConfig

It's a bit heavier and more verbose though.

Commonmark 0.29.0 Support

Commonmark https://spec.commonmark.org/0.29/ was released recently. This issue is for tracking the progress of cmark 0.29.0 support.

29 Aug, 2019: #22, 632/649 passed.

support extension `escaped_line_breaks`

full context: https://github.com/nim-lang/Nim/issues/9291#issuecomment-432351178
it's useful for editors that automatically trim trailing spaces

Error without stacktrace when parsing long, complex document inside a Html comment tag

I ran into a weird behaviour that I was able to minimize in the following example:

import markdown

let text = """
## title

some text:
- one point, and a [link](to_here)
- two points and **emphasis**
- three points, _really_?
  + sub point
  + another

"""  # removing any single line or inline element (e.g. link, emphasis, ...) and error will disappear

var longText = "<!--\n" # if this is removed error disappears
for _ in 1 .. 30: # for less than 30 iterations, error disappears
  longText &= text
longText.add "\n-->"  # this can be removed and error will persist
echo markdown(longText)

Running this (nim 1.4.0, markdown #head) the program errors out without a stack trace.
If I reduce the number of iterations, or remove any line or element from text the error disappears.

The behaviour seems to be triggered by a long and fairly complex (from a parsing perspective) text inside an HTML comment tag (it is sufficient that the text starts with <!--).

Apart from this, I have to say this library is excellent, I have been using it extensively and it is the first time that it fails me (not too harmful, the workaround is simple: just split the text; it was only a bit tricky to minimize the error).
I take the opportunity to thank you for the work you did with nim-markdown and also nim-mustache, which are core dependencies of something I am working on and I am about to release (hopefully) soon: https://github.com/pietroppeter/nimib

are tables on the roadmap?

@GULPF

IMO if Nim is going to switch to markdown, there first must be a pure Nim implementation. Preferable one that implements a well specified form of markdown like CommonMark (note however that commonmark does not support anything fancy like tables).

IMO support for tables is a must, even if CommonMark doesn't support them.

[Bug] Unable to parse gfm 0.29 spec.txt

Source: https://raw.githubusercontent.com/github/cmark-gfm/master/test/spec.txt
Code:

import markdown, lists

let root = Document()
echo(markdown(readFile("/tmp/spec-01.txt"), root=root))

Weird render when you have a trailing whitespace after triple backticks

I am not even sure if this is a bug or it is according to specs (I know markdown is weird about trailing whitespaces).
Adding trailing whitespace after the closing triple backtick means the parser no longer recognizes it as the end of the code block.

this (for clarity I am using a '*' char that I later replace to whitespace):

import markdown
import std / strutils

echo markdown("""
```nim
echo "hello"
```*
""".replace('*', ' '), config=initGfmConfig())

outputs this (without the backslash which I had to add for GitHub to render it):

<pre><code class="language-nim">echo &quot;hello&quot;
\```
</code></pre>

I would expect this:

<pre><code class="language-nim">echo &quot;hello&quot;
</code></pre>

which is what you get if you remove the trailing whitespace.

Tables are not processed correctly

Markdown tables are not processed correctly. A table such as

| Month    | Savings |
|----------|---------|
| January  | $250    |
| February | $80     |
| March    | $420    |

results in the following html code

<p>| Month    | Savings |
|----------|---------|
| January  | $250    |
| February | $80     |
| March    | $420    |</p>

Inline markdown with unmodified normal text

Hi,

The code below

import markdown

let doc = "*Italic* **bold** normal"
echo markdown(doc, root = Paragraph())
echo markdown(doc, root = Inline())

gives the following output

<p><em>Italic</em> <strong>bold</strong> normal</p>

<em>Italic</em>
<strong>bold</strong>
n
o
r
m
a
l

I need the first (Paragraph) output, but without the <p> tag and with the normal word as a single line as shown below

<em>Italic</em> <strong>bold</strong> normal

How can the above output be achieved?

Thank you,
Vlad

[off-topic] [discussion] asciidoc vs nim-markdown

Didn't want to discuss this in #10 to keep each topic distinct; feel free to close if this is too off-topic :), but it may be worth discussing this at least once

Note also that @dom96 mentioned here that asciidoc could be another option for Nim documentation

while I do like asciidoc, the main argument I see against it is that markdown is far more ubiquitous and developers are more likely to be familiar with it, and it's not clear asciidoc's advantages outweigh this aspect.

since you're authoring nim-markdown maybe you have some insight/opinion on nim-markdown vs asciidoc (and @dom96 feel free to comment too)

Here are some readings I did:

advantages of asciidoc

https://asciidoctor.org/docs/asciidoc-vs-markdown/
- what truly makes AsciiDoc the right investment is that its syntax was designed to be extended as a core feature
- AsciiDoc uses a consistent formatting scheme (i.e., it has consistent patterns).
- builtin Includes syntax include::intro.adoc[] (but see #9)
- AsciiDoc offers power and flexibility without requiring the use of HTML or “flavors” for essential syntax such as tables, description lists, admonitions (tips, notes, warnings, etc.) and table of contents.
- Markdown has become a maze of different implementations, termed “flavors”, which make a universal definition evasive.
- AsciiDoc syntax was explicitly designed with the needs of publishing in mind, both print and web
https://medium.com/the-bower/markdown-considered-harmful-495ccfe24a52
https://www.red-gate.com/simple-talk/blogs/sundown-on-markdown/

advantages of markdown

the point I raised above about familiarity / popularity
https://news.ycombinator.com/item?id=9206755

AsciiDoc is a nice project, but I think that pandoc's variant of Markdown has a lot of advantages over AsciiDoc for academic writing [...]

Smartypants

Hey! One feature that would be nice to have is the smartypants extension, which automatically converts dumbquotes into smart quotes, adds en/em dashes and ellipses.

Thanks!
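
For a sense of scope, the dash and ellipsis part of such an extension can be sketched in a few lines. This is an illustration only: smartPunctuation is a hypothetical name, the input is treated as plain text (no code spans or HTML are skipped), and curly quotes are omitted because they need context to pick the opening or closing form.

```nim
import std/strutils

proc smartPunctuation(text: string): string =
  ## Dashes and ellipses only. Longer patterns are listed first so that
  ## "---" is not consumed as "--" plus a stray hyphen.
  text.multiReplace(
    ("---", "\u2014"),   # em dash
    ("--", "\u2013"),    # en dash
    ("...", "\u2026"))   # horizontal ellipsis

echo smartPunctuation("pages 3--5, and so on...")
```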

add list of working / not yet working markdown features in nim-markdown

could you add a list of working / not yet working markdown features in nim-markdown?
maybe as a list of checkboxes kept in sync with code changes eg:

image links
url links

this could be in README.md or another file status.md

I'm curious whether nim-markdown can already be used to process (a subset of) markdown files in Nim repo, if we were to start replacing some rst files with markdown, see https://github.com/nim-lang/Nim/issues/9291#issuecomment-431705555
what would you suggest for this question:
https://github.com/nim-lang/Nim/issues/9291#issuecomment-428877292

How do I write

`term`:idx:
in markdown?

Exported `toSeq` in this package robbed me of 3 hours of my life

nim-markdown/src/markdown.nim

Line 470 in a661c26

proc toSeq*(tokens: DoublyLinkedList[Token]): seq[Token] =

Please, don't export toSeq proc. I've wasted a lot of time debugging why my toSeq wasn't working.
It turns out, it's due to this issue: nim-lang/RFCs#512

I've found a somewhat dirty trick to ease the pain

import markdown except toSeq

Which is fine, but it would be better if it weren't exported in the first place. Is there any particular reason for this toSeq to be exposed?

Document an example usage for parsing markdown into an AST

... instead of converting to markdown.
(in the README)

Issue generating tables in newer versions

Hitting a weird issue with nim-markdown, where tables are not generating after I updated to 0.8.0. For example with this input:

| Header 1 | Header 2 | Header 3 | Header 4 |
| -------- | -------- | -------- | -------- |
| Cell 1   | Cell 2   | Cell 3   | Cell 4   |
| Cell 5   | Cell 6   | Cell 7   | Cell 8   |

I get this output with 0.8.0:

xps13:~$ markdown < test.txt
<p>| Header 1 | Header 2 | Header 3 | Header 4 |
| -------- | -------- | -------- | -------- |
| Cell 1   | Cell 2   | Cell 3   | Cell 4   |
| Cell 5   | Cell 6   | Cell 7   | Cell 8   |</p>

If I install 0.4.0 it goes back to working again:

xps13:~$ markdown < test.txt
<table>
<thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
...

The issue is there with #head as well, but if I download head and run tests, it seems to work fine - which is particularly weird. BTW, same issue programmatically as well as via command line.
