soasme / nim-markdown

A Beautiful Markdown Parser in the Nim World.

Home Page: https://www.soasme.com/nim-markdown/

License: MIT License

Nim 100.00%
markdown markdown-to-html markdown-parser nim nim-lang nim-language

nim-markdown's People

Contributors

benjif, haltcase, hoijui, pietroppeter, soasme, yardanico, zedeus


nim-markdown's Issues

Bad performance with larger documents

I didn't really want to just post this without also providing a good fix, but it's causing me problems, and I lack the understanding of the large amount of code in this necessary to do much to it.

It seems that this library parses and renders Markdown quite slowly, especially on larger documents. I noticed this while testing with ~10KB documents but have mostly tried testing it on ~100KB ones, for which rendering takes about 4 seconds, while a Python CommonMark implementation seems to manage 400ms and a JavaScript one ~50ms. This is proving problematic for an application I want to use this in, and while one fix might be to switch to cmark bindings, they don't provide options to customize parsing, so I would have to reimplement a lot of logic.

By switching most of the internals away from doubly linked lists, I was able to get a roughly 25% improvement (very rough, as I just ran time over it a few times with a big test file as input) but I think bigger changes would be needed to get a significant difference; I think it needs to avoid allocating as many objects (maybe use an ADT or something instead) and to handle chunks of unformatted text better. This also breaks quite a few of the test cases (mostly only spacing in lists as far as I can tell): https://gist.github.com/osmarks/c7d8db89896047368d6512f6284cd7c2

Here is a test input file (without any formatting) and profiling output from it:

out2.txt
profile_results.txt

Add footnote support

I see that this is in the roadmap. It'd be nice if footnotes were properly supported! I'm writing a small static site generator using this library and there are a couple of existing markdown posts I have that use footnotes.

At the moment they seem to turn into links that point to strange paths.

Unable to compile with threading turned on

The compiler message is:

/home/johnd/.nimble/pkgs/markdown-0.7.2/markdown.nim(2317, 6) Warning: 'applyInlineParsers' is not GC-safe as it performs an indirect call via 'inlineParser' [GcUnsafe2]

Looking into it, it makes sense. The MarkdownConfig.inlineParsers field holds a list of pointers to procedures held dynamically. In a single-threaded app, that works. I suspect, but can't confirm, the biggest problem is that the compiler can't check for GC safety because the proc references are only known at runtime.

Simply adding a {.gcsafe.} pragma might fix it if you are confident there won't be any runtime problems.

To replicate, add --threads:on to the nim compiler parameters and call markdown from a threaded-context such as a route in Jester.
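
One way to address the warning, sketched under assumptions: annotate the callback type itself as gcsafe, so the compiler checks every stored proc and accepts the indirect call. The names ParserConfig, shout and exclaim are hypothetical stand-ins, not the library's actual types or parsers.

```nim
import std/strutils

type
  # Marking the proc type as gcsafe makes the compiler verify each
  # callback stored in it, so indirect calls through it are accepted.
  InlineParser = proc (text: string): string {.nimcall, gcsafe.}

  # Hypothetical stand-in for the library's config object.
  ParserConfig = object
    inlineParsers: seq[InlineParser]

proc shout(text: string): string = text.toUpperAscii()
proc exclaim(text: string): string = text & "!"

proc applyInlineParsers(config: ParserConfig, text: string): string {.gcsafe.} =
  # Runs every registered parser in order over the text.
  result = text
  for parser in config.inlineParsers:
    result = parser(result)

var cfg = ParserConfig()
cfg.inlineParsers.add shout
cfg.inlineParsers.add exclaim
echo applyInlineParsers(cfg, "hello")
```

With this shape, a callback that touches GC'd global state is rejected when it is added to the list, rather than producing a warning at the call site.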

How to produce AST?

I've been reading through the docs and am lost trying to produce the AST described in the docs:

Document()
+-Heading(level=1)
  +-Text("H")
  +-Text("e")
  +-Text("l")
  +-Text("l")
  +-Text("o")
  ...
+-Paragraph()
  +-Text("W")
  +-Text("e")
  ...
  +-Em()
    +-Text("n")
    +-Text("i")
    +-Text("m")
    ...
  +-Text(".")
  ...

Given a string, such as # Hello World\This is a **bold** word., how would I go about generating that object? If possible, can that be more explicitly explained in the docs? This project looks to be exactly what I need so I would really appreciate the help.

[feature] API to operate on parsed markdown AST

pretty much the same motivation as described in http://pandoc.org/filters.html

How would you modify your regular expression to handle these cases? It would be hairy, to say the least. What we need is a real parser. Well, pandoc has a real markdown parser, the library function readMarkdown. This transforms markdown text to an abstract syntax tree (AST) that represents the document structure. Why not manipulate the AST directly in a short Haskell script, then convert the result back to markdown using writeMarkdown?

Just want to open a discussion on whether this could be supported, what the API would look like, etc.

use cases

enabling application writers to use nim-markdown API to write these:

  • writing linters for markdown files
  • listing urls / links
  • transform markdown file in custom ways

related

basically this is related to exposing a more modular API, as in pandoc, even if input/output is limited to markdown/html ; in particular input:markdown => output:markdown would be a natural extension (ie, markdown transformers)

Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document (an abstract syntax tree or AST), and a set of writers, which convert this native representation into a target format

using pandoc -f markdown... -t markdown... can have surprisingly useful applications. As a demo, this file is generated by
...

GFM and CommonMark compatibility

The roadmap mentions correctness, but doesn't further explain what that means. There is no consensus on what correct handling of markdown is, see for example Babelmark which compares the output of 20 different markdown implementations. Here's just one example where the issue becomes obvious.

Have you considered making this an implementation of the GFM spec? GFM is an extension of the CommonMark spec made by GitHub, which includes support for tables and several other non-standard markdown features. By making it possible to enable/disable the extensions in the API, it would also be an implementation of CommonMark itself.

Enhance post-processing

As a user, I want an enhanced post-processing API for the parsed AST, so that I can customize the parsed result to support more use cases, such as creating table of contents or adding slugs to headers, etc.

See discussion #67.

htmlEntityToUtf8 adds around 600 kb to binary size with -d:release on Windows

https://github.com/soasme/nim-markdown/blob/master/src/markdownpkg/entities.nim

Tested by manually removing its use in my local Nimble instance:

# markdown.nim
proc escapeHTMLEntity*(doc: string): string =
  var entities = doc.findAll(re"&([^;]+);")
  result = doc
  for entity in entities:
    if not IGNORED_HTML_ENTITY.contains(entity):
      let utf8Char = entity.htmlEntityToUtf8

Size of small website builder compiled with -d:release:

image

    if not IGNORED_HTML_ENTITY.contains(entity):
      let utf8Char = entity#.htmlEntityToUtf8

Same compilation settings:

image

Converting this to a constant table should save a large amount of space. A build option to turn it off might work as a temporary option though, like -d:markdownNoEntities

Update: Tried changing it to a hash table, it apparently does not save much space:

image

This makes sense because of the way case/of is optimized (case/of itself is probably faster than a hash table), but I expected it to have a bigger impact. My mistake.

What does save a little more space than that though is using an array of tuples and checking for equality every single time instead of hashes, sacrificing speed:

image

This is just a bad idea for performance. I would really rather just not have all this in my binary.

Forgot to mention this is on Nim 1.0.4.
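
The build option suggested above could look roughly like the following sketch. It only illustrates the `when defined(...)` mechanism: the entity table here is a tiny made-up subset and `entityToUtf8` is a hypothetical name, not the library's `htmlEntityToUtf8`.

```nim
const
  # Tiny illustrative subset, not the library's real entity table.
  knownEntities = [("amp", "&"), ("lt", "<"), ("gt", ">"), ("quot", "\"")]

proc entityToUtf8(name: string): string =
  ## Returns the decoded character, or the entity unchanged if unknown.
  result = "&" & name & ";"
  # Compiling with -d:markdownNoEntities removes the lookup entirely,
  # so every entity is passed through unchanged.
  when not defined(markdownNoEntities):
    for pair in knownEntities:
      if pair[0] == name:
        return pair[1]

echo entityToUtf8("amp")
echo entityToUtf8("unknown")
```

Whether the table data itself is dropped from the binary depends on the compiler and linker, so any size saving would need to be measured.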

Improve the Performance of nim-markdown

This is an umbrella issue for tracking the efforts on improving the performance of nim-markdown.

Below are potential bottlenecks:

  • Use a normal seq instead of DoublyLinkedList for the token sequence.
  • HtmlBlockParser.parse is very slow. It has nested loop runs which can be optimized. #52
  • Consume less memory/gc by assigning slices instead of string objects to tokens. #54, #55, #56
  • Pre-chop the string by lines, instead of ad-hoc splitLines.
  • Use kind: XXX, instead of object inheritance.
  • Ignore parsing content when an html comment is matched.
  • remove bottleneck since() calls. #53
  • remove bottleneck firstLine() & restLines() calls. #56
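
To illustrate two of the items above (a plain seq for children, and a `kind` field instead of object inheritance), here is a minimal sketch. The Token layout and the render proc are assumptions made for illustration, not the library's actual design.

```nim
type
  TokenKind = enum
    tkText, tkEm, tkParagraph

  # One object variant replaces a class hierarchy; children live in a
  # plain seq rather than a doubly linked list.
  Token = object
    children: seq[Token]
    case kind: TokenKind
    of tkText:
      text: string
    of tkEm, tkParagraph:
      discard

proc wrap(tag: string, children: seq[Token]): string

proc render(t: Token): string =
  # Dispatch on the kind field instead of using dynamic methods.
  case t.kind
  of tkText: result = t.text
  of tkEm: result = wrap("em", t.children)
  of tkParagraph: result = wrap("p", t.children)

proc wrap(tag: string, children: seq[Token]): string =
  result = "<" & tag & ">"
  for child in children:
    result.add render(child)
  result.add "</" & tag & ">"

let doc = Token(kind: tkParagraph, children: @[
  Token(kind: tkText, text: "Hello "),
  Token(kind: tkEm, children: @[Token(kind: tkText, text: "nim")])
])
echo render(doc)
```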

discussion: feasibility of adding nim-markdown in nim documentation pipeline?

originally asked there nim-lang/Nim#9487 but moving that specific question here:

how could we use nim-markdown for markdown=>html generation in ./koch web ? options:

  • add a nimble dependency on nim-markdown for koch (is that even feasible, or would that cause circular dependencies ?)
  • [preferred] copy nim-markdown sources to Nim/ tests/deps/ ; we already do this for these:
    jester-#head opengl-1.1.0 x11-1.0 zip-0.2.1

the copied sources would be (regularly) updated as needed from upstream nim-markdown (and not meant to be locally modified in Nim repo, only meant as a stale copy)

@soasme what do you think?

support file transclusion (including contents of a file)

basically this feature request: github/markup#346

use cases

  • including a csv file (rendering it as a table)
  • including another markdown (cf rst's include)
  • including a LICENSE.txt
  • DRY docs: allows including a version number for example
  • makes it easier to transform Nim's rst files (which have a few include) to markdown
  • embed video

syntax

as proposed here: https://talk.commonmark.org/t/transclusion-or-including-sub-documents-for-reuse/270/3

{{ my_file }} -> include the file and parse it as markdown
{{ my_file[start:end] }} -> include the lines comprised between start and end and parse them as markdown.
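
As a rough sketch of how the two forms above could be recognised, assuming the directive has already been isolated from the surrounding text. The Transclusion type and parseTransclusion proc are hypothetical and not part of nim-markdown.

```nim
import std/strutils

type
  Transclusion = object
    path: string
    hasRange: bool
    first, last: int

proc parseTransclusion(directive: string): Transclusion =
  ## Parses "{{ my_file }}" or "{{ my_file[start:end] }}".
  var body = directive.strip()
  if not (body.startsWith("{{") and body.endsWith("}}")):
    raise newException(ValueError, "not a transclusion directive: " & directive)
  # Drop the surrounding braces and any padding.
  body = body[2 ..^ 3].strip()
  let bracket = body.find('[')
  if bracket < 0 or not body.endsWith("]"):
    return Transclusion(path: body)
  # Text between '[' and ']' must be "start:end".
  let bounds = body[bracket + 1 ..^ 2].split(':')
  if bounds.len != 2:
    raise newException(ValueError, "expected [start:end] in: " & directive)
  result = Transclusion(path: body[0 ..< bracket], hasRange: true,
                        first: parseInt(bounds[0].strip()),
                        last: parseInt(bounds[1].strip()))

let whole = parseTransclusion("{{ my_file }}")
let part = parseTransclusion("{{ notes.md[3:10] }}")
echo whole.path
echo part.path, " ", part.first, " ", part.last
```

Reading the file and feeding the selected lines back into the markdown parser is left out here.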

there are alternatives that have been proposed in various markdown flavors.

Ideally, it should allow optionally specifying the file type (overriding guessing it from file extension if needed), eg: jpg, csv, md, txt, codeblock(?)

caveats

  • github doesn't seem to support this feature
    Github doesn't provide this feature even for reStructuredText (rst) which has include directive in the official language spec : github/markup#172

  • I've seen somewhere that GitHub doesn't support it because of security concerns; I'd like to understand that concern better, but it doesn't seem relevant as far as nim-markdown is concerned

  • different flavors of markdown use a different syntax for this feature

  • other syntax I've seen:

[![Watch the video](https://raw.github.com/GabLeRoux/WebMole/master/ressources/WebMole_Youtube_Video.png)](http://youtu.be/vt5fpE0bzSY)
#include "another-markdown-file.md"

links

CSV files are embedded as tables, source code files become code blocks, and embedded text files help writers structure their work:

table rendering broken maybe?

❯ nim --version
Nim Compiler Version 1.2.6 [Linux: amd64]
Compiled at 2020-07-29

import markdown

let j = """
| Header 1 | Header 2 | Header 3 | Header 4 |
| :------: | -------: | :------- | -------- |
| Cell 1   | Cell 2   | Cell 3   | Cell 4   |
| Cell 5   | Cell 6   | Cell 7   | Cell 8   |
"""

echo markdown(j)

gives

<p>| Header 1 | Header 2 | Header 3 | Header 4 |
| :------: | -------: | :------- | -------- |
| Cell 1   | Cell 2   | Cell 3   | Cell 4   |
| Cell 5   | Cell 6   | Cell 7   | Cell 8   |</p>

Paragraph continuation broken in lists

Paragraph rules (no new paragraph until repeated horizontal or vertical whitespace) are not processed correctly in lists.

An example:

import markdown

let md = """
Normal paragraph

* bullet item one

* bullet starts here
 and continues here
and this is also the bullet, not a new paragraph

This is a new paragraph though
"""

echo markdown(md, config = initGfmConfig())

Produces:

<p>Normal paragraph</p>
<ul>
<li>
<p>bullet item one</p>
</li>
<li>
<p>bullet starts here
and continues here</p>
</li>
</ul>
<p>and this is also the bullet, not a new paragraph</p>
<p>This is a new paragraph though</p>

Where the second to last paragraph lives outside the list. However, this is what GFM renders:


Normal paragraph

  • bullet item one

  • bullet starts here
    and continues here
    and this is also the bullet, not a new paragraph

This is a new paragraph though

build fails `markdown.nim(842, 81) Error: \u not allowed in character literal`

v0.5.0 build is failing with:

[david@eb ~]$ nimble install https://github.com/soasme/nim-markdown
Downloading https://github.com/soasme/nim-markdown using git
  Verifying dependencies for markdown@0.5.0
 Installing markdown@0.5.0
   Building markdown/markdown using c backend
    Prompt: Build failed for 'https://github.com/soasme/nim-markdown@0.5.0', would you like to try installing 'https://github.com/soasme/nim-markdown@#head' (latest unstable)? [y/N]
    Answer: y
Downloading https://github.com/soasme/nim-markdown using git
  Verifying dependencies for markdown@#head
 Installing markdown@#head
   Building markdown/markdown using c backend
       Tip: 5 messages have been suppressed, use --verbose to show them.
     Error: Build failed for package: markdown
        ... Details:
        ... Execution failed with exit code 1
        ... Command: "/home/david/Nim/bin/nim" c --noBabelPath -d:release -o:"/tmp/nimble_26030/githubcom_soasmenimmarkdown_#head/markdown" "/tmp/nimble_26030/githubcom_soasmenimmarkdown_#head/src/markdown.nim"
        ... Output: Hint: used config file '/home/david/Nim/config/nim.cfg' [Conf]
        ... Hint: system [Processing]
        ... Hint: markdown [Processing]
        ... Hint: re [Processing]
        ... Hint: pcre [Processing]
        ... Hint: strutils [Processing]
        ... Hint: parseutils [Processing]
        ... Hint: math [Processing]
        ... Hint: bitops [Processing]
        ... Hint: algorithm [Processing]
        ... Hint: unicode [Processing]
        ... Hint: rtarrays [Processing]
        ... Hint: strformat [Processing]
        ... Hint: macros [Processing]
        ... Hint: tables [Processing]
        ... Hint: hashes [Processing]
        ... Hint: sequtils [Processing]
        ... Hint: uri [Processing]
        ... Hint: htmlparser [Processing]
        ... Hint: streams [Processing]
        ... Hint: parsexml [Processing]
        ... Hint: lexbase [Processing]
        ... Hint: xmltree [Processing]
        ... Hint: strtabs [Processing]
        ... Hint: os [Processing]
        ... Hint: times [Processing]
        ... Hint: options [Processing]
        ... Hint: typetraits [Processing]
        ... Hint: posix [Processing]
        ... Hint: ospaths [Processing]
        ... Hint: lists [Processing]
        ... markdown.nim(842, 81) Error: \u not allowed in character literal

I've tried updating my Nim installation (I'm on the latest version now), but the result is the same.

links in lists not working

1. [foo](https://nim-lang.org)
2. [baa](https://nim-lang.org/installation)

- [foo](https://nim-lang.org)
- [baa](https://nim-lang.org/installation)

get rendered like so:

1. [foo](https://nim-lang.org)
2. [baa](https://nim-lang.org/installation)

- [foo](https://nim-lang.org)
- [baa](https://nim-lang.org/installation)

but hyperlinks should be created.

javascript support

My plan was to use nim-markdown within a Karax app (js target), but compilation fails with:

../../../bin/nim-repo/lib/impure/re.nim(100, 3) Error: undeclared identifier: 'copyMem'

It looks like js support is currently blocked by this: nim-lang/Nim#7640

Consider making `config` more strict/strongly-typed

problem

The config parameter to markdown() is currently a regex-parsed string that's prone to both user & dev error, can blow up at runtime rather than compile time, and isn't great for documentation or autocomplete. TLDR it's "stringly typed" and that's not ideal.

proc markdown*(doc: string, config: string = """
Escape: true
KeepHTML: false
"""): string =

solution

If the parameters are all flags (and they currently are) then it's probably best to use a set[enum] instead. This is a pattern I've used pretty often:

type
  MarkdownOption* {.pure.} = enum
    Escape, KeepHtml

  MarkdownOptions* = set[MarkdownOption]

const
  defaultMarkdownOptions* = {Escape}

proc markdown* (doc: string, config = defaultMarkdownOptions): string

This way if a user typos and passes Escaep they'll get a compile time error, or might even avoid it entirely because of code suggestions. If you want to go this route I'm willing to make a PR.
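
A self-contained sketch of how such a set could be consumed. renderText is a toy stand-in for markdown() that only honours the Escape flag, and the enum is left non-pure here to keep the example short.

```nim
import std/strutils

type
  MarkdownOption = enum
    Escape, KeepHtml
  MarkdownOptions = set[MarkdownOption]

const defaultMarkdownOptions: MarkdownOptions = {Escape}

proc renderText(doc: string, options = defaultMarkdownOptions): string =
  ## Toy stand-in for markdown(): escapes HTML only when Escape is set.
  if Escape in options:
    result = doc.multiReplace(("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"))
  else:
    result = doc

echo renderText("<b>bold</b>")              # default options: escaped
echo renderText("<b>bold</b>", {KeepHtml})  # Escape not set: unchanged
# renderText("x", {Escaep}) would fail at compile time: undeclared identifier
```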

alternative

If future options might be non-boolean they wouldn't be covered by this and maybe something like an object + proc combo would be needed:

type
  MarkdownConfig* = object
    escape, keepHtml: bool
    someConfigurableString: string

proc initMarkdownConfig* (
  escape = true,
  keepHtml = false,
  someConfigurableString = ""
): MarkdownConfig

It's a bit heavier and more verbose though.

Error without stacktrace when parsing long, complex document inside a Html comment tag

I ran into a weird behaviour that I was able to minimize in the following example:

import markdown

let text = """
## title

some text:
- one point, and a [link](to_here)
- two points and **emphasis**
- three points, _really_?
  + sub point
  + another

"""  # removing any single line or inline element (e.g. link, emphasis, ...) and error will disappear

var longText = "<!--\n" # if this is removed error disappears
for _ in 1 .. 30: # for less than 30 iterations, error disappears
  longText &= text
longText.add "\n-->"  # this can be removed and error will persist
echo markdown(longText)

Running this (nim 1.4.0, markdown #head) the program errors out without a stack trace.
If I reduce the number of iterations, or remove any line or element from text the error disappears.

The behaviour seems to be related to the appearance of a long and fairly complex (from a parsing perspective) text inside an HTML comment tag (it is sufficient that it starts with <!--).

Apart from this, I have to say this library is excellent, I have been using it extensively and it is the first time that it fails me (not too harmful, the workaround is simple: just split the text; it was only a bit tricky to minimize the error).
I take the opportunity to thank you for the work you did with nim-markdown and also nim-mustache, which are core dependencies of something I am working on and I am about to release (hopefully) soon: https://github.com/pietroppeter/nimib

Weird render when you have a trailing whitespace after triple backticks

I am not even sure if this is a bug or it is according to specs (I know markdown is weird about trailing whitespaces).
Adding a trailing whitespace after a closing triple backtick makes markdown fail to recognize that it ends the code block.

this (for clarity I am using a '*' char that I later replace to whitespace):

import markdown
import std / strutils

echo markdown("""
```nim
echo "hello"
```*
""".replace('*', ' '), config=initGfmConfig())

outputs this (without the backslash which I had to add for GitHub to render it):

<pre><code class="language-nim">echo &quot;hello&quot;
\```
</code></pre>

I would expect this:

<pre><code class="language-nim">echo &quot;hello&quot;
</code></pre>

which is what you get if you remove the trailing whitespace.

Tables are not processed correctly

Markdown tables are not processed correctly. A table such as

| Month    | Savings |
|----------|---------|
| January  | $250    |
| February | $80     |
| March    | $420    |

results in the following html code

<p>| Month    | Savings |
|----------|---------|
| January  | $250    |
| February | $80     |
| March    | $420    |</p>

Inline markdown with unmodified normal text

Hi,

The code below

let doc = "*Italic* **bold** normal"
echo markdown(doc, root = Paragraph())
echo markdown(doc, root = Inline())

gives the following output

<p><em>Italic</em> <strong>bold</strong> normal</p>

<em>Italic</em>
<strong>bold</strong>
n
o
r
m
a
l

I need the first (Paragraph) output, but without the <p> tag and with the normal word as a single line as shown below

<em>Italic</em> <strong>bold</strong> normal

How can the above output be achieved?

Thank you,
Vlad

[off-topic] [discussion] asciidoc vs nim-markdown

Didn't want to discuss this in #10 to keep each topic distinct; feel free to close if this is too off-topic :), but it may be worth discussing this at least once

Note also that @dom96 mentioned here that asciidoc could be another option for Nim documentation

while I do like asciidoc, the main argument I see against it is that markdown is far more ubiquitous and developers are more likely to know it and be familiar with it, and it's not clear asciidoc's advantages outweigh this aspect.

since you're authoring nim-markdown maybe you have some insight/opinion on nim-markdown vs asciidoc (and @dom96 feel free to comment too)

Here are some readings I did:

advantages of asciidoc

advantages of markdown

AsciiDoc is a nice project, but I think that pandoc's variant of Markdown has a lot of advantages over AsciiDoc for academic writing [...]

Smartypants

Hey! One feature that would be nice to have is the smartypants extension, which automatically converts dumbquotes into smart quotes, adds en/em dashes and ellipses.

Thanks!

add list of working / not yet working markdown features in nim-markdown

  • could you add a list of working / not yet working markdown features in nim-markdown?
    maybe as a list of checkboxes kept in sync with code changes eg:
  • image links
  • url links

this could be in README.md or another file status.md

How do I write

`term`:idx:
in markdown?

Exported `toSeq` in this package robbed me of 3 hours of my life

proc toSeq*(tokens: DoublyLinkedList[Token]): seq[Token] =

Please, don't export toSeq proc. I've wasted a lot of time debugging why my toSeq wasn't working.
It turns out, it's due to this issue: nim-lang/RFCs#512

I've found a somewhat dirty trick to ease the pain

import markdown except toSeq

Which is fine, but it would be better if it weren't exported in the first place. Is there any particular reason for this toSeq to be exposed?

Issue generating tables in newer versions

Hitting a weird issue with nim-markdown, where tables are not generating after I updated to 0.8.0. For example with this input:

| Header 1 | Header 2 | Header 3 | Header 4 |
| -------- | -------- | -------- | -------- |
| Cell 1   | Cell 2   | Cell 3   | Cell 4   |
| Cell 5   | Cell 6   | Cell 7   | Cell 8   |

I get this output with 0.8.0:

xps13:~$ markdown < test.txt
<p>| Header 1 | Header 2 | Header 3 | Header 4 |
| -------- | -------- | -------- | -------- |
| Cell 1   | Cell 2   | Cell 3   | Cell 4   |
| Cell 5   | Cell 6   | Cell 7   | Cell 8   |</p>

If I install 0.4.0 it goes back to working again:

xps13:~$ markdown < test.txt
<table>
<thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
...

The issue is there with #head as well, but if I download head and run tests, it seems to work fine - which is particularly weird. BTW, same issue programmatically as well as via command line.
