soasme / nim-markdown
A Beautiful Markdown Parser in the Nim World.
Home Page: https://www.soasme.com/nim-markdown/
License: MIT License
I didn't really want to just post this without also providing a good fix, but it's causing me problems, and I lack the understanding of the large amount of code in this necessary to do much to it.
It seems that this library parses and renders Markdown quite slowly, especially on larger documents. I noticed this while testing with ~10KB documents but have mostly tried testing it on ~100KB ones, for which rendering takes about 4 seconds, while a Python CommonMark implementation seems to manage 400ms and a JavaScript one ~50ms. This is proving problematic for an application I want to use this in, and while one fix might be to switch to cmark bindings, they don't provide options to customize parsing, so I would have to reimplement a lot of logic.
By switching most of the internals away from doubly linked lists, I was able to get a roughly 25% improvement (very rough, as I just ran time over it a few times with a big test file as input), but I think bigger changes would be needed to get a significant difference; I think it needs to avoid allocating as many objects (maybe use an ADT or something instead) and to handle chunks of unformatted text better. This also breaks quite a few of the test cases (mostly only spacing in lists as far as I can tell): https://gist.github.com/osmarks/c7d8db89896047368d6512f6284cd7c2
Here is a test input file (without any formatting) and profiling output from it:
I see that this is in the roadmap. It'd be nice if footnotes were properly supported! I'm writing a small static site generator using this library and there are a couple of existing markdown posts I have that use footnotes.
At the moment they seem to turn into links that point to strange paths.
The compiler message is:
/home/johnd/.nimble/pkgs/markdown-0.7.2/markdown.nim(2317, 6) Warning: 'applyInlineParsers' is not GC-safe as it performs an indirect call via 'inlineParser' [GcUnsafe2]
Looking into it, it makes sense. The MarkdownConfig.inlineParsers field holds a list of procedure pointers that are registered dynamically. In a single-threaded app, that works. I suspect, but can't confirm, that the biggest problem is that the compiler can't check for GC safety because the proc references are only known at runtime.
Simply adding a {.gcsafe.} pragma might fix it if you are confident there won't be any runtime problems.
To replicate, add --threads:on to the nim compiler parameters and call markdown from a threaded context such as a route in Jester.
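One way to let the compiler check GC safety across an indirect call is to put the gcsafe annotation on the procedure type itself, so every registered parser is verified when it is assigned. The following is a minimal, self-contained sketch of that idea; the InlineParser type and the two toy parsers are hypothetical and are not the library's actual definitions.

```nim
import std/strutils

type
  # Hypothetical parser type: the gcsafe annotation on the type is the point
  # of this sketch, it is not copied from nim-markdown.
  InlineParser = proc (text: string): string {.nimcall, gcsafe.}

proc shout(text: string): string = text.toUpperAscii()
proc exclaim(text: string): string = text & "!"

proc applyInlineParsers(parsers: seq[InlineParser], text: string): string {.gcsafe.} =
  result = text
  for parser in parsers:
    # Indirect call, but the proc type already guarantees GC safety.
    result = parser(result)

# Each assignment is checked against the gcsafe proc type.
let first: InlineParser = shout
let second: InlineParser = exclaim
echo applyInlineParsers(@[first, second], "hello")
```

With this shape, a parser that touches GC-managed global state would be rejected at the assignment rather than producing a warning at the call site.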
Hi, I am not able to find a working link to the API docs (which I see are committed in the repo at https://github.com/soasme/nim-markdown/tree/master/docs/htmldocs). Currently the only link I find is https://www.soasme.com/nim-markdown/, which does not contain API documentation.
When installing, it reports
Warning: method has lock level <unknown>, but another method has 0
I've been reading through the docs and am lost trying to produce the AST described in the docs:
Document()
+-Heading(level=1)
  +-Text("H")
  +-Text("e")
  +-Text("l")
  +-Text("l")
  +-Text("o")
  ...
+-Paragraph()
  +-Text("W")
  +-Text("e")
  ...
  +-Em()
    +-Text("n")
    +-Text("i")
    +-Text("m")
    ...
  +-Text(".")
...
Given a string, such as # Hello World\This is a **bold** word., how would I go about generating that object? If possible, can that be more explicitly explained in the docs? This project looks to be exactly what I need so I would really appreciate the help.
pretty much the same motivation as described in http://pandoc.org/filters.html
How would you modify your regular expression to handle these cases? It would be hairy, to say the least. What we need is a real parser. Well, pandoc has a real markdown parser, the library function readMarkdown. This transforms markdown text to an abstract syntax tree (AST) that represents the document structure. Why not manipulate the AST directly in a short Haskell script, then convert the result back to markdown using writeMarkdown?
Just want to open a discussion on whether this could be supported, what the API would look like, etc.
enabling application writers to use nim-markdown API to write these:
basically this is related to exposing a more modular API, as in pandoc, even if input/output is limited to markdown/html ; in particular input:markdown => output:markdown would be a natural extension (ie, markdown transformers)
Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document (an abstract syntax tree or AST), and a set of writers, which convert this native representation into a target format
jgm/pandocfilters: A python module for writing pandoc filters, with a collection of examples
https://github.com/jgm/pandoc/wiki/Pandoc-Tricks#from-markdown-to-markdown
using pandoc -f markdown... -t markdown... can have surprisingly useful applications. As a demo, this file is generated by
...
The roadmap mentions correctness, but doesn't further explain what that means. There is no consensus on what correct handling of markdown is, see for example Babelmark which compares the output of 20 different markdown implementations. Here's just one example where the issue becomes obvious.
Have you considered making this an implementation of the GFM spec? GFM is an extension of the CommonMark spec made by GitHub, which includes support for tables and several other non-standard markdown features. By making it possible to enable/disable the extensions in the API, it would also be an implementation of CommonMark itself.
As a user, I want an enhanced post-processing API for the parsed AST, so that I can customize the parsed result to support more use cases, such as creating table of contents or adding slugs to headers, etc.
See discussion #67.
https://github.com/soasme/nim-markdown/blob/master/src/markdownpkg/entities.nim
Tested by manually removing its use in my local Nimble instance:
# markdown.nim
proc escapeHTMLEntity*(doc: string): string =
  var entities = doc.findAll(re"&([^;]+);")
  result = doc
  for entity in entities:
    if not IGNORED_HTML_ENTITY.contains(entity):
      let utf8Char = entity.htmlEntityToUtf8
Size of small website builder compiled with -d:release:
    if not IGNORED_HTML_ENTITY.contains(entity):
      let utf8Char = entity#.htmlEntityToUtf8
Same compilation settings:
Converting this to a constant table should save a large amount of space. A build option to turn it off might work as a temporary option though, like -d:markdownNoEntities
Update: Tried changing it to a hash table, it apparently does not save much space:
This makes sense because of the way case/of is optimized (case/of itself is probably faster than a hash table), but I expected it to have a bigger impact. My mistake.
What does save a little more space than that though is using an array of tuples and checking for equality every single time instead of hashes, sacrificing speed:
This is just a bad idea for performance. I would really rather just not have all this in my binary.
Forgot to mention this is on Nim 1.0.4.
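The build-option idea can be sketched with a compile-time define. This is only an illustration: markdownNoEntities is the flag name proposed in this issue and does not exist in the library as far as this thread shows, lookupEntity is a hypothetical helper, and the three-entry table stands in for the much larger real one.

```nim
import std/tables

when defined(markdownNoEntities):
  # Entity decoding is compiled out: references pass through untouched,
  # and no table ends up in the binary.
  proc lookupEntity(name: string): string = "&" & name & ";"
else:
  # Tiny illustrative subset built at compile time.
  const entityTable = {"amp": "&", "lt": "<", "gt": ">"}.toTable

  proc lookupEntity(name: string): string =
    # Unknown names fall back to the original reference text.
    entityTable.getOrDefault(name, "&" & name & ";")

echo lookupEntity("amp")
echo lookupEntity("unknown")
```

Compiling with -d:markdownNoEntities would select the first branch, so the table is never instantiated.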
This is an umbrella issue for tracking the efforts on improving the performance of nim-markdown.
Below are potential bottlenecks:
since() calls. #53
firstLine() & restLines() calls. #56
originally asked there nim-lang/Nim#9487 but moving that specific question here:
how could we use nim-markdown for markdown=>html generation in ./koch web ? options:
the copied sources would be (regularly) updated as needed from upstream nim-markdown (and not meant to be locally modified in Nim repo, only meant as a stale copy)
@soasme what do you think?
import markdown
echo markdown("* ~~da~~")
Output:
<ul>
<li>~~da~~</li>
</ul>
Support parsing text like - [x] todo
to
basically this feature request: github/markup#346
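As a rough illustration of what recognizing such an item involves, here is a small sketch based on the GFM task list rule (a bracket pair containing a space or an x, followed by whitespace). The TaskItem type and parseTaskItem are hypothetical names, and the input is assumed to be the list item content with the bullet already removed.

```nim
import std/strutils

type
  TaskItem = object
    isTask: bool    # true when the item starts with a task marker
    checked: bool   # true for [x] or [X]
    text: string    # remaining item text

proc parseTaskItem(item: string): TaskItem =
  ## `item` is the content of a list item after the bullet, e.g. "[x] todo".
  if item.len >= 4 and item[0] == '[' and item[2] == ']' and
      item[1] in {' ', 'x', 'X'} and item[3] in {' ', '\t'}:
    TaskItem(isTask: true, checked: item[1] != ' ', text: item[4..^1].strip())
  else:
    TaskItem(isTask: false, checked: false, text: item)

echo parseTaskItem("[x] todo")
echo parseTaskItem("[ ] later")
echo parseTaskItem("plain item")
```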
include) to markdown
as proposed here: https://talk.commonmark.org/t/transclusion-or-including-sub-documents-for-reuse/270/3
{{ my_file }} -> include the file and parse it as markdown
{{ my_file[start:end] }} -> include the lines comprised between start and end and parse them as markdown.
there are alternatives that have been proposed in various markdown flavors.
Ideally, it should allow optionally specifying the file type (overriding guessing it from file extension if needed), eg: jpg, csv, md, txt, codeblock(?)
github doesn't seem to support this feature
Github doesn't provide this feature even for reStructuredText (rst) which has include directive in the official language spec : github/markup#172
I've seen somewhere that github doesn't support it because of security concerns; however, I'd like to understand that concern better. It doesn't seem relevant as far as nim-markdown is concerned
different flavors of markdown use a different syntax for this feature
other syntax I've seen:
[![Watch the video](https://raw.github.com/GabLeRoux/WebMole/master/ressources/WebMole_Youtube_Video.png)](http://youtu.be/vt5fpE0bzSY)
#include "another-markdown-file.md"
CSV files are embedded as tables, source code files become code blocks, and embedded text files help writers structure their work:
❯ nim --version
Nim Compiler Version 1.2.6 [Linux: amd64]
Compiled at 2020-07-29
import markdown
let j = """
| Header 1 | Header 2 | Header 3 | Header 4 |
| :------: | -------: | :------- | -------- |
| Cell 1 | Cell 2 | Cell 3 | Cell 4 |
| Cell 5 | Cell 6 | Cell 7 | Cell 8 |
"""
echo markdown(j)
gives
<p>| Header 1 | Header 2 | Header 3 | Header 4 |
| :------: | -------: | :------- | -------- |
| Cell 1 | Cell 2 | Cell 3 | Cell 4 |
| Cell 5 | Cell 6 | Cell 7 | Cell 8 |</p>
Paragraph rules (no new paragraph until repeated horizontal or vertical whitespace) are not processed correctly in lists.
An example:
import markdown
let md = """
Normal paragraph
* bullet item one
* bullet starts here
and continues here
and this is also the bullet, not a new paragraph
This is a new paragraph though
"""
echo markdown(md, config = initGfmConfig())
Produces:
<p>Normal paragraph</p>
<ul>
<li>
<p>bullet item one</p>
</li>
<li>
<p>bullet starts here
and continues here</p>
</li>
</ul>
<p>and this is also the bullet, not a new paragraph</p>
<p>This is a new paragraph though</p>
Where the second to last paragraph lives outside the list. However, this is what GFM renders:
Normal paragraph
bullet item one
bullet starts here
and continues here
and this is also the bullet, not a new paragraph
This is a new paragraph though
v0.5.0 build is failing with:
[david@eb ~]$ nimble install https://github.com/soasme/nim-markdown
Downloading https://github.com/soasme/nim-markdown using git
Verifying dependencies for markdown@0.5.0
Installing markdown@0.5.0
Building markdown/markdown using c backend
Prompt: Build failed for 'https://github.com/soasme/nim-markdown@0.5.0', would you like to try installing 'https://github.com/soasme/nim-markdown@#head' (latest unstable)? [y/N]
Answer: y
Downloading https://github.com/soasme/nim-markdown using git
Verifying dependencies for markdown@#head
Installing markdown@#head
Building markdown/markdown using c backend
Tip: 5 messages have been suppressed, use --verbose to show them.
Error: Build failed for package: markdown
... Details:
... Execution failed with exit code 1
... Command: "/home/david/Nim/bin/nim" c --noBabelPath -d:release -o:"/tmp/nimble_26030/githubcom_soasmenimmarkdown_#head/markdown" "/tmp/nimble_26030/githubcom_soasmenimmarkdown_#head/src/markdown.nim"
... Output: Hint: used config file '/home/david/Nim/config/nim.cfg' [Conf]
... Hint: system [Processing]
... Hint: markdown [Processing]
... Hint: re [Processing]
... Hint: pcre [Processing]
... Hint: strutils [Processing]
... Hint: parseutils [Processing]
... Hint: math [Processing]
... Hint: bitops [Processing]
... Hint: algorithm [Processing]
... Hint: unicode [Processing]
... Hint: rtarrays [Processing]
... Hint: strformat [Processing]
... Hint: macros [Processing]
... Hint: tables [Processing]
... Hint: hashes [Processing]
... Hint: sequtils [Processing]
... Hint: uri [Processing]
... Hint: htmlparser [Processing]
... Hint: streams [Processing]
... Hint: parsexml [Processing]
... Hint: lexbase [Processing]
... Hint: xmltree [Processing]
... Hint: strtabs [Processing]
... Hint: os [Processing]
... Hint: times [Processing]
... Hint: options [Processing]
... Hint: typetraits [Processing]
... Hint: posix [Processing]
... Hint: ospaths [Processing]
... Hint: lists [Processing]
... markdown.nim(842, 81) Error: \u not allowed in character literal
I've tried updating my Nim installation (now on the latest version), but the result is the same.
1. [foo](https://nim-lang.org)
2. [baa](https://nim-lang.org/installation)
- [foo](https://nim-lang.org)
- [baa](https://nim-lang.org/installation)
get rendered like so:
1. [foo](https://nim-lang.org)
2. [baa](https://nim-lang.org/installation)
- [foo](https://nim-lang.org)
- [baa](https://nim-lang.org/installation)
but hyperlinks should be created.
My plan was to use nim-markdown within a Karax app (js target), but compilation fails with:
../../../bin/nim-repo/lib/impure/re.nim(100, 3) Error: undeclared identifier: 'copyMem'
It looks like js support is currently blocked by this: nim-lang/Nim#7640
Any plans on reversing from HTML => Markdown?
The config parameter to markdown() is currently a regex-parsed string that's prone to both user & dev error, can blow up at runtime rather than compile time, and isn't great for documentation or autocomplete. TLDR it's "stringly typed" and that's not ideal.
Lines 865 to 868 in 9348402
If the parameters are all flags (and they currently are) then it's probably best to use a set[enum] instead. This is a pattern I've used pretty often:
type
  MarkdownOption* {.pure.} = enum
    Escape, KeepHtml
  MarkdownOptions* = set[MarkdownOption]
const
  defaultMarkdownOptions* = {Escape}
proc markdown* (doc: string, config = defaultMarkdownOptions): string
This way if a user typos and passes Escaep they'll get a compile time error, or might even avoid it entirely because of code suggestions. If you want to go this route I'm willing to make a PR.
If future options might be non-boolean they wouldn't be covered by this and maybe something like an object + proc combo would be needed:
type
  MarkdownConfig* = object
    escape, keepHtml: bool
    someConfigurableString: string
proc initMarkdownConfig* (
  escape = true,
  keepHtml = false,
  someConfigurableString = ""
): MarkdownConfig
It's a bit heavier and more verbose though.
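To show how the set[enum] variant reads at the call site, here is a small self-contained sketch. The describe proc is a hypothetical stand-in for the real renderer and only reports which flags are active; the type names mirror the proposal above rather than anything in the library.

```nim
type
  MarkdownOption {.pure.} = enum
    Escape, KeepHtml
  MarkdownOptions = set[MarkdownOption]

const defaultMarkdownOptions: MarkdownOptions = {MarkdownOption.Escape}

# Hypothetical stand-in for markdown(): lists the enabled options.
proc describe(config: MarkdownOptions = defaultMarkdownOptions): string =
  for opt in config:          # iterates in enum declaration order
    if result.len > 0:
      result.add ", "
    result.add $opt

echo describe()
echo describe({MarkdownOption.KeepHtml, MarkdownOption.Escape})
# describe({MarkdownOption.Escaep}) would be rejected at compile time.
```

Membership tests such as MarkdownOption.KeepHtml in config replace the string parsing entirely.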
Commonmark https://spec.commonmark.org/0.29/ was released recently. This issue is for tracking the progress of cmark 0.29.0 support.
full context: https://github.com/nim-lang/Nim/issues/9291#issuecomment-432351178
it's useful for editors that automatically trim trailing spaces
I ran into a weird behaviour that I was able to minimize in the following example:
import markdown
let text = """
## title
some text:
- one point, and a [link](to_here)
- two points and **emphasis**
- three points, _really_?
+ sub point
+ another
""" # removing any single line or inline element (e.g. link, emphasis, ...) and error will disappear
var longText = "<!--\n" # if this is removed error disappears
for _ in 1 .. 30: # for less than 30 iterations, error disappears
  longText &= text
longText.add "\n-->" # this can be removed and error will persist
echo markdown(longText)
Running this (nim 1.4.0, markdown #head) the program errors out without a stack trace.
If I reduce the number of iterations, or remove any line or element from text, the error disappears.
The behaviour seems to be related to the appearance of a long and fairly complex (from a parsing perspective) text inside an HTML comment tag (it is sufficient that it starts with <!--).
Apart from this, I have to say this library is excellent, I have been using it extensively and it is the first time that it fails me (not too harmful, the workaround is simple: just split the text; it was only a bit tricky to minimize the error).
I take the opportunity to thank you for the work you did with nim-markdown and also nim-mustache, which are core dependencies of something I am working on and I am about to release (hopefully) soon: https://github.com/pietroppeter/nimib
See also: https://github.com/nim-lang/Nim/issues/9291#issuecomment-431698702
IMO if Nim is going to switch to markdown, there first must be a pure Nim implementation. Preferable one that implements a well specified form of markdown like CommonMark (note however that commonmark does not support anything fancy like tables).
IMO support for tables is a must, even if CommonMark doesn't support them.
Source: https://raw.githubusercontent.com/github/cmark-gfm/master/test/spec.txt
Code:
import markdown, lists
let root = Document()
echo(markdown(readFile("/tmp/spec-01.txt"), root=root))
I am not even sure if this is a bug or it is according to specs (I know markdown is weird about trailing whitespaces).
Adding trailing whitespace after a closing triple backtick makes markdown fail to recognize that it ends the code block.
this (for clarity I am using a '*' char that I later replace to whitespace):
import markdown
import std / strutils
echo markdown("""
```nim
echo "hello"
```*
""".replace('*', ' '), config=initGfmConfig())
outputs this (without the backslash which I had to add for GitHub to render it):
<pre><code class="language-nim">echo "hello"
\```
</code></pre>
I would expect this:
<pre><code class="language-nim">echo "hello"
</code></pre>
which is what you get if you remove the trailing whitespace.
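For reference, the CommonMark spec allows a closing code fence to be indented by up to three spaces and to be followed only by spaces or tabs, so the expected output above matches the spec. A minimal sketch of such a check follows; isClosingFence is a hypothetical helper written for illustration, not the library's code.

```nim
import std/strutils

proc isClosingFence(line: string, fenceChar = '`', minLen = 3): bool =
  ## True when `line` closes a fenced block opened with at least `minLen`
  ## repetitions of `fenceChar`.
  # Trailing spaces and tabs are allowed after the fence.
  let stripped = line.strip(leading = false, trailing = true, chars = {' ', '\t'})
  var i = 0
  # Up to three spaces of indentation are allowed before the fence.
  while i < stripped.len and i < 3 and stripped[i] == ' ':
    inc i
  let start = i
  while i < stripped.len and stripped[i] == fenceChar:
    inc i
  # Nothing but fence characters may remain, and there must be enough of them.
  result = i == stripped.len and i - start >= minLen

echo isClosingFence("``` ")
echo isClosingFence("``` nim")
```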
Markdown tables are not processed correctly. A table such as
| Month | Savings |
|----------|---------|
| January | $250 |
| February | $80 |
| March | $420 |
results in the following html code
<p>| Month | Savings |
|----------|---------|
| January | $250 |
| February | $80 |
| March | $420 |</p>
Hi,
The code below
let doc = "*Italic* **bold** normal"
echo markdown(doc, root = Paragraph())
echo markdown(doc, root = Inline())
gives the following output
<p><em>Italic</em> <strong>bold</strong> normal</p>
<em>Italic</em>
<strong>bold</strong>
n
o
r
m
a
l
I need the first (Paragraph) output, but without the <p> tag and with the normal word on a single line, as shown below
<em>Italic</em> <strong>bold</strong> normal
How can the above output be achieved?
Thank you,
Vlad
Didn't want to discuss this in #10 to keep each topic distinct; feel free to close if this is too off-topic :), but it may be worth discussing this at least once
Note also that @dom96 mentioned here that asciidoc could be another option for Nim documentation
while I do like asciidoc, the main argument I see against it is that markdown is far more ubiquitous and developers are more likely to know it and be familiar with it, and it's not clear asciidoc's advantages outweigh this aspect.
since you're authoring nim-markdown maybe you have some insight/opinion on nim-markdown vs asciidoc (and @dom96 feel free to comment too)
Here are some readings I did:
https://asciidoctor.org/docs/asciidoc-vs-markdown/
what truly makes AsciiDoc the right investment is that its syntax was designed to be extended as a core feature
AsciiDoc uses a consistent formatting scheme (i.e., it has consistent patterns).
builtin Includes syntax include::intro.adoc[] (but see #9)
AsciiDoc offers power and flexibility without requiring the use of HTML or “flavors” for essential syntax such as tables, description lists, admonitions (tips, notes, warnings, etc.) and table of contents.
Markdown has become a maze of different implementations, termed “flavors”, which make a universal definition evasive.
AsciiDoc syntax was explicitly designed with the needs of publishing in mind, both print and web
https://medium.com/the-bower/markdown-considered-harmful-495ccfe24a52
https://www.red-gate.com/simple-talk/blogs/sundown-on-markdown/
the point I raised above about familiarity / popularity
AsciiDoc is a nice project, but I think that pandoc's variant of Markdown has a lot of advantages over AsciiDoc for academic writing [...]
Hey! One feature that would be nice to have is the smartypants extension, which automatically converts dumbquotes into smart quotes, adds en/em dashes and ellipses.
Thanks!
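A very rough sketch of the dash and ellipsis part of such an extension, using only the standard library. It covers plain text only (a real extension would have to skip code spans and HTML), does not handle quotes, and smartPunctuation is a hypothetical name.

```nim
import std/strutils

proc smartPunctuation(text: string): string =
  # multiReplace works in a single pass and prefers earlier pairs,
  # so "---" must be listed before "--".
  text.multiReplace(
    ("---", "\u2014"),  # em dash
    ("--", "\u2013"),   # en dash
    ("...", "\u2026"))  # ellipsis

echo smartPunctuation("pages 1--3 --- and so on...")
```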
this could be in README.md or another file status.md
I'm curious whether nim-markdown can already be used to process (a subset of) markdown files in Nim repo, if we were to start replacing some rst files with markdown, see https://github.com/nim-lang/Nim/issues/9291#issuecomment-431705555
what would you suggest for this question:
https://github.com/nim-lang/Nim/issues/9291#issuecomment-428877292
How do I write `term`:idx: in markdown?
Line 470 in a661c26
Please, don't export the toSeq proc. I've wasted a lot of time debugging why my toSeq wasn't working.
It turns out, it's due to this issue: nim-lang/RFCs#512
I've found a somewhat dirty trick to ease the pain
import markdown except toSeq
Which is fine, but it would be better if it wasn't exported in the first place. Is there any particular reason for this toSeq to be exposed?
... instead of converting to markdown.
(in the README)
Hitting a weird issue with nim-markdown, where tables are not generating after I updated to 0.8.0. For example with this input:
| Header 1 | Header 2 | Header 3 | Header 4 |
| -------- | -------- | -------- | -------- |
| Cell 1 | Cell 2 | Cell 3 | Cell 4 |
| Cell 5 | Cell 6 | Cell 7 | Cell 8 |
I get this output with 0.8.0:
xps13:~$ markdown < test.txt
<p>| Header 1 | Header 2 | Header 3 | Header 4 |
| -------- | -------- | -------- | -------- |
| Cell 1 | Cell 2 | Cell 3 | Cell 4 |
| Cell 5 | Cell 6 | Cell 7 | Cell 8 |</p>
If I install 0.4.0 it goes back to working again:
xps13:~$ markdown < test.txt
<table>
<thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
...
The issue is there with #head as well, but if I download head and run tests, it seems to work fine - which is particularly weird. BTW, same issue programmatically as well as via command line.