C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.

License: MIT License

MD4C Readme

Home: http://github.com/mity/md4c
Wiki: http://github.com/mity/md4c/wiki
Issue tracker: http://github.com/mity/md4c/issues

MD4C stands for "Markdown for C" and that's exactly what this project is about.

What is Markdown

In short, Markdown is the markup language this README.md file is written in.

The following resources can explain more if you are unfamiliar with it:

What is MD4C

MD4C is a Markdown parser implementation in C, with the following features:

Compliance: Generally, MD4C aims to be compliant to the latest version of CommonMark specification. Currently, we are fully compliant to CommonMark 0.31.
Extensions: MD4C supports some commonly requested and accepted extensions. See below.
Performance: MD4C is very fast.
Compactness: MD4C parser is implemented in one source file and one header file. There are no dependencies other than standard C library.
Embedding: MD4C parser is easy to reuse in other projects, its API is very straightforward: There is actually just one function, md_parse().
Push model: MD4C parses the complete document and calls a few callback functions provided by the application to inform it about the start/end of every block, the start/end of every span, and any textual contents.
Portability: MD4C builds and works on Windows and POSIX-compliant OSes. (It should be simple to make it run also on most other platforms, at least as long as the platform provides C standard library, including a heap memory management.)
Encoding: MD4C by default expects UTF-8 encoding of the input document. But it can be compiled to recognize ASCII-only control characters (i.e. to disable all Unicode-specific code), or (on Windows) to expect UTF-16 (i.e. what is on Windows commonly called just "Unicode"). See more details below.
Permissive license: MD4C is available under the MIT license.

Using MD4C

Parsing Markdown

If you need just to parse a Markdown document, you need to include md4c.h and link against MD4C library (-lmd4c); or alternatively add md4c.[hc] directly to your code base as the parser is only implemented in the single C source file.

The main provided function is md_parse(). It takes a text in the Markdown syntax and a pointer to a structure which provides pointers to several callback functions.

As md_parse() processes the input, it calls the callbacks (when entering or leaving any Markdown block or span; and when outputting any textual content of the document), allowing application to convert it into another format or render it onto the screen.

Converting to HTML

If you need to convert Markdown to HTML, include md4c-html.h and link against MD4C-HTML library (-lmd4c-html); or alternatively add the sources md4c.[hc], md4c-html.[hc] and entity.[hc] into your code base.

To convert a Markdown input, call md_html() function. It takes the Markdown input and calls the provided callback function. The callback is fed with chunks of the HTML output. Typical callback implementation just appends the chunks into a buffer or writes them to a file.
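The "append the chunks into a buffer" approach can be sketched as below. This is a minimal illustration rather than code from MD4C: the `html_buffer` type and the `append_chunk` name are made up for the example, and the callback parameters use plain `char` and `unsigned` on the assumption that MD4C is built in its default UTF-8 configuration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Growable output buffer (hypothetical helper type, not part of MD4C). */
struct html_buffer {
    char*  data;
    size_t size;
    size_t capacity;
    int    failed;      /* set to 1 when an allocation fails */
};

/* Output callback: receives one chunk of HTML and appends it to the
 * buffer passed through the user data pointer. */
static void
append_chunk(const char* chunk, unsigned chunk_size, void* userdata)
{
    struct html_buffer* buf = (struct html_buffer*) userdata;

    assert(buf != NULL);
    if(buf->failed || chunk_size == 0)
        return;

    if(buf->size + chunk_size > buf->capacity) {
        size_t new_capacity = (buf->capacity > 0) ? buf->capacity : 256;
        char* new_data;

        /* Double the capacity until the new chunk fits. */
        while(new_capacity < buf->size + chunk_size)
            new_capacity *= 2;

        new_data = (char*) realloc(buf->data, new_capacity);
        if(new_data == NULL) {
            buf->failed = 1;
            return;
        }
        buf->data = new_data;
        buf->capacity = new_capacity;
    }

    memcpy(buf->data + buf->size, chunk, chunk_size);
    buf->size += chunk_size;
}
```

With md4c-html.h included, such a callback would be handed to md_html() together with a pointer to a zero-initialized `struct html_buffer` as the user data; the caller checks `failed` afterwards and frees `data` when done.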

Markdown Extensions

The default behavior is to recognize only Markdown syntax defined by the CommonMark specification.

However, with appropriate flags, the behavior can be tuned to enable some extensions:

With the flag MD_FLAG_COLLAPSEWHITESPACE, a non-trivial whitespace is collapsed into a single space.
With the flag MD_FLAG_TABLES, GitHub-style tables are supported.
With the flag MD_FLAG_TASKLISTS, GitHub-style task lists are supported.
With the flag MD_FLAG_STRIKETHROUGH, strike-through spans are enabled (text enclosed in tilde marks, e.g. ~foo bar~).
With the flag MD_FLAG_PERMISSIVEURLAUTOLINKS permissive URL autolinks (not enclosed in < and >) are supported.
With the flag MD_FLAG_PERMISSIVEEMAILAUTOLINKS, permissive e-mail autolinks (not enclosed in < and >) are supported.
With the flag MD_FLAG_PERMISSIVEWWWAUTOLINKS permissive WWW autolinks without any scheme specified (e.g. www.example.com) are supported. MD4C then assumes http: scheme.
With the flag MD_FLAG_LATEXMATHSPANS LaTeX math spans ( $...$ ) and LaTeX display math spans ($$...$$) are supported. (Note though that the HTML renderer outputs them verbatim in a custom tag <x-equation>.)
With the flag MD_FLAG_WIKILINKS, wiki-style links ([[link label]] and [[target article|link label]]) are supported. (Note that the HTML renderer outputs them in a custom tag <x-wikilink>.)
With the flag MD_FLAG_UNDERLINE, underscore (_) denotes an underline instead of an ordinary emphasis or strong emphasis.

A few features of CommonMark (those some people see as mis-features) may be disabled with the following flags:

With the flag MD_FLAG_NOHTMLSPANS or MD_FLAG_NOHTMLBLOCKS, raw inline HTML or raw HTML blocks respectively are disabled.
With the flag MD_FLAG_NOINDENTEDCODEBLOCKS, indented code blocks are disabled.

Input/Output Encoding

The CommonMark specification declares that any sequence of Unicode code points is a valid CommonMark document.

But, on closer inspection, Unicode plays a role in only a few very specific situations when parsing Markdown documents:

For detection of word boundaries when processing emphasis and strong emphasis, some classification of Unicode characters (whether it is a whitespace or a punctuation) is needed.
For (case-insensitive) matching of a link reference label with the corresponding link reference definition, Unicode case folding is used.
For translating HTML entities (e.g. &amp;) and numeric character references (e.g. &#35; or &#xcab;) into their Unicode equivalents.

However, note that MD4C leaves this translation to the renderer/application, as the renderer is the one that knows the output encoding and whether it needs to perform this kind of translation at all. (For example, when the renderer outputs HTML, it may leave the entities untranslated and defer the work to a web browser.)
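A renderer that does want to translate a numeric character reference itself, and that outputs UTF-8, needs little more than a code point encoder. The sketch below is an illustration only; `encode_utf8` is a hypothetical helper name and not part of MD4C.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out[], which must have room
 * for 4 bytes. Returns the number of bytes written, or 0 when the code
 * point is a surrogate or lies above U+10FFFF. */
static size_t
encode_utf8(unsigned long codepoint, unsigned char* out)
{
    assert(out != NULL);

    if(codepoint < 0x80) {
        out[0] = (unsigned char) codepoint;
        return 1;
    }
    if(codepoint < 0x800) {
        out[0] = (unsigned char) (0xC0 | (codepoint >> 6));
        out[1] = (unsigned char) (0x80 | (codepoint & 0x3F));
        return 2;
    }
    if(codepoint >= 0xD800 && codepoint <= 0xDFFF)
        return 0;   /* surrogates are not valid scalar values */
    if(codepoint < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (codepoint >> 12));
        out[1] = (unsigned char) (0x80 | ((codepoint >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (codepoint & 0x3F));
        return 3;
    }
    if(codepoint <= 0x10FFFF) {
        out[0] = (unsigned char) (0xF0 | (codepoint >> 18));
        out[1] = (unsigned char) (0x80 | ((codepoint >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((codepoint >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (codepoint & 0x3F));
        return 4;
    }
    return 0;
}
```

For instance, code point 35 encodes to the single byte `#`, and U+0CAB (ಫ) encodes to the three bytes E0 B2 AB.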

MD4C relies on this property of the CommonMark and the implementation is, to a large degree, encoding-agnostic. Most of MD4C code only assumes that the encoding of your choice is compatible with ASCII. I.e. that the codepoints below 128 have the same numeric values as ASCII.

Any input MD4C does not understand is simply seen as part of the document text and sent to the renderer's callback functions unchanged.

The two situations (word boundary detection and link reference matching) where MD4C has to understand Unicode are handled as specified by the following preprocessor macros (as specified at the time MD4C is being built):

If preprocessor macro MD4C_USE_UTF8 is defined, MD4C assumes UTF-8 for the word boundary detection and for the case-insensitive matching of link labels.

When none of these macros is explicitly used, this is the default behavior.
On Windows, if preprocessor macro MD4C_USE_UTF16 is defined, MD4C uses WCHAR instead of char and assumes UTF-16 encoding in those situations. (UTF-16 is what Windows developers usually call just "Unicode" and what Win32API generally works with.)

Note that because this macro affects also the types in md4c.h, you have to define the macro both when building MD4C as well as when including md4c.h.

Also note this is only supported in the parser (md4c.[hc]). The HTML renderer does not support this and you will have to write your own custom renderer to use this feature.
If preprocessor macro MD4C_USE_ASCII is defined, MD4C assumes nothing but an ASCII input.

That effectively means that non-ASCII whitespace or punctuation characters won't be recognized as such and that link reference matching will work in a case-insensitive way only for ASCII letters ([a-zA-Z]).
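What "case-insensitive only for ASCII letters" means in practice can be sketched as follows. This is not MD4C's implementation; `labels_equal_ascii` is a hypothetical name, and the sketch ignores the whitespace normalization that CommonMark label matching also performs.

```c
#include <assert.h>
#include <stddef.h>

/* Fold only [A-Z] to lower case; every other byte is left untouched. */
static unsigned char
ascii_lower(unsigned char ch)
{
    return (ch >= 'A' && ch <= 'Z') ? (unsigned char) (ch - 'A' + 'a') : ch;
}

/* Compare two link labels, ignoring case for ASCII letters only.
 * Returns 1 when they match, 0 otherwise. */
static int
labels_equal_ascii(const char* a, size_t a_size, const char* b, size_t b_size)
{
    size_t i;

    assert(a != NULL && b != NULL);
    if(a_size != b_size)
        return 0;

    for(i = 0; i < a_size; i++) {
        if(ascii_lower((unsigned char) a[i]) != ascii_lower((unsigned char) b[i]))
            return 0;
    }
    return 1;
}
```

Under such a comparison, "Foo" matches "fOO", while "Ä" and "ä" (different UTF-8 byte sequences) are treated as different labels.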

Documentation

The API of the parser is quite well documented in the comments in the md4c.h. Similarly, the markdown-to-html API is described in its header md4c-html.h.

There is also a project wiki which provides some more comprehensive documentation. However, note that it is incomplete and some details may be somewhat outdated.

FAQ

Q: How does MD4C compare to other Markdown parsers?

A: Some other implementations combine Markdown parser and HTML generator into a single entangled code hidden behind an interface which just allows the conversion from Markdown to HTML. They are often unusable if you want to process the input in any other way.

Second, most parsers (if not all of them; at least within the scope of the C/C++ languages) are full DOM-like parsers: They construct an abstract syntax tree (AST) representation of the whole Markdown document. That takes time and leads to a bigger memory footprint.

Building AST is completely fine as long as you need it. If you don't, there is a very high chance that using MD4C will be substantially faster and less hungry in terms of memory consumption.

Last but not least, some Markdown parsers are implemented in a naive way. When fed with a smartly crafted input pattern, they may exhibit quadratic (or even worse) parsing times. What MD4C can still parse in a fraction of second may turn into long minutes or possibly hours with them. Hence, when such a naive parser is used to process an input from an untrusted source, the possibility of denial-of-service attacks becomes a real danger.

A lot of our effort went into providing linear parsing times no matter what kind of crazy input the MD4C parser is fed with. (If you encounter an input pattern which leads to super-linear parsing times, please do not hesitate to report it as a bug.)

Q: Does MD4C perform any input validation?

A: No. And we are proud of it. :-)

CommonMark specification states that any sequence of Unicode characters is a valid Markdown document. (In practice, this more or less always means UTF-8 encoding.)

In other words, according to the specification, it does not matter whether some Markdown syntax construction is in some way broken or not. If it's broken, it won't be recognized and the parser should see it just as a verbatim text.

MD4C takes this a step further: It sees any sequence of bytes as a valid input, following completely the GIGO philosophy (garbage in, garbage out). I.e. any ill-formed UTF-8 byte sequence will propagate to the respective callback as a part of the text.

If you need to validate that the input is, say, a well-formed UTF-8 document, you have to do it on your own. The easiest way to do this is to validate the whole document before passing it to the MD4C parser.
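Such a pre-check could look like the sketch below. It is an assumption-laden illustration (the function name `is_valid_utf8` is made up, and nothing like it ships with MD4C); it rejects truncated sequences, stray continuation bytes, overlong forms, surrogates and code points above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the buffer holds well-formed UTF-8, 0 otherwise. */
static int
is_valid_utf8(const unsigned char* s, size_t size)
{
    size_t i = 0;

    assert(s != NULL || size == 0);

    while(i < size) {
        unsigned char lead = s[i];
        unsigned long cp;     /* decoded code point */
        unsigned long min;    /* smallest code point allowed for this length */
        size_t n;             /* number of continuation bytes expected */
        size_t k;

        if(lead < 0x80) {
            i++;
            continue;
        } else if((lead & 0xE0) == 0xC0) {
            n = 1; cp = lead & 0x1F; min = 0x80;
        } else if((lead & 0xF0) == 0xE0) {
            n = 2; cp = lead & 0x0F; min = 0x800;
        } else if((lead & 0xF8) == 0xF0) {
            n = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return 0;         /* stray continuation byte or invalid lead */
        }

        if(size - i <= n)
            return 0;         /* sequence is cut off by the end of input */

        for(k = 1; k <= n; k++) {
            if((s[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (unsigned long) (s[i + k] & 0x3F);
        }

        if(cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;         /* overlong, out of range, or surrogate */

        i += n + 1;
    }
    return 1;
}
```

An application would call this on the whole input and pass the document to the parser only if it returns 1.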

License

MD4C is covered by the MIT license, see the file LICENSE.md.

Links to Related Projects

Ports and bindings to other languages:

commonmark-d: Port of MD4C to D language.
markdown-wasm: Port of MD4C to WebAssembly.
PyMD4C: Python bindings for MD4C

Software using MD4C:

imgui_md: Markdown renderer for Dear ImGui
MarkDown Monolith Assembler: A command line tool for building browser-based books.
QOwnNotes: A plain-text file notepad and todo-list manager with markdown support and ownCloud / Nextcloud integration.
Qt: Cross-platform C++ GUI framework.
Textosaurus: Cross-platform text editor based on Qt and Scintilla.
8th: Cross-platform concatenative programming language.


md4c's Issues

NULL pointer dereference in md4c/md4c.c:5824

I found a segmentation fault when using md2html.
commit cb7ecd7
./md2html --github crash1

It is a NULL pointer dereference in https://github.com/mity/md4c/blob/master/md4c/md4c.c#L5824.
ctx->current_block is a NULL pointer.
I see there is an assert in https://github.com/mity/md4c/blob/master/md4c/md4c.c#L5822, but I don't know why it does not fire.
I just cloned the repository and built it with cmake . and make.

(gdb) set args --github crash1 
(gdb) r
Starting program: /opt/lxf/md4c/md2html/md2html --github crash1 

Program received signal SIGSEGV, Segmentation fault.
md_process_line (line=0x7fffffffde80, p_pivot_line=<synthetic pointer>, ctx=0x7fffffffdf30)
    at /opt/lxf/md4c/md4c/md4c.c:5824
5824	        ctx->current_block->type = MD_BLOCK_TABLE;
(gdb) bt
#0  md_process_line (line=0x7fffffffde80, p_pivot_line=<synthetic pointer>, ctx=0x7fffffffdf30)
    at /opt/lxf/md4c/md4c/md4c.c:5824
#1  md_process_doc (ctx=0x7fffffffdf30) at /opt/lxf/md4c/md4c/md4c.c:5865
#2  md_parse (text=text@entry=0x627250 "", size=size@entry=8632, renderer=renderer@entry=0x7fffffffe1c0, 
    userdata=userdata@entry=0x7fffffffe1a0) at /opt/lxf/md4c/md4c/md4c.c:5935
#3  0x0000000000403aa2 in md_render_html (input=input@entry=0x627250 "", input_size=input_size@entry=8632, 
    process_output=process_output@entry=0x402280 <process_output>, userdata=userdata@entry=0x7fffffffe210, 
    parser_flags=<optimized out>, renderer_flags=<optimized out>) at /opt/lxf/md4c/md2html/render_html.c:488
#4  0x0000000000401263 in process_file (out=0x7ffff7dd4400 <_IO_2_1_stdout_>, in=0x627010)
    at /opt/lxf/md4c/md2html/md2html.c:139
#5  main (argc=<optimized out>, argv=<optimized out>) at /opt/lxf/md4c/md2html/md2html.c:343
(gdb) p ctx->current_block 
$1 = (MD_BLOCK *) 0x0

this is the crash file :
poc file

Incorrect emphasis parsing

(Copied from commonmark/cmark#177, we are hit with exactly the same issue.)

@raphlinus writes:

In the example a***b* c*, cmark produces a**b c*, where I believe the spec would say a*b c. My reading of the spec is that the length of the opening delimiter run is 3, and the length of both closing runs is 1, so in neither case the sum is a multiple of 3.
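The rule being discussed is the CommonMark "multiple of 3" restriction: if one of the two delimiter runs can both open and close emphasis, the runs may not pair up when the sum of their lengths is a multiple of 3, unless both lengths are multiples of 3. A minimal sketch of that check follows; the function and parameter names are made up for illustration and this is neither cmark's nor MD4C's code.

```c
#include <assert.h>

/* Returns 1 when an opening and a closing delimiter run may pair up under
 * the "multiple of 3" rule, 0 otherwise.
 *   opener_can_close: the opening run is also right-flanking
 *   closer_can_open:  the closing run is also left-flanking */
static int
delimiters_may_pair(unsigned opener_len, unsigned closer_len,
                    int opener_can_close, int closer_can_open)
{
    assert(opener_len > 0 && closer_len > 0);

    if(opener_can_close || closer_can_open) {
        if((opener_len + closer_len) % 3 == 0 &&
           !(opener_len % 3 == 0 && closer_len % 3 == 0))
            return 0;
    }
    return 1;
}
```

In the example, the opening run `***` has length 3 and each closing run has length 1, so the sum is 4 and the rule does not forbid either pairing, which is the reading given in the quote.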

Security audit, fuzzing, and more testing

Markdown implementations are often used to process untrusted input. md4c is written in C, which makes it very easy to introduce a security vulnerability. Hence, it is imperative that md4c is hardened against all possible exploits.

This includes:

Adding always-on assertions about things like array bounds
Intensive, repeated fuzzing with tools like afl-fuzz.
Security auditing
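As an illustration of the first point: the standard assert() is compiled out whenever NDEBUG is defined, which release builds typically do, so a check that must always run needs its own macro. The sketch below is hypothetical (`CHECK_OR_FAIL`, `struct block` and `set_block_type` are invented for the example and do not come from md4c); it reports the failure and returns an error code instead of dereferencing a bad pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Always-on check: unlike assert(), it is not affected by NDEBUG.
 * On failure it logs the condition and makes the enclosing function
 * return -1. */
#define CHECK_OR_FAIL(cond)                                             \
    do {                                                                \
        if(!(cond)) {                                                   \
            fprintf(stderr, "%s:%d: check failed: %s\n",                \
                    __FILE__, __LINE__, #cond);                         \
            return -1;                                                  \
        }                                                               \
    } while(0)

/* Hypothetical stand-in for a parser's block record. */
struct block {
    int type;
};

static int
set_block_type(struct block* current, int type)
{
    assert(type >= 0);                  /* debug-only: vanishes under NDEBUG */
    CHECK_OR_FAIL(current != NULL);     /* stays active in every build */

    current->type = type;
    return 0;
}
```

The cost is one extra branch per check, which is usually acceptable on paths that handle untrusted input.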

Get source line number for headings

Hello,
is it possible to retrieve the source line number for headings?
I would like to build a parser which can create a TOC with navigation feature for a markdown editor.

Entities in non-trivial contexts

Entities which are not inside a normal text flow are not translated.

This includes these situations:

Entity inside a link or image destination (URL).
Entity inside a link or image title text.
Entity inside a code fence info string.

This is responsible for the following CommonMark 0.27 test failures.

Example 307: In link destination and title
Example 308: In link destination and title
Example 309: In code fence info string
Example 472: In link destination
Example 475: In link title

Enhance table parsing to deal with pipes inside the cells.

With MD_FLAG_TABLES enabled, we should handle pipe characters more carefully. Currently it is impossible to have (an unescaped) pipe inside the table. That makes a code span with a pipe impossible inside a table.

We can likely follow GitHub implementation as described here:
https://talk.commonmark.org/t/parsing-strategy-for-tables/2027/46
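One ingredient of any such scheme is telling cell-separating pipes apart from escaped ones. The sketch below only shows that part, under the assumption that a backslash escapes the character after it; it is not the algorithm from the linked discussion, and `count_cell_separators` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Count the pipes in a table row that act as cell separators, i.e. those
 * that are not escaped by a preceding backslash. */
static size_t
count_cell_separators(const char* row, size_t size)
{
    size_t i = 0;
    size_t count = 0;

    assert(row != NULL || size == 0);

    while(i < size) {
        if(row[i] == '\\' && i + 1 < size) {
            i += 2;     /* skip the backslash and the character it escapes */
            continue;
        }
        if(row[i] == '|')
            count++;
        i++;
    }
    return count;
}
```

With this approach a code span containing a pipe has to be written with the pipe escaped, and an escaped backslash followed by a pipe still splits the cell.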

Heap buffer overflow in md_is_link_reference_definition_helper()

command: ./md2html testfile

testcase: https://github.com/ChijinZ/security_advisories/blob/master/md4c-387bd02/crash_md_is_link_reference_definition_helper

AddressSanitizer provided information as below:

=================================================================
==7016==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x615000000280 at pc 0x00000054e1e4 bp 0x7ffdf438ab70 sp 0x7ffdf438ab68
READ of size 4 at 0x615000000280 thread T0
    #0 0x54e1e3 in md_is_link_reference_definition_helper /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1931:33
    #1 0x5320d5 in md_is_link_reference_definition /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2213:11
    #2 0x5320d5 in md_consume_link_reference_definitions /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4648
    #3 0x5320d5 in md_end_current_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4694
    #4 0x52c7f7 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5850:5
    #5 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #6 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #7 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #8 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #9 0x7f20771c082f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
    #10 0x41a668 in _start (/home/ubuntu/fuzz/test/md4c/build/md2html/md2html+0x41a668)

0x615000000280 is located 0 bytes to the right of 512-byte region [0x615000000080,0x615000000280)
allocated by thread T0 here:
    #0 0x4ded00 in realloc /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:107
    #1 0x527b65 in md_push_block_bytes /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4560:27
    #2 0x527b65 in md_start_new_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4587
    #3 0x527b65 in md_process_line /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5820
    #4 0x527b65 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5847
    #5 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #6 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #7 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #8 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #9 0x7f20771c082f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1931:33 in md_is_link_reference_definition_helper
Shadow bytes around the buggy address:
0x0c2a7fff8000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2a7fff8020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2a7fff8030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2a7fff8040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c2a7fff8050:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8070: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8090: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff80a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable:           00
Partially addressable: 01 02 03 04 05 06 07 
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Array cookie:            ac
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==7016==ABORTING

afl-fuzz: crash in md_process_table_cell()

With --github, the following minimized test case leads to seg. fault in md_process_table_cell():

[|]: u

[|]: u
---|

Entity in direct link title entails mis-translation

This gets processed correctly:

Some [direct link](http://example.com "Direct link -- ie, inline URL") here.

But when an entity reference occurs in the title text, like this:

Some [direct link](http://example.com "Direct link &ndash; ie, inline URL") here.

the last portion of the title text (starting at the reference) gets repeated after the generated <a> element, so the output from md2html looks like this:

<p>Some <a href="http://example.com" title="Direct link – ie, inline URL">direct link</a>– ie, inline URL&quot;) here.</p>

As far as I have seen, this stems from md4c.c, not md2html.c: the – reference is indeed transmitted twice, once embedded in the attribute text, and a second time as a MD_TEXT_ENTITY item itself.

One can actually observe this behaviour in Babelmark 2, which has recently added MD4C 0.1.1.

Tag `<gi att1=tok1 att2=tok2>` not recognized

In md_is_html_tag(), state 41 "in middle of unquoted attribute value" is not exited when white space is encountered. As a consequence, in the tag mentioned above the second = throws scanning off: we end up in the "unexpected" case and return FALSE, so the tag is not recognized as such.

Invalid HTML with --fpermissive-url-autolinks

With permissive-url-autolinks enabled, md2html appears to generate an extra closing tag after an ordinary inline link.

$ echo "This is a [link](http://github.com/)." | ./md2html
<p>This is a <a href="http://github.com/">link</a>.</p>

$ echo "This is a [link](http://github.com/)." | ./md2html --fpermissive-url-autolinks
<p>This is a <a href="http://github.com/">link</a></a>.</p>

Heap buffer overflow in md_is_named_entity_contents()

command: ./md2html testfile

testcase: https://github.com/ChijinZ/security_advisories/blob/master/md4c-387bd02/crash_md_is_named_entity_contents

AddressSanitizer provided information as below:

==16545==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200000001f at pc 0x0000005464c6 bp 0x7ffe90e1b080 sp 0x7ffe90e1b078
READ of size 1 at 0x60200000001f thread T0
    #0 0x5464c5 in md_is_named_entity_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1311:28
    #1 0x5464c5 in md_is_entity_str /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1341
    #2 0x553b62 in md_build_attribute /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1473:20
    #3 0x5562f9 in md_enter_leave_span_a /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3838:5
    #4 0x5510d2 in md_process_inlines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3947:21
    #5 0x5510d2 in md_process_normal_block_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4284
    #6 0x52e7f7 in md_process_leaf_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4454:13
    #7 0x52e7f7 in md_process_all_blocks /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4529
    #8 0x52e7f7 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5856
    #9 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #10 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #11 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #12 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #13 0x7fd6be7ec82f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
    #14 0x41a668 in _start (/home/ubuntu/fuzz/test/md4c/build/md2html/md2html+0x41a668)

0x60200000001f is located 0 bytes to the right of 15-byte region [0x602000000010,0x60200000001f)
allocated by thread T0 here:
    #0 0x4de898 in __interceptor_malloc /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x54bedb in md_merge_lines_alloc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:904:22
    #2 0x54bedb in md_is_inline_link_spec_helper /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2352
    #3 0x53b9bf in md_is_inline_link_spec /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2370:12
    #4 0x53b9bf in md_resolve_links /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3367
    #5 0x53b9bf in md_analyze_inlines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3786
    #6 0x550b75 in md_process_normal_block_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4283:5

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:1311:28 in md_is_named_entity_contents
Shadow bytes around the buggy address:
0x0c047fff7fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c047fff7fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c047fff7fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c047fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c047fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c047fff8000: fa fa 00[07]fa fa 00 07 fa fa fa fa fa fa fa fa
0x0c047fff8010: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c047fff8020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c047fff8030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c047fff8040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c047fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable:           00
Partially addressable: 01 02 03 04 05 06 07 
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Array cookie:            ac
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==16545==ABORTING

Feature request: parse GitHub checkbox syntax

Another reason I want to add Markdown support to Qt is to enable combined notes and TODO lists like Evernote has, to be able to edit GitHub README.md files, etc.

https://github.com/blog/1825-task-lists-in-all-markdown-documents

As they show there, the syntax is

### Solar System Exploration, 1950s – 1960s

- [ ] Mercury
- [x] Venus
- [x] Earth (Orbit/Moon)
- [x] Mars
- [ ] Jupiter
- [ ] Saturn
- [ ] Uranus
- [ ] Neptune
- [ ] Comet Haley

Solar System Exploration, 1950s – 1960s

It won't be quite so easy to add support for checking and unchecking them to any Qt text-editing component, but I'd like to try.

Currently for my personal todo lists and shopping lists etc., I use todo.txt format (https://f-droid.org/packages/nl.mpcjanssen.simpletask/ and https://f-droid.org/packages/com.nutomic.syncthingandroid/ are a good combination for this); but it has the disadvantage of not being able to mix free-form notes with the checklists. So I think markdown will be better.

Crash

American Fuzzy Lop has found a crash with a pattern I was able to minimize into this.

[x]:
x
- <?

  x

Attributes of `CommonMark.dtd` omitted by parser

When porting some code (which up to now used cmark) to MD4C I missed some information from the parser which is represented as attributes in the CommonMark DTD:

In the details for MD_BLOCK_UL, the information whether the list is "tight" or "loose" (attribute tight (true|false)).
In the details for MD_BLOCK_OL, the information whether the list is "tight" or "loose" (attribute tight (true|false)), and which style of item marker was used (attribute delimiter (period|paren)).

Any chance to put the corresponding information into MD_BLOCK_UL_DETAIL and MD_BLOCK_OL_DETAIL respectively?

heap_overflow in function md_build_attribute md4c.c:1462

commit cb7ecd7
./md2html --github crash2

In function md_build_attribute, https://github.com/mity/md4c/blob/master/md4c/md4c.c#L1462
raw_size is far too large, which leads to a read beyond the end of the heap buffer.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000404f4b in md_build_attribute (ctx=ctx@entry=0x7fffffffdf30, 
    raw_text=0x62756b "http://meta.math.stackexchan\214\345\246\202\351\234\200\346\222\260\345\206\231\346\226\260\347\250\277\344\273\266\357\274\214\347\202\271\345\207\273\351\241\266\351\203\250\345\267\245\345\205\267\346\240\217\345\217\263\344\276\377\a\232\204 <i class=\"icon-file\"></i> **\346\226\260\346\226\207\347\250\277** \346\210\226\350\200\205\344\275\277\347\224\250\345\277\253\346\215\267\351\224\256 `Ctrl+Alt+N`\343\200\202\r\n\r\n------\r\n\r\n## \344\273\200\344\271\210\346\230\257 Markdown\r\n\r\n"..., 
    raw_size=raw_size@entry=4294966501, flags=<optimized out>, attr=attr@entry=0x7fffffffdd40, 
    build=build@entry=0x7fffffffdce0) at /opt/lxf/md4c/md4c/md4c.c:1462
1462	            if(raw_text[raw_off] == _T('\0')) {
(gdb) bt
#0  0x0000000000404f4b in md_build_attribute (ctx=ctx@entry=0x7fffffffdf30, 
    raw_text=0x62756b "http://meta.math.stackexchan\214\345\246\202\351\234\200\346\222\260\345\206\231\346\226\260\347\250\277\344\273\266\357\274\214\347\202\271\345\207\273\351\241\266\351\203\250\345\267\245\345\205\267\346\240\217\345\217\263\344\276\377\a\232\204 <i class=\"icon-file\"></i> **\346\226\260\346\226\207\347\250\277** \346\210\226\350\200\205\344\275\277\347\224\250\345\277\253\346\215\267\351\224\256 `Ctrl+Alt+N`\343\200\202\r\n\r\n------\r\n\r\n## \344\273\200\344\271\210\346\230\257 Markdown\r\n\r\n"..., 
    raw_size=raw_size@entry=4294966501, flags=<optimized out>, attr=attr@entry=0x7fffffffdd40, 
    build=build@entry=0x7fffffffdce0) at /opt/lxf/md4c/md4c/md4c.c:1462
#1  0x0000000000405446 in md_enter_leave_span_a (ctx=ctx@entry=0x7fffffffdf30, enter=4, 
    type=type@entry=MD_SPAN_A, dest=<optimized out>, dest_size=dest_size@entry=4294966501, 
    prohibit_escapes_in_dest=prohibit_escapes_in_dest@entry=1, title=0x0, title_size=0)
    at /opt/lxf/md4c/md4c/md4c.c:3846
#2  0x000000000040580c in md_process_inlines (ctx=ctx@entry=0x7fffffffdf30, lines=lines@entry=0x6333a8, 
    n_lines=n_lines@entry=1) at /opt/lxf/md4c/md4c/md4c.c:4003
#3  0x0000000000411ce3 in md_process_normal_block_contents (n_lines=1, lines=0x6333a8, ctx=0x7fffffffdf30)
    at /opt/lxf/md4c/md4c/md4c.c:4302
#4  md_process_leaf_block (block=0x6333a0, ctx=0x7fffffffdf30) at /opt/lxf/md4c/md4c/md4c.c:4472
#5  md_process_all_blocks (ctx=0x7fffffffdf30) at /opt/lxf/md4c/md4c/md4c.c:4547
#6  md_process_doc (ctx=0x7fffffffdf30) at /opt/lxf/md4c/md4c/md4c.c:5874
#7  md_parse (
    text=text@entry=0x627250 "# \346\254\242\350\277\216\344\275", '&' <repeats 17 times>, " \347\274\226\350\276\221\351\230\305\350\257\273\345\231\250\r\n\r\n------\r\n\r\n\346\310\221\344\273\254\347\220\206\350\247\r\n", size=size@entry=10251, renderer=renderer@entry=0x7fffffffe1c0, 
    userdata=userdata@entry=0x7fffffffe1a0) at /opt/lxf/md4c/md4c/md4c.c:5935
#8  0x0000000000403aa2 in md_render_html (
    input=input@entry=0x627250 "# \346\254\242\350\277\216\344\275", '&' <repeats 17 times>, " \347\274\226\350\276\221\351\230\305\350\257\273\345\231\250\r\n\r\n------\r\n\r\n\346\310\221\344\273\254\347\220\206\350\247\r\n", input_size=input_size@entry=10251, process_output=process_output@entry=0x402280 <process_output>, 
    userdata=userdata@entry=0x7fffffffe210, parser_flags=<optimized out>, renderer_flags=<optimized out>)
    at /opt/lxf/md4c/md2html/render_html.c:488
#9  0x0000000000401263 in process_file (out=0x7ffff7dd4400 <_IO_2_1_stdout_>, in=0x627010)
    at /opt/lxf/md4c/md2html/md2html.c:139
#10 main (argc=<optimized out>, argv=<optimized out>) at /opt/lxf/md4c/md2html/md2html.c:343
(gdb) p raw_off 
$1 = 187029
(gdb) p raw_text[raw_off]
Cannot access memory at address 0x655000
(gdb) p raw_text[raw_off-1]
$2 = 0 '\000'
(gdb) p raw_size 
$3 = 4294966501

poc file: poc
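The reported raw_size of 4294966501 equals 2^32 - 795, which is what a 32-bit unsigned subtraction yields when the mathematical result would be -795. That suggests, although the report does not confirm it, that a size was computed as end minus start with the end offset lying before the start. A guarded subtraction for such offsets might look like this sketch; `checked_span_size` is a hypothetical helper, not an md4c function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compute end - beg for two 32-bit offsets.
 * Returns 0 and stores the size in *p_size on success,
 * or -1 when end lies before beg (plain subtraction would wrap around). */
static int
checked_span_size(uint32_t beg, uint32_t end, uint32_t* p_size)
{
    assert(p_size != NULL);

    if(end < beg)
        return -1;

    *p_size = end - beg;
    return 0;
}
```

A caller that gets -1 can treat the construct as unrecognized text instead of scanning with a wrapped-around size.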

Multiline link reference definition label inside a blockquote does not work.

> [foo
> bar]: /url
>
> [foo bar]

is rendered into

<blockquote>
<p>[foo bar]</p>
</blockquote>

but should be rendered into

<blockquote>
<p><a href="/url">foo bar</a></p>
</blockquote>

Invalid output.

The test case input in #38 generates an unexpected output, even after fixing the invalid read.

Minimized version of the test case for the invalid output is as follows:

[x](((x
x]((C(&))x

It currently generates

<p><a href="" title="(x
x]((C(&amp;">x</a>
x]((C(&amp;))x</p>

That is clearly wrong. The input is likely not valid link syntax (this needs more analysis and study of the spec to confirm). But even if it were a link, the link text should not be repeated after the link. Something is rotten here.

Lots of warnings

Hello,

Thank you for this awesome library. I just get a lot of warnings when compiling with -Wall -Wextra:

$ gcc -std=c99 -Wall -Wextra md4c.c -c
md4c.c: In function ‘md_decode_unicode’:
md4c.c:844:52: warning: unused parameter ‘str_size’ [-Wunused-parameter]
     md_decode_unicode(const CHAR* str, OFF off, SZ str_size, SZ* p_size)
                                                    ^~~~~~~~
md4c.c: In function ‘md_merge_lines’:
md4c.c:864:73: warning: unused parameter ‘n_lines’ [-Wunused-parameter]
 md_merge_lines(MD_CTX* ctx, OFF beg, OFF end, const MD_LINE* lines, int n_lines,
                                                                         ^~~~~~~
md4c.c: In function ‘md_is_hex_entity_contents’:
md4c.c:1275:35: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 md_is_hex_entity_contents(MD_CTX* ctx, const CHAR* text, OFF beg, OFF max_end, OFF* p_end)
                                   ^~~
md4c.c: In function ‘md_is_dec_entity_contents’:
md4c.c:1291:35: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 md_is_dec_entity_contents(MD_CTX* ctx, const CHAR* text, OFF beg, OFF max_end, OFF* p_end)
                                   ^~~
md4c.c: In function ‘md_is_named_entity_contents’:
md4c.c:1307:37: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 md_is_named_entity_contents(MD_CTX* ctx, const CHAR* text, OFF beg, OFF max_end, OFF* p_end)
                                     ^~~
md4c.c: In function ‘md_free_attribute’:
md4c.c:1412:27: warning: unused parameter ‘ctx’ [-Wunused-parameter]
 md_free_attribute(MD_CTX* ctx, MD_ATTRIBUTE_BUILD* build)
                           ^~~
md4c.c: In function ‘md_rollback’:
md4c.c:2618:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(i = 1; i < SIZEOF_ARRAY(ctx->mark_chains); i++) {
                  ^
md4c.c: In function ‘md_build_mark_char_map’:
md4c.c:2719:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         for(i = 0; i < sizeof(ctx->mark_char_map); i++) {
                      ^
md4c.c: In function ‘md_collect_marks’:
md4c.c:2936:52: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                 for(scheme_index = 0; scheme_index < SIZEOF_ARRAY(scheme_map); scheme_index++) {
                                                    ^
md4c.c: In function ‘md_process_verbatim_block_contents’:
md4c.c:4312:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         while(indent > SIZEOF_ARRAY(indent_chunk_str)) {
                      ^
md4c.c: In function ‘md_is_atxheader_line’:
md4c.c:4786:61: warning: unused parameter ‘p_end’ [-Wunused-parameter]
 md_is_atxheader_line(MD_CTX* ctx, OFF beg, OFF* p_beg, OFF* p_end, unsigned* p_level)
                                                             ^~~~~
md4c.c: At top level:
md4c.c:5336:1: warning: missing initializer for field ‘beg’ of ‘MD_LINE_ANALYSIS {aka const struct MD_LINE_ANALYSIS_tag}’ [-Wmissing-field-initializers]
 static const MD_LINE_ANALYSIS md_dummy_blank_line = { MD_LINE_BLANK, 0 };
 ^~~~~~
md4c.c:189:9: note: ‘beg’ declared here
     OFF beg;
         ^~~
md4c.c: In function ‘md_analyze_line’:
md4c.c:5481:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                ctx->n_block_bytes > sizeof(MD_BLOCK))
                                   ^
md4c.c:5499:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
                ctx->n_block_bytes > sizeof(MD_BLOCK))
                                   ^
md4c.c: In function ‘md_parse’:
md4c.c:5908:18: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for(i = 0; i < SIZEOF_ARRAY(ctx.mark_chains); i++) {
                  ^
md4c.c: In function ‘md_rollback’:
md4c.c:2669:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
                 if((mark_flags & MD_MARK_CLOSER)  &&  mark->prev > opener_index) {
                   ^
md4c.c:2676:13: note: here
             default:
             ^~~~~~~
md4c.c: In function ‘md_process_inlines’:
md4c.c:3956:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
                     if(!(mark->flags & MD_MARK_AUTOLINK)) {
                       ^
md4c.c:3966:17: note: here
                 case '@':       /* Permissive e-mail autolink. */
                 ^~~~
md4c.c: In function ‘md_enter_child_containers’:
md4c.c:5197:33: warning: this statement may fall through [-Wimplicit-fallthrough=]
                 is_ordered_list = TRUE;
                                 ^
md4c.c:5200:13: note: here
             case _T('-'):
             ^~~~
md4c.c: In function ‘md_leave_child_containers’:
md4c.c:5240:33: warning: this statement may fall through [-Wimplicit-fallthrough=]
                 is_ordered_list = TRUE;
                                 ^
md4c.c:5243:13: note: here
             case _T('-'):
             ^~~~
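The warnings above fall into three classes (-Wunused-parameter, -Wsign-compare, -Wimplicit-fallthrough), and each has a conventional low-risk fix in C. Below is a minimal sketch of those fixes. The function and array names are hypothetical and are not taken from md4c.c, which may well address the warnings differently.

```c
#include <assert.h>
#include <stddef.h>

#define SIZEOF_ARRAY(a)  (sizeof(a) / sizeof((a)[0]))

static const int example_chain[4] = { 3, 1, 4, 1 };

/* -Wunused-parameter: keep the parameter for a uniform signature and mark
 * it as deliberately unused.
 * -Wsign-compare: use size_t for an index compared against sizeof. */
static int sum_chain(void* ctx)
{
    size_t i;
    int sum = 0;

    (void) ctx;
    for(i = 0; i < SIZEOF_ARRAY(example_chain); i++)
        sum += example_chain[i];
    return sum;
}

/* -Wimplicit-fallthrough: annotate an intended fallthrough with a comment
 * that GCC recognizes. */
static int classify_list_mark(char ch, int* is_ordered)
{
    assert(is_ordered != NULL);
    *is_ordered = 0;
    switch(ch) {
        case '.':
        case ')':
            *is_ordered = 1;
            /* fall through */
        case '-':
        case '+':
        case '*':
            return 1;   /* some kind of list mark */
        default:
            return 0;   /* not a list mark */
    }
}
```

The -Wmissing-field-initializers case can be handled in the same spirit by spelling out every field of the initializer.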

Buffer overflow in md_is_entity_str()

The following input leads to a buffer overflow in md_is_entity_str() when used with md2html --github:

www.x.y&&y&&&y&&&&y&&y&&

(Distilled from PR #46 and from https://bugreports.qt.io/browse/QTBUG-72937)

Emphasis parsed wrongly.

a*b**c* should translate to a<em>b**c</em>, but currently it is translated to a*b**c*.

md2html doesn't generate HTML tables

I tested with the example.md from https://codereview.qt-project.org/#/c/214844/ . It simply puts the raw table syntax into a paragraph:

| | Development Tools | Programming Techniques | Graphical User Interfaces | | ------------: | ----------------- | ---------------------- | ------------------------- | | 9:00 - 11:00 | Introduction to Qt ||| | 11:00 - 13:00 | Using qmake | Object-oriented Programming | Layouts in Qt | | 13:00 - 15:00 | Qt Designer Tutorial | Extreme Programming | Writing Custom Styles | | 15:00 - 17:00 | Qt Linguist and Internationalization | | |

This doesn't matter for Qt, and I don't have any plan to use md2html for anything ATM; but since I'm trying to add md4c to Arch AUR, if the package is to include md2html, the bugs will become noticeable when someone tries to use it for this basic purpose of generating HTML. If it's not intended to be a serious tool (after all, there are enough MD->HTML tools to choose from), I could rather leave it out of the package though.

Parsing '<><><><>...' takes quadratic time

$ python -c 'print("<>" * 10000)' | time md2html > /dev/null
1.41user 0.00system 0:01.45elapsed 96%CPU (0avgtext+0avgdata 1968maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps
$ python -c 'print("<>" * 20000)' | time md2html > /dev/null
4.80user 0.00system 0:04.84elapsed 99%CPU (0avgtext+0avgdata 2608maxresident)k
0inputs+0outputs (0major+365minor)pagefaults 0swaps
$ python -c 'print("<>" * 40000)' | time md2html > /dev/null
19.52user 0.00system 0:19.65elapsed 99%CPU (0avgtext+0avgdata 3528maxresident)k
0inputs+0outputs (0major+640minor)pagefaults 0swaps

(the 1st case from #57)
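A quick way to read the timings above: each run doubles the input, so a linear parser should take roughly twice as long per step, while a quadratic one should take roughly four times as long. The sketch below only computes those ratios from the user times reported above; the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* User times in seconds from the runs above, for 10000, 20000 and 40000
 * repetitions of "<>". */
static const double user_seconds[3] = { 1.41, 4.80, 19.52 };

/* Ratio between two consecutive measurements; each step doubles the input. */
static double doubling_ratio(size_t step)
{
    assert(step + 1 < sizeof(user_seconds) / sizeof(user_seconds[0]));
    return user_seconds[step + 1] / user_seconds[step];
}
```

Both ratios (about 3.4 and about 4.1) are much closer to 4 than to 2, which is consistent with the quadratic behavior named in the title.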

Consecutive lists without a blank line

100. foo
    * bar

is transformed into

<ol start="100"><li>foo
* bar</li>
</ol>

But it should be

<ol start="100"><li>foo</li>
</ol>
<ul>
<li>bar</li>
</ul>

(Interestingly, lower or larger indentation of the 2nd list works correctly.)

Multiple vulnerabilities in md4c

There are multiple vulnerabilities in md4c (git repository: https://github.com/mity/md4c, Latest commit 81e2a5c on Apr 12, 2018).

git log

commit 81e2a5cac2c8c2b1f8fe63b7bce3fe7e516e2891
Author: Martin Mitas <[email protected]>
Date:   Thu Apr 12 17:03:37 2018 +0200

Heap buffer overflow in md_split_simple_pairing_mark()

command: ./md2html testfile

testcase: https://github.com/ChijinZ/security_advisories/blob/master/md4c-81e2a5c/Heap_buffer_overflow_in_md_split_simple_pairing_mark

It seems that an overflow happened in memcpy() at md4c.c:3499:

memcpy(dummy, mark, sizeof(MD_MARK));

AddressSanitizer provided information as below:

==27938==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61a000000684 at pc 0x0000004dd7f5 bp 0x7ffedcfedc30 sp 0x7ffedcfed3e0
WRITE of size 20 at 0x61a000000684 thread T0
    #0 0x4dd7f4 in __asan_memcpy /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cc:23
    #1 0x546dd3 in md_split_simple_pairing_mark /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3499:5
    #2 0x546dd3 in md_analyze_simple_pairing_mark /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3553
    #3 0x540584 in md_analyze_marks /home/ubuntu/fuzz/test/md4c/md4c/md4c.c
    #4 0x53c9c8 in md_analyze_link_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3813:5
    #5 0x53c9c8 in md_analyze_inlines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3802
    #6 0x550b95 in md_process_normal_block_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4283:5
    #7 0x52e7f7 in md_process_leaf_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4454:13
    #8 0x52e7f7 in md_process_all_blocks /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4529
    #9 0x52e7f7 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5856
    #10 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #11 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #12 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #13 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #14 0x7f17c6fc582f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
    #15 0x41a668 in _start (/home/ubuntu/fuzz/test/md4c/build/md2html/md2html+0x41a668)

Address 0x61a000000684 is a wild pointer.
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cc:23 in __asan_memcpy
Shadow bytes around the buggy address:
0x0c347fff8080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c347fff8090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c347fff80a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c347fff80b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff80c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c347fff80d0:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff80e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff80f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff8100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff8110: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c347fff8120: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable:           00
Partially addressable: 01 02 03 04 05 06 07 
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Array cookie:            ac
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==27938==ABORTING

Heap buffer overflow in md_process_inlines()

command: ./md2html testfile

testcase: https://github.com/ChijinZ/security_advisories/blob/master/md4c-81e2a5c/Heap_buffer_overflow_in_md_process_inlines

It seems that the mark variable reads memory beyond the end of its allocated buffer at md4c.c:4004:

while(!(mark->flags & MD_MARK_RESOLVED) || mark->beg < off)

AddressSanitizer provided information as below:

==29037==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61e000000a91 at pc 0x000000553328 bp 0x7fffe3bdfa70 sp 0x7fffe3bdfa68
READ of size 1 at 0x61e000000a91 thread T0
    #0 0x553327 in md_process_inlines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4004:27
    #1 0x553327 in md_process_normal_block_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4284
    #2 0x52e7f7 in md_process_leaf_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4454:13
    #3 0x52e7f7 in md_process_all_blocks /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4529
    #4 0x52e7f7 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5856
    #5 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #6 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #7 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #8 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #9 0x7f49ca83682f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
    #10 0x41a668 in _start (/home/ubuntu/fuzz/test/md4c/build/md2html/md2html+0x41a668)

0x61e000000a91 is located 17 bytes to the right of 2560-byte region [0x61e000000080,0x61e000000a80)
allocated by thread T0 here:
    #0 0x4ded00 in realloc /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:107
    #1 0x5369af in md_push_mark /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2496:21
    #2 0x5369af in md_collect_marks /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2897
    #3 0x5369af in md_analyze_inlines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:3774
    #4 0x550b95 in md_process_normal_block_contents /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4283:5

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4004:27 in md_process_inlines
Shadow bytes around the buggy address:
0x0c3c7fff8100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c3c7fff8110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c3c7fff8120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c3c7fff8130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c3c7fff8140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c3c7fff8150: fa fa[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c3c7fff8160: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c3c7fff8170: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c3c7fff8180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c3c7fff8190: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c3c7fff81a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable:           00
Partially addressable: 01 02 03 04 05 06 07 
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Array cookie:            ac
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==29037==ABORTING

Move HTML renderer from md2html.c to a separate source file

I'm currently working on porting an app from Discount to md4c and would like to reuse the HTML rendering code from md2html.c. However, copying the sources into my project repo currently requires manually stripping out the code for option parsing, clock() calls, the main() function, etc., which makes it harder to track future upstream changes.

Would a pull request to move just the rendering code to a separate source file be accepted? If so, is it likely to conflict with any changes you are working on for issue #5?

P.S. Thanks for sharing this great library. I've been wanting to upgrade a few projects to a CommonMark compatible parser for a long time, but couldn't use cmark for lack of tables support. The feature set of md4c really hits the sweet spot I was looking for.

trouble with zero-based nested lists

Try to parse something like this: it seems to me that it combines some list items. As long as the indices start from 1, it doesn't happen.

0. Introduction
1. Chapter One
    0) One thing
    1) Another thing
        0. Subpoint
        1. Counterpoint
    2) Yet another thing

md2html does not have man page

As MD4C is currently being added to some Linux distros (see #48 or #55), the md2html tool should have better documentation, in particular a man page.

Heap-buffer-overflow in md4c.c

./md2html md4c_heap-buffer-overflow_md4c

==26370==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60300000f000 at pc 0x7f8e75e343ca bp 0x7fff7ec8b0f0 sp 0x7fff7ec8b0e0
WRITE of size 4 at 0x60300000f000 thread T0
    #0 0x7f8e75e343c9 in md_build_attribute /home/github/md4cg/md4c/md4c.c:1491
    #1 0x7f8e75e4e584 in md_setup_fenced_code_detail /home/github/md4cg/md4c/md4c.c:4377
    #2 0x7f8e75e4ea94 in md_process_leaf_block /home/github/md4cg/md4c/md4c.c:4419
    #3 0x7f8e75e4fb28 in md_process_all_blocks /home/github/md4cg/md4c/md4c.c:4528
    #4 0x7f8e75e5b574 in md_process_doc /home/github/md4cg/md4c/md4c.c:5854
    #5 0x7f8e75e5b99c in md_parse /home/github/md4cg/md4c/md4c.c:5915
    #6 0x4045ac in md_render_html /home/github/md4cg/md2html/render_html.c:488
    #7 0x401b4a in process_file /home/github/md4cg/md2html/md2html.c:139
    #8 0x402394 in main /home/github/md4cg/md2html/md2html.c:343
    #9 0x7f8e75a7b82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #10 0x4012c8 in _start (/home//github/md4cg/md2html/md2html+0x4012c8)

0x60300000f000 is located 0 bytes to the right of 32-byte region [0x60300000efe0,0x60300000f000)
allocated by thread T0 here:
    #0 0x7f8e760fb961 in realloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98961)
    #1 0x7f8e75e3327f in md_build_attr_append_substr /home/github/md4cg/md4c/md4c.c:1392
    #2 0x7f8e75e33e6b in md_build_attribute /home/github/md4cg/md4c/md4c.c:1482
    #3 0x7f8e75e4e584 in md_setup_fenced_code_detail /home/github/md4cg/md4c/md4c.c:4377
    #4 0x7f8e75e4ea94 in md_process_leaf_block /home/github/md4cg/md4c/md4c.c:4419
    #5 0x7f8e75e4fb28 in md_process_all_blocks /home/github/md4cg/md4c/md4c.c:4528
    #6 0x7f8e75e5b574 in md_process_doc /home/github/md4cg/md4c/md4c.c:5854
    #7 0x7f8e75e5b99c in md_parse /home/github/md4cg/md4c/md4c.c:5915
    #8 0x4045ac in md_render_html /home/github/md4cg/md2html/render_html.c:488
    #9 0x401b4a in process_file /home/github/md4cg/md2html/md2html.c:139
    #10 0x402394 in main /home//github/md4cg/md2html/md2html.c:343
    #11 0x7f8e75a7b82f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/github/md4cg/md4c/md4c.c:1491 md_build_attribute
Shadow bytes around the buggy address:
  0x0c067fff9db0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9dc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9dd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9de0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9df0: fa fa fa fa fa fa fa fa fa fa fa fa 00 00 00 00
=>0x0c067fff9e00:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9e10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9e20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9e30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9e40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c067fff9e50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
==26370==ABORTING

poc:https://github.com/xcainiao/poc/blob/master/md4c_heap-buffer-overflow_md4c

Erroneous UTF-16 Surrogate Decoding

The pertinent macros to detect and decode UTF-16 surrogate code units in md4c.c are (7d20152):

#define IS_UTF16_SURROGATE_HI(word)     (((WORD)(word) & 0xfc) == 0xd800)
#define IS_UTF16_SURROGATE_LO(word)     (((WORD)(word) & 0xfc) == 0xdc00)
#define UTF16_DECODE_SURROGATE(hi, lo)  ((((unsigned)(hi) & 0x3ff) << 10) | (((unsigned)(lo) & 0x3ff) << 0))

The constant 0xfc in the first two lines should read 0xfc00.

The cast to WORD seems pointless: the actual argument to these macros is always a wchar_t expression, which (in MSC) is promoted to 32-bit int without sign extension. (Furthermore, WORD is defined in <windows.h>, and it is currently the only name used from there!) Thus the defining expression could be:

((word) & 0xfc00) == 0xd800

The expression to compose the Unicode code point from the two 10-bit fragments omits the bias value 0x10000; it should read:

0x10000 + ((((hi) & 0x3ff) << 10) | ((lo) & 0x3ff))

or - using only required parentheses -:

0x10000 + ( ((hi) & 0x3ff) << 10  |  (lo) & 0x3ff )
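Putting the corrections above together, here is a small sketch with the 0xfc00 mask and the 0x10000 bias applied. The `decode_pair()` wrapper is a hypothetical helper added only to make the macros easy to exercise; it is not part of md4c.c.

```c
#include <assert.h>
#include <stdint.h>

/* Corrected masks: the top six bits select the surrogate range. */
#define IS_UTF16_SURROGATE_HI(word)     (((word) & 0xfc00) == 0xd800)
#define IS_UTF16_SURROGATE_LO(word)     (((word) & 0xfc00) == 0xdc00)

/* Corrected decoding: two 10-bit fragments plus the 0x10000 bias. */
#define UTF16_DECODE_SURROGATE(hi, lo)                                  \
    (0x10000 + ((((unsigned)(hi) & 0x3ff) << 10) | ((unsigned)(lo) & 0x3ff)))

/* Hypothetical wrapper: decode one surrogate pair into a code point. */
static uint32_t decode_pair(uint16_t hi, uint16_t lo)
{
    assert(IS_UTF16_SURROGATE_HI(hi) && IS_UTF16_SURROGATE_LO(lo));
    return (uint32_t) UTF16_DECODE_SURROGATE(hi, lo);
}
```

For example, the pair 0xD83D 0xDE00 decodes to U+1F600, and the extremes 0xD800 0xDC00 and 0xDBFF 0xDFFF map to U+10000 and U+10FFFF respectively.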

Provide a dynamic library

Hi! The Qt project is considering adding this library to be used in QTextDocument:

https://codereview.qt-project.org/#/c/214842/

As such distro maintainers like me would really prefer to have this library as a dynamic/shared one. Would you accept a patch for this?

Let the `entity_lookup()` function return UTF-32, not UTF-8

The fact that entity_lookup() currently returns the replacement text as a UTF-8 octet sequence is convenient for UTF-8 output (of course), but clumsy otherwise: when generating UTF-16 output, md2html would have to convert UTF-8 into a code point, and write the code point as one or two UTF-16 code units.

While the latter step is trivial, the former is just an unnecessary burden (and in fact, md2html.c won't replace entity references when generating UTF-16 right now).

A better approach would return UTF-32 from entity_lookup(): from this, a renderer could easily

output the replacement text in UTF-8,
output the replacement text in UTF-16,
output the replacement text in ASCII (using numerical character references for non-ASCII code points);
output the replacement text in Latin 1 (ditto, for code points beyond U+00FF).
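To illustrate why a code point is the more convenient interchange form, here is a sketch of the first two conversions in the list. Both `encode_utf8()` and `encode_utf16()` are hypothetical helper names for illustration; neither is an existing md4c or md2html function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8; returns the number of bytes written. */
static size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10ffff && !(cp >= 0xd800 && cp <= 0xdfff));
    if(cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    } else if(cp < 0x800) {
        out[0] = (unsigned char) (0xc0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3f));
        return 2;
    } else if(cp < 0x10000) {
        out[0] = (unsigned char) (0xe0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char) (0x80 | (cp & 0x3f));
        return 3;
    } else {
        out[0] = (unsigned char) (0xf0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char) (0x80 | (cp & 0x3f));
        return 4;
    }
}

/* Encode one code point as UTF-16; returns the number of code units. */
static size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10ffff && !(cp >= 0xd800 && cp <= 0xdfff));
    if(cp < 0x10000) {
        out[0] = (uint16_t) cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t) (0xd800 | (cp >> 10));
    out[1] = (uint16_t) (0xdc00 | (cp & 0x3ff));
    return 2;
}
```

With this split, a renderer producing ASCII or Latin 1 would only need to compare the code point against 0x7F or 0xFF and fall back to a numeric character reference.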

Angle brackets in link destinations

The spec recognizes two types of link destinations:

a sequence of zero or more characters between an opening < and a closing > that contains no spaces, line breaks, or unescaped < or > characters, or
a nonempty sequence of characters that does not include ASCII space or control characters, and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses. (Implementations may impose limits on parentheses nesting to avoid performance issues, but at least three levels of nesting should be supported.)

Although not explicitly stated, it's clear from various discussions (e.g. commonmark/cmark#193, commonmark/cmark#219) that if parsing with the type 1 fails, the parser should retry with type 2.

However, MD4C currently does that only at the link destination level, not for the whole link. (See function md_is_link_destination()).

Hence we parse correctly

[a](<te<st>)

But we fail with

[a](<x>X)

because <x> is seen as type 1, but the unexpected character X that follows then prevents the whole construct from being recognized as a link.
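The difference between the two inputs can be seen with a simplified scanner for type 1 destinations. This is only a sketch of the rule quoted from the spec, not the code of md_is_link_destination(); the name `scan_angle_destination()` is hypothetical, and escapes are handled only for `<`, `>` and the backslash itself.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of characters consumed by a type 1 destination,
 * including both angle brackets, or 0 if the text is not one. */
static size_t scan_angle_destination(const char* s)
{
    size_t i;

    assert(s != NULL);
    if(s[0] != '<')
        return 0;
    for(i = 1; s[i] != '\0'; i++) {
        if(s[i] == '\\' && (s[i+1] == '<' || s[i+1] == '>' || s[i+1] == '\\')) {
            i++;        /* skip the escaped character */
            continue;
        }
        if(s[i] == '>')
            return i + 1;
        if(s[i] == '<' || s[i] == ' ' || s[i] == '\n' || s[i] == '\r')
            return 0;   /* not allowed inside <...> */
    }
    return 0;           /* no closing '>' found */
}
```

For `<te<st>)` the scan fails right away, so falling back to type 2 at the destination level is enough. For `<x>X)` the scan succeeds and consumes three characters, and the problem only surfaces afterwards, when the caller finds `X` where it expects whitespace or `)`. At that point the whole link has to be retried with a type 2 destination.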

Use MD4C for syntax highlighter: Passing document parts / single blocks

Hi,

MD4C's SAX-like callbacks sound a lot more useful than the reference CMark AST implementation to implement syntax highlighting in an editor. Skimming the code, it appears only reference-style links and images would cause problems if the client does not pass the whole document but only parts of it, like the visible portion on screen or the last edited block of text. Did I get that right?

-- Christian

Many links in a single block take too much time.

This takes too much time to process:

$ python -c 'print("[a](/url) " * 50000)'

Are the sizes of these two arrays correct?

I'm using VC++2015, and the compiler complains about something. Most of them seem Ok to be ignored, but I'm worried about size of two arrays:

1210 static const CHAR open_str[9] = _T("<![CDATA[");

4287 static const CHAR indent_str[16] = _T("                ");

It says the arrays are not big enough to contain the terminating '\0's.

Is it on purpose?

Thanks!
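For context: C (unlike C++) allows a char array to be initialized with a string literal that exactly fills it, in which case no terminating '\0' is stored. Whether that is deliberate in md4c.c is for the maintainer to confirm. The sketch below only shows that such an array is safe as long as it is used with an explicit length. It uses plain `char` instead of `CHAR` and `_T()`, and `starts_with_cdata()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Exactly 9 characters, so the array holds no terminating '\0'.
 * This is valid C, but it would be rejected by a C++ compiler. */
static const char open_str[9] = "<![CDATA[";

/* Compare by explicit size; never pass open_str to strlen() or printf("%s"). */
static int starts_with_cdata(const char* text, size_t size)
{
    assert(text != NULL);
    return size >= sizeof(open_str)
        && memcmp(text, open_str, sizeof(open_str)) == 0;
}
```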

Assertion triggered

The input

***b* c*

triggers an assertion:

MD4C: ../md4c/md4c.c:2368: Assertion 'dummy->ch == 'D'' failed.

Extensions for Reddit flavor markdown

In particular, superscripts and subscripts like hoedown has, and autolinking of /r/ subreddits, would be nice.

Entities inside image contents are not rendered.

![alt text with *entity* &copy;](img.png 'title')

renders into

<p><img src="img.png" alt="alt text with entity " title="title"></p>

but it should render into

<p><img src="img.png" alt="alt text with entity ©" title="title"></p>

Heap buffer overflow in md_merge_lines()

command: ./md2html testfile

testcase: https://github.com/ChijinZ/security_advisories/blob/master/md4c-387bd02/crash_md_merge_lines

AddressSanitizer provided information as below:

=================================================================
==21464==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6040000000b1 at pc 0x00000054ff84 bp 0x7fff500be8d0 sp 0x7fff500be8c8
WRITE of size 1 at 0x6040000000b1 thread T0
    #0 0x54ff83 in md_merge_lines /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:878:18
    #1 0x54ff83 in md_merge_lines_alloc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:910
    #2 0x54ff83 in md_is_link_reference_definition_helper /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2154
    #3 0x532108 in md_is_link_reference_definition /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2215:15
    #4 0x532108 in md_consume_link_reference_definitions /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4648
    #5 0x532108 in md_end_current_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4694
    #6 0x52c7f7 in md_process_doc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5850:5
    #7 0x5202cb in md_parse /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:5917:11
    #8 0x51a7a8 in md_render_html /home/ubuntu/fuzz/test/md4c/md2html/render_html.c:488:12
    #9 0x5195cc in process_file /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:139:11
    #10 0x5195cc in main /home/ubuntu/fuzz/test/md4c/md2html/md2html.c:343
    #11 0x7fda7443a82f in __libc_start_main /build/glibc-Cl5G7W/glibc-2.23/csu/../csu/libc-start.c:291
    #12 0x41a668 in _start (/home/ubuntu/fuzz/test/md4c/build/md2html/md2html+0x41a668)

0x6040000000b1 is located 0 bytes to the right of 33-byte region [0x604000000090,0x6040000000b1)
allocated by thread T0 here:
    #0 0x4de898 in __interceptor_malloc /home/ubuntu/llvm/llvm-6.0.0.src/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:88
    #1 0x54e91f in md_merge_lines_alloc /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:904:22
    #2 0x54e91f in md_is_link_reference_definition_helper /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2154
    #3 0x532108 in md_is_link_reference_definition /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:2215:15
    #4 0x532108 in md_consume_link_reference_definitions /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4648
    #5 0x532108 in md_end_current_block /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:4694

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ubuntu/fuzz/test/md4c/md4c/md4c.c:878:18 in md_merge_lines
Shadow bytes around the buggy address:
0x0c087fff7fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c087fff7fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c087fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c087fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c087fff8000: fa fa fd fd fd fd fd fd fa fa 00 00 00 00 00 04
=>0x0c087fff8010: fa fa 00 00 00 00[01]fa fa fa fa fa fa fa fa fa
0x0c087fff8020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c087fff8030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c087fff8040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c087fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c087fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable:           00
Partially addressable: 01 02 03 04 05 06 07 
Heap left redzone:       fa
Freed heap region:       fd
Stack left redzone:      f1
Stack mid redzone:       f2
Stack right redzone:     f3
Stack after return:      f5
Stack use after scope:   f8
Global redzone:          f9
Global init order:       f6
Poisoned by user:        f7
Container overflow:      fc
Array cookie:            ac
Intra object redzone:    bb
ASan internal:           fe
Left alloca redzone:     ca
Right alloca redzone:    cb
==21464==ABORTING

Feature request: safe mode

This is a feature request for a “safe mode”: potentially-malicious content (such as certain URL schemes) is disallowed to prevent XSS attacks.

The output from this should be safely insertable into a webpage, without any further escaping or sanitization.
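One ingredient of such a mode would be a URL scheme allowlist. The following is only a rough sketch of that idea under stated assumptions: the function names and the allowlist (`http`, `https`, `mailto`) are hypothetical, nothing here is an existing MD4C or md2html feature, and only the scheme prefix of the URL is examined.

```c
#include <assert.h>
#include <stddef.h>

static char ascii_lower(char ch)
{
    return (ch >= 'A' && ch <= 'Z') ? (char) (ch - 'A' + 'a') : ch;
}

/* Case-insensitive comparison of url[0..scheme_len) with a lowercase name. */
static int scheme_equals(const char* url, size_t scheme_len, const char* name)
{
    size_t i;

    for(i = 0; i < scheme_len; i++) {
        if(name[i] == '\0' || ascii_lower(url[i]) != name[i])
            return 0;
    }
    return name[scheme_len] == '\0';
}

/* Returns 1 for relative references and allowlisted schemes, 0 otherwise. */
static int is_allowed_url(const char* url)
{
    static const char* const allowed[] = { "http", "https", "mailto" };
    size_t i, n;

    assert(url != NULL);
    for(n = 0; url[n] != '\0'; n++) {
        unsigned char ch = (unsigned char) url[n];

        if(ch <= 0x20 || ch == 0x7f)
            return 0;   /* whitespace or control character before the scheme ends */
        if(ch == ':')
            break;      /* url[0..n) is the scheme */
        if(ch == '/' || ch == '?' || ch == '#')
            return 1;   /* no scheme: relative reference */
    }
    if(url[n] == '\0')
        return 1;       /* no colon at all: relative reference */

    for(i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if(scheme_equals(url, n, allowed[i]))
            return 1;
    }
    return 0;
}
```

An allowlist is used rather than a blocklist of known-bad schemes because it fails closed for schemes nobody thought of. A complete safe mode would also have to deal with raw HTML in the input, which this sketch does not touch.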

A list item can begin with at most one blank line.

MD4C does not currently follow the rule of CommonMark spec 0.27 that

a list item can begin with at most one blank line.

Therefore we fail Example 241.

However when we literally implement the rule as per this patch:

diff --git a/md4c/md4c.c b/md4c/md4c.c
index aae2a63..a82d458 100644
--- a/md4c/md4c.c
+++ b/md4c/md4c.c
@@ -4833,6 +4833,18 @@ redo:
             ctx->last_line_has_list_loosening_effect = (n_parents > 0  &&
                     n_brothers + n_children == 0  &&
                     ctx->containers[n_parents-1].ch != _T('>'));
+
+            /* If current list item contains nothing but a single blank line
+             * and we would be second blank line in the same list item, then
+             * we end the list. */
+            if(n_parents > 0  &&  ctx->containers[n_parents-1].ch != _T('>')  &&
+               n_brothers + n_children == 0  &&  ctx->current_block == NULL  &&
+               ctx->n_block_bytes > sizeof(MD_BLOCK))
+            {
+                MD_BLOCK* top_block = (MD_BLOCK*) ((char*)ctx->block_bytes + ctx->n_block_bytes - sizeof(MD_BLOCK));
+                if(top_block->type == MD_BLOCK_LI)
+                    n_parents--;
+            }
         }
         goto done;
     } else {

then we stop passing Example 274.

After inspecting, it looks like a contradiction in the spec to me. Filed issue commonmark/commonmark-spec#443 for it.

Ref. definitions with same labels

[foo]: /foo
[qnptgbh]: /qnptgbh
[abgbrwcv]: /abgbrwcv
[abgbrwcv]: /abgbrwcv2
[abgbrwcv]: /abgbrwcv3
[abgbrwcv]: /abgbrwcv4
[alqadfgn]: /alqadfgn

translates to

<p><a href="/foo">foo</a>
<a href="/qnptgbh">qnptgbh</a>
<a href="/abgbrwcv2">abgbrwcv</a>
<a href="/alqadfgn">alqadfgn</a>
[axgydtdu]</p>

I.e. MD4C currently does not guarantee that the first reference definition with a given label is used.
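The CommonMark spec says that when several definitions match, the first one takes precedence. The sketch below shows that expected behavior with a plain linear search over the definitions in document order. The `REF_DEF` type and `lookup_first()` are hypothetical names, labels are matched with a simple `strcmp()` instead of the spec's case-insensitive normalization, and this says nothing about how MD4C actually stores its definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char* label;
    const char* dest;
} REF_DEF;

/* Definitions in document order, as in the input above. */
static const REF_DEF ref_defs[] = {
    { "foo",      "/foo" },
    { "qnptgbh",  "/qnptgbh" },
    { "abgbrwcv", "/abgbrwcv" },
    { "abgbrwcv", "/abgbrwcv2" },
    { "abgbrwcv", "/abgbrwcv3" },
    { "abgbrwcv", "/abgbrwcv4" },
    { "alqadfgn", "/alqadfgn" }
};

/* Returns the destination of the first matching definition, or NULL. */
static const char* lookup_first(const char* label)
{
    size_t i;

    assert(label != NULL);
    for(i = 0; i < sizeof(ref_defs) / sizeof(ref_defs[0]); i++) {
        if(strcmp(ref_defs[i].label, label) == 0)
            return ref_defs[i].dest;
    }
    return NULL;
}
```

If the definitions are sorted or hashed for faster lookup, the same guarantee requires a stable sort or keeping the original index as a tie-breaker.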

Broken permissive autolink just before end-of-file

$ echo -n 'http://example.com' | md2html/md2html --github

generates

<p><a href="http://example.com�">http://example.com</p>

I.e. there is garbage in the link URL and the link is not correctly ended with </a>.

make instructions look... peculiar... to me

https://github.com/mity/md4c/wiki/Building-MD4C says:

$ cd md4c
$ mkdir build
$ cmake -G ..
$ make

Shouldn't it be:

$ cd md4c
$ mkdir build
$ cd build
$ cmake ..
$ make

On my computer the second one works well, while the first one gives me an error that the generator ".." could not be found.

Parsing '[ (]([ (]([ (]([ (](...' takes quadratic time

$ python -c 'print("[ (](" * 20000)' | time md2html > /dev/null
0.75user 0.00system 0:00.76elapsed 97%CPU (0avgtext+0avgdata 3440maxresident)k
0inputs+0outputs (0major+550minor)pagefaults 0swaps
$ python -c 'print("[ (](" * 40000)' | time md2html > /dev/null
2.96user 0.00system 0:03.00elapsed 98%CPU (0avgtext+0avgdata 4956maxresident)k
0inputs+0outputs (0major+989minor)pagefaults 0swaps
$ python -c 'print("[ (](" * 80000)' | time md2html > /dev/null
12.22user 0.00system 0:12.28elapsed 99%CPU (0avgtext+0avgdata 8568maxresident)k
0inputs+0outputs (0major+1867minor)pagefaults 0swaps

(the 3rd case from #57)
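The "quadratic" reading of these timings can be checked with simple arithmetic: when the input size doubles, linear behavior gives a run-time ratio near 2 and quadratic behavior a ratio near 4. A small sketch (the helper name doubling_ratio is made up for this note):

```c
#include <assert.h>

/* Ratio of run times when the input size doubles: about 2 suggests linear
 * growth, about 4 suggests quadratic growth. */
static double doubling_ratio(double t_small, double t_large)
{
    assert(t_small > 0.0);
    return t_large / t_small;
}
```

Using the user times above, doubling_ratio(0.75, 2.96) and doubling_ratio(2.96, 12.22) come out at roughly 3.9 and 4.1, which is consistent with quadratic growth.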

Parsing '\``\``\``\``...' takes quadratic time

$ python -c 'print("\\``" * 20000)' | time md2html > /dev/null
0.81user 0.00system 0:00.85elapsed 95%CPU (0avgtext+0avgdata 2480maxresident)k
0inputs+0outputs (0major+333minor)pagefaults 0swaps
$ python -c 'print("\\``" * 40000)' | time md2html > /dev/null
3.70user 0.00system 0:03.76elapsed 98%CPU (0avgtext+0avgdata 3240maxresident)k
0inputs+0outputs (0major+551minor)pagefaults 0swaps
$ python -c 'print("\\``" * 80000)' | time md2html > /dev/null
15.23user 0.00system 0:15.35elapsed 99%CPU (0avgtext+0avgdata 5008maxresident)k
0inputs+0outputs (0major+991minor)pagefaults 0swaps

(the 2nd case from #57)

Feature request: support table cell merging

That is, it should be possible for a cell to span multiple columns and maybe even multiple rows.

For example MultiMarkdown does it like this:

https://fletcher.github.io/MultiMarkdown-5/tables.html

I'd suggest adding columnSpan and rowSpan to MD_BLOCK_TD_DETAIL.

I'm trying to use md4c to add Markdown support to Qt (the table patch is https://codereview.qt-project.org/#/c/214901/ ) and this is the main low-hanging fruit I've seen so far, since QTextTable supports it already.

Parsing <><><><><><>… takes quadratic time

$ python -c 'print("<>" * 10000)' | time md2html > /dev/null
1.41user 0.00system 0:01.45elapsed 96%CPU (0avgtext+0avgdata 1968maxresident)k
0inputs+0outputs (0major+229minor)pagefaults 0swaps
$ python -c 'print("<>" * 20000)' | time md2html > /dev/null
4.80user 0.00system 0:04.84elapsed 99%CPU (0avgtext+0avgdata 2608maxresident)k
0inputs+0outputs (0major+365minor)pagefaults 0swaps
$ python -c 'print("<>" * 40000)' | time md2html > /dev/null
19.52user 0.00system 0:19.65elapsed 99%CPU (0avgtext+0avgdata 3528maxresident)k
0inputs+0outputs (0major+640minor)pagefaults 0swaps

Flag MD_FLAG_NOHTMLSPANS also disables autolinks

MD_FLAG_NOHTMLSPANS is supposed to disable inline raw HTML and nothing else. However, it also disables (standard CommonMark) autolinks:

$ echo '<http://google.com>' | md2html 
<p><a href="http://google.com">http://google.com</a></p>

$ echo '<http://google.com>' | md2html --fno-html-spans
<p>&lt;http://google.com&gt;</p>
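The expected behavior can be sketched as two independent checks, with the autolink test running regardless of the raw-HTML flag. Everything below is illustrative: SK_FLAG_NOHTMLSPANS is a stand-in constant rather than MD4C's actual value, the function names are made up, and the autolink test is a simplified version of the CommonMark URI autolink rule (e-mail autolinks are left out).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define SK_FLAG_NOHTMLSPANS 0x1u   /* stand-in value, not MD4C's constant */

typedef enum { SK_TEXT, SK_AUTOLINK, SK_RAW_HTML } SK_KIND;

/* Simplified URI autolink test on the text between '<' and '>':
 * a 2-32 character scheme, ':', then no spaces, controls, '<' or '>'. */
static int is_uri_autolink(const char* s)
{
    size_t i = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[i]) || s[i] == '+' || s[i] == '.' || s[i] == '-')
        i++;
    if (i < 2 || i > 32 || s[i] != ':')
        return 0;
    for (i++; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c <= ' ' || c == 0x7f || c == '<' || c == '>')
            return 0;
    }
    return 1;
}

/* Classifies the text found between '<' and '>'. */
static SK_KIND classify_angle_span(const char* inner, unsigned flags)
{
    assert(inner != NULL);
    /* Autolinks are checked first and do not depend on the HTML flag. */
    if (is_uri_autolink(inner))
        return SK_AUTOLINK;
    if (flags & SK_FLAG_NOHTMLSPANS)
        return SK_TEXT;
    return SK_RAW_HTML;   /* a real parser would validate the tag syntax here */
}
```

With this ordering, http://google.com inside angle brackets is classified as an autolink whether or not the flag is set, while a tag such as b falls back to plain text only when the flag is set.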
