micromark / common-markup-state-machine
CMSM: Common markup state machine
Home Page: https://unifiedjs.com
This state machine is finite. Markdown, however, has extensions: mostly annoying, but in some cases hugely useful (GFM, MDX).
We can either a) define most useful extensions and hide them behind flags, b) support hooks for extensions to overwrite states and the like, c) figure out a way to allow backtracking and attempting a list of possibilities, or d) something else?
They all have downsides.
GFM tables are pretty straightforward:
| foo | bar |
| --- | --- |
| baz | bim |
renders as:
foo | bar |
---|---|
baz | bim |
But there are some caveats.
Exhibit A (no pipes):
abc
:-
renders as:
abc |
---|
Also a table 🤷♂️
Exhibit B (escaping in inline code):
| f\|oo |
| ------ |
| b `\|` az |
| b **\|** im |
renders as:
f|oo |
---|
b | az |
b | im |
Escapes cannot be used in CM inline code, or in GFM inline code, but they can apparently be used in inline code in table cells in GFM 🤷♂️
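The escaping caveat can be made concrete with a small sketch. It assumes a hypothetical helper, `splitTableRow`, that is not part of any spec or of micromark, and it only covers cell splitting: an escaped pipe never ends a cell, whatever inline construct it sits in.

```javascript
// Hypothetical helper (not from the spec): split one GFM table row into cells.
// A pipe preceded by a backslash does not end the cell, even inside inline code.
function splitTableRow(line) {
  const cells = []
  let cell = ''
  let index = 0

  while (index < line.length) {
    const char = line[index]

    if (char === '\\' && line[index + 1] === '|') {
      // Escaped pipe: keep a literal pipe and drop the backslash.
      cell += '|'
      index += 2
    } else if (char === '|') {
      // Unescaped pipe: the current cell ends here.
      cells.push(cell)
      cell = ''
      index++
    } else {
      cell += char
      index++
    }
  }

  cells.push(cell)

  // Leading and trailing pipes are optional: drop the empty edge cells they produce.
  if (cells.length > 1 && cells[0].trim() === '') cells.shift()
  if (cells.length > 1 && cells[cells.length - 1].trim() === '') cells.pop()

  return cells.map((d) => d.trim())
}
```

Because the split happens before inline parsing, the backslash is already gone by the time inline code in a cell is looked at, which would explain why the escape appears to work there.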
Define parsing of phrasing content (in ATX headings, Setext headings, and Paragraphs)
Undefined.
Defined
Blocks operate on an input stream. Phrasing is detected when blocks are closed. This means that we already have a token queue to operate on and can look ahead.
I think mostly this will be similar to how blocks can be parsed, but one challenge is that phrasing can span multiple lines.
In GFM, phrasing, when inside a list item, can render a new construct at its start, namely a checkbox!
This gets interesting, because the phrasing can be preceded by definitions.
I’ll use separate comments because there is a way to turn off checked checkboxes! 🙄
Exhibit A (definitions):
- [y]: a
[x] foo
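As a rough sketch of the checkbox check, assume the definitions have already been taken out of the content and only the remaining phrasing is inspected. `parseTaskMarker` is a hypothetical name, and the sketch deliberately simplifies: it accepts only a space, `x`, or `X` between the brackets.

```javascript
// Hypothetical helper: look for a task list marker at the start of the
// phrasing that remains in a list item after definitions are removed.
function parseTaskMarker(phrasing) {
  // `[ ]`, `[x]`, or `[X]`, followed by whitespace and then more content.
  const match = /^\[([ xX])\][ \t]+(?=\S)/.exec(phrasing)

  if (!match) {
    // No checkbox: `checked` is null and the phrasing is untouched.
    return {checked: null, rest: phrasing}
  }

  return {checked: match[1] !== ' ', rest: phrasing.slice(match[0].length)}
}
```

With this split of responsibilities, `[y]: a` is consumed as a definition first, and `[x] foo` is what reaches the checkbox check.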
(Tokeniser: the thing currently specced, which emits tokens. Adapter: a thing that consumes events and produces, for example, an HTML string or a syntax tree. Wording is WIP.)
We need something that knows that we are already in a blockquote, and that a lazy continuation line belongs to the content in that blockquote.
The tokeniser should deal with creating tokens, not with opening and closing blocks.
These two will integrate together though.
The HTML spec has a tokeniser and a tree construction mechanism.
Tree construction (as in, an AST or a CST) is something that can sit above micromark.
So what’s between the tokens and syntax-trees or an HTML string? What does it look like?
Say we’d have this markup (note the superfluous whitespace at the start, and ␠ is a space):
  > # asd #
  > para␠␠
  graph
The current state machine creates (or will create) something along these lines (token type, offset start and end in curlies, and value in parens):
whitespace {0, 2} (`  `)
marker {2, 3} (`>`)
whitespace {3, 4} (` `)
sequence {4, 5} (`#`)
whitespace {5, 6} (` `)
content {6, 9} (`asd`)
whitespace {9, 10} (` `)
sequence {10, 11} (`#`)
lineTerminator {11, 12} (`\n`)
whitespace {0, 2} (`  `)
marker {2, 3} (`>`)
whitespace {3, 4} (` `)
content {4, 8} (`para`)
whitespace {8, 10} (`  `)
lineTerminator {10, 11} (`\n`)
whitespace {0, 2} (`  `)
content {2, 7} (`graph`)
lineTerminator {7, 8} (`\n`)
end-of-file
A CST for that would look something along these lines (just an idea, not the thing to be discussed here):
document
  blockquote
    atxHeading
      lineStart
        whitespace {0, 2} (`  `)
        marker {2, 3} (`>`)
        whitespace {3, 4} (` `)
      fence
        sequence {4, 5} (`#`)
        whitespace {5, 6} (` `)
      phrasing
        text {6, 9} (`asd`)
      lineEnd
        fence
          whitespace {9, 10} (` `)
          sequence {10, 11} (`#`)
        lineTerminator {11, 12} (`\n`)
    paragraph
      lineStart
        whitespace {0, 2} (`  `)
        marker {2, 3} (`>`)
        whitespace {3, 4} (` `)
      phrasing
        text {4, 8} (`para`)
        hard-break
          line-end
            whitespace {8, 10} (`  `)
            lineTerminator {10, 11} (`\n`)
          lineStart
            whitespace {0, 2} (`  `)
        text {2, 7} (`graph`)
      lineEnd
        lineTerminator {7, 8} (`\n`)
What does the thing between the tokens and this tree (or an HTML string) look like?
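One possible first step for such a layer, sketched here only as a starting point for discussion: group the flat tokens into lines and decide, per line, whether it belongs to the blockquote. The names `token` and `resolveLines` are hypothetical, and so is the heuristic for "paragraph-like" (a line with `content` and no `sequence`); a real adapter would track open blocks rather than guess from token types.

```javascript
// Hypothetical token factory mirroring the token listing: type, start, end, value.
const token = (type, start, end, value) => ({type, start, end, value})

const tokens = [
  token('whitespace', 0, 2, '  '),
  token('marker', 2, 3, '>'),
  token('whitespace', 3, 4, ' '),
  token('sequence', 4, 5, '#'),
  token('whitespace', 5, 6, ' '),
  token('content', 6, 9, 'asd'),
  token('whitespace', 9, 10, ' '),
  token('sequence', 10, 11, '#'),
  token('lineTerminator', 11, 12, '\n'),
  token('whitespace', 0, 2, '  '),
  token('marker', 2, 3, '>'),
  token('whitespace', 3, 4, ' '),
  token('content', 4, 8, 'para'),
  token('whitespace', 8, 10, '  '),
  token('lineTerminator', 10, 11, '\n'),
  token('whitespace', 0, 2, '  '),
  token('content', 2, 7, 'graph'),
  token('lineTerminator', 7, 8, '\n')
]

// Group the queue into lines, then mark which lines sit in the blockquote,
// either through their own marker or as a lazy continuation of a paragraph.
function resolveLines(queue) {
  const lines = []
  let current = []

  for (const item of queue) {
    current.push(item)
    if (item.type === 'lineTerminator') {
      lines.push(current)
      current = []
    }
  }

  // A final line without a line terminator still counts.
  if (current.length > 0) lines.push(current)

  let previous = null

  return lines.map((line) => {
    const types = line.map((item) => item.type)
    const hasMarker = types.includes('marker')
    // Assumption for this sketch: content without a heading sequence is a paragraph line.
    const paragraphLike = types.includes('content') && !types.includes('sequence')
    const lazy =
      !hasMarker &&
      paragraphLike &&
      previous !== null &&
      previous.inBlockquote &&
      previous.paragraphLike
    const result = {inBlockquote: hasMarker || lazy, lazy, paragraphLike}
    previous = result
    return result
  })
}
```

For the three lines of the example, the third line has no marker but follows a paragraph line inside the blockquote, so the sketch reports it as a lazy continuation that still belongs to the blockquote.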
Define parsing of content (a content group can result in zero or more definitions and zero or one of either a Paragraph or a Setext heading)
Undefined.
Defined
Similarly to phrasing, figuring out what is a definition, and what is phrasing, operates on tokens instead of a character stream.
The spec is pretty clear about strikethrough: it’s like emphasis, but it can only start and end with two tildes. EXCEPT!
Exhibit A through E:
alpha ~foo~
bravo ~~bar~~
charlie ~~~baz~~~
delta ~~echo ~foxtrot~ golf~~
hotel ~~india ~~juliett~~ kilo~~
Rendered, the tildes are consumed on every line except charlie:
alpha foo
bravo bar
charlie ~~~baz~~~
delta echo foxtrot golf
hotel india juliett kilo
Exhibit F:
~~~baz~~~
^-- Note this is an opening fenced code block line, that is not closed (hence why this text is part of it)
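The observations in these exhibits can be sketched as two checks. Both function names are hypothetical, and the rule "runs of one or two tildes can delimit, three or more cannot" is read off the exhibits rather than taken from the spec text. Matching openers to closers is left out entirely.

```javascript
// Find every run of tildes and note whether, going by the exhibits,
// a run of that size could act as a strikethrough delimiter.
function tildeRuns(text) {
  const runs = []
  const expression = /~+/g
  let match

  while ((match = expression.exec(text)) !== null) {
    runs.push({
      start: match.index,
      size: match[0].length,
      canDelimit: match[0].length <= 2
    })
  }

  return runs
}

// Exhibit F: three or more tildes at the start of a line (after at most
// three spaces of indentation) open a fenced code block instead.
function opensFencedCode(line) {
  return /^ {0,3}~{3,}/.test(line)
}
```

The block-level check has to run first: `~~~baz~~~` at the start of a line never reaches phrasing, while the same text after `charlie ` does and stays literal.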
The queue is a flat list of tokens.
Sometimes, it is already known that some tokens can be grouped together (notably escapes, entity references, or character references?).
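A minimal sketch of such grouping, for escapes only, assuming a queue of single-character tokens shaped like `{type, value}`. That token shape, the `escape` group type, and the name `groupEscapes` are all assumptions made for illustration.

```javascript
// ASCII punctuation, the set of characters a backslash can escape in CommonMark.
const asciiPunctuation = /^[!-\/:-@\[-`{-~]$/

// Walk the flat queue and wrap `\` plus a following punctuation token in one group.
function groupEscapes(queue) {
  const result = []
  let index = 0

  while (index < queue.length) {
    const current = queue[index]
    const next = queue[index + 1]

    if (
      current.value === '\\' &&
      next !== undefined &&
      asciiPunctuation.test(next.value)
    ) {
      // Keep the original tokens as children so no positional info is lost.
      result.push({type: 'escape', children: [current, next]})
      index += 2
    } else {
      result.push(current)
      index++
    }
  }

  return result
}
```

The queue stays flat apart from the groups, so later steps can still look ahead over it.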
Undefined
Defined
GFM gives a couple of extra ways to add links.
Exhibit A: they can start with www.
www.commonmark.org
Exhibit B: or with http:// or https://
http://commonmark.org
https://commonmark.org
Exhibit C: or include an @; a lot of things can come before and after it:
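For exhibits A and B only, detection at the start of a candidate could be sketched as follows. The function names are hypothetical, the sketch ignores domain validation and trailing punctuation, and it does not attempt the @ case at all.

```javascript
// Classify the start of a candidate autolink literal (exhibits A and B only).
function autolinkLiteralKind(text) {
  if (/^www\./.test(text)) return 'www'
  if (/^https?:\/\//.test(text)) return 'protocol'
  return null
}

// GFM inserts the `http` scheme for `www.` literals; others keep their own scheme.
function toHref(text) {
  const kind = autolinkLiteralKind(text)
  if (kind === null) return null
  return kind === 'www' ? 'http://' + text : text
}
```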
MDX support consists of a couple of things:
For MDX to work, “normal” block and inline HTML parsing has to be turned off. As MDX would include autolinks (<http://example.com>: http://example.com), it is probably better to split the HTML or autolink states in two and support split points / retreating / branching.
Markdown is whitespace sensitive, line-by-line, whereas JSX is whitespace insensitive, and well, not line-by-line.
There is a WIP, not very maintained, spec for JSX, which we could properly parse, but this leaves the question open whether we support JSX in its entirety, such as this weird example:
Exhibit A:
<Dropdown x={1 /**/ + 2}
> /*
A dropdown list */
<Menu> // some stuff
<MenuItem>Do Something</MenuItem>
</Menu>
</Dropdown>
^-- Note blank lines, indentation, weird comments, etc?
JS(X) can also contain errors, of course. Do we throw? Do we continue parsing?
Exhibit B:
<Dropdown x={} />
<Dropdown>
A dropdown list
<Menu><MenuItem>Do Something</MenuItem></Menu>
# Heading
Currently, MDX exposes a literal for JSX nodes: {type: 'jsx', value: '<Dropdown>…</Dropdown>'}. What if we’d expose the JSX tree similar to how Acorn/Babel would parse the JSX?
/cc @johno
Hello! 👋 First of all, I’d like to let y’all know that I really appreciate your work towards evolving Markdown.
I am not sure whether this issue is appropriate on this repo, or whether it should go on the micromark repo.
I have found this project while looking for ways to highlight Markdown syntax in a way compliant with CommonMark. Again, it’s really neat, and I appreciate your work towards it!
However, for this use‐case, something that appears to be missing is the ability to perform partial modifications to Markdown source. It doesn’t really matter that much to me if it takes a while to highlight a large document for the first time, but it would be ideal if subsequent modifications could be performed rapidly.
I was wondering whether it’d make sense and be possible to account for that kind of use‐case, or if it is out of scope for this project.
Another interesting project is ToastMark by the Toast UI team. The problem I found with it, though, is that it produces an AST, which makes it kinda awkward to use for syntax highlighting. It also appears not to be very customizable, so one wouldn’t be able to e.g. remove GFM extensions, or use other kinds of plugins.
Sometimes, tokens in the queue are transformed to content. This isn’t defined, but should be.
Undefined
Defined
Currently, block-level elements can’t really be nested (the blockquote is broken: they can only be opened, never closed).
Markdown allows blockquotes and lists/list-items to be nested. To continue the nesting, either a) an angle bracket (>) or whitespace (␠␠␠␠) is used, or b) the content doesn’t need that. 😅
We already have the stack of open groups, and we currently treat the current group as the group to add to. But if we are, say, a blockquote and a list item deep, while the previous line was a blockquote, list item, and blockquote deep, we should instead operate on the list item.
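That rule can be sketched with two small functions. The names are hypothetical, containers are plain strings here, and the sketch ignores lazy continuation and containers newly opened on the current line.

```javascript
// How many of the open containers (outermost first) does the current line continue?
function continuedDepth(open, matched) {
  let depth = 0

  while (
    depth < open.length &&
    depth < matched.length &&
    open[depth] === matched[depth]
  ) {
    depth++
  }

  return depth
}

// Close whatever the current line does not continue and report the group to add to.
function closeUnmatched(open, matched) {
  const depth = continuedDepth(open, matched)
  const remaining = open.slice(0, depth)

  return {
    // Innermost containers close first.
    closed: open.slice(depth).reverse(),
    remaining,
    current: remaining.length > 0 ? remaining[remaining.length - 1] : 'document'
  }
}
```

For the case above, `closeUnmatched(['blockquote', 'listItem', 'blockquote'], ['blockquote', 'listItem'])` closes the inner blockquote and leaves the list item as the group to operate on.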
Undefined.
Defined
Not all HTML can be created. A couple of tags are ignored.
Exhibit A:
alpha <title> bravo
<title>
<plaintext charlie="delta">
renders as:
alpha <title> bravo
<title> <plaintext charlie="delta">
Note that GitHub additionally filters HTML later, with this config.
Also note the missing whitespace between title and plaintext, because there are no <p>s around it. So the tag filter doesn’t seem to ignore block HTML in this case, let it parse as a paragraph, and then also filter the inline HTML. Instead, it has some form of an “invalid HTML” construct.
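For reference, the filtering described in the GFM spec (replacing the leading `<` of a disallowed tag with `&lt;`) can be sketched as a plain string replacement over HTML output. `tagFilter` is a hypothetical name, and also filtering closing tags is an assumption of this sketch. It does not model the “invalid HTML” construct suspected above.

```javascript
// Tag names listed under "Disallowed Raw HTML" in the GFM spec.
const disallowed =
  /<(?=\/?(?:title|textarea|style|xmp|iframe|noembed|noframes|script|plaintext)(?:[\s/>]|$))/gi

// Replace the leading `<` of a disallowed tag with `&lt;`, case-insensitively.
function tagFilter(html) {
  return html.replace(disallowed, '&lt;')
}
```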