micromark / common-markup-state-machine

CMSM: Common markup state machine

Home Page: https://unifiedjs.com

JavaScript 100.00%

markup

common-markup-state-machine's Issues

Extensions

This state machine is finite. Markdown, however, has extensions: mostly annoying, but in some cases hugely useful (GFM, MDX).

We can either a) define most useful extensions and hide them behind flags, b) support hooks for extensions to overwrite states and the like, c) figure out a way to allow backtracking and attempting a list of possibilities, or d) something else?

They all have downsides.
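As an illustration of option b), here is a minimal sketch of a state table that extensions can overwrite. Everything in it (`createMachine`, the `data` state, the `tildeExtension`) is hypothetical and not part of CMSM or micromark; it only shows the shape such hooks could take.

```javascript
// Hypothetical sketch: states live in a plain object, so an extension can
// replace a state and still delegate to the one it replaced.
function createMachine(extensions = []) {
  const states = {
    // Assumed base state: every character becomes a `data` token.
    data(code, emit) {
      emit({type: 'data', value: code})
      return 'data'
    }
  }

  for (const extension of extensions) {
    // Pass a copy, so the extension keeps access to the previous states.
    Object.assign(states, extension({...states}))
  }

  return function run(input) {
    const tokens = []
    let state = 'data'
    for (const code of input) {
      state = states[state](code, (token) => tokens.push(token))
    }
    return tokens
  }
}

// Hypothetical extension: give `~` its own token, defer everything else.
const tildeExtension = (base) => ({
  data(code, emit) {
    if (code === '~') {
      emit({type: 'tilde', value: code})
      return 'data'
    }
    return base.data(code, emit)
  }
})

console.log(createMachine([tildeExtension])('a~'))
```

One downside is already visible in the sketch: the order in which extensions overwrite the same state matters, and nothing here resolves conflicts between them.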

GFM extension: tables

GFM tables are pretty straightforward:

| foo | bar |
| --- | --- |
| baz | bim |

foo	bar
baz	bim

But there are some caveats.

Exhibit A (no pipes):

abc
:-

abc

Also a table 🤷‍♂️
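The pipe-less case can be sketched as a check on the delimiter row alone. This is a simplified, assumed reading of the GFM rules (cells made of hyphens with optional leading or trailing colons); `parseDelimiterRow` is a hypothetical helper, and it does not check that the header row has the same number of cells.

```javascript
// Hypothetical helper: returns the column alignments when `line` looks like
// a GFM delimiter row, or `null` when it does not.
function parseDelimiterRow(line) {
  let value = line.trim()
  // Leading and trailing pipes are optional.
  if (value.startsWith('|')) value = value.slice(1)
  if (value.endsWith('|')) value = value.slice(0, -1)

  const cells = value.split('|').map((cell) => cell.trim())
  if (!cells.every((cell) => /^:?-+:?$/.test(cell))) return null

  return cells.map((cell) => {
    const left = cell.startsWith(':')
    const right = cell.endsWith(':')
    if (left && right) return 'center'
    if (left) return 'left'
    if (right) return 'right'
    return null
  })
}

console.log(parseDelimiterRow('| --- | --- |'))
console.log(parseDelimiterRow(':-'))
console.log(parseDelimiterRow('abc'))
```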

Exhibit B (escaping in inline code):

| f\|oo  |
| ------ |
| b `\|` az |
| b **\|** im |

f\|oo
b `\|` az
b \| im

Escapes cannot be used in CM inline code, or in GFM inline code, but they can apparently be used in inline code in table cells in GFM 🤷‍♂️
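One way to get this behaviour is to split a row into cells before any phrasing is parsed, so that `\|` is unescaped regardless of what surrounds it. A rough sketch, assuming that is how it works (`splitCells` is a hypothetical helper, not an existing API):

```javascript
// Hypothetical helper: split a table row on unescaped pipes. `\|` becomes a
// literal `|` in the cell, even inside what later turns out to be code.
function splitCells(row) {
  const cells = []
  let cell = ''

  for (let index = 0; index < row.length; index++) {
    const char = row[index]
    if (char === '\\' && row[index + 1] === '|') {
      cell += '|'
      index++ // Skip the escaped pipe.
    } else if (char === '|') {
      cells.push(cell)
      cell = ''
    } else {
      cell += char
    }
  }

  cells.push(cell)

  // Drop the empty cells produced by optional leading and trailing pipes.
  if (cells.length > 1 && cells[0].trim() === '') cells.shift()
  if (cells.length > 1 && cells[cells.length - 1].trim() === '') cells.pop()

  return cells.map((value) => value.trim())
}

console.log(splitCells('| b `\\|` az |'))
```

Because the split happens first, the inline code span only ever sees `|`, which would explain why the escape appears to work there.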

Phrasing

Subject of the feature

Define parsing of phrasing content (in ATX headings, Setext headings, and Paragraphs)

Problem

Undefined.

Expected behaviour

Defined

Considerations

Blocks operate on an input stream. Phrasing is detected when blocks are closed. This means that we already have a token queue to operate on and can look ahead.

I think mostly this will be similar to how blocks can be parsed, but one challenge is that phrasing can span multiple lines.

GFM extension: task lists

In GFM, phrasing, when inside a list item, can render a new construct at its start, namely a checkbox!

This gets interesting, because the phrasing can be preceded by definitions.

I’ll use separate comments because there is a way to turn off checked checkboxes! 🙄

Exhibit A (definitions):

- [y]: a
  [x] foo

The thing between the tokeniser and an adapter

Terms

(tokeniser: the thing currently specced, that emits tokens; adapter: a thing that consumes events and produces e.g. an HTML string or a ST, wording WIP).

Problem

We need something that knows that we are already in a blockquote, and that a lazy continuation line belongs to the content in that blockquote.
The tokeniser should deal with creating tokens, not with opening and closing blocks.
These two will integrate together though.
The HTML spec has a tokeniser and a tree construction mechanism.
Tree construction (as in, an AST or a CST) is something that can sit above micromark.

So what’s between the tokens and syntax-trees or an HTML string? What does it look like?

Example

Say we’d have this markup (note the superfluous whitespace at the start, and ␠ is a space):

  > # asd #
  > para␠␠
  graph

The current state machine creates (or will create) something along these lines (token type, offset start and end in curlies, and value in parens):

whitespace {0, 2} (`  `)
marker {2, 3} (`>`)
whitespace {3, 4} (` `)
sequence {4, 5} (`#`)
whitespace {5, 6} (` `)
content {6, 9} (`asd`)
whitespace {9, 10} (` `)
sequence {10, 11} (`#`)
lineTerminator {11, 12} (`\n`)
whitespace {0, 2} (`  `)
marker {2, 3} (`>`)
whitespace {3, 4} (` `)
content {4, 8} (`para`)
whitespace {8, 10} (`  `)
lineTerminator {10, 11} (`\n`)
whitespace {0, 2} (`  `)
content {2, 7} (`graph`)
lineTerminator {7, 8} (`\n`)
end-of-file

A CST for that would look something along these lines (just an idea, not the thing to be discussed here):

document
  blockquote
    atxHeading
      lineStart
        whitespace {0, 2} (`  `)
        marker {2, 3} (`>`)
        whitespace {3, 4} (` `)
        fence
          sequence {4, 5} (`#`)
          whitespace {5, 6} (` `)
      phrasing
        text {6, 9} (`asd`)
      lineEnd
        fence
          whitespace {9, 10} (` `)
          sequence {10, 11} (`#`)
        lineTerminator {11, 12} (`\n`)
    paragraph
      lineStart
        whitespace {0, 2} (`  `)
        marker {2, 3} (`>`)
        whitespace {3, 4} (` `)
      phrasing
        text {4, 8} (`para`)
        hard-break
          line-end
            whitespace {8, 10} (`  `)
            lineTerminator {10, 11} (`\n`)
          lineStart
            whitespace {0, 2} (`  `)
        text {2, 7} (`graph`)
      lineEnd
        lineTerminator {7, 8} (`\n`)

What does the thing between the tokens and this tree (or an HTML string) look like?
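As a starting point for discussion only, here is a sketch of one small piece of such a layer: it walks the flat token list line by line and decides whether each line belongs to the blockquote through its marker or through lazy continuation. The token shape (`{type, value}`, offsets and the end-of-file token left out) and the names `splitLines` and `classifyLines` are assumptions, not part of the spec.

```javascript
// The token list from the example, without offsets and end-of-file.
const tokens = [
  {type: 'whitespace', value: '  '},
  {type: 'marker', value: '>'},
  {type: 'whitespace', value: ' '},
  {type: 'sequence', value: '#'},
  {type: 'whitespace', value: ' '},
  {type: 'content', value: 'asd'},
  {type: 'whitespace', value: ' '},
  {type: 'sequence', value: '#'},
  {type: 'lineTerminator', value: '\n'},
  {type: 'whitespace', value: '  '},
  {type: 'marker', value: '>'},
  {type: 'whitespace', value: ' '},
  {type: 'content', value: 'para'},
  {type: 'whitespace', value: '  '},
  {type: 'lineTerminator', value: '\n'},
  {type: 'whitespace', value: '  '},
  {type: 'content', value: 'graph'},
  {type: 'lineTerminator', value: '\n'}
]

// Group the flat list into lines, each ending at its line terminator.
function splitLines(list) {
  const lines = [[]]
  for (const token of list) {
    lines[lines.length - 1].push(token)
    if (token.type === 'lineTerminator') lines.push([])
  }
  if (lines[lines.length - 1].length === 0) lines.pop()
  return lines
}

// Label each line as `blockquote` (has a marker), `lazy` (no marker, but
// continues a paragraph inside the blockquote), or `top-level`.
function classifyLines(list) {
  let inQuotedParagraph = false

  return splitLines(list).map((line) => {
    const significant = line.filter(
      (token) => token.type !== 'whitespace' && token.type !== 'lineTerminator'
    )
    const first = significant[0]
    const hasMarker = Boolean(first) && first.type === 'marker' && first.value === '>'
    const rest = hasMarker ? significant.slice(1) : significant
    const startsWithContent = rest.length > 0 && rest[0].type === 'content'

    let kind = 'top-level'
    if (hasMarker) kind = 'blockquote'
    else if (inQuotedParagraph && startsWithContent) kind = 'lazy'

    inQuotedParagraph = kind !== 'top-level' && startsWithContent
    return kind
  })
}

console.log(classifyLines(tokens))
```

The sketch keeps block knowledge (are we in a blockquote, is a paragraph open) out of the tokeniser, which is the separation the problem statement asks for. It handles a single level of blockquote and nothing else.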

Content

Subject of the feature

Define parsing of content (a content group can result in zero or more definitions and zero or one of either a Paragraph or a Setext heading)

Problem

Undefined.

Expected behaviour

Defined

Considerations

Similarly to phrasing, figuring out what is a definition, and what is phrasing, operates on tokens instead of a character stream.
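A heavily simplified sketch of that split: it peels single-line definitions off the front of a content group and leaves the rest for a paragraph or Setext heading. It works on strings rather than tokens for brevity, ignores titles and multi-line definitions, and `splitContent` is a hypothetical name.

```javascript
// Simplified assumption: a definition is one line of `[label]: destination`.
const definitionLine = /^ {0,3}\[([^\[\]]+)\]:[ \t]*(\S+)[ \t]*$/

// Hypothetical helper: returns leading definitions and the remaining lines.
function splitContent(lines) {
  const definitions = []
  let index = 0

  while (index < lines.length) {
    const match = definitionLine.exec(lines[index])
    if (!match) break
    definitions.push({label: match[1], destination: match[2]})
    index++
  }

  return {definitions, rest: lines.slice(index)}
}

console.log(splitContent(['[y]: a', '[x] foo']))
```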

GFM extension: strikethrough

The spec is pretty clear about strikethrough: it’s like emphasis, but it can only start and end with two tildes. EXCEPT!

Exhibit A through E:

alpha ~foo~

bravo ~~bar~~

charlie ~~~baz~~~

delta ~~echo ~foxtrot~ golf~~

hotel ~~india ~~juliett~~ kilo~~

alpha ~~foo~~

bravo ~~bar~~

charlie ~~~baz~~~

delta ~~echo ~~foxtrot~~ golf~~

hotel ~~india ~~juliett~~ kilo~~

Exhibit F:

~~~baz~~~


^-- Note this is an opening fenced code block line, that is not closed (hence why this text is part of it)
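Going only by the exhibits above (runs of one or two tildes strike, runs of three do not), a first step could be to scan for tilde runs and flag which ones may act as delimiters. This is an assumption drawn from the observed behaviour, not from the spec, and `tildeRuns` is a hypothetical helper; matching openers to closers is left out.

```javascript
// Hypothetical helper: list every run of tildes with its position and size.
function tildeRuns(value) {
  const expression = /~+/g
  const runs = []
  let match

  while ((match = expression.exec(value))) {
    runs.push({
      start: match.index,
      size: match[0].length,
      // Assumed from the exhibits: only runs of one or two tildes strike.
      canStrike: match[0].length <= 2
    })
  }

  return runs
}

console.log(tildeRuns('alpha ~foo~'))
console.log(tildeRuns('charlie ~~~baz~~~'))
```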

Grouping in the queue

Subject of the feature

The queue is a flat list of tokens.
Sometimes, it is already known that some tokens can be grouped together (notably escapes, entity references, or character references?)

Problem

Undefined

Expected behaviour

Defined
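For escapes, grouping could look roughly like the sketch below: a pass over the flat queue that wraps a backslash and the punctuation after it in one group. The token types `backslash` and `punctuation`, the `escape` group, and `groupEscapes` are all hypothetical names chosen for illustration.

```javascript
// Hypothetical pass: wrap `backslash` + `punctuation` pairs in an `escape`
// group, leaving every other token where it is.
function groupEscapes(queue) {
  const result = []

  for (let index = 0; index < queue.length; index++) {
    const token = queue[index]
    const next = queue[index + 1]

    if (token.type === 'backslash' && next && next.type === 'punctuation') {
      result.push({type: 'escape', children: [token, next]})
      index++ // The next token is now part of the group.
    } else {
      result.push(token)
    }
  }

  return result
}

console.log(
  groupEscapes([
    {type: 'content', value: 'a'},
    {type: 'backslash', value: '\\'},
    {type: 'punctuation', value: '*'}
  ])
)
```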

GFM extension: extended autolinks (aka literal urls)

GFM gives a couple of extra ways to add links.

Exhibit A: they can start with www.

www.commonmark.org

www.commonmark.org

Exhibit B: or with http:// or https://

http://commonmark.org

https://commonmark.org

http://commonmark.org

https://commonmark.org

Exhibit C: or include an @, a lot of things can be before and after it:

[email protected]
[email protected]

[email protected]
[email protected]
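Exhibits A and B can be approximated with a single scan. The sketch assumes the GFM rules that a literal URL starts at the beginning of the input, after whitespace, or after `*`, `_`, `~`, or `(`, and that trailing punctuation is not part of the link. It skips domain validation and the email case entirely, and `findLiteralUrls` is a hypothetical helper.

```javascript
// Hypothetical helper: find `www.`, `http://`, and `https://` literals.
function findLiteralUrls(value) {
  const expression = /(^|[\s*_~(])((?:https?:\/\/|www\.)[^\s<]+)/g
  const results = []
  let match

  while ((match = expression.exec(value))) {
    // Trailing punctuation is assumed not to belong to the link.
    results.push(match[2].replace(/[?!.,:*_~]+$/, ''))
  }

  return results
}

console.log(findLiteralUrls('www.commonmark.org'))
console.log(findLiteralUrls('Visit http://commonmark.org.'))
```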

MDX extension: JSX

MDX support consists of a couple of things:

Ignore HTML parsing

For MDX to work, “normal” block and inline HTML parsing has to be turned off. As MDX would include autolinks (<http://example.com>: http://example.com), it is probably better to split the HTML or autolink states in two and support split points / retreating / branching

Interleaving

Markdown is whitespace sensitive, line-by-line, whereas JSX is whitespace insensitive, and well, not line-by-line.
There is a WIP, not very maintained, spec for JSX, which we could properly parse, but this leaves the question open whether we support JSX in its entirety, such as this weird example:

Exhibit A:

<Dropdown x={1 /**/ + 2}
> /*
  A dropdown list */

  <Menu> // some stuff
  
  	<MenuItem>Do Something</MenuItem>

  </Menu>
</Dropdown>

^-- Note blank lines, indentation, weird comments, etc?

Errors (“invalid” JSX)

JS(X) also has errors, of course. Do we throw errors? Do we continue parsing?

Exhibit B:

<Dropdown x={} />

<Dropdown>
  A dropdown list
  <Menu><MenuItem>Do Something</MenuItem></Menu>

# Heading

Exposing nodes

Currently, MDX exposes a literal for JSX nodes: {type: 'jsx', value: '<Dropdown>…</Dropdown>'}. What if we’d expose the JSX tree similar to how Acorn/Babel would parse the JSX?

/cc @johno

incremental parsing

Hello! 👋 First of all, I’d like to let y’all know that I really appreciate your work towards evolving Markdown.

I am not sure whether this issue is appropriate on this repo, or whether it should go on the micromark repo.

I found this project while looking for ways to highlight Markdown syntax in a way that complies with CommonMark. Again, it’s really neat, and I appreciate your work towards it!

However, for this use‐case, something that appears to be missing is the ability to perform partial modifications to Markdown source. It doesn’t really matter that much to me if it takes a while to highlight a large document for the first time, but it would be ideal if subsequent modifications could be performed rapidly.

I was wondering whether it’d make sense and be possible to account for that kind of use‐case, or if it is out of scope for this project.

Another interesting project is ToastMark by the Toast UI team. The problem I found with it, though, is that it produces an AST, which makes it kinda awkward to use for syntax highlighting. It also appears to not be very customizable, so one wouldn’t be able to e.g. remove GFM extensions, or use other kinds of plugins.

Turning tokens into content

Subject of the feature

Sometimes, tokens in the queue are transformed to content, this isn’t defined but should be.

Problem

Undefined

Expected behaviour

Defined

Stack of continuation

Subject of the feature

Currently, block-level elements can’t really be nested (the blockquote is broken: they can only be opened, never closed).

Markdown allows blockquotes and lists/list-items to be nested. To continue the nesting, either a) an angle bracket (>) or whitespace (␠␠␠␠) is used, or b) the content doesn’t need that (a lazy continuation line). 😅

We already have the stack of open groups, and are now treating the current group as the group to add to, but if we are say, a blockquote and a list item deep, while the previous line was a blockquote, list item, and blockquote deep, we should instead operate on the list item.
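The situation in the last paragraph can be sketched as a comparison between the stack of open containers and the containers the current line continues. The container names and `currentContainer` are hypothetical, and lazy continuation (case b above) is not handled here.

```javascript
// Hypothetical helper: walk both lists from the outside in, and return how
// many containers stay open plus the one to operate on.
function currentContainer(open, line) {
  let depth = 0

  while (
    depth < open.length &&
    depth < line.length &&
    open[depth] === line[depth]
  ) {
    depth++
  }

  return {
    depth,
    // `null` means we are back at the document level.
    container: depth > 0 ? open[depth - 1] : null,
    // Containers past `depth` were open on the previous line and now close.
    closed: open.slice(depth)
  }
}

console.log(
  currentContainer(
    ['blockquote', 'listItem', 'blockquote'],
    ['blockquote', 'listItem']
  )
)
```

For the example in the text this yields the list item as the container to operate on, with the innermost blockquote closed.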

Problem

Undefined.

Expected behaviour

Defined

Considerations

See CommonMark’s appendix

GFM extension: tag filter

Not all HTML can be created. A couple of tags are ignored.

Exhibit A:

alpha <title> bravo

<title>

<plaintext charlie="delta">

alpha <title> bravo

Note that GitHub additionally filters HTML later, with this config

Also note the missing whitespace between title and plaintext, because there are no <p>s around it. So the tag filter doesn’t seem to work by ignoring block HTML, letting it parse as a paragraph, and then filtering the inline HTML as well. Instead, it has some form of an “invalid HTML” construct.
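For reference, the filter described in the GFM spec replaces the leading `<` of nine tag names with `&lt;`. A rough sketch of that as a post-processing step on an HTML string follows; how the check for the end of the tag name is done here is an assumption, and `tagFilter` is a hypothetical helper rather than an existing API.

```javascript
// The tag names listed by the GFM tag filter extension.
const filteredTags = [
  'title',
  'textarea',
  'style',
  'xmp',
  'iframe',
  'noembed',
  'noframes',
  'script',
  'plaintext'
]

// Match `<` (optionally followed by `/`) and one of the names, as long as
// the name ends there (assumed: whitespace, `>`, `/`, or end of input).
const filteredTagExpression = new RegExp(
  '<(\\/?(?:' + filteredTags.join('|') + ')(?=[\\s>\\/]|$))',
  'gi'
)

// Hypothetical helper: escape the opening `<` of filtered tags.
function tagFilter(html) {
  return html.replace(filteredTagExpression, '&lt;$1')
}

console.log(tagFilter('alpha <title> bravo'))
console.log(tagFilter('<plaintext charlie="delta">'))
```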
