
syntax-tree / hast-util-to-mdast


utility to transform hast (HTML) to mdast (markdown)

Home Page: https://unifiedjs.com

License: MIT License

JavaScript 100.00%
html markdown hast mdast hast-util mdast-util unist

hast-util-to-mdast's People

Contributors

christianmurphy, crossjs, jeffal, jounqin, lxcid, lyzidiamond, macklinu, mr0grog, sethvincent, triaeiou, vhf, wooorm


hast-util-to-mdast's Issues

Handle newlines in headings, table cells

Subject of the issue

<h3>alpha<br>bravo</h3> and <td>alpha<br>bravo</td> cannot be represented in Markdown (except by using HTML in Markdown).

The h1 and h2 versions can be used, if the heading is serialised as a Setext heading:

alpha\
bravo
===

Your environment

n/a

Steps to reproduce

<h3>NEW YORK<br>CHARLES E. MERRILL CO.<br>1907</h3>

Expected behaviour

A single space is probably fine. May need a handler.

Actual behaviour

### NEW YORK᛫᛫
CHARLES E. MERRILL CO.᛫᛫
1907

Nested tables create empty cells

Initial checklist

Affected packages and versions

hast-util-to-mdast 8.2.0 and 7.1.3

Link to runnable example

No response

Steps to reproduce

Create a nested table that has more cells than the parent table.
The inspect() function in the table handler calculates the total width of the table,
but doesn't stop the traversal inside a cell.

When a cell contains a nested table, this messes up the rowIndex and cellIndex calculations.

IMO, it should use visit.SKIP in the cell case.

Expected behavior

calculates the number of columns correctly.

Actual behavior

inserts empty cells if a nested table has more cells.
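
As a dependency-free illustration of the suggested fix (not the library's actual inspect() code), the sketch below counts the widest row of a hast table and never descends into cells, so a nested table cannot inflate the parent's column count. The name `countColumns`, the `el` helper, and the hand-rolled walk are assumptions made for this example; with unist-util-visit, the equivalent would be returning `SKIP` when a cell is entered.

```javascript
// Count the widest row of a hast <table>, ignoring anything nested inside cells.
function countColumns(table) {
  let widest = 0

  // Walk row containers (`thead`, `tbody`, ...), but never descend into cells.
  function walk(node) {
    for (const child of node.children || []) {
      if (child.type !== 'element') continue

      if (child.tagName === 'tr') {
        const cells = child.children.filter(
          (d) => d.type === 'element' && (d.tagName === 'td' || d.tagName === 'th')
        )
        widest = Math.max(widest, cells.length)
        // Deliberately no recursion here: cells and their nested tables are skipped.
      } else {
        walk(child)
      }
    }
  }

  walk(table)
  return widest
}

// Hypothetical helper to build hast-like elements for the demo.
const el = (tagName, children = []) => ({type: 'element', tagName, properties: {}, children})

const nested = el('table', [el('tr', [el('td'), el('td'), el('td')])])
const outer = el('table', [el('tbody', [el('tr', [el('td', [nested]), el('td')])])])

console.log(countColumns(outer)) // 2
```

The outer table reports two columns even though the nested table has three cells in its row.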

Runtime

Node v12

Package manager

npm v7

OS

macOS

Build and bundle tools

Other (please specify in steps to reproduce)

How should block-level code be handled?

For example, this HTML:

<pre>alpha();</pre>
<pre><code>bravo();</code></pre>
<pre><code class="language-js">charlie();</code></pre>
<pre><code class="delta language-js echo">foxtrot();</code></pre>
<pre>golf <code>hotel();</code> india</pre>
<pre>juliet <code>kilo();</code></pre>
<pre><code>lima();</code> mike</pre>
<pre><code></code></pre>
<pre>november <div><code>oscar();</code></div> papa</pre>

...what should it result in?
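
One piece of a possible answer, sketched under assumptions: if the `language-*` class convention is used to pick the code language, the fourth example would yield `js` regardless of the surrounding `delta` and `echo` classes. `languageFromClasses` is a hypothetical helper for illustration, not an existing function of this package.

```javascript
// Pick the language from a list of class names using the `language-` prefix convention.
function languageFromClasses(classNames) {
  const prefix = 'language-'
  const match = (classNames || []).find((d) => d.startsWith(prefix))
  return match ? match.slice(prefix.length) : null
}

console.log(languageFromClasses(['delta', 'language-js', 'echo'])) // 'js'
console.log(languageFromClasses([])) // null
```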

`<p><br /></p>` results in `\` due to `wrapText`

Initial checklist

Affected packages and versions

[email protected]

Link to runnable example

No response

Steps to reproduce

import fs from 'fs'

import type { Element } from 'hast'
import { all } from 'hast-util-to-mdast'
import { code } from 'hast-util-to-mdast/lib/handlers/code.js'
import rehypeParse from 'rehype-parse'
import rehypeRemark from 'rehype-remark'
import remarkStringify from 'remark-stringify'
import { unified } from 'unified'

const processor = unified()
  .use(rehypeParse)
  .use(rehypeRemark, {
    handlers: {
      div(h, node: Element) {
        if (
          (node.properties?.className as string[] | undefined)?.includes(
            'toc-macro',
          )
        ) {
          return
        }
        return all(h, node)
      },
      pre(h, node: Element) {
        // `h.handlers.code` is actually `inlineCode`
        return code(h, {
          ...node,
          children: node.children.map(child =>
            child.type === 'text'
              ? {
                  type: 'element',
                  tagName: 'code',
                  children: [child],
                }
              : child,
          ),
        })
      },
    },
  })
  .use(remarkStringify, {
    fences: true,
  })

const main = async () => {
  const text = await fs.promises.readFile('index.html', 'utf8')
  const vfile = await processor.process(text)
  await fs.promises.writeFile('index.md', vfile.value)
}

main().catch(console.error)

<!-- index.html -->
<p><br /></p>

Expected behavior

empty line

Actual behavior

\

Runtime

Node v12

Package manager

yarn v1

OS

macOS

Build and bundle tools

esbuild

Tasklists don't work if checkbox is a direct child of a list item

Initial checklist

Affected packages and versions

[email protected]

Link to runnable example

No response

Steps to reproduce

Tasklists (lists where each item starts with a checkbox) are parsed correctly if the leading checkbox <input> element in an <li> node is nested in a <p> node, but not if the <input> is a direct child of the <li>. I’m not sure why this is a requirement (it’s certainly not in the HTML spec: https://html.spec.whatwg.org/multipage/grouping-content.html#the-li-element), so I assume it’s a bug. (But I appreciate also handling this common case where things are in a paragraph in the list item, which is best practice HTML.)

You can see where this explicit check for a <p> is implemented in lib/handlers/li.js:

export function li(state, node) {
  const head = node.children[0]
  /** @type {boolean | null} */
  let checked = null
  /** @type {Element | undefined} */
  let clone
  // Check if this node starts with a checkbox.
  if (head && head.type === 'element' && head.tagName === 'p') {
    const checkbox = head.children[0]
    if (
      checkbox &&
      checkbox.type === 'element' &&
      checkbox.tagName === 'input' &&
      checkbox.properties &&
      (checkbox.properties.type === 'checkbox' ||
        checkbox.properties.type === 'radio')
    ) {
      checked = Boolean(checkbox.properties.checked)
      clone = {
        ...node,
        children: [
          {...head, children: head.children.slice(1)},
          ...node.children.slice(1)
        ]
      }
    }

I used this script to test:

import {inspect} from 'node:util';
import {toMdast} from 'hast-util-to-mdast';

const mdast = toMdast(hastTree);
console.log(inspect(mdast, false, 100, true));

Expected behavior

I expected that this script:

import {inspect} from 'node:util';
import {toMdast} from 'hast-util-to-mdast';

const mdast = toMdast({
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'ul',
      children: [
        {
          type: 'element',
          tagName: 'li',
          children: [
            {
              type: 'element',
              tagName: 'input',
              properties: { type: 'checkbox', checked: true }
            },
            {
              type: 'text',
              value: 'Checked'
            }
          ]
        },
        {
          type: 'element',
          tagName: 'li',
          children: [
            {
              type: 'element',
              tagName: 'input',
              properties: { type: 'checkbox', checked: false }
            },
            {
              type: 'text',
              value: 'Unhecked'
            }
          ]
        }
      ]
    }
  ]
});

console.log(inspect(mdast, false, 100, true));

…to output:

{
  type: 'root',
  children: [
    {
      type: 'list',
      ordered: false,
      start: null,
      spread: false,
      children: [
        {
          type: 'listItem',
          spread: false,
          checked: true,
          children: [
            {
              type: 'paragraph',
              children: [ { type: 'text', value: 'Checked' } ]
            }
          ]
        },
        {
          type: 'listItem',
          spread: false,
          checked: false,
          children: [
            {
              type: 'paragraph',
              children: [ { type: 'text', value: 'Unhecked' } ]
            }
          ]
        }
      ]
    }
  ]
}

…which corresponds to this Markdown:

- [x] Checked
- [ ] Unchecked

Actual behavior

Instead, the above script outputs:

{
  type: 'root',
  children: [
    {
      type: 'list',
      ordered: false,
      start: null,
      spread: false,
      children: [
        {
          type: 'listItem',
          spread: false,
          checked: null,
          children: [
            {
              type: 'paragraph',
              children: [ { type: 'text', value: '[x]Checked' } ]
            }
          ]
        },
        {
          type: 'listItem',
          spread: false,
          checked: null,
          children: [
            {
              type: 'paragraph',
              children: [ { type: 'text', value: '[ ]Unhecked' } ]
            }
          ]
        }
      ]
    }
  ]
}

…which corresponds to this Markdown:

- \[x]Checked
- \[ ]Unchecked

To get the expected output, you currently need to wrap the checkbox and text in a <p>:

import {inspect} from 'node:util';
import {toMdast} from 'hast-util-to-mdast';

const mdast = toMdast({
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'ul',
      children: [
        {
          type: 'element',
          tagName: 'li',
          children: [
            {
              type: 'element',
              tagName: 'p',
              children: [
                {
                  type: 'element',
                  tagName: 'input',
                  properties: { type: 'checkbox', checked: true }
                },
                {
                  type: 'text',
                  value: 'Checked'
                }
              ]
            }
          ]
        },
        {
          type: 'element',
          tagName: 'li',
          children: [
            {
              type: 'element',
              tagName: 'p',
              children: [
                {
                  type: 'element',
                  tagName: 'input',
                  properties: { type: 'checkbox', checked: false }
                },
                {
                  type: 'text',
                  value: 'Unhecked'
                }
              ]
            }
          ]
        }
      ]
    }
  ]
});

console.log(inspect(mdast, false, 100, true));
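
A minimal sketch of a check that accepts both shapes, assuming the only difference is whether the leading `<input>` sits directly in the `<li>` or inside a leading `<p>`. The helpers `isCheckable` and `leadingCheckbox` are hypothetical names for this illustration and are not part of lib/handlers/li.js.

```javascript
// True for `<input type="checkbox">` and `<input type="radio">` elements.
function isCheckable(node) {
  return Boolean(
    node &&
      node.type === 'element' &&
      node.tagName === 'input' &&
      node.properties &&
      (node.properties.type === 'checkbox' || node.properties.type === 'radio')
  )
}

// Returns `true`/`false` for a leading checkbox, or `null` when there is none.
function leadingCheckbox(li) {
  const head = li.children[0]

  // Case 1: the checkbox is a direct child of the list item.
  if (isCheckable(head)) return Boolean(head.properties.checked)

  // Case 2: the checkbox is the first child of a leading paragraph.
  if (head && head.type === 'element' && head.tagName === 'p') {
    const first = head.children[0]
    if (isCheckable(first)) return Boolean(first.properties.checked)
  }

  return null
}

const checkbox = (checked) => ({
  type: 'element',
  tagName: 'input',
  properties: {type: 'checkbox', checked},
  children: []
})
const text = (value) => ({type: 'text', value})
const li = (children) => ({type: 'element', tagName: 'li', properties: {}, children})
const p = (children) => ({type: 'element', tagName: 'p', properties: {}, children})

const direct = li([checkbox(true), text('Checked')])
const wrapped = li([p([checkbox(false), text('Unchecked')])])
const plain = li([text('No checkbox')])

console.log(leadingCheckbox(direct)) // true
console.log(leadingCheckbox(wrapped)) // false
console.log(leadingCheckbox(plain)) // null
```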

Affected runtime and version

node>=16

Affected package manager and version

[email protected]

Affected OS and version

macOS 13.5

Build and bundle tools

No response

How to deal with media / embedded content

<audio>, <video>

Most examples use phrasing content in audio elements to say that stuff isn’t working, like this example from MDN:

<audio src="http://developer.mozilla.org/@api/deki/files/2926/=AudioTest_(1).ogg" autoplay>
  Your browser does not support the <code>audio</code> element.
</audio>

The naive idea would be to unwrap it, to:

Your browser does not support the `audio` element.

...but that isn’t really useful.

We can also link to the resource like so:

[Your browser does not support the `audio` element.](http://developer.mozilla.org/@api/deki/files/2926/=AudioTest_(1).ogg)

Videos with a [poster] attribute could also be transformed to images:

<video src="videofile.webm" autoplay poster="posterimage.jpg">
Sorry, your browser doesn't support embedded videos, 
but don't worry, you can <a href="videofile.webm">download it</a>
and watch it with your favorite video player!
</video>

...to:

[![Sorry, your browser doesn't support embedded videos, 
but don't worry, you can download it and watch it with your favorite video player!](posterimage.jpg)](videofile.webm)

...but here the text isn’t very nice either.

<iframe>

<iframe src="https://mdn-samples.mozilla.org/snippets/html/iframe-simple-contents.html" width="400" height="300">
  <p>Your browser does not support iframes.</p>
</iframe>

...to:

[Your browser does not support iframes.](https://mdn-samples.mozilla.org/snippets/html/iframe-simple-contents.html)

...not very nice either.

<picture>

<picture>
 <source srcset="mdn-logo.svg" type="image/svg+xml">
 <img src="mdn-logo.png" alt="MDN">
</picture>

or:

<picture>
 <source srcset="mdn-logo-wide.png" media="(min-width: 600px)">
 <img src="mdn-logo-narrow.png" alt="MDN">
</picture>

Should we pick the first <img>?

<object>, <embed>

<object data="movie.swf" type="application/x-shockwave-flash"></object>

...or:

<object data="movie.swf" type="application/x-shockwave-flash">
  <param name="foo" value="bar">
</object>

...and:

<embed type="video/quicktime" src="movie.mov" width="640" height="480">

I think it’s best to ignore them.

<canvas>

<canvas id="canvas" width="300" height="300">
  An alternative text describing what your canvas displays. 
</canvas>

...to:

An alternative text describing what your canvas displays. 

?

<br> in tables breaks table

Subject of the issue

The table is lost in mdast if <td> or <th> elements contain a <br>.

Your environment

Steps to reproduce

https://github.com/syntax-tree/hast-util-to-mdast/pull/57/files

Expected behaviour

table is still preserved, even with <br>

I am not 100% sure what the best solution would be, maybe just drop the <br> ?

Actual behaviour

The output for tables containing <br> seems to lose the table info.

List item spread calculates spread incorrectly with nested lists.

Initial checklist

Affected packages and versions

9.0.0

Link to runnable example

No response

Steps to reproduce

With nested lists, spreadout in li.js calculates spread incorrectly when a nested list has multiple items, as far as I can tell.

import { toMdast } from "hast-util-to-mdast";
const hast = {
  type: "root",
  children: [
    {
      type: "element",
      tagName: "ul",
      children: [
        {
          type: "element",
          tagName: "li",
          properties: {},
          children: [
            {
              type: "text",
              value: "outer",
            },
            {
              type: "element",
              tagName: "ul",
              children: [
                {
                  type: "element",
                  tagName: "li",
                  properties: {},
                  children: [
                    {
                      type: "text",
                      value: "inner",
                    },
                  ],
                },
                {
                  type: "element",
                  tagName: "li",
                  properties: {},
                  children: [
                    {
                      type: "text",
                      value: "inner",
                    },
                  ],
                },
              ],
            },
          ],
        },
      ],
    },
    {
      type: "element",
      tagName: "li",
      properties: {},
      children: [
        {
          type: "text",
          value: "outer",
        },
      ],
    },
  ],
};
const mdast = toMdast(hast);
console.log(JSON.stringify(mdast));

Expected behavior

Both the outer and inner lists should be tight.

Actual behavior

The outer list is flagged as loose. As far as I can tell, this is due to L94, which should instead read seenFlow = child.tagName !== 'ul' && child.tagName !== 'ol' && child.tagName !== 'li'.

Affected runtime and version

18.2.0

Affected package manager and version

8.6.0

Affected OS and version

Windows 11

Build and bundle tools

Other (please specify in steps to reproduce)

How to handle documents

HAST can represent documents, e.g.,

<!doctype html>
<html>
  <head><meta charset="utf8"></head>
  <body><p>Text</p></body>
</html>

...but also:

<!doctype html>
<meta charset="utf8">
<p>Text</p>

Another interesting case is:

<!doctype html>
<html>
  <head><meta charset="utf8"></head>
  <body>
    <header>Random stuff</header>
    <main>Core</main>
    <footer>More random stuff</footer>
  </body>
</html>

...how should the core contents be handled? Should hast-util-to-mdast search for the body element, or even the main, or something else? I remember there’s a class you can add to the main content so that Instapaper recognises it. I couldn’t find it when searching just now, but something like that may be interesting too.

Implicit paragraphs

Originally posted on Gitter by @justin-calleja:

<ol>
<li><code>something</code> Hello World? And now <code>channel</code>.</li>
</ol>

A <code>for</code> loop… ye?

...turns into:

1.  `something`

     Hello World? And now

    `channel`

    .

A

`for`

 loop… ye?

...which is because of HTML’s support for “implicit” paragraphs.

Weird trimming of inline element text causes invalid markdown

Initial checklist

Affected packages and versions

hast-util-to-mdast==10.1.0

Link to runnable example

https://codesandbox.io/p/devbox/dry-http-xx97sg

Steps to reproduce

A minimal example is converting HTML like this:
<p>some text with <em> spaced emphasis </em> in between</p>
A direct html->hast->mdast->markdown pipeline generates this (invalid) markdown:
some text with *spaced emphasis *in between

Expected behavior

Given that the browser seems to trim the contents of <em> in such a case, this:
some text with *spaced emphasis* in between

Actual behavior

The space ends up in the wrong node after the hast-util-to-mdast call.
Other nodes are affected too: at least strong and del, so presumably all inline nodes.
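
One way the expected output could be produced, sketched on plain mdast-like objects: move leading and trailing whitespace out of emphasis, strong, and delete nodes into the surrounding text, collapsing the whitespace where text nodes are merged. `hoistWhitespace` and the tiny `toMarkdown` serializer are hypothetical helpers for this demo only, not the package's code.

```javascript
const inline = new Set(['emphasis', 'strong', 'delete'])

// Move whitespace at the edges of inline nodes out into sibling text nodes.
function hoistWhitespace(children) {
  const result = []

  // Append text, merging with a preceding text node and collapsing whitespace runs.
  const addText = (value) => {
    if (!value) return
    const prev = result[result.length - 1]
    if (prev && prev.type === 'text') {
      prev.value = (prev.value + value).replace(/\s+/g, ' ')
    } else {
      result.push({type: 'text', value})
    }
  }

  for (const node of children) {
    if (node.type === 'text') {
      addText(node.value)
    } else if (inline.has(node.type)) {
      // Recurse first, so nested inline nodes are already cleaned up.
      const inner = hoistWhitespace(node.children)
      const first = inner[0]
      const last = inner[inner.length - 1]
      let before = ''
      let after = ''
      if (first && first.type === 'text') {
        before = first.value.match(/^\s*/)[0]
        first.value = first.value.slice(before.length)
      }
      if (last && last.type === 'text') {
        after = last.value.match(/\s*$/)[0]
        last.value = last.value.slice(0, last.value.length - after.length)
      }
      addText(before)
      result.push({...node, children: inner.filter((d) => d.type !== 'text' || d.value)})
      addText(after)
    } else {
      result.push(node)
    }
  }

  return result
}

// Demo-only serializer for text, emphasis, strong, and delete.
const markers = {emphasis: '*', strong: '**', delete: '~~'}
function toMarkdown(nodes) {
  return nodes
    .map((d) =>
      d.type === 'text' ? d.value : markers[d.type] + toMarkdown(d.children) + markers[d.type]
    )
    .join('')
}

const paragraph = [
  {type: 'text', value: 'some text with '},
  {type: 'emphasis', children: [{type: 'text', value: ' spaced emphasis '}]},
  {type: 'text', value: ' in between'}
]

console.log(toMarkdown(hoistWhitespace(paragraph)))
// some text with *spaced emphasis* in between
```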

Affected runtime and version

All?

Affected package manager and version

No response

Affected OS and version

No response

Build and bundle tools

No response

Expose `defaultHandlers` to be reused

Initial checklist

Problem

Maybe a bit similar to syntax-tree/mdast-util-to-markdown#34

I'm trying to implement a tiny util, cf2md, which transforms Confluence-flavored HTML to Markdown with enhanced features.

In Confluence's HTML, pre can be used without code inside, so I have to wrap its direct text nodes in a code node to reuse the original h.handlers.pre. But no defaultHandlers is provided, so I have to use import { code } from 'hast-util-to-mdast/lib/handlers/code.js', which is actually the original h.handlers.pre.

import { code } from 'hast-util-to-mdast/lib/handlers/code.js'

unified()
  // ...
  .use(rehypeRemark, {
    handlers: {
      pre(h, node, parent) {
        // I'd like to have `h.defaultHandlers.pre` here
        return code(h, {
          ...node,
          children: node.children.map(child =>
            child.type === 'text'
              ? {
                  type: 'element',
                  tagName: 'code',
                  children: [child],
                }
              : child,
          ),
        })
      },
    },
  })

See also https://github.com/rx-ts/cf2md/blob/main/src/index.ts#L82

Solution

Expose defaultHandlers in h

Alternatives

N/A

Fix tables without header row

Subject of the issue

In markdown, the first row of a table is the header row.
In HTML, tables can have no header row or multiple header rows, and a table could be flipped (the first column being ths, later columns being data).

I’m not sure how, but it should somehow be handled.

Add linter

This looks like standard, right? I’m prone to typing semi-colons, so if you’d rather not have them we should add a linter!

"url" package is not listed in package.json, but used in resolve.js

https://github.com/syntax-tree/hast-util-to-mdast/blob/main/lib/util/resolve.js#L5

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions: 8.0.0

Steps to reproduce

  1. Install the package: npm install hast-util-to-mdast@latest -S
  2. Try to import and run it

Expected behavior

The hast utilities run.

Actual behavior

ERROR in ./node_modules/hast-util-to-mdast/lib/util/resolve.js 17:22-25
export 'URL' (imported as 'URL') was not found in 'url' (possible exports: Url, format, parse, resolve, resolveObject)
 @ ./node_modules/hast-util-to-mdast/lib/handlers/media.js 10:0-45 57:11-18 69:9-16
 @ ./node_modules/hast-util-to-mdast/lib/handlers/index.js 20:0-35 133:9-14 181:9-14

This happens because "url" package is not listed in package.json: https://github.com/syntax-tree/hast-util-to-mdast/blob/main/lib/util/resolve.js#L5

ignoring non-standard html tags

Initial checklist

Problem

The handlers option allows keeping some HTML tags, as explained here: Example: keeping some HTML with
How can I keep all non-standard HTML tags?
In my use case I need to keep all non-standard HTML tags that a user defines.
This is because the non-standard HTML tags are actually JSX tags inside an MDX file.

TAGS TO CONVERT

<h1>
<p>
...

TAGS TO KEEP

<custom-tag>
<note>
<JsxWannabe>

Solution

Add support for an HTML attribute data-mdast="keep".
Like data-mdast="ignore", but it keeps the tag as it is instead of ignoring it:
<custom-tag data-mdast="keep">

Alternatives

I see 2 possible alternatives:

  1. Add an option that keeps all non-standard HTML tags.

  2. Pass a wildcard function to the handlers option that doesn't specify a matching tag but runs on everything.
    It could be called all, *, default, or something else:

handlers: {
  all: (h, node, tag) => {

    if (allStandardTags.includes(tag)) {
      // convert it normally to markdown
      
    } else {
      // keep it as html
      return h(node, 'html', toHtml(node))
    }
  },
},

`tt` element

tt

It’s deprecated, but produces monospace text, so I’d go with inlineCode.

<br /> tag doesn't compile to new line in <code />

I'm trying to compile the following html to markdown:

<pre class="language-javascript"><code class="language-javascript">var x = &quot;hi&quot;;<br/>var y = &quot;hi&quot;;<br/>var z = &quot;hi&quot;;<br/>var a = &quot;hi&quot;;<br/></code></pre>

However, when I run this HTML string through rehype-remark, it removes the <br /> tags and hands me markdown that looks like this:

var x = "hi";var y = "hi";var z = "hi";var a = "hi";

Original question asked on spectrum here
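
A small sketch of the behaviour asked for, assuming the code text is collected by walking the hast subtree: every `<br>` contributes a line feed instead of being dropped. `codeText` is a hypothetical helper written for this illustration, not the package's implementation.

```javascript
// Collect the text of a hast subtree, turning `<br>` elements into line feeds.
function codeText(node) {
  if (node.type === 'text') return node.value
  if (node.type === 'element' && node.tagName === 'br') return '\n'
  return (node.children || []).map(codeText).join('')
}

const br = {type: 'element', tagName: 'br', properties: {}, children: []}
const pre = {
  type: 'element',
  tagName: 'pre',
  properties: {className: ['language-javascript']},
  children: [
    {
      type: 'element',
      tagName: 'code',
      properties: {className: ['language-javascript']},
      children: [
        {type: 'text', value: 'var x = "hi";'},
        br,
        {type: 'text', value: 'var y = "hi";'},
        br
      ]
    }
  ]
}

console.log(JSON.stringify(codeText(pre)))
// "var x = \"hi\";\nvar y = \"hi\";\n"
```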

Support tables with "rowspan" or "colspan" attributes

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and couldn’t find anything (or linked relevant results below)

Subject

  • Have a table with rowspan/colspan and missing cell in next row:
<table>
<tbody>
    <tr>
        <th>Table</th>
        <th>With rowspan</th>
    </tr>
    <tr>
        <td rowspan="2">Rowspan</td>
        <td>test</td>
    </tr>
    <tr>
        <!-- There is no cell because it is occupied by rowspan -->
        <td>test</td>
    </tr>
</tbody>
</table>

Which, when rendered, should look something like this:
image

Problem

When converting to MDAST, the missing cells are added automatically, so the following tree is constructed (notice the empty tableCell added at the end of the third row):
image

So, if serialized to Markdown, it would look like:

| Table | With rowspan |
| - | - |
| Rowspan | test |
| test | |

Notice that the "test" in the third row is shifted to the left.

Solution

The converter could take the "rowspan" and "colspan" attributes into account and put empty cells in the corresponding positions, to preserve the table structure in general.
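
The proposal can be sketched on a simplified model, assuming each row is an array of `{value, rowSpan, colSpan}` objects rather than real hast cells. `expandSpans` is a hypothetical function for illustration: it fills every slot covered by a span with an empty string so later cells stay in their columns.

```javascript
// Expand rowspan/colspan into a rectangular grid of strings.
function expandSpans(rows) {
  const grid = rows.map(() => [])

  rows.forEach((cells, rowIndex) => {
    let column = 0

    for (const cell of cells) {
      // Skip slots already claimed by a rowspan from an earlier row.
      while (grid[rowIndex][column] !== undefined) column++

      const rowSpan = cell.rowSpan || 1
      const colSpan = cell.colSpan || 1

      for (let r = 0; r < rowSpan && rowIndex + r < rows.length; r++) {
        for (let c = 0; c < colSpan; c++) {
          // Only the top-left slot keeps the content; the rest become empty cells.
          grid[rowIndex + r][column + c] = r === 0 && c === 0 ? cell.value : ''
        }
      }

      column += colSpan
    }
  })

  // Pad every row to the same width.
  const width = Math.max(0, ...grid.map((row) => row.length))
  return grid.map((row) => Array.from({length: width}, (_, i) => row[i] ?? ''))
}

const rows = [
  [{value: 'Table'}, {value: 'With rowspan'}],
  [{value: 'Rowspan', rowSpan: 2}, {value: 'test'}],
  [{value: 'test'}]
]

console.log(expandSpans(rows))
// [ [ 'Table', 'With rowspan' ], [ 'Rowspan', 'test' ], [ '', 'test' ] ]
```

With this grid, the third row serializes as `| | test |`, keeping "test" in the second column.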

Alternatives

I don't see any alternative implementation. Related to #54.

`wbr` element

wbr

As MDN says it behaves like U+200B, I think we should just compile to \u200B.

Discrepancy in dependency

Subject of the issue

Crash here when using rehype-remark https://github.com/syntax-tree/hast-util-to-mdast/blob/7.1.2/lib/handlers/code.js#L6

It seems the hast-util-is-element range in package.json was not bumped: 1.0.x doesn't contain the convert file.

https://github.com/syntax-tree/hast-util-is-element/tree/1.0.4

I think you can solve the issue by publishing a new tag with updated package.json with hast-util-is-element on ^1.1.0

Steps to reproduce

I don't really know why yarn sometimes picks 1.0.x and sometimes 1.1.x for the range ^1.0.0.

In the first case, it breaks, in the second it works.
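
For context on why both versions can be picked: a caret range such as ^1.0.0 accepts any 1.x.y release at or above 1.0.0, so which one is installed depends on the lockfile and on when the dependency was resolved. The sketch below is a simplified check, assuming plain x.y.z versions with a major version of 1 or higher; `satisfiesCaret` is a hypothetical helper, not yarn's resolver or the semver package's API.

```javascript
// Simplified caret check: same major, and at least the base minor/patch.
// Assumes plain `x.y.z` strings and a major version >= 1.
function satisfiesCaret(version, base) {
  const [major, minor, patch] = version.split('.').map(Number)
  const [baseMajor, baseMinor, basePatch] = base.split('.').map(Number)

  if (major !== baseMajor) return false
  if (minor !== baseMinor) return minor > baseMinor
  return patch >= basePatch
}

console.log(satisfiesCaret('1.0.4', '1.0.0')) // true: allowed by ^1.0.0, but lacks the convert file
console.log(satisfiesCaret('1.1.0', '1.0.0')) // true: also allowed by ^1.0.0
console.log(satisfiesCaret('1.0.4', '1.1.0')) // false: ^1.1.0 rules out 1.0.x
```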

`u` element

u

As there’s no underline in markdown, I’d say go with emphasis instead, or ignore it.

Tags and classes are lost

I noticed that when I try to convert hast with tags and classes to mdast, the details are lost.
What I was expecting is that the tags and classes are part of the mdast like so:

{
  type: 'something',
  data: {
    hName: 'a',
    hProperties: {
      href: url,
      rel: 'nofollow',
      class: 'ping ping-link',
    },
  }
}

I had a look at the code and the only element that supports this is code. If I want to add the change, I'd need to change the all function?

Add support for `<base>`

<base>

The href attribute on the first base element should probably affect all other relative URLs: on image and link nodes.
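
A minimal sketch of that resolution step, assuming the base is an absolute URL and using the standard WHATWG `URL` constructor. The `resolve` function name and the example URLs are made up for illustration.

```javascript
// Resolve `href` against the document's base URL, if there is one.
function resolve(href, base) {
  if (!base) return href

  try {
    return new URL(href, base).href
  } catch {
    // Invalid or relative base: leave the URL untouched.
    return href
  }
}

const base = 'https://example.com/docs/'

console.log(resolve('about.html', base)) // https://example.com/docs/about.html
console.log(resolve('/img/a.png', base)) // https://example.com/img/a.png
console.log(resolve('https://other.example/x', base)) // https://other.example/x
console.log(resolve('about.html', undefined)) // about.html
```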

Straddling

Take this HTML:

<header>
 Welcome!
 <a href="about.html">
  This is home of...
  <h1>The Falcons!</h1>
  The Lockheed Martin multirole jet fighter aircraft!
 </a>
 This page discusses the F-16 Fighting Falcon's innermost secrets.
</header>

...perfectly valid, albeit weird, but not handled correctly here in hast-util-to-mdast.

Read more about straddled paragraphs in the HTML spec, 3.2.5.4 Paragraphs.

Allow mapping of elements to raw HTML

Based on the reasoning here, I guess there’s a use case for people to map things that aren’t easily doable in markdown to actual HTML. We could use raw nodes for that, as used by hast-util-to-mdast. I’m not sure I’d like to include a HAST stringifier to set the value of the raw node to the stringified tree. We could at least set the hChildren on the node though.

Supersedes GH-16.

Should `<select>`, `<datalist>`, `<option>`, and `<optgroup>` render lists?

Questions

See below for examples, but first a few open questions:

  • What to do with [disabled]?
  • Should these lists always render with checkboxes?
  • ...probably a lot more

Examples

For example, this HTML:

<label>
  Choose a browser from this list:
  <input list="browsers">
</label>
<datalist id="browsers">
  <option value="Chrome">
  <option value="Firefox">
  <option value="Internet Explorer">
  <option value="Opera">
  <option value="Safari">
  <option value="Microsoft Edge">
</datalist>

...could be rendered like:

Choose a browser from this list:

* Chrome
* Firefox
* Internet Explorer
* Opera
* Safari
* Microsoft Edge

...and:

<p>
  And these:
  <select>
    <option value="value1">Value 1</option>
    <option value="value2" selected>Value 2</option>
    <option value="value3">Value 3</option>
  </select>
</p>

could be:

And these:

* [ ] Value 1
* [x] Value 2
* [ ] Value 3

...and:

<p>
  Check out these options:
  <select>
    <optgroup label="Group 1">
      <option>Option 1.1</option>
    </optgroup>
    <optgroup label="Group 2">
      <option>Option 2.1</option>
      <option>Option 2.2</option>
    </optgroup>
    <optgroup label="Group 3" disabled>
      <option>Option 3.1</option>
      <option>Option 3.2</option>
      <option>Option 3.3</option>
    </optgroup>
  </select>
</p>

could be:

Check out these options:

* Group 1
  * Option 1.1
* Group 2
  * Option 2.1
  * Option 2.2
* Group 3
  * Option 3.1
  * Option 3.2
  * Option 3.3
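
The checkbox variant for a flat `<select>` could be sketched as follows, assuming `selected` maps to a checked item and only direct `<option>` children are considered. `optionsToItems` and `itemsToMarkdown` are hypothetical helpers for this illustration.

```javascript
// Map the direct <option> children of a <select> to checkable list items.
function optionsToItems(select) {
  return select.children
    .filter((d) => d.type === 'element' && d.tagName === 'option')
    .map((option) => ({
      checked: Boolean(option.properties && option.properties.selected),
      label: option.children
        .filter((d) => d.type === 'text')
        .map((d) => d.value)
        .join('')
        .trim()
    }))
}

// Demo-only serializer for the items above.
function itemsToMarkdown(items) {
  return items.map((d) => '* [' + (d.checked ? 'x' : ' ') + '] ' + d.label).join('\n')
}

const option = (label, properties = {}) => ({
  type: 'element',
  tagName: 'option',
  properties,
  children: [{type: 'text', value: label}]
})

const select = {
  type: 'element',
  tagName: 'select',
  properties: {},
  children: [
    option('Value 1', {value: 'value1'}),
    option('Value 2', {value: 'value2', selected: true}),
    option('Value 3', {value: 'value3'})
  ]
}

console.log(itemsToMarkdown(optionsToItems(select)))
// * [ ] Value 1
// * [x] Value 2
// * [ ] Value 3
```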

passing handlers as options

I'm working on a project that introduces new syntax to markdown and need to convert that syntax between markdown & html.

This remark-bracketed-spans plugin is a good example of the type of thing I'm working on: https://github.com/sethvincent/remark-bracketed-spans/blob/master/index.js

With that plugin the goal is to convert between:

[text in the span]{.class .other-class key=val another=example}

and:

<p><span class="class other-class" data-key="val" data-another="example">text in the span</span></p>

There are a few similar plugins that I'm working on.

For converting the html to markdown, I'm not sure yet where in the pipeline is best for handling these special cases.

Passing custom handlers as options to this module could take care of it.

Usage would be similar to passing visitors to remark-stringify, and could look like:

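// Define the handlers first: `var` declarations are hoisted, but the value
// would still be `undefined` at the point where the pipeline is built.
var handlers = {
  span: function (node, parent) {
    // handle conversion
  }
}

var md = unified()
  .use(rehypeParse)
  .use(rehype2remark, { handlers: handlers }) // passing options through rehype-remark
  .use(remarkStringify)
  .process(html, options)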

I can write up a PR for this, but wanted to check to see if there are other approaches I might consider.

Exception on tables without rows, cells

Subject of the issue

This markup errors:
https://github.com/syntax-tree/hast-util-to-mdast/pull/53/files#diff-b6b8d72f26f12669066c4e2850f557c7

for some odd reason, in my test, it only errored if there are 2 input[type=hidden] (inside the table)

TypeError: Cannot read property 'length' of undefined

Your environment

Steps to reproduce

PR: #53
Error on travis: https://travis-ci.org/syntax-tree/hast-util-to-mdast/jobs/567626032#L699

Expected behaviour

What should happen?

no error

Actual behaviour

What happens instead?

 TypeError: Cannot read property 'length' of undefined
    at one (./node_modules/hast-util-to-mdast/lib/handlers/table.js:77:24)
    at patch (./node_modules/hast-util-to-mdast/lib/handlers/table.js:68:5)
    at table (./node_modules/hast-util-to-mdast/lib/handlers/table.js:10:43)
    at one (./node_modules/hast-util-to-mdast/lib/one.js:24:51)
    at all (./node_modules/hast-util-to-mdast/lib/all.js:15:14)
    at wrapped (./node_modules/hast-util-to-mdast/lib/util/wrap-children.js:9:15)
    at one (./node_modules/hast-util-to-mdast/lib/one.js:24:51)
    at all (./node_modules/hast-util-to-mdast/lib/all.js:15:14)
    at cell (./node_modules/hast-util-to-mdast/lib/handlers/table-cell.js:8:31)
    at one (./node_modules/hast-util-to-mdast/lib/one.js:24:51)

`xmp` element

xmp

It’s deprecated, but I’m thinking the current <pre> handling can also work here, so that xmp elements render as mdast code nodes.

`sup`, `sub` elements

sup, sub

We could try something with Unicode superscript and subscript characters, but I think that goes too far.
Ignoring is probably better.

HTML Tags

I made a list of tags in HTML, and whether they’re mapped to markdown.
Note that some nodes should stay unmapped, as that means they are unwrapped, which is a good thing for <section>, <main>, <address>, and probably many more.

Nodes
  • root → root
  • text → text
  • comment → html
  • element (see below)
Explicitly Ignored Nodes
  • doctype
Elements (todo / done)
  • a → link
  • abbr → children
  • acronym → children
  • address → block children
  • article → block children
  • aside → block children
  • audio → children if block or with link, link otherwise
  • b → strong
  • base → to resolved URLs in link and image
  • bdi → children
  • bdo → children
  • big → children
  • blink → children
  • blockquote → blockquote
  • body → block children
  • br → break
  • button → children
  • canvas → children
  • center → block children
  • cite → children
  • code → inlineCode
  • data → children
  • dd → list, with the contents of dl / dt, or when there are multiple dts or dds, one or two lists
  • del → delete
  • details → children
  • dfn → children
  • dir → list
  • div → block children
  • dl → listItem
  • dt → listItem
  • em → emphasis
  • fieldset → block children
  • figcaption → block children
  • figure → block children
  • font → children
  • footer → block children
  • form → block children
  • h1 → heading
  • h2 → heading
  • h3 → heading
  • h4 → heading
  • h5 → heading
  • h6 → heading
  • header → block children
  • hgroup → block children
  • html → block children
  • hr → thematicBreak
  • i → emphasis
  • iframe → link (if with title and src), otherwise ignored
  • image → image
  • img → image
  • input → to its value; with a checkbox if radio or checkbox; the selected values or placeholder label options if with list
  • ins → children
  • kbd → inlineCode
  • label → children
  • legend → block children
  • li → listItem
  • listing → code
  • main → block children
  • mark → emphasis
  • marquee → children
  • meter → children
  • multicol → block children
  • nav → block children
  • nobr → children
  • noscript → children
  • ol → list
  • output → children
  • p → paragraph
  • picture → block children (should be one image if HTML is valid)
  • plaintext → code
  • pre → code
  • progress → children
  • q → " and children
  • rb → children
  • rbc → children
  • rp → children
  • rt → children
  • rtc → children
  • ruby → children
  • s → delete
  • samp → inlineCode
  • section → block children
  • select → its selected values or its placeholder label options
  • slot → children
  • small → children
  • span → children
  • strike → delete
  • strong → strong
  • sub → children
  • summary → paragraph
  • sup → children
  • table → table
  • tbody → children
  • td → tableCell
  • textarea → text
  • tfoot → children
  • th → tableCell
  • thead → children
  • time → children
  • tr → tableRow
  • tt → inlineCode
  • u → emphasis
  • ul → list
  • var → inlineCode
  • video → children if block or with link, link otherwise, image if with poster
  • wbr → text with a zero-width space ('\u200B')
  • xmp → code
Implicitly Unhandled Elements
  • head
Explicitly Ignored Elements
  • applet
  • area
  • basefont
  • bgsound
  • caption
  • col
  • colgroup
  • command
  • content
  • datalist (affects inputs with list)
  • dialog
  • element
  • embed
  • frame
  • frameset
  • isindex
  • keygen
  • link
  • map
  • math
  • menu
  • menuitem
  • meta
  • nextid
  • noembed
  • noframes
  • object
  • optgroup (affects selects, inputs with list)
  • option (affects selects, inputs with list)
  • param
  • script
  • shadow
  • source (affects audio and video)
  • spacer
  • style
  • svg
  • template
  • title
  • track

How should `<dl>`, `<dt>`, and `<dd>` be rendered?

Say we have the following html:

<dl>
  <dt>Firefox</dt>
  <dd>A web browser.</dd>
</dl>

<dl>
  <dt>Firefox</dt>
  <dt>Mozilla Firefox</dt>
  <dt>Fx</dt>
  <dd>A web browser.</dd>
</dl>

<dl>
  <dt>Firefox</dt>
  <dd>A web browser.</dd>
  <dd>A Red Panda.</dd>
</dl>

Should that be something like this?

* **Firefox**

  A web browser.

* **Firefox**, **Mozilla Firefox**, **Fx**
  
  A web browser.

* **Firefox** <!--OPTION A-->
  
  * A web browser.
  * A Red Panda.

* **Firefox** <!--OPTION B-->
  
  A web browser.

  A Red Panda.

...which renders as:

  • Firefox

    A web browser.

  • Firefox, Mozilla Firefox, Fx

    A web browser.

  • Firefox

    • A web browser.
    • A Red Panda.
  • Firefox

    A web browser.

    A Red Panda.
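
Whichever rendering is chosen, the first step is probably to split the children of a `<dl>` into groups of terms and definitions. The sketch below does that on hast-like objects, assuming a new group starts whenever a `<dt>` follows a `<dd>`. `groupDefinitions` is a hypothetical helper, not the package's handler.

```javascript
// Split the children of a <dl> into groups of {titles, definitions}.
function groupDefinitions(children) {
  const groups = []
  let current = null

  for (const child of children) {
    if (child.type !== 'element') continue

    if (child.tagName === 'dt') {
      // A `dt` after a `dd` (or the very first `dt`) starts a new group.
      if (!current || current.definitions.length > 0) {
        current = {titles: [], definitions: []}
        groups.push(current)
      }
      current.titles.push(child)
    } else if (child.tagName === 'dd') {
      // A `dd` without a preceding `dt` still gets a group of its own.
      if (!current) {
        current = {titles: [], definitions: []}
        groups.push(current)
      }
      current.definitions.push(child)
    }
  }

  return groups
}

const item = (tagName, value) => ({
  type: 'element',
  tagName,
  properties: {},
  children: [{type: 'text', value}]
})

const groups = groupDefinitions([
  item('dt', 'Firefox'),
  item('dt', 'Mozilla Firefox'),
  item('dt', 'Fx'),
  item('dd', 'A web browser.'),
  item('dt', 'Firefox'),
  item('dd', 'A web browser.'),
  item('dd', 'A Red Panda.')
])

console.log(groups.map((d) => [d.titles.length, d.definitions.length]))
// [ [ 3, 1 ], [ 1, 2 ] ]
```

A group with several definitions can then be rendered either as a nested list (option A) or as consecutive paragraphs (option B).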
