evitanrelta / htmlarkdown

HTML-to-Markdown converter that adaptively preserves HTML when needed (eg. when center-aligning, or resizing images)

Home Page: https://evitanrelta.github.io/htmlarkdown

License: MIT License

TypeScript 64.62% HTML 35.38%

converter html-to-markdown typescript commonmark gfm html-converter javascript node node-js nodejs

htmlarkdown's Introduction

HTMLarkdown is an HTML-to-Markdown converter that's able to output HTML-syntax when required.
Like when center-aligning, or resizing images:

(image: Switching to HTML showcase)

Written completely in TypeScript.
Has many Jest tests, covering many edge-case conversions.

Leave an issue/PR if you can think of more!
For now, it is designed for GFM.
Try it out at the demo site below!
https://evitanrelta.github.io/htmlarkdown

How is this different?

Switching to HTML-syntax

Whenever elements cannot be represented in markdown-syntax, HTMLarkdown will switch to HTML-syntax:


Input HTML

Output Markdown

<h1>Normal-heading is <strong>boring</strong></h1>

<h1 align="center">
  Centered-heading is <strong>da wae</strong>
</h1>

<p><img src="https://image.src" /></p>

<p><img width="80%" src="https://image.src" /></p>

# Normal-heading is **boring**

<h1 align="center">
  Centered-heading is <b>da wae</b>
</h1>

![](https://image.src)

<img width="80%" src="https://image.src" />

Note: The HTML-switching is controlled by the rules' Rule.toUseHtmlPredicate.

But HTMLarkdown tries to use as little HTML-syntax as possible, mixing markdown and HTML if needed:


Input HTML

Output Markdown

<blockquote>
  <p align="center">
    Centered-paragraph
  </p>
  <p>Below is a horizontal-rule in blockquote:</p>
  <hr>
</blockquote>

> <p align="center">
>   Centered-paragraph
> </p>
> Below is a horizontal-rule in blockquote:
> 
> <hr>

Depending on the situation, HTMLarkdown will switch between markdown's backslash-escaping or HTML-escaping:


Input HTML

Output Markdown

<!-- In markdown -->
<p>&lt;TAG&gt;, **NOT BOLD**</p>

<!-- In in-line HTML -->
<p>
  <sup>&lt;TAG&gt;, **NOT BOLD**</sup>
</p>

<!-- In block HTML -->
<p align="center">
  &lt;TAG&gt;, **NOT BOLD**
</p>

\<TAG>, \*\*NOT BOLD\*\*

<sup>\<TAG>, \*\*NOT BOLD\*\*</sup>

<p align="center">
  &lt;TAG>, **NOT BOLD**
</p>
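
As a rough illustration of the escaping choice above, here is a minimal sketch. The names `EscapeContext`, `backslashEscape`, `htmlEscape` and `escapeText` are hypothetical and are not part of HTMLarkdown's API; the sketch only assumes the behaviour shown in the example (backslash-escaping in markdown and in-line HTML, HTML-escaping in block HTML).

```typescript
// Hypothetical context names, used only for this sketch.
type EscapeContext = 'markdown' | 'inline-html' | 'block-html'

/** Backslash-escapes characters that could start markdown or HTML syntax. */
function backslashEscape(text: string): string {
    return text.replace(/[\\*_<]/g, '\\$&')
}

/** HTML-escapes just enough to stop the text being parsed as a tag or entity. */
function htmlEscape(text: string): string {
    return text.replace(/&/g, '&amp;').replace(/</g, '&lt;')
}

/** Picks the escaping style based on where the text will end up. */
function escapeText(text: string, context: EscapeContext): string {
    // Markdown-syntax doesn't work inside block HTML, so backslashes
    // would show up literally there. Use HTML entities instead.
    return context === 'block-html' ? htmlEscape(text) : backslashEscape(text)
}
```

The real set of escaped characters is likely larger; only the ones needed for the example above are handled here.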

Handling of edge cases

Adding separators in-between adjacent lists to prevent them from being combined by markdown-renderers:


Input HTML

Output Markdown

<ul>
  <li>List 1 > item 1</li>
  <li>List 1 > item 2</li>
</ul>
<ul>
  <li>List 2 > item 1</li>
  <li>List 2 > item 2</li>
</ul>

- List 1 > item 1
- List 1 > item 2

<!-- LIST_SEPARATOR -->

- List 2 > item 1
- List 2 > item 2

And more!
But this section is getting too long so...

Installation

npm install htmlarkdown

Usage

Markdown conversion (either from `Element` or `string`)

import { HTMLarkdown } from 'htmlarkdown'

/** Convert an element! */
const htmlarkdown = new HTMLarkdown()
const container = document.getElementById('container')
console.log(container.outerHTML)
// => '<div id="container"><h1>Heading</h1></div>'
htmlarkdown.convert(container)
// => '# Heading'


/** 
 * Or a HTML string! 
 * Whichever u prefer. It's 2022, I don't judge :^)
 */
const htmlString = `
<h1>Heading</h1>
<p>Paragraph</p>
`
const htmlStrWithContainer = `<div>${htmlString}</div>`
htmlarkdown.convert(htmlString)
// Set 2nd param 'hasContainer' to true, for container-wrapped string.
htmlarkdown.convert(htmlStrWithContainer, true)
// Both output => '# Heading\n\nParagraph'

Note: If an element is given to convert, it's deep-cloned before any processing/conversion.
Thus, you don't have to worry about it mutating the original element :)

Configuring

/** Configure when creating an instance. */
const htmlarkdown = new HTMLarkdown({
    htmlEscapingMode: '&<>',
    maxPrettyTableWidth: Number.POSITIVE_INFINITY,
    addTrailingLinebreak: true
})

/** Or on an existing instance. */
htmlarkdown.options.maxPrettyTableWidth = -1

Plugins

Plugins are of type (htmlarkdown: HTMLarkdown): void.
They take in a HTMLarkdown instance and configure it by mutating it.

There are 2 plugin-options available in the options object: preloadPlugins and plugins.
The difference is:

preloadPlugins loads the plugins first, before your other options. (like "presets")
Allowing you to overwrite the plugins' changes:

const enableTrailingLinebreak: Plugin = (htmlarkdown) => {
    htmlarkdown.options.addTrailingLinebreak = true
}
const htmlarkdown = new HTMLarkdown({
    addTrailingLinebreak: false,
    preloadPlugins: [enableTrailingLinebreak],
})
htmlarkdown.options.addTrailingLinebreak // false

plugins loads the plugins after your other options.
Meaning, plugins can overwrite your options.

const enableTrailingLinebreak: Plugin = (htmlarkdown) => {
    htmlarkdown.options.addTrailingLinebreak = true
}
const htmlarkdown = new HTMLarkdown({
    addTrailingLinebreak: false,
    plugins: [enableTrailingLinebreak],
})
htmlarkdown.options.addTrailingLinebreak // true

You can also load plugins on existing instances:

htmlarkdown.loadPlugins([myPlugin])

Making a copy of an instance

The conversion of a HTMLarkdown instance solely depends on its options property.
Meaning, you can create a copy of an instance like this:

const htmlarkdown = new HTMLarkdown()
const copy = new HTMLarkdown(htmlarkdown.options)

Configuring rules/processes

See the "How it works" section for info on what the rules/processes do.

/**
 * Overwriting default rules/processes.
 * (does NOT include the defaults)
 */
const htmlarkdown = new HTMLarkdown({
    preProcesses: [myPreProcess1, myPreProcess2],
    rules: [myRule1, myRule2],
    textProcesses: [myTextProcess1, myTextProcess2],
    postProcesses: [myPostProcess1, myPostProcess2]
})

/**
 * Adding on to default rules/processes.
 * (includes the defaults)
 */
const htmlarkdown = new HTMLarkdown()
htmlarkdown.addPreProcess(myPreProcess)
htmlarkdown.addRule(myRule)
htmlarkdown.addTextProcess(myTextProcess)
htmlarkdown.addPostProcess(myPostProcess)

How it works

HTMLarkdown has 3 distinct phases:

1. Pre-processing
   The container-element that's received (and deep-cloned) by the convert method is passed consecutively to each PreProcess in options.preProcesses.
2. Conversion
   The pre-processed container-element is then recursively converted to markdown.
   Elements are converted by Rule in options.rules.
   Text-nodes are converted by TextProcess in options.textProcesses.
   The rules' and text-processes' output strings are then appended to each other, to give the raw markdown.
3. Post-processing
   The raw markdown string is then passed consecutively to each PostProcess in options.postProcesses, to give the final markdown.
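
The three phases can be pictured with a toy pipeline. This is only a sketch under simplifying assumptions: it uses a tiny stand-in node tree instead of real DOM elements, skips the deep-clone, and every type and function name below (`TreeNode`, `convertNode`, `convert`, the shape of `Rule`) is made up for illustration rather than taken from HTMLarkdown's source.

```typescript
// Minimal stand-ins for DOM nodes (the real library works on DOM elements).
type TextNode = { kind: 'text'; text: string }
type ElementNode = { kind: 'element'; tag: string; children: TreeNode[] }
type TreeNode = TextNode | ElementNode

// Simplified, hypothetical shapes for the 4 configurable parts.
type PreProcess = (container: ElementNode) => ElementNode
type Rule = { tag: string; replacement: (inner: string) => string }
type TextProcess = (text: string) => string
type PostProcess = (markdown: string) => string

interface Options {
    preProcesses: PreProcess[]
    rules: Rule[]
    textProcesses: TextProcess[]
    postProcesses: PostProcess[]
}

/** Phase 2: recursively convert a node, appending the children's outputs. */
function convertNode(node: TreeNode, options: Options): string {
    if (node.kind === 'text') {
        return options.textProcesses.reduce((text, process) => process(text), node.text)
    }
    const inner = node.children.map((child) => convertNode(child, options)).join('')
    const rule = options.rules.find((r) => r.tag === node.tag)
    // Elements without a matching rule only pass on their inner-contents.
    return rule ? rule.replacement(inner) : inner
}

function convert(container: ElementNode, options: Options): string {
    // Phase 1: pass the container through each pre-process in turn.
    const preProcessed = options.preProcesses.reduce((el, process) => process(el), container)
    // Phase 2: convert the container's contents to raw markdown.
    const raw = preProcessed.children.map((child) => convertNode(child, options)).join('')
    // Phase 3: pass the raw markdown through each post-process in turn.
    return options.postProcesses.reduce((md, process) => process(md), raw)
}

const options: Options = {
    preProcesses: [],
    rules: [
        { tag: 'h1', replacement: (inner) => `# ${inner}\n\n` },
        { tag: 'p', replacement: (inner) => `${inner}\n\n` },
    ],
    textProcesses: [(text) => text.replace(/\s+/g, ' ')],
    postProcesses: [(md) => md.trim()],
}

const container: ElementNode = {
    kind: 'element',
    tag: 'div',
    children: [
        { kind: 'element', tag: 'h1', children: [{ kind: 'text', text: 'Heading' }] },
        { kind: 'element', tag: 'p', children: [{ kind: 'text', text: 'Paragraph' }] },
    ],
}
const markdown = convert(container, options)
```

With these toy rules, `markdown` is the heading line, a blank line, then the paragraph.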

Rule-processes flowchart
(image: the general conversion flow of HTMLarkdown)

Contributing

Bugs

HTMLarkdown is still under-development, so there'll likely be bugs.

So the easiest way to contribute is to submit an issue (with the bug label), especially for any incorrect markdown-conversions :)

For any incorrect markdown-conversions, state the:

input HTML
current incorrect markdown output
expected markdown output

New conversions, ideas, features, tests

If you have any new elements-conversions / ideas / features / tests that you think should be added, leave an issue with the feature or improve label!

feature label is for new features

improve label is for improvements on existing features

Understandably, there are gray areas on what is a "feature" and what is an "improvement". So just go with whichever seems more appropriate :)

Other markdown specs

Currently, HTMLarkdown has been designed to output markdown for GitHub specifically (ie. GFM).
BUT, if there's another markdown spec. that you'd like to design for (maybe as a plugin?), do leave an issue/discussion :D

Coding-related stuff

Code-formatting is handled by Prettier, so no need to worry bout it :)

Any new feature should

be documented via TSDoc
come with new unit-tests for them
and should pass all new/existing tests

As for which merging method to use, check out the discussion.

Contributors

So far it's just me, so pls send help! :^)

Roadmap

If you've any new ideas / features, check out the Contributing section for it!

Element conversions

Block-elements:

Text-formattings:

Bold (For now, only outputs in asterisks **BOLD**)
Italic (For now, only outputs in asterisks *ITALIC*)
(GFM) ~~Strikethrough~~
Code
Link (For now, only inline links)
Superscript (ie. <sup>)
Subscript (ie. <sub>)
Underline (ie. <u>, <ins>)
(didn't know underlines possible till recently)

Misc:

Images (For now, only inline links)
Horizontal-rule (ie. <hr>)
Linebreaks (ie. <br>)
Preserved HTML comments (Issue #25)

Features to be added:

Custom id attributes

Go to [section with id](#my-section)

<p id="my-section">
  My section
</p>

Reversing GitHub's Issue/PR autolinks

Input HTML

<p>
  Issue autolink:
  <a href="https://github.com/user/repo/issues/7">#7</a>
</p>

Output Markdown

Issue autolink: #7

Ability to customise how codeblock's syntax-highlighting language is obtained from the <pre><code> elements

noop-rule:
They only pass-on their converted inner-contents to their parents.
They themselves don't have any markdown conversions, not even in HTML-syntax.

License

The MIT License (MIT).
So it's freeeeeee


htmlarkdown's Issues

`<div>` should not be handled by a noop rule

Context

Currently, <div> are handled by a noop rule, meaning they aren't stripped but they only pass-on their converted inner-contents to their parents. They themselves don't have any markdown conversions, not even in HTML-syntax.
For example, the below 2 HTML:

<div>
  TEXT
</div>

<div align="center">
  TEXT
</div>

both give the same output of:

TEXT

The problem

<div> in markdowns aren't stripped when being rendered in Github.
For example, they can be used to center images, like:

<div align=center>
  <img src="...">
</div>

Thus, they should not be converted by a noop rule.

Option to use reference-style links

From the CommonMark spec: https://spec.commonmark.org/0.30/#reference-link

A reference link is in the form:

[foo][bar]

[bar]: /url "title"

But there's also different kinds of them: full, collapsed, and shortcut.
The above example is a full reference link. The other 2 are in the form:

[collapsed][]

[collapsed]: /url "title"

[shortcut]

[shortcut]: /url "title"

Horizontal-rule in headings can be simplified

Currently, a <hr> in a heading element such as:

<h1><hr></h1>

converts to:

<h1>
  <hr>
</h1>

But it should instead be simplified to:

# <hr>

Improve the implementation for collapsing of whitespaces

Current implementation for collapsing whitespaces is found in the collapseWhitespace pre-process.

The previous implementation was actually a text-process,
but was later very roughly adapted into a bootleg pre-process:

(from collapseWhitespace.ts)

If anyone has any idea how to overhaul it to make it cleaner or faster, pls send help :(

Should `forcehtml` element attribute on in-line elements propagate to child elements?

Currently, the below input HTML, where a text-formatting element (ie. <b>) has the forcehtml attribute:

<p><b forcehtml><s>TEXT</s></b></p>

gives this markdown output:

<b>~~TEXT~~</b>

where the forcehtml doesn't propagate to the inner <s> element.

The question is, should it propagate or not?
Because the above markdown output renders fine in GitHub (with bold and strikethrough applied).

It seems that markdown syntax still works inside of HTML-syntax, as long as the outer element is an in-line-element like <span>, <b> and <s>, like:

<span><b>~~TEXT~~</b></span>

which properly renders as:

~~TEXT~~

But if the outer element is a block-element (eg. <p>) like below:

<p><b>~~TEXT~~</b></p>

It fails to render the inner markdown-syntax, like so:

~~TEXT~~

Loose list

Edited on 20/12/2022 to include the edge cases in the comments

Context

If list-items have no blank-lines inbetween:

- Item 1
- Item 2
- Item 3

The list is considered tight, and renders like this:

Item 1
Item 2
Item 3

But if there's blank-lines inbetween list-items:

- Item 1

- Item 2

- Item 3

The list is considered loose, and renders with each list-item's contents wrapped in a <p> tag like this:
(visually, it results in larger inbetween-list-item-space)

Item 1
Item 2
Item 3

The improvement

Loose lists (both ordered and unordered) such as:

<!-- All have paragraphs -->
<ul>
  <li><p>Item 1</p></li>
  <li><p>Item 2</p></li>
  <li><p>Item 3</p></li>
</ul>

<!-- Empty list-items have no paragraphs -->
<ul>
  <li><p>Item 1</p></li>
  <li></li>
  <li><p>Item 3</p></li>
</ul>

<!-- List-items with other block elements isn't wrapped in paragraphs -->
<ul>
  <li><p>Item 1</p></li>
  <li><h1>Item 2 (heading)</h1></li>
  <li><p>Item 3</p></li>
</ul>

<!-- List-items with multiple block-elements -->
<ul>
  <li><p>Item 1</p></li>
  <li>
    <p>Item 2</p>
    <h1>Heading in list-item</h1>
  </li>
  <li><p>Item 3</p></li>
</ul>

Should have blank-lines inbetween their converted list-items:

- Item 1

- Item 2

- Item 3

- Item 1

- 

- Item 3

- Item 1

- # Item 2 (heading)

- Item 3

- Item 1

- Item 2
  
  # Heading in list-item

- Item 3

But if the list is tight, with only some list-items having <p>, like:

<ul>
  <li>Item 1</li>
  <li><p>Item 2</p></li>
  <li>Item 3</li>
</ul>

Then the output markdown should be:

- Item 1
- <p>Item 2</p>
- Item 3

`mergeOverwriteArray` not using its own defined TSDoc

Context

The mergeOverwriteArray function found in src > core > helpers > mergeOverwriteArray.ts is like lodash's _.merge but overwrites array values instead of merging them like _.merge.

var users = {
  'data': [{ 'user': 'barney' }, { 'user': 'fred' }]
};

var ages = {
  'data': [{ 'age': 36 }, { 'age': 40 }]
};

_.merge(users, ages);
// => { 'data': [{ 'user': 'barney', 'age': 36 }, { 'user': 'fred', 'age': 40 }] }

mergeOverwriteArray(users, ages);
// => { 'data': [{ 'age': 36 }, { 'age': 40 }] }

The problem

mergeOverwriteArray uses the same type as _.merge, and has its own TSDoc defined in the file mergeOverwriteArray.ts.

However, (at least in VSCode) the TSDoc shown when it's used outside of the mergeOverwriteArray.ts file is that of _.merge instead of its own defined TSDoc:

Help wanted

Any ideas on how to make mergeOverwriteArray have the same generic typing as _.merge but without inheriting its TSDoc?
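
One direction that might be worth trying, sketched here rather than tested against the repo: give the function its own explicit generic signature instead of borrowing the type of _.merge, so the only doc-comment attached to it is its own. The signature below covers just the single-source case, and the implementation is a self-contained stand-in (it doesn't use lodash), so both are assumptions rather than the project's actual code.

```typescript
type PlainObject = Record<string, unknown>

function isPlainObject(value: unknown): value is PlainObject {
    return typeof value === 'object' && value !== null && !Array.isArray(value)
}

/**
 * Recursively merges `source` into `target`, but overwrites array values
 * instead of merging them index-by-index.
 * Mutates and returns `target`.
 */
function mergeOverwriteArray<TObject extends PlainObject, TSource extends PlainObject>(
    target: TObject,
    source: TSource
): TObject & TSource {
    const out: PlainObject = target
    for (const [key, value] of Object.entries(source)) {
        const existing = out[key]
        if (isPlainObject(existing) && isPlainObject(value)) {
            // Both sides are plain objects: merge them recursively.
            mergeOverwriteArray(existing, value)
        } else {
            // Arrays and primitives simply overwrite the target's value.
            out[key] = value
        }
    }
    return target as TObject & TSource
}
```

Supporting multiple sources like _.merge does would need extra overloads, which is left out of this sketch.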

HTML Codeblock indents in list

Context

Codeblock's HTML-in-markdown syntax

The HTML-in-markdown equivalent for codeblock markdowns like:

```javascript
const one = 1;
const two = 2;
```

is:

<pre lang="javascript"><code>const one = 1;
const two = 2;
</code></pre>

which is sensitive to whitespaces that's inside the <pre><code> tags.
For example, adding a 2-space indent to the tags like:

  <pre lang="javascript"><code>const one = 1;
  const two = 2;
  </code></pre>

renders as:

const one = 1;
  const two = 2;

instead of:

const one = 1;
const two = 2;

Current related workarounds

Sometimes the HTML-syntax of codeblock is needed.
For example, inserting codeblocks in tables:

Codeblock in table
const one = 1; const two = 2;

which has a markdown (which is just all HTML) of:

<table>
  <thead>
    <tr>
      <th>Codeblock in table</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
<pre lang="javascript"><code>const one = 1;
const two = 2;
</code></pre>
      </td>
    </tr>
  </tbody>
</table>

Notice how the codeblock-tags are completely unindented.

This is currently achieved by letting the rules indent the codeblock as they please, then unindent it via Regular Expression in the unindentCodeblocks post-process.
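
To make the trick concrete, here is a small sketch of what such a regex-based unindent could look like. It is not the actual unindentCodeblocks post-process; the function name `unindentHtmlCodeblocks` and the regex are assumptions made for illustration.

```typescript
/**
 * Finds every HTML codeblock (from `<pre` up to the first `</pre>`) that
 * starts on an indented line, and strips that indent from each of its lines.
 */
function unindentHtmlCodeblocks(markdown: string): string {
    return markdown.replace(
        /^([ \t]+)(<pre[\s\S]*?<\/pre>)/gm,
        (_match: string, indent: string, block: string) =>
            block
                .split('\n')
                .map((line) => (line.startsWith(indent) ? line.slice(indent.length) : line))
                .join('\n')
    )
}
```

Only the codeblock's lines lose their indent; the surrounding tags (eg. the <td>) keep theirs.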

The problem

Given the below HTML, which is a tight-list with <p> tags in 2/3 of its list-items:

<ul>
    <li>Item 1</li>
    <li>
        <p>Item 2</p>
        <pre><code>Item 2 (codeblock)</code></pre>
        <h1>Item 2 (heading)</h1>
    </li>
    <li><p>Item 3</p></li>
</ul>

The target markdown will have a mix of HTML and markdown syntax:

- Item 1
- <p>Item 2</p>
  <pre><code>Item 2 (codeblock)
  </code></pre>
  <h1>Item 2 (heading)</h1>
- <p>Item 3</p>

Notice how in this case, the <pre><code> tags of the codeblock are indented.
Without this indent, the codeblock will be outside the list.

BUT, that indent cannot be achieved without affecting the current unindentCodeblocks post-process trick mentioned in the Context section above.

Solution?

The easiest way to fix this is to simply make the entire list use HTML-syntax, like:

<ul>
    <li>Item 1</li>
    <li>
        <p>Item 2</p>
<pre><code>Item 2 (codeblock)
</code></pre>
        <h1>Item 2 (heading)</h1>
    </li>
    <li>
      <p>Item 3</p>
    </li>
</ul>

or perhaps the regex of the indent utility function could be changed to indent all but the HTML-codeblocks.

But if the above target markdown is to be achieved without some convoluted regex-based workaround, some SIGNIFICANT OVERHAUL needs to be done on the handling of indents for HTML codeblocks.

HTML comment tags not being escaped

Currently, this HTML:

<p>
  &lt;!--COMMENT--&gt;
</p>

is converted to:

<!--COMMENT-->

where the < isn't escaped when it should be.

Weird rendering of HTML-codeblocks in block-elements like `<p>`

The problem

Although this markdown of HTML-syntax codeblock renders properly:

<pre><code># Heading-1 markdown

## Heading-2 markdown __ITALIC__

&lt;h1 align="center">
    Centered-heading
&lt;/h1>
</code></pre>

# Heading-1 markdown

## Heading-2 markdown __ITALIC__

<h1 align="center">
    Centered-heading
</h1>

When we wrap it in a block element like <blockquote>:

<blockquote>
<pre><code># Heading-1 markdown

## Heading-2 markdown __ITALIC__

&lt;h1 align="center">
    Centered-heading
&lt;/h1>
</code></pre>
</blockquote>

# Heading-1 markdown
Heading-2 markdown ITALIC
<h1 align="center">

Centered-heading

</h1>

It seems to collapse the whitespaces in the codeblock, and attempts to parse the codeblock's contents for markdown-syntax.

The solution

It seems this problem only occurs when there's blank-lines inside the codeblock's content.
Thus, by adding a noop-comment in every blank-line, it seems to stop this weird behavior.

<blockquote>
<pre><code># Heading-1 markdown
<!-- BLANK_LINE -->
## Heading-2 markdown __ITALIC__
<!-- BLANK_LINE -->
&lt;h1 align="center">
    Centered-heading
&lt;/h1>
</code></pre>
</blockquote>

# Heading-1 markdown

## Heading-2 markdown __ITALIC__

<h1 align="center">
    Centered-heading
</h1>
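
A minimal sketch of that workaround, assuming the codeblock's (already escaped) inner text is available as a string. The helper name `markBlankLines` is hypothetical.

```typescript
const BLANK_LINE_COMMENT = '<!-- BLANK_LINE -->'

/**
 * Replaces each blank line inside a codeblock's content with a noop-comment.
 * The empty segment after the final trailing-newline is left alone,
 * so the content still ends with just a newline.
 */
function markBlankLines(codeContent: string): string {
    const lines = codeContent.split('\n')
    return lines
        .map((line, i) => {
            const isAfterLastNewline = i === lines.length - 1
            return line.trim() === '' && !isAfterLastNewline ? BLANK_LINE_COMMENT : line
        })
        .join('\n')
}
```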

Incorrect conversion for empty + forced-HTML block elements

Currently, a forced-HTML block-element like <p> and <h1> like below:

<p forcehtml></p>
<h1 forcehtml></h1>

gets converted to:

<p /><h1/ >

It should instead be:

<p></p>

<h1></h1>

Block-elements in in-line elements (eg. bold, italic, hyperlinks)

If in-line elements contain block elements, they should be in HTML-syntax.

For example, this HTML:

<i>
<pre><code>CODEBLOCK
</code></pre>
</i>

converts to this incorrect markdown:

*```
CODEBLOCK
```

*

Incorrect conversion for empty `<pre>` tag when `addTrailingLinebreak` option is enabled

When addTrailingLinebreak option is enabled via new HTMLarkdown({ addTrailingLinebreak: true }),
the below empty pre tag HTML:

<pre></pre>

gets converted to:

<pre><code><br>
</code></pre>

as the addTrailingLinebreaks preprocess adds a trailing <br> to the empty pre tag.

Codeblocks usually don't have this problem as it contains an inner <code> tag, which is ignored by the addTrailingLinebreaks preprocess as it's not a block-element.

Headings should be in HTML-syntax if they contain block-elements

This HTML:

<h1>
<pre><code>Codeblock (a block-element)
</code></pre>
</h1>

<h1>
  <blockquote>Blockquote (another block-element)</blockquote>
</h1>

Cannot be represented in markdown. The below markdowns will not work:
Fenced-codeblock style: (current implementation's output)

# ```
Codeblock (a block-element)
```

# > Blockquote (another block-element)

Indented-codeblock style:

#     Codeblock (a block-element)

# > Blockquote (another block-element)

The correct conversion should be fully in HTML-syntax:

<h1>
<pre><code>Codeblock (a block-element)
</code></pre>
</h1>

<h1>
  <blockquote>Blockquote (another block-element)</blockquote>
</h1>

Jest coverage report highlighting wrong lines

The coverage report generated by npx jest --coverage gave these highlights which seem to be wrong:

Option to preserve HTML comments

Basic idea

Given these HTML:

<!-- BELOW IS A HEADING -->
<h1>Heading</h1>

<!-- BELOW IS A PARAGRAPH -->
<p>Paragraph</p>

<h1>Heading</h1>
<!-- ABOVE IS A HEADING -->

<p>Paragraph</p>
<!-- ABOVE IS A PARAGRAPH -->

<h1>Heading</h1>

<!-- INBETWEEN -->

<p>Paragraph</p>

The conversion will attempt to place the comment close to its original siblings, like:

<!-- BELOW IS A HEADING -->
# Heading

<!-- BELOW IS A PARAGRAPH -->
Paragraph

# Heading
<!-- ABOVE IS A HEADING -->

Paragraph
<!-- ABOVE IS A PARAGRAPH -->

# Heading

<!-- INBETWEEN -->

Paragraph

No blank-lines on both sides

If the comment in the HTML has no blank-lines between it and BOTH its sibling elements, like:

<h1>Heading</h1>
<!-- NO BLANK LINES ON BOTH SIDES -->
<p>Paragraph</p>

then a blank-line will be added on both sides:

# Heading

<!-- NO BLANK LINES ON BOTH SIDES -->

Paragraph

Multiple blank-lines

In cases where there's more than 1 blank-line between the comment and its sibling elements:

<h1>Heading</h1>



<!-- 3 BLANKS ABOVE, 2 BLANKS BELOW -->


<p>Paragraph</p>

Then the blank-lines are preserved:

# Heading



<!-- 3 BLANKS ABOVE, 2 BLANKS BELOW -->


Paragraph

Comments inside elements

Heading

<h1>
  <!-- COMMENT BEFORE -->
  Heading
  <!-- COMMENT AFTER -->
</h1>

# <!-- COMMENT BEFORE -->Heading<!-- COMMENT AFTER -->

Paragraph

<p>
  <!-- COMMENT BEFORE -->
  Paragraph
  <!-- COMMENT AFTER -->
</p>

<!-- COMMENT BEFORE -->
Paragraph
<!-- COMMENT AFTER -->

List

<ul>
  <!-- OUTSIDE LIST-ITEM (BEFORE) -->
  <li>Item 1</li>
  <li>
    <!-- INSIDE LIST ITEM (BEFORE) -->
    Item 2
    <!-- INSIDE LIST ITEM (AFTER) -->
  </li>
  <li>Item 3</li>
  <!-- OUTSIDE LIST-ITEM (AFTER) -->
</ul>

<!-- OUTSIDE LIST-ITEM (BEFORE) -->
- Item 1
- <!-- INSIDE LIST ITEM (BEFORE) -->
  Item 2
  <!-- INSIDE LIST ITEM (AFTER) -->
- Item 3
<!-- OUTSIDE LIST-ITEM (AFTER) -->

Cases to add to issue's description:

Should not use markdown-escaping inside of HTML-syntax

The problem

Currently, this HTML:

<p align="center">
  &lt;tag&gt;
</p>

converts to:

<p align="center">
  \<tag>
</p>

which incorrectly uses markdown's backslash-escaping, instead of HTML's &lt; escaping.

Edge cases

Most of the time, while inside HTML tags, markdown-syntax (including backslash escaping) doesn't work.
However, there are times when it does, specifically in tags which are:

in-line (eg. text-formattings like <em> / <code>, and <span>)
on a single line in the markdown

For example, these markdown-syntax containing tags render properly:

<code>\<tag> \&nbsp; **Bold**</code>
<sup>\<tag> \&nbsp;  **Bold**</sup>
<span>\<tag> \&nbsp;  **Bold**</span>

Rendered as:

<tag>   Bold
^{<tag>   Bold}
<tag>   Bold

But when they are broken up into multi-lines, the markdown-syntax stops working:

<code>
  \<tag> \&nbsp; **Bold**
</code>
<sup>
  \<tag> \&nbsp;  **Bold**
</sup>
<span>
  \<tag> \&nbsp;  **Bold**
</span>

Rendered as:


  \ \  **Bold**

^{\ \ **Bold**} \ \ **Bold**

Horizontal-rule in text-formatting not converting

Currently, when a <hr> is in any text-formatting element like so:

<b><hr></b>

It doesn't get converted to anything, as the removeTextlessFormattings preprocess removes the element because it has no text.

Remove newlines inside & inbetween block-elements for better readability

Currently, the below HTML:

<blockquote forcehtml>
    <p>Line 1</p>
    <p>Line 2</p>
<pre><code>Codeblock
</code></pre>
    <h1>Heading</h1>
</blockquote>

is converted to:

<blockquote>
  <p>
    Line 1
  </p>
  
  <p>
    Line 2
  </p>
  
<pre><code>Codeblock
</code></pre>
  
  <h1>
    Heading
  </h1>
</blockquote>

which has a lot of unnecessary newlines.

It would be nice to remove the newlines inbetween inner block-elements (eg. paragraphs/codeblocks/headings), as well as make each inner block-element take only 1 line, like so:

<blockquote>
  <p>Line 1</p>
  <p>Line 2</p>
<pre><code>Codeblock
</code></pre>
  <h1>Heading</h1>
</blockquote>

Note: The removal of newlines inbetween inner block-elements inside blockquotes can be achieved by passing down a boolean, which controls whether to add \n\n to end of the inner-content in each block-element-rule

Similarly, block-elements that have no attributes and contain 1 liner inner-content (ie. contains no \n), can be made into 1 liners to improve readability, like so:

<!-- Improved 1 liner -->
<p>TEXT</p>

<!-- Current multi-line block-elements -->
<p>
  TEXT
</p>

Confusing array filter logic

Currently, if a rule's filter is an array of tag-names (ie. TagName[]), the elements are ORed against each other.
eg. filter: ['b', 'strong'] is logically "element has the tag-name 'b' OR 'strong'"

But if the array contains an element of type FilterPredicate, the elements are suddenly ANDed against each other.
eg. filter: ['b', isStrong] is logically "element has the tag-name 'b' AND isStrong"

Also, for convenience, the rules' filters are allowed to be a single (ie. not an array) TagName or FilterPredicate type.
eg. filter: 'b' or filter: isStrong, which slightly complicates the typings and evaluation of the filters.

A better evaluation logic would be:

disallow just filter: TagName | FilterPredicate types (ie. only allow arrays)
allow nested arrays where:
- the outer array is OR logically
- the inner array is AND logically
  (ie. [A, [B, C], D] is logically "A or (B and C) or D")
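
The proposed evaluation could look roughly like the sketch below. `ElementLike` is a minimal stand-in for a DOM element, and the `Filter` type and `matchesFilter` helper are hypothetical names for this illustration; only TagName and FilterPredicate are names from the issue, and their definitions here are assumed.

```typescript
// Minimal stand-in for a DOM element, to keep the sketch self-contained.
interface ElementLike {
    tagName: string
    attributes: Record<string, string>
}

type TagName = string
type FilterPredicate = (element: ElementLike) => boolean
type FilterItem = TagName | FilterPredicate
// Outer array = OR. Inner arrays = AND.
type Filter = (FilterItem | FilterItem[])[]

function matchesItem(element: ElementLike, item: FilterItem): boolean {
    return typeof item === 'string'
        ? element.tagName.toLowerCase() === item.toLowerCase()
        : item(element)
}

/** [A, [B, C], D] is evaluated as "A or (B and C) or D". */
function matchesFilter(element: ElementLike, filter: Filter): boolean {
    return filter.some((entry) =>
        Array.isArray(entry)
            ? entry.every((item) => matchesItem(element, item))
            : matchesItem(element, entry)
    )
}

// Example: "is a <b>, or is a <span> that's center-aligned"
const isCentered: FilterPredicate = (el) => el.attributes['align'] === 'center'
const exampleFilter: Filter = ['b', ['span', isCentered]]
```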

Text-formattings unnecessarily converting to HTML syntax

<p>a <b> a</b></p>

Currently converts to:

a <strong>&nbsp;a</strong>

But, to avoid unnecessarily using HTML syntax, it should instead be converted to:

a **&nbsp;a**

The above bug also happens to the other text-formatting types (eg. italic, strikethrough)

Edit:

I can't think of a situation where the HTML-syntax of text-formattings are required.

Since leading/trailing spaces inside the text-formattings are now escaped to &nbsp;, such leading/trailing spaces can be used in Markdown-syntax:
(eg. bolding with leading & trailing spaces: **    TEXT    **)

So for now, the toUseHtmlPredicate of text-formattings will be set to always return false, until a situation that needs HTML arises.

List separator to break-up adjacent lists

Relevant context

If list-items have no blank-lines inbetween:

- Item 1
- Item 2
- Item 3

They render as a tight-list like this:

Item 1
Item 2
Item 3

But if there's blank-lines:

- Item 1

- Item 2

- Item 3

They render as a loose-list:
(which results in larger inbetween-list-item-space)

Item 1
Item 2
Item 3

The problem

If 2 lists are adjacent to each other:

- List 1 - item 1
- List 1 - item 2
- List 1 - item 3

- List 2 - item 1
- List 2 - item 2
- List 2 - item 3

They render as 1 big loose-list:

List 1 - item 1
List 1 - item 2
List 1 - item 3
List 2 - item 1
List 2 - item 2
List 2 - item 3

This doesn't seem to happen if one of the adjacent lists is in HTML-syntax:

<ul>
  <li>List 1 - item 1</li>
  <li>List 1 - item 2</li>
  <li>List 1 - item 3</li>
</ul>

- List 2 - item 1
- List 2 - item 2
- List 2 - item 3

List 1 - item 1
List 1 - item 2
List 1 - item 3

List 2 - item 1
List 2 - item 2
List 2 - item 3

It also doesn't happen if the adjacent lists aren't of the same type (ie. ordered vs. unordered):

1. Ordered item 1
2. Ordered item 2
3. Ordered item 3

- Unordered item 1
- Unordered item 2
- Unordered item 3

Ordered item 1
Ordered item 2
Ordered item 3

Unordered item 1
Unordered item 2
Unordered item 3

Solution

This can be solved by adding something inbetween the 2 lists. Such as:

Using a comment:

- List 1 - item 1
- List 1 - item 2
- List 1 - item 3

<!-- LIST_SEPARATOR -->

- List 2 - item 1
- List 2 - item 2
- List 2 - item 3

Which renders as:

List 1 - item 1
List 1 - item 2
List 1 - item 3

List 2 - item 1
List 2 - item 2
List 2 - item 3

Using an invalid HTML tag:

- List 1 - item 1
- List 1 - item 2
- List 1 - item 3

<LISTSEPARATOR>

- List 2 - item 1
- List 2 - item 2
- List 2 - item 3

Which renders as:

List 1 - item 1
List 1 - item 2
List 1 - item 3

List 2 - item 1
List 2 - item 2
List 2 - item 3
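
A sketch of the comment-based option, assuming the already-converted top-level blocks are available together with their kind. The `Block` type and `joinBlocks` helper are made up for this example, and a separator is only added between adjacent lists of the same type, as described in "The problem" above.

```typescript
type Block = { kind: 'ul' | 'ol' | 'other'; markdown: string }

const LIST_SEPARATOR = '<!-- LIST_SEPARATOR -->'

/** Joins converted blocks, separating adjacent lists of the same type. */
function joinBlocks(blocks: Block[]): string {
    const parts: string[] = []
    blocks.forEach((block, i) => {
        const prev = i > 0 ? blocks[i - 1] : undefined
        const isAdjacentSameList =
            prev !== undefined && prev.kind !== 'other' && prev.kind === block.kind
        if (isAdjacentSameList) {
            parts.push(LIST_SEPARATOR)
        }
        parts.push(block.markdown)
    })
    return parts.join('\n\n')
}
```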

Allow titles in links

From CommonMark spec: https://spec.commonmark.org/0.30/#link-title

Link titles can be added to the link-markdown by:

[link](/uri "title")

which renders as this HTML:

<p><a href="/uri" title="title">link</a></p>
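
A possible shape for this, as a sketch only: the helper name `inlineLink` is hypothetical, and it assumes the title is always emitted in double-quotes, with backslashes and double-quotes inside it backslash-escaped.

```typescript
/** Builds an inline link, adding the title only when one is given. */
function inlineLink(text: string, href: string, title?: string): string {
    if (!title) {
        return `[${text}](${href})`
    }
    // Escape backslashes and double-quotes so they can't end the title early.
    const escapedTitle = title.replace(/[\\"]/g, '\\$&')
    return `[${text}](${href} "${escapedTitle}")`
}
```

Escaping of the link text and the URL itself is left out here.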

Allow conversion of string HTMLs with containers

Currently, only HTML strings that aren't wrapped in a container are accepted by HTMLarkdown.convert.
For example:

<h1>Heading</h1>
<p>Paragraph</p>

is converted properly to:

# Heading

Paragraph

But when it's wrapped in a container like:

<article id="container">
  <h1>Heading</h1>
  <p>Paragraph</p>
</article>

it doesn't convert properly.

Note: <article> is used as an example above as it doesn't have an associated rule, and is thus stripped.

Trailing-newline in codeblocks

GitHub adds a trailing-newline inside the generated <pre><code> element:

```
codeblock
```

renders as this HTML:

<pre><code>codeblock
</code></pre>

Thus, the current tests (which don't have a trailing-newline) as shown below:

<pre><code>&lt;tag&gt;</code></pre>

need to be updated to include the trailing-newline like so:

<pre><code>&lt;tag&gt;
</code></pre>

Perhaps, an option to enable/disable removing of the trailing-newline should be added.

Also, the HTML-syntax of codeblocks doesn't collapse the whitespace inside it. Thus, it's sensitive to extra newlines.

Hence, the current tests' expected HTML-syntax markdown such as below:

<pre><code>
unbolded <b>bolded</b>
</code></pre>

should instead be:

<pre><code>unbolded <b>bolded</b>
</code></pre>

Option to use setext headings

Since setext headings only support <h1> and <h2>,
the problem is what should HTMLarkdown do when it encounters a <h3>Heading 3</h3> when in setext mode?

Should it:

switch to ATX-style? (ie. ### Heading 3)
or switch to HTML-syntax? (<h3>Heading 3</h3> )
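
To illustrate the first option (falling back to ATX-style), here is a sketch. The function name `convertHeading` and the choice to make the underline at least 3 characters long are assumptions, and it only handles single-line heading text.

```typescript
/** Setext for levels 1 and 2, falling back to ATX-style for levels 3 to 6. */
function convertHeading(level: number, text: string): string {
    const underlineLength = Math.max(text.length, 3)
    if (level === 1) {
        return `${text}\n${'='.repeat(underlineLength)}`
    }
    if (level === 2) {
        return `${text}\n${'-'.repeat(underlineLength)}`
    }
    return `${'#'.repeat(level)} ${text}`
}
```

The second option would only change the last return, to wrap the text in the matching <h3> to <h6> tags instead.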

Padding table delimiter-row's hyphens with a space

Context

Currently, the hyphens - of the table delimiter-row extend all the way to the column-separators |:

| Default-Left | Centered | Right-Aligned |
|--------------|:--------:|--------------:|
| Cell 1       | Cell 2   | Cell 3        |

Note: Colon-aligned : feature hasn't been implemented yet, but is here for illustration purposes.

and does not pad the hyphens with a space like:

| Default-Left | Centered | Right-Aligned |
| ------------ | :------: | ------------: |
| Cell 1       | Cell 2   | Cell 3        |

This was because tables require:

minimum of 3 hyphens
(although GitHub doesn't require it, and technically accepts 1 hyphen)
minimum of length 3 to replace both the leading and trailing hyphen with colons : for center-alignment
(ie. :-:, the shortest possible center-aligned delimiter),

and I didn't want to deal with the edge case where the column only has a width of 1 character, like:

| a |
| - |
| b |

The improvement

Add a 1-space padding on each side of the hyphens.

As for the edge case where the column has < 3 characters, we can either:

set the minimum width of columns to 3 characters (excluding the 2-space paddings):

| 1   |      | 12  |      | 123 |      | 1234 |
| --- |  ->  | --- |  ->  | --- |  ->  | ---- |
| a   |      | a   |      | a   |      | a    |

Center-aligned:
| 1   |      | 12  |      | 123 |      | 1234 |
| :-: |  ->  | :-: |  ->  | :-: |  ->  | :--: |
| a   |      | a   |      | a   |      | a    |

extend the hyphens all the way if there's < 3 characters, else pad the hyphens:

| 1 |        | 12 |       | 123 |      | 1234 |
|---|   ->   |----|  ->   | --- |  ->  | ---- |
| a |        | a  |       | a   |      | a    |

Center-aligned:
| 1 |        | 12 |       | 123 |      | 1234 |
|:-:|   ->   |:--:|  ->   | :-: |  ->  | :--: |
| a |        | a  |       | a   |      | a    |
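
To make the two styles easier to compare, here is a sketch that builds one delimiter cell (the text between two | separators) under each style. The names `delimiterCell`, `"min3"` and `"extend"` are made up for this illustration and are not part of HTMLarkdown.

```typescript
type Align = "none" | "left" | "center" | "right";
// "min3": always pad, with a minimum column width of 3 (first style above).
// "extend": fill the whole cell when the column is narrower than 3 (second style).
type PaddingStyle = "min3" | "extend";

// Builds a run of hyphens of the given total width, with colons on the
// aligned side(s). Callers guarantee width >= 3.
function dashes(width: number, align: Align): string {
  const left = align === "left" || align === "center" ? ":" : "";
  const right = align === "right" || align === "center" ? ":" : "";
  return left + "-".repeat(width - left.length - right.length) + right;
}

function delimiterCell(
  contentWidth: number,
  align: Align,
  style: PaddingStyle
): string {
  if (style === "extend" && contentWidth < 3) {
    // No padding: the hyphens also take the 2 characters the padding would use.
    return dashes(Math.max(contentWidth, 1) + 2, align);
  }
  return " " + dashes(Math.max(contentWidth, 3), align) + " ";
}

for (const width of [1, 2, 3, 4]) {
  console.log(
    "|" + delimiterCell(width, "center", "min3") + "|",
    "|" + delimiterCell(width, "center", "extend") + "|"
  );
}
```

With `"min3"` the header and body cells of a narrow column also need padding out to 3 characters, which this sketch does not show.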

The question

Which of the 2 styles in "The improvement" section above is better?

In GitHub, codeblock syntax-highlighting trims leading/trailing newlines

In GitHub, codeblocks without syntax-highlighting preserve leading/trailing newlines:

```

const x

```

renders as:


const x

But when the language is specified, and there's syntax-highlighting, like:

```javascript

const x

```

it trims leading/trailing newlines when rendered:

const x

This CANNOT be circumvented with HTML-syntax like below. It still trims the newlines.

<pre lang="javascript"><code>

const x

</code></pre>

A possible fix would be to add 1 space to the front and back each, like:
(this example has double newlines on both ends, to show that you only need 1 space at the front and back)

```javascript
[SPACE]

const x

[SPACE]
```
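
A sketch of that idea, where [SPACE] stands for one literal space character. `guardEdgeNewlines` is a hypothetical helper, not an existing HTMLarkdown function, and whether GitHub then keeps the blank lines is the claim made above, not something this code verifies.

```typescript
// Hypothetical helper: `code` is the text between the opening and closing
// fence lines. A lone space is placed on the first/last line whenever that
// line would otherwise be empty, so the edge lines are no longer blank.
function guardEdgeNewlines(code: string, lang: string): string {
  // Without a language there is no highlighting, hence no trimming to avoid.
  if (lang === "") return code;
  const head = code.startsWith("\n") ? " " : "";
  const tail = code.endsWith("\n") ? " " : "";
  return head + code + tail;
}

// Double newlines on both ends, as in the example above.
const guarded = guardEdgeNewlines("\n\nconst x\n\n", "javascript");
console.log(JSON.stringify(guarded.split("\n")));
```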

Incorrect conversion for empty `<code>` tag

Currently, the below HTML:

<code></code>

gets converted to:

``

Which is rendered (at least on GitHub) as literal backticks: ``

The correct conversion should be in HTML-syntax:

<code></code>

Note: other inline text-formatting tags don't have this problem, as they are removed by the removeEmptyElements preprocess.
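
The intended behaviour, as a minimal sketch. `renderInlineCode` is a hypothetical function (HTMLarkdown's actual rule for <code> is not shown here), and it assumes the text contains no backticks.

```typescript
// Hypothetical sketch: an empty code span has no markdown form (two
// backticks render literally), so fall back to HTML-syntax.
function renderInlineCode(text: string): string {
  if (text === "") return "<code></code>";
  // Assumption: `text` contains no backticks, so one backtick on each
  // side is enough to delimit it.
  return "`" + text + "`";
}

console.log(renderInlineCode(""));
console.log(renderInlineCode("const x"));
```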

Don't strip `<thead>`, `<tbody>` & `<tfoot>` tags in tables

The coloring for the rows might be different if they are stripped:

Description HTML-in-markdown Rendered HTML

No <thead>, <tbody> nor <tfoot>

<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Cell 1.1</td>
    <td>Cell 1.2</td>
  </tr>
  <tr>
    <td>Cell 2.1</td>
    <td>Cell 2.2</td>
  </tr>
</table>

Header 1	Header 2
Cell 1.1	Cell 1.2
Cell 2.1	Cell 2.2

With <thead> and <tbody> only

<table>
  <thead>
    <tr>
      <th>Header 1</th>
      <th>Header 2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cell 1.1</td>
      <td>Cell 1.2</td>
    </tr>
    <tr>
      <td>Cell 2.1</td>
      <td>Cell 2.2</td>
    </tr>
  </tbody>
</table>

Header 1	Header 2
Cell 1.1	Cell 1.2
Cell 2.1	Cell 2.2

With <thead>, <tbody> and <tfoot>

<table>
  <thead>
    <tr>
      <th>Header 1</th>
      <th>Header 2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cell 1.1</td>
      <td>Cell 1.2</td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <td>Cell 2.1</td>
      <td>Cell 2.2</td>
    </tr>
  </tfoot>
</table>

Header 1	Header 2
Cell 1.1	Cell 1.2
Cell 2.1	Cell 2.2

Add README badges

Such as the coverage %, license, version and the "all test cases passed" badges.

Missing syntax-highlighting language in empty codeblocks

This HTML:

<pre lang="md"><code>
</code></pre>

outputs this:

```
```

Which is missing the "md" syntax-highlighting language.
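
For reference, a sketch of a fence renderer that keeps the language even when the body is empty. `renderFencedCodeblock` is a hypothetical name; the cause of the bug in HTMLarkdown itself is not shown in this issue.

```typescript
// Three backticks, built with repeat() to keep this example readable.
const FENCE = "`".repeat(3);

// Hypothetical sketch: the language is written on the opening fence
// regardless of whether there is any code between the fences.
function renderFencedCodeblock(code: string, lang = ""): string {
  const body = code === "" ? "" : code + "\n";
  return FENCE + lang + "\n" + body + FENCE;
}

console.log(renderFencedCodeblock("", "md"));
console.log(renderFencedCodeblock("# Title", "md"));
```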

Incorrect newlines for `<br>` that's outside of block-elements

Input:

<br><br>
<p>TEXT</p>
<br><br>

Current output:

<br>
<br>TEXT


<br>
<br>

Expected output:

<br><br>

TEXT

<br><br>

Prevent indenting codeblocks that are in HTML-syntax

Currently, some HTML conversions add indentation like so:

<h1>
  <pre><code>TEXT
  </code></pre>
</h1>

But codeblocks don't collapse whitespace, and thus are sensitive to the added spaces from the indentation.

Hence, codeblocks should not be indented.
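
One way to do this, sketched under the assumption that indentation is applied line by line to already-serialised HTML (which may not match how HTMLarkdown indents internally). `indentOutsidePre` is a hypothetical name.

```typescript
// Hypothetical sketch: indent every line except those that continue a
// <pre> element opened on an earlier line.
function indentOutsidePre(html: string, indent = "  "): string {
  let insidePre = false;
  return html
    .split("\n")
    .map((line) => {
      // Whitespace before the opening <pre> is harmless, so that line is
      // still indented; lines inside the element are left untouched.
      const out = insidePre ? line : indent + line;
      // Whichever of "<pre" / "</pre>" occurs last on the line decides
      // whether the next line is inside a <pre>.
      const lastOpen = line.lastIndexOf("<pre");
      const lastClose = line.lastIndexOf("</pre>");
      if (lastOpen > lastClose) insidePre = true;
      else if (lastClose > lastOpen) insidePre = false;
      return out;
    })
    .join("\n");
}

console.log("<h1>\n" + indentOutsidePre("<pre><code>TEXT\n</code></pre>") + "\n</h1>");
```

The same check would also cover codeblocks whose <pre> carries attributes (eg. lang), since it only looks for the "<pre" prefix.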

Aligning table columns with colons

Table columns can be aligned by adding a trailing/leading colon to the delimiter-row:

| Default-Left | Center-Aligned | Right-Aligned |
| ------------ | :------------: | ------------: |

Additionally, the text in the row can also be visually aligned with spaces to follow the column's alignment:

| Default-Left | Center-Aligned | Right-Aligned |
| ------------ | :------------: | ------------: |
| Cell 1       |     Cell 2     |        Cell 3 |
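
The visual alignment of the cell text can be sketched as below. `alignCell` and `renderRow` are hypothetical helpers, and the column widths are assumed to be computed elsewhere (here they are the header lengths).

```typescript
type CellAlign = "left" | "center" | "right";

// Pads `text` with spaces to `width`, following the column's alignment.
function alignCell(text: string, width: number, align: CellAlign): string {
  const gap = Math.max(width - text.length, 0);
  if (align === "right") return " ".repeat(gap) + text;
  if (align === "center") {
    // An odd gap puts the extra space on the right.
    const before = Math.floor(gap / 2);
    return " ".repeat(before) + text + " ".repeat(gap - before);
  }
  return text + " ".repeat(gap);
}

// Joins the aligned cells with 1-space padding around each separator.
function renderRow(
  cells: string[],
  widths: number[],
  aligns: CellAlign[]
): string {
  const padded = cells.map((cell, i) => alignCell(cell, widths[i], aligns[i]));
  return "| " + padded.join(" | ") + " |";
}

const widths = ["Default-Left", "Center-Aligned", "Right-Aligned"].map(
  (header) => header.length
);
console.log(renderRow(["Cell 1", "Cell 2", "Cell 3"], widths, ["left", "center", "right"]));
```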

`&nbsp;` should not be escaped in code

Currently, this HTML:

<code>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</code>

converts to:

`&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;`

instead of:

`         `

Incorrect conversions for tables with rows of different lengths

Problem 1

Currently, this HTML:

<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Cell 1</td>
  </tr>
</table>

throws an error.

The output markdown should be in HTML-syntax, the same as the above HTML.

Note: It should NOT be in markdown with missing cells like:
| Header 1 | Header 2 |
| -------- | -------- |
| Cell 1   |
As the above markdown renders as:

Header 1 Header 2

Cell 1

When it's supposed to be:

Header 1 Header 2

Cell 1

Problem 2

Currently, this HTML:

<table>
  <tr>
    <th>Header 1</th>
  </tr>
  <tr>
    <td>Cell 1</td>
    <td>Cell 2</td>
  </tr>
</table>

converts to this markdown:

| Header 1 |
|----------|
| Cell 1   | Cell 2 |

which incorrectly renders in GitHub as:

Header 1
Cell 1

The output markdown should be in HTML-syntax, the same as the above HTML.

Note: It should NOT be in markdown with empty headers:
| Header 1 |        |
| -------- | ------ |
| Cell 1   | Cell 2 |
As the above markdown renders as:

Header 1

Cell 1 Cell 2

When it's supposed to be:

Header 1

Cell 1 Cell 2
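
Both problems come down to the same check: use markdown-syntax only when every row has the same number of cells. A minimal sketch, assuming the table has already been read into rows of cell strings (`canUseMarkdownTable` is a hypothetical name, and colspan/rowspan are ignored):

```typescript
// Hypothetical sketch: a table can be written in markdown-syntax only if
// it is rectangular. Otherwise it should stay in HTML-syntax.
function canUseMarkdownTable(rows: string[][]): boolean {
  if (rows.length === 0) return false;
  const columns = rows[0].length;
  return columns > 0 && rows.every((row) => row.length === columns);
}

// Problem 1: the body row is shorter than the header row.
console.log(canUseMarkdownTable([["Header 1", "Header 2"], ["Cell 1"]]));
// Problem 2: the body row is longer than the header row.
console.log(canUseMarkdownTable([["Header 1"], ["Cell 1", "Cell 2"]]));
```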

What merging methods to use?

Context

So far, the merging method I've used is:

ensure branch is up-to-date (if not, rebase branch onto master)
merge by git merge --no-ff [branch]
no squash
then git commit --amend, and add an issue-closing line to the merge-commit's body (eg. Fixes #4, Closes #5)

Which gives a history that looks like this:

The question

I'm still fairly new to the programming world, so I don't know what the industry standard for this is.

To squash?
To rebase instead of merge?
Which commit to put the issue-closing keywords (eg. Fixes #3) on? The merge commit? Or the commits in the branch?

If anyone has some experience with this, and knows the pros/cons of each, please advise! D:

Block-elements wrapped in text-formatting

When block-elements are wrapped in text-formatting elements (eg. <b>), like so:

<b><p>Bold-wrapped paragraph</p></b>
<p>=== SEPARATOR ===</p>
<b><hr></b>
<p>=== END ===</p>

The trailing-newlines of the block-elements become malformed:

**Bold-wrapped paragraph

**=== SEPARATOR ===

**<hr>**=== END ===

The expected output should be:

**Bold-wrapped paragraph**

=== SEPARATOR ===

**<hr>**

=== END ===

The question is, is this a problem?
Since block-elements wrapped in text-formatting elements could be considered as malformed HTML.

HTML-codeblocks are indented when they contain attributes

This HTML:

<h1>
<pre><code>asd
qwe
</code></pre>
</h1>

<h1>
<pre lang="md"><code>asd
qwe
</code></pre>
</h1>

is converted to this markdown:

<h1>
<pre><code>asd
qwe
</code></pre>
</h1>

<h1>
  <pre lang="md"><code>asd
  qwe
  </code></pre>
</h1>

The HTML-syntax codeblock is only indented when it contains attributes (eg. lang).

Text-formatting elements not propagating `forcehtml` attribute

Currently, this HTML:

<p>
  <b forcehtml><i>Bold and italic</i></b>
</p>

converts to:

<b>*Bold and italic*</b>

instead of turning all elements inside the bold into HTML-syntax:

<b><i>Bold and italic</i></b>
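
The expected propagation can be sketched with a tiny element tree. The `ElementNode` type and `render` function are simplified stand-ins made up for this example; they are not the DOM `Element` nor HTMLarkdown's rule API.

```typescript
// Simplified stand-in for an element: tag name, attribute names, children.
interface ElementNode {
  tag: "b" | "i";
  attrs: string[];
  children: Array<ElementNode | string>;
}

// Renders a node, passing `forced` down so that every descendant of a
// `forcehtml` element is also written in HTML-syntax.
function render(node: ElementNode | string, forced = false): string {
  if (typeof node === "string") return node;
  const useHtml = forced || node.attrs.indexOf("forcehtml") !== -1;
  const inner = node.children.map((child) => render(child, useHtml)).join("");
  if (useHtml) return "<" + node.tag + ">" + inner + "</" + node.tag + ">";
  const marker = node.tag === "b" ? "**" : "*";
  return marker + inner + marker;
}

const bold: ElementNode = {
  tag: "b",
  attrs: ["forcehtml"],
  children: [{ tag: "i", attrs: [], children: ["Bold and italic"] }],
};
console.log(render(bold)); // <b><i>Bold and italic</i></b>
```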

Multiple block-elements in a list-item

The problem

The below tight-list HTML:
(explanation on tight/loose list found in #17)

<ul>
  <li>
    Item 1
    <pre><code>Codeblock</code></pre>
    <h1>Heading</h1>
  </li>
  <li>Item 2</li>
</ul>

Currently converts to:

- Item 1```
  Codeblock
  ```
  
  # Heading
- Item 2

It should instead be converted to:

- Item 1
  ```
  Codeblock
  ```
  # Heading
- Item 2

(Sub-problem) Newline after text-node

With reference to the conversion above,
there's a lack of newline after the text "Item 1":

- Item 1```
  Codeblock
  ```

(Sub-problem) Blank-lines in between block-elements inside list-item

Tight-lists cannot have blank-lines inside their list-items, else they will be rendered as loose-lists instead.

With reference to the conversion above,
notice the lack of blank-lines in between the codeblock and heading in the correct conversion:

- Item 1
  ```
  Codeblock
  ```
  # Heading
- Item 2
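
Both sub-problems can be handled together in a sketch like the one below, which assumes each block inside the list-item has already been converted to markdown on its own. `renderTightListItem` is a hypothetical helper, not HTMLarkdown's implementation.

```typescript
// Three backticks, built with repeat() to keep this example readable.
const FENCE = "`".repeat(3);

// Hypothetical sketch: joins the blocks with single newlines (no blank
// lines, so the list stays tight), then indents every continuation line
// to line up with the text after the "- " marker.
function renderTightListItem(blocks: string[]): string {
  const lines = blocks.join("\n").split("\n");
  return lines.map((line, i) => (i === 0 ? "- " : "  ") + line).join("\n");
}

console.log(
  renderTightListItem(["Item 1", FENCE + "\nCodeblock\n" + FENCE, "# Heading"])
);
console.log(renderTightListItem(["Item 2"]));
```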

Option to use underscore bold and italic

Underscore bold and italic are in the form:

__bold__, _italic_

The reason both bold and italics currently use asterisks * only is this edge case:

PREFIX__bold__, PREFIX_italic_

PREFIX**bold**, PREFIX*italic*

which renders as:

PREFIX__bold__, PREFIX_italic_

PREFIXbold, PREFIXitalic

The questions are:

How do we detect if a bold/italic is prefixed (without spaces in between)?
What should HTMLarkdown do if it's set to use underscores, and those edge cases happen?
Should it:
- Use asterisks?
- or use HTML-syntax (ie. <b>, <i>)
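
One possible answer to the detection question is to look at the characters immediately before and after the element. The sketch below is a simplification: `renderEmphasis` and its parameters are hypothetical, and the word-character test is ASCII-only, whereas CommonMark's flanking rules are defined in terms of Unicode whitespace and punctuation.

```typescript
type Kind = "bold" | "italic";
type Fallback = "asterisk" | "html";

// ASCII-only approximation of "this character is part of a word".
function isWordChar(ch: string): boolean {
  return /^[A-Za-z0-9]$/.test(ch);
}

// `before` / `after` are the characters adjacent to the element in the
// surrounding text, or "" at the start/end of the text.
function renderEmphasis(
  text: string,
  kind: Kind,
  before: string,
  after: string,
  preferUnderscore: boolean,
  fallback: Fallback
): string {
  const count = kind === "bold" ? 2 : 1;
  const asterisks = "*".repeat(count);
  if (!preferUnderscore) return asterisks + text + asterisks;
  // Underscores touching a word character are not parsed as emphasis.
  const intraword = isWordChar(before) || isWordChar(after);
  if (!intraword) return "_".repeat(count) + text + "_".repeat(count);
  if (fallback === "asterisk") return asterisks + text + asterisks;
  const tag = kind === "bold" ? "b" : "i";
  return "<" + tag + ">" + text + "</" + tag + ">";
}

console.log("PREFIX" + renderEmphasis("bold", "bold", "X", "", true, "asterisk"));
console.log("PREFIX" + renderEmphasis("italic", "italic", "X", "", true, "html"));
console.log(renderEmphasis("bold", "bold", " ", ",", true, "html"));
```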
