evitanrelta / htmlarkdown

HTML-to-Markdown converter that adaptively preserves HTML when needed (eg. when center-aligning, or resizing images)

Home Page: https://evitanrelta.github.io/htmlarkdown

License: MIT License

TypeScript 64.62% HTML 35.38%
commonmark converter gfm html-converter html-to-markdown javascript node node-js nodejs typescript

htmlarkdown's Introduction

  Hello!  I'm Shaun  

a NUS Computer Science undergraduate

$  I code
$  I drink coffee
$  and sometimes upload gaming YouTube videos

htmlarkdown's People

Contributors

evitanrelta

htmlarkdown's Issues

In Github, codeblock syntax-highlighting trims leading/trailing newlines

In GitHub, codeblocks without syntax-highlighting preserve leading/trailing newlines:

```

const x

```


renders as:


const x


But when the language is specified and there's syntax-highlighting, like:

```javascript

const x

```

it trims leading/trailing newlines when rendered:

const x

This CANNOT be circumvented with HTML-syntax like below. It still trims the newlines.

<pre lang="javascript"><code>

const x

</code></pre>

A possible fix would be to add 1 space each to the front and back, like:
(this example has double newlines on both ends, to show that you only need 1 space at the front and back)

```javascript
[SPACE]

const x

[SPACE]
```
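
A minimal sketch of that space-padding workaround, assuming the codeblock's content is available as an array of lines. `padBlankEdgeLines` is a hypothetical helper name, not something that exists in HTMLarkdown:

```typescript
// Hypothetical helper: replaces a blank first/last line with a single space,
// so that the leading/trailing newlines survive syntax-highlighted rendering.
function padBlankEdgeLines(lines: string[]): string[] {
  if (lines.length === 0) return [];
  const padded = [...lines];
  const last = padded.length - 1;
  if (padded[0] === "") padded[0] = " ";
  if (padded[last] === "") padded[last] = " ";
  return padded;
}

// Only the outermost blank lines get a space; inner blank lines are untouched.
const paddedBody = padBlankEdgeLines(["", "", "const x", "", ""]);
console.log(JSON.stringify(paddedBody));
```

Only the outermost lines are touched, which matches the observation above that 1 space at the front and back is enough.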

Trailing-newline in codeblocks

Github adds a trailing-newline inside the generated <pre><code> element:

```
codeblock
```

renders as this HTML:

<pre><code>codeblock
</code></pre>

Thus, the current tests (which don't have a trailing-newline), such as the one below:

<pre><code>&lt;tag&gt;</code></pre>

need to be updated to include the trailing-newline like so:

<pre><code>&lt;tag&gt;
</code></pre>

Perhaps, an option to enable/disable removing of the trailing-newline should be added.


Also, the HTML-syntax of codeblocks doesn't collapse the whitespace inside it. Thus, it's sensitive to extra newlines.

Hence, the current tests' expected HTML-syntax markdown such as below:

<pre><code>
unbolded <b>bolded</b>
</code></pre>

should instead be:

<pre><code>unbolded <b>bolded</b>
</code></pre>

HTML Codeblock indents in list

Context

Codeblock's HTML-in-markdown syntax

The HTML-in-markdown equivalent for codeblock markdowns like:

```javascript
const one = 1;
const two = 2;
```

is:

<pre lang="javascript"><code>const one = 1;
const two = 2;
</code></pre>

which is sensitive to whitespace inside the <pre><code> tags.
For example, adding a 2-space indent to the tags like:

  <pre lang="javascript"><code>const one = 1;
  const two = 2;
  </code></pre>


renders as:

const one = 1;
  const two = 2;
  


instead of:

const one = 1;
const two = 2;

Current related workarounds

Sometimes the HTML-syntax of codeblock is needed.
For example, inserting codeblocks in tables:

Codeblock in table
const one = 1;
const two = 2;

which has a markdown (which is just all HTML) of:

<table>
  <thead>
    <tr>
      <th>Codeblock in table</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
<pre lang="javascript"><code>const one = 1;
const two = 2;
</code></pre>
      </td>
    </tr>
  </tbody>
</table>

Notice how the codeblock-tags are completely unindented.

This is currently achieved by letting the rules indent the codeblock as they please, then unindent it via Regular Expression in the unindentCodeblocks post-process.
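
A rough sketch of what such a regex-based post-process could look like. This is an illustration written for this issue, not the actual unindentCodeblocks source; the function name `unindentHtmlCodeblocks` and the pattern are assumptions:

```typescript
// Illustrative only: finds indented <pre>...</pre> blocks and removes that
// same indent from every line of the block.
function unindentHtmlCodeblocks(markdown: string): string {
  const indentedPre = /^([ \t]+)(<pre\b[\s\S]*?<\/pre>)/gm;
  return markdown.replace(indentedPre, (_match: string, indent: string, block: string) =>
    block
      .split("\n")
      // The first line already starts at "<pre", so only later lines are trimmed.
      .map((line, i) => (i > 0 && line.startsWith(indent) ? line.slice(indent.length) : line))
      .join("\n")
  );
}

console.log(unindentHtmlCodeblocks("<td>\n  <pre><code>a\n  b\n  </code></pre>\n</td>"));
```

Because it strips exactly the indent found in front of the opening <pre> tag, it assumes the rules indented every line of the codeblock by the same amount.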


The problem

Given the below HTML, which is a tight-list with <p> tags in 2/3 of its list-items:

<ul>
    <li>Item 1</li>
    <li>
        <p>Item 2</p>
        <pre><code>Item 2 (codeblock)</code></pre>
        <h1>Item 2 (heading)</h1>
    </li>
    <li><p>Item 3</p></li>
</ul>


The target markdown will have a mix of HTML and markdown syntax:

- Item 1
- <p>Item 2</p>
  <pre><code>Item 2 (codeblock)
  </code></pre>
  <h1>Item 2 (heading)</h1>
- <p>Item 3</p>

Notice how in this case, the <pre><code> tags of the codeblock are indented.
Without this indent, the codeblock will be outside the list.

BUT, that indent cannot be achieved without affecting the current unindentCodeblocks post-process trick mentioned in the Context section above.


Solution?

The easiest way to fix this is to simply make the entire list use HTML-syntax, like:

<ul>
    <li>Item 1</li>
    <li>
        <p>Item 2</p>
<pre><code>Item 2 (codeblock)
</code></pre>
        <h1>Item 2 (heading)</h1>
    </li>
    <li>
      <p>Item 3</p>
    </li>
</ul>

or perhaps the regex of the indent utility function could be changed to indent all but the HTML-codeblocks.

But if the above target markdown is to be achieved without some convoluted regex-based workaround, some SIGNIFICANT OVERHAUL needs to be done on the handling of indents for HTML codeblocks.
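
For the second idea (indenting everything but the HTML-codeblocks), a line-based sketch might look like the following. `indentExceptHtmlCodeblocks` is a hypothetical name, and the real indent utility may work quite differently:

```typescript
// Hypothetical indent utility: indents every line except blank lines and
// lines that belong to a <pre>...</pre> block.
function indentExceptHtmlCodeblocks(text: string, indent: string = "  "): string {
  let insidePre = false;
  return text
    .split("\n")
    .map((line) => {
      const wasInsidePre = insidePre;
      const opensPre = /<pre\b/.test(line);
      if (opensPre) insidePre = true;
      if (/<\/pre>/.test(line)) insidePre = false;
      const skip = wasInsidePre || opensPre || line === "";
      return skip ? line : indent + line;
    })
    .join("\n");
}

console.log(indentExceptHtmlCodeblocks("<p>a</p>\n<pre><code>x\ny\n</code></pre>\n<h1>b</h1>"));
```

This covers the fully-HTML list case, but not the mixed markdown/HTML target above, where the codeblock tags themselves need the list-item indent.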

Option to use setext headings

Since setext headings only support <h1> and <h2>,
the problem is what should HTMLarkdown do when it encounters a <h3>Heading 3</h3> when in setext mode?

Should it:

  • switch to ATX-style? (ie. ### Heading 3)
  • or switch to HTML-syntax? (<h3>Heading 3</h3> )
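
Either choice could sit behind a fallback setting. A minimal sketch, assuming a hypothetical `fallback` parameter (no such option exists yet):

```typescript
type SetextFallback = "atx" | "html";

// Hypothetical converter for a heading while in setext mode.
function toSetextHeading(level: number, text: string, fallback: SetextFallback): string {
  if (level === 1 || level === 2) {
    // Setext underlines: "=" for <h1>, "-" for <h2>.
    const underlineChar = level === 1 ? "=" : "-";
    return `${text}\n${underlineChar.repeat(Math.max(text.length, 1))}`;
  }
  // <h3> and deeper have no setext form, so use the chosen fallback.
  return fallback === "atx"
    ? `${"#".repeat(level)} ${text}`
    : `<h${level}>${text}</h${level}>`;
}

console.log(toSetextHeading(3, "Heading 3", "atx"));
```

Matching the underline length to the text is only a style choice here; a setext underline can be any length.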

Don't strip `<thead>`, `<tbody>` & `<tfoot>` tags in tables

The coloring for the rows might be different if they are stripped:

No <thead>, <tbody> nor <tfoot>

HTML-in-markdown:

<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Cell 1.1</td>
    <td>Cell 1.2</td>
  </tr>
  <tr>
    <td>Cell 2.1</td>
    <td>Cell 2.2</td>
  </tr>
</table>

Rendered HTML:

Header 1 Header 2
Cell 1.1 Cell 1.2
Cell 2.1 Cell 2.2

With <thead> and <tbody> only

HTML-in-markdown:

<table>
  <thead>
    <tr>
      <th>Header 1</th>
      <th>Header 2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cell 1.1</td>
      <td>Cell 1.2</td>
    </tr>
    <tr>
      <td>Cell 2.1</td>
      <td>Cell 2.2</td>
    </tr>
  </tbody>
</table>

Rendered HTML:

Header 1 Header 2
Cell 1.1 Cell 1.2
Cell 2.1 Cell 2.2

With <thead>, <tbody> and <tfoot>

HTML-in-markdown:

<table>
  <thead>
    <tr>
      <th>Header 1</th>
      <th>Header 2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cell 1.1</td>
      <td>Cell 1.2</td>
    </tr>
  </tbody>
  <tfoot>
    <tr>
      <td>Cell 2.1</td>
      <td>Cell 2.2</td>
    </tr>
  </tfoot>
</table>

Rendered HTML:

Header 1 Header 2
Cell 1.1 Cell 1.2
Cell 2.1 Cell 2.2

Horizontal-rule in text-formatting not converting

Currently, when a <hr> is in any text-formatting element like so:

<b><hr></b>

It doesn't get converted to anything, as the removeTextlessFormattings preprocess removes the element because it has no text.

Add README badges

Such as the coverage %, license, version and the "all test cases passed" badges:

image

Option to preserve HTML comments

Basic idea

Given this HTML:

<!-- BELOW IS A HEADING -->
<h1>Heading</h1>

<!-- BELOW IS A PARAGRAPH -->
<p>Paragraph</p>
<h1>Heading</h1>
<!-- ABOVE IS A HEADING -->

<p>Paragraph</p>
<!-- ABOVE IS A PARAGRAPH -->
<h1>Heading</h1>

<!-- INBETWEEN -->

<p>Paragraph</p>

The conversion will attempt to place the comment close to its original siblings, like:

<!-- BELOW IS A HEADING -->
# Heading

<!-- BELOW IS A PARAGRAPH -->
Paragraph
# Heading
<!-- ABOVE IS A HEADING -->

Paragraph
<!-- ABOVE IS A PARAGRAPH -->
# Heading

<!-- INBETWEEN -->

Paragraph

No blank-lines on both sides

If the comment in the HTML has no blank-lines between it and BOTH its sibling elements, like:

<h1>Heading</h1>
<!-- NO BLANK LINES ON BOTH SIDES -->
<p>Paragraph</p>

then a blank-line will be added on both sides:

# Heading

<!-- NO BLANK LINES ON BOTH SIDES -->

Paragraph

Multiple blank-lines

In cases where there's more than 1 blank-line between the comment and its sibling elements:

<h1>Heading</h1>



<!-- 3 BLANKS ABOVE, 2 BLANKS BELOW -->


<p>Paragraph</p>

Then the blank-lines are preserved:

# Heading



<!-- 3 BLANKS ABOVE, 2 BLANKS BELOW -->


Paragraph

Comments inside elements

Heading

<h1>
  <!-- COMMENT BEFORE -->
  Heading
  <!-- COMMENT AFTER -->
</h1>

is converted to:

# <!-- COMMENT BEFORE -->Heading<!-- COMMENT AFTER -->

Paragraph

<p>
  <!-- COMMENT BEFORE -->
  Paragraph
  <!-- COMMENT AFTER -->
</p>

is converted to:

<!-- COMMENT BEFORE -->
Paragraph
<!-- COMMENT AFTER -->

List

<ul>
  <!-- OUTSIDE LIST-ITEM (BEFORE) -->
  <li>Item 1</li>
  <li>
    <!-- INSIDE LIST ITEM (BEFORE) -->
    Item 2
    <!-- INSIDE LIST ITEM (AFTER) -->
  </li>
  <li>Item 3</li>
  <!-- OUTSIDE LIST-ITEM (AFTER) -->
</ul>

is converted to:

<!-- OUTSIDE LIST-ITEM (BEFORE) -->
- Item 1
- <!-- INSIDE LIST ITEM (BEFORE) -->
  Item 2
  <!-- INSIDE LIST ITEM (AFTER) -->
- Item 3
<!-- OUTSIDE LIST-ITEM (AFTER) -->

Cases to add to issue's description:

  • Heading
  • Paragraph
  • Tight-list
  • Loose-list
  • Blockquote
  • Codeblock
  • Table
  • Hyperlink (ie. <a>)
  • Text-formattings
    (ie. bold, italics, code, strikethrough, sub/superscript, underline)
  • Comments inside tags
    (eg. <h1 <!-- COMMENT--> >)

Remove newlines inside & inbetween block-elements for better readability

Currently, the below HTML:

<blockquote forcehtml>
    <p>Line 1</p>
    <p>Line 2</p>
<pre><code>Codeblock
</code></pre>
    <h1>Heading</h1>
</blockquote>

is converted to:

<blockquote>
  <p>
    Line 1
  </p>
  
  <p>
    Line 2
  </p>
  
<pre><code>Codeblock
</code></pre>
  
  <h1>
    Heading
  </h1>
</blockquote>

which has a lot of unnecessary newlines.


It would be nice to remove the newlines inbetween inner block-elements (eg. paragraphs/codeblocks/headings), as well as make each inner block-element take only 1 line, like so:

<blockquote>
  <p>Line 1</p>
  <p>Line 2</p>
<pre><code>Codeblock
</code></pre>
  <h1>Heading</h1>
</blockquote>

Note: The removal of newlines inbetween inner block-elements inside blockquotes can be achieved by passing down a boolean, which controls whether to add \n\n to end of the inner-content in each block-element-rule


Similarly, block-elements that have no attributes and contain one-line inner-content (ie. contains no \n) can be made into 1 liners to improve readability, like so:

<!-- Improved 1 liner -->
<p>TEXT</p>

<!-- Current multi-line block-elements -->
<p>
  TEXT
</p>
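
The 1-liner rule above could be sketched like this. `renderHtmlBlock` and its parameters are made up for illustration, and the attributes are assumed to be already serialized into a string:

```typescript
// Hypothetical renderer for a block-element that is kept in HTML-syntax.
function renderHtmlBlock(tag: string, attributes: string, inner: string): string {
  // 1 liner only if there are no attributes and the inner-content has no newline.
  const isOneLiner = attributes === "" && !inner.includes("\n");
  if (isOneLiner) return `<${tag}>${inner}</${tag}>`;

  const open = attributes === "" ? `<${tag}>` : `<${tag} ${attributes}>`;
  const indented = inner
    .split("\n")
    .map((line) => `  ${line}`)
    .join("\n");
  return `${open}\n${indented}\n</${tag}>`;
}

console.log(renderHtmlBlock("p", "", "TEXT"));
console.log(renderHtmlBlock("p", 'align="center"', "TEXT"));
```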

HTML-codeblocks are indented when they contain attributes

This HTML:

<h1>
<pre><code>asd
qwe
</code></pre>
</h1>

<h1>
<pre lang="md"><code>asd
qwe
</code></pre>
</h1>

is converted to this markdown:

<h1>
<pre><code>asd
qwe
</code></pre>
</h1>

<h1>
  <pre lang="md"><code>asd
  qwe
  </code></pre>
</h1>

Where the HTML-syntax codeblock is indented when it contains any attributes (eg. lang).

List separator to break-up adjacent lists

Relevant context

If list-items have no blank-lines inbetween:

- Item 1
- Item 2
- Item 3

They render as a tight-list like this:

  • Item 1
  • Item 2
  • Item 3


But if there's blank-lines:

- Item 1

- Item 2

- Item 3

They render as a loose-list:
(which results in larger inbetween-list-item-space)

  • Item 1

  • Item 2

  • Item 3


The problem

If 2 lists are adjacent to each other:

- List 1 - item 1
- List 1 - item 2
- List 1 - item 3

- List 2 - item 1
- List 2 - item 2
- List 2 - item 3

They render as 1 big loose-list:

  • List 1 - item 1

  • List 1 - item 2

  • List 1 - item 3

  • List 2 - item 1

  • List 2 - item 2

  • List 2 - item 3


This doesn't seem to happen if one of the adjacent lists is in HTML-syntax:

<ul>
  <li>List 1 - item 1</li>
  <li>List 1 - item 2</li>
  <li>List 1 - item 3</li>
</ul>

- List 2 - item 1
- List 2 - item 2
- List 2 - item 3

Which renders as:

  • List 1 - item 1
  • List 1 - item 2
  • List 1 - item 3
  • List 2 - item 1
  • List 2 - item 2
  • List 2 - item 3

It also doesn't happen if the adjacent lists aren't of the same type (ie. ordered vs. unordered):

1. Ordered item 1
2. Ordered item 2
3. Ordered item 3

- Unordered item 1
- Unordered item 2
- Unordered item 3

Which renders as:

  1. Ordered item 1
  2. Ordered item 2
  3. Ordered item 3
  • Unordered item 1
  • Unordered item 2
  • Unordered item 3

Solution

This can be solved by adding something inbetween the 2 lists. Such as:

Using a comment:

- List 1 - item 1
- List 1 - item 2
- List 1 - item 3

<!-- LIST_SEPARATOR -->

- List 2 - item 1
- List 2 - item 2
- List 2 - item 3

Which renders as:

  • List 1 - item 1
  • List 1 - item 2
  • List 1 - item 3
  • List 2 - item 1
  • List 2 - item 2
  • List 2 - item 3

Using an invalid HTML tag:

- List 1 - item 1
- List 1 - item 2
- List 1 - item 3

<LISTSEPARATOR>

- List 2 - item 1
- List 2 - item 2
- List 2 - item 3

Which renders as:

  • List 1 - item 1
  • List 1 - item 2
  • List 1 - item 3
  • List 2 - item 1
  • List 2 - item 2
  • List 2 - item 3
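
A small sketch of the comment-separator approach, assuming the converted top-level blocks are available as a list tagged with their kind. The `Block` type and `joinBlocks` are hypothetical names, not HTMLarkdown's API:

```typescript
type Block = { kind: "ul" | "ol" | "other"; markdown: string };

const LIST_SEPARATOR = "<!-- LIST_SEPARATOR -->";

// Joins blocks with blank lines, adding a separator comment only between
// two adjacent lists of the same type (the case that merges into 1 list).
function joinBlocks(blocks: Block[]): string {
  const parts: string[] = [];
  blocks.forEach((block, i) => {
    const prev = blocks[i - 1];
    const isAdjacentSameList =
      prev !== undefined && block.kind !== "other" && prev.kind === block.kind;
    if (isAdjacentSameList) parts.push(LIST_SEPARATOR);
    parts.push(block.markdown);
  });
  return parts.join("\n\n");
}

console.log(
  joinBlocks([
    { kind: "ul", markdown: "- List 1 - item 1" },
    { kind: "ul", markdown: "- List 2 - item 1" },
  ])
);
```

Ordered-next-to-unordered lists are left alone, since they don't merge in the first place.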

Incorrect conversions for tables with rows of different lengths

Problem 1

Currently, this HTML:

<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Cell 1</td>
  </tr>
</table>


Throws an error:

image


The output markdown should be in HTML-syntax, the same as the above HTML.


Note: It should NOT be in markdown with missing cells like:

| Header 1 | Header 2 |
| -------- | -------- |
| Cell 1   |


As the above markdown renders as:

Header 1 Header 2
Cell 1


When it's supposed to be:

Header 1 Header 2
Cell 1

Problem 2

Currently, this HTML:

<table>
  <tr>
    <th>Header 1</th>
  </tr>
  <tr>
    <td>Cell 1</td>
    <td>Cell 2</td>
  </tr>
</table>


converts to this markdown:

| Header 1 |
|----------|
| Cell 1   | Cell 2 |


which incorrectly renders in GitHub as:

Header 1
Cell 1


The output markdown should be in HTML-syntax, the same as the above HTML.


Note: It should NOT be in markdown with empty headers:

| Header 1 |        |
| -------- | ------ |
| Cell 1   | Cell 2 |


As the above markdown renders as:

Header 1
Cell 1 Cell 2


When it's supposed to be:

Header 1
Cell 1 Cell 2
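
Both problems come down to one check before choosing the output syntax. A minimal sketch, assuming the table is already reduced to rows of cell text with the header row first (`canUseMarkdownTable` is a hypothetical name, and colspan/rowspan are ignored):

```typescript
// Returns true only if every row has the same number of cells as the header
// row. Otherwise the table should be kept in HTML-syntax.
function canUseMarkdownTable(rows: string[][]): boolean {
  const [header] = rows;
  if (header === undefined || header.length === 0) return false;
  return rows.every((row) => row.length === header.length);
}

console.log(canUseMarkdownTable([["Header 1", "Header 2"], ["Cell 1"]]));
console.log(canUseMarkdownTable([["Header 1"], ["Cell 1", "Cell 2"]]));
```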

Multiple block-elements in a list-item

The problem

The below tight-list HTML:
(explanation on tight/loose list found in #17)

<ul>
  <li>
    Item 1
    <pre><code>Codeblock</code></pre>
    <h1>Heading</h1>
  </li>
  <li>Item 2</li>
</ul>


Currently converts to:

- Item 1```
  Codeblock
  ```
  
  # Heading
- Item 2


It should instead be converted to:

- Item 1
  ```
  Codeblock
  ```
  # Heading
- Item 2

(Sub-problem) Newline after text-node

With reference to the conversion above,
there's a lack of newline after the text "Item 1":

- Item 1```
  Codeblock
  ```

(Sub-problem) Blank-lines inbetween block-elements inside list-item

Tight-lists cannot have blank-lines inside their list-items, else they will be rendered as loose-lists instead.

With reference to the conversion above,
notice the lack of blank-lines inbetween the codeblock and heading in the correct conversion:

- Item 1
  ```
  Codeblock
  ```
  # Heading
- Item 2

Should not use markdown-escaping inside of HTML-syntax

The problem

Currently, this HTML:

<p align="center">
  &lt;tag&gt;
</p>

converts to:

<p align="center">
  \<tag>
</p>

which incorrectly uses markdown's backslash-escaping, instead of HTML's &lt; escaping.
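
A simplified sketch of context-aware escaping. The `insideHtmlSyntax` flag is an assumption about how the context would be passed down, and only a few characters are handled:

```typescript
// Hypothetical text escaper that picks the escaping style by context.
function escapeText(text: string, insideHtmlSyntax: boolean): string {
  if (insideHtmlSyntax) {
    // HTML context: use entities ("&" first, so entities aren't double-escaped).
    return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
  }
  // Markdown context: backslash-escape "<" and the backslash itself.
  return text.replace(/[\\<]/g, "\\$&");
}

console.log(escapeText("<tag>", true));
console.log(escapeText("<tag>", false));
```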


Edge cases

Most of the time, markdown-syntax (including backslash escaping) doesn't work inside HTML tags.
However, there are times when it does, specifically in tags which are:

  • in-line (eg. text-formattings <em> / <code> & <span>)
  • on a single line in the markdown

For example, these markdown-syntax containing tags render properly:

<code>\<tag> \&nbsp; **Bold**</code>
<sup>\<tag> \&nbsp;  **Bold**</sup>
<span>\<tag> \&nbsp;  **Bold**</span>

Rendered as:

<tag> &nbsp; Bold
<tag> &nbsp; Bold
<tag> &nbsp; Bold


But when they are broken up into multiple lines, the markdown-syntax stops working:

<code>
  \<tag> \&nbsp; **Bold**
</code>
<sup>
  \<tag> \&nbsp;  **Bold**
</sup>
<span>
  \<tag> \&nbsp;  **Bold**
</span>

Rendered as:

\ \  **Bold** \ \  **Bold** \ \  **Bold**

Improve the implementation for collapsing of whitespaces

Current implementation for collapsing whitespaces is found in the collapseWhitespace pre-process.

The previous implementation was actually a text-process,
but was later very roughly adapted into a bootleg pre-process:


(from collapseWhitespace.ts)


If anyone has any idea how to overhaul it to make it cleaner or faster, pls send help :(

Block-elements wrapped in text-formatting

When block-elements are wrapped in text-formattings (eg. <b>), like so:

<b><p>Bold-wrapped paragraph</p></b>
<p>=== SEPARATOR ===</p>
<b><hr></b>
<p>=== END ===</p>

The trailing-newlines of the block-elements become malformed:

**Bold-wrapped paragraph

**=== SEPARATOR ===

**<hr>**=== END ===

The expected output should be:

**Bold-wrapped paragraph**

=== SEPARATOR ===

**<hr>**

=== END ===

The question is, is this a problem?
Since block-elements wrapped in text-formatting elements could be considered as malformed HTML.

`mergeOverwriteArray` not using its own defined TSDoc

Context

The mergeOverwriteArray function found in src > core > helpers > mergeOverwriteArray.ts is like lodash's _.merge but overwrites array values instead of merging them like _.merge.

var users = {
  'data': [{ 'user': 'barney' }, { 'user': 'fred' }]
};

var ages = {
  'data': [{ 'age': 36 }, { 'age': 40 }]
};

_.merge(users, ages);
// => { 'data': [{ 'user': 'barney', 'age': 36 }, { 'user': 'fred', 'age': 40 }] }

mergeOverwriteArray(users, ages);
// => { 'data': [{ 'age': 36 }, { 'age': 40 }] }
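
For reference, the behavior described above can be reproduced without lodash. This is a dependency-free sketch, not the code in mergeOverwriteArray.ts; it only handles plain objects, arrays and primitives, and the name `mergeOverwriteArraySketch` is made up:

```typescript
type PlainObject = { [key: string]: unknown };

function isPlainObject(value: unknown): value is PlainObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Recursively merges `source` into `target` (mutating `target`, like _.merge),
// but arrays in `source` overwrite the target's value instead of being merged.
function mergeOverwriteArraySketch(target: PlainObject, source: PlainObject): PlainObject {
  for (const key of Object.keys(source)) {
    const sourceValue = source[key];
    const targetValue = target[key];
    if (isPlainObject(sourceValue) && isPlainObject(targetValue)) {
      mergeOverwriteArraySketch(targetValue, sourceValue);
    } else if (sourceValue !== undefined) {
      target[key] = sourceValue;
    }
  }
  return target;
}

const mergedUsers = mergeOverwriteArraySketch(
  { data: [{ user: "barney" }, { user: "fred" }] },
  { data: [{ age: 36 }, { age: 40 }] }
);
console.log(JSON.stringify(mergedUsers));
```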

The problem

mergeOverwriteArray uses the same type as _.merge, and has its own TSDoc defined in the file mergeOverwriteArray.ts.

However, (at least in VSCode) when it's used outside of the mergeOverwriteArray.ts file, the TSDoc shown for it is that of _.merge instead of its own defined TSDoc:

image


Help wanted

Any ideas on how to make mergeOverwriteArray have the same generic typing as _.merge but without inheriting its TSDoc?

Padding table delimiter-row's hyphens with a space

Context

Currently, the hyphens - of the table delimiter-row extend all the way to the column-separators |:

| Default-Left | Centered | Right-Aligned |
|--------------|:--------:|--------------:|
| Cell 1       | Cell 2   | Cell 3        |

Note: Colon-aligned : feature hasn't been implemented yet, but is here for illustration purposes.


and does not pad the hyphens with a space like:

| Default-Left | Centered | Right-Aligned |
| ------------ | :------: | ------------: |
| Cell 1       | Cell 2   | Cell 3        |

This was because tables require:

  • minimum of 3 hyphens
    (although GitHub doesn't require it, and technically accepts 1 hyphen)

  • minimum of length 3 to replace both the leading and trailing hyphen with colons : for center-alignment
    (ie. :-:, the shortest possible center-aligned delimiter),

and I didn't want to deal with the edge case where the column only has a width of 1 character, like:

| a |
| - |
| b |

The improvement

Add 1-space paddings on the edge of the hyphens.

As for the edge case where the column has < 3 characters, we can either:

  • set the minimum width of columns to 3 characters (excluding the 2-space paddings):

    | 1   |      | 12  |      | 123 |      | 1234 |
    | --- |  ->  | --- |  ->  | --- |  ->  | ---- |
    | a   |      | a   |      | a   |      | a    |
    
    Center-aligned:
    | 1   |      | 12  |      | 123 |      | 1234 |
    | :-: |  ->  | :-: |  ->  | :-: |  ->  | :--: |
    | a   |      | a   |      | a   |      | a    |
    
  • extend the hyphens all the way if there's < 3 characters, else pad the hyphens:

    | 1 |        | 12 |       | 123 |      | 1234 |
    |---|   ->   |----|  ->   | --- |  ->  | ---- |
    | a |        | a  |       | a   |      | a    |
    
    Center-aligned:
    | 1 |        | 12 |       | 123 |      | 1234 |
    |:-:|   ->   |:--:|  ->   | :-: |  ->  | :--: |
    | a |        | a  |       | a   |      | a    |
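
The first of the 2 styles (minimum width of 3, always padded) could be sketched as below. `delimiterCell` and the `Alignment` type are hypothetical, and the colon-alignment part assumes that feature gets implemented:

```typescript
type Alignment = "none" | "center" | "right";

// Builds 1 cell of the delimiter-row, excluding the "|" column-separators.
function delimiterCell(contentWidth: number, alignment: Alignment = "none"): string {
  // At least 3 characters, so ":-:" always fits.
  const width = Math.max(contentWidth, 3);
  let dashes = "-".repeat(width);
  if (alignment === "center") dashes = `:${"-".repeat(width - 2)}:`;
  if (alignment === "right") dashes = `${"-".repeat(width - 1)}:`;
  // 1-space padding on both edges of the hyphens.
  return ` ${dashes} `;
}

console.log(`|${delimiterCell(1)}|${delimiterCell(4, "center")}|`);
```

With a fixed minimum there is only 1 code path, whereas the second style needs a separate branch for columns narrower than 3 characters.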
    

The question

Which of the 2 styles in "The improvement" section above is better?

Allow conversion of string HTMLs with containers

Currently, only HTML strings that aren't wrapped in a container are accepted by HTMLarkdown.convert.
For example:

<h1>Heading</h1>
<p>Paragraph</p>

is converted properly to:

# Heading

Paragraph

But when it's wrapped in a container like:

<article id="container">
  <h1>Heading</h1>
  <p>Paragraph</p>
</article>

it doesn't convert properly.

Note: <article> is used as an example above as it doesn't have an associated rule, and is thus stripped.

Headings should be in HTML-syntax if they contain block-elements

This HTML:

<h1>
<pre><code>Codeblock (a block-element)
</code></pre>
</h1>

<h1>
  <blockquote>Blockquote (another block-element)</blockquote>
</h1>

Cannot be represented in markdown. The below markdowns will not work:
Fenced-codeblock style: (current implementation's output)

# ```
Codeblock (a block-element)
```

# > Blockquote (another block-element)

Indented-codeblock style:

#     Codeblock (a block-element)

# > Blockquote (another block-element)

The correct conversion should be fully in HTML-syntax:

<h1>
<pre><code>Codeblock (a block-element)
</code></pre>
</h1>

<h1>
  <blockquote>Blockquote (another block-element)</blockquote>
</h1>

Prevent indenting codeblocks that are in HTML-syntax

Currently, some HTML conversions add indentation like so:

<h1>
  <pre><code>TEXT
  </code></pre>
</h1>

But codeblocks don't collapse whitespace, and thus are sensitive to the added spaces from the indentation.

Hence, codeblocks should not be indented.

What merging methods to use?

Context

So far, the merging method I've used is:

  • ensure branch is up-to-date (if not, rebase branch onto master)
  • merge by git merge --no-ff [branch]
  • no squash
  • then git commit --amend, and add an issue-closing line to the merge-commit's body (eg. Fixes #4, Closes #5)

Which gives a history that looks like this:

image


The question

I'm still fairly new in the programming world, so idk what is the industry standard for this.

  • To squash?
  • To rebase instead of merge?
  • Which commit to put the issue-closing keywords (eg. Fixes #3) on? The merge commit? Or the commits in the branch?

If anyone has some experience with this and knows the pros/cons of each, please advise! D:

Incorrect conversion for empty `<code>` tag

Currently, the below HTML:

<code></code>

gets converted to:

``

Which is rendered (at least on GitHub) as literal backticks: ``


The correct conversion should be in HTML-syntax:

<code></code>

Note: other inline text-formatting tags didn't have this problem, as they are removed by the removeEmptyElements preprocess.

`<div>` should not be handled by a noop rule

Context

Currently, <div> elements are handled by a noop rule, meaning they aren't stripped, but they only pass on their converted inner-contents to their parents. They themselves don't have any markdown conversions, not even in HTML-syntax.
For example, the below 2 HTML snippets:

<div>
  TEXT
</div>
<div align="center">
  TEXT
</div>

both give the same output of:

TEXT

The problem

<div> elements in markdown aren't stripped when rendered in GitHub.
For example, they can be used to center images, like:

<div align=center>
  <img src="...">
</div>

Thus, they should not be converted by a noop rule.

Weird rendering of HTML-codeblocks in block-elements like `<p>`

The problem

Although this markdown of HTML-syntax codeblock renders properly:

<pre><code># Heading-1 markdown

## Heading-2 markdown __ITALIC__

&lt;h1 align="center">
    Centered-heading
&lt;/h1>
</code></pre>

as:

# Heading-1 markdown

## Heading-2 markdown __ITALIC__

<h1 align="center">
    Centered-heading
</h1>

When we wrap it in a block element like <blockquote>:

<blockquote>
<pre><code># Heading-1 markdown

## Heading-2 markdown __ITALIC__

&lt;h1 align="center">
    Centered-heading
&lt;/h1>
</code></pre>
</blockquote>

it renders as:

# Heading-1 markdown

Heading-2 markdown ITALIC

<h1 align="center">
Centered-heading
</h1>

It seems to collapse the whitespaces in the codeblock, and attempts to parse the codeblock's contents for markdown-syntax.


The solution

It seems this problem only occurs when there's blank-lines inside the codeblock's content.
Thus, by adding a noop-comment in every blank-line, it seems to stop this weird behavior.

<blockquote>
<pre><code># Heading-1 markdown
<!-- BLANK_LINE -->
## Heading-2 markdown __ITALIC__
<!-- BLANK_LINE -->
&lt;h1 align="center">
    Centered-heading
&lt;/h1>
</code></pre>
</blockquote>

which renders as:

# Heading-1 markdown

## Heading-2 markdown __ITALIC__

<h1 align="center">
    Centered-heading
</h1>

Aligning table columns with colons

Table columns can be aligned by adding a trailing/leading colon to the delimiter-row:

| Default-Left | Center-Aligned | Right-Aligned |
| ------------ | :------------: | ------------: |

Additionally, the text in the row can also be visually aligned with spaces to follow the column's alignment:

| Default-Left | Center-Aligned | Right-Aligned |
| ------------ | :------------: | ------------: |
| Cell 1       |     Cell 2     |        Cell 3 |
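
The visual alignment of cell text is just padding to the column's width. A minimal sketch, where `padCell` is a hypothetical helper and width is counted in UTF-16 code units (so wide characters aren't accounted for):

```typescript
type Align = "left" | "center" | "right";

// Pads `text` with spaces to `width`, following the column's alignment.
function padCell(text: string, width: number, align: Align): string {
  const total = Math.max(width - text.length, 0);
  if (align === "right") return " ".repeat(total) + text;
  if (align === "left") return text + " ".repeat(total);
  // Center: any odd leftover space goes to the right side.
  const left = Math.floor(total / 2);
  return " ".repeat(left) + text + " ".repeat(total - left);
}

const alignedRow = [
  padCell("Cell 1", "Default-Left".length, "left"),
  padCell("Cell 2", "Center-Aligned".length, "center"),
  padCell("Cell 3", "Right-Aligned".length, "right"),
];
console.log(`| ${alignedRow.join(" | ")} |`);
```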

Should `forcehtml` element attribute on in-line elements propagate to child elements?

Currently, the below input HTML, where a text-formatting element (ie. <b>) has the forcehtml attribute:

<p><b forcehtml><s>TEXT</s></b></p>

gives this markdown output:

<b>~~TEXT~~</b>

where the forcehtml doesn't propagate to the inner <s> element.


The question is, should it propagate or not?
Because the above markdown output renders fine in GitHub (with bold and strikethrough applied).

It seems that markdown syntax still works inside of HTML-syntax, as long as the outer element is an in-line-element like <span>, <b> and <s>, like:

<span><b>~~TEXT~~</b></span>

which properly renders as:

TEXT


But if the outer element is a block-element (eg. <p>) like below:

<p><b>~~TEXT~~</b></p>

It fails to render the inner markdown-syntax, like so:

~~TEXT~~

Confusing array filter logic

Currently, if a rule's filter is an array of tag-names (ie. TagName[]), the elements are OR-ed together.
eg. filter: ['b', 'strong'] is logically "element has the tag-name 'b' OR 'strong'"

But if the array contains an element of type FilterPredicate, the elements are suddenly AND-ed together.
eg. filter: ['b', isStrong] is logically "element has the tag-name 'b' AND isStrong"

Also, for convenience, the rules' filters are allowed to be a single (ie. not an array) TagName or FilterPredicate type.
eg. filter: 'b' or filter: isStrong, which slightly complicates the typings and evaluation of the filters.


A better evaluation logic would be:

  • disallow just filter: TagName | FilterPredicate types (ie. only allow arrays)
  • allow nested arrays where:
    • the outer array is OR logically
    • the inner array is AND logically
      (ie. [A, [B, C], D] is logically "A or (B and C) or D")
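
The proposed logic could be evaluated roughly like this. The types below are simplified stand-ins (a bare `{ tagName }` object instead of a DOM element), and `matchesFilter` is a hypothetical name rather than the library's current code:

```typescript
type ElementLike = { tagName: string };
type FilterPredicate = (element: ElementLike) => boolean;
type FilterTerm = string | FilterPredicate;
// Outer array = OR, inner arrays = AND.
type RuleFilter = Array<FilterTerm | FilterTerm[]>;

function matchesTerm(term: FilterTerm, element: ElementLike): boolean {
  return typeof term === "string"
    ? element.tagName.toLowerCase() === term.toLowerCase()
    : term(element);
}

function matchesFilter(filter: RuleFilter, element: ElementLike): boolean {
  return filter.some((entry) =>
    Array.isArray(entry)
      ? entry.every((term) => matchesTerm(term, element)) // inner array: AND
      : matchesTerm(entry, element) // top-level entry: OR
  );
}

// [A, [B, C]] is logically "A or (B and C)".
const isUpperCaseTag: FilterPredicate = (el) => el.tagName === el.tagName.toUpperCase();
console.log(matchesFilter(["b", ["span", isUpperCaseTag]], { tagName: "SPAN" }));
```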

Incorrect conversion for empty `<pre>` tag when `addTrailingLinebreak` option is enabled

When addTrailingLinebreak option is enabled via new HTMLarkdown({ addTrailingLinebreak: true }),
the below empty pre tag HTML:

<pre></pre>

gets converted to:

<pre><code><br>
</code></pre>

as the addTrailingLinebreaks preprocess adds a trailing <br> to the empty pre tag.

Codeblocks usually don't have this problem as they contain an inner <code> tag, which is ignored by the addTrailingLinebreaks preprocess as it's not a block-element.

`&nbsp;` should not be escaped in code

Currently, this HTML:

<code>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</code>

converts to:

`&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;`

instead of:

`         `

Option to use underscore bold and italic

Underscore bold and italic are in the form:

__bold__, _italic_

The reason both bold and italics currently use asterisks * only is because of this edge case:

PREFIX__bold__, PREFIX_italic_

PREFIX**bold**, PREFIX*italic*

which renders as:

PREFIX__bold__, PREFIX_italic_

PREFIXbold, PREFIXitalic


The question is (or i guess "are"):

  • How do we detect if a bold/italic is prefixed (without spaces inbetween)?
  • What should HTMLarkdown do if it's set to use underscore, and those edge cases happen?
    Should it:
    • Use asterisks?
    • or use HTML-syntax (ie. <b>, <i>)
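
One possible way to detect the edge case is to look at the characters right next to the formatted text. The sketch below picks the "use asterisks" answer purely for illustration; `pickEmphasisStyle` is a hypothetical name, and it only treats ASCII letters and digits as word characters, which is a simplification:

```typescript
type EmphasisStyle = "underscore" | "asterisk";

const WORD_CHAR = /[A-Za-z0-9]/;

// `charBefore` / `charAfter` are the characters adjacent to the bold/italic
// text in the output (empty string at the start or end of the text).
function pickEmphasisStyle(
  charBefore: string,
  charAfter: string,
  preferred: EmphasisStyle
): EmphasisStyle {
  // Underscores touching a word character (eg. PREFIX_italic_) don't render.
  const touchesWord = WORD_CHAR.test(charBefore) || WORD_CHAR.test(charAfter);
  return preferred === "underscore" && touchesWord ? "asterisk" : preferred;
}

console.log(pickEmphasisStyle("X", "", "underscore"));
console.log(pickEmphasisStyle(" ", ",", "underscore"));
```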

Loose list

Edited on 20/12/2022 to include the edge cases in the comments

Context

If list-items have no blank-lines inbetween:

- Item 1
- Item 2
- Item 3

The list is considered tight, and renders like this:

  • Item 1
  • Item 2
  • Item 3


But if there's blank-lines inbetween list-items:

- Item 1

- Item 2

- Item 3

The list is considered loose, and renders with each list-item's contents wrapped in a <p> tag like this:
(visually, it results in larger inbetween-list-item-space)

  • Item 1

  • Item 2

  • Item 3


The improvement

Loose lists (both ordered and unordered) such as:

<!-- All have paragraphs -->
<ul>
  <li><p>Item 1</p></li>
  <li><p>Item 2</p></li>
  <li><p>Item 3</p></li>
</ul>
<!-- Empty list-items have no paragraphs -->
<ul>
  <li><p>Item 1</p></li>
  <li></li>
  <li><p>Item 3</p></li>
</ul>
<!-- List-items with other block elements aren't wrapped in paragraphs -->
<ul>
  <li><p>Item 1</p></li>
  <li><h1>Item 2 (heading)</h1></li>
  <li><p>Item 3</p></li>
</ul>
<!-- List-items with multiple block-elements -->
<ul>
  <li><p>Item 1</p></li>
  <li>
    <p>Item 2</p>
    <h1>Heading in list-item</h1>
  </li>
  <li><p>Item 3</p></li>
</ul>

Should have blank-lines inbetween their converted list-items:

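<!-- All have paragraphs -->
- Item 1

- Item 2

- Item 3

<!-- Empty list-items have no paragraphs -->
- Item 1

- 

- Item 3

<!-- List-items with other block elements aren't wrapped in paragraphs -->
- Item 1

- # Item 2 (heading)

- Item 3

<!-- List-items with multiple block-elements -->
- Item 1

- Item 2
  
  # Heading in list-item

- Item 3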

But if the list is tight, with only some list-items having <p> like:

<ul>
  <li>Item 1</li>
  <li><p>Item 2</p></li>
  <li>Item 3</li>
</ul>

Then the output markdown should be:

- Item 1
- <p>Item 2</p>
- Item 3

Text-formattings unnecessarily converting to HTML syntax

<p>a <b> a</b></p>

Currently converts to:

a <strong>&nbsp;a</strong>

But, to avoid unnecessarily using HTML syntax, it should instead be converted to:

a **&nbsp;a**

The above bug also happens to the other text-formatting types (eg. italic, strikethrough)


Edit:

I can't think of a situation where the HTML-syntax of text-formattings are required.

Since leading/trailing spaces inside the text-formattings are now escaped to &nbsp;, such leading/trailing spaces can be used in Markdown-syntax:
(eg. bolding with leading & trailing spaces: **&nbsp; &nbsp; TEXT &nbsp; &nbsp;**)

So for now, the toUseHtmlPredicate of text-formattings will be set to always return false, until a situation that needs HTML arises.
