
html-tree's Introduction

HTML-Tree

HTML-Tree is a suite of Perl modules for representing, creating, and extracting information from HTML syntax trees.

This is a Git repository where development of HTML-Tree takes place. For more information, visit HTML-Tree on CPAN.

Copyright and License

This software is copyright 1995-1998 Gisle Aas; 1999-2004 Sean M. Burke; 2005 Andy Lester; 2006 Pete Krawczyk; 2010 Jeff Fearn; 2012 Christopher J. Madsen (Except the articles contained in HTML::Tree::AboutObjects, HTML::Tree::AboutTrees, and HTML::Tree::Scanning, which are all copyright 2000 The Perl Journal.)

Except for those three TPJ articles, the whole HTML-Tree distribution is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Those three TPJ articles may be distributed under the same terms as Perl itself.

The programs in this library are distributed in the hope that they will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

html-tree's People

Contributors

gbarr, kentfredric, lyda, madsen, metaperl, petdance, petek, tokuhirom

html-tree's Issues

v5.05 (and maybe v5.04) breaks many modules

Hi,

I see 100% failures with your v5.05 release when testing Web::Scraper, all with the same errors.

Examples:
http://www.cpantesters.org/cpan/report/78ead4ae-2664-11e7-b625-8b3bdc40f832
http://www.cpantesters.org/cpan/report/7812f82c-2664-11e7-b625-8b3bdc40f832
http://www.cpantesters.org/cpan/report/77400106-2664-11e7-b625-8b3bdc40f832

I installed v5.03 and everything was fine.

Another affected module:
https://rt.cpan.org/Ticket/Display.html?id=121310

Please check this and patch it as soon as possible!

Best regards, Perlover

Possible typo/bug v5.07

Have just installed latest HTML::Tree 5.07 (6 days old?) and get this:

Use of uninitialized value $tag in string eq at ....
..../HTML/Element.pm line 1109

Have been using prior versions in same calling code for many many years
without this problem.

mishandling <ins> and <del> tags

I seem to be having trouble with HTML::TreeBuilder, where an <ins> tag seems to trigger an early </p> (before it), and a <del> tag seems to trigger an early <p> tag (after it). See example below. Has anyone else seen such problems with these two tags?

Perl code used to process the $text (HTML) input and create the $tree hash:

	my $tree = HTML::TreeBuilder->new();
	$tree->ignore_unknown(0);  # don't discard tags not recognized as HTML
	$tree->no_space_compacting(1);  # preserve spaces
	$tree->warn(1);  # warn if syntax error found
	$tree->p_strict(1);  # auto-close paragraph on new block element
	$tree->implicit_body_p_tag(1);  # loose text gets wrapped in <p>
	$tree->parse_content($text);

Mixed Markdown and HTML input:

Example of Markdown that needs to be supported in document text blocks. There is no need to support this within tables, although it would be a "nice" feature.

Firstly just some simple styling: *italics*, **bold** and ***both***.

There should also be support for _alternative italics_

Then a bulleted list:

* Unordered item
* Another unordered item

And a numbered list:

1. Item one
2. Item two

# We will need a heading

## And a subheading

Finally we&#x92;ll need some [external links](https://duckduckgo.com).

Show that [another link](https://www.catskilltech.com) on the same page works.

Show some <ins>inserted</ins> text and <u>underlined</u> text that display 
underlines. Show some <del>deleted</del> text, <strike>strike-out</strike> text,
and <s>s'd out</s> text that show line-throughs. 
More than <span style="text-decoration: 'underline line-through overline'">one
at a time</span> are possible via style attribute, also via
<u><s>nested tags</s></u>.

Then we need some styling features in tables as shown in the table below. There is no need to support this in text blocks, although it would be a nice feature (colored text is already available in text blocks using its options).

MD to HTML via Text::Markdown:

    <p>Example of Markdown that needs to be supported in document text blocks. There is no need to support this within tables, although it would be a "nice" feature.</p>

    <p>Firstly just some simple styling: <em>italics</em>, <strong>bold</strong> and <strong><em>both</em></strong>.</p>

    <p>There should also be support for <em>alternative italics</em></p>

    <p>Then a bulleted list:</p>

    <ul>
    <li>Unordered item</li>
    <li>Another unordered item</li>
    </ul>

    <p>And a numbered list:</p>

    <ol>
    <li>Item one</li>
    <li>Item two</li>
    </ol>

    <h1>We will need a heading</h1>

    <h2>And a subheading</h2>

    <p>Finally we&#x92;ll need some <a href="https://duckduckgo.com">external links</a>.</p>

    <p>Show that <a href="https://www.catskilltech.com">another link</a> on the same page works.</p>

    <p>Show some <ins>inserted</ins> text and <u>underlined</u> text that display 
    underlines. Show some <del>deleted</del> text, <strike>strike-out</strike> text,
    and <s>s'd out</s> text that show line-throughs. 
    More than <span style="text-decoration: 'underline line-through overline'">one
    at a time</span> are possible via style attribute, also via
    <u><s>nested tags</s></u>.</p>

    <p>Then we need some styling features in tables as shown in the table below. There is no need to support this in text blocks,   although it would be a nice feature (colored text is already available in text blocks using its options).</p>

Dump of HTML parse (via HTML::TreeBuilder) output:

$VAR1 = bless( {
  '_body' => bless( {
    '_content' => [
      bless( {
        '_content' => [
          'Example of Markdown that needs to be supported in document text blocks. There is no need to support this within tables, although it would be a "nice" feature.'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          'Firstly just some simple styling: ',
          bless( {
            '_content' => [
              'italics'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[1],
            '_tag' => 'em'
          }, 'HTML::Element' ),
          ', ',
          bless( {
            '_content' => [
              'bold'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[1],
            '_tag' => 'strong'
          }, 'HTML::Element' ),
          ' and ',
          bless( {
             '_content' => [
              bless( {
                '_content' => [
                  'both'
                ],
                '_parent' => $VAR1->{'_body'}{'_content'}[1]{'_content'}[5],
                '_tag' => 'em'
               }, 'HTML::Element' )
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[1],
            '_tag' => 'strong'
          }, 'HTML::Element' ),
          '.'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' ),
      bless( {
         '_content' => [
          'There should also be support for ',
          bless( {
            '_content' => [
              'alternative italics'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[2],
            '_tag' => 'em'
          }, 'HTML::Element' )
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          'Then a bulleted list:'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          bless( {
            '_content' => [
              'Unordered item'
           ],
            '_parent' => $VAR1->{'_body'}{'_content'}[4],
            '_tag' => 'li'
          }, 'HTML::Element' ),
          bless( {
            '_content' => [
              'Another unordered item'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[4],
            '_tag' => 'li'
          }, 'HTML::Element' )
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'ul'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          'And a numbered list:'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
         bless( {
            '_content' => [
              'Item one'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[6],
            '_tag' => 'li'
          }, 'HTML::Element' ),
          bless( {
            '_content' => [
              'Item two'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[6],
            '_tag' => 'li'
          }, 'HTML::Element' )
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'ol'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
           'We will need a heading'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'h1'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          'And a subheading'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'h2'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          'Finally we’ll need some ',
          bless( {
            '_content' => [
              'external links'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[9],
            '_tag' => 'a',
            'href' => 'https://duckduckgo.com'
          }, 'HTML::Element' ),
          '.'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          'Show that ',
          bless( {
            '_content' => [
              'another link'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[10],
            '_tag' => 'a',
            'href' => 'https://www.catskilltech.com'
          }, 'HTML::Element' ),
          ' on the same page works.'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [ <<<<<<<<<<<<< should include the entire paragraph, not just up to first child!
          'Show some '
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' ),
>>>>>>>> why isn't this content a child of the above paragraph?
      bless( {
        '_content' => [
          'inserted'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'ins'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          ' text and ',
          bless( {
            '_content' => [
              'underlined'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[13],
            '_tag' => 'u'
          }, 'HTML::Element' ),
          ' text that display 
underlines. Show some '
        ],
        '_implicit' => 1,
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          'deleted'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'del'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          ' text, ',
          bless( {
            '_content' => [
              'strike-out'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[15],
            '_tag' => 'strike'
          }, 'HTML::Element' ),
          ' text,
and ',
          bless( {
            '_content' => [
              's\'d out'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[15],
            '_tag' => 's'
          }, 'HTML::Element' ),
          ' text that show line-throughs. 
More than ',
          bless( {
            '_content' => [
              'one
at a time'
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[15],
            '_tag' => 'span',
            'style' => 'text-decoration: \'underline line-through overline\''
          }, 'HTML::Element' ),
          ' are possible via style attribute, also via
',
          bless( {
            '_content' => [
              bless( {
                '_content' => [
                  'nested tags'
                ],
                '_parent' => $VAR1->{'_body'}{'_content'}[15]{'_content'}[7],
                '_tag' => 's'
              }, 'HTML::Element' )
            ],
            '_parent' => $VAR1->{'_body'}{'_content'}[15],
            '_tag' => 'u'
          }, 'HTML::Element' ),
          '.'
        ],
        '_implicit' => 1,
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' ),
      bless( {
        '_content' => [
          'Then we need some styling features in tables as shown in the table below. There is no need to support this in text blocks, although it would be a nice feature (colored text is already available in text blocks using its options).'
        ],
        '_parent' => $VAR1->{'_body'},
        '_tag' => 'p'
      }, 'HTML::Element' )
    ],
    '_implicit' => 1,
    '_parent' => $VAR1,
    '_tag' => 'body'
  }, 'HTML::Element' ),
  '_content' => [
    bless( {
      '_implicit' => 1,
      '_parent' => $VAR1,
      '_tag' => 'head'
    }, 'HTML::Element' ),
    $VAR1->{'_body'}
  ],
  '_done' => 1,
  '_element_count' => 5,
  '_head' => $VAR1->{'_content'}[0],
  '_hparser_xs_state' => \66027640,
  '_ignore_text' => 0,
  '_ignore_unknown' => 0,
  '_implicit' => 1,
  '_implicit_body_p_tag' => 1,
  '_implicit_tags' => 1,
  '_no_expand_entities' => 0,
  '_no_space_compacting' => 1,
  '_p_strict' => 1,
  '_pos' => undef,
  '_store_comments' => 0,
  '_store_declarations' => 1,
  '_store_pis' => 0,
  '_tag' => 'html',
  '_tighten' => 1,
  '_warn' => 1
}, 'HTML::TreeBuilder' );

final processed HTML:

$VAR1 = {   <<<<<<<<<< default CSS (not shown, used in rendering)
};
$VAR2 = {    <<<<<<<<<<<<<<<< user-added CSS styling
  'tag' => 'style',
  'text' => ''
};
$VAR3 = {    <<<<<<<<<<<<<<<< actual start of HTML text to process
  'tag' => 'p',
  'text' => ''
};
$VAR4 = {
  'tag' => '',
  'text' => 'Example of Markdown that needs to be supported in document text blocks. There is no need to support this within tables, although it would be a "nice" feature.'
};
$VAR5 = {
  'tag' => '/p',
  'text' => ''
};
$VAR6 = {
  'tag' => 'p',
  'text' => ''
};
$VAR7 = {
  'tag' => '',
  'text' => 'Firstly just some simple styling: '
};
$VAR8 = {
  'tag' => 'em',
  'text' => ''
};
$VAR9 = {
  'tag' => '',
  'text' => 'italics'
};
$VAR10 = {
  'tag' => '/em',
  'text' => ''
};
$VAR11 = {
  'tag' => '',
  'text' => ', '
};
$VAR12 = {
  'tag' => 'strong',
  'text' => ''
};
$VAR13 = {
  'tag' => '',
  'text' => 'bold'
};
$VAR14 = {
  'tag' => '/strong',
  'text' => ''
};
$VAR15 = {
  'tag' => '',
  'text' => ' and '
};
$VAR16 = {
  'tag' => 'strong',
  'text' => ''
};
$VAR17 = {
  'tag' => 'em',
  'text' => ''
};
$VAR18 = {
  'tag' => '',
  'text' => 'both'
};
$VAR19 = {
  'tag' => '/em',
  'text' => ''
};
$VAR20 = {
  'tag' => '/strong',
  'text' => ''
};
$VAR21 = {
  'tag' => '',
  'text' => '.'
};
$VAR22 = {
  'tag' => '/p',
  'text' => ''
};
$VAR23 = {
  'tag' => 'p',
  'text' => ''
};
$VAR24 = {
  'tag' => '',
  'text' => 'There should also be support for '
};
$VAR25 = {
  'tag' => 'em',
  'text' => ''
};
$VAR26 = {
  'tag' => '',
  'text' => 'alternative italics'
};
$VAR27 = {
  'tag' => '/em',
  'text' => ''
};
$VAR28 = {
  'tag' => '/p',
  'text' => ''
};
$VAR29 = {
  'tag' => 'p',
  'text' => ''
};
$VAR30 = {
  'tag' => '',
  'text' => 'Then a bulleted list:'
};
$VAR31 = {
  'tag' => '/p',
  'text' => ''
};
$VAR32 = {
  'tag' => 'ul',
  'text' => ''
};
$VAR33 = {    <<<<<<< <marker> pseudotag added to permit list marker styling via CSS
  'tag' => 'marker',
  'text' => ''
};
$VAR34 = {
  'tag' => '',
  'text' => ''  <<<<<<<<< will be filled in during rendering with actual marker text
};
$VAR35 = {
  'tag' => '/marker',
  'text' => ''
};
$VAR36 = {
  'tag' => 'li',
  'text' => ''
};
$VAR37 = {
  'tag' => '',
  'text' => 'Unordered item'
};
$VAR38 = {
  'tag' => '/li',
  'text' => ''
};
$VAR39 = {
  'tag' => 'marker',
  'text' => ''
};
$VAR40 = {
  'tag' => '',
  'text' => ''
};
$VAR41 = {
  'tag' => '/marker',
  'text' => ''
};
$VAR42 = {
  'tag' => 'li',
  'text' => ''
};
$VAR43 = {
  'tag' => '',
  'text' => 'Another unordered item'
};
$VAR44 = {
  'tag' => '/li',
  'text' => ''
};
$VAR45 = {
  'tag' => '/ul',
  'text' => ''
};
$VAR46 = {
  'tag' => 'p',
  'text' => ''
};
$VAR47 = {
  'tag' => '',
  'text' => 'And a numbered list:'
};
$VAR48 = {
  'tag' => '/p',
  'text' => ''
};
$VAR49 = {
  'tag' => 'ol',
  'text' => ''
};
$VAR50 = {
  'tag' => 'marker',
  'text' => ''
};
$VAR51 = {
  'tag' => '',
  'text' => ''
};
$VAR52 = {
  'tag' => '/marker',
  'text' => ''
};
$VAR53 = {
  'tag' => 'li',
  'text' => ''
};
$VAR54 = {
  'tag' => '',
  'text' => 'Item one'
};
$VAR55 = {
  'tag' => '/li',
  'text' => ''
};
$VAR56 = {
  'tag' => 'marker',
  'text' => ''
};
$VAR57 = {
  'tag' => '',
  'text' => ''
};
$VAR58 = {
  'tag' => '/marker',
  'text' => ''
};
$VAR59 = {
  'tag' => 'li',
  'text' => ''
};
$VAR60 = {
  'tag' => '',
  'text' => 'Item two'
};
$VAR61 = {
  'tag' => '/li',
  'text' => ''
};
$VAR62 = {
  'tag' => '/ol',
  'text' => ''
};
$VAR63 = {
  'tag' => 'h1',
  'text' => ''
};
$VAR64 = {
  'tag' => '',
  'text' => 'We will need a heading'
};
$VAR65 = {
  'tag' => '/h1',
  'text' => ''
};
$VAR66 = {
  'tag' => 'h2',
  'text' => ''
};
$VAR67 = {
  'tag' => '',
  'text' => 'And a subheading'
};
$VAR68 = {
  'tag' => '/h2',
  'text' => ''
};
$VAR69 = {
  'tag' => 'p',
  'text' => ''
};
$VAR70 = {
  'tag' => '',
  'text' => 'Finally we’ll need some '
};
$VAR71 = {
  '_href' => 'https://duckduckgo.com',
  'tag' => 'a',
  'text' => ''
};
$VAR72 = {
  'tag' => '',
  'text' => 'external links'
};
$VAR73 = {
  'tag' => '/a',
  'text' => ''
};
$VAR74 = {
  'tag' => '',
  'text' => '.'
};
$VAR75 = {
  'tag' => '/p',
  'text' => ''
};
$VAR76 = {
  'tag' => 'p',
  'text' => ''
};
$VAR77 = {
  'tag' => '',
  'text' => 'Show that '
};
$VAR78 = {
  '_href' => 'https://www.catskilltech.com',
  'tag' => 'a',
  'text' => ''
};
$VAR79 = {
  'tag' => '',
  'text' => 'another link'
};
$VAR80 = {
  'tag' => '/a',
  'text' => ''
};
$VAR81 = {
  'tag' => '',
  'text' => ' on the same page works.'
};
$VAR82 = {
  'tag' => '/p',
  'text' => ''
};
$VAR83 = {
  'tag' => 'p',
  'text' => ''
};
$VAR84 = {  <<<<<<<<<<<<<<<< start of next-to-last paragraph
  'tag' => '',
  'text' => 'Show some '
};
$VAR85 = {   <<<<<<<<<<<<<<< should NOT end paragraph here!
  'tag' => '/p',
  'text' => ''
};
$VAR86 = {   <<<<<<<<<<<<<<< floating off by itself, should be within a <p>
  'tag' => 'ins',
  'text' => ''
};
$VAR87 = {
  'tag' => '',
  'text' => 'inserted'
};
$VAR88 = {
  'tag' => '/ins',
  'text' => ''
};
$VAR89 = {   <<<<<<<<<<<<<<< finally (re)start paragraph
  'tag' => 'p',
  'text' => ''
};
$VAR90 = {
  'tag' => '',
  'text' => ' text and '
};
$VAR91 = {
  'tag' => 'u',
  'text' => ''
};
$VAR92 = {
  'tag' => '',
  'text' => 'underlined'
};
$VAR93 = {
  'tag' => '/u',
  'text' => ''
};
$VAR94 = {
  'tag' => '',
  'text' => ' text that display 
underlines. Show some '
};
$VAR95 = {   <<<<<<<<<<<<<<<< should not end paragraph here
  'tag' => '/p',
  'text' => ''
};
$VAR96 = {
  'tag' => 'del',
  'text' => ''
};
$VAR97 = {
  'tag' => '',
  'text' => 'deleted'
};
$VAR98 = {
  'tag' => '/del',
  'text' => ''
};
$VAR99 = {  <<<<<<<<<<<<<<< remainder of paragraph, finally
  'tag' => 'p',
  'text' => ''
};
$VAR100 = {
  'tag' => '',
  'text' => ' text, '
};
$VAR101 = {
  'tag' => 'strike',
  'text' => ''
};
$VAR102 = {
  'tag' => '',
  'text' => 'strike-out'
};
$VAR103 = {
  'tag' => '/strike',
  'text' => ''
};
$VAR104 = {
  'tag' => '',
  'text' => ' text,
and '
};
$VAR105 = {
  'tag' => 's',
  'text' => ''
};
$VAR106 = {
  'tag' => '',
  'text' => 's\'d out'
};
$VAR107 = {
  'tag' => '/s',
  'text' => ''
};
$VAR108 = {
  'tag' => '',
  'text' => ' text that show line-throughs. 
More than '
};
$VAR109 = {
  'style' => 'text-decoration: \'underline line-through overline\'',
  'tag' => 'span',
  'text' => '',
  'text-decoration' => 'underline line-through overline'
};
$VAR110 = {
  'tag' => '',
  'text' => 'one
at a time'
};
$VAR111 = {
  'tag' => '/span',
  'text' => ''
};
$VAR112 = {
  'tag' => '',
  'text' => ' are possible via style attribute, also via
'
};
$VAR113 = {
  'tag' => 'u',
  'text' => ''
};
$VAR114 = {
  'tag' => 's',
  'text' => ''
};
$VAR115 = {
  'tag' => '',
  'text' => 'nested tags'
};
$VAR116 = {
  'tag' => '/s',
  'text' => ''
};
$VAR117 = {
  'tag' => '/u',
  'text' => ''
};
$VAR118 = {
  'tag' => '',
  'text' => '.'
};
$VAR119 = {  <<<<<<<<<<<<< everything from "Show some <ins>... to here should
             <<<<<<<<<<<<< have been within one paragraph!
  'tag' => '/p',
  'text' => ''
};
$VAR120 = {
  'tag' => 'p',
  'text' => ''
};
$VAR121 = {
  'tag' => '',
  'text' => 'Then we need some styling features in tables as shown in the table below. There is no need to support this in text blocks, although it would be a nice feature (colored text is already available in text blocks using its options).'
};
$VAR122 = {
  'tag' => '/p',
  'text' => ''
};

I don't do any extra processing for <ins> and <del>... I merely set the very same text CSS properties as <u> and <s>, so they should be processed the same way (<u> and <s> work fine). It appears that the structure created by HTML::TreeBuilder is faulty, in that some paragraphs end prematurely. This is end-user input, so I can't simply eliminate <ins> and <del> content. Besides, they may wish to style (CSS) them differently.
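One way to narrow this down is to reduce the input to a single paragraph and compare the result with p_strict switched off and on. The following is only a minimal sketch: the helper name ins_parent_tag and the shortened test paragraph are made up for illustration, and whether p_strict is actually involved is a hypothesis to be tested, not an established cause.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse one short paragraph and report the tag of
# the element that ends up as the parent of <ins>.
sub ins_parent_tag {
    my ($p_strict) = @_;

    my $tree = HTML::TreeBuilder->new();
    $tree->ignore_unknown(0);
    $tree->no_space_compacting(1);
    $tree->p_strict($p_strict);
    $tree->implicit_body_p_tag(1);
    $tree->parse_content(
        '<p>Show some <ins>inserted</ins> and <u>underlined</u> text.</p>'
    );

    # look_down in scalar context returns the first matching element.
    my $ins    = $tree->look_down(_tag => 'ins');
    my $parent = $ins ? $ins->parent->tag : '(no ins element found)';

    $tree->delete;    # release the tree explicitly
    return $parent;
}

for my $p_strict (0, 1) {
    printf "p_strict=%d: <ins> parent is <%s>\n",
        $p_strict, ins_parent_tag($p_strict);
}
```

If the parent is p in one run and body in the other, the p_strict setting is the place to look; if it is body in both, the cause lies elsewhere.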

Improper handling of NOSCRIPT in HEAD

When calling parse_file on HTML that contains a noscript block in the head, the noscript tags are left in the head but the contained text is moved to the body. One of two things should be done instead:

  1. Leave the entire noscript block (tags and content) in the head.
  2. Move the entire block (including the surrounding <noscript> and </noscript> tags) to the body.

In addition, blocks with unknown tags (e.g. test, see example below) should perhaps be left in the head.

The behavior appears to be correct for script tags and for tags whose type is unknown (i.e. both the tags and their contents are moved). The code in question may be lines 488–492 in TreeBuilder.pm.

Here's an example (processing is done with ignore_unknown=0 and some newlines added for clarity):

Input

<html>
<head>
<noscript><i>NS</i></noscript>
<test><b>TEST</b></test>
<script>/*JS*/</script>
</head>
<body>
</body>
</html>

Output

<html>
<head>
<noscript></noscript>
</head>
<body>
<i>NS</i>
<test><b>TEST</b></test>
<script>/*JS*/</script>
</body>
</html>

Expected output

<html>
<head>
</head>
<body>
<noscript><i>NS</i></noscript>
<test><b>TEST</b></test>
<script>/*JS*/</script>
</body>
</html>
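The report can be checked with a short script along these lines. It is a sketch under one assumption: it feeds the input through parse_content on a string rather than parse_file, so that it is self-contained, and the two may not behave identically in every respect. It prints the serialized tree and the parent of each element of interest, without presuming what the result will be.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = <<'HTML';
<html>
<head>
<noscript><i>NS</i></noscript>
<test><b>TEST</b></test>
<script>/*JS*/</script>
</head>
<body>
</body>
</html>
HTML

my $tree = HTML::TreeBuilder->new();
$tree->ignore_unknown(0);    # keep the unknown <test> element
$tree->parse_content($html);

# Serialize with default entity escaping and two-space indentation.
my $output = $tree->as_HTML(undef, '  ');
print $output;

# Report where each element of interest ended up in the tree.
for my $tag (qw(noscript i test script)) {
    my $el = $tree->look_down(_tag => $tag);
    printf "<%s> is under <%s>\n",
        $tag, $el ? $el->parent->tag : '(not found)';
}
```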

Invalid tags get automatic end tag?

Here is the Perl code using TreeBuilder:

my $tree = HTML::TreeBuilder->new();
$tree->ignore_unknown(0);  # don't discard tags not recognized as HTML
$tree->no_space_compacting(1);  # preserve spaces
$tree->warn(1);  # warn if syntax error found
$tree->p_strict(1);  # auto-close paragraph on new block element
$tree->implicit_body_p_tag(1);  # loose text gets wrapped in <p>
$tree->parse_content($text);

Here is some HTML input being parsed:

<ul>
  <li>Expect a regular black bullet and text.</li>
  <marker style="_marker-color: red;"><li>Expect red bullet and black text.</li>
</ul>

What TreeBuilder is delivering to me is:

tag='ul'
tag='li'
tag='' text='Expect a regular black bullet and text.'
tag='/li'
tag='marker' style='_marker-color: red;'
tag='/marker'                                                      <<<< UNEXPECTED
tag='li'
tag='' text='Expect red bullet and black text.'
tag='/li'
tag='/ul'

<marker> and _marker-color are my extensions to HTML and CSS, so I don't expect them to be recognized. I do have ignore_unknown set to NOT ignore invalid tags, so my expectation would be to see <marker> as just another tag. Is it behaving properly in that it creates an end tag for <marker>? What if my definition of <marker> is that I'm giving an explicit end tag </marker>, but it appears many lines away... am I going to end up with two </marker>s?

HTML::Tagset has a list of tags with no (or optional) end tags, which of course doesn't include <marker>. If a tag isn't found in this list, does HTML::TreeBuilder automatically create an end tag? I'm not sure that's a good idea.
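It may help to look at the tree itself rather than at a flattened tag list. The tree stores elements, not start and end tags, so a '/marker' entry in a traversal presumably only marks the point where the walk leaves a marker node that has no children. The sketch below (an illustration, not a statement of how TreeBuilder decides where an unknown element ends) reports how many children the marker element has and dumps the tree for inspection.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new();
$tree->ignore_unknown(0);    # keep the non-standard <marker> element
$tree->parse_content(<<'HTML');
<ul>
  <li>Expect a regular black bullet and text.</li>
  <marker style="_marker-color: red;"><li>Expect red bullet and black text.</li>
</ul>
HTML

my $marker = $tree->look_down(_tag => 'marker');
if ($marker) {
    # content_list returns the child nodes (elements and text segments).
    my @children = $marker->content_list;
    printf "<marker> has %d child node(s); its parent is <%s>\n",
        scalar @children, $marker->parent->tag;
}

# Print an indented outline of the whole tree to STDOUT.
$tree->dump;
```

If the marker node turns out to be empty with the li as its sibling, the element was closed before the li was opened; an explicit </marker> later in the input would then find no open marker element to close.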

Clarify how to use DEBUG

To turn debug output on, one must use:

BEGIN {
	$HTML::TreeBuilder::DEBUG=2;
}
use HTML::TreeBuilder;

However, the documentation (in TreeBuilder.pm:2066–2070) only states:

=sub DEBUG

Are we in Debug mode?  This is a constant subroutine, to allow
compile-time optimizations.  To control debug mode, set
C<$HTML::TreeBuilder::DEBUG> I<before> loading HTML::TreeBuilder.

It isn't clear from the above that one needs the wrapping BEGIN block. When I initially tried to turn on debug output I used the following (which didn't work):

$HTML::TreeBuilder::DEBUG=2;
use HTML::TreeBuilder;
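The reason is Perl's execution order: a use statement runs at compile time, while a plain assignment runs at run time, after the module has already been loaded. The following pure-Perl sketch shows the ordering without loading HTML::TreeBuilder at all. The Demo:: variables are invented for the demonstration, and the BEGIN blocks that read them stand in for a module inspecting a package variable while it loads.

```perl
use strict;
use warnings;

our ($seen_plain, $seen_begin);

# Plain assignment: executed at run time, after every BEGIN block has run.
$Demo::PLAIN = 2;
# Stand-in for `use Some::Module;` reading the variable while it loads.
BEGIN { $seen_plain = $Demo::PLAIN }

# Assignment wrapped in BEGIN: executed at compile time, in source order.
BEGIN { $Demo::WRAPPED = 2 }
BEGIN { $seen_begin = $Demo::WRAPPED }

printf "plain assignment seen at load time: %s\n",
    defined $seen_plain ? $seen_plain : 'undef';
printf "BEGIN assignment seen at load time: %s\n", $seen_begin;
```

The first line reports undef and the second reports 2, which matches the behaviour described above: only the assignment inside a BEGIN block is visible to code that runs while the module is being compiled.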

Fix

Rewrite the last sentence of the documentation quoted above to mention that a BEGIN block must be used. Adding an example (such as the first code snippet above) would be nice as well.
