ezyang / htmlpurifier

Standards compliant HTML filter written in PHP

Home Page: http://htmlpurifier.org

License: GNU Lesser General Public License v2.1

PHP 94.50% HTML 4.60% CSS 0.24% XSLT 0.52% JavaScript 0.12% Shell 0.04%

htmlpurifier's Introduction

HTML Purifier

HTML Purifier is an HTML filtering solution that uses a unique combination of robust whitelists and aggressive parsing to ensure that not only are XSS attacks thwarted, but the resulting HTML is standards compliant.

HTML Purifier is oriented towards richly formatted documents from untrusted sources that require CSS and a full tag-set. This library can be configured to accept a more restrictive set of tags, but it won't be as efficient as more bare-bones parsers. It will, however, do the job right, which may be more important.

Places to go:

  • See INSTALL for a quick installation guide
  • See docs/ for developer-oriented documentation, code examples and an in-depth installation guide.
  • See WYSIWYG for information on editors like TinyMCE and FCKeditor

HTML Purifier can be found on the web at: http://htmlpurifier.org/

Installation

Package available on Composer.

If you're using Composer to manage dependencies, you can use

$ composer require ezyang/htmlpurifier

htmlpurifier's People

Contributors

bytestream, codebymikey, darhazer, dilongfa, ezyang, kish0808, makstech, marinaglancy, mbrodala, morozov, mpyw, mtibben, ngnpope, okainov, phpgangsta, r-kovalenko, rah1x, robloach, royopa, sandromiguel, semantic-release-bot, skodak, snapshotpl, stefanotorresi, synchro, timwolla, xemlock, xiphin, zerocrates, zobzn

htmlpurifier's Issues

Something goes wrong when HTML tags appear within a script

4.7
Core.HiddenElements

HTML:

<!DOCTYPE html>
<html>
<head>
    <title>aaaaa</title>
</head>
<body>

<div class="test">

    <script type="text/javascript">
        document.getElementById('test').innerHTML = '<div class="zyCon"><span class="icondlsj" id="icondlsj"></span><div class="zyText"><p>'+_PA+'</p></div></div>';
        document.getElementById('test');
        document.getElementById('test');
        document.getElementById('test');
    </script>

</div>


<a href="#">aaaaa</a>
<p>hello world</p>
</body>
</html>

php:

$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$html = $purifier->purify($html);

output:

<div class="test">

    </div>';
        document.getElementById('test');
        document.getElementById('test');
        document.getElementById('test');





<a href="#">aaaaa</a>
<p>hello world</p>

the right output should be:

<div class="test">


</div>


<a href="#">aaaaa</a>
<p>hello world</p>

Please tag a new stable version of the library

The latest release was made in 2013. This means that people using Composer with its default settings (allowing only stable versions of packages) will not benefit from any of the bug fixes made since then. I just discovered that a bug that has been annoying me all the time (#32) was fixed 6 months ago, but I don't have the fix because it has not been released yet.

Warning Array to string conversion in HTMLPurifier_Printer_ConfigForm_default->render

I'm surprised no one else is seeing this, so sorry if I'm doing something stupid.

v4.7.0 introduced the warning

Array to string conversion in HTMLPurifier_Printer_ConfigForm_default->render

This comes from AutoFormat.RemoveEmpty.Predicate where the default value is an array of arrays. This default also fails to print correctly on the config page:

colgroup:Array
th:Array
td:Array
iframe:Array

It doesn't seem possible to type in a different value and actually get an array of arrays.

This value is also missing from the docs at http://htmlpurifier.org/live/configdoc/plain.html.

DOMLex strips spaces when DirectLex does not, with the exact same config

Given a string of "Testing<mark> </mark>Spaces".

DOMLex will return: "Testing<mark></mark>Spaces"
DirectLex will return: "Testing<mark> </mark>Spaces"

Here is a unit test describing the issue in more detail:

class SpaceBugTest extends \PHPUnit_Framework_TestCase {

    public function testDirectLex_DoesNotRemoveSpace()
    {
        $config = HTMLPurifier_Config::createDefault();
        $config->set('Core.LexerImpl', 'DirectLex');

        $def = $config->getHTMLDefinition(true);
        $def->addElement('mark', 'Inline', 'Flow', 'Common');
        $def->addAttribute('mark', 'data-thread-id', 'CDATA');

        $purifier = new HTMLPurifier($config);

        $before = "Testing<mark> </mark>Spaces";
        $after  = $purifier->purify($before, $config);

        $this->assertEquals($before, $after);
    }

    public function testDOMLex_DoesNotRemoveSpace()
    {
        $config = HTMLPurifier_Config::createDefault();
        $config->set('Core.LexerImpl', 'DOMLex');

        $def = $config->getHTMLDefinition(true);
        $def->addElement('mark', 'Inline', 'Flow', 'Common');
        $def->addAttribute('mark', 'data-thread-id', 'CDATA');

        $purifier = new HTMLPurifier($config);

        $before = "Testing<mark> </mark>Spaces";
        $after  = $purifier->purify($before, $config);

        $this->assertEquals($before, $after);
    }

}

The test testDirectLex_DoesNotRemoveSpace passes without any problems. However, testDOMLex_DoesNotRemoveSpace fails its assertion that the $before and $after strings are the same, because the space has been stripped.

'flashvars' check is case sensitive

HTMLPurifier_Injector_SafeObject has a variable called "$allowedParam" that is used to filter out some tokens in a flash object. Currently it does a case-sensitive check on "flashvars", so other variations such as "FlashVars" are filtered out. Given that Adobe uses "FlashVars" in its example, this should probably be a case-insensitive check.

I can make the changes and generate a pull request, but I'm curious. Should all checks against "$allowedParam" be case insensitive or is it better to just add "FlashVars" to the array?
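
The case-insensitive option could look roughly like the sketch below. It is standalone PHP, not a patch to HTMLPurifier_Injector_SafeObject: the function name isAllowedParam and the allow-list contents are assumptions for illustration, and only the "flashvars" key comes from the report.

```php
<?php
// Hypothetical allow-list keyed by lowercase parameter name.
// Only 'flashvars' is taken from the report; the other keys are illustrative.
$allowedParam = array(
    'wmode'     => true,
    'movie'     => true,
    'flashvars' => true,
);

/**
 * Case-insensitive membership test: normalise the name before the lookup.
 */
function isAllowedParam($name, array $allowed)
{
    return isset($allowed[strtolower($name)]);
}

var_dump(isAllowedParam('FlashVars', $allowedParam)); // bool(true)
var_dump(isAllowedParam('onload', $allowedParam));    // bool(false)
```

Normalising once at lookup time keeps the list itself short; adding "FlashVars" as a second key would also work but would miss spellings such as "FLASHVARS".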

CSS3 rules are deleted

Things like box-shadow, border-radius and linear-gradient() are removed. It's quite a disadvantage for design to not be able to use these properties.
Would it be possible to fix this?

Ability to purify entire HTML documents

Setting this option to false doesn't seem to work properly, or the parameter description is wrong.

I need to purify an HTML page while leaving its structure (the html, head, and title tags) unaltered, but setting the option to false removes these tags too. The only difference from the true value is that the page title is added to the output as plain text; the html, head, and body tags are still removed.

htmlpurifier v.4.4.0

Youtube's default embed codes (old) changed

<object width="480" height="360"><param name="movie" value="//www.youtube.com/v/tZ5tk8ytOFM?hl=ru_RU&amp;version=3&amp;rel=0"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="//www.youtube.com/v/tZ5tk8ytOFM?hl=ru_RU&amp;version=3&amp;rel=0" type="application/x-shockwave-flash" width="480" height="360" allowscriptaccess="always" allowfullscreen="true"></embed></object>

Because there is no http:// prefix, the YouTube filter doesn't catch them.
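
To illustrate the mismatch, here is a small standalone sketch of a pattern that treats the scheme as optional. The pattern and the function name extractYouTubeId are assumptions for illustration; this is not the expression the library's YouTube filter uses.

```php
<?php
// Hypothetical pattern, not the library's own: the scheme is optional, so
// "http://", "https://" and protocol-relative "//" URLs are all recognised.
$pattern = '#^(?:https?:)?//www\.youtube\.com/v/([A-Za-z0-9_-]+)#';

/**
 * Return the video id from an old-style /v/ embed URL, or null if no match.
 */
function extractYouTubeId($url, $pattern)
{
    return preg_match($pattern, $url, $m) ? $m[1] : null;
}

var_dump(extractYouTubeId('//www.youtube.com/v/tZ5tk8ytOFM?hl=ru_RU', $pattern));
var_dump(extractYouTubeId('http://www.youtube.com/v/tZ5tk8ytOFM', $pattern));
```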

chmod destroys ACL

Hi,
The method HTMLPurifier_DefinitionCache_Serializer::_prepareDir uses the umask function, which destroys permissions on systems with ACLs (Debian in my case). Please remove the manual permission setting on cache files, or provide a way to configure it.

How to use htmlpurifier in Laravel 5

In Laravel 5, to use this library I have to install it via Composer. The library entry point ends up at project_folder/vendor/ezyang/htmlpurifier/library/HTMLPurifier.auto.php
I then use the command below in each class:

require_once app_path() . '/../vendor/ezyang/htmlpurifier/library/HTMLPurifier.auto.php';

Is there a better way than this?

Thank you!

Note: the app_path() function returns project_folder/app
My project structure:

-- project_folder
---- app
---- vendor
-------- ezyang
------------ ...

tel protocol

It would be great to have HTMLPurifier_URIScheme for

<a href="tel:+21321312312">

by default. Is that possible, or should I create a fork for it?
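
Until something like this ships by default, a custom scheme class may be enough. The sketch below assumes that the scheme registry resolves a scheme by the class name HTMLPurifier_URIScheme_<scheme> and that URI.AllowedSchemes is a lookup array; the validation rule is a simplification, not a full RFC 3966 check, and the autoload path assumes a Composer install.

```php
<?php
require_once 'vendor/autoload.php'; // assumes a Composer install

// Define the class only if the installed release does not already provide it.
if (!class_exists('HTMLPurifier_URIScheme_tel')) {
    class HTMLPurifier_URIScheme_tel extends HTMLPurifier_URIScheme
    {
        public $browsable = false;
        public $may_omit_host = true;

        public function doValidate(&$uri, $config, $context)
        {
            // A tel: URI carries only a number in the path component.
            $uri->userinfo = null;
            $uri->host = null;
            $uri->port = null;
            // Simplified rule: keep digits and a leading "+", drop the rest.
            $uri->path = preg_replace('/(?!^\+)[^\d]/', '', $uri->path);
            return true;
        }
    }
}

$config = HTMLPurifier_Config::createDefault();
$config->set('URI.AllowedSchemes', array(
    'http'  => true,
    'https' => true,
    'tel'   => true,
));
$purifier = new HTMLPurifier($config);

echo $purifier->purify('<a href="tel:+21321312312">Call us</a>');
```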

css attribute border-color on IMG doesn't allow rgb value

This might be of interest to you http://stackoverflow.com/questions/19200653/htmlpurifier-drops-border-color-defined-in-rgb-but-not-in-hex/19200839#19200839

In short, setting the css attribute border-color to an IMG object is not possible if you use rgb values (hex is fine though). It is possible to get around it like it is done in the above link, by first setting the border-color, and then creating a compound border attribute by reading the 'border' attribute (even though it is technically not defined, the browser/jquery will deliver it) and then turning around and setting that hard to the 'border' attribute.

Surely this is a bug in the htmlpurifier filtering process?

How do I use HTMLPurifier?

I'm completely new to PHP (although I do know C). I've installed PEAR and HTMLPurifier through it on my Mac, but how do I actually get it up and running? The guide says something about needing to include the HTMLPurifier main-equivalent file, but where?

In php.ini? somewhere else?

All I'm trying to do is run it against a few of my HTML files for an ebook, I'm not trying to run it on a live site or anything.
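
For a one-off run over local files, a small command-line script is probably the simplest route. This sketch assumes the PEAR install placed HTMLPurifier.auto.php on PHP's include_path (otherwise use the full path to that file); the script name purify.php and the ".clean.html" suffix are made up for the example.

```php
<?php
// purify.php, run as: php purify.php chapter1.html chapter2.html
// The include happens in the script itself, not in php.ini.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// $argv[0] is the script name, so skip it.
foreach (array_slice($argv, 1) as $file) {
    $dirty = file_get_contents($file);
    if ($dirty === false) {
        fwrite(STDERR, "Cannot read $file\n");
        continue;
    }
    // Write the result next to the original instead of overwriting it.
    file_put_contents($file . '.clean.html', $purifier->purify($dirty));
}
```

With the default configuration the output is a body fragment, so the html, head, and body wrappers of each input file will not appear in the cleaned copy.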

PHP errors when nested element not allowed

If you are restricting HTML content to a certain set of allowed tags, and you allow <ul> tags but not <li> tags, you get a lot of nondescript PHP errors bubbling up from the library. The fact it was the missing rule allowing <li> tags causing the problem was not obvious, just something I tried arbitrarily and it worked.

Object module does not exist? Works on Windows 7, but fails on Windows 2003

vendor/ezyang/htmlpurifier/library/HTMLPurifier/HTMLModuleManager.php 163

            if (!$ok) {
                $module = $original_module;
                if (!class_exists($module)) {
                    trigger_error($original_module . ' module does not exist',
                        E_USER_ERROR);
                    return;
                }

I ran var_dump($module), which outputs
string(6) "Object". On Windows 7 it works, but on Windows 2003 it fails.
The PHP version is 5.3.27, which should not be a problem.
Why?
Is a PHP extension missing?

Bug: HTML tag in comment taken as root element

Duplicate CSS declarations in htmlpurifier

htmlpurifier removes duplicate declarations.

from:

color: red;
color: blue;
color: green;

to:

color: red;
  1. If htmlpurifier removes declarations, it should keep the last declaration, not the first.

  2. Duplicate declarations may be legitimate (in rare cases, with different values) for compatibility with older browsers.

I think there is no need to delete anything here.
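
The "last one wins" behaviour asked for in point 1 can be sketched in a few lines of standalone PHP. This is an illustration of the cascade order only, not the library's CSS validator; the function name is hypothetical, and splitting on ";" is naive (it would break on values that contain a semicolon).

```php
<?php
/**
 * Parse a style string into property => value pairs, letting later
 * declarations override earlier ones, as the CSS cascade does.
 */
function parseDeclarationsLastWins($css)
{
    $result = array();
    foreach (explode(';', $css) as $declaration) {
        if (strpos($declaration, ':') === false) {
            continue; // skip empty or malformed chunks
        }
        list($property, $value) = explode(':', $declaration, 2);
        $property = strtolower(trim($property));
        unset($result[$property]); // re-insert so the property keeps its last position
        $result[$property] = trim($value);
    }
    return $result;
}

print_r(parseDeclarationsLastWins('color: red; color: blue; color: green;'));
```

For the input above the surviving value is "green". Keeping every duplicate, as point 2 suggests, would instead require storing a list of values per property.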

Cache directory should default to PHP temporary directory

A site using htmlpurifier as part of a deployable artifact doesn't have access to write to the codebase, but htmlpurifier defaults to writing within the codebase, specifically:

Directory .../sites/all/libraries/htmlpurifier/library/HTMLPurifier/DefinitionCache/Serializer not writable, please chmod to 777
File .../sites/all/libraries/htmlpurifier/library/HTMLPurifier/DefinitionCache/Serializer.php, line 278

The library should instead default to PHP's built-in temporary directory support.

Also, 777 is an extremely unsafe permission and should not be recommended, especially on a webserver.
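
Until the default changes, the cache location can be set per configuration through the Cache.SerializerPath directive. A minimal sketch, assuming a Composer install; the "htmlpurifier" subdirectory name is an arbitrary choice for the example.

```php
<?php
require_once 'vendor/autoload.php'; // assumes a Composer install

// Hypothetical cache location under PHP's temporary directory.
$cacheDir = sys_get_temp_dir() . '/htmlpurifier';
if (!is_dir($cacheDir)) {
    // 0755 rather than 0777: writable by the owner only.
    mkdir($cacheDir, 0755, true);
}

$config = HTMLPurifier_Config::createDefault();
$config->set('Cache.SerializerPath', $cacheDir);

$purifier = new HTMLPurifier($config);
echo $purifier->purify('<p>cached definitions now live outside the codebase</p>');
```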

It also cuts text that is not a tag

I already wrote to the mailing list, but I think it is better to report it here as well.

    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.Allowed', '');
    $p = new HTMLPurifier($config);
    $p->purify('L<R blabla'); 

The result is just the letter "L".

Is this the correct behavior?

Syntax error (Quick to fix)

Hi,

I am using Netbeans 8 and it does a syntax error check on the files, and it found this error:
htmlpurifier / docs / specimens / jochem-blok-word.html - line 59 (double dot)
..MsoChpDefault
change to
.MsoChpDefault

HTMLPurifier causes segmentation fault

this is my code:

include_once 'htmlpurifier/HTMLPurifier.auto.php';
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$html = $purifier->purify($html);

If $html is a very long rich HTML text (more than ten thousand characters), it causes a segmentation fault.

Tagged releases

Could you please start tagging htmlpurifier so that we can stick to a specific version of this lib?

IDs starting with numbers disallowed, even with Attr.EnableID = true

Hi!

I stumbled upon this problem with numeric id attributes (or at least id attributes that start with numbers). They are stripped out, even if I turn Attr.EnableID on.

See this example:

$config = HTMLPurifier_Config::createDefault();
$config->set('Attr.EnableID', true);
$purifier = new HTMLPurifier($config);

var_dump(
    $purifier->purify('<div id="helloWorld"></div>'),
    $purifier->purify('<div id="4-helloWorld"></div>'),
    $purifier->purify('<div id="5"></div>'),
    $purifier->purify('<div id="a6"></div>')
);

The output is:

string '<div id="helloWorld"></div>' (length=27)
string '<div></div>' (length=11)
string '<div></div>' (length=11)
string '<div id="a6"></div>' (length=19)
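
The output is consistent with the HTML 4 rule that an ID must begin with a letter. The sketch below assumes the installed release ships the Attr.ID.HTML5 directive, which relaxes that rule; it is not available in the version discussed in this report, so check the configuration documentation for your release before relying on it.

```php
<?php
require_once 'vendor/autoload.php'; // assumes a Composer install

$config = HTMLPurifier_Config::createDefault();
$config->set('Attr.EnableID', true);
// Assumption: this directive exists in the installed release.
// Setting an unknown directive triggers an error.
$config->set('Attr.ID.HTML5', true);

$purifier = new HTMLPurifier($config);

var_dump(
    $purifier->purify('<div id="4-helloWorld"></div>'),
    $purifier->purify('<div id="5"></div>')
);
```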

HTMLPurifier for SVG Files?

This is probably more of a configuration issue than something that necessitates a code-change, but I felt it would be worth mentioning.

SVG images can contain embedded JavaScript code, since they're XML documents. e.g. https://hackerone.com/reports/148853

Has anyone had any luck using HTMLPurifier with SVG files? Is any significant code change needed to make it work?

Missing spaces after upgrade from 4.6.0 to 4.7.0

On 4.6.0 we had the following:

<b>Vetgedrukt</b> <i>Schuingedrukt</i> <span>Hou</span><iframe></iframe><script></script> jij ook zo van vakjesdenken?

became:

<b>Vetgedrukt</b> <i>Schuingedrukt</i> Hou jij ook zo van vakjesdenken?

since 4.7.0 it becomes:

<b>Vetgedrukt</b><i>Schuingedrukt</i>Hou jij ook zo van vakjesdenken?

I checked the changeset (v4.6.0...v4.7.0) but can't really find what's causing this.

Any thoughts?

Error handling in

Hi,

Yesterday we encountered an error in the cleanup function in htmlpurifier/library/HTMLPurifier/DefinitionCache/Serializer.php. The return value of opendir is not checked against false, which may result in an endless loop in the readdir loop that follows (for whatever reason, readdir returns an empty string when passed false). The opendir failure was caused by a missing file permission (the x flag).

Sincerely

Marc
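
The missing guard can be sketched as follows. This is a standalone illustration of the check, not the library's cleanup method; the function name listCacheFiles is hypothetical.

```php
<?php
/**
 * List the entries of a directory, returning false if it cannot be opened.
 */
function listCacheFiles($dir)
{
    $dh = @opendir($dir);
    if ($dh === false) {
        // Bail out here instead of calling readdir() on a failed handle.
        trigger_error('Cannot open directory ' . $dir, E_USER_WARNING);
        return false;
    }

    $files = array();
    // Strict comparison: a file named "0" must not end the loop early.
    while (false !== ($filename = readdir($dh))) {
        if ($filename === '.' || $filename === '..') {
            continue;
        }
        $files[] = $filename;
    }
    closedir($dh);

    return $files;
}
```

Calling listCacheFiles() on a directory without the x flag then returns false with a warning instead of spinning.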

Core.EnableIDNA without PEAR Net_IDNA2

Hey,

We just stumbled upon an issue with internationalized domain names. They are becoming more and more common (especially in Germany, with our umlauts), and I found that HTMLPurifier "kills" them in the purification process. We are developing a deployable system running on 1000+ Debian servers and don't want to add PEAR dependencies. As the code of HTMLPurifier/AttrDef/URI/Host.php is based on 2012-era PHP, I'd kindly ask whether it would be possible to update it with proper support for the php_intl component, which offers built-in functions for IDNs: http://php.net/manual/ref.intl.idn.php

As far as I understand Net_IDNA2, the idn_to_x functions should be a proper replacement?

Cheers
Matthias
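
For reference, the intl-based conversion mentioned above could look like this. It is a standalone sketch that requires the intl extension; the wrapper name hostToAscii is hypothetical and this is not how HTMLPurifier/AttrDef/URI/Host.php is written.

```php
<?php
/**
 * Convert an internationalized host name to its ASCII (punycode) form.
 * Returns false if the intl extension is missing or the conversion fails.
 */
function hostToAscii($host)
{
    if (!function_exists('idn_to_ascii')) {
        return false; // intl extension not loaded
    }
    // UTS #46 is the variant current PHP versions expect.
    return idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
}

var_dump(hostToAscii('müller.de'));
var_dump(hostToAscii('example.com'));
```

The function_exists() guard lets a caller fall back to another implementation when intl is not installed.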

borderRadius

Hi, thanks for a great repo. How come the CSS3 property borderRadius is not supported? I found a year-old snippet on how to implement it (code below, from http://htmlpurifier.org/phorum/read.php?2,6154,6154), but it doesn't seem to work in the current version of htmlpurifier.

  $config = HTMLPurifier_Config::createDefault();

  // add some custom CSS3 properties
  $css_definition = $config->getDefinition('CSS');

  $border_radius =
    $info['border-top-left-radius'] =
    $info['border-top-right-radius'] =
    $info['border-bottom-left-radius'] =
    $info['border-bottom-right-radius'] =
    new HTMLPurifier_AttrDef_CSS_Composite(array(
      new HTMLPurifier_AttrDef_CSS_Length('0'),
      new HTMLPurifier_AttrDef_CSS_Percentage(true)
    ));

  $info['border-radius'] = new HTMLPurifier_AttrDef_CSS_Multiple($border_radius);

  // wrap all new attr-defs with decorator that handles !important
  $allow_important = $config->get('CSS.AllowImportant');
  foreach ($info as $k => $v) {
    $css_definition->info[$k] = new HTMLPurifier_AttrDef_CSS_ImportantDecorator($v, $allow_important);
  }

  $html_purifier = new HTMLPurifier($config);

Expand CSSDefinition.php to include more properties

It's been a while since the list of CSS properties in CSSDefinition.php has been updated. There are many new properties available with CSS3, many of which have become standardized. I think it would be a good idea to update the basic doSetup info method with all the safe standardized CSS3 properties.

I would gladly commit the changes.

Remove excessive BR tags

An option that lets you specify a limit on consecutive BR tags (default: null). When set to a number, runs of BR tags longer than that number are trimmed down to the limit.

  • option = null: <br /><br /><br /><br /><br />Text<br /><br />
  • option = 1: <br />Text<br />
  • option = 2: <br /><br />Text<br /><br />
  • option = 3: <br /><br /><br />Text<br /><br />
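
The requested behaviour can be approximated with a regular expression, as in this standalone sketch. It is not an HTML Purifier injector or an existing directive; the function name limitConsecutiveBr is hypothetical.

```php
<?php
/**
 * Trim runs of consecutive <br> tags down to at most $limit tags.
 * A null limit leaves the input untouched.
 */
function limitConsecutiveBr($html, $limit = null)
{
    if ($limit === null) {
        return $html;
    }
    $limit = max(0, (int) $limit);

    // One <br> followed by at least $limit more, i.e. a run longer than $limit.
    // Whitespace between tags is allowed; text after the run is not consumed.
    $pattern = '#<br\s*/?>(?:\s*<br\s*/?>){' . $limit . ',}#i';

    return preg_replace($pattern, str_repeat('<br />', $limit), $html);
}

$input = '<br /><br /><br /><br /><br />Text<br /><br />';
echo limitConsecutiveBr($input, 1), "\n"; // <br />Text<br />
echo limitConsecutiveBr($input, 2), "\n"; // <br /><br />Text<br /><br />
```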

RemoveEmpty always excludes iframes

RemoveEmpty always excludes iframes but does not remove filtered frames.
In my setup I only allow certain tags and iframe sources and want all other iframes to be removed.

Config:

HTML.Allowed: a[href],p,b,strong,i,em,br,img[src],iframe[src|frameborder]
AutoFormat.RemoveEmpty: true
HTML.SafeIframe: true
URI.SafeIframeRegexp: '#^(https?:)?//(www\.youtube(?:-nocookie)?\.com/embed|player\.vimeo\.com/video)/#'

Sample Input:

<iframe src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/151316054&auto_play=false&hide_related=false&visual=true" width="100%" height="450" frameborder="no" scrolling="no">
<iframe src="//www.youtube.com/embed/91dTtJm198g" frameborder="0"></iframe>

Output:

<iframe></iframe> # should be removed
<iframe src="//www.youtube.com/embed/91dTtJm198g" frameborder="0"></iframe>

Remove opacity property from proprietary list in CSSDefinition.php

The comment on line 245 reads:

// technically not proprietary, but CSS3, and no one supports it
$this->info['opacity'] = new HTMLPurifier_AttrDef_CSS_AlphaValue();

This is no longer true. I think the opacity property would be safe to include in the doSetup info method.

Unexpected outcome.

I have the following


/**
 * @param string $raw
 * @return string
 */
function safely_does_it($raw)
{
    $config = HTMLPurifier_Config::create([]);

    $config->set('Core.Encoding', 'UTF-8');
    $config->set('HTML.Allowed', 'span,p,br,a,h1,h2,h3,h4,h5,strong,em,u,ul,li,ol,hr,blockquote,sub,sup,p[class],img');
    $config->set('HTML.AllowedElements', ['span', 'p', 'br', 'a', 'h1', 'h2', 'h3', 'h4', 'h5', 'strong', 'em', 'u', 'ul', 'li', 'ol', 'hr', 'blockquote', 'sub', 'sup', 'img']);
    $config->set('HTML.AllowedAttributes', 'style,target,title,href,class,src,border,alt,width,height,title,name,id');
    $config->set('CSS.AllowedProperties', 'text-align,font-weight,text-decoration');
    $config->set('AutoFormat.RemoveEmpty', true);
    $config->set('Attr.ForbiddenClasses', ['MsoNormal']);

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($raw);
}

I pass it

<p>
  <strong>Hey</strong>
  <br/>
  <img src="data:image/jpeg;base64.....{REST GOES HERE}">
  <strong>Cool</strong>
</p>

I get back

<p><br/></p>

Expected: the original input. (All I want to do is remove scripts and styles, really.)

Any ideas?

Unknown name property in DefinitionCache_Decorator

This line refers to a name property of an HTMLPurifier_DefinitionCache_Decorator instance, however, I can't find that property defined anywhere.

I'm trawling through everything in my IDE doing PSR-2 stuff, and it's spotting things like this - it's a nice payoff for getting phpdocs right!

htmlpurifier downloads

Could you add all releases on github to offer a kind of mirror for when the PEAR channel is down?

I'd love to help out here (work-wise, hosting, etc.). We seem to run into these hiccups fairly often, mostly due to latency issues, but right now, for example, the entire website is down due to a 'bad httpd conf'.

Thoroughly confused. HTML purifier seems to not do anything.

<?php

require_once("../vendor/autoload.php");

$config = \HTMLPurifier_Config::createDefault();
$config->set("Cache.SerializerPath", "cache");
$config->set("HTML.AllowedElements", null);
$purifier = new \HTMLPurifier($config);

echo $purifier->purify("<div>I'm definitely really a puppy  </div>  ");
// Outputs "<div>I'm definitely really a puppy  </div>  "

I really have no idea what I'm doing wrong here. I've never had a problem with HTMLPurifier before, but now it seems like I can't get it to do anything at all - it just returns the original input. I've tried clearing out the cache, but still no luck.

I'm using the latest release (4.7.0), with PHP 5.6.15. I'm loading the library via Composer.
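
One possible explanation is that nothing needed changing: div is part of the default element set, a null HTML.AllowedElements leaves that set unrestricted, and the input is already well formed. A quick way to check that the purifier is active is to feed it something it should reject. This is a sketch assuming the same Composer setup as above.

```php
<?php
require_once '../vendor/autoload.php';

$config = \HTMLPurifier_Config::createDefault();
$purifier = new \HTMLPurifier($config);

// With the default configuration, the onclick attribute and the
// script element should both be dropped from the output.
echo $purifier->purify(
    '<div onclick="steal()">I\'m definitely really a puppy</div><script>alert(1)</script>'
);
```

If the div is what should go, restricting the element list (for example with HTML.AllowedElements set to an explicit array) is the setting to look at.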
