dcwatson / bbcode

A pure python bbcode parser and formatter.

License: BSD 2-Clause "Simplified" License

bbcode's Introduction

Overview

Latest Package http://pypi.python.org/pypi/bbcode

Source Code https://github.com/dcwatson/bbcode

Documentation https://dcwatson.github.io/bbcode/

Installation

The easiest way to install the bbcode module is with pip, e.g.:

pip install bbcode

Requirements

Python, tested with versions 2.7, 3.5, 3.6, 3.7, and 3.8. Also tested with PyPy (2 and 3).

Basic Usage

# Using the default parser.
import bbcode
html = bbcode.render_html(text)

# Installing simple formatters.
parser = bbcode.Parser()
parser.add_simple_formatter('hr', '<hr />', standalone=True)
parser.add_simple_formatter('sub', '<sub>%(value)s</sub>')
parser.add_simple_formatter('sup', '<sup>%(value)s</sup>')

# A custom render function.
def render_color(tag_name, value, options, parent, context):
    return '<span style="color:%s;">%s</span>' % (tag_name, value)

# Installing advanced formatters.
for color in ('red', 'blue', 'green', 'yellow', 'black', 'white'):
    parser.add_formatter(color, render_color)

# Calling format with context.
html = parser.format(text, somevar='somevalue')

Advantages Over Postmarkup

More tag options for how/when to escape - for instance, you can specify whether to escape html or perform cosmetic replacements on a tag-by-tag basis. Same for auto-linking and transforming newlines.
More liberal (and accurate) automatic link creation, using John Gruber's URL regular expression: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
Does not swallow unrecognized tags. For example, [3] will be output as [3], not silently ignored.
More flexible tag option parser. Tags may have standard bbcode options, for example [url=something]text[/url], but may also have named options, for example [url=something alt=icon]text[/url]. These options are passed to the render function as a standard python dictionary.
Ability to specify tag opening and closing delimiters (default: [ and ]). A side benefit of this is being able to use this library to selectively strip HTML tags from a string by using < and >.
Includes a runnable unittest suite.
Python 3 support.

bbcode's Issues

Add more basic tags

Some basic tags not implemented.
For example:

[img]
[email]
[size]
[quote <name>]
[code [codetype]]
[font]
[sub]
[sup]

http://bb.bbboy.net/man/BbCode.html
http://www.bbcode.org/reference.php

Possible improvements to linker

I noticed that there are some things about linker (the user hook into the bbcode parser that activates on link detection) that could be made a bit nicer. The biggest one is:

Give linker access to the context like render() has. Without this I'm not sure there's an idiomatic way of getting this information to the linker, so I am getting around it by overriding large swathes of the Parser class.

It would be convenient (and quite easy, I suspect) to make the following changes, but I can work around all of them by overriding the URL formatter so they're not critical:

linker is used by default in _render_url, not just _link_replace (there are only very slight differences between them as it is).
Allow different schemes than http:// for link replacement (https:// is the obvious one that comes to mind).
For _domain_re, allow a custom list of matchable domains (there are probably clever ICANN related things you could do to get and cache an official list but that's probably a library in itself).

From a perf standpoint:

Have you tested whether it's faster to not capture the https?:// the first time around (rather than using a noncapturing group, or doing a full regex replace)? I.e., change _url_re to

r'(?im)\b((?:((https?://)|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^\s`!()\[\]{};:\'".,<>?]))

and then use the new capture group to decide whether it was necessary to check for prepending http(s)? Obviously it's very dependent on the regex implementation but this technique wouldn't require rescanning the URL again, and would also make the (I suspect) common task of matching all links to a given domain faster.

(Also, I feel like domain_re would be, if not faster, at least more correct if it was explicitly anchored to the beginning of the string with ^--any reason it isn't?)
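
A minimal sketch of the capture-group idea above, using a deliberately simplified pattern rather than the library's _url_re. The names SIMPLE_URL_RE, add_scheme and normalize_links are made up for this illustration. The optional scheme gets its own group, so the replacement callback can tell whether a scheme was present without scanning the URL a second time:

```python
import re

# Simplified stand-in for the library's URL pattern (an assumption, not _url_re).
# Group 2 is set only when the match starts with an explicit http:// or https://.
SIMPLE_URL_RE = re.compile(r"\b((https?://)|www\.)[^\s<>]+", re.IGNORECASE)


def add_scheme(match):
    """Return the matched URL, prepending http:// only when no scheme was captured."""
    url = match.group(0)
    if match.group(2):
        return url
    return "http://" + url


def normalize_links(text):
    """Rewrite every URL-like match in text so that it carries a scheme."""
    return SIMPLE_URL_RE.sub(add_scheme, text)
```

Whether this is actually faster than a second regex pass depends on the regex engine and the input, so it would need to be measured.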

double line returns

This module has a tendency to convert each line return into a double line break. I don't know if this is a part of bbcode's standard, but it's rather annoying because I like to keep careful control over my formatting.

As far as I can tell, there's no way to disable this behavior aside from running python's replace() function on the render_html() function's output.

XSS vulnerability in some tags

[url]javascript:alert('XSS');[/url]
[url]123" onmouseover="alert('Hacked');[/url]

Solution: escape more symbols, such as " and '. All HTML values returned between tags must be escaped.

See how Django does this, for example: https://github.com/django/django/blob/master/django/utils/html.py#L39
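
As a rough illustration of the escaping proposed in this issue, the sketch below builds a link with the standard library's html.escape and quote=True, so that quotes in user input cannot close the href attribute. The function name render_link is hypothetical and is not part of bbcode:

```python
import html


def render_link(href, text):
    """Build an anchor tag with both the attribute value and the body escaped."""
    safe_href = html.escape(href, quote=True)  # escapes &, <, >, " and '
    safe_text = html.escape(text, quote=True)
    return '<a href="%s">%s</a>' % (safe_href, safe_text)


# The second payload from this issue can no longer break out of the attribute.
payload = "123\" onmouseover=\"alert('Hacked');"
link = render_link(payload, payload)
```

Escaping alone does not deal with the first payload (a javascript: URL); that needs a separate check on the URL scheme.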

url should allow specifying target and rel attributes

It would be nice if we could replace

[url=example.com]page[/url]

with this:
<a href="http://example.com/" target="_blank" rel="nofollow">link</a>

by setting a config option on the BBCode instance.

Question about simple_formatter

Is it possible to add a new formatter with the simple formatter and then use it? Do you have an example of how to do this?

Quotation marks in brackets confuse parser

Here are samples:

import bbcode
bbcode.render_html("[b]Hello, [world][/b]")
'<strong>Hello, [world]</strong>'  # all ok

bbcode.render_html("[b]Hello, [wor'ld][/b]")
'<strong>Hello, [wor&#39;ld][/b]</strong>'   # see - [/b] and closing </strong>

bbcode.render_html("[b]Hello, [wor\"ld][/b]")
'<strong>Hello, [wor&quot;ld][/b]</strong>'  # same issue

"Dash" symbols inside code block

Dashes inside code block should remain unchanged, but in 1.0.26:

print(bbcode.render_html('[code]--[/code]'))
<code>&ndash;</code>

print(bbcode.render_html('[code]---[/code]'))
<code>&mdash;</code>

Formatters should only transform their contents and not the contents of their children

How to reproduce

import bbcode

def render(bbcode_text):
    parser = bbcode.Parser()
    parser.add_simple_formatter('left', '<div class="bb-left">%(value)s</div>')
    parser.add_simple_formatter(
        'code',
        '<code>%(value)s</code>',
        same_tag_closes=True,
        render_embedded=False,
        transform_newlines=False,
        escape_html=False,
        replace_links=False,
        replace_cosmetic=False,
        strip=True,
        swallow_trailing_newline=True
    )
    return parser.format(bbcode_text)

print render('[left]a\nb[code]c\nd[/code]\ne\nf[/left]')

Expected output

<div class="bb-left">a<br>b<code>c\nd</code><br>e<br>f</div>

Actual output

<div class="bb-left">a<br>b<code>c<br>d</code><br>e<br>f</div>

Note the <br> inside the <code>c<br>d</code> instead of the \n.

Version

BBCode 1.0.22
Python 2.7.12

Parsing links not working correctly

Hi,

There is a problem with parsing links. If the source string contains more than one URL with the same domain (?), the rest of them are replaced by repeats of the first two.

import bbcode
p = bbcode.Parser()
p.parse("http://github.com/ http://example.org http://github.com/dcwatson/")

Result:
'http://github.com/dcwatson/ http://example.org http://github.com/dcwatson/'

Cosmetic replacement within list tag even when replace_cosmetic=False

This should be an MVE:

>>> import bbcode
>>> p = bbcode.Parser(replace_cosmetic=False)
>>> p.format('[list=1][*](c)[/list]')
'<ol style="list-style-type:decimal;"><li>&copy;</li></ol>'

I think that whatever formatter is doing the substitution for list has replace_cosmetic forced to True?

url bbcode replaces \n with tag

I don't know why, but it should not.

Example: user posts message: [url=\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"[/url]
It becomes a very long empty post.

Expected behaviour:
<a href="\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n">\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n</a>

Problem: lines too looong

Would you accept a PR with PEP 8-conformant line lengths?

Unclosed "tag" treated as tag

Given input like:

>>> import bbcode
>>> bbcode.render_html('He said "[I]n the matter of X v Y, X wins".')
'He said &quot;<em>n the matter of X v Y, X wins&quot;.</em>'

Is it possible to have the mis-identified tag – or tags in general – not close except when there is a matching closing tag, eg [/i]? (Such behaviour is default in phpBB, for example).

Thanks for any help.

Problem with :[

Hi,

import bbcode
parser = bbcode.Parser()
parser.install_default_formatters()
parser.format("Test :[ something")
'Test :Test :[ something'

As you can see, when [ is used as part of, e.g., an emoticon (without a closing ]), there is a problem with the output: the text is duplicated.

render_embedded=False doesn't ignore opening tags

Code with a custom formatter that demonstrates the problem:

import bbcode

parser = bbcode.Parser()

def render_code(name, value, options, parent, context):
    if options:
        lang = parser._replace(options.keys()[0], parser.REPLACE_ESCAPE)
        highlight_class = "lang-%s" % lang
    else:
        highlight_class = 'no-highlight'
    return '<code class="%s">%s</code>' % (highlight_class, value)

parser.add_formatter(
    "code", render_code, 
    render_embedded=False, 
    transform_newlines=False, 
    replace_links=False,
    replace_cosmetic=False,
)

markup = """text before code:
[code python]
def test():
    print "test (c)", 'test'
    code = 123
    a = [code]
    b = 42
[/code] after code"""
print parser.format(markup)

Output:

text before code:<br /><code class="lang-python">def test():
    print &quot;test (c)&quot;, &#39;test&#39;
    code = 123
    a = [code]
    b = 42
[/code] after code</code>

Expected:

text before code:<br /><code class="lang-python">def test():
    print &quot;test (c)&quot;, &#39;test&#39;
    code = 123
    a = [code]
    b = 42
</code> after code

Postmarkup has the same issue.

Output is html escaped even if `escape_html=False`

import bbcode
parser = bbcode.Parser(escape_html=False)
print(parser.format("A' [b]A'[/b]"))

Expected output: A' <strong>A'</strong>
Actual output: A&#39; <strong>A&#39;</strong>

Affected version: bbcode 1.0.32, tested with Python3.6/3.7

Obscure XSS using url tag

Hey there,

today a few fellow students (@qll, @Immortalem, myself) took a look at your library and scanned it for potential security issues. We came up pretty much empty handed but were able to find a pretty nasty way to inject javascript through your parser, though not all browsers are vulnerable to this (most notably IE 10 and Chrome 40).

The exact details of this XSS include sending, e.g. a \x01 byte before the javascript: part so it bypasses your filter but still executes as valid JavaScript:

<a href="\x01javascript:alert(1)">Test</a>

We developed this case (and found other possible cases) using Shazzer (see 1 and 2).

While this seems like a weird edge case, there actually lies a real vulnerability as any user could generate this and run JS in the context of another user. And Chrome 40 is actually the current recent version.

Your current approach using blacklisting has the unfortunate result that there may always come up some way to bypass it (for example, there exists an ancient vbscript: URI for IE6 or something like that, though these would be very ancient machines indeed...).

However, using a whitelisting approach would involve building a whitelist of allowed tags and that can be infinite (think irc://, steam://, ...).

We would like to contribute to this project by offering a fix. But before blindly sending a pull request, we want to get your input on this.

Method 1:
We develop a method that fixes the above bug and keeps blacklisting intact. We think such a fix is possible in a reasonably secure way, although with a blacklisting approach some other, weirder way to evade it might still turn up.

Method 2:
We implement a whitelisting approach that allows augmenting allowed tags by users of your library.

I'd personally prefer the first method as the other one limits functionality. However, since this is your library, we want this to be your decision in the end.

Please let me know how you stand on this issue and we can get started on a fix 👍
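
For comparison, Method 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: is_safe_url and ALLOWED_SCHEMES are hypothetical names, not part of bbcode, and the default scheme list is arbitrary. Control characters and whitespace are removed before the scheme is inspected, which covers the \x01 prefix described above:

```python
import re

# Hypothetical default allowlist; a caller could pass a larger set (irc, steam, ...).
ALLOWED_SCHEMES = frozenset({"http", "https", "ftp", "mailto"})

_CONTROL_OR_SPACE = re.compile(r"[\x00-\x20\x7f]")
_SCHEME = re.compile(r"^([a-z][a-z0-9+.\-]*):", re.IGNORECASE)


def is_safe_url(url, allowed=ALLOWED_SCHEMES):
    """Return True if url has no scheme or a scheme listed in allowed."""
    # Drop characters that some browsers ignore when parsing the scheme.
    cleaned = _CONTROL_OR_SPACE.sub("", url)
    match = _SCHEME.match(cleaned)
    if match is None:
        # No scheme at all: a relative link or a bare domain.
        return True
    # Note: "host:8080/path" without a scheme is rejected too (conservative).
    return match.group(1).lower() in allowed
```

Passing a larger set, for example `is_safe_url(url, allowed=ALLOWED_SCHEMES | {"irc", "steam"})`, is one way to let users of the library extend the list.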

Leading slashes getting removed

The bbcode parser is stripping the leading slashes out of links. Because of this, relative links cannot be used without additional hacking.

Some tag values should not be stripped by default

Let's take source text:
Normal font [b]and bold font [/b]and again normal

As a result, "bold font" will be joined to "and again" without a space.
But the expected result is "and bold font and", with the spaces kept, because many non-developer users have no idea that tag values are stripped (they don't even know anything about tags and bbcode).

Version: 1.0.10

Color tag CSS injection

Not a major bug. But you know somebody is going to be able to exploit this:
https://github.com/dcwatson/bbcode/blob/master/bbcode.py#L137 is not secure. I can do [color=red; font-size: 1000px]Blah[/color] for example.
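
One way to close this hole is to accept only values that look like a plain color name or a hex code and fall back to a harmless default otherwise. The sketch below is not the library's code; sanitize_color and render_color_tag are hypothetical names, and looking the color up under the "color" key of options is an assumption about how [color=...] options are passed:

```python
import re

# A color name made of letters only, or a 3- or 6-digit hex code.
_COLOR_RE = re.compile(r"#[0-9a-fA-F]{3}|#[0-9a-fA-F]{6}|[a-zA-Z]{1,30}")


def sanitize_color(value, default="inherit"):
    """Return value if it is a plain color token, otherwise default."""
    value = value.strip()
    if _COLOR_RE.fullmatch(value):
        return value
    return default


def render_color_tag(tag_name, value, options, parent, context):
    """Render function with the signature used in Basic Usage above."""
    # Assumption: [color=red] arrives as options == {"color": "red"}.
    color = sanitize_color(options.get("color", ""))
    return '<span style="color:%s;">%s</span>' % (color, value)
```

fullmatch is used instead of match so that trailing text such as "; font-size: 1000px" makes the whole value invalid.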

Option to use custom tag opener/closer on formatter level

Hey!
It would be cool if there would be an option to change the tag_opener and tag_closer on formatter level. Or instead of this, a regex that matches for example an emoji (like :smile: or :)).
Sorry if this is already available but I couldn't find it.. just had a quick look on the source and in the docs this isn't mentioned.

Thanks!

PhpBB BBCode url error

Current output (taken directly from PhpBB database posts table):

>>> t="[url=http://www.dlvr.it/:3pzdg5xo]dlvr[/url:3pzdg5xo]"
>>> bbcode.render_html(t)
'<a href="http://www.dlvr.it/:3pzdg5xo">dlvr[/url:3pzdg5xo]</a>'

Expected output:

'<a href="http://www.dlvr.it/">dlvr</a>'

I don't know if this kind of BBCode is valid, but PhpBB uses it like a comment (I don't know why it does that), and they declare that they use BBCode, so maybe it should be recognized.

I suppose that bbcode could just strip them.
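
Until something like this is supported, the suffix can be stripped before rendering. The sketch below assumes the caller already knows the per-post identifier (3pzdg5xo in the example above); strip_phpbb_uid is a hypothetical helper, not part of bbcode:

```python
def strip_phpbb_uid(text, uid):
    """Remove the ':<uid>' suffix that precedes the closing bracket of each tag."""
    return text.replace(":%s]" % uid, "]")


raw = "[url=http://www.dlvr.it/:3pzdg5xo]dlvr[/url:3pzdg5xo]"
cleaned = strip_phpbb_uid(raw, "3pzdg5xo")
```

The cleaned string can then be passed to bbcode.render_html as usual.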

make tag options dict case-insensitive

But still retain the original case of the option keys. Essentially, I want to use something like this:

https://github.com/psf/requests/blob/master/requests/structures.py#L15
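
A stripped-down sketch of such a structure, written in the spirit of the one linked above but not copied from it. Lookups ignore case, while iteration yields the keys as they were last set:

```python
from collections.abc import MutableMapping


class CaseInsensitiveDict(MutableMapping):
    """Mapping with case-insensitive string keys that remembers the original case."""

    def __init__(self, data=None):
        # Maps lowercased key -> (original key, value).
        self._store = {}
        if data is not None:
            self.update(data)

    def __setitem__(self, key, value):
        self._store[key.lower()] = (key, value)

    def __getitem__(self, key):
        return self._store[key.lower()][1]

    def __delitem__(self, key):
        del self._store[key.lower()]

    def __iter__(self):
        return (original for original, _ in self._store.values())

    def __len__(self):
        return len(self._store)


options = CaseInsensitiveDict({"Author": "John"})
```

With this, options["author"] and options["AUTHOR"] both return "John", and list(options) still gives ["Author"].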

URLs with ampersand seem to break

A url with ampersand in the query string seems to break. It appears like the ampersand is stripped by default.

replace_links=False doesn't work on video tag/mp4

        def render_video(name, value, options, parent, context):
            return '<video width="100%" controls><source src="' + format(
                value) + '" type="video/mp4"></video>'

        parser.add_formatter('video', render_video, replace_links=False)

output:

<video width="100%" controls="" flashstopped="true" id="dummyid75" preload="metadata"><source src="<a rel=" nofollow"="" href="https://domain.com/media/ts/2020/01/11/09/27/f6e662d8-9d69-4512-8c8c-27284c09ce39.mp4">https://domain.com/media/ts/2020/01/11/09/27/f6e662d8-9d69-4512-8c8c-27284c09ce39.mp4" type="video/mp4"></video>

Quoted option values are cut on square bracket

I'd like to enable users to specify the author of a quote. http://bbcode.readthedocs.org/en/latest/formatters.html describes an approach to implement a quote tag formatter that recognizes multiple variants of such a specification. I only allow [quote author="John"]…[/quote] in my code.

This fails when the author name contains square brackets, as it is not uncommon on Internet forums (especially in gaming communities):

[quote author="SomeClan][John"]…[/quote]

In this case only SomeClan is recognized as the option value, the first closing square bracket is interpreted as the start tag's closing symbol, and [John"] followed by the text between the quote tags will be the content.

Since the value of the author option is quoted, I expect the parser to honor those quotes and not interpret tag delimiters inside it.
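
The behavior requested here can be sketched as a small scanner that looks for the closing delimiter while skipping anything inside a quoted value. This is an illustration only, not how the library tokenizes; find_tag_end is a hypothetical helper. If a quote is never closed, it falls back to the first closing delimiter, so a stray apostrophe does not swallow the rest of the text:

```python
def find_tag_end(text, start, closer="]"):
    """Return the index of the closer that ends the tag beginning before start.

    Closers inside single- or double-quoted option values are ignored.
    Returns -1 if there is no closer at all.
    """
    quote = None
    for index in range(start, len(text)):
        char = text[index]
        if quote is not None:
            if char == quote:
                quote = None  # the quoted value ends here
        elif char in "\"'":
            quote = char  # a quoted value starts here
        elif char == closer:
            return index
    # Unbalanced quote (or no closer): fall back to the first plain closer.
    return text.find(closer, start)


sample = '[quote author="SomeClan][John"]text[/quote]'
end = find_tag_end(sample, 1)
```

Here sample[:end + 1] is the complete opening tag, including the bracketed author name.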

Invalid BBCode Generates HTML

I use this module in my Django project as below:

##################
# BBCode Parsers #
##################
import bbcode
from django.core.urlresolvers import reverse

# Parser Object
platform_parser = bbcode.Parser(
    install_defaults=False,
    replace_links=False,
    drop_unrecognized=True
)

# Parser Functions
def render_refer_entry_tag(tag_name, value, options, parent, context) -> str:
    return '<a href="{url}">#{pk}</a>'.format(
        url=reverse("entry", kwargs={"pk":value}),
        pk=value
    )

def render_refer_title_tag(tag_name, value, options, parent, context) -> str:
    return '<a href="{url}">{content}</a>'.format(
        url=reverse("title", kwargs={"label": value}),
        content=value
    )

# Parsers
platform_parser.add_formatter(
    "g",
    render_refer_entry_tag,
    strip=True,
    swallow_trailing_newline=True,
    same_tag_closes=True,
    newline_closes=True
)

platform_parser.add_formatter(
    "b",
    render_refer_title_tag,
    strip=True,
    swallow_trailing_newline=True,
    same_tag_closes=True,
    newline_closes=True
)

The following generates unexpected output even though the input is not valid at all:

AssertionError: '<a href="/g/4332/">#4332</a>' != '[g]4332'
- <a href="/g/4332/">#4332</a>
+ [g]4332

I want it to stay the same. I could not find a suitable option for the Parser class in the documentation.

Image tag unsupported

The parser does not appear to support the [img] bbcode tag. Like the other issue I reported, I'll need to code around it.