alir3z4 / html2text

Convert HTML to Markdown-formatted text.

Home Page: alir3z4.github.io/html2text/

License: GNU General Public License v3.0

Python 62.22% HTML 37.78%

markdown markdown-parser python

html2text's Introduction

html2text

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

Usage: html2text [filename [encoding]]

Option	Description
`--version`	Show program's version number and exit
`-h`, `--help`	Show this help message and exit
`--ignore-links`	Don't include any formatting for links
`--escape-all`	Escape all special characters. Output is less readable, but avoids corner case formatting issues.
`--reference-links`	Use reference-style links instead of inline links to create markdown
`--mark-code`	Mark preformatted and code blocks with [code]...[/code]

For a complete list of options see the docs

Or you can use it from within Python:

>>> import html2text
>>>
>>> print(html2text.html2text("<p><strong>Zed's</strong> dead baby, <em>Zed's</em> dead.</p>"))
**Zed's** dead baby, _Zed's_ dead.

Or with some configuration options:

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
Hello, world!

>>> # Don't Ignore links anymore, I like links
>>> h.ignore_links = False
>>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
Hello, [world](https://www.google.com/earth/)!

Originally written by Aaron Swartz. This code is distributed under the GPLv3.

How to install

html2text is available on PyPI: https://pypi.org/project/html2text/

$ pip install html2text

How to run unit tests

tox

To see the coverage results:

coverage html

then open the ./htmlcov/index.html file in your browser.

Documentation

Documentation lives here

html2text's Issues

Remove py_modules from setup.py

[Enhancement] Automatic version number

I would like to propose a way to automatically load the version info from the package (__init__.py) into setup.py:

import os
import re


def get_version():
    """Read __version__ from html2text/__init__.py without importing the package."""
    init_file = os.path.join("html2text", "__init__.py")
    version_re = re.compile(r"^__version__ = ['\"]([^'\"]*)['\"]")
    # A context manager closes the file handle as soon as the search is done.
    with open(init_file, "rt") as f:
        for line in f:
            mo = version_re.search(line)
            if mo:
                return mo.group(1)
    raise RuntimeError("Unable to find version string in {}".format(init_file))

and then you just do

setup(
    name="html2text",
    version=get_version(),
    ...
)

Adapted from stackoverflow.com

2014.9.7 is broken

It installs from PyPI and the tests are present, but the html2text.py module is not.

Better table support

As Markdown supports tables, it would be nice if we could convert HTML tables properly.

Does not accept standard input when running under python 3

When running under Python 3, html2text does not accept standard input, e.g.:

$ echo '<p>hi</p>' | html2text
Traceback (most recent call last):
  File "/usr/bin/html2text", line 9, in <module>
    load_entry_point('html2text==2014.9.25', 'console_scripts', 'html2text')()
  File "/usr/lib/python3.4/site-packages/html2text/__init__.py", line 1083, in main
    data = data.decode(encoding)
AttributeError: 'str' object has no attribute 'decode'

If you put that input string in a file and pass the file as an argument, it works. It also works under Python 2.

I discovered this on Arch Linux with the standard python-html2text package from the community repo. Python 3 is the default Python on Arch, but I also reproduced the problem on a Debian box.
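
The traceback above suggests that stdin already yields str on Python 3, so calling .decode() on it fails. A minimal sketch of one way to read input that works for both bytes and text streams follows; read_html is a hypothetical helper name, not part of html2text, and the default encoding is an assumption.

```python
import sys


def read_html(stream=None, encoding="utf-8"):
    """Return str from a stream that may yield either bytes or str.

    Hypothetical helper for illustration; not part of html2text.
    """
    stream = sys.stdin if stream is None else stream
    # On Python 3, sys.stdin is a text wrapper; its .buffer attribute exposes
    # the underlying bytes. Streams without .buffer are read directly.
    raw = getattr(stream, "buffer", stream)
    data = raw.read()
    # Decode only when bytes were actually returned.
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return data
```

The isinstance check is the important part: it avoids calling .decode() on a value that is already text.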

Link titles lost.

This is <a href="http://example.com/" title="Title"> an example</a> inline link.

Produces

This is [ an example](http://example.com/) inline link.

when

This is [ an example](http://example.com/ "Title") inline link.

is expected as per Daring Fireball.
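
As a rough illustration of the expected output, an inline link with an optional title could be assembled as below. inline_link is a hypothetical function written for this example and does not reflect how html2text builds links internally.

```python
def inline_link(text, href, title=None):
    """Format a Markdown inline link, appending the title when one is given.

    Hypothetical helper for illustration only.
    """
    if title:
        # Escape double quotes so the title cannot end the quoted string early.
        escaped = title.replace('"', '\\"')
        return '[{}]({} "{}")'.format(text, href, escaped)
    return "[{}]({})".format(text, href)
```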

Very slow if input has long sections without spaces

When running html2text on some HTML with a base64-encoded image in an <img> tag I notice really poor performance of html2text.

I narrowed it down to seeing the problem when a long string appears without spaces (e.g. what you see with a base64 image):

In [30]: html = 'x' * 25000
In [31]: %time c = html2text.html2text(html)
Wall time: 11.7 s

You'll notice that a string twice as long (but with spaces) runs 200x faster:

In [33]: html = 'x ' * 25000
In [34]: %time c = html2text.html2text(html)
Wall time: 48.8 ms

Thoughts?
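
Until the underlying slowdown is addressed, one possible workaround (assuming the long space-free runs come from inline base64 images, as in the report) is to blank out data: URIs in img tags before conversion. The regular expression and the strip_inline_images name are assumptions made for this sketch.

```python
import re

# Matches src="data:..." (or single-quoted) inside an <img> tag.
DATA_URI_SRC = re.compile(
    r'(<img\b[^>]*?\bsrc\s*=\s*)(["\'])data:[^"\']*\2',
    re.IGNORECASE,
)


def strip_inline_images(html):
    """Replace inline data: URIs in <img src> with an empty string.

    Hypothetical pre-processing step; it discards the image payload.
    """
    return DATA_URI_SRC.sub(r"\g<1>\g<2>\g<2>", html)
```

The cleaned string can then be passed to html2text.html2text() as usual. This trades the image data for speed, so it only fits cases where the images are not needed in the output.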

Python 3.5 compatibility

While the full test suite seems to succeed under Python 3.4, 12 tests fail under Python 3.5. I haven't yet figured out what the difference is:

running test
......................FF....FFFF......FFFF.........F.........F..........................
======================================================================
FAIL: test_emdash-para_cmd (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 106, in test_cmd
    self.assertEqual(result, actual)
AssertionError: 'Baco[257 chars]shank--\n\n--irure ex esse id, ham commodo mea[476 chars]\n\n' != 'Baco[257 chars]shank—\n\n—irure ex esse id, ham commodo meatl[474 chars]\n\n'
  Bacon ipsum dolor sit amet pork chop id pork belly ham hock, sed meatloaf eu
  exercitation flank quis veniam officia. Chuck dolor esse, occaecat est elit
  drumstick ground round tri-tip nisi. Eu fugiat drumstick leberkas magna.
- Turducken frankfurter nisi aute shank--
?                                      ^^
+ Turducken frankfurter nisi aute shank—
?                                      ^

- --irure ex esse id, ham commodo meatloaf pig pariatur ut cow. Officia salami
? ^^
+ —irure ex esse id, ham commodo meatloaf pig pariatur ut cow. Officia salami in
? ^                                                                          +++
- in fatback voluptate boudin ullamco beef ribs shank. Duis spare ribs pork
? ---
+ fatback voluptate boudin ullamco beef ribs shank. Duis spare ribs pork chop,
?                                                                       ++++++
- chop, ad leberkas reprehenderit id voluptate salami ham ut in ut cillum
? ------
+ ad leberkas reprehenderit id voluptate salami ham ut in ut cillum turducken.
?                                                                  +++++++++++
- turducken. Nisi ribeye tail capicola dolore andouille. Short ribs id beef
? -----------
+ Nisi ribeye tail capicola dolore andouille. Short ribs id beef ribs, et nulla
?                                                               +++++++++++++++
- ribs, et nulla ground round do sunt dolore. Dolore nisi ullamco veniam sunt.
? ---------------
+ ground round do sunt dolore. Dolore nisi ullamco veniam sunt. Duis brisket
?                                                              +++++++++++++
- Duis brisket drumstick, dolor fatback filet mignon meatloaf laboris tri-tip
? -------------
+ drumstick, dolor fatback filet mignon meatloaf laboris tri-tip speck chuck
?                                                               ++++++++++++
- speck chuck ball tip voluptate ullamco laborum.
? ------------
+ ball tip voluptate ullamco laborum.

  \--



======================================================================
FAIL: test_emdash-para_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: 'Baco[257 chars]shank--\n\n--irure ex esse id, ham commodo mea[476 chars]\n\n' != 'Baco[257 chars]shank—\n\n—irure ex esse id, ham commodo meatl[474 chars]\n\n'
  Bacon ipsum dolor sit amet pork chop id pork belly ham hock, sed meatloaf eu
  exercitation flank quis veniam officia. Chuck dolor esse, occaecat est elit
  drumstick ground round tri-tip nisi. Eu fugiat drumstick leberkas magna.
- Turducken frankfurter nisi aute shank--
?                                      ^^
+ Turducken frankfurter nisi aute shank—
?                                      ^

- --irure ex esse id, ham commodo meatloaf pig pariatur ut cow. Officia salami
? ^^
+ —irure ex esse id, ham commodo meatloaf pig pariatur ut cow. Officia salami in
? ^                                                                          +++
- in fatback voluptate boudin ullamco beef ribs shank. Duis spare ribs pork
? ---
+ fatback voluptate boudin ullamco beef ribs shank. Duis spare ribs pork chop,
?                                                                       ++++++
- chop, ad leberkas reprehenderit id voluptate salami ham ut in ut cillum
? ------
+ ad leberkas reprehenderit id voluptate salami ham ut in ut cillum turducken.
?                                                                  +++++++++++
- turducken. Nisi ribeye tail capicola dolore andouille. Short ribs id beef
? -----------
+ Nisi ribeye tail capicola dolore andouille. Short ribs id beef ribs, et nulla
?                                                               +++++++++++++++
- ribs, et nulla ground round do sunt dolore. Dolore nisi ullamco veniam sunt.
? ---------------
+ ground round do sunt dolore. Dolore nisi ullamco veniam sunt. Duis brisket
?                                                              +++++++++++++
- Duis brisket drumstick, dolor fatback filet mignon meatloaf laboris tri-tip
? -------------
+ drumstick, dolor fatback filet mignon meatloaf laboris tri-tip speck chuck
?                                                               ++++++++++++
- speck chuck ball tip voluptate ullamco laborum.
? ------------
+ ball tip voluptate ullamco laborum.

  \--



======================================================================
FAIL: test_googledocmassdownload_cmd (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 106, in test_cmd
    self.assertEqual(result, actual)
AssertionError: "#  t[237 chars]n**   being\n  3. end  \n  \n**bold**   \n_ita[156 chars]_ \n" != "#  t[237 chars]n**  being\n  3. end  \n  \n**bold**   \n_ital[146 chars]_ \n"
  #  test doc  

  first issue  

    - bit
    - _**bold italic**_ 
      - orange
      - apple
    - final  

  text to separate lists  

    1. now with numbers
    2. the prisoner
      1. not an  _italic number_ 
-     2. a  **bold human**   being
?                           -
+     2. a  **bold human**  being
    3. end  

  **bold**   
  _italic_   

  ` def func(x):`  
- `   if x < 1:`  
?  --
+ ` if x < 1:`  
- `     return 'a'`  
?  ----
+ ` return 'a'`  
- `   return 'b'`  
?  --
+ ` return 'b'`  

- Some  ` fixed width text`  here  
?                           -
+ Some  ` fixed width text` here  
  _` italic fixed width text`_ 


======================================================================
FAIL: test_googledocmassdownload_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: "#  t[237 chars]n**   being\n  3. end  \n  \n**bold**   \n_ita[156 chars]_ \n" != "#  t[237 chars]n**  being\n  3. end  \n  \n**bold**   \n_ital[146 chars]_ \n"
  #  test doc  

  first issue  

    - bit
    - _**bold italic**_ 
      - orange
      - apple
    - final  

  text to separate lists  

    1. now with numbers
    2. the prisoner
      1. not an  _italic number_ 
-     2. a  **bold human**   being
?                           -
+     2. a  **bold human**  being
    3. end  

  **bold**   
  _italic_   

  ` def func(x):`  
- `   if x < 1:`  
?  --
+ ` if x < 1:`  
- `     return 'a'`  
?  ----
+ ` return 'a'`  
- `   return 'b'`  
?  --
+ ` return 'b'`  

- Some  ` fixed width text`  here  
?                           -
+ Some  ` fixed width text` here  
  _` italic fixed width text`_ 


======================================================================
FAIL: test_googledocsaved_cmd (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 106, in test_cmd
    self.assertEqual(result, actual)
AssertionError: "#  t[237 chars]n**   being\n  3. end  \n  \n**bold**   \n_ita[156 chars]_ \n" != "#  t[237 chars]n**  being\n  3. end  \n  \n**bold**   \n_ital[146 chars]_ \n"
  #  test doc  

  first issue  

    - bit
    - _**bold italic**_ 
      - orange
      - apple
    - final  

  text to separate lists  

    1. now with numbers
    2. the prisoner
      1. not an  _italic number_ 
-     2. a  **bold human**   being
?                           -
+     2. a  **bold human**  being
    3. end  

  **bold**   
  _italic_   

  ` def func(x):`  
- `   if x < 1:`  
?  --
+ ` if x < 1:`  
- `     return 'a'`  
?  ----
+ ` return 'a'`  
- `   return 'b'`  
?  --
+ ` return 'b'`  

- Some  ` fixed width text`  here  
?                           -
+ Some  ` fixed width text` here  
  _` italic fixed width text`_ 


======================================================================
FAIL: test_googledocsaved_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: "#  t[237 chars]n**   being\n  3. end  \n  \n**bold**   \n_ita[156 chars]_ \n" != "#  t[237 chars]n**  being\n  3. end  \n  \n**bold**   \n_ital[146 chars]_ \n"
  #  test doc  

  first issue  

    - bit
    - _**bold italic**_ 
      - orange
      - apple
    - final  

  text to separate lists  

    1. now with numbers
    2. the prisoner
      1. not an  _italic number_ 
-     2. a  **bold human**   being
?                           -
+     2. a  **bold human**  being
    3. end  

  **bold**   
  _italic_   

  ` def func(x):`  
- `   if x < 1:`  
?  --
+ ` if x < 1:`  
- `     return 'a'`  
?  ----
+ ` return 'a'`  
- `   return 'b'`  
?  --
+ ` return 'b'`  

- Some  ` fixed width text`  here  
?                           -
+ Some  ` fixed width text` here  
  _` italic fixed width text`_ 


======================================================================
FAIL: test_html-escaping_cmd (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 106, in test_cmd
    self.assertEqual(result, actual)
AssertionError: 'Escaped HTML like &lt;div&gt; or &amp; should remain escape[100 chars]\n\n' != 'Escaped HTML like <div> or & should remain escaped on outpu[90 chars]\n\n'
- Escaped HTML like &lt;div&gt; or &amp; should remain escaped on output
?                   ^^^^   ^^^^     ----
+ Escaped HTML like <div> or & should remain escaped on output
?                   ^   ^



      ...unless that escaped HTML is in a <pre> tag

  `...or a <code> tag`



======================================================================
FAIL: test_html-escaping_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: 'Escaped HTML like &lt;div&gt; or &amp; should remain escape[100 chars]\n\n' != 'Escaped HTML like <div> or & should remain escaped on outpu[90 chars]\n\n'
- Escaped HTML like &lt;div&gt; or &amp; should remain escaped on output
?                   ^^^^   ^^^^     ----
+ Escaped HTML like <div> or & should remain escaped on output
?                   ^   ^



      ...unless that escaped HTML is in a <pre> tag

  `...or a <code> tag`



======================================================================
FAIL: test_html_entities_out_of_text_cmd (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 106, in test_cmd
    self.assertEqual(result, actual)
AssertionError: '[allas: Country Manager](http://thth)\n\n' != '[állás: Country Manager](http://thth)\n\n'
- [allas: Country Manager](http://thth)
?  ^  ^
+ [állás: Country Manager](http://thth)
?  ^  ^



======================================================================
FAIL: test_html_entities_out_of_text_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: '[allas: Country Manager](http://thth)\n\n' != '[állás: Country Manager](http://thth)\n\n'
- [allas: Country Manager](http://thth)
?  ^  ^
+ [állás: Country Manager](http://thth)
?  ^  ^



======================================================================
FAIL: test_invalid_unicode_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: 'Br\n\n' != 'B�r\n\n'
- Br
+ B�r
?  +



======================================================================
FAIL: test_nbsp_unicode_mod (test.test_html2text.TestHTML2Text)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/barry/projects/debian/html2text/upstream/test/test_html2text.py", line 99, in test_mod
    self.assertEqual(result, actual)
AssertionError: '# NB[182 chars]ed do\xa0eiusmod\ntempor incididunt ut\xa0labo[385 chars]\n\n' != '# NB[182 chars]ed do eiusmod\ntempor incididunt ut labore et [349 chars]\n\n'
  # NBSP handling test #2

  In this test all NBSPs will be replaced with unicode non-breaking spaces
  (unicode_snob = True).

- Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
?                                                                 ^
+ Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
?                                                                 ^
- tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
?                     ^         ^                       ^       ^
+ tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
?                     ^         ^                       ^       ^
- quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
?                                                  ^          ^
+ quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
?                                                  ^          ^
  consequat.

- Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
?                         ^                ^
+ Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore
?                         ^                ^
- eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
?   ^
+ eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt
?   ^
- in culpa qui officia deserunt mollit anim id est laborum.
?   ^                                         ^
+ in culpa qui officia deserunt mollit anim id est laborum.
?   ^                                         ^



----------------------------------------------------------------------
Ran 88 tests in 4.929s

FAILED (failures=12)
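
One documented difference between the two versions is that html.parser.HTMLParser changed the default of convert_charrefs to True in Python 3.5. Most of the failures above involve entities or character references, so this may be related, but whether it explains all 12 failures is an assumption. The sketch below only shows how the parser callbacks differ between the two settings; Recorder and record are names made up for the example.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Records which callbacks fire for text, entities and char references."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def record(html, convert_charrefs):
    """Feed html to a Recorder and return the list of recorded events."""
    parser = Recorder(convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.events
```

With convert_charrefs=False the parser reports entities through handle_entityref and handle_charref; with True they arrive already decoded inside handle_data, so code that relies on the entity callbacks never sees them.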

i18n

Since not all of the world is speaking English, having i18n (or as @svetlemodry said internationalization) would be really good.

Right now, https://github.com/Alir3z4/html2text/blob/master/html2text/cli.py is the only place to start working on its translation.

Python itself handles gettext and translations well; the only thing that remains is deciding how the app is told which translation to run in.

The selected translation could be tracked in one of these ways:

1. A configuration file under ~/.config/html2text
2. An environment variable
3. Passing the translation to the CLI at runtime

So let's talk about those options:

1. Keeping app configuration under ~/.config/html2text is standard on GNU/Linux and other *nix machines. As the CLI grows, we can easily expand and reuse that configuration file. The downside is a slower startup, since the CLI has to read a file from the user's configuration directory.
2. An environment variable is a good fit if we are not going to keep many configuration files; something like HTML2TEXT_LANG=miow can be enough, and it is easy to pick up when the program runs.
3. I'm not a fan of this one, no, never! The program should know the translation without the language being declared explicitly.
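
The environment variable option could look roughly like the sketch below. It uses the standard gettext module; the HTML2TEXT_LANG variable comes from the discussion above, while load_translation, the "html2text" domain and the "locale" directory are assumptions made for the example.

```python
import gettext
import os


def load_translation(environ=None, domain="html2text", localedir="locale"):
    """Return a gettext translation chosen via HTML2TEXT_LANG.

    Hypothetical helper; domain and localedir are placeholder values.
    """
    environ = os.environ if environ is None else environ
    lang = environ.get("HTML2TEXT_LANG")
    # languages=None lets gettext fall back to LANGUAGE, LC_ALL, LC_MESSAGES, LANG.
    languages = [lang] if lang else None
    # fallback=True returns a pass-through translation when no catalog is found.
    return gettext.translation(
        domain, localedir=localedir, languages=languages, fallback=True
    )
```

Because of fallback=True, an unknown language such as "miow" simply yields the untranslated strings instead of raising an error.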

Use kramdown/Markdown Extra/MMD/pandoc Definition List Format

@lupiter: Currently definition lists are formatted as

The DT item
    The DD item
    The other DD item

Would it be possible to use MD Extra format instead?

The DT item
:    The DD item
:    The other DD item

I believe this would only be a teeny tiny change on line 564. I'd write a pull request but I'm 'not allowed' to write GPLv3 code :(

@mcepl: I vote -1 on this. I would like to keep html2text only to the strictest John Gruber Markdown. If we step into the endless swamp of Markdown variants we are in my opinion doomed.

@lupiter: @mcepl Then why convert it into anything at all? Why not just strip the tags? Since you're already picking something, it would be nice to pick something that has at least one way of converting it back to html. All the variants I can find that do support definition lists use the same syntax.

@Alir3z4: I'm all for this, I mean

it would be nice to pick something that has at least one way of converting it back to html.

Is pure truth.

@mcepl: @Alir3z4 are you really sure you want to get into http://johnmacfarlane.net/babelmark2/ and http://johnmacfarlane.net/babelmark2/faq.html (just a brief intro to the unlimited number of Markdown variants)? I certainly don't.

@Alir3z4: It's not about "Markdown variants" here, thing is again:

it would be nice to pick something that has at least one way of converting it back to html.

If our output can't be converted back to HTML (or at least close to it), then I'd say we lost the text formatting in the first place.

@mcepl: How does the inability to make a round trip (which would be the real bug, if we could not turn our output back into HTML via markdown2 (https://pypi.python.org/pypi/Markdown/) with no extensions) have anything to do with support for MD Extra?

@Alir3z4: You mean the output of html2text which contains def list can be converted to html with no ext at all ?

I know there's a ext for markdown(def_list) for this purpose specifically with syntax:

The DT item
:    The DD item
:    The other DD item

@mcepl: AAaaaaah ... I really believed John Gruber's Markdown has <dl> equivalent. It apparently doesn't. OK, then if (and that's a big IF, I would still give up on it) we want to support some extension to The Markdown, why MD Extra?

@Alir3z4: Don't laugh at me please, but what is MD Extra referring to here?

@Alir3z4: As a matter of fact, I'm fine with any implementation as long as "The Markdown" supports it somehow. For <dl>, Python Markdown comes with the def_list extension, which works as expected for this feature request.

@mcepl: "Would it be possible to use MD Extra format instead?", from the original description of this ticket.

@Alir3z4: @lupiter Please enlighten @mcepl on

Would it be possible to use MD Extra format instead?

I'm interested too.

@mcepl: Most likely https://en.wikipedia.org/wiki/Markdown_Extra ... some kind of PHP dialect of MD. One of a zillion. Which is exactly my point. If we want to desert plain Gruber MD, I would probably go all the way to rST, but I think it is too much work.

@lupiter: My apologies, I did mean PHP Markdown Extra. I believe that was the first variant to implement definition lists, and that was then adopted by other variants which support definition lists such as pandoc, kramdown and MultiMarkdown. I believe the GitHub and StackOverflow markdown variants do not have any support for definition lists.

If you'd like to keep to Gruber markdown (ignoring the ambiguities you pointed out in John MacFarlane's excellent collection) then just strip the tags.

If you are happy to do more than that (as is currently the case), then at least use something that can be converted back into html.

Of the variants covered by MacFarlane, the following support no definition list format:

Showdown
Markdown.pl
RedCarpet
PHP Markdown
Parsedown
Marked
cheapskate
Haskell markdown
Blackfriday

The following use the Markdown Extra format:

PHP Markdown Extra
Parsedown Extra
pandoc
lunamark
RDiscount
Python-Markdown
Minima
Maruku

The following use a different format:

Fatdown (via BB Codes, uses [dl] [dd] [dt] etc)

@mcepl: I hope there is some light in this (and other similar) tunnel ... there is now an initiative by major users of Markdown (including but not limited to GitHub) to unify Markdowns.

Note: Extracted from an old repo.
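
For reference, the Markdown Extra definition list format requested in this thread can be produced with very little code. The sketch below is a standalone illustration, not the html2text implementation; format_definition_list and its input shape are assumptions.

```python
def format_definition_list(items, marker=":    "):
    """Render (term, [definitions]) pairs in Markdown Extra definition list form.

    Hypothetical helper for illustration only.
    """
    lines = []
    for term, definitions in items:
        lines.append(term)
        for definition in definitions:
            # Each definition goes on its own line, prefixed with a colon marker.
            lines.append(marker + definition)
        # A blank line separates consecutive terms.
        lines.append("")
    return "\n".join(lines).rstrip("\n") + "\n"
```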

HTML entities get out of anchor text

@szepeviktor:

another strange behaviour

<a href="http://thth" class="nolink" style="text-decoration:none;color:inherit;cursor:default;">&#225;ll&#225;s: Country Manager</a>

a[llas: Country Manager](http://thth)

Remove meta vars from html2text.py file header

Such as copyright, author, license and some others.
This information is already provided in separate files such as COPYING, AUTHORS.rst, etc.

Make all available options command line flags?

A lot of the options html2text supports are not available via the command line.

Making them available as command line flags should be nice for a lot of people.

@Alir3z4 Thoughts?
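
One way this could be approached is a single table that maps flags to converter attributes, so adding an option means adding one entry. The sketch below uses argparse and only the ignore_links and body_width attributes mentioned elsewhere on this page; OPTION_FLAGS, build_parser, apply_options and the --body-width flag name are hypothetical and not taken from the actual cli.py.

```python
import argparse

# Hypothetical mapping: flag -> (attribute name, argparse keyword arguments).
# A default of None means "not given, keep the library default".
OPTION_FLAGS = {
    "--ignore-links": ("ignore_links", {"action": "store_true", "default": None}),
    "--body-width": ("body_width", {"type": int, "default": None}),
}


def build_parser():
    """Create a parser with one flag per entry in OPTION_FLAGS."""
    parser = argparse.ArgumentParser(prog="html2text")
    for flag, (dest, kwargs) in OPTION_FLAGS.items():
        parser.add_argument(flag, dest=dest, **kwargs)
    return parser


def apply_options(converter, namespace):
    """Copy every explicitly supplied option onto the converter object."""
    for dest, value in vars(namespace).items():
        if value is not None:
            setattr(converter, dest, value)
    return converter
```

In use, apply_options(html2text.HTML2Text(), build_parser().parse_args()) would configure a converter from the command line while leaving unspecified options at their defaults.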

Malformed output of links

>>> import html2text
>>> h = html2text.HTML2Text()
>>> h.handle('<a href="http://www.test.com">http://www.test.com</a>')  #  this fails
u'<http://www.test.com>\n\n'
>>> h.handle('<a href="http://www.test.com/">http://www.test.com</a>')  # adding slash works
u'[http://www.test.com](http://www.test.com/)\n\n'

while this works as expected

>>> h.handle('<a href="http://www.test.com/">test</a>')
u'[test](http://www.test.com/)\n\n'
>>> h.handle('<a href="http://www.test.com">test</a>')
u'[test](http://www.test.com)\n\n'

Wraps long URLs

Forwarding aaronsw/html2text#7, so it doesn't get forgotten:
Forwarding http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616090:

Long URLs are wrapped, which they probably shouldn't be.

Example:

<html>
<head><title>Test</title></head>
<body>
<p>And <a href="http://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=multiarch;[email protected]">here</a> is a long link I had at hand.</p>
</body>
</html>

Results in:

And [here][1] is a long link I had at hand.

   [1]: http://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=multiarch;users
[email protected]
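
A sketch of wrapping that leaves reference definitions and long URLs intact is shown below. It relies on textwrap from the standard library; wrap_markdown and the REFERENCE_LINE pattern are assumptions for illustration and do not describe html2text's own wrapping code.

```python
import re
import textwrap

# Lines such as "   [1]: http://..." are reference definitions.
REFERENCE_LINE = re.compile(r"^\s*\[\d+\]:\s")


def wrap_markdown(text, width=78):
    """Wrap long lines but never break reference definitions or long words."""
    out = []
    for line in text.split("\n"):
        if REFERENCE_LINE.match(line) or len(line) <= width:
            # Keep short lines and reference definitions exactly as they are.
            out.append(line)
        else:
            out.extend(
                textwrap.wrap(
                    line,
                    width,
                    break_long_words=False,  # never split a URL in the middle
                    break_on_hyphens=False,  # hyphens inside URLs are not break points
                )
            )
    return "\n".join(out)
```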

Nested code, anchor bug

There's a bug that crops up when anchor text is code:

>>> import html2text
>>> h = html2text.HTML2Text()
>>> h.handle("<a href='http://www.google.com'><code>google</code></a>")
u'`[google`](http://www.google.com)\n\n'

Should return:

u'[`google`](http://www.google.com)\n\n'

(Looks related to issue 24.)

html2text fails on daringfireball

I couldn't figure out how to resolve this myself.

Installed on Cygwin with Python 2.7.
html2text seems to work on many sites:

html2text https://michelf.ca/projects/php-markdown/extra

But not on Gruber's standard syntax:

html2text http://daringfireball.net/projects/markdown/syntax

The reason for extracting syntax from daringfireball is the gray-background pages are so hard to read.

Nice application; I look forward to studying the html2text python code to learn.

Backslash getting inserted before multiple dashes

Anyone have an explanation or fix for this? It seems like a bug to me.

I'm running version 2014.12.5

In [11]: print html2text.html2text('-').strip()
-

In [12]: print html2text.html2text('--').strip()
\--

In [13]: print html2text.html2text('------').strip()
\------

Bring python version compatibility to separate module

And name it compat.py

Add AUTHORS file

Body width not working with lists

I use the body_width property to limit the width of the text but it seems not to be working for lists.
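
For comparison, this is roughly what width-limited list items could look like. The sketch wraps a single item with a hanging indent using textwrap; wrap_list_item and the bullet string are assumptions, and the code is independent of how html2text handles body_width.

```python
import textwrap


def wrap_list_item(text, width, bullet="  * "):
    """Wrap one list item so continuation lines align under the item text."""
    return textwrap.fill(
        text,
        width=width,
        initial_indent=bullet,
        # Continuation lines are indented by the width of the bullet.
        subsequent_indent=" " * len(bullet),
    )
```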

Why html2text module throws UnicodeDecodeError?

I have a problem with the html2text module; it raises UnicodeDecodeError:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)

Example :

#!/usr/bin/python
# -*- coding: utf-8 -*-
import html2text
import urllib

h = html2text.HTML2Text()
h.ignore_links = True

html = urllib.urlopen( "http://google.com" ).read()

print h.handle( html )

I have also tried h.handle(unicode(html, "utf-8")) with no success. Any help is appreciated.

EDIT:

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    print h.handle(html)
  File "/home/alex/Desktop/html2text-master/html2text.py", line 254, in handle
    return self.optwrap(self.close())
  File "/home/alex/Desktop/html2text-master/html2text.py", line 266, in close
    self.outtext = self.outtext.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbe in position 6: ordinal not in range(128)

Source: http://stackoverflow.com/q/24755022/636136
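
The traceback points at bytes and text being mixed when the output is joined. A common way to avoid that is to decode the downloaded bytes before passing them to handle(). The sketch below shows only the decoding step; to_text is a hypothetical helper, and using UTF-8 with replacement characters is an assumption that will not suit every page.

```python
def to_text(raw, encoding="utf-8"):
    """Return str for bytes or str input, replacing undecodable bytes.

    Hypothetical helper; the right encoding depends on the page being fetched.
    """
    if isinstance(raw, bytes):
        # errors="replace" substitutes U+FFFD instead of raising UnicodeDecodeError.
        return raw.decode(encoding, errors="replace")
    return raw
```

On Python 3 the page would be fetched with urllib.request.urlopen(...).read(), which returns bytes, and the result of to_text() can then be given to h.handle().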

Remove `How to do a release` section from README.md

I don't think people need to do that usually, I see also most of the upstream changes have the version number updated.

Document features and capability of project in README

Currently it's not clear what kind of features html2text comes with.

At the moment it can handle:

General [X]HTML tags, etc.
Unicode
Packaged with a command line utility
HTML2Text class to be used in other apps.
...

img link in 'a href'

@dvasseur The following html code

<a href="http://google.com"><img src="images/google.png" alt="Go to google"></a>

is converted to [Go to google](images/google.png)
I would have preferred [Go to google](http://google.com)

Can this be done?

@szepeviktor: This is so in Markdown:

[![Go to google](images/google.png)](http://google.com)

@Alir3z4 Hmm, this is interesting actually:

In [2]: html2text.html2text('<a href="http://google.com"><img src="images/google.png" alt="Go to google"></a>')
Out[2]: u'![Go to google](images/google.png)\n\n'

Sure it has its own downside, but I would like to have base hostname in the image src too.

@szepeviktor: So the <a> element is completely stripped out?

@Alir3z4: That's another issue here and yes the a tag is gone here

@mcepl: Just for reference http://daringfireball.net/projects/markdown/syntax#img

Note: Extracted from an old repo.
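
As a rough illustration of the output discussed in this thread, the sketch below builds the nested image-inside-link form and optionally resolves relative URLs against a base. `linked_image` and its `base_url` parameter are hypothetical names for this example and are not taken from html2text.

```python
from urllib.parse import urljoin


def linked_image(alt, src, href, base_url=""):
    """Return Markdown for an image wrapped in a link.

    With an empty base_url the URLs are left unchanged.
    """
    image = f"![{alt}]({urljoin(base_url, src)})"
    return f"[{image}]({urljoin(base_url, href)})"


if __name__ == "__main__":
    print(linked_image("Go to google", "images/google.png", "http://google.com"))
    print(linked_image("Go to google", "images/google.png",
                       "http://google.com", base_url="http://google.com/"))
```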

Coverage not detecting cmd line test execution

The coverage module is not detecting the command line tests due to them being carried out
in subprocesses.

Am I right about this? Nedbat shows how to do it, but I am unable to implement it.

Fixing this would increase code coverage a lot as the cli.py file currently stands at 0% coverage.

I do not have experience with coverage.py and so it would take me a long time to fix.
@Alir3z4 @mcepl if you can fix this please do.
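
For reference, coverage.py's documented route for measuring subprocesses is to set the `COVERAGE_PROCESS_START` environment variable to the config file and to have each child interpreter call `coverage.process_startup()` at startup (for example from a `sitecustomize.py` or a `.pth` file), with `parallel = True` in the config and `coverage combine` afterwards. The sketch below only prepares the environment for the child process; `coverage_subprocess_env` is a hypothetical helper and the command in the comment is a placeholder.

```python
import os


def coverage_subprocess_env(config_file=".coveragerc", base_env=None):
    """Return an environment dict that asks coverage.py to measure a child.

    The child still needs `import coverage; coverage.process_startup()`
    to run at interpreter startup (sitecustomize.py or a .pth file).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["COVERAGE_PROCESS_START"] = os.path.abspath(config_file)
    return env


# Usage in a test (placeholder command, adjust to the real CLI entry point):
#   subprocess.check_output([sys.executable, "path/to/cli.py", "in.html"],
#                           env=coverage_subprocess_env())
```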

Extract cli to separate module

2014.4.5 tarballs on PyPI lack test data files test/*.{md,html} ?

Comparing
https://github.com/Alir3z4/html2text/archive/2014.4.5.tar.gz
with
https://pypi.python.org/packages/source/h/html2text/html2text-2014.4.5.tar.gz
the former does contain files test/*.{md,html} while the latter does not.

So the PyPI release test suite is running 0 tests.
Is that intended? Otherwise please fix :)
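
A quick way to check a tarball for the missing files is to list its members and filter for the test data patterns. The sketch below assumes the layout `<name>-<version>/test/*.{md,html}` described above; `find_test_data` is a hypothetical helper.

```python
import fnmatch


def find_test_data(member_names, patterns=("*/test/*.md", "*/test/*.html")):
    """Return the archive members that look like test data files."""
    return sorted(
        name
        for name in member_names
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
    )


# Usage against a downloaded sdist (the path is a placeholder):
#   import tarfile
#   with tarfile.open("html2text-2014.4.5.tar.gz") as tar:
#       print(len(find_test_data(tar.getnames())), "test data files")
```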

Release 2014.9.8 only on PyPI?

It seems the releases on GitHub are out of sync with PyPI:

Latest 2014.9.8
https://pypi.python.org/pypi/html2text

Latest 2014.7.3
https://github.com/Alir3z4/html2text/releases
https://github.com/html2text/html2text/releases

Please get them back in sync. Thanks!

Two extra spaces added before line breaks

Whitespace in HTML should be ignored, but the parser seems to add extra spaces before a newline when translating a <br>.

>>> import html2text
>>> html2text.__version__
(2015, 6, 21)
>>> p = html2text.HTML2Text(bodywidth=0)
>>> p.handle('foo<br>bar')
u'foo  \nbar\n'

In addition to the two spaces added, it preserves one extra space, if present. (Though that seems sane to me, to treat any length of whitespace in html as a single space in text.)

>>> p.handle('foo   <br>    bar')
u'foo   \nbar\n'

It is a low impact problem in practice, but it makes automated testing a bit of a hassle.

Is there a reason for the extra spaces? The parser is very explicit on the replace:

        if tag == "br" and start:
            self.o("  \n")
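
For context, two trailing spaces before a newline is how Markdown marks a hard line break, which is likely why the parser emits them for `<br>`. If the trailing spaces only get in the way of test comparisons, one workaround is to normalise both sides before comparing. The helper below is a sketch with an assumed name; it discards the hard-break information, so it is meant for assertions only.

```python
import re


def strip_line_end_spaces(markdown):
    """Remove spaces and tabs that directly precede a newline."""
    return re.sub(r"[ \t]+(?=\n)", "", markdown)


if __name__ == "__main__":
    print(repr(strip_line_end_spaces("foo  \nbar\n")))
```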

html2text throws ValueError on invalid unicode entities

Using html2text 2015.4.14, python 2.7.9 on Ubuntu 15.06, x86_64:

ipdb> p html2text.html2text('B&#3291685;r')
*** ValueError: ValueError('unichr() arg not in range(0x110000) (wide Python build)',)
ipdb> p html2text.html2text(u'B&#3291685;r')
*** ValueError: ValueError('unichr() arg not in range(0x110000) (wide Python build)',)

Possibly addressed by PR #60?
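
One way to avoid the exception is to guard the conversion and fall back to the literal entity text when the code point is out of range. The sketch below is written for Python 3 (`chr` instead of `unichr`); `charref_to_text` is a hypothetical helper, not the function html2text uses.

```python
def charref_to_text(ref):
    """Convert a numeric character reference body such as '#233' or '#xE9'.

    Returns the original entity text when the reference is malformed or
    outside the Unicode range.
    """
    try:
        if ref[1] in "xX":
            code = int(ref[2:], 16)
        else:
            code = int(ref[1:])
        return chr(code)
    except (IndexError, ValueError, OverflowError):
        return "&" + ref + ";"


if __name__ == "__main__":
    print("B" + charref_to_text("#3291685") + "r")
```

In Python 3 the standard library's `html.unescape` takes a different approach and substitutes U+FFFD for out-of-range references.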

Make html2text a package

Bring html2text to a separate module and take out the conf/constant variables

Consider some upstream pull requests

Ordering issue with emphasis output

html2text turns input of the form

<i>weird </i>whitespace

into output of form

_weird _whitespace

This isn't valid Markdown (or rather, it just leaves the underscores literal rather than translating them back); it seems reasonable that the output should instead be of the form

_weird_ whitespace

Similarly, input like

<i>inside</i>word

turns into

_inside_word

also not valid. I think the best usage here is to leave it literal, which is valid Markdown and readable.
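
The first case could be handled by moving leading and trailing whitespace outside the emphasis markers. This is a minimal sketch of that idea with an assumed helper name; it does not address the intra-word case.

```python
def emphasize(text, marker="_"):
    """Wrap text in emphasis markers, keeping edge whitespace outside them."""
    core = text.strip()
    if not core:
        # Nothing to emphasize; return the whitespace unchanged.
        return text
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    return f"{lead}{marker}{core}{marker}{trail}"


if __name__ == "__main__":
    print(emphasize("weird ") + "whitespace")
```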

Fix AUTHORS.rst formatting

Extract utility/helper methods to `utils` module

Remove install_deps.py

And replace it with a better solution

Add ChangeLog file

Allow selecting decode errors behaviour

Forwarded from https://bugs.launchpad.net/ubuntu/+source/python-html2text/+bug/1318227

Currently it stops conversion on any decode error:

$ html2markdown broken_text
Traceback (most recent call last):
  File "/usr/bin/html2markdown", line 9, in <module>
    load_entry_point('html2text==3.200.3', 'console_scripts', 'html2text')()
  File "/usr/lib/python3/dist-packages/html2text.py", line 781, in main
    data = data.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 4: invalid start byte

But for the files I'm working on it would be perfectly fine just to add

data = data.decode(encoding, errors='ignore')

It can be exposed as an option.
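
A sketch of how such an option might be wired up, assuming an `argparse`-based command line. The flag name `--decode-errors`, the program name, and both helper functions are assumptions for illustration; the values map directly onto the `errors` argument of `bytes.decode`.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical --decode-errors option."""
    parser = argparse.ArgumentParser(prog="html2text-sketch")
    parser.add_argument(
        "--decode-errors",
        choices=("strict", "ignore", "replace"),
        default="strict",
        help="what to do when the input cannot be decoded",
    )
    return parser


def decode_input(data, encoding="utf-8", errors="strict"):
    """Decode input bytes using the selected error handler."""
    return data.decode(encoding, errors=errors)


if __name__ == "__main__":
    args = build_parser().parse_args(["--decode-errors", "ignore"])
    print(decode_input(b"abc\x91def", errors=args.decode_errors))
```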

File html2text.py (for module html2text) not found

When I try to install html2text, 15 of the 34 tests are failing.

When setup.py is running, the following message is thrown
file html2text.py (for module html2text) not found

This is most probably the reason the tests are failing. Are there any solutions to this problem?

UnicodeDecode Error

r = requests.get("http://en.wikipedia.org/wiki/Python_%28programming_language%29")
print html2text.html2text(r.content)

Traceback (most recent call last):
  File "", line 1, in
  File "html2text/__init__.py", line 750, in html2text
    return h.handle(html)
  File "html2text/__init__.py", line 121, in handle
    return self.optwrap(self.close())
  File "html2text/__init__.py", line 139, in close
    outtext = nochr.join(self.outtextlist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 62: ordinal not in range(128)

3.200.3 vs 2014.7.3 output quirks

Just upgraded from 3.200.3 to 2014.7.3 and noticed the following things:

Bold text inside links

<a href="link.htm"><b>Text</b></a>

Before: [**Text**](link.htm)
After: **[Text**](link.htm) (to me this looks incorrect)

Image links

<a href="images/image.jpg"><img alt="Title" src="images/thumbnails/image.jpg"></img></a>

Before: [![Title](images/thumbnails/image.jpg)](images/image.jpg)
After: ![Title](images/thumbnails/image.jpg)

Literal links

Links like this [http://example.com](http://example.com) now look like this <http://example.com>. Is that valid markdown?

Escapes

A lot of unnecessary escapes: \--, 1\.

Downgraded back to 3.200.3

Remove website address

http://www.aaronsw.com/2002/html2text/ should be removed from the project and replaced with something else:

Github generated website (from readme file)
or nothing

Turn readme file from Markdown to reStructuredText

Spaces inside and around links are concatenated

When there's a space before a link and before the link's content, both are preserved:

~$ python
Python 2.7.9 (default, Mar  1 2015, 12:57:24) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import html2text
>>> h = html2text.HTML2Text()
>>> h.handle('foo <a href="/"> bar</a>')
u'foo [ bar](/)\n\n'
>>> h.ignore_links = True
>>> h.handle('foo <a href="/"> bar</a>')
u'foo  bar\n\n'
>>> html2text.__version__
'2014.12.29'

Browsers strip the text inside the link:

foo bar

This is html2text installed from pip, on Debian jessie.
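
One possible way to get the browser-like result is to collapse whitespace inside the link text and keep at most one separating space before the link. The helper below is a sketch under that assumption; `append_link` is a hypothetical name, and trailing whitespace inside the link text is simply dropped.

```python
def append_link(preceding, text, href):
    """Append a Markdown link to `preceding` without doubling spaces."""
    label = " ".join(text.split())  # collapse and trim whitespace
    if not label:
        return preceding
    separator = ""
    if text[0].isspace() and preceding and not preceding[-1].isspace():
        # The only space between the words was inside the link: keep one.
        separator = " "
    return f"{preceding}{separator}[{label}]({href})"


if __name__ == "__main__":
    print(append_link("foo ", " bar", "/"))
```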

Memory Leak in html2text 2014.4.5

I recently added html2text 2014.4.5 to my project and have been using it to convert HTML generated from Jinja2 templates into text. I attach the HTML and the text version of said HTML to emails constructed using the standard email.mime classes.

I added html2text amidst some other changes, so it took me a little time to trace the source of a memory leak that had started occurring back to html2text:

(Pdb) problem
Partition of a set of 2 objects. Total size = 38380808 bytes.
 Index  Count   %     Size   % Cumulative  % Referrers by Kind (class / dict of class)
     0      2 100 38380808 100  38380808 100 dict of html2text.HTML2Text
(Pdb) problem.byclodo
Partition of a set of 2 objects. Total size = 38380808 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      2 100 38380808 100  38380808 100 unicode
(Pdb) problem.byid
Set of 2 <unicode> objects. Total size = 38380808 bytes.
 Index     Size   %   Cumulative  %   Representation (limited)
     0 38380600 100.0  38380600 100.0 u'Hi Mar... \n\n \n'
     1      208   0.0  38380808 100.0 u'<p sty... 100%;">'
(Pdb) problem.byvia
Partition of a set of 2 objects. Total size = 38380808 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1  50 38380600 100  38380600 100 "['outtext']"
     1      1  50      208   0  38380808 100 "['_HTMLParser__starttag_text']"
(Pdb) leftover
Partition of a set of 419 objects. Total size = 38480944 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0     26   6 38384680 100  38384680 100 unicode
     1     93  22    39096   0  38423776 100 dict (no owner)
     2     43  10    23560   0  38447336 100 dict of guppy.etc.Glue.Interface
     3      8   2     8384   0  38455720 100 dict of guppy.etc.Glue.Share
     4     22   5     6160   0  38461880 100 dict of guppy.etc.Glue.Owner
     5    100  24     5232   0  38467112 100 str
     6     23   5     3128   0  38470240 100 list
     7     43  10     2752   0  38472992 100 guppy.etc.Glue.Interface
     8     22   5     1584   0  38474576 100 guppy.etc.Glue.Owner
     9      1   0     1048   0  38475624 100 dict of guppy.heapy.Classifiers.ByUnity
<17 more rows. Type e.g. '_.more' to view.>
(Pdb)

The above information was captured using heapy component of Guppy-PE after following information detailed in http://python.dzone.com/articles/diagnosing-memory-leaks-python and http://www.smira.ru/wp-content/uploads/2011/08/heapy.html

As you can see, the contents of ['outtext'] are huge and based on inspection of the data itself (See last file referenced below) basically consist of the same text repeated over and over. This would seem to indicate some kind of looping error.

I'm not sure if it is relevant to this issue but every now and then when using html2text it fails after reaching line 360 of /usr/lib64/python2.7/HTMLParser.py:
raise AssertionError("we should not get here!")

On a final note, I have replicated both of these issues using both Python 2.7.5 64-bit and PyPy 2.3.0 64-bit.

For your reference as to the context, please see the following pastebin links:

send_mail: http://pastebin.com/hHHh1fUN
email_tasks.py (used by send_mail): http://pastebin.com/Eqcsk23X
email_template: http://pastebin.com/XWS4VreU
base_email_template (used by email_template): http://pastebin.com/X7GfT1LJ
contents of ['outtext']: http://www.mediafire.com/view/6uoj861r59oxme9/problemcontents.txt

I have not done any investigation yet into the exact cause of this issue with html2text, although I hope to do so tomorrow.

For now, hopefully this information will prove useful in determining the source of the issue.

Do not install tests

python2.7 setup.py install needlessly installs test directory as /usr/lib/python2.7/site-packages/test.

Inclusion of test directory in source tarball is already provided by recursive-include test *.html *.md *.py line in MANIFEST.in.

--- setup.py
+++ setup.py
@@ -66,7 +66,7 @@
     """,
     license='GNU GPL 3',
     requires=requires_list,
-    packages=find_packages(),
+    packages=find_packages(exclude=['test']),
     include_package_data=True,
     zip_safe=False,
 )

pypin time

install instructions out of date

The install instructions say to use

$ pip install html2text

but that only gets Aaron's old version:
$ html2markdown --version
html2markdown 3.200.3

Suggestion: Remove instructions from README.md ?

Don't append newlines inside a span

The upstream pull request, aaronsw/html2text#87, has conflicts with the code base.