
Comments (11)

Alir3z4 commented on May 16, 2024

szepeviktor:

I expected [állás: Country Manager](http://thth)
Are accents converted by default?

from html2text.

Alir3z4 commented on May 16, 2024

szepeviktor, it's because html2text likes to see the world as ASCII only and seems to think there is only English with its 26 letters, with some mercy shown to a few other characters.

Well, this needs to be fixed; text should be UTF-8 and encoded too.

In [23]: html2text.html2text(u'<a href="go.com" class="nolink">állás: Country Manager]</a>')
Out[23]: u'[\xe1ll\xe1s: Country Manager]](go.com)\n\n'

or just simply:

In [37]: print html2text.html2text(u'<a href="go.com" class="nolink">állás: Country Manager]</a>')
[állás: Country Manager]](go.com)

I guess we need to encode the text before feeding it to html2text, right?

Alir3z4 commented on May 16, 2024

szepeviktor:

Yes. Only a one-byte HTML entity ends up outside the anchor, not a UTF-8 character.

Alir3z4 commented on May 16, 2024

I guess html2text itself should handle this, that is, encode the input as UTF-8.

Feel free to patch it and make that the default.

theSage21 commented on May 16, 2024

This only happens when the link text begins with a character reference.
<a href="http://thth">&#225;ll&#225;s: Country Manager</a> causes the bug.
<a href="http://thth"> &#225;ll&#225;s: Country Manager</a> translates correctly to
[ állás: Country Manager](http://thth)

This is because the first call goes to handle_charref and only after that to handle_data, and handle_data is the function that adds the '['.
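
The callback order described above can be observed with the standard library's HTMLParser alone. This is a minimal sketch, not html2text's own code: CallbackRecorder and record are hypothetical names introduced for illustration, and the sketch assumes the parser is run with convert_charrefs=False so that character references arrive through handle_charref rather than being merged into handle_data.

```python
from html.parser import HTMLParser


class CallbackRecorder(HTMLParser):
    """Hypothetical helper: records the order in which parser callbacks fire."""

    def __init__(self):
        # convert_charrefs=False keeps character references as separate
        # handle_charref calls instead of folding them into handle_data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_data(self, data):
        self.events.append(("data", data))


def record(html):
    """Parse the given HTML and return the list of recorded callbacks."""
    parser = CallbackRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


# Link text starts with a character reference (the failing case).
failing = record('<a href="http://thth">&#225;ll&#225;s: Country Manager</a>')
# Link text starts with a space (the case that translated correctly).
working = record('<a href="http://thth"> &#225;ll&#225;s: Country Manager</a>')

# First callback after the start tag in each case.
print(failing[1][0])
print(working[1][0])
```

In the first case the callback right after the start tag is a character reference, in the second it is plain data, which matches the explanation of why only the first case loses its opening bracket.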

Fixed in #77

theSage21 commented on May 16, 2024

The issue is solved in #77. Can we close this?

szepeviktor commented on May 16, 2024

Please wait until I get home and can confirm.

theSage21 commented on May 16, 2024

@szepeviktor Sure, sure. It is morning again and I have to sleep. See you on the other side of the sun. 😄

szepeviktor commented on May 16, 2024

There is a problem:

$ echo $LANG
en_US.UTF-8
$ echo '<a href="http://thth" class="nolink" style="text-decoration:none;color:inherit;cursor:default;">&#225;ll&#225;s: Country Manager</a>'| ./html2text
[allas: Country Manager](http://thth)

Shouldn't &#225; be á?

theSage21 commented on May 16, 2024

@szepeviktor By default the command-line tool works with ASCII, so accented characters are converted to their ASCII equivalents. As of now there is no command-line option for Unicode. Try this:

>>> import html2text as h2t
>>> H2T = h2t.HTML2Text()
>>> html = '<a href="http://thth" class="nolink" style="text-decoration:none;color:inherit;cursor:default;">&#225;ll&#225;s: Country Manager</a>'
>>> md_ascii = H2T.handle(html)
>>> H2T.unicode_snob = True
>>> md_unicode = H2T.handle(html)
>>> print(md_ascii)
[allas: Country Manager](http://thth)


>>> print(md_unicode)
[állás: Country Manager](http://thth)


szepeviktor commented on May 16, 2024

Thank you.
Please merge #77.
