Wraps long URLs (html2text, closed)

alir3z4 commented on May 23, 2024
Wraps long URLs

from html2text.

Comments (13)

Alir3z4 commented on May 23, 2024

I'm labeling this as a bug.

@stefanor Thanks for following up on this. Please feel free to open or forward issues and bugs from
https://github.com/aaronsw/html2text/.

jacobsvante commented on May 23, 2024

It took me a while to find out that html2text was indirectly causing the %0A (a URL-encoded newline) that showed up in my links. I've temporarily disabled body_width wrapping in my code to prevent it.

from html2text import HTML2Text

parser = HTML2Text()
parser.body_width = 0  # 0 disables line wrapping, so long URLs stay on one line
parser.handle(value)
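
To illustrate where the %0A comes from: if a URL is broken across two lines and the result is later percent-encoded, the embedded newline is encoded as %0A. This is a minimal sketch using only the standard library; the URL and the position of the line break are made up for illustration.

```python
from urllib.parse import quote

# Hypothetical link that was wrapped in the middle by a line-width limit.
wrapped_url = "https://example.org/some/very/long/\npath/to/page"

# Percent-encoding keeps ":" and "/" but turns the newline into %0A.
encoded = quote(wrapped_url, safe=":/")
print(encoded)
```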

Alir3z4 commented on May 23, 2024

Hey @jmagnusson @stefanor
I see these two hacks trying to fix the issue:

Do you think we can apply the same in html2text without explicitly setting body_width=0?

theSage21 commented on May 23, 2024

@stefanor does this fix help?

stefanor commented on May 23, 2024

Yeah, combined with --reference-links that seems to do the right thing.

theSage21 commented on May 23, 2024

@stefanor @Alir3z4 Consider closed?

Alir3z4 commented on May 23, 2024

I'm going to close this then, thanks for your awesome collaboration on this ;)

nguyenl95 commented on May 23, 2024

This issue still happens to me when the link contains special characters like "-".

Is there any way to rebuild this package with BODY_WIDTH = 0 (config.py)?
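
One possible explanation for the hyphen case, assuming the wrapping behaves like Python's standard textwrap module (an assumption, not something confirmed in this thread): textwrap treats a hyphen between letters as a legal break point by default. The sketch below uses a made-up URL and width.

```python
import textwrap

URL = "https://example.org/docs/some-long-page-name"
text = f"See {URL} for details"

# Default behaviour: hyphens between letters are allowed break points.
broken = textwrap.wrap(text, width=30, break_long_words=False)

# With break_on_hyphens=False the URL is kept as a single chunk.
intact = textwrap.wrap(
    text, width=30, break_long_words=False, break_on_hyphens=False
)

print(any(URL in line for line in broken))
print(any(URL in line for line in intact))
```

If the wrapping does work this way, it would explain why only some long links are affected.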

Alir3z4 commented on May 23, 2024

@nguyenl95 have you considered --reference-links?

nguyenl95 commented on May 23, 2024

@Alir3z4 I intend to use html2text as a library rather than from the command line.
Btw, I just read your /tests and they are really useful.

I use this lib for my crawler (which seems to be a popular use case), and I think body_width=0, or protect_links=True and skip_internal_links=False, should be the default. baseurl is also a really good option that needs to be exposed to readers, btw.

import html2text

def html2md(raw):
    h = html2text.HTML2Text()
    h.body_width = 0  # disable wrapping so long URLs are not broken
    h.baseurl = "https://example.org"  # this option is hidden (undocumented)
    return h.handle(raw)
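
For readers wondering what baseurl changes, here is a rough sketch of the idea, assuming relative links are resolved against it the way urllib.parse.urljoin does. The hrefs are made up for illustration.

```python
from urllib.parse import urljoin

base = "https://example.org"

# Relative hrefs are resolved against the base; absolute ones are left alone.
for href in ["/docs/intro", "images/logo.png", "https://other.example/page"]:
    print(href, "->", urljoin(base, href))
```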

Alir3z4 commented on May 23, 2024

@nguyenl95 Thanks for mentioning it.
I didn't notice you were referring to using the lib itself and not the CLI.

I'd love to see a pull request updating the documentation so others can see and use it.
You would be modifying:

Let me know if I can help you with anything else.

nguyenl95 commented on May 23, 2024

@Alir3z4 Actually, there is one feature I have in mind.

It is a limit on the output length: my forum platform doesn't allow my crawler to post content longer than 32000 characters.
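
Until such a feature exists, the limit could be enforced after conversion. This is a minimal sketch, not part of html2text: truncate_text is a hypothetical helper that cuts at the last paragraph break before the limit, so a link or sentence is less likely to be cut halfway.

```python
def truncate_text(text: str, limit: int = 32000) -> str:
    """Return text shortened to at most `limit` characters."""
    if len(text) <= limit:
        return text
    cut = text.rfind("\n\n", 0, limit)  # last paragraph break before the limit
    if cut <= 0:
        cut = limit  # no paragraph break found: fall back to a hard cut
    return text[:cut].rstrip()


sample = "first paragraph\n\nsecond paragraph\n\nthird paragraph"
print(truncate_text(sample, limit=40))
```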

Alir3z4 commented on May 23, 2024

@nguyenl95 Great, feel free to open a feature request or, even better, a pull request. I would love to know more about it.
