Comments (13)
I've labeled this as a bug.
@stefanor Thanks for following up on this. Please feel free to open or forward issues/bugs from
https://github.com/aaronsw/html2text/.
from html2text.
It took me a while to find that html2text indirectly caused the %0A (a URL-encoded newline) occurring in my links. I've temporarily disabled body_width wrapping in my code to prevent it:
from html2text import HTML2Text

parser = HTML2Text()
parser.body_width = 0  # 0 disables line wrapping, so links are not split
text = parser.handle(value)
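To make the symptom concrete, here is a minimal sketch of how a stray newline inside a link target turns into %0A once the URL is percent-encoded. It uses only Python's standard urllib.parse; the URL and the position of the newline are made up for illustration and are not taken from html2text's actual output.

```python
from urllib.parse import quote

# A link target as it might look after a line wrapper inserted a newline
# in the middle of it (hypothetical example URL).
wrapped_url = "https://example.org/some/long/\npath"

# Percent-encode the URL, leaving ":" and "/" untouched.
encoded = quote(wrapped_url, safe=":/")

print(encoded)  # https://example.org/some/long/%0Apath
```

Anything downstream that encodes the wrapped link would produce a URL like this, which is why turning wrapping off makes the %0A disappear.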
Hey @jmagnusson @stefanor
I see these two hacks trying to fix the issue:
Do you think we can apply the same approach in html2text without explicitly setting body_width=0?
@stefanor does this fix help?
Yeah, combined with --reference-links, that seems to do the right thing.
@stefanor @Alir3z4 Consider closed?
I'm going to close this then, thanks for your awesome collaboration on this ;)
This issue still happens to me when the link contains special characters like "-".
Is there any way to rebuild this package with BODY_WIDTH = 0 (config.py)?
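A possible reason hyphens matter, offered as an assumption and not a confirmed diagnosis of html2text: Python's textwrap.wrap breaks a long token after a hyphen by default (break_on_hyphens=True). The sketch below uses only the standard library; the sample line, the URL, and the width of 30 are hypothetical.

```python
import textwrap

url = "https://example.org/a-very-long-hyphenated-path"
line = f"See [docs]({url}) for details."

# Default behaviour: hyphens inside the URL are treated as break points.
wrapped = textwrap.wrap(line, width=30, break_long_words=False)

# With hyphen breaking turned off, the URL stays on a single line.
kept_whole = textwrap.wrap(
    line, width=30, break_long_words=False, break_on_hyphens=False
)

url_split = url not in "\n".join(wrapped)
url_intact = any(url in part for part in kept_whole)
print(url_split, url_intact)
```

If the wrapping inside the library behaves like this, a link with hyphens can be split even when long words are never broken, which would match the report above.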
@nguyenl95 have you considered --reference-links?
@Alir3z4 I intend to use html2text as a library rather than from the command line.
By the way, I just read your /tests and they are really useful.
I use this library for my crawler (this use case seems common), and I think body_width=0, or protect_links=True and skip_internal_links=False, should be the default. baseurl is also a really good option that needs to be exposed to readers.
import html2text

def html2md(raw):
    h = html2text.HTML2Text()
    h.body_width = 0  # disable wrapping so links are not broken across lines
    h.baseurl = "https://example.org"  # this option is hidden (not in the docs)
    return h.handle(raw)
@nguyenl95 Thanks for mentioning this.
I didn't notice you were referring to using the library itself and not the CLI.
I'd love to see a pull request updating the documentation so others can see and use it.
You would be modifying:
- https://github.com/Alir3z4/html2text/blob/master/docs/how_it_works.md
- https://github.com/Alir3z4/html2text/blob/master/docs/usage.md
Let me know if I can help you with anything else.
@Alir3z4 Actually, there is one feature I have in mind.
It is a limit on output length: my forum platform doesn't allow my crawler to post content over 32000 characters.
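A minimal sketch of what such a limit could look like as a post-processing step on the converted text, assuming it is acceptable to cut at a paragraph boundary. truncate_markdown is a hypothetical helper, not part of html2text, and the default of 32000 simply mirrors the forum limit mentioned above.

```python
def truncate_markdown(text: str, limit: int = 32000) -> str:
    """Cut text to at most `limit` characters, preferring a paragraph boundary."""
    if len(text) <= limit:
        return text
    head = text[:limit]
    # Prefer the last blank line so that no paragraph is cut in the middle.
    cut = head.rfind("\n\n")
    if cut <= 0:
        # No paragraph break found: fall back to the last line break.
        cut = head.rfind("\n")
    if cut <= 0:
        # No line break either: hard cut at the limit.
        cut = limit
    return head[:cut].rstrip()
```

It would be applied to the string returned by handle(), for example truncate_markdown(h.handle(raw)). A cut like this can still land inside a Markdown construct such as a table, so treat it as a starting point only.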
@nguyenl95 Great, feel free to make a feature request or even better a pull request, I would love to know more about it.