danburzo / percollate
A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
Home Page: https://danburzo.ro/projects/percollate/
License: MIT License
Tried running percollate --v, returns the "Unexpected token function" error.
percollate/index.js:31
async function cleanup(url) {
^^^^^^^^
SyntaxError: Unexpected token function
at Object.exports.runInThisContext (vm.js:76:16)
at Module._compile (module.js:542:28)
at Object.Module._extensions..js (module.js:579:10)
at Module.load (module.js:487:32)
at tryModuleLoad (module.js:446:12)
at Function.Module._load (module.js:438:3)
at Module.runMain (module.js:604:10)
at run (bootstrap_node.js:394:7)
at startup (bootstrap_node.js:149:9)
at bootstrap_node.js:509:3
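An "Unexpected token function" error at an async function declaration usually means the installed Node.js release predates async/await, which is enabled by default from Node.js 7.6 onward. A minimal sketch of a version check follows; the helper name supportsAsyncFunctions is hypothetical and not part of percollate.

```javascript
// Hypothetical helper (not part of percollate): returns true when the given
// Node.js version string, e.g. "v6.11.0", is at least 7.6.0, the first
// release where async functions work without a flag.
function supportsAsyncFunctions(version) {
  const [major, minor] = version.replace(/^v/, '').split('.').map(Number);
  return major > 7 || (major === 7 && minor >= 6);
}

// Report the running version and whether it can parse async functions.
console.log(process.version, supportsAsyncFunctions(process.version));
```

If the check prints false, upgrading Node.js is the likely fix.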
Add a --css CLI option to allow sending short style snippets from the CLI directly, without having to use a custom HTML/CSS file. For example, changing the page size:
percollate --output some.pdf --css "@page { size: A4; }" http://example.com
The CSS will be appended to the stylesheet.
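The appending step can be sketched as plain string concatenation. This is an illustration only, not percollate's actual implementation; the appendCss helper and the sample base stylesheet are assumptions.

```javascript
// Hypothetical helper: combine the base stylesheet text with an inline
// snippet passed on the command line. The snippet goes last, so for rules
// of equal specificity it wins over the base stylesheet.
function appendCss(baseCss, inlineCss) {
  if (!inlineCss) return baseCss; // nothing to append
  return `${baseCss}\n${inlineCss}`;
}

const base = 'body { font-family: serif; }'; // stand-in for the default stylesheet
console.log(appendCss(base, '@page { size: A4; }'));
```

Placing the snippet after the stylesheet is what lets a short rule such as the @page example override the defaults.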
Is there a way to configure cookie/local storage before visiting url to create pdf?
I tried and found percollate a very useful tool. However, I would love to use it within html pages for on-demand creation of html page to pdf.
How can I run percollate from within browser instead of node or command-line?
Installed globally via NPM however when trying to run (any site) the following error is received:
/usr/local/lib/node_modules/percollate/index.js:31
async function cleanup(url) {
^^^^^^^^
SyntaxError: Unexpected token function
at createScript (vm.js:56:10)
at Object.runInThisContext (vm.js:97:10)
at Module._compile (module.js:542:28)
at Object.Module._extensions..js (module.js:579:10)
at Module.load (module.js:487:32)
at tryModuleLoad (module.js:446:12)
at Function.Module._load (module.js:438:3)
at Module.runMain (module.js:604:10)
at run (bootstrap_node.js:390:7)
at startup (bootstrap_node.js:150:9)
$percollate pdf --output 1.pdf https://reactjs.org/docs/hello-world.html
Fetching: https://reactjs.org/docs/hello-world.html
Enhancing web page
(node:10750) UnhandledPromiseRejectionWarning: ReferenceError: anchor is not defined
at Array.from.forEach.img (/usr/lib/node_modules/percollate/src/enhancements.js:12:18)
at Array.forEach (<anonymous>)
at imagesAtFullSize (/usr/lib/node_modules/percollate/src/enhancements.js:11:57)
at cleanup (/usr/lib/node_modules/percollate/index.js:40:2)
at process._tickCallback (internal/process/next_tick.js:68:7)
(node:10750) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:10750) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Seems to be a special case though for the page https://betterwebtype.com/rhythm-in-web-typography, the images are not displayed in the final PDF, even though they seem to show up fine in my Firefox's reader view. Any idea why?
This will start a local server (via serve) that processes the HTMLs and shows a web reader interface.
To avoid having to implement separate CLI options.
Let's say we have a static website produced by a static site generator.
Could we transform its pages into one PDF?
Paragraphs are indented in the default style. It seems that Readability wraps <img> elements in <p> elements, causing the images to be indented.
Readability's results are not always perfect, so let's make it flexible enough so that we can take out the readability step, or replace it with some other way of parsing the content.
This means standardizing what we get back from the parser, and we can take Readability as the baseline output.
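One way to picture such a standardized result is a small normalizing function that every parser's output passes through. The sketch below is an assumption about what the contract could look like: the field names are modeled on a subset of what Readability's parse() returns, and normalizeParsed and passthroughParser are hypothetical names.

```javascript
// Hypothetical contract: every parser must yield at least an HTML `content`
// string; the remaining fields get defaults when the parser omits them.
function normalizeParsed(result, fallbackTitle = 'Untitled') {
  if (!result || typeof result.content !== 'string') {
    throw new TypeError('parser must return an object with an HTML `content` string');
  }
  return {
    title: result.title || fallbackTitle,
    byline: result.byline || null,
    content: result.content,
    excerpt: result.excerpt || ''
  };
}

// A trivial stand-in parser that skips the readability step entirely.
const passthroughParser = html => ({ title: '', content: html });

console.log(normalizeParsed(passthroughParser('<p>Hello</p>')));
```

With a contract like this, the rest of the pipeline only depends on the normalized shape, so the parser can be swapped or bypassed.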
When I run percollate pdf --output as32.pdf https://github.com/danburzo/percollate, I get this error:
SyntaxError: Unexpected token ...
at createScript (vm.js:74:10)
at Object.runInThisContext (vm.js:116:10)
at Module._compile (module.js:533:28)
at Object.Module._extensions..js (module.js:580:10)
at Module.load (module.js:503:32)
at tryModuleLoad (module.js:466:12)
at Function.Module._load (module.js:458:3)
at Module.require (module.js:513:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (/usr/local/lib/node_modules/percollate/cli.js:5:40)
Hi,
Good work on this one!
It would be nice to include a few screenshots on the README page just to get a brief idea of what the output would look like.
Regards.
#31 Introduced some href filtering, but I wonder - do we really need to show those hrefs at all? I think they just make it harder to read the text.
Please provide an option to skip all href generation. For now I'm using this hack:
// Mark every link with the no-href class so its URL is not shown.
function noUselessHref(doc) {
  Array.from(doc.querySelectorAll('a'))
    .forEach(el => el.classList.add('no-href'));
}
percollate pdf https://hpbn.co/primer-on-latency-and-bandwidth/
Generates this:
Networking-101:-Primer-on-Latency-and-Bandwidth-High-Performance-Browser-Networking-(O'Reilly).pdf
Some of the image tags have svg sources, and they are not rendered. This page also has a png which is not rendered either.
Saving as PDF
(node:24806) UnhandledPromiseRejectionWarning: Error: Failed to launch chrome!
[1012/081529.390835:ERROR:zygote_host_impl_linux.cc(89)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.
TROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md
at onClose (/usr/lib/node_modules/percollate/node_modules/puppeteer/lib/Launcher.js:339:14)
at Interface.helper.addEventListener (/usr/lib/node_modules/percollate/node_modules/puppeteer/lib/Launcher.js:328:50)
at emitNone (events.js:111:20)
at Interface.emit (events.js:208:7)
at Interface.close (readline.js:368:8)
at Socket.onend (readline.js:147:10)
at emitNone (events.js:111:20)
at Socket.emit (events.js:208:7)
at endReadableNT (_stream_readable.js:1064:12)
at _combinedTickCallback (internal/process/next_tick.js:139:11)
(node:24806) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:24806) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
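The Chrome error above says that running as root requires the --no-sandbox flag. A minimal sketch of building Puppeteer launch options accordingly is shown below. The launchOptions helper is hypothetical, and disabling the sandbox has security implications, so running as a non-root user is generally the safer route.

```javascript
// Hypothetical helper: build options for puppeteer.launch(), adding the
// sandbox-disabling flags only when the process runs as root.
function launchOptions(isRoot) {
  return {
    headless: true,
    args: isRoot ? ['--no-sandbox', '--disable-setuid-sandbox'] : []
  };
}

// process.getuid is not available on Windows, hence the guard.
const isRoot = typeof process.getuid === 'function' && process.getuid() === 0;
console.log(launchOptions(isRoot));

// Intended use, assuming puppeteer is installed:
// const browser = await puppeteer.launch(launchOptions(isRoot));
```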
A PDF generated from many web pages would benefit from a Table of Contents, implemented as PDF bookmarks. We'll probably need to post-process the PDF with something like HummusJS to write the TOC. (Also, I'd appreciate if someone with more experience would explain whether its license is compatible with our MIT License)
Related: #25
For example:
percollate pdf https://github.com/danburzo/percollate
Will result in a PDF where the links in the Table of Contents don't work:
(produced on macOS / [email protected])
I may be dense, but I can't tell how the anchors work in browsers in the first place. (Later edit: the behavior is dependent on JavaScript, of course.)
Initially, the HTML template received the path for the stylesheet (either the default one, or a custom one provided with the --stylesheet option):
percollate/templates/default.html
Lines 3 to 7 in 7c43aec
With the introduction of the --css option, I changed it to:
percollate/templates/default.html
Lines 7 to 9 in cecd11e
And deprecated the passing of the stylesheet property to the template.
However, I think I might have missed a valid use-case for an external stylesheet, namely being able to reference outside resources (images, web fonts) in the custom CSS.
This issue outlines some use-cases, to make sure the final solution covers all of them elegantly:
(Adding to this list as more use-cases arise)
The idea of the imagesAtFullSize enhancement is to get the largest available image from blogs using Blogspot, WordPress, and the like:
percollate/src/enhancements.js
Lines 1 to 20 in 3506b37
However, Wikipedia images are an exception:
<a href="/wiki/File:Perkulator.jpg" class="image">
<img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/250px-Perkulator.jpg" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/375px-Perkulator.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/500px-Perkulator.jpg 2x" data-file-width="1944" data-file-height="2592" width="250" height="333">
</a>
They link to what looks like an image file, but is in fact an HTML page for that image. How can we handle this situation gracefully?
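One possible fallback, offered as a sketch rather than the project's chosen solution, is to ignore the wrapping link and pick the largest candidate from the image's own srcset attribute, as in the Wikipedia markup above. The largestFromSrcset helper is hypothetical, and the simple comma split assumes URLs that contain no commas.

```javascript
// Hypothetical helper: return the URL of the srcset candidate with the
// largest numeric descriptor (e.g. "2x" beats "1.5x"), or null if none.
// Assumes the srcset does not mix x and w descriptors.
function largestFromSrcset(srcset) {
  const candidates = srcset
    .split(',')
    .map(part => part.trim().split(/\s+/))
    .filter(([url]) => url)
    .map(([url, descriptor = '1x']) => ({ url, size: parseFloat(descriptor) || 1 }));
  if (candidates.length === 0) return null;
  return candidates.reduce((best, c) => (c.size > best.size ? c : best)).url;
}

// Placeholder URLs standing in for the thumbnail variants.
console.log(
  largestFromSrcset('//example.org/375px-photo.jpg 1.5x, //example.org/500px-photo.jpg 2x')
);
```

Another option would be to request the linked URL and check its Content-Type header before treating it as an image, at the cost of an extra network round trip.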
Sometimes I might just want to create PDFs from a list of web pages, using the page title as the default file name (using the page title is the default printing behavior of Chrome I believe).
Currently, if I run percollate without specifying --output, it claims to have saved the PDF, but I can't find it in the folder where I executed the command.
Can it just save the web page to the current folder using its title as the filename, when the --output flag is omitted?
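Deriving a file name from the page title could look roughly like the sketch below. The filenameFromTitle helper, the 100-character cap, and the "untitled" fallback are all assumptions made for illustration.

```javascript
// Hypothetical helper: turn a page title into a file name that is safe on
// common file systems by replacing reserved characters with spaces.
function filenameFromTitle(title, extension = 'pdf') {
  const base = (title || '')
    .replace(/[\/\\:*?"<>|]+/g, ' ') // characters not allowed in file names
    .replace(/\s+/g, ' ')            // collapse runs of whitespace
    .slice(0, 100)                   // keep the name reasonably short
    .trim();
  return `${base || 'untitled'}.${extension}`;
}

console.log(filenameFromTitle('Networking 101: Primer on Latency and Bandwidth'));
```

Falling back to a fixed name when the title is empty avoids producing a file called just ".pdf".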
Links whose textContent is the same as the href attribute should not have the latter appended.
got accepts proxy parameters.
Either allow command-line parameters for the proxy to be passed over to got, or even better, honor the http_proxy, https_proxy, and no_proxy environment variables.
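Honoring the proxy environment variables could be sketched as a lookup that returns the proxy URL to use for a given request, or null when the request should go direct. The proxyFor helper is hypothetical, the no_proxy matching is a simplified suffix match, and how the result is handed to the HTTP client is left open. The global URL class assumes Node.js 10 or later.

```javascript
// Hypothetical helper: pick the proxy for a URL from http_proxy,
// https_proxy, and no_proxy (lowercase or uppercase variants).
function proxyFor(url, env = process.env) {
  const { protocol, hostname } = new URL(url);
  const noProxy = (env.no_proxy || env.NO_PROXY || '')
    .split(',')
    .map(entry => entry.trim().toLowerCase())
    .filter(Boolean);
  // Simplified matching: "*" bypasses everything, otherwise match the host
  // itself or any subdomain of the listed entry.
  const bypass = noProxy.some(entry => {
    if (entry === '*') return true;
    const suffix = entry.replace(/^\./, '');
    return hostname === suffix || hostname.endsWith(`.${suffix}`);
  });
  if (bypass) return null;
  const key = protocol === 'https:' ? 'https_proxy' : 'http_proxy';
  return env[key] || env[key.toUpperCase()] || null;
}

// Example with an explicit environment object instead of process.env.
const sampleEnv = { https_proxy: 'http://proxy.local:3128', no_proxy: 'internal.example' };
console.log(proxyFor('https://example.com/page', sampleEnv));
console.log(proxyFor('https://docs.internal.example/page', sampleEnv));
```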
As discussed in #37, sometimes it might be useful to output each webpage to a separate PDF instead of putting all pages into one PDF.
Some websites like SCMP (e.g. https://www.scmp.com/economy/china-economy/article/2169378/xi-jinping-donald-trump-agree-talks-g20-summit-next-month) don't seem to allow requests from headless browsers. I wonder if there's anything that can be done about it, though of course this might fall outside the scope of what this tool should handle.
It would be great to have an option to feed not just a plain list of URLs, but a tabbed, spaced, or somehow formatted (see below) file with captions and URLs to form a multilevel TOC in a resulting PDF.
A sample of such an input file:
<h1><a href="http://url1.com">Level 1 caption</a></h1>
<h2><a href="http://url11.com">Level 1-1 caption</a></h2>
<h3><a href="http://url111.com">Level 1-1-1 caption</a></h3>
<h3><a href="http://url112.com">Level 1-1-2 caption</a></h3>
<h2><a href="http://url12.com">Level 1-2 caption</a></h2>
<h3><a href="http://url121.com">Level 1-2-1 caption</a></h3>
<h3><a href="http://url122.com">Level 1-2-2 caption</a></h3>
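Reading an input file in the format above into a nested outline could be sketched as follows. The parseToc helper and the node shape are assumptions for illustration; the regular expression only handles the simple one-link-per-heading layout of the sample, not arbitrary HTML.

```javascript
// Hypothetical helper: convert heading lines such as
// <h2><a href="...">Caption</a></h2> into a tree keyed by heading level.
function parseToc(text) {
  const root = { level: 0, children: [] };
  const stack = [root]; // path from the root to the most recent heading
  const pattern = /<h([1-6])>\s*<a href="([^"]*)">(.*?)<\/a>\s*<\/h\1>/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    const node = { level: Number(match[1]), url: match[2], title: match[3], children: [] };
    // Climb back up until the top of the stack is a shallower heading.
    while (stack[stack.length - 1].level >= node.level) stack.pop();
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

const sample = [
  '<h1><a href="http://url1.com">Level 1 caption</a></h1>',
  '<h2><a href="http://url11.com">Level 1-1 caption</a></h2>',
  '<h3><a href="http://url111.com">Level 1-1-1 caption</a></h3>',
  '<h2><a href="http://url12.com">Level 1-2 caption</a></h2>'
].join('\n');

console.log(JSON.stringify(parseToc(sample), null, 2));
```

The resulting tree carries both the URLs to fetch and the captions for a multilevel table of contents.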