danburzo / percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.

Home Page: https://danburzo.ro/projects/percollate/

License: MIT License

JavaScript 88.00% HTML 4.24% CSS 7.68% Shell 0.07%

cli epub html markdown pdf puppeteer readability

percollate's Issues

Unexpected token function

Tried running percollate --v, returns the "Unexpected token function" error.

percollate/index.js:31
async function cleanup(url) {
^^^^^^^^
SyntaxError: Unexpected token function
at Object.exports.runInThisContext (vm.js:76:16)
at Module._compile (module.js:542:28)
at Object.Module._extensions..js (module.js:579:10)
at Module.load (module.js:487:32)
at tryModuleLoad (module.js:446:12)
at Function.Module._load (module.js:438:3)
at Module.runMain (module.js:604:10)
at run (bootstrap_node.js:394:7)
at startup (bootstrap_node.js:149:9)
at bootstrap_node.js:509:3
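
The trace above is consistent with a Node.js release that predates native async functions, which arrived in Node.js 7.6. A minimal sketch of a version guard follows; `supportsAsyncFunctions` is a hypothetical helper, not part of percollate.

```javascript
// Hypothetical helper, not part of percollate: returns true when the given
// Node.js version string is at least 7.6, the first release that can parse
// `async function` without a flag.
function supportsAsyncFunctions(version) {
	const [major, minor] = version
		.replace(/^v/, '')
		.split('.')
		.map(Number);
	return major > 7 || (major === 7 && minor >= 6);
}

if (!supportsAsyncFunctions(process.versions.node)) {
	console.error(
		`Node.js ${process.versions.node} cannot parse async functions; upgrade to 7.6 or later.`
	);
}
```

Because the SyntaxError is raised at parse time, a guard like this would only help if it lived in a file that contains no async syntax itself.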

Add --css CLI option

Add a --css CLI option to allow sending short style snippets from the CLI directly, without having to use a custom HTML/CSS file. For example, changing the page size:

percollate --output some.pdf --css "@page { size: A4; }" http://example.com

The CSS will be appended to the stylesheet.
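
A minimal sketch of the appending step, assuming the base stylesheet has already been read into a string. `combineStyles` is a hypothetical name used only for illustration.

```javascript
// Hypothetical helper: appends the snippet passed via --css to the base
// stylesheet. Placing it last lets it win over earlier rules of equal
// specificity in the cascade.
function combineStyles(baseCss, cliCss) {
	return cliCss ? `${baseCss}\n\n${cliCss}` : baseCss;
}

console.log(combineStyles('p { text-indent: 1em; }', '@page { size: A4; }'));
```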

Allow programmatic usage

Add a truly beautiful CSS print stylesheet

Look into the Firefox reader view algorithm (or similar)

Create pdf of pages behind authentication wall?

Is there a way to configure cookie/local storage before visiting url to create pdf?

Browser extension

I tried and found percollate a very useful tool. However, I would love to use it within html pages for on-demand creation of html page to pdf.
How can I run percollate from within browser instead of node or command-line?

Unexpected token function

Installed globally via npm; however, running it against any site produces the following error:

/usr/local/lib/node_modules/percollate/index.js:31
async function cleanup(url) {
^^^^^^^^
SyntaxError: Unexpected token function
at createScript (vm.js:56:10)
at Object.runInThisContext (vm.js:97:10)
at Module._compile (module.js:542:28)
at Object.Module._extensions..js (module.js:579:10)
at Module.load (module.js:487:32)
at tryModuleLoad (module.js:446:12)
at Function.Module._load (module.js:438:3)
at Module.runMain (module.js:604:10)
at run (bootstrap_node.js:390:7)
at startup (bootstrap_node.js:150:9)

Perform HTTP requests sequentially

Resolve relative links to absolute links

anchor is not defined

$percollate pdf --output 1.pdf https://reactjs.org/docs/hello-world.html
Fetching: https://reactjs.org/docs/hello-world.html
Enhancing web page
(node:10750) UnhandledPromiseRejectionWarning: ReferenceError: anchor is not defined
    at Array.from.forEach.img (/usr/lib/node_modules/percollate/src/enhancements.js:12:18)
    at Array.forEach (<anonymous>)
    at imagesAtFullSize (/usr/lib/node_modules/percollate/src/enhancements.js:11:57)
    at cleanup (/usr/lib/node_modules/percollate/index.js:40:2)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:10750) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:10750) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Images not included in the output PDF even though they are shown in Firefox reader view?

This seems to be a special case: for the page https://betterwebtype.com/rhythm-in-web-typography, the images are not displayed in the final PDF, even though they show up fine in my Firefox's reader view. Any idea why?

Make it work with the are.na API

Add URL to each web page in the resulting PDF

Add command `read`

This will start a local server (via serve) that processes the HTMLs and shows a web reader interface.

Add contribution guidelines

Prefer CSS page size / margins, if possible

To avoid having to implement separate CLI options.

Transform static HTML folders?

Let's say we have a static website produced by a static site generator.
Could we transform its pages into one PDF?

Add command `html`

Readability wraps images in paragraph elements, causing text-indent

Paragraphs are indented in the default style. Seems that Readability will wrap <img> elements into <p> elements, causing the images to be indented.

Skip / replace Readability

Readability's results are not always perfect, so let's make it flexible enough so that we can take out the readability step, or replace it with some other way of parsing the content.

This means standardizing what we get back from the parser, and we can take Readability as the baseline output.

SyntaxError: Unexpected token ...

When I run percollate pdf --output as32.pdf https://github.com/danburzo/percollate, I get this error:

SyntaxError: Unexpected token ...
    at createScript (vm.js:74:10)
    at Object.runInThisContext (vm.js:116:10)
    at Module._compile (module.js:533:28)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
    at Function.Module._load (module.js:458:3)
    at Module.require (module.js:513:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/percollate/cli.js:5:40)

Screenshots to README

Hi,

Good work on this one!
It would be nice to include a few screenshots on the README page just to get a brief idea of what the output would look like.

Regards.

Look into HummusJS

https://github.com/galkahana/HummusJS

noUselessHref: provide option to remove all hrefs

#31 introduced some href filtering, but I wonder: do we really need to show those hrefs at all? I think they just make the text harder to read.

Please provide an option to skip all href generation. For now I'm using this hack:

function noUselessHref(doc) {
	Array.from(doc.querySelectorAll(`a`))
		.filter(function(el) {
			return true;
		})
		.forEach(el => el.classList.add('no-href'));
}

Images are not rendered for specific article

percollate pdf https://hpbn.co/primer-on-latency-and-bandwidth/

Generates this:
Networking-101:-Primer-on-Latency-and-Bandwidth-High-Performance-Browser-Networking-(O'Reilly).pdf

Some of the image tags have svg sources, and they are not rendered. This page also has a png which is not rendered either.

Phase out short option flags

Add command `pdf`

Failed to launch chrome because running as root without --no-sandbox

Saving as PDF
(node:24806) UnhandledPromiseRejectionWarning: Error: Failed to launch chrome!
[1012/081529.390835:ERROR:zygote_host_impl_linux.cc(89)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.

TROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md

at onClose (/usr/lib/node_modules/percollate/node_modules/puppeteer/lib/Launcher.js:339:14)
at Interface.helper.addEventListener (/usr/lib/node_modules/percollate/node_modules/puppeteer/lib/Launcher.js:328:50)
at emitNone (events.js:111:20)
at Interface.emit (events.js:208:7)
at Interface.close (readline.js:368:8)
at Socket.onend (readline.js:147:10)
at emitNone (events.js:111:20)
at Socket.emit (events.js:208:7)
at endReadableNT (_stream_readable.js:1064:12)
at _combinedTickCallback (internal/process/next_tick.js:139:11)

(node:24806) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:24806) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
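
Chromium refuses to start sandboxed when run as root, which is what the log reports. One possible workaround, sketched under the assumption that the launch options can be customized, is to pass the `--no-sandbox` and `--disable-setuid-sandbox` Chromium flags only when running as root. `launchOptionsFor` is a hypothetical helper, and whether percollate exposes such a setting is not stated in this issue.

```javascript
// Hypothetical helper: builds an options object suitable for
// puppeteer.launch(). `uid` is the numeric user id, as returned by
// process.getuid() on POSIX systems.
function launchOptionsFor(uid) {
	const args = [];
	if (uid === 0) {
		// Running as root: disable the sandbox so Chromium can start.
		// This weakens isolation, so running as a regular user is preferable.
		args.push('--no-sandbox', '--disable-setuid-sandbox');
	}
	return { headless: true, args };
}

// process.getuid is not available on Windows, hence the guard.
const currentUid = typeof process.getuid === 'function' ? process.getuid() : -1;
console.log(launchOptionsFor(currentUid));
```

The resulting object would be handed to `puppeteer.launch()`; running as a non-root user avoids the need for the flags altogether.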

PDF: Add a Table of Contents to the metadata

A PDF generated from many web pages would benefit from a Table of Contents, implemented as PDF bookmarks. We'll probably need to post-process the PDF with something like HummusJS to write the TOC. (Also, I'd appreciate it if someone with more experience could explain whether its license is compatible with our MIT License.)

Related: #25

--individual + --output: use --output path as prefix?

In-page anchors on github.com pages don't work

For example:

percollate pdf https://github.com/danburzo/percollate

This results in a PDF where the links in the Table of Contents don't work:

danburzopercollate.pdf

(produced on macOS / [email protected])

I may be dense, but I can't tell how the anchors work in browsers in the first place 😰 (Later edit: the behavior is dependent on JavaScript — of course 😄 )

Add example input/output (screenshots, etc.)

Fix template / stylesheet path resolution

Usecases for --stylesheet vs. --css handling in the HTML template

Initially, the HTML template received the path for the stylesheet (either the default one, or a custom one provided with the --stylesheet option):

percollate/templates/default.html

Lines 3 to 7 in 7c43aec

    
<head>
	<meta charset="utf-8">
	<title>🌐 Percollate</title>
	<link rel='stylesheet' media='all' href="{{ stylesheet }}"/>
</head>

With the introduction of the --css option, I changed it to:

percollate/templates/default.html

Lines 7 to 9 in cecd11e

    
<style type='text/css'>
	{{ style }}
</style>

And deprecated the passing of the stylesheet property to the template.

However, I think I might have missed a valid use-case for an external stylesheet, namely being able to reference outside resources (images, web fonts) in the custom CSS.

This issue outlines some use-cases, to make sure the final solution covers all of them elegantly:

Override just the page size, margins, font sizes
Override the default stylesheet with a custom one
Use local web fonts in the custom stylesheet

(Adding to this list as more use-cases arise)

Create a default print stylesheet

Use largest available size for images in Wikipedia articles

The idea of the imagesAtFullSize enhancement is to get the largest available image from blogs using Blogspot, WordPress, and the like:

percollate/src/enhancements.js

Lines 1 to 20 in 3506b37

    
function imagesAtFullSize(doc) {
	/*
		Replace:
			<a href='original-size.png'>
				<img src='small-size.png'/>
			</a>

		With:
			<img src='original-size.png'/>
	 */
	Array.from(doc.querySelectorAll('a > img:only-child')).forEach(img => {
		let anchor = img.parentNode;
		let original = anchor.href;

		// only replace if the HREF matches an image file
		if (original.match(/\.(png|jpg|jpeg|gif|svg)$/)) {
			img.setAttribute('src', original);
			anchor.parentNode.replaceChild(img, anchor);
		}
	});

However, Wikipedia images are an exception:

<a href="/wiki/File:Perkulator.jpg" class="image">
  <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/250px-Perkulator.jpg" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/375px-Perkulator.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/500px-Perkulator.jpg 2x" data-file-width="1944" data-file-height="2592" width="250" height="333">
</a>

They link to what looks like an image file, but is in fact an HTML page for that image. How can we handle this situation gracefully?
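
One possible approach, offered as a sketch rather than a settled design, is to ignore the misleading href and pick the largest candidate advertised in the image's srcset attribute. `largestSrcsetCandidate` is a hypothetical helper. It splits candidates on a comma followed by whitespace, which is a simplification of the full srcset grammar.

```javascript
// Hypothetical helper: returns the URL with the largest descriptor
// (e.g. "2x" or "500w") in a srcset string, falling back to `src` when
// srcset is missing or offers nothing larger than 1x.
function largestSrcsetCandidate(src, srcset) {
	let best = { url: src, size: 1 };
	(srcset || '')
		.split(/,\s+/)
		.map(part => part.trim())
		.filter(Boolean)
		.forEach(part => {
			// A candidate without a descriptor counts as 1x.
			const [url, descriptor = '1x'] = part.split(/\s+/);
			const size = parseFloat(descriptor);
			if (size > best.size) {
				best = { url, size };
			}
		});
	return best.url;
}

console.log(largestSrcsetCandidate('small.jpg', 'medium.jpg 1.5x, large.jpg 2x'));
```

For the Wikipedia markup above this selects the 500px thumbnail, which is the largest size the page advertises but not necessarily the original file.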

Add a cover page to the PDF

URLs starting with slash are considered file:/// URLs

When an href in the web page starts with /, it stays as is, and hence the PDF viewer treats it as a file URL. It should probably be resolved against the domain of the page it came from.
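
A minimal sketch of the resolution step using the WHATWG `URL` constructor that ships with Node.js. `absoluteHref` is a hypothetical helper, and the example.com addresses are placeholders.

```javascript
// Hypothetical helper: resolves an href against the URL of the page it
// was found on. Values that cannot be parsed are returned unchanged.
function absoluteHref(href, pageUrl) {
	try {
		return new URL(href, pageUrl).href;
	} catch (err) {
		return href;
	}
}

console.log(absoluteHref('/docs/intro.html', 'https://example.com/blog/post'));
// prints "https://example.com/docs/intro.html"
```

The same call also handles protocol-relative (`//host/path`) and document-relative (`../page.html`) links, so no special case is needed for the leading slash.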

Default save location?

Sometimes I might just want to create PDFs from a list of web pages, using the page title as the default file name (using the page title is the default printing behavior of Chrome I believe).

Currently, if I run percollate without specifying --output, it claims to have saved the pdf, but I can't find it in the folder where I executed the command.

Can it just save the web page to the current folder using its title as the filename, when an --output flag is omitted?
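
A sketch of how such a default could be derived, assuming the page title is already known. `defaultOutputPath` is a hypothetical helper, and the set of replaced characters is an assumption about what is unsafe in file names on common systems.

```javascript
const path = require('path');

// Hypothetical helper: turns a page title into a PDF path inside `cwd`
// (the current working directory by default).
function defaultOutputPath(title, cwd = process.cwd()) {
	const safe = (title || '')
		.replace(/[\/\\:*?"<>|]+/g, '-') // characters rejected by common file systems
		.replace(/\s+/g, ' ')
		.trim()
		.slice(0, 100); // keep the file name reasonably short
	return path.join(cwd, `${safe || 'untitled'}.pdf`);
}

console.log(defaultOutputPath('Rhythm in Web Typography'));
```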

Add default style for tables

Don't append the "href" attribute unnecessarily

Links whose textContent is the same as the href attribute should not have the latter appended
in-page anchors (?)
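
The two cases above can be sketched as a single predicate. `shouldAppendHref` is a hypothetical name, and skipping in-page anchors is an assumption, since the list leaves that case as an open question.

```javascript
// Hypothetical predicate: decides whether the href should be printed
// after the link text.
function shouldAppendHref(text, href) {
	// Assumption: in-page anchors carry no useful address on paper.
	if (!href || href.startsWith('#')) {
		return false;
	}
	// Compare loosely: ignore the scheme and a trailing slash.
	const normalize = s =>
		s
			.trim()
			.replace(/^https?:\/\//, '')
			.replace(/\/$/, '');
	return normalize(text) !== normalize(href);
}

console.log(shouldAppendHref('https://example.com', 'https://example.com/'));
// prints false
```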

Allow proxy parameters

got accepts proxy parameters.

Either allow command line parameters for proxy to be passed over to got, or even better, honor http_proxy, https_proxy, no_proxy env variables.
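
A sketch of the environment-variable lookup, assuming the common convention that no_proxy holds a comma-separated list of host suffixes. `proxyFor` is a hypothetical helper; how the result is handed to got is left out here.

```javascript
// Hypothetical helper: returns the proxy URL to use for `targetUrl`,
// or null when the request should go out directly.
function proxyFor(targetUrl, env = process.env) {
	const { protocol, hostname } = new URL(targetUrl);

	// Hosts listed in no_proxy (and their subdomains) bypass the proxy.
	const noProxy = (env.no_proxy || env.NO_PROXY || '')
		.split(',')
		.map(entry => entry.trim().replace(/^\./, ''))
		.filter(Boolean);
	const bypass = noProxy.some(
		host => host === '*' || hostname === host || hostname.endsWith('.' + host)
	);
	if (bypass) {
		return null;
	}

	const key = protocol === 'https:' ? 'https_proxy' : 'http_proxy';
	return env[key] || env[key.toUpperCase()] || null;
}

console.log(proxyFor('https://example.com/', { https_proxy: 'http://proxy.local:8080' }));
```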

Add tests

Output multiple PDF files for multiple webpages instead of one single PDF?

As discussed in #37, sometimes it might be useful to output each webpage to a separate PDF instead of putting all pages into one PDF.

Add command `epub`

CLI: show help by default when running with no arguments

Document default CSS / HTML

Some websites return 405 error

Some websites like SCMP (e.g. https://www.scmp.com/economy/china-economy/article/2169378/xi-jinping-donald-trump-agree-talks-g20-summit-next-month) don't seem to allow requests from headless browsers. I wonder if there's anything that can be done about it, though of course this might fall outside the scope of what this tool should handle.

TOC with multiple levels

It would be great to have an option to feed not just a plain list of URLs, but a tabbed, spaced, or somehow formatted (see below) file with captions and URLs to form a multilevel TOC in a resulting PDF.

A sample of such an input file:

<h1><a href="http://url1.com">Level 1 caption</a></h1>
	<h2><a href="http://url11.com">Level 1-1 caption</a></h2>
		<h3><a href="http://url111.com">Level 1-1-1 caption</a></h3>
		<h3><a href="http://url112.com">Level 1-1-2 caption</a></h3>
	<h2><a href="http://url12.com">Level 1-2 caption</a></h2>
		<h3><a href="http://url121.com">Level 1-2-1 caption</a></h3>
		<h3><a href="http://url122.com">Level 1-2-2 caption</a></h3>
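
A sketch of how an input file in this shape could be turned into a nested structure. `parseToc` is a hypothetical function. It relies on a regular expression that only matches the exact heading-and-link pattern from the sample, so it is not a general HTML parser.

```javascript
// Hypothetical parser: reads lines of the form
//   <hN><a href="URL">caption</a></hN>
// and nests each entry under the closest preceding entry of a lower level.
function parseToc(input) {
	const root = { level: 0, children: [] };
	const stack = [root];
	const pattern = /<h([1-6])><a href="([^"]*)">(.*?)<\/a><\/h\1>/g;
	let match;
	while ((match = pattern.exec(input)) !== null) {
		const node = {
			level: Number(match[1]),
			url: match[2],
			caption: match[3],
			children: []
		};
		// Pop until the top of the stack is a valid parent for this level.
		while (stack[stack.length - 1].level >= node.level) {
			stack.pop();
		}
		stack[stack.length - 1].children.push(node);
		stack.push(node);
	}
	return root.children;
}

const sampleToc = [
	'<h1><a href="http://url1.com">Level 1 caption</a></h1>',
	'\t<h2><a href="http://url11.com">Level 1-1 caption</a></h2>',
	'\t\t<h3><a href="http://url111.com">Level 1-1-1 caption</a></h3>'
].join('\n');
console.log(JSON.stringify(parseToc(sampleToc), null, 2));
```

The heading level, not the indentation, determines the nesting, so tabs in the input are purely cosmetic in this sketch.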
