
danburzo / percollate


A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.

Home Page: https://danburzo.ro/projects/percollate/

License: MIT License

JavaScript 88.00% HTML 4.24% CSS 7.68% Shell 0.07%
cli epub html markdown pdf puppeteer readability

percollate's People

Contributors

akuukis, danburzo, emersonlaurentino, guybedo, juhq, mosegontar, ncsing, opw0011, pascalw, pedrolucasp, phenax, ramadis, ssonal, tanmayrajani, vongrad, xiangronglin, yashha


percollate's Issues

Unexpected token function

Tried running percollate --v, returns the "Unexpected token function" error.

percollate/index.js:31
async function cleanup(url) {
^^^^^^^^
SyntaxError: Unexpected token function
at Object.exports.runInThisContext (vm.js:76:16)
at Module._compile (module.js:542:28)
at Object.Module._extensions..js (module.js:579:10)
at Module.load (module.js:487:32)
at tryModuleLoad (module.js:446:12)
at Function.Module._load (module.js:438:3)
at Module.runMain (module.js:604:10)
at run (bootstrap_node.js:394:7)
at startup (bootstrap_node.js:149:9)
at bootstrap_node.js:509:3
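
The report does not state the Node.js version, but a SyntaxError on the async keyword is what runtimes older than 7.6 produce, since that is the first release with async functions enabled by default. A minimal sketch of a guard, assuming it lives in a separate entry file that itself avoids newer syntax (the helper name supportsAsyncFunctions is hypothetical and not part of percollate):

```javascript
// Hypothetical helper, written in ES5 so that old runtimes can parse it.
// Returns true when the version string is at least 7.6.0.
function supportsAsyncFunctions(version) {
	var parts = String(version).replace(/^v/, '').split('.');
	var major = Number(parts[0]);
	var minor = Number(parts[1]);
	return major > 7 || (major === 7 && minor >= 6);
}

if (!supportsAsyncFunctions(process.versions.node)) {
	console.error(
		'Node.js ' + process.versions.node +
		' is too old: async functions need Node.js 7.6 or later.'
	);
}
```

Such a check only helps if it runs before the file containing the async function is loaded, because a syntax error is raised at parse time.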

Add --css CLI option

Add a --css CLI option to allow sending short style snippets from the CLI directly, without having to use a custom HTML/CSS file. For example, changing the page size:

percollate --output some.pdf --css "@page { size: A4; }" http://example.com

The CSS will be appended to the stylesheet.
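
As a rough illustration of "appended to the stylesheet", the sketch below concatenates the snippet after the base CSS so that its rules win on equal specificity. The function name buildStylesheet and its parameters are assumptions for illustration, not percollate's actual code.

```javascript
// Hypothetical helper: returns the base stylesheet followed by the
// snippet passed via --css, or the base stylesheet alone if none is given.
function buildStylesheet(baseCss, extraCss) {
	if (!extraCss) {
		return baseCss;
	}
	// Later rules override earlier ones of equal specificity,
	// so the CLI snippet goes last.
	return baseCss + '\n\n' + extraCss;
}

const css = buildStylesheet('body { font-size: 12pt; }', '@page { size: A4; }');
console.log(css);
```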

Browser extension

I tried percollate and found it very useful. However, I would love to use it within HTML pages to convert a page to PDF on demand.
How can I run percollate from within the browser instead of Node or the command line?

Unexpected token function

Installed globally via npm; however, running it against any site produces the following error:

/usr/local/lib/node_modules/percollate/index.js:31
async function cleanup(url) {
^^^^^^^^
SyntaxError: Unexpected token function
at createScript (vm.js:56:10)
at Object.runInThisContext (vm.js:97:10)
at Module._compile (module.js:542:28)
at Object.Module._extensions..js (module.js:579:10)
at Module.load (module.js:487:32)
at tryModuleLoad (module.js:446:12)
at Function.Module._load (module.js:438:3)
at Module.runMain (module.js:604:10)
at run (bootstrap_node.js:390:7)
at startup (bootstrap_node.js:150:9)

anchor is not defined

$ percollate pdf --output 1.pdf https://reactjs.org/docs/hello-world.html
Fetching: https://reactjs.org/docs/hello-world.html
Enhancing web page
(node:10750) UnhandledPromiseRejectionWarning: ReferenceError: anchor is not defined
    at Array.from.forEach.img (/usr/lib/node_modules/percollate/src/enhancements.js:12:18)
    at Array.forEach (<anonymous>)
    at imagesAtFullSize (/usr/lib/node_modules/percollate/src/enhancements.js:11:57)
    at cleanup (/usr/lib/node_modules/percollate/index.js:40:2)
    at process._tickCallback (internal/process/next_tick.js:68:7)
(node:10750) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:10750) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Add command `read`

This will start a local server (via serve) that processes the HTMLs and shows a web reader interface.

Transform static HTML folders?

Let's say we have a static website produced by a static site generator.
Could we transform its pages into a single PDF?

Skip / replace Readability

Readability's results are not always perfect, so let's make it flexible enough so that we can take out the readability step, or replace it with some other way of parsing the content.

This means standardizing what we get back from the parser, and we can take Readability as the baseline output.
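
One way to picture the standardized output is a small adapter that every parser result passes through. This is only a sketch under stated assumptions: the names normalizeParsed, parseWith and passthroughParser are hypothetical, and the field set (title, byline, content, excerpt) is taken from what Readability's parse() returns.

```javascript
// Hypothetical normalizer: coerces any parser result into one shape,
// using Readability's output fields as the baseline.
function normalizeParsed(result, url) {
	if (!result || typeof result.content !== 'string') {
		throw new TypeError('Parser must return an object with a string `content`');
	}
	return {
		url: url,
		title: result.title || url,
		byline: result.byline || null,
		excerpt: result.excerpt || null,
		content: result.content
	};
}

// Runs any parser function (html, url) => result and normalizes its output.
function parseWith(parser, html, url) {
	return normalizeParsed(parser(html, url), url);
}

// Example replacement parser that skips content extraction entirely.
const passthroughParser = html => ({ content: html });

console.log(parseWith(passthroughParser, '<p>Hello</p>', 'http://example.com'));
```

With this shape, taking out the Readability step amounts to passing a different parser function.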

SyntaxError: Unexpected token ...

When I run percollate pdf --output as32.pdf https://github.com/danburzo/percollate, I get this error:

SyntaxError: Unexpected token ...
    at createScript (vm.js:74:10)
    at Object.runInThisContext (vm.js:116:10)
    at Module._compile (module.js:533:28)
    at Object.Module._extensions..js (module.js:580:10)
    at Module.load (module.js:503:32)
    at tryModuleLoad (module.js:466:12)
    at Function.Module._load (module.js:458:3)
    at Module.require (module.js:513:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/usr/local/lib/node_modules/percollate/cli.js:5:40)

Screenshots to README

Hi,

Good work on this one!
It would be nice to include a few screenshots on the README page just to get a brief idea of what the output would look like.

Regards.

noUselessHref: provide option to remove all hrefs

#31 introduced some href filtering, but I wonder: do we really need to show those hrefs at all? I think they just make the text harder to read.

Please provide an option to skip all href generation. For now I'm using this hack:

function noUselessHref(doc) {
	Array.from(doc.querySelectorAll(`a`))
		.filter(function(el) {
			return true;
		})
		.forEach(el => el.classList.add('no-href'));
}
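
A possible shape for the requested option, as a sketch only: the hideAll flag and the isUseless predicate are hypothetical names, and the predicate stands in for whatever filtering #31 introduced, which is not shown here.

```javascript
// Hypothetical variant of noUselessHref with an opt-in switch.
// options.hideAll:   when true, every link gets the 'no-href' class.
// options.isUseless: placeholder for the existing per-link filter.
function noUselessHref(doc, options = {}) {
	const hideAll = options.hideAll === true;
	const isUseless = options.isUseless || (() => false);
	Array.from(doc.querySelectorAll('a'))
		.filter(el => hideAll || isUseless(el))
		.forEach(el => el.classList.add('no-href'));
}
```

The default keeps the current behavior, so only users who pass the flag lose the printed hrefs.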

Failed to launch Chrome because of "Running as root without --no-sandbox"

Saving as PDF
(node:24806) UnhandledPromiseRejectionWarning: Error: Failed to launch chrome!
[1012/081529.390835:ERROR:zygote_host_impl_linux.cc(89)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.

TROUBLESHOOTING: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md

at onClose (/usr/lib/node_modules/percollate/node_modules/puppeteer/lib/Launcher.js:339:14)
at Interface.helper.addEventListener (/usr/lib/node_modules/percollate/node_modules/puppeteer/lib/Launcher.js:328:50)
at emitNone (events.js:111:20)
at Interface.emit (events.js:208:7)
at Interface.close (readline.js:368:8)
at Socket.onend (readline.js:147:10)
at emitNone (events.js:111:20)
at Socket.emit (events.js:208:7)
at endReadableNT (_stream_readable.js:1064:12)
at _combinedTickCallback (internal/process/next_tick.js:139:11)

(node:24806) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:24806) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
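
The error message itself names the workaround: Chrome refuses to start as root unless the sandbox is disabled. The sketch below builds launch options that add --no-sandbox only when the process runs as root. The helper name chromeLaunchOptions is hypothetical; the args array is the option puppeteer.launch() accepts for Chrome flags. Disabling the sandbox weakens isolation, so running percollate as a non-root user is the safer fix.

```javascript
// Hypothetical helper: returns options for puppeteer.launch().
// Sandbox flags are added only for uid 0 (root).
function chromeLaunchOptions(uid) {
	const args = uid === 0 ? ['--no-sandbox', '--disable-setuid-sandbox'] : [];
	return { headless: true, args: args };
}

// process.getuid is not available on Windows, hence the guard.
const uid = typeof process.getuid === 'function' ? process.getuid() : -1;
const options = chromeLaunchOptions(uid);
console.log(options);
// The options would then be passed along as: puppeteer.launch(options)
```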

PDF: Add a Table of Contents to the metadata

A PDF generated from many web pages would benefit from a Table of Contents, implemented as PDF bookmarks. We'll probably need to post-process the PDF with something like HummusJS to write the TOC. (Also, I'd appreciate if someone with more experience would explain whether its license is compatible with our MIT License)

Related: #25

In-page anchors on github.com pages don't work

For example:

percollate pdf https://github.com/danburzo/percollate

Will result in a PDF where the links in the Table of Contents don't work:

danburzopercollate.pdf

(produced on macOS / [email protected])

I may be dense, but I can't tell how the anchors work in browsers in the first place 😰 (Later edit: the behavior is dependent on JavaScript, of course 😄)

Use cases for --stylesheet vs. --css handling in the HTML template

Initially, the HTML template received the path for the stylesheet (either the default one, or a custom one provided with the --stylesheet option):

<head>
<meta charset="utf-8">
<title>Percollate</title>
<link rel='stylesheet' media='all' href="{{ stylesheet }}"/>
</head>

With the introduction of the --css option, I changed it to:

<style type='text/css'>
{{ style }}
</style>

And deprecated the passing of the stylesheet property to the template.

However, I think I might have missed a valid use-case for an external stylesheet, namely being able to reference outside resources (images, web fonts) in the custom CSS.

This issue outlines some use cases, to make sure the final solution covers all of them elegantly:

  • Override just the page size, margins, font sizes
  • Override the default stylesheet with a custom one
  • Use local web fonts in the custom stylesheet

(Adding to this list as more use-cases arise)

Use largest available size for images in Wikipedia articles

The idea of the imagesAtFullSize enhancement is to get the largest available image from blogs using Blogspot, WordPress, and the like:

function imagesAtFullSize(doc) {
	/*
		Replace:
			<a href='original-size.png'>
				<img src='small-size.png'/>
			</a>
		With:
			<img src='original-size.png'/>
	*/
	Array.from(doc.querySelectorAll('a > img:only-child')).forEach(img => {
		let anchor = img.parentNode;
		let original = anchor.href;
		// only replace if the HREF matches an image file
		if (original.match(/\.(png|jpg|jpeg|gif|svg)$/)) {
			img.setAttribute('src', original);
			anchor.parentNode.replaceChild(img, anchor);
		}
	});
}

However, Wikipedia images are an exception:

<a href="/wiki/File:Perkulator.jpg" class="image">
  <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/250px-Perkulator.jpg" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/375px-Perkulator.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/500px-Perkulator.jpg 2x" data-file-width="1944" data-file-height="2592" width="250" height="333">
</a>

They link to what looks like an image file, but is in fact a HTML page for that image. How can we handle this situation gracefully?
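
One option that avoids following the link at all is to read the srcset attribute, which in the markup above already lists larger renditions. The sketch below picks the candidate with the highest descriptor. It is an assumption-laden illustration, not a proposed patch: largestFromSrcset is a hypothetical name, it assumes all descriptors are of the same kind (all x or all w), and it splits on commas, so URLs containing commas would be mishandled. It also yields the largest thumbnail, not the original file.

```javascript
// Hypothetical helper: returns the URL of the srcset candidate with the
// largest numeric descriptor, or null if the attribute has no candidates.
function largestFromSrcset(srcset) {
	let best = null;
	String(srcset || '')
		.split(',')
		.forEach(candidate => {
			// Each candidate looks like "<url> <descriptor>", e.g. "a.jpg 2x".
			const [url, descriptor = '1x'] = candidate.trim().split(/\s+/);
			const size = parseFloat(descriptor);
			if (url && !Number.isNaN(size) && (!best || size > best.size)) {
				best = { url, size };
			}
		});
	return best ? best.url : null;
}
```

Inside imagesAtFullSize, this could serve as a fallback when the parent link does not pass the image-file check or points to a page such as /wiki/File:….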

Default save location?

Sometimes I might just want to create PDFs from a list of web pages, using the page title as the default file name (using the page title is the default printing behavior of Chrome I believe).

Currently, if I run percollate without specifying --output, it claims to have saved the pdf, but I can't find it in the folder where I executed the command.

Can it just save the web page to the current folder using its title as the filename, when an --output flag is omitted?
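
A sketch of how a default file name could be derived from the page title, assuming a simple ASCII slug is acceptable. The function name filenameFromTitle, the 100-character cap and the 'untitled' fallback are all illustrative choices, not existing percollate behavior.

```javascript
const path = require('path');

// Hypothetical helper: turns a page title into a safe file name.
function filenameFromTitle(title, ext = 'pdf') {
	const slug = String(title || '')
		.normalize('NFKD')                 // split accented letters into base + mark
		.replace(/[\u0300-\u036f]/g, '')   // drop the combining marks
		.toLowerCase()
		.replace(/[^a-z0-9]+/g, '-')       // collapse everything else into dashes
		.slice(0, 100)
		.replace(/^-+|-+$/g, '');          // trim leading and trailing dashes
	return (slug || 'untitled') + '.' + ext;
}

// Resolve against the folder where the command was executed.
console.log(path.join(process.cwd(), filenameFromTitle('Hello, World!')));
```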

Allow proxy parameters

got accepts proxy parameters.

Either allow command line parameters for proxy to be passed over to got, or even better, honor http_proxy, https_proxy, no_proxy env variables.
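
Honoring the environment variables could start with a small resolver like the one below. It is a sketch under simplifying assumptions: proxyFor is a hypothetical name, no_proxy entries are matched only as exact hosts, domain suffixes or *, and ports and CIDR ranges are ignored. How the resulting proxy URL is handed to got is left out on purpose.

```javascript
// Hypothetical helper: returns the proxy URL to use for targetUrl,
// or null when no proxy applies.
function proxyFor(targetUrl, env = process.env) {
	const { protocol, hostname } = new URL(targetUrl);
	const host = hostname.toLowerCase();

	// no_proxy is a comma-separated list of hosts or domain suffixes.
	const rules = (env.no_proxy || env.NO_PROXY || '')
		.split(',')
		.map(rule => rule.trim().toLowerCase().replace(/^\./, ''))
		.filter(Boolean);
	const bypass = rules.some(
		rule => rule === '*' || host === rule || host.endsWith('.' + rule)
	);
	if (bypass) {
		return null;
	}

	const key = protocol === 'https:' ? 'https_proxy' : 'http_proxy';
	return env[key] || env[key.toUpperCase()] || null;
}

console.log(proxyFor('http://example.com'));
```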

TOC with multiple levels

It would be great to have an option to feed not just a plain list of URLs, but a tabbed, spaced, or somehow formatted (see below) file with captions and URLs to form a multilevel TOC in a resulting PDF.

A sample of such an input file:

<h1><a href="http://url1.com">Level 1 caption</a></h1>
	<h2><a href="http://url11.com">Level 1-1 caption</a></h2>
		<h3><a href="http://url111.com">Level 1-1-1 caption</a></h3>
		<h3><a href="http://url112.com">Level 1-1-2 caption</a></h3>
	<h2><a href="http://url12.com">Level 1-2 caption</a></h2>
		<h3><a href="http://url121.com">Level 1-2-1 caption</a></h3>
		<h3><a href="http://url122.com">Level 1-2-2 caption</a></h3>
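
To show that the sample format carries enough information for a multilevel TOC, here is a sketch that turns it into a tree. The function name parseToc and the node shape are assumptions, and the regular expression only understands the exact pattern used in the sample (a heading wrapping a single link with a double-quoted href); a real implementation would likely use an HTML parser.

```javascript
// Hypothetical parser: builds a nested TOC from <hN><a href="...">...</a></hN> lines.
function parseToc(input) {
	const root = { level: 0, children: [] };
	const stack = [root]; // path from the root to the most recent node
	const pattern = /<h([1-6])>\s*<a href="([^"]*)">([^<]*)<\/a>\s*<\/h\1>/g;
	let match;
	while ((match = pattern.exec(input)) !== null) {
		const node = {
			level: Number(match[1]),
			url: match[2],
			caption: match[3].trim(),
			children: []
		};
		// Climb back up until the top of the stack is a shallower heading.
		while (stack[stack.length - 1].level >= node.level) {
			stack.pop();
		}
		stack[stack.length - 1].children.push(node);
		stack.push(node);
	}
	return root.children;
}
```

The nesting comes from the heading level alone, so the tab indentation in the input file is optional.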
