
andrejgajdos / link-preview-generator


Get preview data (a title, description, image, and domain name) from a URL. The library uses the Puppeteer headless browser to scrape the website.

License: MIT License

JavaScript 100.00%
node node-js url preview link-preview open-graph javascript

link-preview-generator's Introduction

link-preview-generator



BLOG POST and DEMO

Install

$ npm install link-preview-generator

Usage

const linkPreviewGenerator = require("link-preview-generator");

// `await` needs an async context in a CommonJS script, so wrap the call.
(async () => {
  const previewData = await linkPreviewGenerator(
    "https://www.youtube.com/watch?v=8mqqY2Ji7_g"
  );
  console.log(previewData);
  /*
  {
    title: 'Kiteboarding: Stylish Backroll in 4 Sessions - Ride with Blake: Vlog 20',
    description: 'The backroll is a staple in your kiteboarding trick ' +
      'bag. With a few small adjustments, you can really ' +
      'improve your style and make this basic your own. ' +
      'Sessio...',
    domain: 'youtube.com',
    img: 'https://i.ytimg.com/vi/8mqqY2Ji7_g/hqdefault.jpg',
    favicon: 'https://www.youtube.com/s/desktop/d3411c39/img/favicon.ico'
  }
  */
})();

API

linkPreviewGenerator(url, puppeteerArgs?, puppeteerAgent?)

Accepts the URL to scrape and two optional parameters: puppeteerArgs (browser options) and puppeteerAgent (the browser user agent).

Returns a promise that resolves to an object with the preview data for the URL.

url

Type: string

The URL to scrape.

puppeteerArgs

Type: array

Options to set on the Chrome browser.

puppeteerAgent

Type: string

Specific user agent to use.
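
The positional parameters above can be awkward to read at a call site. As a minimal sketch (the helper name previewWithOptions and the option names args and agent are hypothetical, not part of the library), a thin wrapper can map a named-options object onto the documented signature. The library function is passed in as generate, so the wrapper itself has no dependencies.

```javascript
// Hypothetical helper: maps a named-options object onto the positional
// signature linkPreviewGenerator(url, puppeteerArgs?, puppeteerAgent?).
// `generate` is expected to be the function exported by link-preview-generator.
function previewWithOptions(generate, url, options = {}) {
  const { args = [], agent } = options;
  // Passing undefined for the agent leaves the library free to apply
  // whatever default user agent it defines.
  return generate(url, args, agent);
}

// Usage (needs the package installed and network access):
// const linkPreviewGenerator = require("link-preview-generator");
// previewWithOptions(linkPreviewGenerator, "https://example.com", {
//   args: ["--disable-dev-shm-usage"],
//   agent: "Mozilla/5.0 (compatible; ExamplePreviewBot/1.0)", // made-up agent string
// }).then(console.log);
```

The wrapper only rearranges arguments, so it returns the same promise the library returns.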

Troubleshooting

If you need to deploy this library (Puppeteer) on Heroku, follow these steps.

If you want to run this library from within a Docker container:

  1. Pass the following Puppeteer arguments as the second argument:
     // Required for Docker version of Puppeteer
     '--no-sandbox',
     '--disable-setuid-sandbox',
     // This will write shared memory files into /tmp instead of /dev/shm,
     // because Docker’s default for /dev/shm is 64MB
     '--disable-dev-shm-usage'
  2. Make sure your Docker image has all the dependencies that headless Chrome needs, or use the buildkite/puppeteer image directly.
  3. Done.

License

MIT © Andrej Gajdos

link-preview-generator's People

Contributors

adrianfdez469, andrejgajdos, arminfro, kliuiev-io, koukikitamura, lucasleray, tgdn, zewa666


link-preview-generator's Issues

Error: Failed to launch the browser process!

I can't use this library; it fails with the following error:

/root/bot/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193
            reject(new Error([
                   ^

Error: Failed to launch the browser process!
[0729/104939.603734:ERROR:zygote_host_impl_linux.cc(90)] Running as root without --no-sandbox is not supported. See https://crbug.com/638180.


TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

    at onClose (/root/bot/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:193:20)
    at Interface.<anonymous> (/root/bot/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:183:68)
    at Interface.emit (node:events:539:35)
    at Interface.close (node:internal/readline/interface:529:10)
    at Socket.onend (node:internal/readline/interface:258:10)
    at Socket.emit (node:events:539:35)
    at endReadableNT (node:internal/streams/readable:1345:12)
    at processTicksAndRejections (node:internal/process/task_queues:83:21)

Node.js v17.9.0
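
The log above states that Chromium will not run as root without --no-sandbox. A minimal sketch of one way to handle that is to add the sandbox flags only when the process really is root and pass the result as the second argument (puppeteerArgs). The helper name sandboxArgs is hypothetical. Keep in mind that disabling the sandbox reduces browser isolation, so running as a non-root user is the safer option where possible.

```javascript
// Hypothetical helper: returns the flags Chromium needs when started as root,
// and an empty list otherwise.
function sandboxArgs(
  uid = typeof process.getuid === "function" ? process.getuid() : undefined
) {
  // uid 0 is root on POSIX systems; process.getuid does not exist on Windows,
  // in which case uid stays undefined and no flags are added.
  return uid === 0 ? ["--no-sandbox", "--disable-setuid-sandbox"] : [];
}

// Usage (needs the package installed and network access):
// const linkPreviewGenerator = require("link-preview-generator");
// linkPreviewGenerator("https://example.com", sandboxArgs()).then(console.log);
```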

<h2> tags not used

This is a minor bug, but h2 tags are never used as a title.
See the code below from index.js, where both branches query the h1 element:

const h1 = document.querySelector("h1").innerHTML;
if (h1 != null && h1.length > 0) {
  return h1;
}
const h2 = document.querySelector("h1").innerHTML;
if (h2 != null && h2.length > 0) {
  return h2;
}

I'm not sure how useful it is to get h2 tags, but I thought it couldn't hurt to fix it.
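
A possible fix, sketched under the assumption that the surrounding code only needs the first non-empty heading: loop over the selectors so that the second lookup really queries h2, and guard against querySelector returning null (the quoted code would throw if the page has no h1). The function name firstHeading is hypothetical.

```javascript
// Returns the innerHTML of the first non-empty heading, trying h1 before h2.
// `doc` is anything with a querySelector method, such as the page's document.
function firstHeading(doc) {
  for (const selector of ["h1", "h2"]) {
    const element = doc.querySelector(selector);
    // querySelector returns null when nothing matches, so check before reading.
    const html = element ? element.innerHTML : "";
    if (html && html.length > 0) {
      return html;
    }
  }
  return null; // no usable heading found
}
```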

Some links that throw errors

Hi,

thanks for your work so far.
I'm having trouble with quite a few links and thought you might have a look. Thanks!

https://en.bem.info/methodology/
(node:300075) UnhandledPromiseRejectionWarning: TypeError [ERR_INVALID_URL]: Invalid URL: /methodology/

https://www.der-informationsdesigner.de/agentur-blog/allgemein/persuasive-design-psychologie-im-webdesign
(node:300075) UnhandledPromiseRejectionWarning: Error: Evaluation failed: Error [ERR_TLS_CERT_ALTNAME_INVALID]: Hostname/IP does not match certificate's altnames: Host: der-informationsdesigner.de. is not in the cert's altnames: DNS:*.goserver.host, DNS:goserver.host

https://web.dev/one-line-layouts/
(node:300075) UnhandledPromiseRejectionWarning: TimeoutError: Navigation timeout of 30000 ms exceeded
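
The first error names "/methodology/" as the invalid URL, which suggests (as an assumption, since the failing line is not shown) that a relative path found on the page was handed to the URL constructor without a base. A minimal sketch of resolving such a path against the page URL follows; the helper name resolveAgainstPage is hypothetical.

```javascript
// Resolve a possibly relative path against the URL of the page it came from.
// Returns the absolute URL as a string, or null when it cannot be resolved.
function resolveAgainstPage(candidate, pageUrl) {
  try {
    // The two-argument form accepts both relative and absolute candidates.
    return new URL(candidate, pageUrl).href;
  } catch (error) {
    return null;
  }
}

console.log(resolveAgainstPage("/methodology/", "https://en.bem.info/methodology/"));
// prints "https://en.bem.info/methodology/"
```

The certificate and timeout errors in the same report are separate problems that this sketch does not address.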

Some URLs result in TypeError: Cannot read property 'src' of undefined

Thanks for sharing this handy preview generator. It works on most links, but a few are failing:

https://www.bloomberg.com/news/articles/2020-03-27/trump-threatens-to-force-gm-to-move-faster-on-ventilators

Error: Evaluation failed: TypeError: Cannot read property 'src' of undefined
at puppeteer_evaluation_script:50:22
at ExecutionContext._evaluateInternal ((...)\node_modules\puppeteer\lib\ExecutionContext.js:122:13)
at process._tickCallback (internal/process/next_tick.js:68:7)
-- ASYNC --
at ExecutionContext. ((...)\node_modules\puppeteer\lib\helper.js:111:15)
at DOMWorld.evaluate ((...)\node_modules\puppeteer\lib\DOMWorld.js:112:20)
at process._tickCallback (internal/process/next_tick.js:68:7)
-- ASYNC --
at Frame. ((...)\node_modules\puppeteer\lib\helper.js:111:15)
at Page.evaluate ((...)\node_modules\puppeteer\lib\Page.js:860:43)
at Page. ((...)\node_modules\puppeteer\lib\helper.js:112:23)
at getImg ((...)\node_modules\link-preview-generator\index.js:18:26)
at module.exports ((...)\node_modules\link-preview-generator\index.js:174:19)
at process._tickCallback (internal/process/next_tick.js:68:7)
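
The message "Cannot read property 'src' of undefined" suggests that the image lookup reads .src from a list entry that does not exist, for example the first item of an empty collection. That is an assumption, because the evaluated script itself is not shown. A defensive version of such a lookup could look like this; the function name firstImageSrc is hypothetical.

```javascript
// Pick the first usable image source from an array-like collection of
// elements, returning null instead of throwing when nothing qualifies.
function firstImageSrc(images) {
  const list = Array.from(images || []);
  const withSrc = list.find(
    (img) => img && typeof img.src === "string" && img.src.length > 0
  );
  return withSrc ? withSrc.src : null;
}
```

A caller can then treat null as "no image available" and still return the title and description.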

Error

I am using it on Windows 10 with Node.js and got the following error while the CPU was under heavy load:

(node:5176) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 exit listeners added. Use emitter.setMaxListeners() to increase limit
(node:5176) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 SIGINT listeners added. Use emitter.setMaxListeners() to increase limit
(node:5176) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 SIGTERM listeners added. Use emitter.setMaxListeners() to increase limit
(node:5176) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 SIGHUP listeners added. Use emitter.setMaxListeners() to increase limit
(node:5176) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'innerHTML' of null
at puppeteer_evaluation_script:14:44
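
The warnings report 11 exit and signal listeners on the process, one more than Node's default limit of 10. A plausible reading (an assumption, since the calling code is not shown) is that more than ten previews were started at once, each launching its own browser. A minimal sketch of capping the number of previews in flight follows; the helper name mapWithLimit is hypothetical and limit is assumed to be at least 1.

```javascript
// Run an async task over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as the input.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}

// Usage (needs the package installed and network access):
// const linkPreviewGenerator = require("link-preview-generator");
// mapWithLimit(urls, 3, (url) => linkPreviewGenerator(url)).then(console.log);
```

If any task rejects, the returned promise rejects with that error, so wrap the task in try/catch when one failing URL should not abort the batch.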

Deploy to Amazon AWS

This is all very new to me. I saw you deployed the demo to Heroku, but it seems that to use this as an API for a mobile app in production, it would have to be deployed to someplace like AWS. I looked at the documentation on npm and couldn't find out how this would be used to create a business-use API. Can you point me in the right direction?

Cannot get preview image of Amazon URL

Hello, I have been trying to get preview images for product URLs. I use this package in my project and it works with eBay, Mercadolibre, and Newegg, but for Amazon it does not return an image. Example of an Amazon preview image:

image-example-amazon

I would appreciate your help, since most of my product URLs come from this website. Thanks!

Use in browser

Is there a way to use this in a front-end application? E.g. using webpack and ES6 imports? The current package seems to only work in node.

Problem with WebSocketTransport

This dependency was not found:
* ws in ./node_modules/puppeteer/lib/WebSocketTransport.js
To install it, you can run: npm install --save ws

I am using it in a Vue.js project.
Even after installing ws, I still get the same error!

Is the latest code working?

Demos are not working.

After running:

const linkPreviewGenerator = require("link-preview-generator");
const previewData = linkPreviewGenerator("https://www.youtube.com/watch?v=8mqqY2Ji7_g");
console.log(previewData);
Promise { <pending> }

It hangs here:

(node:12888) UnhandledPromiseRejectionWarning: TimeoutError: Navigation timeout of 30000 ms exceeded at C:\lpg\node_modules\puppeteer\lib\cjs\puppeteer\common\LifecycleWatcher.js:106:111

Any ideas of how to make it work?
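
Two things in the snippet above can be addressed independently of the timeout itself: the call is not awaited, which is why console.log prints a pending promise, and the rejection is not caught, which produces the UnhandledPromiseRejectionWarning. A minimal sketch that awaits the result and handles failures follows. The function name printPreview is hypothetical, and the library function is passed in as generate.

```javascript
// Await the preview and report failures instead of leaving the promise
// rejection unhandled. Returns the preview data, or null on failure.
async function printPreview(generate, url) {
  try {
    const previewData = await generate(url);
    console.log(previewData);
    return previewData;
  } catch (error) {
    console.error(`Preview failed for ${url}: ${error.message}`);
    return null;
  }
}

// Usage (needs the package installed and network access):
// const linkPreviewGenerator = require("link-preview-generator");
// printPreview(linkPreviewGenerator, "https://www.youtube.com/watch?v=8mqqY2Ji7_g");
```

This does not make a slow page load any faster; it only turns the timeout into a handled error.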

Not working with links that have CAPTCHA

Error while getting preview for google meet link

Thanks for the efforts and work you gave to make this.

I'm facing an issue when I pass a Google Meet link as the url to the function. The error is:
(node:15759) [fs-extra-WARN0003] Warning: fs.realpath.native is not a function. Is fs being monkey-patched?
Can you please provide any insight into this?

It is throwing "URL is not defined" error in Docker

I have added this in one of my APIs with this logic

const { link } = req.body;
try {
  const previewData = await linkPreviewGenerator(link, [
    '--no-sandbox',
    '--disable-setuid-sandbox'
  ]);
  return res.json(previewData);
} catch (error) {
  console.error(error);
  return res.status(500).send(error.message);
}

In the Dockerfile, I have added all the dependencies Puppeteer requires to launch Chrome, with this code:

RUN apt-get update \
    && apt-get install -yq --no-install-recommends \
	ca-certificates fonts-liberation gconf-service libappindicator1 \
	libasound2 libatk-bridge2.0-0 libatk1.0-0 libc6 libcairo2 libcups2 \
	libdbus-1-3 libexpat1 libfontconfig1 libgbm1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 \
	libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 \
	libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 \
	libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 \
	libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 lsb-release wget xdg-utils

But when I call the API, it throws "URL is not defined".

I would love it if you could help me with this.
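
One possible cause, not confirmed by anything in this report, is a base image with a Node.js release old enough that URL is not available as a global (the global was added in Node.js 10). A minimal sketch that sidesteps the global by importing the class from the built-in url module follows. The helper name domainOf is hypothetical, and stripping a leading "www." is an assumption based on the domain value shown in the README example.

```javascript
// Import URL explicitly instead of relying on the global being defined.
const { URL } = require("url");

// Hypothetical helper: returns the host name of a link without a leading "www.".
function domainOf(link) {
  return new URL(link).hostname.replace(/^www\./, "");
}

console.log(domainOf("https://www.youtube.com/watch?v=8mqqY2Ji7_g"));
// prints "youtube.com"
```

Checking the output of node --version inside the container is a quick way to rule this cause in or out.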

Evaluation failed: ReferenceError: uri is not defined

Thanks for your library. It works well, just this link: https://yarnpkg.com doesn't.

Here is some of the stack trace:
Error: Evaluation failed: ReferenceError: uri is not defined
    at <anonymous>:48:37
    at Array.forEach (<anonymous>)
    at <anonymous>:46:14
    at ExecutionContext._evaluateInternal (./node_modules/puppeteer/lib/ExecutionContext.js:122:13)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:94:5)
    at async ExecutionContext.evaluate (./node_modules/puppeteer/lib/ExecutionContext.js:48:12)
    at async getImg (./node_modules/link-preview-generator/index.js:18:15)
    at async module.exports (./node_modules/link-preview-generator/index.js:176:13)

I'm not sure if it's link-preview-generator or puppeteer which needs a fix. The error also appears in the provided heroku demo

Unable to get linkedin

image

The above image shows that the basic linkedin page isn't accessible.
Is this because they implemented some anti-scraping technique?

Thx!

Get a favicon (feature request)

I want to get a favicon as preview data. Can you merge a PR if I implement the following?

// index.js
const getFavicon = async (page, uri) => { // New
  ...
  return favicon
}

module.exports = async (
  uri,
  puppeteerArgs = [],
  puppeteerAgent = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
  executablePath
) => {
  puppeteer.use(pluginStealth());
  ...
  const obj = {};
  obj.title = await getTitle(page);
  obj.description = await getDescription(page);
  obj.domain = await getDomainName(page, uri);
  obj.img = await getImg(page, uri);
  obj.favicon = await getFavicon(page, uri); // New

  await browser.close();
  return obj;
};
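
The body of getFavicon is elided in the proposal above. As a minimal sketch of what the lookup could do (not the implementation that was eventually merged), the page's icon link can be read and resolved against the page URL, falling back to the conventional /favicon.ico location. The function name pickFavicon is hypothetical; it takes a document-like object so that it can run inside the page or be exercised with a stub.

```javascript
// Hypothetical sketch of a favicon lookup.
// `doc` is anything with a querySelector method, such as the page's document.
function pickFavicon(doc, pageUrl) {
  // rel~="icon" matches both rel="icon" and rel="shortcut icon".
  const link = doc.querySelector('link[rel~="icon"]');
  const href = link && link.getAttribute("href");
  // Fall back to /favicon.ico when the page declares no icon link.
  return new URL(href || "/favicon.ico", pageUrl).href;
}
```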

https://link-preview-generator.herokuapp.com/ down

Hey,

First of all, thank you for this great library.
I use your demo website to generate previews for my website, and it seems to have been down for a couple of days. The website is not really part of this repository, but this was a way to notify you.

Alex

amazon images are coming with review and rating

I am trying to generate a preview with an image for an Amazon product URL.
Your library has been very helpful in getting the product image, but the review and rating also appear in it, which I do not need.
Is there any workaround for that?
image

I just want the product image only.
