wikimedia / html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)

License: MIT License

JavaScript 69.28% HTML 30.72%

javascript nodejs node-module metadata-extraction metadata-extractor web-scraping web-scraper

html-metadata's Introduction

html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)

The aim of this library is to be a comprehensive source for extracting all metadata embedded in HTML. It currently supports Schema.org microdata through a third-party library, has native implementations for BEPress, Dublin Core, Highwire Press, JSON-LD, Open Graph, Twitter, EPrints, PRISM, and COinS, and extracts some general metadata that doesn't belong to a particular standard (for instance, the content of the title tag, or meta description tags).

Support is planned for RDFa, AGLS, and other as-yet-unheard-of metadata types. Contributions and requests for other metadata types are welcome!

Install

npm install html-metadata

Usage

Promise-based:

var scrape = require('html-metadata');

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

scrape(url).then(function(metadata){
	console.log(metadata);
});

Callback-based:

var scrape = require('html-metadata');

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

scrape(url, function(error, metadata){
	console.log(metadata);
});

The scrape method used here invokes the parseAll() method, which runs every parsing method registered in metadataFunctions(). These methods are also available for use separately, for example:

Promise-based:

var cheerio = require('cheerio');
var preq = require('preq'); // Promisified request library
var parseDublinCore = require('html-metadata').parseDublinCore;

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

preq(url).then(function(response){
	var $ = cheerio.load(response.body); // declare $ locally instead of leaking a global
	return parseDublinCore($).then(function(metadata){
		console.log(metadata);
	});
});

Callback-based:

var cheerio = require('cheerio');
var request = require('request');
var parseDublinCore = require('html-metadata').parseDublinCore;

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

request(url, function(error, response, html){
	var $ = cheerio.load(html); // declare $ locally instead of leaking a global
	parseDublinCore($, function(error, metadata){
		console.log(metadata);
	});
});

Options object:

You can also pass an options object as the first argument containing extra parameters. Some websites require the user-agent or cookies to be set in order to get the response.

var scrape = require('html-metadata');
var request = require('request');

var options = {
	url: "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/",
	jar: request.jar(), // Cookie jar
	headers: {
		'User-Agent': 'webscraper'
	}
};

scrape(options, function(error, metadata){
	console.log(metadata);
});

The method parseGeneral obtains the following general metadata:

<link rel="apple-touch-icon" href="" sizes="" type="">
<link rel="icon" href="" sizes="" type="">
<meta name="author" content="">
<link rel="author" href="">
<link rel="canonical" href="">
<meta name="description" content="">
<link rel="publisher" href="">
<meta name="robots" content="">
<link rel="shortlink" href="">
<title></title>
<html lang="en">
<html dir="rtl">

Tests

npm test runs the mocha tests

npm run-script coverage runs the tests and reports code coverage

Contributing

Contributions welcome! All contributions should use bluebird promises instead of callbacks, and be .nodeify()-ed in index.js so the functions can be used as either callbacks or Promises.
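The callback-or-Promise convention described above can be illustrated without bluebird. What follows is only a minimal sketch built on native Promises; dualStyle and parseExample are hypothetical names made up for this illustration and are not part of html-metadata, which relies on bluebird's .nodeify() instead.

```javascript
// Hypothetical helper (not part of html-metadata): wraps a Promise-returning
// function so it can also be called with a trailing callback(error, result).
function dualStyle(promiseFn) {
	return function () {
		var args = Array.prototype.slice.call(arguments);
		var last = args[args.length - 1];
		// Only an *extra* trailing function counts as the callback, so a
		// callable argument such as a cheerio $ object is not mistaken for one.
		var hasCallback = args.length > promiseFn.length && typeof last === 'function';
		var callback = hasCallback ? args.pop() : null;
		var promise = Promise.resolve().then(function () {
			return promiseFn.apply(null, args);
		});
		if (!callback) {
			return promise; // Promise style
		}
		// Callback style: report the outcome in Node's (error, result) form.
		promise.then(
			function (result) { callback(null, result); },
			function (error) { callback(error); }
		);
	};
}

// Hypothetical parser used only to demonstrate both calling styles.
var parseExample = dualStyle(function (title) {
	return Promise.resolve({ title: title });
});
```

Under these assumptions, parseExample('a') returns a Promise, while parseExample('a', function (error, result) { ... }) delivers the same result through the callback.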

html-metadata's Issues

Push to NPM

The version of this library on npm is still 1.2.1, while the version here is 1.3.0.

Cannot read property 'connectTimeoutTimer' of undefined

I'm getting a strange error that seems to be caused by some conflict between this module and https://github.com/desmondmorris/node-twitter

TypeError: Uncaught error: Cannot read property 'connectTimeoutTimer' of undefined
    at ConnectTimeoutAgent.createSocket (/path/to/project/node_modules/html-metadata/node_modules/preq/index.js:21:46)
    at ConnectTimeoutAgent.Agent.addRequest (_http_agent.js:157:10)
    at new ClientRequest (_http_client.js:160:16)
    at Object.exports.request (http.js:31:10)
    at Object.exports.request (https.js:199:15)
    at Request.start (/path/to/project/node_modules/twitter/node_modules/request/request.js:744:32)
    at Request.write (/path/to/project/node_modules/twitter/node_modules/request/request.js:1421:10)
    at end (/path/to/project/node_modules/twitter/node_modules/request/request.js:551:18)
    at Immediate.<anonymous> (/path/to/project/node_modules/twitter/node_modules/request/request.js:580:7)
    at runCallback (timers.js:637:20)
    at tryOnImmediate (timers.js:610:5)
    at processImmediate [as _immediateCallback] (timers.js:582:5)

The strange thing is that this error is thrown when I make a request with node-twitter. It seems html-metadata never actually needs to be used; the error is thrown if it is simply require-ed.

Sorry if this isn't the right place to post this, but I figured I would start here since I'm not actually sure where the source of this problem is.

Missing bracket in the demo code on main page

The closing bracket for the "options" variable is missing, only "headers" is closed correctly in the example code for the Options object:

var options = {
url: "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/",
jar: request.jar(), // Cookie jar
headers: {
'User-Agent': 'webscraper'
}

cheerio

⚠️ Self-closing tags get corrupted 🚨

The library doesn't support html5 tags (e.g. self-closing span).

When parsing the following:

<span itemprop="price" content="139.90" />

foo

bar

It keeps adding "foo ... bar" to the price value until it finds a closing </span> tag.

The issue is in chtml, which replaces /> with >.

Scraping foreign language site returns gibberish in metadata

When scraping a URL in a language other than English, such as Russian, the title and description in the metadata are returned as gibberish.

Here is a link example: https://pikabu.ru/story/privet_fsbshniki_mne_drug_byivshiy_sotrudnik_fsb_rasskazal_chto_vyi_tut_sidite__2821880

preq dependency should be updated

Current preq version is "preq": "0.4.10"
Should be updated to version 0.5.0

Licence needed

I'd suggest MIT given it's a service.

using cache to store metadata for frequently used urls

Thanks for this module.

I wanted to suggest adding an optional caching layer built into the module to make it more suitable for such use cases.

This would also help avoid hitting the target URL multiple times.
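The caching idea suggested here could be prototyped outside the module. The sketch below is one possible shape, not an existing html-metadata feature: withCache is a hypothetical helper that wraps any Promise-returning scrape function and remembers results per URL for a caller-chosen number of milliseconds.

```javascript
// Hypothetical helper (not part of html-metadata): caches the Promise
// returned by scrapeFn for each URL until maxAgeMs has elapsed.
function withCache(scrapeFn, maxAgeMs) {
	var cache = new Map(); // url -> { expires: number, promise: Promise }
	return function (url) {
		var now = Date.now();
		var entry = cache.get(url);
		if (entry && entry.expires > now) {
			return entry.promise; // still fresh: reuse without a new request
		}
		var promise = Promise.resolve().then(function () {
			return scrapeFn(url);
		});
		cache.set(url, { expires: now + maxAgeMs, promise: promise });
		// Failed lookups are dropped so the next call retries the request.
		promise.catch(function () {
			var current = cache.get(url);
			if (current && current.promise === promise) {
				cache.delete(url);
			}
		});
		return promise;
	};
}
```

Assuming html-metadata is installed, withCache(require('html-metadata'), 60 * 1000) would give a scrape function that contacts each URL at most once per minute. Because the Promise itself is cached, concurrent calls for the same URL share a single request.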

Meta tag search should be case insensitive

I noticed that on some websites the meta tag names begin with an upper-case letter, e.g.
<meta name="Description" content=

Currently the library only matches lower-case names; case-insensitive matching is not supported. I tried to write my own, but I thought this would be a simple fix that you could release quickly. I think the required change would be in the function

exports.parseGeneral = BBPromise.method(function(html), where the selector would become meta[name="description" i]; similar case-insensitive matching should be applied to all meta tags.

Please look into this as soon as you can: the tool captures meta information in most cases, but on websites like www.microsoft.com the "description" tag text is ignored, and there may be cases where other tags are ignored as well even though they are present in the page.

Thanks and great work
-Saood

Not returning Open graph details in certain websites

I'm scraping this URL and it isn't returning the openGraph property, even though it exists in the head's meta tags, as you can see in the attached image.
https://fussball.news/artikel/allagui-lobt-pauli-paradebeispiel-dafuer-wie-man-eine-krise-bewaeltigen-kann/

general:Object {icons: Array(2), canonical: "https://fussball.news/artikel/allagui-lobt-pauli-p…", shortlink: "https://wp.me/pa5NPE-Grv", …}
canonical:"https://fussball.news/artikel/allagui-lobt-pauli-paradebeispiel-dafuer-wie-man-eine-krise-bewaeltigen-kann/"
icons:Array(2) [Object, Object]
lang:"de-DE"
shortlink:"https://wp.me/pa5NPE-Grv"
title:"Allagui lobt Pauli: "Paradebeispiel dafür, wie man eine Krise bewältigen kann" | fussball.news"
__proto__:Object {constructor: , __defineGetter__: , __defineSetter__: , …}
jsonLd:Object {@context: "https://schema.org", @type: "Organization", url: "https://fussball.news/", …}
schemaOrg:Object {items: Array(1)}

Error building with Webpack.

I get the error:

SyntaxError: Unexpected character '#' (1:0) on index.js.

That line is most likely unneeded?

JSON-LD content embedded in CDATA not parsed

Trying to parse a Recipe described in a JSON-LD content embedded in CDATA returns nothing.

The sample:

<script type="application/ld+json">
    //<![CDATA[
    {"@context":"http://schema.org","@type":"Recipe","name":"Ile flottante","recipeCategory":"\u00eele flottante","image":"https://image.afcdn.com/recipe/20130408/34776_w1024h768c1cx256cy192.jpg","datePublished":"2003-03-31T07:21:00+02:00","prepTime":"PT15M","cookTime":"PT30M","totalTime":"PT45M","recipeYield":"4 personnes","recipeIngredient":["60 cl lait","60 g sucre en poudre","1 gousse vanille","5 oeuf","1 pinc\u00e9e sel","130 g sucre glace","60 g amande"],"recipeInstructions":[{"@type":"HowToStep","text":"Casser le \u0153ufs en s\u00e9parant les blancs des jaunes."},{"@type":"HowToStep","text":"Monter les blancs en neige en y ajoutant une pinc\u00e9e de sel. Mettre petit \u00e0 petit le sucre glace."},{"@type":"HowToStep","text":"Mettre le m\u00e9lange dans un moule \u00e0 charlotte recouvert d'aluminium et beurr\u00e9."},{"@type":"HowToStep","text":"Cuire au bain-marie dans un four \u00e0 210\u00b0C (thermostat 7) pendant 25 \u00e0 30 minutes."},{"@type":"HowToStep","text":"Faire une cr\u00e8me anglaise en chauffant le lait avec une gousse de vanille. Battre les jaunes d\u2019\u0153ufs et le sucre pour les faire mousser. Ajouter petit \u00e0 petit le lait chaud \u00e0 la vanille."},{"@type":"HowToStep","text":"\u00c9paissir le m\u00e9lange au bain-marie et arr\u00eater lorsque la cr\u00e8me nappe la cuill\u00e8re."},{"@type":"HowToStep","text":"D\u00e9moulez l'\u00eele. Saupoudrez le dessus d'amandes effil\u00e9es."},{"@type":"HowToStep","text":"Versez la cr\u00e8me anglaise tout autour de l'\u00eele et mettez au r\u00e9frig\u00e9rateur jusqu'au moment de servir."}],"author":"Sinfonia","description":"lait, sucre en poudre, vanille, oeuf, sel, sucre glace, amande","keywords":"Ile flottante, \u00eele flottante, lait, sucre en poudre, vanille, oeuf, sel, sucre glace, amande","aggregateRating":{"@type":"AggregateRating","reviewCount":44,"ratingValue":4.3,"worstRating":0,"bestRating":5}}
    //]]>
</script>

It seems Cheerio doesn't handle that case. I'm using that quick fix for the parsing:

-- contents = JSON.parse(this.children[0].data);
++ contents = JSON.parse(this.children[0].data.replace(/\n    \/\//g, '').replace(/\n/g, '').replace(/<!\[CDATA\[(.*?)]]>/, '$1').trim());

There's certainly a better way to clean the content.
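As a possible alternative to the chained replace calls above, the CDATA wrapper can be removed with two anchored patterns before parsing. This is only a sketch: parseJsonLdText is a hypothetical helper, not part of html-metadata, and it assumes the wrapper appears at the very start and end of the script text, optionally preceded by // comment markers as in the sample.

```javascript
// Hypothetical helper (not part of html-metadata): strips an optional CDATA
// wrapper, with or without leading "//" markers, then parses the JSON-LD.
function parseJsonLdText(text) {
	var cleaned = text
		.replace(/^\s*(\/\/)?\s*<!\[CDATA\[/, '') // opening marker at the start
		.replace(/(\/\/)?\s*\]\]>\s*$/, '')       // closing marker at the end
		.trim();
	return JSON.parse(cleaned);
}
```

Text without a CDATA wrapper passes through unchanged, so the same function could be applied to every JSON-LD script body. Other wrapper styles, such as block-comment markers around the CDATA delimiters, are not handled by this sketch.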

Is there an issue with parsing of robots.txt?

I think there is some issue in parsing the robots.txt in html-metadata. I get the metadata as { general: { robots: 'noindex,nofollow' } } for the url https://www.sciencedaily.com/news/matter_energy/graphene/. But the robots.txt is as follows

Sitemap: https://www.sciencedaily.com/sitemap-index.xml
User-agent: Yahoo Pipes 1.0
Disallow: /
User-agent: Yahoo Pipes 2.0
Disallow: /
User-agent: *
Disallow: /templates/
Disallow: /includes/
Disallow: /1002721/
Disallow: /v1/sciencedaily/

The robots.txt clearly allows scraping, but html-metadata still does not scrape.

Or am I missing something here?

Receiving HTTPError: 403 from Shopify Sites

This library is working great in all areas except when we try to scrape a Shopify site. When running the main scrape function the JSON response returned is an error message with string "Receiving HTTPError: 403 from Shopify Sites".

Any guidance? You can reproduce this from the Node app we set up at https://metaparser-v2.thesocialpresskit.com by passing a URL in as a query argument. A Shopify example of this is https://metaparser-v2.thesocialpresskit.com/?url=https://sincerelyjewelry.com/

Thanks in advance for any help!!

wrong project, sorry

html-metadata (preq) being blocked by DDOS arrest

Hi guys, thank you for this wonderful module.

I just wanna ask for help regarding this DDoS protection issue. It seems that this module can't get through some sites with DDoS protection (DDOS arrest), like this gulfnews website: http://gulfnews.com/news/uae/health/health-authority-launches-campaign-for-safe-disposal-of-expired-medicines-1.1637255

Is there any way around this?

Thanks
Mark

OG:Image:url being overwritten

I'm attempting to use html-metadata on this url http://www.lemonde.fr/

If you visit the page and inspect you can see the og tags

I am expecting to return back http://s1.lemde.fr/medias/web/1.2.672/img/placeholder/opengraph.jpg
as the og:image:url but instead I am getting this back.

Doing some inspection it looks like at one point the parseOpenGraph function actually populates the value with the correct url but then later overrides it. =[

Changing
https://github.com/wikimedia/html-metadata/blob/master/index.js#L208
to
if (root && !root[propertyValue[2]]){
	property = propertyValue[2];
	root[property] = content;
}

seems to prevent overwriting the expected value but has the side effect of not adding any of the other image metadata to the opengraph.image object. Not sure what other side effects it may have as well. Any idea how this could be edited to add new properties to the subobjects but not override them? Is that in scope for the project?
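One way to add new properties to sub-objects without overriding existing ones is a "first value wins" merge. The sketch below is not the library's parseOpenGraph; collectOpenGraph is a hypothetical function that takes [property, content] pairs in document order and assumes every property name starts with an "og:" prefix.

```javascript
// Hypothetical sketch (not the library's parseOpenGraph): builds a nested
// object from [property, content] pairs, keeping the first value per key.
function collectOpenGraph(pairs) {
	var result = {};
	pairs.forEach(function (pair) {
		var parts = pair[0].split(':').slice(1); // drop the "og" prefix
		var content = pair[1];
		if (parts.length === 0) {
			return; // nothing after the prefix, ignore
		}
		var node = result;
		parts.slice(0, -1).forEach(function (part) {
			if (typeof node[part] !== 'object' || node[part] === null) {
				// Promote a plain value (e.g. og:image) to { url: value }.
				node[part] = node[part] === undefined ? {} : { url: node[part] };
			}
			node = node[part];
		});
		var last = parts[parts.length - 1];
		if (typeof node[last] === 'object' && node[last] !== null) {
			// A sub-object already exists: only fill in a missing url.
			if (node[last].url === undefined) {
				node[last].url = content;
			}
		} else if (node[last] === undefined) {
			node[last] = content; // first value wins, later ones are ignored
		}
	});
	return result;
}
```

With this approach, og:image:width and og:image:height are still attached to the image object, while a later og:image or og:image:url does not replace the first one. Pages that declare several images would need a different policy, for example collecting them in an array, which this sketch does not attempt.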

TypeError: result.isFulfilled is not a function

Full error dump:

TypeError: result.isFulfilled is not a function
at /home/bubs/BotMark/node_modules/html-metadata/lib/index.js:34:26
at runCallback (timers.js:637:20)
at tryOnImmediate (timers.js:610:5)
at processImmediate [as _immediateCallback] (timers.js:582:5)
From previous event:
at Object.exports.parseAll (/home/bubs/BotMark/node_modules/html-metadata/lib/index.js:30:4)
at Request.<anonymous> (/home/bubs/BotMark/node_modules/html-metadata/index.js:31:16)
at Request.self.callback (/home/bubs/BotMark/node_modules/request/request.js:186:22)
at emitTwo (events.js:106:13)
at Request.emit (events.js:191:7)
at Request.<anonymous> (/home/bubs/BotMark/node_modules/request/request.js:1081:10)
at emitOne (events.js:96:13)
at Request.emit (events.js:188:7)
at Gunzip.<anonymous> (/home/bubs/BotMark/node_modules/request/request.js:1001:12)
at Gunzip.g (events.js:291:16)
at emitNone (events.js:91:20)
at Gunzip.emit (events.js:185:7)

Tried both the promise-based and the callback-based style.

Example calling code

htmlMetadata('http://www.google.com', (err, metadata) => {
	if (err) {
		debug('Scraper html-metadata failed! ', err);
	} else {
		debug('Html-metadata returned: ', metadata);
	}
});

Tried this module a few months ago and it worked just fine. Is there something wrong with the latest release?

Getting this issue : Error: Cannot find module 'preq'

Error: Cannot find module 'preq'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:580:15)
at Function.Module._load (internal/modules/cjs/loader.js:506:25)
at Module.require (internal/modules/cjs/loader.js:636:17)
at require (internal/modules/cjs/helpers.js:20:18)
at Object.<anonymous> (/Users/sinhagaurav010/Documents/CICD/ved_backend/node_modules/html-metadata/index.js:15:12)
at Module._compile (internal/modules/cjs/loader.js:688:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:699:10)
at Module.load (internal/modules/cjs/loader.js:598:32)
at tryModuleLoad (internal/modules/cjs/loader.js:537:12)
at Function.Module._load (internal/modules/cjs/loader.js:529:3)
at Module.require (internal/modules/cjs/loader.js:636:17)
at require (internal/modules/cjs/helpers.js:20:18)
at Object.<anonymous> (/Users/sinhagaurav010/Documents/CICD/ved_backend/routes/PostRoute2.js:19:14)
at Module._compile (internal/modules/cjs/loader.js:688:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:699:10)
at Module.load (internal/modules/cjs/loader.js:598:32)

Please fix

How can I use this module for ES6 js ?

I am creating a package which is dependent on html-metadata, and the package should be used in html, so it should be ES6...
So if I am able to create it in ES6, then I can use it

html-metadata returns all fields as undefined for specific url

Hi,

I have been using html-metadata and thanks for such a wonderful software. I noticed that when used on the following url https://www.cnet.com/special-reports/vr101/ - it gives all fields as undefined.

Can you please look into the issue?

My code is

var scrape = require('html-metadata');
var url = process.argv[2];

scrape(url).then(function(metadata){
	console.log("************************");
	console.log(metadata);
});

and the output I get for this program is

parse() is deprecated, use toJson()

{ openGraph:
{ site_name: undefined,
title: undefined,
description: undefined,
url: undefined,
image:
{ url: undefined,
type: 'image/jpeg',
width: '630',
height: '315' },
app_id: undefined,
type: 'article' },
twitter:
{ card: 'summary_large_image',
creator: undefined,
site: undefined } }

Add some CI

:-)

Not working on nytimes.com

Hi,
I tried to parse the page: http://www.nytimes.com/2017/04/07/world/middleeast/syria-attack-trump.html
and got these errors:
(node:68496) Warning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit
(node:68496) Warning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit
Unhandled rejection Error: Exceeded maxRedirects. Probably stuck in a redirect loop https://www.nytimes.com/glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F04%2F07%2Fworld%2Fmiddleeast%2Fsyria-attack-trump.html%3F_r%3D4
at Redirect.onResponse (/Users/Tao/Work/Ludlow/www/ludlow-web/node_modules/request/lib/redirect.js:98:27)
at Request.onRequestResponse (/Users/Tao/Work/Ludlow/www/ludlow-web/node_modules/request/request.js:917:22)
at emitOne (events.js:96:13)
at ClientRequest.emit (events.js:188:7)
at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:474:21)
at HTTPParser.parserOnHeadersComplete (_http_common.js:99:23)
at TLSSocket.socketOnData (_http_client.js:363:20)
at emitOne (events.js:96:13)
at TLSSocket.emit (events.js:188:7)
at readableAddChunk (_stream_readable.js:176:18)
at TLSSocket.Readable.push (_stream_readable.js:134:10)
at TLSWrap.onread (net.js:548:20)

Probably need to set some cookies to break the redirect loop.

Support Browserify

This is a wonderful module, but it seems difficult to browserify (CORS being the least of your concerns). Perhaps we could use something other than Cheerio in a browser context?

Wrong encoding problem

Hi there! First of all, thanks for your great work on this piece of software. 👍

Now, I'm using it with some websites and it gets the wrong charset encoding. For example, with http://elmundo.es it returns some weird characters. Any advice? I could try to take a look at the package and send a Pull Request if I'm able to fix that. 👍
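One common cause of such garbled characters is decoding a non-UTF-8 page as UTF-8; whether that is what happens here has not been confirmed. The sketch below shows how the declared charset could be detected and applied. detectCharset and decodeBody are hypothetical helpers, not part of html-metadata or preq, and decoding labels other than UTF-8 assumes a Node.js build with full ICU support.

```javascript
// Hypothetical helper: reads the charset label from a Content-Type header,
// falls back to a <meta> declaration in the HTML, then to UTF-8.
function detectCharset(contentType, htmlHead) {
	var fromHeader = /charset\s*=\s*["']?([^\s;"']+)/i.exec(contentType || '');
	if (fromHeader) {
		return fromHeader[1].toLowerCase();
	}
	var fromMeta = /<meta[^>]+charset\s*=\s*["']?([^\s;"'>\/]+)/i.exec(htmlHead || '');
	if (fromMeta) {
		return fromMeta[1].toLowerCase();
	}
	return 'utf-8';
}

// Hypothetical helper: decodes raw response bytes with the detected charset.
// TextDecoder is a Node.js global and throws a RangeError for unknown labels.
function decodeBody(buffer, contentType) {
	// Peek at the first bytes as latin1 so ASCII markup is readable as-is.
	var head = buffer.slice(0, 1024).toString('latin1');
	return new TextDecoder(detectCharset(contentType, head)).decode(buffer);
}
```

To use something like this, the response body would have to be fetched as raw bytes rather than as an already decoded string, and the decoded HTML then passed to the parser.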

Possible to add 'keywords' to 'clutteredMeta' object?

Hi,

Would it be possible to add the following beneath line 300 on lib/index.js:
keywords: chtml('meta[name=keywords i]').attr('content'), //meta keywords <meta name ="keywords" content="">

Think it'd be really useful to return these also.

Many thanks!

Microdata: Underlying lib does not honour "content" attribute on all tags

hi there,

the module used to parse microdata does not honour the microdata spec with regard to the "content" attribute.

the spec states:

HTML only allows the content attribute on the meta element. This specification changes the content model to allow it on any element, as a global attribute.

the relevant function in microdata-node only looks at the "content" attribute for "meta" tags.

I mention this problem here due to the fact that the microdata-node module has seen no development in the last 3 years.

How to handle error when host does not respond?

I'm using html-metadata to parse a list of hostnames:

const htmlMetadata = require('html-metadata');

console.log(`Domain: ${domain}`);
const newUrl = completeUrl(domain);
htmlMetadata(newUrl, (err, data) => {
	console.log(`htmlMetadata callback:`, err);
	const newData = {
		url: newUrl,
		shortName: _.capitalize(newUrl.split('.')[1]),
		tld: newUrl.split('.').pop(),
		title: _.get(data, 'general.title', '').replace(ALL_LINEBREAKS, '').replace(ALL_TABS, ''),
		description: _.get(data, 'general.description', '').replace(ALL_LINEBREAKS, '').replace(ALL_TABS, ''),
	}
	cb(null, newData); // even if err, don't propagate it
});

However, when host does not respond, I get this error:

Domain: f-edition.se
_http_agent.js:186
        nextTick(newSocket._handle.getAsyncId(), function() {
                          ^

TypeError: Cannot read property '_handle' of undefined
    at _http_agent.js:186:27
    at Timeout._onTimeout (/Users/tomsoderlund/Documents/Projects/Weld.io/Development/weld-extensions/cabal/node_modules/preq/index.js:24:17)
    at ontimeout (timers.js:488:11)
    at tryOnTimeout (timers.js:323:5)
    at Timer.listOnTimeout (timers.js:283:5)

Please note that this happens before my htmlMetadata callback is triggered.

I've tried wrapping my code in try/catch but it doesn't help, I guess it's because it's async and the error is happening in a different thread.

Any tips?

Add support for reading application/ld+json tag

https://developers.google.com/schemas/formats/json-ld

Release?

The head of master is ahead of the latest version on npm and has been for a couple of months - any chance of a new release?

Relative URLs

OG allows for image URLs that are relative paths. The debugger automatically prepends the domain to relative paths. I've seen this done with other metadata as well, whether it is specifically allowed in a spec or not. It would be helpful for this package to identify and replace those cases. Is it within the scope of this package to cover that? If so, I am happy to make a contribution!
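Resolving such relative values against the page URL can be done with the WHATWG URL class built into Node.js. The following is a minimal sketch of that step only; resolveMetadataUrl is a hypothetical helper, and deciding which metadata fields actually hold URLs is left open.

```javascript
// Hypothetical helper (not part of html-metadata): resolves a possibly
// relative metadata URL against the URL of the page it was found on.
function resolveMetadataUrl(value, pageUrl) {
	try {
		// Absolute URLs are kept; relative and protocol-relative ones
		// are resolved against pageUrl.
		return new URL(value, pageUrl).href;
	} catch (e) {
		return value; // leave values that cannot be parsed untouched
	}
}
```

For example, with a page at http://example.com/post/1, the value /img/a.jpg would resolve to http://example.com/img/a.jpg, while an already absolute URL is returned in normalized form.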

passing url without http fails

I am passing a URL like www.lightercapital.com rather than http://lightercapital.com, and this fails with the error

HTTPError: Invalid URI "www.lightercapital.com"
at request.then (/srv/node_modules/preq/index.js:246:19)

is it possible to handle this?
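Until the library handles this itself, a caller could normalize the input before scraping. The sketch below is one possible workaround; ensureScheme is a hypothetical helper, and defaulting to http:// rather than https:// is an arbitrary assumption.

```javascript
// Hypothetical helper (not part of html-metadata): prepends a scheme when
// the input has none, so "www.example.com" becomes "http://www.example.com".
function ensureScheme(input) {
	var trimmed = String(input).trim();
	if (/^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed)) {
		return trimmed; // already has a scheme such as http:// or https://
	}
	if (trimmed.indexOf('//') === 0) {
		return 'http:' + trimmed; // protocol-relative form
	}
	return 'http://' + trimmed;
}
```

The result of ensureScheme('www.lightercapital.com') could then be passed to scrape in place of the bare host name.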