wikimedia / html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)

License: MIT License

html-metadata's Introduction

html-metadata

The aim of this library is to be a comprehensive source for extracting all metadata embedded in HTML. It currently supports Schema.org microdata through a third-party library, and has native implementations for BEPress, Dublin Core, Highwire Press, JSON-LD, Open Graph, Twitter, EPrints, PRISM, and COinS. It also extracts some general metadata that doesn't belong to a particular standard (for instance, the content of the title tag or of meta description tags).

Support for RDFa, AGLS, and other metadata types is planned. Contributions and requests for other metadata types are welcome!

Install

npm install html-metadata

Usage

Promise-based:

var scrape = require('html-metadata');

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

scrape(url).then(function(metadata){
	console.log(metadata);
});

Callback-based:

var scrape = require('html-metadata');

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

scrape(url, function(error, metadata){
	console.log(metadata);
});

The scrape method used here invokes the parseAll() method, which runs all the parsing methods registered in metadataFunctions(). These methods can also be used separately, for example:

Promise-based:

var cheerio = require('cheerio');
var preq = require('preq'); // Promisified request library
var parseDublinCore = require('html-metadata').parseDublinCore;

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

preq(url).then(function(response){
	var $ = cheerio.load(response.body);
	return parseDublinCore($).then(function(metadata){
		console.log(metadata);
	});
});

Callback-based:

var cheerio = require('cheerio');
var request = require('request');
var parseDublinCore = require('html-metadata').parseDublinCore;

var url = "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/";

request(url, function(error, response, html){
	var $ = cheerio.load(html);
	parseDublinCore($, function(error, metadata){
		console.log(metadata);
	});
});

Options object:

You can also pass an options object as the first argument containing extra parameters. Some websites require the user-agent or cookies to be set in order to get the response.

var scrape = require('html-metadata');
var request = require('request');

var options =  {
	url: "http://blog.woorank.com/2013/04/dublin-core-metadata-for-seo-and-usability/",
	jar: request.jar(), // Cookie jar
	headers: {
		'User-Agent': 'webscraper'
	}
};

scrape(options, function(error, metadata){
	console.log(metadata);
});

The method parseGeneral obtains the following general metadata:

<link rel="apple-touch-icon" href="" sizes="" type="">
<link rel="icon" href="" sizes="" type="">
<meta name="author" content="">
<link rel="author" href="">
<link rel="canonical" href="">
<meta name="description" content="">
<link rel="publisher" href="">
<meta name="robots" content="">
<link rel="shortlink" href="">
<title></title>
<html lang="en">
<html dir="rtl">

Tests

npm test runs the mocha tests

npm run-script coverage runs the tests and reports code coverage

Contributing

Contributions welcome! All contributions should use bluebird promises instead of callbacks, and be .nodeify()-ed in index.js so the functions can be used with either callbacks or Promises.
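
As a rough illustration of the dual callback/Promise convention described above, the sketch below uses a hand-rolled nodeify helper on native Promises rather than bluebird's .nodeify(). Both nodeify and parseTitle are hypothetical names made up for this example and are not part of html-metadata.

```javascript
// Hypothetical helper: approximates what bluebird's .nodeify() does,
// using native Promises. Not part of html-metadata.
function nodeify(promise, callback) {
	if (typeof callback !== 'function') {
		return promise; // Promise style: hand the promise back to the caller
	}
	promise.then(
		function (value) { callback(null, value); },
		function (error) { callback(error); }
	);
	return undefined; // Callback style: the result arrives via the callback
}

// Hypothetical parser written once with promises and exposed in both styles
function parseTitle(html, callback) {
	var promise = new Promise(function (resolve, reject) {
		var match = /<title>([^<]*)<\/title>/i.exec(html);
		if (match) {
			resolve(match[1]);
		} else {
			reject(new Error('No title found'));
		}
	});
	return nodeify(promise, callback);
}

// Promise style
parseTitle('<title>Hello</title>').then(function (title) {
	console.log(title);
});

// Callback style
parseTitle('<title>Hello</title>', function (error, title) {
	console.log(error, title);
});
```

The parser body only ever deals with promises; the single nodeify call at the end is what makes the callback form available.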

html-metadata's People

Contributors

d00rman, ethanlee16, feelfreelinux, geofbot, gwicke, jdforrester, kevinahuber, kolarski, m4tx, mcpo, mvolz, neonowy, rcya1, saffatahmed, scimonster, thewilkybarkid, yoranbrondsema

html-metadata's Issues

Push to NPM

The version of this library on npm is still 1.2.1, while the version here is 1.3.0.

Cannot read property 'connectTimeoutTimer' of undefined

I'm getting a strange error that seems to be caused by some conflict between this module and https://github.com/desmondmorris/node-twitter

TypeError: Uncaught error: Cannot read property 'connectTimeoutTimer' of undefined
    at ConnectTimeoutAgent.createSocket (/path/to/project/node_modules/html-metadata/node_modules/preq/index.js:21:46)
    at ConnectTimeoutAgent.Agent.addRequest (_http_agent.js:157:10)
    at new ClientRequest (_http_client.js:160:16)
    at Object.exports.request (http.js:31:10)
    at Object.exports.request (https.js:199:15)
    at Request.start (/path/to/project/node_modules/twitter/node_modules/request/request.js:744:32)
    at Request.write (/path/to/project/node_modules/twitter/node_modules/request/request.js:1421:10)
    at end (/path/to/project/node_modules/twitter/node_modules/request/request.js:551:18)
    at Immediate.<anonymous> (/path/to/project/node_modules/twitter/node_modules/request/request.js:580:7)
    at runCallback (timers.js:637:20)
    at tryOnImmediate (timers.js:610:5)
    at processImmediate [as _immediateCallback] (timers.js:582:5)

The strange thing is that this error is thrown when I make a request with node-twitter. It seems html-metadata never actually needs to be used, this error will be thrown if it is simply require-ed.

Sorry if this isn't the right place to post this, but I figured I would start here since I'm not actually sure where the source of this problem is.

⚠️ Self-closing tags get corrupted 🚨

The library doesn't support html5 tags (e.g. self-closing span).

When parsing the following:

<span itemprop="price" content="139.90" />

foo

bar

It keeps adding "foo ... bar" to the price attribute until it finds a closing </span> tag.

The issue is in chtml, which replaces /> with >.

Meta tag search should be case insensitive

I noticed that on some websites the meta tag names begin with an upper-case letter, e.g.
<meta name="Description" content=

Currently the application only works for lower-case tag names; case-insensitive matching is not supported. I tried to write my own fix, but I think this is a simple change and you could do a quick release for it. I think the change required would be in the function

exports.parseGeneral = BBPromise.method(function(html) and the change would be to use meta[name="description" i]; a similar case-insensitive match should be applied to all meta tags.

Please look into this as soon as possible. This tool captures meta information in most cases, but for websites like www.microsoft.com the "description" tag text is ignored, and there may be cases where other tags are ignored as well although they are present in the page.

Thanks and great work
-Saood
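
The case-insensitive lookup requested in this issue can be sketched in plain JavaScript, without cheerio or html-metadata. The helper name getMetaContent is hypothetical, and the regular expressions are a simplification that assumes quoted attribute values and no ">" inside them.

```javascript
// Hypothetical helper, independent of html-metadata and cheerio:
// returns the content of the first <meta> tag whose name attribute
// equals `name`, ignoring case. Returns undefined when nothing matches.
function getMetaContent(html, name) {
	const tags = html.match(/<meta\b[^>]*>/gi) || [];
	const wanted = name.toLowerCase();
	for (const tag of tags) {
		// The backreference \1 makes the closing quote match the opening one
		const nameMatch = /\sname\s*=\s*(["'])([\s\S]*?)\1/i.exec(tag);
		const contentMatch = /\scontent\s*=\s*(["'])([\s\S]*?)\1/i.exec(tag);
		if (nameMatch && contentMatch && nameMatch[2].toLowerCase() === wanted) {
			return contentMatch[2];
		}
	}
	return undefined;
}

console.log(getMetaContent('<meta name="Description" content="Example page">', 'description'));
```

Lower-casing both sides of the comparison is what removes the dependence on how the page author capitalised the attribute value.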

Not returning Open graph details in certain websites

I'm scraping this URL and it isn't returning the openGraph property even though it exists in the head's meta tags, as you can see in the attached image.
https://fussball.news/artikel/allagui-lobt-pauli-paradebeispiel-dafuer-wie-man-eine-krise-bewaeltigen-kann/

general:Object {icons: Array(2), canonical: "https://fussball.news/artikel/allagui-lobt-pauli-p…", shortlink: "https://wp.me/pa5NPE-Grv", …}
canonical:"https://fussball.news/artikel/allagui-lobt-pauli-paradebeispiel-dafuer-wie-man-eine-krise-bewaeltigen-kann/"
icons:Array(2) [Object, Object]
lang:"de-DE"
shortlink:"https://wp.me/pa5NPE-Grv"
title:"Allagui lobt Pauli: "Paradebeispiel dafür, wie man eine Krise bewältigen kann" | fussball.news"
__proto__:Object {constructor: , __defineGetter__: , __defineSetter__: , …}
jsonLd:Object {@context: "https://schema.org", @type: "Organization", url: "https://fussball.news/", …}
schemaOrg:Object {items: Array(1)}

screen shot 2018-10-26 at 14 30 28

JSON-LD content embedded in CDATA not parsed

Trying to parse a Recipe described in a JSON-LD content embedded in CDATA returns nothing.

The sample:

<script type="application/ld+json">
    //<![CDATA[
    {"@context":"http://schema.org","@type":"Recipe","name":"Ile flottante","recipeCategory":"\u00eele flottante","image":"https://image.afcdn.com/recipe/20130408/34776_w1024h768c1cx256cy192.jpg","datePublished":"2003-03-31T07:21:00+02:00","prepTime":"PT15M","cookTime":"PT30M","totalTime":"PT45M","recipeYield":"4 personnes","recipeIngredient":["60 cl lait","60 g sucre en poudre","1 gousse vanille","5 oeuf","1 pinc\u00e9e sel","130 g sucre glace","60 g amande"],"recipeInstructions":[{"@type":"HowToStep","text":"Casser le \u0153ufs en s\u00e9parant les blancs des jaunes."},{"@type":"HowToStep","text":"Monter les blancs en neige en y ajoutant une pinc\u00e9e de sel. Mettre petit \u00e0 petit le sucre glace."},{"@type":"HowToStep","text":"Mettre le m\u00e9lange dans un moule \u00e0 charlotte recouvert d'aluminium et beurr\u00e9."},{"@type":"HowToStep","text":"Cuire au bain-marie dans un four \u00e0 210\u00b0C (thermostat 7) pendant 25 \u00e0 30 minutes."},{"@type":"HowToStep","text":"Faire une cr\u00e8me anglaise en chauffant le lait avec une gousse de vanille. Battre les jaunes d\u2019\u0153ufs et le sucre pour les faire mousser. Ajouter petit \u00e0 petit le lait chaud \u00e0 la vanille."},{"@type":"HowToStep","text":"\u00c9paissir le m\u00e9lange au bain-marie et arr\u00eater lorsque la cr\u00e8me nappe la cuill\u00e8re."},{"@type":"HowToStep","text":"D\u00e9moulez l'\u00eele. Saupoudrez le dessus d'amandes effil\u00e9es."},{"@type":"HowToStep","text":"Versez la cr\u00e8me anglaise tout autour de l'\u00eele et mettez au r\u00e9frig\u00e9rateur jusqu'au moment de servir."}],"author":"Sinfonia","description":"lait, sucre en poudre, vanille, oeuf, sel, sucre glace, amande","keywords":"Ile flottante, \u00eele flottante, lait, sucre en poudre, vanille, oeuf, sel, sucre glace, amande","aggregateRating":{"@type":"AggregateRating","reviewCount":44,"ratingValue":4.3,"worstRating":0,"bestRating":5}}
    //]]>
</script>

It seems Cheerio doesn't handle that case. I'm using that quick fix for the parsing:

-- contents = JSON.parse(this.children[0].data);
++ contents = JSON.parse(this.children[0].data.replace(/\n    \/\//g, '').replace(/\n/g, '').replace(/<!\[CDATA\[(.*?)]]>/, '$1').trim());

There's certainly a better way to clean the content.
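
One possible way to tidy the clean-up step is sketched below. It only strips an optional CDATA wrapper, with or without the leading // comment markers shown in the sample, and then parses what is left. The helper name parseJsonLd is hypothetical and the function works on a plain string, so wiring it into the library is left open.

```javascript
// Hypothetical helper: removes an optional CDATA wrapper around JSON-LD text
// (optionally preceded by // comment markers) before handing it to JSON.parse.
function parseJsonLd(text) {
	const cleaned = text
		.replace(/^\s*(?:\/\/)?\s*<!\[CDATA\[/, '') // opening marker at the start
		.replace(/(?:\/\/)?\s*\]\]>\s*$/, '')        // closing marker at the end
		.trim();
	return JSON.parse(cleaned);
}

const sample = [
	'',
	'    //<![CDATA[',
	'    {"@context":"http://schema.org","@type":"Recipe","name":"Ile flottante"}',
	'    //]]>',
	''
].join('\n');

console.log(parseJsonLd(sample).name);
```

Because both patterns are anchored to the start or end of the text, a "//" or "]]" that appears inside the JSON itself (for example in a URL or a nested array) is left alone.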

Is there an issue with parsing of robots.txt?

I think there is some issue in parsing the robots.txt in html-metadata. I get the metadata as { general: { robots: 'noindex,nofollow' } } for the url https://www.sciencedaily.com/news/matter_energy/graphene/. But the robots.txt is as follows

Sitemap: https://www.sciencedaily.com/sitemap-index.xml
User-agent: Yahoo Pipes 1.0
Disallow: /
User-agent: Yahoo Pipes 2.0
Disallow: /
User-agent: *
Disallow: /templates/
Disallow: /includes/
Disallow: /1002721/
Disallow: /v1/sciencedaily/

The robots.txt clearly allows scraping, but html-metadata still does not scrape.

Or am I missing something here?

Receiving HTTPError: 403 from Shopify Sites

This library is working great in all areas except when we try to scrape a Shopify site. When running the main scrape function the JSON response returned is an error message with string "Receiving HTTPError: 403 from Shopify Sites".

Any guidance? You can reproduce from the Node app we setup at https://metaparser-v2.thesocialpresskit.com by passing a url in as a query argument. A Shopify example of this is https://metaparser-v2.thesocialpresskit.com/?url=https://sincerelyjewelry.com/

Thanks in advance for any help!!

OG:Image:url being overwritten

I'm attempting to use html-metadata on this url http://www.lemonde.fr/

If you visit the page and inspect you can see the og tags
screen shot 2015-05-13 at 8 35 58 pm

I am expecting to return back http://s1.lemde.fr/medias/web/1.2.672/img/placeholder/opengraph.jpg
as the og:image:url but instead I am getting this back.

screen shot 2015-05-13 at 8 37 57 pm

Doing some inspection it looks like at one point the parseOpenGraph function actually populates the value with the correct url but then later overrides it. =[

Changing
https://github.com/wikimedia/html-metadata/blob/master/index.js#L208
to
if (root && !root[propertyValue[2]]){
	property = propertyValue[2];
	root[property] = content;
}

seems to prevent overwriting the expected value but has the side effect of not adding any of the other image metadata to the opengraph.image object. Not sure what other side effects it may have as well. Any idea how this could be edited to add new properties to the subobjects but not override them? Is that in scope for the project?
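
The behaviour asked about here (add new properties to a sub-object, but never overwrite one that is already set) can be sketched in isolation as follows. This is not the code of parseOpenGraph; setIfAbsent and the example.com URLs are hypothetical, and how such a helper would fit into the library's parsing loop is an open question.

```javascript
// Hypothetical helper: sets root[property] only when it has no value yet,
// so a later duplicate tag cannot replace the first value, while
// properties seen for the first time are still added.
function setIfAbsent(root, property, content) {
	if (!Object.prototype.hasOwnProperty.call(root, property)) {
		root[property] = content;
	}
	return root;
}

// Simulated sequence of og:image:* properties in document order
const image = {};
[
	['url', 'http://example.com/first.jpg'],
	['type', 'image/jpeg'],
	['url', 'http://example.com/second.jpg'], // duplicate key, ignored
	['width', '630']
].forEach(function (pair) {
	setIfAbsent(image, pair[0], pair[1]);
});

console.log(image);
```

Checking with hasOwnProperty rather than truthiness means an empty string that was set deliberately is also preserved. Pages with several og:image tags might be better served by collecting an array of image objects, which this sketch does not attempt.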

TypeError: result.isFulfilled is not a function

Full error dump:

TypeError: result.isFulfilled is not a function
at /home/bubs/BotMark/node_modules/html-metadata/lib/index.js:34:26
at runCallback (timers.js:637:20)
at tryOnImmediate (timers.js:610:5)
at processImmediate [as _immediateCallback] (timers.js:582:5)
From previous event:
at Object.exports.parseAll (/home/bubs/BotMark/node_modules/html-metadata/lib/index.js:30:4)
at Request.<anonymous> (/home/bubs/BotMark/node_modules/html-metadata/index.js:31:16)
at Request.self.callback (/home/bubs/BotMark/node_modules/request/request.js:186:22)
at emitTwo (events.js:106:13)
at Request.emit (events.js:191:7)
at Request.<anonymous> (/home/bubs/BotMark/node_modules/request/request.js:1081:10)
at emitOne (events.js:96:13)
at Request.emit (events.js:188:7)
at Gunzip.<anonymous> (/home/bubs/BotMark/node_modules/request/request.js:1001:12)
at Gunzip.g (events.js:291:16)
at emitNone (events.js:91:20)
at Gunzip.emit (events.js:185:7)

Tried both promise-based and callback-based.

Example calling code

htmlMetadata('http://www.google.com', (err, metadata) => {
	if (err) {
		debug('Scraper html-metadata failed! ', err);
	} else {
		debug('Html-metadata returned: ', metadata);
	}
});

Tried this module a few months ago and it worked just fine. Is there something wrong with the latest release?

Getting this issue : Error: Cannot find module 'preq'

Error: Cannot find module 'preq'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:580:15)
at Function.Module._load (internal/modules/cjs/loader.js:506:25)
at Module.require (internal/modules/cjs/loader.js:636:17)
at require (internal/modules/cjs/helpers.js:20:18)
at Object.<anonymous> (/Users/sinhagaurav010/Documents/CICD/ved_backend/node_modules/html-metadata/index.js:15:12)
at Module._compile (internal/modules/cjs/loader.js:688:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:699:10)
at Module.load (internal/modules/cjs/loader.js:598:32)
at tryModuleLoad (internal/modules/cjs/loader.js:537:12)
at Function.Module._load (internal/modules/cjs/loader.js:529:3)
at Module.require (internal/modules/cjs/loader.js:636:17)
at require (internal/modules/cjs/helpers.js:20:18)
at Object.<anonymous> (/Users/sinhagaurav010/Documents/CICD/ved_backend/routes/PostRoute2.js:19:14)
at Module._compile (internal/modules/cjs/loader.js:688:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:699:10)
at Module.load (internal/modules/cjs/loader.js:598:32)

Please fix

How can I use this module for ES6 js ?

I am creating a package that depends on html-metadata. The package is meant to be used in HTML pages, so it needs to be ES6.
If I can build it as ES6, then I can use it.

html-metadata returns all fields as undefined for specific url

Hi,

I have been using html-metadata and thanks for such a wonderful software. I noticed that when used on the following url https://www.cnet.com/special-reports/vr101/ - it gives all fields as undefined.

Can you please look into the issue.

My code is

var scrape = require('html-metadata');
var url = process.argv[2];

scrape(url).then(function(metadata){
console.log("************************");
console.log(metadata);
});

and the output I get for this program is

parse() is deprecated, use toJson()


{ openGraph:
   { site_name: undefined,
     title: undefined,
     description: undefined,
     url: undefined,
     image:
      { url: undefined,
        type: 'image/jpeg',
        width: '630',
        height: '315' },
     app_id: undefined,
     type: 'article' },
  twitter:
   { card: 'summary_large_image',
     creator: undefined,
     site: undefined } }

Not working on nytimes.com

Hi,
I tried to parse the page: http://www.nytimes.com/2017/04/07/world/middleeast/syria-attack-trump.html
and got the following error:
(node:68496) Warning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit
(node:68496) Warning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit
Unhandled rejection Error: Exceeded maxRedirects. Probably stuck in a redirect loop https://www.nytimes.com/glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F04%2F07%2Fworld%2Fmiddleeast%2Fsyria-attack-trump.html%3F_r%3D4
at Redirect.onResponse (/Users/Tao/Work/Ludlow/www/ludlow-web/node_modules/request/lib/redirect.js:98:27)
at Request.onRequestResponse (/Users/Tao/Work/Ludlow/www/ludlow-web/node_modules/request/request.js:917:22)
at emitOne (events.js:96:13)
at ClientRequest.emit (events.js:188:7)
at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:474:21)
at HTTPParser.parserOnHeadersComplete (_http_common.js:99:23)
at TLSSocket.socketOnData (_http_client.js:363:20)
at emitOne (events.js:96:13)
at TLSSocket.emit (events.js:188:7)
at readableAddChunk (_stream_readable.js:176:18)
at TLSSocket.Readable.push (_stream_readable.js:134:10)
at TLSWrap.onread (net.js:548:20)

Probably need to set some cookies to break the redirect loop.

Support Browserify

This is a wonderful module, but it seems difficult to browserify (CORS being at least one of the concerns). Perhaps we could use something other than Cheerio for a browser context?

Wrong encoding problem

Hi there! First of all, thanks for your great work on this piece of software. 👍

Now, I'm using it with some websites and it gets the charset encoding wrong. For example, with http://elmundo.es it returns some weird characters. Any advice? I could try to take a look at the package and send a Pull Request if I'm able to fix that. 👍
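
Assuming the weird characters come from a non-UTF-8 page being decoded as UTF-8 (the issue does not confirm the cause), one possible approach is to read the charset from the Content-Type response header and decode the raw bytes accordingly. The sketch below uses Node's built-in TextDecoder; decodeBody is a hypothetical helper, and obtaining the raw bytes from the HTTP client is not shown.

```javascript
// Hypothetical helper: picks the charset from a Content-Type header value
// and decodes the raw response bytes with it, falling back to UTF-8.
function decodeBody(buffer, contentType) {
	const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType || '');
	const label = match ? match[1] : 'utf-8';
	try {
		return new TextDecoder(label).decode(buffer);
	} catch (error) {
		// Unknown or unsupported charset label: fall back to UTF-8
		return new TextDecoder('utf-8').decode(buffer);
	}
}

// "España" encoded as ISO-8859-1: the byte 0xF1 is "ñ"
const bytes = Buffer.from([0x45, 0x73, 0x70, 0x61, 0xf1, 0x61]);
console.log(decodeBody(bytes, 'text/html; charset=ISO-8859-1'));
```

Legacy encodings such as ISO-8859-1 are only available to TextDecoder when Node.js is built with full ICU data. A page can also declare its charset in a meta tag rather than in the header, which this sketch ignores.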

Possible to add 'keywords' to 'clutteredMeta' object?

Hi,

Would it be possible to add the following beneath line 300 on lib/index.js:
keywords: chtml('meta[name=keywords i]').attr('content'), //meta keywords <meta name ="keywords" content="">

Think it'd be really useful to return these also.

Many thanks!

Microdata: Underlying lib does not honour "content" attribute on all tags

hi there,

the module used to parse microdata does not honour the microdata spec with regard to "content" tags attributes.

the spec states:

HTML only allows the content attribute on the meta element. This specification changes the content model to allow it on any element, as a global attribute.

the relevant function in microdata-node only looks at the "content" attribute for "meta" tags.

I mention this problem here due to the fact that the microdata-node module has seen no development in the last 3 years.

How to handle error when host does not respond?

I'm using html-metadata to parse a list of hostnames:

const htmlMetadata = require('html-metadata');

console.log(`Domain: ${domain}`);
const newUrl = completeUrl(domain);
htmlMetadata(newUrl, (err, data) => {
	console.log(`htmlMetadata callback:`, err);
	const newData = {
		url: newUrl,
		shortName: _.capitalize(newUrl.split('.')[1]),
		tld: newUrl.split('.').pop(),
		title: _.get(data, 'general.title', '').replace(ALL_LINEBREAKS, '').replace(ALL_TABS, ''),
		description: _.get(data, 'general.description', '').replace(ALL_LINEBREAKS, '').replace(ALL_TABS, ''),
	}
	cb(null, newData); // even if err, don't propagate it
});

However, when host does not respond, I get this error:

Domain: f-edition.se
_http_agent.js:186
        nextTick(newSocket._handle.getAsyncId(), function() {
                          ^

TypeError: Cannot read property '_handle' of undefined
    at _http_agent.js:186:27
    at Timeout._onTimeout (/Users/tomsoderlund/Documents/Projects/Weld.io/Development/weld-extensions/cabal/node_modules/preq/index.js:24:17)
    at ontimeout (timers.js:488:11)
    at tryOnTimeout (timers.js:323:5)
    at Timer.listOnTimeout (timers.js:283:5)

Please note that this happens before my htmlMetadata callback is triggered.

I've tried wrapping my code in try/catch but it doesn't help. I guess it's because the call is async and the error is thrown later from a timer callback, outside my call stack.

Any tips?

Release?

The head of master is ahead of the latest version on npm and has been for a couple of months - any chance of a new release?

Relative URLs

OG allows for image URLs that are relative paths. The debugger automatically prepends the domain to relative paths. I've seen this done with other metadata as well, whether or not it is specifically allowed in a spec. It would be helpful for this package to identify and replace those cases. Is it within the scope of this package to cover that? If so, I am happy to make a contribution!
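
Resolving a relative metadata URL against the page it came from can be sketched with Node's built-in URL class. The helper name resolveUrl and the example.com addresses are hypothetical, and deciding which metadata fields should be treated as URLs is left open.

```javascript
// Hypothetical helper: resolves a possibly relative metadata URL against
// the URL of the page it was scraped from. Absolute URLs pass through.
function resolveUrl(value, pageUrl) {
	try {
		return new URL(value, pageUrl).href;
	} catch (error) {
		return value; // leave values that cannot be parsed untouched
	}
}

const page = 'http://example.com/news/article.html';
console.log(resolveUrl('/img/og.jpg', page));
console.log(resolveUrl('og.jpg', page));
console.log(resolveUrl('https://cdn.example.org/a.png', page));
```

The two-argument URL constructor follows the same resolution rules browsers apply to links, so root-relative, document-relative, and protocol-relative values are all handled by the one call.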
