
scraperjs's Introduction

Scraperjs


Scraperjs is a web scraper module that makes scraping the web an easy job.

Installing

npm install scraperjs

If you would like to run the tests (this is optional and requires installing with the --save-dev flag),

grunt test

To use some features you’ll need to install phantomjs, if you haven’t already

Getting started

Scraperjs exposes two different scrapers,

  • a StaticScraper, which is lightweight and fast, with a low footprint; however, it doesn't allow for more complex situations, like scraping dynamic content.
  • a DynamicScraper, which is a bit heavier, but allows you to scrape dynamic content, like in the browser console.

Both scrapers expose a very similar API, with some minor differences when it comes to scraping.

Let's scrape Hacker News with both scrapers.

Try to spot the differences.

Static Scraper

var scraperjs = require('scraperjs');
scraperjs.StaticScraper.create('https://news.ycombinator.com/')
	.scrape(function($) {
		return $(".title a").map(function() {
			return $(this).text();
		}).get();
	})
	.then(function(news) {
		console.log(news);
	})

The scrape promise receives a function that will scrape the page and return the result; it receives only jQuery as a parameter to scrape the page. Still, very powerful. It uses cheerio to do the magic behind the scenes.

Dynamic Scraper

var scraperjs = require('scraperjs');
scraperjs.DynamicScraper.create('https://news.ycombinator.com/')
	.scrape(function($) {
		return $(".title a").map(function() {
			return $(this).text();
		}).get();
	})
	.then(function(news) {
		console.log(news);
	})

Again, the scrape promise receives a function to scrape the page. The only difference is that, because we're using a dynamic scraper, the scraping function is sandboxed with only the page scope, so no closures! This means that in this (and only this) scraper you can't call a function that has not been defined inside the scraping function. Also, the result of the scraping function must be JSON-serializable. We use phantom and phantomjs to make it happen, and we also inject jQuery for you.

However, it's possible to pass JSON-serializable data to any scraper.

The $ variable received by the scraping function is, only for the dynamic scraper, hardcoded.
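
For example, a minimal sketch (following the Hacker News example above) that keeps everything inside the scraping function and returns plain, JSON-serializable objects:

var scraperjs = require('scraperjs');

scraperjs.DynamicScraper.create('https://news.ycombinator.com/')
	.scrape(function() {
		// everything used here is defined inside the sandboxed function,
		// and the return value is a plain JSON-serializable array
		return $(".title a").map(function() {
			return {
				title: $(this).text(),
				href: $(this).attr('href')
			};
		}).get();
	})
	.then(function(news) {
		console.log(news);
	});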

Show me the way! (aka Routes)

For more flexible scraping and crawling of the web we sometimes need to go through multiple web sites, and we don't want to map every possible URL format. For that, scraperjs provides the Router class.

Example

var scraperjs = require('scraperjs'),
	router = new scraperjs.Router();

router
	.otherwise(function(url) {
		console.log("Url '"+url+"' couldn't be routed.");
	});

var path = {};

router.on('https?://(www.)?youtube.com/watch/:id')
	.createStatic()
	.scrape(function($) {
		return $("a").map(function() {
			return $(this).attr("href");
		}).get();
	})
	.then(function(links, utils) {
		path[utils.params.id] = links
	})

router.route("https://www.youtube.com/watch/YE7VzlLtp-4", function() {
	console.log("i'm done");
});

The code that allows for parameters in paths is from the Routes.js project; information about the path formatting is there too.

API overview

Scraperjs uses promises whenever possible.

StaticScraper, DynamicScraper and ScraperPromise

So, the scrapers should be used with the ScraperPromise. Start by creating a scraper,

var scraperPromise = scraperjs.StaticScraper.create() // or DynamicScraper

The following promises can be made over it; they all return a scraper promise,

  • onStatusCode(code:number, callback:function(utils:Object)), executes the callback when the status code is equal to the code,
  • onStatusCode(callback:function(code:number, utils:Object)), executes the callback when the status code is received. The callback receives the current status code,
  • delay(time:number, callback:function(last:?, utils:Object)), delays the execution of the chain by time (in milliseconds),
  • timeout(time:number, callback:function(last:?, utils:Object)), executes the callback function after time (in milliseconds),
  • then(callback:function(last:?, utils:Object)), executes the callback after the last promise; the callback receives the result of the previous promise,
  • async(callback:function(last:?, done:function(result:?, err:?), utils)), executes the callback, stopping the promise chain, resuming it when the done function is called. You can provide a result to be passed down the promise chain, or an error to trigger the catch promise,
  • catch(callback:function(error:Error, utils:Object)), executes the callback when there was an error, errors block the execution of the chain even if the promise was not defined,
  • done(callback:function(last:?, utils:Object)), executes the callback at the end of the promise chain, this is always executed, even if there was an error,
  • get(url:string), makes a simple HTTP GET request to the url. This promise should be used only once per scraper.
  • request(options:Object), makes a (possibly) more complex HTTP request, scraperjs uses the request module, and this method is a simple wrapper of request.request(). This promise should be used only once per scraper.
  • scrape(scrapeFn:function(...?), callback:function(result:?, utils:Object)=, ...?), scrapes the page. It executes the scrapeFn and passes its result to the callback. When using the StaticScraper, the scrapeFn receives a jQuery function that is used to scrape the page. When using the DynamicScraper, the scrapeFn doesn't receive anything and can only return a JSON-serializable type. Optionally an arbitrary number of arguments can be passed to the scraping function. The callback may be omitted; if so, the result of the scraping may be accessed with the then promise or utils.lastReturn in the next promise. A short chain combining several of these promises is sketched after this list.
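
A minimal sketch chaining a few of these promises (reusing the Hacker News URL from the examples above):

var scraperjs = require('scraperjs');

scraperjs.StaticScraper.create()
	.get('https://news.ycombinator.com/')
	.onStatusCode(200, function() {
		console.log('200 OK');
	})
	.scrape(function($) {
		// no callback here, so the result is picked up by the next promise
		return $(".title a").length;
	})
	.then(function(count) {
		console.log('found ' + count + ' titles');
	})
	.catch(function(err) {
		console.log('something went wrong:', err);
	})
	.done(function() {
		console.log('chain finished');
	});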

All callback functions receive a utils object as their last parameter; with it, the parameters of a URL matched by a router can be accessed, and the chain can be stopped.

DynamicScraper.create()
	.get("http://news.ycombinator.com")
	.then(function(_, utils) {
		utils.stop();
		// utils.params.paramName
	});

The promise chain is fired in the same sequence it was declared, with the exception of the promises get and request, which fire the chain when they've received a valid response, and the promises done and catch, which were explained above.

You can also waterfall values between promises by returning them (with the exception of the timeout promise, which will always return undefined), and they can be accessed through utils.lastReturn.
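
For instance, a minimal sketch of waterfalling a value:

var scraperjs = require('scraperjs');

scraperjs.StaticScraper.create('https://news.ycombinator.com/')
	.scrape(function($) {
		return $(".title a").first().text();
	})
	.then(function(firstTitle) {
		// the value returned by the previous promise
		return firstTitle.toUpperCase();
	})
	.then(function(last, utils) {
		// also available as utils.lastReturn
		console.log(utils.lastReturn);
	});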

The utils object

You've seen the utils object that is passed to promises; it provides useful information and methods to your promises. Here's what you can do with it:

  • .lastReturn, the value returned by the last promise,
  • .stop(), a function to stop the promise chain,
  • .url, the URL given to the scraper,
  • .params, an object with the parameters defined in the router matching pattern.

A more powerful DynamicScraper

When lots of DynamicScraper instances are needed, their creation gets really heavy on resources and takes a lot of time. To make this lighter you can use a factory, which will create only one PhantomJS instance that every DynamicScraper requests a page from. To use it you must start the factory before any DynamicScraper is created, scraperjs.DynamicScraper.startFactory(), and then close the factory after the execution of your program, scraperjs.DynamicScraper.closeFactory(); a sketch of this workflow follows the example below. To make the scraping function more robust you can inject code into the page,

var ds = scraperjs.DynamicScraper
	.create('http://news.ycombinator.com')
	.async(function(_, done, utils) {
		utils.scraper.inject(__dirname+'/path/to/code.js', function(err) {
			// in this case, if there was an error it won't fire the catch promise.
			if(err) {
				done(err);
			} else {
				done();
			}
		});
	})
	.scrape(function() {
		return functionInTheCodeInjected();
	})
	.then(function(result) {
		console.log(result);
	});
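
A minimal sketch of the factory workflow described above, sharing one PhantomJS instance across scrapers:

var scraperjs = require('scraperjs');

// start the factory before any DynamicScraper is created
scraperjs.DynamicScraper.startFactory();

scraperjs.DynamicScraper.create('https://news.ycombinator.com/')
	.scrape(function() {
		return $(".title a").map(function() {
			return $(this).text();
		}).get();
	})
	.then(function(news) {
		console.log(news);
		// close the factory once all the scraping is done
		scraperjs.DynamicScraper.closeFactory();
	});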

Router

The router should be initialized like a class

var router = new scraperjs.Router(options);

The options object is optional, and these are the options:

  • firstMatch, a boolean; if true, the routing will stop once the first path is matched. The default is false (see the sketch below).
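
For instance, a minimal sketch of a router that stops at the first matched path:

var scraperjs = require('scraperjs'),
	router = new scraperjs.Router({ firstMatch: true });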

The following promises can be made over it,

  • on(path:string|RegExp|function(url:string)), makes the promise for the matched URL or regular expression; alternatively you can use a function to accept or reject a given URL. The promises get or request and createStatic or createDynamic are expected after the on promise,
  • get(), makes it so that the matched page will be requested with a simple HTTP request,
  • request(options:Object), makes it so that the matched page will be requested with a possibly more complex HTTP request; scraperjs uses the request module, and this method is a simple wrapper of request.request(),
  • createStatic(), associates a static scraper to scrape the matched page. This returns a ScraperPromise, so any promise made from now on will be made over a ScraperPromise of a StaticScraper. Also, the done promise of the scraper will not be available,
  • createDynamic(), associates a dynamic scraper to scrape the matched page. This returns a ScraperPromise, so any promise made from now on will be made over a ScraperPromise of a DynamicScraper. Also, the done promise of the scraper will not be available,
  • route(url:string, callback:function(boolean)), routes a URL through all matched paths and calls the callback when it's executed; true is passed if the routing was successful, false otherwise,
  • use(scraperInstance:ScraperPromise), uses a ScraperPromise already instantiated (see the sketch after this list),
  • otherwise(callback:function(url:string)), executes the callback function if the routed URL didn't match any path,
  • catch(callback:function(url:string, error:Error)), executes the callback when an error occurred in the routing scope, not in any scraper; for those situations you should use the catch promise of the scraper.
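
The following is a minimal sketch of use() with an already instantiated ScraperPromise; the URL pattern and page are placeholders, and the exact position of use() in the chain is an assumption based on the descriptions above.

var scraperjs = require('scraperjs');

var router = new scraperjs.Router(),
	titleScraper = scraperjs.StaticScraper.create()
		.scrape(function($) {
			return $("title").text();
		})
		.then(function(title) {
			console.log(title);
		});

// assumed ordering: on, then get, then the pre-built scraper via use
router.on('https?://(www.)?example.com/:page')
	.get()
	.use(titleScraper);

router.otherwise(function(url) {
	console.log("Url '" + url + "' couldn't be routed.");
});

router.route('http://example.com/article', function(found) {
	console.log(found ? 'routed' : 'not routed');
});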

Notes

  • Scraperjs always fetches the document with request, and then when using a DynamicScraper, leverages phantom's setContent() to set the body of the page object. This will result in subtly different processing of web pages compared to directly loading a URL in PhantomJS.

More

Check the examples, the tests, or just dig into the code; it's well documented and simple to understand.

Dependencies

As mentioned above, scraperjs uses some dependencies to do the heavy work, such as

  • async, for flow control
  • request, to make HTTP requests; again, if you want more complex requests see its documentation
  • phantom + phantomjs, phantom is an awesome module that links node to phantom, used in the DynamicScraper
  • cheerio, light and fast DOM manipulation, used to implement the StaticScraper
  • jquery, to include jquery in the DynamicScraper
  • although Routes.js is great, scraperjs doesn't use it, to maintain its "interface layout", but the code to transform the path given in the on promise to regular expressions is from them

License

This project is under the MIT license.

scraperjs's People

Contributors

andrewho83, bitfed, chmac, emilkje, jimjea, kickes, rrrene, ruipgil, rvernica, vdraceil


scraperjs's Issues

please add Iconv support

I have added this to make it work, converting GBK to UTF-8. I think this is useful for somebody like me.
This could be made configurable to decode any charset to UTF-8.

Thanks!

var Iconv = require('iconv').Iconv;
body = new Iconv('GBK', 'UTF-8').convert(new Buffer(body, 'binary'));

then promise

I am trying to understand where the 'then' promise comes from.
I have used Q in the past, and believe Node does not yet support ES6?

DynamicScraper from the example not doing anything

I'm trying out the examples exactly as written. The HackerNews example with scraperjs.StaticScraper works perfectly fine and I get a big array of strings.

However, replacing Static with Dynamic doesn't do anything. At all. The execution just ends with no error. I tried debugging it, adding try/catch statements, but I still get no error.
I tried "npm install phantom" too, but it doesn't change anything for scraperjs.
I have no idea what's wrong here.

Error installing with `npm install scraperjs`

It seems that the required gyp version doesn't support that option.

gyp_main.py: error: no such option: --no-parallel
gyp ERR! configure error
gyp ERR! stack Error: gyp failed with exit code: 2
gyp ERR! stack at ChildProcess.onCpExit (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/configure.js:343:16)
gyp ERR! stack at ChildProcess.emit (events.js:98:17)
gyp ERR! stack at Process.ChildProcess._handle.onexit (child_process.js:810:12)
gyp ERR! System Darwin 13.3.0
gyp ERR! command "node" "/usr/local/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js" "rebuild"
gyp ERR! cwd /usr/local/lib/node_modules/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/weak
gyp ERR! node -v v0.10.30
gyp ERR! node-gyp -v v0.13.1
gyp ERR! not ok

Dynamic Hackernews example crashes

ml@hpz4:~$ node n.js 

stream.js:94
      throw er; // Unhandled stream error in pipe.
            ^
Error
    at new ScraperError (/home/ml/node_modules/scraperjs/src/ScraperError.js:7:16)
    at /home/ml/node_modules/scraperjs/src/DynamicScraper.js:58:20
    at Proto.apply (/home/ml/node_modules/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:123:13)
    at Proto.handle (/home/ml/node_modules/scraperjs/node_modules/phantom/node_modules/dnode/node_modules/dnode-protocol/index.js:99:19)
    at D.dnode.handle (/home/ml/node_modules/scraperjs/node_modules/phantom/node_modules/dnode/lib/dnode.js:140:21)
    at D.dnode.write (/home/ml/node_modules/scraperjs/node_modules/phantom/node_modules/dnode/lib/dnode.js:128:22)
    at SockJSConnection.ondata (stream.js:51:26)
    at SockJSConnection.emit (events.js:95:17)
    at Session.didMessage (/home/ml/node_modules/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/transport.js:220:25)
    at WebSocketReceiver.didMessage (/home/ml/node_modules/scraperjs/node_modules/phantom/node_modules/shoe/node_modules/sockjs/lib/trans-websocket.js:102:40)
ml@hpz4:~$ node -v
v0.10.30

DynamicScraper closeFactory() seems to have memory leak

I'm running nodeJS v0.10.35 on Ubuntu 14.04.1 LTS.
I read your documentation and call

  • scraperjs.DynamicScraper.startFactory()

before I request multiple pages in parallel to scrape.
On completion of all the scraping (using Promise), I would call

  • scraperjs.DynamicScraper.closeFactory()

However, looking at the memory usage after a few hundred rounds of function calls, I see there are a bunch of phantomjs processes sitting there using up memory; eventually the system runs out of memory (4G) and nodeJS crashes.

I looked through the source code for this PhantomPoll class, and I don't see anywhere that it closes the "Page". Is a close() for each page needed to release the memory? Could this be the reason for the memory "leak" that I see? Could you please spend a little bit of your time to help check? Much appreciated.

User-id

It is sometimes useful to be able to set the user-id/user-agent for the request as the target website may depend on it.

doc issue

onError(callback:function(utils)), executes the callback when there was an error, errors block the execution of the chain even if the promise was not defined,

done(callback:function(utils)), executes the callback at the end of the promise chain, this is always executed, even if there was an error,

use(ScraperPromise), uses a ScraperPromise already instantiated.

  1. onError callback params is error not utils
  2. done callback params
    • if the first scraper promise errors (such as a wrong host), it will call done(null)
    • if the error is a syntax error in the scrape function, it will call done(utils)
    • done should receive error as second param
  3. I don't find use in the source, any example?

Potential small bug

In ScraperPromise.js line 282. Do you mean to call this.doneCallBack() with (utils) ?

because I'm getting a crash onError with a stack trace like this:

TypeError: Cannot read property 'lastReturn' of undefined
at doneCallback (/home/innopage/node_modules/scraperjs/src/ScraperPromise.js:28:15)
at ScraperPromise._fire (/home/innopage/node_modules/scraperjs/src/ScraperPromise.js:282:9)
at /home/innopage/node_modules/scraperjs/src/ScraperPromise.js:230:9
at Request.processGet [as _callback]
at self.callback (/home/innopage/node_modules/scraperjs/node_modules/request/request.js:123:22)
at Request.emit (events.js:95:17)
at ClientRequest.self.clientErrorHandler (/home/innopage/node_modules/scraperjs/node_modules/request/request.js:244:10)
at ClientRequest.emit (events.js:95:17)
at Socket.socketOnEnd [as onend]
at Socket.g (events.js:180:16)

request@2.62 breaks DynamicScraper

Updating the request module from 2.61 to 2.62 (allowed by semver) breaks DynamicScraper with an 'Unhandled stream error in pipe.' error message. Besides my application code, it breaks the simpler DynamicScraper Hacker News sample too.

In order to obtain an older, working module tree I tried 'npm shrinkwrap' at first, but it threw an error about phantomjs inner dependencies, so I removed ./node_modules/scraperjs/node_modules/, pinned request@2.61 in ./node_modules/scraperjs/package.json and ran npm install inside it. DynamicScraper started working again.

How can I scrape different websites at the same time?

for (var option in options) {
  scraperjs.StaticScraper.create()
            .request(option)
            .scrape(function($) {
                return $(site_info.matchElem).map(function() {
                    return {
                        title: $(this).text(),
                        link: $(this).attr('href')
                    }
                }).get();
            }, function(results) {
                results.forEach(function(result) {
                    console.log(result);
                });
            });
}

I want to use this code to scrape several websites, but it only shows the last website I scrape. How can I do it?

ECONNRESET issue in Windows 7 - More frequently occurring

I have written some NodeJS code with ScraperJS to scrape a website. It runs perfectly on my OSX machine, but when run in Windows it throws an ECONNRESET err (the timing is random) almost every time I try to execute it.

D:\Documents\_Wealth\_PROJECTS\Hampers in London\_Data\famousbirthdays\node_modules\scraperjs\src\ScraperPromise.js:37
                throw err;
                      ^
Error: read ECONNRESET
    at exports._errnoException (util.js:746:11)
    at TCP.onread (net.js:559:26)

I even tried to wrap my router.route(...) method with async.eachLimit(urls, 2, function() {..}), but then I'm not getting this ECONNRESET err; instead, the scraper itself ends smoothly without completing the task. This again works perfectly in OSX.
Any idea why this is happening? Ideally it should have limited only the number of iterators running in parallel, right?

Requiring scraper prevents capturing process signals

Not sure if this should be considered a bug or not, but it should be documented somewhere. After scraperjs is required, process.on(SIGNAL, callback) callbacks are never called. This prevents the ability to do things like hooking into process terminations and sending logs somewhere, or doing any sort of post-process termination cleanups.

It is easily reproducible; here are four examples (just run each script and kill the process): https://gist.github.com/enoex/84dca1dae510c671a537

Not sure whether this needs to be solved by scraperjs itself, as one solution is to just create a script and run it as a child process from a server, but it's a side effect that should be documented. Thanks!

Generate scroll-down event to force images to load

Hello,

I am trying to use this to scrape some images from a website. The problem is the image URLs are only generated if the user scrolls down the page and the images get into the view port. Is it possible to generate a Page Down event (or Ctrl + End or just go at the end of the page) inside the DynamicScraper?

Thanks!

using 32 bit node.js 0.10.37 -- the test fails

First: thank you for writing this npm module , nice work.
On Ubuntu 14.x Linux, using 64-bit node.js v0.10.37, the tests work great.

110 passing (15s)

Writing coverage object [/home/ivyho2/ivyTest/node_modules/scraperjs/coverage/coverage.json]

Writing coverage reports at [/home/ivyho2/ivyTest/node_modules/scraperjs/coverage]

=============================== Coverage summary ===============================
Statements : 100% ( 337/337 )
Branches : 100% ( 111/111 )
Functions : 98.15% ( 106/108 )

Lines : 100% ( 337/337 )

Running "exec:check-coverage" (exec) task

Running "unserve" task

Done, without errors.

Using 32 bit node.js v0.10.37

  usage of utils
    ✓ stop()
    ✓ scraper
    ✓ params
with DynamicScraper
  with Factory

assert.js:93
throw new assert.AssertionError({
^
AssertionError: "abnormal phantomjs exit code: 127" == "random message"
at Domain.<anonymous> (/home/ivyho/testIvy/node_modules/scraperjs/test/ScraperPromise.js:92:12)
at Domain.emit (events.js:95:17)
at process._fatalException (node.js:263:29)
Exited with code: 7.
Warning: Task "exec:coverage" failed. Use --force to continue.

Aborted due to warnings.
npm ERR! Test failed. See above for more details.
npm ERR! not ok code 0

Is it a phantomjs issue?

Exceeded maxRedirects. Probably stuck in a redirect loop

So I'm trying to scrape a URL which seems to keep redirecting, only to result in the following error. This is a well documented issue in the request module, and I tried to create a scraperPromise.request(options) promise with an options object with followAllRedirects = false, but that just returns a scraperPromise that's an [object Object] when I print it to the console.

Here's the relevant stack trace

Error: Exceeded maxRedirects. Probably stuck in a redirect loop http://ubc.summon.serialssolutions.com/search?s.cmd=addFacetValueFilters%28ContentType%2CNewspaper+Article%3At%29&spellcheck=true&s.q=macbeth
    at Request.onResponse (/Users/.../node_modules/scraperjs/node_modules/request/request.js:901:26)
    at ClientRequest.g (events.js:180:16)
    at ClientRequest.emit (events.js:95:17)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (http.js:1688:21)
    at HTTPParser.parserOnHeadersComplete [as onHeadersComplete] (http.js:121:23)
    at Socket.socketOnData [as ondata] (http.js:1583:20)
    at TCP.onread (net.js:527:27)

Here's how I'm trying to use the request. Not sure how I can proceed further since that console.log never executes. Am I doing something wrong? Any help is appreciated!

var scraperPromise = scraperjs.StaticScraper.create();

        scraperPromise.request(options, function (error, response) {
            if (error) {
              callback(error, null);
            } else {
              callback(null, response.request.href);
              console.log(response);
            }
        });

Receiving 'cannot read error property of null' when using the dynamic scraper.

I have the basic dynamic scraper setup and even tried to create a factory, but the error persists. It's at https://github.com/ruipgil/scraperjs/blob/master/src/DynamicScraper.js#L85: whenever that function is called the result argument is null, so it throws an error. I am examining this through a debugger and it seems that the scraper is fetching the website properly, so I am unsure what the problem is. I am using the https://github.com/ruipgil/scraperjs/blob/master/doc/examples/IMDBOpeningThisWeek.js example, except the .scrape function returns $('body');

using phantom version 1.9.0

Please also add error handling example

Hi!

Really like your library.
I cannot figure out how to make it handle errors (wrong URL for example).
Could you add this somewhere?

Greetings,
Martin

OSX Dynamic Scraper Error: spawn EMFILE

This happens to me more frequently (in OSX) when using DynamicScraper.
Is this possibly because of routing too many URLs? Maybe too many Phantom instances are getting created.

child_process.js:958
    throw errnoException(process._errno, 'spawn');
          ^
Error: spawn EMFILE
    at errnoException (child_process.js:1011:11)
    at ChildProcess.spawn (child_process.js:958:11)
    at exports.spawn (child_process.js:746:9)
    at spawn (/Users/kannanv/Personal/Workplace/helloworld/node_modules/scraperjs/node_modules/phantom/node_modules/win-spawn/index.js:54:10)
    at startPhantomProcess (/Users/kannanv/Personal/Workplace/helloworld/node_modules/scraperjs/node_modules/phantom/phantom.js:17:12)
    at Server.<anonymous> (/Users/kannanv/Personal/Workplace/helloworld/node_modules/scraperjs/node_modules/phantom/phantom.js:115:14)
    at Server.emit (events.js:92:17)
    at net.js:1056:10
    at process._tickCallback (node.js:442:13)

Using it on an express app

How is it supposed to work when we want the scrape to happen on demand? So when I go to a URL, the server scrapes a page and presents the results to the user. Since it's asynchronous, how can we achieve this?

Thanks

Getting started?

Sorry, but I'm not following how to simply run the express server and check the project out. How do you start the server and pass in a URL to be scraped? I ran "grunt serve", and the console reported that it was listening on localhost:3000, but the router does not seem to be catching anything. I'm missing something basic here...

jquery file injection failing

When I use this module in my project an exception is being raised. Basically, the jquery script is failing to be injected because the file isn't present. The jquery dependency isn't being installed in scraperjs's node_modules directory so the hard-coded path in DynamicScraper fails.

Near as I can tell jquery is not being installed because it's already in the dependencies of my project, so I guess it's being deduped by NPM (I didn't think version 2 did this, but it's the only logical explanation).

I don't know of a reasonable way around this other than including the script file in the src directory.

Passing options to phantom

Is there a way to pass options (like load-images: no, load-styles: no, etc.) to the DynamicScraper?
It's just that some pages may load faster without those resources.

How to set user agent or proxy

Could you please provide an example of how to pass a proxy and set a user agent?
Basically, how do you access the inner request object?

.catch() function doesn't exist

I am trying to run the following code :

scraperjs.StaticScraper
        .create({
            url: url,
            headers: {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'}
        })
        .scrape(function ($) {

            // REDACTED

        }, function (data) {
            console.log(data);
        })
        .catch(function(err, utils){
            switch (err.code) {
                case 'ENOTFOUND':
                    debug("Page '%s' not found", err.hostname);
                    break;
                case '30THLINK':
                    debug("Page '%s' doesn't have a 30th link", utils.url);
                    break;
                default:
                    debug('Unknown error found %s', err);
            }
        });

And I am getting this error :

[TypeError: Object [object Object] has no method 'catch']

Without this .catch(), I am not able to catch 'ENOTFOUND' errors.

I saw the Error Handling example, where the .catch() is used but with 'router' (which returns a scraper promise as well). I am not sure if I am following the guide correctly based on the available promise list on the scraper object mentioned in README.

Any pointers if I am missing something? (I tried placing .catch() before .scrape() as well)

Arrays are null if assigned more than once

Here's a simple reproduction

scraperjs = require 'scraperjs'

scraperjs.DynamicScraper.create()
.request
  url: 'http://google.com'
.scrape ->
  a = ['a', 'b', 'c']
  a2 = [a]
  a2.push a
  a2.push a
  a2.push a

  # Return a2
  a2

, (obj, scraperObj) ->
  console.log obj

Or the meaningful part in javascript for those who prefer:

scraperjs.DynamicScraper.create().request({
  url: 'http://google.com'
}).scrape(function() {
  var a, a2;
  a = ['a', 'b', 'c'];
  a2 = [a];
  a2.push(a);
  a2.push(a);
  a2.push(a);
  return a2;
}, function(obj, scraperObj) {
  return console.log(obj);
});

I would expect to see this:

[ [ 'a', 'b', 'c' ], [ 'a', 'b', 'c' ], [ 'a', 'b', 'c' ] ]

But I see this:

[ [ 'a', 'b', 'c' ], null, null ]

Is it my mistake? Is it a bug? In this package? PhantomJS?

DynamicScraper: Support timeout for page loading

During scraping I encounter a couple of pages that hang during loading.
I'm scraping thousands of pages in parallel, and some of them get stuck forever (at least 10+ min).
A timeout would solve this problem; ideally it would take a user-supplied value, but a reasonable hardcoded default is also good.

Dynamic Scraper maintain session across requests

It seems like the dynamic scraper works by using the 'request' library to load content and then loads it in phantom with the setContent command.

https://github.com/ariya/phantomjs/wiki/API-Reference-WebPage#webpage-setContent

It seems that to do navigation-based scraping that depends on sessions, we would have to manually parse the appropriate data out of the response object and pass it along in subsequent requests.

It would be nice to have a persistent session mode where the dynamic scraper would work at the level of a phantom WebPage instance (like a browser tab) and load/navigate across pages using js actions (like clicks). This is important for scraping ajax based pages. Tools like CasperJs work well for this, and the use cases for the Dynamic scraper seem a little limited without it.

Dynamic example fails on OpenShift instance

I am using [email protected] on an OpenShift instance. I am trying the examples on the README page. The static example works, but the dynamic one fails with a strange error. Any hints?

I would expect it might be due to the limited environment available on OpenShift instances, and I wonder what the cause is so I can try to fix it.

> cat > static.js
var scraperjs = require('scraperjs');
scraperjs.StaticScraper.create('https://news.ycombinator.com/')
    .scrape(function($) {
        return $(".title a").map(function() {
            return $(this).text();
        }).get();
    }, function(news) {
        console.log(news);
    })
> node static.js 
[ 'Show HN: My SSH server knows who you are',
  'Show HN: JAWS – A JavaScript and AWS Stack',
  'Federal Judge Strikes Down Idaho ‘Ag-Gag Law’',
...
> cat > dynamic.js
var scraperjs = require('scraperjs');
scraperjs.DynamicScraper.create('https://news.ycombinator.com/')
    .scrape(function() {
        return $(".title a").map(function() {
            return $(this).text();
        }).get();
    }, function(news) {
        console.log(news);
    })
> node dynamic.js 

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: listen EACCES
    at errnoException (net.js:905:11)
    at Server._listen2 (net.js:1024:19)
    at listen (net.js:1065:10)
    at net.js:1147:9
    at asyncCallback (dns.js:68:16)
    at Object.onanswer [as oncomplete] (dns.js:121:9)

Scraper function will not work unless it matches regex pattern

I spent a few hours tracking this one down. I encountered a 'cannot read property error of undefined' when I supplied a scrape function to the .scrape promise. I followed everything step by step and wrote the code myself so that I could memorize the flow. It didn't work. Then I pasted the exact same procedure from the examples folder, and sure enough the thing ran without a problem. What happened was that the function you supply is turned into a string and then manipulated to extract whatever is inside the block. This is done through a regex that matches the enclosing 'function(){..}' stuff, but the regex doesn't work unless there is a space between the closing parenthesis and the opening bracket.

so this

.scrape( function($) { .. .... } )

will work just fine, but this

.scrape( function($){ .. .... } )

throws the null error.

A very subtle difference, but it can give you headaches... Anyway, I hope someone who is better with regular expressions can fix this.

The current expression is

var rg = /^function\s+([a-zA-Z_$][a-zA-Z_$0-9]*)?\((.*?)\) {/g;

within the PhantomWrapper.js file

possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.

I was scraping a large site and came across this:

possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.

It stems from request.js. I didn't see this exact issue addressed here, but after a few moments of searching I found the solution, so I thought I'd share it.

var sjs = require('scraperjs');

var scraperPromise = sjs.StaticScraper.create()
.request({
    url: 'https://calendar.lafayette.edu/node/13829', 
    // solution here, within the request object, set jar: true
    jar: true
})
.onError(function(err) {
    console.log("Error occurred:", err);
})
.scrape(function($) {
    return $('div.field-item.even').map(function() {
        return $(this).text();
    }).get();
}, function(gotten) {
    console.log(gotten);
});

scraperPromise.request should support custom function

Sometimes we need to use the last scrape result to create the request options.

// mocha patch, sugar methods
var ddescribe = describe.only;
var xdescribe = describe.skip;
var iit = it.only;
var xit = it.skip;

var expect = require('chai').expect;
var _ = require('lodash');
var URL = require('url');
var scraperjs = require('scraperjs');

ddescribe('scraperjs', function(){
  it('should chain', function(done){
    var scraperPromise = scraperjs.StaticScraper.create();
    scraperPromise
      .get('http://echo.jsontest.com/k1/v1')
      .scrape(function($){
        return $.html();
      }, function(result){
        return result;
      })
      .request(function(result, utils){
        //FEATURE REQUEST: support custom options
        return {
          url: 'http://echo.jsontest.com/k2/v1',
          method: 'post',
          json: {xxx: result}
        }
      })
      .scrape(function($) {
        return $.html()
      }, function(result, utils){
        //how to got the first scrape result?
        //use utils.last got request result?
        result;
        done();
      });
  });
});

Using Asynchronous ScraperFn

Is there any way to use an asynchronous function as the scrapeFn? I have a URL where I need to set an interval to load extra data into the DOM before I actually do any scraping, and then, when a certain condition is met, I do the actual scrape.

The examples show a return statement, but is there any way to do this with a callback? Thanks!

Doesn't install phantomjs dependency (on DynamicScraper only)

It's a similar issue to this one.
I tried doing npm install phantomjs and then using scraperjs; here's what happened:

var scraperjs = require('scraperjs');

// Static: runs just fine
scraperjs.StaticScraper.create('https://news.ycombinator.com/')
.scrape(function($) {
 return $(".title a").map(function() {
     return $(this).text();
 }).get();
}, function(res) {
 console.log(res);
})

// Dynamic: complains about "phantomjs-node: You don't have 'phantomjs' installed"
scraperjs.DynamicScraper.create('https://news.ycombinator.com/')
.scrape(function($) {
 return $(".title a").map(function() {
     return $(this).text();
 }).get();
}, function(res) {
 console.log(res);
})

Facebook

Can we scrape Facebook pages?

how to scrape with json response?

scraperjs.StaticScraper.create()
    .get('http://echo.jsontest.com/k1/v1')
    .scrape(function($){
        // how to get response json?
        return $.html();
     }, function(result){
       return result;
     })

using 32 bit node.js version 0.12 -- 44 tests failed

StaticScraper
✓ .clone
#create
✓ with argument
✓ without argument
.loadBody, .scrape, .close
✓ without errors
✓ with errors

69 passing (4s)

44 failing

  1. DynamicScraper .loadBody, .scrape, .close:
    Uncaught AssertionError: abnormal phantomjs exit code: 127
    at Console.assert (console.js:106:23)
    at ChildProcess.<anonymous> (/home/nodetest/test/node_modules/scraperjs/node_modules/phantom/phantom.js:153:28)

However, using 64 bit node.js version 0.12 -- everything passed
StaticScraper
✓ .clone
#create
✓ with argument
✓ without argument
.loadBody, .scrape, .close
✓ without errors
✓ with errors

110 passing (18s)

Writing coverage object [/home/nodetest/test/node_modules/scraperjs/coverage/coverage.json]

Writing coverage reports at [/home/nodetest/test/node_modules/scraperjs/coverage]

=============================== Coverage summary ===============================
Statements : 100% ( 337/337 )
Branches : 100% ( 111/111 )
Functions : 98.15% ( 106/108 )

Lines : 100% ( 337/337 )

Running "exec:check-coverage" (exec) task

Running "unserve" task

Done, without errors.
