rchipka / node-osmosis Goto Github PK

View Code? Open in Web Editor NEW

4.1K 4.1K 245.0 856 KB

Web scraper for NodeJS

JavaScript 99.73% HTML 0.27%

node-osmosis's People

Contributors

Stargazers

Watchers

Forkers

bawerd schuka-kirill eivindingebrigtsen arvindr21 stuartnz paulsmash checarsner muazu iyerish alpacarodeo antouank protez yanlinaung nivertech arden sarvex kodemill arezki1990 abada riddbengkok fangleen manshu joaobellotti iarnaud eric013 idahoan kantharia maxehnert gdcnb ruslankonev hbehkamal mjgil marufsiddiqui asterixaye woodwardoge jaredmansaakintola wurh sankyutang licong jmtexier andineck reimund petroslee mcanthony brunocascio qcxl kublaj shirk3y forbe cloudxtreme 00kenshin detectiveosint martinheidegger jwarni ngkhoawork koleok intesso bisubus michaelhogg itainteasy jonnybgod mashamba dwdraju ellerbrock pushpen mbhasin kritikag maroonface imjul1an meanlittlechimp tomato1001 ilonabudapesti muxiansen tlevesque stewones jspenc72 lhfazry deadlocked247 mih4ajlo learn-with-data theplue kode-kitchen jfajobi rubythonode oliv23 lspecian direktspeed compareagences w3aran haidermalik12 mpanic despean diaakasem samverneck fehub bokinga cloud-architecture cubehouse maurellius samogot

node-osmosis's Issues

Ajax selection

It's a great scraping solution but what if their is ajax loaded content in the page?

i tried the following but somehow it's not returning any data

osmosis
.get('http://tweakers.net/pricewatch/')
.find('.large a')
.set('product')
.data(function(listing) {
console.log(listing);
})

Any idea?

osmosis get kills when processing large files

Hi, im processing a 30 mb xml, if i just parse basic fields osmosis works;

DEBUG: (get) starting instance 1
(get) loaded [get] https://xxxxxxxxxxx/feed.xml 
(find) found 41162 results for "item" in https://xxxxxxxxxxx/feed.xml 
DEBUG: (process) stack: 0, RAM: 618.86Mb (+587.44Mb) requests: 1, heap: 14.02Mb / 46.69Mb

But if i add some data processing, the process gets kill;

var superTrim = function (s) {
        return s.split('\n').map(function (l) {
            return l
            .replace(/\r/g, ' ')
            .replace(/\s{2,}/g, ' ');
        })
        .join('\n')
        .replace(/ \n/g, '\n')
        .replace(/\n{3,}/g, '\n\n')
        .trim();
};


osmosis.
(.....)
.then(function(context, job, next) {
    var shasum = crypto.createHash('md5');
    shasum.update(job.url.toString());
    job.md5 = shasum.digest('hex');

    job.title = superTrim(job.title.toString());
    job.description = superTrim(job.description.toString());

    job.published = moment(job.published, 'DD/MM/YYYY').format('YYYY-MM-DD HH:mm:00');

    var shasum = crypto.createHash('md5');
    shasum.update(job.title + job.description);
    job.checksum = shasum.digest('hex');

    next(context, job);
})

Any ideas? could momentjs be leaking memory?

Regards, Eugenio

Pagination loads but skips previous commands

I am currently working on trying to get my scraper to paginate and it doesnt seem to do anything

I20150709-14:05:08.890(-4)? Scraping Heating & Air Conditioning in Charlotte, NC
I20150709-14:05:08.890(-4)? http://myURL
I20150709-14:05:09.897(-4)? (get) loaded [get] http://myURL
I20150709-14:05:09.905(-4)? (find) found 16 results for ".business-info" in http://myURL (/html)
I20150709-14:05:10.582(-4)? (paginate) loaded [get] http://myURL&page=2 {}

Notice how it doesnt "find" anything else it just paginates and does nothing? Am i doing something wrong?

Should work with german umlauts

In my resulting console output all umlauts like öäüß are destroyed.
The site is encoded with charset=iso-8859-1

Something wrong with CSS Selector in Version 0.0.5

Code:
{
'city': '.cityName h3',
'district': '.cityName h1',
'current': '.sk .tem span',
'wet': '.sk .num .sp1',
'wind': '.sk .num .sp2',
--> 'otherCity[]': '.city li a span',
'otherCityLink[]': '.city li a@href'
}

'.city li a span' works fine in v0.0.4.
But in v0.0.5 show this message:
omsosis DEBUG (get) starting instance 2
XPath error : Invalid expression
.//*[contains(concat(' ',normalize-space(@Class),' '),' city ')]//li a//span

Plz help..

Multiple sets on the same page

I'm trying to scrape multiple sets from one and the same page.
Imagine scraping all hrefs from a

tag and then all hrefs from an tag.
I want to avoid making two GET requests to do that.

In my opinion the command chain would look this ...

page = osmosis.get(pageUrl)

.find('menu')
.set({ 
    // get the <menu> links
})
.data(function() {
    // do something with the <menu> links
})

.find('aside')
.set({ 
    // get the <aside> links
})
.data(function() {
    // do something with the <aside> links
})

.then(function() {
    // ... finally write everything to disk
});

... but .find('aside') doesn't get the whole context passed.

How would I do that with osmosis (not using another promise or a generator)?

How to get the find domNode's innerHTML ?

osmosis = require 'osmosis'

osmosis
.get('http://www.cnbeta.com/')
.find('div.hd > div.title > a')
.follow('@href')
.set({
    'content': 'how to get domNode's innerHTML'
})
.data (listing)->
  console.log listing

default is get text

xmljs.node Error

When trying to require Osmosis I get the error below. I'm running node v0.10.25.

Thanks for looking it over.

/root/node_modules/osmosis/node_modules/libxmljs/node_modules/bindings/bindings.js:83
throw e
^
Error: /root/node_modules/osmosis/node_modules/libxmljs/build/Release/xmljs.node: undefined symbol: node_module_register
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Module.require (module.js:364:17)
at require (module.js:380:17)
at bindings (/root/node_modules/osmosis/node_modules/libxmljs/node_modules/bindings/bindings.js:76:44)
at Object. (/root/node_modules/osmosis/node_modules/libxmljs/lib/bindings.js:1:99)
at Module._compile (module.js:456:26)
at Object.Module._extensions..js (module.js:474:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)

How to write data to json file.

.data(function(listing) {
    jsonfile.writeFileSync(file, listing);
})

I know there is a done function, but if there is a error in the scraping process, this function will never be called, but I don't want to lose previous scraped data. So How to write data to json file?

Error when crawling thousands of pages

This is my config

  osmosis
  .config({proxy: 'localhost:8118'})
  .get(url)
  .paginate('#fecha_logo>a+img+a')
  .find('h2>a,#bloque_titulares h3>a')
  .follow('@href')
  .set({
    'title':        'h2',
    'subtitle':     '.volanta',
    'content1':     '.intro',
    'content2':     '#cuerpo',
    'html':         'html'
  })
  .then(function(context, data, next) {
    var url = context.doc().request.url;
    var u = url.split("-");
    var date = new Date(u[2], u[3]-1, u[4].split('.')[0]);
    var article = new Article({
      title: iconv.decode(data.title, "utf-8"),
      subtitle: iconv.decode(data.subtitle, "utf-8"),
      content: htmlToText.fromString(iconv.decode(data.content1 + data.content2, "utf-8")),
      html: data.html,
      url: url,
      date: date
    });
    article.save(function (err, obj) {
      if (err) console.log('Error saving: ' + err);
      console.log('     [' + obj.date + '] ' + obj.title);
    });
    next(context, data);
  })
  .log(console.log)
  .error(console.log)
  .debug(console.log);

Is this error from something in osmosis?

(get) Error: socket hang up
[get] http://www.pagina12.com.ar/diario/elmundo/4-263946-2015-01-15.html tries: 1 - socket hang up
/home/jperelli/crawler/node_modules/osmosis/lib/promises.js:411
                    data.obj[key] = getContent(context.get(val));
                                                      ^

TypeError: Cannot read property 'get' of null
    at loopObject (/home/jperelli/crawler/node_modules/osmosis/lib/promises.js:411:55)
    at Promise.set [as cb] (/home/jperelli/crawler/node_modules/osmosis/lib/promises.js:416:9)
    at Promise.start (/home/jperelli/crawler/node_modules/osmosis/lib/promise.js:227:9)
    at /home/jperelli/crawler/node_modules/osmosis/lib/promise.js:217:81
    at /home/jperelli/crawler/node_modules/osmosis/lib/promises.js:148:25
    at /home/jperelli/crawler/node_modules/osmosis/lib/promise.js:266:5
    at /home/jperelli/crawler/node_modules/osmosis/index.js:231:13
    at done (/home/jperelli/crawler/node_modules/osmosis/node_modules/needle/lib/needle.js:357:7)
    at ClientRequest.had_error (/home/jperelli/crawler/node_modules/osmosis/node_modules/needle/lib/needle.js:364:5)
    at emitOne (events.js:77:13)
    at ClientRequest.emit (events.js:169:7)
    at Socket.socketCloseListener (_http_client.js:235:9)
    at emitOne (events.js:82:20)
    at Socket.emit (events.js:169:7)
    at TCP._onclose (net.js:469:12)

Dealing with parsing errors arising from malformed web pages.

I've been trying to get Osmosis up and going to crawl through some data on a website which unfortunately has malformed DOM trees in some areas. I've used the then call to inspect the context and discovered a whole bunch of libxml parsing error objects. Data which I need to continue crawling is inevitably lopped off due to bad parsing.

I noted in another issue that fetching the raw HTML response from the context is not supported, which is unfortunate because I think I could pull some cringe-worthy regular expression voodoo to get the data I need to continue crawling.

I was wondering if I missed anything in Osmosis which would let me grab any HTML that libxml has not been able to parse successfully.

Thanks in advance.

How to get the find domNode's innerText ?

How to get the find domNode's innerText ?For example

osmosis
.get('https://twitter.com/FunStoryID').
find('p.ProfileTweet-text')
.then(function(context, data, next){
    next();
})
.set({
    'title':'innerText',
})
.data(function(item){
    console.log(item);
});

I want to get the p tag's innerText to set to title.

osmosis.js not found and osmosis.get is not a function

Hi, I am trying to use node-osmosis in a script for an html file.
I have created a directory and executed the command: npm install osmosis.
Next, I copy-pasted the craigs list script given and am attempting to see if that works.
The first error I got was that "require is not defined".
I proceeded to download require js, and used the command: npm install requirejs.
Then in my html file before I ran the craigs list script, I loaded the require.js.
Now, my HTML code looks like the following:

(testing.js is just the craigs list code)
The errors I am getting now include

Osmosis.js not found
osmosis.get is not a function

Do you have any recommendations about what to do ?

Accessing JavaScript Variables

Hi,
I currently use casper/phantom's evaluate function to read global JavaScript variables from the page. It's slow, but still easier than scraping script tags manually with regex to get out vars.

Does Osmosis have something to offer in that direction?

Clarification

What exactly is meant by "DOM support and the ability to run scripts/CSS without a headless browser"?

Does this mean that you will be able to scrape data from a page request that potentially uses AJAX calls or client side code to manipulate the DOM to be scraped?

Clicking JS-only elements/links/buttons

Is this possible? I hit a wall now, osmosis doesn't seem to have a generic .click() method

Osmosis with express?

Hey,

Maybe it's a stupid question but I`m trying to use osmosis inside 'app.get' function but it's working only once, is there a way to reset osmosis and initialize it more then once?

Thanks, great project!

Unable to run on (X)Ubuntu 14.04.2 LTS

Trying to run your sample provided in the README that targets Craigslist after installing via npm install -g yields the following error:
(My $NODE_PATH is /usr/local/lib/node_modules)

/usr/local/lib/node_modules/osmosis/node_modules/libxmljs/node_modules/bindings/bindings.js:91
  throw err
        ^
Error: Could not locate the bindings file. Tried:
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/build/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/build/Debug/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/build/Release/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/out/Debug/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/Debug/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/out/Release/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/Release/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/build/default/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/compiled/1.6.3/linux/x64/xmljs.node
    at bindings (/usr/local/lib/node_modules/osmosis/node_modules/libxmljs/node_modules/bindings/bindings.js:88:9)
    at Object.<anonymous> (/usr/local/lib/node_modules/osmosis/node_modules/libxmljs/lib/bindings.js:1:99)
    at Module._compile (module.js:410:26)
    at Object.Module._extensions..js (module.js:428:10)
    at Module.load (module.js:335:32)
    at Function.Module._load (module.js:290:12)
    at Module.require (module.js:345:17)
    at require (module.js:364:17)
    at Object.<anonymous> (/usr/local/lib/node_modules/osmosis/node_modules/libxmljs/index.js:4:16)
    at Module._compile (module.js:410:26)

Osmosis + Nock support

I wasn't really sure where I should create this issue, so I created it on both projects. See nock/nock#335

I'm facing some issues using Osmosis and Nock for mocking HTTP responses in my unit tests, see the build failing on Travis. The before timeout in mocha gets triggered because Osmosis is encountering an error in the request/parsing.

I've written the smallest reproducible test case I could come up with, based on the failing build above.

test case

var osmosis = require('osmosis');
var nock = require('nock');

nock('https://www.spotify.com')
  .get('/select-your-country/')
  .reply(200, 'whatever');

osmosis
  .get('https://www.spotify.com/select-your-country/')
  .find('.country-list .country-item .country-link')
  .set({
    countryCode: '@rel'
  })
  .error(console.error)
  .done(function(){
    // never reaches here :sadface:
    console.log('it works!');
  });

error

(get) TypeError: Cannot read property '_hasBody' of undefined
[get] https://www.spotify.com/select-your-country/ tries: 1 - Cannot read property '_hasBody' of undefined
    at /Users/matiassingers/dev/osmosis-nock-5226/node_modules/osmosis/index.js:150:35
    at done (/Users/matiassingers/dev/osmosis-nock-5226/node_modules/osmosis/node_modules/needle/lib/needle.js:355:7)
    at PassThrough.<anonymous> (/Users/matiassingers/dev/osmosis-nock-5226/node_modules/osmosis/node_modules/needle/lib/needle.js:523:9)
    at PassThrough.emit (events.js:117:20)
    at _stream_readable.js:943:16
    at process._tickCallback (node.js:419:13)

possible bug

The offending lines in Osmosis look like this:

if (!res.socket._httpMessage._hasBody || data.length == 0)
  throw(new Error('Document is empty'))

res.socket contains no _httpMessage property when returning the mocked response from Nock, this is the entire object:

{ domain: null,
  _events: {},
  _maxListeners: 10,
  writable: true,
  readable: true,
  setNoDelay: [Function: noop],
  setTimeout: [Function],
  _checkTimeout: [Function],
  setKeepAlive: [Function: noop],
  destroy: [Function: noop],
  resume: [Function: noop],
  getPeerCertificate: [Function: getPeerCertificate] }

The question then is:

Is Nock mocking the requests wrongly by not including the _httpMessage._hasBody property?
Or shouldn't Osmosis be checking for this property at all? Might be deprecated or something.

How to get the HTML body string after the Osmosis followed some url?

osmosis
.follow('a.result', function(context, data, nextPage) {
// can I get the HTML body here?
})

What's the correct way to handle missing data?

Whenever one of my find-commands fails and gives me "(find) no results for ____" Osmosis seems to abort the chain.

I tried catching the error with .error, but my subsequent find/data commands are still ignored.

What's the correct way to make Osmosis continue, ignoring the missing data?

I'm doing something like:

osmosis
    .get(url) 
    .find(...)
        .set(...)
    .follow(...)
        .find(...)
            .set(...)
        .find(...) // This selector does not always resolve
            .set(...})
        .find(...) // This doesn't seem to run if the above fails
            .set(...)
    .follow('a@href') // Neither does this
        .find(...)
            .set(..)
    .data(function(stuff) {
        // And I never get here :(
    })
    .done(function() {
        // But I do get here...
    })

how to catch errors?

Hi, thank you for this great lib!
I couldn't find an equivalent to other promises libs .catch or .fail to handle errors, is there something like that?
If not, as I'm using Bluebird, I was thinking to wrap Osmosis promises in a Bluebird promise, any reasons why it wouldn't be a good idea?
thanks in advance!
Max

`done` not being called

It looks like done isn't begin called anymore in the current version of the code.

I moved a project to a different system and it didn't work. I did a diff and it looks like a change to promise.js might have broke it. Using the old code in node_modules2 done works for me. I could be doing something wrong though, this is my first node project :)

diff -bur node_modules/osmosis/lib/promise.js node_modules2/osmosis/lib/promise.js
--- node_modules/osmosis/lib/promise.js 2015-04-12 01:13:06.000000000 +0000
+++ node_modules2/osmosis/lib/promise.js 2015-06-07 01:51:50.476756092 +0000
@@ -63,6 +63,7 @@
p.name = name;
p.cb = cb;
p.args = args;

                          p.stackPending = false;
                        return p.next;
                }else{
                        this.instance = ++instances;

@@ -75,6 +76,7 @@
this.next.depth = this.depth+1;
this.next.prev = this;
this.args = args;

          this.stackPending = false;

        if (typeof this.initialized === 'function') {
                this.initialized(this);

@@ -160,16 +162,20 @@
};

Promise.prototype.start = function(context, data) {

```
  if (this.stackPending === true) {
```
```
          parser.stack--;
```
```
          this.stackPending = false;
```

  }
if (context === null || this.next === undefined)
    return;
if (context === undefined && this.depth !== 0) {
        this.error('no context');
}else if (this.cb !== undefined) {

```
          if (context !== undefined)
```

          if (this.next.stackPending === false) {
        parser.stack++;

                  this.next.stackPending = true;

          }
        this.cb(context, extend({}, data));

```
          context = null;
```
```
          parser.stack--;
}
```
}

How to push the result item to array?

osmosis
    .get('https://twitter.com/FunStoryID')
    .find('div.ProfileTweet-contents')
    .then(function(context, data, next){
        next();
    })
    .set({
        'link':'p[1] a@title',
        'title': "p[1]",
    })
    .data(function(item){
        // push item to array
    });

I want to push all the items to an array,and then send to an api to insert, not directly insert into db.

Examples should hint that a final .done(callback) is required.

No examples hint that a final .done() is required to run without any errors. Without this, every scraping attempt ends up in a TypeError: Cannot read property 'done' of undefined error.

Using osmosis with multiple sites in parallel

I have an express api that should return data scrapped from various sites depending on the express route.

But if multiple requests are executed at the same time or before one of theme is finished errors start being thrown, for example "Can't set headers after they are sent".

I believe this is because context are being passed from one osmosis instance to another. How can I separate entire osmosis instances so I can accomplish this?

some code from express

app.get '/api/scrape/:site/:ref', (req, res, next) ->
  if req.params.ref.length < 7
    res.status(400).json 'error' : 'Invalid reference'
  else
    siteScrape = new require("./src/#{req.params.site}")
    siteScrape req.params.ref, (err, data)->
      if err
        res.status(400).json
          error : err.toString()
      else
        res.status(200).json data
  return

Thanks

How to parse xml?

I am easily able to parse HTML however facing issues with XML. My sample xml document looks like

   <urlset>
         <url>
           <loc>http://www.abc.com/1</loc>
           <image:image>
              <image:loc>http:/www.abc.com/image1.png</image:loc>
              <image:caption><![CDATA[Title 1]]></image:caption>
           </image:image>
         </url>
         <url>
           <loc>http://www.abc.com/2</loc>
           <image:image>
              <image:loc>http:/www.abc.com/image2.png</image:loc>
              <image:caption><![CDATA[Title 2]]></image:caption>
          </image:image>
        </url>
   <urlset>

And I was trying with this code:

osmosis.get('http://www.abc.com/sitemap.xml')
        .set([{
            'url':'url > loc',
            'image':'url > image:loc'
        }])
        .data(function(array){
            console.log(array);
        });

post: TypeError: Cannot read property 'preloader' of undefined

I'm having trouble attempting to use the post method:

When I do

osmosis.post('www.craigslist.org/about/sites');

I get

node-osmosis/lib/promise.js:17
if (typeof cb.preloader === 'function')
             ^
TypeError: Cannot read property 'preloader' of undefined
    at null.post (node-osmosis/lib/promise.js:17:16)

Is it possible to scrape binary data?

For instance:

osmosis
    .get("http://www.todayonline.com/print/1166611")
    .find("//*[@class='image-top']/img/@src")
    .set("img")
    .follow("@src")
...

Quite obviously, the call to .follow("@src") will return binary image data - which I can confirm is happening when I enable the log/debug/error hooks; I see 2 HTTP requests made, instead of 1, and also a whole bunch of XML parsing errors on the binary image data (to be expected).

I wonder if you've got any functionality to scrape the base64-encoded version of the raw image binary data, so that I can place it together with the its @src attribute, in the final data output?

How to validate the request config?

I set the request config via https://github.com/rc0x03/node-osmosis/wiki/01.-The-command-chain,

 osmosis
 .config({
    tries: 3,
    timeout: 1000,
    proxy: 'http://localhost:8889',
})
.get(url);

but the result is that the request config do not work

Version 0.0.5 seems to incompatible with Request

I use both Osmosis and Request in my project.
Request for fetching a task config from my server.
And Osmosis will do the crawling.
But after 5 ~ 6 times, Request will failed to response/callback correctly.
Plz have a look at it.

Code Like:
Request.get(url, function (err, res, body) {
// after 5 ~ 6 times, this function will not be called any more, and just wait for the request timeout.
osmosis.get(body).done();
});

How to get the HTML body string after the Osmosis get called?

osmosis
.get(url)
.then(function (context, data, next) {
  // can I get the HTML body here?
});

Http Proxy Rotation

Hey @rc0x03 ,

Awesome library! After browsing osmosis and needle docs / code it doesn't appear to me that there's a way to rotate your http proxies. Let's say that I've got a list of 100 http proxies and I'd like to have Osmosis randomly select one per each needle request or use a round robin etc.. Is there any way to currently do this? If not, is this something you are open to?

Thanks

Cannot call method 'done' of undefined

Was running below code for testing and recieved the following error msg:

osmosis/lib/promise.js:108
        this.next.done();
                  ^
TypeError: Cannot call method 'done' of undefined

var osmosis = require('osmosis');

osmosis
.get('www.craigslist.org/about/sites') 
.find('h1 + div a')
.set('location')
.data(function(listing) {
    // do something with listing data
    console.log(listing);
})

Thoughts on collaborating?

It seems like we're heading in the same direction. I've been working on the following library: https://github.com/lapwinglabs/x-ray.

I really like some of your design decisions here, specifically around offering a native parser and how you're handling an array of items.

I think having native bindings as the default makes sense, but having a fallback on a node-only solution (like ws or bson) would be helpful.

Some things that x-ray adds is a pluggable driver and more fine grain control on how many requests you're making. So it makes sense to me to merge the projects or come up with some way of working together.

Let me know! :-D

Set complex data

Example

osmosis.set({
    'title':  'a.title',
    'description': 'p.description',
    'url': 'a.permalink @href',
    'author': {
        'id': 'a.author @data-id'
        'name': 'a.author'
        'link': 'a.author @href'
    }
});

parse.find: TypeError: Property 'next' of object [object Object] is not a function

When I use parse and try to find,

osmosis.parse('<html><body>text</body></html>').find("body");

I get the following error

node-osmosis/lib/promise.js:235
                this.next(parser.parse(data));
                     ^
TypeError: Property 'next' of object [object Object] is not a function
    at null.initialized (node-osmosis/lib/promise.js:235:8)
    at null.find (node-osmosis/lib/promise.js:82:9)

Can't get two array data by one set

Code:

    .set({
        'otherCityLink[]': 'li a@href'
        'otherCity[]': 'li a > span'
    })

the second field of the result will be null.
and have to change the code to:

    .set({
        'otherCityLink[]': 'li a@href'
    })
    .set({
        'otherCity[]': 'li a > span'
    })

How to abort a running osmosis?

For some reason, I want to abort a running osmosis.
How to do it?

Dynamic .get() URLs

I'm failing a bit at getting data to pass into the .get, the url isn't directly encoded from the web page but rather a constructed url using an id number. I tried using data to construct it, but wasn't able to pass it to the get command. Thoughts?

.get(init)
.follow('div.name > a')
.set({
    'someNo': 'tr.no'
})
.data(function(data) {
    data.url = 'http://somesite'+ data.someNo + 'moreurl'
})
.get(data.url)

Which fails because no data.url is undefined.

How might I go about passing data to the .get?

Best Regards

Scrape innerHTML propery of an element?

Hello, than you very much for great scraper.

I could not find in the documentation if it is possible to scrape properties of element, like innerHTML.
Is it possible to get scrape not just plain text but rather html content? Something like
.get("https://website.com")
.find('element')
.set('element@innerHTML')

Thank you very much for your help.

Example doesn't work

After installing and trying the example, I get the following error:

Error: Cannot find module '/usr/lib/node_modules/needle'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/Users/matt/Playground/osmosis/node_modules/osmosis/index.js:2:14)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)

After updating that require, I'm getting:

/Users/matt/Playground/osmosis/node_modules/osmosis/node_modules/needle/lib/needle.js:562
        throw new TypeError('Invalid type for ' + key);
              ^
TypeError: Invalid type for follow
    at Object.exports.defaults (/Users/matt/Playground/osmosis/node_modules/osmosis/node_modules/needle/lib/needle.js:562:15)
    at Object.<anonymous> (/Users/matt/Playground/osmosis/node_modules/osmosis/index.js:49:8)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/Users/matt/Playground/osmosis/index.js:1:77)
    at Module._compile (module.js:456:26)

Support for nested objects in array?

In the example, you use .find and .set to populate an array of image URLS (strings) - that's great - but what if I want to populate an array of objects? Let's say I want to save an array of images as:
[ { src : String, title: String } ]. Any support for this?
If not, a decent syntax might be: images[].src : a@href, images[].title: a@title. (perhaps), or: "images[]" : { src : "a@href", title : "a@title" }

(bug) v1.0.0: README.md demo code not working

Several issues I bumped into (sha 3970056)
After similar experience with Issue #1 I noticed that the package.json version is now 1.0.0, and npmjs.org only v0.0.3. So I cloned the repo and npm link'ed:

$ git clone [email protected]:rc0x03/node-osmosis.git
$ cd node-osmosis
$ npm link
#NPM logs go here...
[email protected] node_modules/needle
├── [email protected]
└── [email protected] ([email protected])

[email protected] node_modules/libxmljs
├── [email protected]
└── [email protected]
$ cd ../osmosis-test
npm link osmosis
$ npm i
npm WARN package.json [email protected] No description
npm WARN package.json [email protected] No repository field.
npm WARN package.json [email protected] No README data
[email protected] node_modules/needle
├── [email protected]
└── [email protected] ([email protected])
#index is the exact same demo code from README.md at commit 397005653fceabec90c5ee6772bdade62fb6312d
$ node index.js 
/Users/lf/sandbox/node-osmosis/lib/promise.js:110
    this.next.debugNext(msg);
             ^
TypeError: Cannot read property 'debugNext' of undefined
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:11)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)

After that I place a if in the lib/promise.js

Promise.prototype.debugNext = function(msg) {
    if(this.next) this.next.debugNext(msg);
}

While this doesn't fix the issue at least then I was able to run the demo code.

Thanks, Luis

Logging into a page with basic http authentication?

I'm doing a little web page automation, but I need to log in to the page using basic http authentication first. I can do this with node-request, but either I haven't figured out how with osmosis, or even if osmosis supports it.

Any advice, hints or tips?

Concurrency not in effect?

I wrote a script that generates a long command chain like this one:

osmosis
.get('http://...').set({...}).data(fn)
.get('http://...').set({...}).data(fn)
...
.get('http://...').set({...}).data(fn)
.done(fn);

I would have expected osmosis to kick off many concurrent connections (5, by default) but when I look at the requests property of the Parser instance when I run it, the number never gets past 1 and scraping seems to happen one url at a time indeed given the slowness.

Am I missing something here or is it a genuine issue with this library?

Error: htmlCheckEncoding: unknown encoding gb2312

Problem:

When I crawl in the website which's url is http://news.163.com/15/0416/04/AN9VM4U600014AED.html. I get an unknown encoding error.

Output:

按照单位、�幸怠⑾钅俊⒌赜蛉哺堑谋曜迹婷�2013年

Except Output:

按照单位、行业、项目、地域全覆盖的标准，全面摸清2013年

Description:

It use the gb2312 also call gbk for it's encoding, you can get the introduction of gb2312 in wikipedia(http://en.wikipedia.org/wiki/GB_2312). Before using the osmosis, I use the iconv-lite to solve the encoding error and I get the right answer at last, you can refer to it.

reference

https://github.com/ashtuchkin/iconv-lite
http://en.wikipedia.org/wiki/GB_2312

ERROR: