
node-osmosis's People

Contributors

cubehouse, deadlocked247, e-e-e, hasnat, ivanca, michaelhogg, n-devr, rchipka, samogot


node-osmosis's Issues

Set complex data

Example

osmosis.set({
    'title':  'a.title',
    'description': 'p.description',
    'url': 'a.permalink @href',
    'author': {
        'id': 'a.author @data-id',
        'name': 'a.author',
        'link': 'a.author @href'
    }
});
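
For context, here is a minimal sketch of how a nested set like this would sit in a full command chain; the URL and selectors are placeholders, and nested objects in .set() are exactly the feature this issue is asking about, so whether it runs depends on the version:

var osmosis = require('osmosis');

osmosis
    .get('http://example.com/articles')      // placeholder URL
    .find('div.article')                     // one context per article
    .set({
        'title':       'a.title',
        'description': 'p.description',
        'url':         'a.permalink @href',
        'author': {
            'id':   'a.author @data-id',
            'name': 'a.author',
            'link': 'a.author @href'
        }
    })
    .data(function(article) {
        console.log(article);
    });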

Accessing JavaScript Variables

Hi,
I currently use casper/phantom's evaluate function to read global JavaScript variables from the page. It's slow, but still easier than manually scraping script tags with regex to get the variables out.

Does Osmosis have something to offer in that direction?
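
Not an official answer, but one workaround that avoids a headless browser is to select the script tags and pull the variable out with a regex inside a .then step. A hedged sketch; the variable name and URL are hypothetical, and context.text() assumes the context is a libxmljs node:

var osmosis = require('osmosis');

osmosis
    .get('http://example.com')                       // hypothetical page
    .find('script')                                  // every <script> tag
    .then(function(context, data, next) {
        var src   = context.text();                  // assumed: libxmljs Element#text()
        var match = /window\.pageConfig\s*=\s*(\{[\s\S]*?\});/.exec(src);
        if (match) {
            try { data.pageConfig = JSON.parse(match[1]); } catch (e) { /* not strict JSON */ }
        }
        next(context, data);
    })
    .data(function(data) {
        if (data.pageConfig) console.log(data.pageConfig);
    });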

Logging into a page with basic http authentication?

I'm doing a little web page automation, but I need to log in to the page using basic HTTP authentication first. I can do this with node-request, but I either haven't figured out how to do it with osmosis, or osmosis doesn't support it.

Any advice, hints or tips?
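
Not confirmed by the docs, but if .config() forwards headers to the underlying needle request (an assumption), a Basic auth header could be attached up front. A minimal sketch with placeholder credentials:

var osmosis = require('osmosis');

var user = 'myuser', pass = 'mypass';   // placeholders
var auth = 'Basic ' + new Buffer(user + ':' + pass).toString('base64');

osmosis
    .config({ headers: { 'Authorization': auth } })  // assumes headers reach needle unchanged
    .get('http://example.com/protected')             // placeholder URL
    .find('h1')
    .set('title')
    .data(console.log);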

What's the correct way to handle missing data?

Whenever one of my find commands fails and gives me "(find) no results for ____", Osmosis seems to abort the chain.

I tried catching the error with .error, but my subsequent find/data commands are still ignored.

What's the correct way to make Osmosis continue, ignoring the missing data?

I'm doing something like:

osmosis
    .get(url) 
    .find(...)
        .set(...)
    .follow(...)
        .find(...)
            .set(...)
        .find(...) // This selector does not always resolve
            .set(...)
        .find(...) // This doesn't seem to run if the above fails
            .set(...)
    .follow('a@href') // Neither does this
        .find(...)
            .set(...)
    .data(function(stuff) {
        // And I never get here :(
    })
    .done(function() {
        // But I do get here...
    })

post: TypeError: Cannot read property 'preloader' of undefined

I'm having trouble attempting to use the post method:

When I do

osmosis.post('www.craigslist.org/about/sites');

I get

node-osmosis/lib/promise.js:17
if (typeof cb.preloader === 'function')
             ^
TypeError: Cannot read property 'preloader' of undefined
    at null.post (node-osmosis/lib/promise.js:17:16)

osmosis get kills when processing large files

Hi, I'm processing a 30 MB XML file. If I just parse basic fields, osmosis works:

DEBUG: (get) starting instance 1
(get) loaded [get] https://xxxxxxxxxxx/feed.xml 
(find) found 41162 results for "item" in https://xxxxxxxxxxx/feed.xml 
DEBUG: (process) stack: 0, RAM: 618.86Mb (+587.44Mb) requests: 1, heap: 14.02Mb / 46.69Mb

But if I add some data processing, the process gets killed:

var superTrim = function (s) {
    return s.split('\n').map(function (l) {
        return l
            .replace(/\r/g, ' ')
            .replace(/\s{2,}/g, ' ');
    })
    .join('\n')
    .replace(/ \n/g, '\n')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
};


osmosis.
(.....)
.then(function(context, job, next) {
    var shasum = crypto.createHash('md5');
    shasum.update(job.url.toString());
    job.md5 = shasum.digest('hex');

    job.title = superTrim(job.title.toString());
    job.description = superTrim(job.description.toString());

    job.published = moment(job.published, 'DD/MM/YYYY').format('YYYY-MM-DD HH:mm:00');

    var shasum = crypto.createHash('md5');
    shasum.update(job.title + job.description);
    job.checksum = shasum.digest('hex');

    next(context, job);
})

Any ideas? Could moment.js be leaking memory?

Regards, Eugenio

How to get the found domNode's innerText?

How do I get the found DOM node's innerText? For example:

osmosis
    .get('https://twitter.com/FunStoryID')
    .find('p.ProfileTweet-text')
    .then(function(context, data, next){
        next();
    })
    .set({
        'title': 'innerText',
    })
    .data(function(item){
        console.log(item);
    });

I want to get the p tag's innerText and set it as title.
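
For what it's worth, other issues on this page note that .set() captures a node's text content by default, which is effectively innerText. A minimal sketch under that assumption, following the README's .set('location') pattern:

var osmosis = require('osmosis');

osmosis
    .get('https://twitter.com/FunStoryID')
    .find('p.ProfileTweet-text')
    .set('title')                  // captures the found node's text under the key 'title'
    .data(function(item) {
        console.log(item.title);
    });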

(bug) v1.0.0: README.md demo code not working

Several issues I bumped into (sha 3970056):
After a similar experience with Issue #1, I noticed that the package.json version is now 1.0.0, while npmjs.org only has v0.0.3. So I cloned the repo and npm link'ed it:

$ git clone git@github.com:rc0x03/node-osmosis.git
$ cd node-osmosis
$ npm link
#NPM logs go here...
[email protected] node_modules/needle
├── [email protected]
└── [email protected] ([email protected])

[email protected] node_modules/libxmljs
├── [email protected]
└── [email protected]
$ cd ../osmosis-test
$ npm link osmosis
$ npm i
npm WARN package.json [email protected] No description
npm WARN package.json [email protected] No repository field.
npm WARN package.json [email protected] No README data
[email protected] node_modules/needle
├── [email protected]
└── [email protected] ([email protected])
# index.js is the exact same demo code from README.md at commit 397005653fceabec90c5ee6772bdade62fb6312d
$ node index.js 
/Users/lf/sandbox/node-osmosis/lib/promise.js:110
    this.next.debugNext(msg);
             ^
TypeError: Cannot read property 'debugNext' of undefined
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:11)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)
    at Promise.debugNext (/Users/lf/sandbox/node-osmosis/lib/promise.js:110:12)

After that I placed an if in lib/promise.js:

Promise.prototype.debugNext = function(msg) {
    if(this.next) this.next.debugNext(msg);
}

While this doesn't fix the issue, at least I was able to run the demo code.

Thanks, Luis

Dealing with parsing errors arising from malformed web pages.

I've been trying to get Osmosis up and going to crawl through some data on a website which unfortunately has malformed DOM trees in some areas. I've used the then call to inspect the context and discovered a whole bunch of libxml parsing error objects. Data which I need to continue crawling is inevitably lopped off due to bad parsing.

I noted in another issue that fetching the raw HTML response from the context is not supported, which is unfortunate because I think I could pull some cringe-worthy regular expression voodoo to get the data I need to continue crawling.

I was wondering if I missed anything in Osmosis which would let me grab any HTML that libxml has not been able to parse successfully.

Thanks in advance.

`done` not being called

It looks like done isn't being called anymore in the current version of the code.

I moved a project to a different system and it didn't work. I did a diff and it looks like a change to promise.js might have broken it. Using the old code in node_modules2, done works for me. I could be doing something wrong though; this is my first node project :)

diff -bur node_modules/osmosis/lib/promise.js node_modules2/osmosis/lib/promise.js
--- node_modules/osmosis/lib/promise.js 2015-04-12 01:13:06.000000000 +0000
+++ node_modules2/osmosis/lib/promise.js 2015-06-07 01:51:50.476756092 +0000
@@ -63,6 +63,7 @@
             p.name = name;
             p.cb = cb;
             p.args = args;
+            p.stackPending = false;
             return p.next;
         }else{
             this.instance = ++instances;
@@ -75,6 +76,7 @@
     this.next.depth = this.depth+1;
     this.next.prev = this;
     this.args = args;
+    this.stackPending = false;

     if (typeof this.initialized === 'function') {
         this.initialized(this);
@@ -160,16 +162,20 @@
 };

 Promise.prototype.start = function(context, data) {
+    if (this.stackPending === true) {
+        parser.stack--;
+        this.stackPending = false;
+    }
     if (context === null || this.next === undefined)
         return;
     if (context === undefined && this.depth !== 0) {
         this.error('no context');
     }else if (this.cb !== undefined) {
+        if (context !== undefined)
+        if (this.next.stackPending === false) {
             parser.stack++;
+            this.next.stackPending = true;
+        }
         this.cb(context, extend({}, data));
+        context = null;
+        parser.stack--;
     }
 }

Concurrency not in effect?

I wrote a script that generates a long command chain like this one:

osmosis
.get('http://...').set({...}).data(fn)
.get('http://...').set({...}).data(fn)
...
.get('http://...').set({...}).data(fn)
.done(fn);

I would have expected osmosis to kick off many concurrent connections (5 by default), but when I look at the requests property of the Parser instance while it runs, the number never gets past 1, and given how slow it is, scraping does seem to happen one URL at a time.

Am I missing something here or is it a genuine issue with this library?

Multiple sets on the same page

I'm trying to scrape multiple sets from one and the same page.
Imagine scraping all hrefs from a <menu> tag and then all hrefs from an <aside> tag.
I want to avoid making two GET requests to do that.

In my opinion the command chain would look like this ...

page = osmosis.get(pageUrl)

.find('menu')
.set({ 
    // get the <menu> links
})
.data(function() {
    // do something with the <menu> links
})

.find('aside')
.set({ 
    // get the <aside> links
})
.data(function() {
    // do something with the <aside> links
})

.then(function() {
    // ... finally write everything to disk
});

... but .find('aside') doesn't get the whole context passed.

How would I do that with osmosis (not using another promise or a generator)?
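
One hedged idea: reuse the array syntax shown elsewhere on this page and put both selections in a single .set(), so only one GET is made. Note the separate issue further down about two array fields in one set returning null for the second field, so this may need splitting into two .set() calls depending on the version:

page = osmosis.get(pageUrl)
    .set({
        'menuLinks[]':  'menu a @href',     // the <menu> links
        'asideLinks[]': 'aside a @href'     // the <aside> links
    })
    .data(function(links) {
        // links.menuLinks and links.asideLinks arrive together in one object;
        // write everything to disk here
    });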

Scrape innerHTML property of an element?

Hello, thank you very much for this great scraper.

I could not find in the documentation whether it is possible to scrape properties of an element, like innerHTML.
Is it possible to scrape not just plain text but rather HTML content? Something like:
.get("https://website.com")
.find('element')
.set('element@innerHTML')

Thank you very much for your help.

Cannot call method 'done' of undefined

Was running the code below for testing and received the following error msg:

osmosis/lib/promise.js:108
        this.next.done();
                  ^
TypeError: Cannot call method 'done' of undefined

var osmosis = require('osmosis');

osmosis
.get('www.craigslist.org/about/sites') 
.find('h1 + div a')
.set('location')
.data(function(listing) {
    // do something with listing data
    console.log(listing);
})

Example doesn't work

After installing and trying the example, I get the following error:

Error: Cannot find module '/usr/lib/node_modules/needle'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/Users/matt/Playground/osmosis/node_modules/osmosis/index.js:2:14)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)

After updating that require, I'm getting:

/Users/matt/Playground/osmosis/node_modules/osmosis/node_modules/needle/lib/needle.js:562
        throw new TypeError('Invalid type for ' + key);
              ^
TypeError: Invalid type for follow
    at Object.exports.defaults (/Users/matt/Playground/osmosis/node_modules/osmosis/node_modules/needle/lib/needle.js:562:15)
    at Object.<anonymous> (/Users/matt/Playground/osmosis/node_modules/osmosis/index.js:49:8)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at Object.<anonymous> (/Users/matt/Playground/osmosis/index.js:1:77)
    at Module._compile (module.js:456:26)

Support for nested objects in array?

In the example, you use .find and .set to populate an array of image URLs (strings) - that's great - but what if I want to populate an array of objects? Let's say I want to save an array of images as:
[ { src : String, title: String } ]. Any support for this?
If not, a decent syntax might be: images[].src : a@href, images[].title : a@title (perhaps), or: "images[]" : { src : "a@href", title : "a@title" }
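
For illustration, the second proposed syntax would read like this in a chain; this is the issue's proposal, not a documented feature, and the URL and selectors are placeholders:

var osmosis = require('osmosis');

osmosis
    .get('http://example.com/gallery')       // placeholder
    .find('div.gallery')
    .set({
        'images[]': {                        // proposed: an array of objects
            'src':   'img @src',
            'title': 'img @title'
        }
    })
    .data(function(item) {
        // item.images would be [ { src: '...', title: '...' }, ... ]
        console.log(item.images);
    });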

Is it possible to scrape binary data?

For instance:

osmosis
    .get("http://www.todayonline.com/print/1166611")
    .find("//*[@class='image-top']/img/@src")
    .set("img")
    .follow("@src")
...

Quite obviously, the call to .follow("@src") will return binary image data - which I can confirm is happening when I enable the log/debug/error hooks; I see 2 HTTP requests made, instead of 1, and also a whole bunch of XML parsing errors on the binary image data (to be expected).

I wonder if you've got any functionality to scrape the base64-encoded version of the raw image binary data, so that I can place it together with its @src attribute in the final data output?

How to push the result items to an array?

osmosis
    .get('https://twitter.com/FunStoryID')
    .find('div.ProfileTweet-contents')
    .then(function(context, data, next){
        next();
    })
    .set({
        'link':'p[1] a@title',
        'title': "p[1]",
    })
    .data(function(item){
        // push item to array
    });

I want to push all the items to an array and then send them to an API for insertion, rather than inserting directly into the DB.
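
A minimal sketch of that pattern using only the hooks shown above: push each item in .data and send the whole array in .done (the API call itself is left as a comment):

var osmosis = require('osmosis');
var items = [];

osmosis
    .get('https://twitter.com/FunStoryID')
    .find('div.ProfileTweet-contents')
    .set({
        'link':  'p[1] a@title',
        'title': 'p[1]'
    })
    .data(function(item) {
        items.push(item);                    // collect instead of inserting one by one
    })
    .done(function() {
        // POST `items` to your API here, e.g. with needle or request
        console.log('collected %d items', items.length);
    });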

How to write data to json file.

How do I write data to a JSON file? I'm currently doing:

.data(function(listing) {
    jsonfile.writeFileSync(file, listing);
})

I know there is a done function, but if there is an error in the scraping process, that function will never be called, and I don't want to lose the previously scraped data. So how do I write the data to a JSON file?
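
One way to avoid losing data on a mid-crawl error is to append each listing as it arrives (newline-delimited JSON) rather than writing everything at the end. A minimal sketch:

var osmosis = require('osmosis');
var fs      = require('fs');
var file    = 'listings.ndjson';

osmosis
    .get('www.craigslist.org/about/sites')
    .find('h1 + div a')
    .set('location')
    .data(function(listing) {
        // one JSON object per line; lines already written survive a later crash
        fs.appendFileSync(file, JSON.stringify(listing) + '\n');
    })
    .error(console.error)
    .done(function() {
        console.log('finished');
    });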

Http Proxy Rotation

Hey @rc0x03 ,

Awesome library! After browsing the osmosis and needle docs/code, it doesn't appear that there's a way to rotate your HTTP proxies. Let's say I've got a list of 100 HTTP proxies and I'd like to have Osmosis randomly select one for each needle request, or use round robin, etc. Is there currently any way to do this? If not, is this something you are open to?

Thanks
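
Not something the docs confirm, but since .config({proxy: ...}) appears in another issue on this page, one workaround might be to pick a proxy per chain; whether the proxy can be rotated per request inside a single chain is an open question:

var osmosis = require('osmosis');

var proxies = ['http://10.0.0.1:8080', 'http://10.0.0.2:8080'];   // placeholder list

function randomProxy() {
    return proxies[Math.floor(Math.random() * proxies.length)];
}

osmosis
    .config({ proxy: randomProxy() })    // applies to this chain's requests
    .get('http://example.com')           // placeholder URL
    .find('h1')
    .set('title')
    .data(console.log);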

Clarification

What exactly is meant by "DOM support and the ability to run scripts/CSS without a headless browser"?

Does this mean that you will be able to scrape data from a page request that potentially uses AJAX calls or client-side code to manipulate the DOM being scraped?

parse.find: TypeError: Property 'next' of object [object Object] is not a function

When I use parse and try to find:

osmosis.parse('<html><body>text</body></html>').find("body");

I get the following error

node-osmosis/lib/promise.js:235
                this.next(parser.parse(data));
                     ^
TypeError: Property 'next' of object [object Object] is not a function
    at null.initialized (node-osmosis/lib/promise.js:235:8)
    at null.find (node-osmosis/lib/promise.js:82:9)

Osmosis + Nock support

I wasn't really sure where I should create this issue, so I created it on both projects. See nock/nock#335


I'm facing some issues using Osmosis and Nock for mocking HTTP responses in my unit tests; see the build failing on Travis. The before timeout in mocha gets triggered because Osmosis encounters an error in the request/parsing.

I've written the smallest reproducible test case I could come up with, based on the failing build above.

test case

var osmosis = require('osmosis');
var nock = require('nock');

nock('https://www.spotify.com')
  .get('/select-your-country/')
  .reply(200, 'whatever');

osmosis
  .get('https://www.spotify.com/select-your-country/')
  .find('.country-list .country-item .country-link')
  .set({
    countryCode: '@rel'
  })
  .error(console.error)
  .done(function(){
    // never reaches here :sadface:
    console.log('it works!');
  });

error

(get) TypeError: Cannot read property '_hasBody' of undefined
[get] https://www.spotify.com/select-your-country/ tries: 1 - Cannot read property '_hasBody' of undefined
    at /Users/matiassingers/dev/osmosis-nock-5226/node_modules/osmosis/index.js:150:35
    at done (/Users/matiassingers/dev/osmosis-nock-5226/node_modules/osmosis/node_modules/needle/lib/needle.js:355:7)
    at PassThrough.<anonymous> (/Users/matiassingers/dev/osmosis-nock-5226/node_modules/osmosis/node_modules/needle/lib/needle.js:523:9)
    at PassThrough.emit (events.js:117:20)
    at _stream_readable.js:943:16
    at process._tickCallback (node.js:419:13)

possible bug

The offending lines in Osmosis look like this:

if (!res.socket._httpMessage._hasBody || data.length == 0)
  throw(new Error('Document is empty'))

res.socket contains no _httpMessage property when returning the mocked response from Nock; this is the entire object:

{ domain: null,
  _events: {},
  _maxListeners: 10,
  writable: true,
  readable: true,
  setNoDelay: [Function: noop],
  setTimeout: [Function],
  _checkTimeout: [Function],
  setKeepAlive: [Function: noop],
  destroy: [Function: noop],
  resume: [Function: noop],
  getPeerCertificate: [Function: getPeerCertificate] }

The question then is:

  • Is Nock mocking the requests wrongly by not including the _httpMessage._hasBody property?
  • Or shouldn't Osmosis be checking for this property at all? It might be deprecated or something.

Using osmosis with multiple sites in parallel

Hi

I have an express API that should return data scraped from various sites depending on the express route.

But if multiple requests are executed at the same time, or before one of them is finished, errors start being thrown, for example "Can't set headers after they are sent".

I believe this is because contexts are being passed from one osmosis instance to another. How can I keep entire osmosis instances separate so I can accomplish this?

some code from express

app.get '/api/scrape/:site/:ref', (req, res, next) ->
  if req.params.ref.length < 7
    res.status(400).json 'error' : 'Invalid reference'
  else
    siteScrape = new require("./src/#{req.params.site}")
    siteScrape req.params.ref, (err, data)->
      if err
        res.status(400).json
          error : err.toString()
      else
        res.status(200).json data
  return

Thanks

how to catch errors?

Hi, thank you for this great lib!
I couldn't find an equivalent to other promise libs' .catch or .fail to handle errors; is there something like that?
If not, since I'm using Bluebird, I was thinking of wrapping Osmosis promises in a Bluebird promise. Any reason why that wouldn't be a good idea?
thanks in advance!
Max
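
There is no .catch in the chain itself, but a chain can be wrapped in an outer promise, resolving in .done and recording anything reported to .error. A hedged sketch with a native Promise; a Bluebird promise would look the same:

var osmosis = require('osmosis');

function scrape(url) {
    return new Promise(function(resolve, reject) {
        var results = [];
        var failure = null;

        osmosis
            .get(url)
            .find('h1')                              // placeholder selector
            .set('title')
            .data(function(item) { results.push(item); })
            .error(function(err) { failure = err; }) // osmosis reports errors here
            .done(function() {
                if (failure) reject(failure);
                else resolve(results);
            });
    });
}

scrape('http://example.com').then(console.log, console.error);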

osmosis.js not found and osmosis.get is not a function

Hi, I am trying to use node-osmosis in a script for an HTML file.
I have created a directory and executed the command: npm install osmosis.
Next, I copy-pasted the Craigslist script given and am attempting to see if that works.
The first error I got was that "require is not defined".
I proceeded to download RequireJS, using the command: npm install requirejs.
Then, in my HTML file, before running the Craigslist script, I loaded require.js.
Now my HTML code looks like the following:
[screenshot of the HTML code]
(testing.js is just the Craigslist code)
The errors I am getting now include:

  1. Osmosis.js not found
  2. osmosis.get is not a function

Do you have any recommendations about what to do ?

Something wrong with CSS Selector in Version 0.0.5

Code:
{
    'city': '.cityName h3',
    'district': '.cityName h1',
    'current': '.sk .tem span',
    'wet': '.sk .num .sp1',
    'wind': '.sk .num .sp2',
--> 'otherCity[]': '.city li a span',
    'otherCityLink[]': '.city li a@href'
}

'.city li a span' works fine in v0.0.4.
But in v0.0.5 it shows this message:
omsosis DEBUG (get) starting instance 2
XPath error : Invalid expression
.//*[contains(concat(' ',normalize-space(@Class),' '),' city ')]//li a//span

Please help.

access html data

First of all, thanks for this project. How can I access the HTML data as text or DOM if I can't build a selector for an old website? I plan to "search" in the text.

Error when crawling thousands of pages

This is my config

  osmosis
  .config({proxy: 'localhost:8118'})
  .get(url)
  .paginate('#fecha_logo>a+img+a')
  .find('h2>a,#bloque_titulares h3>a')
  .follow('@href')
  .set({
    'title':        'h2',
    'subtitle':     '.volanta',
    'content1':     '.intro',
    'content2':     '#cuerpo',
    'html':         'html'
  })
  .then(function(context, data, next) {
    var url = context.doc().request.url;
    var u = url.split("-");
    var date = new Date(u[2], u[3]-1, u[4].split('.')[0]);
    var article = new Article({
      title: iconv.decode(data.title, "utf-8"),
      subtitle: iconv.decode(data.subtitle, "utf-8"),
      content: htmlToText.fromString(iconv.decode(data.content1 + data.content2, "utf-8")),
      html: data.html,
      url: url,
      date: date
    });
    article.save(function (err, obj) {
      if (err) console.log('Error saving: ' + err);
      console.log('     [' + obj.date + '] ' + obj.title);
    });
    next(context, data);
  })
  .log(console.log)
  .error(console.log)
  .debug(console.log);

Is this error from something in osmosis?

(get) Error: socket hang up
[get] http://www.pagina12.com.ar/diario/elmundo/4-263946-2015-01-15.html tries: 1 - socket hang up
/home/jperelli/crawler/node_modules/osmosis/lib/promises.js:411
                    data.obj[key] = getContent(context.get(val));
                                                      ^

TypeError: Cannot read property 'get' of null
    at loopObject (/home/jperelli/crawler/node_modules/osmosis/lib/promises.js:411:55)
    at Promise.set [as cb] (/home/jperelli/crawler/node_modules/osmosis/lib/promises.js:416:9)
    at Promise.start (/home/jperelli/crawler/node_modules/osmosis/lib/promise.js:227:9)
    at /home/jperelli/crawler/node_modules/osmosis/lib/promise.js:217:81
    at /home/jperelli/crawler/node_modules/osmosis/lib/promises.js:148:25
    at /home/jperelli/crawler/node_modules/osmosis/lib/promise.js:266:5
    at /home/jperelli/crawler/node_modules/osmosis/index.js:231:13
    at done (/home/jperelli/crawler/node_modules/osmosis/node_modules/needle/lib/needle.js:357:7)
    at ClientRequest.had_error (/home/jperelli/crawler/node_modules/osmosis/node_modules/needle/lib/needle.js:364:5)
    at emitOne (events.js:77:13)
    at ClientRequest.emit (events.js:169:7)
    at Socket.socketCloseListener (_http_client.js:235:9)
    at emitOne (events.js:82:20)
    at Socket.emit (events.js:169:7)
    at TCP._onclose (net.js:469:12)

Unable to run on (X)Ubuntu 14.04.2 LTS

Trying to run your sample provided in the README that targets Craigslist after installing via npm install -g yields the following error:
(My $NODE_PATH is /usr/local/lib/node_modules)

/usr/local/lib/node_modules/osmosis/node_modules/libxmljs/node_modules/bindings/bindings.js:91
  throw err
        ^
Error: Could not locate the bindings file. Tried:
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/build/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/build/Debug/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/build/Release/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/out/Debug/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/Debug/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/out/Release/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/Release/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/build/default/xmljs.node
 → /usr/local/lib/node_modules/osmosis/node_modules/libxmljs/compiled/1.6.3/linux/x64/xmljs.node
    at bindings (/usr/local/lib/node_modules/osmosis/node_modules/libxmljs/node_modules/bindings/bindings.js:88:9)
    at Object.<anonymous> (/usr/local/lib/node_modules/osmosis/node_modules/libxmljs/lib/bindings.js:1:99)
    at Module._compile (module.js:410:26)
    at Object.Module._extensions..js (module.js:428:10)
    at Module.load (module.js:335:32)
    at Function.Module._load (module.js:290:12)
    at Module.require (module.js:345:17)
    at require (module.js:364:17)
    at Object.<anonymous> (/usr/local/lib/node_modules/osmosis/node_modules/libxmljs/index.js:4:16)
    at Module._compile (module.js:410:26)

Dynamic .get() URLs

I'm failing a bit at getting data to pass into the .get; the URL isn't directly encoded in the web page but rather a constructed URL using an id number. I tried using data to construct it, but wasn't able to pass it to the get command. Thoughts?

.get(init)
.follow('div.name > a')
.set({
    'someNo': 'tr.no'
})
.data(function(data) {
    data.url = 'http://somesite'+ data.someNo + 'moreurl'
})
.get(data.url)

Which fails because data.url is undefined.

How might I go about passing data to the .get?

Best Regards
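
One pattern that uses only the documented hooks is to finish the first chain, collect the id numbers, and start a second chain per id in .done. A hedged sketch; the URL construction and selectors are placeholders:

var ids = [];

osmosis
    .get(init)
    .follow('div.name > a')
    .set({ 'someNo': 'tr.no' })
    .data(function(data) {
        ids.push(data.someNo);                           // collect the id numbers first
    })
    .done(function() {
        ids.forEach(function(no) {
            osmosis
                .get('http://somesite/' + no + '/moreurl')   // construct the URL here
                .set({ /* ... */ })
                .data(function(item) { /* ... */ });
        });
    });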

Ajax selection

It's a great scraping solution, but what if there is AJAX-loaded content in the page?

I tried the following but somehow it's not returning any data:

osmosis
    .get('http://tweakers.net/pricewatch/')
    .find('.large a')
    .set('product')
    .data(function(listing) {
        console.log(listing);
    })

Any idea?

Osmosis with express?

Hey,

Maybe it's a stupid question, but I'm trying to use osmosis inside an 'app.get' function and it's working only once. Is there a way to reset osmosis and initialize it more than once?

Thanks, great project!

Version 0.0.5 seems to be incompatible with Request

I use both Osmosis and Request in my project: Request for fetching a task config from my server, and Osmosis to do the crawling.
But after 5~6 times, Request fails to respond/callback correctly.
Please have a look at it.

Code like:

Request.get(url, function (err, res, body) {
    // after 5~6 times, this function will not be called any more; it just waits for the request timeout.
    osmosis.get(body).done();
});

Thoughts on collaborating?

It seems like we're heading in the same direction. I've been working on the following library: https://github.com/lapwinglabs/x-ray.

I really like some of your design decisions here, specifically around offering a native parser and how you're handling an array of items.

I think having native bindings as the default makes sense, but having a fallback to a node-only solution (as ws or bson do) would be helpful.

Some of the things that x-ray adds are a pluggable driver and more fine-grained control over how many requests you're making. So it makes sense to me to merge the projects or come up with some way of working together.

Let me know! :-D

"done" in the command chain

Shouldn't done be called after all other commands are finished? If you run this example:

var fs = require('fs');
var osmosis = require('osmosis');

osmosis
.get('www.craigslist.org/about/sites')
.data(function(data) {
    //call an io function
    fs.readdir('.', function(err, files){
      console.log('data ... working on %s files', files.length);
    });
})
.then(function(context, data, next) {
    //call an io function
    fs.readdir('.', function(err, files){
      console.log('then ... working on %s files', files.length);
      next(context, data);
    });
})
.done(function(){
  console.log('done');
})

the result will be

done
data ... working on 18 files
then ... working on 18 files

Shouldn't "done" come last?

Pagination loads but skips previous commands

I am currently working on trying to get my scraper to paginate, and it doesn't seem to do anything:

I20150709-14:05:08.890(-4)? Scraping Heating & Air Conditioning in Charlotte, NC
I20150709-14:05:08.890(-4)? http://myURL
I20150709-14:05:09.897(-4)? (get) loaded [get] http://myURL
I20150709-14:05:09.905(-4)? (find) found 16 results for ".business-info" in http://myURL (/html)
I20150709-14:05:10.582(-4)? (paginate) loaded [get] http://myURL&page=2 {}

Notice how it doesn't "find" anything else; it just paginates and does nothing. Am I doing something wrong?

xmljs.node Error

When trying to require Osmosis I get the error below. I'm running node v0.10.25.

Thanks for looking it over.

/root/node_modules/osmosis/node_modules/libxmljs/node_modules/bindings/bindings.js:83
  throw e
  ^
Error: /root/node_modules/osmosis/node_modules/libxmljs/build/Release/xmljs.node: undefined symbol: node_module_register
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.require (module.js:364:17)
    at require (module.js:380:17)
    at bindings (/root/node_modules/osmosis/node_modules/libxmljs/node_modules/bindings/bindings.js:76:44)
    at Object.<anonymous> (/root/node_modules/osmosis/node_modules/libxmljs/lib/bindings.js:1:99)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)

Error: htmlCheckEncoding: unknown encoding gb2312

Problem:

When I crawl the website at http://news.163.com/15/0416/04/AN9VM4U600014AED.html, I get an unknown encoding error.

Output:

按照单位、�幸怠⑾钅俊⒌赜蛉哺堑谋曜迹婷�2013年

Expected Output:

按照单位、行业、项目、地域全覆盖的标准,全面摸清2013年

Description:

The site uses gb2312 (also called gbk) as its encoding; you can find an introduction to gb2312 on Wikipedia (http://en.wikipedia.org/wiki/GB_2312). Before using osmosis, I used iconv-lite to solve this encoding error and got the right result in the end; you can refer to it.

reference

https://github.com/ashtuchkin/iconv-lite
http://en.wikipedia.org/wiki/GB_2312

ERROR:

Error: htmlCheckEncoding: unknown encoding gb2312
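
As the reporter suggests, iconv-lite understands gb2312. A hedged sketch of fetching the raw bytes yourself (plain http here), decoding them, and handing the result to osmosis.parse:

var http    = require('http');
var iconv   = require('iconv-lite');
var osmosis = require('osmosis');

http.get('http://news.163.com/15/0416/04/AN9VM4U600014AED.html', function(res) {
    var chunks = [];
    res.on('data', function(c) { chunks.push(c); });
    res.on('end', function() {
        var html = iconv.decode(Buffer.concat(chunks), 'gb2312');  // decode with the correct charset
        osmosis.parse(html)
            .find('h1')                                            // placeholder selector
            .set('title')
            .data(console.log);
    });
});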

Can't get two array data by one set

Code:

    .set({
        'otherCityLink[]': 'li a@href',
        'otherCity[]': 'li a > span'
    })

The second field of the result will be null, and I have to change the code to:

    .set({
        'otherCityLink[]': 'li a@href'
    })
    .set({
        'otherCity[]': 'li a > span'
    })

How to get the found domNode's innerHTML?

osmosis = require 'osmosis'

osmosis
.get('http://www.cnbeta.com/')
.find('div.hd > div.title > a')
.follow('@href')
.set({
    'content': "how to get the domNode's innerHTML?"
})
.data (listing)->
  console.log listing

The default is to get the text.

How to parse xml?

I am easily able to parse HTML, however I'm facing issues with XML. My sample XML document looks like:

   <urlset>
         <url>
           <loc>http://www.abc.com/1</loc>
           <image:image>
              <image:loc>http:/www.abc.com/image1.png</image:loc>
              <image:caption><![CDATA[Title 1]]></image:caption>
           </image:image>
         </url>
         <url>
           <loc>http://www.abc.com/2</loc>
           <image:image>
              <image:loc>http:/www.abc.com/image2.png</image:loc>
              <image:caption><![CDATA[Title 2]]></image:caption>
          </image:image>
        </url>
   </urlset>

And I was trying with this code:

osmosis.get('http://www.abc.com/sitemap.xml')
        .set([{
            'url':'url > loc',
            'image':'url > image:loc'
        }])
        .data(function(array){
            console.log(array);
        });
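
Not verified against osmosis, but since XPath selectors are accepted elsewhere on this page, the namespaced image: elements might be reachable by matching on local-name(), which sidesteps the namespace prefix. A sketch, assuming relative XPath works inside .set():

var osmosis = require('osmosis');

osmosis.get('http://www.abc.com/sitemap.xml')
    .find('//url')                                                  // XPath instead of CSS
    .set({
        'url':   'loc',
        'image': "*[local-name()='image']/*[local-name()='loc']",
        'title': "*[local-name()='image']/*[local-name()='caption']"
    })
    .data(function(item) {
        console.log(item);
    });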
