dharmafly / noodle

A node server and module which allows for cross-domain page scraping on web documents with JSONP or POST.
Home Page: https://noodle.dharmafly.com/
Indicate by the use of "type": "csv" in the request parameters.
Provide a standard JSON response.
Accept an optional selector - e.g. to slice by row or column in a CSV spreadsheet.
See #18 for "type": "json".
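By way of illustration, a CSV query might look something like this (the selector syntax for slicing rows or columns is only a sketch, not a settled design):

{
  "url": "http://example.com/data.csv",
  "type": "csv",
  "selector": "row[2]"
}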
The working branch for this is feature/caching.
When a request is cached, a cache time is set and a cache expiry time is known (say, 1 hour after the response was cached). The HTTP headers sent with the response should therefore indicate to the client that the response can be cached until that expiry time.
This feature should be handled by the new, abstracted caching module.
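A minimal sketch of how the server could derive those headers from a cached item, assuming the caching module exposes the item's expiry time as a Date (the cached.expires property here is an assumption, not the actual module API):

// Sketch: build HTTP caching headers from a cached item's known expiry time.
function cacheHeaders (cached) {
  var secondsLeft = Math.max(0, Math.floor((cached.expires - Date.now()) / 1000));
  return {
    'Cache-Control': 'max-age=' + secondsLeft,
    'Expires': new Date(cached.expires).toUTCString()
  };
}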
In the response from http://dharmafly.nsql.jit.su/?url=http%3A%2F%2Flanyrd.com%2Fseries%2Fl4rp%2F&selector=h4%20a.summary.url&extract=text the unicode character ☄ (in the title for a Lanyrd event) is not returned as the literal character but in an escaped/entity form.
E.g. When the request type is JSON (see #18).
A use case:
In the GitHub API, when listing a user's watched/starred repos, important information about pagination is included in a Link header.
E.g. The headers sent with https://api.github.com/users/premasagar/watched include the following header:
Link: <https://api.github.com/users/premasagar/watched?page=2>; rel="next",
<https://api.github.com/users/premasagar/watched?page=15>; rel="last"
The Link header is the only way to be able to retrieve the last page in the set.
Perhaps we could expand our current response items so that they have a head and a body property (as per @almost's hypermedia API design). The head property will contain headers - just supporting Link headers in the short term. The contents of the body property is what we are currently returning as a response item:
{
  "results": [
    {
      "head": {
        "link": [
          {
            "rel": "next",
            "url": "https://api.github.com/users/premasagar/watched?page=2"
          },
          {
            "rel": "last",
            "url": "https://api.github.com/users/premasagar/watched?page=15"
          }
        ]
      },
      "body": {},
      "created": "2012-07-25T13:02Z"
    }
  ]
}
Noodle has a key:value cache internally. However, I'm having trouble integrating the function for checking for cached items and storing them. This is because I don't have access to the key, which happens to be the query.
The new API does not have a main method like the previous fetch() where a query was passed in. Instead it has a collection of methods under a variety of types, e.g. noodle.html.select, noodle.rss.fetch etc.
Right now the caching can be implemented by the server, since that has access to the query. It would be best, however, if noodle managed it internally.
Maybe every fetch method can inherit from a basic fetch method which handles both the caching check and the caching put.
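A rough sketch of that idea, assuming the cache is keyed on the url plus the serialised query options (the cache.get/cache.put names, the native Promise and the rawHtmlFetch example below are illustrative only):

function cachedFetch (url, options, fetcher) {
  var key = url + JSON.stringify(options || {});
  var cached = cache.get(key);
  if (cached) {
    // caching check: resolve immediately with the stored result
    return Promise.resolve(cached);
  }
  return fetcher(url, options).then(function (result) {
    cache.put(key, result); // caching put
    return result;
  });
}

// Each type-specific fetch could then delegate to it, e.g.
// noodle.html.fetch = function (url, options) {
//   return cachedFetch(url, options, rawHtmlFetch);
// };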
Specify settings like:
etc
Fri Oct 26 2012 17:08:48 GMT+0100 (BST)
should be
Fri, Oct 26 2012 17:08:48 GMT+0100
The parameter's value is a string that is a selector for the element to be extracted.
Both extract and selector are relevant: extract describes the end goal, and selector describes the path to get to it.
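For example, a query object might use a selector to reach some anchors and extract to pull out just their href values (the values here are illustrative):

{
  "url": "http://example.com/",
  "selector": "a",
  "extract": "href"
}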
In the following use case I wish to just get the header information via noodle.
{
"url": "https://api.github.com/users/premasagar/starred",
"headers": "all"
}
or
{
"url": "https://api.github.com/users/premasagar/starred",
"linkHeader": true,
}
However noodle treats the lack of a specified selector as a request to return the full document. Perhaps it would be best if it didn't?
edit: In addition, it would seem that if one doesn't specify the document type in the noodle query then header information does not appear at all. This is another unnecessary step.
The addition of the error property on a result object is inconsistent.
Currently, errors appear on a result object if the result object had the following things occur:
The following errors appear but are buggy:
The errors above appear consistently if there is only one result object. However, if there are multiple result objects instead of just one, the errors above sometimes don't appear. Need to debug this further.
The following errors still need to be written:
Examples of these errors should also be documented in the README.
Also consider the error property to be an error object (message and type/code) instead of just a string.
As per Elsewhere module.
From this:
{
"href": [
"http://google.com/",
"http://google.com/2"
],
"title": [
"Search stuff",
"Search more stuff"
]
}
To this:
[
{
"href": "http://google.com/",
"title": "Search stuff"
},
{
"href": "http://google.com/2",
"title": "Search more stuff"
}
]
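A possible sketch of that transformation, assuming every property holds an array of the same length:

function toRows (columns) {
  var keys = Object.keys(columns);
  var length = keys.length ? columns[keys[0]].length : 0;
  var rows = [];
  for (var i = 0; i < length; i++) {
    var row = {};
    keys.forEach(function (key) {
      row[key] = columns[key][i];
    });
    rows.push(row);
  }
  return rows;
}

// toRows({href: ["http://google.com/"], title: ["Search stuff"]})
// => [{href: "http://google.com/", title: "Search stuff"}]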
Using Mocha + Chai + Sinon
To emphasise the use of noodle as a node library, revisions should be made to the current API. See the branch feature/new-api for the current code implementation, but do not expect a working server.
Currently noodle consists entirely of a single method (scrape) which takes all of its instruction from one or more query objects. This is coupled with separate type modules (html, json, feed) which provide logic for applying the select/extract rules against the data.
The revisions would involve separating out most of the query instructions from the query object into specific methods.
The main noodle object would also have a simple noodle.fetch(url) method which returns a promise.
The noodle object is segmented into different namespaces based on the file type the operation must be performed on:
noodle.html;
noodle.json;
noodle.feed; // normalised representation of rss, atom and rdf
noodle.xml; // for specific targetting of rss, atom, rdf and any xml file
Amongst these types are common operations:
noodle.html.fetch();
noodle.html.select();
noodle.json.fetch();
noodle.json.select();
noodle.feed.fetch();
noodle.feed.select();
noodle.xml.fetch();
noodle.xml.select();
The fetch() methods internally make use of their respective select() methods. The exception to this is that some fetch() methods make use of another type's select() method; one example is the feed type.
Some implementations:
noodle.feed.select = function (feedContents, options) {
  // Normalise the feed to JSON first, then optionally apply a JSON selector
  var json = noodle.feed.toJson(feedContents);
  return (options && options.selector) ? noodle.json.select(json, options) : json;
};

noodle.feed.fetch = function (url, options) {
  return noodle.fetch(url).then(function (data) {
    return noodle.feed.select(data, options);
  });
};

noodle.html.fetch = function (url, options) {
  return noodle.fetch(url).then(function (html) {
    return noodle.html.select(html, options);
  });
};

noodle.html.select = function (html, options) {
  // If a selector is given, run it against the parsed document; otherwise return the raw html
  var results = (options && options.selector) ?
    cheerio.load(html)(options.selector) : html;
  return {results: results};
};
(non-issue)
e.g. Fetch an OPML file, fetch each feed in the file, and for each feed, return a list of the article titles and links.
e.g.
{
"url": "http://example.com/data.json",
"type": "json",
"selector": "user",
"update": [
{
"replace": "foo",
"url": "http://example.com/user/${id}.json",
"type": "json",
"selector": "bar"
}
]
}
The update property contains either a request to update a property on the target of the selector, or an array of requests. If an array, each request object is processed in turn, modifying the selector target each time.
The value ${id} matches the selector target's id property.
The value foo on the target of the selector is updated (or, if it doesn't exist, created) with the data from the JSON file requested in the update request.
If there are multiple targets that match the selector, then the update request is repeated for each target.
This syntax could no doubt be improved.
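As a sketch of the intended behaviour only (fetchJson and select below stand in for noodle's own JSON fetching and dot-notation selection, and are not real API):

function applyUpdate (targets, update) {
  return Promise.all(targets.map(function (target) {
    // Interpolate ${...} placeholders from the target's own properties
    var url = update.url.replace(/\$\{(\w+)\}/g, function (match, prop) {
      return target[prop];
    });
    return fetchJson(url).then(function (json) {
      // Update (or create) the named property on the selector target
      target[update.replace] = select(json, update.selector);
      return target;
    });
  }));
}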
e.g.
{
"url": "http://example.com/data.json",
"type": "json",
"selector": "foo.bar"
}
"type": "JSON"
If the type parameter is omitted, then the default type of html should apply.
If the type is json, but the selector parameter is omitted, then the whole JSON file is returned.
The default content returned should be in standard JSON, served with the MIME type application/json (as is the case with HTML-type queries).
An optional callback parameter will return the resultant JSON in a JSONP wrapper, and serve the response with the MIME type text/javascript.
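For example, a request that includes callback=parseData would (illustratively) produce a response body like:

parseData({"results": []});

served as text/javascript, whereas the same query without the callback parameter would return the bare JSON as application/json.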
As with HTML-type queries, JSON requests can also be part of a multiple query (see #5). They can be part of multiple JSON requests, as well as mixed in with HTML requests.
selector
When selector is supplied, then:
foo.bar[0].foo - see keypath.js, which is derived from Tiny Tim, in order to parse this kind of selector
foo.bar[1,3] or foo.bar[-1]
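A minimal sketch of resolving the simple dot/bracket form against parsed JSON (ranges like [1,3] and negative indices are left out, and this is not the actual keypath.js implementation):

function keypath (obj, path) {
  // "foo.bar[0].foo" -> ["foo", "bar", "0", "foo"]
  var keys = path.replace(/\[(\d+)\]/g, '.$1').split('.');
  return keys.reduce(function (value, key) {
    return value == null ? value : value[key];
  }, obj);
}

keypath({foo: {bar: [{foo: 42}]}}, 'foo.bar[0].foo'); // 42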
Change this in the code (e.g. variable names), API responses, and in the labels on the demo GUI.
If the request is an array, then treat each array element as an individual query and return an array of individual results.
i.e. perform an additional query based on the result of the first query
or
Results from one query could also be used to make a subsequent request.
e.g. foo.com/package.json, fetch all contributors, then return their github repos & avatars.
Currently the charset is fixed at utf-8. This may be problematic down the line. The server should serve what the client requests with regard to headers.
Maybe there is a connect module which does this already.
Update the gh-pages site due to changes in name and the README.
Consider the error property to be an error object (message and type/code) instead of just a string.
Neither is jQuery
{
"url": "http://example.com/data.json",
"type": "json",
"selector": "foo.bar"
}
If the type is json, but the selector parameter is omitted, then the whole JSON file is returned.
If the type is html or none is specified, then the entire HTML page is returned.
This will prevent overload of the service by third party developers.
(This is also issue 22 on node-socialgraph, so the implementation can be shared).
e.g. fetch a list of URLs from anchor hrefs AND select the text content of some other element
E.g. research the number of jsDom instances on large request pages that the server can handle before crashing, and use only 10% of this capacity.
See identical issue 21 in node-socialgraph.
window.document does not have the method
The output should be normalised so that the same fields appear for every type of feed, no matter whether RSS (of any version) or Atom (of any version). Consider emulating the Google Feed API's response for this normalised format: https://developers.google.com/feed/v1/
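For illustration, a normalised entry modelled loosely on the Google Feed API's entry objects might carry fields such as (the exact field set is a suggestion, not a spec):

{
  "title": "Example article",
  "link": "http://example.com/article",
  "author": "Example Author",
  "publishedDate": "Fri, 26 Oct 2012 17:08:48 +0100",
  "contentSnippet": "Plain-text summary of the article",
  "content": "<p>Full HTML content of the article</p>",
  "categories": []
}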
Contenders?:
This is most likely possible with connect middleware like connect-compress.
What is the best URL structure for creating a query-able interface for the noodle web service?
Below are some examples:
.com/fetch/halfmelt.com
.com/fetch?q=halfmelt.com
.com/html?q=halfmelt.com&selector=a[href]
.com/json?q=foo.json&selector=a.b
.com/multiple/?q=html;halfmelt.com;selector=.foo,.bar|rss;foo.com;selector=blah
Queries can be quite complex, as they could contain characters which should be encoded, as well as multiple values for one key, e.g. extract=html,href,text.
Multiple queries can also be sent within one request. These queries could be delimited by a | character; the Google Charts API makes use of pipe delimiters in its urls.
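A sketch of how the server might split such a request before handing each part to noodle (the | and ; syntax here just follows the example above and is not a decided format):

function parseMultiple (q) {
  return q.split('|').map(function (part) {
    return part.split(';');
  });
}

parseMultiple('html;halfmelt.com;selector=.foo,.bar|rss;foo.com;selector=blah');
// [ ['html', 'halfmelt.com', 'selector=.foo,.bar'],
//   ['rss', 'foo.com', 'selector=blah'] ]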
Perhaps a "types" folder where people can include there own supported document type methods.
{
"url": "http://example.com/data.json",
"type": "json",
"selector": "foo.bar"
}
If the type is json, but the selector parameter is omitted, then the whole JSON file is returned.
If the type is html or none is specified, then the entire HTML page is returned.
One should consider how the returned data is to be structured. Perhaps a document property.
Selection might be achieved with either xpath against the original xml file, or dot-notation against the converted json.
How might the API look if we want to allow both xpath and dot-notation selection?
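One possible shape, purely as a sketch, would be to keep a single selector parameter and add an assumed selectorType parameter that defaults to dot-notation against the converted JSON:

{
  "url": "http://example.com/data.xml",
  "type": "xml",
  "selector": "rss.channel.item.title"
}

{
  "url": "http://example.com/data.xml",
  "type": "xml",
  "selector": "/rss/channel/item/title",
  "selectorType": "xpath"
}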