dharmafly / noodle

A node server and module which allows for cross-domain page scraping on web documents with JSONP or POST.
Home Page: https://noodle.dharmafly.com/
Indicate by the use of "type": "csv" in the request parameters.
Provide a standard JSON response.
Accept an optional selector - e.g. to slice by row or column in a CSV spreadsheet.
See #18 for "type": "json".
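By way of illustration, a CSV query might look something like this (the selector syntax for slicing rows or columns is only a sketch, not a settled design):

{
  "url": "http://example.com/data.csv",
  "type": "csv",
  "selector": "row[2]"
}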
The working branch for this is feature/caching.
When a request is cached, a cache time is set and a cache expiry time is known (say, 1 hour after the response was cached). The HTTP headers sent with the response should therefore indicate to the client that the response can be cached until that expiry time.
This feature should be handled by the new, abstracted caching module.
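A minimal sketch of how the server could derive those headers from a cached item, assuming the caching module exposes the item's expiry time as a Date (the cached.expires property here is an assumption, not the actual module API):

// Sketch: build HTTP caching headers from a cached item's known expiry time.
function cacheHeaders (cached) {
  var secondsLeft = Math.max(0, Math.floor((cached.expires - Date.now()) / 1000));
  return {
    'Cache-Control': 'max-age=' + secondsLeft,
    'Expires': new Date(cached.expires).toUTCString()
  };
}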
In the response from http://dharmafly.nsql.jit.su/?url=http%3A%2F%2Flanyrd.com%2Fseries%2Fl4rp%2F&selector=h4%20a.summary.url&extract=text the unicode character ☄ (in the title for a Lanyrd event) is not returned as the literal character but in an escaped/entity form.
E.g. When the request type is JSON (see #18).
A use case:
In the GitHub API, when listing a user's watched/starred repos, important information about pagination is included in a Link header.
E.g. The headers sent with https://api.github.com/users/premasagar/watched include the following header:
Link: <https://api.github.com/users/premasagar/watched?page=2>; rel="next",
<https://api.github.com/users/premasagar/watched?page=15>; rel="last"
The Link header is the only way to be able to retrieve the last page in the set.
Perhaps we could expand our current response items so that they have a head and a body property (as per @almost's hypermedia API design). The head property will contain headers - just supporting Link headers in the short term. The contents of the body property is what we are currently returning as a response item:
{
  "results": [
    {
      "head": {
        "link": [
          {
            "rel": "next",
            "url": "https://api.github.com/users/premasagar/watched?page=2"
          },
          {
            "rel": "last",
            "url": "https://api.github.com/users/premasagar/watched?page=15"
          }
        ]
      },
      "body": {},
      "created": "2012-07-25T13:02Z"
    }
  ]
}
Noodle has a key:value cache internally. However, I'm having trouble integrating the function for checking for cached items and storing them. This is because I don't have access to the key, which happens to be the query.
The new API does not have a main method like the previous fetch() where a query was passed in. Instead it has a collection of methods under a variety of types, e.g. noodle.html.select, noodle.rss.fetch etc.
Right now the caching can be implemented by the server, since that has access to the query. It would be best, however, if noodle managed it internally.
Maybe every fetch method can inherit from a basic fetch method which handles both the caching check and the caching put.
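A rough sketch of that idea, assuming the cache is keyed on the url plus the serialised query options (the cache.get/cache.put names, the native Promise and the rawHtmlFetch example below are illustrative only):

function cachedFetch (url, options, fetcher) {
  var key = url + JSON.stringify(options || {});
  var cached = cache.get(key);
  if (cached) {
    // caching check: resolve immediately with the stored result
    return Promise.resolve(cached);
  }
  return fetcher(url, options).then(function (result) {
    cache.put(key, result); // caching put
    return result;
  });
}

// Each type-specific fetch could then delegate to it, e.g.
// noodle.html.fetch = function (url, options) {
//   return cachedFetch(url, options, rawHtmlFetch);
// };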
Specify settings like:
etc
Fri Oct 26 2012 17:08:48 GMT+0100 (BST)
should be
Fri, Oct 26 2012 17:08:48 GMT+0100
The parameter's value is a string that is a selector for the element to be extracted.
Both extract and selector are relevant: extract describes the end goal, and selector describes the path to get to it.
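For example, a query object might use a selector to reach some anchors and extract to pull out just their href values (the values here are illustrative):

{
  "url": "http://example.com/",
  "selector": "a",
  "extract": "href"
}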
In the following use case I wish to just get the header information via noodle.
{
"url": "https://api.github.com/users/premasagar/starred",
"headers": "all"
}
or
{
"url": "https://api.github.com/users/premasagar/starred",
"linkHeader": true,
}
However noodle treats the lack of a specified selector as a request to return the full document. Perhaps it would be best if it didn't?
edit: In addition, it would seem that if one doesn't specify the document type in the noodle query then header information does not appear at all. This is another unnecessary step.
The addition of the error property on a result object is inconsistent.
Currently, errors appear on a result object if the result object had the following things occur:
The following errors appear but are buggy:
The errors above appear consistently if there is only one result object. However, if there are multiple result objects instead of just one, the errors above sometimes don't appear. Need to debug this further.
The following errors still need to be written:
Examples of these errors should also be documented in the README.
Also consider the error property to be an error object (message and type/code) instead of just a string.
As per Elsewhere module.
From this:
{
"href": [
"http://google.com/",
"http://google.com/2"
],
"title": [
"Search stuff",
"Search more stuff"
]
}
To this:
[
{
"href": "http://google.com/",
"title": "Search stuff"
},
{
"href": "http://google.com/2",
"title": "Search more stuff"
}
]
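A possible sketch of that transformation, assuming every property holds an array of the same length:

function toRows (columns) {
  var keys = Object.keys(columns);
  var length = keys.length ? columns[keys[0]].length : 0;
  var rows = [];
  for (var i = 0; i < length; i++) {
    var row = {};
    keys.forEach(function (key) {
      row[key] = columns[key][i];
    });
    rows.push(row);
  }
  return rows;
}

// toRows({href: ["http://google.com/"], title: ["Search stuff"]})
// => [{href: "http://google.com/", title: "Search stuff"}]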
Using Mocha + Chai + Sinon
To emphasise the use of noodle as a node library, revisions should be made to the current API. See the branch feature/new-api for the current code implementation, but do not expect a working server.
Currently noodle consists entirely of a single method (scrape) which takes all of its instruction from one or more query objects. This is coupled with separate type modules (html, json, feed) which provide logic for applying the select/extract rules against the data.
The revisions would involve separating out most of the query instructions from the query object into specific methods.
The main noodle object would also have a simple noodle.fetch(url) method which returns a promise.
The noodle object is segmented into different namespaces based on the file type the operation must be performed on:
noodle.html;
noodle.json;
noodle.feed; // normalised representation of rss, atom and rdf
noodle.xml; // for specific targetting of rss, atom, rdf and any xml file
Amongst these types are common operations:
noodle.html.fetch();
noodle.html.select();
noodle.json.fetch();
noodle.json.select();
noodle.feed.fetch();
noodle.feed.select();
noodle.xml.fetch();
noodle.xml.select();
The fetch() methods internally make use of their respective select() methods. The exception to this is that some fetch() methods make use of another type's select() method; one example is the feed type.
Some implementations:
noodle.feed.select = function (feedContents, options) {
  // Normalise the feed to JSON first, then optionally apply a JSON selector
  var json = noodle.feed.toJson(feedContents);
  return (options && options.selector) ? noodle.json.select(json, options) : json;
};

noodle.feed.fetch = function (url, options) {
  return noodle.fetch(url).then(function (data) {
    return noodle.feed.select(data, options);
  });
};

noodle.html.fetch = function (url, options) {
  return noodle.fetch(url).then(function (html) {
    return noodle.html.select(html, options);
  });
};

noodle.html.select = function (html, options) {
  // If a selector is given, run it against the parsed document; otherwise return the raw html
  var results = (options && options.selector) ?
    cheerio.load(html)(options.selector) : html;
  return {results: results};
};
(non-issue)
e.g. Fetch an OPML file, fetch each feed in the file, and for each feed, return a list of the article titles and links.
e.g.
{
"url": "http://example.com/data.json",
"type": "json",
"selector": "user",
"update": [
{
"replace": "foo",
"url": "http://example.com/user/${id}.json",
"type": "json",
"selector": "bar"
}
]
}
The update property contains either a request to update a property on the target of the selector, or an array of requests. If an array, each request object is processed in turn, modifying the selector target each time.
The value ${id} matches the selector target's id property.
The value foo on the target of the selector is updated (or, if it doesn't exist, created) with the data from the JSON file requested in the update request.
If there are multiple targets that match the selector, then the update request is repeated for each target.
This syntax could no doubt be improved.
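As a sketch of the intended behaviour only (fetchJson and select below stand in for noodle's own JSON fetching and dot-notation selection, and are not real API):

function applyUpdate (targets, update) {
  return Promise.all(targets.map(function (target) {
    // Interpolate ${...} placeholders from the target's own properties
    var url = update.url.replace(/\$\{(\w+)\}/g, function (match, prop) {
      return target[prop];
    });
    return fetchJson(url).then(function (json) {
      // Update (or create) the named property on the selector target
      target[update.replace] = select(json, update.selector);
      return target;
    });
  }));
}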
e.g.
{
"url": "http://example.com/data.json",
"type": "json",
"selector": "foo.bar"
}
"type": "JSON"
If the type parameter is omitted, then the default type of html should apply.
If the type is json, but the selector parameter is omitted, then the whole JSON file is returned.
The default content returned should be in standard JSON, served with the MIME type application/json (as is the case with HTML-type queries).
An optional callback parameter will return the resultant JSON in a JSONP wrapper, and serve the response with the MIME type text/javascript.
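For example, a request that includes callback=parseData would (illustratively) produce a response body like:

parseData({"results": []});

served as text/javascript, whereas the same query without the callback parameter would return the bare JSON as application/json.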
As with HTML-type queries, JSON requests can also be part of a multiple query (see #5). They can be part of multiple JSON requests, as well as mixed in with HTML requests.
selector
When selector is supplied, then:
foo.bar[0].foo - see keypath.js, which is derived from Tiny Tim, in order to parse this kind of selector
foo.bar[1,3] or foo.bar[-1]
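A minimal sketch of resolving the simple dot/bracket form against parsed JSON (ranges like [1,3] and negative indices are left out, and this is not the actual keypath.js implementation):

function keypath (obj, path) {
  // "foo.bar[0].foo" -> ["foo", "bar", "0", "foo"]
  var keys = path.replace(/\[(\d+)\]/g, '.$1').split('.');
  return keys.reduce(function (value, key) {
    return value == null ? value : value[key];
  }, obj);
}

keypath({foo: {bar: [{foo: 42}]}}, 'foo.bar[0].foo'); // 42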
Change this in the code (e.g. variable names), API responses, and in the labels on the demo GUI.
If the request is an array, then treat each array element as an individual query and return an array of individual results.
i.e. perform an additional query based on the result of the first query
or
Results from one query could also be used to make a subsequent request.
e.g. foo.com/package.json, fetch all contributors, then return their github repos & avatars.
Currently the charset is fixed at utf-8. This may be problematic down the line. The server should serve what the client requests with regard to headers.
Maybe there is a connect module which does this already.
Update the gh-pages site due to changes in name and the README.
Consider the error property to be an error object (message and type/code) instead of just a string.
Neither is jQuery
{
"url": "http://example.com/data.json",
"type": "json",
"selector": "foo.bar"
}
If the type is json, but the selector parameter is omitted, then the whole JSON file is returned.
If the type is html or none is specified, then the entire HTML page is returned.
This will prevent overload of the service by third party developers.
(This is also issue 22 on node-socialgraph, so the implementation can be shared).
e.g. fetch a list of URLs from anchor hrefs AND select the text content of some other element
E.g. research the number of jsDom instances on large request pages that the server can handle before crashing, and use only 10% of this capacity.
See identical issue 21 in node-socialgraph.
window.document does not have the method
The output should be normalised so that the same fields appear for every type of feed, no matter whether RSS (of any version) or Atom (of any version). Consider emulating the Google Feed API's response for this normalised format: https://developers.google.com/feed/v1/
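For illustration, a normalised entry modelled loosely on the Google Feed API's entry objects might carry fields such as (the exact field set is a suggestion, not a spec):

{
  "title": "Example article",
  "link": "http://example.com/article",
  "author": "Example Author",
  "publishedDate": "Fri, 26 Oct 2012 17:08:48 +0100",
  "contentSnippet": "Plain-text summary of the article",
  "content": "<p>Full HTML content of the article</p>",
  "categories": []
}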
Contenders?:
This is most likely possible with connect middleware like connect-compress.
What is the best URL structure for creating a query-able interface for the noodle web service?
Below are some examples:
.com/fetch/halfmelt.com
.com/fetch?q=halfmelt.com
.com/html?q=halfmelt.com&selector=a[href]
.com/json?q=foo.json&selector=a.b
.com/multiple/?q=html;halfmelt.com;selector=.foo,.bar|rss;foo.com;selector=blah
Queries can be quite complex, as they could contain characters which should be encoded, as well as multiple values for one key, e.g. extract=html,href,text.
Multiple queries can also be sent within one request. These queries could be delimited by a | character; the Google Charts API makes use of pipe delimiters in its urls.
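A sketch of how the server might split such a request before handing each part to noodle (the | and ; syntax here just follows the example above and is not a decided format):

function parseMultiple (q) {
  return q.split('|').map(function (part) {
    return part.split(';');
  });
}

parseMultiple('html;halfmelt.com;selector=.foo,.bar|rss;foo.com;selector=blah');
// [ ['html', 'halfmelt.com', 'selector=.foo,.bar'],
//   ['rss', 'foo.com', 'selector=blah'] ]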
Perhaps a "types" folder where people can include there own supported document type methods.
{
"url": "http://example.com/data.json",
"type": "json",
"selector": "foo.bar"
}
If the type is json, but the selector parameter is omitted, then the whole JSON file is returned.
If the type is html or none is specified, then the entire HTML page is returned.
One should consider how the returned data is to be structured. Perhaps a document property.
Selection might be achieved with either xpath against the original xml file, or dot-notation against the converted json.
How might the API look if we want to allow both xpath and dot-notation selection?
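One possible shape, purely as a sketch, would be to keep a single selector parameter and add an assumed selectorType parameter that defaults to dot-notation against the converted JSON:

{
  "url": "http://example.com/data.xml",
  "type": "xml",
  "selector": "rss.channel.item.title"
}

{
  "url": "http://example.com/data.xml",
  "type": "xml",
  "selector": "/rss/channel/item/title",
  "selectorType": "xpath"
}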