dharmafly / noodle
A node server and module which allows for cross-domain page scraping on web documents with JSONP or POST.

Home Page: https://noodle.dharmafly.com/



noodle's Issues

Support CSV source files

Indicate this with "type": "csv" in the request parameters.

Provide a standard JSON response.

Accept an optional selector - e.g. to slice by row or column in a CSV spreadsheet.

See #18 for "type": "json".
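
A hypothetical query might look like this (the row/column selector syntax is illustrative only):

{
    "url": "http://example.com/data.csv",
    "type": "csv",
    "selector": "row[2]"
}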

Serve informative HTTP caching headers with each response

When a response is cached, the cache time is recorded and the cache expiry time is known (say, 1 hour after the response was cached). The HTTP headers sent with the response should therefore tell the client to cache it until that expiry time.

This feature should be handled by the new, abstracted caching module.
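
A minimal sketch of what that module might do, assuming a Node-style response object and a cache entry that records its expiry as a timestamp (cacheEntry and expires are hypothetical names, not noodle's API):

function setCachingHeaders(res, cacheEntry) {
  // Tell the client to cache the response for as long as the internal cache will
  var secondsLeft = Math.max(0, Math.round((cacheEntry.expires - Date.now()) / 1000));
  res.setHeader('Cache-Control', 'public, max-age=' + secondsLeft);
  res.setHeader('Expires', new Date(cacheEntry.expires).toUTCString());
}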

Consider supporting `Link` headers in response

E.g. when the request type is JSON (see #18).

A use case:
In the GitHub API, when listing a user's watched/starred repos, important information about pagination is included in a Link header.
E.g. the headers sent with https://api.github.com/users/premasagar/watched include the following:

Link: <https://api.github.com/users/premasagar/watched?page=2>; rel="next",
      <https://api.github.com/users/premasagar/watched?page=15>; rel="last"

The Link header is the only way to retrieve the last page in the set.

Perhaps we could expand our current response items so that they have a head and a body property (as per @almost's hypermedia API design). The head property would contain headers - just supporting Link headers in the short term. The contents of the body property would be what we are currently returning as a response item:

{
  "results": [
    {
      "head": {
        "link": [
            {
                "rel": "next",
                "url": "https://api.github.com/users/premasagar/watched?page=2"
            },
            {
                "rel": "last",
                "url": "https://api.github.com/users/premasagar/watched?page=15"
            }
        ]
      },
      "body": {},
      "created": "2012-07-25T13:02Z"
    }
  ]
}

How can the cache be used internally?

Noodle has a key:value cache internally. However, I'm having trouble integrating the function that checks for cached items and stores them, because I don't have access to the key, which happens to be the query itself.

The new API does not have a main method like the previous fetch() where a query was passed in.

Instead it has a collection of methods under a variety of types, e.g. noodle.html.select, noodle.rss.fetch, etc.

Right now the caching can be implemented by the server, since the server has access to the query. It would be better, however, if noodle managed it internally.

Maybe every fetch method can inherit from a basic fetch method which handles both the caching check and the caching put.
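
A minimal sketch of that idea, assuming an internal key:value cache with get/put methods and the Q promise library (all names here are hypothetical, not noodle's actual API):

var Q = require('q');

function cachedFetch(query, doFetch) {
  var key = JSON.stringify(query),  // the query itself acts as the cache key
      hit = cache.get(key);         // `cache` is the assumed internal key:value cache

  if (hit) {
    return Q.resolve(hit);          // serve from cache
  }
  return doFetch(query).then(function (result) {
    cache.put(key, result);         // store for subsequent identical queries
    return result;
  });
}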

Add config file

Specify settings like:

  • cache "is old" time
  • cache purge time
  • server port

etc.
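
A hypothetical config.json (property names and values are illustrative only):

{
    "port": 8888,
    "cacheMaxAge": 3600,
    "cachePurgeTime": 604800
}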

Improve caching

  • Truncate the cache according to a limit, keeping a maximum number of cached items at any one time.
  • A cache age after which items are considered in need of an update, e.g. 1 hour.
  • If a request fails, serve the cached data.
  • A purge time, e.g. 1 week.

Rename `extract` parameter to `selector`?

The parameter's value is a string that is a selector for the element to be extracted.
Both extract and selector are relevant: extract describes the end goal, and selector describes the path to get to it.

Allow for getting server headers without any selection

In the following use case, I wish to get just the header information via noodle.

{
  "url": "https://api.github.com/users/premasagar/starred",
  "headers": "all"
}

or

{
  "url": "https://api.github.com/users/premasagar/starred",
  "linkHeader": true
}

However, noodle interprets the lack of a specified selector as a request for the full document. Perhaps it would be best if it didn't?

Edit: in addition, it seems that if the document type is not specified in the noodle query, then header information does not appear at all. This is another unnecessary step.

Make error reporting more consistent

The addition of the error property on a result object is inconsistent.

Currently, an error appears on a result object if any of the following occurred:

  • On 'json' type queries, if the selector yielded no results (it returned an empty array, or the selector was blank)
  • On 'json' type queries, if the JSON parse fails (i.e. the URL is not a JSON document)

The following errors appear but are buggy:

  • On 'html' type queries if the page could not be found
  • On 'json' type queries if the page could not be found

The errors above appear consistently if there is only one result object. However, if there are multiple result objects instead of just one, the errors above sometimes don't appear. This needs further debugging.

The following errors still need to be written:

  • On 'html' type queries if the selector yielded no results

Examples of these errors should also be documented in the README.


Also consider making the error property an error object (with a message and a type/code) instead of just a string.
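
For example, an errored result might then look something like this (the shape and the type value are illustrative only):

{
  "results": [],
  "error": {
    "type": "parse-error",
    "message": "Could not parse the document as JSON"
  }
}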

Change returned json structure

From this:

{
  "href": [
    "http://google.com/",
    "http://google.com/2"
  ],
  "title": [
    "Search stuff",
    "Search more stuff"
  ]
}

To this:

[
  {
    "href": "http://google.com/",
    "title": "Search stuff"
  },
  {
    "href": "http://google.com/2",
    "title": "Search more stuff"
  }
]

Revisions to the Noodle API

To emphasise the use of noodle as a node library, revisions should be made to the current API. See the branch feature/new-api for the current code implementation, but do not expect a working server.

Currently noodle consists entirely of a single method (scrape) which takes all of its instructions from one or more query objects. This is coupled with separate type modules (html, json, feed) which provide the logic for applying the select/extract rules against the data.

The revisions would involve separating out most of the query instructions from the query object into specific methods.

The main noodle object would also have a simple noodle.fetch(url) method which returns a promise.

The noodle object is segmented into different namespaces based on the file type the operation must be performed on:

noodle.html;
noodle.json;
noodle.feed; // normalised representation of rss, atom and rdf
noodle.xml; // for specific targeting of rss, atom, rdf and any xml file

Amongst these types are common operations:

noodle.html.fetch();
noodle.html.select();
noodle.json.fetch();
noodle.json.select();
noodle.feed.fetch();
noodle.feed.select();
noodle.xml.fetch();
noodle.xml.select();

The fetch() methods internally make use of their respective select() methods. The exception is that some fetch() methods make use of another type's select() method; one example is the feed type.

Some implementations:

noodle.feed.select = function (feedContents, options) {
  var json = noodle.feed.toJson(feedContents);
  return options ? noodle.json.select(json, options) : json;
};

noodle.feed.fetch = function (url, options) {
  return noodle.fetch(url).then(function (data) {
    return noodle.feed.select(data, options);
  });
};

noodle.html.fetch = function (url, options) {
  return noodle.fetch(url).then(function (html) {
    return noodle.html.select(html, options);
  });
};

noodle.html.select = function (html, options) {
  var results = options && options.selector ?
      cheerio.load(html)(options.selector) : html;
  return {results: results};
};
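
Usage might then look something like this (a sketch, assuming the promise-based methods above):

noodle.html.fetch('http://example.com', {selector: 'a[href]'})
  .then(function (data) {
    console.log(data.results);
  });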

Support post-processing the response, mixing in new data

e.g. Fetch an OPML file, fetch each feed in the file, and for each feed, return a list of the article titles and links.

e.g.

{
    "url": "http://example.com/data.json",
    "type": "json",
    "selector": "user",
    "update": [
        {
            "replace": "foo",
            "url": "http://example.com/user/${id}.json",
            "type": "json",
            "selector": "bar"
        }
    ]
}

The update property contains either a request to update a property on the target of the selector, or an array of requests. If an array, each request object is processed in turn, modifying the selector target each time.

The value ${id} matches the selector target's id property.

The value foo on the target of the selector is updated (or if it doesn't exist, then created) with the data from the JSON file requested in the update request.

If there are multiple targets that match the selector, then the update request is repeated for each target.
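
A worked example of the intent (all data here is invented for illustration): given that data.json contains {"user": {"id": 42}} and http://example.com/user/42.json contains {"bar": {"name": "Alice"}}, the selector target would be updated to:

{
    "id": 42,
    "foo": {"name": "Alice"}
}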

This syntax could no doubt be improved.

Related to #6 & #18.

Support JSON queries

e.g.

{
    "url": "http://example.com/data.json",
    "type": "json",
    "selector": "foo.bar"
}

"type": "JSON"

If the type parameter is omitted, then the default type of html should apply.

If the type is json, but the selector parameter is omitted, then the whole JSON file is returned.

JSON, JSONP and MIME types

The default content returned should be standard JSON, served with the MIME type application/json (as is the case with HTML-type queries).

An optional callback parameter will return the resultant JSON in a JSONP wrapper, and serve the response with the MIME type text/javascript.

As with HTML-type queries, JSON requests can also be part of a multiple query (see #5). They can be part of multiple JSON requests, as well as mixed in with HTML requests.
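
For instance, a request including ?callback=myFunction (the callback name here is illustrative) would be served as:

myFunction({"results": [ /* results here */ ]});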

selector

When selector is supplied, then:

  • it is a string
  • it uses a JavaScript-style object selector to identify a target property in the JSON file, e.g. foo.bar[0].foo; see keypath.js, which is derived from Tiny Tim, for parsing this kind of selector (a sketch follows this list)
  • (advanced feature; not priority) - it supports slicing the response - e.g. foo.bar[1,3] or foo.bar[-1]
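
A minimal sketch of this kind of keypath resolution (an illustrative reimplementation, not the actual keypath.js code; slicing is not handled):

function keypath(obj, path) {
  return path
    .replace(/\[(\d+)\]/g, '.$1')   // foo.bar[0] -> foo.bar.0
    .split('.')
    .reduce(function (target, key) {
      return target == null ? target : target[key];
    }, obj);
}

keypath({foo: {bar: ['a', 'b']}}, 'foo.bar[0]');   // "a"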

Support conditional / piped queries

i.e. perform an additional query based on the result of the first query.
Put another way: results from one query could be used to make a subsequent request.
E.g. fetch foo.com/package.json, fetch all contributors, then return their github repos & avatars.

Support for different charsets

Currently the charset is fixed at utf-8.

This may be problematic down the line. The server should serve what the client requests with regard to headers.

Maybe there is a connect module which does this already.


Gzip http responses

This is most likely possible with connect middleware like connect-compress.
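
A minimal sketch, assuming a Connect 2.x-era app (connect.compress() was the built-in gzip middleware of that era; connect-compress is a similar standalone module):

var connect = require('connect'),
    app = connect();

app.use(connect.compress());   // gzip/deflate responses when the client accepts it
app.use(function (req, res) {
  res.end('response body here');
});

app.listen(8888);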

Considerations for web service URL interface

Web service URL structure

What is the best URL structure for creating a query-able interface for the noodle web service?

Below are some examples:

.com/fetch/halfmelt.com
.com/fetch?q=halfmelt.com
.com/html?q=halfmelt.com&selector=a[href]
.com/json?q=foo.json&selector=a.b
.com/multiple/?q=html;halfmelt.com;selector=.foo,.bar|rss;foo.com;selector=blah

Queries can be quite complex, as they could contain characters which should be encoded, as well as multiple values for one key, e.g. extract=html,href,text.

Multiple queries can also be sent within one request. These queries could be delimited by a | character. The Google Charts API makes use of pipe delimiters in its URLs.

Support whole documents and not just bits of data to extract

{
    "url": "http://example.com/data.json",
    "type": "json",
    "selector": "foo.bar"
}

If the type is json, but the selector parameter is omitted, then the whole JSON file is returned.
If the type is html or none specified then the entire HTML page is returned.

One should consider how the returned data is to be structured. Perhaps a document property.

Add 'xml' type

Selection might be achieved with either XPath against the original XML file, or dot-notation against the converted JSON.

How might the API look if we want to allow both XPath and dot-notation selection?
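
For example, the two styles might be distinguished by the selector's form (both queries below are illustrative only). XPath against the original XML:

{
    "url": "http://example.com/feed.xml",
    "type": "xml",
    "selector": "//channel/item/title"
}

or dot-notation against the converted JSON:

{
    "url": "http://example.com/feed.xml",
    "type": "xml",
    "selector": "channel.item[0].title"
}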
