PyJSONProxy - simple proxy and scraper
simple proxy
AJAX requests are restricted by the same origin policy. This can be bypassed by using either JSONP, CORS or a local proxy. This implements the third variant. So you can do something like this:
$ curl http://localhost:5000/github/xi/ { "login": "xi", ... }
With a configuration like this:
ENDPOINTS = { 'github': { 'host': 'https://api.github.com/users/' } }
scraping
Maybe the more interesting part is that this also contains a simple scraping mechanism. So if a service does not offer an API but only plain HTML pages, PyJSONProxy can extract information from there:
$ curl http://localhost:5000/github/xi/ { "url": "https://github.com/xi/", "login": "xi", "activity": [ ... ], "repos": [{ ... }] ... }
ENDPOINTS = { 'github': { 'host': 'https://github.com/', 'fields': { 'login': '.vcard-username', 'fullname': '.vcard-fullname', 'email': '.vcard-details .email', 'join-date': '.vcard-details .join-date@datetime', 'activity': { 'selector': '.contribution-activity-listing ul a' }, 'repos': { 'selector': '.popular-repos a.mini-repo-list-item', 'fields': { 'url': '@href', 'name': '.repo', 'description': '.repo-description' } } } } }
Selectors are generally CSS-selectors with the additional option to
select an attribute by appending an @
and the attribute name. If no
attribute is selected, the text content of the element will be used.
CORS header
By setting ALLOW_CORS
to True
, an
Access-Control-Allow-Origin
-header with value *
will be set for
all responses.
Documentation
Some simple documentation is automatically generated and available under
/
(for all endpoints) or /{endpoint}/
(for an individual
endpoint). To provide some input for this documentation, you can add a
description to both endpoints and fields:
ENDPOINTS = { 'github': { 'host': 'https://github.com/', 'doc': 'Access data about GitHub users', 'fields': { 'login': '.vcard-username', 'fullname': '.vcard-fullname', 'email': '.vcard-details .email' 'join-date': '.vcard-details .join-date@datetime' }, 'fields_doc': { 'login': 'github username', 'fullname': 'the user\'s full name', 'join-date': 'date when the user joined github in ISO-xx format' } } }
Note on security and performance
Security and performance were not a priority in this project. So it might be bad.