Scrapper

Scrapper is a simple service to fetch webpages. It exposes two endpoints:

  • /pages - POST here to queue a fetching task. On success, the service responds with 202 Accepted and a Location header containing the URL of a temporary resource (/tasks/{id}) that describes the queued task. Once the task has completed, GET /pages/{id} to see the webpage contents.
  • /tasks/{id} - GET information about a background task. If the task has completed successfully, the service responds with 303 See Other and redirects to /pages/{id}, which holds the actual results.

Scrapper requires MongoDB to work.

Installation

git clone https://github.com/jacek-jablonski/scrapper.git
cd scrapper
make docker-build

Usage

make docker-run

The service listens on http://localhost:8080/.

  1. Request fetching:
 ❯ curl -i -d '{"url": "http://github.com"}' -H "Content-Type: application/json" -X POST http://localhost:8080/pages
HTTP/1.1 202 Accepted
Location: /tasks/e8dc0719-ad93-4ac5-8466-7858169509d6
Content-Type: text/plain; charset=utf-8
Content-Length: 13
Date: Fri, 02 Aug 2019 17:44:09 GMT
Server: Python/3.7 aiohttp/3.5.4

202: Accepted
  2. Get task status:
 ❯ curl -iL -H "Content-Type: application/json" -X GET http://localhost:8080/tasks/e8dc0719-ad93-4ac5-8466-7858169509d6
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 163
Date: Fri, 02 Aug 2019 17:44:10 GMT
Server: Python/3.7 aiohttp/3.5.4

{
  "_id": "8a51f5d7-053e-4587-9893-11d8a0597eb8",
  "created_at": "2019-08-02T17:44:09.506322Z",
  "url": "https://httpstat.us/200?sleep=50000",
  "status": "fetching",
  "error_message": null
}
  3. When the task has finished, get the fetching result (curl -L follows the 303 redirect, so both responses are printed):
 ❯ curl -iL -H "Content-Type: application/json" -X GET http://localhost:8080/tasks/e8dc0719-ad93-4ac5-8466-7858169509d6
HTTP/1.1 303 See Other
Content-Type: text/plain; charset=utf-8
Location: /pages/06f7f4c8-5771-428b-8e8e-0d46a59f2d81
Content-Length: 14
Date: Fri, 02 Aug 2019 17:44:20 GMT
Server: Python/3.7 aiohttp/3.5.4

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Content-Length: 92124
Date: Fri, 02 Aug 2019 17:44:20 GMT
Server: Python/3.7 aiohttp/3.5.4

{
    "_id": "06f7f4c8-5771-428b-8e8e-0d46a59f2d81",
    "created_at": "2019-08-02T17:44:10.188744Z",
    "url": "http://github.com",
    "body": cut
}
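
The curl walk-through above can also be scripted. The following is a minimal sketch of a polling client that uses only the Python standard library. The function names queue_fetch and wait_for_page, the polling interval, and the attempt limit are illustrative assumptions and not part of Scrapper. Treating a non-null error_message as a failed task is also an assumption, since the README does not show what a failed task looks like.

```python
import json
import time
import urllib.request
from urllib.parse import urljoin, urlparse


def queue_fetch(base_url: str, target_url: str) -> str:
    """POST /pages and return the absolute URL of the task resource."""
    request = urllib.request.Request(
        urljoin(base_url, "/pages"),
        data=json.dumps({"url": target_url}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        if response.status != 202:
            raise RuntimeError(f"expected 202 Accepted, got {response.status}")
        # The Location header is relative (/tasks/<id>), so resolve it.
        return urljoin(base_url, response.headers["Location"])


def wait_for_page(task_url: str, interval: float = 1.0, attempts: int = 30) -> dict:
    """Poll the task until the service redirects to the fetched page."""
    for _ in range(attempts):
        # urlopen follows the 303 See Other redirect automatically,
        # so the final URL tells us whether we landed on /pages/<id>.
        with urllib.request.urlopen(task_url) as response:
            document = json.load(response)
            final_path = urlparse(response.url).path
        if final_path.startswith("/pages/"):
            return document  # redirected: this is the page document
        # Assumption: a failed task reports a non-null error_message.
        if document.get("error_message"):
            raise RuntimeError(document["error_message"])
        time.sleep(interval)
    raise TimeoutError(f"task did not finish after {attempts} attempts")


if __name__ == "__main__":
    task = queue_fetch("http://localhost:8080", "http://github.com")
    page = wait_for_page(task)
    print(page["_id"], page["url"])
```

Because urlopen follows the redirect on its own, the sketch checks the final URL instead of looking for the 303 status code directly.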

Additional business rules

If you would like to modify the body, provide a class that inherits from Processor and implements its one required method, process. Uncomment the line

app.add_processor(UpperizationProcessor())

in scrapper/main.py to see how it works.
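
As a rough illustration of the idea, the sketch below defines a stand-in Processor base class and two subclasses. Everything here is an assumption made for the example: the real base class lives in the Scrapper source and may use a different signature (for instance an async method or a different argument type), the behaviour of the real UpperizationProcessor is only guessed from its name, and StripProcessor and apply_processors are hypothetical helpers.

```python
from typing import Iterable


class Processor:
    """Hypothetical stand-in for Scrapper's Processor base class."""

    def process(self, body: str) -> str:
        raise NotImplementedError


class UpperizationProcessor(Processor):
    """Assumed behaviour: upper-case the whole fetched body."""

    def process(self, body: str) -> str:
        return body.upper()


class StripProcessor(Processor):
    """Hypothetical second rule: trim surrounding whitespace."""

    def process(self, body: str) -> str:
        return body.strip()


def apply_processors(body: str, processors: Iterable[Processor]) -> str:
    """Run the body through each processor in registration order."""
    for processor in processors:
        body = processor.process(body)
    return body


if __name__ == "__main__":
    result = apply_processors("  <p>hello</p>\n", [StripProcessor(), UpperizationProcessor()])
    print(result)
```

Chaining processors in registration order is one plausible design; check scrapper/main.py for how add_processor actually applies them.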

Tests

make devinstall
make tests
