
darter's Introduction

darter - web scraping utility

Made with love using Node.js, PhantomJS, and cheerio

Deployed on AWS using Docker

Deployed URL: http://13.126.217.83:10010/scrap

POST Call

Request Body: {
    "productUrl": "http://www.tigerdirect.com/applications/searchtools/item-details.asp?EdpNo=3415697"
}
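
As a rough sketch of how a client might make this call from Node.js: the endpoint and the `productUrl` field come from the example above, while the helper names `buildScrapeRequest` and `scrapeReviews` and the `Content-Type: application/json` header are assumptions made for illustration. It relies on the global `fetch` available in Node 18 and later.

```javascript
// Endpoint taken from the README; the JSON content type is an assumption.
const SCRAPE_ENDPOINT = "http://13.126.217.83:10010/scrap";

// Build the arguments for fetch() without sending anything (hypothetical helper).
function buildScrapeRequest(productUrl) {
  return {
    url: SCRAPE_ENDPOINT,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ productUrl }),
    },
  };
}

// Send the request and parse the JSON response (hypothetical helper).
async function scrapeReviews(productUrl) {
  const { url, options } = buildScrapeRequest(productUrl);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Scrape request failed with status ${response.status}`);
  }
  return response.json();
}
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit test without hitting the deployed service.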

Response: [
    [
        {
            "reviewComment": "\n\nNice personal label maker\nNice little inexpensive label maker.  Only one con about this deal is that the AC adapter is NOT included.  That was a little disappointing because it takes 6 AAA batteries if you want to run it without AC.\n\n",
            "reviewerName": "sacses,",
            "reviewDate": "Feb 21, 2013",
            "rating": "4.5"
        },
        .....
    ]
]
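
The sample review carries some scraping residue: surrounding newlines in `reviewComment`, a trailing comma in `reviewerName`, and a rating delivered as a string. A client could tidy this up along the following lines. This is only a sketch; `normalizeReview` and `normalizeResponse` are hypothetical helpers, not part of darter. Because the sample does not make clear whether reviews arrive as a flat array or grouped into nested arrays, the sketch flattens the payload first.

```javascript
// Clean up a single review object as returned by the API (hypothetical helper).
function normalizeReview(raw) {
  return {
    // Drop the leading and trailing whitespace left over from the page markup.
    reviewComment: raw.reviewComment.trim(),
    // Remove a trailing comma such as the one in "sacses,".
    reviewerName: raw.reviewerName.replace(/,\s*$/, "").trim(),
    // Left as the original string; date parsing is out of scope here.
    reviewDate: raw.reviewDate,
    // Convert "4.5" to the number 4.5.
    rating: Number.parseFloat(raw.rating),
  };
}

// Flatten any nesting in the payload, then normalize every review.
function normalizeResponse(payload) {
  return payload.flat(Infinity).map(normalizeReview);
}
```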

Some Known Improvements

Request Improvements

  1. Add an identifier such as 'companyId' to the request so the service knows which company's site is being scraped. The companyId can then be used to pick that company's website schema configuration.

  2. Accept multiple URLs to scrape in a single request.
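
One way these two request improvements could look in code is sketched below. Everything here is an assumption for illustration: the `productUrls` field name, the `parseScrapeRequest` helper, and the contents of `COMPANY_SCHEMAS` (the selector values are placeholders, not the selectors darter actually uses).

```javascript
// Hypothetical per-company schema configuration. The selector values are
// placeholders for illustration only.
const COMPANY_SCHEMAS = {
  tigerdirect: {
    reviewItem: ".review",
    reviewerName: ".reviewer",
    rating: ".rating",
  },
};

// Return true when the value parses as an absolute http(s) URL.
function isHttpUrl(value) {
  try {
    const { protocol } = new URL(value);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}

// Validate a request body of the form { companyId, productUrls: [...] }
// and resolve the schema configuration for that company.
function parseScrapeRequest(body) {
  const { companyId, productUrls } = body || {};
  if (!Object.prototype.hasOwnProperty.call(COMPANY_SCHEMAS, companyId)) {
    throw new Error(`Unknown companyId: ${companyId}`);
  }
  if (!Array.isArray(productUrls) || productUrls.length === 0) {
    throw new Error("productUrls must be a non-empty array");
  }
  const invalid = productUrls.filter((url) => !isHttpUrl(url));
  if (invalid.length > 0) {
    throw new Error(`Invalid product URLs: ${invalid.join(", ")}`);
  }
  return { companyId, schema: COMPANY_SCHEMAS[companyId], productUrls };
}
```

Rejecting unknown companies and malformed URLs up front means the scraper only starts a PhantomJS session for requests it can actually serve.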

Response Improvements

Return an object like this:
        {
            "productName": "Calculator",
            "noOfReviews": 10,
            "reviews": [ {
                ....
            } ]
        }
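
A minimal sketch of assembling that object from a list of scraped reviews follows. `buildProductResponse` is a hypothetical helper, and the count field is spelled `noOfReviews` here; deriving it from the array length keeps the two from drifting apart.

```javascript
// Wrap scraped reviews in the proposed response object (hypothetical helper).
function buildProductResponse(productName, reviews) {
  return {
    productName,
    // Derived from the array so the count always matches the payload.
    noOfReviews: reviews.length,
    reviews,
  };
}

// Example usage with a single review.
const example = buildProductResponse("Calculator", [
  { reviewerName: "sacses", rating: "4.5" },
]);
console.log(JSON.stringify(example, null, 2));
```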

Code improvements

  1. Improve the recursion logic that scrapes reviews page by page.

  2. Add logging and integrate it with ELK or an error-tracking tool such as Sentry.io.

  3. Make scraping configurable per company by storing schema information in configurable properties, so that a single API can scrape pages from multiple companies.

  4. Integrate a database to store the scraped data and use it for analytics.

  5. Implement IP rotation (for example, through proxies) in case a company blocks our IP after some hits.

  6. Add an authentication API to validate requests.
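
For the first item, one possible direction is to replace recursion with a plain loop that follows "next page" links. The sketch below is not darter's current implementation. It assumes a caller-supplied `fetchPage(url)` function (hypothetical) that resolves to `{ reviews, nextPageUrl }`, and it guards against link cycles and runaway pagination with a visited set and a `maxPages` limit.

```javascript
// Collect reviews from every page, starting at firstPageUrl.
// fetchPage is a hypothetical async function returning
// { reviews: Array, nextPageUrl: string | null } for a given URL.
async function scrapeAllPages(firstPageUrl, fetchPage, maxPages = 50) {
  const reviews = [];
  const visited = new Set();
  let nextUrl = firstPageUrl;

  // Stop when there is no next page, a page repeats, or the limit is hit.
  while (nextUrl && !visited.has(nextUrl) && visited.size < maxPages) {
    visited.add(nextUrl);
    const page = await fetchPage(nextUrl);
    reviews.push(...page.reviews);
    nextUrl = page.nextPageUrl;
  }
  return reviews;
}
```

A loop keeps the call stack flat regardless of how many review pages a product has, and the explicit stop conditions make it easier to add logging or retries around each page fetch later.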
