darter's Introduction

darter - web scraping utility

made with love using node using node, phantomjs, and cheerio

deployed on AWS using Docker

Deployed URL http://13.126.217.83:10010/scrap

POST Call

Request Body: {

    "productUrl":"http://www.tigerdirect.com/applications/searchtools/item-details.asp?EdpNo=3415697"

}

Response : [
    [
        
    {
        "reviewComment": "\n\nNice personal label maker\nNice little inexpensive label maker.  Only one con about this deal is that the AC adapter is NOT included.  That was a little disappointing because it takes 6 AAA batteries if you want to run it without AC.\n\n",
        "reviewerName": "sacses,",
        "reviewDate": "Feb 21, 2013",
        "rating": "4.5"
    },
    .....

]

Some Known Improvements

Request Improvements

To add an identifier like 'companyId' to identify the company, of which data needs to be scraped. On the basis of companyId, we can pick the company's website schema configuration.
To accept multiple Urls to scrap

Response Improvements

To return an object like this
        {
            "productName": "Calculator",
            "noOfRevies": 10,
            "reviews": [ {
                ....
            } ]
        }

Code improvements

To improve recursion logic for continuously scraping the reviews pagewise.
To add logging and integrate it with ELK or error tracking tools like Sentry.io
To make scraping configurable on the basis of company by storing schema information in configurable properties. So that a single API can be used for scraping multiple company's pages.
To integrate with database in order to save the scraped data in db and use that for analytics.
To implement IP spoofing in case a company blocks our IP after some hits.
To add authentication API to validate the requests.

Recommend Projects