
scrapehero-code / amazon-scraper


A simple web scraper to extract Product Data and Pricing from Amazon

Home Page: https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python-and-selectorlib/

Python 100.00%
amazon-scraper page-scraper scrape-products web-scraping web-scraping-tutorials web-crawling

amazon-scraper's Introduction

Amazon Scraper using Selectorlib

A simple Amazon scraper that extracts product details and prices from Amazon.com using Python Requests and Selectorlib.

Full article at ScrapeHero Tutorials

There are two simple scrapers in this project.

  1. Amazon Product Page Scraper: amazon.py
  2. Amazon Search Results Page Scraper: searchresults.py

Note: A completely browser-based commercial version of these scrapers is available in the ScrapeHero Marketplace.

Usage

From a terminal

  1. Clone this project: git clone https://github.com/scrapehero-code/amazon-scraper.git and change into it: cd amazon-scraper
  2. (Optional) Create a virtual environment: python3 -m venv .venv
  3. (Optional) Activate the virtual environment: source .venv/bin/activate
  4. Install the requirements: pip3 install -r requirements.txt

Scrape Product Details from Product Page

  1. Add Amazon product URLs to urls.txt
  2. Run python3 amazon.py
  3. Get data from output.jsonl
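
As a minimal sketch of step 3, the output can be read back with the standard library, assuming output.jsonl holds one JSON object per line (the usual JSON Lines layout). The helper name load_jsonl is hypothetical and is not part of this project.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return a list of dicts, one per non-blank line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

After a run, something like `for product in load_jsonl("output.jsonl"): print(product.get("name"), product.get("price"))` would list what was scraped. Fields the scraper could not find come back as None.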

Scrape Products from Search Results

This scraper only scrapes products from the first page of search results.

  1. Add Amazon search results URLs to search_results_urls.txt
  2. Run python3 searchresults.py
  3. Get data from search_results_output.jsonl
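
For step 1, the input file could be generated from a list of keywords. This is only a sketch: it assumes the file holds one URL per line and that search URLs follow the https://www.amazon.com/s?k=... pattern seen in the example data. The function names are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(keyword, domain="www.amazon.com"):
    """Build a search URL of the form https://<domain>/s?k=<keyword>."""
    return f"https://{domain}/s?{urlencode({'k': keyword})}"


def write_search_urls(keywords, path="search_results_urls.txt"):
    """Write one search URL per line (assumed input format)."""
    with open(path, "w", encoding="utf-8") as handle:
        for keyword in keywords:
            handle.write(build_search_url(keyword) + "\n")
```

urlencode takes care of spaces and special characters, so multi-word keywords do not need manual escaping.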

Example Data Format

Product Details

{
  "name": "2020 HP 15.6\" Laptop Computer, 10th Gen Intel Quard-Core i7 1065G7 up to 3.9GHz, 16GB DDR4 RAM, 512GB PCIe SSD, 802.11ac WiFi, Bluetooth 4.2, Silver, Windows 10, YZAKKA USB External DVD + Accessories",
  "price": "$959.00",
  "short_description": "Powered by latest 10th Gen Intel Core i7-1065G7 Processor @ 1.30GHz (4 Cores, 8M Cache, up to 3.90 GHz); Ultra-low-voltage platform. Quad-core, eight-way processing provides maximum high-efficiency power to go.\n15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768) Display; Intel Iris Plus Graphics\n16GB 2666MHz DDR4 Memory for full-power multitasking; 512GB Solid State Drive (PCI-e), Save files fast and store more data. With massive amounts of storage and advanced communication power, PCI-e SSDs are great for major gaming applications, multiple servers, daily backups, and more.\nRealtek RTL8821CE 802.11b/g/n/ac (1x1) Wi-Fi and Bluetooth 4.2 Combo; 1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\nWindows 10 Home, 64-bit, English; Natural silver; YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad\n› See more product details",
  "images": "{\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX425_.jpg\":[425,425],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX466_.jpg\":[466,466],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY355_.jpg\":[355,355],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX569_.jpg\":[569,569],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY450_.jpg\":[450,450],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX679_.jpg\":[679,679],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX522_.jpg\":[522,522]}",
  "variants": [
    {
      "name": "Click to select 4GB DDR4 RAM, 128GB PCIe SSD",
      "asin": "B01MCZ4LH1"
    },
    {
      "name": "Click to select 8GB DDR4 RAM, 256GB PCIe SSD",
      "asin": "B08537NR9D"
    },
    {
      "name": "Click to select 12GB DDR4 RAM, 512GB PCIe SSD",
      "asin": "B08537ZDYH"
    },
    {
      "name": "Click to select 16GB DDR4 RAM, 512GB PCIe SSD",
      "asin": "B085383P7M"
    },
    {
      "name": "Click to select 20GB DDR4 RAM, 1TB PCIe SSD",
      "asin": "B08537NDVZ"
    }
  ],
  "product_description": "Capacity:16GB DDR4 RAM, 512GB PCIe SSD\n\nProcessor\n\n  Intel Core i7-1065G7 (1.3 GHz base frequency, up to 3.9 GHz with Intel Turbo Boost Technology, 8 MB cache, 4 cores)\n\nChipset\n\n  Intel Integrated SoC\n\nMemory\n\n  16GB DDR4-2666 SDRAM\n\nVideo graphics\n\n  Intel Iris Plus Graphics\n\nHard drive\n\n  512GB PCIe NVMe M.2 SSD\n\nDisplay\n\n  15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768)\n\nWireless connectivity\n\n  Realtek RTL8821CE 802.11b/g/n/ac (1x1) Wi-Fi and Bluetooth 4.2 Combo\n\nExpansion slots\n\n  1 multi-format SD media card reader\n\nExternal ports\n\n  1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\n\nMinimum dimensions (W x D x H)\n\n  9.53 x 14.11 x 0.70 in\n\nWeight\n\n  3.75 lbs\n\nPower supply type\n\n  45 W Smart AC power adapter\n\nBattery type\n\n  3-cell, 41 Wh Li-ion\n\nBattery life mixed usage\n\n  Up to 11 hours and 30 minutes\n\n  Video Playback Battery life\n\n  Up to 10 hours\n\nWebcam\n\n  HP TrueVision HD Camera with integrated dual array digital microphone\n\nAudio features\n\n  Dual speakers\n\nOperating system\n\n  Windows 10 Home 64\n\nAccessories\n\n  YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad",
  "link_to_all_reviews": "https://www.amazon.com/HP-Computer-Quard-Core-Bluetooth-Accessories/product-reviews/B085383P7M/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
}
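
Note that "images" in the example above is a JSON-encoded string rather than a nested object, so it needs a second decoding step. A minimal sketch, assuming each URL maps to a pair of pixel dimensions as shown; largest_image is a hypothetical helper:

```python
import json


def largest_image(images_field):
    """Return the image URL with the largest pixel area, or None if absent."""
    if not images_field:
        return None
    sizes = json.loads(images_field)  # {"<url>": [dim1, dim2], ...}
    if not sizes:
        return None
    # Area is used so the order of the two dimensions does not matter.
    return max(sizes, key=lambda url: sizes[url][0] * sizes[url][1])
```

For the record above this would pick the `_AC_SX679_` URL, since 679 x 679 is the largest size listed.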

Search Results

Each result looks similar to this:

{
    "title": "New ! Dell Inspiron i3583 15.6\" HD Touch-Screen Laptop - Intel i3-8145U - 8GB DDR4-128GB SSD - Windows 10 - Wireless-AC - Bluetooth - SD Card Reader - HDMI & USB 3.1 -Waves MaxxAudio Pro- Black",
    "url": "/Dell-Inspiron-i3583-Touch-Screen-Laptop/dp/B08173ZTJX/ref=sr_1_3?dchild=1&keywords=laptops&qid=1591584632&sr=8-3",
    "rating": "4.1 out of 5 stars",
    "reviews": "122",
    "price": "$472.00",
    "search_url": "https://www.amazon.com/s?k=laptops"
}
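
Every field in a search result is a string, and "url" is relative to the marketplace. The sketch below shows one way to turn a result into typed values. It assumes US-style "$" prices and the "N out of 5 stars" rating text shown above; normalize_result is a hypothetical helper, not part of the scraper.

```python
import re
from decimal import Decimal
from urllib.parse import urljoin


def normalize_result(result):
    """Convert the string fields of one search result into typed values."""
    base = result.get("search_url") or "https://www.amazon.com/"
    rating = re.match(r"([\d.]+) out of", result.get("rating") or "")
    price = re.fullmatch(r"\$([\d,]+(?:\.\d+)?)", result.get("price") or "")
    reviews = (result.get("reviews") or "").replace(",", "")
    return {
        "title": result.get("title"),
        # Resolve the relative product path against the search page URL.
        "url": urljoin(base, result["url"]) if result.get("url") else None,
        "rating": float(rating.group(1)) if rating else None,
        "reviews": int(reviews) if reviews.isdigit() else None,
        "price": Decimal(price.group(1).replace(",", "")) if price else None,
    }
```

Any field that does not match the assumed format becomes None instead of raising, which keeps a batch run going when a listing has no price or rating.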

amazon-scraper's People

Contributors

scrapehero-code


amazon-scraper's Issues

Can we use this for Vine products?

I am looking for a scraper for Amazon Vine products

Example link (you must be a Vine member to view it): https://www.amazon.it/vine/vine-items?queue=last_chance&size=60

empty search results

You could make the scraper detect which type of page it has received, because Amazon serves several page layouts that do not share the same HTML for showing results. One option is a separate scraper for each results layout, plus a check that determines which layout a page uses.

Does not Work with Amazon.de / Amazon Europe

If we use the scraper for any other Amazon marketplace, such as Amazon.de, we get the following error:

Downloading https://www.amazon.de/s?k=laptop
Traceback (most recent call last):
  File "searchresults.py", line 43, in <module>
    for product in data['products']:
TypeError: 'NoneType' object is not iterable

When we use the Selectorlib Google Chrome extension, the selectors look fine: we can see the prices, titles, ratings, etc.
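
The traceback says that data['products'] itself is None: the lookup succeeded, but iterating over its value failed. A defensive loop avoids the crash, although it does not explain why nothing was extracted. This is a minimal sketch, with iter_products as a hypothetical helper:

```python
def iter_products(data):
    """Yield product dicts, treating a missing or None result as empty."""
    if not data:
        return  # nothing was extracted at all
    # "products" may be absent or None when the selectors match nothing.
    yield from data.get("products") or []
```

Replacing `for product in data['products']:` with `for product in iter_products(data):` would turn the TypeError into an empty output for that URL.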

Empty results

Running the script without changes gives me an output.jsonl that looks like this:

{"name": null, "price": null, "short_description": null, "images": null, "rating": null, "number_of_reviews": null, "variants": null, "product_description": null, "sales_rank": null, "link_to_all_reviews": null}
{"name": null, "price": null, "short_description": null, "images": null, "rating": null, "number_of_reviews": null, "variants": null, "product_description": null, "sales_rank": null, "link_to_all_reviews": null}
{"name": null, "price": null, "short_description": null, "images": null, "rating": null, "number_of_reviews": null, "variants": null, "product_description": null, "sales_rank": null, "link_to_all_reviews": null}
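
To see how often a run comes back empty, the output lines can be tallied. The sketch below assumes one JSON object per line, as in the output above; classify and summarize are hypothetical helpers.

```python
import json
from collections import Counter


def classify(record):
    """Label a record 'empty', 'partial', or 'complete' by its null fields."""
    values = list(record.values())
    if all(value is None for value in values):
        return "empty"
    if any(value is None for value in values):
        return "partial"
    return "complete"


def summarize(lines):
    """Count record labels across an iterable of JSON Lines strings."""
    return Counter(classify(json.loads(line)) for line in lines if line.strip())
```

Passing the open output.jsonl file object to summarize gives a count per label. A run that is entirely "empty" suggests the fetched page was not the product page the selectors expect.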

Question: why is this URL not returning anything?

Firstly, I just wanted to say, great work and great article!

I am making a web scraper based on yours that will search Amazon for a list of DVDs or Blu-rays by title alone (hopefully with little error) and return the price. My test URL is https://www.amazon.com/Sean-Connery/dp/B011MHCHTQ/, but it returns null in every field. Do you know why this is?

Thanks!

Page blocked by Amazon

I tried running the code and I got the following error:

Downloading https://www.amazon.com/s?k=laptops
Page https://www.amazon.com/s?k=laptops was blocked by Amazon. Please try using better proxies

Could you look into this issue?
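
One common way to cope with intermittent blocks is to retry after a growing delay. The sketch below is generic and will not get past a persistent block. It assumes a hypothetical fetch callable that returns None when the page was blocked; all names here are illustrative.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or attempts run out."""
    for attempt in range(attempts):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

The sleep parameter exists only so the delay can be replaced in tests. In normal use the default time.sleep applies.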

it works but not always

Hi,
I'm trying to scrape the PS5 page on Amazon.it:

https://www.amazon.it/dp/B08KKJ37F7

Most of the time I do not get the data; sometimes I do.
I scheduled the script to run every 10 seconds. Below are six executions; only one retrieved the data properly.
What could be the reason? Could it be a timeout issue?

{"name": null, "price": null, "short_description": null, "images": null, "rating": null, "number_of_reviews": null, "variants": null, "product_description": null, "sales_rank": null, "link_to_all_reviews": null}

{"name": null, "price": null, "short_description": null, "images": null, "rating": null, "number_of_reviews": null, "variants": null, "product_description": null, "sales_rank": null, "link_to_all_reviews": null}

{"name": "Sony PlayStation 5", "price": null, "short_description": "Prova un caricamento ultra rapido con un'unit\u00e0 SSD ad altissima velocit\u00e0, un coinvolgimento ancora maggiore grazie al supporto per il feedback aptico, ai grilletti adattivi e all'audio 3D e scopri una nuova generazione di incredibili giochi PlayStation Lasciati stupire dalla grafica incredibile e prova le nuove funzionalit\u00e0 di PS5. Scopri un'esperienza di gioco pi\u00f9 profonda con supporto per feedback tattile, trigger adattivi e tecnologia audio 3D Ray Tracing - Immergiti in mondi che offrono un livello di realismo senza precedenti, con ogni raggio di luce simulato individualmente, creando effetti di ombre e riflessi ultra realistici sui giochi PS5 compatibili. Fino a 120 FPS con uscita a 120 Hz - Goditi un gameplay fluido con frame rate elevato fino a 120 FPS per giochi compatibili, con supporto per l'uscita a 120 Hz su display 4K Tecnologia HDR - Su una TV HDR, i giochi PS5 compatibili mostrano una gamma di colori vivaci e realistici Uscita 8K - Le console PS5 supportano l'uscita 8K, permettendoti di giocare sul tuo display 4320p Feedback tattile - Prova il feedback tattile tramite il controller wireless DualSense mentre giochi a determinati giochi per PS5 e senti gli effetti e l'impatto delle tue azioni di gioco attraverso il feedback sensoriale dinamico. Trigger adattivi: fai i conti con i trigger adattivi coinvolgenti e i loro livelli di resistenza dinamica che simulano l'impatto fisico delle tue azioni di gioco in alcuni titoli PS5 .\nDescrizione completa", "images": "{\"https://images-na.ssl-images-amazon.com/images/I/71PMC4DWWFL._AC_SX466_.jpg\":[350,466],\"https://images-na.ssl-images-amazon.com/images/I/71PMC4DWWFL._AC_SX679_.jpg\":[511,679],\"https://images-na.ssl-images-amazon.com/images/I/71PMC4DWWFL._AC_SX342_.jpg\":[257,342],\"https://images-na.ssl-images-amazon.com/images/I/71PMC4DWWFL._AC_SX522_.jpg\":[393,522],\"https://images-na.ssl-images-amazon.com/images/I/71PMC4DWWFL._AC_SX425_.jpg\":[320,425],\"https://images-na.ssl-images-amazon.com/images/I/71PMC4DWWFL._AC_SX385_.jpg\":[290,385],\"https://images-na.ssl-images-amazon.com/images/I/71PMC4DWWFL._AC_SX569_.jpg\":[428,569]}", "rating": null, "number_of_reviews": null, "variants": null, "product_description": null, "sales_rank": null, "link_to_all_reviews": "/Sony-PlayStation-5/product-reviews/B08KKJ37F7?reviewerType=all_reviews"}

{"name": null, "price": null, "short_description": null, "images": null, "rating": null, "number_of_reviews": null, "variants": null, "product_description": null, "sales_rank": null, "link_to_all_reviews": null}

{"name": null, "price": null, "short_description": null, "images": null, "rating": null, "number_of_reviews": null, "variants": null, "product_description": null, "sales_rank": null, "link_to_all_reviews": null}

{"name": null, "price": null, "short_description": null, "images": null, "rating": null, "number_of_reviews": null, "variants": null, "product_description": null, "sales_rank": null, "link_to_all_reviews": null}

empty results

Hi,
The code works well when I give it a search page URL, but it does not work with the URL https://www.amazon.com/gp/goldbox

Can you look into it?

Thanks.
