
amazon_scraper's People

Contributors

adamlwgriffiths, hahnicity, muety, yaph

amazon_scraper's Issues

Average Review Rating

I didn't see that this existed, so I tried parsing it out myself:

avg_rating = float(rs.soup.find("i", {'class':re.compile('averageStarRating')}).text.split(' ')[0])
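A slightly more defensive variant of that one-liner may be worth considering before a pull request: the chained calls raise if the element is missing or the text changes shape. A sketch of just the text-parsing step (assuming the rating text looks like "4.5 out of 5 stars"; the helper name is made up):

```python
import re

def parse_avg_rating(text):
    """Pull the leading float out of strings like '4.5 out of 5 stars'.

    Returns None instead of raising when the element is missing or the
    text is in an unexpected format.
    """
    if not text:
        return None
    match = re.match(r'\s*(\d[\d.]*)', text)
    return float(match.group(1)) if match else None

print(parse_avg_rating("4.5 out of 5 stars"))  # 4.5
print(parse_avg_rating(None))                  # None
```

The original snippet would then become `parse_avg_rating(tag.text if tag else None)` where `tag` is the result of the `soup.find` call.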

Should I submit a pull request?

ImportError: No module named tests

Hi Adam,

First of all, thanks for sharing this project! I am having a small issue; this is my first Python project, so please bear with me. I cloned this project in a c9.io environment and I am getting a few errors when running the tests, namely this one:

ImportError: No module named tests

I would be very grateful if you could point me in the right direction.

Thanks

-Dat

Edit: things I have tried: changing the CWD, reinstalling all the dependencies.

Missing something obvious: urllib 400 error on simple requests

When I try to run the scraper using a simple test and request as follows, it consistently fails with the error traceback shown below. Any idea where I am going awry? Apologies for my idiocy here, but it's not obvious to me why urllib is throwing a 400 error.

from amazon_scraper import AmazonScraper
amzn = AmazonScraper("XXXX", "XXXX", "XXXX")
import itertools
for p in itertools.islice(amzn.search(Keywords='python', SearchIndex='Books'), 5):
    print p.title

And here is the result:

Traceback (most recent call last):
  File "amazon-scraper.py", line 4, in <module>
    for p in itertools.islice(amzn.search(Keywords='python', SearchIndex='Books'), 5):
  File "/Users/pk/anaconda/lib/python2.7/site-packages/amazon_scraper/__init__.py", line 188, in search
    for p in self.api.search(**kwargs):
  File "/Users/pk/anaconda/lib/python2.7/site-packages/amazon/api.py", line 519, in __iter__
    for page in self.iterate_pages():
  File "/Users/pk/anaconda/lib/python2.7/site-packages/amazon/api.py", line 535, in iterate_pages
    yield self._query(ItemPage=self.current_page, **self.kwargs)
  File "/Users/pk/anaconda/lib/python2.7/site-packages/amazon/api.py", line 548, in _query
    response = self.api.ItemSearch(ResponseGroup=ResponseGroup, **kwargs)
  File "/Users/pk/anaconda/lib/python2.7/site-packages/bottlenose/api.py", line 242, in __call__
    {'api_url': api_url, 'cache_url': cache_url})
  File "/Users/pk/anaconda/lib/python2.7/site-packages/bottlenose/api.py", line 203, in _call_api
    return urllib2.urlopen(api_request, timeout=self.Timeout)
  File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request

GUI?

Is there recommended software that provides a GUI?

Page sometimes not loading?

I'm having sporadic trouble when extracting the ASIN using reviews/full_review:

rs = amzn.reviews(ItemId='006001203X')
for r in rs:
    fr = r.full_review()
    myfile.write("%s," % (fr.asin))

I'm sometimes getting the error:

asin = unicode(tag.string)
AttributeError: 'NoneType' object has no attribute 'string'

My guess is that I'm not getting the content of the page when this error occurs, because the individual review's URL is passed along correctly (fr.url) and I can see that the content exists in my browser, but I get "None" when asking for the text of the review (fr.text). Furthermore, sometimes the scraper errors on a specific review and sometimes it doesn't, again making me think this is a loading issue.

In case it helps, I'm using the scraper in conjunction with Tor and PySocks (maybe not necessary?). What would lead to pages sometimes not loading? Any solutions to this issue?

UPDATE:

Here is some output from just printing the reviews (rather than writing them to a file). The format is the review URL followed by the text. Notice that "None" appears seemingly at random; when you visit the actual page, the review text is there.

http://www.amazon.com/review/R1GLFST9IJDL3Z
None
http://www.amazon.com/review/R3O5KSEJ5BONJ7
Written by Dr. Atkins, this book is definitely a good way to get started on the diet. My only reservation is that he spends an awful long time convincing the reader to start the diet. But a good resource for a low/no carb diet.
http://www.amazon.com/review/R353I88IYNVGZJ
Thank you it is what I was looking for
http://www.amazon.com/review/R22GIPYTEYX7IK
None

Also, I have seen this happen both with and without using Tor/PySocks.
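One common mitigation for intermittent failures like this is to retry the fetch a few times before giving up. A minimal retry helper (a sketch only; `flaky` below is a stand-in for whatever call returns the review text, not part of amazon_scraper's API):

```python
import time

def fetch_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() until it returns a non-None result, retrying on
    None results or AttributeError up to `attempts` times."""
    for attempt in range(attempts):
        try:
            result = fetch()
            if result is not None:
                return result
        except AttributeError:
            pass  # e.g. 'NoneType' object has no attribute 'string'
        if attempt < attempts - 1:
            time.sleep(delay)
    return None

# Stand-in fetch that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    return None if calls["n"] < 3 else "review text"

print(fetch_with_retry(flaky, attempts=5, delay=0))  # review text
```

This doesn't explain *why* the pages intermittently come back empty (rate limiting or captcha pages are plausible suspects), but it makes the symptom recoverable.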

Add captcha detection

Detect when Amazon shoots a captcha at us and raise an appropriate error instead of letting our soup code fail with None dereferences.

See this issue for more information on the format of the captcha and result of it being sent:
#25
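A minimal detection sketch (assumptions: the marker phrases below are what Amazon's captcha page currently contains, and `CaptchaError` is a hypothetical exception name, not an existing class in this library):

```python
class CaptchaError(Exception):
    """Raised when Amazon returns a captcha page instead of content."""

# Assumed marker phrases; adjust to whatever the real captcha page contains.
CAPTCHA_MARKERS = (
    "Enter the characters you see below",
    "Type the characters you see in this image",
)

def check_for_captcha(html):
    """Raise CaptchaError if the page looks like a captcha challenge."""
    for marker in CAPTCHA_MARKERS:
        if marker in html:
            raise CaptchaError("Amazon served a captcha page")

check_for_captcha("<html><body>Normal product page</body></html>")  # passes
```

Calling this on the raw response before building the soup would turn the opaque None dereferences into one clear, catchable error.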

Review ids are not being parsed correctly

with url: http://www.amazon.com/product-reviews/1449355730/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending

I did:

from amazon_scraper import AmazonScraper
amzn = AmazonScraper(stuff...)
revs = amzn.reviews(URL="http://www.amazon.com/product-reviews/1449355730/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending")
revs.ids

I get an empty list. The cause might be that Amazon changed their HTML. I'd like to make this change:

     @property
     def ids(self):
         return [
-            extract_review_id(anchor['href'])
-            for anchor in self.soup.find_all('a', text=re.compile(ur'permalink', flags=re.I))
+            anchor["id"]
+            for anchor in self.soup.find_all('div', class_="a-section review")
         ]

This matches up a bit more closely with Amazon's HTML, which looks like:

<div id="R2UBSL6L1T8MIF" class="a-section review"><div class="a-row helpful-votes-count"></div>
...
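The proposed extraction can be sanity-checked against that markup without BeautifulSoup, using the standard library's html.parser (Python 3 shown; a sketch only — the real code should keep going through the library's soup):

```python
from html.parser import HTMLParser

class ReviewIdParser(HTMLParser):
    """Collect the id of every <div> whose class list contains 'review'."""
    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "review" in attrs.get("class", "").split():
            if "id" in attrs:
                self.ids.append(attrs["id"])

parser = ReviewIdParser()
parser.feed('<div id="R2UBSL6L1T8MIF" class="a-section review">'
            '<div class="a-row helpful-votes-count"></div></div>')
print(parser.ids)  # ['R2UBSL6L1T8MIF']
```

Matching on the class token list (rather than the exact string "a-section review") is slightly more robust if Amazon reorders or adds classes.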

Reviews API broken?

I don't know if this is an issue, but I can't get the reviews API to work. Even if I use the values from your unit tests, they do not work. Do the unit tests work for you? Here are the results:

Fitbit FB401BK Flex Wireless Activity + Sleep Wristband, Black, Small - 5.5" - 6.5" & Large - 6.5" - 7.9"
[0, 0, 0, 0, 0]
<built-in method title of unicode object at 0x0000000004CEAED0>
<built-in method title of unicode object at 0x0000000004D66600>
<built-in method title of unicode object at 0x0000000004D8BCC0>
<built-in method title of unicode object at 0x0000000004DB53C0>
<built-in method title of unicode object at 0x0000000004E24A80>
<built-in method title of unicode object at 0x0000000004E4C180>
<built-in method title of unicode object at 0x0000000004E77870>
<built-in method title of unicode object at 0x0000000004E9CF30>
<built-in method title of unicode object at 0x0000000004F10630>
<built-in method title of unicode object at 0x0000000004F76CF0>
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    for r in rs:
  File "build\bdist.win-amd64\egg\amazon_scraper\reviews.py", line 180, in __iter__
    page = Reviews(URL=page.next_page_url) if page.next_page_url else None
TypeError: __init__() takes at least 2 arguments (2 given)

help

Hello,

I'm new to coding and want to set up a script to pull prices from warehousedeals.com for items that are 70%+ off versus the new Amazon price. Can someone assist me with this?
So far I have downloaded Python 2 and Notepad++.

Get Product Price

Sir, can you please tell me how to get the product price through your amazon_scraper API in Python? And also tell me how to get the seller information.

Can't get ASIN for reviews

I'm running into an odd error when trying to get the ASIN for reviews. I get an RS object, but oddly can't get at the ASIN, despite the fact that it has that attribute on the span. Any help?

...
RS value: <amazon_scraper.reviews.Reviews object at 0x10c0c9890>
...

Here's my traceback:

Traceback (most recent call last):
  File "amazon_reviews.py", line 112, in <module>
    scrapeAmazonReviews(filepath, title, amzn, url)
  File "amazon_reviews.py", line 83, in scrapeAmazonReviews
    print "RS ASID: %s" % rs.asin
  File "/Users/pbeeson/.virtualenvs/custom_data_pulls/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 41, in asin
    return unicode(span['name'])
TypeError: 'NoneType' object has no attribute '__getitem__'

Fix up tests

I'm going to try to take on fixing up the tests and getting them to work properly. Right now there are basic issues just getting the tests to run, and we need to turn off the MaxQPS property to get them to go. Also, there are three test failures and several errors that can be fixed. Hell, maybe I'll even figure out how to throw Travis on here, but that can be a separate ticket.

how to properly iterate over reviews

I'm trying to run the example block of code that looks like this:

p = amzn.lookup(ItemId='B0051QVF7A')
rs = amzn.reviews(URL=p.reviews_url)

for r in rs:
    print(r)

but at first I get an error like this:

Traceback (most recent call last):
  File "review-scraper/review-scraper.py", line 19, in <module>
    for r in rs:
  File "/usr/local/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 178, in __iter__
    for id in page.ids:
  File "/usr/local/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 202, in ids
    for anchor in self.soup.find_all('div', class_="a-section review")
  File "/usr/local/lib/python2.7/site-packages/amazon_scraper/__init__.py", line 113, in decorator
    raise e
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?

And when I install the html5lib package, things work a little better (I'm able to print out the first page of reviews), but then I hit another error:

R1V8OBW4HRDV5W
R38AV3D6I8CHS6
R1R19OOAWIN48U
RL37IWIVVB5B4
R3S9D4LLRP7AQN
R1CAZXTXQ6F5A
R36R23EPPWW6UQ
RA751EK4W8EV4
RGZ3A10EDUYQ1
RP149JO3VJ31O
Traceback (most recent call last):
  File "review-scraper/review-scraper.py", line 19, in <module>
    for r in rs:
  File "/usr/local/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 180, in __iter__
    page = Reviews(URL=page.next_page_url) if page.next_page_url else None
TypeError: __init__() takes at least 2 arguments (2 given)

Is there a different package I should be using?
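The first traceback can be caught earlier: bs4 only raises FeatureNotFound at parse time, so probing for the optional parsers at startup gives a clearer failure. A sketch using only the standard library (the function name is made up; the candidate list reflects bs4's commonly supported tree builders):

```python
import importlib.util

def available_parsers():
    """Report which optional bs4 tree builders are importable."""
    candidates = ("lxml", "html5lib")
    found = [name for name in candidates
             if importlib.util.find_spec(name) is not None]
    return found + ["html.parser"]  # stdlib parser is always present

print(available_parsers())
```

If "html5lib" isn't in that list, `pip install html5lib` is the fix for the FeatureNotFound error, as you found; the second TypeError looks like a separate library bug in the pagination code.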

Random stopping on multiples of 10

I am randomly having stops occur as I try to scrape all the reviews for a product (using 'reviews'). I don't receive an error, and the 'soup' output for the last scraped review doesn't seem informative (i.e., it looks like the reviews before it). This doesn't happen for specific products in particular, nor does it occur after the same review each time: sometimes if I re-scrape the same product it stops at a different review, and sometimes all of the reviews are scraped. If I am scraping multiple products one after the other, the scraper just continues on to the next product, even though it didn't scrape all the reviews for the first one.

Finally, the scraper tends to stop on multiples of 10 (e.g., 80, 110, etc.). This makes me believe it has something to do with continuing on to the next page.

Here is the code I'm using (along with a product ID where the scraper randomly stopped):

p = amzn.lookup(ItemId='B008LX6OC6') #also ItemID='B000F8EUFI'
rs = p.reviews()
for review in rs:
    print review.asin
    print review.url
    print review.soup
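The multiples-of-10 pattern is consistent with page-by-page iteration at ten reviews per page, so the silent stop most likely happens when the next-page link fails to parse on a single bad response. A paging loop that re-fetches before treating a missing next link as the end (a sketch over stand-in page objects, not the amazon_scraper API):

```python
def iterate_pages(get_page, first_url, retries=2):
    """Yield pages, re-fetching when next_page_url comes back None,
    since one bad response otherwise ends iteration early."""
    url = first_url
    while url:
        page = get_page(url)
        yield page
        next_url = page.next_page_url
        for _ in range(retries):
            if next_url is not None:
                break
            next_url = get_page(url).next_page_url  # re-fetch and re-parse
        url = next_url

# Stand-in pages: page "p1" "forgets" its next link on the first fetch.
class Page:
    def __init__(self, url, next_page_url):
        self.url, self.next_page_url = url, next_page_url

responses = {"p1": [Page("p1", None), Page("p1", "p2")],
             "p2": [Page("p2", None), Page("p2", None), Page("p2", None)]}
def get_page(url):
    queue = responses[url]
    return queue.pop(0) if len(queue) > 1 else queue[0]

urls = [p.url for p in iterate_pages(get_page, "p1")]
print(urls)  # ['p1', 'p2']
```

Without the retry, the loop would have stopped after "p1", which is exactly the silent early stop described above.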

Incorrect Syntax in init

Hi,

I get an error when installing and running: __init__.py, line 60 reports "incorrect syntax" for _price_regexp = re.compile(ur'(?P[$ΓΊ][\d,.]+)', flags=re.I), with the arrow on the last '.

The same error occurs in product.py line 50 and reviews.py line 60.

AWS Account

Do I need an AWS account for this to work?

Problem installing amazon_scraper

Whether installing from the command line with pip or through PyCharm, I get the following error message:

Collecting amazon-scraper
  Using cached amazon_scraper-0.3.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "", line 20, in
      File "C:\Users...\AppData\Local\Temp\pycharm-packaging0.tmp\amazon-scraper\setup.py", line 4, in
        del os.link
    AttributeError: link

By deleting that line I think it installs OK (I have tried it and used the scraper some time ago). The problem is when trying to install the scraper automatically using an IDE. I am running PyCharm on Windows 8.1, if that helps narrow down the source of the problem. Are any Linux users getting this error?
Any help would be appreciated.

Problems with .text command

Hello,

I've been able to use many of the commands and functions in the amazon_scraper package, but .text doesn't appear to work. I tried to pull the text of a valid review on Amazon using the instructions laid out in your documentation, but I continue to get 'None' instead of the text. I've tried this with several review ids, and all return 'None'. Can someone help me with this?

Thanks,

Brad

Problem with BeautifulSoup import

I ran into an import problem when setting up a simple scraper.
Here is the content of my script:

from __future__ import print_function
import itertools
from amazon_scraper import AmazonScraper
amzn = AmazonScraper("XXXX", "XXXX", "XXXX")
for p in itertools.islice(amzn.search(Keywords='python', SearchIndex='Books'), 5):
    print(p.title)

Here is the output when running the script:

root@nivose:~/amazon# python product.py
Traceback (most recent call last):
  File "product.py", line 3, in <module>
    from amazon_scraper import AmazonScraper
  File "/usr/local/lib/python2.7/dist-packages/amazon_scraper/__init__.py", line 16, in <module>
    from bs4 import BeautifulSoup
  File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 30, in <module>
  File "build/bdist.linux-x86_64/egg/bs4/builder/__init__.py", line 314, in <module>
  File "build/bdist.linux-x86_64/egg/bs4/builder/_html5lib.py", line 70, in <module>
AttributeError: 'module' object has no attribute '_base'

Problem with InsecureRequest

/home/####/.local/lib/python2.7/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)

This is the warning I get when trying this:

from amazon_scraper import AmazonScraper

amzn = AmazonScraper(...)  # I've passed correct arguments here
rs = amzn.reviews(ItemId='B0734X8GW5')
for r in rs.ids:
    rvn = amzn.review(Id=r)
    print(rvn.id)
    print(rvn.text)
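Note this is a warning, not an error: the requests still go through, urllib3 is just flagging that certificate verification is off. If verification genuinely can't be enabled, the warning can be silenced for its category rather than globally. A sketch with a stand-in warning class so it runs without urllib3 installed (with urllib3, the real class is `urllib3.exceptions.InsecureRequestWarning`, and urllib3 also ships a `disable_warnings` helper):

```python
import warnings

# Stand-in for urllib3.exceptions.InsecureRequestWarning.
class InsecureRequestWarning(Warning):
    pass

def insecure_request():
    warnings.warn("Unverified HTTPS request is being made.",
                  InsecureRequestWarning)
    return "response"

with warnings.catch_warnings():
    warnings.simplefilter("ignore", InsecureRequestWarning)
    result = insecure_request()  # warning suppressed inside this block

print(result)  # response
```

The better fix, per the linked urllib3 documentation, is to enable certificate verification rather than suppress the warning.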

Create a new release

Let me know when you're happy with what you've committed.
We'll tag it and release it on PyPI.

Do you have a pypi account?
If so, let me know the username and I'll give you release permissions.

Add ability to set amazon_base

Regardless of the Region parameter, product reviews are always fetched from http://www.amazon.com/product-reviews/ due to the amazon_base constant.
I'd suggest that this base URL either be set dynamically depending on the Region, or be specifiable as a parameter, without having to statically overwrite the variable.
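A sketch of the dynamic mapping this suggests (the region codes follow the Product Advertising API convention; the function name, the default-with-override shape, and the exact set of storefronts are illustrative assumptions, not the library's API):

```python
# Map Product Advertising API region codes to storefront base URLs.
AMAZON_BASES = {
    'US': 'http://www.amazon.com',
    'UK': 'http://www.amazon.co.uk',
    'DE': 'http://www.amazon.de',
    'FR': 'http://www.amazon.fr',
    'JP': 'http://www.amazon.co.jp',
    'CA': 'http://www.amazon.ca',
}

def amazon_base(region='US', override=None):
    """Return the base URL for a region, allowing an explicit override."""
    if override:
        return override
    return AMAZON_BASES.get(region, AMAZON_BASES['US'])

print(amazon_base('DE'))  # http://www.amazon.de
print(amazon_base('US') + '/product-reviews/1449355730')
```

The `override` parameter covers the second half of the request: callers could pass a full base URL without touching the module-level constant.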
