adamlwgriffiths / amazon_scraper
Provides content not accessible through the standard Amazon API
License: Other
I didn't see that this existed, so I tried parsing it out myself:
avg_rating = float(rs.soup.find("i", {'class':re.compile('averageStarRating')}).text.split(' ')[0])
Should I submit a pull request?
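A slightly more defensive version of that one-liner might be worth considering, since the rating element is not always present on the page; `parse_average_rating` is a hypothetical helper name, and the guard simply avoids an exception when the text is missing or malformed:

```python
import re

def parse_average_rating(star_text):
    """Pull the leading number out of Amazon's star-rating text,
    e.g. '4.5 out of 5 stars' -> 4.5. Returns None on missing or
    malformed text instead of raising."""
    if not star_text:
        return None
    match = re.match(r'([\d.]+)', star_text.strip())
    return float(match.group(1)) if match else None
```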
Hi Adam,
First of all, thanks for sharing this project! I'm having a small issue. This is my first Python project, so please bear with me. I cloned this project into a c9.io environment and I'm getting a few errors when running the tests, namely this one:
ImportError: No module named tests
I would be very grateful if you could point me in the right direction.
Thanks
-Dat
Edit: things I have tried: changing the CWD, reinstalling all the dependencies.
When I try to run the scraper with a simple test request, as follows, it consistently fails with the error traceback shown below. Any idea where I am going awry? Apologies if I'm missing something obvious, but it's not clear to me why urllib2 is throwing a 400 error.
from amazon_scraper import AmazonScraper
amzn = AmazonScraper("XXXX", "XXXX", "XXXX")
import itertools
for p in itertools.islice(amzn.search(Keywords='python', SearchIndex='Books'), 5):
print p.title
And here is the result:
Traceback (most recent call last):
File "amazon-scraper.py", line 4, in <module>
for p in itertools.islice(amzn.search(Keywords='python', SearchIndex='Books'), 5):
File "/Users/pk/anaconda/lib/python2.7/site-packages/amazon_scraper/__init__.py", line 188, in search
for p in self.api.search(**kwargs):
File "/Users/pk/anaconda/lib/python2.7/site-packages/amazon/api.py", line 519, in __iter__
for page in self.iterate_pages():
File "/Users/pk/anaconda/lib/python2.7/site-packages/amazon/api.py", line 535, in iterate_pages
yield self._query(ItemPage=self.current_page, **self.kwargs)
File "/Users/pk/anaconda/lib/python2.7/site-packages/amazon/api.py", line 548, in _query
response = self.api.ItemSearch(ResponseGroup=ResponseGroup, **kwargs)
File "/Users/pk/anaconda/lib/python2.7/site-packages/bottlenose/api.py", line 242, in __call__
{'api_url': api_url, 'cache_url': cache_url})
File "/Users/pk/anaconda/lib/python2.7/site-packages/bottlenose/api.py", line 203, in _call_api
return urllib2.urlopen(api_request, timeout=self.Timeout)
File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 475, in error
return self._call_chain(*args)
File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/Users/pk/anaconda/lib/python2.7/urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request
Is there recommended software that would provide a GUI?
I'm having sporadic trouble when extracting the ASIN using reviews/full_review:
rs = amzn.reviews(ItemId='006001203X')
for r in rs:
    fr = r.full_review()
    myfile.write("%s," % (fr.asin))
I'm sometimes getting the error:
asin = unicode(tag.string)
AttributeError: 'NoneType' object has no attribute 'string'
My guess is that I'm not getting the content of the page when this error occurs, because the individual review's URL is passed along correctly (fr.url) and I can see that the content exists in my browser, yet I get "None" when asking for the text of the review (fr.text). Furthermore, sometimes the scraper errors on a specific review and sometimes it doesn't, which again makes me think this is a loading issue.
In case it helps, I'm using the scraper in conjunction with Tor and PySocks (maybe not necessary?). What would lead to pages sometimes not loading? Any solutions to this issue?
Update:
Here is some output from just printing the reviews (rather than writing them to a file). The format is the review URL followed by the text. You'll notice that "None" seems to appear at random, and when you visit the actual page, the review text is there.
http://www.amazon.com/review/R1GLFST9IJDL3Z
None
http://www.amazon.com/review/R3O5KSEJ5BONJ7
Written by Dr. Atkins, this book is definitely a good way to get started on the diet. My only reservation is that he spends an awful long time convincing the reader to start the diet. But a good resource for a low/no carb diet.
http://www.amazon.com/review/R353I88IYNVGZJ
Thank you it is what I was looking for
http://www.amazon.com/review/R22GIPYTEYX7IK
None
Also, I have seen this happen both with and without using Tor/PySocks.
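As a stopgap on the caller's side, a small guard around the soup lookup avoids the None dereference when a page comes back incomplete; this is just a sketch of the idea, not the library's API:

```python
def safe_text(tag):
    # Review pages sometimes come back incomplete (throttling, captchas),
    # so a soup lookup can return None; guard before dereferencing.
    if tag is None:
        return None
    return tag.string
```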
Detect when Amazon shoots a captcha at us and raise an appropriate error instead of letting our soup code fail with None dereferences.
See this issue for more information on the format of the captcha and result of it being sent:
#25
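A minimal sketch of the detection, assuming the captcha page can be recognised by marker strings in the HTML (the exact markers and the `CaptchaError` name are assumptions, not the library's API):

```python
class CaptchaError(Exception):
    """Raised when Amazon serves a captcha page instead of content."""

def check_for_captcha(page_html):
    # Marker strings are an assumption about Amazon's captcha page;
    # adjust them to whatever the interstitial actually contains.
    markers = (
        'Type the characters you see in this image',
        'opfcaptcha',
    )
    if any(marker in page_html for marker in markers):
        raise CaptchaError('Amazon returned a captcha page')
```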
I did:
from amazon_scraper import AmazonScraper
amzn = AmazonScraper(stuff...)
revs = amzn.reviews(URL="http://www.amazon.com/product-reviews/1449355730/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending")
revs.ids
I get an empty list. The cause might be that Amazon changed their HTML? I'd like to make this change:
 @property
 def ids(self):
     return [
-        extract_review_id(anchor['href'])
-        for anchor in self.soup.find_all('a', text=re.compile(ur'permalink', flags=re.I))
+        anchor["id"]
+        for anchor in self.soup.find_all('div', class_="a-section review")
     ]
This matches up a bit more closely with Amazon's HTML, which looks like:
<div id="R2UBSL6L1T8MIF" class="a-section review"><div class="a-row helpful-votes-count"></div>
...
Amazon has moved from a ratings count (10 votes for 1 star) to a percentage (5% voted 1 star):
http://www.amazon.com/dp/B00FLIJJSA
We need to handle this in the API.
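One way to absorb the change is a helper that accepts either form; `parse_rating_breakdown` is a hypothetical name, and converting the percentage back to a count assumes the total review count is known:

```python
def parse_rating_breakdown(value_text, total_reviews):
    # Amazon now shows each star level as a percentage ('5%') rather
    # than a raw vote count ('10'); accept either and return a count.
    value_text = value_text.strip()
    if value_text.endswith('%'):
        percentage = int(value_text.rstrip('%'))
        return round(percentage * total_reviews / 100.0)
    return int(value_text)
```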
I don't know if this is an issue, but I can't get the reviews API to work. Even if I use the values in your unit tests they do not work. Do the unit tests work for you? Here are the results:
Fitbit FB401BK Flex Wireless Activity + Sleep Wristband, Black, Small - 5.5" - 6.5" & Large - 6.5" - 7.9"
[0, 0, 0, 0, 0]
<built-in method title of unicode object at 0x0000000004CEAED0>
<built-in method title of unicode object at 0x0000000004D66600>
<built-in method title of unicode object at 0x0000000004D8BCC0>
<built-in method title of unicode object at 0x0000000004DB53C0>
<built-in method title of unicode object at 0x0000000004E24A80>
<built-in method title of unicode object at 0x0000000004E4C180>
<built-in method title of unicode object at 0x0000000004E77870>
<built-in method title of unicode object at 0x0000000004E9CF30>
<built-in method title of unicode object at 0x0000000004F10630>
<built-in method title of unicode object at 0x0000000004F76CF0>
Traceback (most recent call last):
File "test.py", line 11, in <module>
for r in rs:
File "build\bdist.win-amd64\egg\amazon_scraper\reviews.py", line 180, in __iter__
TypeError: __init__() takes at least 2 arguments (2 given)
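Those `built-in method title` lines suggest the loop is printing `.title` on a plain string, where `title` is the built-in string method, rather than reading a product attribute; if so, the fix is to call it, or to index the right object. A tiny illustration:

```python
s = u'fitbit flex wireless activity wristband'

# Reading .title off a string yields the bound method object, which
# prints as '<built-in method title of ... object at 0x...>'.
method = s.title
print(method)

# Calling it produces the title-cased text instead.
print(s.title())
```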
Hello,
I'm new to coding and want to set up a script to pull prices from warehousedeals.com for items that are 70%+ off versus the new Amazon price. Can someone assist me with how to do this?
So far I have downloaded Python 2 and Notepad++.
Can you please tell me how to get the product price through your amazon_scraper API in Python, and also how to get seller information?
I'm running into an odd error when trying to get the ASIN for reviews. I get an RS object but, oddly, can't get at the ASIN, despite the fact that the span has that attribute. Any help?
...
RS value: <amazon_scraper.reviews.Reviews object at 0x10c0c9890>
...
Here's my traceback:
Traceback (most recent call last):
File "amazon_reviews.py", line 112, in <module>
scrapeAmazonReviews(filepath, title, amzn, url)
File "amazon_reviews.py", line 83, in scrapeAmazonReviews
print "RS ASID: %s" % rs.asin
File "/Users/pbeeson/.virtualenvs/custom_data_pulls/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 41, in asin
return unicode(span['name'])
TypeError: 'NoneType' object has no attribute '__getitem__'
I'm going to take on fixing up the tests and getting them to work properly. Right now there are basic issues just getting the tests to run, and we need to turn off the MaxQPS property to get them going. There are also three test failures and several errors that can be fixed. Hell, maybe I'll even figure out how to throw Travis on here, but that can be a separate ticket.
I'm trying to run the example block of code that looks like this:
p = amzn.lookup(ItemId='B0051QVF7A')
rs = amzn.reviews(URL=p.reviews_url)
for r in rs:
print(r)
but at first I get an error like this:
Traceback (most recent call last):
File "review-scraper/review-scraper.py", line 19, in <module>
for r in rs:
File "/usr/local/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 178, in __iter__
for id in page.ids:
File "/usr/local/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 202, in ids
for anchor in self.soup.find_all('div', class_="a-section review")
File "/usr/local/lib/python2.7/site-packages/amazon_scraper/__init__.py", line 113, in decorator
raise e
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?
And when I install the html5lib package, things work a little better (I'm able to print out the first page of reviews), but then I hit another error:
R1V8OBW4HRDV5W
R38AV3D6I8CHS6
R1R19OOAWIN48U
RL37IWIVVB5B4
R3S9D4LLRP7AQN
R1CAZXTXQ6F5A
R36R23EPPWW6UQ
RA751EK4W8EV4
RGZ3A10EDUYQ1
RP149JO3VJ31O
Traceback (most recent call last):
File "review-scraper/review-scraper.py", line 19, in <module>
for r in rs:
File "/usr/local/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 180, in __iter__
page = Reviews(URL=page.next_page_url) if page.next_page_url else None
TypeError: __init__() takes at least 2 arguments (2 given)
Is there a different package I should be using?
I am randomly having stops occur as I try to scrape all the reviews for a product (using 'reviews'). I don't receive an error, and the 'soup' output for the last review it scraped doesn't seem all that informative (i.e., it looks like the reviews before it). This doesn't happen for specific products in particular, nor does it occur after the same review each time. Sometimes if I re-scrape the same product, it starts or stops at a different review, and sometimes all of the reviews are scraped. If I am scraping multiple products one after the other, the scraper just continues on to the next product, even though it didn't scrape all the reviews for the first one. Finally, the scraper tends to stop on multiples of 10 (e.g., 80, 110, etc.), which makes me believe it has something to do with continuing on to the next page.
Here is the code I'm using (along with a product ID where the scraper randomly stopped):
p = amzn.lookup(ItemId='B008LX6OC6')  # also ItemId='B000F8EUFI'
rs = p.reviews()
for review in rs:
    print review.asin
    print review.url
    print review.soup
Hi,
I get an error when installing and running, in __init__.py at line 60.
It says "invalid syntax": _price_regexp = re.compile(ur'(?P<price>[$ΓΊ][\d,.]+)', flags=re.I).
The error caret points at the last quote.
The same happens in product.py line 50 and reviews.py line 60.
Do I need an AWS account for this to work?
Either installing from command line using pip or through PyCharm, I get the following error message:
//
Collecting amazon-scraper
Using cached amazon_scraper-0.3.2.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 20, in
File "C:\Users...\AppData\Local\Temp\pycharm-packaging0.tmp\amazon-scraper\setup.py", line 4, in
del os.link
AttributeError: link
//
By deleting this line, I think it installs OK (I have tried it and used the scraper some time ago).
The problem is when trying to install the scraper automatically using an IDE. I am running PyCharm on Windows 8.1, if that could indicate the source of the problem. Are any Linux users getting this error?
Any help would be appreciated.
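For reference, a more defensive version of that setup.py line would guard the deletion, since `os.link` does not exist on Windows under Python 2; this is a sketch of the workaround, not the package's actual fix:

```python
import os

# os.link is absent on some platforms (e.g. Windows + Python 2), so
# deleting it unconditionally raises AttributeError; guard it instead.
if hasattr(os, 'link'):
    del os.link
```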
Hello,
I've been able to use many of the commands and functions in the amazon_scraper package, but .text doesn't appear to work. I tried to pull the text of a valid review on Amazon using the instructions laid out in your documentation, but I continue to get 'None' instead of the text. I've tried this with several review ids, and all return 'None'. Can someone help me with this?
Thanks,
Brad
On the Amazon home page, the product links are different and extract_asin doesn't work.
I propose changing _extract_asin_regexp to (/dp/|/gp/product/)(?P<asin>[^/]+)/:
#_extract_asin_regexp = re.compile(r'/dp/(?P<asin>[^/]+)/')
_extract_asin_regexp = re.compile(r'(/dp/|/gp/product/)(?P<asin>[^/]+)/')
Example of link : http://www.amazon.com/gp/product/B00GBHZDY4/
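The proposed pattern can be checked quickly against both link styles (the `extract_asin` wrapper here is just for illustration):

```python
import re

_extract_asin_regexp = re.compile(r'(/dp/|/gp/product/)(?P<asin>[^/]+)/')

def extract_asin(url):
    # Returns the ASIN for both /dp/ and /gp/product/ style links,
    # or None when neither form is present.
    match = _extract_asin_regexp.search(url)
    return match.group('asin') if match else None
```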
dateutil.parser can't parse dates for non-English Amazon regions, where review dates are, e.g., "am 15. September 2017" instead of "on September 15, 2017".
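dateutil only knows English month names, so one workaround is to normalise the localised form before (or instead of) calling it; the helper below is a stdlib-only sketch under the assumption that German review dates always look like "am 15. September 2017":

```python
from datetime import datetime

# German month names -> month numbers (several match English anyway).
GERMAN_MONTHS = {
    'Januar': 1, 'Februar': 2, 'März': 3, 'April': 4,
    'Mai': 5, 'Juni': 6, 'Juli': 7, 'August': 8,
    'September': 9, 'Oktober': 10, 'November': 11, 'Dezember': 12,
}

def parse_german_review_date(text):
    # 'am 15. September 2017' -> datetime(2017, 9, 15)
    if text.startswith('am '):
        text = text[3:]
    day, month, year = text.replace('.', '').split()
    return datetime(int(year), GERMAN_MONTHS[month], int(day))
```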
I'm reporting an import problem I hit when setting up a simple scraper. Here is the content of my script:
from __future__ import print_function
import itertools
from amazon_scraper import AmazonScraper
amzn = AmazonScraper("AKILSM4QQOKUVAX3lNPO", "s8a0hVmtfLL1TLZsKKiCBVvZTdQVG7x1HqhHZ1+E", "")
for p in itertools.islice(amzn.search(Keywords='python',
SearchIndex='Books'), 5):
print(p.title)
Here is the output when running it:
root@nivose:~/amazon# python product.py
Traceback (most recent call last):
File "product.py", line 3, in <module>
from amazon_scraper import AmazonScraper
File "/usr/local/lib/python2.7/dist-packages/amazon_scraper/__init__.py", line 16, in <module>
from bs4 import BeautifulSoup
File "build/bdist.linux-x86_64/egg/bs4/__init__.py", line 30, in <module>
File "build/bdist.linux-x86_64/egg/bs4/builder/__init__.py", line 314, in <module>
File "build/bdist.linux-x86_64/egg/bs4/builder/_html5lib.py", line 70, in <module>
AttributeError: 'module' object has no attribute '_base'
/home/####/.local/lib/python2.7/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
This is the warning I get when trying this:
from amazon_scraper import AmazonScraper
amzn = AmazonScraper(...)  # I've passed correct arguments
rs = amzn.reviews(ItemId='B0734X8GW5')
for r in rs.ids:
    rvn = amzn.review(Id=r)
    print (rvn.id)
    print (rvn.text)
I seem to be able to get only the last 10 reviews. Is this a known issue?
Let me know when you're happy with what you've committed.
We'll tag it and release it on PyPI.
Do you have a PyPI account?
If so, let me know the username and I'll give you release permissions.
I tried the function:
rs = amzn.reviews(URL=p.reviews_url)
It returns only <10 results (unless I'm missing something).
I am getting the reviews from the first review page, but how do I get the reviews from the other pages?
Installing the requirements should use the library named bs4; beautifulsoup4 is not a valid library (Python 3+).
Regardless of the Region parameter, product reviews are always fetched from http://www.amazon.com/product-reviews/ due to the amazon_base constant.
I'd suggest that this base URL either be set dynamically depending on the Region, or be specifiable as a parameter, without having to statically overwrite the variable.