remitchell / python-scraping

Code samples from the book Web Scraping with Python http://shop.oreilly.com/product/0636920034391.do

Python 0.69% Jupyter Notebook 92.30% JavaScript 2.54% Roff 4.46% HTML 0.01%

python-scraping's Introduction

Web Scraping with Python Code Samples

These code samples are for the book Web Scraping with Python 2nd Edition

If you're looking for the first edition code files, they can be found in the v1 directory.

Most code for the second edition is contained in Jupyter notebooks. Although these files can be viewed directly in your browser in Github, some formatting changes and oddities may occur. I recommend that you clone the repository, install Jupyter, and view them locally for the best experience.

The web changes, libraries update, and I make mistakes and typos more frequently than I'd like to admit! If you think you've spotted an error, please feel free to make a pull request against this repository.

python-scraping's People

Contributors

acpk, malonegod, mimibambino, remitchell, spwilson2, takesxisximada, tmrblr


python-scraping's Issues

Chapter 2 nameList problem: 'NoneType' object is not callable

Hey, I typed the same code as the book shows, and I got this error:

Traceback (most recent call last):
File "C:\Users\hongy\AppData\Local\Programs\Python\Python36\nameList.py", line 5, in
nameList = bsObj.findall('span', {'class':'green'})
TypeError: 'NoneType' object is not callable

I'm new to Python and can't figure this out by searching Google. Could you help me with this?
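A likely explanation, assuming BeautifulSoup 4: there is no method named findall, so bsObj.findall is treated as a search for a (nonexistent) <findall> tag and evaluates to None, and calling None raises exactly this error. A minimal sketch of the corrected code, using find_all (the findAll spelling from the first edition also works):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj = BeautifulSoup(html, 'html.parser')

# find_all (or the older alias findAll) is the real method; 'findall' is not,
# which is why the attribute lookup silently returned None.
nameList = bsObj.find_all('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())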

Indentation Error (Storing Data, 4th code)

In the second loop there is a slight indentation issue that keeps the CSV from having the proper form:

csvFile = open('editors.csv', 'wt+')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow) # this line is dedented one level (out of the inner loop) so each table row becomes one CSV row
finally:
    csvFile.close()

Simple inversion in Chapter 04 > 2 Crawling through sites with search

In Chapter 04 > 2 Crawling through sites with search,
in the function def search(self, topic, site):
there is: content = Content(topic, title, body, url)
instead of: content = Content(topic, url, title, body)
as defined in:
class Content: def __init__(self, topic, url, title, body)
This swaps the arguments, so the fields end up on the wrong attributes and the returned content is difficult to understand.
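A minimal sketch of the fix (the class shape is taken from the chapter; passing keyword arguments and the placeholder values are my additions, and sidestep the ordering problem entirely):

class Content:
    def __init__(self, topic, url, title, body):
        self.topic = topic
        self.url = url
        self.title = title
        self.body = body

# Keyword arguments make the intended mapping explicit, so a swapped
# positional order can no longer scramble the fields. The values here
# are placeholders for illustration only.
content = Content(topic='python', url='http://example.com/article',
                  title='Example title', body='Example body text')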

Chapter 10.4: Issue about Posting Image

Hello Everyone,

I am getting an error with the following code:

import requests
files = {'uploadFile': open('python.png', 'rb')}
r = requests.post('http://pythonscraping.com/pages/processing2.php', files=files)
print(r.text)

The response is:
Sorry, there was an error uploading your file.

Any help would be much appreciated.

Can't view files

Can't view any files. Just get a "Sorry, something went wrong. Reload?" message.

Possible improvement on the code of page 76

Hi author,

The lines of code are at the very top of page 76 in the book:

try:
    for row in tableRows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
            writer.writerow(csvRow)
finally:
    csvFile.close()

I tested it and believe that, if we intend to output a normal CSV file, this code should be modified as below:

try:
    for row in tableRows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()

That is, this line is moved to the left by one indent:
writer.writerow(csvRow)

Otherwise, the CSV file will contain many partially built, incomplete rows.

Error occurred when running chapter-12/2-seleniumCookies.py

The output of this script is:

[{'expires': 'Sun, 18 Dec 2016 12:53:17 GMT', 'name': '_gat', 'path': '/', 'expiry': 1482065597, 'domain': '.pythonscraping.com', 'httponly': False, 'value': '1', 'secure': False}, {'expires': 'Tue, 18 Dec 2018 12:43:17 GMT', 'name': '_ga', 'path': '/', 'expiry': 1545136997, 'domain': '.pythonscraping.com', 'httponly': False, 'value': 'GA1.2.2049848913.1482064997', 'secure': False}, {'name': 'has_js', 'path': '/', 'domain': 'pythonscraping.com', 'httponly': False, 'value': '1', 'secure': False}]

WebDriverException Traceback (most recent call last)
in ()
12 driver2.delete_all_cookies()
13 for cookie in savedCookies:
---> 14 driver2.add_cookie(cookie)
15
16 driver2.get("http://pythonscraping.com")

/usr/local/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py in add_cookie(self, cookie_dict)
669
670 """
--> 671 self.execute(Command.ADD_COOKIE, {'cookie': cookie_dict})
672
673 # Timeouts

/usr/local/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py in execute(self, driver_command, params)
234 response = self.command_executor.execute(driver_command, params)
235 if response:
--> 236 self.error_handler.check_response(response)
237 response['value'] = self._unwrap_value(
238 response.get('value', None))

/usr/local/lib/python3.5/site-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
190 elif exception_class == UnexpectedAlertPresentException and 'alert' in value:
191 raise exception_class(message, screen, stacktrace, value['alert'].get('text'))
--> 192 raise exception_class(message, screen, stacktrace)
193
194 def _value_or_default(self, obj, key, default):

WebDriverException: Message: {"errorMessage":"Can only set Cookies for the current domain","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"243","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:58537","User-Agent":"Python-urllib/3.5"},"httpVersion":"1.1","method":"POST","post":"{"cookie": {"expires": "Sun, 18 Dec 2016 12:53:17 GMT", "name": "_gat", "path": "/", "expiry": 1482065597, "domain": ".pythonscraping.com", "httponly": false, "value": "1", "secure": false}, "sessionId": "94eb85a0-c51f-11e6-badf-39312152c0b6"}","url":"/cookie","urlParsed":{"anchor":"","query":"","file":"cookie","directory":"/","path":"/cookie","relative":"/cookie","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/cookie","queryKey":{},"chunks":["cookie"]},"urlOriginal":"/session/94eb85a0-c51f-11e6-badf-39312152c0b6/cookie"}}
Screenshot: available via screen

How can I eliminate this error? Thank you in advance.
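As the message says, Selenium will only accept cookies for the domain currently loaded in the browser, and some drivers additionally reject extra fields such as 'expiry' or a leading dot in 'domain'. A minimal sketch of a workaround, reusing driver2 and savedCookies from the chapter's script (the field-stripping is an assumption about what the driver dislikes, not something from the book):

# Be on the target domain before touching its cookies.
driver2.get('http://pythonscraping.com')
driver2.delete_all_cookies()

for cookie in savedCookies:
    # Drop or normalize fields that some drivers refuse when re-adding a cookie.
    cookie.pop('expiry', None)
    if 'domain' in cookie:
        cookie['domain'] = cookie['domain'].lstrip('.')
    driver2.add_cookie(cookie)

driver2.get('http://pythonscraping.com')
print(driver2.get_cookies())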

csv Creation Failure

python --version: 3.7.0
chapter5/2-createCsv.py

Hello, I wrote the code following the content on GitHub, and afterwards even copied and pasted the code directly, but it still reports this error:
FileNotFoundError: [Errno 2] No such file or directory: '../files/test.csv'
Can you tell me how to solve the problem?
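The error usually means the relative '../files' directory does not exist where the script is being run, and open() will not create directories for you. A rough sketch of a workaround (loosely based on the chapter's CSV example, so the column contents are an approximation): create the directory first, or point the path somewhere that already exists.

import csv
import os

# Make sure the target directory exists before opening the file for writing.
os.makedirs('../files', exist_ok=True)

csvFile = open('../files/test.csv', 'w+')
try:
    writer = csv.writer(csvFile)
    writer.writerow(('number', 'number plus 2', 'number times 2'))
    for i in range(10):
        writer.writerow((i, i + 2, i * 2))
finally:
    csvFile.close()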

Error on first example

Hello, I just purchased your book. I am on the first example, "scrapetest.py", and I am getting an error.

Exception has occurred: NameError
name 'null' is not defined
File "C:\Users\mine\Documents\VSCode\Workspaces\PythonScrapping\scrapetest.py", line 114, in "execution_count": null,

It is a fresh install of VS Code and Python 3.1. This is my first time running them. Any help is appreciated.

chapter13/4-dragAndDrop.py Doesn't work

debian 7

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver import ActionChains

driver = webdriver.PhantomJS(executable_path='./phantomjs')
driver.get('http://pythonscraping.com/pages/javascript/draggableDemo.html')

print(driver.find_element_by_id("message").text)
Prove you are not a bot, by dragging the square from the blue area to the red area!

element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()

print(driver.find_element_by_id("message").text)
Prove you are not a bot, by dragging the square from the blue area to the red area!

======
I saw the same result on Windows 7 too!
I also found this:
SeleniumHQ/selenium#2533

[SSL: CERTIFICATE_VERIFY_FAILED] on pythonscraping

When running the code from the first two chapters that uses pythonscraping.com, I receive the following error:

<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1125)>
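Until the certificate is renewed, one workaround for the urllib examples is to pass an unverified SSL context to urlopen. This is a sketch of a stopgap, not a recommendation: it disables certificate checking entirely, so only use it against a site you already trust.

import ssl
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Skips certificate verification; acceptable only as a temporary workaround
# for the expired certificate on the example site.
context = ssl._create_unverified_context()
html = urlopen('https://pythonscraping.com/pages/page1.html', context=context)
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)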

Method missing

There is no method named getNextExternalLink() defined in Chapter 3 getExternalLinks.py

[SSL: CERTIFICATE_VERIFY_FAILED] on example site

Hi,

I can no longer use the example site.

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>

Can you renew the certificate? ;)

Thx,


To bypass the problem I use requests instead of urllib (a full sketch follows the list):

  • requests.get("url", verify=False) instead of urlopen()
  • BeautifulSoup(html.content, 'html.parser') instead of BeautifulSoup(html.read(), 'html.parser')
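Putting those two substitutions together, a minimal sketch of the requests-based workaround (verify=False also skips certificate checks, so treat it as a stopgap and expect an InsecureRequestWarning):

import requests
from bs4 import BeautifulSoup

# verify=False disables certificate verification to get past the expired cert.
html = requests.get('https://pythonscraping.com/pages/page1.html', verify=False)
bs = BeautifulSoup(html.content, 'html.parser')
print(bs.h1)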

Question in ch14 v1

What is the meaning of the sentence "horrify web administrators by sending their website traffic from Internet Explorer 5.0"? What special feature does Internet Explorer 5.0 have?

Chapter3.Article"Crawling with Scrapy". NotImplementedError

Hi everybody! I am a newbie trying to implement my first test Scrapy project, wikiSpider. I did everything as described in the book and got this result: NotImplementedError: ArticleSpider.parse callback is not defined.

files:
-articleSpider.py:
from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = item
        return item

-items.py

from scrapy import Item, Field

class Article(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()

You can see more details here:
![mypic1](https://user-images.githubusercontent.com/29522625/37275003-b2660ff0-25e6-11e8-9b35-1a3afe85ebdd.jpg)

Can anybody help me? Thanks in advance.

chapter11/3-readWebImages.py

driver.find_element_by_id("sitbReaderRightPageTurner").click()

"Element is not currently visible and may not be manipulated"

findAll vs find_all

the text uses find_all

Using this BeautifulSoup object, you can use the find_all function...

whereas the example uses findAll instead:

nameList = bs.findAll('span', {'class':'green'})
for name in nameList:
    print(name.get_text())

which doesn't make a difference when the code is run, but the code example and the text should remain consistent with one another.
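For what it's worth, findAll survives in BeautifulSoup 4 as a backwards-compatible alias for find_all, so the two spellings return the same results; a minimal check, assuming the bs object from the example:

# findAll is an alias for find_all in BeautifulSoup 4, so both calls return
# the same set of tags from the same parse tree.
assert bs.findAll('span', {'class': 'green'}) == bs.find_all('span', {'class': 'green'})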

Error on "urllib.request import urlopen" from Chapter01_BeginningToScrape.ipynb

Hi,

I am getting the error below after running this code:

from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')

Traceback (most recent call last):
File "C:\Anaconda3\envs\py38\lib\http\client.py", line 871, in _get_hostport
port = int(host[i+1:])
ValueError: invalid literal for int() with base 10: 'port'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "C:\Anaconda3\envs\py38\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Anaconda3\envs\py38\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "C:\Anaconda3\envs\py38\lib\urllib\request.py", line 542, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "C:\Anaconda3\envs\py38\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\Anaconda3\envs\py38\lib\urllib\request.py", line 1379, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Anaconda3\envs\py38\lib\urllib\request.py", line 1319, in do_open
h = http_class(host, timeout=req.timeout, **http_conn_args)
File "C:\Anaconda3\envs\py38\lib\http\client.py", line 833, in init
(self.host, self.port) = self._get_hostport(host, port)
File "C:\Anaconda3\envs\py38\lib\http\client.py", line 876, in _get_hostport
raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
http.client.InvalidURL: nonnumeric port: 'port'

I am using the latest version of Python (3.8.5). What could be the problem?

Thank you.
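One possible cause of "nonnumeric port: 'port'" is a proxy environment variable that still contains a literal placeholder such as http://proxyhost:port; urllib routes the request through that proxy and then fails to parse the fake port. A minimal check (this is a guess about your environment, not a confirmed diagnosis):

import urllib.request

# Shows any proxy settings urllib will pick up from the environment
# (HTTP_PROXY / HTTPS_PROXY etc.); a value containing a literal ':port'
# placeholder would explain the nonnumeric-port error.
print(urllib.request.getproxies())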

chapter 10 - handling redirects

Hi,
I'm running the 3-javascriptRedirect.py file from Chapter 10, and StaleElementReferenceException is never thrown, so the full 10-second timeout always elapses. Is this a bug?
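For comparison, a common explicit-wait pattern for detecting a client-side redirect is to grab an element from the original page and wait for it to go stale. This is a sketch under my own assumptions (Chrome instead of the driver used in the book, and the redirectDemo1.html page from the chapter), not the book's exact code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes a locally available ChromeDriver
driver.get('http://pythonscraping.com/pages/javascript/redirectDemo1.html')

# The <html> element of the original page becomes stale as soon as the
# browser navigates to the redirect target.
old_page = driver.find_element(By.TAG_NAME, 'html')
WebDriverWait(driver, 10).until(EC.staleness_of(old_page))
print(driver.page_source)
driver.quit()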

Chapter 8: 2-countUncommon2Grams.py

First of all, I really enjoy working through all the examples in the book. However, on this specific chapter I am lost. You have a function isCommon but never use it in the program.


Also, the output that you have in the book does not match with what you have in this repo.


I am confused can you please advised? Thank you!

Can you provide the CAPTCHA 'captchaExample.png' of Chapter 11?

I want to test the CAPTCHA recognition with Tesseract from the code in Chapter 11, but I can't find the CAPTCHA file captchaExample.png. I can take a screenshot of the picture in the book and save it as a PNG, but the result of the command tesseract captchaExample.png output is:

Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Empty page!!
Empty page!!

So I want to obtain the original CAPTCHA file, can you provide it? Thanks.
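While waiting for the original file, "Empty page!!" from a book-page screenshot can sometimes be fixed by cleaning the image up before OCR. A rough sketch, assuming Pillow and the pytesseract wrapper are installed (both are assumptions; the book invokes the tesseract command directly):

from PIL import Image
import pytesseract

# Convert to grayscale and apply a hard threshold so the letters stand out
# from the scanned-page background noise.
image = Image.open('captchaExample.png').convert('L')
image = image.point(lambda px: 0 if px < 143 else 255)
image.save('captchaExample_clean.png')

print(pytesseract.image_to_string(Image.open('captchaExample_clean.png')))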

Chapter 16: error occurred while running the multiprocess crawling code

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import random

from multiprocessing import Process
import os
import time

visited = []
def get_links(bs):
    print('Getting links in {}'.format(os.getpid()))
    links = bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))
    return [link for link in links if link not in visited]

def scrape_article(path):
    visited.append(path)
    html = urlopen('http://en.wikipedia.org{}'.format(path))
    time.sleep(3)
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.find('h1').get_text()
    print('Scraping {} in process {}'.format(title, os.getpid()))
    links = get_links(bs)
    if len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs['href']
        print(newArticle)
        scrape_article(newArticle)

processes = []
processes.append(Process(target=scrape_article, args=('/wiki/Kevin_Bacon', )))
processes.append(Process(target=scrape_article, args=('/wiki/Monty_Python', )))

for p in processes:
    p.start()

The following error occurred while running it:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

I changed the code to this:

# inserted if __name__ == '__main__':
if __name__ == '__main__':
    processes = []
    processes.append(Process(target=scrape_article, args=('/wiki/Kevin_Bacon', )))
    processes.append(Process(target=scrape_article, args=('/wiki/Monty_Python', )))
    
    for p in processes:
        p.start()

Question in ch2

from urllib.request import urlopen
from bs4 import BeautifulSoup
html=urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bs=BeautifulSoup(html,"html.parser")
nameList = bs.find_all(text='the prince')
print(len(nameList))

I ran the code above and the result is 7. However, when I use Ctrl+F to search for 'the prince' in the browser, the result is 11. I'm confused about why the results are inconsistent.
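A likely explanation: find_all(text='the prince') matches only text nodes whose entire string is exactly 'the prince', while the browser's Ctrl+F counts every (usually case-insensitive) substring occurrence, including 'the prince' buried inside longer sentences. A small sketch that counts substrings instead, for comparison:

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, 'html.parser')

# Text nodes that merely *contain* the phrase, not just exact-match nodes.
print(len(bs.find_all(text=re.compile('the prince'))))

# Browser-style total: every occurrence anywhere in the visible text,
# ignoring case.
print(len(re.findall('the prince', bs.get_text(), re.IGNORECASE)))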

seleniumBasic.py in Chapter10 - 'Service' object has no attribute 'process'

Sorry, I am missing the executable path; specifically, executable_path='' is left empty.

My code is the same as in the book:

#!/usr/bin/env python3

import time
from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='')
driver.get("http://www.pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()

But when I run this code, I get this:

Traceback (most recent call last):
File "use_selenium.py", line 6, in
driver = webdriver.PhantomJS(executable_path='')
File "D:\Python34\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py", line 52, in init
self.service.start()
File "D:\Python34\lib\site-packages\selenium\webdriver\common\service.py", line 64, in start
stdout=self.log_file, stderr=self.log_file)
File "D:\Python34\lib\subprocess.py", line 848, in init
restore_signals, start_new_session)
File "D:\Python34\lib\subprocess.py", line 1104, in _execute_child
startupinfo)
OSError: [WinError 87] ▒▒▒▒▒▒▒▒
Exception ignored in: <bound method Service.__del__ of <selenium.webdriver.phantomjs.service.Service object at 0x000000000307A940>>
Traceback (most recent call last):
File "D:\Python34\lib\site-packages\selenium\webdriver\common\service.py", line 163, in del
self.stop()
File "D:\Python34\lib\site-packages\selenium\webdriver\common\service.py", line 135, in stop
if self.process is None:
AttributeError: 'Service' object has no attribute 'process'

I am using Windows 8.1 + Python 3.4, with Selenium installed via pip3. The same error also occurs on my CentOS 6.5 server platform.
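A likely cause is the empty executable_path: Selenium tries to launch '' as a subprocess, the launch fails (WinError 87), and the Service object never gets a process attribute, which produces the secondary AttributeError. A minimal sketch with the path filled in; the path below is hypothetical, so point it at wherever the PhantomJS binary actually lives, or put it on your PATH and omit the argument:

import time
from selenium import webdriver

# Hypothetical location; replace with the real path to phantomjs(.exe).
driver = webdriver.PhantomJS(executable_path=r'C:\phantomjs\bin\phantomjs.exe')
driver.get('http://www.pythonscraping.com/pages/javascript/ajaxDemo.html')
time.sleep(3)
print(driver.find_element_by_id('content').text)
driver.close()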

Chapter5 articles.py

I tried to run the code from the file articles.py both in Jupyter and on the Anaconda command line, but every time I run it I get the error "No module named 'scrapy.contrib'", and so far I haven't been able to solve this issue. I'd be glad to get any help with it. Thank you so much.
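scrapy.contrib was deprecated and then removed in newer Scrapy releases, so code written against it fails with exactly this error. A minimal sketch of the modern import locations (an assumption about which names articles.py actually needs; adjust to the ones in your copy of the file):

# Modern Scrapy moved these classes out of scrapy.contrib:
from scrapy.spiders import CrawlSpider, Rule        # was scrapy.contrib.spiders
from scrapy.linkextractors import LinkExtractor     # was scrapy.contrib.linkextractors

class ArticleSpider(CrawlSpider):
    name = 'articles'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Main_Page']
    rules = [Rule(LinkExtractor(allow=r'.*'), callback='parse_items', follow=True)]

    def parse_items(self, response):
        # Placeholder callback; replace with the parsing logic from articles.py.
        print('Visited: {}'.format(response.url))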

Chapter 05 mySQLBasicExample

Thanks for writing this book. As a non-programmer it has been a fun project working through the various examples. I've run into a wall trying to run the mySQLBasicExample. Running on Windows 7; MySQL installed; PyMySQL installed; the MySQL command prompt is open. I copied the subject code into Notepad and saved it as mySQL01.py. Running it results in the following:

[screenshot of the error omitted]

I have tried resetting my root password based on the instructions here: http://dev.mysql.com/doc/refman/5.7/en/resetting-permissions.html but have been unsuccessful. It seems that there are two possible passwords: the one set during install, and ''. To enter mysql I just hit Enter, so it would seem that the password is ''.

What am I missing here?

Thanks!

HTTPError 403:Forbidden or no internal links

I am from China, so as you know, Google, Facebook, and Twitter are not available here.
Many of the external links found by the crawler point to these social websites.

The code should also take the number of internal links into account, since some websites have no internal links at all, which leads to very strange behavior.
Thanks

chapter3 question

Hi,
In Chapter 3, "Collect a list of all External Links", the code in the book raises an error in the splitAddress function:

即将获取链接的URL是:/

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-4254ccc1a2c3> in <module>()
     64             getAllExternalLinks(link)
     65 
---> 66 getAllExternalLinks("http://oreilly.com")

<ipython-input-2-4254ccc1a2c3> in getAllExternalLinks(siteUrl)
     62             print("即将获取链接的URL是:" + link)
     63             allIntLinks.add(link)
---> 64             getAllExternalLinks(link)
     65 
     66 getAllExternalLinks("http://oreilly.com")

<ipython-input-2-4254ccc1a2c3> in getAllExternalLinks(siteUrl)
     62             print("即将获取链接的URL是:" + link)
     63             allIntLinks.add(link)
---> 64             getAllExternalLinks(link)
     65 
     66 getAllExternalLinks("http://oreilly.com")

<ipython-input-2-4254ccc1a2c3> in getAllExternalLinks(siteUrl)
     62             print("即将获取链接的URL是:" + link)
     63             allIntLinks.add(link)
---> 64             getAllExternalLinks(link)
     65 
     66 getAllExternalLinks("http://oreilly.com")

<ipython-input-2-4254ccc1a2c3> in getAllExternalLinks(siteUrl)
     49 allIntLinks = set()
     50 def getAllExternalLinks(siteUrl):
---> 51     html = urlopen(siteUrl)
     52     bs = BeautifulSoup(html,"html.parser")
     53     internalLinks = getInternalLinks(bs,splitAddress(siteUrl)[0])

~/anaconda3/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    221     else:
    222         opener = _opener
--> 223     return opener.open(url, data, timeout)
    224 
    225 def install_opener(opener):

~/anaconda3/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
    509         # accept a URL or a Request object
    510         if isinstance(fullurl, str):
--> 511             req = Request(fullurl, data)
    512         else:
    513             req = fullurl

~/anaconda3/lib/python3.6/urllib/request.py in __init__(self, url, data, headers, origin_req_host, unverifiable, method)
    327                  origin_req_host=None, unverifiable=False,
    328                  method=None):
--> 329         self.full_url = url
    330         self.headers = {}
    331         self.unredirected_hdrs = {}

~/anaconda3/lib/python3.6/urllib/request.py in full_url(self, url)
    353         self._full_url = unwrap(url)
    354         self._full_url, self.fragment = splittag(self._full_url)
--> 355         self._parse()
    356 
    357     @full_url.deleter

~/anaconda3/lib/python3.6/urllib/request.py in _parse(self)
    382         self.type, rest = splittype(self._full_url)
    383         if self.type is None:
--> 384             raise ValueError("unknown url type: %r" % self.full_url)
    385         self.host, self.selector = splithost(rest)
    386         if self.host:

ValueError: unknown url type: '/'

Using the code you provide on GitHub, I get a different error:

Traceback (most recent call last):
  File "/home/kongnian/PycharmProjects/Scraping/getAllExternalLinks.py", line 81, in <module>
    getAllExternalLinks("http://oreilly.com")
  File "/home/kongnian/PycharmProjects/Scraping/getAllExternalLinks.py", line 76, in getAllExternalLinks
    getAllExternalLinks(link)
  File "/home/kongnian/PycharmProjects/Scraping/getAllExternalLinks.py", line 76, in getAllExternalLinks
    getAllExternalLinks(link)
  File "/home/kongnian/PycharmProjects/Scraping/getAllExternalLinks.py", line 76, in getAllExternalLinks
    getAllExternalLinks(link)
  [Previous line repeated 15 more times]
  File "/home/kongnian/PycharmProjects/Scraping/getAllExternalLinks.py", line 63, in getAllExternalLinks
    html = urlopen(siteUrl)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 564, in error
    result = self._call_chain(*args)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 756, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/home/kongnian/anaconda3/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Is this the end of the collection?

thanks!
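A likely cause of the first traceback: the crawl eventually hands a relative link such as '/' straight to urlopen, which cannot parse a bare path. A common fix (my sketch, not the book's exact code) is to resolve every link against the page it was found on with urljoin before opening it; the later HTTP Error 404 is just a dead link, which a try/except around urlopen can skip so the crawl keeps going.

from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def fetch(base_url, link):
    # Turn relative links like '/' or '/about' into absolute URLs first.
    absolute = urljoin(base_url, link)
    try:
        return BeautifulSoup(urlopen(absolute), 'html.parser')
    except HTTPError:
        # Dead external link (e.g. 404): skip it instead of crashing the crawl.
        return None

bs = fetch('http://oreilly.com', '/')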

Can't login to Yahoo! for scraping

I'm going through the 2nd ed. of the book now and it's great. I've spent hours upon hours trying to log into Yahoo! with a POST request but I'm being thwarted. First, the program throws a TooManyRedirects error. When I add the keyword arg of allow_redirects=False, apparently I am being redirected anyway to a site with no content:

Output of response_obj.text:
'<p>Found. Redirecting to <a href="https://guce.yahoo.com/consent?gcrumb=F12tPO4&amp;trapType=login&amp;done=https%3A%2F%2Fwww.yahoo.com%2F&amp;src=">https://guce.yahoo.com/consent?gcrumb=F12tPO4&amp;trapType=login&amp;done=https%3A%2F%2Fwww.yahoo.com%2F&amp;src=</a></p>'

I am passing my browser headers and just about every other piece of data I can identify under normal login circumstances along with the request.
If anyone can successfully log into Yahoo, please spread the knowledge!

Typo on page 221 of the book

The line

In the second scenario, the load your Internet connection and home machine can place on a site like Wikipedia....

Should be

In the third scenario, the load your Internet connection and home machine can place on a site like Wikipedia....

CAPTCHA sample available?

Hi, I'm editing the Korean translation of the book.
In Chapter 11, readers are supposed to obtain hundreds of CAPTCHA images,
but there's no explanation of how or where to get those images.
Could you provide the images you used when writing the book?
Please help me! Thank you.

[Question] How to use splitAddress

Hello.
On page 65 of "Getting Started with Crawling" there is:

splitAddress(startingPage)[0]

What is the [0] used for here?

Thank you.

UserWarning: No parser was explicitly specified

All instances of BeautifulSoup([your markup]) need to be updated to BeautifulSoup([your markup], "html.parser").

For example, the current usage bsObj = BeautifulSoup(html)
is fixed as bsObj = BeautifulSoup(html, "html.parser")

Full error being returned:

bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 5 of the file 5-findParents.py. To get rid of this warning, change code that looks like this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "html.parser")

markup_type=markup_type))

chapter3

No "getNextExternalLink" method

ModuleNotFoundError: No module named 'stem'

I installed stem with pip install stem and also with conda install -c conda-forge stem.

If I run pip freeze in the terminal, stem shows up at its latest version, 1.7.1, but when I try to import it in my code I get:
ModuleNotFoundError: No module named 'stem'

I've tried installing and uninstalling stem versions 1.6.0 and 1.7.0, but it doesn't work!
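When pip freeze lists a package but the import still fails, the usual culprit is that pip and the interpreter running your script belong to different environments (for instance the conda environment versus the system Python). A minimal check that makes no assumptions about your setup:

import sys

# The interpreter actually executing this script; compare it with the Python
# that `pip --version` / `pip freeze` reports, and with `conda env list`.
print(sys.executable)
for path in sys.path:
    print(path)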
