The beautifulscraper from zhangqm666

BeautifulScraper

Simple wraper around BeautifulSoup for HTML parsing and urllib2 for HTTP(S) request/response handling. BeautifulScraper also overrides some of the default handlers in urllib2 in order to:

Handle cookies properly
Offer full control of included cookies
Return the actual response from the server, un-mangled and not reprocessed

Installation

# pip install beautifulscraper

# git clone git://github.com/adregner/beautifulscraper.git
# cd beautifulscraper/
# python setup.py install

Examples

Getting started is brain-dead simple.

>>> from beautifulscraper import BeautifulScraper
>>> scraper = BeautifulScraper()

Start by requesting something.

>>> body = scraper.go("https://github.com/adregner/beautifulscraper")

The response will be a plain BeautifulSoup object. See their documentation for how to use it.

>>> body.select(".repository-description")[0].text
u'\nPython web-scraping library that wraps urllib2 and BeautifulSoup\n'

The headers from the server's response are accessiable.

>>> for header, value in scraper.response_headers.items():
...     print "%s: %s" % (header, value)
... 
status: 200 OK
content-length: 36179
set-cookie: _gh_sess=BAh7BzoQX2NzcmZfdG9rZW4iMUNCOWxnbFpVd3EzOENqVk9GTUFXbDlMVUJIbGxsNEVZUFZJNiswRjhwejQ9Og9zZXNzaW9uX2lkIiUyNmQ2ODE5ZDdiZjM3MTA2N2VlZDk3Y2VlMDViYzI2OA%3D%3D--5d31df13d5c0eeb8f3cccb140392124968abc374; path=/; expires=Sat, 01-Jan-2022 00:00:00 GMT; secure; HttpOnly
strict-transport-security: max-age=2592000
connection: close
server: nginx
x-runtime: 98
etag: "1c595b5b6a25eb7f021e68c3476d61da"
cache-control: private, max-age=0, must-revalidate
date: Wed, 31 Oct 2012 02:14:08 GMT
x-frame-options: deny
content-type: text/html; charset=utf-8

So is the response code as an integer.

>>> type(scraper.response_code), scraper.response_code
(<type 'int'>, 200)

The scraper will keep track of all cookies it sees via the cookielib.CookieJar class. You can read the cookies if you'd like. The Cookie object's are just a collection of properties.

>>> scraper.cookies[0].name
'_gh_sess'

See the pydoc for more information.

zhangqm666 / beautifulscraper Goto Github PK

beautifulscraper's Introduction

BeautifulScraper

Installation

Examples

beautifulscraper's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent