psf / requests-html

Pythonic HTML Parsing for Humans™

Home Page: http://html.python-requests.org

License: MIT License

Python 99.65% Makefile 0.35%
beautifulsoup css-selectors html http kennethreitz lxml pyquery python requests scraping

requests-html's People

Contributors

aldridgexia, angaz, bkcsfi, bonfy, chyroc, clach04, clarksun, frostming, gadgetsteve, guptarohit, hugovk, isudox, kennethreitz, kylepjohnson, little-stoner, miyakogi, naelsondouglas, norbinsh, oldani, pigna90, rachmadaniharyono, sarcastic-pharm, sithumranabahu, surister, timgates42, timofurrer, timotk, toddrme2178, tvytlx, vishalsodani


requests-html's Issues

coupling with scrapy

Hello, your project is very interesting. I wanted to know whether it is possible to use it with the Scrapy library.

Thank you for your help 😊

"certificate verify failed" when downloading Chromium on macOS

Just started experimenting with the library and encountered an error.
I am running Python 3.6.4 on macOS High Sierra (10.13.3).

I have tried to execute the following code:

from requests_html import HTMLSession

session = HTMLSession()

r = session.get('https://python.org')
r.html.render()

And I get the following error as a result.

[W:pyppeteer.chromium_downloader] start chromium download.
Download may take a few minutes.
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1400, in connect
    server_hostname=server_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 407, in wrap_socket
    _context=self, _session=session)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 814, in __init__
    self.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 1068, in do_handshake
    self._sslobj.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/ssl.py", line 689, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/anton/Development/Local/Drafts/scraping.py", line 6, in <module>
    r.html.render()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests_html.py", line 416, in render
    content, result = loop.run_until_complete(_async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests_html.py", line 373, in _async_render
    browser = pyppeteer.launch(headless=True)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyppeteer/launcher.py", line 161, in launch
    return Launcher(options, **kwargs).launch()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyppeteer/launcher.py", line 87, in __init__
    download_chromium()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyppeteer/chromium_downloader.py", line 94, in download_chromium
    extract_zip(download_zip(get_url()), DOWNLOADS_FOLDER / REVISION)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyppeteer/chromium_downloader.py", line 58, in download_zip
    with request.urlopen(url) as f:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)>
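
A common workaround on macOS (an editorial note, not a reply from the thread): the python.org installer ships its own OpenSSL and does not use the system keychain, so either run the bundled "Install Certificates.command" once, or point the standard library at certifi's CA bundle before the Chromium download is triggered. A minimal sketch, assuming certifi is installed:

import os
import certifi

# urllib builds its default SSL context from SSL_CERT_FILE if it is set,
# so this must happen before anything opens an HTTPS connection.
os.environ['SSL_CERT_FILE'] = certifi.where()

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://python.org')
r.html.render()  # the Chromium download now verifies against certifi's bundle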

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 89: invalid continuation byte

Windows 7, Python 3.6.3, requests-html 0.7.2

from requests_html import HTMLSession, HTML

session = HTMLSession()

doc = """

This get's replaced
This get's added to:
<script type="text/javascript">

function addText() {

document.getElementById("add").append(" Text");

}

document.getElementById("replace").innerHTML = "";

</script>

"""

html = HTML(html=doc)

html.render()

print(html.html)

Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/1.py", line 23, in <module>
    html.render()
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests_html.py", line 416, in render
    content, result = loop.run_until_complete(_async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout))
  File "C:\ProgramData\Anaconda3\lib\asyncio\base_events.py", line 467, in run_until_complete
    return future.result()
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests_html.py", line 373, in _async_render
    browser = pyppeteer.launch(headless=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyppeteer\launcher.py", line 161, in launch
    return Launcher(options, **kwargs).launch()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyppeteer\launcher.py", line 129, in launch
    msg = self.proc.stdout.readline().decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 89: invalid continuation byte

Spanish translation

I'm interested in beginning to translate the project into Spanish, and I was wondering if I could start opening pull requests for the files. Many thanks in advance!

Why xpath("//a/@href") can't work well ?

In [11]: session.get("https://www.python.org/").html.xpath("//a/@href")
Out[11]: ---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/usr/lib/python3.6/site-packages/IPython/core/formatters.py in call(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()

/usr/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
378 if cls in self.type_pprinters:
379 # printer registered in self.type_pprinters
--> 380 return self.type_pprinters[cls](obj, self, cycle)
381 else:
382 # deferred printer

/usr/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
557 p.text(',')
558 p.breakable()
--> 559 p.pretty(x)
560 if len(obj) == 1 and type(obj) is tuple:
561 # Special case for 1-item tuples.

/usr/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
393 if callable(meth):
394 return meth(obj, self, cycle)
--> 395 return _default_pprint(obj, self, cycle)
396 finally:
397 self.end_group()

/usr/lib/python3.6/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
508 if _safe_getattr(klass, '__repr__', None) is not object.__repr__:
509 # A user-provided repr. Find newlines and replace them with p.break_()
--> 510 _repr_pprint(obj, p, cycle)
511 return
512 p.begin_group(1, '<')

/usr/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
699 """A pprint that just redirects to the normal repr function."""
700 # Find newlines and replace them with p.break_()
--> 701 output = repr(obj)
702 for idx,output_line in enumerate(output.splitlines()):
703 if idx:

/usr/lib/python3.6/site-packages/requests_html.py in __repr__(self)
194 def __repr__(self) -> str:
195 attrs = []
--> 196 for attr in self.attrs:
197 attrs.append('{}={}'.format(attr, repr(self.attrs[attr])))
198

/usr/lib/python3.6/site-packages/requests_html.py in attrs(self)
202 def attrs(self) -> dict:
203 """Returns a dictionary of the attributes of the class:Element <Element>."""
--> 204 attrs = {k: self.pq.attr[k].strip() for k in self.element.keys()}
205
206 # Split class up, as there are ussually many of them:

AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'keys'
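
For context (an observation on the traceback, not a reply from the thread): //a/@href selects attribute values, which lxml returns as _ElementUnicodeResult strings rather than element nodes, so requests-html's Element wrapper fails when its repr calls .keys(). A sketch of two workarounds:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.python.org/')

# Select the element nodes instead of the attribute strings:
hrefs = [a.attrs['href'] for a in r.html.xpath('//a') if 'href' in a.attrs]

# Or use the built-in helpers, which already collect anchor hrefs:
links = r.html.links
absolute = r.html.absolute_links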

Relative to absolute links in markdown output

Hi Kenneth,

I was just referred to this lib and I must say it's just as awesome as everything for Humans 👍

Since we already have the capability to transform relative links to absolute ones according to the readme, would it be possible to swap out all of the relative links in the generated markdown of an element for absolute ones as well (this would be a sensible default)?

Cheers

Would you consider supporting local HTML files?

One of my uses of BeautifulSoup is parsing a file that takes a few minutes to get generated on the server, so I tend to download it and work on it locally, something like:

with open('report-filtered.html') as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

From a quick look at the HTML class this would be quite straightforward to support: we'd set self.html to the content of that file, and self.url would be None. The result is that absolute_links wouldn't work (unless a URL was passed in too), and perhaps the __repr__ would be more useful with knowledge of a filename too. Maybe the best way would be to set self.url to the filename and feature-gate making links absolute? Obviously there are a few things to think about implementation-wise, but I'd be happy to put the PR together if there's interest. I wanted to check first, though, whether that's something you'd consider/be interested in! :)
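
For reference, a sketch of the workaround available today (assuming a version whose HTML constructor accepts html=, and passing a base url so absolute_links can work):

from requests_html import HTML

with open('report-filtered.html') as html_file:
    html = HTML(html=html_file.read(), url='https://example.org/report/')

print(html.find('title', first=True).text)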

TypeError when opening site

When running the following code I'm getting the error:
"TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'"

import requests_html
session = requests_html.Session()
r = session.get("https://www.facebook.com")

Full Stacktrace:

/usr/lib/python3.6/site-packages/requests/sessions.py in get(self, url, **kwargs)
    499
    500         kwargs.setdefault('allow_redirects', True)
--> 501         return self.request('GET', url, **kwargs)
    502
    503     def options(self, url, **kwargs):

/usr/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    486         }
    487         send_kwargs.update(settings)
--> 488         resp = self.send(prep, **send_kwargs)
    489
    490         return resp

/usr/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
    613
    614         # Response manipulation hooks
--> 615         r = dispatch_hook('response', hooks, r, **kwargs)
    616
    617         # Persist cookies

/usr/lib/python3.6/site-packages/requests/hooks.py in dispatch_hook(key, hooks, hook_data, **kwargs)
     29             hooks = [hooks]
     30         for hook in hooks:
---> 31             _hook_data = hook(hook_data, **kwargs)
     32             if _hook_data is not None:
     33                 hook_data = _hook_data

/usr/lib/python3.6/site-packages/requests_html.py in _handle_response(response, **kwargs)
    205         """
    206
--> 207         response.html = HTML(response=response)
    208         return response
    209

/usr/lib/python3.6/site-packages/requests_html.py in __init__(self, response)
    156     def __init__(self, *, response):
    157         super(HTML, self).__init__(
--> 158             element=fromstring(response.text),
    159             html=response.text,
    160             url=response.url

/usr/lib/python3.6/site-packages/lxml/html/soupparser.py in fromstring(data, beautifulsoup, makeelement, **bsargs)
     31     used.
     32     """
---> 33     return _parse(data, beautifulsoup, makeelement, **bsargs)
     34
     35

/usr/lib/python3.6/site-packages/lxml/html/soupparser.py in _parse(source, beautifulsoup, makeelement, **bsargs)
     77             bsargs['features'] = ['html.parser']  # use Python html parser
     78     tree = beautifulsoup(source, **bsargs)
---> 79     root = _convert_tree(tree, makeelement)
     80     # from ET: wrap the document in a html root element, if necessary
     81     if len(root) == 1 and root[0].tag == "html":

/usr/lib/python3.6/site-packages/lxml/html/soupparser.py in _convert_tree(beautiful_soup_tree, makeelement)
    131     # and body elements.
    132     pre_root = beautiful_soup_tree.contents[:first_element_idx]
--> 133     roots = beautiful_soup_tree.contents[first_element_idx:last_element_idx+1]
    134     post_root = beautiful_soup_tree.contents[last_element_idx+1:]
    135

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

Is it possible to extract computed css for a given element?

For example, when we need to scrape an image displayed via the background-image CSS property, it would be nice to have the possibility of extracting a CSS dictionary for a given element, something like:

about = r.html.find('#about', first=True)
image = about.css('background-image')
# or 
image = about.css()['background-image']
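
One possible route with the current API (a sketch, not an existing .css() method): since render() can evaluate JavaScript and return its result, the browser can be asked for the computed style directly. The #about selector is taken from the example above; the target URL is hypothetical:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.org')

script = """
() => {
    const el = document.querySelector('#about');
    return getComputedStyle(el).getPropertyValue('background-image');
}
"""
background_image = r.html.render(script=script)  # render returns the script's value
print(background_image)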

Scraper throws error instead of pulling values from a webpage

I've written a script in Python to get the price of the last trade from a JavaScript-rendered webpage. I can get the content if I choose to go with Selenium. My goal here is not to use any browser simulator, because the latest release of Requests-HTML is supposed to be able to parse JavaScript-generated content. However, I have not been able to get it working.

import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.gdax.com/trade/LTC-EUR')
    js = r.html.render()
    item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
    print(item)

When I execute the script I get the following error (partial traceback):

Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\new_line_one.py", line 27, in <module>
    item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
AttributeError: 'NoneType' object has no attribute 'find'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\lib\shutil.py", line 381, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 5] Access is denied:
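
A likely explanation (an editorial note, not from the thread): render() mutates r.html in place and returns None unless a script= is passed, so .find() must be called on r.html after rendering. A corrected sketch:

import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.gdax.com/trade/LTC-EUR')
    r.html.render()  # renders in place; the return value is None here
    item = r.html.find('.MarketInfo_market-num_1lAXs', first=True)
    if item is not None:  # the node can still be absent if the JS hasn't run yet
        print(item.text)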

unreadable result

This is my code. It runs successfully, except that the result is unreadable.

I don't know how to fix it. Can anyone help?

import requests_html
session = requests_html.Session()

r = session.get("http://www.dyxia.com/")
r.encoding = "utf-8"

lb = r.html.find(".leibie h2")

print(lb[1].text)
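
A possible cause (an assumption, not confirmed in the thread): r.html is built in a response hook before r.encoding is reassigned, so the late assignment has no effect. Re-parsing the raw bytes with an explicit encoding is one workaround, assuming HTML accepts bytes plus a default_encoding:

import requests_html
from requests_html import HTML

session = requests_html.Session()
r = session.get('http://www.dyxia.com/')

# Re-parse the raw bytes with the encoding the site actually uses
# (utf-8 as in the report; older Chinese sites often need gbk instead):
html = HTML(url=r.url, html=r.content, default_encoding='utf-8')
lb = html.find('.leibie h2')
print(lb[1].text)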

How to install?

Dump from my terminal:

(py36) ❯ pip install requests-http
Collecting requests-http
  Could not find a version that satisfies the requirement requests-http (from versions: )
No matching distribution found for requests-http

I prefer conda environments to pipenv, so it would be nice if I could install using pip.
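
(An editorial note: the transcript installs requests-http, but the package is named requests-html, so pip install requests-html works in any environment, conda included.)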

SyntaxError: invalid syntax

import requests_html

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import requests_html
  File "C:\Python27\lib\site-packages\requests_html.py", line 20
    def __init__(self, *, element, html=None, url):
                        ^
SyntaxError: invalid syntax
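
(An editorial note: the path C:\Python27 shows Python 2.7, which cannot parse keyword-only arguments such as def __init__(self, *, element, ...); requests-html requires Python 3.6+.)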

Suggestion: Consistent content of html with CJK

Details of suggestion

In the latest (not yet released) version, requests_html.HTML.html works fine, but requests_html.Element.html behaves oddly with CJK text. I think the cause is that the html property calls etree.tostring() with its default ASCII encoding.

Problem

>>> import requests_html
>>> doc = '<div>한글<p>한글CJK</p></div>'
>>> page = requests_html.HTML(html=doc, url='fake_url')
>>> page.html
'<div>한글<p>한글CJK</p></div>'
>>> div = page.find('div', first=True)
>>> div
<Element 'div' >
>>> div.html
'<div>&#54620;&#44544;<p>&#54620;&#44544;CJK</p></div>'
>>> div.text
'한글\n한글CJK'

Expected behaviour

>>> doc = '<div>한글<p>한글CJK</p></div>'
>>> page = requests_html.HTML(html=doc, url='fake_url')
>>> page.html
'<div>한글<p>한글CJK</p></div>'
>>> div = page.find('div', first=True)
>>> div.html
'<div>한글<p>한글CJK</p></div>'

I'll work on it to fix this issue.
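
A minimal sketch of the suspected fix (assuming the html property wraps lxml directly): request unicode output from etree.tostring instead of the default ASCII-escaped bytes:

from lxml import etree

element = etree.fromstring('<div>한글<p>한글CJK</p></div>')

print(etree.tostring(element))                      # b'<div>&#54620;...' (escaped)
print(etree.tostring(element, encoding='unicode'))  # '<div>한글<p>한글CJK</p></div>'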

Getting error: pyppeteer.errors.BrowserError: Unexpectedly chrome process closed with return code: 1

r = session.get('http://python-requests.org/')
r.html.render()

[W:pyppeteer.chromium_downloader] start chromium download.
Download may take a few minutes.
[W:pyppeteer.chromium_downloader] chromium download done.
[W:pyppeteer.chromium_downloader] chromium extracted to: /home/hadn/.pyppeteer/local-chromium/533271
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/website/flask_miguel/venv/lib64/python3.6/site-packages/requests_html.py", line 282, in render
    content, result = loop.run_until_complete(_async_render(url=self.url, script=script, sleep=sleep, scrolldown=scrolldown))
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 467, in run_until_complete
    return future.result()
  File "/mnt/website/flask_miguel/venv/lib64/python3.6/site-packages/requests_html.py", line 250, in _async_render
    browser = pyppeteer.launch(headless=True)
  File "/mnt/website/flask_miguel/venv/lib64/python3.6/site-packages/pyppeteer/launcher.py", line 146, in launch
    return Launcher(options, **kwargs).launch()
  File "/mnt/website/flask_miguel/venv/lib64/python3.6/site-packages/pyppeteer/launcher.py", line 111, in launch
    raise BrowserError('Unexpectedly chrome process closed with '
pyppeteer.errors.BrowserError: Unexpectedly chrome process closed with return code: 1

pyppeteer version: 0.0.10
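
(An editorial note, not from the thread: on minimal Linux hosts this return code often means the downloaded Chromium is missing shared libraries such as libnss3 or libXss; launching the binary under ~/.pyppeteer/local-chromium/ directly from a shell usually prints which one is missing.)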

UnicodeDecodeError: 'utf-8' codec can't decode byte

First of all I wanted to say thanks for the library. I'm starting to drop requests + BeautifulSoup in favor of this.

I am extracting information about Japan Airlines from the source below.
Finding the desired element with the find method works, but not with xpath. With the xpath selector, an error of the following form is raised:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8c in position 4467: invalid start byte

Source example:

from requests_html import HTMLSession

BASE_URL = 'http://press.jal.co.jp/en/financial/'
session = HTMLSession()

r = session.get(BASE_URL)

# this one works
last_page = r.html.find('.pager-wrap a')[-2].text

# this one does not
last_page = r.html.xpath('*//div[@class="pager-wrap"]//a')[-2].text

SyntaxError: invalid syntax

Traceback (most recent call last):
  File "/Users/shashwatbishwen/Documents/whatsapp_search_CNG/scrape_1.py", line 1, in <module>
    from requests_html import HTMLSession
  File "/Library/Python/2.7/site-packages/requests_html.py", line 20
    def __init__(self, *, element, html=None, url):
                        ^
SyntaxError: invalid syntax

Process finished with exit code 1

requests_html.HTML :- TypeError: __init__() missing 1 required keyword-only argument: 'url'

When I try to use this example,

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""

>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}

I get the following error

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() missing 1 required keyword-only argument: 'url'
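
A hedged workaround for affected versions: pass any base URL explicitly, since url only became optional in later releases:

from requests_html import HTML

doc = """<a href='https://httpbin.org'>"""
html = HTML(html=doc, url='https://example.org')  # arbitrary base URL
print(html.links)  # {'https://httpbin.org'}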

Get sibling node

Is it possible to get sibling nodes, etc. with this? Such as BeautifulSoup's next_sibling method.
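
One route today (an observation about the wrapper, not a documented API): each Element exposes its underlying lxml node as .element, so lxml's sibling navigation is available. A sketch:

from requests_html import HTML

html = HTML(html='<ul><li>a</li><li>b</li></ul>', url='https://example.org')
first = html.find('li', first=True)
sibling = first.element.getnext()  # lxml's next-sibling navigation
print(sibling.text)  # 'b'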

Render w/o request doesn't execute inline JS

This lib looks great, thanks :)...
Just a note, I was expecting:

doc = """<a href='https://httpbin.org'>"""
html = HTML(html=doc)
html.render()
html.html

to output : <a href='https://httpbin.org'>

Instead I get the content from example.org, which is the default url.

How can I set the html content and then render it? I can't seem to pass it to:

doc = """<a href='https://httpbin.org'>"""
html = HTML(html=doc)
html.render(script=doc)
html.html

either, as I get an:

BrowserError: Evaluation failed: SyntaxError: Unexpected token <
pageFunction:
<a href='https://httpbin.org'>

I could set the url to the local file and patch it in, but that solution seems lacking.

concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://python-requests.org')
r.html.render()
Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\ProgramData\Anaconda3\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\ProgramData\Anaconda3\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\asyncio\events.py", line 127, in _run
    self._callback(*self._args)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
    self._timeout)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
    raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded

pyppeteer (0.0.10)
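
A hedged suggestion, not a confirmed fix: render() in most versions accepts timing parameters (see the timeout argument in the tracebacks above), so the navigation budget can be raised on slow connections:

r.html.render(timeout=30)  # give navigation more time than the default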

AttributeError: 'str' object has no attribute 'decode'

I'm using Python 3.6.3 and following the documented usage example in the repository.
After installing requests-html (pip3 install requests-html), I run the code below:

>>> import requests_html
>>> session = requests_html.Session()

>>> r = session.get('https://python.org/')
>>> r.html.links
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/requests_html.py", line 129, in links
    return set(g for g in gen())
  File "/usr/local/lib/python3.6/dist-packages/requests_html.py", line 129, in <genexpr>
    return set(g for g in gen())
  File "/usr/local/lib/python3.6/dist-packages/requests_html.py", line 120, in gen
    for link in self.find('a'):
  File "/usr/local/lib/python3.6/dist-packages/requests_html.py", line 85, in find
    c = [g for g in gen()]
  File "/usr/local/lib/python3.6/dist-packages/requests_html.py", line 85, in <listcomp>
    c = [g for g in gen()]
  File "/usr/local/lib/python3.6/dist-packages/requests_html.py", line 83, in gen
    yield Element(element=found, url=self.url, default_encoding=_encoding or self.encoding)
  File "/usr/local/lib/python3.6/dist-packages/requests_html.py", line 53, in encoding
    self._encoding = html_to_unicode(self.default_encoding, self.html)[0]
  File "/usr/local/lib/python3.6/dist-packages/w3lib/encoding.py", line 273, in html_to_unicode
    return enc, to_unicode(html_body_str, enc)
  File "/usr/local/lib/python3.6/dist-packages/w3lib/encoding.py", line 185, in to_unicode
    return data_str.decode(encoding, 'replace' if version_info[0:2] >= (3, 3) else 'w3lib_replace')
AttributeError: 'str' object has no attribute 'decode'
>>>

An error occurred.
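
(An editorial note, lightly hedged: this appears to be a compatibility problem between early requests-html releases and w3lib's to_unicode, which expects bytes here but received str; upgrading requests-html, whose encoding handling was later reworked, is the usual fix.)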

Unused import

The first line, from tempfile import TemporaryFile, is unused.

concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded

>>> r = session.get('http://python-requests.org')
>>> r.html.render()
Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at /usr/local/lib/python3.6/dist-packages/pyppeteer/navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at /usr/local/lib/python3.6/dist-packages/pyppeteer/navigator_watcher.py:49>
Traceback (most recent call last):
  File "/usr/lib/python3.6/asyncio/events.py", line 127, in _run
    self._callback(*self._args)
  File "/usr/local/lib/python3.6/dist-packages/pyppeteer/navigator_watcher.py", line 52, in watchdog_cb
    self._timeout)
  File "/usr/local/lib/python3.6/dist-packages/pyppeteer/navigator_watcher.py", line 40, in _raise_error
    raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded

I get this error when running the example usage; I can't figure out whether it's my internet connection or a pyppeteer installation problem.

ValueError: Invalid PI name 'b'xml''

The issue is probably caused by lxml. Here's the I/O:

>>> import requests_html
>>> sess = requests_html.Session()
>>> r = sess.get("http://twitter.com")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 521, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 640, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 640, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 218, in resolve_redirects
    **adapter_kwargs
  File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 625, in send
    r = dispatch_hook('response', hooks, r, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests/hooks.py", line 31, in dispatch_hook
    _hook_data = hook(hook_data, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/requests_html.py", line 245, in _handle_response
    response.html = HTML(url=response.url, html=response.text, default_encoding=response.encoding)
  File "/usr/local/lib/python3.6/site-packages/requests_html.py", line 202, in __init__
    element=fromstring(html),
  File "/usr/local/lib/python3.6/site-packages/lxml/html/soupparser.py", line 33, in fromstring
    return _parse(data, beautifulsoup, makeelement, **bsargs)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/soupparser.py", line 79, in _parse
    root = _convert_tree(tree, makeelement)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/soupparser.py", line 152, in _convert_tree
    res_root = convert_node(html_root)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/soupparser.py", line 216, in convert_node
    return handler(bs_node, parent)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/soupparser.py", line 255, in convert_tag
    handler(child, res)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/soupparser.py", line 255, in convert_tag
    handler(child, res)
  File "/usr/local/lib/python3.6/site-packages/lxml/html/soupparser.py", line 255, in convert_tag
    handler(child, res)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.6/site-packages/lxml/html/soupparser.py", line 273, in convert_pi
    res = etree.ProcessingInstruction(*bs_node.split(' ', 1))
  File "src/lxml/etree.pyx", line 3056, in lxml.etree.ProcessingInstruction (src/lxml/etree.c:79300)
ValueError: Invalid PI name 'b'xml''

Python version: 3.6.4

Installation fails

Hi! I just heard about this project and wanted to try the package, but I was not able to install it following the instructions in the readme.

$ pipenv install requests-html
Installing requests-html…
Collecting requests-html
  Using cached requests_html-0.6.9-py2.py3-none-any.whl
Collecting pyppeteer (from requests-html)
  Using cached pyppeteer-0.0.10.tar.gz
Collecting requests (from requests-html)
  Using cached requests-2.18.4-py2.py3-none-any.whl
Collecting bs4 (from requests-html)
  Using cached bs4-0.0.1.tar.gz
Collecting fake-useragent (from requests-html)
  Using cached fake-useragent-0.1.10.tar.gz
Collecting pyquery (from requests-html)
  Using cached pyquery-1.4.0-py2.py3-none-any.whl
Collecting parse (from requests-html)
  Using cached parse-1.8.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/4_/3m18vsgn6tx2gwh1n_g78tv40000gn/T/pip-build-v4xjk72f/parse/setup.py", line 10, in <module>
        f.write(__doc__)
    TypeError: write() argument must be str, not None

    ----------------------------------------

Adding requests-html to Pipfile's [packages]…
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Traceback (most recent call last):
  File "/Users/victor/Library/Python/3.6/bin/pipenv", line 11, in <module>
    load_entry_point('pipenv==9.0.3', 'console_scripts', 'pipenv')()
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/vendor/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/vendor/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/vendor/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/vendor/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/vendor/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/cli.py", line 1934, in install
    do_lock(system=system, pre=pre)
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/cli.py", line 1102, in do_lock
    pre=pre
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/utils.py", line 545, in resolve_deps
    resolved_tree = actually_resolve_reps(deps, index_lookup, markers_lookup, project, sources, verbose, clear, pre)
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/utils.py", line 507, in actually_resolve_reps
    resolved_tree.update(resolver.resolve(max_rounds=PIPENV_MAX_ROUNDS))
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/patched/piptools/resolver.py", line 102, in resolve
    has_changed, best_matches = self._resolve_one_round()
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/patched/piptools/resolver.py", line 200, in _resolve_one_round
    for dep in self._iter_dependencies(best_match):
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/patched/piptools/resolver.py", line 296, in _iter_dependencies
    dependencies = self.repository.get_dependencies(ireq)
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/patched/piptools/repositories/pypi.py", line 153, in get_dependencies
    result = reqset._prepare_file(self.finder, ireq)
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/patched/pip/req/req_set.py", line 639, in _prepare_file
    abstract_dist.prep_for_dist()
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/patched/pip/req/req_set.py", line 134, in prep_for_dist
    self.req_to_install.run_egg_info()
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/patched/pip/req/req_install.py", line 438, in run_egg_info
    command_desc='python setup.py egg_info')
  File "/Users/victor/Library/Python/3.6/lib/python/site-packages/pipenv/patched/pip/utils/__init__.py", line 707, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip.exceptions.InstallationError: Command "python setup.py egg_info" failed with error code 1 in /var/folders/4_/3m18vsgn6tx2gwh1n_g78tv40000gn/T/tmpw_18ya1sbuild/parse/

I also tried to pip install it, but without success:

$ pip3.6 install requests-html
Collecting requests-html
  Using cached requests_html-0.6.9-py2.py3-none-any.whl
Collecting w3lib (from requests-html)
  Using cached w3lib-1.19.0-py2.py3-none-any.whl
Collecting parse (from requests-html)
  Using cached parse-1.8.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/4_/3m18vsgn6tx2gwh1n_g78tv40000gn/T/pip-build-cw83qyfo/parse/setup.py", line 10, in <module>
        f.write(__doc__)
    TypeError: write() argument must be str, not None

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/4_/3m18vsgn6tx2gwh1n_g78tv40000gn/T/pip-build-cw83qyfo/parse/

Any tips?
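
(An editorial note, lightly hedged: the failure is inside the parse dependency's setup.py, where f.write(__doc__) receives None; that happens when Python strips docstrings, e.g. under PYTHONOPTIMIZE/-OO. Unsetting that option or pulling a newer parse release typically gets past it.)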

install requests-html[browser] failed

Hi, when I try to install requests-html[browser], both pip3 install requests-html[browser] and pipenv install requests-html[browser] fail. The error message is no matches found: requests-html[browser].
But pip3 install requests-html succeeds. What should I do?
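
(An editorial note: no matches found is zsh expanding the square brackets as a glob before pip ever runs; quoting the requirement avoids it.)

$ pip3 install 'requests-html[browser]'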

Remove Markdown rendering

While it might sometimes come in handy to render an element as Markdown, I would prefer it if requests-html stayed lightweight and focused on one clear problem, without throwing in too many nice-to-have features.

doesn't work w/ python3.7b2+ (today's)

from requests_html import HTMLSession
...
  File "/usr/local/lib/python3.7/site-packages/websockets/client.py", line 20, in <module>
    from .protocol import WebSocketCommonProtocol

  File "/usr/local/lib/python3.7/site-packages/websockets/protocol.py", line 18, in <module>
    from .compatibility import asyncio_ensure_future

  File "/usr/local/lib/python3.7/site-packages/websockets/compatibility.py", line 15
    asyncio_ensure_future = asyncio.async           # Python < 3.5
                                        ^
SyntaxError: invalid syntax

Posted here for tracking purposes... (seems to import into 3.6 fine)
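
(Context: async became a reserved keyword in Python 3.7, so the asyncio.async compatibility shim in older websockets releases no longer parses; newer websockets versions removed it.)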

Installation failed: UnicodeDecodeError

Hello, I had a problem when trying to install the package. The following is my environment.

  • Python 3.5.2 :: Anaconda custom (64-bit)
  • Windows 10
  • pipenv, version 11.0.1

Using pipenv install requests-html to install the package. The error is

Installing requests-html…
Collecting requests-html
  Using cached requests_html-0.6.6-py2.py3-none-any.whl
Collecting bs4 (from requests-html)
  Downloading bs4-0.0.1.tar.gz
Collecting pyppeteer (from requests-html)
  Using cached pyppeteer-0.0.12.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\secsi\AppData\Local\Temp\pip-build-e_70uy7c\pyppeteer\setup.py", line 27, in <module>
        compile_files(in_dir, out_dir, target)
      File "c:\users\secsi\.virtualenvs\lianjia-bj-ijjpmtde\lib\site-packages\py_backwards\compiler.py", line 85, in compile_files
        dependencies.update(_compile_file(paths, target))
      File "c:\users\secsi\.virtualenvs\lianjia-bj-ijjpmtde\lib\site-packages\py_backwards\compiler.py", line 57, in _compile_file
        code = f.read()
    UnicodeDecodeError: 'gbk' codec can't decode byte 0xb6 in position 2147: illegal multibyte sequence

    ----------------------------------------

Error:  An error occurred while installing requests-html!
Command "python setup.py egg_info" failed with error code 1 in C:\Users\secsi\AppData\Local\Temp\pip-build-e_70uy7c\pyppeteer\

Add support for an async api of the package

When scraping at a large scale we look for performance, and asyncio improves on this; since this library already requires Python 3.6+, we could implement it without hacking around.

Since the project is already used by a lot of people, the idea is that anyone can use the package in both async and sync ways. How to support both without duplicating the codebase attracts tens of debates. Since the codebase in this case is not that large, what I think we could do is rewrite everything as async and then add wrappers to the API for sync support; users in sync mode can then use the library as normal, even though behind the scenes it runs asynchronously. I have achieved this in other projects by creating a sync function which calls the async version inside a decorator function that handles the loop and any other async plumbing.

Since this library depends on requests, which does not support async yet, I see two options if we choose the approach proposed above: keep using requests and run it in a ThreadPoolExecutor (though this won't allow truly high concurrency), or use aiohttp, whose interface is fairly similar to requests'.

@kennethreitz let me know what you think.
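
A minimal sketch of the wrapper idea (illustrative only; names are hypothetical and the network call is stubbed): implement the core as a coroutine and expose a sync facade that drives the event loop:

import asyncio

async def _async_get(url: str) -> str:
    """Hypothetical async core; real code would use aiohttp here."""
    await asyncio.sleep(0)  # stand-in for network I/O
    return '<html>%s</html>' % url

def get(url: str) -> str:
    """Sync facade: runs the async core on a private event loop."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(_async_get(url))
    finally:
        loop.close()

print(get('https://example.org'))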

AttributeError: 'str' object has no attribute 'decode'

This error occurs when running the official demo code in a pipenv --python 3.6 environment.

from requests_html import session
r = session.get('https://python.org')
r.html.links

The full error info is below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/winterfall/.local/share/virtualenvs/requests-html-demo-rIwHhsY3/lib/python3.6/site-packages/requests_html.py", line 129, in links
    return set(g for g in gen())
  File "/Users/winterfall/.local/share/virtualenvs/requests-html-demo-rIwHhsY3/lib/python3.6/site-packages/requests_html.py", line 129, in <genexpr>
    return set(g for g in gen())
  File "/Users/winterfall/.local/share/virtualenvs/requests-html-demo-rIwHhsY3/lib/python3.6/site-packages/requests_html.py", line 120, in gen
    for link in self.find('a'):
  File "/Users/winterfall/.local/share/virtualenvs/requests-html-demo-rIwHhsY3/lib/python3.6/site-packages/requests_html.py", line 85, in find
    c = [g for g in gen()]
  File "/Users/winterfall/.local/share/virtualenvs/requests-html-demo-rIwHhsY3/lib/python3.6/site-packages/requests_html.py", line 85, in <listcomp>
    c = [g for g in gen()]
  File "/Users/winterfall/.local/share/virtualenvs/requests-html-demo-rIwHhsY3/lib/python3.6/site-packages/requests_html.py", line 83, in gen
    yield Element(element=found, url=self.url, default_encoding=_encoding or self.encoding)
  File "/Users/winterfall/.local/share/virtualenvs/requests-html-demo-rIwHhsY3/lib/python3.6/site-packages/requests_html.py", line 53, in encoding
    self._encoding = html_to_unicode(self.default_encoding, self.html)[0]
  File "/Users/winterfall/.local/share/virtualenvs/requests-html-demo-rIwHhsY3/lib/python3.6/site-packages/w3lib/encoding.py", line 273, in html_to_unicode
    return enc, to_unicode(html_body_str, enc)
  File "/Users/winterfall/.local/share/virtualenvs/requests-html-demo-rIwHhsY3/lib/python3.6/site-packages/w3lib/encoding.py", line 185, in to_unicode
    return data_str.decode(encoding, 'replace' if version_info[0:2] >= (3, 3) else 'w3lib_replace')
AttributeError: 'str' object has no attribute 'decode'

AttributeError: 'str' object has no attribute 'decode'

Here's what I've done:

>>> import requests_html
>>> session = requests_html.Session()
>>> r = session.get("http://google.com")
>>> r.html.find('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/requests_html.py", line 85, in find
    c = [g for g in gen()]
  File "/usr/local/lib/python3.6/site-packages/requests_html.py", line 85, in <listcomp>
    c = [g for g in gen()]
  File "/usr/local/lib/python3.6/site-packages/requests_html.py", line 83, in gen
    yield Element(element=found, url=self.url, default_encoding=_encoding or self.encoding)
  File "/usr/local/lib/python3.6/site-packages/requests_html.py", line 53, in encoding
    self._encoding = html_to_unicode(self.default_encoding, self.html)[0]
  File "/usr/local/lib/python3.6/site-packages/w3lib/encoding.py", line 273, in html_to_unicode
    return enc, to_unicode(html_body_str, enc)
  File "/usr/local/lib/python3.6/site-packages/w3lib/encoding.py", line 185, in to_unicode
    return data_str.decode(encoding, 'replace' if version_info[0:2] >= (3, 3) else 'w3lib_replace')
AttributeError: 'str' object has no attribute 'decode'

Python version: 3.6.4

No module named requests_html

Hey,
I am trying to run the tutorial but I keep getting this:

(screenshot of the error)

I did pipenv install requests-html beforehand and activated it with pipenv shell

Any ideas?

Warning raised

The library raises a warning when using the BrowserSession.get and Session.get methods:

In [1]: import requests_html

In [2]: session = requests_html.BrowserSession()

In [3]: session.get('https://github.com/kennethreitz/requests-html')
/Users/allan/homeInstalled/miniconda3/envs/py36/lib/python3.6/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 11 of the file /Users/allan/homeInstalled/miniconda3/envs/py36/bin/ipython. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))

I happen to know that the same warning is raised when using the BeautifulSoup library like so:

In [5]: from bs4 import BeautifulSoup

In [6]: s = """<!DOCTYPE html>
   ...: <html>
   ...: <head>
   ...:     <title>Hej</title>
   ...: </head>
   ...: <body>
   ...: Foobar
   ...: </body>
   ...: </html>"""

In [7]: BeautifulSoup(s)
/Users/allan/homeInstalled/miniconda3/envs/py36/lib/python3.6/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 11 of the file /Users/allan/homeInstalled/miniconda3/envs/py36/bin/ipython. To get rid of this warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))
Out[7]:
<!DOCTYPE html>
<html>
<head>
<title>Hej</title>
</head>
<body>
Foobar
</body>
</html>

In [8]: BeautifulSoup(s, 'lxml')  # no warning raised
Out[8]:
<!DOCTYPE html>
<html>
<head>
<title>Hej</title>
</head>
<body>
Foobar
</body>
</html>

Hope it helps when tracking it down.

Think of HTML and Element as the same thing

There is only a little difference between HTML and Element:
Element has no links, base_url, or absolute_links.

In most cases an Element also needs those three properties, because we usually only need to follow some of the links on an HTML page.

So the Element class and the HTML class are effectively the same thing.
