daijro / hrequests

🚀 Web scraping for humans

Home Page: https://daijro.gitbook.io/hrequests/

License: Apache License 2.0


hrequests's Introduction

hrequests


Hrequests (human requests) is a simple, configurable, feature-rich replacement for the Python requests library.

✨ Features

  • Seamless transition between HTTP and headless browsing 💻
  • Integrated fast HTML parser 🚀
  • High performance network concurrency with goroutines & gevent 🚀
  • Replication of browser TLS fingerprints 🚀
  • JavaScript rendering 🚀
  • Supports HTTP/2 🚀
  • Realistic browser header generation 🚀
  • JSON serializing up to 10x faster than the standard library 🚀

💻 Browser crawling

  • Simple & uncomplicated browser automation
  • Human-like cursor movement and typing
  • Chrome and Firefox extension support
  • Full page screenshots
  • Proxy support
  • Headless and headful support
  • No CORS restrictions

⚡ More

  • High performance ✨
  • Minimal dependence on the Python standard library
  • HTTP backend written in Go
  • Automatic gzip & brotli decode
  • Written with type safety
  • 100% threadsafe ❤️

Installation

Install via pip:

pip install -U hrequests[all]
python -m hrequests install
Or, install without headless browsing support by omitting the [all] option:

pip install -U hrequests

Documentation

For the latest stable hrequests documentation, check the Gitbook page.

  1. Simple Usage
  2. Sessions
  3. Concurrent & Lazy Requests
  4. HTML Parsing
  5. Browser Automation

Simple Usage

Here is an example of a simple get request:

>>> resp = hrequests.get('https://www.google.com/')

Requests are sent through bogdanfinn's tls-client to spoof the TLS client fingerprint. This is done automatically, and is completely transparent to the user.

Other request methods include post, put, delete, head, options, and patch.

The Response object is a near 1:1 replica of the requests.Response object, with some additional attributes.
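
For example, a POST with a JSON body (a minimal sketch, using httpbin.org as a stand-in echo endpoint):

>>> resp = hrequests.post('https://httpbin.org/post', json={'hello': 'world'})
>>> resp.status_code
200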

Parameters
Parameters:
    url (Union[str, Iterable[str]]): URL or list of URLs to request.
    data (Union[str, bytes, bytearray, dict], optional): Data to send to request. Defaults to None.
    files (Dict[str, Union[BufferedReader, tuple]], optional): Data to send to request. Defaults to None.
    headers (dict, optional): Dictionary of HTTP headers to send with the request. Defaults to None.
    params (dict, optional): Dictionary of URL parameters to append to the URL. Defaults to None.
    cookies (Union[RequestsCookieJar, dict, list], optional): Dict or CookieJar to send. Defaults to None.
    json (dict, optional): Json to send in the request body. Defaults to None.
    allow_redirects (bool, optional): Allow request to redirect. Defaults to True.
    history (bool, optional): Remember request history. Defaults to False.
    verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
    timeout (float, optional): Timeout in seconds. Defaults to 30.
    proxy (str, optional): Proxy URL. Defaults to None.
    nohup (bool, optional): Run the request in the background. Defaults to False.
    <Additionally includes all parameters from `hrequests.Session` if a session was not specified>

Returns:
    hrequests.response.Response: Response object
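
As a sketch of combining several of these parameters (httpbin.org is a stand-in endpoint; any URL works):

>>> resp = hrequests.get(
...     'https://httpbin.org/get',
...     params={'q': 'hrequests'},  # appended to the URL as ?q=hrequests
...     headers={'X-Test': '1'},
...     timeout=10,  # seconds
... )
>>> resp.ok
True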

Properties

Get the response url:

>>> resp.url: str
'https://www.google.com/'

Check if the request was successful:

>>> resp.status_code: int
200
>>> resp.reason: str
'OK'
>>> resp.ok: bool
True
>>> bool(resp)
True

Getting the response body:

>>> resp.text: str
'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><m...'
>>> resp.content: bytes
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><m...'
>>> resp.encoding: str
'UTF-8'

Parse the response body as JSON:

>>> resp.json(): Union[dict, list]
{'somedata': True}

Get the elapsed time of the request:

>>> resp.elapsed: datetime.timedelta
datetime.timedelta(microseconds=77768)

Get the response cookies:

>>> resp.cookies: RequestsCookieJar
<RequestsCookieJar[Cookie(version=0, name='1P_JAR', value='2023-07-05-20', port=None, port_specified=False, domain='.google.com', domain_specified=True...

Get the response headers:

>>> resp.headers: CaseInsensitiveDict
{'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000', 'Cache-Control': 'private, max-age=0', 'Content-Encoding': 'br', 'Content-Length': '51288', 'Content-Security-Policy-Report-Only': "object-src 'none';base-uri 'se

Sessions

Creating a new Chrome Session object:

>>> session = hrequests.Session()  # version randomized by default
>>> session = hrequests.Session('chrome', version=120)
Parameters
Parameters:
    browser (Literal['firefox', 'chrome'], optional): Browser to use. Default is 'chrome'.
    version (int, optional): Version of the browser to use. Browser must be specified. Default is randomized.
    os (Literal['win', 'mac', 'lin'], optional): OS to use in header. Default is randomized.
    headers (dict, optional): Dictionary of HTTP headers to send with the request. Default is generated from `browser` and `os`.
    verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
    timeout (float, optional): Default timeout in seconds. Defaults to 30.
    proxy (str, optional): Proxy URL. Defaults to None.
    cookies (Union[RequestsCookieJar, dict, list], optional): Cookie Jar, or cookie list/dict to send. Defaults to None.
    certificate_pinning (Dict[str, List[str]], optional): Certificate pinning. Defaults to None.
    disable_ipv6 (bool, optional): Disable IPv6. Defaults to False.
    detect_encoding (bool, optional): Detect encoding. Defaults to True.
    ja3_string (str, optional): JA3 string. Defaults to None.
    h2_settings (dict, optional): HTTP/2 settings. Defaults to None.
    additional_decode (str, optional): Decode response body with "gzip" or "br". Defaults to None.
    pseudo_header_order (list, optional): Pseudo header order. Defaults to None.
    priority_frames (list, optional): Priority frames. Defaults to None.
    header_order (list, optional): Header order. Defaults to None.
    force_http1 (bool, optional): Force HTTP/1. Defaults to False.
    catch_panics (bool, optional): Catch panics. Defaults to False.
    debug (bool, optional): Debug mode. Defaults to False.
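
As a hedged sketch combining a few of these options (the proxy URL here is a hypothetical placeholder):

>>> session = hrequests.Session(
...     'firefox',  # browser fingerprint to use
...     os='win',  # generate Windows-flavored headers
...     timeout=10,  # default timeout for every request in this session
...     proxy='http://127.0.0.1:8080',  # hypothetical proxy URL
... )
>>> resp = session.get('https://example.com/')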

Browsers can also be created through the firefox and chrome shortcuts:

>>> session = hrequests.firefox.Session()
>>> session = hrequests.chrome.Session()
Parameters
Parameters:
    version (int, optional): Version of the browser to use. Browser must be specified. Default is randomized.
    os (Literal['win', 'mac', 'lin'], optional): OS to use in header. Default is randomized.
    headers (dict, optional): Dictionary of HTTP headers to send with the request. Default is generated from `browser` and `os`.
    verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
    timeout (float, optional): Default timeout in seconds. Defaults to 30.
    proxy (str, optional): Proxy URL. Defaults to None.
    cookies (Union[RequestsCookieJar, dict, list], optional): Cookie Jar, or cookie list/dict to send. Defaults to None.
    certificate_pinning (Dict[str, List[str]], optional): Certificate pinning. Defaults to None.
    disable_ipv6 (bool, optional): Disable IPv6. Defaults to False.
    detect_encoding (bool, optional): Detect encoding. Defaults to True.
    ja3_string (str, optional): JA3 string. Defaults to None.
    h2_settings (dict, optional): HTTP/2 settings. Defaults to None.
    additional_decode (str, optional): Decode response body with "gzip" or "br". Defaults to None.
    pseudo_header_order (list, optional): Pseudo header order. Defaults to None.
    priority_frames (list, optional): Priority frames. Defaults to None.
    header_order (list, optional): Header order. Defaults to None.
    force_http1 (bool, optional): Force HTTP/1. Defaults to False.
    catch_panics (bool, optional): Catch panics. Defaults to False.
    debug (bool, optional): Debug mode. Defaults to False.

os can be 'win', 'mac', or 'lin'. Default is randomized.

>>> session = hrequests.firefox.Session(os='mac')

This will automatically generate headers based on the browser name and OS:

>>> session.headers
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4; rv:60.2.2) Gecko/20100101 Firefox/60.2.2', 'Accept-Encoding': 'gzip, deflate, br', 'Pragma': 'no-cache'}
Why is the browser version in the header different from the TLS browser version?

Website bot detection systems typically do not correlate the TLS fingerprint browser version with the browser header.

By adding more randomization to our headers, we can make our requests appear to come from a larger number of clients. This makes it harder for websites to identify and block our requests based on a consistent browser version.
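
A quick way to observe this (a minimal sketch; since versions are randomized by default, the generated User-Agents will usually, though not always, differ):

>>> s1, s2 = hrequests.Session('chrome'), hrequests.Session('chrome')
>>> s1.headers['User-Agent'] == s2.headers['User-Agent']
False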

Properties

Here is a simple get request. This is a wrapper around hrequests.get; the only difference is that the session cookies are updated with each request. Creating a session is recommended when making multiple requests to the same domain.

>>> resp = session.get('https://www.google.com/')

Session cookies update with each request:

>>> session.cookies: RequestsCookieJar
<RequestsCookieJar[Cookie(version=0, name='1P_JAR', value='2023-07-05-20', port=None, port_specified=False, domain='.google.com', domain_specified=True...

Regenerate headers for a different OS:

>>> session.os = 'win'
>>> session.headers: CaseInsensitiveDict
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0.3) Gecko/20100101 Firefox/66.0.3', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-US;q=0.5,en;q=0.3', 'Cache-Control': 'max-age=0', 'DNT': '1', 'Upgrade-Insecure-Requests': '1', 'Pragma': 'no-cache'}

Closing Sessions

Sessions can also be closed to free memory:

>>> session.close()

Alternatively, sessions can be used as context managers:

with hrequests.Session() as session:
    resp = session.get('https://www.google.com/')
    print(resp)

Concurrent & Lazy Requests

Nohup Requests

Similar to Unix's nohup command, nohup requests are sent in the background.

Adding the nohup=True keyword argument will return a LazyTLSRequest object. The request is sent immediately, but the response is not waited on until one of its attributes is accessed.

resp1 = hrequests.get('https://www.google.com/', nohup=True)
resp2 = hrequests.get('https://www.google.com/', nohup=True)

resp1 and resp2 are sent concurrently. They will never pause the current thread unless an attribute of the response is accessed:

print('Resp 1:', resp1.reason)  # will wait for resp1 to finish, if it hasn't already
print('Resp 2:', resp2.reason)  # will wait for resp2 to finish, if it hasn't already

This is useful for sending requests in the background that aren't needed until later.

Note: With nohup, a new thread is created for each request. For larger-scale concurrency, consider the following:

Easy Concurrency

You can pass an array/iterator of links to the request methods to send them concurrently. This wraps around hrequests.map:

>>> hrequests.get(['https://google.com/', 'https://github.com/'])
(<Response [200]>, <Response [200]>)

This also works with nohup:

>>> resps = hrequests.get(['https://google.com/', 'https://github.com/'], nohup=True)
>>> resps
(<LazyResponse[Pending]>, <LazyResponse[Pending]>)
>>> # Sometime later...
>>> resps
(<Response [200]>, <Response [200]>)

Grequests-style Concurrency

The methods async_get, async_post, etc. will create an unsent request. This leverages gevent, making it blazing fast.

Parameters
Parameters:
    url (str): URL to send request to
    data (Union[str, bytes, bytearray, dict], optional): Data to send to request. Defaults to None.
    files (Dict[str, Union[BufferedReader, tuple]], optional): Data to send to request. Defaults to None.
    headers (dict, optional): Dictionary of HTTP headers to send with the request. Defaults to None.
    params (dict, optional): Dictionary of URL parameters to append to the URL. Defaults to None.
    cookies (Union[RequestsCookieJar, dict, list], optional): Dict or CookieJar to send. Defaults to None.
    json (dict, optional): Json to send in the request body. Defaults to None.
    allow_redirects (bool, optional): Allow request to redirect. Defaults to True.
    history (bool, optional): Remember request history. Defaults to False.
    verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
    timeout (float, optional): Timeout in seconds. Defaults to 30.
    proxy (str, optional): Proxy URL. Defaults to None.
    <Additionally includes all parameters from `hrequests.Session` if a session was not specified>

Returns:
    hrequests.response.Response: Response object

Async requests are evaluated on hrequests.map, hrequests.imap, or hrequests.imap_enum.

This functionality is similar to grequests. Unlike grequests, monkey patching is not required because this does not rely on the standard Python SSL library.

Create a set of unsent Requests:

>>> reqs = [
...     hrequests.async_get('https://www.google.com/', browser='firefox'),
...     hrequests.async_get('https://www.duckduckgo.com/'),
...     hrequests.async_get('https://www.yahoo.com/')
... ]

map

Send them all at the same time using map:

>>> hrequests.map(reqs, size=3)
[<Response [200]>, <Response [200]>, <Response [200]>]
Parameters
Concurrently converts a list of Requests to Responses.
Parameters:
    requests - a collection of Request objects.
    size - Specifies the number of requests to make at a time. If None, no throttling occurs.
    exception_handler - Callback function, called when exception occurred. Params: Request, Exception
    timeout - Gevent joinall timeout in seconds. (Note: unrelated to requests timeout)

Returns:
    A list of Response objects.

imap

imap returns a generator that yields responses as they come in:

>>> for resp in hrequests.imap(reqs, size=3):
...    print(resp)
<Response [200]>
<Response [200]>
<Response [200]>
Parameters
Concurrently converts a generator object of Requests to a generator of Responses.

Parameters:
    requests - a generator or sequence of Request objects.
    size - Specifies the number of requests to make at a time. default is 2
    exception_handler - Callback function, called when exception occurred. Params: Request, Exception

Yields:
    Response objects.

imap_enum returns a generator that yields a tuple of (index, response) as they come in. The index is the index of the request in the original list:

>>> for index, resp in hrequests.imap_enum(reqs, size=3):
...     print(index, resp)
1 <Response [200]>
0 <Response [200]>
2 <Response [200]>
Parameters
Like imap, but yields tuples of the original request index and the response object.
Unlike imap, failed results and responses from exception handlers that return None are not ignored; instead, a
tuple of (index, None) is yielded.
Responses are still in arbitrary order.

Parameters:
    requests - a sequence of Request objects.
    size - Specifies the number of requests to make at a time. default is 2
    exception_handler - Callback function, called when exception occurred. Params: Request, Exception

Yields:
    (index, Response) tuples.

Exception Handling

To handle timeouts or any other exception raised while sending a request, you can add an optional exception handler that will be called with the request and the exception inside the main thread.

>>> def exception_handler(request, exception):
...    return f'Response failed: {exception}'

>>> bad_reqs = [
...     hrequests.async_get('http://httpbin.org/delay/5', timeout=1),
...     hrequests.async_get('http://fakedomain/'),
...     hrequests.async_get('http://example.com/'),
... ]
>>> hrequests.map(bad_reqs, size=3, exception_handler=exception_handler)
['Response failed: Connection error', 'Response failed: Connection error', <Response [200]>]

The value returned by the exception handler will be used in place of the response in the result list.

If an exception handler isn't specified, the default yield type is hrequests.FailedResponse.
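
This means results can be filtered after the fact. A minimal sketch, assuming FailedResponse is exposed at the top level as described above:

>>> results = hrequests.map(bad_reqs, size=3)
>>> ok = [r for r in results if not isinstance(r, hrequests.FailedResponse)]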


HTML Parsing

HTML scraping is based on selectolax, which is over 25x faster than BeautifulSoup4. This functionality is inspired by requests-html.

Library          Time (1e5 trials)
BeautifulSoup4   52.6
PyQuery           7.5
selectolax        1.9

The HTML parser can be accessed through the html attribute of the response object:

>>> resp = session.get('https://python.org/')
>>> resp.html
<HTML url='https://www.python.org/'>

Parsing page

Grab a list of all links on the page, as-is (anchors excluded):

>>> resp.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/',...

Grab a list of all links on the page, in absolute form (anchors excluded):

>>> resp.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.g...

Search for text on the page:

>>> resp.html.search('Python is a {} language')[0]
'programming'

Selecting elements

Select an element using a CSS Selector:

>>> about = resp.html.find('#about')
Parameters
Given a CSS Selector, returns a list of
:class:`Element <Element>` objects or a single one.

Parameters:
    selector: CSS Selector to use.
    clean: Whether or not to sanitize the found HTML of ``<script>`` and ``<style>``
    containing: If specified, only return elements that contain the provided text.
    first: Whether or not to return just the first result.
    raise_exception: Raise an exception if no elements are found. Default is True.
    _encoding: The encoding format.

Returns:
    A list of :class:`Element <Element>` objects or a single one.

Example CSS Selectors:
- ``a``
- ``a.someClass``
- ``a#someID``
- ``a[target=_blank]``
See W3School's `CSS Selectors Reference
<https://www.w3schools.com/cssref/css_selectors.asp>`_
for more details.
If ``first`` is ``True``, only returns the first
:class:`Element <Element>` found.

Introspecting elements

Grab an Element's text contents:

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Getting an Element's attributes:

>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
>>> about.id
'about'

Get an Element's raw HTML:

>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'

Select Elements within Elements:

>>> about.find_all('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
>>> about.find('a')
<Element 'a' href='/about/' title='' class=''>

Searching by HTML attributes:

>>> about.find('li', role='treeitem')
<Element 'li' role='treeitem' class=('tier-2', 'element-1')>

Search for links within an element:

>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}

Browser Automation

Hrequests supports both Firefox and Chrome browsers, headless and headful sessions, and browser addons/extensions:

Browser support table

Chrome supports both Manifest v2/v3 extensions. Firefox only supports Manifest v2 extensions.

Only Firefox supports CloudFlare WAFs.

Browser   MV2   MV3   Cloudflare WAFs
Firefox   ✔️    ❌    ✔️
Chrome    ✔️    ✔️    ❌

Usage

You can spawn a BrowserSession instance by instantiating it:

>>> page = hrequests.BrowserSession()  # headless=True by default
Parameters
Parameters:
    headless (bool, optional): Whether to run the browser in headless mode. Defaults to True.
    session (hrequests.session.TLSSession, optional): Session to use for headers, cookies, etc.
    resp (hrequests.response.Response, optional): Response to update with cookies, headers, etc.
    proxy (str, optional): Proxy to use for the browser. Example: http://1.2.3.4:8080
    mock_human (bool, optional): Whether to emulate human behavior. Defaults to False.
    browser (Literal['firefox', 'chrome'], optional): Generate useragent headers for a specific browser
    os (Literal['win', 'mac', 'lin'], optional): Generate headers for a specific OS
    extensions (Union[str, Iterable[str]], optional): Path to a folder of unpacked extensions, or a list of paths to unpacked extensions

By default, BrowserSession returns a Chrome browser.

To create a Firefox session, use the firefox shortcut instead:

>>> page = hrequests.firefox.BrowserSession()

BrowserSession is entirely safe to use across threads.
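
For instance, a minimal standalone session (example.com is a placeholder URL; goto, status_code, and close are covered below):

>>> page = hrequests.BrowserSession()
>>> page.goto('https://example.com')
>>> page.status_code
200
>>> page.close()  # always close the page when finished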

Render an existing Response

Responses have a .render() method. This will render the contents of the response in a browser page.

Once the page is closed, the Response content and the Response's session cookies will be updated.

Simple usage

Rendered browser sessions will use the browser set in the initial request.

You can set a request's browser with the browser parameter in the hrequests.get method:

>>> resp = hrequests.get('https://example.com', browser='chrome')

Or by setting the browser parameter of the hrequests.Session object:

>>> session = hrequests.Session(browser='chrome')
>>> resp = session.get('https://example.com')

Example - submitting a login form:

>>> session = hrequests.Session(browser='chrome')
>>> resp = session.get('https://www.somewebsite.com/')
>>> with resp.render(mock_human=True) as page:
...     page.type('.input#username', 'myuser')
...     page.type('.input#password', 'p4ssw0rd')
...     page.click('#submit')
# `session` & `resp` now have updated cookies, content, etc.
Or, without a context manager
>>> session = hrequests.Session(browser='chrome')
>>> resp = session.get('https://www.somewebsite.com/')
>>> page = resp.render(mock_human=True)
>>> page.type('.input#username', 'myuser')
>>> page.type('.input#password', 'p4ssw0rd')
>>> page.click('#submit')
>>> page.close()  # must close the page when done!

The mock_human parameter emulates human-like behavior, including easing and randomizing mouse movements and randomizing typing speed. This functionality is based on botright.

Parameters
Parameters:
    headless (bool, optional): Whether to run the browser in headless mode. Defaults to False.
    mock_human (bool, optional): Whether to emulate human behavior. Defaults to False.
    extensions (Union[str, Iterable[str]], optional): Path to a folder of unpacked extensions, or a list of paths to unpacked extensions

Properties

Cookies are inherited from the session:

>>> page.cookies: RequestsCookieJar  # cookies are inherited from the session
<RequestsCookieJar[Cookie(version=0, name='1P_JAR', value='2023-07-05-20', port=None, port_specified=False, domain='.somewebsite.com', domain_specified=True...

Pulling page data

Get current page url:

>>> page.url: str
'https://www.somewebsite.com/'

Get page content:

>>> page.text: str
'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpag'
>>> page.content: bytes
b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpag'

Get the status of the last navigation:

>>> page.status_code: int
200
>>> page.reason: str
'OK'

Parsing HTML from the page content:

>>> page.html.find_all('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, ...]
>>> page.html.find('a')
<Element 'a' href='/about/' title='' class=''>

Take a screenshot of the page:

>>> page.screenshot(path='screenshot.png')
Parameters
Take a screenshot of the page

Parameters:
    selector (str, optional): CSS selector to screenshot
    path (str, optional): Path to save screenshot to. Defaults to None.
    full_page (bool): Whether to take a screenshot of the full scrollable page. Cannot be used with selector. Defaults to False.

Returns:
    Optional[bytes]: Returns the screenshot buffer, if `path` was not provided
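
For instance, to keep the screenshot in memory instead of writing a file (a sketch relying on the bytes return described above):

buf = page.screenshot(full_page=True)  # no `path`, so the buffer is returned
with open('screenshot.png', 'wb') as f:
    f.write(buf)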

Navigate the browser

Navigate to a url:

>>> page.url = 'https://bing.com'
# or use goto
>>> page.goto('https://bing.com')

Navigate through page history:

>>> page.back()
>>> page.forward()

Controlling elements

Click an element:

>>> page.click('#my-button')
# or through the html parser
>>> page.html.find('#my-button').click()
Parameters
Parameters:
    selector (str): CSS selector to click.
    button (Literal['left', 'right', 'middle'], optional): Mouse button to click. Defaults to 'left'.
    count (int, optional): Number of clicks. Defaults to 1.
    timeout (float, optional): Timeout in seconds. Defaults to 30.
    wait_after (bool, optional): Wait for a page event before continuing. Defaults to True.

Hover over an element:

>>> page.hover('.dropbtn')
# or through the html parser
>>> page.html.find('.dropbtn').hover()
Parameters
Parameters:
    selector (str): CSS selector to hover over
    modifiers (List[Literal['Alt', 'Control', 'Meta', 'Shift']], optional): Modifier keys to press. Defaults to None.
    timeout (float, optional): Timeout in seconds. Defaults to 90.

Type text into an element:

>>> page.type('#my-input', 'Hello world!')
# or through the html parser
>>> page.html.find('#my-input').type('Hello world!')
Parameters
Parameters:
    selector (str): CSS selector to type in
    text (str): Text to type
    delay (int, optional): Delay between keypresses in ms. On mock_human, this is randomized by 50%. Defaults to 50.
    timeout (float, optional): Timeout in seconds. Defaults to 30.

Drag and drop an element:

>>> page.dragTo('#source-selector', '#target-selector')
# or through the html parser
>>> page.html.find('#source-selector').dragTo('#target-selector')
Parameters
Parameters:
    source (str): Source to drag from
    target (str): Target to drop to
    timeout (float, optional): Timeout in seconds. Defaults to 30.
    wait_after (bool, optional): Wait for a page event before continuing. Defaults to False.
    check (bool, optional): Check if an element is draggable before running. Defaults to False.

Throws:
    hrequests.exceptions.BrowserTimeoutException: If timeout is reached

Check page elements

Check if a selector is visible and enabled:

>>> page.isVisible('#my-selector'): bool
>>> page.isEnabled('#my-selector'): bool
Parameters
Parameters:
    selector (str): Selector to check

Evaluate and return a script:

>>> page.evaluate('selector => document.querySelector(selector).checked', '#my-selector')
Parameters
Parameters:
    script (str): Javascript to evaluate in the page
    arg (str, optional): Argument to pass into the javascript function

Awaiting events

Wait for a navigation to complete:

>>> page.awaitNavigation()
Parameters
Parameters:
    timeout (float, optional): Timeout in seconds. Defaults to 30.

Throws:
    hrequests.exceptions.BrowserTimeoutException: If timeout is reached

Wait for a script or function to return a truthy value:

>>> page.awaitScript('selector => document.querySelector(selector).value === 100', '#progress')
Parameters
Parameters:
    script (str): Script to evaluate
    arg (str, optional): Argument to pass to script
    timeout (float, optional): Timeout in seconds. Defaults to 30.

Throws:
    hrequests.exceptions.BrowserTimeoutException: If timeout is reached

Wait for the URL to match:

>>> page.awaitUrl(re.compile(r'https?://www\.google\.com/.*'), timeout=10)
Parameters
Parameters:
    url (Union[str, Pattern[str], Callable[[str], bool]]) - URL to match for
    timeout (float, optional): Timeout in seconds. Defaults to 30.

Throws:
    hrequests.exceptions.BrowserTimeoutException: If timeout is reached

Wait for an element to exist on the page:

>>> page.awaitSelector('#my-selector')
# or through the html parser
>>> page.html.find('#my-selector').awaitSelector()
Parameters
Parameters:
    selector (str): Selector to wait for
    timeout (float, optional): Timeout in seconds. Defaults to 30.

Throws:
    hrequests.exceptions.BrowserTimeoutException: If timeout is reached

Wait for an element to be enabled:

>>> page.awaitEnabled('#my-selector')
# or through the html parser
>>> page.html.find('#my-selector').awaitEnabled()
Parameters
Parameters:
    selector (str): Selector to wait for
    timeout (float, optional): Timeout in seconds. Defaults to 30.

Throws:
    hrequests.exceptions.BrowserTimeoutException: If timeout is reached

Screenshot an element:

>>> page.screenshot('#my-selector', path='screenshot.png')
# or through the html parser
>>> page.html.find('#my-selector').screenshot('selector.png')
Parameters
Screenshot an element

Parameters:
    selector (str, optional): CSS selector to screenshot
    path (str, optional): Path to save screenshot to. Defaults to None.
    full_page (bool): Whether to take a screenshot of the full scrollable page. Cannot be used with selector. Defaults to False.

Returns:
    Optional[bytes]: Returns the screenshot buffer, if `path` was not provided

Adding Firefox/Chrome extensions

Firefox/Chrome extensions can be easily imported into a browser session.

Note: Firefox extensions are Firefox-only, and Chrome extensions are Chrome-only.

If you plan on using Firefox-specific or Chrome-specific extensions, make sure to set your browser parameter to the correct browser before rendering the page:

# when dealing with captchas, make sure to use firefox
>>> resp = hrequests.get('https://accounts.hcaptcha.com/demo', browser='firefox')

Extensions are added with the extensions parameter:

  • This can be a list of absolute paths to unpacked extensions:

    with resp.render(extensions=['C:\\extensions\\hektcaptcha', 'C:\\extensions\\ublockorigin']):
  • Or a folder containing the unpacked extensions:

    with resp.render(extensions='C:\\extensions'):

    Note that these need to be unpacked extensions. You can unpack a .crx file by changing the file extension to .zip and extracting the contents.
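
As a sketch of that unpacking step in Python (file names and paths here are hypothetical; Python's zipfile can often read a .crx directly because it locates the archive index at the end of the file, but renaming the file to .zip works if this fails):

import zipfile

# Extract a downloaded .crx into a folder of unpacked extension files.
# A .crx is a zip archive with an extra header prepended, which zipfile
# usually tolerates since it scans for the zip index from the end.
with zipfile.ZipFile('hektcaptcha.crx') as crx:
    crx.extractall('C:\\extensions\\hektcaptcha')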

Here is a usage example of a captcha solver extension:

>>> resp = hrequests.get('https://accounts.hcaptcha.com/demo', browser='firefox')
>>> with resp.render(extensions=['C:\\extensions\\hektcaptcha']) as page:
...     page.awaitSelector('.hcaptcha-success')  # wait for captcha to finish
...     page.click('input[type=submit]')

Requests & Responses

Requests can also be sent within browser sessions. These operate the same as the standard hrequests.request, and will use the browser's cookies and headers. The BrowserSession cookies will be updated with each request.

This returns a normal Response object:

>>> resp = page.get('https://duckduckgo.com')
Parameters
Parameters:
    url (str): URL to send request to
    params (dict, optional): Dictionary of URL parameters to append to the URL. Defaults to None.
    data (Union[str, dict], optional): Data to send to request. Defaults to None.
    headers (dict, optional): Dictionary of HTTP headers to send with the request. Defaults to None.
    form (dict, optional): Form data to send with the request. Defaults to None.
    multipart (dict, optional): Multipart data to send with the request. Defaults to None.
    timeout (float, optional): Timeout in seconds. Defaults to 30.
    verify (bool, optional): Verify the server's TLS certificate. Defaults to True.
    max_redirects (int, optional): Maximum number of redirects to follow. Defaults to None.

Throws:
    hrequests.exceptions.BrowserTimeoutException: If timeout is reached

Returns:
    hrequests.response.Response: Response object

Other methods include post, put, delete, head, and patch.
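
For example (httpbin.org as a stand-in endpoint; note the parameter list above takes data/form/multipart rather than json):

>>> resp = page.post('https://httpbin.org/post', data={'key': 'value'})
>>> resp.status_code
200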

Closing the page

The BrowserSession object must be closed when finished. This will close the browser, update the response data, and merge new cookies with the session cookies.

>>> page.close()

Note that this is automatically done when using a context manager.

Session cookies are updated:

>>> session.cookies: RequestsCookieJar
<RequestsCookieJar[Cookie(version=0, name='MUID', value='123456789', port=None, port_specified=False, domain='.bing.com', domain_specified=True, domain_initial_dot=True...

Response data is updated:

>>> resp.url: str
'https://www.bing.com/?toWww=1&redig=823778234657823652376438'
>>> resp.content: Union[bytes, str]
'<!DOCTYPE html><html lang="en" dir="ltr"><head><meta name="theme-color" content="#4F4F4F"><meta name="description" content="Bing helps you turn inform...

Other ways to create a Browser Session

You can use .render to spawn a BrowserSession object directly from a url:

# Using a Session:
>>> page = session.render('https://google.com')
# Or without a session at all:
>>> page = hrequests.render('https://google.com')

Make sure to close all BrowserSession objects when done!

>>> page.close()

hrequests's People

Contributors

daijro, dependabot[bot], kianmeng


hrequests's Issues

Browser Extension interaction

You currently have the ability for browsers to use extensions such as adblockers. However, there are a few extensions that I want to use that require active input (pressing a couple of buttons). Could there be a way to interact with these extensions?

Contact with me

I'm sorry to write here, but can you contact me about something on Discord? @max_andolini

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

My python crash when trying simple request

import hrequests

resp = hrequests.get('https://www.google.com')

I am using a virtual env with Python 3.9.6 on a MacBook with an M2 chip. In the PyCharm interpreter or terminal, I get "Process finished with exit code 137 (interrupted by signal 9: SIGKILL)", and an Apple crash report window appears with the message that "python quit unexpectedly". This happens during import of the package.

Can we use auth proxies?

How could we accomplish something like this using hrequests?

import requests

proxies = {
   'http': 'http://proxy.example.com:8080',
   'https': 'http://proxy.example.com:8081',
}

response = requests.get('http://httpbin.org/ip', proxies=proxies, auth=('USERNAME', 'PASSWORD'))

Add socks5 proxies [tls-client v1.7.0]

Hello,

every time I try to use socks5 proxies I get a Connection Error:

>>> get(
...     "https://ipv4.webshare.io/",
...     proxies={
...         "http": "socks5h://XXXX-rotate:[email protected]:80/",
...         "https": "socks5h://XXXX-rotate:[email protected]:80/"
...     }
... ).text
Traceback (most recent call last):
  File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/response.py", line 72, in execute_request
    resp = self.session.execute_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/client.py", line 389, in execute_request
    raise ClientException(response_object['body'])
hrequests.exceptions.ClientException: failed to build client out of request input: scheme socks5h is not supported

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/reqs.py", line 209, in request
    req.send()
  File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/reqs.py", line 126, in send
    raise e
  File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/reqs.py", line 123, in send
    self.response = self.session.request(self.method, self.url, **merged_kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/session.py", line 181, in request
    proc.send()
  File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/response.py", line 65, in send
    self.response = self.execute_request()
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bader/code/python3/lib/python3.11/site-packages/hrequests/response.py", line 80, in execute_request
    raise ClientException('Connection error') from e
hrequests.exceptions.ClientException: Connection error

Speed benchmarks with `async_get`?

Hey, I am using async_get with hrequests.imap_enum(reqs) to scrape 100 URLs at a time. The readme says this should be blazing fast, but I'm not sure what that means. It's currently taking 3-5 minutes, and that does not include a render step.

Is that approximately the amount of time it should take? I was thinking it'd be a lot faster since it's just a request and not a rendering. Here's the code I'm currently using. The rows variable is 100 DB records, specifically a SQL Alchemy RowMapping object.

reqs = [hrequests.async_get(r.url) for r in rows]

responses = []

for index, resp in hrequests.imap_enum(reqs):
    if resp:
        with open(f'{directory}/{index}.pickle', 'wb') as file:
            pickle.dump(resp, file)
        responses.append({"url": resp.url, "resp": resp})
        
    else:
        print(f'No response for {index}')

TypeError: string indices must be integers

Python 3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hrequests
Downloading tls-client library from bogdanfinn/tls-client...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python310\lib\site-packages\hrequests\__init__.py", line 2, in <module>
    from .session import Session, TLSSession, chrome, firefox
  File "C:\Python310\lib\site-packages\hrequests\session.py", line 11, in <module>
    from .cffi import freeMemory
  File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 90, in <module>
    libman = LibraryManager()
  File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 35, in __init__
    filename = self.check_library()
  File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 54, in check_library
    self.download_library()
  File "C:\Python310\lib\site-packages\hrequests\cffi.py", line 63, in download_library
    if self.file_cont in asset['name'] and asset['name'].endswith(self.file_ext):
TypeError: string indices must be integers

This also happens when you install both hrequests and hrequests[all]

How can I use Browser Automation?

page = hrequests.BrowserSession()

I have tried to use browser automation as given in your example, but it shows an error like this:
AttributeError: module 'hrequests' has no attribute 'BrowserSession'

and I also tried this example:

session = hrequests.Session(browser='chrome')
resp = session.get('https://quotes.toscrape.com/page/1/')
with resp.render(mock_human=True) as page:
    print(page.text)

and it also throws an error: AttributeError: module 'hrequests' has no attribute 'browser'

Render Screenshot - page is not fully rendered

I was working through the README; excellent package. In the past I have used Playwright to headlessly render a page and take a screenshot. I attempted to replicate this with your library, but the captured screenshot appears to lack the JavaScript rendering.

session = hrequests.Session(browser='chrome')
resp = session.get('https://www.bentley.edu/undergraduate')
page = resp.render(mock_human=True)
page.awaitNavigation()
page.screenshot('test.png', full_page=True)

It could well be user error, but I am not sure of the best way to ensure that the page renders before grabbing a screenshot.

AttributeError: dlsym(0x8546e980, DestroySession): symbol not found

Hi, I'm unable to import hrequests using the latest beta version: 0.8.0-beta b1af435

I get the following exception:

AttributeError                            Traceback (most recent call last)
/Users/libre/Documents/GitHub/project_env/playground.ipynb Cell 43 line 1
----> 1 import hrequests

File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/site-packages/hrequests/__init__.py:1
----> 1 from .response import Response, ProcessResponse
      2 from .session import Session, TLSSession, chrome, firefox
      3 from .reqs import *

File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/site-packages/hrequests/response.py:11
      8 from orjson import dumps, loads
     10 import hrequests
---> 11 from hrequests.cffi import PORT
     12 from hrequests.exceptions import ClientException
     14 from .cookies import RequestsCookieJar

File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/site-packages/hrequests/cffi.py:112
    109 del libman
    111 # extract the exposed destroySession function
--> 112 library.DestroySession.argtypes = [GoString]
    113 library.DestroySession.restype = ctypes.c_void_p
    116 def destroySession(session_id: str):

File /opt/homebrew/Caskroom/miniforge/base/envs/project_env/lib/python3.11/ctypes/__init__.py:389, in CDLL.__getattr__(self, name)
...
--> 394     func = self._FuncPtr((name_or_ordinal, self))
    395     if not isinstance(name_or_ordinal, int):
    396         func.__name__ = name_or_ordinal

AttributeError: dlsym(0x8546e980, DestroySession): symbol not found

Browser Session - Error

Good Morning

OS: Ubuntu 20.04
hrequests version: 0.8.2

By using the hrequests.firefox.BrowserSession() I am receiving the error:

Exception in thread Thread-2 (spawn_main):
Traceback (most recent call last):
  File "/.asdf/installs/python/3.10.5/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/.asdf/installs/python/3.10.5/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/browser.py", line 128, in spawn_main
    asyncio.new_event_loop().run_until_complete(self.main())
  File "/.asdf/installs/python/3.10.5/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/browser.py", line 135, in main
    self.context = await self.client.new_context(
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/playwright_mock/playwright_mock.py", line 35, in new_context
    _proxy = await ProxyManager(self, proxy)
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/async_class.py", line 173, in __await__
    yield from self.create_task(
  File "/.asdf/installs/python/3.10.5/lib/python3.10/site-packages/hrequests/playwright_mock/proxy_manager.py", line 25, in __ainit__
    self.timeout = httpx.Timeout(20.0, read=None)
TypeError: Timeout.__init__() got an unexpected keyword argument 'read'

It's related to the dependency, not to the code.

I am reporting for further analysis.

Have a great day.

Required dependency missing when installing the library

OS: Windows 11.0
Python version: 3.11.4

When installing the "hrequests" package using the pip command and executing some related code, some specific errors are observed, such as:

  • ModuleNotFoundError: No module named 'bs4'
  • ModuleNotFoundError: No module named 'BeautifulSoup'

It is noticeable that, even when using only the "get" function provided by the "hrequests" package, the "bs4" library is requested, and even after trying to install the "hrequests[all]" package, the same errors persist. However, when the "bs4" library is installed manually, the code works without problems. This is a rather inconvenient problem to have to solve every time I use this library in another project.

Resp.render is not applying proxy

I have observed that when I try to render the content, the proxy is not used, because proxy is None in the Response class. The proxy only works without rendering.

r.content as bytes or r.raw is needed for binaries.

Hi, I have tried downloading images, but request.content is not raw bytes.
I've tried mangling r.content with .encode('utf-8') / decoding (unicode-escapes), but to no avail. The response is always some weird mix of encoded and escaped string and bytes.
The Response class implementation is different from requests.Response and does not support requests.Response.raw, and r.content is not bytes.

Is there any workaround? Cheers

Sample code:

url = f"https://upload.wikimedia.org/wikipedia/commons/5/59/Shrine_of_Rememberance_%2811884180023%29.jpg"
filePath = os.path.join("tmp", f"image.jpg")
while True:
    r = hrequests.request("GET", url)
    if r.status_code != 403:
        data = r.content.encode('utf-8') 
        with open(filePath, "wb") as f:
            f.write(data)
        break

Output from r: (screenshot omitted)

Resulting binary from r.content.encode('utf-8'): (screenshot omitted)

Content is not fully loaded

I am testing this library with browser automation on some websites, and I have observed that for many of them, lazy content (images, JS scripts that build the page) does not fully load. I am wondering what might cause this issue.

Cookies not properly set in session

I stumbled on a case that indicates that cookies are not properly set in session.

Example url: https://somo.app

If you open the url in a browser, you can see that the first request returns a 307 redirect to '/' and sets a cookie; the subsequent request to the same url includes the cookie and returns 200.

Trying to open this url with hrequests will fail with:
hrequests.exceptions.ClientException: failed to do request: Get "/": stopped after 10 redirects

import hrequests

url = "https://somo.app"
session = hrequests.Session()

resp = session.get(url, allow_redirects=True)

If I try to do it request by request:

import hrequests

url = "https://somo.app"
session = hrequests.Session()

resp = session.get(url, allow_redirects=False)
print(resp.status_code)
print(resp.headers.get("location"))

resp = session.get(url, allow_redirects=False)
print(resp.status_code)

It shows that the subsequent request returns 307 again. But it should not: the cookie should be set, and the second request should return 200.

Getting and setting the cookie manually produces the expected behavior:

import hrequests

url = "https://somo.app"
session = hrequests.Session()

resp = session.get(url, allow_redirects=False)
print(resp.status_code)
print(resp.headers.get("location"))

headers = {"Cookie": ";".join([f"{c.name}={c.value}" for c in resp.cookies])}
resp = session.get(url, allow_redirects=False, headers=headers)
print(resp.status_code)

Unsupported chrome version error

Hi there, when running a simple hrequests.get command on Ubuntu I get the following error:

<bound method ? of <class 'hrequests.session.chrome'>> is not a supported chrome version: (103, 104, 105, 106, 107, 108, 109, 110, 111, 112)

This happens when both installing via hrequests[all] and hrequests.

Was wondering if anyone else has run into this, or could help me debug?

Thanks!

No Binaries for M1 Macs

Installing version 0.8.0 or above doesn't work on M1 (or presumably M2) Macs, due to the missing binaries.

/lib/python3.10/site-packages/hrequests/cffi.py", line 98, in download_library
    raise IOError('Could not find a matching binary for your system.')
OSError: Could not find a matching binary for your system.

Temporary workaround is to install:

pip install -U "hrequests[all]<0.8.0"

Are there plans to include these binaries? Or perhaps some instructions on how to compile them myself?

Lambda Execution Issues

Hey there! Awesome library! I am running into some issues. I hope the community here can help me troubleshoot them. I am attempting to run hrequests in Lambda to interact with specific web pages when a function URL is called.

I am using the AWS SDK to deploy a Docker container similar to the following to ECR -> Lambda:

FROM mcr.microsoft.com/playwright/python:v1.34.0-jammy

# Include global arg in this stage of the build
ARG FUNCTION_DIR

RUN mkdir -p ${FUNCTION_DIR}

COPY app.py ${FUNCTION_DIR}

WORKDIR /app

COPY ./mytool/pyproject.toml ./mytool/poetry.lock /app/

COPY ./mytool/. /app

# Install dependencies using poetry
RUN pip install --no-cache-dir poetry awslambdaric aws-xray-sdk sh \
    && poetry config virtualenvs.create false \
    && poetry install --no-interaction --no-ansi

RUN python -m playwright install-deps
RUN python -m playwright install

WORKDIR ${FUNCTION_DIR}

ENTRYPOINT [ "/usr/bin/python", "-m", "awslambdaric" ]
CMD [ "app.handler" ]

An app.py file similar to the following is then called using said function URL via awslambdaric:

def handler(event, context):
    logger.debug(msg=f"Initial event: {event}")

    headers = event["headers"]
    header_validation = validate_headers(headers)

    input = headers["x-input"]
    try:
        command = headers["x-command"].split()
        command.extend(input.split())
    except Exception as e:
        logger.error(msg=f"Error parsing command: {e}")
        return {
            "statusCode": 500,
            "body": f"Error parsing command: {e}",
        }

    parsed = []
    try:
        logger.debug(msg=f"Running command: {command}")

        # Set HOME=/tmp to avoid writing to the container filesystem
        # Set LD_LIBRARY_PATH to include /usr/lib64 to avoid issues with the AWS X-Ray daemon
        os.environ["HOME"] = "/tmp"
        os.environ["LD_LIBRARY_PATH"] = "/usr/lib64"

        results = subprocess.run(command, capture_output=True, text=True, env=os.environ.copy())
        logger.debug(msg=f"Results stdout: {results.stdout}")
        logger.debug(msg=f"Results stderr: {results.stderr}")
        logger.debug(msg=f"Command exited with code: {results.returncode}")

    except subprocess.TimeoutExpired as e:
        logger.error(msg=f"Command timed out: {e}")
        return {
            "statusCode": 408,  # HTTP status code for Request Timeout
            "body": json.dumps({
                "stdout": str(e.stdout),
                "stderr": str(e.stderr),
                "e": str(e),
                "error": "Command timed out"
            }),
        }
    except Exception as e:
        logger.error(msg=f"Error executing command: {e}")
        return {
            "statusCode": 500,
            "body": f"Error executing command: {e}",
        }

    try:
        for line in results.stdout.splitlines():
            parsed_json = json.loads(line)
            logger.debug(msg=f"Output: {parsed_json}")
            parsed.append(parsed_json)
    except Exception as e:
        logger.error(msg=f"Error parsing output: {e}")
        return {
            "statusCode": 500,
            "body": f"Error parsing output: {e}",
        }
    
    xray_recorder.end_segment()

    return {"statusCode": 200, "body": json.dumps(parsed)}

This app.py code is calling a separate tool I have created that utilizes hrequests for navigation and interaction with web pages. When calling the app.py file with the function URL, however, the following error is returned from hrequests specifically:

Exception in thread Thread-1 (spawn_main):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/hrequests/browser.py", line 128, in spawn_main
    asyncio.new_event_loop().run_until_complete(self.main())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/hrequests/browser.py", line 135, in main
    self.context = await self.client.new_context(
  File "/usr/local/lib/python3.10/dist-packages/hrequests/playwright_mock/playwright_mock.py", line 38, in new_context
    _browser = await context.new_context(
  File "/usr/local/lib/python3.10/dist-packages/hrequests/playwright_mock/context.py", line 6, in new_context
    context = await inst.main_browser.new_context(
  File "/usr/local/lib/python3.10/dist-packages/playwright/async_api/_generated.py", line 14154, in new_context
    await self._impl_obj.new_context(
  File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_browser.py", line 127, in new_context
    channel = await self._channel.send("newContext", params)
  File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py", line 61, in send
    return await self._connection.wrap_api_call(
  File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py", line 482, in wrap_api_call
    return await cb()
  File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py", line 97, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Target page, context or browser has been closed

Some notes on what has already been attempted:

  • The container image runs just fine on my local system with similar resource allocations specified
  • I can call my tool remotely, and it appears to run partially before hitting this exception
  • I have increased memory allocation to the Lambda function several times without success.
  • My tool always hits the Lambda timeout value no matter how high it is set, so I suspect this error is occurring and locking the application entirely.

I am not experienced with playwright and headless browser usage, so any help would be greatly appreciated. I understand this is not directly related to hrequests, but I hope the community here is familiar enough with the frameworks to assist. Thanks!

None Type Error

Thanks for the library.

Faced this error on Windows:

  File "C:\Users\Chetan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\hrequests\response.py", line 127, in Response
    elapsed: timedelta | None = None
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

The fix was to set elapsed = None. Please add it to the library to help people.

Jupyter support for BrowserSession

On Windows 10, Python 3.10.1:

import hrequests
page = hrequests.BrowserSession()

This results in the following exception:

Task exception was never retrieved
future: <Task finished name='Task-7' coro=<Connection.run() done, defined at <redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_connection.py:264> exception=NotImplementedError()>
Traceback (most recent call last):
  File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_connection.py", line 271, in run
    await self._transport.connect()
  File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
    raise exc
  File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 116, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec
    transport = await self._make_subprocess_transport(
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError
Exception in thread Thread-5 (spawn_main):
Traceback (most recent call last):
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "<redacted>\hrequests\venv\lib\site-packages\hrequests\browser.py", line 115, in spawn_main
    asyncio.new_event_loop().run_until_complete(self.main())
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 641, in run_until_complete
    return future.result()
  File "<redacted>\hrequests\venv\lib\site-packages\hrequests\browser.py", line 119, in main
    self.client = await hrequests.PlaywrightMock(
  File "<redacted>\hrequests\venv\lib\site-packages\async_class.py", line 173, in __await__
    yield from self.create_task(
  File "<redacted>\hrequests\venv\lib\site-packages\hrequests\playwright_mock\playwright_mock.py", line 19, in __ainit__
    self.playwright = await async_playwright().start()
  File "<redacted>\hrequests\venv\lib\site-packages\playwright\async_api\_context_manager.py", line 52, in start
    return await self.__aenter__()
  File "<redacted>\hrequests\venv\lib\site-packages\playwright\async_api\_context_manager.py", line 47, in __aenter__
    playwright = AsyncPlaywright(next(iter(done)).result())
  File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_connection.py", line 271, in run
    await self._transport.connect()
  File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 127, in connect
    raise exc
  File "<redacted>\hrequests\venv\lib\site-packages\playwright\_impl\_transport.py", line 116, in connect
    self._proc = await asyncio.create_subprocess_exec(
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec
    transport, protocol = await loop.subprocess_exec(
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec
    transport = await self._make_subprocess_transport(
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
    raise NotImplementedError
NotImplementedError
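
The NotImplementedError is raised by asyncio's SelectorEventLoop, which cannot spawn subprocesses on Windows; Jupyter tends to select it by default, and Playwright needs a subprocess to launch the browser. A common workaround (a sketch, and it may not resolve every Jupyter nested-loop quirk) is to force the Proactor policy before hrequests creates its event loop:

import asyncio
import sys

# The Proactor loop implements subprocess support on Windows;
# the Selector loop that Jupyter picks by default does not.
if sys.platform == 'win32':
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

import hrequests
page = hrequests.BrowserSession()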

Screenshot for element?

I think screenshots should be possible for specific elements, not just the entire page, and kept in memory without saving to disk.

Like this:

image_element = page.find('#captchaframe')
image = image_element.screenshot()
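
For reference, Playwright itself already returns element screenshots as in-memory bytes, so this would largely be a passthrough; a minimal sketch at the plain-Playwright level (these are Playwright's names, not hrequests'):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.org')
    # With no `path` argument, screenshot() returns the PNG as bytes
    png_bytes = page.locator('h1').screenshot()
    browser.close()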

The library I have been dreaming for, thank you.

Quick question: how do I use the HTML parser once the browser has loaded? Or even just evaluate with the render function and pass the evaluated HTML to the parser.

I'm trying to get the value of a rendered JS object.

Thank you for putting out a cohesive package for automation; definitely here early before it blows up.
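
A sketch of one way this could look, assuming (per the Browser Automation docs) that BrowserSession mirrors the Response parser interface and exposes goto/evaluate; treat the method names as assumptions:

import hrequests

page = hrequests.BrowserSession()
page.goto('https://example.org')

# Assumption: the rendered DOM is queryable through the same parser as resp.html
title = page.html.find('title').text

# Assumption: evaluate() runs JS in the page and returns JSON-serializable
# values, covering a rendered JS object (the object name here is hypothetical)
value = page.evaluate('window.__INITIAL_STATE__')

page.close()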

Dockerfile with hrequests

I'm building a container that uses the hrequests library; however, when I call the get function, the call fails and the program stops. Thanks for your help!

hrequests[all] doesn't support Python 3.12

I tried to install with pip install -U hrequests[all] but got the problem below. I have already installed these individual components via the Visual Studio Installer:

  • C++ CMake tools for Windows
  • Testing tools core features
  • C++ AddressSanitizer

Command: pip install -U hrequests[all]
Error:
Using cached playwright_stealth-1.0.6-py3-none-any.whl (28 kB)
Building wheels for collected packages: greenlet
Building wheel for greenlet (setup.py) ... error
error: subprocess-exited-with-error

ร— python setup.py bdist_wheel did not run successfully.
โ”‚ exit code: 1
โ•ฐโ”€> [120 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-312
creating build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet_init_.py -> build\lib.win-amd64-cpython-312\greenlet
creating build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform_init_.py -> build\lib.win-amd64-cpython-312\greenlet\platform
creating build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\leakcheck.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_contextvars.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_cpp.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_extension_interface.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_gc.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_generator.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_generator_nested.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_greenlet.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_greenlet_trash.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_leaks.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_stack_saved.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_throw.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_tracing.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_version.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests\test_weakref.py -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests_init_.py -> build\lib.win-amd64-cpython-312\greenlet\tests
running egg_info
writing src\greenlet.egg-info\PKG-INFO
writing dependency_links to src\greenlet.egg-info\dependency_links.txt
writing requirements to src\greenlet.egg-info\requires.txt
writing top-level names to src\greenlet.egg-info\top_level.txt
reading manifest file 'src\greenlet.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files found matching 'benchmarks*.json'
no previously-included directories found matching 'docs/_build'
warning: no files found matching '*.py' under directory 'appveyor'
warning: no previously-included files matching '*.pyc' found anywhere in distribution
warning: no previously-included files matching '*.pyd' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
warning: no previously-included files matching '.coverage' found anywhere in distribution
adding license file 'LICENSE'
adding license file 'LICENSE.PSF'
adding license file 'AUTHORS'
writing manifest file 'src\greenlet.egg-info\SOURCES.txt'
copying src\greenlet\greenlet.cpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet.h -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_allocator.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_compiler_compat.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_cpython_compat.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_exceptions.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_greenlet.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_internal.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_refs.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_slp_switch.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_thread_state.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_thread_state_dict_cleanup.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\greenlet_thread_support.hpp -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\slp_platformselect.h -> build\lib.win-amd64-cpython-312\greenlet
copying src\greenlet\platform\setup_switch_x64_masm.cmd -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_aarch64_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_alpha_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_amd64_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm32_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm32_ios.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm64_masm.asm -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm64_masm.obj -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_arm64_msvc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_csky_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_m68k_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_mips_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc64_aix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc64_linux.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_aix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_linux.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_macosx.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_ppc_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_riscv_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_s390_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_sparc_sun_gcc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x32_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x64_masm.asm -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x64_masm.obj -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x64_msvc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x86_msvc.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\platform\switch_x86_unix.h -> build\lib.win-amd64-cpython-312\greenlet\platform
copying src\greenlet\tests_test_extension.c -> build\lib.win-amd64-cpython-312\greenlet\tests
copying src\greenlet\tests_test_extension_cpp.cpp -> build\lib.win-amd64-cpython-312\greenlet\tests
running build_ext
building 'greenlet._greenlet' extension
creating build\temp.win-amd64-cpython-312
creating build\temp.win-amd64-cpython-312\Release
creating build\temp.win-amd64-cpython-312\Release\src
creating build\temp.win-amd64-cpython-312\Release\src\greenlet
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -DWIN32=1 -IC:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include -IC:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\cppwinrt" /EHsc /Tpsrc/greenlet/greenlet.cpp /Fobuild\temp.win-amd64-cpython-312\Release\src/greenlet/greenlet.obj /EHsr /GT
greenlet.cpp
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(831): error C2039: 'use_tracing': is not a member of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(67): note: see declaration of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(834): error C2039: 'recursion_limit': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(834): error C2039: 'recursion_remaining': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(848): error C2039: 'trash_delete_nesting': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(867): error C2039: 'use_tracing': is not a member of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(67): note: see declaration of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(870): error C2039: 'recursion_remaining': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(870): error C2039: 'recursion_limit': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(881): error C2039: 'trash_delete_nesting': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(891): error C2039: 'use_tracing': is not a member of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(67): note: see declaration of '_PyCFrame'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(899): error C2039: 'recursion_limit': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
C:\Users\HOSEEN\AppData\Local\Temp\pip-install-xc72bsav\greenlet_678fbe0f7a954225994ef76c8021962b\src\greenlet\greenlet_greenlet.hpp(899): error C2039: 'recursion_remaining': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
src/greenlet/greenlet.cpp(3095): error C2039: 'trash_delete_nesting': is not a member of '_ts'
C:\Users\HOSEEN\AppData\Local\Programs\Python\Python312\include\cpython/pystate.h(115): note: see declaration of '_ts'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.39.33519\bin\HostX86\x64\cl.exe' failed with exit code 2
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for greenlet
Running setup.py clean for greenlet
Failed to build greenlet
ERROR: Could not build wheels for greenlet, which is required to install pyproject.toml-based projects

Python version: 3.12.2
OS: Windows
How can I fix this error?
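
The compiler errors ('use_tracing': is not a member of '_PyCFrame', etc.) are characteristic of a pre-3.0 greenlet being built from source against CPython 3.12, which changed those interpreter internals. greenlet 3.x supports 3.12 and ships prebuilt wheels, so forcing a newer greenlet first usually avoids the compiler entirely (a sketch, assuming nothing else pins greenlet below 3.0):

pip install --upgrade pip
pip install "greenlet>=3.0"
pip install -U hrequests[all]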

Handling AttributeErrors when parsing many different URLs

Hi, nice work on this library. I'm trying to parse a bunch of pages with it, but I'm running into issues where fetching content that doesn't exist throws an AttributeError. Here's an example:

resp = hrequests.get("some_url")
data = {}

try:
    data['url'] = resp.url
    data["canonical"] = resp.html.find("link[@rel='canonical']").url
    data["title"] = resp.html.find("title").text
    data["meta_description"] = resp.html.find("meta[name='description']").text

except AttributeError:
    pass

Because I'm calling .text and .url on these elements, if any element doesn't exist in the HTML response, the code throws AttributeError: 'NoneType' object has no attribute 'text', and the data dict only holds whatever was collected before the error, missing any other valid elements. For example, if there is no <title> element but the other three do exist, data will contain only the url and canonical values; it won't have meta_description.

The AttributeError makes sense, but when scraping at scale there will be errors, edge cases, and missing content, and I don't see a way to handle this gracefully. I'm fine with an empty string or None when a value is missing. Is there a better way to handle this? I could avoid the .url and .text properties, but then I'd have to handle everything downstream with a pile of if/else statements, and I'd prefer to parse the content early in the pipeline.
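
One lightweight pattern is a small helper that tolerates missing nodes, so each field degrades to a default instead of aborting the whole dict; safe_text below is our own name, not part of the hrequests API:

import hrequests

def safe_text(html, selector, default=''):
    # Return the matched element's text, or the default if the selector misses
    el = html.find(selector)
    return el.text if el is not None else default

resp = hrequests.get('some_url')
data = {
    'url': resp.url,
    'title': safe_text(resp.html, 'title'),
    'meta_description': safe_text(resp.html, "meta[name='description']"),
}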

PyInstaller support

Using hrequests, after creating an exe from my Python script, I get this error at exe startup:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\frige\AppData\Local\Temp\_MEI441402\hrequests\bin\CR_VERSIONS.json'

The script itself works fine when run directly.

import hrequests
session = hrequests.Session('chrome', version=103)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0',
    'Accept': '*/*',
    'Accept-Language': 'it-IT,it;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://sell.wethenew.com/login',
    'content-type': 'application/json',
    'Alt-Used': 'sell.wethenew.com',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'If-None-Match': 'W/17gujz3lxj828',
}

proxy = 'http://user:pass@host:port'  # placeholder; `proxy` was undefined in the original snippet
csrf = session.get('https://sell.wethenew.com/api/auth/csrf', headers=headers, proxy=proxy).json()['csrfToken']

This is the command that auto-py-to-exe runs:
pyinstaller --noconfirm --onefile --console --hidden-import "discord_webhook" "D:/Dev/Main.py"

Can anybody help me?

I tried adding hrequests to the hidden imports in auto-py-to-exe, but nothing changed.
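
Since the missing file lives inside the installed package (hrequests\bin\CR_VERSIONS.json), the usual fix is PyInstaller's data-collection flags; --hidden-import only pulls in Python modules, never data files. A sketch (untested against hrequests specifically):

pyinstaller --noconfirm --onefile --console --collect-data "hrequests" --hidden-import "discord_webhook" "D:/Dev/Main.py"

If binaries shipped with the package are also missing at runtime, --collect-all "hrequests" bundles its modules, data files, and binaries in one flag.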

Browser version selection not showing in request Header

How to reproduce the issue:
session = hrequests.Session(browser='chrome', version=112)
resp = session.get("https://httpbin.org/headers")
print(resp.json())
The User-Agent header reports Chrome/114.0.5731.1 instead of the expected Chrome/112.x.x.x.

Overriding encoding

In the requests library, if the wrong encoding is detected, it's possible to fix this manually by overriding the Response.encoding attribute.

In hrequests, this is unfortunately not possible; doing so raises the following exception:

resp.encoding = 'euc_kr'
AttributeError: property 'encoding' of 'Response' object has no setter

I was wondering if there's any other way to achieve this manual override, as some of the websites I'm working with are decoded incorrectly.
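
Until a setter exists, one workaround is to decode the raw bytes yourself and bypass .text's detected encoding entirely, assuming .content returns the raw body as bytes the way it does in requests:

resp = hrequests.get('https://example.kr')  # placeholder URL
text = resp.content.decode('euc_kr', errors='replace')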
