hip's Issues

Picking a good name

Last time I talked to @theacodes about this, she was pretty strongly of the opinion that it was great that we were basing this off urllib3 so we inherit all the hard-earned testing and corner-case handling, but that ultimately it was going to be much smoother if it ended up as a separate library that could be deployed alongside legacy urllib3 during the transitional period.

That means that we're going to need a new name!

Well, or we could just call it urllib3v2, but... no. I refuse.

I guess urllib4 is another obvious option. It does communicate "this is the replacement for urllib3", though IMO it'd be nice to take this opportunity to switch to a real, human-friendly name instead of like... a bad password. Also someone's already squatting the name on PyPI, though we could probably get it back if we wanted to argue about it.

Other ideas: getcha? posterboy?

What should we do with urllib3's securetransport and pyopenssl support?

urllib3 has code in urllib3.contrib to support using securetransport or pyopenssl for TLS, instead of the default stdlib ssl module.

This is accomplished by doing some pretty intrusive monkeypatching of urllib3's low level networking code. In hip we've replaced all that low-level networking code, and we additionally have the complication that now we have multiple different networking backends. So the existing code needs to change somehow.

I guess the first question is: should we keep it at all? Can someone summarize the problems that it's solving? I feel like there's probably a bunch of important but subtle details and history here that I don't understand, but we need to understand them before making a decision.

If we do keep it, does it only need to work for the sync backend, or does it need to work for the async backends too? Presumably the answer will depend on why it's useful in the first place.

And if we do keep it, can we not do wacky monkeypatching stuff? I'm not a fan of contrib modules in general... I feel like either we should support something properly as a full-fledged feature, or not support it.

Running unasync on test files

One of Hip's main goals is to support both asynchronous I/O (asyncio, trio...) and synchronous I/O. We achieve this by running unasync to generate synchronous code from asynchronous code and writing various I/O backends, which allows us to avoid any duplication between async and sync code. See #149 for more details.

We know how to test asynchronous code (see https://github.com/python-trio/hip/blob/master/test/with_dummyserver/async/test_poolmanager.py) and sync code (by using the urllib3 tests), but we don't know how to keep a single copy of the test files yet, apart from some vague plan of using unasync in tests too.

This issue is about figuring this out: I started thinking about it since we're about to modify unasync to cope with the new hip/ahip naming.

The idea is to generate synchronous tests, ship them as part of the source distribution, install that and run pytest on the installed tests. This new process won't be more annoying because we already need to install the sdist to run tests anyway.

File names

We want to transform hip/test/with_dummyserver/$ASYNC/test_poolmanager.py into hip/test/with_dummyserver/$SYNC/test_poolmanager.py. pytest allows using the same file name (test_poolmanager.py here) as long as we add __init__.py files in the appropriate places.

We need to choose the $ASYNC and $SYNC names. Since async becomes a keyword in Python 3.7, async and sync will only work if we don't need to import those test files. We don't need to import them right now but that's how the PyOpenSSL and SecureTransport tests work in urllib3, so maybe just sticking to _async and _sync is more future-proof.

File contents

The async test file is going to look like this after the hip/ahip rename:

from ahip import PoolManager

from test.with_dummyserver import conftest


class TestPoolManager:
    @conftest.async_backends
    async def test_redirect(self, http_server, backend):
        base_url = "http://{}:{}".format(http_server.host, http_server.port)
        with PoolManager(backend=backend) as http:
            r = await http.request(
                "GET",
                "%s/redirect" % base_url,
                fields={"target": "%s/" % base_url},
                redirect=False,
            )
            assert r.status == 303

            r = await http.request(
                "GET", "%s/redirect" % base_url, fields={"target": "%s/" % base_url}
            )

            assert r.status == 200
            assert r.data == b"Dummy server!"

And I think the sync test file should look like this:

from hip import PoolManager

from test.with_dummyserver import conftest


class TestPoolManager:
    @conftest.sync_backend
    def test_redirect(self, http_server, backend):
        base_url = "http://{}:{}".format(http_server.host, http_server.port)
        with PoolManager(backend=backend) as http:
            r = http.request(
                "GET",
                "%s/redirect" % base_url,
                fields={"target": "%s/" % base_url},
                redirect=False,
            )
            assert r.status == 303

            r = http.request(
                "GET", "%s/redirect" % base_url, fields={"target": "%s/" % base_url}
            )

            assert r.status == 200
            assert r.data == b"Dummy server!"

We need two changes here: ahip -> hip and async_backends -> sync_backend. We can specify those in the setup.py file or we can hardcode that in unasync, because hip is just that special :-).

The decisions we need to make:

  • Does that make sense?
  • What directory names? _async -> _sync?
  • Should we hardcode ahip -> hip and async_backends -> sync_backend, or allow specifying it? (A sketch of the latter follows this list.)
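
For the "allow specifying it" route, here's a minimal sketch of what the setup.py wiring could look like, assuming unasync's Rule / cmdclass_build_py hooks and the _async/_sync directory names discussed above; the exact paths and replacement pairs are illustrative, not decided:

# Hypothetical setup.py excerpt: generate the sync library and sync tests
# from their async counterparts at build time.
import unasync
from setuptools import setup

setup(
    name="hip",
    # ... the rest of the usual metadata elided ...
    cmdclass={
        "build_py": unasync.cmdclass_build_py(
            rules=[
                # Library code: ahip/... -> hip/...
                unasync.Rule("/ahip/", "/hip/"),
                # Tests: .../_async/... -> .../_sync/..., also swapping the
                # module name and the backend-selection decorator.
                unasync.Rule(
                    "/test/with_dummyserver/_async/",
                    "/test/with_dummyserver/_sync/",
                    additional_replacements={
                        "ahip": "hip",
                        "async_backends": "sync_backend",
                    },
                ),
            ]
        )
    },
)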

Figure out what interface the Response object should expose for streaming data

In urllib3, HTTPResponse inherits from io.IOBase and implements the standard Python file interface. This is an interesting issue that discusses a bug in their implementation, and also why people value this feature: urllib3/urllib3#1305

We need to decide what interface our version of HTTPResponse should expose. Constraints:

  • The async version can't inherit from IOBase, because IOBase is a sync interface, and for async we probably want to expose something compatible with trio's ReceiveStream API
  • But our sync version will need to provide some way to expose an IOBase, so people can keep feeding it into the csv module etc., like they're doing in the issue I linked above
  • We generally try to make sync and async interfaces identical, modulo async markers

You'll notice that these constraints are contradictory :-). Especially since the ReceiveStream and IOBase interfaces actually conflict with each other: they both support iteration, but for ReceiveStream it gives arbitrary chunks of bytes, while for IOBase it gives lines.

Some options:

  • We could have HTTPResponse implement the IOBase API in sync mode, and the ReceiveStream API in async mode.
  • We could have it implement IOBase in sync mode, and an async version of IOBase in async mode, and also provide an as_stream() method
  • We could have it implement ReceiveStream in async mode, a sync-ified version of ReceiveStream in sync mode, and also provide an as_file() method.
    • Further options: in async mode, the as_file method could either be not defined at all, or it could still exist and return an async version of the file interface
  • We could have it implement neither interface directly, but provide both as_stream and as_file methods (and then there's the same question about which methods are available in which modes). A rough sketch of an as_file() adapter follows this list.
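
To make the as_file() idea concrete, here's a rough sync-only sketch built on the standard io adapters; _RawBody and as_file are hypothetical names, and a response object with a read(nbytes) method is assumed:

import csv
import io

class _RawBody(io.RawIOBase):
    """Hypothetical adapter: wraps a response's read() so the stdlib io stack
    (and therefore csv.reader() and friends) can consume the body."""

    def __init__(self, response):
        self._response = response

    def readable(self):
        return True

    def readinto(self, b):
        chunk = self._response.read(len(b))
        b[: len(chunk)] = chunk
        return len(chunk)  # 0 signals EOF

def as_file(response, encoding="utf-8"):
    """Line-oriented, file-like view of a response body (sync mode only)."""
    return io.TextIOWrapper(io.BufferedReader(_RawBody(response)), encoding=encoding)

# e.g. rows = list(csv.reader(as_file(response)))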

Repo cleanup

I think we should:

  • go through and delete all the legacy branches we aren't actively using (I think that's everything except: bleach-spike, api-design, and disable-ci?)
  • go through and rename all the existing tags from <version> to urllib3/<version> or similar – it might be handy to keep them around for historical purposes, but we're restarting the version numbering, so they shouldn't just be called <version>. Or maybe it doesn't matter and we should just delete them?
  • rename bleach-spike to master
  • switch the project settings to make master the default+protected branch

Am I missing anything? This all requires some fiddly+intrusive git stuff, so we want to do it soon, but we also want to do it right...

Notes on "high-level" HTTP support

I've heard repeatedly that requests is a "high-level" HTTP API and urllib3 is a "low-level" HTTP API, but I wasn't super clear on what exactly that meant in practice. So I read requests today in hopes of learning more about what secret sauce it's adding. These are some notes/thoughts I wrote down while doing that.


Things that requests does:

  • Convenience functions: get, post, etc., plus all the helpers for encoding and decoding bodies. Including:

    • the thing where you can pass in a bunch of different formats of request body and requests will try to DWIM

    • ability to read the response body as bytes, text, json, streaming bytes, streaming text, lines, ...

    • charset autodetection when reading text/json

    • adding http:// on the front of URLs that are missing a scheme

    • choosing between Content-Length and Transfer-Encoding: chunked framing

    • measures time from starting the request until response headers are received (or something like that)

    • lots of fiddling with URLs and URL encodings. Seems to normalize URL encoding even. Uses urlparse, which is highly dubious. Not sure what this is actually about.

  • a "hooks" system that lets you register callbacks on particular events: https://2.python-requests.org//en/master/user/advanced/#event-hooks

    • In practice, it just lets you pass some functions that get to inspect/replace the Response object before it's returned to the user
  • automatic redirect handling. this has a lot of interesting complexities

    • there's some redirect defaults in requests/sessions.py for get, options, head... but not the other methods for some reason? I'm not sure what this actually does. (they're also in requests/api.py, but AFAICT that's totally redundant)

    • requests accepts file objects as request bodies, and streams them. This requires seeking back to the beginning on redirects

    • switching proxy configuration when redirected

    • saves the list of response objects from intermediate requests. I think the body is unconditionally read and stashed in memory on the response object! this is plausibly correct, since I guess redirect bodies are always small, and you want to consume the body so the connection can be released back to the pool? though there's some kind of DoS opportunity here if the server gives back a large body – normally a client can protect themselves against large response bodies by choosing to use the streaming body reading APIs, but in this case requests unconditionally fetches and stores the whole body and there's no way to stop it. Not the scariest thing, but probably should be considered a bug.

    • there's a lower-level "incremental" API: if you set allow_redirects=False, you get back the response object for the first request in the chain, and that object has some extra state on it that tracks where it is in the request chain, which you can access with requests.Response.next. However, I'm not 100% sure how you use this... seems like it would be more useful for next() to go ahead and perform the next request and return the response? maybe if the API were that easy, that could be what gets recommended to everyone who wants to see the detailed history of the redirect chain, and avoid some of the awkwardness about intermediate bodies mentioned in the previous bullet point?

    • the redirect code uses a ton of mutable state and is entangled with a lot of the other features here, which makes it hard to understand and reason about

  • authorization

    • digest auth handling is really complex: thread-local variables! relies on the hook system to integrate into redirect handling! reimplements the file seek on redirect handling (I have no idea why)! some silly code for generating nonces that combines os.urandom and ad hoc entropy sources!

    • if the user passed https://username:password@host/ URLs, requests removes them and converts them into a HTTPBasicAuth setup

  • cookies: you can pass in a cookiejar of cookies to use on requests, and it will automatically update it across a Session

    • also interacts with redirects
  • mounting different adapters

  • setting up TLS trust via certifi

  • proxy autoconfiguration from environment (reading .netrc, $http_proxy, $NO_PROXY, etc.)

  • forces all HTTP verbs to be uppercase -- interesting choice. the HTTP RFCs say that method names are case-sensitive, so in principle GET and get are both valid HTTP verbs, and they're different from each other. In practice, I've never heard of anyone using anything except all-uppercase, and apparently the requests devs never have either.

  • pickling!


Dealing with cookies is really messy! AFAICT there isn't a modern WHATWG-style standard yet, so no-one entirely knows how they're supposed to work. In Python there's the venerable http.cookies, whose API is super gnarly to work with, and I'm skeptical that its semantics are really suitable for the modern world, given that it comes from the same era as http.client. There's this cookies package on PyPI, which hasn't had a release since 2014, but the README has a bunch of important-sounding critiques of http.cookies. And... that's about it! Eek.


I've been thinking about architectures like: a package.Session with a high-level API (similar to requests), and package.lowlevel.ConnectionPool with a low-level API (similar to what urllib3 has now, but simplified), where the low-level object is "mounted" onto the high-level object, so we can use the mounting feature as an API for users to inject quirky configuration stuff. This is along the same lines as httpx's middleware system. httpx wants to use this for redirects/auth/etc. Looking at the notes above, I'm not sure they're that easily splittable? But I need to think about it more. What parts would people reasonably need to be able to mix and match?

Note: I literally just mean I've been thinking about these kinds of architectures, like as a design puzzle. I'm not saying that we should adopt an architecture like that for this project. But I am pondering what we should do.

Remove HTTPHeaderDict.from_httplib

We're not using httplib (the old name for http.client), so all mentions of from_httplib can be removed from the code.

This is a good first issue! If you're interested, please ask for help here.

Disabling cert verification: can we improve security without punishing users?

@sethmlarson thinks that the classic verify=False makes it a bit too easy to disable cert checking, so it encourages folks to do it without fully understanding the consequences. Of course we don't want to make it punitively difficult either. Often the folks who have to disable cert checking hate that as much as we do, and are just stuck in an impossible situation. But some ideas for things we could potentially do:

  • Make the kwarg more verbose, so it's more obvious what it does and that enabling it is insecure, e.g. disable_secure_certificate_validation=True
  • Issue a warning whenever it's used... folks who want to disable the warning could still do so, but it makes the issue visible without actually stopping anyone from getting their work done
  • Make it very easy to trust a specific certificate, so that TOFU-style trust for self-signed certificates is straightforward. Ideally this should be even easier than disabling cert validation entirely.

support for request targets that can't be specified as URLs

There's a weird special-case in HTTP: the OPTIONS method (only!) can be used with a request-target of * to request general info about the server, rather than any particular file on the server (details). What makes this weird is that there's no way to specify this target as a URL: if you do hip.request("OPTIONS", "https://github.com"), that should generate a request-target of /. OPTIONS * HTTP/1.1 and OPTIONS / HTTP/1.1 are both meaningful things, and they're different from each other, so we need some other way to get a * target.

There's another weird special case: the CONNECT method (only!) doesn't take a destination URL, but host + port.

These aren't necessary for an MVP, but we should support them eventually. What are the API implications?

One approach that occurred to me was to offer a low-level API that lets you directly specify the HTTP "request-target" (the thing in-between the method and HTTP/1.1). Normally this is determined by a combination of the target URL + proxy setup, but conceivably people might need to do some wacky stuff to interoperate with custom almost-HTTP servers. But, I don't think this is the best approach here, because it's not actually sufficient for either the OPTIONS * or CONNECT cases.

For OPTIONS *, you normally set the request target to *... but if you're going through a proxy, then you instead use a URL without a trailing slash, like OPTIONS https://github.com HTTP/1.1. (This also means that if the user does hip.request("OPTIONS", "https://github.com"), and is using a proxy, then we need to convert that into OPTIONS https://github.com/ HTTP/1.1, i.e. make sure to add the trailing slash). So this will need special case handling; we can't just cover it with a generic "set the request target" thing.

And for CONNECT you definitely need special-case handling, because instead of a regular response body you get an arbitrary bidirectional byte stream (at least if the request succeeds).

So, maybe we want the core public API to be something like:

# hip.get, hip.post, hip.options, etc. are trivial wrappers around this:
hip.request(method, url, ...)
# These are special and can't be implemented in terms of hip.request
hip.options_asterisk(host, ...)
hip.connect(target, ...)

(Of course internally these would probably all immediately call into the same _real_request interface. But the point is that an interface that can handle all these weird cases at once is maybe too weird and error-prone to make public.)

A few more notes that aren't exactly on topic for this issue, but are kind of adjacent and worth writing down:

  • In Go apparently they sometimes use a form like CONNECT /some/path HTTP/1.1, as part of an ad hoc RPC system: https://golang.org/src/net/http/request.go#L1035. Not sure we would ever care about this, and it's semi-deprecated anyway, but hey, if we take a target string like hostname:port, then we can automatically support this too? Or maybe it's better to ignore this weird edge case and make the signature hip.connect(host, port, ...)

  • The HTTP/1.1 Upgrade header is another mechanism where you do an HTTP request and then it turns into a bidirectional bytestream. But it's a bit different from CONNECT, because it takes a proper URL. HOWEVER, there is one complicated case where they interact: HTTP/2 doesn't have an Upgrade header; instead, you do a special variant of CONNECT that does take a full URL. If we want to support this (and we probably do eventually, for e.g. websockets), then we can't just expose Upgrade: and RFC-8441-extended-CONNECT, because you have to pick between them based on which protocol you're using, and we don't know that until we've already made the connection to the host. So we probably also need a top-level hip.upgrade(...) function that exposes the common subset of Upgrade: and RFC-8441-extended-CONNECT, and internally it picks between them. This would be another top-level function that can't be implemented in terms of hip.request.

Mysterious PyPy2 test case failure

We just merged this new test from urllib3's master branch, and it fails on Travis with PyPy 2 with the following traceback:

____________ test_wrap_socket_given_ca_certs_no_load_default_certs _____________
Traceback (most recent call last):
  File ".../urllib3/test/test_ssl.py", line 111, in test_wrap_socket_given_ca_certs_no_load_default_certs
    ssl_.ssl_wrap_socket(sock, ca_certs="/tmp/fake-file")
  File ".../urllib3/.tox/pypy/site-packages/urllib3/util/ssl_.py", line 382, in ssl_wrap_socket
    context.load_verify_locations(ca_certs, ca_cert_dir)
  File ".../urllib3/.tox/pypy/site-packages/mock/mock.py", line 1061, in __call__
    _mock_self._mock_check_sig(*args, **kwargs)
  File ".../urllib3/.tox/pypy/site-packages/mock/mock.py", line 209, in checksig
    sig.bind(*args, **kwargs)
  File ".../urllib3/.tox/pypy/site-packages/funcsigs/__init__.py", line 792, in bind
    return args[0]._bind(args[1:], kwargs)
  File ".../urllib3/.tox/pypy/site-packages/funcsigs/__init__.py", line 717, in _bind
    raise TypeError(msg)
TypeError: 'cadata' parameter lacking default value

The traceback is also recorded in the commit message, and you can currently see it in the urllib3/urllib3 master builds. For example in https://travis-ci.org/urllib3/urllib3/jobs/572772982 (search for test_wrap_socket_given_ca_certs_no_load_default_certs).

Possibly relevant metadata:

$ python --version
Python 2.7.13 (ab0b9caf307db6592905a80b8faffd69b39005b8, Apr 30 2018, 08:21:35)
[PyPy 6.0.0 with GCC 7.2.0]
$ python -c "import ssl; print(ssl.OPENSSL_VERSION)"
OpenSSL 1.1.0h  27 Mar 2018

Anyway, the test fails with TypeError: 'cadata' parameter lacking default value in an autospec mock. cadata is a parameter that was added to SSLContext.load_verify_locations in the standard library ssl module in CPython 2.7.9. (CPython 2.7.9 was released in December 2014 and backported the whole ssl module from Python 3.4.)

For CPython < 2.7.9, urllib3 provides a custom SSLContext that only exposes the available options, so for example load_verify_locations is called without cadata. At some point, PyPy was probably just calling OpenSSL's SSL_CTX_load_verify_locations directly, but if I understand correctly they had to switch to the CPython implementation to support cadata.

Okay, so, it looks like it's a mock/funcsigs bug they decided not to fix: testing-cabal/mock#438. What needs to be done here is to link to this issue and to the mock issue in the code, so I'm marking this as a good first issue!

When should we run unasync?

We're using unasync to generate synchronous code from async code. We have a few choices about when to run it. Which is best? There's some logic behind our current approach, but we haven't really talked it through, and @sethmlarson raised some questions about it, so let's discuss.

Options

Option 1: We could run unasync at commit time, and check the generated code into the repo. (With a CI check to make sure that it stays up to date.)

Upsides: tooling generally expects your Python code to live inside the repo, and optimizes for this case. So e.g. setuptools, pytest, flit, development installs, etc. all automatically work. It's easy to look at the final code since it's present on disk / in PR diffs etc., instead of having to imagine it. You can use the library directly from a checkout without any "build" step. Coverage "just works".

Downsides: since the generated code affects most of the library logic + tests, almost all commits will cause changes in the generated code. That means we'd have to run the generator script on like, every single commit, and every PR diff would have double copies of every change. We get to tell contributors that they forgot to re-run the generator script, over and over.

Option 2: We could run unasync when building wheels or installing. (This is what we do now.)

Upsides: Only actual "source of truth" code gets checked into version control, so no doubled diffs, no need to worry about things getting out of sync, etc.

Downsides: We need a "build" step before we can use the code, which makes testing and packaging more complicated. (E.g., this rules out using flit. Unless we can convince @takluyver to add a hook to run some arbitrary code during flit's build step, but so far he's been resistant to adding such features :-).) Though I guess even with "option 1" we'd still want to make our test harness always re-run the code generator before running tests, b/c anything else is too prone to confusing mistakes, so that's probably pretty similar between these two approaches. Coverage is messy though. (Maybe we could use some hack to re-map coverage of lines in _sync/*.py back to the corresponding file in _async/*.py?)

Option 3: We could run unasync at import time.

Upsides: can't possibly get out of sync; no need for special workflow steps.

Downsides: Probably makes import too slow to be acceptable. Source may not be available at import time (some people distribute apps with only .pycs included, no .pys). And in fact, eval may not even be available at import time (e.g. Dropbox patches it out of their client's bundled interpreter as a hardening measure, or some environments may disable it using PEP 578 audit hooks). If we want to run unasync on tests (and we do want to run unasync on tests), then we'd have to figure out how to make our magic import stuff play nicely with pytest's magic import stuff (where they rewrite asserts).

Are there any other options I'm missing?

Comments

Code generation at import time isn't really viable, except maybe as a convenience for development. In that case it might have some merit – anything that makes test iterations faster is great! – but for that we'd still have to figure out how to get pytest to play along, which sounds painful. (I once went digging in pytest to try to figure out how their import magic works, in hopes of fixing non-src/ layouts to work better, and I completely failed to understand it at all.)

So aside from that special case, I think the choice is really between committing the generated code, versus auto-generating it at build time.

For Trio, we have some generated code, and we actually commit it to the repo, like Option 1. But that's a bit of a different situation: the generated code in Trio is all trivial boilerplate, and rarely changes. So having to re-generate it manually isn't much of a burden on contributors – in fact most contributors never encounter it at all. And in the rare cases where it does need to be regenerated, it's easy to ignore during reviews – because there's nothing interesting in those files anyway, so you can just skip them.

For this project, most of the interesting logic and tests are in files that affect the generated code, so I think we'll need to optimize our regular contributing workflow around dealing with that. That's why we decided to go with build-time generation initially – it has more of an up-front cost in terms of building the infrastructure to make it work at all, but it's not much worse than any project with a build step (e.g. any project that uses C extensions), and the hope is that it'll produce less manual churn during day-to-day development.

I could be wrong about that though :-). If anyone else has thoughts, please share!

Remove unused is_fp_closed function

This function is only covered because it's tested explicitly; it's otherwise unused. This means we can remove src/urllib3/util/response.py.

Move urllib3 into src/urllib3

I think we have to do this if we want to get tox working, because of these issues: https://docs.pytest.org/en/latest/goodpractices.html#choosing-a-test-layout-import-rules

And, weirdly... it might actually help with our problems with git having trouble figuring out the diff, because right now git is convinced that e.g. the new urllib3/_async/connectionpool.py can't be the same as the old urllib3/connectionpool.py, even though their contents are almost identical, because our branch still has a file called urllib3/connectionpool.py (and who cares that it's completely different from the old urllib3/connectionpool.py). But if we rename urllib3/ to src/urllib3/, then git will fall back on looking at content to try to match up files, and should decide that the old urllib3/connectionpool.py matches src/urllib3/_async/connectionpool.py, instead of src/urllib3/connectionpool.py.

At least... I hope that's true?

Add http2 support

Having support for HTTP/2 would be nice, but it's not a priority right now. We can use requests-core's h2 branch as a reference.

review connection lifecycle

The intended lifecycle of a urllib3 connection involves being set up, put into a connection pool, taken out of the connection pool, used, put back in the connection pool, etc. Some of these steps are tricky:

  • if something goes wrong when setting up the connection (e.g., can't get the socket to connect, TLS handshake error) then it should never be put into the pool
  • when taking it out of the pool, it might have been closed by the server, so we have to check whether it's still usable. (Generally, "if the socket is readable, it's not usable" is the heuristic we want, since readability indicates that the server has sent some sort of 'go away' message or just closed the socket; making this work with Twisted is non-trivial. A sketch of the sync-backend check follows this list.)
  • we should only put it back into the pool if it's in a clean and ready to re-use state. (If our caller only reads half the response and then stops, we can't re-use that connection.)
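
Here's a minimal sketch of that sync-backend check, assuming a plain socket object; the function name is illustrative:

import select
import socket

def looks_reusable(sock: socket.socket) -> bool:
    """Heuristic from the list above: a pooled connection that has become
    readable before we send anything has either been closed by the server
    or has received some kind of "go away" data, so don't reuse it."""
    try:
        readable, _, _ = select.select([sock], [], [], 0)
    except (OSError, ValueError):  # the fd may already be closed
        return False
    return not readable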

Test the Proactor event loop on Windows + Python 3.8

There are two issues here:

  1. Tornado does not intend to support the Proactor event loop, so we need to configure Tornado to use the selector loop without changing the global policy
  2. AnyIO does not support the Proactor loop yet, but we rely on AnyIO to support asyncio

So this is blocked until AnyIO supports the Proactor event loop or until we add a native asyncio backend.

Discussion: the urllib3 bleach-spike branch

Edit: current status

There's a lot of discussion and updates below, making this issue hard to read if you just want to know what the current status is. So tl;dr:

This is a project to add support to urllib3 (and ultimately the packages that depend on it like requests, boto, etc.) for async mode with multiple IO library backends (currently trio + twisted, and it's easy to add more), while keeping the original sync API mostly intact. For a big picture overview and demo see: urllib3/urllib3#1323

We're definitely looking for help to move this forward. If you want to help see below for some summary of what needs to be done, and you can post in this thread or join us in our chat if you want more information.

Things that are working:

  • Basic HTTP requests using sync mode, trio mode, twisted mode (again, see urllib3/urllib3#1323 for a demo, or check out the demo/ directory)

  • The test suite runs, though lots of tests are temporarily marked xfail or skip.

Things that need to be done (incomplete list)

  • Get the tests running on travis, so we can make sure new PRs aren't breaking things

  • Review all the tests marked xfail or skip and one-by-one make them pass (or delete them if they no longer apply). This will require fixing a bunch of things (timeouts, https, etc.)

  • Make it so trio and twisted aren't import-time requirements.

  • Take the syncifier code in setup.py and factor it out into its own project. Also it would be nice to get it running under python 2.

  • Get the tests running on python 2 (they should work if you can get things set up, but we currently can't build on python 2 because of the previous bullet point, so testing is awkward)

  • Get tox working (for now, use runtests.py instead)

  • Maybe make close methods async? (see #1 (comment) and replies and links therein)

  • Currently we're only testing the synchronous mode; we need to add tests for the different async backends. This probably requires making parts of the test suite async (with some kind of parametrization to test against multiple backends) and then using the syncifier script to generate a synchronous version of the tests to test the synchronous mode.

  • Add tests for new features, in particular servers that send an early response.

  • Eventually: we'll need to update the docs.

  • Perhaps set up some kind of more useful way of tracking this todo list

Getting started

So you want to help? Awesome!

  1. Check out the bleach-spike branch from the python-trio/urllib3 repo

  2. You should be able to run the tests with python runtests.py (this currently requires python 3.5+)

  3. You should see lots and lots of output, and eventually get a message from pytest that looks like "683 passed, 241 skipped, 31 xfailed, 20 warnings".

  4. Hack on the code to make the "passed" number go up, and the other numbers go down. Or start working on any of the other suggestions up above.

The below is the original post that started off this thread.


Original post

(See urllib3/urllib3#1228 for context)

The bleach-spike branch in this repo is based off of @Lukasa's v2 branch in the main urllib3 repo. The idea is to experiment with converting urllib3 to support three backends/public APIs: plain sockets (like it does now, using select/poll internally), trio (with an async-based public API), twisted (with an async-based public API). [These were chosen as a representative sampling of approaches, i.e. if it works for these it'll probably also work for asyncio/tornado/curio/..., modulo specific limitations like asyncio's lack of starttls.] I'm putting some notes here because @merrellb asked if he could help, so :-).

Here's the diff link: urllib3/urllib3@v2...python-trio:bleach-spike

The overall goal here is not to immediately create a polished product, but just to test the viability of this approach, so a bit of sloppiness is OK; the goal is to get a proof-of-concept as efficiently as possible.

The basic idea is: we define a common "backend API" that provides a common set of socket operations, and acts as the abstraction boundary between urllib3 and our different network backends. notes.org has some details on what my current draft of the API looks like. urllib3/backends/ has the actual backend implementations. The backend uses async/await, but the plain socket implementation of it never actually takes advantage of this -- its methods never yield. The trio and twisted backends of course assume that they're running in a trio or twisted async context.

Done:

  • Define a plausible backend API that I think provides everything we need for HTTP/1 support
  • Implemented the three backends, but there's probably a bunch of terrible bugs because I haven't tried running the code at all
  • Rewrote sync_connection.py in terms of the backend API, but again haven't actually tried running it. I think it's much more readable and reliable (once we fix the silly bugs) than the version in v2. (The filename is now something of a misnomer, but I figured I'd leave it alone for now to make the diff more readable. See above re: proof-of-concept.)

Todo:

  • Thread through these changes to the rest of urllib3; in particular:
    • we need to pass the backend object through from the connection pool through to the individual connection objects
    • we need to propagate async/await markers out from sync_connection.py out to all their callers
  • Figure out how connection pooling interactions (return connection to pool, retrieve connection from pool) are supposed to work... that's the part of sync_connection.py that I didn't really understand, so I put in some best-guesses but it's probably wrong. Probably there are other interactions with the rest of the world that aren't quite settled as well.
  • Attempt a basic GET request on all the backends, and then iterate on fixing all the syntax errors etc. until it works
    • https exercises unique paths
    • proxy support exercises unique paths

The trio code currently assumes the existence of trio.open_tcp_stream, which, uh, doesn't exist yet. I realized today though that contrary to previous reports we don't need to robustify SSLStream to make this work (i.e., python-trio/trio#198 is not actually a blocker). (The reason is that I changed the backend interface here a little bit in a way that both simplified the code here and also removed this problem. Specifically, before I was trying to use read_and_write_for_a_while to send the request and then switch to receive_some to read the response headers; now I just use a single call to read_and_write_for_a_while to do both.)

Further items that maybe aren't on the critical path:

  • Implement a "bleaching script" that uses the tokenize module to automatically transform the library back into a synchronous library by deleting async and await tokens, plus replacing __aiter__ → __iter__ and __anext__ → __next__
  • Go through all the todo items marked XX in the source

CC: @merrellb

Should we drop Python 2.7 support?

When this project was conceived in mid-2017, it was obvious that we needed to support Python 2.7, because:

  • Python 2.7 was going to be a thing until April 2020.
  • As of November 2019, urllib3 still gets slightly more than 50% of its downloads from Python 2, does not plan to drop 2.7 anytime soon, and Red Hat is going to support Python 2.7 until 2024.
  • This was designed as urllib3 v2, so it was going to support the same versions as urllib3.

However, it's now clear that this project won't be urllib3 v2, and we're going to offer a completely different API. We won't offer a stable version soon, and nobody should be starting new Python 2.7 projects.

What is Python 2.7 support costing us? Surprisingly, not much, because it's already here, so it's not a lot of effort to keep the support. Here are possible benefits of dropping Python 2.7:

  • reduce the number of builds we have to worry about in CI
  • use nonlocal instead of the context hack in connection.py, and remove other Python 2 compat code, especially Python < 2.7.9 SSL stuff
  • stop excluding async backends from coverage reports
  • use the latest versions of test dependencies such as Tornado and pytest, which no longer support Python 2.7
  • possibly stop using code generation for runtime code and use run_secretly_sync_async_fn instead for sync support
  • possibly use type annotations without hacks

My personal conclusion: I think it makes sense to remove Python 2.7 support (and App Engine!) at the same time we stop merging changes directly from urllib3, which should happen real soon now (TM).

Discussion: What do we call the async versions of hip.get, hip.post, etc.?

So everyone's used to being able to do:

import requests
response = requests.get(...)

So I'm assuming we will also want to support:

import hip
response = hip.get(...)

But what if you want to do an async get? If hip.get is already claimed for the sync version, then we can't use that for async gets.

There are three options I've thought about:

Option 1: Don't offer top-level async helpers at all. If you want to make async requests, you have to make an AsyncSession.

This is what I originally thought we'd want to do, because most users generally want to re-use connections, and especially b/c we want to support HTTP/2 in async mode, which requires background tasks and a shared connection pool. So pushing them towards sessions is the way to go, and we'd treat the no-session mode as a kind of simplified legacy interface for sync users only.

But... based on the discussion in #125, it sounds like we're going to be able to support HTTP/2 and shared connection pools even if users aren't explicitly using a session. So that makes the top-level hip.get-style helpers way more useful than before, and makes them equally useful in both sync and async mode. So maybe we want them after all!

Option 2: We could follow the same pattern as Python uses for e.g. __iter__ vs __aiter__, and have:

response = hip.get(...)  # sync
response = await hip.aget(...)  # async

This would work. There are some downsides though: it makes the async version feel a little more awkward and "second class". It means we either need to rename the session methods too (await async_session.aget(...)), or else have an inconsistency (await hip.aget(...) / await async_session.get(...)). And if someone accidentally writes hip.get inside async code, then it'll seem to work but cause nasty performance problems, as it blocks the whole event loop.

Also, making users pick between these two modes on a call-by-call basis just feels a bit wrong: generally either your whole program is sync, or else your whole program is async; it's not going to magically change between two lines in the same file.

Option 3: Use different namespaces. E.g.:

# Sync
import hip
response = hip.get(...)

# Async
import ahip
response = ahip.get(...)

This seems pretty attractive. We could even remove the class name munging code, so instead of hip.Session and hip.AsyncSession, we'd just have hip.Session and async_hip.Session – make the two APIs completely identical except for the top-level namespace and the async/await annotations. And it has the nice property that if you're using async, you import the async namespace at the top of your file, and then you can't accidentally use the sync version; and the async version gets to use all the nice unqualified names like get.

It also makes life somewhat simpler for code that wants to build on top of hip, while using unasync to support both sync+async modes. You just do if ASYNC_MODE: import async_hip as hip; else import sync_hip as hip, and then the rest of your file uses hip and it just works in both modes. With "option 2" above you can also make this work, but you need a more sophisticated version of unasync that can munge aget → get, etc.

If we go down this road then we'll need to do some major bikeshedding to pick the names – hip vs async_hip? sync_hip versus async_hip? hip.sync versus hip.async_? (Unfortunately you can't have a module named async because it's a keyword :-(.) hip.s versus hip.a? This is kind of a second-order issue, because if separate namespaces are a bad idea, then we don't need to even think about names. But it's also kind of important, because my main hesitation with this approach is that all the "async hip" names look kind of ugly.

I guess we could even take this a step further and distribute hip and ahip as two separate packages on PyPI. Not sure if that's better or worse than bundling them together.

Timeout handling

We don't have a coherent story for timeout handling yet. I think at this point we do pass timeouts through from the external APIs into the backends, but:

  • the Trio and Twisted backends don't actually implement timeouts yet

  • we don't have a general strategy for reporting timeouts

It used to be that timeout handling involved some coordination between the urllib3 code and the network layer, where urllib3 would set some special options to make timeouts happen, and then catch the native timeout exceptions (like socket.timeout) and translate them into the exceptions used in urllib3's public API.

I think we should figure out what exceptions our users are expecting, and make all the backends responsible for raising these directly, and remove the translation layer.
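
As a rough sketch of that idea for the Trio backend (ReadTimeoutError stands in for whichever public exception we pick, and receive_some for the backend's read primitive):

import trio

class ReadTimeoutError(Exception):
    """Placeholder for whatever public timeout exception we settle on."""

async def receive_with_timeout(stream: trio.abc.ReceiveStream, timeout: float) -> bytes:
    """Enforce the timeout inside the backend and raise the public exception
    directly, instead of translating socket.timeout after the fact."""
    try:
        with trio.fail_after(timeout):
            return await stream.receive_some(65536)
    except trio.TooSlowError:
        raise ReadTimeoutError("read timed out after %ss" % timeout) from None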

[API design] Streaming upload API: push-style or pull-style?

[This is partially discussed in #124, but that's a huge omnibus issue so I'm pulling it out into a more focused one. This is also capturing some notes from a voice chat @sethmlarson and I had last week.]

Say you're uploading some data as a request body, and you want to generate it on the fly. There are two basic API approaches that hip could offer. In a "push" style API, the user pushes us data as it becomes available:

request_object = await hip.start_request(...)
await request_object.send_all(data1)
await request_object.send_all(data2)
response = await request_object.send_eof()

In a "pull" style API, the user provides an (async) iterable, and hip pulls data out of it:

async def generate_body_parts():
    yield data1
    yield data2
await hip.post(..., body=generate_body_parts())

These are kind of duals of each other. Generally, Trio's style prefers the "push" approach, for two reasons: First, keeping the loop in user code makes the control flow more explicit and transparent, so I think it's easier to reason about. And second, if you have a push-style API, then it's easy to implement a pull-style API, but going the other direction is much more awkward:

# Pull on top of push is just a for loop:
async for data in generate_body_parts():
    await request_object.send_all(data)

# Push on top of pull needs more complicated machinery:
send_side, receive_side = trio.open_memory_channel(0)  # 0 = an unbuffered channel
async with trio.open_nursery() as nursery:
    nursery.start_soon(partial(hip.post, ..., body=receive_side))
    await send_side.send(data1)
    await send_side.send(data2)
    # FIXME: also have to retrieve the return value from hip.post somehow

Push-style APIs are also much more natural if we want to support bidirectional HTTP, i.e. letting people send a bit of the request, then read a bit of the response, then send a bit more of the request, etc. (I'm not sure we do want to do this, but still, it's a trade-off to consider.)

However, for hip's purposes, the push-style approach has some significant downsides:

  • For HTTP uploads specifically, it's very common to already have some kind of object that represents the body (a byte-string, some JSON, an open file, an async generator, etc.), and it's natural to want to just pass this object as a simple kwarg like body=. We probably don't want to make people explicitly write a for loop every time. So, if we supported a push-style API, then we'd also end up adding a pull-style option as sugar on top of it, and then we have two APIs.

  • Push-style APIs expose a lot more states in the API. With a classic requests.get API, you call the function, and it doesn't return until at least the whole request body has been delivered and the response headers have been received. With a push-style API, I guess we'd need to... first return an object that represents the request-in-progress, and then the user would push data through it for a while, and then when they mark the request body complete then that would need to return a second object that has the response headers etc.? And what happens if the server starts responding early? This makes the API much more complex to design and use.

    • In particular, the "early response from server" case is important for bidi HTTP: for normal standard-conformant HTTP, if the server does an early response then you want to hard-abort sending the request body. The whole point of bidi HTTP though is that you don't do this. So even if we did have a push-style API, that wouldn't be enough on its own to support bidi HTTP; we'd still need some extra API surface area beyond that.
  • Push-style APIs are even more awkward if you want to handle redirects and retries, where you might have to rewind the body and re-send it. So now we need some way for hip to tell the user that they've been redirected and need to start over on pushing the body, etc.

Putting this all together, I think the conclusion is that we should follow requests instead of Trio, and expose a pull-style API instead of a push-style API.

If we do decide that bidi HTTP is an important use case (e.g. so grpc clients can use hip), then another possibility would be to make the API look similar to an Upgrade/CONNECT. For Upgrade/CONNECT, I imagine that our API will involve sending a request, getting a response object, and then if it succeeded calling some method on the response object to get a bidi byte stream for the new protocol to use. For bidi HTTP, we could use a similar API pattern, where during the request you set some special flag that says "I want to do bidi HTTP", and then we use some special internal processing that doesn't send any request body immediately, but then lets you use the pull-out-a-byte-stream method to get a byte stream referring to the request and response bodies. I think that will end up being much more natural.

How to track session-related state beyond cookies?

@sethmlarson points out that there are a bunch of bits of state that HTTP clients might want to track across otherwise-independent requests:

  • cookies
  • caches
  • altsvc
  • permanent redirects
  • hsts
  • (what else?)

Traditionally http client libs handle cookies, but mostly drop the other stuff on the floor. Maybe we should have a more comprehensive strategy. I think he's thought about this some, so I'll let him fill in more details :-)

How to handle Response bodies?

Some time ago @njsmith and I discussed how best to implement the "streaming" / "preloading" of Response data and what that interface might look like.

Existing Implementations

urllib3 has preload_content=False on the HTTPResponse object to signal that the data should not be loaded into memory and should instead be .read() until empty.

Requests has stream=True on the .request() method which changes whether preload_content is called on the urllib3.HTTPResponse object but also provides functions like .iter_content() with optional decoding of the raw binary data into text. Requests also provides a .raw object which if I remember correctly doesn't do any decompression of the raw data stream.

Other ideas include having different methods on the Response itself that trigger streaming versus load into memory (.text() / .stream_text()). This way is tough because we then have to worry about GC and connection issues if the user is using the library to only check the status code, in an interactive console, etc.

Another idea is to have a separate method for streaming, but that'd move away from our "one function entrypoint" for no real benefit, maybe.

Ideas for Hip

How to configure streaming?

Definitely going to support both streaming and preloading of the body onto the Response.

Users expect the body to be preloaded by default, as is the case in Requests and urllib3. This makes calling things like .text and .json easy, and also ensures your connection is returned to the pool by default, with no GC / pooling issues.

Using typing.Literal[True] and typing.Literal[False] along with overload, we should be able to achieve proper type hinting and IDE support for stream=True/False, so that tooling knows whether a given response will be streamable or not. Requests uses the stream=True/False flag, so it'd be familiar to users. We're both leaning towards this as the solution.

Calling .text and .json on a Response doesn't make sense in a streaming context, so @njsmith and I thought it'd be a good idea to have two response types, one for preloaded mode and one for streaming mode. Then we can avoid the awkward state on HTTP responses.
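
A rough sketch of the overload approach (Response and StreamingResponse are placeholder names, and typing.Literal needs Python 3.8+ or typing_extensions):

from typing import Literal, Union, overload

class Response: ...            # preloaded body: .text / .json available
class StreamingResponse: ...   # streaming body: read incrementally

@overload
def request(method: str, url: str, *, stream: Literal[True]) -> StreamingResponse: ...
@overload
def request(method: str, url: str, *, stream: Literal[False] = ...) -> Response: ...

def request(method: str, url: str, *, stream: bool = False) -> Union[Response, StreamingResponse]:
    """Runtime implementation; the overloads above let type checkers and IDEs
    know which response type a given stream= value produces."""
    raise NotImplementedError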

What happens when Response body is larger than memory?

Current behavior for HTTP clients is to load everything into memory by default.

  • Maybe we could set a reasonable limit and error out if more than that is loaded into memory?
  • Maybe detect memory size and error out on that threshold by default?
  • stream=True case doesn't have to worry about memory, users are able to make their own decisions about memory.

How to access a text stream / file-like object?

I've wanted to add a mode to get back a file-like object for interfaces that only support that (think csv.reader(), etc) as this has been a pain point for playing with HTTP response data.

Requests also has .iter_content() with decoding support, what would this interface look like for Hip? Requests chunk sizes are also always related to the amount of bytes read, rather than the amount of text that is decoded. This struck me as confusing but maybe it's still the right thing to do (limiting actual data read from the socket not data chunks coming from the stream).
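
For the decoding question, one possible shape is an incremental decoder wrapped around the byte-chunk iterator, which keeps chunk sizes tied to bytes read from the socket rather than decoded text; iter_text is a hypothetical name:

import codecs

def iter_text(chunks, encoding="utf-8"):
    """Sketch: decode an iterable of byte chunks incrementally, so the chunk
    size still refers to bytes read from the socket, not decoded characters."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush any buffered partial character
    if tail:
        yield tail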

Improve documentation UX for sync and async APIs

Some discussion around how we can provide both examples and an API reference for Hip, given that the two namespaces are very similar but have basically two separate audiences.

A great idea from @njsmith was to do something similar to a lot of projects that support multiple programming languages (Python, Java, Ruby, etc.) and have a "Sync / Async" toggle in the documentation navigation. Toggling would change all code examples and autodoc refs between ahip and hip, include or exclude async/await, etc.

This would be especially useful for the reference as it will be mostly symmetrical thanks to unasync.

Some thoughts and ideas that came up:

  • When you're on a sync / async page toggling the slider should bring you to the corresponding page and restore your scroll position on the new page.
    • The toggle can probably be a piece of JavaScript that does URL rewriting as long as the doc URLs are symmetric (/ref/async/... to /ref/sync/...)
    • Could use fragments, scroll %, a combination of both to restore scroll position?
  • There will probably be a need for being able to mark sections as "sync-only" and "async-only" for gotchas, tips, limitations, etc
  • Should this slider be extended to Sync, Trio, Asyncio, Curio?
    • Some examples will probably want to use specific APIs like trio.run() or trio.open_nursery()
      for showing off how to do multiple concurrent requests.

Initial runtests.py run fails on macOS - ModuleNotFoundError: No module named 'h11'

On macOS (10.13), running runtests.py fails because h11 is not installed correctly. This has been true since the beginning of runtests.py, even though I'm only reporting it now.

When installing the dev requirements, I expect "Successfully installed h11-0.7.0+dev" but I see "Successfully installed h11". Even though pip reports success, h11 and h11-0.7.0+dev-py3.6.egg-info are not in test-venv-cp36m-macosx_10_13_x86_64/lib/python3.6/site-packages.

Workarounds:

  • Using venv instead of virtualenv, but we want to support Python 2 eventually so that's not useful
  • Install h11 manually in the newly created venv
  • Activate the newly created venv, install an updated wheel, then run runtests.py

I tried to use activate_this.py without much success. I've wasted a lot of time with this issue, so I'm not planning to investigate further, but I thought I would report what I've tried.

Review HTTPS

We're probably not doing anything globally sensible here. We need to go through and make sure everyone is exposing the right APIs and generally agrees on what they are doing.

Disable do_handshake_on_connect in ssl.wrap_socket

Python 2.x and Python 3.x have a parameter on ssl.wrap_socket called do_handshake_on_connect: https://docs.python.org/3/library/ssl.html#ssl.wrap_socket:

The parameter do_handshake_on_connect specifies whether to do the SSL handshake automatically after doing a socket.connect(), or whether the application program will call it explicitly, by invoking the SSLSocket.do_handshake() method. Calling SSLSocket.do_handshake() explicitly gives the program control over the blocking behavior of the socket I/O involved in the handshake.

It's enabled by default, but we probably want to disable it to avoid having uncontrolled blocking I/O? I'm still fuzzy on the details here.
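
For the sync backend, the explicit-handshake pattern would look roughly like this (stdlib-only sketch; the host, port, and timeout values are arbitrary examples):

import socket
import ssl

ctx = ssl.create_default_context()

sock = socket.create_connection(("example.com", 443), timeout=10)
tls_sock = ctx.wrap_socket(sock, server_hostname="example.com",
                           do_handshake_on_connect=False)
tls_sock.settimeout(5)   # this timeout now covers the handshake I/O below
tls_sock.do_handshake()  # explicit, instead of implicitly on connect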

Intermittent failures

Okay, so we're soon going to have multiple backends, so all tests are going to be run many more times per build, and it's going to be even more important that they're reliable. I have kept putting this off, but I think it's important to be serious about it.

So I think I'm going to start listing failures here, to remember that I need to fix them. Usually the failures are simple timeout issues where the timeout is not correctly chosen or applies to too many operations.

Add support for Async and Sync Iterators as Request bodies

Supporting these two scripts would bring us in line with Requests:

def sync_gen():
    yield b"data"

async def async_gen():
    yield b"data"

# ahip supports async and sync iterators

import ahip
import trio

http = ahip.PoolManager()

async def main():
    await http.request("POST", "http://example.com", body=sync_gen())
    await http.request("POST", "http://example.com", body=async_gen())

trio.run(main)

# 'hip' supports only sync

import hip

http = hip.PoolManager()
http.request("POST", "http://example.com", body=sync_gen())

Some notes:

  • Treat both of these cases as errors if a redirect occurs: raise an UnrewindableBody exception
  • We shouldn't set Content-Length, since the length isn't known until we drain the body; it should be Transfer-Encoding: chunked
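
Here's a rough sketch (not actual hip code) of how the body handling might tell these cases apart:

# Rough sketch only: anything whose length isn't known up front gets framed as
# Transfer-Encoding: chunked, and async iterables are only accepted by the
# async (ahip) code path. On a redirect these bodies can't be replayed, which
# is where the UnrewindableBody error comes in.

async def iter_request_body(body):
    if hasattr(body, "__aiter__"):      # async generator / async iterable
        async for chunk in body:
            yield chunk
    elif hasattr(body, "__iter__") and not isinstance(body, (bytes, bytearray, str)):
        for chunk in body:              # plain generator / iterable
            yield chunk
    else:
        yield body                      # bytes-like: length known, Content-Length is fine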

Related:

Early data support

I just had a long discussion with @sethmlarson as we tried to wrap our head around RFC 8470, the RFC for supporting TLS 1.3 "early data" in HTTP. It's super complicated, so I'm going to try to write down what (I think) we figured out, before I forget again.

First, usually we're lazy and just talk about "clients" and "proxies" and "servers", but for this we need much more precise terminology. So in this post I'm going to be super anal about this. An HTTP request may traverse multiple hops:

user agent → intermediary → intermediary → origin

The user agent is where the request starts, then it bounces through zero or more intermediaries, until it finally reaches the origin, which actually handles the request and sends back the response.

At each hop, there's a client and a server: on the first hop, the user agent is the client and the first intermediary is the server, but then on the next hop that intermediary is the client and the next intermediary is the server, etc.

hip is an HTTP client library. So this means sometimes we're running inside the user agent (most of the time), and sometimes we're running inside an intermediary (for example, if someone implements an HTTP reverse proxy by writing a web server that turns around and calls hip to forward requests to another server).

OK, now, early data. The way this works is, TLS servers hand out "session tickets" to TLS clients, and if the client hangs onto that ticket, then the next time they make a connection they can use the ticket to skip some steps and make the connection faster.

When a server issues a ticket, there's a flag it can set that tells the client whether it's allowed to use "early data". If this flag is set, then when the client uses the ticket to make a new connection, then it can start sending encrypted application data right away, in the first volley of packets. This is nice because it reduces latency – you get to skip a round trip. BUT the trade-off is, since the client + server haven't had a chance to negotiate any per-connection entropy yet, the early data can't be "customized" to this connection. Therefore, if someone evil is snooping on the network and observes this early data, they can save a copy, and then later make a new connection and play the data back – a "replay attack".

This is somewhat limited: the data is encrypted, so the attacker doesn't necessarily know what it says – but they may be able to make a good guess based on context, e.g. if I convince you to send me $5 on paypal, and then I see you make an HTTPS connection to paypal and capture the data, then I can guess that the data I captured is likely to be an encrypted version of "send $5 to @njsmith". If I replay that a few hundred times then it might have some nice effects. Well, nice for me, anyway; not so nice for you.

And, after the early data, the client/server still have to complete the handshake before they can continue, and an attacker can't do this, because this requires having the actual session ticket, not just a copy of the early data. So this means that an attacker can't actually get a response. And if a server waits for the handshake to be complete, then they can retroactively tell that the early data actually was real, not a replay attack. But the whole point of early data is that you want to get started processing the request early, without waiting for the handshake. And when I'm attacking paypal, I don't really care if I can't see the responses to my replayed messages, just so long as the money gets transferred.

So, servers can't just blindly treat early data like regular data. Like, for Paypal, if the request is GET / then it's fine to process that as early data so the homepage loads faster, but if it's POST /transfer-money?to=njsmith&... then processing that as early data is not OK. In general, origin servers need some awkward application-specific logic to decide what's OK and what's not; there's nothing the IETF can do to help with that. Instead, RFC 8470 is about answering: What do user-agents and intermediaries need to do, in order to make it possible for origins to apply their awkward application-specific logic?

For a user-agent, it's not too complicated: if they have a request that they think might be a good candidate for early-data (e.g. a GET with no body), and they have a valid session ticket with the early data flag set, then they can go ahead and try to use TLS early data. Otherwise, they make the request like normal.

If they do use TLS early data, then the server can say eh this is like GET /, I don't care about replay attacks, I'm going to accept the early data and respond immediately. That's easy. But if the server thinks the request is one where replay attacks would be bad, then it has three options:

  • It can use some TLS-level messages to reject the early data entirely. In this case it's like it was never sent, and the client's TLS library will finish the handshake and then send it again as regular data.

  • It can stash the early data in an internal buffer somewhere, and wait to process it until after the handshake completes. Once the handshake is complete, we know that there's no replay attack, so this is equivalent to the case above – it's just a question of whether you're more worried about server memory (for the internal buffer) versus network bandwidth (for the re-sending).

  • It can send back a 425 Too Early response, in which case the client has to make a whole new request, and this time it shouldn't even try to use early data.

If you're a user-agent, the first two cases are pretty trivial: your TLS library probably handles them internally. And for the last case, a 425 response, you can transparently retry without worrying. So when hip is acting as a user-agent, then it can try using early data opportunistically, without any user intervention, and automatically handle any issues that arise without hassling the user.

But sometimes hip might be acting as an intermediary, and this gets trickier. As an intermediary, you might receive incoming early data, and want to immediately forward it on to the origin server without waiting for the handshake to complete. But, this is risky: you know that this is early data, and might be a replay attack. But if you send it on to the origin server as normal, then this information might get lost – the origin server has no way to tell that this is a potential replay attack, and it might fail to reject POST /transfer-money?to=njsmith&.... So you need to tell the origin server.

You might think well, you should just send the next request as early data yourself, so that the server will know! But, two problems: (1) one of the server's options for being paranoid is to accept the early data and silently hold onto it until the handshake completes, and then treat it as regular data. And your TLS library probably doesn't have an option to be like "hey send this early data, but then pause and don't finish the handshake until I tell you to". So the server might decide that actually this data is safe against replays, even if you send it as early data, and have no way to tell whether this has happened. (2) you might not have a usable TLS session ticket, so you can't necessarily send early data anyway. Heck, your connection to the upstream server might not even be using TLS at all.

(A common configuration where this would happen is if the intermediary is a reverse proxy inside the same data center as the origin server: the user-agent is really far away, so you want to minimize round trips between the user-agent and the proxy, but the intermediary and origin are really close, so it's fine if you have to do a full handshake before you can pass on the request.)

So, you can't rely on sending the next request as early data. And for hip that's very convenient, because it means we don't need to expose a force_early_data=True config option.

Instead, if you're in this situation, what you have to do is (a) make sure that the upstream server supports RFC 8470, via some kind of out-of-band configuration (again, this is OK for the reverse-proxy-inside-a-data-center case), and (b) add an Early-Data: 1 header to the forwarded request.

And then on the server side, if it sees Early-Data: 1 and decides that this request is unsafe, then it has to use the 425 response method; the other two options aren't allowed.

And then when the intermediary sees the 425 response, it can't just blindly retry it immediately, like a user-agent would: it has to either forward the 425 back to the user-agent, or else it has to at least wait for the user-agent to complete the handshake, so that the intermediary knows that there isn't a replay attack happening, and then it can immediately re-send the request to the origin.

So this is what makes things awkward: when hip is a user-agent, it wants to opportunistically use early-data, and to transparently handle 425 responses. But when hip is an intermediary, it can't handle 425 responses transparently.

However, I think there is a solution:

If someone is building intermediary and wants to handle incoming early-data requests, then they have to do something to explicitly opt-in to that. No HTTP server is going to blindly deliver early-data to applications that aren't expecting it. So we can assume that whoever is calling hip already knows that they're dealing with early data.

And, the folks building this intermediary have to know whether the upstream server supports RFC 8470, and if it does, they have to add the Early-Data: 1 header. So from hip's perspective, the presence of an Early-Data: 1 header in the request will happen if and only if it's being used as part of an intermediary, and the request it's being asked to send needs the special intermediary handling.

So I think the bottom line is: hip can opportunistically attempt to use early data, and it can automatically retry on 425 responses, except that on requests where Early-Data: 1 is set, then it should return 425 responses back to the application. (And then they can handle that however they want.)
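
As a sketch of that rule (send_once() and the request/response objects here are hypothetical stand-ins for whatever hip's internals end up looking like):

async def send_with_early_data(request, send_once):
    # First attempt: opportunistically allow TLS early data if we have a
    # suitable session ticket.
    response = await send_once(request, try_early_data=True)
    if response.status_code != 425:
        return response
    if request.headers.get("Early-Data") == "1":
        # We're acting as an intermediary: the 425 has to go back to the
        # application / user-agent, which knows when it's safe to retry.
        return response
    # We're the user-agent: safe to retry transparently, without early data.
    return await send_once(request, try_early_data=False)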

Add SOCKS proxies support.

Currently urllib3 uses the SOCKS library to add support for SOCKS proxies. That library is synchronous and does its own socket handling, so it would be really hard to integrate it with our backends that provide async functionality. As suggested by @njsmith, we can use the socks5 package instead to add SOCKS proxy support.

First steps towards a high-level HTTP client API

@njsmith and I discussed what the next steps would look like for Hip to become less like urllib3 and more like what users expect from a high-level HTTP client.

  • Make a v0.1 release with the current code-base before changing everything.
    This will allow people including ourselves to play around with the code-base and
    hammer on the implementation, especially the async implementation.
  • Make modifications to our unasync configuration to have two top-level names ahip and hip instead of hip._async and hip._sync. This will probably require a change in unasync to make it more generic?
  • Create a top-level async def request(method: str, url: str) -> Response function within the ahip namespace and add a simple unit test for the function. Figure out how in the future we'll be writing tests for both async and sync.
  • Hide all of the "urllib3" API. We could accomplish this by basically adding underscores in front of everything. Keep all the existing unit tests the same for now and replace all names with the new _-prefixed ones.
  • Start adding existing features to the top-level request() function. When functionality is added to the top-level function migrate those unit tests away from testing the internal APIs and to only use ahip.request() and hip.request().

From there we're well on our way towards a v1.0. Let me know if we forgot any steps that you think are important, there's obviously a lot of work to do in addition to what's listed above but hopefully we can focus on the above list to get us closer to a stable user-facing API.

Each of the above tasks can potentially be turned into their own issue when we start working on them?
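
For illustration, the top-level function mentioned above could be as thin as this sketch (names are illustrative, not a committed API; a real Response type would come later):

import ahip

_default_pool = ahip.PoolManager()

async def request(method: str, url: str, **kwargs):
    # For now, just delegate to the existing PoolManager machinery.
    return await _default_pool.request(method, url, **kwargs)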

Overall plans for the project

Now that we're transitioning out of "stealth mode" and have some new contributors joining, we should probably make sure we're all on the same page about goals, and start making some plans. Here's some notes to kick that off; discussion is very much welcome.

Goal/overall vision

For me, the goal is to create a Python HTTP client library that works for everyone. Like, nothing against other client libraries, we're not enemies or anything, but we want to eventually reach a point where our quality and features are so great that the maintainers of urllib3/requests/aiohttp/asks/treq/hyper/etc. are all comfortable recommending their users switch to our thing, or rebasing their libraries on top of ours. Of course that will take a lot of work, but it seems like in the long run it will take less work than everyone duplicating effort on maintaining all these HTTP stacks in parallel. Maintaining a production-quality HTTP client is hard.

Scope

We originally started out with a fairly narrow goal of "replace urllib3", possibly even keeping the name (as "urllib3 v2"). Over time, my feelings on this have shifted, due to a few factors:

  • It turns out that urllib3's public API exposes a lot of internal details, which is a major challenge for maintenance, because it's difficult to change things without accidentally breaking backwards compatibility. (This is a common problem in libraries that grew organically and then got really popular – numpy has similar problems.) And in particular, it's not actually possible to switch from httplib → h11 without breaking public API. And that's a precondition to pretty much everything else we want to do. So, keeping the urllib3 name and full public API simply isn't possible.

  • As we've gotten more familiar with urllib3's internals, I've found more places that I think could benefit from some cleanup :-)

  • The more I studied the requests/urllib3 split, the more it seemed to me that the separation doesn't make a lot of technical sense, at least in retrospect. The "high level" features that requests adds are not terribly complicated, they're pretty much always useful to have available, and they can be done in a way that doesn't add much or any overhead/complexity if you don't need them.

So... there's a tricky balance here, because we don't want to go wild trying to fix every frustration we've ever had with any HTTP library. We need to keep our schedule manageable and we want to make it easy for existing code to migrate to our new library. But I do think we should:

  • take this opportunity to review and pare down the public API and make sure that we're not exposing more internal details than we want to, and

  • increase our scope to target roughly requests+urllib3's feature set, rather than just urllib3 alone.

Decision-making (aka "governance")

So far we've been doing this all in the python-trio org, which I guess de facto makes me BDFL. I'm not particularly attached to that, but I do have some of the relevant skills and it's a simple solution to keeping things organized, so I'm happy to continue in that role for now if that's what we want. Keeping stuff in the python-trio org also lets us lean on the shared org infrastructure (e.g. code of conduct, contribution guidelines, automatic invitation bot, chat, forum, etc.). It might also make sense to switch to some other org like python-http or similar, but as of today that's more of an aspirational placeholder than a functioning organization. Or there's python-hyper, but that's effectively moribund. So I don't really know of a better place than python-trio to put stuff.

To be explicit though: even if it stays in python-trio for now, that doesn't mean that trio has some special status; we'll still have first-class support for asyncio etc., and the goal is to create something that belongs to the Python community as a whole.

Next steps

So far we've been trying to stay close to upstream urllib3 to make it easy to merge changes. But I think we've taken that about as far as it can go, so it's time to pull the trigger and start diverging for real. So some things we want to start doing soon:

  • Switch to our new name (which we should hopefully get ownership of any time now...)
  • Update the README and docs to include the critical information on what the new project is (at least the rough goals and that the public API is very much NOT stable, and how it compares to httpx)
  • Upload a v0.0.1 to pypi and announce it to the world

That kind of needs to happen in that order, I guess. There are also some next steps that are more parallelizable:

  • Start seriously extending the test suite to cover the async code
  • Start reviewing the public API and making decisions about what it should look like (there are some other issues open for parts of this)
  • I know @sethmlarson wants to start refactoring the test suite to make it simpler/less fragile, which would be awesome
  • Probably more stuff I'm not thinking of right now

Again this is just a brain dump to prompt discussion, not the final word on anything, but hopefully it's a good start.

What do you all think? Any additions, concerns you want to raise, tweaks you want to make?

Stop worrying about backwards-compatibility

There are a number of things in the urllib3 codebase that were done in the name of backwards-compatibility and that can now be removed if that makes sense. This is an umbrella issue to record the small things that can be removed.

There's nothing listed right now, but grep for "backwards" or "deprecat" to find candidates.

More cleanups

Now that we're no longer urllib3, there are a number of things that we can and should remove right now. I think @jab, @sethmlarson and @RatanShreshtha would like to help, and these cleanups are a great way to pitch in right now. Please send one PR for each change:

  • Remove the _appveyor directory (forgot about in #153)
  • Remove App Engine files from src/urllib3/contrib and test/appengine and the GAE line commented out in .travis.yml
  • Remove the downstream tests from CI in _travis/downstream and the references in .travis.yml
  • Switch mentions in the docs from urllib3 to hip (except where we actually want to refer to urllib3)
  • There's #130 too which is again up-for-grabs

Fix test_connection_cleanup_on_read_timeout

So we have this failing test named test_connection_cleanup_on_read_timeout. @RatanShreshtha investigated to understand the situation. Here's the test:

https://github.com/python-trio/urllib3/blob/e8603c043894bd28809e76b0edf209fc3e062759/test/with_dummyserver/test_socketlevel.py#L551-L578

The test fails because the connection is not returned to the pool, so poolsize != pool.pool.qsize(). In the bleach-spike branch, the connection is only released when the response is complete:

https://github.com/python-trio/urllib3/blob/e8603c043894bd28809e76b0edf209fc3e062759/src/urllib3/_async/response.py#L289-L298

The response is complete when the client and server h11 states are IDLE:

https://github.com/python-trio/urllib3/blob/e8603c043894bd28809e76b0edf209fc3e062759/src/urllib3/_async/connection.py#L523-L532

This is True when the request has completed successfully, but it should also be True when there's not a clean exit. So I think the fix is to make sure the state comes back to idle when self.close() is called.

Merge CI improvements upstream

I think I prefer to wait until we've caught up with upstream to do this, but I still want to record the list, which I will continue to update:

  • Fix TLS 1.3 tests when using ssl from the standard library (a1fa2c2)
  • Bump macOS Python versions (important to test 3.7.4 with the above fix)
  • Skip one PyPy 2 test (#110)
  • Remove newlines after with statements (done as part of #119)
  • Python 2 coverage uploads
  • setup.cfg

Things that we don't get right either:

  • Is the no brotli test running somewhere?
  • Can we drop the warnings configuration when launching pytest with the latest pytest warning changes? This would allow us to use --disable-warnings
  • Fix lines merged by black (using rg '" "')
  • Use cls in setup_class
  • Remove useless proxy_from_url redirection (it's part of the public API)

Things that are specific to us:

  • We no longer use pytest-random-order: either configure it or drop it
  • Remove enforce_content_length=False
  • AppVeyor does not upload its coverage results to codecov

[discussion] [potential heresy] global connection pools and global tasks

Here's a convenient thing about current Python HTTP clients: when you do create a Session, you can just do session = Session(). You can use with Session() as session, but it's not required. This flexibility can be very convenient – in fact trio's own github bot uses it.

Here's an annoying thing about HTTP/2 and HTTP/3: you really need a background task to handle the shared connection. (And that issue doesn't even talk about cancellation safety, which is also a huge issue that I don't see how to solve without background tasks.) According to the principles of structured concurrency, that means that if you want to support HTTP/2 or HTTP/3, then you must use a with block to manage sessions, which is a usability regression compared to older HTTP/1.1-only clients.

We could potentially restrict this regression to just async clients, since sync clients will probably never support HTTP/2 or HTTP/3. But that's pretty awkward, since our goal is to keep the async and sync APIs matching as close as possible. Another option: we could support both async_session = AsyncSession() and async with AsyncSession() as async_session... but the first session is restricted to HTTP/1.1, and you have to write it the second way if you want your session to use HTTP/2 or HTTP/3. But, this would suck, because it's a really subtle distinction that would create a really surprising behavioral difference.

Here's an annoying thing about current Python HTTP clients: you have to explicitly create and use a Session to make things efficient. If you just call requests.get(...), then it can't reuse connections. Wouldn't it be nice if things were magically efficient for everyone? Golang handles this by having a global Client object (= their version of a Session) that gets used by bare functions like http.get(...). We could do that too! (Well, one global session for the sync API, and one session-per-async-loop for the async API.) However... you can't have an implicitly constructed global session, if sessions require a with. So if you want to make HTTP efficient by default out-of-the-box, then your sessions have to support the bare constructor form.

So... I'm wondering if this is a place where we should commit heresy against the structured concurrency dogma, and use globally-scoped tasks to manage HTTP/2 and HTTP/3 connections. (And then if the answer is yes, we should also consider whether to have a global connection pool.)

Heresy: historical context

Back in the 1970s, the folks arguing for structured programming were originally super dogmatic about it. Structured programs weren't allowed to include constructs like break or continue or return, because those can jump out of multiple nested blocks at once. That means they broke Dijkstra's black-box rule, and were considered gotos in disguise. Instead you were supposed to make sure your loop exit condition covered all possible ways the loop could exit, and arrange your functions so that you always fell off the bottom eventually.

Of course today this sounds pretty silly. It's certainly possible to write some confusing code using these constructs, but they're nowhere near as bad as goto: their effects are local and pretty restricted, and avoiding them is often worse than the disease. E.g. if you encounter a fatal error inside a nested loop, you'd have to write a bunch of duplicated boilerplate conditions to carefully ferry it out of the loop and down to the bottom of your function, and that's also super error prone and confusing.

But I'm sympathetic to those early structured programming evangelists. The only reason we know that these constructs are worth it is because they made a serious attempt to go without them. And I'm sure there were dozens of other cases where goto proponents insisted that some kind of goto was the right solution, and only a super-dogmatic insistence on structure could force them to actually try the structured version... and discover that in fact it was better.

So the lessons I take from this are:

  • A dogmatic insistence on "structured concurrency" is probably only, like 98% correct, not 100% correct, and we'll eventually look back and think we were silly for insisting on that 2%
  • But we don't yet know which 2% is the silly part!
  • And if we're going to make exceptions, we have to be really careful to analyze and justify them, because otherwise we'll lose the moral high ground
  • ...and besides, that's the only way to move the theory forward.

So let's try to put those lessons to work.

So how bad would this heresy be exactly?

There's a few reasons to suspect that using background tasks to pump HTTP/2 and HTTP/3 connections might not be that bad.

First, the idea is to keep the exact same public API as we had back in the old days with HTTP/1.1 and purely synchronous operations. So right there we know that this shouldn't mess up the control flow that our users experience, because the control flow shouldn't change at all. (If we can do it right.)

And second, the housekeeping work that HTTP/2 and HTTP/3 connections need is basically the same as the housekeeping work that the kernel already does on TCP connections. Every TCP connection has its own concurrent tasks running in the kernel to manage the TCP state machine, track ACKs, manage flow control, etc. When we call send or recv on a socket, that's actually just passing some request off to these concurrent tasks, which they perform on their own schedule. But we mostly don't have to think about this – it doesn't cause any horrible concurrency problems. So hopefully HTTP/2 and HTTP/3 should work the same way.

So that's promising, but let's work through the details and try to distill out some principles.

I think the fundamental reason it works for TCP is that the send/recv API is very careful to preserve the illusion that these operations really happen immediately when you call them. That's what we want here too: users shouldn't even be able to tell whether their requests are using HTTP/1.1 or HTTP/2.

Crucially, this includes error handling. If the kernel detects an error, like a TCP RST packet or something, it doesn't crash; it holds onto the error and reports it the next time we do a call on the socket. That's what we want for HTTP requests too. And for TCP, the semantics of send/recv are such that if there's an error on a socket that we never look at, then it's OK for the kernel to discard that error. Fortunately, that also applies to HTTP/2 and HTTP/3: if some background connection dies at a moment when we aren't actually using it for an HTTP request, then that's not really an error; we just silently discard that connection and open a new one later if we need to. The fact that multiple requests might share the same connection is designed to be entirely invisible, an implementation detail.

And the kernel is written very carefully to make sure it never ever crashes. Furthermore: if the kernel does crash, it takes down our program with it. However, the reverse isn't true: if our program crashes, then the kernel keeps going. Since the kernel is strictly more reliable than our code, that makes it safe for us to hand off work to it to happen later. For example, after you close a socket, the kernel might keep working on it for a while, to send any data that you passed to send earlier. If our program exits after calling close, that data isn't lost. OTOH if the kernel does lose the data, it's probably because the whole system died, in which case we would have lost the data no matter what we did. So the hand-off doesn't affect reliability.

Twisted and asyncio also like to pump data in the background, but they don't have this property: if you hand them some data to send, and then your program exits or crashes, their event loop stops, and the data that you thought you sent might not actually be sent. For our background tasks, we could try to force the program to keep running until they finish their work, but probably an easier way to handle this is to make sure that if a user sends something, that call doesn't report success until the background task has handed the data off to the kernel. (This is also the simplest way to handle backpressure.) And of course we'd need to be very very careful that they can never ever crash, and we'd need a backup plan for if that fails and they crash anyway.

But if we do all that..... I actually can't think of any problems that these background tasks would cause? This isn't exactly a tight list of reusable principles yet, but it's a bit closer than we were.

So what kind of architecture are we talking about?

I'm imagining something like Trio's "system task" concept, which gives a way to spawn a task into a global "system" nursery. They have some special properties that are all appropriate here:

  • If they raise an exception, then the whole program is cancelled (basically they follow the normal nursery rules, but running as a sibling of the entire program) – so if you have some horrible programming bug then at least you'll notice, but you're strongly encouraged to make sure that they never ever raise an exception
  • They're protected from receiving KeyboardInterrupt, which is kinda important for that "never raise an exception" thing (but means they have to be careful not to ever get stuck in an infinite loop or anything)
  • They're automatically cancelled when the main program exits.

Emulating some of these details on top of asyncio is probably a hassle, but I'm going to assume the Trio semantics for now and figure out how to compromise for other libraries later.

So given that, we can silently create a background task whenever an HTTP/2 or HTTP/3 connection is made, and automatically kill it again when the connection is closed. That means that in principle, we could implement pretty much any of the classic Session patterns.
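
On Trio, that could look something like the sketch below; conn, run_forever() and mark_broken() are hypothetical stand-ins for the real connection object, but trio.lowlevel.spawn_system_task is the real system-task API:

import trio

async def _pump_connection(conn):
    # Hypothetical pump: shuttle frames between the socket and the HTTP/2 or
    # HTTP/3 state machine until the connection closes. Written so it never
    # raises, since an exception in a system task cancels the whole program.
    try:
        await conn.run_forever()
    except Exception:
        conn.mark_broken()  # silently discard; a new connection is opened later

def start_connection_pump(conn):
    # System tasks run outside any user nursery and are cancelled
    # automatically when trio.run() exits.
    trio.lowlevel.spawn_system_task(_pump_connection, conn)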

However, I think the best architecture, if we can manage it, would be to have a single loop-global connection pool. The idea would be that it stores as little configuration state as possible; all it should hold is:

  • The cache of already-open connections
  • The minimal configuration needed to manage that cache (e.g. maximum number of idle connections)
  • A cache of which origins support HTTP/3 (since HTTP/3 is served over UDP, it has a weird negotiation system where you never use it by itself; you have to first make a connection using HTTP/1.1 or HTTP/2 and in the response the server mentions hey btw, I also have an HTTP/3 server on this port, you can use that for future requests)

This is the stuff that you want to share as globally as possible, because the more widely it's shared the better the chances that any given request in your program can skip expensive connection setup or use a more efficient protocol.

But that means we can't associate configuration with the pool. Instead, we'd associate configuration with individual requests. For example: you don't disable certificate validation on the connection pool, like urllib3 does; instead you disable it on individual requests. And part of the connection pool's job is to keep track of which connections have certificate validation enabled or disabled, and match up requests with an appropriate connection. (urllib3 already does something similar, but slightly more convoluted: it stores configuration on individual ConnectionPools, but then the PoolManager class does matchmaking between requests and pools based on the per-request configuration settings – see _key_fields and associated stuff in poolmanager.py. So this is taking the PoolManager approach and doubling down on it.)
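
For illustration, that matchmaking might be built around a key roughly like this sketch (field names are made up, in the spirit of urllib3's PoolKey):

from typing import NamedTuple, Optional

class ConnectionKey(NamedTuple):
    # Anything that changes the *connection itself* (rather than a single
    # request on it) has to be part of the key. Fields are illustrative only.
    scheme: str
    host: str
    port: int
    verify_tls: bool
    ca_certs: Optional[str]
    client_cert: Optional[str]

# The pool would then keep something like Dict[ConnectionKey, List[Connection]]
# and hand out a cached connection only when a request's settings produce an
# identical key.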

Of course it's still useful to be able to set up configuration on a Session object once, and then re-use it multiple times. But in this approach, the Session would only be a collection of default configuration that it applies to requests made through it, before delegating the actual request handling to the global pool. That would make Session lifecycle management super trivial :-).

We might also want to expose the option of making new pools, for weird cases and for testing. So maybe the layering would be:

  • A Session holds configuration, including a reference to which low-level HTTP transport object to use to fulfill individual requests.
  • By default this points to the global connection pool
  • But you can make new pools, and explicitly configure a Session to use them

I'm not 100% sure whether we'd want to expose the low-level transport object as part of the public API – #124 has lots of musing on this topic. But that seems like the basic architecture we'd want internally, at the least.

Like any global cache, we'd have to be very careful to make sure that the pool can't accidentally leak state in unexpected ways. For example, for the sync version of the pool, we'd have to make sure that it was both thread-safe and fork-safe. That seems pretty doable though.

Another interesting challenge will be figuring out where to stash the "global" connection pool. For the sync backend, we can just use a global variable. For Trio, we can use a RunVar. For asyncio and Twisted... maybe we need a WeakKeyDict mapping loops/reactors to connection pools, or something? I'm sure we can figure something out.
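
A sketch of the first two cases (the make_pool factory stands in for whatever the real pool type ends up being):

import trio

# Sync backend: a module-level global (plus a lock for thread-safety, omitted
# here) is enough.
_sync_global_pool = None

# Trio backend: RunVar gives one value per call to trio.run(), so separate
# event loops never share a pool.
_trio_global_pool = trio.lowlevel.RunVar("hip_global_pool")

def get_trio_global_pool(make_pool):
    # make_pool is a hypothetical factory for the real connection-pool type.
    try:
        return _trio_global_pool.get()
    except LookupError:
        pool = make_pool()
        _trio_global_pool.set(pool)
        return pool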

That sounds elegant and all, but does reality agree?

Currently libraries like requests and urllib3 don't have the kind of careful-and-systematic separation of configuration from the connection pool that I'm envisioning above. Could it actually work in practice?

Here's the state that requests keeps on Session objects:

  • defaults for extra headers, auth, params to attach to every request
  • default for stream=True/False (!?)
  • TLS settings: trust configuration, client cert
  • maximum redirect chain length
  • proxy configuration
  • "hooks" = a list of arbitrary callables to run on responses before returning them
  • "adapters" = which low-level "raw" HTTP transports to use for different protocols
  • cookies

Other things I can think of:

  • default timeout settings
  • default retry settings
  • default settings for transport-level configuration, like TCP keepalive or whether to use HTTP/2

A lot of this is assorted "configuration" stuff that would fit very naturally in my description above.

We'd also want to think hard about whether to make any of this stuff mutable, because if there's a global default Session and its configuration is mutable then that creates all kinds of potential for folks to get themselves into trouble.

The tricky cases:

Connection caching policy obviously has to be configured on the connection cache. Arguably this is a good thing... the main reason for policy settings here is to avoid allocating too many fds, and fds are a process-global resource, so it kinda makes sense to set this policy at a process-global level. But is that good enough in practice, or do people need the option to create private connection pools with their own policy, inside particular modules?

I'm not quite sure what to do with cookies. They can mostly fit into the "configuration" model, if we expose a "cookie jar" type, and let you specify which cookie jar to use on a per-request basis. But what should we do when there's no cookie jar specified: use a process-global jar? Don't store cookies at all? And when you create a Session without specifying a jar, does it default to using the global settings, or should it create a new cookie jar?

Go's approach is that when you make an http.Client, you can either specify a CookieJar, or set it to their version of None. If it's None, then all the default cookie handling is disabled (so not only does it not automatically add cookies to requests or save cookies from responses, it doesn't even strip cookies when following redirects! not sure what the reasoning there was). And the default global Client has the CookieJar set to None.

I can see arguments either way! On the one hand, having a global CookieJar would be a vector for state to leak between different parts of your program in potentially-surprising ways. On the other hand, a lot of Python programs are quick scripts to mess with HTTP APIs and we want to make that beginner path as smooth as possible – and how often do you really want to throw away cookies? Browsers have a global cookie jar and it works out OK for them. Though it wouldn't be terrible to say that by default cookies are ignored, and if you want automatic cookie handling then you have to create a Session. And I guess browsers do have stuff like incognito windows and facebook containers. And if you have two tasks interacting with some API on behalf of different users, you definitely don't want to accidentally mix up their cookies. So I'm leaning towards making the global session use a null cookiejar that's always empty.

I somehow screwed up the merge from master

I have no idea how I did this -- I didn't even know you could do this -- but somehow when I merged to bleach-spike from master in e1d8b60, git didn't record that as a merge commit, it just recorded it as a regular commit containing all the changes in the merge. Like, I cherry-picked from master instead of merging?

So, we've kinda lost track of the relationship between our branch and master.

Fixing this is possible but I guess it will require some delicate git surgery.

It looks like the actual merge point was 5fd9026. It's at the right time, and I've confirmed that its last change is indeed present in bleach-spike, while the next change in master (db1c1b6) is not present in bleach-spike.

Choosing content-Length vs chunked encoding for request bodies

@RatanShreshtha pointed to this PR: urllib3/urllib3#1715, asking whether it applies to us. I looked into it and realized it involves a larger issue that I don't understand, so I'm moving it from chat into a proper issue :-)

So: Body encoding is tricky. In HTTP/1.0, the only option was to use Content-Length. Of course this caused problems for streaming uploads. So HTTP/1.1 added the option to use Transfer-Encoding: chunked. For response bodies, this is fine – we can handle whichever one the server gives us. But for request bodies, we have a problem: we have to pick which one we're using before we've talked to the server, and we don't know yet whether it's a HTTP/1.0 server or a HTTP/1.1 server!

So in some cases you MUST use Content-Length, because the server doesn't support anything else, and in some cases you MUST use Transfer-Encoding: chunked, because you don't know the length of the data ahead of time and can't afford to generate it all up front in-memory (either because it's too big, or because your semantics actually demand streaming, e.g. consider a Travis worker streaming logs up to the central Travis server). There's no way to automatically detect which case we're in. So, it kind of has to be exposed to the user as an option they can set.

In upstream urllib3, this is done through the chunked=True/False kwarg. In our branch currently, I guess the only way to do it is by explicitly setting the Content-Length or Transfer-Encoding headers? And if you don't set either, then we unconditionally use Transfer-Encoding: chunked, see _add_transport_headers in connectionpool.py.

I don't think our current thing is going to work... there are a lot of cases where using Content-Length is easy and free (e.g. when the body is specified as an explicit string, or a file whose length we can query). And since Content-Length is more broadly compatible, we need to be using it in these cases; anything else will be a regression compared to other HTTP libraries. But there are a bunch of other cases that are trickier, and I'm not sure what we want to do with them.

One thing we definitely know is what the user passed as the body. If there's no body, that's always Content-Length: 0 (or possibly no header at all, for some methods like GET where Content-Length: 0 is the default). If the body is a fixed-length thing we know up front, then that's always Content-Length: <the length>. Or it could be something more complicated, like an arbitrary iterable or a file object that doesn't support seek/tell/fstat.

So one option would be to say that we default to Content-Length in the former cases, and default to Transfer-Encoding: chunked in the latter case, and if you want to override that then you can set the header explicitly.
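
A sketch of that default heuristic (purely illustrative, ignoring details like string encoding and the current file position):

import os

def default_framing_headers(body):
    # Bodies with a cheaply-knowable length get Content-Length; everything
    # else falls back to Transfer-Encoding: chunked.
    if body is None:
        return {"Content-Length": "0"}
    if isinstance(body, (bytes, bytearray)):
        return {"Content-Length": str(len(body))}
    if hasattr(body, "fileno"):
        try:
            return {"Content-Length": str(os.fstat(body.fileno()).st_size)}
        except OSError:
            pass
    # Generators, arbitrary iterables, unseekable files, pipes, ...
    return {"Transfer-Encoding": "chunked"}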

Another option would be to do like urllib3 does now, and have an explicit chunked= kwarg. Then we have to decide:

  • If we get chunked=False + an iterable body, what do we do? I guess the options are:
    • Immediately read the whole iterable into memory, so we can figure out its length
    • Give an error, saying that the user should either use chunked=True or set a Content-Length
  • If the user doesn't specify chunked=, what's the default? I guess the options are:
    • chunked=False
    • chunked="guess"

This is a 2x2 space of possibilities.

Expect: 100-Continue support

I learned from here that botocore does some strange things to patch urllib3 to support Expect: 100-continue.

This is definitely not our first priority, but native support for Expect: 100-continue should be fairly straightforward using h11 – probably easier than supporting botocore's monkeypatching! – so I wanted to make a note here so I don't forget.
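
Roughly, the client-side dance with h11 would look like this sketch: send the request head with "Expect: 100-continue", wait for an InformationalResponse, then send the body. The socket handling, method, target, and timeout/fallback behavior are placeholders.

import h11

def send_expect_continue(sock, body: bytes):
    conn = h11.Connection(our_role=h11.CLIENT)
    request = h11.Request(
        method="PUT",
        target="/upload",
        headers=[
            ("Host", "example.com"),
            ("Content-Length", str(len(body))),
            ("Expect", "100-continue"),
        ],
    )
    sock.sendall(conn.send(request))

    # Wait for "100 Continue". A real implementation would also apply a
    # timeout and fall back to sending the body anyway, per RFC 7231.
    while True:
        event = conn.next_event()
        if event is h11.NEED_DATA:
            conn.receive_data(sock.recv(4096))
        elif isinstance(event, h11.InformationalResponse):
            break  # got 100 Continue: OK to send the body
        elif isinstance(event, h11.Response):
            return event  # e.g. 417 Expectation Failed: don't send the body

    sock.sendall(conn.send(h11.Data(data=body)))
    sock.sendall(conn.send(h11.EndOfMessage()))
    return None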
