w3c / webdriver-bidi

Bidirectional WebDriver protocol for browser automation

Home Page: https://w3c.github.io/webdriver-bidi/

webdriver-bidi's Introduction

WebDriver BiDi

WebDriver BiDi is a bidirectional protocol for browser automation, building on and extending WebDriver.

WebDriver BiDi is a living standard that continuously gets new features added. For more info, consult these resources:

An explainer with more background and goals
A roadmap based on real-world end-to-end user scenarios
Detailed proposals for the initial protocol
A spec under active development

How to build the specification locally

We use bikeshed to generate the specification.

Make sure you have the right version of python installed.

Now you can run in your terminal:

./scripts/build.sh

This script installs bikeshed (if not installed yet) and generates an index.html file for the specification.

Later on, you can use the --upgrade argument to force installing a newer version.

How to generate CDDL locally

Make sure you have npm and rust installed.

Now you can run in your terminal:

./scripts/test.sh

This script installs required npm and cargo packages (if not installed yet) and generates the CDDL files for the remote end (remote.cddl) and the client (local.cddl).

Later on, you can use the --upgrade argument to force installing newer versions.

webdriver-bidi's Issues

Interop of identifiers with original WebDriver spec

There are a few key places where we need to ensure smooth interop with the existing WebDriver spec. Notably, with the following:

Session IDs
Window handles
Element identifiers

The Session ID is required to allow the session to be established in the first place.

Being able to share window handles allows one to use WebDriver to discover top-level browsing contexts and use those with WebDriver BiDi. One can also see it being helpful when returning data from BiDi to use with the original WebDriver spec.

The programming model we're seeing surface in Selenium is to use the original WebDriver protocol most of the time, but then augment that with calls to CDP (and therefore to WebDriver BiDi at a later date). As such, being able to reference an element consistently in both protocols is an extremely useful feature.

Create CDDL Index section

Bikeshed generates IDL Index sections; e.g., https://drafts.csswg.org/cssom/#idl-index

We should have similar sections grouping together local and remote ends of CDDL productions.

session.Status command doesn't really make sense

The way the spec is set up at the moment, it's not possible to have a WebSocket connection without a session. So the session status command doesn't make sense because it's always going to say yes there's a session.

For this to make sense there needs to be a way to start the websocket listener without starting a WebDriver/HTTP session.

Specify how to determine and address script execution contexts

WebDriver 1.0 doesn't have any explicit notion of realms (sometimes referred to as "targets" or "execution contexts" although the latter means something different in the context of ES). Instead it always executes script in the realm associated with the active browsing context. There is agreement that the BiDi version should treat realms as first class objects and allow commands to address specific realms.

The platform has several different mechanisms that create realms. Each browsing context has an associated realm in which scripts run (e.g. those provided by <script> elements). In addition various kinds of workers also create realms. Worker realms may be associated with zero or many browsing contexts. For example a document may create a dedicated worker which has that document in its owner set. That worker may create further workers which have the parent in their owner set. Or a document may access a SharedWorker which can have multiple documents (or other workers) in its owner set. Service workers are different again in that they are created by the user agent in response to service worker events, and may be shut down at any time.

Worklets also exist. It's unclear to me what the requirements around those are.

The requirement to be able to run script in a specified realm means that commands which execute script must provide a realm id to use as the target. This is similar to the Execution Context Id from CDP (note that CDP also has commands which take an object and infer the realm from the passed object).

In addition there must be some way to get a list of accessible realms. For a specific browsing context it may make sense to have a method to get the realm associated with that browsing context, and possibly also workers owned by the browsing context. It probably doesn't make sense to expose service workers in this way since they are more ephemeral.

It also makes sense to provide lifecycle events associated with realms being created and destroyed. This will allow getting handles to realms once they are created. It seems useful for such events to contain information about the kind of execution environment associated with the realm.

Define session creation without HTTP driver

Some implementations want to support connecting without requiring an initial HTTP request. This is required for feature parity with CDP-based clients, e.g. Puppeteer, that are able to establish a connection to the browser directly without going through a separate driver binary.

The following discussion will assume a WebSockets based transport, but the same issues would apply to an implementation that wanted to allow connections over e.g. a unix pipe without the initial HTTP handshake.

Currently once a session is created, the HTTP layer returns a websocket url of the form ws://localhost:<port>/session/<sessionid> for the client to connect to. Since this requires the session id to be known it's clear that this doesn't work well for establishing a session directly over WebSockets. An obvious implementation would be to allow connecting to ws://localhost:<port>/session and defining a command like

SessionNewCommand = {
  method: "session.new",
  params: SessionNewParameters
}

SessionNewParameters = {
  ? alwaysMatch: Capabilities,
  ? firstMatch: [*Capabilities],
}

Capabilities = {
  *text: any
}

Then, if you send this command when there's no existing session it would create a session and return a response with the matched capabilities and otherwise it would error.
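As a rough illustration of the message shape this proposal describes, here is a minimal Python sketch that builds a session.new command. The top-level `id` field, the helper name `build_session_new`, and the example capability are assumptions made for illustration only; the proposal above fixes just the `method` and `params` members.

```python
import json


def build_session_new(command_id, always_match=None, first_match=None):
    """Build a hypothetical session.new command as a JSON string."""
    params = {}
    # Both members are optional in the proposed parameters type,
    # so they are only included when the caller supplies them.
    if always_match is not None:
        params["alwaysMatch"] = always_match
    if first_match is not None:
        params["firstMatch"] = first_match
    return json.dumps(
        {"id": command_id, "method": "session.new", "params": params}
    )


# Example capability chosen purely for illustration.
message = build_session_new(1, always_match={"browserName": "firefox"})
```

A client would send `message` as the first frame after connecting to the session-less WebSocket resource, and treat an error response as "a session already exists".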

One question is whether the session itself should reuse the connection; it would be the "wrong" resource since the session id wouldn't be in the path. But I don't immediately see a practical problem with reusing it in this case, and the alternative would require the client to establish a new connection which adds latency.

One wart might be if you want to allow reconnecting to the session once it has dropped; in this case it might be necessary to connect to the URL including the session id (to account for nodes accepting multiple sessions). That is also probably sufficient reason not to change the spec to make /session the only supported ws resource; in a node that supports multiple sessions that would make it hard to work out which session to reconnect to.

Another question with this setup is how to communicate the ws port to the local end. This is analogous to the problem of how to communicate the HTTP server address to local ends, and is usually solved either by putting the local end in control and allowing the client to select the address through remote-specific options (with some risk of races) or by communicating the address back through stdout of the client.

Update spec to define things in terms of Navigables

whatwg/html#6315 is going to update the HTML spec to change the session history model, and introduce the concept of "navigables" which correspond to things in the browser that can be navigated. "Top-level navigables" are (usually) tabs and windows in the browser UI.

This concept of "top-level navigable" is exactly what WebDriver means by "window", and is in fact a better match for what we want to expose in the API than "top-level browsing context". That's because the top-level browsing context in a particular tab can change over time (e.g. with COOP/COEP). We don't necessarily need to expose those changes to the user, and in particular want to expose an identifier that's constant over the full life of the tab/window.

It's not clear to me if we should update everything related to browsing contexts to instead expose navigables (with or without changing all the naming), or if we should expose both separately.

Sending a webSocket URL to the client isn't really necessary

In Establishing a Connection, we currently construct a WebSocket URL with a host, port, and session ID, and return this URL to the client in the webSocketUrl property of the new session capabilities.

If we were to specify that all websocket URLs follow a known pattern instead, then the client can simply connect to ws://host:port/session/, and there is no need for the remote end to tell the client what this URL is. This is more in line with how traditional WebDriver endpoints work. For example, the client just knows what the proper endpoint is for sending a Find Element command for a particular session.

If we decide to support bidi sessions without going through HTTP first, as proposed in #46, then a well-known websocket URL will come in handy here too. For example, we might define a global websocket endpoint where clients can send "status" and "new bidi session" commands.

Proposal:

Specify the path for a websocket URL for a session to be /session/
Return a boolean for the webSocketUrl capability that indicates whether bidi is supported. Probably rename it to bidi while we're at it.
If bidi is true, then the remote end will accept websocket connections at /session/

Allow `maxDepth` serialisation setting

The BiDi protocol can be used both on the local and remote hosts. As a BiDi user, I want to be able to configure the protocol behaviour by changing the serialisation depth.

Options:

Add a global number setting serialisationMaxDepth.
Add a global boolean flag previewSerialisation (switching between serialisationMaxDepth=1 and serialisationMaxDepth=0)
Add an option to specify maxDepth in each request.
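To make the effect of a depth limit concrete, here is a minimal Python sketch of depth-limited serialisation. The output shape (a `type` member plus an optional `value` member) and the function name `serialize` are assumptions for illustration, not something this issue or the spec defines.

```python
def serialize(value, max_depth):
    """Serialise nested dicts/lists, omitting contents below max_depth."""
    if isinstance(value, dict):
        result = {"type": "object"}
        if max_depth > 0:
            # Children are serialised with one less level of depth available.
            result["value"] = {
                key: serialize(item, max_depth - 1)
                for key, item in value.items()
            }
        return result
    if isinstance(value, list):
        result = {"type": "array"}
        if max_depth > 0:
            result["value"] = [serialize(item, max_depth - 1) for item in value]
        return result
    # Primitives are always returned in full, whatever the depth.
    return {"type": type(value).__name__, "value": value}
```

With this sketch a depth of 0 returns only the type of a container, and a depth of 1 returns a one-level preview, which is roughly the distinction the boolean option above would toggle.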

Consider whether to allow multiple sessions in browsers

With classic WebDriver the blocking nature of the commands means that it only makes sense to have one active session at a time. But with BiDi we can imagine multiple clients connecting to the browser simultaneously and each getting its own set of events. For example alongside a client running tests, a separate client might be checking performance metrics for each page load. These functions are theoretically quite decoupled but when only a single session is permitted the clients at least have to cooperate to share a session id and internal state like the event subscriptions. In practice this usually means everything has to be part of the same client (as in the devtools model). But that monolithic design might not be necessary and it might improve the tooling ecosystem when people are able to make small tools that do one thing rather than needing to create large monolithic tools where unrelated functions need to integrate at the code level.

There should be a way to unsubscribe from all events

#51 added support for subscribing to events. A common pattern is for people to want to reset the state of the browser (e.g. between tests). As such, the ability to unsubscribe from all events would be useful, either by providing a way to get all subscribed events, or by a direct call to unsubscribe.

Define a host check and explain when to use secure vs. insecure WebSocket connections

As discussed with @jgraham, the WebDriver BiDi spec is missing remote end steps for a host check when a WebSocket client wants to connect. By default the remote end should only allow connections from localhost, so as not to offer a vector for attackers. In that case it shouldn't matter whether the WebSocket is secure or insecure, so using an insecure WebSocket should be fine.

But browsers might lift this restriction if really needed and there is no proxy running on the same machine that could handle remote connections. In such a case the WebDriver BiDi spec should absolutely require a secure WebSocket connection to be used.

What are the requirements for a command id?

Each command has a command id that's an integer allowing the client ("local end") to identify which response corresponds to which command. But it's not clearly specified what requirements there are on this id and how they are enforced. Presumably the id is supposed to be unique. Is it supposed to be monotonically increasing? That allows the remote end to reject messages with a possibly-reused id while only storing a single integer, but also requires that the client serializes all messages through a single process. In particular it doesn't allow some architecture in which multiple independent processes write commands to the same websocket. That's probably fine, and if we really want something like that we could revisit supporting multiple connections. But I want to check there's not some reason to allow non-monotonic command ids.
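To show what "only storing a single integer" could look like on the remote end, here is a minimal Python sketch of a strictly-increasing id check. The class name `CommandIdChecker` and the assumption that ids are non-negative are illustrative only; the spec does not currently state either.

```python
class CommandIdChecker:
    """Remote-end sketch: accept only strictly increasing integer command ids."""

    def __init__(self):
        # The only state kept is the last accepted id.
        # Assumption: valid ids are non-negative, so -1 means "none seen yet".
        self.last_id = -1

    def accept(self, command_id):
        # Reject non-integers (bool is excluded explicitly because it is an
        # int subclass in Python).
        if isinstance(command_id, bool) or not isinstance(command_id, int):
            return False
        # Anything at or below the last accepted id might be a reuse.
        if command_id <= self.last_id:
            return False
        self.last_id = command_id
        return True
```

Note that this rejects out-of-order ids as well as duplicates, which is exactly the property that rules out several independent writers sharing one websocket.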

Update goals in the explainer document

Copying in feedback from email (it may make sense to split this into multiple issues if any of this is controversial):

This section captures a number of use cases, but I think there's a product-oriented view which is that we should be looking to provide the functionality that allows existing remote-automation libraries with browser-specific backends (or browser-specific features) to use a standard backend. This includes e.g. puppeteer, playwright, cypress, selenium, saucelabs.
Some of these goals like "fail fast on any js error" seem like details of possible designs that can be discussed in the relevant features rather than top-level goals.
"Access to native devtools protocol" shouldn't be an explicit goal. Having a way to support vendor extensions should be a goal (as you'd expect since WebDriver already has this capability). But for Firefox we don't anticipate building on the devtools protocol, and a requirement to expose that would significantly complicate things for us with little gain. Moreover starting a standard with the explicit goal of exposing nonstandard parts of implementations seems like it rather misses the point; we want a standard featureset that covers all the important cross-browser use cases so that people don't have to reach for the single-browser escape hatch at the cost of interop.
The "easy mapping to native devtools protocol" doesn't seem like a goal for the protocol; for example we don't expect it to have a trivial mapping to RDP used in Firefox. Vendors may of course have constraints on the technical direction of the protocol which will derive from a shared implementation with their devtools, but those should form part of the discussion rather than be an explicit goal.
An additional goal that we'd like to see is that the BiDi protocol ends up as a superset of the HTTP-based protocol i.e. there's never a requirement to send HTTP commands to get access to specific functionality. From an implementation point of view that would allow the HTTP-based and BiDi functionality to end up sharing all the code except for the transport layer.

Add bidi-specific error section

Up until now we have been reusing the error codes from webdriver/http. That's probably not going to be enough for the whole protocol. I suggest we use entirely custom errors and provide an informative mapping onto the HTTP error to help clients.

Document our design principles / rules of thumb

In #46 (comment) I spelled out a principle about the ability to implement WebDriver BiDi both as part of a driver binary like ChromeDriver, or as part of the browser directly, and avoiding assumptions that only make sense in one scenario. @jgraham agreed:

I don't think that there's a requirement that the connection has multiple hops (I don't expect the gecko implementation to have multiple hops in the default case). There may also be middleware that adds additional hops to the connection.

It would be good to write down some things like this which we've converged on, to be able to point to. A few others that come to mind, which I don't know if others agree with:

err on the side of matching existing debugging/inspector protocols where the benefit of being different is not very clear
forward compatibility: "a design characteristic that allows a system to accept input intended for a later version of itself."
provide the building blocks needed to do something first, and higher-level conveniences only if they can be expressed in terms of those building blocks, and if the convenience is almost universally needed.

Specify a mechanism for enabling events

The protocol will generate two kinds of messages originating at the browser; command responses and events in response to changes in the browser state e.g. browsing contexts being created, logs being generated, etc. Automatically sending all events generated over the protocol is likely to be a significant performance problem both for the client and the browser under test itself. Therefore there needs to be some mechanism to opt into the required events.

There are various possible levels of granularity for this opt in. The obvious axes of variation are what events should be sent, and where they should be sent from. The "what" axis is about the extent to which individual events have to be specified vs larger groups of events, and the "where" axis is about whether events are sent from all browsing contexts and execution realms or just from some.

For example in the CDP protocol, events are divided into groups called "domains" (and commands from those domains are only available when the events are being received). Domains are atomic i.e. you can't enable a subset of events from a domain, and specific to a "target" i.e. a domain is either on or off for a specific top level browsing context and all descendant browsing contexts (I think). This makes sense for a devtools protocol in which the devtools are enabled on a per-tab basis, there are typically many tabs that are not being inspected, and the tools themselves are organised into panels corresponding to each domain, with each panel using most of the functionality of the domain.

For automation use cases things are a little different:

Typically the full browser instance is under automation at all times. Therefore it's more likely sensible to enable events for all the top level browsing contexts.
However there are some kinds of events that may have a performance impact or cause excessive network traffic, so it may make sense to enable those events only for specified browsing contexts or realms
There isn't such a strong connection between the client and the server, so the assumption that wanting one kind of event means that you will also be requiring all similar events doesn't make so much sense.
There is a bootstrapping issue. For example if a new top level browsing context is created, it is likely desirable to enable a set of events. This suggests the concept of a default set and a context/realm specific override set.

Decide how to handle events that the client might miss

There is a natural race between events that happen in the browser and the client subscribing to the relevant event stream. This means that a client may miss events that happened before the stream was enabled. There are a few use cases where this matters:

Where the client wants to maintain a complete mapping of some browser-side resources e.g. a complete list of current browsing contexts.
Where the client wants a complete list of transient events that occurred e.g. a complete list of log events.

These are different in that in the first case the client is trying to replicate state that's current on the remote end, but in the latter case there isn't the same notion of the current state of the remote; it's just about buffering things that can be missed.

The solution that currently works in the spec for the state replication cases is to enable the desired events, and then read the current state. The fact that this is two commands creates a race, so the client may get a creation event before it has read the current state. This means the client will have to do deduplication. It also weakens invariants because it means that it's possible to e.g. get a contextCreated event for a context pointing to a parent that isn't yet known to the client.
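The client-side bookkeeping this implies can be sketched in Python as follows. The class `ContextTracker`, its method names, and the representation of a context as an (id, parent id) pair are all assumptions for illustration; only the contextCreated event comes from the discussion above.

```python
class ContextTracker:
    """Client-side sketch: merge contextCreated events with a state snapshot."""

    def __init__(self):
        self.contexts = {}  # context id -> parent id (None for top-level)

    def on_context_created(self, context_id, parent_id=None):
        # Deduplicate: the same context may be seen both in an event and in
        # the snapshot read afterwards. Returns True only for new contexts.
        if context_id in self.contexts:
            return False
        self.contexts[context_id] = parent_id
        return True

    def merge_snapshot(self, snapshot):
        # snapshot is an iterable of (context id, parent id) pairs.
        for context_id, parent_id in snapshot:
            self.on_context_created(context_id, parent_id)

    def orphans(self):
        # Contexts whose parent is not (yet) known: the weakened invariant.
        return sorted(
            context_id
            for context_id, parent_id in self.contexts.items()
            if parent_id is not None and parent_id not in self.contexts
        )
```

Until the snapshot has been merged, `orphans()` can be non-empty, so any client code that assumes a parent is always known has to tolerate that window.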

For the pure-buffer cases where there isn't a current state to read, replicating this solution would involve having a command to get the buffer, which would only make sense in that one scenario; once you are subscribed to events presumably there's no point in also maintaining a buffer.

So an obvious solution here is to provide a hook, so that when subscribing to certain events we immediately emit all the events to replicate the current state, or the buffered events. The problem in the current spec is we allow such fine-grained filtering of events it's not really clear how this should work. For example if browsing context A has a child B and you subscribe to context module events first for context B then for context A, you would clearly get a contextCreated event for B when subscribing to events for B, but it's unclear if you ought to get the same event again when you subscribe for A. It would be possible to write an algorithm so you get each event at most once, but it seems rather complex to explain the behaviour.

Assuming we think this is a problem that we want to solve, there are several options I can think of:

Allow subscribing to top-level events in the new session command so you can be sure that the subscription happens before any events are sent. The subscription would be a capability alongside webSocketUrl. This would only work for globally enabled events since at the time of new session we don't have browsing context ids. I think we should do this in any case, because in practice I think it will cover the most important use cases. But it doesn't cover all use cases.
Only allow subscribing to events per top-level browsing context, not for any context. That makes things simpler (although there is still a decision to make if you first subscribe for a specific context and then later globally) and more or less matches what CDP has (but CDP doesn't smooth over out of process iframes; I assume we'd want to do that).

Spec-wise I think we're going to need a post-enable hook that an event can define to emit the initial events after it's enabled. For logging this will presumably involve an explicit buffer we put events in when they aren't to be emitted; for contexts &c. it will probably involve reading the current state and synthesising events we would have got.

What does it mean to "match […] against the remote end definition"

e.g., "Match parsed against the remote end definition. If this results in a match:"

What does it mean to "match" against it? When does the match fail? Is an extra property a failure? https://tools.ietf.org/html/rfc8610#section-4.2 deliberately leaves this up to the application, so we need to define this.
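The difference between the possible answers can be sketched in Python. This is not a CDDL implementation; it only checks map keys, and the function name `matches`, its parameters, and the example member names are assumptions for illustration.

```python
def matches(message, required, optional=(), allow_extra=False):
    """Sketch of matching a decoded JSON object against a set of member names."""
    keys = set(message)
    # A missing required member is a failure under either policy.
    if not set(required) <= keys:
        return False
    # Members that are neither required nor optional are "extra".
    extra = keys - set(required) - set(optional)
    # The open question is which value of allow_extra the spec should mandate.
    return allow_extra or not extra
```

Under the strict policy a message carrying an unknown member fails to match, while under the lenient policy the same message matches, which matters for forward compatibility.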

Subscribing to events for only specific browsing context

In #51 we landed on this model for events:

Every browsing context (effectively) has a tri-state setting for each event: explicitly enabled, explicitly disabled, or unset
There's a global event set where each event can also be explicitly enabled globally
To determine if an event is enabled, one walks up the tree of browsing contexts until one finds it to be explicitly enabled or explicitly disabled, ultimately falling back to the global event set
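The lookup described in the last point can be sketched in Python. The `Context` class, the use of a dict for the tri-state (True, False, or absent for unset), and the function name `is_event_enabled` are assumptions for illustration rather than spec text.

```python
class Context:
    """A browsing context with per-event tri-state settings."""

    def __init__(self, parent=None):
        self.parent = parent
        # event name -> True (explicitly enabled) or False (explicitly
        # disabled); an absent key means "unset".
        self.events = {}


def is_event_enabled(context, event, global_events):
    """Walk up the context tree, then fall back to the global event set."""
    node = context
    while node is not None:
        if event in node.events:
            return node.events[event]
        node = node.parent
    return event in global_events
```

For example, with "load" enabled on a top-level context and explicitly disabled on one of its frames, every context nested inside that frame resolves to disabled, while the top-level context itself stays enabled.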

As promised in https://www.w3.org/2020/10/07-webdriver-minutes.html, I think there's a bit of a race condition worth raising here. We have been focused so far on not missing any events, but there's also an issue of avoiding events that you don't want or expect.

Scenario: I want to wait for the "load" event on the top-level browsing context, but don't care about load events on any iframes, etc.

One can deal with this in one of two ways:

Filter events on the browsing context ID
Subscribe to the event for the top-level browsing context only, and don't do any filtering

The issue is that the latter sounds appealing, and makes some code simpler, but would in fact be racy. Right now one has to disable the event on any descendant browsing contexts after they are created, and events could be fired before that. Because it's a race, though, the problem might not show up when writing the tests, and only later make them flaky.

I think this specific problem could be averted by making the event subscription tri-state but in a different way:

enabled for only this browsing context
enabled for this and descendant browsing contexts, including future ones
unset, falling back to ancestor browsing contexts and finally the global event set

This doesn't allow subscribing to an event for the top-level browsing context and only some of its iframes, and would in that scenario require client-side filtering of events. Not sure what the right tradeoff is, but putting this idea out there.

Specify a network request interception feature

A common use case for automation is to take a request that would be made e.g. to a third party API and provide a mock response for that request, or to take a request that requires HTTP authentication and automatically provide that authentication. This is a feature that's exposed in Puppeteer and is in the works for Selenium.

The existing CDP API is the Fetch domain. Something similar has been added to WebKit. Gecko has similar features with a different API exposed to extensions.

Given that the CDP-ish model is already widely implemented in browsers and used in clients, it makes sense to spec something close to that. Practically this means:

A means to enable request interception for a specific set of contexts (or realms; serviceworkers for example seem like an important use case here) and with a URL filter that allows only intercepting a subset of responses
An event that's produced when a request matches the filter with details about the request, including an id that can be used to continue the request.
A command to fulfill the request, supplying the response headers and body.
A (possibly separate) command to cause a network error
Some way to get the response that would have been returned by the network request so that it can be modified before use.

Auth seems to be handled somewhat separately, with the following:

An event that's fired when a request results in a HTTP auth challenge
A command to provide the authentication response

Specify session history navigation

We should have commands for moving around in history.

Currently WebDriver has forward and back commands that correspond to traversing the history by a delta in the HTML spec. This is straightforward to specify. CDP takes a different approach; it allows getting a list of session history entries and navigating to an entry by id. Probably that's widely implementable because it corresponds to browser features like a drop down on the back button showing the full history. The question is whether the additional flexibility has use cases that merit the additional complexity.

Define remote object lifecycle

To prevent memory leakage, there should be a way to release the received remote object.
This can be implemented like in CDP Runtime.releaseObject

Only allow top-level browsing contexts (navigable) in session subscribe/unsubscribe

As of now the session.subscribe command doesn't make it clear that only top-level BrowsingContexts (navigables) should be passed in for enabling event subscription for the whole BrowsingContext tree. Instead we implicitly get these from a given BrowsingContext when updating the events map (step 7.2.2).

It would be better to only allow top-level BrowsingContexts to be passed-in, and error out respectively if that isn't the case.

The same also applies to session.unsubscribe.

Specify closing a session

It should be possible to close a session. This could optionally shut down the UA; certainly some way to close the browser is a requirement for our use cases.

Specify reloading current page

CDP has Page.reload. Seems like a reasonable place to start. Difference from Page.navigate with transition type set to reload? Presumably have the same wait arguments as navigation.

Specify a low-level transport format

The bi-directional protocol needs a low-level format for encoding messages that are sent over the wire.

The discussion so far has assumed that the format will be JSON-based; this has compatibility with the HTTP protocol and good human readability. It matches other existing protocols (although note that CDP now has a binary variant) and likely makes sense if we think that the message parsing overhead will be small compared to other latencies in the system.

@burg proposed that we adopt JSON-RPC as the specific flavour of JSON for the protocol. This meets several requirements:

Provides a command type and a response type with an id field to match commands and responses
Provides an event type that isn't generated by a command
Provides an error type

However there are several details that might not match our requirements

Requires a version number on every message, bloating the protocol for little gain (updating versions in WebDriver would be a difficult undertaking given the compatibility guarantees we provide).
The protocol has a pipelining feature where an array of commands can be sent at once. This kind of feature seems useful to us, but the particular requirement that the implementation produce an array of responses doesn't seem like it matches use cases we might have (like running multiple concurrent commands and getting the response from each as soon as it's ready).
The spec has several SHOULD-level requirements; this is worrying from the point of view of obtaining interop.

Machine readable definitions

This is the BiDi sibling of issue w3c/webdriver#1510, see that issue's description for the full background.

The solution for REST and BiDi likely won't be the same, and we might do one without the other.

For BiDi specifically, @bwalderman has already put together an openrpc.json proposal.

Use the message ID in errors, even if everything else was invalid

Discussed at #50 (comment).

https://w3c.github.io/webdriver-bidi/#handle-an-incoming-message will respond with an error if any of a number of conditions are true. Although https://w3c.github.io/webdriver-bidi/#respond-with-an-error is just a placeholder, it's clear that the ID isn't passed in, and so can't be passed back.

The behavior should be, I believe, that the ID is sent back if there was one.

Open question: If the ID was a string or an object, should that be sent back? Probably not?
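One possible reading of this behaviour, sketched in Python: recover the id when the message is at least a JSON object with an integer `id`, and send back null otherwise. The response shape (`id` and `error` members), the function name `error_response`, and the choice to drop non-integer ids are assumptions for illustration, since the error-response algorithm is still a placeholder.

```python
import json


def error_response(raw_message, error_code):
    """Build an error response, echoing the command id when one is recoverable."""
    command_id = None
    try:
        parsed = json.loads(raw_message)
    except ValueError:
        # Not valid JSON at all, so there is no id to recover.
        parsed = None
    if isinstance(parsed, dict):
        candidate = parsed.get("id")
        # Only integer ids are echoed; strings and objects are dropped here,
        # which is one answer to the open question above. bool is excluded
        # because it is an int subclass in Python.
        if isinstance(candidate, int) and not isinstance(candidate, bool):
            command_id = candidate
    return {"id": command_id, "error": error_code}
```

This keeps the client able to correlate the error with the command it sent even when the rest of the message (for example the method or params) was invalid.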

Decide on a value serialization format

Remote automation frequently requires serializing JS objects and handles to DOM objects over the network. Different serialization formats are possible here with different tradeoffs. We should ensure we make an intentional decision for the format we use for the BiDi protocol; in particular it need not be the same as that for the HTTP protocol, although we do have a requirement to be able to share object handles i.e. it must be possible for a client (local end) to transform a WebElement from the HTTP protocol into a form where it can be transmitted over BiDi and will end up referring to the same element.

Prior Art

WebDriver / HTTP

JS values are serialized as JSON. Handles to elements and windows are serialized using a special object format that's included inline in the JSON. Other values that aren't in JSON (e.g. BigInt, Infinity, etc.) aren't supported at all.

CDP

Any remote object can be represented via a RemoteObject object. This includes a mechanism for returning JSON-serialized values where that is possible, and for representing JS primitives that aren't serializable as JSON. However, there isn't a mechanism for representing e.g. an array of elements, which means that something like findElements may require repeated trips over the API.

Firefox RDP

https://docs.firefox-dev.tools/backend/protocol.html#grips

Specify how to enable an event stream

For various use cases (e.g. logging) we want the ability to enable a stream of events from the browser to the client. But it isn't possible to have all events enabled all the time; we want a design that allows us to enable and disable particular events, either for the session as a whole or for specific browsing contexts. In some cases enabling a specific event stream might also enable additional commands that can make use of data from the events.

The explainer document currently proposes a mechanism where different event streams have enable/disable pairs with a counter that is incremented for each enable and decremented for each disable (clamped at 0), with the stream only being actually disabled when the count reaches 0. But there are concerns that this mechanism may be over-complex, providing a protocol-level solution to an application-level concern around how to handle multiple parts of the code.
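
For reference, the counter mechanism described above can be sketched roughly as follows. This is an illustration in Python, not the explainer's text; the class and method names are hypothetical.

```python
from collections import defaultdict


class EventStreamCounters:
    """Per-stream enable counters, clamped at 0 (hypothetical sketch)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def enable(self, stream):
        # Each enable increments the counter for that stream.
        self._counts[stream] += 1
        return self.is_enabled(stream)

    def disable(self, stream):
        # Each disable decrements the counter, never going below 0.
        self._counts[stream] = max(0, self._counts[stream] - 1)
        return self.is_enabled(stream)

    def is_enabled(self, stream):
        # The stream is only actually disabled once the count reaches 0.
        return self._counts[stream] > 0
```

Two independent parts of a client can each call enable and disable without turning the stream off for the other, which is the convenience the counter buys at the cost of extra protocol state.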

The simplest solution here seems to be to divide events into various groups ("domains" in CDP) and have an enable and disable command for each group. Some additional property may be needed to determine the scope of the enabling (e.g. one browsing context / agent vs global).

Confusing naming. `context: string` and `contexts: object`

context is used both for an object (contexts: [*BrowsingContextInfo] implies that a context is a BrowsingContextInfo) and for an ID string (context: BrowsingContext, specified as an ID string). This dual meaning causes confusion.

The response to the browsingContext.getTree command should look like:

...
    "contexts": [{
        "context": "3BD893F38972478DB095B393E8C3AFE7",
        "parent": null,
        "url": "about:blank",
        "type": "page"
      }]
...

While reading, I tend to misinterpret BrowsingContextInfo as a context, because of the property name contexts.

We should stick to consistent naming and rename either the string context to contextID or the array contexts to contextInfos.

Simplify `browsingContext.contextDestroyed` event

Currently, the browsingContext.contextDestroyed event contains the entire BrowsingContextInfo. I suggest keeping only context, which seems to be enough.

BrowsingContextDestroyedEvent = {
 method: "browsingContext.contextDestroyed",
 params: BrowsingContextInfo
}

Testing approach and policy for normative changes

To achieve interoperability between implementations, we will need an extensive test suite for WebDriver BiDi in web-platform-tests, as all new web platform features should have.

A few things to sort out sooner rather than later:

A testing plan would help define clearly what we're trying to test.
The testing approach will likely be similar to the existing WebDriver tests, but there are new problems to solve because of the async nature of the protocol. We will need to update wpt infra.
At what point should we start requiring tests for normative changes? (background)

A new standard requires more authentication ways than basic auth only

Currently Selenium supports only basic HTTP authentication. In a new standard it would be great to have more authentication methods, such as mutual TLS authentication or tokens.

Require some minimum message size to work?

In Design Proposal for ChromeDriver establishing BiDi WebSocket connection @k7z45 noted that the WebSocket implementation in Chromium has a 256 MB limit on messages sent and received.

That's a large limit, and the upper limit should probably be treated as a quality of implementation question as is typically done for things like URL lengths, <canvas> size, image size, etc.

However, should we have a minimum message size to support that we test in web-platform-tests, so that clients can depend on that working?

Provide a mechanism for extension commands to send custom error codes

WebDriver 1.0 has a fixed set of error code strings. We can re-use many of these to report errors in corresponding bidi commands and add new error codes for any new bidi commands we introduce. However, if we eventually support extension commands in the bidi protocol, it would be useful if implementers had a way to send custom error codes to report errors from their extension commands.

This could be as simple as requiring custom error codes to have a prefix followed by a ":", similar to extension capabilities.
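
A rough sketch of what such a check could look like, in Python for illustration. The set below is only a small subset of the WebDriver error codes, and the "vendor" prefix used in the usage note is hypothetical.

```python
# A small subset of the fixed WebDriver error codes, for illustration only.
STANDARD_ERROR_CODES = frozenset({"invalid argument", "unknown command", "unknown error"})


def is_acceptable_error_code(code):
    """Accept standard codes, or custom codes of the form "<prefix>:<name>"."""
    if code in STANDARD_ERROR_CODES:
        return True
    prefix, separator, rest = code.partition(":")
    # A custom code needs a non-empty prefix, the ":" itself, and a non-empty name.
    return bool(prefix) and bool(separator) and bool(rest)
```

Under this rule "vendor:quota exceeded" would be accepted as an extension error code, while an unprefixed "quota exceeded" would be rejected because it is not in the standard set.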

Transfer issues from other repo

Let's transfer https://github.com/w3c/webdriver/labels/BiDi to this repo.

It looks like I have the ability to do it (here's how) but I want to ask first.

@jgraham @bwalderman @christian-bromann good idea?

Omit `id: null`

According to the spec,

command id can be null, in which case the id field will also be set to null, not omitted from response.

Having id: null in each command seems redundant. I suggest omitting id: null.

Forward compatibility and the use of {*text => any} in CDDL

In #77 (comment) I learned what the {*text => any} in our CDDL is for: it allows arbitrary extra keys in an object. I don't know what part of the CDDL spec to read to understand this, but https://tools.ietf.org/html/rfc8610#appendix-H has it in an example.

In order to allow future extensions of the (unversioned) protocol, we will want to accept extra keys everywhere, which means we'll need to ensure {*text => any} is used wherever extra keys would otherwise be rejected. It looks like SubscribeParameters is a case where we don't have it, so that session.subscribe with params { events: 'browsingContext.*', futureExtraStuff: true } would fail.

Can tooling help us ensure that we don't miss {*text => any} anywhere? Or is there a way of using CDDL where the default behavior is inverted?
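
To make the difference concrete, here is a small Python sketch of the two behaviors. It is not a CDDL validator; the function and its allow_extra flag are hypothetical and only model "closed" versus "open" maps.

```python
def validate_params(params, required_keys, allow_extra):
    """Check a params object against a list of required keys.

    allow_extra=True models a map that includes {*text => any};
    allow_extra=False models a map that rejects unknown keys.
    """
    missing = [key for key in required_keys if key not in params]
    if missing:
        return False
    if allow_extra:
        return True
    # Closed map: every key present must be one of the known keys.
    return set(params) <= set(required_keys)
```

With the params from the example above, the open variant accepts futureExtraStuff while the closed variant rejects the whole command, which is the forward-compatibility hazard described here.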

Add a module for executing script

Script execution is a fundamental feature requirement for automation, and is a protocol feature that allows clients to implement many higher-level automation features before we have explicit support (i.e. anything that is accessible to content script).

In terms of precedent there are several examples to consider:

WebDriver has endpoints for executing a script both synchronously and asynchronously (although both calls are blocking). In both cases the script is wrapped inside an implied function. In the async case one of the arguments to that function is a callback that is called with the return value. This allows use cases like adding a script that returns when the page receives an event, but running the callback in the context of the event handler.
CDP (and, reportedly, the WebInspector protocol) has a Runtime.evaluate endpoint. This evaluates a script directly in the realm ("execution context") provided in the call, and returns the completion value. Async behaviour can be modelled by setting up a promise and using Runtime.awaitPromise to provide a callback once a condition is met.
WebExtensions have content injected scripts (e.g. gecko docs). These are run in the content process, but in a sandbox ("isolated world") so that they get access to the page without modifications to globals made by in-page scripts. They also provide unique DOM APIs to allow access to extension-related functionality and to postMessage data back to the main extension script.

The unique ability of extensions to return arbitrary data at a time of their choosing seems very valuable for automation. For example it would give the ability to send an event every time a mutation observer is called. It also allows modelling the behaviour of WebDriver/HTTP and CDP; to model the behaviour of execute script, the provided script text would be wrapped in a function like:

// ${args} and ${script} are template placeholders substituted before evaluation.
let result = function (${args}) {
  ${script}
}
WebDriver.emit(result)

and execute async script would be similar but with the WebDriver.emit function passed in as the final argument. To match the script execution to the returned value, we could return a token in the initial script response which would later be provided in the output of any emit calls originating in that script. I'm not sure how to model that in IDL, but in principle it's just a requirement that when script is injected the interface is instantiated with the token as internal data that the exposed API can access.
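
The token bookkeeping could look something like the following Python sketch. Everything here is hypothetical (the EmitRouter name, the "script-N" token format, and the methods); it only illustrates matching emitted values back to the script that produced them.

```python
import itertools


class EmitRouter:
    """Associates emitted values with the script execution that produced them."""

    def __init__(self):
        self._next_number = itertools.count(1)
        self._values = {}

    def register_script(self):
        # The token would be returned in the initial script response.
        token = f"script-{next(self._next_number)}"
        self._values[token] = []
        return token

    def on_emit(self, token, value):
        # Every emit carries the token of the script it originated in.
        if token not in self._values:
            return False  # unknown token: nothing to match against
        self._values[token].append(value)
        return True

    def values_for(self, token):
        return list(self._values.get(token, []))
```

Because each emit carries its token, a script that emits many times (for example from a mutation observer) can be told apart from other scripts running in the same realm.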

Another question is whether injected script should run directly in the targeted realm, or should be sandboxed as extensions are. In gecko we already partially sandbox WebDriver scripts in a way that allows them to access functions on the page, but does not allow globals to be set by WebDriver and read by the page. Not having any access to page-set state seems problematic, particularly for inspecting the state of objects. But some ability to sandbox scripts so they are able to access the initial values of DOM properties rather than page-set overrides does also seem useful.

Specify navigating a browsing context

One of the first things any automation script will do is to get a browsing context (#42) and navigate it. Navigating a browsing context is therefore something we'll want to work on fairly early, although it might not be the very first command.

At a high (enough) level it should be very similar to https://w3c.github.io/webdriver/#navigate-to, but there are plenty of considerations:

Should the command return as soon as the navigation is started (navigate sync steps done) or wait for navigation to complete like WebDriver classic does?
If there is waiting, is the page loading strategy a parameter?
Which events have to fire while navigating a browsing context?

What happens if a socket prematurely dies?

This was asked by @AutomatedTester in #24 (comment).

The minimum we should do is to define clearly what happens to the remote end if the connection is reset. Beyond that, it would be good if it's possible for a client to reconnect, and the most likely way that wouldn't be possible is if we decided (for simplicity elsewhere) that there could only be one websocket connection at all for the duration of a session. In other words, this issue is likely slightly entangled with what to do if a client tries to set up two parallel connections.

Include description in Symbol serialization?

https://w3c.github.io/webdriver-bidi/#serialize-as-a-remote-value

While a Symbol isn't defined by its description, it does have one, and it would be useful to include it without requiring another call to get it.

The description of Symbol('foo') is the string "foo". It looks like the descriptions of well-known symbols are strings like "Symbol.iterator".

Do we support > 1 connection for a single session?

This is an inline issue in https://w3c.github.io/webdriver-bidi/#transport.

In ChromeDriver establishing BiDi WebSocket connection Design by @k7z45 this question has come up and we should resolve it on the spec side so that it can be implemented and tested.

Specify a Logging module

WebDriver BiDi should have a Logging module to make it easy to monitor messages logged from various sources. This module might be a good candidate for early prototyping.

In practice, we've seen test automation assert that there is no logging to the console at all, or fail if there's any message text that matches a certain pattern, which brings pitfalls with any browser-specific log output.

The protocol should distinguish different kinds of log entries and allow filtering based on various criteria to further reduce traffic over the websocket.

One event:

logEntryAdded. Some possible parameters:
- level (String): e.g. info
- text (String): the log message
- kind (String): is it a Console API call, a kind of error or warning (CSS, JS, security), a network request?
- scriptContextId or browsingContextId for which this log entry was added
- timestamp (number)
- extra (object): additional optional metadata that applies to different kinds of log entries
  - url: resource URL for which this log entry was added
  - args: Console API call args
  - method: Console API method called
  - lineNumber
  - colNumber
  - requestId: network request ID
  - bootstrapScriptId if the entry originated from a bootstrap script

In addition to being able to subscribe to the logEntryAdded event, we should be able to narrow down the subscription with filters like the kind, a url prefix, scriptContextId, whether it's from a bootstrap script...

Maybe the general subscribe method could have a match parameter to specify such filters along the same lines as the match patterns in the bootstrap-scripts proposal.
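
As a rough illustration of such filtering, here is a Python sketch that checks a log entry against a filter. The entry fields follow the parameter list above; the filter key names (kind, urlPrefix, scriptContextId) and the function itself are assumptions made for this example.

```python
def matches_filter(entry, log_filter):
    """Return True if a logEntryAdded payload passes every criterion given.

    Criteria that are absent from log_filter are not checked.
    """
    kind = log_filter.get("kind")
    if kind is not None and entry.get("kind") != kind:
        return False

    url_prefix = log_filter.get("urlPrefix")
    if url_prefix is not None:
        # url lives in the optional "extra" metadata; treat a missing url as "".
        url = entry.get("extra", {}).get("url", "")
        if not url.startswith(url_prefix):
            return False

    context_id = log_filter.get("scriptContextId")
    if context_id is not None and entry.get("scriptContextId") != context_id:
        return False

    return True
```

Applying the filter on the remote end, before an event is sent, is what would reduce traffic over the websocket; applying it on the client would only reduce noise.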

Add a bootstrap scripts feature

Initial proposal at https://github.com/w3c/webdriver-bidi/blob/master/proposals/bootstrap-scripts.md

Cross-reference CDDL productions

Saying something like "matching the BrowsingContextInfo production" is annoying without any link for cross-referencing.

Specify browsing context discovery

We need a mechanism to discover which top level / other browsing contexts exist and a set of events for the browsing context lifecycle (created/destroyed/etc.).

Command response: "value" or "result"?

In #50, the CommandResult CDDL definition was added, with "value" as the property to hold the result of a successful command. I noticed this and suggested changing it in #56 (comment), but am moving the discussion to a separate issue.

https://www.jsonrpc.org/specification uses "result" and in #14 (comment) and #14 (comment) I provided samples of Chrome and Safari's devtools/inspector protocols showing that they also use "result".

Although it will not be possible to use existing libraries for JSON-RPC or existing devtools/inspector protocols with WebDriver BiDi, I would like to match existing protocols where it is of no consequence and only a matter of picking a name. So I am suggesting that we use "result".

Specify how to establish a connection

https://github.com/w3c/webdriver/issues/1498 concerns what kinds of messages to exchange over a connection, but to get off the ground we also need to first enumerate possible connections and establish a connection. This would, I think, also be the "upgrade mechanism" for getting from a WebDriver HTTP connection to a BiDi connection.

Straw proposal, assuming an already created WebDriver session:

GET /session/{session id}/targets to enumerate targets to which one can connect. Among other information there would be a (WebSocket?) URL to connect to. That's all.

However, this is probably a flawed proposal, and it seems like the choices here will have large consequences. Questions that come to mind:

What is the scope of a connection, is it a single realm, an agent cluster, a browsing context group, the whole browser, or something else?
Is it a goal to reduce the number of connections by multiplexing?
Should it be possible to create a session using BiDi, not merely "upgrade" one?