whatwg / urlpattern

URL Pattern Standard

Home Page: https://urlpattern.spec.whatwg.org/

License: Other

Makefile 0.56% HTML 99.44%

standard urlpattern whatwg whatwg-urlpattern

urlpattern's Issues

consider making service worker scopes only use pathname matching

At the TPAC 2020 discussion @annevk suggested making service workers scopes only match against pathnames instead of the entire URL. This is related to the question of removing hostname matching in #19, but slightly different.

After thinking about the suggestion and what it would mean I am leaning towards doing this. My idea is to expose a separate interface:

let p = new URLPathnamePattern('/foo/*');
if (p.test(myPath)) {
  // do stuff
}

It would also take a baseURL for relative pathnames:

let p = new URLPathnamePattern('./*', self.location);
if (p.test(myPath)) {
  // do stuff
}

The full URLPattern and this URLPathnamePattern interface could share the same internal code for matching pathnames.

For lists we would need something like URLPathnamePatternList.

One notable change for the test() and exec() methods is they would no longer require a full parsable URL. Instead they would only require a pathname. For code that only has a pathname to begin with this would make the API easier to use instead of constructing full URLs just to wildcard all the non-path parts.

Service worker scopes would then take URLPathnamePattern or URLPathnamePatternList. This would cause service workers to lose the ability to match against any search string values, but I think we all agree that was an accidental feature and does not work well for real use cases.

My current plan is to prototype this kind of path-only interface in addition to URLPattern for use cases looking for full URL matching.
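As a rough illustration of the path-only idea, the sketch below compiles a pathname pattern into a regular expression and tests bare pathnames against it. URLPathnamePattern does not exist; the helper name compilePathnamePattern and its wildcard-only syntax are assumptions made for this example, not the proposed matching algorithm.

```javascript
// Hypothetical helper: supports only literal text and the `*` wildcard.
function compilePathnamePattern(pattern, baseURL) {
  // Resolve relative patterns such as './*' against the base URL's path.
  // `*` is not percent-encoded in URL paths, so it survives resolution.
  const pathname = baseURL ? new URL(pattern, baseURL).pathname : pattern;
  const source = pathname
    .split('*')
    // Escape regexp metacharacters in the literal parts.
    .map((literal) => literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp(`^${source}$`);
}

// Only a pathname is needed as input; no full URL has to be constructed.
const absolute = compilePathnamePattern('/foo/*');
const relative = compilePathnamePattern('./*', 'https://example.com/app/index.html');
```

Here absolute accepts /foo/bar but rejects /foobar, and relative accepts paths under /app/.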

Query parameter matching not supported on all platforms

I'm a bit concerned about the proposed ability to match on query parameters. We will want this behaviour to be consistent in the web app manifest spec and for installed PWAs. However, platforms like Android do not offer the ability for apps to differentiate captured URLs based on query parameters. Specifically, Android supports capturing based on <scheme>://<host>:<port>/<path>, and thus we can't implement query scoping rules on Android natively.

The spec "legitimises" the ability to differentiate scopes based on query parameters. If sites start using this frequently, URL capturing won't work as expected on platforms that don't support this.

Missing token usage

https://wicg.github.io/urlpattern/#add-a-token doesn't describe what to use the token for. I assume step 6 should be appending token to tokenizer's token list.

ServiceWorkerAllowed should be mentioned in the explainer security section

Currently we discussed ServiceWorkerAllowed here:

https://github.com/wanderview/service-worker-scope-pattern-matching/blob/master/explainer.md#serviceworkerallowed-behavior

But not in the security section. We should at least point back to the previous section and/or briefly summarize the approach.

pass patterns as strings to service worker API

Per discussions leading up to w3c/ServiceWorker#1468 (comment) there is a growing consensus that patterns should be passed to the service worker API as strings and not objects.

While that integration will ultimately be spec'd in the service worker repo I'm filing an issue here to track it and for eventually updating the explainer.

Tracking changes important to the polyfill

I made a small polyfill on top of path-to-regexp to play around with this and found a few differences.

First of all, something like

pathname: '/*.:imagetype(jpg|gif|png)'

doesn't work, the wildcard has to be (.*) instead. I do like the /* wildcard so maybe we can just document this as an improvement.

You expose named groups as result.pathname.groups, but JS regexp actually supports named groups now (path-to-regexp doesn't use that feature) and exposes them as result.groups.

Basically this

const imagePattern = new URLPattern({
  pathname: '/(.*).:imagetype(jpg|gif|png)'
});

let result = imagePattern.exec("/photo.jpg");

console.log(result)
console.log(result.pathname.groups['imagetype']);

could be done with regexps like:

// A regex literal keeps "\." as an escaped dot; inside a string it would collapse to ".".
const imageRegex = /\/(.*)\.(?<imagetype>jpg|gif|png)/;
let res = imageRegex.exec("/photo.jpg");

console.log(res);
console.log(res.groups['imagetype']);

should URLPattern constructor throw SyntaxError or TypeError for bad patterns?

Currently the implementation and spec have URLPattern always throwing TypeError exceptions. For bad pattern strings, though, should we instead throw SyntaxError?

Does /foo/?* prevent /foobar from matching?

Based on the description that "The ? pattern character makes the preceding character optional" and "The * pattern character matches zero or more characters", it would seem that a path of "/foo/?*" would match "/foo", "/foo/", "/foobar", and "/foo/bar", basically it's allowed to either have a slash, or not, and it's allowed to have anything after the slash.

I think the intention is that "/foobar" would not be matched by this (otherwise why not just write "/foo*"?). It's not immediately obvious that this would be blocked.

It's possible that this is explained by the "non-variable prefix" stuff, but I haven't fully understood that from a cursory reading of the explainer. If so, it should maybe be written to explain that explicitly and up front.
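To make the two readings concrete, here is a small sketch that expresses each as a regular expression. Both expressions are my own translation of the prose above, not the conversion the explainer or spec performs.

```javascript
// Literal reading: "/?" makes the slash optional, "*" matches anything.
const literalReading = /^\/foo\/?.*$/;
// Segment-aware reading: the optional slash and the wildcard travel together.
const segmentAwareReading = /^\/foo(?:\/.*)?$/;

const paths = ['/foo', '/foo/', '/foobar', '/foo/bar'];
const results = paths.map((path) => ({
  path,
  literal: literalReading.test(path),
  segmentAware: segmentAwareReading.test(path),
}));
console.log(results);
```

Under the literal reading all four paths match; under the segment-aware reading only /foobar is rejected.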

Add link to the spec to the GitHub project page

Sad:

Nice:

(You could also turn off releases/packages/environments in the sidebar via settings if you wanted.)

URL Normalization

Some touchy edge cases with pathname matching that I've come up against and that are worth considering:

1. Handling of repeated / in URLs
   - In some cases, services handle multiple slashes the same as a single slash (e.g. https://github.com/WICG/////urlpattern resolves just fine).
2. URL encoding and path matching
   - When you write a pattern, should you write the unicode character or the encoded value? I understand people would like to just write /你好 and have it work instead of writing /%E4%BD%A0%E5%A5%BD.
3. String normalization
   - As covered in the link below and on MDN, there are a number of ways to write the same unicode character. It may be worth doing string normalization in some way to avoid this edge case, though it's probably reasonable not to, since there would be performance implications.

Some of the solutions to these are covered in https://github.com/pillarjs/path-to-regexp#process-pathname, but a lot of implementations ignore the edge cases, which results in inconsistent handling across users. Both 1 and 3 likely depend on how "exact" matches want to be.
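The three edge cases can be observed with standard JavaScript APIs. This is only a sketch of the underlying behaviour; whether URLPattern should apply any of these steps is exactly the open question, and the example.com URL is a placeholder.

```javascript
// 1. Repeated slashes: the URL parser keeps them, so collapsing is opt-in.
const repeated = new URL('https://github.com/WICG/////urlpattern').pathname;
const collapsed = repeated.replace(/\/{2,}/g, '/');

// 2. Encoding: the URL parser percent-encodes non-ASCII path characters.
const encoded = new URL('/你好', 'https://example.com').pathname;

// 3. Normalization: the same visible character can be spelled two ways.
const composed = '\u00e9';    // "é" as a single code point
const decomposed = 'e\u0301'; // "e" followed by a combining acute accent
const equalRaw = composed === decomposed;
const equalNFC = composed.normalize('NFC') === decomposed.normalize('NFC');

console.log({ repeated, collapsed, encoded, equalRaw, equalNFC });
```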

consider matching structured input

Sometimes a developer may only care about matching a part of the URL; e.g. just against pathname. The plan of record to support this is:

let pattern = new URLPattern({ pathname: "/foo/bar/*" });

This matches the pathname, but all other URL components are wildcards.

This mostly works for the use case, but is still awkward to use since URLPattern.match() requires parseable full URLs to be passed in. For example:

const input = "/foo/bar/baz";

// fails due to invalid input
pattern.match(input);

// awkward and expensive, but works
pattern.match(new URL(input, self.location).href);

To make this more ergonomic we could allow URLPattern.match() to accept structured input:

pattern.match({ pathname: input });

This allows the developer to indicate they only have a pathname. The other components of the URL are unknown and will only match if the pattern has a wildcard specified for the component.

Double check the spec to verify we are handling code unit vs code point correctly

The C++ I am translating from operates on UTF-8 strings. Spec language uses UTF-16 strings. It's possible I've confused things in my translation. I think it should be OK if we are indexing into code points, but this issue is intended as a reminder to go back and check.
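For reference, the distinction is easy to see in JavaScript, whose strings are indexed by UTF-16 code units. The sample string is an arbitrary choice for illustration.

```javascript
const text = 'a\u{1F600}b'; // "a", an emoji outside the BMP, "b"

const codeUnitCount = text.length;       // counts UTF-16 code units
const codePointCount = [...text].length; // string iteration walks code points

const unitAtIndex1 = text.charCodeAt(1);   // high surrogate only
const pointAtIndex1 = text.codePointAt(1); // full code point

console.log(codeUnitCount, codePointCount, unitAtIndex1.toString(16), pointAtIndex1.toString(16));
```

An algorithm that indexes by code unit sees four positions here, while one that indexes by code point sees three.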

should we make empty string port values and the default port number equivalent?

Consider a situation where you only want to support the default port for a protocol; e.g. 443 for https. To do this today you would need to specify a port pattern like port: '(443)?'.

As an alternative we could make the empty string port and the default port number equivalent. Then you could write a pattern like port: '' which would match either the default port explicitly set or the empty string for the port. It would not match other port values.

My inclination is to maybe not do this at first and only add it later if requested. I'm not sure how backward compatible that approach is, though.
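For comparison, the URL parser already treats an explicit default port and an omitted port as the same thing. The snippet below only shows that existing URL behaviour; it makes no claim about what URLPattern does with port patterns.

```javascript
// An explicit default port is coerced to the empty string by the URL parser.
const explicitDefault = new URL('https://example.com:443/').port;
const omitted = new URL('https://example.com/').port;
// A non-default port is preserved as written.
const custom = new URL('https://example.com:8443/').port;

console.log({ explicitDefault, omitted, custom });
```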

Relationship with URI Templates (RFC 6570)

Hi,

First of all, thank you for working on this topic.

There is an existing web standard (implemented by several libraries and server-side frameworks) to express URI templates: https://datatracker.ietf.org/doc/html/rfc6570

As stated in the RFC, it isn’t sufficient to create a fully-featured router, however it is sufficient for simple use cases:

Some URI Templates can be used in reverse for the purpose of variable matching: comparing the template to a fully formed URI in order to extract the variable parts from that URI and assign them to the named variables. Variable matching only works well if the template expressions are delimited by the beginning or end of the URI or by characters that cannot be part of the expansion, such as reserved characters surrounding a simple string expression. In general, regular expression languages are better suited for variable matching.

For instance, we use this syntax in the Mercure protocol to specify topics to match: https://mercure.rocks/spec#topic-selectors

This syntax is also widely used in the PHP community to define routing patterns (it is usually converted as a regex for performance by the router libraries).

For consistency with the existing standards of the web platform, shouldn’t your proposal extend the syntax defined in RFC 6570, or at least reuse the parts that can be reused (the named groups for instance)?

Also, having a pure string representation of the pattern (#22) would help interoperability. For instance, this would allow to use this syntax in the Mercure protocol (as a query parameter value) or in declarative routers.

Best regards,

Patterns as a string

Creating a meta proposal to bring together the discussion around #20, #19, #14, and #13. Feel free to correct me or close if I haven't thought about some edge cases in the individual locations that the pattern intends to be used, I'm less familiar with things like service worker scopes and CSP.

I'm making a couple of assumptions:

Using a single string is the simplest API for developers
Being incompatible with path-to-regexp is acceptable (on this point, it already is - just in a different way)
You don't use * (wildcards) as the default behavior
You enforce the scheme/hostname to be defined (e.g. http://, https?:// or *://)

With these assumptions, I think you could create a simpler API that mirrors the URL API. For example, new URLPattern('*', window.location) would be supported. You could always enforce a full URL instead of partial URLs, similar to the current URL API too.

In the service worker use-case, you can have it throw if the "origin" doesn't match the static prefix of the URLPattern. There was already some need for this sort of behavior due to the scope matching ordering: https://github.com/WICG/urlpattern/blob/master/explainer.md#scope-match-ordering.

Finally, on the path matching "magic", if you remove the magic prefix/suffix and make it explicit you can just do new URLPattern('/:foo{/bar}?', 'http://{:subdomain.}*.example.com').

should URLPattern be case sensitive or insensitive

The path-to-regexp library is case insensitive by default with an option sensitive:true to override. In contrast, most components of a URL are case sensitive with the hostname being the only case insensitive part.

What should URLPattern default to and should it provide an option to override?

We could follow path-to-regexp as our cowpath as to what is popular/expected and make all components case insensitive by default. We would then likely need to provide an option to require case sensitivity since it will matter in some URLs.

I believe @domenic advocates that we by default match URL behavior. We might need to provide an override option to always be case insensitive.
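As a quick reference for what "match URL behavior" would mean, this sketch shows which components the URL parser lowercases. The input URL is an arbitrary example.

```javascript
const url = new URL('HTTPS://EXAMPLE.com/Foo/Bar?Q=1#Frag');

// The scheme and hostname are lowercased during parsing...
const protocol = url.protocol;
const hostname = url.hostname;
// ...while pathname, search, and hash keep their original case.
const pathname = url.pathname;
const search = url.search;
const hash = url.hash;

console.log({ protocol, hostname, pathname, search, hash });
```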

consider how to handle changes to ecma-262 regex standard

While reviewing some of my prototype code @jeremyroman raised the point that we need to consider what to do when ecma-262 regex changes upstream in the spec and js implementations.

The path-to-regexp syntax we are using embeds regular expressions and while it largely can pass the expressions through it does need to have some knowledge of internal regex structure. This is mainly associated with detecting nested groupings and forbidding unnamed capture groups.

It's possible our initial parse algorithm that works for these scenarios today could break if regex syntax is expanded in the future. We might start accidentally blocking the use of a new feature that is added, etc. We should consider what to do in this situation.

I'm filing this issue so we don't forget, but I'm not sure there is anything we can do immediately. I don't think we want to embed a complete regex parser in urlpattern directly. Instead I think we would want to consider the impact of regex changes on a case-by-case basis and assess how to adapt urlpattern, if necessary. My expectation is that regex does not change too frequently, so this should not be excessively burdensome.

Missing argument in `is a non-special pattern char` calls

https://wicg.github.io/urlpattern/#ref-for-constructor-string-parser②②

Return the result of running is a non-special pattern char given parser’s token index and "#".

https://wicg.github.io/urlpattern/#ref-for-constructor-string-parser②①

If result of running is a non-special pattern char given parser’s token index and "?" is true, then return true.

Both of these calls are missing the parser argument.

Why the limit to one search parameter?

At most one search parameter pattern.

I'm curious as to the rationale behind this limit. It seems potentially useful to be able to allow any number of search param restrictions.

I would imagine that these search parameters act as additional constraints, i.e., there is no way to make a URL that matches a particular pattern, but adding extra query params to the URL fails to match. However, multiple search parameter patterns would require a matching URL to feature ALL of those search parameters.
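The "additional constraints" semantics described above could look something like the following sketch. The helper matchesSearchConstraints and its '*' convention for "any value" are hypothetical and exist only to illustrate the idea.

```javascript
// Hypothetical helper: every constraint must be satisfied by some parameter;
// parameters not mentioned in the constraints are ignored.
function matchesSearchConstraints(url, constraints) {
  const params = new URL(url).searchParams;
  return Object.entries(constraints).every(([name, expected]) =>
    params.getAll(name).some((value) => expected === '*' || value === expected)
  );
}

const constraints = { wiz: 'wham', zot: '*' };
const withExtras = matchesSearchConstraints('https://example.com/?x=1&zot=9&wiz=wham', constraints);
const missingOne = matchesSearchConstraints('https://example.com/?wiz=wham', constraints);

console.log({ withExtras, missingOne });
```

Extra parameters and parameter order do not affect the result, but a URL lacking any one of the required parameters fails to match.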

make URLPattern and URLPatternList serializable/structured cloneable

During the call in w3c/ServiceWorker#1535 @asutherland asked if these types will be serializable. I interpreted this as asking if they can be structured cloned. I think it makes sense to do so and we should include it in the proposal.

should URLPattern match a trailing "/" by default like path-to-regexp?

During the call in w3c/ServiceWorker#1535 we discussed the optional prefix behavior. For example:

new URLPattern({ pathname: '/foo/:bar?' });

Would match:

/foo
/foo/xyz

But would not match:

/foo/
/foobar

The main point of the feature is to avoid matching the /foobar here, but not matching /foo/ may be a problem for sites that normalize URLs with a trailing slash.

This discrepancy could be because the proposal is currently based on "strict" path-to-regexp mode instead of "default" mode. In "default" mode a trailing "/" is always permitted.

This issue is to figure out how to handle this. Should we switch the proposal to use default mode which is what most developers use? Should we do something different for optional prefix behavior?
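The difference between the two modes can be sketched with hand-written regular expressions for the '/foo/:bar?' pattern. These expressions are my approximation of the behaviour described above, not the output of path-to-regexp.

```javascript
// "Strict"-style: the optional group carries its "/" prefix; no trailing slash.
const strict = /^\/foo(?:\/([^\/]+))?$/;
// "Default"-style: same, but a single trailing "/" is always permitted.
const loose = /^\/foo(?:\/([^\/]+))?\/?$/;

const table = ['/foo', '/foo/xyz', '/foo/', '/foobar'].map((path) => ({
  path,
  strict: strict.test(path),
  loose: loose.test(path),
}));
console.log(table);
```

Both variants reject /foobar; only the trailing-slash-tolerant one accepts /foo/.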

Web NFC does some URL matching, would be good to coordinate

https://w3c.github.io/web-nfc/#url-pattern-match-algorithm

consider generating URL strings from a URLPattern

One of the most frequent pieces of feedback we have gotten is that developers would like to generate URL and component strings from a URLPattern. For example, something like:

const pattern = new URLPattern({ pathname: '/product/:id' });
const my_pathname = pattern.generate('pathname', { id: '12345' });
// my_pathname is "/product/12345"

There is a discussion thread about this in #41. I'm filing this issue to consider as a future enhancement.
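Until something like generate() exists, a minimal stand-in for the simplest case can be written by hand. The function generatePathname below is hypothetical, handles only plain :name groups, and ignores modifiers, wildcards, and custom regexp groups.

```javascript
// Hypothetical helper: substitutes ":name" groups with encoded values.
function generatePathname(template, params) {
  return template.replace(/:([A-Za-z_][A-Za-z0-9_]*)/g, (match, name) => {
    if (!Object.prototype.hasOwnProperty.call(params, name)) {
      throw new TypeError(`Missing value for group "${name}"`);
    }
    // Encode so that the value cannot introduce extra path segments.
    return encodeURIComponent(params[name]);
  });
}

const myPathname = generatePathname('/product/:id', { id: '12345' });
console.log(myPathname); // "/product/12345"
```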

should exec() results contain canonicalized or original input?

Currently the exec() method returns a URLPatternResult which has an input property. There is also then a URLPatternComponentResult for each component which each then has their own input property containing the component input value.

So, for example:

const p = new URLPattern({ hostname: 'cafe.com' });
const r = p.exec({ hostname: 'cafe.com' });
// r.input contains an object { hostname: 'cafe.com' }
// r.hostname.input contains 'cafe.com'

Now, what should we do if the input had to be canonicalized?

My implementation currently returns the original input for result.input, but will use the canonicalized value for the component input property. For example:

const p = new URLPattern({ hostname: 'xn--caf-dma.com' });
const r = p.exec({ hostname: 'café.com' });
// r.input contains an object { hostname: 'café.com' }
// r.hostname.input contains 'xn--caf-dma.com'

On the one hand, I like that you can see both the original input and the canonicalized input on the result. It seems like this would be useful for debugging why a match worked or not.

On the other hand I don't like that there are two properties named the same thing with slightly different behavior.

Thoughts?
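The canonicalization in question is the same one the URL parser applies to hostnames, which can be observed directly. This sketch only shows the original and canonical forms side by side; how URLPatternResult should expose them is the open question.

```javascript
// "café.com" written with an escape so the accented character is unambiguous.
const originalHostname = 'caf\u00e9.com';
// The URL parser converts non-ASCII hostnames to their ASCII (Punycode) form.
const canonicalHostname = new URL(`https://${originalHostname}/`).hostname;

console.log({ originalHostname, canonicalHostname });
```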

should we provide pattern matching against hostnames?

At the TPAC 2020 discussion @annevk raised the question that the use cases for hostname pattern matching are not well documented. Also, path-to-regexp does not have broad usage for hostname matching like it does for pathnames.

So far the hostname matching use cases are:

FetchEvent handling that needs to execute different logic based on the origin of the request; e.g *.my.cdn.com, etc. This is based on discussing workbox use cases with @jeffposnick.
Web platform features other than service workers that need to match against hostname. CSP is a legacy API that does this. @jeremyroman is working on a new API feature that needs this and was hoping to use URLPattern.

So far, however, it does not sound like these use cases need my proposed "suffix behavior" where . is treated as a suffix character that is grouped with optional and repeating groups. Since this is an addition to path-to-regexp, it probably makes sense just to remove this feature for now. Instead, hostname would support "simple" patterns without any special prefix or suffix behavior.

Based on all that I plan to at least prototype the full URLPattern including hostname matching. This will not require that much extra work if I only expose the "simple matching" approach without prefix/suffix behavior. Once developers can play with the API and give feedback we can decide whether to keep or remove it.

Let's use this issue to track use cases and feedback around the feature.

Compatibility of future features

I like the idea of adding features in future, but I'm not sure how we can do it.

With /*.(png|jpg), this will initially match:

/foo.(png%7Cjpg)
/bar.(png%7Cjpg)

And these will not match:

/foo.png
/bar.jpg

If we later add support for groups, the above reverses. Is that something we can do? It feels like a compatibility issue. Or, will there be a way to opt into new features? Or, will we reserve a set of characters for future use, making /*.(png|jpg) throw "not implemented" or somesuch?

Finish and tidy up the domintro section

The domintro section currently lists the properties only; it doesn't list the constructor or methods.

Also, the way the properties are listed is a bit strange; it makes them look like static properties: URLPattern.protocol. Instead they should look like instance properties: urlPattern.protocol.

See whatwg/meta#190, or probably just skip to whatwg/meta#190 (comment) for my summary of best practices that we're trying to converge the ecosystem on a bit.

API bikeshedding

baseUrl should be baseURL per the URL Standard (but see #1)
path should probably be pathname to match URL and friends?

I suggest that search be overloaded between a string, an iterable of two-element iterables, or a record, similar to the URLSearchParams constructor itself:

new URLPattern({
  ...,
  search: "foo=bar&*=baz"
});

new URLPattern({
  ...,
  search: [['foo', 'bar'], ['*', 'baz']]
});

new URLPattern({
  ...,
  search: { foo: 'bar', '*': 'baz' }
});

Why are baseUrl and path separate fields?

It seems simpler and equally expressive to just have a single field representing the origin and path parts of the scope. I'm not sure what to call it, perhaps just baseUrl?

{
    baseUrl: self.origin,
    path: '/foo/?*',
    search: '*',
}

becomes:

{
    baseUrl: self.origin + '/foo/?*',
    search: '*',
}

Alternatively, if you want to keep them separate, why not rename baseUrl to origin?

Consider providing a natural ordering for URLPattern objects

In this twitter thread:

https://twitter.com/posva/status/1418470155579953153

It was pointed out that having an ordering for patterns is helpful for routing frameworks.

I have a plan to provide an ordering for restricted patterns that are service worker scopes. This issue is to consider if there is a way to come up with a generalized ordering.

The big question in my mind would be how to treat custom regular expression groups. My inclination would be to treat them as equal to a full wildcard * in terms of specificity since we can't really tell how specific or not they are. (Inspecting a custom regexp to determine what it's going to do is probably going to be too complex, etc.)

should there be a stringifier and what should it return?

Should there be a stringifier for URLPattern? If so, what should it return?

There is no single string input to URLPattern that works for all patterns, so we can't use that as the toString() output.

We could add a general human-readable string, but it's unclear if that is necessary if devtools already supports iterating properties. With per-component pattern properties it's already quite debuggable:

Matching root subpaths with an exception of e.g. /cms

I believe a quite common case is to have a public web page on root (/) and some CMS on a sub path (e.g. /cms, /admin). Effectively those can be completely separate applications.

Note that the public page typically would also have routes like "/product/123/title" or "/blog/2020/my-article".

So it would be useful to have a separate serviceWorker for both applications. And it would be useful to exclude "/cms/*" instead of defining all paths for the main app.

Consider a string syntax instead of a dictionary

I wonder if a syntax for URL patterns would be easier to learn and read instead of designing a dictionary which takes up multiple lines just to express a simple pattern.

I'm thinking that it should basically have a syntax that looks like a URL with ? and * being special, and \? (an escaped question mark) separating the path from the query, and with special treatment being applied to the query, allowing a match in any order.

For example:

https://example.com/foo/bar/?*\?wiz=wham&zot=*

Means to match any URL starting with "https://example.com/foo/bar" or a sub-path thereof, AND which must have the "wiz=wham" anywhere in its query param list, AND it must have a zot parameter with any value. wiz and zot can appear in any order, and the URL is allowed to have other query params, and they can come before or after wiz and zot.

The counter I can think of to this is that there's potential confusion to the user in that it looks like a glob or regex that would require those query parameters to appear in that order and first in the param list, where in fact we are creating a new type of URL-specific glob that treats query parameters separately (allowing them to appear in any order). Also having to write /? is ugly and you might forget the slash and create an entirely different meaning.

The advantage of this would be that you can use it in both Service Worker and Manifest (a JSON dictionary) with the exact same syntax, since you'd be defining a syntax for a "URL pattern" rather than a class URLPattern (requiring a separate explanation of how to express it in a JSON dictionary).

Support more generic URLPattern such as scheme, host, origin, etc

I think this API has more potential usage, if we expand on what can be supported.

For example, if we support scheme:

new URLPattern({
    scheme: 'https',
  })

This can be used to pass to Sanitizer API for pattern matching against a['href'], iframe['src'], etc.

Similarly, if we can support host or origin:

new URLPattern({
    origin: 'https://example.com',
  })

This can be again used to pass to Sanitizer API for pattern matching against form['action'].

CC: @otherdaniel

`add a token` call missing arguments

https://wicg.github.io/urlpattern/#ref-for-tokenize-policy①

Run add a token given tokenizer and "char".

This call is missing multiple arguments

Escape Regexp String missing index increment step

https://wicg.github.io/urlpattern/#escape-a-regexp-string is missing a step for incrementing the index

Defaults differ depending on base URL being set or not, should they?

Taking this from kenchris/urlpattern-polyfill#26

It may well be a non-issue but I figured I should at least raise it.

new URLPattern({}); // all parts are wildcard
new URLPattern({ baseURL }); // all parts not specified in the base URL are exact matches of empty strings, no wildcards

This difference in behaviour confused me so may confuse others, too.

We need to support empty strings as values, but maybe we could treat them as wildcards during conversion/parsing of a base URL?

Presumably under the hood, passing a base URL results in it being parsed (probably via URL) and we then assign each part (property) to the URLPattern. At that point, maybe we could assign empty strings as wildcards?

Another example:

new URLPattern({ baseURL: 'https://example.com' })
// results in a pattern object:
{
  hostname: "example.com",
  protocol: "https",
  pathname: "/",
  port: "",
  username: "",
  search: "",
  password: "",
  hash: ""
}

Meanwhile:

new URLPattern({
  hostname: 'example.com',
  protocol: 'https',
  pathname: '/'
});
// results in a pattern object:
{
  hostname: 'example.com',
  protocol: 'https',
  pathname: '/',
  port: '*',
  username: '*',
  search: '*',
  password: '*',
  hash: '*'
}

If we considered empty values in base URLs as wild cards, these two would behave the same.

In the real world, I think it's likely someone might try this:

new URLPattern({
  pathname: '/user/:id',
  baseURL: 'https://example.com/'
})

But consider how that'd then work:

pattern.test("https://example.com/user/123"); // true
pattern.test("https://example.com/user/123?foo=bar"); // false
pattern.test("https://example.com/user/123#foo"); // false

If you think it's better how it currently works, that's also fine, but please do document the behaviour difference well: the fact that it will not default to wildcards with a base URL.
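The empty strings come straight from URL parsing, as this sketch shows. It only inspects the URL object; the mapping onto pattern components in the comments restates the issue's description rather than the polyfill's actual code.

```javascript
const base = new URL('https://example.com');

// Components as the URL parser reports them (delimiters stripped).
const components = {
  protocol: base.protocol.replace(/:$/, ''),
  username: base.username,
  password: base.password,
  hostname: base.hostname,
  port: base.port,
  pathname: base.pathname,
  search: base.search.replace(/^\?/, ''),
  hash: base.hash.replace(/^#/, ''),
};

// Every component the base URL does not mention is '', not a wildcard.
const emptyComponents = Object.keys(components).filter((key) => components[key] === '');
console.log(components, emptyComponents);
```

Treating those empty components as wildcards, as suggested above, would be a decision made on top of this parsing step.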

Consider fully supporting RTL and Bidi URLs

RTL text in URL is not as uncommon as many think. This API must support RTL and bidi text in URL and patterns. It's important because : is used for path parameters in the beginning of the parameter name. In a RTL URL it would be a bit confusing unless it's fully specified.

should we attempt to URL canonicalize patterns?

The URL() constructor does a number of things to canonicalize the input string. For example:

Illegal characters are automatically URL encoded. For most components this is percent encoding, but for hostname it appears to use IDNA encoding, etc.
Some characters, like {, normally get percent encoded by URL encoding, but URLPattern uses it as one of its special characters.
For some protocols (http/https/etc) the pathname is required to consist of at least /. The slash is added if missing.
For some protocols (http/https/etc) the pathname is flattened to remove . and .. segments.
If the port matches the default port for a given protocol, then it is coerced to the empty string.
Probably many others...

So the question is can we and should we apply these transformations to component patterns. My original intent was to try to do so, but I have identified a number of problems:

Often the transformations are not safe within the custom regexp pattern groupings. For example, if you try to percent encode a unicode character within a regexp [a-z] character list it won't work correctly since each character in the percent encoding is considered independently. Similar difficulties arise with the .. and . flattening since those characters may appear within a regexp. In general we don't want to have to duplicate the regexp parser in URLPattern, so solving this seems quite difficult.
A URLPattern may often not have a fixed protocol. This makes it difficult to apply transformations that are conditional on protocol.
Developers may be confused that some characters are URL encoded, but others are not.

Therefore I'm leaning towards not applying any canonicalization or automatic encoding for patterns. For example, if a non-ascii character is included in the pattern we would throw. In contrast, however, URL values passed to test() and exec() would be fully URL canonicalized. Therefore developers would be required to write patterns to match canonical URLs, but we would not fully enforce or automatically help with that at URLPattern construction time.

I intend to implement the above to start, but if it becomes a problem we could fall back to a half measure. The URL canonicalization could be applied to patterns, but only outside of custom regexp groups. If you include a unicode character within a custom regexp then URLPattern would throw, but otherwise it would be automatically URL encoded. Of course, pattern special characters like { would still need to be exempted from encoding, so it would not be quite equivalent to URL constructor behavior.

Generally it will be easier to move from the "no canonicalization" approach to "canonicalize outside of regexp" without breaking existing patterns.

Thoughts?
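Several of the transformations listed above, and the character-class problem, can be reproduced with the URL constructor and a plain RegExp. This is a sketch of the existing behaviour only; the example URLs are placeholders.

```javascript
// Dot segments are flattened and a missing path becomes "/".
const flattened = new URL('https://example.com/a/./b/../c').pathname;
const addedSlash = new URL('https://example.com').pathname;
// "{" and "}" are percent-encoded in paths, yet URLPattern uses "{" as syntax.
const braces = new URL('https://example.com/{id}').pathname;

// Percent-encoding inside a character class changes its meaning:
// each character of "%C3%A9" becomes an independent class member.
const encodedE = encodeURIComponent('\u00e9'); // "%C3%A9"
const naiveClass = new RegExp(`^[${encodedE}]$`);
const matchesPercentSign = naiveClass.test('%');
const matchesEncodedSequence = naiveClass.test(encodedE);

console.log({ flattened, addedSlash, braces, matchesPercentSign, matchesEncodedSequence });
```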

consider allowing an 'origin: foo' init attribute

When passing an init dictionary to the URLPattern constructor you can set things like protocol, hostname, and port either through setting values for these components individually:

new URLPattern({ protocol: 'https', hostname: 'example.com', port: '' });

Or through the baseURL attribute:

new URLPattern({ baseURL: 'https://example.com/', pathname: '/foo/:bar' });

The first is a bit awkward and the second causes the path to also be inherited. If you override the pathname as in the example above this is fine, but consider a case where you care about search strings:

new URLPattern({ baseURL: 'https://example.com/', search: '(.*)q=foo(.*)' });

Here you want to match any URL with the given query value, however the pathname was inherited as an exact / from the baseURL. You need to add an additional pathname: '/(.*)' explicitly to match any pathname.

As an alternative we could add an origin: 'https://example.com/' attribute. It would function like baseURL, but would not apply the pathname, leaving it as a wildcard.

consider exposing URLPatternList

After the last face-to-face meeting we changed the plan to make service worker scope take a pathname pattern string instead of a full URLPattern object and an array of strings instead of a URLPatternList object.

Given this was the main use case for URLPatternList, should we just not expose URLPatternList? Are there other use cases for URLPatternList?

@jeffposnick @jakearchibald do you think developers would need URLPatternList vs simply just managing a list of URLPattern objects themselves?
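
For comparison, managing the list by hand is only a few lines. This is a sketch under the assumption that each entry exposes a test() method, as URLPattern does; the helper names matchesAny and findMatch are made up for illustration.

```javascript
// True if at least one pattern in the array accepts the input.
function matchesAny(patterns, input) {
  return patterns.some((pattern) => pattern.test(input));
}

// Returns the first pattern that accepts the input, or undefined.
function findMatch(patterns, input) {
  return patterns.find((pattern) => pattern.test(input));
}
```

A caller would pass an ordinary array of URLPattern objects as the first argument. What a dedicated URLPatternList could add beyond this, such as a combined exec() result, is the open question.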

support unicode group names

Currently path-to-regexp and URLPattern only support alphanumeric and underscore ASCII characters in a group name like :foo. It would be good to allow other unicode characters to support developers working in different languages.

I can support this in URLPattern, but ideally it would be nice to align path-to-regexp to it as well. I think to do this we would likely need two changes:

In addition to alphanumeric and underscore ASCII characters also allow unicode characters.
Support some kind of callback to encode unicode characters in other parts of the pattern.
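
One possible definition of a valid name, sketched here as an assumption rather than a decided rule, is to follow JavaScript identifier conventions using Unicode property escapes. The isValidGroupName function is hypothetical.

```javascript
// Assumed rule: first character is ID_Start or underscore, the rest are
// ID_Continue (which already covers digits and underscore).
const GROUP_NAME = /^[\p{ID_Start}_]\p{ID_Continue}*$/u;

function isValidGroupName(name) {
  return GROUP_NAME.test(name);
}

console.log(isValidGroupName('foo_1')); // true
console.log(isValidGroupName('имя'));   // true
console.log(isValidGroupName('1abc'));  // false
```

The existing ASCII names remain valid under this rule, so it would be a strict extension of the current behavior.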

@blakeembrey what do you think? If you think this is reasonable I can work on a PR.

embedded regex non-capturing groups and assertions

Currently path-to-regexp checks for (? sequences in embedded regular expressions in order to block named capture groups:

https://github.com/pillarjs/path-to-regexp/blob/125c43e6481f68cc771a5af22b914acdb8c5ba1f/src/index.ts#L102

@jeremyroman points out this prevents additional regular expression features:

Non-capturing groups like (?:x).
Assertions like (?<=y)x.
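
A narrower check is possible in principle. The sketch below is not path-to-regexp's code; hasNamedCaptureGroup is a hypothetical function that rejects only the (?<name> form and lets non-capturing groups and assertions through. It deliberately ignores escaped parentheses and character classes, which a real implementation would need to handle.

```javascript
// Matches "(?<" only when it is not followed by "=" or "!", that is,
// a named capture group rather than a lookbehind assertion.
function hasNamedCaptureGroup(source) {
  return /\(\?<(?![=!])/.test(source);
}

console.log(hasNamedCaptureGroup('(?<id>\\d+)')); // true
console.log(hasNamedCaptureGroup('(?:x)'));       // false
console.log(hasNamedCaptureGroup('(?<=y)x'));     // false

// The features that would become available are ordinary JavaScript regexp syntax.
const repeated = /^(?:ab)+$/.test('abab');            // non-capturing group
const digits = 'price: $42'.match(/(?<=\$)\d+/)[0];   // lookbehind assertion
```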

@blakeembrey do you think it would be reasonable to expand this check to not exclude these cases? I imagine it might add some bundle size to path-to-regexp for what is probably very niche uses.

Express goal of folding into URL Standard

A generic URL pattern match library would most logically fit into the URL Standard I think, especially if we want to more deeply consider shorthand syntax and such.

Given that scope it might also be worth adjusting the name and similar details. At least from the meeting it seemed clear that beyond service worker scope, filtering in the service worker fetch event was also considered an important use case with somewhat different requirements.

consider maximum scope size

At the TPAC 2020 discussion @asutherland suggested that we should consider maximum scope size limits. While it's theoretically possible for sites to stick large amounts of info encoded in the scope URL today, the scope list mechanism might encourage further size growth. We should have an interoperable agreement as to what is too much.

passing base URL as a dictionary property vs a separate string argument

So far I've been mostly aiming to have the base URL, if present, passed as a property on the dictionary argument. For example:

new URLPattern({ pathname: '/foo/bar', baseURL: 'https://example.com' });

And:

p.test({ pathname: '/foo/bar', baseURL: 'https://example.com' });

I also have a TODO, though, to support a baseURL for the string URL input to test()/exec(). For example:

p.test('/foo/bar', 'https://example.com');

@blakeembrey suggested maybe we should always support base URL as a second argument. For example, the first two examples would look like this:

new URLPattern({ pathname: '/foo/bar' }, 'https://example.com');

p.test({ pathname: '/foo/bar' }, 'https://example.com');

This also seems reasonable to me. I think the one downside is that you lose the "named arguments" style of the dictionary. We could possibly support both styles, though, and let developers choose which they prefer.
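
If both styles were supported, the two inputs would have to be reconciled somewhere. The following is a sketch of one way to do that. The normalizeBase helper is hypothetical, and throwing when the base URL is supplied twice is an assumption, not a decided behavior.

```javascript
// Merge a positional base URL into the dictionary form.
function normalizeBase(init, baseURL) {
  if (baseURL !== undefined && init.baseURL !== undefined) {
    // Assumed policy: supplying the base URL in both places is an error.
    throw new TypeError('baseURL given both as a property and as an argument');
  }
  return baseURL === undefined ? { ...init } : { ...init, baseURL };
}

const fromProperty = normalizeBase({ pathname: '/foo/bar', baseURL: 'https://example.com' });
const fromArgument = normalizeBase({ pathname: '/foo/bar' }, 'https://example.com');
```

Both calls produce the same dictionary, so the rest of the implementation would only need to handle one shape.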

Thoughts?

should URLPattern require a SecureContext?

There is a trend towards requiring secure contexts for new features, although sometimes it seems this is only applied to "powerful" new features. Should URLPattern require a SecureContext?

@annevk @domenic what is the current school of thought on this in standards?

Should we expose a combined groups result?

Currently each component in a URLPattern produces a separate groups map. They are accessed like result.pathname.groups, result.hostname.groups, etc. This is a bit verbose and also requires the person accessing the groups to know which component each named group is in.

We could instead try to expose a result.groups that combines all the matched values from all the components into one map.

The main issue is what to do for conflicting group names. For example, what if both the hostname and pathname have :id groups? Or worse, what if both hostname and pathname have anonymous groups that get the 0 index name?

Maybe we could offer a convenience like result.uniqueGroups that contains any matched groups with a unique name. If you need a value that does not have a unique name then you have to go to result.pathname.groups.
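
The uniqueGroups idea can be sketched as a function over a result-shaped object. This is an illustration only; the function name and the choice to drop every conflicting name, rather than keep the first, are assumptions.

```javascript
const COMPONENTS = [
  'protocol', 'username', 'password', 'hostname',
  'port', 'pathname', 'search', 'hash',
];

// Collect every group name across components, then keep only the names
// that appeared in exactly one component.
function uniqueGroups(result) {
  const seen = new Map(); // name -> { count, value }
  for (const component of COMPONENTS) {
    const groups = result[component]?.groups ?? {};
    for (const [name, value] of Object.entries(groups)) {
      const entry = seen.get(name);
      if (entry) {
        entry.count += 1;
      } else {
        seen.set(name, { count: 1, value });
      }
    }
  }
  const unique = {};
  for (const [name, { count, value }] of seen) {
    if (count === 1) unique[name] = value;
  }
  return unique;
}

// Both components define "id" and an anonymous "0" group, so only "slug" survives.
const combined = uniqueGroups({
  hostname: { groups: { id: 'a', 0: 'x' } },
  pathname: { groups: { id: 'b', slug: 'post', 0: 'y' } },
});
```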

Consider alternatives to path to regex pattern

Great work so far. Happy to see progress here. Just one quick thing I'd like to point out is that the regex approach here is certainly not the fastest. I haven't read through everything in detail yet, so forgive me if I've missed something. I wanted to at least point towards the radix-tree-based approach that the fastify framework uses for parsing and matching: https://github.com/delvedor/find-my-way.
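
To illustrate the general idea of tree-based matching, here is a simplified sketch. It is a per-segment trie, not a true radix tree (which also compresses shared string prefixes), and it is not find-my-way's implementation. The class names are made up, and there is no backtracking from a static match to a parameter match.

```javascript
class TrieNode {
  constructor() {
    this.children = new Map(); // static segment text -> TrieNode
    this.param = null;         // { name, node } for a ":name" segment
    this.value = undefined;    // payload stored at the end of a route
  }
}

class SegmentTrie {
  constructor() {
    this.root = new TrieNode();
  }

  add(path, value) {
    let node = this.root;
    for (const segment of path.split('/').filter(Boolean)) {
      if (segment.startsWith(':')) {
        if (!node.param) node.param = { name: segment.slice(1), node: new TrieNode() };
        node = node.param.node;
      } else {
        if (!node.children.has(segment)) node.children.set(segment, new TrieNode());
        node = node.children.get(segment);
      }
    }
    node.value = value;
  }

  find(path) {
    const params = {};
    let node = this.root;
    for (const segment of path.split('/').filter(Boolean)) {
      const next = node.children.get(segment);
      if (next) {
        node = next; // static segments take priority over parameters
      } else if (node.param) {
        params[node.param.name] = segment;
        node = node.param.node;
      } else {
        return null;
      }
    }
    return node.value === undefined ? null : { value: node.value, params };
  }
}

const trie = new SegmentTrie();
trie.add('/users/:id', 'user');
trie.add('/users/me', 'me');
```

Lookup cost depends on the number of segments in the input rather than on the number of registered routes, which is the property the comment above is pointing at.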
