opws / opws-dataset
Profiles for the user account systems of various sites.
License: Open Data Commons Open Database License v1.0
I feel that really diving into either of these is better handled by another project (like ToSDR). We can maybe track the URL of each page, and a foreign key for associating another database for each of these.
It seems to me like splitting the profiles into directories by public suffix doesn't really get us much except a more confusing filename structure. It's not like it's saving us from overpacking a single directory, since they're still mostly in /profiles/com.
The bugs.gnupg.org profile needs review.
Looking at #4, #37, and #38, I think it might make more sense to just refactor many of the fields of password.reset to something more like password.reset.steps, which would be an array detailing what happens when a reset token is activated:
So, the best sites go [change, expire, autologin], Bountysource goes [expire, change, autologin], Neocities goes [autochange] (and not even [autochange, login]), and there are lots of weird different ways this can go down that are all worth documenting.
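As a rough sketch of how such step arrays could be checked, assuming the token names used in the examples above (the set is not exhaustive, and both helper functions are hypothetical, not part of the dataset's tooling):

```python
# Step tokens taken from the examples in this issue; the set is not exhaustive.
KNOWN_STEPS = {"change", "expire", "autologin", "autochange", "login"}

# Hypothetical password.reset.steps values, keyed by a label for illustration only.
reset_steps = {
    "best practice": ["change", "expire", "autologin"],
    "Bountysource": ["expire", "change", "autologin"],
    "Neocities": ["autochange"],
}


def unknown_steps(steps):
    """Return the step tokens that are not in KNOWN_STEPS, in order."""
    return [step for step in steps if step not in KNOWN_STEPS]


def expires_before_change(steps):
    """True if "expire" is listed before "change" in the step array."""
    return (
        "expire" in steps
        and "change" in steps
        and steps.index("expire") < steps.index("change")
    )
```

Because the array is ordered, a consumer can ask questions about sequence (as `expires_before_change` does) that a set of independent flags could not answer.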
The distinction needs to be made about whether the email (if HTML content is sent) contains:
Most entries with a password.reset.response written before github.yaml (all 14 that aren't WiBit's temp password) conflate all of these into "response: token", which isn't super helpful (as pretty much all password reset mechanisms are token-based).
Also, I may have been ignoring sites that include usernames in the body of the email.
Similar to #60, sites that close for certain spans of time - or have a set obsoletion date - are an open question. Right now, I'm choosing to remove the profile of any site where the management of accounts becomes inaccessible - both in terms of sites that have shut down / suspended operations indefinitely, and sites that are likely to resume operation in the future (#115).
Open questions:
#12 proposes a "markets" array, where "market" here means "a locale where aspects of a site operate differently". (The slightly-less-general example is giving alternate versions of a site's name.)
This is generally handled by new profiles, really. If localization is not part of the domain, it's kind of out of scope?
I think names should maybe just be special-cased.
I mean, most sites that allow third-party login also allow third-party registration, and the other credential forms one may use for login (password, totp) are at the top level.
opensets/domainprofiles describes objective data only, such as mechanisms and locations.
opensets/domainprofiles does not restrict what domains it profiles on the basis of content, nor does it attempt to describe a domain's content with subjective labels, such as "not safe for work".
Endpoints that want to filter, or otherwise profile, sites based on their content should look into an existing commercial product or service, such as WebSense or Blue Coat, that best fits their use case. Endpoints may also consider reading content keywords provided via meta tags when retrieving pages on the domain.
name.com currently has this string under redflags:
Maximum password length only disclosed when exceeded.
This is actually pretty common (if the site states its maximum password size at all), so I'm no longer listing it as a red flag (so long as the site recognizes it has a maximum password size - shame on OKC and GOG). That said, it'd be nice if there were some way to recognize sites for being exemplary in disclosing things up-front, like:
See also #28.
Is trying to log in with a nonexistent account distinguished from trying to log in with the wrong password? Does the password recovery form tell you whether or not a user with that account exists?
These are things that can potentially be considered leaks (and if one does it, the other should too): if they're not considered a major leak, they should both be differentiated for usability.
Only distinguishing one is a red flag.
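A minimal sketch of that rule, assuming each form's behavior has already been recorded as a boolean (the function name and the message text are made up for illustration):

```python
def enumeration_redflag(login_distinguishes, reset_distinguishes):
    """Return a red-flag message when only one form reveals account existence.

    Each argument is True if that form tells the visitor whether an account
    exists. Returns None when the two forms behave consistently.
    """
    if login_distinguishes != reset_distinguishes:
        return "Account existence is disclosed by only one of login and password reset."
    return None
```

Both forms disclosing, or neither disclosing, yields `None`; only the mismatch is flagged.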
Right now, this is (sparsely) documented as profile.fields: this should be changed to user.profile.fields. (It's user.profile to distinguish it from, say, an organization profile, and, further, to distinguish it from the overall site profile.)
There also needs to be a more rigorously defined structure. There are a few axes of information to document about user profiles:
These will probably follow the block-array model used by questions, where each group of fields sharing the same value for all axes is lumped together.
Alongside user.profile.fields will be other profile info, like where to view or modify user profiles (akin to password.reset and password.change).
Also, there should be a note that this is not meant to be exhaustive. A user's blog posts could be considered part of their "profile information", but that doesn't make them part of the structure.
We have CONTRIBUTING.md, but we can do better.
Right now, I don't want to incorporate this, as it would be redundant with https://github.com/rmlewisuk/justdelete.me/blob/master/sites.json.
However, I wouldn't be opposed to merging the datasets down the line - the only change I'd need would be that the "difficulty" field would need to be converted into objective criteria, eg. something like a "manual" field with tokens for various manual interventions that are commonly required that determine a site's difficulty.
It seems like it'd be worth recording whether sites accept email for logging in, or just usernames, or what.
Field names are either separated by levels (if there is liable to be a neighboring field of that scope) or by compound word (which is not separated by underscores or anything like that), with the exception of localization fields (as specified in #12), which follow an established convention that uses underscores (to fit the pattern of traditional C identifier compatibility).
Part of the reason words are not separated (eg. thirdparty) is because that's the way colloquial English usage normally trends toward, like the way "login" has become a word. It'd be a little cringeworthy to see log_in, the way it is when you read really old writing and people write things like "Inter-Net".
requester-ip is the only case I can think of in the currently-documented stuff where something actually uses a separator, and that's because I really couldn't think of a more eloquent way to put it. (It'd be better as something like origin, but that's a discussion for another time.)
Yes, sometimes management is only as extensive as retrieving a single fixed API key for the user, but lots of other sites have more complex support, like provisioning and revocation.
We should track these: whether they're present, what the time duration is, if that's disclosed, if it's configurable, what the default is...
I originally designed the schema so the strings "yes" and "no" are used for fields where a boolean might be used, to leave the door open for non-binary values (eg. password.reset.token.login used to have the values "no", "before", and "after").
However, I'm now not certain that that's a solid strategy: pseudo-boolean values (with values neither "yes" nor "no") just seem like bad solutions.
On the other hand, explicitly checking for "no" beats explicitly checking for false (which can usually be confused for an undefined / nil value). So I'm not fully in one camp or another here.
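To illustrate the trade-off, here is a small sketch in Python. The `get_field` helper and the sample profiles are hypothetical, and the captcha field is used only as a convenient example:

```python
def get_field(profile, path):
    """Walk a dotted path through nested dicts; return None if any level is missing."""
    node = profile
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


documented_no = {"password": {"reset": {"captcha": "no"}}}
boolean_no = {"password": {"reset": {"captcha": False}}}
undocumented = {"password": {"reset": {}}}

# String check: only the explicitly documented "no" matches.
string_documented = get_field(documented_no, "password.reset.captcha") == "no"
string_missing = get_field(undocumented, "password.reset.captcha") == "no"

# Truthiness check: a documented False and a missing field look the same.
falsy_documented = not get_field(boolean_no, "password.reset.captcha")
falsy_missing = not get_field(undocumented, "password.reset.captcha")
```

With strings, "documented as no" and "not documented" stay distinct; with a careless truthiness test on booleans they collapse into one case.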
For these reasons, I propose we should drop the https field from profiles.
eBay offers phone calling and SMS messaging for password reset credentials. Wells Fargo doesn't have any way to set a password via email (you just enter credentials and off you go).
Right now, these sites' profiles don't really represent this: there should probably be some way to model this in the data (possibly even one that wouldn't require an overhaul to existing fields).
Obviously primitives like strings / arrays / maps (objects - the terminology should probably settle on "map" to match the YAML spec) don't need to be specified - they can defer to the YAML spec.
However, higher-level types, like token-strings and localized-string-maps (#12), should be specified in their own document, along with other higher-level domain-specific abstractions (like "password usability token strings" and "element-describing maps" (#10)).
I'm thinking the tests will still read from fields.md, but the types will be tested with a lib in tests/lib/types.js (which may read from types.md for things like verifying that a token string uses only documented values).
For documenting types, I'm thinking h2 will cover a few specific high-level types, where h3s under it are docs for types of that type (like token-string is an h2 and value sets for token strings are h3s).
What about character classes? That's probably going to get used by a few different types, and applied to tests for those different types by code in tests/lib/types.js.
Since some fields are polymorphic, I'm thinking the headers in fields.md will be redesigned to be something like fieldname, otherfieldname: type1, type2, where anything matching fieldname or otherfieldname will be subject to the tests for type1 and type2. Also, type1 and type2 will probably be links to the relevant section in fields.md (in the Markdown, probably written with [type1][] and a list of links at the bottom).
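A sketch of how such a header might be parsed, assuming the plain "names: types" form and ignoring the Markdown link syntax (both function names are hypothetical; the real tests would live in JavaScript under tests/lib):

```python
def parse_field_header(header):
    """Split 'name1, name2: type1, type2' into (field names, type names)."""
    names_part, _, types_part = header.partition(":")
    names = [name.strip() for name in names_part.split(",") if name.strip()]
    types = [kind.strip() for kind in types_part.split(",") if kind.strip()]
    return names, types


def types_for(field, headers):
    """Collect the type names whose tests would apply to `field`."""
    applicable = []
    for header in headers:
        names, types = parse_field_header(header)
        if field in names:
            applicable.extend(types)
    return applicable
```

For example, `types_for("otherfieldname", ["fieldname, otherfieldname: type1, type2"])` yields both type names, so the field would be run against the tests for each.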
For instance, a government site for Seattle residents.
This will be important when organizing regular reviews for such sites, as only certain people will be able to review them, and it'd be nice to let reviewers know about these restrictions up front, before they go opening a tab for a site they're not eligible for an account on.
Bountysource appears to invalidate tokens on GET (or, at least, they just invalidated the link I got after I signed out and visited it again without resetting).
(Implicit redflag: This is super bad practice since GET requests should be safe, ie. free of side effects.)
I'm thinking this should be tracked as password.reset.token.invalidate, with the value being a string (like password.reset.token.login) that describes when the invalidation happens.
I think it'd be better to have blacklist be a combination of "classes" (the content the blacklist has now), "strings" (for explicitly-banned strings), "variables" (for stuff like the user's name), "dictionaries" (for types of lists of passwords that aren't allowed, like "english" for English words and "common" for common passwords) and "previous" (a number of previously-used passwords that are banned) - this would cover pretty much all the "mustnot" rules I can recall having to put there.
I guess we'd probably change "whitelist" to "whitelist.classes" for symmetry, even though there's not really any practical way to have a whitelist of anything but character classes (though we might also have "whitelist.strings" for specific characters).
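A sketch of what a restructured blacklist could look like, with a shape check. The five keys come from the proposal above; the sample values and the `blacklist_errors` helper are assumptions for illustration:

```python
# Expected type for each proposed blacklist key.
BLACKLIST_SHAPE = {
    "classes": list,
    "strings": list,
    "variables": list,
    "dictionaries": list,
    "previous": int,
}

# Hypothetical example values; only the key names come from the proposal.
example_blacklist = {
    "classes": ["space"],
    "strings": ["password"],
    "variables": ["name"],
    "dictionaries": ["english", "common"],
    "previous": 5,
}


def blacklist_errors(blacklist):
    """Return a list of problems with a blacklist map (empty if it looks fine)."""
    errors = []
    for key, value in blacklist.items():
        expected = BLACKLIST_SHAPE.get(key)
        if expected is None:
            errors.append("unknown key: " + key)
        elif isinstance(value, bool) or not isinstance(value, expected):
            # bool is rejected explicitly because it is a subclass of int.
            errors.append(key + " should be " + expected.__name__)
    return errors
```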
Right now, email usability is either "once" or "twice", while the token for a single-entry password form is "single".
I think it'd be better to share a lexicon between the two, for consistency.
Slack reset tokens expire after 3 days, but they also expire 30 minutes after visiting the page. There should be a field like password.reset.form for the actual reset stuff (this would then be password.reset.form.expires, separate from password.reset.token.expires), where some other fields currently on password.reset should maybe go.
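To show how the two expirations might interact, here is a sketch that assumes whichever deadline comes first wins, and that both durations are available as `timedelta` values (how they would be serialized in a profile is not decided here):

```python
from datetime import datetime, timedelta


def reset_deadline(token_sent, form_opened, token_expires, form_expires):
    """Return the moment after which the reset can no longer be completed.

    token_sent and form_opened are datetimes (form_opened may be None if the
    link has not been visited yet); token_expires and form_expires are
    timedeltas standing in for password.reset.token.expires and
    password.reset.form.expires.
    """
    token_deadline = token_sent + token_expires
    if form_opened is None:
        return token_deadline
    return min(token_deadline, form_opened + form_expires)
```

With the Slack figures above (3 days and 30 minutes), a link opened an hour after it was sent would stop working 90 minutes after sending, long before the 3-day token limit.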
Things to list:
As for passwords, if the password is case-insensitive, that is a HUGE red flag, and it belongs in redflags (since it's not common).
The clunkiness of the string approach is apparent with tokens like "emailonce" and "emailtwice". This should be two values of "register.usability.email".
Not all sites have "Current password" alongside "new password" - sometimes, you're prompted for your password just to access the setting, and sometimes you only have to reenter it to commit the change.
Fields of open-ended text like notes and redflags should be changed to use objects with the language code for the language the notes are written in (to permit them to be localized in applications that present notes to the user).
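A sketch of how a client might read such an object. The lookup order (exact tag, then base language, then a fallback of "en") is an assumption, as are the function name and the sample notes:

```python
def localized(text_map, language, fallback="en"):
    """Pick the best available text for `language` from a language-keyed map.

    Tries the exact tag (eg. "fr-CA"), then the base language ("fr"),
    then the fallback language. Returns None if nothing matches.
    """
    if language in text_map:
        return text_map[language]
    base = language.split("-")[0]
    if base in text_map:
        return text_map[base]
    return text_map.get(fallback)


# Hypothetical notes value keyed by language code.
notes = {
    "en": "Reset links expire quickly.",
    "fr": "Les liens de réinitialisation expirent rapidement.",
}
```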
Tentative routes:
Object of all domains that are profiled, and the hash/ETag of its source material.
Gives a 404 if the domain does not exist. Domain-collapsing logic is expected to be handled by the client.
Maybe takes a query string that allows for subset requests. Maybe. (I kind of feel "select" handles that better, although it's true that the data can mismatch between requests if the server fails over to a newer version between requests for two distant fields from a select.)
Returns all domains with the field specified, and what the value of that field is.
Gets everything (domainprofiles.json).
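The field-selection behavior described above could be sketched like this, independent of any web framework. The function names and sample data are hypothetical, and no route paths are implied:

```python
_MISSING = object()  # Sentinel so that a stored None is not mistaken for absence.


def lookup(node, path):
    """Follow a dotted field path through nested dicts, or return _MISSING."""
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def select(profiles, field):
    """Return {domain: value} for every profile that has `field`."""
    result = {}
    for domain, profile in profiles.items():
        value = lookup(profile, field)
        if value is not _MISSING:
            result[domain] = value
    return result


# Hypothetical stand-in for the full dataset.
profiles = {
    "example.com": {"password": {"reset": {"captcha": "yes"}}},
    "example.org": {"name": "Example"},
}
```

Here `select(profiles, "password.reset.captcha")` returns only the example.com entry, since example.org does not document that field.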
I've never liked Grunt. https://github.com/sighjs/sigh sounds nice.
I've been using the "v0.1.0" and "future" milestones for a while now - I think I should probably take inventory on what each of those will entail.
I want to mandate the presence of the "Reviewed" field for all profiles (so at least they need to be reviewed before inclusion). This will entail reviewing all existing profiles:
(checklist removed)
There are a few things surrounding registration that are currently undocumented:
This is just sort of a hypothetical issue around the notion of creating a file resembling one of these profiles as a definition to feed into a framework that could create these routes.
Only one canonical version of the profile is maintained. All modification for previous schemas will be done via live backward migrations from the current model. As all data in a profile is optional and must not be expected to be present, this will not cause a problem if a later version of the schema outright removes a field.
The tag / version of the repo refers to the schema used by the live data.
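As a sketch of what a live backward migration might look like, the function below converts the dictionary-style password.change and password.reset (each with a url field) back to the older layout where they were plain URL fields, as described elsewhere in these issues. The function name is hypothetical, and real migrations would cover more fields:

```python
import copy


def downgrade_password(profile):
    """Return a copy of `profile` with password.change/reset as plain URLs.

    Every field is optional, so anything missing is simply left out, and
    sections without a url are dropped because the old schema cannot hold them.
    """
    old = copy.deepcopy(profile)
    password = old.get("password")
    if not isinstance(password, dict):
        return old
    for key in ("change", "reset"):
        section = password.get(key)
        if isinstance(section, dict):
            if "url" in section:
                password[key] = section["url"]
            else:
                del password[key]
    return old
```

Since the input is deep-copied, the canonical profile is never modified; the older shape exists only in the response.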
Could, and should, be present and mirroring the new password.change and password.reset (login integrating some more of the loose fields on root like totp).
The old way of doing this had "reset" and "change" be URL fields under "password", as well as a "rules" array of (plaintext) rules, and a "length" field for password length (min and max).
Rules was usually copied directly from a posted set of password restrictions, was often partially redundant with "length", and was never parseable.
To make way for new data, password needs to be restructured like so:

- change is now a dictionary with:
  - url field for the URL for changing passwords.
    - http://example.com/user/ + username + /changepw.
  - reenter, a boolean stating whether or not the original password is required to change the password.
- reset is now a dictionary with:
  - url field for the URL for resetting passwords.
  - accept field for accepted identifiers. Usually either "email", "username", or the two combined with a space (in either order).
  - captcha, if the site's password reset requires a CAPTCHA. (Absence of this value should be considered false.)
  - auth, token-separated string of what authentication measures, if any, are required to trigger a reset. Usually "question" (for an answer to a security question), if any.
  - response, what type of measure is sent to reset the password. Known values:
- rules is now a dictionary with:
  - blacklist, for easily-explained blacklisted characters (usually just space)
  - length, for length restrictions
  - must, for special requirements
  - mustnot, for special prohibitions (complex, secret blacklists are described here)
- redflags no longer include redundant red flags apparent from the data (such as limited password length, restricted character sets, or HTTPS breakages).
- notes can now be under any field for which the notes are relevant (no more root notes about password resetting).

must/mustnot was originally documented as special, for things like required character classes (which frequently take forms of "at least two of X, Y, and Z" that are not easily serialized) and bizarre pattern restrictions (hooooly cow Seattle's utilities site).
Rethinking the https field should definitely happen some time soon (possibly replacing it with an access dictionary), but it's currently being postponed.
Slack is the one I'm thinking here.
The way I'm thinking of doing this is "subdomains.vary", which would be a space-separated list of tokens describing the things that are different per-subdomain: for Slack, it would be "username password".
There'd be no distinguishing of specific subdomains, as that's a job for separate profiles. The "subdomains" property is inherently a reference to wildcard subdomains of the profiled domain.
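Reading such a token list would be straightforward. A minimal sketch, assuming the field is nested as subdomains.vary and whitespace-separated as proposed (the function name is hypothetical):

```python
def varies_per_subdomain(profile, aspect):
    """True if `aspect` is listed in the profile's subdomains.vary tokens."""
    vary = profile.get("subdomains", {}).get("vary", "")
    return aspect in vary.split()


# The Slack value proposed above.
slack = {"subdomains": {"vary": "username password"}}
```

Splitting into tokens before testing membership avoids false matches on substrings, eg. "user" would not match "username".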
Taking a sec to split this out to its own issue, since it raises some separate discussions in and of itself.
See #43 (comment) for the meat of what a documentation restructure would look like (as much for accommodating that issue's plans as #42).
So oDesk rebranded as Upwork and has moved the domain.
Going forward, how should these rebrands work? (What I've been doing, just moving the profile and changing my own uses of it, is clearly not viable going forward.) I'm thinking the old domain should have a migrate field that is like use but signals that data tied to the old domain should be moved to the new one.
For backreferencing, I'm thinking profiles should maybe also have a formerly array (Blotpass would then use this array like use).
Maybe instead of migrate, it should be moved.to, with maybe a moved.on that states the date that the move occurred.
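A sketch of how a client could follow moved.to, assuming profiles are keyed by domain. The placeholder domains, the function name, and the cycle check are all assumptions for illustration:

```python
def resolve_domain(profiles, domain):
    """Follow moved.to links until reaching a domain that has not moved.

    Raises ValueError if the chain of moves loops back on itself.
    """
    seen = set()
    while domain not in seen:
        seen.add(domain)
        target = profiles.get(domain, {}).get("moved", {}).get("to")
        if not target:
            return domain
        domain = target
    raise ValueError("circular moved.to chain at " + domain)


# Placeholder domains standing in for a rebrand; moved.on is omitted here.
profiles = {
    "old.example": {"moved": {"to": "new.example"}},
    "new.example": {"formerly": ["old.example"]},
}
```

The formerly array on the new profile gives the reverse direction, so a tool holding data for new.example can also recognize records filed under old.example.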
Similar to #45, some sites have you sign up for an account up front, while other sites send you an email you use to register (following the structure I personally believe is best) - respoke.io is an example of the latter.
Right now, "platform" is just a simple field. However, it's entirely feasible that these profiles could hold detailed platform info (on the order of BuiltWith.com's info). Also, it's likely that platforms could be more complex than just 1:1 name matching.
I propose that platform is redefined to something like platform.application.name and platform.application.version, where "applications" is a new set (next to "profiles") where versions, with their defaults, are provided as an array.
Moreover, there's probably a need for some kind of "mount path" data for apps - keys they use as the basis of their URL defaults. This needs further consideration.
URLs should be as widely-applicable as possible, with as many components as possible removed that refer to specific contexts (such as referrers and locales), so long as visiting the simplified URL still results in the desired page.
In the event that certain components can't be removed without breaking the page (such as adobe.com), the used components should be as generic as possible (ie. using those components that would be presented when reaching that page from the root domain). Where generic values aren't possible but specific universal values are, they should use the most common specific instance (eg. 37signals accounts using Basecamp's Launchpad) or closest to the origin / most targeted market (so eg. if Pottermore required a locale, it would use en-GB).
(In the event of ties, English and specifically en-US wins, following the same logic as #12.)
In the event that there is no acceptable universally-usable value, the component should be variable-concatenated out (ie. the way usernames in URLs are).
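One way to picture a variable-concatenated URL is as a list of literal pieces and variable references. The representation below (dicts with a "var" key) is purely an assumption for illustration, as is percent-encoding the substituted value:

```python
from urllib.parse import quote


def build_url(parts, variables):
    """Join literal URL pieces and variable references into one URL.

    `parts` mixes plain strings with {"var": name} placeholders; each
    placeholder is replaced by the percent-encoded value from `variables`.
    """
    pieces = []
    for part in parts:
        if isinstance(part, dict):
            pieces.append(quote(variables[part["var"]], safe=""))
        else:
            pieces.append(part)
    return "".join(pieces)


# The username-in-URL pattern, written as parts.
change_url = ["http://example.com/user/", {"var": "username"}, "/changepw"]
```

A client would call `build_url(change_url, {"username": ...})` with the value it holds for the current user.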
Some of this is being answered with a new password.reset.sessions.invalidate field (#69), but there are other interactions that could be profiled:
As #4 describes, there should be a field describing where you go after a password reset form is submitted (which is independent of password.reset.token.login).
Maybe password.reset.submit, which could have a 'redirect' (with 'yes' or the status code used for redirection, or 'no' if it actually doesn't redirect on submission), as well as a 'location' (the URL you are redirected to). And/or something that describes if you're redirected to the login page, and if that login form has some fields pre-filled.
Bring out your use cases, bring out your use cases...
Here are a few refactors I'm considering:

- [...] password.reset that only apply to the first step of password reset. Now that so much of password.reset involves other components of password reset (and password.reset.steps entails all of it), I'm thinking things would be cleaner if these fields would move to password.reset.request. (This also opens up the door for making request part of steps, which will make #70 more versatile.)
- password.reset.accept would make more sense as password.reset.request.accepts, since it's not an imperative "accept" but a passive "accepts".
- username.remind has a similar problem, both in imperative mismatch and in not having a request level. It would match much better as username.reminder.request.accepts.
- [...] expiration getting added with #103), it'd make sense to move register to registration. This doesn't mismatch the other fields: it just makes them match as nouns, rather than verbs (which makes more sense anyway).