opws-dataset's People

Contributors

stuartpb

opws-dataset's Issues

Privacy policy / ToS / ToU / TaC fields

I feel that really diving into any of these is better handled by another project (like ToSDR). We could maybe track the URL of each page, plus a foreign key for associating each of them with an entry in another database.

Why not include all profiles in one directory?

It seems to me like splitting the profiles into directories by public suffix doesn't really get us much except a more confusing filename structure. It's not like it's saving us from overpacking a single directory, since they're still mostly in /profiles/com.

Refactoring password.reset

Looking at #4, #37, and #38, I think it might make more sense to just refactor many of the fields of password.reset to something more like password.reset.steps, which would be an array detailing what happens when a reset token is activated:

  • change
  • autochange (for when the password is changed to something else by the site, a la Neocities)
  • button (when the next step happens automatically, but only after a POST)
  • expire (documenting at what point the token is expired) see comment below
  • stub (the page displays a message confirming the reset and you go nowhere)
  • login (user is directed to the login page)
  • autologin (user is logged in)

So, the best sites go [change, expire, autologin], Bountysource goes [expire, change, autologin], Neocities goes [autochange] (and not even [autochange, login]), and there are lots of weird different ways this can go down that are all worth documenting.
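
A minimal sketch of how such a steps array might be checked, in Python. The step names come from the list above; the function name validate_steps and the rule that a step appears at most once per reset are assumptions made for illustration, not part of the proposal.

```python
# Step names taken from the list above.
KNOWN_STEPS = {
    "change", "autochange", "button", "expire", "stub", "login", "autologin",
}


def validate_steps(steps):
    """Return a list of problems found in a password.reset.steps array."""
    problems = []
    seen = set()
    for step in steps:
        if step not in KNOWN_STEPS:
            problems.append(f"unknown step: {step}")
        elif step in seen:
            # Assumption: each step happens at most once per reset.
            problems.append(f"repeated step: {step}")
        seen.add(step)
    return problems
```

Under these assumptions, both ["change", "expire", "autologin"] and Neocities' ["autochange"] pass with no problems reported.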

password.reset.response needs cleanup

The distinction needs to be made about whether the email (if HTML content is sent) contains:

  • the visible URL
  • a link to click
  • a hand-typeable token (outside of a URL)

Most entries with a password.reset.response written before github.yaml (all 14 that aren't WiBit's temp password) conflate all of these into "response: token", which isn't super helpful (as pretty much all password reset mechanisms are token-based).

Also, I may have been ignoring sites that include usernames in the body of the email.

Profiling ephemeral / seasonal sites

Similar to #60, sites that close for certain spans of time - or have a set obsoletion date - are an open question. Right now, I'm choosing to remove the profile of any site where the management of accounts becomes inaccessible - both in terms of sites that have shut down / suspended operations indefinitely, and sites that are likely to resume operation in the future (#115).

Open questions:

  • Should these sites have profiles?
  • What would be the protocol for reviewing them out-of-season?
  • If we profile them, how should the data on their shutdown schedule be presented?
  • Should they get rotated out of profiles at the time they go out of service?
  • Should they be moved to legacies (#43)?

Localized variants ("markets")

#12 proposes a "markets" array, where "market" here means "a locale where aspects of a site operate differently". (The slightly-less-general example is giving alternate versions of a site's name.)

This is generally handled by new profiles, really. If localization is not part of the domain, it's kind of out of scope?

I think names should maybe just be special-cased.

Content policy

opensets/domainprofiles describes objective data only, such as mechanisms and locations.

opensets/domainprofiles does not restrict what domains it profiles on the basis of content, nor does it attempt to describe a domain's content with subjective labels, such as "not safe for work".

Endpoints that want to filter, or otherwise profile, sites based on their content should look into an existing commercial product or service, such as WebSense or Blue Coat, that best fits their use case. Endpoints may also consider reading content keywords provided via meta tags when retrieving pages on the domain.

How much of this data is stated by the site?

name.com currently has this string under redflags:

Maximum password length only disclosed when exceeded.

This is actually pretty common (if the site states its maximum password size at all), so I'm no longer listing it as a red flag (so long as the site recognizes it has a maximum password size - shame on OKC and GOG). That said, it'd be nice if there were some way to recognize sites for being exemplary in disclosing things up-front, like:

  • Their password and username restrictions
  • What email address reset emails come from
  • When their reset tokens expire
  • Whether changing passwords logs you out (on other machines)

See also #28.

Nonexistent account disclosure

Is trying to log in with a nonexistent account distinguished from trying to log in with the wrong password? Does the password recovery form tell you whether or not a user with that account exists?

These are things that can potentially be considered leaks (and if one does it, the other should too): if they're not considered a major leak, they should both be differentiated for usability.

Only distinguishing one is a red flag.
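
As a rough Python sketch of that rule: the function below takes two booleans (whether the login form and the reset form each reveal that an account doesn't exist) and reports a red flag only when they disagree. The function name and the wording of the message are hypothetical.

```python
def disclosure_redflag(login_distinguishes, reset_distinguishes):
    """Return a red-flag string when only one of the two forms discloses
    whether an account exists, otherwise None."""
    if login_distinguishes == reset_distinguishes:
        # Both disclose, or neither does: consistent, so no red flag.
        return None
    leaky = "login" if login_distinguishes else "password reset"
    return f"Only the {leaky} form discloses whether an account exists."
```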

Profiling the structure of user profiles

Right now, this is (sparsely) documented as profile.fields: this should be changed to user.profile.fields. (It's user.profile to distinguish it from, say, an organization profile, and, further, to distinguish it from the overall site profile.)

There also needs to be a more rigorously defined structure. There are a few axes of information to document about user profiles:

  • How private a field is: viewable to the world/search engines, viewable to any registered users, viewable to a limited group of users, or private to one's self
  • To what degree the user can control the aforementioned privacy of a field
  • What's prompted on registration, what's prompted after registration, and what's never prompted
  • Degrees of requirement: what's required, what's sometimes required, and what's completely optional (this could be part of the previous data)

These will probably follow the block-array model used by questions, where each group of fields sharing the same value for all axes are lumped together.

Alongside user.profile.fields will be other profile info, like where to view or modify user profiles (akin to password.reset and password.change).

Also, there should be a note that this is not meant to be exhaustive. A user's blog posts could be considered part of their "profile information", but that doesn't make them part of the structure.

Account-deletion info

Right now, I don't want to incorporate this, as it would be redundant with https://github.com/rmlewisuk/justdelete.me/blob/master/sites.json.

However, I wouldn't be opposed to merging the datasets down the line - the only change I'd need would be that the "difficulty" field would need to be converted into objective criteria, eg. something like a "manual" field with tokens for various manual interventions that are commonly required that determine a site's difficulty.

login.form.account.accepts

It seems like it'd be worth recording whether sites accept email for logging in, or just usernames, or what.

Structural style guide

Field names are either separated by levels (if there is liable to be a neighboring field of that scope) or by compound word (which is not separated by underscores or anything like that), with the exception of localization fields (as specified in #12), which follow an established convention that uses underscores (to fit the pattern of traditional C identifier compatibility).

Part of the reason words are not separated (eg. thirdparty) is that colloquial English usage normally trends that way, like the way "login" has become a word. It'd be a little cringeworthy to see log_in, the way it is when you read really old writing and people write things like "Inter-Net".

requester-ip is the only case I can think of in the currently-documented stuff where something actually uses a separator, and that's because I really couldn't think of a more eloquent way to put it. (It'd be better as something like origin, but that's a discussion for another time.)

api.key.retrieval should be api.key.management

Yes, sometimes management is only as extensive as retrieving a single fixed API key for the user, but lots of other sites have more complex support, like provisioning and revocation.

"Stay Logged In" checkboxes

We should track these: whether they're present, what the time duration is, if that's disclosed, if it's configurable, what the default is...

Replacing "yes" and "no" enums

I originally designed the schema so the strings "yes" and "no" are used for fields where a boolean might be used, to leave the door open for non-binary values (eg. password.reset.token.login used to have the values "no", "before", and "after").

However, I'm now not certain that that's a solid strategy: pseudo-boolean values (with values neither "yes" nor "no") just seem like bad solutions.

On the other hand, explicitly checking for "no" beats explicitly checking for false (which can usually be confused for an undefined / nil value). So I'm not fully in one camp or another here.
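
To illustrate the trade-off, here is a small Python sketch. The captcha key is borrowed from password.reset.captcha; both helper functions and the three sample profiles are hypothetical. A careless truthiness check treats an explicit false and a missing field the same way, while a check against the string "no" only matches an explicit answer.

```python
def naive_is_no(value):
    # Truthiness check: cannot tell an explicit False from a missing field.
    return not value


def explicit_is_no(value):
    # String enum check: only an explicit "no" matches.
    return value == "no"


profile_with_bool = {"captcha": False}
profile_with_enum = {"captcha": "no"}
empty_profile = {}  # the field was never recorded
```

A stricter boolean check (value is False in Python) would also avoid the confusion, so this only shows why the loose check is risky, not that booleans are unusable.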

Dropping the `https` field

  • This is a volatile field, with a lot of factors that can influence its results, and one that is often prone to change circumstantially (more so than, say, password requirements).
  • Unlike the rest of the fields in profiles, it's fairly straightforward to determine this data through automated testing, so (among other things) it could easily be re-gathered programmatically after removal (and would probably be more accurate).
  • It's largely redundant to the data collected for the HTTPS Everywhere Atlas.

For these reasons, I propose we should drop the https field from profiles.

Modeling non-email password reset

eBay offers phone calling and SMS messaging for password reset credentials. Wells Fargo doesn't have any way to set a password via email (you just enter credentials and off you go).

Right now, these sites' profiles don't really represent this: there should probably be some way to model this in the data (possibly even one that wouldn't require an overhaul to existing fields).

Documenting types

Obviously primitives like strings / arrays / maps (objects - the terminology should probably settle on "map" to match the YAML spec) don't need to be specified - they can defer to the YAML spec.

However, higher-level types, like token-strings and localized-string-maps (#12), should be specified in their own document, along with other higher-level domain-specific abstractions (like "password usability token strings" and "element-describing maps" (#10)).

I'm thinking the tests will still read from fields.md, but the types will be tested with a lib in tests/lib/types.js (which may read from types.md for things like verifying that a token string uses only documented values).

For documenting types, I'm thinking h2 will cover a few specific high-level types, where h3s under it are docs for types of that type (like token-string is an h2 and value sets for token strings are h3s).

What about character classes? That's probably going to get used by a few different types, and applied to tests for those different types by code in tests/lib/types.js.

Since some fields are polymorphic, I'm thinking the headers in fields.md will be redesigned to be something like fieldname, otherfieldname: type1, type2, where anything matching fieldname or otherfieldname will be subject to the tests for type1 and type2. Also, type1 and type2 will probably be links to the relevant section in fields.md (in the Markdown, probably written with [type1][] and a list of links at the bottom).
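
A sketch of parsing that header shape, in Python for illustration (the actual tests would presumably live in tests/lib/types.js). The function name is hypothetical, and the sketch does not handle the [type1][] link syntax.

```python
def parse_field_header(header):
    """Split 'a, b: t1, t2' into (['a', 'b'], ['t1', 't2']).

    A header with no colon is treated as field names with no types.
    """
    names_part, _, types_part = header.partition(":")
    names = [n.strip() for n in names_part.split(",") if n.strip()]
    types = [t.strip() for t in types_part.split(",") if t.strip()]
    return names, types
```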

Does a site restrict membership registration to certain people?

For instance, a government site for Seattle residents.

This will be important when organizing regular reviews: only certain people will be able to review such sites, and it'd be nice to let reviewers know about the restriction up front, before they go opening a tab for a site they're not eligible for an account on.

How does use invalidate reset tokens?

Bountysource appears to invalidate tokens on GET (or, at least, they just invalidated the link I got after I signed out and visited it again without resetting).

(Implicit redflag: This is super bad practice, since GET requests should be safe, ie. free of side effects like invalidating a token.)

I'm thinking this should be tracked as password.reset.token.invalidate, with the value being a string (like password.reset.token.login) that describes when the invalidation happens.

Refactoring password (and username) rules for extensibility

I think it'd be better to have blacklist be a combination of "classes" (the content the blacklist has now), "strings" (for explicitly-banned strings), "variables" (for stuff like the user's name), "dictionaries" (for types of lists of passwords that aren't allowed, like "english" for English words and "common" for common passwords) and "previous" (a number of previously-used passwords that are banned) - this would cover pretty much all the "mustnot" rules I can recall having to put there.

I guess we'd probably change "whitelist" to "whitelist.classes" for symmetry, even though there's not really any practical way to have a whitelist of anything but character classes (though we might also have "whitelist.strings" for specific characters).
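
Here is a rough Python sketch of how a blacklist with those keys could be applied to a candidate password. Everything beyond the key names is an assumption: the two character classes, the case-insensitive matching of variables, and passing password history as plain strings (a real system would compare hashes). The "dictionaries" key is left out because it needs external word lists.

```python
# Hypothetical character classes; the real class names are not defined here.
CLASSES = {
    "space": lambda c: c == " ",
    "digit": str.isdigit,
}


def blacklist_violations(password, blacklist, variables=None, history=()):
    """Return which blacklist entries a candidate password violates."""
    variables = variables or {}
    found = []
    for name in blacklist.get("classes", []):
        if any(CLASSES[name](c) for c in password):
            found.append(f"classes:{name}")
    for banned in blacklist.get("strings", []):
        if banned in password:
            found.append(f"strings:{banned}")
    for var in blacklist.get("variables", []):
        value = variables.get(var)
        # Assumption: variables (like the user's name) match case-insensitively.
        if value and value.lower() in password.lower():
            found.append(f"variables:{var}")
    previous = blacklist.get("previous", 0)
    # Only the most recent `previous` passwords are banned.
    if previous and password in list(history)[-previous:]:
        found.append("previous")
    return found
```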

Password reset links expiring on visit

Slack reset tokens expire after 3 days, but they also expire 30 minutes after visiting the page. There should be a field like password.reset.form for the actual reset stuff (this would then be password.reset.form.expires, separate from password.reset.token.expires), where some other fields currently on password.reset should maybe go.
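
A small Python sketch of how the two expirations might combine, assuming the earlier deadline wins. The function and parameter names are hypothetical; the defaults are the Slack figures quoted above.

```python
from datetime import datetime, timedelta


def reset_deadline(issued_at, visited_at=None,
                   token_expires=timedelta(days=3),
                   form_expires=timedelta(minutes=30)):
    """Return the moment after which the reset can no longer be completed."""
    deadline = issued_at + token_expires
    if visited_at is not None:
        # Assumption: visiting the page can shorten the window, never extend it.
        deadline = min(deadline, visited_at + form_expires)
    return deadline
```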

Username case fields

Things to list:

  • username.case.sensitivity: should be "insensitive", can be "sensitive" if the system is stupid (accepts "stuart" and "Stuart" as two different users).
  • username.case.display: Whether the system coerces usernames to "upper" or "lower", or keeps them "mixed".

As for passwords, if the password is case-insensitive, that is a HUGE red flag, and it belongs in redflags (since it's not common).

Freeform text should be per-locale

Fields of open-ended text like notes and redflags should be changed to use objects with the language code for the language the notes are written in (to permit them to be localized in applications that present notes to the user).
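
A sketch, in Python, of how an application might read such a field. The lookup order (exact code, then bare language, then a fallback) and the treatment of legacy plain strings are assumptions, as is the function name.

```python
def localized_text(field, preferred, fallback="en"):
    """Pick the best available text from a per-locale notes/redflags object."""
    if isinstance(field, str):
        # Legacy profiles: a bare string that has not been converted yet.
        return field
    # Try the exact code ("fr-CA"), then the bare language ("fr"), then fallback.
    for code in (preferred, preferred.split("-")[0], fallback):
        if code in field:
            return field[code]
    return None
```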

API Endpoints

  • There should be an API that gives a list of domains that are profiled, and the hash/ETag of its source material.
  • There should be a list of every domain that has a "use" record, and what the value of that record is.
    • We could also have just generalized sideways record querying, so you can get all the records for any field. We could even memoize them (though this could have the potential to make RAM blow up).

Tentative routes:

/v0/1/etags

Object of all domains that are profiled, and the hash/ETag of its source material.
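
One way the response object could be built, sketched in Python. Using SHA-256 over the raw profile source is an assumption (the issue only says "hash/ETag"), and the function name is hypothetical.

```python
import hashlib


def etags(sources):
    """Map each profiled domain to a hash of its source text.

    `sources` maps domain -> raw profile source (e.g. the YAML file contents).
    """
    return {
        domain: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for domain, text in sources.items()
    }
```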

/v0/1/domain/:domain

Gives a 404 if the domain does not exist. Domain-collapsing logic is expected to be handled by the client.

Maybe takes a query string that allows for subset requests. Maybe. (I kind of feel "select" handles that better, although it's true that the data can mismatch between requests if the server fails over to a newer version between requests for two distant fields from a select.)

/v0/1/select/:field

Returns all domains with the field specified, and what the value of that field is.
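
A minimal Python sketch of the lookup behind this route, assuming profiles are held in memory as nested dictionaries keyed by domain and that :field is a dotted path. Both function names are hypothetical.

```python
def get_field(profile, dotted):
    """Walk a dotted field path; return None if any level is missing."""
    node = profile
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def select(profiles, field):
    """Map each domain that has the field to that field's value."""
    result = {}
    for domain, profile in profiles.items():
        value = get_field(profile, field)
        if value is not None:
            result[domain] = value
    return result
```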

/v0/1/profiles/domains

Gets everything (domainprofiles.json).

Roadmap

I've been using the "v0.1.0" and "future" milestones for a while now - I think I should probably take inventory on what each of those will entail.

All profiles should be reviewed

I want to mandate the presence of the "Reviewed" field for all profiles (so at least they need to be reviewed before inclusion). This will entail reviewing all existing profiles:

(checklist removed)

Is the user logged in after registration / email verification?

There are a few things surrounding registration that are currently undocumented:

  • Whether the user is logged in after submitting the registration form, or if they need to log in manually, or...
  • If the user has to validate their email for registration to complete, or if it's just to make the account "full", and/or if this validation email logs them in.

Using the profile format prescriptively

This is just sort of a hypothetical issue around the notion of creating a file resembling one of these profiles as a definition to feed into a framework that could create these routes.

Versioning, Schemas, and Migration

Only one canonical version of the profile is maintained. All modification for previous schemas will be done via live backward migrations from the current model. As all data in a profile is optional and must not be expected to be present, this will not cause a problem if a later version of the schema outright removes a field.
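
A sketch of what one live backward migration could look like, in Python. It assumes, purely for illustration, that the rename of password.reset.accept to password.reset.request.accepts discussed elsewhere in these issues has landed in the current schema; the function name is hypothetical.

```python
import copy


def downgrade_request_accepts(profile):
    """Present a current profile in the older password.reset.accept shape."""
    profile = copy.deepcopy(profile)  # never mutate the canonical data
    reset = profile.get("password", {}).get("reset", {})
    request = reset.get("request", {})
    if "accepts" in request:
        reset["accept"] = request.pop("accepts")
        if not request:
            # Drop the now-empty level the old schema did not have.
            del reset["request"]
    return profile
```

Since every field is optional, a profile without the field passes through unchanged.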

The tag / version of the repo refers to the schema used by the live data.

`register` and `login` fields

These could, and should, be present, mirroring the new password.change and password.reset (with login absorbing some of the loose fields on root, like totp).

Standardize to new layout

The old way of doing this had "reset" and "change" be URL fields under "password", as well as a "rules" array of (plaintext) rules, and a "length" field for password length (min and max).

Rules was usually copied directly from a posted set of password restrictions, was often partially redundant with "length", and was never parseable.

To make way for new data, password needs to be restructured like so:

  • change is now a dictionary with:
    • url field for the URL for changing passwords.
      • URLs that require parameters like username / ID separate it with spaces and '+' signs, like http://example.com/user/ + username + /changepw.
    • reenter, a boolean stating whether or not the original password is required to change the password.
      • Cases like GitHub's sudo mode count as "true".
      • Additional authentication measures, as required on sites like PayPal, are not currently listed.
  • reset is now a dictionary with:
    • url field for the URL for resetting passwords.
    • accept field for accepted identifiers. Usually either "email", "username", or the two combined with a space (in either order).
      • For sites that require multiple identifiers, they are separated by a plus.
    • captcha, if the site's password reset requires a CAPTCHA. (Absence of this value should be considered false.)
    • auth, token-separated string of what authentication measures, if any, are required to trigger a reset. Usually "question" (for an answer to a security question), if any.
    • response, what type of measure is sent to reset the password. Known values:
      • "settoken", a token (usually in the form of a link) that, when consumed (visited), allows you to set a new password
      • "temp", a temporary password that immediately prompts for a new password on use
      • "new", a new password to log in with that it is suggested you change
  • rules is now a dictionary with:
    • blacklist, for easily-explained blacklisted characters (usually just space)
    • length, for length restrictions
    • must, for special requirements
    • mustnot, for special prohibitions (complex, secret blacklists are described here)
  • redflags no longer include redundant red flags apparent from the data (such as limited password length, restricted character sets, or HTTPS breakages).
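
For the parameterized URL convention above, a consumer might expand the ' + ' form along these lines. This is a Python sketch that assumes the pieces strictly alternate between literal text and a variable name, starting with a literal; the function name is hypothetical.

```python
from urllib.parse import quote


def expand_url(template, variables):
    """Expand 'http://example.com/user/ + username + /changepw'."""
    parts = template.split(" + ")
    out = []
    for index, part in enumerate(parts):
        if index % 2:
            # Odd positions are variable names; percent-encode their values.
            out.append(quote(variables[part], safe=""))
        else:
            out.append(part)
    return "".join(out)
```

A URL with no ' + ' in it is returned as-is.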

notes can now be under any field for which the notes are relevant (no more root notes about password resetting).

must/mustnot was originally documented as special, for things like required character classes (which frequently take forms of "at least two of X, Y, and Z" that are not easily serialized) and bizarre pattern restrictions (hooooly cow Seattle's utilities site).

Rethinking the https field should definitely happen some time soon (possibly replacing it as an access dictionary), but it's currently being postponed.

Subdomain-variant data

Slack is the one I'm thinking here.

The way I'm thinking of doing this is "subdomains.vary", which would be a space-separated list of tokens describing the things that are different per-subdomain: for Slack, it would be "username password".

There'd be no distinguishing of specific subdomains, as that's a job for separate profiles. The "subdomains" property is inherently a reference to wildcard subdomains of the profiled domain.
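
Reading such a field could be as simple as the following Python sketch, which assumes the nested layout subdomains → vary and a hypothetical helper name.

```python
def varies_per_subdomain(profile, field):
    """True if `field` is listed in the profile's subdomains.vary tokens."""
    tokens = profile.get("subdomains", {}).get("vary", "").split()
    return field in tokens


# The Slack example from above.
slack = {"subdomains": {"vary": "username password"}}
```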

Restructuring documentation

Taking a sec to split this out to its own issue, since it raises some separate discussions in and of itself.

See #43 (comment) for the meat of what a documentation restructure would look like (as much for accommodating that issue's plans as #42).

Recording historical data

So oDesk rebranded as Upwork and has moved the domain.

Going forward, how should these rebrands work? (What I've been doing, just moving the profile and changing my own uses of it, is clearly not viable going forward.) I'm thinking the old domain should have a migrate field that is like use but signals that data tied to the old domain should be moved to the new one.

For backreferencing, I'm thinking profiles should maybe also have a formerly array (Blotpass would then use this array like use).

Maybe instead of migrate, it should be moved.to, with maybe a moved.on that states the date that the move occurred.
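
A sketch of how a client could follow moved.to, written in Python. The in-memory shape of the profiles, the loop guard, and the function name are all assumptions.

```python
def resolve_domain(domain, profiles):
    """Follow moved.to links until reaching a domain that has not moved."""
    seen = set()
    while domain in profiles and "moved" in profiles[domain]:
        if domain in seen:
            # Guard against two profiles pointing at each other.
            raise ValueError(f"migration loop at {domain}")
        seen.add(domain)
        domain = profiles[domain]["moved"]["to"]
    return domain
```

With the oDesk case above, a profile for odesk.com carrying moved.to: upwork.com would resolve to upwork.com, and data tied to the old domain could then be re-keyed.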

How are emails verified / accounts registered?

Similar to #45, some sites have you sign up for an account up front, while other sites send you an email you use to register (following the structure I personally believe is best) - respoke.io is an example of the latter.

Documenting stacks/platforms instead of individual domains

Right now, "platform" is just a simple field. However, it's entirely feasible that these profiles could hold detailed platform info (on the order of BuiltWith.com's info). Also, it's likely that platforms could be more complex than just 1:1 name matching.

I propose that platform is redefined to something like platform.application.name and platform.application.version, where "applications" is a new set (next to "profiles") where versions, with their defaults, are provided as an array.

Moreover, there's probably a need for some kind of "mount path" data for apps - keys they use as the basis of their URL defaults. This needs further consideration.

URL policy

URLs should be as widely-applicable as possible, with as many components as possible removed that refer to specific contexts (such as referrers and locales), so long as visiting the simplified URL still results in the desired page.

In the event that certain components can't be removed without breaking the page (such as adobe.com), the used components should be as generic as possible (ie. using those components that would be presented when reaching that page from the root domain). Where generic values aren't possible but specific universal values are, they should use the most common specific instance (eg. 37signals accounts using Basecamp's Launchpad) or closest to the origin / most targeted market (so eg. if Pottermore required a locale, it would use en-GB).

(In the event of ties, English and specifically en-US wins, following the same logic as #12.)

In the event that there is no acceptable universally-usable value, the component should be variable-concatenated out (ie. the way usernames in URLs are).

What happens when you try to reset your password and you're logged in?

Some of this is being answered with a new password.reset.sessions.invalidate field (#69), but there are other interactions that could be profiled:

  • Whether the password-reset page is accessible when you're logged in
  • Whether visiting a reset link works when you're logged in, and what effect it has

Reconciling password.reset.flow.submit and password.change

As #4 describes, there should be a field describing where you go after a password reset form is submitted (which is independent of password.reset.token.login).

Maybe password.reset.submit, which could have a 'redirect' (with 'yes' or the status code used for redirection, or 'no' if it actually doesn't redirect on submission), as well as a 'location' (the URL you are redirected to). And/or something that describes if you're redirected to the login page, and if that login form has some fields pre-filled.

Refactoring for password.reset.request

Here are a few refactors I'm considering:

  • There are a few fields on password.reset that only apply to the first step of password reset. Now that so much of password.reset involves other components of password reset (and password.reset.steps entails all of it), I'm thinking things would be cleaner if these fields would move to password.reset.request. (This also opens up the door for making request part of steps, which will make #70 more versatile.)
  • In light of that, password.reset.accept would make more sense as password.reset.request.accepts, since it's not an imperative "accept" but a passive "accepts".
  • username.remind has a similar problem, both in imperative mismatch and in not having a request level. It would match much better as username.reminder.request.accepts.
  • In the same vein as many of these other things (including expiration getting added with #103), it'd make sense to move register to registration. This doesn't mismatch the other fields: it just makes them match as nouns, rather than verbs (which makes more sense anyway).
