louislva / openactiondata

Building a diverse and clean dataset of humans using the web. Open source.

Home Page: https://open-action-data.vercel.app

HTML 15.67% TypeScript 71.69% JavaScript 8.19% CSS 3.53% Shell 0.92%

openactiondata's Issues

UI/UX: "We want to save this session" flow

Whenever the data engine (#1) identifies an interesting session, automatic anonymization runs (#3). The user should then be able to review the anonymized recording and confirm that we may save the session to the dataset.

In my head, it's something like a 3-step process:

  • The extension pops up, with copy explaining that the last X seconds of browsing were very relevant for the dataset
  • The user watches the anonymized replay; the replay should maybe even highlight which data/strings were stripped away. E.g. "We replaced all instances of lou*****@gmail.com with [OAD_EMAIL_PLACEHOLDER]"
  • The user presses confirm and gets a big, animated "🥳" emoji or something similar.

Looking for someone to help develop the UX, mock up the UI, or just build the frontend.

Test-suite for data anonymization

We need to develop automatic data anonymization, and to do that sanely, we should have a test-suite to check for false negatives in the data anonymization.

A simple way to do that: record a number of sessions of humans typing in (fake) sensitive data, and save them as JSON files. Then make a test-suite that puts each JSON file through the anonymize() function and checks whether any of the values to be anonymized are still present afterwards. It should also check for them inside the concatenated keystrokes. If any are still present, the test case should fail.
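The check described above could look roughly like the sketch below. It assumes a session is plain parsed JSON and that the concatenated keystrokes are passed in as a separate string; `findLeaks` and its parameters are hypothetical names, not existing project code.

```typescript
// Any value that can come out of JSON.parse.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Collect every string (object keys and string values) found anywhere in the session.
function collectStrings(value: Json, out: string[] = []): string[] {
  if (typeof value === "string") {
    out.push(value);
  } else if (Array.isArray(value)) {
    for (const item of value) collectStrings(item, out);
  } else if (value !== null && typeof value === "object") {
    for (const key of Object.keys(value)) {
      out.push(key);
      collectStrings(value[key], out);
    }
  }
  return out;
}

// Hypothetical helper: returns the sensitive values that still appear
// in the (supposedly anonymized) session or in the concatenated keystrokes.
function findLeaks(session: Json, sensitiveValues: string[], keystrokes: string = ""): string[] {
  const haystack = collectStrings(session).concat([keystrokes]);
  return sensitiveValues.filter((secret) =>
    haystack.some((text) => text.indexOf(secret) !== -1)
  );
}
```

A test case would then run a recorded session through anonymize(), call `findLeaks` on the result, and fail if the returned list is not empty.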

The kind of sensitive data we should test for:

  • Email address
  • Password
  • Name
  • Home address
  • Phone number
  • Bank account / credit card details
  • Crypto seed phrases
  • API keys
  • Social security / VAT number / passport number
  • ... anything else you can think of! Please throw a comment!

Data-engine to identify interesting "sessions"

Right now, we have some simple code that just records & uploads indiscriminately. For various reasons (storage costs, data cleanliness, user consent), we should aim to only collect sessions that matter.

With a data engine you're trying to find all the sessions that would actually provide a loss for your model. E.g. 20 million Google Searches wouldn't provide much loss each (because they'd quickly be learned), but that one-off usage of a website we've never seen before, or some really complicated workflow in Figma, might be super high-loss (and hence valuable) for the model.

Problem is, we're not gonna be doing active learning with a model to start with, so you need approximations of the metric I described above ^

Some ideas are:

  • URLs / domain names; if we have a lot of data from X, don't collect more. Also, we could weight them by expected variance. E.g. DuckDuckGo, low variance, don't collect that many. Figma, high variance, collect more. And so on.
  • Uniqueness of URL walk
  • Amount of user interaction; mouse movements + keystrokes are ultimately what you have to predict, so the less there is, the less signal to train on, and the more useless the session. (e.g. for driving datasets, don't sample driving straight, sample turns)
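As a rough illustration, the first and third ideas could be combined into one score. This is only a sketch under assumptions: the `SessionSummary` shape, the `sessionsPerDomain` counts, and the formula itself (domain novelty times interaction density) are all hypothetical, not something the project has settled on.

```typescript
// Hypothetical summary of one recorded session.
interface SessionSummary {
  domain: string;
  durationSeconds: number;
  interactionCount: number; // mouse movements + keystrokes
}

// Higher score = more worth collecting.
// novelty: shrinks as we accumulate sessions from the same domain.
// density: interactions per second, so idle sessions score low.
function scoreSession(
  summary: SessionSummary,
  sessionsPerDomain: { [domain: string]: number }
): number {
  const alreadySeen = sessionsPerDomain[summary.domain] ?? 0;
  const novelty = 1 / (1 + alreadySeen);
  const density =
    summary.durationSeconds > 0
      ? summary.interactionCount / summary.durationSeconds
      : 0;
  return novelty * density;
}
```

A per-domain variance weight (low for DuckDuckGo, high for Figma) could be multiplied in as a third factor once there is data to estimate it from.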

I'm open to more ideas / feedback!

Verify that rrweb collects keystrokes & create a utility to extract them all to one string

We use rrweb to record sessions. If you spin up the mock-backend & the chrome-extension, you can start recording & saving some sessions to JSON files locally.

I'm actually not sure if rrweb records keystrokes as-is, so first make sure of that.

If it does, make a script that can take one of these JSON files, find every keystroke, and concatenate them into one string. This is useful for #2, because we need to be able to verify that there is no sensitive data left inside of the keystrokes (after anonymization).
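A minimal sketch of such a utility follows. It rests on an assumption that should be checked against the rrweb version in use: that typed text shows up as incremental snapshot events (`type` 3) whose `data.source` is 5 (input) and whose `data.text` holds the field's current value, rather than as raw key presses. The `RecordedEvent` type and `extractInputText` are hypothetical names.

```typescript
// Loose shape of a recorded event; only the fields this sketch reads.
interface RecordedEvent {
  type: number;
  data?: { source?: number; text?: unknown };
}

// Assumed rrweb constants; verify against the rrweb version in use.
const INCREMENTAL_SNAPSHOT = 3;
const INPUT_SOURCE = 5;

// Concatenate the text of every input event into one string,
// one line per event, in recording order.
function extractInputText(events: RecordedEvent[]): string {
  const texts: string[] = [];
  for (const event of events) {
    const data = event.data;
    if (
      event.type === INCREMENTAL_SNAPSHOT &&
      data !== undefined &&
      data.source === INPUT_SOURCE &&
      typeof data.text === "string"
    ) {
      texts.push(data.text);
    }
  }
  return texts.join("\n");
}
```

If each event really does carry the whole field value, the output will contain growing prefixes ("a", "af", "af7", ...) rather than single characters, which is still searchable for leftover sensitive data.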

UI/UX: Let user press "never record this site"

This app only succeeds if it's not annoying. Users may have websites they would never want recorded - let's make sure we don't repeatedly ask them about these.

Maybe we can even infer "never record this site" from a few rejections in a row?
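One way the inference could work is sketched below, assuming we keep a per-site history of the user's answers. The threshold of 3 and the function name are assumptions for illustration.

```typescript
// Assumed threshold: how many rejections in a row imply "never record this site".
const REJECTIONS_BEFORE_BLOCK = 3;

// decisions: the user's answers for one site, oldest first.
// true = agreed to save the session, false = rejected it.
function shouldNeverRecord(
  decisions: boolean[],
  threshold: number = REJECTIONS_BEFORE_BLOCK
): boolean {
  if (threshold < 1 || decisions.length < threshold) return false;
  // Only the most recent `threshold` answers count, and all must be rejections.
  return decisions.slice(-threshold).every((accepted) => !accepted);
}
```

Because only the trailing run is inspected, a single acceptance resets the count, so a site the user sometimes approves is not blocked by accident.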

Reduce size of JSON sessions as much as possible

We need to be efficient with storage. The JSON is needlessly large and can be reduced by at least 10x, but let's try to get 20-50x.

Some ideas:

  • GZip; in my tests this will get the file to 5-10% of original size
  • Binary representation of data (like pickle or similar); not sure how much this would help when applied in combination with GZip
  • Strip JSON of useless data

Automatic data anonymization

Build a function which takes JSON sessions of recorded browser activity and replaces sensitive data with placeholder values. So if I type af7AFx2aGH as my password, it should replace every instance of it with [OPEN_ACTION_DATA_PASSWORD] or something similar, including inside the recorded keystrokes.

A list of data types that are considered sensitive + a series of tests for their anonymization is being developed in #2.
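The replacement step on its own might be sketched as follows, assuming the sensitive strings have already been detected by some other means (detection is the hard part and is not shown). `replaceSensitive` and the `replacements` map are hypothetical names.

```typescript
// Any value that can come out of JSON.parse.
type JsonValue =
  | null
  | boolean
  | number
  | string
  | JsonValue[]
  | { [key: string]: JsonValue };

// Returns a copy of `value` in which every occurrence of each known
// sensitive string is replaced by its placeholder.
// `replacements` maps sensitive string -> placeholder (assumed to be given).
function replaceSensitive(
  value: JsonValue,
  replacements: { [secret: string]: string }
): JsonValue {
  if (typeof value === "string") {
    let result = value;
    for (const secret of Object.keys(replacements)) {
      // split/join replaces all occurrences without regex escaping issues.
      if (secret.length > 0) result = result.split(secret).join(replacements[secret]);
    }
    return result;
  }
  if (Array.isArray(value)) {
    return value.map((item) => replaceSensitive(item, replacements));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: JsonValue } = {};
    for (const key of Object.keys(value)) {
      out[key] = replaceSensitive(value[key], replacements);
    }
    return out;
  }
  return value; // numbers, booleans, null are left untouched
}
```

This only catches exact matches. If the recording stores partially typed values (for example "af7AF" before the password is complete), those prefixes would survive and need separate handling.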

Proper backend to store sessions from the user

The rough stack I'm thinking:

  • Object storage (S3 or alternative) for the recorded sessions
  • PostgreSQL for metadata about the sessions stored in S3
  • Next.js API which inserts into DB + responds with a signed PUT url for S3 (allows client to upload directly to S3, saving Next.js costs)

Wrt. pricing, I think Vercel/Next.js would get us at least ~30 million requests per month on the $20/month business plan. Seems like enough for a while. The DB should be pretty cheap as well. Object storage (S3) will be much more costly.

I don't know if we should have any kind of authentication? I want to know the ~identity of the user submitting, for spam-prevention & cleaning purposes. But it seems like IP + Google Chrome identifier + captcha might take us a long way here?

Adblock interferes with recordings

Repro:

  • Install uBlock Origin; this seems to be the culprit
  • Use the chrome-extension paired with the mock-backend to record a session on https://www.azlyrics.com/lyrics/villagepeople/ymca.html.
  • Use the mock-backend route to replay the session, and notice that the dark purple bar is much larger in the replay (as large as it would have been if the ad was still there) than it was when using the website. This also causes the cursor not to be aligned.
