louislva / openactiondata

Building a diverse and clean dataset of humans using the web. Open source.

Home Page: https://open-action-data.vercel.app

HTML 15.67% TypeScript 71.69% JavaScript 8.19% CSS 3.53% Shell 0.92%

openactiondata's Issues

UI/UX: "We want to save this session" flow

Whenever the data engine (#1) identifies an interesting session, automatic anonymization is performed (#3). The user should then be able to review the anonymized recording and confirm that we may save the session to the dataset.

In my head, it's something like a 3-step process:

The extension pops up, with copy explaining that the last X seconds of browsing were very relevant for the dataset.
The user watches the anonymized replay; the replay should maybe even highlight which data/strings were stripped away, e.g. "We replaced all instances of lou*****@gmail.com with [OAD_EMAIL_PLACEHOLDER]".
The user presses confirm and gets a big, animated "🥳" emoji or something similar.
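The "what was stripped away" message in step two could be produced along these lines. This is only a sketch: `Redaction`, `maskEmail` and `describeRedaction` are hypothetical names, and the masking rule (keep the first three characters of the local part) is an assumption read off the example message above.

```typescript
// Hypothetical shape: one entry per sensitive value the anonymizer replaced.
interface Redaction {
  original: string;    // the sensitive value that was found
  placeholder: string; // what it was replaced with
}

// Keep the first three characters of the local part and mask the rest,
// so the user can recognise the value without seeing it in full.
// (Local parts of three characters or fewer are left visible by this rule.)
function maskEmail(email: string): string {
  const at = email.lastIndexOf("@");
  if (at < 0) return "*".repeat(email.length);
  const local = email.slice(0, at);
  const masked = "*".repeat(Math.max(local.length - 3, 0));
  return local.slice(0, 3) + masked + email.slice(at);
}

function describeRedaction(r: Redaction): string {
  return `We replaced all instances of ${maskEmail(r.original)} with ${r.placeholder}`;
}

console.log(
  describeRedaction({
    original: "louise42@example.com",
    placeholder: "[OAD_EMAIL_PLACEHOLDER]",
  }),
);
```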

Looking for someone to help develop the UX, mock up the UI, or just build the frontend.

Test-suite for data anonymization

We need to develop automatic data anonymization, and to do that sanely, we should have a test-suite to check for false negatives in the data anonymization.

A simple way to do that: record a number of sessions of humans typing in (fake) sensitive data, and save them as JSON files. Then make a test-suite that puts each JSON file through the anonymize() function and checks whether the values that should have been anonymized are still present afterwards, both in the JSON itself and inside the concatenated keystrokes. If any of them is still present, the test case should fail.
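A rough sketch of that check, assuming each fixture bundles a recorded session with the fake secrets that were typed into it. `Fixture` and `findLeaks` are hypothetical names, and the anonymizer and keystroke extractor are passed in as parameters because neither exists yet; the two stand-ins in the demo are placeholders, not the real thing.

```typescript
type Session = unknown; // a parsed session JSON file

interface Fixture {
  name: string;
  session: Session;
  sensitiveValues: string[]; // the fake secrets typed during the recording
}

// Returns every sensitive value that survived anonymization.
// An empty array means the fixture passes.
function findLeaks(
  fixture: Fixture,
  anonymize: (session: Session) => Session,
  extractKeystrokes: (session: Session) => string,
): string[] {
  const cleaned = anonymize(fixture.session);
  const json = JSON.stringify(cleaned);
  const keystrokes = extractKeystrokes(cleaned);
  return fixture.sensitiveValues.filter((value) => {
    // Values containing quotes or backslashes are escaped inside JSON,
    // so look for the escaped form as well.
    const escaped = JSON.stringify(value).slice(1, -1);
    return json.includes(value) || json.includes(escaped) || keystrokes.includes(value);
  });
}

// Demo with a made-up session shape and stand-in helpers.
const fixture: Fixture = {
  name: "password typed into a login form",
  session: { events: [{ text: "hunt" }, { text: "hunter2!" }] },
  sensitiveValues: ["hunter2!"],
};
const joinText = (s: Session): string =>
  (s as { events: { text: string }[] }).events.map((e) => e.text).join("");
const identity = (s: Session): Session => s;
const naiveScrub = (s: Session): Session =>
  JSON.parse(JSON.stringify(s).split("hunter2!").join("[PASSWORD]"));

console.log(findLeaks(fixture, identity, joinText));   // leaks: the password is still there
console.log(findLeaks(fixture, naiveScrub, joinText)); // no leaks left
```

A test runner would then loop over all fixtures and fail any case where `findLeaks` returns a non-empty array.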

The kind of sensitive data we should test for:

Email address
Password
Name
Home address
Phone number
Bank account / credit card details
Crypto seed phrases
API keys
Social security / VAT number / passport number
... anything else you can think of! Please throw a comment!

Install Sentry for error-logging

On the webapp and inside the chrome-extension.

Data-engine to identify interesting "sessions"

Right now, we have some simple code that just records & uploads indiscriminately. For various reasons (storage costs, data cleanliness, user consent), we should aim to only collect sessions that matter.

With a data engine, you're trying to find the sessions on which your model would actually incur a high loss. E.g. 20 million Google searches wouldn't each provide much loss (because they'd quickly be learned), but that one-off usage of a website we've never seen before, or some really complicated workflow in Figma, might be super high-loss (and hence valuable) for the model.

Problem is, we're not going to be doing active learning with a model to start with, so we need approximations of the metric described above.

Some ideas are:

URLs / domain names: if we already have a lot of data from X, don't collect more. We could also weight domains by expected variance. E.g. DuckDuckGo is low variance, so don't collect that many; Figma is high variance, so collect more. And so on.
Uniqueness of URL walk
Amount of user interaction: mouse movements + keystrokes are ultimately what you have to predict, so the less there is, the less signal there is to train on and the more useless the session (e.g. for driving datasets, don't sample driving straight, sample turns).
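To make two of these heuristics concrete, here is one possible scoring function. Everything in it is an assumption for illustration: the names (`SessionStats`, `EngineState`, `interestScore`), the per-domain variance weights, and the exact formula. URL-walk uniqueness is left out.

```typescript
interface SessionStats {
  domain: string;
  mouseEvents: number;
  keystrokes: number;
  durationSeconds: number;
}

interface EngineState {
  sessionsPerDomain: Map<string, number>; // how many sessions we already hold
  varianceWeight: Map<string, number>;    // hand-tuned guess, higher = more varied site
}

// Higher score = more worth collecting.
function interestScore(stats: SessionStats, state: EngineState): number {
  const seen = state.sessionsPerDomain.get(stats.domain) ?? 0;
  const variance = state.varianceWeight.get(stats.domain) ?? 1;
  // Diminishing returns: each extra session from a domain is worth less.
  const novelty = variance / (1 + seen);
  // Sessions with little interaction carry little signal to train on.
  const perSecond =
    (stats.mouseEvents + stats.keystrokes) / Math.max(stats.durationSeconds, 1);
  return novelty * Math.log1p(perSecond);
}

const state: EngineState = {
  sessionsPerDomain: new Map([["duckduckgo.com", 20000], ["figma.com", 3]]),
  varianceWeight: new Map([["duckduckgo.com", 0.5], ["figma.com", 5]]),
};

const figma = interestScore(
  { domain: "figma.com", mouseEvents: 400, keystrokes: 100, durationSeconds: 60 },
  state,
);
const search = interestScore(
  { domain: "duckduckgo.com", mouseEvents: 400, keystrokes: 100, durationSeconds: 60 },
  state,
);
console.log({ figma, search });
```

With the same amount of interaction, the Figma session scores far higher than the DuckDuckGo one, and a session with no interaction at all scores zero.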

I'm open to more ideas / feedback!

Verify that rrweb collects keystrokes & create a utility to extract them all to one string

We use rrweb to record sessions. If you spin up the mock-backend & the chrome-extension, you can start recording & saving some sessions to JSON files locally.

I'm actually not sure if rrweb records keystrokes as-is, so first make sure of that.

If it does, make a script that can take one of these JSON files, find every keystroke, and concatenate them into one string. This is useful for #2, because we need to be able to verify that there is no sensitive data left inside of the keystrokes (after anonymization).
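A possible starting point for that script. It rests on an unverified assumption: that rrweb stores typed text as incremental-snapshot events with an "input" source carrying the field's current text, rather than as individual key presses. The numeric codes below (`3` and `5`) and the `RecordedEvent` shape are my reading of rrweb's event format and should be checked against the version in use.

```typescript
// Assumed rrweb codes; verify against the installed rrweb version.
const INCREMENTAL_SNAPSHOT = 3; // event type for incremental snapshots
const INPUT_SOURCE = 5;         // incremental source for input changes

// Minimal assumed shape of a recorded event; real events carry more fields.
interface RecordedEvent {
  type: number;
  timestamp: number;
  data?: { source?: number; text?: string; id?: number };
}

// Concatenates the text of every input event, in chronological order.
function extractTypedText(events: RecordedEvent[]): string {
  const parts: string[] = [];
  const ordered = [...events].sort((a, b) => a.timestamp - b.timestamp);
  for (const event of ordered) {
    const data = event.data;
    if (
      event.type === INCREMENTAL_SNAPSHOT &&
      data !== undefined &&
      data.source === INPUT_SOURCE &&
      typeof data.text === "string"
    ) {
      parts.push(data.text);
    }
  }
  return parts.join("");
}

const sample: RecordedEvent[] = [
  { type: 3, timestamp: 30, data: { source: 5, text: "hi", id: 7 } },
  { type: 3, timestamp: 10, data: { source: 1 } }, // not an input event
  { type: 3, timestamp: 20, data: { source: 5, text: "h", id: 7 } },
  { type: 4, timestamp: 0 },                       // not an incremental snapshot
];
console.log(extractTypedText(sample));
```

If each input event holds the whole field value, the result repeats prefixes ("h" followed by "hi"), which is fine for the #2 check since the full value still appears as a substring.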

UI/UX: Let user press "never record this site"

This app only succeeds if it's not annoying. Users might have websites they'd never want to record; let's make sure we don't repeatedly ask them about these.

Maybe we can even infer "never record this site" from a few rejections in a row?
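One way the inference could work, as a sketch: count consecutive rejections per domain and stop asking once a threshold is hit. The class name `SitePreferences` and the threshold of three are assumptions.

```typescript
// Assumed value for "a few rejections in a row".
const REJECTIONS_BEFORE_MUTE = 3;

class SitePreferences {
  private neverRecord = new Set<string>();
  private rejectionStreak = new Map<string, number>();

  // The user explicitly pressed "never record this site".
  markNeverRecord(domain: string): void {
    this.neverRecord.add(domain);
  }

  // Call after the user accepts or rejects a "save this session" prompt.
  recordDecision(domain: string, accepted: boolean): void {
    if (accepted) {
      this.rejectionStreak.delete(domain); // an accept resets the streak
      return;
    }
    const streak = (this.rejectionStreak.get(domain) ?? 0) + 1;
    this.rejectionStreak.set(domain, streak);
    if (streak >= REJECTIONS_BEFORE_MUTE) {
      this.neverRecord.add(domain);
    }
  }

  shouldAsk(domain: string): boolean {
    return !this.neverRecord.has(domain);
  }
}

const prefs = new SitePreferences();
prefs.recordDecision("bank.example", false);
prefs.recordDecision("bank.example", false);
prefs.recordDecision("bank.example", false);
console.log(prefs.shouldAsk("bank.example")); // the third rejection in a row mutes the site
```

An inferred mute should probably be reversible in settings, since the user never asked for it explicitly.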

Reduce size of JSON sessions as much as possible

We need to be efficient with storage. The JSON is needlessly large and can be reduced by at least 10x, but let's try to get 20-50x.

Some ideas:

gzip; in my tests this gets the file down to 5-10% of its original size
Binary representation of data (like pickle or similar); not sure how much this would help when applied in combination with GZip
Strip JSON of useless data
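The gzip idea can be tried with Node's built-in `zlib` module. A minimal sketch: `compressSession` and `decompressSession` are hypothetical helper names, and the synthetic session below is made up purely to have something repetitive to compress, so its ratio says nothing about real recordings.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

// Serialize a session to JSON and gzip it.
function compressSession(session: unknown): Buffer {
  return gzipSync(JSON.stringify(session));
}

// Inverse of compressSession.
function decompressSession(compressed: Buffer): unknown {
  return JSON.parse(gunzipSync(compressed).toString("utf8"));
}

// Made-up, highly repetitive stand-in for a recorded session.
const synthetic = {
  events: Array.from({ length: 2000 }, (_, i) => ({
    type: 3,
    timestamp: 1700000000000 + i * 16,
    data: { source: 1, positions: [{ x: i % 500, y: (i * 7) % 300, id: 1 }] },
  })),
};

const rawBytes = Buffer.byteLength(JSON.stringify(synthetic), "utf8");
const gzBytes = compressSession(synthetic).length;
console.log(`raw: ${rawBytes} bytes, gzip: ${gzBytes} bytes`);
```

Running the same measurement on a few real session files would show how far gzip alone gets towards the 20-50x target before trying a binary format or stripping fields.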

Automatic data anonymization

Build a function which takes JSON sessions of recorded browser activity and replaces sensitive data with placeholder values. So if I type af7AFx2aGH as my password, it should replace every instance of it with [OPEN_ACTION_DATA_PASSWORD] or something similar, including inside the recorded keystrokes.
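A minimal sketch of the replacement half of this, assuming the sensitive values have already been detected somehow (detection is the hard part and is not shown). The `SensitiveValue` type and the two-argument signature are assumptions; the real anonymize() may well take only the session.

```typescript
interface SensitiveValue {
  value: string;       // the exact string to remove
  placeholder: string; // what to put in its place
}

// Returns a deep copy of the session with every occurrence of each
// sensitive value replaced, in both string values and object keys.
function anonymize<T>(session: T, secrets: SensitiveValue[]): T {
  // Longest first, so a secret that contains a shorter one is replaced whole.
  const ordered = secrets
    .filter((s) => s.value.length > 0)
    .sort((a, b) => b.value.length - a.value.length);

  const scrub = (text: string): string =>
    ordered.reduce((acc, s) => acc.split(s.value).join(s.placeholder), text);

  const walk = (node: unknown): unknown => {
    if (typeof node === "string") return scrub(node);
    if (Array.isArray(node)) return node.map((item) => walk(item));
    if (node !== null && typeof node === "object") {
      const out: Record<string, unknown> = {};
      for (const [key, value] of Object.entries(node as Record<string, unknown>)) {
        out[scrub(key)] = walk(value);
      }
      return out;
    }
    return node; // numbers, booleans, null
  };

  return walk(session) as T;
}

// Made-up session shape for the demo.
const recorded = {
  events: [
    { type: 3, data: { source: 5, text: "af7AFx2aGH" } },
    { type: 3, data: { source: 5, text: "login with af7AFx2aGH" } },
  ],
};
const cleaned = anonymize(recorded, [
  { value: "af7AFx2aGH", placeholder: "[OPEN_ACTION_DATA_PASSWORD]" },
]);
console.log(JSON.stringify(cleaned));
```

If the recording stores the field value after every key press, partial prefixes such as af7AF would survive this exact-match replacement, so they would need separate handling.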

A list of data types that are considered sensitive + a series of tests for their anonymization is being developed in #2.

Proper backend to store sessions from the user

The rough stack I'm thinking:

Object storage (S3 or alternative) for the recorded sessions
PostgreSQL for metadata about the sessions stored in S3
Next.js API which inserts into DB + responds with a signed PUT url for S3 (allows client to upload directly to S3, saving Next.js costs)
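The request flow in the last item could look roughly like this. To keep the sketch runnable without a database or an S3 account, the two side effects are injected as functions; `UploadDeps`, `createUpload`, the metadata fields and the object-key layout are all hypothetical, and the stub URL is not a real signed URL.

```typescript
interface SessionMetadata {
  sessionId: string;
  objectKey: string;
  sizeBytes: number;
  createdAt: string;
}

// In a real deployment these would wrap a PostgreSQL INSERT and an
// S3 (or compatible) presigned-PUT call; here they are placeholders.
interface UploadDeps {
  insertMetadata(row: SessionMetadata): Promise<void>;
  signPutUrl(objectKey: string): Promise<string>;
}

// What the API route would do per request: record metadata, then hand
// the client a URL it can PUT the session to directly.
async function createUpload(
  deps: UploadDeps,
  sessionId: string,
  sizeBytes: number,
): Promise<{ uploadUrl: string; objectKey: string }> {
  const objectKey = `sessions/${sessionId}.json.gz`;
  await deps.insertMetadata({
    sessionId,
    objectKey,
    sizeBytes,
    createdAt: new Date().toISOString(),
  });
  const uploadUrl = await deps.signPutUrl(objectKey);
  return { uploadUrl, objectKey };
}

// In-memory stubs for trying the flow locally.
const rows: SessionMetadata[] = [];
const stubDeps: UploadDeps = {
  insertMetadata: async (row) => {
    rows.push(row);
  },
  signPutUrl: async (key) => `https://example.invalid/${key}?signature=stub`,
};

createUpload(stubDeps, "demo-session", 1024).then((ticket) => {
  console.log(ticket.uploadUrl);
});
```

Because the client uploads straight to object storage, the API route only ever handles small JSON requests, which is what keeps the Next.js side cheap.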

Wrt. pricing, I think Vercel/Next.js would get us at least ~30 million requests per month on the $20/month business plan, which seems like enough for a while. The DB should be pretty cheap as well. Object storage (S3) will be much more costly.

I don't know if we should have any kind of authentication. I want to know the approximate identity of the user submitting, for spam-prevention & cleaning purposes, but it seems like IP + Google Chrome identifier + captcha might take us a long way here?

Adblock interferes with recordings

Repro:

Install uBlock Origin; this seems to be the culprit
Use the chrome-extension paired with the mock-backend to record a session on https://www.azlyrics.com/lyrics/villagepeople/ymca.html.
Use the mock-backend route to replay the session, and notice that the dark purple bar is much larger in the replay (as large as it would have been if the ad were still there) than it was when using the website. This also causes the cursor to be misaligned.
