louislva / openactiondata
Building a diverse and clean dataset of humans using the web. Open source.
Home Page: https://open-action-data.vercel.app
Whenever the data engine (#1) identifies an interesting session, automatic anonymization is performed (#3), and the user should then be able to review the anonymized recording and confirm that we may save the session to the dataset.
In my head, it's something like a 3-step process:
Looking for someone to help develop the UX, mock up the UI, or just build the frontend.
We need to develop automatic data anonymization, and to do that sanely, we should have a test suite to check for false negatives in the data anonymization.
A simple way to do that: Record a number of sessions of humans typing in (fake) sensitive data, and save them as JSON files. Then make a test suite that puts each JSON file through the anonymize() function, and checks whether the values to be anonymized are still present afterwards. It should also check for them inside the concatenated keystrokes. If they are still present, the test case should fail.
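The check described above could be sketched roughly as follows. The `Fixture` shape, the `findLeaks` name, and the way `anonymize` and `concatKeystrokes` are passed in are all assumptions made for illustration; the real signatures are not defined in this issue.

```typescript
// Hypothetical fixture shape: one recorded session plus the fake
// sensitive values that were typed while recording it.
interface Fixture {
  name: string;
  sessionJson: string;       // raw recording, serialized as JSON
  sensitiveValues: string[]; // fake secrets typed during the recording
}

// Returns the sensitive values that survived anonymization (false negatives).
// `anonymize` and `concatKeystrokes` are injected so the check does not
// depend on how they end up being implemented.
function findLeaks(
  fixture: Fixture,
  anonymize: (sessionJson: string) => string,
  concatKeystrokes: (sessionJson: string) => string,
): string[] {
  const anonymized = anonymize(fixture.sessionJson);
  const keystrokes = concatKeystrokes(anonymized);
  return fixture.sensitiveValues.filter(
    (value) => anonymized.includes(value) || keystrokes.includes(value),
  );
}
```

A test case would then fail whenever `findLeaks` returns a non-empty list for a fixture. Note that this plain substring check would miss values that contain characters JSON escapes (such as quotes), so fixtures with such values would need extra care.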
The kind of sensitive data we should test for:
On webapp and inside chrome-extension
Right now, we have some simple code that just records & uploads indiscriminately. For various reasons (storage costs, data cleanliness, user consent), we should aim to only collect sessions that matter.
With a data engine you're trying to find all the sessions that would actually provide a loss for your model. E.g. 20 million Google Searches wouldn't provide much loss each (because they'd quickly be learned), but that one-off usage of a website we've never seen before or some really complicated workflow in Figma, might be super high-loss (and hence valuable) for the model.
Problem is, we're not gonna be doing active learning with a model to start with, so you need approximations of the metric I described above ^
Some ideas are:
I'm open to more ideas / feedback!
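To make the "approximation" idea concrete, here is a minimal sketch of one possible proxy: scoring a session by how rarely its host has been seen so far. This is an assumption made for illustration and not necessarily one of the ideas in this issue; `HostNoveltyScorer` and its methods are hypothetical names.

```typescript
// Hypothetical proxy for "loss": hosts we have seen often score low,
// hosts we have never seen score 1.
class HostNoveltyScorer {
  private counts = new Map<string, number>();

  // Score in (0, 1]: 1 for a never-seen host, shrinking as the host repeats.
  score(host: string): number {
    return 1 / (1 + (this.counts.get(host) ?? 0));
  }

  // Call once per collected session so later sessions on the same host score lower.
  observe(host: string): void {
    this.counts.set(host, (this.counts.get(host) ?? 0) + 1);
  }
}
```

With this, the 20-millionth search session scores close to zero, while the first session on an unseen site scores 1. It says nothing about how complicated a workflow is, so it would only cover the "website we've never seen before" half of the example.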
We use rrweb to record sessions. If you spin up the mock-backend & the chrome-extension, you can start recording & saving some sessions to JSON files locally.
I'm actually not sure if rrweb records keystrokes as-is, so first make sure of that.
If it does, make a script that can take one of these JSON files, find every keystroke, and concatenate them into one string. This is useful for #2, because we need to be able to verify that there is no sensitive data left inside of the keystrokes (after anonymization).
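A minimal sketch of such a script is below. It assumes the JSON file is an array of events, and that typed text shows up in incremental-snapshot events (numeric type 3) with an input source (numeric source 5) and a `data.text` field. Those numeric values and field names are assumptions that should be checked against real recordings, which is exactly the verification step mentioned above.

```typescript
// Assumed event shape; verify against actual recorded JSON files.
interface RecordedEvent {
  type: number;
  timestamp: number;
  data?: { source?: number; text?: string };
}

const INCREMENTAL_SNAPSHOT = 3; // assumed numeric event type
const INPUT_SOURCE = 5;         // assumed numeric source for input events

// Concatenates the text of every input event, in timestamp order.
function concatKeystrokes(sessionJson: string): string {
  const events: RecordedEvent[] = JSON.parse(sessionJson);
  return events
    .filter((e) => e.type === INCREMENTAL_SNAPSHOT && e.data?.source === INPUT_SOURCE)
    .sort((a, b) => a.timestamp - b.timestamp)
    .map((e) => e.data?.text ?? "")
    .join("");
}
```

If each input event turns out to carry the field's whole current value rather than a single key, the result will contain repeated prefixes, but the full typed value still appears as a substring, which is what the check in #2 needs.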
This app only succeeds if it's not annoying. Users might have websites they'd never want recorded, so let's make sure we don't repeatedly ask them about these.
Maybe we can even infer "never record this site" from a few rejections in a row?
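The "few rejections in a row" idea could look something like this sketch. The class name, the method names, and the default threshold of 3 are all assumptions; the issue does not specify them.

```typescript
// Tracks consecutive rejections per site (e.g. per hostname).
class RejectionTracker {
  private streaks = new Map<string, number>();
  private readonly threshold: number;

  // threshold = 3 is an arbitrary default chosen for this sketch.
  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  // An accepted recording resets the streak; a rejection extends it.
  recordDecision(site: string, accepted: boolean): void {
    const next = accepted ? 0 : (this.streaks.get(site) ?? 0) + 1;
    this.streaks.set(site, next);
  }

  // True once the user has rejected this site `threshold` times in a row.
  shouldNeverRecord(site: string): boolean {
    return (this.streaks.get(site) ?? 0) >= this.threshold;
  }
}
```

The extension would call `shouldNeverRecord` before prompting, and stop asking about a site once it returns true. Persisting the streaks across browser restarts is left out here.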
We need to be efficient with storage. The JSON is needlessly large and can be reduced by at least 10x, but let's try to get 20-50x.
Some ideas:
Build a function which takes JSON sessions of recorded browser activity, and replaces sensitive data with placeholder values. So if I type af7AFx2aGH as my password, it should replace every instance of it with [OPEN_ACTION_DATA_PASSWORD] or something similar, including inside the recorded keystrokes.
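As a rough sketch of the replacement step only, the following assumes the sensitive values are already known and passed in. How they get detected is out of scope here, and the `SensitiveValue` shape, the `scrub` helper, and the two-argument `anonymize` signature are hypothetical.

```typescript
interface SensitiveValue {
  value: string;       // literal text to scrub, e.g. a typed password
  placeholder: string; // e.g. "[OPEN_ACTION_DATA_PASSWORD]"
}

// Recursively walks parsed JSON and replaces every occurrence of each
// sensitive value inside string values. Object keys are left untouched.
function scrub(node: unknown, secrets: SensitiveValue[]): unknown {
  if (typeof node === "string") {
    let text = node;
    for (const s of secrets) {
      if (s.value.length > 0) text = text.split(s.value).join(s.placeholder);
    }
    return text;
  }
  if (Array.isArray(node)) return node.map((child) => scrub(child, secrets));
  if (node !== null && typeof node === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(node as Record<string, unknown>)) {
      out[key] = scrub(child, secrets);
    }
    return out;
  }
  return node; // numbers, booleans, null
}

function anonymize(sessionJson: string, secrets: SensitiveValue[]): string {
  return JSON.stringify(scrub(JSON.parse(sessionJson), secrets));
}
```

This only catches complete occurrences of a value. If the recording stores intermediate states while the user is typing (for example "af7", then "af7A"), those prefixes would still be present and need separate handling.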
A list of data types that are considered sensitive + a series of tests for their anonymization is being developed in #2.
The rough stack I'm thinking:
Wrt. pricing, I think Vercel/Next.js would get us at least ~30 million requests per month on the $20/month business plan. Seems like enough for a while. Object storage (S3) will be much more costly. DB should be pretty cheap as well.
I don't know if we should have any kind of authentication. I want to know the ~identity of the user submitting, for spam-prevention & cleaning purposes. But it seems like IP + Google Chrome identifier + captcha might take us a long way here?
Repro: