
qascade / dcr


A PoC framework to orchestrate interoperable Differentially Private Data Clean Room Services using Intel SGX hardware as root of trust.

License: GNU Affero General Public License v3.0

Go 99.81% Makefile 0.19%
data data-security datacleanroom confidential-computing differential-privacy intel-sgx golang intel-sgx-sdk gssoc23 gssoc

dcr's Introduction

Hi 👋, I'm Shubh Karman Singh

A Computer Science Engineer, guitarist, and self-proclaimed computer science lover. I work as a Data Engineer at LiveRamp.

  • 🔭 I maintain and build a Data Clean Room Solution: dcr

  • 📝 I'm accepting PRs for my projects dcr and Yet Another Streaming Tool.

  • 🌱 I'm currently learning Distributed Systems, Scala and Spark.

  • 👨‍💻 All of my projects are available at https://github.com/qascade

  • 💬 Ask me about golang, rust, databases, differential privacy, confidential computing, containers, docker, binary exploitation, computer networks

  • 📫 How to reach me: [email protected]

  • 📄 Know about my experiences here

  • ⚡ Fun fact: I am very passionate about everything jazz, neo-soul and RnB. I have collected and completed all the AC Games...


Languages and Tools:

bash c cplusplus docker git go javascript linux mysql nodejs postgresql python rust


dcr's People

Contributors

chococandy63, qascade, s-ishita, tiklup11


dcr's Issues

Improving the readme file of this project

Description

I would like to improve the readme file of this project, making it more engaging and expressive for users.


How are you planning to resolve this issue?

Introducing a few changes in the readme file and improving it.

Test: Init Test Env

Description

We want dcr to initialize a test environment consisting of two Postgres containers with test data preloaded.

How are you planning to resolve this issue?

dcr init --testenv should spin up two Postgres containers, with the media and advertiser data loaded in each of them.
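
A minimal sketch of what the --testenv bootstrap could look like, assuming Docker is available on PATH; the container names, ports, image tag, and seed-data paths below are hypothetical:

package main

import (
	"fmt"
	"log"
	"os/exec"
)

// startPostgres launches one postgres container and mounts a seed SQL script
// into the image's init directory so the test data is loaded on first boot.
// (docker -v needs an absolute host path in practice.)
func startPostgres(name, hostPort, seedSQL string) error {
	cmd := exec.Command("docker", "run", "-d",
		"--name", name,
		"-e", "POSTGRES_PASSWORD=dcr_test",
		"-p", hostPort+":5432",
		"-v", seedSQL+":/docker-entrypoint-initdb.d/seed.sql",
		"postgres:15")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("starting %s: %v: %s", name, err, out)
	}
	return nil
}

func main() {
	// One container per collaborator in the sample collaboration.
	if err := startPostgres("dcr-media", "5433", "./testdata/media.sql"); err != nil {
		log.Fatal(err)
	}
	if err := startPostgres("dcr-advertiser", "5434", "./testdata/advertiser.sql"); err != nil {
		log.Fatal(err)
	}
}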

feat: dockerize dcr

Description

As of now, if someone wants to test-run dcr, they have to use a Linux machine and go through the hassle of installing all the dependencies. I want to make it easier by allowing the demonstration to run on any host.
Prepare
Dockerize dcr and add an ENTRYPOINT of dcr run -p sample/init_collaboration. It would also be better to update the docs on how to interact with the containerized binary.

fix(address): authorized sorting failing on non-authorization

Description

Before the Clean Room service runs, we do a modified topological sort of the graph generated after parsing the yaml. Ideally, we want to remove only the unauthorized node and its dependencies, but currently the whole service terminates if anything comes back unauthorized. The graph has edges directed toward dependencies (dest -> transformation, transformation -> source). Inside the code the graph nodes are defined as addresses.

The current algorithm is a topological sort using Kahn's algorithm that does an authorization check before pushing a node into the queue. You can look at the AuthorizedSort() function inside collaboration/address/topo.go.

This would be a good issue for DSA folks, as it's a pure graph problem involving topological sort and basic traversals.

How to fix this?

  • Solution 1: Create a separate graph for all destinations and compute over it.
  • Solution 2: Reverse all the edges, i.e. make them point in the order of execution rather than toward dependencies. This should make things much easier; a rough sketch of this approach follows the list.
  • Please note that in real-world use cases the graph will not have more than ~20 nodes, so I don't want any unnecessary optimizations in the implementation. A simple naive traversal is sufficient as long as it gets the work done.
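
For illustration, a minimal sketch along the lines of Solution 2, using simplified hypothetical node types (the real ones live in collaboration/address): instead of terminating, the sort drops unauthorized nodes and everything that depends on them.

package address

import "fmt"

// Node is a simplified stand-in for the address nodes defined in this package.
type Node struct {
	Ref        string
	Authorized bool
}

// AuthorizedSort runs Kahn's algorithm on the dependency graph but, instead of
// failing the whole run, drops any unauthorized node together with everything
// that depends on it. deps maps a node to the nodes it depends on
// (dest -> transformation, transformation -> source), so nodes with no
// dependencies (sources) are emitted first, i.e. in execution order.
func AuthorizedSort(nodes map[string]*Node, deps map[string][]string) ([]string, error) {
	pending := make(map[string]int)         // unresolved dependencies per node
	dependents := make(map[string][]string) // dependency -> nodes that need it
	for ref := range nodes {
		pending[ref] = len(deps[ref])
		for _, d := range deps[ref] {
			dependents[d] = append(dependents[d], ref)
		}
	}

	var queue []string
	for ref, n := range pending {
		if n == 0 {
			queue = append(queue, ref)
		}
	}

	dropped := make(map[string]bool)
	var order []string
	processed := 0
	for len(queue) > 0 {
		ref := queue[0]
		queue = queue[1:]
		processed++
		if !nodes[ref].Authorized || dropped[ref] {
			dropped[ref] = true // unauthorized, or depends on a dropped node
		} else {
			order = append(order, ref)
		}
		for _, parent := range dependents[ref] {
			if dropped[ref] {
				dropped[parent] = true // poison everything downstream
			}
			pending[parent]--
			if pending[parent] == 0 {
				queue = append(queue, parent)
			}
		}
	}
	if processed != len(nodes) {
		return nil, fmt.Errorf("cycle detected in address graph")
	}
	return order, nil
}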

refactor(config): remove repeated code in package config

Description

While writing the config parser, I had to accept some redundancy for the sake of other, higher priorities. I want that fixed: refactor and refine the config parser and remove the unnecessary redundancies.

[Feature Request]: add codeql workflow

Is your feature request related to a problem? Please describe.

The repository contains Go code but does not have a workflow for code scanning.

Describe the solution you'd like

I want to add the CodeQL workflow to automate security checks. CodeQL is the code analysis engine developed by GitHub to identify vulnerabilities in code. It will analyze the code and display the results as code scanning alerts, and it will run on every push and pull request using GitHub Actions.

Record

  • I agree to follow this project's Code of Conduct
  • I'm a GSSoC'23 contributor
  • I want to work on this issue

doc(readme): update readme badges

Description

I want to update the readme badges.

Screenshots

badges

Are you raising this PR under GSSoC'23?

Yes

How are you planning to resolve this issue?

Introducing a few changes in the readme file.

Add Auto Comment Feature to Improve Collaboration

Issue Description:
As an active contributor to your open-source project, I believe that implementing an auto-comment feature would greatly enhance collaboration and communication within the project. This feature would automatically generate comments in response to specific events, such as when an issue is opened, a pull request is created, an issue is assigned, or an issue is unassigned.

Feature Details:

  • When an issue is opened, the auto-comment should greet the author and provide a brief acknowledgement and request for additional context.
  • When a pull request is opened, the auto-comment should greet the author, express gratitude, and remind them to follow the project's contributing guidelines.
  • When an issue is closed, the auto-comment should thank the author for their contribution and encourage further engagement.
  • When an issue is assigned to someone, the auto-comment should notify the assignee and encourage them to start working on it.
  • When an issue is unassigned from someone, the auto-comment should notify the assignee about the change and suggest reassignment if they are offline.

Benefits:

  • Improved communication and engagement with contributors.
  • Provides clear instructions and acknowledgements for various events.
  • Enhances collaboration by setting expectations and providing reminders.
  • Reduces manual effort by automating comment generation.

Acceptance Criteria:

  • The auto-comment feature should be implemented using the "wow-actions/auto-comment" GitHub Action.
  • Comments should be appropriately customized for each event, mentioning relevant parties and providing the necessary information.
  • The auto-comment workflow should trigger on the following events: issues opened, pull requests opened, issues closed, issues assigned, and issues unassigned.
  • The feature should be added to the project's existing GitHub Actions workflow file.

Additional Context:
Feel free to ask any questions or seek clarification regarding the auto-comment feature. I'm excited about contributing to your project and believe this enhancement will greatly benefit its community.

refactor(source): merge transformation_owners_allowed and destinations_allowed to transformations_allowed.

Description

For a Source, transformation_owners_allowed and destinations_allowed feel redundant; I want to unify them into transformations_allowed as a single data access grant.

Sample Yaml to understand the changes you have to make:

collaborator: Media #name of the collaborator 
sources: 
  - name: Media_customers
    csv_location: ./media_customers.csv
    description: table having data for media customer
    columns:
      - name: email 
        type: string
        masking_type: sha256
        selectable: true 
        aggregates_allowed: 
          - private_count
          - private_count_distinct  
        join_key: true 
    transformations_allowed: 
      - ref: /Research/transformation/customer_overlap_count 
        noise_parameters: 
          - noiseType: Laplace
          - epsilon: math.Log(2) 
          - maxPartitionsPerUser: 1

Make necessary changes in the address authorization and the graph population logic to enable this yaml.

feat: enable use of the address graph and maneuver policies between address nodes.

Description

Sources, Transformations, and Destinations form a directed acyclic graph. Each node in this graph is called an address. Each Source and Transformation in the graph has a policy, from which a policy for the transformation output is derived. Create an algorithm to propagate/assign policies to transformation outputs through the address graph.
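
A minimal sketch of one way the propagation could work, assuming hypothetical Policy fields and a "keep the more restrictive epsilon" merge rule; the real rule would come from the collaboration spec.

package address

// Policy is a stand-in for whatever noise/access constraints an address carries.
type Policy struct {
	NoiseType string
	Epsilon   float64
}

// merge derives an output policy by keeping the more restrictive (smaller
// epsilon) of the two inputs. This merge rule is an assumption for the sketch.
func merge(a, b Policy) Policy {
	if b.Epsilon < a.Epsilon {
		return b
	}
	return a
}

// PropagatePolicies walks the addresses in topological (execution) order and
// assigns each node the merge of its own policy and the derived policies of
// all of its inputs, so a transformation output inherits constraints from its
// sources.
func PropagatePolicies(order []string, inputs map[string][]string, policies map[string]Policy) map[string]Policy {
	derived := make(map[string]Policy, len(order))
	for _, ref := range order {
		p, ok := policies[ref]
		for _, in := range inputs[ref] {
			if !ok {
				p, ok = derived[in], true
				continue
			}
			p = merge(p, derived[in])
		}
		derived[ref] = p
	}
	return derived
}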

doc: comment every code snippet to maintain a godoc ref

Description

This codebase was built by just two people, and due to other priorities at hand we skipped proper comments. Now that we are transitioning to making this framework open source, we should have better documentation.

fix(config): error message on wrong yaml input.

Description

There is code in the ParseSpec() function that outputs a better error message when there is an error in the input yaml, but for some reason the proper error message is not propagating back.
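
One possible cause is that the error is reformatted or swallowed somewhere up the call chain instead of being wrapped. A minimal, hypothetical illustration of the fmt.Errorf %w pattern that keeps the yaml error visible to every caller (the Spec type and yaml library choice here are assumptions, not the repository's actual code):

package config

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// Spec is a stand-in for the real parsed spec struct.
type Spec struct {
	Collaborator string `yaml:"collaborator"`
}

func ParseSpec(path string) (*Spec, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("reading spec %s: %w", path, err)
	}
	var spec Spec
	if err := yaml.Unmarshal(raw, &spec); err != nil {
		// Wrapping with %w (rather than replacing the error with a generic
		// message) lets callers print and unwrap the yaml library's details.
		return nil, fmt.Errorf("parsing spec %s: %w", path, err)
	}
	return &spec, nil
}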

test: Make test checks strict.

Description

The tests I have written in this project are not strictly checking the contents of the results produced by the runs (they just print them). I want to make the tests stricter: all file results should be hashed and matched, and all map/struct results should be sorted and deeply compared.
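
A minimal sketch of both checks, with hypothetical file names and result types:

package dcr_test

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"reflect"
	"sort"
	"testing"
)

// fileSHA256 hashes a result file so it can be matched against a pinned digest.
func fileSHA256(t *testing.T, path string) string {
	t.Helper()
	data, err := os.ReadFile(path)
	if err != nil {
		t.Fatalf("reading %s: %v", path, err)
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func TestResultsStrict(t *testing.T) {
	// 1. File results: hash and match against a digest pinned from a known-good run.
	const wantDigest = "..." // pinned digest
	if got := fileSHA256(t, "testdata/expected_output.csv"); got != wantDigest {
		t.Errorf("output digest = %s, want %s", got, wantDigest)
	}

	// 2. Struct results: sort deterministically, then compare deeply.
	type row struct {
		Pet   string
		Count int
	}
	got := []row{{"dogs", 3}, {"cats", 5}}  // stand-in for the real result
	want := []row{{"cats", 5}, {"dogs", 3}} // hardcoded expected values
	sort.Slice(got, func(i, j int) bool { return got[i].Pet < got[j].Pet })
	sort.Slice(want, func(i, j int) bool { return want[i].Pet < want[j].Pet })
	if !reflect.DeepEqual(got, want) {
		t.Errorf("results = %v, want %v", got, want)
	}
}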

feat(transformation): design specs for supporting dp sql queries as transformation in dcr

Description

SQL queries are much more user-friendly and readable than general-purpose languages like Python or Go. We want to support SQL queries as transformations if possible, but the catch is that the query result must be differentially private. We may have to look into different strategies for generating an execution plan from the query input that makes the SQL output dp. We would also want the query execution to be done inside a TEE.

Some high level strategies may include:

  1. Take an existing DB and modify its query language and execution plan to use Google's dp definition.
  2. Write an OLAP db from scratch and make sure the above conditions are met.
  3. Write an interpreter that generates confidential go-apps after reading a SQL query. (With this we will only be able to support a limited set of queries and may have to go query by query.)

Resources:

  1. https://dpfordb.github.io/dpfordb.pdf
  2. https://github.com/google/differential-privacy

doc(readme): add contributors graph

Description

I want to add a contributors graph to the readme.


Are you raising this PR under GSSoC'23?

Yes

How are you planning to resolve this issue?

Introducing a few changes in the readme file.

feat(collaboration): noise validation and quantification in dcr

Description

We have a concept of Trust Groups. A trust group is the set of sources that have given destination_allowed permission to the same destination.

How are you planning to resolve this issue?

For this we need to complete the validateNoises function in lib/collaboration/collaboration.go.

This validateNoises function will also need the list of collaborators that permit the same destination.

This list will be fetched from the address graph. All these collaborators form a Trust Group. After that, we have three options for noise validation/propagation:
1. Only one collaborator from the trust group is allowed to define noises. This can be simplified in yaml by letting the other collaborators acknowledge the noise by referring to the noise parameters, which can be introduced as an address_type.
2. All collaborators in a trust group must give the same noises at the source level; if the noises mismatch, it results in an error.
3. (To be introduced later, if feasible) There is no trust group; everybody can define whatever amount of noise they want, and we define a mechanism that selects the largest of all the contributed noises for the result.

For now, we will start implementing the first option.
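
A minimal sketch of what that first option could look like, with hypothetical types; the real entry point is the validateNoises function mentioned above.

package collaboration

import "fmt"

// NoiseParams is a stand-in for the per-source noise configuration.
type NoiseParams struct {
	NoiseType string
	Epsilon   float64
}

// validateNoises enforces option 1: exactly one collaborator in the trust
// group defines the noise parameters for the shared destination, and everyone
// else just acknowledges them.
func validateNoises(trustGroup []string, noiseBy map[string]*NoiseParams) (*NoiseParams, error) {
	var owner string
	var params *NoiseParams
	for _, collaborator := range trustGroup {
		p := noiseBy[collaborator]
		if p == nil {
			continue // this collaborator only acknowledges the shared noise
		}
		if params != nil {
			return nil, fmt.Errorf("noise defined by both %s and %s; only one collaborator per trust group may define it", owner, collaborator)
		}
		owner, params = collaborator, p
	}
	if params == nil {
		return nil, fmt.Errorf("no collaborator in the trust group defined noise parameters")
	}
	return params, nil
}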

Adding dependabot.yml file

Description

We need to add a dependabot.yml file, as it helps us keep our dependencies up to date. Every day, it checks the dependency files for outdated requirements and opens individual PRs for any it finds.

Are you raising this PR under GSSoC'23?

Yes @qascade

How are you planning to resolve this issue?

Sample file:
dependabot.yml

test: make TestGraph tests strict

Description

Currently, the address.Graph struct inside the graph package is populated using the NewGraph() function. This function is tested by the TestGraph function; I want to make sure all the graph values are checked deeply against a hardcoded expected struct rather than just printed.

How are you planning to resolve this issue?

Create a hardcoded stub of the expected values for each field in the graph and compare it with the graph produced by the code. As all the structs have already been checked manually, all tests should pass in general. Please make sure to check across the config yamls for cross-verification.

feat(service): automate csv assignment inside enclave.json

Description

CSVs can't be moved inside enclaves unless they are explicitly mentioned inside enclave.json; check the files field in service/temp_enclave.json. I want that to be automated and picked up from the yaml, not through temporary CSVs.
The enclave.json is to be created/modified before ego-sign, using the sources specified in the yaml.

How are you planning to resolve this issue?

One of the following strategies may be used; you are free to come up with your own strategy as well. A sketch of the first strategy follows the list.

  1. Use basic json Writers to manipulate the file.
  2. Use some templates and use pongo2 templating engine.
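
A minimal sketch of the first strategy, assuming enclave.json carries a top-level "files" array (as in service/temp_enclave.json) and that csvPaths comes from the parsed yaml sources; the exact shape of each entry is an assumption.

package service

import (
	"encoding/json"
	"os"
)

// addSourceFiles rewrites enclave.json so that every CSV referenced in the
// parsed yaml sources is available inside the enclave before ego-sign runs.
func addSourceFiles(enclaveJSON string, csvPaths []string) error {
	raw, err := os.ReadFile(enclaveJSON)
	if err != nil {
		return err
	}
	var conf map[string]interface{}
	if err := json.Unmarshal(raw, &conf); err != nil {
		return err
	}

	files, _ := conf["files"].([]interface{})
	for _, p := range csvPaths {
		// The {"source": ..., "target": ...} entry shape is an assumption;
		// match it to whatever temp_enclave.json actually uses.
		files = append(files, map[string]string{"source": p, "target": p})
	}
	conf["files"] = files

	out, err := json.MarshalIndent(conf, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(enclaveJSON, out, 0o644)
}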

feat: implement a SQL grammar that lets collaborators define differentially private transformations and run them on our system.

Description

For now our system is built for generic transformations: a collaborator has to provide the code for the transformation, which is good for ML use cases (given it is federated), but ordinary analysts just want to run simple SQL queries. We should give them a user-friendly way to express their queries, with additional specifications/parameters for noises, and compile them onto some database that can run on Intel SGX.

One such option may be : https://github.com/edgelesssys/edgelessdb

Some idea of how the SQL may look, based on Google's ZetaSQL extension:

SELECT WITH ANONYMIZATION OPTIONS (   
		epsilon={{epsilon_val}}, 
		delta={{delta_val}}, 
		kappa={{kappa_val}}
) ANON_COUNT({{count_col}} CLAMPED {{threshold}}) AS {{transformation_name}}
 FROM {{table1}} JOIN {{table2}} USING ({{join_key}})

Add 'Back-to-top' link in ReadMe.md

Description

I would like to add a 'back-to-top' link in ReadMe.md.
Adding a back-to-top link will make navigation easier.

Are you raising this PR under GSSoC'23?

Yes

Add issue templates

Description

Would like to add issue templates.
Could this be assigned to me?


Are you raising this PR under GSSoC'23?

Yes

feat: Clean Room Service should spin up a Trusted Container(Simulated) with ZetaSQL engine running inside the enclave

Description

Implement a Dockerfile that generates a linux-sgx container with a ZetaSQL engine inside an enclave. We also need a server inside the enclave that acts as a gateway between ZetaSQL and the Clean Room Service. This server is under development in issue #20 by @qascade and @ShisuiMadara.

spec doc: https://www.notion.so/Solution-3-07d81059daab40cb84180336a33c3dd9 under Trusted Env section.

How are you planning to resolve this issue?

TBD. Please write it in the spec docs and get it signed off.

fix(service): unable to capture output for command:exit status 2

Description

After setting up the project and running this command in the terminal:

./dcr run --pkgpath ../samples/init_collaboration
I get the following error:

ERRO[0013] err running event: unable to capture output for command:exit status 2 
Error: err running service: err running event: unable to capture output for command:exit status 2

How are you planning to resolve this issue?

Introducing a few code changes.

feat(service): upgrade the ego-server inside the feat.ego_server_run branch to enable attested tls.

Description

Ego provides an example of a server verifying that a payload comes from a trusted source. I want to enable that feature inside the clean room server, which runs inside the SGX enclave. Please note that as we are just simulating the TEE, you can assume the server to be the root CA itself.

How are you planning to resolve this issue?

You can take a look at this: https://github.com/edgelesssys/ego/tree/master/samples/attested_tls

Feat: Implement SourceWarehouse Interface

Description

This issue is to track the development of the SourceWarehouse interface.


How are you planning to resolve this issue?

Introducing a few code changes; a hypothetical sketch of the interface follows.
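
A hypothetical sketch of what the SourceWarehouse interface could look like; the method set below is an assumption, not the repository's actual design.

package source

import "context"

// SourceWarehouse abstracts any backing store (CSV file, Postgres instance,
// EdgelessDB table) that a collaborator's source can be read from.
type SourceWarehouse interface {
	// Connect establishes the connection or opens the file backing the source.
	Connect(ctx context.Context) error
	// Read streams the source rows as column-name -> value maps.
	Read(ctx context.Context) ([]map[string]string, error)
	// Close releases the underlying connection.
	Close() error
}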

feat(config): enable use of relative addresses in config yaml.

Description

For now, if users want to use dcr, they have to spell out the full reference to refer to an address on the address graph, which is not user-friendly. We should enable relative addressing in the yaml, with the full address calculated from the yaml scopes.
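
A minimal sketch of the resolution step, assuming the collaborator that owns the enclosing yaml provides the scope; the helper and the exact ref layout are hypothetical.

package config

import "strings"

// resolveRef turns a relative ref such as "transformation/customer_overlap_count"
// into a full address ref like "/Research/transformation/customer_overlap_count",
// using the collaborator that owns the enclosing yaml as the scope.
func resolveRef(ref, collaborator string) string {
	if strings.HasPrefix(ref, "/") {
		return ref // already a full address
	}
	return "/" + collaborator + "/" + strings.TrimPrefix(ref, "./")
}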

ci: propose an automated way to generate changelogs

Description

This project strictly adheres to Conventional Commits, but we haven't added any mechanism that tracks changes in a changelog against a version. I will assign a version number per commit. I want a way to generate and update the changelog in a file called changelog.md before committing.
Give me a strategy through a script that takes the semantic version as input, or maybe use some existing tooling such as the one linked. Also add a CI workflow to make sure the changelog has been generated before the merge.

Are you raising this Issue under GSSoC'23?

Yes

feat(service): multiple collaborations enabled per ego-server.

Description

The current server implementation inside the feat.ego_server_run branch only supports a single collaboration. I want multiple collaborations to be queued, or even running on multiple threads at the same time. As all the sources are read-only, there should be no race conditions on any of the sources while reading.

How to resolve this issue?

Design a strategy to identify unique collaborations and their entities. One option is to prepend a CollaborationID to every AddressRef, like {collab_id}/{collaborator}/{address_type}/{address_name}. A collaboration keeps the same id as long as none of its addresses are modified. We can use a hashing strategy to detect changes inside the collaboration package and update ids accordingly.
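
A minimal sketch of deriving a stable CollaborationID by hashing the sorted address refs, so the id only changes when an address changes; the helper below is hypothetical.

package collaboration

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// CollaborationID hashes the sorted address refs so the id is independent of
// the order in which the addresses were declared.
func CollaborationID(addressRefs []string) string {
	refs := append([]string(nil), addressRefs...)
	sort.Strings(refs)
	sum := sha256.Sum256([]byte(strings.Join(refs, "\n")))
	return hex.EncodeToString(sum[:8]) // short, human-readable prefix
}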

feat: dp sql query Proof of Concept

Description

Confidential GoApps are great, but we want to create a better UX by incorporating SQL queries as transformations. You can look at the ZetaSQL differential privacy extension.

Ideally, what we want is a SQL engine that executes differentially privately.

feat: add an example of a join query with a group by but using confidential go app.

Description

The current use case only does a simple count query without any partitions. I want to add an example with partitions that properly demonstrates the use of the maxContributionsPerUsers option inside Google's dp definition.

How to do this?

Use the same media/advertiser/research data set for simplicity, although you are free to use your imagination for a different scenario; you will just have to generate your own datasets.

You can do a query like: what are the common customers, grouped by the kind of pet they have? So the output should be:
a private count of customers who have dogs, and a private count of customers who have cats.

In SQL terms, the query should look like:

SELECT
    ac.pets,
    COUNT(DISTINCT ac.email) AS count_common_customers
FROM
    media_customers mc
INNER JOIN
    airline_customers ac ON mc.email = ac.email
GROUP BY
    ac.pets;

So the end result should be a Confidential GoApp, with the appropriate yaml modifications, that functions exactly as the above query would. A rough non-private skeleton in Go follows.
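
For reference, a non-private skeleton of that query in Go, with inline stand-in data; in the actual Confidential GoApp the raw per-partition count would be replaced by a private distinct count from Google's differential-privacy library, bounded by the contribution limits from the yaml.

package main

import "fmt"

type customer struct {
	Email string
	Pets  string
}

func main() {
	// Stand-in rows; the real app would read the media and airline CSVs.
	media := []customer{{Email: "a@x.com"}, {Email: "b@x.com"}}
	airline := []customer{{Email: "a@x.com", Pets: "dogs"}, {Email: "c@x.com", Pets: "cats"}}

	// Inner join on email.
	inMedia := make(map[string]bool, len(media))
	for _, c := range media {
		inMedia[c.Email] = true
	}

	// Group overlapping customers by pet and count distinct emails per group.
	counts := make(map[string]map[string]bool)
	for _, c := range airline {
		if !inMedia[c.Email] {
			continue
		}
		if counts[c.Pets] == nil {
			counts[c.Pets] = make(map[string]bool)
		}
		counts[c.Pets][c.Email] = true
	}

	for pet, emails := range counts {
		// A private_count_distinct with Laplace noise would replace len(emails) here.
		fmt.Printf("%s: %d common customers\n", pet, len(emails))
	}
}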

Feat: Encrypted Data Migration from Postgres to EdgelessDB instance and Running Query on EdgelessDB instance.

Description

As we have decided to use Postgres instances as data sources, we want to pipe the data present in the Postgres instances through encrypted channels using TLS. Also, modify the contract yaml so collaborators can fill in the relevant decryption keys. Then run a simple query on EdgelessDB.

How are you planning to resolve this issue?

The Service interface should have a MigrateEncrypted() method that takes the necessary context parameters as input and populates the data on the EdgelessDB instance, which is exposed as a MySQL database on the interface.

The Service should also have a Run() method that takes a query as input. This Run() will later be modified to validate SQL compliance. A hypothetical sketch of the interface follows.
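
A hypothetical sketch (the parameter and return types below are assumptions, not the repository's actual design):

package service

import (
	"context"
	"database/sql"
)

// Service migrates collaborator data into the EdgelessDB (MySQL-compatible)
// instance over an encrypted channel and runs queries against it.
type Service interface {
	// MigrateEncrypted pipes the rows of a Postgres source table into the
	// EdgelessDB instance over TLS.
	MigrateEncrypted(ctx context.Context, source *sql.DB, target *sql.DB, table string) error

	// Run executes a query on the EdgelessDB instance; a later revision will
	// validate SQL compliance before execution.
	Run(ctx context.Context, query string) (*sql.Rows, error)
}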

References

Code generated using chatgpt: https://gist.github.com/qascade/ec04c90d1cd93a1a208a157b17deca16

feat: protocol to send destinations to collaborators allowed to access them.

Description

Look at the Solution 3 spec doc.
Assumption: We have validated/compiled the transformation and defined who is allowed to receive the generated output destination_tables.
We want an output folder populated with the destination_tables to be generated under the package of each collaborator who is entitled to receive the destination table.

So, create a server on the dcr side and another server that will go inside the container.

Problem 1: How do we send the transformation from the dcr-side server to the container server, so the container server can trigger it on the SQL engine and receive the result? (Medium)
Problem 2: Given the destination_tables received by the container server, how does the container server make sure they reach their respective destinations securely? (Hard)

ci: an actions config to test ego apps on github.

Description

For now, any tests that use ego to build, sign and run the Go app fail because we don't have any CI set up for it. I want to enable those tests in GitHub Actions and use them as a check before PR merge.

How are you planning to resolve this issue?

Write a CI yaml that lets us run tests that use confidential go apps. This will also help automate the testing of attested TLS at a later stage.
