diffix / reference

Reference implementation for the Open Diffix anonymization mechanism.

Home Page: https://www.open-diffix.org

License: Other

F# 100.00%
anonymization privacy-protection privacy-tools anonymize anonymisation

reference's Introduction

Open Diffix reference implementation

Purpose

This is the reference implementation of Open Diffix. It serves as a sandbox in which we can quickly try and validate new mechanisms for anonymization. The reference implementation is meant to offer anonymization quality matching that of production-ready code, though without the polish, usability, documentation, and performance work a commercial product would require. It can safely be used to anonymize data, but there will be rough edges.

Documentation

Once a mechanism is considered stable, it will be added to the pg_diffix project, using the reference code as a guideline. As such, all Diffix-related documentation present there should also apply to this project. Under-development features will not be documented.

Development process

The concepts implemented are often complex. Therefore, we do not skimp on code quality or legibility. Code going into the master and release branches is peer-reviewed and must pass the tests.

The tests rely on the presence of a test database. For more information on how to create it, please read the data/README.

Design considerations

We use SQLite as a dumb backing store for data. Dumb in the sense that we just read the data out of it and otherwise rely on our own code to perform the query execution. The reason for this is that it allows us to perform all the operations we would otherwise do in a database, but without the constraints of having to work around the complex and large codebase of a real world database engine.

Organization

The codebase is currently organized in a number of projects:

  • OpenDiffix.Core: Contains the meat of this project. It is the query and anonymization engine. The main public interface to it is the QueryEngine.run function.
  • OpenDiffix.CLI: A command line interface that can be used to exercise the reference implementation.

Branches

To avoid merge conflicts we work on feature branches. Once the automated tests pass, a branch can be reviewed and merged into master.

Each finished version of Diffix will have its own release branch. Code in master should be considered unstable; it represents the next, under-development version of Diffix.

Formatting

We use fantomas for formatting. It might be beneficial to have a pre-commit git hook like the following to ensure the code you commit to the repository has been formatted:

.git/hooks/pre-commit:

#!/bin/sh
git diff --cached --name-only --diff-filter=ACM -z | xargs -0 dotnet fantomas
git diff --cached --name-only --diff-filter=ACM -z | xargs -0 git add -p

Getting "Format on Save" to work correctly in VS Code is not a straightforward task. Existing formatting extensions either don't work at all, use an incompatible version of fantomas, or ignore the workspace settings. Here is a solution that produces results consistent with our CI checks:

  1. Install the right version of fantomas globally.
  2. Add the Custom Local Formatters extension to VS Code.
  3. Configure the extension by adding this entry to the customLocalFormatters.formatters array:
  {
      "command": "fantomas --stdout ${file}",
      "languages": [
          "fsharp"
      ]
  }
  4. Open the file .vscode/extensions/jkillian.custom-local-formatters-0.0.6/out/extension.js from your home folder and comment out this line:
(_a = process.stdin) === null || _a === void 0 ? void 0 : _a.write(originalDocumentText);

Creating a release

To generate an executable of the command line interface, run:

dotnet publish -r <win|linux|osx>-x64 -o build -c Release \
  /p:PublishSingleFile=true /p:PublishTrimmed=true /p:IncludeNativeLibrariesForSelfExtract=true \
  --self-contained true src/OpenDiffix.CLI/

If the build succeeds, the binary will be placed in the build folder. It is self-contained and can be moved anywhere you desire.

Using CLI

The reference implementation can be used through the provided command line interface offered as part of the solution. See "Creating a release" for more information on how to build the command line interface.

The -h flag gives you an overview of the available parameters. Typical usage should be achievable with one of the two following sample commands:

  • Run a single query: OpenDiffix.CLI -f data/data.sqlite --aid-columns customers.id -q "SELECT city, count(*) FROM customers GROUP BY city".
  • Run a batch of queries (significantly faster if you want to run many queries at one time): OpenDiffix.CLI --queries-path queries-sample.json. For an example of what the input file format should look like, please consult queries-sample.json.

In both cases the query result will be written back to the terminal (standard out).

reference's People

Contributors: cristianberneanu, edongashi, pdobacz, sebastian, yoid2000

reference's Issues

Align on functionality for v0

This issue takes a step back to align on which features we should support in a version 0, and why.

We are eventually targeting multiple levels of protection (lacking names for these versions as of yet):

  • (a) protecting against accidental disclosure
  • (b) very strong, but potentially not safe against a very malicious and resourceful attacker
  • (c) bullet proof, safe to open to the world

In terms of the complexity of the anonymization the versions don't follow an obvious progression.
(a) has the lowest implementation complexity and the fewest protections. Most things are allowed and no sophisticated protections are offered. Our system really doesn't care much what the user does.
In terms of overall complexity, (c) is likely next in line, as I believe it will offer a restricted subset of the functionality of (b) that we deem to be safe.

I believe whatever functionality we build for version (a) is also required in the other two versions. Therefore I think the goal should be to start with version (a) and do as follows:

  • build an early reference implementation of (a)
  • let people play with the reference implementation
  • if not useful then revert and think some more
  • if moderately useful then make a Postgres implementation of (a) that is actually usable in the wild
  • subsequently continue with (b) or (c)

Functionality I think should be part of a first iteration of (a):

  • anonymizing aggregates (let's start with count, sum and avg?)
  • low count filtering
  • support for multiple AIDs (this is a recurring issue and requirement for the people we talk to)

What we initially skip:

  • remaining aggregates
  • window functions

Thoughts? @diffix/developers, @yoid2000, @fjab

Support connection strings

We support SQLite databases today, and in the CLI we allow you to specify the path to the SQLite database using -d. This doesn't really scale well beyond supporting data from a single store.

Let's instead move to using a connection string.

We could use one of two formats.

  • Connection strings with an explicit schema, i.e. like csv://myfile.csv/?delimiter=,
  • Old school connection strings, i.e. like DRIVER=csv;PATH=myfile.csv;DELIMITER=,;

The command line interface could then be used like: cli --connection-string <connection string> or cli -c <connection string>.

This sort of scheme would also allow us to easily add future data providers, like Excel, Postgres, or anything else!

I'll leave the exact format of the connection string up to the implementor.
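
A minimal sketch of parsing the "old school" format into a key/value map; the function name and the case-insensitive key handling are assumptions for illustration, since the exact format is left to the implementor:

let parseConnectionString (input: string) : Map<string, string> =
  input.Split(';')
  |> Array.choose (fun part ->
    // Split each "KEY=value" pair on the first '=' only.
    match part.Split([| '=' |], 2) with
    | [| key; value |] when key.Trim() <> "" ->
        Some (key.Trim().ToUpperInvariant(), value.Trim())
    | _ -> None)
  |> Map.ofArray

// Example: parseConnectionString "DRIVER=csv;PATH=myfile.csv;DELIMITER=,;"
// => map [ "DELIMITER", ","; "DRIVER", "csv"; "PATH", "myfile.csv" ]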

Think through noise levels given LED

This is kind of a todo list for Paul.

I realized that I need to rethink noise, both in terms of amount (size of SD) and in how we do flattening. This is because with LED, a whole class of attacks is no longer possible (difference attacks). As a result, we might not need as much noise or flattening as previously.

Things to do:

  • Do intersection attacks still work given LED?
  • If so, how much noise needed to defend against intersection attacks?
  • How much noise is needed to defend against reconstruction attacks?
  • Are there reconstruction attacks specifically against heavy hitters?
  • If yes, how much flattening/noise needed to defend?
  • other?

Easy Diffix salt generation

Currently the salt for an Easy Diffix table is supplied as a CLI parameter.

I would like to have a default way of producing the salt, which can be overridden by the CLI parameter, but where the normal operation is to use the default.

The reason for this default approach is two-fold:

  1. To ensure that if the same table is loaded, then the same salt is used.
  2. To ensure that a different salt is used if a modified version of the table is loaded (i.e. an updated table).

I suggest the following approach:

  • Initialize a variable xor_result = 0
  • Loop through every cell in the table
    • For each cell, hash the value and XOR the hashed result into xor_result
  • Set salt to be the one-way hash of xor_result
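
A minimal F# sketch of this default salt, not a shipped implementation; hashCell and the treatment of cells as strings are assumptions for illustration:

open System
open System.Security.Cryptography
open System.Text

// Hash a single cell value (rendered as a string) down to 64 bits.
let hashCell (value: string) : uint64 =
  use sha = SHA256.Create()
  let bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(value))
  BitConverter.ToUInt64(bytes, 0)

let defaultSalt (rows: seq<string list>) : string =
  // XOR the hashes of every cell; the result is independent of row and cell order.
  let xorResult =
    rows
    |> Seq.collect id
    |> Seq.fold (fun acc cell -> acc ^^^ hashCell cell) 0UL
  // The salt is a one-way hash of the accumulated XOR.
  use sha = SHA256.Create()
  sha.ComputeHash(BitConverter.GetBytes(xorResult))
  |> Array.map (fun b -> sprintf "%02x" b)
  |> String.concat ""

This satisfies both goals above: loading the same table yields the same salt, while a modified table yields (with high probability) a different one.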

Diffix Publish v0.1 spec

Question: should this spec live here, or in the strategy doc, @fjab?

Diffix Publish v0.1 spec

This doc outlines the design of the first public open-diffix software release.

Over time the goal is to produce a Postgres extension with a range of anonymisation capabilities. These can be turned on or off. Different combinations of enabled protections make for product versions that satisfy varying use cases and requirements. Three design points we have in mind are Diffix Publish, Diffix Cloak, and Diffix Knox. These range from more relaxed to the strictest set of protections open-diffix is capable of offering.

Our initial release will target Diffix Publish.

Overarching design goal

Diffix Publish aims to allow the full set of query capabilities offered by Postgres. It should be safe to use to create reports for consumption by parties that do not themselves have direct influence over the queries generated. Use cases are those of a trusted analyst producing a report or dashboard for an audience that is not privileged to see the raw data.

Goal of v0.1

In the initial release of Diffix Publish, we will not strive for the full set of query capabilities Postgres offers. Instead, the goal is to restrict the product sufficiently to get something shipped sooner rather than later. The challenges of learning how to integrate with Postgres already constitute a good hurdle.

The goal is to have a version that includes enough features that a user can produce simple counts.

Features: Included/Excluded

We want Diffix Publish v0.1 to include support for the following features.

Anonymisation features

  • A single AID
  • Simplified, but sticky, noise layers.
  • We assume all tables contain sensitive data and reject queries for tables that lack an AID configuration.
  • Strict low count filtering (i.e. no dropping of columns to surface more data) that is sticky, i.e. the low count filtering does not change between repeat query runs.
  • Freestanding count aggregates (support for count(distinct aid) and count(*) that are neither nested within, nor themselves contain, other functions). No other aggregates are supported.

General features

  • The ability to configure which column is the AID-column.
  • The ability to lock a user/system into anonymisation mode.
  • There are no restrictions on the usage of non-aggregate functions. The only exception is that they cannot be used together with aggregate functions.
  • Simple WHERE-clause conditions are allowed.
  • Only support flat single table queries (no joins, no subqueries).

Notable exclusions compared to Aircloak Insights:

  • We do not produce * rows representing rows that were low count filtered.
  • We do not allow the other common aggregates such as min/max/sum/avg/etc. Not because we do not know how, but rather to limit the work we need to do.

What is a simple WHERE-clause condition?

Most WHERE-clause conditions are considered simple in the context of Diffix Publish. We allow unbounded ranges, complex functions, exclusion of rows. These are things we will need to keep an eye on in other versions of Diffix, but they are fine in Publish.
What we do disallow are WHERE-clause conditions that make use of subqueries. An example might be:

...
WHERE name in (
  SELECT names 
  FROM users 
  ORDER BY count(*) DESC 
  LIMIT 10
)

The reason is that we do not want to deal with subqueries at this time, nor, for that matter, with the difference between anonymising and non-anonymising queries.

Anonymisation

Noise layers

We take some shortcuts when it comes to the noise layers. Specifically, we do not take query semantics into account. We will address this shortcoming at a later time. It’s OK for Diffix Publish.

We generally seed our noise layers with:

  • A statically configured seed value
  • The set of AIDs in the bucket that we are producing noise for

The noise layers are zero mean gaussian values with a given standard deviation.
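
A minimal sketch of drawing one such noise value (a Box-Muller transform over System.Random); deriving the seed from the configured seed and the bucket's AID set is assumed to happen elsewhere:

// Zero-mean gaussian sample with standard deviation `sd`, from a seeded RNG.
let gaussianNoise (sd: float) (seed: int) : float =
  let rng = System.Random(seed)
  let u1 = 1.0 - rng.NextDouble() // keep u1 in (0, 1] so log is defined
  let u2 = rng.NextDouble()
  sd * sqrt (-2.0 * log u1) * cos (2.0 * System.Math.PI * u2)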

Limit values

Throughout, when selecting limits (such as the low count threshold), we uniformly pick a number between two bounds. Therefore a parameter such as low_count_threshold will really have to be specified as the pair low_count_threshold_low, low_count_threshold_high.

The sample from the uniform distribution is chosen with a random number generator seeded just like that of noise layers: using a combination of a preconfigured seed and the set of AIDs.
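
A minimal sketch of picking such a sticky limit value; seedFromAids is a hypothetical stand-in for however the seed is actually combined, and GetHashCode is used only for illustration (it is not stable across processes):

// Combine the configured seed with the (sorted) set of AIDs in the bucket.
let seedFromAids (configuredSeed: int) (aids: string list) : int =
  aids
  |> List.sort
  |> List.fold (fun acc aid -> acc * 31 + aid.GetHashCode()) configuredSeed

// Uniformly pick a value in [low, high]; the same seed always yields the same value,
// which is what makes the limit sticky across query reruns.
let limitValue (low: int) (high: int) (configuredSeed: int) (aids: string list) : int =
  let rng = System.Random(seedFromAids configuredSeed aids)
  rng.Next(low, high + 1)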

Count

While we might have gotten away with very simplistic count anonymisation for Diffix Publish, I have decided to use an AID based approach. The reasons are that this approach is likely to carry over to other versions of Open Diffix and that it puts a hurdle in our path that we need to cross sooner rather than later, namely the ability to track AIDs.

The count anonymisation roughly follows our UID-based anonymisation as it existed in Aircloak Insights. You can find the reference to the original design here.

count(*)

The process requires the following configuration parameters for the limit values used when suppressing outliers:

  • outlier_count
  • top_count

Procedure

In pseudo-code:

NCo = noisy limit value based on outlier_count
NCt = noisy limit value based on top_count
all_users = group data by user and sort descending by per-user count

_outliers, rest = split all_users on NCo
top_users, regular_users = split rest on NCt
outlier_compensation = (average of top_users) * NCo
final_count = sum of [regular_users, top_users] + outlier_compensation

In human language:

  • For each bucket, we produce a per-AID count.
  • We produce a noisy outlier count NCo using the outlier_count parameter.
  • The counts of the NCo AIDs with the highest count are dropped (to be replaced in the final step)
  • We produce a noisy top count NCt using the top_count parameter.
  • We calculate the average amongst the NCt users with the highest counts remaining after dropping the NCo users with the most extreme values.
  • The final count is the sum of the remaining counts (including the NCt top users) plus NCo times the average of the NCt top counts.
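
A minimal F# sketch of the procedure above, assuming the noisy limits NCo and NCt have already been drawn; names are illustrative, not the actual OpenDiffix.Core code:

let anonymizedCount (perAidCounts: int list) (nco: int) (nct: int) : float =
  let sorted = perAidCounts |> List.sortDescending
  // Drop the NCo most extreme contributions; they are compensated for below.
  let outliers, rest = List.splitAt (min nco sorted.Length) sorted
  // The NCt highest remaining counts form the top group.
  let topUsers = rest |> List.truncate nct
  let topAverage =
    if List.isEmpty topUsers then 0.0
    else topUsers |> List.map float |> List.average
  // Remaining counts (top group included) plus NCo times the top-group average.
  float (List.sum rest) + float (List.length outliers) * topAverage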

count(distinct AID)

The process requires the following configuration parameter, defined as the pair of a magnitude and a standard deviation:

  • distinct_count

Procedure

  • We calculate the number of distinct AIDs Cd.
  • The final count is Cd plus a noise layer based on distinct_count.

Low count filtering

The process requires the following limit value configuration parameters:

  • low_count_threshold

Procedure

We low count filter based on the number of distinct AIDs in a bucket. The threshold is sticky (it persists between query reruns) and is a limit value sampled from a uniform distribution as described in the Limit values section.

Rows that are low count filtered are removed from the resultset in their entirety.

We provide no global estimate of the number of rows suppressed, silently dropping rows. We can improve this in a later iteration of our tool.
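
A minimal sketch of the sticky decision, assuming the seed has already been derived from the configured seed and the bucket's distinct AID set as described in the Limit values section:

let isLowCount (distinctAidCount: int) (seed: int) (thresholdLow: int) (thresholdHigh: int) : bool =
  let rng = System.Random(seed)
  // The threshold is a limit value drawn uniformly from [thresholdLow, thresholdHigh].
  let threshold = rng.Next(thresholdLow, thresholdHigh + 1)
  distinctAidCount < threshold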

Discuss the high-level project structure

I think we need to establish a high-level design of the project, so we are all aligned regarding the road ahead and we can each start working on separate, but compatible pieces.

I'll start the discussion with a few topics that I consider important:

  1. Input data
    I don't like putting the data into a SQLite database and committing that into the repository.
    I propose that we instead write a small tool to generate the data according to some specification. We can then use the same generated data for development, automated testing and the deployed instance (same exact data would be preferable).
    We also need to settle on what additional configuration we'll require for exposing the data through diffix and include it beside the generated data.

  2. Repository structure
    We should have a library that contains the main logic for the reference implementation.
    Besides it, for development and testing, we create a CLI application that runs a single query on the default generated data.
    The deployed instance can be a separate web frontend, that executes a given query over generated data.

  3. Initial library structure
    Before taking the first steps into the actual implementation, we need to write a mockup of a real database engine: a parser, compiler, validator, normalizer, planner and executor. We do this by taking heavy inspiration from the inner workings of Postgres (but we distill each stage to the bare essentials). This should lead to us learning some lessons that will be applicable to the integration with an actual database.
    We can then decide which anonymization features we implement for version 1.0.

Seeding in multi-AID world

The current reference design does some processing per-AID. For example, it determines, per AID, the amount of flattening and what a representative top-group average would be for noise generation.

The random number generator is only seeded (at present) with the AID that is being operated on, rather than the full set of AIDs. More concretely, if you have AIDs email, ssn, and phone_number then each calculation operates on one, but not all, of the AIDs.

  • Investigate whether all AIDs should be used
  • Investigate whether using all AIDs opens up for attacks
  • Investigate whether not using all AIDs opens up for attacks

CLI: add schema discovery verb

The command line argument parser we are using supports subcommands.

It could make sense to split the CLI API into separate commands such as:

  • schema
  • anonymize

etc.

This way we don't overload the query command with schema detection, and we end up with an API that's easier to consume.

Exact API terminology up to implementor.

Both of the sub commands specified above would need to support the same connection string arguments.

Attack testing interface requirements

The interface for test attacks will need the following:

  • Ability to insert named tables that can be referred to in subsequent SQL queries
    • These tables are generally small: a few hundred lines, maybe a few thousand at most
    • Generally I'll simply use the same table name every time, so the table will be overwritten
  • Ability to query for Diffix parameters, including but not limited to:
    • Version number
    • LCF parameters
    • outlier and top parameters
    • noise parameters (SD)
    • eventually some list of anonymization features that are enabled/disabled
  • Ability to convey the system secret seed along with the query

Specify system class (Publish, Cloak, Knox) in query

In my tests, I'd like to automatically determine whether the attack should be expected to succeed or fail. At the moment it doesn't matter because we have only one version (Publish), but once we have more I'll need to know what it is I'm attacking.

For now, I'm going to put the attribute 'class' inside 'anonymization_parameters'.

The possible values will be 'publish', 'cloak', and 'knox'.

Let me know if this is going to break anything.

Not ready: Use all AIDs for seeding random number generators

Ref: #124
See: #129

With multiple AIDs we do a lot of repetitive work. For example for each AID individually we do flattening which requires us to generate extreme value and top group sizes. These random variables should come from random number generators that have been seeded with all AIDs, not just the single AID being processed.

Support more scalar functions.

Only a couple of scalar functions are currently supported.
More (like round, abs, cast, etc.) will soon be needed for the Easy Diffix tool.

Look into the "create NULL values with JOIN using count(*)" difference attack

Cristian raised the following difference attack:

I was under the impression this attack was already described in another discussion. But to be more explicit:

We have 2 tables: purchases, with columns product_id and client_id, and clients, with columns id and ssn.
Column ssn is labeled as the AID. Assume the following data is present:

Clients:

id ssn
1 a
2 b
3 c
4 d
5 e
6 f

Purchases:

product_id client_id
1 1
2 2
3 3
4 4
5 5
1 6
2 6
3 6
4 6
5 6

If we issue a query like:

select count(*) from purchases join clients on client_id = clients.id

we get the following input to the count aggregator:

count ssn
1 a
1 b
1 c
1 d
1 e
5 f

If Ne=1 and Nt = 2, this results in a flattening of 4. The output will be 10 - 4 + noise(1) = 6 + noise(1).

For the query:

select count(*) from purchases left join clients on client_id = clients.id and ssn <> 'f'

we get the following input to the count aggregator:

count ssn
1 a
1 b
1 c
1 d
1 e
5 NULL

The last row is put aside as unaccounted.
If Ne=1 and Nt = 2, this results in a flattening of 0 for accounted rows. The output will be 5 - 0 + 5 - 0 + noise(1) = 10 + noise(1).

The difference in the results is large enough to dominate the noise, allowing us to know the entity with ssn = 'f' is an outlier and to detect other attributes of it by conditional inclusion/exclusion in queries.

Validate whether all AIDs need to be non-null

In #118 (specifically in this commit: 437e4aa) rows are dropped where not all AIDs have a non-null value.

This maps onto the design we used previously in Insights Air where we dropped a value if the AID was null. This does have quite some side effects in the new design:

  • LEFT JOIN and RIGHT JOIN between sensitive tables effectively become INNER JOINs since the rows where one or the other side is empty are dropped. As a result we effectively only support INNER JOIN and CROSS JOINs for sensitive tables

  • OUTER JOINs only remain possible with non-sensitive tables

  • Investigate whether it's safe to have null-AID values

    • If yes, then update reference

Requirements for auto-generating tables for testing

Here are my requirements so far (for building attacks).

  • I need to create a separate table per attack. I'll eventually have some hundreds of attacks, so will have hundreds of tables. I do this in order to create a known attack condition that I then try to attack.
    • Currently I dynamically upload each table at the beginning of each attack. (Sebastian's current reference code allows this.)
  • Basic approach is that I build a "base" table, and then add or remove rows from the base table to create final table with the attack conditions.
  • For the base table, I specify column name, type, and the values that should go into the table.
    • If AID column, I'll say whether the AIDs are unique per row or whether the number of rows per AID follow some distribution (currently only the former is implemented)
    • If non-AID column, I list the distinct values that should go into that column.
  • The base table contains 10 rows for every combination of non-AID column values. (So it grows exponentially with the number of columns and values. But typically I only need a few columns, so this is not a problem.)
  • For deleting rows, I give a logical expression for which rows to delete ('c1 == 6 and 'c2 == 'v'). (In my current implementation, this is just a dataframe search expression.)
  • For adding rows, I state what each column value should be, either by stating the value explicitly, or stating a token that means "unique value" (something not already in the table).
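
A minimal F# sketch of building the base table as described, with unique per-row AIDs; the representation (a list of column-name-to-value maps) is an assumption for illustration:

let baseTable (columns: (string * string list) list) : Map<string, string> list =
  // Cartesian product of all listed non-AID column values.
  let combinations =
    columns
    |> List.fold
         (fun acc (name, values) ->
           [ for row in acc do
               for v in values do
                 yield Map.add name v row ])
         [ Map.empty ]
  combinations
  |> List.collect (List.replicate 10) // 10 rows per combination
  |> List.mapi (fun i row -> Map.add "aid" (string i) row) // unique AID per row

// Example: baseTable [ "c1", ["1"; "6"]; "c2", ["v"; "w"] ]
// => 40 rows (10 per combination), each with a unique "aid" value.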

Adapt flattening design to flatten both ends of a distribution

See this comment for more context: #142 (review)

TL;DR: It's not enough to just split positive vs negative numbers as currently written in the design.

Things that need to be considered:

  • What is done when there is a mix of positive and negative numbers within the extreme value group or top group (say the bottom end of a value range is -1, 1, and then 10000)?
  • What value is used for noise scale? Should it be the largest across all AID instances still? Or something else?

Multi-level aggregates with direct usage in top query

Consider the following queries, where an inner flattened aggregate is not aggregated again but is used directly.

SELECT 2 * cnt FROM (
  SELECT count(*) AS cnt
  FROM table 
) x

SELECT name FROM (
  SELECT name, count(*) AS cnt
  FROM table
  GROUP BY name
) x
WHERE cnt > 100

What do we do in this case?

Do we rewrite to something like this:

SELECT 2 * apply_noise(cnt, aid_set) FROM (
  SELECT flat_count(aid) AS cnt, aid_agg(aid) as aid_set
  FROM table 
) x

Or do we reject usage of inner aggregates in the top query?

Language of implementation

Let's agree on a language of implementation.
The current reference implementation as it exists is cheap and cheerful. We can throw it away and not much will be lost, so let's not take it into consideration for the purposes of choosing a language.

My preferences for features in the language of choice would be:

  • strong type system (makes expressing complex ideas easier, guards against implementation mistakes, and makes refactoring much simpler)
  • functional language (personal preference, as it tends to have less boilerplate, and lends itself to a natural way of expressing things)
  • immutable by default, mutable where needed

My preference: F#

My vote is for F#. It satisfies all the criteria above. The .NET 5 virtual machine is portable and performant. We have an incredibly large ecosystem of packages we can use through NuGet (all the packages from C#, etc.).

Other candidates

rust

Has a strong type system. Is performant and efficient. Seems much slower to work in. Better suited to a final implementation than to a reference implementation.

python / ruby / javascript / your choice of dynamic language

Large community and widely used. But dynamic, with none of the handholding a strongly typed language gives you.

C#

Large community and the same wealth of 3rd party libraries we would get with F#. The difference being that it lacks the stronger, functional-style type system and the benefits that provides.

C / C++

Same language we will likely have to use in a final implementation inside the DB.
In my book it is too low level, with too high a risk of making mistakes, to be suitable for a quick-moving reference implementation.

OCaml

F# is a derivative of OCaml, but further evolved, with proper industry backing, a significantly larger set of packages to use, and much better tooling support.

Haskell

Great type system, but a fucking pain to work with in practice, in my experience. I don't see it having any real benefits over F# in the context of what we are doing.


@cristianberneanu, @edongashi please chime in.

Make a distinction between analysis VARs and planner VARs

Analyzing/rewriting a query where we have to pinpoint a flattened index from query range is awkward.
It also assumes that the planner will preserve join order.

We could have two expressions: a VarReference which points to a range index / column index during analysis and a ColumnReference which points to the attribute's index in a plan tuple. For this change we'd also need to have a flat QueryRange and a separate JoinTree.

Quite a few changes have to be made to the analyzer and planner; as such I consider this low priority.

Use a RNG with an internal state of at least 64-bits.

The default .NET RNG can only be seeded with a 32-bit value (or is it 31 bits?). We need one that can be seeded with at least 64 bits.
We could use the provided secure RNG, but that is slow and only generates bytes, so we'd need custom functions to generate numbers (and we have to make sure they are not biased).
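
A minimal sketch of a suitable generator, SplitMix64, which keeps 64 bits of internal state and can be seeded directly with a 64-bit hash; this is an illustration, not a decision on which RNG to adopt:

type SplitMix64(seed: uint64) =
  let mutable state = seed

  // 64-bit output; unchecked arithmetic wraps around as intended.
  member _.NextUInt64() : uint64 =
    state <- state + 0x9E3779B97F4A7C15UL
    let mutable z = state
    z <- (z ^^^ (z >>> 30)) * 0xBF58476D1CE4E5B9UL
    z <- (z ^^^ (z >>> 27)) * 0x94D049BB133111EBUL
    z ^^^ (z >>> 31)

  // Unbiased double in [0, 1): take the top 53 bits of the output.
  member this.NextDouble() : float =
    float (this.NextUInt64() >>> 11) * (1.0 / 9007199254740992.0)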

Instrument query phases

We need to instrument the query phases and take runtime measurements. The latter is very useful because there are reports that the anonymization is slow, and we need to know where the bottleneck is.

Not ready: Ensure the analyst cannot learn from the order in which AIDs are processed

Given that we use all AIDs when seeding the random number generators and that random number generators are often stateful, we need to be careful to make it so that an analyst cannot influence the order in which the AIDs are processed, and therefore produce fresh noise samples (potentially resulting in different extreme and top group sizes!).

This could be done through:

  • ensuring the AIDs are processed in an order that is sorted by some internal AID identifier (internally used table.column name for example, or some object identifier in Postgres), or
  • ensuring that all per-AID processing uses an identical random number generator (which in turn would mean that all AIDs use the same top and extreme value group sizes for a given query and dataset)

question regarding postgres extension for "replaying" the query on certain rows

@edongashi this may be for you...

For LED, there is a point where we want to know which rows are flipped (go from included to excluded or vice versa) if a given condition is forced to be true or false.

One way to do this is to know exactly the role that the condition plays in the query logic (i.e. given A or (B and C), we know the boolean outcome of the complete expression if we know the boolean outcome of each condition).

Another way would be to have the example in the truth table. So, for instance, suppose we obtain this truth table from query execution:

     A      B      C      outcome
C1   false  -      -      false
C2   true   true   -      true
C3   true   false  true   true
C4   true   false  false  false

and we want to know what happens to combination C3 if we force C = false. In this case, we can see that then it would be equivalent to C4, and so the outcome is false.

But what if none of the rows matched C4? Then it would not be in the generated truth table, and we would not know the outcome of C4.

So, what we'd need to be able to do is replay the query execution where the conditions match C3, and see if the outcome is true or false.

My question is, is this possible?

Or, more generally, is there an easier way to build the truth table without knowing the condition logic explicitly?

Add additional aggregators.

The Easy Diffix tool will soon need support for additional aggregators (min, max, sum) in the reference implementation.

How do we handle multiple AIDs per row

This is mostly a note that results from thinking through low effect detection.
It might turn out to be superfluous, but I wanted to have it recorded in case it's not.


In our new design we aim to make the query writing experience more natural. For example, we allow a mix of per-user aggregates and cross-user aggregates in different parts of a query, without forcing anonymization to take place.

When a cross-user aggregate takes place, we need to account for the fact that the resulting rows pertain to multiple individuals, and for what each user's contribution is in a given row.

I believe this is also the case when not dealing with aggregates.

While we in Aircloak Insights ensured that any single non-aggregate row belonged to a single identifiable user,
and did so by requiring a JOIN condition on the AID column, we do not enforce this in Open Diffix.

Take the following query as an example of what should be allowed:

SELECT height, amount
FROM (
  SELECT height
  FROM users
) t, (
  SELECT amount
  FROM transactions
) s

The cross join ensures that each row is associated with one or two AIDs.
I.e. the resulting table might be something like the following, even though no aggregation has taken place.

Height Amount AID contribution
183 200 [aid1]
181 200 [aid1; aid2]
183 200 [aid2; aid3]
199 111 [aid3]

I wonder if we shouldn't get into the mindset of always thinking of rows as belonging to one or more users, whether or not the data is aggregated.

We could easily extend the AID contribution information to contain a per-user count (here denoted by a tuple of the form aid, number of occurrences):

Height Amount AID contribution
183 200 [aid1, 1]
181 200 [aid1, 1; aid2, 1]
183 200 [aid2, 1; aid3, 1]
199 111 [aid3, 1]

which would map perfectly onto the result of a subquery that does aggregation as well, which could yield a table such as:

Height Amount AID contribution
183 200 [aid1, 3]
181 200 [aid1, 2; aid2, 1]
183 200 [aid2, 1; aid3, 3]
199 111 [aid3, 10]

Please note that in all this I am not thinking about or considering the support for different AID classes. These are all problems we need a solution for, even in the base case of our system only providing support for a single AID type per database!
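
A minimal sketch of the bookkeeping described above: per-row AID contributions kept as (aid, occurrences) pairs that get merged whenever two rows are combined; the types are illustrative, not the actual reference code:

type AidContribution = Map<string, int>

// Merge the contributions of two joined rows by summing per-AID occurrence counts.
let mergeContributions (left: AidContribution) (right: AidContribution) : AidContribution =
  right
  |> Map.fold (fun acc aid count ->
       let existing = acc |> Map.tryFind aid |> Option.defaultValue 0
       Map.add aid (existing + count) acc) left

// Example: mergeContributions (Map.ofList [ "aid1", 1 ]) (Map.ofList [ "aid1", 2; "aid2", 1 ])
// => map [ "aid1", 3; "aid2", 1 ]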

The map visitor utils should not be recursive

The node mutators NodeUtils.map are confusing because they try to be recursive. As we will soon see, we want to control the process of visiting child nodes depending on context. Subqueries are an example where we don't want the same logic applied recursively. I ran into difficulties with this while trying to validate queries and subqueries.

Because of the above reasons, map should visit only immediate children of a node. The callback can easily call map again with itself to make it work the same as before.

Postgres also does this and I think it's more flexible.
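
A minimal sketch of the proposed shape; Expression and mapChildren are illustrative stand-ins, not the actual NodeUtils types:

type Expression =
  | Constant of int
  | Column of string
  | Function of string * Expression list

// map visits only the immediate children of a node...
let mapChildren (f: Expression -> Expression) (node: Expression) : Expression =
  match node with
  | Constant _ | Column _ -> node
  | Function (name, args) -> Function (name, List.map f args)

// ...and the callback decides whether (and how) to recurse, so context such as
// subqueries can stop or alter the traversal.
let rec stripColumns (node: Expression) : Expression =
  match node with
  | Column _ -> Constant 0
  | other -> mapChildren stripColumns other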

Add text to LCF spec regarding pre-configured parameter settings

We should have a subsection early on stating what the appropriate parameter settings are for different levels of protection, as well as default minimum values (i.e. prevent lower= or sd=0 type settings).

Don't forget to mention the relationship between LCF parameters and other features like per-column select disable (if any).

Also determine best conceptual parameter settings for admin (i.e. lower and mean).
