
ipa's People

Contributors

benjaminsavage, bmcase, csharrison, danielmasny, eriktaubeneck, martinthomson, richajaindce


ipa's Issues

Proposal: Extending IPA to support advanced reach reporting

tl;dr: R&F measurement is important for brand advertising, and it seems like IPA can be extended to support this use case. Lots of trade-offs to consider, but definitely worth a discussion/deep-dive.

Context: Brand advertisers are a significant part of the digital advertising ecosystem, and accurate assessment of the reach & frequency of ads has always been prioritized by advertisers, large and small, that focus on brand building. In addition to measuring R&F, advertisers also want to manage the frequency of ads, to ensure ads are seen at least M times, but not more than N times. This type of measurement and management has most often required access to individual identifiers (facilitated by 3rd-party cookies and device identifiers), mainly because de-duplication happens across devices/platforms/media and a persistent identifier makes count(distinct) queries easy. However, to truly protect user privacy on the web while still providing utility to brand advertisers, these measurements can be enabled in aggregate: R&F is an aggregate measure anyway, and frequency management done in aggregate can be quite effective.

Progress to-date: A group of advertisers, in collaboration with trade groups, media publishers and broadcasters, has been working on a cross-media measurement solution that intends to enable privacy-forward reach & frequency measurement. The proposed architecture essentially has 3 parts:

  1. A service that takes in ad impression information and produces a statistical label (the labels on their own don't mean anything and don't relate to an individual, but labels can be analyzed together to create accurate reach & frequency measurement).
  2. A service that takes these statistical labels and converts them to encrypted sketches (like Bloom filters, these sketches preserve some notion of the underlying database without having to store each individual impression, thereby significantly reducing storage and compute needs).
  3. A 3+ party MPC that takes encrypted sketches and produces a differentially private aggregate.

One thing to note is that it is preferable to perform some of the steps outlined above (especially step 1) on-device, to be able to scale to all publishers on the web.
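For intuition, here is a toy TypeScript sketch (not part of the proposal) of how a Bloom-filter-style sketch can estimate distinct reach without storing individual impressions. The hash function and estimator are purely illustrative; real deployments would use keyed hashes and encrypted sketches.

// Illustrative only: a Bloom-filter-style sketch for estimating distinct reach.
class ReachSketch {
  private bits: Uint8Array;
  constructor(private m: number, private k: number) {
    this.bits = new Uint8Array(m);
  }
  // Toy hash; a real sketch would use a keyed cryptographic hash.
  private hash(label: string, seed: number): number {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < label.length; i++) {
      h = Math.imul(h ^ label.charCodeAt(i), 16777619);
    }
    return (h >>> 0) % this.m;
  }
  add(label: string): void {
    for (let i = 0; i < this.k; i++) this.bits[this.hash(label, i)] = 1;
  }
  // Standard Bloom-filter cardinality estimate: n ~ -(m/k) * ln(1 - X/m),
  // where X is the number of set bits.
  estimateReach(): number {
    const set = this.bits.reduce((a: number, b: number) => a + b, 0);
    return -(this.m / this.k) * Math.log(1 - set / this.m);
  }
}

// Duplicate labels (the same person seen on multiple devices) set the same
// bits, so the estimate approximates *distinct* reach.
const sketch = new ReachSketch(1 << 16, 4);
["label-A", "label-B", "label-A"].forEach((l) => sketch.add(l));
console.log(sketch.estimateReach()); // ~2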

Overlap/Synergy with IPA:
The team working on R&F recently requested that Chrome support this functionality using the shared storage method. Essentially, the ask is for browsers to enable a client-side mechanism to generate encrypted statistical labels that can be used in a downstream secure computation (using 3+ party MPC). The first part of IPA, which is setting an encrypted match key client-side, can be very helpful for generating the statistical label (step 1 above). With this enabled, the R&F solution doesn't have to rely on any impression-level identifiers on-device, and can instead leverage the browser/device-generated key as the input for generating the statistical labels. For web platforms, this allows them to solve two large use cases (R&F and conversion attribution) using the same mechanism. There is also a lot of commonality in terms of the MPC consortium (helper party network) and differentially private aggregate outputs. I think it would be valuable for the team to think through this use case and have a discussion on what it would take for IPA to also enable the R&F use case.

How do Privacy Budgets work in IPA?

with Ben Savage and Martin Thomson

Summary:

  • This document explains how privacy budgets are allocated and managed in IPA.
  • For the cases of Self-Attributing Publishers, Self-Attributing Advertisers, and Mobile Measurement Partners (MMPs), the solution is fairly well established; for the case of Ad Networks we are still in the exploration phase and would like to facilitate more discussion and ideation.
  • We would like to invite feedback and additional suggestions on designs for how privacy budgets could work for Ad Networks.

Private Scope in IPA

Privacy budgets are a new thing for the web. In the world of 3rd-party cookies, sites learn information with full confidence about specific individuals' activity on other sites. Applying differential privacy to what sites learn from IPA queries lets us limit how much information a site can learn about any specific individual.

The goal of IPA is that each site has a budget per epoch on the amount of information that can be learned about people who interacted with the site during that epoch. Our best approximation of a “person” in the IPA system is the matchkey, so more specifically IPA proposes that each site has a budget per epoch on the amount of information that can be learned about a given matchkey’s interaction with the site during that epoch.

At a high level, IPA allows sites to request encrypted matchkeys for source or trigger events occurring on their site. Sites can then add attributes to these reports (e.g., values to trigger events and breakdown keys to source events) and share them with other sites, also called Report Collectors. At any time, a Report Collector can take a batch of source and trigger reports and submit a query to the MPC to get an attribution measurement on these events. With each query, the Report Collector must specify how much of its per-epoch budget it wants to spend on that particular query. The Report Collector also specifies, for each query, the per-matchkey sensitivity cap to be enforced by the MPC. Together, the cap and the budget allocated to the query determine the parameters of the noise applied to the outputs, such that the information released about each matchkey is at most the query's budget.
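To make the relationship between the cap and the budget concrete, here is a minimal sketch assuming a simple Laplace mechanism; IPA does not fix the noise distribution in this document, so the function names and formula below are illustrative only.

// Illustrative: derive the Laplace noise scale from a query's sensitivity cap
// and allocated epsilon budget (classic Laplace mechanism, b = cap / eps).
function laplaceScale(sensitivityCap: number, epsilonBudget: number): number {
  return sensitivityCap / epsilonBudget;
}

// Inverse-CDF sampling from Laplace(0, scale).
function sampleLaplace(scale: number): number {
  const u = Math.random() - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// A query that caps each matchkey's contribution at 5 and spends eps = 0.5 of
// the site's per-epoch budget adds Laplace noise with scale 10 to each output.
const trueAggregate = 1234; // stand-in for one capped, aggregated output value
const noisyAggregate = trueAggregate + sampleLaplace(laplaceScale(5, 0.5));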

Queries for different types of Report Collectors

There are different types of Report Collectors who will need to submit IPA queries. See our What is a “Report Collector?” explainer for more details, but the main classes we are working to support are Self-Attributing Publishers, Self-Attributing Advertisers, Ad Networks, and MMPs.

Source fan-out queries for Self-Attributing Publishers

A Self-Attributing Publisher is a site that runs its own ads and collects source reports for them. It also collects trigger reports from Advertiser websites/apps. IPA enables these sites to submit source fan-out queries, which consist of source reports from only that source site along with trigger reports from any number of Advertiser sites.

The budget to be spent on a source fan-out query is deducted from the source site's per-epoch budgets but not from the budgets of the trigger sites. More specifically, if a source fan-out query has source reports from multiple epochs, the source site's budget for each of those epochs is reduced by the amount to be spent on that query.

The Helper Parties who run the MPC queries are also responsible for enforcing the privacy budgets. They are responsible for checking several things about each submitted query. Recall that the encrypted matchkeys carry authenticated associated data containing the site that requested the encrypted matchkey, the epoch when it was requested, and whether it was requested for a source or trigger event.

For source fan-out queries, the Helper Parties check that

  1. All the source reports in the query correspond to the Report Collector that is submitting the source fan-out query.
  2. All the source reports in the query come from the set of epochs specified in the source fan-out query.
  3. The Report Collector has available budget for every epoch that was specified in the query.

[Figure: source fan-out query]

Trigger fan-out queries for Self-Attributing Advertisers

A Self-Attributing Advertiser is an advertiser site that is large enough to perform its own ad-measurement in-house. They collect source reports from the publishers they buy ads from. IPA enables these trigger sites to submit trigger fan-out queries, which consist of trigger reports from only that trigger site along with source reports from any number of publisher sites or ad networks.

The budget to be spent on a trigger fan-out query is deducted from the trigger site's per-epoch budgets but not from the budgets of the source sites. If a trigger fan-out query has trigger reports from multiple epochs, the trigger site's budget for each of those epochs is reduced by the amount to be spent on that query. In practice, trigger fan-out queries likely just include reports from the most recent one or two epochs; source fan-out queries might look back several epochs for longer attribution windows.

For trigger fan-out queries, the Helper Parties check that

  1. All the trigger reports in the query correspond to the Report Collector that is submitting the trigger fan-out query.
  2. All the trigger reports in the query come from the set of epochs specified in the trigger fan-out query.
  3. The Report Collector has available budget for every epoch that was specified in the query.

[Figure: trigger fan-out query]

Queries for MMPs

“Mobile Measurement Partners” or MMPs are another example of a current “Report Collector”. They help advertisers perform conversion attribution queries across multiple publishers / ad-networks, and have the ability to perform cross-publisher attribution (including multi-touch attribution). In IPA, MMPs run trigger fan-out queries on behalf of Advertiser Apps / Websites. This is nearly identical to the case of Self-Attributing Advertisers, with the only difference being that the responsibility of running queries has been delegated to the MMP. The Advertiser App/Website enables the MMP to submit IPA queries on its behalf and spend its privacy budget.

An MMP that is a service provider for many Advertisers won't be able to combine budgets from multiple advertisers. It will see and spend from the budgets of all its different trigger sites, with separate trigger fan-out queries for each.

[Figure: MMP queries]

Queries for Ad Networks

Ad Networks show ads across a large number of publisher apps / websites on behalf of many Advertiser apps / websites. They will need to collect reports about source and trigger events in order to submit IPA queries.

[Figure: Ad Network queries]

We are still exploring what the best options are for supporting privacy budgets for Ad Networks. We are considering two design proposals right now but would be open to additional constructions that would give good privacy protections for end-users.

  1. Design Proposal 1: Ad Networks have to work within the above structure of budgets being given to sites, but have the additional ability to run source fan-out queries that contain source reports from many source sites. In order to run a query with multiple source sites, the budgets of all the included source sites would need to be spent.
  2. Design Proposal 2: Ad Networks themselves get budgets allocated to them and sites make commitments to only allow a certain set of Ad Networks to run queries with reports generated on their site.

Design Proposal 1 (no custom support added for Ad Networks)

In this proposal, Ad Networks have to work (for the most part) within the earlier constraints of running source and trigger fan-out queries on behalf of the websites they work with. However, in the earlier setting of the Self-Attributing Publisher, the helpers verified that all source events originated from the same source site. This is not possible for ad networks, as impressions are shown across many sites. To support source fan-out queries involving source events shown across many source sites, we could imagine adding support for source queries across multiple sites, and simply deducting from the privacy budget of all included sites.

Since sites generally work with many Ad Networks, this would lead to sites needing to delegate partial amounts of their budget to the different Ad Networks they work with. How might sites delegate their budgets?

  1. One idea would be to delegate in proportion to the number of ads shown by a particular Ad Network. However, if an Ad Network only showed one ad on a site and got a tiny fraction of the budget, it wouldn't be able to include that report in a query with reports from sites that delegated larger budgets. (Since the query deducts equally from all budgets, the minimum budget available becomes the query budget; see the sketch after this list.)
  2. A second idea would be to standardize a maximum number of Ad Networks a site would work with and then assign them all the same budget from that site. This would make it easier for Ad Networks to run source fan-out queries across many sites.
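To make the accounting in idea 1 concrete, here is a small hypothetical sketch of the deduction rule; the types and names are illustrative only.

// Hypothetical Design Proposal 1 accounting: a multi-site source fan-out
// query can only spend what every included site has delegated, and that
// same amount is deducted from each of them.
function spendOnMultiSiteQuery(
  delegatedBudgets: Map<string, number>, // site -> budget delegated to this Ad Network
  requestedEpsilon: number
): number {
  const minBudget = Math.min(...delegatedBudgets.values());
  const spend = Math.min(requestedEpsilon, minBudget);
  for (const site of delegatedBudgets.keys()) {
    delegatedBudgets.set(site, delegatedBudgets.get(site)! - spend);
  }
  return spend; // the effective epsilon available to this query
}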

[Figure: Design Proposal 1]

In summary, partitioning privacy budgets across multiple ad networks would be very complex to manage. Worst case, it could push the ecosystem towards consolidation.

Design Proposal 2 (separate budgets for Ad Networks)

We consider an additional way of supporting Ad Network budgets. Instead of fixing a budget per site and letting it be delegated to Ad Networks, we have considered allowing each source site to delegate to a limited number of ad networks, each of which would have a constant-sized, cross-web privacy budget.

In this design, the total privacy loss is proportional to the number of ad networks the user is exposed to, rather than the number of sites they visit. For a user who visits relatively few sites, this could be worse; but for users who visit a modest number of sites, the set of ad networks they are exposed to could be smaller than the number of sites. The privacy loss in that case would be reduced, and might not increase further as the user visits more sites (assuming those sites have delegated to the same set of ad networks the user previously encountered).

Assumptions:

  • A site would choose to delegate to ad networks and not run its own IPA queries. (Otherwise this would purely increase DP leakage.)
  • There are controls on who can register with the Helper Parties as an Ad Network, to avoid per-site ad networks, which would increase privacy loss.
  • There would still need to be a limit on the number of Ad Networks that could operate on a site.

This proposal would essentially reduce the Ad Network case to the same situation as the Self-Attributing publisher case:

  • A single privacy budget against which all queries are spent, which is a fixed value irrespective of the number of ad impressions served.
  • Full flexibility for how source-events are broken down via breakdown keys since all source events can be combined into a single query.

In order to implement this second design, we would need the browser to bind the source reports to the ad network that is displaying the ad on the publisher’s website, in addition to the publisher’s site. To do this we would need the following:

  1. Sites would commit to using a particular set of Ad Networks in a way the Helper Parties can verify.
  2. The getencryptedmatchkey() API will have an additional boolean parameter, delegated, which if false tells the browser to bind the report to only the top-level domain. If true, the report will be bound to both the top-level domain and the current (frame) context (here we assume the ads shown by Ad Networks are in iframes that correspond to a domain operated by that Ad Network).
  3. When the Ad Network calls the getencryptedmatchkey() API in the iframe of the ad, it will supply the delegated parameter as true, and the browser will create the report and bind it to both the site and the Ad Network (see the sketch after this list).
    1. If the Ad Network calls the API with delegated set to false, it gets back a report bound only to the top-level site. Since this top-level site is one which has decided to delegate queries, any report bound only to this site will be rejected by the Helpers and will never leak any information about the user.
  4. When the Ad Network submits a query with this report, the Helper Parties will verify that the site has committed to delegating to this Ad Network.
    1. This way, if reports are transmitted out of band to an Ad Network the site has not committed to delegate to, that network will not be able to use them.
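To illustrate steps 2-4, here is a hypothetical TypeScript sketch of an Ad Network script calling the API from its iframe. The function name, option shape, and collection endpoint follow the prose above and are not a finalized API.

// Assumed, per the prose above; not a real, finalized browser API.
declare function getencryptedmatchkey(options: {
  eventType: "source" | "trigger";
  delegated: boolean;
}): Promise<unknown>;

async function recordSourceEvent(): Promise<void> {
  // Called from within the Ad Network's iframe on the publisher's page.
  const report = await getencryptedmatchkey({
    eventType: "source",
    delegated: true, // bind to BOTH the top-level site and this Ad Network
  });
  // Attach attributes and forward to the Ad Network's report collector
  // (the endpoint is hypothetical).
  await fetch("https://reports.ad-network.example/collect", {
    method: "POST",
    body: JSON.stringify({ report, breakdownKey: 7 }),
  });
}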

Comparison of Ad Network Designs

The following figure illustrates a comparison of privacy budgets between the two designs.

[Figure: comparison of privacy budgets between the two designs]

Here is a table that compares the main two designs considered so far.

[Table: comparison of the two designs (not reproduced here)]

Open Questions for discussion:

  • How many Ad Networks per site are needed? How to ensure that a user wouldn’t be exposed to too many of them as they browse the web?
  • What about publishers who serve a mixture of their own ads, as well as some ads from Ad Networks?
  • How do people feel about the impact on total information leakage, and how it varies based on the number of sites a person visits (or the number of ad networks to which they are exposed)?
  • What mechanisms can we employ to prevent multiple, duplicate ad-networks that are in fact operated by the same entity? How about preventing per-site ad networks?

Clarification on epoch and encrypted match keys binding

In #11, @benjaminsavage described the lookback attribution process as follows:

For "Trigger Fanout Queries", ... assuming we have attestation for the "epoch", we could simply ask the helpers to validate that all of the trigger events come from the epoch that was specified in the query. It would be OK to allow "source events" from preceding epochs (within some range - that would be limited by how often the helpers rotate their encryption keys, TBD.

Also, as the IPA end-to-end document describes,

An encrypted match key is bound both to the site that requested it and to the epoch in which it was generated.

It is, however, fairly common for advertisers to run their ad campaigns continuously for weeks or even months. This implies that conversions on a trigger website happen across epochs, so the trigger events included in a trigger fanout query can come from different epochs, just like the source events. The same is true for source fanout queries.

We are currently assuming that the encrypted match key is bound to the epoch in which it was generated and can't be used outside that. There were discussions of epochs in the context of DP, but it was unclear to me how this impacts the cross-epoch attribution scenario.

I imagined two options.

  1. We enforce a trigger fanout query to contain trigger events from one epoch only. Report collectors are only allowed to generate per-epoch attribution reports.
  2. Allow a trigger fanout query to include trigger events (as well as source events) from multiple epochs. This raises other issues to be discussed, such as: a) how do helper parties store past encryption keys? and b) how do helper parties know which key to use to decrypt the match keys (send the epoch in the clear, or utilize HPKE as @dominiccooney mentioned in #5)?

Consider a randomized-response-like privacy mechanism

I want to propose a new privacy mechanism to IPA that:

  • Serves to help solve the optimization use-case
  • Does so under the existing differential privacy constraints of IPA as I understand them

This mechanism is inspired by the following research papers:

This research operates in the “label DP” setting, where model features are publicly known, but labels are not. This matches the setting of IPA well, because the report collector has full information about the raw impression event (and thus the corresponding feature vector), but the label (e.g. whether the impression led to a conversion) is protected with differential privacy. While these techniques are in the “local” differential privacy regime, they (somewhat surprisingly) perform close to the state of the art in private model training depending on the task.

Here is an outline of how we could implement one of the algorithms described in that work, RROnBins, in the IPA setting:

  • For every per-source breakdown key, pass along a list of possible output bins. For example, a bucketized range of values like {[0, 10], [11, 100], [101+]}
  • After aggregation*, rather than applying a fixed noise distribution like Gaussian to the sum, perform k-ary Randomized Response on the specified k output bins for that breakdown key. That is, with probability $p = k/(k - 1 + e^\epsilon)$, pick a bin at random; otherwise pick the correct bin the aggregate falls in. This mechanism satisfies $\epsilon$-differential privacy (see the sketch after this list).
  • Note: the mechanism can also be extended to do randomized response over a restricted set of trigger-side breakdown keys if those become supported by IPA.

*Note: for practical purposes, we would likely need to support a unique breakdown key per source to take advantage of this research, i.e. "aggregate" over only single sources. While IPA currently only mentions "aggregate" queries, as far as I can tell there are no restrictions in the existing design preventing aggregation over single events (i.e. a unique breakdown key per source), as long as the output is protected by differential privacy. The mechanism described above offers the same protection.
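A minimal TypeScript sketch of the k-ary randomized response step described above, using the stated probability; this is illustrative, not IPA helper code.

// Sketch of k-ary randomized response over the k output bins for a
// breakdown key, matching the probabilities stated above.
function kRandomizedResponse(trueBin: number, k: number, epsilon: number): number {
  const p = k / (k - 1 + Math.exp(epsilon));
  if (Math.random() < p) {
    // With probability p, pick one of the k bins uniformly at random.
    return Math.floor(Math.random() * k);
  }
  // Otherwise report the bin the aggregate actually falls in.
  return trueBin;
}

// Example: the aggregate falls in bin 1 of {[0, 10], [11, 100], [101+]}, so k = 3.
// Pr[report = true bin] = e^eps / (k - 1 + e^eps), which gives eps-DP.
const reported = kRandomizedResponse(1, 3, Math.log(3));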

For an overview of how we’re thinking about these capabilities in ARA, see the flexible event-level explainer we recently published for ARA.

Clarification on oblivious sorting and attribution

Hey,
In the end-to-end protocol document it states:

Our oblivious attribution algorithm operates on the sorted list of secret-shared values (described above) and has O(log N) iterations where N is the list size.

I'm probably missing something, but can you explain how you plan to sort a secret-shared list? Are you planning to use an order-preserving encryption scheme? You mentioned that the sorting algorithm is not defined yet, but do you have something in mind?

We definitely don't want to store match keys in the clear, but maybe some other information could be. Are you planning to run your attribution algorithm using unencrypted helper bits, stop bits, trigger bits and credits?

Thank you.

Consider annotating `getMatchKey` calls with an event type

We should consider asking callers to annotate calls to getMatchKey with a type, e.g. source or trigger. This clarifies the use of the API and can help drive UX and UA-specific policy that might differ across event types (e.g. permission prompts, user gesture gating, etc.).

This may also end up being useful if we need to cap the sensitivity of the number of source and trigger events from the client, e.g. to support something like patcg/meetings#97, where the sensitivities of sources and triggers are different.

Note: we may want to have more specific types beyond source / trigger e.g. click, view, ghost-view to more clearly specify the intended event type.

Encrypted match keys current site may conflict with fenced frames

The end-to-end doc talks about get_encrypted_match_key and "the report collector (acting on behalf of a specific source/trigger website/app)" and reports containing "current_website/app."

My understanding is that "current website" means, approximately, the site displayed in the URL bar; a common use case would be a script from an ad network, running in an iframe displaying an ad, calling get_encrypted_match_key to generate an impression event which includes the top-level site.

There is a proposal for an iframe-like thing, Fenced Frames, with many use cases related to ads. Many Fenced Frames goals hinge on cutting communication between the fenced frame and the containing frame. For example, in Fenced Frames issue #38, @shivanigithub said:

[sandboxed iframes] can still have communication with the top-level frame via window.top, window.parent, joint history, post message (I think) etc. and fenced frames do not have any of those.

(Emphasis mine.) Digging through the Chromium source in lieu of a more detailed spec, blink::Frame::Parent(FrameTreeBoundary::kFenced) does not cross the fenced frame boundary and this is what window.top and location.ancestorOrigins use for their walks. Thus it appears that Fenced Frames intend to stop the framed content from learning the top-level site identity.

This potentially puts IPA and Fenced Frames into conflict: IPA plans to share the top-level site with the framed content; Fenced Frames plans to hide it.

Here are a couple of potential paths forward:

First, Fenced Frames could change to not hide the top-level site. I'm not acutely aware of the tradeoffs fenced frames are making, and I understand their requirement to limit cross-domain communication, but it seems like a serious defect that ads cannot detect what site they are appearing on. It is reasonable and typical for advertisers to want to verify they are not displayed on adult sites.

Second, IPA could share the top-level site, but only with the helpers. As written, this is what get_encrypted_match_key suggests, but this interaction with Fenced Frames might be a good issue to flag for the TAG. Naturally there may also be privacy budget implications if people were forced to use fenced frames widely. For example, a report collector could no longer independently discard event spam from unrelated origins. (But the helper network could.)

Note, in #5 @martinthomson suggested moving some repetitive fields, like current site, out of the encrypted report and into the HPKE associated data to make batched reports smaller. This is a good idea. However this requires the callee to know the plaintext top-level site.

What are the interfaces for the setMatchKey and getEncryptedMatchKey APIs?

This PR tries to define the interfaces of the two client-side APIs: getEncryptedMatchKey and getHelperNetworks. These are the interfaces user agents (like Chromium, Gecko, WebKit, etc.) would expose to developers.

Proposed API surface

[SecureContext]
interface PrivateAttribution {
  // Retrieve an encryption of a secret-sharing of the match key that was previously set by a 
  // given match key provider. This encryption will be specific to the site (i.e. registrable domain) 
  // at the time when this function is invoked
  Promise<PrivateAttributionEncryptedMatchKey> getEncryptedMatchKey(
     // Report collector origin who is collecting the events
     required DOMString reportCollector, 

      // other arguments which are needed to generate encryptions
     required PrivateAttributionOptions options
  );
  
  // Get the set of helpers that the browser is willing to encrypt match keys towards.
  Promise<sequence<PrivateAttributionNetwork>>
    getHelperNetworks();
};

dictionary PrivateAttributionOptions {
  // Is this a source or trigger event
  required PrivateAttributionEventType eventType;

  // Which helper party network do you want to encrypt towards
  required PrivateAttributionNetwork helperNetwork;

  // TODO add more parameters, such as
  // binding to a report collector for partial budget allocation.
  // See: https://github.com/patcg/docs-and-reports/tree/main/threat-model#33-privacy-parameters
};

interface PrivateAttributionNetwork {
  sequence<DOMString> origins;
};

enum PrivateAttributionEventType {
  "source",
  "trigger",
}

[Exposed=Window]
partial interface Window {
  attribute PrivateAttribution attribution;
};

getEncryptedMatchKey

The getEncryptedMatchKey API allows retrieval of an encryption of a secret-sharing of the match key that was previously set. If no match key is found, the API also generates a random value for the match key and stores it.

API arguments

eventType: specifies whether this is a source or a trigger event
helperNetwork: the helper party network towards which the match keys should be encrypted
options: additional arguments which might need to be passed

API implementation detail

Step 1: Find the correct match_key

  • If a match_key is found in storage, it will be used.
  • If the user has enabled “private browsing”, the API will return a secret-sharing of a random value.
  • There may be a user setting to disable this API. If the user has disabled it, the API will return a secret-sharing of a random value.
  • Otherwise, a random match_key will be generated on-device, stored, and used as the match key.

Step 2: Split match_key into 3 shares

  • The match_key from the previous step will be split into 3 shares. The shares are obtained by generating 2 random shares and XORing the match key with those two shares to produce the third.

Step 3: Prefix eventType to each packet

  • A bit will be prefixed to each share to indicate a source event (0) or a trigger event (1).

Step 4: Encrypt packets

  • Each packet obtained from last step will be encrypted towards the helper party network
    using public encryption keys (see below in getHelperNetworks for details).
  • If no helper network is identified, one will be chosen for the site.
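Here is a rough TypeScript sketch of Steps 2 and 3, assuming a 64-bit match key; the actual width and packet layout are not fixed by this PR, so treat this as illustrative. The resulting packets would then each be encrypted towards one helper, as in Step 4.

// Illustrative: XOR-split an (assumed) 64-bit match key into 3 shares and
// prefix the event-type bit (0 = source, 1 = trigger) to each packet.
function randomU64(): bigint {
  return crypto.getRandomValues(new BigUint64Array(1))[0];
}

function splitMatchKey(matchKey: bigint, isTrigger: boolean): bigint[] {
  const mask = (1n << 64n) - 1n;
  // Two shares are random; the third makes the XOR of all three equal the key.
  const share1 = randomU64();
  const share2 = randomU64();
  const share3 = (matchKey ^ share1 ^ share2) & mask;
  const typeBit = isTrigger ? 1n : 0n;
  // Prefix the event-type bit above the 64 key bits of each packet.
  return [share1, share2, share3].map((s) => (typeBit << 64n) | s);
}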

Please note that

  • Calls to this API on different sites should return different secret sharings of the underlying match key, to ensure no cross-site linkability is introduced.
  • The browser will only cache the raw shares it created upon first invocation, and will store them against the calling “site” (note: not origin).
  • Each time this API is called on the same site during the same epoch, a new set of encrypted shares will be created using the public keys and associated data. This is to avoid revealing the distribution of the same users across source and trigger sites.

API return type

This has been covered in detail in #50

getHelperNetworks

The second API, getHelperNetworks, returns a list of the helper party networks towards which the browser is willing to encrypt match keys.

About public encryption keys: Browsers will need to know which public keys to use when encrypting data inside getEncryptedMatchKey(). These could be hard-coded into the browser, part of some sort of configuration system, or periodically downloaded from the helpers. The choice of key needs to be globally consistent to prevent it from being used to identify users; see https://datatracker.ietf.org/doc/html/draft-ietf-privacypass-key-consistency

About PrivateAttributionNetwork: We won’t provide a constructor for PrivateAttributionNetwork, so that only values that the browser supports can be passed to the options parameter.

Additional Note

  • These APIs would only be available in the foreground and would not work in the background by default. For example, in Chromium they won't be available to Workers by default.
  • Chromium implementation details can be found in this doc.

Specify how helpers/helper party sets are named, discovered

When pages call getEncryptedMatchKey they need to specify a helper party network. User agents will mediate whether the request is acceptable to users, for example, by requiring networks have $\ge n$ helpers, are in different legal jurisdictions, etc.

How does the caller name these sets? (Related to naming, are the sets identified by their members, or are they abstract and the members can vary over time? Is a specific epoch also part of the identity?)

How does a page discover which sets a given user agent accepts? (Also might depend on the identity provider, see Issue #26.)

Note, in the "fit and finish" of the API, specific helper identity, or set member order, may be important. The IPA explainer mentions secret shares $v_1, v_2, v_3$ and it is functionally important that $v_n$ reaches the right member of the set... But sets A,B,C and C,A,B are presumably equivalent so some naming, canonical order, etc. is necessary.

How does a report collector map one of these sets, to endpoints it can submit queries against?

IPA Biweekly Meeting Agenda Request - Changes to Match Key Providers

Agenda+: Changes to Match Key Providers

Discuss recent concerns with malicious MKPs and proposed mitigations. It is probably best to focus on what it looks like to use device-generated match keys for an initial version, with an option for platform syncing of match keys, while keeping the design compatible with future support for an MKP.

Time

20 min

Issue Link

#57

#42

Partitioning queries (by geography)

In #3 we talk about budgeting and note that the budgeting system assumes all users are present for every query made by a site. This is a direct consequence of there being a single budget that is spent fractionally on each query.

A standard technique in managing differential privacy is to find non-overlapping portions of the database that can have independent budgets. If such a partition can be made, then a query might be made that spends budget in one partition without reducing the budget for queries in other partitions. The key is aligning partitions to people/subjects/users, so that each only falls into a single partition. This means those people contribute information to a single query, meaning they have zero sensitivity in other partitions.

For IPA, the number of partitions that might be made is limited. We need something that can definitively partition events so that they can't be reused in other partitions. This means marking all affected events with a value that is assigned at the time the event is generated, using information that the site knows at that time.

Once we have that partition key, we apply that key to each generated event. Then, when making a query, helpers count a budget for each partition that is present in the query. (Again, only source events need to be counted for source queries; trigger events for trigger queries.) The helpers only match events to those that are in the same partition.

In a way, this is like the attribution constraint ID in that it effectively extends the match key for the purposes of matching. The difference is that while budgets apply across the attribution constraint ID, each partition would have its own privacy budget.

Note that we already have one such partition key: we call it the epoch. This is globally synchronized in a way that ensures that there is no disagreement about where each event falls. We would need a key with similar properties for this.

The only partition key @benjaminsavage and I could come up with is geography-based, but there might be other information sites could use. It's possible that we could just make a partition key free-form, with sites able to work out their own partitioning approach.

The advantage of partitioning here is that you can make independent queries for different markets fairly confidently. The cost is that for those users that move geographies (international travel, VPNs, etc...) you miss any attribution in any geography-restricted queries you make. You can make queries that hit multiple geographies, which will catch those missed attributions. The cost of that is that your numbers for per-geography and global queries might be harder to reconcile.
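As a sketch of how helpers might account for partition budgets, here is a hypothetical TypeScript fragment; the per-partition epsilon value and key format are illustrative only.

// Hypothetical per-partition budget accounting by the helpers: each
// (site, epoch, partitionKey) tuple carries its own independent budget.
const PER_PARTITION_BUDGET = 1.0; // illustrative epsilon per partition per epoch

const ledger = new Map<string, number>(); // "site|epoch|partition" -> remaining eps

function trySpend(site: string, epoch: number, partition: string, eps: number): boolean {
  const key = `${site}|${epoch}|${partition}`;
  const remaining = ledger.get(key) ?? PER_PARTITION_BUDGET;
  if (eps > remaining) return false; // reject: this partition's budget is exhausted
  ledger.set(key, remaining - eps);
  return true;
}

// A query over the "EU" partition spends nothing from the "US" partition,
// because (ideally) no person contributes events to both.
trySpend("news.example", 42, "EU", 0.5); // true
trySpend("news.example", 42, "US", 0.5); // true; independent budget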

IPA Biweekly Meeting Agenda Request - IPA Sharding with Sensitivity Bounds

Agenda+: IPA Sharding with Sensitivity Bounds

In order to support the higher end of query sizes outlined in patcg/private-measurement issue #21, it's almost certain that we will need a solution that scales horizontally. This presents some challenges in the IPA setting, as we would want to avoid a helper party learning which events land on which shard. Preventing this information leakage also requires limiting information about the sizes of the shards.

Issue #35 explores this in detail.

Time

This will take at least 30 min, potentially all 60.


Other Links

#35

IPA Biweekly Meeting Agenda Request - Privacy Budgets for Ad Networks brainstorm

Agenda+: Privacy Budgets in IPA for Ad Networks brainstorm

We'll continue discussing Privacy Budgets in IPA for Ad Networks. After presenting at the June PATCG several folks gave feedback and offered to help brainstorm further.

Time

Let's spend the first 45min on this.

Issue Link

#78
#78 (comment)

Other Links

What should be the return value of getEncryptedMatchKey()?

This issue tries to define the object which will be returned by getEncryptedMatchKey API.

Relevant reading

https://github.com/patcg-individual-drafts/ipa/blob/main/details/encryption.md

Proposed return value

We have detailed the getEncryptedMatchKey interface and implementation in #52; in this PR, we would like to nail down its return value, PrivateAttributionEncryptedMatchKey.

dictionary PrivateAttributionHelperShare {
    // This is the identifier for the HPKE key that was used to encrypt. 
    // Since the helper party may have multiple keys, this indicates which one to apply
    uint8 keyId;
    ArrayBuffer encryptedShare;
};


dictionary PrivateAttributionEncryptedMatchKey {
    // The registrable domain of the top-level "site" that the encrypted match key was generated for.
    DOMString site;
    // Epoch during which the encrypted match key was generated
    uint16 epoch;
    // Map from helper to the encrypted bits they get and additional info used to generate those bits
    record<DOMString, PrivateAttributionHelperShare> shares;
};

PrivateAttributionEncryptedMatchKey will consist of

  • site: The top-level site for which the encrypted match key was generated (see #51 )
  • epoch: The epoch during which the encrypted match keys were generated.
  • shares: This will be a map from helper to PrivateAttributionHelperShare, which contains the encrypted bits (encryptedShare) and the additional information needed to decrypt them (keyId). keyId is the identifier for the HPKE key that was used to encrypt the share; helper parties can have multiple keys, and keyId indicates which one to apply to decrypt this particular share.
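A hypothetical usage example, assuming the getEncryptedMatchKey and getHelperNetworks interfaces from #52 and the dictionaries above are exposed with matching TypeScript typings; the report collector origin is made up.

// Illustrative only; none of this is a shipped browser API.
async function collectSourceReport(attribution: PrivateAttribution): Promise<void> {
  const [network] = await attribution.getHelperNetworks();
  const mk: PrivateAttributionEncryptedMatchKey = await attribution.getEncryptedMatchKey(
    "https://collector.example", // hypothetical report collector origin
    { eventType: "source", helperNetwork: network }
  );
  console.log(mk.site, mk.epoch);
  // Each helper origin maps to its own ciphertext plus the HPKE keyId
  // needed to select the right decryption key for it.
  for (const [helperOrigin, share] of Object.entries(mk.shares)) {
    console.log(helperOrigin, share.keyId, share.encryptedShare.byteLength);
  }
}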

Standards Position of the Chrome Privacy Sandbox Measurement Team

This meta-issue describes the Chrome Privacy Sandbox Measurement team's current position on the Interoperable Private Attribution proposal, which is designed to satisfy similar use cases as the Attribution Reporting API (ARA).

Note: Chrome still plans to ship the Attribution Reporting API as explained in this document. Our engagement with experimental new APIs like IPA doesn’t change those plans.

Overall, we think more exploration is warranted given some beneficial design characteristics of IPA:

  • It has structural support for cross-device and cross-platform attribution.
  • It removes report delays and background fetches from the client, which improves usability and eliminates a source of unpredictability in the API.
  • It places the privacy mechanism entirely server-side, which allows for more efficient privacy/utility trade-offs.
  • The client-side implementation of IPA is simple, making it easier for independent implementations to be interoperable.

Because of these benefits, we see IPA as a worthwhile direction to explore. We think prototyping code in the Chromium codebase will be valuable for developing and addressing some of our concerns with the proposal, some of which we outline below:

  • IPA’s server-side infrastructure will need to process multiple orders of magnitude more data than ARA’s, because it moves the on-device join into the server. We need to be confident that this is feasible and cost-effective for all critical use-cases. See issue 22 on the private-measurement repo for the general scale issue for off-device attribution.
  • IPA has been designed to provide only simple aggregates keyed by impression-side information, which doesn’t address fundamental ecosystem use-cases including model training or reporting by conversion type. We will need to ensure that IPA can provide a rich enough set of outputs such that these use-cases and others can be achieved. This may also require investigating alternative privacy mechanisms, e.g. #60.
  • Issue 39: IPA’s pervasive sharing of “match keys” means match key providers are forced to share proprietary user data with arbitrary other parties on the web.
  • Issue 42: IPA allows for arbitrary linking to off-device and off-web events, as long as they can be annotated with a match key. This potential abuse vector is hard to reason about and may need extra mitigations before we’re confident it’s safe. See also issue 57 which identifies a privacy attack based on this capability. Potential fixes which remove the notion of a match key provider reduce the potential utility of IPA (e.g. to support attribution across platforms).
  • Issue 25: IPA currently only specifies a Javascript API, which we don’t think will be tenable for real-world adoption where ad measurement is often at the HTTP layer.
  • If sites participating in IPA want to delegate to one (or many) third parties to do their measurement for them, they need an explicit mechanism to advertise to browsers/helper party networks about this delegation. This mechanism is not currently well-defined in the IPA proposal (although a strawman is discussed in issue 6). Our experience with deploying ARA leads us to believe that this mechanism will be a source for issues unless carefully crafted.

We are interested to see if these concerns can be addressed with iteration, and are happy to evaluate and discuss proposals. We also plan on continuing to explore alternative designs apart from IPA within the PATCG, with the goal of converging on a cross-browser standard for attribution measurement.

Insecure timestamps allow replay attacks across epochs

In the current doc, timestamps for each event are provided by the report collector with no client-side attestation. This allows a report collector to replay old events by modifying the timestamp accordingly. This attack doesn't necessarily break the desired differential privacy protection, since we'd still bound the impact of any one event per epoch, but it allows some "worst case" style attacks where one sensitive event could get queried again and again, every epoch.

It might be the case that we accept this kind of leakage, but I think it's worth a discussion of whether we'd want to restrict in some way events from the client from participating in unbounded epochs. Note that doing this might break legitimate use-cases ("lookback windows" for attribution that are longer than an epoch), so we should proceed carefully.

IPA Biweekly Meeting Agenda Request - Chrome Standards Position and open concerns

Agenda+: Standards Position of the Chrome Privacy Sandbox Measurement Team

I propose we give some time to discuss the recent issue on the Standards Position of the Chrome Privacy Sandbox Measurement Team. We could start by letting the Chrome team add any additional color and answer any questions, before discussing a select few of the open-concern bullet points (I propose we focus on trigger-side breakdowns, match keys without JavaScript (if this needs discussion), and delegated report collectors).

@csharrison let me know if this sounds good to you or if you want to change it up at all.

Time

Overview and questions: 15 min

Select bullet points to discuss (remaining time, 25min):

IPA’s server-side infrastructure will need to process multiple orders of magnitude more data than ARA’s, because it moves the on-device join into the server. We need to be confident that this is feasible and cost-effective for all critical use-cases. See patcg/private-measurement#21 on the private-measurement repo for the general scale issue for off-device attribution.

Since last meeting we talked mostly about scale, and I don't think there are any updates in this area; I propose we revisit scaling proposals in a later biweekly call.

IPA has been designed to provide only simple aggregates keyed by impression-side information, which doesn’t address fundamental ecosystem use-cases including model training or reporting by conversion type #55. We will need to ensure that IPA can provide a rich enough set of outputs such that these use-cases and others can be achieved. This may also require investigating alternative privacy mechanisms.

I propose we discuss this issue on trigger-side breakdowns and what sort of circuit we would want to compute in the MPC using trigger breakdown keys. As for the alternative privacy mechanisms discussed in #60, it looks like the next step from the comments thread is to bring this to the broader PATCG for discussion.

#39: IPA’s pervasive sharing of “match keys” means match key providers are forced to share proprietary user data with arbitrary other parties on the web.
#42: IPA allows for arbitrary linking to off-device and off-web events, as long as they can be annotated with a match key. This potential abuse vector is hard to reason about and may need extra mitigations before we’re confident it’s safe. See also #57 which identifies a privacy attack based on this capability. Potential fixes which remove the notion of a match key provider reduce the potential utility of IPA (e.g. to support attribution across platforms).

We may have already covered this in the earlier part of the meeting, but we can revisit if there is more to discuss.

#25: IPA currently only specifies a Javascript API, which we don’t think will be tenable for real-world adoption where ad measurement is often at the HTTP layer.

This is outside my area, so I will let others decide if they want to discuss anything in the meeting.

If sites participating in IPA want to delegate to one (or many) third parties to do their measurement for them, they need an explicit mechanism to advertise to browsers/helper party networks about this delegation. This mechanism is not currently well-defined in the IPA proposal (although a strawman is discussed in #6). Our experience with deploying ARA leads us to believe that this mechanism will be a source for issues unless carefully crafted.

Would be great if we have time to discuss this; I've been thinking about fleshing this out more recently. It would also be helpful to hear, from the experience of deploying ARA, what some of the challenges may be.

Issue Link

#59

Should we use sites (i.e. registrable domains) or origins?

Based on our understanding, web platform features have moved towards scoping on origin rather than site. This has been called out as a warning in the HTML standard under the "Site" section:

For the reasons explained in URL, the same site and schemelessly same site concepts should be avoided when possible, in favor of same origin checks

For IPA, there are two places where we need to make this choice:

  1. Match Key Providers
    When a match key is set, we propose using the origin. Correspondingly, when calling getEncryptedMatchKey(), the first argument to that method is the “match key provider”. This should also be an origin.

  2. Storage of encrypted match key values
    The return value of getEncryptedMatchKey() should be consistent for a given site (for a given epoch). The User-Agent should compute the value once and store it to be returned on subsequent calls. Unfortunately, here we cannot use origin and must use site. There are a few reasons for this:

    • Privacy Budgeting is to be implemented at the “site” level. Doing so requires the helpers to check the site to which encrypted match keys are bound. We cannot give a separate privacy budget per origin as cookies enable linkage across origins belonging to the same site. Thus, scoping privacy budget to “origin” could allow sites to obtain an unlimited privacy budget.
    • The helper nodes need to know which “site” an encrypted match key was bound to in order to decrypt it. This is information helpers see in the clear. Helpers should not see data about origins, just the top level site.
    • Our current ideas about “DP Sharding” involve sensitivity capping that requires all records from a given user on a given site (in a given browser, in a given epoch) are consistent. Generating different values per origin would violate this constraint. (See: #35)

Better understanding IPA "report collector" coordination needs

I want to deepen my understanding of report collector setup coordination requirements when there are 2 or more report collectors or report collector delegates. I suspect I'm not the only one who may have these questions. I hope to use what I learn to make suggestions for updated documentation and potentially design.

Scenario

For the sake of clarity, I'm going to use real companies names in my hopefully somewhat real-life scenario. I'm going to interchange between saying "it is my understanding" and ending sentences with "correct?" to indicate areas where IPA authors may feel obliged to affirm or correct me.

Let's say Dyson is running a campaign using GCM360 as its general advertiser ad server and Sizmek Ad Suite where it must. Dyson's agency divides the total campaign budget evenly between three DSPs:

The Trade Desk serves Dyson video and banner creatives (all hosted/measured in GCM360) on nytimes.com, cnn.com, wsj.com, weather.com and howtocleanstuff.net via SSPs representing those sites. It is my understanding that The Trade Desk and GCM360 can call get_encrypted_match_key() just by nature of delivering ad code to the page on these sites. However, given that VAST is delivered as XML, there will be no way to call it from those creatives; the site would likely need to pass the value through the video player in a macro. I don't see or expect anything stopping sites from doing that.

DV360 serves video creatives (hosted/measured in GCM360) on youtube.com and cnn.com. GCM360 and DV360 will run into the same VAST XML issues above calling get_encrypted_match_key(), so they will need youtube.com and cnn.com to pass its value through the video players on those sites.

Amazon serves product ad creatives (hosted/measured in Sizmek by Amazon) on amazon.com and howtocleanstuff.net via Amazon's SSP. Again, it is my understanding that Amazon can call get_encrypted_match_key() just by nature of delivering ad code to the page for these sites.

GCM360, The Trade Desk, DV360 and Amazon (including Sizmek) all have code on the Dyson landing page to log potential conversion events and therefore can call get_encrypted_match_key(), correct?

The following table lays out source and trigger report collector opportunities based on the understanding above.

|                | Source Collector Delegate for                                                    | Trigger Collector Delegate for |
|----------------|----------------------------------------------------------------------------------|--------------------------------|
| GCM360         | nytimes.com, cnn.com, wsj.com, weather.com, youtube.com, howtocleanstuff.net      | dyson.com                      |
| The Trade Desk | nytimes.com, cnn.com, wsj.com, weather.com, howtocleanstuff.net                   | dyson.com                      |
| DV360          | cnn.com, youtube.com                                                              | dyson.com                      |
| Amazon         | howtocleanstuff.net, amazon.com                                                   | dyson.com                      |

Each of these entities needs to run both source and trigger fan-out queries (where they can, based on the table above) to understand campaign performance for Dyson, correct? Where each of these entities has both source and trigger data, they can individually select whatever helper party network they like for the epoch, correct?

Suppose nytimes.com, youtube.com and amazon.com are interested in running source fanout queries. There might be several entities wanting to run them; in the case of nytimes.com, I suspect the SSPs who delivered ads? Presumably Google runs source fanout queries for youtube.com, and Amazon for amazon.com. Does each of those entities need to run its source reports through the same helper party network GCM360 used, which saw all of what happened on dyson.com? Additionally, whichever of these entities wants to query will also need to align on the attribution constraint ID with at least GCM360 (the advertiser's ad server for all instances outside of amazon.com), correct?

There are clearly a lot of actors wanting to understand what is happening:

  • Dyson and its ad buying/measuring technology providers
  • Each of the websites listed and their monetization technology providers

My understanding of the current design (July 29, 2022) is that pre-campaign coordination may be necessary to optimize privacy budgets, helper party network selection and Attribution Constraint IDs. I may be wrong, but I hope this somewhat detailed and reality-based scenario will help us all understand a bit more clearly where scale/adoption/utility issues exist and might be addressed.

"Gathering Events" section is very confusing

The first sentence mentions "collecting information on important events". It's unclear what events these are (source events, trigger events, both, something else?) and where information is being collected: is this a page's JS collecting source (trigger?) event information and passing it to the browser, the browser collecting this information (provided by sites) in a database, or this information being collected by a server after reports have been triggered? The sentence does begin with "IPA begins", but I don't know what the authors consider the "start" of IPA: is it the start of data actually being aggregated by a server, the start of one or both types of events, or what?

The section also mentions collecting information, collecting events, and collecting reports. It's not clear if these are one, two, or three distinct concepts.

Event protection and encoding

Right now, we're a little imprecise when we talk about how the encryption of various pieces is put in place.

Each match key is protected toward the three helpers. There is not a lot we can do to optimize there, though I will observe that we don't need to explicitly encode a lot of the data: it is sufficient to include some fields under an AAD. That trick can provide integrity protection for information that is public but needs to be bound to the value, like site name and epoch.

Aside from the match key, most of the data comes directly from the report collector. This information can be protected using bulk transfer keys. For that, we can use TLS if the report collector talks directly to helpers, or a single large encrypted item otherwise.

That leads to each helper receiving a stream of records that comprise:

  1. Site name (cleartext)
  2. Epoch (cleartext)
  3. An encrypted match key (which need only contain the XOR shares of the match key; the AAD can bind to the site name and epoch)
  4. XOR share of the attribution constraint ID (this might be reduced to a share of a single bit, indicating whether to attribute to the previous items if events are presorted)
  5. XOR share of the timestamp (this might be removed entirely if items are presorted)
  6. A share of a bit indicating if this is a trigger or source event.
  7. An additive share of a value. For a source event, the breakdown key; for a trigger event, the trigger value.

Helpers will use the first three fields to decrypt their shares of the match key, then execute the sort on that field. There should be no need to sort on attribution constraint ID or timestamp if we require that the items are presorted.
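For concreteness, here is a rough TypeScript rendering of the per-helper record described above; the field names and widths are illustrative, not a wire format.

// Illustrative shape of one record in the stream each helper receives.
interface HelperRecord {
  siteName: string;              // cleartext (or a small integer alias, see below)
  epoch: number;                 // cleartext
  encryptedMatchKey: Uint8Array; // XOR shares; the AAD binds site name and epoch
  constraintIdShare: Uint8Array; // XOR share of the attribution constraint ID
  timestampShare: Uint8Array;    // XOR share (droppable if items are presorted)
  isTriggerShare: number;        // share of the trigger/source bit
  valueShare: bigint;            // additive share: breakdown key or trigger value
}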

If we want to include multiple breakdowns, we can then do multiple values. The reason you might want separate fields for breakdown key and trigger value is that they might have different ranges. However, I think it would be enough to bound the range of values by construction (i.e., share on $\mathbb{Z}_{2^n}$ and always assume a contribution of $2^n-1$, or even $2^n$ (see footnote 1)), but pass the number of breakdown keys for each breakdown up front. If the breakdown keys exceed the limit, there is no privacy consequence, because those items with bad breakdown keys contribute nothing.

Something worth noting here is that there will be a lot of repetitions for the public information. Site name and epoch will repeat often. It might be best for encoding reasons to assign these (either individually or as a tuple) a small integer value so that they can be sent without spending too many bytes. Note that trigger events on a trigger query will have the same site name on every item; same for source events on a source query - this information can therefore be encoded on zero bits.

For anything less than 127 different sites and epochs, the public data might encode in one byte, and two bytes should cover all but the most extreme uses. Then, with 48 bytes of HPKE (x25519 and AES-GCM or ChaCha20Poly1305) overhead, we have 54 bytes for a 64-bit match key. That's the big part done. The rest could probably easily fit into one or two bytes per breakdown. If you have small breakdown values, you might get as many as 8 breakdowns and the site+epoch identifier into 10 bytes, with a total size of 64 bytes per item.

Packing values this tightly will mean that we need to do a modulus conversion for values, which adds a multiplication round or two. We can probably do that in parallel with other operations though. The advantage that comes from doing that is that range checking comes for free, which saves a lot of communication between the report collector and helpers².

Footnotes

  1. The cost of doing $N$ breakdowns with values in the range $[0, M)$ is that you need to ensure that $M\ge N$, which might just mean scaling values up. That's a little awkward if your range tops out at $2^n-1$, so we might consider adding 1 to the value to make this easier. If you have 10 breakdowns and you want values from 1 to 4, then multiplying by four to get to a maximum of 16 is easier than working out how to scale up to 15.

  2. Prio's range check involves a fairly large number of field elements being submitted. For a value in $[0,2^{16})$, this involves sending 20 field elements of something in the order of 80 bits each, so 200 bytes. That's ~100x more to save the communication of a multiplication.

Two-way sharing for (some) inputs

The document currently contemplates the potential for inputs to be provided as simple two-way shares. These can be upgraded to 2-of-3 replicated shares by a simple process.

Assume that each helper, $i$, receives a value, $v_i$, with one helper instead setting $v_i = 0$. Then, for each value, at each helper:

  1. Use PRSS to generate a share of zero, $r_i$. This can use the pairwise PRSS generator, taking the difference between the "left" and "right" values.
  2. Send $v_i+r_i$ to the right.
  3. Receive $v_{i-1}+r_{i-1}$ from the left.
  4. Set the share to $(v_i+r_i, v_{i-1}+r_{i-1})$.
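A minimal cleartext simulation of this upgrade, with PRSS modeled as pairwise-shared generator values and an arbitrary modulus (all names and constants here are illustrative, not part of the design):

import random

p = 2**31 - 1  # hypothetical prime modulus for the arithmetic sharing

# Pairwise PRSS: generator g[k] is known only to helpers k and (k + 1) % 3.
rng = [random.Random(seed) for seed in (101, 202, 303)]
g = [rng[k].randrange(p) for k in range(3)]

v = 42
v_shares = [7, (v - 7) % p, 0]  # two-way sharing; helper 2 sets v_2 = 0

# Step 1: each helper derives a share of zero from its "right" and "left" generators.
r = [(g[i] - g[(i - 1) % 3]) % p for i in range(3)]
assert sum(r) % p == 0

# Steps 2-4: send v_i + r_i to the right; helper i ends up holding the pair
# (its own masked value, the masked value received from the left).
masked = [(v_shares[i] + r[i]) % p for i in range(3)]
replicated = [(masked[i], masked[(i - 1) % 3]) for i in range(3)]

# The replicated sharing still reconstructs v.
assert sum(masked) % p == v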

This requires one round of communication, but reduces the input size by a factor of three. In the case where all input data arrives at a single helper, the total communication cost of getting full replicated secret shares to all helpers is 6N (for query submission) + 4N (for redistribution) = 10N. With two-way sharing, this reduces to 2N (submission) + N (distribution) + 3N (the above protocol) = 6N. (Here, N is the number of records times the size of each share.)

The inter-helper communication here is unchanged, but the submission costs are reduced by a factor of three. Costs for submission might be higher than for inter-helper communication. There are a great many reasons to find cheap ways to move data between helpers, so we might expect lower costs there. After all, we're sending far more than 4N between helpers during protocol execution.

The main drawback with this approach is that it is not easy to protect against additive attacks in the above protocol. Any helper can add an arbitrary amount to the provided values without being detected.

However, it might be possible to apply this process selectively to different input fields, perhaps even at the discretion of the record collector. For instance, though it might not be possible to do this for match keys, a record collector might be able to opt for this approach for all of the data that it provides. Though an additive attack is possible, which affects the correctness of the outcome, the record collector might consider this risk to be acceptable.

From a user privacy perspective, my current belief is that additive attacks on the inputs provided by the record collector are not possible. We already allow the record collector to set these values.

Stable DP Sharding

Issue #35 outlines a feasible method of sharding with a very conservative epsilon.

But there was an open question on that post:

Sharding messes up the presort on timestamps

One optimization we’ve been hoping to use is to have the report collector presort on timestamps in the clear before submitting a query. However, in the computation of the shard id with the PRF we must shuffle all the rows. This breaks the previous sort on timestamps. It is an open problem to figure out if there is a way to maintain this presorting optimization in some way.

I’d like to present a possible solution to this problem: “Stable DP Sharding”. It’s a variation on the multi-bit radix sort algorithm.

Step 1: Modulus conversion

Input (streaming):

The report collector submits reports containing an XOR replicated sharing of a 45-bit match key. So each helper receives two 45-bit integers, one of which it shares in common with the helper to its left, and one of which it shares in common with the helper to its right.

$[mk_i]$

We assume that the report collector is submitting a query of 100 billion records, and it’s impossible for all the records to fit in memory on a single machine. As such, the helpers must operate on streaming records as they arrive.

The modulus conversion protocol is an MPC protocol to “upgrade” these bitwise sharings in $F_2$ to a vector of sharings in $F_p$, for some large value of $p$.

${ [mk^0_i], [mk^1_i], …, [mk^{44}_i] }$

We had originally proposed that the helpers would use a prime sufficiently large to produce our intended security parameter. We are currently using the prime $2^{32} - 5$.

However, certain PRFs we might want to use in Step 2 require larger fields. In this case, we might consider either using a very large $p$, or potentially performing the modulus conversion twice: once to produce the inputs used to compute the PRF, and once for use in the rest of the protocol (i.e. sorting, and computing of helper bits).

Step 2: Compute the PRF

Input (streaming):

For each input row $i$, each helper receives 45 secret-shares, which are sharings of the bits of the match key.
${ [mk^0_i], [mk^1_i], …, [mk^{44}_i] }$

Each secret value $mk^j_i \in \{0, 1\}$.
But the secret-sharings are in a large field: $[mk^j_i] \in Z_p$

The helpers perform an MPC protocol to compute a secret-sharing of a PRF of these values: $[PRF(mk)_i]$

Potential PRFs we might consider:

Step 3: Bit-decompose the PRF

Input (streaming):

For each input row, at the end of step 2, each helper should be in possession of a secret-share of a PRF of the match key:
$[PRF(mk)_i] ∈ Z_p$

Now the helpers must run a bit-decomposition protocol to extract the M least-significant bits of this value:
${ [PRF(mk)^0_i], [PRF(mk)^1_i], …, [PRF(mk)^{M-1}_i] }$

We will utilize these M bits in much the same way as the multi-bit sort algorithm. At each round of sharding, the input rows will be sharded into $2^M$ different shards. It’s very likely that M=3 will continue to be the optimal value, but it’s possible that we will want to choose a larger value to minimize the number of malicious shuffle operations, which will be non-trivial for extremely large input sizes.

There will likely be multiple rounds, and there is no requirement that M be the same in each round.

Step 4: Compute equality checks

Input (streaming):

For each input row, the helpers consider their secret-shares of M bits of the bit-decomposed PRF:

${ [PRF(mk)^0_i], [PRF(mk)^1_i], …, [PRF(mk)^{M-1}_i] }$

Consider this a bit-decomposition of an unsigned integer, meaning it must represent one of the numbers in the range: $[0, 2^M)$

The helpers run a protocol to compare these M secret-shared bits with all of these possible values. The result is a vector:
${ [eq(0)_i], [eq(1)_i], …, [eq(2^M - 1)_i] }$

Where all of the elements are sharings of zero except one, which is a sharing of one.

All of these checks can be done with just 4 multiplications when M=3 and 11 multiplications when M=4. (See the implementation of this in: multi_bit_permutation.rs)
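To see why so few multiplications suffice, here is a cleartext reference (illustrative only; not the MPC protocol or the code in multi_bit_permutation.rs). The only non-linear terms are the products of subsets of bits - 4 of them for 3 bits (three pairwise products and one triple product), 11 for 4 bits - and every eq(x) is a linear combination of those products:

from itertools import combinations

def equality_vector(bits):
    # bits: the M secret bits (here plain 0/1 ints standing in for shares).
    n = len(bits)
    # Precompute every monomial prod_{j in S} b_j. Building each from a
    # smaller one means 4 real multiplications for n=3 (size-1 entries are free).
    mono = {(): 1}
    for size in range(1, n + 1):
        for S in combinations(range(n), size):
            mono[S] = mono[S[:-1]] * bits[S[-1]]
    eq = []
    for x in range(2 ** n):
        ones = tuple(j for j in range(n) if (x >> j) & 1)
        zeros = [j for j in range(n) if j not in ones]
        # eq(x) = prod_{j set in x} b_j * prod_{j clear in x} (1 - b_j),
        # expanded by inclusion-exclusion: a linear combination of monomials.
        total = 0
        for size in range(len(zeros) + 1):
            for T in combinations(zeros, size):
                total += (-1) ** size * mono[tuple(sorted(ones + T))]
        eq.append(total)
    return eq

assert equality_vector([0, 1, 1]) == [0, 0, 0, 0, 0, 0, 1, 0]  # 0 + 2 + 4 = 6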

Step 5: Increment running sums

Input (streaming):

For each input row, the helpers take their shares of the $2^M$ equality checks computed in step 4:
${ [eq(0)_i], [eq(1)_i], …, [eq(2^M - 1)_i] }$

Before processing input row 0, the helpers first initialized a vector of secret-shares of zero:
${ [count(0)], [count(1)], …, [count(2^M - 1)] }$

Upon processing each input row, the helpers update these counts by adding the equality check shares computed in the last step:

$[count(0)] = [count(0)] + [eq(0)_i]$
$[count(1)] = [count(1)] + [eq(1)_i]$
$\ldots$
$[count(2^M - 1)] = [count(2^M - 1)] + [eq(2^M - 1)_i]$

This only requires local processing and no inter-helper communication.

Since only one equality check is a sharing of one, only one counter will be updated. The others will remain the same.

Step 6: compute the permutation destination index

Input (streaming):

For each input row, the helpers take both:

  1. The vector of shares we called “equality checks” computed in step 4:
    ${ [eq(0)_i], [eq(1)_i], …, [eq(2^M - 1)_i] }$
  2. The current value of the running sums updated in step 5:
    ${ [count(0)], [count(1)], …, [count(2^M - 1)] }$

The helpers run a protocol to compute the dot-product of these two vectors. The result is a secret-sharing of a number:
$[σ(i)]$

This number will be a part of a per-shard permutation. Its value will eventually be revealed (after shuffling). The number will indicate the destination index (on that shard) to which this record should be taken.

A few notes about this value:

  • For some rows $σ(i)$ will be the same -- but those rows are guaranteed to go to different shards.
  • For all rows that go to the same shard, $σ(i)$ is guaranteed to be unique
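Steps 5 and 6 are easy to see in the clear (a purely illustrative sketch, with plain integers standing in for secret shares):

M = 3
counts = [0] * (2 ** M)  # step 5 state: running count per shard

def destination_index(eq):
    # eq: the one-hot equality vector for this row, from step 4.
    for s in range(2 ** M):
        counts[s] += eq[s]  # step 5: local additions, no communication
    # Step 6: one dot product selects the count for this row's shard.
    return sum(e * c for e, c in zip(eq, counts))

# Two rows bound for shard 6 receive destination indices 1 and 2, in order.
assert destination_index([0, 0, 0, 0, 0, 0, 1, 0]) == 1
assert destination_index([0, 0, 0, 0, 0, 0, 1, 0]) == 2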

Step 7: malicious shuffle (part 1): re-distribute

Input (streaming):

The original input row, and in addition to that a few values which were computed in the past few steps:

  1. The M least significant bits of the PRF:
    ${ [PRF(mk)^0_i], [PRF(mk)^1_i], …, [PRF(mk)^{M-1}_i] }$
  2. The per-shard destination index:
    $[σ(i)]$

We assume that we must process the rows in a streaming fashion, because there are too many to fit into memory on a single machine.

We assume that the helper party has access to Q physical machines, such that if there are N input rows, then N/Q rows can comfortably fit into memory.

Two helpers will use their shared randomness to generate a sort of “permutation”, but this will be done slightly differently due to the streaming nature of this shuffle.

For each row, these two helpers will generate a pseudo-random value in the range [0, Q) using shared randomness, and send the record to the corresponding physical machine.

Step 8: malicious shuffle (part 2): add dummy records

Input (none):

The same two helpers from the last step will now generate a random number of “dummy records” per “shard”.

There will be $2^M$ “shards”. These will (at a later step) be revealed by running a “reveal” protocol on the M bits of the PRF which were used in step 4.

For each of the values in the range $[0, 2^M)$, the helpers can generate a random number of “dummy records”. They can select the values of the M bits of the PRF to indicate the desired shard (they need not compute any actual PRF).

For example, when M=3, when generating “dummy records” for shard 6 (binary 110, least-significant bit first: 0, 1, 1), the two helpers should ensure that:
$[PRF(mk)^0_i] = [0]$
$[PRF(mk)^1_i] = [1]$
$[PRF(mk)^2_i] = [1]$

They must also take care to generate legitimate destination indices. They can do this by continuing to increment the corresponding counters.

For example, when generating a “dummy record” for shard 6, the two helpers should first increment counter 6:
$[count(6)] = [count(6)] + 1$

Then, they can use this as the secret-sharing of the destination index on shard 6 as:
$[σ(i)] = [count(6)]$

Each dummy record is then sent to a random physical machine in $[0, Q)$, determined by the shared randomness of the two helpers.
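A cleartext sketch of this bookkeeping (a hypothetical helper function; in the protocol these values would be secret-shared and generated jointly by the two shuffling helpers):

def make_dummy_for_shard(shard, counts, M=3):
    # PRF bits chosen so the dummy reveals as the desired shard (LSB first);
    # for shard 6 this yields [0, 1, 1].
    prf_bits = [(shard >> j) & 1 for j in range(M)]
    # Keep destination indices legitimate by continuing the running count.
    counts[shard] += 1
    return {"prf_bits": prf_bits, "sigma": counts[shard]}

counts = [0] * 8
assert make_dummy_for_shard(6, counts) == {"prf_bits": [0, 1, 1], "sigma": 1}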

Step 9: malicious shuffle (part 3): distributed shuffle

Each of the Q physical machines operated by the two helper nodes now randomly permutes all of the rows it was sent, using a Fisher-Yates shuffle.

Step 10: malicious shuffle (part 4): bringing it all back together

Starting from physical machine 0, and ending with physical machine Q-1, the shuffled records on each machine are then sent back to the coordinator machines that originally processed the streaming inputs.

This is where the “reshare” process happens. The two helpers that know the “permutation” need some communication between themselves as they reshare the values based on PRSS.

The helper who was “out of the loop” so to speak, is simply informed of the total number of records it must generate, and it initializes that many rows, where each sharing is just initialized with PRSS.

Step 11: repeat the shuffle 2 more times

As the records stream in during step 10, the next shuffle immediately begins. A different set of two helpers participates each time. They immediately repeat step 7, randomly sending each record to one of Q physical machines.

The helpers repeat steps 7-10 a total of three times, each time with a different helper "out of the loop".

Step 12: Reveal the shards and permutation destinations

As soon as the 3rd shuffle is complete, and the “bringing it all back together” process is happening, the 3 helpers can immediately (in a streaming fashion) reveal two things about each record they process:

  1. The M bits of the PRF which were used in step 4
    ${ [PRF(mk)^0_i], [PRF(mk)^1_i], …, [PRF(mk)^{M-1}_i] }$
  2. The per-shard permutation destination
    $[σ(i)]$

The three helpers run a “reveal” protocol to open these secret-sharings.

The M bits of the PRF are considered as an M-bit unsigned integer:
shard_id = $PRF(mk)^0_i + 2 * PRF(mk)^1_i + … + 2^{M-1} * PRF(mk)^{M-1}_i$

This row can immediately be sent, in a streaming fashion, to the machine intended to process that shard_id.

Step 13: Apply the permutation

Each helper can now use $2^M$ physical machines (shards) to process the inputs. All of the records about a given user will be present on just one shard. And due to the dummy rows, we have a differential privacy guarantee about the amount of information released via the cardinality of rows on each shard.

The machine processing each shard can simply move each record to the destination index given by $σ(i)$. No information is revealed in this process because it’s just a random permutation (per shard).

This ensures that all of the real records that were sent to each shard appear in the same order (relative to one another) as they did in the original input. In this way, we can avoid the need to sort by timestamp. This is why this is called a “stable sharding”.

Yes, a malicious helper knows that the dummy records are all at the end of the shard, but it doesn’t know how many of them are dummy records, and it doesn’t know which real records made it to this shard. Due to the usage of a PRF, a malicious match key provider cannot even ensure two different match keys will end up on the same shard.

Request: Add support for a fixed attribution window

The current "IPA end-to-end" document does not provide a way to constrain the attributed conversions based on the amount of time between the source and trigger event.

Currently, if any of the source-events in an IPA query share a common match key (and "attribution constraint ID") with any of the trigger-events that came after it, they'll be attributed.

Today, it is common for advertisers to count the number of "1-day / 7-day / 28-day conversions". If this is behaviour that advertisers would like to keep (as opposed to just, "Do any of the trigger events I've included in this query come from the same match-key and later in time than any of these source events I've included in this query") then we would need to add support for this.

It would be great to get feedback about how much demand there is for this use-case, and to provide a sample oblivious algorithm to achieve this outcome.
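As a straw-man for discussion, here is a cleartext sketch of the check (the real thing would be an oblivious MPC computation over secret shares; rows are assumed to belong to a single match key and to be sorted by timestamp, and the 7-day window is arbitrary):

WINDOW = 7 * 24 * 3600  # e.g. a 7-day attribution window, in seconds

def attribute_with_window(rows):
    # rows: (is_trigger, timestamp, value) tuples, sorted by timestamp.
    credited = []
    last_source_ts = None
    for is_trigger, ts, value in rows:
        if not is_trigger:
            last_source_ts = ts
        else:
            # The window test is one extra bit, multiplied into the value.
            in_window = int(last_source_ts is not None
                            and ts - last_source_ts <= WINDOW)
            credited.append(value * in_window)
    return credited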

Request: Add support for queries where clicks take precedence over impressions

One common attribution heuristic is to attempt to attribute conversions to a click, and failing that, attribute to an impression.

The current "IPA end-to-end doc" does not demonstrate a sample query which would achieve this. Instead, all source events are treated the same.

It would be nice to be able to somehow support this functionality. It doesn't seem like it should have any different privacy budget characteristics compared to other queries.

This would require the report collector to pass in a (secret-shared?) bit of information to indicate if a source event corresponds to an impression or a click.
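For illustration only, a cleartext version of the heuristic, where is_click is the extra bit mentioned above (secret-shared in the real protocol):

def last_touch_click_priority(rows):
    # rows: (is_trigger, is_click, breakdown_key, value), sorted by timestamp.
    totals = {}
    last_click_key = last_impression_key = None
    for is_trigger, is_click, key, value in rows:
        if not is_trigger:
            if is_click:
                last_click_key = key
            else:
                last_impression_key = key
        else:
            # Prefer the most recent click; fall back to the most recent impression.
            winner = last_click_key if last_click_key is not None else last_impression_key
            if winner is not None:
                totals[winner] = totals.get(winner, 0) + value
    return totals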

Request: Add a sample "multi-touch" implementation

The current "IPA end-to-end" doc provides an example of how to achieve "last touch attribution" via an oblivious algorithm which can be run in MPC.

It would be nice to also provide an example of an oblivious algorithm that could provide some simple type of "multi-touch attribution", for example, "equal credit, up to the last 3 touches".

It would also be nice to benchmark this and see how much it would change the overall cost-structure of an IPA query.
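A cleartext sketch of one such rule, "equal credit, up to the last 3 touches" (illustrative only; the oblivious version is the part that needs design and benchmarking):

from collections import deque

def equal_credit_last_3(rows):
    # rows: (is_trigger, breakdown_key, value) tuples, sorted by timestamp.
    credit = {}
    recent = deque(maxlen=3)  # retains only the last 3 source touches
    for is_trigger, key, value in rows:
        if not is_trigger:
            recent.append(key)
        elif recent:
            share = value / len(recent)  # equal credit across the touches
            for k in recent:
                credit[k] = credit.get(k, 0) + share
    return credit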

Timestamps and partial presorting

In the current document we contemplate an option where entries are presorted by the report collector. This saves MPC time, because every bit that is sorted requires extra work. The only risk here is that the report collector might not have the timestamps necessary to do this sorting, primarily because sites might not be happy to share that kind of timing information with each other.

There's a potential for a progressive resolution here that might not even increase complexity that much. We already have a requirement that events be sorted at the level of epoch. Not only does the report collector need to know epoch, but the helpers do also. But epoch is really just the high bits of a timestamp. We should definitely require that events be presorted based on epoch; though that could be done in the clear by each helper, it is likely easier to manage an efficient encoding of input events if they are presorted.

If we had to sort on the remainder of the timestamp in MPC, report collectors would have to provide the N low bits of the timestamp to the helpers. (If this information is encrypted by the site that generated the event -- so that the report collector can't see it -- that would require some additional encapsulation, but we'll leave that for the moment.) If events were presorted on timestamp, report collectors would be providing no low bits.

There are probably a few numbers between N and 0 that could be used. If some of those bits are disclosed to the report collector and the remainder protected for MPC, sites might be able to choose how much information to expose to the report collector vs. how much extra queries might cost.

A lot of this comes down to how much sites are willing to release time information at a granularity necessary for attribution. If the report collector can be trusted with that much information, then a full time-based presort is ideal.

Permissioning layer for match keys

Currently in IPA, a match key set by a site can be used by any other site. There is no mechanism in the system where a site could choose to keep a match key to itself. I have serious concerns about this.

In practical terms, to use IPA to achieve cross-device attribution, match keys would likely be derived from PII. This means there is a tension between using IPA to its fullest, and pervasively sharing your device graph / PII-derived user data with the rest of the web ecosystem.

Like most other APIs that store user data, IPA should abide by the Same Origin Policy by default, and not expose read access to match keys, even in an encrypted form. If sites want to share their match keys with others, we can support selectively exposing access in an opt-in way with a permissioning system. This could look something like a set policy declared at setMatchKey time:

setMatchKey(<key>, {
  exposeToOrigins: ['https://foo.com', 'https://bar.*.com']
});

Where we could support pattern matching using the URLPattern API infrastructure, which will let a provider allow a specific set of report collector origins (or everyone with "*") to consume their match keys.

Additionally, we could support dynamic, just-in-time permissions, e.g.

  • getMatchKey could automatically grant permission to a report collector if the API was called within a document whose origin matches the provider.
  • If we support an HTTP API (#25) for getting match keys, we could allow a provider to redirect to a report collector and append its match key.

These changes would allow sites to use IPA without fear of leaking their users’ data to parties outside their control.

Note: We might need something more sophisticated if we want match key setting to persist beyond a single browser / app (e.g. storage mediated by an operating system), but I think we can discuss that later on.

IPA Sharding with Sensitivity Bound


Thus far sharding IPA has proven difficult because of the inability to bound the sensitivity of a matchkey’s contribution to shard size. An attacker could request a large number N of encryptions of a matchkey on two websites and then (colluding with a Helper Party) see whether one shard increased by 2N (meaning it was the same user across the two websites). Several ideas for sharding were presented at the last PATCG and referred to this paper for differentially private histograms but for these to be applied the matchkeys would first need to have bounded multiplicity.

As a step towards bounding the sensitivity of a matchkey, it was suggested (@benjaminsavage mentioned discussing this with several folks at the last PATCG) that the device can cache the encryption of the matchkey per domain/app (or report collector) that is requesting it. Then, when a website requests N encryptions of a matchkey from a user, it will get back the exact same authenticated ciphertext N times. The Helper Party Network processing these reports in a query will be able to filter out the duplicates or reject the query as malicious.

If a Match Key Provider is the one colluding with a Helper Party to launch this attack, it is able to bypass the sensitivity capping for a matchkey by directly producing as many distinct encryptions of known matchkeys as it likes. This seems to allow the same style of attacks to continue.

However, we claim that the technique of caching encryptions on the device to bound the sensitivity of unknown matchkeys is sufficient to protect against cardinality attacks in sharding, even if the Matchkey Provider is able to add arbitrarily many encryptions for known matchkeys into the sharding process. The reason is that even though a MKP can create an arbitrary number N of encryptions of a known matchkey, the attack only works by observing the sum of N and the number M of matchkeys that come from a user’s actual devices. If we can bound M with some reasonable sensitivity (say 100 = (# different devices) * (max # domains visited per device)), then we can add non-negative DP noise such that the sum N + M + noise is differentially private.

(This may have been clear to some discussing this at the PATCG, but overall I've found it helpful write out the approach more fully).

Amount of Noise Needed

For DP parameters that are a couple orders of magnitude less than the overall DP budget, we can get by with adding around 4k dummy records per shard. This is a relatively small amount compared to the overall query sizes and does not depend on the size of the total query or the number of shards. The number of dummy records that need to be added only depends on the sensitivity bound for unknown matchkeys and the epsilon and delta DP parameters.

More specifically, if the sensitivity of a matchkey is 100, then we can make the value N + M visible to any attacker differentially private with epsilon = 0.01 and delta = 10^{-8} by adding noise from a Truncated Geometric distribution, with the expected size of the noise being 1827 (see the appendix section for details). For this to be the amount of noise from the perspective of all parties, we will probably need at least two of the Helper Parties to each add this many (in expectation) dummy records. So we can expect to add fewer than 4k dummy rows per shard.

Two stage sharding

To enforce the removal of identical encryptions of matchkeys, we will need to use an extra sharding stage, since we don’t assume the full input set could fit on any one shard for deduplication. To be maliciously secure, we most likely need each Helper Party to enforce the deduplication on its own set of encrypted shares.

Stage 1: Ciphertext Sharding: First we will shard based on the value of the ciphertext itself, specifically the ciphertexts that encrypt the matchkey. This cannot leak any additional information, since the ciphertext values are public to all parties. In the case where the matchkey is encrypted as replicated shares (6 ciphertexts in all), the Helpers could agree on a public sharding function, such as hashing all the ciphertexts in a certain order, interpreting those bits as an integer, y, and then computing the shard as i = y modulo the number of shards.

Stage 2: PRF Sharding: Then each shard will decrypt and compute a PRF of the matchkey using MPC. Here each of these shards for a particular Helper Party will need to communicate with the paired shards of the other Helpers.

(Figure: IPA two-stage sharding)
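A rough sketch of the stage-1 computation, assuming each report carries its match key ciphertexts as byte strings and that the helpers have agreed on SHA-256 and a canonical ciphertext ordering (all placeholders, not a spec):

import hashlib

def ciphertext_shard(ciphertexts, num_shards):
    # ciphertexts: the (e.g. 6) encrypted match key shares, in the agreed order.
    h = hashlib.sha256()
    for ct in ciphertexts:
        h.update(ct)
    y = int.from_bytes(h.digest(), "big")
    return y % num_shards

# Identical ciphertexts always land on the same shard, enabling deduplication.
assert ciphertext_shard([b"ct-1", b"ct-2"], 8) == ciphertext_shard([b"ct-1", b"ct-2"], 8)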

Optimized encryption for shard id computation

When the device encrypts the matchkey towards the helper parties, it may be a useful tradeoff to let the matchkey be encrypted in two ways:

  1. As replicated shares to be used in the MPC for sorting
  2. With some other encryption to be used specifically for the computation of a PRF. For example, the PRF referenced here could begin by having the UA encrypt the matchkey using threshold ElGamal encryption. To get malicious security we will want to use a maliciously secure PRF such as the Dodis-Yampolskiy PRF. Daniel Masny and some academic collaborators have an efficient protocol for computing this in the honest majority setting that we can share more on.

Sharding messes up the presort on timestamps

One optimization we’ve been hoping to use is to have the report collector presort on timestamps in the clear before submitting a query. However, in the computation of the shard id with the PRF we must shuffle all the rows. This breaks the previous sort on timestamps. It is an open problem to figure out if there is a way to maintain this presorting optimization in some way.

An old approach revisited

An earlier approach that we considered for IPA was to perform the join by applying an OPRF to all of the matchkeys, instead of using sorting. This approach is vulnerable to known cardinality attacks on the matchkeys, and for this reason we moved away from it. It is worth noting here, though, that bounding the cardinality of unknown matchkeys is not sufficient to let us go back to this approach.

First, consider the direction we would like to go in to make that OPRF approach have differentially private leakage. If the sensitivity bound on a user's matchkeys coming off real devices is M = 100, there will be some number of matchkeys with cardinality 1, some with cardinality 2, ..., and some with cardinality 100. For each of these histogram buckets, we will add non-negative DP noise by adding sets of dummy reports with those particular numbers of identical matchkeys. We could add DP noise to each of the buckets with sensitivity 1. Here, the bound on the user's number of matchkeys gives us the number of buckets in the histogram.

The flaw here is that we cannot actually bound the number of times a user’s matchkey is submitted. A malicious MKP can submit a targeted user’s matchkey many times (e.g. in a source-fan-out query there could be many copies of a matchkey with source events or split across multiple sets of trigger reports from fake trigger sites).

As a result, the histogram has an unbounded number of buckets that we would need to add noise to. To use the notation above: even though M is bounded, N is unbounded; to make this approach work, there would have to be a way to bound N + M.

Truncated Geometric/Discrete Laplace with large sensitivity

We can use the techniques for non-negative DP from our other paper. In the paper we focused on the case where the sensitivity is $\Delta=1$. Here we go back and show, first for the truncated geometric / discrete Laplace, how to determine the parameters of the noise distribution for large values of $\Delta$. Other distributions could also be used. This extends section 3.2 of the paper.

In that section we consider a DP mechanism using the following distribution:
$M(X) = f(X) + z; \quad z \sim p_{DoubleGeometric}(n)$
$\textrm{Pr}_{DoubleGeometric}(x|n) = Ae^{-\epsilon|n-x|}; \qquad x\in \{0,\ldots,2n\}$
for some normalizing constant $0 < A < 1$ and some $n \in \mathbb{N}$.

As a probability distribution this must sum to 1, which lets us solve for $A$. Letting $r=e^{-\epsilon}$, we get (see equation 10 in the paper)
$A=(1-r)/(1+r-2r^{n+1})$

If we have a sensitivity $\Delta$ and desired $\epsilon$ and $\delta$, then we need to cover the tail as

$$\delta \geq A \sum_{k=n-\Delta+1}^{n} e^{-k\epsilon}.$$

For $\Delta$ larger than 1, we give a simple script to compute the smallest $n$ such that the above inequality holds for a given $\Delta, \epsilon, \delta$.

import math

def RHS(n, Delta, epsilon):
    # Computes A * \sum_{k = n - Delta + 1}^{n} e^{-k * epsilon},
    # where A = (1 - r) / (1 + r - 2 * r^{n+1}) and r = e^{-epsilon}.
    # The sum can be written as A * \sum_{k = n - Delta + 1}^{n} r^k.
    r = math.exp(-epsilon)
    A = (1 - r) / (1 + r - 2 * r ** (n + 1))
    result = 0
    for k in range(n - Delta + 1, n + 1):
        result += r ** k
    return A * result

def find_smallest_n(Delta, epsilon, small_delta):
    # Finds the smallest positive integer n such that
    # small_delta >= RHS(n, Delta, epsilon).
    # Lemma: the smallest such n is always at least Delta.
    for n in range(Delta, 2 ** 32):
        if small_delta >= RHS(n, Delta, epsilon):
            print("n =", n, "RHS =", RHS(n, Delta, epsilon))
            return n

assert find_smallest_n(1, 0.5, 10 ** (-6)) == 25  # example in the paper

find_smallest_n(100, 0.01, 10 ** (-8))
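Running find_smallest_n(100, 0.01, 10**(-8)) should yield the value behind the expected-noise figure quoted earlier (1827): the double geometric is centered at n, so its expectation is n.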

Thanks @benjaminsavage, @eriktaubeneck, @akoshelev, and @danielmasney for lots of discussions on this.

Meetings: Monthly VC Call to Discuss Open Issue and Progress

We will be holding monthly calls to discuss and make progress on open issues on IPA, open to anyone participating in the PATCG (which, in turn, is an open community group). We will take minutes in the same manner as the PATCG meetings, and we will report important progress back in the regular PATCG meetings.

Timing

The IPA authors and collaborators span many timezones, which makes timing difficult. For now, we are pegging the meeting to the second Tuesday of the month at 3pm Pacific time (whether PST or PDT) for 1 hour.

These are on the w3.org PATCG Calendar listed as "PATCG Ad-Hoc Meetings: IPA", with appropriate timezone conversions.

Logistics

Specify whether identity providers can constrain the set of helper parties

The getEncryptedMatchKey operation must encrypt identity secret shares towards specific helpers. Because the user agent implements the operation, it can exercise control over whether a set of helpers is acceptable to the user. Likewise, the site can exercise discretion about whether to call the operation and which helpers to request.

Can the identity provider constrain the set of helper parties it supports encrypting its identities towards? If so, how is it set? Updated? setMatchKey could be elaborated, or the identity provider could publish a separate statement about these.

How match keys are retrieved

We need a more thorough description of how match keys are stored and subsequently retrieved.

This is non-trivial, as we have the case where no match key is set, default values, and a need to consider how clearing site state interacts with this.

Attack on privacy budgets using fake sites and malicious Match Key Provider


Summary:

  • There is an attack that a malicious Match Key Provider (MKP) can carry out by creating fake sites and using those sites' privacy budgets to measure events that happened on its own source or trigger site.
  • We discuss some potential mitigations to the attack

I’ve been thinking about the overall threat model for managing the privacy budgets. Last summer, for the End-to-End doc, I was thinking about similar budget sharing attacks, but at the time those attacks seemed preventable; I have now been able to crystallize this attack into something we don’t prevent with what we are currently doing.

Background

We have recently been assuming that the associated data on encrypted matchkeys includes: the site, the trigger/source bit, the requested MKP, the epoch, and (previously) the RC intended to run the query. Also, for source-fan-out queries we’ve been assuming that the Helper Party Network (HPN) will verify that all source events were created on the site that is submitting the query and spending the budget.

Attack

Steps for the attack using a source site:

  1. Suppose there is a malicious MKP with a source site (a.example).
  2. a.example creates fake source sites a1.example, …, a10.example, and each one independently registers with the HPN to get a privacy budget.
  3. Trigger site b.example sends trigger reports to a.example.
  4. As the MKP, a.example can create the source reports for its source events without calling the API. Thus it can use the same underlying data to create copies of the source events that happened on a.example as if they happened on a1.example,...,a10.example.
  5. Now a.example, a1.example, …, a10.example all run source-fan-out queries with the trigger reports from b.example.
  6. Each one of these queries is over the same underlying data, and lets a.example learn about the events that happen on it using the budget from 10x more sites.

If we also required encrypted matchkeys to include who would be the RC issuing the query (b.example deciding its events can be queried by a.example but not a1.example,...,a10.example), that would prevent this attack unless b.example is also colluding and requests copies for all the fake sites. But since we assume source and trigger sites will collude in our threat model, this would be an insufficient mitigation in the general case. A similar attack could be done with the MKP using a trigger site.

Potential Mitigations

The IPA team have all been thinking of some potential mitigations to the above attack:

  1. Remove having a MKP altogether and just use on-device matchkeys
    1. This would prevent a MKP from having the ability to create events that didn’t actually occur as a result of a user visiting a website.
    2. It would give up cross-device attribution, except in cases where a user's matchkey could use other OS-supported methods of being synced across logged-in devices.
    3. It would also make sort more expensive by requiring larger matchkeys, except that we may have the option to use an OPRF-based matching; see below.
  2. Limit who can be a MKP
    1. This doesn’t completely remove the attack but limits who could carry it out and makes those parties more trusted in the system.
    2. Ideally, we are able to combine this with additional technical mechanisms like query transparency that allow detecting if a matchkey provider is being malicious in this way.
    3. See also issue #42 on limiting who can be a MKP
  3. Query transparency
    1. With the addition of query transparency (having the HPN publish various high-level statistics about the queries being run and by whom), it could let you see that some of the same ciphertexts were used in many different queries (all the queries run by the fake sites in the attack above include the same set of trigger reports). This alone isn’t sufficient to tell you who is running the attack, as the real site (a.example) need not run a query itself, only the fake sites. But if there are a limited number of MKPs, detection of such behaviors could be used to trigger more frequent audits.
  4. Make privacy budgets at the business level
    1. When you sign up to get a privacy budget from the HPN, we could ensure more KYC to know what business is running the queries. This could make it harder than simply creating fake sites to get more budget.
  5. Use multiple non-colluding match key providers
    1. One idea @eriktaubeneck suggested was to use the XOR of match keys from different, non-colluding match key providers. It would give something closer to the intersection of the device graphs. @martinthomson has thought about an extension to this using a group key exchange.

Likely we would want to use a combination of the above methods. We think device generated matchkeys, 1), would be a decent MVP and would be similar to the first in-market test we are planning, but we will continue to research ways of supporting a MKP with better security guarantees.

Even without a MKP involved, there is the possibility of more general privacy budget sharing attacks that try to induce a client to submit events for many sites. If a site can induce a client to submit events for many sites, while knowing that they all came from the same person, it could carry out this attack on its own. For example, sites might attempt some form of bounce tracking to get an event registered as if it happened on multiple sites. There are probably some easy mitigations to bounce tracking and similar abuses, like having a minimum delay from page load until release of the match key, which could be cut short after user engagement.

Implications for MPC – option for OPRF based matching

If we were to go with mitigation 1) or 2) above, that changes the trust model around what sort of attacks a malicious MKP can carry out. If we carry this same new threat model through the MPC we can get some optimizations. Note that from a privacy/security standpoint we don’t have to make this change to matching; it is just that this becomes an option we could more likely pursue as an optimization.

  1. Consider first that if there is no MKP: in this case, if we still cache the matchkey ciphertext per domain to maintain a sensitivity bound on the matchkey (#35), we get a much stronger bound. No longer can a malicious matchkey provider increase the cardinality of known matchkeys. So all matchkeys have cardinality bounded by the number of domains likely to be visited from one device in an epoch. With this bound in place, we can return to using an OPRF-based join instead of sorting, and can prevent the cardinality attack with differential privacy.
  2. If we limit who a MKP can be and trust them not to carry out attacks on the privacy budget, we might as well also trust them not to run malicious cardinality attacks.
  3. Note that if we don’t switch to an OPRF based matching but do include who would be the final RC to run the query in the encryption of the matchkey as associated data, that could mess up the ciphertext caching we hope to use to enable private sharding.

Overall threat model for privacy budgets

This attack is a particular example of the sort of attacks sites might try to carry out through sharing their budgets. We should expand the threat model document (link) to capture a more robust definition of a privacy budget threat model. Other APIs are susceptible to similar budget sharing attacks. PCM is known to be vulnerable to a similar style of fake-sites attack; see 2.2.5 of https://mozilla.github.io/ppa-docs/pcm.pdf. ARA has been adjusting its privacy grain and limits to deal with similar privacy budget sharing concerns; see WICG/attribution-reporting-api#661 and WICG/attribution-reporting-api#725

Privacy budgets

Any application of differential privacy is dependent on understanding the sensitivity of the system. If we are going to provide some defensible definition of the privacy properties of IPA - at least in terms of our chosen rubric of differential privacy - we need a solid understanding of the scope of information the system ingests so that we might understand what a contribution from a single person might look like.

The goal from the outset is to reach some sort of conclusion about the total information release about each user, over each epoch (read: week), to each site (advertiser/publisher/etc...) that uses the API, for all the queries that might be made (both trigger and source queries). There are multiple dimensions to this, with each having different considerations.

We're looking for input on a general strategy of bounding contributions from individuals so that we might use a more generous $\epsilon$ value when applying noise to results. As we add dimensions along which we cannot limit contributions directly, we want to see if there are some reasonable assumptions we might make regarding total contribution so that we don't have to shave the overall $\epsilon$ too finely and increase noise.

Multiple Queries and Breakdowns

The noise added by DP protections can be reduced if different noise is contributed over multiple queries.

The first major split in IPA queries is between source and trigger queries. Each site is able to make each type of query independently, with source queries only containing source events for that site and trigger queries only containing trigger events for that site. This immediately adds a factor of 2 to the information release or sensitivity of the system.

Multiple queries of the same type are another major feature. Sites need to be able to query the same data different ways. The current design aims to manage the risk by providing sites with a fixed $\epsilon_e$ budget that can be expended over multiple queries. This is mostly effective in the sense that each query can then expend a finite amount of budget. Sites making the queries can manage how much noise they can tolerate on each query.

This budget cannot be infinitely finely sliced. Given the nature of the system, we are probably looking at something less than ideal $(\epsilon, \delta=0)$-differential privacy. We likely need a $\delta$ value that is non-zero. That means that there will need to be finite limits on queries to avoid the overall value of $\delta$ growing very large over multiple queries.

Multiple breakdowns over the same events look very much like separate queries from this perspective, so while we might get some greater assurances when it comes to accountability (and accounting!), for analysis purposes we can consider them to be multiple queries.

This approach basically assumes that every user is present in every query. From a privacy perspective, this means that it forces careful limiting of queries and ensures that the total query count is reflective of the privacy cost. There's a big caveat there for the grouping construct, but I'll take that up on a different issue.

We have discussed using geolocation (country/jurisdiction) as a sharding key that would allow a site to effectively split their $\epsilon_e$ budget so that they can make independent queries. That too can be taken up on a different issue.

Overall, how the design manages multiple queries seems manageable.

Users and Devices

Though IPA attempts to provide cross-device/cross-platform attribution, nothing forces a site to use a functional match key provider that would enable this. In the extreme case, a user might contribute events from multiple devices in a way that is known to a site, but not visible to IPA itself for use within budgeting.

Here, we might simply need to make some assumptions about the number of devices on which a user might present.

Note here that it is probably not sufficient to assume that activity will be restricted to a subset of a user's devices. If we consider an adversary that is motivated to recover more information about a user than we wish to reveal, then they can use the API to generate information for any device that sees any use. IPA only requires that sites are visited before an event can be generated. So any assumption needs to capture the number of devices that see use in any given epoch.

Sites

A more difficult aspect of this is that each site makes its own queries independently. This is a critical usability feature of the design: we don't want a situation where the queries one site makes cause another site to be able to make fewer queries. So each site operates entirely independently.

This makes sense from a functional perspective, but as a practical matter sites are able to join user identities in a number of ways. The use of primary identifiers (email, phone numbers, federated identity, even just usernames) makes it possible for sites to link user identity across sites. Sites that have a shared understanding of user identity can potentially use that to amplify the information IPA provides. Multiple sites that make queries for the same users relative to a target site might be able to reduce the effective noise in results, allowing them to learn something about those users on the target site. Think of this as being able to average out the differential privacy noise through multiple queries, except that the multiple queries are split between sites and therefore not obviously accountable.

Time

A site is going to want to make queries over multiple epochs. No useful system can be built on a premise of an absolutely finite bound on the information release.

From a privacy perspective, the continuous release of information has always been the most difficult thing to come to terms with. This means that sites that dedicate their use of a measurement system can - given sufficient time - overcome any differential privacy protections we apply.

As a practical matter, it might be best to simply accept and acknowledge this limitation. We can then concentrate our efforts on describing the time period over which information is released, and on choosing an overall target $\epsilon$ for a period of a year or six months or whatever time period we deem appropriate. That value can then be translated into an equivalent value for any arbitrary period. Though any time period can be chosen, using a benchmark period will ensure that we can talk about it without constantly qualifying it.

Combining Different Dimensions

This means we have the following factors to consider:

  • Source/Trigger, $n_t = 2$
  • Queries, $n_q$
  • Number of devices for the same user, $n_d$
  • Number of potentially coordinating sites, $n_s$
  • Number of epochs, $n_e$

As far as I can see, each of these is independent. So - aside from queries, where we have a means of direct control - we might reasonably assign each a value and multiply them out to produce a single scaling factor. Then we can decide a global $\epsilon_G$ target value over which this is split. Then - if we go with Gaussian noise and can scale on $L^2$ sensitivity - the budget for a site over each week might then be something like:

$$ \epsilon_e = \frac{\epsilon_G}{\sqrt{n_d n_t n_s n_e}} $$
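To get a concrete sense of scale, a small illustrative calculation (every value below is an assumption for the example, not a proposal):

import math

epsilon_G = 10.0                  # hypothetical global target for the benchmark period
n_d, n_t, n_s, n_e = 3, 2, 5, 52  # assumed devices, source/trigger, colluding sites, epochs

epsilon_e = epsilon_G / math.sqrt(n_d * n_t * n_s * n_e)
print(round(epsilon_e, 3))        # ~0.253 per-site, per-epoch budget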

Effect of (Poor) Assumptions

Any assumptions we make will inevitably not hold universally. It is also important to understand how privacy might be degraded if assumptions are violated.

To take some extreme examples, we might assume that users will use 100 devices, which is high enough that very few people would ever find that to be a problem. If we scaled $\epsilon$ accordingly, the extra privacy cost to a user with 101 devices might be insignificant.

On the other hand, an assumption of $n_d=2$ devices might give us a more generous $\epsilon_e$, but this limit might be exceeded very often. While most people use only a couple of devices, some have quite a few more than two. So, while users who use 8 or more devices are a small minority, it might not be acceptable for them to suffer privacy loss that is double (or more) our baseline assumption.

Any assumptions we make will naturally be violated over time, which gives us a means of converting a violation of assumptions into something more meaningful. The increased information exposure that results from exceeding an assumed limit might then be expressed as a reduction in the time it takes to reach the global $\epsilon_G$ target. The inverse is also true: if fewer than the assumed number are used, then the time extends.

Feedback Needed

Is this a reasonably comprehensive survey of the options? Is it reasonable to make some assumptions about sensitivity rather than describe absolutes?

Is this composition approach reasonable? Are these properly independent and without any means to exploit correlations to improve the per-epoch $\epsilon_e$ value?

Are there values that we might use for these assumptions? What principles might you apply in choosing those numbers?

Are there ways we can deal with the common cases where far fewer than the assumed limit is used? (My sense is that the answer is "no", but no harm in asking.)

Add a mechanism to support breakdowns by trigger data

It is a very common use-case to query conversion aggregates broken down by information embedded in the trigger/conversion associated with a given attributed impression. For example:

  • The “type” of conversion e.g. an enum like a purchase, email signup, etc
  • A signal about whether the conversion should be an input into a bidding optimization algorithm
  • Finer grained breakdowns on things like purchase cart contents (product categories, etc)

Supporting this use-case will bring more value to IPA, and we can do it without impacting the privacy bar, so we should consider adding support for this in IPA.

We have support for this in ARA in two places:

  1. In event-level reports, we have a “trigger data cardinality” of 2 for views and 8 for clicks. This allows event-level reports to embed 1-3 bits of trigger-side metadata.
  2. In aggregatable reports, we allow the aggregation key (similar to IPA’s breakdown key) to embed an arbitrary amount of trigger-side data.

One caveat here is that the kind of “conversion label” is sometimes a function of both source and trigger-side data. For example, Google Ads has a “selective optimization” feature that allows for labeling a conversion as eligible for bidding optimization based on the campaign. This is a useful technique for reducing the dimensionality of the output (and thus saving privacy budget).

There are many techniques to explore to satisfy this use case, but at a high level, this requires two things:

  1. Some mechanism to "mix" breakdown keys across contexts (e.g. binary OR, etc.)
  2. Some mechanism to query mixed breakdown keys (e.g. automatically querying the cartesian product, callers directly specifying the output domain, some thresholding mechanism for sparse histograms, etc)
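For instance, a sketch of the simplest mixing mechanism, in which source and trigger contribute disjoint bit ranges and the mix is a binary OR (the widths here are arbitrary):

SOURCE_BITS, TRIGGER_BITS = 8, 4  # hypothetical key layout

def mix_breakdown_key(source_key, trigger_data):
    # Source-side key occupies the high bits, trigger-side data the low bits.
    assert source_key < 2 ** SOURCE_BITS and trigger_data < 2 ** TRIGGER_BITS
    return (source_key << TRIGGER_BITS) | trigger_data

# e.g. campaign 5 on the source side, conversion type 2 on the trigger side
assert mix_breakdown_key(5, 2) == 0b0101_0010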

More work would be needed to generate trigger breakdown keys as a function of source-side data. For reference on how ARA solves this, see our event explainer, whose mechanism works similarly for generating aggregatable reports. At a high level, we designed a simple DSL that allows selecting trigger data / aggregation keys based on source-side information.

IPA Biweekly Meeting Agenda 4/26

Agenda+:

Here is a list of proposed agenda topics for the biweekly PATCG call this week. I seem unable to add the agenda+ label to existing issues, so I will link out to some existing ones here for us to continue discussing.

Multi-touch attribution algorithms

Agenda request here to talk more about this and get publisher input on what MTA algorithms would be desired in the in-market test of IPA

on-device matchkeys -- short and long term plans

In the long term with on-device matchkeys we would like different browsers (and apps) to be able to access the same per-device matchkey; in the near term in the Chromium design what we can do is implement a browser specific matchkey.

Sharding

  1. We can give a short update on the design for caching matchkey shares instead of ciphertext
  2. We should follow up on discussion of whether webviews would impact sensitivity of matchkey cardinality

Trigger side-breakdowns

@csharrison has a couple slides on trigger side-breakdowns we didn't get to last meeting.

IPA Biweekly Meeting Agenda Request - Additional Sharding Designs

Agenda+: What do you want to discuss?

@marianapr and @schoppmp have been thinking about additional ideas for sharding that could benefit IPA scalability. We would like to let them present and get feedback on this direction in a future PATCG/IPA call.

To pose a few questions about their design in advance:

  1. What is the setting/assumptions? Can it match IPA's malicious, honest-majority setting?
  2. Can it work as a wrapper around IPA's main MPC queries, in that it partitions the data into shards and each shard runs IPA's MPC query?
  3. Does it work with stable sharding on timestamps, or would sorting on timestamps be needed afterwards?
  4. How does it prevent leakages, like cardinality attacks on matchkeys, and how does it ensure that shard sizes are differentially private?

Time

We can probably spend most of a meeting on this

Issue Link

Other Links

A couple other issues related to sharding for IPA are:
#35
#49

IPA Biweekly Meeting Agenda Request - Event Labeling Queries with per matchkey DP bound

Agenda+: What do you want to discuss?

There has been considerable interest in supporting event-level outputs in IPA (IPA issue 60, PATCG issue 41). This was discussed at the May 2023 PATCG meeting, where the consensus was that we could support this so long as we can enforce a per-user bound on the information released.

I've been working on outlining (#70) how to do this with an Event Labeling Query that labels events with noisy labels, in a way that lets us maintain a per-matchkey DP bound. It would be great to iterate with this group on the requirements and initial design.

Time

~30min

Issue Link

I opened this as a PR on the end-to-end doc, as I think that will be an easier place to iterate on the design than an issue: #70

Clarification on protection offered from restricting to single source/trigger websites

In the end to end doc it states:

Finally, to prevent a report collector from circumventing their privacy budget by running duplicate queries, we need the helper party network to enforce that a report collector, (acting on behalf of a specific source or trigger website app) can only run source or trigger fanout queries related to that website/app. We can achieve this by having the helper party network ensure that:

  1. That they (the helper party network) are the helper party network committed to by the report collector.
  2. All provided encrypted match keys are from the match key provider committed to by the report collector.
  3. That the current source or trigger fanout query only contains source or trigger reports (respectively) with a match key generated on the report collector’s associated website/app, which is recorded in the current website/app field of the data.

I can understand why you would want to enforce (3) (e.g. you don't want to support other types of queries than single-site trigger/source fanout), but I am not sure why removing it allows circumventing privacy budgets. In particular, the website/app does not seem to appear in the "grain" specified in the budget management section, which specifies a grain of (match key, report collector, epoch).

e.g. if a trigger fanout query contains multiple trigger sites, what breaks here? What kind of "duplicate queries" are we worried about?

Threat model: Privacy of data known in the clear to the report collector

A question that came up in the context of #60 is whether the threat model of IPA aims to protect data known in the clear to one of the parties (e.g., the record collector). It seems clear that cross-site data (the match key) and anything derived from it must be protected by DP, but the draft is not explicit about whether this also applies to other inputs. So are protocols that assume some of the inputs are known to all helpers (but not necessarily public) within scope? For randomized response, one could for example think of priors derived from first-party data that can then be used to improve accuracy by restricting the domain of the randomized response (as for example done here).

CC @marianapr @csharrison

Client-server bandwidth optimization

tl;dr it might be possible to reduce client-server bandwidth to one sixth, with an increase in server-server bandwidth of the same magnitude

Our current plan for managing bandwidth usage has report collectors submit input events to each helper separately. After creating a query, which happens on one helper, the events that are destined for different helpers are sent to that helper directly. However, there is a model in which the client-server bandwidth can be greatly reduced, inspired by a trick that the libprio people have implemented.

In the direct-to-helper mode, each helper receives events that comprise an encrypted blob (containing two shares of a match key and some supplementary stuff), metadata (like site, epoch, and so forth), and shares of breakdown/trigger/timestamp. Shares are replicated.

Even in this mode, we could greatly reduce client-server bandwidth by taking two of the three shares and replacing them with shared randomness. This potentially applies to both match keys and the additional shares, but the greatest benefit comes from compressing the additional shares.

The basic idea is that two of the three shares are entirely replaced by a seed for a CSPRNG or PRF. We then define a way to use that RNG to generate values for those shares. For the breakdown/trigger/timestamp/... shares, these all come from the same node, so a single seed is sufficient to supply values for those shares. The final share is set to $x_2=x\oplus f(s_0, j)\oplus f(s_1, j)$, (where $f()$ is the PRF, $s_i$ is the seed that replaces shares of $x_i$, and $j$ is an identifier for this particular share). This final share has to be transmitted in full.
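A minimal sketch of this seed expansion, assuming HMAC-SHA256 as the PRF $f$ and XOR sharing; the names here are illustrative rather than from the proposal:

```python
import hashlib
import hmac
import secrets

def f(seed: bytes, j: int, width: int = 4) -> int:
    """PRF: the j-th pseudo-random value expanded from a seed."""
    digest = hmac.new(seed, j.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:width], "big")

# Client side: shares x0 and x1 are never sent; only their seeds are.
s0, s1 = secrets.token_bytes(16), secrets.token_bytes(16)
x = 0x1234ABCD   # a value being shared (e.g. a breakdown key)
j = 42           # identifier for this particular share
x2 = x ^ f(s0, j) ^ f(s1, j)   # the one share transmitted in full

# All three components together reconstruct x; each helper holds only
# two of them, preserving the secret sharing.
assert x == f(s0, j) ^ f(s1, j) ^ x2
```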

The drawback stems from our use of replicated secret sharing: two different helpers each need a copy of the full share. But if we are willing to have helpers manage distribution of that share, we can encrypt the stream of full shares under a key that we make available to the two helpers that will receive it. The same trick applies to the shared seeds.

That means that, for the shares they submit, report collectors could generate:

  • $\left\{k_{s,0}, k_{s,1}, k_{s,2} \right\}$ - symmetric keys for encrypting shared data; these are never sent directly
  • $\left\{e_{aead}(k_{s,0}, s_0), e_{aead}(k_{s,1}, s_1), e_{aead}(k_{s,2}, X)\right\}$ - symmetric encryptions of the two CSPRNG seeds, plus the (large) symmetric encryption of the final set of shares
  • $\left\{e_{hpke}(k_{h,0}, (k_{s,0}, k_{s,1})), e_{hpke}(k_{h,1}, (k_{s,1}, k_{s,2})), e_{hpke}(k_{h,2}, (k_{s,2}, k_{s,0}))\right\}$ - public key encryptions of the symmetric keys toward the public key of each helper, which means that each helper is able to decrypt exactly two of the three values provided (a seed or the full shares)
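A sketch of that envelope structure, using AES-GCM from the `cryptography` package for the AEAD layer; note that a pre-shared key stands in for the HPKE step, which a real implementation would perform toward each helper's public key:

```python
import secrets
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def seal(key: bytes, plaintext: bytes) -> bytes:
    """AEAD-encrypt, prepending the nonce (stand-in for e_aead)."""
    nonce = secrets.token_bytes(12)
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

# Per-submission symmetric keys k_s0..k_s2; never sent directly.
k_s = [AESGCM.generate_key(bit_length=128) for _ in range(3)]

s0, s1 = secrets.token_bytes(16), secrets.token_bytes(16)  # CSPRNG seeds
X = b"...concatenation of the final, full shares..."       # the big payload

# Symmetric layer: two small seed ciphertexts, one large share ciphertext.
ciphertexts = [seal(k_s[0], s0), seal(k_s[1], s1), seal(k_s[2], X)]

# Key-wrapping layer: each helper can open exactly two of the three values.
# Pre-shared AES-GCM keys k_h stand in for HPKE toward helper public keys.
k_h = [AESGCM.generate_key(bit_length=128) for _ in range(3)]
wrapped_keys = [
    seal(k_h[0], k_s[0] + k_s[1]),  # helper 0: seed s0 and seed s1
    seal(k_h[1], k_s[1] + k_s[2]),  # helper 1: seed s1 and the full shares X
    seal(k_h[2], k_s[2] + k_s[0]),  # helper 2: the full shares X and seed s0
]
```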

Relative to what we currently do, this adds a small amount of fixed overhead, but it reduces per-record client-server overhead to one sixth. The cost is that the helper that receives this data needs to send a copy of the large per-record data to another helper node. If we value client-server bandwidth more highly than server-server bandwidth, this is likely a very good trade-off.

This trick could be used for match keys, but the savings are small enough that it is not worthwhile. The CSPRNG seed will be roughly equivalent to the size of the large match key we are considering for use in the OPRF (32 bytes). That's larger than the 40-bit value we are currently using, even with the doubling. That means that the only potential saving comes if we are using the larger OPRF, because we would avoid the cost of sending the two shares to each helper, but half of those costs are cancelled by the need to encrypt the keys separately, which costs 16 bytes extra.

Using same match key per-browser vs per user-profile

For the initial design, we have scoped the underlying match key down to one generated by the user agent. It would never be revealed to anyone and would just be stored on the device. This is due to the recently discovered malicious match key provider attack.

Based on ongoing discussion with @csharrison and @benjaminsavage, in this issue we want to discuss what the scope of a browser-generated match key should be under this narrowed design.

  • Should there be a separate match key for each user profile logged into the browser, or
  • should all profiles share the same match key on the device?

Arguments in favour of having the same match key across profiles

  1. In terms of attribution, there would be better coverage if ads shown on one profile could receive attribution for conversion events generated in another.
  2. Since sites cannot detect anything about the user from an encrypted match key, per-browser vs per-profile would have no impact on the end user experience of using websites.
  3. Separately, in terms of end-user control, users can disable the IPA API entirely in settings, or temporarily by using private browsing.

Arguments in favour of having different match keys across profiles

  1. It is not common for browser-global state like this to be shared between profiles.
  2. One common use-case for profiles is when multiple people use the same browser. Sharing a match key across multiple people who live together effectively becomes "household attribution", which might not be what advertisers / publishers want.

We would appreciate feedback about this.

Suggestion: Add more clarity about how multiple breakdowns could share one sort

In the last PAT-CG meeting, I mentioned the possibility of a different type of IPA query, where instead of passing in just one "breakdown_key" per source event, a report collector could specify an array of "breakdown_keys". For cases where there is a need to perform multiple, different breakdowns on the same set of source and trigger events, this would be more efficient from both a processing and a differential privacy perspective.

  1. The events only need to be sorted once. The various breakdowns could be computed off of the same sorted list. Since the "sort" operation dominates the cost of an IPA query, amortising that cost across multiple queries is attractive as a cost-saving measure.
  2. In this scenario, it's impossible for the report collector to modify the queries they are running in response to the results of prior queries, because they receive the results for all the different breakdown queries at once. This allows for more efficient composition of differential privacy: less noise needs to be added than if the queries were run one after another.

It would be good to add all of this to the "IPA end-to-end doc" somewhere.
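To illustrate the amortization, here is a hypothetical post-attribution aggregation step; the record shape and field names are invented for this sketch and are not part of the proposal:

```python
# Each attributed conversion carries the array of breakdown keys copied
# from its matched source event (hypothetical shape).
attributed = [
    {"breakdown_keys": [3, 17], "trigger_value": 10},
    {"breakdown_keys": [3, 4],  "trigger_value": 5},
    {"breakdown_keys": [1, 17], "trigger_value": 7},
]

# One sort + one attribution pass feeds every breakdown: results[i]
# aggregates by the i-th key in each array.
num_breakdowns = 2
results = [{} for _ in range(num_breakdowns)]
for row in attributed:
    for i in range(num_breakdowns):
        key = row["breakdown_keys"][i]
        results[i][key] = results[i].get(key, 0) + row["trigger_value"]

print(results)  # [{3: 15, 1: 7}, {17: 17, 4: 5}]
```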

Match keys without JavaScript (for browser implementations)

The proposal currently suggests that (when in a browser) the browser implement a new function, get_encrypted_match_key(). However, many sites serve ads on publisher websites without having the ability to execute JavaScript (even in an iframe).

We should explore ways in which a delegated report collector, acting on behalf of either the publisher or the advertiser, can receive an encrypted match key without requiring the ability to execute JavaScript on a website.

Matchkey Specification: UAV as fallback MKP

Summary:

  • We propose a way to optimize the matchkey size by letting the User Agent Vendor (UAV) be a fallback matchkey provider for its User Agents (UAs) instead of having the UA generate random on-device matchkeys when none has been set by a Matchkey Provider (MKP).

Our original design for setting matchkeys allowed for two ways of setting the matchkey: 1) a MKP calls the set_match_key() API, or 2) when no MKP has set a key, the device generates its own matchkey the first time get_encrypted_matchkey() is called. The second case leads to the need for much larger matchkeys to avoid collisions (think birthday paradox). Here we propose an additional way to set the matchkey, which allows the User Agent Vendor to be a fallback matchkey provider for its User Agents.

The User Agent Vendor would supply each of its User Agents with a unique matchkey. These matchkeys would only enable same-user-agent attribution, thus providing the same functionality as having the UA set its own matchkey randomly. But what we achieve here is a significant optimization in the size required for these matchkeys: since they are set uniquely by the UAV, we don't need to increase the size to bound the probability of spurious collisions. Since sorting in the MPC attribution scales with the number of bits in the matchkey, and sort is still the most expensive stage in MPC, this translates to a large optimization for the MPC.

Setting the matchkey

We propose three ways of setting the matchkey: 1) by a Matchkey Provider (MKP), 2) by the User Agent Vendor (UAV), or 3) by the UA itself. We also propose that the matchkey itself encodes how it was set so there is no overlap between matchkeys set from the different methods.

The structure of the matchkey is as follows (a code sketch follows this list):

  1. The first bit is called set_by_MKP and is 1 if the matchkey has been set by that matchkey provider and 0 otherwise. When a MKP sets a matchkey, they will set the 44-bit mk_from_MKP part to be unique for each of their users.

  2. In the case that a MKP has not set their matchkey on a UA and get_encrypted_matchkey() is called, the API will still return a matchkey, but it will be one unique to that user agent. For setting such a matchkey, we propose that the UAV act as a fallback MKP, supplying each of its user agents with a unique matchkey. The second bit of the matchkey is called set_by_UAV and is 1 if the UAV has set the matchkey and 0 otherwise. When the UAV has set the matchkey, they will use the next 8 bits to set their UAV_id. The reason for this is that we don't want collisions coming from different UAVs, and we don't want to rely on the User Agent String, separate from these IPA APIs, to determine which UA a user is using. With the remaining 34 bits, the UAV will provide each of its UAs with a unique key. Allocating 8 bits for the UAV_id helps future-proof the design; if not that many are needed immediately, browsers with lots of users could take multiple entries (from which they can allocate as though they were additional bits), which would allow small and large browsers to coexist.

  3. In the case that neither a MKP nor a UAV have set a matchkey the first 2 bits will be zeros and the UA will generate and store its own matchkey the first time get_encrypted_matchkey() is called. Since UAs would be doing this independently of each other there will be the possibility of collisions.
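As a rough sketch of the three encodings described above (bit strings for readability; the widths are taken from the text, so this is illustrative rather than normative):

```python
import secrets

def mk_from_mkp(mkp_value: int) -> str:
    """set_by_MKP = 1, followed by a unique 44-bit per-user value."""
    assert 0 <= mkp_value < 2 ** 44
    return "1" + format(mkp_value, "044b")

def mk_from_uav(uav_id: int, per_ua_value: int) -> str:
    """set_by_MKP = 0, set_by_UAV = 1, 8-bit UAV_id, 34-bit per-UA value."""
    assert 0 <= uav_id < 2 ** 8 and 0 <= per_ua_value < 2 ** 34
    return "01" + format(uav_id, "08b") + format(per_ua_value, "034b")

def mk_from_ua() -> str:
    """Both flag bits 0; 43 random bits drawn on-device (collisions possible)."""
    return "00" + format(secrets.randbits(43), "043b")

# The flag-bit prefixes guarantee that keys set by different methods
# can never collide with one another.
assert mk_from_mkp(123)[0] == "1"
assert mk_from_uav(5, 999)[:2] == "01"
assert mk_from_ua()[:2] == "00"
```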

[Figure: IPA Matchkey Specification]

Hiding user counts

When the MKP or the UAV sets a matchkey uniquely for each of its users, they likely do not want the matchkeys to reveal anything about their total user counts (in case there is a way, with device access, to learn what matchkeys have been set on your device). To prevent this, a MKP/UAV could use a keyed pseudo-random permutation (PRP) to map from a user index in [0, number_of_users) to a still-small space [0, 2^34). Since this is a permutation, it will not introduce any collisions, but it will prevent a few matchkeys from revealing anything about how many total users a MKP has. This may be one such available primitive to use for the PRP.
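One concrete way to build such a small-domain PRP (a sketch; the issue links to a specific primitive, which this does not attempt to reproduce) is a keyed balanced Feistel network over the 34-bit space:

```python
import hashlib
import hmac

HALF = 17                 # split the 34-bit domain into two 17-bit halves
MASK = (1 << HALF) - 1

def _round(key: bytes, r: int, half: int) -> int:
    """HMAC-SHA256 as the Feistel round function, truncated to 17 bits."""
    msg = bytes([r]) + half.to_bytes(3, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big") & MASK

def prp(key: bytes, index: int, rounds: int = 4) -> int:
    """Keyed permutation on [0, 2**34): user index -> matchkey value."""
    assert 0 <= index < 2 ** 34
    left, right = index >> HALF, index & MASK
    for r in range(rounds):
        left, right = right, left ^ _round(key, r, right)
    return (left << HALF) | right

# A Feistel network is a bijection for any round function, so distinct
# user indices always map to distinct matchkeys (no added collisions),
# while the mapping itself reveals nothing about user counts without the key.
print(prp(b"mkp secret key", 0), prp(b"mkp secret key", 1))
```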

Collisions for UA generated matchkeys

As mentioned above, for matchkeys that are generated independently on devices when no MKP or UAV has set the matchkey, there is the possibility of collisions. We would like to see how many such matchkeys a query can contain before spurious collisions start to appear. The expected number of matchkeys that collide with some other matchkey in the query is $E=n(1-(1-1/N)^{n-1})$, where $n$ is the query size and $N$ is the size of the matchkey space.

  • Using this formula, if we can tolerate an expected number of collisions < 10, then with 43 bits for the matchkey we need to have fewer than ~9.3M randomly generated matchkeys in a query.
  • If we can tolerate an expected number of collisions < 1, then with 43 bits for the matchkey we need to have fewer than ~3M randomly generated matchkeys in a query.
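These figures can be sanity-checked numerically (using `log1p`/`expm1` to avoid precision loss):

```python
import math

def expected_collisions(n: int, bits: int = 43) -> float:
    """E = n * (1 - (1 - 1/N)**(n-1)) with N = 2**bits."""
    N = 2 ** bits
    return n * -math.expm1((n - 1) * math.log1p(-1 / N))

print(expected_collisions(9_300_000))  # ~9.8, just under the E < 10 budget
print(expected_collisions(3_000_000))  # ~1.02, right around the E < 1 threshold
```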

Thanks @martinthomson for lots of good discussions on this and for suggesting how to hide the user counts and @benjaminsavage, @richajaindce and @akoshelev for reviewing.

Restrictions on match key providers

Given that match key providers have great power in the IPA proposal to do cross-device attribution and to match web events with arbitrary off-web events that can be linked to a given match key, we should consider possible mitigations to limit the harm here. I want to discuss two possible mitigations, but I’m open to more:

  1. permission prompt for setMatchKey

    The UA can show a permission prompt in front of the user when setMatchKey is invoked. While we're mindful that the UX would need to be constructed in a user-understandable way, it would mitigate this issue, especially if the prompt mentioned possible linking with off-web or cross-device activities.

  2. UA-allowlist for match key providers

    The UA can maintain an allowlist of match key providers, and anyone not on the list that tries to call setMatchKey will fail. This opens up the possibility of match key providers needing to abide by certain UA policies / guidelines to be allowed to set match keys. A policy layer is useful because IPA has no mechanism for technical enforcement of event provenance.

    Of course, this has downsides in that it opens up interoperability concerns if different UAs have different policies / allow lists of providers. It also makes the API less “open” in that it limits which entities can use the most powerful features. Still, something like this might be needed for some UAs to consider the proposal with setMatchKey “safe enough” for shipping.
