
private-ad-measurement's Introduction

private-ad-measurement

This document is an individual draft proposal. It has not been adopted by the Private Advertising Technology Community Group.


private-ad-measurement's Issues

Consider routing reports through an (untrusted) ingestion server

Rather than sending reports directly to the MPC system, I want to consider routing them through an ingestion server, e.g. operated by the advertiser / publisher / ad tech. I believe this does not regress the security or privacy stance of this API, and it comes with a number of benefits:

  • The MPC nodes need to keep much less state. The ingestion server is responsible for long-term storage of the raw reports prior to issuing an aggregate query, and so the state the MPC nodes need to keep is only the minimum required for keeping track of anti-replay / privacy budgets.
  • Minimizing the responsibility of the MPC will make deploying MPC nodes much easier.
  • This allows us to support some more flexible queries without changing the MPC protocol at all. These include:
    • Queries that group together many conversion IDs
    • Queries that filter out records (e.g. based on a context ID), for instance if they are marked as spammy

We should figure out what anti-replay state management would look like in this system, but I think it will end up overall simplifying the system and making it more flexible.
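To make the proposed flow concrete, here is a minimal sketch of such an ingestion server; the names (Report, IngestionServer, build_query) are hypothetical, not part of the PAM design, and the encrypted shares are treated as opaque bytes:

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Report:
    context_id: str          # cleartext ID tying the report to its context
    conversion_id: int       # grouping key
    encrypted_shares: bytes  # opaque to the ingestion server

@dataclass
class IngestionServer:
    reports: List[Report] = field(default_factory=list)

    def store(self, report: Report) -> None:
        # Long-term storage of raw reports lives here, so the MPC nodes
        # only keep anti-replay / privacy-budget state.
        self.reports.append(report)

    def build_query(self, conversion_ids: Set[int], spammy: Set[str]) -> List[bytes]:
        # Group several conversion IDs into one query, and drop reports whose
        # context was later flagged as spam; the MPC protocol is unchanged.
        return [r.encrypted_shares for r in self.reports
                if r.conversion_id in conversion_ids and r.context_id not in spammy]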

Private Logistic Regression in PAM

Model training on private client data is a well-motivated use case for ads. There are several approaches which could be compatible with PAM. The original PRIO paper (section 5.3) describes how to implement linear regression using the aggregations supported in PRIO. Upon reflection, it seems that logistic regression could be implemented just as easily, perhaps with an approach like this one. This seems somewhat similar to a paper Apple published on Federated Learning, in that PRIO is used to privately aggregate updates from many devices.

Clarifying the problem setup

The training dataset consists of $N$ $k$-dimensional feature vectors

$X^{(1)}, X^{(2)}, \ldots, X^{(N)} \in \mathbb{R}^k$

where each component of $X^{(i)}$ is in the range $[0,1]$.

There are also $N$ corresponding binary labels for whether this user had a conversion or not

$y^{(1)}, y^{(2)}, \ldots, y^{(N)} \in \{0,1\}.$

We assume that the features are all known in the clear to one party, and that the piece we are trying to keep private is the label. In particular, we want a label-DP guarantee for the labels. For more details see the above writeup, but the core idea to focus on here is that the part of the computation that needs to be computed privately is the following term

$$\frac{1}{N} \sum_{i=1}^{N} y^{(i)}X^{(i)} \in \mathbb{R}^k$$

which is a linear combination of $k$-dimensional feature vectors.

The key parameters of the problem are

  1. The number of dimensions, $k$, in the feature vectors (e.g. 32–256 is probably a good range to evaluate).
  2. The number of bits, $b$, of precision used to represent the fixed-point numbers in the range $[0,1]$ (e.g. 8 bits is probably a good benchmark).
  3. The number of rows, $N$, in the training dataset.
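For concreteness, here is a short sketch computing the target quantity in the clear; the parameter values are the illustrative ranges above, and the quantization scheme is an assumption:

# Cleartext computation of the quantity the MPC must compute privately:
# (1/N) * sum_i y_i * X_i. Parameters (k=32, b=8, N=1000) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
k, b, N = 32, 8, 1000
X = rng.random((N, k))                      # features in [0, 1], known in the clear
X = np.round(X * (2**b - 1)) / (2**b - 1)   # quantize to b bits of precision
y = rng.integers(0, 2, size=N)              # private binary labels

target = (y[:, None] * X).sum(axis=0) / N   # the k-dimensional private term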

Implementing in PAM

There are two main ways this could be implemented in PAM and they likely vary considerably in performance:

  1. PAM with an impression nonce (see discussion here about doing this in PAM)
  2. PAM without an impression nonce

If impressions can be identified by a nonce, we don’t need to send large feature vectors to and from the device but can join the label to the features before submitting to the MPC.

Using reports with impression nonce

To implement this logistic regression in PAM for publisher queries, we can keep client-server bandwidth low by implementing it in the following way:

  1. When a site registers an impression, the device generates a random nonce which is returned to the caller.

    1. The publisher creates secret shares of the feature vector $X^{(i)}$, which they associate with this nonce.
  2. The device computes the attribution result for this impression, which is the label $y^{(i)}$.

  3. Later, the publisher report is sent as (nonce, EncryptedSecretShares($y^{(i)}$), DP parameters).

  4. Upon receipt of the attributed impression report, the publisher joins the EncryptedSecretShares($y^{(i)}$) to the stored secret-shared feature vector $X^{(i)}$ having the same nonce.
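A minimal sketch of this flow, assuming two-party additive secret sharing over a prime field (PAM does not fix a sharing scheme) and eliding the encryption of shares to the helpers:

import secrets

P = 2**61 - 1  # prime field modulus (an assumption)

def share(value: int) -> tuple:
    # Two-party additive sharing: the shares sum to `value` mod P.
    s0 = secrets.randbelow(P)
    return s0, (value - s0) % P

# Step 1: at impression registration the device returns a random nonce,
# and the publisher stores secret shares of X^{(i)} under that nonce.
nonce = secrets.token_hex(16)
features = [200, 13, 255]  # X^{(i)} quantized to 8 bits (assumed encoding)
publisher_store = {nonce: [share(f) for f in features]}

# Steps 2-3: the device computes the label y^{(i)} and reports
# (nonce, shares of y); encryption of the shares is elided here.
y = 1
report = (nonce, share(y))

# Step 4: the publisher joins the label shares to the stored feature
# shares having the same nonce.
joined = (publisher_store[report[0]], report[1])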

MPC steps

  1. We need a way to prove to the MPC that the values in $X^{(i)}$ are appropriately bounded (i.e. each component of $X^{(i)}$ is in the range $[0,1]$).
    1. Option 1: The publisher computes a zero-knowledge proof that the components of $X^{(i)}$ are in the range $[0,1]$ and submits this along with a secret-shared $X^{(i)}$ to the MPC. The MPC Helpers check the zero-knowledge proofs.
    2. Option 2: The publisher creates an XOR sharing of each of the 8 bits of each feature, such that each secret-shared bit can only be a {0, 1} value. The MPC Helpers then interpret this as 8 bits of precision for a number in the range $[0,1]$.
  2. The MPC then multiplies every component of the secret-shared $X^{(i)}$ by the secret-shared label $y^{(i)}$.
  3. Next the MPC sums these across all records, to compute $$\sum_{i=1}^N y^{(i)}X^{(i)}$$
  4. MPC adds DP noise to the $k$-dimensional output vector
  5. Reveal the $k$-dimensional output vector
  6. Scaling by $1/N$ can be computed outside the MPC.
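The following is a cleartext simulation of the arithmetic in steps 2–6, using the Option 2 bit interpretation; it is not an MPC protocol, just the computation the helpers would perform on shares, and Laplace noise is only one possible DP mechanism (the proposal does not fix one):

import numpy as np

rng = np.random.default_rng(0)
k, b, N, epsilon = 32, 8, 1000, 1.0

# Option 2 interpretation: 8 XOR-shared bits per feature are read as a
# fixed-point number in [0, 1], so the bound holds by construction.
bits = rng.integers(0, 2, size=(N, k, b))
X = (bits * (2.0 ** -np.arange(1, b + 1))).sum(axis=2)   # each entry in [0, 1)

y = rng.integers(0, 2, size=N)
total = (y[:, None] * X).sum(axis=0)         # steps 2-3: sum of y_i * X_i

# Step 4: flipping one label changes each component by at most 1, so the
# L1 sensitivity of the k-vector is at most k; scale the noise accordingly.
noisy = total + rng.laplace(scale=k / epsilon, size=k)

result = noisy / N                           # steps 5-6: reveal, then scale by 1/N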

Using reports without impression nonce

To implement this logistic regression without an impression nonce, the feature vector would need to be sent down to the device. The multiplication of the feature vector by the label would happen on the device. The resulting $y^{(i)}X^{(i)}$ vector would then need to be secret shared and sent off the device for aggregation. The device would also need to generate a zero-knowledge proof that the values in $y^{(i)}X^{(i)}$ are appropriately bounded. The MPC would then check these proofs, aggregate, and add DP noise. This approach considerably increases the report size and client-server bandwidth. The MPC will be somewhat cheaper, since it does not need to compute the product $y^{(i)}X^{(i)}$ itself.
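A back-of-the-envelope comparison of the per-helper report payloads under the two approaches; the field-element size is an assumption, and encryption overhead and proof size are ignored:

# With a nonce the device only ships shares of the single label y;
# without a nonce it ships shares of the full k-dimensional y*X vector.
field_bits = 64                      # assumed share size per element
k = 256
with_nonce = field_bits              # one shared label
without_nonce = k * field_bits       # a full shared vector
print(without_nonce // with_nonce)   # k times more payload per report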

Questions:

  1. Would Apple consider offering private logistic regression in an ads measurement API, using any of the above approaches or others?
  2. What are your thoughts on the tradeoffs of these two approaches and do either seem viable?

Per-impression privacy budgeting vs. explicit conversion limits

The current design for PAM requires the impression and conversion to “agree” on how many conversions an impression can contribute to. This works OK in the steady state, but during “strategy” migrations it can be fragile. For example, imagine I am running a campaign where I set the conversion limit to 3, but I realize halfway through that the noise is just too high so I want to reduce this to 2. So, I change the impression-side registrations to use 2 instead of 3, but at this point we still have a bunch of pending impressions with 3 that are not compatible with “2” conversions, so it isn’t clear how to complete this migration on the conversion side.

The ARA approach solves this by giving each impression a fixed “budget” that it can “spend” somewhat arbitrarily at conversion time, and if an impression doesn’t have enough budget it results in no contributions. This gracefully degrades during migrations. In the example above, we would just begin consuming half of the budget per conversion from each impression, rather than a third.
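A minimal sketch of this accounting, with a hypothetical budget constant:

BUDGET = 6  # abstract units per impression (hypothetical constant)

class Impression:
    def __init__(self) -> None:
        self.remaining = BUDGET

    def try_spend(self, amount: int) -> bool:
        # An under-budget impression contributes nothing rather than
        # erroring, which is what makes migrations degrade gracefully.
        if amount > self.remaining:
            return False
        self.remaining -= amount
        return True

imp = Impression()
imp.try_spend(BUDGET // 3)   # campaign starts at 3 conversions per impression
imp.try_spend(BUDGET // 2)   # mid-campaign switch to 2; no migration needed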

This approach of maintaining an abstract "contribution budget" can allow for other flexibility, so stay tuned for a follow-up on this.

Support for combining reports with different IDs

Ideally, it should be possible to make aggregate queries as flexible as possible within the system. To that end, it might be useful to consider support for grouping together multiple histograms that otherwise would need to be queried separately.

Here is an example: imagine I am tracking two types of conversions, purchases and “add to cart” events. Usually I query these independently to get breakdowns per type so I give them different Conversion IDs. However, during a particular week I may have a reduced volume of reports, and so I’d prefer to get a histogram combining the purchase and add to cart reports, which allows me to get aggregate data with less overall noise (one noise addition vs. two).
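To quantify the benefit: holding the per-query Laplace scale fixed, summing two independently noised histograms carries √2 times the noise standard deviation of a single combined query (a sketch of the arithmetic, not a statement about PAM's actual mechanism):

import math

epsilon, sensitivity = 1.0, 1.0
scale = sensitivity / epsilon            # Laplace scale per query (assumed)
std_one = math.sqrt(2) * scale           # std of a single Laplace draw
std_sum = math.sqrt(2) * std_one         # std of the sum of two independent draws
print(std_sum / std_one)                 # ~1.41x more noise when querying separately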

Note that this is the analog of ARA/issues/470 on the ARA side, allowing more dynamic decision-making.

Ad-hoc discussion Thursday October 19th

There will be an ad-hoc discussion of PAM and ARA on Thursday October 19th at 12 PM Pacific time to discuss on-device joining of advertising impressions and conversions.

Webex Meetings
https://appleinc.webex.com/appleinc/j.php?MTID=mf3f7592d8371f49ab500f8d7b38fb67a

Phone
tel://+14156550003,,24887616395,,,#

Global call-in numbers:
https://appleinc.webex.com/cmp3300/webcomponents/widget/globalcallin/globalcallin.do?siteurl=appleinc

Dial from video system [AVCN]:
Enter the Meeting ID: 24887616395

Private measurement of single events (and privacy mechanisms optimized for this setting)

The existing advertiser and publisher reports support the private measurement of “single events”, a topic we’ve previously discussed in the PATCG (and developed a high-level consensus on here). The two types of reports naturally support this setting if the conversion ID / Ad ID is unique and you make an “aggregate” query over just a single instance.

Given that we see a lot of value in supporting this setting (particularly for the “publisher reports” which reveal impression features), I have two questions:

  1. Can you confirm that these queries are acceptable under the PAM privacy model? In other words, do you believe the DP protection is enough to protect these events? Given that they are technically supported in the existing design, I believe the answer is “yes”, but would like confirmation.

  2. If the answer to (1) is “yes”, are you open to considering additional privacy mechanisms that optimize for utility in this setting without regressing differential privacy? E.g., if you look at slide 6 in this presentation, you will see that for binary questions, the Laplace mechanism has ~15% increased effective noise rate vs. randomized response for epsilon = ln(3), but provides the same privacy protection. If you are open to considering alternative mechanisms in this setting, I would be happy to constructively engage on that front, since it’s an area we are investing research into.
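For intuition, here is a sketch of the per-report noise of the two mechanisms at epsilon = ln(3). It only shows the shape of the comparison; the ~15% figure on the cited slide presumably measures effective noise in a particular aggregate setting, and this calculation will not reproduce that exact number:

import math

eps = math.log(3)

# Randomized response on one bit: answer truthfully with prob p, flip otherwise.
p = math.exp(eps) / (1 + math.exp(eps))       # = 0.75 at eps = ln(3)
debias = 2 * p - 1                            # scale factor of the unbiased estimator
rr_std = math.sqrt(p * (1 - p)) / debias      # std of one debiased report

# Laplace mechanism on a sensitivity-1 count: std = sqrt(2) / epsilon.
lap_std = math.sqrt(2) / eps

print(rr_std, lap_std, lap_std / rr_std)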

Support for multiple impression-side keys, and naturally support both count + value queries

Currently the design supports an impression registering a single “key” / ad id that gets mapped to a single histogram bucket. This makes certain use-cases impossible to satisfy, like querying a hierarchy of impression features (see this paper, which explains how Google Ads uses hierarchical queries in ARA: https://arxiv.org/pdf/2308.13510.pdf).

A simple example: you might want to measure a hierarchy of {campaign, location, impression day}. You could combine these into a single key, but this effectively “spends” all of your privacy budget on the leaves of the tree, when there are more optimal methods if you care about reporting aggregates at every level of the tree (i.e. getting an accurate campaign count, campaign × location, and campaign × location × day).

Another use-case is if you just want two different views of the data that don’t form a hierarchy:

  • conversion counts by campaign
  • conversion counts by location

In ARA we support this by giving every impression an abstract “contribution budget”, which they can use to contribute across many keys. The underlying constraint is that the L1 sensitivity of the total contribution for a given impression is bounded by some large constant which noise in the aggregation layer is tuned to. Each key can get a separate fraction of the contribution budget, and so get different amounts of effective noise after aggregation.

This is tough to integrate into PAM since it requires listing out all of the Ad IDs that can convert at conversion time. An alternative is to have the conversion specify key labels along with a domain size. This allows you to more succinctly map into the aggregation key domain, especially useful if you want a “one to many” mapping.

Let me propose an alignment with ARA here, to show how this could be done (super simplified):

// Impression registration lists out label -> key mappings
{
  "aggregation_keys": {
    "geo": 0x5,
    "campaign": 0x159
  }
}

// Conversion registration allocates budget consumption to labels. Imagine the total
// budget is 100 and you want to spend more budget on the per-campaign queries.
// Internally, the labels are mapped to the keys listed out at impression reg
{
  "aggregatable_values": {
    "geo": 5,
    "campaign": 95
  },
  // Imagine we have ~100 geos and a domain that fits 400 simultaneous campaigns.
  // If any keys land outside of this domain, they are ignored.
  "histogram_size": 500
}

The above configuration would generate a 500-element vector where index 0x5 has value 5 and index 0x159 has value 95. This is much more succinct than pre-specifying 500 different Ad IDs at conversion time.
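A small sketch of how this (hypothetical) configuration expands into the histogram contribution, mirroring the JSON above:

aggregation_keys = {"geo": 0x5, "campaign": 0x159}    # from impression registration
aggregatable_values = {"geo": 5, "campaign": 95}      # from conversion registration
histogram_size = 500

vector = [0] * histogram_size
for label, value in aggregatable_values.items():
    index = aggregation_keys.get(label)
    if index is not None and index < histogram_size:  # out-of-domain keys are ignored
        vector[index] = value

assert vector[0x5] == 5 and vector[0x159] == 95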

Note that this approach is related to the approach mentioned in #3, which avoids the need for impressions to specify their maximum # of conversions. It also allows for measuring both counts and values, whereas (IIUC) the current PAM design only supports either count measurement or value measurement for a single conversion. Counts in this model can be thought of as just a fixed fraction of the contribution budget (determined by the API user).

Also: this strawman does not support many features that ARA has for specifying histogram buckets (namely, that the bucket can be a function of both impression and conversion-side information). If you are generally OK exploring more use-cases we can iterate on other features along these lines.

Clarifying Google’s sparse histogram use case for PAM

I’ll open this issue here on PAM, but it is more of a question directed to @csharrison. Charlie, I’d like to clarify my understanding of the use case you are trying to solve with the sparse histograms you talked about in the PAM ad hoc call, where you expressed the need to send a very large # of adIDs to the device for the mapping table used to generate Advertiser reports.

In showing ads on the open web, there are three use cases that seem like they could be related to what you’re looking for (informed by Ben Savage’s experience with Meta’s Audience Network):

  1. We want to provide the Advertiser with measurement of conversions across their ads. What we don’t think Advs want to see is a breakdown of how many conversions resulted from ads shown on each of the 10,000s of sites their ad appeared on. Rather, what they want to see is the results grouped by more actionable breakdowns, like the different creatives they have used. In this case, we don’t see the need for the mapping table sent to the device to be huge – it can just be on the order of the actual distinct creatives or ads the Adv wants to measure.
  2. An Adv may actually want to see a breakdown of how many times their ad was shown on each of the 10,000s of sites for the purpose of brand safety (the Adv caring to know their ad isn’t appearing on sites they don’t want their brand associated with). But this problem can be solved without any cross-site measurement: we don’t need to consider whether there was a conversion, and the ad network can just count how many impressions from the Adv are shown on each site and give the Adv this report.
  3. The third use case that really gets interesting is to support an ad network that needs to understand the post-click conversion rate for different site_ad pairs. What happens often is that the same ad shown on different sites may lead to very different rates of deep-funnel conversions. This is because some sites are poorly designed and result in a lot of accidental clicks. The ad network needs to know this about different sites to incorporate into the bid model telling the Adv how much they should bid to show their ad on different sites. What we want for calibrating this is a breakdown of how many conversions result from every site_ad pair – that may be too many breakdowns to get a good signal/noise ratio, so we could coarsen ad to ad_set or even put similar sites together so we have enough traffic to measure.

My understanding is you’re trying to solve something like this 3rd use case using Advertiser reports, which is why you need to ship down a set of adIDs roughly the size of the # of sites. I’ve been thinking about how you might be able to solve this 3rd use case using PAM publisher reports while keeping this huge mapping off the device. Luke clarified that doing what we usually call “late binding” of breakdown keys to publisher reports seems reasonable in PAM. In fact, PPM has an issue to potentially support just this in PRIO, by letting shares come with labels; the query is then just to aggregate all things with the same label.

I think this can let us solve the 3rd use case in the following way:

  • Suppose an Adv is showing an ad campaign with an ad network. The ad network will put the same adID on every impression even though they are displayed on many 10,000s of sites. Then the Adv’s conversion mapping contains just one adID.
  • Attribution happens on the device – let’s say everything is 1-day click-through. Then, 1 day after each impression, the device returns its report to the ad network. To support the late binding, each ad impression has a nonce that was placed in the impression and returned in the report (this doesn’t leak anything cross-site). The ad network can then group these impression reports by site_ad_set, where ad_set can be a coarser grain than the ad_id sent to the device (maybe up to the campaign level). They could also combine multiple sites into the same bin. Once they have received enough traffic for a bin, they can supply all of its reports to the MPC to aggregate and get a noisy total of conversions for that site_ad_set (see the sketch below).
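Here is a minimal sketch of that late-binding step; all names are hypothetical. The ad network keeps a first-party map from each impression nonce to (site, ad_set) and bins incoming reports by it before handing each bin to the MPC:

from collections import defaultdict

nonce_to_bin = {                      # recorded at impression time, first-party data
    "n1": ("news.example", "ad_set_A"),
    "n2": ("blog.example", "ad_set_A"),
}

incoming = [("n1", b"shares1"), ("n2", b"shares2")]   # (nonce, encrypted shares)

bins = defaultdict(list)
for nonce, shares in incoming:
    bins[nonce_to_bin[nonce]].append(shares)

# Once a bin has enough traffic, submit bins[(site, ad_set)] as one
# aggregate query to get a noisy conversion total for that pair.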

Charlie, can you clarify whether these are the use cases you’re trying to support, or if there is a more complex use case? Luke, if you see something about this construction PAM couldn’t support, please let me know.

Support filtering: Add a (cleartext) per-report context ID to reports

The privacy model of PAM assumes that the server receiving a report can associate it 1:1 with the context that created it. This is due to the fact that reports are sent unconditionally and deterministically when the API is invoked.

Given that this is assumed to be known by a potential adversary already, I think there is value in making this relationship explicit. API invocation should allow the caller to specify some context_id, which could be embedded in the cleartext of the report when it is sent out.

This ID is useful for a number of different use-cases:

  • Filtering spammy aggregates: an offline process can mark various impression / conversion events as spammy or fraudulent after they occur, based on the context that created the event. It would be useful to have a hook into this ID within a report, so that we could potentially support filtering out reports matching certain spammy IDs
  • Context IDs may be useful if we support post-hoc grouping of reports based on their identifier (will file a follow-up issue on this)
  • Context IDs make the API more understandable. Servers processing reports can (in theory) know which of their API invocations actually successfully led to reports being sent out and when (useful for debugging).

Let me know what you think, or if I’ve misunderstood the PAM privacy model in some way. This is the direction we are thinking of going in ARA in WICG/attribution-reporting-api#974, which matches the "unconditional report" model of PAM and introduces an ID for filtering along the way.
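For illustration, a report envelope carrying such an ID might look like the following sketch; the field names are hypothetical, not part of PAM or ARA:

report = {
    "context_id": "ctx-8f2a",                 # set by the caller at API invocation
    "payload": "<encrypted shares, opaque>",  # unchanged by this proposal
}

# A receiver can later drop reports whose context was flagged as spam,
# without ever touching the encrypted payload:
spammy = {"ctx-8f2a"}
kept = [r for r in [report] if r["context_id"] not in spammy]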
