
Comments (11)

csharrison commented on September 26, 2024

Yes, I think it should provide enough protection (just my own opinion). If we can, we should validate with real-world testing to see whether the data is really necessary, though, as I think even without this you should be able to estimate the window with pretty high accuracy.

@johnivdel FYI.


csharrison commented on September 26, 2024

I think this really hinges on whether we need to add random noise to report delays, since the existing API does not have any noise except to stagger queued reports at browser startup.

If we implement what is currently described in the explainer, then this proposal doesn't seem necessary. If you get a report associated with a click, you can tell which reporting window it came from by looking at the time delta from the click time. The only source of noise here is browser uptime / startup staggering, which should hopefully be minimal.
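For concreteness, here is a minimal sketch (mine, not the explainer's; the window deadlines of 2, 7, and 30 days are illustrative) of how a reporting origin could map a report's arrival time back to a reporting window:

```python
from datetime import datetime, timedelta

# Illustrative window deadlines, measured from click time; the real
# schedule depends on the source configuration.
WINDOW_OFFSETS = [timedelta(days=2), timedelta(days=7), timedelta(days=30)]

def infer_window(click_time: datetime, report_time: datetime) -> int:
    """Index of the latest reporting window whose deadline precedes the report."""
    delta = report_time - click_time
    passed = [i for i, offset in enumerate(WINDOW_OFFSETS) if delta >= offset]
    if not passed:
        raise ValueError("report arrived before any window deadline")
    return max(passed)

# A report arriving ~7 days (plus a few minutes of startup stagger) after
# the click unambiguously maps to the second window.
click = datetime(2024, 1, 1)
print(infer_window(click, click + timedelta(days=7, minutes=4)))  # -> 1
```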

However, if we feel we want to add noise to the API, we should analyze why we are doing it and whether reporting the conversion window makes that noise less effective. For instance, in PCM, noise is needed because click-time information cannot be learned from a report, and conversions are reported after some delay relative to the conversion time. We may consider unifying this mode of delay (see privacycg/private-click-measurement#29 for more context), e.g. by adding noise on top of the conversion windows.

If we did something like that, the main thing we want to achieve is for PCM's privacy model to hold (i.e. your click time and your conversion time are well hidden). However, we may be able to come up with other solutions here that don't simply add the two delay mechanisms together.


jinghao commented on September 26, 2024

Thanks for the thoughtful response, @csharrison!

> The only source of noise here is browser uptime / startup staggering, which should hopefully be minimal.

This is entirely possible, though I don't have the data to confirm it one way or the other.

On the other hand, even small inaccuracies in our reports can significantly hurt our trust with advertisers. Facebook has been criticized in the past for inaccuracies of this magnitude in its metrics, so I hope you can see why I'm keen on getting the most accurate data possible, subject to privacy constraints.

> If we did something like that, the main thing we want to achieve is for PCM's privacy model to hold (i.e. your click time and your conversion time are well hidden).

If we report the conversion window at day-level resolution, I think we can maintain the privacy of both the click time and the conversion time. For instance, suppose there's a random delay of 0-24 hours for conversion reports. If we got a conversion with a 7-day window, we would know that the click happened 7-8 days ago (still a 24-hour window) and that the conversion happened sometime in the past 8 days. When you add in the additional noise of browsers being offline at the end of the window, it becomes unknowable.
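Here is the interval arithmetic for that scenario as a small sketch (the 7-day window and 0-24h delay are the example's numbers, not API behavior):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)       # reporting window from the example
MAX_DELAY = timedelta(hours=24)  # uniform random send delay from the example

def inferred_click_interval(report_received: datetime):
    """All the reporter learns: click_time = received - WINDOW - delay,
    with delay somewhere in [0, MAX_DELAY], i.e. a 24-hour-wide interval."""
    return (report_received - WINDOW - MAX_DELAY, report_received - WINDOW)

received = datetime(2024, 1, 9, 12, 0)
earliest, latest = inferred_click_interval(received)
print(earliest, latest)  # the click happened 7-8 days before receipt
# The conversion lies somewhere in [click, click + WINDOW], so the reporter
# only learns that it happened within the past 8 days.
```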

What do you think? If that doesn't work, how do you think we can move forward here? I'm really keen on finding solutions that let us credit conversions to the right window, because advertisers have different targets for each. Thanks!


csharrison commented on September 26, 2024

Good point. I thought about this with @johnivdel, and I think we concluded that reporting a time that omits browser downtime (and the associated randomized report delays at startup) is fine for the privacy model of preventing identity joining, but it unfortunately leaks information about user behavior that we might not want to leak: how long it has been since the user last used their browser. This is an unexpected thing to be able to learn from the API, so I would be cautious about adding it unless it's very necessary.

Note that this problem only exists for the event-level API. For the aggregate API, I believe we can send along the time the report was "scheduled" to be sent, since no report can be tied to any individual (unlike the event-level API).

However, I wonder how big a problem this really is. E.g. with reporting windows at 1, 7, and 28 days, a user who converts between days 1 and 7 would need to be offline for more than two weeks before they are "mixed in" with users who convert in the 7-28 day window. I think this should be pretty rare.
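A back-of-the-envelope check of that claim, assuming the 1/7/28-day schedule above (under these windows the exact figure is three weeks, consistent with the "more than two weeks" estimate):

```python
# Reporting window deadlines, in days after the click (the example schedule).
WINDOWS_DAYS = [1, 7, 28]

def offline_days_needed(conversion_day: float) -> float:
    """Days the browser must stay offline past the scheduled send time for
    the report to arrive no earlier than reports from the next window."""
    scheduled = next(w for w in WINDOWS_DAYS if conversion_day <= w)
    later = [w for w in WINDOWS_DAYS if w > scheduled]
    return (later[0] - scheduled) if later else float("inf")

print(offline_days_needed(3))   # converts on day 3 -> 21 days offline needed
print(offline_days_needed(10))  # last window: can never be mixed in later
```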


csharrison commented on September 26, 2024

Though I suppose, in some sense, you will be able to learn this to some extent with any implementation of the event-level API, especially if you configure just a single reporting window. In the existing design, if a report comes T time after the last reporting window, you know (modulo some complexity with startup scheduling) that the user hasn't started the browser for T time, so maybe this is moot.

Still, I think I would be interested in seeing whether this "implicit noise" is a problem in practice, since it does boost some of the privacy guarantees (e.g. it hides the true conversion delay).


jinghao commented on September 26, 2024

> Though I suppose, in some sense, you will be able to learn this to some extent with any implementation of the event-level API, especially if you configure just a single reporting window. In the existing design, if a report comes T time after the last reporting window, you know (modulo some complexity with startup scheduling) that the user hasn't started the browser for T time, so maybe this is moot.

Yeah, this is why I didn't consider it a new risk. We could always somewhat mitigate the problem by adding some random delay to firing off the report even after the browser is restarted.

You are right that in practice the cross-window case should be pretty rare. If we think the value of being 100% accurate outweighs the potential privacy risks, I'd still prefer this if possible.

> Note that this problem only exists for the event-level API. For the aggregate API, I believe we can send along the time the report was "scheduled" to be sent, since no report can be tied to any individual (unlike the event-level API).

That's great to hear.


csharrison commented on September 26, 2024

Yes, I think the natural way to solve this is to introduce some noise at startup that hides browser uptime a bit more. Right now we add uniform noise to simulate a shuffle, but it might make sense to switch to exponential noise, whose key property is an unbounded tail: a user always has some plausible deniability that they were actually online but got really unlucky.

Right now Chrome's uniform delay is between 0 and 5 minutes, so maybe Exponential[1/3] minutes is a reasonable (somewhat conservative) starting point: the mean delay is 3 minutes, and there is only a ~3% chance a report will be delayed more than 10 minutes at startup.
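The quoted figures check out: for an exponential with rate 1/3 per minute, the mean is 3 minutes and the tail probability beyond 10 minutes is e^(-10/3) ≈ 3.6%. A quick verification:

```python
import math

RATE = 1.0 / 3.0                    # per minute, i.e. Exponential[1/3]
mean_delay = 1.0 / RATE             # 3.0 minutes
p_over_10 = math.exp(-RATE * 10.0)  # survival function at 10 minutes
print(mean_delay, p_over_10)        # 3.0 0.0356... (~3-4%)
```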

The only problem with this approach is that it bunches up reports a bit more than the uniform noise we have currently (more of the probability mass sits near zero), so timing attacks to join impressions from different publishers may be more effective. We should weigh these two approaches, or maybe consider summing the noise from two sources to get the best of both worlds.
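A minimal sketch of that "sum two noise sources" idea, with illustrative parameters: a uniform component for spread (as in the current shuffle) plus an exponential component for the unbounded tail.

```python
import random

def startup_delay_minutes() -> float:
    """Hypothetical combined startup delay: Uniform(0, 5) + Exponential(mean 3)."""
    uniform_part = random.uniform(0.0, 5.0)         # current-style shuffle spread
    exponential_part = random.expovariate(1 / 3.0)  # unbounded tail, mean 3 min
    return uniform_part + exponential_part

samples = [startup_delay_minutes() for _ in range(100_000)]
print(sum(samples) / len(samples))  # mean ~5.5 minutes
```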


jinghao commented on September 26, 2024

If we label the conversion window, I think we could tolerate a wider delay than Exponential[1/3]. We're not trying to figure out when people's browsers come back online; we just want to be able to assign conversions to the right bucket. What do you think?


csharrison commented on September 26, 2024

It seems reasonable to me, but one problem with wider delays is that they risk perpetually delaying conversions from users who use the browser only for very short periods before closing it (e.g. via Android killing the process). We will likely need to tune some of the parameters with data to make sure we're making the right trade-offs here.


jinghao commented on September 26, 2024

That makes sense. Do you think that will give us enough protection to include the attribution window in the conversion report?


johnivdel commented on September 26, 2024

Following up on this:

We've gone in the direction of not adding substantial random delays to reports. In the aggregate explainer, reports are annotated with their scheduled_report_time.

For parity between the two, it seems useful to also send the scheduled_report_time for event-level reports.

This would also be helpful for debugging, as it lets the reporting endpoint know what state the browser was in when it sent the report.

As noted above, in some scenarios you may learn more than before (e.g. a report from the 7-day window that got sent on day 40 because the browser was shut down can now be tied back to the 7-day window), but this should not affect the worst/"standard" case, because the report time is derived from impression-side data.
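For illustration, an event-level report annotated this way might look like the following (scheduled_report_time mirrors the aggregate explainer's field; the other field names follow the event-level explainer, but treat the exact shape as an assumption):

```python
import json

# Hypothetical event-level report body; values are made up.
example_report = {
    "attribution_destination": "https://advertiser.example",
    "source_event_id": "12345678",
    "trigger_data": "2",
    # When the browser intended to send the report (seconds since the epoch),
    # independent of any downtime-induced delay in actually sending it.
    "scheduled_report_time": "1643235600",
}
print(json.dumps(example_report, indent=2))
```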

