We haven't yet had time to talk, but here are some quick thoughts.
The notation seems a bit overly heavy... do you really need to have this underlying abstract probability space and the maps X_λ, Y etc.?
It seems like working with iid samples x ∈ Rᵖ is probably fine and would make it much easier for many people to read.
Similarly, instead of introducing the parametrized set of functions X_λ and the realizations X_λ(ω), it would be more straightforward to just replace X(ω) with x and capture the λ dependence in the distribution, as in p(x|λ) and p(x, y|λ).
Your training data {x, y, λ} implies a prior distribution over λ, which is going to be a fundamental issue in the formulation of this problem and in the degree to which it is restricted to a Bayesian formalism. It is possible to try to guarantee some frequentist properties while optimizing the power with respect to some weighting over λ, i.e. using the prior to focus capacity and power, but not in the inference step. I guess you are keeping the loss function general here, but implicitly it's going to depend on the prior placed on λ.
In eq. 1 you come back to the idea of the underlying event space where ω is fixed on the LHS and RHS. Here it seems like the underlying event space is important, but it's not clear to me how this would work in a physics example. Here is a situation where I think it could work...
Sometimes we simulate an event with some nominal detector simulation (λ=0), get a set of particle energies and momenta, and then modify it in some deterministic way depending on λ. In that setting, you might say that Ω (the original event) and X (the modified event) are in the same space with X_{λ=0}(ω) = ω. Then you could impose equation 1 in practice. However, that covers only a small subset of the ways we describe how systematic uncertainties work.
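To make that picture concrete, here's a toy sketch (the global energy-scale transform X_λ(ω) = (1+λ)·ω is an invented example, not something from your draft): the nominal event ω and its modified version live in the same space, and λ=0 recovers the nominal event, so eq. 1 can actually be imposed.

```python
import numpy as np

rng = np.random.default_rng(0)

# omega: a "nominal" event from the lambda=0 simulation,
# here just a vector of particle energies (hypothetical example)
omega = rng.exponential(scale=50.0, size=5)

def X(lam, omega):
    """Deterministic lambda-dependent modification of the nominal event,
    e.g. a global energy-scale shift (invented for illustration)."""
    return (1.0 + lam) * omega

# Omega and X live in the same space, and X_{lambda=0}(omega) = omega,
# so equation 1 can be imposed with a common omega on both sides.
assert np.allclose(X(0.0, omega), omega)
assert not np.allclose(X(0.1, omega), omega)
```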
More generally, I have a generative model for X that depends on the parameter λ and encodes p(X|λ). I can generate from both, but I can't identify a common ω to make equation 1 practical.
Equation 2 doesn't require the ability to identify the same ω, so I'd just work with that. And if you don't need the common ω, then you can drop all the X_λ(ω) notation and just write equation 2 as
p(f(x) | λᵢ) = p(f(x) | λⱼ)   ∀ i, j    [this is called a pivotal quantity in statistics]
https://en.wikipedia.org/wiki/Pivotal_quantity
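A textbook toy example of a pivot (my example, not from the draft): for x = (x₁, x₂) with xᵢ ~ N(λ, 1), the difference f(x) = x₁ − x₂ is distributed as N(0, 2) for every λ, so p(f(x)|λᵢ) = p(f(x)|λⱼ) holds exactly. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # A pivot for a location family: the difference is invariant to the shift lambda
    return x[:, 0] - x[:, 1]

# Samples of x = (x1, x2) with x_i ~ N(lambda, 1) under two different lambdas
x_a = rng.normal(loc=0.0, scale=1.0, size=(200_000, 2))
x_b = rng.normal(loc=3.0, scale=1.0, size=(200_000, 2))

# f(x) ~ N(0, 2) regardless of lambda, so its moments agree across lambda
for name, fx in [("lambda=0", f(x_a)), ("lambda=3", f(x_b))]:
    print(f"{name}: mean={fx.mean():+.3f}  std={fx.std():.3f}")
```

Both lines print a mean near 0 and a standard deviation near √2, independent of λ.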
The profile likelihood ratio Λ(x) (used for the generalized likelihood ratio test) is (asymptotically) a pivotal quantity. So in our initial work where we learn the likelihood p(x|λ) we arrive at your goal once we profile out λ. So the goal of this work is either to a) improve on the asymptotic convergence of the profile likelihood ratio, or b) to essentially learn the profile likelihood ratio.
In our current approach, we learn the full likelihood in terms of parameters of interest and nuisance parameters, and then we use some optimization technique to profile the nuisance parameters and arrive at the profile likelihood ratio. This is expensive b/c we have to learn a big function and the optimization can be slow. So learning the profile likelihood ratio is a nice goal since the profiling will essentially be built in and we can learn a function that depends on fewer parameters.
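The cost of that inner optimization is easy to see even in a toy model (all of this is invented for illustration: one measurement x ~ N(μ + λ, 1) plus an auxiliary measurement a ~ N(λ, σ_a) constraining the nuisance). Profiling means re-minimizing over λ for every value of μ:

```python
import numpy as np

# Toy observed data and nuisance constraint width (invented numbers)
x_obs, a_obs, sigma_a = 1.2, 0.3, 0.5

def nll(mu, lam):
    # -2 log L up to constants: main measurement plus auxiliary constraint
    return (x_obs - mu - lam) ** 2 + ((a_obs - lam) / sigma_a) ** 2

# Profiling: for each mu, minimize over the nuisance lam. Here a brute-force
# grid scan; in the real problem this inner optimization is the slow part.
lam_grid = np.linspace(-3, 3, 2001)
def profiled_nll(mu):
    return nll(mu, lam_grid).min()

mu_grid = np.linspace(-2, 4, 601)
pnll = np.array([profiled_nll(mu) for mu in mu_grid])
mu_hat = mu_grid[pnll.argmin()]
print("profiled best-fit mu:", round(mu_hat, 2))
```

The joint minimum sits at λ = a_obs and μ = x_obs − λ, and the profiled scan recovers it; the point is that every μ in the scan pays for a full optimization over λ, which is what a learned profile likelihood ratio would fold into the function itself.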
However, the value of the nuisance parameter needed for profiling, i.e. the one that maximizes the likelihood conditional on the parameters of interest, depends on the entire sample of events. In my original version of the approximate likelihood ratio paper I wrote about this some... talking about event-level and experiment-level quantities. So I think any approach that tries to estimate the appropriate value of the nuisance parameter based on individual samples is doomed to fail.
Ok... reading along. I thought you were going to propose something different, but it looks like you are trying to learn λ̂ = r(f(X_λ; θ_f))... this is also what Radford Neal proposed in the paper that we referenced for our approximate likelihood paper. He tries to estimate the unknown nuisance parameters on an event-by-event basis. As I wrote above, I think trying to estimate λ̂ on an event-by-event basis is not the optimal strategy.
A variation on your idea... which is where I thought you were going... is to learn a mapping f(X) that has the properties of a pivotal quantity. We can call it "learning to pivot" :-) In that case you optimize the loss L_f(θ_f) while an antagonist tries to discriminate between the distributions p(f(x) | λᵢ) vs. p(f(x) | λⱼ). If f(x) is a pivot, then the discriminator won't be able to tell the difference. Here you never estimate λ̂(xᵢ); instead λ̂ is constant for all x and is essentially encoded into the model f(x; θ_f), just as it is for the profile likelihood ratio Λ(x). The loss function for the adversary would need to be generalized from the situation where you discriminate between just two classes to a continuous set of λ.
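To illustrate the adversary's job without the full adversarial training loop (everything here is an invented toy: feature 0 carries the signal, feature 1 leaks λ), compare a pivotal f with a non-pivotal one by how well a linear adversary can regress λ from f(x). Under this setup the adversary gets essentially zero R² on the pivot and a clearly nonzero R² on the leaky map:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data (invented): feature 0 is lambda-independent, feature 1 leaks lambda
n = 50_000
lam = rng.uniform(-1, 1, size=n)               # nuisance parameter per event
x = np.stack([rng.normal(0.0, 1.0, n),         # lambda-independent feature
              rng.normal(lam, 1.0)], axis=1)   # lambda-dependent feature

def adversary_r2(fx, lam):
    """How well a linear adversary predicts lambda from f(x), measured by R^2."""
    A = np.stack([fx, np.ones_like(fx)], axis=1)
    coef, *_ = np.linalg.lstsq(A, lam, rcond=None)
    resid = lam - A @ coef
    return 1.0 - resid.var() / lam.var()

f_pivotal = x[:, 0]            # ignores the lambda-dependent feature: a pivot
f_leaky = x[:, 0] + x[:, 1]    # uses it: the adversary can recover lambda

print("adversary R^2 vs pivotal f:", round(adversary_r2(f_pivotal, lam), 3))
print("adversary R^2 vs leaky   f:", round(adversary_r2(f_leaky, lam), 3))
```

In the actual scheme, f would be trained against the adversary so that its parameters are pushed toward the pivotal solution; this sketch only shows the signal the adversary feeds back.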
If we broke λ into N sub-classes indexed by c and used a soft-max for the predictions, I bet the equivalent of the d = 1/2 saddle point is to have d_c = 1/N. Maybe there is some continuum limit there. Thinking in real time... I can see that the implicit prior on λ is going to show up here b/c reparametrizing λ is like redefining the binning for the discretization of λ into the sub-classes indexed by c.
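To spell out that saddle-point guess (my reasoning, assuming a uniform prior π_c = 1/N over the sub-classes): if f(x) is a pivot then p(f(x)|c) is identical for every c, and Bayes gives d_c(f(x)) = p(c|f(x)) = p(f(x)|c) π_c / Σ_{c'} p(f(x)|c') π_{c'} = π_c = 1/N, i.e. the optimal discriminator collapses to the prior. With a non-uniform prior the saddle point would instead be d_c = π_c, which is exactly where the binning/reparametrization sensitivity shows up.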