We haven't yet had time to talk, but here are some quick thoughts.
The notation seems a bit overly heavy... do you really need to have this underlying abstract probability space and the maps X_λ, Y etc.?
It seems like working with iid samples x ∈ Rᵖ is probably fine and would make it much easier for many people to read.
Similarly, instead of introducing the parametrized set of functions X_λ and the realizations X_λ(ω), it would be more straightforward to just replace X(ω) with x and capture the λ dependence in the distribution, as in p(x|λ) and p(x, y|λ).
Your training data {x, y, λ} implies a prior distribution over λ, which is going to be a fundamental issue in the formulation of this problem and in the degree to which it is restricted to a Bayesian formalism. It is possible to try to guarantee some frequentist properties while optimizing the power with respect to some weighting over λ, i.e. using the prior to focus capacity and power, but not in the inference step. I guess you are keeping the loss function general here, but implicitly it's going to depend on the prior placed on λ.
In eq. 1 you come back to the idea of the underlying event space where ω is fixed on the LHS and RHS. Here it seems like the underlying event space is important, but it's not clear to me how this would work in a physics example. Here is a situation where I think it could work...
Sometimes we simulate an event with some nominal detector simulation (λ=0), get a set of particle energies and momenta, and then modify it in some deterministic way depending on λ. In that setting, you might say that Ω (the original event) and X (the modified event) are in the same space with X_{λ=0}(ω) = ω. Then you could impose equation 1 in practice. However, that covers only a small subset of the ways we describe how systematic uncertainties work.
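To make that picture concrete, here's a toy sketch (the global energy-scale transform X_λ(ω) = (1+λ)·ω is an invented example, not something from your draft): the nominal event ω and its modified version live in the same space, and λ=0 recovers the nominal event, so eq. 1 can actually be imposed.

```python
import numpy as np

rng = np.random.default_rng(0)

# omega: a "nominal" event from the lambda=0 simulation,
# here just a vector of particle energies (hypothetical example)
omega = rng.exponential(scale=50.0, size=5)

def X(lam, omega):
    """Deterministic lambda-dependent modification of the nominal event,
    e.g. a global energy-scale shift (invented for illustration)."""
    return (1.0 + lam) * omega

# Omega and X live in the same space, and X_{lambda=0}(omega) = omega,
# so equation 1 can be imposed with a common omega on both sides.
assert np.allclose(X(0.0, omega), omega)
assert not np.allclose(X(0.1, omega), omega)
```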
More generally, I have a generative model for X that depends on the parameter λ and encodes p(X|λ). I can generate from both, but I can't identify a common ω to make equation 1 practical.
Equation 2 doesn't require the ability to identify the same ω, so I'd just work with that. And if you don't need the common ω, then you can drop all the X_λ(ω) notation and just write equation 2 as
p(f(x) | λᵢ) = p(f(x) | λⱼ)   ∀ i, j    [this is called a pivotal quantity in statistics]
https://en.wikipedia.org/wiki/Pivotal_quantity
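A textbook toy example of a pivot (my example, not from the draft): for x = (x₁, x₂) with xᵢ ~ N(λ, 1), the difference f(x) = x₁ − x₂ is distributed as N(0, 2) for every λ, so p(f(x)|λᵢ) = p(f(x)|λⱼ) holds exactly. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # A pivot for a location family: the difference is invariant to the shift lambda
    return x[:, 0] - x[:, 1]

# Samples of x = (x1, x2) with x_i ~ N(lambda, 1) under two different lambdas
x_a = rng.normal(loc=0.0, scale=1.0, size=(200_000, 2))
x_b = rng.normal(loc=3.0, scale=1.0, size=(200_000, 2))

# f(x) ~ N(0, 2) regardless of lambda, so its moments agree across lambda
for name, fx in [("lambda=0", f(x_a)), ("lambda=3", f(x_b))]:
    print(f"{name}: mean={fx.mean():+.3f}  std={fx.std():.3f}")
```

Both lines print a mean near 0 and a standard deviation near √2, independent of λ.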
The profile likelihood ratio Λ(x) (used for the generalized likelihood ratio test) is (asymptotically) a pivotal quantity. So in our initial work where we learn the likelihood p(x|λ) we arrive at your goal once we profile out λ. So the goal of this work is either to a) improve on the asymptotic convergence of the profile likelihood ratio, or b) to essentially learn the profile likelihood ratio.
In our current approach, we learn the full likelihood in terms of parameters of interest and nuisance parameters, and then we use some optimization technique to profile the nuisance parameters and arrive at the profile likelihood ratio. This is expensive b/c we have to learn a big function and the optimization can be slow. So learning the profile likelihood ratio is a nice goal since the profiling will essentially be built in and we can learn a function that depends on fewer parameters.
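The cost of that inner optimization is easy to see even in a toy model (all of this is invented for illustration: one measurement x ~ N(μ + λ, 1) plus an auxiliary measurement a ~ N(λ, σ_a) constraining the nuisance). Profiling means re-minimizing over λ for every value of μ:

```python
import numpy as np

# Toy observed data and nuisance constraint width (invented numbers)
x_obs, a_obs, sigma_a = 1.2, 0.3, 0.5

def nll(mu, lam):
    # -2 log L up to constants: main measurement plus auxiliary constraint
    return (x_obs - mu - lam) ** 2 + ((a_obs - lam) / sigma_a) ** 2

# Profiling: for each mu, minimize over the nuisance lam. Here a brute-force
# grid scan; in the real problem this inner optimization is the slow part.
lam_grid = np.linspace(-3, 3, 2001)
def profiled_nll(mu):
    return nll(mu, lam_grid).min()

mu_grid = np.linspace(-2, 4, 601)
pnll = np.array([profiled_nll(mu) for mu in mu_grid])
mu_hat = mu_grid[pnll.argmin()]
print("profiled best-fit mu:", round(mu_hat, 2))
```

The joint minimum sits at λ = a_obs and μ = x_obs − λ, and the profiled scan recovers it; the point is that every μ in the scan pays for a full optimization over λ, which is what a learned profile likelihood ratio would fold into the function itself.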
However, the value of the nuisance parameter needed for profiling, i.e. the one that maximizes the likelihood conditional on the parameters of interest, depends on the entire sample of events. In my original version of the approximate likelihood ratio paper I wrote about this some... talking about event-level and experiment-level quantities. So I think any approach that tries to estimate the appropriate value of the nuisance parameter based on individual samples is doomed to fail.
Ok... reading along. I thought you were going to propose something different, but it looks like you are trying to learn λ̂ = r(f(X_λ; θ_f))... this is also what Radford Neal proposed in the paper that we referenced for our approximate likelihood paper. He tries to estimate the unknown nuisance parameters on an event-by-event basis. As I wrote above, I think trying to estimate λ̂ on an event-by-event basis is not the optimal strategy.
A variation on your idea... which is where I thought you were going... is to learn a mapping f(X) that has the properties of a pivotal quantity. We can call it "learning to pivot" :-) In that case you optimize the loss L_f(θ_f) while an antagonist tries to discriminate between the distributions p(f(x) | λᵢ) vs. p(f(x) | λⱼ). If f(x) is a pivot, then the discriminator won't be able to tell the difference. Here you never estimate λ̂(xᵢ); instead λ̂ is constant for all x and is essentially encoded into the model f(x; θ_f), just as it is for the profile likelihood ratio Λ(x). The loss function for the adversary would need to be generalized from the situation where you discriminate between just two classes to a continuous set of λ.
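To illustrate the adversary's job without the full adversarial training loop (everything here is an invented toy: feature 0 carries the signal, feature 1 leaks λ), compare a pivotal f with a non-pivotal one by how well a linear adversary can regress λ from f(x). Under this setup the adversary gets essentially zero R² on the pivot and a clearly nonzero R² on the leaky map:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data (invented): feature 0 is lambda-independent, feature 1 leaks lambda
n = 50_000
lam = rng.uniform(-1, 1, size=n)               # nuisance parameter per event
x = np.stack([rng.normal(0.0, 1.0, n),         # lambda-independent feature
              rng.normal(lam, 1.0)], axis=1)   # lambda-dependent feature

def adversary_r2(fx, lam):
    """How well a linear adversary predicts lambda from f(x), measured by R^2."""
    A = np.stack([fx, np.ones_like(fx)], axis=1)
    coef, *_ = np.linalg.lstsq(A, lam, rcond=None)
    resid = lam - A @ coef
    return 1.0 - resid.var() / lam.var()

f_pivotal = x[:, 0]            # ignores the lambda-dependent feature: a pivot
f_leaky = x[:, 0] + x[:, 1]    # uses it: the adversary can recover lambda

print("adversary R^2 vs pivotal f:", round(adversary_r2(f_pivotal, lam), 3))
print("adversary R^2 vs leaky   f:", round(adversary_r2(f_leaky, lam), 3))
```

In the actual scheme, f would be trained against the adversary so that its parameters are pushed toward the pivotal solution; this sketch only shows the signal the adversary feeds back.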
If we broke λ into N sub-classes indexed by c and used a soft-max for the predictions, I bet the equivalent of the d = 1/2 saddle point is to have d_c = 1/N. Maybe there is some continuum limit there. Thinking in real time... I can see that the implicit prior on λ is going to show up here b/c reparametrizing λ is like redefining the binning for the discretization of λ into the sub-classes indexed by c.
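To spell out that saddle-point guess (my reasoning, assuming a uniform prior π_c = 1/N over the sub-classes): if f(x) is a pivot then p(f(x)|c) is identical for every c, and Bayes gives d_c(f(x)) = p(c|f(x)) = p(f(x)|c) π_c / Σ_{c'} p(f(x)|c') π_{c'} = π_c = 1/N, i.e. the optimal discriminator collapses to the prior. With a non-uniform prior the saddle point would instead be d_c = π_c, which is exactly where the binning/reparametrization sensitivity shows up.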