Hi, Simone.
You're doing something somewhat strange and expecting the algorithms to do things they can't know about.
Cross-validation in machine learning is easy when you have some figure of merit (ROC AUC, MSE, classification accuracy); in that case evaluation is quite straightforward.
However, in the case of reweighting, correct validation requires two steps:
- weak check: looking at 1-d distributions (or computing simple 1-d tests)
- strong check: verifying that the machine learning model used in the analysis can't discriminate between the datasets after reweighting.
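The two checks above can be sketched roughly as follows. This is a toy illustration, not hep_ml code: the `original`/`target` arrays and the `weights` vector are made up here, standing in for your MC, data, and the output of a fitted reweighter (e.g. `GBReweighter.predict_weights(original)`).

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-ins: in a real analysis these are your MC and data arrays,
# and `weights` would come from the fitted reweighter.
original = rng.normal(0.3, 1.1, size=(5000, 2))
target = rng.normal(0.0, 1.0, size=(5000, 2))
weights = np.exp(-0.3 * original[:, 0])  # hypothetical reweighter output

# Weak check: 1-d KS tests, approximating the weighted sample by
# resampling original events with probability proportional to weight.
idx = rng.choice(len(original), size=len(original), p=weights / weights.sum())
for i in range(original.shape[1]):
    ks_before = ks_2samp(original[:, i], target[:, i]).statistic
    ks_after = ks_2samp(original[idx, i], target[:, i]).statistic
    print(f"feature {i}: KS before={ks_before:.3f}, after={ks_after:.3f}")

# Strong check: train a classifier to separate reweighted original from
# target; a ROC AUC near 0.5 means it cannot tell them apart.
X = np.vstack([original, target])
y = np.concatenate([np.zeros(len(original)), np.ones(len(target))])
w = np.concatenate([weights / weights.mean(), np.ones(len(target))])
X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(X, y, w, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X_tr, y_tr, sample_weight=w_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1], sample_weight=w_te)
print(f"post-reweighting AUC = {auc:.3f} (0.5 would mean no residual difference)")
```

The classifier used for the strong check should ideally be the same kind of model used later in the analysis.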
(Also, is there any reason to optimize parameters automatically?)
from hep_ml.
Hi,
OK, let me try to clarify what the situation is.
I have played a bit with the hyperparameters and ended up using the following configuration:
GBReweighterPars = {"n_estimators": 200,
                    "learning_rate": 0.1,
                    "max_depth": 4,
                    "min_samples_leaf": 1000,
                    "subsample": 1.0}
However, when I use different samples with much lower statistics, I am afraid the above are far from optimal, e.g. too many n_estimators, causing the reweighter to misbehave.
Rather than trying other settings myself, I was wondering if there is an automated way to study this.
In particular, after having created the reweighter, I compute the ROC AUC on a number of variables of interest, which I could use as a FoM.
Would that be useful?
Thanks
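For concreteness, a per-variable ROC AUC can be computed by using the variable itself as the classifier score, with the reweighter output entering as sample weights; |AUC - 0.5| then measures how separable the two marginal distributions are. The variable names and weights below are illustrative, not from the actual analysis.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
# Hypothetical 1-d variable in MC ("original") and data ("target");
# the weights are made up here, standing in for reweighter output.
orig_var = rng.normal(0.5, 1.0, 10000)
targ_var = rng.normal(0.0, 1.0, 10000)
weights = np.exp(-0.5 * orig_var)  # toy weights shifting the mean toward 0

labels = np.concatenate([np.zeros(10000), np.ones(10000)])
values = np.concatenate([orig_var, targ_var])
sw = np.concatenate([weights, np.ones(10000)])

# The variable itself serves as the "score"; |AUC - 0.5| is the
# distance between the two weighted marginal distributions.
auc_before = roc_auc_score(labels, values)
auc_after = roc_auc_score(labels, values, sample_weight=sw)
print(abs(auc_before - 0.5), abs(auc_after - 0.5))
```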
Would that be useful?
Not really. 1-dimensional discrepancies are not all discrepancies.
You can drive 1-dimensional ROC AUCs to 0.5 with max_depth=1, but you won't catch any non-trivial difference between the distributions.
(Well, you can use it as a starting point and then check the result using step 2, but no guarantees whatsoever can be made for this approach.)
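The failure mode described above can be demonstrated directly without any reweighter: two samples with identical 1-d marginals but opposite feature correlations give per-feature AUCs of about 0.5, while a classifier that sees both features separates them easily. The sample sizes and correlation values below are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 8000
# Identical standard-normal marginals, opposite correlation (+0.8 vs -0.8).
a = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=n)
b = rng.multivariate_normal([0, 0], [[1.0, -0.8], [-0.8, 1.0]], size=n)

y = np.concatenate([np.zeros(n), np.ones(n)])
# 1-d AUCs using each raw feature as the score: both close to 0.5.
auc_1d = [roc_auc_score(y, np.concatenate([a[:, i], b[:, i]]))
          for i in range(2)]

# A 2-d classifier picks up the correlation difference easily.
X = np.vstack([a, b])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)
auc_2d = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc_1d, auc_2d)
```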
OK, then how do you suggest picking the hyperparameters?
If you really want to automate this process, you need to write an evaluation function that accounts for both steps 1) and 2) mentioned above, e.g. the sum over KS(feature_i) plus abs(ROC AUC of the classifier - 0.5).
As for me: I pick a relatively small number of trees (30-50), select the leaf size and regularization according to the dataset, and play with the depth (2-4) and learning rate (0.1-0.3). I stop when I see that I have significantly reduced the discrepancy between the datasets. There are many other errors to be dealt with in the analysis, and trying to drive only one of them to zero isn't a wise strategy.
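A possible shape for such a combined evaluation function is sketched below. Everything here is a toy stand-in: the datasets are synthetic, the weighted-KS helper is hand-rolled (scipy's ks_2samp does not accept weights), and the two candidate weight vectors stand in for the output of a reweighter fitted with two different hyperparameter settings; in a real scan, each candidate would come from one fit, and the setting with the lowest score would win.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
original = rng.normal(0.4, 1.2, size=(4000, 2))  # toy MC
target = rng.normal(0.0, 1.0, size=(4000, 2))    # toy data

def ks_weighted(x, y, wx):
    """Two-sample KS distance with weights on the first sample."""
    xs, ws = np.sort(x), wx[np.argsort(x)]
    ys = np.sort(y)
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.concatenate([[0.0], np.cumsum(ws)])[
        np.searchsorted(xs, grid, side="right")] / ws.sum()
    Fy = np.searchsorted(ys, grid, side="right") / len(y)
    return float(np.max(np.abs(Fx - Fy)))

def reweighting_score(original, target, weights):
    """Lower is better: sum of per-feature weighted KS distances plus
    |AUC - 0.5| of a classifier separating reweighted original from target."""
    ks_sum = sum(ks_weighted(original[:, i], target[:, i], weights)
                 for i in range(original.shape[1]))
    X = np.vstack([original, target])
    y = np.concatenate([np.zeros(len(original)), np.ones(len(target))])
    w = np.concatenate([weights / weights.mean(), np.ones(len(target))])
    X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
        X, y, w, random_state=0)
    clf = GradientBoostingClassifier(n_estimators=40, max_depth=3,
                                     random_state=0)
    clf.fit(X_tr, y_tr, sample_weight=w_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1],
                        sample_weight=w_te)
    return ks_sum + abs(auc - 0.5)

# Two hand-made weightings standing in for two hyperparameter settings.
flat = np.ones(len(original))
tilted = np.exp(-0.3 * original[:, 0])
scores = {"no reweighting": reweighting_score(original, target, flat),
          "tilted": reweighting_score(original, target, tilted)}
print(scores)
```

Note the AUC term is evaluated on a held-out split; evaluating it on the training events would report spurious discrimination even for identical distributions.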