hgganalysisdev's People

Contributors

bsathian, fsetti, gekobs, leonardogiannini, mhl0116, sam-may


hgganalysisdev's Issues

Implement signal region optimization tools

Currently, signal region optimization relies on the ttH/FCNC code (using RooFit + combine). This is fine for our purposes now, but it is a bit messy: the code is hard to read and not as flexible as I would like.

A pure-python implementation would be faster and could be made more configurable, readable, and user-friendly.

I'd propose using zfit [1] as a replacement for RooFit and hepstats [2] as a replacement for combine.

I plan to do this, but anyone else who is interested should feel free to take it on as well.

[1] https://github.com/zfit/zfit
[2] https://github.com/scikit-hep/hepstats
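
To make the proposal a bit more concrete, here is a rough, untested sketch of the kind of zfit + hepstats workflow I have in mind. The Gaussian signal shape, exponential background, parameter ranges, and toy data below are all placeholders, not our actual model:

import numpy as np
import zfit
from hepstats.hypotests import UpperLimit
from hepstats.hypotests.calculators import AsymptoticCalculator
from hepstats.hypotests.parameters import POI, POIarray

# Observable: diphoton mass window (placeholder range)
mgg = zfit.Space("mgg", limits=(100, 180))

# Placeholder shapes: Gaussian signal near 125 GeV, falling exponential background
mean = zfit.Parameter("mean", 125.0, 120.0, 130.0)
sigma = zfit.Parameter("sigma", 1.5, 0.5, 5.0)
lam = zfit.Parameter("lam", -0.05, -1.0, 0.0)
n_sig = zfit.Parameter("n_sig", 1.0, 0.0, 100.0)
n_bkg = zfit.Parameter("n_bkg", 100.0, 0.0, 10000.0)

signal = zfit.pdf.Gauss(mu=mean, sigma=sigma, obs=mgg).create_extended(n_sig)
background = zfit.pdf.Exponential(lam, obs=mgg).create_extended(n_bkg)
model = zfit.pdf.SumPDF([signal, background])

# Toy data standing in for the selected events
data = zfit.Data.from_numpy(obs=mgg, array=np.random.uniform(100, 180, size=1000))

nll = zfit.loss.ExtendedUnbinnedNLL(model=model, data=data)
minimizer = zfit.minimize.Minuit()

# Asymptotic upper limit on the signal yield (the part combine does for us today)
calculator = AsymptoticCalculator(nll, minimizer)
poi_null = POIarray(n_sig, np.linspace(0.0, 25.0, 26))
poi_alt = POI(n_sig, 0.0)
ul = UpperLimit(calculator, poi_null, poi_alt)
limit = ul.upperlimit(alpha=0.05)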

More Dynamic Workflow

At the moment, it seems that the workflow for this analysis framework is essentially based around a configuration JSON. While this is nice for simplicity, it might also suffer from its rigidity in the future. For example, I know that we have to do many checks and small studies for VBS HWW wherein I have to kind of brutalize my code in order to get timely results. As such, I wonder if you would be open to moving towards a different workflow, one that we've been using for VBS HWW and that I've been messing around with on the side. I think it could lend some flexibility that would prove generally useful.

The credit for this workflow is due to Philip. He has, for a long time, been organizing his analyses into "cutflow" objects. These cutflows are constructed by stringing together "cuts" in a tree structure. Each cut is given a name and two lambda functions: one that contains the logic for the cut (i.e. returns pass or fail selection) and another that returns the event weight for that cut (e.g. xsec weight, b-tagging scale factors). Overall, this workflow allows for a lot of niceties; I have listed a few of them under "Pros" at the bottom. However, while I think it is nice, I think it can also be simplified and improved.

I would like to propose the following overall structure. A binary search tree seems more fitting, in general, than the multi-branch tree that we use in VBS HWW. That is, each cut is a node in the BST, and whether it returns true or false determines whether the event goes right or left (respectively) in the tree. Iteration in the event loop then terminates when a leaf is reached. For example, I took Philip's cutflow idea (and related tools) and put them into something I call RAPIDO. I would take a similar approach within Python, where a lot of this would be much simpler thanks to Python's dynamic typing (a lot of acrobatics is needed to achieve a similar effect in C++, and it is needed for writing to TTrees). That is, I would keep a structure similar to what you have now, where selections are separated into their own functions, etc., but I would change the overall organization towards this cutflow structure.
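
As a rough, standalone sketch of what I mean (the class and selection names below are made up for illustration, not taken from RAPIDO or the existing framework):

class Cut:
    """A node in a binary cutflow tree: a name, a pass/fail lambda, and a weight lambda."""
    def __init__(self, name, passes, weight=lambda event: 1.0):
        self.name = name
        self.passes = passes      # event -> bool
        self.weight = weight      # event -> float (e.g. xsec weight, scale factors)
        self.right = None         # next cut if this one passes
        self.left = None          # next cut (or nothing) if it fails

def run_cutflow(root, event):
    """Walk the tree for one event; return the last node reached and the accumulated weight."""
    node, last, total_weight = root, root, 1.0
    while node is not None:
        passed = node.passes(event)
        if passed:
            total_weight *= node.weight(event)
        last = node
        node = node.right if passed else node.left
    return last.name, total_weight

# Example: a toy two-cut flow with placeholder selections
diphoton = Cut("diphoton_presel", lambda ev: ev["n_photons"] >= 2)
tau_pair = Cut("two_taus", lambda ev: ev["n_taus"] >= 2, weight=lambda ev: ev["xsec_weight"])
diphoton.right = tau_pair

leaf, weight = run_cutflow(diphoton, {"n_photons": 2, "n_taus": 2, "xsec_weight": 0.5})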

Let me conclude with a quick summary of the workflow along with a few "pros" I would like to highlight.

Workflow:
I propose that you organize the analysis like a BST cutflow, where you have one (or a few) common cutflow(s) in this repo. Contributors then clone this repo and make necessary changes to these common cutflows to answer questions, do weird studies, etc. Common object IDs and other tools (e.g. Python equivalent of cmstas/NanoTools) are also kept here.

Pros:

  • More explicit organization (i.e. the PEP 20 / general gospel of "explicit over implicit"), with less logic hidden behind nested objects
  • Cutflows are easily printable
  • Simple/diagnostic histograms can be filled while looping over events at different stages of the cutflow
  • Different signal regions or control regions are more easily/explicitly definable in the BST framework
  • The BDTs that you ultimately produce can also be exported to this BST format and analyzed as above

Apologies for the long issue. Please let me know if this is at all interesting to you; I would be happy to meet with folks to discuss further.

gen matching code

We don't have gen matching yet. I'm in the process of implementing it.

Preselection: Tool for making data/MC plots

Implement a script (e.g. make_plots.py) that takes a dataframe (output from the looper) as input and makes various plots.

Ideal configurable options:

  • Input dataframe
  • Type of plot: data/MC plot OR shape comparison
  • Linear vs. log axis (hopefully the script also calculates reasonable axis ranges)
  • Include/don't include ratio pad
  • List of variables to make plots for. My first idea for making this easily configurable is a json file containing the list of variables along with the relevant settings for each (binning, axis limits, additional text to print on the plot, etc.); see the sketch after these lists.

Configurable options for data/MC plots:

  • List of samples to consider as backgrounds (plot these as stacked histograms)
  • List of samples to consider as signals (plot these as solid lines)
  • Normalization of background: scale by lumi or scale total background to data
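
A hypothetical config file combining the two sets of options might look like the following. The sample names, variable names, and binning here are just placeholders:

{
  "plot_type": "data_mc",
  "backgrounds": ["DiPhoton", "GJets", "TTGG"],
  "signals": ["ggTauTau"],
  "normalization": "lumi",
  "ratio_pad": true,
  "variables": {
    "gg_mass": {"bins": 40, "range": [100, 180], "x_label": "m_gg [GeV]", "log_y": false},
    "g1_pt":   {"bins": 50, "range": [0, 250],   "x_label": "lead photon pT [GeV]", "log_y": true}
  }
}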

I know there are many different plotting packages already out there, and I am relatively ignorant about them, so please comment if you have one in mind that is the best / most user-friendly. One of these would probably be a good starting point for the script and could save most of the work.

parquet instead of pickle?

Sorry for being nosy - I am curious about all the columnar technologies in an actual analysis context like Hgg :)

I saw that pickle is mentioned/used in several places. You might try parquet instead (df.to_parquet(), pd.read_parquet()). It's essentially the industry's version of ROOT (a compressed columnar format), so it'll be much faster at serializing/deserializing than pickle. Uncompressed pickle is probably faster, but compared to compressed pickle (df.to_pickle("blah.pkl.gz")), parquet will win. Of course this all depends on how big your files are.
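
For concreteness, the swap would be roughly the following (the toy dataframe is just illustrative, and parquet needs pyarrow or fastparquet installed):

import pandas as pd

df = pd.DataFrame({"gg_mass": [125.1, 123.4], "weight": [0.8, 1.1]})

# current approach
df.to_pickle("events.pkl")            # or "events.pkl.gz" for compressed pickle
df = pd.read_pickle("events.pkl")

# parquet alternative (columnar, compressed by default with pyarrow)
df.to_parquet("events.parquet")
df = pd.read_parquet("events.parquet")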

Preselection: Implement Dask/condor submission for looper

Implement functionality to submit jobs to Dask and/or condor for the preselection looper.

Implementation would go in this function in prep_helper.py:

def submit_jobs(self):

Likely the cleanest way to do it would be to make a Batch directory and build helper classes for Dask/condor submission there.
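
As a hedged sketch of what the Dask path could look like, using dask_jobqueue's HTCondorCluster (the resource numbers, file names, and process_file function below are placeholders, not part of the current looper):

from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

def process_file(filename):
    # placeholder: run the looper selections on one input file and
    # return the path of the output dataframe it wrote
    return filename.replace(".root", ".parquet")

cluster = HTCondorCluster(cores=1, memory="4GB", disk="2GB")
cluster.scale(jobs=50)                       # request 50 condor workers
client = Client(cluster)

files = ["skim_1.root", "skim_2.root"]       # placeholder inputs
futures = client.map(process_file, files)    # one task per input file
outputs = client.gather(futures)             # collect output paths for merging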

Merging of output dataframes would also need to be updated accordingly:

def merge_outputs(self):
    master_file = self.output_dir + self.selections + "_" + self.output_tag + ".pkl"
    master_df = pandas.DataFrame()
    for file in self.outputs:
        if self.debug > 0:
            print("[LoopHelper] Loading file %s" % file)
        if not os.path.exists(file):
            continue
        df = pandas.read_pickle(file)
        master_df = pandas.concat([master_df, df], ignore_index=True)
    master_df.to_pickle(master_file)

Another thing to keep in mind: it would be nice to have the batch submission tools not be entirely specific to the looper (or at least easily generalizable), as they will also be useful for MVA training (e.g. hyperparameter scans) and Signal Region Optimization (scanning MVA cut values).

Implement four-vector tools

It will be useful to have tools for computing four-vector-related quantities.

For example, we may want to build an H->TauTau candidate out of two hadronic taus/leptons and compute its pT and eta, compute its deltaR with respect to photons/diphoton, etc.

Given that we already have many useful quantities saved in the skims (gg_pt, gg_eta, SVFit quantities), this is not super urgent, but will be necessary if we want to do things without remaking skims.

I'd suggest we make a PhysicsTools directory inside Preselection and build either a class or a set of functions in e.g. four_vector_utils.py. The main functionality we'd want is the ability to add two four-vectors together and return the resulting four-vector's properties (pT, eta, phi, mass). Once we have this, we can do things like compute dR(H->TauTau, H->gg) with the existing tools for calculating delta R (these need to be cleaned up as well).
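
A minimal sketch of what four_vector_utils.py could contain (the function names and the example numbers are just suggestions):

import math

def to_cartesian(pt, eta, phi, mass):
    """(pt, eta, phi, m) -> (px, py, pz, E)"""
    px, py = pt * math.cos(phi), pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return px, py, pz, energy

def add_four_vectors(p1, p2):
    """Add two (pt, eta, phi, mass) four-vectors and return the sum in the same convention."""
    px, py, pz, e = (a + b for a, b in zip(to_cartesian(*p1), to_cartesian(*p2)))
    pt = math.hypot(px, py)
    phi = math.atan2(py, px)
    eta = math.asinh(pz / pt) if pt > 0 else 0.0
    mass = math.sqrt(max(e**2 - (px**2 + py**2 + pz**2), 0.0))
    return pt, eta, phi, mass

def delta_r(eta1, phi1, eta2, phi2):
    """deltaR with proper phi wrapping."""
    dphi = (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

# e.g. build an H->TauTau candidate from two taus, then compute dR to the diphoton system
htautau = add_four_vectors((45.0, 0.3, 1.2, 1.78), (38.0, -0.1, -0.4, 1.78))
dr = delta_r(htautau[1], htautau[2], 0.5, 2.0)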

Preselection: Tool for making yield tables

Implement a script (e.g. make_tables.py) that takes a dataframe object (output from the looper) as input and prints out various yield tables.

The script should print yields and uncertainties for each process, the total background (i.e. the sum of all background MC), and the ratio of each process to the total background yield.

Ideally the following options would be configurable:

  • Input dataframe
  • List of samples to consider as background
  • List of samples to consider as signal (just ggTauTau for now, but will be nice to have this easily configurable in the future)
  • Option to scale non-resonant background yields to the m_gg mass window, ~[122, 128], for a fairer comparison with signal and resonant backgrounds
  • Option to make tables separately by year
  • Options to apply cuts based on columns saved in the dataframe (e.g. print yields after cutting on some value of m_tautau). The easiest way to do this would probably be to supply a json file with a list of cuts as input; the script would then make a yield table for each cut listed in the config file.

This should go in a tables directory under the Preselection dir.
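
For the yields and uncertainties themselves, a minimal sketch of the core calculation (assuming the looper dataframe has "process" and "weight" columns plus the relevant kinematic columns; these names are assumptions, not the current schema):

import numpy as np
import pandas as pd

def yield_table(df, backgrounds, signals, cut=None):
    """Weighted yields, sqrt(sum(w^2)) uncertainties, and per-process / total-background ratios."""
    if cut is not None:
        df = df.query(cut)                      # e.g. "gg_mass > 122 and gg_mass < 128"
    bkg_weights = df.loc[df["process"].isin(backgrounds), "weight"]
    bkg_total = bkg_weights.sum()
    rows = []
    for process in backgrounds + signals:
        weights = df.loc[df["process"] == process, "weight"]
        n = weights.sum()
        unc = np.sqrt((weights**2).sum())
        rows.append({"process": process, "yield": n, "unc": unc,
                     "frac_of_bkg": n / bkg_total if bkg_total > 0 else float("nan")})
    rows.append({"process": "total_bkg", "yield": bkg_total,
                 "unc": np.sqrt((bkg_weights**2).sum()), "frac_of_bkg": 1.0})
    return pd.DataFrame(rows)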
