hgganalysisdev's People

Contributors

bsathian, fsetti, gekobs, leonardogiannini, mhl0116, sam-may


hgganalysisdev's Issues

Implement signal region optimization tools

Currently, signal region optimization relies on the ttH/FCNC code (using RooFit + combine). This is fine for our purposes now, but it is a bit messy: the code is hard to read and not as flexible as I would like.

A pure-python implementation would be faster and could be made more configurable, readable, and user-friendly.

I'd propose using zfit [1] as a replacement for RooFit and hepstats [2] as a replacement for combine.

I plan to do this, but anyone else who is interested should feel free to take it on as well.

[1] https://github.com/zfit/zfit
[2] https://github.com/scikit-hep/hepstats
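
To make the proposal a bit more concrete, here is a rough, untested sketch of the kind of zfit + hepstats workflow I have in mind. The Gaussian signal shape, exponential background, parameter ranges, and toy data below are all placeholders, not our actual model:

import numpy as np
import zfit
from hepstats.hypotests import UpperLimit
from hepstats.hypotests.calculators import AsymptoticCalculator
from hepstats.hypotests.parameters import POI, POIarray

# Observable: diphoton mass window (placeholder range)
mgg = zfit.Space("mgg", limits=(100, 180))

# Placeholder shapes: Gaussian signal near 125 GeV, falling exponential background
mean = zfit.Parameter("mean", 125.0, 120.0, 130.0)
sigma = zfit.Parameter("sigma", 1.5, 0.5, 5.0)
lam = zfit.Parameter("lam", -0.05, -1.0, 0.0)
n_sig = zfit.Parameter("n_sig", 1.0, 0.0, 100.0)
n_bkg = zfit.Parameter("n_bkg", 100.0, 0.0, 10000.0)

signal = zfit.pdf.Gauss(mu=mean, sigma=sigma, obs=mgg).create_extended(n_sig)
background = zfit.pdf.Exponential(lam, obs=mgg).create_extended(n_bkg)
model = zfit.pdf.SumPDF([signal, background])

# Toy data standing in for the selected events
data = zfit.Data.from_numpy(obs=mgg, array=np.random.uniform(100, 180, size=1000))

nll = zfit.loss.ExtendedUnbinnedNLL(model=model, data=data)
minimizer = zfit.minimize.Minuit()

# Asymptotic upper limit on the signal yield (the part combine does for us today)
calculator = AsymptoticCalculator(nll, minimizer)
poi_null = POIarray(n_sig, np.linspace(0.0, 25.0, 26))
poi_alt = POI(n_sig, 0.0)
ul = UpperLimit(calculator, poi_null, poi_alt)
limit = ul.upperlimit(alpha=0.05)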

More Dynamic Workflow

At the moment, it seems that the workflow for this analysis framework is essentially based around a configuration JSON. While this is nice for simplicity, it might also suffer from its rigidity in the future. For example, I know that we have to do many checks and small studies for VBS HWW wherein I have to kind of brutalize my code in order to get timely results. As such, I wonder if you would be open to moving towards a different workflow, one that we've been using for VBS HWW and that I've been messing around with on the side. I think it could lend some flexibility that would prove generally useful.

The credit for this workflow is due to Philip. He has, for a long time, been organizing his analyses into "cutflow" objects. These cutflows are constructed by stringing together "cuts" in a tree structure. Each cut is given a name and two lambda functions: one that contains the logic for the cut (i.e. returns pass or fail selection) and another that returns the event weight for that cut (e.g. xsec weight, b-tagging scale factors). Overall, this workflow allows for a lot of niceties; I have listed a few of them under "Pros" at the bottom. However, while I think it is nice, I think it can also be simplified and improved.

I would like to propose the following overall structure. A binary search tree seems more fitting, in general, than the multi-branch tree that we use in VBS HWW. That is, each cut is a node in the BST, and whether it returns true or false determines whether the event goes right or left (respectively) in the tree. Iteration in the event loop then terminates when a leaf is reached. For example, I took Philip's cutflow idea (and related tools) and put them into something I call RAPIDO. I would take a similar approach within Python, where a lot of this would be much simpler thanks to Python's dynamic typing (a lot of acrobatics is needed to achieve a similar effect in C++, and it is needed for writing to TTrees). That is, I would keep a structure similar to what you have now, where selections are separated into their own functions, etc., but I would change the overall organization towards this cutflow structure.
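
As a rough, standalone sketch of what I mean (the class and selection names below are made up for illustration, not taken from RAPIDO or the existing framework):

class Cut:
    """A node in a binary cutflow tree: a name, a pass/fail lambda, and a weight lambda."""
    def __init__(self, name, passes, weight=lambda event: 1.0):
        self.name = name
        self.passes = passes      # event -> bool
        self.weight = weight      # event -> float (e.g. xsec weight, scale factors)
        self.right = None         # next cut if this one passes
        self.left = None          # next cut (or nothing) if it fails

def run_cutflow(root, event):
    """Walk the tree for one event; return the last node reached and the accumulated weight."""
    node, last, total_weight = root, root, 1.0
    while node is not None:
        passed = node.passes(event)
        if passed:
            total_weight *= node.weight(event)
        last = node
        node = node.right if passed else node.left
    return last.name, total_weight

# Example: a toy two-cut flow with placeholder selections
diphoton = Cut("diphoton_presel", lambda ev: ev["n_photons"] >= 2)
tau_pair = Cut("two_taus", lambda ev: ev["n_taus"] >= 2, weight=lambda ev: ev["xsec_weight"])
diphoton.right = tau_pair

leaf, weight = run_cutflow(diphoton, {"n_photons": 2, "n_taus": 2, "xsec_weight": 0.5})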

Let me conclude with a quick summary of the workflow along with a few "pros" I would like to highlight.

Workflow:
I propose that you organize the analysis like a BST cutflow, where you have one (or a few) common cutflow(s) in this repo. Contributors then clone this repo and make necessary changes to these common cutflows to answer questions, do weird studies, etc. Common object IDs and other tools (e.g. Python equivalent of cmstas/NanoTools) are also kept here.

Pros:

  • More explicit organization (i.e. the PEP 20 / general gospel of "explicit over implicit"), with less logic hidden behind nested objects
  • Cutflows are easily printable
  • Simple/diagnostic histograms can be filled while looping over events at different stages of the cutflow
  • Different signal regions or control regions are more easily/explicitly definable in the BST framework
  • The BDTs that you ultimately produce can also be exported to this BST format and analyzed as above

Apologies for the long issue. Please let me know if this is at all interesting to you; I would be happy to meet with folks to discuss further.

gen matching code

We don't have gen matching yet. I'm in the process of implementing it.

Preselection: Tool for making data/MC plots

Implement a script (e.g. make_plots.py) that takes a dataframe (output from the looper) as input and makes various plots.

Ideal configurable options:

  • Input dataframe
  • Type of plot: data/MC plot OR shape comparison
  • Linear vs. log axis (hopefully the script also calculates reasonable axis ranges)
  • Include/don't include ratio pad
  • List of variables to make plots for. My first idea for making this easily configurable is a json file containing the list of variables along with the relevant settings for each (binning, axis limits, additional text to print on the plot, etc.); see the sketch after these lists.

Configurable options for data/MC plots:

  • List of samples to consider as backgrounds (plot these as stacked histograms)
  • List of samples to consider as signals (plot these as solid lines)
  • Normalization of background: scale by lumi or scale total background to data
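
A hypothetical config file combining the two sets of options might look like the following. The sample names, variable names, and binning here are just placeholders:

{
  "plot_type": "data_mc",
  "backgrounds": ["DiPhoton", "GJets", "TTGG"],
  "signals": ["ggTauTau"],
  "normalization": "lumi",
  "ratio_pad": true,
  "variables": {
    "gg_mass": {"bins": 40, "range": [100, 180], "x_label": "m_gg [GeV]", "log_y": false},
    "g1_pt":   {"bins": 50, "range": [0, 250],   "x_label": "lead photon pT [GeV]", "log_y": true}
  }
}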

I know there are many different plotting packages already out there, and I am relatively ignorant about them, so please comment if you have one in mind that is the best / most user-friendly. One of these would probably be a good starting point for the script and could save most of the work.

parquet instead of pickle?

Sorry for being nosy - I am curious about all the columnar technologies in an actual analysis context like Hgg :)

I saw that pickle is mentioned/used in several places. You might try parquet instead (df.to_parquet(), pd.read_parquet()). It's essentially the industry's version of ROOT (a compressed columnar format), so it'll be much faster at serializing/deserializing than pickle. Uncompressed pickle is probably faster, but compared to compressed pickle (df.to_pickle("blah.pkl.gz")), parquet will win. Of course this all depends on how big your files are.
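
For concreteness, the swap would be roughly the following (the toy dataframe is just illustrative, and parquet needs pyarrow or fastparquet installed):

import pandas as pd

df = pd.DataFrame({"gg_mass": [125.1, 123.4], "weight": [0.8, 1.1]})

# current approach
df.to_pickle("events.pkl")            # or "events.pkl.gz" for compressed pickle
df = pd.read_pickle("events.pkl")

# parquet alternative (columnar, compressed by default with pyarrow)
df.to_parquet("events.parquet")
df = pd.read_parquet("events.parquet")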

Preselection: Implement Dask/condor submission for looper

Implement functionality to submit jobs to Dask and/or condor for the preselection looper.

Implementation would go in this function in prep_helper.py:

def submit_jobs(self):

Likely the cleanest way to do it would be to make a Batch directory and build helper classes for Dask/condor submission there.
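
As a hedged sketch of what the Dask path could look like, using dask_jobqueue's HTCondorCluster (the resource numbers, file names, and process_file function below are placeholders, not part of the current looper):

from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

def process_file(filename):
    # placeholder: run the looper selections on one input file and
    # return the path of the output dataframe it wrote
    return filename.replace(".root", ".parquet")

cluster = HTCondorCluster(cores=1, memory="4GB", disk="2GB")
cluster.scale(jobs=50)                       # request 50 condor workers
client = Client(cluster)

files = ["skim_1.root", "skim_2.root"]       # placeholder inputs
futures = client.map(process_file, files)    # one task per input file
outputs = client.gather(futures)             # collect output paths for merging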

Merging of output dataframes would also need to be updated accordingly:

def merge_outputs(self):
    master_file = self.output_dir + self.selections + "_" + self.output_tag + ".pkl"
    master_df = pandas.DataFrame()
    for file in self.outputs:
        if self.debug > 0:
            print("[LoopHelper] Loading file %s" % file)
        if not os.path.exists(file):
            continue
        df = pandas.read_pickle(file)
        master_df = pandas.concat([master_df, df], ignore_index=True)
    master_df.to_pickle(master_file)

Another thing to keep in mind: it would be nice to have the batch submission tools not be entirely specific to the looper (or at least easily generalizable), as they will also be useful for MVA training (e.g. hyperparameter scans) and Signal Region Optimization (scanning MVA cut values).

Implement four-vector tools

It will be useful to have tools for computing four-vector-related quantities.

For example, we may want to build an H->TauTau candidate out of two hadronic taus/leptons and compute its pT and eta, compute its deltaR with respect to photons/diphoton, etc.

Given that we already have many useful quantities saved in the skims (gg_pt, gg_eta, SVFit quantities), this is not super urgent, but will be necessary if we want to do things without remaking skims.

I'd suggest we make a PhysicsTools directory inside Preselection and build either a class or a set of functions in e.g. four_vector_utils.py. The main functionality we'd want is the ability to add two four-vectors together and return the resulting four-vector's properties (pT, eta, phi, mass). Once we have this, we can do things like compute dR(H->TauTau, H->gg) with the existing tools for calculating delta R (these need to be cleaned up as well).
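
A minimal sketch of what four_vector_utils.py could contain (the function names and the example numbers are just suggestions):

import math

def to_cartesian(pt, eta, phi, mass):
    """(pt, eta, phi, m) -> (px, py, pz, E)"""
    px, py = pt * math.cos(phi), pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return px, py, pz, energy

def add_four_vectors(p1, p2):
    """Add two (pt, eta, phi, mass) four-vectors and return the sum in the same convention."""
    px, py, pz, e = (a + b for a, b in zip(to_cartesian(*p1), to_cartesian(*p2)))
    pt = math.hypot(px, py)
    phi = math.atan2(py, px)
    eta = math.asinh(pz / pt) if pt > 0 else 0.0
    mass = math.sqrt(max(e**2 - (px**2 + py**2 + pz**2), 0.0))
    return pt, eta, phi, mass

def delta_r(eta1, phi1, eta2, phi2):
    """deltaR with proper phi wrapping."""
    dphi = (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

# e.g. build an H->TauTau candidate from two taus, then compute dR to the diphoton system
htautau = add_four_vectors((45.0, 0.3, 1.2, 1.78), (38.0, -0.1, -0.4, 1.78))
dr = delta_r(htautau[1], htautau[2], 0.5, 2.0)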

Preselection: Tool for making yield tables

Implement a script (e.g. make_tables.py) that takes a dataframe object (output from the looper) as input and prints out various yield tables.

The script should print yields and uncertainties for each process, the total background (i.e. the sum of all background MC), and the ratio of each process to the total background yield.

Ideally the following options would be configurable:

  • Input dataframe
  • List of samples to consider as background
  • List of samples to consider as signal (just ggTauTau for now, but will be nice to have this easily configurable in the future)
  • Option to scale non-resonant background yields to the m_gg mass window, ~[122, 128], for a fairer comparison with signal and resonant backgrounds
  • Option to make tables separately by year
  • Options to apply cuts based on columns saved in the dataframe (e.g. print yields after cutting on some value of m_tautau). The easiest way to do this would probably be to supply a json file with a list of cuts as input; the script would then make a yield table for each cut listed in the config file.

This should go in a tables directory under the Preselection dir.
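
For the yields and uncertainties themselves, a minimal sketch of the core calculation (assuming the looper dataframe has "process" and "weight" columns plus the relevant kinematic columns; these names are assumptions, not the current schema):

import numpy as np
import pandas as pd

def yield_table(df, backgrounds, signals, cut=None):
    """Weighted yields, sqrt(sum(w^2)) uncertainties, and per-process / total-background ratios."""
    if cut is not None:
        df = df.query(cut)                      # e.g. "gg_mass > 122 and gg_mass < 128"
    bkg_weights = df.loc[df["process"].isin(backgrounds), "weight"]
    bkg_total = bkg_weights.sum()
    rows = []
    for process in backgrounds + signals:
        weights = df.loc[df["process"] == process, "weight"]
        n = weights.sum()
        unc = np.sqrt((weights**2).sum())
        rows.append({"process": process, "yield": n, "unc": unc,
                     "frac_of_bkg": n / bkg_total if bkg_total > 0 else float("nan")})
    rows.append({"process": "total_bkg", "yield": bkg_total,
                 "unc": np.sqrt((bkg_weights**2).sum()), "frac_of_bkg": 1.0})
    return pd.DataFrame(rows)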
