fast-hep / fast-plotter

Manipulate binned pandas dataframes into plots

Home Page: https://fast-hep.web.cern.ch

Makefile 2.60% Python 97.40%

hacktoberfest plotting python

fast-plotter's People

eshwen gluonicpenguin bundocka dbanthony snwebb kreczko

fast-plotter's Issues

Add options for the ratio plot error bar calculation

Imported from gitlab issue 5

The current error calculation assumes the ratio is for an efficiency plot, which will not always be the case.
https://gitlab.cern.ch/fast-hep/public/fast-plotter/blob/master/fast_plotter/plotting.py#L134

It would be good to make the error calculation type an option. The main options I can think of are:

efficiency
data/MC (standard)
data/MC (Poisson)
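
The three options above could be sketched per bin as follows. This is only an illustration, not the code in fast_plotter/plotting.py: the function name ratio_error and its kind argument are hypothetical, the efficiency case assumes unweighted counts with the numerator a subset of the denominator, and the Poisson case assumes the MC uncertainty is neglected.

```python
import math


def ratio_error(num, den, num_err=0.0, den_err=0.0, kind="standard"):
    """Return (ratio, error) for a single bin.

    Hypothetical helper, not part of fast-plotter.
    kind is one of "efficiency", "standard" or "poisson".
    """
    if den <= 0:
        return float("nan"), float("nan")
    ratio = num / den
    if kind == "efficiency":
        # Binomial error; assumes unweighted counts and num <= den.
        err = math.sqrt(max(ratio * (1.0 - ratio), 0.0) / den)
    elif kind == "standard":
        # Uncorrelated error propagation for num / den.
        err = math.sqrt((num_err / den) ** 2 + (num * den_err / den ** 2) ** 2)
    elif kind == "poisson":
        # Assumption: sqrt(N) error on the data, MC treated as exact.
        err = math.sqrt(max(num, 0.0)) / den
    else:
        raise ValueError("unknown error kind: %r" % kind)
    return ratio, err
```

For example, ratio_error(50, 100, kind="efficiency") gives a ratio of 0.5 with an error of 0.05.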

Review functions in __init__.py

Imported from gitlab issue 4

There are still a few functions in __init__.py from the initial commit.

We should review these, move anything useful to utils.py, and remove anything obsolete.

Signal line colours are inconsistent

Even when a colour is specified for a dataset in the dataset_colours block of a plotting config, it is not respected. Signal lines always seem to be ordered by total yield and to follow the tab10 colourmap from matplotlib.

Bugs in v0.1.4

Imported from gitlab issue 10

Recent changes have broken things for Data / MC plots:

Return code not set properly if there are crashes
Crashes caused by treating single line plots as multiple line plots

Streamline re-ordering/merging/dropping columns (Rob)

Streamline the functions that let the user merge columns together and then re-order or drop columns as necessary for input to fast-plotter, ensuring the index is set for the new merged column (see the working config below).

stages:
    - {rename_region: ReBin}
    - {rebin_met: ReBin}
    - {combine_region_met: CombineColumns}
    - {drop_region_met: AssignDim}
    - {save: WriteOut}

rename_region:
    axis: region
    drop_others: true
    mapping:
        0: SR
        6: SB0
        7: SB1
        8: SB2
        9: SB3
        10: SB4

rebin_met:
    axis: met
    drop_others: true
    mapping:
        "[200.0, 300.0)":   "[200.0, 300.0)"
        "[300.0, 400.0)":   "[300.0, 400.0)"
        "[400.0, 500.0)":   "[400.0, 600.0)"
        "[500.0, 600.0)":   "[400.0, 600.0)"
        "[600.0, 700.0)":   "[600.0, 2000.0)"
        "[700.0, 800.0)":   "[600.0, 2000.0)"
        "[800.0, 900.0)":   "[600.0, 2000.0)"
        "[900.0, 1000.0)":   "[600.0, 2000.0)"
        "[1000.0, inf)":   "[600.0, 2000.0)"

combine_region_met:
    format_strings: {"region_met":"{region}_{met}"}
    as_index: [region_met]

drop_region_met:
    drop_cols: [region, category, met]

save: #{}
    filename: "tbl_dataset.region.category.met--sig_regions_fit.csv"
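
What the combine_region_met and drop_region_met stages are meant to achieve can be illustrated with plain pandas. This is a minimal sketch under assumptions, not the implementation behind CombineColumns or AssignDim: the function combine_columns and the toy dataframe are made up for illustration.

```python
import pandas as pd


def combine_columns(df, format_strings, as_index=None):
    """Hypothetical stand-in for a CombineColumns-like stage.

    format_strings maps a new column name to a str.format template
    that refers to existing columns (or named index levels).
    """
    # Turn named index levels into ordinary columns so templates can use them.
    named_index = any(name is not None for name in df.index.names)
    out = df.reset_index() if named_index else df.copy()
    for new_col, template in format_strings.items():
        out[new_col] = out.apply(
            lambda row: template.format(**row.to_dict()), axis=1
        )
    if as_index:
        out = out.set_index(as_index)
    return out


# Toy input, invented for this example.
df = pd.DataFrame({
    "region": ["SR", "SB0"],
    "met": ["[200.0, 300.0)", "[300.0, 400.0)"],
    "n": [10, 4],
})
combined = combine_columns(df, {"region_met": "{region}_{met}"},
                           as_index=["region_met"])
# Equivalent of dropping the now-redundant columns.
combined = combined.drop(columns=["region", "met"])
```

The merged column becomes the index, so the result has one row per region and met bin combination.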

Versatility in signal vs background plots

Imported from gitlab issue 8

Include a "summary" panel for signal/sqrt(background) or the Asimov formula when plotting signal vs background. Ideally use colour-coding so that in the summary plot, for each bin, there's an f(S, B) value for each signal, where the colour of the point matches the top plot.
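
The two candidate f(S, B) functions can be written per bin as below. This is a sketch, assuming the "Asimov formula" refers to the usual Asimov significance for a counting experiment with no background uncertainty; the function names are hypothetical.

```python
import math


def simple_significance(s, b):
    """S / sqrt(B) for one bin; NaN if the background is not positive."""
    return s / math.sqrt(b) if b > 0 else float("nan")


def asimov_significance(s, b):
    """Asimov significance sqrt(2 * ((S + B) * ln(1 + S / B) - S))."""
    if b <= 0 or s < 0:
        return float("nan")
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))
```

For S much smaller than B the two agree closely; the Asimov value is the smaller of the two once S becomes comparable to B.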

Dataset order not applied to signal

When specifying certain datasets as signal in a config, i.e.,

signal: '.*SVJ*.'

and a dataset order is specified for them, i.e.,

dataset_order:
    - SVJ_3000_20_0.9_peak (Pythia)
    - 'SVJ_3000_20_0.9_peak (MadGraph)'
    - SVJ_3000_20_0.1_peak (Pythia)
    - 'SVJ_3000_20_0.1_peak (MadGraph)'
    - SVJ_1000_20_0.3_peak (Pythia)
    - 'SVJ_1000_20_0.3_peak (MadGraph)'

the datasets are not plotted in that order. Presumably they are simply ordered by yield, as the following plots demonstrate; they were produced with the above config fragments applied.

Get CI pipeline working

Imported from gitlab issue 6

Require a dummy "background" dataset or row (Rob)

Currently it is not possible to plot only signal as lines: a background row is required in order to plot (i.e. plotting always requires stacks).

Push package to pypi

Imported from gitlab issue 7

ValueError: scatter requires x column to be numeric

Imported from gitlab issue 2

csv at: /afs/cern.ch/user/d/danthony/public/combined_signal/

(chip_env) [zw18769@soolin updated_combined_signals]$ fast_plotter /users/zw18769/CHIP/analysis/output/MC_signals/combined_signal/updated_combined_signals/tbl_dataset.njet.nbjet--weight_nominal.csv

 
fast_plotter - INFO - Processing: /users/zw18769/CHIP/analysis/output/MC_signals/combined_signal/updated_combined_signals/tbl_dataset.njet.nbjet--weight_nominal.csv
Traceback (most recent call last):
  File "/users/zw18769/.local/bin/fast_plotter", line 11, in <module>
    load_entry_point('fast-plotter', 'console_scripts', 'fast_plotter')()

  File "/users/zw18769/CHIP/fast-plotter/fast_plotter/__main__.py", line 45, in main
    process_one_file(infile, args)

  File "/users/zw18769/CHIP/fast-plotter/fast_plotter/__main__.py", line 67, in process_one_file
    scale_sims=args.lumi, yscale=args.yscale)

  File "/users/zw18769/CHIP/fast-plotter/fast_plotter/plotting.py", line 21, in plot_all
    plot = plot_1d_many(projected, data=data, dataset_col=dataset_col, yscale=yscale, scale_sims=scale_sims)

  File "/users/zw18769/CHIP/fast-plotter/fast_plotter/plotting.py", line 63, in plot_1d_many
    _actually_plot(df_data, kind=kind_data, label="Data", ax=main_ax)

  File "/users/zw18769/CHIP/fast-plotter/fast_plotter/plotting.py", line 52, in _actually_plot
    df.reset_index().plot.scatter(x=x_axis, y="sumw", yerr="err", color="k", label=label, ax=ax)

  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 3461, in scatter
    return self(kind='scatter', x=x, y=y, c=c, s=s, **kwds)

  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 2941, in __call__
    sort_columns=sort_columns, **kwds)

  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 1977, in plot_frame
    **kwds)

  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 1743, in _plot
    kind=kind, **kwds)

  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 845, in __init__
    super(ScatterPlot, self).__init__(data, x, y, s=s, **kwargs)

  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 820, in __init__
    raise ValueError(self._kind + ' requires x column to be numeric')

ValueError: scatter requires x column to be numeric
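
The message itself says that the scatter call was given a non-numeric x column. One plausible cause, assuming the binning column holds bin labels stored as strings (the CSV contents are not shown here), can be worked around by plotting against integer bin positions and using the labels only as tick labels. The helper add_numeric_axis below is hypothetical and only sketches that conversion.

```python
import pandas as pd


def add_numeric_axis(df, column):
    """Return (copy of df with a numeric position column, ordered labels).

    Hypothetical helper: each distinct label in `column` is mapped to an
    integer position, in order of first appearance.
    """
    out = df.copy()
    labels = list(dict.fromkeys(out[column]))  # unique, order preserved
    positions = {label: i for i, label in enumerate(labels)}
    out[column + "_pos"] = out[column].map(positions)
    return out, labels


# Toy data, invented for this example.
df = pd.DataFrame({"njet": ["[2, 3)", "[3, 4)", "[2, 3)"],
                   "sumw": [1.0, 2.0, 3.0]})
numeric_df, labels = add_numeric_axis(df, "njet")
```

The new njet_pos column is numeric and can be passed as x, with labels used to set the tick labels afterwards.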

Discrete variables are plotted incorrectly

If a variable is binned discretely in fast-carpenter (e.g., njet), it is plotted incorrectly. The first two bins share the same value, and the last two bins share the same value. For example,
plot_dataset.njet--n_jets--weight_nominal--project_njet-yscale_log.pdf

Feature requests

A few suggestions:

Add 'merge_df' functionality into Plotter
Allow datasets to be normalised to unity (grouped by the datasets in 'value replacements')
Add optional 'ignore' as command-line argument for columns you don't want to plot
Add optional 'vars' as command-line argument for variables you do want to plot (to avoid requiring such a specific naming scheme)
Add an option to classify datasets undefined in the config as 'other'; this is an easy way to plot the datasets you're interested in and combine the others you don't care about
Add optional 'style' for how variables should be plotted. By this, I just mean using the 'style' option to effectively treat all datasets as signal or all datasets as background for the purposes of plotting

FAST-PLOTTER: Add 2D histogram functionality

This involves editing the code starting at [https://github.com/FAST-HEP/fast-plotter/blob/master/fast_plotter/plotting.py#L47]

ModuleNotFoundError `cycler` issue

I tried to use fast_plotter recently, but it couldn't find cycler even though it is installed.

$ fast_plotter --help
Traceback (most recent call last):
  File "/home/anaylor/.pyenv/versions/miniconda3-4.3.30/bin/fast_plotter", line 5, in <module>
    from fast_plotter.__main__ import main
  File "/home/anaylor/.pyenv/versions/miniconda3-4.3.30/lib/python3.6/site-packages/fast_plotter/__main__.py", line 7, in <module>
    import matplotlib
  File "/home/anaylor/.local/lib/python3.6/site-packages/matplotlib/__init__.py", line 139, in <module>
    from . import cbook, rcsetup
  File "/home/anaylor/.local/lib/python3.6/site-packages/matplotlib/rcsetup.py", line 31, in <module>
    from cycler import Cycler, cycler as ccycler
ModuleNotFoundError: No module named 'cycler'
$ pip freeze | grep cycler
cycler==0.10.0
$ pip freeze | grep fast-plotter
fast-plotter==0.8.1
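
In the traceback above, fast_plotter is loaded from the pyenv miniconda site-packages while matplotlib comes from ~/.local, so it is possible (though not confirmed here) that the pip being queried belongs to a different environment than the interpreter running fast_plotter. A small diagnostic sketch, to be run with the same interpreter as the failing script, shows where each module would be imported from. The helper name locate_modules is made up for this example.

```python
import importlib.util
import sys


def locate_modules(names):
    """Map each top-level module name to the file it would load from.

    The value is None when the current interpreter cannot find the module.
    """
    found = {}
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        found[name] = spec.origin if spec is not None else None
    return found


print("interpreter:", sys.executable)
for name, origin in locate_modules(["matplotlib", "cycler", "fast_plotter"]).items():
    print(name, "->", origin)
```

If cycler shows None here while pip freeze lists it, running python -m pip with this interpreter would install it into the environment that is actually in use.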

Feature request: create output directory for plots if it doesn't currently exist

Imported from gitlab issue 9
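
A minimal sketch of the requested behaviour, assuming Python 3.2 or later (where os.makedirs accepts exist_ok); the function name ensure_output_dir is hypothetical and not part of fast-plotter.

```python
import os
import tempfile


def ensure_output_dir(path):
    """Create path (and any missing parents) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path


# Usage: a nested directory under a temporary location is created on demand.
base = tempfile.mkdtemp()
outdir = ensure_output_dir(os.path.join(base, "plots", "signal"))
ensure_output_dir(outdir)  # calling it again is harmless
```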

If dataset_order is a list, fast-plotter fails

If a user specifies dataset_order as a list in their plotting config file, fast-plotter complains at this line

fast-plotter/fast_plotter/utils.py

Line 132 in 814f0f7

if dataset_order.startswith("sum"):

as lists don't have a startswith() method. The following check should instead be at the top of the function:

fast-plotter/fast_plotter/utils.py

Line 140 in 814f0f7

if isinstance(dataset_order, list):

Also, when specifying dataset_order as a list, the datasets are actually plotted in reverse order.
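
The reordering suggested above can be sketched as follows. This is not the function in fast_plotter/utils.py: order_datasets, its arguments, and the "sum-ascending" / "sum-descending" strings are assumptions made for illustration. The only point is that the list check comes before any string method is called.

```python
def order_datasets(datasets, dataset_order="sum-ascending", totals=None):
    """Return datasets in the requested order (hypothetical helper).

    dataset_order is either an explicit list of names or a string
    starting with "sum" (assumed here: "sum-ascending"/"sum-descending").
    totals maps dataset name to total yield for the "sum" modes.
    """
    # Check for a list first, so startswith() is never called on it.
    if isinstance(dataset_order, (list, tuple)):
        listed = [d for d in dataset_order if d in datasets]
        rest = [d for d in datasets if d not in dataset_order]
        return listed + rest  # listed names first, in the given order
    if dataset_order.startswith("sum"):
        totals = totals or {}
        descending = dataset_order.endswith("descending")
        return sorted(datasets, key=lambda d: totals.get(d, 0.0),
                      reverse=descending)
    raise ValueError("unknown dataset_order: %r" % (dataset_order,))
```

Keeping the list in the order given, rather than reversing it, also addresses the reversed plotting order at this level; whether the plotting code reverses it again later is not covered by this sketch.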

Return of 'ValueError: scatter requires x column to be numeric'

Imported from gitlab issue 3

CSVs are attached; it seems to be an issue with all of them.

(chip_env) [zw18769@soolin updated_combined_signals]$ fast_plotter -s ".*" -o ~/CHIP/analysis/output/MC_signals/combined_signal_5-12-18/ -l 41800 -w weight_nominal ~/CHIP/analysis/output/MC_signals/combined_signal/tbl_dataset.leadJet_pt.leadJet_eta--weight_nominal.csv 

fast_plotter - INFO - Processing: /users/zw18769/CHIP/analysis/output/MC_signals/combined_signal/tbl_dataset.leadJet_pt.leadJet_eta--weight_nominal.csv

Traceback (most recent call last):
  File "/users/zw18769/.local/bin/fast_plotter", line 11, in <module>
    load_entry_point('fast-plotter', 'console_scripts', 'fast_plotter')()
  File "/users/zw18769/CHIP/fast-plotter/fast_plotter/__main__.py", line 49, in main
    process_one_file(infile, args)
  File "/users/zw18769/CHIP/fast-plotter/fast_plotter/__main__.py", line 71, in process_one_file
    data=args.data, signal=args.signal, scale_sims=args.lumi, yscale=args.yscale)
  File "/users/zw18769/CHIP/fast-plotter/fast_plotter/plotting.py", line 21, in plot_all
    plot = plot_1d_many(projected, data=data, signal=signal, dataset_col=dataset_col, yscale=yscale, scale_sims=scale_sims)
  File "/users/zw18769/CHIP/fast-plotter/fast_plotter/plotting.py", line 99, in plot_1d_many
    plot_ratio(summed_data, summed_sims, x=x_axis, y=y, yvar=yvar, ax=summary_ax)
  File "/users/zw18769/CHIP/fast-plotter/fast_plotter/plotting.py", line 135, in plot_ratio
    ratio.reset_index().plot.scatter(x=x, y="Data / MC", yerr="err", ax=ax)
  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 3461, in scatter
    return self(kind='scatter', x=x, y=y, c=c, s=s, **kwds)
  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 2941, in __call__
    sort_columns=sort_columns, **kwds)
  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 1977, in plot_frame
    **kwds)
  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 1743, in _plot
    kind=kind, **kwds)
  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 845, in __init__
    super(ScatterPlot, self).__init__(data, x, y, s=s, **kwargs)
  File "/users/zw18769/.local/lib/python2.7/site-packages/pandas/plotting/_core.py", line 820, in __init__
    raise ValueError(self._kind + ' requires x column to be numeric')

ValueError: scatter requires x column to be numeric

tbl_dataset.leadJet_eta--weight_nominal.csv

cuts_signal_selection-.csv

tbl_dataset.met--weight_nominal.csv

tbl_dataset.leadJet_pt.leadJet_eta--weight_nominal.csv

tbl_dataset.njet.nbjet--weight_nominal.csv

tbl_dataset.sublJet_eta--weight_nominal.csv

tbl_dataset.sublJet_pt.sublJet_eta--weight_nominal.csv

Regex comparison issue when using Latex notation

If signal contributions are labelled using LaTeX (i.e. the names include backslashes), they cannot be shown in the plot, because this regex comparison line (https://github.com/FAST-HEP/fast-plotter/blob/master/fast_plotter/utils.py#L77) raises re.error: bad escape \X at position Y.

The workaround is to manually append the LaTeX-formatted string to the first_values array, with double backslashes. This is due to the known re.match issue when a string includes "\".
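
The failure and one possible fix can be reproduced with the standard library alone. In the sketch below the label and the helper matches_label are invented for illustration; re.escape is the standard way to treat a string literally inside a pattern, but whether that suits the comparison in utils.py depends on whether the names there are meant to be regexes.

```python
import re

# A LaTeX-style dataset label (example value, not taken from a real config).
label = r"$\mathrm{SVJ}$"

# Using the label directly as a pattern fails: "\m" is not a valid escape.
try:
    re.compile(label)
    raw_pattern_ok = True
except re.error:
    raw_pattern_ok = False


def matches_label(label, candidate):
    """True if candidate is exactly the literal label (hypothetical helper)."""
    return re.fullmatch(re.escape(label), candidate) is not None
```

With re.escape the backslashes, dollar signs and braces are all matched literally, so no manual doubling of backslashes is needed.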

Additional customisation options

Some small, additional customisation options could be useful (or used as default):

Option for hatched error bar on background MC instead of filled (in main plot)
Alter the range of the data/MC sub-plot (e.g., 0.5 - 1.5)

Fast plotter fails to extract weights from file name

Imported from gitlab issue 1

fast-plotter fails to open my dataframe output from fast-carpenter, with this backtrace:

  File "testplot.py", line 6, in <module>
    dataframe = read_binned_df("output/tbl_genJetPt.deltaPt.csv")
  File "/users/sb17498/.local/lib/python2.7/site-packages/fast_plotter/utils.py", line 25, in read_binned_df
    read_opts = get_read_options(filename)
  File "/users/sb17498/.local/lib/python2.7/site-packages/fast_plotter/utils.py", line 17, in get_read_options
    index_cols, _ = decipher_filename(filename)
  File "/users/sb17498/.local/lib/python2.7/site-packages/fast_plotter/utils.py", line 13, in decipher_filename
    weights = groups.group("weights").split(".")

when run on tbl_genJetPt.deltaPt.csv.
The example in the fast_cms_public_tutorial repo does not fail, as it does not perform the groups.group("weights") operation.

I presume the error is due to not using weights in the sequence, see sequence_cfg.yaml.
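
One way to tolerate file names without a weights part is to make that part optional when parsing. The sketch below is a hypothetical reimplementation, not the decipher_filename in fast_plotter/utils.py; it assumes the naming pattern tbl_<binning columns joined by dots>[--<weights joined by dots>].csv seen in the file names quoted in these issues.

```python
import os
import re

# Assumed pattern: "tbl_", dot-separated binning columns, then an
# optional "--" followed by dot-separated weight names, then ".csv".
_FILENAME = re.compile(r"^tbl_(?P<binning>.+?)(?:--(?P<weights>.+))?\.csv$")


def decipher_filename(filename):
    """Return (index_cols, weights); weights is [] if none are encoded."""
    match = _FILENAME.match(os.path.basename(filename))
    if match is None:
        raise ValueError("unexpected file name: %r" % filename)
    index_cols = match.group("binning").split(".")
    weights = match.group("weights")  # None when there is no "--" part
    return index_cols, (weights.split(".") if weights else [])
```

With this, "output/tbl_genJetPt.deltaPt.csv" yields the two binning columns and an empty weights list instead of raising.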