cta-observatory / protopipe
Prototype data analysis pipeline for the Cherenkov Telescope Array Observatory
Home Page: https://protopipe.readthedocs.io/en/latest/
License: Other
This issue is part of a project described in issue #24
The following is a "real-time" list of differences found between the pipelines during the comparison.
Not all features are critical to recovering the missing performance, but all should be implemented (as similarly as possible) to allow their optional use when comparing different algorithms.
These quantities depend on the calibration process (see issue #31) and in principle could be defined via benchmarking.
Currently we define them by requiring the rejection of 99.7% of the noise (which in the case of the pure MonteCarlo simulation means pixels where the true number of photoelectrons is 0).
Preliminary testing without double-pass image extraction (see below) shows that such thresholds are ~(8,4) for core and boundary pixels respectively for both LSTCam and NectarCam.
These values should decrease as we approach the method implemented in CTA-MARS.
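The current procedure can be sketched as follows (the noise distribution and all values are purely illustrative; the real thresholds come from the reconstructed charges of pixels whose true number of photoelectrons is 0):

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the reconstructed charges of pixels whose true p.e. count is 0
noise_charges = rng.normal(0.0, 3.0, size=100_000)

# picture threshold chosen so that 99.7% of noise pixels are rejected
picture_threshold = np.percentile(noise_charges, 99.7)
# boundary threshold, here simply half of it (as in the ~(8, 4) pair above)
boundary_threshold = picture_threshold / 2
```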
This is more relevant for direction reconstruction, but it is done using output of this data-level.
Description is here.
In protopipe, this requires outputting parameters that are currently missing from the generated DL1 files (the coordinates of the Center of Gravity of the parametrized image; see issue #40).
SEE ISSUE #31 FOR LATEST NEWS
Even if in the end we will most likely use the leakage parameter to assess if an image is clipped, this is still relevant for image parametrization.
In CTA-MARS the conversion between distances and degrees in each camera is done via a linear factor. Such quantity depends on the geometry of the dish containing the mirrors and takes into account optical aberrations.
This is explained in more detail in the second point of this section of the wiki
Unit-testing coverage is currently ~60%.
Many of the hidden operations are provided by ctapipe, so they are indirectly tested.
The code related specifically to protopipe should be covered as much as possible.
Already in very old versions of protopipe, it was possible to select a cleaning which retained the number of islands.
This has never been used (at least based on the common tailcut cleaning), and the main-cluster cleaning has been chosen for both, which is also what (at least) CTAMARS does.
EDIT: this is not correct, see issue #117 - still important to check the effect of clusters on classifier training!
In case the image is produced by a proton (electrons tend to produce gamma-like images due to the nature of EM showers), it is expected (but never tested in protopipe) that the number of surviving clusters could be useful for classification purposes (by becoming a feature in the classification models).
This should not matter for energy training, since there we already use images exclusively produced by gammas.
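A minimal sketch of how such a feature could be added (names are illustrative, and the labeling step is assumed to come from something like scipy.ndimage.label applied to the cleaning mask):

```python
import numpy as np

def n_islands(islands_labels):
    """Number of surviving clusters ("islands") in a labeled cleaning mask.
    The labeling itself is assumed to come from e.g. scipy.ndimage.label."""
    return len(np.unique(islands_labels[islands_labels > 0]))

# the count could then simply be added to the classifier's feature set
features = {"hillas_width": 0.05, "hillas_intensity": 120.0}
features["n_islands"] = n_islands(np.array([0, 1, 1, 2, 0, 3]))
```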
Right now the configuration files necessary to use protopipe, both on single files and on the grid, are divided under grid and protopipe (both under protopipe/aux/example_config_files and when using the auxiliary script create_dir_structure.py).
There is no need for this, since there is only one such file for the grid.
Since we want to recover the missing performance between protopipe and CTA-MARS, a step-by-step comparison will show where the changes happen.
CTAMARS step-by-step analysis can be found here.
Of course, divergences can be present across the entire pipeline, with varying importance.
The work could be divided into the following steps:
Each of these points should be tracked in a separate issue.
Both supporting libraries (ctapipe and pyirf) already provide a logging system that makes use of python's logging facility.
Even after the refactoring, there will be the need for a better logging system in the pipeline, and having it available already in 0.4.0-dev would be nice.
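As a sketch of what that could look like with Python's logging facility (logger name and format are illustrative):

```python
import logging

# per-module logger, following Python's logging facility as already used
# by ctapipe and pyirf (handler/format choices are illustrative)
logger = logging.getLogger("protopipe.scripts.write_dl1")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger.info("starting event loop")
logger.debug("this only shows up when the level is set to DEBUG")
```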
Launching the first script write_dl1.py with the files from the new divergent simulations, I get the following error from astropy when creating reco_result:
Traceback (most recent call last):
File "i/protopipe/protopipe/scripts/write_dl1.py", line 442, in <module>
main()
File "/protopipe/protopipe/scripts/write_dl1.py", line 222, in main
) in preper.prepare_event(source, save_images=args.save_images):
File "/protopipe/protopipe/pipeline/event_preparer.py", line 1163, in prepare_event
for tel_id in point_altitude_dict.keys()
File "/protopipe/protopipe/pipeline/event_preparer.py", line 1163, in <dictcomp>
for tel_id in point_altitude_dict.keys()
File "/miniconda3/envs/protopipe/lib/python3.7/site-packages/astropy/coordinates/sky_coordinate.py", line 257, in __init__
frame_cls(**frame_kwargs), args, kwargs)
File "/miniconda3/envs/protopipe/lib/python3.7/site-packages/astropy/coordinates/sky_coordinate_parsers.py", line 244, in _parse_coordinate_data
valid_components.update(_get_representation_attrs(frame, units, kwargs))
File "/miniconda3/envs/protopipe/lib/python3.7/site-packages/astropy/coordinates/sky_coordinate_parsers.py", line 589, in _get_representation_attrs
valid_kwargs[frame_attr_name] = repr_attr_class(value, unit=unit)
File "/miniconda3/envs/protopipe/lib/python3.7/site-packages/astropy/coordinates/angles.py", line 618, in __new__
self.wrap_angle = wrap_angle
File "/miniconda3/envs/protopipe/lib/python3.7/site-packages/astropy/coordinates/angles.py", line 654, in wrap_angle
self._wrap_internal()
File "/miniconda3/envs/protopipe/lib/python3.7/site-packages/astropy/coordinates/angles.py", line 645, in _wrap_internal
super().__setitem__((), value)
File "/miniconda3/envs/protopipe/lib/python3.7/site-packages/astropy/units/quantity.py", line 1060, in __setitem__
self.view(np.ndarray).__setitem__(i, self._to_own_unit(value))
File "/miniconda3/envs/protopipe/lib/python3.7/site-packages/astropy/units/quantity.py", line 1372, in _to_own_unit
raise TypeError("cannot convert value type to array type "
TypeError: cannot convert value type to array type without precision loss
If I use a file from the old simulations, the script runs without error. Both simulations were done using the same CORSIKA and SIMTEL versions; the only difference is the addition of a TELESCOPE 0 entry in the simtel config file, which corresponds to the array pointing. I checked and this should not interfere: I have 19 telescopes with IDs from 1 to 19.
The values of point_azimuth_dict[tel_id] and point_altitude_dict[tel_id] are both float32. One difference with respect to the old simulations is that some azimuth values are negative.
point_altitude_dict[tel_id]: 1.2257447242736816 rad float32
point_azimuth_dict[tel_id]: -3.0990066528320312 rad float32
Maybe this is the problem for astropy? @vuillaut @maxnoe
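A possible workaround to try (an assumption, not a confirmed fix): cast the float32 pointing values to 64-bit floats and wrap negative azimuths into [0, 2π) before building the SkyCoord, since astropy refuses lossy conversions when applying a wrap angle to float32 data:

```python
import math

def normalize_pointing(azimuth_rad, altitude_rad):
    """Cast to Python floats (64-bit) and wrap the azimuth into [0, 2*pi)."""
    az = float(azimuth_rad) % (2 * math.pi)
    alt = float(altitude_rad)
    return az, alt

# the values quoted in the issue above
az, alt = normalize_pointing(-3.0990066528320312, 1.2257447242736816)
```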
The paths to both the new and old simulations on the GRID can be found on Redmine: https://forge.in2p3.fr/issues/36268
This issue is part of a project described in issue #24
The following is a "real-time" list of differences found between the pipelines during the comparison.
Not all features are critical to recovering the missing performance, but all should be implemented (as similarly as possible) to allow their optional use when comparing different algorithms.
Description:
This is explained in detail here.
As with many other features, it will be developed in parallel in the development version of ctapipe (and imported from there with the new release) and in protopipe (based on the ctapipe 0.7 conda version).
calibscale
This should be done in ctapipe, but for the moment it is also done here in PR #140.
Description:
As explained in the wiki section "Conversion to photoelectrons", the factor to transform the integrated charges (in ADC counts) into a number of photoelectrons is the pixel-wise one calculated by simtelarray (0.92*LasCalData->calib[][][]).
We thus need to check whether ctapipe takes this value into account.
Right now it is already possible to analyze diffuse data, but it is a bit annoying to set up the training.
From the modeling configuration files (either regressor.yaml or classifier.yaml) it is possible to define selection cuts for training data in the section SigFiducialCuts (and BkgFiducialCuts for the classifier).
One of these cuts can be for example
offset < 1.0 # deg
but in the case of a diffuse analysis this is not sufficient, and one wants to differentiate between different offset bins (concentric rings around a location, or similar selections).
protopipe.scripts.build_models should be able to accept an array of values from the configuration file that will act as bin edges, and run itself once per bin.
The model outputs should then be written bin-wise in separate folders, and protopipe.scripts.write_dl2 should then select the right model depending on the event-candidate offset.
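A sketch of the proposed behaviour (all names, paths and bin edges are hypothetical; in practice the edges would come from the YAML configuration):

```python
import bisect

# hypothetical bin edges, as they could be given in the configuration file
offset_bin_edges = [0.0, 1.0, 2.0, 3.0]  # deg

def offset_bin(offset_deg):
    """Index of the concentric offset ring an event falls into,
    or None if it lies outside the configured edges."""
    i = bisect.bisect_right(offset_bin_edges, offset_deg) - 1
    if 0 <= i < len(offset_bin_edges) - 1:
        return i
    return None

# build_models would train one model per bin (writing it e.g. to a
# hypothetical models/offset_bin_{i}/ folder), and write_dl2 would pick
# the model indexed by offset_bin(event_offset)
```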
The pywi-cta and pywi dependencies should be removed and the required code moved to protopipe/ctapipe. These two packages are unlikely to be updated anymore (they were developed by J. Decock at CEA, initially for the wavelet cleaning studies).
Because some features are hard-coded, protopipe version 0.2.1 cannot analyse data files containing FlashCam, since some camera features (e.g. the camera radius) are missing.
Test done with write_dl1.py and a prod3b Paranal file.
This limitation is not documented, and it should be (it is only mentioned that MST and LST are hard-coded).
Now that some Prod5 files have been produced, it is time to add support to read them.
Two possible solutions:
protopipe.pipeline.utils.prod3b_array should be added, and be a bit less complicated, since we can retrieve the subarray combinations from here.

The changelog file is intended to collect a few phrases explaining the major changes done to the code, not the "comments" issued with the commits. It should use understandable phrases and help the reader understand a change without reading the notifications/commit messages.
Description:
This issue is part of #37 .
As mentioned in the CTA-MARS step-by-step analysis wiki page, in order to correct for the average effect of optical aberrations, which in some telescopes shift the focused light spot "outwards" (i.e. further away from the center than it would be for perfect optics), we modify the pixel coordinates before proceeding to image parametrization: we move all pixels towards the center, with the shift being proportional to the distance from the center of the camera.
That is, we "shrink" the camera by a given factor. The factor depends on the telescope type: it is 1.0466 for the LSTs and 1.0278 for NectarCams.
These values, which can be dubbed "effective focal lengths", were calculated by K. Bernloehr and are taken from https://www.mpi-hd.mpg.de/hfm/CTA/MC/Prod3/Config/PSF/flen.pdf
This correction is important; if not applied, it will lead to significant mis-reconstruction of the events. For example, it will shift the reconstructed gamma-ray directions away from the FoV center by the same factors quoted above; for an LST, this would introduce a 0.1 deg systematic error (at all energies) for gamma rays arriving 2 degrees off-axis.
This factor is the ratio between the effective and nominal focal lengths.
While the nominal ones are already available from the simtel files, the effective ones are taken from the PDF file mentioned above.
Implementation:
The transformation from the Camera to the Telescope frame is effectively
x[rad] = x[m] / nom,
with nom the nominal focal length; what CTA-MARS does instead is the equivalent of
x[rad] = (x[m] / nom) * (nom / eff) = x[m] / eff,
with eff the effective focal length.
This change needs to happen between image cleaning and Hillas parametrization;
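A minimal numpy sketch of the change (focal-length and coordinate values purely illustrative):

```python
import numpy as np

# focal lengths (illustrative): nominal from the simtel file, effective
# obtained via the "shrink" factor quoted above for an LST-like telescope
f_nom = 28.0             # m
f_eff = f_nom * 1.0466   # m

# pixel coordinates in the camera frame, in meters
pix_x = np.array([0.00, 0.05, 0.10])
pix_y = np.array([0.00, 0.00, 0.05])

# camera -> telescope frame using the *effective* focal length,
# i.e. x[rad] = x[m] / eff instead of x[m] / nom
x_rad = pix_x / f_eff
y_rad = pix_y / f_eff
```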
Caveat:
The implementation of this procedure in the code of protopipe implies that HillasReconstructor needs to be tweaked: that class expects its input parameters in the camera frame, whereas we will input them already in the telescope frame.
Eventually the same correction will impact ctapipe as well.
There, the values of eff could be read from ctapipe-extra and fed directly to the focal_length attribute of ctapipe.coordinates.CameraFrame.
Also HillasReconstructor would need to be modified in order to always start directly from the Telescope frame. For further details on this, please follow cta-observatory/ctapipe#1223.
Data related to 2nd-pass reconstructed values associated with simulated "signal" and "noise" true pixels (separately) are currently built from incorrect data.
From a simplified version, I recovered the correct results, which I already put here for preliminary information.
This does not seem to influence the calculated values of the optimized cleaning thresholds; checks will follow.
When enabling the --save_images option of protopipe.scripts.write_dl1, the images saved are all the same for each event.
This happens because the variables dl1_phe_image and mc_phe_image are single arrays and not dictionaries, unlike the great majority of the output of protopipe.pipeline.event_preparer.prepare_event.
To fix this it will be sufficient to initialize these two variables as empty dictionaries using the values of the variable tel_id as keys, both in protopipe.pipeline.event_preparer.prepare_event and protopipe.scripts.write_dl1.
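A sketch of the proposed fix (variable names from the issue; the loop and the data it iterates over are illustrative):

```python
# per-event containers keyed by telescope ID instead of single arrays
dl1_phe_image = {}
mc_phe_image = {}

# inside prepare_event's loop over triggered telescopes (data illustrative)
event_images = {1: [1.0, 2.0], 2: [3.0, 4.0]}
for tel_id, image in event_images.items():
    dl1_phe_image[tel_id] = image  # calibrated image for this telescope
    mc_phe_image[tel_id] = image   # the true-p.e. image would be stored here

# write_dl1 then saves dl1_phe_image[tel_id] per telescope, so images differ
```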
Running model_diagnostic.py for checking the energy regression model, I got this error:
File "model_diagnostic.py", line 195, in main
save_fig(path, outdir, "features", fig=fig)
File "/cta-protopipe/protopipe/protopipe/pipeline/utils.py", line 12, in save_fig
fig.savefig(path.join(outdir, name, format=ext))
TypeError: join() got an unexpected keyword argument 'format'
"format" should be an argument of savefig() and not join().
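For reference, a sketch of the corrected call (paths and names illustrative):

```python
import os.path as path

outdir, name, ext = "plots", "features", "pdf"

# Buggy call from utils.py: `format` ends up as a keyword argument of join()
# fig.savefig(path.join(outdir, name, format=ext))   # raises TypeError

# Fixed: close the join() call first, then pass `format` to savefig()
target = path.join(outdir, name)
# fig.savefig(target, format=ext)   # matplotlib call, commented out here
```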
The current definition of the data levels in protopipe does not correspond to the data models defined in ctapipe.
Taking advantage of the DL1 model now frozen in ctapipe 0.8.0 and the upcoming release of protopipe 0.3 (in which the DL1 data model will be indirectly frozen), it is time to name the scripts that generate the data tables properly.
write_dl1.py produces DL1 data plus direction estimation (and energy estimation if a model already exists). Its result is a set of events used to train the model estimators: this is called data_training.py in #58.
write_dl2.py produces data corresponding to events (aka showers) with direction, energy and classification information. It is practically DL2, but most probably it will differ from ctapipe's DL2, and its result is a set of data that goes into performance estimation (namely cuts optimization and IRF production).
A possible candidate name could be data_processing.py, but other suggestions are welcome!
WARNING: after this change, the GRID interface will also need to be modified accordingly!
provide better versioning, with version number parsed from 1 source (see cta-observatory/ctapipe#1334)
start packaging (at least) to PyPI
make aliases to scripts
Using the script build_model.py with classifier.yaml, I get the following error:
Traceback (most recent call last):
File "/storage/gpfs_data/ctalocal/adonini/protopipe/protopipe/scripts/build_model.py", line 248, in <module>
main()
File "/storage/gpfs_data/ctalocal/adonini/protopipe/protopipe/scripts/build_model.py", line 169, in main
force_same_nsig_nbkg=use_same_number_of_sig_and_bkg_for_training,
File "/storage/gpfs_data/ctalocal/adonini/protopipe/protopipe/mva/train_model.py", line 66, in split_data
target_name=self.target_name
File "/storage/gpfs_data/ctalocal/adonini/protopipe/protopipe/mva/utils.py", line 54, in split_train_test
run_max_train = obs_ids[max_train_obs_idx]
IndexError: index 0 is out of bounds for axis 0 with size 0
The files I use should be correct:
filename_sig: DL1/for_classification/dl1_tail_gamma_merged.h5
filename_bkg: DL1/for_classification/dl1_tail_proton_merged.h5
and the data are loaded correctly in data_sig and data_bkg:
But if I print them after the "add label" step (lines 149-150 in build_model.py), I get an empty data frame:
data_sig: Empty DataFrame
data_bkg: Empty DataFrame
The problem should be in the function prepare_data in utils.py, but I cannot see it.
Currently the modeling features:
are defined through the configuration files (either regressor.yaml or classifier.yaml);
when the appropriate classes in protopipe.mva read them, they pass through protopipe.mva.utils.prepare_data, which adds modified versions of the basic DL1/DL2a variables to the dataframes (such as e.g. log10 of variables, or more complex analytical combinations);
most importantly, they are hardcoded into write_dl2.py.
These 3 steps, as they are, make it difficult, if not annoying and easily error-prone, to play with different features.
My current idea is to make a dictionary - open to the user through the documentation - where all possible features, including new ones (so this dictionary would be open-ended), are mapped to integers.
So something like:
1: hillas_width
2: hillas_intensity
....
14: some crazy function
The user would then input the features from the configuration files as a list of integers, which the DL2 script can read, since the dictionary maps the features unambiguously to the estimation section.
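A sketch of such a registry (the entries and the lambda-based lookup are illustrative assumptions):

```python
import math

# open-ended feature registry: integer -> (name, how to compute it from a row)
FEATURE_MAP = {
    1: ("hillas_width", lambda row: row["hillas_width"]),
    2: ("hillas_intensity", lambda row: row["hillas_intensity"]),
    3: ("log10_intensity", lambda row: math.log10(row["hillas_intensity"])),
}

def resolve_features(indices, row):
    """Turn the list of integers from the YAML config into named values."""
    return {FEATURE_MAP[i][0]: FEATURE_MAP[i][1](row) for i in indices}

row = {"hillas_width": 0.05, "hillas_intensity": 100.0}
features = resolve_features([1, 3], row)
```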
Since we are using the free tier of Travis CI and are approaching the usage threshold of our plan, we need to find an alternative.
It is expected that sooner or later we will migrate to the CTA-ASWG GitLab workspace, but for the time being it is better to enable CI from the current hosting service, i.e. GitHub.
This is sub-summary issue part of #85 .
benchmarks/DL2/benchmarks_DL2_direction-reconstruction.ipynb)
minimum number of telescopes = 2 (configurable)
minimum angle between the image axes = 15 deg for 2-telescope events
maximum uncertainty in direction reconstruction
a cut removing images which are cut off at the edge of the camera FoV (configurable)
This will represent a major enhancement of the capabilities of protopipe and possibly other pipelines to interface with other groups and compare results between themselves.
It will also represent a major change to the code structure of protopipe (mainly write_dl1.py
and the code portions called by it).
Further information in cta-observatory/ctapipe#1163.
Of course, we depend on ctapipe's own timescale.
It would be much preferable to wait for ctapipe 0.8 and import this feature straight from it (better if in sub-modules of the complete DL1 script, which currently doesn't have all of protopipe's features) than to hardcode it and then take it out once that release is available.
This represents a requirement for protopipe 0.3.
Using protopipe 0.2.1 official release
Trying to analyse a data file with write_dl1.py
Using prod3b files (La Palma) and asking explicitly that only LSTCam and NectarCam are analysed (using analysis.yaml configuration file in aux/), the code crashes because it tries to decode FlashCam information.
This contradicts the fact that cam_id_list did not contain FlashCam (see error below).
Using the --cam_ids option on the command line does not solve the problem.
A protection is missing.
> python write_dl1.py --config_file testconfig.yaml -o mydl1.fits -i. -f data-prod3b-simtel.gz -m 10
/local/home/stolar/soft/anaconda3/envs/ctapipe07-021/lib/python3.7/site-packages/corsikaio/subblocks/dtypes.py:20: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
return np.dtype(dict(**dt))
pywicta package could not be imported
wavelet cleaning will not work
No module named 'pywicta'
pywicta package could not be imported
wavelet cleaning will not work
No module named 'pywicta'
found 1 files
file: 0 filename = ./data-simtel-North.gz
Traceback (most recent call last):
File "write_dl1.py", line 355, in <module>
main()
File "write_dl1.py", line 190, in main
) in preper.prepare_event(source, save_images=args.save_images):
File "/local/home/stolar/soft/anaconda3/envs/ctapipe07-021/lib/python3.7/site-packages/protopipe-0.1-py3.7.egg/protopipe/pipeline/event_preparer.py", line 315, in prepare_event
pmt_signal, camera
File "/local/home/stolar/soft/anaconda3/envs/ctapipe07-021/lib/python3.7/site-packages/protopipe-0.1-py3.7.egg/protopipe/pipeline/image_cleaning.py", line 119, in clean_image
mask = self.cleaners[cam_id](new_img, geom, self.clean_opts[cam_id])
KeyError: 'FlashCam'
Closing remaining open files:mydl1.fits...done
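A sketch of the missing protection (class and method names are assumptions modeled on the traceback, not the actual protopipe API): telescopes whose camera is not configured should be skipped instead of raising a KeyError:

```python
# Skip cameras that are not in the configured list instead of crashing
class ImageCleaner:
    def __init__(self, cleaners):
        self.cleaners = cleaners  # e.g. {"LSTCam": ..., "NectarCam": ...}

    def clean_image(self, image, cam_id):
        if cam_id not in self.cleaners:
            # camera not requested in cam_id_list: ignore this telescope
            return None
        return self.cleaners[cam_id](image)

cleaner = ImageCleaner({"LSTCam": lambda img: img, "NectarCam": lambda img: img})
skipped = cleaner.clean_image([1, 2, 3], "FlashCam")  # returns None, no crash
```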
This is basically the end result for what concerns the pipeline structure.
It shouldn't change much depending on the nature of the data (simulated or real).
Summary of refactored pipeline steps/tools:
You can find a more schematic view of this from my presentation at the October 2020 CTAC meeting.
ctapipe-stage1 to ctapipe-process, cta-observatory/ctapipe#1726
protopipe-MODELS to read from new ctapipe data model (see also cta-observatory/ctapipe#1755)
ctapipe-process (see cta-observatory/ctapipe#1744)
BONUS(es)
Here the single tools are explained.
This is the tool that produces data that can potentially be used for the training of the models.
I say "potentially" because of course it could be used for DL1/DL2a studies or it could be used partially for training and testing (like it is done in the current version of protopipe).
Based on : ctapipe-process
Input: simtel files or real R1 data
Output:
This is currently the least defined one, mainly because it is not strictly necessary for it to depend on ctapipe.
Currently, this is done by protopipe.scripts.build_models.py, which is based on protopipe.mva.
The main point here is that the tool should have its own configuration system and only worry about I/O.
All the internal operations will basically be a black box from the point of view of the pipeline itself.
This will allow switching between ML libraries/frameworks/testing code such as protopipe.mva, aict-tools or ctapipe.
Based on :
Input: from the Training tool
Output: model files with/without train/test data used (current protopipe includes them for intermediate benchmarking)
The only requirement of the output is to have at least 1 commonly accepted format (we can support more later if needed) so as to allow the use/comparison of different code sources.
This is basically the tool that produces DL2 files (like protopipe.scripts.write_dl2).
Based on : ctapipe-process
Input:
Output:
This is the tool that translates science case and DL2 data into the specific DL3 data.
This tool could also take care of event classes and types.
Based on : science-case-based configuration file + pyirf
Input:
from the Data processing tool,
Output: data in DL3 format as in GADF + output from CTA IRF WG
protopipe.perf is expected to contain at least 1 script to produce DL3 data (see issue #73).
The possibility to add statistical uncertainties to protopipe DL3 output has already been tested using the bootstrap method.
The single script(s) should perform the basic operation (produce DL3 from DL2).
In this issue, a second script is proposed to encapsulate the first one and perform the bootstrapping.
name: make_DL3_bootstrap.py
expected input: configuration file and/or CLI options
structure proposal:
# IMPORTS
# ...
from protopipe.perf import make_DL3_EventDisplay
from protopipe.perf import make_DL3_CTAMARS
# + any other pyirf-based approach for DL2->DL3 production
# ...
from protopipe.perf.utils import getDL2

def DL2_resampling(DL2):  # where DL2 is a table
    # ...
    return DL2_table_from_resampled_data

def main():
    # READ CONFIG
    conf = load_config(yaml)    # updated version of the current YAML configuration file
    DL2_protopipe = conf.indir  # path to DL2 data as produced by protopipe.scripts.write_dl2
    N = conf.N                  # bootstrap iterations
    approach = conf.approach    # DL3-generating script to use
    quantities = [...]          # list of DL3 output information of which we want to save expectation values + uncertainties

    DL2_pyirf = getDL2(DL2_protopipe)  # translation of protopipe/ctapipe DL2 data into pyirf internal format

    # BOOTSTRAP CYCLE
    expectation_values = {IRF_1: [], ..., etc.}
    uncertainty_values = {IRF_1: [], ..., etc.}
    DL3_iteration_output = {}
    for iteration in range(N):
        # resample DL2 data in pyirf-ready format
        DL2_pyirf_resampled = DL2_resampling(DL2_pyirf)
        # output could be the FITS file or a list of HDUs
        DL3_iteration_output[iteration] = globals()[approach](DL2_pyirf_resampled, iteration)
        if conf.save_all:  # save the DL3 data from the single iteration in a separate folder
            # save DL3_iteration_output[iteration] to f"protopipe_{approach}_bootstrap#{iteration}.fits.gz"
            pass
    for quantity in quantities:  # from DL3_iteration_output
        # get expectation values (e.g. medians) into expectation_values
        # get uncertainties (e.g. stds) into uncertainty_values
        pass
    # write the DL3 FITS file using expectation_values in the standard format
    # and uncertainty_values as an additional column

if __name__ == "__main__":
    main()
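Independently of the protopipe specifics, the core of the bootstrap can be shown with a runnable stdlib-only sketch (the statistic and the data are purely illustrative):

```python
import random
import statistics

def bootstrap(values, n_iterations=1000, seed=0):
    """Resample with replacement, recompute the statistic each time, and
    return its expectation value and uncertainty (statistic illustrative)."""
    rng = random.Random(seed)
    medians = []
    for _ in range(n_iterations):
        resampled = [rng.choice(values) for _ in values]
        medians.append(statistics.median(resampled))
    return statistics.median(medians), statistics.stdev(medians)

estimate, uncertainty = bootstrap([1.0, 2.0, 2.5, 3.0, 10.0])
```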
Previously, data was split by obs_id, but the diagnostic benchmarks didn't really care.
After PR #114 the split will be performed on single images, thus providing a more balanced splitting (all simulated runs should be more or less the same, but events can trigger a different number of images).
It could make sense to also have model/DL2 benchmarking/diagnostics on the level of runs or events in the future.
For this, we need to link the event ID with the observation ID, as well as the image parameters, to split and combine the model output.
allow choosing the number of images to be used for training based on camera types (so not a single integer flag --max_events, but a dictionary with camera types as keys and numbers of training images as values - this would also make the current cam_id_list flag obsolete)
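A sketch of what the per-camera flag could look like (camera names and numbers hypothetical):

```python
# hypothetical per-camera training budget replacing the single --max_events flag
max_events = {"LSTCam": 200_000, "NectarCam": 150_000}

def use_image(cam_id, n_used_so_far):
    """An image is used for training only if its camera type is configured
    and its budget is not yet exhausted; the keys replace cam_id_list."""
    return cam_id in max_events and n_used_so_far < max_events[cam_id]
```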
The latest working version of gammapy for protopipe is 0.8, which could be quite old for some people.
With gammapy 0.10 came PR #1998: a change was made to the class SensitivityEstimator by which the keyword irf was dropped.
This breaks protopipe, which still uses it.
Indeed, when launching the script make_performance.py, the user gets the following error:
### Estimating sensitivity...
Traceback (most recent call last):
  File "/Users/michele/Applications/ctasoft/protopipe/protopipe/scripts/make_performance.py", line 185, in <module>
    main()
  File "/Users/michele/Applications/ctasoft/protopipe/protopipe/scripts/make_performance.py", line 181, in main
    sensitivity_maker.estimate_sensitivity()
  File "/Users/michele/Applications/ctasoft/protopipe/protopipe/perf/irf_maker.py", line 101, in estimate_sensitivity
    sensitivity_estimator = SensitivityEstimator(irf=self.irf, livetime=obs_time)
TypeError: __init__() got an unexpected keyword argument 'irf'
So if we want to let protopipe use the latest version of gammapy (provided that this will not break anything else, of course), at least line #101 in protopipe/protopipe/perf/irf_maker.py will need to be updated accordingly.
`../../protopipe/scripts/write_dl1.py --config_file configs/protopipe/default.yml --save_images -o gamma_dl1.h5 -f gamma_20deg_180deg_run11___cta-prod3-demo-2147m-LaPalma-baseline_cone10.simtel.gz -i $PWD/data/DL0
file: 0 filename = /home/bregeon/CTA/Data/Pipelines/ctasoft/protopipe/work/test/data/DL0/gamma_20deg_180deg_run11___cta-prod3-demo-2147m-LaPalma-baseline_cone10.simtel.gz
Traceback (most recent call last):
File "../../protopipe/scripts/write_dl1.py", line 355, in
main()
File "../../protopipe/scripts/write_dl1.py", line 190, in main
) in preper.prepare_event(source, save_images=args.save_images):
File "/home/bregeon/CTA/Data/Pipelines/ctasoft/protopipe/protopipe/pipeline/event_preparer.py", line 319, in prepare_event
mask_biggest = largest_island(labels)
File "/home/bregeon/CTA/Data/Pipelines/ctasoft/protopipe/protopipe/pipeline/event_preparer.py", line 118, in largest_island
return islands_labels == np.argmax(np.bincount(islands_labels[islands_labels > 0]))
File "<array_function internals>", line 6, in argmax
File "/home/bregeon/data/Soft/miniconda3/envs/protopipe/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 1153, in argmax
return _wrapfunc(a, 'argmax', axis=axis, out=out)
File "/home/bregeon/data/Soft/miniconda3/envs/protopipe/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
return bound(*args, **kwds)
ValueError: attempt to get argmax of an empty sequence
Closing remaining open files:gamma_dl1.h5...doneimages.h5...done
`
The problem is basically that largest_island is called on an image where no island survived the cleaning.
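A sketch of a possible guard (an assumption on the fix, mirroring the largest_island shown in the traceback): return an all-False mask when no island survived:

```python
import numpy as np

def largest_island(islands_labels):
    """Return the mask of the largest labeled island; if no island survived
    the cleaning, return an empty mask instead of calling argmax on an
    empty sequence (this guard is the proposed fix)."""
    if np.count_nonzero(islands_labels) == 0:
        return np.zeros_like(islands_labels, dtype=bool)
    return islands_labels == np.argmax(np.bincount(islands_labels[islands_labels > 0]))
```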
Hi,
@adonini and I have created two notebooks adapted from the scripts in pyirf for:
The notebooks are ready and they work, but we would like to discuss some points:
the pointing_az/alt columns used in pyirf are not stored in protopipe DL2 files (at the moment we are hard-coding these values)
is the gh_score parameter in pyirf equivalent to gammaness and/or score in protopipe DL2 files? @vuillaut @maxnoe @HealthyPear

In the framework of issue #37, one of the tests is based on the use of the miss parameter.
In order to calculate it, we need the coordinates of the image Center Of Gravity (cog_x, cog_y, cog_phi in ctapipe slang), which are currently missing from the generated DL1 files.
These parameters are the ingredients for ctapipe.image.hillas.camera_to_shower_coordinates, which returns miss.
This will eventually be superseded by the new DL1 format, but since the comparison is ongoing, it makes sense to do it before.
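For reference, the quantity itself is simple to compute once the CoG coordinates are available; a pure-Python sketch (mirroring what the transverse coordinate of ctapipe.image.hillas.camera_to_shower_coordinates provides):

```python
import math

def miss(source_x, source_y, cog_x, cog_y, psi):
    """Perpendicular distance ("miss") from a reference point (e.g. the
    nominal source position) to the image major axis, defined by the CoG
    (cog_x, cog_y) and the orientation angle psi."""
    dx, dy = source_x - cog_x, source_y - cog_y
    # transverse component with respect to the axis direction (cos psi, sin psi)
    return abs(-dx * math.sin(psi) + dy * math.cos(psi))
```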
These are NOT unit-tests.
These are NOT benchmarks.
These are tests checking that all protopipe scripts (aka the "steps" or "tools" of the pipeline) end correctly with proper output.
Each integration test should be contained in a template function that performs the following checks,
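Such a template function could be sketched like this (the stand-in command replaces a real protopipe script invocation; names are illustrative):

```python
import subprocess
import sys
from pathlib import Path

def check_script(cmd, expected_output):
    """Template integration check: the pipeline script must end correctly
    (exit code 0) and produce its expected output file."""
    result = subprocess.run(cmd, capture_output=True)
    assert result.returncode == 0, result.stderr
    assert Path(expected_output).exists()

# demo with a stand-in command instead of a real protopipe script
check_script(
    [sys.executable, "-c", "open('dl1_demo.h5', 'w').close()"],
    "dl1_demo.h5",
)
```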
We should have 2 datasets (CTA-N + CTA-S both baseline arrays) per production.
Each dataset could have the same composition as a normal analysis, so
UPDATE: Actually it is not necessary; we can just use the same proportions between files while keeping the current pipeline workflow.
They could be produced directly using simtel and not by cutting an already existing simtel file.
This is to avoid situations in which a small test file doesn't contain enough triggered events.
So we could simulate e.g. 100 showers (= 10% statistics) with E > 10 TeV within a scatter radius of 150 meters.
UPDATE: Documentation has been updated accordingly (at least for Prod3b test files)
Aim of this issue:
The documentation has not been touched for a while (at least since March 2019!).
Many things happened since then, and even if some Sphinx extensions already allow for automatic documentation, many parts of the code are not set up for this (e.g. missing docstrings).
Also, ideas and TODOs are scattered throughout the text and some of them could be solved or obsolete.
Needless to say, any new feature from now on should be properly documented directly inside the associated Pull Request(s).
Things to do:
The script write_dl2.py as it is doesn't work.
Line 79:
allowed_tels, cams_and_foclens = prod3b_array(filenamelist[0], site, array)
the definition of the variable subarray is missing.
At line 219 some values defined in PreparedEvent are missing: dl1_phe_image_mask_reco, dl1_phe_image_1stPass, calibration_status and leakage_dict.
After adding these values the script works fine.
This issue is part of a project described in issue #24.
The following is a "real-time" list of differences found between the pipelines during the comparison.
Not all features are critical to recovering the missing performance, but all should be implemented (as similarly as possible) to allow their optional use when comparing different algorithms.
Basic quality cuts (configurable)
Intensity>50 (in the "biased" units - see #31 and the calibration benchmarking notebook)
0.1 < Width / Length < 0.6
80% camera radius containment
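For illustration, the three cuts above can be expressed as a single predicate; the function name, argument names, and the interpretation of containment as a cut on the image center-of-gravity radius are assumptions, not protopipe's API:

```python
# Hypothetical helper implementing the basic image quality cuts listed
# above; thresholds mirror the issue text and are meant to be configurable.
def passes_quality_cuts(intensity, width, length,
                        cog_radius, camera_radius,
                        min_intensity=50.0,
                        min_wl=0.1, max_wl=0.6,
                        containment=0.8):
    """Return True if an image survives the basic quality cuts."""
    if intensity <= min_intensity:      # Intensity > 50 (biased units)
        return False
    if length == 0:                     # guard against degenerate images
        return False
    wl = width / length
    if not (min_wl < wl < max_wl):      # 0.1 < Width/Length < 0.6
        return False
    # 80% camera-radius containment of the image center of gravity
    return cog_radius <= containment * camera_radius
```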
Direction LUT and associated quality cut
From CTAMARS wiki page:
"a look-up table is built per telescope type, containing, in bins of image intensity and Width / Length, the square of . This allows us to estimate for an image, given its Intensity, Width and Length, how reliable its axis is as a measure of the shower axis' orientation. The values in these LUTs will be used to set relative weights for the different telescopes in the stereoscopic reconstruction of events with three or more valid images."
The associated image quality cut rejects images whose direction LUT bin contains fewer than 10 events.
From the IRF Prod3b report:
"A maximum likelihood fit is performed to obtain the shower axis parameters (direction, core location).
In this fit each image takes part with a weight calculated according to its elongation (Width/Length) and
Size (total light content)."
This could be implemented in ctapipe (see cta-observatory/ctapipe#1636)
The weight mentioned here comes from the Direction LUT.
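As a concrete illustration, the LUT and the derived weight could be sketched as follows; the binning, the stored quantity (mean squared axis deviation), and all names are assumptions based on the description above, not CTAMARS code:

```python
import numpy as np

# Hypothetical bin edges for a per-telescope-type direction LUT.
intensity_edges = np.linspace(1.5, 5.0, 8)   # log10(intensity) bins
wl_edges = np.linspace(0.0, 1.0, 6)          # Width/Length bins

def build_lut(log_intensity, wl, sq_deviation):
    """Average squared axis deviation per (intensity, W/L) bin,
    plus the per-bin event counts used for the quality cut."""
    sums, _, _ = np.histogram2d(log_intensity, wl,
                                bins=[intensity_edges, wl_edges],
                                weights=sq_deviation)
    counts, _, _ = np.histogram2d(log_intensity, wl,
                                  bins=[intensity_edges, wl_edges])
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts, counts

def image_weight(lut, counts, log_i, wl, min_events=10):
    """Relative weight for the stereoscopic reconstruction:
    1 / <deviation^2>; images in bins with fewer than `min_events`
    entries are rejected (weight 0)."""
    i = np.clip(np.digitize(log_i, intensity_edges) - 1,
                0, len(intensity_edges) - 2)
    j = np.clip(np.digitize(wl, wl_edges) - 1, 0, len(wl_edges) - 2)
    if counts[i, j] < min_events or not np.isfinite(lut[i, j]) or lut[i, j] <= 0:
        return 0.0
    return 1.0 / lut[i, j]
```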
Apparently, in CTAMARS the image cleaning uses all the clusters, whereas the cleaning performed between the two passes of the image extraction uses only the brightest cluster.
Right now we use the main-cluster cleaning throughout the whole pipeline.
This will likely update the results summarized in PR #24.
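A minimal sketch of brightest-cluster selection on a cleaned image, treating the camera as a pixel graph (function and argument names are illustrative; ctapipe geometries provide the neighbor information in practice):

```python
from collections import deque

def brightest_cluster(mask, charges, neighbors):
    """Keep only the connected cluster of surviving pixels with the
    largest total charge; return the reduced boolean mask.

    mask      -- list of bools, True for pixels surviving the cleaning
    charges   -- per-pixel charge
    neighbors -- dict mapping pixel id -> list of neighboring pixel ids
    """
    seen = set()
    best, best_charge = set(), float("-inf")
    for start in range(len(mask)):
        if not mask[start] or start in seen:
            continue
        # BFS over surviving neighboring pixels
        cluster, queue = set(), deque([start])
        seen.add(start)
        while queue:
            pix = queue.popleft()
            cluster.add(pix)
            for nb in neighbors[pix]:
                if mask[nb] and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        total = sum(charges[p] for p in cluster)
        if total > best_charge:
            best, best_charge = cluster, total
    return [p in best for p in range(len(mask))]
```

Swapping between this reduced mask and the all-clusters mask is what would make the two cleaning flavours optionally comparable.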
The only one missing should be:
- Concentration (ratio of the highest charge contained in two contiguous pixels to the Intensity)
which has the same definition problem as in #92.
This issue is in common with #92.
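A sketch of the Concentration parameter as defined above (the function name and the `neighbors` adjacency representation are assumptions; the definitional ambiguity discussed in #92 is exactly which pixel pairs count as "contiguous"):

```python
def concentration_2pix(charges, neighbors):
    """Highest charge contained in two contiguous pixels, divided by
    the total image intensity. `neighbors` maps pixel id -> ids."""
    intensity = sum(charges)
    if intensity <= 0:
        return 0.0
    best_pair = max(
        charges[i] + charges[j]
        for i in range(len(charges))
        for j in neighbors[i]
    )
    return best_pair / intensity
```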
Right now we weight with Intensity, while CTAMARS uses Intensity^0.54.
CTAMARS uses the whole gamma-2 and proton-1 samples to train the classification model, whereas protopipe splits the original TRAINING data into train/test sub-samples.
This allows intermediate benchmarking to be applied before using the models on the rest of the analysis data sample (DL2 production takes more time, and it can be convenient to study the models without producing DL2 data every time).
This upgrade is expected to still use the current script (protopipe.pipeline.make_performance) with:
Related issues and PR
Current plan for EARLY development
protopipe.scripts should contain,
Since support for the other CTA cameras has been implemented, compatibility with the MAGIC camera, already implemented in ctapipe, could also be added. There is now a group working on MAGIC+LST data and it would be nice to be able to analyze that data with protopipe too, beginning at least with MC data.
I hard-coded the MAGICCam info, as was done for the other cameras, and added a new layout for the telescope IDs. Now I will run some tests on MC files.
This is a sub-summary issue, part of #85.
This part of the pipeline is more complex and structured than the others because it takes into account both energy and classification.
It will be subdivided later on.
benchmarks/DL2/benchmarks_DL2_direction-reconstruction.ipynb
This is not a critical thing, but it seems nice.
An easy solution could be to replicate what ctapipe does and use the same release drafter.
Currently, protopipe uses an Adaptive Boost Regressor based on a Decision Tree, while CTAMARS uses a Random Forest regressor.
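As a toy illustration (with made-up data, not protopipe's actual model configuration), the two choices map to scikit-learn as follows:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data standing in for image parameters -> energy.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.05, 500)

# Current protopipe choice: boosted decision trees
ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=5),
                        n_estimators=50, random_state=0).fit(X, y)

# CTAMARS-like choice: a Random Forest regressor
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
```

Since both expose the same fit/predict interface, switching between them should mostly be a matter of the model-configuration file.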
Add missing parameters
This one is not defined in the same way in ctapipe; it is not clear whether we should add it as is, so it is better to wait and see whether the difference in definition affects the overall performance of the pipeline.
The last 3 features may require a small enhancement in the management of the features read from the configuration file and from the scripts which produce DL2 data (see #90).
Right now we get only the point estimate, but we should also get a measure of the variance from the per-tree estimates.
This is also used in the weighting of the gammaness estimation for the DL2-candidate event.
From the wiki page of the CTAMARS analysis:
the RFs provide estimates for log10 E and its RMS, but the average is done after converting those to the linear energy scale.
Issue #139 is now breaking this behaviour, because this operation is performed on the base-10 logarithm scale!
This issue will also be useful for #93.
Right now we weight with Intensity, while CTAMARS uses 1/RMS^2, where the RMS comes from each Random Forest regressor.
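The per-tree spread and the linear-scale averaging can be sketched with scikit-learn's `estimators_` attribute; the data, the function name, and the exact weight definition are assumptions based on the description above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the energy-training data: features -> log10(E).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 4))
log10_e = 1.0 + 2.0 * X[:, 0] + rng.normal(0, 0.05, 300)

rf = RandomForestRegressor(n_estimators=40, random_state=1).fit(X, log10_e)

def energy_and_weight(model, x):
    """Per-tree estimates averaged on the LINEAR energy scale,
    plus a CTAMARS-like 1/RMS^2 weight from their spread."""
    per_tree_log = np.array([t.predict(x)[0] for t in model.estimators_])
    per_tree_lin = 10.0 ** per_tree_log   # convert BEFORE averaging
    mean_e = per_tree_lin.mean()
    rms = per_tree_lin.std()
    weight = 1.0 / rms**2 if rms > 0 else 0.0
    return mean_e, weight
```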
As shown here, we have lost sensitivity at low energies (mainly due to the necessary changes between 0.3.0 and 0.4.0).
Currently, it is not clear whether the previous 2 points will solve this or whether it will require fixing/adding something in DL1 and/or DL2a.
CTAMARS uses the whole gamma-1 sample to train the energy model, whereas protopipe splits the original TRAINING data into train/test sub-samples.
This allows intermediate benchmarking to be applied before using the models on the rest of the analysis data sample (DL2 production takes more time, and it can be convenient to study the models without producing DL2 data every time).
In the case of energy estimation this may be a smaller problem than for classification; in fact, the energy estimation benchmarks can be applied (as of 0.4.0) to the gamma-2 sample, which is used to train the classification model.
Context
Some of these steps are different between CTA-MARS and EventDisplay.
We should record as much information as possible about which operations are missing from our API.
What CTA-MARS does
Paraphrasing from the IRF document:
In each energy bin, cuts in:
- multiplicity (scanned from >= 2 to >= 9 telescopes of any type),
- Hadronness (from 0.5 to 0.95 gamma efficiency)
- event direction (from 0.5 to 0.95 gamma efficiency)
are optimized in order to achieve the best possible (point-like source) gamma-ray flux sensitivity (against the background of cosmic-ray electrons and protons), according to the standard definition within CTA (5 sigma, >= 10 events, >= 5% of residual background).
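The scan described above can be sketched as a grid search over candidate cut combinations in one energy bin; the counts, the naive significance formula (a real analysis would use a Li&Ma-style significance), and all names here are illustrative assumptions:

```python
import math

def passes_criteria(n_gamma, n_bkg):
    """CTA-like validity criteria: 5 sigma significance,
    >= 10 excess events, excess >= 5% of residual background."""
    if n_gamma < 10 or n_gamma < 0.05 * n_bkg:
        return False
    sigma = n_gamma / math.sqrt(max(n_bkg, 1e-9))  # naive significance
    return sigma >= 5.0

def best_cuts(gamma_counts, bkg_counts):
    """gamma_counts / bkg_counts: dicts keyed by a cut combination
    (multiplicity, hadronness_efficiency, direction_efficiency),
    giving the surviving gamma / background counts in this energy bin."""
    best, best_flux = None, math.inf
    for cuts, n_g in gamma_counts.items():
        n_b = bkg_counts[cuts]
        if not passes_criteria(n_g, n_b):
            continue
        # minimal detectable flux ~ sqrt(background) / gamma efficiency
        flux = math.sqrt(n_b) / n_g
        if flux < best_flux:
            best, best_flux = cuts, flux
    return best
```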
Current situation
(to be updated as the discussion in this issue grows)
Questions for @moralejo:
Hello, I'm from a team working on a shower-reconstruction technique based on neural networks. We think it would be nice to have the option to use Keras models alongside the already implemented scikit-learn regression/classification models for tasks like this.
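One way this could work is a thin adapter exposing the scikit-learn-style fit/predict interface that protopipe already expects; `KerasAdapter` and its arguments are hypothetical (an off-the-shelf wrapper could be used instead), and the usage below exercises it with a stand-in model so no Keras install is needed for the sketch:

```python
class KerasAdapter:
    """Wrap any object with Keras-like fit/predict behind a
    scikit-learn-style interface."""

    def __init__(self, build_fn, epochs=10, batch_size=32):
        self.build_fn = build_fn   # callable returning a compiled model
        self.epochs = epochs
        self.batch_size = batch_size
        self.model_ = None

    def fit(self, X, y):
        self.model_ = self.build_fn()
        self.model_.fit(X, y, epochs=self.epochs,
                        batch_size=self.batch_size, verbose=0)
        return self

    def predict(self, X):
        return self.model_.predict(X, verbose=0)


# Usage with a stand-in model object (no Keras required here):
class _MeanModel:
    def fit(self, X, y, **kwargs):
        self.mean = sum(y) / len(y)
    def predict(self, X, **kwargs):
        return [self.mean] * len(X)

adapter = KerasAdapter(_MeanModel, epochs=1).fit([[0], [1]], [1.0, 3.0])
predictions = adapter.predict([[2], [3]])
```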