
sms2021-tra-tra's Introduction

This is the repository for Andrew Spanopoulos' CERN Summer Student project.

The aim is to investigate and implement a machine-learning-based (conformal) transform method that maps hits from the detector coordinate system into a representation where either clustering or histogramming techniques can be applied.

This project has been made public to showcase the Summer Student project on tracking.

sms2021-tra-tra's People

Contributors

asalzburger, andrewspano, noemina


sms2021-tra-tra's Issues

Try to produce your first data set

As soon as you have installed the ACTS project, you can produce your first data set:

./ActsExampleParticleGun -n100 --gen-nparticles 25 --gen-mom-gev 0.25:10. --gen-mom-transverse true --gen-eta -0.5:0.5 --output-csv
./ActsExampleFatrasDD4hep --dd4hep-input=../../acts/thirdparty/OpenDataDetector/xml/OpenDataDetector.xml --output-csv  --bf-constant-tesla 0:0:2 --input-dir="./"

The two commands are used, respectively, to:

  • generate the particle sample
  • simulate how they interact with the detector

With the first command you produce 100 events (-n100) with 25 particles each (--gen-nparticles 25) [by default you simulate single muons with pdgId=13], uniformly distributed in eta between -0.5 and 0.5 (--gen-eta -0.5:0.5). The transverse momentum (--gen-mom-transverse true) is set to be uniformly distributed between 250 MeV and 10 GeV (--gen-mom-gev 0.25:10.). The output files are written in CSV format (--output-csv).

For the detector simulation we use the OpenDataDetector (ODD) (--dd4hep-input=../../acts/thirdparty/OpenDataDetector/xml/OpenDataDetector.xml), assuming a homogeneous axial magnetic field of 2 T (--bf-constant-tesla 0:0:2). The directory of the input files is also specified (--input-dir="./"), as well as the format of the output files (--output-csv).
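
As a quick way to inspect what was produced, the CSV files can be read with pandas. This is a minimal sketch, not part of the original instructions; the file-name pattern and column names are assumptions based on the ACTS CSV writers and may differ in your output directory:

  import glob
  import pandas as pd

  # Assumed file naming of the ACTS CSV writers; adjust to what was actually written.
  hit_files = sorted(glob.glob("event*-hits.csv"))
  particle_files = sorted(glob.glob("event*-particles.csv"))

  # Load one event and inspect the columns (tx, ty, tz are typically the true hit positions).
  hits = pd.read_csv(hit_files[0])
  particles = pd.read_csv(particle_files[0])
  print(hits.columns.tolist())
  print(particles.columns.tolist())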

  • When you feel comfortable, try to produce one dataset with very high pT, e.g. 50-100 GeV. This will be used to cross-check if the approximation of the Hough transform space in the transverse plane holds (as it has to for high pT tracks).

More complex input files

In data/pdg13-n25-0.5to10GeV-0.5eta there are more complicated input files. The simulation is now done with

  • particles of transverse momentum 0.5 to 10 GeV
  • 25 particles per event

Tasks:

  • Learn how to associate the particle truth information to the particles
  • Plot, e.g., the number of hits per particle (e.g. as a function of eta, phi, etc.)
  • Try to fit a circle to the hits from one particle -> estimate the transverse momentum and compare to the truth (a sketch follows below)
  • Make a PR to import your code

Info: curvature, magnetic field, particle mass.
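
A minimal sketch of the circle fit and p_T estimate, using an algebraic (Kasa) least-squares circle fit in the transverse plane and the standard relation p_T [GeV] ≈ 0.3 · B [T] · R [m]; the column names and units (positions in mm) are assumptions based on the ACTS CSV hit format:

  import numpy as np

  def fit_circle(x, y):
      """Algebraic (Kasa) circle fit: least-squares solution of x^2 + y^2 + D*x + E*y + F = 0."""
      A = np.column_stack([x, y, np.ones_like(x)])
      b = -(x**2 + y**2)
      (D, E, F), *_ = np.linalg.lstsq(A, b, rcond=None)
      cx, cy = -D / 2.0, -E / 2.0
      radius = np.sqrt(cx**2 + cy**2 - F)
      return cx, cy, radius

  def estimate_pt(x_mm, y_mm, b_tesla=2.0):
      """Estimate p_T in GeV from the fitted circle radius, assuming hit positions in mm."""
      _, _, r_mm = fit_circle(np.asarray(x_mm, float), np.asarray(y_mm, float))
      return 0.3 * b_tesla * (r_mm * 1e-3)  # p_T [GeV] ~ 0.3 * B [T] * R [m]

  # Hypothetical usage with the hits of one particle (tx, ty from the ACTS hits CSV):
  # pt_est = estimate_pt(hits_one_particle["tx"], hits_one_particle["ty"])

The estimate can then be compared to the truth p_T stored with the generated particle.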

Implementing Hough transform for (phi_0,q/p_T) feature space

  • Define the transformation to move from the (x,y) space to the (phi_0, q/p_T) feature space. A bit of math is posted below; a commonly used approximate form is also sketched after this list.

  • Plot the tracks in the (phi_0, q/p_T) plane and produce the binned version of the accumulation histogram, in which the crossing point can be evaluated.

[image]

  • Make heat map with different colour scheme (e.g. inverting the colour palette).
  • Evaluate efficiency of the Hough transform as a function of p_T for the precise and approximated transformation.
  • Produce heat map for the longitudinal transformation.
  • Start combining the information in the longitudinal and transverse planes.
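
For reference, a commonly used form of this transformation (my own sketch, not necessarily the exact math posted in the image above): for a track from the origin in a homogeneous axial field B, a hit at transverse radius r and azimuth phi satisfies, with r in m, B in T, p_T in GeV and R = p_T / (0.3 |q| B) the radius of curvature (up to a sign convention),

  \phi_0 = \phi + \arcsin\!\left(\frac{0.3\, B\, r}{2}\,\frac{q}{p_T}\right)
  \qquad\xrightarrow{\; r \ll 2R \;}\qquad
  \phi_0 \approx \phi + \frac{0.3\, B\, r}{2}\,\frac{q}{p_T}

so every hit defines a curve (a straight line in the approximated case) in the (phi_0, q/p_T) feature space, and hits from the same track intersect in one bin of the accumulation histogram; the arcsin form corresponds to the "precise" transformation and the linear form to the "approximated" one mentioned above.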

About the efficiency plots:

  • Correct the p_T used for the efficiency plots: it has to be the truth p_T. Make two plots:
    • number of generated particles vs generated p_T
    • number of generated particles that were found vs generated p_T (for the precise and the approximated transformation)
  • Add efficiency vs eta (again for the truth particle)
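
A minimal numpy sketch of how these two histograms combine into an efficiency curve (the binning and the function name are assumptions):

  import numpy as np

  def efficiency_vs_pt(pt_generated, pt_matched, pt_bins=np.linspace(0.5, 10.0, 20)):
      """pt_generated: truth p_T of all generated particles;
      pt_matched: truth p_T of the generated particles matched to a found track (GeV)."""
      n_gen, _ = np.histogram(pt_generated, bins=pt_bins)
      n_found, _ = np.histogram(pt_matched, bins=pt_bins)
      # Efficiency per bin = found / generated, leaving empty bins at zero.
      return np.divide(n_found, n_gen,
                       out=np.zeros_like(n_found, dtype=float), where=n_gen > 0)

The same recipe applies for the efficiency vs eta plot.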

Study effect of material and non-homogeneous magnetic field

You have successfully validated the Hough transform in (phi,q/p_T) and built up combinations of the information from the two feature spaces. It is time now to study the effect of detector material and non-homogeneous magnetic field.

Considering the results from #12 and #16, we can use the approximated transformation together with combination 1.

  • Evaluate the efficiency as a function of pT and eta for the three cases: perfect scenario, w/ material effects and w/ non-homogeneous magnetic field.

  • Evaluate the fake rate and the duplicate rate, if possible as a function of eta and pT as well.

  • It would also be nice to see how these effects pollute the Hough transform and what the hot spots look like in the feature space. This would require some gymnastics with the random number generator, to be able to generate the same particles for the three configurations considered.

Duplicate Removal

  • Implement baseline methods for removing duplicate tracks from the Hough Transform output.

    For both baseline approaches implemented, the efficiency drops along with the duplicate and fake rates. This happens because tracks that are not duplicates are flagged as such and therefore removed. To solve this, we must fine-tune these baseline algorithms a bit further:

    • Implement a purity function to see what percentage of the hits assigned to an estimated track actually belongs to the leading particle, in order to understand how much noise there is in the bins.
    • Fine-tune the algorithm so that the efficiency is not affected at all.
  • Implement a more sophisticated (yet still deterministic) method of filtering out duplicate tracks. Maybe build on top of the baseline and also use geometry information?

  • Implement a Machine Learning approach to duplicate removal. For this:

    • Set up the NVIDIA GPU on my computer.
    • Implement a function that, given the results of a run of the Hough Transform algorithm, creates a dataset of duplicate and non-duplicate tracks. Make sure the data is not biased and not too easy for the NN to separate: for the non-duplicate tracks (negative examples), pick tracks that are actually close in Hough space, since the duplicate tracks will also be close in Hough space (a sketch follows this list).
    • Do the above for many bin sizes and nhits values, in order to make sure that we don't overfit to a specific case.
    • Also create a test dataset with unseen bin sizes and nhits values.
    • Train the model on the training data and assess the results on both seen and unseen data.
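
A minimal sketch of such a dataset builder and a stand-in classifier; the candidate representation, the truth-based labelling and the closeness cuts are all assumptions:

  from itertools import combinations
  import numpy as np
  from sklearn.neural_network import MLPClassifier  # simple stand-in for the NN

  def build_pair_dataset(candidates, max_dphi0=0.1, max_dqpt=0.2):
      """candidates: list of dicts with 'phi0', 'qpt' (bin centres), 'hits' (set of hit ids)
      and 'particle_id' (leading truth particle).  A pair is labelled as a duplicate when
      both candidates point to the same truth particle; only pairs that are close in Hough
      space are kept, so the negative examples are not trivially separable."""
      X, y = [], []
      for a, b in combinations(candidates, 2):
          dphi0, dqpt = abs(a["phi0"] - b["phi0"]), abs(a["qpt"] - b["qpt"])
          if dphi0 > max_dphi0 or dqpt > max_dqpt:
              continue
          X.append([dphi0, dqpt, len(a["hits"]), len(b["hits"])])
          y.append(int(a["particle_id"] == b["particle_id"]))
      return np.array(X), np.array(y)

  # Hypothetical usage:
  # clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500).fit(X_train, y_train)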

More complex simulation setups

List of datasets to be produced:

  • detector with material

This needs a material map file odd-material-map.root. To activate this, one needs to specify the material input file in the Fatras simulation by adding the following program options:

--mat-input-type=file --mat-input-file=odd-material-map.root
  • with non-homogeneous magnetic field

This needs a magnetic field map odd-bfield.root. To use this, one needs to add the following option to the Fatras command:

  --bf-map-file=odd-bfield.root
  • with material and non-homogeneous magnetic field

These two options can of course also be combined (which represents the worst case).

The files are part of a PR into the ODD detector (not merged yet):

https://github.com/acts-project/OpenDataDetector/pulls

Accessing them requires Git Large File Storage (git-lfs) support.

As we haven't talked about particle types yet, we should do that as well:

  • muons (PDG 13)
  • electrons (PDG 11)
  • pions (PDG 211)

Some actions:

  • Investigate those with constant field, but under the influence of material
    • how many hits are produced (truth) by those particles
    • what's the efficiency for these particles

If we want to see the effects of magnetic field / material on muons themselves:

  • this is visible in the low momentum regime (0.5 - 1 GeV)

Study binning effect on selection of Hough tracks

  • Develop a method to study the effect of noise due to the binning (also useful to understand the material and magnetic field effects)
  • Evaluate the bin of phi (or q/pT) for the true particle and pick the bin center
  • Use the phi (or q/pT) bin center from the true particle as the hit value and make the Hough space accumulation plot. This will allow us to understand how the bins spread for one variable when the other one is fixed.
  • Plot the residual for phi and q/pT (comparing the value of the reconstructed found track (nhits>8, 9, 10) and the true particle)

  • Compare with bin width used

  • For q/p_T the bin size in the Hough space is not optimal. Let's try to change it and see how the counts vs q/pT residual changes.

  • Same as above, shown using nhits>8, 9, 10 when selecting the reconstructed tracks

  • Once we have defined a better q/p_T bin size and observed how the residuals change as a function of the number of hits in the bin used to select the reco track, we need to look at the efficiency and duplicate rate as a function of the number of hits.

  • --> Look at the other issue about combining the longitudinal and transverse HTs ;)
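
A minimal sketch for the bin-center / residual part of this study; the bin edges and the reco/true variable names are assumptions:

  import numpy as np

  phi_edges = np.linspace(-np.pi, np.pi, 201)  # hypothetical phi_0 binning
  qpt_edges = np.linspace(-2.0, 2.0, 101)      # hypothetical q/p_T binning [1/GeV]

  def bin_center(value, edges):
      """Return the centre of the bin that contains `value`."""
      i = np.clip(np.digitize(value, edges) - 1, 0, len(edges) - 2)
      return 0.5 * (edges[i] + edges[i + 1])

  # Residuals between the selected reconstructed bin and the true particle,
  # e.g. for reconstructed tracks with nhits > 8:
  # res_phi = phi_reco - bin_center(phi_true, phi_edges)
  # res_qpt = qpt_reco - bin_center(qpt_true, qpt_edges)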

Understand and decode GeometryID

The geometry_id carries the volume, layer and sensitive information via bit masks:

  // (2^8)-1 = 255 volumes
  static constexpr Value kVolumeMask = 0xff00000000000000;
  // (2^8)-1 = 255 boundaries
  static constexpr Value kBoundaryMask = 0x00ff000000000000;
  // (2^12)-1 = 4095 layers
  static constexpr Value kLayerMask = 0x0000fff000000000;
  // (2^8)-1 = 255 approach surfaces
  static constexpr Value kApproachMask = 0x0000000ff0000000;
  // (2^28)-1 sensitive surfaces
  static constexpr Value kSensitiveMask = 0x000000000fffffff;

Prepare functions to decode this

  • Decode volume, layer, sensitive number from geometry_id
  • What are the distributions of those numbers in your dataset?
  • Investigate truth tracks regarding their geometry paths
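
A minimal Python sketch of the decoding, derived directly from the masks quoted above (the function names are my own):

  # Bit masks copied from the ACTS GeometryIdentifier definition quoted above.
  VOLUME_MASK    = 0xff00000000000000
  BOUNDARY_MASK  = 0x00ff000000000000
  LAYER_MASK     = 0x0000fff000000000
  APPROACH_MASK  = 0x0000000ff0000000
  SENSITIVE_MASK = 0x000000000fffffff

  def _extract(geometry_id: int, mask: int) -> int:
      """Shift the masked bits down to the least significant position."""
      shift = (mask & -mask).bit_length() - 1  # position of the lowest set bit of the mask
      return (geometry_id & mask) >> shift

  def decode_geometry_id(geometry_id: int) -> dict:
      return {
          "volume":    _extract(geometry_id, VOLUME_MASK),
          "boundary":  _extract(geometry_id, BOUNDARY_MASK),
          "layer":     _extract(geometry_id, LAYER_MASK),
          "approach":  _extract(geometry_id, APPROACH_MASK),
          "sensitive": _extract(geometry_id, SENSITIVE_MASK),
      }

Applied to the geometry_id column of the hits CSV, this gives the volume/layer/sensitive distributions asked for above.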

Investigate & plot single particle files

In data/pdg13-n1-1GeV-0.5eta there are 100 single muon (PDG code 13) events simulated in a test detector.

The simulation was done with:

  • constant magnetic field of 2 Tesla in direction (0,0,1)
  • constant transverse momentum of 1 GeV
  • restricted to |eta| < 0.5 to stay within the barrel region (eta denotes the pseudo-rapidity)
  • without any material in the detector

Investigate the hits and particle content:

  • plot the hits in x/y plane (transverse view)
  • plot the hits in r/z plane (longitudinal view)
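
A minimal matplotlib sketch for these two views, assuming the hits were loaded into a pandas DataFrame with the ACTS CSV columns tx, ty, tz (column names and mm units are assumptions):

  import numpy as np
  import matplotlib.pyplot as plt

  def plot_hit_views(hits):
      """hits: DataFrame with the true hit positions tx, ty, tz (assumed to be in mm)."""
      x, y, z = hits["tx"].values, hits["ty"].values, hits["tz"].values
      r = np.hypot(x, y)
      fig, (ax_xy, ax_rz) = plt.subplots(1, 2, figsize=(10, 5))
      ax_xy.scatter(x, y, s=2)
      ax_xy.set(xlabel="x [mm]", ylabel="y [mm]", title="transverse view")
      ax_rz.scatter(z, r, s=2)
      ax_rz.set(xlabel="z [mm]", ylabel="r [mm]", title="longitudinal view")
      plt.show()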

Look up / research the concept of a Hough transform.

First (straight-line) Hough transform

Implement a first straight-line-based Hough transform, e.g. using scikit-image:

https://scikit-image.org/docs/0.3/auto_examples/plot_hough_transform.html

  • try to detect the straight lines in the r-z plane and compare to the ground truth
  • define efficiency/fake/duplicate rates (for the moment, hopefully only the efficiency matters)
  • think of expanding the straight-line Hough transform to a helical conformal-mapping transform
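
A minimal sketch using skimage.transform.hough_line, which operates on a binary image, so the r-z hits are first rasterised into a 2D histogram (the binning and the function name are assumptions):

  import numpy as np
  from skimage.transform import hough_line, hough_line_peaks

  def straight_line_ht(z, r, bins=(200, 100)):
      """z, r: arrays of hit coordinates in the longitudinal view."""
      image, _, _ = np.histogram2d(z, r, bins=bins)
      binary = image.T > 0  # rows = r bins, columns = z bins
      # Accumulate over a range of line angles and pick the strongest peaks.
      angles = np.linspace(-np.pi / 2, np.pi / 2, 360, endpoint=False)
      hspace, thetas, dists = hough_line(binary, theta=angles)
      return hough_line_peaks(hspace, thetas, dists)

  # Each returned (accumulator value, theta, dist) triple is a straight-line candidate
  # to compare with the truth particles.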

Further refinements:

  • write a purely truth-based efficiency function (1)
  • implement the book-keeping

(1) Truth based efficiency function

Given a set of found hits [f], look up how many of the hits actually come from the same particle using the truth association, i.e. every hit in the production file carries an identifier of the particle that produced it.

  • every hit has a weight; assume weight 1
  • if multiple particles are present, calculate the fraction for each particle and take the leading particle:
    sum the hit weights w_i for the leading particle and divide by the total weight, which results in a truth (matching) probability in [0, 1]
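
A minimal sketch of this matching probability, assuming each found hit carries the particle_id of the particle that produced it (as in the truth CSV):

  from collections import Counter

  def matching_probability(hit_particle_ids, weights=None):
      """Return the leading particle and the fraction of the total hit weight it carries."""
      if weights is None:
          weights = [1.0] * len(hit_particle_ids)  # every hit has weight 1 by assumption
      totals = Counter()
      for pid, w in zip(hit_particle_ids, weights):
          totals[pid] += w
      leading_pid, leading_weight = totals.most_common(1)[0]
      return leading_pid, leading_weight / sum(weights)

  # Example: a candidate whose hits come from particles 7, 7, 7, 4
  # -> leading particle 7 with matching probability 0.75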

Updates:

  • Plot a histogram of "matching probability"
  • Calculate the efficiency per track, efficiency vs eta, efficiency vs pT
  • Start writing a selection/classification function (2)

(2) So far, you are taking the 25 best tracks per event, because you know (from truth information that would not be available in a real reconstruction) that 25 particles are produced.

To overcome this, we use selection criteria without looking at truth information:

  • minimum number of hits
  • holes
  • shared hits: are the hits already used by another track candidate?
  • compatibility with the track model, in general via a least-squares estimator, which yields a chi2
    • ML: a trained classifier
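
A minimal sketch of such a truth-free selection, applying a minimum-hit cut and a shared-hit veto (the thresholds and the candidate representation are assumptions):

  def select_tracks(candidates, min_hits=9, max_shared_fraction=0.5):
      """candidates: list of (score, set_of_hit_ids), e.g. score = number of hits in the Hough bin.
      Greedily accept candidates in order of decreasing score, rejecting those that mostly
      reuse hits already claimed by an accepted candidate."""
      accepted, used_hits = [], set()
      for score, hits in sorted(candidates, key=lambda c: c[0], reverse=True):
          if len(hits) < min_hits:
              continue
          if len(hits & used_hits) / len(hits) > max_shared_fraction:
              continue
          accepted.append((score, hits))
          used_hits |= hits
      return accepted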

Improvement in metric runtime

Andreas was right: dictionaries are not suited for big data. I implemented (literally) everything using arrays, and now the results are incredible. For npileup = 200, the runtime of the efficiency computation dropped from an estimated 1-2 years to 1 minute. Amazing. Here are the results:

[Screenshot of the results, 2021-09-08]

The efficiency looks nice, but the duplicate/fake rates are an issue. The purification algorithm takes a long time to compute, since it basically runs the r-z Hough Transform for each bin. I may need to think of a smarter solution to overcome this problem. Maybe NNs are the way to go, so I will invest most of my time there.

NN possibilities

  • Typical NN classifier: an NN that classifies bins as good or bad.

Input features: maximum number of hits times (x, y, z); N hidden layers; output nodes: a [0, 1] probability and a [0, 1] contamination.

  • Typical NN classifier (per hit):

Take the first NN with only the [0, 1] probability output and classify every hit in the bin as on/off.
The best architecture would be an RNN.

  • Inference NN:

Let's assume the HT delivers high-quality track candidates (start with the optimal case):
construct an NN that predicts the properties of the particle that created the hits.

The architecture would be:
maximum number of hits x (x, y, z) -> some hidden layers -> 6 output variables: (x0, y0, z0), (px0, py0, pz0).
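
A minimal PyTorch sketch of this inference NN; the maximum number of hits and the layer widths are assumptions, and shorter candidates would be zero-padded to the fixed input size:

  import torch
  import torch.nn as nn

  MAX_HITS = 20  # hypothetical maximum number of hits per candidate

  class InferenceNet(nn.Module):
      """Maps a fixed-size block of hit positions to the 6 particle parameters (x0, y0, z0, px0, py0, pz0)."""
      def __init__(self, max_hits: int = MAX_HITS):
          super().__init__()
          self.net = nn.Sequential(
              nn.Linear(max_hits * 3, 128),
              nn.ReLU(),
              nn.Linear(128, 64),
              nn.ReLU(),
              nn.Linear(64, 6),
          )

      def forward(self, hits: torch.Tensor) -> torch.Tensor:
          # hits: (batch, max_hits, 3) zero-padded hit positions
          return self.net(hits.flatten(start_dim=1))

  model = InferenceNet()
  dummy = torch.zeros(4, MAX_HITS, 3)
  print(model(dummy).shape)  # torch.Size([4, 6])

Training would minimise e.g. nn.MSELoss() between the prediction and the truth parameters of the leading particle.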
