
sms2021-tra-tra's Introduction

This is the repository for Andrew Spanopoulos' CERN Summer Student project.

The aim is to investigate and implement a machine-learning-based (conformal) transform method that maps hits from the detector coordinate system into a representation where either clustering or histogramming techniques can be applied.

This project has been made public to showcase the Summer Student project on tracking.

sms2021-tra-tra's People

Contributors

asalzburger, andrewspano, noemina


sms2021-tra-tra's Issues

Try to produce your first data set

As soon as you have installed the ACTS project, you can produce your first data set:

./ActsExampleParticleGun -n100 --gen-nparticles 25 --gen-mom-gev 0.25:10. --gen-mom-transverse true --gen-eta -0.5:0.5 --output-csv
./ActsExampleFatrasDD4hep --dd4hep-input=../../acts/thirdparty/OpenDataDetector/xml/OpenDataDetector.xml --output-csv  --bf-constant-tesla 0:0:2 --input-dir="./"

The two commands are used, respectively, to:

  • generate the particle sample
  • simulate how they interact with the detector

With the first command you produce 100 events (-n100) with 25 particles each (--gen-nparticles 25) [by default you simulate single muons with pdgId=13], uniformly distributed in eta between -0.5 and 0.5 (--gen-eta -0.5:0.5). The transverse momentum (--gen-mom-transverse true) is set to be uniformly distributed between 250 MeV and 10 GeV (--gen-mom-gev 0.25:10.). The output files are written in CSV format (--output-csv).

For the detector simulation we use the OpenDataDetector (ODD) (--dd4hep-input=../../acts/thirdparty/OpenDataDetector/xml/OpenDataDetector.xml), assuming a homogeneous axial magnetic field of 2 T (--bf-constant-tesla 0:0:2). The directory of the input files is also specified (--input-dir="./"), as well as the format of the output files (--output-csv).
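
As a quick way to inspect what was produced, the CSV files can be read with pandas. This is a minimal sketch, not part of the original instructions; the file-name pattern and column names are assumptions based on the ACTS CSV writers and may differ in your output directory:

  import glob
  import pandas as pd

  # Assumed file naming of the ACTS CSV writers; adjust to what was actually written.
  hit_files = sorted(glob.glob("event*-hits.csv"))
  particle_files = sorted(glob.glob("event*-particles.csv"))

  # Load one event and inspect the columns (tx, ty, tz are typically the true hit positions).
  hits = pd.read_csv(hit_files[0])
  particles = pd.read_csv(particle_files[0])
  print(hits.columns.tolist())
  print(particles.columns.tolist())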

  • When you feel comfortable, try to produce one dataset with very high pT, e.g. 50-100 GeV. This will be used to cross-check if the approximation of the Hough transform space in the transverse plane holds (as it has to for high pT tracks).

More complex input files

In data/pdg13-n25-0.5to10GeV-0.5eta there are more complicated input files. The simulation is now done with

  • particles of transverse momentum 0.5 to 10 GeV
  • 25 particles per event

Tasks:

  • Learn how to associate the particle truth information to the particles
  • Plot, e.g., the number of hits per particle (e.g. as a function of eta, phi, etc.)
  • Try to fit a circle to the hits from one particle -> estimate the transverse momentum and compare to the truth (a sketch follows below)
  • Make a PR to import your code

Info: curvature, magnetic field, particle mass.
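
A minimal sketch of the circle fit and p_T estimate, using an algebraic (Kasa) least-squares circle fit in the transverse plane and the standard relation p_T [GeV] ≈ 0.3 · B [T] · R [m]; the column names and units (positions in mm) are assumptions based on the ACTS CSV hit format:

  import numpy as np

  def fit_circle(x, y):
      """Algebraic (Kasa) circle fit: least-squares solution of x^2 + y^2 + D*x + E*y + F = 0."""
      A = np.column_stack([x, y, np.ones_like(x)])
      b = -(x**2 + y**2)
      (D, E, F), *_ = np.linalg.lstsq(A, b, rcond=None)
      cx, cy = -D / 2.0, -E / 2.0
      radius = np.sqrt(cx**2 + cy**2 - F)
      return cx, cy, radius

  def estimate_pt(x_mm, y_mm, b_tesla=2.0):
      """Estimate p_T in GeV from the fitted circle radius, assuming hit positions in mm."""
      _, _, r_mm = fit_circle(np.asarray(x_mm, float), np.asarray(y_mm, float))
      return 0.3 * b_tesla * (r_mm * 1e-3)  # p_T [GeV] ~ 0.3 * B [T] * R [m]

  # Hypothetical usage with the hits of one particle (tx, ty from the ACTS hits CSV):
  # pt_est = estimate_pt(hits_one_particle["tx"], hits_one_particle["ty"])

The estimate can then be compared to the truth p_T stored with the generated particle.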

Implementing Hough transform for (phi_0,q/p_T) feature space

  • Define the transformation to move from the (x,y) space to the (phi_0, q/p_T) feature space. A bit of math is posted below; a commonly used approximate form is also sketched after this list.

  • Plot the tracks in the (phi_0, q/p_T) plane and produce the binned version of the accumulation histogram, in which the crossing point can be evaluated.

[image]

  • Make heat map with different colour scheme (e.g. inverting the colour palette).
  • Evaluate efficiency of the Hough transform as a function of p_T for the precise and approximated transformation.
  • Produce heat map for the longitudinal transformation.
  • Start combining the information in the longitudinal and transverse planes.
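
For reference, a commonly used form of this transformation (my own sketch, not necessarily the exact math posted in the image above): for a track from the origin in a homogeneous axial field B, a hit at transverse radius r and azimuth phi satisfies, with r in m, B in T, p_T in GeV and R = p_T / (0.3 |q| B) the radius of curvature (up to a sign convention),

  \phi_0 = \phi + \arcsin\!\left(\frac{0.3\, B\, r}{2}\,\frac{q}{p_T}\right)
  \qquad\xrightarrow{\; r \ll 2R \;}\qquad
  \phi_0 \approx \phi + \frac{0.3\, B\, r}{2}\,\frac{q}{p_T}

so every hit defines a curve (a straight line in the approximated case) in the (phi_0, q/p_T) feature space, and hits from the same track intersect in one bin of the accumulation histogram; the arcsin form corresponds to the "precise" transformation and the linear form to the "approximated" one mentioned above.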

About the efficiency plots:

  • Correct the p_T used for the efficiency plots: it has to be the truth p_T. Make two plots:
    • number of generated particles vs generated p_T
    • number of generated particles that were found vs generated p_T (for the precise and the approximated transformation)
  • Add efficiency vs eta (again for the truth particle)
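
A minimal numpy sketch of how these two histograms combine into an efficiency curve (the binning and the function name are assumptions):

  import numpy as np

  def efficiency_vs_pt(pt_generated, pt_matched, pt_bins=np.linspace(0.5, 10.0, 20)):
      """pt_generated: truth p_T of all generated particles;
      pt_matched: truth p_T of the generated particles matched to a found track (GeV)."""
      n_gen, _ = np.histogram(pt_generated, bins=pt_bins)
      n_found, _ = np.histogram(pt_matched, bins=pt_bins)
      # Efficiency per bin = found / generated, leaving empty bins at zero.
      return np.divide(n_found, n_gen,
                       out=np.zeros_like(n_found, dtype=float), where=n_gen > 0)

The same recipe applies for the efficiency vs eta plot.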

Study effect of material and non-homogeneous magnetic field

You have successfully validated the Hough transform in (phi,q/p_T) and built up combinations of the information from the two feature spaces. It is time now to study the effect of detector material and non-homogeneous magnetic field.

Considering the results from #12 and #16, we can use the approximated transformation together with combination 1.

  • Evaluate the efficiency as a function of pT and eta for the three cases: perfect scenario, w/ material effects and w/ non-homogeneous magnetic field.

  • Evaluate the fake rate and the duplicate rate, if possible as a function of eta and pT as well.

  • It would also be nice to see how these effects pollute the Hough transform and what the hot spots look like in the feature space. This would require some gymnastics with the random number generator, to be able to generate the same particles for the three configurations considered.

Duplicate Removal

  • Implement baseline methods for removing duplicate tracks from the Hough Transform output.

    For both baseline approaches implemented, the efficiency drops along with the duplicate and fake rates. This happens because tracks that are not duplicates are flagged as such and therefore removed. To solve this, we must fine-tune these baseline algorithms a bit further:

    • Implement a purity function to see what percentage of the hits assigned to an estimated track actually belongs to the leading particle, in order to understand how much noise there is in the bins.
    • Fine-tune the algorithm so that the efficiency is not affected at all.
  • Implement a more sophisticated (yet still deterministic) method of filtering out duplicate tracks. Maybe build on top of the baseline and also use geometry information?

  • Implement a Machine Learning approach to duplicate removal. For this:

    • Set up the NVIDIA GPU on my computer.
    • Implement a function that, given the results of a run of the Hough Transform algorithm, creates a dataset of duplicate and non-duplicate tracks. Make sure the data is not biased and not too easy for the NN to separate: for the non-duplicate tracks (negative examples), pick tracks that are actually close in Hough space, since the duplicate tracks will also be close in Hough space (a sketch follows this list).
    • Do the above for many bin sizes and nhits values, in order to make sure that we don't overfit to a specific case.
    • Also create a test dataset with unseen bin sizes and nhits values.
    • Train the model on the training data and assess the results on both seen and unseen data.
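
A minimal sketch of such a dataset builder and a stand-in classifier; the candidate representation, the truth-based labelling and the closeness cuts are all assumptions:

  from itertools import combinations
  import numpy as np
  from sklearn.neural_network import MLPClassifier  # simple stand-in for the NN

  def build_pair_dataset(candidates, max_dphi0=0.1, max_dqpt=0.2):
      """candidates: list of dicts with 'phi0', 'qpt' (bin centres), 'hits' (set of hit ids)
      and 'particle_id' (leading truth particle).  A pair is labelled as a duplicate when
      both candidates point to the same truth particle; only pairs that are close in Hough
      space are kept, so the negative examples are not trivially separable."""
      X, y = [], []
      for a, b in combinations(candidates, 2):
          dphi0, dqpt = abs(a["phi0"] - b["phi0"]), abs(a["qpt"] - b["qpt"])
          if dphi0 > max_dphi0 or dqpt > max_dqpt:
              continue
          X.append([dphi0, dqpt, len(a["hits"]), len(b["hits"])])
          y.append(int(a["particle_id"] == b["particle_id"]))
      return np.array(X), np.array(y)

  # Hypothetical usage:
  # clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500).fit(X_train, y_train)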

More complex simulation setups

List of datasets to be produced:

  • detector with material

This needs a material map file odd-material-map.root. To activate this, one needs to specify the material input file in the Fatras simulation by adding the following program options:

--mat-input-type=file --mat-input-file=odd-material-map.root
  • with non-homogeneous magnetic field

This needs a magnetic field map odd-bfield.root. To use this, one needs to add the following option to the Fatras command:

  --bf-map-file=odd-bfield.root
  • with material and non-homogeneous magnetic field

These two options can of course also be combined (which represents the worst case).

The files are part of a PR into the ODD detector (not merged yet):

https://github.com/acts-project/OpenDataDetector/pulls

Accessing them requires Git Large File Storage (git-lfs) support.

As we haven't talked about particle types yet, we should do that as well:

  • muons (PDG 13)
  • electrons (PDG 11)
  • pions (PDG 211)

Some actions:

  • Investigate those with constant field, but under the influence of material
    • how many hits are produced (truth) by those particles
    • what's the efficiency for these particles

If we want to see the effects of magnetic field / material on muons themselves:

  • this is visible in the low momentum regime (0.5 - 1 GeV)

Study binning effect on selection of Hough tracks

  • Develop a method to study the effect of noise due to the binning (also useful to understand the material and magnetic field effects)
  • Evaluate the bin of phi (or q/pT) for the true particle and pick the bin center
  • Use the phi (or q/pT) bin center from the true particle as the hit value and make the Hough space accumulation plot. This will allow us to understand how the bins spread for one variable when the other one is fixed.
  • Plot the residual for phi and q/pT (comparing the value of the reconstructed found track (nhits>8, 9, 10) and the true particle)

  • Compare with bin width used

  • For q/p_T the bin size in the Hough space is not optimal. Let's try to change it and see how the counts vs q/pT residual changes.

  • Same as above, shown using nhits>8, 9, 10 when selecting the reconstructed tracks

  • Once we have defined a better q/p_T bin size and observed how the residuals change as a function of the number of hits in the bin used to select the reco track, we need to look at the efficiency and duplicate rate as a function of the number of hits.

  • --> Look at the other issue about combining the longitudinal and transverse HTs ;)
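
A minimal sketch for the bin-center / residual part of this study; the bin edges and the reco/true variable names are assumptions:

  import numpy as np

  phi_edges = np.linspace(-np.pi, np.pi, 201)  # hypothetical phi_0 binning
  qpt_edges = np.linspace(-2.0, 2.0, 101)      # hypothetical q/p_T binning [1/GeV]

  def bin_center(value, edges):
      """Return the centre of the bin that contains `value`."""
      i = np.clip(np.digitize(value, edges) - 1, 0, len(edges) - 2)
      return 0.5 * (edges[i] + edges[i + 1])

  # Residuals between the selected reconstructed bin and the true particle,
  # e.g. for reconstructed tracks with nhits > 8:
  # res_phi = phi_reco - bin_center(phi_true, phi_edges)
  # res_qpt = qpt_reco - bin_center(qpt_true, qpt_edges)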

Understand and decode GeometryID

The geometry_id carries the volume, layer and sensitive information via bit masks:

  // (2^8)-1 = 255 volumes
  static constexpr Value kVolumeMask = 0xff00000000000000;
  // (2^8)-1 = 255 boundaries
  static constexpr Value kBoundaryMask = 0x00ff000000000000;
  // (2^12)-1 = 4095 layers
  static constexpr Value kLayerMask = 0x0000fff000000000;
  // (2^8)-1 = 255 approach surfaces
  static constexpr Value kApproachMask = 0x0000000ff0000000;
  // (2^28)-1 sensitive surfaces
  static constexpr Value kSensitiveMask = 0x000000000fffffff;

Prepare functions to decode this

  • Decode volume, layer, sensitive number from geometry_id
  • What are the distributions of those numbers in your dataset?
  • Investigate truth tracks regarding their geometry paths
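
A minimal Python sketch of the decoding, derived directly from the masks quoted above (the function names are my own):

  # Bit masks copied from the ACTS GeometryIdentifier definition quoted above.
  VOLUME_MASK    = 0xff00000000000000
  BOUNDARY_MASK  = 0x00ff000000000000
  LAYER_MASK     = 0x0000fff000000000
  APPROACH_MASK  = 0x0000000ff0000000
  SENSITIVE_MASK = 0x000000000fffffff

  def _extract(geometry_id: int, mask: int) -> int:
      """Shift the masked bits down to the least significant position."""
      shift = (mask & -mask).bit_length() - 1  # position of the lowest set bit of the mask
      return (geometry_id & mask) >> shift

  def decode_geometry_id(geometry_id: int) -> dict:
      return {
          "volume":    _extract(geometry_id, VOLUME_MASK),
          "boundary":  _extract(geometry_id, BOUNDARY_MASK),
          "layer":     _extract(geometry_id, LAYER_MASK),
          "approach":  _extract(geometry_id, APPROACH_MASK),
          "sensitive": _extract(geometry_id, SENSITIVE_MASK),
      }

Applied to the geometry_id column of the hits CSV, this gives the volume/layer/sensitive distributions asked for above.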

Investigate & plot single particle files

In data/pdg13-n1-1GeV-0.5eta there are 100 single muon (PDG code 13) events simulated in a test detector.

The simulation was done with:

  • constant magnetic field of 2 Tesla in direction (0,0,1)
  • constant transverse momentum of 1 GeV
  • restricted to |eta| < 0.5 to stay within the barrel region (eta denotes the pseudo-rapidity)
  • without any material in the detector

Investigate the hits and particle content:

  • plot the hits in x/y plane (transverse view)
  • plot the hits in r/z plane (longitudinal view)
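
A minimal matplotlib sketch for these two views, assuming the hits were loaded into a pandas DataFrame with the ACTS CSV columns tx, ty, tz (column names and mm units are assumptions):

  import numpy as np
  import matplotlib.pyplot as plt

  def plot_hit_views(hits):
      """hits: DataFrame with the true hit positions tx, ty, tz (assumed to be in mm)."""
      x, y, z = hits["tx"].values, hits["ty"].values, hits["tz"].values
      r = np.hypot(x, y)
      fig, (ax_xy, ax_rz) = plt.subplots(1, 2, figsize=(10, 5))
      ax_xy.scatter(x, y, s=2)
      ax_xy.set(xlabel="x [mm]", ylabel="y [mm]", title="transverse view")
      ax_rz.scatter(z, r, s=2)
      ax_rz.set(xlabel="z [mm]", ylabel="r [mm]", title="longitudinal view")
      plt.show()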

Look up / research the concept of a Hough transform.

First (straight-line) Hough transform

Implement a first straight-line-based Hough transform, e.g. using scikit-image:

https://scikit-image.org/docs/0.3/auto_examples/plot_hough_transform.html

  • try to detect the straight lines in the r-z plane and compare to the ground truth
  • define efficiency/fake/duplicate rates (for the moment, hopefully only the efficiency matters)
  • think of expanding the straight-line Hough transform to a helical conformal-mapping transform
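
A minimal sketch using skimage.transform.hough_line, which operates on a binary image, so the r-z hits are first rasterised into a 2D histogram (the binning and the function name are assumptions):

  import numpy as np
  from skimage.transform import hough_line, hough_line_peaks

  def straight_line_ht(z, r, bins=(200, 100)):
      """z, r: arrays of hit coordinates in the longitudinal view."""
      image, _, _ = np.histogram2d(z, r, bins=bins)
      binary = image.T > 0  # rows = r bins, columns = z bins
      # Accumulate over a range of line angles and pick the strongest peaks.
      angles = np.linspace(-np.pi / 2, np.pi / 2, 360, endpoint=False)
      hspace, thetas, dists = hough_line(binary, theta=angles)
      return hough_line_peaks(hspace, thetas, dists)

  # Each returned (accumulator value, theta, dist) triple is a straight-line candidate
  # to compare with the truth particles.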

Further refinements:

  • write a purely truth-based efficiency function (1)
  • implement the book-keeping

(1) Truth based efficiency function

Given a set of found hits [f], look up how many of the hits actually come from the same particle using the truth association, i.e. every hit in the production file carries an identifier of the particle that produced it.

  • every hit has a weight; assume weight 1
  • if multiple particles are present, calculate the fraction for each particle and take the leading particle:
    sum the hit weights w_i for the leading particle and divide by the total weight, which results in a truth (matching) probability in [0, 1]
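
A minimal sketch of this matching probability, assuming each found hit carries the particle_id of the particle that produced it (as in the truth CSV):

  from collections import Counter

  def matching_probability(hit_particle_ids, weights=None):
      """Return the leading particle and the fraction of the total hit weight it carries."""
      if weights is None:
          weights = [1.0] * len(hit_particle_ids)  # every hit has weight 1 by assumption
      totals = Counter()
      for pid, w in zip(hit_particle_ids, weights):
          totals[pid] += w
      leading_pid, leading_weight = totals.most_common(1)[0]
      return leading_pid, leading_weight / sum(weights)

  # Example: a candidate whose hits come from particles 7, 7, 7, 4
  # -> leading particle 7 with matching probability 0.75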

Updates:

  • Plot a histogram of "matching probability"
  • Calculate the efficiency per track, efficiency vs eta, efficiency vs pT
  • Start writing a selection/classification function (2)

(2) So far, you are taking the 25 best tracks per event, because you know (from truth information that would not be available in a real reconstruction) that 25 particles are produced.

To overcome this, we use selection criteria without looking at truth information:

  • minimum number of hits
  • holes
  • shared hits: are the hits already used by another track candidate?
  • compatibility with the track model, in general via a least-squares estimator, which yields a chi2
    • ML: a trained classifier
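
A minimal sketch of such a truth-free selection, applying a minimum-hit cut and a shared-hit veto (the thresholds and the candidate representation are assumptions):

  def select_tracks(candidates, min_hits=9, max_shared_fraction=0.5):
      """candidates: list of (score, set_of_hit_ids), e.g. score = number of hits in the Hough bin.
      Greedily accept candidates in order of decreasing score, rejecting those that mostly
      reuse hits already claimed by an accepted candidate."""
      accepted, used_hits = [], set()
      for score, hits in sorted(candidates, key=lambda c: c[0], reverse=True):
          if len(hits) < min_hits:
              continue
          if len(hits & used_hits) / len(hits) > max_shared_fraction:
              continue
          accepted.append((score, hits))
          used_hits |= hits
      return accepted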

Improvement in metric runtime

Andreas was right: dictionaries are not suited for big data. I implemented (literally) everything using arrays, and now the results are incredible. For npileup = 200, the runtime of the efficiency computation dropped from an estimated 1-2 years to 1 minute. Amazing. Here are the results:

[Screenshot of the results, 2021-09-08]

The efficiency looks nice, but the duplicate/fake rates are an issue. The purification algorithm takes a long time to compute, since it basically runs the r-z Hough Transform for each bin. I may need to think of a smarter solution to overcome this problem. Maybe NNs are the way to go, so I will invest most of my time there.

NN possibilities

  • Typical NN classifier: an NN that classifies bins as good or bad.

Input features: maximum number of hits times (x, y, z); N hidden layers; output nodes: a [0, 1] probability and a [0, 1] contamination.

  • Typical NN classifier (per hit):

Take the first NN with only the [0, 1] probability output and classify every hit in the bin as on/off.
The best architecture would be an RNN.

  • Inference NN:

Let's assume the HT delivers high-quality track candidates (start with the optimal case):
construct an NN that predicts the properties of the particle that created the hits.

The architecture would be:
maximum number of hits x (x, y, z) -> some hidden layers -> 6 output variables: (x0, y0, z0), (px0, py0, pz0).
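
A minimal PyTorch sketch of this inference NN; the maximum number of hits and the layer widths are assumptions, and shorter candidates would be zero-padded to the fixed input size:

  import torch
  import torch.nn as nn

  MAX_HITS = 20  # hypothetical maximum number of hits per candidate

  class InferenceNet(nn.Module):
      """Maps a fixed-size block of hit positions to the 6 particle parameters (x0, y0, z0, px0, py0, pz0)."""
      def __init__(self, max_hits: int = MAX_HITS):
          super().__init__()
          self.net = nn.Sequential(
              nn.Linear(max_hits * 3, 128),
              nn.ReLU(),
              nn.Linear(128, 64),
              nn.ReLU(),
              nn.Linear(64, 6),
          )

      def forward(self, hits: torch.Tensor) -> torch.Tensor:
          # hits: (batch, max_hits, 3) zero-padded hit positions
          return self.net(hits.flatten(start_dim=1))

  model = InferenceNet()
  dummy = torch.zeros(4, MAX_HITS, 3)
  print(model(dummy).shape)  # torch.Size([4, 6])

Training would minimise e.g. nn.MSELoss() between the prediction and the truth parameters of the leading particle.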
