
machinelearninghep's Issues

Travis got coronavirus.

The pylint checks performed by the Travis CI tool keep failing: Travis starts sneezing too much and the log gets too long.
Error message:

The job exceeded the maximum log length, and has been terminated.

Skim databases

  • create common path structures to heavily skim the databases

  • enhances usability and maintainability

  • less error prone

  • use YAML anchors where appropriate (see #608)
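YAML anchors let a common block be defined once and reused; below is a minimal sketch with made-up keys (nothing here is from the actual databases):

```yaml
# Hypothetical example of YAML anchors for shared path structures.
common_paths: &common_paths        # define an anchor
  data: /data/Derived
  results: /data/Results

period_2016:
  <<: *common_paths                # merge the anchored mapping
period_2017:
  <<: *common_paths
  results: /data/Results2017       # override a single key
```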

Add a dev branch for

Create a dev branch for PRs where developments and dependencies might change fairly quickly. Keep the master branch clean and update it on a longer time scale to keep things as stable as possible.

Can't install the ML package without superuser permission

I tried to install the package as specified in the wiki (Installation) on Ubuntu 18.04.

Running python3 setup.py install as a user I receive an error related to missing permissions, see log_error.txt for the error message.

I also tried python3 setup.py install --user, but still got a permission error:

running install
running bdist_egg
running egg_info
error: [Errno 13] Permission denied

As a superuser the installation was fine.

New parameter in the database

Dear all,
I added a new parameter to the LcpK0s database to select whether the FONLL file is a ROOT file or a CSV file.
I did this because the FONLL file used before was for 5 TeV (and it was a CSV file), while the FONLL file for 13 TeV is a ROOT file.
You can still use the old file by setting the new parameter to false (isFONLLfromROOT: false),
but you have to add it to your database.
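In the database YAML this looks roughly like the following; the surrounding structure is illustrative, only isFONLLfromROOT is the actual new parameter:

```yaml
# Illustrative placement; only isFONLLfromROOT is from the actual change.
LcpK0s:
  isFONLLfromROOT: false   # false: FONLL predictions from a CSV file (5 TeV)
  # isFONLLfromROOT: true  # true: FONLL predictions from a ROOT file (13 TeV)
```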

Fix usage of pylint exceptions

The # pylint: disable= exceptions are overused and misused.

  • Fix warnings that can be fixed.
  • Use the remaining exceptions only within the scope where they are supposed to apply, i.e. at the desired block level or at the end of the desired line of code.

See documentation and FAQ for more details.
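A minimal sketch of a scoped pylint exception (the function and its logic are made up, not from the package):

```python
# Hypothetical example: scope a pylint exception to a single line
# instead of disabling the warning for the whole file.

def load_values(raw):
    """Convert raw strings to ints, skipping malformed entries."""
    values = []
    for item in raw:
        try:
            values.append(int(item))
        except Exception:  # pylint: disable=broad-except
            # line-level disable: applies only to the line above
            continue
    return values

print(load_values(["1", "x", "3"]))  # -> [1, 3]
```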

Separate predictions from applied dataframe

For each model, create one pickled dataframe with predictions per child_i/pack_j/AnalysisResults<suffix>.pkl.

-> no pre-application needed, full data available
-> no additional duplication of data, only additional one-column dataframes per model prediction

  • prediction directory
  • cross check
  • remove applied pickles from DB
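A rough sketch of the idea, with hypothetical column and file names (nothing here is from the actual Processer code): the full data is written once, and each model only adds a single-column dataframe that can be joined on demand.

```python
# Sketch: store one single-column dataframe of model predictions
# next to the full data instead of duplicating the applied data.
import os
import tempfile

import pandas as pd

tmp = tempfile.mkdtemp()

# full dataframe, written once at processing time
df = pd.DataFrame({"pt": [1.2, 3.4, 5.6], "inv_mass": [2.28, 2.29, 2.30]})
df.to_pickle(os.path.join(tmp, "AnalysisResults.pkl"))

# per-model predictions: only one extra single-column dataframe
pred = pd.DataFrame({"model_prob": [0.1, 0.8, 0.6]})
pred.to_pickle(os.path.join(tmp, "AnalysisResults_model.pkl"))

# at analysis time, join them on the index when needed
merged = pd.read_pickle(os.path.join(tmp, "AnalysisResults.pkl")).join(
    pd.read_pickle(os.path.join(tmp, "AnalysisResults_model.pkl")))
print(list(merged.columns))
```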

Enhance documentation

  • add a github.io page

  • generate it automatically per release (tag)

  • a great way of documenting (and advertising ;))

Avoid ROOT dependency

We should avoid ROOT dependency if possible or at least make it optional for different reasons, among potential others:

  • the effort of getting ROOT running, as well as the ROOT functionality actually used within this package, doesn't scale with the purpose of this package (also, building this package in principle takes seconds, whereas it may take hours if someone needs to build ROOT from source)
  • the above point may discourage people who just want to use the nice ML functionality of this package
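One common way to make such a dependency optional (a sketch, not the package's current code; the helper function is hypothetical) is to guard the import and fail only when ROOT functionality is actually requested:

```python
# Sketch of an optional ROOT dependency: import lazily, guard usage.
try:
    import ROOT  # noqa: F401
    HAVE_ROOT = True
except ImportError:
    ROOT = None
    HAVE_ROOT = False

def save_histogram(data, path):
    """Save a histogram to a ROOT file (hypothetical helper)."""
    if not HAVE_ROOT:
        raise RuntimeError(
            "This feature needs ROOT; install it or skip this step.")
    # ... ROOT-specific code would go here ...

print(isinstance(HAVE_ROOT, bool))
```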

Running grid search crashes the execution

The execution of the program crashes when running the grid search (dogridsearch == 1) introduced in PR #20. This is the error message.

The problem is probably an incompatibility of scikit-learn > 0.17 with sklearn-evaluation.
@mfaggin Don't you already have a solution to this issue?

Sort final results first by year

Sort final results first by year and, below that, by MC and data. This also gives the opportunity to place additional information concerning both MC and data inside the corresponding year directory.

Remove hardcoded numbers in signal selection

We should get rid of these hardcoded numbers in the signal selection strings in database_ml_parameters.yml

  sel_signal: cand_type_ML==10 or cand_type_ML==11 or cand_type_ML==18 or cand_type_ML==19

used in

  df_sig = df_sig.query(sel_signal)

Especially since we will soon add the IsSelectedPID and IsSelectedTopo bits in the TreeCreator, and will probably start using meson-dependent bits in the same bitmap.

I did some quick tries with bitwise operations, but these don't seem to work in the query() function. I found a solution, but it is a bit cumbersome, as my pandas DataFrame knowledge is still limited ;-)

# For testing
import pandas as pd

signalmap = 0b0001         # should be stored in database_ml_parameters.yml
df = pd.DataFrame([[2, 6, 34, 0], [1, 4, 23, 1], [2, 3, 54, 2], [4, 5, 56, 3],
                   [3, 2, 56, 4], [6, 3, 73, 5], [9, 0, 23, 6], [7, 5, 54, 7],
                   [8, 9, 83, 8], [3, 6, 92, 9]],
                  columns=['a', 'b', 'c', 'cand_type_ML'])

df_temp = df.query("(cand_type_ML>>signalmap) == 1")                  # Doesn't work
df_temp = df.query("numpy.bitwise_and(cand_type_ML,1) == 1")          # Doesn't work
df_temp = df[df['cand_type_ML'].apply(lambda x: x & signalmap == 1)]  # Works, but cumbersome

There should be a nicer solution. Does anyone have a better idea?
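One possibility, as a sketch (not tested against the actual analysis code): pandas Series support the bitwise operators element-wise, so a plain boolean mask avoids both query() and the per-row apply():

```python
# Sketch: vectorized bitmask selection without query() or apply().
import pandas as pd

signalmap = 0b0001   # would come from database_ml_parameters.yml
df = pd.DataFrame({"cand_type_ML": [0, 1, 2, 3, 10, 11, 18, 19]})

# & on a Series is vectorized, so this selects all rows with the bit set
df_sig = df[(df["cand_type_ML"] & signalmap) == signalmap]
print(df_sig["cand_type_ML"].tolist())  # -> [1, 3, 11, 19]
```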

Improve the webapp

Here's a checklist of things to do to improve the webapp.

  • Use Twisted/Klein (a Flask-like framework built on Twisted)
  • Properly copy input files in the installation directory
  • Handle sessions, progress, for when client disconnects/changes network
  • Use a temporary space as output directory (not "current" directory)
  • Deployment on Appspot: fix configuration files
  • Deployment on the development machine: dockerize
  • Use headless Pyplot
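For the last point, headless Matplotlib usually just means selecting the non-interactive Agg backend before pyplot is imported (a sketch; the webapp may need more than this):

```python
# Sketch: force the Agg backend so plots render on a server
# without any display.
import io

import matplotlib
matplotlib.use("Agg")           # must be set before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
buf = io.BytesIO()
fig.savefig(buf, format="png")  # works without an X server
print(matplotlib.get_backend())
```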

Revise and bump versions of dependencies

Versions of dependencies used by the package should be revised. E.g. the problem we see in #692 is gone using pandas version >=1.0.0.
However, that version in particular cannot be bumped yet, as some changes in the new pandas API might break our code.

But that is something for the less stressful times after approval :)

Abseil interferes with our logger

Abseil interferes with Python's logging used via
import logging
For details see abseil/abseil-py#99

Note that what abseil does is not really best practice, and there will most likely be a fix addressing this issue.

Since our logger is also used for fatal messaging with subsequent aborting, and we also use it to log to a file, we want to keep it and NOT move to abseil.logging, as some people suggest in the above linked issue.

There is a temporary fix thanks to abseil/abseil-py#99 (comment), see PR #196.

If someone spots some weird behaviour, please let us (@benedikt-voelkel or @ginnocen) know.

Missing graphviz dependency

Using the SciKit models with activate_scikit = 1, an error is obtained. This is due to graphviz being absent from the dependencies in Prerequisites. Its addition should solve the problem.

However, graphviz is only used to obtain a graph of the SciKit single decision tree model.
Given that a single decision tree is not very interesting for ML in HEP, and that this plot cannot be made for BDTs, another possibility is to drop this feature from the code in order to avoid the graphviz dependency.

I would go for the latter solution, also because, at least on Ubuntu, graphviz is provided as a system package and requires superuser permissions to install.

@ginnocen What do you think?

Document the code with docstrings

I would suggest documenting at least the functions exposed to the user in doclassification_regression.py with docstrings, following the PEP 8 and PEP 257 style guides.

That way, information about a function can be accessed interactively with IPython or Jupyter. Moreover, the docstrings could (in the future) be used to build the code documentation automatically.

Currently the missing-docstring error is disabled in our pylint configuration; we could consider enabling it if we want to be strict.
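A minimal PEP 257-style sketch (the function is made up, not taken from doclassification_regression.py):

```python
def select_candidates(df, threshold):
    """Filter candidates by model score.

    Args:
        df: input dataframe with a 'score' column.
        threshold: minimum accepted score.

    Returns:
        The rows of df whose score is at least threshold.
    """
    return df[df["score"] >= threshold]

# the docstring is now available interactively, e.g. help(select_candidates)
print(select_candidates.__doc__.splitlines()[0])
```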

Move `process_histomass`

Move the process_histomass workflow to the Analyzers, as it is analysis-specific and does not belong in the Processer logic.

Specify data types for dataframes

If data types for a pandas.DataFrame are not specified, a query operation with negative values most likely fails (see also: https://stackoverflow.com/questions/50400843/using-negative-numbers-in-pandas-dataframe-query-expression)

Hence, we cannot directly apply custom cuts with negative values using pandas.DataFrame.query, which is a major problem.

Furthermore, specifying data types might save disk space!

This needs to be done at the first processing stage, when the TTrees are converted and pickled.
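A small sketch of the fix side (the column names are hypothetical, not from the package): making the dtypes explicit lets query() handle cuts with negative values, and a 32-bit type also halves the footprint compared to the float64 default.

```python
# Sketch: specify explicit numeric dtypes when pickling so that
# query() cuts with negative values work and disk space shrinks.
import pandas as pd

# hypothetical columns; in the package this would happen when the
# TTrees are converted and pickled
df = pd.DataFrame({"d_len_xy": [-0.5, 0.2, 1.3],
                   "pt_cand": [1.0, 2.5, 7.2]})

# make the dtypes explicit (and smaller than the float64 default)
df = df.astype({"d_len_xy": "float32", "pt_cand": "float32"})

# a custom cut with a negative value now works in query()
cut = df.query("d_len_xy > -0.1")
print(len(cut))  # -> 2
```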

Save plots also in pickle

Besides saving the plots as PNGs in the /plots_nevt*/ directory, it would be good to also save the figure object itself. That way one doesn't have to rerun everything to make small changes to the layout, and it becomes possible to show two differently trained pT bins on one canvas.

Pickle should work, right? @ginnocen, what do you think?

Code will become something like this

import pickle

plotname = folder + '/ROCcurve.png'
plt.savefig(plotname)
plotnamepickle = folder + '/ROCcurve.pickle'
with open(plotnamepickle, 'wb') as fid:
    pickle.dump(figure2, fid)

And for offline use

import matplotlib.pyplot as plt
import pickle as pl

with open('ROCcurve.pickle', 'rb') as fid:
    fig_handle = pl.load(fid)
fig_handle.show()

# Get the data for further operations
x = fig_handle.axes[0].lines[0].get_data()

Include new example of input data with the new dataformat structure

In the Monte Carlo SmallSkimmed_*.root files downloaded by ml-get-data, very few or no signal candidates are present. They are also very small (~100 KB). They should probably be replaced by bigger files with more data.

At the moment, the code might not work properly with these files.

Organise code in sub-directories

Inside machine_learning_hep we should organise the individual .py files into sub-directories, grouping them accordingly.
This makes the structure cleaner (especially for people who are new and having a first look at the code).

Cross section doesn't work when dofit + doeff are disabled

When the full analysis chain is enabled (fit by loading AliHFInvMassFitter, efficiency with the Python code, and cross section by loading HFPtSpectrum), everything runs smoothly. When only docross is enabled, the package crashes after processing the first period with the following message:

Traceback (most recent call last):
  File "do_entire_analysis.py", line 23, in <module>
    main()
  File "/home/lvermunt/JulyALICE/fork/MachineLearningHEP/machine_learning_hep/steer_analysis.py", line 461, in main
    do_entire_analysis(run_config, db_analysis, db_ml_models, db_ml_gridsearch, db_run_list)
  File "/home/lvermunt/JulyALICE/fork/MachineLearningHEP/machine_learning_hep/steer_analysis.py", line 390, in do_entire_analysis
    myan.multi_makenormyields()
  File "/home/lvermunt/JulyALICE/fork/MachineLearningHEP/machine_learning_hep/multianalyzer.py", line 113, in multi_makenormyields
    self.process_listsample[indexp].makenormyields()
  File "/home/lvermunt/JulyALICE/fork/MachineLearningHEP/machine_learning_hep/analyzer.py", line 1513, in makenormyields
    gROOT.LoadMacro("HFPtSpectrum.C")
SystemError: int TROOT::LoadMacro(const char* filename, int* error = 0, bool check = kFALSE) =>
    problem in C++; program state has been reset

 *** Break *** segmentation violation
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/cvmfs/alice.cern.ch/el6-x86_64/Packages/ROOT/v6-16-00-18/lib/ROOT.py", line 770, in cleanup
    isCocoa = _root.gSystem.InheritsFrom( 'TMacOSXSystem' )
TypeError: none of the 2 overloaded methods succeeded. Full details:
  bool TObject::InheritsFrom(const char* classname) =>
    problem in C++; program state has been reset
  bool TObject::InheritsFrom(const TClass* cl) =>
    could not convert argument 1

 *** Break *** segmentation violation
Segmentation fault (core dumped)

I don't see what could cause this. @benedikt-voelkel, do you have an idea? For plot_multi_trial you are also loading a .C macro. Does that work stand-alone?

Add validation plotting function

Hello @njacazio. In PR #625 I added a simple skeleton for producing validation plots. The new function should read the file where you included all the validation histograms, build the plots, and save them to a PDF.
