alisw / MachineLearningHEP
License: GNU General Public License v3.0
@benedikt-voelkel. I opened this issue as a reminder to fix the logger inside the Processer.
Cheers GM
The pylint checks performed by the Travis CI tool keep failing: the log grows too long and Travis terminates the job.
Error message:
The job exceeded the maximum log length, and has been terminated.
Make common path structures to heavily skim the databases. This enhances usability and maintenance and is less error-prone.
Use YAML anchors where appropriate (see #608).
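As a sketch of the YAML-anchor idea (the keys and paths below are made up, not taken from the actual databases): an anchor defines a common block once, and aliases reuse it per entry, overriding only what differs.

```yaml
# Hypothetical database fragment: define common paths once with an anchor ...
common_paths: &common_paths
  pkl: /data/pkl
  pkl_skimmed: /data/pklsk

# ... then reuse them per period with an alias, overriding only what differs.
LHC18a:
  <<: *common_paths
  pkl: /data/LHC18a/pkl
```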
Create a dev branch for PRs where developments and dependencies might vary fairly quickly over time. Keep the master branch clean and update it on a longer time scale to keep things as stable as possible.
I tried to install the package as specified in the wiki (Installation) on Ubuntu 18.04. Running python3 setup.py install as a user, I receive an error related to missing permissions; see log_error.txt for the error message.
I also tried python3 setup.py install --user, still obtaining a permission error:
running install
running bdist_egg
running egg_info
error: [Errno 13] Permission denied
As a superuser the installation was fine.
Dear all,
I added a new parameter to the LcpK0s database to select whether the FONLL file is a ROOT file or a CSV file. I did this because the FONLL file used before was for 5 TeV (a CSV file), while the FONLL file for 13 TeV is a ROOT file.
You can still use the old file by setting the new parameter to false (isFONLLfromROOT: false), but you have to add it to your database.
pylint exceptions are overused and misused. Use # pylint: disable=<check> at the desired block level or at the end of the desired line of code. See the pylint documentation and FAQ for more details.
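A minimal sketch of narrowly-scoped suppressions (the check names and the code are illustrative, not from the repository):

```python
# Block-level: the disable applies from here to the end of the enclosing scope.
# pylint: disable=too-few-public-methods
class Holder:
    """A tiny example container."""

    def __init__(self, value):
        self.value = value


def callback(event, context):  # pylint: disable=unused-argument
    """Line-level: the disable applies to this line only."""
    return event
```

Keeping the scope this tight documents exactly which check is silenced and where, instead of a file-wide or config-wide disable.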
For each model, create one pickled dataframe with predictions per child_i/pack_j/AnalysisResults<suffix>.pkl:
-> no pre-application needed, full data available
-> no additional duplication of data, only additional one-column dataframes per model prediction
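A minimal sketch of the layout (file names and column names are hypothetical): one single-column dataframe of per-candidate predictions, pickled next to the skimmed data so the full data stays available without duplicating it.

```python
import os
import tempfile

import pandas as pd

# One single-column dataframe of model predictions per skimmed file.
preds = pd.DataFrame({"y_proba_modelA": [0.12, 0.87, 0.45]})

outdir = tempfile.mkdtemp()  # stands in for a child_i/pack_j/ directory
path = os.path.join(outdir, "AnalysisResults_modelA.pkl")
preds.to_pickle(path)

# Later, read it back and align it by row position with the skimmed data.
loaded = pd.read_pickle(path)
```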
Add a github.io page, generated automatically per release (tag). A great way for documentation (and advertisement ;) ).
We should avoid the ROOT dependency if possible, or at least make it optional, for different reasons, among potential others:
Currently the linting of setup.py isn't checked by Travis, and it looks like it doesn't respect PEP 8.
The execution of the program crashes when running the grid search (dogridsearch == 1), introduced in PR #20. This is the error message.
The problem should be the incompatibility of scikit-learn > 0.17 with sklearn-evaluation.
@mfaggin Don't you already have a solution to this issue?
Sort final results first by year and then by MC and data. This also gives the opportunity to place additional info concerning both MC and data inside the corresponding year directory.
We should get rid of these hardcoded numbers in the signal selection strings in database_ml_parameters.yml
sel_signal: cand_type_ML==10 or cand_type_ML==11 or cand_type_ML==18 or cand_type_ML==19
used in
df_sig = df_sig.query(sel_signal)
Especially since we will soon add the IsSelectedPID and IsSelectedTopo bits in the TreeCreator, and will probably start using meson-dependent bits in the same bitmap.
I did some quick tries with bitwise operations, but these don't seem to work in the query() function. I found a solution, but it is a bit cumbersome as my pandas DataFrame knowledge is still limited ;-)
# For testing
signalmap = 0b0001  # should be stored in database_ml_parameters.yml
df = pd.DataFrame([[2, 6, 34, 0], [1, 4, 23, 1], [2, 3, 54, 2], [4, 5, 56, 3],
                   [3, 2, 56, 4], [6, 3, 73, 5], [9, 0, 23, 6], [7, 5, 54, 7],
                   [8, 9, 83, 8], [3, 6, 92, 9]],
                  columns=['a', 'b', 'c', 'cand_type_ML'])
df_temp = df.query("(cand_type_ML>>signalmap) == 1")  # Doesn't work
df_temp = df.query("numpy.bitwise_and(cand_type_ML, 1) == 1")  # Doesn't work
df_temp = df[df['cand_type_ML'].apply(lambda x: x & signalmap == 1)]  # Works, but cumbersome
There should be a nicer solution. Does anyone have a better idea?
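For what it's worth, pandas Series support element-wise bitwise operators directly, so a possibly nicer alternative needs no apply() at all. A sketch with the toy dataframe from above (signalmap is assumed to come from database_ml_parameters.yml):

```python
import pandas as pd

signalmap = 0b0001  # assumed to be stored in database_ml_parameters.yml

df = pd.DataFrame([[2, 6, 34, 0], [1, 4, 23, 1], [2, 3, 54, 2],
                   [4, 5, 56, 3], [3, 2, 56, 4], [6, 3, 73, 5]],
                  columns=['a', 'b', 'c', 'cand_type_ML'])

# Element-wise bitwise AND on the whole column; keep rows where the
# signal bit(s) in the candidate-type bitmap are set.
df_sel = df[(df['cand_type_ML'] & signalmap) == signalmap]
```

This stays vectorized (no per-row Python lambda) and generalizes to multi-bit masks by changing signalmap.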
Here's a checklist of things to do to improve the webapp.
Hello @DelloStritto. I just opened this issue to make sure we don't forget to update the dataset of the Lc->pK0s. For the moment I added a temporary new database https://github.com/ginnocen/MachineLearningHEP/blob/master/machine_learning_hep/data/data_prod_20200304/database_ml_parameters_LcpK0spp_0304_sub.yml.
The versions of the dependencies used by the package should be revised. E.g. the problem we see in #692 is gone using pandas version >= 1.0.0.
However, that particular version cannot be bumped yet, as some changes in the new pandas API might break our code. Something for less stressful post-approval times :)
Abseil interferes with Python's logging used via import logging. For details see abseil/abseil-py#99.
Note that what abseil does is not really best practice, and there will most likely be a fix addressing this issue.
Since our logger is also used for fatal messaging with subsequent aborting, and we also use it to log to a file, we want to keep it and NOT move to abseil.logging, as some people suggest in the above linked issue.
There is a temporary fix thanks to abseil/abseil-py#99 (comment), see PR #196.
If anyone spots some weird behaviour, please let us (@benedikt-voelkel or @ginnocen) know.
Using the SciKit models with activate_scikit = 1, an error is obtained. This is due to the absence of graphviz as a dependency in Prerequisites. Its addition should solve the problem.
However, graphviz is used only to obtain a graph of the scikit-learn single decision tree model. Given that a single decision tree is not very interesting for ML in HEP, and that this plot can't be produced for BDTs, another possibility is to drop this feature from the code in order to avoid the graphviz dependency.
I would go for the latter solution, also because, at least on Ubuntu, graphviz is provided as a system package and requires superuser permissions to be installed.
@ginnocen What do you think?
I would suggest documenting at least the functions exposed to the user in doclassification_regression.py with docstrings, following the style guides PEP 8 and PEP 257.
In this way, information about a function can be accessed interactively with IPython or Jupyter. Moreover, the docstrings could be used (in the future) to automatically build code documentation.
Currently the missing-docstring error is disabled in our pylint configuration; we could think about enabling it if we want to be strict.
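As an illustration of the suggested style (the function and its parameters are hypothetical, not taken from doclassification_regression.py):

```python
def in_mass_window(mass, center, half_width):
    """Check whether a candidate mass lies inside a symmetric window.

    Args:
        mass: Candidate invariant mass in GeV/c^2.
        center: Window center in GeV/c^2.
        half_width: Half-width of the window in GeV/c^2.

    Returns:
        bool: True if the candidate passes the window cut.
    """
    return abs(mass - center) <= half_width
```

With such a docstring, help(in_mass_window) and IPython's `in_mass_window?` show the signature and description interactively, and tools like Sphinx can pick it up for generated documentation.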
Move the process_histomass workflow to the Analyzers, as this is analysis specific and does not belong to the Processer logic.
If datatypes for pandas.DataFrame are not specified, a query operation with negative values most likely fails (see also: https://stackoverflow.com/questions/50400843/using-negative-numbers-in-pandas-dataframe-query-expression). Hence, we cannot directly apply custom cuts with negative values using pandas.DataFrame.query, which is a major problem.
Furthermore, specifying datatypes might save disk space!
This needs to be done at the first processing stage, when the TTrees are converted and pickled.
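A minimal sketch of the failure mode and the fix (the column name and values are made up): when a column comes back with object dtype, a query with a negative literal can fail, while casting to an explicit numeric dtype makes it work and uses less memory than float64.

```python
import pandas as pd

# Column that ended up with object dtype, e.g. after a lossy conversion step.
df = pd.DataFrame({"dca": ["-0.10", "0.20", "-0.30"]})

# df.query("dca > -0.2") on the object column raises a TypeError here,
# because strings cannot be compared to a number.

# Casting to an explicit, compact dtype fixes the query and saves space.
df["dca"] = df["dca"].astype("float32")
selected = df.query("dca > -0.2")
```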
Besides saving the plots as PNGs in the /plots_nevt*/ directory, it would be good to also save the figure object itself. Then one doesn't have to rerun for small changes in the layout, and it is possible to show two differently trained pT bins in one canvas.
Pickle should work, right? @ginnocen, what do you think?
The code would become something like this:
import pickle

plotname = folder + '/ROCcurve.png'
plt.savefig(plotname)
plotnamepickle = folder + '/ROCcurve.pickle'
with open(plotnamepickle, 'wb') as fid:
    pickle.dump(figure2, fid)  # figure2: the matplotlib figure created earlier
And for offline use
import matplotlib.pyplot as plt
import pickle as pl

fig_handle = pl.load(open('ROCcurve.pickle', 'rb'))
fig_handle.show()
# Get the data for further operations
x = fig_handle.axes[0].lines[0].get_data()
In the Monte Carlo SmallSkimmed_*.root files, downloaded by ml-get-data, very few or no signal candidates are present. They are also very small (~100 KB). They could probably be replaced by bigger files with more data.
At the moment, the code may not work properly using these files.
The explicit usage of nsigTOF_Pr_0 in https://github.com/ginnocen/MachineLearningHEP/blob/master/machine_learning_hep/processerdhadrons_mult.py#L409-L410 breaks the code for any database/particle that does not extract this field from the produced TTrees (e.g. D0, see https://github.com/ginnocen/MachineLearningHEP/blob/master/machine_learning_hep/data/data_prod_20200417/database_ml_parameters_D0pp_0417.yml).
This was introduced in #673
Inside machine_learning_hep we should organise the individual .py files by grouping them into sub-directories.
This makes the structure cleaner (especially for people who are new and having a first look at the code, etc.).
When the full analysis chain is enabled (so fit by loading AliHFInvMassFitter, efficiency with the python code, and cross section by loading HFPtSpectrum) everything runs smoothly. When only docross is enabled, the package crashes after processing the first period with the following message:
Traceback (most recent call last):
File "do_entire_analysis.py", line 23, in <module>
main()
File "/home/lvermunt/JulyALICE/fork/MachineLearningHEP/machine_learning_hep/steer_analysis.py", line 461, in main
do_entire_analysis(run_config, db_analysis, db_ml_models, db_ml_gridsearch, db_run_list)
File "/home/lvermunt/JulyALICE/fork/MachineLearningHEP/machine_learning_hep/steer_analysis.py", line 390, in do_entire_analysis
myan.multi_makenormyields()
File "/home/lvermunt/JulyALICE/fork/MachineLearningHEP/machine_learning_hep/multianalyzer.py", line 113, in multi_makenormyields
self.process_listsample[indexp].makenormyields()
File "/home/lvermunt/JulyALICE/fork/MachineLearningHEP/machine_learning_hep/analyzer.py", line 1513, in makenormyields
gROOT.LoadMacro("HFPtSpectrum.C")
SystemError: int TROOT::LoadMacro(const char* filename, int* error = 0, bool check = kFALSE) =>
problem in C++; program state has been reset
*** Break *** segmentation violation
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/cvmfs/alice.cern.ch/el6-x86_64/Packages/ROOT/v6-16-00-18/lib/ROOT.py", line 770, in cleanup
isCocoa = _root.gSystem.InheritsFrom( 'TMacOSXSystem' )
TypeError: none of the 2 overloaded methods succeeded. Full details:
bool TObject::InheritsFrom(const char* classname) =>
problem in C++; program state has been reset
bool TObject::InheritsFrom(const TClass* cl) =>
could not convert argument 1
*** Break *** segmentation violation
Segmentation fault (core dumped)
I don't see what can cause this. @benedikt-voelkel, do you have an idea? For plot_multi_trial you are also loading a .C macro. Does this work stand-alone?
Using the PCA (dopca: 1) crashes the execution when one of the following flags is set: doimportance: 1, dotesting: 1, applytodatamc: 1.
The PCA must be applied on the relevant data for the various flags.