pyqalm: Clustering with learned fast transforms
This repository provides base scripts to play with the qk-means algorithm. You can find:
- an easily installable package which provides all the functions needed to play with the qkmeans or the palm4msa algorithm;
- a minimal working example showing how to use our qkmeans algorithm;
- the history and evolution of our experiment script until its final version;
- the history of tested parameters for this experiment script;
- the utility functions necessary to download or create the datasets used in experiments;
- the LaTeX source to compile our report;
- the visualization scripts used to generate figures.
Note: This is a view of the development repository. We have tried to make it as clear as possible to help readers reproduce the results. If you have any question or need clarification on the repository content, do not hesitate to raise an issue.
Note 2: Not everything is clean and tidy. We have left the code in the if __name__ == "__main__" blocks of each module so that you can draw inspiration from it, but it is not guaranteed to work. There is certainly a lot of non-working code, but we chose to remain transparent in case it is of use to anyone. If you have any question, please raise an issue.
Giffon, L., Emiya, V., Kadri, H., & Ralaivola, L. (2021). QuicK-means: accelerating inference for K-means by learning fast transforms. Machine Learning, 1-25.
For this installation procedure, we assume you have a fresh Python environment activated. Python version should be 3.7.
Before installing, make sure that opencv is installed in your environment so that python -c "import cv2; print(cv2.__version__)" doesn't crash.
Also install tensorflow version 1.13.1 separately.
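As a hedged illustration, the prerequisites above can be checked from Python before running the installation. This sketch is not part of the repository: the function name check_prerequisites is hypothetical, and it only tests that the modules can be found, not that tensorflow is exactly version 1.13.1.

```python
import importlib.util
import sys


def check_prerequisites(version=None, modules=("cv2", "tensorflow")):
    """Return a list of problems; an empty list means every check passed.

    `version` defaults to the running interpreter's (major, minor) pair.
    """
    if version is None:
        version = tuple(sys.version_info[:2])
    problems = []
    # The README asks for Python 3.7.
    if tuple(version) != (3, 7):
        problems.append(
            "Python 3.7 expected, found {}.{}".format(version[0], version[1])
        )
    # find_spec returns None when a top-level module cannot be located.
    for module in modules:
        if importlib.util.find_spec(module) is None:
            problems.append("module {!r} is not importable".format(module))
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

An empty output means the interpreter version matches and both modules were found; the exact tensorflow version still has to be checked by hand.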
To install the package, simply type:
make install
Then you should also install Luc Giffon's scientific toolbox: https://github.com/lucgiffon/scikit-luc
To verify that everything is working, you can use the test rule from the root directory:
make test
Under the code/scripts/examples directory, you will find 2 minimal working examples:
- example_hierarchical_palm4msa.py shows how the hierarchical-palm4msa function can be called
- example_qkmeans.py shows how the qkmeans function can be called
These scripts can be run directly and shouldn't raise any error.
Running experiments can be done in a few steps:
- Prepare data
- Prepare parameters
- Launch experiment
- Visualize results
See the details of those steps in the following subsections.
Note: We point to the last working versions of the experiments, parameters and visualizations, but you can look at the others if you find them interesting. The names of the experiment scripts, the parameter files and the visualization scripts roughly match each other.
Look inside the Makefile to see which rules you can call for which dataset. You can also create all datasets at once using the following command, but this will take a lot of time and a lot of space on your disk, so be careful. The data will be stored under the data/external directory.
To prepare all the data (including the synthetic data, which takes a long time), type the following (make sure you have downloaded the kddcup04 dataset first, see below):
make data
To prepare all the data but __NOT__ the synthetic data, type the following (make sure you have downloaded the kddcup04 dataset first, see below):
make data_no_million
To download the kddcup04 dataset, go [here](http://osmot.cs.cornell.edu/kddcup/datasets.html), register, then download the datasets using your credentials. Once this is done, extract the archive into data/raw so that data/raw/data_kddcup04/bio_train.dat exists.
If you want to remove all the data, you can run the following (this can take some time too):
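Before launching the data preparation, it can help to confirm that the archive was extracted to the expected place. The following is a minimal sketch, not a script from the repository; the function name kddcup04_ready is hypothetical, and the only path it relies on is the one given above.

```python
from pathlib import Path


def kddcup04_ready(repo_root="."):
    """Return True if data/raw/data_kddcup04/bio_train.dat exists under repo_root."""
    expected = Path(repo_root) / "data" / "raw" / "data_kddcup04" / "bio_train.dat"
    return expected.is_file()


if __name__ == "__main__":
    if kddcup04_ready():
        print("kddcup04 data found")
    else:
        print("bio_train.dat is missing: download and extract kddcup04 first")
```

Run it from the root of the repository, or pass the repository path as repo_root.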
make clean
In the parameters/08/aaai directory, you will find some files called lazyfile_*.yml. These files describe combinations of parameters to test. You can interpret them using the lazygrid command from the [scikit-luc](https://pypi.org/project/scikit-luc/) package (which should be installed if you have installed qkmeans). To print the command lines produced by a lazyfile, do for example:
cd parameters/08/aaai
lazygrid -l lazyfile_qmeans_analysis_caltech_decoda2.yml
You can use the script code/scripts/2019/08/3_4_qmeans_minibatch_hierarchical_init.py with any combination of the previously generated parameters. The usage help is at the top of the script, in the docstring. For example you can do:
cd code/scripts/2019/08
python 3_4_qmeans_minibatch_hierarchical_init.py qmeans --sparsity-factor 2 --seed 0 --nystrom 5000 --assignation-time 5000 --batch-assignation-time 5000 --1-nn --initialization uniform_sampling --nb-iteration 10 --nb-cluster 128 --caltech 28
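To repeat the example above over several seeds without going through lazygrid, the command line can be generated programmatically. This is only a sketch: the helper build_command is hypothetical, it uses nothing but the flags shown in the example, and which other values the script accepts has to be checked against its docstring.

```python
import shlex

SCRIPT = "3_4_qmeans_minibatch_hierarchical_init.py"


def build_command(seed, nb_cluster=128, sparsity_factor=2):
    """Return the example command line, varying only three of its values."""
    args = [
        "python", SCRIPT, "qmeans",
        "--sparsity-factor", str(sparsity_factor),
        "--seed", str(seed),
        "--nystrom", "5000",
        "--assignation-time", "5000",
        "--batch-assignation-time", "5000",
        "--1-nn",
        "--initialization", "uniform_sampling",
        "--nb-iteration", "10",
        "--nb-cluster", str(nb_cluster),
        "--caltech", "28",
    ]
    # Quote each token so the printed line can be pasted into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    for seed in range(3):
        print(build_command(seed))
```

Each printed line can then be run from code/scripts/2019/08.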
You can try to adapt the code in code/visualization/2019/08/aaai_conference to visualize the results. This code won't work as is in your case because of invalid paths, but you may take inspiration from it.
LaTeX makes it difficult to write a make rule for its compilation, so you will have to do it yourself. To compile the paper as a PDF, go to reports/aaai_2020, then use pdflatex and bibtex:
cd reports/aaai_2020
pdflatex aaai2020_qmeans.tex -synctex=1 -interaction=nonstopmode
bibtex aaai2020_qmeans.aux
pdflatex aaai2020_qmeans.tex -synctex=1 -interaction=nonstopmode
pdflatex aaai2020_qmeans.tex -synctex=1 -interaction=nonstopmode
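Since no make rule is provided, the pdflatex, bibtex, pdflatex, pdflatex sequence can be wrapped in a small Python helper. This is a hedged sketch and not part of the repository: both function names are hypothetical, the .aux name is assumed to be derived from the .tex file name, and pdflatex and bibtex must be available on your PATH.

```python
import subprocess
from pathlib import Path


def latex_build_commands(tex_file="aaai2020_qmeans.tex"):
    """Return the four commands of the build sequence as argument lists."""
    stem = Path(tex_file).stem
    pdflatex = ["pdflatex", tex_file, "-synctex=1", "-interaction=nonstopmode"]
    bibtex = ["bibtex", stem + ".aux"]
    # One pass to produce the .aux file, bibtex, then two passes to resolve references.
    return [list(pdflatex), bibtex, list(pdflatex), list(pdflatex)]


def compile_report(report_dir="reports/aaai_2020"):
    """Run the build sequence inside report_dir, stopping at the first failure."""
    for command in latex_build_commands():
        subprocess.run(command, cwd=report_dir, check=True)
```

Calling compile_report() from the root of the repository runs the same steps as the manual procedure.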
For the implementation of PALM4MSA, we followed the description in the PALM4MSA paper and relied on the authors' open source MATLAB implementation, available here: https://faust.inria.fr/ . Note that our implementation may be sub-optimal compared to theirs; you can get in touch with them to use their Python version.
The Palm4MSA paper: Le Magoarou, Luc, and Remi Gribonval. "Flexible Multilayer Sparse Approximations of Matrices and Applications". IEEE Journal of Selected Topics in Signal Processing 10, no. 4 (June 2016): 688-700. https://doi.org/10.1109/JSTSP.2016.2543461.
QKmeans paper: Luc Giffon, Valentin Emiya, Liva Ralaivola, Hachem Kadri. QuicK-means: Acceleration of K-means by learning a fast transform. 2019. ⟨hal-02174845v2⟩
Use logger for debugging:
from qkmeans.utils import logger
logger.debug("your logging message level debug")
logger.info("your logging message level info")
logger.warning("your logging message level warning")
logger.error("your logging message level error")
If you want to change the logging level, do:
import logging
import daiquiri
daiquiri.setup(level=logging.DEBUG) # for level debug, change value of level according to your needs
# your script (with no prints)
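The effect of the logging level can be illustrated with the standard library alone. The sketch below assumes that the qkmeans logger filters messages the way a standard logging.Logger does; ListHandler and demo are hypothetical names used only for this illustration.

```python
import logging


class ListHandler(logging.Handler):
    """Collect messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append("{}: {}".format(record.levelname, record.getMessage()))


def demo(level):
    """Emit one message per level and return those that pass the filter."""
    demo_logger = logging.getLogger("logging_level_demo")
    demo_logger.setLevel(level)
    demo_logger.propagate = False  # keep the demo out of the root logger's output
    handler = ListHandler()
    demo_logger.addHandler(handler)
    try:
        demo_logger.debug("debug message")
        demo_logger.info("info message")
        demo_logger.warning("warning message")
        demo_logger.error("error message")
    finally:
        demo_logger.removeHandler(handler)
    return handler.messages


if __name__ == "__main__":
    print(demo(logging.WARNING))  # only the warning and error messages remain
    print(demo(logging.DEBUG))    # all four messages are kept
```

Messages below the configured level are dropped, which is why lowering the level to DEBUG makes the debug output appear.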
- Laboratoire d'Informatique et Systèmes
- Université d'Aix-Marseille
- Centre National de la Recherche Scientifique
- Université de Toulon
Released under the GNU General Public License version 3 or later (see LICENSE.txt).