
Slow training IVF indexes · faiss-rs · CLOSED

ava57r commented on August 22, 2024
Slow training IVF indexes


Comments (13)

Enet4 commented on August 22, 2024

Could you please include how you are building and running the Rust application? It should be in release mode for a fair comparison.

The random vector data set generation can be a bit more idiomatic:

use rand::Rng;

let rng = rand::thread_rng();
let xt: Vec<f32> = rng.sample_iter(rand::distributions::Standard) // f32 uniform in [0, 1)
    .take(1_000_000 * d)
    .collect();

Other than that, some profiling would help identify where the potential overhead is.


ava57r commented on August 22, 2024

Commands:

In the faiss directory:

cp build/c_api/libfaiss_c.so ~/faiss-lib/libfaiss_c.so
cp build/faiss/libfaiss.so ~/faiss-lib/libfaiss.so
cp build/faiss/libfaiss_avx2.so ~/faiss-lib/libfaiss_avx2.so

Run the training program:

time LD_LIBRARY_PATH=/home/user/faiss-lib/ LIBRARY_PATH=/home/user/faiss-lib/ cargo run --release
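For context, the training program itself was not posted in the thread; a minimal sketch of the kind of program being timed, assuming faiss-rs's index_factory and Index::train APIs plus the rand crate (the "IVF1024,Flat" descriptor and 1M x 128 data set match the numbers reported below):

use faiss::{index_factory, Index, MetricType};
use rand::Rng;

fn main() {
    let d: usize = 128;
    let n: usize = 1_000_000;
    // 1M random training vectors, f32 uniform in [0, 1)
    let xt: Vec<f32> = rand::thread_rng()
        .sample_iter(rand::distributions::Standard)
        .take(n * d)
        .collect();

    // "IVF1024,Flat": IVF with 1024 lists over flat (uncompressed) storage
    let mut index = index_factory(d as u32, "IVF1024,Flat", MetricType::L2)
        .expect("failed to create index");
    index.train(&xt).expect("training failed");
}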


ava57r commented on August 22, 2024

I tried IVF1024.

Python CPU:

time python3 train-cpu-rand.py 
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training

real    0m8.587s
user    8m26.756s
sys     10m56.531s

Rust:

    Finished release [optimized] target(s) in 0.01s
     Running `target/release/train-faiss-cpu`
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training

real    4m51.672s
user    68m31.671s
sys     1m9.906s

I tried the perf utility.

perf results for IVF1024


Enet4 commented on August 22, 2024

I tried to run the same Rust program in my environment (although I don't have a Python environment handy at the moment).
Intel Core i7-9750H CPU, 12GB, running Manjaro Linux, kernel 5.12.9, with the index factory descriptor changed to "IVF1024,Flat".

It does take a long time, but it appears to spend most of it in sgemm_.

perf stat cargo run --release

What is particularly curious is that it is not parallelizing the training process (should it?). In any case, the performance differences could either be related with the differences in the BLAS versions that were linked, or maybe how the Python bindings are set up for multithreading.

Can you check whether Python is using multiple threads? And whether the Python faiss package is linking against OpenBLAS or MKL (this script seems to suggest the latter)? I imagine that this could make a severe difference.


ava57r commented on August 22, 2024

> What is particularly curious is that it is not parallelizing the training process (should it?). In any case, the performance differences could either be related with the differences in the BLAS versions that were linked, or maybe how the Python bindings are set up for multithreading.

It looks strange. A lot of threads are created, but only a few (at most 3) run at 100% load.

> Can you check whether Python is using multiple threads? And whether the Python faiss package is linking against OpenBLAS or MKL (this script seems to suggest the latter)? I imagine that this could make a severe difference.

I will try.

I use miniconda3 + https://github.com/facebookresearch/faiss/blob/master/INSTALL.md#installing-from-conda-forge to test the Python version.


Enet4 commented on August 22, 2024

OK, I managed to run the Python program with miniconda3 and the faiss-cpu package. It saturated all 6 cores (hyperthreading disabled) and ran pretty fast.

❯ time /opt/miniconda3/bin/python main-cpu.py 
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training

real	0m7.115s
user	0m32.828s
sys	0m8.887s

The Rust version:

❯ time cargo run --release
    Finished release [optimized + debuginfo] target(s) in 0.00s
     Running `target/release/use-faiss`
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
        Training IVF residual
IndexIVF: no residual training

real    4m35.884s
user    4m38.034s
sys     0m0.293s

Although I am not entirely sure, both appear to be using OpenBLAS. Maybe some configuration is lacking. We could also remove some intermediate layers and write an equivalent C++ program.


ava57r commented on August 22, 2024

I tried the C API, with a modified example_c:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include "AutoTune_c.h"
#include "IndexFlat_c.h"
#include "Index_c.h"
#include "clone_index_c.h"
#include "error_c.h"
#include "index_factory_c.h"
#include "index_io_c.h"

#define FAISS_TRY(C)                                       \
    {                                                      \
        if (C) {                                           \
            fprintf(stderr, "%s", faiss_get_last_error()); \
            exit(-1);                                      \
        }                                                  \
    }

double drand() {
    return (double)rand() / (double)RAND_MAX;
}

int main() {
    time_t seed = time(NULL);
    srand(seed);
    printf("Generating some data...\n");
    int d = 128;     // dimension
    int nb = 1000000; // database size
    float* xb = malloc(d * nb * sizeof(float));

    for (int i = 0; i < nb; i++) {
        for (int j = 0; j < d; j++)
            xb[d * i + j] = drand();
        xb[d * i] += i / 1000.;
    }

    FaissIndex* index = NULL;
    FAISS_TRY(faiss_index_factory(
            &index, d, "IVF1024,Flat", METRIC_L2)); // use factory to create index
    printf("is_trained = %s\n",
           faiss_Index_is_trained(index) ? "true" : "false");
    faiss_Index_set_verbose(index, 1);
    FAISS_TRY(faiss_Index_train(index, nb, xb)); // train the index
    printf("ntotal = %lld\n", (long long)faiss_Index_ntotal(index));

    printf("Freeing index...\n");
    faiss_Index_free(index);
    printf("Done.\n");

    return 0;
}

results:

time main_cpu
Generating some data...
is_trained = false
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training
ntotal = 0
Freeing index...
Done.

real	4m52.076s
user	71m44.659s
sys	1m16.614s

I tried the configuration from https://github.com/facebookresearch/faiss/wiki/Threads-and-asynchronous-calls#performance-of-internal-threading-openmp

results

time OMP_WAIT_POLICY=PASSIVE main_cpu
Generating some data...
is_trained = false
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training
ntotal = 0
Freeing index...
Done.

real	4m40.540s
user	4m57.808s
sys	0m17.758s

libfaiss_c.so depends on libfaiss_avx2.so:

ldd main_cpu
	... // omitted
	libfaiss_c.so
	... // omitted

ldd libfaiss_c.so 
	... // omitted
	libfaiss_avx2.so
	... // omitted


ava57r commented on August 22, 2024

The conda-forge package (https://anaconda.org/conda-forge/faiss) is at v1.7.0.


ava57r commented on August 22, 2024

I had problems with slow training earlier, and now I can reproduce it in v1.6.3.


ava57r commented on August 22, 2024

> What is particularly curious is that it is not parallelizing the training process (should it?). In any case, the performance differences could either be related with the differences in the BLAS versions that were linked, or maybe how the Python bindings are set up for multithreading.

> Can you check whether Python is using multiple threads? And whether the Python faiss package is linking against OpenBLAS or MKL (this script seems to suggest the latter)? I imagine that this could make a severe difference.

Python uses:

  • multiple threads
  • OpenBLAS from the miniconda installation

My C version (main_cpu) creates a lot of threads but uses only 1 thread for training.
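A way to sanity-check this from the Rust side is to ask the OpenMP runtime directly. A sketch, assuming the binary transitively links an OpenMP runtime (libgomp or libiomp) through libfaiss; omp_get_max_threads and omp_set_num_threads are standard OpenMP entry points:

use std::os::raw::c_int;

// Standard OpenMP runtime entry points, resolved from whichever
// OpenMP library libfaiss was linked against.
extern "C" {
    fn omp_get_max_threads() -> c_int;
    fn omp_set_num_threads(n: c_int);
}

fn main() {
    unsafe {
        println!("OpenMP max threads: {}", omp_get_max_threads());
        // Experiment: pin the OpenMP thread count before training.
        omp_set_num_threads(6);
        println!("OpenMP max threads now: {}", omp_get_max_threads());
    }
}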


ava57r commented on August 22, 2024

Build options: facebookresearch/faiss#1511

OpenBLAS has supported AVX2 since v0.3.8: https://github.com/xianyi/OpenBLAS/releases/tag/v0.3.8

Test on CentOS:

rpm -qa | grep blas
blas-3.8.0-8.el8.x86_64

rpm -qa | grep lapack
lapack-3.8.0-8.el8.x86_64

miniconda3

conda list | grep blas
libblas                   3.9.0                8_openblas    conda-forge
libcblas                  3.9.0                8_openblas    conda-forge
liblapack                 3.9.0                8_openblas    conda-forge
libopenblas               0.3.12          pthreads_hb3c22a3_1    conda-forge

https://github.com/xianyi/OpenBLAS/tree/v0.3.12

UPD:
The OpenBLAS FAQ explains how to find out at runtime whether the library was built single-threaded or multithreaded:
https://github.com/xianyi/OpenBLAS/wiki/Faq#how-can-i-find-out-at-runtime-what-options-the-library-was-built-with-
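Per that FAQ, the OpenBLAS that actually got loaded can be queried at runtime. A sketch from the Rust side, assuming the process is linked against libopenblas; openblas_get_config and openblas_get_num_threads are part of the OpenBLAS C API:

use std::ffi::CStr;
use std::os::raw::{c_char, c_int};

// OpenBLAS runtime introspection entry points.
extern "C" {
    fn openblas_get_config() -> *const c_char;
    fn openblas_get_num_threads() -> c_int;
}

fn main() {
    unsafe {
        // Prints the build options, e.g. version, DYNAMIC_ARCH, MAX_THREADS=...
        let config = CStr::from_ptr(openblas_get_config());
        println!("OpenBLAS config: {}", config.to_string_lossy());
        println!("OpenBLAS threads: {}", openblas_get_num_threads());
    }
}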


ava57r commented on August 22, 2024

On CentOS I installed OpenBLAS and rebuilt the libraries and binaries.

sudo dnf install openblas
rpm -qa | grep openblas
openblas-0.3.12-1.el8.x86_64
ldd ./libfaiss_c.so 
	libfaiss_avx2.so
	libopenblas.so.0

Rust version with OMP_WAIT_POLICY=PASSIVE:

time OMP_WAIT_POLICY=PASSIVE cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/train-faiss-cpu`
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training

real	0m20.339s
user	0m47.973s
sys	0m26.781s

Rust version without OMP_WAIT_POLICY:

time cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/train-faiss-cpu`
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training

real	0m20.186s
user	0m45.038s
sys	0m20.870s

C version without OMP_WAIT_POLICY:

time main_cpu
Generating some data...
is_trained = false
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training
ntotal = 0
Freeing index...
Done.

real	0m20.714s
user	0m53.912s
sys	0m27.579s


ava57r commented on August 22, 2024

I tried again with MKL (BLA_VENDOR=Intel10_64_dyn).

Python version (pytorch installation):

time python3 train-cpu-rand.py 
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training

real	0m11.389s
user	2m20.324s
sys	5m0.904s

Rust version with MKL:

time LD_LIBRARY_PATH=/usr/local/lib64:/opt/intel/mkl/lib/intel64 cargo run --release 

    Finished release [optimized] target(s) in 0.01s
     Running `target/release/train-faiss-cpu`
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training

real	0m36.642s
user	18m36.431s
sys	0m18.177s

C with MKL:

time LD_LIBRARY_PATH=/usr/local/lib64:/opt/intel/mkl/lib/intel64 ./build/c_api/main_cpu 
Generating some data...
is_trained = false
Training level-1 quantizer
Training level-1 quantizer on 1000000 vectors in 128D
Training IVF residual
IndexIVF: no residual training
ntotal = 0
Freeing index...
Done.

real	0m37.687s
user	18m46.975s
sys	0m18.039s

Performance largely depends on how the faiss libraries are configured and linked. As these results show, the Rust bindings add no extra slowdown in this case.

