pytorch / fbgemm Goto Github PK

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

License: Other

CMake 1.33% C++ 53.80% C 0.39% Starlark 0.20% Cuda 19.43% Python 24.80% Makefile 0.03% Shell 0.02%

fbgemm's Introduction

FBGEMM

FBGEMM (Facebook GEneral Matrix Multiplication) is a low-precision, high-performance matrix-matrix multiplications and convolution library for server-side inference.

The library provides efficient low-precision general matrix multiplication for small batch sizes and support for accuracy-loss minimizing techniques such as row-wise quantization and outlier-aware quantization. FBGEMM also exploits fusion opportunities in order to overcome the unique challenges of matrix multiplication at lower precision with bandwidth-bound operations.

FBGEMM is used as a backend of Caffe2 and PyTorch quantized operators for x86 machines:

See the full Documentation for more information on building, installing, and developing with FBGEMM, as well as the most up-to-date support matrix and API documentation for this library.

What's New?

New Features and Recent Improvements (January, 2020)

Citation

For a high-level overview, design philosophy and brief descriptions of various parts of FBGEMM please see our blog post.

For those looking for the appropriate article to cite regarding FBGEMM, we recommend citing our paper:

@article{fbgemm,
  title={FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference},
  author={Khudia, Daya and Huang, Jianyu and Basu, Protonu and Deng, Summer and Liu, Haixin and Park, Jongsoo and Smelyanskiy, Mikhail},
  journal={arXiv preprint arXiv:2101.05615},
  year={2021}
}

Join the FBGEMM community

For questions, support, news updates, or feature requests, please feel free to:

File a ticket in GitHub Issues
Post a discussion in GitHub Discussions
Reach out to us on the #fbgemm channel in PyTorch Slack

For contributions, please see the CONTRIBUTING file for ways to help out.

License

FBGEMM is BSD licensed, as found in the LICENSE file.

fbgemm's People

Contributors

Stargazers

Watchers

Forkers

github30 templeblock jspark1105 hongyunnchen boncheolgu nieshaoshuai jamesr66a b-xiang ajitmaths zgsxwsdxg skymysky guidewsp hiekay basicv8vc sihengtech01 dskhudia jianyuh entn-at hubbucket-team sszth yuhonghong66 kenshin92 bunnyrabbit8mile yns88 amylittleyang lifeiteng dut3062796s williamtambellini pyzh gan-challenger houseroad lly-zero-one yazici protonu ykim362 dhpollack marian-nmt fengzifrank anteplim huihanl smilejames ychjiang yinghai yujunfeng nidao66 mrx0089 vellano xiaomengy tricky61 efiks melonelabs mn5k zrphercule dreamerdoremi fightseed wuhuikx liangan1 tejashah94 peterjc123 rdpli liaoheping malfet nevolver awesome-archive peteroxic shenmayufei superian wshyb1995 klonggan daydreamcoding lolothomasi dennisjin gazelei frozenmaple fkueyu xiangjun0103 tglgg ijingo xijiangyuexin keyman9848 reno77 jixiaoqiang yecq tank-2019 kentchun33333 ytxjonathan thebluesmoke focus3d saonam dexception vectorlu shulepov de30 lamprosmousselimis cadet354 anopen yuanzhenjie menchunlei clin99 grezzzle

fbgemm's Issues

Does cache size and instruction set the cpu support affect the performance of FBGEMM?

I found my system get slower after switching to FBGEMM from OpenBLAS. I replaced the float matrix-matrix multiplication in OpenBLAS with Int8 matrix-matrix multiplication build with FBGEMM. But my system get slower after switching to the Int8 operation.
I set the thread number to 1 and test on the same machine.
I guess the cache size and the instruction set my cpu support affect the performance. My CPU is Intel Core i7-7500U, and it only supports avx2. The L1 cache size is 128 KB, L2 cache size is 512 KB and L3 cache size is 4.0MB.
I use the default Packing Parameters now.
I hope to speed up my system by using the Int8 matrix-matrix multiplication pipeline built with FBGEMM. But I am not sure what makes the Int8 version even slower. Should I switch to the CPU that support avx512 vnni? Or should I find the best params for my machine as #553 ?

‘_mm256_cmpge_epi32_mask’ was not declared for fbgemm when compiling CPU only pytorch

When I compile CPU only pytorch, I encountered below issue for fbgemm module:
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc: In function ‘void fbgemm::internal::compressed_indices_remap_avx512(int32_t, const IndexType*, const int32_t*, const IndexType*, const float*, IndexType*, IndexType*, float*)’:
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:441:66: error: there are no arguments to ‘_mm256_cmpge_epi32_mask’ that depend on a template parameter, so a declaration of ‘_mm256_cmpge_epi32_mask’ must be available [-fpermissive]
__mmask8 cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);
^
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:441:66: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:476:59: error: there are no arguments to ‘_mm256_cmpge_epi32_mask’ that depend on a template parameter, so a declaration of ‘_mm256_cmpge_epi32_mask’ must be available [-fpermissive]
cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);
^
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc: In instantiation of ‘void fbgemm::internal::compressed_indices_remap_avx512(int32_t, const IndexType*, const int32_t*, const IndexType*, const float*, IndexType*, IndexType*, float*) [with IndexType = int; bool HAS_WEIGHTS = true; int32_t = int]’:
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:595:1: required from here
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:441:49: error: ‘_mm256_cmpge_epi32_mask’ was not declared in this scope
__mmask8 cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);
^
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:476:42: error: ‘_mm256_cmpge_epi32_mask’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);
^
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:441:49: note: ‘_mm256_cmpge_epi32_mask’ declared here, later in the translation unit
__mmask8 cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);
^
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc: In instantiation of ‘void fbgemm::internal::compressed_indices_remap_avx512(int32_t, const IndexType*, const int32_t*, const IndexType*, const float*, IndexType*, IndexType*, float*) [with IndexType = int; bool HAS_WEIGHTS = false; int32_t = int]’:
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:596:1: required from here
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:441:49: error: ‘_mm256_cmpge_epi32_mask’ was not declared in this scope
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:476:42: error: ‘_mm256_cmpge_epi32_mask’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);
^
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:441:49: note: ‘_mm256_cmpge_epi32_mask’ declared here, later in the translation unit
__mmask8 cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);
^
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc: In instantiation of ‘void fbgemm::internal::compressed_indices_remap_avx512(int32_t, const IndexType*, const int32_t*, const IndexType*, const float*, IndexType*, IndexType*, float*) [with IndexType = long int; bool HAS_WEIGHTS = true; int32_t = int]’:
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:597:1: required from here
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:441:49: error: ‘_mm256_cmpge_epi32_mask’ was not declared in this scope
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:476:42: error: ‘_mm256_cmpge_epi32_mask’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);
^
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:441:49: note: ‘_mm256_cmpge_epi32_mask’ declared here, later in the translation unit
__mmask8 cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);
^
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc: In instantiation of ‘void fbgemm::internal::compressed_indices_remap_avx512(int32_t, const IndexType*, const int32_t*, const IndexType*, const float*, IndexType*, IndexType*, float*) [with IndexType = long int; bool HAS_WEIGHTS = false; int32_t = int]’:
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:598:1: required from here
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:441:49: error: ‘_mm256_cmpge_epi32_mask’ was not declared in this scope
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:476:42: error: ‘_mm256_cmpge_epi32_mask’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);
^
/home/sunrise/pytorch/third_party/fbgemm/src/EmbeddingSpMDMAvx512.cc:441:49: note: ‘_mm256_cmpge_epi32_mask’ declared here, later in the translation unit
__mmask8 cmp_res_v = _mm256_cmpge_epi32_mask(len_v, vec_len_v);

Could anyone provide help?

Do you have the plan of supporting Intel AMX instruction set on fbgemm backend?

🚀 Feature

Support Intel AMX instruction set on fbgemm backend

Motivation

AMX instruction is more efficient than before. we hope to use it in fbgemm backend when we are quantizing model to int8 with PyTorch.

Pitch

We hope fbgemm support AMX instruction set so that we can use the same code without any change on a machine with AMX instruction set.

Performance Bottleneck for Quantize/Dequantize

Description:
The dequantize/quantize op is implemented with single thread in fbgemm and these dequantize ops will be the performance bottleneck when they are used in int8 model.

We use pytorch-transformers to enable int8 model and only Linear ops are quantized ops, so a lot of dequantize/quantize ops are need. For glue/CoLA task with large-base-mode, the profile result are as follows:

In order to improve performance, we use OpenMP to speed dequantize/quantize op. The profile results are as follows:

From the above results, we can see that single thread dequantize/quantize ops seriously impact the performance of quantized model.

Environment:
Cacade lake 8280 CPU

[Question] Does FBGEMM support packing B matrix offline now?

Hello! I'm trying to accelerate the matrix multiplication in my own project with FBGEMM. However, I found the matrix multiplication get slower after using the FBGEMM, and I think it's caused by repacking the weight matrix(B matrix).

Your blog says that the cost of repacking can be avoided by prepacking B matrix. But I did not find an example showing how to pack B matrix offline and reuse the prepacked B matrix. I found #427, which provides an unofficial version of FBGEMM, but it doesn't work in my project.

Could you please tell me does FBGEMM support packing B matrix offline and reusing the prepacked matrix now? And is there any usage I can refer to?

I will be very grateful for your help.

Compilation broken in the latest git master

Hey,
Latest git master doesn't compile due to a missing include of <stdexcept>

In file included from /tmp/FBGEMM/include/fbgemm/Fbgemm.h:17,
                 from /tmp/FBGEMM/src/QuantUtilsAvx2.cc:14:
/tmp/FBGEMM/include/fbgemm/./ConvUtils.h: In constructor ‘fbgemm::conv_param_t<SPATIAL_DIM>::conv_param_t(int, int, int, std::array<int, N>, int, std::array<int, N>, std::array<int, N>, std::array<int, (SPATIAL_DIM * 2)>, std::array<int, N>)’:
/tmp/FBGEMM/include/fbgemm/./ConvUtils.h:73:18: error: ‘runtime_error’ is not a member of ‘std’
   73 |       throw std::runtime_error(
      |                  ^~~~~~~~~~~~~
/tmp/FBGEMM/include/fbgemm/./ConvUtils.h:78:18: error: ‘runtime_error’ is not a member of ‘std’
   78 |       throw std::runtime_error(
      |

Including it at the top of the file fixes the issue.

Also, a question: include/fbgemm/FbgemmSpMM.h" has been removed? Has that functionality been removed or moved to another place?

Cheers,

Nick

[Question] offline packed memory

Hi,
I would like to skip pack() in PackBMatrix() in my use case. I pack memory offline and keep raw memory in another place. For a workaround, I created another constructor in my fork of FBGEMM. https://github.com/marian-nmt/FBGEMM/blob/master/src/PackBMatrix.cc#L241 (This is one of the main reasons Marian maintains a separate fork of FBGEMM)
Would there be any way to create PackBMatrix() without calling pack(), or would you consider to merge if I submit a PR for this?
Thanks!

build from source failed

Scanning dependencies of target cpuid-dump
Scanning dependencies of target gtest
Scanning dependencies of target clog
Scanning dependencies of target fbgemm_avx512
Scanning dependencies of target fbgemm_avx2
Scanning dependencies of target fbgemm_generic
Scanning dependencies of target asmjit
[ 0%] Building C object cpuinfo/deps/clog/CMakeFiles/clog.dir/src/clog.c.o
[ 0%] Building C object cpuinfo/CMakeFiles/cpuid-dump.dir/tools/cpuid-dump.c.o
[ 0%] Building CXX object CMakeFiles/fbgemm_avx512.dir/src/FbgemmBfloat16ConvertAvx512.cc.o
[ 0%] Building CXX object CMakeFiles/fbgemm_avx512.dir/src/FbgemmFloat16ConvertAvx512.cc.o
[ 1%] Building CXX object CMakeFiles/fbgemm_avx512.dir/src/UtilsAvx512.cc.o
[ 1%] Building CXX object CMakeFiles/fbgemm_avx512.dir/src/FbgemmFP16UKernelsAvx512.cc.o
[ 1%] Building CXX object CMakeFiles/fbgemm_avx512.dir/src/FbgemmFP16UKernelsAvx512_256.cc.o
[ 1%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/EmbeddingSpMDMAvx2.cc.o
[ 2%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8Depthwise3DAvx2.cc.o
[ 2%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8Depthwise3x3Avx2.cc.o
[ 2%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwiseAvx2.cc.o
[ 2%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwisePerChannelQuantAvx2.cc.o
[ 3%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/OptimizedKernelsAvx2.cc.o
[ 3%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/PackDepthwiseConvMatrixAvx2.cc.o
[ 1%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/FbgemmBfloat16ConvertAvx2.cc.o
[ 1%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/FbgemmFloat16ConvertAvx2.cc.o
[ 3%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/UtilsAvx2.cc.o
[ 4%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/FbgemmFP16UKernelsAvx2.cc.o
[ 4%] Building CXX object googletest/googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
[ 4%] Building CXX object CMakeFiles/fbgemm_avx2.dir/src/QuantUtilsAvx2.cc.o
[ 4%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/arch.cpp.o
[ 4%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/assembler.cpp.o
[ 4%] Linking C executable cpuid-dump
[ 4%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/builder.cpp.o
[ 4%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/EmbeddingSpMDM.cc.o
[ 5%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/EmbeddingSpMDMNBit.cc.o
[ 5%] Linking C static library libclog.a
[ 6%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/callconv.cpp.o
[ 6%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/ExecuteKernel.cc.o
[ 6%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/codeholder.cpp.o
[ 6%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/ExecuteKernelU8S8.cc.o
[ 6%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/compiler.cpp.o
[ 6%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/Fbgemm.cc.o
[ 6%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/constpool.cpp.o
[ 7%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/cpuinfo.cpp.o
[ 8%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/FbgemmBfloat16Convert.cc.o
[ 8%] Built target cpuid-dump
[ 8%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/FbgemmConv.cc.o
[ 8%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/emitter.cpp.o
[ 8%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/FbgemmFP16.cc.o
[ 8%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/func.cpp.o
[ 8%] Built target clog
[ 9%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/FbgemmFloat16Convert.cc.o
[ 10%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/globals.cpp.o
[ 10%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/FbgemmI64.cc.o
[ 10%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/FbgemmI8Spmdm.cc.o
Scanning dependencies of target cpuinfo_internals
[ 10%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/inst.cpp.o
[ 10%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/jitallocator.cpp.o
Scanning dependencies of target cpuinfo
[ 10%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/FbgemmSpConv.cc.o
[ 10%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/jitruntime.cpp.o
[ 11%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/logging.cpp.o
[ 11%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/init.c.o
[ 11%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/api.c.o
[ 12%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/FbgemmSpMM.cc.o
[ 13%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/init.c.o
[ 14%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/init.c.o
[ 14%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC16.cc.o
[ 14%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/api.c.o
[ 14%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/osutils.cpp.o
[ 14%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/init.c.o
[ 14%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/operand.cpp.o
[ 14%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC16Avx512.cc.o
[ 14%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/ralocal.cpp.o
[ 14%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/info.c.o
[ 14%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC16Avx512VNNI.cc.o
[ 14%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/vendor.c.o
[ 15%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC32.cc.o
[ 15%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/info.c.o
[ 15%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/uarch.c.o
[ 16%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC32Avx512.cc.o
[ 16%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/rapass.cpp.o
[ 16%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/rastack.cpp.o
[ 17%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/vendor.c.o
[ 17%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/uarch.c.o
[ 16%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/string.cpp.o
[ 18%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/support.cpp.o
[ 18%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC32Avx512VNNI.cc.o
[ 19%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/name.c.o
[ 19%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/name.c.o
[ 19%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/topology.c.o
[ 20%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/GroupwiseConvAcc32Avx2.cc.o
[ 20%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/isa.c.o
[ 21%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/PackAMatrix.cc.o
[ 21%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/cache/init.c.o
[ 22%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/topology.c.o
[ 22%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/PackAWithIm2Col.cc.o
[ 22%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/isa.c.o
[ 22%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/target.cpp.o
[ 22%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/type.cpp.o
[ 22%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/cache/init.c.o
[ 22%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/virtmem.cpp.o
[ 22%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/PackBMatrix.cc.o
[ 22%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/cache/descriptor.c.o
[ 23%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/zone.cpp.o
[ 23%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/cache/descriptor.c.o
[ 23%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/cache/deterministic.c.o
[ 22%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/linux/init.c.o
[ 23%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/zonehash.cpp.o
[ 24%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/PackMatrix.cc.o
[ 25%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/cache/deterministic.c.o
[ 26%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/PackAWithQuantRowOffset.cc.o
[ 26%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/linux/smallfile.c.o
[ 26%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/x86/linux/cpuinfo.c.o
[ 26%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/PackAWithRowOffset.cc.o
[ 26%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/PackWeightMatrixForGConv.cc.o
[ 26%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/linux/init.c.o
[ 26%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/linux/multiline.c.o
[ 26%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/zonelist.cpp.o
[ 26%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/x86/linux/cpuinfo.c.o
[ 26%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/linux/smallfile.c.o
[ 26%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/linux/current.c.o
[ 26%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/QuantUtils.cc.o
[ 26%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/zonestack.cpp.o
[ 28%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/zonetree.cpp.o
[ 29%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/RefImplementations.cc.o
[ 29%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/PackWeightsForConv.cc.o
[ 29%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/linux/processors.c.o
[ 29%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/linux/multiline.c.o
[ 30%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/SparseAdagrad.cc.o
[ 30%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/core/zonevector.cpp.o
[ 31%] Building C object cpuinfo/CMakeFiles/cpuinfo_internals.dir/src/linux/cpulist.c.o
[ 31%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/linux/current.c.o
[ 31%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/linux/cpulist.c.o
[ 31%] Building CXX object CMakeFiles/fbgemm_generic.dir/src/Utils.cc.o
[ 31%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/x86/x86assembler.cpp.o
[ 32%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/x86/x86builder.cpp.o
[ 32%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/x86/x86callconv.cpp.o
[ 33%] Building C object cpuinfo/CMakeFiles/cpuinfo.dir/src/linux/processors.c.o
[ 33%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/x86/x86compiler.cpp.o
[ 33%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/x86/x86features.cpp.o
[ 34%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/x86/x86internal.cpp.o
[ 34%] Building CXX object asmjit/CMakeFiles/asmjit.dir/src/asmjit/x86/x86instdb.cpp.o
/tmp/ccWuobBM.s: Assembler messages:
/tmp/ccWuobBM.s:52: Error: operand size mismatch for vbroadcastss' /tmp/ccWuobBM.s:54: Error: operand size mismatch for vbroadcastss'
make[2]: *** [CMakeFiles/fbgemm_avx512.dir/src/FbgemmFloat16ConvertAvx512.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....

undefined reference to `cblas_sgemm'

I followed the build instruction in Readme file and ge the following error message. Is this something I need to deal with or just an unstable mistake? How can I fix it?

[ 75%] Linking CXX executable FP16Benchmark
CMakeFiles/FP16Benchmark.dir/FP16Benchmark.cc.o: In function `double fbgemm::measureWithWarmup<performance_test(int, bool, int, bool)::{lambda()#1}, performance_test(int, bool, int, bool)::{lambda()#2}>(performance_test(int, bool, int, bool)::{lambda()#1}&&, int, int, performance_test(int, bool, int, bool)::{lambda()#2} const&, bool) [clone ._omp_fn.1]':
FP16Benchmark.cc:(.text+0x15c): undefined reference to `cblas_sgemm'
CMakeFiles/FP16Benchmark.dir/FP16Benchmark.cc.o: In function `performance_test(int, bool, int, bool)':
FP16Benchmark.cc:(.text+0x1e19): undefined reference to `cblas_sgemm'
FP16Benchmark.cc:(.text+0x20dc): undefined reference to `cblas_sgemm'
collect2: error: ld returned 1 exit status
bench/CMakeFiles/FP16Benchmark.dir/build.make:152: recipe for target 'bench/FP16Benchmark' failed
make[2]: *** [bench/FP16Benchmark] Error 1
CMakeFiles/Makefile2:2102: recipe for target 'bench/CMakeFiles/FP16Benchmark.dir/all' failed
make[1]: *** [bench/CMakeFiles/FP16Benchmark.dir/all] Error 2
Makefile:138: recipe for target 'all' failed
make: *** [all] Error 2

Conv1D via gemm is slow?

I am trying to use FBGEMM to implement a 1D convolution. I am doing via an im2col method, in which I copy convert a [batch, timesteps, input_channels] tensor into a [batch * timesteps, conv_width * input_channels] tensor and then multiply by the B-matrix of shape [conv_input * input_channels, output_channels]. I have also noticed that wav2letter takes the same approach in their source code.

I am using ReQuantizeForFloat as the output pipeline and doing the multiplication in 8-bit integer arithmetic with 32-bit accumulation.

My benchmarks show that, for large numbers of timesteps, this is significantly slower than MKL (operating in fp32). I've reduced my issue to a small benchmark that only tests the gemm with a large batch size (on the order of 500 to 5000).

Is this the expected behaviour? Can I do something differently to get good performance from FBGEMM gemms on the sizes I'm working with? (5000x256 * 256x256, for example.) Thanks.

FbgemmF16 is not always faster than "PrePacked" OpenBLAS SGEMM

Environment:

MacBook Pro (15-inch, 2017)
2.9 GHz Intel Core i7

Change to OpenBLAS:

https://github.com/xianyi/OpenBLAS/blob/v0.3.4/driver/level3/level3.c#L322 and https://github.com/xianyi/OpenBLAS/blob/v0.3.4/driver/level3/level3.c#L377 are commented out as we can pre pack it like Fbgemm

export OPENBLAS_NUM_THREADS=1
export GOTO_NUM_THREADS=1
export OMP_NUM_THREADS=1

Test results(cblas_sgemm no transpose):

   OPENBLAS_PrePacked_FP32 m =     1 n =  3072 k =  1024 Gflops =  17.4617 GBytes =  17.5072
    FBGEMM_N6fbgemm5__f16E m =     1 n =  3072 k =  1024 Gflops =  12.0319 GBytes =  12.0632

   OPENBLAS_PrePacked_FP32 m =     2 n =  3072 k =  1024 Gflops =  30.1207 GBytes =  15.1388
    FBGEMM_N6fbgemm5__f16E m =     2 n =  3072 k =  1024 Gflops =  20.4900 GBytes =  10.2984

   OPENBLAS_PrePacked_FP32 m =     3 n =  3072 k =  1024 Gflops =  29.9095 GBytes =  10.0477
    FBGEMM_N6fbgemm5__f16E m =     3 n =  3072 k =  1024 Gflops =  32.2804 GBytes =  10.8442

   OPENBLAS_PrePacked_FP32 m =     4 n =  3072 k =  1024 Gflops =  63.1276 GBytes =  15.9463
    FBGEMM_N6fbgemm5__f16E m =     4 n =  3072 k =  1024 Gflops =  43.8964 GBytes =  11.0884

   OPENBLAS_PrePacked_FP32 m =     5 n =  3072 k =  1024 Gflops =  49.5468 GBytes =  10.0384
    FBGEMM_N6fbgemm5__f16E m =     5 n =  3072 k =  1024 Gflops =  54.3773 GBytes =  11.0171

   OPENBLAS_PrePacked_FP32 m =     6 n =  3072 k =  1024 Gflops =  58.9594 GBytes =   9.9801
    FBGEMM_N6fbgemm5__f16E m =     6 n =  3072 k =  1024 Gflops =  61.0574 GBytes =  10.3352

   OPENBLAS_PrePacked_FP32 m =     7 n =  3072 k =  1024 Gflops =  47.5032 GBytes =   6.9099
    FBGEMM_N6fbgemm5__f16E m =     7 n =  3072 k =  1024 Gflops =  66.0421 GBytes =   9.6066

   OPENBLAS_PrePacked_FP32 m =     8 n =  3072 k =  1024 Gflops =  77.7143 GBytes =   9.9167
    FBGEMM_N6fbgemm5__f16E m =     8 n =  3072 k =  1024 Gflops =  69.4565 GBytes =   8.8629

   OPENBLAS_PrePacked_FP32 m =     1 n =  2688 k =   896 Gflops =  18.7028 GBytes =  18.7584
    FBGEMM_N6fbgemm5__f16E m =     1 n =  2688 k =   896 Gflops =  11.3259 GBytes =  11.3596

   OPENBLAS_PrePacked_FP32 m =     2 n =  2688 k =   896 Gflops =  33.0037 GBytes =  16.6001
    FBGEMM_N6fbgemm5__f16E m =     2 n =  2688 k =   896 Gflops =  22.9513 GBytes =  11.5439

   OPENBLAS_PrePacked_FP32 m =     3 n =  2688 k =   896 Gflops =  28.6379 GBytes =   9.6312
    FBGEMM_N6fbgemm5__f16E m =     3 n =  2688 k =   896 Gflops =  32.1949 GBytes =  10.8275

   OPENBLAS_PrePacked_FP32 m =     4 n =  2688 k =   896 Gflops =  57.9563 GBytes =  14.6616
    FBGEMM_N6fbgemm5__f16E m =     4 n =  2688 k =   896 Gflops =  42.1195 GBytes =  10.6552

   OPENBLAS_PrePacked_FP32 m =     5 n =  2688 k =   896 Gflops =  49.6843 GBytes =  10.0847
    FBGEMM_N6fbgemm5__f16E m =     5 n =  2688 k =   896 Gflops =  50.0093 GBytes =  10.1507

   OPENBLAS_PrePacked_FP32 m =     6 n =  2688 k =   896 Gflops =  58.1400 GBytes =   9.8630
    FBGEMM_N6fbgemm5__f16E m =     6 n =  2688 k =   896 Gflops =  63.4359 GBytes =  10.7614

   OPENBLAS_PrePacked_FP32 m =     7 n =  2688 k =   896 Gflops =  51.4508 GBytes =   7.5032
    FBGEMM_N6fbgemm5__f16E m =     7 n =  2688 k =   896 Gflops =  63.0543 GBytes =   9.1954

   OPENBLAS_PrePacked_FP32 m =     8 n =  2688 k =   896 Gflops =  77.9591 GBytes =   9.9769
    FBGEMM_N6fbgemm5__f16E m =     8 n =  2688 k =   896 Gflops =  71.0457 GBytes =   9.0922

error: 'posix_memalign' was not declared in this scope

../third_party/fbgemm/include/fbgemm/Fbgemm.h:1012:7: error: 'posix_memalign' was not declared in this scope
if (posix_memalign(&aligned_mem, __align, __size))

when built with MinGW.

The solution: https://stackoverflow.com/questions/10862121/undefined-reference-to-posix-memalign-using-mingw32

Add stdexcept to include/fbgemm/FbgemmPackMatrixB.h

Compiled on Fedora 32 failed. I added #include to include/fbgemm/FbgemmPackMatrixB.h to fix it.

Thanks.

Runtime query if machine has required instruction sets

This code path:

FBGEMM/src/PackMatrix.cc

Line 54 in 25ab38b

assert(0 && "unsupported architecure");

Causes failures when integrating FBGEMM into PyTorch at the moment. A possible alternative is to provide an API for the framework (PyTorch) to query so that it can select between the FBGEMM specialized version of the operator or the default version. Can we export a simple API that returns a boolean describing--at runtime--whether we are on a machine that supports the necessary AVX instructions?

Get CMake Error when trying to generate doc using sphinx.

Hello! I'm trying to generate the docs of FBGEMM and get some errors.

I modify the CMakeList.txt to turn the FBGEMM_BUILD_DOCS ON(by moditify option(FBGEMM_BUILD_DOCS "Build fbgemm documentation" OFF) to option(FBGEMM_BUILD_DOCS "Build fbgemm documentation" ON)) and do cmake ...

But it reports CMake Error like this:
CMake Error at docs/CMakeLists.txt:2 (find_package):
By not providing "FindSphinx.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "Sphinx", but
CMake did not find one.
...

I try to add the path where the Sphinx is installed like this
cmake .. -DCMAKE_PREFIX_PATH=/usr/lib/python3/dist-packages/sphinx
but I get the same error.

Is there something I did wrong?

Error: invalid register for .seh_savexmm when building with MinGW

<censored>\mingw32\bin\g++.exe   -I../cmake/../third_party/cpuinfo/include -Ithird_party/fbgemm/third_party/asmjit/src -I../third_party/fbgemm/include -I../third_party/fbgemm
 -I../third_party/protobuf/src -Wno-deprecated -fvisibility-inlines-hidden -O3 -DNDEBUG   -m64 -mavx2 -mfma -mavx512f -masm=intel -std=c++11 -MD -MT third_party/fbgemm/CMakeFiles/fbgemm_avx512.dir/src/Utils_avx512.cc.obj -MF third_party\fbgemm\CMakeFiles\fbgemm_avx512.dir\src\Utils_avx512.cc.obj.d -o third_party/fbgemm/CMakeFiles/fbgemm_avx512.dir/src/Utils_avx512.cc.obj -c ../third_party/fbgemm/src/Utils_avx512.cc
<censored>\AppData\Local\Temp\cczFskKo.s: Assembler messages:
<censored>\AppData\Local\Temp\cczFskKo.s:48: Error: invalid register for .seh_savexmm
<bunch of messages like this>

[Questions] Why doesn't PyTorch use accumulation int16_t mode

FBGEMM is faster with accumulation int16_t. Why don't I see PyTorch use this mode?

Is it possible to speed up matrix multiplication by adjusting the values of the Packing parameters under the same hardware environment?

Hi! I am reading the source code of FBGEMM and interested in the CPU optimization part. I found that FBGEMM sets Packing parameters for each ISA separately. I am curious whether the values of these parameters are determined empirically or by a certain algorithm? Is it possible to speed up matrix multiplication by adjusting the values of the Packing parameters under the same hardware environment? Is it possible to run FBGEMM on more ISA by appropriately setting the values of the Packing parameters? I will be very grateful for your help.

Error: invalid use of register on Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz

When compiling PyTorch 1.6, I get this:

FAILED: third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/QuantUtilsAvx2.cc.o 
/p/software/hdfml/stages/Devel-2020/software/GCCcore/9.3.0/bin/g++ -DFBGEMM_STATIC -DTH_BLAS_MKL -I../third_party/cpuinfo/include -I../third_party/fbgemm/third_party/asmjit/src -I../third_party/fbgemm/include -I../third_party/fbgemm -I../cmake/../third_party/benchmark/include -isystem ../cmake/../third_party/googletest/googlemock/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem /p/software/hdfml/stages/Devel-2020/software/imkl/2020.2.254/mkl/include -isystem ../third_party/gemmlowp -isystem ../third_party/neon2sse -isystem ../third_party/XNNPACK/include -O2 -ftree-vectorize -march=native -fno-math-errno -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -O3 -DNDEBUG -fPIC -fvisibility=hidden -m64 -mavx2 -mf16c -mfma -masm=intel -std=c++14 -MD -MT third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/QuantUtilsAvx2.cc.o -MF third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/QuantUtilsAvx2.cc.o.d -o third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/QuantUtilsAvx2.cc.o -c ../third_party/fbgemm/src/QuantUtilsAvx2.cc
/tmp/cc2RqTt6.s: Assembler messages:
/tmp/cc2RqTt6.s:7795: Error: invalid use of register
/tmp/cc2RqTt6.s:7798: Error: invalid use of register
/tmp/cc2RqTt6.s:7916: Error: invalid use of register
/tmp/cc2RqTt6.s:7919: Error: invalid use of register
/tmp/cc2RqTt6.s:8543: Error: invalid use of register
/tmp/cc2RqTt6.s:8658: Error: invalid use of register
/tmp/cc2RqTt6.s:9486: Error: invalid use of register
/tmp/cc2RqTt6.s:9488: Error: invalid use of register
/tmp/cc2RqTt6.s:9612: Error: invalid use of register
/tmp/cc2RqTt6.s:9614: Error: invalid use of register
/tmp/cc2RqTt6.s:10270: Error: invalid use of register
/tmp/cc2RqTt6.s:10389: Error: invalid use of register
/tmp/cc2RqTt6.s:11249: Error: invalid use of register
/tmp/cc2RqTt6.s:11251: Error: invalid use of register
/tmp/cc2RqTt6.s:11382: Error: invalid use of register
(many more)
...

This step goes through if I remove the -masm=intel

Is there anything I should be careful of when creating a pointer pointing to `PackBMatrix` and passing it as a parameter?

Hello! I am using FBGEMM in my project. I create a pointer pointing to PackBMatrix<int8_t> and pass it to a function as parameters. Segmentation fault occurs when fbgemmPacked() is called. I think may be something goes wrong when creating the pointer and passing parameters. Is there something I should be careful of when creating a pointer to PackBMatrix and using it as parameter? Following is a piece of my code. I will be very grateful for your help!

PackBMatrix<int8_t> *packedBN=nullptr;

packedBN = new PackBMatrix<int8_t>(
transpose ? matrix_op_t::Transpose : matrix_op_t::NoTranspose,
nrow, ncol, quantized, transpose ? nrow : ncol, packedbuf, 1, params);

...
void fbgemmPacked8Gemm(
...
//PackBMatrix as parameter
PackBMatrix<int8_t>* packedBN,
...
){
...
// gemm computation
fbgemmPacked(packAN, *packedBN, (float*)C.data, (int32_t*)C.data, (int32_t) n, outputProcObj, 0, 1, params);
...
}

...
fbgemmPacked8Gemm(
...
packedBN,
...
);

use git submodules

Would it be possible to use git submodules for ASMJIT?
The downloading during the (PyTorch) build can be a bit hard for me when hacking PyTorch while travelling.
Thanks!

Can fast BF16 <-> Float conversion be used at more places in PyTorch?

Currently, the vectorized BF16 <-> Float conversion in PyTorch is slower.

FBGEMM has faster conversion routines, and is being used in quantized kernels.
Can these routines for BF16 be used at more places in PyTorch?

Thank you!

[Question] Does AVX2 BlockingFactors work on a AVX512 machine for the int8 GEMM?

Assuming I pack A, B matrix with AVX2 parameters (

FBGEMM/include/fbgemm/Fbgemm.h

Line 413 in ef5c9ef

const BlockingFactors* params = nullptr);

FBGEMM/include/fbgemm/Fbgemm.h

Line 331 in ef5c9ef

const BlockingFactors* params = nullptr);

), can this packed A, B used on AVX512 machine?

I am also using AVX2 parameter for the gemm function.
(

FBGEMM/include/fbgemm/Fbgemm.h

Line 1358 in ef5c9ef

const BlockingFactors* blocking_params = nullptr);

Thanks.

[Question] 8bit integers and negative numbers

Hey,

I have been reading the code for sparse 8bit gemm: https://github.com/pytorch/FBGEMM/blob/master/test/SpMMI8Test.cc and I have a few questions.

I noticed that getRandomSparseVector will only generate positive numbers. Is this because you rely on the maddubs instruction? Does it mean that the A matrix can only contain positive numbers?

I noticed this bit in the code:

for (int i = 0; i < m * k; ++i) {
   aptr[i] &= 0x7F;
 }

You avoid large numbers to avoid saturation. Does this mean there is no handling of saturation when it happens?

Thanks,

Nick

Error when compiling for shared target

Hi, I'm looking to build a so library which depends on Wav2Letter which depends on FBGEMM. For some FFI usage. I'm receiving some error.

I tried enabling -fPIC but can't figure out how to make the linking work. Can I circumvent this?

/usr/bin/ld: 3rdparty/wav2letter/inference/inference/module/fbgemm/src/fbgemm/libfbgemm.a(FbgemmFP16.cc.o): relocation R_X86_64_32 against `.bss' can not be used when making a shared object; recompile with -fPIC
3rdparty/wav2letter/inference/inference/module/fbgemm/src/fbgemm/libfbgemm.a: error adding symbols: Bad value

Here I tried to manipulate the CMakeLists.txt of fbgemm. No change though. This is the relevant extract.

project(fbgemm VERSION 0.1 LANGUAGES CXX C)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC")
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fPIC")

set(FBGEMM_LIBRARY_TYPE "shared")

This is my cmake part

add_library(${project_name}_api
  SHARED
  src/api.cpp
  src/api.h
)

set_target_properties(fbgemm PROPERTIES POSITION_INDEPENDENT_CODE ON)
set_target_properties(streaming_inference_modules_nn_impl PROPERTIES POSITION_INDEPENDENT_CODE ON)
set_target_properties(streaming_inference_modules_nn_backend PROPERTIES POSITION_INDEPENDENT_CODE ON)
set_target_properties(${project_name} PROPERTIES POSITION_INDEPENDENT_CODE ON)
set_target_properties(${project_name}_api PROPERTIES POSITION_INDEPENDENT_CODE ON)

quantized matrix multiplication question

Could you please point me to a quantized matrix-matrix multiplication example in the test or benchmark directory ? That is to say,

C = A * B // A, B, C are single-precision floating-point matrices
C' = dequant (quant(A) * quant (B) ) // quant(A) and quant(B) are int8 matrices

Thanks

Is this available on windows?

Can this library be built on windows?

AVX512 part of code uses std libraries that may leak avx512 instructions.

In code here:

https://github.com/pytorch/FBGEMM/blob/master/src/GenerateKernelU8S8S32ACC16Avx512.cc#L294

STL functions are used (from iostream). This is also seen in avx2 part of the code here:

FBGEMM/src/FbgemmI8DepthwiseAvx2.cc

Lines 9 to 12 in 895646c

    
           #include <algorithm> // for min and max 
        
           #include <cassert> 
        
           #include <cmath> // for lrintf and sqrt 
        
           #include <tuple> // for tie

Usage of stl functions is known to possibly leak intrinsic instructions in template functions: during linkage time, the linker is free to choose the instantiation of template functions, and if the linker chooses the instantiation that is built with, say, avx512, and then the code is run on a machine that does not have avx512, segfault with "illegal instruction" happens.

We might want to explicitly ban all usage of c++ headers, and avoid the use of any template instantiations, in .cc files (and included header files) that are built with -mavx2 and -mavx512f.

Exception: Illegal in tests

After building using exactly the commands listen in the README.md (submodules installed successfully), I run make test and get this:

...
me@localhost:~/Workspace/qnn/torchQ/FBGEMM/build$ make
. . .
[100%] Built target RowwiseAdagradFusedBenchmark
me@localhost:~/Workspace/qnn/torchQ/FBGEMM/build$ make test
Running tests...
Test project /home/marvin/Workspace/qnn/torchQ/FBGEMM/build
      Start  1: Bfloat16ConvertTest
 1/19 Test  #1: Bfloat16ConvertTest ..............***Exception: Illegal  0.16 sec
      Start  2: EmbeddingSpMDM8BitTest
 2/19 Test  #2: EmbeddingSpMDM8BitTest ...........***Exception: Illegal  0.16 sec
      Start  3: EmbeddingSpMDMNBitTest
 3/19 Test  #3: EmbeddingSpMDMNBitTest ...........***Exception: Illegal  0.16 sec
      Start  4: EmbeddingSpMDMTest
 4/19 Test  #4: EmbeddingSpMDMTest ...............***Exception: Illegal  0.16 sec
      Start  5: FP16Test
 5/19 Test  #5: FP16Test .........................***Exception: Illegal  0.14 sec
      Start  6: Float16ConvertTest
 6/19 Test  #6: Float16ConvertTest ...............***Exception: Illegal  0.14 sec
      Start  7: GConvTest
 7/19 Test  #7: GConvTest ........................***Exception: Illegal  0.14 sec
      Start  8: I64Test
 8/19 Test  #8: I64Test ..........................***Exception: Illegal  0.14 sec
      Start  9: I8DepthwiseTest
 9/19 Test  #9: I8DepthwiseTest ..................***Exception: Illegal  0.15 sec
      Start 10: I8SpmdmTest
10/19 Test #10: I8SpmdmTest ......................***Exception: Illegal  0.14 sec
      Start 11: Im2ColFusedRequantizeTest
11/19 Test #11: Im2ColFusedRequantizeTest ........***Exception: Illegal  0.14 sec
      Start 12: PackedRequantizeAcc16Test
12/19 Test #12: PackedRequantizeAcc16Test ........***Exception: Illegal  0.14 sec
      Start 13: PackedRequantizeTest
13/19 Test #13: PackedRequantizeTest .............***Exception: Illegal  0.14 sec
      Start 14: QuantUtilsTest
14/19 Test #14: QuantUtilsTest ...................***Exception: Illegal  0.14 sec
      Start 15: RequantizeOnlyTest
15/19 Test #15: RequantizeOnlyTest ...............***Exception: Illegal  0.14 sec
      Start 16: RowWiseSparseAdagradFusedTest
16/19 Test #16: RowWiseSparseAdagradFusedTest ....***Exception: Illegal  0.15 sec
      Start 17: SparseAdagradTest
17/19 Test #17: SparseAdagradTest ................***Exception: Illegal  0.14 sec
      Start 18: TransposeTest
18/19 Test #18: TransposeTest ....................***Exception: Illegal  0.14 sec
      Start 19: UniConvTest
19/19 Test #19: UniConvTest ......................***Exception: Illegal  0.20 sec

0% tests passed, 19 tests failed out of 19

Total Test time (real) =   2.85 sec

The following tests FAILED:
	  1 - Bfloat16ConvertTest (ILLEGAL)
	  2 - EmbeddingSpMDM8BitTest (ILLEGAL)
	  3 - EmbeddingSpMDMNBitTest (ILLEGAL)
	  4 - EmbeddingSpMDMTest (ILLEGAL)
	  5 - FP16Test (ILLEGAL)
	  6 - Float16ConvertTest (ILLEGAL)
	  7 - GConvTest (ILLEGAL)
	  8 - I64Test (ILLEGAL)
	  9 - I8DepthwiseTest (ILLEGAL)
	 10 - I8SpmdmTest (ILLEGAL)
	 11 - Im2ColFusedRequantizeTest (ILLEGAL)
	 12 - PackedRequantizeAcc16Test (ILLEGAL)
	 13 - PackedRequantizeTest (ILLEGAL)
	 14 - QuantUtilsTest (ILLEGAL)
	 15 - RequantizeOnlyTest (ILLEGAL)
	 16 - RowWiseSparseAdagradFusedTest (ILLEGAL)
	 17 - SparseAdagradTest (ILLEGAL)
	 18 - TransposeTest (ILLEGAL)
	 19 - UniConvTest (ILLEGAL)
Errors while running CTest
Makefile:116: recipe for target 'test' failed
make: *** [test] Error 8

my Testing/Temporary/LastTest.log looks like this:

Start testing: Nov 03 18:28 CET
----------------------------------------------------------
1/19 Testing: Bfloat16ConvertTest
1/19 Test: Bfloat16ConvertTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/Bfloat16ConvertTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"Bfloat16ConvertTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from FBGemmBfloat16Test
[ RUN      ] FBGemmBfloat16Test.Conversion
[       OK ] FBGemmBfloat16Test.Conversion (0 ms)
[ RUN      ] FBGemmBfloat16Test.Conversion_simd
[       OK ] FBGemmBfloat16Test.Conversion_simd (17 ms)
[ RUN      ] FBGemmBfloat16Test.Conversion_simd2
m = 203 n = 491
<end of output>
Test time =   0.16 sec
----------------------------------------------------------
Test Failed.
"Bfloat16ConvertTest" end time: Nov 03 18:28 CET
"Bfloat16ConvertTest" time elapsed: 00:00:00
----------------------------------------------------------

2/19 Testing: EmbeddingSpMDM8BitTest
2/19 Test: EmbeddingSpMDM8BitTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/EmbeddingSpMDM8BitTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"EmbeddingSpMDM8BitTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 1536 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 1536 tests from InstantiationName/Fused8BitRowwiseEmbeddingLookupTest
[ RUN      ] InstantiationName/Fused8BitRowwiseEmbeddingLookupTest.basicTest/0
<end of output>
Test time =   0.16 sec
----------------------------------------------------------
Test Failed.
"EmbeddingSpMDM8BitTest" end time: Nov 03 18:28 CET
"EmbeddingSpMDM8BitTest" time elapsed: 00:00:00
----------------------------------------------------------

3/19 Testing: EmbeddingSpMDMNBitTest
3/19 Test: EmbeddingSpMDMNBitTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/EmbeddingSpMDMNBitTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"EmbeddingSpMDMNBitTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 3072 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3072 tests from InstantiationName/FusedNBitRowwiseEmbeddingLookupTest
[ RUN      ] InstantiationName/FusedNBitRowwiseEmbeddingLookupTest.basicTest/0
<end of output>
Test time =   0.16 sec
----------------------------------------------------------
Test Failed.
"EmbeddingSpMDMNBitTest" end time: Nov 03 18:28 CET
"EmbeddingSpMDMNBitTest" time elapsed: 00:00:00
----------------------------------------------------------

4/19 Testing: EmbeddingSpMDMTest
4/19 Test: EmbeddingSpMDMTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/EmbeddingSpMDMTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"EmbeddingSpMDMTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 3072 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3072 tests from InstantiationName/EmbeddingSpMDMTest
[ RUN      ] InstantiationName/EmbeddingSpMDMTest.basicTest/0
<end of output>
Test time =   0.16 sec
----------------------------------------------------------
Test Failed.
"EmbeddingSpMDMTest" end time: Nov 03 18:28 CET
"EmbeddingSpMDMTest" time elapsed: 00:00:00
----------------------------------------------------------

5/19 Testing: FP16Test
5/19 Test: FP16Test
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/FP16Test"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"FP16Test" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from InstantiationName/FBGemmFP16Test
[ RUN      ] InstantiationName/FBGemmFP16Test.Test/0
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"FP16Test" end time: Nov 03 18:28 CET
"FP16Test" time elapsed: 00:00:00
----------------------------------------------------------

6/19 Testing: Float16ConvertTest
6/19 Test: Float16ConvertTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/Float16ConvertTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"Float16ConvertTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 6 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 6 tests from InstantiationName/FBGemmFloat16Test
[ RUN      ] InstantiationName/FBGemmFloat16Test.Conversion/0
[       OK ] InstantiationName/FBGemmFloat16Test.Conversion/0 (0 ms)
[ RUN      ] InstantiationName/FBGemmFloat16Test.Conversion/1
[       OK ] InstantiationName/FBGemmFloat16Test.Conversion/1 (0 ms)
[ RUN      ] InstantiationName/FBGemmFloat16Test.Conversion_simd/0
[       OK ] InstantiationName/FBGemmFloat16Test.Conversion_simd/0 (0 ms)
[ RUN      ] InstantiationName/FBGemmFloat16Test.Conversion_simd/1
[       OK ] InstantiationName/FBGemmFloat16Test.Conversion_simd/1 (0 ms)
[ RUN      ] InstantiationName/FBGemmFloat16Test.Conversion_simd2/0
m = 100 n = 990
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"Float16ConvertTest" end time: Nov 03 18:28 CET
"Float16ConvertTest" time elapsed: 00:00:00
----------------------------------------------------------

7/19 Testing: GConvTest
7/19 Test: GConvTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/GConvTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"GConvTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 26 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 24 tests from InstantiationName/fbgemmGConvAcc32WithQuantGranularityTest
[ RUN      ] InstantiationName/fbgemmGConvAcc32WithQuantGranularityTest.requantizeTest/0
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"GConvTest" end time: Nov 03 18:28 CET
"GConvTest" time elapsed: 00:00:00
----------------------------------------------------------

8/19 Testing: I64Test
8/19 Test: I64Test
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/I64Test"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"I64Test" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Int64GemmTest
[ RUN      ] Int64GemmTest.test
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"I64Test" end time: Nov 03 18:28 CET
"I64Test" time elapsed: 00:00:00
----------------------------------------------------------

9/19 Testing: I8DepthwiseTest
9/19 Test: I8DepthwiseTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/I8DepthwiseTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"I8DepthwiseTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 83 tests from 3 test cases.
[----------] Global test environment set-up.
[----------] 16 tests from InstantiationName/FBGemmDepthWiseTest
[ RUN      ] InstantiationName/FBGemmDepthWiseTest.Test2D/0
<end of output>
Test time =   0.15 sec
----------------------------------------------------------
Test Failed.
"I8DepthwiseTest" end time: Nov 03 18:28 CET
"I8DepthwiseTest" time elapsed: 00:00:00
----------------------------------------------------------

10/19 Testing: I8SpmdmTest
10/19 Test: I8SpmdmTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/I8SpmdmTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"I8SpmdmTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 20 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 20 tests from Instance0/fbgemmSPMDMTest
[ RUN      ] Instance0/fbgemmSPMDMTest.TestsSpMDM/0
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"I8SpmdmTest" end time: Nov 03 18:28 CET
"I8SpmdmTest" time elapsed: 00:00:00
----------------------------------------------------------

11/19 Testing: Im2ColFusedRequantizeTest
11/19 Test: Im2ColFusedRequantizeTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/Im2ColFusedRequantizeTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"Im2ColFusedRequantizeTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 30 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 30 tests from InstantiationName/fbgemmIm2colTest
[ RUN      ] InstantiationName/fbgemmIm2colTest.Acc32Test/0
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"Im2ColFusedRequantizeTest" end time: Nov 03 18:28 CET
"Im2ColFusedRequantizeTest" time elapsed: 00:00:00
----------------------------------------------------------

12/19 Testing: PackedRequantizeAcc16Test
12/19 Test: PackedRequantizeAcc16Test
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/PackedRequantizeAcc16Test"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"PackedRequantizeAcc16Test" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 32 tests from 3 test cases.
[----------] Global test environment set-up.
[----------] 24 tests from InstantiationName/fbgemmu8s8acc16WithQuantGranularityTest
[ RUN      ] InstantiationName/fbgemmu8s8acc16WithQuantGranularityTest.Test/0
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"PackedRequantizeAcc16Test" end time: Nov 03 18:28 CET
"PackedRequantizeAcc16Test" time elapsed: 00:00:00
----------------------------------------------------------

13/19 Testing: PackedRequantizeTest
13/19 Test: PackedRequantizeTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/PackedRequantizeTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"PackedRequantizeTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 60 tests from 3 test cases.
[----------] Global test environment set-up.
[----------] 48 tests from InstantiationName/fbgemmu8s8acc32WithQuantGranularityTest
[ RUN      ] InstantiationName/fbgemmu8s8acc32WithQuantGranularityTest.Test/0
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"PackedRequantizeTest" end time: Nov 03 18:28 CET
"PackedRequantizeTest" time elapsed: 00:00:00
----------------------------------------------------------

14/19 Testing: QuantUtilsTest
14/19 Test: QuantUtilsTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/QuantUtilsTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"QuantUtilsTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 266 tests from 8 test cases.
[----------] Global test environment set-up.
[----------] 1 test from QuantizeTestSingle
[ RUN      ] QuantizeTestSingle.vectorScalar
[       OK ] QuantizeTestSingle.vectorScalar (1 ms)
[----------] 1 test from QuantizeTestSingle (1 ms total)

[----------] 1 test from QuantizeTest
[ RUN      ] QuantizeTest.cornerCases
[       OK ] QuantizeTest.cornerCases (0 ms)
[----------] 1 test from QuantizeTest (0 ms total)

[----------] 1 test from FusedQuantizeDequantizeTestSingle
[ RUN      ] FusedQuantizeDequantizeTestSingle.vectorScalar
[       OK ] FusedQuantizeDequantizeTestSingle.vectorScalar (0 ms)
[----------] 1 test from FusedQuantizeDequantizeTestSingle (0 ms total)

[----------] 1 test from FusedQuantizeDequantizeTest
[ RUN      ] FusedQuantizeDequantizeTest.cornerCases
/home/marvin/Workspace/qnn/torchQ/FBGEMM/test/QuantUtilsTest.cc:526: Failure
Value of: floatCloseAll(dst_int8, ref)
  Actual: false ( results do not match.  mismatch at (0) 
	  ref: 0 test: 1.51395e-05
	 diff: 1.51395e-05 > 1.19209e-07
 mismatch at (1) 
	  ref: 0 test: -1.52588e-05
	 diff: 1.52588e-05 > 1.19209e-07
)
Expected: true
/home/marvin/Workspace/qnn/torchQ/FBGEMM/test/QuantUtilsTest.cc:554: Failure
Value of: floatCloseAll(dst_uint8, ref2)
  Actual: false ( results do not match.  mismatch at (0) 
	  ref: 0 test: 3.03983e-05
	 diff: 3.03983e-05 > 1.19209e-07
 mismatch at (2) 
	  ref: 0 test: 3.03983e-05
	 diff: 3.03983e-05 > 1.19209e-07
 mismatch at (4) 
	  ref: 0 test: 3.03983e-05
	 diff: 3.03983e-05 > 1.19209e-07
 mismatch at (6) 
	  ref: 0 test: 3.03983e-05
	 diff: 3.03983e-05 > 1.19209e-07
 mismatch at (8) 
	  ref: 0 test: 3.03983e-05
	 diff: 3.03983e-05 > 1.19209e-07
)
Expected: true
[  FAILED  ] FusedQuantizeDequantizeTest.cornerCases (0 ms)
[----------] 1 test from FusedQuantizeDequantizeTest (0 ms total)

[----------] 144 tests from InstantiationName/QuantizeGroupwiseTest
[ RUN      ] InstantiationName/QuantizeGroupwiseTest.quantizeGTest/0
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"QuantUtilsTest" end time: Nov 03 18:28 CET
"QuantUtilsTest" time elapsed: 00:00:00
----------------------------------------------------------

15/19 Testing: RequantizeOnlyTest
15/19 Test: RequantizeOnlyTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/RequantizeOnlyTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"RequantizeOnlyTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 240 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 240 tests from InstantiationName/FloatRequantizeTest
[ RUN      ] InstantiationName/FloatRequantizeTest.floatBiasTest/0
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"RequantizeOnlyTest" end time: Nov 03 18:28 CET
"RequantizeOnlyTest" time elapsed: 00:00:00
----------------------------------------------------------

16/19 Testing: RowWiseSparseAdagradFusedTest
16/19 Test: RowWiseSparseAdagradFusedTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/RowWiseSparseAdagradFusedTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"RowWiseSparseAdagradFusedTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 384 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 384 tests from InstantiationName/RowWiseSparseAdagradFusedTest
[ RUN      ] InstantiationName/RowWiseSparseAdagradFusedTest.rowwiseTest/0
<end of output>
Test time =   0.15 sec
----------------------------------------------------------
Test Failed.
"RowWiseSparseAdagradFusedTest" end time: Nov 03 18:28 CET
"RowWiseSparseAdagradFusedTest" time elapsed: 00:00:00
----------------------------------------------------------

17/19 Testing: SparseAdagradTest
17/19 Test: SparseAdagradTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/SparseAdagradTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"SparseAdagradTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 48 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 48 tests from InstantiationName/SparseAdagradTest
[ RUN      ] InstantiationName/SparseAdagradTest.basicTest_two_stages/0
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"SparseAdagradTest" end time: Nov 03 18:28 CET
"SparseAdagradTest" time elapsed: 00:00:00
----------------------------------------------------------

18/19 Testing: TransposeTest
18/19 Test: TransposeTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/TransposeTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"TransposeTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TransposeTest
[ RUN      ] TransposeTest.TransposeTest
<end of output>
Test time =   0.14 sec
----------------------------------------------------------
Test Failed.
"TransposeTest" end time: Nov 03 18:28 CET
"TransposeTest" time elapsed: 00:00:00
----------------------------------------------------------

19/19 Testing: UniConvTest
19/19 Test: UniConvTest
Command: "/home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test/UniConvTest"
Directory: /home/marvin/Workspace/qnn/torchQ/FBGEMM/build/test
"UniConvTest" start time: Nov 03 18:28 CET
Output:
----------------------------------------------------------
Running main() from /home/marvin/Workspace/qnn/torchQ/FBGEMM/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 6961 tests from 3 test cases.
[----------] Global test environment set-up.
[----------] 1 test from uniConvTest
[ RUN      ] uniConvTest.cornerCases
<end of output>
Test time =   0.20 sec
----------------------------------------------------------
Test Failed.
"UniConvTest" end time: Nov 03 18:28 CET
"UniConvTest" time elapsed: 00:00:00
----------------------------------------------------------

End testing: Nov 03 18:28 CET

How can I alleviate this Issue?

pytorch master branch is broken

Pytorch master branch is broken due to a recent commit. Could anyone revert the update? Thanks.
https://ezyang.github.io/pytorch-ci-hud/build/pytorch-master

pytorch/pytorch@44428d0

Is it possible to generate SPMDM kernels with asmjit?

Hi all,
Thanks for sharing such a high-performance GEMM library.

After reading through source codes, I found that only U8S8S32AC* kernels are generated from asmjit.
Is it possible to port SpMDM codes to asmjit? I'm tring to optimzie SpMDM by myself.

Thanks!
Yang

Undefined reference to `fbgemm::BCSRMatrix<signed char, 1, 4>::COLTILE'

Depending on the compiler and sometimes the build type, I have undefined references to static constexpr member variable fbgemm::BCSRMatrix<signed char, 1, 4>::COLTILE.

Here is the error with apple-clang 12 on Macos with fbgemm shared and Debug:

[62/62] Linking CXX shared library lib/libfbgemm.dylib
FAILED: lib/libfbgemm.dylib
: && /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -m64 -stdlib=libc++ -g -arch x86_64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.sdk -dynamiclib -Wl,-headerpad_max_install_names -m64 -o lib/libfbgemm.dylib -install_name libfbgemm.dylib source_subfolder/CMakeFiles/fbgemm_generic.dir/src/EmbeddingSpMDM.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/EmbeddingSpMDMNBit.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/ExecuteKernel.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/ExecuteKernelU8S8.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/Fbgemm.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/FbgemmBfloat16Convert.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/FbgemmConv.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/FbgemmFPCommon.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/FbgemmFP16.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/FbgemmFloat16Convert.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/FbgemmI64.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/FbgemmSparseDense.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/FbgemmI8Spmdm.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/GenerateKernel.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC16.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC16Avx512.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC16Avx512VNNI.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC32.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/GenerateKernelU8S8S32ACC32Avx512VNNI.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/GroupwiseConv.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/GroupwiseConvAcc32Avx2.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/GroupwiseConvAcc32Avx512.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/PackAMatrix.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/PackAWithIm2Col.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/PackAWithQuantRowOffset.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/PackAWithRowOffset.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/PackBMatrix.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/PackMatrix.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/PackWeightMatrixForGConv.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/PackWeightsForConv.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/QuantUtils.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/RowWiseSparseAdagradFused.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/SparseAdagrad.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/spmmUtils.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/TransposeUtils.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/GenerateI8Depthwise.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/RefImplementations.cc.o source_subfolder/CMakeFiles/fbgemm_generic.dir/src/Utils.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/EmbeddingSpMDMAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/FbgemmBfloat16ConvertAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/FbgemmFloat16ConvertAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8Depthwise3DAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwiseAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwisePerChannelQuantAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/FbgemmSparseDenseAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/FbgemmSparseDenseInt8Avx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/OptimizedKernelsAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/PackDepthwiseConvMatrixAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/QuantUtilsAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/spmmUtilsAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/UtilsAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx2.dir/src/FbgemmFP16UKernelsAvx2.cc.o source_subfolder/CMakeFiles/fbgemm_avx512.dir/src/FbgemmBfloat16ConvertAvx512.cc.o source_subfolder/CMakeFiles/fbgemm_avx512.dir/src/FbgemmFloat16ConvertAvx512.cc.o source_subfolder/CMakeFiles/fbgemm_avx512.dir/src/FbgemmSparseDenseAvx512.cc.o source_subfolder/CMakeFiles/fbgemm_avx512.dir/src/FbgemmSparseDenseInt8Avx512.cc.o source_subfolder/CMakeFiles/fbgemm_avx512.dir/src/FbgemmSparseDenseVectorInt8Avx512.cc.o source_subfolder/CMakeFiles/fbgemm_avx512.dir/src/QuantUtilsAvx512.cc.o source_subfolder/CMakeFiles/fbgemm_avx512.dir/src/UtilsAvx512.cc.o source_subfolder/CMakeFiles/fbgemm_avx512.dir/src/FbgemmFP16UKernelsAvx512.cc.o source_subfolder/CMakeFiles/fbgemm_avx512.dir/src/FbgemmFP16UKernelsAvx512_256.cc.o  /Users/spaceim/.conan/data/asmjit/cci.20210306/_/_/package/ba203d82ae0020eccba7236c3748eb8f79fceaf6/lib/libasmjit.a  /Users/spaceim/.conan/data/cpuinfo/cci.20201217/_/_/package/9487036bb519a194183bc395da97637cb9f00de3/lib/libcpuinfo.a  /Users/spaceim/.conan/data/cpuinfo/cci.20201217/_/_/package/9487036bb519a194183bc395da97637cb9f00de3/lib/libclog.a && :
Undefined symbols for architecture x86_64:
  "fbgemm::BCSRMatrix<signed char, 1, 4>::COLTILE", referenced from:
      fbgemm::BCSRMatrix<signed char, 1, 4>::pack(signed char const*, unsigned long) in FbgemmSparseDense.cc.o
ld: symbol(s) not found for architecture x86_64

I see also this kind of error with clang on Linux with libstdc++ (didn't try with libc++ on Linux). Only with Debug build if clang >= 6, and also for Release build if clang < 6.

Per Channel Linear Quantization Gap with FBGEMM between Skylake and Cascade lake X86_64 CPU

Description:
We use per channel quantization for linear weights and per tensor affine quantization for activation in Pytorch based on the FBGEMM int8 kernel. But when we evaluate the modules, we get the different result in Skylake and Cascade lake X86_64 CPU.

The Mean Squared Error (MSE) is used to measure the gap between the output of fp32 linear and quantized linear. In order to validate this issue, we fetch the input(8x128x768), weights(768x768) and bias from CoLA task in Bert model and use these data in a UnitTest. the results are as following:

MSE Cascade lake : 259.8453
MSE Sky lake: 195320.5625

The above results show that Sky lake get obviously larger gap than Cascade lake. if you want to reproduce the above result you can get the code from https://github.com/liangan1/pytorch/tree/prv_bert and use the UT test/test_per_channel_linear.py

Environment:
gcc version 7.2.1 20170829 (Red Hat 7.2.1-1) (GCC)
CentOS Linux release 7.4.1708 (Core)
conda 4.6.11

Test failures: RowWiseSparseAdagradFusedTest, SparseAdagradTest

When testing my FBGEMM installation I noticed the following test failures:

$ make -j12 test
...
The following tests FAILED:
	 16 - RowWiseSparseAdagradFusedTest (Failed)
	 17 - SparseAdagradTest (Failed)
Errors while running CTest
Output from these tests are in: /tmp/t-astewart/spack-stage/spack-stage-fbgemm-master-kqngcydkqhc6umwgpjq2xs55rawby5es/spack-build-kqngcyd/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
Makefile:138: recipe for target 'test' failed
make: *** [test] Error 8

I'm not passing any special customizations to the build. Here is the build log:

spack-build-out.txt

Unfortunately, LastTest.log is too large to upload to GitHub. This occurs with the master version of FBGEMM built with GCC 7.3.0 on Ubuntu 18.04.

@dskhudia

Is it possible to build with already-installed asmjit/cpuinfo?

I'm writing a Spack package for fbgemm (as a pytorch dependency). Is it possible to use already-installed copies of asmjit and cpuinfo? I try to avoid relying on git submodules, especially when using a package manager.

Also, I noticed that there aren't any stable releases of fbgemm. Is this planned someday?

P.S. Would anyone like to be listed as the official maintainer for the fbgemm package? You don't need to know much about Spack, it just gives us someone to ping when a user reports build issues or wants to modify the build recipe.

Any chance that FBGEMM will support TBB?

As titile. : )

inference fail when build 32bit lib on window10

Hi @shz0116 @peterjc123 @ykim362

I have built 32bit lib on window10, but inference quantized model fail.

First, i fix errors when building, as follow:
(1)
error:
fbgemm\src\OptimizedKernelsAvx2.cc(126): error C3861: “_mm256_extract_epi64”: undefined
fix:

--- a/src/OptimizedKernelsAvx2.cc
+++ b/src/OptimizedKernelsAvx2.cc
@@ -122,6 +122,7 @@ void transpose_8rows(
     _mm_storel_epi64(
         reinterpret_cast<__m128i*>(dst + (j + 0) * ld_dst),
         _mm256_castsi256_si128(y0));
+#if defined(_M_X64)
     *reinterpret_cast<int64_t*>(dst + (j + 1) * ld_dst) =
         _mm256_extract_epi64(y0, 1);
     _mm_storel_epi64(
@@ -191,6 +192,77 @@ void transpose_8rows(
         _mm256_extract_epi64(y7, 2);
     *reinterpret_cast<int64_t*>(dst + (j + 31) * ld_dst) =
         _mm256_extract_epi64(y7, 3);
+#else
+    *reinterpret_cast<int32_t*>(dst + (j + 1) * ld_dst) =
+        _mm256_extract_epi32(y0, 1);
+    _mm_storel_epi64(
+        reinterpret_cast<__m128i*>(dst + (j + 2) * ld_dst),
+        _mm256_castsi256_si128(y1));
+    *reinterpret_cast<int32_t*>(dst + (j + 3) * ld_dst) =
+        _mm256_extract_epi32(y1, 1);
+    _mm_storel_epi64(
+        reinterpret_cast<__m128i*>(dst + (j + 4) * ld_dst),
+        _mm256_castsi256_si128(y2));
+    *reinterpret_cast<int32_t*>(dst + (j + 5) * ld_dst) =
+        _mm256_extract_epi32(y2, 1);
+    _mm_storel_epi64(
+        reinterpret_cast<__m128i*>(dst + (j + 6) * ld_dst),
+        _mm256_castsi256_si128(y3));
+    *reinterpret_cast<int32_t*>(dst + (j + 7) * ld_dst) =
+        _mm256_extract_epi32(y3, 1);
+    _mm_storel_epi64(
+        reinterpret_cast<__m128i*>(dst + (j + 8) * ld_dst),
+        _mm256_castsi256_si128(y4));
+    *reinterpret_cast<int32_t*>(dst + (j + 9) * ld_dst) =
+        _mm256_extract_epi32(y4, 1);
+    _mm_storel_epi64(
+        reinterpret_cast<__m128i*>(dst + (j + 10) * ld_dst),
+        _mm256_castsi256_si128(y5));
+    *reinterpret_cast<int32_t*>(dst + (j + 11) * ld_dst) =
+        _mm256_extract_epi32(y5, 1);
+    _mm_storel_epi64(
+        reinterpret_cast<__m128i*>(dst + (j + 12) * ld_dst),
+        _mm256_castsi256_si128(y6));
+    *reinterpret_cast<int32_t*>(dst + (j + 13) * ld_dst) =
+        _mm256_extract_epi32(y6, 1);
+    _mm_storel_epi64(
+        reinterpret_cast<__m128i*>(dst + (j + 14) * ld_dst),
+        _mm256_castsi256_si128(y7));
+    *reinterpret_cast<int32_t*>(dst + (j + 15) * ld_dst) =
+        _mm256_extract_epi32(y7, 1);
+    *reinterpret_cast<int32_t*>(dst + (j + 16) * ld_dst) =
+        _mm256_extract_epi32(y0, 2);
+    *reinterpret_cast<int32_t*>(dst + (j + 17) * ld_dst) =
+        _mm256_extract_epi32(y0, 3);
+    *reinterpret_cast<int32_t*>(dst + (j + 18) * ld_dst) =
+        _mm256_extract_epi32(y1, 2);
+    *reinterpret_cast<int32_t*>(dst + (j + 19) * ld_dst) =
+        _mm256_extract_epi32(y1, 3);
+    *reinterpret_cast<int32_t*>(dst + (j + 20) * ld_dst) =
+        _mm256_extract_epi32(y2, 2);
+    *reinterpret_cast<int32_t*>(dst + (j + 21) * ld_dst) =
+        _mm256_extract_epi32(y2, 3);
+    *reinterpret_cast<int32_t*>(dst + (j + 22) * ld_dst) =
+        _mm256_extract_epi32(y3, 2);
+    *reinterpret_cast<int32_t*>(dst + (j + 23) * ld_dst) =
+        _mm256_extract_epi32(y3, 3);
+    *reinterpret_cast<int32_t*>(dst + (j + 24) * ld_dst) =
+        _mm256_extract_epi32(y4, 2);
+    *reinterpret_cast<int32_t*>(dst + (j + 25) * ld_dst) =
+        _mm256_extract_epi32(y4, 3);
+    *reinterpret_cast<int32_t*>(dst + (j + 26) * ld_dst) =
+        _mm256_extract_epi32(y5, 2);
+    *reinterpret_cast<int32_t*>(dst + (j + 27) * ld_dst) =
+        _mm256_extract_epi32(y5, 3);
+    *reinterpret_cast<int32_t*>(dst + (j + 28) * ld_dst) =
+        _mm256_extract_epi32(y6, 2);
+    *reinterpret_cast<int32_t*>(dst + (j + 29) * ld_dst) =
+        _mm256_extract_epi32(y6, 3);
+    *reinterpret_cast<int32_t*>(dst + (j + 30) * ld_dst) =
+        _mm256_extract_epi32(y7, 2);
+    *reinterpret_cast<int32_t*>(dst + (j + 31) * ld_dst) =
+        _mm256_extract_epi32(y7, 3);
+#endif

(2)
error:
src\QuantUtilsAvx2.cc(270): error C3861: “_mm256_extract_epi64”: undefined
fix:

--- a/src/QuantUtilsAvx2.cc
+++ b/src/QuantUtilsAvx2.cc
@@ -267,9 +268,14 @@ void RequantizeFixedPointAvx2(

       clipped_v = _mm256_shuffle_epi8(clipped_v, shuffle_mask_v);
       clipped_v = _mm256_permutevar8x32_epi32(clipped_v, permute_mask_v);
+      std::cout << "Enter Quantile" << std::endl;
+#if defined(_M_X64)
       *(int64_t*)(dst + i) = _mm256_extract_epi64(clipped_v, 0);
+#else
+      *(int32_t*)(dst + i) = _mm256_extract_epi32(clipped_v, 0);
+#endif
     }
-
+#if defined(_M_X64)
     for (; i < len; ++i) {
       int64_t ab_64 =
         static_cast<int64_t>(src[i]) * static_cast<int64_t>(params.multiplier);
@@ -278,6 +284,16 @@ void RequantizeFixedPointAvx2(
         ((ab_64 + nudge) >> params.right_shift);
       dst[i] = std::min<int64_t>(std::max<int64_t>(quantized_down, 0l), 255l);
     }
+#else
+    for (; i < len; ++i) {
+      int32_t ab_64 = static_cast<int32_t>(src[i]) *
+          static_cast<int32_t>(params.multiplier);
+      int32_t nudge = 1ll << std::max(0, params.right_shift - 1);
+      int32_t quantized_down = params.target_qparams.zero_point +
+          ((ab_64 + nudge) >> params.right_shift);
+      dst[i] = std::min<int32_t>(std::max<int32_t>(quantized_down, 0l), 255l);
+    }
+#endif
   }
 }

Second, to solve 64bit/32bit compatibility， what can i do? Please give me some suggestion!!
Thanks very much!

Pytorch version: v1.3.1
fbgemm version: master branch

Poor error message for Conv when parameters are incompatible

Message:

RuntimeError: [FBGEMM_CONV_ERROR] Prepacked weights can't be used with these convolution parameters!

In particular, this does not tell me anything about why it cannot be used with these parameters, which parameters are incompatible, or what I should do to fix it

Cache_evict busy loop has performance impact on benchmarking

Clflush is speculative and it is not ordered with respect to prefetch. The underlying behavior is architecture specific. In the SLS case, the clflush, even not timed, impacts the performance of SLS kernel. To minimize the impact, it is suggested to flush the touched rows only (patch here), instead of flushing the entire embedding table. That can give a performance number closer to the real case.

Here is the data observed for 4-bit SLS:

CLflush a dummy embedding table (with same size, instead of the real table) won’t help the performance number.

Case A: Read EMB, flush EMB (original)

bit_rate     4batch size    10  num rows         4000000   emb dim   128      avg length   100
32 bit indices with prefetching, lengths_sum 898
            SLS    cache not flushed     prefetch on     b/w   7.33648 GB/s     effective b/w:          11.9275GB/s   time       9.6369e-06
            SLS        cache flushed     prefetch on     b/w   1.72568 GB/s     effective b/w:          3.24835GB/s   time      3.53854e-05

Case B: Read EMB, flush dummy EMB

bit_rate     4batch size    10  num rows         4000000   emb dim   128      avg length   100
32 bit indices with prefetching, lengths_sum 898
            SLS    cache not flushed     prefetch on     b/w     7.212 GB/s     effective b/w:          13.5755GB/s   time        8.467e-06
            SLS        cache flushed     prefetch on     b/w   1.88487 GB/s     effective b/w:          3.54799GB/s   time      3.23969e-05

We can avoid the core from cache embedding table by always reading the first row in SLS kernel. In that case, clflushing of embedding table makes no sense and we expect similar performance between flush and non-flush version (of embedding-table only). However, the data below shows the non-flush version has much better number.

Case C: Read EMB-0 (read the first row always, no prefetch), Flush EMB

bit_rate     4batch size    10  num rows         4000000   emb dim   128      avg length   100
32 bit indices without prefetching, lengths_sum 898
            SLS    cache not flushed    prefetch off     b/w   7.95871 GB/s     effective b/w:          14.9811GB/s   time       7.6726e-06
            SLS        cache flushed    prefetch off     b/w   1.90915 GB/s     effective b/w:           3.5937GB/s   time      3.19849e-05

Case D: Read EMB-0 (read the first row always, no prefetch), don’t flush EMB

bit_rate     4batch size    10  num rows         4000000   emb dim   128      avg length   100
32 bit indices without prefetching, lengths_sum 898
            SLS    cache not flushed    prefetch off     b/w   7.72982 GB/s     effective b/w:          14.5502GB/s   time       7.8998e-06
            SLS        cache flushed    prefetch off     b/w   6.42448 GB/s     effective b/w:          12.0931GB/s   time       9.5049e-06

Only flushing of the touched line (patch here) in embedding table give 1.8x higher number than Case A.

Case E: Read EMB, flush only touched rows of EMB

bit_rate     4batch size    10  num rows         4000000   emb dim   128      avg length   100
32 bit indices with prefetching, lengths_sum 898
            SLS    cache not flushed     prefetch on     b/w   7.50532 GB/s     effective b/w:          14.1277GB/s   time       8.1361e-06
            SLS        cache flushed     prefetch on     b/w   3.28172 GB/s     effective b/w:          6.17736GB/s   time      1.86073e-05

Cannot build: can't find `aligned_alloc`

I try to build on my mac and get this:

In file included from /Users/jamesreed/FBGEMM/src/FbgemmFP16.cc:7:
/Users/jamesreed/FBGEMM/include/fbgemm/FbgemmFP16.h:67:24: error: use of undeclared identifier 'aligned_alloc'
    pmat_ = (float16 *)aligned_alloc(64, matSize() * sizeof(float16) + padding);

There are some workarounds described here https://stackoverflow.com/questions/29247065/compiler-cant-find-aligned-alloc-function but I haven't gotten them to work with LLVM on mac

Building Cython bindings and a Python package

Thanks for open-sourcing this! I've spent a tonne of time trying to get a decent gemm solution for spaCy over the last two years. I won't even list the number of things I've tried and yaks I've shaved in this quest. The current plan is to package Blis, but the API and support for low precision maths is really appealing here. FBGEMM does more of what I need and none of the things I don't, so I'm interested in seeing whether we could adopt it.

The holy grail for me is a Python package that ships the source code of a C extension, along with Cython bindings. The Python package's setup.py compiles the extension, and statically links the C++ dependency into it. At the same time, binary wheel files should be available that do dynamic architecture detection.

I understand it's probably a long road to get this to work, and might not be viable. I do believe it's worth trying, though. If we can just get a decent GEMM package for Python, with C-level bindings exposed, the ML community will benefit enormously. The current GEMM problems are really bad. Currently, it's super common for people to end up with numpy installations that call into Prescott-targeted kernels, cratering CPU performance. I really believe this problem is costing labs everywhere a fortune in compute costs, because folks are moving workloads to GPU that they could execute much more cheaply on CPU if they only had their systems properly configured.

Some questions:

Does FBGEMM support dynamic architecture detection? Like, can I compile a binary on a server with avx-512 instructions, and run it on another server that only has avx2?
What's the prognosis for future Windows support look like? I think requiring LLVM is reasonable, to avoid the weirdness of MSVC. I think this should mostly make it about the build system and threading library? Note that currently, support for Windows Subsystem for Linux is impossible, as WSL doesn't expose avx2 instructions: microsoft/WSL#2234 .

Convolution too slow for depthwise when input channels != output channels

This is the root cause of https://discuss.pytorch.org/t/got-slow-speed-on-quantized-model-with-fbgemm-on-x86/74439

fbgemmConv takes the im2col route for such cases and it's too slow. Here are the results for a shape I benchmarked.

"N", "IC", "OC", "H", "W", "G", "kernel", "stride", "pad"
1, 128, 256, 32, 100, 128, 3, 1, 1

FP32 time: 3.55 ms
fbgemmConv time: 86.76ms

cc @jspark1105

Standalone sparse-dense matrix multiplication benchmark

Hi, I am wondering if FBGEMM supports standalone sparse matrix dense matrix multiplication using the unrolling approach to get register blocking as mentioned in the new release notes. It seems like the test involves the operation fused with another matrix multiplication. I am wondering if an API similar to say MKL's SpMM exists for FBGEMM. Thank you!

PyTorch build fails on Windows with FBGEMM

Error text:

[99/3428] C:\Users\circleci\project\build\win_tmp\bin\sccache-cl.exe   /TP -DTH_BLAS_MKL -D_OPENMP_NOFORCE_MANIFEST -I..\cmake\..\third_party\cpuinfo\include -I..\third_party\fbgemm\third_party\asmjit\src -I..\third_party\fbgemm\include -I..\third_party\fbgemm -I..\cmake\..\third_party\benchmark\include -I..\cmake\..\third_party\googletest\googlemock\include -I..\cmake\..\third_party\googletest\googletest\include -I..\third_party\protobuf\src -Iwin_tmp\mkl\include /DWIN32 /D_WINDOWS /GR  /w /EHa /bigobj -openmp:experimental /wd4244 /wd4267 /wd4305 /wd4309 /MD /O2 /Ob2 /DNDEBUG /w /EHa /bigobj   -std:c++14 /showIncludes /Fothird_party\fbgemm\CMakeFiles\fbgemm_generic.dir\src\FbgemmFP16.cc.obj /Fdthird_party\fbgemm\CMakeFiles\fbgemm_generic.dir\ /FS -c ..\third_party\fbgemm\src\FbgemmFP16.cc
FAILED: third_party/fbgemm/CMakeFiles/fbgemm_generic.dir/src/FbgemmFP16.cc.obj 
C:\Users\circleci\project\build\win_tmp\bin\sccache-cl.exe   /TP -DTH_BLAS_MKL -D_OPENMP_NOFORCE_MANIFEST -I..\cmake\..\third_party\cpuinfo\include -I..\third_party\fbgemm\third_party\asmjit\src -I..\third_party\fbgemm\include -I..\third_party\fbgemm -I..\cmake\..\third_party\benchmark\include -I..\cmake\..\third_party\googletest\googlemock\include -I..\cmake\..\third_party\googletest\googletest\include -I..\third_party\protobuf\src -Iwin_tmp\mkl\include /DWIN32 /D_WINDOWS /GR  /w /EHa /bigobj -openmp:experimental /wd4244 /wd4267 /wd4305 /wd4309 /MD /O2 /Ob2 /DNDEBUG /w /EHa /bigobj   -std:c++14 /showIncludes /Fothird_party\fbgemm\CMakeFiles\fbgemm_generic.dir\src\FbgemmFP16.cc.obj /Fdthird_party\fbgemm\CMakeFiles\fbgemm_generic.dir\ /FS -c ..\third_party\fbgemm\src\FbgemmFP16.cc
..\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C2039: 'runtime_error': is not a member of 'std'
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.26.28801\include\vector(24): note: see declaration of 'std'
..\third_party\fbgemm\include\fbgemm/FbgemmFP16.h(100): error C3861: 'runtime_error': identifier not found
Microsoft (R) C/C++ Optimizing Compiler Version 19.26.28805 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

Can there be a `FBGEMM_SPECIALIZED_QUANTIZE_AVX512` as well?

Hello,

FBGEMM_SPECIALIZED_QUANTIZE_AVX2 exists in src/QuantUtils.cc.
Does FBGEMM_SPECIALIZED_QUANTIZE_AVX512 not currently exist because AVX512 didn't perform as well, when tested?

Thank you!

[ Question ] Leveraging Int16 Accumulation Type for convolution

Hi ,

I use pytorch vision to run INT8 optimized resnet on AVX2 machine. The code path uses "optimized_conv_t::im2col" .
It looks accumulation type for PackWeightsForConv is set to INT32 . Kindly let me know if there is a way to set Accumulation type to INT16 . My understanding is that INT16 accumulation can leverage ( MR*NR ) = 12 * 16 registers while that gets limited to 6 * 16 registers with INT32 per the packing traits of FBGEMM.

Is there a way to accumulate in INT16 ( INT8 weights and INT8 inputs ) for AVX2 path and requantize the accumulated
outputs as INT32.

Can I change the default accumulation to INT16 by modifying "include/fbgemm/Fbgemm.h" and why INT32 is set as default.

Thanks,
Arun

PackBMatrix is too slow

I'm currently considering applying FBGEMM in a use case where the B matrix must be packed online (i.e. the weight matrix is dynamically computed). In my prototype, I've observed that I hit a bottleneck in PackBMatrix and when I investigate with perf, I see that most of the slowdown occurs because of multiple integer divisions in the inner loop.

I explored several optimization techniques and was able to get a 10x speedup on that routine. In summary I did:

Loop invariant hoisting (move block offset calculation out of inner loop)
Strength reduction (change modulus in inner loop to an increment + rollover)

This is the diff of PackBMatrix<>::pack()

     T* out = BaseType::getBuf() +
         g * this->packedBufferSize(block.row_size, block.col_size);
     for (int i = block.row_start; i < block.row_start + block.row_size; ++i) {
+      auto r = i;
+      int32_t block_row_id = r / BaseType::blockRowSize();
+      int32_t brow_offset = (block_row_id * BaseType::blockCols()) *
+          (BaseType::blockRowSize() * BaseType::blockColSize());
+
+      int32_t inblock_offset_row_cpt = (r % BaseType::blockRowSize() / row_interleave_) *
+              BaseType::blockColSize() * row_interleave_ + r % row_interleave_;
+
+      int32_t block_col_id = block.col_start / BaseType::blockColSize();
+      int32_t block_col_offs = block.col_start % BaseType::blockColSize();
       for (int j = block.col_start; j < block.col_start + block.col_size; ++j) {
         T val = tr ? smat_[i + (g * block.col_size + j) * ld_]
                    : smat_[(g * block.row_size + i) * ld_ + j];
-        out[addr(i, j)] = tconv(val, out[addr(i, j)]);
+        int32_t bcol_offset =
+            block_col_id * BaseType::blockRowSize() * BaseType::blockColSize();
+        int32_t block_offset = brow_offset + bcol_offset;
+        int32_t inblock_offset = inblock_offset_row_cpt + (block_col_offs) * row_interleave_;
+        int32_t index = block_offset + inblock_offset;
+        out[index] = tconv(val, out[index]);
+
+        block_col_offs++;
+        if (block_col_offs == BaseType::blockColSize()) {
+          block_col_offs = 0;
+          block_col_id++;
+        }
       }
     }
     // fill the remaining with zero.

This works for my purposes in an offline prototype, but I am not sure if the addr() API on PackMatrix is going to be used by others. This change moves the implementation of that API to be intertwined with the loop structure in pack(), so I'm not quite sure how to proceed here

Build Error On Windows for Shared Library

I ran into the following linking errors when building the SHARED library on Windows, also on github windows-latest. It could not find the PackMatrix function. But it looks well defined in PackMatrix.cc files. @peterjc123

PackAMatrix.cc.obj : error LNK2019: unresolved external symbol "public: __cdecl fbgemm::PackMatrix<class fbgemm::PackAMatrix<unsigned char,int>,unsigned char,int>::PackMatrix<class fbgemm::PackAMatrix<unsigned char,int>,unsigned char,int>(int,int,unsigned char *,int,struct fbgemm::BlockingFactors const *)" (??0?$PackMatrix@V?$PackAMatrix@EH@fbgemm@@eh@fbgemm@@qeaa@HHPEAEHPEBUBlockingFactors@1@@z) referenced in function "public: __cdecl fbgemm::PackAMatrix<unsigned char,int>::PackAMatrix<unsigned char,int>(enum fbgemm::matrix_op_t,int,int,unsigned char const *,int,unsigned char *,int,struct fbgemm::BlockingFactors const *)" (??0?$PackAMatrix@EH@fbgemm@@qeaa@W4matrix_op_t@1@HHPEBEHPEAEHPEBUBlockingFactors@1@@z)
PackAMatrix.cc.obj : error LNK2019: unresolved external symbol "public: __cdecl fbgemm::PackMatrix<class fbgemm::PackAMatrix<unsigned char,short>,unsigned char,short>::PackMatrix<class fbgemm::PackAMatrix<unsigned char,short>,unsigned char,short>(int,int,unsigned char *,int,struct fbgemm::BlockingFactors const *)" (??0?$PackMatrix@V?$PackAMatrix@EF@fbgemm@@Ef@fbgemm@@qeaa@HHPEAEHPEBUBlockingFactors@1@@z) referenced in function "public: __cdecl fbgemm::PackAMatrix<unsigned char,short>::PackAMatrix<unsigned char,short>(enum fbgemm::matrix_op_t,int,int,unsigned char const *,int,unsigned char *,int,struct fbgemm::BlockingFactors const *)" (??0?$PackAMatrix@EF@fbgemm@@qeaa@W4matrix_op_t@1@HHPEBEHPEAEHPEBUBlockingFactors@1@@z)
PackAWithIm2Col.cc.obj : error LNK2019: unresolved external symbol "public: __cdecl fbgemm::PackMatrix<class fbgemm::PackAWithIm2Col<unsigned char,int,2>,unsigned char,int>::PackMatrix<class fbgemm::PackAWithIm2Col<unsigned char,int,2>,unsigned char,int>(int,int,unsigned char *,int,struct fbgemm::BlockingFactors const *)" (??0?$PackMatrix@V?$PackAWithIm2Col@EH$01@fbgemm@@eh@fbgemm@@qeaa@HHPEAEHPEBUBlockingFactors@1@@z) referenced in function "public: __cdecl fbgemm::PackAWithIm2Col<unsigned char,int,2>::PackAWithIm2Col<unsigned char,int,2>(struct fbgemm::conv_param_t<2> const &,unsigned char const *,unsigned char *,int,int *,bool,struct fbgemm::BlockingFactors const *)" (??0?$PackAWithIm2Col@EH$01@fbgemm@@qeaa@AEBU?$conv_param_t@$01@1@PEBEPEAEHPEAH_NPEBUBlockingFactors@1@@z)
PackAWithIm2Col.cc.obj : error LNK2019: unresolved external symbol "public: __cdecl fbgemm::PackMatrix<class fbgemm::PackAWithIm2Col<unsigned char,short,2>,unsigned char,short>::PackMatrix<class fbgemm::PackAWithIm2Col<unsigned char,short,2>,unsigned char,short>(int,int,unsigned char *,int,struct fbgemm::BlockingFactors const *)" (??0?$PackMatrix@V?$PackAWithIm2Col@EF$01@fbgemm@@Ef@fbgemm@@qeaa@HHPEAEHPEBUBlockingFactors@1@@z) referenced in function "public: __cdecl fbgemm::PackAWithIm2Col<unsigned char,short,2>::PackAWithIm2Col<unsigned char,short,2>(struct fbgemm::conv_param_t<2> const &,unsigned char const *,unsigned char *,int,int *,bool,struct fbgemm::BlockingFactors const *)" (??0?$PackAWithIm2Col@EF$01@fbgemm@@qeaa@AEBU?$conv_param_t@$01@1@PEBEPEAEHPEAH_NPEBUBlockingFactors@1@@z)

Can't compile with MinGW

Compilation with MinGW fails in Utils.cc compilation unit:

[ 67%] Building CXX object source_subfolder/CMakeFiles/fbgemm_generic.dir/src/Utils.cc.obj
C:\Users\appveyor\.conan\data\fbgemm\cci.20210316\SpaceIm\testing\build\141d94d85a537d2efd429b0ec9e349c3da14d829\source_subfolder\src\Utils.cc: In function 'void* fbgemm::fbgemmAlignedAlloc(size_t, size_t, bool)':
C:\Users\appveyor\.conan\data\fbgemm\cci.20210316\SpaceIm\testing\build\141d94d85a537d2efd429b0ec9e349c3da14d829\source_subfolder\src\Utils.cc:399:49: error: 'posix_memalign' was not declared in this scope
   ret = posix_memalign(&aligned_mem, align, size);
                                                 ^

This can be easiliy fixed with:

--- a/bench/AlignedVec.h
+++ b/bench/AlignedVec.h
@@ -100,7 +100,7 @@ class aligned_allocator {
     // Mallocator wraps malloc().
     void* pv = nullptr;
     int ret;
-#ifdef _MSC_VER
+#ifdef _WIN32
     pv = _aligned_malloc(n * sizeof(T), Alignment);
     ret = 0;
 #else
@@ -118,7 +118,7 @@ class aligned_allocator {
   }
 
   void deallocate(T* const p, const std::size_t /*n*/) const {
-#ifdef _MSC_VER
+#ifdef _WIN32
     _aligned_free(p);
 #else
     free(p);
--- a/src/Utils.cc
+++ b/src/Utils.cc
@@ -392,7 +392,7 @@ void* fbgemmAlignedAlloc(
     bool raiseException /*=false*/) {
   void* aligned_mem = nullptr;
   int ret;
-#ifdef _MSC_VER
+#ifdef _WIN32
   aligned_mem = _aligned_malloc(size, align);
   ret = 0;
 #else
@@ -406,7 +406,7 @@ void* fbgemmAlignedAlloc(
 }
 
 void fbgemmAlignedFree(void* p) {
-#ifdef _MSC_VER
+#ifdef _WIN32
   _aligned_free(p);
 #else
   free(p);

With this fix it compiles, but I have runtime errors which I was able to fix with:

--- a/src/FbgemmFP16UKernelsIntrinsicAvx2.cc
+++ b/src/FbgemmFP16UKernelsIntrinsicAvx2.cc
@@ -5,7 +5,7 @@
  * LICENSE file in the root directory of this source tree.
  */
 
-#ifdef _MSC_VER
+#ifdef _WIN32
 #include <immintrin.h>
 #include "./FbgemmFP16UKernelsAvx2.h"
 
@@ -112,4 +112,4 @@ void NOINLINE gemmkernel_6x2_Avx2_fp16_fA0fB0fC0(GemmParamsFP16* gp) {
 }
 
 } // namespace fbgemm
-#endif // _MSC_VER
+#endif // _WIN32
--- a/src/FbgemmFP16UKernelsIntrinsicAvx512.cc
+++ b/src/FbgemmFP16UKernelsIntrinsicAvx512.cc
@@ -5,7 +5,7 @@
  * LICENSE file in the root directory of this source tree.
  */
 
-#ifdef _MSC_VER
+#ifdef _WIN32
 #include <immintrin.h>
 #include "./FbgemmFP16UKernelsAvx512.h"
 
@@ -137,4 +137,4 @@ void NOINLINE gemmkernel_14x2_Avx512_fp16_fA0fB0fC0(GemmParamsFP16* gp) {
 }
 
 } // namespace fbgemm
-#endif // _MSC_VER
+#endif // _WIN32
--- a/src/FbgemmFP16UKernelsIntrinsicAvx512_256.cc
+++ b/src/FbgemmFP16UKernelsIntrinsicAvx512_256.cc
@@ -5,7 +5,7 @@
  * LICENSE file in the root directory of this source tree.
  */
 
-#ifdef _MSC_VER
+#ifdef _WIN32
 #include <immintrin.h>
 #include "./FbgemmFP16UKernelsAvx512_256.h"
 
@@ -118,4 +118,4 @@ void NOINLINE gemmkernel_14x2_Avx512_256_fp16_fA0fB0fC0(GemmParamsFP16* gp) {
 }
 
 } // namespace fbgemm
-#endif // _MSC_VER
+#endif // _WIN32
--- a/src/FbgemmI8Spmdm.cc
+++ b/src/FbgemmI8Spmdm.cc
@@ -75,7 +75,7 @@ void CompressedSparseColumn::SpMDM(
 // resnet/resnext so we are keeping arrays with dynamic size for gcc/clang and
 // dynamically allocated memory for MSVC even though dynamically allocated
 // memory works for all compilers.
-#ifdef _MSC_VER
+#ifdef _WIN32
   uint8_t* A_buffer =
       static_cast<uint8_t*>(fbgemmAlignedAlloc(64, K * 32 * sizeof(uint8_t)));
   int32_t* C_buffer =
@@ -94,7 +94,7 @@ void CompressedSparseColumn::SpMDM(
     // The cost of transpose is O(K*N) and we do O(NNZ*N) multiplications.
     // If NNZ/K is small, it's not worth doing transpose so we just use this
     // scalar loop.
-#ifdef _MSC_VER
+#ifdef _WIN32
     int32_t* C_temp = static_cast<int32_t*>(
         fbgemmAlignedAlloc(64, block.row_size * sizeof(int32_t)));
 #else
@@ -158,7 +158,7 @@ void CompressedSparseColumn::SpMDM(
         }
       } // for each column of B
     }
-#ifdef _MSC_VER
+#ifdef _WIN32
     fbgemmAlignedFree(A_buffer);
     fbgemmAlignedFree(C_buffer);
     fbgemmAlignedFree(C_temp);
@@ -179,7 +179,7 @@ void CompressedSparseColumn::SpMDM(
   for (int i1 = block.row_start; i1 < i_end; i1 += 32) {
     // Transpose 32 x K submatrix of A
     if (i_end - i1 < 32) {
-#ifdef _MSC_VER
+#ifdef _WIN32
       uint8_t* A_temp_buffer = static_cast<uint8_t*>(
           fbgemmAlignedAlloc(64, K * 32 * sizeof(uint8_t)));
 #else
@@ -200,7 +200,7 @@ void CompressedSparseColumn::SpMDM(
       for (int i2 = (i_end - i1) / 8 * 8; i2 < 32; i2 += 8) {
         transpose_8rows(K, A_temp_buffer + i2 * K, K, A_buffer + i2, 32);
       }
-#ifdef _MSC_VER
+#ifdef _WIN32
       fbgemmAlignedFree(A_temp_buffer);
 #endif
     } else {
@@ -280,7 +280,7 @@ void CompressedSparseColumn::SpMDM(
   spmdm_run_time += (dt);
   t_start = std::chrono::high_resolution_clock::now();
 #endif
-#ifdef _MSC_VER
+#ifdef _WIN32
   fbgemmAlignedFree(A_buffer);
   fbgemmAlignedFree(C_buffer);
 #endif
--- a/src/QuantUtilsAvx2.cc
+++ b/src/QuantUtilsAvx2.cc
@@ -27,7 +27,7 @@ void QuantizeAvx2(
     T* dst,
     int len,
     const TensorQuantizationParams& qparams) {
-#if defined(__AVX2__) && (defined(__FMA__) || defined(_MSC_VER))
+#if defined(__AVX2__) && (defined(__FMA__) || defined(_WIN32))
   constexpr int VLEN = 8;
   constexpr int32_t min_val = std::numeric_limits<T>::min();
   constexpr int32_t max_val = std::numeric_limits<T>::max();
@@ -162,7 +162,7 @@ void NO_SANITIZE("address") FusedQuantizeDequantizeAvx2(
   float inverse_scale = 1.f / qparams.scale;
   constexpr int32_t min_val = std::numeric_limits<T>::min();
   constexpr int32_t max_val = std::numeric_limits<T>::max();
-#if defined(__AVX2__) && (defined(__FMA__) || defined(_MSC_VER))
+#if defined(__AVX2__) && (defined(__FMA__) || defined(_WIN32))
 
   constexpr int VLEN = 8;
   // This is the largest int32 value less than int32_max

[Question] using FBGEMM (int8 GEMM) in a multi-threaded application

Hi,
I have a question about the usage of FBGEMM in a multi-threaded program.
In my test environment, I have 24 cores CPU and 20 independent threads calling FBGEMM (int8 GEMM). All the threads are doing independent jobs, so there's no parallelism for one fbgemm call. https://github.com/marian-nmt/marian-dev/blob/master/src/tensors/cpu/fbgemm/packed_gemm.cpp#L614

In a heavy load situation, I'd expect higher than 2000% cpu utilization, but the cpu utilization is always less than 250% (seems only one thread is using fbgemm). Have you had this kind of problem before?
(FYI, when I use fp16 fbgemm, I don't have this issue. The cpu utilization is higher than 2000%.)
https://github.com/marian-nmt/marian-dev/blob/master/src/tensors/cpu/fbgemm/packed_gemm.cpp#L516

Thank you in advance!

	#include <algorithm> // for min and max
	#include <cassert>
	#include <cmath> // for lrintf and sqrt
	#include <tuple> // for tie

pytorch / fbgemm Goto Github PK

fbgemm's Introduction

FBGEMM

What's New?

Citation

Join the FBGEMM community

License

fbgemm's People

Contributors

Stargazers

Watchers

Forkers

fbgemm's Issues

🚀 Feature

Motivation

Pitch

Recommend Projects

Recommend Topics

Recommend Org