
pyvkfft's Introduction

pyvkfft - python interface to the CUDA and OpenCL backends of VkFFT (Vulkan Fast Fourier Transform library)

VkFFT is a GPU-accelerated Fast Fourier Transform library for Vulkan/CUDA/HIP/OpenCL.

pyvkfft offers a simple python interface to the CUDA and OpenCL backends of VkFFT, compatible with pyCUDA, CuPy and pyOpenCL.

The documentation can be found at https://pyvkfft.readthedocs.io

Installation

Install using pip install pyvkfft (works on macOS, Linux and Windows). See below for an installation using conda-forge, or for an installation from source.

Notes:

  • the PyPI package includes the VkFFT headers and will automatically install pyopencl if opencl is available. However, you need to manually install either cupy or pycuda to use the cuda backend.
  • if you want to specify the backend to be installed (which can be necessary e.g. if you have nvcc installed but cuda is not actually available), you can do that using e.g. VKFFT_BACKEND=opencl pip install pyvkfft. By default the opencl backend is always installed, and the cuda one if nvcc is found.
  • If you need to support more than 8 dimensions for the transforms, you can use e.g. VKFFT_MAX_FFT_DIMENSIONS=10 pip install pyvkfft.

Requirements:

  • pyopencl and the opencl libraries/development tools for the opencl backend
  • pycuda or cupy and CUDA development tools (nvcc, nvrtc library) for the cuda backend
  • numpy
  • on Windows, this requires Visual Studio (C++ tools) and a CUDA toolkit installation, with either the CUDA_PATH or CUDA_HOME environment variable set. However, it should be simpler to install using conda, as detailed below
  • Optional:
    • scipy and pyfftw for more accurate tests (and to test DCT/DST)

This package can be installed from source by running pip install . in the source directory.

Note: python setup.py install is now disabled, to avoid broken environments where both installation methods have been used.

Installation using conda

You can use conda (or much faster mamba) to install pre-compiled binaries with CUDA and OpenCL support on linux-x86_64, linux-aarch64, linux-ppc64le, win-amd64, macos-x86_64, macos-arm64 platforms.

conda config --add channels conda-forge
conda install pyvkfft

Note regarding CUDA support: there are multiple package versions of pyvkfft available, with either only OpenCL support, or compiled using the cuda nvrtc library versions 11.2, 11.8 or 12.x. If you want cuda support, you can install pyvkfft while using the cuda-version meta-package to select a specific cuda version. For example:

# Only install pyvkfft, select cuda nvrtc 11.2
conda install pyvkfft cuda-version=11.2

# Install pyvkfft, pyopencl, cupy with nvrtc version 12
conda install pyvkfft pyopencl cupy cuda-version=12

The only constraint is that the CUDA driver must be more recent than the requested cuda nvrtc version (type conda info or mamba info to see conda's detected __cuda variable).

See more information in conda-forge's documentation

Once installed, you can use the pyvkfft-info script to see which languages, backends (pyopencl, pycuda, cupy) and GPU devices are available.

Installation from source (git)

git clone --recursive https://github.com/vincefn/pyvkfft.git
cd pyvkfft
pip install .

As indicated above, you can use the environment variables VKFFT_BACKEND and VKFFT_MAX_FFT_DIMENSIONS during pip install to select the backend or the maximum number of transformed dimensions.

Examples

The simplest way to use pyvkfft is to use the pyvkfft.fft interface, which will automatically create the VkFFTApp (the FFT plans) according to the type of GPU arrays (pycuda, pyopencl or cupy), and also cache these apps:

import pycuda.autoprimaryctx
import pycuda.gpuarray as cua
from pyvkfft.fft import fftn
import numpy as np

d0 = cua.to_gpu(np.random.uniform(0,1,(200,200)).astype(np.complex64))
# This will compute the fft to a new GPU array
d1 = fftn(d0)

# An in-place transform can also be done by specifying the destination
d0 = fftn(d0, d0)

# Or an out-of-place transform to an existing array (the destination array is always returned)
d1 = fftn(d0, d1)

See the scripts and notebooks in the examples directory. An example notebook is also available on Google Colab; make sure to select a GPU for the runtime.

Features

  • CUDA (using PyCUDA or CuPy) and OpenCL (using PyOpenCL) backends
  • complex (C2C) transforms
  • R2C/C2R, now fully supporting odd sizes for the fast axis with inplace transforms
  • Discrete Cosine or Sine Transforms (DCT/DST) of type 1, 2, 3 and 4
  • out-of-place or inplace
  • single and double precision for all transforms (double precision requires device support)
  • Allows up to 8 FFT dimensions by default (can be increased by using VKFFT_MAX_FFT_DIMENSIONS when installing).
  • arrays can have more dimensions than the FFT (batch transforms).
  • Options are available to tune (manually or automatically) the performance for specific GPUs.
  • arbitrary array sizes, using the Bluestein algorithm for prime factors > 13 (in this case the performance can be significantly lower, up to ~4x, depending on the transform size; see the example performance plot below). Rader's FFT algorithm is now also used for primes from 17 up to the maximum shared memory length (~10000, see VkFFT's documentation for details)
  • transform along a given list of axes, e.g. using a 4-dimensional array and supplying axes=(-3,-1). For R2C transforms, the fast axis must be transformed.
  • normalisation: norm=0 (no normalisation, so iFFT(FFT(x)) returns x multiplied by the array size) and norm=1 (the backward transform divides by the array size, so iFFT(FFT(x)) restores the original array)
  • Support for C (default) and F-ordered arrays, for C2C and R2C transforms
  • unit tests for all transforms: see test sub-directory. Note that these take a long time to finish due to the extensive number of sub-tests.
  • Note that out-of-place C2R transform currently destroys the complex array for FFT dimensions >=2
  • tested on macOS (10.13.6/x86, 12.6/M1), Linux (Debian/Ubuntu, x86-64 and power9), and Windows 10 (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2)
  • GPUs tested: mostly nVidia cards, but also some AMD cards and macOS with M1 GPUs.
  • inplace transforms do not require an extra buffer or work area (unlike cuFFT), unless the x size is larger than 8192, or the y and z FFT sizes are larger than 2048; in that case a buffer of the same size as the array is necessary. The lower memory requirements make larger FFTs possible than with cuFFT (even for R2C!). For example, you can compute the 3D FFT of a 1600**3 complex64 array (about 30.5 GiB) with 32GB of memory.
  • transforms can either be done by creating a VkFFTApp (a.k.a. the fft 'plan'), with the selected backend (pyvkfft.cuda for pycuda/cupy or pyvkfft.opencl for pyopencl) or by using the pyvkfft.fft interface with the fftn, ifftn, rfftn and irfftn functions which automatically detect the type of GPU array and cache the corresponding VkFFTApp (see the example notebook pyvkfft-fft.ipynb).
  • the pyvkfft-test command-line script can test specific transforms against expected accuracy values, for all types of transforms.
  • pyvkfft results are evaluated before any release with a comprehensive test suite, comparing transform results for all types of transforms: single and double precision, 1D, 2D and 3D, inplace and out-of-place, different norms, radix and Bluestein, etc. The pyvkfft-test-suite script can be used to run the full suite, which takes more than two days on an A40 GPU using up to 16 parallel processes, with about 1.5 million unit tests.
[Figure: test results for pyvkfft 2024.1]
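The two normalisation conventions (norm=0 and norm=1) can be illustrated without a GPU. The following minimal sketch uses a naive O(N²) pure-Python DFT, which is only an illustration and not how VkFFT computes transforms; `dft` and `idft` are hypothetical helper names, and the `norm` argument merely mimics the meaning of pyvkfft's norm=0 and norm=1 as described in the list above.

```python
import cmath


def dft(x):
    """Naive forward DFT, no scaling: X[k] = sum_j x[j] * exp(-2i*pi*k*j/N)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]


def idft(X, norm=1):
    """Naive backward DFT.

    norm=0: no scaling, so idft(dft(x)) == len(x) * x
    norm=1: divide by the array size, so idft(dft(x)) == x
    """
    n = len(X)
    y = [sum(X[k] * cmath.exp(2j * cmath.pi * k * j / n) for k in range(n))
         for j in range(n)]
    if norm == 1:
        y = [v / n for v in y]
    return y


x = [1.0, 2.0, 3.0, 4.0]
print([round(v.real, 6) for v in idft(dft(x), norm=0)])  # [4.0, 8.0, 12.0, 16.0]
print([round(v.real, 6) for v in idft(dft(x), norm=1)])  # [1.0, 2.0, 3.0, 4.0]
```

With norm=0 the caller is responsible for dividing by the array size after the round trip, which is the usual choice when the scale factor can be folded into a later step.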

Performance

See the benchmark notebook, which can plot OpenCL and CUDA backend throughput, and compare it with cuFFT (using scikit-cuda) and clFFT (using gpyfft).

The pyvkfft-benchmark script is available to run simple or systematic tests, and can also compare with cuFFT and clFFT.

Example results for batched 2D, single precision FFT with array dimensions of batch x N x N using a V100:

https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_V100-Linux.png

Notes regarding this plot:

  • the computed throughput is theoretical, as if each transform axis for the pair (FFT, iFFT) required exactly one read and one write. This is obviously not true, and explains the drop after N=1024 for cuFFT and (to a lesser extent) vkFFT.
  • the batch size is adapted for each N so that the transform takes long enough; in practice the transformed array is around 600MB. Transforms on small arrays with small batch sizes may yield lower performance, or better performance when fully cached.
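The throughput convention described in these notes can be written down explicitly. The sketch below assumes that each transformed axis costs exactly one full read and one full write of the array, for both the FFT and the iFFT; the function name and the choice of GiB/s as the unit are assumptions for illustration, and the benchmark notebook or pyvkfft-benchmark script may use different constants.

```python
def theoretical_throughput_gibps(shape, itemsize, ndim_fft, elapsed_s):
    """Theoretical throughput (GiB/s) for one FFT + iFFT pair (hypothetical helper).

    Assumes each of the `ndim_fft` transformed axes requires exactly one read
    and one write of the whole array, for both the FFT and the iFFT.
    """
    nbytes = itemsize
    for n in shape:
        nbytes *= n
    # 2 transforms (FFT, iFFT) x ndim_fft axes x 2 accesses (read + write)
    total_bytes = nbytes * 2 * ndim_fft * 2
    return total_bytes / elapsed_s / 1024**3


# Batched 2D complex64 transform: batch=16 x 1024 x 1024, 8 bytes per element,
# with the FFT + iFFT pair taking 10 ms
print(theoretical_throughput_gibps((16, 1024, 1024), 8, 2, 0.01))  # 100.0
```

Because real transforms need more than one pass per axis for large N, the measured time grows while this nominal byte count does not, which is why the plotted throughput drops for N > 1024.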

The general results are:

  • vkFFT throughput is similar to cuFFT up to N=1024. For N>1024 vkFFT is much more efficient than cuFFT due to the smaller number of reads and writes per FFT axis (apart from isolated radix-2 3 sizes)
  • the OpenCL and CUDA backends of vkFFT perform similarly, though there are ranges where CUDA performs better, due to different caching. [Note that if the card is also used for display, the difference can increase, e.g. for nVidia cards OpenCL performance is more affected by display use than the CUDA backend]
  • clFFT (via gpyfft) generally performs much worse than the other transforms, though this was tested using nVidia cards. (Note that the clFFT/gpyfft benchmark tries all FFT axis permutations to find the fastest combination)

Another example on an A40 card (only with radix<=13 transforms):

https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_A40-Linux-radix13.png

On this card cuFFT is significantly better, even if the radix 11 and 13 transforms supported by vkFFT give globally better results.

Performance tuning

Starting with VkFFT 1.3.0 and pyvkfft 2023.2, it is possible to tweak low-level parameters including coalesced memory or warp size, batch grouping, number of threads, etc...

Optimising those is difficult, so only do it out of curiosity or when trying to get some extra performance. Generally, VkFFT defaults work quite well. Using the simple FFT API, you can activate auto-tuning by passing tuning=True to the transform functions (fftn, rfftn, etc.). Only do this for iterative processes which really require fine-tuning!

Here is an example of the benchmark run on a V100 GPU, tuning the coalescedMemory parameter (default value=32):

https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-V100-cuda-2D-coalmem.png

As you can see the optimal value varies with the 2D array size: below n=1536, using coalescedMemory=64 gives the best results, 32 (default) is best between 1536 and 2048, and above that there is little difference between the values chosen.

The same test on an A40 shows little difference. On an Apple M1 pro, it is the aimThreads parameter which is better tuned from 128 (default) to 64 to yield up to 50% faster transforms. YMMV !

Accuracy

See the accuracy notebook, which compares the accuracy of different FFT libraries (pyvkfft with different options and backends, scikit-cuda (cuFFT), pyfftw), using pyfftw long-double precision as a reference.

Example results for 1D transforms (radix 2,3,5 and 7) using a Titan V:

https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/accuracy-1DFFT-TITAN_V.png

Analysis:

  • in single precision on the nVidia Titan V card, the VkFFT error is about 3 times larger (worse) than pyfftw's (also computed in single precision), e.g. 6e-7 vs 2e-7, which should be negligible for most applications. However, when using a lookup table (LUT) for trigonometric values instead of hardware functions (useLUT=1 in VkFFTApp), the accuracy is identical to pyfftw, and better than cuFFT.
  • accuracy is the same for cuda and opencl, though this can depend on the card and drivers used (e.g. it's different on a GTX 1080)
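The accuracy figures above are relative errors measured against a more precise reference. The sketch below computes two such metrics, a relative L2 error and a relative maximum (L∞) error; the exact definitions used by the accuracy notebook and pyvkfft-test may differ, so treat both the formulas and the function name `relative_errors` as assumptions.

```python
import math


def relative_errors(result, reference):
    """Return (relative L2 error, relative max error) of `result` vs `reference`.

    Both inputs are equal-length sequences of (possibly complex) numbers.
    This is a hypothetical helper, not a pyvkfft function.
    """
    if len(result) != len(reference):
        raise ValueError("result and reference must have the same length")
    diff = [a - b for a, b in zip(result, reference)]
    # ||result - reference||_2 / ||reference||_2
    n2 = math.sqrt(sum(abs(d) ** 2 for d in diff)) / math.sqrt(
        sum(abs(r) ** 2 for r in reference))
    # max|result - reference| / max|reference|
    ninf = max(abs(d) for d in diff) / max(abs(r) for r in reference)
    return n2, ninf


ref = [1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j]
res = [1 + 1e-7j, 2 + 0j, 3 + 0j, 4 - 1e-7j]
print(relative_errors(res, ref))
```

In practice `reference` would come from a higher-precision computation (the notebook uses pyfftw in long double), so that the reference's own rounding error is negligible.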
You can easily test a transform using the pyvkfft-test command line script, e.g.:
pyvkfft-test --systematic --backend pycuda --nproc 8 --range 2 4500 --radix --ndim 2

Use pyvkfft-test --help to list available options.

You can use the pyvkfft-test-suite script to run the comprehensive test suite which is used to evaluate pyvkfft before a new release. Several options are available to target specific (C2C, R2C..) transforms or even run a random subset of transform sizes for fast detection of issues.

TODO

  • access to the other backends:
    • for vulkan and rocm this only makes sense combined with a pycuda/cupy/pyopencl equivalent.
  • out-of-place C2R transform without modifying the C array ? This would require using an R array padded with two columns, as for the inplace transform
  • half precision ?
  • on-the-fly convolution ?
  • zero-padding ?
  • access to the code of the generated kernels ?

Authors & acknowledgements

  • Vincent Favre-Nicolin (@vincefn, ESRF-The European Synchrotron) - main pyvkfft author
  • Dmitrii Tolmachev, @DTolm - VkFFT author
  • Fernando Isuru (@isuruf) - conda package and other contributions

pyvkfft's People

Contributors

alexfikl, isuruf, vincefn

pyvkfft's Issues

error

The error is "The prime numbers of the FFT size is larger than 13". How can this be solved?
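The message refers to FFT sizes with a prime factor larger than 13, beyond the radix kernels (the Features list states that such sizes are handled with the Bluestein or Rader algorithms, at a performance cost). When zero-padding is acceptable for the application, one possible workaround is to pad each transformed axis to the next size whose prime factors are all ≤ 13. The helpers below are a hypothetical sketch, not part of pyvkfft; scipy's `scipy.fft.next_fast_len` does something comparable, but for a different set of small primes.

```python
def largest_prime_factor(n):
    """Return the largest prime factor of n (returns 1 for n == 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    largest = 1
    p = 2
    while p * p <= n:
        # Divide out every factor p, remembering the largest one seen
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    # Whatever remains (> 1) is a prime larger than all factors found so far
    return n if n > 1 else largest


def next_radix13_size(n):
    """Smallest size >= n whose prime factors are all <= 13 (hypothetical helper)."""
    m = max(n, 1)
    while largest_prime_factor(m) > 13:
        m += 1
    return m


print(largest_prime_factor(34))  # 17, so 34 is not a pure radix<=13 size
print(next_radix13_size(34))     # 35 = 5 * 7
```

Padding changes the transform itself (finer frequency sampling), so it is only a valid workaround when the application tolerates zero-padded data.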

Question regarding VkFFTApp and pyvkfft.fft (OpenCL backend)

As a sort of learning exercise I'm implementing Richardson-Lucy deconvolution using pyVKFFT for the FFTs that are involved for each iteration.
Originally, I was using VkFFTApp to setup the R2C FFT:
r2c_out = VkFFTApp(psf_c.shape, psf_c.dtype, queue=cq, ndim=3, r2c=True, inplace=False)
Then I use fft = r2c_out.fft(decon, fft) to calculate the R2C (and r2c_out.ifft for the inverse)

After some time, the pyvkfft.fft module was implemented so I started to play with that, switching to just importing from pyvkfft.fft import rfftn, irfftn and using those instead of the App et al.
So now I use fft = rfftn(decon) for the R2C (and irfftn for the inverse).

Note: in both cases, I do some initial FFTs outside of my for loop (this is where I set up VkFFTApp, for example), so I assume the Apps are created and cached before the FFTs and iFFTs run in the loop. Also, for this simple exercise everything has the same shape (3D, float32).

Anyhow, I've managed to get everything working, after some PBCAK snafus.
Both are considerably faster than clFFT-based or reikna-based python deconvolution as well!
However, I do notice that 100 iterations using the manual VkFFTApp run some 20% faster than with the rfftn/irfftn option.

Is this expected or perhaps because I'm not passing all the extra elements that are in VkFFTApp when calling rfftn?

Additionally, I'm a bit confused about in-place vs out-of-place for the R2C case.
Looking at the example notebook: https://github.com/vincefn/pyvkfft/blob/master/examples/pyvkfft-fft.ipynb
The in-place example is:

print("R2C transform, inplace")
dr = cla.to_device(cq, ascent().astype(np.float32))
sh = (dr.shape[0], dr.shape[1]//2+1)
dc = cla.empty(cq, sh, dtype=np.complex64)
do_rfft_and_plot(dr, dc)

and the function says:

def do_rfft_and_plot(dr, dc=None):
    # if dc is None, the transform is out-of-place and the destination
    # array is allocated on-the-fly

So if in both cases a dc array is created, what's the difference/advantage of one vs the other? Certainly the out-of-place option is easier to use...

Thanks for any advice and for your efforts in making this library!
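For reference, the array shapes involved in an R2C transform along the last (fast) axis can be computed as below. This sketch assumes the usual half-spectrum layout: the complex output has n//2+1 elements along the fast axis, and for an in-place transform the real array must be padded to 2*(n//2+1) elements along that axis (n+2 for even n, consistent with the "30+2" and "34+2" in-place shapes in the test log below). The function name `r2c_shapes` is hypothetical; check the pyvkfft documentation for the exact requirements.

```python
def r2c_shapes(shape):
    """Return (complex_shape, inplace_real_shape) for an R2C transform of a
    real array with logical shape `shape`, transformed along the last axis.

    complex_shape: half-spectrum shape, n//2 + 1 along the fast axis.
    inplace_real_shape: padded real buffer able to hold that spectrum in place,
    2 * (n//2 + 1) along the fast axis (n + 2 for even n, n + 1 for odd n).
    """
    *batch, n = shape
    nc = n // 2 + 1
    return tuple(batch) + (nc,), tuple(batch) + (2 * nc,)


print(r2c_shapes((512, 512)))  # ((512, 257), (512, 514))
print(r2c_shapes((2, 30)))     # ((2, 16), (2, 32))
```

In the in-place case the complex result overwrites the padded real buffer, so no second array is allocated; the out-of-place case keeps the real input intact but needs a separate complex array of the half-spectrum shape.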

All R2C tests with ndim=2 fail [numpy], because the fast axis is not always the last one

pyopencl  R2C         (30+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   0   FFT: n2=1.09e-07 ninf=1.18e-07 < 2.74e-06 (0.043) 0 iFFT: n2=1.06e-07 ninf=1.28e-07 < 2.74e-06 (0.047) 0   OK
pyopencl  R2C         (30+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   1   FFT: n2=1.09e-07 ninf=1.18e-07 < 2.74e-06 (0.043) 0 iFFT: n2=9.92e-08 ninf=1.36e-07 < 2.74e-06 (0.049) 0   OK
pyopencl  R2C          (30,) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   0   FFT: n2=1.09e-07 ninf=1.18e-07 < 2.74e-06 (0.043) 1 iFFT: n2=1.06e-07 ninf=1.28e-07 < 2.74e-06 (0.047) 1   OK
pyopencl  R2C          (30,) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   1   FFT: n2=1.09e-07 ninf=1.18e-07 < 2.74e-06 (0.043) 1 iFFT: n2=9.92e-08 ninf=1.36e-07 < 2.74e-06 (0.049) 1   OK
pyopencl  R2C         (30+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   0   FFT: n2=8.12e-08 ninf=8.93e-08 < 2.74e-06 (0.033) 0 iFFT: n2=7.62e-08 ninf=7.70e-08 < 2.74e-06 (0.028) 0   OK
pyopencl  R2C         (30+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   1   FFT: n2=8.12e-08 ninf=8.93e-08 < 2.74e-06 (0.033) 0 iFFT: n2=1.31e-07 ninf=1.35e-07 < 2.74e-06 (0.049) 0   OK
pyopencl  R2C          (30,) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   0   FFT: n2=8.12e-08 ninf=8.93e-08 < 2.74e-06 (0.033) 1 iFFT: n2=7.62e-08 ninf=7.70e-08 < 2.74e-06 (0.028) 1   OK
pyopencl  R2C          (30,) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   1   FFT: n2=8.12e-08 ninf=8.93e-08 < 2.74e-06 (0.033) 1 iFFT: n2=1.31e-07 ninf=1.35e-07 < 2.74e-06 (0.049) 1   OK
pyopencl  R2C         (30+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   0   FFT: n2=2.93e-16 ninf=3.16e-16 < 5.74e-15 (0.055) 0 iFFT: n2=3.70e-16 ninf=4.57e-16 < 5.74e-15 (0.080) 0   OK
pyopencl  R2C         (30+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   1   FFT: n2=2.93e-16 ninf=3.16e-16 < 5.74e-15 (0.055) 0 iFFT: n2=2.13e-16 ninf=3.43e-16 < 5.74e-15 (0.060) 0   OK
pyopencl  R2C          (30,) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   0   FFT: n2=2.62e-16 ninf=3.16e-16 < 5.74e-15 (0.055) 1 iFFT: n2=2.61e-16 ninf=3.43e-16 < 5.74e-15 (0.060) 1   OK
pyopencl  R2C          (30,) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   1   FFT: n2=2.62e-16 ninf=3.16e-16 < 5.74e-15 (0.055) 1 iFFT: n2=2.63e-16 ninf=3.43e-16 < 5.74e-15 (0.060) 1   OK
pyopencl  R2C       (2,30+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   0   FFT: n2=1.37e-07 ninf=1.27e-07 < 2.74e-06 (0.047) 0 iFFT: n2=1.38e-07 ninf=2.24e-07 < 2.74e-06 (0.082) 0   OK
pyopencl  R2C       (2,30+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   1   FFT: n2=1.37e-07 ninf=1.27e-07 < 2.74e-06 (0.047) 0 iFFT: n2=1.22e-07 ninf=1.96e-07 < 2.74e-06 (0.071) 0   OK
pyopencl  R2C         (2,30) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   0   FFT: n2=1.37e-07 ninf=1.27e-07 < 2.74e-06 (0.047) 1 iFFT: n2=1.38e-07 ninf=2.24e-07 < 2.74e-06 (0.082) 1   OK
pyopencl  R2C         (2,30) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   1   FFT: n2=1.37e-07 ninf=1.27e-07 < 2.74e-06 (0.047) 1 iFFT: n2=1.22e-07 ninf=1.96e-07 < 2.74e-06 (0.071) 1   OK
pyopencl  R2C       (2,30+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   0   FFT: n2=8.20e-08 ninf=6.13e-08 < 2.74e-06 (0.022) 0 iFFT: n2=1.01e-07 ninf=1.30e-07 < 2.74e-06 (0.047) 0   OK
pyopencl  R2C       (2,30+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   1   FFT: n2=8.20e-08 ninf=6.13e-08 < 2.74e-06 (0.022) 0 iFFT: n2=1.32e-07 ninf=2.33e-07 < 2.74e-06 (0.085) 0   OK
pyopencl  R2C         (2,30) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   0   FFT: n2=8.20e-08 ninf=6.13e-08 < 2.74e-06 (0.022) 1 iFFT: n2=1.01e-07 ninf=1.30e-07 < 2.74e-06 (0.047) 1   OK
pyopencl  R2C         (2,30) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   1   FFT: n2=8.20e-08 ninf=6.13e-08 < 2.74e-06 (0.022) 1 iFFT: n2=1.32e-07 ninf=2.33e-07 < 2.74e-06 (0.085) 1   OK
pyopencl  R2C       (2,30+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   0   FFT: n2=2.25e-16 ninf=3.30e-16 < 5.74e-15 (0.057) 0 iFFT: n2=5.44e-16 ninf=7.78e-16 < 5.74e-15 (0.136) 0   OK
pyopencl  R2C       (2,30+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   1   FFT: n2=2.25e-16 ninf=3.30e-16 < 5.74e-15 (0.057) 0 iFFT: n2=2.97e-16 ninf=5.56e-16 < 5.74e-15 (0.097) 0   OK
pyopencl  R2C         (2,30) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   0   FFT: n2=2.30e-16 ninf=3.76e-16 < 5.74e-15 (0.066) 1 iFFT: n2=2.62e-16 ninf=4.45e-16 < 5.74e-15 (0.077) 1   OK
pyopencl  R2C         (2,30) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   1   FFT: n2=2.30e-16 ninf=3.76e-16 < 5.74e-15 (0.066) 1 iFFT: n2=2.76e-16 ninf=3.47e-16 < 5.74e-15 (0.061) 1   OK
pyopencl  R2C        (30,30) axes=      None ndim=   2    float32 lut=None inplace=0  norm=   0   FFT: n2=1.79e-07 ninf=2.24e-07 < 3.48e-06 (0.064) 1 iFFT: n2=1.43e+00 ninf=2.52e+00 < 3.48e-06 (723527.389) 0 FAIL
pyopencl  R2C        (30,30) axes=      None ndim=   2    float32 lut=None inplace=0  norm=   1   FFT: n2=1.79e-07 ninf=2.24e-07 < 3.48e-06 (0.064) 1 iFFT: n2=1.43e+00 ninf=2.52e+00 < 3.48e-06 (723527.355) 0 FAIL
pyopencl  R2C        (30,30) axes=      None ndim=   2    float32 lut=True inplace=0  norm=   0   FFT: n2=1.43e-07 ninf=1.81e-07 < 3.48e-06 (0.052) 1 iFFT: n2=1.43e+00 ninf=2.52e+00 < 3.48e-06 (723527.423) 0 FAIL
pyopencl  R2C        (30,30) axes=      None ndim=   2    float32 lut=True inplace=0  norm=   1   FFT: n2=1.43e-07 ninf=1.81e-07 < 3.48e-06 (0.052) 1 iFFT: n2=1.43e+00 ninf=2.52e+00 < 3.48e-06 (723527.423) 0 FAIL
pyopencl  R2C        (30,30) axes=      None ndim=   2    float64 lut=None inplace=0  norm=   0   FFT: n2=3.85e-16 ninf=4.70e-16 < 6.48e-15 (0.073) 1 iFFT: n2=1.43e+00 ninf=2.38e+00 < 6.48e-15 (367908337750424.188) 0 FAIL
pyopencl  R2C        (30,30) axes=      None ndim=   2    float64 lut=None inplace=0  norm=   1   FFT: n2=3.85e-16 ninf=4.70e-16 < 6.48e-15 (0.073) 1 iFFT: n2=1.43e+00 ninf=2.38e+00 < 6.48e-15 (367908337750424.188) 0 FAIL
pyopencl  R2C     (2,2,30+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   0   FFT: n2=1.45e-07 ninf=2.15e-07 < 2.74e-06 (0.078) 0 iFFT: n2=1.31e-07 ninf=2.21e-07 < 2.74e-06 (0.081) 0   OK
pyopencl  R2C     (2,2,30+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   1   FFT: n2=1.45e-07 ninf=2.15e-07 < 2.74e-06 (0.078) 0 iFFT: n2=1.43e-07 ninf=2.25e-07 < 2.74e-06 (0.082) 0   OK
pyopencl  R2C       (2,2,30) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   0   FFT: n2=1.45e-07 ninf=2.15e-07 < 2.74e-06 (0.078) 1 iFFT: n2=1.31e-07 ninf=2.21e-07 < 2.74e-06 (0.081) 1   OK
pyopencl  R2C       (2,2,30) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   1   FFT: n2=1.45e-07 ninf=2.15e-07 < 2.74e-06 (0.078) 1 iFFT: n2=1.43e-07 ninf=2.25e-07 < 2.74e-06 (0.082) 1   OK
pyopencl  R2C     (2,2,30+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   0   FFT: n2=1.01e-07 ninf=1.28e-07 < 2.74e-06 (0.047) 0 iFFT: n2=9.53e-08 ninf=1.23e-07 < 2.74e-06 (0.045) 0   OK
pyopencl  R2C     (2,2,30+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   1   FFT: n2=1.01e-07 ninf=1.28e-07 < 2.74e-06 (0.047) 0 iFFT: n2=1.37e-07 ninf=2.25e-07 < 2.74e-06 (0.082) 0   OK
pyopencl  R2C       (2,2,30) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   0   FFT: n2=1.01e-07 ninf=1.28e-07 < 2.74e-06 (0.047) 1 iFFT: n2=9.53e-08 ninf=1.23e-07 < 2.74e-06 (0.045) 1   OK
pyopencl  R2C       (2,2,30) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   1   FFT: n2=1.01e-07 ninf=1.28e-07 < 2.74e-06 (0.047) 1 iFFT: n2=1.37e-07 ninf=2.25e-07 < 2.74e-06 (0.082) 1   OK
pyopencl  R2C     (2,2,30+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   0   FFT: n2=2.88e-16 ninf=3.14e-16 < 5.74e-15 (0.055) 0 iFFT: n2=4.60e-16 ninf=8.94e-16 < 5.74e-15 (0.156) 0   OK
pyopencl  R2C     (2,2,30+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   1   FFT: n2=2.88e-16 ninf=3.14e-16 < 5.74e-15 (0.055) 0 iFFT: n2=2.65e-16 ninf=5.59e-16 < 5.74e-15 (0.097) 0   OK
pyopencl  R2C       (2,2,30) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   0   FFT: n2=2.82e-16 ninf=3.14e-16 < 5.74e-15 (0.055) 1 iFFT: n2=2.87e-16 ninf=4.75e-16 < 5.74e-15 (0.083) 1   OK
pyopencl  R2C       (2,2,30) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   1   FFT: n2=2.82e-16 ninf=3.14e-16 < 5.74e-15 (0.055) 1 iFFT: n2=2.72e-16 ninf=3.91e-16 < 5.74e-15 (0.068) 1   OK
pyopencl  R2C   (2,2,2,30+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   0   FFT: n2=1.53e-07 ninf=2.01e-07 < 2.74e-06 (0.073) 0 iFFT: n2=1.43e-07 ninf=3.13e-07 < 2.74e-06 (0.114) 0   OK
pyopencl  R2C   (2,2,2,30+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   1   FFT: n2=1.53e-07 ninf=2.01e-07 < 2.74e-06 (0.073) 0 iFFT: n2=1.48e-07 ninf=2.80e-07 < 2.74e-06 (0.102) 0   OK
pyopencl  R2C     (2,2,2,30) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   0   FFT: n2=1.53e-07 ninf=2.01e-07 < 2.74e-06 (0.073) 1 iFFT: n2=1.43e-07 ninf=3.13e-07 < 2.74e-06 (0.114) 1   OK
pyopencl  R2C     (2,2,2,30) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   1   FFT: n2=1.53e-07 ninf=2.01e-07 < 2.74e-06 (0.073) 1 iFFT: n2=1.48e-07 ninf=2.80e-07 < 2.74e-06 (0.102) 1   OK
pyopencl  R2C   (2,2,2,30+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   0   FFT: n2=1.01e-07 ninf=1.18e-07 < 2.74e-06 (0.043) 0 iFFT: n2=1.08e-07 ninf=2.20e-07 < 2.74e-06 (0.080) 0   OK
pyopencl  R2C   (2,2,2,30+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   1   FFT: n2=1.01e-07 ninf=1.18e-07 < 2.74e-06 (0.043) 0 iFFT: n2=1.37e-07 ninf=1.94e-07 < 2.74e-06 (0.071) 0   OK
pyopencl  R2C     (2,2,2,30) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   0   FFT: n2=1.01e-07 ninf=1.18e-07 < 2.74e-06 (0.043) 1 iFFT: n2=1.08e-07 ninf=2.20e-07 < 2.74e-06 (0.080) 1   OK
pyopencl  R2C     (2,2,2,30) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   1   FFT: n2=1.01e-07 ninf=1.18e-07 < 2.74e-06 (0.043) 1 iFFT: n2=1.37e-07 ninf=1.94e-07 < 2.74e-06 (0.071) 1   OK
pyopencl  R2C   (2,2,2,30+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   0   FFT: n2=2.89e-16 ninf=3.35e-16 < 5.74e-15 (0.058) 0 iFFT: n2=4.58e-16 ninf=7.84e-16 < 5.74e-15 (0.137) 0   OK
pyopencl  R2C   (2,2,2,30+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   1   FFT: n2=2.89e-16 ninf=3.35e-16 < 5.74e-15 (0.058) 0 iFFT: n2=3.03e-16 ninf=5.60e-16 < 5.74e-15 (0.098) 0   OK
pyopencl  R2C     (2,2,2,30) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   0   FFT: n2=2.92e-16 ninf=3.35e-16 < 5.74e-15 (0.058) 1 iFFT: n2=3.06e-16 ninf=6.72e-16 < 5.74e-15 (0.117) 1   OK
pyopencl  R2C     (2,2,2,30) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   1   FFT: n2=2.92e-16 ninf=3.35e-16 < 5.74e-15 (0.058) 1 iFFT: n2=3.27e-16 ninf=5.60e-16 < 5.74e-15 (0.098) 1   OK
pyopencl  R2C         (34+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   0   FFT: n2=1.92e-07 ninf=1.58e-07 < 2.77e-06 (0.057) 0 iFFT: n2=1.88e-07 ninf=2.73e-07 < 2.77e-06 (0.099) 0   OK
pyopencl  R2C         (34+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   1   FFT: n2=1.92e-07 ninf=1.58e-07 < 2.77e-06 (0.057) 0 iFFT: n2=1.72e-07 ninf=2.00e-07 < 2.77e-06 (0.072) 0   OK
pyopencl  R2C          (34,) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   0   FFT: n2=1.92e-07 ninf=1.58e-07 < 5.53e-06 (0.028) 1 iFFT: n2=1.88e-07 ninf=2.73e-07 < 5.53e-06 (0.049) 1   OK
pyopencl  R2C          (34,) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   1   FFT: n2=1.92e-07 ninf=1.58e-07 < 5.53e-06 (0.028) 1 iFFT: n2=1.72e-07 ninf=2.00e-07 < 5.53e-06 (0.036) 1   OK
pyopencl  R2C         (34+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   0   FFT: n2=1.86e-07 ninf=1.78e-07 < 2.77e-06 (0.064) 0 iFFT: n2=1.64e-07 ninf=2.63e-07 < 2.77e-06 (0.095) 0   OK
pyopencl  R2C         (34+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   1   FFT: n2=1.86e-07 ninf=1.78e-07 < 2.77e-06 (0.064) 0 iFFT: n2=1.71e-07 ninf=2.59e-07 < 2.77e-06 (0.094) 0   OK
pyopencl  R2C          (34,) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   0   FFT: n2=1.86e-07 ninf=1.78e-07 < 5.53e-06 (0.032) 1 iFFT: n2=1.64e-07 ninf=2.63e-07 < 5.53e-06 (0.047) 1   OK
pyopencl  R2C          (34,) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   1   FFT: n2=1.86e-07 ninf=1.78e-07 < 5.53e-06 (0.032) 1 iFFT: n2=1.71e-07 ninf=2.59e-07 < 5.53e-06 (0.047) 1   OK
pyopencl  R2C         (34+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   0   FFT: n2=6.19e-16 ninf=7.44e-16 < 5.77e-15 (0.129) 0 iFFT: n2=3.26e-16 ninf=3.41e-16 < 5.77e-15 (0.059) 0   OK
pyopencl  R2C         (34+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   1   FFT: n2=6.19e-16 ninf=7.44e-16 < 5.77e-15 (0.129) 0 iFFT: n2=3.07e-16 ninf=4.55e-16 < 5.77e-15 (0.079) 0   OK
pyopencl  R2C          (34,) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   0   FFT: n2=6.19e-16 ninf=7.44e-16 < 1.15e-14 (0.064) 1 iFFT: n2=3.26e-16 ninf=3.41e-16 < 1.15e-14 (0.030) 1   OK
pyopencl  R2C          (34,) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   1   FFT: n2=6.19e-16 ninf=7.44e-16 < 1.15e-14 (0.064) 1 iFFT: n2=3.07e-16 ninf=4.55e-16 < 1.15e-14 (0.039) 1   OK
pyopencl  R2C       (2,34+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   0   FFT: n2=1.66e-07 ninf=1.91e-07 < 2.77e-06 (0.069) 0 iFFT: n2=2.20e-07 ninf=2.82e-07 < 2.77e-06 (0.102) 0   OK
pyopencl  R2C       (2,34+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   1   FFT: n2=1.66e-07 ninf=1.91e-07 < 2.77e-06 (0.069) 0 iFFT: n2=2.38e-07 ninf=3.21e-07 < 2.77e-06 (0.116) 0   OK
pyopencl  R2C         (2,34) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   0   FFT: n2=1.66e-07 ninf=1.91e-07 < 5.53e-06 (0.035) 1 iFFT: n2=2.20e-07 ninf=2.82e-07 < 5.53e-06 (0.051) 1   OK
pyopencl  R2C         (2,34) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   1   FFT: n2=1.66e-07 ninf=1.91e-07 < 5.53e-06 (0.035) 1 iFFT: n2=2.38e-07 ninf=3.21e-07 < 5.53e-06 (0.058) 1   OK
pyopencl  R2C       (2,34+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   0   FFT: n2=1.94e-07 ninf=2.01e-07 < 2.77e-06 (0.073) 0 iFFT: n2=2.05e-07 ninf=2.96e-07 < 2.77e-06 (0.107) 0   OK
pyopencl  R2C       (2,34+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   1   FFT: n2=1.94e-07 ninf=2.01e-07 < 2.77e-06 (0.073) 0 iFFT: n2=2.16e-07 ninf=3.21e-07 < 2.77e-06 (0.116) 0   OK
pyopencl  R2C         (2,34) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   0   FFT: n2=1.94e-07 ninf=2.01e-07 < 5.53e-06 (0.036) 1 iFFT: n2=2.05e-07 ninf=2.96e-07 < 5.53e-06 (0.054) 1   OK
pyopencl  R2C         (2,34) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   1   FFT: n2=1.94e-07 ninf=2.01e-07 < 5.53e-06 (0.036) 1 iFFT: n2=2.16e-07 ninf=3.21e-07 < 5.53e-06 (0.058) 1   OK
pyopencl  R2C       (2,34+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   0   FFT: n2=4.90e-16 ninf=5.49e-16 < 5.77e-15 (0.095) 0 iFFT: n2=5.86e-16 ninf=8.89e-16 < 5.77e-15 (0.154) 0   OK
pyopencl  R2C       (2,34+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   1   FFT: n2=4.90e-16 ninf=5.49e-16 < 5.77e-15 (0.095) 0 iFFT: n2=5.30e-16 ninf=8.89e-16 < 5.77e-15 (0.154) 0   OK
pyopencl  R2C         (2,34) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   0   FFT: n2=4.90e-16 ninf=5.49e-16 < 1.15e-14 (0.048) 1 iFFT: n2=5.86e-16 ninf=8.89e-16 < 1.15e-14 (0.077) 1   OK
pyopencl  R2C         (2,34) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   1   FFT: n2=4.90e-16 ninf=5.49e-16 < 1.15e-14 (0.048) 1 iFFT: n2=5.30e-16 ninf=8.89e-16 < 1.15e-14 (0.077) 1   OK
pyopencl  R2C        (34,34) axes=      None ndim=   2    float32 lut=None inplace=0  norm=   0   FFT: n2=3.16e-07 ninf=3.05e-07 < 7.06e-06 (0.043) 1 iFFT: n2=1.39e+00 ninf=2.70e+00 < 7.06e-06 (382709.503) 0 FAIL
pyopencl  R2C        (34,34) axes=      None ndim=   2    float32 lut=None inplace=0  norm=   1   FFT: n2=3.16e-07 ninf=3.05e-07 < 7.06e-06 (0.043) 1 iFFT: n2=1.39e+00 ninf=2.70e+00 < 7.06e-06 (382709.486) 0 FAIL
pyopencl  R2C        (34,34) axes=      None ndim=   2    float32 lut=True inplace=0  norm=   0   FFT: n2=3.04e-07 ninf=3.78e-07 < 7.06e-06 (0.053) 1 iFFT: n2=1.39e+00 ninf=2.70e+00 < 7.06e-06 (382709.503) 0 FAIL
pyopencl  R2C        (34,34) axes=      None ndim=   2    float32 lut=True inplace=0  norm=   1   FFT: n2=3.04e-07 ninf=3.78e-07 < 7.06e-06 (0.053) 1 iFFT: n2=1.39e+00 ninf=2.70e+00 < 7.06e-06 (382709.469) 0 FAIL
pyopencl  R2C        (34,34) axes=      None ndim=   2    float64 lut=None inplace=0  norm=   0   FFT: n2=7.93e-16 ninf=1.04e-15 < 1.31e-14 (0.080) 1 iFFT: n2=1.43e+00 ninf=2.78e+00 < 1.31e-14 (212602327789630.625) 0 FAIL
pyopencl  R2C        (34,34) axes=      None ndim=   2    float64 lut=None inplace=0  norm=   1   FFT: n2=7.93e-16 ninf=1.04e-15 < 1.31e-14 (0.080) 1 iFFT: n2=1.43e+00 ninf=2.78e+00 < 1.31e-14 (212602327789630.625) 0 FAIL
pyopencl  R2C     (2,2,34+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   0   FFT: n2=2.07e-07 ninf=2.35e-07 < 2.77e-06 (0.085) 0 iFFT: n2=2.13e-07 ninf=2.84e-07 < 2.77e-06 (0.103) 0   OK
pyopencl  R2C     (2,2,34+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   1   FFT: n2=2.07e-07 ninf=2.35e-07 < 2.77e-06 (0.085) 0 iFFT: n2=2.30e-07 ninf=3.37e-07 < 2.77e-06 (0.122) 0   OK
pyopencl  R2C       (2,2,34) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   0   FFT: n2=2.07e-07 ninf=2.35e-07 < 5.53e-06 (0.042) 1 iFFT: n2=2.13e-07 ninf=2.84e-07 < 5.53e-06 (0.051) 1   OK
pyopencl  R2C       (2,2,34) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   1   FFT: n2=2.07e-07 ninf=2.35e-07 < 5.53e-06 (0.042) 1 iFFT: n2=2.30e-07 ninf=3.37e-07 < 5.53e-06 (0.061) 1   OK
pyopencl  R2C     (2,2,34+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   0   FFT: n2=2.07e-07 ninf=2.72e-07 < 2.77e-06 (0.098) 0 iFFT: n2=2.05e-07 ninf=3.36e-07 < 2.77e-06 (0.121) 0   OK
pyopencl  R2C     (2,2,34+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   1   FFT: n2=2.07e-07 ninf=2.72e-07 < 2.77e-06 (0.098) 0 iFFT: n2=2.00e-07 ninf=3.03e-07 < 2.77e-06 (0.110) 0   OK
pyopencl  R2C       (2,2,34) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   0   FFT: n2=2.07e-07 ninf=2.72e-07 < 5.53e-06 (0.049) 1 iFFT: n2=2.05e-07 ninf=3.36e-07 < 5.53e-06 (0.061) 1   OK
pyopencl  R2C       (2,2,34) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   1   FFT: n2=2.07e-07 ninf=2.72e-07 < 5.53e-06 (0.049) 1 iFFT: n2=2.00e-07 ninf=3.03e-07 < 5.53e-06 (0.055) 1   OK
pyopencl  R2C     (2,2,34+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   0   FFT: n2=5.30e-16 ninf=7.55e-16 < 5.77e-15 (0.131) 0 iFFT: n2=5.41e-16 ninf=1.23e-15 < 5.77e-15 (0.213) 0   OK
pyopencl  R2C     (2,2,34+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   1   FFT: n2=5.30e-16 ninf=7.55e-16 < 5.77e-15 (0.131) 0 iFFT: n2=5.55e-16 ninf=1.00e-15 < 5.77e-15 (0.174) 0   OK
pyopencl  R2C       (2,2,34) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   0   FFT: n2=5.30e-16 ninf=7.55e-16 < 1.15e-14 (0.065) 1 iFFT: n2=5.41e-16 ninf=1.23e-15 < 1.15e-14 (0.106) 1   OK
pyopencl  R2C       (2,2,34) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   1   FFT: n2=5.30e-16 ninf=7.55e-16 < 1.15e-14 (0.065) 1 iFFT: n2=5.55e-16 ninf=1.00e-15 < 1.15e-14 (0.087) 1   OK
pyopencl  R2C   (2,2,2,34+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   0   FFT: n2=2.13e-07 ninf=2.43e-07 < 2.77e-06 (0.088) 0 iFFT: n2=1.96e-07 ninf=3.26e-07 < 2.77e-06 (0.118) 0   OK
pyopencl  R2C   (2,2,2,34+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   1   FFT: n2=2.13e-07 ninf=2.43e-07 < 2.77e-06 (0.088) 0 iFFT: n2=2.07e-07 ninf=3.64e-07 < 2.77e-06 (0.132) 0   OK
pyopencl  R2C     (2,2,2,34) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   0   FFT: n2=2.13e-07 ninf=2.43e-07 < 5.53e-06 (0.044) 1 iFFT: n2=1.96e-07 ninf=3.26e-07 < 5.53e-06 (0.059) 1   OK
pyopencl  R2C     (2,2,2,34) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   1   FFT: n2=2.13e-07 ninf=2.43e-07 < 5.53e-06 (0.044) 1 iFFT: n2=2.07e-07 ninf=3.64e-07 < 5.53e-06 (0.066) 1   OK
pyopencl  R2C   (2,2,2,34+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   0   FFT: n2=2.07e-07 ninf=2.33e-07 < 2.77e-06 (0.084) 0 iFFT: n2=1.75e-07 ninf=3.26e-07 < 2.77e-06 (0.118) 0   OK
pyopencl  R2C   (2,2,2,34+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   1   FFT: n2=2.07e-07 ninf=2.33e-07 < 2.77e-06 (0.084) 0 iFFT: n2=1.95e-07 ninf=3.33e-07 < 2.77e-06 (0.120) 0   OK
pyopencl  R2C     (2,2,2,34) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   0   FFT: n2=2.07e-07 ninf=2.33e-07 < 5.53e-06 (0.042) 1 iFFT: n2=1.75e-07 ninf=3.26e-07 < 5.53e-06 (0.059) 1   OK
pyopencl  R2C     (2,2,2,34) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   1   FFT: n2=2.07e-07 ninf=2.33e-07 < 5.53e-06 (0.042) 1 iFFT: n2=1.95e-07 ninf=3.33e-07 < 5.53e-06 (0.060) 1   OK
pyopencl  R2C   (2,2,2,34+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   0   FFT: n2=5.78e-16 ninf=1.12e-15 < 5.77e-15 (0.194) 0 iFFT: n2=5.78e-16 ninf=9.91e-16 < 5.77e-15 (0.172) 0   OK
pyopencl  R2C   (2,2,2,34+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   1   FFT: n2=5.78e-16 ninf=1.12e-15 < 5.77e-15 (0.194) 0 iFFT: n2=5.63e-16 ninf=1.07e-15 < 5.77e-15 (0.186) 0   OK
pyopencl  R2C     (2,2,2,34) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   0   FFT: n2=5.78e-16 ninf=1.12e-15 < 1.15e-14 (0.097) 1 iFFT: n2=5.78e-16 ninf=9.91e-16 < 1.15e-14 (0.086) 1   OK
pyopencl  R2C     (2,2,2,34) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   1   FFT: n2=5.78e-16 ninf=1.12e-15 < 1.15e-14 (0.097) 1 iFFT: n2=5.63e-16 ninf=1.07e-15 < 1.15e-14 (0.093) 1   OK
pyopencl  R2C        (808+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   0   FFT: n2=3.52e-07 ninf=3.76e-07 < 3.45e-06 (0.109) 0 iFFT: n2=3.05e-07 ninf=4.77e-07 < 3.45e-06 (0.138) 0   OK
pyopencl  R2C        (808+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   1   FFT: n2=3.52e-07 ninf=3.76e-07 < 3.45e-06 (0.109) 0 iFFT: n2=2.99e-07 ninf=5.14e-07 < 3.45e-06 (0.149) 0   OK
pyopencl  R2C         (808,) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   0   FFT: n2=3.52e-07 ninf=3.76e-07 < 6.91e-06 (0.054) 1 iFFT: n2=3.05e-07 ninf=4.77e-07 < 6.91e-06 (0.069) 1   OK
pyopencl  R2C         (808,) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   1   FFT: n2=3.52e-07 ninf=3.76e-07 < 6.91e-06 (0.054) 1 iFFT: n2=2.99e-07 ninf=5.14e-07 < 6.91e-06 (0.074) 1   OK
pyopencl  R2C        (808+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   0   FFT: n2=2.46e-07 ninf=2.57e-07 < 3.45e-06 (0.074) 0 iFFT: n2=2.01e-07 ninf=4.55e-07 < 3.45e-06 (0.132) 0   OK
pyopencl  R2C        (808+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   1   FFT: n2=2.46e-07 ninf=2.57e-07 < 3.45e-06 (0.074) 0 iFFT: n2=1.93e-07 ninf=4.09e-07 < 3.45e-06 (0.119) 0   OK
pyopencl  R2C         (808,) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   0   FFT: n2=2.46e-07 ninf=2.57e-07 < 6.91e-06 (0.037) 1 iFFT: n2=2.01e-07 ninf=4.55e-07 < 6.91e-06 (0.066) 1   OK
pyopencl  R2C         (808,) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   1   FFT: n2=2.46e-07 ninf=2.57e-07 < 6.91e-06 (0.037) 1 iFFT: n2=1.93e-07 ninf=4.09e-07 < 6.91e-06 (0.059) 1   OK
pyopencl  R2C        (808+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   0   FFT: n2=7.58e-16 ninf=1.00e-15 < 6.45e-15 (0.155) 0 iFFT: n2=4.26e-16 ninf=8.89e-16 < 6.45e-15 (0.138) 0   OK
pyopencl  R2C        (808+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   1   FFT: n2=7.58e-16 ninf=1.00e-15 < 6.45e-15 (0.155) 0 iFFT: n2=5.06e-16 ninf=1.11e-15 < 6.45e-15 (0.172) 0   OK
pyopencl  R2C         (808,) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   0   FFT: n2=7.58e-16 ninf=1.00e-15 < 1.29e-14 (0.078) 1 iFFT: n2=4.26e-16 ninf=8.89e-16 < 1.29e-14 (0.069) 1   OK
pyopencl  R2C         (808,) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   1   FFT: n2=7.58e-16 ninf=1.00e-15 < 1.29e-14 (0.078) 1 iFFT: n2=4.44e-16 ninf=1.00e-15 < 1.29e-14 (0.077) 1   OK
pyopencl  R2C      (2,808+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   0   FFT: n2=3.73e-07 ninf=4.34e-07 < 3.45e-06 (0.126) 0 iFFT: n2=3.60e-07 ninf=7.22e-07 < 3.45e-06 (0.209) 0   OK
pyopencl  R2C      (2,808+2) axes=      None ndim=   1    float32 lut=None inplace=1  norm=   1   FFT: n2=3.73e-07 ninf=4.34e-07 < 3.45e-06 (0.126) 0 iFFT: n2=3.50e-07 ninf=6.94e-07 < 3.45e-06 (0.201) 0   OK
pyopencl  R2C        (2,808) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   0   FFT: n2=3.73e-07 ninf=4.34e-07 < 6.91e-06 (0.063) 1 iFFT: n2=3.60e-07 ninf=7.22e-07 < 6.91e-06 (0.104) 1   OK
pyopencl  R2C        (2,808) axes=      None ndim=   1    float32 lut=None inplace=0  norm=   1   FFT: n2=3.73e-07 ninf=4.34e-07 < 6.91e-06 (0.063) 1 iFFT: n2=3.50e-07 ninf=6.94e-07 < 6.91e-06 (0.100) 1   OK
pyopencl  R2C      (2,808+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   0   FFT: n2=2.60e-07 ninf=2.70e-07 < 3.45e-06 (0.078) 0 iFFT: n2=2.64e-07 ninf=4.77e-07 < 3.45e-06 (0.138) 0   OK
pyopencl  R2C      (2,808+2) axes=      None ndim=   1    float32 lut=True inplace=1  norm=   1   FFT: n2=2.60e-07 ninf=2.70e-07 < 3.45e-06 (0.078) 0 iFFT: n2=2.53e-07 ninf=4.86e-07 < 3.45e-06 (0.141) 0   OK
pyopencl  R2C        (2,808) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   0   FFT: n2=2.60e-07 ninf=2.70e-07 < 6.91e-06 (0.039) 1 iFFT: n2=2.64e-07 ninf=4.77e-07 < 6.91e-06 (0.069) 1   OK
pyopencl  R2C        (2,808) axes=      None ndim=   1    float32 lut=True inplace=0  norm=   1   FFT: n2=2.60e-07 ninf=2.70e-07 < 6.91e-06 (0.039) 1 iFFT: n2=2.53e-07 ninf=4.86e-07 < 6.91e-06 (0.070) 1   OK
pyopencl  R2C      (2,808+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   0   FFT: n2=7.55e-16 ninf=9.57e-16 < 6.45e-15 (0.148) 0 iFFT: n2=4.53e-16 ninf=9.99e-16 < 6.45e-15 (0.155) 0   OK
pyopencl  R2C      (2,808+2) axes=      None ndim=   1    float64 lut=None inplace=1  norm=   1   FFT: n2=7.55e-16 ninf=9.57e-16 < 6.45e-15 (0.148) 0 iFFT: n2=5.30e-16 ninf=9.99e-16 < 6.45e-15 (0.155) 0   OK
pyopencl  R2C        (2,808) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   0   FFT: n2=7.55e-16 ninf=9.57e-16 < 1.29e-14 (0.074) 1 iFFT: n2=4.53e-16 ninf=9.99e-16 < 1.29e-14 (0.077) 1   OK
pyopencl  R2C        (2,808) axes=      None ndim=   1    float64 lut=None inplace=0  norm=   1   FFT: n2=7.55e-16 ninf=9.57e-16 < 1.29e-14 (0.074) 1 iFFT: n2=4.71e-16 ninf=8.88e-16 < 1.29e-14 (0.069) 1   OK
pyopencl  R2C      (808,808) axes=      None ndim=   2    float32 lut=None inplace=0  norm=   0   FFT: n2=5.83e-07 ninf=6.50e-07 < 9.81e-06 (0.066) 1 iFFT: n2=1.41e+00 ninf=3.33e+00 < 9.81e-06 (339051.468) 0 FAIL
pyopencl  R2C      (808,808) axes=      None ndim=   2    float32 lut=None inplace=0  norm=   1   FFT: n2=5.83e-07 ninf=6.50e-07 < 9.81e-06 (0.066) 1 iFFT: n2=1.41e+00 ninf=3.33e+00 < 9.81e-06 (339051.443) 0 FAIL
pyopencl  R2C      (808,808) axes=      None ndim=   2    float32 lut=True inplace=0  norm=   0   FFT: n2=3.71e-07 ninf=3.84e-07 < 9.81e-06 (0.039) 1 iFFT: n2=1.41e+00 ninf=3.33e+00 < 9.81e-06 (339051.516) 0 FAIL
pyopencl  R2C      (808,808) axes=      None ndim=   2    float32 lut=True inplace=0  norm=   1   FFT: n2=3.71e-07 ninf=3.84e-07 < 9.81e-06 (0.039) 1 iFFT: n2=1.41e+00 ninf=3.33e+00 < 9.81e-06 (339051.492) 0 FAIL
pyopencl  R2C      (808,808) axes=      None ndim=   2    float64 lut=None inplace=0  norm=   0   FFT: n2=1.16e-15 ninf=1.62e-15 < 1.58e-14 (0.103) 1 iFFT: n2=1.41e+00 ninf=3.14e+00 < 1.58e-14 (198609444611850.719) 0 FAIL
pyopencl  R2C      (808,808) axes=      None ndim=   2    float64 lut=None inplace=0  norm=   1   FFT: n2=1.16e-15 ninf=1.62e-15 < 1.58e-14 (0.103) 1 iFFT: n2=1.41e+00 ninf=3.14e+00 < 1.58e-14 (198609444611850.719) 0 FAIL

My GPU supports cl_khr_fp64, but does not support cl_khr_int64, so I made a small modification in vkFFT.h:

	if ((!strcmp(floatType, "double")) || (sc->useUint64)) {
		sc->tempLen = sprintf(sc->tempStr, "\
#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n\
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable\n\
#pragma OPENCL EXTENSION cl_khr_int64_extended_atomics : enable\n\n");
		res = VkAppendLine(sc);
		if (res != VKFFT_SUCCESS) return res;
	}

Zero-padding support?

Hi,

VkFFT says they support zero padding:
"Native zero padding to model open systems (up to 2x faster than simply padding input array with zeros). Can specify the range of sequences filled with zeros and the direction where zero padding is applied (read or write stage)"

Is this available through the Python interface?
Our use case requires us to pad the input and padding can take as long as the FFT itself, so native zero padding would indeed speed it up by 2x.
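To make the request concrete, here is a minimal CPU sketch of what a zero-padded transform computes. It could serve as a reference when checking a GPU result. It is not a pyvkfft feature: whether pyvkfft exposes VkFFT's native zero padding is exactly the open question. The helper names are hypothetical.

```python
import numpy as np

def fft_zero_padded(x, padded_len):
    """1D FFT of x, zero-padded at the end to padded_len samples.

    NumPy pads the input with zeros when n exceeds its length, so the
    caller does not have to allocate the padded array.
    """
    return np.fft.fft(x, n=padded_len)

def fft_explicit_pad(x, padded_len):
    """Same transform, but materialising the zero-padded array first."""
    padded = np.zeros(padded_len, dtype=np.result_type(x.dtype, np.complex64))
    padded[: x.shape[0]] = x
    return np.fft.fft(padded)

x = np.random.default_rng(0).standard_normal(100)
print(np.allclose(fft_zero_padded(x, 256), fft_explicit_pad(x, 256)))  # True
```

The second function is the "pad, then transform" path described above. A native implementation would instead skip the reads of the known-zero region inside the FFT kernel.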

VkFFT 1.3

Dear @vincefn,

I have made some substantial changes to VkFFT in version 1.3 (https://github.com/DTolm/VkFFT/tree/develop). There will be a two-month period before it is merged into the main branch, to give people time to adjust dependent projects. Namely, VkFFT is no longer a single header file but a collection of headers. This should not make it any more complex to use: you still include the vkFFT.h file, and it includes the other files. The main advantage is that the code base is now more structured and much easier for other developers to understand.

I have tested the code on some systems with the implemented benchmark scripts - RTX2080 (Vulkan, CUDA, OpenCL), MI250 (HIP), A100 (CUDA), UHD610 (Level Zero) and M1 Pro (Metal), however, your suite is more thorough in this regard. Also, you might be interested in exploring the new design.

I suppose keeping an issue for this period can be helpful for discussion.

Best regards,
Dmitrii

test_fft.py fails with 2023.1.1 on a GTX 1080 Ti

Here is the relevant output of test_fft.py

test_c2c (__main__.TestFFT.test_c2c)
Run C2C tests ... nvrtcCompileProgram error: NVRTC_ERROR_BUILTIN_OPERATION_FAILURE
nvrtc: error: failed to load builtins for compute_30.
extern __shared__ float shared[];
extern "C" __global__ void __launch_bounds__(6) VkFFT_main (float2* inputs, float2* outputs) {

For a GTX 1080 Ti, it should use compute_61 afaiu.

nvidia-smi says

Driver Version: 530.30.02    CUDA Version: 12.1

and nvcc --version gives

Cuda compilation tools, release 10.1, V10.1.243

Any help will be appreciated.
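For reference, NVRTC selects its target through the `--gpu-architecture` option, whose value is derived from the device's compute capability. A GTX 1080 Ti has compute capability 6.1, so the expected target is `compute_61`, not the `compute_30` in the log. That suggests the architecture was not detected correctly, possibly related to the mismatch between the 12.1 driver and the 10.1 toolkit, although this is an assumption. The minimal sketch below shows the mapping only. The helper name is hypothetical and is not part of pyvkfft.

```python
def nvrtc_arch_option(major, minor):
    """Build an NVRTC --gpu-architecture option from a compute capability."""
    return f"--gpu-architecture=compute_{major}{minor}"

# A GTX 1080 Ti (Pascal) has compute capability 6.1.
print(nvrtc_arch_option(6, 1))  # --gpu-architecture=compute_61
```

If CuPy is installed, `cupy.cuda.Device(0).compute_capability` returns the capability as a string such as `"61"`. That gives a quick way to compare what the device reports with what was used for compilation.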

Batched transforms not working (OpenCL)

Hello, it appears as though batched transforms are not working correctly (or I have misunderstood the documentation). Please find a simple example attached.

In the example, I use vkFFT to do a forward complex-to-complex 1D batched transform and compare the result to NumPy FFT. The relative difference is O(1). If I instead do the same transform one batch at a time, the vkFFT and NumPy transforms match exactly.

I will try to find a solution, but you might know exactly where to look!

Thanks, Matt

vkfft_batch.zip
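The comparison described above can be written as a small harness that checks a batched transform against a row-by-row one, using NumPy as the reference. In this sketch `fft_fn` is a hypothetical stand-in for the GPU wrapper (upload, transform along the last axis, download). Here it is exercised with NumPy itself.

```python
import numpy as np

def rel_error(a, b):
    """L2 relative error of a with respect to the reference b."""
    return float(np.linalg.norm(a - b) / np.linalg.norm(b))

def check_batched_c2c(fft_fn, x):
    """Compare fft_fn on a 2D (nbatch, n) array against NumPy, once for
    the whole batch and once row by row."""
    ref = np.fft.fft(x, axis=-1)
    batched = fft_fn(x)
    per_row = np.stack([fft_fn(row) for row in x])
    return rel_error(batched, ref), rel_error(per_row, ref)

rng = np.random.default_rng(0)
x = (rng.standard_normal((4, 64)) + 1j * rng.standard_normal((4, 64))).astype(np.complex64)
print(check_batched_c2c(lambda a: np.fft.fft(a, axis=-1), x))
```

A small per-row error combined with an O(1) batched error points at how the batch dimension is laid out or passed to the transform, not at the 1D transform itself.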

Shared object file issues on import

When I try to import pyvkfft.cuda I get the following traceback. Any ideas on how to fix this? Thanks.

Traceback (most recent call last):
  File "filtering_test.py", line 9, in <module>
    import pyvkfft.cuda as vk
  File "/laptop/dspenv/lib64/python3.6/site-packages/pyvkfft/cuda.py", line 46, in <module>
    _vkfft_cuda = load_library("_vkfft_cuda")
  File "/laptop/dspenv/lib64/python3.6/site-packages/pyvkfft/cuda.py", line 43, in load_library
    return ctypes.cdll.LoadLibrary(os.path.join(os.path.dirname(__file__) or os.path.curdir, basename + ext))
  File "/usr/lib64/python3.6/ctypes/__init__.py", line 426, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib64/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /laptop/dspenv/lib64/python3.6/site-packages/pyvkfft/_vkfft_cuda.cpython-36m-x86_64-linux-gnu.so: cannot open shared object file: No such file or directory
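One quick diagnostic is to list which compiled backend libraries actually exist in the installed package directory. If `_vkfft_cuda...so` is absent, the CUDA backend was probably not built at install time, for example because nvcc was not found during installation. The sketch below uses only the standard library. The function name is hypothetical.

```python
import glob
import importlib.util
import os

def list_vkfft_libraries(package_dir=None):
    """Return the names of compiled _vkfft_* backend libraries found in the
    pyvkfft package directory (or in package_dir, if given)."""
    if package_dir is None:
        # Locate the installed package without importing it, since
        # importing pyvkfft.cuda is what fails here.
        spec = importlib.util.find_spec("pyvkfft")
        if spec is None or spec.origin is None:
            return []
        package_dir = os.path.dirname(spec.origin)
    pattern = os.path.join(package_dir, "_vkfft_*")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))

print(list_vkfft_libraries())
```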

Cannot import pyvkfft.fft.fftn without OpenCL backends

I tried to install pyvkfft with only the CUDA backend using VKFFT_BACKEND=CUDA CUDAHOME=/usr/local/cuda pip install pyvkfft

This runs and builds the wheel for pyvkfft without errors. After installation, I can import and use the CUDA pyvkfft planner without error:

$ python
Python 3.9.10 (main, Sep  1 2022, 19:43:27) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-10)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyvkfft.cuda import VkFFTApp
>>> 

But if I try to import pyvkfft.fft.fftn, I get an OSError:

$ python
Python 3.9.10 (main, Sep  1 2022, 19:43:27) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-10)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyvkfft.fft import fftn
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../venv/lib/python3.9/site-packages/pyvkfft/fft.py", line 28, in <module>
    from .opencl import VkFFTApp as VkFFTApp_cl, cla, vkfft_version
  File ".../venv/lib/python3.9/site-packages/pyvkfft/opencl.py", line 15, in <module>
    _vkfft_opencl = load_library("_vkfft_opencl")
  File ".../venv/lib/python3.9/site-packages/pyvkfft/base.py", line 117, in load_library
    return ctypes.cdll.LoadLibrary(os.path.join(os.path.dirname(__file__) or os.path.curdir, basename + ext))
  File ".../Python-3.9.10/Lib/ctypes/__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File ".../Python-3.9.10/Lib/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: .../venv/lib/python3.9/site-packages/pyvkfft/_vkfft_opencl.cpython-39-x86_64-linux-gnu.so: cannot open shared object file: No such file or directory

Looking at the code for pyvkfft.fft, it looks like it tries to handle the case where the OpenCL backend isn't installed:

pyvkfft/pyvkfft/fft.py, lines 27 to 33 in d75c620:

try:
    from .opencl import VkFFTApp as VkFFTApp_cl, cla, vkfft_version
    has_opencl = True
except ImportError:
    has_opencl = False

but I get an OSError, not an ImportError.

If I install pyvkfft with both the OpenCL and CUDA backends using CUDAHOME=/usr/local/cuda pip install pyvkfft, things work:

$ python
Python 3.9.10 (main, Sep  1 2022, 19:43:27) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-10)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyvkfft.fft import fftn
>>> 

but I'd like to use pyvkfft with only the Cuda backends installed.
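The guard quoted above only catches ImportError. However, `ctypes.cdll.LoadLibrary` raises OSError when the shared library file is missing, so the exception escapes the guard. Below is a minimal sketch of a more tolerant optional import that catches both exception types. It is an illustration only, not the project's actual fix, and the helper name is hypothetical.

```python
import importlib

def optional_import(module_name, *names):
    """Import names from a module, returning (values, True) on success or
    (None, False) if the module or its compiled library cannot be loaded."""
    try:
        module = importlib.import_module(module_name)
        return tuple(getattr(module, n) for n in names), True
    except (ImportError, OSError):
        # OSError covers ctypes failing to dlopen a missing shared library.
        return None, False

(pi,), ok = optional_import("math", "pi")
print(ok)  # True
```

Inside the package itself, the equivalent change would be writing `except (ImportError, OSError):` in the try/except block shown above.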

Cannot pass 'omitDimension' to VkFFTApp

I have a real np.array z of size (h, w, 2), and I want to batch two FFTs on z[...,0] and z[...,1] (innermost batch in VkFFT vocabulary ???), getting the results in a complex array of size (h, (w/2)+1, 2). This seems to be feasible in VkFFT using the omitDimension parameter, but it does not seem possible to pass this parameter to VkFFTApp.

In the current state of pyVkFFT, is my operation possible in some way?
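Whatever GPU approach ends up being used, a CPU reference for the requested result is easy to write with NumPy. It consists of two independent 2D R2C transforms over axes (0, 1), batched along the last axis. This is a sketch for validating results only. It makes no claim about what pyvkfft's VkFFTApp supports.

```python
import numpy as np

h, w = 6, 8
z = np.random.default_rng(0).standard_normal((h, w, 2)).astype(np.float32)

# 2D R2C over axes (0, 1); the last axis (size 2) is treated as a batch.
# The half-spectrum is taken along axis 1, the last transformed axis.
Z = np.fft.rfftn(z, axes=(0, 1))
print(Z.shape)  # (6, 5, 2), i.e. (h, w // 2 + 1, 2)
```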

RuntimeError: VkFFT error 4051: VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE

Based on your OpenCL example, the following code:

import pyopencl as cl
import pyopencl.array
import pyvkfft.opencl
import numpy as np

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
d = pyopencl.array.to_device(queue, np.random.uniform(0, 1, (8)).astype(np.complex64))
app = pyvkfft.opencl.VkFFTApp(d.shape, d.dtype, queue=queue, ndim=d.ndim)

gives the following error:

clVkFFTApp error: (8,) None complex64 1 None [8] [False]
Traceback (most recent call last):
  File "/Users/alegresor/Desktop/QMCseqCL/ags.py", line 17, in <module>
    app = pyvkfft.opencl.VkFFTApp(d.shape, d.dtype, queue=queue, ndim=d.ndim)
  File "/Users/alegresor/miniconda3/envs/qmcseqcl/lib/python3.10/site-packages/pyvkfft/opencl.py", line 196, in __init__
    raise ex
  File "/Users/alegresor/miniconda3/envs/qmcseqcl/lib/python3.10/site-packages/pyvkfft/opencl.py", line 189, in __init__
    check_vkfft_result(res, shape=shape, dtype=dtype, ndim=ndim, inplace=inplace, norm=norm,
  File "/Users/alegresor/miniconda3/envs/qmcseqcl/lib/python3.10/site-packages/pyvkfft/base.py", line 515, in check_vkfft_result
    raise RuntimeError("VkFFT error %d: %s %s" % (res, r.name, s))
RuntimeError: VkFFT error 4051: VKFFT_ERROR_FAILED_TO_GET_ATTRIBUTE C2C (8,) complex64 1D inplace norm=1 [VkFFT: shape= [8] skip=[0] nbatch=1] [opencl:Portable Computing Language:cpu]

Support transforms for arrays without C-ordering (strides not in the order of axes)

This should mostly (?) require reordering the axes in calc_transform_axes.

The calculation of the fft_scale factor for in-place R2C transforms will also need to be corrected.

A warning should probably be issued for in-place R2C transforms, since the two extra values along the fast axis may not be where they are expected.

This is a followup to #19
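The axis reordering mentioned above can be sketched as sorting the axes by decreasing stride. Transposing the array by the resulting permutation yields a C-ordered view, and the inverse permutation maps results back. This is a minimal NumPy sketch, assuming a non-overlapping memory layout. The function name is hypothetical and does not describe what calc_transform_axes currently does.

```python
import numpy as np

def c_order_permutation(strides):
    """Axis permutation sorting axes by decreasing |stride|, so that
    arr.transpose(perm) has C-ordered strides."""
    # A stable sort keeps the original order of axes with equal strides
    # (e.g. some length-1 axes), which avoids needless reordering.
    return tuple(int(i) for i in np.argsort([-abs(s) for s in strides], kind="stable"))

x = np.ones((1, 100, 1400), dtype=np.complex64).transpose(0, 2, 1)
perm = c_order_permutation(x.strides)
print(perm, x.transpose(perm).flags.c_contiguous)  # (0, 2, 1) True
```

The transform axes must be remapped through the same permutation. For the R2C case, this is also where the fast axis may change, which is why the fft_scale and in-place padding concerns above apply.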

Convolution support

Hi @DTolm, now that the release is out, I ran some tests with on-the-fly convolution, following Osamu's email exchange, which piqued my curiosity.

There is now a branch with convolution support

What I have seen so far (though I have only run a very limited number of tests):

  • 2D and batched 2D convolution works for R2C and C2C (compared to convolution using numpy), both for out- and in-place transforms
  • for batched 2D, I use config->coordinateFeatures = n_batch instead of numberBatches, I assume that's the proper way
  • 1D transform kernel compilations fail (see the notebook below)
  • 3D transforms give incorrect results

The tests are all visible in the pyvkfft-convolve notebook.
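For reference, the NumPy comparison can be sketched with the convolution theorem: a circular convolution equals the inverse FFT of the product of the FFTs. Below is a minimal sketch for real 2D inputs, with a brute-force version for checking small cases. Both functions are hypothetical illustrations. They are not the code used in the notebook.

```python
import numpy as np

def circular_conv2d_fft(a, b):
    """Circular 2D convolution over the last two axes via R2C/C2R FFTs.
    Leading axes, if any, are treated as a batch."""
    return np.fft.irfft2(np.fft.rfft2(a) * np.fft.rfft2(b), s=a.shape[-2:])

def circular_conv2d_direct(a, b):
    """Brute-force circular convolution of two 2D arrays (small sizes only)."""
    ny, nx = a.shape
    out = np.zeros(a.shape)
    for i in range(ny):
        for j in range(nx):
            for k in range(ny):
                for q in range(nx):
                    out[i, j] += a[k, q] * b[(i - k) % ny, (j - q) % nx]
    return out

rng = np.random.default_rng(0)
a, b = rng.standard_normal((4, 6)), rng.standard_normal((4, 6))
print(np.allclose(circular_conv2d_fft(a, b), circular_conv2d_direct(a, b)))  # True
```

Because `rfft2` and `irfft2` act on the last two axes, the FFT version also covers the batched 2D case directly.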

pyvkfft.fft.fftn error?

Hi!
I have found that pyvkfft.fft.fftn exhibits strange behavior.
Also, with previous versions of pyvkfft, this code raised an error.

OS:Windows 10
Python: 3.12.2

import numpy as np
import cupy as cp
import pyvkfft.fft

shape = (3,2,2)

size = np.prod(shape)
x1 = cp.arange(size,dtype=cp.complex64).reshape(shape)
x2 = x1[:,None]

y1 = pyvkfft.fft.fftn(x1,axes=(-1,-2))
y2 = pyvkfft.fft.fftn(x2,axes=(-1,-2)).squeeze()

print("shape x1:",x1.shape, "shape x2:", x2.shape)
print("shape y1:",y1.shape, "shape y2:", y2.shape)
print(y1-y2)

case (i) pyvkfft == 2024.1.1 vkfft == 1.3.4

shape x1: (3, 2, 2) shape x2: (3, 1, 2, 2)
shape y1: (3, 2, 2) shape y2: (3, 2, 2)
[[[ -9.+0.j          1.+0.j       ]
  [  2.-3.4641016j   0.+0.j       ]]

 [[ 28.+3.4641016j  -2.+0.j       ]
  [-55.+0.j          3.+0.j       ]]

 [[ 44.-3.4641016j  -2.+0.j       ]
  [  2.+3.4641016j   0.+0.j       ]]]

case (ii) pyvkfft == 2023.1.1 vkfft == 1.2.9

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[2], line 8
      5 x2 = x1[:,None]
      7 y1 = pyvkfft.fft.fftn(x1,axes=(-1,-2))
----> 8 y2 = pyvkfft.fft.fftn(x2,axes=(-1,-2)).squeeze()
     10 print("shape x1:",x1.shape, "shape x2:", x2.shape)
     11 print("shape y1:",y1.shape, "shape y2:", y2.shape)

File ~\miniconda3\Lib\site-packages\pyvkfft\fft.py:200, in fftn(src, dest, ndim, norm, axes, cuda_stream, cl_queue, return_scale)
    167 """
    168 Perform a FFT on a GPU array, automatically creating the VkFFTApp
    169 and caching it for future re-use.
   (...)
    197 :return: the destination array if return_scale is False, or (dest, scale)
    198 """
    199 backend, inplace, dest, cl_queue = _prepare_transform(src, dest, cl_queue, False)
--> 200 app = _get_fft_app(backend, src.shape, src.dtype, inplace, ndim, axes, norm, cuda_stream, cl_queue,
    201                    strides=src.strides)
    202 if backend == Backend.PYOPENCL:
    203     app.fft(src, dest, queue=cl_queue)

File ~\miniconda3\Lib\site-packages\pyvkfft\fft.py:137, in _get_fft_app(backend, shape, dtype, inplace, ndim, axes, norm, cuda_stream, cl_queue, strides)
    134 @lru_cache(maxsize=config.FFT_CACHE_NB)
    135 def _get_fft_app(backend, shape, dtype, inplace, ndim, axes, norm, cuda_stream, cl_queue, strides=None):
    136     if backend in [Backend.PYCUDA, Backend.CUPY]:
--> 137         return VkFFTApp_cuda(shape, dtype, ndim=ndim, inplace=inplace,
    138                              stream=cuda_stream, norm=norm, axes=axes, strides=strides)
    139     elif backend == Backend.PYOPENCL:
    140         return VkFFTApp_cl(shape, dtype, cl_queue, ndim=ndim, inplace=inplace,
    141                            norm=norm, axes=axes, strides=strides)

File ~\miniconda3\Lib\site-packages\pyvkfft\cuda.py:123, in VkFFTApp.__init__(self, shape, dtype, ndim, inplace, stream, norm, r2c, dct, axes, strides, **kwargs)
    121     raise RuntimeError("Error creating VkFFTConfiguration. Was the CUDA context properly initialised ?")
    122 res = ctypes.c_int(0)
--> 123 self.app = _vkfft_cuda.init_app(self.config, ctypes.byref(res))
    124 check_vkfft_result(res, shape, dtype, ndim, inplace, norm, r2c, dct, axes, "cuda")
    125 if self.app is None:

OSError: exception: integer divide by zero
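As a CPU cross-check, the same computation in NumPy should give identical results for the two layouts, so y1 - y2 is expected to be zero. One possibly relevant detail, offered as an assumption rather than a diagnosis: in NumPy, an axis inserted with None typically reports a stride of 0. If CuPy does the same, that value reaches VkFFTApp through strides=src.strides, which would be consistent with an "integer divide by zero".

```python
import numpy as np

shape = (3, 2, 2)
x1 = np.arange(np.prod(shape), dtype=np.complex64).reshape(shape)
x2 = x1[:, None]  # shape (3, 1, 2, 2): an extra length-1 axis

y1 = np.fft.fftn(x1, axes=(-1, -2))
y2 = np.fft.fftn(x2, axes=(-1, -2)).squeeze()

print(x2.strides)  # inspect the stride reported for the inserted axis
print(np.allclose(y1, y2))  # True
```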

Metal backend support?

Hey, awesome work!
Are there plans to add Metal backend support for Apple silicon? :)
A recently released VkFFT update added a Metal backend on their side.

`pip install pyvkfft` fails when `nvcc` is available

On machines where nvcc is available, the source build of pyvkfft seems to fail with a message like the following:

      /usr/bin/nvcc -I/usr/include -Isrc -c src/vkfft_cuda.cu -o build/temp.linux-x86_64-cpython-310/src/vkfft_cuda.o -O3 --ptxas-options=-v -std=c++11 --compiler-options=-fPIC
      cc1plus: fatal error: src/vkfft_cuda.cu: No such file or directory

See here for a CI log that shows this in context. Note that the issue can be reproduced by just pip install pyvkfft. The machine where I am reproducing this has CUDA installed via Debian's repository, in case it matters. The CUDA version is 11.6.2.

Metadata/license on PyPi

I noticed that GitHub says MIT (there's even a PR setting this), but on PyPI it's MPL.
I think this is a question of setup.py, but I'm not an expert in this sort of thing.
I think

"License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)",

needs to be changed to show "MIT License".

Additionally, pip show pyvkfft doesn't show the license:

╰─ pip show pyvkfft                                               (napari-CL) ─╯
Name: pyvkfft
Version: 2022.1.1
Summary: Python wrapper for the CUDA and OpenCL backends of VkFFT,providing GPU FFT for PyCUDA, PyOpenCL and CuPy
Home-page: https://github.com/vincefn/pyvkfft
Author: Vincent Favre-Nicolin
Author-email: [email protected]
License: UNKNOWN
Location: /Users/piotrsobolewski/Dev/miniforge3/envs/napari-CL/lib/python3.9/site-packages
Requires: numpy, psutil, pyopencl
Required-by: 

So I think that metadata block also needs a license="MIT License" entry.
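Below is a hypothetical excerpt of the relevant setup.py metadata. The `license` argument fills the "License:" field shown by pip show, and the trove classifier is what PyPI displays. Only these two fields are sketched. Everything else in the real setup() call is left out, and the exact layout of pyvkfft's setup.py is not assumed.

```python
# Hypothetical excerpt of setup.py metadata for an MIT-licensed package.
metadata = dict(
    name="pyvkfft",
    license="MIT",
    classifiers=[
        "License :: OSI Approved :: MIT License",
    ],
)

# In setup.py this would be merged into the real call, e.g.:
# from setuptools import setup
# setup(**metadata, ...)  # plus the existing build options
```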

Possible context issue using pyvkfft in a multithreaded/multigpu environment

Hey Vince, I'm trying to write an app that delegates work to threads to perform FFTs on different GPUs. Each thread manages a separate GPU and has the following basic form:

def thread_0():
    cupy.cuda.Device(0).use()
    while True:
        get_data....
        gpu_data = cp.array(data)
        fft = fftn(gpu_data)

def thread_1():
    cupy.cuda.Device(1).use()
    while True:
        get_data....
        gpu_data = cp.array(data)
        fft = fftn(gpu_data)

def main():
    spawn_threads...
    while True:
        send_data_0_thread0(...)
        send_data_1_thread1(...)

However, pyvkfft is throwing an exception:

Traceback (most recent call last):
File "/opt/python/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/opt/python/lib/python3.9/threading.py", line 910, in run
self._target(*self._args, **self._kwargs)
File "/opt/leolabs/radar/20230425T163443/leo-radar/radar/sparta/processing/incoherent_worker.py", line 193, in process_thread
process_dict['results'] = process_incoherent.process_cpi(process_dict['mode'], process_dict['samples'])
File "/opt/leolabs/radar/20230425T163443/leo-radar/radar/sparta/processing/process_incoherent.py", line 434, in process_cpi
self._process_range_subset(cpi_data_gpu, tx_pulses, subset_min_range_idx, pulse_group_rising_edge, range_doppler[rstart:rend])
File "/opt/leolabs/radar/20230425T163443/leo-radar/radar/sparta/processing/process_incoherent.py", line 231, in _process_range_subset
dm_pulse = dsp.fft_demod_decimate(input_data, self._padded_tx_pulse, rising_edge[pg_idx], min_range_idx,
File "/opt/leolabs/radar/20230425T163443/leo-radar/radar/sparta/processing/incoherent_dsp_lib.py", line 199, in fft_demod_decimate
padded_ranges = FFTBackend.fft(padded_ranges, inplace=True, axes=-1)
File "/opt/leolabs/radar/20230425T163443/leo-radar/radar/sparta/utils/backends.py", line 55, in fft
return pyvkfft_lib.fftn(input_data, dest=input_data, ndim=1, axes=axes)
File "/opt/leolabs/radar/20230425T163443/leo-radar/venv3/lib/python3.9/site-packages/pyvkfft/fft.py", line 205, in fftn
app.fft(src, dest)
File "/opt/leolabs/radar/20230425T163443/leo-radar/venv3/lib/python3.9/site-packages/pyvkfft/cuda.py", line 208, in fft
check_vkfft_result(res, src.shape, src.dtype, self.ndim, self.inplace, self.norm, self.r2c,
File "/opt/leolabs/radar/20230425T163443/leo-radar/venv3/lib/python3.9/site-packages/pyvkfft/base.py", line 425, in check_vkfft_result
raise RuntimeError("VkFFT error %d: %s %s" % (res, r.name, s))
RuntimeError: VkFFT error 4039: VKFFT_ERROR_FAILED_TO_LAUNCH_KERNEL C2C (10022,525) complex64 1D inplace norm=1 [cuda]
cuLaunchKernel error: 400, 1 10022 1 - 38 1 1

From what I can tell, this is an access issue where the code is trying to access data on the wrong GPU. Is this an issue with how pyvkfft handles contexts in a multi-GPU environment, or am I not setting something up correctly for pyvkfft? From my debugging, all my other CuPy code appears to respect the device/stream context. Please let me know if I can provide any other information.
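One hypothesis worth checking: the fftn helper caches VkFFTApp objects. The _get_fft_app signature quoted in another traceback on this page is decorated with lru_cache, and its arguments include no device. If that holds for this version, an app created on GPU 0 could be reused from the thread driving GPU 1. The purely illustrative sketch below shows the design fix of keying the cache on the active device. All names are hypothetical, and the "app" is a placeholder tuple rather than a real VkFFTApp.

```python
import threading
from functools import lru_cache

_state = threading.local()

def set_current_device(device_id):
    """Stand-in for selecting a GPU in the calling thread."""
    _state.device = device_id

def current_device():
    return getattr(_state, "device", 0)

@lru_cache(maxsize=32)
def _cached_app(device_id, shape, dtype):
    # Placeholder for building an FFT plan/app on `device_id`.
    return ("app", device_id, shape, dtype)

def get_app(shape, dtype):
    """Cached lookup keyed on the active device as well as the transform
    parameters, so a plan is never shared across GPUs."""
    return _cached_app(current_device(), shape, dtype)
```

A user-side workaround along the same lines is to create one VkFFTApp explicitly per thread, after selecting the device, instead of going through the cached fftn helper.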

Installation on Ubuntu with an NVIDIA card

How do I get pyvkfft to use the NVIDIA card + OpenCL on an Ubuntu 18.04 LTS system? When I run pip install, I get the following error:

  x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.8/src/vkfft_opencl.o -lOpenCL -o build/lib.linux-x86_64-3.8/pyvkfft/_vkfft_opencl.cpython-38-x86_64-linux-gnu.so --shared
  /usr/bin/ld: cannot find -lOpenCL
  collect2: error: ld returned 1 exit status
  error: command 'x86_64-linux-gnu-g++' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for pyvkfft
  Running setup.py clean for pyvkfft
Failed to build pyvkfft
Installing collected packages: pyvkfft

I am running nvidia driver versions:

NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2
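The linker error concerns the unversioned libOpenCL.so link used at build time. The small check below reports two things separately: whether an OpenCL runtime is visible to the dynamic loader, and whether a libOpenCL.so development link exists in a few common directories. The directory list is an assumption and may need adjusting. A runtime that is found without any dev link would suggest the OpenCL development package is missing.

```python
import ctypes.util
import glob
import os

# Assumed search locations; adjust for the local system.
DEFAULT_DIRS = ("/usr/lib/x86_64-linux-gnu", "/usr/lib64", "/usr/lib",
                "/usr/local/cuda/lib64")

def opencl_link_status(search_dirs=DEFAULT_DIRS):
    """Report the OpenCL runtime seen by the loader and any unversioned
    libOpenCL.so links (the file that `-lOpenCL` needs at link time)."""
    runtime = ctypes.util.find_library("OpenCL")
    dev_links = [p for d in search_dirs
                 for p in glob.glob(os.path.join(d, "libOpenCL.so"))]
    return {"runtime": runtime, "dev_links": dev_links}

print(opencl_link_status())
```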

FFT of transposed array throws exception

I found that I get a floating point exception when computing a 1D FFT on a transposed multidimensional array. Transposing can be useful since it's faster to compute 1D FFTs along axis=-1.

This code reproduces the issue on my machine:

import cupy as cp
import pyvkfft.fft
import pyvkfft.version

print(f"version: {pyvkfft.version.vkfft_version()}")

x = cp.ones((1,100,1400), dtype=cp.complex64)
print(f"x shape: {x.shape}")

# transpose and copy data to new array, works fine
x_t1 = x.transpose((0,2,1)).copy()
print(f"this should work (shape: {x_t1.shape})")
pyvkfft.fft.fftn(x_t1, ndim=1, axes=-1)

# transpose w/out a copy, throws exception
x_t2 = x.transpose((0,2,1))
print(f"this should throw an exception (shape: {x_t2.shape})")
pyvkfft.fft.fftn(x_t2, ndim=1, axes=-1)

When I run this script, I get:

(venv3) [leolabs@hqsr-worker-r1-1 jimmy]$ python fft_transpose_issue.py 
version: 1.3.0
x shape: (1, 100, 1400)
this should work (shape: (1, 1400, 100))
caching VkFFTApp with:  None
this should throw an exception (shape: (1, 1400, 100))
caching VkFFTApp with:  None
Floating point exception (core dumped)

I'm using VkFFT 1.3.0 (the version printed by pyvkfft.version.vkfft_version()) with CUDA 12.2.

Thanks for the help!
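The script itself shows that the transposed view fails while an explicit .copy() works, which points at the non-contiguous (strided) memory layout of x.transpose(...) rather than at the shape, since both arrays have shape (1, 1400, 100). Until strided inputs are handled, a defensive workaround is to make the input C-contiguous before calling pyvkfft. Here is a minimal sketch. The helper `as_c_contiguous` is hypothetical, it is demonstrated with NumPy so it runs without a GPU, and it assumes CuPy arrays expose the same `.flags.c_contiguous` and `.copy(order="C")` API.

```python
import numpy as np


def as_c_contiguous(x):
    """Return x unchanged if it is C-contiguous, otherwise a C-ordered copy.

    Hypothetical helper; relies only on `.flags.c_contiguous` and
    `.copy(order="C")`, which NumPy (and, assumed here, CuPy) arrays provide.
    """
    if x.flags.c_contiguous:
        return x
    return x.copy(order="C")


if __name__ == "__main__":
    x = np.ones((1, 100, 1400), dtype=np.complex64)
    x_t = x.transpose((0, 2, 1))  # strided view, no data is moved
    x_c = as_c_contiguous(x_t)    # same values, C-ordered memory
    print(x_t.flags.c_contiguous, x_c.flags.c_contiguous, x_c.shape)
```

With CuPy, the failing call from the report would become pyvkfft.fft.fftn(as_c_contiguous(x_t2), ndim=1, axes=-1). The copy costs an extra device-memory pass, which may offset part of the speed gain that motivated transposing in the first place.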

Unintentional in-place irfftn transform

Hi!

Thanks for pyvkfft and your effort on this awesome opensource project!

When applying pyvkfft.fft.irfftn to a CuPy array, the input array appears to be modified, even though no in-place transform was requested.

Sample code

import pyvkfft
import pyvkfft.fft
import cupy as cp

x = cp.arange(4,dtype='float32').reshape(2,2)

x_fft = pyvkfft.fft.rfftn(x)
x_fft_copy = x_fft.copy()

x2 = pyvkfft.fft.irfftn(x_fft)

print("original array x:\n",x,"\n double-transformed array x2:\n",x2)

print("fft array x_fft:\n",x_fft, "\n fft array copied x_fft_copy:\n", x_fft_copy)
# these arrays are not identical, even though no in-place transform was requested.

Output

original array x:
 [[0. 1.]
 [2. 3.]] 
 double-transformed array x2:
 [[0. 1.]
 [2. 3.]]
fft array x_fft:
 [[ 1.+0.j -1.+0.j]
 [ 5.+0.j -1.+0.j]] 
 fft array copied x_fft_copy:
 [[ 6.+0.j -2.+0.j]
 [-4.+0.j  0.+0.j]]

Environment
OS: Windows 10
cuda: 11.3
pyvkfft version: 2022.1.1
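The numbers in the report are consistent with the input buffer being reused as scratch space. x_fft_copy holds the correct 2D real-to-complex spectrum, and the modified x_fft equals that spectrum after only the first stage of the inverse transform, an inverse complex-to-complex FFT along axis 0. This is an interpretation of the output above, not something documented by pyvkfft. Note that other libraries behave similarly: FFTW, for example, destroys the input of complex-to-real transforms by default. The NumPy sketch below reproduces both arrays without a GPU.

```python
import numpy as np

# Same input as in the report.
x = np.arange(4, dtype=np.float32).reshape(2, 2)

# Full 2D real-to-complex transform: same values as x_fft_copy in the report.
spectrum = np.fft.rfftn(x)

# First stage of the inverse 2D transform: inverse C2C FFT along axis 0 only.
# NumPy's ifft applies the 1/n factor; the result matches the modified x_fft.
stage1 = np.fft.ifft(spectrum, axis=0)

# Second stage: complex-to-real inverse FFT along the last axis recovers x.
x2 = np.fft.irfft(stage1, n=x.shape[-1], axis=-1)

print(np.round(spectrum.real, 6))
print(np.round(stage1.real, 6))
print(x2)
```

If the spectrum is still needed after the inverse transform, passing a copy, e.g. x2 = pyvkfft.fft.irfftn(x_fft.copy()), keeps the original intact at the cost of one extra buffer.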
