
jamienoss / astropy


This project forked from astropy/astropy


Repository for the Astropy core package

Home Page: www.astropy.org

License: BSD 3-Clause "New" or "Revised" License

Python 95.47% C 4.48% HTML 0.01% TeX 0.01% C++ 0.01% Objective-C 0.01%


astropy's Issues

NaN treatment

Dedupe NaN conditionals

src

...
if preserve_nan:
    badvals = np.isnan(array_internal)

if nan_treatment == 'fill':
    initially_nan = np.isnan(array_internal)
    array_internal[initially_nan] = fill_value
...

To prevent duplicated work when preserve_nan == True and nan_treatment == 'fill', the conditions should be joined:

...
if preserve_nan or nan_treatment == 'fill':
    initially_nan = np.isnan(array_internal)
    if nan_treatment == 'fill':
        array_internal[initially_nan] = fill_value
...
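
A self-contained version of the joined form (the helper name apply_nan_treatment is mine, purely illustrative; initially_nan then also serves the preserve_nan bookkeeping previously held in badvals):

import numpy as np

def apply_nan_treatment(array_internal, preserve_nan, nan_treatment, fill_value=0.0):
    # Evaluate np.isnan() at most once, serving both options.
    initially_nan = None
    if preserve_nan or nan_treatment == 'fill':
        initially_nan = np.isnan(array_internal)
        if nan_treatment == 'fill':
            array_internal[initially_nan] = fill_value
    return array_internal, initially_nan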

nan_treatment == 'fill'

For nan_treatment == 'fill' in convolve(), any and all possible NaN values have already been replaced, so the subsequent npy_isnan(val) check (for example, for boundary=None, the default, in the Cython routine boundary_none.pyx) can be removed. Removing it is a step closer to removing all conditionals so that vectorization can take place.
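
As a sketch of the dispatch this enables (has_nan_work is a hypothetical helper, not astropy API): once 'fill' has run, the per-element check is provably dead, so a check-free loop can be selected up front.

import numpy as np

def has_nan_work(array, nan_treatment='interpolate', fill_value=0.0):
    # After 'fill' no NaNs can remain, so the per-element isnan()
    # test inside the convolution loop can be dropped entirely.
    if nan_treatment == 'fill':
        array = np.where(np.isnan(array), fill_value, array)
        return array, False  # select a branch-free (vectorizable) loop
    return array, True       # keep the isnan() check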

Convolution Notes

References:

Astropy:

SciPy:

Other:

Perf:

Driver:

mprof run --include-children ../convolve.py

import astropy
print(astropy.version.version)

import astropy.convolution as astroconv
import scipy.ndimage.filters as sciconv
import scipy.signal as scisig
#from scipy.ndimage.filters import convolve
import numpy as np
import gputools

iLength = 10000
jLength = iLength
image = np.random.random((iLength, jLength))

size = 111
ker = astroconv.Gaussian2DKernel(20, x_size=size, y_size=size)
#for 10k+1 x 10k+1 tests
#ker = np.random.random((iLength+1, jLength+1))

#smoothed = scisig.fftconvolve(image, ker.array)
#smoothed = astroconv.convolve_fft(image, ker, allow_huge=True)

#smoothed = sciconv.convolve(image, ker.array, mode='wrap')
smoothed = astroconv.convolve(image, ker, boundary='wrap')
#smoothed = gputools.convolve(image, ker.array)

Benchmark

Env

  • python 3.5.4-he720263_23
  • numpy 1.13.3-py35hfd7066c_0
  • scipy 1.0.0-py35h8b35106_0
  • astropy 2.0.3 (built on the box below with clang & -O3); v3.0rc2 has no significant code changes.
  • libcxx 4.0.1-h579ed51_0
  • libcxxabi 4.0.1-hebd6815_0
  • clang Apple LLVM version 7.3.0 (clang-703.0.31)
  • memory-profiler 0.50.0-pip

Box

  • Model Name: Mac Pro
  • Model Identifier: MacPro6,1
  • Processor Name: 6-Core Intel Xeon E5
    • Processor Speed: 3.5 GHz
    • Number of Processors: 1
    • Total Number of Cores: 6 (12 w/ hyperthreading ON, the default)
    • L2 Cache (per Core): 256 KB
    • L3 Cache: 12 MB
  • Memory: 32 GB
  • Boot ROM Version: MP61.0116.B25
  • SMC Version (system): 2.20f18
  • Illumination Version: 1.4a6
  • Serial Number (system): F5KS302VF9VN
  • Hardware UUID: 549FB335-E005-524B-85C3-B582CA595478
  • Disk: APPLE SSD SM1024G
  • GPU: Dual AMD FirePro D300
    • 2GB of GDDR5 VRAM (each)
    • 1280 stream processors
    • 256-bit-wide memory bus
    • 160GB/s memory bandwidth
    • 2 teraflops performance

Direct

astropy.convolution.convolve(array, kernel, boundary='fill', fill_value=0.0,
                             nan_treatment='interpolate', normalize_kernel=True,
                             mask=None, preserve_nan=False,
                             normalization_zero_tol=1e-08)
scipy.ndimage.filters.convolve(input, weights, output=None, mode='reflect', cval=0.0, origin=0)

NOTE: the above two functions and their tests utilize only a single thread.

gputools.convolve(data, h, res_g=None, sub_blocks=None)

[Plots: figure_1-2, direct_conv_bench]

FFT

astropy.convolution.convolve_fft(array, kernel, boundary='fill', fill_value=0.0,
                                 nan_treatment='interpolate', normalize_kernel=True,
                                 normalization_zero_tol=1e-08,
                                 preserve_nan=False, mask=None, crop=True, return_fft=False,
                                 fft_pad=None, psf_pad=None,
                                 quiet=False, min_wt=0.0, allow_huge=False,
                                 fftn=<function fftn>, ifftn=<function ifftn>,
                                 complex_dtype=<class 'complex'>)
scipy.signal.fftconvolve(in1, in2, mode='full')

NOTE: the above two functions and their tests utilize only a single thread.

[Plots: fft_conv_bench, figure_1, figure_1-1]

NOTE: There appears to be a memory ceiling at ~25GB beyond which mprof is unable to observe. E.g. when running astropy.convolution.convolve_fft(image, kernel, allow_huge=True), the OSX activity monitor records a peak usage of ~47GB with ~22GB compressed, equating to ~25GB "uncompressed". ~6GB is the background usage of other apps open during testing, yielding the 25GB out of the total 32GB available. It appears that mprof is oblivious to this compressed memory.

Results (10k x 10k with 111 x 111)

Memory Check

The minimum memory required for both arrays to be stored in memory:

  • 0.8GB (float64)
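
The 0.8GB figure follows from the image alone at float64; the 111 x 111 kernel is negligible:

image_gb = 10_000 * 10_000 * 8 / 1e9  # 8 bytes per float64 element -> 0.8GB
kernel_gb = 111 * 111 * 8 / 1e9       # ~0.0001GB, negligible
# A 10k+1 square kernel would instead add another ~0.8GB, giving the
# 1.6GB case further below.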

Direct

[Plot: direct_conv_bench]
The above linear increase in memory, as a function of time, is indicative of "leaking" (though it may not be a literal leak).

FFT

[Plot: fft_conv_bench]

Results (10k x 10k with 10k+1 x 10k+1)

NOTE: The plot titles below are incorrect for the kernel size: they are missing the +1 in both dims.

Memory Check

The minimum memory required for both arrays to be stored in memory:

  • 1.6GB (float64)

Direct

Estimates assuming linear scaling:

  • Astropy default (None): ~141 days. Actual memory used ~2.3GB (incomplete run).
  • Scipy default (reflect): ~101.5 days. Immediately fails with MemoryError; ~3GB reached at fail time.
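
A rough reconstruction of how such an estimate follows from linear scaling (the ~24.5min baseline is an assumption, inferred from the 10k x 10k with 111 x 111 direct runs elsewhere in these notes; direct convolution cost scales with the number of kernel elements):

baseline_minutes = 24.5                           # assumed 111 x 111 runtime
area_ratio = (10_001 ** 2) / (111 ** 2)           # ~8118x more kernel elements
print(baseline_minutes * area_ratio / (60 * 24))  # ~138 days, near the ~141 quoted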

FFT

NOTE: effectively the PSF deconvolution benchmark.

  • Scipy default (full):
    [Plot: scipy]
  • Astropy default (fill): Aborted by the OS with signal 9. The OSX activity monitor indicated significantly higher memory consumption than mprof did; this was also true when not profiled, so it cannot be attributed to mprof overhead. Approx. 70GB of memory was recorded, with ~50GB compressed.

[Plot: figure_1]

convolution.convolve() Cython code hoist optimizations

Hoist index computations

astropy/convolution/boundary_none.pyx::convolve2d_boundary_none()

...
        # Now run the proper convolution
        for i in range(wkx, nx - wkx):
            for j in range(wky, ny - wky):
                top = 0.
                bot = 0.
                for ii in range(i - wkx, i + wkx + 1):
                    for jj in range(j - wky, j + wky + 1):
                        val = f[ii, jj]
                        ker = g[<unsigned int>(nkx - 1 - (wkx + ii - i)),
                                <unsigned int>(nky - 1 - (wky + jj - j))]
                        if not npy_isnan(val):
                            top += val * ker
                            bot += ker
                if normalize_by_kernel:
                    if bot == 0:
                        conv[i, j] = f[i, j]
                    else:
                        conv[i, j] = top / bot
                else:
                    conv[i, j] = top
...

It is (at least) more performant written as the following (though the compiler probably, hopefully, already optimizes it as such anyhow):

...
    cdef unsigned int i, j, ii, jj
    cdef unsigned int nkx_minus_1 = nkx - 1, nky_minus_1 = nky - 1
    cdef unsigned int wkx_minus_i, wky_minus_j
    cdef unsigned int ker_i, ker_j
    cdef unsigned int ny_minus_wky = ny - wky
    cdef unsigned int i_minus_wkx, wkx_plus_1 = wkx + 1
    cdef unsigned int j_minus_wky, wky_plus_1 = wky + 1
    cdef unsigned int i_plus_wkx_plus_1, j_plus_wky_plus_1
    cdef unsigned int nkx_minus_1_minus_wkx_plus_i, nky_minus_1_minus_wky_plus_j

    cdef int iimin, iimax, jjmin, jjmax
    cdef DTYPE_t top, bot, ker, val

    # release the GIL
    with nogil:

        # Now run the proper convolution
        for i in range(wkx, nx - wkx):
            # wkx - i
            wkx_minus_i = wkx - i
            # i - wkx
            i_minus_wkx = i - wkx
            # i + wkx + 1
            i_plus_wkx_plus_1 = i + wkx_plus_1
            # nkx - 1 - (wkx - i)
            nkx_minus_1_minus_wkx_plus_i = nkx_minus_1 - wkx_minus_i
            for j in range(wky, ny_minus_wky):
                # wky - j
                wky_minus_j = wky - j
                j_minus_wky = j - wky
                j_plus_wky_plus_1 = j + wky_plus_1
                nky_minus_1_minus_wky_plus_j = nky_minus_1 - wky_minus_j

                top = 0.
                bot = 0.
                for ii in range(i_minus_wkx, i_plus_wkx_plus_1):
                    # nkx - 1 - (wkx + ii - i)
                    ker_i = nkx_minus_1_minus_wkx_plus_i - ii
                    for jj in range(j_minus_wky, j_plus_wky_plus_1):
                        # nky - 1 - (wky + jj - j)
                        ker_j = nky_minus_1_minus_wky_plus_j - jj
                        val = f[ii, jj]
                        ker = g[ker_i, ker_j]
                        if not npy_isnan(val):
                            top += val * ker
                            bot += ker
                if normalize_by_kernel:
                    if bot == 0:
                        conv[i, j] = f[i, j]
                    else:
                        conv[i, j] = top / bot
                else:
                    conv[i, j] = top
    # GIL acquired again here
    return conv
...

Hoist kernel.sum()

c.f. #4

bot += ker (the sum of the kernel) is re-computed for every element in the image. This is unnecessary. Since bot is reset to 0 for each element, it is highly unlikely that the C compiler is clever enough to optimize this out by hoisting it above the outermost loop.

Hmmm, maybe not. The interpolation scheme seems to prevent this hoist and is the reason that the kernel summation exists in the Cython code. If the image element is NaN then the kernel element is not incorporated in the sum. (Only an issue for NaN interpolation.)
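
A window-level sketch of why (plain NumPy; names illustrative only):

import numpy as np

def window_kernel_sum(image_window, kernel, nan_interpolate):
    # With interpolation, kernel elements under NaN image values drop
    # out of the sum, so it genuinely varies per window and cannot be
    # hoisted; otherwise it is the same constant for every pixel.
    if nan_interpolate:
        return kernel[~np.isnan(image_window)].sum()
    return kernel.sum()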

Shortcircuit normalize_by_kernel

c.f. #4

If this were straight C, I would make this function inline and then wrap it with two functions, or a single one conditioned on normalize_by_kernel, so as to short-circuit the internal conditional pre-call. Since the function is then inline, the C compiler would (should) remove the conditional from within the loop altogether. This way the performance increase can be obtained whilst keeping the code in a single place, removing duplication and aiding readability and maintenance. (This is exactly the pattern used in the C investigation at the end of these notes: comp_conv() calls comp_conv_inline() with a compile-time constant nan_interpolate.)

Kernel normalization treatment

Untangle kernel re-normalization & NaN interpolation

  • The following seems untrue, unnecessary, and inefficient (src)
 # Because the Cython routines have to normalize the kernel on the fly, we
 # explicitly normalize the kernel here, and then scale the image at the
 # end if normalization was not requested.

Hmmm, maybe not. The interpolation scheme seems to prevent this hoist and is the reason that the kernel summation exists in the Cython code. If the image element is NaN then the kernel element is not incorporated in the sum.

A silver lining is that the above is only true for nan_treatment = 'interpolate'. It can, therefore, be optimized out for all other options. NaN interpolation is, however, the default.

This can also be checked before convolving to determine whether NaN values even exist.
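
A sketch of such a pre-check (nan_interpolation_needed is a hypothetical helper); np.isnan(np.sum(array)) probes for NaNs in a single pass, since the sum is NaN iff any element is:

import numpy as np

def nan_interpolation_needed(array, nan_treatment='interpolate'):
    # The sum is NaN iff at least one element is NaN.
    return nan_treatment == 'interpolate' and bool(np.isnan(np.sum(array)))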

Use C division to prevent requiring the GIL during loop

The following code excerpt from a Cython extension is an example where the division remains Python division rather than C division. The issue is that Python division must reacquire the GIL (to be able to raise on a zero denominator); this happens inside a heavily nested loop and is therefore not performant for threaded use.

src

...
    # release the GIL
    with nogil:

        # Now run the proper convolution
        for i in range(wkx, nx - wkx):
            for j in range(wky, ny - wky):
                top = 0.
                bot = 0.
                for ii in range(i - wkx, i + wkx + 1):
                    for jj in range(j - wky, j + wky + 1):
                        val = f[ii, jj]
                        ker = g[<unsigned int>(nkx - 1 - (wkx + ii - i)),
                                <unsigned int>(nky - 1 - (wky + jj - j))]
                        if not npy_isnan(val):
                            top += val * ker
                            bot += ker
                if normalize_by_kernel:
                    if bot == 0:
                        conv[i, j] = f[i, j]
                    else:
                        conv[i, j] = top / bot  # <-- Python division: reacquires the GIL
                else:
                    conv[i, j] = top
    # GIL acquired again here
    return conv

This can be seen using Cython's annotation feature, i.e. cython -a boundary_none.pyx

[Screenshot: annotated Cython output, Python division requiring the GIL]

After specifying the use of C division with the decorator @cython.cdivision(True) the GIL is no longer involved.

[Screenshot: annotated Cython output, C division, no GIL involved]

astropy#3949 is the original issue related to releasing the GIL for this code. It relates to performant usage with Dask, c.f. the previous author's benchmark.

Test

Utilizing the Jupyter notebook test code from the following benchmark.

import numpy as np
import dask.array as da
from astropy.convolution import convolve, Gaussian2DKernel

Nz, Ny, Nx = (64, 1024, 2048)
shape = (Nz, Ny, Nx)
big_numpy_array = np.random.rand(*shape)

cs = (1, Ny, Nx)
big_dask_array = da.from_array(big_numpy_array, chunks=cs)

def filter_func_ap(b):
    ker = Gaussian2DKernel(1)  #, x_size=111, y_size=111)
    return convolve(b.squeeze(), ker, boundary=None)[np.newaxis, :, :]

%time big_dask_array.map_blocks(filter_func_ap).sum().compute()

%timeit filter_func_ap(big_numpy_array[0])

  • With Python division (@cython.cdivision(False)) the 1st test ran with an average wallclock time of 2.6s (over 10 runs); the second test, an average of 216ms/loop.
  • With C division (@cython.cdivision(True)) the 1st test ran with an average wallclock time of 2.47s (over 10 runs); the second test, an average of 206ms/loop.

The use of C division, and thus not having to reacquire the GIL for this op, shaves ~10ms off per loop. This is a 4.9% performance increase (per loop). The benchmark to release the GIL reported a per-loop performance increase of 6.6%; however, I believe this to be bogus. When testing the with gil: version myself, the time per loop is ~206ms. Effectively there is no per-loop performance difference between NOT releasing the GIL + Python division and releasing the GIL + C division. This makes perfect sense: in the former test the GIL is never released, so it does not have to be reacquired within the nested loop for the division op, whereas it does have to be reacquired for the division op when releasing the GIL and using Python division.
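
For reference, the quoted per-loop gain checks out against the timings above:

py_div, c_div = 216e-3, 206e-3            # s/loop from the two runs above
print(f"{(py_div - c_div) / c_div:.1%}")  # -> 4.9%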

To Do

  • Investigate any numerical differences present in the results between Python and C division. (None found.)

Investigate direct convolution in C, NO (P|C)ython

See code at end.

Using all C, with no Python or Cython, there was effectively zero performance gain. For boundary=None (the only boundary code actually tested) the runtimes are almost identical. However, this is only true when compiling with full optimization, -O3 (the default opt level for building Astropy). With no optimization, the runtimes increase drastically, taking 5.6x longer.

Note: it is still better to port to straight C, as the Cython OpenMP functionality is effectively nonexistent.

This comparison also aids #6 (investigate memory allocation issue): all memory is allocated upfront in this C code and there is no notable perf difference, showing that there is no advantage to having Python or Cython do so either.

clang vs gcc

10kx10k image and 111x111 kernel (standard benchmark throughout)

~19min with gcc, vs ~18min with clang, making clang ~6% faster.

Optimizations

Using only Clang here as it has better optimization reporting.

Use -Rpass-missed=.* to identify optimizations that were not made, and -Rpass-analysis=.* for more verbose explanations as to why they were not made. With -O3 this reports the following (just -Rpass-missed=.*, and only those relevant to comp_conv_inline):

conv.c:186:35: remark: loop not vectorized: use -Rpass-analysis=loop-vectorize for more info [-Rpass-missed=loop-vectorize]
            j_plus_wky_plus_1 = j + wky_plus_1; // j + wky + 1
                                  ^
conv.c:198:41: remark: loop not vectorized: use -Rpass-analysis=loop-vectorize for more info [-Rpass-missed=loop-vectorize]
                    val = f->data[ii*ny + jj]; //[ii, jj];
                                        ^
conv.c:209:29: remark: loop not vectorized: cannot prove it is safe to reorder floating-point operations; allow reordering by specifying
      '#pragma clang loop vectorize(enable)' before the loop or by providing the compiler option '-ffast-math'.
      [-Rpass-analysis=loop-vectorize]
                        top += val * ker;
                            ^

The compiler analysis states that the top two loops above would not be beneficial to unroll and interleave.

  • Adding #pragma clang loop vectorize(enable) as suggested actually yields a ~40% slower runtime, effectively undoing the original performance gain. I'm not sure why this is true; it is not because it prevents the inlining from removing the constant conditions (tested by manually removing these).
  • Using -ffast-math: if not interpolating NaN values (present or not) there is no impact on performance. However, when doing so (present or not) there is a significant performance hit: it runs ~72% slower. Comparing the compiler output using -Rpass-analysis=loop-vectorize shows that without fast-math neither the true nor the false branch of the wrapper is vectorized, the cost model indicating no benefit. With fast-math the true branch (for NaN interpolation) is vectorized. Cost models can sometimes suffer from too wide a hysteresis; this is clearly an example of that.

The full C test code referenced above ("see code at end") follows:
#include <stdlib.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <assert.h>
#include <time.h>

typedef struct {
    unsigned nx;
    double * data;
} Array;

void comp_conv(Array * conv, const Array * f, const Array * g, const bool nan_interpolate);
void comp_conv_inline(Array * conv, const Array * f, const Array * g, const bool nan_interpolate);

void writeBin(const Array * array, const char * fname);
double timeDiff(clock_t start);


void comp_conv(Array * conv, const Array * f, const Array * g, const bool nan_interpolate)
{
    if (nan_interpolate)
        comp_conv_inline(conv, f, g, true);
    else
        comp_conv_inline(conv, f, g, false);
}

void writeBin(const Array * array, const char * fname)
{
    printf("Writing data to file '%s'\n", fname);
    fflush(stdout);
    FILE * fp = fopen(fname, "wb");
    assert(fp);

    unsigned nx = array->nx;
    size_t count = (size_t)(nx*nx);
    assert(fwrite(array->data, sizeof(*array->data),count, fp) == count);
    assert(!fclose(fp));
}

inline double timeDiff(clock_t start)
{
    return (double)(clock() - start)/CLOCKS_PER_SEC;
}

int main(int argc, char *argv[])
{
    if (argc < 5)
    {
        printf("not enough params");
        return 1;
    }

    const unsigned nx = atoi(argv[1]);
    const unsigned ny = atoi(argv[1]);
    
    const unsigned nkx = atoi(argv[2]);
    const unsigned nky = atoi(argv[2]);

    const bool nan_interpolate = (bool)atoi(argv[3]);
    const bool normalize = (bool)atoi(argv[4]);

    clock_t start = clock();

    printf("nx = %d\n", nx);
    printf("nkx = %d\n", nkx);

    
    Array f,g,c;

    f.nx = nx;
    g.nx = nkx;
    c.nx = nx;

    f.data = malloc(nx*ny*sizeof(*f.data));
    assert(f.data);
    g.data = malloc(nkx*nky*sizeof(*g.data));
    assert(g.data);
    c.data = calloc(nx*ny, sizeof(*c.data));
    assert(c.data);

    for (unsigned i = 0; i < nx; ++i)
    {
        for (unsigned j = 0; j < nx; ++j)
        {
            f.data[i*ny + j] = rand() % 100;
        }
    }

    double k_sum = 0.;
    for (unsigned i = 0; i < nkx; ++i)
    {
        for (unsigned j = 0; j < nkx; ++j)
        {
            double val = 1. + rand() % 10;
            g.data[i*nky + j] = val;
            k_sum += val;
        }
    }
    printf("k sum = %lf\n", k_sum);
    printf("data setup time = %lf (s)\n", timeDiff(start));
    fflush(stdout);

    comp_conv(&c,&f,&g,nan_interpolate);
    
    if (normalize)
    {
        if (!nan_interpolate)
        {
            for (unsigned i = 0; i < nx; ++i)
            {
                for (unsigned j = 0; j < nx; ++j)
                {
                    c.data[i*ny + j] /= k_sum;
                }
            }
        }
    }
    else
    {
        if (nan_interpolate)
        {
            for (unsigned i = 0; i < nx; ++i)
            {
                for (unsigned j = 0; j < nx; ++j)
                {
                    c.data[i*ny + j] *= k_sum;
                }
            }
        }
    }

#if false
    start = clock();
    writeBin(&f, "image.dat");
    writeBin(&g, "kernel.dat");
    writeBin(&c, "conv.dat");  
    printf("data written out time = %lf (s)\n", timeDiff(start));
    fflush(stdout);
#endif

    free(c.data);
    free(f.data);
    free(g.data);
    return 0;
}

inline __attribute__((always_inline)) void comp_conv_inline(Array * conv, const Array * f, const Array * g, const bool nan_interpolate)
{
    const unsigned nx = conv->nx;
    const unsigned ny = conv->nx;
    const unsigned nkx = g->nx;
    const unsigned nky = g->nx;
    const unsigned wkx = nkx / 2;
    const unsigned wky = nky / 2;
    
    const unsigned int nkx_minus_1 = nkx-1, nky_minus_1 = nky-1;
    unsigned int wkx_minus_i, wky_minus_j;
    unsigned int ker_i, ker_j;
    const unsigned int nx_minus_wkx = nx - wkx;
    const unsigned int ny_minus_wky = ny - wky;
    unsigned int i_minus_wkx;
    const unsigned wkx_plus_1 = wkx + 1;
    unsigned int j_minus_wky;
    const unsigned wky_plus_1 = wky + 1;
    unsigned int i_plus_wkx_plus_1, j_plus_wky_plus_1;
    unsigned int nkx_minus_1_minus_wkx_plus_i, nky_minus_1_minus_wky_plus_j;
    int iimin, iimax, jjmin, jjmax;
    
    double top, bot, ker, val;
    
    const clock_t start = clock();
    
    
    for (unsigned i = wkx; i < nx_minus_wkx; ++i)
    {
        wkx_minus_i = wkx - i; // wkx - i
        i_minus_wkx = i - wkx; //i - wkx
        i_plus_wkx_plus_1 = i + wkx_plus_1; // i + wkx + 1
        nkx_minus_1_minus_wkx_plus_i = nkx_minus_1 - wkx_minus_i; // nkx - 1 - (wkx - i)
    
        for (unsigned j = wky; j < ny_minus_wky; ++j)
        {
            wky_minus_j = wky - j; // wky - j
            j_minus_wky = j - wky; // j - wky
            j_plus_wky_plus_1 = j + wky_plus_1; // j + wky + 1
            nky_minus_1_minus_wky_plus_j = nky_minus_1 - wky_minus_j; // nky - 1 - (wky - j)
            top = 0.;
            if (nan_interpolate)
                bot = 0;
            for (unsigned ii = i_minus_wkx; ii < i_plus_wkx_plus_1; ++ii)
            {
                ker_i = nkx_minus_1_minus_wkx_plus_i - ii; // nkx - 1 - (wkx + ii - i)
   //#pragma clang loop vectorize(enable) // takes twice as long with this set, values are correct though.
                for (unsigned jj = j_minus_wky; jj < j_plus_wky_plus_1; ++jj)
                {
                    ker_j = nky_minus_1_minus_wky_plus_j - jj; // nky - 1 - (wky + jj - j)
                    val = f->data[ii*ny + jj]; //[ii, jj];
                    ker = g->data[ker_i*nky + ker_j]; // [ker_i, ker_j];
                    if (nan_interpolate)
                    {
                        if (!isnan(val))
                        {
                            top += val * ker;
                            bot += ker;
                        }
                    }
                    else
                        top += val * ker;
                }
            }
            if (nan_interpolate)
            {
                if (bot == 0) // This should prob be np.isclose(kernel_sum, 0, atol=normalization_zero_tol)
                    conv->data[i*ny + j]  = f->data[i*ny + j] ;
                else
                    conv->data[i*ny + j]  = top / bot;
            }
            else
                conv->data[i*ny + j] = top;
        }
    }
    
    printf("core comp time = %lf (s)\n", timeDiff(start));
    fflush(stdout);
}

Investigate linear memory increase of convolution.convolve()

The tests of astropy.convolution.convolve() conducted in issue #1 are shown below.

Is this linear increase in memory over time a memory "leak"?

  • An actual leak?
  • temp loop memory allocations not being collected?
  • ...?

Considering that all runs reach the exact same total memory usage, just at later times, this is an indication that memory is not leaking. If it were, usage would be an actual function of time, i.e. runs completing sooner would leak less.

Testing the following:

import astropy
import astropy.convolution as astroconv
import numpy as np
import gc
import time

iLength = 10000
jLength = iLength
image = np.random.random((iLength, jLength))

size = 111
ker = astroconv.Gaussian2DKernel(20, x_size=size, y_size=size)

astroconv.convolve(image, ker, boundary=None)

gc.collect()
time.sleep(60)

Running with mprof run -C test.py yields the following:

[Plot: figure_1-2]

showing that memory is NOT physically leaking beyond Python.

Fudging the code to force Python to allocate all of the required memory up front seems to have no impact on the performance. It is possible that there is still an overhead preventing the nested convolution computation from gaining a perf boost from fast memory-block access.

The fudge was simply to fill conv with random numbers rather than constant zeros (src), i.e.

cdef np.ndarray[DTYPE_t, ndim=2] conv = np.zeros([nx, ny], dtype=DTYPE)
->
cdef np.ndarray[DTYPE_t, ndim=2] conv = np.random.random((nx,ny))

[Plot: figure_1-2]

This run took ~23.7min. Faster than before but not by much, ~3.3%.

Convolution perf notes

Perf boosts

  • #3 NaN treatment
  • #4 Kernel normalization treatment
  • Investigate Cython routines for unnecessary persistent Python (code that does NOT generate C).
    c.f. generated Cython inspection (e.g. cython -a file.pyx)
  • #6 Investigate possible mem leaks as indicated from #1 results.
  • #5 Python vs C division, the GIL, and DASK
  • #7 manually hoist computations
  • look into this.
  • #9 Direct C investigation

Other Issues

  • convolve()'s param boundary has default 'fill' whilst the doc states the default is None. Since the default fill_value = 0., the two are effectively the same. However, if the user accidentally changes fill_value, that value will be used, because boundary=None is NOT the actual default: the internal routine convolve<N>d_boundary_fill() is called and not convolve<N>d_boundary_none(). c.f. #10. (See the sketch after this list.)
  • The doc for convolve() states that an exception will be raised for nan_treatment='interpolate' if the kernel sums to zero and therefore cannot be normalized. This is not actually true; no exception is raised. Note: not yet tested, code inspection only.
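
A quick illustration of the boundary/fill_value pitfall from the first bullet (a sketch against the signature documented above; outputs not verified here):

import numpy as np
from astropy.convolution import convolve

image = np.ones((5, 5))
kernel = np.ones((3, 3)) / 9.

a = convolve(image, kernel)                 # actual defaults: boundary='fill', fill_value=0.
b = convolve(image, kernel, fill_value=5.)  # still 'fill': the changed fill_value now shapes the edges
c = convolve(image, kernel, boundary=None)  # what the doc implies is the default; fill_value ignored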
