Here is a suggested API for fft work: <div class="snippet-clipboard-content notran

<div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clip

FFT API rewrite about lib_dsp HOT 15 CLOSED

xmos commented on July 28, 2024

FFT API rewrite

from lib_dsp.

Comments (15)

larry-xmos commented on July 28, 2024

Katja from http://www.katjaas.nl calls the unmangle/mangle operations just split spectrum and merge spectrum

from lib_dsp.

andrewstanfordjason commented on July 28, 2024

that's much better name. We'll go with that. Thanks. So the API will be:


void lib_dsp_fft_bit_reverse( lib_dsp_fft_complex_t pts[], unsigned N );
void lib_dsp_fft_inverse(lib_dsp_fft_complex_t pts[], unsigned N, const int32_t sine[]);
void lib_dsp_fft_forward(lib_dsp_fft_complex_t pts[], unsigned N, const int32_t sine[]);
void lib_dsp_fft_split_spectrum(lib_dsp_fft_complex_t pts[], unsigned N);
void lib_dsp_fft_merge_spectrum(lib_dsp_fft_complex_t pts[], unsigned N);

from lib_dsp.

andrewstanfordjason commented on July 28, 2024

Do we really need the 16 bit versions too? If we instead had an efficient 16 -> 32 and 32 -> 16 array converter functions then we could do away with the 16 bit implementations at the cost of some run time memory (one length N 32 bit array). The benefit, other than implementation, would be increased precision on the 16 bit version.

from lib_dsp.

andrewstanfordjason commented on July 28, 2024

void lib_dsp_fft_short_to_long(lib_dsp_fft_complex_short_t pts_in[], lib_dsp_fft_complex_t pts_out[], unsigned N);

void lib_dsp_fft_long_to_short(lib_dsp_fft_complex_t pts_in[], lib_dsp_fft_complex_short_t pts_out[], unsigned N);

from lib_dsp.

henkmuller commented on July 28, 2024

I love that guys graphics.

On 8 Mar 2016, at 12:10, Larry Snizek <[email protected]mailto:[email protected]> wrote:

Katja from http://www.katjaas.nl calls the unmangle/mangle operations just split spectrum and merge spectrum

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/25#issuecomment-193757874.

from lib_dsp.

johned0 commented on July 28, 2024

ASJ - you might have a good point about only providing the 32 bit versions and then converting to 16 bit. In addition to the increased precision (which has been a big problem for Thomas, with the 16 bit version), this might actually also be more efficient than the direct 16 bit versions because the 16 bit versions are a lot slower, from Thomas' benchmarks.
Maybe we should get some feedback from Thomas and customer ?
J

from lib_dsp.

johned0 commented on July 28, 2024

Love the graphics too, Henk

from lib_dsp.

andrewstanfordjason commented on July 28, 2024

If someone could contact them then that would be good. I want to get
this done asap as I want to use it properly soon. Thanks

On 08/03/2016 14:05, John Edwards wrote:

ASJ - you might have a good point about only providing the 32 bit
versions and then converting to 16 bit. In addition to the increased
precision (which has been a big problem for Thomas, with the 16 bit
version), this might actually also be more efficient than the direct
16 bit versions because the 16 bit versions are a lot slower, from
Thomas' benchmarks.
Maybe we should get some feedback from Thomas and AISpeech ?
J

—
Reply to this email directly or view it on GitHub
#25 (comment).

from lib_dsp.

johned0 commented on July 28, 2024

I can do that, Andrew. I've reviewed Thomas' emails and think I've got a handle on the performance differences. I'll let you know

from lib_dsp.

andrewstanfordjason commented on July 28, 2024

The accuracy will be very limited for a 16 bit implementation with the
current algorithm. However, if we go to the 32 bit approach then this
issue will be gone. I've written automated tests to verify correctness
and accuracy of the current implementation(32 bit). Forward is good
backwards is bad!

On 08/03/2016 14:22, John Edwards wrote:

I can do that, Andrew. I've reviewed Thomas' emails and think I've got
a handle on the performance differences. I'll let you know

—
Reply to this email directly or view it on GitHub
#25 (comment).

from lib_dsp.

ThomasGmeinder commented on July 28, 2024

Hi Guys
The main constraint at customer is memory.
They need to do a 4k FFT on each of the 8 input channels. Sample data is 16 bit.
32 bit lib_dsp_fft_forward_tworeals was prohibitive as that would need 512k just for buffers:
4096 * 8 (channels) * 4 (bytes per sample) * 2 (re+im) * 2 (double buffers).
That was the motivation to implement the 16-bit version. But that would still need 256k and thus require partitioning the processing on two tiles.
And not leave much memory for the additional temporary 32-bit buffer: 4k_4_2 = 32k. I'm assuming a double buffer would be needed for performance.

But:
Now that the method using the halved spectrum was proposed by Andrew and accepted by customer, the buffer requirements are relaxed:
4096 * 8 (channels) * 2 (bytes per sample) * 2 (double buffers) = 128 kB
That means the additional temporary buffer proposed to convert from short int to int before the FFT should fit.
An additional advantage would be that this temp buffer is only needed on the one tile.

Here's the test for the forward FFT using the new method:
https://github.com/ThomasGmeinder/lib_dsp/blob/master/AN00209_xCORE-200_DSP_Library/app_transforms/src/app_transforms.xc#L279

Regarding Performance:
The 16 bit FFT lib_dsp_fft_forward_complex_shortfor 4k points takes 796855 cycles (7.9ms)
The 32 bit version lib_dsp_fft_forward_complex takes 524504 cycles (5.25ms)

The overhead of 2.54ms is due to unpacking short int from memory into int (for the maccs, + and - of the butterfly)
and then back into into short int before packing and storing to memory.

The function Andrew proposes to copy a 16-bit short int array in a 32-bit int array should be faster than that.
Something like this:
copy_loop:
//load 16-bit im and re from lib_dsp_fft_complex_t
{ld16s r6, r4[r2]; sub r2, r2, 1} // r6: re; decrement offset
{ld16s r3, r4[r2]; sub r2, r2, 1} // r3: im; decrement offset
std r3, r6, r5[r6] // store 32-bit im and re to lib_dsp_fft_complex_t
{bt r2, copy_loop; sub r6, r6, 1}

would take: 16384 cycles (0.164 ms).
Probably an additional cycle due to FNOP but that would still be a lot faster (0.205 ms).

Bottom line:
Performance and accuracy would be better.
The expense would be an additional 32k temp buffer which seems a worthy tradeoff.

But because memory is so tight for DOA+Beamforming+AEC (ideally also Keyword Recognition) we should also get feedback from customer.

Cheers, /T

from lib_dsp.

johned0 commented on July 28, 2024

Thanks very much, Thomas,
I'll add your comments to the case, for Knight to check.
Enjoy the snow :-)
John

from lib_dsp.

ThomasGmeinder commented on July 28, 2024

I implemented and tested the array conversion functions:
void lib_dsp_fft_short_to_long(lib_dsp_fft_complex_short_t pts_in[], lib_dsp_fft_complex_t pts_out[], unsigned N);
void lib_dsp_fft_long_to_short(lib_dsp_fft_complex_t pts_in[], lib_dsp_fft_complex_short_t pts_out[], unsigned N);

Test:
https://github.com/ThomasGmeinder/lib_dsp/blob/master/AN00209_xCORE-200_DSP_Library/app_fft_short_int/src/app_fft_short_int.xc#L392

Performance (cycles per im/re data point):
5 cycles (incl 1 FNOP):
https://github.com/ThomasGmeinder/lib_dsp/blob/master/lib_dsp/src/lib_dsp_fft_short_to_long.S#L27

5 cycles:
https://github.com/ThomasGmeinder/lib_dsp/blob/master/lib_dsp/src/lib_dsp_fft_long_to_short.S#L28

from lib_dsp.

larry-xmos commented on July 28, 2024

that's much better name. We'll go with that. Thanks. So the API will be

Should be merge_spectra rather than merge_spectrum

from lib_dsp.

andrewstanfordjason commented on July 28, 2024

all done now.

from lib_dsp.

FFT API rewrite about lib_dsp HOT 15 CLOSED

Comments (15)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent