doonny / pipecnn
An OpenCL-based FPGA Accelerator for Convolutional Neural Networks
License: Apache License 2.0
Dear Prof. Wang,
We have tried to run your code on the Altera FPGA DE5a-Net-e1, but unfortunately we cannot get the correct result. The result is random every time: sometimes it is fox, sometimes Cardigan or Pomeranian. Could you please help us figure out what went wrong? Thank you so much.
[root@dhcp70 project]# ./run.exe conv.aocx
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs
61063552 total weights read
Loading picture ./data/picture/cat.jpg .....
1024 total output reference read
Platform: Intel(R) FPGA SDK for OpenCL(TM)
Using 1 device(s)
Device 0: de5a_net_e1 : Arria 10 Reference Platform (aclde5a_net_e10)
Device OpenCL Version: OpenCL 1.0 Intel(R) FPGA SDK for OpenCL(TM), Version 16.1
Device Max Compute Units: 1
Device Max WorkGroup Size: 2147483647
Device Max WorkItem Size: 2147483647
Device Global Memory Size: 8192 MBytes
Device Local Memory Size: 16 KBytes
Device Max Clock Freq: 1000 Mhz
Loading kernel/binary from file conv.aocx
Reprogramming device [0] with handle 1
Executing Layer 1:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 27, 27, 96)
Launching kernel lrn with local size: 1, 1, 24 (global size: 27, 27, 24)
Executing Layer 2:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 13, 13, 256)
Launching kernel lrn with local size: 1, 1, 64 (global size: 13, 13, 64)
Executing Layer 3:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 13, 13, 384)
Executing Layer 4:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 13, 13, 384)
Executing Layer 5:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 6, 6, 256)
Executing Layer 6:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 4096)
Executing Layer 7:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 4096)
Executing Layer 8:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 1024)
Copyed all batched results from fc_2 buffers.
Done !!!
Performance Summary
Total runtime: 0.057614s
Kernel runtime summary:
Layer-1:
MemRd: 8.850 ms
Conv : 8.819 ms
Pool : 8.813 ms
MemWr: 8.794 ms
Lrn : 0.643 ms
Layer-2:
MemRd: 14.013 ms
Conv : 13.992 ms
Pool : 13.987 ms
MemWr: 13.969 ms
Lrn : 0.243 ms
Layer-3:
MemRd: 9.407 ms
Conv : 9.365 ms
Pool : 0.000 ms
MemWr: 9.360 ms
Lrn : 0.000 ms
Layer-4:
MemRd: 7.080 ms
Conv : 7.057 ms
Pool : 0.000 ms
MemWr: 7.044 ms
Lrn : 0.000 ms
Layer-5:
MemRd: 4.782 ms
Conv : 4.751 ms
Pool : 4.748 ms
MemWr: 4.735 ms
Lrn : 0.000 ms
Layer-6:
MemRd: 2.583 ms
Conv : 2.547 ms
Pool : 0.000 ms
MemWr: 2.551 ms
Lrn : 0.000 ms
Layer-7:
MemRd: 1.223 ms
Conv : 1.199 ms
Pool : 0.000 ms
MemWr: 1.193 ms
Lrn : 0.000 ms
Layer-8:
MemRd: 0.331 ms
Conv : 0.286 ms
Pool : 0.000 ms
MemWr: 0.290 ms
Lrn : 0.000 ms
Total kernel runtime 48.018 ms
Batch size = 1, average process time per batch: 48.018 ms
Start verifying results ...
Selected item = 0 from the combined batch results in fc buffers
Hi!
I encountered the following problems when I built the Debug configuration for my project on a ZCU102 (system configuration: A53, OpenCL, Linux; runtime: OpenCL; SDx 2017.4, with SDSoC available and SDAccel not available). The warning parts are marked in bold.
I also found that pipe.cl already existed before using pipe_gen.py, though I still ran the script with the arguments (16 8).
This was my first time using the SDx kit; thank you for your help!
---- error part ----
21:19:17 **** Incremental Build of configuration Debug for project pcnn ****
make -j40 incremental
/opt/Xilinx/SDX/SDK/2017.4/gnu/aarch64/lin/aarch64-linux/bin/aarch64-linux-gnu-g++ -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -std=c++14 -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -o "src/common/ocl_util.o" "../src/common/ocl_util.cpp"
/opt/Xilinx/SDX/SDK/2017.4/gnu/aarch64/lin/aarch64-linux/bin/aarch64-linux-gnu-g++ -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -std=c++14 -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -o "src/common/timer.o" "../src/common/timer.cpp"
/opt/Xilinx/SDX/SDK/2017.4/gnu/aarch64/lin/aarch64-linux/bin/aarch64-linux-gnu-g++ -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -std=c++14 -DSDX_PLATFORM=zcu102 -D__USE_XOPEN2K8 -I/opt/Xilinx/SDX/SDx/2017.4/runtime/include/1_2/ -I/opt/Xilinx/SDX/Vivado/2017.4/include/ -O2 -g -Wall -c -fmessage-length=0 -o "src/project/host/main.o" "../src/project/host/main.cpp"
#include "ocl_util.h"
^
compilation terminated.
make: *** Waiting for unfinished jobs....**
../src/common/ocl_util.cpp: In function ‘_cl_program* ocl_util::createProgramFromFile(cl_context, const char*, _cl_device_id* const*, unsigned int)’:
../src/common/ocl_util.cpp:410:22: warning: ignoring attributes on template argument ‘cl_int {aka int}’ [-Wignored-attributes]
scoped_array<cl_int> binary_status(num_devices);
^
21:19:18 Build Finished (took 940ms)
Regards!
Hi Dong,
I have one more question.
Can you explain why you used "accum_piped" in this code?
Why is PIPE_DEPTH = 6?
```c
for(unsigned char ll=0; ll<LANE_NUM; ll++){
    lane_accum[ll] = (MASK_ACCUM & accum_piped[ll][PIPE_DEPTH-1]) + (MASK_MULT & mac(mac_data.lane[ll], mac_weight.lane[ll]));
    // Shift the pipelined registers backwards
    #pragma unroll
    for(unsigned int p=PIPE_DEPTH-1; p>0; p--){
        accum_piped[ll][p] = MASK_ACCUM & accum_piped[ll][p-1];
    }
    // update the first copy
    accum_piped[ll][0] = MASK_ACCUM & lane_accum[ll];
}
```
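Not speaking for the author, but a common reason for this pattern in Intel OpenCL kernels is that a single accumulator creates a loop-carried dependency: the add must finish before the next iteration can start, which raises the loop's initiation interval. Keeping PIPE_DEPTH interleaved partial sums means each copy is only updated once every PIPE_DEPTH iterations, hiding the adder latency. A minimal C sketch of the same idea (the name `piped_sum` is illustrative, not from the repo):

```c
#include <assert.h>

#define PIPE_DEPTH 6

/* Sketch (not the author's code): summing with PIPE_DEPTH interleaved
 * partial accumulators. Each partial sum is touched only every
 * PIPE_DEPTH iterations, so a pipelined loop can keep an initiation
 * interval of 1 even if the adder takes several cycles. */
int piped_sum(const int *data, int n) {
    int acc[PIPE_DEPTH] = {0};
    for (int i = 0; i < n; i++)
        acc[i % PIPE_DEPTH] += data[i];   /* rotate among the copies */
    int total = 0;
    for (int p = 0; p < PIPE_DEPTH; p++)  /* final reduction */
        total += acc[p];
    return total;
}
```

In the kernel, the backward shift of `accum_piped` plays the role of the rotation above, so PIPE_DEPTH = 6 would presumably match the accumulation latency on the target device.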
Thank you
Best regards
Hi, I compiled the project for the DE10-Standard, but the fitter reported that there are not enough LABs. What is the solution?
aoc: First stage compilation completed successfully.
Compiling for FPGA. This process may take a long time, please be patient.
Error (170012): Fitter requires 4243 LABs to implement the design, but the device contains only 4191 LABs
Error: Cannot fit kernel(s) on device
Makefile:135: recipe for target 'conv.aocx' failed
make: *** [conv.aocx] Error 1
Thanks
I am facing the following error when executing PipeCNN on AWS F1.
Loading kernel/binary from file cnnf1_pythonpipe2.awsxclbin
ERROR: ERROR: Memory bank specified for kernel instance "memRead_1" of kernel "memRead" for argument index 21 does not match the physical connectivity from the binary.
Bank specified on host side is "M01_AXI" while bank from the binary is "M00_AXI".
ERROR: clSetKernelArg() for kernel "memRead", argument index 21.
ERROR: CL_INVALID_MEM_OBJECT
Location: ../src/host/main.cpp:730
Failed to set argument 21 kernel memRd
Was I supposed to set any parameter?
full output here
Hi,
I am using the Intel FPGA SDK for OpenCL on an Arria 10 to run PipeCNN (AlexNet). I am getting the error below when I compile the kernel with this command:
$ aoc device/conv_pipe.cl -o bin_fpga/conv_pipe.aocx --board bdw_fpga_v1.0 -v -g
aoc: Environment checks are completed successfully.
You are now compiling the full flow!!
aoc: Selected target board bdw_fpga_v1.0
aoc: Running OpenCL parser....
In file included from :11140:
:2:30: warning: ISO C99 requires whitespace after the macro name
#define ACL_BOARD_bdw_fpga_v1.0 1
^
:3:31: warning: ISO C99 requires whitespace after the macro name
#define AOCL_BOARD_bdw_fpga_v1.0 1
^
2 warnings generated.
aoc: OpenCL parser completed successfully.
aoc: Compiling....
Compiler Error: Unrecognized function call: mult_add_fix8bx4
Error: Optimizer FAILED.
Refer to conv_pipe/conv_pipe.log for details.
When I run the same code on the emulator, it runs fine and gives the expected output.
Why is it not able to recognise the function mult_add_fix8bx4? Should it be compiled separately?
Thanks,
Akash
Hello. As the title said, does PipeCNN support running on Windows 10? Thanks!
Dear doonny,
I'm trying to test the PipeCNN framework on some of Xilinx's embedded FPGA boards in order to take power measurements. I would like to compile the framework for the Digilent ZedBoard, but the synthesized design is too large for this FPGA platform; the XOCC compiler returns this error:
297 RAMB18 and RAMB36/FIFO required but only 280
Could you give me some hints for reducing the BRAM utilization?
Thank You
Hi, I compiled with FLOW=sw_emu and the build succeeded. But when I executed a command like this:
#./run.exe conv.aocx
-bash: ./run.exe: cannot execute binary file: Exec format error
Anyone got an idea?
Thanks very much.
When I build, I get the following errors during the hardware accelerator integration stage.
INFO: [XOCC 60-251] Hardware accelerator integration...
===>The following messages were generated while processing /PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.sim/sim_1/behav :
ERROR: [XOCC 10-426] cannot find port pool_ch15_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:736]
ERROR: [XOCC 10-426] cannot find port pool_ch15_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:735]
ERROR: [XOCC 10-426] cannot find port pool_ch15_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:734]
ERROR: [XOCC 10-426] cannot find port pool_ch14_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:733]
ERROR: [XOCC 10-426] cannot find port pool_ch14_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:732]
ERROR: [XOCC 10-426] cannot find port pool_ch14_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:731]
ERROR: [XOCC 10-426] cannot find port pool_ch13_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:730]
ERROR: [XOCC 10-426] cannot find port pool_ch13_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:729]
ERROR: [XOCC 10-426] cannot find port pool_ch13_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:728]
ERROR: [XOCC 10-426] cannot find port pool_ch12_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:727]
ERROR: [XOCC 10-426] cannot find port pool_ch12_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:726]
ERROR: [XOCC 10-426] cannot find port pool_ch12_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:725]
ERROR: [XOCC 10-426] cannot find port pool_ch11_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:724]
ERROR: [XOCC 10-426] cannot find port pool_ch11_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:723]
ERROR: [XOCC 10-426] cannot find port pool_ch11_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:722]
ERROR: [XOCC 10-426] cannot find port pool_ch10_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:721]
ERROR: [XOCC 10-426] cannot find port pool_ch10_TVALID on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:720]
ERROR: [XOCC 10-426] cannot find port pool_ch10_TDATA on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:719]
ERROR: [XOCC 10-426] cannot find port pool_ch9_TREADY on this module [/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/ipiprj/ipiprj.srcs/sources_1/bd/dr/ipshared/8aa7/hdl/verilog/maxPool.v:718]
ERROR: [XOCC 43-3322] Static elaboration of top level Verilog design unit(s) in library work failed
ERROR: [XOCC 60-399] vivado failed, please see log file for detail: '/PipeCNN/Emulation-HW/_xocc_link_pipecnn/impl/build/hw_em/pipecnn/sv/pipecnn_ipi/vivado.log'
ERROR: [XOCC 60-626] Kernel link failed to complete
ERROR: [XOCC 60-703] Failed to finish linking
make: *** [pipecnn.xclbin] Error 1
21:27:33 Build Finished (took 10m:11s.280ms)
When compiling the de1soc host, the ARM libraries cannot be found. You may need to modify Makefile lines 73-74.
Environment: Windows 64-bit, Quartus 16.1, compiled with the SoC EDS 16.1 Command Shell.
Hi, according to the instructions, the best result for the AlexNet model on the DE1-SoC is 149 ms.
However, the best result I got is only around 450 ms with the following hw configuration.
I did try to increase LANE_NUM to 8, but I got the following error even though the device's resources are not fully used up:
"kernel cannot fit into device"
Could you kindly share the appropriate hw configuration for VEC_SIZE, LANE_NUM, and CONV_GP_SIZE_X?
Thank you
For SDAccel 2017.4, the platform needs to be renamed to xilinx_kcu1500_dynamic_5_0.
There are a lot of warnings like "device/conv_pipe_xilinx.cl:680:708: warning: double precision constant requires cl_khr_fp64, casting to single precision".
Finally, an error message:
ERROR: [XOCC 60-896] For unified platforms, please use -c or -l
ERROR: [XOCC 60-598] Kernel build setup failed to complete
ERROR: [XOCC 60-702] Failed to finish compilation and linking
Makefile:142: recipe for target 'conv.xclbin' failed
make: *** [conv.xclbin] Error 1
I put image.dat, weights.dat, and fc8.dat in the data folder, then built and ran the project in CPU emulation mode. The build finishes successfully. However, when I run the executable, it exits very quickly with no errors. The console output is very short:
***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs
***************************************************
61063552 total weights read
154587 bytes image read
I'm using SDAccel 2017.2 in GUI mode. Why is the output log so short? I don't see any output files created after running the project.
Dear Prof. Wang,
I installed Xilinx's SDAccel, but make reports "aocl: command not found". Can the application be built with SDAccel?
The following error occurred in the SDAccel version (SDK v2017.4) after running the makefile under the project folder.
I think the solution is to change 'xilinx:kcu1500:4ddr-xpr:4.0' to 'xilinx_vcu1525_dynamic_5_0',
but I am not sure whether PipeCNN's environment and behavior would still be valid.
* Error log ----------------------------------------------
ERROR: [XOCC 60-705] No device was found that matches 'xilinx:kcu1500:4ddr-xpr:4.0'. The supported devices are:
xilinx_vcu1525_dynamic_5_0
xilinx_kcu1500_dynamic_5_0
ERROR: [XOCC 60-587] Failed to add a device: specified platform xilinx:kcu1500:4ddr-xpr:4.0 is not found
Makefile:151: recipe for target 'conv.xclbin' failed
make: *** [conv.xclbin] Error 1
------------------------------------------------------------
Regarding the statement that PipeCNN has been tested on the following boards:
may I know where I can get the performance and cost information for these boards, as only the performance of the DE5-Net is listed in the paper?
Thank You
Hello, I have completed all the preceding steps, and the pre-trained model is in the data folder. At the last step, running ./run.exe conv.aocx, nothing happens: no error and no output. Why? Thanks.
Since the width and height of a pooling window can differ, it seems pool_size cannot cover this kind of case.
The same applies to conv_stride if the stride differs between width and height.
When #define XILINX is set, the build generates this error:
../src/host/main.cpp:466:71: error: ‘write_event’ was not declared in this scope
0 /* flags, 0 means from host*/,0, NULL,&write_event[i]);
^~~~~~~~~~~
../src/host/main.cpp: In function ‘int prepare()’:
../src/host/main.cpp:1414:5: warning: this ‘else’ clause does not guard... [-Wmisleading-indentation]
else
^~~~
../src/host/main.cpp:1418:2: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘else’
for(unsigned n = 0; n<layer_config[0][data_n]/VEC_SIZE; n++){
^~~
make: *** [src/host/main.o] Error 1
I uncommented the code for VGG-16 and commented out the code for AlexNet in layer_config.h and main.c.
Here is what I changed in main.c:
// AlexNet
// Original problem size
// File size is in num of DTYPE numbers
//#define IMAGE_FILE_SIZE (227*227*3)
////#define WEIGHTS_FILE_SIZE 60965224 //fc8-1000
//#define WEIGHTS_FILE_SIZE 61063552 //fc8-1024
//#define LAYER_NUM 8
//#define CONV_NUM 5
//const char *weight_file_path = "./data/data_alex/weights.dat";
//const char *input_file_path = "./data/data_alex/image.dat";
//const char *ref_file_path = "./data/data_alex/fc8.dat";
//const char *dump_file_path = "./result_dump.txt";
// VGG16
// Original problem size
// File size is in num of DTYPE numbers
#define IMAGE_FILE_SIZE (224*224*3)
#define WEIGHTS_FILE_SIZE 138455872 //fc8-1024
#define LAYER_NUM 16
#define CONV_NUM 13
const char *weight_file_path = "./data/data_vgg/weights.dat";
const char *input_file_path = "./data/data_vgg/image.dat";
const char *ref_file_path = "./data/data_vgg/fc8.dat";
const char *dump_file_path = "./result_dump.txt";
Here is what I changed in layer_config.h:
// Test with batch=1
// Alexnet Configuration
/*
unsigned layer_config[][NUM_CONFIG_ITEM] = {{0,
227, 227, 3, 11, 11, 3, 96, 96,
0,
55, 55, 96, 4, 0, 0, 1,
1, 27, 27, 96, 3, 2,
1,
1},//Layer-1
{0,
27, 27, 96, 5, 5, 48, 256, 256,
0,
27, 27, 256, 1, 2, 1, 1,
1, 13, 13, 256, 3, 2,
1,
1},//Layer-2
{0,
13, 13, 256, 3, 3, 256, 384, 384,
0,
13, 13, 384, 1, 1, 0, 1,
0, 13, 13, 384, 0, 0,
0,
1},//Layer-3
{0,
13, 13, 384, 3, 3, 192, 384, 384,
1,
13, 13, 384, 1, 1, 1, 1,
0, 13, 13, 384, 0, 0,
0,
0},//Layer-4
{0,
13, 13, 384, 3, 3, 192, 256, 256,
0,
13, 13, 256, 1, 1, 1, 1,
1, 6, 6, 256, 3, 2,
0,
2},//Layer-5 Note: for last conv layer, outputs are write to fc buffer
{1,
6, 6, 256, 6, 6, 256, 4096, 4096, // Note: The input size (dim1/dim2) is the combined data size (batched)
2,
1, 1, 4096, 6, 0, 0, 1,
0, 1, 1, 4096, 0, 0,
0,
3},//Layer-6 fc
{1,
1, 1, 4096, 1, 1, 4096, 4096, 4096,
3,
1, 1, 4096, 1, 0, 0, 1,
0, 1, 1, 4096, 0, 0,
0,
2},//Layer-7 fc
{1,
1, 1, 4096, 1, 1, 4096, 1024, 1024,
2,
1, 1, 1024, 1, 0, 0, 0,
0, 1, 1, 1024, 0, 0,
0,
3}//Layer-8 fc
};
char precision_config[][3] ={{8, 0, -4},//Layer-1
{ 8, 0, -2},//Layer-2
{ 8, 0, -1},//Layer-3
{ 8, -1, -1},//Layer-4
{ 8, -1, -1},//Layer-5
{11, -1, 0},//Layer-6
{10, 0, 2},//Layer-7
{10, 2, 2}//Layer-8
};
unsigned input_config[5] = {227, 227, 3, 1}; //original image size(dim1, dim2, dim3), batch size
//unsigned output_config[3] = {27, 27, 96};//Layer-1
//unsigned output_config[3] = {55, 55, 96};//Layer-1
//unsigned output_config[3] = {13, 13, 256};//Layer-2
//unsigned output_config[3] = {6, 6, 256};//Layer-5
//unsigned output_config[3] = {1, 1, 4096};//Layer-6
unsigned output_config[3] = {1, 1, 1024};//Layer-8 Note: only one result is extracted and verified
*/
// Test with batch=1
// VGG-16 Configuration
unsigned layer_config[][NUM_CONFIG_ITEM] = {{0,
224, 224, 3, 3, 3, 3, 64, 64,
0,
224, 224, 64, 1, 1, 0, 1,
0, 224, 224, 64, 0, 0,
0,
1},//Layer-1 (conv1_1)
{0,
224, 224, 64, 3, 3, 64, 64, 64,
1,
224, 224, 64, 1, 1, 0, 1,
1, 112, 112, 64, 2, 2,
0,
0},//Layer-2 (conv1_2)
{0,
112, 112, 64, 3, 3, 64, 128, 128,
0,
112, 112, 128, 1, 1, 0, 1,
0, 112, 112, 128, 0, 0,
0,
1},//Layer-3 (conv2_1)
{0,
112, 112, 128, 3, 3, 128, 128, 128,
1,
112, 112, 128, 1, 1, 0, 1,
1, 56, 56, 128, 2, 2,
0,
0},//Layer-4 (conv2_2)
{0,
56, 56, 128, 3, 3, 128, 256, 256,
0,
56, 56, 256, 1, 1, 0, 1,
0, 56, 56, 256, 0, 0,
0,
1},//Layer-5 (conv3_1)
{0,
56, 56, 256, 3, 3, 256, 256, 256,
1,
56, 56, 256, 1, 1, 0, 1,
0, 56, 56, 256, 0, 0,
0,
0},//Layer-6 (conv3_2)
{0,
56, 56, 256, 3, 3, 256, 256, 256,
0,
56, 56, 256, 1, 1, 0, 1,
1, 28, 28, 256, 2, 2,
0,
1},//Layer-7 (conv3_3)
{0,
28, 28, 256, 3, 3, 256, 512, 512,
1,
28, 28, 512, 1, 1, 0, 1,
0, 28, 28, 512, 0, 0,
0,
0},//Layer-8 (conv4_1)
{0,
28, 28, 512, 3, 3, 512, 512, 512,
0,
28, 28, 512, 1, 1, 0, 1,
0, 28, 28, 512, 0, 0,
0,
1},//Layer-9 (conv4_2)
{0,
28, 28, 512, 3, 3, 512, 512, 512,
1,
28, 28, 512, 1, 1, 0, 1,
1, 14, 14, 512, 2, 2,
0,
0},//Layer-10 (conv4_3)
{0,
14, 14, 512, 3, 3, 512, 512, 512,
0,
14, 14, 512, 1, 1, 0, 1,
0, 14, 14, 512, 0, 0,
0,
1},//Layer-11 (conv5_1)
{0,
14, 14, 512, 3, 3, 512, 512, 512,
1,
14, 14, 512, 1, 1, 0, 1,
0, 14, 14, 512, 0, 0,
0,
0},//Layer-12 (conv5_2)
{0,
14, 14, 512, 3, 3, 512, 512, 512,
0,
14, 14, 512, 1, 1, 0, 1,
1, 7, 7, 512, 2, 2,
0,
2},//Layer-13 (conv5_3) Note: for last conv layer, outputs are write to fc buffer
{1,
7, 7, 512, 7, 7, 512, 4096, 4096,
2,
1, 1, 4096, 7, 0, 0, 1,
0, 1, 1, 4096, 0, 0,
0,
3},//Layer-14 (fc6)
{1,
1, 1, 4096, 1, 1, 4096, 4096, 4096,
3,
1, 1, 4096, 1, 0, 0, 1,
0, 1, 1, 4096, 0, 0,
0,
2},//Layer-15 (fc7)
{1,
1, 1, 4096, 1, 1, 4096, 1024, 1024,
2,
1, 1, 1024, 1, 0, 0, 0,
0, 1, 1, 1024, 0, 0,
0,
3}//Layer-16 (fc8)
};
char precision_config[][3] ={{7, 0, -2},//Layer-1
{ 8, -2, -5},//Layer-2
{ 8, -5, -5},//Layer-3
{ 8, -5, -6},//Layer-4
{ 7, -6, -7},//Layer-5
{ 8, -7, -7},//Layer-6
{ 8, -7, -7},//Layer-7
{ 8, -7, -6},//Layer-8
{ 8, -6, -5},//Layer-9
{ 8, -5, -5},//Layer-10
{ 9, -5, -4},//Layer-11
{ 9, -4, -3},//Layer-12
{ 8, -3, -2},//Layer-13
{ 8, -2, 0},//Layer-14
{ 7, 0, 2},//Layer-15
{ 7, 2, 2}//Layer-16
};
unsigned input_config[4] = {224, 224, 3, 1};
//unsigned output_config[3] = {224, 224, 64};//Layer-1
//unsigned output_config[3] = {56, 56, 128};//Layer-4(pool2)
//unsigned output_config[3] = {28, 28, 256};//Layer-7(pool3)
//unsigned output_config[3] = {28, 28, 512};//Layer-8(relu4_1)
//unsigned output_config[3] = {28, 28, 512};//Layer-9(relu4_2)
//unsigned output_config[3] = {14, 14, 512};//Layer-10(pool4)
//unsigned output_config[3] = {7, 7, 512};//Layer-13(pool5)
//unsigned output_config[3] = {1, 1, 4096};//Layer-14
unsigned output_config[3] = {1, 1, 1024};//Layer-16
I compiled the project successfully in CPU emulation mode in the SDAccel GUI. However, when I run the project, this error occurs:
***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs
***************************************************
Error: required win_buffer size is 3456, configured size is 2304
Allocate memory for data and weights failed !!!
How can I solve this problem? What else should I change in the code?
I am also trying to understand the host and FPGA code, but I am not able to connect it with the sources I have.
Could you please provide some pointers for me and other readers on this topic? Is there any reference on which you based your code?
Hello, after building, running ./run.exe conv.aocx reports an error saying the weights file cannot be found. But the model files you provided are already in the data directory, and I also tried changing the path in main.cpp to an absolute path; the file still cannot be found.
I hope to get your reply, thanks.
Hello! I am trying to compile the code with the Intel FPGA SDK for OpenCL 17.0 and an Arria 10 board on Windows 10 using mingw-w64. I got an error when the makefile ran a command like:
g++ ./host/main.o ../common/ocl_util.o ../common/timer.o -o run.exe -LC:/intelFPGA_pro/17.0/hld/board/a10_ref/windows64/lib -LC:/intelFPGA_pro/17.0/hld/host/windows64/lib -laltera_a10_ref_mmd -lalteracl -lacl_emulator_kernel_rt -lpkg_editor -llibelf -lacl_hostxml
and the linker reported errors like:
C:/intelFPGA_pro/17.0/hld/host/windows64/lib/alteracl.lib(d:/SJ/nightly/17.0/290/w64/acds/hld/obj/windows64/acl/acl_program.obj).text[l_build_from_source_in_dir]+0xa2): undefined reference to `__imp__wassert'
C:/intelFPGA_pro/17.0/hld/host/windows64/lib/alteracl.lib(d:/SJ/nightly/17.0/290/w64/acds/hld/obj/windows64/acl/acl_program.obj).text[l_load_binary_pkg]+0xb36): undefined reference to `__security_check_cookie'
C:/intelFPGA_pro/17.0/hld/host/windows64/lib/alteracl.lib(d:/SJ/nightly/17.0/290/w64/acds/hld/obj/windows64/acl/acl_program.obj).xdata[$unwind$l_compute_hash]+0x10): undefined reference to `__GSHandlerCheck'
(The output is too long and was truncated; it just repeats these three kinds of errors.)
Anyone has an idea? Thanks in advance!!
Dear doonny,
I'm very interested in testing your PipeCNN on my Zynq UltraScale+ ZCU102.
I compiled the source code with Xilinx SDSoC v2017.1 and the zcu102_es1_ocl platform; then, before launching PipeCNN, I ran these commands:
cd /mnt
cp libxilinxopencl.so /usr/lib
export XILINX_OPENCL=/mnt
(libxilinxopencl.so is the OpenCL library for aarch64). Then, to launch the CNN:
./PipeCNN.elf conv.aocx
and the final output is:
61063552 total weights read
154587 bytes image read
1024 total output reference read
ERROR: No device found
ERROR: CL_DEVICE_NOT_FOUND
Could you give me some help?
Thanks in advance
This source is supposed to be compiled with the Intel FPGA SDK for OpenCL, but I am getting the following errors for an Arria 10. I am using the Intel(R) FPGA SDK for OpenCL(TM) 64-Bit Offline Compiler, Version 17.0.2 Build 297.
Errors:
error: function 'read_channel_altera' is not supported by the Intel(R) FPGA SDK for OpenCL(TM), and no user definition is provided
error: function 'write_channel_altera' is not supported by the Intel(R) FPGA SDK for OpenCL(TM), and no user definition is provided
Hello! I compiled the project in emulator mode and get the following error when I run "run.exe conv.aocx" with the AlexNet data:
run.exe conv.aocx
***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs
***************************************************
61063552 total weights read
154587 bytes image read
1024 total output reference read
Platform: Intel(R) FPGA SDK for OpenCL(TM)
Using 1 device(s)
Device 0: EmulatorDevice : Emulated Device
Device OpenCL Version: OpenCL 1.0 Intel(R) FPGA SDK for OpenCL(TM), Version 17.0
Device Max Compute Units: 1
Device Max WorkGroup Size: 2147483647
Device Max WorkItem Size: 2147483647
Device Global Memory Size: 2048 MBytes
Device Local Memory Size: 16 KBytes
Device Max Clock Freq: 1000 Mhz
Loading kernel/binary from file conv.aocx
ERROR: CL_COMPILER_NOT_AVAILABLE
Location: ../common/ocl_util.cpp:429
Failed to build program with source
Environment: Windows 10, MinGW-w64, Arria 10 board, Intel OpenCL SDK for FPGA 17.0, MSVC 12.0
Anyone got an idea? Thanks in advance!!
Dear Prof. Wang,
For the quantized parameters, you used (n, m) pairs to denote the precision. For example, in the first layer of VGG-16 you used (8,7), (8,0), (8,-2) to denote frac_w, frac_input, and frac_output, while for the last FC layer you used (8,2), (8,2), (4,7). Are there any rules or constraints on how to choose these numbers, or can any numbers be used? If I change the fraction numbers and convert a new model, will it still work? Thank you so much.
Hello! I compiled the project and got the following error:
lcf@lcf-9020:~/work/PipeCNN-master/project$ make
g++ ./host/main.o ../common/ocl_util.o ../common/timer.o -o run.exe -L/home/lcf/intelFPGA/16.1/hld/board/de10_standard/arm32/lib -L/home/lcf/intelFPGA/16.1/hld/host/arm32/lib -L/home/lcf/intelFPGA/16.1/hld/host/linux64/lib -Wl,--no-as-needed -lalteracl -lalterammdpcie -lstdc++ -lelf
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/board/de10_standard/arm32/lib/libalteracl.so when searching for -lalteracl
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/host/arm32/lib/libalteracl.so when searching for -lalteracl
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/board/de10_standard/arm32/lib/libalterammdpcie.so when searching for -lalterammdpcie
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/host/arm32/lib/libalterammdpcie.so when searching for -lalterammdpcie
/usr/bin/ld: cannot find -lalterammdpcie
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/board/de10_standard/arm32/lib/libelf.so when searching for -lelf
/usr/bin/ld: skipping incompatible /home/lcf/intelFPGA/16.1/hld/host/arm32/lib/libelf.so when searching for -lelf
collect2: error: ld returned 1 exit status
Makefile:129: recipe for target 'run.exe' failed
make: *** [run.exe] Error 1
Anyone got an idea? Thanks in advance!!
Dear Prof. Wang,
I use De1-SoC and I changed,
VEC_SIZE = 8
LANE_NUM = 8
CONV_GP_SIZE_X = 7, as in the user instructions,
and
PLATFORM = arm32
and FLOW = hw
Then the Estimated Resource Usage Summary shows:
Logic utilization = 111%
and I got the following error, even though other resources of the device are not fully used up:
kernel cannot fit into device
My question is: are there any other ways to reduce logic utilization, apart from reducing LANE_NUM?
Thank You.
```c
for(unsigned int k=0; k<input_num; k++){
    if(pool_size==3)
        row_pool_reg[ll] = pool_max(line_buf_1[ll][line_buf_ptr], line_buf_0[ll][line_buf_ptr]);
    else // pool_size==2
        row_pool_reg[ll] = line_buf_0[ll][line_buf_ptr];

    pool_reg[ll][0] = pool_max(row_pool_reg[ll], conv_ch_out.lane[ll]);

    // Max pooling among columns
    // with previous row-pooling results stored in shift-registers
    if(pool_size==3)
        col_pool_reg[ll] = pool_max(pool_reg[ll][1], pool_reg[ll][2]);
    else // pool_size==2
        col_pool_reg[ll] = pool_reg[ll][1];

    pool_final.lane[ll] = pool_max(col_pool_reg[ll], pool_reg[ll][0]);

    // Update line buffer
    line_buf_1[ll][line_buf_ptr] = line_buf_0[ll][line_buf_ptr];
    line_buf_0[ll][line_buf_ptr] = conv_ch_out.lane[ll];
}
```
Hi Prof. @doonny, can you explain how you make this work? Is this not square max pooling?
Can I use pool_size == 2? If pool_size == 2, line_buf_1 seems redundant.
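For other readers, the line-buffer part of this pattern can be sketched in plain C (this is my reading of the technique, not the author's kernel; `vmax3_row` and the fixed width `W` are illustrative). Each incoming row is reduced against the two buffered rows above it, giving the vertical max of a 3-row window while storing only two rows; with pool_size == 2 only `line_buf_0` would participate, which is why `line_buf_1` looks redundant in that case:

```c
#include <assert.h>

#define W 4  /* illustrative row width */

static int max2(int a, int b) { return a > b ? a : b; }

/* Streaming vertical max over the last 3 rows using two line buffers.
 * For each column: take the max of the new pixel and the two buffered
 * rows, then shift the buffers down by one row (oldest row discarded). */
void vmax3_row(const int *new_row, int *line_buf_0, int *line_buf_1, int *out) {
    for (int x = 0; x < W; x++) {
        out[x] = max2(new_row[x], max2(line_buf_0[x], line_buf_1[x]));
        line_buf_1[x] = line_buf_0[x];  /* row n-1 becomes row n-2 */
        line_buf_0[x] = new_row[x];     /* new row becomes row n-1 */
    }
}
```

The kernel's `pool_reg` shift registers do the analogous reduction horizontally, so the 2D max decomposes into a row pass and a column pass.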
Hi, how do I print
"The inference result is n02123045 tabby, tabby cat (the prob is 56.00)."
at the end, as shown in your documentation?
Thanks in advance..
Regards,
Ganda.
Hello Prof. Wang,
I'm trying to run the PipeCNN on Alpha Data 7v3 FPGA. [xilinx_adm-pcie-7v3_1ddr_3_0]
I'm using SDx 2017.2, and software emulation runs properly, giving correct results.
When the hardware is built, timing is not met on some paths and the tool reduces the clock to 170.3 MHz (from the original 200 MHz).
But when I run the generated binary conv.xclbin on the FPGA, it hangs at run time:
Executing Layer 1:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 27, 27, 96)
Launching kernel lrn with local size: 1, 1, 24 (global size: 27, 27, 24)
Could you please help me in figuring out the issue?
Thanks in advance
Regards
Hi,
I am trying to get this project running on a Cyclone V SEA5. The configuration you've used is V=8, L=8, GP_X=7, where GP_X is CONV_GP_SIZE_X. But I cannot find CONV_GP_SIZE_X anywhere in the code. Could you tell me where I can set this variable?
Thank you in advance!
Hi sir,
According to the description, Intel's OpenCL SDK v16.1 is used in this project.
We would like to test-run the program on the Terasic De1-SoC board,
but we found that only the BSP for Altera SDK OpenCL 16.0 is provided on the official webpage.
Is that the one you used in this project?
And does it work fine with Intel's OpenCL SDK v16.1?
Thank you
Hi, thank you for your project.
What kind of difference should there be between DTYPE and MACTYPE? In your example they are char (8 bit) and int (32 bit).
If I want to use short (16 bit) as DTYPE, which MACTYPE should I choose?
Long int (64 bit)? Or do I need to round and truncate immediately after every multiplication?
Integration error, problem implementing OCL region.
Is this something I am doing wrong?
full console output here.
Any particular reason for the choice of implementing LRN on the CPU rather than the FPGA (and gaining the FPGA's acceleration benefits)?
Hi Professor @doonny, I read your paper and I'm still confused about how MemRd works. Could you give some pointers for understanding this kernel (memrd)? Thank you in advance.
The current maxpool has a problem when the kernel is run with the following parameters:
- size_x and size_y are odd
- stride = 1
In that case, maxpool unloads less data than needed.
Q1:
How is data handled when it is not a multiple of VEC_SIZE?
Take AlexNet for example:
conv1 has 11x11x3 weights, and the depth of 3 cannot be divided by VEC_SIZE = 4.
So in the MAC operation, a1xb1+a2xb2+a3xb3+a4xb4, the last group would not have a3 and a4 -- are these automatically assigned 0?
Q2:
It also looks like data_vec reads bottom linearly in chunks of VEC_SIZE?
The PipeCNN paper describes the weights as being divided into groups of VEC_SIZE along the Z direction.
E.g., with 3x3x4 weights I should have VEC_SIZE (4) groups of weights along Z, each with 3x3x1 = 9 values:
0, 1, 2, 3, ... , 8
9, 10, 11, 12, ... 17
18, ...
27, ...
but in the code you group data into data_vec linearly:
{0, +1, +2, +3}, {+4, +5, +6, +7}, ..., +35
How does this divide the weights along the Z direction?
for(unsigned short win_itm_z=0; win_itm_z<weight_dim3/VEC_SIZE; win_itm_z++){
    for(unsigned char win_itm_y=0; win_itm_y<win_size_y; win_itm_y++){
        for(unsigned char win_itm_x=0; win_itm_x<win_size_x; win_itm_x++){
            feature_idx_dim1 = win_itm_x;
            feature_idx_dim2 = win_itm_y;
            feature_idx_dim3 = win_itm_z;
            if(xy is at correct location){ // i.e. (x, y) lies inside the un-padded feature map
                data_vec = bottom[data_offset*data_dim1xdim2 + feature_idx_dim3*data_dim1xdim2 + (feature_idx_dim2-padding)*data_dim1 + (feature_idx_dim1-padding)];
            }
            else{
                #pragma unroll
                for(unsigned char vv=0; vv<VEC_SIZE; vv++){
                    data_vec.data[vv] = CZERO;
                }
            }
            // start from using buffer[0]
            win_buffer[0][win_itm_z*win_size_y*win_size_x + win_itm_y*win_size_x + win_itm_x] = data_vec;
        }
    }
}
Hi Prof @doonny, I noticed the current version uses a fixed-point MAC. How does the resource utilization compare to a floating-point MAC?
What is the Pipe depth ?
#define PIPE_DEPTH 6
The default network in the code is AlexNet, and I have successfully built and run it on a Xilinx KCU1500 board. However, when I modified some code and attempted to change the network to VGG-16, the project built successfully but gave wrong results. I am a beginner in CNN and OpenCL, and I think I really need some guidance on how to change the code from the default AlexNet to VGG-16; I can't see any documentation on this in the readme or the user instructions.
This is the AlexNet result which seems correct:
***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs
***************************************************
61063552 total weights read
154587 bytes image read
1024 total output reference read
Platform: Xilinx
Using 1 device(s)
Device 0: xilinx:kcu1500:4ddr-xpr:4.0
Device OpenCL Version: OpenCL 1.0
Device Max Compute Units: 1
Device Max WorkGroup Size: 4096
Device Max WorkItem Size: 4096
Device Global Memory Size: 16384 MBytes
Device Local Memory Size: 16 KBytes
Device Max Clock Freq: 500 Mhz
Loading kernel/binary from file conv.xclbin
WARNING: unaligned host pointer detected, this leads to extra memcpy
(the warning above is repeated 16 more times)
Executing Layer 1:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 27, 27, 96)
Launching kernel lrn with local size: 1, 1, 24 (global size: 27, 27, 24)
Executing Layer 2:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 13, 13, 256)
Launching kernel lrn with local size: 1, 1, 64 (global size: 13, 13, 64)
Executing Layer 3:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 13, 13, 384)
Executing Layer 4:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 13, 13, 384)
Executing Layer 5:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 6, 6, 256)
Executing Layer 6:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 4096)
Executing Layer 7:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 4096)
Executing Layer 8:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 1024)
Copyed all batched results from fc_2 buffers.
Done !!!
-------------------
Performance Summary
Total runtime: 1.050996s
Kernel runtime summary:
Layer-1:
MemRd: 59.144 ms
Conv : 58.641 ms
Pool : 58.187 ms
MemWr: 56.728 ms
Lrn : 381.921 ms
Layer-2:
MemRd: 81.765 ms
Conv : 81.385 ms
Pool : 80.876 ms
MemWr: 80.793 ms
Lrn : 336.314 ms
Layer-3:
MemRd: 18446744071709.031 ms
Conv : 51.617 ms
Pool : 0.000 ms
MemWr: 51.168 ms
Lrn : 0.000 ms
Layer-4:
MemRd: 18446744071656.164 ms
Conv : 18446744071656.809 ms
Pool : 0.000 ms
MemWr: 39.138 ms
Lrn : 0.000 ms
Layer-5:
MemRd: 26.615 ms
Conv : 26.061 ms
Pool : 26.632 ms
MemWr: 25.660 ms
Lrn : 0.000 ms
Layer-6:
MemRd: 27.584 ms
Conv : 27.147 ms
Pool : 0.000 ms
MemWr: 26.590 ms
Lrn : 0.000 ms
Layer-7:
MemRd: 18446744071562.098 ms
Conv : 18446744071561.719 ms
Pool : 0.000 ms
MemWr: 11.994 ms
Lrn : 0.000 ms
Layer-8:
MemRd: 3.911 ms
Conv : 3.548 ms
Pool : 0.000 ms
MemWr: 4.082 ms
Lrn : 0.000 ms
Total kernel runtime 36893488147419.102 ms
Batch size = 1, average process time per batch: 36893488147419.102 ms
Start verifying results ...
Selected item = 0 from the combined batch results in fc buffers
Check Pass !!!
The inference result is n02123045 tabby, tabby ca (the prob is 56.00)
And this is the vgg16 result which is obviously wrong:
***************************************************
PipeCNN: An OpenCL-Based FPGA Accelerator for CNNs
***************************************************
138455872 total weights read
150528 bytes image read
1024 total output reference read
Platform: Xilinx
Using 1 device(s)
Device 0: xilinx:kcu1500:4ddr-xpr:4.0
Device OpenCL Version: OpenCL 1.0
Device Max Compute Units: 1
Device Max WorkGroup Size: 4096
Device Max WorkItem Size: 4096
Device Global Memory Size: 16384 MBytes
Device Local Memory Size: 16 KBytes
Device Max Clock Freq: 500 Mhz
Loading kernel/binary from file conv.xclbin
WARNING: unaligned host pointer detected, this leads to extra memcpy
(the warning above is repeated 32 more times)
Executing Layer 1:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 224, 224, 64)
Executing Layer 2:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 112, 112, 64)
Executing Layer 3:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 112, 112, 128)
Executing Layer 4:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 56, 56, 128)
Executing Layer 5:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 56, 56, 256)
Executing Layer 6:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 56, 56, 256)
Executing Layer 7:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 28, 28, 256)
Executing Layer 8:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 28, 28, 512)
Executing Layer 9:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 28, 28, 512)
Executing Layer 10:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 14, 14, 512)
Executing Layer 11:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 14, 14, 512)
Executing Layer 12:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 14, 14, 512)
Executing Layer 13:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching single work-item kernel Pooling
Launching kernel MemWr with local size: 1, 1, 16 (global size: 7, 7, 512)
Executing Layer 14:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 4096)
Executing Layer 15:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 4096)
Executing Layer 16:
Launching single work-item kernel winbuffer
Launching single work-item kernel Conv
Launching kernel MemWr with local size: 1, 1, 16 (global size: 1, 1, 1024)
Copyed all batched results from fc_2 buffers.
Done !!!
-------------------
Performance Summary
Total runtime: 6.911966s
Kernel runtime summary:
Layer-1:
MemRd: 131.136 ms
Conv : 130.630 ms
Pool : 0.000 ms
MemWr: 128.416 ms
Lrn : 0.000 ms
Layer-2:
MemRd: 18446744067806.332 ms
Conv : 18446744067805.941 ms
Pool : 18446744067805.469 ms
MemWr: 840.861 ms
Lrn : 0.000 ms
Layer-3:
MemRd: 435.174 ms
Conv : 434.800 ms
Pool : 0.000 ms
MemWr: 435.343 ms
Lrn : 0.000 ms
Layer-4:
MemRd: 18446744066528.754 ms
Conv : 821.526 ms
Pool : 821.065 ms
MemWr: 820.978 ms
Lrn : 0.000 ms
Layer-5:
MemRd: 409.369 ms
Conv : 409.873 ms
Pool : 0.000 ms
MemWr: 409.000 ms
Lrn : 0.000 ms
Layer-6:
MemRd: 18446744065296.562 ms
Conv : 807.734 ms
Pool : 0.000 ms
MemWr: 807.327 ms
Lrn : 0.000 ms
Layer-7:
MemRd: 802.170 ms
Conv : 802.702 ms
Pool : 802.189 ms
MemWr: 801.713 ms
Lrn : 0.000 ms
Layer-8:
MemRd: 18446744063685.164 ms
Conv : 18446744063684.770 ms
Pool : 0.000 ms
MemWr: 388.292 ms
Lrn : 0.000 ms
Layer-9:
MemRd: 775.116 ms
Conv : 774.742 ms
Pool : 0.000 ms
MemWr: 775.259 ms
Lrn : 0.000 ms
Layer-10:
MemRd: 775.115 ms
Conv : 774.698 ms
Pool : 774.239 ms
MemWr: 774.151 ms
Lrn : 0.000 ms
Layer-11:
MemRd: 183.018 ms
Conv : 182.621 ms
Pool : 0.000 ms
MemWr: 182.134 ms
Lrn : 0.000 ms
Layer-12:
MemRd: 183.014 ms
Conv : 182.632 ms
Pool : 0.000 ms
MemWr: 182.164 ms
Lrn : 0.000 ms
Layer-13:
MemRd: 182.661 ms
Conv : 182.243 ms
Pool : 181.786 ms
MemWr: 181.401 ms
Lrn : 0.000 ms
Layer-14:
MemRd: 18446744061195.703 ms
Conv : 80.474 ms
Pool : 0.000 ms
MemWr: 86.381 ms
Lrn : 0.000 ms
Layer-15:
MemRd: 13.703 ms
Conv : 14.216 ms
Pool : 0.000 ms
MemWr: 13.308 ms
Lrn : 0.000 ms
Layer-16:
MemRd: 18446744061093.504 ms
Conv : 18446744061093.090 ms
Pool : 0.000 ms
MemWr: 3.383 ms
Lrn : 0.000 ms
Total kernel runtime 55340232221128.656 ms
Batch size = 1, average process time per batch: 55340232221128.656 ms
Start verifying results ...
Selected item = 0 from the combined batch results in fc buffers
Item=0 is wrong (result=-3.000000, golden_ref=-6.000000)
Item=1 is wrong (result=0.000000, golden_ref=3.000000)
Item=2 is wrong (result=-4.000000, golden_ref=-8.000000)
Item=3 is wrong (result=-4.000000, golden_ref=-9.000000)
Item=4 is wrong (result=-1.000000, golden_ref=-5.000000)
Item=5 is wrong (result=-4.000000, golden_ref=-1.000000)
Item=6 is wrong (result=-3.000000, golden_ref=-12.000000)
Item=7 is wrong (result=2.000000, golden_ref=7.000000)
Item=8 is wrong (result=-1.000000, golden_ref=10.000000)
Totally 946 Wrong Results
conv_pipe.cl depends heavily on Altera-specific extensions (write_channel_altera, read_channel_altera), so it cannot be compiled for use with Xilinx FPGAs.
I'm sorry, I was in too much of a hurry with this issue.
This problem occurs when padding (=1) is added on the right and bottom sides
in order to get a result with sizes size_x = 13, size_y = 13.
More information:
input: size_x = 13, size_y = 13, pool_size = 2, pool_stride = 1, depth = 512
result: size_x = 13, size_y = 13, depth = 512
Hi, I compiled the project with Intel OpenCL SDK for FPGA 17.1 and got the following error:
In file included from /home/wangjf/PipeCNN/project/__all_sources.cl:2:
PipeCNN/project/device/conv_pipe.cl:75:24: error: Channel support is not enabled
channel channel_vec data_ch __attribute__((depth(0)));
Anyone got an idea? Thanks in advance!!
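For what it's worth: in the Intel FPGA SDK 17.x toolchain, channel support has to be enabled explicitly. The usual fix (my understanding; worth verifying against the 17.1 programming guide rather than taking as the project's official answer) is to enable the Intel channels extension before any channel declarations:

```c
// at the top of conv_pipe.cl, before any channel declarations
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel channel_vec data_ch __attribute__((depth(0)));
```

Depending on the SDK version, the read_channel_altera/write_channel_altera calls may also need to become read_channel_intel/write_channel_intel.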