
FJoin: an FPGA-based parallel accelerator for stream join

FJoin is an FPGA-based parallel accelerator for stream join. It leverages a large number of basic join units connected in series to form a deep join pipeline, achieving large-scale parallelism. FJoin performs High-Parallel Flow Join: after multiple stream tuples are loaded into the pipeline, the data of the join window flows through it once to complete all join calculations. The host CPU and the FPGA device coordinate to divide the continuous stream join calculation into independent small-batch tasks and to efficiently ensure the completeness of the parallel stream join. FJoin is implemented on a platform equipped with an FPGA accelerator card. Test results based on large-scale real data sets show that, with a single FPGA accelerator card, FJoin increases the join calculation speed by 16 times and reaches 5 times the system throughput of the current best stream join system deployed on a 40-node cluster, while its latency meets real-time stream processing requirements.

Introduction

In the era of the Internet of Everything, real-time streaming data comes from many sources and is ubiquitous. Since joining streams from different sources can extract key information from multi-source streaming data, stream join has become an important operation in stream processing. However, among stream processing operators, the stream join operator is likely to become a system performance bottleneck due to its high computational overhead. When the computing power of the stream processing system cannot keep up with the actual stream data rate, serious congestion arises immediately, processing latency increases rapidly, and stream processing requirements can no longer be met. It is therefore important to study high-performance stream join systems.

Existing studies generally design multi-core parallel systems to obtain low-latency and high-throughput processing performance. The advantage of the multi-core architecture is that it is easy to extend parallelism up to the number of CPU cores while maintaining the global order of stream tuples. Under the global order, it is easy to guarantee that join results are neither repeated nor omitted, that is, to ensure completeness. This parallel mode also makes it easy to collect the results generated by all join cores and sort them in the original order of the source tuples, that is, to keep the output order-preserving. However, parallelism is limited by the resources of a single machine, so this mode cannot cope with large-scale data.

To further improve the parallelism of stream join processing, distributed stream join systems that support scale-out have received extensive attention in recent years. Distributed methods are generally built on distributed stream processing systems, such as the widely used Apache Storm. This introduces the inherent overhead of the framework and also adds communication overhead between nodes. Because of this loss of hardware efficiency, a distributed stream join system requires a large number of CPU cores, which makes deployment and maintenance costly. In summary, a multi-core parallel stream join system scales efficiently but only to a limited degree, while scale-out distribution extends parallelism further but causes a serious drop in hardware processing efficiency.

Given the difficulty of scaling stream join systems both efficiently and at large scale, we focus on using FPGAs to accelerate stream join in parallel. On the one hand, for join predicates that are easy to implement in RTL, FPGAs have the advantage of realizing a large number of specialized join core circuits; compared with a distributed system that invests a large number of CPU cores in scaling out, the cost-effectiveness of FPGAs is obvious. On the other hand, while a multi-core parallel stream join system has many advantages, it is hard to scale up by adding CPUs and memory, whereas adding FPGA accelerator cards to a node is easy. Combining these advantages, the FPGA is a very suitable accelerator platform for stream join computing. We propose and design an FPGA-based parallel stream join accelerator, FJoin, which consists of a multi-core host and multiple FPGAs with independent memory access channels. In each FPGA, a large number of basic join units are connected in series to form a deep join pipeline, and the units contain custom join predicates. The join pipeline loads multiple tuples of one stream at a time and joins them against the tuples in the window of the other stream; as the join window flows through the pipeline, all units perform High-Parallel Flow Join simultaneously. Because it is difficult to ensure the completeness of a parallel join on the FPGA alone, the host part of the system allocates a management thread to each pipeline so that the CPU and FPGA work together. This parallel acceleration framework not only breaks through the scaling limit of multi-core parallel stream join systems but also avoids the hardware inefficiency of distributed scale-out.

FJoin architecture

[Figure: FJoin architecture overview]

The overview of FJoin is shown in the figure above. The system is divided into host-side software (upper part) and device-side hardware (lower part). The hardware part includes, in each FPGA, a pipeline state machine (FSM), a memory access interface (Mem I/F), and a deep join pipeline composed of a large number of basic join units in series. The software part includes a stream join task scheduler (Task Scheduler), a join completeness control module (Completeness Controller) that generates join calculation batch tasks, a result post-processing module (Post-Process Module), and a management thread for each FPGA join pipeline.
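
To make the division of labor concrete, here is a rough C++ sketch of what one host-side management thread does. All type and function names here are hypothetical stand-ins invented for illustration; the actual host interfaces live in the project sources.

    #include <cstdint>
    #include <vector>

    // Hypothetical stand-ins for the real host-side interfaces.
    struct Tuple     { uint64_t key; uint8_t payload[56]; };   // fixed 64-byte tuple
    struct BatchTask { std::vector<Tuple> load_tuples;          // tuples to load into units
                       std::vector<Tuple> window; };            // window slice to stream through

    struct TaskScheduler { BatchTask next_batch() { return {}; } };           // Completeness Controller
    struct JoinPipeline  { std::vector<Tuple> run(const BatchTask&) { return {}; } }; // one FPGA pipeline
    struct PostProcessor { void emit(const std::vector<Tuple>&) {} };         // Post-Process Module

    // One management thread per FPGA join pipeline: repeatedly fetch a
    // small-batch task, drive the device, and forward the results.
    void manage_pipeline(TaskScheduler& sched, JoinPipeline& fpga,
                         PostProcessor& post, const bool& running) {
        while (running) {
            BatchTask task = sched.next_batch();     // independent small-batch task
            std::vector<Tuple> results = fpga.run(task); // load tuples, stream window once
            post.emit(results);                      // order-preserving result output
        }
    }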

Basic join unit

[Figure: structure of the basic join unit]

The structure of the basic join unit is shown in the figure above. It contains the join logic, which can be customized in RTL; three flows of tuples (stream tuples, window tuples, and result tuples) passed from front to back; the clear and stall control signals passed through each stage; and a join status register.
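
The following minimal C++ sketch models one basic join unit as a software pipeline stage, to make the dataflow concrete. It is a behavioral model only, with assumed names and an equality predicate standing in for the customizable join logic; the real unit is an RTL circuit.

    #include <cstdint>
    #include <optional>

    // 64-byte tuple; FJoin fixes the tuple size at 64 bytes.
    struct Tuple {
        uint64_t key;
        uint8_t  payload[56];
    };

    // Behavioral model of one basic join unit (one stage of the deep join
    // pipeline). The real unit also propagates a stall signal; in this
    // model the caller simply does not call tick() while stalled.
    struct BasicJoinUnit {
        std::optional<Tuple> held;   // stream tuple loaded into this stage

        // One clock tick: a window tuple arrives from the previous stage,
        // is joined against the held stream tuple, and is passed onward;
        // `match` models a result tuple entering the result flow.
        std::optional<Tuple> tick(const std::optional<Tuple>& window_in,
                                  bool clear, bool& match) {
            match = false;
            if (clear) {                 // clear signal resets the stage state
                held.reset();
                return std::nullopt;
            }
            if (held && window_in && held->key == window_in->key)
                match = true;            // customizable join predicate fired
            return window_in;            // window tuple flows to the next stage
        }
    };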

How to use?

Prerequisites

Hardware

This project targets the Xilinx Alveo U280 Data Center Accelerator card.

You can also use other FPGA acceleration devices, such as the U250; just adjust the project settings accordingly.

Operating System

Ubuntu 18.04 LTS

Software

Vitis 2019.2

U280 package file for Vitis 2019.2

Environment

To compile and run the project, configure the environment according to the following documentation: https://www.xilinx.com/html_docs/xilinx2020_1/vitis_doc/vhc1571429852245.html

Build and Run Project

After launching Vitis, set up a workspace, then import the project from the zip file in the /proj directory. All source code and project configuration are included in it.

After that, select the "Hardware" target in the lower-left corner and press the hammer button to build. Hardware builds can take hours, but the sample project is configured with 16 basic join units per join pipeline (32 units in total), so the build time should remain moderate.

After building, you also need to prepare datasets and configuration files before running.

Dataset: download the data files (gps_0102, gps_0304) from /data and put them into /src/data of the project.

Configuration file: copy the run configuration file (run.cfg) from /src to the Vitis run directory /Hardware/join_hw-Default.

When the run finishes, the results are written to files in the /result directory, and corresponding output appears on the console.

Project Configuration

FJoin is a framework designed for general stream join predicates. The RTL kernel implementing the join predicate can be replaced to accelerate different calculations, and runtime parameters can be configured to study the performance of the system under different settings.

Runtime Parameters

The system runtime parameters are concentrated in the configuration file <run.cfg>. At startup, FJoin reads this file to set the runtime parameters. Some configuration items must be provided; the others fall back to defaults.

Parameters that must be set in <run.cfg>:

Dataset Definitions

  • R_NAME: R stream name
  • R_DATA: R stream data source file
  • S_NAME: S stream name
  • S_DATA: S stream data source file
  • W_NAME: result file

These default to empty and must be set correctly.

  • R_PUNC / S_PUNC / W_PUNC: the data item separator for each line of the R source, S source, and result file, respectively

The default separators are empty; they must be set according to the data file format.

Parameters that have default values:

(Only parameter settings that appear before the end line of the file are valid.)

Runtime Parameters

  • window_length_in_ms: sliding window size (in milliseconds)
  • max_join_delay_in_ms: maximum join calculation latency (in milliseconds)
  • test_time_in_ms: total running time (in milliseconds)
  • default_source_speed: default initial stream rate (sum of R and S)
  • max_source_speed: maximum stream rate (sum of R and S)
  • r_ratio: R stream ratio
  • s_ratio: S stream ratio
  • add_speed_step: acceleration of the stream rate per second

Runtime Configuration Items

  • post_result_ts: output result timestamps
  • post_result_delay: output result latency
  • post_result_tuple: output join result tuples
  • add_speed_test: run the stream rate increase test
  • order_preserving: produce order-preserving output
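
For illustration, a complete <run.cfg> for the didi sample might look like the sketch below. The key=value layout and all concrete values here are assumptions; follow the sample file shipped in /src for the exact syntax, including the end-line marker after which settings are ignored.

    # Datasets (required)
    R_NAME=gps_r
    R_DATA=./src/data/gps_0102
    S_NAME=gps_s
    S_DATA=./src/data/gps_0304
    W_NAME=./result/join_result
    R_PUNC=,
    S_PUNC=,
    W_PUNC=,

    # Runtime parameters (optional, defaults exist)
    window_length_in_ms=180000
    max_join_delay_in_ms=1000
    test_time_in_ms=600000
    default_source_speed=10000
    max_source_speed=200000
    r_ratio=1
    s_ratio=1
    add_speed_step=1000

    # Runtime configuration items (optional)
    post_result_ts=1
    post_result_delay=1
    post_result_tuple=0
    add_speed_test=0
    order_preserving=0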

Join Predicate

FJoin can also change the join predicate to accelerate different join calculations, but this requires providing an RTL kernel for the new predicate.

Compared with switching runtime parameters, replacing the join predicate definition is more complicated, but fortunately it is needed less frequently.

The following example shows how to change the join predicate of FJoin by replacing the join predicate for the didi data set with the one for the ip data set.

Change Tuple Definitions

  1. Modify the definition of the stream tuples of R/S/Result.

Their definitions are in the -CUSTOM HEAD- section of <head.hpp>. Both the source code and the /proj sample project use the tuple definitions for the didi data set; below them are the definitions for the ip data set (commented out). Comment out the didi tuple definitions and uncomment the ip tuple definitions to complete this step. For simplicity, FJoin fixes the tuple size at 64 bytes; any unused part of a tuple definition must be padded with zeros, as sketched below.
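
For illustration, the -CUSTOM HEAD- section might end up looking like this after the switch (only the R tuple is shown; the S and Result tuples follow the same pattern). The field names are assumptions; only the fixed 64-byte size is prescribed by FJoin.

    #include <cstdint>

    // -CUSTOM HEAD- (sketch; the real field names may differ)

    // didi data set tuple definition, commented out after the switch:
    // struct RTuple {
    //     uint64_t ts;          // timestamp
    //     double   lon, lat;    // GPS coordinates
    //     uint8_t  pad[40];     // zero padding up to 64 bytes
    // };

    // ip data set tuple definition, uncommented after the switch:
    struct RTuple {
        uint64_t ts;             // timestamp
        uint32_t src_ip;         // join key (hypothetical field)
        uint8_t  pad[52];        // zero padding up to 64 bytes
    };
    static_assert(sizeof(RTuple) == 64, "FJoin fixes tuples at 64 bytes");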

  2. Replace the functions that parse the data lines.

The functions are in the <custom.cpp> file, and the definition of the Line structure is in <head.hpp>. Three functions are provided to the system: two convert the Line records read from the stream source into the R and S tuple formats, and one converts a Result tuple back into the Line structure. A sketch follows.
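
The sketch below shows plausible shapes for these three functions for the ip data set. The Line layout, tuple fields, and function names are all assumptions invented so the example compiles on its own; the real definitions are in <head.hpp> and <custom.cpp>.

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    // Hypothetical stand-ins for the real definitions in <head.hpp>:
    struct Line        { char field[4][32]; };   // one record, pre-split by *_PUNC
    struct RTuple      { uint64_t ts; uint32_t src_ip; uint8_t pad[52]; };
    struct STuple      { uint64_t ts; uint32_t dst_ip; uint8_t pad[52]; };
    struct ResultTuple { uint64_t r_ts, s_ts; uint32_t ip; uint8_t pad[44]; };

    // Convert a raw R-stream record into the fixed 64-byte R tuple.
    RTuple line_to_r(const Line& l) {
        RTuple t{};                              // zero-init keeps the padding at 0
        t.ts     = std::strtoull(l.field[0], nullptr, 10);
        t.src_ip = static_cast<uint32_t>(std::strtoul(l.field[1], nullptr, 10));
        return t;
    }

    // Convert a raw S-stream record into the fixed 64-byte S tuple.
    STuple line_to_s(const Line& l) {
        STuple t{};
        t.ts     = std::strtoull(l.field[0], nullptr, 10);
        t.dst_ip = static_cast<uint32_t>(std::strtoul(l.field[1], nullptr, 10));
        return t;
    }

    // Convert a join result back into a Line for the result file.
    Line result_to_line(const ResultTuple& r) {
        Line l{};
        std::snprintf(l.field[0], sizeof l.field[0], "%llu",
                      static_cast<unsigned long long>(r.r_ts));
        std::snprintf(l.field[1], sizeof l.field[1], "%llu",
                      static_cast<unsigned long long>(r.s_ts));
        std::snprintf(l.field[2], sizeof l.field[2], "%u", r.ip);
        return l;
    }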

  3. Modify the paths of the data source files.

The paths are set in the <run.cfg> file (see the example configuration above). At runtime, FJoin reads the corresponding files to form the data streams.

Change the RTL Core of the Join Predicate

The join core is defined as the JoinCore module in FJoin. In Vitis, double-click the two .xo files in /src/vitis_rtl_kernel to open the RTL kernel projects for editing. FJoin keeps the JoinCore module in a separate .v file; for example, the JoinCore for the didi data set is in <didi_Manhattan_Distance.v>. To replace the RTL core, remove the current JoinCore definition file from the project and add the prepared JoinCore module.

Note that the R and S join pipelines are independent, so the replacement must be done once for each; this also makes asymmetric join core designs possible.

The JoinCore for the ip data set can be found in /krnl, in the file <ip_equal.v>.
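
For reference, the comparison that <ip_equal.v> implements in hardware can be modeled in a single line of C++, using the hypothetical tuple shapes from the sketches above; the real key fields may differ.

    // Behavioral model only: the real JoinCore is a Verilog module that
    // asserts its match output when the key fields of the two tuples agree.
    bool ip_equal(const RTuple& r, const STuple& s) {
        return r.src_ip == s.dst_ip;   // hypothetical key fields
    }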

Configure FJoin Join Pipeline

Number of Stages

FJoin can easily configure the number of stages of the stream_r_join and stream_s_join pipelines to change the system scale: the PARA_PIPELINE_STAGE_NUMS parameter in the <para.v> file defines the pipeline depth. The sample project, for example, uses 16 basic join units per pipeline.

Adapt to Different Devices

The example project is designed for the Xilinx Alveo U280 FPGA accelerator card. To adapt to a different accelerator device, modify the project settings: the <link.cfg> file defines how the RTL kernels are implemented on the FPGA accelerator device. Refer to the device manual for the appropriate settings.

Note that each join pipeline needs its own independent memory bank. For example, the two AXI buses of stream_r_join_1 in the sample are connected to the DDR[0] memory bank, while stream_s_join is connected to DDR[1]. Also, the depth of the join pipelines is bounded by the FPGA's available resources; exceeding them will cause synthesis to fail.
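
For illustration, the bank assignment described above might appear in <link.cfg> roughly as follows, using the v++ connectivity syntax. The kernel instance and AXI port names here are assumptions, so check the actual file shipped with the project.

    [connectivity]
    sp=stream_r_join_1.m_axi_gmem0:DDR[0]
    sp=stream_r_join_1.m_axi_gmem1:DDR[0]
    sp=stream_s_join_1.m_axi_gmem0:DDR[1]
    sp=stream_s_join_1.m_axi_gmem1:DDR[1]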

Evaluation Result

[Figure: evaluation results. Left: join calculations completed per second. Right: real-time throughput.]

The most direct measure of a stream join system's processing capacity is the number of join calculations completed per unit of time. The left-hand plot compares, in real time, the number of join calculations completed per second by FJoin and by the distributed stream join system BiStream. We use the Didi Chuxing data set with its corresponding join predicate, and the sliding time window is set to 180 seconds. The results show that FJoin with 1024 basic join units completes more than 100 billion join predicate calculations per second, whereas BiStream, running on a 40-node cluster with 512 CPU cores, completes about 6 billion per second. Measured by join predicate calculations, FJoin achieves a speedup of about 17.

The right-hand plot compares the real-time throughput of FJoin and BiStream. This test uses the network traffic trace data set with its corresponding join predicate, and the sliding time window is set to 15 seconds. We ran a series of tests with increasing input stream rates to measure the system's throughput. Compared with the BiStream system with 512 CPU cores, FJoin with 1024 basic join units reaches 5x the real-time throughput. Moreover, at the same scale of 512 units, FJoin's throughput is still about 4x that of BiStream.

Publication

For more detailed information, please refer to this paper:
Lin L T, Chen H H, Jin H. FJoin: an FPGA-based parallel accelerator for stream join (in Chinese). Sci Sin Inform

doi:10.1360/SSI-2021-0214 (https://www.sciengine.com/publisher/scp/journal/SSI/doi/10.1360/SSI-2021-0214)

Authors and Copyright

FJoin is developed at the National Engineering Research Center for Big Data Technology and System, Cluster and Grid Computing Lab, Services Computing Technology and System Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, by Litao Lin ([email protected]), Hanhua Chen ([email protected]), and Hai Jin ([email protected]).

Copyright (C) 2021, STCS & CGCL and Huazhong University of Science and Technology.
