apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.

License: Apache License 2.0

Java 5.60% Scala 67.70% Shell 1.23% C++ 24.20% CMake 1.09% Python 0.11% Dockerfile 0.03% C 0.01% Makefile 0.01% PowerShell 0.02%

clickhouse simd spark-sql vectorization velox arrow

incubator-gluten's Issues

Upgrade Substrait and generate Substrait plan on driver

Upgrade substrait to version e1b4c04a1b
Generate substrait plan on driver
Add TPCH test data and queries

Support WS Transformer with row-based output

Support ClickHouse as native engine;
Support Row iterator;
Add config 'spark.oap.sql.columnar.iterator' to specify whether to use columnar basic iterator

Clean up unnecessary code

Use Arrow from upstream
Remove Gandiva dependency

Arrow stream is closed too early for Velox computing

Arrow stream is closed too early for Velox computing, which causes segfault.

Print substrait plan in json string

Remove Alias

Currently, Spark has the Alias expression to assign a new name to a computation. But because Substrait is index-based, this expression is unnecessary. Do we need to remove Alias?

Port the code in branch velox_dev to master

We need to port the code in velox_dev branch to master, and enable Velox as an execution backend.
The velox_dev branch will be deleted once this work is completed.

Update Velox to the latest available commit

The dependency on Velox PlanBuilder needs to be removed

Support ClickHouse Spark DataSource V2 -- Phase One

Support using ClickHouse Spark DataSource V2 to create ClickHouse tables, refresh tables, and select from tables (TPCH-Q6).

Fix runtime UnsatisfiedLinkError

[DISCUSS] Any ideas about unified memory management for different native engines?

Hi OAP team, your work is wonderful. Do you have any plans or designs for unified memory management across different native engines (e.g. Velox)?

Run spark-shell with gazelle-jni-jvm-1.2.0-snapshot-jar-with-dependencies.jar failed.

Running spark-shell with gazelle-jni-jvm-1.2.0-snapshot-jar-with-dependencies.jar failed on Ubuntu 20.04. The command is shown below:

Then I checked the library libspark_columnar_jni.so with ldd. There are some undefined symbol errors.

root@ubuntu:/home/gazelle/gazelle-jni/cpp/build/releases# ldd -r libspark_columnar_jni.so
        linux-vdso.so.1 (0x00007ffcd0ae1000)
        libprotobuf.so.17 => /lib/x86_64-linux-gnu/libprotobuf.so.17 (0x00007f03889ec000)
        libdouble-conversion.so.3 => /lib/x86_64-linux-gnu/libdouble-conversion.so.3 (0x00007f03889d6000)
        libsnappy.so.1 => /lib/x86_64-linux-gnu/libsnappy.so.1 (0x00007f03889cb000)
        libglog.so.0 => /usr/local/lib/libglog.so.0 (0x00007f0388984000)
        libarrow.so.400 (0x00007f038724f000)
        libgandiva.so.400 (0x00007f0384ff7000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f0384e13000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0384cc4000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f0384ca9000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0384ab7000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f038b0b5000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f0384a9b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f0384a78000)
        libgflags.so.2.2 => /usr/local/lib/libgflags.so.2.2 (0x00007f0384a49000)
        libunwind.so.8 => /lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f0384a2c000)
        libcrypto.so.1.1 => /lib/x86_64-linux-gnu/libcrypto.so.1.1 (0x00007f0384756000)
        libssl.so.1.1 => /lib/x86_64-linux-gnu/libssl.so.1.1 (0x00007f03846c3000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f03846bd000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f03846b2000)
        libcurl.so.4 => /lib/x86_64-linux-gnu/libcurl.so.4 (0x00007f038461f000)
        liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f03845f6000)
        libnghttp2.so.14 => /lib/x86_64-linux-gnu/libnghttp2.so.14 (0x00007f03845cd000)
        libidn2.so.0 => /lib/x86_64-linux-gnu/libidn2.so.0 (0x00007f03845ac000)
        librtmp.so.1 => /lib/x86_64-linux-gnu/librtmp.so.1 (0x00007f038458c000)
        libssh.so.4 => /lib/x86_64-linux-gnu/libssh.so.4 (0x00007f038451c000)
        libpsl.so.5 => /lib/x86_64-linux-gnu/libpsl.so.5 (0x00007f0384509000)
        libgssapi_krb5.so.2 => /lib/x86_64-linux-gnu/libgssapi_krb5.so.2 (0x00007f03844bc000)
        libldap_r-2.4.so.2 => /lib/x86_64-linux-gnu/libldap_r-2.4.so.2 (0x00007f0384466000)
        liblber-2.4.so.2 => /lib/x86_64-linux-gnu/liblber-2.4.so.2 (0x00007f0384455000)
        libbrotlidec.so.1 => /lib/x86_64-linux-gnu/libbrotlidec.so.1 (0x00007f0384447000)
        libunistring.so.2 => /lib/x86_64-linux-gnu/libunistring.so.2 (0x00007f03842c3000)
        libgnutls.so.30 => /lib/x86_64-linux-gnu/libgnutls.so.30 (0x00007f03840ed000)
        libhogweed.so.5 => /lib/x86_64-linux-gnu/libhogweed.so.5 (0x00007f03840b6000)
        libnettle.so.7 => /lib/x86_64-linux-gnu/libnettle.so.7 (0x00007f038407c000)
        libgmp.so.10 => /lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f0383ff8000)
        libkrb5.so.3 => /lib/x86_64-linux-gnu/libkrb5.so.3 (0x00007f0383f1b000)
        libk5crypto.so.3 => /lib/x86_64-linux-gnu/libk5crypto.so.3 (0x00007f0383ee8000)
        libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 (0x00007f0383ee1000)
        libkrb5support.so.0 => /lib/x86_64-linux-gnu/libkrb5support.so.0 (0x00007f0383ed2000)
        libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007f0383eb6000)
        libsasl2.so.2 => /lib/x86_64-linux-gnu/libsasl2.so.2 (0x00007f0383e99000)
        libgssapi.so.3 => /lib/x86_64-linux-gnu/libgssapi.so.3 (0x00007f0383e54000)
        libbrotlicommon.so.1 => /lib/x86_64-linux-gnu/libbrotlicommon.so.1 (0x00007f0383e2f000)
        libp11-kit.so.0 => /lib/x86_64-linux-gnu/libp11-kit.so.0 (0x00007f0383cf9000)
        libtasn1.so.6 => /lib/x86_64-linux-gnu/libtasn1.so.6 (0x00007f0383ce3000)
        libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 (0x00007f0383cdc000)
        libheimntlm.so.0 => /lib/x86_64-linux-gnu/libheimntlm.so.0 (0x00007f0383cd0000)
        libkrb5.so.26 => /lib/x86_64-linux-gnu/libkrb5.so.26 (0x00007f0383c3b000)
        libasn1.so.8 => /lib/x86_64-linux-gnu/libasn1.so.8 (0x00007f0383b94000)
        libhcrypto.so.4 => /lib/x86_64-linux-gnu/libhcrypto.so.4 (0x00007f0383b5c000)
        libroken.so.18 => /lib/x86_64-linux-gnu/libroken.so.18 (0x00007f0383b43000)
        libffi.so.7 => /lib/x86_64-linux-gnu/libffi.so.7 (0x00007f0383b37000)
        libwind.so.0 => /lib/x86_64-linux-gnu/libwind.so.0 (0x00007f0383b0d000)
        libheimbase.so.1 => /lib/x86_64-linux-gnu/libheimbase.so.1 (0x00007f0383af9000)
        libhx509.so.5 => /lib/x86_64-linux-gnu/libhx509.so.5 (0x00007f0383aab000)
        libsqlite3.so.0 => /lib/x86_64-linux-gnu/libsqlite3.so.0 (0x00007f0383982000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f0383947000)
undefined symbol: _ZN3fLB10FLAGS_avx2E  (./libspark_columnar_jni.so)
undefined symbol: _ZN3fLB32FLAGS_velox_exception_stacktraceE    (./libspark_columnar_jni.so)
undefined symbol: _ZN3fLB10FLAGS_bmi2E  (./libspark_columnar_jni.so)
undefined symbol: _ZN3fLI46FLAGS_velox_exception_stacktrace_rate_limit_msE      (./libspark_columnar_jni.so)
undefined symbol: _ZN3fLB22FLAGS_velox_use_mallocE      (./libspark_columnar_jni.so)
undefined symbol: _ZNK8facebook5velox7process10StackTrace8toStringB5cxx11Ev     (./libspark_columnar_jni.so)
undefined symbol: _ZN5boost16re_detail_10710013put_mem_blockEPv (./libspark_columnar_jni.so)
undefined symbol: _ZN5boost13match_resultsIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEESaINS_9sub_matchISB_EEEE12maybe_assignERKSF_  (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox8encoding6Base646encodeB5cxx11EN5folly5RangeIPKcEE  (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox4dwio6common10encryptioneqERKNS3_20EncryptionPropertiesES6_ (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox8encoding6Base6420calculateDecodedSizeEPKcRmb       (./libspark_columnar_jni.so)
undefined symbol: event_base_new        (./libspark_columnar_jni.so)
undefined symbol: _ZN4date11locate_zoneESt17basic_string_viewIcSt11char_traitsIcEE      (./libspark_columnar_jni.so)
undefined symbol: event_active  (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox4dwrf10ProtoUtils9writeTypeERKNS0_4TypeERNS1_5proto6FooterEPNS6_4TypeE      (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox7process12TraceContext10statusLineB5cxx11Ev (./libspark_columnar_jni.so)
undefined symbol: jump_fcontext (./libspark_columnar_jni.so)
undefined symbol: event_add     (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox7process10StackTraceC1Ei    (./libspark_columnar_jni.so)
undefined symbol: _ZN5boost13match_resultsIPKcSaINS_9sub_matchIS2_EEEE12maybe_assignERKS6_      (./libspark_columnar_jni.so)
undefined symbol: ZSTD_getErrorName     (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox7process12TraceContextD1Ev  (./libspark_columnar_jni.so)
undefined symbol: event_base_set        (./libspark_columnar_jni.so)
undefined symbol: _ZN5boost16re_detail_10710012perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE   (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox8encoding6Base646decodeEPKcmPc      (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox8encoding6Base646encodeEPKcmPc      (./libspark_columnar_jni.so)
undefined symbol: _ZN5boost16re_detail_10710019raise_runtime_errorERKSt13runtime_error  (./libspark_columnar_jni.so)
undefined symbol: ZSTD_decompress       (./libspark_columnar_jni.so)
undefined symbol: event_base_free       (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox8encoding6Base6420calculateEncodedSizeEmb   (./libspark_columnar_jni.so)
undefined symbol: _ZN5boost11basic_regexIcNS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE9do_assignEPKcS7_j     (./libspark_columnar_jni.so)
undefined symbol: _ZN5boost16re_detail_10710013get_mem_blockEv  (./libspark_columnar_jni.so)
undefined symbol: _ZN5boost16re_detail_10710014verify_optionsEjNS_15regex_constants12_match_flagsE      (./libspark_columnar_jni.so)
undefined symbol: event_set     (./libspark_columnar_jni.so)
undefined symbol: ZSTD_getFrameContentSize      (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox8encoding6Base646decodeB5cxx11EN5folly5RangeIPKcEE  (./libspark_columnar_jni.so)
undefined symbol: event_base_loop       (./libspark_columnar_jni.so)
undefined symbol: _ZN5boost16re_detail_10710012perl_matcherIPKcSaINS_9sub_matchIS3_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcEEEEE14construct_initERKNS_11basic_regexIcSA_EENS_15regex_constants12_match_flagsE       (./libspark_columnar_jni.so)
undefined symbol: event_del     (./libspark_columnar_jni.so)
undefined symbol: _ZNK5boost16re_detail_10710031cpp_regex_traits_implementationIcE17transform_primaryB5cxx11EPKcS4_     (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox7process12TraceContextC1ENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEb      (./libspark_columnar_jni.so)
undefined symbol: _ZNK4date9time_zone13get_info_implENSt6chrono10time_pointINS1_3_V212system_clockENS1_8durationIlSt5ratioILl1ELl1EEEEEE        (./libspark_columnar_jni.so)
undefined symbol: ZSTD_isError  (./libspark_columnar_jni.so)
undefined symbol: _ZNK5boost16re_detail_10710031cpp_regex_traits_implementationIcE9transformB5cxx11EPKcS4_      (./libspark_columnar_jni.so)
undefined symbol: LZ4_decompress_safe   (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox4dwio6common9exception18getExceptionLoggerEv        (./libspark_columnar_jni.so)
undefined symbol: event_get_version     (./libspark_columnar_jni.so)
undefined symbol: _ZN5boost16re_detail_10710024get_default_error_stringENS_15regex_constants10error_typeE       (./libspark_columnar_jni.so)
undefined symbol: event_base_loopbreak  (./libspark_columnar_jni.so)
undefined symbol: make_fcontext (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox4dwio6common11compression13lzoDecompressEPKcS5_PcS6_        (./libspark_columnar_jni.so)
undefined symbol: ZSTD_getErrorCode     (./libspark_columnar_jni.so)
undefined symbol: _ZNK4date9time_zone13get_info_implENSt6chrono10time_pointINS_7local_tENS1_8durationIlSt5ratioILl1ELl1EEEEEE   (./libspark_columnar_jni.so)
undefined symbol: ZSTD_compress (./libspark_columnar_jni.so)
undefined symbol: _ZN8facebook5velox35DeserializationRegistryForSharedPtrB5cxx11Ev      (./libspark_columnar_jni.so)
undefined symbol: event_base_get_method (./libspark_columnar_jni.so)

For example, for the undefined symbol _ZN3fLB10FLAGS_avx2E (./libspark_columnar_jni.so), I used c++filt to show the demangled name:

root@ubuntu:/home/gazelle/gazelle-jni/cpp/build/releases# c++filt _ZN3fLB10FLAGS_avx2E
fLB::FLAGS_avx2

The symbol FLAGS_avx2 is used by Velox, but I cannot find its definition.
I have no idea what to do next. Can someone help?
I compiled gazelle_jni on the velox_dev branch and Velox on the substrait branch. @rui-mo, can you give me some help?

Use the unified function names with Substrait

We previously used self-defined function names, which made them difficult for the backends to use. Therefore, we need to switch to the unified names specified in the Substrait yaml files.

Support columnar shuffle on ClickHouse

Support writing shuffle data from ClickHouse blocks and reading the data back into blocks.

Fix some fallback issues

Currently, there are some fallback issues when SparkPlan is SerializeFromObjectExec, ObjectHashAggregateExec and V2CommandExec, for example:

val tookTimeArr = Array(12, 23, 56, 100, 500, 20)
import spark.implicits._
val df = spark.sparkContext.parallelize(tookTimeArr.toSeq, 1).toDF("time")
df.summary().show(100, false)

When executing the above code, it will return a 'null' result.

Refine docs and compiling process

Easier steps to build with Velox
Refine the docs for building Arrow

Add compile script for ClickHouse backend

Compile scripts are needed for the ClickHouse backend.

Remove redundant JNI function nativeInitNative from gazelle-jni

Add a fallback mechanism based on Substrait plan validation

Each backend needs to validate the Substrait plan before the actual computation. If validation fails, the execution will fall back to vanilla Spark.
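
The validate-then-fall-back flow described above could be sketched roughly as follows. This is only an illustration under assumptions: the NativeBackend interface, the ToyBackend class, and the representation of a plan as a list of operator names are all hypothetical and are not Gluten's actual API or the Substrait format.

```java
import java.util.List;
import java.util.Set;

final class FallbackSketch {
    // Hypothetical contract for illustration; not Gluten's actual interface.
    interface NativeBackend {
        boolean validate(List<String> planOperators);
        String execute(List<String> planOperators);
    }

    // Toy backend that only accepts plans built from a fixed operator set.
    static final class ToyBackend implements NativeBackend {
        private final Set<String> supported =
                Set.of("read", "filter", "project", "aggregate");

        @Override
        public boolean validate(List<String> planOperators) {
            return supported.containsAll(planOperators);
        }

        @Override
        public String execute(List<String> planOperators) {
            return "native:" + String.join(">", planOperators);
        }
    }

    // Stand-in for the vanilla Spark execution path.
    static String executeVanilla(List<String> planOperators) {
        return "vanilla:" + String.join(">", planOperators);
    }

    // Validate first; only offload when the backend accepts the whole plan.
    static String run(NativeBackend backend, List<String> planOperators) {
        return backend.validate(planOperators)
                ? backend.execute(planOperators)
                : executeVanilla(planOperators);
    }

    public static void main(String[] args) {
        NativeBackend backend = new ToyBackend();
        // Every operator is supported, so this plan is offloaded.
        System.out.println(run(backend, List.of("read", "filter", "aggregate")));
        // "window" is not supported by the toy backend, so this plan falls back.
        System.out.println(run(backend, List.of("read", "window")));
    }
}
```

The point of the sketch is that validation happens before any native work starts, so a rejected plan never leaves a partially executed native pipeline behind.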

Enable Velox's calculation for TPC-H Q6

Enable Velox's Project and Aggregate
Data format conversion from Velox RowVector to Arrow RecordBatch

Separate the base layer and backend layer

The base layer will include the common code and configs used by every backend. The backend layer will include the code and configs specific to one backend. In this way, each backend builds its own specific layer on top of the base layer, and the computation for different backends is kept well separated.
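
One possible shape for this layering, as a minimal sketch: a shared base configuration that each backend overlays with its own settings. The Backend interface, the factory method, and every config key below are made up for illustration; none of them come from the Gluten code base.

```java
import java.util.HashMap;
import java.util.Map;

final class LayeredBackendSketch {
    // Backend layer contract. Hypothetical, for illustration only.
    interface Backend {
        String name();
        Map<String, String> backendConf();
    }

    // Base layer: settings shared by every backend (keys are made up).
    static Map<String, String> baseConf() {
        Map<String, String> conf = new HashMap<>();
        conf.put("columnar.enabled", "true");
        conf.put("batch.size", "4096");
        return conf;
    }

    // Small factory so the example needs no concrete backend classes.
    static Backend backend(String name, Map<String, String> conf) {
        return new Backend() {
            @Override public String name() { return name; }
            @Override public Map<String, String> backendConf() { return conf; }
        };
    }

    // Effective config = base layer first, then the backend's own layer on top.
    static Map<String, String> effectiveConf(Backend backend) {
        Map<String, String> conf = baseConf();
        conf.putAll(backend.backendConf()); // backend-specific values win
        return conf;
    }

    public static void main(String[] args) {
        Backend velox = backend("velox", Map.of("batch.size", "8192"));
        Backend clickhouse = backend("clickhouse", Map.of("shuffle.codec", "lz4"));
        // The first backend overrides a base value; the second inherits it.
        System.out.println(velox.name() + " batch.size="
                + effectiveConf(velox).get("batch.size"));
        System.out.println(clickhouse.name() + " batch.size="
                + effectiveConf(clickhouse).get("batch.size"));
    }
}
```

Keeping the base settings in one place means a new backend only has to declare what differs from the defaults.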

Register should happen once per executor

Enable filter pushdown for TPCH Q1&Q6 on arrow backend

Restructure the Velox invoking process

Enable Columnar Shuffle

Use unified Jni interfaces

Below parts need to be cleaned and unified:

ExpressionEvaluator
ExpressionEvaluatorJniWrapper
BatchIterator.java
JniUtils and JniInstance
createNativeKernelWithIterator
add a config to decide whether to load Gandiva, Arrow libraries

Add option for deciding whether to compile CPP and make the library name configurable

Update Substrait

Change the arrow version id

To avoid conflict with upstream arrow, we are advised to change the version id in Arrow-7.0 into a self-assigned id.
https://github.com/oap-project/gazelle-jni/blob/master/pom.xml#L116

Upgrade the used Velox version

Support Spark Datasource V1

Do we need to exclude Pre-Projection from Aggregate?

TPC-H Q6's Aggregation includes:
Pre-Projection (Multiply)
Aggregate (Sum)
Post-Projection (Cast to String)

In a local development branch, I have excluded Post-Projection from Aggregate in Scala side by creating a new ProjectRel when needed. Do we need to do that for Pre-Projection?

Add a native test for Velox compute

Rename all Gazelle classes to Gluten in the source code

We have to rename all classes related to Gazelle so that they use the Gluten name.
However, we should keep Gazelle CPP as one of the backend engines.

Make execution backend pluggable

Add a fake result for easier debugging

A fake result of TPC-H Q6 is needed for easier debugging.

Add doc for ClickHouse Backend

Enable the second stage

To enable the second stage, input will be fetched from Java iterator lazily.
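
To illustrate what lazy fetching means here, the sketch below wraps an upstream Java iterator and counts how many batches are actually pulled. It assumes batches are plain int arrays and uses a hypothetical sumFirst method as a stand-in for the second-stage operator; the real implementation pulls batches across JNI and is not shown.

```java
import java.util.Iterator;
import java.util.List;

final class LazyInputSketch {
    // Wraps an upstream iterator and records how many batches were pulled.
    static final class CountingIterator<T> implements Iterator<T> {
        private final Iterator<T> upstream;
        private int pulled = 0;

        CountingIterator(Iterator<T> upstream) {
            this.upstream = upstream;
        }

        @Override
        public boolean hasNext() {
            return upstream.hasNext();
        }

        @Override
        public T next() {
            pulled++;
            return upstream.next();
        }

        int pulled() {
            return pulled;
        }
    }

    // Stand-in for a second-stage operator: sums values from at most
    // `limit` batches and requests each batch only when it is needed.
    static int sumFirst(Iterator<int[]> input, int limit) {
        int sum = 0;
        for (int i = 0; i < limit && input.hasNext(); i++) {
            for (int v : input.next()) {
                sum += v;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<int[]> batches = List.of(new int[] {1, 2}, new int[] {3}, new int[] {4, 5});
        CountingIterator<int[]> input = new CountingIterator<>(batches.iterator());
        System.out.println("pulled before compute: " + input.pulled());
        System.out.println("sum of first two batches: " + sumFirst(input, 2));
        System.out.println("pulled after compute: " + input.pulled());
    }
}
```

Because the consumer drives the iterator, the third batch is never requested when only two are needed.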

Add docs for arrow backend

Unify Jni interfaces

Use the unified JNI interfaces to operate the different native engines.

Update Substrait parsing process

on master branch
on velox_dev branch
