Comments (9)
Cause:
- insert phase
When SET spark.sql.storeAssignmentPolicy = LEGACY is set, the insert operator (ORCBlockOutputFormat for the ORC format) uses the CH Block emitted by FakeRow as its header. As a result, the ORC file that ORCBlockOutputFormat generates differs in types from the CREATE TABLE SQL.
old_header corresponds to the schema of the CH Block emitted by FakeRow, where column c has type Tuple(String, Nullable(String)).
new_header corresponds to the schema in the table metadata, where column c has type Tuple(d Nullable(String), e Nullable(String)).
old_header:a#27 Int32 Int32(size = 0), b#28 Map(String, String) Map(size = 0, Array(size = 0, UInt64(size = 0), Tuple(size = 0, String(size = 0), String(size = 0)))), c#29 Tuple(String, Nullable(String)) Tuple(size = 0, String(size = 0), Nullable(size = 0, String(size = 0), UInt8(size = 0)))
new_header:a Int32 Int32(size = 0), b Map(String, Nullable(String)) Map(size = 0, Array(size = 0, UInt64(size = 0), Tuple(size = 0, String(size = 0), Nullable(size = 0, String(size = 0), UInt8(size = 0))))), c Tuple(d Nullable(String), e Nullable(String)) Tuple(size = 0, Nullable(size = 0, String(size = 0), UInt8(size = 0)), Nullable(size = 0, String(size = 0), UInt8(size = 0)))
- select phase
On top of the above, when reading the ORC file, the column whose type in the file is Tuple(String, Nullable(String)) is read with target type Tuple(d Nullable(String), e Nullable(String)), so the correct data cannot be retrieved. Overall this shows up as inconsistent data between insert and select.
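The two phases above can be modeled with a minimal sketch in plain Python (not the actual Gluten/ClickHouse code; `read_struct_by_name` is a hypothetical helper): the file written under LEGACY carries positional field names, the reader resolves fields by the table schema's names and finds none, so the whole struct comes back NULL.

```python
# Minimal model of the mismatch: under LEGACY the writer keeps the FakeRow
# header's unnamed tuple, so the ORC struct gets positional field names
# ("1", "2"); the reader then resolves fields by the table schema's names
# ("d", "e") and finds nothing.

def read_struct_by_name(orc_struct, target_fields):
    """Resolve struct fields by name, as the reader does for the target
    type Tuple(d Nullable(String), e Nullable(String))."""
    if not all(name in orc_struct for name in target_fields):
        return None  # no matching fields -> whole struct read as NULL
    return {name: orc_struct[name] for name in target_fields}

# File written via native insert under LEGACY: positional names.
legacy_file_struct = {"1": "1", "2": None}
# File whose struct fields match the table schema's names.
named_file_struct = {"d": "1", "e": None}

print(read_struct_by_name(legacy_file_struct, ["d", "e"]))  # None -> column c is NULL
print(read_struct_by_name(named_file_struct, ["d", "e"]))   # {'d': '1', 'e': None}
```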
from incubator-gluten.
Seems not related with #4317. I reverted the pr code and the issue still can be reproduced.
@exmy
In Spark 3.5, this SQL looks to work for me. Could you try the same SQL on 3.3?
insert overwrite tbl partition (day)
select id as a,
map('t1', 'a', 't2', 'b'),
struct('1', null) as c,
'2024-01-08' as day
from range(10)
> @exmy In spark 3.5, i use this sql, it looks work, could you try this sql in 3.3?
> insert overwrite tbl partition (day) select id as a, map('t1', 'a', 't2', 'b'), struct('1', null) as c, '2024-01-08' as day from range(10)
In Spark 3.3 it can still be reproduced.
Can't be reproduced in spark 3.5 with this branch: #6601
(screenshot attached in the original comment showing the query result; the signed GitHub image URL has expired)
Let's dig into it.
It can be reproduced with the Spark configs below:
SET spark.sql.catalogImplementation = 'hive';
SET spark.sql.files.maxPartitionBytes = 1g;
SET spark.serializer = 'org.apache.spark.serializer.JavaSerializer';
SET spark.sql.shuffle.partitions = 5;
SET spark.sql.adaptive.enabled = true;
SET spark.sql.files.minPartitionNum = 1;
SET spark.databricks.delta.maxSnapshotLineageLength = 20;
SET spark.databricks.delta.snapshotPartitions = 1;
SET spark.databricks.delta.properties.defaults.checkpointInterval = 5;
SET spark.databricks.delta.stalenessLimit = 3600000;
SET spark.gluten.sql.columnar.columnartorow = true;
SET spark.gluten.sql.columnar.backend.ch.worker.id = '1';
SET spark.gluten.sql.columnar.iterator = true;
SET spark.gluten.sql.columnar.hashagg.enablefinal = true;
SET spark.gluten.sql.enable.native.validation = false;
SET spark.sql.storeAssignmentPolicy = LEGACY;
SET spark.gluten.sql.columnar.backend.ch.runtime_config.logger.level = 'debug';
0: jdbc:hive2://localhost:10000/> select * from tbl;
+----+----------------------+-------+-------------+
| a | b | c | day |
+----+----------------------+-------+-------------+
| 0 | {"t1":"a","t2":"b"} | NULL | 2024-01-08 |
+----+----------------------+-------+-------------+
1 row selected (0.279 seconds)
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> set spark.gluten.enabled = false;
+-----------------------+--------+
| key | value |
+-----------------------+--------+
| spark.gluten.enabled | false |
+-----------------------+--------+
1 row selected (0.026 seconds)
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> select * from tbl;
+----+----------------------+----------------------+-------------+
| a | b | c | day |
+----+----------------------+----------------------+-------------+
| 0 | {"t1":"a","t2":"b"} | {"d":null,"e":null} | 2024-01-08 |
+----+----------------------+----------------------+-------------+
1 row selected (0.541 seconds)
file contents:
$ orc-contents ./part-00000-e89c5aef-d240-4d78-80a1-d04ebaf5868d.c000.lz4.orc
{"a": 0, "b": [{"key": "t1", "value": "a"}, {"key": "t2", "value": "b"}], "c": {"1": "1", "2": null}}
Clue 1: the ORC files generated by native and non-native insert have a schema diff.
native insert
$ orc-contents ./part-00000-8d8efdee-03ff-49e4-81fd-536854c0894d.c000.lz4.orc
{"a": 0, "b": [{"key": "t1", "value": "a"}, {"key": "t2", "value": "b"}], "c": {"d": "1", "e": null}}
non-native insert
$ orc-contents ./part-00000-e89c5aef-d240-4d78-80a1-d04ebaf5868d.c000.lz4.orc
{"a": 0, "b": [{"key": "t1", "value": "a"}, {"key": "t2", "value": "b"}], "c": {"1": "1", "2": null}}
Clue 2
native insert with SET spark.sql.storeAssignmentPolicy = ANSI. Note that ANSI is the default value.
$ orc-contents ./part-00000-f7df5042-0bd8-4831-a40f-2b43d2309b3f.c000.lz4.orc
{"a": 0, "b": [{"key": "t1", "value": "a"}, {"key": "t2", "value": "b"}], "c": {"d": "1", "e": null}}
0: jdbc:hive2://localhost:10000/> select * from tbl;
+----+----------------------+---------------------+-------------+
| a | b | c | day |
+----+----------------------+---------------------+-------------+
| 0 | {"t1":"a","t2":"b"} | {"d":"1","e":null} | 2024-01-08 |
+----+----------------------+---------------------+-------------+
1 row selected (0.284 seconds)
native insert with SET spark.sql.storeAssignmentPolicy = LEGACY
$ orc-contents ./part-00000-0c3bb661-97e9-4839-88ac-2ecb5d19a4e5.c000.lz4.orc
{"a": 0, "b": [{"key": "t1", "value": "a"}, {"key": "t2", "value": "b"}], "c": {"1": "1", "2": null}}
0: jdbc:hive2://localhost:10000/> select * from tbl;
+----+----------------------+-------+-------------+
| a | b | c | day |
+----+----------------------+-------+-------------+
| 0 | {"t1":"a","t2":"b"} | NULL | 2024-01-08 |
+----+----------------------+-------+-------------+
1 row selected (0.899 seconds)
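Clue 2 suggests the direction a fix could take (an assumption on my side, not the actual patch): before handing the block to ORCBlockOutputFormat, rebind the unnamed tuple's positional fields to the table schema's names, which is effectively what the ANSI path ends up producing. A sketch in plain Python (`align_struct_names` is a hypothetical helper):

```python
def align_struct_names(value, target_names):
    """Positionally rebind struct values to the target schema's field names,
    so the written ORC struct matches the table metadata (d, e) instead of
    the FakeRow header's positional names (1, 2)."""
    if len(value) != len(target_names):
        raise ValueError("struct arity mismatch between header and table schema")
    return dict(zip(target_names, value.values()))

print(align_struct_names({"1": "1", "2": None}, ["d", "e"]))  # {'d': '1', 'e': None}
```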