Comments (4)
It depends on the column and on the type of partitions. For daily rolling partitions, NDVs will likely overlap.
Yup. Discussion from original issue: let's say you partition by date and you have two data columns:
1) userId
2) eventId
For 1) you should take the max (users come from the same ID pool), even though the userId NDV could be approximately the same as the partition row count.
For 2) you should sum, because each eventId is unique.
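To make the max-versus-sum trade-off concrete, here is a minimal sketch with made-up data. The partition names, rows, and the `ndv_estimates` helper are illustrative assumptions, not anything from Trino itself.

```python
# Hypothetical daily partitions; each row is (userId, eventId).
partitions = {
    "2024-01-01": [("u1", "e1"), ("u2", "e2"), ("u1", "e3")],
    "2024-01-02": [("u1", "e4"), ("u3", "e5")],
}


def ndv_estimates(column_index):
    """Compare max and sum of per-partition NDVs with the true table NDV."""
    per_partition = [
        len({row[column_index] for row in rows}) for rows in partitions.values()
    ]
    true_ndv = len(
        {row[column_index] for rows in partitions.values() for row in rows}
    )
    return {"max": max(per_partition), "sum": sum(per_partition), "true": true_ndv}


user_id = ndv_estimates(0)   # values overlap across partitions
event_id = ndv_estimates(1)  # values are unique per row

print(user_id)   # {'max': 2, 'sum': 4, 'true': 3}
print(event_id)  # {'max': 3, 'sum': 5, 'true': 5}
```

In this toy data the sum is exact for eventId, while for userId the true NDV falls between the max (a lower bound) and the sum (an upper bound).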
from trino.
Should we add/extrapolate NDVs instead?
Currently we choose the maximum NDV. If we decided to sum NDVs, then we would need to extrapolate.
It seems that partitions might often be different chunks of data so that NDVs don't overlap.
It depends on the column and on the type of partitions. For daily rolling partitions, NDVs will likely overlap.
Alternatively, we could store the HLL state per column as an auxiliary partition property and calculate the extrapolation based on merged HLLs.
I think that's the only option we have. To do so, we need to store HLL states per column in the partition properties, as the Metastore API doesn't allow you to store arbitrary statistics.
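As a rough illustration of why merged HLL state helps, the sketch below builds a toy HyperLogLog per partition and merges the two by taking the register-wise maximum. Everything here is an assumption made for illustration: the precision `P`, the SHA-1 based hash, and the function names are hypothetical, this is not Trino's HLL implementation, and serializing the registers into a partition property is not shown.

```python
import hashlib
import math

P = 12      # assumed precision: number of bits used to pick a register
M = 1 << P  # number of registers


def _hash64(value):
    """Derive a 64-bit hash from a value (illustrative choice of hash)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hll_from_values(values):
    """Build the HLL registers for one partition's column values."""
    registers = [0] * M
    for value in values:
        h = _hash64(value)
        index = h >> (64 - P)                    # top P bits select the register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
        registers[index] = max(registers[index], rank)
    return registers


def hll_merge(a, b):
    """Merging two sketches is the register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def hll_estimate(registers):
    """Standard HLL estimate with the small-range (linear counting) correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros > 0:
        return M * math.log(M / zeros)
    return raw


# Two hypothetical daily partitions sharing 500 of their 1000 user IDs each.
day1 = [f"user{i}" for i in range(0, 1000)]
day2 = [f"user{i}" for i in range(500, 1500)]

merged = hll_merge(hll_from_values(day1), hll_from_values(day2))
estimate = hll_estimate(merged)
print(round(estimate))  # close to the true NDV of 1500
```

The merged registers are identical to those of a sketch built over both partitions at once, so the overlap is accounted for without having to guess between max and sum.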
from trino.
@findepi I'm not sure we should call it a "bug". Taking the MAX NDV was a deliberate decision. Let's change the label to "enhancement".
from trino.
I think that's the only option we have. To do so, we need to store HLL states per column in the partition properties, as the Metastore API doesn't allow you to store arbitrary statistics.
We could store the HLL state in table/partition properties.
from trino.