Comments (1)
I agree that, in the long run, the community as a whole would benefit more if Spark supported indexes as a concept in the Data Source V2 API. In fact, work on this is already under way (SPARK-36525). If you want to contribute in that direction, it might be better to work on Spark directly.
I think one of the values of Hyperspace is that it is, or at least tries to be, data format agnostic. For example, covering indexes work as long as the data source supports creating, from an existing dataset, a new dataset with fewer columns and laid out in a certain way (e.g. bucketing) for efficient scanning by the selected columns. Data skipping indexes work as long as the data source stores data in multiple objects (e.g. files) and can compute aggregations grouped by object. So we support not only Parquet, Delta Lake, and Iceberg, but also CSV and other data sources, as long as they have these capabilities.
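To make the data skipping requirement concrete, here is a minimal sketch in plain Python. It is not Hyperspace's implementation; `FileStats`, `build_min_max_index`, and `files_to_scan` are hypothetical names, and the "files" are just in-memory lists. It only assumes what the paragraph above states: data is split into multiple objects, and an aggregate (here min/max) can be computed per object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-object aggregate kept by the (hypothetical) index."""
    path: str
    min_value: int
    max_value: int


def build_min_max_index(files):
    """Compute one min/max aggregate per object (file).

    `files` maps a path to the values of the indexed column in that file.
    Empty files are left out because they can never match a filter.
    """
    return [
        FileStats(path, min(values), max(values))
        for path, values in files.items()
        if values
    ]


def files_to_scan(index, lower, upper):
    """Return only the files whose [min, max] range overlaps [lower, upper]."""
    return [
        stats.path
        for stats in index
        if stats.max_value >= lower and stats.min_value <= upper
    ]


# Illustrative data: the format of the files does not matter to the index.
files = {
    "part-0.csv": [1, 5, 9],
    "part-1.csv": [10, 14, 19],
    "part-2.csv": [20, 27, 30],
}
index = build_min_max_index(files)
selected = files_to_scan(index, 12, 21)
print(selected)
```

Nothing in the sketch depends on the file format, which is the point of the argument: any source that exposes its objects and supports a grouped aggregation could, in principle, back this kind of index.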
Looking at the current code and how it's evolving, it seems the index API in Spark doesn't allow Hyperspace-like indexing subsystems, which support more than one data source, to be plugged in. If we want to build Hyperspace around the Data Source V2 API, then we should propose a suitable change to the API so that, for example, we can hook into the index API.