Comments (5)
@MrPowers - Thanks for the easy-to-try example!
The code snippet above definitely works, but I am concerned about cases where different versions of the Delta table have different schemas/columns. Combining all the versions with different columns is not supported out of the box in dd.read_parquet (i.e., schema evolution is not supported), so for this we depend heavily on the pyarrow dataset API.
A pyarrow dataset can read Parquet files with different schemas/columns into a single data frame (filling NaNs for the columns a file does not have).
I have tried adding a wrapper for delta-rs in this branch: https://github.com/rajagurunath/dask_deltatable/tree/feature/delta-rs. After some more documentation and testing, we will create a pull request and release a new package.
Cheers
from dask-deltatable.
Hi @MrPowers,
That's a great suggestion. 👍
We initially discussed integrating delta-rs and Dask here, and at that time we were facing some problems reading data from S3, AzureFS, and GCSFS.
Now I think there is explicit support for the pyarrow filesystem to read from different backends.
After your suggestion, I have started working on the delta-rs integration and am currently trying to figure out how to parallelize IO using Dask (reading multiple Parquet files). If we rely on the delta-rs to_pyarrow_dataset function, I guess all the Parquet files are read in the same thread (one Dask task). So I am planning to take all the advantages of delta-rs and implement only the parallel reading function separately, using Dask Delayed.
Any other suggestions for parallelizing the Delta Lake read using Dask?
And once again, thanks for trying this out and opening an issue here.
@rajagurunath - I actually think this is going to be super easy.
I think this will work:
from deltalake import DeltaTable
import dask.dataframe as dd
dt = DeltaTable("tmp/some-delta-pyspark")
ddf = dd.read_parquet(dt.files())
The read_parquet method takes a list of file names.
Can you try and let me know if it works?
Sounds good, keep me posted ;)
Hi @MrPowers,
The work on integrating delta-rs and Dask is almost complete.
If you have some time, could you please review the PR created here, #3, and let me know your suggestions?
Thanks in advance.
Related Issues (20)
- Handle timestamps other than `datetime64[us]`
- Release soon? HOT 5
- Finalize API for writing Delta Tables HOT 1
- Support pyarrow types_mapper kwarg
- Pickle error with `ParquetFileWriteOptions` and `distributed.Client`
- Support reading and writing to remote filesystems (s3, gcsfs, azure)
- Credentials for remote filesystems?
- `storage_options` inconsistency between `read_deltalake` and `to_deltalake`
- `TypeError`: cannot pickle `builtins.RawDeltaTable` object
- `read_deltalake` vs `read_parquet` performance HOT 1
- Can we get rid of `filters_to_expression`?
- What are the limitations of to_deltalake? HOT 1
- Problem with `pyarrow` dependency when installing dask-deltatable HOT 3
- Failed import when running `deltalake==0.14.0` HOT 4
- Order data by partitions if available HOT 3
- Specify AWS Permissions if reading from S3 HOT 1
- Overwriting tables
- `ImportError` with `deltalake=0.16.0` HOT 4
- Example in Readme not reproducible HOT 2
- `read_deltalake` breaks with dask>=2024.3.1 HOT 7