Comments (2)
Which of the APIs in your dream pseudocode are you missing? Maybe I'm not understanding, but it seems like everything in there is supported by the library (getting the list of data fields, appending fields to the list, constructing a new schema from the fields, etc.).
from parquet-dotnet.
Thanks for the quick reply, and sorry for the delay. I did explore the column API and was able to build something more ergonomic where I could programmatically construct the output schema from the input schema. It was much less ugly than my first pass, where I had to explicitly define every column, and more memory-efficient doing it by columns, so thanks; the library was helpful. A few things (and if this is still too vague I'd be happy to elaborate). My final code for the column-based approach is, at a pseudocode level:
- open existing parquet file
- read 4 specific columns, calculate 2 more based on them
- copy all existing columns and two new ones to output file
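For reference, here is roughly what that flow looks like against the low-level column API (a sketch only: the `derived1`/`derived2` field names, the `copyWithDerived` name, and the compute step are placeholders, and a single row group is assumed):

```fsharp
open System.IO
open Parquet
open Parquet.Schema
open Parquet.Data

// Sketch of the read / derive / copy-everything flow, single row group.
let copyWithDerived (inPath: string) (outPath: string) = task {
    use inFs = File.OpenRead inPath
    use! reader = ParquetReader.CreateAsync inFs
    let oldFields = reader.Schema.GetDataFields()

    // read every existing column from the first row group
    use rg = reader.OpenRowGroupReader 0
    let existing = ResizeArray()
    for f in oldFields do
        let! col = rg.ReadColumnAsync f
        existing.Add col

    // ...compute the two derived arrays from the input columns you need...
    let d1 = DataField<float> "derived1"
    let d2 = DataField<float> "derived2"
    let derivedData : float[] = Array.zeroCreate (int rg.RowCount)

    // build the output schema as old fields + new fields
    let outSchema =
        ParquetSchema(
            Array.append (oldFields |> Array.map (fun f -> f :> Field))
                         [| d1 :> Field; d2 :> Field |])

    // copy all existing columns plus the two derived ones
    use outFs = File.OpenWrite outPath
    use! writer = ParquetWriter.CreateAsync(outSchema, outFs)
    use outRg = writer.CreateRowGroup()
    for col in existing do
        do! outRg.WriteColumnAsync col
    do! outRg.WriteColumnAsync(DataColumn(d1, derivedData))
    do! outRg.WriteColumnAsync(DataColumn(d2, derivedData))
}
```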
I like that the library supports both low-level and high-level approaches, but it's a bit arbitrary that only the row-oriented path gets the high-level treatment. Small thing, but a one-line equivalent for the column approach (at least to get to the schema read) would be nice ergonomically. I use pandas constantly just to peek at a file, and having to open a file stream and then the parquet file is friction. The row-oriented API already has it:
let! table = Parquet.ParquetReader.ReadTableFromFileAsync pqPath
Likewise for the output: it's nice to have control over all of this at a low level, but for lots of applications a one-line open for writing (with a single row group) would be handy. There's a lot of ceremony to handle the stream, then the schema, then the row group:
use outFs = File.OpenWrite(outFile)
let outSchema = new ParquetSchema(oldFields @ newFields)
use! writer = ParquetWriter.CreateAsync(outSchema, outFs)
use outWritegroup = writer.CreateRowGroup()
// write actual columns
do! outWritegroup.WriteColumnAsync(myColumn)
versus something like:
let outSchema = new ParquetSchema(oldFields @ newFields)
use! outPQ = ColumnWriter(outputPath, outSchema)
outPQ.WriteColumn(..)
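For what it's worth, the ceremony does collapse into a small helper on the user side; a sketch (the `writeAllColumns` name is mine, not library API, and a single row group is assumed):

```fsharp
open System.IO
open Parquet
open Parquet.Schema
open Parquet.Data

/// One-shot write: open the file, write every column into a single
/// row group, and dispose the row group, writer, and stream on exit.
let writeAllColumns (path: string) (schema: ParquetSchema) (columns: DataColumn seq) = task {
    use fs = File.OpenWrite path
    use! writer = ParquetWriter.CreateAsync(schema, fs)
    use rg = writer.CreateRowGroup()
    for col in columns do
        do! rg.WriteColumnAsync col
}
```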
Also a small thing, but could you add a lookup for columns by name? e.g. something like
let! column = myPqFile["mycolumn"]
I found myself writing a bunch of filters to find columns by index, and then having to remember the indices. The Python APIs all use strings for column IDs. (I know that has its own pitfalls; an F# type provider that generated types with the columns as field names would be a dream, but that's a whole other level of effort.)
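In the meantime a tiny shim over the schema covers it; a sketch (the `tryField` / `readColumnByName` names are mine, not library API):

```fsharp
open Parquet
open Parquet.Schema

// Find a data field in the schema by column name.
let tryField (schema: ParquetSchema) (name: string) =
    schema.GetDataFields() |> Array.tryFind (fun f -> f.Name = name)

// Read a column from a row group reader by name, if the column exists.
let readColumnByName (rg: ParquetRowGroupReader) (schema: ParquetSchema) (name: string) = task {
    match tryField schema name with
    | Some f ->
        let! col = rg.ReadColumnAsync f
        return Some col
    | None -> return None
}
```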
I'll play with it some more, but overall it's a wonderful tool; I'm just wondering how to get a little closer to the Python versions, which are typically one line to read and one line to write. The downside, of course, is that it's harder to incrementally process files, so you do need different flavors of high and low level. I guess the column- and row-oriented interfaces are just a little too tightly coupled with level right now, but I might also be missing something.
thanks!