Comments (2)
Which of the APIs in your dream pseudocode are you missing? Maybe I'm not understanding, but it seems like everything in there is supported by the library (getting the list of data fields, appending fields to the list, constructing a new schema from the fields, etc.).
from parquet-dotnet.
Thanks for the quick reply, and sorry for the delay. I did explore the column API and was able to build something more ergonomic where I could programmatically construct the output schema from the input schema. It was much less ugly than my first pass, where I had to explicitly define every column, and more memory-efficient doing it by columns, so thanks; the library was helpful. A few things (and if this is still too vague I'd be happy to elaborate). My final code for the column-based approach is, at a pseudocode level:
- open existing parquet file
- read 4 specific columns, calculate 2 more based on them
- copy all existing columns and two new ones to output file
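For reference, here is roughly what that flow looks like against the low-level column API (a sketch only: the `derived1`/`derived2` field names, the `copyWithDerived` name, and the compute step are placeholders, and a single row group is assumed):

```fsharp
open System.IO
open Parquet
open Parquet.Schema
open Parquet.Data

// Sketch of the read / derive / copy-everything flow, single row group.
let copyWithDerived (inPath: string) (outPath: string) = task {
    use inFs = File.OpenRead inPath
    use! reader = ParquetReader.CreateAsync inFs
    let oldFields = reader.Schema.GetDataFields()

    // read every existing column from the first row group
    use rg = reader.OpenRowGroupReader 0
    let existing = ResizeArray()
    for f in oldFields do
        let! col = rg.ReadColumnAsync f
        existing.Add col

    // ...compute the two derived arrays from the input columns you need...
    let d1 = DataField<float> "derived1"
    let d2 = DataField<float> "derived2"
    let derivedData : float[] = Array.zeroCreate (int rg.RowCount)

    // build the output schema as old fields + new fields
    let outSchema =
        ParquetSchema(
            Array.append (oldFields |> Array.map (fun f -> f :> Field))
                         [| d1 :> Field; d2 :> Field |])

    // copy all existing columns plus the two derived ones
    use outFs = File.OpenWrite outPath
    use! writer = ParquetWriter.CreateAsync(outSchema, outFs)
    use outRg = writer.CreateRowGroup()
    for col in existing do
        do! outRg.WriteColumnAsync col
    do! outRg.WriteColumnAsync(DataColumn(d1, derivedData))
    do! outRg.WriteColumnAsync(DataColumn(d2, derivedData))
}
```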
I like that the library supports both low-level and high-level approaches, but it's a bit arbitrary that only the row-oriented path gets the high-level treatment. Small thing, but a one-line equivalent for the column approach (at least to get to the schema read) would be nice ergonomically. I use pandas constantly just to peek at a file, and having to open a file stream and then the parquet file is friction. The row-oriented API already has it:
let! table = Parquet.ParquetReader.ReadTableFromFileAsync pqPath
Likewise for the output: it's nice to have control over all of this at a low level, but for lots of applications a one-line open for writing (with a single row group) would be handy. There's a lot of ceremony to handle the stream, then the schema, then the row group:
use outFs = File.OpenWrite(outFile)
let outSchema = new ParquetSchema(oldFields @ newFields)
use! writer = ParquetWriter.CreateAsync(outSchema, outFs)
use outWritegroup = writer.CreateRowGroup()
// write actual columns
do! outWritegroup.WriteColumnAsync(myColumn)
versus something like:
let outSchema = new ParquetSchema(oldFields @ newFields)
use! outPQ = ColumnWriter(outputPath, outSchema)
outPQ.WriteColumn(..)
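For what it's worth, the ceremony does collapse into a small helper on the user side; a sketch (the `writeAllColumns` name is mine, not library API, and a single row group is assumed):

```fsharp
open System.IO
open Parquet
open Parquet.Schema
open Parquet.Data

/// One-shot write: open the file, write every column into a single
/// row group, and dispose the row group, writer, and stream on exit.
let writeAllColumns (path: string) (schema: ParquetSchema) (columns: DataColumn seq) = task {
    use fs = File.OpenWrite path
    use! writer = ParquetWriter.CreateAsync(schema, fs)
    use rg = writer.CreateRowGroup()
    for col in columns do
        do! rg.WriteColumnAsync col
}
```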
Also a small thing, but could you add a lookup for columns by name? e.g. something like
let! column = myPqFile["mycolumn"]
I found myself writing a bunch of filters to find columns by index, and then having to remember the indices. The Python APIs all use strings for column IDs. (I know that has its own pitfalls; an F# type provider that generated types with the columns as field names would be a dream, but that's a whole other level of effort.)
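In the meantime a tiny shim over the schema covers it; a sketch (the `tryField` / `readColumnByName` names are mine, not library API):

```fsharp
open Parquet
open Parquet.Schema

// Find a data field in the schema by column name.
let tryField (schema: ParquetSchema) (name: string) =
    schema.GetDataFields() |> Array.tryFind (fun f -> f.Name = name)

// Read a column from a row group reader by name, if the column exists.
let readColumnByName (rg: ParquetRowGroupReader) (schema: ParquetSchema) (name: string) = task {
    match tryField schema name with
    | Some f ->
        let! col = rg.ReadColumnAsync f
        return Some col
    | None -> return None
}
```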
I'll play with it some more, but overall it's a wonderful tool; I'm just wondering how to get a little closer to the Python versions, which are typically one line to read and one line to write. The downside, of course, is that it's harder to incrementally process files, so you do need different flavors of high and low level. I guess the column- and row-oriented interfaces are just a little too tightly coupled with level right now, but I might also be missing something.
thanks!