Giter Site home page Giter Site logo

Comments (2)

mrinal-thomas avatar mrinal-thomas commented on May 20, 2024

Which of the APIs in your dream pseudocode are you missing? Maybe I'm not understanding, but it seems like everything in there is supported by the library (getting list of data fields, appending fields to list, constructing new schema from fields, etc.)

from parquet-dotnet.

daz10000 avatar daz10000 commented on May 20, 2024

Thanks for the quick and sorry for delay. I did explore the column api and was able to build something more ergonomic where I could programmatically construct the output schema from the input schema. It was much less ugly than my first pass where I had to explicitly define every column and more memory efficient doing it by columns, so thanks - the library was helpful. A few things (and if this is still too vague I'd be happy to elaborate). Looking at my final code for the column based approach which is (at a pseudocode level)

  • open existing parquet file
  • read 4 specific columns, calculate 2 more based on them
  • copy all existing columns and two new ones to output file

I like that the library supports low level and high level approaches, but it's a bit arbitrary that row oriented is high level. Small thing, but one line equivalent for the column approach (at least to get to the schema read). would be nice ergnomically (I use pandas constantly to just peek at a file, and having to open a file stream then the pq file is friction.

let! table = Parquet.ParquetReader.ReadTableFromFileAsync pqPath

Likewise for the output - it's nice to have control over all of this at a low level, but for lots of applications, a one line open for writing (with a single row group) would be handy. There's a lot of ceremony to handle the stream, and then the schema then the row group.

use outFs = File.OpenWrite(outFile)
let outSchema = new ParquetSchema(oldFields @ newFields)
use! writer = ParquetWriter.CreateAsync(outSchema,outFs)
use outWritegroup = writer.CreateRowGroup()

// write actual columns
do! outWritegroup.WriteColumnAsync(myColumn)
let outSchema = new ParquetSchema(oldFields @ newFields)
use! outPQ = ColumnWriter(outputPath,outSchema)

outPQ.WriteColumn(..)

Also small thing, but could you add a lookup for columns based on name. e.g. something like

let! column = myPqFile["mycolumn"]

I found myself writing a bunch of filters to find columns by index, and then remember the indices. The python apis all use strings for column id (I know it has its own pitfalls - an F# type provider would be a dream that made types with the columns as field names but that's a whole nother level of effort.

I'll play with it some more, but overall wonderful tool and just wondering how to get a little closer to the python versions, which are typically one line to read, one line to write. The downside of course is that it's harder to incrementally process files, so you do needed different flavors of high and low level. I guess the column and row oriented interfaces are just a little too tightly coupled with level right now, but I might also be missing something.

thanks!

from parquet-dotnet.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.