
parquetjs's Introduction

CURRENT STATUS: INACTIVE

This project requires a major overhaul, as well as handling and sorting through dozens of issues and PRs. Please contact me if you're up for the task.

parquet.js

fully asynchronous, pure node.js implementation of the Parquet file format


This package contains a fully asynchronous, pure JavaScript implementation of the Parquet file format. The implementation conforms with the Parquet specification and is tested for compatibility with Apache's Java reference implementation.

What is Parquet?: Parquet is a column-oriented file format; it allows you to write a large amount of structured data to a file, compress it and then read parts of it back out efficiently. The Parquet format is based on Google's Dremel paper.

Installation

To use parquet.js with node.js, install it using npm:

  $ npm install parquetjs

parquet.js requires node.js >= 8

Usage: Writing files

Once you have installed the parquet.js library, you can import it as a single module:

var parquet = require('parquetjs');

Parquet files have a strict schema, similar to tables in a SQL database. So, in order to produce a Parquet file we first need to declare a new schema. Here is a simple example that shows how to instantiate a ParquetSchema object:

// declare a schema for the `fruits` table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' },
  price: { type: 'DOUBLE' },
  date: { type: 'TIMESTAMP_MILLIS' },
  in_stock: { type: 'BOOLEAN' }
});

Note that the Parquet schema supports nesting, so you can store complex, arbitrarily nested records into a single row (more on that later) while still maintaining good compression.

Once we have a schema, we can create a ParquetWriter object. The writer will take input rows as JSON objects, convert them to the Parquet format and store them on disk.

// create a new ParquetWriter that writes to 'fruits.parquet'
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');

// append a few rows to the file
await writer.appendRow({name: 'apples', quantity: 10, price: 2.5, date: new Date(), in_stock: true});
await writer.appendRow({name: 'oranges', quantity: 10, price: 2.5, date: new Date(), in_stock: true});

Once we are finished adding rows to the file, we have to tell the writer object to flush the metadata to disk and close the file by calling the close() method:

await writer.close();

Usage: Reading files

A parquet reader allows retrieving the rows from a parquet file in order. The basic usage is to create a reader and then retrieve a cursor/iterator which allows you to consume row after row until all rows have been read.

You may open more than one cursor and use them concurrently. All cursors become invalid once close() is called on the reader object.

// create a new ParquetReader that reads from 'fruits.parquet'
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

// create a new cursor
let cursor = reader.getCursor();

// read all records from the file and print them
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

When creating a cursor, you can optionally request that only a subset of the columns should be read from disk. For example:

// create a new cursor that will only return the `name` and `price` columns
let cursor = reader.getCursor(['name', 'price']);

It is important that you call close() after you are finished reading the file to avoid leaking file descriptors.

await reader.close();

Encodings

Internally, the Parquet format stores the values of each field as consecutive arrays, which can be compressed/encoded using a number of schemes.

Plain Encoding (PLAIN)

The simplest encoding scheme is the PLAIN encoding. It stores the values as they are, without any compression. The PLAIN encoding is currently the default for all types except BOOLEAN:

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8', encoding: 'PLAIN' },
});

Run Length Encoding (RLE)

The Parquet hybrid run-length/bit-packing encoding compresses runs of numbers very efficiently. Note that the RLE encoding can only be used in combination with the BOOLEAN, INT32 and INT64 types. The RLE encoding requires an additional bitWidth parameter that contains the maximum number of bits required to store the largest value of the field.

var schema = new parquet.ParquetSchema({
  age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },
});

Optional Fields

By default, all fields are required to be present in each row. You can also mark a field as 'optional', which lets you store rows with that field missing:

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64', optional: true },
});

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
await writer.appendRow({name: 'apples', quantity: 10 });
await writer.appendRow({name: 'banana' }); // not in stock

Nested Rows & Arrays

Parquet supports nested schemas that allow you to store rows that have a more complex structure than a simple tuple of scalar values. To declare a schema with a nested field, omit the type in the column definition and add a fields list instead.

Consider this example, which allows us to store a more advanced "fruits" table where each row contains a name, a list of colours and a list of "stock" objects:

// advanced fruits table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  colours: { type: 'UTF8', repeated: true },
  stock: {
    repeated: true,
    fields: {
      price: { type: 'DOUBLE' },
      quantity: { type: 'INT64' },
    }
  }
});

// the above schema allows us to store the following rows:
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');

await writer.appendRow({
  name: 'banana',
  colours: ['yellow'],
  stock: [
    { price: 2.45, quantity: 16 },
    { price: 2.60, quantity: 420 }
  ]
});

await writer.appendRow({
  name: 'apple',
  colours: ['red', 'green'],
  stock: [
    { price: 1.20, quantity: 42 },
    { price: 1.30, quantity: 230 }
  ]
});

await writer.close();

// reading nested rows with a list of explicit columns
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

let cursor = reader.getCursor([['name'], ['stock', 'price']]);
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

await reader.close();

It might not be obvious why one would want to implement or use such a feature when the same could, in principle, be achieved by serializing the record using JSON (or a similar scheme) and then storing it into a UTF8 field.

Putting aside the philosophical discussion on the merits of strict typing, knowing about the structure and subtypes of all records (globally) means we do not have to duplicate this metadata (i.e. the field names) for every record. On top of that, knowing about the type of a field allows us to compress the remaining data more efficiently.
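
For comparison, here is an illustrative sketch of the JSON-blob alternative next to a typed schema; the blob approach re-serializes every field name into every record and gives the encoder no type information to exploit:

// option 1: store each record as an opaque JSON string
// (field names and types are repeated in every single row)
var blobSchema = new parquet.ParquetSchema({
  record: { type: 'UTF8' }
});

// option 2: a typed schema, where field names live once in the file metadata
// and each column can be encoded according to its type
var typedSchema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' }
});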

List of Supported Types & Encodings

We aim to be feature-complete and add new features as they are added to the Parquet specification; this is the list of currently implemented data types and encodings:

Logical Type       Primitive Type   Encodings
-----------------  ---------------  -----------
UTF8               BYTE_ARRAY       PLAIN
JSON               BYTE_ARRAY       PLAIN
BSON               BYTE_ARRAY       PLAIN
BYTE_ARRAY         BYTE_ARRAY       PLAIN
TIME_MILLIS        INT32            PLAIN, RLE
TIME_MICROS        INT64            PLAIN, RLE
TIMESTAMP_MILLIS   INT64            PLAIN, RLE
TIMESTAMP_MICROS   INT64            PLAIN, RLE
BOOLEAN            BOOLEAN          PLAIN, RLE
FLOAT              FLOAT            PLAIN
DOUBLE             DOUBLE           PLAIN
INT32              INT32            PLAIN, RLE
INT64              INT64            PLAIN, RLE
INT96              INT96            PLAIN
INT_8              INT32            PLAIN, RLE
INT_16             INT32            PLAIN, RLE
INT_32             INT32            PLAIN, RLE
INT_64             INT64            PLAIN, RLE
UINT_8             INT32            PLAIN, RLE
UINT_16            INT32            PLAIN, RLE
UINT_32            INT32            PLAIN, RLE
UINT_64            INT64            PLAIN, RLE

Buffering & Row Group Size

When writing a Parquet file, the ParquetWriter will buffer rows in memory until a row group is complete (or close() is called) and then write out the row group to disk.

The size of a row group is configurable by the user and controls the maximum number of rows that are buffered in memory at any given time as well as the number of rows that are co-located on disk:

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(8192);

Dependencies

Parquet uses thrift to encode the schema and other metadata, but the actual data does not use thrift.

Contributions

Please make sure you sign the contributor license agreement in order for us to be able to accept your contribution. We thank you very much!

License

Copyright (c) 2017-2019 ironSource Ltd.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

parquetjs's People

Contributors

arnabguptadev, asmuth, dobesv, dominictarr, gregplaysguitar, jeffbski-rga, jthomerson, kessler, markov00, nateschickler0, wgalecki, zectbynmo, zjonsson

parquetjs's Issues

v0.8.0 isn't published on npm

Hello!

I've been trying to use this module for a project of mine, but I'm running into a few issues (#24), which are entirely solved by v0.8.0.

Is there any way that you could publish the new version on npm?

Thanks :)

Memory Leaks?

Hi,

I'm using elasticsearchJS to export a whole index from ES in batches of 4096.
The whole tool uses about 500 MB of RAM while dumping the ES index to Parquet format
(Node.js has a 2 GB memory limit set).

If I lower or increase the batch size (or sometimes at random), it uses a lot of memory, like 2-3 GB, and gets killed.
The quickest way to reproduce this is to increase the batch size it has to process.
The generated parquet file is usually ~5.4 GB.

Is there anything I can do to debug this more?

Thanks!

P.S.: I'm using git+ssh://[email protected]/ironSource/parquetjs.git#1fa58b589d9b6451379f1558214e9ae751909596 as the parquetJS package.

Error when field has no values

When an optional field has 0 values, the generated file seems unreadable by parquet-tools; the error I get is can not read class org.apache.parquet.format.PageHeader: Required field 'num_nulls' was not found in serialized data! Struct: DataPageHeaderV2(num_values:10, num_nulls:0, num_rows:10, encoding:PLAIN, definition_levels_byte_length:20, repetition_levels_byte_length:0).
When I forced the num_nulls value to be no less than 1, it started working. I'm not sure if this is an issue with the module or with the way I'm trying to use it, so I'm letting you know.

Return array of buffers instead of buffer.concat

Buffer.concat, particularly on the whole rowgroup buffers, can get expensive memory-wise, as we'll end up with two copies of the whole buffer (the individual parts and the concatenated version) for a moment before the garbage collector picks up the pieces. This can limit the maximum size of the rowgroup.

It might be more efficient to return bodyParts (a sequential array of buffers) in general instead of Buffer.concat, and then in writeSection loop through the bodyParts and write them sequentially.
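
For illustration, here is a minimal sketch of the proposed approach (not parquetjs internals; writeBodyParts is a hypothetical helper): the individual buffers are written to the file descriptor one after another instead of being concatenated first, so only one copy of the row group data is held in memory.

const fs = require('fs');

// write a sequential array of buffers to an open file descriptor
async function writeBodyParts(fd, bodyParts) {
  for (const part of bodyParts) {
    await new Promise((resolve, reject) => {
      fs.write(fd, part, 0, part.length, null, (err) => {
        if (err) reject(err); else resolve();
      });
    });
  }
}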

Cannot execute the demo code for writing a parquet file.

I tried the code below....


var parquet = require('parquetjs');

// declare a schema for the fruits table
var schema = new parquet.ParquetSchema({
name: { type: 'UTF8' },
quantity: { type: 'INT64' },
price: { type: 'DOUBLE' },
in_stock: { type: 'BOOLEAN' }
});

// create new ParquetWriter that writes to 'fruits.parquet`
var writer = new parquet.ParquetFileWriter(schema, 'fruits.parquet');

This gives me an error.

1<process.argv.length?process.argv[1].replace(/\\/g,"/"):"unknown-program");b.arguments=process.argv.slice(2);"undefined"!==typeof module&&(module.exports=b);process.on("uncaughtException",function(a){if(!(a instanceof y))throw a;});b.inspect=function(){return"[Emscripten Module object]"}}else if(x)b.print||(b.print=print),"undefined"!=typeof printErr&&(b.printErr=printErr),b.read="undefined"!=typeof read?read:function(){throw"no read() available (jsc?)";},b.readBinary=function(a){if("function"===
                                                                                                                                                                                                                              ^

TypeError: parquet.ParquetFileWriter is not a constructor
    at Object.<anonymous> (/usr/local/globalcdn/playground/parquet/index.js:14:14)
    at Module._compile (module.js:652:30)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)
    at tryModuleLoad (module.js:505:12)
    at Function.Module._load (module.js:497:3)
    at Function.Module.runMain (module.js:693:10)
    at startup (bootstrap_node.js:188:16)
    at bootstrap_node.js:609:3
---------------------------------------------------------------------------------------------

**The parquetjs is installed with this message.**
---------------------------------------------------------------------------------------------
$ npm install --save parquetjs

> [email protected] install /usr/local/globalcdn/playground/parquet/node_modules/ws
> (node-gyp rebuild 2> builderror.log) || (exit 0)

make: Entering directory `/usr/local/globalcdn/playground/parquet/node_modules/ws/build'
  CXX(target) Release/obj.target/bufferutil/src/bufferutil.o
make: Leaving directory `/usr/local/globalcdn/playground/parquet/node_modules/ws/build'

> [email protected] install /usr/local/globalcdn/playground/parquet/node_modules/lzo
> node-gyp rebuild

make: Entering directory `/usr/local/globalcdn/playground/parquet/node_modules/lzo/build'
  CC(target) Release/obj.target/node_lzo/lib/minilzo209/minilzo.o
  CXX(target) Release/obj.target/node_lzo/lib/lzo.o
  SOLINK_MODULE(target) Release/obj.target/node_lzo.node
  COPY Release/node_lzo.node
make: Leaving directory `/usr/local/globalcdn/playground/parquet/node_modules/lzo/build'
npm WARN [email protected] No description
npm WARN [email protected] No repository field.

+ [email protected]
added 17 packages in 4.499s
---------------------------------------------------------------------------------------------

**Environment Info.**
---------------------------------------------------------------------------------------------
 Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-74-generic x86_64)
---------------------------------------------------------------------------------------------

Problems with reader and deep schemas

Here is an example of a schema that is three levels deep. Shredding and materializing a single record works fine; however, writing a parquet file and reading it back results in an error:

const parquet = require('parquetjs');

var schema = new parquet.ParquetSchema({
  a: {
    fields: {
      b: {
        fields: {
          c:  {type: 'UTF8'}
        }
      }
    }
  }
});


let rec = {a: {b: {c: 'this is a test'}}};


async function main() {
  // shread & materialize:
  console.log('shread & materialize:');
  let buf = {};
  parquet.ParquetShredder.shredRecord(schema, rec, buf);
  console.log(parquet.ParquetShredder.materializeRecords(schema, buf));

  // writer and reader
  console.log('writer & reader:');
  const writer = await parquet.ParquetWriter.openFile(schema, 'test.parquet');
  await writer.appendRow(rec);
  await writer.close();

  let reader = await parquet.ParquetReader.openFile('test.parquet');
  let cursor = reader.getCursor();
  let record = null;
  while (record = await cursor.next()) {
    console.log(record);
  }

  await reader.close();
}

main().then(console.log,console.log)

Output is:

shread & materialize:
[ { a: { b: [Object] } } ]
writer & reader:
TypeError: Cannot read property 'rLevelMax' of undefined
    at ParquetEnvelopeReader.readColumnChunk (/home/zjonsson/git/parquetjs/lib/reader.js:344:24)
    at <anonymous>

Potential discrepancy in shred/materialize

Reading the spec carefully I see the following paragraph

One important thing to remember to understand the examples is that not every level of the tree needs a new definition or repetition level. Only repeated fields increment the repetition level, only non-required fields increment the definition level. As those levels are very small bounded values they can be encoded efficiently using a few bits.

Required fields are always defined and do not need a definition level. Non repeated fields do not need a repetition level.

This means that any path to a leaf node that has all path elements as optional: false can only have a definition level of zero (each higher definition level requires an optional: true somewhere in the path).

However when I look at one of the materialize tests in parquetjs I see:

   var schema = new parquet.ParquetSchema({
      name: { type: 'UTF8' },
      stock: {
        repeated: true,
        fields: {
          quantity: { type: 'INT64', repeated: true },
          warehouse: { type: 'UTF8' },
        }
      },
      price: { type: 'DOUBLE' },
    });

  buffer.columnData[['stock',  'quantity']] = {
      dlevels: [2, 2, 2, 2, 0, 1],
      rlevels: [0, 1, 0, 2, 0, 0],
      values: [10, 20, 50, 75],
      count: 6
    };

Nothing in the path is optional; however, many of the dlevels are non-zero. If I change the dlevels to all zeros, the quantity data is not populated in the resulting records.

Is it possible that there is a discrepancy in the implementation?

UnhandledPromiseRejectionWarning on error

If an error occurs in ParquetTransformer._transform, the stream is left in a limbo state. Node logs the following warning:

(node:10265) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 2): invalid value for INT64: N/A
(node:10265) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
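
For context, a Transform stream only surfaces asynchronous errors if the rejection is forwarded to the _transform callback; here is a minimal generic sketch of that pattern (not the parquetjs source, and appendRowFn is a hypothetical async row handler):

const { Transform } = require('stream');

class SafeTransform extends Transform {
  constructor(appendRowFn) {
    super({ objectMode: true });
    this.appendRowFn = appendRowFn;
  }
  _transform(row, _encoding, callback) {
    // forward both success and failure to the callback so the stream
    // emits 'error' instead of leaving an unhandled promise rejection
    this.appendRowFn(row).then(() => callback(), callback);
  }
}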

Optional variables can not be explicitly `undefined` or `null`

When streaming from a database or a CSV file, missing values are often still present in the Object.keys of the record, with the value set to either undefined or null.

Here is an example that fails on colour (which is optional) being undefined in the second record.

const parquet = require('./parquet.js');

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  colour: {type: 'UTF8', optional: true}
});

async function main() {
  const writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
  await writer.appendRow({name: 'banana', colour: 'yellow'});
  await writer.appendRow({name: 'apple', colour: undefined});
  await writer.close();
}

main()
  .then(() => console.log('done'))
  .catch(e => console.log(e));

resulting in the following error:

TypeError: Cannot read property 'constructor' of undefined
    at shredRecordInternal (/home/zjonsson/git/parquetjs/lib/shred.js:83:29)
    at Object.exports.shredRecord (/home/zjonsson/git/parquetjs/lib/shred.js:40:3)
    at ParquetWriter.appendRow (/home/zjonsson/git/parquetjs/lib/writer.js:94:22)
    at main (/home/zjonsson/git/parquetjs/test.js:11:16)
    at <anonymous>

Adding repeated properties to a schema results in a corrupt parquet file.

Version 0.8.0

Having some issues with repeated. The resulting parquet file seems to have errors in it.

org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/PATHTOFILE/profile.parquet

Here is the code I'm testing with; it's the identities object that is causing the problem.


let schema = new parquet.ParquetSchema({
    person: {
        repeated: false,
        fields: {
            firstName: {
                type: 'UTF8'
            },
            lastName: {
                type: 'UTF8'
            }
        }
    },
    identities: {
        repeated: true,
        fields: {
            id: {
                type: 'UTF8'
            },
            xid: {
                type: 'UTF8'
            }
        }
    }
});

async function writeToParquet(schema) {
    // create new ParquetWriter that writes to 'fruits.parquet`
    var writer = await parquet.ParquetWriter.openFile(schema, 'profile.parquet');

    writer.appendRow({
        person: {
            firstName: "Test",
            lastName: "User"
        },
        identities: [{
            id: "ID",
            xid: "XID"
        },{
            id: "ID",
            xid: "XID"
        }]
    });

    await writer.close();
}

writeToParquet(schema);

thrift module is dependent on ws module

The thrift module source is hosted on the FSF git, not on GitHub. I'm not familiar with the procedure for submitting PRs there, but we might need to fork it to a non-ws-dependent version.

Implement statistics

Statistics definition: https://github.com/ironSource/parquetjs/blob/master/parquet.thrift#L204-L212

DataPageHeader: https://github.com/ironSource/parquetjs/blob/master/parquet.thrift#L342-L356
DataPageHeaderV2: https://github.com/ironSource/parquetjs/blob/master/parquet.thrift#L379-L405
ColumnMetaData: https://github.com/ironSource/parquetjs/blob/master/parquet.thrift#L472-L508

This allows min/max values to be seen immediately for given pages/rows, avoiding scanning data outside of the area of interest for a column.
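
For illustration, here is a minimal sketch of accumulating the min/max/null_count that the Statistics struct carries while rows are still buffered (this is independent of parquetjs internals; collectStats is a hypothetical helper):

// accumulate per-column statistics over a buffered set of rows
function collectStats(rows, columnName) {
  const stats = { min: undefined, max: undefined, nullCount: 0 };
  for (const row of rows) {
    const value = row[columnName];
    if (value === undefined || value === null) {
      stats.nullCount++;
      continue;
    }
    if (stats.min === undefined || value < stats.min) stats.min = value;
    if (stats.max === undefined || value > stats.max) stats.max = value;
  }
  return stats;
}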

get file content as base64

Hi,
is there a way to get the parquet file as a base64 string, without having to write it to disk?

My use case is to convert an object into Parquet format and upload it to AWS S3.
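
Using only the documented openFile API, one workaround is to write to a temporary file and read it back before uploading; a minimal sketch (the path is a placeholder, and the actual S3 upload is left to whatever AWS SDK call you use):

const fs = require('fs');
const parquet = require('parquetjs');

async function rowsToBase64(schema, rows) {
  const path = '/tmp/out.parquet'; // placeholder temporary file
  const writer = await parquet.ParquetWriter.openFile(schema, path);
  for (const row of rows) {
    await writer.appendRow(row);
  }
  await writer.close();
  // read the finished file back and base64-encode it for the upload
  return fs.readFileSync(path).toString('base64');
}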

fruits.parquet generated by test/integration.js is unreadable by Hadoop parquet-tools 1.9.0

Build parquet-mr/parquet-tools per these instructions.

Then run its cat command to dump the fruits.parquet file that is generated:

$ java -jar target/parquet-tools-1.9.0.jar cat parquetjs/fruits.parquet 

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/Users/davidr/workspaces/parquet-mr/parquet-tools/target/parquet-tools-1.9.0.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Could not read footer: java.io.IOException: Could not read footer for file DeprecatedRawLocalFileStatus{path=file:/Users/davidr/workspaces/parquetjs/fruits.parquet; isDirectory=false; length=1411554; replication=1; blocksize=33554432; modification_time=1512831680000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

Using parquetjs v0.8.0.

Problem with materialize on an object with optional missing variable

In some cases where a property is optional in the schema and missing from the record, other values go missing as well from the materialized output.

Example:

Actual output [ { fruit: {} } ]
Expected output : [ { fruit: { name: 'apple' } } ]

const parquet = require('../parquetjs');

var schema = new parquet.ParquetSchema({
  fruit: {
    fields: {
      name: { type: 'UTF8' },
      colour: { type: 'UTF8', optional: true }
    }
  }
});

let buf = {};
let rec = { fruit: { name: 'apple'}};

parquet.ParquetShredder.shredRecord(schema, rec, buf);
let records = parquet.ParquetShredder.materializeRecords(schema, buf);

console.log(records);

RowGroup size recommendation is too low for optimal use of Parquet

From the README:

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(8192);

This is way off compared to the intended size of RowGroups for Parquet files. The initial implementation suggests using 128 MiB as the RowGroup size to neatly fit an HDFS block.

Using very tiny RowGroup sizes removes the main benefits of the Parquet format: being columnar, enabling vectorized execution, and offering a good trade-off between compression ratio and CPU usage by encoding the data with knowledge of its data type.

The smallest unit in a Parquet file, a page, is normally set to 1 MiB, which is much more than 200x the recommended RowGroup size. Some implementations have used 64 KiB, which is also greater.

Should be usable in the browser

We want to be able to use this package in the browser; therefore we will need to get rid of, or make optional, some dependencies (ws, brotli, etc.).

related to #1

Publish v0.8.0 to npm

Hey, I see that you have merged v0.8.0 of the library, but on npm it is still v0.7.0.

Could you please upload the new version of the library?

Thanks! And great work!

reader implementation

  • read RLE
  • remaining fromNative types
  • remaining PLAIN types
  • explicit col chunk metadata loading (if not in footer)
  • load compressed pages
  • load data pages v2
  • verify checksums

add snappy compression

I'm trying to compress the parquet file after its creation, but AWS Athena can't read it.

`HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket-example/data/parquet_node/year=2018/month=04/day=18/hour=18/minute=47/file.snappy.parquet (offset=0, length=11716266): can not read class parquet.format.FileMetaData: don't know what type: 15`

This query ran against the "tes" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 3b3a2df1-b202.

Is it possible to add optional snappy compression in the writer?
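
As a possible workaround, later parquetjs releases accept a per-field compression option; assuming the installed version supports it (verify against your version, and note this does not by itself guarantee Athena compatibility), the schema would look roughly like this:

let salesSchema = new parquet.ParquetSchema({
  time: { type: 'TIMESTAMP_MILLIS', compression: 'SNAPPY' },  // 'compression' is assumed to be supported by your version
  quantity: { type: 'DOUBLE', compression: 'SNAPPY' },
  price: { type: 'DOUBLE', compression: 'SNAPPY' },
});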

Package name transfer?

I own the parquet package name on the npm registry, but it is not actually used... I feel I did wrong just reserving it.

ParquetFileWriter doesn't exist, and how to write to stream

According to the docs, you start with:

// create new ParquetWriter that writes to 'fruits.parquet`
var writer = new parquet.ParquetFileWriter(schema, 'fruits.parquet');

Yet, no such parquet.ParquetFileWriter exists:

> p = require('parquetjs')
{ ParquetEnvelopeReader: [Function: ParquetEnvelopeReader],
  ParquetReader: [Function: ParquetReader],
  ParquetEnvelopeWriter: [Function: ParquetEnvelopeWriter],
  ParquetWriter: [Function: ParquetWriter],
  ParquetSchema: [Function: ParquetSchema],
  ParquetShredder: { shredRecord: [Function], materializeRecords: [Function] } }
> p.ParquetFileWriter
undefined

As it is, I came across it looking for a way to get it to write directly to a stream, rather than to a file.

parquet file does not contain codec (HIVE_CANNOT_OPEN_SPLIT)?

let salesSchema = new parquet.ParquetSchema({
time: { type: 'TIMESTAMP_MILLIS' },
quantity: { type: 'DOUBLE' },
price: { type: 'DOUBLE' },
});

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://blahblah.parquet (offset=0, length=296730): can not read class parquet.format.FileMetaData: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[RLE, PLAIN], path_in_schema:[time], codec:null, num_values:100, total_uncompressed_size:817, total_compressed_size:817, data_page_offset:4)
