
parquetjs's Introduction

CURRENT STATUS: INACTIVE

This project requires a major overhaul, as well as handling and sorting through dozens of issues and PRs. Please contact me if you're up for the task.

parquet.js

fully asynchronous, pure node.js implementation of the Parquet file format


This package contains a fully asynchronous, pure JavaScript implementation of the Parquet file format. The implementation conforms with the Parquet specification and is tested for compatibility with Apache's Java reference implementation.

What is Parquet?: Parquet is a column-oriented file format; it allows you to write a large amount of structured data to a file, compress it and then read parts of it back out efficiently. The Parquet format is based on Google's Dremel paper.

Installation

To use parquet.js with node.js, install it using npm:

  $ npm install parquetjs

parquet.js requires node.js >= 8

Usage: Writing files

Once you have installed the parquet.js library, you can import it as a single module:

var parquet = require('parquetjs');

Parquet files have a strict schema, similar to tables in a SQL database. So, in order to produce a Parquet file we first need to declare a new schema. Here is a simple example that shows how to instantiate a ParquetSchema object:

// declare a schema for the `fruits` table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' },
  price: { type: 'DOUBLE' },
  date: { type: 'TIMESTAMP_MILLIS' },
  in_stock: { type: 'BOOLEAN' }
});

Note that the Parquet schema supports nesting, so you can store complex, arbitrarily nested records into a single row (more on that later) while still maintaining good compression.

Once we have a schema, we can create a ParquetWriter object. The writer will take input rows as JSON objects, convert them to the Parquet format and store them on disk.

// create a new ParquetWriter that writes to 'fruits.parquet'
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');

// append a few rows to the file
await writer.appendRow({name: 'apples', quantity: 10, price: 2.5, date: new Date(), in_stock: true});
await writer.appendRow({name: 'oranges', quantity: 10, price: 2.5, date: new Date(), in_stock: true});

Once we are finished adding rows to the file, we have to tell the writer object to flush the metadata to disk and close the file by calling the close() method:

await writer.close();

Usage: Reading files

A parquet reader allows retrieving the rows from a parquet file in order. The basic usage is to create a reader and then retrieve a cursor/iterator which allows you to consume row after row until all rows have been read.

You may open more than one cursor and use them concurrently. All cursors become invalid once close() is called on the reader object.

// create a new ParquetReader that reads from 'fruits.parquet'
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

// create a new cursor
let cursor = reader.getCursor();

// read all records from the file and print them
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

When creating a cursor, you can optionally request that only a subset of the columns should be read from disk. For example:

// create a new cursor that will only return the `name` and `price` columns
let cursor = reader.getCursor(['name', 'price']);

It is important that you call close() after you are finished reading the file to avoid leaking file descriptors.

await reader.close();

Encodings

Internally, the Parquet format stores the values of each field as consecutive arrays, which can be compressed/encoded using a number of schemes.

Plain Encoding (PLAIN)

The simplest encoding scheme is the PLAIN encoding. It stores the values as they are, without any compression. The PLAIN encoding is currently the default for all types except BOOLEAN:

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8', encoding: 'PLAIN' },
});

Run Length Encoding (RLE)

The Parquet hybrid run-length/bit-packing encoding compresses runs of numbers very efficiently. Note that the RLE encoding can only be used in combination with the BOOLEAN, INT32 and INT64 types. The RLE encoding requires an additional bitWidth parameter that contains the maximum number of bits required to store the largest value of the field.

var schema = new parquet.ParquetSchema({
  age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },
});

Optional Fields

By default, all fields are required to be present in each row. You can also mark a field as 'optional', which lets you store rows with that field missing:

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64', optional: true },
});

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
await writer.appendRow({name: 'apples', quantity: 10 });
await writer.appendRow({name: 'banana' }); // not in stock

Nested Rows & Arrays

Parquet supports nested schemas that allow you to store rows that have a more complex structure than a simple tuple of scalar values. To declare a schema with a nested field, omit the type in the column definition and add a fields list instead.

Consider this example, which allows us to store a more advanced "fruits" table where each row contains a name, a list of colours and a list of "stock" objects:

// advanced fruits table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  colours: { type: 'UTF8', repeated: true },
  stock: {
    repeated: true,
    fields: {
      price: { type: 'DOUBLE' },
      quantity: { type: 'INT64' },
    }
  }
});

// the above schema allows us to store the following rows:
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');

await writer.appendRow({
  name: 'banana',
  colours: ['yellow'],
  stock: [
    { price: 2.45, quantity: 16 },
    { price: 2.60, quantity: 420 }
  ]
});

await writer.appendRow({
  name: 'apple',
  colours: ['red', 'green'],
  stock: [
    { price: 1.20, quantity: 42 },
    { price: 1.30, quantity: 230 }
  ]
});

await writer.close();

// reading nested rows with a list of explicit columns
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

let cursor = reader.getCursor([['name'], ['stock', 'price']]);
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

await reader.close();

It might not be obvious why one would want to implement or use such a feature when the same could, in principle, be achieved by serializing the record using JSON (or a similar scheme) and then storing it into a UTF8 field.

Putting aside the philosophical discussion on the merits of strict typing, knowing about the structure and subtypes of all records (globally) means we do not have to duplicate this metadata (i.e. the field names) for every record. On top of that, knowing about the type of a field allows us to compress the remaining data more efficiently.
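
For comparison, here is an illustrative sketch of the JSON-blob alternative next to a typed schema; the blob approach re-serializes every field name into every record and gives the encoder no type information to exploit:

// option 1: store each record as an opaque JSON string
// (field names and types are repeated in every single row)
var blobSchema = new parquet.ParquetSchema({
  record: { type: 'UTF8' }
});

// option 2: a typed schema, where field names live once in the file metadata
// and each column can be encoded according to its type
var typedSchema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' }
});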

List of Supported Types & Encodings

We aim to be feature-complete and add new features as they are added to the Parquet specification; this is the list of currently implemented data types and encodings:

Logical Type       Primitive Type   Encodings
-----------------  ---------------  -----------
UTF8               BYTE_ARRAY       PLAIN
JSON               BYTE_ARRAY       PLAIN
BSON               BYTE_ARRAY       PLAIN
BYTE_ARRAY         BYTE_ARRAY       PLAIN
TIME_MILLIS        INT32            PLAIN, RLE
TIME_MICROS        INT64            PLAIN, RLE
TIMESTAMP_MILLIS   INT64            PLAIN, RLE
TIMESTAMP_MICROS   INT64            PLAIN, RLE
BOOLEAN            BOOLEAN          PLAIN, RLE
FLOAT              FLOAT            PLAIN
DOUBLE             DOUBLE           PLAIN
INT32              INT32            PLAIN, RLE
INT64              INT64            PLAIN, RLE
INT96              INT96            PLAIN
INT_8              INT32            PLAIN, RLE
INT_16             INT32            PLAIN, RLE
INT_32             INT32            PLAIN, RLE
INT_64             INT64            PLAIN, RLE
UINT_8             INT32            PLAIN, RLE
UINT_16            INT32            PLAIN, RLE
UINT_32            INT32            PLAIN, RLE
UINT_64            INT64            PLAIN, RLE

Buffering & Row Group Size

When writing a Parquet file, the ParquetWriter will buffer rows in memory until a row group is complete (or close() is called) and then write out the row group to disk.

The size of a row group is configurable by the user and controls the maximum number of rows that are buffered in memory at any given time as well as the number of rows that are co-located on disk:

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(8192);

Dependencies

Parquet uses thrift to encode the schema and other metadata, but the actual data does not use thrift.

Contributions

Please make sure you sign the contributor license agreement in order for us to be able to accept your contribution. We thank you very much!

License

Copyright (c) 2017-2019 ironSource Ltd.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

parquetjs's People

Contributors

arnabguptadev, asmuth, dobesv, dominictarr, gregplaysguitar, jeffbski-rga, jthomerson, kessler, markov00, nateschickler0, wgalecki, zectbynmo, zjonsson

parquetjs's Issues

v0.8.0 isn't published on npm

Hello!

I've been trying to use this module for a project of mine, but I'm running into a few issues (#24), which are entirely solved by v0.8.0.

Is there any way that you could publish the new version on npm?

Thanks :)

Memory Leaks?

Hi,

I'm using elasticsearchJS to export a whole index from ES in batches of 4096.
The whole tool uses about 500 MB of RAM while dumping the ES index to Parquet format
(Node.js has a 2 GB memory limit set).

If I lower or increase the batch size (or sometimes at random), it uses a lot of memory, like 2-3 GB, and gets killed.
The quickest way to reproduce this is to increase the batch size it has to process.
The generated parquet file is usually ~5.4 GB.

Is there anything I can do to debug this more?

Thanks!

P.S.: I'm using git+ssh://[email protected]/ironSource/parquetjs.git#1fa58b589d9b6451379f1558214e9ae751909596 as the parquetJS package.

Error when field has no values

When an optional field has 0 values, the generated file seems unreadable by parquet-tools; the error I get is can not read class org.apache.parquet.format.PageHeader: Required field 'num_nulls' was not found in serialized data! Struct: DataPageHeaderV2(num_values:10, num_nulls:0, num_rows:10, encoding:PLAIN, definition_levels_byte_length:20, repetition_levels_byte_length:0).
When I forced the num_nulls value to be no less than 1, it started working. I'm not sure if this is an issue with the module or with the way I'm trying to use it, so I'm letting you know.

Return array of buffers instead of buffer.concat

Buffer.concat, particularly on the whole rowgroup buffers, can get expensive memory-wise, as we'll end up with two copies of the whole buffer (the individual parts and the concatenated version) for a moment before the garbage collector picks up the pieces. This can limit the maximum size of the rowgroup.

It might be more efficient to return bodyParts (a sequential array of buffers) in general instead of Buffer.concat, and then in writeSection loop through the bodyParts and write them sequentially.
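
For illustration, here is a minimal sketch of the proposed approach (not parquetjs internals; writeBodyParts is a hypothetical helper): the individual buffers are written to the file descriptor one after another instead of being concatenated first, so only one copy of the row group data is held in memory.

const fs = require('fs');

// write a sequential array of buffers to an open file descriptor
async function writeBodyParts(fd, bodyParts) {
  for (const part of bodyParts) {
    await new Promise((resolve, reject) => {
      fs.write(fd, part, 0, part.length, null, (err) => {
        if (err) reject(err); else resolve();
      });
    });
  }
}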

Cannot execute the demo code for writing a parquet file.

I tried the code below....


var parquet = require('parquetjs');

// declare a schema for the fruits table
var schema = new parquet.ParquetSchema({
name: { type: 'UTF8' },
quantity: { type: 'INT64' },
price: { type: 'DOUBLE' },
in_stock: { type: 'BOOLEAN' }
});

// create new ParquetWriter that writes to 'fruits.parquet`
var writer = new parquet.ParquetFileWriter(schema, 'fruits.parquet');

This gives me an error.

1<process.argv.length?process.argv[1].replace(/\\/g,"/"):"unknown-program");b.arguments=process.argv.slice(2);"undefined"!==typeof module&&(module.exports=b);process.on("uncaughtException",function(a){if(!(a instanceof y))throw a;});b.inspect=function(){return"[Emscripten Module object]"}}else if(x)b.print||(b.print=print),"undefined"!=typeof printErr&&(b.printErr=printErr),b.read="undefined"!=typeof read?read:function(){throw"no read() available (jsc?)";},b.readBinary=function(a){if("function"===
                                                                                                                                                                                                                              ^

TypeError: parquet.ParquetFileWriter is not a constructor
    at Object.<anonymous> (/usr/local/globalcdn/playground/parquet/index.js:14:14)
    at Module._compile (module.js:652:30)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)
    at tryModuleLoad (module.js:505:12)
    at Function.Module._load (module.js:497:3)
    at Function.Module.runMain (module.js:693:10)
    at startup (bootstrap_node.js:188:16)
    at bootstrap_node.js:609:3
---------------------------------------------------------------------------------------------

**The parquetjs is installed with this message.**
---------------------------------------------------------------------------------------------
$ npm install --save parquetjs

> [email protected] install /usr/local/globalcdn/playground/parquet/node_modules/ws
> (node-gyp rebuild 2> builderror.log) || (exit 0)

make: Entering directory `/usr/local/globalcdn/playground/parquet/node_modules/ws/build'
  CXX(target) Release/obj.target/bufferutil/src/bufferutil.o
make: Leaving directory `/usr/local/globalcdn/playground/parquet/node_modules/ws/build'

> [email protected] install /usr/local/globalcdn/playground/parquet/node_modules/lzo
> node-gyp rebuild

make: Entering directory `/usr/local/globalcdn/playground/parquet/node_modules/lzo/build'
  CC(target) Release/obj.target/node_lzo/lib/minilzo209/minilzo.o
  CXX(target) Release/obj.target/node_lzo/lib/lzo.o
  SOLINK_MODULE(target) Release/obj.target/node_lzo.node
  COPY Release/node_lzo.node
make: Leaving directory `/usr/local/globalcdn/playground/parquet/node_modules/lzo/build'
npm WARN [email protected] No description
npm WARN [email protected] No repository field.

+ [email protected]
added 17 packages in 4.499s
---------------------------------------------------------------------------------------------

**Environment Info.**
---------------------------------------------------------------------------------------------
 Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-74-generic x86_64)
---------------------------------------------------------------------------------------------

Problems with reader and deep schemas

Here is an example of a schema that is three levels deep. Shredding and materializing a single record works fine; however, writing a parquet file and reading it back results in an error:

const parquet = require('parquetjs');

var schema = new parquet.ParquetSchema({
  a: {
    fields: {
      b: {
        fields: {
          c:  {type: 'UTF8'}
        }
      }
    }
  }
});


let rec = {a: {b: {c: 'this is a test'}}};


async function main() {
  // shread & materialize:
  console.log('shread & materialize:');
  let buf = {};
  parquet.ParquetShredder.shredRecord(schema, rec, buf);
  console.log(parquet.ParquetShredder.materializeRecords(schema, buf));

  // writer and reader
  console.log('writer & reader:');
  const writer = await parquet.ParquetWriter.openFile(schema, 'test.parquet');
  await writer.appendRow(rec);
  await writer.close();

  let reader = await parquet.ParquetReader.openFile('test.parquet');
  let cursor = reader.getCursor();
  let record = null;
  while (record = await cursor.next()) {
    console.log(record);
  }

  await reader.close();
}

main().then(console.log,console.log)

Output is:

shread & materialize:
[ { a: { b: [Object] } } ]
writer & reader:
TypeError: Cannot read property 'rLevelMax' of undefined
    at ParquetEnvelopeReader.readColumnChunk (/home/zjonsson/git/parquetjs/lib/reader.js:344:24)
    at <anonymous>

Potential discrepancy in shred/materialize

Reading the spec carefully I see the following paragraph

One important thing to remember to understand the examples is that not every level of the tree needs a new definition or repetition level. Only repeated fields increment the repetition level, only non-required fields increment the definition level. As those levels are very small bounded values they can be encoded efficiently using a few bits.

Required fields are always defined and do not need a definition level. Non repeated fields do not need a repetition level.

This means that any path to a leaf node that has all path elements as optional: false can only have a definition level of zero (each higher definition level requires an optional: true somewhere in the path).

However when I look at one of the materialize tests in parquetjs I see:

   var schema = new parquet.ParquetSchema({
      name: { type: 'UTF8' },
      stock: {
        repeated: true,
        fields: {
          quantity: { type: 'INT64', repeated: true },
          warehouse: { type: 'UTF8' },
        }
      },
      price: { type: 'DOUBLE' },
    });

  buffer.columnData[['stock',  'quantity']] = {
      dlevels: [2, 2, 2, 2, 0, 1],
      rlevels: [0, 1, 0, 2, 0, 0],
      values: [10, 20, 50, 75],
      count: 6
    };

Nothing in the path is optional; however, many of the dlevels are non-zero. If I change the dlevels to all zeros, the quantity data is not populated in the resulting records.

Is it possible that there is a discrepancy in the implementation?

UnhandledPromiseRejectionWarning on error

If an error occurs in ParquetTransformer._transform, the stream is left in a limbo state. Node logs the following warning:

(node:10265) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 2): invalid value for INT64: N/A
(node:10265) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
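
For context, a Transform stream only surfaces asynchronous errors if the rejection is forwarded to the _transform callback; here is a minimal generic sketch of that pattern (not the parquetjs source, and appendRowFn is a hypothetical async row handler):

const { Transform } = require('stream');

class SafeTransform extends Transform {
  constructor(appendRowFn) {
    super({ objectMode: true });
    this.appendRowFn = appendRowFn;
  }
  _transform(row, _encoding, callback) {
    // forward both success and failure to the callback so the stream
    // emits 'error' instead of leaving an unhandled promise rejection
    this.appendRowFn(row).then(() => callback(), callback);
  }
}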

Optional variables can not be explicitly `undefined` or `null`

When streaming from a database or a CSV file, missing values are often still present in the Object.keys of the record, with the value set to either undefined or null.

Here is an example that fails on colour (which is optional) being undefined in the second record.

const parquet = require('./parquet.js');

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  colour: {type: 'UTF8', optional: true}
});

async function main() {
  const writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
  await writer.appendRow({name: 'banana', colour: 'yellow'});
  await writer.appendRow({name: 'apple', colour: undefined});
  await writer.close();
}

main()
  .then(() => console.log('done'))
  .catch(e => console.log(e));

resulting in the following error:

TypeError: Cannot read property 'constructor' of undefined
    at shredRecordInternal (/home/zjonsson/git/parquetjs/lib/shred.js:83:29)
    at Object.exports.shredRecord (/home/zjonsson/git/parquetjs/lib/shred.js:40:3)
    at ParquetWriter.appendRow (/home/zjonsson/git/parquetjs/lib/writer.js:94:22)
    at main (/home/zjonsson/git/parquetjs/test.js:11:16)
    at <anonymous>

Adding repeated properties to a schema results in a corrupt parquet file.

Version 0.8.0

Having some issues with repeated. The resulting parquet file seems to have errors in it.

org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/PATHTOFILE/profile.parquet

Here is the code I'm testing with; it's the identities object that is causing the problem.


let schema = new parquet.ParquetSchema({
    person: {
        repeated: false,
        fields: {
            firstName: {
                type: 'UTF8'
            },
            lastName: {
                type: 'UTF8'
            }
        }
    },
    identities: {
        repeated: true,
        fields: {
            id: {
                type: 'UTF8'
            },
            xid: {
                type: 'UTF8'
            }
        }
    }
});

async function writeToParquet(schema) {
    // create new ParquetWriter that writes to 'fruits.parquet`
    var writer = await parquet.ParquetWriter.openFile(schema, 'profile.parquet');

    writer.appendRow({
        person: {
            firstName: "Test",
            lastName: "User"
        },
        identities: [{
            id: "ID",
            xid: "XID"
        },{
            id: "ID",
            xid: "XID"
        }]
    });

    await writer.close();
}

writeToParquet(schema);

thrift module is dependent on ws module

The thrift module source is hosted on the FSF git, not on GitHub. I'm not familiar with the procedure for submitting PRs there, but we might need to fork it to a non-ws-dependent version.

Implement statistics

Statistics definition: https://github.com/ironSource/parquetjs/blob/master/parquet.thrift#L204-L212

DataPageHeader: https://github.com/ironSource/parquetjs/blob/master/parquet.thrift#L342-L356
DataPageHeaderV2: https://github.com/ironSource/parquetjs/blob/master/parquet.thrift#L379-L405
ColumnMetaData: https://github.com/ironSource/parquetjs/blob/master/parquet.thrift#L472-L508

This allows min/max values to be seen immediately for given pages/rows, avoiding scanning data outside of the area of interest for a column.
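
For illustration, here is a minimal sketch of accumulating the min/max/null_count that the Statistics struct carries while rows are still buffered (this is independent of parquetjs internals; collectStats is a hypothetical helper):

// accumulate per-column statistics over a buffered set of rows
function collectStats(rows, columnName) {
  const stats = { min: undefined, max: undefined, nullCount: 0 };
  for (const row of rows) {
    const value = row[columnName];
    if (value === undefined || value === null) {
      stats.nullCount++;
      continue;
    }
    if (stats.min === undefined || value < stats.min) stats.min = value;
    if (stats.max === undefined || value > stats.max) stats.max = value;
  }
  return stats;
}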

get file content as base64

Hi,
is there a way to get the parquet file as a base64 string, without having to write it to disk?

My use case is to convert an object into Parquet format and upload it to AWS S3.
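
Using only the documented openFile API, one workaround is to write to a temporary file and read it back before uploading; a minimal sketch (the path is a placeholder, and the actual S3 upload is left to whatever AWS SDK call you use):

const fs = require('fs');
const parquet = require('parquetjs');

async function rowsToBase64(schema, rows) {
  const path = '/tmp/out.parquet'; // placeholder temporary file
  const writer = await parquet.ParquetWriter.openFile(schema, path);
  for (const row of rows) {
    await writer.appendRow(row);
  }
  await writer.close();
  // read the finished file back and base64-encode it for the upload
  return fs.readFileSync(path).toString('base64');
}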

fruits.parquet generated by test/integration.js is unreadable by Hadoop parquet-tools 1.9.0

Build parquet-mr/parquet-tools per these instructions.

Then run its cat command to dump the fruits.parquet file that is generated:

$ java -jar target/parquet-tools-1.9.0.jar cat parquetjs/fruits.parquet 

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/Users/davidr/workspaces/parquet-mr/parquet-tools/target/parquet-tools-1.9.0.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Could not read footer: java.io.IOException: Could not read footer for file DeprecatedRawLocalFileStatus{path=file:/Users/davidr/workspaces/parquetjs/fruits.parquet; isDirectory=false; length=1411554; replication=1; blocksize=33554432; modification_time=1512831680000; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}

Using parquetjs v0.8.0.

Problem with materialize on an object with optional missing variable

In some cases where a property is optional in the schema and missing from the record, other values go missing as well from the materialized output.

Example:

Actual output [ { fruit: {} } ]
Expected output : [ { fruit: { name: 'apple' } } ]

const parquet = require('../parquetjs');

var schema = new parquet.ParquetSchema({
  fruit: {
    fields: {
      name: { type: 'UTF8' },
      colour: { type: 'UTF8', optional: true }
    }
  }
});

let buf = {};
let rec = { fruit: { name: 'apple'}};

parquet.ParquetShredder.shredRecord(schema, rec, buf);
let records = parquet.ParquetShredder.materializeRecords(schema, buf);

console.log(records);

RowGroup size recommendation is too low for optimal use of Parquet

From the README:

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(8192);

This is way off compared to the intended size of RowGroups for Parquet files. The initial implementation suggests using 128 MiB as the RowGroup size to neatly fit an HDFS block.

Using very tiny RowGroup sizes removes the main benefits of the Parquet format: being columnar, enabling vectorized execution, and offering a good trade-off between compression ratio and CPU usage by encoding the data with knowledge of its data type.

The smallest unit in a Parquet file, a page, is normally set to 1 MiB, which is much more than 200x the recommended RowGroup size. Some implementations have used 64 KiB, which is also greater.

Should be usable in the browser

We want to be able to use this package in the browser; therefore we will need to get rid of, or make optional, some dependencies (ws, brotli, etc.).

related to #1

Publish v0.8.0 to npm

Hey, I see that you have merged v0.8.0 of the library, but on npm it is still v0.7.0.

Could you please upload the new version of the library?

Thanks! And great work!

reader implementation

  • read RLE
  • remaining fromNative types
  • remaining PLAIN types
  • explicit col chunk metadata loading (if not in footer)
  • load compressed pages
  • load data pages v2
  • verify checksums

add snappy compression

I'm trying to compress the parquet file after its creation, but AWS Athena can't read it.

`HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://bucket-example/data/parquet_node/year=2018/month=04/day=18/hour=18/minute=47/file.snappy.parquet (offset=0, length=11716266): can not read class parquet.format.FileMetaData: don't know what type: 15`

This query ran against the "tes" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 3b3a2df1-b202.

Is it possible to add optional snappy compression in the writer?
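
As a possible workaround, later parquetjs releases accept a per-field compression option; assuming the installed version supports it (verify against your version, and note this does not by itself guarantee Athena compatibility), the schema would look roughly like this:

let salesSchema = new parquet.ParquetSchema({
  time: { type: 'TIMESTAMP_MILLIS', compression: 'SNAPPY' },  // 'compression' is assumed to be supported by your version
  quantity: { type: 'DOUBLE', compression: 'SNAPPY' },
  price: { type: 'DOUBLE', compression: 'SNAPPY' },
});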

Package name transfer?

I own the parquet package name on the npm registry, but it is not actually used... I feel I did wrong just reserving it.

ParquetFileWriter doesn't exist, and how to write to stream

According to the docs, you start with:

// create new ParquetWriter that writes to 'fruits.parquet`
var writer = new parquet.ParquetFileWriter(schema, 'fruits.parquet');

Yet, no such parquet.ParquetFileWriter exists:

> p = require('parquetjs')
{ ParquetEnvelopeReader: [Function: ParquetEnvelopeReader],
  ParquetReader: [Function: ParquetReader],
  ParquetEnvelopeWriter: [Function: ParquetEnvelopeWriter],
  ParquetWriter: [Function: ParquetWriter],
  ParquetSchema: [Function: ParquetSchema],
  ParquetShredder: { shredRecord: [Function], materializeRecords: [Function] } }
> p.ParquetFileWriter
undefined

As it is, I came across it looking for a way to get it to write directly to a stream, rather than to a file.

parquet file does not contain codec (HIVE_CANNOT_OPEN_SPLIT)?

let salesSchema = new parquet.ParquetSchema({
time: { type: 'TIMESTAMP_MILLIS' },
quantity: { type: 'DOUBLE' },
price: { type: 'DOUBLE' },
});

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://blahblah.parquet (offset=0, length=296730): can not read class parquet.format.FileMetaData: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[RLE, PLAIN], path_in_schema:[time], codec:null, num_values:100, total_uncompressed_size:817, total_compressed_size:817, data_page_offset:4)
