parquet-go's Introduction

parquet-go


parquet-go is an implementation of the Apache Parquet file format in Go. It provides functionality to read and write parquet files, as well as high-level functionality to manage the data schema of parquet files, to write Go objects directly to parquet files, and to read records from parquet files into Go objects, using automatic or custom marshalling in both directions.

parquet is a file format to store nested data structures in a flat columnar format. By storing in a column-oriented way, it allows for efficient reading of individual columns without having to read and decode complete rows. This allows for efficient reading and faster processing when using the file format in conjunction with distributed data processing frameworks like Apache Hadoop or distributed SQL query engines like Presto and AWS Athena.

This implementation is divided into several packages. The top-level package is the low-level implementation of the parquet file format. It is accompanied by the sub-packages parquetschema and floor. parquetschema provides functionality to parse textual schema definitions as well as the data types to manually or programmatically construct schema definitions. floor is a high-level wrapper around the low-level package. It provides functionality to open parquet files to read from them or write to them using automated or custom marshalling and unmarshalling.
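To give a first impression of the high-level API, here is a minimal sketch of writing and then reading a parquet file with floor; the schema, file name and record type are invented for illustration.

package main

import (
	"log"

	goparquet "github.com/fraugster/parquet-go"
	"github.com/fraugster/parquet-go/floor"
	"github.com/fraugster/parquet-go/parquet"
	"github.com/fraugster/parquet-go/parquetschema"
)

type record struct {
	ID   int64  `parquet:"id"`
	Name string `parquet:"name"`
}

func main() {
	// Parse a textual schema definition (see the Schema Definition section below).
	sd, err := parquetschema.ParseSchemaDefinition(`message example {
		required int64 id;
		required binary name (STRING);
	}`)
	if err != nil {
		log.Fatalf("parsing schema definition failed: %v", err)
	}

	// Write a record using the high-level floor writer.
	w, err := floor.NewFileWriter("example.parquet",
		goparquet.WithSchemaDefinition(sd),
		goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
	)
	if err != nil {
		log.Fatalf("creating writer failed: %v", err)
	}
	if err := w.Write(record{ID: 1, Name: "alice"}); err != nil {
		log.Fatalf("writing record failed: %v", err)
	}
	if err := w.Close(); err != nil {
		log.Fatalf("closing writer failed: %v", err)
	}

	// Read the records back using the high-level floor reader.
	r, err := floor.NewFileReader("example.parquet")
	if err != nil {
		log.Fatalf("opening reader failed: %v", err)
	}
	defer r.Close()
	for r.Next() {
		var rec record
		if err := r.Scan(&rec); err != nil {
			log.Fatalf("scanning record failed: %v", err)
		}
		log.Printf("read %+v", rec)
	}
	if err := r.Err(); err != nil {
		log.Fatalf("reader returned error: %v", err)
	}
}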

Supported Features

| Feature | Read | Write | Note |
|---------|------|-------|------|
| Compression | Yes | Yes | Only GZIP and SNAPPY are supported out of the box; other compressors can be added, see below. |
| Dictionary Encoding | Yes | Yes | |
| Run Length Encoding / Bit-Packing Hybrid | Yes | Yes | The reader can read RLE/bit-packed encoding, but the writer only uses bit-packing. |
| Delta Encoding | Yes | Yes | |
| Byte Stream Split | No | No | |
| Data page V1 | Yes | Yes | |
| Data page V2 | Yes | Yes | |
| Statistics in page meta data | No | Yes | Page meta data is generally not made available to users and not used by parquet-go. |
| Index Pages | No | No | |
| Dictionary Pages | Yes | Yes | |
| Encryption | No | No | |
| Bloom Filter | No | No | |
| Logical Types | Yes | Yes | Support for logical types is in the high-level package (floor); the low-level parquet library only supports the basic types, see the type mapping table below. |

Supported Data Types

| Type in parquet | Type in Go | Note |
|-----------------|------------|------|
| boolean | bool | |
| int32 | int32 | See the note about the int type below. |
| int64 | int64 | See the note about the int type below. |
| int96 | [12]byte | |
| float | float32 | |
| double | float64 | |
| byte_array | []byte | |
| fixed_len_byte_array(N) | [N]byte, []byte | Use any positive number for N. |

Note: the low-level implementation only supports int32 for the INT32 type and int64 for the INT64 type in Parquet. Plain int or uint are not supported. The high-level floor package contains more extensive support for these data types.

Supported Logical Types

| Logical Type | Mapped to Go types | Note |
|--------------|--------------------|------|
| STRING | string, []byte | |
| DATE | int32, time.Time | int32: days since Unix epoch (Jan 01 1970 00:00:00 UTC); time.Time only in floor. |
| TIME | int32, int64, time.Time | int32: TIME(MILLIS, ...); int64: TIME(MICROS, ...), TIME(NANOS, ...); time.Time only in floor. |
| TIMESTAMP | int64, int96, time.Time | time.Time only in floor. |
| UUID | [16]byte | |
| LIST | []T | Slices of any type. |
| MAP | map[T1]T2 | Maps with any key and value types. |
| ENUM | string, []byte | |
| BSON | []byte | |
| DECIMAL | []byte, [N]byte | |
| INT | {,u}int{8,16,32,64} | The implementation is loose and will allow any INT logical type to be converted to any signed or unsigned Go int type. |

Supported Converted Types

| Converted Type | Mapped to Go types | Note |
|----------------|--------------------|------|
| UTF8 | string, []byte | |
| TIME_MILLIS | int32 | Number of milliseconds since the beginning of the day. |
| TIME_MICROS | int64 | Number of microseconds since the beginning of the day. |
| TIMESTAMP_MILLIS | int64 | Number of milliseconds since Unix epoch (Jan 01 1970 00:00:00 UTC). |
| TIMESTAMP_MICROS | int64 | Number of microseconds since Unix epoch (Jan 01 1970 00:00:00 UTC). |
| {,U}INT_{8,16,32,64} | {,u}int{8,16,32,64} | The implementation is loose and will allow any converted type with any Go int type. |
| INTERVAL | [12]byte | |

Please note that converted types are deprecated. Logical types should be used instead.

Supported Compression Algorithms

| Compression Algorithm | Supported | Notes |
|-----------------------|-----------|-------|
| GZIP | Yes, out of the box | |
| SNAPPY | Yes, out of the box | |
| BROTLI | Yes, by importing github.com/akrennmair/parquet-go-brotli | |
| LZ4 | No | LZ4 has been deprecated as of parquet-format 2.9.0. |
| LZ4_RAW | Yes, by importing github.com/akrennmair/parquet-go-lz4raw | |
| LZO | Yes, by importing github.com/akrennmair/parquet-go-lzo | Uses a cgo wrapper around the original LZO implementation, which is licensed as GPLv2+. |
| ZSTD | Yes, by importing github.com/akrennmair/parquet-go-zstd | |
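As a sketch of how one of the pluggable codecs listed above would be wired in: the assumption here is that the extra module registers its compressor in an init function, which is how the "by importing" wording reads; check that module's documentation for the exact mechanism.

package main

import (
	"os"

	goparquet "github.com/fraugster/parquet-go"
	"github.com/fraugster/parquet-go/parquet"

	// Blank import: assumed to register the Brotli codec with parquet-go on init.
	_ "github.com/akrennmair/parquet-go-brotli"
)

func main() {
	f, err := os.Create("brotli.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// After registration, the codec is selected like any built-in one.
	fw := goparquet.NewFileWriter(f,
		goparquet.WithCompressionCodec(parquet.CompressionCodec_BROTLI),
	)
	defer fw.Close()
}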

Schema Definition

parquet-go comes with support for textual schema definitions. The sub-package parquetschema comes with a parser that turns a textual schema definition into the right data type to use elsewhere to specify parquet schemas. The syntax has been mostly reverse-engineered from a similar format that is supported, but barely documented, in Parquet's Java implementation.

For the full syntax, please have a look at the parquetschema package Go documentation.

Generally, the schema definition describes the structure of a message. Parquet will then flatten this into a purely column-based structure when writing the actual data to parquet files.

A message consists of a number of fields. Each field either has a type or is a group. A group itself consists of a number of fields, each of which in turn either has a type or is a group itself. This allows for theoretically unlimited levels of hierarchy.

Each field has a repetition type, describing whether a field is required (i.e. a value has to be present), optional (i.e. a value can be present but doesn't have to be) or repeated (i.e. zero or more values can be present). Optionally, each field (including groups) can have an annotation, which contains a logical type or converted type that says something about the general structure at this point, e.g. LIST indicates a more complex list structure, and MAP a key-value map structure, both following certain conventions. Optionally, a typed field can also have a numeric field ID. The field ID has no purpose intrinsic to the parquet file format.

Here is a simple example of a message with a few typed fields:

message coordinates {
    required double latitude;
    required double longitude;
    optional int32 elevation = 1;
    optional binary comment (STRING);
}

In this example, we have a message with four typed fields, two of them required and two of them optional. double, int32 and binary describe the fundamental data type of the field, while latitude, longitude, elevation and comment are the field names. The parentheses contain an annotation; STRING indicates that the field is a string, encoded as binary data, i.e. a byte array. The field elevation also has a field ID of 1, indicated as a numeric literal and separated from the field name by the equal sign =.
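To show how such a definition is used from Go, here is a short sketch that parses the schema above and hands it to a low-level file writer; the writer target f and the compression choice are illustrative, and imports of the goparquet, parquet and parquetschema packages are assumed.

sd, err := parquetschema.ParseSchemaDefinition(`message coordinates {
    required double latitude;
    required double longitude;
    optional int32 elevation = 1;
    optional binary comment (STRING);
}`)
if err != nil {
	log.Fatalf("parsing schema definition failed: %v", err)
}

fw := goparquet.NewFileWriter(f, // f can be any io.Writer, e.g. an *os.File
	goparquet.WithSchemaDefinition(sd),
	goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
)
defer fw.Close()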

In the following example, we will introduce a plain group as well as two nested groups annotated with logical types to indicate certain data structures:

message transaction {
    required fixed_len_byte_array(16) txn_id (UUID);
    required int32 amount;
    required int96 txn_ts;
    optional group attributes {
        optional int64 shop_id;
        optional binary country_code (STRING);
        optional binary postcode (STRING);
    }
    required group items (LIST) {
        repeated group list {
            required int64 item_id;
            optional binary name (STRING);
        }
    }
    optional group user_attributes (MAP) {
        repeated group key_value {
            required binary key (STRING);
            required binary value (STRING);
        }
    }
}

In this example, we see a number of top-level fields, some of which are groups. The first group is simply a group of typed fields, named attributes.

The second group, items, is annotated as a LIST and in turn contains a repeated group list, which in turn contains a number of typed fields. When a group is annotated as LIST, it needs to follow a particular convention: it has to contain a repeated group named list. Inside this group, any fields can be present.

The third group, user_attributes, is annotated as a MAP. Similar to LIST, it follows certain conventions: it has to contain only a single repeated group named key_value, which in turn contains exactly two fields, one named key and the other named value. This represents a map structure in which each key is associated with one value.

Examples

For examples of how to use both the low-level and high-level APIs of this library, please see the directory examples. You can also check out the accompanying tools (see below) for more advanced examples. The tools are located in the cmd directory.

Tools

parquet-go comes with tooling to inspect and generate parquet files.

parquet-tool

parquet-tool allows you to inspect the metadata, the schema and the number of rows of a parquet file, as well as print its content. You can also use it to split an existing parquet file into multiple smaller files.

Install it by running go get github.com/fraugster/parquet-go/cmd/parquet-tool on your command line. For more detailed help on how to use the tool, consult parquet-tool --help.
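For example, the subcommands used elsewhere in this document look like this (the file name is just a placeholder):

$ parquet-tool meta example.parquet
$ parquet-tool schema example.parquet
$ parquet-tool cat example.parquet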

csv2parquet

csv2parquet makes it possible to convert an existing CSV file into a parquet file. By default, all columns are simply turned into strings, but you can provide type hints to influence the generated parquet schema.

You can install this tool by running go get github.com/fraugster/parquet-go/cmd/csv2parquet on your command line. For more help, consult csv2parquet --help.

Contributing

If you want to hack on this repository, please read the short CONTRIBUTING.md guide first.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

See also the list of contributors who participated in this project.

Special Mentions

  • Nathan Hanna - proposal and prototyping of automatic schema generator jnathanh

License

Copyright 2021 Fraugster GmbH

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

parquet-go's People

Contributors

akrennmair, alrs, atris, deankarn, dim, fzerorubigd, jnathanh, markandrus, mattatcha, nelhage, nwt, ny0m, panamafrancis


parquet-go's Issues

Scanning into `time.Time` without a logical type

Hi,

First and foremost: thanks for the excellent work on this library, it's very much appreciated.

All of our clients' data feeds contain INT96 fields with timestamp information, but have no logical type definition (I know, I know). This makes it difficult for us to scan those fields into time.Time values without having to write custom decoder types. Is there a way to disable this requirement in https://github.com/fraugster/parquet-go/blob/v0.3.0/floor/reader.go#L230-L239? Basically, if the user insists on scanning an INT96 into a time.Time, assume a timestamp if the logical type is not set?

FIXED[16] backed DECIMAL(38,9) not accepted by BigQuery

TLDR: FIXED[16] backed DECIMAL(38, 9) field not accepted by BigQuery:

Column foo has an unexpected logical type: 0.

Trying to create parquet files to be loaded into BigQuery and running into issues with fixed_len_byte_array backed decimals.

BigQuery writes its NUMERIC fields to parquet as required fixed_len_byte_array(16) foo (DECIMAL(38, 9)):

$ parquet-tool meta bq_decimal.parquet
foo:          REQUIRED FIXED_LEN_BYTE_ARRAY R:0 D:0

$ parquet-tool schema bq_decimal.parquet
message schema {
  required fixed_len_byte_array(16) foo (DECIMAL(38, 9));
}

Recreating this schema with fraugster/parquet-go:

func TestDecimal(t *testing.T) {
	type record struct {
		Foo [16]byte `parquet:"foo"`
	}
	schema, err := parquetschema.ParseSchemaDefinition(`
		message schema {
			required fixed_len_byte_array(16) foo (DECIMAL(38, 9));
		}
	`)
	if err != nil {
		panic(fmt.Sprintf("Failed to parse schema: %v", err))
	}

	w, err := floor.NewFileWriter(
		filepath.Join(outputDir, "fraugster_decimal.parquet"),
		goparquet.WithSchemaDefinition(schema),
		goparquet.WithCompressionCodec(parquet.CompressionCodec_GZIP),
	)
	if err != nil {
		panic("Failed to create writer")
	}
	r := record{} // Note: writes all bytes as zeros which should result in zero decimal value.
	w.Write(r)
	w.Close()
}

The meta and schema outputs of the resulting file match the BigQuery parquet, and I can see the byte array written, however BigQuery does not accept this file:

$ bq load --replace --source_format=PARQUET \
    PROJECT:DATASET.fraugster_decimal fraugster_decimal.parquet
Upload complete.
Waiting on bqjob_r7d1a283391779507_0000017e2ac8cd22_1 ... (0s) Current status: DONE
BigQuery error in load operation: 
Error processing job 'PROJECT:bqjob_r7d1a283391779507_0000017e2ac8cd22_1': 
Error while reading data, error message: Column foo has an unexpected logical type: 0.

The issue is only reproducible with fixed_len_byte_array - int64 backed decimals seem to work fine (but at lower precision).

floor cannot read back empty list: "sub-group list or bag not found"

Describe the bug
Using floor struct writer/reader: an empty slice field (e.g. []string{}) written as a list cannot be read back, resulting in error sub-group list or bag not found.

Unit test to reproduce
Can modify the existing test in floor/reader_test.go:

func TestReadWriteSlice(t *testing.T) {
	_ = os.Mkdir("files", 0755)

	sd, err := parquetschema.ParseSchemaDefinition(
		`message test_msg {
			required group foo (LIST) {
				repeated group list {
					required binary element (STRING);
				}
			}
		}`)
	require.NoError(t, err, "parsing schema definition failed")

	t.Logf("schema definition: %s", spew.Sdump(sd))

	hlWriter, err := NewFileWriter(
		"files/list.parquet",
		goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
		goparquet.WithCreator("floor-unittest"),
		goparquet.WithSchemaDefinition(sd),
	)
	require.NoError(t, err)

	type testMsg struct {
		Foo []string
	}

	testData := []testMsg{
		{Foo: []string{}},  // Note: empty slice
	}

	for _, tt := range testData {
		require.NoError(t, hlWriter.Write(tt))
	}
	require.NoError(t, hlWriter.Close())

	hlReader, err := NewFileReader("files/list.parquet")
	require.NoError(t, err)

	count := 0

	var result []testMsg

	for hlReader.Next() {
		var msg testMsg

		require.NoError(t, hlReader.Scan(&msg), "%d. Scan failed", count)
		t.Logf("%d. data = %#v", count, hlReader.data)

		result = append(result, msg)

		count++
	}

	require.NoError(t, hlReader.Err(), "hlReader returned an error")
	t.Logf("count = %d", count)

	for idx, elem := range result {
		require.Equal(t, testData[idx], elem, "%d. read result doesn't match expected data", idx)
	}

	require.NoError(t, hlReader.Close())
}

parquet-go specific details

  • Version 0.11.0

Memory issues when reading files with large row groups

I tried using the low-level reader to read a ~300MB file with a large number of columns and very sparse data (most of the columns are nulls), and the reader attempts to read the entire row group at once before producing the first record. I read the code and realized that the reader reads all data pages from all column chunks belonging to a row group into memory. Also, when reading null columns the reader actually creates values arrays with null values in every element, which makes it so much worse.

In comparison, the Arrow C++ library and the Presto implementation of Parquet both read pages lazily as the client scans through rows, and they don't create values arrays at all for pages with nulls, and in general don't add null values to the values array.

Is there something I'm missing, or this library can't actually be used with larger sparse files?

I guess a possible workaround would be to configure the app that produces our Parquet files to create small row groups, but that may have other negative consequences. The Parquet format docs recommend a row group size of 512 MB to 1 GB, and I can't make this library work even with a 300MB file.

Import failure: thrift: ambiguous import

Hello,

I was trying to pull in your library into a go 1.16 project and was getting this error:

➜  aitk-usage git:(shauncampbell/dev-14909) ✗ go get -u github.com/fraugster/parquet-go
github.com/fraugster/parquet-go imports
	github.com/apache/thrift/lib/go/thrift: ambiguous import: found package github.com/apache/thrift/lib/go/thrift in multiple modules:
	github.com/apache/thrift v0.13.0 (/Users/shauncampbell/go/pkg/mod/github.com/apache/thrift@v0.13.0/lib/go/thrift)
	github.com/apache/thrift/lib/go/thrift v0.0.0-20210120171102-e27e82c46ba4 (/Users/shauncampbell/go/pkg/mod/github.com/apache/thrift/lib/go/thrift@v0.0.0-20210120171102-e27e82c46ba4)

I'm not entirely certain where it is getting this thrift@v0.0.0-20210120171102-e27e82c46ba4 part from, because the go.mod file specifically lists 0.13.0.

Anyone else seen this?

Richer Support for Struct Tags

It would be neat to have richer support for struct tags for auto-generated schema definitions. I added this feature to a branch off my forked repo and am happy to put up a PR if you guys think this is a good idea! I added documentation on what this would look like (I just copied the updates I made to the README on my branch).

Object Schema Definitions

The sub-package parquetschema/autoschema supports auto-generating schema
definitions for a provided object's type using reflection and struct tags. The
generated schema is meant to be compatible with the reflection-based
marshalling/unmarshalling in the floor sub-package.

Supported Parquet Types

| Parquet Type | Go Types | Note |
|--------------|----------|------|
| BOOLEAN | bool | |
| INT32 | int{8,16,32}, uint{,8,16,32} | |
| INT64 | int{,64}, uint64 | |
| INT96 | [12]byte | Must specify type=INT96 in the parquet struct tag. |
| FLOAT | float32 | |
| DOUBLE | float64 | |
| BYTE_ARRAY | string, []byte | |
| FIXED_LEN_BYTE_ARRAY | []byte, [N]byte | |

Supported Logical Types

| Logical Type | Go Types | Note |
|--------------|----------|------|
| STRING | string, []byte | |
| MAP | map[T1]T2 | Maps with any key and value types. |
| LIST | []T, [N]T | Slices and arrays of any type except for byte. |
| ENUM | string, []byte | |
| DECIMAL | int32, int64, []byte, [N]byte | |
| DATE | int32, time.Time | |
| TIME | int32, int64, goparquet.Time | int32: TIME(MILLIS, {false,true}); int64: TIME({MICROS,NANOS}, {false,true}) |
| TIMESTAMP | int64, time.Time | |
| INTEGER | {,u}int{,8,16,32,64} | |
| JSON | string, []byte | |
| BSON | string, []byte | |
| UUID | [16]byte | |

Pointers are automatically mapped to optional fields. Unsupported Go types
include funcs, interfaces, unsafe pointers, unsigned int pointers, and complex
numbers.

Default Type Mappings

By default, Go types are mapped to Parquet types and in some cases logical
types as well. More specific mappings can be achieved by the use of struct
tags (see below).

| Go Type | Default Parquet Type | Default Logical Type |
|---------|----------------------|----------------------|
| bool | BOOLEAN | |
| int{,8,16,32,64} | INT{64,32,32,32,64} | INTEGER({64,8,16,32,64}, true) |
| uint{,8,16,32,64} | INT{32,32,32,32,64} | INTEGER({32,8,16,32,64}, false) |
| string | BYTE_ARRAY | STRING |
| []byte | BYTE_ARRAY | |
| [N]byte | FIXED_LEN_BYTE_ARRAY | |
| time.Time | INT64 | TIMESTAMP(NANOS, true) |
| goparquet.Time | INT64 | TIME(NANOS, true) |
| map | group | MAP |
| slice, array | group | LIST |
| struct | group | |

Struct Tags

Automatic schema definition generation supports the use of the parquet struct
tag for further schema specification beyond the default mappings. Tag fields
have the format key=value and are comma separated. The tags do not support
converted types as these are now deprecated by Parquet. Since converted types
are still required to support backward compatibility, they are automatically
set based on a field's logical type.

| Tag Field | Type | Values | Notes |
|-----------|------|--------|-------|
| name | string | ANY | Defaults to the lower-case struct field name. |
| type | string | INT96 | Unless using a [12]byte field for INT96, this does not ever need to be specified. |
| logicaltype | string | STRING, ENUM, DECIMAL, DATE, TIME, TIMESTAMP, JSON, BSON, UUID | Maps and non-byte slices and arrays are always mapped to the MAP and LIST logical types, respectively. |
| timeunit | string | MILLIS, MICROS, NANOS | Only used when the logical type is TIME or TIMESTAMP; defaults to NANOS. |
| isadjustedtoutc | bool | ANY | Only used when the logical type is TIME or TIMESTAMP; defaults to true. |
| scale | int32 | N >= 0 | Only used when the logical type is DECIMAL; defaults to 0. |
| precision | int32 | N >= 0 | Only used when the logical type is DECIMAL; required. |

All fields must be prefixed by key. and value. when referring to key and
value types of a map, respectively, and element. when referring to the
element type of a slice or array. It is invalid to prefix name since it can
only apply to the field itself.

Object Schema Example

type example  struct {
        ByteSlice          []byte
        String             string
        ByteString         []byte          `parquet:"name=byte_string, logicaltype=STRING"`
        Int64              int64           `parquet:"name=int_64"`
        Uint8              uint8           `parquet:"name=u_int_8"`
        Int96              [12]byte        `parquet:"name=int_96, type=INT96"`
        DefaultTS          time.Time       `parquet:"name=default_ts"`
        Timestamp          int64           `parquet:"name=ts, logicaltype=TIMESTAMP, timeunit=MILLIS, isadjustedtoutc=false"`
        Date               time.Time       `parquet:"name=date, logicaltype=DATE"`
        OptionalDecimal    *int32          `parquet:"name=decimal, logicaltype=DECIMAL, scale=5, precision=10"`
        TimeList           []int32         `parquet:"name=time_list, element.logicaltype=TIME, element.timeunit=MILLIS"`
        DecimalTimeMap     map[int64]int32 `parquet:"name=decimal_time_map, key.logicaltype=DECIMAL, key.scale=5, key.precision=15, value.logicaltype=TIME, value.timeunit=MILLIS, value.isadjustedtoutc=true"`
        Struct             struct {
                OptionalInt64 *int64   `parquet:"name=int_64"`
	        Time          int64    `parquet:"name=time, logicaltype=TIME, isadjustedtoutc=false"`
	        StringList    []string `parquet:"name=string_list"`
        } `parquet:"name=struct"`
}

The above struct is equivalent to the following schema definition:

message autogen_schema {
    required binary byteslice;
    required binary string (STRING);
    required binary byte_string (STRING);
    required int64 int_64 (INTEGER(64,true));
    required int32 u_int_8 (INTEGER(8,false));
    required int96 int_96;
    required int64 default_ts (TIMESTAMP(NANOS,true));
    required int64 ts (TIMESTAMP(MILLIS,false));
    required int32 date (DATE);
    optional int32 decimal (DECIMAL(10,5));
    required group time_list (LIST) {
        repeated group list {
          required int32 element (TIME(MILLIS,true));
        }
    }
    optional group decimal_time_map (MAP) {
        repeated group key_value (MAP_KEY_VALUE) {
          required int64 key (DECIMAL(15,5));
          required int32 value (TIME(MILLIS, true));
        }
    }
    required group struct {
        optional int64 int_64 (INTEGER(64,true));
        required int64 time (TIME(NANOS, false));
        required group string_list (LIST) {
            repeated group list {
                required binary element (STRING);
            }
        }
    }
}

FileWriter.AddData fails on unsigned integer columns

Consider this test, added to readwrite_test.go.

func TestWriteThenReadFileUint(t *testing.T) {
	sd, err := parquetschema.ParseSchemaDefinition(`
		message foo {
			required int32 u8 (UINT_8);
			required int32 u16 (UINT_16);
			required int32 u32 (UINT_32);
			required int64 u64 (UINT_64);
		}
	`)
	require.NoError(t, err)

	var buf bytes.Buffer
	w := NewFileWriter(&buf, WithSchemaDefinition(sd))
	testData := map[string]interface{}{
		"u8": uint32(8),
		"u16": uint32(16),
		"u32": uint32(32),
		"u64": uint64(64),
	}
	require.NoError(t, w.AddData(testData))
	require.NoError(t, w.Close())

	r, err := NewFileReader(bytes.NewReader(buf.Bytes()))
	require.NoError(t, err)

	data, err := r.NextRow()
	require.NoError(t, err)
	require.Equal(t, testData, data)

	_, err = r.NextRow()
	require.Equal(t, io.EOF, err)
}

It fails like this.

$ go test -run TestWriteThenReadFileUint
--- FAIL: TestWriteThenReadFileUint (0.00s)
    readwrite_test.go:889:
        	Error Trace:	readwrite_test.go:889
        	Error:      	Received unexpected error:
        	            	unsupported type for storing in int32 column: uint32 => 8
        	            	github.com/fraugster/parquet-go.(*int32Store).getValues
        	            		/Users/noah/fraugster-parquet-go/type_int32.go:187
        	            	github.com/fraugster/parquet-go.(*ColumnStore).add
        	            		/Users/noah/fraugster-parquet-go/data_store.go:102
        	            	github.com/fraugster/parquet-go.recursiveAddColumnData
        	            		/Users/noah/fraugster-parquet-go/schema.go:738
        	            	github.com/fraugster/parquet-go.(*schema).AddData
        	            		/Users/noah/fraugster-parquet-go/schema.go:695
        	            	github.com/fraugster/parquet-go.(*FileWriter).AddData
        	            		/Users/noah/fraugster-parquet-go/file_writer.go:217
        	            	github.com/fraugster/parquet-go.TestWriteThenReadFileUint
        	            		/Users/noah/fraugster-parquet-go/readwrite_test.go:889
        	            	testing.tRunner
        	            		/usr/local/Cellar/go/1.15.6/libexec/src/testing/testing.go:1123
        	            	runtime.goexit
        	            		/usr/local/Cellar/go/1.15.6/libexec/src/runtime/asm_amd64.s:1374
        	Test:       	TestWriteThenReadFileUint
FAIL
exit status 1
FAIL	github.com/fraugster/parquet-go	0.245s

One possible fix is to add cases for uint32 and []uint32 to the type switch in int32Store.getValues, cases for uint64 and []uint64 to the type switch in int64Store.getValues, and cases for uint32 and uint64 to the type switch in mapKey.

Is that the best approach?
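In the meantime, a caller-side workaround is to hand AddData the signed Go types that the column stores already accept; here is a sketch against the schema above (this loses the unsigned types on the Go side, and values read back come out as int32/int64):

testData := map[string]interface{}{
	"u8":  int32(8),  // UINT_8, UINT_16 and UINT_32 are backed by int32 columns
	"u16": int32(16),
	"u32": int32(32),
	"u64": int64(64), // UINT_64 is backed by an int64 column
}
// w.AddData(testData) then succeeds with the current type switches.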

parquet-go generates corrupt stats

Files created by parquet-go 0.9.0 contain stats that are detected as corrupt; in particular, the null count seems to be off.

$ parquet check-stats files/test3.parquet
files/test3.parquet has corrupt stats: Number of nulls doesn't match.
$

FileWriter sets TotalByteSize to 0

When using the FileWriter, the generated row group metadata sometimes has a TotalByteSize of 0 even though the row group contains multiple column chunks. After looking through the code, it appears that the TotalByteSize is set to 0 in FlushRowGroup(…) here:

fw.rowGroups = append(fw.rowGroups, &parquet.RowGroup{
	Columns:        cc,
	TotalByteSize:  0,
	NumRows:        fw.rowGroupNumRecords(),
	SortingColumns: nil,
})

Is there a specific reason the TotalByteSize is being set to 0?

If not, could one possible solution be to set the TotalByteSize to the sum of each column chunk's TotalCompressedSize for an accurate estimation of the size?
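A sketch of that suggestion, assuming cc is the []*parquet.ColumnChunk slice already assembled in FlushRowGroup and that each chunk's MetaData carries a TotalCompressedSize:

// Sum the compressed sizes of all column chunks in this row group.
var totalByteSize int64
for _, chunk := range cc {
	if chunk.MetaData != nil {
		totalByteSize += chunk.MetaData.TotalCompressedSize
	}
}

fw.rowGroups = append(fw.rowGroups, &parquet.RowGroup{
	Columns:        cc,
	TotalByteSize:  totalByteSize,
	NumRows:        fw.rowGroupNumRecords(),
	SortingColumns: nil,
})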

Error parsing schema from AWS Cost and Usage reporting file

Possibly the same as #12: I get a similar error when trying to parse the schema of a parquet file generated by AWS Cost and Usage Reporting:

$ parquet-tool schema test-00001.snappy.parquet
panic: line 1: expected {, got unknown start of token '46' instead

goroutine 1 [running]:
github.com/fraugster/parquet-go.(*schema).GetSchemaDefinition(0xc000152300, 0xc00012a048)
	/Users/cml35/go/src/github.com/fraugster/parquet-go/schema.go:936 +0x7d
github.com/fraugster/parquet-go/cmd/parquet-tool/cmds.glob..func5(0x16174c0, 0xc000112ee0, 0x1, 0x1)
	/Users/cml35/go/src/github.com/fraugster/parquet-go/cmd/parquet-tool/cmds/schema.go:35 +0x1ac
github.com/fraugster/parquet-go/vendor/github.com/spf13/cobra.(*Command).execute(0x16174c0, 0xc000112eb0, 0x1, 0x1, 0x16174c0, 0xc000112eb0)
	/Users/cml35/go/src/github.com/fraugster/parquet-go/vendor/github.com/spf13/cobra/command.go:830 +0x2c2
github.com/fraugster/parquet-go/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x1616fc0, 0xc00004af78, 0x1007f25, 0xc000102058)
	/Users/cml35/go/src/github.com/fraugster/parquet-go/vendor/github.com/spf13/cobra/command.go:914 +0x30b
github.com/fraugster/parquet-go/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	/Users/cml35/go/src/github.com/fraugster/parquet-go/vendor/github.com/spf13/cobra/command.go:864
github.com/fraugster/parquet-go/cmd/parquet-tool/cmds.Execute()
	/Users/cml35/go/src/github.com/fraugster/parquet-go/cmd/parquet-tool/cmds/root.go:16 +0x31
main.main()
	/Users/cml35/go/src/github.com/fraugster/parquet-go/cmd/parquet-tool/main.go:8 +0x25

In this case it looks like there's a problem with the root message id, com.amazon.aws.origami.datawriter.parquet.ParquetDataWriter$. It seems to be failing at the first '.' (ASCII 46).

"not enough data to read all miniblock bit widths" when writing one record

I seem to have run into trouble when writing one record using parquet.Encoding_DELTA_LENGTH_BYTE_ARRAY or parquet.Encoding_DELTA_BYTE_ARRAY. Reading the written data errors with:

not enough data to read all miniblock bit widths: unexpected EOF

Here is a complete test that fails for me, using v0.3.0:

func TestOne(t *testing.T) {
	var buf bytes.Buffer

	wr := goparquet.NewFileWriter(&buf)

	bas, err := goparquet.NewByteArrayStore(parquet.Encoding_DELTA_LENGTH_BYTE_ARRAY, true, &goparquet.ColumnParameters{})
	if err != nil {
		t.Fatal(err)
	}

	col := goparquet.NewDataColumn(bas, parquet.FieldRepetitionType_OPTIONAL)
	if err := wr.AddColumn("name", col); err != nil {
		t.Fatal(err)
	}

	for i := 0; i < 1; i++ {
		rec := map[string]interface{}{
			"name": []byte("dan"),
		}

		if err := wr.AddData(rec); err != nil {
			t.Fatal(err)
		}
	}

	if err := wr.Close(); err != nil {
		t.Fatal(err)
	}

	rd, err := goparquet.NewFileReader(bytes.NewReader(buf.Bytes()))
	if err != nil {
		t.Fatal(err)
	}

	for {
		row, err := rd.NextRow()
		if err != nil {
			if errors.Is(err, io.EOF) {
				break
			}
			t.Fatal(err)
		}
		t.Log(row)
	}
}

Changing either:

  • i < 1 to i < 2
  • parquet.Encoding_DELTA_LENGTH_BYTE_ARRAY to parquet.Encoding_PLAIN

Gets it passing.

Hopefully I haven't misused something here.

schema with non-alphanumeric identifier?

hi

i like the look of your package and am doing a prototype for a project, but i have a unique requirement with a parquet file whose column names contain a "."

i'm having trouble doing that, any advice?

example test:

package main

import (
	"io"
	"os"
	"testing"

	goparquet "github.com/fraugster/parquet-go"
	"github.com/fraugster/parquet-go/parquet"
	"github.com/fraugster/parquet-go/parquetschema"
)

const schema = `message test {
	optional binary version (UTF8);
	optional binary meta.id (UTF8);
}`

var data = map[string]interface{}{
	"version": []byte("v1"),
	"meta.id": []byte("01EWZYTXQHBYXEHZRC329X4QD3"),
}

func Test_Main(t *testing.T) {

	// parse schema
	s, err := parquetschema.ParseSchemaDefinition(schema)
	if err != nil {
		t.Fatalf("failed to parse schema: %v", err)
	}

	// create output file
	os.Remove("output.parquet")
	var f io.WriteCloser
	if f, err = os.OpenFile("output.parquet", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0644); err != nil {
		t.Fatalf("failed to open file: %v", err)
	}
	defer f.Close()

	// create parquet writer
	fw := goparquet.NewFileWriter(f,
		goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
		goparquet.WithSchemaDefinition(s),
		goparquet.WithCreator("write-lowlevel"),
	)
	defer fw.Close()

	// write parquet
	if err := fw.AddData(data); err != nil {
		t.Errorf("failed to AddData: %v", err)
	}

}

produces this error:

./main_test.go:28: failed to parse schema: line 3: expected ;, got unknown start of token '46' instead

Similarly, if i attempt to read my existing file (produced by an existing system), i can cat it but can't get the schema, failing with the same error, eg.

$ parquet-tool cat ./example.parquet
version = v1
meta.id = 01EWZYTXQHBYXEHZRC329X4QD3
parquet-tool schema ./example.parquet
panic: line 3: expected ;, got unknown start of token '46' instead

goroutine 1 [running]:
github.com/fraugster/parquet-go.(*schema).GetSchemaDefinition(0xc00006c340, 0xc000010058)
        /Users/petermcintyre/go/1.14.7/pkg/mod/github.com/fraugster/[email protected]/schema.go:936 +0x7d
github.com/fraugster/parquet-go/cmd/parquet-tool/cmds.glob..func5(0x18365e0, 0xc00005af00, 0x1, 0x1)
        /Users/petermcintyre/go/1.14.7/pkg/mod/github.com/fraugster/[email protected]/cmd/parquet-tool/cmds/schema.go:35 +0x1ac
github.com/spf13/cobra.(*Command).execute(0x18365e0, 0xc00005aed0, 0x1, 0x1, 0x18365e0, 0xc00005aed0)
        /Users/petermcintyre/go/1.14.7/pkg/mod/github.com/spf13/[email protected]/command.go:830 +0x29d
github.com/spf13/cobra.(*Command).ExecuteC(0x18360e0, 0xc000057f78, 0x100772f, 0xc00007a058)
        /Users/petermcintyre/go/1.14.7/pkg/mod/github.com/spf13/[email protected]/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        /Users/petermcintyre/go/1.14.7/pkg/mod/github.com/spf13/[email protected]/command.go:864
github.com/fraugster/parquet-go/cmd/parquet-tool/cmds.Execute()
        /Users/petermcintyre/go/1.14.7/pkg/mod/github.com/fraugster/[email protected]/cmd/parquet-tool/cmds/root.go:16 +0x31
main.main()
        /Users/petermcintyre/go/1.14.7/pkg/mod/github.com/fraugster/[email protected]/cmd/parquet-tool/main.go:8 +0x20

benchmark compared with github.com/xitongsys/parquet-go

Hello,

I did a benchmark of writing parquet files with the following two libraries, and it turns out that github.com/xitongsys/parquet-go is much faster.

package main

import (
	"os"
	"testing"

	goparquet "github.com/fraugster/parquet-go"
	"github.com/fraugster/parquet-go/floor"
	"github.com/fraugster/parquet-go/parquet"
	"github.com/fraugster/parquet-go/parquetschema"
	"github.com/seaguest/log"
	parquet2 "github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/writer"
)

func BenchmarkParquet(b *testing.B) {
	for n := 0; n < b.N; n++ {
		writeParquet1()
	}
}

func writeParquet1() {
	schemaDef, err := parquetschema.ParseSchemaDefinition(
		`message test {
			required binary format (STRING);
			required int32 data_type;
			required binary country (STRING);
		}`)
	if err != nil {
		log.Fatalf("Parsing schema definition failed: %v", err)
	}

	parquetFilename := "output1.parquet"

	fw, err := floor.NewFileWriter(parquetFilename,
		goparquet.WithSchemaDefinition(schemaDef),
		goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
	)
	if err != nil {
		log.Fatalf("Opening parquet file for writing failed: %v", err)
	}

	type record struct {
		Format   string `parquet:"format"`
		DataType int32  `parquet:"data_type"`
		Country  string `parquet:"country"`
	}

	num := 1000
	for i := 0; i < num; i++ {
		stu := record{
			Format:   "Test",
			DataType: 1,
			Country:  "IN",
		}
		if err = fw.Write(stu); err != nil {
			log.Error("Write error", err)
		}
	}

	if err := fw.Close(); err != nil {
		log.Fatalf("Closing parquet writer failed: %v", err)
	}

}

func writeParquet2() {
	var err error
	w, err := os.Create("output2.parquet")
	if err != nil {
		log.Error("Can't create local file", err)
		return
	}

	type record struct {
		Format   string `parquet:"name=format, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
		DataType int32  `parquet:"name=data_type, type=INT32, encoding=PLAIN"`
		Country  string `parquet:"name=country, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
	}

	//write
	pw, err := writer.NewParquetWriterFromWriter(w, new(record), 4)
	if err != nil {
		log.Error("Can't create parquet writer", err)
		return
	}

	pw.CompressionType = parquet2.CompressionCodec_SNAPPY
	num := 1000
	for i := 0; i < num; i++ {
		stu := record{
			Format:   "Test",
			DataType: 1,
			Country:  "IN",
		}
		if err = pw.Write(stu); err != nil {
			log.Error("Write error", err)
		}
	}
	if err = pw.WriteStop(); err != nil {
		log.Error("WriteStop error", err)
		return
	}
	w.Close()
}

with fraugster/parquet-go, I got

goos: linux
goarch: amd64
pkg: gitlab.wuren.com/data/etp-storage/cmd/test
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkParquet-8   	     367	   2921834 ns/op
PASS
ok  	/test	1.408s

with xitongsys/parquet-go, I got

goos: linux
goarch: amd64
pkg: gitlab.wuren.com/data/etp-storage/cmd/test
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkParquet-8   	    1299	    971545 ns/op
PASS
ok  	/test	1.360s

is there something I did wrong to have such a big difference?

go get github.com/fraugster/parquet-go@v0.4.0 is broken

Hi, I tried to install the latest version of this package but I get the following error:

go get -u github.com/fraugster/parquet-go
# github.com/fraugster/parquet-go/parquet
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:760:36: not enough arguments in call to iprot.ReadStructBegin
	have ()
	want (context.Context)
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:765:55: not enough arguments in call to iprot.ReadFieldBegin
	have ()
	want (context.Context)
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:779:25: not enough arguments in call to iprot.Skip
	have (thrift.TType)
	want (context.Context, thrift.TType)
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:838:31: not enough arguments in call to iprot.ReadFieldEnd
	have ()
	want (context.Context)
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:842:31: not enough arguments in call to iprot.ReadStructEnd
	have ()
	want (context.Context)
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:849:31: not enough arguments in call to iprot.ReadBinary
	have ()
	want (context.Context)
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:858:31: not enough arguments in call to iprot.ReadBinary
	have ()
	want (context.Context)
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:867:28: not enough arguments in call to iprot.ReadI64
	have ()
	want (context.Context)
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:876:28: not enough arguments in call to iprot.ReadI64
	have ()
	want (context.Context)
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:885:31: not enough arguments in call to iprot.ReadBinary
	have ()
	want (context.Context)
../../../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:885:31: too many errors

The v0.4.0 release uses github.com/apache/thrift v0.13.0, but the code assumes an earlier version of thrift:

if _, err := iprot.ReadStructBegin(); err != nil {

In other words, the context is not passed.

However, I managed to get around the issue for now by:

go get github.com/fraugster/parquet-go@5b69a907ab478b9aa2a8d663a871577efcedc801

And it seems there's another v0.4.1 but there are no tags for it.

github.com/fraugster/parquet-go v0.4.1-0.20211010182140-5b69a907ab47
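One workaround that is commonly suggested for this class of ambiguous-import error is to exclude the nested pseudo-module in go.mod so that the package is resolved from github.com/apache/thrift only. This is a sketch based on the versions in the error output above, not a verified fix for this repository:

// go.mod of the consuming project (sketch)
require github.com/apache/thrift v0.13.0

exclude github.com/apache/thrift/lib/go/thrift v0.0.0-20210120171102-e27e82c46ba4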

extra array, slice, time, and map fields in struct cause read/write errors

when trying to use an existing struct type that has fields not in the given schema, where those fields are of one of the following types:

  • array
  • slice
  • time
  • map

it results in nil access errors.

To repro:

func TestWriteRead(t *testing.T) {
	t.Run("fields not defined in schema are ignored, but do not error", func(t *testing.T) {
		t.Run("[1]byte", func(t *testing.T) {
			s := `message test {
				required int64 val;
			}`
			o := struct {
				Val   int64
				extra [1]byte
			}{Val: 1, extra: [1]byte{'1'}}
			e := struct {
				Val   int64
				extra [1]byte
			}{Val: 1, extra: [1]byte{}}
			assert.EqualValues(t, e, writeReadOne(t, o, s))
		})
		t.Run("[]byte", func(t *testing.T) {
			s := `message test {
				required int64 val;
			}`
			o := struct {
				Val   int64
				extra []byte
			}{Val: 1, extra: []byte{'1'}}
			e := struct {
				Val   int64
				extra []byte
			}{Val: 1, extra: nil}
			assert.EqualValues(t, e, writeReadOne(t, o, s))
		})
		t.Run("[]int", func(t *testing.T) {
			s := `message test {
				required int64 val;
			}`
			o := struct {
				Val   int64
				extra []int
			}{Val: 1, extra: []int{1}}
			e := struct {
				Val   int64
				extra []int
			}{Val: 1, extra: nil}
			assert.EqualValues(t, e, writeReadOne(t, o, s))
		})
		t.Run("time.Time", func(t *testing.T) {
			s := `message test {
				required int64 val;
			}`
			o := struct {
				Val   int64
				extra time.Time
			}{Val: 1, extra: time.Now()}
			e := struct {
				Val   int64
				extra time.Time
			}{Val: 1, extra: time.Time{}}
			assert.EqualValues(t, e, writeReadOne(t, o, s))
		})
		t.Run("map[int]int", func(t *testing.T) {
			s := `message test {
				required int64 val;
			}`
			o := struct {
				Val   int64
				extra map[int]int
			}{Val: 1, extra: map[int]int{1: 1}}
			e := struct {
				Val   int64
				extra map[int]int
			}{Val: 1, extra: nil}
			assert.EqualValues(t, e, writeReadOne(t, o, s))
		})
	})
}

func writeReadOne(t *testing.T, o interface{}, schema string) interface{} {
	t.Helper()
	defer func() {
		if r := recover(); r != nil {
			t.Error(r)
		}
	}()
	schemaDef, err := parquetschema.ParseSchemaDefinition(schema)
	require.NoError(t, err)

	var buf bytes.Buffer
	w := NewWriter(goparquet.NewFileWriter(&buf,
		goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
		goparquet.WithSchemaDefinition(schemaDef),
	))

	err = w.Write(o)
	require.NoError(t, err)

	err = w.Close()
	require.NoError(t, err)

	fr, err := goparquet.NewFileReader(bytes.NewReader(buf.Bytes()))
	require.NoError(t, err)

	pr := NewReader(fr)

	pr.Next()

	o2val := reflect.New(reflect.TypeOf(o))
	err = pr.Scan(o2val.Interface())
	require.NoError(t, err)

	return o2val.Elem().Interface()
}

Unable to Read Spark Parquet Schema DECIMAL()

The library reads the schema def as below and then panics on conversion to schema elements

message spark_schema {
optional int32 event_date (DATE);
optional int96 datetime;
optional fixed_len_byte_array(9) id1 (DECIMAL);
optional fixed_len_byte_array(9) id2 (DECIMAL);
optional fixed_len_byte_array(9) id3 (DECIMAL);
optional binary name (UTF8);
}

panic: line 4: expected (, got ")" instead

Looks like it's failing since the decimal precision and scale are not specified in the schema. Spark 2.4 produced this file and has no trouble reading it. I have also used other schema readers and they output (DECIMAL(20,0)) as the schema instead. I am not sure if this is some default when you fail to specify? It seems that it is valid not to specify.

I have also been having trouble writing a decimal schema for Spark to read, as it seems to fail to recognize that the field is a decimal and instead interprets the type as just a fixed-length byte array. For the low-level writer, what is the correct way to write a decimal into a byte array?

Everything else is working for me just cannot seem to read or write decimal types in regard to spark.

Thanks

proposal: schema generation

I'd like to add easy entry-level access to creation and parsing of parquet files without a developer needing to understand the details of building a schema definition. There are common sense default parquet types for each go type and we could easily build those default types for a given object type.

The target api would be something like this:

	type MyType struct {
		Int int
		Bool bool
		Slice []int
		// other common fields...
	}

	schemaDef, err := parquetschema.Generate(new(MyType))
	require.NoError(t, err)

	var buf bytes.Buffer
	w := NewWriter(goparquet.NewFileWriter(&buf,
		goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
		goparquet.WithSchemaDefinition(schemaDef),
	))

	// ...

Is there a way to reduce memory utilization in dictPageReader/PageReader ?

Describe the bug
I am trying to load multiple concurrent parquet files into memory and read them row by row. I am facing an OOM issue while reading 10 concurrent files of 50MB each. Do you see any obvious things in the call graph? Thank you!!

Unit test to reproduce
Please provide a unit test, either as a patch or text snippet or link to your fork. If you can't isolate it into a unit test then please provide steps to reproduce.

parquet-go specific details

  • What version are you using? v0.11.0

Misc Details

  • Are you using AWS Athena, Google BigQuery, presto... ? AWS S3
  • Any other relevant details... how big are the files / rowgroups you're trying to read/write? 10 - 100 MB
  • Do you have memory stats to share? Yes
  • Can you provide a stacktrace? Yes

parquet-go-pprof

How to avoid performance penalty while dealing with too many unique values

My dataset involves capturing response times. The response times are floating point numbers.

When dictStore accumulates the values, the time spent in map accesses to determine unique values seemed to be a lot. When I disabled that logic with my crude code I got some benefit.

My question is: shouldn't the getIndex method in dictStore consider whether dictMode is enabled or not?

Pardon me for my limited knowledge in parquet and the usage of dictMode.

Small bug in `recursiveFix` func

Describe the bug
I uncovered a small bug in schema.go: the recursiveFix func overwrites the passed-in colPath by using it in the append function to add the column name to the column path on L684. Since go slices may or may not reuse the underlying array depending on capacity, this does not always necessarily return a slice that points to a newly allocated array (I have been bitten by this bug before, it's one of go's "gotchas"). This resulted in the go schema having the same column path for all fields in a list element and thus invalid parquet.

message test_result {
  required binary version (STRING);
  required binary variant (STRING);
  required binary task_name (STRING);
  optional binary display_task_name (STRING);
  required binary task_id (STRING);
  optional binary display_task_id (STRING);
  required int32 execution;
  required binary request_type (STRING);
  required int64 created_at (TIMESTAMP(MILLIS,true));
  required group results (LIST) {
    repeated group list {
      required group element {
        required binary test_name (STRING);
        optional binary display_test_name (STRING);
        optional binary group_id (STRING);
        required int32 trial;
        required binary status (STRING);
        optional binary log_test_name (STRING);
        optional binary log_url (STRING);
        optional binary raw_log_url (STRING);
        optional int32 line_num;
        required int64 task_create_time (TIMESTAMP(MILLIS,true));
        required int64 test_start_time (TIMESTAMP(MILLIS,true));
        required int64 test_end_time (TIMESTAMP(MILLIS,true));
      }
    }
  }
}

Using this schema, all the column paths for the list elements were ["results", "list", "element", "test_end_time"] because test_end_time is the last field in the group and the recursive function continuously overwrites the colPath argument and assigns it to each column's path field.

I can put up a PR for the fix, it is a one-line change:

col.path = append(append(col.path, colPath...), col.name)

Let me know if I can provide any more info, thanks!
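For readers who haven't hit this particular gotcha before, here is a small standalone Go example (not parquet-go code) showing how append can reuse the caller's backing array and silently overwrite a previously returned slice:

package main

import "fmt"

func main() {
	// A shared prefix with spare capacity, similar to the colPath argument.
	base := make([]string, 0, 4)
	base = append(base, "results", "list", "element")

	a := append(base, "test_start_time")
	// This second append reuses the same backing array because base still has
	// capacity, so it overwrites the last element that a points to.
	b := append(base, "test_end_time")

	fmt.Println(a) // [results list element test_end_time]
	fmt.Println(b) // [results list element test_end_time]
}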

question

Hi @fraugster @akrennmair @panamafrancis,

I would like to know how to read the following schema with parquet-go?

message spark_schema {
  optional group foo (LIST) {
    repeated group list {
      optional double element;
    }
  }
  required group bar (LIST) {
    repeated group list {
      required double element;
    }
  }
}

the following seems not to work:

type Record struct {
	Foo FooList `parquet:"foo.list"`
	Bar   BarList   `parquet:"bar.list"`
}

type FooList struct {
	List []FooListList `parquet:"list"`
}

type FooListList struct {
	List []float64 `parquet:"element"`
}

type BarList struct {
	List BarListList `parquet:"list"`
}

type BarListList struct {
	List []float64 `parquet:"element"`
}

while this works fine with github.com/segmentio/parquet-go lib:

type Record struct {
	Foo BarList `parquet:"foo"`
	Bar   BarList   `parquet:"bar"`
}

type FooList struct {
	List []FooListList `parquet:"list"`
}

type FooListList struct {
	List []float64 `parquet:"element"`
}

type BarList struct {
	List BarListList `parquet:"list"`
}

type BarListList struct {
	List []float64 `parquet:"element"`
}

Missing Logical type for INTERVAL

Converted types are deprecated and logical types should be used instead, but the logical type for INTERVAL is currently missing from the library.

Add Performant Low-level Operation For Adding Row

The SchemaWriter interface gives functionality for adding row-level data via the AddData method. This method accepts the row information in the form of map[string]interface{}, which allows the caller to provide the name of the column as the key (string) and the value in the form that is appropriate for the underlying data (i.e. string, int64, []byte, bool, etc). However, this comes with a performance impact in the form of heap allocations and increased garbage-collector-managed memory, because the interface{} values involve pointers and escape to the heap. To increase performance, can a new method be added to support providing row data in a way that reduces the allocations escaping to the heap, while still giving the caller the control to handle dynamic data, like what is happening in the CSV to Parquet tool?

Doing a quick scan of the code base and to my untrained eye, it looks like one way to achieve this may be to create a generic struct that can encapsulate the data and use that rather than a map.

// RowData represents a row of data in a CSV file, and can be provided to the `SchemaWriter`
type RowData struct{
	Values []FieldData
}

// FieldData represents each field/column for a row in a CSV file
type FieldData struct{
	DataName string

	/*
		Different data types.
		Only one of these should be populated at a time
	*/
	StringValue string
	IntValue int
	Int16Value int16
	Int32Value int32
	Int64Value int64
	BoolValue bool
}

The SchemaWriter interface can be updated to accept these new types, For example,

// AddDataRow writes a row of data to the underlying writer using the specified data and metadata
func AddDataRow(data RowData) error

I am wondering if my assumptions regarding performance are correct, if there are any known work arounds other than adding new functionality, if this is something that is desired for this project, and what is the desired method to achieve the results.

schema for nesting map inside array

Hi

i'm having trouble writing a schema with a map inside an array. Example JSON:

{
    "people": [
        { "name": "jack" },
        { "name": "jill" }
    ]
}

Things i've tried:

message test {
	optional group people (LIST) {
		repeated group person {
			optional binary name (UTF8);
		}
	}
}

Error:
failed to AddData: data is not a map or array of map, its a []interface {}
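Following the LIST convention described in the README (a group annotated with LIST must contain a repeated group named list), a schema along these lines should at least parse. This is only a sketch; it does not by itself address how the data passed to AddData has to be shaped (the error message above suggests the low-level writer expects a map or an array of maps rather than a []interface{} of arbitrary values).

message test {
	optional group people (LIST) {
		repeated group list {
			required group element {
				optional binary name (UTF8);
			}
		}
	}
}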

doesn't compile

got the following errors:

github.com/fraugster/parquet-go/parquet

../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:760:36: not enough arguments in call to iprot.ReadStructBegin
have ()
want (context.Context)
../../go/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:765:55: not enough arguments in call to iprot.ReadFieldBegin
have ()
want (context.Context)

Inconsistent behavior in floor.reflectMarshaller when field is not in SchemaDefinition

When attempting to write a struct that contains fields not present in the SchemaDefinition, the floor.reflectMarshaller behaves inconsistently depending on the type of field in the struct it is decoding.

If it is a complex type (for example a []byte), it will cause a panic because schemaDef.SubSchema(fieldName) will return a nil pointer which will later be unsafely dereferenced in the decodeByteSliceOrArray method.

However, simple primitives like bools will not cause a panic because there is no dereferencing. These will simply not be written, and Write will not return an error. I can see an argument for that to be the desired behavior, but as a user I expected an error in both cases, and regardless, the slice case should not cause a panic.

Reproduction steps

package main

import (
	goparquet "github.com/fraugster/parquet-go"
	"github.com/fraugster/parquet-go/floor"
	"github.com/fraugster/parquet-go/parquet"
	"github.com/fraugster/parquet-go/parquetschema"
	"log"
)

func main() {
	obj1 := struct {Bar bool}{Bar: true}
	obj2 := struct {Bar []byte}{Bar: []byte{0xFF, 0x0A}}
	if err := write("bool.parquet", obj1); err != nil {
		log.Fatal(err)
	}
	if err := write("bytearray.parquet", obj2); err != nil {
		log.Fatal(err)
	}
}

func write(filename string, obj interface{}) error {
	schemaDef, err := parquetschema.ParseSchemaDefinition(`message test { required int32 foo; }`)
	if err != nil {
		return err
	}
	fw, err := floor.NewFileWriter(filename,
		goparquet.WithSchemaDefinition(schemaDef),
		goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
	)
	if err != nil {
		return err
	}
	if err := fw.Write(obj); err != nil { // panics when writing obj2
		return err
	}
	return fw.Close()
}

This program will write an empty bool.parquet file without returning an error, but will panic when writing bytearray.parquet with the following stack trace

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x12131fd]

goroutine 1 [running]:
github.com/fraugster/parquet-go/floor.(*reflectMarshaller).decodeByteSliceOrArray(0xc00012a600, 0x12ca440, 0xc00012a660, 0x122e3a0, 0xc00012a500, 0x97, 0x0, 0xc00012a660, 0xc000056c10)
        /Users/ryanmiville/go/pkg/mod/github.com/fraugster/[email protected]/floor/writer.go:222 +0x7d
github.com/fraugster/parquet-go/floor.(*reflectMarshaller).decodeValue(0xc00012a600, 0x12ca440, 0xc00012a660, 0x122e3a0, 0xc00012a500, 0x97, 0x0, 0x0, 0x0)
        /Users/ryanmiville/go/pkg/mod/github.com/fraugster/[email protected]/floor/writer.go:202 +0x668
github.com/fraugster/parquet-go/floor.(*reflectMarshaller).decodeStruct(0xc00012a600, 0x12c7100, 0xc00011aea0, 0x1243980, 0xc00012a500, 0x99, 0xc0001320b8, 0x10, 0x124e0e0)
        /Users/ryanmiville/go/pkg/mod/github.com/fraugster/[email protected]/floor/writer.go:110 +0x407
github.com/fraugster/parquet-go/floor.(*reflectMarshaller).marshal(...)
        /Users/ryanmiville/go/pkg/mod/github.com/fraugster/[email protected]/floor/writer.go:79
github.com/fraugster/parquet-go/floor.(*reflectMarshaller).MarshalParquet(0xc00012a600, 0x12c7100, 0xc00011aea0, 0x0, 0x0)
        /Users/ryanmiville/go/pkg/mod/github.com/fraugster/[email protected]/floor/writer.go:75 +0xc5
github.com/fraugster/parquet-go/floor.(*Writer).Write(0xc00012a5a0, 0x1243980, 0xc00012a500, 0x2, 0x2)
        /Users/ryanmiville/go/pkg/mod/github.com/fraugster/[email protected]/floor/writer.go:58 +0xf7
main.write(0x127f66d, 0x11, 0x1243980, 0xc00012a500, 0x0, 0x0)
        /Users/ryanmiville/projects/parquet/main/main.go:34 +0x105
main.main()
        /Users/ryanmiville/projects/parquet/main/main.go:17 +0x105

Process finished with exit code 2

Panic attempting to print schema

When using examples/read-low-level/main.go (which is very similar to the parquet-tool -cmd schema functionality) I get panics no matter what parquet file I point it at. These files are all readable perfectly well by pyspark.

Example stack trace, where main.go is a direct copy from the example code:

goroutine 1 [running]:
github.com/fraugster/parquet-go.(*schema).GetSchemaDefinition(0xc00002c300, 0x10)
	/home/cmp/Build/go/pkg/mod/github.com/fraugster/[email protected]/schema.go:936 +0x7d
main.printFile(0x7ffd484627d7, 0x2f, 0x0, 0x0)
	/home/abc/Build/dump-go/main.go:40 +0x193
main.main()
	/home/abc/Build/dump-go/main.go:21 +0x168

I'll attach an example of a parquet file that makes things go boom ...

Hardening against adversarial input

Describe the bug

This library provides no protection against untrusted inputs.

It is trivial to write inputs that will cause a server to OOM or crash from another issue.

This makes it impossible to use this package unless you have full trust in the users uploading input.

In our minio server we are therefore forced to disable Parquet parsing in S3 Select, since the server may be running in an environment where the users are untrusted.

Adding limits and safety to this package would extend its usage a lot. I know from experience it can be hard and there are a lot of paths to cover, but usually fuzz tests can guide you.

I made a fuzz test to exercise basic functionality. Even without a seed corpus it crashes within a minute, and letting it run a bit longer runs the OS out of resources as well. Ideally I would like to be able to specify memory limits, so we can make reasonable assumptions about how much memory each Reader will use at most.

Unit test to reproduce

Pre 1.18 fuzz test:

func Fuzz(data []byte) int {
	r, err := NewFileReader(bytes.NewReader(data))
	if err != nil {
		return 0
	}
	for {
		_, err := r.NextRow()
		if err != nil {
			break
		}
		for _, col := range r.Columns() {
			_ = col.Element()
		}
	}
	return 1
}

Build+execute with:

λ go-fuzz-build -o=fuzz-build.zip .
λ go-fuzz -timeout=60 -bin=fuzz-build.zip -workdir=testdata/fuzz -procs=16

This will start crashing within a minute.
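On Go 1.18 or later, roughly the same check can be expressed as a native fuzz test. This is a sketch under the assumption that it lives in the goparquet package itself, so NewFileReader is in scope (needs "bytes" and "testing"):

func FuzzFileReader(f *testing.F) {
	f.Fuzz(func(t *testing.T, data []byte) {
		r, err := NewFileReader(bytes.NewReader(data))
		if err != nil {
			return // rejecting malformed input is fine; panicking is not
		}
		for {
			if _, err := r.NextRow(); err != nil {
				break
			}
			for _, col := range r.Columns() {
				_ = col.Element()
			}
		}
	})
}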

Example crash:
panic: runtime error: makeslice: cap out of range

goroutine 1 [running]:
github.com/fraugster/parquet-go/parquet.(*FileMetaData).ReadField4(0xc00036c140, {0xbcc0f0, 0xc000016168}, {0xbcf4c0, 0xc00036c1e0})
	e:/gopath/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:11882 +0xd1
github.com/fraugster/parquet-go/parquet.(*FileMetaData).Read(0xc00036c140, {0xbcc0f0, 0xc000016168}, {0xbcf4c0, 0xc00036c1e0})
	e:/gopath/pkg/mod/github.com/fraugster/[email protected]/parquet/parquet.go:11753 +0x64b
github.com/fraugster/parquet-go.readThrift({0xbcc0f0, 0xc000016168}, {0xbc80c0, 0xc00036c140}, {0xbc81a0, 0xc00035c0d8})
	e:/gopath/pkg/mod/github.com/fraugster/[email protected]/helpers.go:107 +0xe2
github.com/fraugster/parquet-go.ReadFileMetaDataWithContext({0xbcc0f0, 0xc000016168}, {0xbca2b0, 0xc000358150}, 0x80)
	e:/gopath/pkg/mod/github.com/fraugster/[email protected]/file_meta.go:68 +0x5e7
github.com/fraugster/parquet-go.ReadFileMetaData({0xbca2b0, 0xc000358150}, 0x1)
	e:/gopath/pkg/mod/github.com/fraugster/[email protected]/file_meta.go:18 +0x70
github.com/fraugster/parquet-go.NewFileReaderWithOptions({0xbca2b0, 0xc000358150}, {0xc0000bfe10, 0x1, 0x1})
	e:/gopath/pkg/mod/github.com/fraugster/[email protected]/file_reader.go:39 +0x173
github.com/fraugster/parquet-go.NewFileReader({0xbca2b0, 0xc000358150}, {0x0, 0x984993, 0x625ea3ea})
	e:/gopath/pkg/mod/github.com/fraugster/[email protected]/file_reader.go:127 +0xa9
github.com/minio/minio/internal/s3select/parquet.Fuzz({0x1c37f2c0000, 0x3b, 0x3b})
	d:/minio/minio/internal/s3select/parquet/fuzz.go:19 +0xbf
go-fuzz-dep.Main({0xc0000bff60, 0x2, 0x938c05})
	go-fuzz-dep/main.go:36 +0x15b
main.main()
	github.com/minio/minio/internal/s3select/parquet/go.fuzz.main/main.go:17 +0x45
exit status 2

I am not looking for a solution for the specific crash, but rather the class of crashes that can be triggered with malicious user inputs.

parquet-go specific details

  • What version are you using?

0.10.0

  • Can this be reproduced in earlier versions?

Likely.

Concurrent writes

Hi, I have a use-case where I'm writing billions of data entries into a file. I couldn't find any information regarding this in the README, so I wanted to ask whether it's safe to use *goparquet.FileWriter across multiple goroutines to speed up the whole process. Currently it takes several hours for the job to complete and adding parallelism could significantly improve the timing here.
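Since the README does not state whether the writer is goroutine-safe, a conservative pattern is to parallelize only the row preparation and keep every call to the writer on a single goroutine. A sketch, where the inputs channel and the convert function are hypothetical placeholders for the expensive per-record work (needs "runtime" and "sync" plus the floor import):

func writeConcurrently(fw *floor.Writer, inputs <-chan []byte, convert func([]byte) interface{}) error {
	rows := make(chan interface{}, 1024)

	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for in := range inputs {
				rows <- convert(in) // CPU-bound conversion runs in parallel
			}
		}()
	}
	go func() {
		wg.Wait()
		close(rows)
	}()

	// Only this goroutine ever touches the writer.
	for rec := range rows {
		if err := fw.Write(rec); err != nil {
			return err
		}
	}
	return fw.Close()
}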

Question to in memory reading

Hi,

How can I read all the data into a slice of Foo? Is there a better option than a for loop with fr.Next(), as done in https://github.com/fraugster/parquet-go/blob/master/examples/read-low-level/main.go#L44?

type Foo struct {
	Id        int64 `parquet:"name=id, type=INT64"`
	StartDate int64 `parquet:"name=start_date, type=BYTE_ARRAY"`
	EndDate   int64 `parquet:"name=end_date, type=BYTE_ARRAY"`
}

fr, err := NewFileReaderWithOptions(bytes.NewReader(content))
u := make([]*Foo, fr.NumRows())
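The usual pattern with the floor reader is exactly that kind of loop, scanning each row into a struct and appending it. A sketch, assuming the floor reader's Next/Scan/Err/Close methods and that the schema columns can be matched to the Foo fields (needs "bytes" plus the goparquet and floor imports):

// readAllFoos is a hypothetical helper that reads every row into a slice.
func readAllFoos(content []byte) ([]Foo, error) {
	fr, err := goparquet.NewFileReader(bytes.NewReader(content))
	if err != nil {
		return nil, err
	}
	r := floor.NewReader(fr)
	defer r.Close()

	out := make([]Foo, 0, fr.NumRows())
	for r.Next() {
		var f Foo
		if err := r.Scan(&f); err != nil {
			return nil, err
		}
		out = append(out, f)
	}
	return out, r.Err()
}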

Feature Request: Support for string in byte array store

By design, the byte array store is limited to accepting only []byte and not string, so if the map[string]interface{} contains a string, the library returns an error indicating that the type is not supported.

I am trying to read from a JSON file and write into a parquet file. Go's json library decodes all string fields into the Go string type, and there is no easy way (without introducing a struct) to convert them to []byte instead.

Modifying byteArrayStore to also convert string to []byte and []string to [][]byte would fix this problem. There is a type switch there already, and the change would not break backward compatibility.

If you agree, I can implement it.

This is the function that needs to be modified (a sketch of the proposed addition follows it):

func (is *byteArrayStore) getValues(v interface{}) ([]interface{}, error) {
	var vals []interface{}
	switch typed := v.(type) {
	case []byte:
		vals = []interface{}{typed}
	case [][]byte:
		if is.repTyp != parquet.FieldRepetitionType_REPEATED {
			return nil, fmt.Errorf("the value is not repeated but it is an array")
		}
		vals = make([]interface{}, len(typed))
		for j := range typed {
			vals[j] = typed[j]
		}
	default:
		return nil, fmt.Errorf("unsupported type for storing in []byte column %T => %+v", v, v)
	}
	return vals, nil
}
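The suggestion amounts to two extra cases in that switch, along these lines (an untested sketch of the proposal, not the library's current code):

	case string:
		vals = []interface{}{[]byte(typed)}
	case []string:
		if is.repTyp != parquet.FieldRepetitionType_REPEATED {
			return nil, fmt.Errorf("the value is not repeated but it is an array")
		}
		vals = make([]interface{}, len(typed))
		for j := range typed {
			vals[j] = []byte(typed[j])
		}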

floor writer & file reader don't support many int/uint types

These type combinations all error out when writing/reading
- int <-> int64
- int32 <-> int64
- int16 <-> int64
- int8 <-> int64
- uint <-> int64
- uint16 <-> int64
- uint8 <-> int64
- int64 <-> int32
- uint64 <-> int32
- uint32 <-> int32

To repro:

func TestWriteRead(t *testing.T) {
	t.Run("by parquet type", func(t *testing.T) {
		t.Run("int64", func(t *testing.T) {
			t.Run("int go type", func(t *testing.T) {
				o := struct{ Val int }{Val: 1}
				s := `message test {required int64 val;}`
				assert.EqualValues(t, o, writeReadOne(t, o, s))
			})

			t.Run("int32 go type", func(t *testing.T) {
				o := struct{ Val int32 }{Val: 1}
				s := `message test {required int64 val;}`
				assert.EqualValues(t, o, writeReadOne(t, o, s))
			})

			t.Run("int16 go type", func(t *testing.T) {
				o := struct{ Val int16 }{Val: 1}
				s := `message test {required int64 val;}`
				assert.EqualValues(t, o, writeReadOne(t, o, s))
			})

			t.Run("int8 go type", func(t *testing.T) {
				o := struct{ Val int8 }{Val: 1}
				s := `message test {required int64 val;}`
				assert.EqualValues(t, o, writeReadOne(t, o, s))
			})

			t.Run("uint go type", func(t *testing.T) {
				o := struct{ Val uint }{Val: 1}
				s := `message test {required int64 val;}`
				assert.EqualValues(t, o, writeReadOne(t, o, s))
			})

			t.Run("uint16 go type", func(t *testing.T) {
				o := struct{ Val uint16 }{Val: 1}
				s := `message test {required int64 val;}`
				assert.EqualValues(t, o, writeReadOne(t, o, s))
			})

			t.Run("uint8 go type", func(t *testing.T) {
				o := struct{ Val uint8 }{Val: 1}
				s := `message test {required int64 val;}`
				assert.EqualValues(t, o, writeReadOne(t, o, s))
			})
		})

		t.Run("int32", func(t *testing.T) {
			t.Run("int64 go type", func(t *testing.T) {
				o := struct{ Val int64 }{Val: 1}
				s := `message test {required int32 val;}`
				assert.EqualValues(t, o, writeReadOne(t, o, s))
			})

			t.Run("uint64 go type", func(t *testing.T) {
				o := struct{ Val uint64 }{Val: 1}
				s := `message test {required int32 val;}`
				assert.EqualValues(t, o, writeReadOne(t, o, s))
			})

			t.Run("uint32 go type", func(t *testing.T) {
				o := struct{ Val uint32 }{Val: 1}
				s := `message test {required int32 val;}`
				assert.EqualValues(t, o, writeReadOne(t, o, s))
			})
		})
	})
}

func writeReadOne(t *testing.T, o interface{}, schema string) interface{} {
	t.Helper()
	defer func() {
		if r := recover(); r != nil {
			t.Error(r)
		}
	}()
	schemaDef, err := parquetschema.ParseSchemaDefinition(schema)
	require.NoError(t, err)

	var buf bytes.Buffer
	w := NewWriter(goparquet.NewFileWriter(&buf,
		goparquet.WithCompressionCodec(parquet.CompressionCodec_SNAPPY),
		goparquet.WithSchemaDefinition(schemaDef),
	))

	err = w.Write(o)
	require.NoError(t, err)

	err = w.Close()
	require.NoError(t, err)

	fr, err := goparquet.NewFileReader(bytes.NewReader(buf.Bytes()))
	require.NoError(t, err)

	pr := NewReader(fr)

	pr.Next()

	o2val := reflect.New(reflect.TypeOf(o))
	err = pr.Scan(o2val.Interface())
	require.NoError(t, err)

	return o2val.Elem().Interface()
}

parquet-tools cannot read the generated file

Hi,

I have a schema

message test {
	optional group a {
		optional group foo (MAP) {
			repeated group key_value {
				required binary key (STRING);
				optional binary value (STRING);
			}
		}
	}
}

I also defined a struct that implements the MarshalParquet and UnmarshalParquet methods.

type record struct {
	a a
}
type a struct {
	foo map[string]any
}

func (r *record) MarshalParquet(obj interfaces.MarshalObject) error {

	g := obj.AddField("a").Group()

	m := g.AddField("foo").Map()
	for k, v := range r.a.foo {
		elem := m.Add()
		elem.Key().SetByteArray([]byte(k))
		elem.Value().SetByteArray([]byte(v.(string)))
	}
	return nil
}

func (r *record) UnmarshalParquet(obj interfaces.UnmarshalObject) error {

	g, err := obj.GetField("a").Group()
	if err != nil {
		return err
	}

The problem: when I write the data to a file, there is no error and everything seems OK. But when I use parquet-tools to cat the parquet file, it gives this error:

java.lang.IllegalArgumentException: [a, foo, key_value, key] required binary key (STRING) is not in the store: [[a, foo, key_value, value] optional binary value (STRING)] 1
	at org.apache.parquet.hadoop.ColumnChunkPageReadStore.getPageReader(ColumnChunkPageReadStore.java:272)
	at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:246)
	at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:195)
	at org.apache.parquet.tools.command.DumpCommand.execute(DumpCommand.java:148)
	at org.apache.parquet.tools.Main.main(Main.java:223)
java.lang.IllegalArgumentException: [a, foo, key_value, key] required binary key (STRING) is not in the store: [[a, foo, key_value, value] optional binary value (STRING)] 1

How to fix this issue?

Support reading parquet maps from older writers

Describe the bug
AWS Kinesis produces maps in an old way, most notably using map instead of key_value.

example:

optional group new (MAP) {
	repeated group map (MAP_KEY_VALUE) {
		required binary key (UTF8);
		optional group value {
			optional binary b (UTF8);
			optional binary n (UTF8);
		}
	}
}

When trying to populate a struct from a parquet file with this, it'll throw an error.

Unit test to reproduce

https://github.com/parsnips/parquet-go-1/tree/parsnips/support-filling-map has unit test and fix.

parquet-go specific details
reproduced on the latest version on main.

Misc Details
This is parquet generated by AWS Kinesis Firehose ParquetSerDe V2.

I have a test file here: https://gist.github.com/parsnips/928c14d850331dd21c8d917227f77c5c#file-kinesisserde-v2-parquet

Write sparse data

I was reading through the writer code a bit and was wondering whether there is a way to write sparse data in a more optimized way. I am working with a huge schema, but the data rows themselves are extremely sparse (each row only has data for about 5% of the columns in the schema).
I was using the Arrow C++ library (which is implemented in a more or less similar fashion), profiled it with perf, and found that 40% of my CPU time went into writing nils.

Here is what I am looking for, in pseudocode; I would like to know whether this library supports anything of this kind.

From my understanding, here is how the current parquet writer implementation in the fraugster library handles writing nils (I might be wrong):

var data []map[string]interface{}
for _, row := range data {
	for _, prop := range schema.properties {
		if row[prop] == nil {
			writer.columnStore.WriteNil()
		}
	}
}

That works, but it consumes a lot of CPU iterating over nil columns. What I am looking for is something like this:

m := map[string]int{} // built from the schema; stores the row index at which the last non-nil value was seen for each column
for currIdx, row := range data {
	for prop := range row { // instead of iterating over all schema columns, iterate only over what is present in this row
		oldIdx := m[prop]
		writer.columnStore.WriteNilRunLengthEncoded(currIdx - oldIdx) // currIdx - oldIdx is how many nils were seen in between
		m[prop] = currIdx
	}
}

The column path is always overridden by the last modification

Describe the bug
I have a schema

message test {
	optional group a {
		optional group foo (MAP) {
			repeated group key_value {
				required binary key (STRING);
				optional binary value (STRING);
			}
		}
	}
}

The problem: when I write the data to a file, there is no error and everything seems OK. But when I use parquet-tools to cat the parquet file, it gives this error:

java.lang.IllegalArgumentException: [a, foo, key_value, key] required binary key (STRING) is not in the store: [[a, foo, key_value, value] optional binary value (STRING)] 1
	at org.apache.parquet.hadoop.ColumnChunkPageReadStore.getPageReader(ColumnChunkPageReadStore.java:272)
	at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:246)
	at org.apache.parquet.tools.command.DumpCommand.dump(DumpCommand.java:195)
	at org.apache.parquet.tools.command.DumpCommand.execute(DumpCommand.java:148)
	at org.apache.parquet.tools.Main.main(Main.java:223)
java.lang.IllegalArgumentException: [a, foo, key_value, key] required binary key (STRING) is not in the store: [[a, foo, key_value, value] optional binary value (STRING)] 1

Unit test to reproduce
Described as above.

I guess the root cause is in schema.go:

func recursiveFix(col *Column, colPath ColumnPath, maxR, maxD uint16, alloc *allocTracker) {
	.......
	col.maxR = maxR
	col.maxD = maxD
	// at line 684: this append writes into colPath's underlying array, which is shared with the caller
	col.path = append(colPath, col.name)
	if col.data != nil {
		col.data.reset(col.rep, col.maxR, col.maxD)
		return
	}

	for i := range col.children {
		// so no matter how many children there are, colPath always ends up holding the last child's path, due to the bug at line 684
		recursiveFix(col.children[i], col.path, maxR, maxD, alloc)
	}
}

So the quick fix should be:

         // copy the parent path first
	col.path = append([]string(nil), colPath...)
	col.path = append(col.path, col.name)
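The underlying Go behavior is that append on a slice with spare capacity writes into the shared backing array, so two appends from the same parent slice clobber each other. A standalone illustration:

package main

import "fmt"

func main() {
	base := make([]string, 1, 4) // spare capacity, like a shared parent path
	base[0] = "a"

	p1 := append(base, "foo") // writes into base's backing array
	p2 := append(base, "bar") // reuses and overwrites the same slot

	fmt.Println(p1, p2) // prints [a bar] [a bar]
}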

parquet-go specific details

  • What version are you using?
    0.12.0
  • Can this be reproduced in earlier versions?
    not sure.

Misc Details

  • Are you using AWS Athena, Google BigQuery, presto... ? No, just normal parquet file.
  • Any other relevant details... how big are the files / rowgroups you're trying to read/write? A very small file.
  • Does this behavior exist in other implementations? (link to spec/implementation please)
  • Do you have memory stats to share?
  • Can you provide a stacktrace?
  • Can you upload a test file?

parquet-tool - split command usability not ideal

Describe the bug
Larger files are only split if both the --row-group-size and --file-size parameters are set; setting the latter alone does not result in files being split. I think that if the file size is smaller than the row group size, the file-size parameter should take precedence.

parquet-go specific details

  • 0.9.0

Misc Details

  • File is a previously compacted file of ~28MB with a row group size of 128MB

GetSchemaDefinition panics if there is an array

panic: line 1: expected {, got unknown start of token '46' instead [recovered]
	panic: line 1: expected {, got unknown start of token '46' instead

goroutine 19 [running]:
testing.tRunner.func1.2(0x124fce0, 0xc000099380)
	/usr/local/go/src/testing/testing.go:1143 +0x332
testing.tRunner.func1(0xc000082900)
	/usr/local/go/src/testing/testing.go:1146 +0x4b6
panic(0x124fce0, 0xc000099380)
	/usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/fraugster/parquet-go.(*schema).GetSchemaDefinition(0xc0000f4800, 0x10)
	/Users/will/go/pkg/mod/github.com/fraugster/[email protected]/schema.go:936 +0x85

fileReader.nextRow() fails to read all data in columns of type list with null values in list.

Replicate the bug by running write-low-level and read-low-level on my fork. I believe the bug occurs in column.getNextData() in schema.go, but I'll wait to dig into that function until someone confirms this is indeed a bug.

Consider the schema:

message test {
	required int64 id;
	required binary city (STRING);
	required group population (LIST) {
		repeated group list {
			optional int64 element;
		}
	}
}

and input data (notice the nil value in the second value of the list):

l := map[string]interface{}{"list": []map[string]interface{}{
	{"element": int64(3)},
	{"element": nil},
	{"element": int64(2)},
}}

inputData := []struct {
	ID   int
	City string
	Pops map[string]interface{}
}{
	{ID: 1, City: "Berlin", Pops: l},
}
The writer properly writes the parquet file, which I verified using a different parquet reader:

➜  fraugster-parquet-go git:(master) > parquet-tools cat -json output.parquet
{"id":1,"city":"Berlin","population":{"list":[{"element":3},{},{"element":2}]}}

However, when nextRow() returns in read-low-level.go, the array column is [3] instead of [3, nil, 2]. This can be seen both in my GoLand debugger and in the extra print statements in my read-low-level.go implementation.

DECIMAL's `maxDigits` computation is incorrect

The Apache Parquet docs on Logical Types say that, for DECIMALs using fixed_len_byte_array of length n,

Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits

So a DECIMAL(38,0) in a fixed_len_byte_array(16) should be just fine: floor(log_10(2^(8*16 - 1) - 1)) = floor(log_10(2^127 - 1)) = 38.

But this code here computes 37:

maxDigits := int32(math.Floor(math.Log10(math.Exp2(8*float64(n)-1)) - 1))

Notice that the final minus one is in the wrong place. It should happen inside the call to math.Log10, not after it:

     floor(    log_10(2 ^      (8 *         n  - 1)  - 1))
math.Floor(math.Log10(math.Exp2(8 * float64(n) - 1)) - 1)
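With the subtraction moved inside the logarithm, the corrected line would read (n being the declared byte length, as above):

maxDigits := int32(math.Floor(math.Log10(math.Exp2(8*float64(n)-1) - 1)))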

parquet-tool - file statistics command

As a user I would like a command to aid in debugging parquet files. For instance, I would like to obtain the following file stats in a single command (a rough sketch of what the reader API already exposes follows the list):

  • compression algorithm
  • page type v1/v2?
  • row group size
  • author / created by
  • version
  • metadata
  • page size
  • total records /row count
  • any internal info that could help too
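Some of this is already reachable programmatically through the reader. A rough sketch of what such a command could print for row count and schema, using method names that appear elsewhere in these reports (treat anything beyond that as an assumption; needs "os", "fmt" and the goparquet import):

// printStats is a hypothetical helper, not part of the parquet-tool CLI.
func printStats(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	fr, err := goparquet.NewFileReader(f)
	if err != nil {
		return err
	}
	fmt.Println("rows:", fr.NumRows())
	fmt.Println("schema:")
	fmt.Println(fr.GetSchemaDefinition())
	return nil
}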

low level write fails to write go floats equal to math.NaN()

Running the following low-level write example fails with "Closing parquet file writer failed: couldn't find value NaN in dictionary values". This bug is present on release 0.9.0 and on master, but not on 0.8.0.

The bug appears in the encodeValues function in type_dict.go:

func (d *dictEncoder) encodeValues(values []interface{}) error {
	for _, v := range values {
		if idx, ok := d.indexMap[mapKey(v)]; ok {
			d.indices = append(d.indices, idx)
		} else {
			return fmt.Errorf("couldn't find value %v in dictionary values", v)
		}
	}
	return nil
}
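The likely culprit is that NaN never compares equal to itself, so once a NaN value ends up in (or as part of) the dictionary's index map key, a later lookup for NaN can never find it. The Go behavior in isolation:

package main

import (
	"fmt"
	"math"
)

func main() {
	m := map[float64]int{}
	m[math.NaN()] = 1

	_, ok := m[math.NaN()] // never found: NaN != NaN
	fmt.Println(ok, len(m)) // false 1
}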

Schema for JSON Array and Array of Strings?

What would be a possible schema definition for the Data struct? I have tried multiple things, but they do not seem to work.

type Data struct {
	Names     []string
	Addresses []Address
}

type Address struct {
	HouseNumber int
	Area        string
	IsAvailable bool
}

This is what I wrote for Names:

required group Names (LIST) {
	repeated group list {
	    optional binary element (STRING);
	}
}

But it throws the following error: data is not a map or array of map, its a []string

And I can't think of a possible schema for Addresses.
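For Addresses, a LIST whose element is a group would be the usual shape. An untested sketch; the field names, types, and optionality may need adjusting to whatever the marshaller expects:

required group Addresses (LIST) {
	repeated group list {
		optional group element {
			optional int64 HouseNumber;
			optional binary Area (STRING);
			optional boolean IsAvailable;
		}
	}
}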
