aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation

Home Page: https://aloneguid.github.io/parquet-dotnet/

License: MIT License

PowerShell 0.01% C# 85.84% Thrift 1.47% Java 0.20% Python 0.03% Jupyter Notebook 12.34% JavaScript 0.10%
apache-parquet dotnet dotnet-core dotnet-standard apache-spark linux windows ios xamarin xbox

parquet-dotnet's Introduction

Apache Parquet for .NET


Fully managed, safe, extremely fast .NET library to 📖read and ✍️write Apache Parquet files, designed for the .NET world (not a wrapper). Targets .NET 8, .NET 7, .NET 6.0, .NET Core 3.1, .NET Standard 2.1 and .NET Standard 2.0.

Whether you want to build apps for Linux, macOS, Windows, iOS, Android, Tizen, Xbox, PS4, Raspberry Pi, Samsung TVs and much more, Parquet.Net has you covered.

Features at a glance

  • 0️⃣ Zero dependencies - a pure library that just works anywhere .NET works, i.e. desktops, servers, phones, watches and so on.
  • 🚀 Really fast. Faster than Python, Java and the alternative C# implementations out there, and often even faster than native C++ implementations.
  • 🏠 .NET native. Designed to utilise .NET and made for .NET developers, not the other way around.
  • ❤️‍🩹 Not a "wrapper" that forces you to fit in. It's the other way around - it forces parquet to fit into .NET.
  • 🦄 Unique features:


UI

This repository now includes a parquet desktop viewer application called Floor (parquet floor, get it?). It's a cross-platform, self-contained executable made with Avalonia, compiled for Linux, Windows and macOS. You can download it from the releases section.

Floor is not meant to be the best parquet viewer on the planet, but just a reference implementation. There are probably better, more feature-rich applications out there.

Used by

...raise a PR to appear here...

Contributing

See the contribution page. The first important thing you can do is simply star ⭐ this project.

parquet-dotnet's People

Contributors

aloneguid, andycross, azurecoder, cajuncoding, clarkfinger, corrego, curthagenlocher, dmishne, drewmcarthur, dvasseur, ee-naveen, erchirag, felipepessoto, grbinho, krabby, mandyshieh, mcbos, msitt, mukunku, nikolapeja6, oyhovd, qrczak0, ramon-garcia, rickykaare, sierzput, skyyearxp, spanglerco, spark-spartan, srigumm, tpeplow


parquet-dotnet's Issues

Does it support multithreaded reads? How can I read faster?

I use parquet-dotnet to read a big parquet file, but it seems a little slow. Reading all row groups takes 42 seconds with parquet-dotnet but only 14 seconds with pyarrow.

private void ReadParquet(string file)
{
    // Iterate over and read all row groups: C# takes 42 s, Python (pyarrow) takes 14 s

    using (var reader = ParquetReader.OpenFromFile(file))
    {
        //var schema = reader.Schema;
        var dataFields = reader.Schema.GetDataFields();
        var cnt = reader.RowGroupCount;

        for (var i = 0; i < cnt; i++)
        {
            using (var groupReader = reader.OpenRowGroupReader(i))
            {
                var columns = dataFields.Select(groupReader.ReadColumn);

                foreach (var column in columns)
                //Parallel.ForEach(columns, column =>
                {
                    var data = column.Data;
                }
                //);

                //foreach (DataField field in reader.Schema.GetDataFields())
                //{
                //    var dataColumn = groupReader.ReadColumn(field);
                //}
            }
        }
    }
}

https://arrow.apache.org/docs/python/parquet.html

Multithreaded Reads
Each of the reading functions by default use multi-threading for reading columns in parallel. Depending on the speed of IO and how expensive it is to decode the columns in a particular file (particularly with GZIP compression), this can yield significantly higher data throughput.
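Parquet.Net does not parallelise column reads for you, but the caller can. A minimal sketch, assuming each task opens its own ParquetReader so that no stream is shared between threads (thread-safety of a shared reader/row group reader is not guaranteed, so this stays on the safe side):

// Hypothetical parallel read: one private ParquetReader per column so threads never share a stream.
DataField[] fields;
using (var schemaReader = ParquetReader.OpenFromFile(file))
{
    fields = schemaReader.Schema.GetDataFields();
}

Parallel.ForEach(fields, field =>
{
    using (var reader = ParquetReader.OpenFromFile(file))
    {
        for (var i = 0; i < reader.RowGroupCount; i++)
        {
            using (var groupReader = reader.OpenRowGroupReader(i))
            {
                var column = groupReader.ReadColumn(field);   // decode one column of one row group
                var data = column.Data;                       // consume the values here
            }
        }
    }
});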

'Index was outside the bounds of the array.' exception from RowGroupWriter.

Version: Parquet.Net v3.7.7 (NuGet Package)

Runtime Version: .Net Core v3.1

OS: Windows

Expected behavior

RowGroupWriter.Write(Table t) should succeed.

The Table schema column count matched the RowGroupWriter it was being written to, and there were no obvious issues with the data being written.

Actual behavior

RowGroupWriter.Write(Table t) throws 'Index was outside the bounds of the array.' exception. Stacktrace snippet:

at Parquet.ParquetRowGroupWriter.WriteColumn(DataColumn column)
at Parquet.ParquetExtensions.Write(ParquetRowGroupWriter writer, Table table)
at CSVtoParquet.ParquetConverter.FinalizeWriters() in

The Table being written has a schema that matches the row group writer, but based on a review of the code and digging through the object state at the time of the exception, it looks like the field _colIdx = 22 when there are only 22 columns in the schema (valid indices 0-21), which means that the following line throws the exception:

Thrift.SchemaElement tse = _thschema[_colIdx];

         Thrift.SchemaElement tse = _thschema[_colIdx];   **<-- Exception thrown here???**
         if(!column.Field.Equals(tse))
         {
            throw new ArgumentException($"cannot write this column, expected '{tse.Name}', passed: '{column.Field.Name}'", nameof(column));
         }
         IDataTypeHandler dataTypeHandler = DataTypeFactory.Match(tse, _formatOptions);
         _colIdx += 1;

Code snippet reproducing the behavior

I am writing fairly small files (< 128 MB; this error is thrown at ~38 MB output stream size, with 5000 rows max per row group).

I am using the ParquetWriter class inside a Parallel.ForEach anonymous method, but each invocation gets its own instance of the class and there are no static instances, so it's not clear why this would result from a threading problem.

        private void FinalizeWriters()
        {
            if (!(currentTable is null) && currentTable.Count > 0)
            {
                _rgWriter.Write(currentTable); **<-- Exception thrown here???**
                currentTable = null;
            }
            _rgWriter?.Dispose();
            _writer?.Dispose();
        }

Wish I could attach a file, but the file isn't written since the exception is in the writer.
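For reference, a row group writer expects exactly one WriteColumn call per data field, in schema order; writing more columns than the schema has advances the internal column index past the end. A minimal sketch of that pattern, where _writer is the ParquetWriter field from the snippet above, schema is assumed to be the Schema the writer was created with, and GetValuesFor is a hypothetical helper returning a typed array for a field:

// Each row group must receive exactly one column per data field, in schema order.
using (ParquetRowGroupWriter rg = _writer.CreateRowGroup())
{
    foreach (DataField field in schema.GetDataFields())
    {
        // GetValuesFor is a hypothetical helper producing a strongly typed array for this field
        rg.WriteColumn(new DataColumn(field, GetValuesFor(field)));
    }
}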

datetime type doesn't work

I'm trying to set a datetime column (using DateTimeOffset / INT96 type) and read the file with Apache Drill. The query runs, but the datetime column displays some weird value like the one below:
p}'\xC5e\x0B\x00\x00\x1D\x85%\x00
There is a setting to treat INT96 as timestamp (in Drill), but if it is set to true the query throws an error (no useful message though, it just says reader error).
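A possible workaround, assuming the consumer handles INT64 timestamps better than legacy INT96: declare the field as a DateTimeDataField with DateTimeFormat.DateAndTime so the value is written as an INT64 timestamp rather than INT96. A sketch only, not verified against Drill; stream is any writable Stream:

// Store the column as an INT64 timestamp instead of the legacy INT96 representation.
var timeField = new DateTimeDataField("event_time", DateTimeFormat.DateAndTime);

using (var writer = new ParquetWriter(new Schema(timeField), stream))
using (ParquetRowGroupWriter rg = writer.CreateRowGroup())
{
    rg.WriteColumn(new DataColumn(timeField, new[] { DateTimeOffset.UtcNow }));
}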

High memory usage

I'm trying to read 1 GB of CSV files to convert them to parquet.

The problem I'm running into is that the memory usage of the program, a simple POC right now, hits 7 GB after populating 3,630,439 rows with 14 columns.

Right before reading the files into the Table object: (screenshot)

Right after reading the files, when the Table object is ready for saving: (screenshot)

Right after writing to disk (gzip, compression level 1); during the write there are a few spikes over 7.5 GB: (screenshot)

Is this normal? Is there anything I can do to reduce the memory footprint? The raw text lines from the CSV files are being streamed in to reduce the footprint.
Taking a snapshot with the VS diagnostic tools doesn't even show where the memory is allocated, so I'm really at a loss here. (screenshots)

Clearing the Table object and calling GC.Collect immediately drops memory usage to initial levels
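One way to keep the footprint bounded, sketched under the assumption that the CSV can be read in chunks with each chunk buffered as typed arrays per column: skip the row-based Table entirely and write one row group per chunk with the low-level column API, so only one chunk is ever materialised. ReadCsvInChunks, chunk.Ids and chunk.Names are hypothetical.

const int chunkSize = 100_000;                        // rows per row group; tune as needed
var idField = new DataField<int>("id");               // illustrative two-column schema
var nameField = new DataField<string>("name");
var schema = new Schema(idField, nameField);

using (Stream fs = File.Create("out.parquet"))
using (var writer = new ParquetWriter(schema, fs))
{
    writer.CompressionMethod = CompressionMethod.Gzip;

    foreach (var chunk in ReadCsvInChunks("input.csv", chunkSize))   // hypothetical CSV chunk reader
    {
        using (ParquetRowGroupWriter rg = writer.CreateRowGroup())
        {
            rg.WriteColumn(new DataColumn(idField, chunk.Ids));      // int[]
            rg.WriteColumn(new DataColumn(nameField, chunk.Names));  // string[]
        }
        // the chunk's arrays go out of scope here and can be collected
    }
}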

SnappyWriter EncodeBlock Internal CLR error

Version: Parquet.Net v3.7.4

Runtime Version: .Net Core

OS: Windows

Actual behavior

Sometimes crashes:

Fatal error. Internal CLR error. (0x80131506)
at IronSnappy.SnappyWriter.EncodeBlock(System.Span`1<Byte>, System.ReadOnlySpan`1<Byte>)
at IronSnappy.SnappyWriter.Encode(System.Span`1<Byte>, System.ReadOnlySpan`1<Byte>)
at IronSnappy.Snappy.Encode(System.ReadOnlySpan`1<Byte>)
at Parquet.File.Streams.SnappyInMemoryStream.MarkWriteFinished()
at Parquet.File.DataColumnWriter.WriteColumn(Parquet.Data.DataColumn, Parquet.Thrift.SchemaElement, Parquet.Data.IDataTypeHandler, Int32, Int32)
at Parquet.File.DataColumnWriter.Write(System.Collections.Generic.List`1<System.String>, Parquet.Data.DataColumn, Parquet.Data.IDataTypeHandler)
at Parquet.ParquetRowGroupWriter.WriteColumn(Parquet.Data.DataColumn)

Steps to reproduce the behavior

Unfortunately it is not consistently reproducible. I've run the same code many times and most of the time it works fine, but once in a while it crashes.

Using the same code, I never saw it crash when using v3.6.0, before the new snappy implementation.

Code snippet reproducing the behavior

The code is essentially
using var file = new FileStream("filename.parquet", FileMode.Create);
using var parquetWriter = new ParquetWriter(_schema, file);
parquetWriter.CompressionMethod = CompressionMethod.Snappy;
var intBuffer = new int[5000];
var currentIndex = 3293;
using (var groupWriter = parquetWriter.CreateRowGroup())
{
    groupWriter.WriteColumn(new DataColumn(new DataField<int>("IntColumnName"), new ArraySegment<int>(intBuffer, 0, currentIndex).ToArray()));
}

Parquet.Net Version 3.6.0 assembly not signed with a strong name

Version: Parquet.Net v...

Parquet.Net -Version 3.6.0 (https://www.nuget.org/packages/Parquet.Net/3.6.0)

Runtime Version: all of them:

  • netstandard1.4
  • netstandard1.6
  • netstandard2.0
  • netstandard2.1

OS: Windows 10

Expected behavior

The DLLs should be signed with a strong name. Currently I am not able to add them to the GAC.

Actual behavior

When I try to install the libraries into the GAC using gacutil, I get an error:
"Failure adding assembly to the cache: Attempt to install an assembly without a strong name"

I have used a previous version (3.3.0) before, and it did not have this issue.

Steps to reproduce the behavior

  1. Open "Command Prompt for VS 2019" as Administrator;
  2. Execute a command:
  • gacutil /i "C:\Temp\Parquet.Net.3.6.0\lib\netstandard1.4\Parquet.dll"
  • gacutil /i "C:\Temp\Parquet.Net.3.6.0\lib\netstandard2.0\Parquet.dll"

Output:
Microsoft (R) .NET Global Assembly Cache Utility. Version 4.0.30319.0
Copyright (c) Microsoft Corporation. All rights reserved.

Failure adding assembly to the cache: Attempt to install an assembly without a strong name

Writing NULL values to parquet

Version: Parquet.Net build based on last week's repo

Runtime Version: .Net Framework 4.5

OS: Windows

I'm intending to dump a database table to a Parquet file for compact transport. This works fine as long as the table does not contain NULL values. In the past we used the row-based API, which handled NULL values fine. However, as tables grow we need to move to the default column-based API, as it is more efficient in terms of memory and speed. In this API, though, I cannot find out how to work with NULL values. For strings this works fine: NULL values are translated to empty strings, which loses a little information, but I think this is as expected in Parquet. For Int64, however, we cannot make it work. I tried several options:

  1. Passing in Nullable`1[System.Int64] arrays with null-values with a NULL for the repetition-levels
  2. Passing in Nullable`1[System.Int64] arrays with null-values with an all zero's array for the repetition-levels
  3. Passing in a System.Int64 array with non-null values only and adding a definition-levels array with 1 for values and 0 for null values. (I had to change a DataColumn constructor from 'internal' to 'public' to make this work.)

When looking at the empty-arrays section of the chapter on 'Complex types' in the readme, I see that I did everything to get to the solution suggested there (but with an all-nulls array for the repetition levels, as we don't have more than one value in a single row).
That documentation says that if you want to encode:
[1 2]
[]
[3 4]

this is encoded as
values: [1 2 null 3 4]
repetition levels: [0 1 0 0 1]

However, I want to encode
[1]
[2]
[]
[3]
[4]

So that should be
values [1 2 null 3 4]
repetition levels [0 0 0 0 0]

However, the file that comes out is not readable (not with the current parquet-dotnet, and also not with the v2.0.0.0 version we have been using for years).

Below you find the stack-trace and the zipped parquet-file as an attachment.

the stack-trace when reading tst4a.pq

EndOfStreamException Unable to read beyond the end of the stream.
System.IO.BinaryReader.FillBuffer (:0)
System.IO.BinaryReader.ReadDouble (:0)
Parquet.Data.Concrete.DoubleDataTypeHandler.ReadSingle (:0)
Parquet.Data.BasicDataTypeHandler`1.Read (:0)
Parquet.Data.BasicDataTypeHandler`1.Read (:0)
Parquet.File.DataColumnReader.ReadColumn (:0)
Parquet.File.DataColumnReader.ReadDataPage (:0)
Parquet.File.DataColumnReader.Read (:0)
Parquet.ParquetRowGroupReader.ReadColumn (:0)
Parquet.ParquetExtensions+<>c__DisplayClass3_0.b__0 (:0)
System.Linq.Enumerable+WhereSelectArrayIterator`2.MoveNext (:0)
System.Linq.Buffer`1..ctor (:0)

tst4a.zip

build issue for .net45
PS: We had to build the library from source as the NuGet download got us into trouble due to missing System.Buffers v4.0.3.0 and System.Memory v4.0.1.0. Possibly these references are located in Thrift v0.9.2.0, which also could not be found. However, a build from source with the net45 framework produced a DLL folder with these dependencies.
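For what it's worth, with the column-based API a nullable field is normally declared with a nullable CLR type and the data passed as a nullable array; the library then derives the definition levels itself. A minimal sketch of that approach (not a statement about why the file above is unreadable):

// Declare the field as nullable and pass a nullable array; nulls are handled by the library.
var valueField = new DataField<long?>("value");
var schema = new Schema(valueField);

using (Stream fs = File.Create("nullable-longs.parquet"))
using (var writer = new ParquetWriter(schema, fs))
using (ParquetRowGroupWriter rg = writer.CreateRowGroup())
{
    rg.WriteColumn(new DataColumn(valueField, new long?[] { 1, 2, null, 3, 4 }));
}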

ParquetRowGroupWriter .WriteColumn throws Unable to cast object of type 'System.Object[]' to type 'System.Int64[]'

Version: Parquet.Net v3.6.0

Runtime Version: .Net Framework V4.7.2

OS: Windows

Expected behavior

ParquetRowGroupWriter.WriteColumn writes to file

Actual behavior

ParquetRowGroupWriter.WriteColumn

throws Unable to cast object of type 'System.Object[]' to type 'System.Int64[]'

Steps to reproduce the behavior


Code snippet reproducing the behavior

I'm trying to convert a System.Data.DataTable to a parquet file.

Where dt is my DataTable:

            using (Stream fileStream = System.IO.File.OpenWrite("C:\\test.parquet"))
            {
                using (var parquetWriter = new ParquetWriter(queryResultSchema, fileStream))
                {
                    // create a new row group in the file
                    using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup())
                    {
                        for(int i=0; i< dt.Columns.Count; i++)
                        {
                            DataColumn c1 = new DataColumn(fieldList[i], dt.Rows.OfType<System.Data.DataRow>().Select(r => Convert.ChangeType(r[i], dt.Columns[i].DataType)).ToArray());

                            groupWriter.WriteColumn(c1);
                        }
                    }
                }
            }
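The cast usually fails because Convert.ChangeType returns object, so ToArray() produces an object[], while WriteColumn needs a strongly typed array (long[], string[], ...). A hedged sketch for a single Int64 column, reusing dt, i, fieldList and groupWriter from the snippet above (DBNull handling omitted for brevity):

// Build a strongly typed array for an Int64 column instead of an object[].
long[] longValues = dt.Rows
    .OfType<System.Data.DataRow>()
    .Select(r => r.Field<long>(i))     // Field<T> preserves the CLR element type
    .ToArray();

groupWriter.WriteColumn(new DataColumn(fieldList[i], longValues));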

Make ClrType on DataField public

Version: Parquet.Net v3.7.7

Runtime Version: All

OS: All

Expected behavior

I'd like to be able to get the ClrType and ClrNullableIfHasNullsType from a DataField. I need to create some items in my app based on the type and cannot easily get to the CLR Type information.

Actual behavior

The field is internal and inaccessible

Steps to reproduce the behavior

https://github.com/aloneguid/parquet-dotnet/blob/develop/src/Parquet/Data/Schema/DataField.cs#L27-L30
The field here is internal, I'd like it to be public as there is a comment above referencing the possibility.

Ability to ignore fields/properties(of business objects) during serialization to parquet file

Version: Parquet.Net

Runtime Version: .Net Core

OS: Windows/Linux/MacOSX

Expected behavior

Though the current serialize API is very helpful for working with business objects, it would be very helpful if it could provide a way to ignore a few properties during serialization. Something like the below:

public class SimpleStructure
    {
        public int Id { get; set; }
        public string FirstName { get; set; }

        [ParquetIgnore]
        public string SSIN { get; set; }
    }

Actual behavior

No such feature exists in the current serialize API.

Steps to reproduce the behavior


Code snippet reproducing the behavior

Reading dictionary-encoded string columns with null values from multi-page parquet files yields misaligned data

Version: Parquet.Net from v3.9.9 at least

Runtime Version: .Net Framework v4.7.2

OS: Windows

Expected behavior

Data should have the correct values across pages

Actual behavior

When reading a dictionary-encoded column from a multi-page file with null values, there is a chance extra data will be read when decoding dictionary indexes. This is because the decoding function will read up to Num_values items for that page. However, in the presence of nulls, the total number of valid elements will be smaller than Num_values, but because the decoding function doesn't know this, it will continue generating elements until it runs out of data, putting these extra elements in the lookup table and causing data misalignment issues for the pages that follow.

This bug is especially insidious because the first page of data is correctly loaded.

I'm attaching a tentative fix PR that uses statistics to calculate the number of valid items that should be read.

Bug in schema decoding

As per elastacloud/parquet-dotnet#443.

Self-contained solution that reproduces this error.

parquet-mr output:

file:                       file:/C:/tmp/ReadParquet/res/doesntWork.snappy.parquet
creator:                    parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra:                      org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"prediction","type":"double","nullable":false,"metadata":{}},{"name":"impurity","type":"double","nullable":false,"metadata":{}},{"name":"impurityStats","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}},{"name":"gain","type":"double","nullable":false,"metadata":{}},{"name":"leftChild","type":"integer","nullable":false,"metadata":{}},{"name":"rightChild","type":"integer","nullable":false,"metadata":{}},{"name":"split","type":{"type":"struct","fields":[{"name":"featureIndex","type":"integer","nullable":false,"metadata":{}},{"name":"leftCategoriesOrThreshold","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}},{"name":"numCategories","type":"integer","nullable":false,"metadata":{}}]},"nullable":true,"metadata":{}}]}

file schema:                spark_schema
--------------------------------------------------------------------------------
id:                         REQUIRED INT32 R:0 D:0
prediction:                 REQUIRED DOUBLE R:0 D:0
impurity:                   REQUIRED DOUBLE R:0 D:0
impurityStats:              OPTIONAL F:1
.array:                     REPEATED DOUBLE R:1 D:2
gain:                       REQUIRED DOUBLE R:0 D:0
leftChild:                  REQUIRED INT32 R:0 D:0
rightChild:                 REQUIRED INT32 R:0 D:0
split:                      OPTIONAL F:3
.featureIndex:              REQUIRED INT32 R:0 D:1
.leftCategoriesOrThreshold: OPTIONAL F:1
..array:                    REPEATED DOUBLE R:1 D:3
.numCategories:             REQUIRED INT32 R:0 D:1

row group 1:                RC:7 TS:956 OFFSET:4
--------------------------------------------------------------------------------
id:                          INT32 SNAPPY DO:0 FPO:4 SZ:63/61/0.97 VC:7 ENC:BIT_PACKED,PLAIN
prediction:                  DOUBLE SNAPPY DO:0 FPO:67 SZ:77/73/0.95 VC:7 ENC:BIT_PACKED,PLAIN_DICTIONARY
impurity:                    DOUBLE SNAPPY DO:0 FPO:144 SZ:97/97/1.00 VC:7 ENC:BIT_PACKED,PLAIN
impurityStats:
.array:                      DOUBLE SNAPPY DO:0 FPO:241 SZ:211/271/1.28 VC:168 ENC:RLE,PLAIN_DICTIONARY
gain:                        DOUBLE SNAPPY DO:0 FPO:452 SZ:109/107/0.98 VC:7 ENC:BIT_PACKED,PLAIN_DICTIONARY
leftChild:                   INT32 SNAPPY DO:0 FPO:561 SZ:63/61/0.97 VC:7 ENC:BIT_PACKED,PLAIN
rightChild:                  INT32 SNAPPY DO:0 FPO:624 SZ:63/61/0.97 VC:7 ENC:BIT_PACKED,PLAIN
split:
.featureIndex:               INT32 SNAPPY DO:0 FPO:687 SZ:72/68/0.94 VC:7 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
.leftCategoriesOrThreshold:
..array:                     DOUBLE SNAPPY DO:0 FPO:759 SZ:84/94/1.12 VC:7 ENC:PLAIN,RLE
.numCategories:              INT32 SNAPPY DO:0 FPO:843 SZ:67/63/0.94 VC:7 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY

Zstandard Support?

I hope you don't mind feature requests.

It looks like you support gzip and snappy. I would love to see zstandard support as well. It seems to be gaining support. I had some space issues recently with a project of mine and wrote a simple comparison of snappy and zstandard using parquet files in python.

It looks like there are two projects supporting zstandard in C# (first one, second one). Could one of these be used to easily add support for zstandard?

I have little formal programming training and even less experience with .net but I might be willing to pitch in at some point.

Thanks for your time!

Appending to file in ADLS Gen2

Version: Parquet.Net 3.7.7

Runtime Version: .Net Framework 4.6.1

OS: Windows 10

Actual behavior

Attempting to append to a parquet file in Azure Data Lake Storage (ADLS) Gen2 causes the following exception:

System.IO.IOException: destination stream must be seekable for append operations.

(This error is thrown during execution of the code snippet below.)

Expected behavior

Appending to a parquet file in ADLS Gen2 should work.

Clearly the stream provided by the Blob storage API is not seekable, but does this make append operations on ADLS impossible with this library? How should small updates be handled? (e.g. 10 new records come in from an external source, and we wish to update the parquet file.) Am I missing something, and is it actually possible to append via some other means?

Code snippet reproducing the behavior

CloudBlobClient client = ...; // More info at [1] if necessary
CloudBlobContainer container = client.GetContainerReference(containerName);
await container.CreateIfNotExistsAsync();

string pqFileName = "asdf.parquet";
CloudBlockBlob parquetBlob = container.GetBlockBlobReference(pqFileName);

// Redefine the same schema used in asdf.parquet (can also get it by reading the file)
Parquet.Data.DataField[] dataFields = ...;
Schema schema = new Parquet.Data.Schema(dataFields);

Table table = ...; // New records with same schema

using (Stream stream = await parquetBlob.OpenWriteAsync())
{
    // Without append: true, this works; but my use case most likely needs append
    using (ParquetWriter parquetWriter = new ParquetWriter(schema, stream, append: true))
    {
        using (ParquetRowGroupWriter rgWriter = parquetWriter.CreateRowGroup())
        {
            rgWriter.Write(table);
        }
    }
}

[1] https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-dotnet-legacy#authenticate-the-client

Thanks for taking the time to read this, any help or insight would be appreciated!
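Append needs a seekable stream because the writer has to read the existing footer before extending the file, so one hedged workaround (assuming the file is small enough to buffer in memory) is to download the blob into a seekable MemoryStream, append locally, and upload the whole stream back; parquetBlob, schema and table are the variables from the snippet above:

// Download into a seekable buffer, append a row group locally, then overwrite the blob.
using (var ms = new MemoryStream())
{
    await parquetBlob.DownloadToStreamAsync(ms);          // existing parquet file into memory
    ms.Position = 0;

    using (var parquetWriter = new ParquetWriter(schema, ms, append: true))
    using (ParquetRowGroupWriter rgWriter = parquetWriter.CreateRowGroup())
    {
        rgWriter.Write(table);                            // new records as an extra row group
    }

    ms.Position = 0;
    await parquetBlob.UploadFromStreamAsync(ms);          // replace the blob contents
}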

Wrong repetition levels when writing table with nested lists

Ported from elastacloud/parquet-dotnet #458

Version: Parquet.Net v3.6.0

Runtime Version: .Net Core v3.1.101

OS: Windows

Expected behavior

When writing a table with nested list fields, the repetition levels should correctly reflect the data structure.

An example of nested lists could be something like this:

{
    "items":
    [
        { "values": [ "value0,0", "value0,1", "value0,2" ] },
        { "values": [ "value1,0", "value1,1", "value1,2" ] },
        { "values": [ "value2,0", "value2,1", "value2,2" ] },
    ]
}

In this case the repetition levels for the value column should be:

0, 2, 2, 1, 2, 2, 1, 2, 2

With the 0 indicating that the first value starts a new row, and the 1's indicating that the list at level 1 (items) has a new element.

Actual behavior

In the above example the repetition levels for the value column is:

0, 2, 2, 0, 2, 2, 0, 2, 2

Which results in a corrupt parquet file where values "value1,0" to "value2,2" would not be read (or would belong to different rows if more rows were added).

Steps to reproduce the behavior

  1. Create an xUnit Test Project
  2. Add reference to latest Parquet.Net nuget package, or the /src/Parquet/Parquet.csproj
  3. Add the test method below

Code snippet reproducing the behavior

[Fact]
public void ParquetDoubleList()
{
    // Arrange
    var schema = new Schema(
        new DataField<int>("RowId"),
        new ListField(
            "Items",
            new StructField(
                "Item",
                new DataField<int>("ItemId"),
                new ListField(
                    "Values",
                    new DataField<string>("Value")))));

    var inputRow = new Row(0, new[]
    {
        new Row(0, new[] { "Value0,0,0", "Value0,0,1", "Value0,0,2" }),
        new Row(1, new[] { "Value0,1,0", "Value0,1,1", "Value0,1,2" }),
        new Row(2, new[] { "Value0,2,0", "Value0,2,1", "Value0,2,2" }),
    });

    var table = new Table(schema);
    table.Add(inputRow);

    // Write Parquet to Stream
    var stream = new MemoryStream();
    using (var parquetWriter = new ParquetWriter(schema, stream))
    {
        parquetWriter.Write(table);
    }

    // Read Parquet from Stream
    stream.Position = 0;
    Row outputRow;
    using (var reader = new ParquetReader(stream))
    {
        var output = reader.ReadAsTable();
        outputRow = output[0];
    }

    // First "Item" is correctly added
    Assert.Equal(
        ((object[])inputRow[1])[0],
        ((object[])outputRow[1])[0]);

    // But only 1 "Item" is in the outputRow (expected 3)
    Assert.Equal(
        ((Array)inputRow[1]).Length,
        ((Array)outputRow[1]).Length);
}

ParquetConvert.Serialize issue in new version;

Version: Parquet.Net v3.6.0

Runtime Version: .Net Framework v 4.6.1

OS: Windows 10

Expected behavior

The following example should run fine and generate a parquet file:
https://gist.github.com/AndyCross/eb0c3392dc421fbee54f826a03e63a74

Tried the same example with the old Parquet.Net version (3.0.5) and it works as expected.


Actual behavior

While running this example (as a console application in VS 2019) I get an error:
System.Reflection.Emit.Lightweight is not supported on this platform.

Steps to reproduce the behavior

  1. Step 1. In Visual Studio Create a new Console App (.NET Framework) project (Net Framework v 4.6.1);
  2. Step 2. Install Parquet.Net 3.6.0 NuGet Package;
  3. Step 3. Run the program;

Code snippet reproducing the behavior

https://gist.github.com/AndyCross/eb0c3392dc421fbee54f826a03e63a74

Read and deserialize parquet file as RowGroups

Version: Parquet.Net

Runtime Version: .Net Core

OS: Windows/Linux/MacOSX

Expected behavior

Though the current "Deserialise" API is very helpful for working with parquet files, it still lacks APIs that give the ability to read a parquet file as row groups. My current project team needs such an API, as we work with huge data such as 10-million-record files; deserializing such a huge file and loading the entire business-object collection into memory does not work for us due to memory constraints.

So it would be very helpful to dev teams if we could have these two APIs:

int GetRowGroupCount();
IEnumerable< T > ReadRowGroup(int i);

Actual behavior

No such API exists

Steps to reproduce the behavior

  1. Deserialize a parquet file with 10 million records.

Code snippet reproducing the behavior

ParquetConvert.Deserialize is ignoring ParquetColumnAttribute

See old issue here: elastacloud/parquet-dotnet#418

original text:

I am not able to use ParquetConvert.Deserialize when ParquetColumnAttribute is leveraged. One way to reproduce is to update the following unit test:

Add the following attribute to the Name property of SimpleStructure.cs, found in ParquetConvertTest.cs, and the unit test Parquet.Test.Serialisation.Serialise_deserialise_all_types will throw a NullReferenceException:

[ParquetColumn("someothername")]
public string Name { get; set; }

Object reference not set to an instance of an object.
Stack Trace:
at Parquet.Serialization.Values.MSILGenerator.GenerateAssigner(DataColumn dataColumn, Type classType) in /tmp/parquet-dotnet-master/src/Parquet/Serialization/Values/MSILGenerator.cs:line 156

Large float column leads to incorrect values being read

Version: Parquet.Net 3.7.7
Runtime Version: .Net Core 3.1
OS: Windows 10 2004

Expected behavior

I have written a single-column parquet file with Arrow using Snappy compression. It has 270K rows using a single row group with the float values {1, 2, 3, ..., 270000}. Reading the entire column should return those values.

Actual behavior

When reading the column with Parquet.NET, large slices of the Data contain zeroes instead of the expected values. This starts fairly early on in the array; for example at index 38.

Steps to reproduce the behavior

Use the attached parquet file (floats.parquet.zip) and read using the following code:

using var stream = File.OpenRead("floats.parquet");
using var parquetReader = new ParquetReader(stream);
var dataColumns = parquetReader.ReadEntireRowGroup();
var values = (float[]) dataColumns[0].Data;

Console.WriteLine($"values {{{string.Join(", ", values[0..50])}}}");

Observe that a lot of the values will be zeroes:

values {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}

General thoughts

Reading the file in Python using PyArrow works fine.

import pandas as pd
import pyarrow.parquet as pq

table = pq.read_table('floats.parquet').to_pandas()

pd.set_option('display.min_rows', 100)
pd.set_option('display.max_rows', 100)

print(len(table["Value"]))
print(table["Value"])

The problem does not manifest itself when using a smaller number of rows. Also, each value is unique and dictionary encoding has not been disabled. So it could be caused by a bug in the dictionary decoding.

v3.7.1 corrupt input when reading decimal fields

Version: Parquet.Net v3.7.1

Runtime Version: .Net Core 3.1

OS: Windows

Expected behavior

ParquetRowGroupReader.ReadColumn(DataField field) should not throw an exception when reading a decimal field.

Actual behavior

System.IO.IOException: 'corrupt input' is thrown when there are more than 4096 items in the row group. This doesn't occur in v3.7.0.

The code snippet below writes a Parquet file to a MemoryStream then reads it back. In TestClass I have tested setting Value's data type to bool, double, int, long, DateTimeOffset and string; they can all be read without errors. Only the decimal data type causes ParquetRowGroupReader.ReadColumn(...) to throw when there are more than 4096 items in a row group.

Steps to reproduce the behavior

  1. Run the console app in the code snippet below.

Code snippet reproducing the behavior

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

using Parquet;
using Parquet.Data;

namespace ConsoleApp1
{
    class Program
    {
        static void Main()
        {
            // v3.7.1 fails if itemCount > 4096
            const int itemCount = 4097;

            // create items
            List<TestClass> items = Enumerable.Range(0, itemCount)
                                        .Select(i => new TestClass()
                                        {
                                            Value = i
                                        })
                                        .ToList();

            using (MemoryStream ms = new MemoryStream())
            {
                // serialise items
                CompressionMethod compressionMethod = CompressionMethod.Snappy;
                const int rowGroupSize = 5000;
                Schema schema = ParquetConvert.Serialize(items, ms, null, compressionMethod, rowGroupSize);
                ms.Position = 0;

                // create reader
                ParquetOptions parquetOptions = null;
                const bool leaveStreamOpen = true;
                ParquetReader reader = new ParquetReader(ms, parquetOptions, leaveStreamOpen);
                
                // get data field
                DataField dataField = reader.Schema
                                        .GetDataFields()
                                        .Single(f => f.Name.Equals(nameof(TestClass.Value)));
                
                // read values
                for (int i = 0; i < reader.RowGroupCount; ++i)
                {
                    using (ParquetRowGroupReader rowGroupReader = reader.OpenRowGroupReader(i))
                    {
                        // v3.7.0 runs correctly
                        // v3.7.1 throws System.IO.IOException: 'corrupt input' when itemCount > 4096
                        DataColumn dc = rowGroupReader.ReadColumn(dataField);

                        foreach (object value in dc.Data)
                        {
                            Console.WriteLine(value);
                        }
                    }
                }
            }
        }
    }

    class TestClass
    {
        public decimal Value { get; set; }
    }
}

Async Methods

Kinda picking up off of https://github.com/elastacloud/parquet-dotnet/issues/20

I see now that the thrift lib ONLY has async methods, and the older clients are being removed next release.

I'm not sure how often parquet updates its thrift spec, or how often thrift adds new features or whatever, so this might not be a big deal.

Curious whether you have done any testing in the past to see what kind of performance penalty making some of the reading/writing async would have?

Was considering helping and trying to update to the latest netstd thrift generation, but I'm more interested in making this faster....not slower ;)

Field name with dot (.)

This commit 16b55d5 added a validation to disallow fields with dots.

Do we really have a reason not to accept dots? If I comment out the throw line I can read a file with dots in column names, and the only failing test is the one that validates that it throws when we use dots.

I couldn't find any spec saying dots are not allowed.

Thanks.

Cannot parse Decimal types

Version: Parquet.Net v3.6.0

Runtime Version: .Net Core 2.1

OS: Windows

Expected behavior

With a parquet file that has a decimal128 column, Parquet.Net should be able to read the file, parsing that column as a Decimal.

Actual behavior

Null reference exception occurs.

Steps to reproduce the behavior

  1. Create a parquet file with a column with the Decimal128 data type.
  2. Attempt to parse the file; I was using ParquetReader.ReadAsTable()
  3. Null reference exception

Code snippet reproducing the behavior

using Parquet.Data.Rows;
using Xunit;

namespace Parquet.Test
{
   public class DecimalTypeTest : TestBase
   {
      [Fact]
      public void Read_File_As_Table_With_Decimal_Column_Should_Read_File()
      {
         const int decimalColumnIndex = 4;
         Table table = ReadTestFileAsTable("test-types-with-decimal.parquet");

         Assert.Equal(1234.56m, table[0].Get<decimal>(decimalColumnIndex));
      }
   }
}

test-types-with-decimal.zip

InvalidCastException when writing DateTime columns

Version: Parquet.Net v3.7.4

Runtime Version: .Net Core v3.0

OS: Windows

Expected behavior

According to Supported Types, DateTime is supported.

Actual behavior

When attempting to write a DateTime column, I get the following exception

System.InvalidCastException : Unable to cast object of type 'System.DateTime' to type 'System.DateTimeOffset'.

Steps to reproduce the behavior

Run the Unit Test below. In the constructor of DataField<DateTime>, the IDataTypeHandler handler returned is actually the DateTimeOffset one.

Code snippet reproducing the behavior

using System;
using Parquet.Data;
using System.IO;
using Xunit;

namespace Parquet.Test
{
   public class DateTimeTypeTest : TestBase
   {
      [Fact]
      public void Write_read_DateTime()
      {
         var ms = new MemoryStream();
         var date = new DataField<DateTime>("date");

         DateTime now = DateTime.Now;

         //write
         using (var writer = new ParquetWriter(new Schema(date), ms))
         {
            using (ParquetRowGroupWriter rg = writer.CreateRowGroup())
            {
               rg.WriteColumn(new DataColumn(date, new[] { now }));
            }
         }

         //read back
         using (var reader = new ParquetReader(ms))
         {
            using (ParquetRowGroupReader rg = reader.OpenRowGroupReader(0))
            {
               Assert.Equal(now, ((DateTime[])rg.ReadColumn(date).Data)[0]);
            }
         }
      }
   }
}

Unable to write array of strings to parquet file

Version: Parquet.Net v3.7.7

Runtime Version: .Net Core v3.1 etc.

OS: Windows/Linux/MacOSX - Docker

Expected behavior

When I want to create a column with the array-of-strings type, that column should be created according to the repetition levels.

Actual behavior

Parquet.Net is creating a column of type "string".

Steps to reproduce the behavior

  1. Create a new DataField of type IEnumerable<string>
  2. Create a data array of type string[]
  3. Create a definitionLevels array of type int[] in the following format: [0, 1, 1 ... 1] or [0, 0, 0, ... 0]
  4. Create a new DataColumn and pass the three variables from above

Code snippet reproducing the behavior

var dataField = new DataField<IEnumerable<string>>(fieldName);
var data = obj.ListProperty.Select(x => x.Id).ToArray();
var repetitionLevels = new int[obj.ListProperty.Count()];

for (var i = 0; i < repetitionLevels.Length; i++)
{
    repetitionLevels[i] = 0;
}

dataColumn = new DataColumn(dataField, data, repetitionLevels);
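For reference, with a DataField<IEnumerable<string>> the repetition levels mark list boundaries: 0 starts a new row, 1 continues the current row's list. A small illustrative sketch for two rows ["a", "b"] and ["c"]:

var tagsField = new DataField<IEnumerable<string>>("tags");

// Flattened values for the rows ["a", "b"] and ["c"]
string[] values = { "a", "b", "c" };

// 0 = first element of a new row, 1 = further element of the same row's list
int[] repetitionLevels = { 0, 1, 0 };

var column = new DataColumn(tagsField, values, repetitionLevels);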

Appending to file with ParquetConvert.Serialize

Version: Parquet.Net v3.7.7

Runtime Version: All

OS: All

Expected behavior

...
var items = new List<T>();
ParquetConvert.Serialize(items, someFilePath, append: true)

The data from items should be appended to the existing parquet file someFilePath.

Actual behavior

There is no ability to append data to the existing file with ParquetConvert.Serialize

Variable-length byte arrays in statistics should not include length prefix

Parquet dotnet appears to be looking for a length prefix for variable length arrays (strings) in the statistics.

From the files I'm using I'm not seeing any of the string fields length prefixed.

This causes an index out of range exception whenever the stats are attempted to be read.

According to the parquet.thrift document, I don't think these should be length prefixed:

/**
 * Statistics per row group and per page
 * All fields are optional.
 */
struct Statistics {
   /**
    * DEPRECATED: min and max value of the column. Use min_value and max_value.
    *
    * Values are encoded using PLAIN encoding, except that variable-length byte
    * arrays do not include a length prefix.
    *
    * These fields encode min and max values determined by signed comparison
    * only. New files should use the correct order for a column's logical type
    * and store the values in the min_value and max_value fields.
    *
    * To support older readers, these may be set when the column order is
    * signed.
    */
   1: optional binary max;
   2: optional binary min;
   /** count of null value in the column */
   3: optional i64 null_count;
   /** count of distinct values occurring */
   4: optional i64 distinct_count;
   /**
    * Min and max values for the column, determined by its ColumnOrder.
    *
    * Values are encoded using PLAIN encoding, except that variable-length byte
    * arrays do not include a length prefix.
    */
   5: optional binary max_value;
   6: optional binary min_value;
}

New SNAPPY implementation

Original implementation was copied from SnappySharp library. It's extremely slow and sometimes expands the size of the data on compression.

Migrating to IronSnappy which I've created as a separate project. Tests show that compression rates are better and speed is approx 20 times faster!

Write speed improvement for compressed parquet

from @xingchen517:

Hi Team,
Suggestions:
1. Could you please expose the configuration for the gzip CompressionLevel?
2. In Parquet.File.DataColumnWriter.WriteColumn, the GzipStream is called for every value in each column, which makes the process slow; it would be very fast if you wrote first to a buffered stream and then wrote that to the GzipStream.
I tried the two things and it gave me 4x faster writes than before.
Thanks


compression level already implemented, need to add bufferedstream
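To illustrate the suggestion (this is not the library's actual code): buffering writes in front of the compressor turns many tiny Write calls into a few large ones. encodedValues is a hypothetical sequence of per-value payloads.

// Requires System.IO.Compression. Small writes land in the buffer; gzip only sees large chunks.
using (Stream target = File.Create("column.gz"))
using (var gzip = new GZipStream(target, CompressionLevel.Optimal))
using (var buffered = new BufferedStream(gzip, 1024 * 1024))    // 1 MB buffer in front of gzip
{
    foreach (byte[] encodedValue in encodedValues)              // hypothetical per-value payloads
    {
        buffered.Write(encodedValue, 0, encodedValue.Length);
    }
}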

Unable to write a column with a Date logical type

Version: Parquet.Net v3.6

Runtime Version: .Net Core v3.0

OS: Windows

Expected behavior

When writing values to a column whose schema is an int32 with a Date logical type, I'd expect to be able to write int32 values to the column, since the Date logical type stores the number of days since the epoch.

Actual behavior

I can only write values to the column as DateTimeOffset, and they get written to the column with the timestamp[ns] logical type instead.

Perhaps I am misunderstanding something, but I first open a file which has an int32 column with a logical Date type. I then use that file's schema to create a new file and ParquetWriter. My goal is to write new column values of different dates, but instead I can only write timestamps.

Am I perhaps mis-using the library?
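For what it's worth, the schema side of this can be expressed with DateTimeDataField and DateTimeFormat.Date, which stores the value as INT32 days since the epoch; the values are still supplied as DateTimeOffset rather than raw int32. A sketch based on the v3 API:

// Declare a date-only column (INT32, days since epoch) and write DateTimeOffset values into it.
var dateField = new DateTimeDataField("date", DateTimeFormat.Date);

using (Stream fs = File.Create("dates.parquet"))
using (var writer = new ParquetWriter(new Schema(dateField), fs))
using (ParquetRowGroupWriter rg = writer.CreateRowGroup())
{
    rg.WriteColumn(new DataColumn(dateField, new[] { new DateTimeOffset(new DateTime(2020, 7, 1)) }));
}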

Support char/nullable char in serializer

from elastacloud/parquet-dotnet#444

Version: Parquet.Net 3.5.0

Runtime Version: .Net Core 3.0

OS: Windows

Expected behavior

ParquetConvert.Serialize method should support char and char? (nullable char) as well.

Actual behavior

Though serialization does not produce any runtime exceptions for the char datatype, after deserialization to the C# type the char fields always come back null.

Steps to reproduce the behavior


Code snippet reproducing the behavior

class MyClass
{
    public char TestChar { get; set; }
    public int TestInt { get; set; }
}

[TestMethod]
public void TrasformUtil_Should_Serialize_PolicyObjects()
{
    //Arrange
    var objects = new List<MyClass> {
        new MyClass() { TestChar = 'T', TestInt = 1234 },
        new MyClass() { TestChar = 'F', TestInt = 1211 }
    };

    //Act
    MyClass[] policies = null;
    using (var ms = new MemoryStream())
    {
        Schema schema = ParquetConvert.Serialize(objects, ms, compressionMethod: CompressionMethod.Snappy, rowGroupSize: 2);
        ms.Position = 0;
        policies = ParquetConvert.Deserialize<MyClass>(ms);
    }

    policies.Length.Should().Be(2);
    policies[0].TestInt.Should().Be(1234);
    policies[0].TestChar.Should().Be('T'); // Issue for char type.
}

Analysing upgraded parquet specs

ConvertedType

new types:

  • TIME_MICROS = 8;
  • TIMESTAMP_MICROS = 10;

Statistics

  • min, max deprecated
  • added min_value, max_value (same type)

new structs

/** Empty structs to use as logical type annotations */
struct StringType {}  // allowed for BINARY, must be encoded with UTF-8
struct UUIDType {}    // allowed for FIXED[16], must encoded raw UUID bytes
struct MapType {}     // see LogicalTypes.md
struct ListType {}    // see LogicalTypes.md
struct EnumType {}    // allowed for BINARY, must be encoded with UTF-8
struct DateType {}    // allowed for INT32
struct NullType {}    // allowed for any physical type, only null values stored

struct DecimalType {
  1: required i32 scale
  2: required i32 precision
}

/** Time units for logical types */
struct MilliSeconds {}
struct MicroSeconds {}
struct NanoSeconds {}
union TimeUnit {
  1: MilliSeconds MILLIS
  2: MicroSeconds MICROS
  3: NanoSeconds NANOS
}

struct TimestampType {
  1: required bool isAdjustedToUTC
  2: required TimeUnit unit
}

struct TimeType {
  1: required bool isAdjustedToUTC
  2: required TimeUnit unit
}

struct IntType {
  1: required i8 bitWidth
  2: required bool isSigned
}

struct JsonType {
}

struct BsonType {
}

new union:

/**
 * LogicalType annotations to replace ConvertedType.
 *
 * To maintain compatibility, implementations using LogicalType for a
 * SchemaElement must also set the corresponding ConvertedType from the
 * following table.
 */
union LogicalType {
  1:  StringType STRING       // use ConvertedType UTF8
  2:  MapType MAP             // use ConvertedType MAP
  3:  ListType LIST           // use ConvertedType LIST
  4:  EnumType ENUM           // use ConvertedType ENUM
  5:  DecimalType DECIMAL     // use ConvertedType DECIMAL
  6:  DateType DATE           // use ConvertedType DATE

  // use ConvertedType TIME_MICROS for TIME(isAdjustedToUTC = *, unit = MICROS)
  // use ConvertedType TIME_MILLIS for TIME(isAdjustedToUTC = *, unit = MILLIS)
  7:  TimeType TIME

  // use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC = *, unit = MICROS)
  // use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC = *, unit = MILLIS)
  8:  TimestampType TIMESTAMP

  // 9: reserved for INTERVAL
  10: IntType INTEGER         // use ConvertedType INT_* or UINT_*
  11: NullType UNKNOWN        // no compatible ConvertedType
  12: JsonType JSON           // use ConvertedType JSON
  13: BsonType BSON           // use ConvertedType BSON
  14: UUIDType UUID
}

SchemaElement

  • new element 10: optional LogicalType logicalType

Encoding

  • new element BYTE_STREAM_SPLIT = 9;

Compression
new methods:

  BROTLI = 4; // Added in 2.4
  LZ4 = 5;    // Added in 2.4
  ZSTD = 6;   // Added in 2.4

new struct

enum BoundaryOrder {
  UNORDERED = 0;
  ASCENDING = 1;
  DESCENDING = 2;
}

completely new (unsorted)

/** Block-based algorithm type annotation. **/
struct SplitBlockAlgorithm {}
/** The algorithm used in Bloom filter. **/
union BloomFilterAlgorithm {
  /** Block-based Bloom filter. **/
  1: SplitBlockAlgorithm BLOCK;
}

/** Hash strategy type annotation. xxHash is an extremely fast non-cryptographic hash
 * algorithm. It uses 64 bits version of xxHash. 
 **/
struct XxHash {}

/** 
 * The hash function used in Bloom filter. This function takes the hash of a column value
 * using plain encoding.
 **/
union BloomFilterHash {
  /** xxHash Strategy. **/
  1: XxHash XXHASH;
}

/**
 * The compression used in the Bloom filter.
 **/
struct Uncompressed {}
union BloomFilterCompression {
  1: Uncompressed UNCOMPRESSED;
}

/**
  * Bloom filter header is stored at beginning of Bloom filter data of each column
  * and followed by its bitset.
  **/
struct BloomFilterHeader {
  /** The size of bitset in bytes **/
  1: required i32 numBytes;
  /** The algorithm for setting bits. **/
  2: required BloomFilterAlgorithm algorithm;
  /** The hash function used for Bloom filter. **/
  3: required BloomFilterHash hash;
  /** The compression used in the Bloom filter **/
  4: required BloomFilterCompression compression;
}

ColumnMetaData

  • new element 14: optional i64 bloom_filter_offset;

new (unsorted)

struct EncryptionWithFooterKey {
}

struct EncryptionWithColumnKey {
  /** Column path in schema **/
  1: required list<string> path_in_schema
  
  /** Retrieval metadata of column encryption key **/
  2: optional binary key_metadata
}

union ColumnCryptoMetaData {
  1: EncryptionWithFooterKey ENCRYPTION_WITH_FOOTER_KEY
  2: EncryptionWithColumnKey ENCRYPTION_WITH_COLUMN_KEY
}

ColumnChunk
new fields:

  /** File offset of ColumnChunk's OffsetIndex **/
  4: optional i64 offset_index_offset

  /** Size of ColumnChunk's OffsetIndex, in bytes **/
  5: optional i32 offset_index_length

  /** File offset of ColumnChunk's ColumnIndex **/
  6: optional i64 column_index_offset

  /** Size of ColumnChunk's ColumnIndex, in bytes **/
  7: optional i32 column_index_length

  /** Crypto metadata of encrypted columns **/
  8: optional ColumnCryptoMetaData crypto_metadata
  
  /** Encrypted column metadata for this chunk **/
  9: optional binary encrypted_column_metadata

RowGroup
new:

  /** Byte offset from beginning of file to first page (data or dictionary)
   * in this row group **/
  5: optional i64 file_offset

  /** Total byte size of all compressed (and potentially encrypted) column data 
   *  in this row group **/
  6: optional i64 total_compressed_size
  
  /** Row group ordinal in the file **/
  7: optional i16 ordinal

new

/** Empty struct to signal the order defined by the physical or logical type */
struct TypeDefinedOrder {}

/**
 * Union to specify the order used for the min_value and max_value fields for a
 * column. This union takes the role of an enhanced enum that allows rich
 * elements (which will be needed for a collation-based ordering in the future).
 *
 * Possible values are:
 * * TypeDefinedOrder - the column uses the order defined by its logical or
 *                      physical type (if there is no logical type).
 *
 * If the reader does not support the value of this union, min and max stats
 * for this column should be ignored.
 */
union ColumnOrder {

  /**
   * The sort orders for logical types are:
   *   UTF8 - unsigned byte-wise comparison
   *   INT8 - signed comparison
   *   INT16 - signed comparison
   *   INT32 - signed comparison
   *   INT64 - signed comparison
   *   UINT8 - unsigned comparison
   *   UINT16 - unsigned comparison
   *   UINT32 - unsigned comparison
   *   UINT64 - unsigned comparison
   *   DECIMAL - signed comparison of the represented value
   *   DATE - signed comparison
   *   TIME_MILLIS - signed comparison
   *   TIME_MICROS - signed comparison
   *   TIMESTAMP_MILLIS - signed comparison
   *   TIMESTAMP_MICROS - signed comparison
   *   INTERVAL - unsigned comparison
   *   JSON - unsigned byte-wise comparison
   *   BSON - unsigned byte-wise comparison
   *   ENUM - unsigned byte-wise comparison
   *   LIST - undefined
   *   MAP - undefined
   *
   * In the absence of logical types, the sort order is determined by the physical type:
   *   BOOLEAN - false, true
   *   INT32 - signed comparison
   *   INT64 - signed comparison
   *   INT96 (only used for legacy timestamps) - undefined
   *   FLOAT - signed comparison of the represented value (*)
   *   DOUBLE - signed comparison of the represented value (*)
   *   BYTE_ARRAY - unsigned byte-wise comparison
   *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
   *
   * (*) Because the sorting order is not specified properly for floating
   *     point values (relations vs. total ordering) the following
   *     compatibility rules should be applied when reading statistics:
   *     - If the min is a NaN, it should be ignored.
   *     - If the max is a NaN, it should be ignored.
   *     - If the min is +0, the row group may contain -0 values as well.
   *     - If the max is -0, the row group may contain +0 values as well.
   *     - When looking for NaN values, min and max should be ignored.
   */
  1: TypeDefinedOrder TYPE_ORDER;
}

struct PageLocation {
  /** Offset of the page in the file **/
  1: required i64 offset

  /**
   * Size of the page, including header. Sum of compressed_page_size and header
   * length
   */
  2: required i32 compressed_page_size

  /**
   * Index within the RowGroup of the first row of the page; this means pages
   * change on record boundaries (r = 0).
   */
  3: required i64 first_row_index
}

struct OffsetIndex {
  /**
   * PageLocations, ordered by increasing PageLocation.offset. It is required
   * that page_locations[i].first_row_index < page_locations[i+1].first_row_index.
   */
  1: required list<PageLocation> page_locations
}

/**
 * Description for ColumnIndex.
 * Each <array-field>[i] refers to the page at OffsetIndex.page_locations[i]
 */
struct ColumnIndex {
  /**
   * A list of Boolean values to determine the validity of the corresponding
   * min and max values. If true, a page contains only null values, and writers
   * have to set the corresponding entries in min_values and max_values to
   * byte[0], so that all lists have the same length. If false, the
   * corresponding entries in min_values and max_values must be valid.
   */
  1: required list<bool> null_pages

  /**
   * Two lists containing lower and upper bounds for the values of each page.
   * These may be the actual minimum and maximum values found on a page, but
   * can also be (more compact) values that do not exist on a page. For
   * example, instead of storing ""Blart Versenwald III", a writer may set
   * min_values[i]="B", max_values[i]="C". Such more compact values must still
   * be valid values within the column's logical type. Readers must make sure
   * that list entries are populated before using them by inspecting null_pages.
   */
  2: required list<binary> min_values
  3: required list<binary> max_values

  /**
   * Stores whether both min_values and max_values are orderd and if so, in
   * which direction. This allows readers to perform binary searches in both
   * lists. Readers cannot assume that max_values[i] <= min_values[i+1], even
   * if the lists are ordered.
   */
  4: required BoundaryOrder boundary_order

  /** A list containing the number of null values for each page **/
  5: optional list<i64> null_counts
}

struct AesGcmV1 {
  /** AAD prefix **/
  1: optional binary aad_prefix

  /** Unique file identifier part of AAD suffix **/
  2: optional binary aad_file_unique
  
  /** In files encrypted with AAD prefix without storing it,
   * readers must supply the prefix **/
  3: optional bool supply_aad_prefix
}

struct AesGcmCtrV1 {
  /** AAD prefix **/
  1: optional binary aad_prefix

  /** Unique file identifier part of AAD suffix **/
  2: optional binary aad_file_unique
  
  /** In files encrypted with AAD prefix without storing it,
   * readers must supply the prefix **/
  3: optional bool supply_aad_prefix
}

union EncryptionAlgorithm {
  1: AesGcmV1 AES_GCM_V1
  2: AesGcmCtrV1 AES_GCM_CTR_V1
}

FileMetaData

 /**
   * Sort order used for the min_value and max_value fields of each column in
   * this file. Sort orders are listed in the order matching the columns in the
   * schema. The indexes are not necessary the same though, because only leaf
   * nodes of the schema are represented in the list of sort orders.
   *
   * Without column_orders, the meaning of the min_value and max_value fields is
   * undefined. To ensure well-defined behaviour, if min_value and max_value are
   * written to a Parquet file, column_orders must be written as well.
   *
   * The obsolete min and max fields are always sorted by signed comparison
   * regardless of column_orders.
   */
  7: optional list<ColumnOrder> column_orders;

  /** 
   * Encryption algorithm. This field is set only in encrypted files
   * with plaintext footer. Files with encrypted footer store algorithm id
   * in FileCryptoMetaData structure.
   */
  8: optional EncryptionAlgorithm encryption_algorithm

  /** 
   * Retrieval metadata of key used for signing the footer. 
   * Used only in encrypted files with plaintext footer. 
   */ 
  9: optional binary footer_signing_key_metadata

new

/** Crypto metadata for files with encrypted footer **/
struct FileCryptoMetaData {
  /** 
   * Encryption algorithm. This field is only used for files
   * with encrypted footer. Files with plaintext footer store algorithm id
   * inside footer (FileMetaData structure).
   */
  1: required EncryptionAlgorithm encryption_algorithm
    
  /** Retrieval metadata of key used for encryption of footer, 
   *  and (possibly) columns **/
  2: optional binary key_metadata
}
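
For orientation, a hedged sketch of how the AAD pieces above fit together (it uses System.Security.Cryptography.AesGcm, available from .NET Core 3.0; the real module-level AAD layout and the nonce/tag framing are defined by the Parquet modular encryption spec and are simplified here): the externally supplied aad_prefix is concatenated with aad_file_unique and passed as associated data, so decryption fails if either part is wrong.

using System;
using System.Linq;
using System.Security.Cryptography;

static class FooterDecryptionSketch
{
    // 'aadPrefix' is supplied by the caller when supply_aad_prefix is true,
    // otherwise taken from the aad_prefix field; 'aadFileUnique' comes from
    // the AesGcmV1/AesGcmCtrV1 structure. Nonce (12 bytes) and tag (16 bytes)
    // are assumed to have been parsed out of the encrypted module already.
    public static byte[] Decrypt(
        byte[] key, byte[] nonce, byte[] tag, byte[] ciphertext,
        byte[] aadPrefix, byte[] aadFileUnique)
    {
        // The associated data binds the ciphertext to this particular file.
        byte[] aad = (aadPrefix ?? Array.Empty<byte>())
            .Concat(aadFileUnique ?? Array.Empty<byte>())
            .ToArray();

        byte[] plaintext = new byte[ciphertext.Length];
        using (var aes = new AesGcm(key))
        {
            // Throws CryptographicException if the key, AAD or tag do not match.
            aes.Decrypt(nonce, ciphertext, tag, plaintext, aad);
        }
        return plaintext;
    }
}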

KeyNotFoundException when writing Table with empty ListField

Version: Parquet.Net v3.6.0

Runtime Version: .Net Core v3.1.101

OS: Windows

Expected behavior

When writing a table that has a list field of structures, it should be possible for the field to contain empty values.

Actual behavior

In the cases where all rows contain empty values for the list field, a KeyNotFoundException is thrown when writing the table:

System.Collections.Generic.KeyNotFoundException
  HResult=0x80131577
  Message=The given key 'list.list.item.value' was not present in the dictionary.
  Source=System.Private.CoreLib
  StackTrace:
   at System.ThrowHelper.ThrowKeyNotFoundException[T](T key)
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at Parquet.Data.Rows.RowsToDataColumnsConverter.<Convert>b__4_0(DataField df)
   at System.Linq.Enumerable.SelectArrayIterator`2.ToList()
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Parquet.Data.Rows.RowsToDataColumnsConverter.Convert()
   at Parquet.Data.Rows.Table.ExtractDataColumns()
   at Parquet.ParquetExtensions.Write(ParquetRowGroupWriter writer, Table table)
   at Parquet.ParquetExtensions.Write(ParquetWriter writer, Table table)

Steps to reproduce the behavior

  1. Create an xUnit Test Project
  2. Add reference to latest Parquet.Net nuget package, or the /src/Parquet/Parquet.csproj
  3. Add the test method below

Code snippet reproducing the behavior

[Fact]
public void ParquetEmptyListField()
{
    var table = new Table(
        new DataField<int>("id"),
        new ListField(
            "list",
            new StructField(
                "item",
                new DataField<string>("value"))));

    table.Add(new Row(1, new Row[0])); // row with an empty list
    table.Add(new Row(2, new Row[0])); // row with an empty list

    var stream = new MemoryStream();
    using (var parquetWriter = new ParquetWriter(table.Schema, stream))
    {
        parquetWriter.Write(table);
    }

    stream.Position = 0;
    Table outputTable;
    using (var reader = new ParquetReader(stream))
    {
        outputTable = reader.ReadAsTable();
    }

    Assert.Equal(table.ToString(), outputTable.ToString(), ignoreLineEndingDifferences: true);
}

Support column statistics

  • null_count (was already supported)
  • distinct_count (in progress)
  • min_value and max_value
  • legacy min, max
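
For reference, a small sketch of what these fields mean for a single column chunk (plain LINQ over in-memory values, not the library's statistics API; the helper name is made up):

using System;
using System.Linq;

static class ColumnStatisticsSketch
{
    // 'values' represents the materialised values of one column chunk,
    // with nulls included.
    public static void Print(string[] values)
    {
        // null_count: how many entries are null.
        long nullCount = values.Count(v => v == null);

        // distinct_count: number of distinct non-null values.
        long distinctCount = values.Where(v => v != null).Distinct().Count();

        // min_value / max_value: bounds over the non-null values using the
        // column's logical-type ordering (ordinal string order here).
        var ordered = values.Where(v => v != null)
                            .OrderBy(v => v, StringComparer.Ordinal)
                            .ToArray();
        string min = ordered.FirstOrDefault();
        string max = ordered.LastOrDefault();

        Console.WriteLine($"null_count={nullCount}, distinct_count={distinctCount}, min={min}, max={max}");
    }
}

The legacy min/max fields carry the same kind of bounds but are always interpreted with signed comparison, which is why min_value/max_value are preferred.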

Creates invalid files with Snappy compression and a certain amount of data

When creating Parquet files with a certain amount of data, Snappy compression produces invalid output.
I set up a simple test that writes random objects and reads them back in:
https://dotnetfiddle.net/g22ORE

It works fine for 1000 objects, but not for 2000 objects.
It throws an unhandled exception: System.IO.IOException: corrupt input
at IronSnappy.SnappyReader.Decode(Span`1 dst, ReadOnlySpan`1 src)
Using any compression mechanism other than Snappy also avoids the issue.

Version: Parquet.Net v3.7.1

Runtime Version: .Net Core v3.1

Code snippet reproducing the behavior

using System;
using System.IO;
using System.Linq;
using Parquet;

public class TestStructure
{
	public string StringValue
	{
		get;
		set;
	}

	public DateTimeOffset DateValue
	{
		get;
		set;
	}

	public double DoubleValue
	{
		get;
		set;
	}

	public static TestStructure[] GenerateTestStructures(int size)
	{
		var ret = new TestStructure[size];
		for (int i = 0; i < size; i++)
		{
			ret[i] = new TestStructure()
			{DateValue = DateTimeOffset.Now, DoubleValue = i, StringValue = Guid.NewGuid().ToString()};
		}

		return ret;
	}
}

public class Program
{
	public static void RunTest(int n, CompressionMethod compressionMethod)
	{
		var testData = TestStructure.GenerateTestStructures(n);
		var memoryStream = new MemoryStream();
		ParquetConvert.Serialize(testData, memoryStream, null, compressionMethod);
		memoryStream.Position = 0;
		var result = ParquetConvert.Deserialize<TestStructure>(memoryStream);
		if (n == result.Length)
			Console.WriteLine($"Test Succeeded with n={n} compression={compressionMethod}.");
	}
	
	public static void Main()
	{
		RunTest(2000, CompressionMethod.Gzip); // works
		RunTest(1000, CompressionMethod.Snappy); // works
		RunTest(2000, CompressionMethod.None); // works
		
		RunTest(2000, CompressionMethod.Snappy); // throws System.IO.IOException: corrupt input
	}
}
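
A possible mitigation until the Snappy problem is fixed, sketched below for the Program class above (it assumes the optional rowGroupSize parameter that Parquet.Net 3.x's ParquetConvert.Serialize exposes; smaller row groups only shrink each compressed block and are not a guaranteed fix):

	// Sketch only: same round trip, but written in smaller row groups so that
	// each Snappy-compressed block stays small. The optional rowGroupSize
	// parameter of ParquetConvert.Serialize is assumed here (Parquet.Net 3.x).
	public static void RunTestSmallRowGroups(int n, int rowGroupSize)
	{
		var testData = TestStructure.GenerateTestStructures(n);
		var memoryStream = new MemoryStream();
		ParquetConvert.Serialize(testData, memoryStream, null, CompressionMethod.Snappy, rowGroupSize);
		memoryStream.Position = 0;
		var result = ParquetConvert.Deserialize<TestStructure>(memoryStream);
		Console.WriteLine($"Read back {result.Length} of {n} rows with rowGroupSize={rowGroupSize}.");
	}

	// e.g. from Main(): RunTestSmallRowGroups(2000, 500);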

Issue or question about writing parquet file with short/int16

I am so glad that this library exists! It would be wonderful to have full support for the different data types -- I'm having issues with shorts and wonder if there is some simple mistake I am making.

Version: 3.0.0.0

Runtime Version: .Net Framework v4.5

OS: Windows

Expected behavior

When writing parquet files with columns of type short/Int16, the file should be able to be read back in without errors.

Actual behavior

When I try to read the file back in with Parquet.Net, I get the error

System.IO.IOException: 'not a Parquet file(head is 'R1R1')'

Trying to read it in via Python with PyArrow, I get

Unexpected end of stream

I am able to read/write files with only integers and floats using these methods without any problem.

Steps to reproduce the behavior

  1. Step 1: Write a parquet file like this (this part goes through without any issue):
// (I just changed the type of the example idColumn from the Elastacloud documentation)
            var idColumn = new DataColumn(
               new DataField<Int16>("id"),
               new Int16[] { 1, 2 });

            var cityColumn = new DataColumn(
               new DataField<string>("city"),
               new string[] { "London", "Derby" });

            // create file schema
            var schema = new Schema(idColumn.Field, cityColumn.Field);

            using (Stream fileStream = System.IO.File.OpenWrite("d:\\tmp\\test.parquet"))
            {
                using (var parquetWriter = new ParquetWriter(schema, fileStream))
                {
                    // create a new row group in the file
                    using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup())
                    {
                        groupWriter.WriteColumn(idColumn);
                        groupWriter.WriteColumn(cityColumn);
                    }
                }
            }
  2. Step 2: try to read back in:
//
string fileNameX = @"d:\tmp\test.parquet";
Stream fs = new FileStream(fileNameX, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
var z = ParquetReader.ReadTableFromStream(fs);
