codename-hub / php-parquet

PHP implementation for reading and writing Apache Parquet files/streams

License: Other

Dockerfile 0.16% PHP 93.10% PowerShell 0.48% Thrift 5.99% Shell 0.09% Python 0.19%
php php-library parquet apache-parquet

php-parquet's People

Contributors

eisberg, katalystical, kevinvonjocoon


php-parquet's Issues

When trying to decode an empty string with StringDataTypeHandler->plainDecode, an error is thrown from CustomBinaryReader->readString

I'm getting an exception when StringDataTypeHandler::plainDecode receives an empty string ("") in the $encoded parameter.

public function plainDecode(
    \codename\parquet\format\SchemaElement $tse,
    $encoded
  ) {
    if ($encoded === null) return null;

    $ms = fopen('php://memory', 'r+');
    fwrite($ms, $encoded);
    $br = BinaryReader::createInstance($ms);
    $element = $this->readSingleInternal($br, $tse, -1, false);
    return $element;
  }

The current validation only checks for null, but $encoded can also be an empty string.
I think the fix is to check for empty values, not just null.
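A standalone sketch of the suggested guard (plainDecodeGuarded is an illustrative name, not the library's actual method):

```php
<?php
// Hypothetical sketch: bail out for both null and '' so the binary reader
// never receives a zero-length buffer (fread() rejects a length of 0).
function plainDecodeGuarded(?string $encoded): ?string {
    if ($encoded === null || $encoded === '') {
        return $encoded;
    }
    $ms = fopen('php://memory', 'r+');
    fwrite($ms, $encoded);
    rewind($ms);
    return fread($ms, strlen($encoded));
}

var_dump(plainDecodeGuarded(''));    // string(0) ""
var_dump(plainDecodeGuarded('abc')); // string(3) "abc"
var_dump(plainDecodeGuarded(null));  // NULL
```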

Reading Row Groups

Hey
Based on this library, I'm trying to implement a parquet adapter for Flow PHP.
I started by writing a few tests (the row groups are that small only for testing purposes); you can find the code below:

Code Example

<?php

use codename\parquet\data\DataColumn;
use codename\parquet\data\DataField;
use codename\parquet\data\Schema;
use codename\parquet\ParquetWriter;

require_once __DIR__ . '/../vendor/autoload.php';

$id = DataField::createFromType('id', 'integer');

$schema = new Schema([$id]);

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'w+'), null, false);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [1, 2, 3, 4]));
$rowGroup->finish();
$writer->finish();

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'a+'), null, true);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [5, 6, 7, 8]));
$rowGroup->finish();
$writer->finish();

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'a+'), null, true);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [9, 10, 11, 12]));
$rowGroup->finish();
$writer->finish();

$writer = new ParquetWriter($schema, \fopen(__DIR__.'/test.parquet', 'a+'), null, true);
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($id, [13, 14, 15, 16]));
$rowGroup->finish();
$writer->finish();

But when I try to read the file using parquet-tools, I get the following error:

parquet-tools cat --json test.parquet
{"id":1}
{"id":2}
{"id":3}
{"id":4}
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Socket is closed by peer.

I also tried to check the file content through an Avro/Parquet viewer and I'm getting this:

[screenshot not included]

Any idea what might be wrong here?
If you can point me in the right direction, I can debug this issue further; I'm not that familiar with the Parquet format, so any help is welcome.
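One thing worth checking (an assumption on my part, not a confirmed diagnosis): the appending runs reopen the file with 'a+'. In PHP, append mode forces every write to the end of the file regardless of fseek(), which breaks any writer that needs to seek back, for example to rewrite a footer; 'r+' honors seeks. A minimal demonstration of the difference:

```php
<?php
// Demonstrates why 'a+' can break append-style writers: in append mode,
// PHP ignores the seek position for writes and always writes at EOF.
$path = tempnam(sys_get_temp_dir(), 'seektest');

file_put_contents($path, 'AAAA');
$h = fopen($path, 'a+');
fseek($h, 0);            // attempt to overwrite the beginning...
fwrite($h, 'BB');        // ...but 'a+' appends anyway
fclose($h);
echo file_get_contents($path), "\n"; // AAAABB

file_put_contents($path, 'AAAA');
$h = fopen($path, 'r+');
fseek($h, 0);
fwrite($h, 'BB');        // 'r+' honors the seek
fclose($h);
echo file_get_contents($path), "\n"; // BBAA

unlink($path);
```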

Thanks for all your work to make parquet available in PHP!

Help with data structure

Hi,

I've just started looking at this excellent package and am struggling to understand the structure for writing nested arrays. Could someone please help with the format I need for the following example data?

$data = [
    [  
        'id' => 1,
        'transcripts' => [
            ['id' => 1, 'speaker' => 'bob', 'text' => 'Hello?'],
            ['id' => 2, 'speaker' => 'jane', 'text' => 'Hello back at ya?'],
            ['id' => 2, 'speaker' => 'bob', 'text' => 'how are you today'],
        ],
        'redflags' => [
             ['id' => 1, 'type' => 'type_a', 'score' => 50, 'analysis' => 'they greeted each other'],
             ['id' => 2, 'type' => 'type_b', 'score' => 0, 'analysis' => ''],
             ['id' => 3, 'type' => 'type_c', 'score' => 0, 'analysis' => ''],
        ]
    ],
    [  
        'id' => 2,
        'transcripts' => [
            ['id' => 1, 'speaker' => 'bob', 'text' => 'Hello again?'],
            ['id' => 2, 'speaker' => 'jane', 'text' => 'Hello back at ya again?'],
            ['id' => 2, 'speaker' => 'bob', 'text' => 'how were you yesterday'],
        ],
        'redflags' => [
             ['id' => 1, 'type' => 'type_a', 'score' => 50, 'analysis' => 'they greeted each other again'],
             ['id' => 2, 'type' => 'type_b', 'score' => 0, 'analysis' => ''],
             ['id' => 3, 'type' => 'type_c', 'score' => 0, 'analysis' => ''],
        ]
    ],
];

I am then trying to digest the data and save it to a file using:

if (!Storage::exists('exports')) {
    Storage::makeDirectory('exports');
}

$filename = 'export_' . $form_data['collection'] . '_' . date('Y-m-d-H-i-s') . '.parquet';
$file_path = storage_path('app/exports/' . $filename);
if (!file_exists($file_path)) {
    touch($file_path);
}

$schema_fields = [
    DataField::createFromType('id', 'integer'),
    new MapField(
        'transcripts',
        DataField::createFromType('id', 'string'),
        StructField::createWithFieldArray('transcripts', [
            DataField::createFromType('id', 'integer'),
            DataField::createFromType('speaker', 'string'),
            DataField::createFromType('text', 'string'),
        ])
    ),
    new MapField(
        'redflags',
        DataField::createFromType('id', 'string'),
        StructField::createWithFieldArray('redflags', [
            DataField::createFromType('id', 'integer'),
            DataField::createFromType('type', 'string'),
            DataField::createFromType('score', 'integer'),
            DataField::createFromType('analysis', 'string'),
        ])
    ),
];

$schema = new Schema($schema_fields);

$handle = fopen($file_path, 'r+');
$dataWriter = new ParquetDataWriter($handle, $schema);

// return data as a ResourceCollection
$dataToWrite = DatasetResource::collection($data);
$dataWriter->putBatch($dataToWrite->toArray(Request()));

$dataWriter->finish();
fclose($handle);
return $file_path;

The file is written without errors, but when I pass it to an online reader, each of the multidimensional fields transcripts and redflags only seems to contain its first row, not the rest. Is this expected, or am I missing something?

Any pointers thankfully received.
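For data shaped like this, a list of structs per row rather than a key/value map, a ListField of StructField may match the structure better than MapField. This is a sketch based on the parquet-dotnet API that this library ports; the ListField class and constructor signature are assumptions, not verified against this library:

```php
// Sketch: one list-of-structs field per nested array
// (ListField usage is an assumption; verify against the library's sources)
$schema_fields = [
    DataField::createFromType('id', 'integer'),
    new ListField(
        'transcripts',
        StructField::createWithFieldArray('transcript', [
            DataField::createFromType('id', 'integer'),
            DataField::createFromType('speaker', 'string'),
            DataField::createFromType('text', 'string'),
        ])
    ),
    new ListField(
        'redflags',
        StructField::createWithFieldArray('redflag', [
            DataField::createFromType('id', 'integer'),
            DataField::createFromType('type', 'string'),
            DataField::createFromType('score', 'integer'),
            DataField::createFromType('analysis', 'string'),
        ])
    ),
];
```

With a MapField, each entry needs a distinct key; duplicate keys (like the repeated 'id' values above) could plausibly explain why only one row per group survives.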

Regards,

Paul.

Support PHP 8.3

Can you please update composer.json to support PHP 8.3? Thanks!
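For reference, a composer.json platform constraint that admits 8.3 might look like the following; the exact lower bound is an assumption, so check the library's actual minimum requirement:

```json
{
    "require": {
        "php": "^7.1 || ^8.0"
    }
}
```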

Batch Writing - Example

Hi,

Does this support batch writing?

I have 100k+ rows in my database and plan to write them to a parquet file in batches: chunking 5k rows from the database at a time, then writing each chunk to the parquet file.

I saw an example at https://github.com/aloneguid/parquet-dotnet/blob/master/doc/writing.md#appending-to-files, but when I try the same approach with this library I always get this error: not a Parquet file(tail is '\\\')

This is my code:

public function export($table, $startDate, $endDate)
{
    $fields = [];
    $columns = [];

    $appendFile = false;

    $fileName = $table . '_' . str_replace(':', '-', $startDate) . '.parquet';
    $filePath = storage_path('/' . $fileName);
    $fileStream = fopen($filePath, 'a+');

    DB::table($table)->whereBetween('created_at', [$startDate, $endDate])->orderBy('id')->chunk(5000, function ($result) use (&$fields, &$columns, &$filePath, $table, $startDate, $endDate, &$parquetWriter, &$groupWriter, &$appendFile, &$fileStream) {
        if ($result) {
            $keyOnly = Arr::first($result);

            foreach ($keyOnly as $key => $value) {
                $colType = $this->getColumnType($table, $key);
                $dataColumn = new DataColumn(
                    DataField::createFromType($key, $colType),
                    $result->pluck($key)->toArray()
                );

                $columns[] = $dataColumn;
                $fields[] = $dataColumn->getField();
            }

            $schema = new Schema($fields);

            $parquetWriter = new ParquetWriter($schema, $fileStream, null, $appendFile);

            $appendFile = true;

            // create a new row group in the file
            $groupWriter = $parquetWriter->CreateRowGroup();
            foreach ($columns as $col) {
                $groupWriter->WriteColumn($col);
            }

            $groupWriter->finish();   // finish inner writer(s)
        }
    });

    $parquetWriter->finish(); // finish the parquet writer last

    return $filePath;
}

Your help is greatly appreciated! Thanks.
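A few things stand out in the code above (assumptions worth verifying, not confirmed fixes): the file is opened with 'a+', which in PHP ignores fseek() for writes and so prevents the writer from reading back or rewriting the footer; $fields and $columns are never reset, so they grow with every chunk; and each appending writer expects a finished footer, so the writer likely needs to be finished per chunk rather than once at the end. A restructured sketch (property names such as $field->name are assumptions about the library's API):

```php
// Sketch of a restructured loop: build the schema once, keep the stream
// seekable, and finish the writer after every chunk so the next append
// finds a valid footer.
$schema = new Schema($fields);          // $fields built once, before chunking
$fileStream = fopen($filePath, 'w+');   // 'w+' creates the file and honors fseek()
$appendFile = false;

DB::table($table)->orderBy('id')->chunk(5000,
    function ($result) use ($schema, $fields, $fileStream, &$appendFile) {
        // reopen the writer per chunk; append for every chunk after the first
        $writer = new ParquetWriter($schema, $fileStream, null, $appendFile);
        $appendFile = true;

        $groupWriter = $writer->CreateRowGroup();
        foreach ($fields as $field) {
            $groupWriter->WriteColumn(new DataColumn(
                $field,
                $result->pluck($field->name)->toArray()
            ));
        }
        $groupWriter->finish();
        $writer->finish(); // writes the footer so the next append can read it
    }
);
```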

DateTimeDataField subtracts 1 day in case of non UTC date

First of all, I am a very happy user of this package, so thanks for the effort.

However, I found that DateTimeDataField miscalculates the date value when it is given a PHP DateTime object that is not in UTC. For example, my dataset contains this DateTime field:

^ DateTimeImmutable @1664575200 {#861
  date: 2022-10-01 00:00:00.0 Europe/Amsterdam (+02:00)
}

Then the number of unix days calculated (by dumping https://github.com/codename-hub/php-parquet/blob/master/src/data/concrete/DateTimeOffsetDataTypeHandler.php#L161) is

19265

But this value leads to 2022-09-30 in the parquet output, which is one day off.

When I convert the date to UTC first, it is correct:

^ DateTimeImmutable @1664582400 {#860
  date: 2022-10-01 00:00:00.0 UTC (+00:00)
}

19266

Now the dates in the parquet files are correct as well.

The root cause, I guess, is that the code takes the diff between two times in two different timezones; PHP internally normalizes the timezones first, so the date shifts back one day when moving from CEST to UTC. I tried to reproduce the issue here:

https://3v4l.org/24iPo#v8.2.6

You can see that the reconstructed date is one day earlier due to the timezone offset.
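The shift can be reproduced without the library at all: diffing a non-UTC midnight against the UTC epoch yields a fractional day count, and truncation loses a day. Normalizing the wall-clock date to UTC first avoids it:

```php
<?php
// Standalone reproduction of the off-by-one day.
$epoch = new DateTimeImmutable('1970-01-01 00:00:00', new DateTimeZone('UTC'));

$cest = new DateTimeImmutable('2022-10-01 00:00:00', new DateTimeZone('Europe/Amsterdam'));
// @1664575200 is 19265.91... days after the epoch; ->days truncates to 19265,
// which decodes back to 2022-09-30
echo $epoch->diff($cest)->days, "\n"; // 19265

$utc = new DateTimeImmutable('2022-10-01 00:00:00', new DateTimeZone('UTC'));
// @1664582400 is exactly 19266 days, which decodes to 2022-10-01
echo $epoch->diff($utc)->days, "\n"; // 19266
```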

Issue when trying to get empty string value from parquet file

Hi there,
thank you for your library. It works fine, but I had a problem retrieving data from a file containing empty values. The problem occurs in the readString function of the CustomBinaryReader class. If you add a check for the length and a default value, everything works fine:

/**
 * @inheritDoc
 */
public function readString($length)
{
  $this->position += $length; // ?
  return $length ? fread($this->stream, $length) : '';
}
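The guard is needed because fread() rejects a zero length outright (a ValueError on PHP 8; a warning plus false on PHP 7). A minimal demonstration:

```php
<?php
// fread() with length 0 fails rather than returning ''
$ms = fopen('php://memory', 'r+');
fwrite($ms, 'hello');
rewind($ms);

try {
    var_dump(fread($ms, 0));
} catch (ValueError $e) {
    echo 'fread(0) failed: ', $e->getMessage(), "\n";
}

// With the suggested guard, a zero length short-circuits to ''
$length = 0;
var_dump($length ? fread($ms, $length) : ''); // string(0) ""
```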

Can't write data to an AWS S3 stream

    $disk = Storage::disk('s3');
    $file = 'test.parquet';

    $schema = new Schema([
        DataField::createFromType('id', 'integer'),
        DataField::createFromType('name', 'string'),
    ]);

    $handle = $disk->readStream($file);

    $dataWriter = new ParquetDataWriter($handle, $schema);

    // add two records at once
    $dataToWrite = [
        [ 'id' => 1, 'name' => 'abc' ],
        [ 'id' => 2, 'name' => 'def' ],
    ];
    $dataWriter->putBatch($dataToWrite);

    $dataWriter->finish(); // Don't forget to finish at some point.

$handle is a file resource, but the data can't be written to the stream.
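A likely explanation (an assumption, not verified against Flysystem internals): readStream() returns a read-only stream, and Parquet writing also needs seeking, which S3 streams generally don't support. A common pattern is to build the file in a local seekable stream and upload the finished bytes:

```php
// Sketch; Storage::put() accepting a resource is standard Laravel behavior,
// but verify against your framework version.
$handle = fopen('php://temp', 'r+');   // seekable, writable scratch stream

$dataWriter = new ParquetDataWriter($handle, $schema);
$dataWriter->putBatch($dataToWrite);
$dataWriter->finish();

rewind($handle);
$disk->put($file, $handle);            // upload the finished file to S3
fclose($handle);
```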

Issue when readColumn in Parquet file with large amount of data

Hi, thanks for your great library. It works well with small parquet files, but when I tried to read data from a parquet file with ~500k rows, the array values from readColumn()->getData() became incorrect.

Here is my parquet file:
https://dev-sc2-pn.s3.ap-northeast-1.amazonaws.com/sc2_area_master+(3).parquet

My parquet file has only 92 rows with project_id = '123456789012345678', but when I get the data from the column via getData(), it returns more than 300k rows with this project_id.

Here is my sample code. Do you have any idea about this issue?

// open parquet file reader
$parquetReader = new ParquetReader($fileStream);

// get file schema (available straight after opening parquet reader)
// however, get only data fields as only they contain data values
$dataFields = $parquetReader->schema->GetDataFields();

// enumerate through row groups in this file
for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
    // create row group reader
    $groupReader = $parquetReader->OpenRowGroupReader($i);
    $rowCount = $groupReader->getRowCount();

    // read all columns inside each row group (you have an option to read only
    // required columns if you need to)
    $columns = [];
    foreach ($dataFields as $field) {
        $columns[] = @$groupReader->ReadColumn($field);
    }

    // $data member, accessible through ->getData(), contains an array of column data
    $projectIds = $columns[0]->getData();

    dd($columns[0]->getData(0));
}

Schema DataField type 'timestamp'

Is there anything I can do to create a parquet file that will be used in time-related queries? It seems a timestamp type is not available?

Not a major issue, but what about bigint as well?
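Assuming the library mirrors the parquet-dotnet API it ports, a timestamp-like column might be declared with DateTimeDataField; the class paths and the DateTimeFormat constant below are assumptions, not verified against this library:

```php
// Sketch (assumed API): a date+time column declared explicitly
use codename\parquet\data\DateTimeDataField;
use codename\parquet\data\DateTimeFormat;

$ts = new DateTimeDataField('created_at', DateTimeFormat::DateAndTime);
```

For 64-bit integers, parquet-dotnet uses a long-typed data field; the PHP port may expose an equivalent type name for DataField::createFromType().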

limit number of records

I have a problem reading parquet files with codename\parquet\helper\ParquetDataIterator. When they are small, it reads them without problems, but past roughly 500 records it saturates memory and the whole process fails.

I have been reading the documentation and running thousands of tests with ParquetDataIterator and with ParquetReader, but it always fails at those parquet sizes, even after scaling up the machine and the memory.

I also tried to fetch only X rows at a time, but I can't get it to work and there is no documentation about it.

Do you have a solution for using ParquetDataIterator or ParquetReader with a limited number of records, requesting them in order N at a time, without having to load everything into memory?

I am currently using these codes:

1) For ParquetReader:

use jocoon\parquet\ParquetReader;

// open file stream (in this example for reading only)
$fileStream = fopen(__DIR__.'/test.parquet', 'r');

// open parquet file reader
$parquetReader = new ParquetReader($fileStream);

// get file schema (available straight after opening parquet reader)
// however, get only data fields as only they contain data values
$dataFields = $parquetReader->schema->GetDataFields();

// enumerate through row groups in this file
for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
    // create row group reader
    $groupReader = $parquetReader->OpenRowGroupReader($i);
    // read all columns inside each row group (you have an option to read only
    // required columns if you need to)
    $columns = [];
    foreach ($dataFields as $field) {
        $columns[] = $groupReader->ReadColumn($field);
    }

    // get first column, for instance
    $firstColumn = $columns[0];

    // the data member contains a typed array of column data
    $data = $firstColumn->getData();

    // Print data or do other stuff with it
    print_r($data);
}

2) For ParquetDataIterator:

use codename\parquet\helper\ParquetDataIterator;

$iterateMe = ParquetDataIterator::fromFile('your-parquet-file.parquet');

foreach ($iterateMe as $dataset) {
    // $dataset is an associative array
    // and already combines data of all columns
    // back to a row-like structure
}
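If the iterator decodes row groups lazily (an assumption about the library's internals, worth verifying), one pragmatic way to cap the number of records is to window the iteration manually:

```php
// Sketch: take at most $limit records starting at $offset
use codename\parquet\helper\ParquetDataIterator;

$iterateMe = ParquetDataIterator::fromFile('your-parquet-file.parquet');

$offset = 10000;   // skip the first N records
$limit  = 500;     // then take at most M
$i = 0;

foreach ($iterateMe as $dataset) {
    if ($i++ < $offset) continue;        // skip until the window starts
    // ... process $dataset ...
    if ($i >= $offset + $limit) break;   // stop early
}
```

Note that skipping still iterates (and may still decode) the earlier records; true random access would need to go through row-group readers instead.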

Can't append rows to an existing parquet file

public function __construct($handle, Schema $schema, ?ParquetOptions $options = null)
{
    $append = false; // TODO: detect already written data and change to true (append mode)
    $this->schema = $schema;
    $this->writerInstance = new ParquetWriter($schema, $handle, $options, $append);
}

I was expecting append support here.
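Until the helper exposes an append option, a workaround sketch is to construct ParquetWriter directly with the append flag set (the constructor signature is taken from the helper code above; the rest follows the writer usage shown in other issues here):

```php
// $handle must be opened in a seekable read/write mode such as 'r+'
// so the existing footer can be read back and rewritten
$writer = new ParquetWriter($schema, $handle, $options, true); // append = true
$rowGroup = $writer->CreateRowGroup();
$rowGroup->WriteColumn(new DataColumn($field, $values));
$rowGroup->finish();
$writer->finish();
```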

ParquetReader - Get Value

Hi,

thanks for this package. I'm currently testing parquet libraries in different languages to check which one could replace our current Flask implementation.

It seems that your package has no problems reading our files (the TS parquet package seems to support only Parquet 2.0).

I quickly created a Laravel app to test it (I have some other ideas, but the main feature should work before I start developing ;))

The snippet is the following:

$parquetPath = Storage::path('path/to/parquetfile.parquet');

$parquetStream = fopen($parquetPath, 'r');

$parquetReader = new ParquetReader($parquetStream);

$dataFields = $parquetReader->schema->GetDataFields();

$result = [];

for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
    // create row group reader
    $groupReader = $parquetReader->OpenRowGroupReader($i);
    // read all columns inside each row group (you have an option to read only
    // required columns if you need to)
    $columns = [];
    foreach ($dataFields as $field) {
        $column = $groupReader->ReadColumn($field);
        $columns[$column->getField()->name] = $column->getData();
    }

    $result[] = $columns;
}

dd($result);

$result shows me the correct columns, but the column values I get via getData() are always the raw binary.

I checked the code but wasn't able to find the relevant part to convert them back to readable values.

Maybe you could give me a hint on how to do this.

Thanks!
