domoritz / json2parquet
Convert JSON files to Apache Parquet.
License: Apache License 2.0
Thank you for this great tool!
The documentation notes that setting --max-read-records to 0 will stop schema inference and set the type for all columns to strings (lines 32 to 33 at commit 27dfa6a).
The behavior I'm seeing is that json2parquet will only read the number of records set by --max-read-records. Is there another way to stop schema inference?
Some examples demonstrating the behavior:
$ cat foo.json
{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}
$ ./json2parquet -V
json2parquet 0.6.0
$ ./json2parquet --max-read-records 0 foo.json foo.parquet -p
Schema:
{
  "fields": []
}
$ du -h --apparent-size foo.parquet
183 foo.parquet
$ ./json2parquet --max-read-records 2 foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}
$ du -h --apparent-size foo.parquet
751 foo.parquet
$ ./json2parquet foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key3",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key4",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key5",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}
$ du -h --apparent-size foo.parquet
1.6K foo.parquet
Thank you for writing these utilities! I was looking for precisely this capability.
I was consistently getting only 1024 records in my output until I realized that reader.next() is only called once (line 180 at commit 1c46f9d). It needs to be wrapped in a loop to consume all the batches, right?
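If it helps, here is roughly the loop I would expect, as a sketch only: I'm assuming reader is an arrow json::Reader whose next() returns Result<Option<RecordBatch>> and writer is a parquet ArrowWriter; the actual identifiers in main.rs may differ.

// Sketch of the fix: drain every RecordBatch instead of calling
// next() a single time. `reader` (arrow json::Reader) and `writer`
// (parquet ArrowWriter) are assumptions, not the actual names.
while let Some(batch) = reader.next()? {
    writer.write(&batch)?;
}
writer.close()?;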
I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with both json2parquet and ClickHouse. json2parquet was 1.5x slower than ClickHouse at converting the records into Snappy-compressed Parquet.
I converted the original GeoJSON into JSONL with three fields per record. The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.
$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
I then converted that file into Snappy-compressed Parquet with ClickHouse which took 32 seconds and produced a file 793 MB in size.
$ cat California.jsonl \
    | clickhouse local \
        --input-format JSONEachRow \
        -q "SELECT * FROM table FORMAT Parquet" \
    > cali.snappy.pq
The following was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).
$ git clone https://github.com/domoritz/json2parquet/
$ cd json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' cargo build --release
$ /usr/bin/time -al \
    target/release/json2parquet \
        -c snappy \
        California.jsonl \
        California.snappy.pq
The above took 43.8 seconds to convert the JSONL into Parquet, producing an 815 MB file with 12 row groups.
In [1]: import pyarrow.parquet as pq
In [2]: pf = pq.ParquetFile('California.snappy.pq')
In [3]: pf.schema
Out[3]:
<pyarrow._parquet.ParquetSchema object at 0x109a11380>
required group field_id=-1 arrow_schema {
  optional binary field_id=-1 capture_dates_range (String);
  optional binary field_id=-1 geom (String);
  optional int64 field_id=-1 release;
}
In [4]: pf.metadata
Out[4]:
<pyarrow._parquet.FileMetaData object at 0x10adf09f0>
  created_by: parquet-rs version 29.0.0
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 12
  format_version: 1.0
  serialized_size: 7969
The ClickHouse-produced PQ file has 306 row groups.
In [1]: pf = pq.ParquetFile('cali.snappy.pq')
In [2]: pf.schema
Out[2]:
<pyarrow._parquet.ParquetSchema object at 0x105ccc940>
required group field_id=-1 schema {
  optional int64 field_id=-1 release;
  optional binary field_id=-1 capture_dates_range;
  optional binary field_id=-1 geom;
}
In [3]: pf.metadata
Out[3]:
<pyarrow._parquet.FileMetaData object at 0x1076705e0>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 306
  format_version: 1.0
  serialized_size: 228389
I'm not sure if the row group sizes played into the performance delta.
Is there anything I can do to my compilation settings to speed up json2parquet?
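On the row-group question: the underlying parquet crate does expose the row-group size through WriterProperties, so the 12 vs. 306 difference could be tested directly. A minimal sketch, assuming a recent parquet crate; whether json2parquet surfaces this setting on its CLI is an assumption on my part:

use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// ~11.5M rows across 306 groups is roughly 38K rows per group,
// which mimics the ClickHouse-produced file.
let props = WriterProperties::builder()
    .set_compression(Compression::SNAPPY)
    .set_max_row_group_size(38_000)
    .build();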
$ vi test.jsonl
{"area": 123, "geom": "", "centroid_x": -86.86346599122807, "centroid_y": 34.751296108771925, "h3_7": "872649315ffffff", "h3_8": "882649315dfffff", "h3_9": "892649315cbffff"}
$ json2parquet -c lz4 test.jsonl lz4.pq
$ ls -lth lz4.pq # 2.5K
$ hexdump -C lz4.pq | head; echo; hexdump -C lz4.pq | tail
00000000 50 41 52 31 15 00 15 1c 15 42 2c 15 02 15 00 15 |PAR1.....B,.....|
00000010 06 15 06 1c 58 08 7b 00 00 00 00 00 00 00 18 08 |....X.{.........|
00000020 7b 00 00 00 00 00 00 00 00 00 00 04 22 4d 18 44 |{..........."M.D|
00000030 40 5e 0e 00 00 80 02 00 00 00 02 01 7b 00 00 00 |@^..........{...|
00000040 00 00 00 00 00 00 00 00 e4 c0 1d d2 15 04 19 25 |...............%|
00000050 00 06 19 18 04 61 72 65 61 15 0a 16 02 16 6a 16 |.....area.....j.|
00000060 90 01 26 08 3c 58 08 7b 00 00 00 00 00 00 00 18 |..&.<X.{........|
00000070 08 7b 00 00 00 00 00 00 00 00 00 15 00 15 1c 15 |.{..............|
00000080 42 2c 15 02 15 00 15 06 15 06 1c 58 08 19 62 dc |B,.........X..b.|
00000090 06 43 b7 55 c0 18 08 19 62 dc 06 43 b7 55 c0 00 |.C.U....b..C.U..|
00000970 41 42 51 41 45 41 41 4f 41 41 38 41 42 41 41 41 |ABQAEAAOAA8ABAAA|
00000980 41 41 67 41 45 41 41 41 41 42 67 41 41 41 41 67 |AAgAEAAAABgAAAAg|
00000990 41 41 41 41 41 41 41 42 41 68 77 41 41 41 41 49 |AAAAAAABAhwAAAAI|
000009a0 41 41 77 41 42 41 41 4c 41 41 67 41 41 41 42 41 |AAwABAALAAgAAABA|
000009b0 41 41 41 41 41 41 41 41 41 51 41 41 41 41 41 45 |AAAAAAAAAQAAAAAE|
000009c0 41 41 41 41 59 58 4a 6c 59 51 41 41 41 41 41 3d |AAAAYXJlYQAAAAA=|
000009d0 00 18 19 70 61 72 71 75 65 74 2d 72 73 20 76 65 |...parquet-rs ve|
000009e0 72 73 69 6f 6e 20 32 33 2e 30 2e 30 00 fe 04 00 |rsion 23.0.0....|
000009f0 00 50 41 52 31 |.PAR1|
000009f5
$ ipython
In [1]: import pyarrow.parquet as pq
In [2]: pf = pq.ParquetFile('lz4.pq')
In [3]: pf
Out[3]: <pyarrow.parquet.core.ParquetFile at 0x10ca1cd90>
In [4]: pf.schema
Out[4]:
<pyarrow._parquet.ParquetSchema object at 0x10e74b280>
required group field_id=-1 arrow_schema {
  optional int64 field_id=-1 area;
  optional double field_id=-1 centroid_x;
  optional double field_id=-1 centroid_y;
  optional binary field_id=-1 geom (String);
  optional binary field_id=-1 h3_7 (String);
  optional binary field_id=-1 h3_8 (String);
  optional binary field_id=-1 h3_9 (String);
}
In [6]: pf.read()
# OSError: Corrupt Lz4 compressed data.
$ clickhouse client
CREATE TABLE pq_test (
    area Nullable(Int64),
    centroid_x Nullable(Float64),
    centroid_y Nullable(Float64),
    geom Nullable(String),
    h3_7 Nullable(String),
    h3_8 Nullable(String),
    h3_9 Nullable(String))
ENGINE = Log;
$ clickhouse client \
    --query='INSERT INTO pq_test FORMAT Parquet' \
    < lz4.pq
Code: 33. DB::ParsingException: Error while reading Parquet data: IOError: Corrupt Lz4 compressed data.: While executing ParquetBlockInputFormat: data for INSERT was parsed from stdin: (in query: INSERT INTO pq_test FORMAT Parquet). (CANNOT_READ_ALL_DATA)
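For what it's worth, this looks like the known ambiguity around the legacy LZ4 codec in Parquet: the framing parquet-rs writes is not what parquet-cpp based readers (pyarrow, ClickHouse) expect, and the LZ4_RAW codec was added to the format to resolve exactly this. A minimal sketch, assuming a parquet crate recent enough to expose LZ4_RAW:

use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// LZ4_RAW is the unframed variant that parquet-cpp based readers
// decode reliably; the legacy LZ4 codec's framing differed between
// implementations, which yields exactly this "corrupt data" error.
let props = WriterProperties::builder()
    .set_compression(Compression::LZ4_RAW)
    .build();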
As with csv2parquet, specifying the target schema rather than inferring it can be useful.
It would be nice if stdin were supported:
unzstd --memory=2048MB --stdout RS_2008-02.jsonl.zst | json2parquet out.parq
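One wrinkle: schema inference reads ahead over the input, and stdin cannot be rewound, so streaming input would likely need either buffering or an explicit schema. A hypothetical sketch of the usual convention of treating "-" as stdin; none of these names come from the actual code:

use std::fs::File;
use std::io::{self, BufReader, Read};

// Open either a real file, or stdin when the path is "-".
fn open_input(path: &str) -> io::Result<Box<dyn Read>> {
    if path == "-" {
        Ok(Box::new(io::stdin()))
    } else {
        Ok(Box::new(BufReader::new(File::open(path)?)))
    }
}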