domoritz / json2parquet
Convert JSON files to Apache Parquet.
License: Apache License 2.0
Thank you for this great tool!
The documentation notes that setting --max-read-records to 0 will stop schema inference and set the type for all columns to strings (lines 32 to 33 at commit 27dfa6a).
The behavior I'm seeing is that json2parquet will only read the number of records set by --max-read-records. Is there another way to stop schema inference?
Some examples demonstrating the behavior:
$ cat foo.json
{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}
$ ./json2parquet -V
json2parquet 0.6.0
$ ./json2parquet --max-read-records 0 foo.json foo.parquet -p
Schema:
{
  "fields": []
}
$ du -h --apparent-size foo.parquet
183 foo.parquet
$ ./json2parquet --max-read-records 2 foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}
$ du -h --apparent-size foo.parquet
751 foo.parquet
$ ./json2parquet foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key3",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key4",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key5",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}
$ du -h --apparent-size foo.parquet
1.6K foo.parquet
Thank you for writing these utilities! I was looking for precisely this capability.
I was consistently getting only 1024 records in my output until I realized that reader.next() is only called once (line 180 at commit 1c46f9d). It needs to be wrapped in a loop to consume all the batches, right?
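If it helps, here is roughly the loop I would expect, as a sketch only: I'm assuming reader is an arrow json::Reader whose next() returns Result<Option<RecordBatch>> and writer is a parquet ArrowWriter; the actual identifiers in main.rs may differ.

// Sketch of the fix: drain every RecordBatch instead of calling
// next() a single time. `reader` (arrow json::Reader) and `writer`
// (parquet ArrowWriter) are assumptions, not the actual names.
while let Some(batch) = reader.next()? {
    writer.write(&batch)?;
}
writer.close()?;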
I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with both json2parquet and ClickHouse. json2parquet was 1.5x slower than ClickHouse at converting the records into Snappy-compressed Parquet.
I converted the original GeoJSON into JSONL with three fields per record. The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.
$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
I then converted that file into Snappy-compressed Parquet with ClickHouse which took 32 seconds and produced a file 793 MB in size.
$ cat California.jsonl \
    | clickhouse local \
        --input-format JSONEachRow \
        -q "SELECT * FROM table FORMAT Parquet" \
    > cali.snappy.pq
The following was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).
$ git clone https://github.com/domoritz/json2parquet/
$ cd json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' cargo build --release
$ /usr/bin/time -al \
    target/release/json2parquet \
        -c snappy \
        California.jsonl \
        California.snappy.pq
The above took 43.8 seconds to convert the JSONL into Parquet, producing an 815 MB file with 12 row groups.
In [1]: import pyarrow.parquet as pq
In [2]: pf = pq.ParquetFile('California.snappy.pq')
In [3]: pf.schema
Out[3]:
<pyarrow._parquet.ParquetSchema object at 0x109a11380>
required group field_id=-1 arrow_schema {
  optional binary field_id=-1 capture_dates_range (String);
  optional binary field_id=-1 geom (String);
  optional int64 field_id=-1 release;
}
In [4]: pf.metadata
Out[4]:
<pyarrow._parquet.FileMetaData object at 0x10adf09f0>
  created_by: parquet-rs version 29.0.0
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 12
  format_version: 1.0
  serialized_size: 7969
The ClickHouse-produced PQ file has 306 row groups.
In [1]: pf = pq.ParquetFile('cali.snappy.pq')
In [2]: pf.schema
Out[2]:
<pyarrow._parquet.ParquetSchema object at 0x105ccc940>
required group field_id=-1 schema {
  optional int64 field_id=-1 release;
  optional binary field_id=-1 capture_dates_range;
  optional binary field_id=-1 geom;
}
In [3]: pf.metadata
Out[3]:
<pyarrow._parquet.FileMetaData object at 0x1076705e0>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 306
  format_version: 1.0
  serialized_size: 228389
I'm not sure if the row group sizes played into the performance delta.
Is there anything I can do to my compilation settings to speed up json2parquet?
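On the row-group question: the underlying parquet crate does expose the row-group size through WriterProperties, so the 12 vs. 306 difference could be tested directly. A minimal sketch, assuming a recent parquet crate; whether json2parquet surfaces this setting on its CLI is an assumption on my part:

use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// ~11.5M rows across 306 groups is roughly 38K rows per group,
// which mimics the ClickHouse-produced file.
let props = WriterProperties::builder()
    .set_compression(Compression::SNAPPY)
    .set_max_row_group_size(38_000)
    .build();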
$ vi test.jsonl
{"area": 123, "geom": "", "centroid_x": -86.86346599122807, "centroid_y": 34.751296108771925, "h3_7": "872649315ffffff", "h3_8": "882649315dfffff", "h3_9": "892649315cbffff"}
$ json2parquet -c lz4 test.jsonl lz4.pq
$ ls -lth lz4.pq # 2.5K
$ hexdump -C lz4.pq | head; echo; hexdump -C lz4.pq | tail
00000000 50 41 52 31 15 00 15 1c 15 42 2c 15 02 15 00 15 |PAR1.....B,.....|
00000010 06 15 06 1c 58 08 7b 00 00 00 00 00 00 00 18 08 |....X.{.........|
00000020 7b 00 00 00 00 00 00 00 00 00 00 04 22 4d 18 44 |{..........."M.D|
00000030 40 5e 0e 00 00 80 02 00 00 00 02 01 7b 00 00 00 |@^..........{...|
00000040 00 00 00 00 00 00 00 00 e4 c0 1d d2 15 04 19 25 |...............%|
00000050 00 06 19 18 04 61 72 65 61 15 0a 16 02 16 6a 16 |.....area.....j.|
00000060 90 01 26 08 3c 58 08 7b 00 00 00 00 00 00 00 18 |..&.<X.{........|
00000070 08 7b 00 00 00 00 00 00 00 00 00 15 00 15 1c 15 |.{..............|
00000080 42 2c 15 02 15 00 15 06 15 06 1c 58 08 19 62 dc |B,.........X..b.|
00000090 06 43 b7 55 c0 18 08 19 62 dc 06 43 b7 55 c0 00 |.C.U....b..C.U..|
00000970 41 42 51 41 45 41 41 4f 41 41 38 41 42 41 41 41 |ABQAEAAOAA8ABAAA|
00000980 41 41 67 41 45 41 41 41 41 42 67 41 41 41 41 67 |AAgAEAAAABgAAAAg|
00000990 41 41 41 41 41 41 41 42 41 68 77 41 41 41 41 49 |AAAAAAABAhwAAAAI|
000009a0 41 41 77 41 42 41 41 4c 41 41 67 41 41 41 42 41 |AAwABAALAAgAAABA|
000009b0 41 41 41 41 41 41 41 41 41 51 41 41 41 41 41 45 |AAAAAAAAAQAAAAAE|
000009c0 41 41 41 41 59 58 4a 6c 59 51 41 41 41 41 41 3d |AAAAYXJlYQAAAAA=|
000009d0 00 18 19 70 61 72 71 75 65 74 2d 72 73 20 76 65 |...parquet-rs ve|
000009e0 72 73 69 6f 6e 20 32 33 2e 30 2e 30 00 fe 04 00 |rsion 23.0.0....|
000009f0 00 50 41 52 31 |.PAR1|
000009f5
$ ipython
In [1]: import pyarrow.parquet as pq
In [2]: pf = pq.ParquetFile('lz4.pq')
In [3]: pf
Out[3]: <pyarrow.parquet.core.ParquetFile at 0x10ca1cd90>
In [4]: pf.schema
Out[4]:
<pyarrow._parquet.ParquetSchema object at 0x10e74b280>
required group field_id=-1 arrow_schema {
  optional int64 field_id=-1 area;
  optional double field_id=-1 centroid_x;
  optional double field_id=-1 centroid_y;
  optional binary field_id=-1 geom (String);
  optional binary field_id=-1 h3_7 (String);
  optional binary field_id=-1 h3_8 (String);
  optional binary field_id=-1 h3_9 (String);
}
In [6]: pf.read()
# OSError: Corrupt Lz4 compressed data.
$ clickhouse client
CREATE TABLE pq_test (
    area Nullable(Int64),
    centroid_x Nullable(Float64),
    centroid_y Nullable(Float64),
    geom Nullable(String),
    h3_7 Nullable(String),
    h3_8 Nullable(String),
    h3_9 Nullable(String))
ENGINE = Log;
$ clickhouse client \
    --query='INSERT INTO pq_test FORMAT Parquet' \
    < lz4.pq
Code: 33. DB::ParsingException: Error while reading Parquet data: IOError: Corrupt Lz4 compressed data.: While executing ParquetBlockInputFormat: data for INSERT was parsed from stdin: (in query: INSERT INTO pq_test FORMAT Parquet). (CANNOT_READ_ALL_DATA)
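For what it's worth, this looks like the known ambiguity around the legacy LZ4 codec in Parquet: the framing parquet-rs writes is not what parquet-cpp based readers (pyarrow, ClickHouse) expect, and the LZ4_RAW codec was added to the format to resolve exactly this. A minimal sketch, assuming a parquet crate recent enough to expose LZ4_RAW:

use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

// LZ4_RAW is the unframed variant that parquet-cpp based readers
// decode reliably; the legacy LZ4 codec's framing differed between
// implementations, which yields exactly this "corrupt data" error.
let props = WriterProperties::builder()
    .set_compression(Compression::LZ4_RAW)
    .build();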
As with csv2parquet, specifying the target schema rather than inferring it can be useful.
It would be nice if stdin were supported:
unzstd --memory=2048MB --stdout RS_2008-02.jsonl.zst | json2parquet out.parq
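One wrinkle: schema inference reads ahead over the input, and stdin cannot be rewound, so streaming input would likely need either buffering or an explicit schema. A hypothetical sketch of the usual convention of treating "-" as stdin; none of these names come from the actual code:

use std::fs::File;
use std::io::{self, BufReader, Read};

// Open either a real file, or stdin when the path is "-".
fn open_input(path: &str) -> io::Result<Box<dyn Read>> {
    if path == "-" {
        Ok(Box::new(io::stdin()))
    } else {
        Ok(Box::new(BufReader::new(File::open(path)?)))
    }
}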