json2parquet's Introduction

JSON to Parquet

This repo is archived and the code moved to Arrow CLI Tools.

Convert JSON files to Apache Parquet. You may also be interested in csv2parquet, json2arrow, or csv2arrow.

Installation

Download prebuilt binaries

You can get the latest releases from https://github.com/domoritz/json2parquet/releases/.

With Cargo

cargo install json2parquet

Usage

Usage: json2parquet [OPTIONS] <JSON> <PARQUET>

Arguments:
  <JSON>     Input JSON file
  <PARQUET>  Output file

Options:
  -s, --schema-file <SCHEMA_FILE>
          File with Arrow schema in JSON format
      --max-read-records <MAX_READ_RECORDS>
          The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed
  -c, --compression <COMPRESSION>
          Set the compression [possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd]
  -e, --encoding <ENCODING>
          Sets encoding for any column [possible values: plain, rle, bit-packed, delta-binary-packed, delta-length-byte-array, delta-byte-array, rle-dictionary]
      --data-pagesize-limit <DATA_PAGESIZE_LIMIT>
          Sets data page size limit
      --dictionary-pagesize-limit <DICTIONARY_PAGESIZE_LIMIT>
          Sets dictionary page size limit
      --write-batch-size <WRITE_BATCH_SIZE>
          Sets write batch size
      --max-row-group-size <MAX_ROW_GROUP_SIZE>
          Sets max size for a row group
      --created-by <CREATED_BY>
          Sets "created by" property
      --dictionary
          Sets flag to enable/disable dictionary encoding for any column
      --statistics <STATISTICS>
          Sets flag to enable/disable statistics for any column [possible values: none, chunk, page]
      --max-statistics-size <MAX_STATISTICS_SIZE>
          Sets max statistics size for any column. Applicable only if statistics are enabled
  -p, --print-schema
          Print the schema to stderr
  -n, --dry
          Only print the schema
  -h, --help
          Print help information
  -V, --version
          Print version information

The --schema-file option uses the same file format as --dry and --print-schema.
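
One possible round trip (a sketch: -p is documented to print the schema to stderr, and the redirection below assumes --dry does the same; file names are placeholders):

json2parquet --dry input.json out.parquet 2> schema.json
# adjust types in schema.json as needed, then:
json2parquet -s schema.json input.json out.parquet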

Limitations

Since this tool uses the Arrow JSON loader, it is limited to what that loader supports. Right now, that means line-delimited JSON (JSON Lines) files, for example:

{ "a": 42, "b": true }
{ "a": 12, "b": false }
{ "a": 7, "b": true }
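
A minimal invocation for such a file might look like the following (file names and the compression choice are illustrative):

json2parquet -c zstd data.jsonl data.parquet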

json2parquet's People

Contributors

andyredhead, dependabot[bot], domoritz, erichutchins, github-actions[bot], yurivish

json2parquet's Issues

Performance improvement for writing Snappy-compressed Parquet files?

I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with both json2parquet and ClickHouse. json2parquet turned out to be about 1.5x slower than ClickHouse at producing Snappy-compressed Parquet.

I converted the original GeoJSON into JSONL with three fields per record. The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.

$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}

I then converted that file into Snappy-compressed Parquet with ClickHouse, which took 32 seconds and produced a 793 MB file.

$ cat California.jsonl \
    | clickhouse local \
          --input-format JSONEachRow \
          -q "SELECT *
              FROM table
              FORMAT Parquet" \
    > cali.snappy.pq

The following was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).

$ git clone https://github.com/domoritz/json2parquet/
$ cd json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' cargo build --release
$ /usr/bin/time -al \
        target/release/json2parquet \
        -c snappy \
        California.jsonl \
        California.snappy.pq

The above took 43.8 seconds to convert the JSONL into Parquet and produced an 815 MB file containing 12 row groups.

In [1]: import pyarrow.parquet as pq

In [2]: pf = pq.ParquetFile('California.snappy.pq')

In [3]: pf.schema
Out[3]: 
<pyarrow._parquet.ParquetSchema object at 0x109a11380>
required group field_id=-1 arrow_schema {
  optional binary field_id=-1 capture_dates_range (String);
  optional binary field_id=-1 geom (String);
  optional int64 field_id=-1 release;
}

In [4]: pf.metadata
Out[4]: 
<pyarrow._parquet.FileMetaData object at 0x10adf09f0>
  created_by: parquet-rs version 29.0.0
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 12
  format_version: 1.0
  serialized_size: 7969

The ClickHouse-produced PQ file has 306 row groups.

In [1]: pf = pq.ParquetFile('cali.snappy.pq')

In [2]: pf.schema
Out[2]: 
<pyarrow._parquet.ParquetSchema object at 0x105ccc940>
required group field_id=-1 schema {
  optional int64 field_id=-1 release;
  optional binary field_id=-1 capture_dates_range;
  optional binary field_id=-1 geom;
}

In [3]: pf.metadata
Out[3]: 
<pyarrow._parquet.FileMetaData object at 0x1076705e0>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 306
  format_version: 1.0
  serialized_size: 228389

I'm not sure if the row group sizes played into the performance delta.

Is there anything I can do to my compilation settings to speed up json2parquet?
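
If the row-group layout is the relevant factor, the existing --max-row-group-size flag should at least make the two files comparable: 11,542,912 rows over 306 groups is roughly 38,000 rows per group, so something like the following (value illustrative) would mimic the ClickHouse layout:

$ target/release/json2parquet \
        -c snappy \
        --max-row-group-size 38000 \
        California.jsonl \
        California.rg38k.snappy.pq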

LZ4-compressed PQ files unreadable by Pandas and ClickHouse

Versions:

  • json2parquet 0.6.0
  • PyArrow 10.0.1
  • ClickHouse 22.13.1.1119
$ vi test.jsonl
{"area": 123, "geom": "", "centroid_x": -86.86346599122807, "centroid_y": 34.751296108771925, "h3_7": "872649315ffffff", "h3_8": "882649315dfffff", "h3_9": "892649315cbffff"}
$ json2parquet -c lz4 test.jsonl lz4.pq
$ ls -lth lz4.pq # 2.5K
$ hexdump -C lz4.pq | head; echo; hexdump -C lz4.pq | tail
00000000  50 41 52 31 15 00 15 1c  15 42 2c 15 02 15 00 15  |PAR1.....B,.....|
00000010  06 15 06 1c 58 08 7b 00  00 00 00 00 00 00 18 08  |....X.{.........|
00000020  7b 00 00 00 00 00 00 00  00 00 00 04 22 4d 18 44  |{..........."M.D|
00000030  40 5e 0e 00 00 80 02 00  00 00 02 01 7b 00 00 00  |@^..........{...|
00000040  00 00 00 00 00 00 00 00  e4 c0 1d d2 15 04 19 25  |...............%|
00000050  00 06 19 18 04 61 72 65  61 15 0a 16 02 16 6a 16  |.....area.....j.|
00000060  90 01 26 08 3c 58 08 7b  00 00 00 00 00 00 00 18  |..&.<X.{........|
00000070  08 7b 00 00 00 00 00 00  00 00 00 15 00 15 1c 15  |.{..............|
00000080  42 2c 15 02 15 00 15 06  15 06 1c 58 08 19 62 dc  |B,.........X..b.|
00000090  06 43 b7 55 c0 18 08 19  62 dc 06 43 b7 55 c0 00  |.C.U....b..C.U..|

00000970  41 42 51 41 45 41 41 4f  41 41 38 41 42 41 41 41  |ABQAEAAOAA8ABAAA|
00000980  41 41 67 41 45 41 41 41  41 42 67 41 41 41 41 67  |AAgAEAAAABgAAAAg|
00000990  41 41 41 41 41 41 41 42  41 68 77 41 41 41 41 49  |AAAAAAABAhwAAAAI|
000009a0  41 41 77 41 42 41 41 4c  41 41 67 41 41 41 42 41  |AAwABAALAAgAAABA|
000009b0  41 41 41 41 41 41 41 41  41 51 41 41 41 41 41 45  |AAAAAAAAAQAAAAAE|
000009c0  41 41 41 41 59 58 4a 6c  59 51 41 41 41 41 41 3d  |AAAAYXJlYQAAAAA=|
000009d0  00 18 19 70 61 72 71 75  65 74 2d 72 73 20 76 65  |...parquet-rs ve|
000009e0  72 73 69 6f 6e 20 32 33  2e 30 2e 30 00 fe 04 00  |rsion 23.0.0....|
000009f0  00 50 41 52 31                                    |.PAR1|
000009f5
$ ipython
In [1]: import pyarrow.parquet as pq

In [2]: pf = pq.ParquetFile('lz4.pq')

In [3]: pf
Out[3]: <pyarrow.parquet.core.ParquetFile at 0x10ca1cd90>

In [4]: pf.schema
Out[4]:
<pyarrow._parquet.ParquetSchema object at 0x10e74b280>
required group field_id=-1 arrow_schema {
  optional int64 field_id=-1 area;
  optional double field_id=-1 centroid_x;
  optional double field_id=-1 centroid_y;
  optional binary field_id=-1 geom (String);
  optional binary field_id=-1 h3_7 (String);
  optional binary field_id=-1 h3_8 (String);
  optional binary field_id=-1 h3_9 (String);
}

In [6]: pf.read()

# OSError: Corrupt Lz4 compressed data.
$ clickhouse client
CREATE TABLE pq_test (
    area Nullable(Int64),
    centroid_x Nullable(Float64),
    centroid_y Nullable(Float64),
    geom Nullable(String),
    h3_7 Nullable(String),
    h3_8 Nullable(String),
    h3_9 Nullable(String))
ENGINE = Log;
$ clickhouse client \
    --query='INSERT INTO pq_test FORMAT Parquet' \
    < lz4.pq
Code: 33. DB::ParsingException: Error while reading Parquet data: IOError: Corrupt Lz4 compressed data.: While executing ParquetBlockInputFormat: data for INSERT was parsed from stdin: (in query: INSERT INTO pq_test FORMAT Parquet). (CANNOT_READ_ALL_DATA)
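
As a sanity check (not a fix), writing the same input with another codec from the -c list shows whether the corruption is specific to the LZ4 path; the output file name below is arbitrary:

$ json2parquet -c zstd test.jsonl zstd.pq
$ python3 -c "import pyarrow.parquet as pq; print(pq.ParquetFile('zstd.pq').read())"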

Specify schema

As with csv2parquet, it can be useful to specify the target schema rather than inferring it.
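
The -s/--schema-file option in the usage above appears to address this; assuming it behaves like the csv2parquet equivalent, usage would look like the following (file names illustrative, with the schema file written in the same JSON format that -p prints):

json2parquet -s schema.json input.json output.parquet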

Reader stops after first batch

match reader.next() {

Thank you for writing these utilities! I was looking for precisely this capability.

I was consistently only getting 1024 records in my output until I realized the reader.next() function is only getting called once here. It needs to be wrapped in a loop to consume all the batches, right?
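
A minimal sketch of that fix, assuming reader is arrow's JSON Reader (whose next() returned Result<Option<RecordBatch>> in the arrow versions of that era) and writer is the parquet ArrowWriter the tool already sets up:

while let Some(batch) = reader.next()? {
    // consume every batch the reader produces, not just the first one
    writer.write(&batch)?;
}
writer.close()?;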

setting `--max-read-records 0` reads zero records

Thank you for this great tool!

The documentation notes that setting --max-read-records to 0 will stop schema inference and set the type for all columns to strings:

json2parquet/Readme.md, lines 32 to 33 at 27dfa6a:

--max-read-records <MAX_READ_RECORDS>
The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed

The behavior I'm seeing is that json2parquet only reads the number of records set by --max-read-records, so with 0 it reads (and writes) no records at all.

Is there another way to stop schema inference?

Some examples demonstrating the behavior:

$ cat foo.json
{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}

$ ./json2parquet -V
json2parquet 0.6.0

$ ./json2parquet --max-read-records 0 foo.json foo.parquet -p
Schema:
{
  "fields": []
}

$ du -h --apparent-size foo.parquet
183     foo.parquet

$ ./json2parquet --max-read-records 2 foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}

$ du -h --apparent-size foo.parquet
751     foo.parquet

$ ./json2parquet foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key3",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key4",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key5",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}

$ du -h --apparent-size foo.parquet
1.6K    foo.parquet
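
Until this is resolved, one possible workaround (assuming --schema-file takes precedence over inference) is to hand-write an all-Utf8 schema in the same format that -p prints and pass it with -s; only two of the five keys are shown here for brevity:

$ cat string-schema.json
{
  "fields": [
    {"name": "key1", "data_type": "Utf8", "nullable": true, "dict_id": 0, "dict_is_ordered": false},
    {"name": "key2", "data_type": "Utf8", "nullable": true, "dict_id": 0, "dict_is_ordered": false}
  ]
}
$ ./json2parquet -s string-schema.json foo.json foo.parquet -p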

read jsonl from stdin

It would be nice if stdin was supported:

unzstd --memory=2048MB --stdout RS_2008-02.jsonl.zst | json2parquet out.parq
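
Until then, decompressing to a temporary file first is a straightforward workaround:

unzstd --memory=2048MB --stdout RS_2008-02.jsonl.zst > /tmp/RS_2008-02.jsonl
json2parquet /tmp/RS_2008-02.jsonl out.parq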
