
Comments (5)

domoritz commented on September 26, 2024

Huh, strange. max_read_records should only be used for schema inference. See

match arrow::json::reader::infer_json_schema(&mut buf_reader, opts.max_read_records) {

Do you see a bug there? Could you try stepping through the code and also actually check what is in the Parquet file?

from json2parquet.

cardi commented on September 26, 2024

Sure, let me take a look.

I don't have any experience with Rust, so it will take a little while for me to get familiar with it.


cardi commented on September 26, 2024

[...] check what is in the parquet file?

$ cat generate_parquets.sh
#!/usr/bin/env bash

./json2parquet --max-read-records 0 foo.json foo-0.parquet
./json2parquet --max-read-records 1 foo.json foo-1.parquet
./json2parquet --max-read-records 2 foo.json foo-2.parquet
./json2parquet --max-read-records 3 foo.json foo-3.parquet
./json2parquet --max-read-records 4 foo.json foo-4.parquet
./json2parquet                      foo.json foo-5.parquet

read_parquet.py:

#!/usr/bin/env python3

import pandas as pd

for i in range(6):
  fn = "foo-" + str(i) + ".parquet"
  df = pd.read_parquet(fn)
  print(fn)
  print(df)
  print()

output:

$ ./read_parquet.py
foo-0.parquet
Empty DataFrame
Columns: []
Index: []

foo-1.parquet
     key1
0  value1
1    None
2    None
3    None
4    None

foo-2.parquet
     key1    key2
0  value1    None
1    None  value2
2    None    None
3    None    None
4    None    None

foo-3.parquet
     key1    key2    key3
0  value1    None    None
1    None  value2    None
2    None    None  value3
3    None    None    None
4    None    None    None

foo-4.parquet
     key1    key2    key3    key4
0  value1    None    None    None
1    None  value2    None    None
2    None    None  value3    None
3    None    None    None  value4
4    None    None    None    None

foo-5.parquet
     key1    key2    key3    key4    key5
0  value1    None    None    None    None
1    None  value2    None    None    None
2    None    None  value3    None    None
3    None    None    None  value4    None
4    None    None    None    None  value5


cardi commented on September 26, 2024

I did some digging into arrow::json::reader::infer_json_schema, and it seems like it needs to read some records to build a schema and does not build a generic schema of string types.

(It's possible that arrow has some way of building a generic schema of all Strings, but I'd expect that it would still have to read through the entire JSON file.)

The definition of arrow::json::reader::infer_json_schema calls arrow::json::reader::infer_json_schema_from_iterator; this might be where columns are coerced into Strings if they don't match certain types?
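As a rough illustration of what record-limited inference does (a toy sketch in Python, not arrow's actual implementation): the inferred schema is just the union of keys seen in the first max_read_records lines, so keys that first appear later never make it into the schema, and a limit of 0 yields an empty schema.

```python
import json

def infer_schema(lines, max_read_records=None):
    """Toy stand-in for record-limited schema inference: union the
    keys (and value type names) of the first `max_read_records`
    newline-delimited JSON records. None means read everything."""
    fields = {}
    for i, line in enumerate(lines):
        if max_read_records is not None and i >= max_read_records:
            break
        for key, value in json.loads(line).items():
            fields[key] = type(value).__name__
    return fields

lines = ['{"key1":"value1"}', '{"key2":"value2"}', '{"key3":"value3"}']
print(infer_schema(lines, max_read_records=1))  # only key1 is seen
print(infer_schema(lines, max_read_records=0))  # empty schema
print(infer_schema(lines))                      # all keys are seen
```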

I've tested this with the following JSON file:

{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}

and with --max-read-records 1, the resulting Parquet is:

      key1
0   value1
1     None
2     None
3     None
4     None
5   value1
6   value1
7   value1
8   value1
9   value1
10  value1
11  value1
12  value1

For context, my use case for json2parquet is part of a workflow to convert network packet capture (.pcap) files --> JSON --> Parquet.

The converted JSON doesn't seem to have a consistent schema that can be inferred, e.g.,

Error: General("Error inferring schema: Json error: Expected scalar or scalar array JSON type, found: Object({ [...]

(It's very possible that I'm asking json2parquet to do something outside of a reasonable scope—I might have better success with selecting a few fields to extract from pcaps --> csv --> Parquet, which I'll try next.)
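One workaround I may try before falling back to CSV: pre-flatten the nested objects so that every line contains only scalar values, which is what the schema inference expects. A sketch (flatten_record is my own hypothetical helper, not part of json2parquet or arrow):

```python
import json

def flatten_record(obj, prefix=""):
    """Flatten nested JSON objects into dotted scalar keys, e.g.
    {"ip": {"src": "10.0.0.1"}} becomes {"ip.src": "10.0.0.1"}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_record(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

# One deeply nested record, as a pcap-to-JSON tool might emit.
record = {"layers": {"ip": {"src": "10.0.0.1", "dst": "10.0.0.2"}}}
print(json.dumps(flatten_record(record)))
```

Running every line through something like this before handing the file to json2parquet should at least avoid the "Expected scalar or scalar array JSON type, found: Object" error, though arrays and mixed types would still need handling.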


cardi commented on September 26, 2024

Oops—I might not have answered your earlier question.

My read of the code is that the following snippet puts a schema into schema:

json2parquet/src/main.rs

Lines 141 to 150 in 27dfa6a

match arrow::json::reader::infer_json_schema(&mut buf_reader, opts.max_read_records) {
    Ok(schema) => {
        input.seek(SeekFrom::Start(0))?;
        Ok(schema)
    }
    Err(error) => Err(ParquetError::General(format!(
        "Error inferring schema: {}",
        error
    ))),
}

schema is then used later to assist with the JSON reading:

json2parquet/src/main.rs

Lines 165 to 167 in 27dfa6a

let schema_ref = Arc::new(schema);
let builder = ReaderBuilder::new().with_schema(schema_ref);
let reader = builder.build(input)?;

I think the issue is arrow::json::reader::infer_json_schema does not return a "generic" schema of String types if max_read_records is set to 0, but instead returns an empty schema (and the resultant Parquet file would be empty):

{
  "fields": []
}

This seems like the expected (if not documented) behavior for arrow::json::reader::infer_json_schema, so perhaps it's just the usage statement that needs to be updated. (But it would be great to be able to generate a "generic" schema of sorts!)
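To make the idea concrete, a full-scan fallback could look like this sketch: read every line once, collect all keys ever seen, and declare each one a nullable string. This is plain Python, not arrow's API; an empty input would produce exactly the empty {"fields": []} schema shown above.

```python
import json

def generic_string_schema(lines):
    """Scan every newline-delimited JSON record and return a schema
    mapping each key ever seen to a nullable string ("utf8") field,
    preserving first-seen order."""
    fields = {}
    for line in lines:
        for key in json.loads(line):
            fields.setdefault(key, "utf8")
    return {"fields": [{"name": k, "type": t, "nullable": True}
                       for k, t in fields.items()]}

lines = ['{"key1":"value1"}', '{"key2":"value2"}', '{"key1":"value1"}']
print(generic_string_schema(lines))
```

The cost, as noted above, is that the whole file has to be read twice: once to build the schema and once to convert the records.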

