
Comments (5)

domoritz commented on September 26, 2024

Huh, strange. max_read_records should only be used for schema inference. See

match arrow::json::reader::infer_json_schema(&mut buf_reader, opts.max_read_records) {

Do you see a bug there? Could you try stepping through the code and also actually check what is in the Parquet file?

from json2parquet.

cardi commented on September 26, 2024

Sure, let me take a look.

I don't have any experience with Rust, so it will take a little while for me to get familiar with it.


cardi commented on September 26, 2024

[...] check what is in the parquet file?

$ cat generate_parquets.sh
#!/usr/bin/env bash

./json2parquet --max-read-records 0 foo.json foo-0.parquet
./json2parquet --max-read-records 1 foo.json foo-1.parquet
./json2parquet --max-read-records 2 foo.json foo-2.parquet
./json2parquet --max-read-records 3 foo.json foo-3.parquet
./json2parquet --max-read-records 4 foo.json foo-4.parquet
./json2parquet                      foo.json foo-5.parquet

read_parquet.py:

#!/usr/bin/env python3

import pandas as pd

for i in range(6):
  fn = "foo-" + str(i) + ".parquet"
  df = pd.read_parquet(fn)
  print(fn)
  print(df)
  print()

output:

$ ./read_parquet.py
foo-0.parquet
Empty DataFrame
Columns: []
Index: []

foo-1.parquet
     key1
0  value1
1    None
2    None
3    None
4    None

foo-2.parquet
     key1    key2
0  value1    None
1    None  value2
2    None    None
3    None    None
4    None    None

foo-3.parquet
     key1    key2    key3
0  value1    None    None
1    None  value2    None
2    None    None  value3
3    None    None    None
4    None    None    None

foo-4.parquet
     key1    key2    key3    key4
0  value1    None    None    None
1    None  value2    None    None
2    None    None  value3    None
3    None    None    None  value4
4    None    None    None    None

foo-5.parquet
     key1    key2    key3    key4    key5
0  value1    None    None    None    None
1    None  value2    None    None    None
2    None    None  value3    None    None
3    None    None    None  value4    None
4    None    None    None    None  value5


cardi commented on September 26, 2024

I did some digging into arrow::json::reader::infer_json_schema, and it seems like it needs to read some records to build a schema and does not build a generic schema of string types.

(It's possible that arrow has some way of building a generic schema of all Strings, but I'd expect that it would still have to read through the entire JSON file.)

The definition of arrow::json::reader::infer_json_schema calls arrow::json::reader::infer_json_schema_from_iterator; this might be where columns are coerced into Strings if they don't match certain types?
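As a rough illustration of what record-limited inference does (a toy sketch in Python, not arrow's actual implementation): the inferred schema is just the union of keys seen in the first max_read_records lines, so keys that first appear later never make it into the schema, and a limit of 0 yields an empty schema.

```python
import json

def infer_schema(lines, max_read_records=None):
    """Toy stand-in for record-limited schema inference: union the
    keys (and value type names) of the first `max_read_records`
    newline-delimited JSON records. None means read everything."""
    fields = {}
    for i, line in enumerate(lines):
        if max_read_records is not None and i >= max_read_records:
            break
        for key, value in json.loads(line).items():
            fields[key] = type(value).__name__
    return fields

lines = ['{"key1":"value1"}', '{"key2":"value2"}', '{"key3":"value3"}']
print(infer_schema(lines, max_read_records=1))  # only key1 is seen
print(infer_schema(lines, max_read_records=0))  # empty schema
print(infer_schema(lines))                      # all keys are seen
```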

I've tested this with the following JSON file:

{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}

and with --max-read-records 1, the resulting Parquet is:

      key1
0   value1
1     None
2     None
3     None
4     None
5   value1
6   value1
7   value1
8   value1
9   value1
10  value1
11  value1
12  value1

For context, my use case for json2parquet is part of a workflow to convert network packet capture (.pcap) files --> JSON --> Parquet.

The converted JSON doesn't seem to have a consistent schema that can be inferred, e.g.,

Error: General("Error inferring schema: Json error: Expected scalar or scalar array JSON type, found: Object({ [...]

(It's very possible that I'm asking json2parquet to do something outside of a reasonable scope—I might have better success with selecting a few fields to extract from pcaps --> csv --> Parquet, which I'll try next.)
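One workaround I may try before falling back to CSV: pre-flatten the nested objects so that every line contains only scalar values, which is what the schema inference expects. A sketch (flatten_record is my own hypothetical helper, not part of json2parquet or arrow):

```python
import json

def flatten_record(obj, prefix=""):
    """Flatten nested JSON objects into dotted scalar keys, e.g.
    {"ip": {"src": "10.0.0.1"}} becomes {"ip.src": "10.0.0.1"}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_record(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

# One deeply nested record, as a pcap-to-JSON tool might emit.
record = {"layers": {"ip": {"src": "10.0.0.1", "dst": "10.0.0.2"}}}
print(json.dumps(flatten_record(record)))
```

Running every line through something like this before handing the file to json2parquet should at least avoid the "Expected scalar or scalar array JSON type, found: Object" error, though arrays and mixed types would still need handling.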


cardi commented on September 26, 2024

Oops—I might not have answered your earlier question.

My read of the code is that the following snippet puts a schema into schema:

json2parquet/src/main.rs

Lines 141 to 150 in 27dfa6a

match arrow::json::reader::infer_json_schema(&mut buf_reader, opts.max_read_records) {
    Ok(schema) => {
        input.seek(SeekFrom::Start(0))?;
        Ok(schema)
    }
    Err(error) => Err(ParquetError::General(format!(
        "Error inferring schema: {}",
        error
    ))),
}

schema is then used later to assist with the JSON reading:

json2parquet/src/main.rs

Lines 165 to 167 in 27dfa6a

let schema_ref = Arc::new(schema);
let builder = ReaderBuilder::new().with_schema(schema_ref);
let reader = builder.build(input)?;

I think the issue is arrow::json::reader::infer_json_schema does not return a "generic" schema of String types if max_read_records is set to 0, but instead returns an empty schema (and the resultant Parquet file would be empty):

{
  "fields": []
}

This seems like the expected (if not documented) behavior for arrow::json::reader::infer_json_schema, so perhaps it's just the usage statement that needs to be updated. (But it would be great to be able to generate a "generic" schema of sorts!)
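To make the idea concrete, a full-scan fallback could look like this sketch: read every line once, collect all keys ever seen, and declare each one a nullable string. This is plain Python, not arrow's API; an empty input would produce exactly the empty {"fields": []} schema shown above.

```python
import json

def generic_string_schema(lines):
    """Scan every newline-delimited JSON record and return a schema
    mapping each key ever seen to a nullable string ("utf8") field,
    preserving first-seen order."""
    fields = {}
    for line in lines:
        for key in json.loads(line):
            fields.setdefault(key, "utf8")
    return {"fields": [{"name": k, "type": t, "nullable": True}
                       for k, t in fields.items()]}

lines = ['{"key1":"value1"}', '{"key2":"value2"}', '{"key1":"value1"}']
print(generic_string_schema(lines))
```

The cost, as noted above, is that the whole file has to be read twice: once to build the schema and once to convert the records.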

