Comments (5)
Huh, strange. max_read_records should only be used for schema inference. See
Line 141 in 27dfa6a
from json2parquet.
Sure, let me take a look.
I don't have any experience with Rust, so it will take a while for me to get familiar with it.
[...] check what is in the parquet file?
$ cat generate_parquets.sh
#!/usr/bin/env bash
./json2parquet --max-read-records 0 foo.json foo-0.parquet
./json2parquet --max-read-records 1 foo.json foo-1.parquet
./json2parquet --max-read-records 2 foo.json foo-2.parquet
./json2parquet --max-read-records 3 foo.json foo-3.parquet
./json2parquet --max-read-records 4 foo.json foo-4.parquet
./json2parquet foo.json foo-5.parquet
read_parquet.py:
#!/usr/bin/env python3
import pandas as pd
for i in range(6):
    fn = "foo-" + str(i) + ".parquet"
    df = pd.read_parquet(fn)
    print(fn)
    print(df)
    print()
output:
$ ./read_parquet.py
foo-0.parquet
Empty DataFrame
Columns: []
Index: []
foo-1.parquet
key1
0 value1
1 None
2 None
3 None
4 None
foo-2.parquet
key1 key2
0 value1 None
1 None value2
2 None None
3 None None
4 None None
foo-3.parquet
key1 key2 key3
0 value1 None None
1 None value2 None
2 None None value3
3 None None None
4 None None None
foo-4.parquet
key1 key2 key3 key4
0 value1 None None None
1 None value2 None None
2 None None value3 None
3 None None None value4
4 None None None None
foo-5.parquet
key1 key2 key3 key4 key5
0 value1 None None None None
1 None value2 None None None
2 None None value3 None None
3 None None None value4 None
4 None None None None value5
I did some digging into arrow::json::reader::infer_json_schema, and it seems like it needs to read some records to build a schema; it does not build a generic schema of String types. (It's possible that arrow has some way of building a generic all-Strings schema, but I'd expect that it would still have to read through the entire JSON file.)
Here is the definition of arrow::json::reader::infer_json_schema, which calls arrow::json::reader::infer_json_schema_from_iterator; this might be where columns are coerced into Strings if they don't match certain types?
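For illustration, here is a minimal Python sketch (hypothetical, not arrow's actual implementation) of record-limited schema inference. It shows the behavior being discussed: keys that don't appear within the first max_read_records records never make it into the schema, and a limit of 0 yields an empty schema.

```python
import json
from itertools import islice

def infer_schema(lines, max_read_records=None):
    """Union the keys (and their JSON value types) seen in the first
    max_read_records lines; None means read every line."""
    fields = {}
    for line in islice(lines, max_read_records):
        for key, value in json.loads(line).items():
            fields.setdefault(key, type(value).__name__)
    return {"fields": [{"name": k, "type": t} for k, t in fields.items()]}

records = ['{"key1":"value1"}', '{"key2":"value2"}']
infer_schema(records, max_read_records=1)  # only key1 makes it into the schema
infer_schema(records, max_read_records=0)  # {'fields': []} -- an empty schema
```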
I've tested this with the following JSON file:
{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
and with --max-read-records 1, the resulting Parquet is:
key1
0 value1
1 None
2 None
3 None
4 None
5 value1
6 value1
7 value1
8 value1
9 value1
10 value1
11 value1
12 value1
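That output is consistent with the reader writing every record but projecting each one onto the inferred schema: keys outside the schema are dropped, and fields a record lacks come out as None. A rough Python sketch of that projection (hypothetical, not json2parquet's actual code):

```python
import json

# Schema inferred from the first record only, as with --max-read-records 1
schema_fields = ["key1"]

rows = []
for line in ['{"key1":"value1"}', '{"key2":"value2"}', '{"key1":"value1"}']:
    record = json.loads(line)
    # Keys absent from the schema (key2 here) are silently dropped;
    # fields the record lacks come out as None.
    rows.append({field: record.get(field) for field in schema_fields})

# rows == [{'key1': 'value1'}, {'key1': None}, {'key1': 'value1'}]
```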
For context, my use case for json2parquet is part of a workflow to convert network packet capture (.pcap) files --> JSON --> Parquet. The converted JSON doesn't seem to have a consistent schema that can be inferred, e.g.,
Error: General("Error inferring schema: Json error: Expected scalar or scalar array JSON type, found: Object({ [...]
(It's very possible that I'm asking json2parquet to do something outside of a reasonable scope; I might have better success selecting a few fields to extract from pcaps --> csv --> Parquet, which I'll try next.)
Oops—I might not have answered your earlier question.
My read of the code is that the following snippet puts a schema into schema:
Lines 141 to 150 in 27dfa6a
schema is then used later to assist with the JSON reading:
Lines 165 to 167 in 27dfa6a
I think the issue is that arrow::json::reader::infer_json_schema does not return a "generic" schema of String types if max_read_records is set to 0, but instead returns an empty schema (and the resulting Parquet file would be empty):
{
"fields": []
}
This seems like the expected (if undocumented) behavior for arrow::json::reader::infer_json_schema, so perhaps it's just the usage statement that needs to be updated. (But it would be great to be able to generate a "generic" schema of sorts!)