queryverse / ParquetFiles.jl
FileIO.jl integration for Parquet files
License: Other
(v1.0) pkg> add ParquetFiles
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package ParquetFiles [46a55296]:
ParquetFiles [46a55296] log:
├─possible versions are: 0.0.1 or uninstalled
├─restricted to versions * by an explicit requirement, leaving only versions 0.0.1
└─restricted by julia compatibility requirements to versions: uninstalled — no versions left
(v1.0) pkg> add ParquetFiles#master
Updating git-repo `https://github.com/queryverse/ParquetFiles.jl.git`
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Parquet [626c502c]:
Parquet [626c502c] log:
├─possible versions are: 0.1.0 or uninstalled
├─restricted to versions 0.1.0-* by ParquetFiles [46a55296], leaving only versions 0.1.0
│ └─ParquetFiles [46a55296] log:
│ ├─possible versions are: 0.0.1 or uninstalled
│ └─ParquetFiles [46a55296] is fixed to version 0.0.1+
└─restricted by julia compatibility requirements to versions: uninstalled — no versions left
Data
https://github.com/xiaodaigh/parquet-data-collection/blob/master/dsd50p.parquet
I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
using ParquetFiles
a = "\\\\wsl\$\\Ubuntu-18.04\\git\\aia-engine-data-wrangling\\test\\data\\breast_cancer_with_schema\\BreastCancer.parquet"
print(load(a))
which yields the error
KeyError: key 5 not found
getindex(::Dict{Int64,Thrift.ThriftMetaAttribs}, ::Int64) at dict.jl:477
read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.RowGroup) at base.jl:181
read at base.jl:169 [inlined]
read at base.jl:167 [inlined]
read_container(::Thrift.TCompactProtocol, ::Array{Parquet.PAR2.RowGroup,1}) at base.jl:369
read_container(::Thrift.TCompactProtocol, ::Type{Array{Parquet.PAR2.RowGroup,1}}) at base.jl:168
read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.FileMetaData) at base.jl:190
read at base.jl:169 [inlined]
read at base.jl:167 [inlined]
read_thrift(::IOStream, ::Type{Parquet.PAR2.FileMetaData}) at reader.jl:324
metadata(::IOStream, ::String, ::Int32) at reader.jl:339
ParFile(::String, ::IOStream; maxcache::Int64) at reader.jl:57
ParFile at reader.jl:55 [inlined]
ParFile(::String) at reader.jl:46
top-level scope at look_for_parquet.jl:32
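As a diagnostic, the FileIO layer can be bypassed and the file opened with Parquet.jl's own pre-0.5 `ParFile` API. The `KeyError` above is raised while parsing the Thrift footer, so if the metadata itself is the problem it should reproduce here too. This is a sketch assuming Parquet 0.3.x and a local copy of the failing file:

```julia
using Parquet

# Substitute the path of the failing file here.
p = ParFile("BreastCancer.parquet")  # the Thrift footer is parsed on open
println(schema(p))                   # if opening succeeds, inspect the schema
```

If `ParFile` throws the same `KeyError`, the problem is in Parquet.jl's Thrift metadata reader rather than in ParquetFiles.jl itself.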
Data Examples
https://github.com/xiaodaigh/parquet-data-collection/blob/master/new-york-city-bike-share-dataset.parquet
I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
using ParquetFiles
a = "fannie_mae_perf_small.parquet"
print(load(a))
which yields the error
Base.Meta.ParseError("extra token \"Duration\" after end of expression")
parse(::String, ::Int64; greedy::Bool, raise::Bool, depwarn::Bool) at meta.jl:184
parse at meta.jl:176 [inlined]
parse(::String; raise::Bool, depwarn::Bool) at meta.jl:215
parse at meta.jl:215 [inlined]
schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at schema.jl:230
schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at schema.jl:224
schema at reader.jl:66 [inlined]
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:65
show(::Base.TTY, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
print(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:35
print(::Base.TTY, ::ParquetFiles.ParquetFile, ::Char) at io.jl:46
println(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:73
println(::ParquetFiles.ParquetFile) at coreio.jl:4
top-level scope at look_for_parquet.jl:34
julia> @time versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Pentium(R) 4 CPU 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, nocona)
0.951149 seconds (1.18 M allocations: 53.974 MiB, 4.30% gc time)
julia> @time Pkg.add("Queryverse")
Resolving package versions...
Updating `~/.julia/environments/v1.4/Project.toml`
[no changes]
Updating `~/.julia/environments/v1.4/Manifest.toml`
[no changes]
3.222060 seconds (2.55 M allocations: 168.544 MiB, 5.66% gc time)
julia> @time using Queryverse
[ Info: Precompiling Queryverse [612083be-0b0f-5412-89c1-4e7c75506a58]
ERROR: LoadError: UndefVarError: RecCursor not defined
Stacktrace:
[1] top-level scope at /home/c/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:26
[2] top-level scope at none:2
[3] eval at ./boot.jl:331 [inlined]
in expression starting at /home/c/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:26
ERROR: LoadError: Failed to precompile ParquetFiles [46a55296-af5a-53b0-aaa0-97023b66127f] to /home/c/.julia/compiled/v1.4/ParquetFiles/WDBBU_g2XoO.ji.
Stacktrace:
[1] top-level scope at none:2
[2] eval at ./boot.jl:331 [inlined]
in expression starting at /home/c/.julia/packages/Queryverse/ysqbZ/src/Queryverse.jl:15
ERROR: Failed to precompile Queryverse [612083be-0b0f-5412-89c1-4e7c75506a58] to /home/c/.julia/compiled/v1.4/Queryverse/hLJnW_g2XoO.ji.
Stacktrace:
[1] compilecache(::Base.PkgId, ::String) at ./loading.jl:1272
[2] _require(::Base.PkgId) at ./loading.jl:1029
[3] require(::Base.PkgId) at ./loading.jl:927
[4] require(::Module, ::Symbol) at ./loading.jl:922
[5] top-level scope at util.jl:175
Reading a parquet file into a DataFrame is ~170× slower than using CSV.read with the same data. I am not sure I can help improve performance, but this is limiting my use of ParquetFiles.jl.
MWE:
(@v1.4) pkg> st
Status `~/.julia/environments/v1.4/Project.toml`
[6e4b80f9] BenchmarkTools v0.5.0
[336ed68f] CSV v0.6.2
[a93c6f00] DataFrames v0.21.2
[626c502c] Parquet v0.4.0
[46a55296] ParquetFiles v0.2.0
using ParquetFiles, BenchmarkTools, CSV, DataFrames
CSV.read("data.csv")
DataFrame(load("data.parquet"))
Loading times for ParquetFiles
@benchmark DataFrame(load("data.parquet"))
BenchmarkTools.Trial:
memory estimate: 45.66 MiB
allocs estimate: 961290
--------------
minimum time: 287.492 ms (0.00% GC)
median time: 290.843 ms (0.00% GC)
mean time: 296.344 ms (1.64% GC)
maximum time: 326.041 ms (8.46% GC)
--------------
samples: 17
evals/sample: 1
Loading times for CSV:
@benchmark CSV.read("data.csv")
BenchmarkTools.Trial:
memory estimate: 758.14 KiB
allocs estimate: 2299
--------------
minimum time: 1.690 ms (0.00% GC)
median time: 1.735 ms (0.00% GC)
mean time: 1.772 ms (1.43% GC)
maximum time: 14.096 ms (63.93% GC)
--------------
samples: 2817
evals/sample: 1
As compared to pandas:
import pandas as pd
%timeit pd.read_parquet("data.parquet")
# 3.61 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_csv("data.csv")
# 4.73 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Data are included in the zip file:
data.zip
Data Examples
https://github.com/xiaodaigh/parquet-data-collection/blob/master/noshowappointments.parquet
I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
using ParquetFiles
a = "fannie_mae_perf_small.parquet"
print(load(a))
which yields the error
TypeError: in typeassert, expected Array{UInt8,1}, got typeof(show)
top-level scope at none:2
eval at boot.jl:331 [inlined]
schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at schema.jl:231
schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at schema.jl:224
schema at reader.jl:66 [inlined]
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:65
show(::Base.TTY, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
print(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:35
print(::Base.TTY, ::ParquetFiles.ParquetFile, ::Char) at io.jl:46
println(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:73
println(::ParquetFiles.ParquetFile) at coreio.jl:4
top-level scope at look_for_parquet.jl:35
Parquet.jl has been updated and RecCursor is no longer exported, so ParquetFiles is broken.
It should be updated to use RecordCursor, but it is currently unusable.
This, of course, breaks Queryverse as well.
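Until that port happens, one workaround is to hold Parquet back to a release that still exports RecCursor. This is a sketch; the exact version to pin is an assumption, so check the compat bounds of the ParquetFiles release you have installed:

```julia
using Pkg

# Pin Parquet to an older release that still exports RecCursor.
# v0.4.0 is an assumption; adjust to match ParquetFiles' compat bounds.
Pkg.add(PackageSpec(name = "Parquet", version = v"0.4.0"))
Pkg.pin("Parquet")
```

Pinning prevents a later `Pkg.update()` from silently pulling in the incompatible Parquet release again.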
Once the file is registered with FileIO
I am working with a series of large Parquet files (which I cannot share), and there seems to be a weird error when reading them:
julia> DataFrame(load("current.parquet"))
ERROR: UndefRefError: access to undefined reference
The stacktrace is as follows:
Stacktrace:
[1] getproperty(::ParquetFiles.RCType276, ::Symbol) at ./Base.jl:33
[2] macro expansion at /home/tpoisot/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:48 [inlined]
[3] iterate(::ParquetFiles.ParquetNamedTupleIterator{NamedTuple{(:column, :names, :go, :here),Tuple{String,Int32,String,String,Int32,Int32,Int32,Int32,Int32}},ParquetFiles.RCType276}, ::Int64) at /home/tpoisot/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:39
[4] iterate at /home/tpoisot/.julia/packages/Tables/xHhzi/src/tofromdatavalues.jl:53 [inlined]
[5] iterate at ./iterators.jl:139 [inlined]
[6] buildcolumns(::Tables.Schema{(:column, :names, :go, :here),Tuple{String,Int32,String,String,Int32,Int32,Int32,Int32,Int32}}, ::Tables.IteratorWrapper{ParquetFiles.ParquetNamedTupleIterator{NamedTuple{(:column, :names, :go, :here),Tuple{String,Int32,String,String,Int32,Int32,Int32,Int32,Int32}},ParquetFiles.RCType276}}) at /home/tpoisot/.julia/packages/Tables/xHhzi/src/fallbacks.jl:127
[7] columns at /home/tpoisot/.julia/packages/Tables/xHhzi/src/fallbacks.jl:237 [inlined]
[8] DataFrame(::ParquetFiles.ParquetFile; copycols::Bool) at /home/tpoisot/.julia/packages/DataFrames/GtZ1l/src/other/tables.jl:43
[9] DataFrame(::ParquetFiles.ParquetFile) at /home/tpoisot/.julia/packages/DataFrames/GtZ1l/src/other/tables.jl:34
[10] top-level scope at REPL[30]:1
Interestingly, as far as I can tell, the entire file is loaded, but converting to a DataFrame or saving to CSV throws the same error. My guess is that the last line somehow has characters it should not?
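The UndefRefError pattern is what you get when iteration reads a record field that was never assigned, e.g. a null value landing in a slot whose Julia type does not allow `missing`. A minimal, package-free sketch of the same failure mode, assuming that is the cause here:

```julia
# A freshly allocated Vector of a non-isbits type has unassigned (#undef) slots.
v = Vector{String}(undef, 2)
v[1] = "assigned"

println(isassigned(v, 2))  # false
v[2]                       # ERROR: UndefRefError: access to undefined reference
```

This matches the stack trace above: `getproperty` on a partially constructed `RCType276` record hits an unassigned field.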
I was able to read the file easily using Python and pandas; however, in Julia the same file produces the following error:
ERROR: LoadError: Base.Meta.ParseError("extra token \"Vector\" after end of expression")
I used Parquet.jl to try to load the file, and I get the following schema:
Schema:
required schema {
  optional INT64 timestamp# (from TIMESTAMP_MILLIS)
  optional INT64 t_datetime# (from TIMESTAMP_MILLIS)
  optional BYTE_ARRAY id_string# (from UTF8)
  optional DOUBLE v_double
  optional DOUBLE q_double
  optional BYTE_ARRAY alias_string# (from UTF8)
}
However, when I run
schema(JuliaConverter(Main), Pq, :Tags)
I get the same error as before.
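The common root cause in these parse errors is that `schema_to_julia_types` splices raw column names into generated Julia source; any name that is not a valid identifier (a trailing `#`, a dot as in `Cl.thickness`, or an embedded space) either fails `Meta.parse` or does not produce a field Symbol. A small sketch of the failure and one possible fix — the `sanitize` helper is hypothetical, not part of either package:

```julia
# A column name with a space cannot be parsed as a single identifier:
err = try
    Meta.parse("Monthly Duration")   # hypothetical column name
catch e
    e
end
println(err isa Base.Meta.ParseError)  # true

# Hypothetical fix: map raw names onto valid Symbols before code generation.
sanitize(name) = Symbol(replace(name, r"[^A-Za-z0-9_]" => "_"))
println(sanitize("Monthly Duration"))  # Monthly_Duration
println(sanitize("Cl.thickness"))      # Cl_thickness
println(sanitize("timestamp#"))        # timestamp_
```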
Data Examples
https://github.com/xiaodaigh/parquet-data-collection/blob/master/fannie_mae_perf_small.parquet
https://github.com/xiaodaigh/parquet-data-collection/blob/master/synthetic_data.parquet
I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
using ParquetFiles
a = "fannie_mae_perf_small.parquet"
print(load(a))
which yields the error
Error displaying ParquetFiles.ParquetFile: EOFError: read end of file
read at iobuffer.jl:212 [inlined]
_read_varint at codec.jl:40 [inlined]
read_hybrid(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::UInt8, ::Int64, ::Type{Int32}, ::Array{Int32,1}; read_len::Bool) at codec.jl:129
read_hybrid at codec.jl:125 [inlined]
(::Parquet.var"#read_hybrid##kw")(::NamedTuple{(:read_len,),Tuple{Bool}}, ::typeof(Parquet.read_hybrid), ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::UInt8, ::Int64, ::Type{Int32}) at codec.jl:125
(::Parquet.var"#read_hybrid##kw")(::NamedTuple{(:read_len,),Tuple{Bool}}, ::typeof(Parquet.read_hybrid), ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::UInt8, ::Int64) at codec.jl:125
read_hybrid at codec.jl:125 [inlined]
read_rle_dict(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32) at codec.jl:118
read_values(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::Int32, ::Int32) at reader.jl:222
read_levels_and_values(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Tuple{Int32,Int32,Int32}, ::Int32, ::Int32, ::ParFile, ::Parquet.Page) at reader.jl:261
values(::ParFile, ::Parquet.Page) at reader.jl:239
values(::ParFile, ::Parquet.PAR2.ColumnChunk) at reader.jl:178
setrow(::ColCursor{Float64}, ::Int64) at cursor.jl:144
ColCursor(::ParFile, ::UnitRange{Int64}, ::String, ::Int64) at cursor.jl:115
(::Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64})(::String) at none:0
iterate at generator.jl:47 [inlined]
collect_to!(::Array{ColCursor,1}, ::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}, ::Int64, ::Int64) at array.jl:710
collect_to!(::Array{ColCursor{Int64},1}, ::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}, ::Int64, ::Int64) at array.jl:718
collect_to_with_first!(::Array{ColCursor{Int64},1}, ::ColCursor{Int64}, ::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}, ::Int64) at array.jl:689
collect(::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}) at array.jl:670
RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{ParquetFiles.RCType267}, ::Int64) at cursor.jl:269
RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{ParquetFiles.RCType267}) at cursor.jl:269
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:74
show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
show at multimedia.jl:47 [inlined]
(::Atom.var"#61#62"{ParquetFiles.ParquetFile})(::Base.GenericIOBuffer{Array{UInt8,1}}) at display.jl:17
sprint(::Function; context::Nothing, sizehint::Int64) at io.jl:105
sprint at io.jl:101 [inlined]
render at display.jl:16 [inlined]
Copyable at types.jl:39 [inlined]
Copyable at types.jl:40 [inlined]
render at display.jl:19 [inlined]
Can we save a DataFrame to a Parquet file?
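ParquetFiles.jl is read-only via FileIO's `load`; writing is not provided here. Newer Parquet.jl releases (around v0.6 and later, to the best of my knowledge) added a `write_parquet` function that accepts any Tables.jl-compatible table, so a hedged sketch would be:

```julia
using Parquet, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

# write_parquet ships with newer Parquet.jl, not with ParquetFiles.jl,
# and is unavailable on the 0.3.x versions discussed in this thread.
write_parquet("out.parquet", df)
```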
using ParquetFiles
a = "BreastCancer.parquet"
print(load(a))
which yields the error
syntax: field name "Cl.thickness" is not a symbol
top-level scope at look_for_parquet.jl:31
eval at boot.jl:331 [inlined]
schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at schema.jl:231
schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at schema.jl:224
schema at reader.jl:66 [inlined]
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:65
show(::Base.TTY, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
print(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:35
print(::ParquetFiles.ParquetFile) at coreio.jl:3
top-level scope at look_for_parquet.jl:31
The data can be obtained here
https://github.com/xiaodaigh/parquet-data-collection/blob/master/BreastCancer.parquet
I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
Is there a solution to my issue here using ParquetFiles.jl?
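The "field name is not a symbol" error arises because `Cl.thickness` parses as a property access, not an identifier, so the generated struct is invalid. On Julia ≥ 1.3 the `var"..."` syntax shows what a fix could emit: a dotted name can act as a field name when quoted this way. A minimal sketch, independent of either package:

```julia
# var"..." (Julia >= 1.3) lets a non-identifier string act as a field name.
struct Row
    var"Cl.thickness"::Int
end

r = Row(5)
println(getproperty(r, Symbol("Cl.thickness")))  # 5
```

A patched `schema_to_julia_types` could emit `var"..."` fields like this instead of splicing raw names, though ParquetFiles v0.2 does not do so.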