queryverse / ParquetFiles.jl
FileIO.jl integration for Parquet files
License: Other
(v1.0) pkg> add ParquetFiles
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package ParquetFiles [46a55296]:
ParquetFiles [46a55296] log:
├─possible versions are: 0.0.1 or uninstalled
├─restricted to versions * by an explicit requirement, leaving only versions 0.0.1
└─restricted by julia compatibility requirements to versions: uninstalled — no versions left
(v1.0) pkg> add ParquetFiles#master
Updating git-repo `https://github.com/queryverse/ParquetFiles.jl.git`
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Parquet [626c502c]:
Parquet [626c502c] log:
├─possible versions are: 0.1.0 or uninstalled
├─restricted to versions 0.1.0-* by ParquetFiles [46a55296], leaving only versions 0.1.0
│ └─ParquetFiles [46a55296] log:
│ ├─possible versions are: 0.0.1 or uninstalled
│ └─ParquetFiles [46a55296] is fixed to version 0.0.1+
└─restricted by julia compatibility requirements to versions: uninstalled — no versions left
Data
https://github.com/xiaodaigh/parquet-data-collection/blob/master/dsd50p.parquet
I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
using ParquetFiles
a = "\\\\wsl\$\\Ubuntu-18.04\\git\\aia-engine-data-wrangling\\test\\data\\breast_cancer_with_schema\\BreastCancer.parquet"
print(load(a))
which yields the error
KeyError: key 5 not found
getindex(::Dict{Int64,Thrift.ThriftMetaAttribs}, ::Int64) at dict.jl:477
read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.RowGroup) at base.jl:181
read at base.jl:169 [inlined]
read at base.jl:167 [inlined]
read_container(::Thrift.TCompactProtocol, ::Array{Parquet.PAR2.RowGroup,1}) at base.jl:369
read_container(::Thrift.TCompactProtocol, ::Type{Array{Parquet.PAR2.RowGroup,1}}) at base.jl:168
read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.FileMetaData) at base.jl:190
read at base.jl:169 [inlined]
read at base.jl:167 [inlined]
read_thrift(::IOStream, ::Type{Parquet.PAR2.FileMetaData}) at reader.jl:324
metadata(::IOStream, ::String, ::Int32) at reader.jl:339
ParFile(::String, ::IOStream; maxcache::Int64) at reader.jl:57
ParFile at reader.jl:55 [inlined]
ParFile(::String) at reader.jl:46
top-level scope at look_for_parquet.jl:32
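As a diagnostic, the FileIO layer can be bypassed and the file opened with Parquet.jl's own pre-0.5 `ParFile` API. The `KeyError` above is raised while parsing the Thrift footer, so if the metadata itself is the problem it should reproduce here too. This is a sketch assuming Parquet 0.3.x and a local copy of the failing file:

```julia
using Parquet

# Substitute the path of the failing file here.
p = ParFile("BreastCancer.parquet")  # the Thrift footer is parsed on open
println(schema(p))                   # if opening succeeds, inspect the schema
```

If `ParFile` throws the same `KeyError`, the problem is in Parquet.jl's Thrift metadata reader rather than in ParquetFiles.jl itself.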
Data Examples
https://github.com/xiaodaigh/parquet-data-collection/blob/master/new-york-city-bike-share-dataset.parquet
I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
using ParquetFiles
a = "fannie_mae_perf_small.parquet"
print(load(a))
which yields the error
Base.Meta.ParseError("extra token \"Duration\" after end of expression")
parse(::String, ::Int64; greedy::Bool, raise::Bool, depwarn::Bool) at meta.jl:184
parse at meta.jl:176 [inlined]
parse(::String; raise::Bool, depwarn::Bool) at meta.jl:215
parse at meta.jl:215 [inlined]
schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at schema.jl:230
schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at schema.jl:224
schema at reader.jl:66 [inlined]
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:65
show(::Base.TTY, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
print(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:35
print(::Base.TTY, ::ParquetFiles.ParquetFile, ::Char) at io.jl:46
println(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:73
println(::ParquetFiles.ParquetFile) at coreio.jl:4
top-level scope at look_for_parquet.jl:34
julia> @time versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Pentium(R) 4 CPU 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, nocona)
0.951149 seconds (1.18 M allocations: 53.974 MiB, 4.30% gc time)
julia> @time Pkg.add("Queryverse")
Resolving package versions...
Updating `~/.julia/environments/v1.4/Project.toml`
[no changes]
Updating `~/.julia/environments/v1.4/Manifest.toml`
[no changes]
3.222060 seconds (2.55 M allocations: 168.544 MiB, 5.66% gc time)
julia> @time using Queryverse
[ Info: Precompiling Queryverse [612083be-0b0f-5412-89c1-4e7c75506a58]
ERROR: LoadError: UndefVarError: RecCursor not defined
Stacktrace:
[1] top-level scope at /home/c/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:26
[2] top-level scope at none:2
[3] eval at ./boot.jl:331 [inlined]
in expression starting at /home/c/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:26
ERROR: LoadError: Failed to precompile ParquetFiles [46a55296-af5a-53b0-aaa0-97023b66127f] to /home/c/.julia/compiled/v1.4/ParquetFiles/WDBBU_g2XoO.ji.
Stacktrace:
[1] top-level scope at none:2
[2] eval at ./boot.jl:331 [inlined]
in expression starting at /home/c/.julia/packages/Queryverse/ysqbZ/src/Queryverse.jl:15
ERROR: Failed to precompile Queryverse [612083be-0b0f-5412-89c1-4e7c75506a58] to /home/c/.julia/compiled/v1.4/Queryverse/hLJnW_g2XoO.ji.
Stacktrace:
[1] compilecache(::Base.PkgId, ::String) at ./loading.jl:1272
[2] _require(::Base.PkgId) at ./loading.jl:1029
[3] require(::Base.PkgId) at ./loading.jl:927
[4] require(::Module, ::Symbol) at ./loading.jl:922
[5] top-level scope at util.jl:175
Reading a parquet file into a DataFrame is ~170× slower than using CSV.read with the same data. I am not sure I can help improve performance, but this is limiting my use of ParquetFiles.jl.
MWE:
(@v1.4) pkg> st
Status `~/.julia/environments/v1.4/Project.toml`
[6e4b80f9] BenchmarkTools v0.5.0
[336ed68f] CSV v0.6.2
[a93c6f00] DataFrames v0.21.2
[626c502c] Parquet v0.4.0
[46a55296] ParquetFiles v0.2.0
using ParquetFiles, BenchmarkTools, CSV, DataFrames
CSV.read("data.csv")
DataFrame(load("data.parquet"))
Loading times for ParquetFiles
@benchmark DataFrame(load("data.parquet"))
BenchmarkTools.Trial:
memory estimate: 45.66 MiB
allocs estimate: 961290
--------------
minimum time: 287.492 ms (0.00% GC)
median time: 290.843 ms (0.00% GC)
mean time: 296.344 ms (1.64% GC)
maximum time: 326.041 ms (8.46% GC)
--------------
samples: 17
evals/sample: 1
Loading times for CSV:
@benchmark CSV.read("data.csv")
BenchmarkTools.Trial:
memory estimate: 758.14 KiB
allocs estimate: 2299
--------------
minimum time: 1.690 ms (0.00% GC)
median time: 1.735 ms (0.00% GC)
mean time: 1.772 ms (1.43% GC)
maximum time: 14.096 ms (63.93% GC)
--------------
samples: 2817
evals/sample: 1
As compared to pandas:
import pandas as pd
%timeit pd.read_parquet("data.parquet")
# 3.61 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_csv("data.csv")
# 4.73 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Data are included in the zip file:
data.zip
Data Examples
https://github.com/xiaodaigh/parquet-data-collection/blob/master/noshowappointments.parquet
I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
using ParquetFiles
a = "fannie_mae_perf_small.parquet"
print(load(a))
which yields the error
TypeError: in typeassert, expected Array{UInt8,1}, got typeof(show)
top-level scope at none:2
eval at boot.jl:331 [inlined]
schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at schema.jl:231
schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at schema.jl:224
schema at reader.jl:66 [inlined]
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:65
show(::Base.TTY, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
print(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:35
print(::Base.TTY, ::ParquetFiles.ParquetFile, ::Char) at io.jl:46
println(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:73
println(::ParquetFiles.ParquetFile) at coreio.jl:4
top-level scope at look_for_parquet.jl:35
Parquet.jl has been updated and RecCursor is no longer exported, so ParquetFiles is broken.
It should be updated to use RecordCursor, but it is currently unusable.
This, of course, breaks Queryverse as well.
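Until that port happens, one workaround is to hold Parquet back to a release that still exports RecCursor. This is a sketch; the exact version to pin is an assumption, so check the compat bounds of the ParquetFiles release you have installed:

```julia
using Pkg

# Pin Parquet to an older release that still exports RecCursor.
# v0.4.0 is an assumption; adjust to match ParquetFiles' compat bounds.
Pkg.add(PackageSpec(name = "Parquet", version = v"0.4.0"))
Pkg.pin("Parquet")
```

Pinning prevents a later `Pkg.update()` from silently pulling in the incompatible Parquet release again.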
Once the file is registered with FileIO
I am working with a series of large Parquet files (which I cannot share), and there seems to be a weird error when reading them:
julia> DataFrame(load("current.parquet"))
ERROR: UndefRefError: access to undefined reference
The stacktrace is as follows:
Stacktrace:
[1] getproperty(::ParquetFiles.RCType276, ::Symbol) at ./Base.jl:33
[2] macro expansion at /home/tpoisot/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:48 [inlined]
[3] iterate(::ParquetFiles.ParquetNamedTupleIterator{NamedTuple{(:column, :names, :go, :here),Tuple{String,Int32,String,String,Int32,Int32,Int32,Int32,Int32}},ParquetFiles.RCType276}, ::Int64) at /home/tpoisot/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:39
[4] iterate at /home/tpoisot/.julia/packages/Tables/xHhzi/src/tofromdatavalues.jl:53 [inlined]
[5] iterate at ./iterators.jl:139 [inlined]
[6] buildcolumns(::Tables.Schema{(:column, :names, :go, :here),Tuple{String,Int32,String,String,Int32,Int32,Int32,Int32,Int32}}, ::Tables.IteratorWrapper{ParquetFiles.ParquetNamedTupleIterator{NamedTuple{(:column, :names, :go, :here),Tuple{String,Int32,String,String,Int32,Int32,Int32,Int32,Int32}},ParquetFiles.RCType276}}) at /home/tpoisot/.julia/packages/Tables/xHhzi/src/fallbacks.jl:127
[7] columns at /home/tpoisot/.julia/packages/Tables/xHhzi/src/fallbacks.jl:237 [inlined]
[8] DataFrame(::ParquetFiles.ParquetFile; copycols::Bool) at /home/tpoisot/.julia/packages/DataFrames/GtZ1l/src/other/tables.jl:43
[9] DataFrame(::ParquetFiles.ParquetFile) at /home/tpoisot/.julia/packages/DataFrames/GtZ1l/src/other/tables.jl:34
[10] top-level scope at REPL[30]:1
Interestingly, as far as I can tell, the entire file is loaded, but converting to a DataFrame or saving to CSV throws the same error. My guess is that the last line somehow has characters it should not?
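The UndefRefError pattern is what you get when iteration reads a record field that was never assigned, e.g. a null value landing in a slot whose Julia type does not allow `missing`. A minimal, package-free sketch of the same failure mode, assuming that is the cause here:

```julia
# A freshly allocated Vector of a non-isbits type has unassigned (#undef) slots.
v = Vector{String}(undef, 2)
v[1] = "assigned"

println(isassigned(v, 2))  # false
v[2]                       # ERROR: UndefRefError: access to undefined reference
```

This matches the stack trace above: `getproperty` on a partially constructed `RCType276` record hits an unassigned field.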
I was able to read the file easily using Python and pandas; however, in Julia the same file produces the following error:
ERROR: LoadError: Base.Meta.ParseError("extra token \"Vector\" after end of expression")
I used Parquet.jl to try to load the file, and I get the following schema:
Schema:
required schema {
  optional INT64 timestamp# (from TIMESTAMP_MILLIS)
  optional INT64 t_datetime# (from TIMESTAMP_MILLIS)
  optional BYTE_ARRAY id_string# (from UTF8)
  optional DOUBLE v_double
  optional DOUBLE q_double
  optional BYTE_ARRAY alias_string# (from UTF8)
}
However, when I run
schema(JuliaConverter(Main), Pq, :Tags)
I get the same error as before.
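The common root cause in these parse errors is that `schema_to_julia_types` splices raw column names into generated Julia source; any name that is not a valid identifier (a trailing `#`, a dot as in `Cl.thickness`, or an embedded space) either fails `Meta.parse` or does not produce a field Symbol. A small sketch of the failure and one possible fix — the `sanitize` helper is hypothetical, not part of either package:

```julia
# A column name with a space cannot be parsed as a single identifier:
err = try
    Meta.parse("Monthly Duration")   # hypothetical column name
catch e
    e
end
println(err isa Base.Meta.ParseError)  # true

# Hypothetical fix: map raw names onto valid Symbols before code generation.
sanitize(name) = Symbol(replace(name, r"[^A-Za-z0-9_]" => "_"))
println(sanitize("Monthly Duration"))  # Monthly_Duration
println(sanitize("Cl.thickness"))      # Cl_thickness
println(sanitize("timestamp#"))        # timestamp_
```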
Data Examples
https://github.com/xiaodaigh/parquet-data-collection/blob/master/fannie_mae_perf_small.parquet
https://github.com/xiaodaigh/parquet-data-collection/blob/master/synthetic_data.parquet
I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
using ParquetFiles
a = "fannie_mae_perf_small.parquet"
print(load(a))
which yields the error
Error displaying ParquetFiles.ParquetFile: EOFError: read end of file
read at iobuffer.jl:212 [inlined]
_read_varint at codec.jl:40 [inlined]
read_hybrid(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::UInt8, ::Int64, ::Type{Int32}, ::Array{Int32,1}; read_len::Bool) at codec.jl:129
read_hybrid at codec.jl:125 [inlined]
(::Parquet.var"#read_hybrid##kw")(::NamedTuple{(:read_len,),Tuple{Bool}}, ::typeof(Parquet.read_hybrid), ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::UInt8, ::Int64, ::Type{Int32}) at codec.jl:125
(::Parquet.var"#read_hybrid##kw")(::NamedTuple{(:read_len,),Tuple{Bool}}, ::typeof(Parquet.read_hybrid), ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::UInt8, ::Int64) at codec.jl:125
read_hybrid at codec.jl:125 [inlined]
read_rle_dict(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32) at codec.jl:118
read_values(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::Int32, ::Int32) at reader.jl:222
read_levels_and_values(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Tuple{Int32,Int32,Int32}, ::Int32, ::Int32, ::ParFile, ::Parquet.Page) at reader.jl:261
values(::ParFile, ::Parquet.Page) at reader.jl:239
values(::ParFile, ::Parquet.PAR2.ColumnChunk) at reader.jl:178
setrow(::ColCursor{Float64}, ::Int64) at cursor.jl:144
ColCursor(::ParFile, ::UnitRange{Int64}, ::String, ::Int64) at cursor.jl:115
(::Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64})(::String) at none:0
iterate at generator.jl:47 [inlined]
collect_to!(::Array{ColCursor,1}, ::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}, ::Int64, ::Int64) at array.jl:710
collect_to!(::Array{ColCursor{Int64},1}, ::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}, ::Int64, ::Int64) at array.jl:718
collect_to_with_first!(::Array{ColCursor{Int64},1}, ::ColCursor{Int64}, ::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}, ::Int64) at array.jl:689
collect(::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}) at array.jl:670
RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{ParquetFiles.RCType267}, ::Int64) at cursor.jl:269
RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{ParquetFiles.RCType267}) at cursor.jl:269
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:74
show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
show at multimedia.jl:47 [inlined]
(::Atom.var"#61#62"{ParquetFiles.ParquetFile})(::Base.GenericIOBuffer{Array{UInt8,1}}) at display.jl:17
sprint(::Function; context::Nothing, sizehint::Int64) at io.jl:105
sprint at io.jl:101 [inlined]
render at display.jl:16 [inlined]
Copyable at types.jl:39 [inlined]
Copyable at types.jl:40 [inlined]
render at display.jl:19 [inlined]
Can we save a DataFrame to a Parquet file?
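ParquetFiles.jl is read-only via FileIO's `load`; writing is not provided here. Newer Parquet.jl releases (around v0.6 and later, to the best of my knowledge) added a `write_parquet` function that accepts any Tables.jl-compatible table, so a hedged sketch would be:

```julia
using Parquet, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

# write_parquet ships with newer Parquet.jl, not with ParquetFiles.jl,
# and is unavailable on the 0.3.x versions discussed in this thread.
write_parquet("out.parquet", df)
```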
using ParquetFiles
a = "BreastCancer.parquet"
print(load(a))
which yields the error
syntax: field name "Cl.thickness" is not a symbol
top-level scope at look_for_parquet.jl:31
eval at boot.jl:331 [inlined]
schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at schema.jl:231
schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at schema.jl:224
schema at reader.jl:66 [inlined]
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:65
show(::Base.TTY, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
print(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:35
print(::ParquetFiles.ParquetFile) at coreio.jl:3
top-level scope at look_for_parquet.jl:31
The data can be obtained here
https://github.com/xiaodaigh/parquet-data-collection/blob/master/BreastCancer.parquet
I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
Is there a solution to my issue here using ParquetFiles.jl?
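The "field name is not a symbol" error arises because `Cl.thickness` parses as a property access, not an identifier, so the generated struct is invalid. On Julia ≥ 1.3 the `var"..."` syntax shows what a fix could emit: a dotted name can act as a field name when quoted this way. A minimal sketch, independent of either package:

```julia
# var"..." (Julia >= 1.3) lets a non-identifier string act as a field name.
struct Row
    var"Cl.thickness"::Int
end

r = Row(5)
println(getproperty(r, Symbol("Cl.thickness")))  # 5
```

A patched `schema_to_julia_types` could emit `var"..."` fields like this instead of splicing raw names, though ParquetFiles v0.2 does not do so.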