ParquetFiles.jl's People

Contributors

davidanthoff, github-actions[bot], scls19fr, xiaodaigh, yijia-chen

ParquetFiles.jl's Issues

Unable to add ParquetFiles, Julia v1.0

(v1.0) pkg> add ParquetFiles
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package ParquetFiles [46a55296]:
 ParquetFiles [46a55296] log:
 ├─possible versions are: 0.0.1 or uninstalled
 ├─restricted to versions * by an explicit requirement, leaving only versions 0.0.1
 └─restricted by julia compatibility requirements to versions: uninstalled — no versions left
(v1.0) pkg> add ParquetFiles#master
  Updating git-repo `https://github.com/queryverse/ParquetFiles.jl.git`
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Parquet [626c502c]:
 Parquet [626c502c] log:
 ├─possible versions are: 0.1.0 or uninstalled
 ├─restricted to versions 0.1.0-* by ParquetFiles [46a55296], leaving only versions 0.1.0
 │ └─ParquetFiles [46a55296] log:
 │   ├─possible versions are: 0.0.1 or uninstalled
 │   └─ParquetFiles [46a55296] is fixed to version 0.0.1+
 └─restricted by julia compatibility requirements to versions: uninstalled — no versions left
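The resolver output above shows that the only registered ParquetFiles release (0.0.1) is ruled out by its Julia compatibility bounds, and that ParquetFiles#master still pulls in the registered Parquet 0.1.0, which is ruled out the same way. A possible workaround, assuming the master branches of both packages declare Julia 1.0 compatibility (an assumption, not verified here), is to add both from master in one command so the resolver is not forced onto a registered Parquet release:

(v1.0) pkg> add Parquet#master ParquetFiles#master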

Error when reading a file

Data
https://github.com/xiaodaigh/parquet-data-collection/blob/master/dsd50p.parquet

I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.

using ParquetFiles
a = "\\\\wsl\$\\Ubuntu-18.04\\git\\aia-engine-data-wrangling\\test\\data\\breast_cancer_with_schema\\BreastCancer.parquet"
print(load(a))

which yields the error

KeyError: key 5 not found
getindex(::Dict{Int64,Thrift.ThriftMetaAttribs}, ::Int64) at dict.jl:477
read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.RowGroup) at base.jl:181
read at base.jl:169 [inlined]
read at base.jl:167 [inlined]
read_container(::Thrift.TCompactProtocol, ::Array{Parquet.PAR2.RowGroup,1}) at base.jl:369
read_container(::Thrift.TCompactProtocol, ::Type{Array{Parquet.PAR2.RowGroup,1}}) at base.jl:168
read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.FileMetaData) at base.jl:190
read at base.jl:169 [inlined]
read at base.jl:167 [inlined]
read_thrift(::IOStream, ::Type{Parquet.PAR2.FileMetaData}) at reader.jl:324
metadata(::IOStream, ::String, ::Int32) at reader.jl:339
ParFile(::String, ::IOStream; maxcache::Int64) at reader.jl:57
ParFile at reader.jl:55 [inlined]
ParFile(::String) at reader.jl:46
top-level scope at look_for_parquet.jl:32
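The trace shows the failure happens while Parquet.jl parses the Thrift-encoded file metadata (a RowGroup field id it does not recognize), before any column data is read. As a quick sanity check that the file itself is intact, one can verify the Parquet magic bytes: a well-formed file begins and ends with the 4-byte marker "PAR1". This is a hypothetical diagnostic, not ParquetFiles functionality:

# Hypothetical diagnostic, not part of ParquetFiles: a well-formed Parquet file
# begins and ends with the 4-byte magic "PAR1".
open(a) do io
    head = String(read(io, 4))
    seek(io, filesize(io) - 4)
    tail = String(read(io, 4))
    @show head tail   # both should print "PAR1"
end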

Base.Meta.ParseError

Data Examples
https://github.com/xiaodaigh/parquet-data-collection/blob/master/new-york-city-bike-share-dataset.parquet

I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.

using ParquetFiles
a = "fannie_mae_perf_small.parquet"
print(load(a))

which yields the error

Base.Meta.ParseError("extra token \"Duration\" after end of expression")
parse(::String, ::Int64; greedy::Bool, raise::Bool, depwarn::Bool) at meta.jl:184
parse at meta.jl:176 [inlined]
parse(::String; raise::Bool, depwarn::Bool) at meta.jl:215
parse at meta.jl:215 [inlined]
schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at schema.jl:230
schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at schema.jl:224
schema at reader.jl:66 [inlined]
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:65
show(::Base.TTY, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
print(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:35
print(::Base.TTY, ::ParquetFiles.ParquetFile, ::Char) at io.jl:46
println(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:73
println(::ParquetFiles.ParquetFile) at coreio.jl:4
top-level scope at look_for_parquet.jl:34
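The stack trace shows that schema_to_julia_types builds the record type's Julia source code as a string and hands it to Meta.parse; when that string contains a stray second token (here apparently "Duration", presumably from a column's logical type), Meta.parse throws exactly this error. A minimal hypothetical reproduction follows; the literal string is an illustration, not the actual generated code:

# Hypothetical stand-in for the generated type string; any two tokens trigger the same error.
julia> Meta.parse("Dates.Millisecond Duration")
ERROR: Base.Meta.ParseError("extra token \"Duration\" after end of expression")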

ERROR: LoadError: UndefVarError: RecCursor not defined

julia> @time versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Pentium(R) 4 CPU 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, nocona)
  0.951149 seconds (1.18 M allocations: 53.974 MiB, 4.30% gc time)

julia> @time Pkg.add("Queryverse")
  Resolving package versions...
   Updating `~/.julia/environments/v1.4/Project.toml`
 [no changes]
   Updating `~/.julia/environments/v1.4/Manifest.toml`
 [no changes]
  3.222060 seconds (2.55 M allocations: 168.544 MiB, 5.66% gc time)

julia> @time using Queryverse
[ Info: Precompiling Queryverse [612083be-0b0f-5412-89c1-4e7c75506a58]
ERROR: LoadError: UndefVarError: RecCursor not defined
Stacktrace:
 [1] top-level scope at /home/c/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:26
 [2] top-level scope at none:2
 [3] eval at ./boot.jl:331 [inlined]
in expression starting at /home/c/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:26
ERROR: LoadError: Failed to precompile ParquetFiles [46a55296-af5a-53b0-aaa0-97023b66127f] to /home/c/.julia/compiled/v1.4/ParquetFiles/WDBBU_g2XoO.ji.
Stacktrace:
 [1] top-level scope at none:2
 [2] eval at ./boot.jl:331 [inlined]
in expression starting at /home/c/.julia/packages/Queryverse/ysqbZ/src/Queryverse.jl:15
ERROR: Failed to precompile Queryverse [612083be-0b0f-5412-89c1-4e7c75506a58] to /home/c/.julia/compiled/v1.4/Queryverse/hLJnW_g2XoO.ji.
Stacktrace:
 [1] compilecache(::Base.PkgId, ::String) at ./loading.jl:1272
 [2] _require(::Base.PkgId) at ./loading.jl:1029
 [3] require(::Base.PkgId) at ./loading.jl:927
 [4] require(::Module, ::Symbol) at ./loading.jl:922
 [5] top-level scope at util.jl:175
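RecCursor is a type that ParquetFiles v0.2 uses from Parquet.jl (the other traces in this list show it defined in Parquet's cursor.jl), so the UndefVarError suggests the Parquet.jl version resolved in this environment no longer provides it. A hedged workaround sketch, assuming an older Parquet release such as 0.3.2 (the version the other reports here use) still defines RecCursor:

julia> using Pkg

julia> Pkg.add(PackageSpec(name="Parquet", version="0.3.2"))  # assumption: this release still defines RecCursor

julia> Pkg.pin("Parquet")

Then restart Julia and retry using Queryverse.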

Reading Parquet to DataFrame is slow

Reading a Parquet file into a DataFrame is ~170x slower than using CSV.read on the same data. I am not sure I can help improve performance, but this is limiting my use of ParquetFiles.jl.

MWE:

(@v1.4) pkg> st
Status `~/.julia/environments/v1.4/Project.toml`
  [6e4b80f9] BenchmarkTools v0.5.0
  [336ed68f] CSV v0.6.2
  [a93c6f00] DataFrames v0.21.2
  [626c502c] Parquet v0.4.0
  [46a55296] ParquetFiles v0.2.0
using ParquetFiles, BenchmarkTools, CSV, DataFrames
CSV.read("data.csv")
DataFrame(load("data.parquet"))
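As an aside for anyone reproducing this on a newer environment: CSV.read("data.csv") without a sink argument is the CSV.jl 0.6 API from the status output above; newer CSV.jl releases require an explicit sink, so the equivalent call would presumably be:

using CSV, DataFrames
df = CSV.read("data.csv", DataFrame)   # newer CSV.jl: pass the sink explicitly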

Loading times for ParquetFiles:

@benchmark DataFrame(load("data.parquet"))
BenchmarkTools.Trial: 
  memory estimate:  45.66 MiB
  allocs estimate:  961290
  --------------
  minimum time:     287.492 ms (0.00% GC)
  median time:      290.843 ms (0.00% GC)
  mean time:        296.344 ms (1.64% GC)
  maximum time:     326.041 ms (8.46% GC)
  --------------
  samples:          17
  evals/sample:     1

Loading times for CSV:

@benchmark CSV.read("data.csv")
BenchmarkTools.Trial: 
  memory estimate:  758.14 KiB
  allocs estimate:  2299
  --------------
  minimum time:     1.690 ms (0.00% GC)
  median time:      1.735 ms (0.00% GC)
  mean time:        1.772 ms (1.43% GC)
  maximum time:     14.096 ms (63.93% GC)
  --------------
  samples:          2817
  evals/sample:     1

As compared to pandas:

import pandas as pd
%timeit pd.read_parquet("data.parquet")                                                                                                                                          
# 3.61 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pd.read_csv("data.csv")                                                                                                                                                  
# 4.73 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The data are included in the attached zip file:
data.zip

TypeError when reading

Data Examples
https://github.com/xiaodaigh/parquet-data-collection/blob/master/noshowappointments.parquet

I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.

using ParquetFiles
a = "fannie_mae_perf_small.parquet"
print(load(a))

which yields the error

TypeError: in typeassert, expected Array{UInt8,1}, got typeof(show)
top-level scope at none:2
eval at boot.jl:331 [inlined]
schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at schema.jl:231
schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at schema.jl:224
schema at reader.jl:66 [inlined]
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:65
show(::Base.TTY, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
print(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:35
print(::Base.TTY, ::ParquetFiles.ParquetFile, ::Char) at io.jl:46
println(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:73
println(::ParquetFiles.ParquetFile) at coreio.jl:4
top-level scope at look_for_parquet.jl:35

UndefRefError

I am working with a series of large Parquet files (which I cannot share), and there seems to be a weird error when reading them:

julia> DataFrame(load("current.parquet"))
ERROR: UndefRefError: access to undefined reference

The stacktrace is as follows:

Stacktrace:
 [1] getproperty(::ParquetFiles.RCType276, ::Symbol) at ./Base.jl:33
 [2] macro expansion at /home/tpoisot/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:48 [inlined]
 [3] iterate(::ParquetFiles.ParquetNamedTupleIterator{NamedTuple{(:column, :names, :go, :here),Tuple{String,Int32,String,String,Int32,Int32,Int32,Int32,Int32}},ParquetFiles.RCType276}, ::Int64) at /home/tpoisot/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:39
 [4] iterate at /home/tpoisot/.julia/packages/Tables/xHhzi/src/tofromdatavalues.jl:53 [inlined]
 [5] iterate at ./iterators.jl:139 [inlined]
 [6] buildcolumns(::Tables.Schema{(:column, :names, :go, :here),Tuple{String,Int32,String,String,Int32,Int32,Int32,Int32,Int32}}, ::Tables.IteratorWrapper{ParquetFiles.ParquetNamedTupleIterator{NamedTuple{(:column, :names, :go, :here),Tuple{String,Int32,String,String,Int32,Int32,Int32,Int32,Int32}},ParquetFiles.RCType276}}) at /home/tpoisot/.julia/packages/Tables/xHhzi/src/fallbacks.jl:127
 [7] columns at /home/tpoisot/.julia/packages/Tables/xHhzi/src/fallbacks.jl:237 [inlined]
 [8] DataFrame(::ParquetFiles.ParquetFile; copycols::Bool) at /home/tpoisot/.julia/packages/DataFrames/GtZ1l/src/other/tables.jl:43
 [9] DataFrame(::ParquetFiles.ParquetFile) at /home/tpoisot/.julia/packages/DataFrames/GtZ1l/src/other/tables.jl:34
 [10] top-level scope at REPL[30]:1

Interestingly, as far as I can tell, the entire file is loaded, but converting it to a DataFrame or saving it to CSV throws the same error. My guess is that the last row somehow contains characters it should not.
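The trace points at getproperty on the generated record type (ParquetFiles.RCType276), i.e. a field of a record object is read before it was ever assigned. A minimal hypothetical illustration of this error class in plain Julia (not ParquetFiles code) is an incompletely initialized mutable struct, which would be consistent with a null value in the file landing in a field the generated type cannot represent:

# Hypothetical illustration, not ParquetFiles code: reading a field that was never
# assigned in a partially initialized mutable struct raises UndefRefError.
mutable struct Row
    name::String
    Row() = new()   # leave the field unassigned
end

r = Row()
r.name              # ERROR: UndefRefError: access to undefined reference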

Error in loading a file

I was able to read a file easily using Python and pandas; in Julia, however, the same file produces the following error:

ERROR: LoadError: Base.Meta.ParseError("extra token \"Vector\" after end of expression")

I used Parquet.jl to try to load the file, and I get the following schema:

Schema:
required schema {
  optional INT64 timestamp# (from TIMESTAMP_MILLIS)
  optional INT64 t_datetime# (from TIMESTAMP_MILLIS)
  optional BYTE_ARRAY id_string# (from UTF8)
  optional DOUBLE v_double
  optional DOUBLE q_double
  optional BYTE_ARRAY alias_string# (from UTF8)
}

However, when I run
schema(JuliaConverter(Main), Pq, :Tags)

I get the same error as before.

EOFError

Data Examples
https://github.com/xiaodaigh/parquet-data-collection/blob/master/fannie_mae_perf_small.parquet
https://github.com/xiaodaigh/parquet-data-collection/blob/master/synthetic_data.parquet

I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.

using ParquetFiles
a = "fannie_mae_perf_small.parquet"
print(load(a))

which yields the error

Error displaying ParquetFiles.ParquetFile: EOFError: read end of file
read at iobuffer.jl:212 [inlined]
_read_varint at codec.jl:40 [inlined]
read_hybrid(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::UInt8, ::Int64, ::Type{Int32}, ::Array{Int32,1}; read_len::Bool) at codec.jl:129
read_hybrid at codec.jl:125 [inlined]
(::Parquet.var"#read_hybrid##kw")(::NamedTuple{(:read_len,),Tuple{Bool}}, ::typeof(Parquet.read_hybrid), ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::UInt8, ::Int64, ::Type{Int32}) at codec.jl:125
(::Parquet.var"#read_hybrid##kw")(::NamedTuple{(:read_len,),Tuple{Bool}}, ::typeof(Parquet.read_hybrid), ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::UInt8, ::Int64) at codec.jl:125
read_hybrid at codec.jl:125 [inlined]
read_rle_dict(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32) at codec.jl:118
read_values(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::Int32, ::Int32) at reader.jl:222
read_levels_and_values(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Tuple{Int32,Int32,Int32}, ::Int32, ::Int32, ::ParFile, ::Parquet.Page) at reader.jl:261
values(::ParFile, ::Parquet.Page) at reader.jl:239
values(::ParFile, ::Parquet.PAR2.ColumnChunk) at reader.jl:178
setrow(::ColCursor{Float64}, ::Int64) at cursor.jl:144
ColCursor(::ParFile, ::UnitRange{Int64}, ::String, ::Int64) at cursor.jl:115
(::Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64})(::String) at none:0
iterate at generator.jl:47 [inlined]
collect_to!(::Array{ColCursor,1}, ::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}, ::Int64, ::Int64) at array.jl:710
collect_to!(::Array{ColCursor{Int64},1}, ::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}, ::Int64, ::Int64) at array.jl:718
collect_to_with_first!(::Array{ColCursor{Int64},1}, ::ColCursor{Int64}, ::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}, ::Int64) at array.jl:689
collect(::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}) at array.jl:670
RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{ParquetFiles.RCType267}, ::Int64) at cursor.jl:269
RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{ParquetFiles.RCType267}) at cursor.jl:269
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:74
show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
show at multimedia.jl:47 [inlined]
(::Atom.var"#61#62"{ParquetFiles.ParquetFile})(::Base.GenericIOBuffer{Array{UInt8,1}}) at display.jl:17
sprint(::Function; context::Nothing, sizehint::Int64) at io.jl:105
sprint at io.jl:101 [inlined]
render at display.jl:16 [inlined]
Copyable at types.jl:39 [inlined]
Copyable at types.jl:40 [inlined]
render at display.jl:19 [inlined]

Fails to load data with non-alphanumeric characters in the column name

using ParquetFiles
a = "BreastCancer.parquet"
print(load(a))

which yields the error

syntax: field name "Cl.thickness" is not a symbol
top-level scope at look_for_parquet.jl:31
eval at boot.jl:331 [inlined]
schema_to_julia_types(::Module, ::Parquet.Schema, ::Symbol) at schema.jl:231
schema(::JuliaConverter, ::Parquet.Schema, ::Symbol) at schema.jl:224
schema at reader.jl:66 [inlined]
getiterator(::ParquetFiles.ParquetFile) at ParquetFiles.jl:65
show(::Base.TTY, ::ParquetFiles.ParquetFile) at ParquetFiles.jl:13
print(::Base.TTY, ::ParquetFiles.ParquetFile) at io.jl:35
print(::ParquetFiles.ParquetFile) at coreio.jl:3
top-level scope at look_for_parquet.jl:31

The data can be obtained here

https://github.com/xiaodaigh/parquet-data-collection/blob/master/BreastCancer.parquet

I am using Windows 10, Julia 1.4, ParquetFiles.jl v0.2, and Parquet 0.3.2, which are the latest versions as of writing.
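For reference, the failure can be reproduced without ParquetFiles: the stack trace shows schema_to_julia_types generating a record struct whose field names are taken verbatim from the Parquet column names, and a name containing a dot is not a valid Julia field name. A hypothetical minimal reproduction (RCDemo and the Int64 field type are illustrative, not the actual generated code):

# Hypothetical illustration: evaluating a struct definition whose field name
# contains "." fails at lowering time with the same syntax error as above.
ex = Meta.parse("mutable struct RCDemo; Cl.thickness::Int64; end")
eval(ex)   # ERROR: syntax: field name "Cl.thickness" is not a symbol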
