ktrueda / parquet-tools
easy install parquet-tools
License: MIT License
Traceback (most recent call last):
File "/usr/bin/parquet-tools", line 5, in <module>
from parquet_tools.cli import main
File "/usr/lib/python2.7/site-packages/parquet_tools/cli.py", line 5
def main() -> None:
^
SyntaxError: invalid syntax
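The failure above comes from the script being executed by Python 2: the return-type annotation on main is Python 3-only syntax. A minimal check (not from the project) showing that the failing line only parses under Python 3:

```python
import ast

# "def main() -> None:" uses a return-type annotation; Python 2's parser
# rejects it with exactly the SyntaxError shown above, while Python 3
# parses it fine.
tree = ast.parse("def main() -> None:\n    pass\n")
print(type(tree.body[0]).__name__)  # FunctionDef
```

Installing with an explicit Python 3 interpreter (python3 -m pip install parquet-tools) avoids picking up the Python 2 entry point.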
If users specify multiple parquet files with different column structures, the result is concatenated along the column axis (not the row axis).
$ poetry run parquet-tools csv ./tests/test1.parquet ./tests/test0.parquet
one,two,three,a,b,c,d
-1.0,foo,True,,,,
,bar,False,,,,
2.5,baz,True,,,,
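The behaviour can be reproduced with plain pandas; the frames below are stand-ins for the two test files, and show how concatenating disjoint schemas yields the union of columns with empty cells:

```python
import pandas as pd

# Stand-ins for test1.parquet and test0.parquet with disjoint columns.
df1 = pd.DataFrame({"one": [-1.0, None, 2.5],
                    "two": ["foo", "bar", "baz"],
                    "three": [True, False, True]})
df0 = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

# Row-axis concat of disjoint schemas produces the union of columns,
# filling missing cells with NaN (which the csv command emits as empty).
combined = pd.concat([df1, df0], ignore_index=True)
print(list(combined.columns))
```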
I used the tool pacman82/odbc2parquet to create a parquet file with LZ4 compression.
When I inspect the parquet file with parquet-tools, the compression is not detected:
Would it be possible to have a parameter that behaves like truncate in Spark's show()?
Currently, if you try to show a data frame with a column holding long strings, or with many columns, the layout breaks down and becomes hard to read. Since I don't care about that column, I see two possibilities: use --columns and select every column except that one (maybe there could be a parameter listing the columns to exclude directly?), or truncate the column contents to a fixed length, but I couldn't find such an option.
I am not able to generate the schema for a parquet file.
Below is the error:
parquet-tools schema Sample.parquet
usage: parquet-tools [-h] {show,csv,inspect} ...
parquet-tools: error: argument {show,csv,inspect}: invalid choice: 'schema' (choose from 'show', 'csv', 'inspect')
It would be useful to show row groups, the compression used, metadata, and encodings, e.g.:
docker run -v $PWD/houseprices:/data markhneedham/pq meta /data/house_prices.parquet
File path: /data/house_prices.parquet
Created by: parquet-cpp version 1.5.1-SNAPSHOT
Properties: (none)
Schema:
message schema {
required int32 price (INTEGER(32,false));
required int32 date (INTEGER(16,false));
required binary postcode1;
required binary postcode2;
required int32 type (INTEGER(8,true));
required int32 is_new (INTEGER(8,false));
required int32 duration (INTEGER(8,true));
required binary addr1;
required binary addr2;
required binary street;
required binary locality;
required binary town;
required binary district;
required binary county;
}
Row group 0: count: 1000000 6.32 B records start: 4 total(compressed): 6.026 MB total(uncompressed):9.089 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
price INT32 Z _ R 1000000 1.72 B 0 "100" / "523000000"
date INT32 Z _ R 1000000 1.77 B 0 "9131" / "19405"
postcode1 BINARY Z _ R 1000000 0.00 B 0 "0x" / "0x42413131"
postcode2 BINARY Z _ R 1000000 0.16 B 0 "0x" / "0x39595A"
type INT32 Z _ R 1000000 0.19 B 0 "0" / "4"
is_new INT32 Z _ R 1000000 0.04 B 0 "0" / "1"
duration INT32 Z _ R 1000000 0.07 B 0 "0" / "2"
addr1 BINARY Z _ R 1000000 1.20 B 0 "0x" / "0x5A5954454B20484F555345"
addr2 BINARY Z _ R 1000000 0.20 B 0 "0x" / "0x5A4F4E452043"
street BINARY Z _ R 1000000 0.44 B 0 "0x" / "0x5A494F4E5320434C4F5345"
locality BINARY Z _ R 1000000 0.35 B 0 "0x" / "0x5A45414C53"
town BINARY Z _ R 1000000 0.08 B 0 "0x4142424F5453204C414E474..." / "0x595354524144204D4555524947"
district BINARY Z _ R 1000000 0.06 B 0 "0x41445552" / "0x594F524B"
county BINARY Z _ R 1000000 0.03 B 0 "0x41564F4E" / "0x594F524B"
When piping parquet-tools csv output to less, I consistently get a BrokenPipeError when less is closed:
> parquet-tools csv my.parquet | less
Traceback (most recent call last):
File "/.../bin/parquet-tools", line 10, in <module>
sys.exit(main())
File ".../lib/python3.9/site-packages/parquet_tools/cli.py", line 26, in main
args.handler(args)
File "/.../lib/python3.9/site-packages/parquet_tools/commands/csv.py", line 46, in _cli
_execute(
File "/.../lib/python3.9/site-packages/parquet_tools/commands/csv.py", line 62, in _execute
print(df_select.to_csv(index=None))
File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 47, in write
self.__convertor.write(text)
File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 170, in write
self.write_and_convert(text)
File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 198, in write_and_convert
self.write_plain_text(text, cursor, len(text))
File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 203, in write_plain_text
self.wrapped.write(text[start:end])
BrokenPipeError: [Errno 32] Broken pipe
It is not a deal breaker, since in less I can see the output just fine. It is just a bit distracting, so I thought I'd mention it.
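A common fix for CLIs whose stdout may be closed early by a pager is to restore the default SIGPIPE disposition, so the process exits quietly instead of raising. This is a generic sketch, not parquet-tools' actual code:

```python
import signal

# On POSIX, SIG_DFL makes a write to a closed pipe terminate the process
# silently instead of raising BrokenPipeError. SIGPIPE does not exist on
# Windows, hence the hasattr guard.
if hasattr(signal, "SIGPIPE"):
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

print("one,two,three")  # writes after the pager exits no longer traceback
```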
Is it possible to specify the number of rows to display when using the show command? This would be useful when browsing parquet files with millions of rows, e.g. only show the top 5 rows.
I installed this package with pip (python3 -m pip install parquet-tools) on a Debian server and got this error:
Traceback (most recent call last):
File "/home/taz/.local/bin/parquet-tools", line 5, in <module>
from parquet_tools.cli import main
File "/home/taz/.local/lib/python3.6/site-packages/parquet_tools/cli.py", line 2, in <module>
from parquet_tools.commands import show, csv, inspect
File "/home/taz/.local/lib/python3.6/site-packages/parquet_tools/commands/show.py", line 9, in <module>
from .utils import (FileNotFoundException, InvalidCommandExcpetion,
File "/home/taz/.local/lib/python3.6/site-packages/parquet_tools/commands/utils.py", line 4, in <module>
from dataclasses import dataclass
ModuleNotFoundError: No module named 'dataclasses'
I can see that the dataclasses library is used in parquet_tools/commands/utils.py but is not declared as a dependency in pyproject.toml.
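dataclasses is in the standard library from Python 3.7 onward, so only Python 3.6 needs the PyPI backport. A possible (illustrative, not the project's actual) Poetry constraint in pyproject.toml:

```toml
[tool.poetry.dependencies]
python = "^3.6"
# backport of the stdlib module; installed only on Python 3.6
dataclasses = { version = "^0.8", python = "~3.6" }
```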
Hello!
I have tried quite a few parquet cli's out there, and I think this one is the best!
Here is my use case:
I am running this tool in a local environment in Docker Compose against a local S3 emulator/server. However, because --endpoint-url
comes after the subcommand and not the main command, it makes my invocations a little more verbose. Here is what I mean...
This is my docker-compose:
services:
  s3:
    image: scireum/s3-ninja
    pull_policy: always
    ports:
      - 9000:9000
    labels:
      - traefik.enable=true
      - traefik.http.routers.s3.rule=Host(`s3`) && PathPrefix(`/`)
      - traefik.http.services.s3.loadbalancer.server.port=9000
  parquet-tools:
    profiles:
      - script
    build:
      dockerfile_inline: |
        ARG PYTHON_TOOLS_VERSION=0.2.15
        FROM apache/arrow-dev:amd64-conda-python-3.10
        RUN python -m ensurepip --upgrade
        RUN pip install parquet-tools==0.2.15
    environment:
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    entrypoint:
      - parquet-tools
    depends_on:
      - s3
  aws-cli:
    profiles:
      - script
    image: public.ecr.aws/aws-cli/aws-cli
    pull_policy: always
    environment:
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    entrypoint:
      - aws
      - --endpoint-url
      - http://s3:9000
    depends_on:
      - s3
volumes:
  s3:
Comparing the aws-cli service with the parquet-tools service, you'll notice that I can add --endpoint-url to the aws-cli entrypoint, so that running it from the CLI looks like this:
$ docker compose run --rm aws-cli s3 ls s3://BUCKET/...
When using the parquet-tools binary, I have to do something like:
$ docker compose run --rm parquet-tools show --endpoint-url http://s3:9000 s3://BUCKET/...
I would love it if I could call the command like this against my local S3 (either by moving the flag or by supporting it via an environment variable):
$ docker compose run --rm parquet-tools show s3://BUCKET/...
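The environment-variable fallback could be as simple as an argparse default. This is a sketch of the requested behaviour, not an existing parquet-tools feature, and the ENDPOINT_URL variable name is hypothetical:

```python
import argparse
import os

# Pre-set for the demo; in docker-compose this would come from `environment:`.
os.environ.setdefault("ENDPOINT_URL", "http://s3:9000")

parser = argparse.ArgumentParser(prog="parquet-tools")
parser.add_argument(
    "--endpoint-url",
    default=os.environ.get("ENDPOINT_URL"),
    help="S3 endpoint; defaults to $ENDPOINT_URL when the flag is omitted",
)
args = parser.parse_args([])  # no flag needed when the env var is set
print(args.endpoint_url)  # http://s3:9000
```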
Thanks!
It'd be great if you'd support moto 5.x.
moto 5 brings some API changes, but seems to be much simpler overall.
I think that PYTHONPROFILEIMPORTTIME (https://docs.python.org/3/using/cmdline.html#envvar-PYTHONPROFILEIMPORTTIME) is useful.
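For reference, the variable sends per-module import timings to stderr, equivalent to running python -X importtime; shown here against a stdlib import, with the parquet-tools invocation as a comment:

```shell
# Profile startup imports without touching any code; timings go to stderr.
# Against parquet-tools itself this would be:
#   PYTHONPROFILEIMPORTTIME=1 parquet-tools --help 2> import-times.log
PYTHONPROFILEIMPORTTIME=1 python3 -c "import json" 2>&1 | head -n 3
```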
With the latest release (0.2.5), the output of inspect changed from the old output to what you now also show in your readme:
############ Column(UpdateTime) ############
name: UpdateTime
path: UpdateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MILLIS
became:
SchemaElement
    type = 2
    repetition_type = 1
    name = UpdateTime
    converted_type = 9
I found the old output much more useful, especially since in my output, unlike the example in the readme, the types are not even translated to their string representations but appear as bare integers (1, 9, ...).
Is there any way to switch that back, or make it switchable via an option?
It would be useful to print column sizes with inspect.
Hello,
I've created a Parquet dump from a Vertica database and, upon inspecting it with parquet-tools, I see that num_rows is negative. Any ideas? I'm not sure whether this is a bug in Vertica or not.
############ file meta data ############
created_by: parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)
num_columns: 3
num_rows: -1803805480
num_row_groups: 1
format_version: 1.0
serialized_size: 329
Later edit: when I exported the table, Vertica reported that it had exported 2491161816 rows, which is more than 2^31 (I'm not sure what type num_rows is).
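The two numbers are consistent with the writer storing the row count in a signed 32-bit field: reinterpreting Vertica's reported count as an int32 gives exactly the negative value inspect shows.

```python
import struct

# 2491161816 exceeds 2^31 - 1; packed as an unsigned 32-bit integer and
# read back as signed, it wraps to the value reported by inspect.
true_rows = 2491161816
wrapped, = struct.unpack("<i", struct.pack("<I", true_rows))
print(wrapped)  # -1803805480
```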
Thanks for developing this useful package. I needed a quick way to view parquet files from the terminal, and this package works great.
I am working on a submission to conda-forge. This will create conda binaries for all operating systems, and also auto-update whenever you release a new version to PyPI.
conda-forge/staged-recipes#15926
There are a few license issues I have to address before the recipe is accepted:
- the package license needs to be included in the distribution (via MANIFEST.in);
- the repo vendors parquet.thrift, which is released under the Apache License, and that license also requires inclusion in the distributed software.
Ideally we could add these here in the source repo. I'm happy to send a PR if you'd like; otherwise I can address them in the conda recipe. As a previous example, here is a PR of mine that included the package license as well as the license of a vendored file: randy3k/lineedit#4
I tried installing with a few Python versions, including 3.8 and 3.6, and both installed successfully; however, executing parquet-tools returns the error in the title.
The readme seems to be outdated; for example, there is no inspect subcommand anymore.
CSV export is nice, but if a text field itself contains the separator, the output cannot be interpreted correctly.
Exporting as JSON would alleviate the problem.
Another possibility is to export the CSV with quoting options.
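Proper quoting already makes embedded separators round-trip safely in pandas, so a csv-command option could simply expose it; a minimal sketch:

```python
import csv

import pandas as pd

# QUOTE_MINIMAL wraps only the fields that contain the separator, so a
# value with an embedded comma stays parseable.
df = pd.DataFrame({"text": ["contains,comma"], "n": [1]})
out = df.to_csv(index=False, quoting=csv.QUOTE_MINIMAL)
print(out)  # the comma-bearing field comes out as "contains,comma"
```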
parquet-tools show -n 5 s3://bucket-name/prefix/* is very slow.
The reason is that the head rows are extracted only after all files have been downloaded.
It should stop downloading files early once enough rows are available.
For both csv and show actions, when the --columns argument is supplied, the --head value is ignored and all rows are presented.
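The expected composition of the two flags is simply column selection followed by the head limit; a pandas sketch (column names and the limit are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10), "c": range(10)})

# --columns a b  combined with  --head 3: select first, then limit rows.
subset = df[["a", "b"]].head(3)
print(subset.shape)  # (3, 2)
```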