
parquet-tools's People

Contributors

dependabot[bot], engstrom, exaspace, fabaff, geronimogoemon, jdblischak, ktrueda, mrdavidlaing, omo, ryan-williams, sattler, sbrandtb, sligocki, smousa, uniocto


parquet-tools's Issues

Including license files

Thanks for developing this useful package. I needed a quick way to view parquet files from the terminal, and this package works great.

I am working on a submission to conda-forge. This will create conda binaries for all operating systems, and also auto-update whenever you release a new version to PyPI.

conda-forge/staged-recipes#15926

There are a few license issues I have to address before the recipe is accepted:

  1. The license is MIT, but no file containing the license text is included, which the MIT license requires. This can be addressed by adding a license file and then listing it in MANIFEST.in (see the sketch after this list).
  2. This package vendors parquet.thrift, which is released under the Apache License; that license likewise requires that it be included with the distributed software.
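
A sketch of the MANIFEST.in additions (the location of the vendored thrift file is an assumption on my part):

include LICENSE
include parquet_tools/parquet.thrift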

Ideally we could add these here in the source repo. I'm happy to send a PR if you'd like; otherwise I can address them in the conda recipe. As an example, here is an earlier PR of mine that includes the package license as well as the license of a vendored file: randy3k/lineedit#4

make --endpoint-url configurable by environment, or move it before the subcommand

Hello!

I have tried quite a few parquet CLIs out there, and I think this one is the best!

Here is my use case:
I am running this tool in a local environment in docker compose against a local S3 emulator/server. However, because --endpoint-url comes after the subcommand rather than the main command, it makes my invocations a little more verbose. Here is what I mean...

This is my docker-compose:

services:
  s3:
    image: scireum/s3-ninja
    pull_policy: always
    ports:
      - 9000:9000
    labels:
      - traefik.enable=true
      - traefik.http.routers.s3.rule=Host(`s3`) && PathPrefix(`/`)
      - traefik.http.services.s3.loadbalancer.server.port=9000

  parquet-tools:
    profiles:
      - script
    build:
      dockerfile_inline: |
        FROM apache/arrow-dev:amd64-conda-python-3.10
        ARG PYTHON_TOOLS_VERSION=0.2.15
        RUN python -m ensurepip --upgrade
        RUN pip install parquet-tools==$$PYTHON_TOOLS_VERSION
    environment:
      AWS_ACCESS_KEY_ID:  ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    entrypoint:
      - parquet-tools
    depends_on:
      - s3

  aws-cli:
    profiles:
      - script
    image: public.ecr.aws/aws-cli/aws-cli
    pull_policy: always
    environment:
      AWS_ACCESS_KEY_ID:  ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    entrypoint:
      - aws
      - --endpoint-url
      - http://s3:9000
    depends_on:
      - s3

volumes:
  s3:

Comparing the aws-cli service with the parquet-tools service, you'll notice that for aws-cli I can add --endpoint-url to the entrypoint, so that running it from the CLI looks like this:

$ docker compose run --rm aws-cli s3 ls s3://BUCKET/...

When using the parquet-tools binary, I have to do something like:

$ docker compose run --rm parquet-tools show --endpoint-url http://s3:9000 s3://BUCKET/...

I would love to be able to call the command like this against my local S3 (either by moving the flag or by supporting it via an environment variable):

$ docker compose run --rm parquet-tools show s3://BUCKET/...
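
A minimal sketch of what the environment-variable fallback could look like, assuming the CLI is argparse-based (the variable name AWS_ENDPOINT_URL is illustrative, not an existing parquet-tools convention):

import argparse
import os

def add_endpoint_arg(parser: argparse.ArgumentParser) -> None:
    # Fall back to an environment variable when --endpoint-url is not
    # passed explicitly on the command line.
    parser.add_argument(
        '--endpoint-url',
        default=os.environ.get('AWS_ENDPOINT_URL'),
        help='endpoint URL of the S3 API; defaults to $AWS_ENDPOINT_URL if set')

With that, I could set AWS_ENDPOINT_URL in the compose environment block and the shorter invocation above would just work.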

Thanks!

incompatible with moto >= 5

It'd be great if you'd support moto 5.x.

moto 5 brings some API changes, but seems to be much simpler overall.
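
For context, the main breaking change is that moto 5 merged the per-service decorators into a single one; the test body below is illustrative:

# moto < 5
from moto import mock_s3

@mock_s3
def test_show_s3():
    ...

# moto >= 5
from moto import mock_aws

@mock_aws
def test_show_s3():
    ...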

Schema command not working

I am not able to generate the schema for a parquet file.
Below is the error:
parquet-tools schema Sample.parquet
usage: parquet-tools [-h] {show,csv,inspect} ...
parquet-tools: error: argument {show,csv,inspect}: invalid choice: 'schema' (choose from 'show', 'csv', 'inspect')
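
For what it's worth, schema is a subcommand of the Java parquet-tools, not of this Python package; as the usage line shows, the closest equivalent here appears to be:

$ parquet-tools inspect Sample.parquet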

Add show meta

Show row groups, the compression used, metadata, and encodings.

e.g.

docker run -v $PWD/houseprices:/data markhneedham/pq meta /data/house_prices.parquet

File path:  /data/house_prices.parquet
Created by: parquet-cpp version 1.5.1-SNAPSHOT
Properties: (none)
Schema:
message schema {
  required int32 price (INTEGER(32,false));
  required int32 date (INTEGER(16,false));
  required binary postcode1;
  required binary postcode2;
  required int32 type (INTEGER(8,true));
  required int32 is_new (INTEGER(8,false));
  required int32 duration (INTEGER(8,true));
  required binary addr1;
  required binary addr2;
  required binary street;
  required binary locality;
  required binary town;
  required binary district;
  required binary county;
}


Row group 0:  count: 1000000  6.32 B records  start: 4  total(compressed): 6.026 MB  total(uncompressed): 9.089 MB
--------------------------------------------------------------------------------
           type      encodings count     avg size   nulls   min / max
price      INT32     Z _ R     1000000   1.72 B     0       "100" / "523000000"
date       INT32     Z _ R     1000000   1.77 B     0       "9131" / "19405"
postcode1  BINARY    Z _ R     1000000   0.00 B     0       "0x" / "0x42413131"
postcode2  BINARY    Z _ R     1000000   0.16 B     0       "0x" / "0x39595A"
type       INT32     Z _ R     1000000   0.19 B     0       "0" / "4"
is_new     INT32     Z _ R     1000000   0.04 B     0       "0" / "1"
duration   INT32     Z _ R     1000000   0.07 B     0       "0" / "2"
addr1      BINARY    Z _ R     1000000   1.20 B     0       "0x" / "0x5A5954454B20484F555345"
addr2      BINARY    Z _ R     1000000   0.20 B     0       "0x" / "0x5A4F4E452043"
street     BINARY    Z _ R     1000000   0.44 B     0       "0x" / "0x5A494F4E5320434C4F5345"
locality   BINARY    Z _ R     1000000   0.35 B     0       "0x" / "0x5A45414C53"
town       BINARY    Z _ R     1000000   0.08 B     0       "0x4142424F5453204C414E474..." / "0x595354524144204D4555524947"
district   BINARY    Z _ R     1000000   0.06 B     0       "0x41445552" / "0x594F524B"
county     BINARY    Z _ R     1000000   0.03 B     0       "0x41564F4E" / "0x594F524B"
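
A sketch of where this information could come from, using pyarrow (which this package already builds on); the file name matches the example above:

import pyarrow.parquet as pq

meta = pq.ParquetFile('house_prices.parquet').metadata
print(meta.created_by, meta.num_rows, meta.num_row_groups)

rg = meta.row_group(0)  # RowGroupMetaData
for i in range(rg.num_columns):
    col = rg.column(i)  # ColumnChunkMetaData
    print(col.path_in_schema, col.physical_type, col.compression,
          col.encodings, col.total_compressed_size, col.statistics)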

What happened to the old output of "inspect"?

With the latest release (0.2.5), the output of inspect changed from the old output to what you also show in your readme now:

############ Column(UpdateTime) ############
name: UpdateTime
path: UpdateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MILLIS

became:

SchemaElement
    type = 2
    repetition_type = 1
    name = UpdateTime
    converted_type = 9

I found the old output much more useful, especially since in my output, as opposed to the example in the readme, the types are not even translated to their string representations but are shown as plain integers (1, 9, ...).
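
For reference, those integers are the raw enum values from parquet.thrift: in the snippet above, type = 2 is INT64, repetition_type = 1 is OPTIONAL, and converted_type = 9 is TIMESTAMP_MILLIS, consistent with the old output. A minimal decoding sketch:

# Enum values as defined in parquet.thrift
TYPE = {0: 'BOOLEAN', 1: 'INT32', 2: 'INT64', 3: 'INT96',
        4: 'FLOAT', 5: 'DOUBLE', 6: 'BYTE_ARRAY', 7: 'FIXED_LEN_BYTE_ARRAY'}
REPETITION = {0: 'REQUIRED', 1: 'OPTIONAL', 2: 'REPEATED'}

print(TYPE[2], REPETITION[1])  # INT64 OPTIONAL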

Is there any way we can switch that back, or make it switchable via an option?

Missing `dataclasses` in requirements

I just installed this package with pip (python3 -m pip install parquet-tools) on a Debian server and got this error:

Traceback (most recent call last):
  File "/home/taz/.local/bin/parquet-tools", line 5, in <module>
    from parquet_tools.cli import main
  File "/home/taz/.local/lib/python3.6/site-packages/parquet_tools/cli.py", line 2, in <module>
    from parquet_tools.commands import show, csv, inspect
  File "/home/taz/.local/lib/python3.6/site-packages/parquet_tools/commands/show.py", line 9, in <module>
    from .utils import (FileNotFoundException, InvalidCommandExcpetion,
  File "/home/taz/.local/lib/python3.6/site-packages/parquet_tools/commands/utils.py", line 4, in <module>
    from dataclasses import dataclass
ModuleNotFoundError: No module named 'dataclasses'

I can see that the dataclasses library is used in parquet_tools/commands/utils.py but is not declared as a dependency in pyproject.toml.
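
dataclasses has been part of the standard library since Python 3.7, so the dependency is only needed on 3.6 (which the traceback above shows). A sketch of a conditional requirement using a PEP 508 environment marker (the exact syntax depends on how the project declares its dependencies):

dataclasses; python_version < "3.7"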

Setting to force length of columns (truncate in spark)

Could it be possible to have a parameter that behaves like truncate in Spark's show()?

Currently, if you try to show a data frame with a column that holds a long string, or with many columns, the shape of the data frame breaks down, making it hard to read. Since I don't care about that column, I have two options: use --columns and select every column except that one (maybe there could be a parameter listing the columns to exclude directly?), or truncate the width of the columns, but I couldn't find such an option.
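
A sketch of one way such an option could truncate cell values before the table is rendered, assuming the data sits in a pandas DataFrame as elsewhere in this package (the function name and default width are illustrative):

import pandas as pd

def truncate_cells(df: pd.DataFrame, width: int = 20) -> pd.DataFrame:
    # Shorten any string cell longer than `width`, Spark-style.
    def shorten(v):
        if isinstance(v, str) and len(v) > width:
            return v[:width - 3] + '...'
        return v
    return df.applymap(shorten)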

num_rows negative

Hello,

I've created a PARQUET dump from a Vertica database, and upon inspecting it with parquet-tools I see that num_rows is negative. Any ideas? I'm not sure whether this is a bug in Vertica or not.

############ file meta data ############
created_by: parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)
num_columns: 3
num_rows: -1803805480
num_row_groups: 1
format_version: 1.0
serialized_size: 329

Later edit:
When I exported the table, Vertica reported that it had exported 2491161816 rows, which is more than 2^31 (I'm not sure what type num_rows is).
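
That count is consistent with a signed 32-bit overflow somewhere in the writer: 2491161816 wrapped into an int32 gives exactly the negative value shown. A quick check:

rows = 2491161816
print(rows - 2**32)  # -1803805480, exactly the num_rows reported by inspect

Since parquet.thrift declares num_rows as an i64, the truncation most likely happened on the Vertica side.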

Broken pipe when piping to `less`

When piping parquet-tools csv output to less, I consistently get a BrokenPipeError when less is closed:

> parquet-tools csv my.parquet | less
Traceback (most recent call last):
  File "/.../bin/parquet-tools", line 10, in <module>
    sys.exit(main())
  File ".../lib/python3.9/site-packages/parquet_tools/cli.py", line 26, in main
    args.handler(args)
  File "/.../lib/python3.9/site-packages/parquet_tools/commands/csv.py", line 46, in _cli
    _execute(
  File "/.../lib/python3.9/site-packages/parquet_tools/commands/csv.py", line 62, in _execute
    print(df_select.to_csv(index=None))
  File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 47, in write
    self.__convertor.write(text)
  File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 170, in write
    self.write_and_convert(text)
  File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 198, in write_and_convert
    self.write_plain_text(text, cursor, len(text))
  File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 203, in write_plain_text
    self.wrapped.write(text[start:end])
BrokenPipeError: [Errno 32] Broken pipe

It is not a deal breaker, since in less I can see the output just fine. It is just a bit distracting, so I thought I'd mention it.
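
For what it's worth, the Python documentation suggests a pattern for command-line tools that write to stdout; a sketch of how the entry point could be wrapped (untested against this code base):

import os
import sys

try:
    main()
except BrokenPipeError:
    # Redirect stdout to devnull so interpreter shutdown does not raise a
    # second BrokenPipeError while flushing the already-closed stream.
    devnull = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, sys.stdout.fileno())
    sys.exit(1)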

[feature request] support json as conversion format or better csv support

CSV export is nice, but if a text field itself contains the separator, the output cannot be correctly interpreted.

  1. Exporting as JSON would alleviate the problem

  2. Another possibility is to export the CSV with options:

  • quoting all string field output (and escaping embedded quotes as "")
  • specifying a different custom separator (e.g. using some improbable character or set of characters as the separator: ¤µ§)
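
Both options map directly onto pandas; the traceback in the broken-pipe issue above shows the csv command already goes through DataFrame.to_csv. A sketch (the file name is illustrative):

import csv
import pandas as pd

df = pd.read_parquet('my.parquet')

# 1. JSON export, one record per line
print(df.to_json(orient='records', lines=True))

# 2a. CSV with every non-numeric field quoted (embedded quotes doubled)
print(df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC))

# 2b. CSV with a custom single-character separator
print(df.to_csv(index=False, sep='¤'))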

Early stopping download if use -n option

parquet-tools show -n 5 s3://bucket-name/prefix/*

is very slow.

The reason is that the head rows are extracted only after all files have been downloaded. It should stop downloading as soon as enough rows have been read.
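
A sketch of the local half of this, assuming pyarrow: stream record batches and stop as soon as n rows are available, instead of materializing every file first. (Cutting short an in-flight S3 transfer would additionally need a filesystem that reads lazily, such as pyarrow's own S3 support.)

import pyarrow as pa
import pyarrow.parquet as pq

def head(path, n=5):
    pf = pq.ParquetFile(path)
    batches, got = [], 0
    for batch in pf.iter_batches(batch_size=n):
        batches.append(batch)
        got += batch.num_rows
        if got >= n:
            break  # later row groups are never read
    return pa.Table.from_batches(batches, schema=pf.schema_arrow).slice(0, n)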
