
parquet-tools's People

Contributors

dependabot[bot], engstrom, exaspace, fabaff, geronimogoemon, jdblischak, ktrueda, mrdavidlaing, omo, ryan-williams, sattler, sbrandtb, sligocki, smousa, uniocto


parquet-tools's Issues

Including license files

Thanks for developing this useful package. I needed a quick way to view parquet files from the terminal, and this package works great.

I am working on a submission to conda-forge. This will create conda binaries for all operating systems, and also auto-update whenever you release a new version to PyPI.

conda-forge/staged-recipes#15926

There are a few license issues I have to address before the recipe is accepted:

  1. The license is MIT, but no file containing the license text is included, which the MIT license requires. This can be addressed by adding a license file and then listing it in MANIFEST.in (see the sketch after this list).
  2. This package vendors parquet.thrift, which is released under the Apache License; that license likewise requires that it be included with the distributed software.
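
A sketch of the MANIFEST.in additions (the location of the vendored thrift file is an assumption on my part):

include LICENSE
include parquet_tools/parquet.thrift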

Ideally we could add these here in the source repo. I'm happy to send a PR if you'd like; otherwise I can address them in the conda recipe. As an example, here is an earlier PR of mine that includes the package license as well as the license of a vendored file: randy3k/lineedit#4

make --endpoint-url configurable by environment, or move it before the subcommand

Hello!

I have tried quite a few parquet CLIs out there, and I think this one is the best!

Here is my use case:
I am running this tool in a local environment in docker compose against a local S3 emulator/server. However, because --endpoint-url comes after the subcommand rather than the main command, it makes my invocations a little more verbose. Here is what I mean...

This is my docker-compose:

services:
  s3:
    image: scireum/s3-ninja
    pull_policy: always
    ports:
      - 9000:9000
    labels:
      - traefik.enable=true
      - traefik.http.routers.s3.rule=Host(`s3`) && PathPrefix(`/`)
      - traefik.http.services.s3.loadbalancer.server.port=9000

  parquet-tools:
    profiles:
      - script
    build:
      dockerfile_inline: |
        FROM apache/arrow-dev:amd64-conda-python-3.10
        ARG PYTHON_TOOLS_VERSION=0.2.15
        RUN python -m ensurepip --upgrade
        RUN pip install parquet-tools==$$PYTHON_TOOLS_VERSION
    environment:
      AWS_ACCESS_KEY_ID:  ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    entrypoint:
      - parquet-tools
    depends_on:
      - s3

  aws-cli:
    profiles:
      - script
    image: public.ecr.aws/aws-cli/aws-cli
    pull_policy: always
    environment:
      AWS_ACCESS_KEY_ID:  ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    entrypoint:
      - aws
      - --endpoint-url
      - http://s3:9000
    depends_on:
      - s3

volumes:
  s3:

Comparing the aws-cli service with the parquet-tools service, you'll notice that for aws-cli I can add --endpoint-url to the entrypoint, so that running it from the CLI looks like this:

$ docker compose run --rm aws-cli s3 ls s3://BUCKET/...

When using the parquet-tools binary, I have to do something like:

$ docker compose run --rm parquet-tools show --endpoint-url http://s3:9000 s3://BUCKET/...

I would love to be able to call the command like this against my local S3 (either by moving the flag or by supporting it via an environment variable):

$ docker compose run --rm parquet-tools show s3://BUCKET/...
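
A minimal sketch of what the environment-variable fallback could look like, assuming the CLI is argparse-based (the variable name AWS_ENDPOINT_URL is illustrative, not an existing parquet-tools convention):

import argparse
import os

def add_endpoint_arg(parser: argparse.ArgumentParser) -> None:
    # Fall back to an environment variable when --endpoint-url is not
    # passed explicitly on the command line.
    parser.add_argument(
        '--endpoint-url',
        default=os.environ.get('AWS_ENDPOINT_URL'),
        help='endpoint URL of the S3 API; defaults to $AWS_ENDPOINT_URL if set')

With that, I could set AWS_ENDPOINT_URL in the compose environment block and the shorter invocation above would just work.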

Thanks!

incompatible with moto >= 5

It'd be great if you'd support moto 5.x.

moto 5 brings some API changes, but seems to be much simpler overall.
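
For context, the main breaking change is that moto 5 merged the per-service decorators into a single one; the test body below is illustrative:

# moto < 5
from moto import mock_s3

@mock_s3
def test_show_s3():
    ...

# moto >= 5
from moto import mock_aws

@mock_aws
def test_show_s3():
    ...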

Schema command not working

I am not able to generate the schema for a parquet file.
Below is the error:
parquet-tools schema Sample.parquet
usage: parquet-tools [-h] {show,csv,inspect} ...
parquet-tools: error: argument {show,csv,inspect}: invalid choice: 'schema' (choose from 'show', 'csv', 'inspect')
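
For what it's worth, schema is a subcommand of the Java parquet-tools, not of this Python package; as the usage line shows, the closest equivalent here appears to be:

$ parquet-tools inspect Sample.parquet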

Add show meta

Show row groups, the compression used, metadata, and encodings.

e.g.

docker run -v $PWD/houseprices:/data markhneedham/pq meta /data/house_prices.parquet

File path:  /data/house_prices.parquet
Created by: parquet-cpp version 1.5.1-SNAPSHOT
Properties: (none)
Schema:
message schema {
  required int32 price (INTEGER(32,false));
  required int32 date (INTEGER(16,false));
  required binary postcode1;
  required binary postcode2;
  required int32 type (INTEGER(8,true));
  required int32 is_new (INTEGER(8,false));
  required int32 duration (INTEGER(8,true));
  required binary addr1;
  required binary addr2;
  required binary street;
  required binary locality;
  required binary town;
  required binary district;
  required binary county;
}


Row group 0:  count: 1000000  6.32 B records  start: 4  total(compressed): 6.026 MB  total(uncompressed): 9.089 MB
--------------------------------------------------------------------------------
           type      encodings count     avg size   nulls   min / max
price      INT32     Z _ R     1000000   1.72 B     0       "100" / "523000000"
date       INT32     Z _ R     1000000   1.77 B     0       "9131" / "19405"
postcode1  BINARY    Z _ R     1000000   0.00 B     0       "0x" / "0x42413131"
postcode2  BINARY    Z _ R     1000000   0.16 B     0       "0x" / "0x39595A"
type       INT32     Z _ R     1000000   0.19 B     0       "0" / "4"
is_new     INT32     Z _ R     1000000   0.04 B     0       "0" / "1"
duration   INT32     Z _ R     1000000   0.07 B     0       "0" / "2"
addr1      BINARY    Z _ R     1000000   1.20 B     0       "0x" / "0x5A5954454B20484F555345"
addr2      BINARY    Z _ R     1000000   0.20 B     0       "0x" / "0x5A4F4E452043"
street     BINARY    Z _ R     1000000   0.44 B     0       "0x" / "0x5A494F4E5320434C4F5345"
locality   BINARY    Z _ R     1000000   0.35 B     0       "0x" / "0x5A45414C53"
town       BINARY    Z _ R     1000000   0.08 B     0       "0x4142424F5453204C414E474..." / "0x595354524144204D4555524947"
district   BINARY    Z _ R     1000000   0.06 B     0       "0x41445552" / "0x594F524B"
county     BINARY    Z _ R     1000000   0.03 B     0       "0x41564F4E" / "0x594F524B"
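
A sketch of where this information could come from, using pyarrow (which this package already builds on); the file name matches the example above:

import pyarrow.parquet as pq

meta = pq.ParquetFile('house_prices.parquet').metadata
print(meta.created_by, meta.num_rows, meta.num_row_groups)

rg = meta.row_group(0)  # RowGroupMetaData
for i in range(rg.num_columns):
    col = rg.column(i)  # ColumnChunkMetaData
    print(col.path_in_schema, col.physical_type, col.compression,
          col.encodings, col.total_compressed_size, col.statistics)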

What happened to the old output of "inspect"?

With the latest release (0.2.5), the output of inspect changed from the old output to what you also show in your readme now:

############ Column(UpdateTime) ############
name: UpdateTime
path: UpdateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=true, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MILLIS

became:

SchemaElement
    type = 2
    repetition_type = 1
    name = UpdateTime
    converted_type = 9

I found the old output much more useful, especially since in my output, as opposed to the example in the readme, the types are not even translated to their string representations but are shown as plain integers (1, 9, ...).
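
For reference, those integers are the raw enum values from parquet.thrift: in the snippet above, type = 2 is INT64, repetition_type = 1 is OPTIONAL, and converted_type = 9 is TIMESTAMP_MILLIS, consistent with the old output. A minimal decoding sketch:

# Enum values as defined in parquet.thrift
TYPE = {0: 'BOOLEAN', 1: 'INT32', 2: 'INT64', 3: 'INT96',
        4: 'FLOAT', 5: 'DOUBLE', 6: 'BYTE_ARRAY', 7: 'FIXED_LEN_BYTE_ARRAY'}
REPETITION = {0: 'REQUIRED', 1: 'OPTIONAL', 2: 'REPEATED'}

print(TYPE[2], REPETITION[1])  # INT64 OPTIONAL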

Is there any way we can switch that back, or make it switchable via an option?

Missing `dataclasses` in requirements

I just installed this package with pip (python3 -m pip install parquet-tools) on a Debian server and got this error:

Traceback (most recent call last):
  File "/home/taz/.local/bin/parquet-tools", line 5, in <module>
    from parquet_tools.cli import main
  File "/home/taz/.local/lib/python3.6/site-packages/parquet_tools/cli.py", line 2, in <module>
    from parquet_tools.commands import show, csv, inspect
  File "/home/taz/.local/lib/python3.6/site-packages/parquet_tools/commands/show.py", line 9, in <module>
    from .utils import (FileNotFoundException, InvalidCommandExcpetion,
  File "/home/taz/.local/lib/python3.6/site-packages/parquet_tools/commands/utils.py", line 4, in <module>
    from dataclasses import dataclass
ModuleNotFoundError: No module named 'dataclasses'

I can see that the dataclasses library is used in parquet_tools/commands/utils.py but is not declared as a dependency in pyproject.toml.
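
dataclasses has been part of the standard library since Python 3.7, so the dependency is only needed on 3.6 (which the traceback above shows). A sketch of a conditional requirement using a PEP 508 environment marker (the exact syntax depends on how the project declares its dependencies):

dataclasses; python_version < "3.7"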

Setting to force length of columns (truncate in spark)

Could it be possible to have a parameter that behaves like truncate in Spark's show()?

Currently, if you try to show a data frame with a column that holds a long string, or with many columns, the shape of the data frame breaks down, making it hard to read. Since I don't care about that column, I have two options: use --columns and select every column except that one (maybe there could be a parameter listing the columns to exclude directly?), or truncate the width of the columns, but I couldn't find such an option.
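
A sketch of one way such an option could truncate cell values before the table is rendered, assuming the data sits in a pandas DataFrame as elsewhere in this package (the function name and default width are illustrative):

import pandas as pd

def truncate_cells(df: pd.DataFrame, width: int = 20) -> pd.DataFrame:
    # Shorten any string cell longer than `width`, Spark-style.
    def shorten(v):
        if isinstance(v, str) and len(v) > width:
            return v[:width - 3] + '...'
        return v
    return df.applymap(shorten)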

num_rows negative

Hello,

I've created a PARQUET dump from a Vertica database, and upon inspecting it with parquet-tools I see that num_rows is negative. Any ideas? I'm not sure whether this is a bug in Vertica or not.

############ file meta data ############
created_by: parquet-cpp version 1.0.0-v10.1.1 (build Vertica Analytic Database)
num_columns: 3
num_rows: -1803805480
num_row_groups: 1
format_version: 1.0
serialized_size: 329

Later edit:
When I exported the table, Vertica reported that it had exported 2491161816 rows, which is more than 2^31 (I'm not sure what type num_rows is).
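
That count is consistent with a signed 32-bit overflow somewhere in the writer: 2491161816 wrapped into an int32 gives exactly the negative value shown. A quick check:

rows = 2491161816
print(rows - 2**32)  # -1803805480, exactly the num_rows reported by inspect

Since parquet.thrift declares num_rows as an i64, the truncation most likely happened on the Vertica side.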

Broken pipe when piping to `less`

When piping parquet-tools csv output to less, I consistently get a BrokenPipeError when less is closed:

> parquet-tools csv my.parquet | less
Traceback (most recent call last):
  File "/.../bin/parquet-tools", line 10, in <module>
    sys.exit(main())
  File ".../lib/python3.9/site-packages/parquet_tools/cli.py", line 26, in main
    args.handler(args)
  File "/.../lib/python3.9/site-packages/parquet_tools/commands/csv.py", line 46, in _cli
    _execute(
  File "/.../lib/python3.9/site-packages/parquet_tools/commands/csv.py", line 62, in _execute
    print(df_select.to_csv(index=None))
  File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 47, in write
    self.__convertor.write(text)
  File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 170, in write
    self.write_and_convert(text)
  File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 198, in write_and_convert
    self.write_plain_text(text, cursor, len(text))
  File "/.../lib/python3.9/site-packages/colorama/ansitowin32.py", line 203, in write_plain_text
    self.wrapped.write(text[start:end])
BrokenPipeError: [Errno 32] Broken pipe

It is not a deal breaker, since in less I can see the output just fine. It is just a bit distracting, so I thought I'd mention it.
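
For what it's worth, the Python documentation suggests a pattern for command-line tools that write to stdout; a sketch of how the entry point could be wrapped (untested against this code base):

import os
import sys

try:
    main()
except BrokenPipeError:
    # Redirect stdout to devnull so interpreter shutdown does not raise a
    # second BrokenPipeError while flushing the already-closed stream.
    devnull = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, sys.stdout.fileno())
    sys.exit(1)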

[feature request] support json as conversion format or better csv support

CSV export is nice, but if a text field itself contains the separator, the output cannot be correctly interpreted.

  1. Exporting as JSON would alleviate the problem

  2. Another possibility is to export the CSV with options:

  • quoting all string field output (and escaping embedded quotes as "")
  • specifying a different custom separator (e.g. using some improbable character or set of characters as the separator: ¤µ§)
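
Both options map directly onto pandas; the traceback in the broken-pipe issue above shows the csv command already goes through DataFrame.to_csv. A sketch (the file name is illustrative):

import csv
import pandas as pd

df = pd.read_parquet('my.parquet')

# 1. JSON export, one record per line
print(df.to_json(orient='records', lines=True))

# 2a. CSV with every non-numeric field quoted (embedded quotes doubled)
print(df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC))

# 2b. CSV with a custom single-character separator
print(df.to_csv(index=False, sep='¤'))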

Early stopping download if use -n option

parquet-tools show -n 5 s3://bucket-name/prefix/*

is very slow.

The reason is that the head rows are extracted only after all files have been downloaded. It should stop downloading as soon as enough rows have been read.
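
A sketch of the local half of this, assuming pyarrow: stream record batches and stop as soon as n rows are available, instead of materializing every file first. (Cutting short an in-flight S3 transfer would additionally need a filesystem that reads lazily, such as pyarrow's own S3 support.)

import pyarrow as pa
import pyarrow.parquet as pq

def head(path, n=5):
    pf = pq.ParquetFile(path)
    batches, got = [], 0
    for batch in pf.iter_batches(batch_size=n):
        batches.append(batch)
        got += batch.num_rows
        if got >= n:
            break  # later row groups are never read
    return pa.Table.from_batches(batches, schema=pf.schema_arrow).slice(0, n)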
