parquet-tools's People

Contributors

dependabot[bot], hangxie, likang


parquet-tools's Issues

Get rid of filter option in cat

It's half-baked with limited features, while jq can do a better job. The original idea was to provide an embedded filter in case the data is too large for jq to load into memory, but with JSONL support in parquet-tools this may no longer be needed: even if jq loads the full JSONL into memory, one can split the jumbo file into small chunks and then apply jq to each.

Don't forget to update USAGE.md.

Refactor schema code

Code for the raw format is fine, but the code for JSON schema and go struct output is pretty messy; it should be refactored to make maintenance easier.

nested go struct

The current go struct implementation does not work well with nested structs; s3://dpla-provider-export/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet is a good example to test with.

CI: cannot build arm/v7 image

amd64 and arm64 are both good, but not arm/v7:

#34 [linux/arm/v7 builder 4/4] RUN apt update     && apt install -y bash make git     && make build
#34 sha256:795f1301e7583f9110bd32e669be8e74cc5c6f300e73b986d8015c928e772ac0
#34 110.4 go: downloading github.com/xitongsys/parquet-go-source v0.0.0-20201108113611-f372b7d813be
#34 110.6 go: downloading github.com/xitongsys/parquet-go v1.6.1-0.20210331075444-5ecfa15142b5
#34 111.5 go: downloading github.com/stretchr/testify v1.6.1
#34 112.5 go: downloading github.com/pkg/errors v0.9.1
#34 113.0 go: downloading github.com/golang/mock v1.4.3
#34 113.0 go: downloading github.com/apache/thrift v0.13.1-0.20201008052519-daf620915714
#34 114.6 go: downloading github.com/davecgh/go-spew v1.1.1
#34 115.3 go: downloading github.com/pmezard/go-difflib v1.0.0
#34 115.4 go: downloading gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c
#34 116.8 go: downloading github.com/jmespath/go-jmespath v0.4.0
#34 120.5 go: downloading gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127
#34 121.2 go: downloading golang.org/x/net v0.0.0-20201110031124-69a78807bb2b
#34 129.8 go: downloading github.com/jmespath/go-jmespath/internal/testify v1.5.1
#34 130.3 go: downloading github.com/golang/snappy v0.0.1
#34 130.6 go: downloading github.com/klauspost/compress v1.10.5
#34 144.0 go: downloading github.com/kr/pretty v0.1.0
#34 144.1 go: downloading gopkg.in/yaml.v2 v2.2.8
#34 144.1 go: downloading github.com/kr/text v0.1.0
#34 144.3 go: downloading golang.org/x/text v0.3.3
#34 157.7 ==> Building executable
#34 174.8 go build runtime/cgo: gcc: exit status 1

INT96 import issue

INT96 values do not match the original values after import:

    "Int96": "1717-12-28T19:20:10.805069776Z",		      |	    "Int96": "2022-01-01T09:09:09.009009Z",

Support HDFS

It seems quite a few people use Hadoop along with parquet, and the HDFS scheme is supported by parquet-mr; I think it's a good idea to have HDFS support here as well.

Unable to process map with value type of list

With a parquet file generated by parquet-go's json_schema.go example:

$ parquet-tools schema json_schema.parquet
panic: runtime error: index out of range [0] with length 0 [recovered]
	panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/alecthomas/kong.catch(0x1400081fe60)
	github.com/alecthomas/[email protected]/kong.go:383 +0xb8
panic({0x103c7e2c0, 0x140000cd260})
	runtime/panic.go:838 +0x204
github.com/hangxie/parquet-tools/cmd.(*schemaNode).updateTagFromConvertedType(0x1400081ec48, 0x1400060ef60?)
	github.com/hangxie/parquet-tools/cmd/schema.go:210 +0x818
github.com/hangxie/parquet-tools/cmd.(*schemaNode).getTagMap(0x1400081ec48)
	github.com/hangxie/parquet-tools/cmd/schema.go:285 +0x520
github.com/hangxie/parquet-tools/cmd.getTagMapAsChild({0x0, 0x0, 0x140000ca960, {0x140006143da, 0x5}, 0x140000ca9b0, 0x140000ca9b8, 0x0, 0x0, 0x0, ...}, ...)
	github.com/hangxie/parquet-tools/cmd/schema.go:312 +0x98
github.com/hangxie/parquet-tools/cmd.(*schemaNode).updateTagFromConvertedType(0x1400014ecb0, 0x1400060eed0?)
	github.com/hangxie/parquet-tools/cmd/schema.go:225 +0x558
github.com/hangxie/parquet-tools/cmd.(*schemaNode).getTagMap(0x1400014ecb0)
	github.com/hangxie/parquet-tools/cmd/schema.go:285 +0x520
github.com/hangxie/parquet-tools/cmd.(*schemaNode).jsonSchema(0x1400014ecb0)
	github.com/hangxie/parquet-tools/cmd/schema.go:129 +0x24
github.com/hangxie/parquet-tools/cmd.(*schemaNode).jsonSchema(0x1400014e7e0)
	github.com/hangxie/parquet-tools/cmd/schema.go:147 +0x160
github.com/hangxie/parquet-tools/cmd.(*SchemaCmd).Run(0x104375920, 0x1?)
	github.com/hangxie/parquet-tools/cmd/schema.go:37 +0x278
reflect.Value.call({0x103b79460?, 0x104375920?, 0x140006bfad8?}, {0x1038feff2, 0x4}, {0x1400000e1b0, 0x1, 0x10314604c?})
	reflect/value.go:556 +0x5e4
reflect.Value.Call({0x103b79460?, 0x104375920?, 0xc?}, {0x1400000e1b0, 0x1, 0x1})
	reflect/value.go:339 +0x98
github.com/alecthomas/kong.callMethod({0x1038feb24, 0x3}, {0x103c05580?, 0x104375920?, 0x3?}, {0x103b79460?, 0x104375920?, 0x0?}, 0x0?)
	github.com/alecthomas/[email protected]/callbacks.go:71 +0x3a4
github.com/alecthomas/kong.(*Context).RunNode(0x140000c4f80, 0x1400038e700, {0x140006bff00, 0x1, 0x1})
	github.com/alecthomas/[email protected]/context.go:706 +0x468
github.com/alecthomas/kong.(*Context).Run(0x1400038e1c0?, {0x140006bff00?, 0x0?, 0x0?})
	github.com/alecthomas/[email protected]/context.go:723 +0xc0
main.main()
	github.com/hangxie/parquet-tools/main.go:40 +0x2bc

Update usage

Review USAGE.md to make sure all sections are up to date.

add to brew formula

We should make a v1.0.0 release before doing so; we are pretty close to that.

build rpm

Need to check if RPM is still a thing ... I've been away from RHEL/CentOS/Fedora for some time.

Improve parquet-go error handling

parquet-go should not panic on catchable errors; the following error was from an invalid CSV schema:

$ go run . import -m cmd/testdata/jsonl.schema -s cmd/testdata/jsonl.source /tmp/jsonl.parquet
panic: runtime error: index out of range [1] with length 1 [recovered]
	panic: runtime error: index out of range [1] with length 1

goroutine 1 [running]:
github.com/alecthomas/kong.catch(0x140003bfef8)
	/Users/xiehang/go/pkg/mod/github.com/alecthomas/[email protected]/kong.go:383 +0xd0
panic({0x102dcc7a0, 0x14000036c18})
	/opt/homebrew/Cellar/go/1.17/libexec/src/runtime/panic.go:1038 +0x21c
github.com/xitongsys/parquet-go/common.StringToTag({0x14000500000, 0x1})
	/Users/xiehang/go/pkg/mod/github.com/hangxie/[email protected]/common/common.go:86 +0x2980
github.com/xitongsys/parquet-go/schema.NewSchemaHandlerFromMetadata({0x140002a0600, 0x14, 0x20})
	/Users/xiehang/go/pkg/mod/github.com/hangxie/[email protected]/schema/csv.go:28 +0x378
github.com/xitongsys/parquet-go/writer.NewCSVWriter({0x140002a0600, 0x14, 0x20}, {0x102e6cf50, 0x1400000fdb8}, 0x8)
	/Users/xiehang/go/pkg/mod/github.com/hangxie/[email protected]/writer/csv.go:27 +0x50
github.com/hangxie/parquet-tools/cmd.newCSVWriter({0x14000036be8, 0x12}, {0x140002a0600, 0x14, 0x20})
	/Users/xiehang/Dev/parquet-tools/cmd/common.go:174 +0x94
github.com/hangxie/parquet-tools/cmd.(*ImportCmd).importCSV(0x10349d588)
	/Users/xiehang/Dev/parquet-tools/cmd/import.go:56 +0x328
github.com/hangxie/parquet-tools/cmd.(*ImportCmd).Run(0x10349d588, 0x140000818a0)
	/Users/xiehang/Dev/parquet-tools/cmd/import.go:27 +0x54
reflect.Value.call({0x102d56de0, 0x10349d588, 0x213}, {0x102a76885, 0x4}, {0x1400000fda0, 0x1, 0x1})
	/opt/homebrew/Cellar/go/1.17/libexec/src/reflect/value.go:543 +0x584
reflect.Value.Call({0x102d56de0, 0x10349d588, 0x213}, {0x1400000fda0, 0x1, 0x1})
	/opt/homebrew/Cellar/go/1.17/libexec/src/reflect/value.go:339 +0x8c
github.com/alecthomas/kong.callMethod({0x102a763b9, 0x3}, {0x102da8860, 0x10349d588, 0x199}, {0x102d56de0, 0x10349d588, 0x213}, 0x1400047c8d0)
	/Users/xiehang/go/pkg/mod/github.com/alecthomas/[email protected]/callbacks.go:71 +0x4ac
github.com/alecthomas/kong.(*Context).RunNode(0x140000ecd00, 0x14000442460, {0x140003bff38, 0x1, 0x1})
	/Users/xiehang/go/pkg/mod/github.com/alecthomas/[email protected]/context.go:706 +0x3e0
github.com/alecthomas/kong.(*Context).Run(0x140000ecd00, {0x140003bff38, 0x1, 0x1})
	/Users/xiehang/go/pkg/mod/github.com/alecthomas/[email protected]/context.go:723 +0x80
main.main()
	/Users/xiehang/Dev/parquet-tools/main.go:33 +0x1a8
exit status 2

Refactor document

To make README more useful, I should put the installation steps and simple use cases into README, while USAGE.md still keeps the full documentation for everything; it seems most users just read README and have no interest in going through the lengthy USAGE.md.

schema command panicked at a certain parquet file

@erikburgess reported that a certain parquet file will panic the schema command with:

panic: runtime error: index out of range [0] with length 0 [recovered]
	panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/alecthomas/kong.catch(0x14000559e60)
	github.com/alecthomas/[email protected]/kong.go:383 +0xb4
panic({0x104e45100, 0x140001e56e0})
	runtime/panic.go:884 +0x204
github.com/hangxie/parquet-tools/cmd.(*schemaNode).updateTagFromConvertedType(0x14000524240, 0x140001128d0?)
	github.com/hangxie/parquet-tools/cmd/schema.go:262 +0x838
github.com/hangxie/parquet-tools/cmd.(*schemaNode).getTagMap(0x14000524240)
	github.com/hangxie/parquet-tools/cmd/schema.go:337 +0x518
github.com/hangxie/parquet-tools/cmd.(*schemaNode).jsonSchema(0x14000524240)
	github.com/hangxie/parquet-tools/cmd/schema.go:127 +0x24
github.com/hangxie/parquet-tools/cmd.(*schemaNode).jsonSchema(0x14000114630)
	github.com/hangxie/parquet-tools/cmd/schema.go:145 +0x160
github.com/hangxie/parquet-tools/cmd.(*SchemaCmd).Run(0x105559d68, 0x1?)
	github.com/hangxie/parquet-tools/cmd/schema.go:37 +0x298
reflect.Value.call({0x104d3d540?, 0x105559d68?, 0x140004d7ac8?}, {0x104ab8de6, 0x4}, {0x1400000fd10, 0x1, 0x1042ef39c?})
	reflect/value.go:584 +0x688
reflect.Value.Call({0x104d3d540?, 0x105559d68?, 0x140004d7b68?}, {0x1400000fd10?, 0xc?, 0xc?})
	reflect/value.go:368 +0x90
github.com/alecthomas/kong.callMethod({0x104ab890b, 0x3}, {0x104dcb920?, 0x105559d68?, 0x3?}, {0x104d3d540?, 0x105559d68?, 0x0?}, 0x0?)
	github.com/alecthomas/[email protected]/callbacks.go:71 +0x3f0
github.com/alecthomas/kong.(*Context).RunNode(0x140001c0a00, 0x140003aa700, {0x140004d7f00, 0x1, 0x1})
	github.com/alecthomas/[email protected]/context.go:706 +0x460
github.com/alecthomas/kong.(*Context).Run(0x140003aa1c0?, {0x140004d7f00?, 0x0?, 0x0?})
	github.com/alecthomas/[email protected]/context.go:723 +0xbc
main.main()
	github.com/hangxie/parquet-tools/main.go:40 +0x2b8

Docker hub notification

The Slack webhook does not work with the Docker Hub repository; the notification needs to be moved to a CCI job.

protocol error: received DATA after END_STREAM

To reproduce:

$ parquet-tools schema https://huggingface.co/datasets/laion/laion2B-en/resolve/main/part-00047-5114fd87-297e-42b0-9d11-50f1df323dfa-c000.snappy.parquet
2022/08/24 09:51:57 protocol error: received DATA after END_STREAM
2022/08/24 09:51:58 protocol error: received DATA after END_STREAM
2022/08/24 09:51:58 protocol error: received DATA after END_STREAM
2022/08/24 09:51:58 protocol error: received DATA after END_STREAM
2022/08/24 09:51:58 protocol error: received DATA after END_STREAM
2022/08/24 09:51:58 protocol error: received DATA after END_STREAM
2022/08/24 09:51:58 protocol error: received DATA after END_STREAM
2022/08/24 09:51:58 protocol error: received DATA after END_STREAM
2022/08/24 09:51:59 protocol error: received DATA after END_STREAM
2022/08/24 09:51:59 protocol error: received DATA after END_STREAM
2022/08/24 09:51:59 protocol error: received DATA after END_STREAM
{"Tag":"name=Spark_schema, type=STRUCT, repetitiontype=REQUIRED","Fields":[{"Tag":"name=SAMPLE_ID, type=INT64, repetitiontype=OPTIONAL"},{"Tag":"name=URL, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"},{"Tag":"name=TEXT, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"},{"Tag":"name=HEIGHT, type=INT32, repetitiontype=OPTIONAL"},{"Tag":"name=WIDTH, type=INT32, repetitiontype=OPTIONAL"},{"Tag":"name=LICENSE, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"},{"Tag":"name=NSFW, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"},{"Tag":"name=Similarity, type=DOUBLE, repetitiontype=OPTIONAL"}]}

TUI

A reminder to myself that a TUI could be pretty interesting; I'm thinking of using Bubble Tea or tview to build something, maybe with parquet-tools but possibly something else.

FIXED_LEN_BYTE_ARRAY/DECIMAL type output

FIXED_LEN_BYTE_ARRAY/DECIMAL output is not human readable; it's something like:

"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0005\u007f)"
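A DECIMAL stored as FIXED_LEN_BYTE_ARRAY is a big-endian two's-complement unscaled integer; rendering it as a readable string could be sketched like this (the scale would come from the schema tag; `decimalString` is an illustrative name):

```go
package main

import (
	"fmt"
	"math/big"
)

// decimalString renders a big-endian two's-complement unscaled integer
// (as stored in a FIXED_LEN_BYTE_ARRAY DECIMAL) with the given scale.
func decimalString(raw []byte, scale int) string {
	v := new(big.Int).SetBytes(raw)
	if len(raw) > 0 && raw[0]&0x80 != 0 { // negative in two's complement
		v.Sub(v, new(big.Int).Lsh(big.NewInt(1), uint(8*len(raw))))
	}
	sign := ""
	if v.Sign() < 0 {
		sign = "-"
		v.Neg(v)
	}
	s := v.String()
	if scale <= 0 {
		return sign + s
	}
	for len(s) <= scale { // pad so the decimal point has a leading digit
		s = "0" + s
	}
	return sign + s[:len(s)-scale] + "." + s[len(s)-scale:]
}

func main() {
	// 0x3039 is 12345; with scale 2 that reads as 123.45
	fmt.Println(decimalString([]byte{0x30, 0x39}, 2))
}
```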

refactor code to deal with fields that need to be re-interpretted

Code to handle DECIMAL, INTERVAL, and INT96 in the cat and meta commands was added incrementally without a high-level design; the logic is spread across many places and is hard to maintain.

Here is what I have in mind to refactor that code:

  1. test cases:
    • Cover these types:
      • DECIMAL (FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY)
      • DECIMAL (INT32 and INT64)
      • INTERVAL (FIXED_LEN_BYTE_ARRAY)
      • INT96 (treated as timestamp only)
    • Cover fields of the above types in these locations:
      • top level
        • scalar
        • pointer
      • embedded
        • list element
        • map key
        • map value
  2. rough idea for cat:
    1. scan the schema to get all fields that need to be reinterpreted
    2. base64-encode the string values of those fields
    3. serialize the whole row to JSON
    4. use gjson/sjson to retrieve the fields that need to be reinterpreted, convert them, then assign the results back
  3. rough idea for meta:
    1. scan the schema to get all fields that need to be reinterpreted
    2. reinterpret the min/max values

Improve output of MinValue and MaxValue

There should be something I can do to improve the data shown as MinValue and MaxValue. I don't mean to deal with all scenarios, but it should at least handle numeric values (INTnn and FLOAT/DOUBLE) and string values (UTF8 only).
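Since parquet stores min/max statistics as plain-encoded little-endian bytes, a decoder for just those cases could look like this (a minimal sketch; `decodeStat` is an illustrative name and only covers the types mentioned above):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeStat renders raw min/max statistics bytes based on the column's
// physical type; anything unhandled falls back to the raw bytes.
func decodeStat(physicalType string, raw []byte) interface{} {
	switch physicalType {
	case "INT32":
		return int32(binary.LittleEndian.Uint32(raw))
	case "INT64":
		return int64(binary.LittleEndian.Uint64(raw))
	case "FLOAT":
		return math.Float32frombits(binary.LittleEndian.Uint32(raw))
	case "DOUBLE":
		return math.Float64frombits(binary.LittleEndian.Uint64(raw))
	case "BYTE_ARRAY": // assume UTF8 here, per the scope above
		return string(raw)
	}
	return raw
}

func main() {
	fmt.Println(decodeStat("INT32", []byte{0x2a, 0, 0, 0})) // 42
	fmt.Println(decodeStat("BYTE_ARRAY", []byte("hello")))  // hello
}
```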

Missing logicaltype in schema output

logicaltype tags like these are missing in the schema output:

Date2             int32               `parquet:"name=date2, type=INT32, convertedtype=DATE, logicaltype=DATE"`
TimeMillis2       int32               `parquet:"name=timemillis2, type=INT32, logicaltype=TIME, logicaltype.isadjustedtoutc=true, logicaltype.unit=MILLIS"`
TimeMicros2       int64               `parquet:"name=timemicros2, type=INT64, logicaltype=TIME, logicaltype.isadjustedtoutc=false, logicaltype.unit=MICROS"`
TimestampMillis2  int64               `parquet:"name=timestampmillis2, type=INT64, logicaltype=TIMESTAMP, logicaltype.isadjustedtoutc=true, logicaltype.unit=MILLIS"`
TimestampMicros2  int64               `parquet:"name=timestampmicros2, type=INT64, logicaltype=TIMESTAMP, logicaltype.isadjustedtoutc=false, logicaltype.unit=MICROS"`
Decimal5          int32               `parquet:"name=decimal5, type=INT32, scale=2, precision=9, logicaltype=DECIMAL, logicaltype.precision=9, logicaltype.scale=2"`

The schema output is:

Date2             int32            `parquet:"name=Date2, type=INT32, convertedtype=DATE, repetitiontype=REQUIRED"`
Timemillis2       int32            `parquet:"name=Timemillis2, type=INT32, repetitiontype=REQUIRED"`
Timemicros2       int64            `parquet:"name=Timemicros2, type=INT64, repetitiontype=REQUIRED"`
Timestampmillis2  int64            `parquet:"name=Timestampmillis2, type=INT64, repetitiontype=REQUIRED"`
Timestampmicros2  int64            `parquet:"name=Timestampmicros2, type=INT64, repetitiontype=REQUIRED"`
Decimal5          int32            `parquet:"name=Decimal5, type=INT32, repetitiontype=REQUIRED"`

INTERVAL type import or cat problem

Imported a JSONL file to parquet with the INTERVAL data type, then cat panicked:

$ parquet-tools cat -f jsonl cmd/testdata/all-types.parquet > /tmp/data.jsonl
$ parquet-tools schema -f json cmd/testdata/all-types.parquet > /tmp/schema.json
$ parquet-tools import -m /tmp/schema.json -f jsonl -s /tmp/data.jsonl /tmp/imported.parquet
$ parquet-tools cat /tmp/imported.parquet
panic: runtime error: index out of range [0] with length 0 [recovered]
	panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/alecthomas/kong.catch(0x140002ffe60)
	github.com/alecthomas/[email protected]/kong.go:383 +0xb8
panic({0x100d1f3a0, 0x1400090c990})
	runtime/panic.go:838 +0x204
github.com/xitongsys/parquet-go/types.DECIMAL_BYTE_ARRAY_ToString({0x1013b0020?, 0x0?, 0x0?}, 0x100a1236d?, 0x140002fee28?)
	github.com/xitongsys/[email protected]/types/converter.go:127 +0x1d0
github.com/hangxie/parquet-tools/cmd.reinterpretNestedFields(0x140002ff078, {0x1400008a6d0, 0x0, 0x0}, {0x10?, 0x1?, 0x140002fefc8?, 0x100294334?})
	github.com/hangxie/parquet-tools/cmd/cat.go:229 +0x604
github.com/hangxie/parquet-tools/cmd.reinterpretNestedFields(0x140006781d0, {0x1400008a6d0, 0x1, 0x1}, {0x0?, 0x4?, 0x1400035f100?, 0x14000479110?})
	github.com/hangxie/parquet-tools/cmd/cat.go:216 +0x248
github.com/hangxie/parquet-tools/cmd.(*CatCmd).Run(0x10137ef80, 0x1?)
	github.com/hangxie/parquet-tools/cmd/cat.go:112 +0x978
reflect.Value.call({0x100c3f940?, 0x10137ef80?, 0x1400035fad8?}, {0x1009f271b, 0x4}, {0x140004746f0, 0x1, 0x1002aaeec?})
	reflect/value.go:556 +0x5e4
reflect.Value.Call({0x100c3f940?, 0x10137ef80?, 0x9?}, {0x140004746f0, 0x1, 0x1})
	reflect/value.go:339 +0x98
github.com/alecthomas/kong.callMethod({0x1009f224f, 0x3}, {0x100d247a0?, 0x10137ef80?, 0x3?}, {0x100c3f940?, 0x10137ef80?, 0x0?}, 0x0?)
	github.com/alecthomas/[email protected]/callbacks.go:71 +0x3a4
github.com/alecthomas/kong.(*Context).RunNode(0x140001ff200, 0x14000412380, {0x1400035ff00, 0x1, 0x1})
	github.com/alecthomas/[email protected]/context.go:706 +0x468
github.com/alecthomas/kong.(*Context).Run(0x140004121c0?, {0x1400035ff00?, 0x0?, 0x0?})
	github.com/alecthomas/[email protected]/context.go:723 +0xc0
main.main()
	github.com/hangxie/parquet-tools/main.go:40 +0x2bc

It works fine if this field is removed:

    {
      "Tag": "name=Interval, type=FIXED_LEN_BYTE_ARRAY, convertedtype=INTERVAL, repetitiontype=REQUIRED"
    },

Import using JSON schema

This was mentioned by a friend; I think it's a useful feature, as JSON schema is used more than the schema file format used by parquet-go.

I'm not sure if it is doable though; need to research a bit.

Panic on Fedora 35 and 36

It works on Fedora 33 and 34, but fails on 35 and 36; error from 36:

# ./parquet-tools-v1.10.1-linux-amd64
runtime/cgo: pthread_create failed: Operation not permitted
SIGABRT: abort
PC=0x7f2996e7b39c m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7f2996e7b39c
stack: frame={sp:0x7ffdc062a060, fp:0x0} stack=[0x7ffdbfe2b5b8,0x7ffdc062a5f0)
0x00007ffdc0629f60:  0x00007ffdc062a430  0x00000000033662e0
0x00007ffdc0629f70:  0x0000000000203000  0x0000000001708480
0x00007ffdc0629f80:  0x00007f297027805b  0x00007f299700ee9f
0x00007ffdc0629f90:  0x0000000000000001  0x0000000000000000
0x00007ffdc0629fa0:  0x2525252525252525  0x2525252525252525
0x00007ffdc0629fb0:  0x000000ffffffffff  0x0000000000000000
0x00007ffdc0629fc0:  0x000000ffffffffff  0x0000000000000000
0x00007ffdc0629fd0:  0x415353454d5f434c  0x505f434c00534547
0x00007ffdc0629fe0:  0x0000000000000000  0x0000000000000000
0x00007ffdc0629ff0:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a000:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a010:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a020:  0x6e75720000000000  0x6f67632f656d6974
0x00007ffdc062a030:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a040:  0x3b31303d63706d2e  0x67676f2e2a3a3633
0x00007ffdc062a050:  0x2a3a36333b31303d  0x00007f2996e7b38e
0x00007ffdc062a060: <0x3d7661772e2a3a36  0x2e2a3a36333b3130
0x00007ffdc062a070:  0x333b31303d61676f  0x7375706f2e2a3a36
0x00007ffdc062a080:  0x2a3a36333b31303d  0x3b31303d7870732e
0x00007ffdc062a090:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a0a0:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a0b0:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a0c0:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a0d0:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a0e0:  0x0000000000000000  0x83af847d47132c00
0x00007ffdc062a0f0:  0x00007f2996de9740  0x0000000000000006
0x00007ffdc062a100:  0x00000000033662e0  0x0000000000203000
0x00007ffdc062a110:  0x0000000001708480  0x00007f2996e2e696
0x00007ffdc062a120:  0x00007f2996fe8990  0x00007f2996e187f3
0x00007ffdc062a130:  0x0000000000000020  0x0000000000000000
0x00007ffdc062a140:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a150:  0x0000000000000000  0x0000000000000000
runtime: unknown pc 0x7f2996e7b39c
stack: frame={sp:0x7ffdc062a060, fp:0x0} stack=[0x7ffdbfe2b5b8,0x7ffdc062a5f0)
0x00007ffdc0629f60:  0x00007ffdc062a430  0x00000000033662e0
0x00007ffdc0629f70:  0x0000000000203000  0x0000000001708480
0x00007ffdc0629f80:  0x00007f297027805b  0x00007f299700ee9f
0x00007ffdc0629f90:  0x0000000000000001  0x0000000000000000
0x00007ffdc0629fa0:  0x2525252525252525  0x2525252525252525
0x00007ffdc0629fb0:  0x000000ffffffffff  0x0000000000000000
0x00007ffdc0629fc0:  0x000000ffffffffff  0x0000000000000000
0x00007ffdc0629fd0:  0x415353454d5f434c  0x505f434c00534547
0x00007ffdc0629fe0:  0x0000000000000000  0x0000000000000000
0x00007ffdc0629ff0:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a000:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a010:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a020:  0x6e75720000000000  0x6f67632f656d6974
0x00007ffdc062a030:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a040:  0x3b31303d63706d2e  0x67676f2e2a3a3633
0x00007ffdc062a050:  0x2a3a36333b31303d  0x00007f2996e7b38e
0x00007ffdc062a060: <0x3d7661772e2a3a36  0x2e2a3a36333b3130
0x00007ffdc062a070:  0x333b31303d61676f  0x7375706f2e2a3a36
0x00007ffdc062a080:  0x2a3a36333b31303d  0x3b31303d7870732e
0x00007ffdc062a090:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a0a0:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a0b0:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a0c0:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a0d0:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a0e0:  0x0000000000000000  0x83af847d47132c00
0x00007ffdc062a0f0:  0x00007f2996de9740  0x0000000000000006
0x00007ffdc062a100:  0x00000000033662e0  0x0000000000203000
0x00007ffdc062a110:  0x0000000001708480  0x00007f2996e2e696
0x00007ffdc062a120:  0x00007f2996fe8990  0x00007f2996e187f3
0x00007ffdc062a130:  0x0000000000000020  0x0000000000000000
0x00007ffdc062a140:  0x0000000000000000  0x0000000000000000
0x00007ffdc062a150:  0x0000000000000000  0x0000000000000000

goroutine 1 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:350 fp=0xc000050780 sp=0xc000050778 pc=0x462f60
runtime.main()
	/usr/local/go/src/runtime/proc.go:174 +0x7b fp=0xc0000507e0 sp=0xc000050780 pc=0x43771b
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0000507e8 sp=0xc0000507e0 pc=0x465181

rax    0x0
rbx    0x7f2996de9740
rcx    0x7f2996e7b39c
rdx    0x6
rdi    0x15
rsi    0x15
rbp    0x15
rsp    0x7ffdc062a060
r8     0x7ffdc062a130
r9     0x7f2996fa24e0
r10    0x8
r11    0x246
r12    0x6
r13    0x203000
r14    0x1708480
r15    0x7f297027805b
rip    0x7f2996e7b39c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

schema output of LIST and MAP key/value types misses converted type

expected:

Map  map[string]int32 `parquet:"name=Map, type=MAP, repetitiontype=REQUIRED, keytype=BYTE_ARRAY, keyconvertedtype=UTF8, valuetype=INT32"`
List []string         `parquet:"name=List, type=LIST, repetitiontype=REQUIRED, valuetype=BYTE_ARRAY, valueconvertedtype=DECIMAL, valuescale=2, valueprecision=10"`

got:

Map  map[string]int32 `parquet:"name=Map, type=MAP, keytype=BYTE_ARRAY, valuetype=INT32"`
List []string         `parquet:"name=List, type=LIST, valuetype=BYTE_ARRAY"`

Push to alternative container registry

I'm moving away from Docker; the current build works with podman after #97. However, built images are still uploaded to Docker Hub, which is kind of risky (cost and throttling, etc.).
