apache / parquet-java

Apache Parquet Java

Home Page: https://parquet.apache.org/

License: Apache License 2.0

parquet-java's Issues

Parquet OutputFormat should allow controlling the file size

To generate the most efficient on-disk files, the file size is important to control. It would be nice if we could configure the OutputFormat to roll over to a new file once the current one reaches the right size.

There's currently no easy way to tune this; it requires indirect tuning (number of reducers, map input size).
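
For reference, the closest existing knob is the row-group ("block") size on ParquetOutputFormat; a rollover threshold like the one proposed here would be a new setting. A minimal sketch, with a hypothetical property name standing in for the proposal:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import parquet.hadoop.ParquetOutputFormat;

class OutputSizeConfigSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());

    // Existing setting: controls the row-group size, not when a new file is started.
    ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);

    // Hypothetical setting for the proposal: roll over to a new file at ~1 GB.
    // This property does not exist today and is shown only to illustrate the idea.
    job.getConfiguration().setLong("parquet.example.rollover.bytes", 1024L * 1024 * 1024);
  }
}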

Reporter: Nong Li / @nongli

Note: This issue was originally created as PARQUET-17. Please see the migration documentation for further details.

Add bloom filters to parquet statistics

For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups.
Pull request:
#215
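
As a rough illustration of the idea only (this sketch uses Guava's BloomFilter and may differ from what the pull request implements): build a filter per column chunk while writing, then skip a row group at read time when the looked-up value definitely isn't present.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

class RowGroupBloomFilterSketch {
  public static void main(String[] args) {
    // Built while writing a row group, one filter per column chunk.
    BloomFilter<Long> idFilter = BloomFilter.create(Funnels.longFunnel(), 1_000_000, 0.01);
    idFilter.put(42L);

    // At read time: a negative answer is definite, so the row group can be skipped.
    long lookedUpId = 7L;
    boolean canSkipRowGroup = !idFilter.mightContain(lookedUpId);
    System.out.println("skip row group: " + canSkipRowGroup);
  }
}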

Reporter: Alex Levenson / @isnotinvain
Assignee: Junjie Chen / @chenjunjiedada

Note: This issue was originally created as PARQUET-41. Please see the migration documentation for further details.

Record filtering in the filter2 API could possibly short circuit

Record-level filtering in the filter2 API still requires visiting every value of the record. We may be able to short-circuit as soon as the filter predicate reaches a known state.

Another approach would be to figure out how to get essentially random access to the values referenced by the predicate and check them first. This could be tricky because it would require re-structuring the assembly algorithm.
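
For context, a hedged example of the filter2 record-level API being discussed (the column name and value are made up): today, evaluating such a predicate still assembles and visits every value of each record.

import org.apache.hadoop.conf.Configuration;
import parquet.filter2.predicate.FilterApi;
import parquet.filter2.predicate.FilterPredicate;
import parquet.hadoop.ParquetInputFormat;

class Filter2Sketch {
  public static void main(String[] args) {
    // Record-level predicate; short-circuiting would stop evaluation as soon
    // as the outcome for the current record is known.
    FilterPredicate pred = FilterApi.eq(FilterApi.intColumn("user.id"), 17);

    Configuration conf = new Configuration();
    ParquetInputFormat.setFilterPredicate(conf, pred);
  }
}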

Reporter: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-37. Please see the migration documentation for further details.

Decommission google group

Some members of the community are still sending mail to the old Google Groups list, but some of those discussions seem to have been neglected a bit, such as this one:

https://groups.google.com/forum/#!topic/parquet-dev/eMoVUXxY044

I think an auto-reply should be added to the old list telling people to post to the Apache list instead, and for members who post through the web forum interface, a message pinned to the top of the list should point them to the Apache JIRA.

Reporter: Jason Altekruse / @jaltekruse
Assignee: Julien Le Dem / @julienledem

Note: This issue was originally created as PARQUET-10. Please see the migration documentation for further details.

Investigate automatic not null checks via annotations in place of checkNotNull calls

We've discussed that it would be neat if we could replace a lot of the checkNotNull() calls in parquet-mr with an annotation like

@NotNull

or even make not null the default and annotate things that can be null with

@Nullable

and have this enforced by the compiler / an annotation processor.
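
A small sketch of the two styles; the annotation below is declared inline purely as a stand-in for something like JSR-305's @Nonnull, not an existing parquet-mr annotation:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in annotation, purely for illustration.
@Retention(RetentionPolicy.CLASS)
@Target(ElementType.PARAMETER)
@interface NotNull {}

class NullCheckStyles {
  // Current style: an explicit runtime check on every call.
  void write(Object value) {
    if (value == null) {
      throw new NullPointerException("value must not be null");
    }
    // ... use value ...
  }

  // Proposed style: declare the constraint and let an annotation processor
  // or static-analysis tool enforce it.
  void writeChecked(@NotNull Object value) {
    // ... use value ...
  }
}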

Reporter: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-29. Please see the migration documentation for further details.

Benchmark the assembly of thrift objects, and possibly create a more efficient ReplayingTProtocol

The current implementation of parquet thrift creates an instance of TProtocol for each value of each record and builds a stack of these events, which are then replayed back to the TBase.

I'd be curious to benchmark this, and if it's slow, try building a "ReplayingTProtocol" that instead of having a stack of TProtocol instances, contains a primitive array of each type. As events are fed into this replaying TProtocol, it would just add these primitives to its buffers, and then the TBase would drain them. This would effectively let us stream the values into the TBase without making an object allocation for each value.

The buffers could be set to a certain size, and if they fill up (which they shouldn't in most cases), the TBase could begin draining the protocol until it is empty again, at which point the TProtocol can block the TBase from draining further while the parquet record assembly feeds it more events.
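
A minimal sketch of the buffering idea, with made-up names (this is not an existing class): primitive arrays are filled as events arrive and drained by the TBase, with no per-value object allocation. Only the int buffer is shown; the full idea would have one buffer per primitive type.

class ReplayBuffersSketch {
  private final int[] ints;
  private int writePos;
  private int readPos;

  ReplayBuffersSketch(int capacity) {
    this.ints = new int[capacity];
  }

  // Called as record assembly feeds events in; returns false when full so the
  // caller can let the TBase drain before continuing.
  boolean offerInt(int value) {
    if (writePos == ints.length) {
      return false;
    }
    ints[writePos++] = value;
    return true;
  }

  // Called as the TBase drains buffered values back out.
  int drainInt() {
    return ints[readPos++];
  }
}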

This is all moot if it turns out not to be a bottleneck, though :)

Reporter: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-33. Please see the migration documentation for further details.

[parquet-scrooge] mvn eclipse:eclipse fails on parquet-scrooge

mvn eclipse:eclipse on the parquet-mr project fails when it hits the scrooge sub-project. Since Scrooge is written in Scala, this is probably not very surprising, but it means the target never reaches the Hive modules and Tools. We should at least skip it if this isn't easy to fix.

Reporter: Dmitriy V. Ryaboy / @dvryaboy
Assignee: Dmitriy V. Ryaboy / @dvryaboy

Note: This issue was originally created as PARQUET-8. Please see the migration documentation for further details.

InternalParquetRecordReader will not read multiple blocks when filtering

The InternalParquetRecordReader keeps track of the count of records it has processed and uses that count to know when it is finished and when to load a new row group of data. But when it is wrapping a FilteredRecordReader, this count is not updated for records that are filtered, so when the reader exhausts the block it is reading, it will continue calling read() on the filtered reader and will pass null values to the caller.

The quick fix is to detect null values returned by the record reader and update the count to read the next row group. But the longer-term solution is to correctly account for the filtered records.

The pull request for the quick fix is #9.
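
A rough, self-contained sketch of the quick fix's behaviour, with made-up names (it is not the actual patch): a null from the filtering reader is treated as "record filtered out" and reading continues, rather than handing the null to the caller.

import java.util.Iterator;

class SkipFilteredRecords<T> {
  private final Iterator<T> filteringReader; // yields null for filtered-out records

  SkipFilteredRecords(Iterator<T> filteringReader) {
    this.filteringReader = filteringReader;
  }

  // Returns the next record that passed the filter, or null only at end of input.
  T nextKeyValue() {
    while (filteringReader.hasNext()) {
      T record = filteringReader.next();
      if (record != null) {
        return record;
      }
      // null means the record was filtered out: keep reading so the count
      // still advances and the next row group gets loaded.
    }
    return null;
  }
}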

Reporter: Ryan Blue / @rdblue
Assignee: Thomas White / @tomwhite

Note: This issue was originally created as PARQUET-9. Please see the migration documentation for further details.

Better exception when files are inaccessible

In some cases the Hadoop FileSystem API will throw a NullPointerException when trying to access files that have moved.
We'd want to catch those and give a better error message.

Caused by: java.lang.NullPointerException
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1043)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:211)
	at parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:395)
	at parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:443)
	at parquet.hadoop.ParquetInputFormat.getGlobalMetaData(ParquetInputFormat.java:467)
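
A hedged sketch of the kind of wrapper this asks for (method name and message are illustrative): catch the NullPointerException from globStatus() and rethrow with an error that names the path.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class SafeGlobSketch {
  static FileStatus[] globOrExplain(Path pattern, Configuration conf) throws IOException {
    FileSystem fs = pattern.getFileSystem(conf);
    try {
      return fs.globStatus(pattern);
    } catch (NullPointerException e) {
      throw new IOException("Could not list files matching " + pattern
          + "; the path may have been moved or deleted", e);
    }
  }
}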

Reporter: Julien Le Dem / @julienledem

Note: This issue was originally created as PARQUET-20. Please see the migration documentation for further details.

Remove items from semver blacklist

parquet-hadoop currently has the semver checks disabled, and a few classes are blacklisted.

We need to: 1) publish an artifact (maybe 1.6.0rc1) and set it as the "previous version" as far as the semver enforcer is concerned, and then 2) re-enable the enforcer and clear its blacklist.

Reporter: Alex Levenson / @isnotinvain
Assignee: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-50. Please see the migration documentation for further details.

Cannot read dictionary-encoded pages with all null values

This is issue #283. Parquet-mr will try to read the bit-width byte in DictionaryValuesReader#initFromPage even if the incoming offset is at the end of the byte array because there are no values.

Here's the stack trace:

Caused by: parquet.io.ParquetDecodingException: could not read page Page [id: 1, bytes.size=7, valueCount=100, uncompressedSize=7] in col [id] INT32
	at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:532)
	at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
	at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
	at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:339)
	at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
	at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
	at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:265)
	at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
	at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
	at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:112)
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:174)
	... 29 more
Caused by: java.io.EOFException
	at parquet.bytes.BytesUtils.readIntLittleEndianOnOneByte(BytesUtils.java:76)
	at parquet.column.values.dictionary.DictionaryValuesReader.initFromPage(DictionaryValuesReader.java:55)
	at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:530)
	... 39 more
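
A minimal sketch of the guard implied by the description (illustrative only, not the actual fix): when the offset already points past the end of the page bytes, there are no values and no bit-width byte to read.

class DictionaryPageGuardSketch {
  // Returns the dictionary bit width, or 0 when the page holds no values.
  static int readBitWidthIfPresent(byte[] page, int offset) {
    if (offset >= page.length) {
      // All values on this page are null: nothing was written after the offset.
      return 0;
    }
    return page[offset] & 0xFF; // the bit width is stored in a single byte
  }
}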

Reporter: Ryan Blue / @rdblue
Assignee: Ryan Blue / @rdblue

Note: This issue was originally created as PARQUET-18. Please see the migration documentation for further details.

Refactor the Statistics classes to match the specialized pattern used throughout parquet

Because Parquet tries very hard to avoid autoboxing, most of the core classes are specialized for each primitive by having a method for each type, eg:

void writeInt(int x);
void writeLong(long x);
void writeDouble(double x);

and so on.

However, the statistics classes take the other approach of having an IntStatistics class, a LongStatistics class, a DoubleStatistics class, and so on. I think it's worth going for consistency: pick one pattern and stick to it. The first pattern mentioned above seems to be the more common one at the moment.

We may want to take this one step further and define an interface that these all conform to, eg:

public interface ParquetTypeVisitor {
  void visitInt(int x);
  void visitLong(long x);
  void visitDouble(double x);
}

Reporter: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-32. Please see the migration documentation for further details.

Enforce PARQUET jira prefix for PR names in merge script

merge_parquet_pr.py should:

  • enforce that the pull request description starts with "Parquet-X: "
  • automatically close the corresponding JIRA (right now it does, except it asks for the JIRA ID)
  • ask for JIRA credentials (right now they have to be set in the environment)

https://github.com/apache/incubator-parquet-mr/blob/master/dev/merge_parquet_pr.py

Reporter: Julien Le Dem / @julienledem

Note: This issue was originally created as PARQUET-24. Please see the migration documentation for further details.

Unnecessary getFileStatus() calls on all part-files in ParquetInputFormat.getSplits

When testing Spark SQL Parquet support, we found that accessing large Parquet files located in S3 can be very slow. To be more specific, we have an S3 Parquet file with over 3,000 part-files, and calling ParquetInputFormat.getSplits on it takes several minutes. (We were accessing this file from our office network rather than from within AWS.)

After some investigation, we found that ParquetInputFormat.getSplits calls getFileStatus() on all part-files one by one sequentially (here). In the case of S3, each getFileStatus() call issues an HTTP request and waits for the reply in a blocking manner, which is quite expensive.

Actually, all these FileStatus objects have already been fetched when the footers are retrieved (here). Caching these FileStatus objects can greatly improve our S3 case (reduced from over 5 minutes to about 1.4 minutes).

Will submit a PR for this issue soon.
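
A hedged sketch of the caching idea (names are illustrative, not taken from the eventual PR): remember the FileStatus objects already fetched while reading footers and consult that cache before issuing another getFileStatus() call.

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class FileStatusCacheSketch {
  private final Map<Path, FileStatus> cache = new ConcurrentHashMap<Path, FileStatus>();

  // Populate the cache with statuses obtained while footers were read.
  void put(FileStatus status) {
    cache.put(status.getPath(), status);
  }

  // Only hits the filesystem (e.g. S3) when the path has not been seen yet.
  FileStatus get(FileSystem fs, Path path) throws IOException {
    FileStatus status = cache.get(path);
    if (status == null) {
      status = fs.getFileStatus(path);
      cache.put(path, status);
    }
    return status;
  }
}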

Reporter: Cheng Lian / @liancheng

Note: This issue was originally created as PARQUET-16. Please see the migration documentation for further details.

SERDE backed schema for parquet storage in Hive

As of now, for a Hive table stored as Parquet, the schema can only be specified in the Hive MetaStore. For our use case, we would like the schema to be provided by a Thrift SerDe rather than the MetaStore. Using a Thrift IDL as the schema provider lets us maintain a consistent schema across execution engines other than Hive, such as Pig and native MR.

Additionally, for a large sparse schema, it is much easier to build Thrift objects and use parquet-thrift/elephant-bird to convert them into columns/tuples than to construct the whole big tuple directly.

Reporter: Abhishek Agarwal / @abhishekagarwal87
Assignee: Ashish Singh / @SinghAsDev

Note: This issue was originally created as PARQUET-47. Please see the migration documentation for further details.

Adding Type Persuasion for Primitive Types

Pull request: https://github.com/apache/incubator-parquet-mr/pull/3
Original from the old repo: Parquet/parquet-mr#410

These changes allow primitive types to be requested as different types than what is stored in the file format using a flag to turn off strict type checking (default is on). Types are cast to the requested type where possible and will suffer precision loss for casting where necessary (e.g. requesting a double as an int).

No performance penalty is imposed when using the type defined in the file. A flag exists to

A 6x6 test case is provided to test conversion between the primitive types.

Reporter: Daniel Weeks / @danielcweeks
Assignee: Daniel Weeks / @danielcweeks

Note: This issue was originally created as PARQUET-2. Please see the migration documentation for further details.

NPE when an empty file is included in a Hive query that uses CombineHiveInputFormat

Make sure the valueObj instance variable is always initialized. This change is needed when running a Hive query that uses the CombineHiveInputFormat and the first file in the combined split is empty. This can lead to a NullPointerException because the valueObj is null when the CombineHiveInputFormat calls the createValue method.
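
A hedged sketch of the behaviour being asked for, with illustrative names (not the actual patch): createValue() always returns a non-null value, even before any split has been initialized.

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Writable;

class CreateValueSketch {
  private ArrayWritable valueObj;

  ArrayWritable createValue() {
    if (valueObj == null) {
      // Initialize lazily so CombineHiveInputFormat never sees a null here,
      // even when the first file in the combined split is empty.
      valueObj = new ArrayWritable(Writable.class, new Writable[0]);
    }
    return valueObj;
  }
}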

Reporter: Matt Martin / @matt-martin
Assignee: Matt Martin / @matt-martin

Note: This issue was originally created as PARQUET-19. Please see the migration documentation for further details.

Use LRU caching for footers in ParquetInputFormat.

The caching approach needs to change because of issues that occur when the same ParquetInputFormat instance is reused to generate splits for different input directories. For example, it causes problems in Hive's FetchOperator when the FetchOperator is attempting to operate over more than one partition (sidenote: as far as I could tell, Hive has been reusing inputformat instances in this way for quite some time). The details of how this issue manifests itself with respect to Hive are described in more detail here: https://groups.google.com/d/msg/parquet-dev/0aXql-3z7vE/Gn5m094V7PMJ

The proposed patch can be found here: https://github.com/apache/incubator-parquet-mr/pull/2
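
A hedged sketch of one way to get LRU eviction (the actual patch may structure this differently): a LinkedHashMap in access order with a bounded size.

import java.util.LinkedHashMap;
import java.util.Map;

class LruFooterCacheSketch<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  LruFooterCacheSketch(int maxEntries) {
    super(16, 0.75f, true); // access order, so gets refresh an entry's recency
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries; // evict the least recently used footer
  }
}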

Reporter: Matt Martin / @matt-martin
Assignee: Matt Martin / @matt-martin

Note: This issue was originally created as PARQUET-4. Please see the migration documentation for further details.

Document new jira/pr workflow for contributors

We should create clear documentation for contributors on how to create new issues, PRs, etc. Spark has a pretty good page that we can use as a reference, since they use a similar workflow: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

We should also update the README.md with a link to this more detailed document, and cut most of what we currently have in the contributing section.

Reporter: Dmitriy V. Ryaboy / @dvryaboy
Assignee: Julien Le Dem / @julienledem

Note: This issue was originally created as PARQUET-6. Please see the migration documentation for further details.

tool to merge pull requests based on Spark

https://github.com/apache/incubator-parquet-mr/pull/5

Given a pull request ID on github.com/apache/incubator-parquet-mr, this script will merge it.
It requires two remotes, apache-github and apache, pointing to the corresponding repos.
Tested here (pretending my fork is the apache remote):
julienledem@485658a

original tool:
https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py

Reporter: Julien Le Dem / @julienledem
Assignee: Julien Le Dem / @julienledem

Note: This issue was originally created as PARQUET-3. Please see the migration documentation for further details.

Parquet doesn't recognize the nested Array type in MAP as ArrayWritable.

When trying to insert Hive data of type MAP<string, array> into Parquet, the following error is thrown:

Caused by: parquet.io.ParquetEncodingException: This should be an ArrayWritable or MapWritable: org.apache.hadoop.hive.ql.io.parquet.writable.BinaryWritable@c644ef1c
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeData(DataWritableWriter.java:86)

The problem is reproducible with the following steps.
Relevant test data is attached.

CREATE TABLE test_hive (
node string,
stime string,
stimeutc string,
swver string,
moid MAP <string,string>,
pdfs MAP <string,array>,
utcdate string,
motype string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '=';

LOAD DATA LOCAL INPATH '/root/38388/test.dat' INTO TABLE test_hive;

CREATE TABLE test_parquet(
pdfs MAP <string,array>
)
STORED AS PARQUET ;

INSERT INTO TABLE test_parquet SELECT pdfs FROM test_hive;

Reporter: Mala Chikka Kempanna
Assignee: Ryan Blue / @rdblue

Note: This issue was originally created as PARQUET-26. Please see the migration documentation for further details.
