apache / parquet-java

Apache Parquet Java

Home Page: https://parquet.apache.org/

License: Apache License 2.0

parquet-java's Issues

Parquet OutputFormat should allow controlling the file size

To generate the most efficient on-disk files, the file size is important to control. It would be nice if we could configure the OutputFormat to roll over to a new file once the current one reaches the right size.

There's currently no easy way to tune this; it requires indirect tuning (number of reducers, map input size).
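
For reference, the closest existing knob is the row-group ("block") size on ParquetOutputFormat; a rollover threshold like the one proposed here would be a new setting. A minimal sketch, with a hypothetical property name standing in for the proposal:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import parquet.hadoop.ParquetOutputFormat;

class OutputSizeConfigSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());

    // Existing setting: controls the row-group size, not when a new file is started.
    ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);

    // Hypothetical setting for the proposal: roll over to a new file at ~1 GB.
    // This property does not exist today and is shown only to illustrate the idea.
    job.getConfiguration().setLong("parquet.example.rollover.bytes", 1024L * 1024 * 1024);
  }
}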

Reporter: Nong Li / @nongli

Note: This issue was originally created as PARQUET-17. Please see the migration documentation for further details.

Add bloom filters to parquet statistics

For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups.
Pull request:
#215
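
As a rough illustration of the idea only (this sketch uses Guava's BloomFilter and may differ from what the pull request implements): build a filter per column chunk while writing, then skip a row group at read time when the looked-up value definitely isn't present.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

class RowGroupBloomFilterSketch {
  public static void main(String[] args) {
    // Built while writing a row group, one filter per column chunk.
    BloomFilter<Long> idFilter = BloomFilter.create(Funnels.longFunnel(), 1_000_000, 0.01);
    idFilter.put(42L);

    // At read time: a negative answer is definite, so the row group can be skipped.
    long lookedUpId = 7L;
    boolean canSkipRowGroup = !idFilter.mightContain(lookedUpId);
    System.out.println("skip row group: " + canSkipRowGroup);
  }
}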

Reporter: Alex Levenson / @isnotinvain
Assignee: Junjie Chen / @chenjunjiedada

Note: This issue was originally created as PARQUET-41. Please see the migration documentation for further details.

Record filtering in the filter2 API could possibly short circuit

Record-level filtering in the filter2 API still requires visiting every value of the record. We may be able to short-circuit as soon as the filter predicate reaches a known state.

Another approach would be to figure out how to get essentially random access to the values referenced by the predicate and check them first. This could be tricky because it would require re-structuring the assembly algorithm.
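
For context, a hedged example of the filter2 record-level API being discussed (the column name and value are made up): today, evaluating such a predicate still assembles and visits every value of each record.

import org.apache.hadoop.conf.Configuration;
import parquet.filter2.predicate.FilterApi;
import parquet.filter2.predicate.FilterPredicate;
import parquet.hadoop.ParquetInputFormat;

class Filter2Sketch {
  public static void main(String[] args) {
    // Record-level predicate; short-circuiting would stop evaluation as soon
    // as the outcome for the current record is known.
    FilterPredicate pred = FilterApi.eq(FilterApi.intColumn("user.id"), 17);

    Configuration conf = new Configuration();
    ParquetInputFormat.setFilterPredicate(conf, pred);
  }
}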

Reporter: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-37. Please see the migration documentation for further details.

Decommission google group

Some members of the community are still sending mail to the old Google Groups list, but some of those discussions seem to have been neglected a bit, such as this one:

https://groups.google.com/forum/#!topic/parquet-dev/eMoVUXxY044

I think an auto-reply should be added to the old list telling people to post to the Apache list instead, and for members who post through the web forum interface, a message pinned to the top of the list should point them to the Apache JIRA.

Reporter: Jason Altekruse / @jaltekruse
Assignee: Julien Le Dem / @julienledem

Note: This issue was originally created as PARQUET-10. Please see the migration documentation for further details.

Investigate automatic not null checks via annotations in place of checkNotNull calls

We've discussed that it would be neat if we could replace a lot of the checkNotNull() calls in parquet-mr with an annotation like

@NotNull

or even make not null the default and annotate things that can be null with

@Nullable

and have this enforced by the compiler / an annotation processor.
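
A small sketch of the two styles; the annotation below is declared inline purely as a stand-in for something like JSR-305's @Nonnull, not an existing parquet-mr annotation:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Stand-in annotation, purely for illustration.
@Retention(RetentionPolicy.CLASS)
@Target(ElementType.PARAMETER)
@interface NotNull {}

class NullCheckStyles {
  // Current style: an explicit runtime check on every call.
  void write(Object value) {
    if (value == null) {
      throw new NullPointerException("value must not be null");
    }
    // ... use value ...
  }

  // Proposed style: declare the constraint and let an annotation processor
  // or static-analysis tool enforce it.
  void writeChecked(@NotNull Object value) {
    // ... use value ...
  }
}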

Reporter: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-29. Please see the migration documentation for further details.

Benchmark the assembly of thrift objects, and possibly create a more efficient ReplayingTProtocol

The current implementation of parquet thrift creates an instance of TProtocol for each value of each record and builds a stack of these events, which are then replayed back to the TBase.

I'd be curious to benchmark this, and if it's slow, try building a "ReplayingTProtocol" that instead of having a stack of TProtocol instances, contains a primitive array of each type. As events are fed into this replaying TProtocol, it would just add these primitives to its buffers, and then the TBase would drain them. This would effectively let us stream the values into the TBase without making an object allocation for each value.

The buffers could be set to a certain size, and if they fill up (which they shouldn't in most cases), the TBase could begin draining the protocol until it is empty again, at which point the TProtocol can block the TBase from draining further while the parquet record assembly feeds it more events.
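
A minimal sketch of the buffering idea, with made-up names (this is not an existing class): primitive arrays are filled as events arrive and drained by the TBase, with no per-value object allocation. Only the int buffer is shown; the full idea would have one buffer per primitive type.

class ReplayBuffersSketch {
  private final int[] ints;
  private int writePos;
  private int readPos;

  ReplayBuffersSketch(int capacity) {
    this.ints = new int[capacity];
  }

  // Called as record assembly feeds events in; returns false when full so the
  // caller can let the TBase drain before continuing.
  boolean offerInt(int value) {
    if (writePos == ints.length) {
      return false;
    }
    ints[writePos++] = value;
    return true;
  }

  // Called as the TBase drains buffered values back out.
  int drainInt() {
    return ints[readPos++];
  }
}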

This is all moot if it turns out not to be a bottleneck, though :)

Reporter: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-33. Please see the migration documentation for further details.

[parquet-scrooge] mvn eclipse:eclipse fails on parquet-scrooge

mvn eclipse:eclipse on the parquet-mr project fails when it hits the scrooge sub-project. Since Scrooge is written in Scala, this is probably not very surprising, but it means the target never reaches the Hive modules and Tools. We should at least skip it if this isn't easy to fix.

Reporter: Dmitriy V. Ryaboy / @dvryaboy
Assignee: Dmitriy V. Ryaboy / @dvryaboy

Note: This issue was originally created as PARQUET-8. Please see the migration documentation for further details.

InternalParquetRecordReader will not read multiple blocks when filtering

The InternalParquetRecordReader keeps track of the count of records it has processed and uses that count to know when it is finished and when to load a new row group of data. But when it is wrapping a FilteredRecordReader, this count is not updated for records that are filtered, so when the reader exhausts the block it is reading, it will continue calling read() on the filtered reader and will pass null values to the caller.

The quick fix is to detect null values returned by the record reader and update the count to read the next row group. But the longer-term solution is to correctly account for the filtered records.

The pull request for the quick fix is #9.
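
A rough, self-contained sketch of the quick fix's behaviour, with made-up names (it is not the actual patch): a null from the filtering reader is treated as "record filtered out" and reading continues, rather than handing the null to the caller.

import java.util.Iterator;

class SkipFilteredRecords<T> {
  private final Iterator<T> filteringReader; // yields null for filtered-out records

  SkipFilteredRecords(Iterator<T> filteringReader) {
    this.filteringReader = filteringReader;
  }

  // Returns the next record that passed the filter, or null only at end of input.
  T nextKeyValue() {
    while (filteringReader.hasNext()) {
      T record = filteringReader.next();
      if (record != null) {
        return record;
      }
      // null means the record was filtered out: keep reading so the count
      // still advances and the next row group gets loaded.
    }
    return null;
  }
}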

Reporter: Ryan Blue / @rdblue
Assignee: Thomas White / @tomwhite

Note: This issue was originally created as PARQUET-9. Please see the migration documentation for further details.

Better exception when files are inaccessible

In some cases the Hadoop FileSystem API will throw a NullPointerException when trying to access files that have moved.
We'd want to catch those and give a better error message.

Caused by: java.lang.NullPointerException
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1043)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:211)
	at parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:395)
	at parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:443)
	at parquet.hadoop.ParquetInputFormat.getGlobalMetaData(ParquetInputFormat.java:467)
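
A hedged sketch of the kind of wrapper this asks for (method name and message are illustrative): catch the NullPointerException from globStatus() and rethrow with an error that names the path.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class SafeGlobSketch {
  static FileStatus[] globOrExplain(Path pattern, Configuration conf) throws IOException {
    FileSystem fs = pattern.getFileSystem(conf);
    try {
      return fs.globStatus(pattern);
    } catch (NullPointerException e) {
      throw new IOException("Could not list files matching " + pattern
          + "; the path may have been moved or deleted", e);
    }
  }
}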

Reporter: Julien Le Dem / @julienledem

Note: This issue was originally created as PARQUET-20. Please see the migration documentation for further details.

Remove items from semver blacklist

parquet-hadoop currently has the semver checks disabled, and a few classes are blacklisted.

We need to: 1) publish an artifact (maybe 1.6.0rc1) and set it as the "previous version" as far as the semver enforcer is concerned, and then 2) re-enable the enforcer and clear its blacklist.

Reporter: Alex Levenson / @isnotinvain
Assignee: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-50. Please see the migration documentation for further details.

Cannot read dictionary-encoded pages with all null values

This is issue #283. Parquet-mr will try to read the bit-width byte in DictionaryValuesReader#initFromPage even if the incoming offset is at the end of the byte array because there are no values.

Here's the stack trace:

Caused by: parquet.io.ParquetDecodingException: could not read page Page [id: 1, bytes.size=7, valueCount=100, uncompressedSize=7] in col [id] INT32
	at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:532)
	at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
	at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
	at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:339)
	at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
	at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
	at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:265)
	at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
	at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
	at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:112)
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:174)
	... 29 more
Caused by: java.io.EOFException
	at parquet.bytes.BytesUtils.readIntLittleEndianOnOneByte(BytesUtils.java:76)
	at parquet.column.values.dictionary.DictionaryValuesReader.initFromPage(DictionaryValuesReader.java:55)
	at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:530)
	... 39 more
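
A minimal sketch of the guard implied by the description (illustrative only, not the actual fix): when the offset already points past the end of the page bytes, there are no values and no bit-width byte to read.

class DictionaryPageGuardSketch {
  // Returns the dictionary bit width, or 0 when the page holds no values.
  static int readBitWidthIfPresent(byte[] page, int offset) {
    if (offset >= page.length) {
      // All values on this page are null: nothing was written after the offset.
      return 0;
    }
    return page[offset] & 0xFF; // the bit width is stored in a single byte
  }
}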

Reporter: Ryan Blue / @rdblue
Assignee: Ryan Blue / @rdblue

Note: This issue was originally created as PARQUET-18. Please see the migration documentation for further details.

Refactor the Statistics classes to match the specialized pattern used throughout parquet

Because Parquet tries very hard to avoid autoboxing, most of the core classes are specialized for each primitive by having a method for each type, eg:

void writeInt(int x);
void writeLong(long x);
void writeDouble(double x);

and so on.

However, the statistics classes take the other approach of having an IntStatistics class, a LongStatistics class, a DoubleStatistics class, and so on. I think it's worth going for consistency: pick one pattern and stick to it. The first pattern mentioned above seems to be the more common one at the moment.

We may want to take this one step further and define an interface that these all conform to, eg:

public interface ParquetTypeVisitor {
  void visitInt(int x);
  void visitLong(long x);
  void visitDouble(double x);
}

Reporter: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-32. Please see the migration documentation for further details.

Enforce PARQUET jira prefix for PR names in merge script

merge_parquet_pr.py should:

  • enforce that the pull request description starts with "Parquet-X: "
  • automatically close the corresponding JIRA (right now it does, except it asks for the JIRA ID)
  • ask for JIRA credentials (right now they have to be set in the environment)

https://github.com/apache/incubator-parquet-mr/blob/master/dev/merge_parquet_pr.py

Reporter: Julien Le Dem / @julienledem

Note: This issue was originally created as PARQUET-24. Please see the migration documentation for further details.

Unnecessary getFileStatus() calls on all part-files in ParquetInputFormat.getSplits

When testing Spark SQL Parquet support, we found that accessing large Parquet files located in S3 can be very slow. To be more specific, we have an S3 Parquet file with over 3,000 part-files, and calling ParquetInputFormat.getSplits on it takes several minutes. (We were accessing this file from our office network rather than from within AWS.)

After some investigation, we found that ParquetInputFormat.getSplits calls getFileStatus() on all part-files one by one sequentially (here). In the case of S3, each getFileStatus() call issues an HTTP request and waits for the reply in a blocking manner, which is quite expensive.

Actually, all these FileStatus objects have already been fetched when the footers are retrieved (here). Caching these FileStatus objects can greatly improve our S3 case (reduced from over 5 minutes to about 1.4 minutes).

Will submit a PR for this issue soon.
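
A hedged sketch of the caching idea (names are illustrative, not taken from the eventual PR): remember the FileStatus objects already fetched while reading footers and consult that cache before issuing another getFileStatus() call.

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class FileStatusCacheSketch {
  private final Map<Path, FileStatus> cache = new ConcurrentHashMap<Path, FileStatus>();

  // Populate the cache with statuses obtained while footers were read.
  void put(FileStatus status) {
    cache.put(status.getPath(), status);
  }

  // Only hits the filesystem (e.g. S3) when the path has not been seen yet.
  FileStatus get(FileSystem fs, Path path) throws IOException {
    FileStatus status = cache.get(path);
    if (status == null) {
      status = fs.getFileStatus(path);
      cache.put(path, status);
    }
    return status;
  }
}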

Reporter: Cheng Lian / @liancheng

Note: This issue was originally created as PARQUET-16. Please see the migration documentation for further details.

SERDE backed schema for parquet storage in Hive

As of now, for a Hive table stored as Parquet, the schema can only be specified in the Hive MetaStore. For our use case, we would like the schema to be provided by a Thrift SerDe rather than the MetaStore. Using a Thrift IDL as the schema provider lets us maintain a consistent schema across execution engines other than Hive, such as Pig and native MR.

Additionally, for a large sparse schema, it is much easier to build Thrift objects and use parquet-thrift/elephant-bird to convert them into columns/tuples than to construct the whole big tuple directly.

Reporter: Abhishek Agarwal / @abhishekagarwal87
Assignee: Ashish Singh / @SinghAsDev

Note: This issue was originally created as PARQUET-47. Please see the migration documentation for further details.

Adding Type Persuasion for Primitive Types

Pull request: https://github.com/apache/incubator-parquet-mr/pull/3
Original from the old repo: Parquet/parquet-mr#410

These changes allow primitive types to be requested as different types than what is stored in the file format using a flag to turn off strict type checking (default is on). Types are cast to the requested type where possible and will suffer precision loss for casting where necessary (e.g. requesting a double as an int).

No performance penalty is imposed when using the type defined in the file. A flag exists to

A 6x6 test case is provided to test conversion between the primitive types.

Reporter: Daniel Weeks / @danielcweeks
Assignee: Daniel Weeks / @danielcweeks

Note: This issue was originally created as PARQUET-2. Please see the migration documentation for further details.

NPE when an empty file is included in a Hive query that uses CombineHiveInputFormat

Make sure the valueObj instance variable is always initialized. This change is needed when running a Hive query that uses the CombineHiveInputFormat and the first file in the combined split is empty. This can lead to a NullPointerException because the valueObj is null when the CombineHiveInputFormat calls the createValue method.
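
A hedged sketch of the behaviour being asked for, with illustrative names (not the actual patch): createValue() always returns a non-null value, even before any split has been initialized.

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Writable;

class CreateValueSketch {
  private ArrayWritable valueObj;

  ArrayWritable createValue() {
    if (valueObj == null) {
      // Initialize lazily so CombineHiveInputFormat never sees a null here,
      // even when the first file in the combined split is empty.
      valueObj = new ArrayWritable(Writable.class, new Writable[0]);
    }
    return valueObj;
  }
}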

Reporter: Matt Martin / @matt-martin
Assignee: Matt Martin / @matt-martin

Note: This issue was originally created as PARQUET-19. Please see the migration documentation for further details.

Use LRU caching for footers in ParquetInputFormat.

The caching approach needs to change because of issues that occur when the same ParquetInputFormat instance is reused to generate splits for different input directories. For example, it causes problems in Hive's FetchOperator when the FetchOperator is attempting to operate over more than one partition (sidenote: as far as I could tell, Hive has been reusing inputformat instances in this way for quite some time). The details of how this issue manifests itself with respect to Hive are described in more detail here: https://groups.google.com/d/msg/parquet-dev/0aXql-3z7vE/Gn5m094V7PMJ

The proposed patch can be found here: https://github.com/apache/incubator-parquet-mr/pull/2
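
A hedged sketch of one way to get LRU eviction (the actual patch may structure this differently): a LinkedHashMap in access order with a bounded size.

import java.util.LinkedHashMap;
import java.util.Map;

class LruFooterCacheSketch<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  LruFooterCacheSketch(int maxEntries) {
    super(16, 0.75f, true); // access order, so gets refresh an entry's recency
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries; // evict the least recently used footer
  }
}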

Reporter: Matt Martin / @matt-martin
Assignee: Matt Martin / @matt-martin

Note: This issue was originally created as PARQUET-4. Please see the migration documentation for further details.

Document new jira/pr workflow for contributors

We should create clear documentation for contributors on how to create new issues, PRs, etc. Spark has a pretty good page that we can use as a reference, since they use a similar workflow: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

We should also update the README.md with a link to this more detailed document, and cut most of what we currently have in the contributing section.

Reporter: Dmitriy V. Ryaboy / @dvryaboy
Assignee: Julien Le Dem / @julienledem

Note: This issue was originally created as PARQUET-6. Please see the migration documentation for further details.

tool to merge pull requests based on Spark

https://github.com/apache/incubator-parquet-mr/pull/5

Given a pull request ID on github.com/apache/incubator-parquet-mr, this script will merge it.
It requires two remotes, apache-github and apache, pointing to the corresponding repos.
Tested here (pretending my fork is the apache remote):
julienledem@485658a

original tool:
https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py

Reporter: Julien Le Dem / @julienledem
Assignee: Julien Le Dem / @julienledem

Note: This issue was originally created as PARQUET-3. Please see the migration documentation for further details.

Parquet doesn't recognize the nested Array type in MAP as ArrayWritable.

When trying to insert Hive data of type MAP<string, array> into Parquet, the following error is thrown:

Caused by: parquet.io.ParquetEncodingException: This should be an ArrayWritable or MapWritable: org.apache.hadoop.hive.ql.io.parquet.writable.BinaryWritable@c644ef1c
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeData(DataWritableWriter.java:86)

The problem is reproducible with the following steps.
Relevant test data is attached.

CREATE TABLE test_hive (
node string,
stime string,
stimeutc string,
swver string,
moid MAP <string,string>,
pdfs MAP <string,array>,
utcdate string,
motype string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '=';

LOAD DATA LOCAL INPATH '/root/38388/test.dat' INTO TABLE test_hive;

CREATE TABLE test_parquet(
pdfs MAP <string,array>
)
STORED AS PARQUET ;

INSERT INTO TABLE test_parquet SELECT pdfs FROM test_hive;

Reporter: Mala Chikka Kempanna
Assignee: Ryan Blue / @rdblue

Note: This issue was originally created as PARQUET-26. Please see the migration documentation for further details.
