Apache Parquet Java
Home Page: https://parquet.apache.org/
License: Apache License 2.0
To generate the most efficient on-disk files, it is important to control file size. It would be nice if we could configure the output format to roll over to a new file when the current one reaches the right size.
There's currently no easy way to tune this; it can only be controlled indirectly (number of reducers, map input size).
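A rough sketch of what a size-based roll-over could look like (everything here is hypothetical, including the WriterFactory helper and the assumption that getDataSize() reports bytes accumulated so far; this is not an existing API):
import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;

// Hypothetical sketch, not an existing feature: roll over to a new
// file once the current one grows past a target size.
public class RollingWriter<T> implements Closeable {

  // Assumed helper that knows how to open a ParquetWriter for a path.
  public interface WriterFactory<T> {
    ParquetWriter<T> open(Path file) throws IOException;
  }

  private final Path baseDir;
  private final long targetSize;
  private final WriterFactory<T> factory;
  private int part = 0;
  private ParquetWriter<T> current;

  public RollingWriter(Path baseDir, long targetSize, WriterFactory<T> factory)
      throws IOException {
    this.baseDir = baseDir;
    this.targetSize = targetSize;
    this.factory = factory;
    this.current = factory.open(nextFile());
  }

  private Path nextFile() {
    return new Path(baseDir, String.format("part-%05d.parquet", part++));
  }

  public void write(T record) throws IOException {
    current.write(record);
    // getDataSize() is assumed here to report the bytes accumulated so far
    if (current.getDataSize() >= targetSize) {
      current.close();
      current = factory.open(nextFile());
    }
  }

  @Override
  public void close() throws IOException {
    current.close();
  }
}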
Note: This issue was originally created as PARQUET-17. Please see the migration documentation for further details.
If the delegated PrimitiveConverter supports dictionaries, then FilteringPrimitiveConverter should too.
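In code, the ask is roughly this delegation inside FilteringPrimitiveConverter, where delegate is the wrapped converter and the methods are PrimitiveConverter's existing dictionary hooks (a sketch, not the actual patch):
// Sketch of the requested delegation (simplified; the real class
// would also need to run the filtering ValueInspectors on each
// dictionary-decoded value):
@Override
public boolean hasDictionarySupport() {
  return delegate.hasDictionarySupport();
}

@Override
public void setDictionary(Dictionary dictionary) {
  delegate.setDictionary(dictionary);
}

@Override
public void addValueFromDictionary(int dictionaryId) {
  delegate.addValueFromDictionary(dictionaryId);
}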
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-36. Please see the migration documentation for further details.
For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups.
Pull request:
#215
Reporter: Alex Levenson / @isnotinvain
Assignee: Junjie Chen / @chenjunjiedada
Note: This issue was originally created as PARQUET-41. Please see the migration documentation for further details.
Is this too strict?
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-44. Please see the migration documentation for further details.
Record-level filtering in the filter2 API still requires visiting every value of the record. We may be able to short-circuit as soon as the filter predicate reaches a known state.
Another approach would be to figure out how to get essentially random access to the values referenced by the predicate and check them first. This could be tricky because it would require restructuring the assembly algorithm.
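A sketch of the short-circuit idea, with made-up types (the real implementation would live in the record assembly):
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: evaluate the predicate incrementally and stop
// consuming column values once its outcome is decided.
class ShortCircuitFilter {
  enum State { KEEP, DROP, UNKNOWN }

  static State evaluate(List<Object> columnValues, Function<Object, State> predicate) {
    State state = State.UNKNOWN;
    for (Object value : columnValues) {
      state = predicate.apply(value);
      if (state != State.UNKNOWN) {
        break; // decided: no need to visit the record's remaining values
      }
    }
    return state;
  }
}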
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-37. Please see the migration documentation for further details.
Some members of the community are still sending mail to the old Google Groups list, and some of those discussions seem to have been neglected a bit, such as this one:
https://groups.google.com/forum/#!topic/parquet-dev/eMoVUXxY044
I think an auto-reply should be added to the old list telling people to post to the Apache list instead, and for members who post via the web forum interface, a message pinned to the head of the list should point them to the Apache JIRA.
Reporter: Jason Altekruse / @jaltekruse
Assignee: Julien Le Dem / @julienledem
Note: This issue was originally created as PARQUET-10. Please see the migration documentation for further details.
This is one PR that helps with memory usage, but it may not be enough to solve the overall problem:
https://github.com/apache/incubator-parquet-format/pull/2
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-28. Please see the migration documentation for further details.
Bit of a bummer it's compiling with Scala 2.9.2 rather than 2.10.4.
Reporter: Ian O Connell
Assignee: Tim / @tsdeng
Note: This issue was originally created as PARQUET-48. Please see the migration documentation for further details.
We've discussed that it would be neat if we could replace a lot of the checkNotNull() calls in parquet-mr with an annotation like @NotNull, or even make not-null the default and annotate things that can be null with @Nullable, and have this enforced by the compiler / annotation processor.
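For example, with JSR-305-style annotations (purely an illustration of the idea; the class and fields are made up):
import javax.annotation.Nonnull;
import javax.annotation.Nullable;

// Illustration only: nullability expressed in signatures (checked by
// annotation tooling) rather than with runtime checkNotNull() calls.
class ReadContext {
  @Nullable
  private String filter;            // may legitimately be absent

  private final String schema;

  ReadContext(@Nonnull String schema) {
    this.schema = schema;           // no checkNotNull(schema, "schema") needed
  }

  void setFilter(@Nullable String filter) {
    this.filter = filter;
  }
}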
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-29. Please see the migration documentation for further details.
We currently only filter row groups via the min / max value in the row group.
We should additionally inspect the dictionary of unique values in a row group (if it has one); this could dramatically increase our ability to drop entire row groups.
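Sketched as pseudo-logic (the dictionary access shown here is assumed, and this only applies when a column chunk is fully dictionary encoded):
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical pruning pass: if a column chunk is fully dictionary
// encoded, test the predicate against each distinct value; if none
// can match, the whole row group can be dropped.
class DictionaryRowGroupFilter {
  static boolean canDrop(Set<Object> dictionaryValues, Predicate<Object> predicate) {
    for (Object value : dictionaryValues) {
      if (predicate.test(value)) {
        return false; // some value could match; the row group must be read
      }
    }
    return true; // no distinct value satisfies the predicate
  }
}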
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-40. Please see the migration documentation for further details.
As far as I can tell, there is no way to pass a dynamic argument to use in filtering.
UnboundRecordFilters should be initialized with a Configuration object that they can pull arguments from.
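A sketch of the proposed shape; bind() is the existing UnboundRecordFilter method, while the configure() hook and the config key are assumptions:
import org.apache.hadoop.conf.Configuration;
import parquet.column.ColumnReader;
import parquet.filter.RecordFilter;
import parquet.filter.UnboundRecordFilter;

// Sketch of the proposal: let the filter read its parameters from the
// job Configuration instead of compile-time constants.
public class MinScoreFilter implements UnboundRecordFilter {
  private long minScore;

  public void configure(Configuration conf) { // proposed hook
    this.minScore = conf.getLong("example.filter.min.score", 0L); // hypothetical key
  }

  @Override
  public RecordFilter bind(Iterable<ColumnReader> readers) {
    // here the filter would bind minScore against the relevant column
    throw new UnsupportedOperationException("sketch only");
  }
}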
Reporter: Sandy Ryza / @sryza
Assignee: Sandy Ryza / @sryza
Note: This issue was originally created as PARQUET-25. Please see the migration documentation for further details.
A user reported seeing slowness when projection push-down code is active, which seems to stem from ProtocolEventsAmender.
Details can be found in https://github.com/Parquet/parquet-mr/issues/406
Reporter: Dmitriy V. Ryaboy / @dvryaboy
Note: This issue was originally created as PARQUET-7. Please see the migration documentation for further details.
The parquet file format already supports extensible key/value storage in the column metadata and file metadata sections of the file. It would be nice if there was a facility in place for computing custom values to be placed in these key/value stores.
See: https://github.com/Parquet/parquet-mr/pull/185
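A sketch of one way this could look, assuming a finalizeWrite()-style hook on WriteSupport (treat both the hook's exact shape and the metadata key as assumptions):
import java.util.HashMap;
import java.util.Map;
import org.apache.parquet.hadoop.api.WriteSupport;

// Sketch: a WriteSupport that accumulates a custom value while records
// are written and publishes it into the file's key/value metadata.
abstract class CountingWriteSupport<T> extends WriteSupport<T> {
  private long recordCount = 0;

  protected void recordWritten() {
    recordCount++;
  }

  @Override
  public FinalizedWriteContext finalizeWrite() {
    Map<String, String> extra = new HashMap<>();
    extra.put("x-example-record-count", Long.toString(recordCount)); // made-up key
    return new FinalizedWriteContext(extra);
  }
}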
Reporter: Wesley Graham Peck
Note: This issue was originally created as PARQUET-15. Please see the migration documentation for further details.
The current implementation of parquet thrift creates an instance of TProtocol for each value of each record and builds a stack of these events, which are then replayed back to the TBase.
I'd be curious to benchmark this, and if it's slow, try building a "ReplayingTProtocol" that instead of having a stack of TProtocol instances, contains a primitive array of each type. As events are fed into this replaying TProtocol, it would just add these primitives to its buffers, and then the TBase would drain them. This would effectively let us stream the values into the TBase without making an object allocation for each value.
The buffers could be set to a certain size, and if they fill up (which they shouldn't in most cases), the TBase could begin draining the protocol until it is empty again, at which point the TProtocol can block the TBase from draining further while the Parquet record assembly feeds it more events.
This is all moot if it turns out not to be a bottleneck though :)
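Reduced to its core, the buffering idea might look like this (all of it hypothetical; the overflow/draining coordination described above is elided):
// Hypothetical core of a "ReplayingTProtocol": primitive-typed buffers
// instead of a stack of TProtocol event objects, so feeding a value
// allocates nothing.
class ReplayingBuffers {
  private final int[] i32s = new int[4096];
  private final long[] i64s = new long[4096];
  private int i32Write, i32Read, i64Write, i64Read;

  void feedI32(int v) { i32s[i32Write++] = v; }  // called by record assembly
  void feedI64(long v) { i64s[i64Write++] = v; } // (one buffer per thrift type)

  int readI32() { return i32s[i32Read++]; }      // drained by the TBase
  long readI64() { return i64s[i64Read++]; }
}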
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-33. Please see the migration documentation for further details.
mvn eclipse:eclipse
on the parquet-mr project fails when it hits the scrooge sub-project. Scrooge being in Scala, this is probably not very surprising, but it means the target doesn't get to all the Hive modules and the tools. We should at least skip it if this isn't easy to fix.
Reporter: Dmitriy V. Ryaboy / @dvryaboy
Assignee: Dmitriy V. Ryaboy / @dvryaboy
Note: This issue was originally created as PARQUET-8. Please see the migration documentation for further details.
The InternalParquetRecordReader keeps track of the count of records it has processed and uses that count to know when it is finished and when to load a new row group of data. But when it is wrapping a FilteredRecordReader, this count is not updated for records that are filtered, so when the reader exhausts the block it is reading, it will continue calling read() on the filtered reader and will pass null values to the caller.
The quick fix is to detect null values returned by the record reader and update the count to read the next row group. But the longer-term solution is to correctly account for the filtered records.
The pull request for the quick fix is #9.
Reporter: Ryan Blue / @rdblue
Assignee: Thomas White / @tomwhite
Note: This issue was originally created as PARQUET-9. Please see the migration documentation for further details.
HLL (HyperLogLog) and CMS (Count-Min Sketch) for row groups could help with query planning (getting a sense of data skew) and with cheaply counting approximate distinct values. Both are commutative, which means they can be combined across row groups (unlike an exact distinct count, for example).
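The property being relied on, written out as a (hypothetical) interface:
// A sketch computed per row group can be merged into a file- or
// dataset-level sketch. HLL and CMS support this; an exact distinct
// count does not, which is the point made above.
interface MergeableSketch<S extends MergeableSketch<S>> {
  void update(Object value); // absorb one value while writing a row group
  S merge(S other);          // combine two row groups' sketches
}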
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-42. Please see the migration documentation for further details.
http://parquet.incubator.apache.org is not available yet.
Environment: public access
Reporter: bijaya
Assignee: Chris Aniszczyk
Note: This issue was originally created as PARQUET-1. Please see the migration documentation for further details.
In some cases the Hadoop filesystem API will throw NullPointerException when trying to access files that have moved.
We'd want to catch those and give a better error message.
Caused by: java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1043)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:211)
at parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:395)
at parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:443)
at parquet.hadoop.ParquetInputFormat.getGlobalMetaData(ParquetInputFormat.java:467)
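The requested behavior is roughly this (a sketch; the helper and the message text are made up):
import java.io.IOException;
import java.util.concurrent.Callable;

// Sketch: translate the Hadoop NPE into an actionable error.
final class FsErrors {
  static <T> T withBetterError(Callable<T> fsCall, String inputPath) throws IOException {
    try {
      return fsCall.call();
    } catch (NullPointerException e) {
      throw new IOException("Could not list '" + inputPath
          + "'; the input files may have been moved or deleted", e);
    } catch (IOException e) {
      throw e;
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}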
Reporter: Julien Le Dem / @julienledem
Note: This issue was originally created as PARQUET-20. Please see the migration documentation for further details.
such as String columns as a convenience over Binary columns, Maps, Sets, etc.
Maybe we don't want to do this and just support the primitive column types though.
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-35. Please see the migration documentation for further details.
parquet-hive and parquet-pig make assumptions about list schemas that are not compatible with the more compact schemas generated by parquet-protobuf. This bug was discussed in more detail on https://github.com/Parquet/parquet-mr/issues/354
Reporter: Nathan Howell
Note: This issue was originally created as PARQUET-14. Please see the migration documentation for further details.
parquet-hadoop currently has the semver checks disabled, and a few classes are blacklisted.
We need to 1) publish an artifact (maybe 1.6.0rc1) and set that as the "previous version" as far as the semver enforcer is concerned, and then 2) re-enable the enforcer and clear its blacklist.
Reporter: Alex Levenson / @isnotinvain
Assignee: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-50. Please see the migration documentation for further details.
Log statements from the InputFormat or anywhere else in the hadoop client / submitter seem to get buffered until the MR job completes, instead of printing as the job progresses.
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-27. Please see the migration documentation for further details.
This patch was included in Hive after the SerDe moved into Hive itself (Hive 0.14+). A backport is required for use with earlier versions.
https://github.com/apache/incubator-parquet-mr/pull/13
Reporter: Daniel Weeks / @danielcweeks
Assignee: Daniel Weeks / @danielcweeks
Note: This issue was originally created as PARQUET-22. Please see the migration documentation for further details.
I encountered OOMs reading metadata for a dataset with 500+ columns, many of them quite sparse.
We can reduce memory utilization significantly.
Reporter: Dmitriy V. Ryaboy / @dvryaboy
Note: This issue was originally created as PARQUET-11. Please see the migration documentation for further details.
Reporter: Thomas White / @tomwhite
Assignee: Thomas White / @tomwhite
Note: This issue was originally created as PARQUET-21. Please see the migration documentation for further details.
This is issue #283. Parquet-mr will try to read the bit-width byte in DictionaryValuesReader#initFromPage
even if the incoming offset is at the end of the byte array because there are no values.
Here's the stack trace:
Caused by: parquet.io.ParquetDecodingException: could not read page Page [id: 1, bytes.size=7, valueCount=100, uncompressedSize=7] in col [id] INT32
at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:532)
at parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
at parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:339)
at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:265)
at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:112)
at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:174)
... 29 more
Caused by: java.io.EOFException
at parquet.bytes.BytesUtils.readIntLittleEndianOnOneByte(BytesUtils.java:76)
at parquet.column.values.dictionary.DictionaryValuesReader.initFromPage(DictionaryValuesReader.java:55)
at parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:530)
... 39 more
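The fix is essentially a bounds check before reading the bit-width byte (a sketch, not the committed patch):
// Sketch of the guard: an empty page has no bit-width byte to read.
static int readBitWidthOrSkip(byte[] page, int offset, int valueCount) {
  if (valueCount == 0 || offset >= page.length) {
    return -1; // nothing to decode; caller leaves the reader empty
  }
  return page[offset] & 0xFF; // the bit width is a single byte
}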
Reporter: Ryan Blue / @rdblue
Assignee: Ryan Blue / @rdblue
Note: This issue was originally created as PARQUET-18. Please see the migration documentation for further details.
Because Parquet tries very hard to avoid autoboxing, most of the core classes are specialized for each primitive by having a method for each type, e.g.:
void writeInt(int x);
void writeLong(long x);
void writeDouble(double x);
and so on.
However, the statistics classes take the other approach, with an IntStatistics class, a LongStatistics class, a DoubleStatistics class, and so on. I think it's worth picking one pattern and sticking to it for consistency; the first pattern seems to be the more common one currently.
We may want to take this one step further and define an interface that these all conform to, e.g.:
public interface ParquetTypeVisitor {
  void visitInt(int x);
  void visitLong(long x);
  void visitDouble(double x);
}
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-32. Please see the migration documentation for further details.
merge_parquet_pr.py should:
https://github.com/apache/incubator-parquet-mr/blob/master/dev/merge_parquet_pr.py
Reporter: Julien Le Dem / @julienledem
Note: This issue was originally created as PARQUET-24. Please see the migration documentation for further details.
We need to decide which Java packages are "internal", which we are free to refactor without going up a major revision, and which are part of our public API and require strict semantic versioning. We then need to reflect this in the Maven enforcer that detects these changes.
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-31. Please see the migration documentation for further details.
It might be a good idea to unify these two classes.
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-30. Please see the migration documentation for further details.
ParquetReader has a lot of constructors. Maybe we should use the Builder pattern instead.
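Illustratively (all names and options here are hypothetical stand-ins, not the real reader):
// Illustration only: a builder replacing the constructor matrix.
final class Reader {
  final String path;
  final String filter;

  private Reader(String path, String filter) {
    this.path = path;
    this.filter = filter;
  }

  static Builder builder(String path) {
    return new Builder(path);
  }

  static final class Builder {
    private final String path;
    private String filter; // optional settings get fluent setters

    private Builder(String path) {
      this.path = path;
    }

    Builder withFilter(String filter) {
      this.filter = filter;
      return this;
    }

    Reader build() {
      return new Reader(path, filter);
    }
  }
}
Usage would then read Reader.builder(path).withFilter(filter).build() instead of picking among a dozen constructors.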
Reporter: Alex Levenson / @isnotinvain
Assignee: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-39. Please see the migration documentation for further details.
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-46. Please see the migration documentation for further details.
Parquet's package names are still parquet.*. Since Parquet is now in the Apache Incubator, the namespaces should be updated to be org.apache.parquet.*.
Reporter: David Chen
Assignee: Ryan Blue / @rdblue
Note: This issue was originally created as PARQUET-23. Please see the migration documentation for further details.
When testing Spark SQL Parquet support, we found that accessing large Parquet files located in S3 can be very slow. To be more specific, we have an S3 Parquet file with over 3,000 part-files; calling ParquetInputFormat.getSplits on it takes several minutes. (We were accessing this file from our office network rather than AWS.)
After some investigation, we found that ParquetInputFormat.getSplits is trying to call getFileStatus() on all part-files one by one sequentially (here). And in the case of S3, each getFileStatus() call issues an HTTP request and waits for the reply in a blocking manner, which is considerably expensive.
Actually, all these FileStatus objects have already been fetched when the footers are retrieved (here). Caching these FileStatus objects can greatly improve our S3 case (reduced from over 5 minutes to about 1.4 minutes).
Will submit a PR for this issue soon.
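The caching fix boils down to reusing what the footer pass already fetched; a sketch (names are assumptions, not the actual PR):
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

// Sketch: remember the FileStatus objects fetched while reading
// footers, so getSplits() never re-issues one blocking S3 request
// per part-file.
class FileStatusCache {
  private final Map<Path, FileStatus> byPath = new HashMap<>();

  void put(FileStatus status) {
    byPath.put(status.getPath(), status);
  }

  FileStatus get(Path path) {
    return byPath.get(path); // null means: fall back to fs.getFileStatus(path)
  }
}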
Reporter: Cheng Lian / @liancheng
Note: This issue was originally created as PARQUET-16. Please see the migration documentation for further details.
As of now, for a Hive table stored as Parquet, the schema can only be specified in the Hive MetaStore. For our use case, it is desirable that the schema be provided by the Thrift SerDe rather than the MetaStore. Using a Thrift IDL as the schema provider allows us to maintain a consistent schema across execution engines other than Hive, such as Pig and native MR.
Additionally, for a large sparse schema, it is much easier to build Thrift objects and use parquet-thrift/elephant-bird to convert them into columns/tuples than to construct the whole big tuple by hand.
Reporter: Abhishek Agarwal / @abhishekagarwal87
Assignee: Ashish Singh / @SinghAsDev
Note: This issue was originally created as PARQUET-47. Please see the migration documentation for further details.
Pull request: https://github.com/apache/incubator-parquet-mr/pull/3
Original from the old repo: Parquet/parquet-mr#410
These changes allow primitive types to be requested as different types than what is stored in the file, using a flag to turn off strict type checking (default is on). Types are cast to the requested type where possible and will suffer precision loss where necessary (e.g. requesting a double as an int).
No performance penalty is imposed for using the type defined in the file. A flag exists to
A 6x6 test case is provided to test conversion between the primitive types.
Reporter: Daniel Weeks / @danielcweeks
Assignee: Daniel Weeks / @danielcweeks
Note: This issue was originally created as PARQUET-2. Please see the migration documentation for further details.
According to ShowSchemaCommand.java, the -d | --detailed option doesn't require an optional argument. The hasOptionalArg() call should be removed.
Reporter: Cheng Lian / @liancheng
Note: This issue was originally created as PARQUET-13. Please see the migration documentation for further details.
They currently are not supported. They would need their own set of operators, like contains() and size() etc.
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-34. Please see the migration documentation for further details.
Make sure the valueObj instance variable is always initialized. This change is needed when running a Hive query that uses CombineHiveInputFormat and the first file in the combined split is empty. That can lead to a NullPointerException, because valueObj is null when CombineHiveInputFormat calls the createValue method.
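The shape of the fix, sketched as it might look inside the Parquet Hive record reader (simplified; valueObj and the surrounding class are assumed from context):
// Sketch: never let createValue() hand back null, even when the
// reader was opened on an empty split and initialization was skipped.
@Override
public ArrayWritable createValue() {
  if (valueObj == null) {
    valueObj = new ArrayWritable(Writable.class, new Writable[0]);
  }
  return valueObj;
}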
Reporter: Matt Martin / @matt-martin
Assignee: Matt Martin / @matt-martin
Note: This issue was originally created as PARQUET-19. Please see the migration documentation for further details.
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-45. Please see the migration documentation for further details.
There are a handful of classes, like BlockMetaData and some of the other *MetaData classes, that are currently in parquet-hadoop but aren't Hadoop specific. This forces some other classes (like the row group filter) to also live in parquet-hadoop when they really belong in parquet-column.
Reporter: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-38. Please see the migration documentation for further details.
The caching approach needs to change because of issues that occur when the same ParquetInputFormat instance is reused to generate splits for different input directories. For example, it causes problems in Hive's FetchOperator when the FetchOperator is attempting to operate over more than one partition (sidenote: as far as I could tell, Hive has been reusing inputformat instances in this way for quite some time). The details of how this issue manifests itself with respect to Hive are described in more detail here: https://groups.google.com/d/msg/parquet-dev/0aXql-3z7vE/Gn5m094V7PMJ
The proposed patch can be found here: https://github.com/apache/incubator-parquet-mr/pull/2
Reporter: Matt Martin / @matt-martin
Assignee: Matt Martin / @matt-martin
Note: This issue was originally created as PARQUET-4. Please see the migration documentation for further details.
Reporter: Julien Le Dem / @julienledem
Note: This issue was originally created as PARQUET-51. Please see the migration documentation for further details.
Instead of creating an object for each record, Avro DatumReaders support passing in an object to be reused. It would be nice to expose this in Parquet.
This could be turned on with a setting in the job configuration.
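For reference, this is the Avro idiom being borrowed; the read(reuse, decoder) call is real Avro API, everything around it is illustrative:
import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;

// Avro's reuse idiom: read(reuse, decoder) refills the same object
// instead of allocating a new record on every call.
final class ReuseExample {
  static void drain(DatumReader<GenericRecord> reader, Decoder decoder, long count)
      throws IOException {
    GenericRecord reuse = null;
    for (long i = 0; i < count; i++) {
      reuse = reader.read(reuse, decoder); // after the first call, no new allocation
      // ... use the record here before the next iteration overwrites it ...
    }
  }
}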
Reporter: Sandy Ryza / @sryza
Note: This issue was originally created as PARQUET-5. Please see the migration documentation for further details.
https://github.com/apache/incubator-parquet-mr/pull/4
Reporter: Alex Levenson / @isnotinvain
Assignee: Alex Levenson / @isnotinvain
Note: This issue was originally created as PARQUET-49. Please see the migration documentation for further details.
Add support for additional logical types. Based on discussions here: https://docs.google.com/document/d/1y8UKDsdgT6d05xXXz1UjeDJhOYWAGZiZyDn4uHtXXXw/edit and https://github.com/Parquet/parquet-format/pull/94/files#r11984805
Reporter: Jacques Nadeau / @jacques-n
Note: This issue was originally created as PARQUET-12. Please see the migration documentation for further details.
We should create clear documentation for contributors on how to create new issues, PRs, etc. Spark has a pretty good page that we can use as a reference, since they use a similar workflow: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
We should also update the README.md with a link to this more detailed document, and cut most of what we currently have in the contributing section.
Reporter: Dmitriy V. Ryaboy / @dvryaboy
Assignee: Julien Le Dem / @julienledem
Note: This issue was originally created as PARQUET-6. Please see the migration documentation for further details.
https://github.com/apache/incubator-parquet-mr/pull/5
Given a pull request id on github.com/apache/incubator-parquet-mr, this script will merge it.
It requires two remotes, apache-github and apache, pointing to the corresponding repos.
Tested here (pretending my fork is the apache remote):
julienledem@485658a
original tool:
https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py
Reporter: Julien Le Dem / @julienledem
Assignee: Julien Le Dem / @julienledem
Note: This issue was originally created as PARQUET-3. Please see the migration documentation for further details.
When trying to insert Hive data of type MAP<string, array> into Parquet, it throws the following error:
Caused by: parquet.io.ParquetEncodingException: This should be an ArrayWritable or MapWritable: org.apache.hadoop.hive.ql.io.parquet.writable.BinaryWritable@c644ef1c
at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeData(DataWritableWriter.java:86)
The problem is reproducible with the following steps:
Relevant test data is attached.
CREATE TABLE test_hive (
node string,
stime string,
stimeutc string,
swver string,
moid MAP <string,string>,
pdfs MAP <string,array>,
utcdate string,
motype string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '=';
LOAD DATA LOCAL INPATH '/root/38388/test.dat' INTO TABLE test_hive;
CREATE TABLE test_parquet(
pdfs MAP <string,array>
)
STORED AS PARQUET ;
INSERT INTO TABLE test_parquet SELECT pdfs FROM test_hive;
Reporter: Mala Chikka Kempanna
Assignee: Ryan Blue / @rdblue
Note: This issue was originally created as PARQUET-26. Please see the migration documentation for further details.