
iceberg's Introduction

Iceberg has moved! Iceberg has been donated to the Apache Software Foundation.

Please use the new Apache mailing lists, site, and repository.

Iceberg is a new table format for storing large, slow-moving tabular data. It is designed to improve on the de facto standard table layout built into Hive, Presto, and Spark.

Status

Iceberg is under active development at Netflix.

The core Java library that tracks table snapshots and metadata is complete, but still evolving. Current work is focused on integrating Iceberg into Spark and Presto.

The Iceberg format specification is being actively updated and is open for comment. Until the specification is complete and released, it carries no compatibility guarantees. The spec is currently evolving as the Java reference implementation changes.

Java API javadocs are available for the 0.3.0 (latest) release.

Collaboration

We welcome collaboration on both the Iceberg library and specification. The draft spec is open for comments.

For other discussion, please use the Iceberg mailing list or open issues on the Iceberg GitHub page.

Building

Iceberg is built using Gradle 4.4.

Iceberg table support is organized in library modules:

  • iceberg-common contains utility classes used in other modules
  • iceberg-api contains the public Iceberg API
  • iceberg-core contains implementations of the Iceberg API and support for Avro data files; this is what processing engines should depend on
  • iceberg-parquet is an optional module for working with tables backed by Parquet files
  • iceberg-orc is an optional module for working with tables backed by ORC files (experimental)
  • iceberg-hive is an implementation of Iceberg tables backed by the Hive metastore Thrift client

This project also has modules for adding Iceberg support to processing engines:

  • iceberg-spark is an implementation of Spark's Datasource V2 API for Iceberg (use iceberg-runtime for a shaded version)
  • iceberg-data is a client library used to read Iceberg tables from JVM applications
  • iceberg-pig is an implementation of Pig's LoadFunc API for Iceberg
  • iceberg-presto-runtime generates a shaded runtime jar that is used by Presto to integrate with Iceberg tables

Compatibility

Iceberg's Spark integration is compatible with the following Spark versions:

Iceberg version    Spark version
0.2.0+             2.3.0
0.3.0+             2.3.2

About Iceberg

Overview

Iceberg tracks individual data files in a table instead of directories. This allows writers to create data files in-place and only adds files to the table in an explicit commit.

Table state is maintained in metadata files. All changes to table state create a new metadata file and replace the old metadata with an atomic operation. The table metadata file tracks the table schema, partitioning config, other properties, and snapshots of the table contents. Each snapshot is a complete set of data files in the table at some point in time. Snapshots are listed in the metadata file, but the files in a snapshot are stored in separate manifest files.

The atomic transitions from one table metadata file to the next provide snapshot isolation. Readers use the snapshot that was current when they load the table metadata and are not affected by changes until they refresh and pick up a new metadata location.

Data files in snapshots are stored in one or more manifest files that contain a row for each data file in the table, its partition data, and its metrics. A snapshot is the union of all files in its manifests. Manifest files can be shared between snapshots to avoid rewriting metadata that is slow-changing.
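
To make this concrete, here is a minimal sketch that loads a table with the Java API and inspects its snapshots. The class names come from the com.netflix.iceberg packages, but the API is still evolving, so treat exact signatures as approximate; the table location is only an example.

import org.apache.hadoop.conf.Configuration;
import com.netflix.iceberg.Snapshot;
import com.netflix.iceberg.Table;
import com.netflix.iceberg.hadoop.HadoopTables;

public class InspectTable {
  public static void main(String[] args) {
    // Loading a table reads the current metadata file, which tracks the schema,
    // partition spec, properties, and the list of snapshots.
    Table table = new HadoopTables(new Configuration()).load("hdfs://nn/warehouse/db/events");

    System.out.println("schema: " + table.schema());
    System.out.println("partition spec: " + table.spec());

    // Each snapshot is a complete set of the table's data files at a point in time;
    // the files themselves are listed in the snapshot's manifest files.
    for (Snapshot snapshot : table.snapshots()) {
      System.out.println("snapshot: " + snapshot.snapshotId());
    }
  }
}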

Design benefits

This design addresses specific problems with the Hive layout: file listing is no longer used to plan jobs, and files are created in place without renaming.

This also provides improved guarantees and performance:

  • Snapshot isolation: Readers always use a consistent snapshot of the table, without needing to hold a lock. All table updates are atomic.
  • O(1) RPCs to plan: Instead of listing O(n) directories in a table to plan a job, reading a snapshot requires O(1) RPC calls.
  • Distributed planning: File pruning and predicate push-down are distributed to jobs, removing the metastore as a bottleneck.
  • Version history and rollback: Table snapshots are kept as history and tables can roll back if a job produces bad data.
  • Finer granularity partitioning: Distributed planning and O(1) RPC calls remove the current barriers to finer-grained partitioning.
  • Safe file-level operations: By supporting atomic changes, Iceberg enables new use cases, like safely compacting small files and safely appending late data to tables.

Why a new table format?

There are several problems with the current format:

  • There is no specification. Implementations don't handle all cases consistently. For example, Hive and Spark use different hash functions for bucketing, so their bucketed tables are not compatible. Hive uses a locking scheme to make cross-partition changes safe, but no other implementation uses it.
  • The metastore only tracks partitions. Files within partitions are discovered by listing partition paths. Listing partitions to plan a read is expensive, especially when using S3. This also makes atomic changes to a table’s contents impossible. Netflix has developed custom Metastore extensions to swap partition locations, but these are slow because it is expensive to make thousands of updates in a database transaction.
  • Operations depend on file rename. Most output committers depend on rename operations to implement guarantees and reduce the amount of time tables only have partial data from a write. But rename is not a metadata-only operation in S3 and will copy data. The new S3 committers that use multipart upload make this better, but can’t entirely solve the problem and put a lot of load on the S3 index during job commit.

Table data is tracked in both a central metastore, for partitions, and the file system, for files. The central metastore can be a scale bottleneck and the file system doesn't (and shouldn't) provide transactions to isolate concurrent reads and writes. The current table layout cannot be patched to fix its major problems.

Other design goals

In addition to changes in how table contents are tracked, Iceberg's design improves a few other areas:

  • Schema evolution: Columns are tracked by ID to support add/drop/rename (see the sketch after this list).
  • Reliable types: Iceberg uses a core set of types, tested to work consistently across all of the supported data formats.
  • Metrics: The format includes cost-based optimization metrics stored with data files for better job planning.
  • Invisible partitioning: Partitioning is built into Iceberg as table configuration; Iceberg can plan efficient queries without requiring extra partition predicates.
  • Unmodified partition data: The Hive layout stores partition data escaped in strings. Iceberg stores partition data without modification.
  • Portable spec: Tables are not tied to Java. Iceberg has a clear specification for other implementations.
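
As referenced in the schema evolution item above, here is a hedged sketch of ID-based schema evolution through the Java API. The UpdateSchema method names reflect the current library, but exact signatures should be treated as approximate, and the column names are made up.

import com.netflix.iceberg.Table;
import com.netflix.iceberg.types.Types;

public class EvolveSchema {
  // Because columns are tracked by ID, renames and additions do not require rewriting data files.
  public static void evolve(Table table) {
    table.updateSchema()
        .renameColumn("id", "obj_id")                    // existing data still maps by field ID
        .addColumn("ts", Types.TimestampType.withZone()) // new column gets a fresh ID
        .commit();
  }
}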

iceberg's People

Contributors

aspyker, julienledem, mccheah, omalley, omervk, parth-brahmbhatt, rdblue, rdsr


iceberg's Issues

Convert TestReadProjection/TestSparkReadProjection to use Spark's InternalRow

In starting to look at Iceberg's schema evolution for ORC, I found that the current test case is full of Avro types and data structures. That doesn't work for ORC, because I have no desire to build those bindings.

Therefore, I'll make a version of TestSparkReadProjection that uses Iceberg's Schema and Spark's InternalRow. That will work with all three file formats.

Should I fork the current test classes? Or should I change the current test to be more generic?

Add scan listener interface

Iceberg should support use cases that gather data from table operations. An easy way to provide data for scans is to add a listener interface that is called each time a scan is planned or executed.
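
A purely hypothetical sketch of what such a listener could look like follows; none of these names exist in the codebase today, and the actual design is open for discussion.

import com.netflix.iceberg.expressions.Expression;

// Hypothetical interface, for illustration only.
public interface ScanListener {
  // Called each time a scan is planned, with the table name and the filter used for planning.
  void scanPlanned(String tableName, Expression filter);

  // Called each time a planned scan is executed.
  void scanExecuted(String tableName, Expression filter);
}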

Support snapshot selection in Spark's query options

Spark passes query options from DataFrameReader to the Iceberg source. Iceberg should support selecting a specific snapshot ID or the table state at some time from these options.

This is attached to the audit workflow support milestone because it is needed to read the table state that is being audited.
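
For illustration, selecting a snapshot from Spark might look like the sketch below. The DataFrameReader calls are standard Spark API; the "snapshot-id" option name and the table location are assumptions, since this issue is about adding that support.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadSnapshot {
  public static Dataset<Row> read(SparkSession spark) {
    // "snapshot-id" is a hypothetical option name; DataFrameReader would pass it
    // to the Iceberg source as a query option.
    return spark.read()
        .format("iceberg")
        .option("snapshot-id", "8765432109876543210")
        .load("hdfs://nn/warehouse/db/events");
  }
}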

Add in and notIn predicates

Currently, set inclusion is implemented using a tree of equality predicates joined with OR predicates, as sketched below. It would be much more efficient to add support for IN and NOT_IN predicates.
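
For reference, here is a sketch of the current workaround using the expression API, assuming the factory methods in com.netflix.iceberg.expressions.Expressions; the proposed in() method is hypothetical.

import com.netflix.iceberg.expressions.Expression;

import static com.netflix.iceberg.expressions.Expressions.equal;
import static com.netflix.iceberg.expressions.Expressions.or;

public class SetInclusion {
  // Today, id IN (1, 2, 3) has to be expanded into a tree of OR'd equality predicates.
  public static Expression idInOneTwoThree() {
    return or(or(equal("id", 1), equal("id", 2)), equal("id", 3));
  }

  // A dedicated predicate, e.g. in("id", 1, 2, 3), would express this directly
  // and could be evaluated more efficiently (hypothetical).
}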

Add an API to maintain external schema mappings

Once Iceberg supports external schema mappings, it should also support an easy way to maintain those mappings by notifying Iceberg when an external schema changes. Iceberg would update its mapping when notified.

For example, starting with this mapping:

[ {"field-id": 1, "names": ["id"]},
  {"field-id": 2, "names": ["data"]} ]

Consider a new Avro schema registered that renames id to obj_id and adds a ts field. Iceberg would add an unmapped entry for ts and add obj_id to the id mapping, based on the Avro schema's field alias indicating that id and obj_id are the same field. The updated mapping would be:

[ {"field-id": 1, "names": ["obj_id", "id"]},
  {"field-id": 2, "names": ["data"]},
  {"names": ["ts"]} ]

Next, if the Iceberg table schema is updated to add ts, the mapping would be updated by matching the new Iceberg column to the unmatched mapping entry to produce this mapping:

[ {"field-id": 1, "names": ["rec_id", "id"]},
  {"field-id": 2, "names": ["data"]},
  {"field-id": 3, "names": ["ts"]} ]

This would maintain compatibility with new Avro data files without making changes to the Iceberg table other than the mapping. Columns can be added in Iceberg or Avro first and the mapping is completed by column name when it is added in both schemas.

Add operation to snapshots

Snapshots that only append data can be aged off more aggressively than snapshots with deletes, because all of their data files are still tracked in the next snapshot. Adding an operation type and a summary to snapshot metadata would enable improvements to metadata cleanup operations.

Add action to cherry-pick snapshot changes

In an audit workflow, new data is written to an orphan snapshot that is not committed as the table's current state until it is audited. After auditing a change, it may need to be applied or cherry-picked on top of the latest snapshot instead of the one that was current when the audited changes were created.

Iceberg needs to support cherry-picking the changes from an orphan snapshot by applying them to the current snapshot.

Add manifest target size

Merge appends currently rewrite all of the data in the manifest to which new files will be appended. This writes too much data and wastes space with duplicate data. A simple fix is to add a target manifest size. After a manifest hits the target size, do not rewrite it and start using a new manifest.
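
If the target size becomes a table property, configuring it might look roughly like this. UpdateProperties is the existing API for setting table properties, but the property name below is made up for illustration.

import com.netflix.iceberg.Table;

public class SetManifestTargetSize {
  public static void configure(Table table) {
    // "commit.manifest.target-size-bytes" is an illustrative property name, not a defined one.
    table.updateProperties()
        .set("commit.manifest.target-size-bytes", String.valueOf(8 * 1024 * 1024))
        .commit();
  }
}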

usage help

First of all, I think this package can be super beneficial, so kudos.
Can you guide me through installing and using this amazing package, either with:

  • an installation guide including prerequisites (or, even better, a Docker image)
  • a tutorial; I saw there is a small example in a Jupyter notebook.

thanks for the hard work

Support Customizing The Location Of Data Files Written By The Spark Data Source

Currently the Iceberg Data Source Writer requires files to be written to a location relative to the location of the table's metadata files. However, this is an artificial requirement because the manifest specifies URIs of data files that are completely independent of the URI of the table's metadata file system. For example one might want their table metadata to be stored in HDFS but their data files to be stored in S3.

We propose supporting a data source option, iceberg.spark.writer.dataLocation, to allow for overriding the base directory URI of the data files that are to be written.
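
With that option, a write might look like the sketch below. The option name is the one proposed above, the Spark calls are standard DataFrameWriter API, and the paths are examples only.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class WriteWithDataLocation {
  public static void write(Dataset<Row> df) {
    // Metadata stays under the table location (HDFS here), while new data files
    // would be written under the proposed override location (S3 here).
    df.write()
        .format("iceberg")
        .option("iceberg.spark.writer.dataLocation", "s3://bucket/warehouse/db/events/data")
        .mode("append")
        .save("hdfs://nn/warehouse/db/events");
  }
}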

Publish all project artifacts independently

Currently, Iceberg has to be imported by including the iceberg-runtime jar on the classpath. However, this is problematic for detecting version conflicts between Iceberg and projects that depend on it. Build systems typically include an option to resolve version conflicts between different dependent modules, or outright fail the build if such version conflicts are not resolved. Conflicts can be resolved by forcing a specific version. The iceberg-runtime jar bundles all of its dependencies without declaring their versions in its Maven pom, so version conflicts between Iceberg's bundled dependencies and the dependencies pulled in by Iceberg's clients cannot be detected until runtime.

The solution is just to publish all modules and their dependencies separately as well, so that users of Iceberg can depend on the modules and track the Maven dependency tree properly. For example, in my Gradle build I want to be able to do this:

dependencies {
  runtime 'com.netflix.iceberg:iceberg-spark:0.3.2'
}

Ignore invalid partition fields

Iceberg may add new transforms to the partition spec. When a transform is not recognized, Iceberg should ignore the field so that the format is forward-compatible with new transforms.

Iceberg should also ignore fields with multiple source columns, in case transforms on multiple columns are added.

Add parent snapshot ID to snapshot metadata

Audit workflows require writing a new snapshot, but not updating the table's current state until that snapshot has been audited. Once a snapshot has been audited, applying it to the current table state might be a fast-forward if its parent is the current table state, or might be a cherry-pick if another snapshot has been committed on top of the audit's parent. To know whether the snapshot must be cherry-picked, Iceberg should track each snapshot's parent ID.

Add split offsets to manifest files

Instead of storing a single HDFS block size for each data file, Iceberg should store a list of split offsets. That will allow split planning to be more precise by using row group or stripe offsets, without reading file footers.

Restrict decimal precision to 38 digits.

From @omalley:

Because implementing efficient operations on arbitrary-length numbers is hard. BigInteger is really slow. Most C++ compilers now have __int128_t to directly implement 128-bit integers. Implementing a reasonable Int128 in Java is a pain, but doable.

All databases have limits on precision. Hive, Spark SQL, and SQL Server have a 38-digit limit. Oracle uses a 31-digit limit. MySQL has a 65-digit limit. To me, 38 is more than any application that I have written (except one!) needs, and it has a pretty straightforward implementation.

Vectorized Parquet Read In Spark DataSource

The Parquet file format reader available in core Spark includes a number of optimizations, the main one being vectorized columnar reading. In considering a potential migration from the old Spark readers to Iceberg, one would be concerned about the performance gap that comes from lacking Spark's numerous optimizations in this space.

It is not clear what the best way is to incorporate these optimizations into Iceberg. One option would be to propose moving this code from Spark to parquet-mr. Another would be to invoke Spark's Parquet reader directly here, but that is an internal API. We could implement vectorized reading directly in Iceberg, but that would largely amount to reinventing the wheel.

Implement strict projection for transforms

Strict projection isn't required and wasn't implemented for several of the partitioning transforms. When strict projection isn't implemented (the projectStrict method returns null), Iceberg falls back to a safe implementation. For example, residual evaluation will not remove predicates because they cannot be guaranteed to be true, and deletes can't determine that all values in a file match, so the file can't be deleted (deletes usually fall back to min/max metrics evaluation).

Implementing strict projection for all transforms where possible will improve query efficiency, make deletes faster, and so on.

Add snapshot history to table metadata

Table metadata currently keeps a set of snapshots that are still considered valid (e.g., their manifests and data files can't be deleted) and a pointer to the current snapshot. It is possible to use the still-valid snapshots to support time-travel queries, but iterating over the list of snapshots to find the one that was current at some point in time doesn't work: the current pointer may never have been updated to point to a particular snapshot, and it is possible to roll back to a previous snapshot. Transactions also create multiple snapshots in a single metadata commit, some of which are never the table's current snapshot.

To fix these problems, the metadata should keep a log of changes to the current-snapshot reference. Time-travel queries would then be able to go back to the correct current snapshot, skipping the ones that were never current and still using a snapshot that was later rolled back.

Implement ReplaceFiles action

ReplaceFiles is the basis for operations like compaction or changing the file format: replace a set of files with new files that contain the same data rewritten.

Split files when building scan tasks

When building a scan, the TableScan API can plan the files to read (planFiles) or group the files into combined splits (planTasks). Split planning should also split files at the target split size before bin packing to create the final splits.
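
For context, here is a sketch of the scan planning API referenced above; class and method names follow the current TableScan API, but treat exact signatures as approximate, and the filter is just an example.

import com.netflix.iceberg.CombinedScanTask;
import com.netflix.iceberg.Table;

import static com.netflix.iceberg.expressions.Expressions.equal;

public class PlanScan {
  public static void plan(Table table) {
    // planTasks() bin-packs data files into combined splits; this issue proposes also
    // splitting large files at a target split size before bin packing.
    for (CombinedScanTask task : table.newScan()
                                      .filter(equal("day", "2018-08-05"))
                                      .planTasks()) {
      System.out.println(task.files().size() + " file splits in this task");
    }
  }
}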

Reading Deltas via Metadata

(This is very similar in philosophy to #47, and it would be good to read that before this; the same caveats apply.)

A job that wants to read only new data since the last time it ran must understand what the high-water mark was and read new data from its source based on a predicate. For instance:

val newData = spark
  .read
  /* ... */
  .filter($"day" === "2018-08-05")

However, we can base our reads on the accumulation of snapshots over time: if our snapshots are S1, S2, S3, and S4, and the last snapshot we processed was S1, we can read the new data from S2, S3, and S4 and skip the filtering completely. This would essentially make our high-water mark metadata-based, rather than data-based.

This can be achieved using the low-level Iceberg API, but not using the Spark API; adding it to the Spark API would be a great addition to the project.

Here's a sketch of how this API might look:

spark
  .read
  .format("iceberg")
  .snapshots(2, 3, 4)
  .load(path)

Note: Specifying the list of snapshots would also let this API support other use cases, such as parallel-processing of snapshots, etc.

Add support for relative paths in table metadata

Metadata currently tracks all paths using the full path. This is costly when not using compression (like the current metadata files) and doesn't allow a table to be easily relocated. The format should support relative paths with respect to the table root, or full paths.

Add external schema mappings for data without field IDs

Files written by Iceberg writers contain Iceberg field IDs that are used for column projection. Iceberg doesn't currently support tracking data files that were written by other systems and added to Iceberg tables with the API because the field IDs are missing. To support files written by non-Iceberg writers, Iceberg could support a table-level mapping from a source schema to Iceberg IDs.

For example, a table with 2 columns might have an Avro schema mapping like this one, encoded as JSON in table properties:

[ {"field-id": 1, "names": ["id"]},
  {"field-id": 2, "names": ["data"]} ]

When reading an Avro file, the read schema would be produced using the file's schema and the field IDs from the mapping. The names in each field mapping are a list, to handle aliasing.

com.netflix.iceberg.pig.SchemaUtil does not convert complex schemas in maps and lists to Pig Schema

I was trying to reproduce some of the potential issues I highlighted in the PigStorage pull request when I hit a possible bug in SchemaUtil.

  1. In com.netflix.iceberg.pig.SchemaUtil#convertComplex, if the input is a list or a map, we call convert(Type type) for the nested elements, which does not seem to take into account whether its input is a complex schema. It seems that convert(NestedField) is much closer to what we want, isn't it? I tried to make the changes, but I'm not sure how to get a NestedField out of a list or a map value type. The methods are there, but they seem to expect an "id" [com.netflix.iceberg.types.Types.MapType#field(int id)].

Here's a test case which shows that the Pig schema does not contain the nested struct:

@Test
public void testTupleInMap() throws IOException {
  Schema icebergSchema = new Schema(
      optional(
          1, "nested_list",
          MapType.ofOptional(
              2, 3,
              StringType.get(),
              ListType.ofOptional(
                  4, StructType.of(
                      required(5, "id", LongType.get()),
                      optional(6, "data", StringType.get()))))));

  ResourceSchema pigSchema = SchemaUtil.convert(icebergSchema);
  // The output should contain a nested struct within a list within a map, I think.
  Assert.assertEquals("nested_list:[[]]", pigSchema.toString());
}

ORC: validate values are not null for required columns

The SparkOrcWriter defines converters that could easily throw exceptions when a required column has a null value, like this:

  static class RequiredIntConverter implements Converter {
    public void addValue(int rowId, int column, SpecializedGetters data,
                         ColumnVector output) {
      if (data.isNullAt(column)) {
        throw new NullPointerException("Column " + column + " is required, but null in row: " + rowId);
      } else {
        output.isNull[rowId] = false;
        ((LongColumnVector) output).vector[rowId] = data.getInt(column);
      }
    }
  }

Lazily merge manifests to cut down write volume

Commit e0a5f50 added a target size for manifest files to keep write volume low in merge appends, but manifests below the target size are still rewritten. We can improve on this by merging more lazily: either require a configurable minimum number of manifests before merging, or merge only when the combined manifest will be approximately the target size. The second option is difficult because the decision needs to be made before writing the merged file (or merge, check size, and delete). I think it makes sense to go with the first option: add a minimum number of manifests to merge.

Move snapshot out of metadata to a manifest file of data manifests

Metadata files are large when the list of manifests grows. This could be solved by using a separate manifest file that tracks the manifests of data files.

A secondary benefit to this approach is that the partition data in the manifest files would show up as min/max stats in the snapshot manifest, allowing Iceberg to eliminate whole manifest files when planning a scan.

Support replace table

CTAS is often used to replace a result set. If the replacement CTAS fails, users want the previous version of the table to exist. Replacing a table as a single operation would replace all metadata. Like CTAS, this will require a soft-replace that returns a table configured with the new CTAS schema (fresh ids) and partitioning, but that is not committed until the write operation is committed.

Upgrade to Spark 2.4.0

Spark 2.4.0 was just released. We should upgrade iceberg-spark's Spark dependency to latest.

Share A Single File System Instance In HadoopTableOperations

We shouldn't use Util.getFS every time we want a FileSystem object in HadoopTableOperations. An example of where this breaks down is if file system object caching is disabled (set fs.<scheme>.impl.disable.cache). When such caching is disabled, a long string of calls on HadoopTableOperations in quick succession will create and GC FileSystem objects very quickly, leading to degraded JVM behavior.

An example of where one would want to disable file system caching is so that different instances of HadoopTableOperations can be set up with FileSystem objects that are configured with different Configuration objects - for example, configuring different Hadoop properties when invoking the data source in various iterations, given that we move forward with #91. Unfortunately, Hadoop caches file system objects by URI, not Configuration, so if one wants different HadoopTableOperations instances to load differently configured file system objects with the same URI, they will instead receive the same FileSystem object back every time, unless they disable FileSystem caching.
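
A minimal sketch of the idea, independent of Iceberg's code: the class below is illustrative (it is not the actual HadoopTableOperations) and only uses the plain Hadoop API.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reuse one FileSystem per table-operations instance instead of resolving a new one
// on every call, so behavior does not degrade when the Hadoop FileSystem cache is disabled.
class SharedFileSystem {
  private final Configuration conf;
  private volatile FileSystem fs;

  SharedFileSystem(Configuration conf) {
    this.conf = conf;
  }

  FileSystem fs(Path path) throws IOException {
    if (fs == null) {
      synchronized (this) {
        if (fs == null) {
          fs = path.getFileSystem(conf);
        }
      }
    }
    return fs;
  }
}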

Support Custom Hadoop Properties In The Data Source

The Iceberg data source just uses the Spark Session's global Hadoop configuration when constructing File System objects in HadoopTableOperations, Reader, and Writer. We propose support for specifying additional reader and writer-specific options to the Hadoop configuration. The data source can parse out options with the prefix iceberg.spark.hadoop.* and apply those to the Hadoop configuration that is sent to all uses of the Hadoop FileSystem API throughout the Spark DataSource.
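
A sketch of the proposed behavior is below; the prefix comes from this proposal, while the class and method are illustrative.

import java.util.Map;

import org.apache.hadoop.conf.Configuration;

class DataSourceHadoopOptions {
  private static final String PREFIX = "iceberg.spark.hadoop.";

  // Copy options carrying the proposed prefix into the Hadoop Configuration used by the
  // reader and writer, leaving the Spark session's global configuration untouched.
  static Configuration apply(Map<String, String> options, Configuration base) {
    Configuration conf = new Configuration(base);
    for (Map.Entry<String, String> entry : options.entrySet()) {
      if (entry.getKey().startsWith(PREFIX)) {
        conf.set(entry.getKey().substring(PREFIX.length()), entry.getValue());
      }
    }
    return conf;
  }
}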

Promoting Idempotency Through Metadata

Currently, implementing idempotent jobs over Iceberg is done via data-based predicates. For instance, if a job run is presumed to have written the data for 2018-08-05, you will write something like:

df.write(t).overwrite($"day" === "2018-08-05")

However, this may be:

  1. Slow - the filter needs to be pushed down for partitions on rewrite, and the boundary values must be calculated for each write (which might be slow if done on demand)
  2. Incomplete (false negatives) - the predicate doesn't cover everything it should, e.g. late-arriving data included in the previous output
  3. Overzealous (false positives) - the predicate covers data we don't want to overwrite, e.g. data that arrived before or after this job's run for the same day
  4. Mixes domain knowledge into the operations - each job needs to know which field determines what it wrote and what value it wrote, and preserve that value somewhere

To promote more complete idempotency, we can use the metadata Iceberg provides to revert previous snapshots based on their metadata. If, for instance, Partition P1 writes file F1, and we want to re-run the job that wrote it, we can write P2 which deletes F1 and writes F2 with the new data, effectively reverting P1.

The benefits from this would be:

  1. Snapshot isolation is preserved
  2. No duplicate data can be read (F1+F2)
  3. No incomplete data can be read (neither F1 nor F2)
  4. We can revert a snapshot, regardless of how far back it happened

Note: This would only be usable in cases where we are only appending new data in snapshots, so cases where we also regularly compact or coalesce files may not be supported.

To achieve this, we could:

  1. Use the com.netflix.iceberg.RewriteFiles operation, but this would keep us at a very low level, close to Iceberg, and force us to manually manage the files ourselves.
  2. Use the com.netflix.iceberg.Rollback operation, but this only rolls back the previous snapshot, which is something we don't want to be tied to.
  3. Use the com.netflix.iceberg.DeleteFiles operation, but this would create a new snapshot, causing us to either read duplicate or incomplete data.

What could be great is an API that lets us have some sort of transaction over both high-level (Spark) and low-level (Iceberg) APIs, so that we could delete the files written in a snapshot and write data using Spark, only then committing the transaction and creating a new snapshot.

@rdblue I would love to hear what you think this kind of API would look like.
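
Purely for illustration, one possible shape for such an API is sketched below; the names are hypothetical and nothing here exists in Iceberg today.

// Hypothetical sketch only: stage a revert of an earlier snapshot together with a new
// write, and commit both as one snapshot so readers never see duplicate or partial data.
public interface RevertAndReplace {
  // Stage removal of every data file that the given snapshot added.
  RevertAndReplace revertSnapshot(long snapshotId);

  // Stage the data files produced by the replacement job (e.g. a Spark write).
  RevertAndReplace addFiles(Iterable<CharSequence> newFilePaths);

  // Commit both changes atomically as a single new snapshot.
  void commit();
}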

Support create from snapshot for CTAS

CTAS writes data and creates a table in the same atomic operation. This would require a soft-create that handles tasks like reassigning column ids, but that doesn't commit metadata. This would return an in-memory table that is actually created when data is committed.
