
datasketches-hive's Introduction


DataSketches Java UDF/UDAF Adaptors for Apache Hive

Please visit the main DataSketches website for more information.

If you are interested in making contributions to this site please see our Community page for how to contact us.


Hadoop Hive UDFs/UDAFs

See relevant sections under the different sketch types in Java Core Documentation.

Build Instructions

NOTE: This component accesses resource files for testing. As a result, every directory element of the full absolute path to the target installation directory must qualify as a Java identifier: no spaces or other non-Java-identifier characters in any path element. This is required by the Java Specification in order to ensure location-independent access to resources: see Oracle Location-Independent Access to Resources

JDK8 is required to compile

This DataSketches component is pure Java and you must compile using JDK 8.

Recommended Build Tool

This DataSketches component is structured as a Maven project and Maven is the recommended Build Tool.

There are two types of tests: normal unit tests and tests run by the strict profile.

To run normal unit tests:

$ mvn clean test

To run the strict profile tests:

$ mvn clean test -P strict

To install jars built from the downloaded source:

$ mvn clean install -DskipTests=true

This will create the following jars:

  • datasketches-hive-X.Y.Z-incubating.jar The compiled main class files.
  • datasketches-hive-X.Y.Z-incubating-tests.jar The compiled test class files.
  • datasketches-hive-X.Y.Z-incubating-sources.jar The main source files.
  • datasketches-hive-X.Y.Z-incubating-test-sources.jar The test source files.
  • datasketches-hive-X.Y.Z-incubating-javadoc.jar The compressed Javadocs.

Dependencies

Run-time

This has the following top-level dependencies:

  • org.apache.datasketches : datasketches-java
  • org.apache.hive : hive-exec
  • org.apache.hadoop : hadoop-common
  • org.apache.hadoop : hadoop-mapreduce-client-core

Testing

See the pom.xml file for test dependencies.

datasketches-hive's People

Contributors

alexandersaydakov, dependabot[bot], jmalkin, koke, leerho, okumin, packet23, will-lauer


datasketches-hive's Issues

Use new Union interface in MergeSketchUDAF

Union now supports a new interface, allowing MergeSketchUDAF to do all of its computations with a Union rather than maintaining both an UpdateSketch and a Union. This should simplify the code even though it won't reduce the memory footprint (the union and update sketches are currently lazily constructed, and only one ever exists at a time).

ClassCast exception from dataToSketch

When running a query in Hive that includes ... estimate_sketch(data_to_sketch(column)) ..., I get a ClassCastException with the following stack trace:

Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazy.LazyString cannot be cast to org.apache.hadoop.io.Text
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveJavaObject(WritableStringObjectInspector.java:46)
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getString(PrimitiveObjectInspectorUtils.java:843)
        at com.yahoo.sketches.hive.theta.DataToSketchUDAF$DataToSketchEvaluator.updateData(DataToSketchUDAF.java:459)
        at com.yahoo.sketches.hive.theta.DataToSketchUDAF$DataToSketchEvaluator.iterate(DataToSketchUDAF.java:241)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:185)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:612)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:787)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:693)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:761)

This occurs with Hive 1.2.1.5.1601270058

Consider hashing STRING columns as UTF-8 instead of UTF-16 in HLL

Currently the HLL implementation hashes STRING columns by converting the value to java.lang.String and then hashing the underlying UTF-16 char array:
https://github.com/apache/incubator-datasketches-hive/blob/master/src/main/java/org/apache/datasketches/hive/hll/SketchState.java#L66

This looks like an optimization at first, since the UTF-16-to-UTF-8 conversion can be skipped, but as far as I know Hive actually stores STRINGs as hadoop.io.Text, which holds the string as a UTF-8 encoded byte array:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Text.java#L214

This came up during work on Apache Impala's HLL implementation, which should be compatible with Apache Hive's existing DataSketches wrapper. Since Impala stores strings as byte arrays, converting the values to UTF-16 adds overhead and can also cause problems for byte arrays that are not valid UTF-16.
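To see why the two hashing strategies can never agree, here is a standalone sketch (plain JDK, not code from this repository) comparing a string's UTF-8 byte sequence with the raw bytes of its UTF-16 char array. For any non-ASCII string the two differ in both length and content, so a sketch fed one encoding cannot match a sketch fed the other.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Flatten a String's UTF-16 char array into raw big-endian bytes,
    // approximating what hashing the char array effectively operates on.
    static byte[] utf16Bytes(String s) {
        char[] chars = s.toCharArray();
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) (chars[i] >>> 8); // high byte
            out[2 * i + 1] = (byte) chars[i];     // low byte
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "café"; // 4 chars; 'é' needs 2 bytes in UTF-8
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = utf16Bytes(s);
        // Different lengths, hence necessarily different hash inputs.
        System.out.println(utf8.length);  // 5
        System.out.println(utf16.length); // 8
    }
}
```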

Set Operations don't support custom seeds

While the Exclude set operation supports passing in a custom seed, Union and Intersection UDFs currently do not, limiting them to being used only with sketches created with the default seed. Seed should be allowed as an optional argument, with the default seed assumed if none is specified.

EstimateSketchUDF isn't processing BINARY fields correctly

I've been scratching my head for a while with this one, but I was writing some unit tests where I created a theta sketch with a single item, and the estimate function was returning an estimate of minus 800M.

This seems easily reproducible for me (using Hive version 1.1.0-cdh5.16.1):

add jar /path/to/datasketches-memory-1.2.0-incubating.jar;
add jar /path/to/datasketches-java-1.2.0-incubating.jar;
add jar /path/to/datasketches-hive-1.0.0-incubating.jar;

create temporary function data2sketch as 'org.apache.datasketches.hive.theta.DataToSketchUDAF';
create temporary function estimate as 'org.apache.datasketches.hive.theta.EstimateSketchUDF';

create temporary table theta_input as select 1 as id;

create temporary table sketch_intermediate as select data2sketch(id) as sketch from theta_input;

select estimate(sketch) as estimate_from_table from sketch_intermediate;

-- Output:
-- +----------------------+--+
-- | estimate_from_table  |
-- +----------------------+--+
-- | -8.80936683E8        |
-- +----------------------+--+

with intermediate as (
    select data2sketch(id) as sketch from theta_input
)
select estimate(sketch) as estimate_from_table from intermediate;

-- Output:
-- +----------------------+--+
-- | estimate_from_table  |
-- +----------------------+--+
-- | 1.0                  |
-- +----------------------+--+

For some reason there were extra bytes in the BytesWritable storage, which broke the calculations. What was supposed to be a 16-byte SingleItemSketch had an extra 8 zero-filled bytes appended, making DataSketches interpret it as a completely different structure.

A unit test of what I was seeing coming from Hive:

@Test
public void evaluateRespectsByteLength() {
    byte[] inputBytes = new byte[]{
            (byte) 0x01, (byte) 0x03, (byte) 0x03, (byte) 0x00,
            (byte) 0x00, (byte) 0x3a, (byte) 0xcc, (byte) 0x93,
            (byte) 0x15, (byte) 0xf9, (byte) 0x7d, (byte) 0xcb,
            (byte) 0xbd, (byte) 0x86, (byte) 0xa1, (byte) 0x05,
            (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00,
            (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00
    };
    BytesWritable input = new BytesWritable(inputBytes, 16);
    EstimateSketchUDF estimate = new EstimateSketchUDF();
    Double testResult = estimate.evaluate(input);
    assertEquals(1.0, testResult, 0.0);
}

Adding this wrapper around EstimateSketchUDF fixes the problem:

public class EstimateSketchUDF extends org.apache.datasketches.hive.theta.EstimateSketchUDF {

  @Override
  public Double evaluate(BytesWritable binarySketch) {
    if (binarySketch == null) {
      return 0.0;
    }

    byte[] bytes = new byte[binarySketch.getLength()];
    System.arraycopy(binarySketch.getBytes(), 0, bytes, 0, binarySketch.getLength());
    BytesWritable fixedSketch = new BytesWritable(bytes);

    return super.evaluate(fixedSketch);
  }
}
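The underlying issue is that BytesWritable's backing array can be larger than its logical length (getLength()), so code that consumes the raw buffer without honoring the length sees trailing padding. A minimal stdlib sketch of the trimming idea (simulating the buffer with a plain byte array, no Hadoop classes involved):

```java
import java.util.Arrays;

public class TrimDemo {
    public static void main(String[] args) {
        // Simulated backing buffer: 16 valid bytes followed by 8 bytes of
        // zero padding, as BytesWritable may over-allocate its internal array.
        byte[] backing = new byte[24];
        Arrays.fill(backing, 0, 16, (byte) 0x7f); // the "real" sketch bytes
        int validLength = 16; // what BytesWritable.getLength() would report

        // Copy only the valid prefix, which is what the wrapper above
        // accomplishes with System.arraycopy.
        byte[] trimmed = Arrays.copyOf(backing, validLength);

        System.out.println(trimmed.length); // 16
    }
}
```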

The error of intersect reaches 41%

On the TPC-H dataset I used theta sketches to compute intersections. The error of some results reaches 41%, but the docs say the default size (4096) gives about 3% error.

spark.sql("create temporary function data2sketch as 'org.apache.datasketches.hive.theta.DataToSketchUDAF'")
spark.sql("create temporary function intersect as 'org.apache.datasketches.hive.theta.IntersectSketchUDF'")
spark.sql("create temporary function estimate as 'org.apache.datasketches.hive.theta.EstimateSketchUDF'")

scala> lineitem.select("l_suppkey").intersect(order.select("o_orderkey")).count
res17: Long = 250000
but the theta sketch result is 145593, an error of 0.41

scala> customer.select("c_custkey").intersect(lineitem.select("l_orderkey")).count
res18: Long = 3750000
but the theta sketch result is 4404198, an error of 0.14
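This kind of amplification is expected for theta intersections: the absolute error is driven by the size of the union, so the relative error on the intersection scales roughly by the ratio |A∪B| / |A∩B|, not by the per-sketch 3% bound alone. A back-of-envelope illustration follows; the union cardinality below is hypothetical for illustration, and only the ~3% bound and the 250,000 intersection count come from the report above.

```java
public class IntersectionErrorDemo {
    public static void main(String[] args) {
        double perSketchError = 0.03;      // ~3% relative error at default size 4096
        double intersection   = 250_000;   // exact intersection count reported above
        double union          = 3_500_000; // hypothetical union cardinality

        // The absolute error is on the order of perSketchError * union, so the
        // relative error on the intersection is amplified by union/intersection.
        double relativeError = perSketchError * (union / intersection);
        System.out.printf("%.2f%n", relativeError); // 0.42
    }
}
```

So with a union roughly 14x the intersection, an error near the reported 41% is consistent with the documented per-sketch accuracy rather than a bug.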
