
datasketches-hive's Introduction


DataSketches Java UDF/UDAF Adaptors for Apache Hive

Please visit the main DataSketches website for more information.

If you are interested in making contributions to this site please see our Community page for how to contact us.


Hadoop Hive UDFs/UDAFs

See relevant sections under the different sketch types in Java Core Documentation.

Build Instructions

NOTE: This component accesses resource files for testing. As a result, every directory element of the full absolute path to the target installation directory must qualify as a Java identifier: no spaces or other non-Java-identifier characters in any path element. This is required by the Java Specification in order to ensure location-independent access to resources: see Oracle Location-Independent Access to Resources

JDK8 is required to compile

This DataSketches component is pure Java and you must compile using JDK 8.

Recommended Build Tool

This DataSketches component is structured as a Maven project and Maven is the recommended Build Tool.

There are two types of tests: normal unit tests and tests run by the strict profile.

To run normal unit tests:

$ mvn clean test

To run the strict profile tests:

$ mvn clean test -P strict

To install jars built from the downloaded source:

$ mvn clean install -DskipTests=true

This will create the following jars:

  • datasketches-hive-X.Y.Z-incubating.jar The compiled main class files.
  • datasketches-hive-X.Y.Z-incubating-tests.jar The compiled test class files.
  • datasketches-hive-X.Y.Z-incubating-sources.jar The main source files.
  • datasketches-hive-X.Y.Z-incubating-test-sources.jar The test source files.
  • datasketches-hive-X.Y.Z-incubating-javadoc.jar The compressed Javadocs.

Dependencies

Run-time

This has the following top-level dependencies:

  • org.apache.datasketches : datasketches-java
  • org.apache.hive : hive-exec
  • org.apache.hadoop : hadoop-common
  • org.apache.hadoop : hadoop-mapreduce-client-core

Testing

See the pom.xml file for test dependencies.

datasketches-hive's People

Contributors

alexandersaydakov, dependabot[bot], jmalkin, koke, leerho, okumin, packet23, will-lauer


datasketches-hive's Issues

Use new Union interface in MergeSketchUDAF

Union now supports a new interface, allowing MergeSketchUDAF to do all of its computations with a Union rather than maintaining both an UpdateSketch and a Union. This should simplify the code even though it won't reduce the memory footprint (the union and update sketches are currently lazily constructed, and only one ever exists at a time).

ClassCast exception from dataToSketch

When running a query in Hive that includes ... estimate_sketch(data_to_sketch(column)) ..., I get a ClassCastException with the following stack trace:

Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.lazy.LazyString cannot be cast to org.apache.hadoop.io.Text
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveJavaObject(WritableStringObjectInspector.java:46)
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getString(PrimitiveObjectInspectorUtils.java:843)
        at com.yahoo.sketches.hive.theta.DataToSketchUDAF$DataToSketchEvaluator.updateData(DataToSketchUDAF.java:459)
        at com.yahoo.sketches.hive.theta.DataToSketchUDAF$DataToSketchEvaluator.iterate(DataToSketchUDAF.java:241)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:185)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:612)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:787)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:693)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:761)

This occurs with Hive 1.2.1.5.1601270058

Consider hashing STRING columns as UTF-8 instead of UTF-16 in HLL

Currently the HLL implementation hashes STRING columns by converting the value to java.lang.String and then hashing the underlying UTF-16 char array:
https://github.com/apache/incubator-datasketches-hive/blob/master/src/main/java/org/apache/datasketches/hive/hll/SketchState.java#L66

This looks like an optimization at first, since the UTF-16-to-UTF-8 conversion can be skipped, but as far as I know Hive actually stores STRINGs as hadoop.io.Text, which holds the string as a UTF-8 encoded byte array:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/Text.java#L214

This came up during work on Apache Impala's HLL implementation, which should be compatible with Apache Hive's existing DataSketches wrapper. Since Impala stores strings as byte arrays, converting the values to UTF-16 adds overhead and can also cause problems for byte arrays that are not valid UTF-16.
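To see why the two hashing strategies can never agree, here is a standalone sketch (plain JDK, not code from this repository) comparing a string's UTF-8 byte sequence with the raw bytes of its UTF-16 char array. For any non-ASCII string the two differ in both length and content, so a sketch fed one encoding cannot match a sketch fed the other.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Flatten a String's UTF-16 char array into raw big-endian bytes,
    // approximating what hashing the char array effectively operates on.
    static byte[] utf16Bytes(String s) {
        char[] chars = s.toCharArray();
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) (chars[i] >>> 8); // high byte
            out[2 * i + 1] = (byte) chars[i];     // low byte
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "café"; // 4 chars; 'é' needs 2 bytes in UTF-8
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = utf16Bytes(s);
        // Different lengths, hence necessarily different hash inputs.
        System.out.println(utf8.length);  // 5
        System.out.println(utf16.length); // 8
    }
}
```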

Set Operations don't support custom seeds

While the Exclude set operation supports passing in a custom seed, Union and Intersection UDFs currently do not, limiting them to being used only with sketches created with the default seed. Seed should be allowed as an optional argument, with the default seed assumed if none is specified.

EstimateSketchUDF isn't processing BINARY fields correctly

I've been scratching my head for a while with this one, but I was writing some unit tests where I created a theta sketch with a single item, and the estimate function was returning an estimate of minus 800M.

This seems easily reproducible for me (using Hive version 1.1.0-cdh5.16.1):

add jar /path/to/datasketches-memory-1.2.0-incubating.jar;
add jar /path/to/datasketches-java-1.2.0-incubating.jar;
add jar /path/to/datasketches-hive-1.0.0-incubating.jar;

create temporary function data2sketch as 'org.apache.datasketches.hive.theta.DataToSketchUDAF';
create temporary function estimate as 'org.apache.datasketches.hive.theta.EstimateSketchUDF';

create temporary table theta_input as select 1 as id;

create temporary table sketch_intermediate as select data2sketch(id) as sketch from theta_input;

select estimate(sketch) as estimate_from_table from sketch_intermediate;

-- Output:
-- +----------------------+--+
-- | estimate_from_table  |
-- +----------------------+--+
-- | -8.80936683E8        |
-- +----------------------+--+

with intermediate as (
    select data2sketch(id) as sketch from theta_input
)
select estimate(sketch) as estimate_from_table from intermediate;

-- Output:
-- +----------------------+--+
-- | estimate_from_table  |
-- +----------------------+--+
-- | 1.0                  |
-- +----------------------+--+

For some reason there were extra bytes in the BytesWritable storage, which broke the calculations. What was supposed to be a 16-byte SingleItemSketch had an extra 8 zero-filled bytes appended, making DataSketches interpret it as a completely different structure.

A unit test of what I was seeing coming from Hive:

@Test
public void evaluateRespectsByteLength() {
    byte[] inputBytes = new byte[]{
            (byte) 0x01, (byte) 0x03, (byte) 0x03, (byte) 0x00,
            (byte) 0x00, (byte) 0x3a, (byte) 0xcc, (byte) 0x93,
            (byte) 0x15, (byte) 0xf9, (byte) 0x7d, (byte) 0xcb,
            (byte) 0xbd, (byte) 0x86, (byte) 0xa1, (byte) 0x05,
            (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00,
            (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00
    };
    BytesWritable input = new BytesWritable(inputBytes, 16);
    EstimateSketchUDF estimate = new EstimateSketchUDF();
    Double testResult = estimate.evaluate(input);
    assertEquals(1.0, testResult, 0.0);
}

Adding this wrapper around EstimateSketchUDF fixes the problem:

public class EstimateSketchUDF extends org.apache.datasketches.hive.theta.EstimateSketchUDF {

  @Override
  public Double evaluate(BytesWritable binarySketch) {
    if (binarySketch == null) {
      return 0.0;
    }

    byte[] bytes = new byte[binarySketch.getLength()];
    System.arraycopy(binarySketch.getBytes(), 0, bytes, 0, binarySketch.getLength());
    BytesWritable fixedSketch = new BytesWritable(bytes);

    return super.evaluate(fixedSketch);
  }
}
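The underlying issue is that BytesWritable's backing array can be larger than its logical length (getLength()), so code that consumes the raw buffer without honoring the length sees trailing padding. A minimal stdlib sketch of the trimming idea (simulating the buffer with a plain byte array, no Hadoop classes involved):

```java
import java.util.Arrays;

public class TrimDemo {
    public static void main(String[] args) {
        // Simulated backing buffer: 16 valid bytes followed by 8 bytes of
        // zero padding, as BytesWritable may over-allocate its internal array.
        byte[] backing = new byte[24];
        Arrays.fill(backing, 0, 16, (byte) 0x7f); // the "real" sketch bytes
        int validLength = 16; // what BytesWritable.getLength() would report

        // Copy only the valid prefix, which is what the wrapper above
        // accomplishes with System.arraycopy.
        byte[] trimmed = Arrays.copyOf(backing, validLength);

        System.out.println(trimmed.length); // 16
    }
}
```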

The error of intersect reaches 41%

On the TPC-H dataset I used theta sketches to compute intersections. The error of some results reaches 41%, but the docs say the default size (4096) gives about 3% error.

spark.sql("create temporary function data2sketch as 'org.apache.datasketches.hive.theta.DataToSketchUDAF'")
spark.sql("create temporary function intersect as 'org.apache.datasketches.hive.theta.IntersectSketchUDF'")
spark.sql("create temporary function estimate as 'org.apache.datasketches.hive.theta.EstimateSketchUDF'")

scala> lineitem.select("l_suppkey").intersect(order.select("o_orderkey")).count
res17: Long = 250000
but the theta sketch result is 145593, an error of 0.41

scala> customer.select("c_custkey").intersect(lineitem.select("l_orderkey")).count
res18: Long = 3750000
but the theta sketch result is 4404198, an error of 0.14
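This kind of amplification is expected for theta intersections: the absolute error is driven by the size of the union, so the relative error on the intersection scales roughly by the ratio |A∪B| / |A∩B|, not by the per-sketch 3% bound alone. A back-of-envelope illustration follows; the union cardinality below is hypothetical for illustration, and only the ~3% bound and the 250,000 intersection count come from the report above.

```java
public class IntersectionErrorDemo {
    public static void main(String[] args) {
        double perSketchError = 0.03;      // ~3% relative error at default size 4096
        double intersection   = 250_000;   // exact intersection count reported above
        double union          = 3_500_000; // hypothetical union cardinality

        // The absolute error is on the order of perSketchError * union, so the
        // relative error on the intersection is amplified by union/intersection.
        double relativeError = perSketchError * (union / intersection);
        System.out.printf("%.2f%n", relativeError); // 0.42
    }
}
```

So with a union roughly 14x the intersection, an error near the reported 41% is consistent with the documented per-sketch accuracy rather than a bug.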
