Giter Site home page Giter Site logo

hive-udaf-maxrow's Introduction

hive-udaf-maxrow

hive-udaf-maxrow is a simple user-defined aggregate function (UDAF) for Hive.

The maxrow() aggregate function is similar to the built-in max() function, but it allows you to refer to additional columns in the maximal row.

Example

For example, given the following data in a Hive table:

idtssomedata
12data-1,2
13data-1,3
14data-1,4
25data-2,5
23data-2,3
24data-2,4
36data-3,6
31data-3,1
34data-3,4

You can query this table using the maxrow() function:

hive> ADD JAR hive-udaf-maxrow.jar;
hive> CREATE TEMPORARY FUNCTION maxrow AS 'com.scribd.hive.udaf.GenericUDAFMaxRow';
hive> SELECT id, maxrow(ts, somedata) FROM sometable GROUP BY id;
idmaxrow
1{"col0":4,"col1":"data-1,4"}
2{"col0":5,"col1":"data-2,5"}
3{"col0":6,"col1":"data-3,6"}

While maxrow() looks only at its first parameter ("ts" in this case) to compute the maximum value, it carries along any additional values ("somedata" in this case).

Since maxrow() returns a "struct" value (see below), you can parse the result with Hive's built-in "dot" notation. For example:

hive> SELECT id, m.col0 as ts, m.col1 as somedata FROM (
          SELECT id, maxrow(ts, somedata) as m FROM sometable GROUP BY id
      ) s;
idtssomedata
14data-1,4
25data-2,5
36data-3,6

Limitations

As can be seen from the example above, there are a couple of limitations due to how Hive UDAFs work:

  • A UDAF can only output a value for a single column. Therefore, maxrow() returns a complex-valued "struct" object.
  • Hive does not provide the UDAF with the name of the columns that are being passed as input to the UDAF. Therefore, maxrow() generates simple names such as "col0", "col1", etc.

Building

To build hive-udaf-maxrow, you need to specify the location of your Hadoop and Hive jar files using the HADOOP_HOME and HIVE_HOME environment variables. The build classpath will include all of the jar files in these directories and and their lib/ subdirectories. For example:

> HADOOP_HOME=/path/to/hadoop HIVE_HOME=/path/to/hive ant

A successful build will create the dist/hive-udaf-maxrow.jar file. You can add this jar file to your Hive session using the ADD JAR command shown above.

hive-udaf-maxrow's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

hive-udaf-maxrow's Issues

GenericUDAFCollectSetArray

Hi, all.
I'm writting a GenericUDAFCollectSetArray, which should worked like:

id ts somedata
1 2 data-1,2
1 3 data-1,3
1 4 data-1,4
2 5 data-2,5
2 3 data-2,3
2 4 data-2,4
3 6 data-3,6
3 1 data-3,1
3 4 data-3,4

SELECT id, collectSetArray(ts, somedata) FROM sometable GROUP BY id;
result:

id col1
1 [{"_col0": "2", "_col1": "data-1,2"}, {"_col0": "3", "_col1": "data-1,3"},
{"_col0": "4", "_col1": "data-1,4"}]
2 [{"_col0": "3", "_col1": "data-2,3"}, {"_col0": "4", "_col1": "data-2,4"},
{"_col0": "5", "_col1": "data-2,5"}]
3 [{"_col0": "1", "_col1": "data-3,1"}, {"_col0": "4", "_col1": "data-3,4"},
{"_col0": "6", "_col1": "data-3,6"}]

And I'm stuck in init() return and merge().
The below would be my own code which imitate
GenericUDAFMaxRow and GenericUDAFCollectSet.

Thanks for your patience for the long code.
And any hint would be helpful.

public class GenericUDAFCollectSet extends AbstractGenericUDAFResolver {
public GenericUDAFCollectSet() {
}

@Override
public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
        throws SemanticException {

    if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
        throw new UDFArgumentTypeException(0,
                "Only primitive type arguments are accepted but "
                        + parameters[0].getTypeName() + " was passed as parameter 1.");
    }

    return new GenericUDAFMkSetEvaluator();
}

public static class GenericUDAFMkSetEvaluator extends GenericUDAFEvaluator {


    private StandardListObjectInspector internalMergeOI;
    private PrimitiveObjectInspector internalMergeElementOI;

    private ObjectInspector[] inputOIs;
    private ObjectInspector[] outputOIs;
    private ObjectInspector structOI;
    private StandardListObjectInspector loi;

    public ObjectInspector init(Mode m, ObjectInspector[] parameters)
            throws HiveException {
        super.init(m, parameters);

        System.out.println("init() mode: " + m);

        int paramsLength = parameters.length;

        if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
            assert(parameters instanceof Object[]);

            inputOIs = parameters;
        } else {
            assert(paramsLength == 1);

            internalMergeOI = (StandardListObjectInspector) parameters[0];

            internalMergeElementOI = (PrimitiveObjectInspector) internalMergeOI.getListElementObjectInspector();

            System.out.println("internalMergeOI: " + internalMergeOI.getTypeName());
            System.out.println("internalMergeElementOI: " + internalMergeElementOI.getTypeName());
        }

        outputOIs = new ObjectInspector[paramsLength];

        List<String> fieldNames = new ArrayList<String>(paramsLength);
        List<ObjectInspector> fieldOIs = Arrays.asList(outputOIs);

        for (int i = 0; i < paramsLength; i++) {
            fieldNames.add("_col" + i);
            outputOIs[i] = ObjectInspectorUtils.getStandardObjectInspector(parameters[i]);
        }

        structOI = ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);

        loi = ObjectInspectorFactory.getStandardListObjectInspector(structOI);

        System.out.println("return from init() ");
        return loi;
    }

    static class MkArrayAggregationBuffer implements AggregationBuffer {
        // What you see is what you get.
        List<Object[]> container;
    }

    @Override
    public void reset(AggregationBuffer agg) throws HiveException {
        ((MkArrayAggregationBuffer) agg).container = new ArrayList<Object[]>();
    }

    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
        MkArrayAggregationBuffer ret = new MkArrayAggregationBuffer();
        reset(ret);
        return ret;
    }

    //mapside
    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters)
            throws HiveException {
        System.out.println("iterate() p.length: " + parameters.length);

        if (parameters.length > 0) {
            MkArrayAggregationBuffer listAgg = (MkArrayAggregationBuffer) agg;

            putIntoSet(parameters, listAgg);
        }

        printAgg(agg);
    }

    //mapside
    @Override
    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
        System.out.println("terminatePartial() ");
        printAgg(agg);

        MkArrayAggregationBuffer listAgg = (MkArrayAggregationBuffer) agg;
        // However, the log said it seems to be ArrayList<Object>.
        ArrayList<Object[]> ret = new ArrayList<Object[]>(listAgg.container.size());
        ret.addAll(listAgg.container);
        return ret;
    }

    @Override
    public void merge(AggregationBuffer agg, Object partial)
            throws HiveException {
        System.out.println("merge() ");

        MkArrayAggregationBuffer listAgg = (MkArrayAggregationBuffer) agg;
        ArrayList<Object[]> partialResult = (ArrayList<Object[]>) internalMergeOI.getList(partial);
        for(Object[] i : partialResult) {
            putIntoSet(i, listAgg);
        }

        printAgg(agg);
    }

    @Override
    public Object terminate(AggregationBuffer agg) throws HiveException {
        System.out.println("terminate() ");
        printAgg(agg);

        MkArrayAggregationBuffer listAgg = (MkArrayAggregationBuffer) agg;
        ArrayList<Object[]> ret = new ArrayList<Object[]>(listAgg.container.size());
        ret.addAll(listAgg.container);
        return ret;
    }

    private void putIntoSet(Object[] p, MkArrayAggregationBuffer myagg) {
        System.out.println("putIntoSet() ");

        Object[] objects = new Object[p.length];

        for (int i = 0; i < p.length; i++) {
            objects[i] = ObjectInspectorUtils.copyToStandardObject(p[i], this.inputOIs[i]);
        }

        myagg.container.add(objects);
    }
}

}

Clarity in the behaviour

Hi,

What will be the expected behavior of maxrow(), when the key value is duplicated.?
So, in your example , if i add one more record like.
1 4 data-1,5
What will be the result for SELECT id, maxrow(ts, somedata) FROM sometable GROUP BY id;?
whether it's 1 4 data-1,5 or 1 4 data-1,4

Thanks,

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.