
A framework for writing performant user-defined functions (UDFs) that are portable across a variety of engines including Apache Spark, Apache Hive, and Presto.

License: BSD 2-Clause "Simplified" License

Java 89.65% Scala 10.35%

transport's Introduction


Transport UDFs

Transport is a framework for writing performant user-defined functions (UDFs) that are portable across a variety of engines, including Apache Spark, Apache Hive, and Trino. Transport UDFs are also capable of directly processing data stored in serialization formats such as Apache Avro. With Transport, developers only need to implement their UDF logic once using the Transport API. Transport then takes care of translating the UDF to a native UDF version targeted at each of the various engines or formats. Currently, Transport can generate engine artifacts for Spark, Hive, and Trino, and format artifacts for Avro. Further details on Transport can be found in this LinkedIn Engineering blog post.

Documentation

Example

This example shows how a portable UDF is written using the Transport APIs.

public class MapFromTwoArraysFunction extends StdUDF2<StdArray, StdArray, StdMap> implements TopLevelStdUDF {

  private StdType _mapType;

  @Override
  public List<String> getInputParameterSignatures() {
    return ImmutableList.of(
        "array(K)",
        "array(V)"
    );
  }

  @Override
  public String getOutputParameterSignature() {
    return "map(K,V)";
  }

  @Override
  public void init(StdFactory stdFactory) {
    super.init(stdFactory);
    // Create the output map type once from the output signature, so eval() can instantiate maps of it.
    _mapType = getStdFactory().createStdType(getOutputParameterSignature());
  }

  @Override
  public StdMap eval(StdArray a1, StdArray a2) {
    // The two arrays must have the same length to be zipped into a map.
    if (a1.size() != a2.size()) {
      return null;
    }
    StdMap map = getStdFactory().createMap(_mapType);
    for (int i = 0; i < a1.size(); i++) {
      map.put(a1.get(i), a2.get(i));
    }
    return map;
  }

  @Override
  public String getFunctionName() {
    return "map_from_two_arrays";
  }

  @Override
  public String getFunctionDescription() {
    return "A function to create a map out of two arrays";
  }
}

In the example above, StdMap and StdArray are interfaces that provide high-level map and array operations to their objects. Depending on the engine where this UDF is executed, those interfaces are implemented differently to deal with native data types used by that engine. getStdFactory() is a method used to create objects that conform to a given data type (such as a map whose keys are of the type of elements in the first array and values are of the type of elements in the second array). StdUDF2 is an abstract class to express a UDF that takes two parameters. It is parametrized by the UDF input types and the UDF output type. Please consult the Transport UDFs API for more details and examples.

How to Build

Clone the repository:

git clone https://github.com/linkedin/transport.git

Change directory to transport:

cd transport

Build:

./gradlew build

Please note that this project requires Java 8 to run. Either set JAVA_HOME to the home of an appropriate version and then use ./gradlew build as described above, or set the org.gradle.java.home gradle property to the Java home of an appropriate version as below:

./gradlew -Dorg.gradle.java.home=/path/to/java/home build
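Alternatively, the property can be set persistently in a gradle.properties file. A minimal sketch, assuming the file lives in the project root (or in ~/.gradle/); the JDK path shown is an example only:

# gradle.properties (in the project root or in ~/.gradle/); the path below is an example
org.gradle.java.home=/usr/lib/jvm/java-8-openjdk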

There are known issues with Java 1.8.0_291; please avoid that update. If you run into build issues, do a fresh checkout and build the project with a different Java 8 update.

How to Use

The project under the directory transportable-udfs-examples is a standalone Gradle project that shows how to set up a project that uses the Transport UDFs framework to write Transport UDFs. You can model your project after that standalone project. It implements a number of example UDFs to showcase different features and aspects of the API. In particular, check out three components:

  • The UDF example code, to familiarize yourself with the API and see how new UDFs are written.

  • The test code, to see how UDF tests are written against a unified testing API while the framework runs them on multiple platforms.

  • The root build.gradle file, to see how the Transport plugin is applied; the plugin generates Hive, Spark, and Trino UDFs from the Transport UDFs you define when you build the project (a minimal sketch of applying it is shown below).
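For illustration only, this is roughly what applying the plugin in a root build.gradle might look like. The plugin ID and version here are assumptions; treat transportable-udfs-examples/build.gradle as the authoritative reference.

plugins {
  id 'java'
  // Assumed plugin ID and placeholder version; copy the real coordinates from the example project's build.gradle.
  id 'com.linkedin.transport.plugin' version 'x.y.z'
}

To see the generated artifacts in action, build the example project: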

Change directory to transportable-udfs-examples:

cd transportable-udfs-examples

Build transportable-udfs-examples:

gradle build

You will notice that the build process generates some code; these are the platform-specific versions of the UDFs. Once the build succeeds, check out the output artifacts:

ls transportable-udfs-example-udfs/build/libs/

The output should look like this:

transportable-udfs-example-udfs-hive.jar
transportable-udfs-example-udfs-trino.jar
transportable-udfs-example-udfs-spark.jar
transportable-udfs-example-udfs.jar

That is it! Although the UDFs are implemented only once, building the project produces multiple jars. Each jar uses the native APIs and data models of its platform to implement the UDFs, so from an execution engine's perspective there is no extra data transformation needed for interoperability or portability; the classes appropriate to each engine are used directly.

To call those jars from your SQL engine (i.e., Hive, Spark, or Trino), follow the standard process for deploying UDF jars in each engine. For example, in Hive, you add the jar to the classpath using the ADD JAR statement and register the UDF using the CREATE FUNCTION statement. In Trino, the jar is deployed to the plugin directory. However, a small patch is required for the Trino engine to recognize the jar as a plugin, since the generated Trino UDFs implement the SqlScalarFunction API, which is currently not part of Trino's SPI architecture. You can find the patch here and apply it before deploying your UDF jar to the Trino engine.
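As an illustration of the Hive flow, a hedged sketch is below; the jar path and the wrapper class name registered in CREATE FUNCTION are assumptions (the actual class name comes from your generated artifact), so substitute the values from your own build output:

-- Hypothetical example: the jar path and wrapper class name below are placeholders.
ADD JAR /path/to/transportable-udfs-example-udfs-hive.jar;
CREATE TEMPORARY FUNCTION map_from_two_arrays
  AS 'com.linkedin.transport.examples.MapFromTwoArraysFunction';
SELECT map_from_two_arrays(array(1, 2), array('a', 'b'));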

Contributing

The project is under active development and we welcome contributions in different forms:

  • Contributing new general-purpose Transport UDFs (e.g., machine learning, spatial, or linear algebra UDFs).

  • Contributing new platform support.

  • Contributing a framework for new types of UDFs, e.g., aggregate UDFs (UDAFs), or table functions (UDTFs).

Please take a look at the Contribution Agreement.

Questions?

Please send any questions or discussion topics to [email protected]

License

BSD 2-CLAUSE LICENSE

Copyright 2018 LinkedIn Corporation.
All Rights Reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the
   distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

transport's People

Contributors

aastha25, akshayrai, curtiscwang, cwsteinbach, dihu-linkedin, hotsushi, ianvkoeppe, jasperll, jjoyce0510, khaitranq, kxu1026, ljfgem, lxynov, maluchari, phd3, raymondlam12, rzhang10, shardulm94, shipkit-org, srramach, surennihalani, wagnermarkd, weijiii, wmoustafa, yiqiangin


transport's Issues

Generated Hive UDF for working with struct values fails

I created a simple UDF and a unit test for it in the transport-udfs-examples module. The UDF increments the first integer field of a struct by 1. This is the UDF and the unit test:

public class StructElementIncrementByOneFunction extends StdUDF1<StdStruct, StdStruct> implements TopLevelStdUDF {

  @Override
  public List<String> getInputParameterSignatures() {
    return ImmutableList.of(
        "row(integer, integer)"
    );
  }

  @Override
  public String getOutputParameterSignature() {
    return "row(integer, integer)";
  }

  @Override
  public StdStruct eval(StdStruct myStruct) {
    int currVal = ((StdInteger) myStruct.getField(0)).get();
    myStruct.setField(0, getStdFactory().createInteger(currVal + 1));
    return myStruct;
  }

  @Override
  public String getFunctionName() {
    return "struct_element_increment_by_one";
  }

  @Override
  public String getFunctionDescription() {
    return "increment first element by one";
  }
}

A unit test for it:

public class TestStructElementIncrementByOneFunction extends AbstractStdUDFTest {

  @Override
  protected Map<Class<? extends TopLevelStdUDF>, List<Class<? extends StdUDF>>> getTopLevelStdUDFClassesAndImplementations() {
    return ImmutableMap.of(
        StructElementIncrementByOneFunction.class, ImmutableList.of(StructElementIncrementByOneFunction.class));
  }

  @Test
  public void testStructElementIncrementByOneFunction() {
    StdTester tester = getTester();
    tester.check(functionCall("struct_element_increment_by_one", row(1, 3)), row(2, 3), "row(integer,integer)");
    tester.check(functionCall("struct_element_increment_by_one", row(-2, 3)), row(-1, 3), "row(integer,integer)");
  }
}

The generated spark_2.11 and spark_2.12 UDFs run correctly, but the generated Hive artifact fails with the exception below (I am running the Gradle task ./gradlew hiveTask for testing).

java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.setStructFieldData(StandardStructObjectInspector.java:221)
at com.linkedin.transport.hive.data.HiveStruct.setField(HiveStruct.java:53)
at com.linkedin.transport.examples.StructElementIncrementByOneFunction.eval(StructElementIncrementByOneFunction.java:33)
at com.linkedin.transport.examples.StructElementIncrementByOneFunction.eval(StructElementIncrementByOneFunction.java:16)
at com.linkedin.transport.hive.StdUdfWrapper.evaluate(StdUdfWrapper.java:162)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:81)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:425)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:417)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:140)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1693)
at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:347)
at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685)
at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:455)
at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:447)
at com.linkedin.transport.test.hive.HiveTester.assertFunctionCall(HiveTester.java:123)
at com.linkedin.transport.test.spi.SqlStdTester.check(SqlStdTester.java:31)
at com.linkedin.transport.test.spi.StdTester.check(StdTester.java:38)
at com.linkedin.transport.examples.TestStructElementIncrementByOneFunction.testStructElementIncrementByOneFunction(TestStructElementIncrementByOneFunction.java:30)

I don't have any local changes. Am I doing something wrong or is this something to fix?

Unable to build project using the latest version of Gradle

Error:

spedamallu@spedamallu-mbp143 ~/PersonalProjects/transport/transportable-udfs-examples (master) $
| => gradle build
Starting a Gradle Daemon (subsequent builds will be faster)

> Configure project :transport
  Building version '0.0.62' (value loaded from 'version.properties' file).

FAILURE: Build failed with an exception.

* Where:
Build file '/Users/spedamallu/PersonalProjects/transport/build.gradle' line: 68

* What went wrong:
A problem occurred evaluating project ':transport:transportable-udfs-annotation-processor'.
> Failed to apply plugin 'org.gradle.java'.
   > Could not find method testCompile() for arguments [org.testng:testng:6.11] on object of type org.gradle.api.internal.artifacts.dsl.dependencies.DefaultDependencyHandler.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/7.0/userguide/command_line_interface.html#sec:command_line_warnings

BUILD FAILED in 22s

Version:

spedamallu@spedamallu-mbp143 ~/PersonalProjects/transport (master) $
| => gradle --version

------------------------------------------------------------
Gradle 7.0
------------------------------------------------------------

Build time:   2021-04-09 22:27:31 UTC
Revision:     d5661e3f0e07a8caff705f1badf79fb5df8022c4

Kotlin:       1.4.31
Groovy:       3.0.7
Ant:          Apache Ant(TM) version 1.10.9 compiled on September 27 2020
JVM:          1.8.0_281 (Oracle Corporation 25.281-b09)
OS:           Mac OS X 10.14.6 x86_64

spedamallu@spedamallu-mbp143 ~/PersonalProjects/transport (master) $

The issue seems to be caused by the removal of the deprecated compile/runtime configurations in Gradle 7 - https://github.com/gradle/gradle/blob/c4d09af3db073776225e705307d93284df8753c0/subprojects/docs/src/docs/userguide/migration/upgrading_version_6.adoc#removal-of-compile-and-runtime-configurations

Would like to know if this was attempted already and was discouraged for any reason. Otherwise, happy to create a PR.

Allow UDFs to declare whether they are deterministic

Currently we consider all UDFs non-deterministic. This prevents engines from pushing down predicates that could be resolved at query compile time. Presto/Hive/Spark allow UDFs to declare whether they are deterministic. We should provide something similar in our API.

Support for java.util.{Map, List} and Record container types

This ticket discusses support for container data types:

  • java.util.Map instead of StdMap
  • java.util.List instead of StdList

The benefits are self-evident: a cleaner, more familiar, and more usable API.
We had already discussed this internally, but I'm not sure why we did not pursue it further at the time. However, I'm filing this ticket to gauge whether we or anyone upstream would be interested in something like this.

I tried a prototype for Spark and the results look encouraging. I was able to support Map and List and Struct. For Struct, I rolled my own and provided a Record API. See: https://github.com/rdsr/t2/blob/master/core/src/main/scala/t2/api/traits.scala

Of course, it all depends on a few things:

  • Whether the Presto implementation will be able to support this
  • Whether deep, backward-incompatible API changes are warranted for a cleaner API

There are a lot of tradeoffs to be discussed.

Disable Jacoco if applied for platform tests

The Jacoco plugin is known to interfere with Presto and Hive because it instruments classes with additional fields. We should consider disabling Jacoco for platform tests, since code coverage is already measured by the generic tests.
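A possible approach, as a sketch only: turn off Jacoco instrumentation on the platform test tasks in Gradle. The task names below are assumptions about how the build wires them up, and this assumes those tasks are Test tasks with the Jacoco plugin applied.

tasks.matching { it.name in ['hiveTest', 'sparkTest', 'trinoTest'] }.configureEach {
    // Disable Jacoco instrumentation for platform test tasks (task names above are assumed examples).
    jacoco {
        enabled = false
    }
}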

Is there any plan to support UDAFs or UDTFs?

This is an amazing project! I have extended the plugin to support another computation platform, but it is a pity that UDAFs and UDTFs are not supported. Is there any plan for these? Thanks!

Autogenerate platform-specific wrapper classes and jars

Currently, when authoring a UDF, a user first extends one of the StdUDF_i abstract classes and defines the business logic of the UDF. However, after the core logic of the UDF is written, the user has to create n different modules accompanying the core UDF module (one for each platform) [Hive|Presto|Spark]. Through these modules, the user is exposed to platform-specific details such as the UDF API, testing methodology, set of dependencies, and packaging requirements. If we look at the contents of these modules, though, we see that the wrappers and Gradle files a user has to write consist mostly of boilerplate code. Yet writing these modules is where the user tends to spend most of their time. This makes them a great candidate for auto-generation.

This task will be split into a few subtasks:

  • An annotation processor to extract metadata about Transport UDFs defined by the user - #11
  • A Gradle task to generate wrappers classes from the metadata - #15
  • A Gradle task to compile generated wrappers and package them into JARs - #18
  • A Gradle plugin to automatically apply Transport related gradle tasks and configurations - #17

Type safe API for Transport

This ticket makes a case for a more type-safe API for Transport. IMO, our API today is lacking in two ways:

  1. The container API - Struct, Map, and Array - is not parameterized. This means that consuming elements from these containers requires a typecast. E.g., taking an int out of a map means taking a StdData out and then casting it to a StdInteger.
  2. The StdUDF API is parameterized, but the parameters must extend StdData. This is limiting because:
    1. Supporting #7 would require typecasting
    2. It would make it impossible to support #6

I think we can do better. Though Presto may have some unknowns, I was able to achieve some success with a Spark prototype. Below I give an idea of my approach.

Containers API
As described in #7, I chose java.util.{List, Map} for the List and Map types. For Struct, I defined a Record type, similar in spirit to Avro's GenericRecord. The key point is that all these container APIs are parameterized, as shown below:

trait Schema {
  def schema: DataType
}

trait IndexedRecord extends Schema {
  def put[V](i: Int, v: V): Unit

  def get[V](i: Int): V
}

trait GenericRecord extends IndexedRecord {
  def put[V](key: String, v: V): Unit

  def get[V](key: String): V
}

abstract class GenericList[A] extends util.AbstractList[A] with Schema

abstract class GenericMap[K, V] extends util.AbstractMap[K, V] with Schema

So to get a field out of a record, we'd do

final GenericRecord r = ... 
final List<Integer> f = r.get("A");

This does not involve any typecasting and is type safe. Similar examples can be given for other container types.

UDF API
Similarly, for the UDF API, we can have generic parameters which need not extend StdData. Below is the API I used in my prototype:

trait Fn0[F] extends UDF0[F] with Fn

trait Fn1[T, F] extends UDF1[T, F] with Fn

trait Fn2[T1, T2, F] extends UDF2[T1, T2, F] with Fn

trait Fn3[T1, T2, T3, F] extends UDF3[T1, T2, T3, F] with Fn

trait Fn4[T1, T2, T3, T4, F] extends UDF4[T1, T2, T3, T4, F] with Fn

trait Fn5[T1, T2, T3, T4, T5, F] extends UDF5[T1, T2, T3, T4, T5, F] with Fn

trait Fn6[T1, T2, T3, T4, T5, T6, F] extends UDF6[T1, T2, T3, T4, T5, T6, F] with Fn

This would help us implement #7 cleanly and in a type-safe manner, and would make #6 possible.

Passing an array of nulls results in incorrect StdData object in Hive

When dealing with StdData objects, we expect that StdData.get() will never be null; instead, the StdData objects themselves are null. But in the case of arrays of strings in Hive, we do get a StdString with a null value underneath.

I think a change needs to go in here:

if (hiveObjectInspector instanceof IntObjectInspector) {
where, if hiveData is null, a null StdData should be returned instead.

Support null values in StdMap

Currently, if we try to add null values to a StdMap, exceptions are thrown in pretty much every implementation. However, null values seem like a valid use case, and we should consider supporting them.

Enable TestBinaryDuplicateFunction and TestBinaryObjectSizeFunction for Trino in the module of transportable-udfs-example-udfs

With the upgrade to Trino 406, these two test classes are temporarily disabled for the Trino tests for the following reason:

The Trino test infrastructure named QueryAssertions is used to run these tests, and it always executes the function in two query forms: the normal query (e.g. SELECT "binary_duplicate"(a0) FROM (VALUES ROW(from_base64('YmFy'))) t(a0); and SELECT "binary_size"(a0) FROM (VALUES ROW(from_base64('Zm9v'))) t(a0);), and the same query with a "where RAND()>0" clause (e.g. SELECT "binary_duplicate"(a0) FROM (VALUES ROW(from_base64('YmFy'))) t(a0) where RAND()>0; and SELECT "binary_size"(a0) FROM (VALUES ROW(from_base64('Zm9v'))) t(a0) where RAND()>0;). QueryAssertions verifies that the outputs of both queries are equal; otherwise the test fails. However, the execution of the query with the WHERE clause goes through VariableWidthBlockBuilder.writeByte(), which creates the input byte array in a Slice with an initial capacity of 32 bytes, while the execution of the query without the WHERE clause does not go through VariableWidthBlockBuilder.writeByte() and creates the input byte array in a Slice sized to the actual content. Therefore, the outputs of the two queries differ.

As the code causing the problem lies in Trino, these tests should be re-enabled once the fix lands in Trino.

UDF Spark registration in Scala is not shown as a method

When trying to register one of the example UDFs packaged in Transport in a local spark image as described in the Using Transport UDFs doc, after importing the UDF, Scala gives the error that register isn't a member of the object.

When listing all of the methods available to Scala from the class, register isn't listed as one of the methods of the class. It did successfully import though, as the class is available to reference.

I think that this could be related to importing the wrong file, as the classpath is a bit odd -- but it does refer to an actual class that Scala successfully imports. (Classpath: com_linkedin_transport_transportable_udfs_example_udfs_unspecified.com.linkedin.transport.examples.NumericAddIntFunction)

Transport plugin shades everything by default, which causes type-not-found issues when referring to other libs

When I create a new UDF that calls an API exposed by another lib, it errors when I test it in Hive and on other platforms.

Caused by: com_linkedin_jobs_udf_jobs_udfs_2_1_1.org.apache.avro.AvroTypeException: Found com.linkedin.standardization.taxonomy.industries.IndustryStatus, expecting com_linkedin_jobs_udf_jobs_udfs_2_1_1.com.linkedin.standardization.taxonomy.industries.IndustryStatus
	at com_linkedin_jobs_udf_jobs_udfs_2_1_1.org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:309)
	at com_linkedin_jobs_udf_jobs_udfs_2_1_1.org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
	at com_linkedin_jobs_udf_jobs_udfs_2_1_1.org.apache.avro.io.ResolvingDecoder.readEnum(ResolvingDecoder.java:260)

The reason is that it expects the shaded type prefixed with com_linkedin_jobs_udf_jobs_udfs_2_1_1 but finds the original type.

To work around it, we have to explicitly exclude those namespaces by adding

shadeHiveJar.setDoNotShade(["com.linkedin.standardization.taxonomy.industries.*"])
shadeSpark_211Jar.setDoNotShade(["com.linkedin.standardization.taxonomy.industries.*"])
shadeSpark_212Jar.setDoNotShade(["com.linkedin.standardization.taxonomy.industries.*"])

to build.gradle.

Is this by design or a bug?

For more information, please refer to the internal discussion: https://linkedin-randd.slack.com/archives/C02D9EYGPGA/p1641401436435300

Test framework should create runtime tmp files in non-root directory

When we run Hive/Spark tests through the test framework, it creates some files in the root folder of the project, such as:

metastore_db/
derby.log
hdfs_cache/

Right now we add these to .gitignore to ignore them. We should consider setting appropriate configs so that these files are generated in tmp directories and not in the root directory.

Transport support for existing UDFs written in Presto for execution in Hive/Spark etc.

Presto supports a lot of rich functions that we find very useful; examples are:

These are just a few of the functions supported by Presto.

If Transport also had APIs to translate Presto functions into Transport UDFs, it could unlock the rich set of Presto functions for other execution engines (Spark or Flink, for example). A distribution of Presto's out-of-the-box functions, made available through Transport for Hive/Flink, could be of immense value to users.

Extending this idea, doing the same for Hive UDFs would help adoption, as teams that already have a lot of Hive UDFs could take advantage of the framework right away without rewriting those UDFs with Transport.

Transport should depend on hive-exec core jar for the test framework

org.apache.hive:hive-exec bundles some transitive dependencies (e.g. Guava) without shading them. This causes classpath issues in the test framework, as the user is not able to depend on a newer Guava version. Instead, we should use the org.apache.hive:hive-exec jar with the core classifier, which does not bundle transitive dependencies. This way, Guava dependencies can be manipulated through Gradle dependency resolution if required.
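A hedged sketch of what that dependency swap could look like in Gradle; the version and configuration name below are placeholders, not the project's actual values:

dependencies {
    // The "core" classifier avoids the fat hive-exec jar that bundles unshaded Guava and friends.
    // Version and configuration name are placeholders; use whatever the test framework module needs.
    implementation 'org.apache.hive:hive-exec:2.3.9:core'
}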

Hive tests fail for varbinary when building

I'm using Transport to convert an internal Presto UDF (ipstr) to a Transport one. This UDF takes a varbinary as a parameter. The unit tests pass fine when run in IntelliJ. However, when I try to build, the Hive tests fail for the test cases related to that UDF.

default String getBinaryArgumentString(ByteBuffer value) {
    return "CAST('" + new String(value.array(), StandardCharsets.UTF_8) + "' AS BINARY)";
}

I believe the issue is in the method above, in SQLFunctionCallGenerator. From what I understand, the Hive test writes the unit tests out as queries and executes them. In this case, it converts a ByteBuffer into a String, which gets cast back into a binary.

I tested it in IntelliJ: converting byte[] -> String and then getting the byte[] back from the String yields a different result than the original byte[].

In our integration test, we convert the varbinary into a base64 encoded string and then execute a query like:

select ipstr(from_base64( base64EncodedStringOfVarbinary ));

This problem might exist for Presto and Spark too, but I have not been able to verify that.
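For illustration, a hedged sketch of the base64 idea applied to the generator method quoted above. This is not the project's actual code; the interface name is invented for the sketch, and it assumes Hive's built-in unbase64() function is acceptable in generated test queries.

import java.nio.ByteBuffer;
import java.util.Base64;

interface Base64BinaryArgumentSketch {
  // Hypothetical sketch: round-trip the bytes through base64 instead of a lossy UTF-8 String,
  // so the exact byte[] is reconstructed inside the generated Hive query via unbase64().
  default String getBinaryArgumentString(ByteBuffer value) {
    return "unbase64('" + Base64.getEncoder().encodeToString(value.array()) + "')";
  }
}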

Support for standard java.lang primitive types

Filing this ticket to investigate if we can have support for java.lang primitive types.

Today we have a limited set of specialized primitive data types, e.g. StdString, StdLong, StdInteger. These specialized types have no methods associated with them and are inferior to standard JVM types like String, Int, and Long. A user writing a UDF with a primitive-type argument will always want to use the standard JVM type, since they are easy to work with, have a set of familiar methods (e.g. consider the String class), and everyone understands them better [e.g. String, Int, and Long are very well understood and used throughout].

Also, as and when we add support for newer primitive datatypes, it will become more challenging to roll our own wrapper classes for Decimal, Date, Binary etc.

There is no performance concern here as users working on primitive types will always 'realize' the underlying primitive object into a standard jvm type. The performance may also be slightly better as we are not creating wrapper classes for them.

I've tried a prototype for Spark using standard JVM primitive types, as shown here: https://github.com/rdsr/t2/blob/master/examples/src/main/java/t2/examples/Add.java, and the results look encouraging. The code is also much easier to read [without the specialized primitive wrappers].

Add support for Flink UDFs

I think Transport is an excellent project that lets us use the same UDF implementation across various query engines. This is a request to add support for Flink UDFs, which would extend the usability of this project to stream-compute use cases as well.
