
cascading-hive's Introduction

cascading-hive

Cascading-Hive is an integration between Apache Hive and Cascading. It currently has the following major features:

  • running Hive queries within a Cascade
  • reading from Hive tables within a Cascading Flow (including transactional tables)
  • writing/creating Hive tables from a Cascading Flow
  • writing/creating partitioned Hive tables from a Cascading Flow
  • deconstructing a Hive view into Taps
  • reading/writing taps from HCatalog

The demo sub-directory contains applications that demonstrate those features.
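
As a quick illustration of the table source/sink features, here is a minimal sketch that copies one Hive table into another. It follows the HiveTableDescriptor/HiveTap usage that appears in the demo and in the issues below; the database name, table names and columns are placeholders, not part of the project.

import cascading.flow.FlowDef;
import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;
import cascading.pipe.Pipe;
import cascading.tap.SinkMode;
import cascading.tap.hive.HiveTableDescriptor;
import cascading.tap.hive.HiveTap;

public class CopyHiveTable
  {
  public static void main( String[] args )
    {
    // describe the existing source table (names and types are assumptions)
    HiveTableDescriptor sourceDescriptor = new HiveTableDescriptor( "example_db", "source_table",
        new String[]{ "id", "name" }, new String[]{ "bigint", "string" } );
    HiveTap source = new HiveTap( sourceDescriptor, sourceDescriptor.toScheme() );

    // describe the target table; REPLACE mode plus the boolean flag mirrors the usage shown in the issues below
    HiveTableDescriptor sinkDescriptor = new HiveTableDescriptor( "example_db", "target_table",
        new String[]{ "id", "name" }, new String[]{ "bigint", "string" } );
    HiveTap sink = new HiveTap( sinkDescriptor, sinkDescriptor.toScheme(), SinkMode.REPLACE, true );

    // copy every row from source to sink
    Pipe copy = new Pipe( "copy" );

    FlowDef flowDef = FlowDef.flowDef()
        .addSource( copy, source )
        .addTailSink( copy, sink );

    new Hadoop2MR1FlowConnector().connect( flowDef ).complete();
    }
  }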

The code will pick up the configuration for a remote Hive MetaStore automatically, as long as it is present in your Hive or Hadoop configuration files (e.g. hive.metastore.uris in hive-site.xml).

Hive dependency

Cascading-Hive works with Hive 1.x; Hive 2.x is not yet supported. When using cascading-hive in your project, you have to specify the version of Hive you are using as a runtime dependency yourself. This is done to avoid classpath issues with the various Hadoop and Hive distributions in existence. See the demo project for an example.

Installing

To install cascading-hive into your local Maven repository, run:

> gradle install

Maven, Ivy, Gradle

Cascading-Hive is also available on http://conjars.org.

Limitations

Views

Please note that support for Hive views is currently limited: a view presents itself as a resource (a Tap in Cascading terms) but is actually a computation. It is currently not possible to read from a Hive view within a Cascading Flow. To work around this, you can create a table in Hive instead and read from that within your Cascading Flow. This restriction might change in the future.

Transactional Tables

Also note that it is not yet possible to write to transactional tables and that the HiveTap will prevent any attempt to do so.

HCatTap

Using HCatTap to sink data to a Blob Storage service may lead to issues that can be hard to deal with due to Blob Storage not being a real file system. If you are planning to do so, use it at your own risk.

Users have encountered various issues when using HCatTap to sink tables to S3, but there is a known workaround (tested on EMR 4.7.0):

	mapred.output.direct.EmrFileSystem = false
	mapred.output.direct.NativeS3FileSystem = false

These settings are required in order to be able to commit dynamic partitions. This also implies that direct commits in EMR will be disabled, and the job may take longer during the commit phase of tasks and jobs, since the underlying FileSystem will have to copy the files to their final locations and delete the temporary copies. Depending on your use case, this waiting time and the reliance on eventually consistent data may or may not be an issue.

Note that even though direct commits won't be available, EMR consistent views can still be used.
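
If you configure jobs from code rather than from the cluster configuration files, the same keys can be set on the properties handed to the FlowConnector. A minimal sketch, assuming a Hadoop2MR1FlowConnector as used elsewhere in this project; only the two property names come from the workaround above:

Properties properties = new Properties();

// disable EMR direct commits so HCatTap can commit dynamic partitions (see the settings above)
properties.setProperty( "mapred.output.direct.EmrFileSystem", "false" );
properties.setProperty( "mapred.output.direct.NativeS3FileSystem", "false" );

// build the flow with these properties
Hadoop2MR1FlowConnector connector = new Hadoop2MR1FlowConnector( properties );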

Environment

Finally note that Hive relies on the hadoop command being present in your PATH when it executes the queries on the Hadoop cluster.

cascading-hive's People

Contributors

ddcprg, fs111, stephanh, supreetoberoi, teabot


cascading-hive's Issues

Infinite recursive call in OutputCommitter of HCatTap

Hi,

We've started using HCatTap for some of our jobs and found an issue with the class OutputCommitterWrapper where an infinite recursive call is made when setupJob() and abortJob() are invoked.

This seems to happen in some scenarios and seems related to Hive rather than Cascading.

The following code sources data from a java List and sinks into an unpartitioned Hive table:

package com.company.hcat;

import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;

import java.io.File;
import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.junit.After;
import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

import cascading.HiveTestCase;
import cascading.flow.hadoop.util.HadoopUtil;
import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hcatalog.HCatScheme;
import cascading.tap.hcatalog.HCatTap;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

import com.google.common.collect.ImmutableList;
import com.twitter.maple.tap.MemorySourceTap;

public class HCatTapWriteTest extends HiveTestCase {
  private static final long serialVersionUID = 1L;

  private static final String DATABASE = "test_db";
  private static final String TABLE = "t_test";
  private static final String COL1 = "col1";
  private static final String COL2 = "col2";
  private static final Fields FIELDS = new Fields(new String[] { COL1, COL2 },
      new Class[] { Long.class, String.class });

  public @Rule TemporaryFolder temporaryFolder = new TemporaryFolder();

  private File dbLocation;
  private File tableLocation;

  @Before
  public void init() throws Exception {
    dbLocation = temporaryFolder.newFolder(DATABASE);
    tableLocation = temporaryFolder.newFolder(DATABASE + "/" + TABLE);
    createDb(dbLocation);
    createTable(tableLocation);
  }

  @After
  public void after() throws Exception {
    runHiveQuery("DROP TABLE IF EXISTS " + DATABASE + "." + TABLE);
    runHiveQuery("DROP DATABASE IF EXISTS " + DATABASE);
  }

  private void createDb(File dbLocation) throws Exception {
    runHiveQuery("CREATE DATABASE " + DATABASE + " LOCATION '" + dbLocation.getAbsolutePath() + "'");
  }

  private void createTable(File tableLocation) throws Exception {
    StringBuilder ddl = new StringBuilder(100)
        .append("CREATE TABLE " + DATABASE + "." + TABLE + " ( ")
        .append(COL1 + " BIGINT, ")
        .append(COL2 + " STRING ")
        .append(") ")
        .append("ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ")
        .append("STORED AS TEXTFILE ")
        .append("LOCATION '")
        .append(tableLocation.getAbsolutePath())
        .append("'");
    runHiveQuery(ddl.toString());
  }

  @Test
  public void write() throws Exception {
    HiveConf hiveConf = createHiveConf();

    List<Tuple> rows = ImmutableList
        .<Tuple> builder()
        .add(new Tuple(10L, "col10"))
        .add(new Tuple(11L, "col12"))
        .build();
    MemorySourceTap source = new MemorySourceTap(rows, FIELDS);

    HCatTap sink = new HCatTap(new HCatScheme(FIELDS), DATABASE, TABLE);

    Pipe pipe = new Pipe("copy");

    new Hadoop2MR1FlowConnector(HadoopUtil.createProperties(hiveConf)).connect(source, sink, pipe).complete();

    List<Object> tableData = runHiveQuery("SELECT * FROM " + DATABASE + "." + TABLE);
    assertThat(tableData.size(), is(2));
    assertThat(tableData.get(0).toString(), is("10\tcol10"));
    assertThat(tableData.get(1).toString(), is("11\tcol12"));
  }

}

The output of the code can be downloaded here.

The maven dependencies to run the previous code are:

<dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hive.hcatalog</groupId>
      <artifactId>hive-webhcat-java-client</artifactId>
      <version>${hive.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>cascading</groupId>
      <artifactId>cascading-hive</artifactId>
      <version>2.1.0-wip-dev</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-minicluster</artifactId>
      <version>${hadoop.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>cascading</groupId>
      <artifactId>cascading-core</artifactId>
      <version>${cascading.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>cascading</groupId>
      <artifactId>cascading-hadoop2-mr1</artifactId>
      <version>${cascading.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>cascading</groupId>
      <artifactId>cascading-platform</artifactId>
      <version>${cascading.version}</version>
      <classifier>tests</classifier>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.derby</groupId>
      <artifactId>derby</artifactId>
      <version>10.12.1.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>maple</artifactId>
      <version>0.16.0</version>
    </dependency>
  </dependencies>

Similar code that writes to a partitioned table does not show these symptoms.

The method commitJob() works around this issue by unsetting the committer after the first call, but this is not the case for the other two methods. Doing this will only fix the problem in some scenarios.

We've noticed that the class org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer creates a new job context when these job-specific committer methods are invoked. Specifically:

public class HCatMapRedUtil {
  ...
  public static JobContext createJobContext(JobConf conf, org.apache.hadoop.mapreduce.JobID id, Progressable progressable) {
    return ShimLoader.getHadoopShims().getHCatShim().createJobContext(conf, id, (Reporter) progressable);
  }
}

Could this be what's causing the issue in cascading-hive?

Any thoughts, ideas or insights that could help us to get this sorted are most welcome.

If you require more details please let me know.

Thanks,
Daniel

Let HiveTableDescriptor be constructed directly from an org.apache.hadoop.hive.metastore.api.Table instance

The current implementation of HiveTableDescriptor makes it difficult to create a table with a non-default location or a custom input/output format (there may be other limitations, but those are the two most notable that I've seen). Why not allow people to create an org.apache.hadoop.hive.metastore.api.Table object directly and then just pass it to the HiveTableDescriptor constructor? The folks who don't want to know about storage descriptors, etc. can keep using the existing constructors, but I think this new constructor would be essential for anyone who has even moderately customized Hive tables.
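
A hypothetical sketch of what this could look like from the caller's side; the single-argument constructor at the end is the proposal and does not exist in the current API, and the table, location and format values are placeholders:

import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
import org.apache.hadoop.hive.metastore.api.Table;

// build a fully customized metastore Table: non-default location, custom input/output formats
Table table = new Table();
table.setDbName( "example_db" );
table.setTableName( "custom_table" );

StorageDescriptor sd = new StorageDescriptor();
sd.setLocation( "hdfs:///data/custom/location" );
sd.setInputFormat( "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat" );
sd.setOutputFormat( "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" );
table.setSd( sd );

// proposed constructor, not part of the current API:
// HiveTableDescriptor descriptor = new HiveTableDescriptor( table );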

Add key copier in RecordReaderWrapper

We detected an issue when hash-joining pipes from different tables using different file formats, specifically joining ORC and Sequence Files. The key of the ORC table would be left set to the last entry read in previous steps.

We will provide a patch to deal with this problem.

Identifiers for two different objects of a tap on a hive table are same

We have a use case where we are trying to read some data from a source and then, based on business logic, do some processing on it before writing it back to the same Hive table. It involves a self-join to calculate a delta. With the way the identifier is built in HCatTap (link), an identical identifier is returned for two different tap objects on the same table, which leads to a cyclic flow exception being thrown by Cascading. I propose adding the id of the tap to the identifier to make the identifier for a tap object unique. I will raise the PR shortly.

HCatTap fails to commit dynamic partitions when used as S3 sink

We've tried to use HCatTap in AWS to output data to a partitioned table. Some strange behavior shows up when the data is stored in S3: OutputCommitter#commitJob() seems to be invoked twice for each job, and data is not moved properly from the temporary "directories" to its final location.

We have discussed the problem with Amazon and we have a patch to fix this issue which we would like to contribute. I'll be raising a PR soon.

HiveFlow and HiveRiffle should provide asynchronous start method

In order to be consistent with the general Flow interface, HiveFlow and/or HiveRiffle should implement the start method which asynchronously starts the job.

Without this, the Execution.run method in Scalding won't work, since it doesn't call the complete method but instead uses the start method and a FlowListener to listen for the completion event.
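
For reference, this is the pattern such a change would enable on the caller's side; a rough sketch using the standard Cascading Flow/FlowListener API, nothing HiveFlow-specific:

flow.addListener( new FlowListener()
  {
  public void onStarting( Flow flow ) { }

  public void onStopping( Flow flow ) { }

  public void onCompleted( Flow flow )
    {
    // notify the caller that the flow has finished
    }

  public boolean onThrowable( Flow flow, Throwable throwable )
    {
    return false; // not handled here, let the flow fail
    }
  } );

flow.start(); // returns immediately, unlike complete()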

cascading-hive: Table not Stored in Metastore DB

The table we created from the code is not stored in the metastore DB.
It creates a table in HDFS, but it is not showing in the Hive shell when we do show tables;

This is the HiveTableDescriptor code:

HiveTableDescriptor sinkTableDescriptor = new HiveTableDescriptor( "CallCenterDB", "transdata_result", CALL_CENTER_TABLE_FIELDS, CALL_CENTER_TABLE_TYPES );

Please solve this issue.

gradle install failing: Artifact 'commons-logging:commons-logging:1.1.3@jar' not found

gradle install is failing due to the error Artifact 'commons-logging:commons-logging:1.1.3@jar' not found

I tried it on the root as well as the inner folders, and it fails in both cases.
In the cascading-hive folder, when I tried gradle install, it fails with the exception:
Build file 'C:\Users\dk\cascading hive\cascading-hive-1.0\build.gradle' line: 37
Build file 'C:\Users\dk\cascading hive\cascading-hive-1.0\build.gradle' line: 37

  • What went wrong:
    A problem occurred evaluating root project 'cascading-hive'.

    Could not resolve all dependencies for configuration 'classpath'.
    Could not download artifact 'commons-logging:commons-logging:1.1.3@jar'
    Artifact 'commons-logging:commons-logging:1.1.3@jar' not found.

Error Reading External Table

I am trying to create a Hive tap from an external Hive table:

    HiveTableDescriptor hiveTableDescriptor = new HiveTableDescriptor(databaseName, tableName, columnNames, columnTypes);
    // create HiveTap based on the descriptor
    HiveTap hiveTableTap = new HiveTap(hiveTableDescriptor, hiveTableDescriptor.toScheme(), REPLACE, true);
    flowDef = flowDef.addSource(currentPipe, hiveTableTap);

And I am getting an "input path does not exist" error. I am using version 1.0.0-wip-15.

A select * from test_table works fine. The same code also works for non-external tables.

Any idea how to fix this?

Thanks,
Ryadh

Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameservice1/user/hive/warehouse/www.db/test_table
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:194)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:134)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1090)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1082)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:919)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

HiveTap is not concurrency safe.

As part of our testing we are writing to two HivePartitionTaps in parallel locally. This will fail with the stack trace below because it will try to create the same database in parallel.

Our current fix is to catch that particular exception around the createDatabase call. We also get task failures https://github.com/CommBank/ebenezer/issues/34 caused by a similar issue. I don't really have a good fix for this problem.

We are probably going to wrap the createDatabase call in the following try catch block to enable our tests to work:

 // rethrow unless the failure is the duplicate-database race described above
 if ( !( ex.getCause() != null &&
            ex.getCause().getCause() != null &&
            ex.getCause().getCause().getMessage().startsWith( "The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUE_DATABASE'") ) )
            {
            throw ex;
            }

The stack trace:

java.lang.Exception: cascading.CascadingException: java.io.IOException: MetaException(message:javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUE_DATABASE' defined on 'DBS'.)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
Caused by: cascading.CascadingException: java.io.IOException: MetaException(message:javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUE_DATABASE' defined on 'DBS'.)
    at cascading.tap.hive.HivePartitionTap$HivePartitionCollector.closeCollector(HivePartitionTap.java:88)
    at cascading.tap.partition.BasePartitionTap$PartitionCollector.close(BasePartitionTap.java:186)
    at cascading.flow.stream.SinkStage.cleanup(SinkStage.java:120)
    at cascading.flow.stream.StreamGraph.cleanup(StreamGraph.java:176)
    at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:155)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:695)
Caused by: java.io.IOException: MetaException(message:javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUE_DATABASE' defined on 'DBS'.)
    at cascading.tap.hive.HiveTap.createHiveTable(HiveTap.java:169)
    at cascading.tap.hive.HiveTap.registerPartition(HiveTap.java:322)
    at cascading.tap.hive.HivePartitionTap$HivePartitionCollector.closeCollector(HivePartitionTap.java:84)
    ... 13 more
Caused by: MetaException(message:javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUE_DATABASE' defined on 'DBS'.)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:602)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
    at com.sun.proxy.$Proxy8.create_database(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:414)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
    at com.sun.proxy.$Proxy9.createDatabase(Unknown Source)
    at cascading.tap.hive.HiveTap.createHiveTable(HiveTap.java:159)
    ... 15 more
Caused by: javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUE_DATABASE' defined on 'DBS'.
    at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451)
    at org.datanucleus.api.jdo.JDOTransaction.commit(JDOTransaction.java:165)
    at org.apache.hadoop.hive.metastore.ObjectStore.commitTransaction(ObjectStore.java:345)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:111)
    at com.sun.proxy.$Proxy7.commitTransaction(Unknown Source)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database_core(HiveMetaStore.java:563)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:591)
    ... 29 more
Caused by: java.sql.BatchUpdateException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUE_DATABASE' defined on 'DBS'.
    at org.apache.derby.impl.jdbc.EmbedStatement.executeBatch(Unknown Source)
    at org.apache.commons.dbcp.DelegatingStatement.executeBatch(DelegatingStatement.java:297)
    at org.apache.commons.dbcp.DelegatingStatement.executeBatch(DelegatingStatement.java:297)
    at org.datanucleus.store.rdbms.ParamLoggingPreparedStatement.executeBatch(ParamLoggingPreparedStatement.java:372)
    at org.datanucleus.store.rdbms.SQLController.processConnectionStatement(SQLController.java:628)
    at org.datanucleus.store.rdbms.SQLController.processStatementsForConnection(SQLController.java:596)
    at org.datanucleus.store.rdbms.SQLController$1.transactionFlushed(SQLController.java:683)
    at org.datanucleus.store.connection.AbstractManagedConnection.transactionFlushed(AbstractManagedConnection.java:86)
    at org.datanucleus.store.connection.ConnectionManagerImpl$2.transactionFlushed(ConnectionManagerImpl.java:454)
    at org.datanucleus.TransactionImpl.flush(TransactionImpl.java:199)
    at org.datanucleus.TransactionImpl.commit(TransactionImpl.java:263)
    at org.datanucleus.api.jdo.JDOTransaction.commit(JDOTransaction.java:98)
    ... 38 more
Caused by: java.sql.SQLIntegrityConstraintViolationException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUE_DATABASE' defined on 'DBS'.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
    at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeBatchElement(Unknown Source)
    ... 50 more
Caused by: java.sql.SQLException: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUE_DATABASE' defined on 'DBS'.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
    ... 58 more
Caused by: ERROR 23505: The statement was aborted because it would have caused a duplicate key value in a unique or primary key constraint or unique index identified by 'UNIQUE_DATABASE' defined on 'DBS'.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.sql.execute.IndexChanger.insertAndCheckDups(Unknown Source)
    at org.apache.derby.impl.sql.execute.IndexChanger.doInsert(Unknown Source)
    at org.apache.derby.impl.sql.execute.IndexChanger.insert(Unknown Source)
    at org.apache.derby.impl.sql.execute.IndexSetChanger.insert(Unknown Source)
    at org.apache.derby.impl.sql.execute.RowChangerImpl.insertRow(Unknown Source)
    at org.apache.derby.impl.sql.execute.InsertResultSet.normalInsertCore(Unknown Source)
    at org.apache.derby.impl.sql.execute.InsertResultSet.open(Unknown Source)
    at org.apache.derby.impl.sql.GenericPreparedStatement.executeStmt(Unknown Source)
    at org.apache.derby.impl.sql.GenericPreparedStatement.execute(Unknown Source)
    ... 52 more

java.lang.RuntimeException: Failed to load Hive builtin functions

I am running HiveDemo.java.
I have the hive-builtins jar in my build.
I am getting the exception:
flow failed: load data into dual
java.lang.RuntimeException: Failed to load Hive builtin functions
at org.apache.hadoop.hive.ql.session.SessionState.<init>(SessionState.java:217)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:265)
at cascading.flow.hive.HiveDriverFactory.createHiveDriver(HiveDriverFactory.java:67)
at cascading.flow.hive.HiveQueryRunner.run(HiveQueryRunner.java:77)
at cascading.flow.hive.HiveRiffle.complete(HiveRiffle.java:97)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at riffle.process.scheduler.ProcessWrapper.invokeMethod(ProcessWrapper.java:178)
at riffle.process.scheduler.ProcessWrapper.findInvoke(ProcessWrapper.java:166)
at riffle.process.scheduler.ProcessWrapper.complete(ProcessWrapper.java:147)
at cascading.flow.hadoop.ProcessFlow.complete(ProcessFlow.java:199)
at cascading.flow.hive.HiveFlow.complete(HiveFlow.java:145)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:1057)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:1005)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.FileNotFoundException: /tmp/hadoop-stark/hadoop-unjar4751041980891901498 (Is a directory)
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:215)
at java.util.zip.ZipFile.<init>(ZipFile.java:145)
at java.util.jar.JarFile.<init>(JarFile.java:153)
at java.util.jar.JarFile.<init>(JarFile.java:90)
at sun.net.www.protocol.jar.URLJarFile.<init>(URLJarFile.java:93)
at sun.net.www.protocol.jar.URLJarFile.getJarFile(URLJarFile.java:69)
at sun.net.www.protocol.jar.JarFileFactory.get(JarFileFactory.java:84)
at sun.net.www.protocol.jar.JarURLConnection.connect(JarURLConnection.java:122)
at sun.net.www.protocol.jar.JarURLConnection.getInputStream(JarURLConnection.java:150)
at java.net.URL.openStream(URL.java:1037)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.registerFunctionsFromPluginJar(FunctionRegistry.java:1307)
at org.apache.hadoop.hive.ql.session.SessionState.<init>(SessionState.java:213)
... 19 more

Need to be able to set performance options for queries

We want to be able to set specific hive performance options for some of the queries we run. Currently the HiveDriverFactory instantiates a new HiveConf as needed when it creates the driver.

Would it make sense to have a createHiveDriver method that takes a HiveConf as parameter and to modify HiveFlow so that it passes down the HiveConf or is there a better way?

Am I right in thinking there shouldn't be any serialisation issues around HiveConf since the hive flow will be run on the submitter node?
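
A hypothetical sketch of the kind of factory method being asked for; this is not the current HiveDriverFactory API, just an illustration of passing a caller-supplied HiveConf through to the Hive Driver:

// hypothetical overload: the caller tunes the HiveConf before the driver is created
public Driver createHiveDriver( HiveConf conf )
  {
  SessionState.start( conf );
  return new Driver( conf );
  }

The caller could then set tuning keys such as hive.exec.parallel on the HiveConf before handing it over.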

use cascading hive on tez very slow

I appended the following to hive-site.xml:

<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn-tez</value>
</property>
<property>
    <name>tez.lib.uris</name>
    <value>hdfs:///apps/tez-0.5.3/tez-0.5.3.tar.gz</value>
</property>

Then I run yarn jar build/libs/cascading-hive-demo-1.0.jar cascading.hive.HiveDemo

The job runs successfully, but very slowly; slower than Hive on MR.
Some logs follow:
15/02/11 23:07:06 INFO HiveMetaStore.audit: ugi=hdfs ip=unknown-ip-addr cmd=Metastore shutdown complete.
15/02/11 23:07:06 WARN conf.HiveConf: HiveConf of name hive.optimize.mapjoin.mapreduce does not exist
15/02/11 23:07:06 WARN conf.HiveConf: HiveConf of name hive.semantic.analyzer.factory.impl does not exist
15/02/11 23:07:06 WARN conf.HiveConf: HiveConf of name hive.auto.convert.sortmerge.join.noconditionaltask does not exist
15/02/11 23:07:06 INFO session.SessionState: Created local directory: /tmp/61d91f93-0f22-4ebb-8fae-0dff14b8bbd7_resources
15/02/11 23:07:06 INFO session.SessionState: Created HDFS directory: /tmp/hive/hdfs/61d91f93-0f22-4ebb-8fae-0dff14b8bbd7
15/02/11 23:07:06 INFO session.SessionState: Created local directory: /tmp/hdfs/61d91f93-0f22-4ebb-8fae-0dff14b8bbd7
15/02/11 23:07:06 INFO session.SessionState: Created HDFS directory: /tmp/hive/hdfs/61d91f93-0f22-4ebb-8fae-0dff14b8bbd7/_tmp_space.db
15/02/11 23:07:06 INFO tez.TezSessionState: User of session id 61d91f93-0f22-4ebb-8fae-0dff14b8bbd7 is hdfs
15/02/11 23:07:06 INFO tez.DagUtils: Jar dir is null/directory doesn't exist. Choosing HIVE_INSTALL_DIR - hdfs:/user/hdfs/.hiveJars
15/02/11 23:07:06 INFO tez.DagUtils: Resource modification time: 1423663740552
15/02/11 23:07:06 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.5.3, revision=${buildNumber}, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=20150112-2131 ]
15/02/11 23:07:06 INFO tez.TezSessionState: Opening new Tez Session (id: 61d91f93-0f22-4ebb-8fae-0dff14b8bbd7, scratch dir: hdfs://localhost:8020/tmp/hive/hdfs/_tez_session_dir/61d91f93-0f22-4ebb-8fae-0dff14b8bbd7)
15/02/11 23:07:06 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
15/02/11 23:07:06 INFO client.TezClient: Session mode. Starting session.
15/02/11 23:07:06 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs:///apps/tez-0.5.3/tez-0.5.3.tar.gz
15/02/11 23:07:06 INFO client.TezClient: Tez system stage directory hdfs://localhost:8020/tmp/hive/hdfs/_tez_session_dir/61d91f93-0f22-4ebb-8fae-0dff14b8bbd7/.tez/application_1422003283699_0139 doesn't exist and is created
15/02/11 23:07:06 INFO impl.YarnClientImpl: Submitted application application_1422003283699_0139
15/02/11 23:07:06 INFO client.TezClient: The url to track the Tez Session: http://localhost:8088/proxy/application_1422003283699_0139/

The job blocks here for a long time.
