
cascading-jdbc's Introduction

Cascading

Thanks for using Cascading.

Cascading 3.3

Cascading 3 includes a few major changes and additions from prior major releases:

  • Complete re-write of the platform query planner and improvements to the planner API
  • Addition of Apache Tez as a supported runtime platform
  • Changes to the Tap/Scheme generic type signatures to support portability

These changes are intended to simplify the creation of bindings to new platform implementations and to improve the performance of the resulting applications.

General Information:

For project documentation and community support, visit: cascading.org

To download a pre-built distribution, visit http://cascading.org/downloads/, or use Maven (described below).

The project includes nine Cascading jar files:

  • cascading-core-x.y.z.jar - all Cascading Core class files
  • cascading-xml-x.y.z.jar - all Cascading XML operations class files
  • cascading-expression-x.y.z.jar - all Cascading Janino expression operations class files
  • cascading-local-x.y.z.jar - all Cascading Local in-memory mode class files
  • cascading-hadoop-x.y.z.jar - all Cascading Hadoop 1.x MapReduce mode class files
  • cascading-hadoop2-io-x.y.z.jar - all Cascading Hadoop 2.x HDFS and IO related class files
  • cascading-hadoop2-mr1-x.y.z.jar - all Cascading Hadoop 2.x MapReduce mode class files
  • cascading-hadoop2-tez-x.y.z.jar - all Cascading Hadoop 2.x Tez mode class files
  • cascading-hadoop2-tez-stats-x.y.z.jar - all Cascading Tez YARN timeline server class files

These class jars, along with tests, source, and javadoc jars, are all available via the Conjars.org Maven repository.

Hadoop 1.x mode is where the Cascading application should run on a Hadoop MapReduce cluster.

Hadoop 2.x MR1 mode is the same as above but for Hadoop 2.x releases.

Hadoop 2.x Tez mode is where the Cascading application should run on an Apache Tez DAG cluster.

Local mode is where the Cascading application will run locally in memory without any Hadoop dependencies or cluster distribution. This implementation has minimal to no robustness in low memory situations, by design.

As of Cascading 3.x, all above jar files are built against Java 1.7. Prior versions of Cascading are built against Java 1.6.

Extensions, the SDK, and DSLs

A number of projects based on Cascading, and extensions to it, are available.

Visit the Cascading Extensions page for a current list.

Or download the Cascading SDK which includes many pre-built binaries.

Of note are three top level projects:

  • Fluid - A fluent Java API for Cascading that is compatible with the default API.
  • Lingual - ANSI SQL and JDBC on Cascading
  • Pattern - Machine Learning scoring and PMML support with Cascading

And alternative languages:

And a third-party computing platform:

Versioning

Cascading stable releases are always of the form x.y.z, where z is the current maintenance release.

x.y.z releases are maintenance releases. No public incompatible API changes will be made, but in an effort to fix bugs, remediation may entail throwing new Exceptions.

x.y releases are minor releases. New features are added. No public incompatible API changes will be made on the core processing APIs (Pipes, Functions, etc), but in an effort to resolve inconsistencies, minor semantic changes may be necessary.

It is important to note that we reserve the right to make breaking changes to the new query planner API throughout the 3.x releases. This allows us to respond to bugs and performance issues without issuing new major releases. Cascading 4.0 will keep the public query planner APIs stable.

The source and tags for all stable releases can be found here: https://github.com/Cascading/cascading

WIP (work in progress) releases are fully tested builds of code not yet deemed fully stable. On every build by our continuous integration servers, the WIP build number is increased. Successful builds are then tagged and published.

The WIP releases are always of the form x.y.z-wip-n, where x.y.z will be the next stable release version the WIP releases are leading up to. n is the current successfully tested build.

The source, working branches, and tags for all WIP releases can be found here: https://github.com/cwensel/cascading

Or downloaded from here: http://cascading.org/wip/

When a WIP is deemed stable and ready for production use, it will be published as a x.y.z release, and made available from the http://cascading.org/downloads/ page.

Writing and Running Tests

Comprehensive tests should be written against the cascading.PlatformTestCase.
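For reference, a minimal platform test might look like the sketch below. The helper methods available on PlatformTestCase vary by version, so the test body is left as a comment; this is an illustrative assumption, not a verified example from the project:

import cascading.PlatformTestCase;
import org.junit.Test;

public class ExamplePlatformTest extends PlatformTestCase
  {
  @Test
  public void testSomething() throws Exception
    {
    // build platform-specific source/sink taps and pipes here, connect them
    // into a Flow, complete it, then assert on the resulting tuples
    }
  }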

When running tests built against the PlatformTestCase, the local cluster can be disabled (if enabled by the test) by setting:

-Dtest.cluster.enabled=false

From Gradle, to run a single test case:

> gradle :cascading-hadoop2-mr1:platformTest --tests=*.FieldedPipesPlatformTest -i

or a single test method:

> gradle :cascading-hadoop2-mr1:platformTest --tests=*.FieldedPipesPlatformTest.testNoGroup -i

Debugging the 3.x Planner

The new 3.0 planner has a much improved debugging framework.

When running tests, set the following:

-Dtest.traceplan.enabled=true

If you are on Mac OS X and have installed GraphViz, dot files can be converted to pdf on the fly. To enable, set:

-Dutil.dot.to.pdf.enabled=true

Optionally, for standalone applications, statistics and tracing can be enabled selectively with the following properties (see the sketch after this list):

  • cascading.planner.stats.path - outputs detailed statistics on time spent by the planner
  • cascading.planner.plan.path - basic planner information
  • cascading.planner.plan.transforms.path - detailed information for each rule
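For example, these properties can be handed to the FlowConnector when the flow is created. A minimal sketch, assuming the Hadoop 2 MR1 platform and illustrative output paths:

import java.util.Properties;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;

Properties properties = new Properties();
// each property points at a location where the planner writes its output
properties.setProperty( "cascading.planner.stats.path", "build/planner/stats" );
properties.setProperty( "cascading.planner.plan.path", "build/planner/plan" );
properties.setProperty( "cascading.planner.plan.transforms.path", "build/planner/transforms" );

FlowConnector connector = new Hadoop2MR1FlowConnector( properties );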

Contributing and Reporting Issues

See CONTRIBUTING.md at https://github.com/Cascading/cascading.

Using with Maven/Ivy

It is strongly recommended that developers pull Cascading from our Maven-compatible jar repository, Conjars.org.

You can find the latest public and WIP (work in progress) releases here:

When creating tests, make sure to add any of the relevant above dependencies to your test scope or equivalent configuration along with the cascading-platform dependency.

Note that the cascading-platform compile dependency has no classes; you must pull the tests dependency with the tests classifier.

See http://cascading.org/downloads/#maven for example Maven pom dependency settings.

Source and Javadoc artifacts (using the appropriate classifier) are also available through Conjars.

Note that cascading-hadoop, cascading-hadoop2-mr1, and cascading-hadoop2-tez declare a provided dependency on the Hadoop jars so that, typically, Hadoop is not pulled into your application packaging as a transitive dependency.

Building and IDE Integration

For most cases, building Cascading is unnecessary as it has been pre-built, tested, and published to our Maven repository (above).

To build Cascading, run the following in the shell:

> git clone https://github.com/cascading/cascading.git
> cd cascading
> gradle build

Cascading requires at least Gradle 2.7 and Java 1.7 to build.

To use an IDE like IntelliJ, run the following to create IntelliJ project files:

> gradle idea

Similarly for Eclipse:

> gradle eclipse

Using with Apache Hadoop

First confirm you are using a supported version of Apache Hadoop by checking the Compatibility page.

To use Cascading with Hadoop, we suggest packaging the cascading-core and cascading-hadoop2-mr1 jar files, along with all third-party libraries, into the lib folder of your job jar and executing your job via $HADOOP_HOME/bin/hadoop jar your.jar <your args>.

For example, your job jar would look like this (via: jar -tf your.jar):

/<all your class and resource files>
/lib/cascading-core-x.y.z.jar
/lib/cascading-hadoop2-mr1-x.y.z.jar
/lib/cascading-hadoop2-io-x.y.z.jar
/lib/cascading-expression-x.y.z.jar
/lib/<cascading third-party jar files>

Hadoop will unpack the jar locally and remotely (in the cluster) and add any libraries in lib to the classpath. This is a feature specific to Hadoop.

cascading-jdbc's People

Contributors

azymnis, cwensel, emlyn, fs111, joeposner, johnynek, koertkuipers, noitcudni, r0man, rdesmond, senior, sritchie


cascading-jdbc's Issues

[scalding] java.lang.IllegalArgumentException: given types array contains null

This isn't exactly a bug in cascading-jdbc and is more likely an issue in the Scalding integration with it, but since I couldn't find any documentation or examples, I thought I would file an issue and ask. I hope that's ok.

Scalding JDBC integration seems to have completely broken with 2.5.3 (and with current master).

class JDBCTestWrite(args: Args) extends Job(args) {
  import TDsl._
  // create a simple data set
  val data = IterableSource(List(1, 2, 3, 4, 5), 'value1)
  // read it and write to DB.
  data.read.write(DBOutputSource())
}

Produces an error:

Caused by: java.lang.IllegalArgumentException: given types array contains null
    at cascading.tuple.Fields.<init>(Fields.java:650)
    at cascading.jdbc.JDBCScheme.deriveInternalSinkFields(JDBCScheme.java:755)
    at cascading.jdbc.JDBCScheme.presentSinkFields(JDBCScheme.java:722)
    at cascading.tap.Tap.presentSinkFields(Tap.java:382)
    at cascading.flow.BaseFlow.presentSinkFields(BaseFlow.java:255)
    at cascading.flow.BaseFlow.updateSchemes(BaseFlow.java:206)
    at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:272)
    ... 20 more

This seems to be caused by fields.getType( i ) returning null here:

types[ i ] = InternalTypeMapping.findInternalType( fields.getType( i ) );

This is part of the new JDBCScheme mechanism introduced in 2.5.3. Must all fields have types attached to them for jdbc integration to work now?
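For anyone hitting the same thing, a hedged workaround sketch (the field name and type here are illustrative, not taken from the project): declare explicit Java types on the Fields handed to the JDBCScheme so the derived type array never contains null.

import java.lang.reflect.Type;
import cascading.tuple.Fields;

// attach an explicit type to each field used by the JDBCScheme
Fields sinkFields = new Fields( new Comparable[]{ "value1" }, new Type[]{ int.class } );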

Example for Hortonworks Sandbox 2.4

Hi folks,

I don't really know where to put this information, but I just made a small sample project on how to use this Tap/Sink on Hortonworks 2.4.
It is available here:
https://github.com/busche/cascading-jdbc-on-hdp

... maybe you want to add it as a note somewhere. It took me quite some time to resolve all the runtime conflicts with the provided libraries, so it should be a helpful starting point for others.

Best,
André

Cascading exception java.lang.IllegalArgumentException: Source fields not found in sinked data.

I am reading a binary file in Cascading and writing it to a text file.

Pipe testRecords = new Pipe( "Test records" );
Fields fieldSelector = new Fields( PotentialCounterfeitDatum.TRANSIT_ABA_FN, PotentialCounterfeitDatum.ACCOUNT_NUMBER_FN,
    PotentialCounterfeitDatum.AMOUNT_FN, PotentialCounterfeitDatum.SCHEME_CREATE_DATE_FN );
testRecords = new Retain( testRecords, fieldSelector );

I am getting an exception (java.lang.IllegalArgumentException: Source fields not found in sinked data).

Please advise what the error may be.

Building with MySQL support failed: `Javadoc generation failed`

I'm trying to build the project with MySQL-support using the command-line given from the Readme:

gradle build --stacktrace -Dcascading.jdbc.url.mysql="jdbc:mysql://some-host/somedb?user=someuser&password=somepw" -i

However, the Javadoc generation failed. Here is the stacktrace:

org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':cascading-jdbc-core:javadoc'.
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeActions(ExecuteActionsTaskExecuter.java:69)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.execute(ExecuteActionsTaskExecuter.java:46)
    at org.gradle.api.internal.tasks.execution.PostExecutionAnalysisTaskExecuter.execute(PostExecutionAnalysisTaskExecuter.java:35)
    at org.gradle.api.internal.tasks.execution.SkipUpToDateTaskExecuter.execute(SkipUpToDateTaskExecuter.java:64)
    at org.gradle.api.internal.tasks.execution.ValidatingTaskExecuter.execute(ValidatingTaskExecuter.java:58)
    at org.gradle.api.internal.tasks.execution.SkipEmptySourceFilesTaskExecuter.execute(SkipEmptySourceFilesTaskExecuter.java:52)
    at org.gradle.api.internal.tasks.execution.SkipTaskWithNoActionsExecuter.execute(SkipTaskWithNoActionsExecuter.java:52)
    at org.gradle.api.internal.tasks.execution.SkipOnlyIfTaskExecuter.execute(SkipOnlyIfTaskExecuter.java:53)
    at org.gradle.api.internal.tasks.execution.ExecuteAtMostOnceTaskExecuter.execute(ExecuteAtMostOnceTaskExecuter.java:43)
    at org.gradle.execution.taskgraph.DefaultTaskGraphExecuter$EventFiringTaskWorker.execute(DefaultTaskGraphExecuter.java:203)
    at org.gradle.execution.taskgraph.DefaultTaskGraphExecuter$EventFiringTaskWorker.execute(DefaultTaskGraphExecuter.java:185)
    at org.gradle.execution.taskgraph.AbstractTaskPlanExecutor$TaskExecutorWorker.processTask(AbstractTaskPlanExecutor.java:66)
    at org.gradle.execution.taskgraph.AbstractTaskPlanExecutor$TaskExecutorWorker.run(AbstractTaskPlanExecutor.java:50)
    at org.gradle.execution.taskgraph.DefaultTaskPlanExecutor.process(DefaultTaskPlanExecutor.java:25)
    at org.gradle.execution.taskgraph.DefaultTaskGraphExecuter.execute(DefaultTaskGraphExecuter.java:110)
    at org.gradle.execution.SelectedTaskExecutionAction.execute(SelectedTaskExecutionAction.java:37)
    at org.gradle.execution.DefaultBuildExecuter.execute(DefaultBuildExecuter.java:37)
    at org.gradle.execution.DefaultBuildExecuter.access$000(DefaultBuildExecuter.java:23)
    at org.gradle.execution.DefaultBuildExecuter$1.proceed(DefaultBuildExecuter.java:43)
    at org.gradle.execution.DryRunBuildExecutionAction.execute(DryRunBuildExecutionAction.java:32)
    at org.gradle.execution.DefaultBuildExecuter.execute(DefaultBuildExecuter.java:37)
    at org.gradle.execution.DefaultBuildExecuter.execute(DefaultBuildExecuter.java:30)
    at org.gradle.initialization.DefaultGradleLauncher$4.run(DefaultGradleLauncher.java:154)
    at org.gradle.internal.Factories$1.create(Factories.java:22)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:90)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:52)
    at org.gradle.initialization.DefaultGradleLauncher.doBuildStages(DefaultGradleLauncher.java:151)
    at org.gradle.initialization.DefaultGradleLauncher.access$200(DefaultGradleLauncher.java:32)
    at org.gradle.initialization.DefaultGradleLauncher$1.create(DefaultGradleLauncher.java:99)
    at org.gradle.initialization.DefaultGradleLauncher$1.create(DefaultGradleLauncher.java:93)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:90)
    at org.gradle.internal.progress.DefaultBuildOperationExecutor.run(DefaultBuildOperationExecutor.java:62)
    at org.gradle.initialization.DefaultGradleLauncher.doBuild(DefaultGradleLauncher.java:93)
    at org.gradle.initialization.DefaultGradleLauncher.run(DefaultGradleLauncher.java:82)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter$DefaultBuildController.run(InProcessBuildActionExecuter.java:94)
    at org.gradle.tooling.internal.provider.ExecuteBuildActionRunner.run(ExecuteBuildActionRunner.java:28)
    at org.gradle.launcher.exec.ChainingBuildActionRunner.run(ChainingBuildActionRunner.java:35)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter.execute(InProcessBuildActionExecuter.java:43)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter.execute(InProcessBuildActionExecuter.java:28)
    at org.gradle.launcher.exec.ContinuousBuildActionExecuter.execute(ContinuousBuildActionExecuter.java:77)
    at org.gradle.launcher.exec.ContinuousBuildActionExecuter.execute(ContinuousBuildActionExecuter.java:47)
    at org.gradle.launcher.exec.DaemonUsageSuggestingBuildActionExecuter.execute(DaemonUsageSuggestingBuildActionExecuter.java:51)
    at org.gradle.launcher.exec.DaemonUsageSuggestingBuildActionExecuter.execute(DaemonUsageSuggestingBuildActionExecuter.java:28)
    at org.gradle.launcher.cli.RunBuildAction.run(RunBuildAction.java:43)
    at org.gradle.internal.Actions$RunnableActionAdapter.execute(Actions.java:170)
    at org.gradle.launcher.cli.CommandLineActionFactory$ParseAndBuildAction.execute(CommandLineActionFactory.java:237)
    at org.gradle.launcher.cli.CommandLineActionFactory$ParseAndBuildAction.execute(CommandLineActionFactory.java:210)
    at org.gradle.launcher.cli.JavaRuntimeValidationAction.execute(JavaRuntimeValidationAction.java:35)
    at org.gradle.launcher.cli.JavaRuntimeValidationAction.execute(JavaRuntimeValidationAction.java:24)
    at org.gradle.launcher.cli.CommandLineActionFactory$WithLogging.execute(CommandLineActionFactory.java:206)
    at org.gradle.launcher.cli.CommandLineActionFactory$WithLogging.execute(CommandLineActionFactory.java:169)
    at org.gradle.launcher.cli.ExceptionReportingAction.execute(ExceptionReportingAction.java:33)
    at org.gradle.launcher.cli.ExceptionReportingAction.execute(ExceptionReportingAction.java:22)
    at org.gradle.launcher.Main.doAction(Main.java:33)
    at org.gradle.launcher.bootstrap.EntryPoint.run(EntryPoint.java:45)
    at org.gradle.launcher.bootstrap.ProcessBootstrap.runNoExit(ProcessBootstrap.java:54)
    at org.gradle.launcher.bootstrap.ProcessBootstrap.run(ProcessBootstrap.java:35)
    at org.gradle.launcher.GradleMain.main(GradleMain.java:23)
Caused by: org.gradle.api.GradleException: Javadoc generation failed. Generated Javadoc options file (useful for troubleshooting): '/Users/dnies/devel/cascading-jdbc/cascading-jdbc-core/build/tmp/javadoc/javadoc.options'
    at org.gradle.api.tasks.javadoc.internal.JavadocGenerator.execute(JavadocGenerator.java:57)
    at org.gradle.api.tasks.javadoc.internal.JavadocGenerator.execute(JavadocGenerator.java:31)
    at org.gradle.api.tasks.javadoc.Javadoc.executeExternalJavadoc(Javadoc.java:143)
    at org.gradle.api.tasks.javadoc.Javadoc.generate(Javadoc.java:131)
    at org.gradle.internal.reflect.JavaMethod.invoke(JavaMethod.java:75)
    at org.gradle.api.internal.project.taskfactory.AnnotationProcessingTaskFactory$StandardTaskAction.doExecute(AnnotationProcessingTaskFactory.java:227)
    at org.gradle.api.internal.project.taskfactory.AnnotationProcessingTaskFactory$StandardTaskAction.execute(AnnotationProcessingTaskFactory.java:220)
    at org.gradle.api.internal.project.taskfactory.AnnotationProcessingTaskFactory$StandardTaskAction.execute(AnnotationProcessingTaskFactory.java:209)
    at org.gradle.api.internal.AbstractTask$TaskActionWrapper.execute(AbstractTask.java:585)
    at org.gradle.api.internal.AbstractTask$TaskActionWrapper.execute(AbstractTask.java:568)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeAction(ExecuteActionsTaskExecuter.java:80)
    at org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter.executeActions(ExecuteActionsTaskExecuter.java:61)
    ... 57 more
Caused by: org.gradle.process.internal.ExecException: Process 'command '/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/javadoc'' finished with non-zero exit value 1
    at org.gradle.process.internal.DefaultExecHandle$ExecResultImpl.assertNormalExitValue(DefaultExecHandle.java:367)
    at org.gradle.process.internal.DefaultExecAction.execute(DefaultExecAction.java:31)
    at org.gradle.api.tasks.javadoc.internal.JavadocGenerator.execute(JavadocGenerator.java:52)
    ... 68 more

Issue with resourceExist logic

Related to #12

The issue is this: https://github.com/Cascading/cascading-jdbc/blob/2.5/cascading-jdbc-core/src/main/java/cascading/jdbc/JDBCTap.java#L671-L680

Since it just catches Exception, any transient database issue during the default table-exists query gets treated as if the table exists. For example, a long-running query on the same table could hold a lock and cause this query to time out, which would then make this think that the table exists. I realize that this is kind of an edge case, but it led to a production table being dropped.

The default table existence query is broken

https://github.com/Cascading/cascading-jdbc/blob/2.5/cascading-jdbc-core/src/main/java/cascading/jdbc/JDBCTap.java#L642
which uses
https://github.com/Cascading/cascading-jdbc/blob/2.5/cascading-jdbc-core/src/main/java/cascading/jdbc/JDBCFactory.java#L51
as a default

I'm pretty sure that in any sql db the following:

select 1 from %s where 1 = 0

will always return no rows. The code as is checks to see if it returns exactly 1.

I don't know how we didn't hit this in testing (bad luck, I guess), but we hit it instantly in prod. I'm going to propose a fix in Scalding for MySQL that overrides the table existence query, but we should probably have a more reasonable default that works in the major databases (at least MySQL and Postgres).
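A hedged alternative sketch (not the project's actual fix; the helper method and its placement are illustrative): probe existence through JDBC metadata rather than through a row-returning query, so the result does not depend on the table having rows.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

// returns true if the named table is visible through JDBC metadata
static boolean tableExists( Connection connection, String tableName ) throws SQLException
  {
  try( ResultSet tables = connection.getMetaData().getTables( null, null, tableName, new String[]{ "TABLE" } ) )
    {
    return tables.next();
    }
  }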

NoMethodError trying to build

I'm trying to build cascading-jdbc so that I can work through some of the cascading tutorials (specifically, the one dealing with Amazon Redshift). However, I'm getting an error when I attempt to build.

Error output is attached:
error.txt

I feel like I'm probably missing a library, but I can't figure out what it is or where I would include it.

I'm using gradle 1.11.

Thanks

Loading data from one database to another database is not supported with Hadoop2MR1FlowConnector

Hi,

In the code below I am trying to read data from one MySQL server and load it into a table on another MySQL server.

The job completes successfully, but the target table is not loaded with any records.

Instead of loading data into the target table on the target server (ip2), it loads the data into the source server (ip1).

It looks like a single JobConf is created for the single MR job, so just one connection is created instead of two (as per the code in cascading.jdbc.db.DBConfiguration). If I add a GroupBy pipe between the source tap and the target tap it works fine, since that creates two MR jobs, each with its own JobConf (see the sketch after the code below).

Currently I am using Hadoop2MR1FlowConnector() since we are not using Tez for this project. It works fine with Hadoop2TezFlowConnector().

public class TableToTable {

  public static void main(String[] args) throws SQLException, ClassNotFoundException, IOException {

    String jdbcurl1 = "jdbc:mysql://ip2:3306/cascading_jdbc?user=root&password=mysql";
    String jdbcurl2 = "jdbc:mysql://ip1:3306/cascading_jdbc?user=root&password=mysql";

    String driverName = "com.mysql.jdbc.Driver";

    Class<? extends DBInputFormat> inputFormatClass = MySqlDBInputFormat.class;

    String TESTING_TABLE_NAME_SOURCE = "testingtable12";
    String TESTING_TABLE_NAME_TARGET = "testingtable13";

    Fields fields = new Fields( new Comparable[]{"num", "lwr", "upr"}, new Type[]{int.class, String.class, String.class} );
    Pipe parsePipe = new Pipe( "insert" );
    String[] columnNames = {"num", "lwr", "upr"};
    String[] columnDefs = {"INT NOT NULL", "VARCHAR(100) NOT NULL", "VARCHAR(100) NOT NULL"};

    String[] primaryKeys = null;

    TableDesc tableDescS = new TableDesc( TESTING_TABLE_NAME_SOURCE, columnNames, columnDefs, primaryKeys );
    JDBCScheme schemeS = new JDBCScheme( inputFormatClass, fields, columnNames );
    JDBCTap sourceTap = new JDBCTap( jdbcurl1, driverName, tableDescS, schemeS, SinkMode.REPLACE );
    sourceTap.setBatchSize( 1 );

    TableDesc tableDescT = new TableDesc( TESTING_TABLE_NAME_TARGET, columnNames, columnDefs, primaryKeys );
    JDBCScheme schemeT = new JDBCScheme( inputFormatClass, fields, columnNames );
    JDBCTap targetTapT = new JDBCTap( jdbcurl2, driverName, tableDescT, schemeT, SinkMode.REPLACE );
    targetTapT.setBatchSize( 1 );

    Flow<?> parseFlow = new Hadoop2MR1FlowConnector().connect( sourceTap, targetTapT, parsePipe );
    parseFlow.complete();
  }
}
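For reference, a hedged sketch of the GroupBy workaround mentioned above (the grouping field is chosen arbitrarily; this is an illustration, not a recommended design):

// force a shuffle, and therefore a second MR job, so the source and sink
// ends of the flow each get their own JobConf and connection settings
Pipe parsePipe = new Pipe( "insert" );
parsePipe = new GroupBy( parsePipe, new Fields( "num" ) );

Flow<?> parseFlow = new Hadoop2MR1FlowConnector().connect( sourceTap, targetTapT, parsePipe );
parseFlow.complete();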

https://groups.google.com/forum/#!topic/cascading-user/eBiV9vaomQo

exception handling bug

In https://github.com/Cascading/cascading-jdbc/blob/2.5/cascading-jdbc-core/src/main/java/cascading/jdbc/db/DBOutputFormat.java#L135

An exception is created without a message, which leads to a NullPointerException when the code tries to substring the exception's message.

I'm also throwing together a test case for how I even got this in the first place, see:
https://groups.google.com/forum/#!topic/cascading-user/nZ-xRtEDGpI

Error: java.lang.NullPointerException
    at cascading.jdbc.db.DBOutputFormat.manageBatchProcessingError(DBOutputFormat.java:158)
    at cascading.jdbc.db.DBOutputFormat.executeBatch(DBOutputFormat.java:136)

Issue with new table exist logic

So, I had a chance to have a user on Vertica test 2.5.4-wip-84:

14/07/23 03:57:00 INFO jdbc.JDBCTap: testing table exists with select 1 from TABLE where 1 = 0
14/07/23 03:57:00 INFO jdbc.JDBCTap: creating connection: JDBC STRING
14/07/23 03:57:00 INFO jdbc.JDBCTap: executing query: select 1 from TABLE where 1 = 0
14/07/23 03:57:00 INFO jdbc.JDBCTap: 'TABLE' exists? false

Note: I replaced our table name with TABLE and the JDBC string with JDBC STRING. The table exists, but it wasn't detected. I had them try using the WIP:

14/07/23 17:22:13 INFO jdbc.JDBCTap: testing if table exists with DatabaseMetaData

That's it... the lack of logging alone seems like a regression, but it also didn't work: it then tried to create the table (and failed, since the table exists).

Tap for Apache Phoenix

Does anyone know if a Tap exists for Phoenix? I see a SlideShare mention that HomeAway created one, but I can't seem to find it; maybe it's not open sourced. If one doesn't exist, would this be the right place? Create a sub-project like it's been done for Oracle, MySQL, etc.?

Potential issue with TableDesc w.r.t table existence

https://github.com/Cascading/cascading-jdbc/blob/2.5/cascading-jdbc-core/src/main/java/cascading/jdbc/TableDesc.java#L171-L182

It seems weird to me that getTableExistsQuery() checks canQueryExistence(). It seems like we are conflating two ideas. If canQueryExistence() is false, it should mean that there is no way to query existence in this database, right? But in getTableExistsQuery(), it uses this to mean "fall back to the default for the database," which is odd. This means that a user who first checks canQueryExistence() could get false because the tableDesc is not set, but getTableExistsQuery() could still return a valid query.

Seems like a bug to me. Thoughts?

[Feature Request] JDBCTap that can run in Local mode

Currently JDBCTap is defined as Tap<JobConf, RecordReader, OutputCollector> which is incompatible with Cascading's local mode, which uses java.util.Properties rather than org.apache.hadoop.mapred.JobConf.

It would be useful to run cascading-jdbc in local mode, for a number of reasons, such as

  1. Testing. For example, we would like to unit test our JDBC integration in Scalding, but not being able to execute in local mode makes that impractical.
  2. Small jobs. During development, or production, it may be useful to execute a job in local mode that can nevertheless connect to DBs.
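For illustration only, the difference in the generic signatures looks roughly like the sketch below; the local-mode type parameters shown are a hypothetical choice, not an existing class in this project:

// existing Hadoop-bound tap (as described above):
//   public class JDBCTap extends Tap<JobConf, RecordReader, OutputCollector> { ... }
//
// a hypothetical local-mode variant would need to bind to Properties instead, e.g.:
//   public class LocalJDBCTap extends Tap<Properties, ResultSet, PreparedStatement> { ... }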

Problems with schemas under Redshift

I'm using cascading-jdbc-redshift:2.5.4-wip-83, and I can't get it to sink into my Redshift instance at all.

The table I'm trying to sink into is called etl_demo. If I just use that as the table name for the RedshiftTableDesc, and run the job, I get this error:

cascading.tap.TapException: SQL error code: 0 executing update statement: CREATE TABLE etl_demo ( date DATE, store BIGINT, handle_time INTEGER, count_calls INTEGER, avg_handle_time FLOAT )  DISTKEY (store)  SORTKEY (store, date) 
        at cascading.jdbc.JDBCTap.executeUpdate(JDBCTap.java:478)
        at cascading.jdbc.JDBCTap.createResource(JDBCTap.java:597)
        at cascading.jdbc.RedshiftTap.createResource(RedshiftTap.java:172)
        at cascading.jdbc.JDBCTap.sinkConfInit(JDBCTap.java:395)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:138)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:48)
        at cascading.flow.hadoop.HadoopFlowStep.initFromSink(HadoopFlowStep.java:422)
        at cascading.flow.hadoop.HadoopFlowStep.getInitializedConfig(HadoopFlowStep.java:101)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:201)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:69)
        at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:768)
        at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1229)
        at cascading.flow.BaseFlow.initialize(BaseFlow.java:199)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:259)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
        at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
        at com.progressfin.analytics.cascading.ETLDemo.buildQuery(ETLDemo.java:108)
        at com.progressfin.analytics.cascading.ETLDemo.main(ETLDemo.java:60)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.postgresql.util.PSQLException: ERROR: no schema has been selected to create in
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2077)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1810)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:498)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:372)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:300)
        at cascading.jdbc.JDBCTap.executeUpdate(JDBCTap.java:470)
        ... 22 more

So if I'm reading this right, Redshift doesn't like the CREATE TABLE statement because the user doesn't specify a schema to create the table in. If I try "public.etl_demo" as the table name for the RedshiftTableDesc, then I get this:

14/07/07 14:42:08 INFO jdbc.JDBCTap: executing update: CREATE TABLE public.etl_demo ( date DATE, store BIGINT, handle_time INTEGER, count_calls INTEGER, avg_handle_time FLOAT )  DISTKEY (store)  SORTKEY (store, date) 
14/07/07 14:42:09 INFO jdbc.JDBCTap: testing if table exists with DatabaseMetaData
14/07/07 14:42:09 INFO jdbc.JDBCTap: creating connection: jdbc:postgresql://something.example.com:5439/dev    
Exception in thread "main" cascading.flow.planner.PlannerException: could not build flow from assembly: [unable to create table: public.etl_demo]
    at cascading.flow.planner.FlowPlanner.handleExceptionDuringPlanning(FlowPlanner.java:576)
    at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:265)
    at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
    at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
    at com.progressfin.analytics.cascading.ETLDemo.buildQuery(ETLDemo.java:108)
    at com.progressfin.analytics.cascading.ETLDemo.main(ETLDemo.java:60)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: cascading.tap.TapException: unable to create table: public.etl_demo
    at cascading.jdbc.JDBCTap.sinkConfInit(JDBCTap.java:396)
    at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:138)
    at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:48)
    at cascading.flow.hadoop.HadoopFlowStep.initFromSink(HadoopFlowStep.java:422)
    at cascading.flow.hadoop.HadoopFlowStep.getInitializedConfig(HadoopFlowStep.java:101)
    at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:201)
    at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:69)
    at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:768)
    at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1229)
    at cascading.flow.BaseFlow.initialize(BaseFlow.java:199)
    at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:259)
    ... 9 more

Here the CREATE TABLE succeeds, but apparently the DatabaseMetaData check fails because it looks for a table with name public.etl_demo—but no such table exists.

And if I create the table by hand in Redshift, and use "etl_demo" in the RedshiftTableDesc, I get this (SinkMode.REPLACE):

14/07/07 14:54:42 INFO jdbc.JDBCTap: deleting table: etl_demo
14/07/07 14:54:42 INFO jdbc.JDBCTap: creating connection: jdbc:postgresql://something.example.com:5439/dev
14/07/07 14:54:42 INFO jdbc.JDBCTap: executing update: DROP TABLE etl_demo
14/07/07 14:54:43 WARN jdbc.JDBCTap: unable to drop table: etl_demo
14/07/07 14:54:43 WARN jdbc.JDBCTap: sql failure
org.postgresql.util.PSQLException: ERROR: table "etl_demo" does not exist
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2077)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1810)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:498)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:372)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:300)
        at cascading.jdbc.JDBCTap.executeUpdate(JDBCTap.java:470)
        at cascading.jdbc.JDBCTap.deleteResource(JDBCTap.java:623)
        at cascading.jdbc.RedshiftTap.deleteResource(RedshiftTap.java:183)
        at cascading.jdbc.JDBCTap.sinkConfInit(JDBCTap.java:392)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:138)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:48)
        at cascading.flow.hadoop.HadoopFlowStep.initFromSink(HadoopFlowStep.java:422)
        at cascading.flow.hadoop.HadoopFlowStep.getInitializedConfig(HadoopFlowStep.java:101)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:201)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:69)
        at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:768)
        at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1229)
        at cascading.flow.BaseFlow.initialize(BaseFlow.java:199)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:259)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
        at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
        at com.progressfin.analytics.cascading.ETLDemo.buildQuery(ETLDemo.java:108)
        at com.progressfin.analytics.cascading.ETLDemo.main(ETLDemo.java:60)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" cascading.flow.planner.PlannerException: could not build flow from assembly: [unable to drop table: etl_demo]
        at cascading.flow.planner.FlowPlanner.handleExceptionDuringPlanning(FlowPlanner.java:576)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:265)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
        at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
        at com.progressfin.analytics.cascading.ETLDemo.buildQuery(ETLDemo.java:108)
        at com.progressfin.analytics.cascading.ETLDemo.main(ETLDemo.java:60)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: cascading.tap.TapException: unable to drop table: etl_demo
        at cascading.jdbc.JDBCTap.sinkConfInit(JDBCTap.java:393)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:138)
        at cascading.jdbc.RedshiftTap.sinkConfInit(RedshiftTap.java:48)
        at cascading.flow.hadoop.HadoopFlowStep.initFromSink(HadoopFlowStep.java:422)
        at cascading.flow.hadoop.HadoopFlowStep.getInitializedConfig(HadoopFlowStep.java:101)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:201)
        at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:69)
        at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:768)
        at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1229)
        at cascading.flow.BaseFlow.initialize(BaseFlow.java:199)
        at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:259)
        ... 9 more

My take here is that it can't drop the table because it doesn't know which schema to look in, once more.

If I create the table by hand and use SinkMode.UPDATE instead, then I get past that, only to fail further down when it tries an INSERT statement against the unqualified table name:

14/07/07 15:07:56 ERROR db.DBOutputFormat: unable to execute update batch [msglength: 207][totstmts: 453][crntstmts: 453][batch: 1000] Batch entry 0 INSERT INTO etl_demo (date,store,handle_time,count_calls,avg_
org.postgresql.util.PSQLException: ERROR: relation "etl_demo" does not exist
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2077)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1810)
        at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1065)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:398)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2725)
        at cascading.jdbc.db.DBOutputFormat$DBRecordWriter.executeBatch(DBOutputFormat.java:113)
        at cascading.jdbc.db.DBOutputFormat$DBRecordWriter.close(DBOutputFormat.java:88)
        at cascading.jdbc.JDBCTapCollector.close(JDBCTapCollector.java:103)
        at cascading.flow.stream.SinkStage.cleanup(SinkStage.java:120)
        at cascading.flow.stream.StreamGraph.cleanup(StreamGraph.java:176)
        at cascading.flow.hadoop.FlowReducer.close(FlowReducer.java:157)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:471)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:445)

(Note that it's trying to INSERT into the Redshift table instead of staging into S3 and then using a COPY command, as it's supposed to. I don't know if this is a problem in my end.)

I think it may be possible to work around all of these issues by configuring the Redshift server to add a default search_path, but I don't think this should be necessary to use the tap.

Looking at commit 37364da, perhaps the TableDesc should allow you to specify a schema name in addition to the table name? The DatabaseMetaData call is using null for the schema name, so the following logic (from JDBCTap.java) doesn't work if a database has more than one table with the same name in different schemas:

  DatabaseMetaData dbm = connection.getMetaData();

  // [My comment: the second `null` means "in any schema", which means
  // that we can get false positives in this check.]
  tables = dbm.getTables( null, null, tableDesc.getTableName(), null );

  if ( tables.next() )
    return true;
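A hedged illustration of that suggestion (the schemaName variable is hypothetical; TableDesc has no such accessor today): passing the schema pattern through to the metadata lookup would avoid the cross-schema false positives.

  DatabaseMetaData dbm = connection.getMetaData();

  // scope the lookup to the table's own schema instead of matching it in any schema
  tables = dbm.getTables( null, schemaName, tableDesc.getTableName(), null );

  if ( tables.next() )
    return true;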
