
lingual's Introduction

Overview

Lingual is true SQL for Cascading and Apache Hadoop.

Lingual includes JDBC drivers, a SQL command shell, and a catalog manager for creating schemas and tables.

Lingual is under active development on the wip-2.0 branch. All wip releases are made available from files.concurrentinc.com. Final releases can be found under files.cascading.org.

Lingual requires no installation beyond the optional command line utilities.

Lingual is based on the Cascading distributed processing engine and the Optiq SQL parser and rule engine.

See the Lingual page for installation and usage.
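
For a quick orientation, connecting through the JDBC driver looks roughly like the following sketch. It assumes the Lingual client jar is on the classpath and that a schema named EXAMPLE with a table EMPLOYEE has already been registered in the catalog; the schema, table, and class names are placeholders, not part of the distribution.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LingualJdbcExample
  {
  public static void main( String[] args ) throws Exception
    {
    // register the Lingual JDBC driver
    Class.forName( "cascading.lingual.jdbc.Driver" );

    // "local" runs in-process; "hadoop" submits the query to a cluster
    Connection connection = DriverManager.getConnection( "jdbc:lingual:local" );
    Statement statement = connection.createStatement();

    // EXAMPLE.EMPLOYEE is a placeholder schema/table created via the catalog
    ResultSet resultSet = statement.executeQuery( "select * from \"EXAMPLE\".\"EMPLOYEE\"" );

    while( resultSet.next() )
      System.out.println( resultSet.getString( 1 ) );

    connection.close();
    }
  }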

Reporting Issues

The best way to report an issue is to add a new test to SimpleSqlPlatformTest along with the expected result set and submit a pull request on GitHub.

Failing that, feel free to open an issue on the Cascading/Lingual project site or mail the mailing list.

Developing

Running:

> gradle idea

from the root of the project will create all IntelliJ project and module files, and retrieve all dependencies.

lingual's People

Contributors: ceteri, cwensel, fs111, joeposner, julianhyde

lingual's Issues

Lingual shell gives untruthful exit codes

vagrant@master:~$ echo "select count(*) from \"api_in\".\"api_full_stats\";" | lingual shell
Concurrent, Inc - Lingual 1.0.2
only 10,000 rows will be displayed
sqlline version 1.1.6
0: jdbc:lingual:hadoop> select count(*) from "api_in"."api_full_stats";
{utcTimestamp=1392573579207, currentTimestamp=1392573579207, localTimestamp=1392573579207, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
Warning: exception while executing query: flow failed: unhandled exception: unable to create connection: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.: xxx.rds.amazonaws.com (state=,code=0)
Error: exception while executing query (state=,code=0)
0: jdbc:lingual:hadoop> Closing: SC



vagrant@master:~$ echo $?
0

How to apply custom filters with Lingual queries on HBase...

We store operational data in a message store (HBase). I'm trying to take advantage of the custom filters we've written over this store when a Lingual query is run on it; otherwise it's a full table scan for each query.

Here is the code I'm trying to run on my HBase instance:
https://gist.github.com/prodeezy/9000271

The approach I'm taking is to use a custom Tap that takes in an HBase Scan object. It overrides openForRead() and creates a TableRecordReader with this Scan object applied to it. The code for that method is here: https://gist.github.com/prodeezy/9000365

I'd like to know if it is possible to do what I'm trying to achieve? If so, is this the right way to do it?

I currently get the following error:

cascading.flow.planner.PlannerException: could not build flow from assembly: [unable to pack object: cascading.flow.hadoop.HadoopFlowStep]
at cascading.flow.planner.FlowPlanner.handleExceptionDuringPlanning(FlowPlanner.java:576)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:263)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:80)
at cascading.flow.FlowConnector.connect(FlowConnector.java:459)
at cascading.hbase.HBaseStaticTest.testJdbcConnect(HBaseStaticTest.java:104)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:77)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:195)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:63)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: cascading.flow.FlowException: unable to pack object: cascading.flow.hadoop.HadoopFlowStep
at cascading.flow.hadoop.HadoopFlowStep.pack(HadoopFlowStep.java:195)
at cascading.flow.hadoop.HadoopFlowStep.getInitializedConfig(HadoopFlowStep.java:170)
at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:201)
at cascading.flow.hadoop.HadoopFlowStep.createFlowStepJob(HadoopFlowStep.java:69)
at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:676)
at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1181)
at cascading.flow.BaseFlow.initialize(BaseFlow.java:199)
at cascading.flow.hadoop.planner.HadoopPlanner.buildFlow(HadoopPlanner.java:257)
... 31 more
Caused by: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Scan
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:346)
at java.util.HashMap.writeObject(HashMap.java:1100)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:975)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1480)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:346)
at cascading.flow.hadoop.util.JavaObjectSerializer.serialize(JavaObjectSerializer.java:57)
at cascading.flow.hadoop.util.HadoopUtil.serializeBase64(HadoopUtil.java:265)
at cascading.flow.hadoop.HadoopFlowStep.pack(HadoopFlowStep.java:191)
... 38 more
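
The last frame is the real problem: Hadoop job configuration packing uses Java serialization, and org.apache.hadoop.hbase.client.Scan is not Serializable, so any Tap or Scheme that holds a Scan field cannot be packed into the HadoopFlowStep. A minimal sketch of one way around this, assuming the scan can be described by serializable values (row range, column family) and rebuilt lazily; the class and field names here are hypothetical, not part of cascading-hbase:

import java.io.Serializable;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical holder: keep only Serializable state, never the Scan itself,
// so the enclosing Tap/Scheme survives HadoopFlowStep.pack().
public class ScanSpec implements Serializable
  {
  private final String startRow;
  private final String stopRow;
  private final String columnFamily;

  // rebuilt on demand, never serialized
  private transient Scan scan;

  public ScanSpec( String startRow, String stopRow, String columnFamily )
    {
    this.startRow = startRow;
    this.stopRow = stopRow;
    this.columnFamily = columnFamily;
    }

  public Scan getScan()
    {
    if( scan == null )
      {
      scan = new Scan( Bytes.toBytes( startRow ), Bytes.toBytes( stopRow ) );
      scan.addFamily( Bytes.toBytes( columnFamily ) );
      }

    return scan;
    }
  }

A custom Tap along the lines of the gists above could then hold a ScanSpec instead of a Scan and call getScan() inside openForRead(), so nothing non-serializable is captured when the flow step is packed.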

Improve error reporting

When connecting Lingual to Postgresql using cascading-jdbc, I made a mistake in the JDBC connection string:
jdbc:postgres://localhost:5432/sesame_store instead of
jdbc:postgresql://localhost:5432/sesame_store

After trying to run a query I got just the following error reports in the shell:

Warning: exception while executing query (state=,code=0)
Error: exception while executing query (state=,code=0)

It would be nice to see something more informative; it took me quite a long time to debug the problem. I'll submit a pull request with a suggested fix, after which the message would be the following:

Warning: exception while executing query: null: unable to create connection: No suitable driver found for jdbc:postgres://localhost:5432/mydb?user=postgres&password=postgres (state=,code=0)
Error: exception while executing query (state=,code=0)

It would perhaps also be nice to mention that one can increase logging by modifying log4j.properties in ~/.lingual-client/lib/lingual-client-1.0.0-wip-???.jar

SQLTimestampCoercibleType coercion roundtrip fails; timezone related

I'm using Lingual 1.1.0. This is the test case I've written, which I understand ought to succeed:

import static org.hamcrest.Matchers.*;
import static org.junit.Assert.assertThat;

@Test
public void sqlTimestampCoercibleTypeRoundtrip() {
    Timestamp ts = new Timestamp(System.currentTimeMillis());
    Fields fields = new Fields("ts", new SQLTimestampCoercibleType());
    Tuple tuple = Tuple.size(1);
    TupleEntry entry = new TupleEntry(fields, tuple);
    entry.setObject("ts", ts);

    assertThat("Timestamp value roundtrip", entry.getObject("ts", Timestamp.class), equalTo((Object)ts));
}

Result of running the test case:

java.lang.AssertionError: Timestamp value roundtrip
Expected: <2014-07-07 22:07:12.082>
     but: was <2014-07-08 05:07:12.082>
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
    at org.junit.Assert.assertThat(Assert.java:865)
    at com.progressfin.sqltaps.SQLTapsTest.sqlTimestampCoercibleTypeRoundtrip(SQLTapsTest.java:65)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:160)

It's 7 hours off. I'm in California, which is UTC-7 right now.
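
The seven-hour discrepancy matches the local UTC offset, which suggests the coercion between java.sql.Timestamp and the internal millisecond value applies the JVM default time zone on only one side of the roundtrip. A hedged workaround sketch (not a fix), assuming the offset really does come from TimeZone.getDefault(): pin the default time zone to UTC before exercising the coercible type.

import java.util.TimeZone;

import org.junit.BeforeClass;

public class TimestampRoundtripWorkaround
  {
  // Workaround sketch, assuming the offset comes from the JVM default time zone:
  // pin the default to UTC (equivalent to running the JVM with -Duser.timezone=UTC)
  // before exercising SQLTimestampCoercibleType.
  @BeforeClass
  public static void pinTimeZoneToUtc()
    {
    TimeZone.setDefault( TimeZone.getTimeZone( "UTC" ) );
    }
  }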

Unable to build: Could not find property 'ext' on root project 'lingual'

I have checked out Lingual and tried to set up Eclipse project files:

gradle eclipse

I now get:

FAILURE: Build failed with an exception.

* Where:
Script '.../lingual/etc/properties.gradle' line: 33

* What went wrong:
A problem occurred evaluating script.
Cause: Could not find property 'ext' on root project 'lingual'.

* Try:
Run with -s or -d option to get more details. Run with -S option to get the full (very verbose) stacktrace.

BUILD FAILED

Total time: 1.963 secs

How to get result pipe as input to next query without writing to file?

I need to understand whether there is any way in Lingual to use a result pipe as input to the next query.

The code is like below:

Tap empTap = new FileTap( new SQLTypedTextDelimited( ",", "\"" ),
    "C:/Users/Suryawanship/Desktop/inputs/emp.csv", SinkMode.KEEP );
Tap salesTap = new FileTap( new SQLTypedTextDelimited( ",", "\"" ),
    "C:/Users/Suryawanship/Desktop/inputs/dept.csv", SinkMode.KEEP );

Tap resultsTap = new FileTap( new SQLTypedTextDelimited( ",", "\"" ),
    "C:/Users/Suryawanship/Desktop/output/emp-dept.csv", SinkMode.REPLACE );

Pipe empPipe = new Pipe( "example.employee" );
Pipe deptPipe = new Pipe( "example.dept" );
Pipe resultPipe = new Pipe( "results" );

/* Identity identity = new Identity( new Fields( "1", "2", "3", "4" ) );
resultPipe = new Each( resultPipe, Fields.ALL, identity ); */

FlowDef flowDef = FlowDef.flowDef()
  .setName( "sql flow" )
  .addSource( empPipe, empTap )
  .addSource( deptPipe, salesTap )
  .addSink( resultPipe, resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner().setSql( statement );
flowDef.addAssemblyPlanner( sqlPlanner );

Instead of writing resultsTap to a file, can we directly use it in another query as an input pipe? If yes, please advise how to proceed.
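
I am not aware of a way to hand a result Pipe directly to a second SQLPlanner without materializing it, but the usual pattern is sketched below, continuing from the taps and pipes above: reuse the first flow's sink Tap as the source Tap of a second FlowDef that carries the next statement. The finalTap path, the pipe names, and secondStatement are placeholders.

// Sketch: materialize the first result, then feed the same Tap into a second SQL flow.
Tap finalTap = new FileTap( new SQLTypedTextDelimited( ",", "\"" ),
    "C:/Users/Suryawanship/Desktop/output/final.csv", SinkMode.REPLACE );

FlowDef secondFlowDef = FlowDef.flowDef()
  .setName( "second sql flow" )
  .addSource( new Pipe( "example.results" ), resultsTap )  // first flow's sink as input
  .addSink( new Pipe( "final" ), finalTap );

secondFlowDef.addAssemblyPlanner( new SQLPlanner().setSql( secondStatement ) );

// complete the first flow before connecting the second, so the intermediate file exists
LocalFlowConnector connector = new LocalFlowConnector();
connector.connect( flowDef ).complete();
connector.connect( secondFlowDef ).complete();

The intermediate file is still written; each SQLPlanner plans a single statement into its own flow, so materializing between statements (or composing the flows in a Cascade) appears to be the expected pattern.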

Column aliasing in INSERT INTO ... SELECT ... doesn't work in Lingual 1.1 with HBase

Test script:

#!/bin/bash

# Config
hdfs_path=/local/lingual-alias-test/
hbase_table=out
hbase_col_family=fields
export LINGUAL_PLATFORM=hadoop
export HADOOP_USER_NAME=hadoop

# Test data
printf "\n1\ta\n2\tb\n3\tc\n" > /tmp/alias.tsv
hadoop fs -copyFromLocal /tmp/alias.tsv "${hdfs_path}alias.tsv"

# Lingual 1.1
lingual catalog --init
lingual catalog --provider --add cascading:cascading-hbase:2.2.0:provider
lingual catalog --schema IN --add
lingual catalog --schema IN --stereotype TSVFILE -add \
--columns A,B \
--types   string,string
lingual catalog --schema IN --table IN --stereotype TSVFILE -add "${hdfs_path}" --format tsv
lingual catalog --schema OUT --add
lingual catalog --schema OUT --protocol hbase --add --provider hbase
lingual catalog --schema OUT --stereotype HTABLE -add \
--columns A,B \
--types   string,string
lingual catalog --schema OUT --format hbase --add --properties="family=${hbase_col_family}" --provider hbase
lingual catalog --schema OUT --table OUT --stereotype HTABLE -add "${hbase_table}" --protocol hbase --format hbase --provider hbase

# Works
lingual shell --sql - <<- EOQ
    INSERT INTO "OUT"."OUT"
        select *
        from "IN"."IN";
    SELECT * from "OUT"."OUT";
EOQ

# Doesn't work
lingual shell --sql - <<- EOQ
    INSERT INTO "OUT"."OUT"
        select
            B as A,
            A as B
        from "IN"."IN";
EOQ

Error is:

Warning: exception while executing query: could not build flow from assembly: [[HBaseScheme[['A', 'B' ...][cascading.hbase.HBaseFactory.createScheme(HBaseFactory.java:120)] unable to resolve scheme sink selector: [{1}:'A'], with incoming: [{2}:'$0', '$1' | String, String]]: [HBaseScheme[['A', 'B' ...][cascading.hbase.HBaseFactory.createScheme(HBaseFactory.java:120)] unable to resolve scheme sink selector: [{1}:'A'], with incoming: [{2}:'$0', '$1' | String, String]: could not select fields: [{1}:'A'], from: [{2}:'$0', '$1' | String, String] (state=,code=0)
Error: exception while executing query (state=,code=0)

How to run Lingual SQL code as a script file and pass variable values as parameters or from a property file

Hi,
I am converting Teradata queries to Lingual SQL.
In the Teradata script files we have variables such as ${Teradata_schema}.TableName, whose values need to be passed from the command line or a property file.

I have not found a way to do either of the following:

  1. Run the Lingual SQL query as a script file.
  2. Pass the parameter from the command line or read it from a property file.

Please help me with this issue, as I am stuck and not able to move forward.

Regards,
Danish

Time part of timestamp fields is parsed as 00:00:00

I'm experiencing a curious problem on Lingual 1.1.0. Everything is working as expected, but when trying to parse a timestamp field like "2014-01-01 01:02:03", the data comes out as "2014-01-01 00:00:00".

Test case:

lingual catalog --init
lingual catalog --schema timestampTest --add
lingual catalog --schema timestampTest --stereotype times --add --columns ts --types timestamp
lingual catalog --schema timestampTest --table times --stereotype times --add data.csv

echo -e "ts\n\"2014-01-01 01:02:03\"" > data.csv

echo -e "select * from \"timestampTest\".\"times\";\n!quit\n" | lingual shell

Outputs:

+------------------------+
|           ts           |
+------------------------+
| 2014-01-01 00:00:00.0  |
+------------------------+

One workaround I have found is to add the table before the stereotype:

lingual catalog --init
lingual catalog --schema timestampTest --add
lingual catalog --schema timestampTest --table times --stereotype times --add data.csv
lingual catalog --schema timestampTest --stereotype times --add --columns ts --types timestamp

echo -e "ts\n\"2014-01-01 01:02:03\"" > data.csv

echo -e "select * from \"timestampTest\".\"times\";\n!quit\n" | lingual shell

Which outputs the expected:

+----------------------+
|          ts          |
+----------------------+
| 2014-01-01 01:02:03  |
+----------------------+

But this also adds a stereotype named "data" in the catalog, which doesn't seem right.

The documentation suggests that the stereotype should be added before the table, but maybe I'm missing something there?

java.lang.AssertionError: table may be null; expr may not

I'm trying to set up a simple log analysis application based on Cascading & Lingual, but I run into the following error:

2013-12-28 00:06:31,899 INFO  tap.TapSchema (TapSchema.java:addTapTableFor(125)) - adding table on schema: null, table: LOG_ENTRIES, fields: 'ip', 'time', 'method', 'event', 'status', 'size' | String, String, String, String, String, String, identifier: log_analysis_tcsv.txt
2013-12-28 00:06:31,905 INFO  tap.TapSchema (TapSchema.java:addTapTableFor(125)) - adding table on schema: null, table: results, fields: 'ip', identifier: log_analysis_ips.txt

java.lang.AssertionError: table may be null; expr may not
    at net.hydromatic.optiq.prepare.OptiqPrepareImpl$RelOptTableImpl.<init>(OptiqPrepareImpl.java:806)
    at net.hydromatic.optiq.prepare.OptiqPrepareImpl$RelOptTableImpl.<init>(OptiqPrepareImpl.java:784)
...
    at net.hydromatic.optiq.prepare.OptiqPrepareImpl.prepareSql(OptiqPrepareImpl.java:195)
    at cascading.lingual.flow.SQLPlanner.resolveTails(SQLPlanner.java:111)
    at cascading.flow.planner.FlowPlanner.resolveAssemblyPlanners(FlowPlanner.java:150)

My application code:

Tap inTap = getPlatform().getTap( new SQLTypedTextDelimited( ",", "\"" ), IN_PATH, SinkMode.KEEP );

Tap outTap = getPlatform().getTap( new SQLTypedTextDelimited( new Fields( "ip" ), ",", "\"" ), OUT_PATH, SinkMode.REPLACE );

// Define and execute flow
FlowDef flowDef = FlowDef.flowDef()
    .addSource( "LOG_ENTRIES", inTap )
    .addSink( "results", outTap );

String statement = "SELECT DISTINCT ip FROM LOG_ENTRIES";

SQLPlanner sqlPlanner = new SQLPlanner().setSql( statement );
flowDef.addAssemblyPlanner( sqlPlanner );

getPlatform().getFlowConnector().connect( flowDef ).complete();

INSERT INTO ... SELECT ... doesn't work using column positions in Lingual 1.1

In SQL, INSERT INTO ... SELECT ... happily works without column aliases, using column positions. This doesn't work in Lingual 1.1:

Lingual 1.1

#!/bin/bash

# Config
hdfs_path=/local/lingual-alias-test/
hbase_table=out
hbase_col_family=fields
export LINGUAL_PLATFORM=hadoop
export HADOOP_USER_NAME=hadoop

# Test data
printf "\n1\ta\n2\tb\n3\tc\n" > /tmp/alias.tsv
hadoop fs -copyFromLocal /tmp/alias.tsv "${hdfs_path}alias.tsv"

# Lingual 1.1
lingual catalog --init
lingual catalog --provider --add cascading:cascading-hbase:2.2.0:provider
lingual catalog --schema IN --add
lingual catalog --schema IN --stereotype TSVFILE -add \
--columns A,B \
--types   string,string
lingual catalog --schema IN --table IN --stereotype TSVFILE -add "${hdfs_path}" --format tsv
lingual catalog --schema OUT --add
lingual catalog --schema OUT --protocol hbase --add --provider hbase
lingual catalog --schema OUT --stereotype HTABLE -add \
--columns C,D \
--types   string,string
lingual catalog --schema OUT --format hbase --add --properties="family=${hbase_col_family}" --provider hbase
lingual catalog --schema OUT --table OUT --stereotype HTABLE -add "${hbase_table}" --protocol hbase --format hbase --provider hbase

# Doesn't work
lingual shell --sql - <<- EOQ
    INSERT INTO "OUT"."OUT"
        select *
        from "IN"."IN";
    SELECT * from "OUT"."OUT";
EOQ

# Doesn't work
lingual shell --sql - <<- EOQ
    INSERT INTO "OUT"."OUT"
        select
            A,
            B
        from "IN"."IN";
EOQ

Error output in both cases:

{utcTimestamp=1398423862581, currentTimestamp=1398423862581, localTimestamp=1398423862581, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
Warning: exception while executing query: could not build flow from assembly: [[HBaseScheme[['C', 'D' ...][cascading.hbase.HBaseFactory.createScheme(HBaseFactory.java:120)] unable to resolve scheme sink selector: [{1}:'C'], with incoming: [{2}:'A', 'B' | String, String]]: [HBaseScheme[['C', 'D' ...][cascading.hbase.HBaseFactory.createScheme(HBaseFactory.java:120)] unable to resolve scheme sink selector: [{1}:'C'], with incoming: [{2}:'A', 'B' | String, String]: could not select fields: [{1}:'C'], from: [{2}:'A', 'B' | String, String] (state=,code=0)
Error: exception while executing query (state=,code=0)

Postgres

CREATE TABLE int (
a text NOT NULL,
b text NOT NULL
);

CREATE TABLE outt (
c text NOT NULL,
d text NOT NULL
);

insert into int values ('1', 'a'), ('2', 'b');

insert into outt select * from int -- works

insert into outt select a, b from int -- also works

Lingual 1.1 silently drops first line of TSV source files

Setup:

#!/bin/bash

# Config
hdfs_path=/local/lingual-tsv-test/
hbase_table=out
hbase_col_family=fields
export LINGUAL_PLATFORM=hadoop
export HADOOP_USER_NAME=hadoop

# Lingual 1.1
lingual catalog --init
lingual catalog --provider --add cascading:cascading-hbase:2.2.0:provider
lingual catalog --schema IN --add
lingual catalog --schema IN --stereotype TSVFILE -add \
--columns A,B \
--types   string,string
lingual catalog --schema IN --table IN --stereotype TSVFILE -add "${hdfs_path}" --format tsv

First row lost

Script:

# Test data - note no newline at start
printf "1\ta\n2\tb\n3\tc\n" > /tmp/lossy.tsv
hadoop fs -copyFromLocal /tmp/lossy.tsv "${hdfs_path}lossy.tsv"

# First line missing
lingual shell --sql - <<- EOQ
    SELECT * from "IN"."IN";
EOQ

Output:

{utcTimestamp=1398421144285, currentTimestamp=1398421144285, localTimestamp=1398421144285, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
{utcTimestamp=1398421144285, currentTimestamp=1398421144285, localTimestamp=1398421144285, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
{utcTimestamp=1398421144285, currentTimestamp=1398421144285, localTimestamp=1398421144285, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
+----+----+
| A  | B  |
+----+----+
| 2  | b  |
| 3  | c  |
+----+----+
2 rows selected (1.26 seconds)

Newline at start fixes it

Script:

# Test data - add newline at start
printf "\n1\ta\n2\tb\n3\tc\n" > /tmp/not-lossy.tsv
hadoop fs -rmr "${hdfs_path}"
hadoop fs -copyFromLocal /tmp/not-lossy.tsv "${hdfs_path}not-lossy.tsv"

# Works - all rows returned
lingual shell --sql - <<- EOQ
    SELECT * from "IN"."IN";
EOQ

Output:

{utcTimestamp=1398421279816, currentTimestamp=1398421279816, localTimestamp=1398421279816, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
{utcTimestamp=1398421279816, currentTimestamp=1398421279816, localTimestamp=1398421279816, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
{utcTimestamp=1398421279816, currentTimestamp=1398421279816, localTimestamp=1398421279816, timeZone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]}
+----+----+
| A  | B  |
+----+----+
| 1  | a  |
| 2  | b  |
| 3  | c  |
+----+----+
3 rows selected (1.318 seconds)

Support local catalog on Windows

This occurs:

ERROR platform.PlatformBroker: error starting connection
java.lang.IllegalArgumentException: Illegal character in opaque part at index 2: C:\Users\guoma\workspaceCascadingGuide\LingualMaven\src\main\resources\data\example
at java.net.URI.create(URI.java:859)
at cascading.lingual.platform.PlatformBroker.createSchemaNameFrom(PlatformBroker.java:628)
at cascading.lingual.catalog.SchemaCatalogManager.createSchemaDefAndTableDefsFor(SchemaCatalogManager.java:238)
at cascading.lingual.catalog.SchemaCatalogManager.createSchemaDefAndTableDefsFor(SchemaCatalogManager.java:223)
at cascading.lingual.platform.PlatformBroker.loadTransientSchemas(PlatformBroker.java:672)
at cascading.lingual.platform.PlatformBroker.loadCatalogManager(PlatformBroker.java:389)
at cascading.lingual.platform.PlatformBroker.getCatalogManager(PlatformBroker.java:270)
at cascading.lingual.platform.PlatformBroker.startConnection(PlatformBroker.java:174)
at cascading.lingual.jdbc.LingualConnection.initialize(LingualConnection.java:128)
at cascading.lingual.jdbc.LingualConnection.<init>(LingualConnection.java:80)
at SC.<init>(Unknown Source)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at cascading.lingual.jdbc.JaninoFactory.create(JaninoFactory.java:148)
at cascading.lingual.jdbc.JaninoFactory.createConnection(JaninoFactory.java:45)
at cascading.lingual.jdbc.Driver.connect(Driver.java:172)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:233)
at lingual.LingualMaven.JdbcExample.run(JdbcExample.java:17)
at lingual.LingualMaven.JdbcExample.main(JdbcExample.java:11)
Caused by: java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\Users\guoma\workspaceCascadingGuide\LingualMaven\src\main\resources\data\example
at java.net.URI$Parser.fail(URI.java:2829)
at java.net.URI$Parser.checkChars(URI.java:3002)
at java.net.URI$Parser.parse(URI.java:3039)
at java.net.URI.<init>(URI.java:595)
at java.net.URI.create(URI.java:857)
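
The underlying problem is that a raw Windows path such as C:\Users\... is not a valid URI, so URI.create() rejects the drive colon and backslashes. A hedged client-side sketch, assuming the identifier handed to Lingual can be supplied as a proper file: URI instead of a raw Windows path; the path below is just the one from the stack trace:

import java.io.File;

public class WindowsPathToUri
  {
  public static void main( String[] args )
    {
    // Convert the raw Windows path into a well-formed file: URI,
    // e.g. file:/C:/Users/guoma/.../data/example, before handing it to Lingual.
    String windowsPath = "C:\\Users\\guoma\\workspaceCascadingGuide\\LingualMaven\\src\\main\\resources\\data\\example";
    String uri = new File( windowsPath ).toURI().toString();

    System.out.println( uri );
    }
  }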

Cannot specify location of catalog in HDFS with "lingual catalog --uri"

According to the help screen for "lingual catalog --help" one should be able to specify the directory for the catalog location with the --uri option:

--uri <directory>                    path to catalog location, defaults is
                                       current directory on current       
                                       platform (default: ./)             

But my attempts to get this to work are failing. Examples:

$ lingual catalog --platform hadoop --uri hdfenc/12345 --init
path: hdfs://michael-hadoop-9.foo.com:8020/user/diuser/.lingual has been initialized

$ hadoop fs -rmr -skipTrash .lingual
Deleted hdfs://michael-hadoop-9.foo.com:8020/user/diuser/.lingual

$ lingual catalog --platform hadoop --uri /user/diuser/hdfenc/12345 --init
path: hdfs://michael-hadoop-9.foo.com:8020/user/diuser/.lingual has been initialized

$ hadoop fs -rmr -skipTrash .lingual
Deleted hdfs://michael-hadoop-9.foo.com:8020/user/diuser/.lingual

lingual catalog --platform hadoop \
  --uri hdfs://michael-hadoop-9.foo.com:8020/user/diuser/hdfenc/12345/.lingual \
  --init

In every case it always creates it in the top level HDFS directory for my HDFS user.

Note: you can specify the catalog location with:

lingual catalog --platform hadoop --config catalog=hdfenc/12345 --init

So either the documentation should be updated or the code should be adjusted to allow the --uri switch to work as well.

Tested on:
Hadoop 1 (Hortonworks Data Platform 3.2)

$ lingual catalog --version
Concurrent, Inc - Lingual:1.0.3-wip-320, Cascading:2.2.0 

Client side ClassLoader issue if query involves multiple providers

There is a ClassLoader issue that only occurs when you have different providers, each with their own jars, as source and sink.

For full details on this:

https://groups.google.com/forum/#!topic/lingual-user/V-Sk1KKsPD0

As a temporary workaround (assuming you are using Lingual shell):

$ lingual catalog --provider --add cascading:cascading-jdbc-mysql:2.5.1:provider
$ lingual catalog --provider --add cascading:cascading-hbase:2.2.0:provider
# or whatever providers you are loading...

$ find /home/$USER/.ivy2/cache/cascading -type f -iname "cascading*jar" -exec cp '{}' $(dirname `which lingual`) ';'
