eel-sdk issues from 51zero

Avro (non parquet) source and sink

Add support for impala

By passing authmech=3 together with a username and password in the JDBC URL, you can interact with Impala even when Kerberos is enabled.
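As a rough illustration, the URL could be assembled as below. This is only a sketch: the `jdbc:impala://` prefix, the `AuthMech`, `UID` and `PWD` property spellings, and port 21050 are assumptions based on the Cloudera Impala JDBC driver, and `ImpalaUrlSketch` is a hypothetical helper, not part of eel-sdk.

```scala
object ImpalaUrlSketch {
  // Builds a JDBC URL that asks the driver for username/password authentication (AuthMech=3).
  // Property names are assumed to follow the Cloudera Impala JDBC driver conventions.
  def url(host: String, port: Int, user: String, password: String, schema: String = "default"): String =
    s"jdbc:impala://$host:$port/$schema;AuthMech=3;UID=$user;PWD=$password"
}
```

A call such as `ImpalaUrlSketch.url("impala-host", 21050, "alice", "secret")` would then be handed to whatever JDBC source is in use.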

Add union operation

Joins two frames together by appending one to the other.
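A minimal sketch of what such a union might look like, using a hypothetical `SimpleFrame` stand-in rather than the real eel-sdk frame type. It assumes both frames must share the same columns, which the issue does not state.

```scala
// Hypothetical stand-in for a frame: column names plus rows of values.
final case class SimpleFrame(columns: Vector[String], rows: Vector[Vector[Any]]) {
  // Appends the rows of `other` after the rows of this frame.
  // Assumption: a union is only valid when both schemas are identical.
  def union(other: SimpleFrame): SimpleFrame = {
    require(columns == other.columns, s"Schema mismatch: $columns vs ${other.columns}")
    SimpleFrame(columns, rows ++ other.rows)
  }
}

object UnionExample {
  val a = SimpleFrame(Vector("id", "name"), Vector(Vector(1, "sam")))
  val b = SimpleFrame(Vector("id", "name"), Vector(Vector(2, "ann")))
  val both: SimpleFrame = a.union(b)
}
```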

Parquet source should accept schemas from hive metastore

Use multiple sources and sinks in the same pipeline

val pipe1 = JdbcSource( ) ....
val pipe2 = JdbcSource( )

val pipe3 = From(pipe1,pipe2).join( pipe1.id -> pipe2.customer_id) ...
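The join in the last line could behave roughly like the inner join sketched below. Rows are modelled as plain maps, and `JoinSketch` and its `join` signature are hypothetical; the issue does not say whether an inner or outer join is intended.

```scala
object JoinSketch {
  type Row = Map[String, Any]

  // Inner join: pairs each left row with every right row whose key value matches.
  // When both sides share a column name, the right-hand value wins.
  def join(left: Seq[Row], right: Seq[Row], leftKey: String, rightKey: String): Seq[Row] = {
    val rightByKey: Map[Any, Seq[Row]] = right.groupBy(row => row(rightKey))
    for {
      l <- left
      r <- rightByKey.getOrElse(l(leftKey), Seq.empty)
    } yield l ++ r
  }
}
```

Indexing the right side by key first keeps the join close to linear in the number of rows, which matters little for small frames but avoids a nested scan.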

Add metadata support when creating Hive tables

Column projection

Should project (slice) particular columns

JdbcSource( ... )
.project( 1 to 10 )
.toJdbcSink( .. )

JdbcSource( ... )
.project( "id","column1","column2")
.toJdbcSink( .. )

Or re-arrange the order of the columns

JdbcSource( ... )
.project( "column2","column1","id")
.toJdbcSink( .. )
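Selecting and re-ordering by name can be handled by one operation, as the sketch below suggests. `ProjectionSketch.project` is a hypothetical helper working on a header plus positional rows; it is not the eel-sdk API.

```scala
object ProjectionSketch {
  // Keeps only the requested columns, in the order they are requested.
  def project(header: Vector[String], rows: Seq[Vector[Any]], wanted: String*): (Vector[String], Seq[Vector[Any]]) = {
    // Resolve each requested name to its position once, failing fast on unknown names.
    val indexes = wanted.toVector.map { name =>
      val i = header.indexOf(name)
      require(i >= 0, s"Unknown column: $name")
      i
    }
    (wanted.toVector, rows.map(row => indexes.map(i => row(i))))
  }
}
```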

Per column filtering

On a pipe, filter out values based on their expected types or expected values.

Source(...)
.filter("columnName" , allow = Type.INT)
.filter("columnName" , _.as(Type.INT) > 10 , errors="nums.txt")
.filter("gender" , _.as(Type.String) == "male")
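One possible reading of the `errors` argument is that rejected rows are kept aside instead of being silently dropped. The sketch below follows that assumption and returns them to the caller, who could write them to a file such as `nums.txt`. `FilterSketch` and `filterInt` are hypothetical names.

```scala
object FilterSketch {
  type Row = Map[String, Any]

  // Splits rows into (kept, rejected).
  // A row is kept only when `column` holds an Int that satisfies `p`;
  // rows with a missing column or a value of another type are rejected.
  def filterInt(rows: Seq[Row], column: String)(p: Int => Boolean): (Seq[Row], Seq[Row]) =
    rows.partition { row =>
      row.get(column) match {
        case Some(i: Int) => p(i)
        case _            => false
      }
    }
}
```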

Support splitting pipes

val ( male , female) = Source(...)
.partition( "gender" , _.as(Type.String) == "male")

HttpFS / WebHDFS support

Sometimes we want to operate on remote Hadoop clusters. That can be achieved through REST calls to HttpFS or WebHDFS.

The two services differ in two ways: (i) they listen on different ports, and (ii) HttpFS minimizes the footprint required to access HDFS, because a single node acts as a "gateway", whereas WebHDFS requires access to all nodes of the cluster.

Introduce capabilities covering the following functionality:

File and Directory Operations
Create and Write to a File
Append to a File
Open and Read a File
Make a Directory
Rename a File/Directory
Delete a File/Directory
Status of a File/Directory
List a Directory

Other File System Operations
Get Content Summary of a Directory
Get File Checksum
Get Home Directory
Set Permission
Set Owner
Set Replication Factor
Set Access or Modification Time
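The operations listed above map onto the `op` query parameter of the WebHDFS REST API, which HttpFS also exposes under the same `/webhdfs/v1` prefix. The sketch below only builds request URLs and records the HTTP method for each operation; `WebHdfsSketch` is a hypothetical helper and it ignores authentication parameters such as `user.name`.

```scala
import java.net.URLEncoder

object WebHdfsSketch {
  // Operation name -> (HTTP method, value of the `op` parameter).
  val operations: Map[String, (String, String)] = Map(
    "create"         -> ("PUT", "CREATE"),
    "append"         -> ("POST", "APPEND"),
    "open"           -> ("GET", "OPEN"),
    "mkdirs"         -> ("PUT", "MKDIRS"),
    "rename"         -> ("PUT", "RENAME"),
    "delete"         -> ("DELETE", "DELETE"),
    "status"         -> ("GET", "GETFILESTATUS"),
    "list"           -> ("GET", "LISTSTATUS"),
    "contentSummary" -> ("GET", "GETCONTENTSUMMARY"),
    "checksum"       -> ("GET", "GETFILECHECKSUM"),
    "homeDirectory"  -> ("GET", "GETHOMEDIRECTORY"),
    "setPermission"  -> ("PUT", "SETPERMISSION"),
    "setOwner"       -> ("PUT", "SETOWNER"),
    "setReplication" -> ("PUT", "SETREPLICATION"),
    "setTimes"       -> ("PUT", "SETTIMES")
  )

  // Builds the request URL. The same function serves both services; only host and port change.
  // HttpFS listens on 14000 by default; the NameNode web port used by WebHDFS is
  // 50070 on Hadoop 2.x and 9870 on Hadoop 3.x.
  def url(host: String, port: Int, path: String, op: String, params: (String, String)*): String = {
    require(path.startsWith("/"), "path must be absolute")
    val query = (("op" -> op) +: params)
      .map { case (k, v) => s"$k=${URLEncoder.encode(v, "UTF-8")}" }
      .mkString("&")
    s"http://$host:$port/webhdfs/v1$path?$query"
  }
}
```

Because the URL scheme is shared, a component written this way could switch between a gateway and direct NameNode access through configuration alone.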

add signed to schema fields

Add avro sink

Capture size of schema fields

For numeric, determine if float/double, int/long
For varchar determine varchar(x) size

Add takeWhile

Add except() to Frame

Column discarding

JdbcSource( ... )
.discard( 5 to 6 )
.toJdbcSink( .. )

JdbcSource( ... )
.discard( "unnecessarycolumn1","unnecessarycolumn2")
.toJdbcSink( .. )

Add full SQL support

As frames will be relatively small, we can store them in an in-memory database (H2) and provide full SQL support.

Enhance HdfsResolver

Add common methods such as du, mkdir, mkdirs (creates parent directories as well), rm, rmdir and rmdirs.

jdbc dialect shouldn't use 0 for varchar

Make frames immutable and repeatable

Pivot capability

Imagine you have the schema (quarter, product, sales) with the following data:

("quarter1",  "wine",   100)
("quarter1",  "beer",   220)
("quarter1",  "coffee", 550)
("quarter2",  "coffee", 5)

Suppose you want to pivot on product, using sales as the values, with the new schema "wine", "beer", "coffee" and a default value of 0. The result would be:

quarter      wine     beer   coffee
quarter1      100      220      550
quarter2        0        0        5

Also be able to unpivot the data
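A minimal sketch of both directions over (quarter, product, sales) triples. `PivotSketch` is a hypothetical helper; it assumes the target columns are supplied by the caller, and that when the same (quarter, product) pair appears twice the last value wins.

```scala
object PivotSketch {
  // Turns (rowKey, column, value) triples into one row per key with a value for each column.
  def pivot(rows: Seq[(String, String, Int)], columns: Seq[String], default: Int = 0): Seq[(String, Seq[Int])] = {
    val keys = rows.map(_._1).distinct // keeps first-seen order
    val lookup = rows.map { case (k, c, v) => (k, c) -> v }.toMap
    keys.map(k => k -> columns.map(c => lookup.getOrElse((k, c), default)))
  }

  // Reverses `pivot`. Cells equal to `default` are dropped, so a genuine
  // value equal to the default cannot be told apart from a missing one.
  def unpivot(pivoted: Seq[(String, Seq[Int])], columns: Seq[String], default: Int = 0): Seq[(String, String, Int)] =
    for {
      (k, values) <- pivoted
      (c, v) <- columns.zip(values)
      if v != default
    } yield (k, c, v)
}
```

The lossy treatment of default values in `unpivot` is a design choice of this sketch; keeping missing cells as `Option` values would make the round trip exact.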

Add component for solr

Add schema support to Frame

Insert columns with default value

JdbcSource( ... )
.insert( "newcolumn" , "new-default-value")
.toJdbcSink( .. )

Add common Sinks & Sources

HBase
Kafka / Queues
Solr
MongoDB
Cassandra
Elastic

Add dropWhile

Add CsvSink

jdbc dialect should support all types

Add avro (non parquet) source

Allow extraction of portion of a source

For example, you have a source Parquet file with 100 million rows and you want to copy 100K rows from it into another sink (e.g. a database, or a file to use for integration tests).

You could .take(100000)
Or .sample(0.1%)
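The two options behave differently: `take` returns the first rows, while sampling spreads the selection over the whole source. A sketch over plain iterators follows; `ExtractSketch` is hypothetical, the fraction is given as a number (0.1% becomes 0.001), and the sampling is Bernoulli, so the resulting row count is only approximately the requested share.

```scala
import scala.util.Random

object ExtractSketch {
  // First `n` rows, read lazily so the rest of the source is never touched.
  def take[A](rows: Iterator[A], n: Int): Iterator[A] = rows.take(n)

  // Keeps each row independently with probability `fraction`.
  // A fixed seed makes the extract repeatable, which helps for test fixtures.
  def sample[A](rows: Iterator[A], fraction: Double, seed: Long = 42L): Iterator[A] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be between 0 and 1")
    val rnd = new Random(seed)
    rows.filter(_ => rnd.nextDouble() < fraction)
  }
}
```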

Add HBase component

pretty print schema

Partitioned queries on Hive

Load data from Hive by providing optional partitioning information

val query = "SELECT column1 , column7 FROM tableX"
JdbcSource ( query = query , partition = Partition( "cob_date" , ">" , "20150615") ).read
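One way the optional partition could be applied is by appending a predicate to the query before it is sent to Hive. The sketch below assumes the base query has no `WHERE` clause of its own and that the partition value is compared as a string; `PartitionedQuerySketch` and its `Partition` case class are hypothetical and only mirror the shape used in the snippet above.

```scala
object PartitionedQuerySketch {
  final case class Partition(column: String, op: String, value: String)

  // Only plain comparison operators are accepted, to avoid splicing arbitrary text into the SQL.
  private val allowedOps = Set("=", "<>", "<", "<=", ">", ">=")

  // Returns the query unchanged when no partition is given,
  // otherwise appends a predicate on the partition column.
  def withPartition(query: String, partition: Option[Partition]): String = partition match {
    case None => query
    case Some(Partition(column, op, value)) =>
      require(allowedOps(op), s"Unsupported operator: $op")
      val escaped = value.replace("'", "''") // escape single quotes in the literal
      s"$query WHERE $column $op '$escaped'"
  }
}
```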
