eel-sdk's Issues

Add support for impala

By passing authmech=3 and a user/password in the JDBC URL, you can interact with Impala even when Kerberos is enabled.

Column projection

A source should be able to project (slice) particular columns, either by position or by name:

JdbcSource( ... )
.project( 1 to 10 )
.toJdbcSink( .. )

JdbcSource( ... )
.project( "id","column1","column2")
.toJdbcSink( .. )

Or re-arrange the order of the columns

JdbcSource( ... )
.project( "column2","column1","id")
.toJdbcSink( .. )
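
To make the intended behaviour concrete, here is a minimal sketch of projection over an in-memory table. `Frame` and both `project` overloads are hypothetical stand-ins for illustration, not the eel-sdk API, and treating positions as 1-based is an assumption drawn from the range in the example above.

```scala
object ProjectSketch {
  // Hypothetical in-memory stand-in for a source's data; not the eel-sdk API.
  final case class Frame(columns: Vector[String], rows: Vector[Vector[Any]]) {

    // Keep the named columns in the order given, which also re-arranges them.
    def project(names: String*): Frame = {
      val indexes = names.map { n =>
        val i = columns.indexOf(n)
        require(i >= 0, s"Unknown column: $n")
        i
      }.toVector
      Frame(indexes.map(columns), rows.map(r => indexes.map(r)))
    }

    // Keep columns by position (assumed 1-based), e.g. project(1 to 3).
    def project(positions: Range): Frame =
      project(positions.map(p => columns(p - 1)): _*)
  }
}
```

Resolving names to indexes once, before touching any row, keeps the per-row work to a simple index lookup.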

Per column filtering

On a Pipe, filter out values based on their expected types or expected values:

Source(...)
.filter("columnName" , allow = Type.INT)
.filter("columnName" , _.as(Type.INT) > 10 , errors="nums.txt")
.filter("gender" , _.as(Type.String) == "male")
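
A rough sketch of the two filter styles follows, under the assumption that a row can be modelled as a `Map[String, Any]`. The names `allowInt` and `filterInt` are made up for illustration. Rejected rows are returned rather than written to a file, so the caller decides where errors go.

```scala
object FilterSketch {
  // Hypothetical row model: column name -> untyped value.
  type Row = Map[String, Any]

  // Keep rows whose `column` value is an Int (the "allow = Type.INT" case).
  def allowInt(rows: Seq[Row], column: String): Seq[Row] =
    rows.filter(_.get(column).exists(_.isInstanceOf[Int]))

  // Keep rows whose `column` value is an Int satisfying `p`. Everything else
  // (wrong type, missing column, failed predicate) comes back as rejects,
  // which a caller could write to an errors file.
  def filterInt(rows: Seq[Row], column: String)(p: Int => Boolean): (Seq[Row], Seq[Row]) =
    rows.partition { row =>
      row.get(column) match {
        case Some(i: Int) => p(i)
        case _            => false
      }
    }
}
```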

Support splitting pipes

val ( male , female) = Source(...)
.partition( "gender" , _.as(Type.String) == "male")
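
On plain Scala collections the same split is already available through `partition`. The sketch below assumes a row is a `Map[String, Any]`; `split` is a hypothetical helper name.

```scala
object SplitSketch {
  // Hypothetical row model: column name -> untyped value.
  type Row = Map[String, Any]

  // Rows whose `column` value satisfies `p` go to the first result; all other
  // rows, including those missing the column, go to the second.
  def split(rows: Seq[Row], column: String)(p: Any => Boolean): (Seq[Row], Seq[Row]) =
    rows.partition(row => row.get(column).exists(p))
}
```

Usage would look like `val (male, female) = SplitSketch.split(rows, "gender")(_ == "male")`. Note that the second half is "everything else", not strictly "female".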

HttpFS / WebHDFS support

Sometimes we want to operate on remote Hadoop clusters. That can be achieved through REST calls to HttpFS / WebHDFS.

The only differences between the two services are that (i) they listen on different ports and (ii) HttpFS minimizes the footprint required to access HDFS, because a single node acts as a "gateway", whereas WebHDFS requires access to all nodes of the cluster.

Introduce capabilities covering the following functionality:

File and Directory Operations
Create and Write to a File
Append to a File
Open and Read a File
Make a Directory
Rename a File/Directory
Delete a File/Directory
Status of a File/Directory
List a Directory

Other File System Operations
Get Content Summary of a Directory
Get File Checksum
Get Home Directory
Set Permission
Set Owner
Set Replication Factor
Set Access or Modification Time
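
As a starting point, the capabilities listed above can be mapped onto the HTTP method and `op` parameter that the WebHDFS REST API uses for each of them; HttpFS serves the same `/webhdfs/v1` path. The `Op` case class, the `url` helper and the host and port in the usage note are illustrative assumptions. Extra parameters such as `destination` or `user.name` are left out.

```scala
object WebHdfsSketch {
  // HTTP method plus the value of the `op` query parameter.
  final case class Op(method: String, name: String)

  // Capability (as worded in this issue) -> WebHDFS REST operation.
  val operations: Map[String, Op] = Map(
    "Create and Write to a File"         -> Op("PUT", "CREATE"),
    "Append to a File"                   -> Op("POST", "APPEND"),
    "Open and Read a File"               -> Op("GET", "OPEN"),
    "Make a Directory"                   -> Op("PUT", "MKDIRS"),
    "Rename a File/Directory"            -> Op("PUT", "RENAME"),
    "Delete a File/Directory"            -> Op("DELETE", "DELETE"),
    "Status of a File/Directory"         -> Op("GET", "GETFILESTATUS"),
    "List a Directory"                   -> Op("GET", "LISTSTATUS"),
    "Get Content Summary of a Directory" -> Op("GET", "GETCONTENTSUMMARY"),
    "Get File Checksum"                  -> Op("GET", "GETFILECHECKSUM"),
    "Get Home Directory"                 -> Op("GET", "GETHOMEDIRECTORY"),
    "Set Permission"                     -> Op("PUT", "SETPERMISSION"),
    "Set Owner"                          -> Op("PUT", "SETOWNER"),
    "Set Replication Factor"             -> Op("PUT", "SETREPLICATION"),
    "Set Access or Modification Time"    -> Op("PUT", "SETTIMES")
  )

  // Build the request URL for an absolute HDFS path.
  def url(host: String, port: Int, path: String, op: String): String = {
    require(path.startsWith("/"), "path must be absolute")
    s"http://$host:$port/webhdfs/v1$path?op=$op"
  }
}
```

For example, `WebHdfsSketch.url("gateway.example.com", 14000, "/tmp/data", "LISTSTATUS")` builds the URL for listing a directory through a hypothetical gateway host.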

Column discarding

JdbcSource( ... )
.discard( 5 to 6 )
.toJdbcSink( .. )

JdbcSource( ... )
.discard( "unnecessarycolumn1","unnecessarycolumn2")
.toJdbcSink( .. )
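
Discarding is the complement of projection: keep every column except the ones named. A minimal sketch over a hypothetical in-memory `Frame` (not the eel-sdk API), again assuming 1-based positions:

```scala
object DiscardSketch {
  // Hypothetical in-memory stand-in for a source's data.
  final case class Frame(columns: Vector[String], rows: Vector[Vector[Any]]) {

    // Drop the named columns; names that do not exist are silently ignored.
    def discard(names: String*): Frame = {
      val keep = columns.indices.filterNot(i => names.contains(columns(i))).toVector
      Frame(keep.map(columns), rows.map(r => keep.map(r)))
    }

    // Drop columns by position (assumed 1-based), e.g. discard(5 to 6).
    def discard(positions: Range): Frame = {
      val keep = columns.indices.filterNot(i => positions.contains(i + 1)).toVector
      Frame(keep.map(columns), rows.map(r => keep.map(r)))
    }
  }
}
```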

Add full SQL support

As Frames will be relatively small, we can store them in an in-memory database (H2) and provide full SQL support.

Enhance HdfsResolver

Add common methods such as:

du, mkdir, mkdirs (creates parent directories as well), rm, rmdir, rmdirs

Pivot capability

Imagine you have the schema (quarter, product, sales) with the following data:

("quarter1",  "wine",   100)
("quarter1",  "beer",   220)
("quarter1",  "coffee", 550)
("quarter2",  "coffee", 5)

You want to pivot by product and sales, with newschema = "wine", "beer", "coffee" and defaultvalue = 0, resulting in:

quarter      wine     beer   coffee
quarter1      100     220     550
quarter2        0       0       5

Also be able to unpivot the data
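
One way to read the request is sketched below. It assumes the input is a sequence of (quarter, product, sales) triples with integer sales; `pivot` and `unpivot` are hypothetical names rather than existing eel-sdk methods.

```scala
object PivotSketch {
  // (row key, pivot column, value), e.g. (quarter, product, sales).
  type Fact = (String, String, Int)

  // One output row per key: the key plus one value per entry of `newSchema`,
  // using `default` where no fact exists. If a (key, column) pair occurs more
  // than once, the last occurrence wins.
  def pivot(facts: Seq[Fact], newSchema: Seq[String], default: Int = 0): Seq[(String, Seq[Int])] = {
    val lookup = facts.map { case (k, c, v) => (k, c) -> v }.toMap
    val keys = facts.map(_._1).distinct // keeps first-seen order
    keys.map(k => k -> newSchema.map(c => lookup.getOrElse((k, c), default)))
  }

  // Inverse direction: one fact per cell. Cells that were filled with the
  // default value are emitted too, since they cannot be told apart.
  def unpivot(rows: Seq[(String, Seq[Int])], schema: Seq[String]): Seq[Fact] =
    for {
      (k, values) <- rows
      (c, v)      <- schema.zip(values)
    } yield (k, c, v)
}
```

Because defaults are indistinguishable from real values after pivoting, a round trip through `pivot` and `unpivot` can return more rows than it started with.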

Allow extraction of a portion of a source

For example, you have a source Parquet file with 100 million rows and you want to extract 100K rows from it into another sink (e.g. a database), for instance to produce a file for integration tests.

  • You could .take(100000)
  • Or .sample(0.1%)
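
Both options can be sketched over a plain `Iterator`, assuming rows are streamed one at a time. `ExtractSketch` and its method signatures are illustrative. A rate of 0.1% corresponds to `fraction = 0.001`.

```scala
import scala.util.Random

object ExtractSketch {
  // First `n` rows; lazy, so the rest of the source is never read.
  def take[A](rows: Iterator[A], n: Int): Iterator[A] = rows.take(n)

  // Keep each row independently with probability `fraction` (Bernoulli
  // sampling), so the result size is only approximately fraction * total.
  def sample[A](rows: Iterator[A], fraction: Double, rng: Random = new Random()): Iterator[A] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    rows.filter(_ => rng.nextDouble() < fraction)
  }
}
```

`take` is cheap but biased towards the start of the file. `sample` has to scan the whole source but draws rows from across it.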

Partitioned queries on Hive

Load data from Hive by providing optional partitioning information

val query = "SELECT column1 , column7 FROM tableX"
JdbcSource ( query = query , partition = Partition( "cob_date" , ">" , "20150615") ).read
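
One possible reading is that the partition information is turned into a predicate appended to the query. The sketch below makes that assumption. It also assumes the base query has no WHERE clause of its own and that partition values are simple tokens. `Partition` and `withPartition` are hypothetical names.

```scala
object PartitionQuerySketch {
  // Hypothetical description of a partition constraint.
  final case class Partition(column: String, op: String, value: String)

  private val allowedOps = Set("=", "<", "<=", ">", ">=", "<>")

  // Append the partition predicate to a query that has no WHERE clause yet.
  // Inputs are validated instead of escaped, to keep the sketch small.
  def withPartition(query: String, partition: Option[Partition]): String =
    partition match {
      case None => query
      case Some(Partition(column, op, value)) =>
        require(allowedOps.contains(op), s"Unsupported operator: $op")
        require(column.matches("[A-Za-z_][A-Za-z0-9_]*"), s"Bad column name: $column")
        require(value.matches("[A-Za-z0-9_-]+"), s"Bad partition value: $value")
        s"$query WHERE $column $op '$value'"
    }
}
```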

Allow providing CSV HEADER as a param

Some CSV files do not provide a header, and we should be able to supply one while reading
(in order to run some map/reduce or SQL on the data)

val pipe = Csv("my.tsv","\t").from(fileA).withHeader("column1","column2","column3")
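
What `withHeader` might do internally is pair each delimited line with the supplied column names. The sketch below assumes a single-character delimiter and no quoted fields; `CsvHeaderSketch` is a hypothetical name.

```scala
import java.util.regex.Pattern

object CsvHeaderSketch {
  // Pair each delimited line with caller-supplied column names.
  def withHeader(lines: Iterator[String], delimiter: Char, header: Seq[String]): Iterator[Map[String, String]] = {
    val separator = Pattern.quote(delimiter.toString)
    lines.map { line =>
      // A limit of -1 keeps trailing empty fields instead of dropping them.
      val fields = line.split(separator, -1)
      require(fields.length == header.length,
        s"Expected ${header.length} fields but got ${fields.length}: $line")
      header.zip(fields).toMap
    }
  }
}
```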

Support ORC file format

Just as Parquet is popular on Cloudera clusters,
ORC is the popular format on Hortonworks clusters.

Create Sinks/Sources for ORC files

Support input (multi-line) => output (entry-per-line)

This covers use cases where you have JSON messages spread across multiple lines,
or multiple XML messages spread across multiple lines,

and you want to automatically identify the bounds of those messages and convert such Sources into (entry-per-line) Sinks.
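
For the JSON case, message bounds can be found by tracking brace depth while ignoring braces inside string literals. The following is a sketch of that idea only. It assumes every message is a top-level JSON object, it buffers its output rather than streaming it, and it does not attempt the XML case. `ReframeSketch` is a hypothetical name.

```scala
import scala.collection.mutable.ArrayBuffer

object ReframeSketch {
  // Turn multi-line, concatenated JSON objects into one compact string per
  // top-level object. Whitespace outside string literals is dropped.
  def jsonObjects(lines: Iterator[String]): Iterator[String] = {
    val out = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var depth = 0
    var inString = false
    var escaped = false
    for (line <- lines; ch <- line) {
      if (inString) {
        current.append(ch)
        if (escaped) escaped = false
        else if (ch == '\\') escaped = true
        else if (ch == '"') inString = false
      } else ch match {
        case '{' => depth += 1; current.append(ch)
        case _ if depth == 0 => // skip anything between messages
        case '"' => inString = true; current.append(ch)
        case '}' =>
          depth -= 1
          current.append(ch)
          if (depth == 0) { out += current.toString; current.clear() }
        case c if c.isWhitespace => // formatting only
        case c => current.append(c)
      }
    }
    out.iterator
  }
}
```

An incomplete trailing message is dropped silently in this sketch; a real implementation would probably want to report it.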

Create REPL

Provide a REPL where data transformations can be executed in real-time
