51zero / eel-sdk
Big Data Toolkit for the JVM
License: Apache License 2.0
By passing AuthMech=3 and a username/password in the JDBC URL, you can interact with Impala even when Kerberos is enabled.
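For example, a connection string of this shape could be built as below (host, port, and credentials are placeholders; AuthMech=3 selects username/password authentication in the Cloudera Impala JDBC driver):

```scala
// Build an Impala JDBC URL that authenticates with username/password.
// Host, port and credentials below are placeholders.
def impalaUrl(host: String, port: Int, user: String, pass: String): String =
  s"jdbc:impala://$host:$port;AuthMech=3;UID=$user;PWD=$pass"

val url = impalaUrl("impala-host", 21050, "etl_user", "secret")
// then e.g. java.sql.DriverManager.getConnection(url),
// once the Impala driver jar is on the classpath
```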
To join two frames together (combining rows of one with matching rows of another):
val pipe1 = JdbcSource( ) ....
val pipe2 = JdbcSource( )
val pipe3 = From(pipe1,pipe2).join( pipe1.id -> pipe2.customer_id) ...
Should project (slice) particular columns, by index range or by name:
JdbcSource( ... )
.project( 1 to 10 )
.toJdbcSink( .. )
JdbcSource( ... )
.project( "id","column1","column2")
.toJdbcSink( .. )
Or re-arrange the order of the columns
JdbcSource( ... )
.project( "column2","column1","id")
.toJdbcSink( .. )
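Under the hood, a projection only needs to re-map each row against the requested column list; a minimal sketch over rows modelled as maps (the row representation and names here are illustrative, not eel's actual internals):

```scala
// Project (and re-order) columns by name; rows are modelled as Maps for brevity.
def project(rows: Seq[Map[String, Any]], columns: Seq[String]): Seq[Seq[Any]] =
  rows.map(row => columns.map(row))

val rows = Seq(Map("id" -> 1, "column1" -> "a", "column2" -> "b"))
val reordered = project(rows, Seq("column2", "column1", "id"))
// re-arranged order: Seq(Seq("b", "a", 1))
```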
On a Pipe, filter out values based on expected types or on expected values:
Source(...)
.filter("columnName" , allow = Type.INT)
.filter("columnName" , _.as(Type.INT) > 10 , errors = "nums.txt")
.filter("gender" , _.as(Type.STRING) == "male")
Support splitting pipes
val ( male , female) = Source(...)
.partition( "gender" , _.as(Type.STRING) == "male")
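Scala's standard collections already suggest the shape of such an API; a sketch of splitting rows by a predicate (rows as maps, field names illustrative):

```scala
// Split rows into two pipes based on a predicate over one column.
def splitBy(rows: Seq[Map[String, String]], column: String, p: String => Boolean) =
  rows.partition(row => p(row(column)))

val people = Seq(
  Map("name" -> "bob",   "gender" -> "male"),
  Map("name" -> "alice", "gender" -> "female"))

val (male, female) = splitBy(people, "gender", _ == "male")
// male contains bob's row, female contains alice's row
```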
Sometimes we want to operate over remote Hadoop clusters. That can be achieved through REST calls to HttpFS or WebHDFS.
The only differences between the two services are that i) they listen on different ports, and ii) HttpFS minimizes the footprint required to access HDFS, since a single node acts as a "gateway", whereas WebHDFS requires access to all nodes of the cluster.
Introduce capabilities covering the following operations:
File and Directory Operations
Create and Write to a File
Append to a File
Open and Read a File
Make a Directory
Rename a File/Directory
Delete a File/Directory
Status of a File/Directory
List a Directory
Other File System Operations
Get Content Summary of a Directory
Get File Checksum
Get Home Directory
Set Permission
Set Owner
Set Replication Factor
Set Access or Modification Time
For numeric columns, determine whether float/double or int/long is appropriate.
For varchar columns, determine the varchar(x) size.
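A sketch of how such schema inference could work on a column's sample values (the promotion rules here are an assumption for illustration, not eel's actual ones):

```scala
import scala.util.Try

// Infer a column type from sample string values: prefer INT, then LONG,
// then DOUBLE, falling back to VARCHAR of the longest observed value.
def inferType(samples: Seq[String]): String =
  if (samples.forall(s => Try(s.toInt).isSuccess)) "INT"
  else if (samples.forall(s => Try(s.toLong).isSuccess)) "LONG"
  else if (samples.forall(s => Try(s.toDouble).isSuccess)) "DOUBLE"
  else s"VARCHAR(${samples.map(_.length).max})"

val t1 = inferType(Seq("1", "2", "3"))   // "INT"
val t2 = inferType(Seq("1.5", "2"))      // "DOUBLE"
val t3 = inferType(Seq("abc", "de"))     // "VARCHAR(3)"
```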
JdbcSource( ... )
.discard( 5 to 6 )
.toJdbcSink( .. )
JdbcSource( ... )
.discard( "unnecessarycolumn1","unnecessarycolumn2")
.toJdbcSink( .. )
As Frames will be relatively small, we can store them in an in-memory db (H2) and provide full SQL support.
Provide common file-system methods like:
du
mkdir
mkdirs (creates parent directories as well)
rm
rmdir
rmdirs
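These methods map naturally onto Hadoop's FileSystem API (mkdirs, delete, getContentSummary); as a local stand-in, the java.nio equivalents would look like:

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator

val base: Path = Files.createTempDirectory("eel-demo")

// mkdirs: create a directory together with any missing parents
val nested = Files.createDirectories(base.resolve("a/b/c"))

// du: sum the sizes of all regular files under a directory
def du(dir: Path): Long = {
  val stream = Files.walk(dir)
  try stream.filter((p: Path) => Files.isRegularFile(p))
            .mapToLong((p: Path) => Files.size(p))
            .sum()
  finally stream.close()
}

// rmdirs: delete a directory tree, deepest entries first
def rmdirs(dir: Path): Unit = {
  val stream = Files.walk(dir)
  try stream.sorted(Comparator.reverseOrder[Path]())
            .forEach((p: Path) => Files.delete(p))
  finally stream.close()
}
```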
Imagine you have a schema quarter / product / sales, with data:
("quarter1", "wine", 100)
("quarter1", "beer", 220)
("quarter1", "coffee", 550)
("quarter2", "coffee", 5)
And you want to pivot by product and sales, with newSchema = "wine", "beer", "coffee" and defaultValue = 0, resulting in the data:
quarter    wine  beer  coffee
quarter1    100   220       0
quarter2      0     0       5
Also be able to unpivot the data
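The transformation itself can be sketched over in-memory tuples (the eel API for this is not yet defined; only the pivot/unpivot logic is shown):

```scala
val data = Seq(
  ("quarter1", "wine",   100),
  ("quarter1", "beer",   220),
  ("quarter1", "coffee", 550),
  ("quarter2", "coffee", 5))

// Pivot: one row per quarter, one column per product, defaulting to 0.
val newSchema = Seq("wine", "beer", "coffee")
val pivoted: Map[String, Seq[Int]] =
  data.groupBy(_._1).map { case (quarter, rows) =>
    val byProduct = rows.map(r => r._2 -> r._3).toMap
    quarter -> newSchema.map(p => byProduct.getOrElse(p, 0))
  }
// pivoted("quarter1") == Seq(100, 220, 0)
// pivoted("quarter2") == Seq(0, 0, 5)

// Unpivot: turn the wide rows back into (quarter, product, sales) triples,
// dropping the filled-in default values.
val unpivoted = pivoted.toSeq.flatMap { case (q, sales) =>
  newSchema.zip(sales).collect { case (p, s) if s != 0 => (q, p, s) }
}
```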
JdbcSource( ... )
.insert( "newcolumn" , "new-default-value")
.toJdbcSink( .. )
HBase
Kafka / Queues
Solr
MongoDB
Cassandra
Elastic
e.g. you have a source Parquet file with 100 million rows and you want to get 100K rows out of it into another sink (e.g. a database, or a file to use for integration tests).
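Because sources are streamed, a take that stops pulling after N rows is enough; a sketch over a plain iterator (the names are illustrative):

```scala
// Lazily take the first n rows from a (potentially huge) streamed source.
def sample[T](source: Iterator[T], n: Int): Iterator[T] = source.take(n)

val hundredMillion = Iterator.from(1)   // stand-in for a 100M-row source
val firstHundredK  = sample(hundredMillion, 100000).toVector
// only the first 100K rows are ever materialised
```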
Load data from Hive by providing optional partitioning information
val query = "SELECT column1, column7 FROM tableX"
JdbcSource(query = query, partition = Partition("cob_date", ">", "20150615")).read
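Such a partition hint would presumably be pushed down into the generated SQL; a sketch (the Partition class here is hypothetical, mirroring the snippet above):

```scala
// Hypothetical partition hint, pushed down into the query's WHERE clause.
case class Partition(column: String, op: String, value: String)

def withPartition(query: String, p: Partition): String =
  s"$query WHERE ${p.column} ${p.op} '${p.value}'"

val sql = withPartition("SELECT column1, column7 FROM tableX",
                        Partition("cob_date", ">", "20150615"))
// SELECT column1, column7 FROM tableX WHERE cob_date > '20150615'
```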
Some CSV files do not provide a schema, and we should be able to add one while reading
(in order to run some map/reduce or SQL on it).
val pipe = Csv("my.tsv","\t").from(fileA).withHeader("column1","column2","column3")
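The underlying mechanics of withHeader can be sketched as zipping each split line against the caller-supplied column names (row representation is illustrative):

```scala
// Attach a caller-supplied header to a headerless TSV,
// yielding one column->value map per line.
def withHeader(lines: Seq[String], sep: Char, header: Seq[String]): Seq[Map[String, String]] =
  lines.map(line => header.zip(line.split(sep)).toMap)

val raw = Seq("1\tbob\t42", "2\talice\t37")
val parsed = withHeader(raw, '\t', Seq("column1", "column2", "column3"))
// parsed.head("column2") == "bob"
```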
So something like "hdfs://mydir/*".withFilter(...) to read only a filtered set of files.
Just as Parquet is the popular format on Cloudera clusters,
ORC is the popular format on Hortonworks clusters.
Create Sinks/Sources for ORC files.
It's a format not as common as Parquet, but it used to be a prevalent data format.
To help users understand how resources are utilized, eel could display I/O and CPU stats, letting users identify performance bottlenecks.
Use cases where you have JSON messages spanning multiple lines,
or multiple XML messages across multiple lines,
and you want to automatically identify the bounds of those messages and convert such Sources into (entry-per-line) Sinks.
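For JSON, message bounds can be recovered by tracking brace nesting depth across lines; a simplified sketch (it ignores braces inside quoted strings, which a real implementation must handle):

```scala
// Group physical lines into logical JSON messages by tracking {} nesting depth.
// Simplified: does not account for braces inside string literals.
def regroup(lines: Seq[String]): Seq[String] = {
  val out = Seq.newBuilder[String]
  var depth = 0
  val current = new StringBuilder
  for (line <- lines) {
    current.append(line)
    depth += line.count(_ == '{') - line.count(_ == '}')
    if (depth == 0 && current.nonEmpty) {
      out += current.toString
      current.clear()
    }
  }
  out.result()
}

val multiLine = Seq("{", "\"a\": 1", "}", "{ \"b\": 2 }")
val messages = regroup(multiLine)
// Seq("{\"a\": 1}", "{ \"b\": 2 }") -- one entry per logical message
```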
Provide a REPL where data transformations can be executed in real time.