apache / seatunnel

SeaTunnel is a next-generation, high-performance, distributed tool for massive data integration.

Home Page: https://seatunnel.apache.org/

License: Apache License 2.0

Shell 0.26% Java 99.44% Python 0.08% JavaScript 0.05% Batchfile 0.17%
Topics: data-integration, high-performance, offline, real-time, apache, batch, cdc, change-data-capture, data-ingestion, elt

seatunnel's Introduction

Apache SeaTunnel


Overview

SeaTunnel is a next-generation, high-performance, distributed data integration tool, capable of synchronizing vast amounts of data daily. It's trusted by numerous companies for its efficiency and stability.

Why Choose SeaTunnel

SeaTunnel addresses common data integration challenges:

  • Diverse Data Sources: Seamlessly integrates with hundreds of evolving data sources.

  • Complex Synchronization Scenarios: Supports various synchronization methods, including real-time, CDC, and full database synchronization.

  • Resource Efficiency: Minimizes computing resources and JDBC connections for real-time synchronization.

  • Quality and Monitoring: Provides data quality and monitoring to prevent data loss or duplication.

Key Features

  • Diverse Connectors: Offers support for over 100 connectors, with ongoing expansion.

  • Batch-Stream Integration: Easily adaptable connectors simplify data integration management.

  • Distributed Snapshot Algorithm: Ensures data consistency across synchronized data.

  • Multi-Engine Support: Works with SeaTunnel Zeta Engine, Flink, and Spark.

  • JDBC Multiplexing and Log Parsing: Efficiently synchronizes multiple tables and entire databases.

  • High Throughput and Low Latency: Provides high-throughput data synchronization with low latency.

  • Real-Time Monitoring: Offers detailed insights during synchronization.

  • Two Job Development Methods: Supports coding and visual job management with the SeaTunnel web project.

SeaTunnel Workflow


Configure jobs, select execution engines, and parallelize data using Source Connectors. Easily develop and extend connectors to meet your needs.

Supported Connectors

For a list of connectors and their health status, visit the Connector Status.

Getting Started

Download SeaTunnel from the official website.

Choose your runtime execution engine: SeaTunnel Zeta Engine, Flink, or Spark.

Use Cases

Explore real-world use cases of SeaTunnel, such as Weibo, Tencent Cloud, Sina, Sogou, and Yonghui Superstores. More use cases can be found on the SeaTunnel blog.

Code of Conduct

Participate in this project following the Contributor Covenant Code of Conduct.

Contributors

We appreciate all developers for their contributions. See the list of contributors.

How to Compile

Refer to this document for compilation instructions.

Contact Us

Landscapes

SeaTunnel enriches the CNCF CLOUD NATIVE Landscape.

Apache SeaTunnel Web Project

SeaTunnel Web is a sub-project of SeaTunnel that provides visual job management, scheduling, running, and monitoring. It is built on the SeaTunnel Connector API and the SeaTunnel Zeta Engine, and can be deployed independently. For more information, please refer to SeaTunnel Web.

Our Users

Companies and organizations worldwide use SeaTunnel for research, production, and commercial products. Visit our user page for more information.

License

Apache 2.0 License

Frequently Asked Questions

1. How do I install SeaTunnel?

Follow the installation guide on our website to get started.

2. How can I contribute to SeaTunnel?

We welcome contributions! Please refer to our Contribution Guidelines for details.

3. How do I report issues or request features?

You can report issues or request features on our GitHub repository.

4. Can I use SeaTunnel for commercial purposes?

Yes, SeaTunnel is available under the Apache 2.0 License, allowing commercial use.

5. Where can I find documentation and tutorials?

Our official documentation includes detailed guides and tutorials to help you get started.

6. Is there a community or support channel?

Join our Slack community for support and discussions: SeaTunnel Slack.


seatunnel's Issues

Antlr4 tutorials

Antlr tutorial:

https://tomassetti.me/antlr-mega-tutorial/
http://sqtds.github.io/tags/antlr4/
https://alexecollins.com/antlr4-and-maven-tutorial/
http://meri-stuff.blogspot.com/2011/09/antlr-tutorial-expression-language.html#LexerBasics
http://progur.com/2016/09/how-to-create-language-using-antlr4.html
https://yq.aliyun.com/articles/11366
http://www.cnblogs.com/sld666666/p/6145854.html
http://blog.csdn.net/dc_726/article/details/45399371
https://github.com/antlr/antlr4/blob/master/doc/index.md
https://github.com/antlr/grammars-v4/blob/master/json/JSON.g4
https://plugins.jetbrains.com/plugin/7358-antlr-v4-grammar-plugin
https://stackoverflow.com/questions/21534316/is-there-a-simple-example-of-using-antlr4-to-create-an-ast-from-java-source-code
https://stackoverflow.com/questions/23092081/antlr4-visitor-pattern-on-simple-arithmetic-example

https://stackoverflow.com/questions/6487593/what-does-fragment-mean-in-antlr
http://floris.briolas.nl/floris/2008/10/antlr-common-pittfals/
https://github.com/odiszapc/nginx-java-parser
https://codevomit.wordpress.com/2015/04/25/antlr4-project-with-maven-tutorial-episode-3/
https://stackoverflow.com/questions/1931307/antlr-is-there-a-simple-example
https://stackoverflow.com/questions/29971097/how-to-create-ast-with-antlr4

Listener vs Visitor:

https://stackoverflow.com/questions/20714492/antlr4-listeners-and-visitors-which-to-implement?rq=1

http://jakubdziworski.github.io/java/2016/04/01/antlr_visitor_vs_listener.html

ANTLRv4: How to read double quote escaped double quotes in string?

https://stackoverflow.com/questions/17897651/antlrv4-how-to-read-double-quote-escaped-double-quotes-in-string

nested boolean expression parsing:

https://stackoverflow.com/questions/25096713/parser-lexer-logical-expression
https://stackoverflow.com/questions/30976962/nested-boolean-expression-parser-using-antlr

parsing comment:

https://stackoverflow.com/questions/7070763/parse-comment-line?rq=1
https://stackoverflow.com/questions/28674875/antlr-4-how-to-parse-comments
http://meri-stuff.blogspot.com/2012/09/tackling-comments-in-antlr-compiler.html

design pattern: visitor

https://dzone.com/articles/design-patterns-visitor
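As a concrete reference, the visitor pattern that ANTLR's generated `*BaseVisitor` classes follow can be sketched with no ANTLR dependency at all. The names below (`Expr`, `Num`, `Add`) are illustrative, not from any real grammar:

```java
public class VisitorDemo {
    // Each node type accepts a visitor and dispatches to the matching
    // typed visit method -- the same shape ANTLR generates for a grammar.
    interface Expr { <T> T accept(Visitor<T> v); }

    interface Visitor<T> {
        T visitNum(Num n);
        T visitAdd(Add a);
    }

    static class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
        public <T> T accept(Visitor<T> v) { return v.visitNum(this); }
    }

    static class Add implements Expr {
        final Expr left, right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        public <T> T accept(Visitor<T> v) { return v.visitAdd(this); }
    }

    public static void main(String[] args) {
        // Evaluate (1 + (2 + 3)) with an external evaluator visitor
        Expr e = new Add(new Num(1), new Add(new Num(2), new Num(3)));
        Visitor<Integer> eval = new Visitor<Integer>() {
            public Integer visitNum(Num n) { return n.value; }
            public Integer visitAdd(Add a) { return a.left.accept(this) + a.right.accept(this); }
        };
        System.out.println(e.accept(eval)); // 6
    }
}
```

New operations (e.g. pretty-printing) are added by writing another `Visitor<T>` implementation, without touching the node classes — exactly the trade-off the listener-vs-visitor links above discuss.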

books:

"The Definitive Antlr4 Reference"

scala code generation and runtime compile

Chinese/English Documentation

Chinese documentation completion status:


  • Configuration

  • (Garyelephant) Common configuration

  • Input plugins

    • Fake
    • File
    • Hdfs
    • Kafka
    • S3
    • Socket
  • Filter plugins

    • (Rickyhuo)Add
    • Checksum
    • (Rickyhuo)Convert
    • (Rickyhuo)Date
    • Drop
    • Geoip
    • Grok
    • Json
    • Kv
    • (Rickyhuo)Lowercase
    • (Rickyhuo)Remove
    • (Rickyhuo)Rename
    • Repartition
    • Replace
    • Sample
    • Split
    • SQL
    • Table
    • Truncate
    • (Rickyhuo)Uppercase
    • Uuid
  • Output plugins

    • Elasticsearch
    • File
    • Hdfs
    • Jdbc
    • (Rickyhuo)Kafka
    • MySQL
    • S3
    • Stdout

  • Deployment (Garyelephant, Jan 10)
  • Monitoring (Garyelephant)
  • Performance and tuning
  • (Rickyhuo, Jan 2) Plugin development
  • (Garyelephant, Jan 10) Roadmap
  • Contributing code

English documentation completion status:

2017-09-04 Week TODO

Here is what can be done this week:

(1) Filter UDFs:

a) Find the full list of UDFs that ship with Spark SQL and see what each can do; later we can reference these directly usable UDFs in Waterdrop's documentation;

b) For the Filters we plan to implement, can we also provide corresponding UDFs, and if so, how?

c) Do our Filters overlap with Spark SQL's built-in UDFs, and can those be reused?

(2) Finalize the BaseFilter interface definition;

one approach is to go through all the Filter plugins and see what BaseFilter interface they require;

another is to think through how a future user developing their own plugin would build it on top of the BaseFilter interface.

(3) Support multiple inputs and outputs in the pipeline code

(4) [Only after the first three are done] Finalize the BaseInput and BaseOutput interface definitions; this involves several tricky technical points, to be discussed next week.
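To make the BaseFilter discussion concrete, here is a hypothetical sketch of what such a contract could look like. The names `BaseFilter`, `checkConfig`, and `UppercaseFilter` are illustrative assumptions for this sketch, not the actual Waterdrop API:

```java
import java.util.*;

public class FilterSketch {
    // Hypothetical BaseFilter contract: a filter validates its config,
    // then transforms a batch of rows (each row modeled here as a
    // map of field name -> value).
    interface BaseFilter {
        boolean checkConfig(Map<String, Object> config);
        List<Map<String, Object>> process(List<Map<String, Object>> rows);
    }

    // Example of a plugin a third-party developer might write against it.
    static class UppercaseFilter implements BaseFilter {
        private String field;

        public boolean checkConfig(Map<String, Object> config) {
            Object f = config.get("source_field");
            if (f == null) return false; // reject incomplete config
            field = f.toString();
            return true;
        }

        public List<Map<String, Object>> process(List<Map<String, Object>> rows) {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> row : rows) {
                Map<String, Object> copy = new HashMap<>(row); // don't mutate input
                copy.computeIfPresent(field, (k, v) -> v.toString().toUpperCase());
                out.add(copy);
            }
            return out;
        }
    }

    public static void main(String[] args) {
        BaseFilter f = new UppercaseFilter();
        Map<String, Object> config = new HashMap<>();
        config.put("source_field", "info");
        f.checkConfig(config);

        Map<String, Object> row = new HashMap<>();
        row.put("info", "hello");
        System.out.println(f.process(List.of(row)).get(0).get("info")); // HELLO
    }
}
```

Going through each planned Filter plugin and checking whether it fits a config-then-transform contract like this is one way to pressure-test the interface before freezing it.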

Spark Streaming & Spark SQL

Development and testing of the project framework (Event, data flow, DataFrame, UDF, UDAF)

configParse drops the last character of a config value

The sql field read by the filter.Sql plugin is missing its last character.

app.conf

spark {
  spark.streaming.batchDuration = 5
  spark.master = "local[2]"
  spark.app.name = "Waterdrop-1"
  spark.ui.port = 13000
}

input {
  kafka {
    topics = "sinabip_test"
    consumer.auto.offset.reset = "largest"
  }
}

filter {
  Split {
    source_field = "raw_message"
    fields = ["times", "info"]
  }
  Sql {
    table_name = "test",
    sql = "select info from test where info='hello'"
  }
}

output {
  Stdout {}
}

Configuration printed in the Main function

[INFO] Parsed Config: 
{
    "filter" : [
        {
            "entries" : {
                "fields" : [
                    "times",
                    "info"
                ],
                "source_field" : "raw_message"
            },
            "name" : "Split"
        },
        {
            "entries" : {
                "sql" : "select info from test where info='hello",
                "table_name" : "test"
            },
            "name" : "Sql"
        }
    ],
    "input" : [
        {           
            "name" : "kafka"
        }
    ],
    "output" : [
        {
            "entries" : {},
            "name" : "Stdout"
        }
    ],
    "spark" : {
        "spark" : {
            "app" : {
                "name" : "Waterdrop-1"
            },
            "master" : "local[2]",
            "streaming" : {
                "batchDuration" : 5
            },
            "ui" : {
                "port" : 13000
            }
        }
    }
}
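Note how the parsed `sql` value ends in `'hello` with the closing quote missing. A likely class of culprit is an off-by-one while stripping the surrounding quotes; the sketch below is hypothetical and only illustrates the symptom, not the actual configParse code:

```java
public class QuoteTrimBug {
    // Hypothetical illustration: stripping surrounding quotes with an end
    // index one too small also drops the value's last character.
    static String stripQuotesBuggy(String v) {
        return v.substring(1, v.length() - 2); // off by one: loses last char
    }

    static String stripQuotesFixed(String v) {
        return v.substring(1, v.length() - 1); // removes only the two quotes
    }

    public static void main(String[] args) {
        String raw = "\"select info from test where info='hello'\"";
        System.out.println(stripQuotesBuggy(raw)); // ...info='hello   (truncated)
        System.out.println(stripQuotesFixed(raw)); // ...info='hello'
    }
}
```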

conditional and expression in configuration file

Adding a src/test/java directory: running a main method under it fails

I wanted to create a test directory, src/test/java, and add a class with a main method:

org.interestinglab.waterdrop.WaterDropTest

Running it fails with:

Error:(13, 14) BoolExprBaseVisitor is already defined as object BoolExprBaseVisitor
public class BoolExprBaseVisitor<T> extends AbstractParseTreeVisitor<T> implements BoolExprVisitor<T> {
             ^

Multi-language support in Spark RDD processing

You can use the pipe() function on RDDs to call external code. It passes data to an external program through stdin / stdout. For Spark Streaming, you would do dstream.transform(rdd => rdd.pipe(...)) to call it on each RDD.
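The `pipe()` idea can be illustrated outside Spark: the essence is writing records to an external command's stdin and reading its stdout back as the transformed records. A minimal sketch — the `pipe` helper below is illustrative plain Java, not Spark's API:

```java
import java.io.*;
import java.util.*;

public class PipeDemo {
    // Mimics RDD.pipe(): write each record to an external command's stdin,
    // read the command's stdout lines back as the transformed records.
    static List<String> pipe(List<String> records, List<String> command)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).start();
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(p.getOutputStream()))) {
            for (String r : records) {
                w.write(r);
                w.newLine();
            }
        } // closing stdin signals EOF to the external program
        List<String> out = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) out.add(line);
        }
        p.waitFor();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // "tr" stands in for any external program (a Python script, a C++ binary, ...)
        System.out.println(pipe(Arrays.asList("hello", "world"),
                                Arrays.asList("tr", "a-z", "A-Z")));
        // → [HELLO, WORLD]
    }
}
```

For very large record batches, stdin writing and stdout reading would need separate threads to avoid pipe-buffer deadlock; Spark handles that internally.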

Notes on improving usability

  1. Debug mode: let users easily see how the data changes at every stage.

  2. Local mode: use Spark's local mode to make debugging and local development easy.

  3. Visualization of the Filter pipeline.

  4. Anticipate users' application scenarios and simplify the corresponding deployment and execution workflows.

Coding style improvements

  • Add checkstyle checks for the Java code
  • Configure a stricter coding style in scalastyle
  • Configure codacy to match

Support complex configuration logic in config files (e.g., if/else logic, template variables, predefined variables)
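For the template-variable part, substituting `${name}` placeholders could look like the following hypothetical sketch (the `substitute` helper is an assumption for illustration, not existing Waterdrop code):

```java
import java.util.*;
import java.util.regex.*;

public class TemplateVars {
    // Replace ${name} placeholders in a config line with values from a map
    // (user-defined variables or System.getenv()); unknown placeholders
    // are left untouched so the error surfaces downstream.
    static String substitute(String line, Map<String, String> vars) {
        Matcher m = Pattern.compile("\\$\\{(\\w+)\\}").matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = vars.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("topic", "sinabip_test");
        System.out.println(substitute("topics = \"${topic}\"", vars));
        // → topics = "sinabip_test"
    }
}
```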

Structured Streaming & Window Operations

Backward compatibility

  • After upgrading Spark to 2.2, the accompanying Scala upgrade to 2.11 breaks compatibility with Spark 1.6

  • JDK 1.7 compatibility

Roadmap

Key future directions for Waterdrop:

  • Support the Flink/Beam compute engines

  • Support stateful, real-time aggregation on Flink (users can specify time granularity, dimensions, and metrics)

  • Interactive UI (interactive pipeline building, interactive SQL execution, and visual tools for functional and performance diagnosis)

  • Deep performance optimization in the Spark/Flink internals, driven by application scenarios

  • Grow the base of companies running Waterdrop in production (technical support for companies in China, promotion in the English-speaking community)

  • Self-service, interactive problem diagnosis and performance tuning (cf. Alibaba Arthas)

Current issues with the JSON plugin

  1. When target_field is ***ROOT***, the schema has to be derived from the raw data, and there is currently no solution for this.
  2. Nested JSON structures cannot be parsed across multiple levels.

Structural issues in the plugin system

Current directory structure

filter/
├── BaseFilter.scala
└── Split.scala
└── Sql.scala

Would it be better to add a directory level above the individual files? Some plugins cannot keep all their code in one file, and lumping everything together is hard to manage. Take grok as an example:

filter/
├── BaseFilter.scala
├── grok
│   ├── Grok.scala
│   └── PatternGrok.scala
├── split
│   └── Split.scala
└── sql
    └── Sql.scala
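With one directory per plugin, the pipeline still needs a way to find plugin classes at runtime. One common approach is loading them by a class name taken from the config; a minimal hypothetical sketch (names are illustrative, not the real Waterdrop loader):

```java
public class PluginLoader {
    // Hypothetical plugin contract the loader instantiates against.
    public interface BaseFilter {
        String name();
    }

    // Stands in for e.g. filter/split/Split living in its own package.
    public static class Split implements BaseFilter {
        public String name() { return "Split"; }
    }

    // Look up a plugin class by name and create it via its no-arg constructor.
    static BaseFilter load(String className) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        return (BaseFilter) cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // In real use the class name would come from the user's config file.
        System.out.println(load("PluginLoader$Split").name()); // Split
    }
}
```

This keeps each plugin self-contained in its own package while the pipeline core stays ignorant of concrete plugin classes.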

M1 TODO

  • Configuration parsing

    • Parse the Spark common config (spark.*, plus appname and duration)
    • Error reporting and locating for configuration files
    • [Optional] Implement if..else logic [directly tied to the plugin pipeline design]
    • [Optional] User-predefined template variables and system environment variable substitution
  • Plugin pipeline

    • Finalize the BaseFilter interface definition (key point: filters, including third-party developers' filters, are automatically registered as UDFs when needed)
    • Finalize the BaseInput and BaseOutput interface definitions (considering uses of broadcast variables and accumulators, and the relationship to Spark input/output formats)
    • Support multiple inputs and outputs in the pipeline code
    • Relationship between the Serializer and other plugins
    • Ability to integrate external developers' plugins (Java/Scala supported)
    • [Optional] Field Reference
    • [Optional] Support if..else logic
  • Input, Filter, and Output plugin development

    • Input plugins
    • Filter plugins
    • Output plugins
    • Functional testing of Input, Filter, and Output plugins (Spark on YARN [client, cluster] mode, Spark on Mesos, local)
  • End-to-end simplification

    • Separate build.sbt files for different targets
    • Manage the whole Spark + Waterdrop workflow, while still allowing Waterdrop to run as a plain Spark job
    • Installation
    • Deployment (three deployment modes)
    • Plugin integration
    • Configuration
    • Running
  • Chinese and English documentation

    • A unified documentation format for plugin definitions
    • Complete Chinese documentation (plugin docs are the priority)
    • Complete English documentation (plugin docs are the priority)

[Go live at this milestone]


  • Performance report
    • [Optional] Testing of stability, throughput, and consistency at large data volumes
    • [Optional] Performance report
    • [Optional] Performance tuning (parallelism, filter pipeline code)
