Example data flow:
Read one record, "a b c", from the Kafka input.
Convert the String RDD into a DataFrame; the Row has the following structure:
{
"raw_message": "a b c"
}
After the split filter, the Row becomes:
{
"raw_message": "a b c",
"c1": "a",
"c2": "b",
"c3": "c"
}
The result is then written out through the Kafka output.
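The flow above can be sketched in Spark as follows. This is a minimal illustration, not the plugin's actual implementation: the object name, the local SparkSession, and the in-memory Seq standing in for the Kafka input are all assumptions, and the field names c1, c2, c3 are taken from the example Row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

object SplitFilterSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("split-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the String RDD read from the Kafka input.
    val df = Seq("a b c").toDF("raw_message")

    // Split raw_message on a single space, then expose each element
    // as its own top-level column.
    val parts = split(col("raw_message"), " ")
    val result = df
      .withColumn("c1", parts.getItem(0))
      .withColumn("c2", parts.getItem(1))
      .withColumn("c3", parts.getItem(2))

    result.show(false)
    spark.stop()
  }
}
```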
The DataFrame operations that filters need are summarized below:
Q: How to insert a row into a DataFrame? (filter plugin: clone)
A: https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
Q: How to delete a row from a DataFrame? (filter plugin: drop)
A: https://stackoverflow.com/questions/43515193/how-to-delete-rows-in-a-table-created-from-a-spark-dataframe
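Since a DataFrame is immutable, "inserting" and "deleting" rows both mean deriving a new DataFrame. A minimal sketch of the two answers above, assuming Spark 2.x or later (union and =!= are not available under those names in 1.x); the object name and the column names name and value are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RowInsertDeleteSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("row-insert-delete-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("name", "value")

    // "Insert": build a one-row DataFrame with the same schema and union it.
    val extra = Seq(("c", 3)).toDF("name", "value")
    val appended = df.union(extra)

    // "Delete": keep only the rows that do not match the drop condition.
    val remaining = appended.filter(col("name") =!= "b")

    remaining.show(false)
    spark.stop()
  }
}
```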
Q: How to add a specific column to a DataFrame? (filter plugin: date)
A: Use withColumn() together with a udf.
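A minimal sketch of the withColumn() + udf combination for a date-style filter. The input column name ts, the output column name ts_millis, the timestamp pattern, and the UTC time zone are assumptions chosen for illustration; the udf as written does not handle null or malformed input.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object DateColumnSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("date-column-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("2017-08-01 12:30:00").toDF("ts")

    // Parse the string into epoch milliseconds. A new formatter is created
    // per call because SimpleDateFormat is not thread-safe.
    val toEpochMillis = udf { (s: String) =>
      val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
      fmt.parse(s).getTime
    }

    // withColumn appends the derived column to every row.
    val withDate = df.withColumn("ts_millis", toEpochMillis(col("ts")))

    withDate.show(false)
    spark.stop()
  }
}
```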
Q: How to delete a specific column from a DataFrame? (filter plugin: drop_field)
A: example: df.drop(df.col("raw.hourOfWeek"))
Q: How to update a specific column in a DataFrame? (filter plugin: )
A: Use withColumn again; passing the name of an existing column replaces that column.
Q: How to add multiple columns at once? (filter plugin: grok, kv, split)
A: There are two ways:
(1) val newDf = dataframe.flatMap(...), see:
https://community.hortonworks.com/comments/73622/view.html
(2) Register a UDF whose return type is a Struct. Calling withColumn("name", udf) then adds one nested column
that contains multiple fields. See:
https://stackoverflow.com/questions/32196207/derive-multiple-columns-from-a-single-column-in-a-spark-dataframe
https://community.hortonworks.com/comments/73622/view.html
See also #30
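A minimal sketch of approach (2), assuming Spark 2.x or later. The case class Parts, the nested column name parts, and the space delimiter are hypothetical; a udf that returns a case class yields a struct column, whose fields can then be selected into top-level columns if needed. The sketch assumes every input has at least three space-separated tokens.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object StructUdfSketch {
  // Hypothetical struct type; defined outside main so Spark can derive its schema.
  case class Parts(c1: String, c2: String, c3: String)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("struct-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("a b c").toDF("raw_message")

    // One udf call produces all three fields as a single struct value.
    val splitToStruct = udf { (s: String) =>
      val p = s.split(" ")
      Parts(p(0), p(1), p(2))
    }

    // Adds one nested column "parts" with fields c1, c2, c3.
    val nested = df.withColumn("parts", splitToStruct(col("raw_message")))

    // Optional: flatten the nested fields into top-level columns.
    val flat = nested.select(
      col("raw_message"),
      col("parts.c1"),
      col("parts.c2"),
      col("parts.c3"))

    flat.show(false)
    spark.stop()
  }
}
```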
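Q: How to add columns enriched from dictionary data? (filter plugin: dict, geoip)
A: It depends on the size of the dictionary data. If it is small, a broadcast variable + udf is enough. If it is large, it is better to register the dictionary as a DataFrame
and join it with the data DataFrame (Spark should optimize the join automatically based on the dictionary DataFrame, so distinguishing by dictionary size may not be necessary).
Either way, the dictionary should be created up front with a broadcast variable so that it is not rebuilt for every batch.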
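Both options can be sketched side by side. This is an illustration under assumptions: the in-memory Map stands in for real dictionary data, the column names c1, key, and label are hypothetical, and the broadcast() hint in the join variant is one way to request a broadcast join explicitly rather than relying on automatic optimization.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col, udf}

object DictEnrichSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dict-enrich-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "x").toDF("c1")

    // Hypothetical dictionary; built once, outside any per-batch code.
    val dict = Map("a" -> "alpha", "b" -> "beta")

    // Option 1 (small dictionary): broadcast variable + udf lookup.
    val dictBc = spark.sparkContext.broadcast(dict)
    val lookup = udf { (k: String) => dictBc.value.getOrElse(k, null: String) }
    val viaUdf = df.withColumn("label", lookup(col("c1")))

    // Option 2 (dictionary as a DataFrame): left outer join, with an explicit
    // broadcast hint on the dictionary side.
    val dictDf = dict.toSeq.toDF("key", "label")
    val viaJoin = df
      .join(broadcast(dictDf), df("c1") === dictDf("key"), "left_outer")
      .drop("key")

    viaUdf.show(false)
    viaJoin.show(false)
    spark.stop()
  }
}
```

In both variants, keys missing from the dictionary end up with a null label.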
Other references:
https://stackoverflow.com/questions/31816975/how-to-pass-whole-row-to-udf-spark-dataframe-filter