
garyelephant avatar garyelephant commented on May 11, 2024

Example data flow:

 One record, "a b c", is read from the Kafka input.
 The String RDD is converted to a DataFrame whose Row has the following structure:
 {
    "raw_message": "a b c"
 }

 After passing through the split filter:
 {
    "raw_message": "a b c",
    "c1": "a",
    "c2": "b",
    "c3": "c"
 }
 The result is then written out through the Kafka output.
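
The flow above could be sketched roughly as follows. This is only an illustration written in script style (for spark-shell or a Scala script with Spark on the classpath); the in-memory RDD stands in for the Kafka input, and the local SparkSession and the space delimiter are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

val spark = SparkSession.builder().master("local[1]").appName("split-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the String RDD produced by the Kafka input (assumption).
val rdd = spark.sparkContext.parallelize(Seq("a b c"))

// String RDD -> DataFrame with a single "raw_message" column.
val df = rdd.toDF("raw_message")

// Split on a single space and expose each part as its own column.
val parts = split(col("raw_message"), " ")
val result = df
  .withColumn("c1", parts.getItem(0))
  .withColumn("c2", parts.getItem(1))
  .withColumn("c3", parts.getItem(2))

result.show()
```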

Requirements for how filters operate on a DataFrame:

 Q: how to insert a row into a DataFrame? (filter plugin: clone)
 A: https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
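
One common way to append a row is to build a one-row DataFrame with the same schema and union it with the original. A minimal sketch, assuming a local SparkSession and made-up sample values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("append-row-sketch").getOrCreate()
import spark.implicits._

val df = Seq("a b c").toDF("raw_message")

// DataFrames are immutable, so "inserting" means producing a new DataFrame.
// The extra row must have the same schema as the original.
val extra = Seq("d e f").toDF("raw_message")
val appended = df.union(extra)

appended.show()
```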

 Q: how to delete a row from a DataFrame? (filter plugin: drop)
 A: https://stackoverflow.com/questions/43515193/how-to-delete-rows-in-a-table-created-from-a-spark-dataframe
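
Because a DataFrame is immutable, dropping rows usually means filtering for the rows to keep. A minimal sketch; the sample values and the drop condition are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("drop-row-sketch").getOrCreate()
import spark.implicits._

val df = Seq("a b c", "noise", "d e f").toDF("raw_message")

// Keep every row except the ones matching the (hypothetical) drop condition.
val kept = df.filter(col("raw_message") =!= "noise")

kept.show()
```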

 Q: how to add a specific column to a DataFrame? (filter plugin: date)
 A: Use withColumn() combined with a UDF.
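
A minimal sketch of the withColumn() + UDF combination. The column names `ts` and `day`, and the choice of epoch seconds formatted as a UTC date, are assumptions for illustration and not the date plugin's actual behavior:

```scala
import java.time.{Instant, ZoneOffset}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[1]").appName("add-column-sketch").getOrCreate()
import spark.implicits._

val df = Seq(86400L).toDF("ts")

// UDF: epoch seconds -> ISO date string in UTC.
val toDay = udf((epochSeconds: Long) =>
  Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).toLocalDate.toString)

// withColumn returns a new DataFrame with the extra column appended.
val withDay = df.withColumn("day", toDay(col("ts")))

withDay.show()
```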

 Q: how to delete a specific column from a DataFrame? (filter plugin: drop_field)
 A: example: df.drop(df.col("raw.hourOfWeek"))
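
A slightly fuller sketch of the same call on a flat DataFrame; the column names here are taken from the split example and are only illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("drop-column-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("a b c", "a", "b", "c")).toDF("raw_message", "c1", "c2", "c3")

// drop accepts either a Column or a column name; both return a new DataFrame.
val withoutC3 = df.drop(df.col("c3"))
val withoutRaw = df.drop("raw_message")

withoutC3.printSchema()
```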

 Q: how to update a specific column in a DataFrame? (filter plugin: )
 A: Use withColumn here as well.
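
Calling withColumn with the name of an existing column replaces that column. A minimal sketch, using upper-casing as an arbitrary example of an update:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

val spark = SparkSession.builder().master("local[1]").appName("update-column-sketch").getOrCreate()
import spark.implicits._

val df = Seq("a b c").toDF("raw_message")

// Same column name as an existing column: the column is replaced, not duplicated.
val updated = df.withColumn("raw_message", upper(col("raw_message")))

updated.show()
```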

 Q: how to add multiple columns at once? (filter plugin: grok, kv, split)
 A: There are two approaches. (1) val newDf = dataframe.flatMap(...), see:
 https://community.hortonworks.com/comments/73622/view.html
 (2) Register a UDF whose return type is a Struct. This amounts to withColumn("name", udf): it adds one nested column
 that contains multiple fields. See:
 https://stackoverflow.com/questions/32196207/derive-multiple-columns-from-a-single-column-in-a-spark-dataframe
 https://community.hortonworks.com/comments/73622/view.html
 See also #30.
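
Approach (2) could look roughly like the sketch below. The case class `Parts`, the column name `parts`, and the fixed three-field split are hypothetical; the final select shows one way to flatten the nested column back into top-level columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical result type; a UDF returning a case class yields a struct column.
case class Parts(c1: String, c2: String, c3: String)

val spark = SparkSession.builder().master("local[1]").appName("struct-udf-sketch").getOrCreate()
import spark.implicits._

val df = Seq("a b c").toDF("raw_message")

// Assumes every message has at least three space-separated fields.
val splitUdf = udf((s: String) => {
  val p = s.split(" ")
  Parts(p(0), p(1), p(2))
})

// One nested column "parts" holding c1, c2 and c3.
val nested = df.withColumn("parts", splitUdf(col("raw_message")))

// Optionally expand the struct fields into top-level columns.
val flat = nested.select("raw_message", "parts.*")

flat.show()
```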

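 Q: how to add a column enriched from dictionary information? (filter plugin: dict, geoip)
 A: This depends on the size of the dictionary data. If it is small, a broadcast variable plus a UDF is enough. If it is large, it is better to register the dictionary as a DataFrame
 and join it with the data DataFrame (Spark should optimize the join automatically based on the dictionary DataFrame, so distinguishing by dictionary size may not be necessary).
 Either way, the dictionary should be created up front with a broadcast variable, so that it is not rebuilt for every batch.
 Other references: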
 https://stackoverflow.com/questions/31816975/how-to-pass-whole-row-to-udf-spark-dataframe-filter
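
Both options could be sketched as follows. The `ip` and `location` columns, the sample dictionary, and the "unknown" default are made up for illustration; the `broadcast` function is Spark's join hint for a small DataFrame, which is separate from a `SparkContext.broadcast` variable.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col, udf}

val spark = SparkSession.builder().master("local[1]").appName("dict-enrich-sketch").getOrCreate()
import spark.implicits._

val events = Seq("10.0.0.1", "10.0.0.2").toDF("ip")

// Option A (small dictionary): broadcast variable + UDF lookup.
val dictMap = spark.sparkContext.broadcast(Map("10.0.0.1" -> "office"))
val lookup = udf((ip: String) => dictMap.value.getOrElse(ip, "unknown"))
val viaUdf = events.withColumn("location", lookup(col("ip")))

// Option B: dictionary as a DataFrame, joined with a broadcast hint.
// A left join keeps events that have no dictionary entry (location is null).
val dictDf = Seq(("10.0.0.1", "office")).toDF("ip", "location")
val viaJoin = events.join(broadcast(dictDf), Seq("ip"), "left")

viaUdf.show()
viaJoin.show()
```

In a streaming job, `dictMap` and `dictDf` would be created once outside the per-batch logic so that they are reused across batches.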

from seatunnel.
