Giter Site home page Giter Site logo

JSON插件目前的问题 about seatunnel HOT 3 CLOSED

apache avatar apache commented on July 16, 2024
JSON插件目前的问题

from seatunnel.

Comments (3)

garyelephant avatar garyelephant commented on July 16, 2024

image

如果这么写,JSON Filter只能处理单层,不嵌套的json string,
{"k1":"v1", "k2":3}能处理,{"k1":"v1", "k2":{"k3":"v3"}}像这样的再嵌套一层的,就会直接报错

from seatunnel.

RickyHuo avatar RickyHuo commented on July 16, 2024

目前解决方案为,先通过spark.read.json获取到schema,再根据schema进行转换

val stringDataSet = df.select(srcField).as[String]
val schema = spark.read.json(stringDataSet).schema
df.withColumn(targetField, from_json(col(srcField), schema))

此方案的问题是带来了额外的计算开销。

解决方案:

  • 抽样而不是把所有数据转换为dataset,val stringDataSet = df.select(srcField).sample(true, 0.1).as[String]
  • 如果传入的数据,本来就是JSON,就在Input之后,filter之前,一次性转换为DF:val df = spark.read.json(stringRDD)

from seatunnel.

RickyHuo avatar RickyHuo commented on July 16, 2024

一个新的解决思路:

通过spark.read.json(stringDataSet)获取一个dataframe df1,剩余的字段通过df1.withColums加上。 可以减少多余的json解析操作

通过spark.read.json(stringDataSet)获取一个dataframe df1,新增自增id字段,和原始df进行join

import pyspark.sql.functions as f

minDf = sc.parallelize([['2016-11-01 10:50:00'],['2016-11-01 11:46:00']]).toDF(["date_min"])
maxDf = sc.parallelize([['2016-11-01 10:50:00'],['2016-11-01 11:46:00']]).toDF(["date_max"])

# since there is no common column between these two dataframes add row_index so that it can be joined
minDf=minDf.withColumn('row_index', f.monotonically_increasing_id())
maxDf=maxDf.withColumn('row_index', f.monotonically_increasing_id())

minDf = minDf.join(maxDf, on=["row_index"]).sort("row_index").drop("row_index")
minDf.show()

from seatunnel.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.