Comments (3)
如果这么写,JSON Filter只能处理单层,不嵌套的json string,
{"k1":"v1", "k2":3}能处理,{"k1":"v1", "k2":{"k3":"v3"}}像这样的再嵌套一层的,就会直接报错
from seatunnel.
目前解决方案为,先通过spark.read.json
获取到schema
,再根据schema
进行转换
val stringDataSet = df.select(srcField).as[String]
val schema = spark.read.json(stringDataSet).schema
df.withColumn(targetField, from_json(col(srcField), schema))
此方案的问题是带来了额外的计算开销。
解决方案:
- 抽样而不是把所有数据转换为dataset,val stringDataSet = df.select(srcField).sample(true, 0.1).as[String]
- 如果传入的数据,本来就是JSON,就在Input之后,filter之前,一次性转换为DF:val df = spark.read.json(stringRDD)
from seatunnel.
一个新的解决思路:
通过spark.read.json(stringDataSet)获取一个dataframe df1,剩余的字段通过df1.withColums加上。 可以减少多余的json解析操作
通过spark.read.json(stringDataSet)获取一个dataframe df1,新增自增id字段,和原始df进行join
import pyspark.sql.functions as f
minDf = sc.parallelize([['2016-11-01 10:50:00'],['2016-11-01 11:46:00']]).toDF(["date_min"])
maxDf = sc.parallelize([['2016-11-01 10:50:00'],['2016-11-01 11:46:00']]).toDF(["date_max"])
# since there is no common column between these two dataframes add row_index so that it can be joined
minDf=minDf.withColumn('row_index', f.monotonically_increasing_id())
maxDf=maxDf.withColumn('row_index', f.monotonically_increasing_id())
minDf = minDf.join(maxDf, on=["row_index"]).sort("row_index").drop("row_index")
minDf.show()
from seatunnel.
Related Issues (20)
- Why email not have corresponding e2e test
- [Bug] [Connector-JDBC] Table structure synchronization HOT 3
- MongoDB - Sink - PluginIdentifier not found HOT 7
- [Feature][Module Name] Feature title HOT 1
- how to set tls_verify_certificate = false in JDBC connection
- [Feature][Engine] The name of the rest-api interface for returning job details was changed from running-job to job-info
- [Feature][sftp] Ignore error records HOT 2
- The main method caused an error: Plugin PluginIdentifier HOT 7
- [Bug] [CI] Fix FixSlotResourceTest Unstable issues
- [Feature][Connector-V2] Support Hive catalog for paimon sink HOT 1
- Seatunnel Source code compile failed
- When extracting hive data to mysql, the parallelism parameter setting is invalid. HOT 7
- [Bug] [Connector-V2-Paimon] Data changes are lost when sinking into Paimon using batch mode. HOT 18
- [Bug] quick-start-flink not working !!! ? HOT 7
- [Feature][Transform] Add split multiple source table in transform HOT 1
- [Feature] Unified Storage Config
- hook Listener HOT 1
- [Bug] [Zeta] restore job failed casued by NPE HOT 2
- [Bug] [Zeta] Config variable substitution 不一致 HOT 4
- Error: VM option 'UseG1GC' is experimental and must be enabled via -XX:+UnlockExperimentalVMOptions HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from seatunnel.