Problem found for Spark version: 3.0.1
Scala version: 2.12
I am working with Spark version 3.0.1 and generating a large DataFrame.
At the end of the calculations, I save the DataFrame plan in JSON format; I need it for data lineage.
But there is one problem: if I persist a DataFrame, its plan in JSON format is completely truncated.
That is, all data lineage disappears.
For example, I do this:
val myDf: DataFrame = ???
val myPersistDf = myDf.persist
// the toJSON method truncates my plan
val jsonPlan = myPersistDf.queryExecution.optimizedPlan.toJSON
As a result, only information about the current columns remains.
If you use Spark version 3.1.2, there is no such problem; the plan is not truncated.
- Build the project in an IDE
- Open CutPlan.scala in the /src/main/scala/ directory
- Change the value of the isNeedPersist variable to false and run the project:
17: val isNeedPersist = false
- Change the value of the isNeedPersist variable to true and run the project:
17: val isNeedPersist = true
UPD(1):
Now I'm trying to convert each node to JSON separately. It doesn't work perfectly yet, but I think this is the right direction. The problem is that I'm still losing some data lineage.
val jsonPlan = s"[${getJson(result_df.queryExecution.optimizedPlan).mkString(",")}]"

import org.apache.spark.sql.catalyst.trees.TreeNode
import org.json4s.jackson.JsonMethods

// Pre-order traversal: serialize the current node on its own (keeping only
// the head element of the array that toJSON produces), then recurse into all
// children, including innerChildren (e.g. a cached plan).
def getJson(lp: TreeNode[_]): Seq[String] = {
  val children = (lp.innerChildren ++ lp.children.map(c => c.asInstanceOf[TreeNode[_]])).distinct
  JsonMethods.compact(JsonMethods.render(JsonMethods.parse(lp.toJSON)(0))) +:
    children.flatMap(t => getJson(t))
}
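The traversal idea above can be illustrated without Spark. This is a minimal sketch using a toy Node class in place of Catalyst's TreeNode (the Node and toJson names are hypothetical, purely illustrative): each call emits the current node's JSON and then recurses into its children, so no subtree is dropped from the output.

```scala
// Toy stand-in for a plan node; in Spark this would be a Catalyst TreeNode.
case class Node(name: String, children: Seq[Node]) {
  // Minimal JSON for one node, without its children.
  def toJson: String = s"""{"name":"$name"}"""
}

// Pre-order walk: this node's JSON first, then every descendant's,
// mirroring the getJson sketch for Catalyst plans.
def collectJson(n: Node): Seq[String] =
  n.toJson +: n.children.flatMap(collectJson)

val tree = Node("Project", Seq(Node("Filter", Seq(Node("Relation", Nil)))))
val jsonPlan = s"[${collectJson(tree).mkString(",")}]"
// jsonPlan: [{"name":"Project"},{"name":"Filter"},{"name":"Relation"}]
```

The same pattern applies to a real plan: because every node is serialized individually before recursing, a node whose own toJSON output omits its children (as happens after persist) no longer hides the rest of the lineage.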
P.S. Issue in Jira: https://issues.apache.org/jira/browse/SPARK-38068
Question on Stack Overflow: https://stackoverflow.com/questions/70910318/why-do-the-persist-and-cache-methods-shorten-dataframe-plan-in-spark