Problem found for Spark version: 3.0.1
Scala version: 2.12
I am working with Spark version 3.0.1 and generating a large DataFrame.
At the end of the calculations, I save the DataFrame plan in JSON format; I need it for data lineage.
But there is one problem: if I persist a DataFrame, its plan in JSON format is completely truncated.
That is, all data lineage disappears.
For example, I do this:
val myDf: DataFrame = ???
val myPersistDf = myDf.persist
// the toJSON method truncates my plan
val jsonPlan = myPersistDf.queryExecution.optimizedPlan.toJSON
As a result, only information about the current columns remains.
If you use Spark version 3.1.2, there is no such problem; the plan is not truncated.
- Build the project in an IDE
- Open CutPlan.scala in the /src/main/scala/ directory
- Change the value of the isNeedPersist variable to false and run the project:
17: val isNeedPersist = false
- Change the value of the isNeedPersist variable to true and run the project:
17: val isNeedPersist = true
UPD(1):
Now I'm trying to convert each node to JSON separately. It doesn't work perfectly yet, but I think this is the right direction. The problem is that I'm still losing some data lineage.
val jsonPlan = s"[${getJson(result_df.queryExecution.optimizedPlan).mkString(",")}]"

import org.apache.spark.sql.catalyst.trees.TreeNode
import org.json4s.jackson.JsonMethods

// Pre-order traversal: serialize the current node on its own (keeping only
// the head element of the array that toJSON produces), then recurse into all
// children, including innerChildren (e.g. a cached plan).
def getJson(lp: TreeNode[_]): Seq[String] = {
  val children = (lp.innerChildren ++ lp.children.map(c => c.asInstanceOf[TreeNode[_]])).distinct
  JsonMethods.compact(JsonMethods.render(JsonMethods.parse(lp.toJSON)(0))) +:
    children.flatMap(t => getJson(t))
}
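The traversal idea above can be illustrated without Spark. This is a minimal sketch using a toy Node class in place of Catalyst's TreeNode (the Node and toJson names are hypothetical, purely illustrative): each call emits the current node's JSON and then recurses into its children, so no subtree is dropped from the output.

```scala
// Toy stand-in for a plan node; in Spark this would be a Catalyst TreeNode.
case class Node(name: String, children: Seq[Node]) {
  // Minimal JSON for one node, without its children.
  def toJson: String = s"""{"name":"$name"}"""
}

// Pre-order walk: this node's JSON first, then every descendant's,
// mirroring the getJson sketch for Catalyst plans.
def collectJson(n: Node): Seq[String] =
  n.toJson +: n.children.flatMap(collectJson)

val tree = Node("Project", Seq(Node("Filter", Seq(Node("Relation", Nil)))))
val jsonPlan = s"[${collectJson(tree).mkString(",")}]"
// jsonPlan: [{"name":"Project"},{"name":"Filter"},{"name":"Relation"}]
```

The same pattern applies to a real plan: because every node is serialized individually before recursing, a node whose own toJSON output omits its children (as happens after persist) no longer hides the rest of the lineage.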
P.S. Issue in Jira: https://issues.apache.org/jira/browse/SPARK-38068
Question on Stack Overflow: https://stackoverflow.com/questions/70910318/why-do-the-persist-and-cache-methods-shorten-dataframe-plan-in-spark