
cutplandataframe's Introduction

Problem

Problem found for Spark version: 3.0.1
Scala version: 2.12

I am working with Spark 3.0.1 and generating a large DataFrame. At the end of the calculations, I save the DataFrame's plan in JSON format, because I need that plan.
But there is one problem: if I persist the DataFrame, its plan in JSON format is truncated, and all data lineage disappears.

For example, I do this:

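import org.apache.spark.sql.DataFrame

val myDf: DataFrame = ???  // placeholder for the large DataFrame
val myPersistDf = myDf.persist()
// toJSON returns a truncated plan once the DataFrame is persisted
val jsonPlan = myPersistDf.queryExecution.optimizedPlan.toJSON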

As a result, only information about the current columns remains.
With Spark 3.1.2 the problem does not occur: the plan is not truncated.
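The comparison described above can be sketched as a small local program. This is a minimal sketch, not the project's CutPlan.scala: the object name `PersistPlanSketch`, the helper `build`, and the toy pipeline are hypothetical, and it assumes Spark is on the classpath. It prints only the lengths of the two JSON plans, so the difference can be checked on whichever Spark version is in use.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PersistPlanSketch {
  // Hypothetical stand-in for the real pipeline: a DataFrame with a few steps of lineage.
  def build(spark: SparkSession): DataFrame =
    spark.range(100).toDF("id")
      .withColumn("doubled", col("id") * 2)
      .filter("doubled > 10")

  // Returns (plan JSON without persist, plan JSON with persist).
  def planJsons(spark: SparkSession): (String, String) = {
    // Serialize the optimized plan before anything has been cached.
    val plainJson = build(spark).queryExecution.optimizedPlan.toJSON
    // Build the same pipeline again, persist it, then serialize its optimized plan.
    val persistedJson = build(spark).persist().queryExecution.optimizedPlan.toJSON
    (plainJson, persistedJson)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("persist-plan-sketch")
      .getOrCreate()
    try {
      val (plainJson, persistedJson) = planJsons(spark)
      println(s"without persist: ${plainJson.length} characters")
      println(s"with persist:    ${persistedJson.length} characters")
    } finally spark.stop()
  }
}
```

The unpersisted plan is serialized first on purpose: once a plan is registered in the cache, an identical pipeline built later may also be answered from the cache, which would blur the comparison.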

How to use

  1. Build the project in an IDE.
  2. Open CutPlan.scala in the directory /src/main/scala/.
  3. Set the isNeedPersist variable to false and run the project:
       val isNeedPersist = false
  4. Set the isNeedPersist variable to true and run the project:
       val isNeedPersist = true
  5. Four plan files will appear in the root directory. Now you can see the problem.

UPD(1):

Now I'm trying to convert each node to JSON separately. It doesn't work perfectly yet, but I think this is the right direction. The remaining problem is that I'm still losing some data lineage.

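import org.apache.spark.sql.catalyst.trees.TreeNode
import org.json4s.jackson.JsonMethods

val jsonPlan = s"[${getJson(result_df.queryExecution.optimizedPlan).mkString(",")}]"

  // Serializes the node itself, then recurses into its inner and regular children.
  def getJson(lp: TreeNode[_]): Seq[String] = {
    val children = (lp.innerChildren ++ lp.children.map(c => c.asInstanceOf[TreeNode[_]])).distinct
    // toJSON yields a JSON array of the subtree's nodes; element 0 is the current node.
    JsonMethods.compact(JsonMethods.render(JsonMethods.parse(lp.toJSON)(0))) +:
      children.flatMap(t => getJson(t))
  }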
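To illustrate why walking innerChildren in addition to children can recover nodes that a children-only traversal misses, here is a toy model with no Spark dependency. `Node`, `TraversalSketch`, and the node names are hypothetical; only the shape of the traversal mirrors getJson. It assumes, as the getJson attempt suggests, that a persisted relation behaves like a leaf that keeps the plan it was built from as an inner child.

```scala
// Toy stand-in for a Catalyst TreeNode: `children` are regular child nodes,
// `innerChildren` are subtrees a node holds outside of `children`.
final case class Node(
    name: String,
    children: Seq[Node] = Nil,
    innerChildren: Seq[Node] = Nil)

object TraversalSketch {
  // Follows only `children`.
  def viaChildren(n: Node): Seq[String] =
    n.name +: n.children.flatMap(viaChildren)

  // Follows `innerChildren` and `children`, in the same order as getJson.
  def viaAllChildren(n: Node): Seq[String] =
    n.name +: (n.innerChildren ++ n.children).distinct.flatMap(viaAllChildren)

  def main(args: Array[String]): Unit = {
    val source = Node("Scan")
    val filter = Node("Filter", children = Seq(source))
    // A leaf (no regular children) that keeps its original subtree as an inner child.
    val cached = Node("Cached", innerChildren = Seq(filter))
    val plan   = Node("Project", children = Seq(cached))

    println(viaChildren(plan).mkString(" -> "))    // Project -> Cached
    println(viaAllChildren(plan).mkString(" -> ")) // Project -> Cached -> Filter -> Scan
  }
}
```

In this toy model the children-only walk stops at the leaf, while the combined walk reaches the nodes behind it. Whether every piece of lineage in a real plan is reachable this way is exactly the open question noted above.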

P.S. Issue in Jira: https://issues.apache.org/jira/browse/SPARK-38068
Question on Stack Overflow: https://stackoverflow.com/questions/70910318/why-do-the-persist-and-cache-methods-shorten-dataframe-plan-in-spark
