
Crash reading ROOT tuples (spark-root issue, closed)

diana-hep commented on September 28, 2024
Crash reading ROOT tuples

from spark-root.

Comments (7)

vkhristenko commented on September 28, 2024

Hi Andrew,

Can you please provide a ROOT file?

VK


vkhristenko commented on September 28, 2024

Random thing: can you follow the conventions in there? I mean the order of columns.
Just as a test. As far as the exception goes, the error is not actually on the spark-root side, but there might be something I'm doing wrong as well.

Try using the same order of columns first, and please provide a file if possible. I see that this is not a CMS AOD/MiniAOD or similar format, though.

VK


PerilousApricot commented on September 28, 2024

Hi VK - I notice we're on the same Slack, so I'll send you the URL through there (it's actual CMS data, so...)


PerilousApricot commented on September 28, 2024

And I agree the problem looks to be in Spark itself (I don't see anything obvious in the stacktrace to say otherwise), which is surprising to me.

I can confirm that the problem goes away if I reorder the columns. This works:

droppedColumn = df.select('Trigger_names','Trigger_prescale')

but this doesn't:

droppedColumn = df.select('Trigger_prescale','Trigger_names')

That seems counter-intuitive to me. I've gone through the Spark docs a couple of times and can't see anything that says the columns in a select() call need to be in the same order as they appear in the DataFrame. In fact, I see a suggestion that select() can be used to reorder columns. I'm perplexed at what's happening.
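Until the cause is found, one way to apply the "same order of columns" workaround without hand-sorting names is to reorder the requested columns to match the DataFrame's schema before calling select(). The sketch below is only an illustration: the helper name in_schema_order is hypothetical, and it assumes the order that works is the order reported by df.columns, which the thread suggests but does not confirm.

```python
def in_schema_order(schema_columns, wanted):
    """Return `wanted` sorted by each name's position in `schema_columns`.

    Hypothetical helper, not part of Spark or spark-root.
    """
    position = {name: i for i, name in enumerate(schema_columns)}
    missing = [name for name in wanted if name not in position]
    if missing:
        raise KeyError("columns not in schema: {}".format(missing))
    return sorted(wanted, key=position.__getitem__)


# Intended use with a PySpark DataFrame (df.columns is the list of column names):
#   ordered = in_schema_order(df.columns, ['Trigger_prescale', 'Trigger_names'])
#   droppedColumn = df.select(*ordered)
```

The helper works on plain lists of names, so it can be checked without a Spark session.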


vkhristenko commented on September 28, 2024

Hi Andrew,

If possible, please use the right column order for the next ~2 weeks :) I'm on vacation right now.
I will check what's going on under the hood once I'm back in the office. I've seen this behavior before and will check whether it comes from spark-root. It must be us, somewhere in the column selection.

Apologies for the delay in resolving this issue!

VK


PerilousApricot commented on September 28, 2024

No problem, enjoy your vacation! Like we discussed, I see something similar with UDFs as well, which is where I originally got stuck. I've written up a set of tests that hopefully can point to what's happening. The stacktraces look similar to the select() case: something is getting confused in Spark's Catalyst module (which appears to be the query planning module).

What's particularly interesting is that I still get those errors even if I don't ask for any columns to be passed to the UDF.

# Verify Spark functionality
import os

import findspark
findspark.init()  # locate the Spark installation before importing pyspark

import pyspark
import pyspark.sql.functions
from pyspark.sql.functions import array
from pyspark.sql.types import BooleanType

# Stop any SparkContext left over from a previous run
if 'sc' in globals():
    sc.stop()
sc = pyspark.SparkContext(appName="Pi")
sqlContext = pyspark.SQLContext(sc)

# Load the test ROOT file from the current directory through spark-root
testPath = os.path.join(os.getcwd(), "tuple-test.root")
df = sqlContext.read.format("org.dianahep.sparkroot").load(testPath)

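# Reduced DataFrames with one and two columns
dropColumn1 = df.select("Trigger_decision")
dropColumn2 = df.select("Trigger_decision", "Trigger_names")

# Trivial UDF: ignores its (optional) argument and always returns True
def triggerFilterFunc(val=None):
    return True
triggerFilterUDF = pyspark.sql.functions.udf(triggerFilterFunc, BooleanType())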

# 1. OK - Use the only column in the DataFrame
triggeredDF = dropColumn1.withColumn("Trigger_pass", triggerFilterUDF('Trigger_decision'))
triggeredDF.take(1)

# 2. OK - Use the first of two columns in the DataFrame
triggeredDF = dropColumn2.withColumn("Trigger_pass", triggerFilterUDF('Trigger_decision'))
triggeredDF.take(1)

# 3. OK - Use both columns in the 2-column DataFrame in reverse order
triggeredDF = dropColumn2.withColumn("Trigger_pass", triggerFilterUDF(array("Trigger_names", 'Trigger_decision')))
triggeredDF.take(1)

# 4. OK - Use both columns in the 2-column DataFrame in correct order
triggeredDF = dropColumn2.withColumn("Trigger_pass", triggerFilterUDF(array('Trigger_decision', "Trigger_names")))
triggeredDF.take(1)

# 5. OK - Use the second of two columns in the DataFrame
triggeredDF = dropColumn2.withColumn("Trigger_pass", triggerFilterUDF('Trigger_names'))
triggeredDF.take(1)

# 6. Not OK - Use the first column in the original DataFrame
triggeredDF = df.withColumn("Trigger_pass", triggerFilterUDF('Trigger_decision'))
triggeredDF.take(1)

# 7. Not OK - Use the second column in the original DataFrame
triggeredDF = df.withColumn("Trigger_pass", triggerFilterUDF('Trigger_names'))
triggeredDF.take(1)

# 8. Not OK - Use no columns in the 1-column DataFrame
triggeredDF = dropColumn1.withColumn("Trigger_pass", triggerFilterUDF())
triggeredDF.take(1)

# 9. Not OK - Use no columns in the 2-column DataFrame
triggeredDF = dropColumn2.withColumn("Trigger_pass", triggerFilterUDF())
triggeredDF.take(1)

# 10. Not OK - Use no columns in the original DataFrame
triggeredDF = df.withColumn("Trigger_pass", triggerFilterUDF())
triggeredDF.take(1)
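
If the cases above are run as one script rather than cell by cell, the first failing case would end the run before the later ones execute. A small harness along the following lines could run every case and report which ones raise. This is a sketch only: run_cases is a hypothetical helper, not part of PySpark or spark-root, and it assumes each failure surfaces as a Python exception when take(1) is called.

```python
def run_cases(cases):
    """Run each (label, thunk) pair and record whether it raised."""
    results = {}
    for label, thunk in cases:
        try:
            thunk()
            results[label] = "OK"
        except Exception as exc:
            # Keep only the exception type and the first line of its message
            first_line = (str(exc).splitlines() or [""])[0]
            results[label] = "Not OK: {}: {}".format(type(exc).__name__, first_line)
    return results


# Intended use with the DataFrames and UDF defined above:
#   results = run_cases([
#       ("1", lambda: dropColumn1.withColumn(
#           "Trigger_pass", triggerFilterUDF('Trigger_decision')).take(1)),
#       ("6", lambda: df.withColumn(
#           "Trigger_pass", triggerFilterUDF('Trigger_decision')).take(1)),
#   ])
#   for label, outcome in results.items():
#       print(label, outcome)
```

Wrapping each case in a lambda defers the Spark call until the harness invokes it, so an exception in one case does not prevent the others from running.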


vkhristenko commented on September 28, 2024

Fixed with vkhristenko@07a1be0.
