Comments (7)
Hi Andrew,
Can you please provide a ROOT file?
VK
from spark-root.
Random thing - can you follow the naming conventions in there? I mean the order of columns?
Just as a test: as far as the exception goes, the error is not actually on the spark-root side, but there might be something I'm doing wrong as well...
Try using the same order of columns first, and please provide a file if possible. But I see that this is not a CMS AOD/MiniAOD or similar...
VK
Hi VK - I notice we're on the same Slack, so I'll send you the URL through there (it's actual CMS data, so...)
And I agree the problem looks to be in Spark itself (I don't see anything obvious in the stacktrace to say otherwise), which is surprising to me...
I can confirm that the problem goes away if I reorder the columns. This works:
droppedColumn = df.select('Trigger_names', 'Trigger_prescale')
but this doesn't:
droppedColumn = df.select('Trigger_prescale', 'Trigger_names')
That seems counterintuitive to me. I've gone through the Spark docs a couple of times and can't find anything that says the columns in a select statement need to be in the same order as they appear in the DataFrame. In fact, I see a suggestion that select() can be used to reorder columns. I'm perplexed at what's happening...
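For contrast, here is a tiny plain-Python sketch of name-based versus position-based column projection. This is a loose analogy only, not Spark's or spark-root's actual code, and the trigger values are made up; the column names are taken from the thread. A reader that binds requested columns to on-disk data by file position rather than by name would show exactly this kind of order sensitivity:

```python
# Loose analogy only -- not Spark's actual resolution logic.
# One "file row"; column names come from the thread, values are made up.
row = {"Trigger_names": ["HLT_Mu8"], "Trigger_prescale": [1]}
file_order = ["Trigger_names", "Trigger_prescale"]  # physical order on disk

def select_by_name(row, cols):
    # Name-based projection: the order of `cols` never matters.
    return [row[c] for c in cols]

def select_by_position(row, cols, file_order):
    # Position-based projection: silently assumes `cols` matches the file
    # order, so a reordered select() binds the wrong data to each column.
    return [row[file_order[i]] for i in range(len(cols))]

print(select_by_name(row, ["Trigger_prescale", "Trigger_names"]))
# [[1], ['HLT_Mu8']] -- correct binding
print(select_by_position(row, ["Trigger_prescale", "Trigger_names"], file_order))
# [['HLT_Mu8'], [1]] -- mis-bound
```

Under name-based resolution the order of arguments to select() is free, which matches what the Spark docs suggest; only a position-based shortcut somewhere in the read path would care about ordering.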
Hi Andrew,
If possible, use the right order for the next ~2 weeks :) I'm on vacation right now...
I will check what's going on under the hood once I'm back in the office. I've seen this behavior before and will check whether it's spark-root; it must be us somewhere in the column selection...
Apologies for the delay in resolving this issue!
VK
No problem, enjoy your vacation! As we discussed, I see something similar with UDFs as well, which is where I originally got stuck. I've written up a set of tests that hopefully point to what's happening. The stacktraces look similar to the select() case: something is getting confused in Spark's Catalyst module (which appears to be the query-planning layer).
What's particularly interesting is that I still get these errors even if I don't ask for any columns to be passed to the UDF...
# Verify Spark functionality
import findspark
findspark.init()

import pyspark
import random

if 'sc' in globals():
    sc.stop()
sc = pyspark.SparkContext(appName="Pi")

import os
import os.path
from pyspark.sql.types import BooleanType, ArrayType
from pyspark.sql.functions import array
import pyspark.sql.functions

sqlContext = pyspark.SQLContext(sc)
testPath = os.path.join(os.getcwd(), "tuple-test.root")
df = sqlContext.read.format("org.dianahep.sparkroot").load(testPath)
dropColumn1 = df.select("Trigger_decision")
dropColumn2 = df.select("Trigger_decision", "Trigger_names")

def triggerFilterFunc(val=None):
    return True

triggerFilterUDF = pyspark.sql.functions.udf(triggerFilterFunc, BooleanType())

# 1. OK - Use the only column in the DataFrame
triggeredDF = dropColumn1.withColumn("Trigger_pass", triggerFilterUDF("Trigger_decision"))
triggeredDF.take(1)

# 2. OK - Use the first of two columns in the DataFrame
triggeredDF = dropColumn2.withColumn("Trigger_pass", triggerFilterUDF("Trigger_decision"))
triggeredDF.take(1)

# 3. OK - Use both columns in the two-column DataFrame in reversed order
triggeredDF = dropColumn2.withColumn("Trigger_pass", triggerFilterUDF(array("Trigger_names", "Trigger_decision")))
triggeredDF.take(1)

# 4. OK - Use both columns in the two-column DataFrame in the original order
triggeredDF = dropColumn2.withColumn("Trigger_pass", triggerFilterUDF(array("Trigger_decision", "Trigger_names")))
triggeredDF.take(1)

# 5. OK - Use the second of two columns in the DataFrame
triggeredDF = dropColumn2.withColumn("Trigger_pass", triggerFilterUDF("Trigger_names"))
triggeredDF.take(1)

# 6. Not OK - Use the first column in the original DataFrame
triggeredDF = df.withColumn("Trigger_pass", triggerFilterUDF("Trigger_decision"))
triggeredDF.take(1)

# 7. Not OK - Use the second column in the original DataFrame
triggeredDF = df.withColumn("Trigger_pass", triggerFilterUDF("Trigger_names"))
triggeredDF.take(1)

# 8. Not OK - Pass no columns with the one-column DataFrame
triggeredDF = dropColumn1.withColumn("Trigger_pass", triggerFilterUDF())
triggeredDF.take(1)

# 9. Not OK - Pass no columns with the two-column DataFrame
triggeredDF = dropColumn2.withColumn("Trigger_pass", triggerFilterUDF())
triggeredDF.take(1)

# 10. Not OK - Pass no columns with the original DataFrame
triggeredDF = df.withColumn("Trigger_pass", triggerFilterUDF())
triggeredDF.take(1)
fixed with vkhristenko@07a1be0