diana-hep / spark-root
Apache Spark Data Source for ROOT File Format
License: Apache License 2.0
by default FileFormat does not provide a recursive file search
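Until the data source handles recursion itself, one workaround (a sketch, not part of the repo; the hdfs:///data/root path is a placeholder) is to enumerate the files with the Hadoop FileSystem API and pass them to load explicitly:

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer

// Recursively collect .root files, then hand the explicit list to the reader.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val files = ArrayBuffer.empty[String]
val it = fs.listFiles(new Path("hdfs:///data/root"), true) // true = recursive
while (it.hasNext) {
  val status = it.next()
  if (status.getPath.getName.endsWith(".root")) files += status.getPath.toString
}
val df = spark.read.format("org.dianahep.sparkroot").load(files: _*)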
Hi, when I try to load the DataFrames with the example files, I get an exception. I have tried the last two versions of the library, 0.15 and 0.16, and the exception occurs in both.
I am attaching a file with the output.
issue-spark-shell-diana-hep.txt
Do not build a Row for the BASE classes - flatten that out!
Below is an example of such a class:
StreamerInfo for class: CSCSegment, version=11, checksum=0x98eed67
RecSegment BASE offset= 0 type= 0
vector<CSCRecHit2D> theCSCRecHits offset= 0 type=300 ,stl=1, ctype=61,
Point3DBase<float,LocalTag> theOrigin offset= 0 type=62 in chamber frame - the GeomDet local coordinate system
Vector3DBase<float,LocalTag> theLocalDirection offset= 0 type=62 in chamber frame - the GeomDet local coordinate system
CLHEP::HepSymMatrix theCovMatrix offset= 0 type=62 the covariance matrix
double theChi2 offset= 0 type= 8
bool aME11a_duplicate offset= 0 type=18
vector<CSCSegment> theDuplicateSegments offset= 0 type=300 ,stl=1, ctype=61, ME1/1a only
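One way to express that flattening at the Spark schema level; a minimal sketch, assuming the BASE members show up as nested structs whose field name matches the base class (e.g. RecSegment above), with flattenBases being a name of my choosing:

import org.apache.spark.sql.types._

// Hoist the fields of BASE-class structs into the parent struct
// instead of wrapping them in their own Row.
def flattenBases(struct: StructType, baseNames: Set[String]): StructType =
  StructType(struct.fields.flatMap { f =>
    f.dataType match {
      case s: StructType if baseNames.contains(f.name) =>
        flattenBases(s, baseNames).fields
      case s: StructType =>
        Array(f.copy(dataType = flattenBases(s, baseNames)))
      case _ => Array(f)
    }
  })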
Need to identify what happens upon repartitioning of the data. We are not yet capable of writing ROOT files, so how can we get 3 tasks reading from different locations?
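For reference, a quick probe of the behavior (a sketch; the file name is a placeholder): repartition() inserts a shuffle, so the 3 downstream tasks read shuffle blocks produced by the original read tasks rather than ROOT files in different locations, which is why no ROOT-writing capability is required:

val df = spark.read.format("org.dianahep.sparkroot").load("file.root")
println(df.rdd.getNumPartitions)  // partition count chosen by the ROOT read
val df3 = df.repartition(3)       // adds a shuffle stage
println(df3.rdd.getNumPartitions) // 3: each task reads shuffled rows, not files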
Hello,
(I'm not sure how to properly debug Scala code, so if you need more info, let me know.)
With Spark 2.2, org.diana-hep:spark-root_2.11:0.1.11, org.diana-hep:histogrammar-sparksql_2.11:1.0.4, and the following code:
import findspark
findspark.init()
import pyspark
import random
if 'sc' in globals():
    sc.stop()
sc = pyspark.SparkContext(appName="Pi")
import os
import os.path
sqlContext = pyspark.SQLContext(sc)
testPath = os.path.join(os.getcwd(), "tuple-test.root")
df = sqlContext.read.format("org.dianahep.sparkroot").load(testPath)
droppedColumn = df.select('Trigger_names','Trigger_prescale','Trigger_decision')
# This works
df.take(1)
# This fails
droppedColumn.take(1)
I get the following backtrace: https://gist.github.com/PerilousApricot/118a6aaa088fe3ed6e07a36e7e5c794d
df.printSchema()
yields the following
root
|-- Trigger_decision: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- Trigger_names: array (nullable = true)
| |-- element: string (containsNull = true)
|-- Trigger_prescale: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- Muon_pt: array (nullable = true)
| |-- element: double (containsNull = true)
<snip other columns>
I stumbled onto this error while trying to write a UserDefinedFunction to filter over events; the above snippet is the minimal case I developed for this report. I can also provide the input file if you're a CMS member, if that helps.
Just a regex won't be sufficient for complex nested class declarations!
logger.debug(s"Synthesizing the current class arguments: ${classTypeString}")
val mapTemplateRE = "(.*?),(.*?)".r
val (keyTypeString, valueTypeString) = argumentsTypeString match {
  case mapTemplateRE(aaa, bbb) => (aaa, bbb)
  case _ => (null, null)
}
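A depth-aware split avoids this; a minimal sketch (splitTopLevel is my name): track angle-bracket nesting and cut only at a comma at depth 0, so a declaration like map<int,vector<pair<int,float> > > keeps its value type intact where the regex above would cut at the first comma:

def splitTopLevel(args: String): Option[(String, String)] = {
  var depth = 0
  var splitAt = -1
  var i = 0
  while (i < args.length && splitAt < 0) {
    args.charAt(i) match {
      case '<' => depth += 1          // entering a nested template
      case '>' => depth -= 1          // leaving a nested template
      case ',' if depth == 0 => splitAt = i // top-level separator
      case _ =>
    }
    i += 1
  }
  if (splitAt < 0) None
  else Some((args.substring(0, splitAt).trim, args.substring(splitAt + 1).trim))
}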
We do not need to actually read an object to know its class; just use the key.
This is important because TH2, for instance, is not being read properly so far.
VK
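The class name is stored directly in the key header on disk, so it can be recovered without deserializing the payload. A hand-rolled sketch following the documented TKey layout (this is not an existing root4j call; classNameFromKey is my name):

import java.io.DataInput

def classNameFromKey(in: DataInput): String = {
  in.readInt()                  // fNbytes
  val version = in.readShort()  // key version; > 1000 means 64-bit seeks
  in.readInt()                  // fObjlen
  in.readInt()                  // fDatime
  in.readShort()                // fKeylen
  in.readShort()                // fCycle
  val seekBytes = if (version > 1000) 8 else 4
  in.skipBytes(2 * seekBytes)   // fSeekKey, fSeekPdir
  val n = in.readUnsignedByte() // TString length; 255 flags a 4-byte length
  val len = if (n == 255) in.readInt() else n
  val buf = new Array[Byte](len)
  in.readFully(buf)
  new String(buf, "US-ASCII")   // fClassName, e.g. "TH2F"
}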
I can use the library with root4j: I can load DataFrames and work with them. But the library is no longer maintained, so is there any maintained alternative?
For CMSSW there are a bunch of empty classes - remove them from the schema
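A minimal sketch of that pruning against Spark's schema types (the recursion and the dropEmpty name are mine): recursively drop any struct field that ends up with zero members, so the empty CMSSW classes never reach the DataFrame schema:

import org.apache.spark.sql.types._

def dropEmpty(struct: StructType): StructType =
  StructType(struct.fields.flatMap { f =>
    f.dataType match {
      case s: StructType =>
        val pruned = dropEmpty(s)
        // a struct with no surviving members is removed entirely
        if (pruned.isEmpty) None else Some(f.copy(dataType = pruned))
      case _ => Some(f)
    }
  })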
Hi everyone,
I was wondering if the notebook provided as an example (publicCMSMuonia_exampleAnalysis) was the actual code/workflow run here or at least similar in functionality here and here
Thanks in advance for any pointers!
Hi!
When I try to load this file (3D coordinates of points) in a spark-shell, I get the following error:
scala> val df = spark.read.format("org.dianahep.sparkroot").load("event000001000-hits.root")
Map(path -> event000001000-hits.root)
java.io.IOException: Error during decompression (size=5013/14440)
at org.dianahep.root4j.core.RootHDFSInputStream.slice(RootHDFSInputStream.java:354)
at org.dianahep.root4j.proxy.TKey.getData(<generated>:107)
at org.dianahep.root4j.proxy.TKey.getObject(<generated>:55)
at org.dianahep.root4j.core.FileClassFactory.<init>(FileClassFactory.java:39)
at org.dianahep.root4j.RootFileReader.init(RootFileReader.java:324)
at org.dianahep.root4j.RootFileReader.<init>(RootFileReader.java:183)
at org.dianahep.sparkroot.package$RootTableScan.<init>(sparkroot.scala:96)
at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:146)
at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:143)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
... 48 elided
Caused by: java.util.zip.DataFormatException: invalid distance code
at java.util.zip.Inflater.inflateBytes(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:259)
at org.dianahep.root4j.core.RootHDFSInputStream.slice(RootHDFSInputStream.java:314)
... 59 more
Maybe the file is corrupted to start with, but being neither an expert nor a user of ROOT, I cannot judge...
So any help would be appreciated!
Thanks,
Julien
Dear All,
When I try to load a ROOT file (which you can find here [1]), I get the following error [2].
Usually I have never had problems using the command below:
df = spark.read.format("org.dianahep.sparkroot").load("file.root")
We noted that some branches (those starting with Jet_btagSF, for instance) are visible in the TBrowser but not in PyROOT.
I assume some mistake was made when creating those branches.
Is there any plan to allow skipping 'problematic' branches when loading a ROOT file, instead of crashing?
Cheers,
Luca
[1] https://www.dropbox.com/s/8yzbdvs4rbaiyf7/6214A145-5711-E811-997E-0CC47A78A42C_Friend.root?dl=0
[2]
Map(path -> /data/taohuang/HHNtuple_20180418_DYestimation/DYJetsToLL_M-10to50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/6214A145-5711-E811-997E-0CC47A78A42C_Friend.root)
Warning: Generating dummy read for fIOBits
Traceback (most recent call last):
  File "trainDY.py", line 32, in <module>
    df = spark.read.format("org.dianahep.sparkroot").load("/data/taohuang/HHNtuple_20180418_DYestimation/DYJetsToLL_M-10to50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/6214A145-5711-E811-997E-0CC47A78A42C_Friend.root")
  File "/home/demarley/anaconda2/lib/python2.7/site-packages/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/home/demarley/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/demarley/anaconda2/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/demarley/anaconda2/lib/python2.7/site-packages/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o41.load.
: java.io.IOException: Cannot skip object with no length
at org.dianahep.root4j.core.RootInputStream.skipObject(RootInputStream.java:596)
at org.dianahep.root4j.core.RootHDFSInputStream.skipObject(RootHDFSInputStream.java:387)
at org.dianahep.root4j.proxy.ROOT.TIOFeatures.readMembers(<generated>)
at org.dianahep.root4j.core.AbstractRootObject.read(AbstractRootObject.java:52)
at org.dianahep.root4j.core.RootInputStream.readObject(RootInputStream.java:466)
at org.dianahep.root4j.core.RootHDFSInputStream.readObject(RootHDFSInputStream.java:222)
at org.dianahep.root4j.proxy.TTree.readMembers(<generated>)
at org.dianahep.root4j.core.AbstractRootObject.read(AbstractRootObject.java:52)
at org.dianahep.root4j.proxy.TKey.getObject(<generated>:57)
at org.dianahep.sparkroot.core.package$$anonfun$findTree$1.apply(ast.scala:1177)
at org.dianahep.sparkroot.core.package$$anonfun$findTree$1.apply(ast.scala:1166)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at org.dianahep.sparkroot.core.package$.findTree(ast.scala:1166)
at org.dianahep.sparkroot.package$RootTableScan.<init>(sparkroot.scala:97)
at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:146)
at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:143)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)