diana-hep / spark-root Goto Github PK

View Code? Open in Web Editor NEW

29.0 29.0 18.0 38.68 MB

Apache Spark Data Source for ROOT File Format

License: Apache License 2.0

Scala 9.66% Jupyter Notebook 90.34%

big-data histogrammar python root scala spark

spark-root's People

Contributors

Stargazers

Watchers

Forkers

vkhristenko remkamal superdude264 yarborocl srimanob aniq55 reikdas vkuznet gutsche julienpeloton nutufts lucacanali glorf wenxingfang saborforly youngkwonjo neogyk

spark-root's Issues

Recursively find all the input files

by default FileFormat does not provide a recursive file search

Exception when I try to create a Dataframe from source data

Hi, When I try to load the Dataframes with the example files, I got an exception. I have tried to use the last two versions of the library, 0.15 and 0.16 and in both the exceptions occur.

I am attaching a file with the output.
issue-spark-shell-diana-hep.txt

Broadcast HadoopConfiguration to the executors

Add all of the STL Collections

multimap
array/list, etc....
for associative containers - include the case for non-basic type for the key

Flatten out BASE classes

Do not build a Row for the BASE classes - flatten that out!

Why bytesRead are 0s always???

class members as pointers to the class itself, or a vector<T>

below is an example of that

StreamerInfo for class: CSCSegment, version=11, checksum=0x98eed67
  RecSegment     BASE            offset=  0 type= 0                     
  vector<CSCRecHit2D> theCSCRecHits   offset=  0 type=300 ,stl=1, ctype=61,                     
  Point3DBase<float,LocalTag> theOrigin       offset=  0 type=62 in chamber frame - the GeomDet local coordinate system
  Vector3DBase<float,LocalTag> theLocalDirection offset=  0 type=62 in chamber frame - the GeomDet local coordinate system
  CLHEP::HepSymMatrix theCovMatrix    offset=  0 type=62 the covariance matrix
  double         theChi2         offset=  0 type= 8                     
  bool           aME11a_duplicate offset=  0 type=18                     
  vector<CSCSegment> theDuplicateSegments offset=  0 type=300 ,stl=1, ctype=61, ME1/1a only

What happes upon df.repartition

Need to identify what happens upon repartitioning of the data. We are not capable yet of writing ROOT files, therefore how can we get 3 tasks reading from different locations???

Crash reading ROOT tuples

Hello,

(I'm not sure how to properly debug scala code, so if you need more info, let me know)

With spark2.2 with org.diana-hep:spark-root_2.11:0.1.11,org.diana-hep:histogrammar-sparksql_2.11:1.0.4 and the following code:

import findspark
findspark.init()

import pyspark
import random
if 'sc' in globals():
    sc.stop()
sc = pyspark.SparkContext(appName="Pi")
import os
import os.path
sqlContext = pyspark.SQLContext(sc)
testPath = os.path.join(os.getcwd(), "tuple-test.root")
df = sqlContext.read.format("org.dianahep.sparkroot").load(testPath)
droppedColumn = df.select('Trigger_names','Trigger_prescale','Trigger_decision')
# This works
df.take(1)
# This fails
droppedColumn.take(1)

I get the following backtrace: https://gist.github.com/PerilousApricot/118a6aaa088fe3ed6e07a36e7e5c794d

df.printSchema() yields the following

root
 |-- Trigger_decision: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- Trigger_names: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Trigger_prescale: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- Muon_pt: array (nullable = true)
 |    |-- element: double (containsNull = true)
<snip other columns>

I stumbled onto this error trying to write a UserDefinedFunction to filter over events, and the above snippet was a "minimal case" I developed to report. I can provide the input file if you're a CMS member if that helps as well.

Code Cleaning

The build of the intermediate state needs cleaning + commenting

Proper Template Arguments synthesis

just a re won't be sufficient for complex nested class declarations!

          logger.debug(s"Synthesizing the current class arguments: ${classTypeString}")
          val mapTemplateRE = "(.*?),(.*?)".r
          val (keyTypeString, valueTypeString) = argumentsTypeString match {
            case mapTemplateRE(aaa,bbb) => (aaa,bbb)
            case _ => (null, null)
          }

use "paranthesis balancing"

TTree Search bug fix

https://github.com/diana-hep/spark-root/blob/master/src/main/scala/org/dianahep/sparkroot/ast/ast.scala#L1689

do not need to actually read an object to know its class, just use the key.
Important cause TH2, for instance, are not being properly read so far

Is there a maintained version?

I can use the library using root4j, I can load Dataframes and work with them, but, the library is no longer maintained, so, is there any maintained library?

issue-spark-shell-diana-hep.txt

Remove empty Rows from the schema

For CMSSW there are a bunch of empty classes - remove them from the schema

Regarding public CMS MuOnia Example Analysis python notebook

Hi everyone,

I was wondering if the notebook provided as an example (publicCMSMuonia_exampleAnalysis) was the actual code/workflow run here or at least similar in functionality here and here
Thanks in advanced for any pointers!

java.io.IOException: Error during decompression (root 6.14)

Configuration:

OS: macOS High Sierra (10.13.5)
JDK: 8
Scala: 2.11
Spark: 2.2.1
extra-packages: org.diana-hep:root4j:0.1.6,org.diana-hep:spark-root_2.11:0.1.16

Hi!

When I try to load this file (3d coordinates of points) in a spark-shell, I get the following error:

scala> val df = spark.read.format("org.dianahep.sparkroot").load("event000001000-hits.root")
Map(path -> event000001000-hits.root)
java.io.IOException: Error during decompression (size=5013/14440)
  at org.dianahep.root4j.core.RootHDFSInputStream.slice(RootHDFSInputStream.java:354)
  at org.dianahep.root4j.proxy.TKey.getData(<generated>:107)
  at org.dianahep.root4j.proxy.TKey.getObject(<generated>:55)
  at org.dianahep.root4j.core.FileClassFactory.<init>(FileClassFactory.java:39)
  at org.dianahep.root4j.RootFileReader.init(RootFileReader.java:324)
  at org.dianahep.root4j.RootFileReader.<init>(RootFileReader.java:183)
  at org.dianahep.sparkroot.package$RootTableScan.<init>(sparkroot.scala:96)
  at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:146)
  at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:143)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 48 elided
Caused by: java.util.zip.DataFormatException: invalid distance code
  at java.util.zip.Inflater.inflateBytes(Native Method)
  at java.util.zip.Inflater.inflate(Inflater.java:259)
  at org.dianahep.root4j.core.RootHDFSInputStream.slice(RootHDFSInputStream.java:314)
  ... 59 more

Maybe the file is corrupted to start with, but being neither an expert nor a user of root, I cannot judge...
So any help would be appreciated!

Thanks,
Julien

Problematic branches cause a crash

Dear All,

When I try to load a root file (that you can find here [1]) I get the following error [2].
Usually I never had problems using the command below:
df = spark.read.format("org.dianahep.sparkroot").load("file.root")

We noted that some branches (those starting with Jet_btagSF) for instance, are visible in the TBrowser, but not in PYROOT.
I assume some mistake have been done when creating those branches.
There is any plan to be able to skip 'problematic' branches when loading a root file, instead of having a crash?

Cheers,
Luca

[1] https://www.dropbox.com/s/8yzbdvs4rbaiyf7/6214A145-5711-E811-997E-0CC47A78A42C_Friend.root?dl=0

[2]
Map(path -> /data/taohuang/HHNtuple_20180418_DYestimation/DYJetsToLL_M-10to50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/6214A145-5711-E811-997E-0CC47A78A42C_Friend.root)
Warnng: Generating dummy read for fIOBits
Traceback (most recent call last):
File "trainDY.py", line 32, in
df = spark.read.format("org.dianahep.sparkroot").load("/data/taohuang/HHNtuple_20180418_DYestimation/DYJetsToLL_M-10to50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/6214A145-5711-E811-997E-0CC47A78A42C_Friend.root")
File "/home/demarley/anaconda2/lib/python2.7/site-packages/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/home/demarley/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in call
answer, self.gateway_client, self.target_id, self.name)
File "/home/demarley/anaconda2/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/demarley/anaconda2/lib/python2.7/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o41.load.
: java.io.IOException: Cannot skip object with no length
at org.dianahep.root4j.core.RootInputStream.skipObject(RootInputStream.java:596)
at org.dianahep.root4j.core.RootHDFSInputStream.skipObject(RootHDFSInputStream.java:387)
at org.dianahep.root4j.proxy.ROOT.TIOFeatures.readMembers()
at org.dianahep.root4j.core.AbstractRootObject.read(AbstractRootObject.java:52)
at org.dianahep.root4j.core.RootInputStream.readObject(RootInputStream.java:466)
at org.dianahep.root4j.core.RootHDFSInputStream.readObject(RootHDFSInputStream.java:222)
at org.dianahep.root4j.proxy.TTree.readMembers()
at org.dianahep.root4j.core.AbstractRootObject.read(AbstractRootObject.java:52)
at org.dianahep.root4j.proxy.TKey.getObject(:57)
at org.dianahep.sparkroot.core.package$$anonfun$findTree$1.apply(ast.scala:1177)
at org.dianahep.sparkroot.core.package$$anonfun$findTree$1.apply(ast.scala:1166)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at org.dianahep.sparkroot.core.package$.findTree(ast.scala:1166)
at org.dianahep.sparkroot.package$RootTableScan.(sparkroot.scala:97)
at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:146)
at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:143)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)