
spark-root's People

Contributors

glorf, jpivarski, reikdas, vkhristenko


spark-root's Issues

class members as pointers to the class itself, or a vector<T>

Below is an example (note the vector<CSCSegment> member inside the CSCSegment class itself):

StreamerInfo for class: CSCSegment, version=11, checksum=0x98eed67
  RecSegment     BASE            offset=  0 type= 0                     
  vector<CSCRecHit2D> theCSCRecHits   offset=  0 type=300 ,stl=1, ctype=61,                     
  Point3DBase<float,LocalTag> theOrigin       offset=  0 type=62 in chamber frame - the GeomDet local coordinate system
  Vector3DBase<float,LocalTag> theLocalDirection offset=  0 type=62 in chamber frame - the GeomDet local coordinate system
  CLHEP::HepSymMatrix theCovMatrix    offset=  0 type=62 the covariance matrix
  double         theChi2         offset=  0 type= 8                     
  bool           aME11a_duplicate offset=  0 type=18                     
  vector<CSCSegment> theDuplicateSegments offset=  0 type=300 ,stl=1, ctype=61, ME1/1a only    

What happens upon df.repartition

We need to identify what happens when the data is repartitioned. We are not yet capable of writing ROOT files, so how can we end up with, say, 3 tasks reading from different locations? A minimal way to observe the behaviour is sketched below.
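
The following sketch (not code from this repository) assumes a spark-shell with the spark-root package on the classpath and a placeholder input file named tuple-test.root; it simply compares partition counts before and after the repartition:

// Sketch only: "tuple-test.root" is a placeholder input file name.
val df = spark.read.format("org.dianahep.sparkroot").load("tuple-test.root")

// How many read tasks did the data source produce?
println(s"partitions after load: ${df.rdd.getNumPartitions}")

// Ask for 3 partitions; the open question is what each of those 3 tasks
// actually reads, given that we cannot write ROOT files back out yet.
val df3 = df.repartition(3)
println(s"partitions after repartition(3): ${df3.rdd.getNumPartitions}")

In stock Spark, repartition introduces a shuffle, so the downstream tasks read shuffle output rather than the ROOT file directly.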

Crash reading ROOT tuples

Hello,

(I'm not sure how to properly debug scala code, so if you need more info, let me know)

With Spark 2.2, the packages org.diana-hep:spark-root_2.11:0.1.11 and org.diana-hep:histogrammar-sparksql_2.11:1.0.4, and the following code:

import findspark
findspark.init()

import pyspark
import random
if 'sc' in globals():
    sc.stop()
sc = pyspark.SparkContext(appName="Pi")
import os
import os.path
sqlContext = pyspark.SQLContext(sc)
testPath = os.path.join(os.getcwd(), "tuple-test.root")
df = sqlContext.read.format("org.dianahep.sparkroot").load(testPath)
droppedColumn = df.select('Trigger_names','Trigger_prescale','Trigger_decision')
# This works
df.take(1)
# This fails
droppedColumn.take(1)

I get the following backtrace: https://gist.github.com/PerilousApricot/118a6aaa088fe3ed6e07a36e7e5c794d

df.printSchema() yields the following:

root
 |-- Trigger_decision: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- Trigger_names: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Trigger_prescale: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- Muon_pt: array (nullable = true)
 |    |-- element: double (containsNull = true)
<snip other columns>

I stumbled onto this error while trying to write a UserDefinedFunction to filter over events; the snippet above is the minimal case I could reduce it to for this report. If it helps, I can provide the input file (CMS members only).

Code Cleaning

  • The code that builds the intermediate state needs cleaning and commenting

Proper Template Arguments synthesis

Just a regex won't be sufficient for complex nested class declarations!

          logger.debug(s"Synthesizing the current class arguments: ${classTypeString}")
          val mapTemplateRE = "(.*?),(.*?)".r
          val (keyTypeString, valueTypeString) = argumentsTypeString match {
            case mapTemplateRE(aaa,bbb) => (aaa,bbb)
            case _ => (null, null)
          }  
  • use "parenthesis balancing" instead of a regex (a sketch follows below)
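
A possible shape for that balancing (not code from this repository; the helper name splitTemplateArgs is invented for illustration): split the argument string only at commas that sit at nesting depth zero, tracking angle brackets and parentheses instead of relying on a regex.

// Hypothetical helper: split template arguments at top-level commas only, so that
// nested declarations such as "int,vector<pair<int,float> >" are handled correctly.
def splitTemplateArgs(argumentsTypeString: String): Seq[String] = {
  val args = scala.collection.mutable.ArrayBuffer(new StringBuilder)
  var depth = 0
  for (c <- argumentsTypeString) c match {
    case '<' | '(' => depth += 1; args.last.append(c)
    case '>' | ')' => depth -= 1; args.last.append(c)
    case ',' if depth == 0 => args += new StringBuilder   // top-level separator
    case other => args.last.append(other)
  }
  args.map(_.toString.trim).toSeq
}

// splitTemplateArgs("int,vector<pair<int,float> >")
//   -> Seq("int", "vector<pair<int,float> >")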

java.io.IOException: Error during decompression (root 6.14)

Configuration:

  • OS: macOS High Sierra (10.13.5)
  • JDK: 8
  • Scala: 2.11
  • Spark: 2.2.1
  • extra-packages: org.diana-hep:root4j:0.1.6,org.diana-hep:spark-root_2.11:0.1.16

Hi!

When I try to load this file (3d coordinates of points) in a spark-shell, I get the following error:

scala> val df = spark.read.format("org.dianahep.sparkroot").load("event000001000-hits.root")
Map(path -> event000001000-hits.root)
java.io.IOException: Error during decompression (size=5013/14440)
  at org.dianahep.root4j.core.RootHDFSInputStream.slice(RootHDFSInputStream.java:354)
  at org.dianahep.root4j.proxy.TKey.getData(<generated>:107)
  at org.dianahep.root4j.proxy.TKey.getObject(<generated>:55)
  at org.dianahep.root4j.core.FileClassFactory.<init>(FileClassFactory.java:39)
  at org.dianahep.root4j.RootFileReader.init(RootFileReader.java:324)
  at org.dianahep.root4j.RootFileReader.<init>(RootFileReader.java:183)
  at org.dianahep.sparkroot.package$RootTableScan.<init>(sparkroot.scala:96)
  at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:146)
  at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:143)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 48 elided
Caused by: java.util.zip.DataFormatException: invalid distance code
  at java.util.zip.Inflater.inflateBytes(Native Method)
  at java.util.zip.Inflater.inflate(Inflater.java:259)
  at org.dianahep.root4j.core.RootHDFSInputStream.slice(RootHDFSInputStream.java:314)
  ... 59 more

Maybe the file is corrupted to start with, but being neither an expert nor a regular user of ROOT, I cannot judge.
Any help would be appreciated!

Thanks,
Julien
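
One way to narrow down whether this failure comes from the Spark layer or from the underlying root4j reader is to open the file with root4j directly; a hedged sketch, assuming RootFileReader accepts a plain path string (the stack trace above shows the exception being thrown while the reader initialises):

import org.dianahep.root4j.RootFileReader

// Sketch only: the path-string constructor is assumed, not verified against root4j.
// Per the stack trace, "Error during decompression" surfaces while the reader
// initialises, so this should reproduce the problem without Spark in the loop.
try {
  val reader = new RootFileReader("event000001000-hits.root")
  println(s"root4j opened the file without errors: $reader")
} catch {
  case e: java.io.IOException =>
    println(s"root4j fails on its own as well: ${e.getMessage}")
}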

Problematic branches cause a crash

Dear All,

When I try to load a ROOT file (which you can find here [1]), I get the error shown in [2].
I have never had problems before with the command below:
df = spark.read.format("org.dianahep.sparkroot").load("file.root")

We noticed that some branches (for instance those starting with Jet_btagSF) are visible in the TBrowser but not in PyROOT.
I assume some mistake was made when those branches were created.
Is there any plan to allow skipping 'problematic' branches when loading a ROOT file, instead of crashing?

Cheers,
Luca

[1] https://www.dropbox.com/s/8yzbdvs4rbaiyf7/6214A145-5711-E811-997E-0CC47A78A42C_Friend.root?dl=0

[2]
Map(path -> /data/taohuang/HHNtuple_20180418_DYestimation/DYJetsToLL_M-10to50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/6214A145-5711-E811-997E-0CC47A78A42C_Friend.root)
Warnng: Generating dummy read for fIOBits
Traceback (most recent call last):
  File "trainDY.py", line 32, in <module>
    df = spark.read.format("org.dianahep.sparkroot").load("/data/taohuang/HHNtuple_20180418_DYestimation/DYJetsToLL_M-10to50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/6214A145-5711-E811-997E-0CC47A78A42C_Friend.root")
  File "/home/demarley/anaconda2/lib/python2.7/site-packages/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/home/demarley/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/demarley/anaconda2/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/demarley/anaconda2/lib/python2.7/site-packages/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o41.load.
: java.io.IOException: Cannot skip object with no length
  at org.dianahep.root4j.core.RootInputStream.skipObject(RootInputStream.java:596)
  at org.dianahep.root4j.core.RootHDFSInputStream.skipObject(RootHDFSInputStream.java:387)
  at org.dianahep.root4j.proxy.ROOT.TIOFeatures.readMembers(<generated>)
  at org.dianahep.root4j.core.AbstractRootObject.read(AbstractRootObject.java:52)
  at org.dianahep.root4j.core.RootInputStream.readObject(RootInputStream.java:466)
  at org.dianahep.root4j.core.RootHDFSInputStream.readObject(RootHDFSInputStream.java:222)
  at org.dianahep.root4j.proxy.TTree.readMembers(<generated>)
  at org.dianahep.root4j.core.AbstractRootObject.read(AbstractRootObject.java:52)
  at org.dianahep.root4j.proxy.TKey.getObject(<generated>:57)
  at org.dianahep.sparkroot.core.package$$anonfun$findTree$1.apply(ast.scala:1177)
  at org.dianahep.sparkroot.core.package$$anonfun$findTree$1.apply(ast.scala:1166)
  at scala.collection.immutable.Range.foreach(Range.scala:160)
  at org.dianahep.sparkroot.core.package$.findTree(ast.scala:1166)
  at org.dianahep.sparkroot.package$RootTableScan.<init>(sparkroot.scala:97)
  at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:146)
  at org.dianahep.sparkroot.DefaultSource.createRelation(sparkroot.scala:143)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:214)
  at java.lang.Thread.run(Thread.java:748)
