
absaoss / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

License: Apache License 2.0

Scala 63.81% COBOL 3.24% Gnuplot 0.20% Batchfile 0.03% Shell 0.06% ANTLR 0.52% Java 32.13%
cobol-parser spark cobol copybook mainframe ebcdic scalable etl

cobrix's People

Contributors

codealways, dependabot[bot], dwfchu, felipemmelo, fosgate29, gd-iborisov, georgichochov, jaggel, joaquin021, miroslavpojer, schaloner-kbc, thesuperzapper, tr11, vbarakou, wajda, yruslan, zejnilovic

cobrix's Issues

expandPic failing when expanding virtual decimals

If a .cob file has a PIC with two sets of parentheses, and either set contains more than one digit, the parser fails to parse that line and reports a NOT DIVISIBLE by the RECORD SIZE error, with the record size calculated from the copybook coming out smaller than expected.

Pass: PIC S9(15)V99.
Pass: PIC S9(9)V9(6).
Fail: S9(15)V9(2).

I suspect this may be related to repeatCount in expandPic, around line 944 of CopybookParser.scala.
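
For illustration, here is a minimal sketch of the expansion the parser is expected to perform on parenthesised repeat counts; this is not the actual Cobrix implementation, just a regex-based approximation showing that the failing PIC should expand to the same result as the passing ones.

import scala.util.matching.Regex

// Expand parenthesised repeat counts in a PIC string,
// e.g. "S9(15)V9(2)" -> "S999999999999999V99".
val repeatPattern: Regex = "([9XAPZ])\\((\\d+)\\)".r

def expandPic(pic: String): String =
  repeatPattern.replaceAllIn(pic, m => m.group(1) * m.group(2).toInt)

expandPic("S9(15)V99")   // S999999999999999V99
expandPic("S9(9)V9(6)")  // S999999999V999999
expandPic("S9(15)V9(2)") // S999999999999999V99 -- the failing case expands the same way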

Rewrite the COBOL parser

The current parser does its job, but it is not written according to standard parsing practice.

We need to write a grammar for the subset of COBOL that we are parsing and generate the parser using a tool like ANTLR, JavaCC, etc.

Not showing deeper levels as columns. Also .count() problem.

I managed to get it working.
However, it is only showing the level 01 records as columns in the data frame.
Why is that? The level 05 and level 10 fields are not appearing as columns.
Also, a .count() call is failing with a Java error. Why is this? (The error is pasted at the bottom of this issue.)

scala> :load /home/private/Documents/SparkSQLExample_cobol.scala
Loading /home/private/Documents/SparkSQLExample_cobol.scala...
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, DataFrame, SQLContext}
import org.apache.spark.sql.SQLContext
import za.co.absa.cobrix._
import scodec.bits._
import scodec.codecs._
spark: org.apache.spark.sql.SparkSession.type = org.apache.spark.sql.SparkSession$@4ad3969
res0: spark.Builder = org.apache.spark.sql.SparkSession$Builder@7ec5aad
res1: spark.Builder = org.apache.spark.sql.SparkSession$Builder@7ec5aad
2018-10-03 14:52:31 WARN SparkSession$Builder:66 - Using an existing SparkSession; some configuration may not take effect.
res2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@65f470f8
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@21ed4a51
2018-10-03 14:52:40 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
DataFrame: org.apache.spark.sql.DataFrame = [SAM010A_LEADER_RECORD: struct<SAM010A_ALL_ZEROS: string, SAM010A_DATE_CREATED: int ... 3 more fields>, SAM010D_DATA_RECORD_B: struct<SAM010D_ACCOUNT_NO: bigint, SAM010D_ACCOUNT_NO_A: struct<SAM010D_PREFIX: int, SAM010D_BRANCH_NO: int ... 1 more field> ... 301 more fields> ... 2 more fields]
+---------------------+---------------------+-----------------------------+----------------------+
|SAM010A_LEADER_RECORD|SAM010D_DATA_RECORD_B|SAM010E_BRANCH_TRAILER_RECORD|SAM010F_TRAILER_RECORD|
+---------------------+---------------------+-----------------------------+----------------------+
| [0000000000000000...| [2001001355, [2, ...| [2, 1, 2408,,, 0,...| [909000000, 0]|
| [8909000000000002...| [9909000000, [9, ...| [2, 909, 0,, 2000...| [0,]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 9...| [, 1725]|
| [00{0000000019640...| [, [0,, 0], [0], ...| [0,, 0, 338520000...| [0, 0]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 0...| [, 5463696]|
| [283261{000000000...| [, [0, 147,], [1]...| [0, 0,, 0.00, 0, ...| [0, 0]|
| [0000000000000000...| [5000000, [0, 5, ...| [0, 0, 0, 1000600...| [90,]|
| [0000000090000000...| [90, [0, 0, 90], ...| [0, 0, 90, 0.00, ...| [0, 0]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 0...| [0, 0]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 0...| [0, 0]|
+---------------------+---------------------+-----------------------------+----------------------+
only showing top 10 rows

scala> DataFrame.count()
[Stage 1:========================================================>(77 + 1) / 78]2018-10-03 15:03:30 ERROR Executor:91 - Exception in task 77.0 in stage 1.0 (TID 78)
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.spark.input.FixedLengthBinaryRecordReader.nextKeyValue(FixedLengthBinaryRecordReader.scala:118)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Inconsistent generated IDs

Depending on whether segment ID filtering is set, the resulting IDs can have gaps in the ID numbers.

The generated IDs should not depend on "segment_field", "segment_filter" and "segment_id_root" options.

signed decimal places

Hi,
I have fields which are made up of 7 numeric digits S9(7). We need them to have 7 decimal places and I cannot find a PIC that can achieve that.
So a value of 000456J needs to become -0.0004561
V9(7) achieves that, but only for positive values; it produces nulls for negatives.
I've tried several PICs, including the ones below, but they either fail or do not achieve the desired result:
SV9(7)
S9(0)V9(7)
Can you please let me know what PIC I need to use to get 7 decimal places for a signed field of 7 digits?
Thank you

streaming

I'm reading a file in streaming mode. There is no error, but the count returns 0.

My code:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import za.co.absa.cobrix.spark.cobol.source.streaming.CobolStreamer._

val spark = SparkSession
  .builder()
  .appName("CobolParser")
  .master("local[2]")
  .config("duration", 13)
  .config("copybook", paramMap("-Dcopybook"))  // paramMap is defined elsewhere in the application
  .config("path", paramMap("-Ddata"))
  .getOrCreate()

val streamingContext = new StreamingContext(spark.sparkContext, Seconds(13))

val reader = getReader(streamingContext)

val result = streamingContext.cobolStream()
result.foreachRDD((rdd: RDD[Row], time: Time) => {
  println(time)
  println(rdd.count())   // this always prints 0
  for (item <- rdd) {
    println(s"---------------$item")
  }
})

Output example:
9/01/13 09:30:47 INFO JobScheduler: Added jobs for time 1547400647000 ms
19/01/13 09:30:47 INFO JobScheduler: Starting job streaming job 1547400647000 ms.0 from job set of time 1547400647000 ms
1547400647000 ms
19/01/13 09:30:47 INFO SparkContext: Starting job: count at NewStreamingEbcdic.scala:78
19/01/13 09:30:47 INFO DAGScheduler: Job 0 finished: count at NewStreamingEbcdic.scala:78, took 0.005128 s
0
19/01/13 09:30:47 INFO SparkContext: Starting job: foreach at NewStreamingEbcdic.scala:79
19/01/13 09:30:47 INFO DAGScheduler: Job 1 finished: foreach at NewStreamingEbcdic.scala:79, took 0.000039 s
19/01/13 09:30:47 INFO JobScheduler: Finished job streaming job 1547400647000 ms.0 from job set of time 1547400647000 ms
19/01/13 09:30:47 INFO JobScheduler: Total delay: 0.252 s for time 1547400647000 ms (execution: 0.085 s)
19/01/13 09:30:47 INFO FileInputDStream: Cleared 0 old files that were older than 1547400582000 ms:
19/01/13 09:30:47 INFO ReceivedBlockTracker: Deleting batches:
19/01/13 09:30:47 INFO InputInfoTracker: remove old batch metadata:
19/01/13 09:31:00 INFO FileInputDStream: Finding new files took 11 ms
19/01/13 09:31:00 INFO FileInputDStream: New files at time 1547400660000 ms:

Why does this happen?

Unable to read local file

Whenever I try to read a local file, it still searches for the file in HDFS and gives the exception below:

java.lang.IllegalArgumentException: Wrong FS: file://xxxxxx:9000/root/abc.cpy expected: hdfs://xxxxxx:9000

Command:
val df = spark.read.format("cobol")
  .option("encoding", "ebcdic")
  .option("generate_record_id", true)
  .option("copybook", "file:///root/abc.cpy")
  .load("file:///root/data.bin")

Allow disabling file size check for debugging purposes

When a copybook does not match the data file, an error message is displayed saying that the record size does not divide the file size, and no data is retrieved.

It can be very helpful, when debugging where exactly a copybook starts to mismatch a data file, to let users override that file size check and still parse the file. In this mode only the top records of a data file may be retrieved correctly, but most of the time that is enough for debugging the copybook/data file correspondence.
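
A sketch of how such an override could look from the user side; the option name debug_ignore_file_size and the paths are assumptions, not a confirmed API:

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")   // hypothetical path
  .option("debug_ignore_file_size", "true")      // hypothetical option name
  .load("/path/to/mismatched_data")              // hypothetical path
df.show(10, truncate = false)                    // inspect the top records to see where fields start to drift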

Retain Group data

I have a few copybooks that contain the structure in the copybook below:

***********************************************************************
        05  NAME.
            10  SHORT-NAME.
                11  NAME-CHAR-1     PIC X.
                11  FILLER          PIC X(9).
            10  FILLER              PIC X(20).
***********************************************************************

The NAME field is 30 characters, but the copybook also uses the first 10 chars for the SHORT-NAME field and the first character as another field.

Right now we drop all the information in the FILLER fields, so we end up with only the first character in the copybook above. This is connected to #53, and a solution for that one would probably cover this case as well.

Maybe we could add an option to specify non-terminal items to retain? In the example above, we would want to retain NAME and SHORT-NAME. We could add an _NT suffix or pass in a map that specifies the new names.
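
A sketch of how the proposed option could be used; the option name non_terminals and the _NT suffix behaviour are taken from the suggestion above and are assumptions, not a confirmed API:

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")   // hypothetical path
  .option("non_terminals", "NAME,SHORT-NAME")    // hypothetical option name and value format
  .load("/path/to/data")                         // hypothetical path
// Expected result: extra string columns (e.g. NAME_NT, SHORT_NAME_NT) holding the raw
// 30- and 10-character values alongside the parsed child fields.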

Support for decimal scaling

I have the following PIC in a copybook: SVPP9(5), which is not recognized as a valid PIC. The two Ps mean that the SV9(5) field is scaled down by 100. It seems that we can have fields such as S9(n)P(m) or 9(n)P(m) for scaling up, and P(m)9(n) or SVP(m)9(n) for scaling down.

Page 211 of this manual gives the following examples:

PPP999: 0 through .000999
S999PPP: -1000 through -999000 and +1000 through +999000 or zero
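
As a small illustration of the arithmetic those scaling positions imply (not the Cobrix decoder, just the two examples above worked out in Scala):

// "PPP999": the Ps are scaling positions between the assumed decimal point and
// the stored digits, so stored digits 999 represent 999 * 10^-(3 + 3) = 0.000999
val scaledDown = 999 * math.pow(10, -(3 + 3))   // 9.99E-4

// "S999PPP": the Ps scale the stored digits up, so 999 represents 999 * 10^3 = 999000
// (the sign comes from the S overpunch)
val scaledUp = 999 * math.pow(10, 3)            // 999000.0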

cannot find columns with occurs x times

When I tried to read the COBOL binary file using the following script:
val df = spark.read.format("za.co.absa.cobrix.spark.cobol.source")
  .option("copybook", "file:///test/XYZ.txt")
  .option("is_record_sequence", "true")
  .option("is_rdw_big_endian", "true")
  .option("schema_retention_policy", "collapse_root")
  .load("/test/test1")

The script ran successfully, but when I retrieved the data I found the FILLER column missing, while the other columns are fine. In the copybook it looks like:
02 FILLER OCCURS 2 TIMES
PIC X.

Can you tell me how to solve the issue?

OCCURS depending on non-integer fields

The OCCURS clause handler currently enforces an integer DEPENDING ON field, which can lead to issues when the field depended on is interpreted as a decimal. For example, S9V is a Decimal(1,0) but in reality is an integer value since it doesn't specify any decimal places.

Maybe we can allow a Decimal(n,0) to be depended on or, alternatively, interpret those as int values.

This is not a priority, since one can easily change a copybook's DEPENDING ON fields to strip the extra V decimal place indicator.

Spark-Cobol-App

I get an error when I execute the command mvn test in the Spark-Cobol-App example.
The error is:
java.lang.IllegalStateException: Root segment SEGMENT_ID=='C' not found in the data file.

Retaining string for 9(...) picture

Sometimes pictures like 9(8) do not actually refer to integer values. For example, many mainframe files use 9(8) to represent a date. This means the value should not be converted to an integer automatically. I'd like to have a feature where 9(...) fields are not automatically converted to integers but are treated as strings.
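
A possible workaround with the current behaviour, assuming a DataFrame df whose hypothetical column DATE_FIELD was read from a 9(8) picture as an integer: rebuild the original 8-character string, keeping leading zeros.

import org.apache.spark.sql.functions.{col, lpad}

// Restore the fixed-width textual form of the 9(8) field (e.g. "20190101")
val withDateString = df.withColumn(
  "DATE_FIELD_STR",
  lpad(col("DATE_FIELD").cast("string"), 8, "0")
)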

Exception in thread "main" za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 16: Unable to parse the value of LEVEL. Numeric value expected, but 'CUSTOM-CHANGE-FLAGS-CNT' encountered

I am getting the error
Exception in thread "main" za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 16: Unable to parse the value of LEVEL. Numeric value expected, but 'CUSTOM-CHANGE-FLAGS-CNT' encountered

Also, does Cobrix work with nested OCCURS subgroups? How many levels of nesting are handled?
Do I need to make any changes to the copybooks from the mainframe in order to parse them? If any changes are required, please let me know.
I am having a lot of issues with copybooks; I have tried to parse many of them, but they fail for some reason.

If there are any tips for parsing copybooks more easily, please let me know.

I appreciate your cooperation.

java.lang.NoClassDefFoundError: za/co/absa/cobrix/cobol/parser/encoding/Encoding

I ran mvn package to build your code.
I referenced the three produced jars in spark-shell:
spark-cobol-0.2.6-SNAPSHOT.jar
spark-cobol-0.2.6-SNAPSHOT-javadoc.jar
spark-cobol-0.2.6-SNAPSHOT-sources.jar

Then I invoke spark-shell:
spark-shell --jars /path_to_cobrix_jars/*.jar

I am getting an error that a particular class is missing.
Can you guide me?

scala> :load /home/private/Documents/SparkSQLExample_cobol.scala
Loading /home/private/Documents/SparkSQLExample_cobol.scala...
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, DataFrame, SQLContext}
import org.apache.spark.sql.SQLContext
import za.co.absa.cobrix._
spark: org.apache.spark.sql.SparkSession.type = org.apache.spark.sql.SparkSession$@37f41a81
res0: spark.Builder = org.apache.spark.sql.SparkSession$Builder@5ee581db
res1: spark.Builder = org.apache.spark.sql.SparkSession$Builder@5ee581db
2018-09-11 10:16:51 WARN SparkSession$Builder:66 - Using an existing SparkSession; some configuration may not take effect.
res2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@5a8656a2
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4070c4ff
java.lang.NoClassDefFoundError: za/co/absa/cobrix/cobol/parser/encoding/Encoding
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createFixedLengthReader(DefaultSource.scala:82)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.buildEitherReader(DefaultSource.scala:69)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:50)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 77 elided
Caused by: java.lang.ClassNotFoundException: za.co.absa.cobrix.cobol.parser.encoding.Encoding
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 85 more

Code executed is below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, DataFrame, SQLContext}
import org.apache.spark.sql.SQLContext
import za.co.absa.cobrix._

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .getOrCreate()

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val DataFrame = sqlContext.read.format("cobol")
  .option("copybook", "/home/private/Documents/Data_Files/Cobol_EBCDIC/pnic00.wspack.sam010.d0.sq2700.D180630.cpy")
  .load("/home/private/Documents/Data_Files/Cobol_EBCDIC/pnic00.wspack.sam010.d0.sq2700.D180630.dat")

Add an option to control string trimming

Currently, when Cobrix reads a string field it trims it on both sides.
It is often required to be as little data-intrusive as possible and only trim on the right, or not trim at all.

A new option should be created to control this behavior.
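
A sketch of how such an option might be exposed; the option name and its values are assumptions, not a confirmed API:

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")   // hypothetical path
  .option("string_trimming_policy", "right")     // hypothetical option name; e.g. both | left | right | none
  .load("/path/to/data")                         // hypothetical path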

Add support for EBCDIC code pages

Currently Cobrix supports only the common EBCDIC code page. We need an option for users to specify a different code page for the data file. The priority is code page 37.

Characters in these code pages cannot be mapped to 7-bit ASCII. Need an EBCDIC to Unicode character conversion to support this.

Add support for nested segment loading

Currently, multisegment files can be loaded only one segment at a time. It would be very useful to add the possibility of loading all the segments of a hierarchical database. Child nodes can become an array of structs inside the root node (see the schema sketch after the questions below).

Open implementation questions:

  • How should users specify parent-child relationships among segments? Perhaps the parent segment needs to be specified for each segment.
  • How should users provide a schema for each of the segments? Should a copybook per segment be required, or can all segments somehow be combined into one copybook? Which is better?
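
The schema sketch referenced above, expressed as a Spark StructType; the segment and field names are hypothetical and serve only to illustrate child segments becoming arrays of structs inside the root segment:

import org.apache.spark.sql.types._

val targetSchema = StructType(Seq(
  StructField("COMPANY_ID", StringType),              // root segment fields
  StructField("COMPANY_NAME", StringType),
  StructField("CONTACTS", ArrayType(StructType(Seq(   // child segment nested as an array of structs
    StructField("CONTACT_NAME", StringType),
    StructField("PHONE_NUMBER", StringType)
  ))))
))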

EBCDIC byte reader - New feature

Hi Team - We have an enhancement requirement to read bytes of data from a single EBCDIC file. The file contains different record types, and a 3-letter or 1-letter keyword identifies the record type. The file should be split based on that.
Note: it is not a standard file that has an RDW.

SV9(x) should be a valid PIC

Great library! I hit an issue with a copybook containing an SV9(5) PIC. The tests seem to indicate this is invalid, but PICs such as SV9(x) and SV99 should be valid.

According to this, given a PIC S9(p-s)V9(s):

p is precision; s is scale. 0≤s≤p≤31. If s=0, use S9(p)V or S9(p). If s=p, use SV9(s).

Top level REDEFINES

I have a copybook that has a format akin to the following

***********************************************************************
       01 RECORD-TYPE-1.
           02  FIELD-1            PIC X(16).
       01 RECORD-TYPE-2 REDEFINES RECORD-TYPE-1.
           02  FIELD-2            PIC X(16).
       01 RECORD-TYPE-3 REDEFINES RECORD-TYPE-1.
           02  FIELD-3            PIC X(16).
***********************************************************************

The library reads this as a copybook of size 48 (3 × 16), but it should be only 16 bytes. The following copybook, which adds a root element, seems to work correctly (and is a workaround for now):

***********************************************************************
      01 RECORD.
       02 RECORD-TYPE-1.
           03  FIELD-1            PIC X(16).
       02 RECORD-TYPE-2 REDEFINES RECORD-TYPE-1.
           03  FIELD-2            PIC X(16).
       02 RECORD-TYPE-3 REDEFINES RECORD-TYPE-1.
           03  FIELD-3            PIC X(16).
***********************************************************************

Sparse indexes for non-RDW variable length files not used

Currently, sparse index generation is used only if RDW headers are present in a variable record length file.

This makes processing of fixed record length files slow when file headers and footers are used, or when record ID generation is used for fixed record length files.

Need to add sparse index generation to RDW-less files as well.

spark.read.format: Intellij vs shell

Hi team

Scenario 1: Spark-shell, local mode, Windows local file system.

The following ingestion works fine:

spark-shell --jars D:\Tools\cobrix-master-jars\cobol-parser-0.4.2.jar,D:\Tools\cobrix-master-jars\spark-cobol-0.4.2.jar,D:\Tools\cobrix-master-jars\scodec-core_2.11-1.10.3.jar,D:\Tools\cobrix-master-jars\scodec-bits_2.11-1.1.4.jar

val df_in: DataFrame = spark.read.format("cobol"). //"za.co.absa.cobrix.spark.cobol.source"
    option("copybook", "D:\\Tools\\cobrix-master\\data\\test2_copybook.cob").
    load("D:\\Tools\\cobrix-master\\data\\test2_data\\example.bin")
df_in.count

res30: Long = 10

Scenario 2: IntelliJ Ultimate, Maven, scala-maven-plugin, and the following dependencies in the POM:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.2</version>
        </dependency>

        <dependency>
            <groupId>za.co.absa.cobrix</groupId>
            <artifactId>spark-cobol</artifactId>
            <version>0.4.2</version>
        </dependency>

and creating a simple ingestion routine:

package XYZ

import org.apache.spark.sql._

object Main {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().
      master("local").
      appName("cobrixTester").
      getOrCreate()

    val df_in: DataFrame = spark.read.format("cobol"). //"za.co.absa.cobrix.spark.cobol.source"
      option("copybook", "D:\\Tools\\cobrix-master\\data\\test2_copybook.cob").
      load("D:\\Tools\\cobrix-master\\data\\test2_data\\example.bin")
    df_in.count
  }
}

This results in:

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:67)
Caused by: java.lang.NoClassDefFoundError: scala/Product$class
	at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParameters.<init>(CobolParameters.scala:50)
	at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersParser$.parse(CobolParametersParser.scala:118)
	at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:54)
	at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:48)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at XYZ.Main$.main(Main.scala:14)
	at XYZ.Main.main(Main.scala)
	... 5 more
Caused by: java.lang.ClassNotFoundException: scala.Product$class
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 15 more

This happens for both .format("cobol") and .format("za.co.absa.cobrix.spark.cobol.source").

Thanks for your inputs
Matthias

Field names cannot start with COMP-X

We should support copybooks with reserved keywords at the beginning of field names. While this is bad practice, it's still valid COBOL syntax.

The copybook below would fail because COMP-ACCOUNT-I starts like COMP-X:

                   12  ACCOUNT-DETAIL    OCCURS 80
                                         DEPENDING ON NUMBER-OF-ACCTS
                                         INDEXED BY COMP-ACCOUNT-I.
                      15  ACCOUNT-NUMBER     PIC X(24).
                      15  ACCOUNT-TYPE-N     PIC 9(5) COMP-3.
                      15  ACCOUNT-TYPE-X     REDEFINES
                           ACCOUNT-TYPE-N  PIC X(3).

PR INCOMING TO FIX

NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (136 bytes per record).

I am getting this error while calling df.show().

df.show()
19/04/25 23:03:42 ERROR utils.FileUtils$: File hdfs://quickstart.cloudera:8020/user/cloudera/test_mainframe/DEPEND_SAMP_190422150915.dat IS NOT divisible by 136.
java.lang.IllegalArgumentException: There are some files in /user/cloudera/test_mainframe/DEPEND_SAMP_190422150915.dat that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (136 bytes per record). Check the logs for the names of the files.
at za.co.absa.cobrix.spark.cobol.source.scanners.CobolScanners$.buildScanForFixedLength(CobolScanners.scala:87)
at za.co.absa.cobrix.spark.cobol.source.CobolRelation.buildScan(CobolRelation.scala:85)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:308)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3254)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
... 49 elided

Can someone please help me resolve this exception?

Request for assistance: GCD of 1, but all pointing towards fixed record length files

Hi Ruslan

First of all, thanks for your work! I work for a company that gets its transaction batch files from an IBM mainframe in EBCDIC encoding.

For a PoC on ingesting mainframe data directly into Spark, we have so far managed to ingest fixed record length files without major issues, only by adapting the original copybooks in order to get proper REDEFINES and, with that, a proper fixed record length, which corresponds to the GCD of the test files.

We have issues, however, with variable record length files. We identified several of them simply by calculating the GCD of the test files, which in these cases is of course 1.

However, arranging the corresponding copybook by implementing proper REDEFINES, and excluding common fields from the REDEFINES, results in a structure that suggests a fixed record length:
a) all segments have the same length (Cobrix schema parsing works fine),
b) the second common field, at position 8, is called "RECORD-TYPE", has byte length 4 (DISPLAY) and very likely indicates the record type.

When trying to ingest such files with option("is_record_sequence", "true"), the result is:

df_in: org.apache.spark.sql.DataFrame = [GLOBAL_RECORD: struct<Account_ID: string, Record_Type: string ... 14 more fields>] 2019-04-08 11:07:53 WARN VarLenNestedReader:87 - Input split size = 32 MB 2019-04-08 11:07:53 ERROR Executor:91 - Exception in task 0.0 in stage 98.0 (TID 7660) java.lang.IllegalStateException: RDW headers should never be zero (0,0,0,0). Found zero size record at 0. at za.co.absa.cobrix.cobol.parser.decoders.BinaryUtils$.extractRdwRecordSize(BinaryUtils.scala:305)

So, obviously, the RDW is not found.

When trying to ingest such files with option("is_record_sequence", "false"), the result of course is:

df_in: org.apache.spark.sql.DataFrame = [GLOBAL_RECORD: struct<Account_ID: string, Record_Type: string ... 14 more fields>] 2019-04-08 11:25:20 ERROR FileUtils$:218 - File file:/XYZ IS NOT divisible by 1513. java.lang.IllegalArgumentException: There are some files in XYZ that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (1513 bytes per record)

In brief: GCD of 1, but all is pointing towards fixed record length.

We are still figuring out in which form to send you a data sample plus copybook, in order to comply with internal guidelines.
In the meantime, I'd greatly appreciate a hint.

Thanks a lot in advance
Matthias

Handle negative values for unsigned number patterns

Here is an example of an unsigned field definition:

07 AMOUNT    PIC 9(16)V99.

Cobrix is permissive and allows the actual data to contain negative numbers in such a column. A better behavior might be to return null in such cases, which is what Cobrix usually does when it encounters a pattern/data mismatch.

Add an option for segment-redefine awareness

If there is a known mapping between segment IDs and redefines, we can have an option for the user to specify that mapping. This way the parser won't need to parse all the segments for each record. Combined with segment filtering (predicate pushdown), this should significantly increase performance when targeting a relational model of the output.
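
An illustration of the kind of mapping such an option would carry; the segment IDs and group names below are hypothetical:

// With a mapping like this known up front, only the redefine that matches a
// record's segment id needs to be decoded, instead of every redefine per record.
val segmentIdToRedefine = Map(
  "C" -> "COMPANY-RECORD",   // hypothetical segment id -> redefined GROUP
  "P" -> "PERSON-RECORD"
)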

Header and Trailer records

Some of the files I have come with a header and/or trailer record. For fixed length, the layout looks something like

HEADER
RECORD
...
RECORD
TRAILER

where all the records, the header, and the trailer have the same (fixed) length.

Do we already support something like this? Right now I'm just creating a new copybook that contains the three copybooks for the types above glued together with REDEFINES, but it would be fantastic if we could support this kind of layout.

Q: Maturity of this library

I apologise, as I know this is the wrong place since it's not an issue; if you could point me to a better place to ask this question, I'm happy to go elsewhere.

Anyway, we are wondering what usage this library is getting in production environments, as we just want to confirm the project is mature enough for our usage.

Is it possible to confirm if Absa is using this in prod?

0.4.1 SBT import library is huge

We tried importing 0.4.1 through the SBT import. The older version (0.4.0) JAR size was approximately 12 MB, but the current version is 110 MB. I just want to understand whether this is normal or whether there are any issues with the project build.

Add support for Scala 2.12

Background

Cobrix was originally developed for Spark 2.2.2, which supports Scala 2.11.
Now Spark 2.4.2 and Spark 3.0.0 are coming with Scala 2.12 support.

Feature

Need to add support for Scala 2.12.

Example

See #84

Proposed Solution

  1. Need to introduce a release process that generates two versions of the library: one for Scala 2.11 and one for Scala 2.12 (a minimal cross-build sketch follows below).
  2. The cobol-parser artifact does not depend on Spark, so it can have a single version.
  3. The spark-cobol artifact should have two versions for each release.
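
The cross-build sketch mentioned in item 1, written as an sbt build.sbt; the version numbers and module layout are illustrative and do not describe the actual Cobrix build:

lazy val commonSettings = Seq(
  organization := "za.co.absa.cobrix",
  scalaVersion := "2.11.12",
  crossScalaVersions := Seq("2.11.12", "2.12.10")   // "+publish" builds both Scala versions
)

// cobol-parser does not depend on Spark
lazy val cobolParser = (project in file("cobol-parser"))
  .settings(commonSettings)

// spark-cobol depends on Spark, which is itself published per Scala version,
// so this produces spark-cobol_2.11 and spark-cobol_2.12 artifacts
lazy val sparkCobol = (project in file("spark-cobol"))
  .settings(commonSettings)
  .settings(libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.2" % Provided)
  .dependsOn(cobolParser)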

Proactively validate binary data size

The COBOL data source uses the binaryRecords() API to parse binary data.
When the size of any input file is not evenly divisible by the record size, Spark produces an exception.

We need to proactively check that all binary files are evenly divisible by the record size and produce a more readable exception.

Preferably, this needs to be scalable to cases where there are millions of files (see the sketch below).
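
A sketch of the kind of proactive check this describes, using the Hadoop FileSystem API; the path and record length are hypothetical, and spark is assumed to be in scope. For millions of files the listing itself would likely need to be distributed.

import org.apache.hadoop.fs.{FileSystem, Path}

val recordLength = 136                                   // hypothetical, derived from the copybook
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Collect the names of files whose size is not a multiple of the record length
val badFiles = fs.listStatus(new Path("/path/to/data"))  // hypothetical path
  .filter(_.isFile)
  .filter(_.getLen % recordLength != 0)
  .map(_.getPath.toString)

if (badFiles.nonEmpty) {
  throw new IllegalArgumentException(
    s"Files not divisible by the record size ($recordLength bytes): ${badFiles.mkString(", ")}")
}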

Debug functionality of EBCDIC data

Hi,
This is not an issue report but a request for debug functionality. When Cobrix creates the DataFrame, it decodes the EBCDIC data according to the data type of each primitive field. Would it be possible to also show the hex value of a column, given an option like add_hex = true? This functionality would be for debugging purposes only, to check the data.

Fix code duplication for COBOL AST

Currently, the COBOL AST is an array of GROUP objects. It needs to be changed to be just a GROUP, so that AST traversal functions can be simplified.
Also, this would allow root-level primitive fields in copybooks.

This is to support something like

     05 FILLER PIC 9(08) BINARY.
     05 :RRHEADER:-CNTL.
        10 :RRHEADER:-LENGTH PIC S9(08) BINARY.
           88 :RRHEADER:-LENGTH-SET VALUE +65.

The FILLER at the top is not currently supported since it is a root-level primitive field.
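
A sketch of the proposed shape (not the actual Cobrix classes): the AST root becomes a single group whose children can be sub-groups or primitive fields, which also covers the root-level FILLER above.

sealed trait Statement { def name: String }

final case class Primitive(name: String, pic: String) extends Statement

final case class Group(name: String, children: Seq[Statement]) extends Statement

// Traversal becomes a single recursive walk over one root Group
def visit(stmt: Statement)(f: Statement => Unit): Unit = stmt match {
  case p: Primitive =>
    f(p)
  case g: Group =>
    f(g)
    g.children.foreach(child => visit(child)(f))
}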

RDW interpretation is incorrect for an IBM z/OS dataset

The Record Descriptor Word (previously referred to as the XCOM header) is interpreted incorrectly for a file transferred from IBM z/OS to Windows 10 via standard FTP in binary mode.

Per the IBM documentation (https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.idad400/d4357.htm) the RDW is the first 2 bytes of each record, but the extractRdwRecordSize method in the BinaryUtils class is expecting the length to be in the 3rd and 4th bytes.

It's possible XCOM or other transfer methods change the interpretation of the RDW, but the current method makes it impossible to process a dataset originating from IBM z/OS, at least in our environment.

I am able to process the datasets by making a minor adjustment to the extractRdwRecordSize method to read the 1st and 2nd bytes instead, and the data is parsed correctly after that.
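
For illustration, a sketch of reading the record length from a 4-byte RDW as the IBM documentation describes it (big-endian length in the first two bytes, followed by two reserved bytes). This is not the Cobrix extractRdwRecordSize method, and whether the length includes the 4 RDW bytes themselves may vary by transfer method.

def rdwRecordSize(rdw: Array[Byte]): Int = {
  require(rdw.length >= 4, "An RDW is 4 bytes long")
  ((rdw(0) & 0xFF) << 8) | (rdw(1) & 0xFF)   // length from bytes 1-2, big-endian
}

rdwRecordSize(Array[Byte](0x00, 0x20, 0x00, 0x00))   // 32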

Syntax error in the copybook at line 2: Unable to parse the value of LEVEL. Numeric value expected, but 'PN-TABLE-NAME' encountered

Hi Team,

I am getting an error while trying to create the DataFrame. It is failing with a layout file error.
Could you please help me find the error in the layout file?

Spark shell:
spark2-shell --master yarn --deploy-mode client --driver-cores 2 --driver-memory 4G --jars cobol-parser-0.4.3-SNAPSHOT.jar,spark-cobol-0.4.3-SNAPSHOT.jar,scodec-core_2.11-1.10.3.jar,scodec-bits_2.11-1.1.2.jar

Data Frame:
val df = spark.read.format("cobol")
  .option("copybook", "/user/cloudera/layout/layout_spark.txt")
  .load("/user/cloudera/test_mainframe/DEPEND_SAMP_190422150915.dat")

Error:
za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 2: Unable to parse the value of LEVEL. Numeric value expected, but 'PN-TABLE-NAME' encountered
at za.co.absa.cobrix.cobol.parser.CopybookParser$.za$co$absa$cobrix$cobol$parser$CopybookParser$$CreateCopybookLine(CopybookParser.scala:204)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$8.apply(CopybookParser.scala:134)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$8.apply(CopybookParser.scala:133)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.parseTree(CopybookParser.scala:133)
at za.co.absa.cobrix.spark.cobol.reader.fixedlen.FixedLenNestedReader.loadCopyBook(FixedLenNestedReader.scala:81)
at za.co.absa.cobrix.spark.cobol.reader.fixedlen.FixedLenNestedReader.(FixedLenNestedReader.scala:47)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createFixedLengthReader(DefaultSource.scala:86)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.buildEitherReader(DefaultSource.scala:73)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:57)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:48)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 49 elided

layout file:
01 DCLDPN-DPND-REC-TABL.
02 DPN-TABLE-NAME PIC X(8).
02 DPN-EMP-NBR PIC S9(9) COMP.
02 DPN-CARR-CODE PIC X(3).
02 DPN-ACVY-STRT-DATE PIC X(10).
02 DPN-CREW-ACVY-CODE PIC X(3).
02 DPN-PRNG-NBR PIC X(5).
02 DPN-PRNG-ORIG-DATE PIC X(10).
02 DPN-ASNT-DAYS-CNT PIC S9(4) COMP.
02 DPN-RVEW-DATE PIC X(10).
02 DPN-CONJ-ACVY-CODE PIC X(3).
02 DPN-NOTE-RCVD-IND PIC X(1).
02 DPN-DROP-OFF-DATE PIC X(10).
02 DPN-DPND-CMNT-TEXT PIC X(65).
02 FILLER PIC X(66).

Redefine Field name with dot(.) notation

Hi,
While searching for REDEFINES fields attached to a given statement (Group/Primitive), I found that only the field name is provided. This can be an issue if there are multiple fields with the same name in the copybook. Is it possible to provide the exact field name with its hierarchy, using dot (.) notation?

Speeding up reads

I am doing some benchmarking on a Databricks cluster where I use Cobrix to read EBCDIC files and write to parquet. I have an implementation of the same process which does not use this library. Reading a 2GB EBCDIC file with Cobrix takes two minutes longer than reading the file using sc.binaryRecords() and putting the right schema in place to create a DataFrame. The file has around 1400 columns and 9000 bytes per record.

Here is the cluster config:
Spark Version: 2.4
Worker count: flexible, depending on the workload
RAM per worker: 14 GB
Cores per worker: 4
Executor count: flexible, depending on the workload
Executor memory: 7.4 GB

The benchmarks you have included in this project make me think that an increase in executor count would speed up the read throughput. However, I have currently enabled autoscaling in Databricks so it dynamically allocates executors on the fly depending on the workload.

Could you please provide some guidelines on optimising read speed? Things like configurations you have tried in your organisation's use case can really be of help to speed up the process.

Add support for EBCDIC non-printable characters in string fields

Currently Cobrix ignores all characters that translate into ASCII characters below 32. Need to have an option to allow retaining these.

I think it is aligned with supporting EBCDIC code pages. Currently we support only the common code page. We can add 'common/all' to include non-printable characters.

Allow to specify several copybooks

Background

Multisegment data files, especially ones that come from hierarchical databases, are often described by several copybooks. In this case each copybook describes a particular segment.

Currently, in order to load multisegment files, one must combine the segment copybooks so that each segment is represented by a GROUP in the copybook, and these segment GROUPs must redefine each other. This is manual, copybook-intrusive work.

Feature

Allow Cobrix to accept several copybooks as an input and automatically combine them by placing segments into GROUPs that redefine each other.
