absaoss / cobrix
A COBOL parser and Mainframe/EBCDIC data source for Apache Spark
License: Apache License 2.0
If a .cob file has a PIC with two parenthesized repeat counts and either count has more than one digit, the parser fails to parse that line and returns a NOT DIVISIBLE by the RECORD SIZE calculated from the copybook error, with a byte count smaller than expected.
Pass: PIC S9(15)V99.
Pass: PIC S9(9)V9(6).
Fail: PIC S9(15)V9(2).
I suspect this may have something to do with repeatCount in expandPic on line 944 of CopybookParser.scala.
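For reference, here is a minimal sketch (not the actual Cobrix implementation) of how a PIC expansion routine should consume a multi-digit repeat count such as the 15 in S9(15)V9(2):

```scala
// Hypothetical sketch of PIC expansion; the real expandPic in
// CopybookParser.scala may differ. The key point is that the repeat
// count between parentheses must be read in full, not one digit at a time.
def expandPic(pic: String): String = {
  val sb = new StringBuilder
  var i = 0
  while (i < pic.length) {
    if (i + 1 < pic.length && pic(i + 1) == '(') {
      val close = pic.indexOf(')', i + 2)
      val count = pic.substring(i + 2, close).toInt // multi-digit count, e.g. 15
      sb.append(pic(i).toString * count)
      i = close + 1
    } else {
      sb.append(pic(i))
      i += 1
    }
  }
  sb.toString
}
```

With this, both the passing and the failing examples above expand to the same record size they should.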
By design, filler fields are dropped. Is there any option to override this default behavior and keep them?
The current parser does its job, but it is written without conforming to the usual practice of parsing.
We need to write a grammar for the subset of COBOL that we are parsing and generate the parser using a tool like ANTLR, JavaCC, etc.
I managed to get it working.
However, it is only showing the level 01 records as columns in the data frame.
Why is that? The level 05 and level 10 are not appearing as columns.
Also, the .count() function fails with a Java error. Why is this? (The error is pasted at the bottom of this issue.)
scala> :load /home/private/Documents/SparkSQLExample_cobol.scala
Loading /home/private/Documents/SparkSQLExample_cobol.scala...
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, DataFrame, SQLContext}
import org.apache.spark.sql.SQLContext
import za.co.absa.cobrix._
import scodec.bits._
import scodec.codecs._
spark: org.apache.spark.sql.SparkSession.type = org.apache.spark.sql.SparkSession$@4ad3969
res0: spark.Builder = org.apache.spark.sql.SparkSession$Builder@7ec5aad
res1: spark.Builder = org.apache.spark.sql.SparkSession$Builder@7ec5aad
2018-10-03 14:52:31 WARN SparkSession$Builder:66 - Using an existing SparkSession; some configuration may not take effect.
res2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@65f470f8
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@21ed4a51
2018-10-03 14:52:40 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
DataFrame: org.apache.spark.sql.DataFrame = [SAM010A_LEADER_RECORD: struct<SAM010A_ALL_ZEROS: string, SAM010A_DATE_CREATED: int ... 3 more fields>, SAM010D_DATA_RECORD_B: struct<SAM010D_ACCOUNT_NO: bigint, SAM010D_ACCOUNT_NO_A: struct<SAM010D_PREFIX: int, SAM010D_BRANCH_NO: int ... 1 more field> ... 301 more fields> ... 2 more fields]
+---------------------+---------------------+-----------------------------+----------------------+
|SAM010A_LEADER_RECORD|SAM010D_DATA_RECORD_B|SAM010E_BRANCH_TRAILER_RECORD|SAM010F_TRAILER_RECORD|
+---------------------+---------------------+-----------------------------+----------------------+
| [0000000000000000...| [2001001355, [2, ...| [2, 1, 2408,,, 0,...| [909000000, 0]|
| [8909000000000002...| [9909000000, [9, ...| [2, 909, 0,, 2000...| [0,]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 9...| [, 1725]|
| [00{0000000019640...| [, [0,, 0], [0], ...| [0,, 0, 338520000...| [0, 0]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 0...| [, 5463696]|
| [283261{000000000...| [, [0, 147,], [1]...| [0, 0,, 0.00, 0, ...| [0, 0]|
| [0000000000000000...| [5000000, [0, 5, ...| [0, 0, 0, 1000600...| [90,]|
| [0000000090000000...| [90, [0, 0, 90], ...| [0, 0, 90, 0.00, ...| [0, 0]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 0...| [0, 0]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 0...| [0, 0]|
+---------------------+---------------------+-----------------------------+----------------------+
only showing top 10 rows
scala> DataFrame.count()
[Stage 1:========================================================>(77 + 1) / 78]2018-10-03 15:03:30 ERROR Executor:91 - Exception in task 77.0 in stage 1.0 (TID 78)
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.spark.input.FixedLengthBinaryRecordReader.nextKeyValue(FixedLengthBinaryRecordReader.scala:118)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Depending on whether segment id filtering is set, the resulting IDs can have gaps in the ID numbers.
The generated IDs should not depend on "segment_field", "segment_filter" and "segment_id_root" options.
Hi,
I have fields which are made up of 7 numeric digits S9(7). We need them to have 7 decimal places and I cannot find a PIC that can achieve that.
So a value of 000456J needs to become -0.0004561
V9(7) achieves that, but only for positive values; it returns nulls for negatives.
I've tried several PICs, including the few below, but they either fail or do not achieve the desired result:
SV9(7)
S9(0)V9(7)
Can you please let me know what PIC I need to use to achieve 7 decimal places for signed field of 7 digits?
Thank you
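For context, the trailing J in 000456J is a sign overpunch (EBCDIC zoned decimal): the last character carries both the final digit and the sign. A hedged sketch of the decoding, assuming 7 implied decimal places as in the example:

```scala
// Sketch only: decode a trailing sign overpunch ('{', 'A'..'I' positive,
// '}', 'J'..'R' negative) and scale by 10^-7 for an SV9(7)-style field.
def decodeSignedScaled(s: String): BigDecimal = {
  val (digit, sign) = s.last match {
    case '{'                       => (0, 1)
    case '}'                       => (0, -1)
    case c if c >= 'A' && c <= 'I' => (c - 'A' + 1, 1)
    case c if c >= 'J' && c <= 'R' => (c - 'J' + 1, -1)
    case c                         => (c - '0', 1) // plain digit, unsigned
  }
  BigDecimal(s.init + digit.toString) * sign / BigDecimal(10).pow(7)
}
```

This is why the sign only works when the PIC is recognized as signed: with an unsigned V9(7) the overpunched bytes of negative values do not decode as digits.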
I'm reading a file in streaming mode. I get no errors, but the count returns 0.
My code:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import za.co.absa.cobrix.spark.cobol.source.streaming.CobolStreamer._

val spark = SparkSession
  .builder()
  .appName("CobolParser")
  .master("local[2]")
  .config("duration", 13)
  .config("copybook", paramMap("-Dcopybook"))
  .config("path", paramMap("-Ddata"))
  .getOrCreate()

val streamingContext = new StreamingContext(spark.sparkContext, Seconds(13))
val reader = getReader(streamingContext)
val result = streamingContext.cobolStream()

result.foreachRDD { (rdd: RDD[Row], time: Time) =>
  println(time)
  println(rdd.count())
  for (item <- rdd) {
    println(s"---------------$item")
  }
}
output example :
19/01/13 09:30:47 INFO JobScheduler: Added jobs for time 1547400647000 ms
19/01/13 09:30:47 INFO JobScheduler: Starting job streaming job 1547400647000 ms.0 from job set of time 1547400647000 ms
1547400647000 ms
19/01/13 09:30:47 INFO SparkContext: Starting job: count at NewStreamingEbcdic.scala:78
19/01/13 09:30:47 INFO DAGScheduler: Job 0 finished: count at NewStreamingEbcdic.scala:78, took 0.005128 s
0
19/01/13 09:30:47 INFO SparkContext: Starting job: foreach at NewStreamingEbcdic.scala:79
19/01/13 09:30:47 INFO DAGScheduler: Job 1 finished: foreach at NewStreamingEbcdic.scala:79, took 0.000039 s
19/01/13 09:30:47 INFO JobScheduler: Finished job streaming job 1547400647000 ms.0 from job set of time 1547400647000 ms
19/01/13 09:30:47 INFO JobScheduler: Total delay: 0.252 s for time 1547400647000 ms (execution: 0.085 s)
19/01/13 09:30:47 INFO FileInputDStream: Cleared 0 old files that were older than 1547400582000 ms:
19/01/13 09:30:47 INFO ReceivedBlockTracker: Deleting batches:
19/01/13 09:30:47 INFO InputInfoTracker: remove old batch metadata:
19/01/13 09:31:00 INFO FileInputDStream: Finding new files took 11 ms
19/01/13 09:31:00 INFO FileInputDStream: New files at time 1547400660000 ms:
Why does this happen?
Whenever I try to read a local file, it still searches for the file in HDFS and gives the exception below:
java.lang.IllegalArgumentException: Wrong FS: file://xxxxxx:9000/root/abc.cpy expected: hdfs://xxxxxx:9000
command:
val df = spark.read.format("cobol").option("encoding", "ebcdic").option("generate_record_id", true).option("copybook", "file:///root/abc.cpy").load("file:///root/data.bin")
When a copybook does not match the data file, an error message is displayed saying that the record size does not divide the file size, and no data is retrieved.
It can be very helpful for debugging where exactly a copybook starts to mismatch a data file to allow users to override that file size check and still parse the file. In this mode only the top records of a data file can be retrieved, but most of the time that is enough for debugging copybook/data file correspondence.
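One hedged way such a debug mode could work (illustrative only; no such Cobrix option is implied to exist): truncate the input to the largest whole number of records and parse only those, instead of failing the divisibility check outright:

```scala
// Sketch: keep only complete records, dropping any trailing partial
// record instead of rejecting the whole file.
def wholeRecords(data: Array[Byte], recordSize: Int): Array[Byte] =
  data.take((data.length / recordSize) * recordSize)
```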
I have a few copybooks that contain the structure in the copybook below:
***********************************************************************
05 NAME.
10 SHORT-NAME.
11 NAME-CHAR-1 PIC X.
11 FILLER PIC X(9).
10 FILLER PIC X(20).
***********************************************************************
The NAME field is 30 characters, but the copybook also uses the first 10 chars for the SHORT-NAME field and the first character as another field.
Right now we drop all the information in the FILLER fields, so we end up with only the first character in the copybook above. This is connected to #53 and a solution for that one would probably be OK for this case.
Maybe we could add an option to specify non-terminal items to retain? In the example above, we would want to retain NAME and SHORT-NAME. We could add an _NT suffix or pass in a map allowing the new names to be specified.
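One possible retention scheme, sketched here as a hypothetical helper (not an existing Cobrix option), is to keep FILLER fields under generated unique names so they no longer collide:

```scala
// Sketch: assign unique names to retained FILLER fields so that the
// resulting schema has no duplicate column names.
def renameFillers(fields: Seq[String]): Seq[String] = {
  var n = 0
  fields.map {
    case "FILLER" => n += 1; s"FILLER_$n"
    case other    => other
  }
}
```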
I have the following PIC in a copybook: SVPP9(5), which is not recognized as a valid PIC. This means the SV9(5) field will be scaled down by 100. It seems that we can have fields such as S9(n)P(m) or 9(n)P(m) for scaling up and P(m)9(n) or SVP(m)9(n) for scaling down.
Page 211 of this manual gives the following examples:
PPP999: 0 through .000999
S999PPP: -1000 through -999000 and +1000 through +999000 or zero
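The manual's examples can be reproduced with a small sketch (a hypothetical helper, not Cobrix code): leading Ps shift the implied decimal point left past all the digits, while trailing Ps multiply by powers of ten:

```scala
// Sketch of P scaling: PPP999 -> digits * 10^-(leadingP + nDigits),
// S999PPP -> digits * 10^trailingP.
def pScaled(digits: String, leadingP: Int, trailingP: Int): BigDecimal = {
  val d = BigDecimal(digits)
  if (trailingP > 0) d * BigDecimal(10).pow(trailingP)
  else if (leadingP > 0) d / BigDecimal(10).pow(leadingP + digits.length)
  else d
}
```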
When I tried to read the cobol binary file using the following script:
val df = spark.read.format("za.co.absa.cobrix.spark.cobol.source").option("copybook", "file:///test/XYZ.txt").option("is_record_sequence", "true").option("is_rdw_big_endian", "true").option("schema_retention_policy", "collapse_root").load("/test/test1")
The script ran successfully, but when I retrieved the data I found the FILLER column missing; the other columns are good. In the copybook it looks like:
02 FILLER OCCURS 2 TIMES
PIC X.
Can you tell me how to solve the issue?
The OCCURS clause handler currently enforces an integer DEPENDING ON field, which can lead to issues when the depended-on field is interpreted as a decimal. For example, S9V is a Decimal(1,0) but in reality is an integer value since it doesn't specify any decimal places.
Maybe we can allow a Decimal(n,0) to be depended on or, alternatively, interpret those as int values.
This is not a priority since one can easily change a copybook's DEPENDING ON fields to strip the extra V decimal place indicator.
I get an error when I execute mvn test in the example Spark-Cobol-App.
The error is:
java.lang.IllegalStateException: Root segment SEGMENT_ID=='C' not found in the data file.
Sometimes pictures like 9(8) are actually not referring to integer values. For example, many mainframe files use 9(8) to represent a date. This means that the value should not be converted to integer automatically. I'd like to have a feature where all 9(...) fields are not automatically converted to integer but treated as string
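As an interim workaround (a sketch assuming the field has already been converted to an integer), the fixed-width display form of a 9(8) date can be recovered by re-padding with leading zeros, though an option to skip the conversion entirely would be cleaner:

```scala
// Sketch: restore the fixed-width display form of an unsigned 9(n) field
// that was converted to an integer (the leading zeros were lost).
def asDisplay(value: Long, picDigits: Int): String =
  s"%0${picDigits}d".format(value)
```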
I am getting the error
Exception in thread "main" za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 16: Unable to parse the value of LEVEL. Numeric value expected, but 'CUSTOM-CHANGE-FLAGS-CNT' encountered
Also, does Cobrix work with nested OCCURS subgroups? How many levels of nested subgroups are handled?
Do I need to make any changes to the mainframe copybooks to parse them? If any changes are required, please let me know, because I am having a lot of issues with copybooks: I have tried to parse many of them, but they fail for various reasons.
If there are any tips for parsing copybooks more easily, please let me know.
I appreciate your cooperation.
Currently the mapping between segment id values and redefined fields is supported only for variable length files. There are use cases when this would be helpful to be supported for fixed length files as well.
I have done an mvn package to build your code.
I referenced the three produced jars in spark-shell:
spark-cobol-0.2.6-SNAPSHOT.jar
spark-cobol-0.2.6-SNAPSHOT-javadoc.jar
spark-cobol-0.2.6-SNAPSHOT-sources.jar
Then I invoke spark-shell:
spark-shell --jars /path_to_cobrix_jars/*.jar
I am getting an error that a particular class is missing.
Can you guide me?
scala> :load /home/private/Documents/SparkSQLExample_cobol.scala
Loading /home/private/Documents/SparkSQLExample_cobol.scala...
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, DataFrame, SQLContext}
import org.apache.spark.sql.SQLContext
import za.co.absa.cobrix._
spark: org.apache.spark.sql.SparkSession.type = org.apache.spark.sql.SparkSession$@37f41a81
res0: spark.Builder = org.apache.spark.sql.SparkSession$Builder@5ee581db
res1: spark.Builder = org.apache.spark.sql.SparkSession$Builder@5ee581db
2018-09-11 10:16:51 WARN SparkSession$Builder:66 - Using an existing SparkSession; some configuration may not take effect.
res2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@5a8656a2
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4070c4ff
java.lang.NoClassDefFoundError: za/co/absa/cobrix/cobol/parser/encoding/Encoding
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createFixedLengthReader(DefaultSource.scala:82)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.buildEitherReader(DefaultSource.scala:69)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:50)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 77 elided
Caused by: java.lang.ClassNotFoundException: za.co.absa.cobrix.cobol.parser.encoding.Encoding
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 85 more
Code executed is below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, DataFrame, SQLContext}
import org.apache.spark.sql.SQLContext
import za.co.absa.cobrix._
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.getOrCreate()
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val DataFrame = sqlContext.read.format("cobol").option("copybook", "/home/private/Documents/Data_Files/Cobol_EBCDIC/pnic00.wspack.sam010.d0.sq2700.D180630.cpy").load("/home/private/Documents/Data_Files/Cobol_EBCDIC/pnic00.wspack.sam010.d0.sq2700.D180630.dat")
Currently, when Cobrix reads a string type it trims it from both sides.
Often it is required to be as little data-intrusive as possible and to do only right trimming, or no trimming at all.
A new option should be created to control this behavior.
Currently Cobrix supports only the common EBCDIC code page. We need an option for users to specify a different code page for the data file; the priority is code page 37.
Characters in these code pages cannot all be mapped to 7-bit ASCII, so an EBCDIC to Unicode character conversion is needed to support this.
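For what it's worth, the JDK already ships a code page 37 decoder (charset name "IBM037", available in standard OpenJDK/Oracle builds) that maps EBCDIC bytes to Unicode; a quick sketch:

```scala
import java.nio.charset.Charset

// Sketch: decode EBCDIC code page 37 bytes to a Unicode string using the
// JDK's built-in IBM037 charset.
def decodeCp037(bytes: Array[Byte]): String =
  new String(bytes, Charset.forName("IBM037"))
```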
I hit a copybook with the following definition: PIC 9(15)+.
The copybook parser is not able to handle it. What is the meaning of the + in this case?
Currently multisegment files can be loaded one segment at a time. It might be very useful to add the possibility of loading all the segments of a hierarchical database. Child nodes could become an array of structs inside the root node.
Open implementation questions:
Hi Team - We have an enhancement requirement to read bytes of data from a single EBCDIC file. This file contains different record types, identified by a 3-letter or 1-letter keyword, and the file should be split based on that.
Note: it's not a standard file with an RDW.
Great library! Hit an issue with a copybook with an SV9(5) pic. The tests seem to indicate this is invalid, but PICs such as SV9(x) and SV99 should be valid.
According to this, given a PIC S9(p-s)V9(s)
:
p is precision; s is scale. 0≤s≤p≤31. If s=0, use S9(p)V or S9(p). If s=p, use SV9(s).
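Those rules can be written down directly (a hypothetical helper just to illustrate the rule, not a Cobrix API):

```scala
// Sketch: choose a signed PIC for precision p and scale s per the rules above.
def picFor(p: Int, s: Int): String = {
  require(0 <= s && s <= p && p <= 31)
  if (s == 0) s"S9($p)"
  else if (s == p) s"SV9($s)"
  else s"S9(${p - s})V9($s)"
}
```

So the s=p corner case is exactly what produces PICs like SV9(5).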
I have a copybook that has a format akin to the following
***********************************************************************
01 RECORD-TYPE-1.
02 FIELD-1 PIC X(16).
01 RECORD-TYPE-2 REDEFINES RECORD-TYPE-1.
02 FIELD-2 PIC X(16).
01 RECORD-TYPE-3 REDEFINES RECORD-TYPE-1.
02 FIELD-3 PIC X(16).
***********************************************************************
The library is reading this as a copybook of size 48 (3*16) but should be only 16 in size. The following copybook, which adds a root element, seems to work correctly (and is a workaround right now):
***********************************************************************
01 RECORD.
02 RECORD-TYPE-1.
03 FIELD-1 PIC X(16).
02 RECORD-TYPE-2 REDEFINES RECORD-TYPE-1.
03 FIELD-2 PIC X(16).
02 RECORD-TYPE-3 REDEFINES RECORD-TYPE-1.
03 FIELD-3 PIC X(16).
***********************************************************************
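The expected sizing rule, sketched as an illustrative helper: groups that REDEFINE one another overlay the same storage, so the record size should be the maximum of the group sizes, not their sum:

```scala
// Sketch: REDEFINES groups share storage, so the combined size is the
// maximum group size, not the sum of all group sizes.
def redefinedRecordSize(groupSizes: Seq[Int]): Int = groupSizes.max
```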
Currently, sparse index generation is used only if RDW headers are present in a variable record length file.
This makes processing of fixed record length files slow when file headers and footers are used, or when record id generation is used for fixed record length files.
We need to add sparse index generation for RDW-less files as well.
Hi team
Scenario 1: Spark-shell, local mode, Windows local file system.
The following ingestion works fine:
spark-shell --jars D:\Tools\cobrix-master-jars\cobol-parser-0.4.2.jar,D:\Tools\cobrix-master-jars\spark-cobol-0.4.2.jar,D:\Tools\cobrix-master-jars\scodec-core_2.11-1.10.3.jar,D:\Tools\cobrix-master-jars\scodec-bits_2.11-1.1.4.jar
val df_in: DataFrame = spark.read.format("cobol"). //"za.co.absa.cobrix.spark.cobol.source"
option("copybook", "D:\\Tools\\cobrix-master\\data\\test2_copybook.cob").
load("D:\\Tools\\cobrix-master\\data\\test2_data\\example.bin")
df_in.count
res30: Long = 10
Scenario 2: Intellij Ultimate. Maven, scala-maven-plugin, and the following dependencies in the POM,
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.2</version>
</dependency>
<dependency>
<groupId>za.co.absa.cobrix</groupId>
<artifactId>spark-cobol</artifactId>
<version>0.4.2</version>
</dependency>
and creating a simple ingestion routine:
package XYZ
import org.apache.spark.sql._
object Main {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().
master("local").
appName("cobrixTester").
getOrCreate()
val df_in: DataFrame = spark.read.format("cobol"). //"za.co.absa.cobrix.spark.cobol.source"
option("copybook", "D:\\Tools\\cobrix-master\\data\\test2_copybook.cob").
load("D:\\Tools\\cobrix-master\\data\\test2_data\\example.bin")
df_in.count
}
}
This results in:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:67)
Caused by: java.lang.NoClassDefFoundError: scala/Product$class
at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParameters.<init>(CobolParameters.scala:50)
at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersParser$.parse(CobolParametersParser.scala:118)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:54)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:48)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at XYZ.Main$.main(Main.scala:14)
at XYZ.Main.main(Main.scala)
... 5 more
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 15 more
for both .format("cobol") and .format("za.co.absa.cobrix.spark.cobol.source").
Thanks for your inputs
Matthias
Cobrix throws a NullPointerException when a segment filter is used and the data file contains a record with a field holding non-decodable data, i.e. the decoder returns null.
We should support copybooks with protected keywords at the beginning of field names. While this is bad practice, it's still valid COBOL syntax.
The copybook below would fail, due to COMP-ACCOUNT-I being similar to COMP-X:
12 ACCOUNT-DETAIL OCCURS 80
DEPENDING ON NUMBER-OF-ACCTS
INDEXED BY COMP-ACCOUNT-I.
15 ACCOUNT-NUMBER PIC X(24).
15 ACCOUNT-TYPE-N PIC 9(5) COMP-3.
15 ACCOUNT-TYPE-X REDEFINES
ACCOUNT-TYPE-N PIC X(3).
PR INCOMING TO FIX
I am getting this error while trying df.show.
df.show()
19/04/25 23:03:42 ERROR utils.FileUtils$: File hdfs://quickstart.cloudera:8020/user/cloudera/test_mainframe/DEPEND_SAMP_190422150915.dat IS NOT divisible by 136.
java.lang.IllegalArgumentException: There are some files in /user/cloudera/test_mainframe/DEPEND_SAMP_190422150915.dat that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (136 bytes per record). Check the logs for the names of the files.
at za.co.absa.cobrix.spark.cobol.source.scanners.CobolScanners$.buildScanForFixedLength(CobolScanners.scala:87)
at za.co.absa.cobrix.spark.cobol.source.CobolRelation.buildScan(CobolRelation.scala:85)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:308)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3254)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
... 49 elided
Can someone please help me resolve this exception?
Hi Ruslan
First of all, thanks for your work! I work for a company that gets its transaction batch files from an IBM mainframe in EBCDIC encoding.
For a PoC on ingesting mainframe data directly in Spark, we have managed to ingest fixed record length files so far without major issues, only by adapting the original copybooks in order to get proper REDEFINES and, by extension, a proper fixed record length, which corresponds to the GCD of the test files.
We have issues, however, with variable record length files. We identified several of them simply by calculating the GCD of the test files, which in the present cases is of course 1.
However, arranging the corresponding copybook by implementing proper REDEFINES, respectively excluding common fields from REDEFINES, results in a structure that suggests fixed record length:
a) all segments have the same length (Cobrix schema parsing works fine),
b) the second common field, at position 8, is called "RECORD-TYPE", has byte length 4 (DISPLAY) and very likely indicates the record type.
When trying to ingest such files with option("is_record_sequence", "true"), the result is:
df_in: org.apache.spark.sql.DataFrame = [GLOBAL_RECORD: struct<Account_ID: string, Record_Type: string ... 14 more fields>] 2019-04-08 11:07:53 WARN VarLenNestedReader:87 - Input split size = 32 MB 2019-04-08 11:07:53 ERROR Executor:91 - Exception in task 0.0 in stage 98.0 (TID 7660) java.lang.IllegalStateException: RDW headers should never be zero (0,0,0,0). Found zero size record at 0. at za.co.absa.cobrix.cobol.parser.decoders.BinaryUtils$.extractRdwRecordSize(BinaryUtils.scala:305)
So, obviously, the RDW is not found.
When trying to ingest such files with option("is_record_sequence", "false"), the result of course is:
df_in: org.apache.spark.sql.DataFrame = [GLOBAL_RECORD: struct<Account_ID: string, Record_Type: string ... 14 more fields>] 2019-04-08 11:25:20 ERROR FileUtils$:218 - File file:/XYZ IS NOT divisible by 1513. java.lang.IllegalArgumentException: There are some files in XYZ that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (1513 bytes per record)
In brief: GCD of 1, but all is pointing towards fixed record length.
We are still figuring out in which form to send you a data sample plus copybook, in order to comply with internal guidelines.
In the meantime, I'd greatly appreciate a hint.
Thanks a lot in advance
Matthias
Even though PR #86 is merged, the parser still complains about a picture like SVPP9(5).
Here is an example of an unsigned field definition:
07 AMOUNT PIC 9(16)V99.
Cobrix is permissive and allows the actual data to contain negative numbers in that column. A better behavior might be to return null in such cases, which is what Cobrix usually does when it encounters a pattern/data mismatch.
If there is a known mapping between segment ids and redefines we can have an option for a user to specify that mapping. This way the parser won't need to parse all the segments for each record. Combining with segment filtering (predicate pushdown) this should significantly increase performance when targeting a relational model of the output.
Some of the files I have come with a header and/or trailer record. For fixed length, the layout looks something like
HEADER
RECORD
...
RECORD
TRAILER
where all the records, the header, and the trailer have the same (fixed) length.
Do we already support something like this? Right now I'm just creating a new copybook that contains the three copybooks for the types above glued together with REDEFINES, but it would be fantastic if we could support this kind of layout.
I apologise, as I know this is the wrong place since it's not an issue; if you could point me to a better place to ask this question I'm happy to go elsewhere.
Anyway, we are wondering what usage this library is getting in production environments, as we just want to confirm the project is mature enough for our usage.
Is it possible to confirm if Absa is using this in prod?
We tried importing 0.4.1 through the SBT import. The older version (0.4.0) JAR size was approx 12 MB but the current version size is 110 MB. I just want to understand if this is normal or if there are any issues with the project build.
Cobrix was originally developed for Spark 2.2.2, which supported Scala 2.11.
Now Spark 2.4.2 and Spark 3.0.0 are coming with Scala 2.12 support.
We need to add Scala 2.12 support.
See #84
The cobol-parser artifact does not depend on Spark, so it can have a single version. The spark-cobol artifact should have 2 versions for each release.
I am trying to read a header record file that has variable length fields, but the file doesn't have a Record Descriptor Word (RDW). On reading the file, I could see the first 4 characters getting truncated because I am using option("is_record_sequence", "true"). Is there any way to avoid this truncation?
The Cobol data source uses the binaryRecords() API to parse binary data.
When the size of any of the input files is not evenly divisible by the record size, Spark will produce an exception.
We need to proactively check that all binary files are evenly divisible by the record size and produce a more readable exception.
Preferably this needs to be scalable for cases when there are millions of files.
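A minimal sketch of the proposed pre-check as a pure function. The names are illustrative; in practice the file sizes would be gathered via the Hadoop FileSystem listing API, possibly in parallel when there are millions of files:

```scala
object RecordSizeCheck {
  /** Returns the file sizes (in bytes) that are not evenly divisible
    * by the fixed record size, so a readable error can name the offenders. */
  def misaligned(fileSizes: Seq[Long], recordSize: Int): Seq[Long] = {
    require(recordSize > 0, "record size must be positive")
    fileSizes.filter(_ % recordSize != 0)
  }
}
```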
Hi,
Here I am not reporting an issue but suggesting functionality, mainly for debugging purposes. When Cobrix creates a DataFrame, it decodes the EBCDIC data according to the data type of the given primitive field. Would it be possible to also show the hex value of each column, given an option like add_hex = true? This functionality would be for debugging only, to check the data.
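A sketch of how the raw bytes of a field could be rendered as hex for such a debug column. The `add_hex` option is the reporter's suggestion, not an existing setting, and the helper below is illustrative:

```scala
object HexDebug {
  /** Renders the raw EBCDIC bytes of a field as an uppercase hex string,
    * suitable for populating an extra "<field>_hex" debug column. */
  def toHex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString
}
```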
Currently the COBOL AST is an array of GROUP objects. We need to change it to be just a GROUP; this way AST traversal functions can be simplified.
Also, this would allow root-level primitive fields in copybooks.
This is to support something like
05 FILLER PIC 9(08) BINARY.
05 :RRHEADER:-CNTL.
10 :RRHEADER:-LENGTH PIC S9(08) BINARY.
88 :RRHEADER:-LENGTH-SET VALUE +65.
The FILLER at the top is not currently supported since it is a root-level primitive field.
Record Descriptor Word (previously referred to as XCOM header) is interpreted incorrectly for a file transferred from IBM z/OS to Windows 10 via standard FTP in binary mode.
Per the IBM documentation (https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.idad400/d4357.htm), the record length is stored in the first 2 bytes of each record's 4-byte RDW, but the extractRdwRecordSize method in the BinaryUtils class expects the length to be in the 3rd and 4th bytes.
It's possible that XCOM or other transfer methods change the interpretation of the RDW, but the current method makes it impossible to process a dataset originating from IBM z/OS, at least in our environment.
I was able to process the datasets correctly by making a minor adjustment to the extractRdwRecordSize method so it reads the 1st and 2nd bytes; the data is parsed correctly after that.
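A sketch of the adjusted extraction the reporter describes, assuming the big-endian length sits in the first 2 bytes of the 4-byte RDW and includes the RDW itself. Conventions can vary by transfer tool, so this is illustrative rather than the library's actual method:

```scala
object Rdw {
  /** Extracts the record payload length from a 4-byte RDW:
    * bytes 0-1 hold the big-endian total length (assumed here to
    * include the 4-byte RDW itself), bytes 2-3 are reserved. */
  def recordPayloadSize(rdw: Array[Byte]): Int = {
    require(rdw.length >= 4, "an RDW is 4 bytes")
    val total = ((rdw(0) & 0xff) << 8) | (rdw(1) & 0xff)
    total - 4 // subtract the RDW itself to get the payload length
  }
}
```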
Hi Team,
I am getting an error while trying to create the DataFrame; it is failing with a layout file error.
Could you please help me find the error in the layout file?
Spark shell:
spark2-shell --master yarn --deploy-mode client --driver-cores 2 --driver-memory 4G --jars cobol-parser-0.4.3-SNAPSHOT.jar,spark-cobol-0.4.3-SNAPSHOT.jar,scodec-core_2.11-1.10.3.jar,scodec-bits_2.11-1.1.2.jar
Data Frame:
val df = spark.read.format("cobol").option("copybook", "/user/cloudera/layout/layout_spark.txt").load("/user/cloudera/test_mainframe/DEPEND_SAMP_190422150915.dat")
Error:
za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 2: Unable to parse the value of LEVEL. Numeric value expected, but 'PN-TABLE-NAME' encountered
at za.co.absa.cobrix.cobol.parser.CopybookParser$.za$co$absa$cobrix$cobol$parser$CopybookParser$$CreateCopybookLine(CopybookParser.scala:204)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$8.apply(CopybookParser.scala:134)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$8.apply(CopybookParser.scala:133)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.parseTree(CopybookParser.scala:133)
at za.co.absa.cobrix.spark.cobol.reader.fixedlen.FixedLenNestedReader.loadCopyBook(FixedLenNestedReader.scala:81)
at za.co.absa.cobrix.spark.cobol.reader.fixedlen.FixedLenNestedReader.(FixedLenNestedReader.scala:47)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createFixedLengthReader(DefaultSource.scala:86)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.buildEitherReader(DefaultSource.scala:73)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:57)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:48)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 49 elided
layout file:
01 DCLDPN-DPND-REC-TABL.
02 DPN-TABLE-NAME PIC X(8).
02 DPN-EMP-NBR PIC S9(9) COMP.
02 DPN-CARR-CODE PIC X(3).
02 DPN-ACVY-STRT-DATE PIC X(10).
02 DPN-CREW-ACVY-CODE PIC X(3).
02 DPN-PRNG-NBR PIC X(5).
02 DPN-PRNG-ORIG-DATE PIC X(10).
02 DPN-ASNT-DAYS-CNT PIC S9(4) COMP.
02 DPN-RVEW-DATE PIC X(10).
02 DPN-CONJ-ACVY-CODE PIC X(3).
02 DPN-NOTE-RCVD-IND PIC X(1).
02 DPN-DROP-OFF-DATE PIC X(10).
02 DPN-DPND-CMNT-TEXT PIC X(65).
02 FILLER PIC X(66).
Hi,
While looking up the REDEFINES fields attached to a given statement (group or primitive), I found that only the field name is provided. This can be an issue if there are multiple fields with the same name in the copybook. Is it possible to provide the exact field name with its hierarchy in dot (.) notation?
I am doing some benchmarking on a Databricks cluster where I use Cobrix to read EBCDIC files and write to parquet. I have an implementation of the same process which does not use this library. Reading a 2GB EBCDIC file with Cobrix takes two minutes longer than reading the file using sc.binaryRecords() and putting the right schema in place to create a DataFrame. The file has around 1400 columns and 9000 bytes per record.
Here is the cluster config:
Spark Version: 2.4
Worker count: flexible, depending on the workload
RAM per worker: 14 GB
Cores per worker: 4
Executor count: flexible, depending on the workload
Executor memory: 7.4 GB
The benchmarks you have included in this project make me think that an increase in executor count would speed up the read throughput. However, I have currently enabled autoscaling in Databricks so it dynamically allocates executors on the fly depending on the workload.
Could you please provide some guidelines on optimising read speed? Things like configurations you have tried in your organisation's use case can really be of help to speed up the process.
Currently Cobrix ignores all characters that translate into ASCII characters below 32. We need an option to allow retaining these.
I think this is aligned with supporting EBCDIC code pages. Currently we support only the common code page. We could add a 'common/all' option to include non-printable characters.
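A sketch of what the opt-in behavior could look like as a pure post-decoding step. The flag name is illustrative, and whether Cobrix's decoder drops or replaces control characters is an assumption here:

```scala
object NonPrintable {
  /** Drops ASCII control characters (code < 32) unless the caller
    * opts in to retaining them verbatim. */
  def sanitize(s: String, keepControlChars: Boolean): String =
    if (keepControlChars) s else s.filter(_ >= 32)
}
```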
Multisegment data files, especially ones that come from hierarchical databases are often described by several copybooks. In this case each copybook describes a particular segment.
Currently, in order to load multisegment files, one must combine the segment copybooks so that each segment is represented by a GROUP in the copybook, and these segment GROUPs must redefine each other. This is manual, copybook-intrusive work.
Allow Cobrix to accept several copybooks as an input and automatically combine them by placing segments into GROUPs that redefine each other.