
absaoss / cobrix

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

License: Apache License 2.0

Scala 63.81% COBOL 3.24% Gnuplot 0.20% Batchfile 0.03% Shell 0.06% ANTLR 0.52% Java 32.13%
cobol-parser spark cobol copybook mainframe ebcdic scalable etl

cobrix's People

Contributors

codealways, dependabot[bot], dwfchu, felipemmelo, fosgate29, gd-iborisov, georgichochov, jaggel, joaquin021, miroslavpojer, schaloner-kbc, thesuperzapper, tr11, vbarakou, wajda, yruslan, zejnilovic

cobrix's Issues

expandPic failing when expanding virtual decimals

If a .cob file has a PIC with two sets of parentheses, and either set contains more than one digit, the parser fails to parse that line and reports a NOT DIVISIBLE by the RECORD SIZE error, with the record size calculated from the copybook coming out smaller than expected.

Pass: PIC S9(15)V99.
Pass: PIC S9(9)V9(6).
Fail: S9(15)V9(2).

I suspect this may be related to repeatCount in expandPic, around line 944 of CopybookParser.scala.
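
For illustration, here is a minimal sketch of the expansion the parser is expected to perform on parenthesised repeat counts; this is not the actual Cobrix implementation, just a regex-based approximation showing that the failing PIC should expand to the same result as the passing ones.

import scala.util.matching.Regex

// Expand parenthesised repeat counts in a PIC string,
// e.g. "S9(15)V9(2)" -> "S999999999999999V99".
val repeatPattern: Regex = "([9XAPZ])\\((\\d+)\\)".r

def expandPic(pic: String): String =
  repeatPattern.replaceAllIn(pic, m => m.group(1) * m.group(2).toInt)

expandPic("S9(15)V99")   // S999999999999999V99
expandPic("S9(9)V9(6)")  // S999999999V999999
expandPic("S9(15)V9(2)") // S999999999999999V99 -- the failing case expands the same way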

Rewrite the COBOL parser

The current parser does its job, but it is not written according to standard parsing practice.

We need to write a grammar for the subset of COBOL that we are parsing and generate the parser using a tool like ANTLR, JavaCC, etc.

Not showing deeper levels as columns. Also .count() problem.

I managed to get it working.
However, it is only showing the level 01 records as columns in the data frame.
Why is that? The level 05 and level 10 fields are not appearing as columns.
Also, a .count() call is failing with a Java error. Why is this? (The error is pasted at the bottom of this issue.)

scala> :load /home/private/Documents/SparkSQLExample_cobol.scala
Loading /home/private/Documents/SparkSQLExample_cobol.scala...
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, DataFrame, SQLContext}
import org.apache.spark.sql.SQLContext
import za.co.absa.cobrix._
import scodec.bits._
import scodec.codecs._
spark: org.apache.spark.sql.SparkSession.type = org.apache.spark.sql.SparkSession$@4ad3969
res0: spark.Builder = org.apache.spark.sql.SparkSession$Builder@7ec5aad
res1: spark.Builder = org.apache.spark.sql.SparkSession$Builder@7ec5aad
2018-10-03 14:52:31 WARN SparkSession$Builder:66 - Using an existing SparkSession; some configuration may not take effect.
res2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@65f470f8
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@21ed4a51
2018-10-03 14:52:40 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
DataFrame: org.apache.spark.sql.DataFrame = [SAM010A_LEADER_RECORD: struct<SAM010A_ALL_ZEROS: string, SAM010A_DATE_CREATED: int ... 3 more fields>, SAM010D_DATA_RECORD_B: struct<SAM010D_ACCOUNT_NO: bigint, SAM010D_ACCOUNT_NO_A: struct<SAM010D_PREFIX: int, SAM010D_BRANCH_NO: int ... 1 more field> ... 301 more fields> ... 2 more fields]
+---------------------+---------------------+-----------------------------+----------------------+
|SAM010A_LEADER_RECORD|SAM010D_DATA_RECORD_B|SAM010E_BRANCH_TRAILER_RECORD|SAM010F_TRAILER_RECORD|
+---------------------+---------------------+-----------------------------+----------------------+
| [0000000000000000...| [2001001355, [2, ...| [2, 1, 2408,,, 0,...| [909000000, 0]|
| [8909000000000002...| [9909000000, [9, ...| [2, 909, 0,, 2000...| [0,]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 9...| [, 1725]|
| [00{0000000019640...| [, [0,, 0], [0], ...| [0,, 0, 338520000...| [0, 0]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 0...| [, 5463696]|
| [283261{000000000...| [, [0, 147,], [1]...| [0, 0,, 0.00, 0, ...| [0, 0]|
| [0000000000000000...| [5000000, [0, 5, ...| [0, 0, 0, 1000600...| [90,]|
| [0000000090000000...| [90, [0, 0, 90], ...| [0, 0, 90, 0.00, ...| [0, 0]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 0...| [0, 0]|
| [0000000000000000...| [0, [0, 0, 0], [0...| [0, 0, 0, 0.00, 0...| [0, 0]|
+---------------------+---------------------+-----------------------------+----------------------+
only showing top 10 rows

scala> DataFrame.count()
[Stage 1:========================================================>(77 + 1) / 78]2018-10-03 15:03:30 ERROR Executor:91 - Exception in task 77.0 in stage 1.0 (TID 78)
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.spark.input.FixedLengthBinaryRecordReader.nextKeyValue(FixedLengthBinaryRecordReader.scala:118)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Inconsistent generated IDs

Depending on whether segment ID filtering is set, the resulting IDs can have gaps in the ID numbers.

The generated IDs should not depend on "segment_field", "segment_filter" and "segment_id_root" options.

signed decimal places

Hi,
I have fields which are made up of 7 numeric digits S9(7). We need them to have 7 decimal places and I cannot find a PIC that can achieve that.
So a value of 000456J needs to become -0.0004561
V9(7) achieves that, but only for positive values; it produces nulls for negatives.
I've tried several PICs, including the ones below, but they either fail or do not achieve the desired result:
SV9(7)
S9(0)V9(7)
Can you please let me know what PIC I need to use to get 7 decimal places for a signed field of 7 digits?
Thank you

streaming

I'm reading a file in streaming mode. There is no error, but the count returns 0.

My code:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import za.co.absa.cobrix.spark.cobol.source.streaming.CobolStreamer._

val spark = SparkSession
  .builder()
  .appName("CobolParser")
  .master("local[2]")
  .config("duration", 13)
  .config("copybook", paramMap("-Dcopybook"))  // paramMap is defined elsewhere in the application
  .config("path", paramMap("-Ddata"))
  .getOrCreate()

val streamingContext = new StreamingContext(spark.sparkContext, Seconds(13))

val reader = getReader(streamingContext)

val result = streamingContext.cobolStream()
result.foreachRDD((rdd: RDD[Row], time: Time) => {
  println(time)
  println(rdd.count())   // this always prints 0
  for (item <- rdd) {
    println(s"---------------$item")
  }
})

Output example:
9/01/13 09:30:47 INFO JobScheduler: Added jobs for time 1547400647000 ms
19/01/13 09:30:47 INFO JobScheduler: Starting job streaming job 1547400647000 ms.0 from job set of time 1547400647000 ms
1547400647000 ms
19/01/13 09:30:47 INFO SparkContext: Starting job: count at NewStreamingEbcdic.scala:78
19/01/13 09:30:47 INFO DAGScheduler: Job 0 finished: count at NewStreamingEbcdic.scala:78, took 0.005128 s
0
19/01/13 09:30:47 INFO SparkContext: Starting job: foreach at NewStreamingEbcdic.scala:79
19/01/13 09:30:47 INFO DAGScheduler: Job 1 finished: foreach at NewStreamingEbcdic.scala:79, took 0.000039 s
19/01/13 09:30:47 INFO JobScheduler: Finished job streaming job 1547400647000 ms.0 from job set of time 1547400647000 ms
19/01/13 09:30:47 INFO JobScheduler: Total delay: 0.252 s for time 1547400647000 ms (execution: 0.085 s)
19/01/13 09:30:47 INFO FileInputDStream: Cleared 0 old files that were older than 1547400582000 ms:
19/01/13 09:30:47 INFO ReceivedBlockTracker: Deleting batches:
19/01/13 09:30:47 INFO InputInfoTracker: remove old batch metadata:
19/01/13 09:31:00 INFO FileInputDStream: Finding new files took 11 ms
19/01/13 09:31:00 INFO FileInputDStream: New files at time 1547400660000 ms:

Why does this happen?

Unable to read local file

Whenever I try to read a local file, it still searches for the file in HDFS and gives the exception below:

java.lang.IllegalArgumentException: Wrong FS: file://xxxxxx:9000/root/abc.cpy expected: hdfs://xxxxxx:9000

Command:
val df = spark.read.format("cobol")
  .option("encoding", "ebcdic")
  .option("generate_record_id", true)
  .option("copybook", "file:///root/abc.cpy")
  .load("file:///root/data.bin")

Allow disabling file size check for debugging purposes

When a copybook does not match the data file, an error message is displayed saying that the record size does not divide the file size, and no data is retrieved.

It can be very helpful, when debugging where exactly a copybook starts to mismatch a data file, to let users override that file size check and still parse the file. In this mode only the top records of a data file may be retrieved correctly, but most of the time that is enough for debugging the copybook/data file correspondence.
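
A sketch of how such an override could look from the user side; the option name debug_ignore_file_size and the paths are assumptions, not a confirmed API:

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")   // hypothetical path
  .option("debug_ignore_file_size", "true")      // hypothetical option name
  .load("/path/to/mismatched_data")              // hypothetical path
df.show(10, truncate = false)                    // inspect the top records to see where fields start to drift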

Retain Group data

I have a few copybooks that contain the structure in the copybook below:

***********************************************************************
        05  NAME.
            10  SHORT-NAME.
                11  NAME-CHAR-1     PIC X.
                11  FILLER          PIC X(9).
            10  FILLER              PIC X(20).
***********************************************************************

The NAME field is 30 characters, but the copybook also uses the first 10 chars for the SHORT-NAME field and the first character as another field.

Right now we drop all the information in the FILLER fields, so we end up with only the first character in the copybook above. This is connected to #53, and a solution for that one would probably cover this case as well.

Maybe we could add an option to specify non-terminal items to retain? In the example above, we would want to retain NAME and SHORT-NAME. We could add an _NT suffix or pass in a map that specifies the new names.
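
A sketch of how the proposed option could be used; the option name non_terminals and the _NT suffix behaviour are taken from the suggestion above and are assumptions, not a confirmed API:

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")   // hypothetical path
  .option("non_terminals", "NAME,SHORT-NAME")    // hypothetical option name and value format
  .load("/path/to/data")                         // hypothetical path
// Expected result: extra string columns (e.g. NAME_NT, SHORT_NAME_NT) holding the raw
// 30- and 10-character values alongside the parsed child fields.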

Support for decimal scaling

I have the following PIC in a copybook: SVPP9(5), which is not recognized as a valid PIC. The two Ps mean that the SV9(5) field is scaled down by 100. It seems that we can have fields such as S9(n)P(m) or 9(n)P(m) for scaling up, and P(m)9(n) or SVP(m)9(n) for scaling down.

Page 211 of this manual gives the following examples:

PPP999: 0 through .000999
S999PPP: -1000 through -999000 and +1000 through +999000 or zero
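
As a small illustration of the arithmetic those scaling positions imply (not the Cobrix decoder, just the two examples above worked out in Scala):

// "PPP999": the Ps are scaling positions between the assumed decimal point and
// the stored digits, so stored digits 999 represent 999 * 10^-(3 + 3) = 0.000999
val scaledDown = 999 * math.pow(10, -(3 + 3))   // 9.99E-4

// "S999PPP": the Ps scale the stored digits up, so 999 represents 999 * 10^3 = 999000
// (the sign comes from the S overpunch)
val scaledUp = 999 * math.pow(10, 3)            // 999000.0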

cannot find columns with occurs x times

When I tried to read the COBOL binary file using the following script:
val df = spark.read.format("za.co.absa.cobrix.spark.cobol.source")
  .option("copybook", "file:///test/XYZ.txt")
  .option("is_record_sequence", "true")
  .option("is_rdw_big_endian", "true")
  .option("schema_retention_policy", "collapse_root")
  .load("/test/test1")

The script ran successfully, but when I retrieved the data I found the FILLER column missing, while the other columns are fine. In the copybook it looks like:
02 FILLER OCCURS 2 TIMES
PIC X.

Can you tell me how to solve the issue?

OCCURS depending on non-integer fields

The OCCURS clause handler currently enforces an integer DEPENDING ON field, which can lead to issues when the field depended on is interpreted as a decimal. For example, S9V is a Decimal(1,0) but in reality is an integer value since it doesn't specify any decimal places.

Maybe we can allow a Decimal(n,0) to be depended on or, alternatively, interpret those as int values.

This is not a priority, since one can easily change a copybook's DEPENDING ON fields to strip the extra V decimal place indicator.

Spark-Cobol-App

I get an error when I execute the command mvn test in the Spark-Cobol-App example.
The error is:
java.lang.IllegalStateException: Root segment SEGMENT_ID=='C' not found in the data file.

Retaining string for 9(...) picture

Sometimes pictures like 9(8) do not actually refer to integer values. For example, many mainframe files use 9(8) to represent a date. This means the value should not be converted to an integer automatically. I'd like to have a feature where 9(...) fields are not automatically converted to integers but are treated as strings.
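
A possible workaround with the current behaviour, assuming a DataFrame df whose hypothetical column DATE_FIELD was read from a 9(8) picture as an integer: rebuild the original 8-character string, keeping leading zeros.

import org.apache.spark.sql.functions.{col, lpad}

// Restore the fixed-width textual form of the 9(8) field (e.g. "20190101")
val withDateString = df.withColumn(
  "DATE_FIELD_STR",
  lpad(col("DATE_FIELD").cast("string"), 8, "0")
)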

Exception in thread "main" za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 16: Unable to parse the value of LEVEL. Numeric value expected, but 'CUSTOM-CHANGE-FLAGS-CNT' encountered

I am getting the error
Exception in thread "main" za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 16: Unable to parse the value of LEVEL. Numeric value expected, but 'CUSTOM-CHANGE-FLAGS-CNT' encountered

Also, does Cobrix work with nested OCCURS subgroups? How many levels of nesting are handled?
Do I need to make any changes to the copybooks from the mainframe in order to parse them? If any changes are required, please let me know.
I am having a lot of issues with copybooks; I have tried to parse many of them, but they fail for some reason.

If there are any tips for parsing copybooks more easily, please let me know.

I appreciate your cooperation.

java.lang.NoClassDefFoundError: za/co/absa/cobrix/cobol/parser/encoding/Encoding

I ran mvn package to build your code.
I referenced the three produced jars in spark-shell:
spark-cobol-0.2.6-SNAPSHOT.jar
spark-cobol-0.2.6-SNAPSHOT-javadoc.jar
spark-cobol-0.2.6-SNAPSHOT-sources.jar

Then I invoke spark-shell:
spark-shell --jars /path_to_cobrix_jars/*.jar

I am getting an error that a particular class is missing.
Can you guide me?

scala> :load /home/private/Documents/SparkSQLExample_cobol.scala
Loading /home/private/Documents/SparkSQLExample_cobol.scala...
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, DataFrame, SQLContext}
import org.apache.spark.sql.SQLContext
import za.co.absa.cobrix._
spark: org.apache.spark.sql.SparkSession.type = org.apache.spark.sql.SparkSession$@37f41a81
res0: spark.Builder = org.apache.spark.sql.SparkSession$Builder@5ee581db
res1: spark.Builder = org.apache.spark.sql.SparkSession$Builder@5ee581db
2018-09-11 10:16:51 WARN SparkSession$Builder:66 - Using an existing SparkSession; some configuration may not take effect.
res2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@5a8656a2
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4070c4ff
java.lang.NoClassDefFoundError: za/co/absa/cobrix/cobol/parser/encoding/Encoding
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createFixedLengthReader(DefaultSource.scala:82)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.buildEitherReader(DefaultSource.scala:69)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:50)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 77 elided
Caused by: java.lang.ClassNotFoundException: za.co.absa.cobrix.cobol.parser.encoding.Encoding
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 85 more

Code executed is below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, DataFrame, SQLContext}
import org.apache.spark.sql.SQLContext
import za.co.absa.cobrix._

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .getOrCreate()

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val DataFrame = sqlContext.read.format("cobol")
  .option("copybook", "/home/private/Documents/Data_Files/Cobol_EBCDIC/pnic00.wspack.sam010.d0.sq2700.D180630.cpy")
  .load("/home/private/Documents/Data_Files/Cobol_EBCDIC/pnic00.wspack.sam010.d0.sq2700.D180630.dat")

Add an option to control string trimming

Currently, when Cobrix reads a string field it trims it on both sides.
It is often required to be as little data-intrusive as possible and only trim on the right, or not trim at all.

A new option should be created to control this behavior.
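
A sketch of how such an option might be exposed; the option name and its values are assumptions, not a confirmed API:

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")   // hypothetical path
  .option("string_trimming_policy", "right")     // hypothetical option name; e.g. both | left | right | none
  .load("/path/to/data")                         // hypothetical path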

Add support for EBCDIC code pages

Currently Cobrix supports only the common EBCDIC code page. We need an option for users to specify a different code page for the data file. The priority is code page 37.

Characters in these code pages cannot be mapped to 7-bit ASCII. Need an EBCDIC to Unicode character conversion to support this.

Add support for nested segment loading

Currently, multisegment files can be loaded only one segment at a time. It would be very useful to add the possibility of loading all the segments of a hierarchical database. Child nodes can become an array of structs inside the root node (see the schema sketch after the questions below).

Open implementation questions:

  • How should users specify parent-child relationships among segments? Perhaps the parent segment needs to be specified for each segment.
  • How should users provide a schema for each of the segments? Should a copybook per segment be required, or can all segments somehow be combined into one copybook? Which is better?
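
The schema sketch referenced above, expressed as a Spark StructType; the segment and field names are hypothetical and serve only to illustrate child segments becoming arrays of structs inside the root segment:

import org.apache.spark.sql.types._

val targetSchema = StructType(Seq(
  StructField("COMPANY_ID", StringType),              // root segment fields
  StructField("COMPANY_NAME", StringType),
  StructField("CONTACTS", ArrayType(StructType(Seq(   // child segment nested as an array of structs
    StructField("CONTACT_NAME", StringType),
    StructField("PHONE_NUMBER", StringType)
  ))))
))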

EBCDIC byte reader - New feature

Hi Team - We have an enhancement requirement to read bytes of data from a single EBCDIC file. The file contains different record types, and a 3-letter or 1-letter keyword identifies the record type. The file should be split based on that.
Note: it is not a standard file that has an RDW.

SV9(x) should be a valid PIC

Great library! I hit an issue with a copybook containing an SV9(5) PIC. The tests seem to indicate this is invalid, but PICs such as SV9(x) and SV99 should be valid.

According to this, given a PIC S9(p-s)V9(s):

p is precision; s is scale. 0≤s≤p≤31. If s=0, use S9(p)V or S9(p). If s=p, use SV9(s).

Top level REDEFINES

I have a copybook that has a format akin to the following

***********************************************************************
       01 RECORD-TYPE-1.
           02  FIELD-1            PIC X(16).
       01 RECORD-TYPE-2 REDEFINES RECORD-TYPE-1.
           02  FIELD-2            PIC X(16).
       01 RECORD-TYPE-3 REDEFINES RECORD-TYPE-1.
           02  FIELD-3            PIC X(16).
***********************************************************************

The library reads this as a copybook of size 48 (3 × 16), but it should be only 16 bytes. The following copybook, which adds a root element, seems to work correctly (and is a workaround for now):

***********************************************************************
      01 RECORD.
       02 RECORD-TYPE-1.
           03  FIELD-1            PIC X(16).
       02 RECORD-TYPE-2 REDEFINES RECORD-TYPE-1.
           03  FIELD-2            PIC X(16).
       02 RECORD-TYPE-3 REDEFINES RECORD-TYPE-1.
           03  FIELD-3            PIC X(16).
***********************************************************************

Sparse indexes for non-RDW variable length files not used

Currently, sparse index generation is used only if RDW headers are present in a variable record length file.

This makes processing of fixed record length files slow when file headers and footers are used, or when record ID generation is used for fixed record length files.

Need to add sparse index generation to RDW-less files as well.

spark.read.format: Intellij vs shell

Hi team

Scenario 1: Spark-shell, local mode, Windows local file system.

The following ingestion works fine:

spark-shell --jars D:\Tools\cobrix-master-jars\cobol-parser-0.4.2.jar,D:\Tools\cobrix-master-jars\spark-cobol-0.4.2.jar,D:\Tools\cobrix-master-jars\scodec-core_2.11-1.10.3.jar,D:\Tools\cobrix-master-jars\scodec-bits_2.11-1.1.4.jar

val df_in: DataFrame = spark.read.format("cobol"). //"za.co.absa.cobrix.spark.cobol.source"
    option("copybook", "D:\\Tools\\cobrix-master\\data\\test2_copybook.cob").
    load("D:\\Tools\\cobrix-master\\data\\test2_data\\example.bin")
df_in.count

res30: Long = 10

Scenario 2: IntelliJ Ultimate, Maven, scala-maven-plugin, and the following dependencies in the POM:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.2</version>
        </dependency>

        <dependency>
            <groupId>za.co.absa.cobrix</groupId>
            <artifactId>spark-cobol</artifactId>
            <version>0.4.2</version>
        </dependency>

and creating a simple ingestion routine:

package XYZ

import org.apache.spark.sql._

object Main {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().
      master("local").
      appName("cobrixTester").
      getOrCreate()

    val df_in: DataFrame = spark.read.format("cobol"). //"za.co.absa.cobrix.spark.cobol.source"
      option("copybook", "D:\\Tools\\cobrix-master\\data\\test2_copybook.cob").
      load("D:\\Tools\\cobrix-master\\data\\test2_data\\example.bin")
    df_in.count
  }
}

This results in:

Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:67)
Caused by: java.lang.NoClassDefFoundError: scala/Product$class
	at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParameters.<init>(CobolParameters.scala:50)
	at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersParser$.parse(CobolParametersParser.scala:118)
	at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:54)
	at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:48)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at XYZ.Main$.main(Main.scala:14)
	at XYZ.Main.main(Main.scala)
	... 5 more
Caused by: java.lang.ClassNotFoundException: scala.Product$class
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 15 more

This happens for both .format("cobol") and .format("za.co.absa.cobrix.spark.cobol.source").

Thanks for your inputs
Matthias

Field names cannot start with COMP-X

We should support copybooks with reserved keywords at the beginning of field names. While this is bad practice, it's still valid COBOL syntax.

The copybook below would fail because COMP-ACCOUNT-I starts like COMP-X:

                   12  ACCOUNT-DETAIL    OCCURS 80
                                         DEPENDING ON NUMBER-OF-ACCTS
                                         INDEXED BY COMP-ACCOUNT-I.
                      15  ACCOUNT-NUMBER     PIC X(24).
                      15  ACCOUNT-TYPE-N     PIC 9(5) COMP-3.
                      15  ACCOUNT-TYPE-X     REDEFINES
                           ACCOUNT-TYPE-N  PIC X(3).

PR INCOMING TO FIX

NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (136 bytes per record).

I am getting this error while calling df.show().

df.show()
19/04/25 23:03:42 ERROR utils.FileUtils$: File hdfs://quickstart.cloudera:8020/user/cloudera/test_mainframe/DEPEND_SAMP_190422150915.dat IS NOT divisible by 136.
java.lang.IllegalArgumentException: There are some files in /user/cloudera/test_mainframe/DEPEND_SAMP_190422150915.dat that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (136 bytes per record). Check the logs for the names of the files.
at za.co.absa.cobrix.spark.cobol.source.scanners.CobolScanners$.buildScanForFixedLength(CobolScanners.scala:87)
at za.co.absa.cobrix.spark.cobol.source.CobolRelation.buildScan(CobolRelation.scala:85)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:308)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3254)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
... 49 elided

Can someone please help me resolve this exception?

Request for assistance: GCD of 1, but all pointing towards fixed record length files

Hi Ruslan

First of all, thanks for your work! I work for a company that gets its transaction batch files from an IBM mainframe in EBCDIC encoding.

For a PoC on ingesting mainframe data directly into Spark, we have so far managed to ingest fixed record length files without major issues, only by adapting the original copybooks in order to get proper REDEFINES and, with that, a proper fixed record length, which corresponds to the GCD of the test files.

We have issues, however, with variable record length files. We identified several of them simply by calculating the GCD of the test files, which in these cases is of course 1.

However, arranging the corresponding copybook by implementing proper REDEFINES, and excluding common fields from the REDEFINES, results in a structure that suggests a fixed record length:
a) all segments have the same length (Cobrix schema parsing works fine),
b) the second common field, at position 8, is called "RECORD-TYPE", has byte length 4 (DISPLAY) and very likely indicates the record type.

When trying to ingest such files with option("is_record_sequence", "true"), the result is:

df_in: org.apache.spark.sql.DataFrame = [GLOBAL_RECORD: struct<Account_ID: string, Record_Type: string ... 14 more fields>] 2019-04-08 11:07:53 WARN VarLenNestedReader:87 - Input split size = 32 MB 2019-04-08 11:07:53 ERROR Executor:91 - Exception in task 0.0 in stage 98.0 (TID 7660) java.lang.IllegalStateException: RDW headers should never be zero (0,0,0,0). Found zero size record at 0. at za.co.absa.cobrix.cobol.parser.decoders.BinaryUtils$.extractRdwRecordSize(BinaryUtils.scala:305)

So, obviously, the RDW is not found.

When trying to ingest such files with option("is_record_sequence", "false"), the result of course is:

df_in: org.apache.spark.sql.DataFrame = [GLOBAL_RECORD: struct<Account_ID: string, Record_Type: string ... 14 more fields>] 2019-04-08 11:25:20 ERROR FileUtils$:218 - File file:/XYZ IS NOT divisible by 1513. java.lang.IllegalArgumentException: There are some files in XYZ that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (1513 bytes per record)

In brief: GCD of 1, but all is pointing towards fixed record length.

We are still figuring out in which form to send you a data sample plus copybook, in order to comply with internal guidelines.
In the meantime, I'd greatly appreciate a hint.

Thanks a lot in advance
Matthias

Handle negative values for unsigned number patterns

Here is an example of an unsigned field definition:

07 AMOUNT    PIC 9(16)V99.

Cobrix is permissive and allows the actual data to contain negative numbers in such a column. A better behavior might be to return null in such cases, which is what Cobrix usually does when it encounters a pattern/data mismatch.

Add an option for segment-redefine awareness

If there is a known mapping between segment IDs and redefines, we can have an option for the user to specify that mapping. This way the parser won't need to parse all the segments for each record. Combined with segment filtering (predicate pushdown), this should significantly increase performance when targeting a relational model of the output.
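
An illustration of the kind of mapping such an option would carry; the segment IDs and group names below are hypothetical:

// With a mapping like this known up front, only the redefine that matches a
// record's segment id needs to be decoded, instead of every redefine per record.
val segmentIdToRedefine = Map(
  "C" -> "COMPANY-RECORD",   // hypothetical segment id -> redefined GROUP
  "P" -> "PERSON-RECORD"
)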

Header and Trailer records

Some of the files I have come with a header and/or trailer record. For fixed length, the layout looks something like

HEADER
RECORD
...
RECORD
TRAILER

where all the records, the header, and the trailer have the same (fixed) length.

Do we already support something like this? Right now I'm just creating a new copybook that contains the three copybooks for the types above glued together with REDEFINES, but it would be fantastic if we could support this kind of layout.

Q: Maturity of this library

I apologise, as I know this is the wrong place since it's not an issue; if you could point me to a better place to ask this question, I'm happy to go elsewhere.

Anyway, we are wondering what usage this library is getting in production environments, as we just want to confirm the project is mature enough for our usage.

Is it possible to confirm if Absa is using this in prod?

0.4.1 SBT import library is huge

We tried importing 0.4.1 through the SBT import. The older version (0.4.0) JAR size was approximately 12 MB, but the current version is 110 MB. I just want to understand whether this is normal or whether there are any issues with the project build.

Add support for Scala 2.12

Background

Cobrix was originally developed for Spark 2.2.2, which supports Scala 2.11.
Now Spark 2.4.2 and Spark 3.0.0 are coming with Scala 2.12 support.

Feature

Need to add support for Scala 2.12.

Example

See #84

Proposed Solution

  1. Need to introduce a release process that generates two versions of the library: one for Scala 2.11 and one for Scala 2.12 (a minimal cross-build sketch follows below).
  2. The cobol-parser artifact does not depend on Spark, so it can have a single version.
  3. The spark-cobol artifact should have two versions for each release.
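
The cross-build sketch mentioned in item 1, written as an sbt build.sbt; the version numbers and module layout are illustrative and do not describe the actual Cobrix build:

lazy val commonSettings = Seq(
  organization := "za.co.absa.cobrix",
  scalaVersion := "2.11.12",
  crossScalaVersions := Seq("2.11.12", "2.12.10")   // "+publish" builds both Scala versions
)

// cobol-parser does not depend on Spark
lazy val cobolParser = (project in file("cobol-parser"))
  .settings(commonSettings)

// spark-cobol depends on Spark, which is itself published per Scala version,
// so this produces spark-cobol_2.11 and spark-cobol_2.12 artifacts
lazy val sparkCobol = (project in file("spark-cobol"))
  .settings(commonSettings)
  .settings(libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.2" % Provided)
  .dependsOn(cobolParser)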

Proactively validate binary data size

The COBOL data source uses the binaryRecords() API to parse binary data.
When the size of any input file is not evenly divisible by the record size, Spark produces an exception.

We need to proactively check that all binary files are evenly divisible by the record size and produce a more readable exception.

Preferably, this needs to be scalable to cases where there are millions of files (see the sketch below).
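
A sketch of the kind of proactive check this describes, using the Hadoop FileSystem API; the path and record length are hypothetical, and spark is assumed to be in scope. For millions of files the listing itself would likely need to be distributed.

import org.apache.hadoop.fs.{FileSystem, Path}

val recordLength = 136                                   // hypothetical, derived from the copybook
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Collect the names of files whose size is not a multiple of the record length
val badFiles = fs.listStatus(new Path("/path/to/data"))  // hypothetical path
  .filter(_.isFile)
  .filter(_.getLen % recordLength != 0)
  .map(_.getPath.toString)

if (badFiles.nonEmpty) {
  throw new IllegalArgumentException(
    s"Files not divisible by the record size ($recordLength bytes): ${badFiles.mkString(", ")}")
}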

Debug functionality of EBCDIC data

Hi,
This is not an issue report but a request for debug functionality. When Cobrix creates the DataFrame, it decodes the EBCDIC data according to the data type of each primitive field. Would it be possible to also show the hex value of a column, given an option like add_hex = true? This functionality would be for debugging purposes only, to check the data.

Fix code duplication for COBOL AST

Currently, the COBOL AST is an array of GROUP objects. It needs to be changed to be just a GROUP, so that AST traversal functions can be simplified.
Also, this would allow root-level primitive fields in copybooks.

This is to support something like

     05 FILLER PIC 9(08) BINARY.
     05 :RRHEADER:-CNTL.
        10 :RRHEADER:-LENGTH PIC S9(08) BINARY.
           88 :RRHEADER:-LENGTH-SET VALUE +65.

The FILLER at the top is not currently supported since it is a root-level primitive field.
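
A sketch of the proposed shape (not the actual Cobrix classes): the AST root becomes a single group whose children can be sub-groups or primitive fields, which also covers the root-level FILLER above.

sealed trait Statement { def name: String }

final case class Primitive(name: String, pic: String) extends Statement

final case class Group(name: String, children: Seq[Statement]) extends Statement

// Traversal becomes a single recursive walk over one root Group
def visit(stmt: Statement)(f: Statement => Unit): Unit = stmt match {
  case p: Primitive =>
    f(p)
  case g: Group =>
    f(g)
    g.children.foreach(child => visit(child)(f))
}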

RDW interpretation is incorrect for an IBM z/OS dataset

The Record Descriptor Word (previously referred to as the XCOM header) is interpreted incorrectly for a file transferred from IBM z/OS to Windows 10 via standard FTP in binary mode.

Per the IBM documentation (https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.idad400/d4357.htm) the RDW is the first 2 bytes of each record, but the extractRdwRecordSize method in the BinaryUtils class is expecting the length to be in the 3rd and 4th bytes.

It's possible XCOM or other transfer methods change the interpretation of the RDW, but the current method makes it impossible to process a dataset originating from IBM z/OS, at least in our environment.

I am able to process the datasets by making a minor adjustment to the extractRdwRecordSize method to read the 1st and 2nd bytes instead, and the data is parsed correctly after that.
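
For illustration, a sketch of reading the record length from a 4-byte RDW as the IBM documentation describes it (big-endian length in the first two bytes, followed by two reserved bytes). This is not the Cobrix extractRdwRecordSize method, and whether the length includes the 4 RDW bytes themselves may vary by transfer method.

def rdwRecordSize(rdw: Array[Byte]): Int = {
  require(rdw.length >= 4, "An RDW is 4 bytes long")
  ((rdw(0) & 0xFF) << 8) | (rdw(1) & 0xFF)   // length from bytes 1-2, big-endian
}

rdwRecordSize(Array[Byte](0x00, 0x20, 0x00, 0x00))   // 32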

Syntax error in the copybook at line 2: Unable to parse the value of LEVEL. Numeric value expected, but 'PN-TABLE-NAME' encountered

Hi Team,

I am getting an error while trying to create the DataFrame. It is failing with a layout file error.
Could you please help me find the error in the layout file?

Spark shell:
spark2-shell --master yarn --deploy-mode client --driver-cores 2 --driver-memory 4G --jars cobol-parser-0.4.3-SNAPSHOT.jar,spark-cobol-0.4.3-SNAPSHOT.jar,scodec-core_2.11-1.10.3.jar,scodec-bits_2.11-1.1.2.jar

Data Frame:
val df = spark.read.format("cobol")
  .option("copybook", "/user/cloudera/layout/layout_spark.txt")
  .load("/user/cloudera/test_mainframe/DEPEND_SAMP_190422150915.dat")

Error:
za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 2: Unable to parse the value of LEVEL. Numeric value expected, but 'PN-TABLE-NAME' encountered
at za.co.absa.cobrix.cobol.parser.CopybookParser$.za$co$absa$cobrix$cobol$parser$CopybookParser$$CreateCopybookLine(CopybookParser.scala:204)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$8.apply(CopybookParser.scala:134)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$8.apply(CopybookParser.scala:133)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.parseTree(CopybookParser.scala:133)
at za.co.absa.cobrix.spark.cobol.reader.fixedlen.FixedLenNestedReader.loadCopyBook(FixedLenNestedReader.scala:81)
at za.co.absa.cobrix.spark.cobol.reader.fixedlen.FixedLenNestedReader.(FixedLenNestedReader.scala:47)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createFixedLengthReader(DefaultSource.scala:86)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.buildEitherReader(DefaultSource.scala:73)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:57)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:48)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
... 49 elided

layout file:
01 DCLDPN-DPND-REC-TABL.
02 DPN-TABLE-NAME PIC X(8).
02 DPN-EMP-NBR PIC S9(9) COMP.
02 DPN-CARR-CODE PIC X(3).
02 DPN-ACVY-STRT-DATE PIC X(10).
02 DPN-CREW-ACVY-CODE PIC X(3).
02 DPN-PRNG-NBR PIC X(5).
02 DPN-PRNG-ORIG-DATE PIC X(10).
02 DPN-ASNT-DAYS-CNT PIC S9(4) COMP.
02 DPN-RVEW-DATE PIC X(10).
02 DPN-CONJ-ACVY-CODE PIC X(3).
02 DPN-NOTE-RCVD-IND PIC X(1).
02 DPN-DROP-OFF-DATE PIC X(10).
02 DPN-DPND-CMNT-TEXT PIC X(65).
02 FILLER PIC X(66).

Redefine Field name with dot(.) notation

Hi,
While searching for REDEFINES fields attached to a given statement (Group/Primitive), I found that only the field name is provided. This can be an issue if there are multiple fields with the same name in the copybook. Is it possible to provide the exact field name with its hierarchy, using dot (.) notation?

Speeding up reads

I am doing some benchmarking on a Databricks cluster where I use Cobrix to read EBCDIC files and write to parquet. I have an implementation of the same process which does not use this library. Reading a 2GB EBCDIC file with Cobrix takes two minutes longer than reading the file using sc.binaryRecords() and putting the right schema in place to create a DataFrame. The file has around 1400 columns and 9000 bytes per record.

Here is the cluster config:
Spark Version: 2.4
Worker count: flexible, depending on the workload
RAM per worker: 14 GB
Cores per worker: 4
Executor count: flexible, depending on the workload
Executor memory: 7.4 GB

The benchmarks you have included in this project make me think that an increase in executor count would speed up the read throughput. However, I have currently enabled autoscaling in Databricks so it dynamically allocates executors on the fly depending on the workload.

Could you please provide some guidelines on optimising read speed? Things like configurations you have tried in your organisation's use case can really be of help to speed up the process.

Add support for EBCDIC non-printable characters in string fields

Currently Cobrix ignores all characters that translate into ASCII characters below 32. Need to have an option to allow retaining these.

I think it is aligned with supporting EBCDIC code pages. Currently we support only the common code page. We can add 'common/all' to include non-printable characters.

Allow to specify several copybooks

Background

Multisegment data files, especially ones that come from hierarchical databases, are often described by several copybooks. In this case each copybook describes a particular segment.

Currently, in order to load multisegment files, one must combine the segment copybooks so that each segment is represented by a GROUP in the copybook, and these segment GROUPs must redefine each other. This is manual, copybook-intrusive work.

Feature

Allow Cobrix to accept several copybooks as an input and automatically combine them by placing segments into GROUPs that redefine each other.
