Comments (26)
@yruslan, can you please guide me on this?
Hi, this is not directly supported, unfortunately.
The only workaround I see right now is to have a REDEFINES for each use case, which is messy:
01 CUSTOMER-RECORD.
   05 SEGMENT-INDICATORS.
      10 CUSTOMER-DETAILS-PRESENT      PIC X(1).
      10 ACCOUNT-INFORMATION-PRESENT   PIC X(1).
      10 TRANSACTION-HISTORY-PRESENT   PIC X(1).
   05 DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID                PIC X(10).
   05 DETAILS010 REDEFINES DETAILS100.
      10 ACCOUNT-INFORMATION.
         15 ACCOUNT-NUMBER             PIC X(10).
   05 DETAILS001 REDEFINES DETAILS100.
      10 TRANSACTION-HISTORY.
         15 TRANSACTION-ID             PIC X(10).
   05 DETAILS110 REDEFINES DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID                PIC X(10).
      10 ACCOUNT-INFORMATION.
         15 ACCOUNT-NUMBER             PIC X(10).
   05 DETAILS011 REDEFINES DETAILS100.
      10 ACCOUNT-INFORMATION.
         15 ACCOUNT-NUMBER             PIC X(10).
      10 TRANSACTION-HISTORY.
         15 TRANSACTION-ID             PIC X(10).
   05 DETAILS101 REDEFINES DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID                PIC X(10).
      10 TRANSACTION-HISTORY.
         15 TRANSACTION-ID             PIC X(10).
   05 DETAILS111 REDEFINES DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID                PIC X(10).
         15 CUSTOMER-NAME              PIC X(30).
      10 ACCOUNT-INFORMATION.
         15 ACCOUNT-NUMBER             PIC X(10).
      10 TRANSACTION-HISTORY.
         15 TRANSACTION-ID             PIC X(10).
Then you can resolve fields based on the indicators.
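For illustration, a minimal PySpark sketch of that resolution step (a sketch only: it assumes Cobrix's default conversion of dashes to underscores and that each DETAILSxxx group surfaces as a nullable struct column; the exact column paths are assumptions, not verified output):

from pyspark.sql import functions as F

# Pick CUSTOMER-ID from whichever DETAILSxxx redefine group is populated.
# Only the groups that contain CUSTOMER-DETAILS are listed; when that
# segment is absent from a record, the resolved value is null.
df_resolved = df.withColumn(
    "CUSTOMER_ID_RESOLVED",
    F.coalesce(
        F.col("DETAILS100.CUSTOMER_DETAILS.CUSTOMER_ID"),
        F.col("DETAILS110.CUSTOMER_DETAILS.CUSTOMER_ID"),
        F.col("DETAILS101.CUSTOMER_DETAILS.CUSTOMER_ID"),
        F.col("DETAILS111.CUSTOMER_DETAILS.CUSTOMER_ID"),
    ),
)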
[Image: V2_DATA - screenshot of the sample input data (expired GitHub attachment link)]
The above is the data I need to read.
"
-
Customer Data COBOL Copybook with Hexadecimal Fields and Segment Presence Indicators
01 CUSTOMER-RECORD. 05 SEGMENT-INDICATORS. 10 CUSTOMER-DETAILS-PRESENT PIC X(1) COMP-X. 10 ACCOUNT-INFORMATION-PRESENT PIC X(1) COMP-X. 10 TRANSACTION-HISTORY-PRESENT PIC X(1) COMP-X. 05 CUSTOMER-DETAILS. 10 CUSTOMER-ID PIC X(10) COMP-X. 10 CUSTOMER-NAME PIC X(30) COMP-X. 10 CUSTOMER-ADDRESS PIC X(50) COMP-X. 10 CUSTOMER-PHONE-NUMBER PIC X(15) COMP-X. 05 ACCOUNT-INFORMATION. 10 ACCOUNT-NUMBER PIC X(10) COMP-X. 10 ACCOUNT-TYPE PIC X(2) COMP-X. 10 ACCOUNT-BALANCE PIC X(12) COMP-X. 05 TRANSACTION-HISTORY. 10 TRANSACTION-ID PIC X(10) COMP-X. 10 TRANSACTION-DATE PIC X(8) COMP-X. 10 TRANSACTION-AMOUNT PIC X(12) COMP-X. 10 TRANSACTION-TYPE PIC X(2) COMP-X.
"
With the above format, is this not feasible at present?
Hi @yruslan. Cobrix is not able to find the next record for permutation DETAILS100. I guess it won't find the next record if any of the segments is missing and the record length is disturbed. The source data is a continuous binary file, and if a particular segment is not present, Cobrix gets confused about where the next record starts. Can you suggest a fix?
Tried:
* You may obtain a copy of the License at *
* *
* http://www.apache.org/licenses/LICENSE-2.0 *
* *
* Unless required by applicable law or agreed to in writing, software *
* distributed under the License is distributed on an "AS IS" BASIS, *
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. *
* See the License for the specific language governing permissions and *
* limitations under the License. *
* *
****************************************************************************
01 CUSTOMER-RECORD.
   10 SEGMENT-INDICATORS.
      15 CUSTOMER-DETAILS-PRESENT      PIC 9(1).
      15 ACCOUNT-INFORMATION-PRESENT   PIC 9(1).
      15 TRANSACTION-HISTORY-PRESENT   PIC 9(1).
01 CUSTOMER-DETAILS-TAB.
   10 CUST-TAB OCCURS 1 TO 2 TIMES DEPENDING ON CUSTOMER-DETAILS-PRESENT.
      15 CUSTOMER-ID                   PIC X(10).
      15 CUSTOMER-NAME                 PIC X(30).
      15 CUSTOMER-ADDRESS              PIC X(50).
      15 CUSTOMER-PHONE-NUMBER         PIC X(15).
01 ACCOUNT-INFORMATION-TAB.
   10 ACCT-INFO-TAB OCCURS 1 TO 2 TIMES DEPENDING ON ACCOUNT-INFORMATION-PRESENT.
      15 ACCOUNT-NUMBER                PIC X(10).
      15 ACCOUNT-TYPE                  PIC X(2).
      15 ACCOUNT-BALANCE               PIC X(12).
01 TRANSACTION-HISTORY-TAB.
   10 TRANS-TAB OCCURS 1 TO 2 TIMES DEPENDING ON TRANSACTION-HISTORY-PRESENT.
      15 TRANSACTION-ID                PIC X(10).
      15 TRANSACTION-DATE              PIC X(8).
      15 TRANSACTION-AMOUNT            PIC X(12).
      15 TRANSACTION-TYPE              PIC X(2).
Getting error:
za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 13: Invalid input 'EGMENT-INDICATORS' at position 13:6
Code:
df = (spark.read.format("cobol")
      .option("copybook", "/mnt/idfprodappdata/Client_Data/xx_Cobix_RnD/SCHEMA/T_3.cob")
      .option("record_format", "F")
      .option("variable_size_occurs", True)
      .option("variable_size_occurs", "true")
      .load("/mnt/idfprodappdata/x/xx/DATA/customer_data_file_V2.dat"))
df.display()
Yes, I see that there is an additional complication: the record size varies, and it depends on the indicator fields. Currently, Cobrix supports record length mapping only if the segment field is a single field. Since your indicator fields are adjacent, you can combine them as a workaround.
Note that I've combined the indicators into a single SEGMENT-ID field, redefined by the individual indicators:
01 CUSTOMER-RECORD.
   05 SEGMENT-ID PIC X(3).
   05 SEGMENT-INDICATORS REDEFINES SEGMENT-ID.
      10 CUSTOMER-DETAILS-PRESENT      PIC X(1).
      10 ACCOUNT-INFORMATION-PRESENT   PIC X(1).
      10 TRANSACTION-HISTORY-PRESENT   PIC X(1).
   05 DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID                PIC X(10).
   05 DETAILS010 REDEFINES DETAILS100.
      10 ACCOUNT-INFORMATION.
         15 ACCOUNT-NUMBER             PIC X(10).
   05 DETAILS001 REDEFINES DETAILS100.
      10 TRANSACTION-HISTORY.
         15 TRANSACTION-ID             PIC X(10).
   05 DETAILS110 REDEFINES DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID                PIC X(10).
      10 ACCOUNT-INFORMATION.
         15 ACCOUNT-NUMBER             PIC X(10).
   05 DETAILS011 REDEFINES DETAILS100.
      10 ACCOUNT-INFORMATION.
         15 ACCOUNT-NUMBER             PIC X(10).
      10 TRANSACTION-HISTORY.
         15 TRANSACTION-ID             PIC X(10).
   05 DETAILS101 REDEFINES DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID                PIC X(10).
      10 TRANSACTION-HISTORY.
         15 TRANSACTION-ID             PIC X(10).
   05 DETAILS111 REDEFINES DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID                PIC X(10).
         15 CUSTOMER-NAME              PIC X(30).
      10 ACCOUNT-INFORMATION.
         15 ACCOUNT-NUMBER             PIC X(10).
      10 TRANSACTION-HISTORY.
         15 TRANSACTION-ID             PIC X(10).
Then you can use the segment-id-to-record-size mapping, but you need to determine the record size for each combination of indicators:
.option("record_format", "F")
.option("record_length_field", "SEGMENT_ID")
.option("record_length_map", """{"001":50,"010":30,"100":20,"011":80,"110":50,"101":70,"111":100}""")
.option("segment_field", "SEGMENT-ID")
.option("redefine-segment-id-map:0", "DETAILS001 => 001")
.option("redefine-segment-id-map:1", "DETAILS010 => 010")
.option("redefine-segment-id-map:2", "DETAILS100 => 100")
.option("redefine-segment-id-map:3", "DETAILS011 => 011")
.option("redefine-segment-id-map:4", "DETAILS110 => 110")
.option("redefine-segment-id-map:5", "DETAILS101 => 101")
.option("redefine-segment-id-map:6", "DETAILS111 => 111")
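Assembled into a single read, the above might look like this sketch (paths are placeholders, and the record lengths in the map are illustrative only; the real sizes must be computed for your data):

df = (spark.read.format("cobol")
      .option("copybook", "/path/to/copybook.cob")
      .option("record_format", "F")
      .option("record_length_field", "SEGMENT_ID")
      .option("record_length_map", """{"001":50,"010":30,"100":20,"011":80,"110":50,"101":70,"111":100}""")
      .option("segment_field", "SEGMENT-ID")
      .option("redefine-segment-id-map:0", "DETAILS001 => 001")
      .option("redefine-segment-id-map:1", "DETAILS010 => 010")
      .option("redefine-segment-id-map:2", "DETAILS100 => 100")
      .option("redefine-segment-id-map:3", "DETAILS011 => 011")
      .option("redefine-segment-id-map:4", "DETAILS110 => 110")
      .option("redefine-segment-id-map:5", "DETAILS101 => 101")
      .option("redefine-segment-id-map:6", "DETAILS111 => 111")
      .load("/path/to/data.dat"))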
@yruslan
Getting error:
IllegalStateException: The record length field SEGMENT_ID must be an integral type.
The input data is binary; the segment indicator values are like 100, 001, 111, 101, 100.
Which version of Cobrix are you using?
You can add
.option("pedantic", "true")
to ensure all passed options are recognized.
@yruslan This is the copybook:
* You may obtain a copy of the License at *
* *
* http://www.apache.org/licenses/LICENSE-2.0 *
* *
* Unless required by applicable law or agreed to in writing, software *
* distributed under the License is distributed on an "AS IS" BASIS, *
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. *
* See the License for the specific language governing permissions and *
* limitations under the License. *
* *
****************************************************************************
01 CUSTOMER-RECORD.
   05 SEGMENT-ID PIC X(3).
   05 DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID            PIC X(10).
         15 CUSTOMER-NAME          PIC X(30).
         15 CUSTOMER-ADDRESS       PIC X(50).
         15 CUSTOMER-PHONE-NUMBER  PIC X(15).
   05 DETAILS110 REDEFINES DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID            PIC X(10).
         15 CUSTOMER-NAME          PIC X(30).
         15 CUSTOMER-ADDRESS       PIC X(50).
         15 CUSTOMER-PHONE-NUMBER  PIC X(15).
      10 ACCOUNT-INFORMATION.
         15 ACCOUNT-NUMBER         PIC X(10).
         15 ACCOUNT-TYPE           PIC X(2).
         15 ACCOUNT-BALANCE        PIC X(12).
   05 DETAILS101 REDEFINES DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID            PIC X(10).
         15 CUSTOMER-NAME          PIC X(30).
         15 CUSTOMER-ADDRESS       PIC X(50).
         15 CUSTOMER-PHONE-NUMBER  PIC X(15).
      10 TRANSACTION-HISTORY.
         15 TRANSACTION-ID         PIC X(10).
         15 TRANSACTION-DATE       PIC X(8).
         15 TRANSACTION-AMOUNT     PIC X(12).
         15 TRANSACTION-TYPE       PIC X(2).
   05 DETAILS111 REDEFINES DETAILS100.
      10 CUSTOMER-DETAILS.
         15 CUSTOMER-ID            PIC X(10).
         15 CUSTOMER-NAME          PIC X(30).
         15 CUSTOMER-ADDRESS       PIC X(50).
         15 CUSTOMER-PHONE-NUMBER  PIC X(15).
      10 ACCOUNT-INFORMATION.
         15 ACCOUNT-NUMBER         PIC X(10).
         15 ACCOUNT-TYPE           PIC X(2).
         15 ACCOUNT-BALANCE        PIC X(12).
      10 TRANSACTION-HISTORY.
         15 TRANSACTION-ID         PIC X(10).
         15 TRANSACTION-DATE       PIC X(8).
         15 TRANSACTION-AMOUNT     PIC X(12).
         15 TRANSACTION-TYPE       PIC X(2).
Read code:
df = (spark.read.format("cobol")
      .option("copybook", "/mnt/idfprodappdata/x/xx/SCHEMA/T_3_4.cob")
      .option("record_format", "F")
      .option("record_length_field", "SEGMENT_ID")
      .option("record_length_map", """{"001":50,"010":30,"100":20,"011":80,"110":50,"101":70,"111":100}""")
      .option("segment_field", "SEGMENT-ID")
      .option("redefine-segment-id-map:2", "DETAILS100 => 100")
      .option("redefine-segment-id-map:4", "DETAILS110 => 110")
      .option("redefine-segment-id-map:5", "DETAILS101 => 101")
      .option("redefine-segment-id-map:6", "DETAILS111 => 111")
      .option("pedantic", "true")
      .load("/mnt/idfprodappdata/x/xx/DATA/customer_data_file_V2.dat"))
Getting error: IllegalArgumentException: Redundant or unrecognized option(s) to 'spark-cobol': record_length_map.
Version: za.co.absa.cobrix:spark-cobol_2.12:2.6.9
@yruslan I also tried to read the above file using OCCURS but was getting a syntax error.
Version: za.co.absa.cobrix:spark-cobol_2.12:2.6.9

The record_length_map option was added in more recent versions of Cobrix; try 2.7.2.

Getting error: IllegalArgumentException: Redundant or unrecognized option(s) to 'spark-cobol': record_length_map.

This confirms that you need to update to 2.7.2 in order to use this option.
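For example, on a plain Spark session the newer artifact can be pulled in through spark.jars.packages (a sketch; on Databricks you would attach the library to the cluster instead):

from pyspark.sql import SparkSession

# Use a Cobrix release that includes the record_length_map option (2.7.x+).
spark = (SparkSession.builder
         .config("spark.jars.packages", "za.co.absa.cobrix:spark-cobol_2.12:2.7.2")
         .getOrCreate())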
@yruslan I have updated the version but am getting:
NumberFormatException: For input string: ""
@yruslan Hi, I have updated the version to za.co.absa.cobrix:spark-cobol_2.12:2.7.3.
I had also tried OCCURS, but that gives me a syntax error:
"Py4JJavaError: An error occurred while calling o493.load.
: za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 13: Invalid input 'EGMENT-INDICATORS' at position 13:6"
* You may obtain a copy of the License at *
* *
* http://www.apache.org/licenses/LICENSE-2.0 *
* *
* Unless required by applicable law or agreed to in writing, software *
* distributed under the License is distributed on an "AS IS" BASIS, *
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. *
* See the License for the specific language governing permissions and *
* limitations under the License. *
* *
****************************************************************************
01 CUSTOMER-RECORD.
   10 SEGMENT-INDICATORS.
      15 CUSTOMER-DETAILS-PRESENT      PIC 9(1).
      15 ACCOUNT-INFORMATION-PRESENT   PIC 9(1).
      15 TRANSACTION-HISTORY-PRESENT   PIC 9(1).
01 CUSTOMER-DETAILS-TAB.
   10 CUST-TAB OCCURS 1 TO 2 TIMES DEPENDING ON CUSTOMER-DETAILS-PRESENT.
      15 CUSTOMER-ID                   PIC X(10).
      15 CUSTOMER-NAME                 PIC X(30).
      15 CUSTOMER-ADDRESS              PIC X(50).
      15 CUSTOMER-PHONE-NUMBER         PIC X(15).
01 ACCOUNT-INFORMATION-TAB.
   10 ACCT-INFO-TAB OCCURS 1 TO 2 TIMES DEPENDING ON ACCOUNT-INFORMATION-PRESENT.
      15 ACCOUNT-NUMBER                PIC X(10).
      15 ACCOUNT-TYPE                  PIC X(2).
      15 ACCOUNT-BALANCE               PIC X(12).
01 TRANSACTION-HISTORY-TAB.
   10 TRANS-TAB OCCURS 1 TO 2 TIMES DEPENDING ON TRANSACTION-HISTORY-PRESENT.
      15 TRANSACTION-ID                PIC X(10).
      15 TRANSACTION-DATE              PIC X(8).
      15 TRANSACTION-AMOUNT            PIC X(12).
      15 TRANSACTION-TYPE              PIC X(2).
The error message is due to padding of the copybook with spaces. Please fix the padding by making sure the first 6 characters of each line are part of a comment, or use these options to specify the padding for your copybook:
https://github.com/AbsaOSS/cobrix?tab=readme-ov-file#copybook-parsing-options
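For instance, if the copybook cannot be re-padded, the comment-area handling can be relaxed via the parsing options documented at the link above (a sketch; option names and defaults should be checked against the README):

df = (spark.read.format("cobol")
      .option("copybook", "/path/to/copybook.cob")
      .option("record_format", "F")
      .option("truncate_comments", "false")  # do not treat columns 1-6 and 73+ as comments
      .load("/path/to/data.dat"))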
@yruslan Regarding the above: I have updated the version but am getting:
NumberFormatException: For input string: ""
@yruslan regarding -->I have updated the version but getting : NumberFormatException: For input string: ""
Please post the full stack trace - it is hard to tell what is causing the error.
Also, make sure the segment-value-to-record-length map that you pass to record_length_map is valid JSON, and that the record lengths are correct. The values I posted were just examples of the feature.
An example of a valid JSON to pass to the record length mapping:
{
  "100": 20,
  "101": 70,
  "110": 50,
  "111": 100,
  "001": 50,
  "010": 30,
  "011": 80
}
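To avoid hand-writing the JSON, the map can be built in Python and serialized, for example (the lengths here are still the illustrative values, not real sizes; paths are placeholders):

import json

# Segment-indicator value -> record length in bytes (example values only).
record_lengths = {"100": 20, "101": 70, "110": 50, "111": 100,
                  "001": 50, "010": 30, "011": 80}

df = (spark.read.format("cobol")
      .option("copybook", "/path/to/copybook.cob")
      .option("record_format", "F")
      .option("record_length_field", "SEGMENT-ID")
      .option("record_length_map", json.dumps(record_lengths))  # guaranteed-valid JSON
      .load("/path/to/data.dat"))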
@yruslan Stack trace:
NumberFormatException: For input string: ""
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5) (192.223.255.11 executor 1): java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)
at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.$anonfun$fetchRecordUsingRecordLengthFieldExpression$1(VRLRecordReader.scala:200)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.$anonfun$fetchRecordUsingRecordLengthFieldExpression$1$adapted(VRLRecordReader.scala:195)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:193)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.fetchRecordUsingRecordLengthFieldExpression(VRLRecordReader.scala:195)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.fetchNext(VRLRecordReader.scala:98)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.next(VRLRecordReader.scala:75)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.fetchNext(VarLenNestedIterator.scala:84)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.next(VarLenNestedIterator.scala:74)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.next(VarLenNestedIterator.scala:43)
at za.co.absa.cobrix.spark.cobol.reader.VarLenNestedReader$RowIterator.next(VarLenNestedReader.scala:43)
at za.co.absa.cobrix.spark.cobol.reader.VarLenNestedReader$RowIterator.next(VarLenNestedReader.scala:40)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
at com.google.common.collect.Iterators$PeekingImpl.hasNext(Iterators.java:1139)
at com.databricks.photon.NativeRowBatchIterator.hasNext(NativeRowBatchIterator.java:44)
at 0xc246792 .HasNext(external/workspace_spark_3_5/photon/jni-wrappers/jni-row-batch-iterator.cc:50)
at com.databricks.photon.JniApiImpl.hasNext(Native Method)
at com.databricks.photon.JniApi.hasNext(JniApi.scala)
at com.databricks.photon.JniExecNode.hasNext(JniExecNode.java:76)
at com.databricks.photon.BasePhotonResultHandler$$anon$1.hasNext(PhotonExec.scala:891)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.$anonfun$hasNext$1(PhotonBasicEvaluatorFactory.scala:228)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.photon.PhotonResultHandler.timeit(PhotonResultHandler.scala:30)
at com.databricks.photon.PhotonResultHandler.timeit$(PhotonResultHandler.scala:28)
at com.databricks.photon.BasePhotonResultHandler.timeit(PhotonExec.scala:878)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.hasNext(PhotonBasicEvaluatorFactory.scala:228)
at com.databricks.photon.CloseableIterator$$anon$10.hasNext(CloseableIterator.scala:211)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$1(UnsafeRowBatchUtils.scala:68)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$2(Collector.scala:214)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:190)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:155)
at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:129)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:149)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:101)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$10(Executor.scala:1013)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:106)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:1016)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:903)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.$anonfun$failJobAndIndependentStages$1(DAGScheduler.scala:3910)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3908)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3822)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3809)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3809)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1680)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1665)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1665)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:4157)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:4069)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:4057)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:55)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$runJob$1(DAGScheduler.scala:1329)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1317)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:3034)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$runSparkJobs$1(Collector.scala:355)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:299)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$collect$1(Collector.scala:384)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:381)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:122)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:131)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:94)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:90)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:78)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$computeResult$1(ResultCacheManager.scala:552)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.collectResult$1(ResultCacheManager.scala:546)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.computeResult(ResultCacheManager.scala:563)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:400)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:399)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:318)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeCollectResult$1(SparkPlan.scala:560)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94)
at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:557)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3840)
at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3831)
at org.apache.spark.sql.Dataset.$anonfun$withAction$3(Dataset.scala:4803)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:1152)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4801)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$9(SQLExecution.scala:398)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:713)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:278)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1180)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:165)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:650)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4801)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3830)
at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:324)
at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:100)
at com.databricks.backend.daemon.driver.PythonDriverLocalBase.generateTableResult(PythonDriverLocalBase.scala:856)
at com.databricks.backend.daemon.driver.JupyterDriverLocal.computeListResultsItem(JupyterDriverLocal.scala:1451)
at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.addCustomDisplayData(JupyterDriverLocal.scala:283)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike.toInt(StringLike.scala:304)
at scala.collection.immutable.StringLike.toInt$(StringLike.scala:304)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:33)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.$anonfun$fetchRecordUsingRecordLengthFieldExpression$1(VRLRecordReader.scala:200)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.$anonfun$fetchRecordUsingRecordLengthFieldExpression$1$adapted(VRLRecordReader.scala:195)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:193)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.fetchRecordUsingRecordLengthFieldExpression(VRLRecordReader.scala:195)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.fetchNext(VRLRecordReader.scala:98)
at za.co.absa.cobrix.cobol.reader.iterator.VRLRecordReader.next(VRLRecordReader.scala:75)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.fetchNext(VarLenNestedIterator.scala:84)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.next(VarLenNestedIterator.scala:74)
at za.co.absa.cobrix.cobol.reader.iterator.VarLenNestedIterator.next(VarLenNestedIterator.scala:43)
at za.co.absa.cobrix.spark.cobol.reader.VarLenNestedReader$RowIterator.next(VarLenNestedReader.scala:43)
at za.co.absa.cobrix.spark.cobol.reader.VarLenNestedReader$RowIterator.next(VarLenNestedReader.scala:40)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
at com.google.common.collect.Iterators$PeekingImpl.hasNext(Iterators.java:1139)
at com.databricks.photon.NativeRowBatchIterator.hasNext(NativeRowBatchIterator.java:44)
at 0xc246792 .HasNext(external/workspace_spark_3_5/photon/jni-wrappers/jni-row-batch-iterator.cc:50)
at com.databricks.photon.JniApiImpl.hasNext(Native Method)
at com.databricks.photon.JniApi.hasNext(JniApi.scala)
at com.databricks.photon.JniExecNode.hasNext(JniExecNode.java:76)
at com.databricks.photon.BasePhotonResultHandler$$anon$1.hasNext(PhotonExec.scala:891)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.$anonfun$hasNext$1(PhotonBasicEvaluatorFactory.scala:228)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.photon.PhotonResultHandler.timeit(PhotonResultHandler.scala:30)
at com.databricks.photon.PhotonResultHandler.timeit$(PhotonResultHandler.scala:28)
at com.databricks.photon.BasePhotonResultHandler.timeit(PhotonExec.scala:878)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.hasNext(PhotonBasicEvaluatorFactory.scala:228)
at com.databricks.photon.CloseableIterator$$anon$10.hasNext(CloseableIterator.scala:211)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$1(UnsafeRowBatchUtils.scala:68)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$2(Collector.scala:214)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:190)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:155)
at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:129)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:149)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:101)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$10(Executor.scala:1013)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:106)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:1016)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:903)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Looks like some of your options might be incorrect. Use:
.option("pedantic", "true")
to reveal the incorrect option. Also, please share all the options you are passing to spark-cobol.
@yruslan
df = (spark.read.format("cobol")
      .option("copybook", "/mnt/idfprodappdata/x/xx/SCHEMA/T_3_4.cob")
      .option("record_format", "F")
      .option("record_length_field", "SEGMENT_ID")
      .option("record_length_map", """{"001":50,"010":30,"100":20,"011":80,"110":50,"101":70,"111":100}""")
      .option("segment_field", "SEGMENT-ID")
      .option("redefine-segment-id-map:2", "DETAILS100 => 100")
      .option("redefine-segment-id-map:4", "DETAILS110 => 110")
      .option("redefine-segment-id-map:5", "DETAILS101 => 101")
      .option("redefine-segment-id-map:6", "DETAILS111 => 111")
      .option("pedantic", "true")
      .load("/mnt/idfprodappdata/x/xx/DATA/customer_data_file_V2.dat"))
df.display()
The options look good, with the exception of the JSON you are passing to record_length_map. As I said, I just provided an example; it is up to you to figure out the record sizes for each of the cases.
I can help you if you send the layout position table that is printed in the log when you use spark-cobol.
UPDATE: You can also try
.option("record_length_field", "SEGMENT-ID")
instead of
.option("record_length_field", "SEGMENT_ID")
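As an illustration of how those sizes could be derived from the full copybook posted earlier (an unverified sketch: CUSTOMER-DETAILS is 10+30+50+15 = 105 bytes, ACCOUNT-INFORMATION is 10+2+12 = 24 bytes, TRANSACTION-HISTORY is 10+8+12+2 = 32 bytes, plus 3 bytes for SEGMENT-ID):

# Sketch: compute a record_length_map candidate from the copybook field widths.
SEGMENT_ID_LEN = 3       # PIC X(3)
CUSTOMER_DETAILS = 105   # 10 + 30 + 50 + 15
ACCOUNT_INFO = 24        # 10 + 2 + 12
TRANSACTIONS = 32        # 10 + 8 + 12 + 2

sizes = {}
for c in (CUSTOMER_DETAILS, 0):      # first indicator: customer details
    for a in (ACCOUNT_INFO, 0):      # second indicator: account information
        for t in (TRANSACTIONS, 0):  # third indicator: transaction history
            key = f"{int(c > 0)}{int(a > 0)}{int(t > 0)}"
            sizes[key] = SEGMENT_ID_LEN + c + a + t
sizes.pop("000", None)               # no record has all segments absent
print(sizes)   # {'111': 164, '110': 132, '101': 140, '100': 108, ...}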
I had also tried occurs but that is giving me syntax error as : "Py4JJavaError: An error occurred while calling o493.load. : za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 13: Invalid input 'EGMENT-INDICATORS' at position 13:6"
01 CUSTOMER-RECORD.
   10 SEGMENT-INDICATORS.
      15 CUSTOMER-DETAILS-PRESENT      PIC 9(1).
      15 ACCOUNT-INFORMATION-PRESENT   PIC 9(1).
      15 TRANSACTION-HISTORY-PRESENT   PIC 9(1).
01 CUSTOMER-DETAILS-TAB.
   10 CUST-TAB OCCURS 1 TO 2 TIMES DEPENDING ON CUSTOMER-DETAILS-PRESENT.
      15 CUSTOMER-ID                   PIC X(10).
      15 CUSTOMER-NAME                 PIC X(30).
      15 CUSTOMER-ADDRESS              PIC X(50).
      15 CUSTOMER-PHONE-NUMBER         PIC X(15).
01 ACCOUNT-INFORMATION-TAB.
   10 ACCT-INFO-TAB OCCURS 1 TO 2 TIMES DEPENDING ON ACCOUNT-INFORMATION-PRESENT.
      15 ACCOUNT-NUMBER                PIC X(10).
      15 ACCOUNT-TYPE                  PIC X(2).
      15 ACCOUNT-BALANCE               PIC X(12).
01 TRANSACTION-HISTORY-TAB.
   10 TRANS-TAB OCCURS 1 TO 2 TIMES DEPENDING ON TRANSACTION-HISTORY-PRESENT.
      15 TRANSACTION-ID                PIC X(10).
      15 TRANSACTION-DATE              PIC X(8).
      15 TRANSACTION-AMOUNT            PIC X(12).
      15 TRANSACTION-TYPE              PIC X(2).
This is also a possible solution, and it is more elegant than what I proposed. I think if you just fix the padding of the copybook, it might work (e.g., add 4 leading spaces to each line).
* You may obtain a copy of the License at *
* *
* http://www.apache.org/licenses/LICENSE-2.0 *
* *
* Unless required by applicable law or agreed to in writing, software *
* distributed under the License is distributed on an "AS IS" BASIS, *
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. *
* See the License for the specific language governing permissions and *
* limitations under the License. *
* *
****************************************************************************
01 CUSTOMER-RECORD.
   10 SEGMENT-INDICATORS.
      15 CUSTOMER-DETAILS-PRESENT      PIC 9(1).
      15 ACCOUNT-INFORMATION-PRESENT   PIC 9(1).
      15 TRANSACTION-HISTORY-PRESENT   PIC 9(1).
01 CUSTOMER-DETAILS-TAB.
   10 CUST-TAB OCCURS 1 TO 2 TIMES DEPENDING ON CUSTOMER-DETAILS-PRESENT.
      15 CUSTOMER-ID                   PIC X(10).
      15 CUSTOMER-NAME                 PIC X(30).
      15 CUSTOMER-ADDRESS              PIC X(50).
      15 CUSTOMER-PHONE-NUMBER         PIC X(15).
01 ACCOUNT-INFORMATION-TAB.
   10 ACCT-INFO-TAB OCCURS 1 TO 2 TIMES DEPENDING ON ACCOUNT-INFORMATION-PRESENT.
      15 ACCOUNT-NUMBER                PIC X(10).
      15 ACCOUNT-TYPE                  PIC X(2).
      15 ACCOUNT-BALANCE               PIC X(12).
01 TRANSACTION-HISTORY-TAB.
   10 TRANS-TAB OCCURS 1 TO 2 TIMES DEPENDING ON TRANSACTION-HISTORY-PRESENT.
      15 TRANSACTION-ID                PIC X(10).
      15 TRANSACTION-DATE              PIC X(8).
      15 TRANSACTION-AMOUNT            PIC X(12).
      15 TRANSACTION-TYPE              PIC X(2).
@yruslan: I have fixed the padding, but it is still giving a syntax error.
Py4JJavaError: An error occurred while calling o409.load.
: za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException: Syntax error in the copybook at line 19: Invalid input '15' at position 19:15
at za.co.absa.cobrix.cobol.parser.antlr.ThrowErrorStrategy.recover(ANTLRParser.scala:38)
at za.co.absa.cobrix.cobol.parser.antlr.copybookParser.item(copybookParser.java:3305)
at za.co.absa.cobrix.cobol.parser.antlr.copybookParser.main(copybookParser.java:215)
at za.co.absa.cobrix.cobol.parser.antlr.ANTLRParser$.parse(ANTLRParser.scala:85)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.parseTree(CopybookParser.scala:282)
at za.co.absa.cobrix.cobol.reader.schema.CobolSchema$.fromReaderParameters(CobolSchema.scala:108)
at za.co.absa.cobrix.cobol.reader.VarLenNestedReader.loadCopyBook(VarLenNestedReader.scala:202)
at za.co.absa.cobrix.cobol.reader.VarLenNestedReader.&lt;init&gt;(VarLenNestedReader.scala:52)
at za.co.absa.cobrix.spark.cobol.reader.VarLenNestedReader.&lt;init&gt;(VarLenNestedReader.scala:37)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createVariableLengthReader(DefaultSource.scala:112)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.buildEitherReader(DefaultSource.scala:76)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:56)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:44)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:398)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:392)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:348)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:348)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
at java.lang.Thread.run(Thread.java:750)
File , line 1
----> 1 df = spark.read.format("cobol").option("copybook", "/mnt/idfprodappdata/x/xx/SCHEMA/T_3_2.cob").option("record_format", "F").option("variable_size_occurs", True).option("variable_size_occurs", "true").load("/mnt/idfprodappdata/x/xx/DATA/data_test25_occurs.dat")
2 df.display()
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
I think you fixed only the padding at the beginning of each line, but not at the end.
In copybooks, characters after position 72 are ignored.
Your copybook should look like this (including spaces):
       01 CUSTOMER-RECORD.
          10 SEGMENT-INDICATORS.
             15 CUSTOMER-DETAILS-PRESENT    PIC 9(1).
             15 ACCOUNT-INFORMATION-PRESENT PIC 9(1).
             15 TRANSACTION-HISTORY-PRESENT PIC 9(1).
       01 CUSTOMER-DETAILS-TAB.
          10 CUST-TAB OCCURS 1 TO 2 TIMES
             DEPENDING ON CUSTOMER-DETAILS-PRESENT.
             15 CUSTOMER-ID                 PIC X(10).
             15 CUSTOMER-NAME               PIC X(30).
             15 CUSTOMER-ADDRESS            PIC X(50).
             15 CUSTOMER-PHONE-NUMBER       PIC X(15).
       01 ACCOUNT-INFORMATION-TAB.
          10 ACCT-INFO-TAB OCCURS 1 TO 2 TIMES
             DEPENDING ON ACCOUNT-INFORMATION-PRESENT.
             15 ACCOUNT-NUMBER              PIC X(10).
             15 ACCOUNT-TYPE                PIC X(2).
             15 ACCOUNT-BALANCE             PIC X(12).
       01 TRANSACTION-HISTORY-TAB.
          10 TRANS-TAB OCCURS 1 TO 2 TIMES
             DEPENDING ON TRANSACTION-HISTORY-PRESENT.
             15 TRANSACTION-ID              PIC X(10).
             15 TRANSACTION-DATE            PIC X(8).
             15 TRANSACTION-AMOUNT          PIC X(12).
             15 TRANSACTION-TYPE            PIC X(2).
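With the copybook padded like that, the OCCURS-based read attempted earlier in the thread should apply unchanged; for reference, a minimal sketch with placeholder paths, mirroring the options used above:

df = (spark.read.format("cobol")
      .option("copybook", "/path/to/fixed_copybook.cob")
      .option("record_format", "F")
      .option("variable_size_occurs", "true")  # honour OCCURS ... DEPENDING ON sizes
      .option("pedantic", "true")              # fail fast on unrecognized options
      .load("/path/to/data.dat"))
df.display()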