#723 Add the ability to generate _corrupt_fields column containing information on fields that Cobrix failed to decode
#820
Conversation
- Introduced `generate_corrupted_fields` option to append a `_corrupted_fields` array to the schema, containing field names and raw values of fields that could not be decoded.
- Updated relevant methods in `RecordExtractors` to handle corrupted fields.
- Enhanced documentation and README to reflect the new feature.
…ull fields, improve performance of generating corrupted fields slightly.
…d memory wise), fix `PIC X USAGE COMP` cases.
…and flexibility in schema creation.
Walkthrough
Adds a new `generate_corrupt_fields` option that captures fields failing to decode by storing their field name and raw bytes into a `_corrupt_fields` array column. A usage sketch follows the sequence diagram below.
Sequence Diagram
sequenceDiagram
participant Client as Spark Client
participant Extractor as RecordExtractors
participant Primitive as Primitive
participant Output as Output Row
Client->>Extractor: extractRecord(..., generateCorruptFields=true)
activate Extractor
Extractor->>Extractor: init corruptedFields buffer
loop for each field
Extractor->>Primitive: getRawValue(offset)
activate Primitive
Primitive-->>Extractor: rawBytes or null
alt rawBytes present and decode succeeds
Extractor-->>Extractor: use decoded value
else rawBytes present but decode returns null
Extractor->>Extractor: create CorruptField(fieldName, rawBytes)
Extractor-->>Extractor: add to corruptedFields
end
deactivate Primitive
end
alt generateCorruptFields == true
Extractor->>Extractor: build _corrupt_fields array from buffer
Extractor->>Output: append _corrupt_fields to Row
end
Extractor-->>Client: Row (with _corrupt_fields if enabled)
deactivate Extractor
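To make the user-facing flow concrete, here is a minimal, hedged usage sketch. The option and column names (`generate_corrupt_fields`, `_corrupt_fields`, `field_name`, `raw_value`) follow this PR; the copybook contents and file paths below are placeholders, not part of the change.

```scala
// Hypothetical usage sketch: the copybook and the input path are placeholders.
val copybook =
  """       01  RECORD.
    |           05  ID      PIC 9(4)  COMP.
    |           05  AMOUNT  PIC S9(7)V99 COMP-3.
    |""".stripMargin

val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybook)
  .option("generate_corrupt_fields", "true")  // the option introduced by this PR
  .load("/path/to/mainframe/data")

// Each row carries a `_corrupt_fields` array; every element names a field that
// failed to decode together with its raw bytes.
df.selectExpr("ID", "explode_outer(_corrupt_fields) AS cf")
  .selectExpr("ID", "cf.field_name", "cf.raw_value")
  .show(false)
```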
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
JaCoCo code coverage report - 'cobol-parser'
JaCoCo code coverage report - 'spark-cobol'
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (1)
43-43: ⚠️ Potential issue | 🟡 Minor. Typo in Scaladoc: "pecifies" → "Specifies".
Proposed fix
- * @param schemaRetentionPolicy pecifies a policy to transform the input schema. The default policy is to keep the schema exactly as it is in the copybook.
+ * @param schemaRetentionPolicy Specifies a policy to transform the input schema. The default policy is to keep the schema exactly as it is in the copybook.
🤖 Fix all issues with AI agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`:
- Around line 125-156: The isEmpty method currently treats byte 0x40 as an
empty-space unconditionally; update isEmpty to be encoding-aware by checking the
field's encoding (use dataType.enc on AlphaNumeric fields) and only treating
0x40 as an empty byte when the encoding is EBCDIC, while treating 0x20 as the
space for ASCII-encoded fields; retain the existing zero (0x00) check and fall
back to the current behavior for unknown encodings. Modify the loop in isEmpty
(and the earlier idx bounds logic using binaryProperties.dataSize) to build the
set of “empty” byte values based on dataType.enc and use that set to decide
emptiness.
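For illustration only, the encoding-aware selection this prompt describes could be sketched as below. It assumes the `dataType` member of `Primitive` and the parser's `EBCDIC`/`ASCII` encoding types already visible in this file; the helper names are invented for the sketch and are not the actual implementation.

```scala
// Sketch only: pick the byte values treated as "empty" based on the field encoding.
private def emptyByteValues: Set[Byte] = dataType match {
  case a: AlphaNumeric =>
    a.enc match {
      case Some(EBCDIC) => Set(0x00.toByte, 0x40.toByte) // EBCDIC space
      case Some(ASCII)  => Set(0x00.toByte, 0x20.toByte) // ASCII space
      case _            => Set(0x00.toByte)              // unknown encoding: null bytes only
    }
  case _ => Set(0x00.toByte)
}

// Sketch of the loop body in isEmpty: a field counts as empty when every byte in
// its slice belongs to the "empty" set for its encoding.
private def isSliceEmpty(record: Array[Byte], startIdx: Int, endIdx: Int): Boolean =
  (startIdx until endIdx).forall(i => emptyByteValues.contains(record(i)))
```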
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`:
- Around line 435-436: The hierarchical extractor call in
extractHierarchicalRecord always passes generateCorruptedFields = false and
corruptedFields = null to applyRecordPostProcessing, so users enabling
corrupted-fields support for hierarchical/multisegment data get no output;
update extractHierarchicalRecord to detect when the user option for corrupted
fields is true and either (a) propagate the flag and a proper corruptedFields
collection into applyRecordPostProcessing if you implement support, or (b) emit
a clear warning via the existing logging mechanism (e.g., logger.warn) that
corrupted-fields are not supported for hierarchical/multisegment extraction and
will be ignored; reference extractHierarchicalRecord and
applyRecordPostProcessing so the change is made at that call site.
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala`:
- Line 50: The ScalaDoc for the parameter generateCorruptFields is referencing
the wrong field name "_corrupt_fields"; update the comment to the correct field
name "_corrupted_fields" so the doc matches the actual option and field used
elsewhere (search for generateCorruptFields and any references to
_corrupted_fields/_corrupt_fields to confirm consistency).
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala`:
- Around line 898-901: The current check uses
params.contains(PARAM_CORRUPTED_FIELDS) which treats an explicit "false" value
the same as presence and wrongly throws; update the condition in the
CobolParametersParser where PARAM_CORRUPTED_FIELDS is checked so it only throws
when the option is explicitly set to true (e.g., inspect params.get/PARAM value
or use params.getBoolean-like logic on the params map) instead of just checking
presence; change the check around PARAM_CORRUPTED_FIELDS and the throw in that
block to allow explicit generate_corrupted_fields=false while still blocking
true when used with segment-children:*.
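A short sketch of the intended validation follows; the function name is illustrative, `params` is assumed to behave like a `Map[String, String]`, and only the constant `PARAM_CORRUPTED_FIELDS` is taken from the surrounding discussion.

```scala
// Sketch only: reject the combination only when the option is explicitly "true",
// so generate_corrupted_fields=false remains allowed with segment-children.
def validateCorruptFieldsOption(params: Map[String, String]): Unit = {
  val corruptFieldsEnabled =
    params.get(PARAM_CORRUPTED_FIELDS).exists(_.trim.equalsIgnoreCase("true"))
  val hasSegmentChildren = params.keys.exists(_.startsWith("segment-children:"))

  if (corruptFieldsEnabled && hasSegmentChildren) {
    throw new IllegalArgumentException(
      s"Option '$PARAM_CORRUPTED_FIELDS' is not supported together with 'segment-children:*'.")
  }
}
```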
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala`:
- Line 61: The Scaladoc and parameter naming are inconsistent: update
ReaderParameters (parameter generateCorruptFields) to match the actual constant
and usage by renaming it to generateCorruptedFields (or vice-versa across the
codebase) and fix the Scaladoc text to refer to Constants.corruptedFieldsField
(`_corrupted_fields`); ensure RecordExtractors.extractRecord and any other
callers use the same name so the parameter and documentation are consistent.
In `@README.md`:
- Around line 600-603: The fenced code block showing
.option("generate_corrupted_fields", "true") lacks a language identifier, which
triggers markdownlint MD040; update the fenced block to add a language tag
(e.g., scala) immediately after the opening backticks so it reads ```scala and
retains the existing code snippet .option("generate_corrupted_fields", "true").
- Around line 1954-1960: The changelog heading for the 2.9.8 entry uses ####
which skips a level (MD001); update the heading text "2.9.8 will be released
soon." to use a triple-hash heading (###) instead of #### so it correctly
follows the prior `##` level; ensure the surrounding block (the bullet about
[`#723`] and the code example `.option("generate_corrupted_fields", "true")`)
remains unchanged and properly nested under the new `###` heading.
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala`:
- Line 47: Update the Scaladoc for the generateCorruptedFields parameter in
CobolSchema (parameter name: generateCorruptedFields) to state that corrupted
record fields are appended to the end of the schema (not the beginning); this
aligns the doc with the implementation that uses :+ to add the
`_corrupted_fields` field at the end (see construction around the code that
appends `_corrupted_fields`).
🧹 Nitpick comments (7)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/common/Constants.scala (1)
67-69: Constants defined but not referenced in `RecordExtractors.getCorruptedFieldsGroup`.
In RecordExtractors.scala (lines 535-538), the corrupted fields group is constructed using hardcoded string literals "field_name", "raw_value", and "_corrupted_fields" instead of referencing these constants. Either use the constants there or remove them here to avoid drift.
Suggested change in RecordExtractors.scala (lines 532-539):
private def getCorruptedFieldsGroup: Group = {
  val corruptedFieldsInGroup = new mutable.ArrayBuffer[Statement]
- corruptedFieldsInGroup += Primitive(15, "field_name", "field_name", 0, AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
- corruptedFieldsInGroup += Primitive(15, "raw_value", "raw_value", 0, AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)
+ corruptedFieldsInGroup += Primitive(15, Constants.fieldNameColumn, Constants.fieldNameColumn, 0, AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
+ corruptedFieldsInGroup += Primitive(15, Constants.rawValueColumn, Constants.rawValueColumn, 0, AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)
- Group(10, "_corrupted_fields", "_corrupted_fields", 0, children = corruptedFieldsInGroup, occurs = Some(10))(None)
+ Group(10, Constants.corruptedFieldsField, Constants.corruptedFieldsField, 0, children = corruptedFieldsInGroup, occurs = Some(10))(None)
}
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1)
125-156: Duplicated bounds-checking logic between `isEmpty` and `getRawValue`.
Both methods repeat the same offset validation and byte-range calculation (string vs. non-string, varchar trimming). Consider extracting a shared private helper that returns the valid (startIdx, endIdx) or null/None to reduce duplication.
Also applies to: 167-191
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala (1)
72-73: `ArrayBuffer` allocated even when `generateCorruptedFields` is `false`.
Line 73 unconditionally allocates a new ArrayBuffer[CorruptedField] for every record, even when the feature is disabled. Consider making it lazy or conditional.
Suggested fix:
- val corruptedFields = new ArrayBuffer[CorruptedField]
+ val corruptedFields = if (generateCorruptedFields) new ArrayBuffer[CorruptedField] else null
Then the null-guard at line 511 already covers this.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VarLenNestedIterator.scala (1)
95-110: Correct propagation of `generateCorruptFields` to `extractRecord`.
The wiring is correct. Note the naming inconsistency: `ReaderParameters.generateCorruptFields` maps to `RecordExtractors.extractRecord`'s `generateCorruptedFields` parameter. Not a bug, but the inconsistent naming (Corrupt vs Corrupted) across the API surface could cause confusion.
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test41CorruptedFieldsSpec.scala (1)
104-113: Consider using the `PARAM_CORRUPTED_FIELDS` constant for the option key.
`PARAM_CORRUPTED_FIELDS` is imported on line 21 but only used in the assertion at line 121. For consistency and to guard against the option key string drifting out of sync, consider using the constant here as well.
That said, using string literals in Spark option maps is a common test pattern (it validates the actual user-facing API surface), so this is a minor style preference.
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/CobolSchemaSpec.scala (1)
905-910: Inconsistent type assertion style.
Line 909 uses `isInstanceOf[StringType]` while all other tests in this file compare with `==` (e.g., `== StringType`, `== IntegerType`). For consistency, prefer the equality check.
Proposed fix:
- assert(sparkSchema.fields(2).dataType.isInstanceOf[StringType])
+ assert(sparkSchema.fields(2).dataType == StringType)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala (1)
147-147: Naming inconsistency: `generateCorruptFields` vs `generateCorruptedFields` across the codebase.
The parameter is defined as `generateCorruptFields` (no "ed") in `ReaderParameters` and `CobolParameters`, but `CobolSchema` and `RecordExtractors` use `generateCorruptedFields`. This inconsistency spans multiple files and creates confusing naming conventions. Consider aligning all usages to `generateCorruptedFields`, which is grammatically correct and already used in the public-facing option key `generate_corrupted_fields`.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala
applyRecordPostProcessing(ast, records.toList, effectiveSchemaRetentionPolicy, generateRecordId, generateRecordBytes = false, generateCorruptedFields = false, Nil, fileId, recordId, recordLength, Array.empty[Byte], generateInputFileField = generateInputFileField, inputFileName, null, handler)
}
Hierarchical record extraction does not support corrupted fields.
extractHierarchicalRecord hardcodes generateCorruptedFields = false and corruptedFields = null. If a user enables the option with hierarchical/multisegment data, they'll silently get no corrupted fields output. Consider logging a warning if the feature is enabled but unsupported, or documenting this limitation.
🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`
around lines 435 - 436, The hierarchical extractor call in
extractHierarchicalRecord always passes generateCorruptedFields = false and
corruptedFields = null to applyRecordPostProcessing, so users enabling
corrupted-fields support for hierarchical/multisegment data get no output;
update extractHierarchicalRecord to detect when the user option for corrupted
fields is true and either (a) propagate the flag and a proper corruptedFields
collection into applyRecordPostProcessing if you implement support, or (b) emit
a clear warning via the existing logging mechanism (e.g., logger.warn) that
corrupted-fields are not supported for hierarchical/multisegment extraction and
will be ignored; reference extractHierarchicalRecord and
applyRecordPostProcessing so the change is made at that call site.
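For illustration, option (b) from the comment above could look roughly like the sketch below at the `extractHierarchicalRecord` call site. The logger instance and the exact flag name are assumptions, not the actual implementation. As the maintainer notes below, option parsing already rejects this combination, so the warning may end up redundant.

```scala
// Sketch only: warn that the feature is ignored for hierarchical extraction.
if (generateCorruptFields) {
  logger.warn("generate_corrupt_fields is not supported for hierarchical " +
    "(segment-children) extraction and will be ignored.")
}
```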
If they set the option to true the option map fails validation and throws an exception. So for now this is good. But in the future we can enable this option for hierarchical record extractor as well.
@yruslan, Thank you for the clarification! I can see from the validation logic in CobolParametersParser that the combination of generate_corrupted_fields with hierarchical segment-children is properly prevented at option parsing time, throwing an IllegalArgumentException before extraction occurs. This makes the hardcoded false value in extractHierarchicalRecord safe and appropriate for the current implementation.
The validation approach is sound—failing fast during configuration rather than silently ignoring the option. Looking forward to seeing this feature extended to hierarchical extractors in the future when support is implemented.
✏️ Learnings added
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 820
File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala:435-436
Timestamp: 2026-02-11T08:01:45.131Z
Learning: In `cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`, the `extractHierarchicalRecord` method hardcodes `generateCorruptedFields = false` because validation in `CobolParametersParser` throws an `IllegalArgumentException` when `generate_corrupted_fields` is used with segment-children (hierarchical data), preventing users from enabling this unsupported combination.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
* @param variableLengthParams VariableLengthParameters containing the specifications for the consumption of variable-length Cobol records.
* @param variableSizeOccurs If true, OCCURS DEPENDING ON data size will depend on the number of elements
* @param generateRecordBytes Generate 'record_bytes' field containing raw bytes of the original record
* @param generateCorruptFields Generate '_corrupt_fields' field for fields that haven't converted successfully
Fix doc field name to _corrupted_fields.
Line 50 mentions _corrupt_fields, but the option and field name elsewhere are _corrupted_fields.
📝 Suggested fix
- * @param generateCorruptFields Generate '_corrupt_fields' field for fields that haven't converted successfully
+ * @param generateCorruptFields Generate '_corrupted_fields' field for fields that haven't converted successfully
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- * @param generateCorruptFields Generate '_corrupt_fields' field for fields that haven't converted successfully
+ * @param generateCorruptFields Generate '_corrupted_fields' field for fields that haven't converted successfully
🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala`
at line 50, The ScalaDoc for the parameter generateCorruptFields is referencing
the wrong field name "_corrupt_fields"; update the comment to the correct field
name "_corrupted_fields" so the doc matches the actual option and field used
elsewhere (search for generateCorruptFields and any references to
_corrupted_fields/_corrupt_fields to confirm consistency).
Which name for the field seems better: _corrupt_fields or _corrupted_fields?
@yruslan, great question! Here's my analysis of both options:
_corrupted_fields (current implementation):
- ✅ More grammatically complete (past participle: "fields that have been corrupted")
- ✅ More explicit and descriptive
- ✅ Commonly used pattern in data quality contexts (similar to "rejected_records", "failed_rows")
- ✅ Clearer that it refers to fields that experienced corruption during processing
_corrupt_fields:
- ✅ Shorter and more concise
- ✅ Better parallelism with the parameter name `generateCorruptFields`
- ⚠️ Could be ambiguous: "corrupt fields" might imply fields that are inherently corrupt vs. fields that failed during decoding
Recommendation: I'd lean toward _corrupted_fields for clarity and standard data quality terminology. The past participle form makes it unambiguous that these fields encountered corruption during the decoding process.
For consistency, you might also consider renaming the parameter to generateCorruptedFields to maintain parallelism, though generateCorruptFields is already quite clear.
What's your preference given the context of your users and how they'll interact with this feature?
I see, but still lean towards _corrupt_fields since spark-csv has a similar concept, generating a _corrupt_record field for rows it failed to decode
@yruslan, excellent point — Spark CSV indeed uses _corrupt_record as the standard naming convention for similar functionality. Aligning with that established pattern makes perfect sense for consistency across the Spark ecosystem and will be more intuitive for users already familiar with Spark's data quality features.
So to confirm: you'd like to change the implementation from _corrupted_fields to _corrupt_fields (rather than updating the doc comment), which means:
- Update `Constants.scala` from `corruptedFieldsField = "_corrupted_fields"` to `corruptedFieldsField = "_corrupt_fields"`
- Ensure all usages throughout the codebase reference the updated constant
- Keep the doc comment as-is (it already says `_corrupt_fields`)
This would align the feature naming with Spark's established conventions while maintaining internal consistency. The parameter name generateCorruptFields would also align nicely with the field name _corrupt_fields.
Would you like me to help identify all the files that would need updates for this change?
✏️ Learnings added
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 820
File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala:50-50
Timestamp: 2026-02-11T08:15:51.206Z
Learning: In the Cobrix project, the field name for corrupted/failed fields should be `_corrupt_fields` (not `_corrupted_fields`) to align with Apache Spark CSV's established naming convention which uses `_corrupt_record` for similar functionality, ensuring consistency across the Spark ecosystem.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 820
File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala:435-436
Timestamp: 2026-02-11T08:01:45.131Z
Learning: In `cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`, the `extractHierarchicalRecord` method hardcodes `generateCorruptedFields = false` because validation in `CobolParametersParser` throws an `IllegalArgumentException` when `generate_corrupted_fields` is used with segment-children (hierarchical data), preventing users from enabling this unsupported combination.
...-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
* @param fileEndOffset A number of bytes to skip at the end of each file
* @param generateRecordId If true, a record id field will be prepended to each record.
* @param generateRecordBytes Generate 'record_bytes' field containing raw bytes of the original record
* @param generateCorruptFields Generate '_corrupt_fields' field for fields that haven't converted successfully
Naming inconsistency: generateCorruptFields vs. generateCorruptedFields and doc typo.
- The Scaladoc (line 61) says `_corrupt_fields` but the actual column name in `Constants.corruptedFieldsField` is `_corrupted_fields`.
- This parameter is named `generateCorruptFields` while `RecordExtractors.extractRecord` names it `generateCorruptedFields`. Aligning these would reduce confusion.
Suggested doc fix
- * @param generateCorruptFields Generate '_corrupt_fields' field for fields that haven't converted successfully
+ * @param generateCorruptFields Generate '_corrupted_fields' field for fields that haven't converted successfully
Also applies to: 115-115
🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala`
at line 61, The Scaladoc and parameter naming are inconsistent: update
ReaderParameters (parameter generateCorruptFields) to match the actual constant
and usage by renaming it to generateCorruptedFields (or vice-versa across the
codebase) and fix the Scaladoc text to refer to Constants.corruptedFieldsField
(`_corrupted_fields`); ensure RecordExtractors.extractRecord and any other
callers use the same name so the parameter and documentation are consistent.
- #### 2.9.8 will be released soon.
- [#723](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate `corrupted_fields` field that contains field names and raw values
  of fields that Cobrix couldn't decode.
```scala
.option("generate_corrupted_fields", "true")
```
Fix heading level increment (MD001).
The changelog jumps from ## to ####. Use ### for the 2.9.8 entry.
📝 Suggested fix
-- #### 2.9.8 will be released soon.
+- ### 2.9.8 will be released soon.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- - #### 2.9.8 will be released soon.
- - [#723](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate `corrupted_fields` field that contains field names and raw values
-   of fields that Cobrix couldn't decode.
- ```scala
- .option("generate_corrupted_fields", "true")
- ```
+ - ### 2.9.8 will be released soon.
+ - [`#723`](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate `corrupted_fields` field that contains field names and raw values
+   of fields that Cobrix couldn't decode.
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 1954-1954: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4
(MD001, heading-increment)
🤖 Prompt for AI Agents
In `@README.md` around lines 1954 - 1960, The changelog heading for the 2.9.8
entry uses #### which skips a level (MD001); update the heading text "2.9.8 will
be released soon." to use a triple-hash heading (###) instead of #### so it
correctly follows the prior `##` level; ensure the surrounding block (the bullet
about [`#723`] and the code example `.option("generate_corrupted_fields",
"true")`) remains unchanged and properly nested under the new `###` heading.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`:
- Around line 155-163: Update the ScalaDoc for the isEmpty(itOffset: Int,
record: Array[Byte]): Boolean method to accurately describe which byte values
are treated as "space" for different encodings: state that ASCII bytes use 0x20,
EBCDIC bytes use 0x40, and unknown/fallback encodings treat space as 0x00 (in
addition to null bytes 0x00 being considered empty); reference the isEmpty
method in Primitive.scala so readers understand the documented behavior matches
the implementation.
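As an illustration of the requested wording (not the actual committed ScalaDoc), the documentation for `isEmpty` could read roughly as follows:

```scala
/**
  * Returns true if all bytes of the field at the given offset are "empty".
  *
  * A byte counts as empty when it is a null byte (0x00) or the space character
  * of the field's encoding: 0x20 for ASCII, 0x40 for EBCDIC. For unknown or
  * unsupported encodings only 0x00 is treated as empty.
  */
def isEmpty(itOffset: Int, record: Array[Byte]): Boolean = ???
```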
In `@README.md`:
- Around line 1955-1956: Update the changelog entry to reference the correct
field name _corrupt_fields (with leading underscore) instead of corrupt_fields,
and ensure the rest of the codebase consistently uses _corrupt_fields: search
for and rename any occurrences of corrupt_fields or _corrupted_fields to
_corrupt_fields (including the field definition and parameter in
CobolParameters.scala and any tests, serializers, or references that use the old
names), and update README.md, tests, and serialization/deserialization logic to
the new symbol to keep naming aligned with Spark conventions.
🧹 Nitpick comments (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala (1)
528-539: Prefer Constants for corrupt field names.
Avoids string drift now that canonical names live in `Constants`.
♻️ Suggested refactor:
- corruptFieldsInGroup += Primitive(15, "field_name", "field_name", 0, AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
- corruptFieldsInGroup += Primitive(15, "raw_value", "raw_value", 0, AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)
+ corruptFieldsInGroup += Primitive(15, Constants.fieldNameColumn, Constants.fieldNameColumn, 0, AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
+ corruptFieldsInGroup += Primitive(15, Constants.rawValueColumn, Constants.rawValueColumn, 0, AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)
- Group(10, "_corrupt_fields", "_corrupt_fields", 0, children = corruptFieldsInGroup, occurs = Some(10))(None)
+ Group(10, Constants.corruptFieldsField, Constants.corruptFieldsField, 0, children = corruptFieldsInGroup, occurs = Some(10))(None)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test41CorruptFieldsSpec.scala (1)
104-105: Use the imported `PARAM_CORRUPT_FIELDS` constant instead of the string literal.
`PARAM_CORRUPT_FIELDS` is already imported (line 21) and used in the assertion on line 121. Using the constant here too avoids the risk of the option name drifting out of sync.
♻️ Suggested fix:
- val df = getDataFrame(tmpFileName, Map("generate_corrupt_fields" -> "true"))
+ val df = getDataFrame(tmpFileName, Map(PARAM_CORRUPT_FIELDS -> "true"))
Same applies to line 118:
- getDataFrame("/tmp/dummy", Map("generate_corrupt_fields" -> "true", "segment-children:0" -> "COMPANY => DEPT,CUSTOMER"))
+ getDataFrame("/tmp/dummy", Map(PARAM_CORRUPT_FIELDS -> "true", "segment-children:0" -> "COMPANY => DEPT,CUSTOMER"))
744-911: Thorough builder test coverage.
The builder tests comprehensively exercise each `withX` method and their combinations. One minor nit: line 909 uses `isInstanceOf[StringType]` while the rest of this file consistently uses `== StringType` for type assertions (e.g., lines 542, 544, 627). Consider using `== StringType` for consistency.
♻️ Suggested fix:
- assert(sparkSchema.fields(2).dataType.isInstanceOf[StringType])
+ assert(sparkSchema.fields(2).dataType == StringType)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`:
- Line 174: Update the two typos in the comments inside Primitive.scala: change
"should exactly fix the required bytes" to "should exactly fit the required
bytes" for the comment that starts "Non-string field size..." and apply the same
replacement for the second occurrence around line 216; locate these comments in
the Primitive.scala source (search for the phrase "should exactly fix the
required bytes") and edit the text accordingly.
🧹 Nitpick comments (5)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1)
66-102: Minor duplication in `spaceChar` initialization between `Integral` and `Decimal` branches.
The `None` compact case for both `Integral` and `Decimal` (lines 80-86 and 92-98) is identical. Not urgent, but could be deduplicated if this logic grows.
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/CobolSchemaSpec.scala (1)
872-887: Consider adding deeper structural assertions for the `_corrupt_fields` array element type.
The multi-segment test at line 300 validates the inner struct fields (`field_name`, `raw_value`) of the `_corrupt_fields` array, but this builder test only checks the top-level name and `ArrayType`. Adding a similar inner-struct assertion here would guard against regressions in the element schema.
Suggested addition:
  assert(sparkSchema.fields(1).name == "_corrupt_fields")
  assert(sparkSchema.fields(1).dataType.isInstanceOf[ArrayType])
+ val corruptFieldsType = sparkSchema.fields(1).dataType.asInstanceOf[ArrayType]
+ val elementType = corruptFieldsType.elementType.asInstanceOf[StructType]
+ assert(elementType.fields.map(_.name).toSet == Set("field_name", "raw_value"))
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala (2)
114-116: Minor style: missing space after comma in `getRawValue` calls.
Lines 115 and 142 both have `s.getRawValue(offset,data)` / `st.getRawValue(useOffset,data)` without a space after the comma, inconsistent with the rest of the codebase (e.g., `s.decodeTypeValue(offset, data)` on line 113).
Proposed fix:
- corruptFields += CorruptField(s"${field.name}[$i]", s.getRawValue(offset,data))
+ corruptFields += CorruptField(s"${field.name}[$i]", s.getRawValue(offset, data))
- corruptFields += CorruptField(field.name, st.getRawValue(useOffset,data))
+ corruptFields += CorruptField(field.name, st.getRawValue(useOffset, data))
533-540: Synthetic Group for corrupt fields schema uses hardcoded `occurs = Some(10)` and `decode = null`.
The `occurs = Some(10)` on line 539 is arbitrary since the actual array size is determined at runtime. If this Group is ever used for size calculations (e.g., `binaryProperties.actualSize`), the hardcoded 10 could produce incorrect results. Similarly, `decode = null` and `encode = null` on the Primitives (lines 536–537) would NPE if anyone tries to decode through these synthetic nodes.
Since this Group is only used as a structural schema hint for `handler.create` (line 518), this is likely safe today. But it's fragile; consider adding a comment clarifying this Group must never be used for decoding or size computation.
36-42: Test data comments could be more precise for record 3.
Line 39's comment says "COMP invalid value (negative)" but the expected output shows `F3 = 61702` (decoded successfully, not flagged as corrupt). The field `F1 = 0x00` is the one that decodes to null here. Consider updating the comment to clarify what's actually being tested in this record: it seems to test that an out-of-PIC-range COMP value is still decoded (not flagged) and that a null byte for alphanumeric is treated as empty (not corrupt).
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala
Branch force-pushed from 91a8caa to 8a835e5.
Title changed from "Add the ability to generate _corrupted_fields column containing information on fields that Cobrix failed to decode" to "Add the ability to generate _corrupt_fields column containing information on fields that Cobrix failed to decode".
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`:
- Around line 66-101: The cached spaceChar value in Primitive.scala incorrectly
treats COMP-3 empty byte as 0x40; locate the private val spaceChar (pattern
matching on dataType) and change the COMP3() cases inside the Integral and
Decimal branches from returning 0x40 to returning 0x00 so COMP-3 fields follow
the project-wide null encoding; ensure both occurrences (the t.compact match in
Integral and the t.compact match in Decimal) are updated and run tests to verify
corruption detection now flags all-0x40 packed fields.
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`:
- Around line 533-540: getCorruptFieldsGroup currently hardcodes occurs =
Some(10) which caps maxElements to 10 and can truncate flattening; update
getCorruptFieldsGroup to compute the occurs value instead of using the literal
10 — e.g., derive it from the copybook metadata or a configuration parameter
(use an existing method/property that exposes max possible corrupt fields or add
a config lookup/fallback) and pass that computed Option[Int] into the Group
constructor (replace the literal Some(10) with the computed Option). Ensure the
change uses the same symbols: getCorruptFieldsGroup, Group(..., occurs = ...) so
downstream metadata reflects the true upper bound.
/** This is cached value to speedup checking for empty values */
private val spaceChar: Byte = {
  dataType match {
    case t: AlphaNumeric =>
      t.enc match {
        case Some(EBCDIC) => 0x40
        case Some(ASCII) => 0x20
        case Some(_) => 0
        case None => 0x40
      }
    case t: Integral =>
      t.compact match {
        case Some(COMP3()) => 0x40
        case Some(_) => 0
        case None =>
          t.enc match {
            case Some(EBCDIC) => 0x40
            case Some(ASCII) => 0x20
            case Some(_) => 0
            case None => 0x40
          }
      }
    case t: Decimal =>
      t.compact match {
        case Some(COMP3()) => 0x40
        case Some(_) => 0
        case None =>
          t.enc match {
            case Some(EBCDIC) => 0x40
            case Some(ASCII) => 0x20
            case Some(_) => 0
            case None => 0x40
          }
      }
    case _ => 0
  }
🧩 Analysis chain
🏁 Script executed:
# First, let's understand how spaceChar is used
rg -A 5 -B 5 "spaceChar" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala
Repository: AbsaOSS/cobrix
Length of output: 630
🏁 Script executed:
# Search for usages of spaceChar throughout the codebase
rg "spaceChar" --type scala -A 3 -B 3Repository: AbsaOSS/cobrix
Length of output: 3940
🏁 Script executed:
# Look for COMP-3 handling in reader/extractor code
rg "COMP3|0x40" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/ --type scala -A 2 -B 2Repository: AbsaOSS/cobrix
Length of output: 40
🏁 Script executed:
# Check for empty/null value detection in readers
rg "isEmpty|isNull|empty.*value" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/ --type scala -A 2 | head -50Repository: AbsaOSS/cobrix
Length of output: 5846
🏁 Script executed:
# Look for test cases related to COMP-3 empty values
fd ".*Suite\.scala$" cobol-parser/src/test --exec rg -l "COMP3|comp3" {} \;Repository: AbsaOSS/cobrix
Length of output: 40
🏁 Script executed:
# Find test files mentioning COMP3 or packed decimal
find cobol-parser/src/test -name "*.scala" -exec grep -l "COMP3\|comp3\|packed" {} \;
Repository: AbsaOSS/cobrix
Length of output: 329
🏁 Script executed:
# Check for corrupted field handling related to COMP-3
rg "corrupted|Corrupt" cobol-parser/src/main/scala --type scala -B 2 -A 2 | grep -A 5 -B 5 "COMP3"Repository: AbsaOSS/cobrix
Length of output: 40
🏁 Script executed:
# Look for isEmpty method implementation and all its usages
rg "def isEmpty" cobol-parser/src/main/scala --type scala -A 10Repository: AbsaOSS/cobrix
Length of output: 1312
🏁 Script executed:
# Search for any configuration or documentation about null encoding
rg "null.*encod|encod.*null|0x00|zero.*byte" cobol-parser/src/main/scala --type scala | head -20Repository: AbsaOSS/cobrix
Length of output: 3289
🏁 Script executed:
# Let's look at the BinaryDecoderSpec test
cat cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryDecoderSpec.scala | head -100
Repository: AbsaOSS/cobrix
Length of output: 10149
🏁 Script executed:
# Check isEmpty usage in tests
rg "isEmpty" cobol-parser/src/test --type scala -B 3 -A 3Repository: AbsaOSS/cobrix
Length of output: 24980
🏁 Script executed:
# Look for COMP3 in the test files we found
rg "COMP3|comp3" cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryDecoderSpec.scala -B 5 -A 5Repository: AbsaOSS/cobrix
Length of output: 4525
🏁 Script executed:
# Let's check the full isEmpty method
sed -n '102,130p' cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala
Repository: AbsaOSS/cobrix
Length of output: 1087
🏁 Script executed:
# Look for the complete Primitive.scala spaceChar definition with context
sed -n '60,110p' cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala
Repository: AbsaOSS/cobrix
Length of output: 1540
🏁 Script executed:
# Search for any test that calls isEmpty or checks empty values with COMP-3
rg "isEmpty|empty" cobol-parser/src/test --type scala | grep -i "comp3\|packed"Repository: AbsaOSS/cobrix
Length of output: 40
🏁 Script executed:
# Check for usage of isEmpty method in the codebase
rg "\.isEmpty\(" cobol-parser/src/main --type scala -B 2 | head -40Repository: AbsaOSS/cobrix
Length of output: 1013
Change COMP‑3 empty byte from 0x40 to 0x00 to properly detect corruption.
COMP-3 is packed decimal format where 0x40 is meaningless data. Using 0x40 as the empty indicator causes fields with invalid all-0x40 bytes to be incorrectly classified as empty and excluded from corruption reporting. This contradicts the design principle that null values are encoded as 0x00 for all field types. Changing this to 0x00 aligns with the null encoding scheme and ensures corrupted COMP-3 data is properly detected.
Suggested fix
- case Some(COMP3()) => 0x40
+ case Some(COMP3()) => 0
- case Some(COMP3()) => 0x40
+ case Some(COMP3()) => 0
🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`
around lines 66 - 101, The cached spaceChar value in Primitive.scala incorrectly
treats COMP-3 empty byte as 0x40; locate the private val spaceChar (pattern
matching on dataType) and change the COMP3() cases inside the Integral and
Decimal branches from returning 0x40 to returning 0x00 so COMP-3 fields follow
the project-wide null encoding; ensure both occurrences (the t.compact match in
Integral and the t.compact match in Decimal) are updated and run tests to verify
corruption detection now flags all-0x40 packed fields.
private def getCorruptFieldsGroup: Group = {
  val corruptFieldsInGroup = new mutable.ArrayBuffer[Statement]

  corruptFieldsInGroup += Primitive(15, Constants.fieldNameColumn, Constants.fieldNameColumn, 0, AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
  corruptFieldsInGroup += Primitive(15, Constants.rawValueColumn, Constants.rawValueColumn, 0, AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)

  Group(10, Constants.corruptFieldsField, Constants.corruptFieldsField, 0, children = corruptFieldsInGroup, occurs = Some(10))(None)
}
Hardcoded occurs = Some(10) can under‑report max corrupt fields.
This caps maxElements metadata to 10, which can be misleading (and can truncate flattening logic). Consider deriving the upper bound from the copybook (e.g., max possible corrupt fields) or making it configurable.
🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`
around lines 533 - 540, getCorruptFieldsGroup currently hardcodes occurs =
Some(10) which caps maxElements to 10 and can truncate flattening; update
getCorruptFieldsGroup to compute the occurs value instead of using the literal
10 — e.g., derive it from the copybook metadata or a configuration parameter
(use an existing method/property that exposes max possible corrupt fields or add
a config lookup/fallback) and pass that computed Option[Int] into the Group
constructor (replace the literal Some(10) with the computed Option). Ensure the
change uses the same symbols: getCorruptFieldsGroup, Group(..., occurs = ...) so
downstream metadata reflects the true upper bound.
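One possible shape for this fix, sketched under the assumption that the caller can count the decodable primitive fields in the parsed copybook: the `maxCorruptFields` parameter and any counting helper are illustrative, not existing Cobrix API, while the other symbols come from the snippet above.

```scala
// Sketch only: pass the upper bound in instead of hardcoding 10.
private def getCorruptFieldsGroup(maxCorruptFields: Int): Group = {
  val corruptFieldsInGroup = new mutable.ArrayBuffer[Statement]

  corruptFieldsInGroup += Primitive(15, Constants.fieldNameColumn, Constants.fieldNameColumn, 0,
    AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
  corruptFieldsInGroup += Primitive(15, Constants.rawValueColumn, Constants.rawValueColumn, 0,
    AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)

  Group(10, Constants.corruptFieldsField, Constants.corruptFieldsField, 0,
    children = corruptFieldsInGroup, occurs = Some(maxCorruptFields))(None)
}
```

A caller could derive `maxCorruptFields` from, for example, the number of Primitive statements in the copybook AST, so the Group's metadata reflects the true upper bound.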
Closes #723
Summary by CodeRabbit
New Features
Documentation
Tests