
Conversation

@yruslan (Collaborator) commented Feb 11, 2026

Closes #723

Summary by CodeRabbit

  • New Features

    • Added generate_corrupt_fields option that appends a _corrupt_fields array (field_name, raw_value) for values that fail conversion.
    • Added a fluent CobolSchema builder API for easier schema construction and configuration.
  • Documentation

    • README and changelog updated with usage examples and description of generate_corrupt_fields.
  • Tests

    • New integration tests validating corrupt-fields output and option compatibility (including incompatible option checks).
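For orientation, here is a minimal usage sketch of the option described in the summary above. The SparkSession (`spark`) and file paths are placeholders, not taken from this PR:

```scala
// Hypothetical usage sketch (paths are placeholders).
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")   // placeholder copybook location
  .option("generate_corrupt_fields", "true")     // append the _corrupt_fields array column
  .load("/path/to/data")                         // placeholder data location

// Each _corrupt_fields element carries the field name and the raw bytes that failed to decode.
df.select("_corrupt_fields").show(truncate = false)
```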

- Introduced `generate_corrupted_fields` option to append a `_corrupted_fields` array to the schema, containing field names and raw values of fields that could not be decoded.
- Updated relevant methods in `RecordExtractors` to handle corrupted fields.
- Enhanced documentation and README to reflect the new feature.
…ull fields, improve performance of generating corrupted fields slightly.
…d memory wise), fix `PIC X USAGE COMP` cases.
@coderabbitai bot (Contributor) commented Feb 11, 2026

Walkthrough

Adds a new generate_corrupt_fields option that captures fields failing to decode by storing their field name and raw bytes into a _corrupt_fields array. The flag is threaded through parameters, schema, iterators, and record extractors; Primitive gained raw extraction and isEmpty helpers. Tests, README, and a CobolSchema builder were added/updated.
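For orientation, the appended `_corrupt_fields` column described above would have roughly the following Spark type. This is a hedged sketch; the exact element types (in particular `BinaryType` for `raw_value`) and nullability are assumptions, not taken from the PR:

```scala
// Hedged sketch of the appended column's Spark type.
import org.apache.spark.sql.types._

val corruptFieldsType = ArrayType(
  StructType(Seq(
    StructField("field_name", StringType),  // name of the field that failed to decode
    StructField("raw_value", BinaryType)    // raw bytes of the value (type is an assumption)
  ))
)
```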

Changes

  • **Parameters & Parsing** (`cobol-parser/src/main/scala/.../CobolParameters.scala`, `.../CobolParametersParser.scala`, `.../ReaderParameters.scala`): Add `generate_corrupt_fields` constant and `generateCorruptFields: Boolean` option; validate incompatible combos and propagate into `ReaderParameters`.
  • **Record Extraction & Corrupt model** (`cobol-parser/src/main/scala/.../record/RecordExtractors.scala`, `.../record/CorruptField.scala`): Collect `CorruptField(field_name, raw_value)` when decoding yields null but raw bytes exist; pass the buffer through post-processing and optionally append the `_corrupt_fields` array to outputs.
  • **Primitive API** (`cobol-parser/src/main/scala/.../Primitive.scala`): Add `getRawValue` and `isEmpty` methods; separate raw extraction from decoding to enable corrupt-byte detection.
  • **Schema & Constants** (`cobol-parser/src/main/scala/.../Constants.scala`, `.../schema/CobolSchema.scala`, `spark-cobol/src/main/scala/.../CobolSchema.scala`): Add constants (`_corrupt_fields`, `field_name`, `raw_value`); propagate `generateCorruptFields` through `CobolSchema`/`CobolReaderSchema` and conditionally add the corrupt-fields `StructType`; add the `CobolSchema` builder API.
  • **Iterators & Reader Flow** (`cobol-parser/src/main/scala/.../FixedLenNestedRowIterator.scala`, `.../VarLenNestedIterator.scala`, `spark-cobol/src/main/scala/.../SparkCobolOptionsBuilder.scala`): Forward `generateCorruptFields` into `RecordExtractors.extractRecord` calls and adjust call sites to the new parameter list.
  • **Parser & Usage comments** (`cobol-parser/src/main/scala/.../ParserVisitor.scala`, `.../datatype/Usage.scala`): Adjust COMP usage handling (COMP4/COMP9) and update usage guidance comments; minor import reorders.
  • **Tests & Documentation** (`README.md`, `spark-cobol/src/test/scala/.../Test41CorruptFieldsSpec.scala`, `spark-cobol/src/test/scala/.../*`): README documents the new option and changelog; a new integration test validates schema and data; many tests updated to use the `CobolSchema` builder and parameter constants.
  • **Misc test updates** (`spark-cobol/src/test/.../DummyCobolSchema.scala`, `.../Test02TextFilesOldSchool.scala`, etc.): Refactor tests to the builder API, small style/import tweaks, and replace some hard-coded strings with constants.

Sequence Diagram

sequenceDiagram
    participant Client as Spark Client
    participant Extractor as RecordExtractors
    participant Primitive as Primitive
    participant Output as Output Row

    Client->>Extractor: extractRecord(..., generateCorruptFields=true)
    activate Extractor

    Extractor->>Extractor: init corruptedFields buffer

    loop for each field
        Extractor->>Primitive: getRawValue(offset)
        activate Primitive
        Primitive-->>Extractor: rawBytes or null
        alt rawBytes present and decode succeeds
            Extractor-->>Extractor: use decoded value
        else rawBytes present but decode returns null
            Extractor->>Extractor: create CorruptField(fieldName, rawBytes)
            Extractor-->>Extractor: add to corruptedFields
        end
        deactivate Primitive
    end

    alt generateCorruptFields == true
        Extractor->>Extractor: build _corrupt_fields array from buffer
        Extractor->>Output: append _corrupt_fields to Row
    end

    Extractor-->>Client: Row (with _corrupt_fields if enabled)
    deactivate Extractor

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I nibble bytes both shy and bold,

When decoders stumble, crumbs are told.
I tuck raw bytes and names in rows,
A little trail where the mystery goes.
Builder-bounced and brisk I hop — corrupts now saved, no more flip-flop.

🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 not passed

❌ Failed checks (1 warning, 1 inconclusive)

  • **Linked Issues check** (⚠️ Warning): The PR implements the core requirements from #723: it adds the `generate_corrupt_fields` option and generates a `_corrupt_fields` column with `field_name` and `raw_value` data when decoding fails. However, the linked issue requested user-configurable column names and detailed error messages, while the implementation uses a fixed column name and a simplified data structure without error descriptions. Resolution: verify whether the simplified implementation (fixed column name `_corrupt_fields` with only `field_name` and `raw_value`) fully satisfies the original #723 requirements for user-configurable naming and detailed error messages. Consider documenting any scope changes or design trade-offs.
  • **Out of Scope Changes check** (❓ Inconclusive): Changes are generally scoped to the corrupted fields feature. However, ParserVisitor.scala adjusts COMP encoding handling (COMP1/COMP4 to COMP4/COMP9), and Usage.scala adds deprecation comments; these appear tangential to the `_corrupt_fields` feature and may warrant clarification. Resolution: clarify how the COMP encoding changes in ParserVisitor.scala and the Usage.scala comments relate to the #723 feature, and confirm whether they are prerequisite fixes or separate improvements bundled into this PR.

✅ Passed checks (3 passed)

  • **Description Check** (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • **Title check** (✅ Passed): The title directly describes the main feature: adding a `_corrupted_fields` column to capture field decoding failures, which aligns with the primary change in the changeset.
  • **Docstring Coverage** (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping the docstring coverage check.



@github-actions bot commented Feb 11, 2026

JaCoCo code coverage report - 'cobol-parser'

| Scope | Coverage |
| --- | --- |
| Overall Project | 91.45% (-0.59%) 🍏 |
| Files changed | 62.29% 🍏 |

| File | Coverage |
| --- | --- |
| Constants.scala | 89.7% 🍏 |
| ParserVisitor.scala | 89.44% (-0.07%) 🍏 |
| CobolSchema.scala | 80.67% (-1%) 🍏 |
| RecordExtractors.scala | 80.27% (-17%) 🍏 |
| Usage.scala | 79.59% 🍏 |
| FixedLenNestedRowIterator.scala | 73.44% 🍏 |
| Primitive.scala | 71.16% (-13.6%) 🍏 |
| VarLenNestedIterator.scala | 67.72% 🍏 |
| CorruptField.scala | 0% |

@github-actions bot commented Feb 11, 2026

JaCoCo code coverage report - 'spark-cobol'

| Scope | Coverage |
| --- | --- |
| Overall Project | 82.39% 🍏 |
| Files changed | 100% 🍏 |

| File | Coverage |
| --- | --- |
| CobolSchema.scala | 95.51% 🍏 |
| SparkCobolOptionsBuilder.scala | 90.32% 🍏 |

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (1)

43-43: ⚠️ Potential issue | 🟡 Minor

Typo in Scaladoc: "pecifies" → "Specifies".

Proposed fix
-  * `@param` schemaRetentionPolicy   pecifies a policy to transform the input schema. The default policy is to keep the schema exactly as it is in the copybook.
+  * `@param` schemaRetentionPolicy   Specifies a policy to transform the input schema. The default policy is to keep the schema exactly as it is in the copybook.
🤖 Fix all issues with AI agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`:
- Around line 125-156: The isEmpty method currently treats byte 0x40 as an
empty-space unconditionally; update isEmpty to be encoding-aware by checking the
field's encoding (use dataType.enc on AlphaNumeric fields) and only treating
0x40 as an empty byte when the encoding is EBCDIC, while treating 0x20 as the
space for ASCII-encoded fields; retain the existing zero (0x00) check and fall
back to the current behavior for unknown encodings. Modify the loop in isEmpty
(and the earlier idx bounds logic using binaryProperties.dataSize) to build the
set of “empty” byte values based on dataType.enc and use that set to decide
emptiness.

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`:
- Around line 435-436: The hierarchical extractor call in
extractHierarchicalRecord always passes generateCorruptedFields = false and
corruptedFields = null to applyRecordPostProcessing, so users enabling
corrupted-fields support for hierarchical/multisegment data get no output;
update extractHierarchicalRecord to detect when the user option for corrupted
fields is true and either (a) propagate the flag and a proper corruptedFields
collection into applyRecordPostProcessing if you implement support, or (b) emit
a clear warning via the existing logging mechanism (e.g., logger.warn) that
corrupted-fields are not supported for hierarchical/multisegment extraction and
will be ignored; reference extractHierarchicalRecord and
applyRecordPostProcessing so the change is made at that call site.

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala`:
- Line 50: The ScalaDoc for the parameter generateCorruptFields is referencing
the wrong field name "_corrupt_fields"; update the comment to the correct field
name "_corrupted_fields" so the doc matches the actual option and field used
elsewhere (search for generateCorruptFields and any references to
_corrupted_fields/_corrupt_fields to confirm consistency).

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala`:
- Around line 898-901: The current check uses
params.contains(PARAM_CORRUPTED_FIELDS) which treats an explicit "false" value
the same as presence and wrongly throws; update the condition in the
CobolParametersParser where PARAM_CORRUPTED_FIELDS is checked so it only throws
when the option is explicitly set to true (e.g., inspect params.get/PARAM value
or use params.getBoolean-like logic on the params map) instead of just checking
presence; change the check around PARAM_CORRUPTED_FIELDS and the throw in that
block to allow explicit generate_corrupted_fields=false while still blocking
true when used with segment-children:*.

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala`:
- Line 61: The Scaladoc and parameter naming are inconsistent: update
ReaderParameters (parameter generateCorruptFields) to match the actual constant
and usage by renaming it to generateCorruptedFields (or vice-versa across the
codebase) and fix the Scaladoc text to refer to Constants.corruptedFieldsField
(`_corrupted_fields`); ensure RecordExtractors.extractRecord and any other
callers use the same name so the parameter and documentation are consistent.

In `@README.md`:
- Around line 600-603: The fenced code block showing
.option("generate_corrupted_fields", "true") lacks a language identifier, which
triggers markdownlint MD040; update the fenced block to add a language tag
(e.g., scala) immediately after the opening backticks so it reads ```scala and
retains the existing code snippet .option("generate_corrupted_fields", "true").
- Around line 1954-1960: The changelog heading for the 2.9.8 entry uses ####
which skips a level (MD001); update the heading text "2.9.8 will be released
soon." to use a triple-hash heading (###) instead of #### so it correctly
follows the prior `##` level; ensure the surrounding block (the bullet about
[`#723`] and the code example `.option("generate_corrupted_fields", "true")`)
remains unchanged and properly nested under the new `###` heading.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala`:
- Line 47: Update the Scaladoc for the generateCorruptedFields parameter in
CobolSchema (parameter name: generateCorruptedFields) to state that corrupted
record fields are appended to the end of the schema (not the beginning); this
aligns the doc with the implementation that uses :+ to add the
`_corrupted_fields` field at the end (see construction around the code that
appends `_corrupted_fields`).
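A hedged sketch of the boolean-aware check suggested in the CobolParametersParser item above. The plain `Map[String, String]`, the helper name, and the literal option key are assumptions for illustration; the real parser uses its own parameters wrapper and constants:

```scala
// Hypothetical sketch: reject the combination only when the option is explicitly enabled.
def validateCorruptFieldsOptions(options: Map[String, String]): Unit = {
  val corruptFieldsEnabled =
    options.get("generate_corrupt_fields").exists(_.trim.equalsIgnoreCase("true"))
  val hasSegmentChildren =
    options.keys.exists(_.startsWith("segment-children"))

  if (corruptFieldsEnabled && hasSegmentChildren) {
    throw new IllegalArgumentException(
      "Option 'generate_corrupt_fields' is not supported together with 'segment-children:*' (hierarchical records).")
  }
}
```

With this shape, an explicit `generate_corrupt_fields=false` no longer trips the incompatibility check.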
🧹 Nitpick comments (7)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/common/Constants.scala (1)

67-69: Constants defined but not referenced in RecordExtractors.getCorruptedFieldsGroup.

In RecordExtractors.scala (lines 535-538), the corrupted fields group is constructed using hardcoded string literals "field_name", "raw_value", and "_corrupted_fields" instead of referencing these constants. Either use the constants there or remove them here to avoid drift.

Suggested change in RecordExtractors.scala (lines 532-539)
  private def getCorruptedFieldsGroup: Group = {
    val corruptedFieldsInGroup = new mutable.ArrayBuffer[Statement]

-   corruptedFieldsInGroup += Primitive(15, "field_name", "field_name", 0, AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
-   corruptedFieldsInGroup += Primitive(15, "raw_value", "raw_value", 0, AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)
+   corruptedFieldsInGroup += Primitive(15, Constants.fieldNameColumn, Constants.fieldNameColumn, 0, AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
+   corruptedFieldsInGroup += Primitive(15, Constants.rawValueColumn, Constants.rawValueColumn, 0, AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)

-   Group(10, "_corrupted_fields", "_corrupted_fields", 0,  children = corruptedFieldsInGroup, occurs = Some(10))(None)
+   Group(10, Constants.corruptedFieldsField, Constants.corruptedFieldsField, 0,  children = corruptedFieldsInGroup, occurs = Some(10))(None)
  }
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1)

125-156: Duplicated bounds-checking logic between isEmpty and getRawValue.

Both methods repeat the same offset validation and byte-range calculation (string vs. non-string, varchar trimming). Consider extracting a shared private helper that returns the valid (startIdx, endIdx) or null/None to reduce duplication.

Also applies to: 167-191
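A possible shape for such a helper, sketched under assumptions about Primitive's internals: only `binaryProperties.dataSize` is taken from this review, and the string vs. non-string sizing and varchar trimming mentioned above are deliberately omitted:

```scala
// Hypothetical shared bounds helper for isEmpty and getRawValue (simplified sketch).
// Assumes binaryProperties.dataSize is the field's byte length; the real logic also
// distinguishes string vs. non-string sizes and trims variable-length strings.
private def fieldByteRange(itOffset: Int, record: Array[Byte]): Option[(Int, Int)] = {
  val start = itOffset
  val end   = itOffset + binaryProperties.dataSize
  if (start >= 0 && end <= record.length) Some((start, end)) else None
}
```

Both methods could then pattern match on the returned range and treat `None` as out of bounds (empty for `isEmpty`, null for `getRawValue`).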

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala (1)

72-73: ArrayBuffer allocated even when generateCorruptedFields is false.

Line 73 unconditionally allocates new ArrayBuffer[CorruptedField] for every record, even when the feature is disabled. Consider making it lazy or conditional.

Suggested fix
-    val corruptedFields = new ArrayBuffer[CorruptedField]
+    val corruptedFields = if (generateCorruptedFields) new ArrayBuffer[CorruptedField] else null

Then the null-guard at line 511 already covers this.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VarLenNestedIterator.scala (1)

95-110: Correct propagation of generateCorruptFields to extractRecord.

The wiring is correct. Note the naming inconsistency: ReaderParameters.generateCorruptFields maps to RecordExtractors.extractRecord's generateCorruptedFields parameter. Not a bug, but the inconsistent naming (Corrupt vs Corrupted) across the API surface could cause confusion.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test41CorruptedFieldsSpec.scala (1)

104-113: Consider using PARAM_CORRUPTED_FIELDS constant for the option key.

PARAM_CORRUPTED_FIELDS is imported on line 21 but only used in the assertion at line 121. For consistency and to guard against the option key string drifting out of sync, consider using the constant here as well.

That said, using string literals in Spark option maps is a common test pattern (it validates the actual user-facing API surface), so this is a minor style preference.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/CobolSchemaSpec.scala (1)

905-910: Inconsistent type assertion style.

Line 909 uses isInstanceOf[StringType] while all other tests in this file compare with == (e.g., == StringType, == IntegerType). For consistency, prefer the equality check.

Proposed fix
-      assert(sparkSchema.fields(2).dataType.isInstanceOf[StringType])
+      assert(sparkSchema.fields(2).dataType == StringType)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala (1)

147-147: Naming inconsistency: generateCorruptFields vs generateCorruptedFields across the codebase.

The parameter is defined as generateCorruptFields (no "ed") in ReaderParameters and CobolParameters, but CobolSchema and RecordExtractors use generateCorruptedFields. This inconsistency spans multiple files and creates confusing naming conventions. Consider aligning all usages to generateCorruptedFields, which is grammatically correct and already used in the public-facing option key generate_corrupted_fields.

Comment on lines 435 to 436
applyRecordPostProcessing(ast, records.toList, effectiveSchemaRetentionPolicy, generateRecordId, generateRecordBytes = false, generateCorruptedFields = false, Nil, fileId, recordId, recordLength, Array.empty[Byte], generateInputFileField = generateInputFileField, inputFileName, null, handler)
}
@coderabbitai bot (Contributor) commented Feb 11, 2026

⚠️ Potential issue | 🟡 Minor

Hierarchical record extraction does not support corrupted fields.

extractHierarchicalRecord hardcodes generateCorruptedFields = false and corruptedFields = null. If a user enables the option with hierarchical/multisegment data, they'll silently get no corrupted fields output. Consider logging a warning if the feature is enabled but unsupported, or documenting this limitation.

🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`
around lines 435 - 436, The hierarchical extractor call in
extractHierarchicalRecord always passes generateCorruptedFields = false and
corruptedFields = null to applyRecordPostProcessing, so users enabling
corrupted-fields support for hierarchical/multisegment data get no output;
update extractHierarchicalRecord to detect when the user option for corrupted
fields is true and either (a) propagate the flag and a proper corruptedFields
collection into applyRecordPostProcessing if you implement support, or (b) emit
a clear warning via the existing logging mechanism (e.g., logger.warn) that
corrupted-fields are not supported for hierarchical/multisegment extraction and
will be ignored; reference extractHierarchicalRecord and
applyRecordPostProcessing so the change is made at that call site.

@yruslan (Collaborator, Author) commented

If they set the option to true the option map fails validation and throws an exception. So for now this is good. But in the future we can enable this option for hierarchical record extractor as well.

@coderabbitai bot (Contributor) commented

@yruslan, Thank you for the clarification! I can see from the validation logic in CobolParametersParser that the combination of generate_corrupted_fields with hierarchical segment-children is properly prevented at option parsing time, throwing an IllegalArgumentException before extraction occurs. This makes the hardcoded false value in extractHierarchicalRecord safe and appropriate for the current implementation.

The validation approach is sound—failing fast during configuration rather than silently ignoring the option. Looking forward to seeing this feature extended to hierarchical extractors in the future when support is implemented.


✏️ Learnings added
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 820
File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala:435-436
Timestamp: 2026-02-11T08:01:45.131Z
Learning: In `cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`, the `extractHierarchicalRecord` method hardcodes `generateCorruptedFields = false` because validation in `CobolParametersParser` throws an `IllegalArgumentException` when `generate_corrupted_fields` is used with segment-children (hierarchical data), preventing users from enabling this unsupported combination.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

* @param variableLengthParams VariableLengthParameters containing the specifications for the consumption of variable-length Cobol records.
* @param variableSizeOccurs If true, OCCURS DEPENDING ON data size will depend on the number of elements
* @param generateRecordBytes Generate 'record_bytes' field containing raw bytes of the original record
* @param generateCorruptFields Generate '_corrupt_fields' field for fields that haven't converted successfully
@coderabbitai bot (Contributor) commented Feb 11, 2026

⚠️ Potential issue | 🟡 Minor

Fix doc field name to _corrupted_fields.
Line 50 mentions _corrupt_fields, but the option and field name elsewhere are _corrupted_fields.

📝 Suggested fix
-  * `@param` generateCorruptFields   Generate '_corrupt_fields' field for fields that haven't converted successfully
+  * `@param` generateCorruptFields   Generate '_corrupted_fields' field for fields that haven't converted successfully
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
* @param generateCorruptFields Generate '_corrupt_fields' field for fields that haven't converted successfully
* `@param` generateCorruptFields Generate '_corrupted_fields' field for fields that haven't converted successfully
🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala`
at line 50, The ScalaDoc for the parameter generateCorruptFields is referencing
the wrong field name "_corrupt_fields"; update the comment to the correct field
name "_corrupted_fields" so the doc matches the actual option and field used
elsewhere (search for generateCorruptFields and any references to
_corrupted_fields/_corrupt_fields to confirm consistency).

@yruslan (Collaborator, Author) commented

Which name for the field seems better _corrupt_fields or _corrupted_fields ?

@coderabbitai bot (Contributor) commented

@yruslan, great question! Here's my analysis of both options:

_corrupted_fields (current implementation):

  • ✅ More grammatically complete (past participle: "fields that have been corrupted")
  • ✅ More explicit and descriptive
  • ✅ Commonly used pattern in data quality contexts (similar to "rejected_records", "failed_rows")
  • ✅ Clearer that it refers to fields that experienced corruption during processing

_corrupt_fields:

  • ✅ Shorter and more concise
  • ✅ Better parallelism with the parameter name generateCorruptFields
  • ⚠️ Could be ambiguous: "corrupt fields" might imply fields that are inherently corrupt vs. fields that failed during decoding

Recommendation: I'd lean toward _corrupted_fields for clarity and standard data quality terminology. The past participle form makes it unambiguous that these fields encountered corruption during the decoding process.

For consistency, you might also consider renaming the parameter to generateCorruptedFields to maintain parallelism, though generateCorruptFields is already quite clear.

What's your preference given the context of your users and how they'll interact with this feature?

@yruslan (Collaborator, Author) commented

I see but still lean towards _corrupt_fields since spark-csv has a similar concept generating _corrupt_records field for rows it failed to decode

@coderabbitai bot (Contributor) commented

@yruslan, excellent point — Spark CSV indeed uses _corrupt_record as the standard naming convention for similar functionality. Aligning with that established pattern makes perfect sense for consistency across the Spark ecosystem and will be more intuitive for users already familiar with Spark's data quality features.

So to confirm: you'd like to change the implementation from _corrupted_fields to _corrupt_fields (rather than updating the doc comment), which means:

  1. Update Constants.scala from corruptedFieldsField = "_corrupted_fields" to corruptedFieldsField = "_corrupt_fields"
  2. Ensure all usages throughout the codebase reference the updated constant
  3. Keep the doc comment as-is (it already says _corrupt_fields)

This would align the feature naming with Spark's established conventions while maintaining internal consistency. The parameter name generateCorruptFields would also align nicely with the field name _corrupt_fields.

Would you like me to help identify all the files that would need updates for this change?


✏️ Learnings added
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 820
File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala:50-50
Timestamp: 2026-02-11T08:15:51.206Z
Learning: In the Cobrix project, the field name for corrupted/failed fields should be `_corrupt_fields` (not `_corrupted_fields`) to align with Apache Spark CSV's established naming convention which uses `_corrupt_record` for similar functionality, ensuring consistency across the Spark ecosystem.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 820
File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala:435-436
Timestamp: 2026-02-11T08:01:45.131Z
Learning: In `cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`, the `extractHierarchicalRecord` method hardcodes `generateCorruptedFields = false` because validation in `CobolParametersParser` throws an `IllegalArgumentException` when `generate_corrupted_fields` is used with segment-children (hierarchical data), preventing users from enabling this unsupported combination.

* @param fileEndOffset A number of bytes to skip at the end of each file
* @param generateRecordId If true, a record id field will be prepended to each record.
* @param generateRecordBytes Generate 'record_bytes' field containing raw bytes of the original record
* @param generateCorruptFields Generate '_corrupt_fields' field for fields that haven't converted successfully
@coderabbitai bot (Contributor) commented

⚠️ Potential issue | 🟡 Minor

Naming inconsistency: generateCorruptFields vs. generateCorruptedFields and doc typo.

  1. The Scaladoc (line 61) says _corrupt_fields but the actual column name in Constants.corruptedFieldsField is _corrupted_fields.
  2. This parameter is named generateCorruptFields while RecordExtractors.extractRecord names it generateCorruptedFields. Aligning these would reduce confusion.
Suggested doc fix
-  * `@param` generateCorruptFields   Generate '_corrupt_fields' field for fields that haven't converted successfully
+  * `@param` generateCorruptFields   Generate '_corrupted_fields' field for fields that haven't converted successfully

Also applies to: 115-115

🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala`
at line 61, The Scaladoc and parameter naming are inconsistent: update
ReaderParameters (parameter generateCorruptFields) to match the actual constant
and usage by renaming it to generateCorruptedFields (or vice-versa across the
codebase) and fix the Scaladoc text to refer to Constants.corruptedFieldsField
(`_corrupted_fields`); ensure RecordExtractors.extractRecord and any other
callers use the same name so the parameter and documentation are consistent.

Comment on lines 1954 to 1960
- #### 2.9.8 will be released soon.
- [#723](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate `corrupted_fields` field that contains field names and raw values
of fields that Cobrix couldn't decode.
```scala
.option("generate_corrupted_fields", "true")
```

@coderabbitai bot (Contributor) commented

⚠️ Potential issue | 🟡 Minor

Fix heading level increment (MD001).
The changelog jumps from ## to ####. Use ### for the 2.9.8 entry.

📝 Suggested fix
-- #### 2.9.8 will be released soon.
+- ### 2.9.8 will be released soon.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- #### 2.9.8 will be released soon.
- [#723](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate `corrupted_fields` field that contains field names and raw values
of fields that Cobrix couldn't decode.
```scala
.option("generate_corrupted_fields", "true")
```
- ### 2.9.8 will be released soon.
- [`#723`](https://github.com/AbsaOSS/cobrix/pull/723) Added the option to generate `corrupted_fields` field that contains field names and raw values
of fields that Cobrix couldn't decode.
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 1954-1954: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)

🤖 Prompt for AI Agents
In `@README.md` around lines 1954 - 1960, The changelog heading for the 2.9.8
entry uses #### which skips a level (MD001); update the heading text "2.9.8 will
be released soon." to use a triple-hash heading (###) instead of #### so it
correctly follows the prior `##` level; ensure the surrounding block (the bullet
about [`#723`] and the code example `.option("generate_corrupted_fields",
"true")`) remains unchanged and properly nested under the new `###` heading.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`:
- Around line 155-163: Update the ScalaDoc for the isEmpty(itOffset: Int,
record: Array[Byte]): Boolean method to accurately describe which byte values
are treated as "space" for different encodings: state that ASCII bytes use 0x20,
EBCDIC bytes use 0x40, and unknown/fallback encodings treat space as 0x00 (in
addition to null bytes 0x00 being considered empty); reference the isEmpty
method in Primitive.scala so readers understand the documented behavior matches
the implementation.

In `@README.md`:
- Around line 1955-1956: Update the changelog entry to reference the correct
field name _corrupt_fields (with leading underscore) instead of corrupt_fields,
and ensure the rest of the codebase consistently uses _corrupt_fields: search
for and rename any occurrences of corrupt_fields or _corrupted_fields to
_corrupt_fields (including the field definition and parameter in
CobolParameters.scala and any tests, serializers, or references that use the old
names), and update README.md, tests, and serialization/deserialization logic to
the new symbol to keep naming aligned with Spark conventions.
🧹 Nitpick comments (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala (1)

528-539: Prefer Constants for corrupt field names.
Avoids string drift now that canonical names live in Constants.

♻️ Suggested refactor
-    corruptFieldsInGroup += Primitive(15, "field_name", "field_name", 0, AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
-    corruptFieldsInGroup += Primitive(15, "raw_value", "raw_value", 0, AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)
+    corruptFieldsInGroup += Primitive(15, Constants.fieldNameColumn, Constants.fieldNameColumn, 0, AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
+    corruptFieldsInGroup += Primitive(15, Constants.rawValueColumn, Constants.rawValueColumn, 0, AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)
 
-    Group(10, "_corrupt_fields", "_corrupt_fields", 0,  children = corruptFieldsInGroup, occurs = Some(10))(None)
+    Group(10, Constants.corruptFieldsField, Constants.corruptFieldsField, 0, children = corruptFieldsInGroup, occurs = Some(10))(None)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test41CorruptFieldsSpec.scala (1)

104-105: Use the imported PARAM_CORRUPT_FIELDS constant instead of the string literal.

PARAM_CORRUPT_FIELDS is already imported (line 21) and used in the assertion on line 121. Using the constant here too avoids the risk of the option name drifting out of sync.

♻️ Suggested fix
-        val df = getDataFrame(tmpFileName, Map("generate_corrupt_fields" -> "true"))
+        val df = getDataFrame(tmpFileName, Map(PARAM_CORRUPT_FIELDS -> "true"))

Same applies to line 118:

-        getDataFrame("/tmp/dummy", Map("generate_corrupt_fields" -> "true", "segment-children:0" -> "COMPANY => DEPT,CUSTOMER"))
+        getDataFrame("/tmp/dummy", Map(PARAM_CORRUPT_FIELDS -> "true", "segment-children:0" -> "COMPANY => DEPT,CUSTOMER"))
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/CobolSchemaSpec.scala (1)

744-911: Thorough builder test coverage.

The builder tests comprehensively exercise each withX method and their combinations. One minor nit:

Line 909 uses isInstanceOf[StringType] while the rest of this file consistently uses == StringType for type assertions (e.g., lines 542, 544, 627). Consider using == StringType for consistency.

♻️ Suggested fix
-      assert(sparkSchema.fields(2).dataType.isInstanceOf[StringType])
+      assert(sparkSchema.fields(2).dataType == StringType)

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`:
- Line 174: Update the two typos in the comments inside Primitive.scala: change
"should exactly fix the required bytes" to "should exactly fit the required
bytes" for the comment that starts "Non-string field size..." and apply the same
replacement for the second occurrence around line 216; locate these comments in
the Primitive.scala source (search for the phrase "should exactly fix the
required bytes") and edit the text accordingly.
🧹 Nitpick comments (5)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala (1)

66-102: Minor duplication in spaceChar initialization between Integral and Decimal branches.

The None compact case for both Integral and Decimal (lines 80-86 and 92-98) is identical. Not urgent, but could be deduplicated if this logic grows.
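One possible deduplication, sketched from the `t.enc match` blocks quoted later in this conversation; the helper name and the `Option[Encoding]` parameter type are assumptions:

```scala
// Hypothetical helper: map an optional encoding to the byte treated as "space".
private def encodingSpaceChar(enc: Option[Encoding]): Byte = enc match {
  case Some(EBCDIC) => 0x40  // EBCDIC space
  case Some(ASCII)  => 0x20  // ASCII space
  case Some(_)      => 0x00  // other encodings fall back to the null byte
  case None         => 0x40  // default when no encoding is specified
}
```

The `case None =>` arms of both the Integral and Decimal branches could then call `encodingSpaceChar(t.enc)`.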

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/CobolSchemaSpec.scala (1)

872-887: Consider adding deeper structural assertions for the _corrupt_fields array element type.

The multi-segment test at Line 300 validates the inner struct fields (field_name, raw_value) of the _corrupt_fields array, but this builder test only checks the top-level name and ArrayType. Adding a similar inner-struct assertion here would guard against regressions in the element schema.

Suggested addition
       assert(sparkSchema.fields(1).name == "_corrupt_fields")
       assert(sparkSchema.fields(1).dataType.isInstanceOf[ArrayType])
+      val corruptFieldsType = sparkSchema.fields(1).dataType.asInstanceOf[ArrayType]
+      val elementType = corruptFieldsType.elementType.asInstanceOf[StructType]
+      assert(elementType.fields.map(_.name).toSet == Set("field_name", "raw_value"))
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala (2)

114-116: Minor style: missing space after comma in getRawValue calls.

Lines 115 and 142 both have s.getRawValue(offset,data) / st.getRawValue(useOffset,data) without a space after the comma, inconsistent with the rest of the codebase (e.g., s.decodeTypeValue(offset, data) on line 113).

Proposed fix
-              corruptFields += CorruptField(s"${field.name}[$i]", s.getRawValue(offset,data))
+              corruptFields += CorruptField(s"${field.name}[$i]", s.getRawValue(offset, data))
-            corruptFields += CorruptField(field.name, st.getRawValue(useOffset,data))
+            corruptFields += CorruptField(field.name, st.getRawValue(useOffset, data))

533-540: Synthetic Group for corrupt fields schema uses hardcoded occurs = Some(10) and decode = null.

The occurs = Some(10) on line 539 is arbitrary since the actual array size is determined at runtime. If this Group is ever used for size calculations (e.g., binaryProperties.actualSize), the hardcoded 10 could produce incorrect results. Similarly, decode = null and encode = null on the Primitives (lines 536–537) would NPE if anyone tries to decode through these synthetic nodes.

Since this Group is only used as a structural schema hint for handler.create (line 518), this is likely safe today. But it's fragile — consider adding a comment clarifying this Group must never be used for decoding or size computation.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test41CorruptFieldsSpec.scala (1)

36-42: Test data comments could be more precise for record 3.

Line 39 comment says "COMP invalid value (negative)" but the expected output shows F3 = 61702 (decoded successfully, not flagged as corrupt). The field F1 = 0x00 is the one that decodes to null here. Consider updating the comment to clarify what's actually being tested in this record — it seems to test that an out-of-PIC-range COMP value is still decoded (not flagged) and that a null byte for alphanumeric is treated as empty (not corrupt).

@yruslan force-pushed the feature/723-add-corrupted-fields-field branch from 91a8caa to 8a835e5 on February 11, 2026 09:00
@yruslan changed the title from "#723 Add the ability to generate _corrupted_fields column containing information on fields that Cobrix failed to decode" to "#723 Add the ability to generate _corrupt_fields column containing information on fields that Cobrix failed to decode" on Feb 11, 2026
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`:
- Around line 66-101: The cached spaceChar value in Primitive.scala incorrectly
treats COMP-3 empty byte as 0x40; locate the private val spaceChar (pattern
matching on dataType) and change the COMP3() cases inside the Integral and
Decimal branches from returning 0x40 to returning 0x00 so COMP-3 fields follow
the project-wide null encoding; ensure both occurrences (the t.compact match in
Integral and the t.compact match in Decimal) are updated and run tests to verify
corruption detection now flags all-0x40 packed fields.

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`:
- Around line 533-540: getCorruptFieldsGroup currently hardcodes occurs =
Some(10) which caps maxElements to 10 and can truncate flattening; update
getCorruptFieldsGroup to compute the occurs value instead of using the literal
10 — e.g., derive it from the copybook metadata or a configuration parameter
(use an existing method/property that exposes max possible corrupt fields or add
a config lookup/fallback) and pass that computed Option[Int] into the Group
constructor (replace the literal Some(10) with the computed Option). Ensure the
change uses the same symbols: getCorruptFieldsGroup, Group(..., occurs = ...) so
downstream metadata reflects the true upper bound.

Comment on lines +66 to +101
```scala
  /** This is cached value to speedup checking for empty values */
  private val spaceChar: Byte = {
    dataType match {
      case t: AlphaNumeric =>
        t.enc match {
          case Some(EBCDIC) => 0x40
          case Some(ASCII)  => 0x20
          case Some(_)      => 0
          case None         => 0x40
        }
      case t: Integral =>
        t.compact match {
          case Some(COMP3()) => 0x40
          case Some(_)       => 0
          case None =>
            t.enc match {
              case Some(EBCDIC) => 0x40
              case Some(ASCII)  => 0x20
              case Some(_)      => 0
              case None         => 0x40
            }
        }
      case t: Decimal =>
        t.compact match {
          case Some(COMP3()) => 0x40
          case Some(_)       => 0
          case None =>
            t.enc match {
              case Some(EBCDIC) => 0x40
              case Some(ASCII)  => 0x20
              case Some(_)      => 0
              case None         => 0x40
            }
        }
      case _ => 0
    }
```
@coderabbitai bot (Contributor) commented

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's understand how spaceChar is used
rg -A 5 -B 5 "spaceChar" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala

Repository: AbsaOSS/cobrix

Length of output: 630


🏁 Script executed:

# Search for usages of spaceChar throughout the codebase
rg "spaceChar" --type scala -A 3 -B 3

Repository: AbsaOSS/cobrix

Length of output: 3940


🏁 Script executed:

# Look for COMP-3 handling in reader/extractor code
rg "COMP3|0x40" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/ --type scala -A 2 -B 2

Repository: AbsaOSS/cobrix

Length of output: 40


🏁 Script executed:

# Check for empty/null value detection in readers
rg "isEmpty|isNull|empty.*value" cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/ --type scala -A 2 | head -50

Repository: AbsaOSS/cobrix

Length of output: 5846


🏁 Script executed:

# Look for test cases related to COMP-3 empty values
fd ".*Suite\.scala$" cobol-parser/src/test --exec rg -l "COMP3|comp3" {} \;

Repository: AbsaOSS/cobrix

Length of output: 40


🏁 Script executed:

# Find test files mentioning COMP3 or packed decimal
find cobol-parser/src/test -name "*.scala" -exec grep -l "COMP3\|comp3\|packed" {} \;

Repository: AbsaOSS/cobrix

Length of output: 329


🏁 Script executed:

# Check for corrupted field handling related to COMP-3
rg "corrupted|Corrupt" cobol-parser/src/main/scala --type scala -B 2 -A 2 | grep -A 5 -B 5 "COMP3"

Repository: AbsaOSS/cobrix

Length of output: 40


🏁 Script executed:

# Look for isEmpty method implementation and all its usages
rg "def isEmpty" cobol-parser/src/main/scala --type scala -A 10

Repository: AbsaOSS/cobrix

Length of output: 1312


🏁 Script executed:

# Search for any configuration or documentation about null encoding
rg "null.*encod|encod.*null|0x00|zero.*byte" cobol-parser/src/main/scala --type scala | head -20

Repository: AbsaOSS/cobrix

Length of output: 3289


🏁 Script executed:

# Let's look at the BinaryDecoderSpec test
cat cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryDecoderSpec.scala | head -100

Repository: AbsaOSS/cobrix

Length of output: 10149


🏁 Script executed:

# Check isEmpty usage in tests
rg "isEmpty" cobol-parser/src/test --type scala -B 3 -A 3

Repository: AbsaOSS/cobrix

Length of output: 24980


🏁 Script executed:

# Look for COMP3 in the test files we found
rg "COMP3|comp3" cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryDecoderSpec.scala -B 5 -A 5

Repository: AbsaOSS/cobrix

Length of output: 4525


🏁 Script executed:

# Let's check the full isEmpty method
sed -n '102,130p' cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala

Repository: AbsaOSS/cobrix

Length of output: 1087


🏁 Script executed:

# Look for the complete Primitive.scala spaceChar definition with context
sed -n '60,110p' cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala

Repository: AbsaOSS/cobrix

Length of output: 1540


🏁 Script executed:

# Search for any test that calls isEmpty or checks empty values with COMP-3
rg "isEmpty|empty" cobol-parser/src/test --type scala | grep -i "comp3\|packed"

Repository: AbsaOSS/cobrix

Length of output: 40


🏁 Script executed:

# Check for usage of isEmpty method in the codebase
rg "\.isEmpty\(" cobol-parser/src/main --type scala -B 2 | head -40

Repository: AbsaOSS/cobrix

Length of output: 1013


Change COMP‑3 empty byte from 0x40 to 0x00 to properly detect corruption.

COMP-3 is a packed decimal format in which 0x40 is meaningless data. Using 0x40 as the empty indicator causes fields with invalid all-0x40 bytes to be incorrectly classified as empty and excluded from corruption reporting. This contradicts the design principle that null values are encoded as 0x00 for all field types. Changing this to 0x00 aligns with the null encoding scheme and ensures corrupted COMP-3 data is properly detected.

Suggested fix
-          case Some(COMP3()) => 0x40
+          case Some(COMP3()) => 0
-          case Some(COMP3()) => 0x40
+          case Some(COMP3()) => 0
🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala`
around lines 66 - 101, The cached spaceChar value in Primitive.scala incorrectly
treats COMP-3 empty byte as 0x40; locate the private val spaceChar (pattern
matching on dataType) and change the COMP3() cases inside the Integral and
Decimal branches from returning 0x40 to returning 0x00 so COMP-3 fields follow
the project-wide null encoding; ensure both occurrences (the t.compact match in
Integral and the t.compact match in Decimal) are updated and run tests to verify
corruption detection now flags all-0x40 packed fields.

Comment on lines +533 to +540
```scala
  private def getCorruptFieldsGroup: Group = {
    val corruptFieldsInGroup = new mutable.ArrayBuffer[Statement]

    corruptFieldsInGroup += Primitive(15, Constants.fieldNameColumn, Constants.fieldNameColumn, 0, AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
    corruptFieldsInGroup += Primitive(15, Constants.rawValueColumn, Constants.rawValueColumn, 0, AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)

    Group(10, Constants.corruptFieldsField, Constants.corruptFieldsField, 0, children = corruptFieldsInGroup, occurs = Some(10))(None)
  }
```
@coderabbitai bot (Contributor) commented

⚠️ Potential issue | 🟡 Minor

Hardcoded occurs = Some(10) can under‑report max corrupt fields.
This caps maxElements metadata to 10, which can be misleading (and can truncate flattening logic). Consider deriving the upper bound from the copybook (e.g., max possible corrupt fields) or making it configurable.

🤖 Prompt for AI Agents
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala`
around lines 533 - 540, getCorruptFieldsGroup currently hardcodes occurs =
Some(10) which caps maxElements to 10 and can truncate flattening; update
getCorruptFieldsGroup to compute the occurs value instead of using the literal
10 — e.g., derive it from the copybook metadata or a configuration parameter
(use an existing method/property that exposes max possible corrupt fields or add
a config lookup/fallback) and pass that computed Option[Int] into the Group
constructor (replace the literal Some(10) with the computed Option). Ensure the
change uses the same symbols: getCorruptFieldsGroup, Group(..., occurs = ...) so
downstream metadata reflects the true upper bound.
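A hedged sketch of the parameterized variant suggested above. How the bound would be computed (for example, from the number of primitive fields in the copybook) is an assumption:

```scala
// Hypothetical sketch: take the OCCURS bound as a parameter instead of hardcoding 10.
private def getCorruptFieldsGroup(maxCorruptFields: Int): Group = {
  val corruptFieldsInGroup = new mutable.ArrayBuffer[Statement]

  corruptFieldsInGroup += Primitive(15, Constants.fieldNameColumn, Constants.fieldNameColumn, 0,
    AlphaNumeric("X(50)", 50), decode = null, encode = null)(None)
  corruptFieldsInGroup += Primitive(15, Constants.rawValueColumn, Constants.rawValueColumn, 0,
    AlphaNumeric("X(50)", 50, enc = Some(RAW), compact = Some(COMP4())), decode = null, encode = null)(None)

  // Callers could pass, for example, the number of primitive fields in the copybook.
  Group(10, Constants.corruptFieldsField, Constants.corruptFieldsField, 0,
    children = corruptFieldsInGroup, occurs = Some(maxCorruptFields))(None)
}
```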

@yruslan merged commit 8022849 into master on Feb 11, 2026 (7 checks passed)
@yruslan deleted the feature/723-add-corrupted-fields-field branch on February 11, 2026 09:11


Development

Successfully merging this pull request may close these issues.

Add an option to store casting errors in a separate field
