Core: update manifest delete file size after rewrite table action#15470

Open
mbutrovich wants to merge 3 commits into apache:main from mbutrovich:delete_file_size_after_rewrite
@mbutrovich mbutrovich commented Feb 27, 2026

Which issue does this PR close?

Closes #12554.

Rationale for this change

rewriteTablePath rewrites position delete files (updating embedded data file paths), which changes their size. The manifest was written before the delete files were rewritten, so file_size_in_bytes in the manifest reflected the original size. Readers that trust this field (Trino, Impala, Comet, iceberg-rust) fail with errors like "end of stream not reached."
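To illustrate why a stale size breaks readers, here is a toy sketch. It assumes a made-up format whose last four bytes are a magic marker, loosely modeled on formats that locate their footer from the end of the file. `StaleSizeDemo`, `writeFile`, and `footerFound` are hypothetical names, not Iceberg or Parquet APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy illustration only: a made-up format that ends with a 4-byte magic marker.
class StaleSizeDemo {
  static final byte[] MAGIC = "END1".getBytes(StandardCharsets.US_ASCII);

  // Builds a "file": the embedded data file path followed by the trailing magic.
  static byte[] writeFile(String embeddedPath) {
    byte[] body = embeddedPath.getBytes(StandardCharsets.UTF_8);
    byte[] out = Arrays.copyOf(body, body.length + MAGIC.length);
    System.arraycopy(MAGIC, 0, out, body.length, MAGIC.length);
    return out;
  }

  // A reader that trusts the size recorded in metadata instead of asking storage.
  static boolean footerFound(byte[] file, long recordedSize) {
    int end = (int) recordedSize;
    if (end < MAGIC.length || end > file.length) {
      return false; // the recorded size points outside the file
    }
    byte[] tail = Arrays.copyOfRange(file, end - MAGIC.length, end);
    return Arrays.equals(tail, MAGIC);
  }

  public static void main(String[] args) {
    byte[] original = writeFile("/warehouse/a/deeply/nested/source/table/data/f1.parquet");
    byte[] rewritten = writeFile("/t/data/f1.parquet");
    // The manifest still records the original size, but the file on disk is the rewritten one.
    System.out.println("actual size: " + footerFound(rewritten, rewritten.length));
    System.out.println("stale size:  " + footerFound(rewritten, original.length));
  }
}
```

With the size measured after the rewrite the footer is found. With the size recorded before the rewrite the reader looks in the wrong place and fails.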

What changes are included in this PR?

Moves position delete file rewriting into the manifest-writing Spark task, eliminating the separate rewritePositionDeletes() Spark job. Each manifest-writing task now also rewrites the delete files that manifest references, measures the actual size via getLength(), and records it in the manifest entry.

This means:

  • No extra Spark job — down from two jobs to one
  • Parallelism preserved — one Spark task per manifest, same as before
  • Correct file_size_in_bytes at manifest write time, no reconciliation needed

Trade-off: if the same delete file appears in multiple manifests, it gets rewritten redundantly. The staging path is deterministic so the output is identical — wasted I/O, not incorrect. In practice this is rare.
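The rewrite-then-measure ordering can be sketched as follows. This is a minimal illustration under stated assumptions: local files stand in for Iceberg's `FileIO`, a plain string replace stands in for the position delete rewrite, and `InlineRewriteSketch`, `stagingPath`, and `rewriteAndMeasure` are hypothetical names that only mirror the idea in this PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: local files stand in for FileIO; all names here are hypothetical.
class InlineRewriteSketch {
  // Deterministic: the same source location always maps to the same staging file.
  static Path stagingPath(String location, String sourcePrefix, Path stagingDir) {
    if (!location.startsWith(sourcePrefix)) {
      throw new IllegalArgumentException("Location is not under source prefix: " + location);
    }
    String relative = location.substring(sourcePrefix.length());
    // Flatten the relative path so the sketch does not need to create directories.
    return stagingDir.resolve(relative.replace('/', '_'));
  }

  // Rewrites embedded paths, writes the result to staging, and returns the size
  // measured from the staged file rather than the size of the original.
  static long rewriteAndMeasure(
      String content, String location, String sourcePrefix, String targetPrefix, Path stagingDir) {
    Path staged = stagingPath(location, sourcePrefix, stagingDir);
    try {
      String rewritten = content.replace(sourcePrefix, targetPrefix);
      // Files.write truncates an existing file, so a duplicate rewrite overwrites in place.
      Files.write(staged, rewritten.getBytes(StandardCharsets.UTF_8));
      return Files.size(staged);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to rewrite position delete file " + location, e);
    }
  }

  public static void main(String[] args) {
    Path stagingDir = Paths.get(System.getProperty("java.io.tmpdir"));
    String source = "/warehouse/a/deeply/nested/source/table";
    String content = source + "/data/f1.parquet,0\n";
    long size =
        rewriteAndMeasure(content, source + "/data/d1-deletes.parquet", source, "/t", stagingDir);
    System.out.println("size to record in the manifest entry: " + size);
  }
}
```

Calling `rewriteAndMeasure` twice for the same delete file writes the same bytes to the same staging path, which is why the redundant rewrite costs I/O but does not change the result.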

How are these changes tested?

New test testDeleteFileSizeInBytesAfterRewrite creates a table with position deletes using a deeply nested path (so the rewritten path differs in length), runs rewriteTablePath, copies the result, and asserts that file_size_in_bytes in the rewritten manifest matches the actual file size on disk.

AI Usage

I am more familiar with the iceberg-rust codebase, so Claude helped me navigate the code, prototype a design, and draft the PR description (in DataFusion Comet's PR template). Claude also helped me with the API change failures in CI.

@mbutrovich mbutrovich changed the title Core: update delete file size after rewrite table action Core: update manifest delete file size after rewrite table action Feb 27, 2026
Comment on lines +407 to +414
"1.10.0":
  org.apache.iceberg:iceberg-api:
  - code: "java.class.defaultSerializationChanged"
    old: "class org.apache.iceberg.encryption.EncryptingFileIO"
    new: "class org.apache.iceberg.encryption.EncryptingFileIO"
    justification: "New method for Manifest List reading"
  org.apache.iceberg:iceberg-core:
  - code: "java.class.noLongerInheritsFromClass"
Contributor

Was this moved unintentionally?

// Rewrite inline so the manifest records the actual file size, which changes because
// embedded data file paths are rewritten. The staging path is deterministic, so
// duplicates across manifests simply overwrite with identical content.
String staging = stagingPath(file.location(), sourcePrefix, stagingLocation);
Contributor

Nit: Maybe call the variable also stagingPath?

  throw new UncheckedIOException(
      "Failed to rewrite position delete file " + file.location(), e);
}
long actualSize = io.newInputFile(staging).getLength();
Contributor

getLength() is a HEAD call, isn't it? I'm wondering if we can get the length from PositionDeleteReaderWriter somehow so we don't need to use getLength().
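One way to avoid the extra storage call, sketched under assumptions: wrap the output stream in a byte counter and read the total once the writer has finished. `CountingOutputStream` and `WriterLengthSketch` below are hypothetical helpers built on the JDK only. Whether `PositionDeleteReaderWriter` can expose such a count is not established in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an Iceberg class: counts the bytes that pass through it.
class CountingOutputStream extends FilterOutputStream {
  private long count = 0;

  CountingOutputStream(OutputStream out) {
    super(out);
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    count++;
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
    count += len;
  }

  // Only meaningful as a file size once everything, including any footer, is written.
  long bytesWritten() {
    return count;
  }
}

class WriterLengthSketch {
  // Writes the content and returns the number of bytes written, with no size lookup afterwards.
  static long writeAndCount(OutputStream sink, String content) {
    try (CountingOutputStream out = new CountingOutputStream(sink)) {
      out.write(content.getBytes(StandardCharsets.UTF_8));
      out.flush();
      return out.bytesWritten();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    long written = writeAndCount(sink, "/t/data/f1.parquet,0\n");
    System.out.println("bytes written: " + written + ", sink size: " + sink.size());
  }
}
```

The count is only a valid file size if the writer buffers nothing beyond the counted stream, so it would need to be read after the writer is closed.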

rewritePositionDeletes(deleteFiles);
int rewrittenDeleteFilesCount =
    (int)
        rewriteManifestResult.toRewrite().stream().filter(e -> e instanceof DeleteFile).count();
Contributor

We may have to deduplicate the delete files before counting. Previously we used Collectors.toSet().
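A minimal sketch of the deduplication suggested here, assuming delete files are identified by their location. `FileRef` is a hypothetical stand-in for the entries returned by `toRewrite()`, not an Iceberg type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DedupCountSketch {
  // Hypothetical stand-in for a file entry; only the location matters for deduplication here.
  static final class FileRef {
    final String location;
    final boolean isDelete;

    FileRef(String location, boolean isDelete) {
      this.location = location;
      this.isDelete = isDelete;
    }
  }

  // Counts each delete file once, even if several manifests reference it.
  static int countDistinctDeleteFiles(List<FileRef> toRewrite) {
    return toRewrite.stream()
        .filter(f -> f.isDelete)
        .map(f -> f.location)
        .collect(Collectors.toSet())
        .size();
  }

  public static void main(String[] args) {
    List<FileRef> files =
        Arrays.asList(
            new FileRef("/t/data/d1-deletes.parquet", true),
            new FileRef("/t/data/d1-deletes.parquet", true), // same file, second manifest
            new FileRef("/t/data/d2-deletes.parquet", true),
            new FileRef("/t/data/f1.parquet", false));
    System.out.println("distinct delete files: " + countDistinctDeleteFiles(files));
  }
}
```

Counting with `stream().filter(...).count()` alone would count the shared delete file twice. Collecting locations into a set first counts it once.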

Development

Successfully merging this pull request may close these issues.

Missing size rewrite in rewrite_table_path for delete file