Spark 4.1: Add targetTable support to RewriteTablePath#15412

Open
mxm wants to merge 4 commits into apache:main from mxm:rewrite-table-path-target-table

mxm (Contributor) commented Feb 23, 2026

This enables incremental copy using the target table to automatically
determine where to resume. We validate that source and target are in sync
at the determined version by comparing snapshot IDs and checking that
all manifest files exist.

Example:

```java
Table sourceTable = ...
Table targetTable = ...
actions()
    .rewriteTablePath(sourceTable)
    .rewriteLocationPrefix(sourceLocation, targetLocation)
    // Resolve startVersion from the targetTable
    .targetTable(targetTable)
    .execute();
```

The `targetTable()` parameter takes precedence over `startVersion(..)`.

mxm (Contributor, Author) commented Feb 23, 2026

manuzhang changed the title from "Spark: Add targetTable support to RewriteTablePath" to "Spark 4.1: Add targetTable support to RewriteTablePath" on Feb 23, 2026

manuzhang (Member) commented

It looks a bit strange to have the concept of targetTable for RewriteTablePath, which merely prepares a table for copying to another location. What's the end-to-end process?

mxm (Contributor, Author) commented Feb 23, 2026

Thanks for asking! This is useful for continuous incremental replication of a source table to a destination table. Instead of a one-off full rewrite and copy of a table to a new destination, you repeat the process against an existing copy. To avoid rewriting and copying everything again, you rewrite and copy just the files between the last copied version and the current version of the source table. This is already possible today, but it requires manually setting the startVersion parameter to the last copied version, which is error-prone.
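
The incremental idea described above can be sketched in plain Java. This is a minimal illustration, not the action's implementation: the class `IncrementalCopySketch`, the method `filesToCopy`, and the model of "each version lists the files it added" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model: an ordered list of metadata versions, and for each
// version the data files it introduced. Not part of the Iceberg API.
final class IncrementalCopySketch {
  // Returns the files added after startVersion (exclusive) up to the newest
  // version (inclusive). A null startVersion means nothing has been copied
  // yet, so every file is returned (a full copy).
  static List<String> filesToCopy(
      List<String> versions, Map<String, List<String>> addedFiles, String startVersion) {
    int from = 0;
    if (startVersion != null) {
      int idx = versions.indexOf(startVersion);
      if (idx < 0) {
        throw new IllegalArgumentException("Unknown start version: " + startVersion);
      }
      from = idx + 1;
    }
    List<String> result = new ArrayList<>();
    for (String version : versions.subList(from, versions.size())) {
      result.addAll(addedFiles.getOrDefault(version, List.of()));
    }
    return result;
  }
}
```

Passing the wrong `startVersion` by hand either copies too much or silently skips files, which is the error-prone step that resolving the version from the target table is meant to remove.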

Comment on lines 234 to 235
Preconditions.checkArgument(
    startVersionName == null, "Cannot set both startVersion and targetTable.");
Contributor

This check is better placed in the validateAndSetStartVersion method.

mxm (Contributor, Author), Feb 26, 2026

I agree, we should move this check.
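
One plausible reason for moving the check: if it lives in the `targetTable(..)` setter, a later call to `startVersion(..)` would bypass it, whereas a check at validation time is independent of call order. The sketch below illustrates this with a simplified stand-in class. `StartVersionValidationSketch`, the `String` used in place of a `Table`, and `resolveFromTarget` are hypothetical; only the error message and the method name `validateAndSetStartVersion` come from the thread.

```java
// Simplified stand-in for the action's builder; not the real implementation.
final class StartVersionValidationSketch {
  private String startVersionName; // set via startVersion(..)
  private String targetTableName;  // hypothetical stand-in for a Table reference

  StartVersionValidationSketch startVersion(String name) {
    this.startVersionName = name;
    return this;
  }

  StartVersionValidationSketch targetTable(String name) {
    this.targetTableName = name;
    return this;
  }

  // Runs once all options are set, so the order of the setter calls does not matter.
  String validateAndSetStartVersion() {
    if (targetTableName != null) {
      if (startVersionName != null) {
        throw new IllegalArgumentException("Cannot set both startVersion and targetTable.");
      }
      startVersionName = resolveFromTarget(targetTableName);
    }
    return startVersionName;
  }

  // Placeholder for the real resolution logic, which is not shown in this thread.
  private static String resolveFromTarget(String target) {
    return "version-resolved-from-" + target;
  }
}
```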


private String findVersion(String version, TableMetadata sourceMetadata) {
  String currentSourceMetadataFile = currentMetadataPath(table);
  if (currentSourceMetadataFile.endsWith(version)) {
Contributor

This seems brittle to me. I don't think we should rely on filenames to determine whether the snapshot files are the same; I don't think we have any guarantees there.

Contributor

Maybe relying on snapshot ids would be better?

mxm (Contributor, Author), Feb 26, 2026

I agree. The original code only compared file names to decide whether the version/metadata files are identical. Now, this method is only used to select the candidate version. Afterwards, we check via isSameSnapshot(sourceVersionFullPath, targetVersionFullPath) that the snapshot IDs match.

Contributor

Yeah, but if the filenames have changed, we will not find the snapshot and will fall back to a full copy. So we should probably just skip this check.

Contributor, Author

I wonder how likely a rename is, given the implications it has. If we switch to snapshot IDs, we have to read the metadata file of every version until we find the matching one. That is slower than the filename-based search followed by the snapshot ID verification. I've pushed this change with 1163057.
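
The trade-off discussed here (a cheap filename match to pick a candidate, then a snapshot ID check to confirm it) can be sketched as follows. This is an illustration under stated assumptions: `ResumePointSketch`, `findResumeVersion`, and the `snapshotIdByPath` map are hypothetical, and the map lookup stands in for reading a metadata file. Returning an empty result on a mismatch is a simplification; how the real action reacts in that case is not shown in this thread.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of "filename search, then snapshot ID verification".
final class ResumePointSketch {
  // Last path segment, e.g. "s3://src/metadata/v2.metadata.json" -> "v2.metadata.json".
  static String fileName(String path) {
    return path.substring(path.lastIndexOf('/') + 1);
  }

  static Optional<String> findResumeVersion(
      List<String> sourceMetadataPaths,
      Map<String, Long> snapshotIdByPath,
      String targetMetadataPath) {
    String targetName = fileName(targetMetadataPath);
    for (String sourcePath : sourceMetadataPaths) {
      // Step 1: select a candidate by filename only; no metadata is read yet.
      if (!fileName(sourcePath).equals(targetName)) {
        continue;
      }
      // Step 2: confirm the candidate by comparing snapshot IDs.
      Long sourceId = snapshotIdByPath.get(sourcePath);
      Long targetId = snapshotIdByPath.get(targetMetadataPath);
      if (sourceId != null && sourceId.equals(targetId)) {
        return Optional.of(sourcePath);
      }
    }
    // No verified match: the caller cannot resume from a known version.
    return Optional.empty();
  }
}
```

In this model only the filename-matched candidate requires a metadata lookup, whereas a search purely by snapshot ID would need one lookup per version until a match is found.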

 * @return this for method chaining
 */
default RewriteTablePath targetTable(Table targetTable) {
  return this;
Contributor

Should this throw?

Contributor, Author

Yes, updated this to match the behavior in the other PRs #15381 and #15382.
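
A throwing default could look roughly like the sketch below. The interface `RewriteTablePathSketch`, the two implementing classes, the `Object` parameter, and the exact exception message are assumptions for illustration; the actual change is in the PR itself.

```java
// Hypothetical stand-in for the action interface.
interface RewriteTablePathSketch {
  // Default throws so that implementations without support fail loudly
  // instead of silently ignoring the option (as "return this" would).
  default RewriteTablePathSketch targetTable(Object targetTable) {
    throw new UnsupportedOperationException(
        getClass().getName() + " does not implement targetTable");
  }
}

// An implementation that never added support inherits the throwing default.
final class LegacyAction implements RewriteTablePathSketch {}

// An implementation with support overrides the default and keeps chaining.
final class SupportingAction implements RewriteTablePathSketch {
  private Object target;

  @Override
  public RewriteTablePathSketch targetTable(Object targetTable) {
    this.target = targetTable;
    return this;
  }
}
```

With a no-op default, a caller who sets `targetTable(..)` on an implementation that lacks support would get a full rewrite without any indication that the option was ignored.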
