Spark 4.1: Add targetTable support to RewriteTablePath#15412
Spark 4.1: Add targetTable support to RewriteTablePath#15412mxm wants to merge 4 commits intoapache:mainfrom
Conversation
This enables incremental copy using the target table to automatically
determine where to resume. We validate that source and target are in sync
at the determined version by comparing snapshot IDs and checking that
all manifest files exist.
Example:
```java
Table sourceTable = ...
Table targetTable = ...
actions()
.rewriteTablePath(sourceTable)
.rewriteLocationPrefix(sourceLocation, targetLocation)
// Resolve startVersion from the targetTable
.targetTable(targetTable)
.execute();
```
The `targetTable()` parameter takes precedence over `startVersion(..)`.
|
It looks a bit strange to have the concept of |
|
Thanks for asking! This is useful for continuous incremental replication of a source table to a destination table. Rather than doing a one-off full rewrite/copy of a table to a new destination, you want to repeat the process against an existing copy. In order to avoid having to rewrite/copy everything again, you want to rewrite/copy just the files between the last copied version and the current version of the source table. This is already possible today, but it requires setting the |
| Preconditions.checkArgument( | ||
| startVersionName == null, "Cannot set both startVersion and targetTable."); |
There was a problem hiding this comment.
This is better placed in the validateAndSetStartVersion method
There was a problem hiding this comment.
I agree, we should move this check.
|
|
||
| private String findVersion(String version, TableMetadata sourceMetadata) { | ||
| String currentSourceMetadataFile = currentMetadataPath(table); | ||
| if (currentSourceMetadataFile.endsWith(version)) { |
There was a problem hiding this comment.
This seems brittle to me. I don't think we should rely on filenames to determine if the snapshot files are the same or not. I don't think we have guarantees there
There was a problem hiding this comment.
Maybe relying on snapshot ids would be better?
There was a problem hiding this comment.
I agree. The original code did only file name comparison to assess that the version/metadata files are identical. Now, this method is only used to select the candidate version. Afterwards, we check via isSameSnapshot(sourceVersionFullPath, targetVersionFullPath) that the snapshots ids match.
There was a problem hiding this comment.
Yeah, but if the filenames are changed then we will not find the snapshot and do a full copy. So probably we should just skip this check.
There was a problem hiding this comment.
I wonder how likely a rename is, given the implications this has. If we switch to snapshot ids, this means reading the metadata file for all versions until the matching one. This is slower than the filename-based search followed by the snapshot id verification. I've pushed this change with 1163057.
| * @return this for method chaining | ||
| */ | ||
| default RewriteTablePath targetTable(Table targetTable) { | ||
| return this; |
This enables incremental copy using the target table to automatically determine where to resume. We validate that source and target are in sync at the determined version by comparing snapshot IDs and checking that all manifest files exist.
Example:
The
targetTable()parameter takes precedence overstartVersion(..).