Allow to read iceberg table data from any location #1461

Merged
zvonand merged 8 commits into antalya-26.1 from backport/antalya-26.1/90740
Mar 6, 2026

@zvonand (Collaborator) commented Feb 27, 2026

Supersedes #1092, #1163, #1212

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support Iceberg tables whose files are located outside the table location or on a different storage.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions bot commented Feb 27, 2026

Workflow [PR], commit [8fd3016]

@zvonand zvonand force-pushed the backport/antalya-26.1/90740 branch from 66845e0 to c424634 Compare March 3, 2026 14:18
@vzakaznikov (Collaborator) commented

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1461 (Iceberg external location support):

Confirmed defects:

  • Medium: Inconsistent same-storage key normalization after prefetch exhaustion

    • Impact: Iceberg reads can fail when the number of absolute-path files exceeds the prefetch window (max_threads), because later tasks may keep a non-normalized path while still using the base storage.
    • Anchor: src/Storages/ObjectStorage/StorageObjectStorageSource.cpp / ReadTaskIterator::next() vs constructor prefetch branch.
    • Trigger: absolute same-storage paths with file count > prefetch size.
    • Why defect: path normalization behavior differs by iterator branch, so correctness depends on task position.
    • Fix direction (short): always apply resolved key rewrite when absolute path is present; keep storage-switch decision separate.
    • Regression test direction (short): integration case with >max_threads files using absolute same-storage paths and validation that all tasks resolve key consistently.
  • Low: Secondary storage creation is performed under shared cache mutex

    • Impact: high contention and elevated deadlock-risk surface under concurrent cold-cache resolutions.
    • Anchor: src/Storages/ObjectStorage/Utils.cpp / getOrCreateStorageAndKey.
    • Trigger: concurrent scans that resolve many external locations simultaneously.
    • Why defect: potentially heavy ObjectStorageFactory::create(...) is executed inside locked section.
    • Fix direction (short): use double-checked creation (check under lock, create outside lock, emplace under lock).
    • Regression test direction (short): concurrent resolver stress test over many distinct cache keys; assert no deadlock and bounded latency.

Coverage summary:

  • Scope reviewed: resolver/iterator/position-delete/protocol paths touched by this PR.
  • Categories failed: path normalization parity; cache-lock concurrency.
  • Categories passed: protocol compatibility and unsupported-scheme error handling.
  • Assumptions/limits: static audit only; no runtime test execution in this pass.

@zvonand zvonand added the port-antalya PRs to be ported to all new Antalya releases label Mar 3, 2026
@zvonand zvonand merged commit 193f79a into antalya-26.1 Mar 6, 2026
375 of 466 checks passed
@alsugiliazova (Member) commented

PR #1461 CI Verification Report

Summary

Support Iceberg tables that have files outside the table location or on a different storage. The PR changes 40 files, primarily in src/Storages/ObjectStorage/DataLakes/Iceberg/, and adds a new test_storage_iceberg_multistorage integration test suite.

PR's Own Integration Tests

All 18 test_storage_iceberg_multistorage tests PASSED (amd_binary):

| Test | Status |
|---|---|
| test_multi_storage_combinations[s3-s3-s3-s3] | OK |
| test_multi_storage_combinations[s3-s3-s3-local] | OK |
| test_multi_storage_combinations[s3-s3-local-s3] | OK |
| test_multi_storage_combinations[s3-s3-local-local] | OK |
| test_multi_storage_combinations[s3-local-s3-s3] | OK |
| test_multi_storage_combinations[s3-local-s3-local] | OK |
| test_multi_storage_combinations[s3-local-local-s3] | OK |
| test_multi_storage_combinations[s3-local-local-local] | OK |
| test_multi_storage_combinations[azure-azure-azure-azure] | OK |
| test_multi_storage_combinations[azure-azure-azure-local] | OK |
| test_multi_storage_combinations[azure-azure-local-azure] | OK |
| test_multi_storage_combinations[azure-azure-local-local] | OK |
| test_multi_storage_combinations[azure-local-azure-azure] | OK |
| test_multi_storage_combinations[azure-local-azure-local] | OK |
| test_multi_storage_combinations[azure-local-local-azure] | OK |
| test_multi_storage_combinations[azure-local-local-local] | OK |
| test_multi_storage_combinations[local-local-local-local] | OK |
| test_four_different_s3_buckets | OK |

MasterCI Failures

| # | Suite | Failing Test | Error | Related to PR | Flakiness (90d) |
|---|---|---|---|---|---|
| 1 | Stateless (amd_asan, dist plan, parallel 2/2) | 03572_export_merge_tree_part_limits_and_table_functions | Result differs with reference | No | 20 fails / 12 days |
| 2 | Stateless (amd_tsan, parallel 2/2) | 03572_export_merge_tree_part_limits_and_table_functions | Result differs with reference | No | 20 fails / 12 days |
| 3 | Stateless (amd_binary, old analyzer, s3, DBReplicated, seq) | 01111_create_drop_replicated_db_stress | Database dropped/renamed concurrently | No | 44 fails / 29 days |
| 4 | Integration (amd_asan, db disk, old analyzer, 2/6) | test_storage_delta/test_cdf.py::test_cdf[] | Query result mismatch | No | 88 fails / 27 days |
| 5 | Integration (amd_tsan, 1/6) | test_export_replicated_mt_partition...::test_mutations_after_export_partition_started | Query timeout | No | Pre-existing on parent |
| 6 | Integration (amd_tsan, 1/6) | test_export_replicated_mt_partition...::test_patch_parts_after_export_partition_started | Wait for export start timeout | No | Pre-existing on parent |
| 7 | Integration (amd_tsan, 1/6) | test_export_replicated_mt_partition...::test_mutation_in_partition_clause | Wait for export status timeout | No | Pre-existing on parent |
| 8 | Integration (amd_tsan, 1/6) | test_export_replicated_mt_partition...::test_export_partition_with_mixed_computed_columns | Wait for export status timeout | No | Pre-existing on parent |
| 9 | Integration (amd_tsan, 2/6) | test_restore_db_replica...::test_query_after_restore_db_replica[rename-no exists-no restart] | Restore table failure | No | 40 fails / 24 days |
| 10 | Integration (amd_tsan, 2/6) | test_restore_db_replica...::test_query_after_restore_db_replica[alter-with exists-no restart] | Restore table failure | No | 40 fails / 20 days |
| 11 | Stateless (amd_tsan, s3, sequential 1/2) | Infrastructure | Only 3 tests ran (Failures: 1/3) | No | Infrastructure issue |
| 12 | Stateless (amd_tsan, s3, sequential 2/2) | Infrastructure | Only 3 tests ran (Failures: 1/3) | No | Infrastructure issue |

All MasterCI failures are unrelated to the PR. They are either known flaky tests (verified via cidb), pre-existing on the parent commit (77b2934), or infrastructure issues.

Regression Tests (clickhouse-regression)

All regression failures were pre-existing on the parent commit (77b2934):

| Suite | Aarch64 | Release | Pre-existing |
|---|---|---|---|
| swarms | Fail | Fail | Yes |
| settings | Fail | Fail | Yes |
| iceberg_1 | Fail | Fail | Yes |
| iceberg_2 | Fail | Fail | Yes |
| parquet | Fail | Fail | Yes |
| s3_export_part | - | Fail | Yes |
| alter_attach_1 | Fail | - | Unrelated to Iceberg |

Manual Verification

This PR was additionally verified manually using the localfileio example from the Altinity/ice repository. The example demonstrates reading Iceberg tables with local file storage through the Iceberg REST Catalog, confirming the PR's functionality of reading table data from different storage locations.

Conclusion

PR #1461 is verified. All 18 PR-specific integration tests passed. All CI failures are pre-existing or known flaky tests, none related to the PR changes. The PR was also manually verified using the Altinity/ice localfileio example.

@alsugiliazova alsugiliazova added the verified Verified by QA label Mar 6, 2026
@alsugiliazova (Member) commented

Audit Report: PR #1461

Possible defects (High/Medium/Low)

Medium

  • Inconsistent same-storage key normalization after prefetch exhaustion
    • Impact: Iceberg reads can fail for absolute-path files when the iterator falls back from prefetched buffer to callback tasks (e.g., file count exceeds prefetch window). Behavior depends on task position, not input semantics.
    • Anchor: src/Storages/ObjectStorage/StorageObjectStorageSource.cpp / ReadTaskIterator::{ReadTaskIterator,next}.
    • Fault-injection trigger: absolute Iceberg paths that resolve to base storage but require key normalization (e.g., URI to key conversion), with enough files to hit both constructor-prefetched and next() callback branches.
    • Transition mapping: T2 -> T3 invariant I2/I3 break.
    • Why defect: constructor branch rewrites key whenever key is non-empty; next() branch rewrites only when storage_to_use != object_storage. Equivalent paths should produce equivalent key normalization regardless of branch.
    • Smallest logical repro:
      1. Provide Iceberg metadata containing absolute same-storage paths.
      2. Ensure object count > prefetch window (max_threads) so iterator uses both branches.
      3. Observe some tasks use normalized key while later tasks keep unnormalized path.
      4. Metadata/read calls can target wrong key form and fail intermittently by task position.
    • Likely fix direction:
      • Always apply resolved-key rewrite when key is non-empty in both branches.
      • Keep storage-switch decision independent from key-normalization decision.
      • Add branch-parity assertion/tests for prefetch vs callback path.
    • Regression test direction:
      • Integration test with >max_threads absolute same-storage files.
      • Assert all tasks use identical normalized key form.
      • Cover both prefetched and callback branches explicitly.
    • Affected subsystem and blast radius: object-storage read iterator for Iceberg external-location tables; impacts correctness/reliability of reads under realistic multi-file workloads.
    • Evidence:
+#if USE_AVRO
+            if (auto iceberg_info = std::dynamic_pointer_cast<IcebergDataObjectInfo>(object))
+            {
+                if (auto abs_path = iceberg_info->getAbsolutePath())
+                {
+                    auto [storage_to_use, key] = resolveObjectStorageForPath(
+                        table_location, *abs_path, object_storage, secondary_storages, getContext());
+                    if (!key.empty())
+                    {
+                        iceberg_info->setResolvedStorage(storage_to_use);
+                        iceberg_info->relative_path_with_metadata.relative_path = key;
+                    }
+                }
+            }
...
+            if (auto abs_path = iceberg_info->getAbsolutePath())
+            {
+                auto [storage_to_use, key] = resolveObjectStorageForPath(
+                    table_location, *abs_path, object_storage, secondary_storages, getContext());
+                if (!key.empty() && storage_to_use != object_storage)
+                {
+                    iceberg_info->setResolvedStorage(storage_to_use);
+                    iceberg_info->relative_path_with_metadata.relative_path = key;
+                }
+            }
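
The "likely fix direction" above (normalize the key in both branches, decide the storage switch separately) could look roughly like the following. This is a minimal standalone sketch, not the PR's code: `Storage`, `TaskInfo`, `resolveForPath` and `applyResolvedLocation` are hypothetical stand-ins for the real ClickHouse types and for `resolveObjectStorageForPath`, and the toy resolver only understands `s3://<name>/` prefixes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real object-storage and task types.
struct Storage { std::string name; };
using StoragePtr = std::shared_ptr<Storage>;

struct TaskInfo
{
    std::string relative_path;   // key used for the actual read
    StoragePtr resolved_storage; // null means "read from the base storage"
};

// Toy resolver (assumption): maps "s3://<name>/<key>" to {storage, key}.
// Returns an empty key when the path matches neither storage.
std::pair<StoragePtr, std::string> resolveForPath(
    const std::string & abs_path, const StoragePtr & base, const StoragePtr & secondary)
{
    const StoragePtr candidates[] = {base, secondary};
    for (const StoragePtr & storage : candidates)
    {
        const std::string prefix = "s3://" + storage->name + "/";
        if (abs_path.compare(0, prefix.size(), prefix) == 0)
            return {storage, abs_path.substr(prefix.size())};
    }
    return {StoragePtr{}, std::string{}};
}

// One helper shared by the prefetch branch and the callback branch, so the
// result cannot depend on which branch produced the task.
void applyResolvedLocation(
    TaskInfo & task, const std::string & abs_path,
    const StoragePtr & base, const StoragePtr & secondary)
{
    const auto resolved = resolveForPath(abs_path, base, secondary);
    if (resolved.second.empty())
        return; // unresolved: leave the task untouched

    // Key normalization: applied whenever a key was resolved.
    task.relative_path = resolved.second;

    // Storage switch: a separate decision, taken only for a non-base storage.
    if (resolved.first != base)
        task.resolved_storage = resolved.first;
}
```

With a helper of this shape, a same-storage absolute path such as `s3://base/data/f1.parquet` ends up with the key `data/f1.parquet` regardless of task position, while a path on another storage additionally records the storage to read from.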

Low

  • Secondary storage creation is performed under shared cache mutex
    • Impact: elevated lock contention and larger deadlock-risk surface under concurrent cold-cache external-location resolution.
    • Anchor: src/Storages/ObjectStorage/Utils.cpp / getOrCreateStorageAndKey.
    • Fault-injection trigger: concurrent scans resolving many distinct external locations/storage endpoints.
    • Transition mapping: T4 invariant I4 break.
    • Why defect: potentially heavy ObjectStorageFactory::create(...) executes while secondary_storages.mutex is held, serializing unrelated misses and extending critical section.
    • Smallest logical repro:
      1. Start concurrent reads over Iceberg metadata that references many unique external storages.
      2. Force cache misses in secondary_storages.storages.
      3. Observe serialized creation under one mutex, with high wait time and reduced throughput.
    • Likely fix direction:
      • Use double-checked creation: check under lock, create outside lock, emplace under lock.
      • Optionally use per-key in-flight map to suppress duplicate creates without global lock hold.
      • Keep factory call out of global critical section.
    • Regression test direction:
      • Multi-thread stress test with many distinct cache keys.
      • Assert absence of long mutex hold and bounded p95/p99 resolve latency.
    • Affected subsystem and blast radius: shared object-storage resolver cache used by Iceberg multistorage paths; primarily reliability/performance, potentially availability under extreme contention.
    • Evidence:
+std::pair<ObjectStoragePtr, std::string> getOrCreateStorageAndKey(
+    const std::string & cache_key,
+    const std::string & key_to_use,
+    const std::string & storage_type,
+    SecondaryStorages & secondary_storages,
+    const ContextPtr & context,
+    std::function<void(Poco::Util::MapConfiguration &, const std::string &)> configure_fn)
+{
+    std::lock_guard lock(secondary_storages.mutex);
+    if (auto it = secondary_storages.storages.find(cache_key); it != secondary_storages.storages.end())
+        return {it->second, key_to_use};
+    ...
+    /// Create under lock to avoid duplicate creation and wasted work
+    ObjectStoragePtr storage = ObjectStorageFactory::instance().create(cache_key, *cfg, config_prefix, context, /*skip_access_check*/ true);
+    secondary_storages.storages.emplace(cache_key, storage);
+    return {storage, key_to_use};
+}
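
The double-checked creation suggested above (check under lock, create outside lock, emplace under lock) can be sketched as follows. This is an illustration under stated assumptions rather than a patch for `getOrCreateStorageAndKey`: `Storage`, `StorageCache`, `getOrCreate` and the `create_fn` callback are hypothetical names, and the callback stands in for the potentially heavy `ObjectStorageFactory::create(...)` call.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real storage and cache types.
struct Storage { std::string endpoint; };
using StoragePtr = std::shared_ptr<Storage>;

struct StorageCache
{
    std::mutex mutex;
    std::map<std::string, StoragePtr> storages;
};

StoragePtr getOrCreate(
    StorageCache & cache,
    const std::string & cache_key,
    const std::function<StoragePtr(const std::string &)> & create_fn)
{
    {
        // First check: short critical section, lookup only.
        std::lock_guard<std::mutex> lock(cache.mutex);
        auto it = cache.storages.find(cache_key);
        if (it != cache.storages.end())
            return it->second;
    }

    // Potentially slow creation runs without the lock. Two threads that miss
    // on the same key at the same time may both reach this point.
    StoragePtr created = create_fn(cache_key);

    // Second check: emplace keeps the existing entry if another thread won
    // the race, so every caller gets the same cached instance.
    std::lock_guard<std::mutex> lock(cache.mutex);
    return cache.storages.emplace(cache_key, std::move(created)).first->second;
}
```

The trade-off is the one the original code comment names: creating under the lock avoids duplicate creation, whereas this variant accepts an occasional discarded duplicate in exchange for not serializing unrelated cache misses. A per-key in-flight map, as mentioned in the fix direction, would remove the duplicate work at the cost of more bookkeeping.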

Labels

antalya-26.1 antalya-26.1.3.20001 port-antalya PRs to be ported to all new Antalya releases verified Verified by QA
