Conversation
Adds a complete tpcds_benchmark executable (built with -DTPCDS_ENABLE=ON) that generates TPC-DS store_sales and inventory tables into Parquet, CSV, ORC, Lance, Paimon, or Iceberg via the existing tpch::WriterInterface. Build infrastructure (third_party/dsdgen/): - CMakeLists.txt: mkheader → tables.h/streams.h/columns.h → distcomp → tpcds.idx → gen_dsts.py → dsts_generated.c → dsdgen_objs object library - dsdgen_stubs.c: thin file-I/O stubs (not needed in embedded mode) - tpcds_dsdgen.h: C++-safe forward declarations that bypass the fragile config.h/porting.h LINUX/HUGE_TYPE dependency chain; defines ds_key_t, decimal_t, ds_pricing_t, W_STORE_SALES_TBL, W_INVENTORY_TBL, and the EMBEDDED_DSDGEN callback globals - cmake/gen_dsts.py: embeds tpcds.idx as a C byte array (dsts_generated.c) C++ wrappers (include/tpch/, src/dsdgen/): - dsdgen_wrapper.hpp/.cpp: DSDGenWrapper class — initialises dsdgen from the embedded tpcds.idx (via mkstemp temp file), exposes Arrow schemas for STORE_SALES and INVENTORY, and drives generation via the EMBEDDED_DSDGEN callback trampoline (store_sales is master-detail: 8-16 line items per ticket; callback fires once per line item) - dsdgen_converter.hpp/.cpp: append_*_to_builders() helpers mapping the dsdgen C structs to Arrow array builders; dec_to_double() correctly converts decimal_t scaled integers (avoids buggy upstream dectoflt) Executable (src/tpcds_main.cpp): - CLI mirrors tpch_benchmark: --format, --table, --scale-factor, --output-dir, --max-rows, --verbose - Batched Arrow generation loop (10 000 rows/batch) → writer->write_batch third_party/tpcds submodule updated to branch tpcds_cpp_embedded which adds EMBEDDED_DSDGEN guards to w_store_sales.c (callback + suppressed file I/O for both store_sales and store_returns output files). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hers) Implements the following tables in tpcds_benchmark: catalog_sales — master-detail (callback-based, 4-14 line items/ticket) web_sales — master-detail (callback-based, 8-16 line items/ticket) customer — dimension table (simple, 100K rows at SF=1) item — dimension table (simple, ~18K rows at SF=1) date_dim — dimension table (simple, 73,049 fixed rows) store_returns — driven through store_sales ticket loop catalog_returns — driven through catalog_sales ticket loop web_returns — driven through web_sales ticket loop Changes: - tpcds_dsdgen.h: added struct definitions for all 8 new table row types, table ID constants, mk_w_* function declarations, and EMBEDDED_DSDGEN extern callback declarations for catalog_sales and web_sales - dsdgen_wrapper.hpp: added generate_* method declarations for 8 tables - dsdgen_wrapper.cpp: Arrow schemas, trampolines (catalog/web sales), generate_* implementations; returns tables are driven through their parent sales loop using a no-op callback to suppress file output while populating the global sales struct for valid return references - dsdgen_converter.hpp/.cpp: Arrow builder converters for all 8 tables; updated append_dsdgen_row_to_builders() dispatcher - tpcds_main.cpp: updated parse_table(), generation dispatch, and usage text to reflect Phase 3 completion - third_party/tpcds submodule: updated to include w_catalog_sales.c and w_web_sales.c EMBEDDED_DSDGEN callback patches All 10 implemented tables verified: correct row counts, sane values. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
Pull request overview
Adds TPC-DS data generation support by vendoring the upstream tools, embedding distribution data, and exposing a new tpcds_benchmark executable.
Changes:
- Adds a
third_party/tpcdssubmodule and a CMake build for compiling/embedding TPC-DSdsdgen. - Introduces a C++
DSDGenWrapper+ Arrow conversion layer for several initial TPC-DS tables. - Adds a
TPCDS_ENABLEbuild flag and wires a newtpcds_benchmarktarget into the top-level build.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| third_party/tpcds | Adds the TPC-DS tools submodule pin for dsdgen sources. |
| third_party/dsdgen/tpcds_dsdgen.h | Provides a C++-friendly API surface for calling into dsdgen objects. |
| third_party/dsdgen/dsdgen_stubs.c | Placeholder TU for embed-mode stubs (currently comment-only). |
| third_party/dsdgen/CMakeLists.txt | Builds mkheader, distcomp, embeds tpcds.idx, and compiles dsdgen as objects. |
| src/tpcds_main.cpp | New tpcds_benchmark CLI that generates Arrow batches and writes using existing writer backends. |
| src/dsdgen/dsdgen_wrapper.cpp | Initializes embedded distributions + exposes per-table generation APIs with callbacks. |
| src/dsdgen/dsdgen_converter.cpp | Converts dsdgen structs into Arrow builders for supported tables. |
| include/tpch/dsdgen_wrapper.hpp | Public wrapper API + table enum + schema helpers. |
| include/tpch/dsdgen_converter.hpp | Public conversion function declarations. |
| cmake/gen_dsts.py | New generator to embed tpcds.idx as a uint8_t[] C array. |
| CMakeLists.txt | Adds TPCDS_ENABLE option and the tpcds_benchmark target when enabled. |
| .gitmodules | Registers the new third_party/tpcds submodule. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b0d16b9ce
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Adds support for all 24 TPC-DS tables to tpcds_benchmark: call_center, catalog_page, web_page, web_site, warehouse, ship_mode, household_demographics, customer_demographics, customer_address, income_band, reason, time_dim, promotion, store. Key changes: - tpcds_dsdgen.h: fix ds_addr_t (add missing plus4 field); add structs and TPCDS_* constants for all 14 new tables - dsdgen_wrapper.hpp/.cpp: TPCDS_SIMPLE_GENERATE macro + 14 new generators - dsdgen_converter.cpp: append_*_to_builders + dispatcher for 14 tables; append_addr_fields() helper for ds_addr_t (street/city/state/zip/etc.) - tpcds_main.cpp: parse_table() + generation dispatch for all 24 tables Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ructs The old approach manually copied all struct definitions (W_STORE_SALES_TBL, ds_addr_t, etc.) into tpcds_dsdgen.h, which could silently diverge from the actual generator sources — as demonstrated by the missing ds_addr_t.plus4 field that caused stack smashing in call_center/warehouse/customer_address. New approach: include the canonical tpcds w_*.h headers directly. All 24 table structs are now sourced from the real implementation files. The TPCDS_* integer constants are kept as prefixed literals (tables.h defines bare macros like CALL_CENTER=0 that would collide with C++ enum member names). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e enum - Re-added #include "tables.h" to tpcds_dsdgen.h - Changed TPCDS_* from literal values to aliases (e.g., #define TPCDS_CALL_CENTER CALL_CENTER) - This allows using the canonical build-generated constants while avoiding macro collisions - Renamed TableType enum members to PascalCase (CallCenter, CatalogPage, etc.) to avoid colliding with ALL_CAPS macros from tables.h - Updated all TableType:: references throughout dsdgen_wrapper.cpp and tpcds_main.cpp - All 24 TPC-DS tables now fully functional via CLI and generation dispatch This maintains the canonical struct layouts (by including real tpcds headers) while resolving namespace conflicts through C++ enum conventions. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1015c1c823
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Apply Arrow dictionary(int8(), utf8()) encoding to low/medium-cardinality TPC-DS columns, following TPC-H Phase 3.3 pattern (+57% throughput improvement). Encoded columns (50 total): - customer_demographics: cd_gender, cd_marital_status, cd_education_status, cd_credit_rating - customer_address: ca_location_type, ca_state, ca_country, ca_street_type - time_dim: t_am_pm, t_shift, t_sub_shift, t_meal_time - date_dim: d_day_name - item: i_category, i_size, i_color, i_units, i_container - call_center: cc_class, cc_hours, cc_name + address fields (cc_state, cc_country, cc_street_type) - catalog_page: cp_department, cp_type - web_page: wp_type - web_site: web_class + address fields (web_state, web_country, web_street_type) - warehouse: address fields (w_state, w_country, w_street_type) - ship_mode: sm_type, sm_code, sm_carrier - store: s_hours, s_geography_class, s_division_name, s_company_name + address fields (s_state, s_country, s_street_type) - customer: c_salutation - promotion: p_purpose Implementation: - src/tpcds_main.cpp: DICTIONARY type handling in create_builders() and finish_batch() - include/tpch/dsdgen_converter.hpp: get_dict_for_field() declaration - src/dsdgen/dsdgen_converter.cpp: 25+ encode functions + dictionary registry (41 entries) - src/dsdgen/dsdgen_wrapper.cpp: 9 table schemas updated to dict8 All 24 TPC-DS tables validated (SF=1). Expected: +50-60% performance gain at scale (from Phase 3.3 precedent). Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Two targeted fixes to address fact-table regressions found after DS-1
dict8 encoding (store_sales -59%, catalog_sales -74%, web_sales -74%):
1. Disable Parquet auto-dict for int64/int32/float64 columns
- Add make_writer_props(schema) helper in parquet_writer.cpp
- Iterates schema fields; calls disable_dictionary() for numeric types
- Arrow dictionary(int8,utf8) columns unaffected (identified by name)
- Eliminates ScalarMemoTable<double>::GetOrInsert (was 8.85% of CPU)
and ScalarMemoTable<long>::GetOrInsert (was 4.53%) from profiles
- Used in all 3 WriterProperties construction sites
2. std::map → std::unordered_map for builder lookup
- Add BuilderMap type alias in dsdgen_converter.hpp
- Replace all 26 function signatures in dsdgen_converter.cpp
- Replace create_builders/finish_batch/reset_builders in tpcds_main.cpp
- O(log N) string comparison → O(1) hash per column per row
Measured gains (SF=1, Parquet/SNAPPY, avg 2 runs):
web_sales: 356K → 424K r/s (+19%)
catalog_sales: 359K → 414K r/s (+16%)
store_returns: 161K → 167K r/s (+4%)
web_returns: 115K → 121K r/s (+5%)
customer_demographics: 1354K → 1508K r/s (+11%)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… indices
Replace BuilderMap = unordered_map<string, Builder> with
BuilderMap = vector<Builder> (schema-order), with named access via
auto-generated constexpr column index constants.
Code reads:
builders[col::store_returns::sr_returned_date_sk] // readable + zero cost
instead of:
builders["sr_returned_date_sk"] // was: hash per row
New files:
include/tpch/dsdgen_col_idx.hpp — auto-generated constexpr indices
(24 tables, one constexpr size_t per column per table)
scripts/gen_col_indices.py — generates dsdgen_col_idx.hpp and
rewrites builders[N] → builders[col::TABLE::col] in converter
Regenerate after schema changes: python3 scripts/gen_col_indices.py
dsdgen_converter.cpp:
- 374 builders["col"] → builders[col::TABLE::col] (named index)
- append_addr_fields: pfx+string arg → size_t base arg
- Callers pass col::TABLE::PREFIX_street_number (named, self-documenting)
dsdgen_converter.hpp: BuilderMap = vector<shared_ptr<ArrayBuilder>>
tpcds_main.cpp: create_builders push_back, finish_batch/reset by index
Measured gains vs DS-2 unordered_map (SF=1, warm runs):
store_returns: 167K → 198K r/s (+19%)
web_sales: 424K → 777K r/s (+83%)
catalog_sales: 414K → 768K r/s (+86%)
store_sales: 543K →1012K r/s (+86%)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Profiling after DS-3 showed Snappy at ~32% CPU for store_sales, but removing it actually SLOWS things down: compression reduces write volume, hiding I/O latency on WSL2 VirtIO-blk. Snappy remains the default and is the optimal choice. The flag is useful for explicit benchmarking of the tradeoff: --compression snappy (default, optimal for I/O-bound workloads) --compression zstd (better ratio, slightly more CPU) --compression none (fastest encode, most disk I/O) Note: LZ4 is intentionally excluded (Parquet LZ4 interoperability issues). Implementation: - ParquetWriter::set_compression(codec) + compression_codec_ member - parse_compression() maps string → parquet::Compression::type - tpcds_main: --compression CLI flag routed through create_writer() Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add cmake/gen_dist_cache.py that reads tpcds.idx at build time and emits static const int[] / decimal_t[] arrays for all 42 TKN_INT and 0 TKN_DECIMAL value sets across 67 TPC-DS distributions. CMake integration in third_party/dsdgen/CMakeLists.txt: - Step 2c: gen_dist_cache.py → dist_cache_generated.c (51 KB) - dist_cache_generated.c added to dsdgen_objs OBJECT library - New tpcds_dist_cache_gen target The generated tpcds_lookup_int_cache() / tpcds_lookup_dec_cache() are referenced by dist.c (in submodule commit) when EMBEDDED_DSDGEN is defined. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ports the same cardinality-hint pattern from TPC-H (Phase 3.2/3.5) to TPC-DS schemas. Without hints Lance runs XXH3+HLL per batch per utf8 column, causing store_sales SF=5 to degrade to ~86K rows/s. Changes: - Add tpcds_field() helper (mirrors tpch_field()) that attaches lance.cardinality metadata to Arrow utf8 fields - Pass scale_factor to get_schema() so SF-scaled counts are correct - Annotate every utf8 field in all 18 TPC-DS dimension+fact schemas with the appropriate bounded cardinality from TPC-DS v3 spec - Convert hd_buy_potential (6 values) from utf8 to dict8; add encode_hd_buy_potential() and register the dictionary in get_dict_for_field() Result: store_sales SF=5 Lance throughput 86K -> 1.13M rows/s (13x). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In Apache ORC all integer types (tinyint/smallint/int/bigint) are stored using LongVectorBatch — there is no IntVectorBatch for the `int` (32-bit) ORC type. The previous code performed a dynamic_cast to orc::IntVectorBatch for Arrow INT32 arrays which always returned nullptr, causing every table with int32 fields to crash with "Failed to cast ORC column to IntVectorBatch". Fix: cast to LongVectorBatch (same as INT64 path), matching the ORC C++ library contract. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Enables Parquet streaming write path (ParquetWriter::enable_streaming_write()) and Lance streaming path (LanceWriter::enable_streaming_write(true)) via a new --zero-copy / -z CLI flag. Impact on store_sales SF=5: Parquet without --zero-copy: 1.00M r/s, 4.2 GB peak RSS Parquet with --zero-copy: 1.36M r/s, 271 MB peak RSS (15× less RAM, 36% faster) Lance: both modes stream via Rust FFI — no C++ batch accumulation regardless. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Change batch_size from 10,000 → 8,192 rows in run_generation() to match Lance's default max_rows_per_group. This ensures C++ batches align to Lance row-group boundaries, eliminating split/leftover rows at edges. Also benefits Parquet and ORC stripe alignment. SF=10 store_sales results (28.8M rows): Lance buffered: 290K → 868K r/s after alignment Lance zero-copy: stability improved Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Add -DTPCDS_ENABLE=ON to all build-matrix configs (base, orc, lance) so tpcds_benchmark is built and uploaded as part of the CI artifact - Rename benchmark-suite → tpch-benchmark-suite and optimization-benchmarks → tpch-optimization-benchmarks for clarity - Add tpcds-benchmark-suite job: 12 matrix entries covering csv/parquet × 4 tables, orc/lance × 2 tables (store_returns, store_sales, customer, item); tpcds.idx is embedded in the binary so no runtime data files are needed - Add tpcds-optimization-benchmarks job: 8 matrix entries covering parquet/lance × baseline/zero-copy × store_returns/store_sales - Update results-aggregation to wait on all 4 benchmark jobs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
This PR adds a dedicated
tpcds_benchmarkexecutable and integrates the TPC-DSdsdgenreference code into this repository in a form usable from the existing Arrow-based writer pipeline.The branch started as initial TPC-DS table support and grew to include:
What is included
1. TPC-DS benchmark executable
tpcds_benchmarktpch_benchmark2. Embedded TPC-DS dsdgen integration
tpcds.idxdistribution data at build time3. Table support
Initial phases added fact tables and then expanded to the full set used by the benchmark work in this branch, including:
store_salesinventorycatalog_salesweb_salesstore_returnscatalog_returnsweb_returnscustomer,item,date_dim,call_center,catalog_page,web_page,web_site,warehouse,ship_mode,household_demographics,customer_demographics,customer_address,income_band,reason,time_dim,promotion,store4. Arrow conversion layer
lance.cardinalitymetadata hints on TPC-DS schemas5. Output format support
Additional format-related work in this branch:
--compressionfor Parquet (snappy,zstd,none)INT32handlingLance zero-copy / streaming work
A substantial part of this branch investigates and improves Lance behavior for TPC-DS generation.
Added functionality
--zero-copystreaming mode intpcds_benchmarksync,auto,async)Main conclusions from the investigation
synczero-copy is the practical defaultasyncpath incurred large extra RSS and Tokio overhead for this workload128batches /1,048,576rows significantly reduced fragment/manifest churn and improvedstore_salesscalingCurrent default from this branch
tpcds_benchmark --format lance --zero-copydefaults to the bounded sync pathProfiling / benchmarking artifacts
This branch also includes investigation notes and generated benchmark artifacts under
benchmark-results/, including:These files are included because they were used to justify behavior changes in the Lance path and to document the observed tradeoffs.
Review-driven cleanup included
ValueOrDie()in the TPC-DS batch finalization path--scale-factorstatic_assertchecks tyingTableTypevalues to generated TPC-DS table constantsstore_sales,catalog_sales,web_sales)Known limitations / follow-up items
*_returnsgeneration semantics still deserve a follow-up review; there is an open question about whether benchmark convenience should preserve the native sparse return-rate exactlytpcds_benchmarktable dispatch remains explicitif/elselogic for now; this could be converted to a dispatch table later if desiredTesting / validation
Validation in this branch was primarily benchmark- and profiler-driven:
tpcds_benchmarkruns on major fact tablesstore_salesperf record/perf statprofiling of Lance zero-copy paths