A code-generating GPU database system that executes all 22 TPC-H benchmark queries using CUDA. Queries are compiled at runtime with either NVRTC (fast) or NVCC, and kernel code is generated through a composable operator framework. CODEGENeral is extensible and supports benchmarking on custom-defined schemas and queries.
```
      .^.
     /_+_\
    ( o_o )
 __/  (_)  \__    "Elevate your GPU Database!"
/ /|  ___  |\ \       - the CODEGENERAL
| | | {JIT} | | |
\ \|  ---  |/ /
 '--'_____'--'
```

## Features

- All 22 TPC-H queries implemented
- Runtime JIT compilation (NVRTC or NVCC)
- Operator-based code generation framework
- Bitmap-based join acceleration
- Shared memory reductions for high-performance aggregations
## Requirements

- CUDA Toolkit (tested with CUDA 11+)
- CMake 3.18+
- C++17 compiler
- TPC-H data files (`.tbl` format)
## Building

```shell
mkdir build && cd build
cmake ..
make -j8
```

## Running

```shell
# Run with TPC-H data directory
./src/tpch_demo_refactored /path/to/tpch-dbgen

# Run with random test data (no data directory)
./src/tpch_demo_refactored
```

### Options

| Option | Description |
|---|---|
| --nvcc | Use the NVCC compiler instead of NVRTC (slower compilation, same kernel performance) |
| --dp | Use dynamic parallelism when generating the code for launching kernels |
| --coop | Use cooperative groups and fuse the generated code into a single kernel |
| -v, --verbose | Print generated CUDA kernel code |
| -q <number> | Execute only the specified TPC-H query |
| -o, --output FILE | Write timing results to a CSV file |
| -h, --help | Show help message |
### Examples

```shell
# Run all queries with NVRTC and export timing to CSV
./src/tpch_demo_refactored /path/to/tpch-dbgen -o results.csv

# Run with the NVCC compiler and verbose output
./src/tpch_demo_refactored /path/to/tpch-dbgen --nvcc -v

# Compare NVRTC vs NVCC compilation times
./src/tpch_demo_refactored /path/to/tpch-dbgen -o nvrtc_timing.csv
./src/tpch_demo_refactored /path/to/tpch-dbgen --nvcc -o nvcc_timing.csv
```

The CSV output has the following format:

```
query,compiler,lineitem_count,compilation_ms,kernel_ms,total_ms
Q1,NVRTC,6001215,117.38,30.65,148.04
Q2,NVRTC,6001215,535.31,0.24,535.54
...
```

## Architecture

The query engine uses a composable operator framework to generate CUDA kernels. Operators are chained together in a producer-consumer pattern.
### Core classes

| Class | Description |
|---|---|
| Codegen | Code generator that builds CUDA kernel source code. Manages indentation, parameters, and code blocks. |
| Operator | Abstract base class for all operators. Defines the produce(codegen, consume) interface. |
| UnaryOperator | Base class for operators with a single child operator. |
### Table scan operators

| Operator | Description |
|---|---|
| GPUTableScan | Basic table scan with a simple for loop. Iterates over tuples with an idx variable. |
| GPUTableScanGridStride | Table scan using the grid-stride loop pattern for better GPU occupancy. |
### Selection operators

| Operator | Description |
|---|---|
| Selection | Filters rows using a predicate (for basic scans). Generates if (predicate) { ... }. |
### Bitmap join operators

| Operator | Description |
|---|---|
| BitmapBuild | Builds a bitmap from qualifying rows: bitmap[keyExpr] = 1. |
| BitmapJoin | Filters rows by checking the bitmap: if (bitmap[keyExpr]) { ... }. |
| MultiBitmapJoin | Joins on multiple bitmaps simultaneously (AND logic). |
| AntiBitmapJoin | Anti-join: passes rows where the bitmap is NOT set: if (!bitmap[keyExpr]) { ... }. |
| TableBitmapBuild | Standalone operator that scans a table and builds a bitmap in one step. |
| TableBitmapBuildGridStride | Grid-stride version of TableBitmapBuild. |
### Array operators

| Operator | Description |
|---|---|
| ArrayLookup | Reads a value from an array: resultVar = array[keyExpr]. |
| ArrayStore | Writes a value to an array: array[keyExpr] = valueExpr. |
### Aggregation operators

| Operator | Description |
|---|---|
| AtomicArrayAgg | Atomic aggregation by key: atomicAdd(&array[bucketExpr], valueExpr). |
| AtomicArrayCount | Atomic count by key: atomicAdd(&array[keyExpr], 1). |
| AtomicCount | Atomic increment of a single counter. |
| SharedMemReductionAgg | High-performance reduction using shared memory (one atomicAdd per block). |
| ArrayMaxReduction | Finds the maximum value in an array using a shared memory reduction. |
| KeyedAggregation | GROUP BY aggregation with multiple aggregates per bucket. |
| KeyedDualAggregation | Dual aggregation with a bucket key (GROUP BY with two aggregates). |
### Expression operators

| Operator | Description |
|---|---|
| ComputeExpr | Computes a derived value: type varName = expression. |
### Code generation example

```cpp
#include "codegen/codegen.hpp"

// Generate a kernel for:
// SELECT SUM(l_extendedprice) FROM lineitem WHERE l_shipdate >= '1994-01-01'
codegen::Codegen cg;
cg.setKernelName("aggregateRevenue");

auto scan = std::make_unique<codegen::GPUTableScanGridStride>(
    "lineitem", "LineItemTuple", "li");
auto filter = std::make_unique<codegen::Selection>(
    std::move(scan), "date_ge(li.l_shipdate, date_start)");
codegen::AtomicArrayAgg agg(
    std::move(filter), "d_result", "0", "li.l_extendedprice", "double");

agg.produce(&cg, [](){});
std::string kernelCode = cg.print();
```

## Project structure

```
CUDACodeGeneral/
├── codegen/
│   ├── codegen.hpp                   # Codegen framework
│   └── operator.hpp                  # Operator framework
├── queries/tpch/
│   └── tpch_q1.hpp ... tpch_q22.hpp  # TPC-H query implementations
├── schema/tables/tpch/
│   └── tpch_schema.hpp               # TPC-H table definitions
├── src/
│   ├── tpch_demo_refactored.cu       # Main executable
│   ├── tpch_loader.hpp               # TPC-H data loader
│   ├── launcher.hpp                  # Kernel launch utilities
│   └── jit_compiler.hpp              # NVRTC/NVCC compilation
└── build/                            # Build output
```
## Performance notes

- NVRTC compilation is typically 3-6x faster than NVCC
- Kernel execution times are identical between the two compilers
- Shared memory reductions significantly outperform naive atomic aggregations
- Bitmap joins enable efficient multi-table query execution
## License

See the LICENSE file for details.