Description
MFC CI has zero build caching — every run compiles from scratch. There are two categories of runners:
- GitHub-hosted (Ubuntu, macOS): 7 matrix combos with GCC/Intel. Ephemeral VMs.
- Self-hosted HPC (Phoenix, Frontier, Frontier AMD): 8 matrix combos with NVHPC/Cray CCE/amdflang. Shared runner pools where jobs float between multiple runner instances, each with a different workspace path.
The build system already supports incremental builds — build.py:578 checks is_configured() (looks for CMakeCache.txt) and skips reconfiguration if found. The problem is purely that build artifacts don't persist across runs.
Key constraints discovered during analysis
- `build/` is in `.gitignore` → `actions/checkout` doesn't delete it
- `shutil.rmtree` (used by `mfc.sh clean`) follows symlinks and would destroy cache contents
- ccache does not work with NVHPC, Cray CCE, or amdflang → useless for HPC runners
- Self-hosted runners have different workspace paths per instance → `CMakeCache.txt` contains stale absolute paths when a job lands on a different runner
- `uv` is already used for pip installs → venv setup is already fast (~seconds)
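To make the stale-path constraint concrete, here is a minimal sketch of how a script could detect that a cached build was configured from a different workspace. It relies on the `CMAKE_HOME_DIRECTORY` entry that CMake writes into `CMakeCache.txt`; the function names and the runner paths are hypothetical and are not part of the MFC toolchain.

```shell
#!/bin/bash
# Print the source directory recorded in a CMakeCache.txt file.
cmake_cached_source_dir() {
    sed -n 's/^CMAKE_HOME_DIRECTORY:INTERNAL=//p' "$1"
}

# Succeed (exit 0) if the cache at $1 was configured from a source tree
# other than $2.
cmake_cache_is_stale() {
    local cached
    cached="$(cmake_cached_source_dir "$1")"
    [ -n "$cached" ] && [ "$cached" != "$2" ]
}

# Demo with a fabricated cache file; both runner paths are made up.
demo_dir="$(mktemp -d)"
printf '%s\n' \
    'CMAKE_BUILD_TYPE:STRING=Release' \
    'CMAKE_HOME_DIRECTORY:INTERNAL=/runners/a/_work/MFC/MFC' \
    > "$demo_dir/CMakeCache.txt"

if cmake_cache_is_stale "$demo_dir/CMakeCache.txt" "/runners/b/_work/MFC/MFC"; then
    echo "stale: configured in $(cmake_cached_source_dir "$demo_dir/CMakeCache.txt")"
fi
```

A check like this only reports the mismatch; what to do about it (rewrite paths or reconfigure) is a separate decision.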
Changes
1. actions/cache for GitHub-hosted runners
Files: .github/workflows/test.yml, .github/workflows/coverage.yml
Add actions/cache@v4 after checkout, caching the build/ directory.
In test.yml, insert after the Clone step in the github job (after line 95):
```yaml
- name: Restore Build Cache
  uses: actions/cache@v4
  with:
    path: build
    key: mfc-build-${{ matrix.os }}-${{ matrix.mpi }}-${{ matrix.debug }}-${{ matrix.precision }}-${{ matrix.intel }}-${{ hashFiles('CMakeLists.txt', 'toolchain/dependencies/**', 'toolchain/cmake/**', 'src/**/*.fpp', 'src/**/*.f90') }}
    restore-keys: |
      mfc-build-${{ matrix.os }}-${{ matrix.mpi }}-${{ matrix.debug }}-${{ matrix.precision }}-${{ matrix.intel }}-
```

In coverage.yml, insert after checkout (after line 36):
```yaml
- name: Restore Build Cache
  uses: actions/cache@v4
  with:
    path: build
    key: mfc-coverage-${{ hashFiles('CMakeLists.txt', 'toolchain/dependencies/**', 'toolchain/cmake/**', 'src/**/*.fpp', 'src/**/*.f90') }}
    restore-keys: |
      mfc-coverage-
```

How it works:
- Cache miss (first run for a configuration, so neither the exact key nor the `restore-keys` prefix matches): builds from scratch, cache saved on completion
- Cache hit via `restore-keys` prefix (source changed but config same): restores old build dir, `is_configured()` returns True, CMake does incremental build of only changed files
- Exact cache hit: restores build dir, nothing to recompile, just runs tests
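The three outcomes above can be sketched as a small shell function that mimics how `actions/cache` resolves a lookup: an exact match on `key` wins, otherwise the most recently created cache whose key starts with the `restore-keys` prefix is restored. This is an illustrative model, not the action's implementation; the function name and the example hashes are hypothetical.

```shell
#!/bin/bash
# classify_cache_lookup KEY RESTORE_PREFIX [SAVED_KEY...]
# Prints "exact", "partial <saved-key>", or "miss".
# SAVED_KEYs are assumed to be listed oldest first, so the last prefix
# match stands in for "most recently created".
classify_cache_lookup() {
    local key="$1" prefix="$2" saved match=""
    shift 2
    for saved in "$@"; do
        if [ "$saved" = "$key" ]; then
            echo "exact"
            return 0
        fi
    done
    for saved in "$@"; do
        case "$saved" in
            "$prefix"*) match="$saved" ;;
        esac
    done
    if [ -n "$match" ]; then
        echo "partial $match"
    else
        echo "miss"
    fi
}

# Hypothetical keys: the config prefix is fixed, the trailing hash
# changes whenever the hashed source files change.
prefix="mfc-build-ubuntu-mpi-debug-double-false-"
classify_cache_lookup "${prefix}aaa111" "$prefix"
classify_cache_lookup "${prefix}bbb222" "$prefix" "${prefix}aaa111"
classify_cache_lookup "${prefix}bbb222" "$prefix" "${prefix}aaa111" "${prefix}bbb222"
```

The three calls correspond to a first run, a run after a source change, and a re-run on unchanged sources.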
2. Persistent build cache for self-hosted HPC runners
Approach: Symlink build/ → $HOME/.mfc-ci-cache/<cluster>-<device>-<interface>/build/ so every run of the same config finds cached artifacts regardless of which runner instance it lands on.
2a. New helper script: .github/scripts/setup-build-cache.sh
```shell
#!/bin/bash
# Sets up a persistent build cache for self-hosted CI runners.
# Creates a symlink: ./build -> $HOME/.mfc-ci-cache/<key>/build
#
# Usage: source .github/scripts/setup-build-cache.sh <cluster> <device> <interface>

_cache_cluster="${1:?Usage: setup-build-cache.sh <cluster> <device> <interface>}"
_cache_device="${2:?}"
_cache_interface="${3:-none}"

_cache_key="${_cache_cluster}-${_cache_device}-${_cache_interface}"
_cache_dir="$HOME/.mfc-ci-cache/${_cache_key}/build"

echo "=== Build Cache Setup ==="
echo "  Cache key: $_cache_key"
echo "  Cache dir: $_cache_dir"

mkdir -p "$_cache_dir"

# If build/ exists (real dir or stale symlink), remove it
if [ -e "build" ] || [ -L "build" ]; then
    rm -rf "build"
fi

ln -s "$_cache_dir" "build"

# Handle cross-runner workspace path changes.
# CMakeCache.txt stores absolute paths from whichever runner instance
# originally configured the build. If we're on a different runner, sed-replace
# the old workspace path with the current one so CMake can do incremental builds.
_workspace_marker="$_cache_dir/.workspace_path"
if [ -f "$_workspace_marker" ]; then
    _old_workspace=$(cat "$_workspace_marker")
    if [ "$_old_workspace" != "$(pwd)" ]; then
        echo "  Workspace path changed: $_old_workspace -> $(pwd)"
        echo "  Updating cached CMake paths..."
        find "$_cache_dir/staging" -type f \
            \( -name "CMakeCache.txt" -o -name "*.cmake" \
               -o -name "Makefile" -o -name "build.ninja" \) \
            -exec sed -i "s|${_old_workspace}|$(pwd)|g" {} + 2>/dev/null || true
    fi
fi
echo "$(pwd)" > "$_workspace_marker"

echo "  Symlink: build -> $_cache_dir"
echo "========================="
```

Why this works:
- `rm -rf "build"` on a symlink removes the symlink, not the target, so the cache is safe
- `actions/checkout` may remove the symlink via `git clean`, but we recreate it immediately after
- The sed fixup rewrites stale absolute paths in CMake files when the workspace path changes (different runner instance), enabling incremental builds across runners
- `$HOME` is on shared NFS on both Phoenix and Frontier, so the cache is accessible from login and compute nodes
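The first and third points can be checked in isolation with a throwaway directory tree. The sketch below uses temporary directories in place of real runner workspaces and a fabricated one-line `CMakeCache.txt`; it writes the rewritten file through a temporary copy instead of `sed -i` so that it also runs where `sed -i` behaves differently from GNU sed. All paths and variable names here are illustrative assumptions.

```shell
#!/bin/bash
work="$(mktemp -d)"
cache_dir="$work/cache/build"   # stands in for $HOME/.mfc-ci-cache/<key>/build
old_ws="$work/runner-a"         # workspace of the runner that configured the build
new_ws="$work/runner-b"         # workspace of the runner the next job lands on
mkdir -p "$cache_dir/staging" "$old_ws" "$new_ws"

# Fabricated cache entry that records the old workspace path.
echo "CMAKE_HOME_DIRECTORY:INTERNAL=$old_ws" > "$cache_dir/staging/CMakeCache.txt"

# 1. rm -rf on the symlink itself removes only the link, not the target.
ln -s "$cache_dir" "$old_ws/build"
rm -rf "$old_ws/build"
[ -f "$cache_dir/staging/CMakeCache.txt" ] && echo "cache survived symlink removal"

# 2. On the new runner, link to the same cache and rewrite the stale path.
ln -s "$cache_dir" "$new_ws/build"
cache_file="$cache_dir/staging/CMakeCache.txt"
sed "s|$old_ws|$new_ws|g" "$cache_file" > "$cache_file.new"
mv "$cache_file.new" "$cache_file"

# The file is now visible through the new symlink with the new path.
cat "$new_ws/build/staging/CMakeCache.txt"
```

Note that the workspace path is used as a sed pattern, so a path containing `|` or regex metacharacters would need escaping first.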
2b. Modify HPC build scripts
.github/workflows/frontier/build.sh — Add cache setup after mfc.sh load, replace ./mfc.sh clean in retry with targeted cleanup:
```shell
. ./mfc.sh load -c f -m g

# Set up persistent build cache
source .github/scripts/setup-build-cache.sh frontier "$job_device" "$job_interface"

# In retry logic, replace:
#   ./mfc.sh clean
# with:
rm -rf build/staging/* build/install/* build/lock.yaml
```

The targeted cleanup clears compiled artifacts (forcing full reconfigure on retry) without destroying the symlink or the venv.
.github/workflows/frontier_amd/build.sh — Same pattern, using frontier_amd as the cluster name. Module load is mfc.sh load -c famd -m g.
.github/workflows/phoenix/test.sh — Same pattern, using phoenix as the cluster name. Note: Phoenix builds inside SLURM jobs (not on login nodes), but $HOME is on shared NFS so the cache is accessible.
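The effect of the targeted cleanup can be reproduced on a mock cache. The sketch below assumes, for illustration only, that the venv lives in `build/venv` and that staging contains a hidden file; neither detail is confirmed by the plan. It shows that the symlink and the venv survive, and that the `*` glob does not match dotfiles, so hidden files directly under `staging/` are left behind.

```shell
#!/bin/bash
work="$(mktemp -d)"
cache="$work/cache/build"
mkdir -p "$cache/staging/obj" "$cache/install/bin" "$cache/venv" "$work/ws"
touch "$cache/staging/obj/a.o" "$cache/staging/.hidden" \
      "$cache/install/bin/app" "$cache/lock.yaml" "$cache/venv/pyvenv.cfg"
ln -s "$cache" "$work/ws/build"

# Run the cleanup from the workspace, as the retry logic would.
(
    cd "$work/ws" || exit 1
    rm -rf build/staging/* build/install/* build/lock.yaml
)

# Report what is left behind the symlink.
ls -A "$work/ws/build"
ls -A "$work/ws/build/staging"
```

If leftover dotfiles in staging ever matter, removing and recreating the `staging` directory itself would be one alternative.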
2c. No changes needed to:
- `frontier/test.sh`, `frontier_amd/test.sh`: test-only scripts, no build step
- `frontier/submit.sh`, `phoenix/submit.sh`: submit scripts, build symlink is in the workspace which `$SLURM_SUBMIT_DIR` points to
- `test.yml` self-hosted job definitions: cache logic lives entirely in the shell scripts
- Any Python toolchain files (`build.py`, `clean.py`, `common.py`, etc.)
- `mfc.sh`
File change summary
| File | Action | Description |
|---|---|---|
| .github/workflows/test.yml | Modify | Add actions/cache@v4 step to github job |
| .github/workflows/coverage.yml | Modify | Add actions/cache@v4 step to run job |
| .github/scripts/setup-build-cache.sh | New | Shared helper: symlink build/ → persistent cache, sed-fixup paths |
| .github/workflows/frontier/build.sh | Modify | Add cache setup, replace mfc.sh clean with targeted rm |
| .github/workflows/frontier_amd/build.sh | Modify | Same as frontier |
| .github/workflows/phoenix/test.sh | Modify | Add cache setup, replace mfc.sh clean with targeted rm |
Zero changes to MFC source code or Python toolchain.
How incremental builds work after these changes
- First run (cold cache): No `CMakeCache.txt` → `is_configured()` returns False → full configure + build. Artifacts saved in persistent cache.
- Same runner, code changed: `CMakeCache.txt` exists → `is_configured()` returns True → skips configure → `cmake --build` recompiles only changed files.
- Different runner, code changed: Symlink points to same cache → sed fixes workspace paths in CMake files → `is_configured()` returns True → CMake re-runs configure (detects file changes) → incremental build.
- Build failure: Retry logic clears `build/staging/*` and `build/install/*` → next attempt does full configure + build (same as today).
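The first three cases can be condensed into a small decision function. This is a sketch of the reasoning only, assuming that `CMakeCache.txt` lives somewhere under `staging/` and that `.workspace_path` is the marker written by the helper script; the function name, the `abc` subdirectory, and the workspace paths are hypothetical.

```shell
#!/bin/bash
# plan_build CACHE_DIR WORKSPACE
# Prints which path a build would take for the given cache state.
plan_build() {
    local cache_dir="$1" workspace="$2" marker="$1/.workspace_path"
    if [ -z "$(find "$cache_dir/staging" -name CMakeCache.txt 2>/dev/null | head -n 1)" ]; then
        echo "full-build"                    # cold cache: nothing configured yet
    elif [ -f "$marker" ] && [ "$(cat "$marker")" != "$workspace" ]; then
        echo "fix-paths-then-incremental"    # configured on a different runner
    else
        echo "incremental"                   # same workspace as last time
    fi
}

# Demo on a mock cache directory.
work="$(mktemp -d)"
cache="$work/build"
mkdir -p "$cache/staging/abc"
plan_build "$cache" "/runner-a/ws"

touch "$cache/staging/abc/CMakeCache.txt"
echo "/runner-a/ws" > "$cache/.workspace_path"
plan_build "$cache" "/runner-a/ws"
plan_build "$cache" "/runner-b/ws"
```

The fourth case (build failure) empties `staging/`, which sends the next attempt back down the `full-build` branch.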
Verification
- GH runners: Open a PR against `MFlowCode/MFC`. The first run builds from scratch (cache miss). Push a trivial commit; the second run should show `Cache restored` in the "Restore Build Cache" step and build faster.
- Self-hosted runners: The first run on each runner creates the cache. Subsequent runs should show "Symlink: build -> ..." in the logs. If the runner changes, the logs should show "Updating cached CMake paths..." and the build should still be incremental.
- Retry path: Intentionally break a build (if possible) to verify that the retry logic clears staging and recovers.
Risks and mitigations
| Risk | Mitigation |
|---|---|
| Stale CMake cache causes persistent build failures | Retry logic clears staging on failure; degrades to full rebuild (same as today) |
| sed misses some path references | CMake auto-regeneration handles most mismatches; retry catches the rest |
| Cache grows unbounded on HPC | Only 8 cache dirs total (one per config); each is updated in-place. Can add housekeeping cron later. |
| Concurrent jobs for same config corrupt cache | concurrency group in test.yml cancels in-progress runs; matrix dimensions ensure different configs use different cache dirs |
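
For the housekeeping mentioned in the table, one possible starting point is a dry-run listing of caches that have not been used recently. The sketch below relies on the fact that the helper script rewrites `.workspace_path` on every run, so that file's modification time approximates the last use. The function name, the 30-day threshold, and the cache key names are assumptions, and the function only lists candidates rather than deleting anything.

```shell
#!/bin/bash
# list_stale_caches ROOT DAYS
# Lists <ROOT>/<key>/build directories whose .workspace_path marker
# was last modified more than DAYS days ago.
list_stale_caches() {
    local root="$1" days="$2"
    find "$root" -mindepth 3 -maxdepth 3 -type f -name .workspace_path \
        -mtime "+$days" -exec dirname {} \; | sort
}

# Demo on a mock cache root with two hypothetical cache keys.
root="$(mktemp -d)"
mkdir -p "$root/frontier-gpu-acc/build" "$root/phoenix-cpu-none/build"
touch "$root/phoenix-cpu-none/build/.workspace_path"                   # used just now
touch -t 202001010000 "$root/frontier-gpu-acc/build/.workspace_path"   # last used in 2020

list_stale_caches "$root" 30
```

In a real cron job the root would be `$HOME/.mfc-ci-cache`, and any deletion step would need to avoid racing with a running build.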