Add build caching to self-hosted runners #1145

@sbryngelson

MFC CI has zero build caching — every run compiles from scratch. There are two categories of runners:

  • GitHub-hosted (Ubuntu, macOS): 7 matrix combos with GCC/Intel. Ephemeral VMs.
  • Self-hosted HPC (Phoenix, Frontier, Frontier AMD): 8 matrix combos with NVHPC/Cray CCE/amdflang. Shared runner pools where jobs float between multiple runner instances, each with a different workspace path.

The build system already supports incremental builds — build.py:578 checks is_configured() (looks for CMakeCache.txt) and skips reconfiguration if found. The problem is purely that build artifacts don't persist across runs.

Key constraints discovered during analysis

  • build/ is in .gitignore → actions/checkout doesn't delete it
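  • shutil.rmtree (used by mfc.sh clean) raises an error on a path that is itself a symlink, but deletes everything it reaches through one (e.g. build/staging) → mfc.sh clean would either fail or destroy cache contents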
  • ccache does not work with NVHPC, Cray CCE, or amdflang → useless for HPC runners
  • Self-hosted runners have different workspace paths per instance → CMakeCache.txt contains stale absolute paths when a job lands on a different runner
  • uv is already used for pip installs → venv setup is already fast (~seconds)

Changes

1. actions/cache for GitHub-hosted runners

Files: .github/workflows/test.yml, .github/workflows/coverage.yml

Add actions/cache@v4 after checkout, caching the build/ directory.

In test.yml, insert after the Clone step in the github job (after line 95):

      - name: Restore Build Cache
        uses: actions/cache@v4
        with:
          path: build
          key: mfc-build-${{ matrix.os }}-${{ matrix.mpi }}-${{ matrix.debug }}-${{ matrix.precision }}-${{ matrix.intel }}-${{ hashFiles('CMakeLists.txt', 'toolchain/dependencies/**', 'toolchain/cmake/**', 'src/**/*.fpp', 'src/**/*.f90') }}
          restore-keys: |
            mfc-build-${{ matrix.os }}-${{ matrix.mpi }}-${{ matrix.debug }}-${{ matrix.precision }}-${{ matrix.intel }}-

In coverage.yml, insert after checkout (after line 36):

      - name: Restore Build Cache
        uses: actions/cache@v4
        with:
          path: build
          key: mfc-coverage-${{ hashFiles('CMakeLists.txt', 'toolchain/dependencies/**', 'toolchain/cmake/**', 'src/**/*.fpp', 'src/**/*.f90') }}
          restore-keys: |
            mfc-coverage-

How it works:

  • Cache miss (first run or source hash change): builds from scratch, cache saved on completion
  • Cache hit via restore-keys prefix (source changed but config same): restores old build dir, is_configured() returns True, CMake does an incremental build of only the changed files
  • Exact cache hit: restores build dir, nothing to recompile, just runs tests
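
The three outcomes above can be mimicked locally with a small lookup function. This is only a sketch: lookup_cache is a hypothetical stand-in for the matching that actions/cache performs on the service side, and the key strings are shortened examples rather than the real keys from test.yml. It assumes the documented behavior that an exact key wins and that, among several prefix matches, the most recently created cache is restored.

```bash
#!/bin/bash
# Hypothetical model of the actions/cache lookup (not the real implementation).
# $1 = exact key, $2 = restore-keys prefix, remaining args = stored keys, oldest first.
lookup_cache() {
    local key="$1" prefix="$2" match="" stored
    shift 2
    for stored in "$@"; do
        if [ "$stored" = "$key" ]; then
            echo "exact:$stored"        # exact hit: nothing to recompile
            return 0
        fi
        case "$stored" in
            "$prefix"*) match="$stored" ;;  # remember the newest prefix match
        esac
    done
    if [ -n "$match" ]; then
        echo "partial:$match"           # restore-keys hit: incremental build
    else
        echo "miss"                     # cold build from scratch
    fi
}

# Example: the source hash changed from "222" to "333", the config prefix did not.
lookup_cache "mfc-build-ubuntu-333" "mfc-build-ubuntu-" \
    "mfc-build-ubuntu-111" "mfc-build-ubuntu-222"
```

On a partial hit the restored directory comes from an older source hash, which is why the incremental build step still matters.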

2. Persistent build cache for self-hosted HPC runners

Approach: Symlink build/ → $HOME/.mfc-ci-cache/<cluster>-<device>-<interface>/build/ so every run of the same config finds cached artifacts regardless of which runner instance it lands on.

2a. New helper script: .github/scripts/setup-build-cache.sh

#!/bin/bash
# Sets up a persistent build cache for self-hosted CI runners.
# Creates a symlink: ./build -> $HOME/.mfc-ci-cache/<key>/build
#
# Usage: source .github/scripts/setup-build-cache.sh <cluster> <device> <interface>

_cache_cluster="${1:?Usage: setup-build-cache.sh <cluster> <device> <interface>}"
_cache_device="${2:?}"
_cache_interface="${3:-none}"

_cache_key="${_cache_cluster}-${_cache_device}-${_cache_interface}"
_cache_dir="$HOME/.mfc-ci-cache/${_cache_key}/build"

echo "=== Build Cache Setup ==="
echo "  Cache key: $_cache_key"
echo "  Cache dir: $_cache_dir"

mkdir -p "$_cache_dir"

# If build/ exists (real dir or stale symlink), remove it
if [ -e "build" ] || [ -L "build" ]; then
    rm -rf "build"
fi

ln -s "$_cache_dir" "build"

# Handle cross-runner workspace path changes.
# CMakeCache.txt stores absolute paths from whichever runner instance
# originally configured the build. If we're on a different runner, sed-replace
# the old workspace path with the current one so CMake can do incremental builds.
_workspace_marker="$_cache_dir/.workspace_path"
if [ -f "$_workspace_marker" ]; then
    _old_workspace=$(cat "$_workspace_marker")
    if [ "$_old_workspace" != "$(pwd)" ]; then
        echo "  Workspace path changed: $_old_workspace -> $(pwd)"
        echo "  Updating cached CMake paths..."
        find "$_cache_dir/staging" -type f \
            \( -name "CMakeCache.txt" -o -name "*.cmake" \
               -o -name "Makefile" -o -name "build.ninja" \) \
            -exec sed -i "s|${_old_workspace}|$(pwd)|g" {} + 2>/dev/null || true
    fi
fi
echo "$(pwd)" > "$_workspace_marker"

echo "  Symlink: build -> $_cache_dir"
echo "========================="
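
The path fixup in the script above can be exercised in isolation. The following is a minimal sketch under stated assumptions: the runner paths, the abc123 staging subdirectory, and the two CMakeCache.txt entries are invented for the demo, and sed -i is used in its GNU form (as on the Linux HPC systems). The old path is treated as a regular expression by sed, so a "." in it matches any character.

```bash
#!/bin/bash
# Rewrite an old workspace prefix to a new one in CMake-generated files.
# rewrite_workspace_paths is a hypothetical wrapper around the find/sed
# pair used in setup-build-cache.sh.
rewrite_workspace_paths() {
    local staging="$1" old="$2" new="$3"
    find "$staging" -type f \
        \( -name "CMakeCache.txt" -o -name "*.cmake" \
           -o -name "Makefile" -o -name "build.ninja" \) \
        -exec sed -i "s|${old}|${new}|g" {} +
}

# Build a throwaway staging tree with made-up absolute paths.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/staging/abc123"
cat > "$demo_root/staging/abc123/CMakeCache.txt" <<'EOF'
CMAKE_HOME_DIRECTORY:INTERNAL=/runners/runner-1/_work/MFC/MFC
MFC_BINARY_DIR:STATIC=/runners/runner-1/_work/MFC/MFC/build/staging/abc123
EOF

rewrite_workspace_paths "$demo_root/staging" \
    "/runners/runner-1/_work/MFC/MFC" "/runners/runner-2/_work/MFC/MFC"

# Both entries should now refer to runner-2.
cat "$demo_root/staging/abc123/CMakeCache.txt"
```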

Why this works:

  • rm -rf "build" on a symlink removes the symlink, not the target — cache is safe
  • actions/checkout may remove the symlink via git clean, but we recreate it immediately after
  • The sed fixup rewrites stale absolute paths in CMake files when the workspace path changes (different runner instance), enabling incremental builds across runners
  • $HOME is on shared NFS on both Phoenix and Frontier, so the cache is accessible from login and compute nodes
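
The first point is easy to confirm in a scratch directory. This sketch uses temporary paths only and does not touch a real cache; the file name object.o is a placeholder.

```bash
#!/bin/bash
# Show that rm -rf on a symlink (no trailing slash) removes only the link.
demo=$(mktemp -d)
mkdir -p "$demo/cache/build" "$demo/workspace"
echo "artifact" > "$demo/cache/build/object.o"

cd "$demo/workspace" || exit 1
ln -s "$demo/cache/build" build

rm -rf "build"      # unlinks the symlink; the target directory is untouched

[ ! -L build ] && echo "symlink removed"
[ -f "$demo/cache/build/object.o" ] && echo "cache contents intact"
```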

2b. Modify HPC build scripts

.github/workflows/frontier/build.sh — Add cache setup after mfc.sh load, replace ./mfc.sh clean in retry with targeted cleanup:

. ./mfc.sh load -c f -m g

# Set up persistent build cache
source .github/scripts/setup-build-cache.sh frontier "$job_device" "$job_interface"

# In retry logic, replace:
#   ./mfc.sh clean
# with:
    rm -rf build/staging/* build/install/* build/lock.yaml

The targeted cleanup clears compiled artifacts (forcing full reconfigure on retry) without destroying the symlink or the venv.
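
A scratch-directory sketch of that cleanup follows. It assumes the venv lives at build/venv and uses placeholder file names. One detail worth knowing: the shell glob * does not match dotfiles, so hidden entries directly under staging/ or install/ would survive.

```bash
#!/bin/bash
# Targeted cleanup through a symlinked build/ directory, using temp paths only.
cleanup_demo=$(mktemp -d)
mkdir -p "$cleanup_demo/cache/build/staging/abc" \
         "$cleanup_demo/cache/build/install/bin" \
         "$cleanup_demo/cache/build/venv" \
         "$cleanup_demo/ws"
touch "$cleanup_demo/cache/build/lock.yaml" \
      "$cleanup_demo/cache/build/staging/abc/CMakeCache.txt" \
      "$cleanup_demo/cache/build/venv/pyvenv.cfg"

cd "$cleanup_demo/ws" || exit 1
ln -s "$cleanup_demo/cache/build" build

# Same command as in the retry logic: empties staging/ and install/,
# drops lock.yaml, and leaves the symlink and venv/ alone.
rm -rf build/staging/* build/install/* build/lock.yaml

ls -A build
```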

.github/workflows/frontier_amd/build.sh — Same pattern, using frontier_amd as the cluster name. Module load is mfc.sh load -c famd -m g.

.github/workflows/phoenix/test.sh — Same pattern, using phoenix as the cluster name. Note: Phoenix builds inside SLURM jobs (not on login nodes), but $HOME is on shared NFS so the cache is accessible.

2c. No changes needed to:

  • frontier/test.sh, frontier_amd/test.sh — test-only scripts, no build step
  • frontier/submit.sh, phoenix/submit.sh — submit scripts, build symlink is in the workspace which $SLURM_SUBMIT_DIR points to
  • test.yml self-hosted job definitions — cache logic lives entirely in the shell scripts
  • Any Python toolchain files (build.py, clean.py, common.py, etc.)
  • mfc.sh

File change summary

| File | Action | Description |
| --- | --- | --- |
| .github/workflows/test.yml | Modify | Add actions/cache@v4 step to github job |
| .github/workflows/coverage.yml | Modify | Add actions/cache@v4 step to run job |
| .github/scripts/setup-build-cache.sh | New | Shared helper: symlink build/ → persistent cache, sed-fixup paths |
| .github/workflows/frontier/build.sh | Modify | Add cache setup, replace mfc.sh clean with targeted rm |
| .github/workflows/frontier_amd/build.sh | Modify | Same as frontier |
| .github/workflows/phoenix/test.sh | Modify | Add cache setup, replace mfc.sh clean with targeted rm |

Zero changes to MFC source code or Python toolchain.

How incremental builds work after these changes

  1. First run (cold cache): No CMakeCache.txt → is_configured() returns False → full configure + build. Artifacts saved in persistent cache.
  2. Same runner, code changed: CMakeCache.txt exists → is_configured() returns True → skips configure → cmake --build recompiles only changed files.
  3. Different runner, code changed: Symlink points to same cache → sed fixes workspace paths in CMake files → is_configured() returns True → CMake re-runs configure (detects file changes) → incremental build.
  4. Build failure: Retry logic clears build/staging/* and build/install/* → next attempt does full configure + build (same as today).
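
The first three cases can be told apart from the cache directory alone. The helper below is a hypothetical diagnostic, not part of the proposed changes; it assumes the layout used by setup-build-cache.sh (a staging/ tree plus a .workspace_path marker inside the cache directory).

```bash
#!/bin/bash
# Hypothetical helper: report which scenario a run would fall into.
# $1 = cache dir (the symlink target), $2 = current workspace path.
describe_run() {
    local cache_dir="$1" workspace="$2"
    local marker="$cache_dir/.workspace_path"
    if ! find "$cache_dir/staging" -name CMakeCache.txt 2>/dev/null | grep -q .; then
        echo "cold: full configure + build"
    elif [ -f "$marker" ] && [ "$(cat "$marker")" = "$workspace" ]; then
        echo "warm: same workspace, incremental build"
    else
        echo "warm: workspace changed, rewrite paths then incremental build"
    fi
}

# Example with a throwaway cache directory and made-up workspace paths.
run_demo=$(mktemp -d)
describe_run "$run_demo" "/runners/runner-1/_work/MFC/MFC"
```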

Verification

  1. GH runners: Open a PR against MFlowCode/MFC. First run builds from scratch (cache miss). Push a trivial commit — second run should show Cache restored in the "Restore Build Cache" step and build faster.
  2. Self-hosted runners: The first run of each config creates the cache. Subsequent runs should show "Symlink: build -> ..." in the logs. If the runner changes, the logs should also show "Updating cached CMake paths..." and the build should still be incremental.
  3. Retry path: Intentionally break a build (if possible) to verify that the retry logic clears staging and recovers.
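
For the self-hosted check, a quick assertion on the symlink target may be easier than reading logs. This is a sketch only; check_build_symlink is a hypothetical helper and the expected path must be filled in with the real <cluster>-<device>-<interface> key.

```bash
#!/bin/bash
# Hypothetical check: is ./build a symlink to the expected cache directory?
# $1 = expected target, e.g. "$HOME/.mfc-ci-cache/<cluster>-<device>-<interface>/build"
check_build_symlink() {
    local expected="$1"
    if [ -L build ] && [ "$(readlink build)" = "$expected" ]; then
        echo "ok: build -> $expected"
    else
        echo "unexpected: build is not a symlink to $expected" >&2
        return 1
    fi
}
```

Run it from the workspace root after setup-build-cache.sh has been sourced.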

Risks and mitigations

| Risk | Mitigation |
| --- | --- |
| Stale CMake cache causes persistent build failures | Retry logic clears staging on failure; degrades to full rebuild (same as today) |
| sed misses some path references | CMake auto-regeneration handles most mismatches; retry catches the rest |
| Cache grows unbounded on HPC | Only 8 cache dirs total (one per config); each is updated in-place. Can add housekeeping cron later. |
| Concurrent jobs for same config corrupt cache | concurrency group in test.yml cancels in-progress runs; matrix dimensions ensure different configs use different cache dirs |
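
If the housekeeping cron is ever needed, one possible shape is sketched below. It is an assumption, not part of this proposal: list_stale_caches is a hypothetical helper that relies on the .workspace_path marker being rewritten on every run, so the marker's modification time approximates the last use of a cache. It only lists candidates and deletes nothing.

```bash
#!/bin/bash
# Hypothetical housekeeping helper: print cache key directories whose
# .workspace_path marker has not been rewritten for more than N days.
# $1 = cache root (e.g. "$HOME/.mfc-ci-cache"), $2 = age threshold in days.
list_stale_caches() {
    local cache_root="$1" max_age_days="$2" marker
    # Markers live at <root>/<key>/build/.workspace_path, i.e. depth 3.
    find "$cache_root" -mindepth 3 -maxdepth 3 -type f \
        -name ".workspace_path" -mtime +"$max_age_days" -print |
    while IFS= read -r marker; do
        dirname "$(dirname "$marker")"   # strip /build/.workspace_path
    done
}
```

A cron entry could call this with the cache root and review the output before removing anything.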
