by/fix sticky disk timeout #156

Closed
bruce-y wants to merge 70 commits into main from by/fix-sticky-disk-timeout

bruce-y commented Mar 10, 2026

  • *: initial scaffolding for the stickydisk action
  • README: update use cases in README
  • README: update README
  • README: add blacksmith logo (#4)
  • main: maintain runner perms when mounting a stickydisk
  • update README (#6)
  • README: update README with use cases and arch diagram
  • .github: fix typo
  • README: fix nits
  • Update README.md
  • .github: clarify in first sentence
  • Update README.md
  • Update README.md
  • src: add sync before umount
  • src: increase timeout to the same as build-push-action
  • src: explicitly flush journal before umounting (#12)
  • use grpc port from the env if present (#14)
  • src: print who we are establishing client with
  • src: use BLACKSMITH prefixed env vars to work inside job containers
  • src: revert to use the correct github repo name
  • README: add Alpha status callout (#22)
  • Add failure detection logic to sticky disk action
  • Fix linting errors and add CLAUDE.md
  • Update dependencies and improve CLAUDE.md
  • Fix mkdir permission error for restricted paths
  • Fix mkdir permission error for restricted paths
  • Add intentional failure to test failure detection
  • Revert "Add intentional failure to test failure detection"
  • add filesystem usage tracking to sticky disk commits (#30)
  • fix: collect filesystem usage before unmounting sticky disk
  • docs: add stickydisk-delete section to README
  • docs: update stickydisk-delete section with accurate features
  • chore: update buf dependency versions to fix CI
  • docs: add reference to setup-docker-builder for Docker cache deletion
  • chore: update build artifacts for platform compatibility
  • chore: update build artifacts for platform compatibility
  • Revert "chore: update build artifacts for platform compatibility"
  • fix: update package-lock.json with latest buf dependency
  • fix: remove lost+found directory after formatting to prevent permission errors
  • test: add placeholder test to satisfy CI requirements
  • style: format code with prettier
  • fix: remove explicit buf package install from CI to prevent formatting diffs
  • docs: expand use cases with production-validated examples
  • docs: expand use cases with production-validated examples
  • fix: correct Nix example step order and action
  • chore: promote stickydisk action from Alpha to Beta
  • docs: BLA-2526 - Update stickydisk limit from 5 to 10 disks
  • feat(post): add explicit durability flush after unmount (BLA-3202)
  • fix: apply prettier formatting
  • refactor: remove env var, change debug to info, reduce timeout to 10s
  • refactor: use shell timeout command for flush operation
  • ensure that sticky disk timeout is cleared even if thrown
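
The last commit in this list concerns clearing a timeout even when the guarded operation throws. The PR's own diff is not shown here, so the following is only a minimal sketch of that pattern. The helper name `withTimeout` and its parameters are hypothetical, not taken from the action's source.

```typescript
// Hypothetical helper (not from the action's source): races an operation
// against a timer and always clears the timer, whatever the outcome.
async function withTimeout<T>(
  operation: () => Promise<T>,
  timeoutMs: number,
  label: string,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([operation(), timeout]);
  } finally {
    // Runs on success, on a thrown error, and on timeout, so no timer
    // is left pending to keep the process alive or fire later.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Placing `clearTimeout` in `finally` rather than after the `await` is the point of the sketch: a rejection from `operation()` would otherwise skip the cleanup.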

Note

Medium Risk
Medium risk because it changes the published action interface in action.yml (new required inputs and a post script), which can break existing consumers; the rest is CI/docs scaffolding.

Overview
This PR repackages the repository from a cache-delete action into the useblacksmith/stickydisk action by updating action.yml (required key/path inputs and a post step) and rewriting the README to document sticky disk behavior and common caching use cases.

It also adds CI/ops scaffolding: a basic.yaml workflow to exercise mounting multiple sticky disks, a bump-tag.yaml workflow to force-update the v1 tag, and tighter build CI checks (Buf setup/registry auth plus enforced Prettier + “no uncommitted build output” gating).

Written by Cursor Bugbot for commit 94119f5. This will update automatically on new commits.

adityamaru and others added 30 commits December 13, 2024 18:31
*: initial scaffolding for the stickydisk action
main: maintain runner perms when mounting a stickydisk
README: update README with use cases and arch diagram
Update README.md - use blacksmith cache action
src: increase timeout to the same as build-push-action
* src: explicitly flush journal before umounting

* *: generated code

* *: upgrade @buf/blacksmith_vm-agent.connectrpc_es@latest
src: print who we are establishing client with
src: use BLACKSMITH prefixed env vars to work inside job containers
aayushshah15 and others added 29 commits October 3, 2025 01:09
Fixes critical bug where filesystem usage was collected after unmounting, causing df to report incorrect data.

Changes:
- Moved df command execution to before unmount
- Added proper path escaping to prevent shell injection
- Fixed log message syntax error and unit label (GB -> GiB)
- Added prettier config for consistent formatting
- Removed unsupported eslint comma-dangle rule

Co-authored-by: Aayush Shah <aayushshah15@users.noreply.github.com>
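
The ordering and escaping fixes described in the commit above can be sketched roughly as follows. This is an illustration under assumptions, not the action's code: `shellQuote`, `parseDfOutput`, `formatGiB`, `collectUsageThenUnmount`, and the injected `Run` callback are all hypothetical names, and the `df -B1 --output=used,size` invocation assumes GNU coreutils `df`.

```typescript
// Hypothetical command runner; the real action's exec wrapper is not shown here.
type Run = (command: string) => Promise<string>;

// Quote a path for POSIX sh: wrap in single quotes, escape embedded single quotes.
function shellQuote(path: string): string {
  return `'${path.replace(/'/g, `'\\''`)}'`;
}

// Parse the last line of `df -B1 --output=used,size` (assumes GNU df).
function parseDfOutput(stdout: string): { usedBytes: number; sizeBytes: number } {
  const lines = stdout.trim().split("\n");
  const [usedBytes, sizeBytes] = lines[lines.length - 1].trim().split(/\s+/).map(Number);
  if (!Number.isFinite(usedBytes) || !Number.isFinite(sizeBytes)) {
    throw new Error(`unexpected df output: ${stdout}`);
  }
  return { usedBytes, sizeBytes };
}

// 1 GiB = 1024^3 bytes, hence the GiB label rather than GB.
function formatGiB(bytes: number): string {
  return `${(bytes / 1024 ** 3).toFixed(2)} GiB`;
}

// Usage is read while the filesystem is still mounted; only then is it unmounted.
async function collectUsageThenUnmount(run: Run, mountPath: string) {
  const quoted = shellQuote(mountPath);
  const usage = parseDfOutput(await run(`df -B1 --output=used,size ${quoted}`));
  await run(`sudo umount ${quoted}`);
  return usage;
}
```

Running `df` after `umount` would report on whatever filesystem holds the now-empty mount point, which is the incorrect data the commit message refers to.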
- Add comprehensive stickydisk-delete documentation section
- Include basic usage and cleanup workflow examples
- Add pattern matching and use case documentation
- Maintain consistent formatting with Blacksmith logo
- Remove pattern matching (not supported)
- Remove logo and overview fluff
- Focus on two supported methods: delete by key and Docker cache
- Simplify examples
Rebuild dist files with Node v23.2.0 to match .nvmrc version.
Build artifacts may differ slightly between macOS and Linux CI environment.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Rebuild dist files with Node v23.2.0 to match .nvmrc version.
Build artifacts may differ slightly between macOS and Linux CI environment.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Update @buf/blacksmith_vm-agent.connectrpc_es version to match CI requirements

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
docs: add stickydisk-delete section to README
…on errors

Remove the lost+found directory created by mkfs.ext4 to prevent EACCES
permission errors when tools recursively scan sticky disk mount paths.

Problem:
mkfs.ext4 always creates lost+found with root:root 0700 permissions for
fsck recovery. When sticky disks are mounted to paths that tools scan
recursively (e.g., ./node_modules, ./build-cache), permission errors occur:

- pnpm/npm/yarn: Scan node_modules for packages, hit lost+found
  Error: EACCES: permission denied, open '.../lost+found/package.json'

- Docker buildx: Scans build context, hits lost+found
  Error: error from sender: open build-cache/lost+found: permission denied

Impact Analysis:
- 4 customer installations affected in past 11 days
- Pattern 1: pnpm/npm/yarn with node_modules mount
- Pattern 2: Docker buildx with build-cache mount
- Error is intermittent for some tools (buildx)
- Persistent for others (pnpm/yarn)

Evidence:
- Direct snapshot inspection shows empty lost+found (no corruption)
- This is standard mkfs.ext4 behavior, not a bug
- Workarounds already implemented by affected users

Solution:
After mkfs.ext4, mount temporarily and remove lost+found directory.
This is safe because:
- Sticky disks are ephemeral CI caches
- If corruption occurs, cache can be rebuilt
- lost+found only needed for fsck recovery of critical filesystems
- Standard practice in Docker/K8s for cache volumes

Prevents need for per-repo workarounds.

Fixes BLA-2150
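
A rough sketch of the solution described in the commit above, assuming the device has already been formatted with mkfs.ext4 and that a scratch mount directory exists. The function name `removeLostAndFound`, the injected `Run` callback, and the use of `sudo` are assumptions for illustration. The action's actual implementation is not shown on this page.

```typescript
// Hypothetical command runner; stands in for whatever exec helper the action uses.
type Run = (command: string) => Promise<string>;

// Quote a path for POSIX sh so unusual characters cannot break the command.
function quote(path: string): string {
  return `'${path.replace(/'/g, `'\\''`)}'`;
}

// Mount a freshly formatted ext4 device at a scratch directory, delete the
// root-owned lost+found directory, and unmount again.
async function removeLostAndFound(
  run: Run,
  device: string,
  scratchMount: string,
): Promise<void> {
  await run(`sudo mount ${quote(device)} ${quote(scratchMount)}`);
  try {
    await run(`sudo rm -rf ${quote(`${scratchMount}/lost+found`)}`);
  } finally {
    // Always unmount, even if the removal failed.
    await run(`sudo umount ${quote(scratchMount)}`);
  }
}
```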
Add minimal placeholder test since full integration testing of block
device operations requires actual hardware and is done manually.
…g diffs

The CI was installing @buf/blacksmith_vm-agent.connectrpc_es@latest which would
update package-lock.json, causing prettier to detect formatting changes. Since
npm ci already installs the correct version from package-lock.json, the explicit
install is unnecessary and causes issues.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…-mkfs

fix: remove lost+found directory after formatting to prevent permissi…
Replace deprecated Blacksmith-specific actions with upstream equivalents
and add real-world use case examples based on production data from 58+
installations with 2,000+ active sticky disk entities.

Changes to existing examples:
- NPM caching: Use actions/setup-node@v4 (not deprecated useblacksmith/setup-node@v5)
- Updated runner labels to specific images (blacksmith-4vcpu-ubuntu-2204)
- Removed node_modules mount (causes lost+found permission issues)
- Focus on npm cache (~/.npm) which is the safer pattern

New use cases added:
- Go Build and Module Cache (4 installations, ~1TB)
- Turborepo Cache (4 installations, ~610GB)
- Python Virtual Environments (2 installations, ~566GB)
- Nix Package Cache (2 installations, ~29GB)
- Playwright Browser Binaries (5 installations, ~36GB)

All examples:
- Use upstream actions (actions/setup-node, actions/setup-go, etc.)
- Based on real customer workflows
- Anonymized for privacy
- Production-validated patterns
Replace deprecated Blacksmith-specific actions with upstream equivalents
and add real-world use case examples based on production data from 58+
installations with 2,000+ active sticky disk entities.

Changes to existing examples:
- NPM: Use actions/setup-node@v4 (not deprecated useblacksmith/setup-node@v5)
- NPM: Added node_modules mount back (safe after lost+found fix)
- Updated all runner labels to specific images (blacksmith-4vcpu-ubuntu-2204)
- Bazel: Updated runner label for consistency

New use cases added:
- Go Build and Module Cache (4 installations, ~1TB)
- Turborepo Cache (4 installations, ~610GB)
- Python Virtual Environments (2 installations, ~566GB)
- Nix Package Cache (2 installations, ~29GB)
- Playwright Browser Binaries (5 installations, ~36GB)

All examples:
- Use upstream actions (no deprecated Blacksmith actions)
- Based on real customer workflows
- Anonymized for privacy
- Production-validated patterns
Fix Nix example to match production workflow pattern:
1. Create /nix directories first
2. Mount sticky disk to /nix
3. THEN install Nix (which populates the mounted directory)

Previous order (mount after install) would cause the sticky disk to
overwrite the Nix installation.

Also updated to use nixbuild/nix-quick-install-action@v30 which is
what production workflows actually use.
…production-data

docs/update use cases with examples from production uses
Update status from Alpha to Beta to reflect:
- 58+ installations in production
- 2,000+ active sticky disk entities
- Stable API with no breaking changes planned
- Recent fixes for edge cases (lost+found issue)
- Production-validated across diverse use cases

The action is stable and ready for broader adoption.
chore: promote stickydisk action from Alpha to Beta
BLA-2526: This change updates the documentation to reflect the new default limit
of 10 sticky disks per GitHub Action job (increased from 5).

Changes:
- Updated README.md line 29 from 'up to 5 sticky disks' to 'up to 10 sticky disks'

Note: When Docker pull caching is enabled, it reserves one slot, so customers will
effectively have 9 stickydisk slots available in that configuration.

Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
Add blockdev --flushbufs operation on guest side after unmounting the sticky
disk to ensure data durability before Ceph RBD snapshots are taken.

Changes:
- Add getDeviceFromMount() to extract device path from mount point
- Add flushBlockDevice() that runs blockdev --flushbufs with stats logging
- Log I/O stats from /sys/block/{device}/stat before and after flush
- Add ENABLE_DURABILITY_FLUSH env var for feature flag (defaults to enabled)
- Handle errors gracefully - log warnings but don't fail the cleanup flow

Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
Co-Authored-By: maru@blacksmith.sh <adityamaru@gmail.com>
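
The durability-flush commits above (device lookup, `blockdev --flushbufs`, a 10s limit enforced with the shell `timeout` command, and warn-only error handling) might look roughly like the sketch below. `findDeviceForMount`, `flushBestEffort`, and the `Run` callback are hypothetical names. The sketch also assumes the device path comes from `/proc/mounts`-style text and contains no shell metacharacters.

```typescript
// Hypothetical command runner; the action's real exec helper is not shown here.
type Run = (command: string) => Promise<string>;

// Look up the device backing a mount point in /proc/mounts-style text
// ("device mountpoint fstype options dump pass"). The last match wins,
// since a later mount on the same path shadows earlier ones.
function findDeviceForMount(mountsText: string, mountPoint: string): string | undefined {
  let found: string | undefined;
  for (const line of mountsText.split("\n")) {
    const [device, target] = line.split(" ");
    if (target === mountPoint) found = device;
  }
  return found;
}

// Flush block-device buffers, bounded by the coreutils `timeout` command.
// Failures are logged and swallowed so cleanup can continue.
async function flushBestEffort(
  run: Run,
  device: string,
  timeoutSeconds = 10,
): Promise<boolean> {
  try {
    await run(`sudo timeout ${timeoutSeconds} blockdev --flushbufs ${device}`);
    return true;
  } catch (error) {
    console.warn(`flush of ${device} failed or timed out: ${String(error)}`);
    return false;
  }
}
```

Because the flush runs after unmounting, the device has to be resolved while the mount still exists and passed along to the flush step.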
docs: BLA-2526 - Update stickydisk limit from 5 to 10 disks
…urability-flush

feat(post): add explicit durability flush after unmount (BLA-3202)
bruce-y closed this Mar 10, 2026
4 participants