feat: implement provisioning script hotfix system#8086
- Added a mechanism for publishing hotfixed provisioning scripts as OCI artifacts.
- Nodes can now autonomously detect and pull hotfixes during provisioning.
- Stamped VHDs with provisioning scripts version for hotfix detection.
- Introduced `build-hotfix-oci.sh` for building and pushing hotfix artifacts.
- Created `manifest.json` to map SKUs to their script inventories.
- Added README documentation for the hotfix system and usage instructions.
Pull request overview
Implements an OCI-based “provisioning script hotfix” mechanism so nodes can detect and apply updated provisioning scripts at CSE/provisioning time without requiring a new VHD release.
Changes:
- Adds a SKU→script inventory `manifest.json` plus a `build-hotfix-oci.sh` tool to package and publish hotfix artifacts to an OCI registry.
- Adds hotfix detection/application logic to `cse_start.sh`, and stamps VHDs with a provisioning-scripts version for tag matching.
- Adds an e2e scenario test and updates generated CustomData test snapshots accordingly.
Reviewed changes
Copilot reviewed 70 out of 76 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `vhdbuilder/provisioning-manifest/manifest.json` | Defines source→destination mappings and permissions for hotfix-packaged scripts (currently Ubuntu 22.04). |
| `vhdbuilder/provisioning-manifest/build-hotfix-oci.sh` | Builds a tarball + metadata and pushes them as an OCI artifact via oras. |
| `vhdbuilder/provisioning-manifest/README.md` | Documents the hotfix workflow and operator usage. |
| `vhdbuilder/packer/install-dependencies.sh` | Stamps the VHD with a provisioning-scripts version used for hotfix tag detection. |
| `parts/linux/cloud-init/artifacts/cse_start.sh` | Adds node-side detection/pull/extract logic for hotfix artifacts during provisioning. |
| `e2e/scenario_test.go` | Adds an e2e scenario validating hotfix detection runs and handles the no-hotfix case. |
| `Makefile` | Adds a convenience target to invoke the hotfix build script. |
| `.pipelines/scripts/verify_shell.sh` | Marks the new script as bash-only for linting/verification. |
| `pkg/agent/testdata/**` | Regenerates CustomData snapshot blobs to reflect updated provisioning scripts. |
# Generate metadata
METADATA_PATH="${OUTPUT_DIR}/hotfix-metadata.json"
cat > "$METADATA_PATH" <<EOF
{
  "hotfixId": "${HOTFIX_TAG}",
  "affectedVersion": "${AFFECTED_VERSION}",
  "sku": "${SKU}",
  "description": "${DESCRIPTION}",
  "sourceCommit": "${SOURCE_COMMIT}",
  "createdAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "tarballSha256": "${TARBALL_SHA256}",
  "files": ${METADATA_FILES_JSON}
}
EOF
`hotfix-metadata.json` is built via a heredoc with `${DESCRIPTION}` interpolated directly into JSON. If the description contains quotes, backslashes, or newlines, the resulting file will be invalid JSON (and can break consumers). Generate the metadata using `jq -n --arg ...` (or otherwise JSON-escape fields) instead of string concatenation; the same applies to `METADATA_FILES_JSON` entries.
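One way the suggestion above could look is sketched here. It assumes the same shell variables as the quoted heredoc and assumes that `METADATA_FILES_JSON` already holds a valid JSON array; `write_hotfix_metadata` is a hypothetical helper name, not something in the PR.

```shell
# Sketch only: write_hotfix_metadata is a hypothetical helper, not PR code.
# Arguments: $1 = output path, $2 = JSON array describing the files
# (assumed to be valid JSON already, e.g. the value of METADATA_FILES_JSON).
write_hotfix_metadata() {
    local out_path="$1" files_json="$2"
    # --arg passes each value as a JSON string, so quotes, backslashes and
    # newlines are escaped by jq; --argjson parses its value as JSON.
    jq -n \
        --arg hotfixId "${HOTFIX_TAG}" \
        --arg affectedVersion "${AFFECTED_VERSION}" \
        --arg sku "${SKU}" \
        --arg description "${DESCRIPTION}" \
        --arg sourceCommit "${SOURCE_COMMIT}" \
        --arg createdAt "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        --arg tarballSha256 "${TARBALL_SHA256}" \
        --argjson files "${files_json}" \
        '{hotfixId: $hotfixId, affectedVersion: $affectedVersion, sku: $sku,
          description: $description, sourceCommit: $sourceCommit,
          createdAt: $createdAt, tarballSha256: $tarballSha256, files: $files}' \
        > "${out_path}"
}
```

Because `--argjson` rejects malformed input, a broken files array would fail at build time instead of producing an unparseable metadata file.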
# Defaults
REGISTRY="hotfixscriptpoc.azurecr.io"
DRY_RUN=false
SKU=""
AFFECTED_VERSION=""
DESCRIPTION=""
FILES=""

usage() {
    cat <<EOF
Usage: $(basename "$0") [OPTIONS]

Build and push a provisioning script hotfix as an OCI artifact.

Required:
  --sku <sku>                 Target OS SKU (e.g., ubuntu-2204)
  --affected-version <ver>    Baked VHD image version to hotfix (e.g., 202602.10.0)
  --description <desc>        Human-readable description of the hotfix
  --files <paths>             Comma-separated source paths of changed files
                              (relative to repo root, e.g., parts/linux/cloud-init/artifacts/cse_helpers.sh)

Optional:
  --registry <registry>       Target container registry (default: hotfixscriptpoc.azurecr.io)
  --dry-run                   Build artifact locally but do not push to registry

Examples:
  # Build and push a hotfix for a single file
  $(basename "$0") \\
    --sku ubuntu-2204 \\
    --affected-version 202602.10.0 \\
    --description "Fix CVE-2026-XXXX in provision_source.sh" \\
    --files "parts/linux/cloud-init/artifacts/cse_helpers.sh"

  # Dry-run (no push) with multiple files
  $(basename "$0") \\
    --sku ubuntu-2204 \\
    --affected-version 202602.10.0 \\
    --description "Fix provisioning regression" \\
    --files "parts/linux/cloud-init/artifacts/cse_helpers.sh,parts/linux/cloud-init/artifacts/cse_install.sh" \\
    --dry-run
EOF
    exit 1
}

# Parse arguments
while [[ $# -gt 0 ]]; do
    case "$1" in
        --sku) SKU="$2"; shift 2 ;;
        --affected-version) AFFECTED_VERSION="$2"; shift 2 ;;
        --description) DESCRIPTION="$2"; shift 2 ;;
        --files) FILES="$2"; shift 2 ;;
        --registry) REGISTRY="$2"; shift 2 ;;
        --dry-run) DRY_RUN=true; shift ;;
        -h|--help) usage ;;
        *) echo "ERROR: Unknown option: $1"; usage ;;
    esac
done

# Validate required arguments
if [[ -z "$SKU" || -z "$AFFECTED_VERSION" || -z "$DESCRIPTION" || -z "$FILES" ]]; then
    echo "ERROR: Missing required arguments."
    usage
fi

# Validate manifest exists
if [[ ! -f "$MANIFEST_FILE" ]]; then
    echo "ERROR: Manifest not found at ${MANIFEST_FILE}"
    exit 1
fi

# Validate SKU exists in manifest
if ! jq -e ".skus[\"${SKU}\"]" "$MANIFEST_FILE" > /dev/null 2>&1; then
    echo "ERROR: SKU '${SKU}' not found in manifest."
    echo "Available SKUs: $(jq -r '.skus | keys[]' "$MANIFEST_FILE" | tr '\n' ', ')"
    exit 1
fi

# Validate oras is available
if ! command -v oras &>/dev/null; then
    echo "ERROR: 'oras' CLI not found. Install from https://oras.land/"
    exit 1
fi

HOTFIX_TAG="${AFFECTED_VERSION}-hotfix"
REPOSITORY=$(jq -r ".skus[\"${SKU}\"].repository" "$MANIFEST_FILE")
ARTIFACT_REF="${REGISTRY}/${REPOSITORY}:${HOTFIX_TAG}"
ARTIFACT_TYPE="application/vnd.aks.provisioning-scripts.hotfix.v1"
`manifest.json` includes a per-SKU `registry` field, but `build-hotfix-oci.sh` ignores it and always defaults to the hardcoded `hotfixscriptpoc.azurecr.io` unless `--registry` is provided. This makes the manifest misleading and increases the chance of publishing to the wrong registry. Consider defaulting `REGISTRY` from `.skus[sku].registry` (and letting `--registry` override).
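A minimal sketch of that precedence follows, assuming the manifest layout named in the comment (`.skus[sku].registry`). `resolve_registry` is a hypothetical helper, and the script would also need to track whether `--registry` was actually passed instead of pre-filling `REGISTRY` with a default.

```shell
# Sketch only: resolve_registry is a hypothetical helper, not PR code.
# Arguments: $1 = manifest path, $2 = SKU, $3 = value of --registry (may be empty).
resolve_registry() {
    local manifest="$1" sku="$2" override="${3:-}"
    # An explicit --registry always wins.
    if [ -n "${override}" ]; then
        printf '%s\n' "${override}"
        return 0
    fi
    # Otherwise fall back to the per-SKU registry recorded in the manifest.
    # Passing the SKU with --arg avoids interpolating it into the jq program.
    local from_manifest
    from_manifest="$(jq -r --arg sku "${sku}" '.skus[$sku].registry // empty' "${manifest}")"
    if [ -z "${from_manifest}" ]; then
        echo "ERROR: no registry for SKU '${sku}' in ${manifest}; pass --registry" >&2
        return 1
    fi
    printf '%s\n' "${from_manifest}"
}
```

Usage would be along the lines of `REGISTRY="$(resolve_registry "$MANIFEST_FILE" "$SKU" "$REGISTRY_OVERRIDE")" || exit 1`, where `REGISTRY_OVERRIDE` is an assumed variable set only by the `--registry` flag.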
1. **VHD Build**: Each VHD is stamped with the AgentBaker commit SHA in `/opt/azure/containers/.provisioning-scripts-version`
2. **Hotfix Publish**: An operator builds and pushes corrected scripts as an OCI artifact tagged `<baked-version>-hotfix`
3. **Node Detection**: At provisioning time, `check_for_script_hotfix()` in `cse_start.sh` checks the registry for a matching hotfix tag
4. **Overlay**: If found, the tarball is extracted over the baked scripts before `provision.sh` runs
5. **Fallback**: Any failure is non-fatal: nodes always proceed with baked scripts
The README says the VHD is stamped with the AgentBaker commit SHA in `/opt/azure/containers/.provisioning-scripts-version`, but the VHD build change stamps `IMAGE_VERSION` when available (VHD SIG version) and only falls back to the commit SHA. Please update the README wording to match the actual stamp semantics so operators publish hotfixes against the right value.
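The stamp semantics described in this comment, and the tag an operator would publish against, can be sketched as below. Both function names are hypothetical; the actual logic in `install-dependencies.sh` is not shown in this review and may differ.

```shell
# Sketch only: both helpers are hypothetical illustrations of the review comment.
# resolve_scripts_version: $1 = IMAGE_VERSION (may be empty), $2 = commit SHA.
resolve_scripts_version() {
    local image_version="${1:-}" commit_sha="${2:-}"
    if [ -n "${image_version}" ]; then
        # Preferred stamp: the VHD SIG image version.
        printf '%s\n' "${image_version}"
    else
        # Fallback stamp: the AgentBaker commit SHA.
        printf '%s\n' "${commit_sha}"
    fi
}

# hotfix_tag_for: the tag that matches a stamped value, "<baked-version>-hotfix".
hotfix_tag_for() {
    printf '%s-hotfix\n' "$1"
}
```

An operator hotfixing image `202602.10.0` would therefore publish under `202602.10.0-hotfix`, while a VHD built without `IMAGE_VERSION` would only match a tag derived from its commit SHA.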
    if ! timeout 30 oras manifest fetch "${repo}:${hotfix_tag}" > /dev/null 2>&1; then
        echo "$(date): Hotfix check: no hotfix tag '${hotfix_tag}' found (normal case)" >> "$hotfix_log"
The `oras manifest fetch` probe logs "no hotfix tag found (normal case)" for any failure (including DNS/network timeouts and 401/403 auth failures), because stderr is discarded. This will make real outages look like the normal no-hotfix path. Consider capturing the error output/exit code and logging a different message for transient/auth errors vs an actual missing tag.
Suggested change (replaces the two lines quoted above):

    local manifest_err
    manifest_err=$(timeout 30 oras manifest fetch "${repo}:${hotfix_tag}" 2>&1 >/dev/null)
    local manifest_rc=$?
    if [ "$manifest_rc" -ne 0 ]; then
        # Distinguish "tag truly missing" (normal case) from other failures
        if echo "$manifest_err" | grep -qiE "404|MANIFEST_UNKNOWN|NAME_UNKNOWN|not[[:space:]]+found"; then
            echo "$(date): Hotfix check: no hotfix tag '${hotfix_tag}' found (normal case)" >> "$hotfix_log"
        else
            echo "$(date): Hotfix check: oras manifest fetch failed for '${hotfix_tag}' (rc=${manifest_rc}): ${manifest_err}" >> "$hotfix_log"
        fi
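To make the missing-tag pattern easy to check without a registry, the classification can be pulled into a small pure function, sketched here. `classify_manifest_error` is a hypothetical name, and the matched strings are assumptions about what the oras CLI prints, not confirmed output. Note that under `grep -E` the one-or-more quantifier is written `+`; `\+` would match a literal plus sign.

```shell
# Sketch only: classify_manifest_error is a hypothetical helper.
# Prints "missing" when the error text looks like an absent tag or repository,
# and "failure" for anything else (auth errors, timeouts, DNS problems, ...).
classify_manifest_error() {
    local err="$1"
    if printf '%s' "${err}" | grep -qiE '404|MANIFEST_UNKNOWN|NAME_UNKNOWN|not[[:space:]]+found'; then
        echo "missing"
    else
        echo "failure"
    fi
}
```

Matching a bare `404` is deliberately loose and could also hit unrelated text that happens to contain those digits, so a stricter pattern may be preferable once the real error strings are known.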
…onanonwestus3
- Changed registry references in provisioning manifest and related scripts to use `abe2eprivatenonanonwestus3.azurecr.io` instead of `hotfixscriptpoc.azurecr.io`.
- Updated README.md to reflect the new registry for testing and artifact verification.
- Modified `build-hotfix-oci.sh` to set the default registry to `abe2eprivatenonanonwestus3.azurecr.io`.
- Adjusted `manifest.json` to point to the new registry for provisioning scripts.

- Renamed `oras_login_with_kubelet_identity` to `oras_login_with_managed_identity` in tests to reflect the new implementation.
- Enhanced README to clarify the hotfix process, including the use of managed identity for ORAS login and registry selection details.
- Updated `build-hotfix-oci.sh` to create tarballs without including the staging directory root entry, preventing permission issues during extraction.
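The tarball change in the last bullet might look roughly like the sketch below. The PR's actual `tar` invocation is not quoted here, so this is only one way to do it, assuming GNU `find` and GNU `tar`; `build_hotfix_tarball` is a hypothetical name.

```shell
# Sketch only: build_hotfix_tarball is a hypothetical helper, assuming GNU find/tar.
# Arguments: $1 = staging directory, $2 = output .tar.gz path (absolute is safest).
build_hotfix_tarball() {
    local staging_dir="$1" out_tar="$2"
    # List the top-level entries (including dotfiles) relative to the staging
    # directory and archive those, so the archive has no "./" member for the
    # staging root itself. Directories listed this way are still recursed into.
    find "${staging_dir}" -mindepth 1 -maxdepth 1 -printf '%P\0' \
        | tar -czf "${out_tar}" -C "${staging_dir}" --null -T -
}
```

Archiving `.` instead would add a `./` entry, and extracting that entry can apply the staging directory's mode and ownership to the extraction target, which is the kind of permission issue the commit message mentions.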
Pull request overview
Copilot reviewed 29 out of 84 changed files in this pull request and generated 3 comments.
# Defaults
REGISTRY="abe2eprivatenonanonwestus3.azurecr.io"
DRY_RUN=false

### Step-by-Step

1. **Identify affected versions**: Determine which baked VHD versions contain the bug. Check the version stamp format (currently git commit SHA).
… for consistency
- Changed the conditional check for `IMAGE_VERSION` in `install-dependencies.sh` from double brackets to single brackets for consistency with shell scripting best practices.
Pull request overview
Copilot reviewed 29 out of 84 changed files in this pull request and generated 6 comments.
    `sudo sed -n '/^check_for_script_hotfix()/,/^}/p' /opt/azure/containers/provision_start.sh > /tmp/hotfix_func.sh`,
    `sudo bash -c 'echo "999999.99.0" > /opt/azure/containers/.provisioning-scripts-version'`,
    `sudo rm -f /opt/azure/containers/.hotfix-applied`,
    `sudo bash -c '> /var/log/azure/hotfix-check.log'`,
    fmt.Sprintf(`sudo bash -c 'export PATH=/opt/bin:$PATH && source /tmp/hotfix_func.sh && export HOTFIX_REGISTRY=%s && check_for_script_hotfix'`, hotfixRegistry),
    nbc.ContainerService.Properties.SecurityProfile = &datamodel.SecurityProfile{
        PrivateEgress: &datamodel.PrivateEgress{
            Enabled: true,
            ContainerRegistryServer: fmt.Sprintf("%s.azurecr.io/aks-managed-repository", config.PrivateACRNameNotAnon(config.Config.DefaultLocation)),
Pull request overview
Copilot reviewed 29 out of 84 changed files in this pull request and generated 1 comment.
What this PR does / why we need it:
- Introduced `build-hotfix-oci.sh` for building and pushing hotfix artifacts.
- Created `manifest.json` to map SKUs to their script inventories.

Which issue(s) this PR fixes:
Fixes #