[GCS Connector] Add Workload Identity Federation as authentication type #749

@Pankesito

Description

Summary

The ADF Google Cloud Storage connector currently only supports HMAC key authentication. This feature request asks for Workload Identity Federation (WIF) as an additional authenticationType, enabling secretless, identity-based access to GCS from Azure Data Factory.

Problem

Organizations with many GCP projects face an O(n) scaling problem with HMAC keys:

| GCP Projects | HMAC Keys | Key Vault Secrets | Rotation Events/Year |
|---|---|---|---|
| 10 | 10 | 10 | 20–40 |
| 50 | 50 | 50 | 100–200 |
| 100 | 100 | 100 | 200–400 |

Each HMAC key is a long-lived static credential — created per-project, stored in Key Vault, manually rotated. This does not align with zero-trust principles.

Proposed Solution

Add WorkloadIdentityFederation as an authenticationType on the GCS linked service. The token flow:

  1. ADF Managed Identity → Azure Entra ID access token (JWT, ~1h TTL)
  2. GCP Security Token Service → Federated access token (RFC 8693)
  3. GCP IAM Credentials API → Short-lived GCP access token via SA impersonation (1h)
  4. GCS API → Read objects using the SA token
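
Step 2 above can be sketched as a single RFC 8693 token-exchange call. This is a hedged illustration (the helper names and the use of `requests` are ours; google-auth performs the equivalent call internally):

```python
STS_URL = "https://sts.googleapis.com/v1/token"

def sts_exchange_payload(entra_jwt: str, wif_audience: str) -> dict:
    """Build the RFC 8693 form body that trades an Entra ID JWT for a
    federated GCP access token. `wif_audience` is the full provider
    resource name (//iam.googleapis.com/projects/.../providers/...)."""
    return {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "audience": wif_audience,
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
        "subject_token": entra_jwt,
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
    }

def exchange_for_federated_token(entra_jwt: str, wif_audience: str) -> str:
    """Step 2: POST the exchange request to GCP STS."""
    import requests  # lazy import keeps the payload builder dependency-free
    resp = requests.post(
        STS_URL,
        data=sts_exchange_payload(entra_jwt, wif_audience),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```

The federated token returned here is what step 3 then presents to the IAM Credentials API for service-account impersonation.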

Proposed Linked Service Schema

{
  "type": "GoogleCloudStorage",
  "typeProperties": {
    "authenticationType": "WorkloadIdentityFederation",
    "workloadIdentityFederation": {
      "gcpProjectNumber": "123456789012",
      "workloadIdentityPoolId": "azure-adf-pool",
      "workloadIdentityProviderId": "azure-entra-oidc",
      "serviceAccountEmail": "adf-reader@my-project.iam.gserviceaccount.com"
    }
  }
}
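
To make the required fields explicit, here is a hedged sketch of how the new block could be validated. Field names follow the proposed schema above; nothing here is an existing ADF API:

```python
# Required keys of the proposed workloadIdentityFederation block (illustrative).
REQUIRED_WIF_FIELDS = (
    "gcpProjectNumber",
    "workloadIdentityPoolId",
    "workloadIdentityProviderId",
    "serviceAccountEmail",
)

def validate_wif_properties(type_properties: dict) -> list:
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    if type_properties.get("authenticationType") != "WorkloadIdentityFederation":
        errors.append("authenticationType must be 'WorkloadIdentityFederation'")
    wif = type_properties.get("workloadIdentityFederation") or {}
    for field in REQUIRED_WIF_FIELDS:
        if not wif.get(field):
            errors.append(f"missing required field: {field}")
    # WIF audiences are built from the numeric project *number*, not the project ID.
    if wif.get("gcpProjectNumber") and not str(wif["gcpProjectNumber"]).isdigit():
        errors.append("gcpProjectNumber must be numeric (project number, not ID)")
    return errors
```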

Benefits over HMAC

| Property | HMAC Keys | WIF |
|---|---|---|
| Credential lifetime | Indefinite | ~1 hour (auto-refreshed) |
| Secret storage | Key Vault per project | None |
| Rotation | Manual per project | Automatic |
| Blast radius | Full bucket access if leaked | N/A — no secret to leak |
| Multi-project scale | O(n) keys | O(1) pool + O(n) IAM bindings |
| Auditability | Key Vault logs only | Entra + GCP Cloud Audit Logs |

Precedent: Microsoft Defender for Cloud

Microsoft already ships this pattern in production. Defender for Cloud's GCP connector uses WIF to access GCP APIs via Entra ID — no stored credentials.

| Aspect | Defender for Cloud | Proposed ADF Connector |
|---|---|---|
| Identity source | Defender service principal | ADF Managed Identity |
| Token exchange | Entra JWT → GCP STS | Entra JWT → GCP STS |
| SA impersonation | Yes | Yes |
| Credential storage | None | None |
| Multi-project support | Yes (org connector) | Yes (per linked service) |

PoC Validation

We validated a complete PoC demonstrating the full credential chain end-to-end, using msal, google-auth, google-cloud-storage, and azure-identity.

Successful Execution

============================================================
  ADF <-> GCS  Workload Identity Federation  PoC
  Azure Entra ID -> GCP STS -> SA Impersonation -> GCS
  No HMAC keys. No long-lived secrets.
============================================================

INFO  token_exchange  Acquired Entra ID access token via client_secret (length=1247 chars)
INFO  token_exchange  GCP credentials acquired — SA: adf-gcs-reader@..., expires: 2026-03-03T13:00:02+00:00

  GCS Bucket: gs://adf-gcs-wif-poc-xxxxxxxx
  Objects found (1):
    - test-data/sample.csv

  SUCCESS: GCS access via Workload Identity Federation confirmed.
============================================================

Test Suite (99 tests, 99.3% coverage)

$ python -m pytest tests/unit/ --cov --cov-report=term-missing -q

........................................................................ [72%]
...........................                                              [100%]

Name                       Stmts   Miss  Cover   Missing
--------------------------------------------------------
python/blob_sink.py           37      0   100%
python/config.py              61      0   100%
python/gcs_client.py          45      0   100%
python/main.py                72      0   100%
python/token_exchange.py      72      2    97%
--------------------------------------------------------
TOTAL                        287      2    99%

99 passed

Negative Test (Access Control Enforced)

When the IAM workloadIdentityUser binding is removed, the flow correctly fails with a 403, confirming access control is enforced at the identity level — not via static keys.
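
The negative path can be expressed as a small check. The sketch below uses stubs in place of a real `storage.Client` so the 403 branch is exercisable without GCP credentials; in the actual PoC the raised exception is `google.api_core.exceptions.Forbidden`:

```python
class Forbidden(Exception):
    """Stand-in for google.api_core.exceptions.Forbidden (HTTP 403)."""

class DeniedBucketStub:
    """Simulates a bucket whose caller lacks roles/iam.workloadIdentityUser."""
    def list_blobs(self):
        raise Forbidden("403: caller does not have storage.objects.list")

class AllowedBucketStub:
    """Simulates a bucket the federated identity may read."""
    def list_blobs(self):
        return iter([])

def access_is_denied(bucket) -> bool:
    """True iff listing fails with a 403, i.e. IAM is enforcing at the identity level."""
    try:
        list(bucket.list_blobs())
    except Forbidden:
        return True
    return False
```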

Integration Issues Discovered and Resolved

| # | Issue | Root Cause | Resolution |
|---|---|---|---|
| 1 | Token issuer mismatch | Default Entra tokens are v1 (`iss`: `sts.windows.net/{tid}`); GCP expects v2 | Set `accessTokenAcceptedVersion = 2` on the Entra app manifest (ref) |
| 2 | Audience mismatch | v2 tokens set `aud` to the client_id (GUID), not the App ID URI | Configure GCP `allowed_audiences` with `[client_id]` (ref) |
| 3 | SA impersonation 400 | `identity_pool.Credentials` requires OAuth scopes before the impersonation call | Call `.with_scopes(["cloud-platform"])` before `.refresh()` |
| 4 | Entra identifier URI policy | Tenant policy requires the tenant ID in `api://` URIs | Dynamically compute the URI as `api://{tenant_id}/app-name` |
| 5 | File encoding on Windows | PowerShell `>` redirect produces UTF-16 with BOM | Use UTF-8 no-BOM encoding |
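
For issue 5, a hedged sketch of the fix (the helper name is illustrative): write the subject-token file explicitly as UTF-8 without a BOM, since the text-format `credential_source` reads the file verbatim:

```python
def write_subject_token(path: str, jwt: str) -> None:
    """Write the Entra JWT for credential_source consumption.

    PowerShell's `>` redirection emits UTF-16 with a BOM, which breaks the
    text-format token parser; writing explicitly as UTF-8 without a BOM
    avoids the problem on Windows.
    """
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(jwt)
```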

GCP-Side Configuration (What Customers Set Up)

OIDC Provider (Terraform)

resource "google_iam_workload_identity_pool" "this" {
  project                   = var.gcp_project_id
  workload_identity_pool_id = "azure-adf-pool"
  display_name              = "Azure ADF Pool"
}

resource "google_iam_workload_identity_pool_provider" "azure_entra" {
  project                            = var.gcp_project_id
  workload_identity_pool_id          = google_iam_workload_identity_pool.this.workload_identity_pool_id
  workload_identity_pool_provider_id = "azure-entra-oidc"

  attribute_mapping = {
    "google.subject"   = "assertion.sub"   # Azure SP object ID
    "attribute.tid"    = "assertion.tid"   # Azure tenant ID
    "attribute.app_id" = "assertion.azp"   # Azure client ID
  }

  # Only accept tokens from the specific Azure tenant
  attribute_condition = "attribute.tid == '${var.azure_tenant_id}'"

  oidc {
    issuer_uri        = "https://login.microsoftonline.com/${var.azure_tenant_id}/v2.0"
    allowed_audiences = [var.azure_client_id]
  }
}

Per-Project Service Account + IAM Binding

resource "google_service_account" "adf_reader" {
  project      = var.gcp_project_id
  account_id   = "adf-gcs-reader"
  display_name = "ADF GCS Reader (WIF)"
}

resource "google_service_account_iam_member" "wif_impersonation" {
  service_account_id = google_service_account.adf_reader.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/projects/${var.gcp_project_number}/locations/global/workloadIdentityPools/azure-adf-pool/attribute.app_id/${var.azure_client_id}"
}

# Grant read access to the target bucket
resource "google_storage_bucket_iam_member" "reader" {
  bucket = var.gcs_bucket_name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.adf_reader.email}"
}

Core Token Exchange Code

The actual 3-step flow that ADF would implement internally:

Step 1 — Acquire Entra token (Managed Identity in production):

from azure.identity import ManagedIdentityCredential

credential = ManagedIdentityCredential(client_id=azure_client_id)
token = credential.get_token(f"{app_id_uri}/.default")

Steps 2+3 — STS exchange + SA impersonation (handled by google-auth):

import google.auth.transport.requests  # needed for the Request() transport below
from google.auth import identity_pool

creds = identity_pool.Credentials.from_info({
    "type": "external_account",
    "audience": "//iam.googleapis.com/projects/{project_number}/locations/global/workloadIdentityPools/{pool_id}/providers/{provider_id}",
    "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
    "token_url": "https://sts.googleapis.com/v1/token",
    "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/{sa_email}:generateAccessToken",
    "credential_source": {"file": "<path_to_azure_jwt>", "format": {"type": "text"}}
})

scoped = creds.with_scopes(["https://www.googleapis.com/auth/cloud-platform"])
scoped.refresh(google.auth.transport.requests.Request())
# scoped is now a valid GCP credential — pass to any GCS client

Step 4 — Use with GCS (unchanged from existing connector):

from google.cloud import storage

client = storage.Client(credentials=scoped, project=None)
bucket = client.bucket("my-bucket")
blobs = list(bucket.list_blobs())

What the ADF Team Would Need to Build

  1. Add WorkloadIdentityFederation as an authenticationType option (backward-compatible — HMAC remains the default)
  2. Implement the 3-step token flow: Managed Identity → STS → SA impersonation
  3. Use the resulting GCP access token with the existing GCS data reader (no changes needed)
  4. Add configuration fields to Portal UI, ARM/Bicep, and REST API schemas
  5. Document the GCP-side WIF setup for customers

No changes to GCS APIs are required — only a new authentication path in the ADF connector.

Interim Workaround

Until native support ships, customers can use an ADF Custom Activity running on Azure Batch with a User-Assigned Managed Identity to execute the WIF token exchange and transfer data from GCS to Azure Blob Storage. We have a working implementation of this approach with dual auth mode (client_secret for dev, managed_identity for production).
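
The dual auth mode mentioned above can be sketched as a small credential factory (function and mode names are illustrative, not the PoC's actual identifiers):

```python
def make_azure_credential(mode: str, *, tenant_id=None, client_id=None, client_secret=None):
    """Return an Entra credential: client_secret for dev, managed_identity for production.

    Imports are deferred so the selection logic itself has no azure-identity
    dependency at module load time.
    """
    if mode == "managed_identity":
        # Production: User-Assigned Managed Identity on the Azure Batch node.
        from azure.identity import ManagedIdentityCredential
        return ManagedIdentityCredential(client_id=client_id)
    if mode == "client_secret":
        # Dev: app registration with a client secret.
        from azure.identity import ClientSecretCredential
        return ClientSecretCredential(tenant_id, client_id, client_secret)
    raise ValueError(f"unknown auth mode: {mode!r}")
```

Either credential's `get_token(...)` result is then fed into the same WIF exchange, so the GCP side is identical in both modes.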

References

Microsoft Learn

Google Cloud

Terraform Registry

Standards
