add cancel request protocol for ZMQ env client-server #962

Open
mikasenghaas wants to merge 1 commit into main from feat/cancel-request-protocol

mikasenghaas (Member) commented Feb 26, 2026

Summary

  • Adds a CancelRequest message type to the ZMQ client-server protocol so the client can notify the server to stop processing cancelled rollout/group requests
  • Client sends cancel messages on CancelledError (e.g. scheduler timeout) and from cancel_all_pending() — fire-and-forget, failures are logged but don't affect the cancellation flow
  • Server tracks request_id → asyncio.Task mappings and cancels the corresponding task when a cancel message arrives, handled inline in the serve loop for minimal latency

Before: Cancellation was one-directional and local-only. The server continued burning inference compute on cancelled requests until the response was either silently ignored or hit a ZMQError.

After: The server receives a cancel message and cancels the in-flight asyncio task, stopping inference work promptly.
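The server-side bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `TaskRegistry`, `track`, and `handle_cancel` are hypothetical names, and `asyncio.sleep(60)` stands in for the inference work.

```python
import asyncio


class TaskRegistry:
    """Hypothetical stand-in for the server's request_id -> task mapping."""

    def __init__(self) -> None:
        self.tasks: dict[str, asyncio.Task] = {}

    def track(self, request_id: str, coro) -> asyncio.Task:
        # Must be called from inside a running event loop.
        task = asyncio.create_task(coro)
        self.tasks[request_id] = task
        # Drop the entry once the task finishes, however it finishes.
        task.add_done_callback(lambda _: self.tasks.pop(request_id, None))
        return task

    def handle_cancel(self, request_ids: list[str]) -> list[str]:
        """Cancel tracked, still-running tasks; return the IDs actually cancelled."""
        cancelled = []
        for request_id in request_ids:
            task = self.tasks.pop(request_id, None)
            if task is not None and not task.done():
                task.cancel()
                cancelled.append(request_id)
        return cancelled


async def demo() -> tuple[list[str], bool]:
    registry = TaskRegistry()
    task = registry.track("req-1", asyncio.sleep(60))  # stands in for inference
    await asyncio.sleep(0)  # give the task a chance to start
    cancelled = registry.handle_cancel(["req-1", "unknown-id"])
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the task was cancelled by handle_cancel
    return cancelled, task.cancelled()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Unknown or already-finished IDs are skipped rather than treated as errors, which matches the "ignores unknown/done IDs" behaviour listed in the test plan.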

Test plan

  • CancelRequest serialization/deserialization roundtrip (Pydantic + msgpack)
  • Client send_cancel() sends properly formatted message, no-ops on empty list, swallows errors
  • Client send_request() catches CancelledError, cleans up pending entry, sends cancel to server
  • Client cancel_all_pending() sends cancel for all pending request IDs
  • Server _handle_cancel() cancels tracked tasks, ignores unknown/done IDs, handles invalid requests
  • Existing crash recovery tests pass (updated mock to skip cancel messages)
  • Full test suite passes (723 passed, only pre-existing alphabet_sort env failure)
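The client-side behaviour exercised by the test plan might look roughly like the sketch below. Only the method names `send_cancel` and `send_request` come from the PR description; `SketchClient`, its `pending` dict, and the `sent_cancels` list (which stands in for the ZMQ socket) are assumptions made for illustration.

```python
import asyncio


class SketchClient:
    """Hypothetical client; the real ZMQEnvClient may be structured differently."""

    def __init__(self) -> None:
        self.pending: dict[str, asyncio.Future] = {}
        self.sent_cancels: list[list[str]] = []  # stands in for the ZMQ socket

    async def send_cancel(self, request_ids: list[str]) -> None:
        if not request_ids:
            return  # no-op on an empty list
        try:
            self.sent_cancels.append(list(request_ids))
        except Exception as exc:  # fire-and-forget: log, never raise
            print(f"Failed to send cancel: {exc}")

    async def send_request(self, request_id: str):
        future = asyncio.get_running_loop().create_future()
        self.pending[request_id] = future
        try:
            return await future
        except asyncio.CancelledError:
            # Clean up locally, notify the server, then propagate the cancellation.
            self.pending.pop(request_id, None)
            await self.send_cancel([request_id])
            raise


async def demo_client() -> tuple[list[list[str]], dict]:
    client = SketchClient()
    task = asyncio.create_task(client.send_request("req-1"))
    await asyncio.sleep(0)  # let send_request register its future
    task.cancel()  # simulates an external cancellation such as a scheduler timeout
    try:
        await task
    except asyncio.CancelledError:
        pass
    return client.sent_cancels, client.pending


if __name__ == "__main__":
    print(asyncio.run(demo_client()))
```

Re-raising `CancelledError` after sending the cancel keeps the caller's cancellation semantics intact; the cancel message is a side effect, not a replacement for the exception.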

🤖 Generated with Claude Code


Note

Medium Risk
Adds a new cross-process cancellation path and changes the ZMQ server receive loop to parse and branch on message type, which can affect in-flight request handling and concurrency edge cases.

Overview
Adds a new CancelRequest message type and exports it via verifiers.workers to support server-side cancellation of in-flight work.

Updates ZMQEnvClient to fire-and-forget send_cancel() calls when send_request() is externally cancelled and when cancel_all_pending() clears pending futures.

Updates ZMQEnvServer to track request_id → asyncio.Task, detect cancel messages inline in the serve loop, and cancel/remove the corresponding task; includes new tests covering serialization, client send behavior, and server cancellation, plus a crash-recovery test tweak to ignore cancel frames.

Written by Cursor Bugbot for commit fd55552.

When a rollout or group request is cancelled (e.g. scheduler timeout),
the client now sends a CancelRequest message to the server so it can
stop processing the request instead of wasting inference compute. The
server tracks request_id→task mappings and cancels the asyncio task on
receiving a cancel message.

Cancel messages are fire-and-forget on the client side — failures are
logged but do not affect the cancellation flow. The server handles
cancels inline in the serve loop (no task spawned) for minimal latency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

self.logger.warning(
    f"Failed to deserialize message {request_id[:7]}"
)
continue
Deserialization failure silently drops request without responding

Low Severity

Moving msgpack.unpackb from process_request (where failures were caught by the except Exception handler that sends an error BaseResponse back to the client) into the serve loop (where failures just continue without sending any response) changes error behavior. Previously, a deserialization failure produced an immediate error response to the client; now the client's future is never resolved and it silently hangs until timeout. While rare in practice, this makes debugging serialization mismatches significantly harder.
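One way to address this, sketched under assumptions: decode in a helper that returns either the message or an error payload, so the serve loop can reply instead of just continuing. `decode_or_error` and the error field names are hypothetical (the actual `BaseResponse` schema is not shown here), and `json` from the standard library stands in for msgpack to keep the example self-contained.

```python
import json
from typing import Optional


def decode_or_error(
    request_id: str, payload: bytes
) -> tuple[Optional[dict], Optional[dict]]:
    """Return (message, None) on success or (None, error_response) on failure."""
    try:
        return json.loads(payload), None
    except ValueError as exc:  # JSONDecodeError and UnicodeDecodeError both derive from it
        # Hypothetical error payload; the real server would build a BaseResponse.
        error = {
            "request_id": request_id,
            "success": False,
            "error": f"Failed to deserialize message: {exc}",
        }
        return None, error


if __name__ == "__main__":
    message, error = decode_or_error("abc1234", b"not json")
    # In the serve loop, a non-None error would be sent back to the client
    # before continuing, so the client's future resolves immediately.
    print(message, error)
```

With this shape, the serve loop can still branch on message type inline while preserving the earlier behaviour of answering a malformed request with an error.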
