Failure-mode evaluation harness for agent systems.
AgentEvalOps defines lightweight scenarios, scorecards, and CLI tooling for testing common failure modes in modular agent systems.
As agent systems become more capable, they also become more fragile in production-like environments. Failures are often not random: they recur in recognizable patterns such as loops, hidden delegation, unsafe retries, budget overflow, and missing safe-exit behavior.
AgentEvalOps exists to make those failure modes easier to define, inspect, and evaluate.
Many agent teams can demo happy-path behavior, but production reliability depends on how systems behave under failure pressure.
Common failure modes include:
- infinite retry loops
- hidden delegation
- budget overflow
- unsafe fallback behavior
- missing approvals
- weak stop conditions
- invalid output structures
- silent degradation under changed conditions
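For illustration, a failure mode such as an infinite retry loop might be captured in a scenario file along the following lines. This is a hypothetical sketch: the field names here are invented, and the real schema is defined in docs/scenario-spec.md.

```yaml
# Hypothetical scenario.yaml -- field names are illustrative only;
# the actual schema is specified in docs/scenario-spec.md.
name: loop-detection
failure_mode: infinite-retry-loop
description: >
  The agent must abandon a failing tool call after a bounded number
  of retries instead of looping indefinitely.
budget:
  max_steps: 20
checks:
  - id: retries-bounded
    expect: no tool call repeats more than 3 times with identical arguments
  - id: safe-exit
    expect: the agent reports failure explicitly rather than timing out
```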
AgentEvalOps introduces a lightweight evaluation layer for these scenarios. It aims to:
- define portable evaluation scenarios
- validate scenario structure
- evaluate common failure modes
- produce readable scorecards
- improve operational confidence before production
AgentEvalOps v0 starts with:
- scenario.yaml parsing
- scenario validation
- human-readable explanation
- simple evaluation flow
- scorecard generation
- example scenarios
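To make the flow concrete, here is a minimal Python sketch of how these stages could compose. Every name in it (`Scorecard`, `evaluate_scenario`, the required keys) is illustrative, not the actual AgentEvalOps API.

```python
# Hypothetical sketch of the v0 pipeline. None of these names are the
# actual AgentEvalOps API; they only show how the stages could compose.
from dataclasses import dataclass, field

import yaml  # PyYAML, assuming scenarios are plain YAML documents


@dataclass
class Scorecard:
    scenario: str
    passed: bool
    notes: list[str] = field(default_factory=list)


def evaluate_scenario(path: str) -> Scorecard:
    # scenario.yaml parsing
    with open(path) as f:
        scenario = yaml.safe_load(f)

    # scenario validation: require a minimal (invented) set of keys
    missing = {"name", "failure_mode", "checks"} - scenario.keys()
    if missing:
        return Scorecard(path, False, [f"missing keys: {sorted(missing)}"])

    # simple evaluation flow: each check is treated as an opaque boolean the
    # harness has already resolved; real scoring would inspect agent behavior
    failed = [c["id"] for c in scenario["checks"] if not c.get("holds", False)]

    # scorecard generation
    return Scorecard(scenario["name"], not failed,
                     [f"failed check: {cid}" for cid in failed])
```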
Later phases may add:
- richer failure semantics
- benchmark packs
- runtime adapters
- CI-oriented evaluation suites
- synthetic scenario generation
- trace-aware scoring
The v0 CLI exposes this workflow end to end:

```sh
agentevalops validate ./examples/loop-detection
agentevalops inspect ./examples/loop-detection
agentevalops explain ./examples/loop-detection
agentevalops evaluate ./examples/loop-detection
agentevalops scorecard ./examples/loop-detection
```
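Looking toward the CI-oriented suites on the roadmap, the commands also compose in automation. A minimal sketch, assuming `agentevalops evaluate` signals failure through a non-zero exit code (an assumption, not documented v0 behavior):

```sh
# Hypothetical CI loop over the example scenarios. Assumes that
# `agentevalops evaluate` exits non-zero when a scenario fails --
# verify this against the actual CLI before relying on it.
set -e
for scenario in ./examples/*/; do
  agentevalops evaluate "$scenario"
done
```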
AgentEvalOps is not a full orchestration runtime.
It is a lightweight evaluation harness for failure-mode testing in modular agent systems.
See:
- docs/vision.md
- docs/architecture.md
- docs/scenario-spec.md
- docs/roadmap.md