enkronos/agentevalops

AgentEvalOps

Failure-mode evaluation harness for agent systems.

AgentEvalOps defines lightweight scenarios, scorecards, and CLI tooling for testing common failure modes in modular agent systems.

As agent systems become more capable, they also become more fragile in production-like environments. Failures are often not random: they recur in recognizable patterns such as loops, hidden delegation, unsafe retries, budget overflow, and missing safe-exit behavior.

AgentEvalOps exists to make those failure modes easier to define, inspect, and evaluate.

Why AgentEvalOps

Many agent teams can demo happy-path behavior, but production reliability depends on how systems behave under failure pressure.

Common failure modes include:

  • infinite retry loops
  • hidden delegation
  • budget overflow
  • unsafe fallback behavior
  • missing approvals
  • weak stop conditions
  • invalid output structures
  • silent degradation under changed conditions

AgentEvalOps introduces a lightweight evaluation layer for these scenarios.
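As an illustration of the first failure mode in the list above, here is a minimal sketch of a retry-loop check over a recorded trace. The trace format (a sequence of `(tool_name, arguments)` pairs), the `find_retry_loop` function, and the `max_repeats` threshold are all assumptions made for this example; they are not taken from the AgentEvalOps codebase.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical trace format: each step is (tool_name, serialized_arguments).
Step = Tuple[str, str]


def find_retry_loop(trace: Iterable[Step], max_repeats: int = 3) -> Optional[Step]:
    """Return the first step seen more than max_repeats times, else None."""
    counts: Counter = Counter()
    for step in trace:
        counts[step] += 1
        if counts[step] > max_repeats:
            return step
    return None


# Illustrative trace: the same search call is issued five times in a row.
trace = [("search", "q=foo")] * 5 + [("summarize", "doc=1")]
looping_step = find_retry_loop(trace)
print(looping_step)
```

Counting identical calls is the simplest possible signal. A real check would likely also consider time windows or near-duplicate arguments.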

Core goals

  • define portable evaluation scenarios
  • validate scenario structure
  • evaluate common failure modes
  • produce readable scorecards
  • improve operational confidence before production
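To make the "readable scorecards" goal concrete, the following sketch shows one way a plain-text scorecard could be rendered from per-check results. The `CheckResult` type, its fields, and the output layout are hypothetical and chosen only for illustration; the actual scorecard format produced by AgentEvalOps is not shown in this README.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    name: str          # hypothetical check label, e.g. a failure mode
    passed: bool
    detail: str = ""   # optional human-readable note


def render_scorecard(scenario_name: str, results: List[CheckResult]) -> str:
    """Render a short plain-text summary of check results."""
    passed = sum(1 for r in results if r.passed)
    lines = [
        f"Scorecard: {scenario_name}",
        f"Passed {passed}/{len(results)} checks",
    ]
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        suffix = f" ({r.detail})" if r.detail else ""
        lines.append(f"  [{status}] {r.name}{suffix}")
    return "\n".join(lines)


# Illustrative results, not real evaluation output.
results = [
    CheckResult("infinite retry loops", True),
    CheckResult("budget overflow", False, "exceeded token budget"),
]
print(render_scorecard("loop-detection", results))
```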

Initial scope

AgentEvalOps v0 starts with:

  • scenario.yaml parsing
  • scenario validation
  • human-readable explanation
  • simple evaluation flow
  • scorecard generation
  • example scenarios
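The validation step in the list above might look roughly like the sketch below. The field names (`name`, `failure_mode`, `steps`) are placeholders invented for this example, and a plain dict stands in for a parsed scenario.yaml. The actual scenario format is the one referenced as docs/scenario-spec.md.

```python
from typing import Any, Dict, List

# Hypothetical required fields and types; not the project's real schema.
REQUIRED_FIELDS = {"name": str, "failure_mode": str, "steps": list}


def validate_scenario(scenario: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the scenario is valid."""
    problems: List[str] = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in scenario:
            problems.append(f"missing required field: {field}")
        elif not isinstance(scenario[field], expected_type):
            problems.append(
                f"field {field!r} should be of type {expected_type.__name__}"
            )
    steps = scenario.get("steps")
    if isinstance(steps, list) and not steps:
        problems.append("steps must not be empty")
    return problems


# A dict standing in for the parsed contents of a scenario.yaml file.
scenario = {"name": "loop-detection", "failure_mode": "infinite_retry_loop", "steps": []}
print(validate_scenario(scenario))
```

Returning a list of problems rather than raising on the first error lets a single validation run report everything wrong with a scenario.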

Later phases may add:

  • richer failure semantics
  • benchmark packs
  • runtime adapters
  • CI-oriented evaluation suites
  • synthetic scenario generation
  • trace-aware scoring

Example

agentevalops validate ./examples/loop-detection
agentevalops inspect ./examples/loop-detection
agentevalops explain ./examples/loop-detection
agentevalops evaluate ./examples/loop-detection
agentevalops scorecard ./examples/loop-detection
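The five commands above can also be scripted. The sketch below builds the same command lines and runs them in order, stopping at the first failure. It assumes the `agentevalops` executable is on the PATH and that a nonzero exit code signals failure; this README does not confirm either. The `build_commands` and `run_all` helpers are hypothetical names introduced for this example.

```python
import subprocess
from typing import Callable, List, Optional

# Subcommands exactly as listed in the example above.
SUBCOMMANDS = ["validate", "inspect", "explain", "evaluate", "scorecard"]


def build_commands(scenario_dir: str) -> List[List[str]]:
    """Build one argv list per subcommand for the given scenario directory."""
    return [["agentevalops", sub, scenario_dir] for sub in SUBCOMMANDS]


def run_all(
    scenario_dir: str,
    runner: Optional[Callable[[List[str]], int]] = None,
) -> int:
    """Run each subcommand in order; return the first nonzero exit code, or 0."""
    if runner is None:
        # Default runner shells out to the real CLI (assumed to be installed).
        def runner(argv: List[str]) -> int:
            return subprocess.run(argv).returncode

    for argv in build_commands(scenario_dir):
        code = runner(argv)
        if code != 0:
            return code
    return 0


# Usage (requires the CLI to be installed):
#   exit_code = run_all("./examples/loop-detection")
```

Passing a custom `runner` makes the sequence testable without invoking the real CLI.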

Philosophy

AgentEvalOps is not a full orchestration runtime.

It is a lightweight evaluation harness for failure-mode testing in modular agent systems.

Roadmap

See:

  • docs/vision.md
  • docs/architecture.md
  • docs/scenario-spec.md
  • docs/roadmap.md