diff --git a/docs/design/anthropic-article-summary.md b/docs/design/anthropic-article-summary.md new file mode 100644 index 0000000..2e39320 --- /dev/null +++ b/docs/design/anthropic-article-summary.md @@ -0,0 +1,219 @@ +# Effective Harnesses for Long-Running Agents + +## Article Summary + +Source: [Anthropic Engineering Blog](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) + +### The Problem + +Long-running agents face a fundamental challenge: maintaining coherent state and context across multiple sessions. Each new session begins without memory of prior work—like an engineering shift change where incoming workers lack context. Without structure, agents may: + +- Lose track of completed work +- Re-implement already-finished features +- Leave code in half-finished, unmergeable states +- Fail to recover context efficiently when resuming +- Declare victory prematurely without proper verification + +### The Solution: Structured Tracking Files + +The article advocates for a two-phase approach using an **Initializer Agent** followed by incremental **Coding Agent** sessions, connected by persistent tracking files. + +--- + +## Key Files + +### 1. feature_list.json + +A structured JSON file listing all required features with status tracking. + +```json +{ + "category": "functional", + "description": "New chat button creates fresh conversation", + "steps": ["Step-by-step verification instructions"], + "passes": false +} +``` + +**Critical constraint**: Agents may **only modify the `passes` field**. The article states: + +> "It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality." + +**Format choice**: JSON was chosen over Markdown for the feature list: + +> "We landed on using JSON for this, as the model is less likely to inappropriately change or overwrite JSON files compared to Markdown files." 
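A harness can enforce this constraint mechanically rather than trusting the prompt alone. The sketch below (hypothetical helper names, not code from the article) applies a status update and accepts an edited list only if every field except `passes` is unchanged:

```python
import copy

def apply_passes_update(features, index, passed):
    """Return a copy of the feature list with only the `passes` field
    of one entry changed; all other fields are untouched by construction."""
    updated = copy.deepcopy(features)
    updated[index]["passes"] = bool(passed)
    return updated

def assert_only_passes_changed(old, new):
    """Reject edits that add, remove, or rewrite features —
    only the `passes` field may differ between old and new."""
    if len(old) != len(new):
        raise ValueError("features were added or removed")
    for before, after in zip(old, new):
        for key in ("category", "description", "steps"):
            if before.get(key) != after.get(key):
                raise ValueError(f"immutable field {key!r} was edited")
```

A harness could run `assert_only_passes_changed` against the agent's rewritten `feature_list.json` before committing it, turning the "only modify `passes`" rule into a hard gate instead of an instruction.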
+ +**Scale example**: In the article's claude.ai clone demo, this meant over 200 features covering both functional requirements ("a user can open a new chat, type in a query, press enter, and see an AI response") and style requirements (UI polish, responsive layout). + +### 2. claude-progress.txt + +A chronological log of agent activities enabling subsequent sessions to quickly understand completed work without re-reading entire conversation context. Updated at the end of each session with: + +- Summary of work completed +- Commits made +- Blockers encountered +- Suggested next steps + +### 3. init.sh + +An executable script that starts the development environment. Agents run this at session start to ensure consistent application state before implementing new features. + +### 4. Git Repository + +Used for version control with descriptive commit messages. Allows agents to: + +- Revert problematic changes +- Maintain working codebase states between sessions +- Provide audit trail of all changes + +--- + +## Two-Phase Workflow + +### Phase 1: Initializer Agent + +Used once at project start to establish the foundational tracking files: + +1. **Create feature_list.json** - Analyze project requirements and create comprehensive feature list with: + - Clear descriptions + - Step-by-step verification procedures for each feature + - All features start with `passes: false` + +2. **Create init.sh** - Write startup script that: + - Sets up the development environment + - Installs dependencies + - Starts development server + - Is idempotent (safe to run multiple times) + +3. **Initialize git repository** - Create initial commit with added files + +4. **Create claude-progress.txt** - Document initial setup and suggested starting point + +### Phase 2: Coding Agent + +Used for each incremental work session. Each session follows this protocol: + +#### Session Start + +1. Run `pwd` to confirm working directory +2. Read `claude-progress.txt` for context on previous sessions +3. 
Run `git log` to see recent commits +4. Read `feature_list.json` to identify incomplete features +5. Start development server via `init.sh` + +This "warm-up" sequence saves tokens by quickly recovering context instead of re-exploring the codebase. + +#### Baseline Testing (Critical) + +**Before implementing new features**, run baseline functionality tests: + +- Select 1-2 features where `passes: true` +- Run their verification steps +- If any baseline tests fail, fix regressions BEFORE new work + +This catches bugs introduced in previous sessions. + +#### Feature Selection + +Select the **highest-priority incomplete feature** where `passes` is `false`. + +Work on **ONE feature per session**. This constraint ensures: + +- Focused, completable work units +- Clean state at session end +- Easy rollback if needed + +#### Implementation + +1. Implement the feature incrementally +2. Commit frequently with descriptive messages +3. After implementation, run verification steps + +#### End-to-End Verification + +The article emphasizes that agents initially failed to verify end-to-end functionality. The solution: + +> "Explicit prompting to use browser automation tools (Puppeteer MCP) for human-like testing rather than unit tests alone." + +**Verification requirements:** + +- Test through actual UI interactions (not just unit tests) +- Use browser automation (e.g., Puppeteer MCP) when applicable +- Verify as an end-user would experience the feature +- Only mark `passes: true` after full verification + +#### Session End + +**Before any session termination**: + +1. Commit all changes with descriptive message +2. Update `feature_list.json`: set `passes: true` for verified features +3. Append to `claude-progress.txt`: + - Session summary + - Features completed + - Commits made + - Blockers encountered + - Suggested next steps +4. 
Ensure code is in mergeable state + +--- + +## Common Failure Modes and Solutions + +| Problem | Initializer Solution | Coding Agent Solution | +| ---------------------------------------- | ------------------------------------------------ | -------------------------------------------------- | +| Agent declares victory prematurely | Create feature list with structured verification | Read feature file; work on single feature | +| Buggy/undocumented code states | Write git repo + progress file initially | Read progress/git logs; test baseline; commit work | +| Features marked complete without testing | Establish feature list with verification steps | Self-verify all features before marking complete | +| Time wasted understanding app setup | Write `init.sh` startup script | Read and execute `init.sh` | + +--- + +## Core Principles + +The patterns described here are derived from observing effective human engineers—the harness essentially codifies practices that experienced developers already follow during handoffs. + +### Clean State Outputs + +The system emphasizes producing orderly, documented code suitable for merging to main branches. This prevents subsequent agents from inheriting undocumented, half-finished features. + +### Incremental Progress + +- Work on one feature per session +- Use git commits with descriptive messages +- Write progress summaries for continuity +- Allows reverting failed changes + +### Explicit Verification + +Agents must self-verify features through end-to-end testing before marking them complete. Unit tests alone are insufficient—browser automation and user-perspective testing are required. + +--- + +## Future Directions + +The article notes open questions: + +> "It's still unclear whether a single, general-purpose coding agent performs best across contexts, or if better performance can be achieved through a multi-agent architecture." 
+ +Potential specialized agents mentioned: + +- Testing agent +- Quality assurance agent +- Code cleanup agent + +Generalization beyond web development to: + +- Scientific research +- Financial modeling + +--- + +## Limitations + +The article acknowledges current constraints: + +- **Vision limitations**: Claude's visual processing has gaps that affect verification accuracy +- **Browser automation gaps**: Tools cannot detect all UI states (e.g., browser-native alert modals are invisible to Puppeteer) + +These limitations inform which verification approaches are most reliable. diff --git a/docs/design/anthropic-example-review.md b/docs/design/anthropic-example-review.md new file mode 100644 index 0000000..2777da2 --- /dev/null +++ b/docs/design/anthropic-example-review.md @@ -0,0 +1,364 @@ +# Anthropic Autonomous Coding Agent - Implementation Review + +**Repository**: [github.com/anthropics/claude-quickstarts/autonomous-coding](https://github.com/anthropics/claude-quickstarts/tree/main/autonomous-coding) +**Language**: Python +**SDK**: Claude Agent SDK +**Purpose**: Reference implementation demonstrating long-running autonomous coding patterns + +--- + +## Executive Summary + +This is Anthropic's official reference implementation of the patterns described in their ["Effective Harnesses for Long-Running Agents"](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) blog post. This review documents both the implementation details and how each design choice maps to problems identified in Anthropic's research. 
+ +**Key characteristics**: + +- Minimal harness focused on demonstrating core concepts +- Defense-in-depth security model with custom bash allowlist +- Two-agent pattern (initializer + coding) with 200+ feature test cases +- Explicit regression testing before new feature work +- Browser/UI verification emphasis with Puppeteer MCP + +--- + +## Architecture + +### File Structure + +``` +autonomous-coding/ +├── autonomous_agent_demo.py # Entry point, CLI argument parsing +├── agent.py # Core session loop, phase selection logic +├── client.py # Claude SDK client configuration +├── security.py # Bash command allowlist and validation hooks +├── progress.py # Progress tracking (feature_list.json parsing) +├── prompts.py # Prompt template loading utilities +├── test_security.py # Security validation tests +├── prompts/ +│ ├── app_spec.txt # Application specification template +│ ├── initializer_prompt.md # First session: setup and planning +│ └── coding_prompt.md # Subsequent sessions: implementation +├── requirements.txt +└── README.md +``` + +### Generated Project Structure + +When run, creates: + +``` +{project-dir}/ +├── feature_list.json # Source of truth for test cases +├── app_spec.txt # Copied specification +├── init.sh # Environment setup script (agent-created) +├── claude-progress.txt # Session handoff notes (agent-created) +├── .claude_settings.json # Security settings +└── [application code] # Generated implementation +``` + +--- + +## Implementation Overview + +### Two-Phase Workflow + +**Phase 1: Initialization** (session 1) - Triggered when `feature_list.json` doesn't exist: + +1. Create `feature_list.json` with 200+ test cases +2. Create `init.sh` environment setup script +3. Initialize git repository +4. Write initial `claude-progress.txt` notes + +**Phase 2: Implementation** (sessions 2+) - Triggered when `feature_list.json` exists. Each session follows a 10-step workflow: + +1. Orient (read spec, feature list, git log, progress notes) +2. 
Start servers via `init.sh` +3. **Regression test** existing features (CRITICAL - must verify before new work) +4. Select one incomplete feature +5. Implement the feature +6. Verify via browser automation (Puppeteer MCP) +7. Update `feature_list.json` (only modify `passes` field) +8. Git commit with verification details +9. Update `claude-progress.txt` +10. Clean exit (commit all, leave working) + +### State Files + +**`feature_list.json`** - Primary state file with structured test cases: + +```json +[ + { + "category": "functional" | "style", + "description": "Brief description of the feature", + "steps": ["Step 1: ...", "Step 2: ...", "Step 3: ..."], + "passes": false + } +] +``` + +Requirements: 200+ features (25+ with 10+ steps), all start with `"passes": false`. Completion detected when all have `"passes": true`. + +**`claude-progress.txt`** - Agent-written session handoff notes: accomplishments, completed tests, issues discovered, next steps, completion status. + +**Git history** - Provides implementation record and rollback capability. + +### Security Model + +Three-layer defense-in-depth: + +1. **OS-level sandbox** - Bash commands run in isolated environment +2. **Filesystem restrictions** - Operations limited to project directory +3. 
**Custom bash allowlist** - Only 18 specific commands permitted + +**Bash command allowlist** (`security.py`): + +```python +ALLOWED_COMMANDS = { + # File inspection + "ls", "cat", "head", "tail", "wc", "grep", + + # File operations + "cp", "mkdir", "chmod", + + # Directory + "pwd", + + # Node.js development + "npm", "node", + + # Version control + "git", + + # Process management + "ps", "lsof", "sleep", "pkill", + + # Script execution + "init.sh", +} +``` + +**Specialized validators** for risky commands: + +- `pkill`: Only dev processes (node, npm, npx, vite, next) +- `chmod`: Only `+x` permissions (pattern: `^[ugoa]*\+x$`) +- `init.sh`: Only explicit paths (`./init.sh` or absolute paths) + +**Validation hook** (`bash_security_hook`): Pre-tool-use hook validates all bash commands against allowlist. Fail-safe design blocks unparseable commands. + +--- + +## Mapping Implementation to Blog Post Patterns + +The following design choices directly address failure modes identified in Anthropic's ["Effective Harnesses for Long-Running Agents"](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) blog post: + +### Problem 1: Premature Completion + +**Blog finding**: Agents declare "I'm done!" when the project is only partially complete, lacking objective completion criteria. + +**Implementation solution**: The initializer creates a `feature_list.json` file with 200+ discrete, testable features. Each has a boolean `passes` field. Completion is only when **all** features have `"passes": true`. The agent cannot claim completion without satisfying this objective, verifiable criterion. + +### Problem 2: Context Loss Between Sessions + +**Blog finding**: Agents waste time re-understanding project state when resuming, especially across context window boundaries. 
+ +**Implementation solution**: Three-part context restoration pattern: + +- **`claude-progress.txt`**: Agent-written session notes explaining what was accomplished, issues found, and what to work on next +- **`init.sh`** script: One-command environment restart (install deps, start servers) +- **Git log review**: Implementation history provides concrete record of changes + +The coding prompt explicitly requires agents to run orientation commands (`pwd`, `cat app_spec.txt`, `cat feature_list.json`, `git log`, etc.) at the start of every session to rebuild context quickly. + +### Problem 3: Silent Regression + +**Blog finding**: Agents introduce bugs when implementing new features, breaking previously working functionality without detecting it. + +**Implementation solution**: Step 3 of the coding workflow **mandates** regression testing before any new work. Agents must test 1-2 features marked `"passes": true` to verify they still work. If regressions are found, the feature is immediately marked `"passes": false` and must be fixed before proceeding. This catches bugs introduced in the previous session before they compound. + +### Problem 4: Insufficient Verification + +**Blog finding**: Agents would mark features complete without thorough end-to-end testing, or test only the backend (curl requests) without verifying the UI actually works. The blog identifies "inadequate testing" as a core problem and recommends browser automation to test as human users would. + +**Implementation solution**: Step 6 explicitly requires Puppeteer MCP browser automation. The prompt includes strong language: + +- "CRITICAL: You MUST verify features through the actual UI" +- "DON'T: Only test with curl (backend testing alone is insufficient)" +- "DON'T: Use JavaScript evaluation to bypass UI" +- "DON'T: Skip visual verification" + +Agents must interact like real users (clicks, form input, screenshots) and verify visual appearance, not just API responses. 
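Taken together, Problems 1 and 4 reduce to a small invariant a harness can encode: `passes` flips to `true` only through a successful end-to-end check, and "done" is defined over the entire list. A minimal sketch (hypothetical helper names, not code from the repository):

```python
def mark_feature(feature, verify):
    """Set passes=True only when the supplied end-to-end check succeeds;
    a failed check forces passes back to False (covers regressions too)."""
    feature["passes"] = bool(verify(feature))
    return feature["passes"]

def project_complete(features):
    # Objective completion criterion: every feature has been verified.
    return all(f["passes"] for f in features)
```

In the real workflow `verify` would drive browser automation against the feature's `steps`; the point of the shape is that there is no code path to `passes: true` that bypasses verification.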
+ +### Problem 5: Feature List Corruption + +**Blog finding**: "The model is less likely to inappropriately change or overwrite JSON files compared to Markdown files." + +**Implementation solution**: JSON format instead of Markdown. JSON's structured nature makes accidental corruption during edits less likely. Additionally, prompts include "strongly-worded instructions": + +**From initializer prompt**: "IT IS CATASTROPHIC TO REMOVE OR EDIT FEATURES IN FUTURE SESSIONS. Features can ONLY be marked as passing." + +**From coding prompt**: "YOU CAN ONLY MODIFY ONE FIELD: 'passes' — NEVER: Remove tests, edit descriptions, modify steps, reorder" + +This constraint pattern (using emphatic language to enforce rules) is explicitly recommended in the blog post. + +### Problem 6: Inadequate Test Coverage + +**Blog finding**: Vague high-level requirements lead to incomplete implementations. The blog recommends expanding initial specs into "200+ discrete, testable features." + +**Implementation solution**: The initializer creates minimum 200 features with explicit requirements: + +- Both "functional" and "style" categories +- Mix of narrow tests (2-5 steps) and comprehensive tests (10+ steps) +- At least 25 tests must have 10+ steps each +- Ordered by priority (fundamental features first) + +This ensures comprehensive coverage and prevents the agent from considering simple happy-path testing as "complete." + +--- + +## Key Design Decisions + +### Custom Security Layer + +**Decision**: Implement custom bash allowlist via pre-tool-use hooks, despite the Claude SDK already providing some sandboxing. + +**Rationale**: Defense-in-depth security model. The SDK's sandbox provides OS-level isolation, but the custom allowlist gives explicit control over which commands can run. This layered approach means if one layer fails, others provide protection. 
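The parse-then-check shape of the allowlist layer, with fail-safe blocking of unparseable input, can be sketched as follows. This is a simplified illustration of the approach the repository describes (`shlex` parsing, block-on-error), not the actual `security.py`; the allowlist here is a subset of the real 18 commands:

```python
import shlex

# Illustrative subset of the repository's 18-command allowlist
ALLOWED_COMMANDS = {"ls", "cat", "pwd", "grep", "git", "npm", "node"}

def bash_security_hook(command):
    """Pre-tool-use check: permit only allowlisted commands.
    Anything that cannot be parsed is blocked, never allowed."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # fail safe: e.g. unbalanced quotes
    if not tokens:
        return False  # empty command
    return tokens[0] in ALLOWED_COMMANDS
```

The real hook layers per-command validators on top of this (restricting `pkill` targets, `chmod` modes, and `init.sh` paths), but the fail-safe default — block whenever validation cannot positively succeed — is the core of the design.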
+ +**Design details**: + +- 18 allowed commands covering file inspection, development tools, and version control +- Specialized validators for risky commands (pkill limited to dev processes, chmod limited to `+x`, init.sh path restrictions) +- Fail-safe: unparseable commands are blocked rather than allowed +- Uses `shlex` for safe command parsing to prevent shell injection + +--- + +## Strengths + +### 1. Reference Implementation Quality + +- Clear code structure +- Well-commented +- Follows patterns from blog post +- Demonstrates concepts without over-engineering + +### 2. Security Focus + +- Multiple validation layers +- Explicit allowlist (18 commands) +- Specialized validators for risky commands (pkill, chmod, init.sh) +- Fail-safe design (block if unparseable) +- Uses shlex for safe command parsing + +### 3. Quality Emphasis + +From prompts: + +- Zero console errors +- Polished UI +- End-to-end workflows +- Screenshot verification +- Regression testing before new work + +### 4. Educational Value + +- Each file has clear purpose +- Prompts document workflow explicitly +- Easy to understand what harness vs. agent does + +--- + +## Limitations + +### 1. CLI-Only Configuration + +**Limitation**: No config file, only command-line args (`--project-dir`, `--max-iterations`, `--model`) + +**Can't easily**: + +- Share configuration across team +- Version control settings +- Reuse settings between projects + +### 2. No Session State Separation + +**Limitation**: Agent-written `feature_list.json` contains all state + +**Issues**: + +- Harness can't track which phase ran +- No session counter maintained by harness +- Harder to implement conditional logic + +### 3. Fixed Two Phases + +**Limitation**: Initializer + coding hardcoded + +**Can't**: + +- Add review/QA phase +- Insert research phase before init +- Run deployment phase after completion + +### 4. 
Python Runtime Dependency + +**Limitation**: Requires Python, pip, SDK installation + +**Issues**: + +- Not a single binary +- Version compatibility concerns +- Deployment complexity + +--- + +## Usage Example + +```bash +# Install +npm install -g @anthropic-ai/claude-code +pip install -r requirements.txt + +# Set API key +export ANTHROPIC_API_KEY='your-key' + +# First run (initialization) +python autonomous_agent_demo.py \ + --project-dir ./my-app \ + --model claude-sonnet-4-5-20250929 + +# Continue (resume automatically) +python autonomous_agent_demo.py --project-dir ./my-app + +# Limit iterations (testing) +python autonomous_agent_demo.py \ + --project-dir ./my-app \ + --max-iterations 5 +``` + +**Expected timeline**: + +- Session 1 (init): Several minutes +- Sessions 2+: 5-15 minutes each +- Full application: Many hours/days + +**Tip**: Reduce feature count in `initializer_prompt.md` from 200 → 20-50 for faster demos. + +**Session resumability**: + +- Press Ctrl+C to pause +- Run same command to resume +- Progress persists via git and feature_list.json + +--- + +## Conclusion + +The Anthropic autonomous coding agent is an excellent reference implementation that: + +- Faithfully demonstrates the blog post patterns +- Emphasizes security with defense-in-depth +- Prioritizes quality with regression testing and browser verification +- Provides clear educational value + +It makes deliberate trade-offs favoring clarity and demonstration over production features, which is appropriate for its role as a quickstart example. diff --git a/docs/design/lorah-recommendations.md b/docs/design/lorah-recommendations.md new file mode 100644 index 0000000..39c9335 --- /dev/null +++ b/docs/design/lorah-recommendations.md @@ -0,0 +1,488 @@ +# Lorah Enhancement Recommendations + +**Goal**: Make Lorah strictly enforce Anthropic's recommended patterns for long-running autonomous coding agents. + +**Philosophy**: Pre-v1, Lorah should be opinionated. 
Enforce proven patterns; relax constraints later based on feedback. + +--- + +## Context + +This document synthesizes: + +1. [Anthropic Blog Post](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) - Core concepts +2. [Anthropic Reference Implementation](https://github.com/anthropics/claude-quickstarts/tree/main/autonomous-coding) - Python quickstart +3. Lorah Implementation - Current Go-based harness + +### What Lorah Does Well + +| Strength | Details | +| -------------------- | ------------------------------------------------------------ | +| Single binary | No runtime dependencies except Claude CLI | +| Config system | JSON config with deep merge over defaults | +| Error recovery | Exponential backoff with circuit breaker | +| State separation | `tasks.json` (agent) vs `session.json` (harness) | +| Presets | Built-in configs for python, go, rust, web-nodejs, read-only | +| Workflow flexibility | Alternative prompt sets via `workflows/` | +| PID locking | Prevents concurrent runs | +| Atomic writes | Temp file + rename prevents corruption | + +### Gaps vs. 
Anthropic's Reference + +| Feature | Anthropic Has | Lorah Missing | +| -------------------- | --------------------------------------------- | ----------------- | +| Regression testing | Step 3: Test passing features before new work | Not in prompts | +| Browser verification | Puppeteer MCP, screenshots, E2E testing | Not in prompts | +| init.sh execution | Harness runs setup script each session | Prompt-based only | +| Strong immutability | "IT IS CATASTROPHIC to edit features" | Softer language | +| Quality standards | Zero console errors, polished UI checklist | Not explicit | +| Custom security | Bash command allowlist | Delegates to CLI | + +--- + +## Naming Alignment + +Adopt Anthropic's taxonomy for cognitive consistency: + +| Concept | Anthropic | Lorah Current | Adopt | +| --------------- | --------------------- | ------------- | -------------------- | +| Checklist file | `feature_list.json` | `tasks.json` | `feature_list.json` | +| Checklist items | features | tasks | features | +| Progress log | `claude-progress.txt` | `progress.md` | `claude-progress.md` | +| Specification | `app_spec.txt` | `spec.md` | `app_spec.md` | +| Setup script | `init.sh` | (none) | `init.sh` | + +**Keep `.md` extensions** (better than `.txt` for formatting/rendering). + +**Keep phase names**: "initialization/implementation" clearer than "Initializer Agent/Coding Agent". + +--- + +## Prompt Enforcement Architecture + +### Problem + +Users providing custom prompts can skip critical steps (regression testing, verification). + +### Solution: Wrapper + Slot Injection + +Harness controls **structure and requirements**; user controls **project-specific details**. 
+ +``` +┌─────────────────────────────────────────┐ +│ [HARNESS: orientation] │ ← Cannot remove +│ [HARNESS: regression testing] │ ← Cannot remove +├─────────────────────────────────────────┤ +│ [USER: implementation.md] │ ← User controls +├─────────────────────────────────────────┤ +│ [HARNESS: verification wrapper] │ ← Cannot remove +│ [USER: verification.md] │ ← User controls method +│ [HARNESS: verification footer] │ ← Cannot remove +├─────────────────────────────────────────┤ +│ [USER: quality-standards.md] │ ← User controls criteria +├─────────────────────────────────────────┤ +│ [HARNESS: exit protocol] │ ← Cannot remove +│ [HARNESS: immutability warning] │ ← Cannot remove +└─────────────────────────────────────────┘ +``` + +### What's Embedded vs User-Provided + +| Content | Source | User Can Modify? | +| ------------------------------ | ----------------------------- | ------------------- | +| Orientation protocol | Harness embedded | No | +| Regression testing requirement | Harness embedded | No | +| "You MUST verify" framing | Harness embedded | No | +| _How_ to verify | User's `verification.md` | Yes (file required) | +| Quality criteria | User's `quality-standards.md` | Yes (file required) | +| Exit protocol | Harness embedded | No | +| Feature immutability warning | Harness embedded | No | + +### Prompt Assembly + +```go +func assembleImplementationPrompt(userDir string) string { + return fmt.Sprintf(`# Implementation Phase + +%s + +%s + +--- + +## Implementation + +%s + +--- + +## Verification (REQUIRED) + +⚠️ You MUST verify the feature works before marking it complete. +Do not skip this step. Do not rely on code inspection alone. 
+ +### How to Verify This Project + +%s + +### After Verification + +- If verification passes → update feature_list.json, set passes: true +- If verification fails → fix the issue, re-verify + +--- + +## Quality Standards + +%s + +--- + +%s +`, + embedded("orientation"), + embedded("regression"), + readUserFile("implementation.md"), + readUserFile("verification.md"), + readUserFile("quality-standards.md"), + embedded("exit"), + ) +} +``` + +--- + +## Initialization Enforcement + +### What Can Go Wrong + +| Problem | Consequence | +| ------------------------------- | ------------------------------------------ | +| Vague features ("make it work") | Agent can't verify, marks done prematurely | +| Features too large | Can't finish in one session, broken state | +| Missing init.sh | No reliable environment setup | +| No git init | No rollback capability | + +### Post-Init Validation + +Harness validates before allowing implementation phase: + +```go +func validateInitialization(projectDir string) error { + checks := []struct { + name string + check func() error + }{ + {"feature_list.json exists", checkFeatureListExists}, + {"feature_list.json has 10-200 items", checkFeatureCount}, + {"feature_list.json schema valid", checkFeatureSchema}, + {"no vague feature names", checkFeatureQuality}, + {"all features start passes: false", checkInitialState}, + {"init.sh exists", checkInitShExists}, + {"init.sh is executable", checkInitShExecutable}, + {"init.sh runs successfully", runInitSh}, + {"git initialized", checkGitRepo}, + {"initial commit exists", checkGitCommits}, + {"claude-progress.md exists", checkProgressExists}, + } + + for _, c := range checks { + if err := c.check(); err != nil { + return fmt.Errorf("init validation [%s]: %w", c.name, err) + } + } + return nil +} +``` + +### Feature Quality Validation + +Detect vague feature names: + +```go +vaguePatterns := []string{ + "^make .* work$", + "^add \\w+$", // "add login" too vague + "^fix \\w+$", // "fix bugs" too 
vague + "^implement \\w+$", // "implement auth" too vague +} +``` + +**Implementation phase cannot start until initialization passes all checks.** + +--- + +## Directory Structure + +### Configured Project + +``` +my-project/ +├── .lorah/ +│ ├── config.json # User: harness + claude settings +│ ├── app_spec.md # User: project specification +│ │ +│ ├── prompts/ +│ │ ├── initialization/ +│ │ │ ├── project-description.md # User: what to build +│ │ │ └── feature-guidance.md # User: how to structure features +│ │ │ +│ │ └── implementation/ +│ │ ├── implementation.md # User: project-specific guidance +│ │ ├── verification.md # User: how to verify (preset default) +│ │ └── quality-standards.md # User: quality criteria (preset default) +│ │ +│ ├── feature_list.json # Agent-created, harness-validated +│ ├── claude-progress.md # Agent-created +│ ├── session.json # Harness-created +│ │ +│ └── init.sh # Agent-created, harness-executed +│ +└── [project source code...] +``` + +### Embedded in Harness Binary + +``` +embedded/ +├── wrappers/ +│ ├── orientation.md # pwd, ls, spec, features, git log +│ ├── regression.md # Test passing features first +│ ├── verification-wrapper.md # "You MUST verify..." +│ ├── exit.md # Commit all, leave working +│ └── immutability.md # "CATASTROPHIC to edit features" +│ +└── presets/ + ├── web-frontend/ + │ ├── verification.md + │ └── quality-standards.md + ├── cli-tool/ + │ ├── verification.md + │ └── quality-standards.md + ├── library/ + ├── backend-api/ + └── data-pipeline/ +``` + +--- + +## Project-Type Presets + +Verification and quality standards vary by project type: + +### web-frontend + +**verification.md:** + +```markdown +For each feature: + +1. Test through browser UI (not just code inspection) +2. Use actual browser interactions (clicks, keyboard) +3. Take screenshots of key states +4. Check browser console for errors (must be zero) +5. 
Verify responsive behavior +``` + +**quality-standards.md:** + +```markdown +- Zero console errors +- Polished UI design +- Responsive layout +- Accessible (keyboard navigation, contrast) +``` + +### cli-tool + +**verification.md:** + +```markdown +For each feature: + +1. Run the command with expected inputs +2. Capture stdout/stderr output +3. Verify exit code is 0 for success +4. Test error cases (invalid input, missing flags) +5. Confirm help text is accurate +``` + +**quality-standards.md:** + +```markdown +- Clear, helpful error messages +- Consistent flag naming (--verbose, -v) +- Exit code 0 on success, non-zero on failure +- Works in pipeline (stdin/stdout friendly) +``` + +### library + +**verification.md:** + +```markdown +For each feature: + +1. Run test suite (pytest/go test/cargo test) +2. Verify public API matches documentation +3. Check for breaking changes +4. Test with example usage code +``` + +**quality-standards.md:** + +```markdown +- Test coverage on public API +- Documented public functions +- No breaking changes without version bump +- Consistent error handling +``` + +### backend-api + +**verification.md:** + +```markdown +For each feature: + +1. Test endpoints with curl/httpie +2. Verify response status codes +3. Check response body structure +4. Test error cases (400, 404, 500) +5. Verify database state changes +``` + +**quality-standards.md:** + +```markdown +- Consistent response format +- Proper HTTP status codes +- Clear error messages +- Input validation +``` + +--- + +## Validation Flow + +``` +lorah run + │ + ▼ +┌─────────────────────────────────────┐ +│ Phase: Initialization │ +│ │ +│ 1. Assemble init prompt │ +│ (embedded + user slots) │ +│ 2. Run Claude session │ +│ 3. Validate outputs: │ +│ ├─ feature_list.json valid │ +│ ├─ init.sh exists + runs │ +│ ├─ git initialized │ +│ └─ claude-progress.md exists │ +│ 4. 
If validation fails → retry │ +│ with error context │ +└─────────────────────────────────────┘ + │ (validation passes) + ▼ +┌─────────────────────────────────────┐ +│ Phase: Implementation (loop) │ +│ │ +│ 1. Run init.sh │ +│ 2. Assemble impl prompt │ +│ (embedded + user slots) │ +│ 3. Run Claude session │ +│ 4. Check completion │ +│ └─ All features pass? → Done │ +│ └─ Features remain? → Continue │ +└─────────────────────────────────────┘ +``` + +--- + +## Implementation Roadmap + +### Phase 1: Prompt Improvements (No Harness Code) + +Update default prompts to include: + +- Regression testing step +- Browser/E2E verification emphasis +- Quality standards section +- Strong immutability language + +**Effort**: Low +**Breaking changes**: None + +### Phase 2: Naming Alignment + +- Rename `tasks.json` → `feature_list.json` +- Rename `progress.md` → `claude-progress.md` +- Rename `spec.md` → `app_spec.md` +- Update all prompts to use "features" not "tasks" + +**Effort**: Low +**Breaking changes**: Migration command needed + +### Phase 3: init.sh Support + +- Check for `.lorah/init.sh` before implementation phase +- Execute if present +- Handle errors gracefully +- Update init prompt to require init.sh creation + +**Effort**: Low-Medium +**Breaking changes**: None (only runs if file exists) + +### Phase 4: Wrapper + Slot Prompt Assembly + +- Embed required sections in binary +- Require user files (verification.md, quality-standards.md) +- Assemble prompts with wrapper + user content +- Validate user files exist and aren't empty + +**Effort**: Medium +**Breaking changes**: Directory structure change + +### Phase 5: Post-Init Validation + +- Implement validation checks +- Gate implementation phase on validation pass +- Retry init with error context on failure + +**Effort**: Medium +**Breaking changes**: None (stricter behavior) + +### Phase 6: Preset Enhancements + +- Add preset-specific verification.md templates +- Add preset-specific quality-standards.md templates +- 
Update `lorah init --preset` to create new structure + +**Effort**: Medium +**Breaking changes**: None + +--- + +## What Lorah Enforces (Non-Negotiable) + +1. **Orientation protocol** - pwd, ls, spec, features, git log +2. **Regression testing** - Test passing features before new work +3. **Verification requirement** - Must verify before marking complete +4. **Exit protocol** - Commit all, leave working state +5. **Feature immutability** - Only `passes` field can change +6. **init.sh execution** - Runs before each implementation session +7. **Post-init validation** - Gate to implementation phase + +## What Users Control + +1. **Project description** - What to build +2. **Verification method** - How to test (browser/CLI/API/etc.) +3. **Quality standards** - Criteria appropriate to project +4. **Implementation guidance** - Project-specific instructions +5. **Feature guidance** - How to structure features for this domain + +## What Presets Provide + +1. **Default verification.md** - Sensible verification for project type +2. **Default quality-standards.md** - Sensible criteria for project type diff --git a/docs/design/lorah-review.md b/docs/design/lorah-review.md new file mode 100644 index 0000000..2f16acf --- /dev/null +++ b/docs/design/lorah-review.md @@ -0,0 +1,644 @@ +# Lorah Agent Harness - Implementation Review + +**Repository**: `cpplain/lorah` +**Language**: Go +**Purpose**: Long-running autonomous coding agent orchestration for Claude Code CLI + +--- + +## Executive Summary + +Lorah is an implementation of Anthropic's recommended patterns for long-running agent harnesses, as documented in their blog post [Effective Harnesses for Long-Running Agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents). This review maps Lorah's implementation to the patterns described in Anthropic's research, identifying areas of full conformance, partial alignment, and intentional deviations. 
+ +**Key characteristics**: + +- Clean separation between harness logic (orchestration) and agent behavior (prompts) +- Robust error recovery with exponential backoff +- Convention-based defaults with minimal configuration burden +- JSON task tracking preventing progress corruption +- Single binary deployment with no external dependencies + +**Key strength**: Simplicity. Lorah does one thing well - orchestrate Claude Code CLI sessions with reliable state management. + +--- + +## Architecture + +### Core Components + +``` +main.go → CLI entry point (run, verify, init, info) +lorah/ + runner.go → Agent loop, phase selection, error recovery + client.go → Claude CLI subprocess execution + config.go → Configuration loading with deep merge + tracking.go → JSON checklist progress monitoring (JsonChecklistTracker) + verify.go → Pre-run environment validation + messages.go → Stream-JSON parser + messages_types.go → Message type definitions + info.go → Template embedding, scaffolding + presets.go → Built-in project type configs + lock.go → PID-based instance locking + schema.go → Configuration schema generation +``` + +### Control Flow + +``` +┌─────────────────────────────────────────────────────────────┐ +│ 1. Acquire PID lock (harness.lock) │ +│ 2. Load config (.lorah/config.json) │ +│ 3. Initialize tracker (tasks.json) │ +│ 4. Load session state (session.json) │ +│ 5. 
Ensure tracking files exist │ +└─────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ AGENT LOOP │ +│ │ +│ ┌────────────────────────────────────────────────────┐ │ +│ │ Select Phase: │ │ +│ │ - initialization (if not done) │ │ +│ │ - implementation (iterative) │ │ +│ └────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌────────────────────────────────────────────────────┐ │ +│ │ Load prompt from .lorah/prompts/{phase}.md │ │ +│ │ Prepend error context if previous session failed │ │ +│ └────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌────────────────────────────────────────────────────┐ │ +│ │ Run Claude CLI session (subprocess) │ │ +│ │ - Stream output to terminal │ │ +│ │ - Parse JSON messages │ │ +│ │ - Capture result │ │ +│ └────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌────────────────────────────────────────────────────┐ │ +│ │ Success? → Reset errors, mark phase, continue │ │ +│ │ Error? → Increment errors, backoff, retry │ │ +│ │ Complete? 
→ Exit loop │ │ +│ └────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────┘ +``` + +--- + +## Implementation Overview + +### Two-Phase Workflow + +Lorah hardcodes two phases: + +| Phase | Prompt File | Purpose | Runs | +| ------------------ | ---------------------------------- | ------------------ | -------------- | +| **initialization** | `.lorah/prompts/initialization.md` | One-time setup | Once | +| **implementation** | `.lorah/prompts/implementation.md` | Iterative building | Until complete | + +**Phase selection logic** (from `runner.go`): + +```go +if !tracker.IsInitialized() && !initCompleted { + return initializationPhase, InitializationPromptFile +} +return implementationPhase, ImplementationPromptFile +``` + +**Initialization detection**: `tracker.IsInitialized()` returns true when `tasks.json` contains at least one item. + +**Completion detection**: `tracker.IsComplete()` returns true when all items in `tasks.json` have `"passes": true` and count > 0. + +### State Files + +Lorah maintains four state files with clear ownership boundaries: + +#### 1. `tasks.json` (Agent-written) + +Equivalent to Anthropic's `feature_list.json`. Progress checklist with task schema: + +```json +[ + { "name": "Task name", "description": "What to build", "passes": false }, + { "name": "Another task", "description": "Description", "passes": true } +] +``` + +**Tracker operations** (`tracking.go`): + +- `IsInitialized()` → file exists and has items (total > 0) +- `IsComplete()` → all items have `passes: true` and count > 0 +- `GetSummary()` → (passing_count, total_count) + +#### 2. `progress.md` (Agent-written) + +Equivalent to Anthropic's `claude-progress.txt`. Handoff notes between sessions for human-readable context. + +#### 3. 
`session.json` (Harness-written) + +Session state tracking: + +```json +{ + "session_number": 5, + "completed_phases": ["initialization"] +} +``` + +**Updated by**: Harness only (not the agent) +**Atomic writes**: Uses temp file + rename pattern + +#### 4. `harness.lock` (Harness-written) + +Contains PID of running harness instance. Prevents concurrent runs; detects and clears stale locks. + +**Clear ownership boundaries**: + +| File | Owner | Purpose | +| -------------- | ------- | ------------------------- | +| `tasks.json` | Agent | Task completion tracking | +| `progress.md` | Agent | Session handoff notes | +| `session.json` | Harness | Phase and session state | +| `harness.lock` | Harness | Concurrent run prevention | +| `config.json` | User | Configuration overrides | + +### Configuration + +**Convention-based paths** (fixed): + +``` +.lorah/ + config.json # Optional overrides + spec.md # Project specification + tasks.json # Created by init phase + progress.md # Created by init phase + session.json # Created by harness + harness.lock # Created by harness + prompts/ + initialization.md # Init phase prompt + implementation.md # Build phase prompt +``` + +**Loading strategy**: + +1. Load embedded defaults +2. Deep-merge user `config.json` over defaults +3. Apply CLI flag overrides +4. Validate harness section only (Claude section passed through) + +**No per-phase config** - single global config applies to all phases. + +--- + +## Mapping Implementation to Blog Post Patterns + +This section maps Lorah's design to the six key problems identified in Anthropic's [Effective Harnesses for Long-Running Agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) blog post. + +### Problem 1: Premature Completion + +**Blog finding**: Agents declare "I'm done!" when the project is only partially complete, lacking objective completion criteria. 
+ +**Lorah's approach**: + +The initialization phase creates `tasks.json` with a boolean `passes` field for each task. Completion is only reached when **all** tasks have `"passes": true`. The `tracker.IsComplete()` function enforces this objective criterion: + +```go +func (t *JsonChecklistTracker) IsComplete() bool { + passing, total := t.GetSummary() + return passing == total && total > 0 +} +``` + +**Conformance**: ✓ FULL - Uses same JSON boolean tracking pattern as Anthropic's reference implementation + +**Deviation**: Simpler schema - Lorah uses `{name, description, passes}` vs Anthropic's `{category, description, steps[], passes}`. Trade-off: less structure but lower corruption risk. + +### Problem 2: Context Loss Between Sessions + +**Blog finding**: Agents waste time re-understanding project state when resuming, especially across context window boundaries. + +**Lorah's approach**: + +Three-part context restoration pattern: + +1. **`progress.md`**: Agent-written session notes explaining what was accomplished, issues found, and next steps +2. **Git log**: Implementation history provides concrete record of changes +3. **Orientation commands**: Implementation prompt explicitly requires reading state files + +From `.lorah/prompts/implementation.md`: + +```markdown +### STEP 1: Get Your Bearings + +pwd && ls -la +cat .lorah/spec.md +cat .lorah/tasks.json +cat .lorah/progress.md +git log --oneline -10 +``` + +**Conformance**: ✓ FULL - Three-part context restoration matches Anthropic's pattern + +**Deviation**: No `init.sh` script execution by harness. Lorah relies on agent-written setup instructions rather than a harness-executed environment setup script. + +### Problem 3: Silent Regression + +**Blog finding**: Agents introduce bugs when implementing new features, breaking previously working functionality without detecting it. 
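
The mitigation Anthropic describes is to re-test a couple of already-passing features before starting new work. A harness could surface such candidates mechanically rather than leaving it to prompts; a hypothetical Go helper, not part of Lorah (`Task` mirrors its `tasks.json` schema):

```go
package main

import "fmt"

// Task mirrors Lorah's minimal tasks.json schema: {name, description, passes}.
type Task struct {
	Name        string `json:"name"`
	Description string `json:"description"`
	Passes      bool   `json:"passes"`
}

// regressionCandidates picks up to n tasks already marked passing,
// for a session to re-verify before implementing anything new.
func regressionCandidates(tasks []Task, n int) []Task {
	var picked []Task
	for _, t := range tasks {
		if t.Passes {
			picked = append(picked, t)
			if len(picked) == n {
				break
			}
		}
	}
	return picked
}

func main() {
	tasks := []Task{
		{Name: "New chat button", Passes: true},
		{Name: "Send message", Passes: true},
		{Name: "Rename conversation", Passes: false},
	}
	for _, t := range regressionCandidates(tasks, 2) {
		fmt.Println("re-verify:", t.Name)
	}
}
```

The harness could prepend the selected names to the session prompt, keeping the regression step independent of whether the prompt author remembered to include it.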
+ +**Lorah's approach**: + +Default prompts say "Implement & Test" and "verify it works" but do NOT include an explicit regression testing step. + +From `.lorah/prompts/implementation.md`: + +```markdown +### STEP 3: Implement & Test + +Implement the task and verify it works. +``` + +**Conformance**: ✗ MISSING - Structure supports regression testing but prompts don't mandate it + +**Anthropic's approach**: Step 3 of coding workflow explicitly requires testing 1-2 features marked `"passes": true` to verify they still work before implementing new features. + +**Gap**: Lorah's prompts lack "BEFORE implementing new features, test 1-2 previously passing tasks to verify no regressions" instruction. + +### Problem 4: Insufficient Verification + +**Blog finding**: Agents mark features complete without thorough end-to-end testing. Anthropic recommends browser automation (Puppeteer MCP) to test as human users would. + +**Lorah's approach**: + +Prompts mention testing but don't emphasize end-to-end verification or require specific testing methods: + +```markdown +### STEP 3: Implement & Test + +Implement the task and verify it works. +``` + +**Conformance**: ✗ MISSING - Testing mentioned but not emphasized with critical language or browser automation requirements + +**Anthropic's approach**: Explicit Puppeteer MCP requirement with strong language: + +- "CRITICAL: You MUST verify features through the actual UI" +- "DON'T: Only test with curl (backend testing alone is insufficient)" +- "DON'T: Use JavaScript evaluation to bypass UI" + +**Gap**: Lorah's prompts lack browser/screenshot verification emphasis and don't mandate end-to-end testing through the UI. + +### Problem 5: Feature List Corruption + +**Blog finding**: "The model is less likely to inappropriately change or overwrite JSON files compared to Markdown files." Agents may remove or modify test cases inappropriately. 
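
This constraint also lends itself to mechanical enforcement: a harness could diff the task list before and after each session and reject anything beyond `passes` flips. A sketch in Go (hypothetical; neither Lorah nor Anthropic's reference harness is described as doing this):

```go
package main

import "fmt"

// Task mirrors Lorah's tasks.json schema: {name, description, passes}.
type Task struct {
	Name        string `json:"name"`
	Description string `json:"description"`
	Passes      bool   `json:"passes"`
}

// checkImmutable rejects any change other than a flipped passes field:
// no added, removed, reordered, or reworded tasks.
func checkImmutable(before, after []Task) error {
	if len(after) != len(before) {
		return fmt.Errorf("task count changed: %d -> %d", len(before), len(after))
	}
	for i := range before {
		if after[i].Name != before[i].Name || after[i].Description != before[i].Description {
			return fmt.Errorf("task %q was edited or reordered", before[i].Name)
		}
	}
	return nil
}

func main() {
	before := []Task{{Name: "Login", Description: "User can log in", Passes: false}}
	after := []Task{{Name: "Login", Description: "User can log in", Passes: true}}
	fmt.Println(checkImmutable(before, after)) // nil: only passes changed, which is allowed
}
```

A failed check could trigger a revert of the task file from git before the next session, turning the prompt-level rule into a hard guarantee.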
+ +**Lorah's approach**: + +Uses JSON format (`tasks.json`) with minimal schema to reduce corruption risk. The schema itself (no nested arrays) makes accidental corruption during edits less likely. + +```json +[{ "name": "Task name", "description": "What to build", "passes": false }] +``` + +**Conformance**: ✓ FULL - JSON format choice matches Anthropic's recommendation + +**Deviation**: Prompts lack emphatic "CATASTROPHIC" language. Anthropic's prompts include: + +- "IT IS CATASTROPHIC TO REMOVE OR EDIT FEATURES IN FUTURE SESSIONS" +- "YOU CAN ONLY MODIFY ONE FIELD: 'passes' — NEVER: Remove tests, edit descriptions..." + +Lorah's prompts say "Mark task as `\"passes\": true`" but don't explicitly forbid other modifications with strong language. + +### Problem 6: Inadequate Test Coverage + +**Blog finding**: Vague high-level requirements lead to incomplete implementations. Anthropic recommends expanding specs into "200+ discrete, testable features" with detailed verification steps. + +**Lorah's approach**: + +Tasks are user-defined during initialization phase. No minimum count requirement. No structured verification steps per task. + +From `.lorah/prompts/initialization.md`: + +```markdown +### STEP 2: Create Task List + +Create `.lorah/tasks.json` with testable requirements: + +[{ "name": "Task name", "description": "What to build", "passes": false }] +``` + +**Conformance**: ~ PARTIAL - Structure supports comprehensive task lists, but doesn't enforce scale or detail + +**Anthropic's approach**: + +- Minimum 200 features +- Mix of narrow tests (2-5 steps) and comprehensive tests (10+ steps) +- At least 25 tests must have 10+ steps each +- Both "functional" and "style" categories + +**Gap**: Lorah doesn't enforce minimum task count, doesn't require step arrays per task, and doesn't mandate comprehensive coverage. 
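
These scale requirements become mechanically checkable once the schema carries per-feature step arrays. A hypothetical post-initialization validation, sketched in Go against Anthropic's richer schema (Lorah performs no such check, and its schema has no `steps` field):

```go
package main

import "fmt"

// Feature mirrors the richer schema from the blog post:
// {category, description, steps[], passes}.
type Feature struct {
	Category    string   `json:"category"`
	Description string   `json:"description"`
	Steps       []string `json:"steps"`
	Passes      bool     `json:"passes"`
}

// validateCoverage enforces scale requirements after initialization:
// at least minTotal features overall, of which at least minDetailed
// carry detailedSteps or more verification steps.
func validateCoverage(features []Feature, minTotal, minDetailed, detailedSteps int) error {
	if len(features) < minTotal {
		return fmt.Errorf("%d features; need at least %d", len(features), minTotal)
	}
	detailed := 0
	for _, f := range features {
		if len(f.Steps) >= detailedSteps {
			detailed++
		}
	}
	if detailed < minDetailed {
		return fmt.Errorf("%d features with %d+ steps; need at least %d", detailed, detailedSteps, minDetailed)
	}
	return nil
}

func main() {
	features := []Feature{{Category: "functional", Description: "New chat", Steps: []string{"open app", "click new chat"}}}
	// Anthropic's thresholds would be (200, 25, 10).
	fmt.Println(validateCoverage(features, 200, 25, 10))
}
```

On failure, the harness could re-run the initialization session with the error message as context rather than proceeding to implementation.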
+ +--- + +## Conformance Summary + +Quick-reference table showing alignment with Anthropic's patterns: + +| Anthropic Pattern | Lorah Implementation | Status | +| --------------------------------- | ------------------------------------------- | --------- | +| Two-agent pattern (init + build) | Fixed init + implementation phases | ✓ FULL | +| JSON feature tracking | `tasks.json` with `passes` boolean | ✓ FULL | +| Session orientation | Prompts mandate pwd/git log/progress review | ✓ FULL | +| Single feature per session | Implicit in implementation prompt | ✓ FULL | +| Production-ready between sessions | "Commit your changes" in prompt | ✓ FULL | +| Clear handoff artifacts | `progress.md` + git commits | ✓ FULL | +| Error recovery with backoff | Exponential backoff implemented | ✓ FULL | +| Sandbox isolation | Delegated to Claude CLI | ✓ FULL | +| Regression testing before work | Not in default prompts | ✗ MISSING | +| Browser/E2E verification | Not in default prompts | ✗ MISSING | +| init.sh script execution | Not implemented | ✗ MISSING | +| "CATASTROPHIC" immutability | Not in default prompts | ✗ MISSING | +| 200+ features requirement | Not enforced | ✗ MISSING | + +**Summary**: 8/13 patterns fully implemented, 5/13 intentionally simplified or omitted. + +--- + +## Key Design Decisions + +### Security Delegation vs Defense-in-Depth + +**Decision**: Delegate all security to Claude CLI rather than implement custom validation layer. + +**Trade-off**: + +- **Anthropic's reference implementation**: Custom bash allowlist (18 commands), specialized validators for risky commands (pkill, chmod), fail-safe design +- **Lorah**: Pure passthrough to Claude CLI `--settings` flag for sandbox configuration + +From `config.go`: + +```go +// Claude config is not validated - Claude CLI handles its own validation +``` + +**Rationale**: Claude CLI already provides robust sandboxing with OS-level isolation, filesystem restrictions, and configurable permissions. 
Duplicating this logic would increase maintenance burden and create two sources of truth for security policy. By delegating, Lorah remains simple and leverages Claude CLI's battle-tested security model. + +### Convention-Based Configuration + +**Decision**: Fixed file paths (`.lorah/tasks.json`, `.lorah/progress.md`) and two-phase model rather than flexible configuration. + +**Trade-off**: + +- **Pros**: Simpler mental model, less configuration surface, works out-of-box with zero config +- **Cons**: Can't add phases without code changes, can't customize file paths, can't have per-phase configuration + +**Rationale**: Convention over configuration reduces cognitive load. Users get a working harness immediately with `lorah init`, and only configure when they need to override defaults. The deep-merge config system allows partial overrides without exposing every internal detail. + +### Minimal Task Schema + +**Decision**: Simple `{name, description, passes}` schema vs Anthropic's `{category, description, steps[], passes}`. + +**Trade-off**: + +- **Anthropic schema**: More structure, explicit verification steps, categorization +- **Lorah schema**: Minimal fields, lower corruption risk, simpler for agents to work with + +**Rationale**: Each additional field increases the surface area for agent mistakes. The simpler schema reduces the chance of accidental corruption while still providing enough structure for completion tracking. Verification steps can be written in the `description` field if needed. + +--- + +## Strengths + +### 1. Simplicity + +- Single binary (no runtime dependencies except Claude CLI) +- Minimal configuration surface +- Fixed two-phase model (easy to understand) +- Flat package structure (~3000 lines in one package) + +### 2. Robustness + +- PID-based locking prevents concurrent runs +- Exponential backoff on errors with circuit breaker +- Atomic file writes (temp file + rename) +- Stream-JSON parsing handles unknown message types gracefully + +### 3. 
Clear Ownership Boundaries + +Harness and agent have clearly separated state files with no overlap. Each file has a single owner (harness, agent, or user), reducing confusion and preventing race conditions. + +### 4. Workflow Flexibility + +Same harness works for feature development, code review, bug fixing, refactoring - just swap prompt files. See `workflows/review/` for an example of code review mode. + +--- + +## Limitations + +### 1. Fixed Phase Model + +Only two phases (init + implementation) hardcoded in `runner.go`. + +Can't easily: + +- Add research phase before init +- Add review/QA phase after implementation +- Run deployment phase after features complete + +**Workaround**: Modify prompts to include multi-step workflows within existing phases. + +### 2. No Per-Phase Configuration + +Single `config.json` for entire harness. Can't override settings (like `max-turns` or `model`) per phase. + +### 3. Minimal Task Schema + +Tasks have `name`, `description`, and `passes`. Can't express: + +- Task dependencies (blocking relationships) +- Priority levels +- Structured verification steps (step arrays) +- Categories (functional vs style) + +**Trade-off**: Simpler schema = less corruption risk. + +### 4. No Init Script Execution + +Environment setup via prompts only, not executable script. The harness doesn't run `init.sh` like Anthropic's reference - it relies on the agent to set up the environment as instructed. + +### 5. Prompt Gaps Compared to Anthropic + +Default prompts lack: + +- Explicit regression testing step before new work +- Browser/screenshot verification emphasis +- "CRITICAL" or "CATASTROPHIC" language for immutability rules +- Minimum feature count requirements (200+) +- Structured verification steps per feature + +--- + +## Lorah-Specific Features + +Features Lorah provides beyond the minimal reference implementation: + +### 1. 
Presets System + +Built-in configurations for common project types: + +| Preset | Configuration | +| ------------ | -------------------------------------------------------- | +| `python` | Python-specific settings (pip, uv, PyPI access) | +| `go` | Go-specific settings (module proxy access) | +| `rust` | Rust-specific settings (crates.io access) | +| `web-nodejs` | Node.js web app settings (npm, local dev server binding) | +| `read-only` | Analysis-only mode (restricts tools to Read, Glob, Grep) | + +**Usage**: `lorah init --preset python` + +### 2. Workflow Flexibility via Prompts + +Same harness supports different workflows by swapping prompt files: + +**Build workflow** (default): + +- Init: Create task list +- Implementation: Build tasks one at a time + +**Review workflow** (`workflows/review/`): + +- Init: Catalog ALL issues in codebase (no fixes) +- Implementation: Fix ONE issue per session, with explicit regression verification + +**Same harness, different behavior** - only prompts change. + +### 3. Error Recovery with Exponential Backoff + +Configurable error recovery with circuit breaker: + +```json +"error-recovery": { + "max-consecutive-errors": 5, + "initial-backoff-seconds": 5.0, + "max-backoff-seconds": 120.0, + "backoff-multiplier": 2.0, + "max-error-message-length": 2000 +} +``` + +**Backoff formula**: `delay = min(initial * multiplier^(n-1), max)` + +**Backoff schedule**: + +- Error 1: 5 seconds +- Error 2: 10 seconds +- Error 3: 20 seconds +- Error 4: 40 seconds +- Error 5: 80 seconds +- Error 6+: 120 seconds (capped) + +**Error context injection**: On retry, previous error message is prepended to prompt: + +``` +Note: The previous session encountered an error: {error_message} +Please continue with your work. + +{original_prompt} +``` + +### 4. Optional Configuration with Defaults + +Works out-of-box without config file. Embedded defaults + deep-merge user overrides: + +1. Load embedded template defaults +2. 
Deep-merge user `config.json` over defaults (if exists) +3. Apply CLI flag overrides +4. Validate harness section only + +**Philosophy**: Only configure what you need to change. + +### 5. Single Binary Distribution + +Go binary with no external runtime dependencies (except Claude CLI). Simpler deployment than Python-based reference implementation: + +```bash +# Build +go build -o ./bin/lorah . + +# Install +go install . + +# No pip, no virtualenv, no package.json +``` + +--- + +## Usage Example + +```bash +# Initialize project with preset +lorah init --project-dir ./my-app --preset go + +# Verify setup +lorah verify --project-dir ./my-app + +# Run agent loop +lorah run --project-dir ./my-app + +# Continue after interruption (same command) +lorah run --project-dir ./my-app + +# Limit iterations for testing +lorah run --project-dir ./my-app --max-iterations 5 +``` + +**Expected timeline**: + +- Session 1 (init): Several minutes +- Sessions 2+: 5-15 minutes each +- Full application: Many hours/days (depends on task count) + +**Session resumability**: + +- Press Ctrl+C to pause +- Run same command to resume +- Progress persists via git and `tasks.json` + +--- + +## Conclusion + +Lorah is a functional agent harness that implements the core structural patterns from Anthropic's guidance while making pragmatic trade-offs favoring simplicity. 
+ +**Full conformance** (8/13 Anthropic patterns): + +- Two-agent architecture (init + implementation) +- JSON progress tracking with boolean completion field +- Session orientation via mandatory state file reads +- Single-feature-per-session workflow +- Clean handoffs via git commits and progress notes +- Error recovery with exponential backoff +- Sandbox isolation (delegated to Claude CLI) +- Production-ready code between sessions + +**Intentional simplifications** (5/13 patterns): + +- Regression testing not mandated in prompts +- Browser/E2E verification not emphasized +- No init.sh harness execution +- No emphatic "CATASTROPHIC" immutability language +- No minimum feature count enforcement (200+) + +These gaps reflect design choices favoring minimal prompt complexity and simpler mental models over comprehensive failure-mode coverage. The harness provides robust orchestration structure; prompts can be enhanced to add missing verification patterns when needed. + +**Best suited for**: + +- Projects with straightforward init → build workflows +- Teams wanting minimal configuration overhead +- Use cases where two phases suffice +- Developers comfortable writing custom prompts for specialized workflows + +The codebase is well-structured, idiomatic Go, and provides a solid foundation for autonomous coding workflows.