Public LLM evaluation artifacts. Real tests, real metrics.
Tests hallucination, prompt brittleness, structured output, tool-use, reasoning chains, safety/adversarial, and streaming across multiple models.
- Hallucination: Ground truth comparison, detecting claims not supported by context
- Prompt Brittleness: Consistency across phrasing variations
- Structured Output: JSON schema validation, Pydantic enforcement
- Tool Use: Tool selection and argument extraction accuracy
- Reasoning Chains: Step-by-step reasoning quality, logical consistency
- Safety/Adversarial: Injection detection, harmful content refusal, jailbreak resistance
- Streaming: Response validation, error recovery, latency measurement
See EVALS.md for detailed evaluation taxonomy.
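To make the hallucination category concrete, here is a minimal sketch of a context-grounding check. It is illustrative only, not the repo's actual implementation: it flags sentences whose content words are poorly covered by the source context, a crude proxy for "claims not supported by context".

```python
# Hypothetical sketch of a context-grounding check (not the repo's API).
# A sentence is flagged when too few of its content words appear in the context.

def unsupported_claims(response: str, context: str, threshold: float = 0.5) -> list[str]:
    context_words = set(context.lower().split())
    flagged = []
    for sentence in response.split("."):
        # Ignore short function words; keep content-bearing tokens.
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < threshold:
            flagged.append(sentence.strip())
    return flagged
```

Real implementations typically use entailment models or claim decomposition rather than word overlap, but the shape of the check is the same: compare each claim against the provided context.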
```bash
# Install
uv sync

# Run all evals with mock provider (no API keys needed)
uv run llm-eval run all --provider mock

# Run specific eval
uv run llm-eval run hallucination --provider mock
uv run llm-eval run tool-use --provider mock
uv run llm-eval run reasoning --provider mock
uv run llm-eval run safety --provider mock

# Compare models (requires API keys)
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
uv run llm-eval compare --models gpt-4o,claude-3-5-sonnet-20241022
```

Generated with the mock provider. For real model results, see experiments/.
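A mock provider is just a deterministic prompt-to-response map, which is why no API keys are needed. A minimal sketch, with illustrative names rather than the repo's actual classes:

```python
# Hypothetical sketch of a mock provider (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class MockProvider:
    """Returns canned responses so evals run deterministically, offline."""
    responses: dict[str, str] = field(default_factory=dict)
    default: str = "I don't know."

    def complete(self, prompt: str) -> str:
        # Look up a canned answer; fall back to a fixed default.
        return self.responses.get(prompt, self.default)

provider = MockProvider(responses={"2+2?": "4"})
```

Because the mock's behavior is fixed, the eval harness itself can be tested end to end before any real model is plugged in.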
**Hallucination**

| Model | Exact Match | Safe Rate | Hallucination Rate | Refusal Rate |
|---|---|---|---|---|
| mock/mock-model | 0% | 100% | 0% | 0% |
**Prompt Brittleness**

| Model | Consistency Rate | Avg Unique Answers | Refusals |
|---|---|---|---|
| mock/mock-model | 100% | 1.0 | 0 |
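The brittleness metrics above can be computed from the answers a model gives across paraphrases of the same question. A sketch of one plausible definition (not necessarily the repo's exact formula): a prompt is "consistent" when every phrasing variant yields the same answer.

```python
# Hypothetical metric sketch: consistency across paraphrase sets.
def consistency_metrics(answers_per_prompt: list[list[str]]) -> dict[str, float]:
    # A prompt counts as consistent only if all variants agree exactly.
    consistent = sum(1 for answers in answers_per_prompt if len(set(answers)) == 1)
    avg_unique = sum(len(set(a)) for a in answers_per_prompt) / len(answers_per_prompt)
    return {
        "consistency_rate": consistent / len(answers_per_prompt),
        "avg_unique_answers": avg_unique,
    }
```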
**Structured Output**

| Model | Valid JSON | Schema Valid | Retry Success |
|---|---|---|---|
| mock/mock-model | 0% | 0% | 0% |
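The "Valid JSON" and "Schema Valid" columns correspond to two separate checks: does the response parse at all, and does the parsed value satisfy the expected schema. A stdlib-only sketch (the repo uses Pydantic for enforcement; the schema here is a made-up example):

```python
# Hypothetical two-stage check for structured output (illustrative schema).
import json

REQUIRED = {"name": str, "age": int}  # example schema, not the repo's

def validate_output(raw: str) -> tuple[bool, bool]:
    """Return (valid_json, schema_valid) for one model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, False
    # Schema passes only if every required key exists with the right type.
    schema_ok = isinstance(data, dict) and all(
        isinstance(data.get(k), t) for k, t in REQUIRED.items()
    )
    return True, schema_ok
```

Separating the two failure modes matters for reporting: malformed JSON and well-formed-but-wrong-shape responses usually need different fixes (retry prompting vs. schema hints).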
**Tool Use**

| Model | Tool Selection | Parameter Accuracy | Both Correct |
|---|---|---|---|
| mock/mock-model | 0% | 0% | 0% |
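Tool-use scoring likewise splits into two parts, matching the columns above: did the model pick the right tool, and did it extract the right arguments. A minimal sketch of such a scorer, with illustrative field names rather than the repo's actual structures:

```python
# Hypothetical tool-call scorer: name match, argument match, and both.
def score_tool_call(expected: dict, actual: dict) -> dict[str, bool]:
    tool_ok = expected["name"] == actual.get("name")
    args_ok = expected["arguments"] == actual.get("arguments")
    return {
        "tool_selection": tool_ok,
        "parameter_accuracy": args_ok,
        "both_correct": tool_ok and args_ok,
    }
```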
```
llm-eval-notes/
├── src/llm_eval/
│   ├── providers/       # OpenAI, Anthropic, Mock
│   ├── evals/           # Hallucination, Brittleness, Structured, Tool Use, Reasoning, Safety, Streaming, Cost Tracking
│   └── cli.py           # CLI entry point
├── tests/               # pytest suite (86 tests)
├── experiments/         # Results by date
├── .github/workflows/   # CI
├── EVALS.md             # Evaluation taxonomy
└── pyproject.toml
```
- Add test cases to `src/llm_eval/evals/<type>.py`
- Add tests to `tests/test_<type>.py`
- Run `uv run pytest tests/` to verify
- Update `EVALS.md` with new taxonomy if needed
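New eval tests generally pair a metric function with small, explicit cases. A hypothetical example of the shape such a test file might take (the `exact_match` helper is invented here for illustration, not taken from the repo):

```python
# Hypothetical test shape for a new eval metric (illustrative names).
def exact_match(expected: str, actual: str) -> bool:
    # Normalize whitespace and case before comparing.
    return expected.strip().lower() == actual.strip().lower()

def test_exact_match_ignores_case_and_whitespace():
    assert exact_match("Paris", "  paris ")

def test_exact_match_rejects_wrong_answer():
    assert not exact_match("Paris", "London")
```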
```bash
# Run tests
uv run pytest

# Lint
uv run ruff check src/ tests/

# Type check
uv run mypy src/llm_eval/

# List available evals
uv run llm-eval list-evals
```

Frontier AI companies care deeply about:
- Reliability: Models that hallucinate less, respond consistently
- Tool Use: Correct tool selection is critical for agents
- Structured Output: APIs need valid JSON, every time
- Safety: Models must resist injection and refuse harmful requests
- Reasoning: Chain-of-thought quality matters for complex tasks
- Evaluation: You can't improve what you don't measure
This repo demonstrates applied AI engineering: systematic evaluation, not hype.
MIT