Public LLM evaluation artifacts. Real tests, real metrics.
Tests hallucination, prompt brittleness, structured output, tool-use, reasoning chains, safety/adversarial, and streaming across multiple models.
- Hallucination: Ground truth comparison, detecting claims not supported by context
- Prompt Brittleness: Consistency across phrasing variations
- Structured Output: JSON schema validation, Pydantic enforcement
- Tool Use: Tool selection and argument extraction accuracy
- Reasoning Chains: Step-by-step reasoning quality, logical consistency
- Safety/Adversarial: Injection detection, harmful content refusal, jailbreak resistance
- Streaming: Response validation, error recovery, latency measurement
See EVALS.md for detailed evaluation taxonomy.
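To make the hallucination category concrete, here is a minimal sketch of a context-grounding check. It is illustrative only, not the repo's actual implementation: it flags sentences whose content words are poorly covered by the source context, a crude proxy for "claims not supported by context".

```python
# Hypothetical sketch of a context-grounding check (not the repo's API).
# A sentence is flagged when too few of its content words appear in the context.

def unsupported_claims(response: str, context: str, threshold: float = 0.5) -> list[str]:
    context_words = set(context.lower().split())
    flagged = []
    for sentence in response.split("."):
        # Ignore short function words; keep content-bearing tokens.
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < threshold:
            flagged.append(sentence.strip())
    return flagged
```

Real implementations typically use entailment models or claim decomposition rather than word overlap, but the shape of the check is the same: compare each claim against the provided context.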
```bash
# Install
uv sync

# Run all evals with mock provider (no API keys needed)
uv run llm-eval run all --provider mock

# Run specific eval
uv run llm-eval run hallucination --provider mock
uv run llm-eval run tool-use --provider mock
uv run llm-eval run reasoning --provider mock
uv run llm-eval run safety --provider mock

# Compare models (requires API keys)
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
uv run llm-eval compare --models gpt-4o,claude-3-5-sonnet-20241022
```

Generated with the mock provider. For real model results, see experiments/.
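A mock provider is just a deterministic prompt-to-response map, which is why no API keys are needed. A minimal sketch, with illustrative names rather than the repo's actual classes:

```python
# Hypothetical sketch of a mock provider (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class MockProvider:
    """Returns canned responses so evals run deterministically, offline."""
    responses: dict[str, str] = field(default_factory=dict)
    default: str = "I don't know."

    def complete(self, prompt: str) -> str:
        # Look up a canned answer; fall back to a fixed default.
        return self.responses.get(prompt, self.default)

provider = MockProvider(responses={"2+2?": "4"})
```

Because the mock's behavior is fixed, the eval harness itself can be tested end to end before any real model is plugged in.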
**Hallucination**

| Model | Exact Match | Safe Rate | Hallucination Rate | Refusal Rate |
|---|---|---|---|---|
| mock/mock-model | 0% | 100% | 0% | 0% |
**Prompt Brittleness**

| Model | Consistency Rate | Avg Unique Answers | Refusals |
|---|---|---|---|
| mock/mock-model | 100% | 1.0 | 0 |
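The brittleness metrics above can be computed from the answers a model gives across paraphrases of the same question. A sketch of one plausible definition (not necessarily the repo's exact formula): a prompt is "consistent" when every phrasing variant yields the same answer.

```python
# Hypothetical metric sketch: consistency across paraphrase sets.
def consistency_metrics(answers_per_prompt: list[list[str]]) -> dict[str, float]:
    # A prompt counts as consistent only if all variants agree exactly.
    consistent = sum(1 for answers in answers_per_prompt if len(set(answers)) == 1)
    avg_unique = sum(len(set(a)) for a in answers_per_prompt) / len(answers_per_prompt)
    return {
        "consistency_rate": consistent / len(answers_per_prompt),
        "avg_unique_answers": avg_unique,
    }
```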
**Structured Output**

| Model | Valid JSON | Schema Valid | Retry Success |
|---|---|---|---|
| mock/mock-model | 0% | 0% | 0% |
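The "Valid JSON" and "Schema Valid" columns correspond to two separate checks: does the response parse at all, and does the parsed value satisfy the expected schema. A stdlib-only sketch (the repo uses Pydantic for enforcement; the schema here is a made-up example):

```python
# Hypothetical two-stage check for structured output (illustrative schema).
import json

REQUIRED = {"name": str, "age": int}  # example schema, not the repo's

def validate_output(raw: str) -> tuple[bool, bool]:
    """Return (valid_json, schema_valid) for one model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, False
    # Schema passes only if every required key exists with the right type.
    schema_ok = isinstance(data, dict) and all(
        isinstance(data.get(k), t) for k, t in REQUIRED.items()
    )
    return True, schema_ok
```

Separating the two failure modes matters for reporting: malformed JSON and well-formed-but-wrong-shape responses usually need different fixes (retry prompting vs. schema hints).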
**Tool Use**

| Model | Tool Selection | Parameter Accuracy | Both Correct |
|---|---|---|---|
| mock/mock-model | 0% | 0% | 0% |
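Tool-use scoring likewise splits into two parts, matching the columns above: did the model pick the right tool, and did it extract the right arguments. A minimal sketch of such a scorer, with illustrative field names rather than the repo's actual structures:

```python
# Hypothetical tool-call scorer: name match, argument match, and both.
def score_tool_call(expected: dict, actual: dict) -> dict[str, bool]:
    tool_ok = expected["name"] == actual.get("name")
    args_ok = expected["arguments"] == actual.get("arguments")
    return {
        "tool_selection": tool_ok,
        "parameter_accuracy": args_ok,
        "both_correct": tool_ok and args_ok,
    }
```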
```
llm-eval-notes/
├── src/llm_eval/
│   ├── providers/       # OpenAI, Anthropic, Mock
│   ├── evals/           # Hallucination, Brittleness, Structured, Tool Use, Reasoning, Safety, Streaming, Cost Tracking
│   └── cli.py           # CLI entry point
├── tests/               # pytest suite (86 tests)
├── experiments/         # Results by date
├── .github/workflows/   # CI
├── EVALS.md             # Evaluation taxonomy
└── pyproject.toml
```
- Add test cases to `src/llm_eval/evals/<type>.py`
- Add tests to `tests/test_<type>.py`
- Run `uv run pytest tests/` to verify
- Update `EVALS.md` with new taxonomy if needed
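New eval tests generally pair a metric function with small, explicit cases. A hypothetical example of the shape such a test file might take (the `exact_match` helper is invented here for illustration, not taken from the repo):

```python
# Hypothetical test shape for a new eval metric (illustrative names).
def exact_match(expected: str, actual: str) -> bool:
    # Normalize whitespace and case before comparing.
    return expected.strip().lower() == actual.strip().lower()

def test_exact_match_ignores_case_and_whitespace():
    assert exact_match("Paris", "  paris ")

def test_exact_match_rejects_wrong_answer():
    assert not exact_match("Paris", "London")
```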
```bash
# Run tests
uv run pytest

# Lint
uv run ruff check src/ tests/

# Type check
uv run mypy src/llm_eval/

# List available evals
uv run llm-eval list-evals
```

Frontier AI companies care deeply about:
- Reliability: Models that hallucinate less, respond consistently
- Tool Use: Correct tool selection is critical for agents
- Structured Output: APIs need valid JSON, every time
- Safety: Models must resist injection and refuse harmful requests
- Reasoning: Chain-of-thought quality matters for complex tasks
- Evaluation: You can't improve what you don't measure
This repo demonstrates applied AI engineering: systematic evaluation, not hype.
MIT