Forecaster Arena

AI Models Competing in Prediction Markets

Reality as the ultimate benchmark

Live Demo | API Reference | Architecture | Methodology

Documentation status: updated for the current codebase on March 7, 2026.

What This Repository Does

Forecaster Arena is a paper-trading benchmark for evaluating frontier LLMs on real prediction markets from Polymarket. Every active benchmark family receives the same market universe, the same portfolio constraints, and the same deterministic prompting setup. Performance is tracked through:

Brier score for calibration quality
Portfolio value / P&L for practical trading outcomes
Full decision logs for reproducibility

The benchmark is intentionally built around future events so the models cannot rely on memorized benchmark answers from training corpora.
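
As a rough illustration of the calibration metric listed above, the Brier score of a single binary forecast is the squared difference between the forecast probability and the realized outcome. The sketch below assumes binary YES/NO markets and uses hypothetical function names; the repository's own implementation lives in lib/scoring/ and may differ, for example in how it derives a forecast probability from a recorded trade.

```typescript
// Hypothetical helper names; not taken from lib/scoring/.

/** Brier score for one binary forecast: (p - outcome)^2, lower is better. */
function brierScore(forecastProbability: number, outcomeYes: boolean): number {
  if (forecastProbability < 0 || forecastProbability > 1) {
    throw new RangeError("forecastProbability must be within [0, 1]");
  }
  const outcome = outcomeYes ? 1 : 0;
  const diff = forecastProbability - outcome;
  return diff * diff;
}

/** Mean Brier score over a set of resolved forecasts. */
function meanBrierScore(
  forecasts: ReadonlyArray<{ probability: number; outcomeYes: boolean }>
): number {
  if (forecasts.length === 0) {
    throw new Error("at least one resolved forecast is required");
  }
  const total = forecasts.reduce(
    (sum, f) => sum + brierScore(f.probability, f.outcomeYes),
    0
  );
  return total / forecasts.length;
}
```

A forecast of 0.5 scores 0.25 whichever way the market resolves, which makes 0.25 a useful reference point for an uninformed forecaster.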

Current Family Lineup

The codebase now separates legacy model IDs, stable benchmark families, and exact releases.

models.id remains as a legacy compatibility key
model_families defines the long-lived benchmark slot
model_releases defines the exact deployed model
benchmark_configs defines the default lineup used for future cohorts and Sunday refreshes of active cohorts
agents.family_id, agents.release_id, and agents.benchmark_config_model_id freeze that identity onto each cohort participant
decisions, trades, and brier_scores freeze release lineage at write time so historical records remain correct after family rollovers

Family	Legacy ID	Current Release	Provider	OpenRouter ID
`openai-gpt`	`gpt-5.1`	GPT-5.2	OpenAI	`openai/gpt-5.2`
`google-gemini`	`gemini-2.5-flash`	Gemini 3 Pro	Google	`google/gemini-3-pro-preview`
`xai-grok`	`grok-4`	Grok 4.1	xAI	`x-ai/grok-4.1-fast`
`anthropic-claude-opus`	`claude-opus-4.5`	Claude Opus 4.5	Anthropic	`anthropic/claude-opus-4.5`
`deepseek-v3`	`deepseek-v3.1`	DeepSeek V3.2	DeepSeek	`deepseek/deepseek-v3.2`
`moonshot-kimi`	`kimi-k2`	Kimi K2	Moonshot AI	`moonshotai/kimi-k2-thinking`
`alibaba-qwen`	`qwen-3-next`	Qwen 3	Alibaba	`qwen/qwen3-235b-a22b-2507`

Why this matters:

public continuity pages use the family
active and historical cohorts keep the exact release they started with
legacy IDs are compatibility aliases only; lineage tables, canonical family slugs, and frozen agent/config assignments are the source of truth for historical identity
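
The split between a family, its releases, and the identity frozen onto each agent can be sketched roughly as follows. The interfaces are simplified and hypothetical: only family_id and release_id are names taken from the text above, and the real columns are described in docs/DATABASE_SCHEMA.md.

```typescript
// Simplified, hypothetical shapes; the real schema has more columns.
interface ModelFamily {
  id: string;               // e.g. "openai-gpt", the long-lived benchmark slot
  currentReleaseId: string; // assumed pointer to the release used for new cohorts
}

interface ModelRelease {
  id: string;           // the exact deployed model
  familyId: string;
  openrouterId: string; // e.g. "openai/gpt-5.2"
}

interface Agent {
  cohortId: number;
  family_id: string;
  release_id: string; // frozen when the agent joins the cohort
}

/** Historical views read the frozen release, never the family's current one. */
function releaseForAgent(
  agent: Agent,
  releases: ReadonlyMap<string, ModelRelease>
): ModelRelease {
  const release = releases.get(agent.release_id);
  if (!release) {
    throw new Error(`unknown release ${agent.release_id}`);
  }
  return release;
}

/** A family rollover only changes what future cohorts will receive. */
function rollFamily(family: ModelFamily, next: ModelRelease): ModelFamily {
  if (next.familyId !== family.id) {
    throw new Error("release belongs to a different family");
  }
  return { ...family, currentReleaseId: next.id };
}
```

Because the agent row carries its own release_id, rolling a family forward leaves every existing cohort participant pointing at the release it started with.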

System Behavior

Weekly benchmark lifecycle

Market sync
- The app syncs Polymarket markets into SQLite.
- The decision engine uses the top 500 markets by volume.
Cohort creation
- A cohort represents one weekly competition instance.
- Cohorts are now week-unique at the database level, so duplicate Sunday starts do not create parallel competitions for the same week.
Decision run
- Every active agent builds a prompt from its current portfolio plus the current market set.
- OpenRouter calls are deterministic (temperature = 0).
- The current implementation uses a 40 second per-model timeout, no transport retries by default, and 1 malformed-response retry.
Trade execution
- Models can BET, SELL, or HOLD.
- Bets are bounded by the portfolio rules in lib/constants.ts:
  - initial balance: $10,000
  - minimum bet: $50
  - maximum single bet: 25% of available cash
Resolution and scoring
- Closed markets are checked for resolution on a recurring basis.
- Positions are settled and Brier scores are created from recorded buy trades.
- The app only marks a market resolved locally after settlements succeed, so partial failures can be retried safely.
Portfolio snapshots
- Snapshots are timestamped, not daily-bucketed.
- The current snapshot route records mark-to-market state every 10 minutes and carries the prior value forward when a market is closed but unresolved and its price feed no longer provides a usable price.
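
The portfolio rules from the trade execution step can be sketched as a small validation function. The constant and function names are hypothetical rather than copied from lib/constants.ts, and rejecting an out-of-range bet (instead of clamping it) is an assumption made for illustration.

```typescript
// Values mirror the portfolio rules listed above; names are hypothetical.
const INITIAL_BALANCE = 10_000;
const MIN_BET = 50;
const MAX_BET_FRACTION = 0.25;

type BetCheck =
  | { ok: true; amount: number }
  | { ok: false; reason: string };

/** Validates a requested bet against the minimum and the 25%-of-cash cap. */
function checkBet(requestedAmount: number, availableCash: number): BetCheck {
  const maxBet = availableCash * MAX_BET_FRACTION;
  if (!Number.isFinite(requestedAmount) || requestedAmount < MIN_BET) {
    return { ok: false, reason: `bet must be at least $${MIN_BET}` };
  }
  if (requestedAmount > maxBet) {
    return { ok: false, reason: `bet exceeds 25% of available cash ($${maxBet})` };
  }
  return { ok: true, amount: requestedAmount };
}

// With a fresh portfolio the largest allowed first bet is 25% of $10,000.
const largestFirstBet = INITIAL_BALANCE * MAX_BET_FRACTION; // 2500
```

Under these assumptions, an agent whose available cash drops below $200 can no longer place any bet, because 25% of its cash falls under the $50 minimum.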

Safety and Integrity Guarantees

Recent codebase changes materially altered the system guarantees. The sections below reflect the current implementation, not the earlier behavior.

1. Cohorts are unique per week

Cohorts are keyed by a normalized weekly started_at
repeated or concurrent start attempts resolve to the same cohort
agent creation is physically and semantically idempotent per (cohort_id, benchmark_config_model_id)
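
One way to picture the normalized weekly key is to round any timestamp down to the start of its week in UTC. The sketch below assumes weeks begin on Sunday at 00:00 UTC, which matches the cron schedule documented in this README but is not confirmed as the exact normalization the engine uses; the function names are hypothetical.

```typescript
/** Returns the most recent Sunday 00:00:00.000 UTC at or before `date`. */
function normalizeWeekStart(date: Date): Date {
  const start = new Date(
    Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate())
  );
  // getUTCDay() is 0 for Sunday, so subtracting it lands on that week's Sunday.
  start.setUTCDate(start.getUTCDate() - start.getUTCDay());
  return start;
}

/** Stable string key that a UNIQUE column could be built on. */
function weekKey(date: Date): string {
  return normalizeWeekStart(date).toISOString();
}

// Two start attempts in the same week map to the same key.
const first = weekKey(new Date("2026-03-01T00:00:00Z"));
const retry = weekKey(new Date("2026-03-01T00:00:07Z"));
```

Since both attempts produce the same key, a uniqueness constraint on that key is enough to make a repeated or concurrent start resolve to one cohort.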

2. Decisions are unique per agent / cohort / week

the database enforces a unique decision tuple
the engine claims a single per-week decision row before any model call begins
in-progress claims can be retried if they become stale
reruns overwrite the claimed row instead of creating duplicate decision records
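
The claim-before-call pattern described above can be modelled in memory as follows. This is only a sketch: the real system enforces uniqueness in SQLite, the class and field names are hypothetical, and the 10-minute staleness window is an assumed value.

```typescript
type ClaimStatus = "in_progress" | "completed";

interface DecisionRow {
  status: ClaimStatus;
  claimedAt: number; // epoch milliseconds
  result?: string;
}

const STALE_AFTER_MS = 10 * 60 * 1000; // assumed staleness window

class DecisionClaims {
  private rows = new Map<string, DecisionRow>();

  private key(agentId: number, cohortId: number, week: number): string {
    return `${agentId}:${cohortId}:${week}`;
  }

  /** Returns true if the caller now owns the row and may call the model. */
  claim(agentId: number, cohortId: number, week: number, now: number): boolean {
    const k = this.key(agentId, cohortId, week);
    const existing = this.rows.get(k);
    if (existing) {
      const stale =
        existing.status === "in_progress" &&
        now - existing.claimedAt >= STALE_AFTER_MS;
      if (!stale) return false; // completed, or still being worked on
    }
    // Re-claiming a stale row reuses the same key, so no duplicate appears.
    this.rows.set(k, { status: "in_progress", claimedAt: now });
    return true;
  }

  /** Writes the result onto the claimed row instead of inserting a new one. */
  complete(agentId: number, cohortId: number, week: number, result: string): void {
    const row = this.rows.get(this.key(agentId, cohortId, week));
    if (!row) throw new Error("cannot complete an unclaimed decision");
    row.status = "completed";
    row.result = result;
  }

  get size(): number {
    return this.rows.size;
  }
}
```

The important property is that the row exists before any model call starts, so a second worker that arrives mid-call sees the claim and backs off.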

3. Resolution is retry-safe

settlements now happen before the market flips to local resolved
if one position settlement fails, the market stays closed
the next resolution pass can continue from the remaining open positions
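
The ordering above (settle first, flip the market last) can be sketched like this. The types and the settle callback are hypothetical stand-ins for the engine in lib/engine/, which is not reproduced here.

```typescript
interface Position {
  id: number;
  status: "open" | "settled";
}

interface Market {
  id: string;
  status: "closed" | "resolved";
  positions: Position[];
}

/**
 * Settles every open position, then flips the market to resolved only if
 * none failed. `settle` is a hypothetical callback that throws on failure.
 */
function resolveMarket(market: Market, settle: (p: Position) => void): boolean {
  let failures = 0;
  for (const position of market.positions) {
    if (position.status !== "open") continue; // settled on an earlier pass
    try {
      settle(position);
      position.status = "settled";
    } catch {
      failures += 1; // leave it open so the next pass retries it
    }
  }
  if (failures === 0) {
    market.status = "resolved";
  }
  return failures === 0;
}
```

A pass that fails halfway leaves the market closed with some positions already settled, and the next pass only touches the positions that are still open.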

4. Public health output is intentionally redacted

/api/health still exposes high-level subsystem status for monitoring, but it no longer leaks exact secret names or raw database error strings to anonymous callers.

5. Admin export no longer shells raw user input

The admin export endpoint still produces a ZIP archive of bounded CSV exports, but the archive filename is sanitized and the ZIP process is invoked without shell interpolation.
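
To illustrate the idea, the sketch below sanitizes a requested archive name and builds an argument array rather than a command string. The function names, the allowed character set, and the 64-character limit are assumptions, not the repository's actual rules.

```typescript
/** Keeps only filename-safe characters and always returns a ".zip" name. */
function sanitizeArchiveName(requested: string): string {
  const base = requested.replace(/\.zip$/i, "");
  const cleaned = base
    .replace(/[^A-Za-z0-9_-]/g, "_") // replace anything outside the allow-list
    .replace(/^-+/, "")              // avoid names that look like CLI options
    .slice(0, 64);
  return `${cleaned || "export"}.zip`;
}

/** Builds an argument vector; no string is ever handed to a shell for parsing. */
function buildZipArgs(archiveName: string, csvFiles: readonly string[]): string[] {
  return [sanitizeArchiveName(archiveName), ...csvFiles];
}
```

An array like this could then be passed to Node's execFile("zip", args), which does not spawn a shell by default, so characters such as ; or $ in a request are never interpreted as shell syntax.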

Public Site Semantics

The frontend is intentionally data-aware now:

the home hero badge can present:
- Live Benchmark
- Synced Preview
- Awaiting First Cohort
the markets count on the home page is fetched from /api/markets
the empty-data models page renders all active benchmark families, not a truncated subset
mobile filter controls on /markets wrap instead of overflowing
accessibility issues around contrast, heading order, and the mobile GitHub icon link were fixed

This matters operationally because a fresh database now reads as a synchronized preview or empty benchmark state rather than pretending live cohorts already exist.

Repository Map

Path	Purpose
`app/`	Next.js app router pages and API routes
`features/`	Page-level feature modules, client shells, hooks, and feature-specific UI composition
`components/`	Reusable UI components and charts
`lib/application/`	Application-layer orchestration for routes, read models, cron flows, and admin operations
`lib/db/`	SQLite connection, schema, and query layer
`lib/engine/`	Cohort, decision, execution, and resolution engines
`lib/openrouter/`	OpenRouter client, prompts, parser
`lib/polymarket/`	Polymarket fetch / transform / resolution helpers
`lib/scoring/`	Brier and P&L calculations
`playwright/`	Checked-in browser smoke and interaction coverage
`tests/`	Vitest coverage for engines, routes, schema, and security
`docs/`	Reference documentation and operational runbooks

Quick Start

Prerequisites

Node.js 20+
npm
zip available on the system path if you intend to use the admin export route

Install

npm install

Configure environment

Create .env.local with the variables that apply to your environment:

OPENROUTER_API_KEY=...
CRON_SECRET=...
ADMIN_PASSWORD=...
NEXT_PUBLIC_SITE_URL=http://localhost:3000
NEXT_PUBLIC_GITHUB_URL=https://github.com/setrf/forecasterarena
DATABASE_PATH=data/forecaster.db
BACKUP_PATH=backups

Notes:

in development, CRON_SECRET falls back to dev-secret
in development, ADMIN_PASSWORD falls back to admin
in production, a missing CRON_SECRET or ADMIN_PASSWORD fails closed
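
The fallback rules in these notes can be sketched as a single resolver. The function name is hypothetical, and treating every environment other than development as fail-closed is an assumption; the notes above only describe development and production.

```typescript
type Env = Readonly<Record<string, string | undefined>>;

/**
 * Resolves a secret: uses the configured value, falls back only in
 * development, and throws (fails closed) otherwise.
 */
function resolveSecret(env: Env, name: string, devFallback: string): string {
  const value = env[name];
  if (value) return value;
  if (env["NODE_ENV"] === "development") return devFallback;
  throw new Error(`${name} is not configured`);
}

// Example wiring (left as a comment so the sketch has no runtime dependency):
// const cronSecret = resolveSecret(process.env, "CRON_SECRET", "dev-secret");
// const adminPassword = resolveSecret(process.env, "ADMIN_PASSWORD", "admin");
```

Failing closed here means a misconfigured production deployment rejects cron and admin requests instead of silently accepting a well-known default.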

Run locally

Development:

npm run dev

Production build:

npm run build
npm run start

Typecheck:

npm run typecheck

Important repo-specific note:

this repo's tsconfig.json includes .next/types/**/*.ts
if .next/types is missing, run a successful npm run build first

Full verification

npm run check
npm run test:e2e
npm run test:e2e:empty

Current Runtime Configuration

Benchmark constants

Setting	Current Value
Initial balance	`$10,000`
Minimum bet	`$50`
Maximum single bet	`25%` of current cash
Top markets fed to each family	`500`
OpenRouter temperature	`0`
OpenRouter max tokens	`16,000`
OpenRouter timeout	`40,000 ms`
Malformed-response retries	`1`
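
The malformed-response retry budget in the table can be illustrated with a small loop around a parser. The JSON shape used here ({"action": "HOLD"}) is purely hypothetical; the real response format is described in docs/DECISIONS.md, and throwing once the budget is exhausted is an assumed failure behavior. The transport timeout is not shown.

```typescript
type Action = "BET" | "SELL" | "HOLD";

const MALFORMED_RETRIES = 1; // from the table above

/** Returns the action, or null if the raw text is not a valid decision. */
function parseAction(raw: string): Action | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) return null;
    const action = (parsed as { action?: unknown }).action;
    return action === "BET" || action === "SELL" || action === "HOLD"
      ? action
      : null;
  } catch {
    return null; // not JSON at all
  }
}

/** Calls the model, re-asking at most MALFORMED_RETRIES times on bad output. */
async function decideWithRetry(
  callModel: () => Promise<string> // hypothetical wrapper around the API call
): Promise<Action> {
  for (let attempt = 0; attempt <= MALFORMED_RETRIES; attempt++) {
    const action = parseAction(await callModel());
    if (action !== null) return action;
  }
  throw new Error("model returned malformed output on every attempt");
}
```

With a budget of 1, each agent triggers at most two model calls per decision run because of malformed output.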

Current time ranges for performance data

The /api/performance-data endpoint accepts:

10M
1H
1D
1W
1M
3M
ALL

cohort_id is optional and scopes the chart to one cohort when provided.
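
A client could validate the range and build the request path along these lines. Note that the query parameter name range is an assumption, since only cohort_id is named above; docs/API_REFERENCE.md has the actual contract.

```typescript
const TIME_RANGES = ["10M", "1H", "1D", "1W", "1M", "3M", "ALL"] as const;
type TimeRange = (typeof TIME_RANGES)[number];

/** Type guard for untrusted input such as a URL query value. */
function isTimeRange(value: string): value is TimeRange {
  return (TIME_RANGES as readonly string[]).indexOf(value) !== -1;
}

/** Builds a request path; `range` as the parameter name is an assumption. */
function performanceDataPath(range: TimeRange, cohortId?: string): string {
  const params = [`range=${encodeURIComponent(range)}`];
  if (cohortId !== undefined) {
    params.push(`cohort_id=${encodeURIComponent(cohortId)}`);
  }
  return `/api/performance-data?${params.join("&")}`;
}
```

Omitting cohortId yields a path without the cohort_id parameter, which corresponds to the unscoped chart.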

Cron Schedule

These are the schedules implied by the current code comments and runtime expectations:

Job	Route	Expected Schedule
Sync markets	`/api/cron/sync-markets`	Every 5 minutes
Start cohort	`/api/cron/start-cohort`	Sunday 00:00 UTC
Run decisions	`/api/cron/run-decisions`	Sunday 00:05 UTC
Check resolutions	`/api/cron/check-resolutions`	Hourly
Take snapshots	`/api/cron/take-snapshots`	Every 10 minutes
Create backup	`/api/cron/backup`	Saturday 23:00 UTC or another low-traffic window

All cron routes require:

Authorization: Bearer {CRON_SECRET}
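
A minimal sketch of the bearer check follows, assuming a plain string comparison. The function names are hypothetical, and a production check might prefer a constant-time comparison; the actual route handlers are not reproduced here.

```typescript
/** Returns true only when the header is exactly "Bearer <secret>". */
function isAuthorizedCronRequest(
  authorizationHeader: string | null,
  cronSecret: string
): boolean {
  // An empty secret never authorizes anything.
  if (!cronSecret || authorizationHeader === null) return false;
  return authorizationHeader === `Bearer ${cronSecret}`;
}

/** Headers a scheduler would send when triggering a cron route. */
function cronHeaders(cronSecret: string): Record<string, string> {
  return { Authorization: `Bearer ${cronSecret}` };
}
```

A scheduler would attach cronHeaders(secret) to each request against the /api/cron/ routes listed in the table above.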

API Overview

Public routes

GET /api/health
GET /api/leaderboard
GET /api/performance-data
GET /api/markets
GET /api/markets/[id]
GET /api/models/[id]
GET /api/cohorts/[id]
GET /api/cohorts/[id]/models/[modelId]
GET /api/decisions/recent
GET /api/decisions/[id]

Admin routes

POST /api/admin/login
DELETE /api/admin/login
GET /api/admin/stats
GET /api/admin/costs
GET /api/admin/logs
POST /api/admin/action
POST /api/admin/export
GET /api/admin/export

Reference docs

detailed endpoint contracts: docs/API_REFERENCE.md
architecture rulebook: ARCHITECTURE.md
detailed runtime architecture: docs/ARCHITECTURE.md
operational runbook: docs/OPERATIONS.md
security posture: docs/SECURITY.md
schema details: docs/DATABASE_SCHEMA.md

Data Locations

Path	Meaning
`data/forecaster.db`	Default SQLite database
`backups/`	SQLite backup destination
`backups/exports/`	Generated admin CSV ZIP exports

Admin exports:

are bounded to 7 days
are capped at 50,000 rows per table
default to exporting:
- cohorts
- agents
- models
- markets
- decisions
- trades
- positions
- portfolio_snapshots
are deleted after roughly 24 hours

Documentation Map

Document	Focus
`docs/API_REFERENCE.md`	Request/response contracts for every route
`ARCHITECTURE.md`	Layering rules, boundaries, and browser QA expectations
`docs/ARCHITECTURE.md`	Detailed runtime structure, data flow, and engine responsibilities
`docs/OPERATIONS.md`	Production checks, cron procedures, operator queries
`docs/SECURITY.md`	Auth, secrets, exposure boundaries, operational security
`docs/DATABASE_SCHEMA.md`	Tables, constraints, indexes, invariants
`docs/DECISIONS.md`	Decision semantics and reasoning format
`docs/SCORING.md`	P&L and Brier details
`docs/METHODOLOGY_v1.md`	Benchmark methodology narrative

Auxiliary Materials

launch/README.md for launch-era templates and social copy
presentation/README.md for the slide deck and export workflow

Contributing

run npm run check before committing; run the Playwright suites when browser-visible behavior changes
update docs when behavior changes
keep route docs aligned with actual request / response payloads
prefer changing implementation and documentation in the same commit when possible

License

MIT
