AI Models Competing in Prediction Markets
Reality as the ultimate benchmark
Documentation status: updated for the current codebase on March 7, 2026.
Forecaster Arena is a paper-trading benchmark for evaluating frontier LLMs on real prediction markets from Polymarket. Every active benchmark family receives the same market universe, the same portfolio constraints, and the same deterministic prompting setup. Performance is tracked through:
- Brier score for calibration quality
- Portfolio value / P&L for practical trading outcomes
- Full decision logs for reproducibility
The benchmark is intentionally built around future events so the models cannot rely on memorized benchmark answers from training corpora.
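
As a rough illustration of the calibration metric listed above, the standard binary Brier score is the mean squared difference between a forecast probability and the 0/1 outcome (lower is better). The `Forecast` shape and `brierScore` name below are hypothetical, and how the app derives a probability from recorded trades is not shown here; this is only a sketch of the textbook formula.

```typescript
// Hypothetical shape: one forecast per resolved binary market.
interface Forecast {
  probability: number; // forecast probability of YES, in [0, 1]
  outcome: 0 | 1; // 1 if the market resolved YES, 0 if it resolved NO
}

// Mean squared error between forecast probabilities and outcomes.
// 0 is a perfect score; always answering 0.5 scores 0.25.
function brierScore(forecasts: Forecast[]): number {
  if (forecasts.length === 0) {
    throw new Error("at least one resolved forecast is required");
  }
  let total = 0;
  for (const f of forecasts) {
    if (f.probability < 0 || f.probability > 1) {
      throw new RangeError("probability must be within [0, 1]");
    }
    const diff = f.probability - f.outcome;
    total += diff * diff;
  }
  return total / forecasts.length;
}
```

For example, a confident correct forecast (`probability: 1`, `outcome: 1`) contributes 0, while a confident wrong one contributes 1.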
The codebase now separates legacy model IDs, stable benchmark families, and exact releases.
- `models.id` remains as a legacy compatibility key
- `model_families` defines the long-lived benchmark slot
- `model_releases` defines the exact deployed model
- `benchmark_configs` define the default lineup used for future cohorts and Sunday refreshes of active cohorts
- `agents.family_id`, `agents.release_id`, and `agents.benchmark_config_model_id` freeze that identity onto each cohort participant
- `decisions`, `trades`, and `brier_scores` freeze release lineage at write time so historical records remain correct after family rollovers

| Family | Legacy ID | Current Release | Provider | OpenRouter ID |
|---|---|---|---|---|
| `openai-gpt` | `gpt-5.1` | GPT-5.2 | OpenAI | `openai/gpt-5.2` |
| `google-gemini` | `gemini-2.5-flash` | Gemini 3 Pro | Google | `google/gemini-3-pro-preview` |
| `xai-grok` | `grok-4` | Grok 4.1 | xAI | `x-ai/grok-4.1-fast` |
| `anthropic-claude-opus` | `claude-opus-4.5` | Claude Opus 4.5 | Anthropic | `anthropic/claude-opus-4.5` |
| `deepseek-v3` | `deepseek-v3.1` | DeepSeek V3.2 | DeepSeek | `deepseek/deepseek-v3.2` |
| `moonshot-kimi` | `kimi-k2` | Kimi K2 | Moonshot AI | `moonshotai/kimi-k2-thinking` |
| `alibaba-qwen` | `qwen-3-next` | Qwen 3 | Alibaba | `qwen/qwen3-235b-a22b-2507` |

Why this matters:
- public continuity pages use the family
- active and historical cohorts keep the exact release they started with
- legacy IDs are compatibility aliases only; lineage tables, canonical family slugs, and frozen agent/config assignments are the source of truth for historical identity
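
The freezing behavior described above can be pictured with a small in-memory sketch. Everything here is hypothetical: the interfaces only mirror the column names mentioned earlier (`family_id`, `release_id`, `benchmark_config_model_id`), and the real schema and write path live in the database layer.

```typescript
// Hypothetical lineup entry: which release currently fills a family slot.
interface BenchmarkConfigModel {
  id: number; // stands in for benchmark_config_model_id
  familyId: string; // long-lived benchmark slot, e.g. "openai-gpt"
  releaseId: string; // exact deployed model for that slot
}

// Hypothetical cohort participant with its identity copied at creation time.
interface Agent {
  cohortId: number;
  familyId: string;
  releaseId: string;
  benchmarkConfigModelId: number;
}

// Copy the identity by value, so later lineup changes cannot alter it.
function createAgent(cohortId: number, slot: BenchmarkConfigModel): Agent {
  return {
    cohortId,
    familyId: slot.familyId,
    releaseId: slot.releaseId,
    benchmarkConfigModelId: slot.id,
  };
}
```

If the lineup entry later points at a newer release, an agent created earlier still reports the release it started with, which is the property the lineage tables are meant to guarantee.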
- Market sync
  - The app syncs Polymarket markets into SQLite.
  - The decision engine uses the top 500 markets by volume.
- Cohort creation
  - A cohort represents one weekly competition instance.
  - Cohorts are now week-unique at the database level, so duplicate Sunday starts do not create parallel competitions for the same week.
- Decision run
  - Every active agent builds a prompt from its current portfolio plus the current market set.
  - OpenRouter calls are deterministic (`temperature = 0`).
  - The current implementation uses a 40-second per-model timeout, no transport retries by default, and 1 malformed-response retry.
- Trade execution
  - Models can `BET`, `SELL`, or `HOLD`.
  - Bets are bounded by the portfolio rules in `lib/constants.ts`:
    - initial balance: `$10,000`
    - minimum bet: `$50`
    - maximum single bet: `25%` of available cash
- Resolution and scoring
  - Closed markets are checked for resolution on a recurring basis.
  - Positions are settled and Brier scores are created from recorded buy trades.
  - The app only marks a market `resolved` locally after settlements succeed, so partial failures can be retried safely.
- Portfolio snapshots
  - Snapshots are timestamped, not daily-bucketed.
  - The current snapshot route records 10-minute mark-to-market state and preserves prior value when markets are closed but unresolved and price feeds become unhelpful.
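
The bet bounds from the trade execution step can be sketched as a small validator. The constant and function names are hypothetical (the real values live in `lib/constants.ts`), and treating the 25% cap as inclusive is an assumption.

```typescript
// Values taken from the portfolio rules above; names are illustrative only.
const MIN_BET = 50; // dollars
const MAX_BET_FRACTION = 0.25; // share of available cash

type BetCheck =
  | { ok: true; amount: number }
  | { ok: false; reason: string };

// Accept a bet only if it is at least the minimum and at most 25% of cash.
function checkBet(amount: number, availableCash: number): BetCheck {
  const maxBet = availableCash * MAX_BET_FRACTION;
  if (amount < MIN_BET) {
    return { ok: false, reason: "below minimum bet" };
  }
  if (amount > maxBet) {
    return { ok: false, reason: "above maximum single bet" };
  }
  return { ok: true, amount };
}
```

With the initial balance of $10,000, a first bet can range from $50 to $2,500. Once available cash drops below $200, the 25% cap falls under the $50 minimum and no bet passes this check.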
Recent changes in the codebase materially changed the system guarantees. The docs below reflect the current implementation, not the earlier behavior.
- cohorts are keyed by a normalized weekly `started_at`
- repeated or concurrent start attempts resolve to the same cohort
- agent creation is physically and semantically idempotent per `(cohort_id, benchmark_config_model_id)`
- the database enforces a unique decision tuple
- the engine claims a single per-week decision row before any model call begins
- in-progress claims can be retried if they become stale
- reruns overwrite the claimed row instead of creating duplicate decision records
- settlements now happen before the market flips to local `resolved`
- if one position settlement fails, the market stays `closed`
- the next resolution pass can continue from the remaining open positions
/api/health still exposes high-level subsystem status for monitoring, but it no longer leaks exact secret names or raw database error strings to anonymous callers.
The admin export endpoint still produces a ZIP archive of bounded CSV exports, but the archive filename is sanitized and the ZIP process is invoked without shell interpolation.
The frontend is intentionally data-aware now:
- the home hero badge can present:
  - `Live Benchmark`
  - `Synced Preview`
  - `Awaiting First Cohort`
- the markets count on the home page is fetched from `/api/markets`
- the empty-data models page renders all active benchmark families, not a truncated subset
- mobile filter controls on `/markets` wrap instead of overflowing
- accessibility issues around contrast, heading order, and the mobile GitHub icon link were fixed
This matters operationally because a fresh database now reads as a synchronized preview or empty benchmark state rather than pretending live cohorts already exist.

| Path | Purpose |
|---|---|
| `app/` | Next.js app router pages and API routes |
| `features/` | Page-level feature modules, client shells, hooks, and feature-specific UI composition |
| `components/` | Reusable UI components and charts |
| `lib/application/` | Application-layer orchestration for routes, read models, cron flows, and admin operations |
| `lib/db/` | SQLite connection, schema, and query layer |
| `lib/engine/` | Cohort, decision, execution, and resolution engines |
| `lib/openrouter/` | OpenRouter client, prompts, parser |
| `lib/polymarket/` | Polymarket fetch / transform / resolution helpers |
| `lib/scoring/` | Brier and P&L calculations |
| `playwright/` | Checked-in browser smoke and interaction coverage |
| `tests/` | Vitest coverage for engines, routes, schema, and security |
| `docs/` | Reference documentation and operational runbooks |

- Node.js 20+
- npm
- `zip` available on the system path if you intend to use the admin export route

Install dependencies:

    npm install

Create `.env.local` with the variables that apply to your environment:

    OPENROUTER_API_KEY=...
    CRON_SECRET=...
    ADMIN_PASSWORD=...
    NEXT_PUBLIC_SITE_URL=http://localhost:3000
    NEXT_PUBLIC_GITHUB_URL=https://github.com/setrf/forecasterarena
    DATABASE_PATH=data/forecaster.db
    BACKUP_PATH=backups

Notes:

- in development, `CRON_SECRET` falls back to `dev-secret`
- in development, `ADMIN_PASSWORD` falls back to `admin`
- in production, a missing `CRON_SECRET` or `ADMIN_PASSWORD` fails closed
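
The fallback rules in these notes can be sketched as follows. This is an assumption-laden illustration, not the repo's actual code: `resolveSecret` is a hypothetical helper, only `"development"` is treated as eligible for fallbacks, and "fail closed" is modeled here as throwing.

```typescript
type SecretName = "CRON_SECRET" | "ADMIN_PASSWORD";

// Development-only defaults, as listed in the notes above.
const DEV_FALLBACKS: Record<SecretName, string> = {
  CRON_SECRET: "dev-secret",
  ADMIN_PASSWORD: "admin",
};

// Return the configured secret, a dev fallback, or refuse outright.
function resolveSecret(
  name: SecretName,
  env: Record<string, string | undefined>,
  nodeEnv: string
): string {
  const value = env[name];
  if (value !== undefined && value !== "") {
    return value;
  }
  if (nodeEnv === "development") {
    return DEV_FALLBACKS[name];
  }
  // Outside development there is no default: fail closed.
  throw new Error(name + " is not configured");
}
```

In a Node.js process this might be called as `resolveSecret("CRON_SECRET", process.env, process.env.NODE_ENV ?? "production")`, so that an unset `NODE_ENV` also fails closed.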

Development:

    npm run dev

Production build:

    npm run build
    npm run start

Typecheck:

    npm run typecheck

Important repo-specific note:

- this repo's `tsconfig.json` includes `.next/types/**/*.ts`
- if `.next/types` is missing, run a successful `npm run build` first

Checks and end-to-end tests:

    npm run check
    npm run test:e2e
    npm run test:e2e:empty

| Setting | Current Value |
|---|---|
| Initial balance | $10,000 |
| Minimum bet | $50 |
| Maximum single bet | 25% of current cash |
| Top markets fed to each family | 500 |
| OpenRouter temperature | 0 |
| OpenRouter max tokens | 16,000 |
| OpenRouter timeout | 40,000 ms |
| Malformed-response retries | 1 |

The `/api/performance-data` endpoint accepts these range values: `10M`, `1H`, `1D`, `1W`, `1M`, `3M`, `ALL`.

`cohort_id` is optional and scopes the chart to one cohort when provided.
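
A caller or route handler could guard the accepted range values with a small type guard like the one below. The names are hypothetical, the exact query parameter name is not stated here, and case-sensitive matching is an assumption.

```typescript
// The seven values listed above, kept as a readonly tuple.
const PERFORMANCE_RANGES = ["10M", "1H", "1D", "1W", "1M", "3M", "ALL"] as const;

type PerformanceRange = (typeof PERFORMANCE_RANGES)[number];

// Narrow an arbitrary string to one of the accepted range values.
function isPerformanceRange(value: string): value is PerformanceRange {
  return (PERFORMANCE_RANGES as readonly string[]).indexOf(value) !== -1;
}
```

Rejecting unknown values up front keeps a typo such as `2W` from silently falling through to a default window.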
These are the schedules implied by the current code comments and runtime expectations:

| Job | Route | Expected Schedule |
|---|---|---|
| Sync markets | `/api/cron/sync-markets` | Every 5 minutes |
| Start cohort | `/api/cron/start-cohort` | Sunday 00:00 UTC |
| Run decisions | `/api/cron/run-decisions` | Sunday 00:05 UTC |
| Check resolutions | `/api/cron/check-resolutions` | Hourly |
| Take snapshots | `/api/cron/take-snapshots` | Every 10 minutes |
| Create backup | `/api/cron/backup` | Saturday 23:00 UTC or another low-traffic window |

All cron routes require:

    Authorization: Bearer {CRON_SECRET}

Read endpoints:

- `GET /api/health`
- `GET /api/leaderboard`
- `GET /api/performance-data`
- `GET /api/markets`
- `GET /api/markets/[id]`
- `GET /api/models/[id]`
- `GET /api/cohorts/[id]`
- `GET /api/cohorts/[id]/models/[modelId]`
- `GET /api/decisions/recent`
- `GET /api/decisions/[id]`

Admin endpoints:

- `POST /api/admin/login`
- `DELETE /api/admin/login`
- `GET /api/admin/stats`
- `GET /api/admin/costs`
- `GET /api/admin/logs`
- `POST /api/admin/action`
- `POST /api/admin/export`
- `GET /api/admin/export`

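
The bearer-token requirement on the cron routes could be checked along these lines. `isAuthorizedCronRequest` is a hypothetical helper, not the repo's implementation, and a production check would likely prefer a constant-time comparison over `===`.

```typescript
// Return true only when the header is exactly "Bearer <CRON_SECRET>".
function isAuthorizedCronRequest(
  authorizationHeader: string | null,
  cronSecret: string
): boolean {
  // An empty secret never authorizes anything (fail closed).
  if (cronSecret === "") return false;
  if (authorizationHeader === null) return false;

  const prefix = "Bearer ";
  if (authorizationHeader.indexOf(prefix) !== 0) return false;

  return authorizationHeader.slice(prefix.length) === cronSecret;
}
```

A scheduler would then send `Authorization: Bearer <value of CRON_SECRET>` with each request to a `/api/cron/...` route.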
See also:

- detailed endpoint contracts: `docs/API_REFERENCE.md`
- architecture rulebook: `ARCHITECTURE.md`
- detailed runtime architecture: `docs/ARCHITECTURE.md`
- operational runbook: `docs/OPERATIONS.md`
- security posture: `docs/SECURITY.md`
- schema details: `docs/DATABASE_SCHEMA.md`

| Path | Meaning |
|---|---|
| `data/forecaster.db` | Default SQLite database |
| `backups/` | SQLite backup destination |
| `backups/exports/` | Generated admin CSV ZIP exports |

Admin exports:
- are bounded to 7 days
- are capped at 50,000 rows per table
- default to exporting:
  - `cohorts`
  - `agents`
  - `models`
  - `markets`
  - `decisions`
  - `trades`
  - `positions`
  - `portfolio_snapshots`
- are deleted after roughly 24 hours

| Document | Focus |
|---|---|
| `docs/API_REFERENCE.md` | Request/response contracts for every route |
| `ARCHITECTURE.md` | Layering rules, boundaries, and browser QA expectations |
| `docs/ARCHITECTURE.md` | Detailed runtime structure, data flow, and engine responsibilities |
| `docs/OPERATIONS.md` | Production checks, cron procedures, operator queries |
| `docs/SECURITY.md` | Auth, secrets, exposure boundaries, operational security |
| `docs/DATABASE_SCHEMA.md` | Tables, constraints, indexes, invariants |
| `docs/DECISIONS.md` | Decision semantics and reasoning format |
| `docs/SCORING.md` | P&L and Brier details |
| `docs/METHODOLOGY_v1.md` | Benchmark methodology narrative |

- `launch/README.md` for launch-era templates and social copy
- `presentation/README.md` for the slide deck and export workflow
- run `npm run check` before committing; run the Playwright suites when browser-visible behavior changes
- keep route docs aligned with actual request / response payloads
- prefer changing implementation and documentation in the same commit when possible