Methodology
This page explains how the AI 2027 Tracker works — from extracting predictions out of a narrative scenario, to assigning statuses, to handling ambiguity and corrections. If you want to inspect the logic behind any assessment on this site, start here.
1. What counts as a prediction?
AI 2027 is a narrative scenario, not a list of forecasts. It reads like a story — describing events, timelines, and dynamics in prose rather than as discrete, falsifiable claims. That makes it vivid but hard to score directly.
To track the scenario against reality, we extract individual predictions from the narrative. A statement qualifies as a trackable prediction if it meets all three of these criteria:
- It describes something that could happen or not happen. A prediction must be about a state of the world, not a definition or framing device. "Labs race to build AGI" is a claim about the world. "AGI would change everything" is commentary.
- It has a discernible timeframe. The scenario places events in rough chronological order, sometimes with explicit dates ("by mid-2026") and sometimes through context ("during this period of rapid scaling"). We accept both, noting when the timing is inferred rather than stated.
- It can, in principle, be compared with evidence. We don't require a clean binary yes/no — many predictions are directional — but there must be some class of real-world evidence that would inform the assessment.
Each extracted prediction includes the original claim (exact quote or faithful paraphrase), a source reference to the relevant section of the scenario, and an explanation of how we interpret the claim in context.
We currently track 48 predictions across eight categories: model capability, agent autonomy, coding automation, governance, security, geopolitics, economic impact, and takeoff dynamics.
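To make the structure concrete, here is a minimal sketch of how an extracted prediction might be represented as a record, written in TypeScript. The field names and the Category type are our own illustrative choices, not the tracker's actual schema:

```typescript
// Hypothetical schema for an extracted prediction (illustrative only).
type Category =
  | "model capability" | "agent autonomy" | "coding automation"
  | "governance" | "security" | "geopolitics"
  | "economic impact" | "takeoff dynamics";

interface Prediction {
  claim: string;           // exact quote or faithful paraphrase
  sourceRef: string;       // which section of the scenario the claim comes from
  interpretation: string;  // how we read the claim in context
  timeframe: string;       // explicit date, or a period inferred from context
  timingInferred: boolean; // true when the timing is inferred rather than stated
  category: Category;
}
```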
2. How narrative statements become trackable
Once a prediction is extracted, the next step is operationalization — converting prose into something we can monitor. This is where most of the interpretive work happens, and where we're most careful to show our reasoning.
Each prediction is mapped to one or more proxy types:
- Benchmark performance. A quantitative test or evaluation that measures progress. Example: "AI systems can handle 80% of coding tasks" → SWE-bench scores, HumanEval pass rates, METR task horizon benchmarks.
- Product availability. Whether a capability is available in production systems. Example: "AI assistants manage calendars and emails" → do shipped products from major labs actually do this?
- Product behavior. Observable behavior in commercial AI products. Example: "Coding agents handle multi-file changes" → what can tools like Cursor, Copilot, or Devin actually do today?
- Industry actions. Observable actions by organizations or markets. Example: "Massive infrastructure investment" → are data center buildouts, power deals, and capex figures tracking as described?
- Policy actions. Government actions, regulations, or official statements. Example: "Export controls tighten on AI chips" → what has actually been enacted or proposed?
- Model releases. Capabilities demonstrated by newly released models. Example: "Models reach PhD-level reasoning" → what do frontier models actually achieve on graduate-level reasoning tasks?
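As a sketch of how this mapping might be represented, the proxy types above can be modeled as a union type. The names below are our own shorthand for the six types just listed, not an official taxonomy:

```typescript
// Hypothetical proxy-type taxonomy; the names are illustrative shorthand.
type ProxyType =
  | "benchmark"            // quantitative tests and evaluations
  | "product-availability" // capability shipped in production systems
  | "product-behavior"     // observable behavior of commercial AI products
  | "industry-action"      // observable actions by organizations or markets
  | "policy-action"        // government actions, regulations, statements
  | "model-release";       // capabilities of newly released models

// Example: operationalizing "AI systems can handle 80% of coding tasks".
const codingTaskProxies: ProxyType[] = ["benchmark", "product-behavior"];
```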
Every prediction page includes a "How We Track This" section that specifies which proxies we use and why. This is the most important section for readers who want to challenge our framing — if the operationalization is wrong, the assessment is wrong regardless of the evidence.
3. What counts as evidence?
Not all information is equally useful. We accept evidence from the following classes, roughly ordered by reliability:
- Benchmark results and technical evaluations. Published scores on recognized benchmarks (SWE-bench, GPQA, ARC-AGI, METR task horizons, etc.), ideally from independent evaluators or the benchmark maintainers themselves.
- Model release details. Official announcements, technical reports, and system cards from AI labs describing model capabilities, training details, and evaluation results.
- Public product behavior. Observable, reproducible capabilities in shipping products. What can a user actually do with a tool today?
- Official policy documents. Legislation, executive orders, regulatory filings, export control rules — not rumors or leaked drafts, but enacted or formally proposed policy.
- Credible reporting. Well-sourced journalism from outlets with a track record (Reuters, Bloomberg, The Information, etc.). We note when an assessment relies on reporting rather than primary sources.
- Published analyses. Research papers, think-tank reports, credible blog posts with transparent methodology. We evaluate the argument, not the author's credentials.
- Direct statements from relevant actors. Public statements from lab leaders, government officials, or other key figures. We treat these as data points about intentions, not as ground truth about outcomes.
Evidence discipline
- Primary sources preferred. We prioritize original data over media coverage of that data.
- No single-source status changes. We don't change a prediction's status based on one report, one benchmark, or one announcement alone — unless the evidence is unambiguous (e.g., a model is publicly released).
- All evidence is dated. We note when evidence was published. Outdated evidence is flagged, not silently relied upon.
- Links are verified. Evidence URLs are checked for accessibility. Broken links are replaced or noted.
- Counterevidence is required. Every prediction page includes genuine counterarguments — not strawmen, but the strongest case against our current assessment.
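These rules translate naturally into fields on each evidence item. A minimal sketch with hypothetical names; the check at the end encodes the no-single-source rule:

```typescript
// Hypothetical shape of a single evidence item (illustrative only).
interface EvidenceItem {
  sourceClass:
    | "benchmark" | "model-release" | "product-behavior" | "policy-document"
    | "reporting" | "analysis" | "actor-statement";
  url: string;
  publishedOn: string;   // ISO date: all evidence is dated
  linkVerified: boolean; // broken links are replaced or noted
  isPrimary: boolean;    // primary sources preferred over coverage of them
  summary: string;
}

// No single-source status changes: a change needs corroboration
// unless the evidence is unambiguous (e.g., a public model release).
function supportsStatusChange(items: EvidenceItem[], unambiguous: boolean): boolean {
  return unambiguous || items.length >= 2;
}
```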
4. How statuses are assigned
Each prediction carries one of six statuses. These are judgments, not mechanical outputs — we explain the reasoning on every prediction page:
- Confirmed. The substance of the prediction is borne out by strong evidence, within a reasonable interpretation of the predicted timeframe. The timing may not be exact to the month, but the thing happened. Example: if the scenario predicted massive data center investment in 2025–2026, and that investment is clearly happening, it's confirmed.
- Ahead. The development is occurring earlier or more aggressively than the scenario described. This is a stronger claim than "On Track" — it means the scenario was, if anything, conservative on this point. We use this sparingly and only with clear evidence of acceleration.
- On Track. Things are moving in the predicted direction, at roughly the predicted pace. This doesn't mean the prediction will definitely come true — only that nothing so far contradicts it, and positive indicators exist.
- Behind. The scenario's direction looks right, but the timeline looks aggressive. The prediction may still come true — just later. We distinguish this from "not supported" because the direction of travel matters.
- Emerging. There are real indicators pointing in the predicted direction — not just speculation — but the evidence is too thin, too recent, or too ambiguous for a confident status. Many predictions start here and graduate to a firmer status as evidence accumulates.
- Not Yet Testable. The predicted timeframe hasn't arrived, or the preconditions for testing haven't materialized. We may note early indicators, but we don't assign a substantive status to something that can't yet be evaluated. This is not a judgment — it's patience.
How status differs from confidence
Status reflects what we think is happening. Confidence reflects how sure we are about our assessment. A prediction can be "On Track" with moderate confidence (the direction looks right but evidence is limited) or "Confirmed" with high confidence (clear, multi-source evidence). These are independent dimensions — see confidence scores below.
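Because status and confidence are independent dimensions, they naturally live in separate fields. A minimal sketch, assuming hypothetical names:

```typescript
// The six statuses, as named throughout this page.
type Status =
  | "Confirmed" | "Ahead" | "On Track"
  | "Behind" | "Emerging" | "Not Yet Testable";

// Status (what we think is happening) and confidence (how sure we are)
// vary independently: "On Track" at 55% and "Confirmed" at 95% are both valid.
interface Assessment {
  status: Status;
  confidence: number; // 0-100; see "Confidence scores" below
  reasoning: string;  // every status comes with an explanation
}
```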
5. How timing mismatches are handled
A recurring challenge: the scenario gets the direction right but the timing wrong. How do we score that?
Our approach distinguishes three dimensions of a prediction:
- Direction: Is the thing happening at all? (e.g., Are AI coding tools improving?)
- Magnitude: Is it happening at the predicted scale? (e.g., Are they handling 80% of tasks, or 30%?)
- Timing: Is it happening when predicted? (e.g., By mid-2026, or later?)
We use different statuses for different mismatch patterns:
- Right direction, right timing → On Track or Confirmed
- Right direction, ahead of schedule → Ahead
- Right direction, behind schedule → Behind (with notes on expected delay)
- Right direction, magnitude unclear → Emerging or On Track (depending on signal strength)
- Direction unclear → Emerging or Not Yet Testable
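Read as a decision procedure, the mapping above might look like the following sketch. The inputs are deliberately coarse and the function is illustrative; real assessments are judgments that weigh signal strength, not mechanical outputs:

```typescript
type Status =
  | "Confirmed" | "Ahead" | "On Track"
  | "Behind" | "Emerging" | "Not Yet Testable";

// Simplified mapping from mismatch pattern to status, mirroring the list above.
function statusFor(
  direction: "right" | "unclear",
  timing: "on-time" | "ahead" | "behind",
  magnitudeClear: boolean,
  strongSignal: boolean
): Status {
  if (direction === "unclear") return strongSignal ? "Emerging" : "Not Yet Testable";
  if (!magnitudeClear) return strongSignal ? "On Track" : "Emerging";
  switch (timing) {
    case "ahead":   return "Ahead";
    case "behind":  return "Behind";
    case "on-time": return strongSignal ? "Confirmed" : "On Track";
  }
}
```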
We don't penalize the scenario for being directionally correct but off by a few months. The scenario describes events over a multi-year arc — expecting month-level precision from a narrative document would be unreasonable. However, we do flag timing mismatches explicitly, because a prediction that's "right but two years late" is meaningfully different from one that's "right on time."
Each prediction page includes timeline notes identifying whether the main uncertainty is about direction, timing, interpretation, measurement, or external dependencies.
6. How ambiguous claims are handled
AI 2027 is a scenario, not a specification. Many of its claims are deliberately vague, metaphorical, or embedded in narrative context that admits multiple interpretations. We handle this through several practices:
Explicit interpretation
Every prediction page includes a tracker interpretation — our reading of what the claim means in operational terms. When the original text is ambiguous, we state our interpretation clearly and explain why we chose it. Readers can disagree with the interpretation while still finding the evidence useful.
Multiple readings
When a claim supports multiple reasonable interpretations, we note them. Sometimes we'll track a narrow reading (easier to score) alongside a broad reading (closer to the scenario's spirit). We indicate which interpretation drives our current status.
Charitable but honest framing
We aim for the most reasonable interpretation — not the weakest (easy to confirm) or strongest (easy to falsify). If a claim is genuinely unclear, we say so rather than picking the reading that makes our status cleanest.
Narrative context matters
Some claims only make sense in the context of the broader scenario. "Agents can self-replicate" means something different depending on whether the surrounding narrative describes formal sandboxed evaluations or autonomous real-world behavior. We use the scenario's own context to constrain interpretation, and we explain when we do.
Unfalsifiable claims
A small number of claims are effectively unfalsifiable within any reasonable timeframe or evidence structure. We flag these as "Not Yet Testable" with a note explaining why, rather than forcing a status that implies more certainty than we have.
7. How revisions are handled
A tracker that never changes is useless. A tracker that changes without explanation is untrustworthy. Our revision practices aim to balance responsiveness with discipline.
When we update
- New evidence: When materially relevant evidence emerges — a benchmark result, model release, policy decision, or credible report — we update the affected prediction pages.
- Status changes: When accumulated evidence warrants changing a prediction's status, we do so and document the reasoning.
- Interpretation corrections: If we realize our operationalization was flawed or our interpretation of the original claim was wrong, we revise it and explain why.
- Error corrections: Factual errors, broken links, or outdated information are fixed as soon as identified.
How we document changes
- Every prediction page includes an Update History section — a chronological log of status changes, evidence additions, and interpretation revisions.
- The site-wide Changelog records significant changes across all predictions.
- Every page shows a last updated timestamp.
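For illustration, one Update History entry might be shaped like this (field names hypothetical):

```typescript
// Hypothetical shape of one Update History entry (illustrative only).
interface UpdateEntry {
  date: string; // when the change was made
  kind: "status-change" | "evidence-added" | "interpretation-revised" | "error-fixed";
  summary: string;    // what changed
  reasoning?: string; // why, for status and interpretation changes
}
```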
What we don't do
- We don't silently change statuses. Every change is logged.
- We don't retroactively adjust interpretations to make old assessments look better. If we got something wrong, we say so.
- We don't update for the sake of updating. Changes are driven by evidence, not a content calendar.
8. How external comparisons are used
This tracker doesn't exist in a vacuum. Other sources track overlapping questions — sometimes with different conclusions. We use them as inputs, not substitutes.
AI Futures Project
The AI Futures Project includes some of the authors of the AI 2027 scenario. Their later updates and self-grading provide valuable context — especially when they've revised or qualified their own views. We reference these but don't treat them as authoritative on what reality looks like. The authors' assessment of their own scenario is interesting but not the same as independent verification.
Metaculus and crowd forecasts
Metaculus runs prediction tournaments covering some of the same questions the scenario addresses. Crowd forecasts provide a useful calibration check — if our assessment diverges sharply from aggregated forecaster opinion, that's worth noting and explaining. But crowd medians aren't ground truth. We use them as one signal among many.
Other sources
We reference benchmarks (METR, SWE-bench, GPQA, ARC-AGI), lab publications, government documents, and credible analyses as they become relevant. No single external source drives our assessments.
The key principle
External sources are additional evidence and calibration, not replacements for our own assessment. We never collapse an external forecast into the original scenario — "Metaculus agrees" is not the same as "the evidence confirms." We keep the layers separate: the original claim, external views, observed evidence, and our assessment.
9. The speed ratio
The speed ratio is our top-level estimate of how fast reality is progressing relative to the AI 2027 timeline. The current ratio is 0.70× — meaning reality appears to be moving at roughly 70% of the pace the scenario describes.
How it's calculated
The speed ratio is a composite estimate derived from multiple input signals:
- METR task horizon doubling rates — how fast AI systems are gaining the ability to handle longer, more complex tasks, compared to the rate the scenario implies
- Compute buildout timelines — data center construction, power procurement, and infrastructure investment pace vs. predicted pace
- Capability benchmark progress — frontier model performance on key benchmarks vs. predicted capability milestones
- AI Futures Project self-grading — where available, the scenario authors' own assessment of pacing
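To illustrate what "composite estimate" means, here is one simple way such a ratio could be aggregated. The signal values, the weights, and the weighted-mean formula below are all invented for the example; they are not the tracker's actual inputs or method:

```typescript
// Invented example inputs: each signal is an observed-pace / predicted-pace ratio.
const signals = [
  { name: "METR task horizon doubling rate", ratio: 0.80, weight: 0.35 },
  { name: "Compute buildout timelines",      ratio: 0.70, weight: 0.25 },
  { name: "Capability benchmark progress",   ratio: 0.62, weight: 0.25 },
  { name: "AI Futures Project self-grading", ratio: 0.60, weight: 0.15 },
];

// One simple aggregation: a weighted arithmetic mean of the pace ratios.
const speedRatio =
  signals.reduce((sum, s) => sum + s.ratio * s.weight, 0) /
  signals.reduce((sum, s) => sum + s.weight, 0);

console.log(speedRatio.toFixed(2)); // "0.70" with these invented numbers
```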
What it is and isn't
The speed ratio is a compass, not a GPS coordinate. It's meant to give a quick, honest answer to "Is this scenario roughly on track?" — not to provide decimal-point precision about the state of AI progress.
A ratio below 1.0× means reality is slower than predicted. Above 1.0× would mean faster. The ratio can change as new evidence arrives — it's a living estimate, not a fixed grade.
We report it because it's useful shorthand, but we don't want anyone to mistake it for more precision than it contains. The real substance is in the individual prediction pages, not this single number.
10. Confidence scores
Each prediction carries a confidence score (0–100%) reflecting how certain we are about our status assessment. This is distinct from whether the prediction is "right" — it's about the strength of our evaluation.
- 90–100%: Very high confidence. Multiple strong, independent evidence sources. Little room for reasonable disagreement about the status.
- 70–89%: High confidence. Good evidence, but some ambiguity in interpretation or incomplete coverage.
- 50–69%: Moderate confidence. Evidence is mixed, the prediction is hard to operationalize, or reasonable people could disagree on status.
- Below 50%: Low confidence. Limited evidence, inherently difficult to verify, or the prediction is too vague for strong assessment.
A prediction rated "Confirmed" with 75% confidence means: we believe the evidence supports confirmation, but there's meaningful uncertainty — perhaps about scope, about whether the evidence fully captures the original claim, or about alternative interpretations.
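The bands above are simple to express as a lookup. A sketch, with hypothetical names:

```typescript
type ConfidenceBand = "very high" | "high" | "moderate" | "low";

// Maps a 0-100 confidence score to the bands described above.
function bandFor(score: number): ConfidenceBand {
  if (score >= 90) return "very high";
  if (score >= 70) return "high";
  if (score >= 50) return "moderate";
  return "low";
}

console.log(bandFor(75)); // "high": e.g., "Confirmed" at 75% confidence
```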
11. Guiding principles
- Accuracy over hype. We don't bend interpretation to make the scenario look right or wrong. If the evidence is mixed, we say it's mixed.
- Separate source from interpretation. Every assessment involves four layers: the original claim, our interpretation, the evidence, and our assessment. We keep them visibly distinct.
- Be explicit about uncertainty. We use statuses and confidence scores that allow ambiguity. Forcing false precision is worse than admitting uncertainty.
- Build for skeptics. People who think AI 2027 is too aggressive — or too narrative-driven — should still find this site useful. The tracker is credible only if it's credible to people who disagree with the scenario.
- Show your work. Every judgment is sourced and explainable. If we can't explain why we assigned a status, we shouldn't assign it.
- Correct visibly. Revisions are a feature, not a bug. When we change an assessment, we document when, why, and what changed.