Meta-evaluation story

Four findings.

§1. The dominant axis of cross-model variance separates forward measures (PLS, ridge, classification accuracy, behavioral consistency) from reverse measures (reverse-PLS, RDM, single-unit tuning fidelity), cutting across the neural/behavioral split. The positive pole tracks ImageNet top-1; the negative pole dissociates from it. We read this axis as recall vs precision.

§2. Scoring metric is the largest single predictor of pairwise benchmark agreement under partial-correlation control, only modestly above brain region and benchmark family. Most of the variance in agreement between any two benchmarks is not explained by metadata at all.

§3. Conventional Brain-Score HIER scores fall as the suite expands. The 2022-era top performers maintain high completeness on the post-2024 benchmarks but are surpassed on those benchmarks by more recent submissions; the drop reflects measured performance, not coverage gaps.

§4. A single-threshold filter on a 15-benchmark panel (chosen by §1 PC1 + §2 metric diversity) and calibrated only on the pre-2024 cohort retains 100% of true Q1 (41/41) and 95% of true Q2 (41/43) of the 2024+ holdout, with sensitivity 97.6% (95% CI [93.4, 100.0]) and specificity 69.9% (95% CI [60.9, 77.8]).

0 The leaderboard data, in 60 seconds

The Brain-Score Vision leaderboard scores 524 public models against 99 benchmarks, producing a 524 × 99 matrix. Each cell is in one of three states:

SCORED (57%): the model was evaluated and produced a numeric score.
FAILED (10%): the model was evaluated but the scoring step produced no number (crash, missing output, etc.).
NEVER (33%): the model was never evaluated on that benchmark.

The HIER score visible on the public leaderboard is a recursive equal-weight average over the benchmark tree in which both FAILED and NEVER count as zero at the leaf. A benchmark the model was never given the chance to run is therefore treated the same as one it tried and failed. Throughout this analysis we also report HIER_attempted, which uses the same tree mean but excludes NEVER cells from the parent average (FAILED still contributes zero). The two agree when coverage is uniform and diverge when it isn't; that divergence is the central observation of Section 3.
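
The two aggregations can be sketched on a toy tree. This is a minimal illustration, not the leaderboard's implementation: the tree layout, the benchmark names, and the scores are hypothetical, and dropping a subtree whose leaves are all NEVER is an assumption about how HIER_attempted handles that case.

```python
from statistics import mean

NEVER = "NEVER"    # model never evaluated on this leaf
FAILED = "FAILED"  # evaluated, but no numeric score produced

def tree_mean(node, skip_never):
    """Recursive equal-weight mean over a benchmark tree.

    A node is a leaf (float, FAILED, or NEVER) or a dict of children.
    Returns None when a subtree has nothing left to average, which can
    only happen with skip_never=True.
    """
    if isinstance(node, dict):
        child_scores = [tree_mean(child, skip_never) for child in node.values()]
        kept = [s for s in child_scores if s is not None]
        return mean(kept) if kept else None
    if node == FAILED:
        return 0.0                      # FAILED is zero under both conventions
    if node == NEVER:
        return None if skip_never else 0.0
    return float(node)

def hier(tree):
    return tree_mean(tree, skip_never=False)

def hier_attempted(tree):
    return tree_mean(tree, skip_never=True)

# Hypothetical model: names and scores are made up for illustration.
model = {
    "neural": {
        "V1": {"bench_a": 0.5, "bench_b": NEVER},
        "IT": {"bench_c": 0.25, "bench_d": FAILED},
    },
    "behavior": {"bench_e": 0.5, "bench_f": NEVER},
}

print(hier(model))            # 0.21875
print(hier_attempted(model))  # 0.40625
```

The FAILED leaf lowers both scores; only the two NEVER leaves account for the gap between them.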

Two analytic cohorts are used. The structural cohort (105 models with ≥ 0.90 effective coverage on all 99 benchmarks) is used wherever pairwise covariance needs uniform coverage: PCA, region correlations, sibling agreement. The ranking cohort (253 models with ≥ 0.90 coverage on the 81 pre-2024 benchmarks) is used for ranking and quartile analyses, where requiring coverage on the 18 post-2024 additions would bias the cohort toward already-popular models.

1 The dominant axis of cross-model variance cuts across the neural/behavioral split

Principal-components analysis of the structural cohort matrix (105 models × 99 benchmarks, column-standardized, FAILED → 0, NEVER → column-mean impute) identifies 5–10 components by parallel analysis. PC1 alone captures 27% of standardized variance, more than the sum of the next three components. Its loadings do not align with the neural/behavioral partition that organizes the HIER tree.
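
The preprocessing and the first component can be sketched in plain Python. This is an illustration under stated assumptions, not the analysis code: the column mean used for NEVER cells is taken over SCORED and FAILED cells, loadings are reported as eigenvector weights scaled by the square root of the eigenvalue, power iteration stands in for a full eigendecomposition, and the three toy benchmarks are hypothetical.

```python
import random
from math import sqrt

NEVER, FAILED = "NEVER", "FAILED"

def prepare_column(cells):
    """FAILED -> 0, NEVER -> column mean, then z-score the column.

    Assumption: the imputed mean is taken over SCORED and FAILED cells.
    A constant column would divide by zero and is not handled here.
    """
    filled = [0.0 if c == FAILED else c for c in cells]
    observed = [c for c in filled if c != NEVER]
    col_mean = sum(observed) / len(observed)
    filled = [col_mean if c == NEVER else c for c in filled]
    mu = sum(filled) / len(filled)
    sd = sqrt(sum((v - mu) ** 2 for v in filled) / len(filled))
    return [(v - mu) / sd for v in filled]

def first_pc(columns, iterations=500, seed=0):
    """Power iteration on the correlation matrix of z-scored columns.

    Returns (loadings, explained_variance_fraction). The sign of the
    loadings is arbitrary, as with any principal component.
    """
    n, p = len(columns[0]), len(columns)
    corr = [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
            for ci in columns]
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(p)]
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p))
                     for i in range(p))
    loadings = [x * sqrt(eigenvalue) for x in v]
    # The trace of a correlation matrix equals p, so eigenvalue / p is
    # the fraction of standardized variance on this component.
    return loadings, eigenvalue / p

# Hypothetical 5-model x 3-benchmark matrix.
benchmarks = {
    "bench_a": [0.1, 0.2, 0.3, 0.4, NEVER],
    "bench_b": [0.2, 0.4, 0.6, 0.8, 0.5],
    "bench_c": [0.4, 0.3, 0.2, 0.1, FAILED],
}
columns = [prepare_column(cells) for cells in benchmarks.values()]
loadings, explained = first_pc(columns)
print(dict(zip(benchmarks, loadings)), explained)
```

In this toy matrix bench_a and bench_b load with one sign and bench_c with the other, the same kind of bipolar structure the PC1 loadings show on the real suite.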

PC1 loadings, structural cohort, ordered by absolute magnitude. The positive pole loads on forward measures (PLS, ridge, classification accuracy, behavioral consistency on distorted images); the negative pole loads on reverse measures (reverse-PLS, neural representational-dissimilarity matrices, V1 texture-modulation index). Both poles draw on neural and behavioral data and span multiple species, regions, and acquisition modalities.

The strongest positive loadings on PC1 are Rajalingham 2018 behavioral i2n consistency (+0.86), Geirhos 2021 silhouette error-consistency (+0.85), Allen 2022 fMRI V2-ridge prediction (+0.83), and Geirhos 2021 eidolon error-consistency (+0.83). The strongest negative loadings are MajajHong 2015 IT reverse-PLS (-0.63), Coggan 2024 V1 RDM (-0.61), Marques 2020 abs-texture-modulation-index (-0.49), and Coggan 2024 V2 RDM (-0.44).

|r| ≥ 0.83

PC1 loadings agreement across three NEVER-as-missing imputation approaches (structural cohort, pairwise-complete, soft-impute rank-6). The axis is not an artifact of how missing data is filled in.

|r| = 0.57

PC1 loadings agreement between any of the three NEVER-as-missing approaches and naive NEVER → 0. Filling missing cells with zero distorts the axis by roughly 40% of its content.

PC1 is also robust to removing the 18 post-2024 benchmark additions entirely (|r| = 0.98 on the 81 common benchmarks). The axis is a property of the suite, not of the post-2024 additions and not of the missing-data convention.

Mechanism: recall vs precision, tested against ImageNet top-1

Forward measures (PLS, ridge, classification accuracy, behavioral consistency) score high when a model's features contain the variance the brain data carries, rewarding recall. Reverse measures (reverse-PLS, RDM distances, single-unit tuning fidelity) score high when a model's features lack the variance the brain data does not carry, rewarding precision. Modern vision backbones are trained for classification, segmentation, or contrastive matching, all of which reward recall.

The same loading pattern admits other readings, so the recall/precision label needs an external test. We use ImageNet top-1 accuracy as a recall-leaning reference: a recall axis should track it, while precision-leaning benchmarks should dissociate from it. On 75 cohort models, PC1 vs ImageNet ρ = +0.79 (p ≈ 10^-16). The five top-positive-pole benchmarks correlate with ImageNet at mean ρ = +0.58; the five top-negative-pole benchmarks at mean ρ = -0.10. The dissociation is what the recall/precision reading predicts, not what a generic competence axis would predict.

Per-new-benchmark Spearman ρ to HIER_old (the recursive HIER mean over the 81 pre-2024 benchmarks). Bars are colored by category: coral = anti-correlated, amber = orthogonal, blue = partial, sage = confirmatory. The two reverse-PLS variants are the only post-2024 additions with negative ρ to the established suite; the remaining 16 are partial, orthogonal, or confirmatory.

Of the 18 post-2024 benchmark additions, only the two MajajHong 2015 reverse-PLS variants are anti-correlated with HIER_old (mean ρ = -0.33). The other 16 additions reinforce, orthogonally extend, or confirm the established suite. The asymmetry implies that a benchmark family probing representational precision explores axes the suite does not already cover, while additional forward-prediction variants on new datasets largely duplicate existing signal.

2 Scoring metric is the largest single predictor of pairwise benchmark agreement

For each pair of benchmarks we computed the Spearman rank correlation across models on cells where both benchmarks have a SCORED outcome (pairwise-complete protocol). Partitioning all pairs along five same-axis indicators (shared scoring metric, acquisition type, species, region, and family) shows scoring metric as the largest single predictor under partial-correlation control, only modestly above region and family; the five axes together account for about a quarter of pairwise variance.
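
The pairwise-complete protocol can be sketched as follows. This is a minimal sketch with hypothetical scores: `None` stands in for any cell that is not SCORED, tied scores receive average ranks, and the minimum of three shared models is an arbitrary choice made here, not a value from the analysis.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None  # undefined when one column is constant
    return cov / sqrt(vx * vy)

def pairwise_complete_spearman(col_a, col_b, min_models=3):
    """Spearman rho over models where both benchmarks are SCORED."""
    pairs = [(a, b) for a, b in zip(col_a, col_b)
             if a is not None and b is not None]
    if len(pairs) < min_models:
        return None
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical scores for six models on two benchmarks.
bench_a = [0.1, 0.4, None, 0.3, 0.9, 0.2]
bench_b = [0.2, 0.5, 0.7, None, 0.8, 0.1]
rho = pairwise_complete_spearman(bench_a, bench_b)
print(rho)  # four models are shared; rho is 0.8 up to float rounding
```

Because each pair of benchmarks is correlated on its own set of shared models, the sample size behind each correlation varies across pairs.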

+0.36

Mean pairwise r between benchmarks that share a scoring metric (PLS, ridge, RDM, error-consistency) but differ in acquisition modality.

+0.19

Mean pairwise r between benchmarks that share an acquisition modality (fMRI, electrophysiology, behavior) but differ in scoring metric.

Panel A: mean pairwise r when two benchmarks share each metadata axis (green) versus differ on it (coral). Panel B: multiple-regression coefficients of pairwise r on the five same-axis indicators jointly.

To isolate each axis's marginal contribution, we regressed pairwise r jointly on all five same-axis indicators. The joint R² is 0.25, so 75% of pairwise variance is not explained by these five metadata axes. Pairs that match on every axis still vary in agreement because of specific stimulus, model, and scoring-implementation choices the metadata doesn't see. The coefficients below describe what the metadata does explain.

Metric: +0.19
Species: +0.09
Region: +0.11
Acquisition: -0.01
Benchmark Family: +0.10

The Metric coefficient (+0.19) is the largest of the five same-axis indicators, but only modestly larger than Region (+0.11), Benchmark Family (+0.10), and Species (+0.09). Acquisition type contributes essentially nothing (-0.01). Together the five axes explain R² = 0.25 of pairwise-agreement variance; most of the variance lives in stimulus, model, and scoring-implementation choices the regression doesn't see.
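
The joint regression on same-axis indicators can be sketched with ordinary least squares. This is a sketch on synthetic, noise-free data with only two of the five indicators: the intercept of 0.10 is invented, the slopes 0.19 and 0.11 are borrowed from the reported coefficients purely as the synthetic ground truth, and the normal-equations solver stands in for whatever regression routine the analysis used.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols(rows, y):
    """Least-squares [intercept, b1, b2, ...] via the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    p = len(x[0])
    xtx = [[sum(xi[j] * xi[k] for xi in x) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(x, y)) for j in range(p)]
    return solve(xtx, xty)

# One row per benchmark pair: (same_metric, same_region) as 0/1 indicators.
pairs = [(0, 0), (1, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
# Synthetic agreement values, built so the true coefficients are known.
pairwise_r = [0.10 + 0.19 * m + 0.11 * r for m, r in pairs]

intercept, b_metric, b_region = ols(pairs, pairwise_r)
print(intercept, b_metric, b_region)  # close to 0.10, 0.19, 0.11
```

Each coefficient is the change in expected pairwise r when a pair shares that axis with the other indicators held fixed, which is why a joint coefficient can differ from the raw within-axis mean.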

A subtlety: the raw within-axis pairwise r is slightly higher for Benchmark Family (mean 0.50) than for Metric (mean 0.49). Metric emerges as the largest unique contributor only after family is held fixed in the joint regression. The headline therefore says "largest single predictor", not "metric beats family". Within V1 and behavior, where same-metric pairs are nearly always same-family, the two contributions cannot be cleanly separated from this regression alone.

The same comparison broken out as a 2 × 2 (scoring metric shared or not, crossed with modality shared or not). Pairs sharing a metric agree more strongly than pairs sharing a modality, regardless of whether modality is held fixed.

Mechanism, and where the effect is not identifiable

A scoring metric is an algorithm that compresses the comparison between model activations and brain data into a single number. PLS rewards rank-k linear predictability, ridge rewards full-rank linear predictability under an L2 penalty, RDMs reward similarity-structure preservation, and error-consistency rewards matching trial-by-trial behavior. Choices at this compression step propagate into how rankings line up.

Stratifying by region shows where the metric effect is identifiable and where it isn't:

V4, IT: same-metric pairs span multiple families (0% of same-metric pairs are also same-family). β_metric is unchanged when family is controlled (V4: 0.46 → 0.46; IT: 0.39 → 0.39).
V1: 99% of same-metric pairs are also same-family. β_metric drops 0.45 → 0.18 (60% reduction) once family is controlled.
Behavior: every same-metric pair is also same-family. β_metric drops 0.23 → 0.15 (36% reduction).

The single Metric coefficient (+0.19) averages over this asymmetry: the metric effect is robust at V4/IT but partially confounded with family at V1 and behavior.

3 Consequences of the NEVER → 0 convention

Under the conventional Brain-Score HIER aggregation, the median structural-cohort score dropped by 0.066 (95% bootstrap CI 0.058–0.077) between the 2022 and 2026 suite states, with 77% (69%–86%) of cohort members losing more than 0.02 in absolute score. The drop is not a coverage artifact: the 2022-era top performers are still scored on today's full suite, and they are surpassed on the post-2024 benchmarks by more recent submissions. Replacing HIER with an aggregation that excludes never-attempted leaves (HIER_attempted) re-quartiles 19% (13%–27%) of the ranking cohort (κ = 0.75, 95% CI [0.64, 0.83]; median non-zero shift = 9 positions).

The conventional aggregation (HIER) treats a cell the model never attempted (NEVER) the same as a cell where the model was evaluated but the computation produced no score (FAILED): both contribute zero to the parent average. Roughly 33% of cells in the live model × benchmark matrix are NEVER and roughly 10% are FAILED, so a substantial fraction of every aggregate is structural zeros rather than measured scores. Each new benchmark added to the suite raises the share of NEVER cells among prior submissions that are not re-run on it. We document the consequence two ways: at the present-day snapshot, and as a trajectory across the leaderboard's recent history.

3a · Snapshot view: 19% of submissions land in a different quartile

An alternative aggregation, HIER_attempted, applies the same recursive mean but excludes NEVER leaves from both numerator and denominator. The two conventions agree by construction when coverage is uniform; they diverge when coverage is structured. The 18 post-2024 benchmark additions are coverage-structured: of the 253 models in the ranking cohort, 51% have all 18 marked NEVER. Replacing HIER with HIER_attempted re-quartiles 19% of the cohort (48 of 253 models; Cohen's κ = 0.75 on quartile labels, ρ = 0.96 on continuous rank). The maximum absolute rank shift is 73 positions; the median non-zero shift is 9 positions.
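
The reported κ can be checked against the 48-of-253 count with a short calculation. This is a consistency check under two assumptions that are not stated in the text: quartiles are as close to equal-sized as 253 models allow (63, 63, 63, 64), and the 48 changes leave quartile sizes unchanged. The actual pattern of moves between quartiles is not given, so the swaps below are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelings of the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Quartile labels under HIER for 253 models (assumed sizes 63/63/63/64).
sizes = {1: 63, 2: 63, 3: 63, 4: 64}
hier_q = [q for q, n in sizes.items() for _ in range(n)]

# Hypothetical relabeling under HIER_attempted: 24 models move Q1 -> Q2
# and 24 move Q2 -> Q1, giving 48 changes with unchanged quartile sizes.
attempted_q = list(hier_q)
for i in range(24):
    attempted_q[i] = 2        # indices 0..62 hold Q1
    attempted_q[63 + i] = 1   # indices 63..125 hold Q2

changed = sum(a != b for a, b in zip(hier_q, attempted_q))
kappa = cohens_kappa(hier_q, attempted_q)
print(changed, round(kappa, 2))  # 48 0.75
```

With chance agreement near 0.25 for four equal classes, 205 of 253 matching labels gives κ ≈ (0.81 − 0.25) / 0.75 ≈ 0.75, in line with the reported value.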

19%

of the ranking cohort changes quartile when NEVER cells are excluded from the parent average (48 of 253).

51%

of the ranking cohort has all 18 post-2024 benchmarks marked NEVER. Conventional Brain-Score HIER assigns those cells a value of zero.

Conventional Brain-Score HIER score versus HIER_attempted score for the 253 ranking-cohort models. Off-diagonal distance is the magnitude of the convention-induced shift.

Mechanism

Adding a new benchmark to the HIER tree grows the relevant parent's denominator from k to k + 1. For models not yet evaluated on the new benchmark, the convention contributes a structural zero to the numerator. The model's published score therefore declines with each addition the model has not run, even when its performance on the prior 81 benchmarks is unchanged. Coverage, which varies systematically with submission date, is being averaged into the HIER tree as if it were capability.
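
The arithmetic of that dilution is easy to sketch for a single flat parent. This is a simplification: the real HIER tree is nested, so the exact factor depends on where the new leaves sit, and the starting mean of 0.44 is a hypothetical value.

```python
def hier_after_additions(mean_on_run, k_run, m_never):
    """Flat-parent mean under NEVER -> 0.

    k_run leaves carry the model's measured mean; m_never added leaves
    each contribute a structural zero to the numerator.
    """
    return (mean_on_run * k_run + 0.0 * m_never) / (k_run + m_never)

# Hypothetical model averaging 0.44 on the 81 benchmarks it has run.
for m in (0, 6, 12, 18):
    print(m, round(hier_after_additions(0.44, 81, m), 3))

final = hier_after_additions(0.44, 81, 18)  # 0.44 * 81 / 99, about 0.36
```

In this flat case the published score is the measured mean multiplied by k / (k + m), so 18 unrun additions on top of 81 scale it by 81/99 without any change in measured performance.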

3b · Temporal view: scores fall as the suite expands

This view fixes the aggregation (HIER) and asks where today's top models would have ranked in earlier suite states. For each historical era we prune the HIER tree to leaves that existed by that era and recompute HIER (NEVER → 0) on each cohort member. The reverse direction is not symmetric: models that were top-ranked earlier often lack scores on benchmarks added afterwards (never re-run, deprecated wrappers), so we report only today's models sent backwards.
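
The pruning step can be sketched as follows. The tree, the benchmark names, the years each leaf was added, and the scores are all hypothetical; `None` stands in for a FAILED or NEVER cell, and both count as zero as in conventional HIER.

```python
from statistics import mean

def prune(tree, era, added_in):
    """Copy of the tree keeping only leaves whose benchmark existed by `era`."""
    kept = {}
    for name, child in tree.items():
        if isinstance(child, dict):
            sub = prune(child, era, added_in)
            if sub:                  # drop subtrees left with no leaves
                kept[name] = sub
        elif added_in[name] <= era:
            kept[name] = child
    return kept

def hier(node):
    """Recursive equal-weight mean; None (FAILED or NEVER) counts as zero."""
    if isinstance(node, dict):
        return mean([hier(child) for child in node.values()])
    return 0.0 if node is None else node

# Hypothetical model and suite history.
tree = {
    "V1": {"old_v1": 0.5, "new_v1": 0.25},
    "behavior": {"old_beh": 0.5, "new_beh": None},
}
added_in = {"old_v1": 2022, "new_v1": 2024, "old_beh": 2022, "new_beh": 2026}

scores_by_era = {era: hier(prune(tree, era, added_in))
                 for era in (2022, 2024, 2026)}
print(scores_by_era)  # {2022: 0.5, 2024: 0.4375, 2026: 0.3125}
```

The toy model loses score twice: in 2024 from a newly added benchmark on which it scores below its earlier mean, and in 2026 from one it has no score on. An era with no surviving leaves would leave nothing to average and is not handled here.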

We use the 105-model structural cohort (Section 0) and four historical eras (2022, 2023, 2024, and the present) spanning the suite's four largest expansions: 11 (2021-Q1) → 34 (2022) → 51 (2023) → 81 (2024) → 99 (2026) benchmarks. Within each era, ranks are taken among the 105 cohort members.

Each row is one of the 105 structural-cohort models, sorted by today's HIER (top = today's top). Columns are calendar years. Each cell is the model's conventional Brain-Score HIER on the era's pruned benchmark suite, that is, the Brain-Score Vision score it would have had if submitted at that date (dark green = higher; coral = lower). Black outlines mark the top-5 models per era, restricted to models that had been first-scored by that era (so a 2025-submitted model cannot appear as a "top of 2022" outline even if its 2022-pruned score is high). Today's top-5 form a contiguous block at the top of the 2026 column; earlier eras' top-5 are scattered across the row order, showing which models actually led the leaderboard at those dates.

81 / 105

Cohort members whose HIER fell by more than 0.02 between 2022 and today (77%).

8 / 105

Cohort members whose HIER gained by more than 0.02 (8%).

-0.066

Change in median HIER on the structural cohort between the 2022 snapshot (0.442) and today (0.376).

Era-restricted top-5 outlines mark the top performers at each date, restricted to models that had been first-scored by that era (26 of 105 cohort models by 2022, 29 by 2023, 37 by 2024). Of today's top-5, 0 were in 2022's top-5, 0 in 2023's, and 0 in 2024's. The 2022 leaders (variants of `effnetb1_cutmix*`) are still in the cohort but rank in the 40s–70s today, displaced by ViT-Large CLIP and ConvNeXt-Large submissions that appeared later. Within-cohort rank reshuffles substantially over four years: 32 models climb more than 10 percentile points and 29 fall by the same margin.

4 A single-threshold filter on the §1+§2 subset

Section 1's PC1 axis says which benchmarks separate models best; Section 2 says metric diversity adds independent signal per benchmark. To build a small predictive panel, walk down the |PC1 loading| ranking, round-robin across the 8 scoring-metric types (behavioral, error-consistency, ridge, PLS, reverse-PLS, RDM, V1 physiologic, OST): every type contributes one benchmark before any type repeats. The first 15 picks are the panel.
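
One way to read that selection rule is as repeated passes down the |PC1| ranking, each pass taking the highest-ranked unused benchmark of every metric type. The sketch below follows that reading, which is an interpretation rather than the authors' code. The candidate list contains only the 15 eventual panel members, with values as reported in the panel list of this section, so it checks the ordering logic but not which of the 99 benchmarks get excluded.

```python
def round_robin_panel(candidates, panel_size):
    """candidates: (name, abs_pc1_loading, metric_type) tuples.

    Each pass walks the |PC1| ranking and takes at most one benchmark
    per metric type; passes repeat until the panel is full.
    """
    remaining = sorted(candidates, key=lambda c: c[1], reverse=True)
    panel = []
    while remaining and len(panel) < panel_size:
        used_types, leftover = set(), []
        for cand in remaining:
            if cand[2] not in used_types and len(panel) < panel_size:
                panel.append(cand)
                used_types.add(cand[2])
            else:
                leftover.append(cand)
        remaining = leftover
    return panel

CANDIDATES = [
    ("Rajalingham2018-i2n", 0.858, "behavioral"),
    ("Geirhos2021silhouette-error_consistency", 0.851, "error_consistency"),
    ("Allen2022_fmri_surface.V2-ridge", 0.834, "ridge"),
    ("Sanghavi2020.IT-pls", 0.825, "pls"),
    ("MajajHong2015public.IT-reverse_pls", 0.630, "reverse_pls"),
    ("tong.Coggan2024_fMRI.V1-rdm", 0.613, "rdm"),
    ("Marques2020_FreemanZiemba2013-abs_texture_modulation_index", 0.488, "physiologic"),
    ("Kar2019-ost", 0.180, "ost"),
    ("Geirhos2021eidolonI-error_consistency", 0.833, "error_consistency"),
    ("Allen2022_fmri_surface.V4-ridge", 0.819, "ridge"),
    ("MajajHong2015.IT-pls", 0.771, "pls"),
    ("tong.Coggan2024_behavior-ConditionWiseAccuracySimilarity", 0.731, "behavioral"),
    ("Bracci2019.anteriorVTC-rdm", 0.556, "rdm"),
    ("MajajHong2015public.V4-reverse_pls", 0.385, "reverse_pls"),
    ("Marques2020_FreemanZiemba2013-max_texture", 0.375, "physiologic"),
]

panel = round_robin_panel(CANDIDATES, panel_size=15)
for name, loading, metric in panel:
    print(f"{name}: |PC1| = {loading:.3f}, metric = {metric}")
```

Under this reading the first pass yields eight benchmarks, one per metric type, and the second pass adds seven more because OST has only one candidate in this list. The output order matches the order of the published panel list.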

The panel score is the recursive HIER mean across those 15 benchmarks (NEVER → 0). We calibrate a single threshold T as the minimum panel score among Q1+Q2 models in the 56-model pre-2024 training cohort. No parameter touches the holdout; the 197-model 2024+ holdout is used only to measure how the threshold generalizes.
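
The calibration reduces to one `min` over the training cohort. In this sketch the panel scores and quartile labels are hypothetical, and treating a score exactly equal to T as retained is an assumption, chosen so that every training Q1+Q2 model passes its own threshold.

```python
def calibrate_threshold(panel_scores, quartiles):
    """T = lowest panel score among training models in true Q1 or Q2."""
    return min(s for s, q in zip(panel_scores, quartiles) if q in (1, 2))

def passes_filter(panel_score, threshold):
    # Assumption: ties with the threshold are retained.
    return panel_score >= threshold

# Hypothetical training cohort (panel score, true quartile).
train_scores = [0.41, 0.35, 0.28, 0.22, 0.25, 0.15, 0.09]
train_quartiles = [1, 1, 2, 2, 3, 3, 4]
T = calibrate_threshold(train_scores, train_quartiles)

# Hypothetical holdout models, scored only after T is fixed.
holdout_scores = [0.38, 0.21, 0.30, 0.12]
kept = [passes_filter(s, T) for s in holdout_scores]
print(T, kept)  # 0.22 [True, False, True, False]
```

The rule guarantees full retention of the top two quartiles on the training cohort only. The training Q3 model at 0.25 also passes, which illustrates why specificity comes out lower than sensitivity under this choice of threshold.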

Holdout panel scores by true quartile (canonical-HIER quartile on the full 99 leaves). Each point is one of the 197 2024+ holdout models; horizontal jitter is for visibility. The dashed line is the threshold T = 0.199, calibrated from the pre-2024 cohort alone. Models above the line reach full canonical scoring; models below are filtered.

Holdout result

Per-quartile retention (fraction of each true quartile retained above T):

Q1: 100.0% retained (41/41), 95% bootstrap CI [100.0, 100.0].
Q2: 95.3% retained (41/43), 95% CI [88.0, 100.0].
Q3: 55.8% retained (29/52), 95% CI [43.1, 70.2].
Q4: 8.2% retained (5/61), 95% CI [1.8, 16.1].

Treating Q1+Q2 as the positive class (top tier) and Q3+Q4 as the negative class:

Sensitivity (P[retained | true Q1+Q2]) = 97.6%, 95% CI [93.4, 100.0].
Specificity (P[filtered | true Q3+Q4]) = 69.9%, 95% CI [60.9, 77.8].
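
Both rates follow directly from the per-quartile counts above. The sketch below only redoes that arithmetic; it does not reproduce the bootstrap intervals.

```python
# (retained, total) per true quartile on the 2024+ holdout, as reported above.
counts = {"Q1": (41, 41), "Q2": (41, 43), "Q3": (29, 52), "Q4": (5, 61)}

def pooled(quartiles):
    """Sum retained and total counts over a group of quartiles."""
    retained = sum(counts[q][0] for q in quartiles)
    total = sum(counts[q][1] for q in quartiles)
    return retained, total

pos_retained, pos_total = pooled(["Q1", "Q2"])   # positive class: top tier
neg_retained, neg_total = pooled(["Q3", "Q4"])   # negative class

sensitivity = pos_retained / pos_total                # 82 / 84
specificity = (neg_total - neg_retained) / neg_total  # 79 / 113
holdout_size = pos_total + neg_total                  # 197

print(round(100 * sensitivity, 1), round(100 * specificity, 1))  # 97.6 69.9
```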

All Q1 holdout models are retained. Two of 43 Q2 models (4.7%) are misclassified as below-threshold; this 5% Q2 leakage is real cohort drift between pre-2024 and 2024+ submissions on the |PC1|+metric panel, not bootstrap noise. (Both misclassified Q2 models land in true-Q3 panel-score territory, not Q4: their 15-leaf panel scores rank above ~90% of true-Q4 models. The failure mode is one-quartile-off boundary slippage, not pathological misranking.)

The 15 benchmarks

Rajalingham2018-i2n: |PC1| = 0.858, scoring metric = behavioral
Geirhos2021silhouette-error_consistency: |PC1| = 0.851, scoring metric = error_consistency
Allen2022_fmri_surface.V2-ridge: |PC1| = 0.834, scoring metric = ridge
Sanghavi2020.IT-pls: |PC1| = 0.825, scoring metric = pls
MajajHong2015public.IT-reverse_pls: |PC1| = 0.630, scoring metric = reverse_pls
tong.Coggan2024_fMRI.V1-rdm: |PC1| = 0.613, scoring metric = rdm
Marques2020_FreemanZiemba2013-abs_texture_modulation_index: |PC1| = 0.488, scoring metric = physiologic
Kar2019-ost: |PC1| = 0.180, scoring metric = ost
Geirhos2021eidolonI-error_consistency: |PC1| = 0.833, scoring metric = error_consistency
Allen2022_fmri_surface.V4-ridge: |PC1| = 0.819, scoring metric = ridge
MajajHong2015.IT-pls: |PC1| = 0.771, scoring metric = pls
tong.Coggan2024_behavior-ConditionWiseAccuracySimilarity: |PC1| = 0.731, scoring metric = behavioral
Bracci2019.anteriorVTC-rdm: |PC1| = 0.556, scoring metric = rdm
MajajHong2015public.V4-reverse_pls: |PC1| = 0.385, scoring metric = reverse_pls
Marques2020_FreemanZiemba2013-max_texture: |PC1| = 0.375, scoring metric = physiologic

Three benchmarks recover the leaderboard order

The chart below tracks two agreement metrics as the subset grows: Cohen's κ on quartile labels (blue) and Spearman ρ on the panel score itself (coral). ρ reaches 0.93 at N = 3, leaving at most 0.07 to gain from the remaining 96 benchmarks (ρ = 1 at N = 99 by construction). The first three benchmarks under the |PC1|+metric ordering are:

Rajalingham2018-i2n (behavioral image-level consistency, |PC1| = 0.86)
Geirhos2021silhouette-error_consistency (behavioral error pattern on silhouettes, |PC1| = 0.85)
Allen2022_fmri_surface.V2-ridge (V2 fMRI ridge fit, |PC1| = 0.83)

These three are also the §1 PC1 high-loaders, one per scoring-metric type. Scoring just these three on a new submission already places it within Spearman ρ ≈ 0.93 of where it would land on the full 99-leaf suite. Quartile-label agreement (κ) is noisier: it plateaus near 0.70 from N = 3 to N ≈ 50 because boundary models flip assignment depending on small panel-composition shifts, then climbs to 1.0 at N = 99 by construction. Quartile labels discard within-quartile rank, so they are sensitive to any model sitting near a quartile boundary; ρ is not.

Subset-vs-reference agreement under the |PC1|+metric ordering. Reference is the recursive HIER mean over all 99 leaves. Shaded bands are bootstrap 95% CIs.

At least two panel benchmarks, the MajajHong2015public IT and V4 reverse_pls variants, are among the 18 post-2024 additions (§1), on which roughly half the ranking cohort has NEVER cells (§3). For those models the panel score gets a structural zero from each such leaf; this is the same NEVER → 0 bias §3 critiques. The bias is bounded (a small minority of the 15 panel benchmarks) but pulls panel scores downward for sparse-coverage models in the same direction §3 describes.

Recommendations for a Brain-Score 3.0 leaderboard

Distinguish SCORED, FAILED, and NEVER in the public interface. The database already records the three states. The leaderboard currently displays "0" for both FAILED and NEVER. Surfacing the distinction lets readers separate capability gaps from coverage gaps.
Offer HIER_attempted as a parallel aggregation. Same recursive tree, same per-benchmark weights; NEVER cells are excluded from the parent average rather than contributed as zero. Conventional Brain-Score HIER and HIER_attempted differ at κ = 0.75 on quartile labels. Historical comparisons under HIER are confounded with subsequent benchmark additions; comparisons under HIER_attempted are not.
Prioritize precision-side benchmark expansions. Of the 18 post-2024 additions, only the two reverse-PLS variants are anti-correlated with HIER_old. The next-most-informative additions are benchmark families that probe representational precision (reverse-PLS variants on additional datasets, RDM-on-residuals after PLS reconstruction, or single-unit-tuning fidelity metrics) rather than further forward-prediction variants on new datasets.
Use the 15-benchmark |PC1|+metric panel as a top-tier filter (§4). Calibrated on the pre-2024 cohort (T = minimum Q1+Q2 panel score, train-only, no parameter fit to the holdout). On the 2024+ holdout: sensitivity 97.6% (95% CI [93.4, 100.0]), specificity 69.9% (95% CI [60.9, 77.8]); 100% of Q1 and 95% of Q2 above threshold. The full 99-benchmark suite remains the canonical aggregation.
Surface scoring metric as a reporting axis alongside cortical region. The HIER tree is organized anatomically, but the data's pairwise structure is organized by metric (Section 2). Reporting per-metric strata within each region node would let readers see when a model is metric-fragile within a single anatomical level, which the current tree hides.
Expose a snapshot view alongside the conventional Brain-Score leaderboard (in progress). The wayback construction (§3b) lets a model be evaluated against the leaderboard as it existed at any prior date. This supports historical-progress claims and makes the timing component of any apparent rank improvement explicit. The dataset already carries the timestamps needed, and a historical-wayback prototype is under development.

Summary

The leaderboard's published rankings depend as much on aggregation choices as on the brain data being averaged. Different scoring metrics produce different rankings; the NEVER → 0 convention re-quartiles 19% of the cohort; and the principal axis of model variation cuts across the HIER tree's anatomical organization. The recommendations above make those choices visible in the public interface, and the §4 panel filter keeps the canonical scoring tractable as the suite grows.