
RankedAGI

AI models ranked by latest benchmarks

Simulated Data Engine and Methodology Page
29 May 2026
Rebuilt the simulated benchmark data estimator into a more accurate hybrid model, validated it with cross-validation, and published a public /engine page that explains how RankedAGI scores are built.

Estimator

  • Diagnosed the previous estimator's frontier under-prediction, where a top model could be dragged toward weaker neighbours (for example, Claude Mythos was estimated at 52% on DeepSWE when it should sit near the 70% frontier).
  • Added a masked cross-validation harness (scripts/sim-eval) that hides known results, predicts them, and measures accuracy across several patterns, including the hard case of recovering a benchmark's top scorer.
  • Built a global factor model (matrix completion: a per-model strength term plus learned latent interactions) as the v2 estimator.
  • Shipped a hybrid estimator that uses the local nearest-models method as the base and the global model as a frontier floor, a coverage backstop, and a guard against catastrophic errors.
  • Validated that the hybrid beats the previous estimator on every cross-validation slice while covering more of the grid, and confirmed the headline fixes on DeepSWE (Claude Mythos from 52% to 65%, GPT-5.5 Pro from 51% to 59%).
  • Wired the hybrid into production and regenerated the simulated-data sidecar (now version 3).
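
The hybrid combination and the masked evaluation described in the list above can be sketched roughly as follows. This is a minimal illustration, not the production code. The names `hybrid_estimate` and `masked_mae`, the `max_gap` threshold, the exact combination rules, and the leave-one-out masking pattern are all assumptions; the changelog only states that the local method is the base, that the global model acts as a frontier floor, a coverage backstop, and a guard against catastrophic errors, and that the harness hides known results and predicts them.

```python
from typing import Callable, Dict, Optional, Tuple

Cell = Tuple[str, str]  # (model, benchmark); a hypothetical key shape


def hybrid_estimate(
    local: Optional[float],
    global_: Optional[float],
    is_frontier: bool,
    max_gap: float = 0.25,  # assumed threshold, not taken from the source
) -> Optional[float]:
    """Combine a local nearest-models estimate with a global-model estimate."""
    # Coverage backstop: fall back to whichever estimate exists.
    if local is None:
        return global_
    if global_ is None:
        return local
    # Frontier floor: a top model may not drop below the global estimate.
    if is_frontier:
        return max(local, global_)
    # Catastrophic-error guard: clamp the local value to within
    # max_gap of the global value when the two disagree strongly.
    if abs(local - global_) > max_gap:
        return global_ + max_gap if local > global_ else global_ - max_gap
    return local


def masked_mae(
    known: Dict[Cell, float],
    predict: Callable[[Cell, Dict[Cell, float]], Optional[float]],
) -> float:
    """Hide each known cell in turn, predict it from the remaining cells,
    and return the mean absolute error over the cells that got a prediction."""
    errors = []
    for cell, truth in known.items():
        visible = {k: v for k, v in known.items() if k != cell}
        guess = predict(cell, visible)
        if guess is not None:
            errors.append(abs(guess - truth))
    if not errors:
        raise ValueError("no masked cell could be predicted")
    return sum(errors) / len(errors)
```

With illustrative inputs, `hybrid_estimate(0.40, 0.60, is_frontier=True)` returns 0.60, because the frontier floor lifts the local estimate up to the global one. A real harness would presumably add other masking patterns, such as hiding a benchmark's top scorer.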

Public /engine page

  • Added a new page at /engine ("The RankedAGI Engine") documenting the composite benchmarks and the simulated data that powers them, written for both humans and AI agents.
  • Built visuals from live data: a coverage waffle showing the share of the grid that is benchmarked versus simulated, per-composite reliance bars, and a scoring pipeline with a subtle motion-safe SVG animation.
  • Added a collapsible "Details for nerds" section with the full algorithm, the composite-score formula, and the validation approach.
  • Framed values as "benchmarked" versus "simulated" so estimates never read as fake, and removed em dashes from the page copy.
  • Added /engine to the footer, the sitemap, and llms.txt for discovery.
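
As a rough illustration of the numbers behind a coverage waffle, the share of benchmarked versus simulated cells could be computed as below. The function name `coverage_shares` and the per-cell labels are hypothetical; the actual data format used by the page is not described in this changelog.

```python
from collections import Counter
from typing import Dict, Iterable


def coverage_shares(cell_kinds: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of grid cells that are benchmarked and simulated.

    cell_kinds holds one label per (model, benchmark) cell. Labels other
    than "benchmarked" and "simulated" (for example, empty cells) count
    toward the total but are not reported.
    """
    counts = Counter(cell_kinds)
    total = sum(counts.values())
    if total == 0:
        return {"benchmarked": 0.0, "simulated": 0.0}
    return {
        kind: counts.get(kind, 0) / total
        for kind in ("benchmarked", "simulated")
    }
```

For a four-cell grid with two cells of each kind, both shares come out as 0.5.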

Docs and methodology split

  • Rewrote docs/ragi-simulated-data.md to describe the hybrid system as a living document, including an improvement backlog (bimodal benchmarks, sparse frontier models).
  • Moved the composite-scoring and simulated-data methodology off /sources, which now focuses on data provenance, and pointed it to /engine.

Other

  • Added an integer formatter so integer-format benchmark columns (such as Elo-style ratings) render as rounded numbers in the table.
  • Removed the "Made by Tavlean" item from the footer.
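
The integer formatter mentioned above might look something like this minimal sketch. The name `format_integer` is hypothetical, and the site's real formatter may differ (for instance in how it handles missing values or rounding ties).

```python
def format_integer(value: float) -> str:
    """Render a numeric score, such as an Elo-style rating, as a rounded integer string."""
    # Python's round() sends exact .5 ties to the nearest even integer;
    # a half-up rule would need different handling.
    return str(round(value))
```

For example, `format_integer(1487.6)` returns `"1488"`.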