RankedAGI
AI models ranked by latest benchmarks
Simulated Data Engine and Methodology Page
29 May 2026
Rebuilt the simulated benchmark data estimator into a more accurate hybrid model, validated it with cross-validation, and published a public /engine page that explains how RankedAGI scores are built.
Estimator
- Diagnosed the previous estimator's frontier under-prediction, where a top model could be dragged toward weaker neighbours (for example, Claude Mythos was estimated at 52% on DeepSWE when it should sit near the 70% frontier).
- Added a masked cross-validation harness (scripts/sim-eval) that hides known results, predicts them, and measures accuracy across several patterns, including the hard case of recovering a benchmark's top scorer.
- Built a global factor model (matrix completion: a per-model strength term plus learned latent interactions) as the v2 estimator.
- Shipped a hybrid estimator that uses the local nearest-models method as the base and the global model as a frontier floor, a coverage backstop, and a guard against catastrophic errors.
- Validated that the hybrid beats the previous estimator on every cross-validation slice while covering more of the grid, and confirmed the headline fixes on DeepSWE (Claude Mythos from 52% to 65%, GPT-5.5 Pro from 51% to 59%).
- Wired the hybrid into production and regenerated the simulated-data sidecar (now version 3).
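
The three roles the hybrid gives the global model (frontier floor, coverage backstop, catastrophic-error guard) can be sketched as below. This is a minimal illustration, not the production code: the function name `hybrid_estimate`, the `is_frontier` flag, the `max_gap` threshold, and the 0 to 1 score scale are all assumptions made for the example.

```python
from typing import Optional


def hybrid_estimate(
    local: Optional[float],
    global_: Optional[float],
    is_frontier: bool,
    max_gap: float = 0.25,  # hypothetical threshold, not a published value
) -> Optional[float]:
    """Combine a local nearest-models estimate with a global-model estimate.

    Scores are assumed to lie in [0, 1]. None means "no estimate available".
    """
    # Coverage backstop: if one estimator has nothing, use the other.
    if local is None:
        return global_
    if global_ is None:
        return local

    # The local method is the base.
    estimate = local

    # Frontier floor: a top model is never scored below the global estimate,
    # so weaker neighbours cannot drag it down.
    if is_frontier:
        estimate = max(estimate, global_)

    # Catastrophic-error guard: keep the result within max_gap of the
    # global estimate in either direction.
    estimate = min(max(estimate, global_ - max_gap), global_ + max_gap)

    # Stay on the assumed [0, 1] scale.
    return min(max(estimate, 0.0), 1.0)
```

With illustrative inputs, `hybrid_estimate(0.50, 0.68, is_frontier=True)` returns 0.68 because the floor applies, while the same call with `is_frontier=False` keeps the local 0.50 because the two estimates are within `max_gap` of each other.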
Public /engine page
- Added a new page at /engine ("The RankedAGI Engine") documenting the composite benchmarks and the simulated data that powers them, written for both humans and AI agents.
- Built visuals from live data: a coverage waffle showing the share of the grid that is benchmarked versus simulated, per-composite reliance bars, and a scoring pipeline with a subtle motion-safe SVG animation.
- Added a collapsible "Details for nerds" section with the full algorithm, the composite-score formula, and the validation approach.
- Framed values as "benchmarked" versus "simulated" so estimates never read as fake, and removed em dashes from the page copy.
- Added /engine to the footer, the sitemap, and llms.txt for discovery.
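
A coverage waffle of the kind described above needs whole cells that still add up to the full grid. The sketch below shows one way to do that with largest-remainder rounding. The function name `waffle_cells`, the 100-cell grid, and the category labels are assumptions for illustration; the page may allocate cells differently.

```python
from typing import Dict


def waffle_cells(counts: Dict[str, int], cells: int = 100) -> Dict[str, int]:
    """Split a fixed number of waffle cells across categories by share.

    Uses largest-remainder rounding so the cells always sum to `cells`.
    """
    total = sum(counts.values())
    if total == 0:
        return {name: 0 for name in counts}

    # Exact (fractional) cell count for each category.
    exact = {name: n * cells / total for name, n in counts.items()}

    # Start from the rounded-down counts.
    allocated = {name: int(value) for name, value in exact.items()}

    # Hand the leftover cells to the largest fractional remainders.
    leftover = cells - sum(allocated.values())
    by_remainder = sorted(
        counts, key=lambda name: exact[name] - allocated[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        allocated[name] += 1

    return allocated
```

For example, a grid with one benchmarked entry for every two simulated ones yields 33 benchmarked cells and 67 simulated cells.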
Docs and methodology split
- Rewrote docs/ragi-simulated-data.md to describe the hybrid system as a living document, including an improvement backlog (bimodal benchmarks, sparse frontier models).
- Moved the composite-scoring and simulated-data methodology off /sources, which now focuses on data provenance, and pointed it to /engine.
Other
- Added an integer formatter so integer-format benchmark columns (such as Elo-style ratings) render as rounded numbers in the table.
- Removed the "Made by Tavlean" item from the footer.
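
The integer formatter mentioned in this list can be pictured with a short sketch. The name `format_integer`, the empty string for missing values, and the round-half-up rule are assumptions for illustration, not the site's actual implementation.

```python
import math
from typing import Optional


def format_integer(value: Optional[float]) -> str:
    """Render a numeric benchmark value (such as an Elo-style rating)
    as a rounded whole number for table display."""
    if value is None:
        return ""  # assumed placeholder for a missing cell
    # Round half up. Python's built-in round() rounds halves to the
    # nearest even integer, which can look surprising in a table.
    return str(math.floor(value + 0.5))
```

A rating of 1287.6 would render as 1288, and 1250.5 as 1251.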

