AgentRank

Building

AI track

AgentRank measures whether websites actually work for AI browser agents. A fixed agent attempts the same task on dozens of real sites, we rank the sites by measured success rate, and we test whether Google's new static "agent-readiness" audit predicts those real outcomes.

Looking for

data analysis/collection, front-end dev

About this project

Why it exists. AI agents are becoming real users of the web. They compare prices, look up policies, and pull facts off pages, and they fail in ways human visitors never would. Google recently shipped an "Agentic Browsing" category in Lighthouse 13.3, a static audit that scores how agent-ready a website is from signals like llms.txt, accessibility-tree quality, and layout stability. Site owners will inevitably optimize for that score. What nobody has established is whether a static audit like that reflects what happens when a real agent actually tries to use the site.

What it does. AgentRank answers that with behavioral ground truth: we rank real websites by how often an AI browser agent completes the same fixed task on them, then compare those measured rankings against the static audit scores.
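
One way to make "compare measured rankings against static audit scores" concrete is a rank correlation. The sketch below computes Spearman's rho in plain Python. The source does not say which statistic the project uses, so the choice of Spearman, the function names, and the sample numbers in the usage note are all illustrative assumptions.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)
```

For example, `spearman([0.9, 0.5, 0.1], [80, 60, 40])` pairs three made-up success rates with three made-up audit scores that sort the sites identically. A value near 1 would mean the audit orders sites the way the agent does, and a value near 0 would mean the audit says little about real outcomes.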

How it works:

A Gemini-powered browser agent attempts the same task on every site: start at the homepage, navigate to the primary product or service page, and extract one specific factual claim. The model, prompt, step budget, and timeout are all frozen, so the websites are the only variable.
The cohort spans payment platforms, developer tools, government portals, big-box retail, and single-location small businesses, with multiple trials per site.
Scoring is pre-registered. Success means the agent's answer contains an expected substring, with the answer keys frozen and committed before any runs happen. The expected answer never appears in the agent's prompt, so the only way to score is to actually browse.
Two conditions are measured per site: full autonomous navigation from the homepage, and a deep-link extraction control that separates "can the agent read the page" from "can the agent find the page."
Every trial stores a full step-by-step transcript, so any number on the leaderboard can be inspected down to the individual agent actions.
The leaderboard reports Wilson 95% confidence intervals with tie-aware ranking, so sites that are statistically indistinguishable share a rank instead of implying false precision.
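
The scoring and leaderboard steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the names `is_success`, `wilson_interval`, `SiteResult`, and `tie_aware_ranks` are hypothetical, case-insensitive matching is an assumption, and the tie rule used here (a site shares the current rank while its interval overlaps that of the rank's top site) is one plausible reading of "statistically indistinguishable."

```python
import math
from dataclasses import dataclass


def is_success(answer: str, expected_substring: str) -> bool:
    """Substring scoring. Case-insensitive matching is an assumption."""
    return expected_substring.casefold() in answer.casefold()


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the interval is the whole range
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


@dataclass
class SiteResult:
    site: str
    successes: int
    trials: int

    @property
    def rate(self) -> float:
        return self.successes / self.trials if self.trials else 0.0


def tie_aware_ranks(results: list[SiteResult]) -> dict[str, int]:
    """Assign ranks so that sites with overlapping intervals share a rank.

    Assumed rule: walk sites from best to worst success rate. A site starts
    a new rank only if its upper bound falls below the lower bound of the
    site that opened the current rank. New ranks skip ahead to the site's
    position (1, 1, 3 rather than 1, 1, 2).
    """
    ordered = sorted(results, key=lambda r: r.rate, reverse=True)
    ranks: dict[str, int] = {}
    leader_low = None
    current_rank = 0
    for position, result in enumerate(ordered, start=1):
        low, high = wilson_interval(result.successes, result.trials)
        if leader_low is None or high < leader_low:
            current_rank = position
            leader_low = low
        ranks[result.site] = current_rank
    return ranks
```

With ten trials per site, a site at 10/10 and a site at 9/10 have overlapping intervals and would share rank 1 under this rule, while a site at 1/10 would rank third. With so few trials the intervals are wide, which is why running multiple trials per site matters for separating sites.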

We are currently expanding the site cohort and running additional test waves before publishing the full analysis.

Stack: Python, Playwright, and Gemini 2.0 Flash for the agent harness; Lighthouse 13.3 for static scoring; Firebase Firestore for run storage; Next.js, TypeScript, Tailwind, and Recharts for the leaderboard and charts; Firebase Hosting for deploy.

Team

James OC

Lead