{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreig6en5sc4n2wvzydemw2d2oozu2o6jl33go7h23sggl3s77fow74a",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mi2ax6dwdez2"
},
"path": "/t/what-is-your-preferred-site-to-see-ai-scores-on-different-ai-tests/174698#post_2",
"publishedAt": "2026-03-27T13:29:13.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"leaderboards",
"Posts",
"Blog",
"Hub Models",
"Spaces",
"llm-stats.com",
"Reddit",
"LLM Stats",
"Artificial Analysis",
"Aurea",
"GitHub",
"LiveBench",
"99Franklin",
"DeepResearch Bench"
],
"textContent": "I think many people check leaderboards for that purpose.\n\nAlso, since the leaderboard essentially ranks models **based on benchmarks** , it isn’t particularly well-suited for **models specialized in narrow tasks** , so it’s safer to use other channels as well. (On HF, this includes Posts, Blog, Hub Models, Spaces, etc.)\n\n* * *\n\nThe public favorites are not one site. They cluster into **two general-purpose sites** and **several specialist leaderboards**.\n\nThe two names that come up most often are **Artificial Analysis** and **LLM Stats**. In a recent discussion asking “What LLM benchmarking sites do you use?”, one person explicitly said they use **Artificial Analysis** because it combines results from multiple benchmarks, while another commenter listed **LiveBench** , **Artificial Analysis** , **SWE-bench** , **ContextArena** , and others as their regular set. An AI news roundup also explicitly grouped **artificialanalysis.ai** and **llm-stats.com** together as recommended benchmarking resources. (Reddit)\n\n## The shortest answer\n\nIf you want **one site to browse first** , I would point you to **LLM Stats**.\nIf you want **one site to trust more for methodology** , I would point you to **Artificial Analysis**.\nIf you want **human preference rankings** , use **LM Arena**.\nIf you want **freshness and anti-contamination** , use **LiveBench**.\nIf you care about **OCR** or **research agents** , jump to **OCRBench v2** and **DeepResearch Bench / GAIA** instead of stopping at a general leaderboard. (LLM Stats)\n\n## What people seem to like most, and why\n\n### 1. Artificial Analysis\n\nThis is the closest thing to a **serious all-purpose benchmark dashboard**. Its homepage says it provides **independent analysis of AI** and compares models on **intelligence, speed, and price** , plus provider performance. 
Its methodology page says its **Artificial Analysis Intelligence Index v4.0.2** combines **10 evaluations** across reasoning, knowledge, math, and programming, and it explicitly says the suite is **text-only and English-only**, with image, speech, and multilingual performance benchmarked separately. That methodological clarity is a big reason people trust it more than random scoreboards. (Artificial Analysis)\n\nWhy people like it:\n\n * It is broad.\n * It explains how its composite score is built.\n * It includes practical dimensions like speed and price, not just benchmark scores. (Artificial Analysis)\n\nWhy it is not enough alone:\n\n * Its flagship Intelligence Index is **not** your OCR leaderboard.\n * It is **not** your deep-research-agent leaderboard.\n * It is a good overview, but it still has to be paired with specialist benchmarks for your exact task. (Artificial Analysis)\n\n### 2. LLM Stats\n\nThis is the strongest public match to your description of **“a sortable site where a person can sort any column on the web page.”** Its homepage says it is **updated Mar 27**, tracks **275 models**, and exposes a big leaderboard with columns like **Code**, **Arena**, **GPQA**, **SWE-bench**, **Context**, and input/output pricing. It also has separate areas for leaderboards, benchmarks, compare pages, arenas, and news. That makes it very good as a **fast scanning and comparison** site. (LLM Stats)\n\nWhy people like it:\n\n * It is dense and practical.\n * It is easy to scan.\n * It mixes benchmark scores with cost and context window info. (LLM Stats)\n\nWhy it is not enough alone:\n\n * It is more of an **aggregator and comparison hub** than a single deep evaluation methodology.\n * It is excellent for “what should I look at next?” and weaker for “which specialist benchmark is the right ground truth for OCR or research?” (LLM Stats)\n\n## The likely site used in those AICodeKing videos\n\nThe strongest public guess is **LLM Stats**. 
I cannot verify the exact video URL from here, but among current public sites, **LLM Stats** matches your description best because it has the **sortable many-column leaderboard layout** and a broad comparison surface. Artificial Analysis is also possible, but its interface is more analysis-centric than a “big sortable comparison grid.” (LLM Stats)\n\n## The other sites people keep using\n\n### 3. LM Arena\n\nLM Arena is different from the others. It is not mainly a static benchmark site. Its FAQ says user votes shape rankings through the **Bradley-Terry** rating system, and the models stay anonymous until after the vote. Business Insider’s 2025 interview with LM Arena’s CTO says the site had grown to **over 3 million monthly users** and supports rankings across text, coding, vision, and image-generation-related experiences. (Aurea)\n\nWhy people like it:\n\n * It reflects **human preference**, not just benchmark math.\n * It is good for “which answer feels better in practice?” (Aurea)\n\nWhy people distrust it:\n\n * Community discussions say it can reward verbosity or style rather than the exact capability you care about.\n * Even supporters tend to treat it as one signal, not the only one. (Reddit)\n\n### 4. LiveBench\n\nLiveBench is one of the main answers from people who care about **stale benchmarks** and **contamination**. In that recent community thread, one commenter included **LiveBench** among their go-to sites. The official site says it **updates questions regularly so the benchmark completely refreshes every 6 months**, while the project materials say questions are added and updated monthly and scored against **objective ground truth** rather than an LLM judge. (Reddit)\n\nWhy people like it:\n\n * It is fresher than old static leaderboards.\n * It is explicitly designed to limit contamination.\n * It covers broad categories like math, coding, reasoning, language, instruction following, and data analysis. 
(GitHub)\n\nWhy it is not enough alone:\n\n * It is a benchmark, not a giant comparison dashboard.\n * It is strong for “what still has signal,” weaker for “show me everything in one sortable interface.” (LiveBench)\n\n## For OCR and research, people should not stop at general leaderboards\n\nYour examples matter. You mentioned **OCR** and **researching**. Those are specialized enough that a general site can mislead you.\n\n### OCR: OCRBench v2\n\nOCRBench v2’s official site says it aims to **update every quarter**. Its paper says it is a **large-scale bilingual text-centric benchmark** with **4× more tasks** than the earlier OCRBench, **31 scenarios**, **10,000 human-verified QA pairs**, and a **private test set with 1,500 manually annotated images**. That is far closer to real OCR evaluation than a generic “best model overall” chart. (99Franklin)\n\n### Researching: DeepResearch Bench and GAIA\n\nDeepResearch Bench’s official site says it consists of **100 PhD-level research tasks** spanning **22 distinct fields**, built with **100+ domain experts**. GAIA’s official leaderboard says it is a benchmark for **general AI assistants** that require reasoning, multimodality, web browsing, and tool use, with **450 questions** and three difficulty levels. If what you mean by “researching” is “can it browse, synthesize, and deliver a useful report,” these are more on-point than a generic chat leaderboard. (DeepResearch Bench)\n\n## What the internet consensus looks like\n\nThe recurring pattern in public discussions is this:\n\n * **No single leaderboard captures everything.**\n * **People mix multiple sites.**\n * **They often use one broad site, one freshness-oriented benchmark, and one specialist benchmark.** In the recent “most reliable benchmarking site” discussion, one commenter explicitly said no single benchmark captures everything and that different leaderboards measure different aspects of capability. 
In the “what benchmarking sites do you use?” thread, people named a mix of **UGI**, **LiveBench**, **Artificial Analysis**, **SWE-bench**, **EQBench**, and others rather than rallying around only one site. (Reddit)\n\nThat is the background. The web does **not** seem to have one universally accepted favorite. It has a few recurring favorites for different jobs. (Reddit)\n\n## What I would recommend for your exact use case\n\nIf your real goal is:\n\n### “I want one page that stays updated and lets me sort models quickly.”\n\nUse **LLM Stats** first. It best matches the UI you described and it is actively updated. (LLM Stats)\n\n### “I want one page that is more rigorous and easier to trust.”\n\nUse **Artificial Analysis** first. It has clearer methodology and a stronger “independent analysis” framing. (Artificial Analysis)\n\n### “I want to know what real people prefer in side-by-side use.”\n\nUse **LM Arena**. (Aurea)\n\n### “I want benchmark signal that is fresher and harder to game.”\n\nUse **LiveBench**. (LiveBench)\n\n### “I care specifically about OCR.”\n\nUse **OCRBench v2** before trusting any overall ranking. (99Franklin)\n\n### “I care specifically about research agents.”\n\nUse **DeepResearch Bench** and **GAIA** before trusting any overall ranking. (DeepResearch Bench)\n\n## My practical ranking\n\nFor a normal person trying to keep up without drowning:\n\n 1. **LLM Stats** for daily browsing and sortable comparisons. (LLM Stats)\n 2. **Artificial Analysis** for more serious comparison and methodology. (Artificial Analysis)\n 3. **LM Arena** for human-preference sanity checks. (Aurea)\n 4. **LiveBench** when you care about benchmark freshness. (LiveBench)\n 5. **OCRBench v2** or **DeepResearch Bench / GAIA** when the task is specialized. 
(99Franklin)\n\n## Bottom line\n\nIf you want the cleanest answer to “what are people’s favorite websites for this,” the recurring favorites are:\n\n * **Artificial Analysis**\n * **LLM Stats**\n * **LM Arena**\n * **LiveBench**\n\nAnd for your specific examples:\n\n * **OCRBench v2** for OCR\n * **DeepResearch Bench** and **GAIA** for research-style agent work. (Reddit)\n\nIf I had to guess the site from the video, I would bet on **LLM Stats** first. (LLM Stats)",
"title": "What is your preferred site to see AI scores on different AI tests?"
}