The backstory
Evaluating image generation models has always been difficult: qualitative differences such as typography adherence, spatial reasoning, and dynamic lighting aren't easily captured by traditional automated metrics like FID or standard CLIP scores.
General 'Arena' platforms, where users blindly click which image they like better, are great for a vibe check, but they don't help practitioners understand why one model is better for a specific task. I wanted a platform that treated image generation like a rigorous science instead of an unpredictable art form.
What ImageBench.ai does
ImageBench is a streamlined, transparent benchmark platform. Instead of relying on legacy metrics or pure crowd preference, it pairs VLM-based judges (specifically Qwen 3 L and Qwen 122B) with human evaluations to rate images methodically.
Models are evaluated across six focused categories:
- Text Rendering
- Spatial Reasoning
- Human Realism
- Truthfulness
- Professional Studio
- Graphical Design
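To make the judging pipeline concrete, here is a minimal sketch of how per-category scores from a VLM judge and human raters might be combined into a single leaderboard number. The type names, 0.5 weighting, and rating scale are my assumptions, not ImageBench's published methodology:

```typescript
// Hypothetical sketch of score aggregation; the weighting scheme, type names,
// and rating scale are assumptions, not ImageBench's actual code.

type CategoryScore = { category: string; vlmScore: number; humanScore: number };

// Blend the VLM judge's rating with the human rating per category,
// then average across categories for an overall leaderboard score.
function overallScore(scores: CategoryScore[], vlmWeight = 0.5): number {
  if (scores.length === 0) return 0;
  const blended = scores.map(
    (s) => vlmWeight * s.vlmScore + (1 - vlmWeight) * s.humanScore
  );
  return blended.reduce((sum, x) => sum + x, 0) / blended.length;
}
```

With equal weights, a model scoring (VLM 8, human 7) on Text Rendering and (6, 6) on Spatial Reasoning averages to (7.5 + 6) / 2 = 6.75.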
Users can explore a global leaderboard, compare two models side by side in the built-in Versus Mode, and dig into the underlying dataset through the platform's interactive Hero Showcase, which renders images directly from the edge with no perceptible lag.
How it was created
This project was brought to life with an AI agent, built on:
- Next.js 15 (App Router) on Cloudflare Pages
- React 18, TypeScript, Tailwind CSS with fully responsive custom aesthetics
- Next.js built-in image generation (next/og) replacing a complex headless-browser setup for OpenGraph images and SVGs
- Clean file-based JSON dataset routing for benchmarks
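The next/og bullet in the stack above refers to Next.js's built-in ImageResponse, which renders JSX straight to a PNG at request time, so no headless browser is needed. A minimal route sketch, assuming Next.js 14+; the layout and text are placeholders, not the platform's actual OG template:

```tsx
// app/og/route.tsx — minimal sketch of headless-browser-free OG image generation.
import { ImageResponse } from "next/og";

export function GET() {
  // JSX is rendered to a 1200x630 PNG on the fly.
  return new ImageResponse(
    (
      <div
        style={{
          width: "100%",
          height: "100%",
          display: "flex",
          alignItems: "center",
          justifyContent: "center",
          fontSize: 64,
        }}
      >
        ImageBench.ai
      </div>
    ),
    { width: 1200, height: 630 }
  );
}
```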
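The file-based JSON dataset routing could look something like the sketch below; the directory layout, file names, and entry shape are illustrative assumptions, not ImageBench's actual schema:

```typescript
// Illustrative sketch of file-based JSON benchmark loading; paths and the
// BenchmarkEntry shape are assumptions, not ImageBench's actual dataset schema.
import { readFileSync } from "node:fs";
import { join } from "node:path";

type BenchmarkEntry = { model: string; category: string; score: number };

// Parse and shape-check one benchmark file's contents,
// so a malformed dataset file fails loudly at load time.
function parseBenchmark(raw: string): BenchmarkEntry[] {
  const entries = JSON.parse(raw);
  if (!Array.isArray(entries)) {
    throw new Error("benchmark file must be a JSON array");
  }
  return entries as BenchmarkEntry[];
}

// Load e.g. data/benchmarks/text-rendering.json for a category page or route.
function loadBenchmark(dataDir: string, category: string): BenchmarkEntry[] {
  return parseBenchmark(readFileSync(join(dataDir, `${category}.json`), "utf8"));
}
```

Keeping each category in its own JSON file means a route can read exactly one small file per request and the dataset stays diffable in version control.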