AdvertBench, ranking the ability of LLMs to create image ads

by joegibbs·Jun 21, 2026·2 points·0 comments

AI Analysis

●● Solid · Big Brain · Rabbit Hole

Human-voted ad benchmark as proxy for LLM tool-use ability.

Strengths
  • Models choose their own tools (Pillow, Chromium) instead of fixed APIs.
  • Human voting provides ground truth over automated image metrics.
  • Creative methodology measuring multimodal capability through practical tasks.
Weaknesses
  • Research experiment without clear production use case beyond benchmarking.
  • Human voting doesn't scale for large model comparisons.
Category
Target Audience

AI researchers and marketers evaluating multimodal LLM capabilities

Post Description

An experiment I made. The models get access to an E2B sandbox and are instructed to create an ad according to a specification. They can choose whatever tools they want (e.g. Pillow, Chromium), so the task serves as a proxy for their ability to use tools, create other kinds of images, do complex layouts, etc. Currently Opus 4.8 is on top (not surprising, though it took 66 conversation turns to create the image) and GLM-5.2 is in fifth place (which I do find surprising, because it doesn't have image capability).
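The post does not say how human votes are turned into a ranking. As a rough illustration only, here is a minimal sketch of one common approach, an Elo-style update, assuming voters are shown two ads and pick the one they prefer. The function name `elo_rank`, the `(winner, loser)` vote format, the constants, and the placeholder model names are all assumptions for the example, not details of AdvertBench.

```python
from collections import defaultdict


def elo_rank(votes, k=32.0, base=1000.0):
    """Rank models from pairwise votes using a simple Elo update.

    votes: iterable of (winner, loser) model-name pairs, in voting order.
    Returns a list of (model, rating) sorted from highest to lowest rating.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Expected score of the winner under the standard Elo logistic curve.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes; the model names are placeholders, not benchmark results.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
ranking = elo_rank(votes)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

Elo depends on the order in which votes arrive, so a benchmark with few votes per model might prefer a plain win rate or an order-independent fit such as Bradley-Terry instead.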

Similar Projects

AI/ML · ●● Solid

Find the best local LLM for your hardware, ranked by benchmarks

Ranks models by actual benchmark scores instead of just fitting the biggest model in VRAM.

Solve My Problem · Ship It
andyyyy64
283681mo ago
AI/ML · ●● Solid

LLM Debate Benchmark

Side-swapped debate matchups expose model weaknesses standard benchmarks miss.

Big Brain · Dark Horse
zone411
933mo ago