AdvertBench, ranking the ability of LLMs to create image ads


by joegibbs·Jun 21, 2026·2 points·0 comments


AI Analysis

●● Solid · Big Brain · Rabbit Hole

Human-voted ad benchmark as proxy for LLM tool-use ability.

Strengths

• Models choose their own tools (Pillow, Chromium) instead of fixed APIs.
• Human voting provides ground truth over automated image metrics.
• Creative methodology measuring multimodal capability through practical tasks.

Weaknesses

• Research experiment without clear production use case beyond benchmarking.
• Human voting doesn't scale for large model comparisons.

Post Description

An experiment I've made. The models get access to an E2B sandbox and are instructed to create an ad according to the specifications. They can choose whatever tools they want for it (e.g. Pillow, Chromium). The task is a proxy for their ability to use tools, create other kinds of images, do complex layouts, etc. Currently Opus 4.8 is on top (not surprising, though it did take 66 conversation turns to create the image) and GLM-5.2 is in fifth place (which I do find surprising, because it doesn't have image capability).
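
The post doesn't say how the human votes are turned into a ranking. As a rough illustration only, here is a minimal sketch that assumes each vote is a pairwise choice between two ads and ranks models by win rate. The function name `rank_by_win_rate`, the vote format, and the `model_a`/`model_b`/`model_c` names are all hypothetical placeholders, not AdvertBench's actual method or data.

```python
from collections import defaultdict


def rank_by_win_rate(votes):
    """Rank models by the fraction of pairwise votes they won.

    `votes` is an iterable of (winner, loser) pairs. This format is an
    assumption made for the sketch; the real benchmark may differ.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    rates = {model: wins[model] / games[model] for model in games}
    # Highest win rate first; ties are broken alphabetically by model name.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Made-up votes with placeholder model names, purely for illustration.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]
ranking = rank_by_win_rate(votes)
for model, rate in ranking:
    print(f"{model}: {rate:.2f}")
```

Win rate is the simplest option here. With uneven matchups, a rating scheme such as Elo or Bradley-Terry would probably give a fairer ordering, but that is beyond this sketch.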