GitHub Repository

I benchmarked 17 LLMs on a WP plugin task after Copilot removed Claude Opus. Exactly zero used the native UI.

3 stars · JavaScript

I blind-tested 14 LLMs on a WP plugin task. Surprising Findings

by guilamu·Apr 23, 2026·3 points·2 comments

AI Analysis

Mid · Niche Gem

Rigorous benchmark methodology, but it's research not a tool you can use.

Strengths
  • Blind testing with anonymized outputs prevents evaluator bias in scoring.
  • Gemini 3.1 Pro as impartial judge against a 100-point rubric is clever.
  • Reveals all 14 models failed to hook into Gravity Forms native search input.
Weaknesses
  • Static README with findings — no interactive tool or repeatable benchmark runner.
  • WordPress-specific results don't generalize to other frameworks or use cases.
Target Audience

WordPress developers evaluating AI coding assistants

Similar To

LMArena · Artificial Analysis

Post Description

Recently, GitHub Copilot silently dropped support for Claude Opus on Pro accounts. Since Opus was my go-to model for my daily workflow (developing WordPress plugins), I needed a reliable replacement.

I decided to run a rigorous, blind benchmark across 14 state-of-the-art and local LLMs to measure objectively which model understands WordPress development best. To keep the test fair, I started with a completely fresh IDE and zero context for every single generation.

I asked each model to build a "Gravity Forms Live Search" plugin using a minimal, zero-shot prompt. To avoid personal bias, I had Gemini 3.1 Pro blindly grade the anonymized outputs against a strict 100-point rubric, comparing them to my own reference implementation.

Surprising Findings

1. The "Blind Spot" (Re-inventing the wheel): Out of 14 models, exactly 0 hooked into the native Gravity Forms search input (#form_list_search). Instead of analyzing the implicit context (the DOM), every single model injected a brand-new, redundant <input> into the page.

2. Complete lack of advanced UX foresight: Because it wasn't explicitly asked for, no model anticipated the need for a keyboard shortcut (Ctrl+F), and none attempted to update the native item counter as rows were hidden. Zero models implemented background fetching of paginated pages to make the search global.

3. The Diacritics Separator: Most models used a simple .toLowerCase() for filtering, which breaks on accented characters. Only a select few implemented robust normalization (.normalize('NFD')) to handle diacritics correctly.

4. Local models struggled: Local inference failed to keep up on my low-end hardware (7700X with 64 GB, RX 6700 with 10 GB). Gemma4-26b underperformed significantly, generating a fatal PHP error and scoring 18/100.
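
To make finding 1 concrete, here is a minimal sketch of reusing the existing search box instead of adding a second one. The selector #form_list_search comes from the finding above. The function name attachLiveSearch and the injected doc parameter are hypothetical, chosen so the snippet also runs outside a browser. This is an illustration, not code from the reference implementation.

```javascript
// Reuse the existing search box rather than creating a new <input>.
// `doc` is injected (hypothetical design choice) so the function can be
// exercised without a real browser document.
function attachLiveSearch(doc, onQuery) {
  const input = doc.querySelector('#form_list_search'); // selector named in finding 1
  if (!input) return false; // not on the forms list screen: do nothing
  // Fire on every keystroke instead of waiting for the form to be submitted.
  input.addEventListener('input', () => onQuery(input.value));
  return true;
}

// In a browser, wire it to the real document.
if (typeof document !== 'undefined') {
  attachLiveSearch(document, (query) => console.log('query:', query));
}
```

Returning false when the input is missing keeps the script harmless on admin pages that do not contain the forms list.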
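
For finding 2, a rough sketch of keeping a counter in sync while rows are hidden, plus a Ctrl+F check. The names filterRows and isFindShortcut are hypothetical, the "N items" wording is an assumption about the counter text, and which element acts as the counter is left to the caller. Matching here is plain lowercase only, so accents are not handled in this snippet.

```javascript
// Hide non-matching rows and report how many remain visible.
// `rows` are elements (or element-like objects) exposing `textContent` and `hidden`.
// `counter` is whatever element displays the item count (may be null).
function filterRows(rows, counter, query) {
  const needle = query.trim().toLowerCase();
  let visible = 0;
  for (const row of rows) {
    const match = row.textContent.toLowerCase().includes(needle);
    row.hidden = !match;
    if (match) visible += 1;
  }
  if (counter) {
    // Assumed wording; a real plugin would use a translatable string.
    counter.textContent = `${visible} item${visible === 1 ? '' : 's'}`;
  }
  return visible;
}

// True for Ctrl+F (or Cmd+F on macOS).
function isFindShortcut(event) {
  return (event.ctrlKey || event.metaKey) && event.key.toLowerCase() === 'f';
}

if (typeof document !== 'undefined') {
  document.addEventListener('keydown', (event) => {
    const input = document.querySelector('#form_list_search');
    if (input && isFindShortcut(event)) {
      event.preventDefault(); // keep the browser's own find bar closed
      input.focus();
    }
  });
}
```

An empty query matches every row, so clearing the search box restores the full list and the original count.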
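
The difference described in finding 3 is easy to show. The sketch below folds text with .normalize('NFD') and then strips the combining marks, so "adhésion" matches "adhesion". The helper names foldText and matchesQuery are hypothetical, and the sample title is made up for illustration.

```javascript
// Fold a string for accent-insensitive, case-insensitive comparison.
function foldText(value) {
  return String(value)
    .normalize('NFD')                 // split "é" into "e" + a combining accent
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining diacritical marks
    .toLowerCase();
}

// True when the folded query occurs anywhere in the folded title.
function matchesQuery(title, query) {
  return foldText(title).includes(foldText(query));
}

const title = "Formulaire d'adhésion"; // made-up form title
console.log(matchesQuery(title, 'adhesion'));             // true
console.log(title.toLowerCase().includes('adhesion'));    // false: plain lowercase keeps the accent
```

Folding both the title and the query means the search also works when the user types the accented form.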

The Standouts

The Winner: Claude 4.7 Opus (68/100). It wrote highly performant JS (caching DOM text, 120ms debounce), handled diacritics perfectly, and used modern WordPress i18n. It stands out as the most capable direct replacement for Copilot Pro Opus.
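
The two performance techniques credited to the winner can be sketched as follows. This is not the model's actual output, only an illustration under assumptions: debounce and buildIndex are hypothetical names, and 120 is the delay quoted above.

```javascript
// Delay calls to `fn` until `waitMs` have passed without a new call.
function debounce(fn, waitMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer); // cancel the pending call, if any
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Read each row's text once up front so keystrokes never touch the DOM for reads.
function buildIndex(rows) {
  return rows.map((row) => ({ row, text: row.textContent.toLowerCase() }));
}

// Filter against the cached text only.
function search(index, query) {
  const needle = query.trim().toLowerCase();
  return index.filter((entry) => entry.text.includes(needle)).map((entry) => entry.row);
}

// Example wiring: at most one search per 120 ms pause in typing.
const index = buildIndex([{ textContent: 'Contact Form' }, { textContent: 'Newsletter' }]);
const onInput = debounce((query) => console.log(search(index, query).length), 120);
onInput('c');
onInput('co');
onInput('con'); // only this call runs, after 120 ms, and logs 1
```

Caching trades a little memory for fewer DOM reads, and the index has to be rebuilt if rows are added or removed.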

The Value King: GLM 5.1 (61/100). GLM secured a notable 2nd place, ahead of Opus 4.6! When checking OpenRouter, GLM 5.1 ($1.05 in / $3.50 out) is ~3-4x cheaper than Sonnet 4.6 and ~5-7x cheaper than Opus 4.6/4.7, making it a very cost-effective alternative for this task.

The Leaderboard

1. Claude 4.7 Opus plan – 68

2. GLM 5.1 – 61

3. Claude 4.6 Opus plan – 59

4. Mimo v2.5 pro – 58

5. Qwen 3.6+ – 55

6. Sonnet 4.6 – 55

7. Gemini 3.1 pro – 53

8. Kimi K2.6 – 49

9. GPT 5.4 xHigh – 49

10. Gemini 3 flash – 47

11. Claude 4.7 Opus fast – 46

12. Minimax m2.7 – 36

13. Gemma4-e4b (Local rx6700) – 32

14. Gemma4-26b (Local CPU) – 18

Takeaway

Even the best LLMs default to the path of least resistance: "just make it work." If you want native-feeling, fully integrated UX, you cannot rely on the model's implicit knowledge; you have to explicitly prompt for it.

I've published the full leaderboard, the exact prompts used, the detailed scoring grid, and all the generated code in the GitHub repository here: https://github.com/guilamu/llms-wordpress-plugin-benchmark

I will be testing a Level 2 prompt next, feeding the models a WordPress + Gravity Forms reference file to see how they adapt.

Similar Projects

OpenCode Benchmark Dashboard

Benchmarks OpenCode models locally, but lacks preloaded datasets and only works with configured OpenAI-compatible APIs.

Niche Gem
grigio
103mo ago
AI/ML●●Solid

ModelSweep - Open-Source Benchmarking for Local LLMs

Postman for local LLMs with LLM-as-Judge and Elo ratings built in.

Ship It · Niche Gem · Slick
leonickson
203mo ago