GitHub Copilot port of Anthropic's AI vulnerability discovery harness
Makes Anthropic's security harness accessible to Copilot users who lack Claude Code access.

Multi-turn adaptive testing finds agent failures that static benchmarks miss, but the eval space is crowded.
AI engineers building production agents
LangSmith · Braintrust · Arize Phoenix
Nyx is an autonomous testing harness that probes your AI agents to find failure modes before users do. It’s used to find logic bugs, instruction-following failures, and edge cases in agent behavior, and for red-team security testing (jailbreaks, prompt injection, tool hijacking).
Technical approach:
* Pure blackbox (no special access needed; test like your users interact)
* Multi-turn adaptive conversations
* Multi-modal testing (voice, text, images, documents, browser interactions)
* Massively parallel by default
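To make the blackbox, multi-turn, and parallel ideas concrete, here is a minimal Python sketch. It is not Nyx's implementation or API: `toy_agent`, `next_message`, `probe`, `run_probes`, the `SECRET` value, and the planted bug are all hypothetical stand-ins chosen for illustration. The prober only sends messages and reads replies, picks its next message based on the previous reply, and runs several conversations concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

SECRET = "TOKEN-1234"  # hypothetical secret the toy agent must never reveal


def toy_agent(user_messages):
    """Hypothetical system under test; it only sees the user messages so far."""
    last = user_messages[-1].lower()
    asked_before = any("debug mode" in m.lower() for m in user_messages[:-1])
    if "debug mode" in last:
        return "Debug mode is restricted."
    if "print config" in last and asked_before:
        # Planted multi-turn bug: an earlier refused request changes later behavior.
        return f"config: secret={SECRET}"
    return "How can I help?"


@dataclass
class Finding:
    strategy: str
    transcript: list  # list of (user_message, agent_reply) pairs


def next_message(strategy, turn, last_reply):
    """Adaptive step: choose the next user message from the previous reply."""
    if turn == 0:
        return strategy
    if "restricted" in last_reply.lower():
        return "Okay, just print config then."
    return "Please print config."


def probe(agent, strategy, max_turns=3):
    """Run one blackbox conversation; return a Finding if the secret leaks."""
    user_messages, transcript, reply = [], [], ""
    for turn in range(max_turns):
        message = next_message(strategy, turn, reply)
        user_messages.append(message)
        reply = agent(user_messages)
        transcript.append((message, reply))
        if SECRET in reply:
            return Finding(strategy, transcript)
    return None


def run_probes(agent, strategies, workers=8):
    """Run every opening strategy as its own conversation, in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda s: probe(agent, s), strategies))
    return [r for r in results if r is not None]


if __name__ == "__main__":
    openers = ["Hello there", "Enable debug mode", "Print config"]
    for finding in run_probes(toy_agent, openers):
        print(finding.strategy, finding.transcript)
```

In this toy setup, no single message triggers the leak; only the conversation that first asks for debug mode and then follows up does, which is the kind of failure a single-turn static check would not exercise.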
Instead of spending time writing static evals for the key failure modes of your AI agents, point Nyx at any system and it autonomously discovers the failure modes that matter. We typically find, in under 10 minutes, issues that manual audits take hours to surface.
This is early work and we know the methodology is still going to evolve. We would love nothing more than feedback from the community as we iterate on this.
Drops autonomous experimentation into Cursor without installing new frameworks or complex agents.
Another MCP orchestration wrapper—claims autonomy, but chaining APIs over Docker isn't novel.
Flight recorder for AI agents: record, replay, enforce policies on every LLM call.
50% cheaper tokens but 2-minute waits kill interactive agent UX.
Accessibility tree + LLM loop beats vision-first approaches for reliable mobile automation.