Claude skill that evaluates B2B vendors by talking to their AI agents
AI agents interrogating other AI agents is a genuinely novel vendor evaluation approach.
Hundreds of agent skills for medical research, including protocol design, data analysis, evidence insights, and academic writing.
Curated prompt library with 420+ skills, but agent skill marketplaces already exist.
AI agent developers building medical research tools
LangChain tools · CrewAI skills · Prompt libraries
Medical Skill Auditor is an evaluation framework that AIPOCH uses to assess the quality of its medical research agent skills before they are made available to users. It acts as a gatekeeper, ensuring that skills meet defined standards in reliability, usability, security, and scientific integrity.
How does Medical Skill Auditor work?
Veto Gates
To enforce strict quality control, Skill Auditor is designed with two layers of veto mechanisms. Any failure in these checks may lead to immediate rejection of a skill.
Skill Veto
Operational Stability · Structural Consistency · Result Determinism · System Security
Research Veto
Scientific Integrity · Practice Boundaries · Methodological Ground · Code Usability
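The two veto layers above can be read as a simple all-or-nothing gate. The following is a minimal sketch, not AIPOCH's implementation: the check names come from the lists above, but the function names, the boolean result format, and the choice to treat a missing check as a failure are assumptions made for illustration.

```python
from __future__ import annotations

# Check names taken from the two veto lists above.
SKILL_VETO = (
    "Operational Stability",
    "Structural Consistency",
    "Result Determinism",
    "System Security",
)
RESEARCH_VETO = (
    "Scientific Integrity",
    "Practice Boundaries",
    "Methodological Ground",
    "Code Usability",
)


def failed_gates(results: dict[str, bool]) -> list[str]:
    """Return the names of all veto checks that did not pass.

    Assumption: a check that is missing from `results` counts as failed.
    """
    return [
        name
        for name in SKILL_VETO + RESEARCH_VETO
        if not results.get(name, False)
    ]


def passes_veto(results: dict[str, bool]) -> bool:
    """A skill clears the gates only if every check in both layers passed."""
    return not failed_gates(results)
```

In this sketch a single failed check is enough to reject the skill, which is the strictest reading of "may lead to immediate rejection".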
Core Capability
Evaluates a skill’s design and contract against key dimensions such as Functional Suitability, Reliability, Performance & Context, Agent Usability, Human Usability, Security, Agent-Specific, and Maintainability.
Medical Task
Assesses actual outputs of a skill with layered criteria.
For skill testing, the AI automatically generates inputs. The number of inputs in specific categories will increase or decrease depending on the complexity of the skill. The following 7 inputs represent the most comprehensive version.
Canonical · Variant A · Edge · Variant B · Stress · Scope Boundary · Adversarial
Skill Complexity Classification
Label | Code/Rank | Definition
Simple | S | Narrow task scope
Moderate | M | Moderate branching or multiple task types
Complex | C | Broad or multi-step specialized skill
Simple (S): 3 inputs
Moderate (M): 5 inputs
Complex (C): 7 inputs
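The mapping from complexity class to number of generated inputs can be sketched as a small lookup. This is an illustrative sketch only: the counts and category names come from the text above, while the names `INPUTS_PER_COMPLEXITY` and `input_count` are hypothetical. The post does not say which categories are kept for Simple and Moderate skills, so the sketch returns only the count.

```python
# The seven input categories of the most comprehensive test set.
INPUT_CATEGORIES = [
    "Canonical",
    "Variant A",
    "Edge",
    "Variant B",
    "Stress",
    "Scope Boundary",
    "Adversarial",
]

# Number of generated test inputs per complexity code (S, M, C).
INPUTS_PER_COMPLEXITY = {"S": 3, "M": 5, "C": 7}


def input_count(complexity_code: str) -> int:
    """Return how many test inputs a skill of the given complexity receives."""
    code = complexity_code.strip().upper()
    if code not in INPUTS_PER_COMPLEXITY:
        raise ValueError(f"Unknown complexity code: {complexity_code!r}")
    return INPUTS_PER_COMPLEXITY[code]
```

A Complex skill receives one input per category, so `input_count("C")` equals `len(INPUT_CATEGORIES)`.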
Final Score
Skill Auditor uses a two-stage scoring system: static evaluation (design quality, 40% of the final score) and dynamic evaluation (runtime performance, 60% of the final score). The final score is the weighted sum of the two.
Static (40%) Dynamic (60%)
Final Score = Static Score × 40% + Dynamic Score × 60%
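As a worked example, the formula above can be written as a short function. This is a sketch under stated assumptions: the post does not specify the score scale, so a 0 to 100 scale is assumed here, and the function and constant names are hypothetical.

```python
STATIC_WEIGHT = 0.40   # design quality, from the formula above
DYNAMIC_WEIGHT = 0.60  # runtime performance, from the formula above


def final_score(static_score: float, dynamic_score: float) -> float:
    """Combine the two stage scores using the 40/60 weighting.

    Assumption: both scores are on the same scale (for example 0 to 100).
    """
    return static_score * STATIC_WEIGHT + dynamic_score * DYNAMIC_WEIGHT
```

For a hypothetical skill with a static score of 80 and a dynamic score of 90, this works out to 80 × 40% + 90 × 60% = 86. Because the dynamic stage carries the larger weight, a skill with a strong design but weak runtime results scores lower than the reverse case.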
You can view evaluation results for selected AIPOCH skills here: https://www.aipoch.com/agent-skills/medical-research-literat....
This framework is still under active development, and we’d love to hear your feedback! Right now it is applied only to a subset of AIPOCH’s skills, but we’re considering expanding it more broadly. If it could be used to assess third‑party skills in the future, would you consider trying it in your own projects? Are there evaluation frameworks you’re already using?
Claude Skill for agent evals, but LangSmith and Arize already own this.
Security scanning catches data exfiltration before skills go live.
Docker sandbox execution catches runtime threats static analysis alone misses.
Self-healing agents patch prompts automatically via replay validation; beats manual iteration.
Git-native prompt versioning with Crucible evaluation, but only 1 star on GitHub.