GitHub Repository

Agent-first CLI for native UI automation with Set-of-Marks screenshots. MCP server + headless Xvfb support included.

18 stars · Python

SoMatic – Vision-based OS automation framework for AI agents

by smyansondur·May 21, 2026·2 points·0 comments

AI Analysis

●●● Banger · Big Brain · Wizardry · Zero to One

Brings Set-of-Marks prompting to native OS apps where DOM trees don't exist.

Strengths
  • YOLO-based element detection bypasses brittle accessibility trees entirely.
  • MCP server integration enables immediate use with Claude Code and Cursor.
  • npm wrapper bootstraps a local Python venv with ONNX runtime automatically.
Weaknesses
  • Relies on pre-converted ONNX models rather than training custom detectors.
  • Performance on high-DPI or multi-monitor setups remains unproven in docs.
Category
Target Audience

AI agent developers and RPA engineers

Similar To

Computer Use · ActionKit · Browser Use

Post Description

Hi HN, I'm Smyan and I enjoy building agents. Modern multimodal LLMs are great at vision and perception but are quite poor at localization. This naturally creates a massive problem when we try to take our RPA frameworks and give them to agents to perform computer use tasks.

For browsers, we have been able to solve this by using the DOM tree to supply the LLM with structural hints. More recently, browser-use frameworks have adopted Set-Of-Marks prompting, which takes the structural information of the webpage and converts it into labeled visual bounding boxes. This lets the LLM use its strong vision and perception and turn them into accurate localization. Functionally, the LLM now simply needs to say "click 4" instead of "click 443 213".

This methodology, however, fails horribly when we try to apply it to native OS automation. The accessibility tree, which often exists for native apps, is usually quite brittle, exposes non-deterministic selectors and is often stripped by developers, which can make it hard to localize elements. Fuzzy matching can help with this, but it is nonetheless very hard to get right.
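To make the fuzzy-matching point concrete, here is a minimal sketch using Python's standard `difflib`. The label strings and the `find_element` helper are hypothetical and only stand in for whatever an accessibility tree might expose; this is not how SoMatic works, since SoMatic avoids the tree entirely.

```python
import difflib

def find_element(target, labels, cutoff=0.6):
    """Return the label closest to `target`, or None if nothing is similar enough."""
    # Compare case-insensitively but return the original label.
    lowered = {label.lower(): label for label in labels}
    matches = difflib.get_close_matches(target.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

# Hypothetical labels: some are auto-generated ids, some are human-readable.
labels = ["btn_save_1f3a", "btn_cancel_9c2e", "Save As...", "textfield_name"]

print(find_element("save as", labels))        # matches the readable label
print(find_element("volume slider", labels))  # nothing close enough, so None
```

The `cutoff` is the fragile part: set it too low and the wrong element gets matched, too high and renamed or auto-generated labels are never found.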

This is exactly why I made SoMatic. SoMatic is a purely vision-based framework that uses a finetuned YOLO model (heavily inspired by OmniParser v2) to identify text and interactable elements in a UI. The YOLO model runs locally on the CPU with ONNX and is quite fast. SoMatic draws the bounding boxes and labels and then maps the id of each bounding box to the coordinates of the center of that box. This enables Set-Of-Marks prompting for, in principle, ANY user interface.
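The id-to-center mapping described above can be sketched in a few lines. This is an illustration under stated assumptions, not SoMatic's actual code: the `(x1, y1, x2, y2)` box format, the 1-based ids, and the `build_marks` and `resolve` helpers are all hypothetical.

```python
def build_marks(boxes):
    """Map 1-based mark ids to the integer center of each (x1, y1, x2, y2) box."""
    return {
        mark_id: ((x1 + x2) // 2, (y1 + y2) // 2)
        for mark_id, (x1, y1, x2, y2) in enumerate(boxes, start=1)
    }

def resolve(command, marks):
    """Turn a command like 'click 2' into ('click', (x, y))."""
    verb, mark_id = command.split()
    return verb, marks[int(mark_id)]

# Hypothetical detector output: two boxes in pixel coordinates.
boxes = [(10, 10, 110, 40), (400, 200, 486, 226)]
marks = build_marks(boxes)

print(resolve("click 2", marks))
```

The model only ever has to name a mark id; the pixel coordinates stay on the framework side.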

I ran an ablation benchmark using the framework with GPT-5.5 (high) and was able to achieve ~20% higher accuracy than with the raw model alone. What was surprising, however, was that the model performed slightly better when it knew just the locations of the bounding boxes (without actually seeing them). This could be due to the threshold tuning for the YOLO model drawing either too many or too few boxes (I'm not entirely sure).
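For the locations-only variant, one way to hand the model box positions without drawing anything is a plain text listing. The post does not say how the locations were passed in the benchmark, so the format and the `describe_marks` helper below are assumptions for illustration.

```python
def describe_marks(marks):
    """Render mark ids and centers as one line per element, ordered by id."""
    lines = [
        f"[{mark_id}] center=({x}, {y})"
        for mark_id, (x, y) in sorted(marks.items())
    ]
    return "\n".join(lines)

# Hypothetical marks: id -> center of the detected box, in pixels.
marks = {1: (60, 25), 2: (443, 213)}

prompt = "Interactable elements (id and center in pixels):\n" + describe_marks(marks)
print(prompt)
```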

Either way, if you have been wanting to give your AI agents full autonomy over your computer (Windows, Mac and Linux), you can download the CLI with

npm install -g somatic-cli/cli

and the corresponding skill with

npx skills add Smyan1909/SoMatic

The CLI also comes with a stdio MCP server if you want the model to receive the screenshots (b64 encoded) directly through the chosen API instead of having to read the image file after each screenshot.
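On the receiving side, a b64-encoded screenshot is just a string that decodes to image bytes. The sketch below assumes the payload is a PNG, which the post does not state, and `decode_screenshot` is a hypothetical helper rather than part of the SoMatic API.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file

def decode_screenshot(b64_payload):
    """Decode a base64 string and check that it looks like a PNG (assumed format)."""
    raw = base64.b64decode(b64_payload, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("payload is not a PNG image")
    return raw

# Stand-in payload: a PNG signature followed by dummy bytes, not a real screenshot.
sample = base64.b64encode(PNG_SIGNATURE + b"fake image data").decode("ascii")

image_bytes = decode_screenshot(sample)
print(len(image_bytes))
```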

I'd love to get your feedback on the vision-only approach. Are we at the point where we can finally abandon the mess that is the OS accessibility tree for automation?

Similar Projects

Automate Mac with Codex: macOS Control MCP Demo

Lets agents actually see the screen and act on it by returning OCR text with pixel coordinates and offering commands like click_at, type_text, and press_key. You can run it instantly with npx (it auto-creates a Python venv and hooks into Apple Vision/Quartz), and there are ready-made integration snippets for Claude, VS Code, and Cursor — a pragmatic, technically neat tool for closed-loop agent UI work. It’s limited to macOS 13+ and Apple APIs, but within that niche it removes a lot of friction.

Wizardry · Niche Gem
peterhddcode
10 · 4mo ago
AI/ML · Mid

Colab pipeline for auto-labeling datasets with prompt and training YOLO

Notebook wrapper around an API, but Roboflow and Label Studio already do this.

Ship It
eyasu6464
21 · 3mo ago
Developer Tools · ●● Solid

A vision-based AI agent for end-to-end testing

They've traded brittle selector-based scripts for a vision-and-planning loop: describe a test in plain English, the agent visually inspects the UI, plans actions, executes them (including OS-level interactions) and iterates until success or failure. If it actually nails reproducible CI-friendly runs, debuggable artifacts, and edge cases like dynamic content and auth flows, this could be a meaningful shift — but those operational details will make or break it.

Wizardry · Bold Bet
chikathreesix
20 · 3mo ago