Lumen – vision-first browser agent (state of the art, open source)
Vision-only coordinates beat DOM selectors where Stagehand and browser-use still stumble on UI changes.
Agent-first CLI for native UI automation with Set-of-Marks screenshots. MCP server + headless Xvfb support included.
Brings Set-of-Marks prompting to native OS apps where DOM trees don't exist.
AI agent developers and RPA engineers
Computer Use · ActionKit · Browser Use
For browsers, we have been able to solve this by using the DOM tree to supply the LLM with structural hints. More recently, browser-use frameworks have adopted Set-of-Marks prompting, which takes the structural information of the webpage and converts it into labeled visual bounding boxes. This lets the LLM use its strong vision and perception to localize elements accurately. Functionally, the LLM now simply needs to say "click 4" instead of "click 443 213".
This methodology, however, fails horribly when we try to apply it to native OS automation. The accessibility tree, which often exists for native apps, is usually quite brittle, exposes non-deterministic selectors, and is often stripped by developers, all of which can make it hard to localize elements. Fuzzy matching can help with this, but it is nonetheless very hard to get right.
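To make the fuzzy-matching point concrete, here is a minimal sketch using Python's standard `difflib`. The function name `find_element` and the sample labels are hypothetical and only for illustration. This is not how SoMatic or any particular accessibility API works.

```python
import difflib


def find_element(target, labels, cutoff=0.6):
    """Return the label that best matches `target`, or None if nothing is close.

    `labels` stands in for names read from an accessibility tree (hypothetical data).
    """
    # Compare case-insensitively, but return the original spelling.
    lowered = {label.lower(): label for label in labels}
    matches = difflib.get_close_matches(
        target.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


labels = ["Save As...", "btn_submit_7f3a", "Cancel"]
print(find_element("Save as", labels))
print(find_element("Print", labels))
```

A similarity cutoff like this handles small spelling differences, but it offers little help when the tree exposes generated or missing names, which is the brittleness described above.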
This is exactly why I made SoMatic. SoMatic is a pure vision-based framework that uses a fine-tuned YOLO model (highly inspired by OmniParser v2) to identify text and interactable elements in a UI. The YOLO model runs locally on the CPU with ONNX and is quite fast. SoMatic draws the bounding boxes and labels, then maps the ID of each bounding box to the coordinates of that box's center. This enables Set-of-Marks prompting for, in principle, ANY user interface.
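The ID-to-center mapping described above can be sketched in a few lines of Python. This is an illustrative sketch, not SoMatic's actual code: the `Box` type, the 1-based numbering, and the function names are all assumptions, and the boxes are hard-coded where a detector's output would normally go.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x1: int
    y1: int
    x2: int
    y2: int


def build_mark_table(boxes):
    """Assign 1-based mark IDs and map each ID to the center of its box."""
    table = {}
    for mark_id, box in enumerate(boxes, start=1):
        center_x = (box.x1 + box.x2) // 2
        center_y = (box.y1 + box.y2) // 2
        table[mark_id] = (center_x, center_y)
    return table


def resolve_mark(table, mark_id):
    """Translate a model answer such as "click 2" into pixel coordinates."""
    if mark_id not in table:
        raise KeyError(f"unknown mark id: {mark_id}")
    return table[mark_id]


# Stand-in detections; a real pipeline would take these from the detector.
boxes = [Box(0, 0, 10, 20), Box(400, 200, 486, 226)]
table = build_mark_table(boxes)
print(resolve_mark(table, 2))  # (443, 213)
```

With a table like this, the model only has to name a mark, and the lookup supplies the pixel coordinates to click.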
I ran an ablation benchmark using the framework with GPT-5.5 (high) and achieved roughly 20% higher accuracy than with the raw model alone. What was surprising, however, was that the model performed slightly better when it knew just the locations of the bounding boxes (without actually seeing them drawn). This could be due to the threshold tuning for the YOLO model, which may draw either too many or too few boxes (I'm not entirely sure).
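To show why the threshold matters, here is a small sketch of confidence filtering. The detection format, the scores, and the threshold values are made up for illustration. The post does not state SoMatic's actual threshold or output format.

```python
def filter_detections(detections, threshold):
    """Keep only detections whose confidence score meets the threshold."""
    return [d for d in detections if d["score"] >= threshold]


# Hypothetical detector output: one box per dict, with a confidence score.
detections = [
    {"box": (10, 10, 90, 40), "score": 0.92},
    {"box": (100, 10, 180, 40), "score": 0.55},
    {"box": (12, 12, 88, 38), "score": 0.21},  # likely a duplicate of the first box
]

for threshold in (0.1, 0.5, 0.9):
    kept = filter_detections(detections, threshold)
    print(threshold, len(kept))
```

A low threshold keeps the near-duplicate box and clutters the screenshot with extra marks, while a high one drops the second element entirely, so either extreme could plausibly hurt the model.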
Either way, if you have been wanting to give your AI agents full autonomy of your computer (Windows, Mac and Linux), you can download the CLI with
npm install -g somatic-cli/cli
and the corresponding skill with
npx skills add Smyan1909/SoMatic
The CLI also comes with a stdio MCP server if you want the model to parse the screenshots (base64-encoded) directly from the chosen API instead of having to read the image after each screenshot.
I'd love to get your feedback on the vision-only approach. Are we at the point where we can finally abandon the mess that is the OS accessibility tree for automation?
Lets agents actually see the screen and act on it by returning OCR text with pixel coordinates and offering commands like click_at, type_text, and press_key. You can run it instantly with npx (it auto-creates a Python venv and hooks into Apple Vision/Quartz), and there are ready-made integration snippets for Claude, VS Code, and Cursor — a pragmatic, technically neat tool for closed-loop agent UI work. It’s limited to macOS 13+ and Apple APIs, but within that niche it removes a lot of friction.
Standardizes portable cryptographic receipts for agent behavior—but adoption unclear, overlaps Nobulex heavily.
Notebook wrapper around an API, but Roboflow and Label Studio already do this.
Site returns Cloudflare 522 error — can't evaluate a broken landing page.
They've traded brittle selector-based scripts for a vision-and-planning loop: describe a test in plain English, the agent visually inspects the UI, plans actions, executes them (including OS-level interactions) and iterates until success or failure. If it actually nails reproducible CI-friendly runs, debuggable artifacts, and edge cases like dynamic content and auth flows, this could be a meaningful shift — but those operational details will make or break it.