Lumen – vision-first browser agent (state of the art, open source)
Vision-only coordinates beat DOM selectors where Stagehand and browser-use still stumble on UI changes.
Desktop automation beyond the DOM using Groq Vision and PyAutoGUI.
RPA engineers, agent developers
UiPath · Claude Computer Use · Adept
75% of real computer work happens in desktop apps, legacy software, and tools with zero APIs. Agents are completely blind to all of it.
PerceptAI uses EasyOCR + Groq Vision to read any screen and PyAutoGUI to act on it. One plain English instruction executes autonomously with self-healing and memory.
Demo: percept-ai-phi.vercel.app GitHub: github.com/Neeraj04-CY/PerceptAi
Would love feedback from anyone building agents.
Vision-only coordinates beat DOM selectors where Stagehand and browser-use still stumble on UI changes.
Lets agents actually see the screen and act on it by returning OCR text with pixel coordinates and offering commands like click_at, type_text, and press_key. You can run it instantly with npx (it auto-creates a Python venv and hooks into Apple Vision/Quartz), and there are ready-made integration snippets for Claude, VS Code, and Cursor — a pragmatic, technically neat tool for closed-loop agent UI work. It’s limited to macOS 13+ and Apple APIs, but within that niche it removes a lot of friction.
Screenpipe alternative with ripgrep redaction for passwords, API keys, and SSNs.
GUI wrapper around Apple's Vision Framework, but it's free and open source.
Finally, visual proof when your AI agent claims it finished the UI work.
Vision models catch UI bugs that Playwright selectors miss — built for AI agent workflows.