2500 vision benchmarks / evals for Vision Language Models
Daily arXiv scraping with Claude classification beats manual curation.

3-line real-time VLM API, but competing products handle camera inference already.
Developers building video agents, accessibility apps, home security, and content moderation systems
Replicate · Banana.dev · Modal Labs
Daily arXiv scraping with Claude classification beats manual curation.
Streams LLM weights from CD-ROM during inference to fit 77MB models in 32MB RAM.
Multi-threaded video capture fixes OpenCV's standard blocking I/O bottleneck for Python pipelines.
Beats Qwen2.5-VL-7B on temporal grounding while running on a single consumer GPU.
Ditches the rendering engine entirely to let a VLM hallucinate the pixels from HTML.
Runs Foundation Models on the Neural Engine and can also host MLX/GGUF models locally while offering an in-app HuggingFace browser, on-device WhisperKit/tts, vision analysis and image/video generation — all in a native SwiftUI shell. Exposing 33+ tools over TCP via the Model Context Protocol is a clever move for automation and orchestration, but the macOS-only scope and crowded local-LLM space mean it's a powerful niche play rather than a universal winner.