DataFlow,Turn raw data into high-quality LLM training datasets
LLM-based cleaning operators beat regex pipelines for messy text data.
Example scripts and notebooks for bootstrapping YOLO datasets with open-vocabulary object detection using free-text prompts and the Detect Anything API. Includes Python, Colab, and JS examples for easy dataset generation.
Notebook wrapper around an API, but Roboflow and Label Studio already do this.
Computer vision engineers, ML hobbyists
Roboflow · Label Studio · CVAT
The workflow uses open-vocabulary object detection to generate bounding boxes from free-text prompts, which are then exported as YOLO labels and used to train a detector.
Typical workflow in the notebook:
Start with an unlabeled or weakly labeled image dataset Generate bounding boxes using prompts (for example "cat's head" or "dent in car bumper") Filter positives and rebalance the dataset Export labels in YOLO format Train and evaluate a YOLO model
In the example notebook I use a cats vs dogs dataset with only image-level labels. Using the prompt "cat's and dog's head", the pipeline auto-generates head bounding boxes and trains a small YOLO model.
The repository mainly contains the Colab notebook plus example scripts for running the detection and exporting YOLO labels.
Curious to know what people think of this approach.
LLM-based cleaning operators beat regex pipelines for messy text data.
Yet another Hugging Face dataset in a sea of thousands.
Synthetic rare-defect dataset solves real validation gap, but relies on closed Silera tool.
Auto-fills in_features when connecting layers, exports clean PyTorch code.
Beats GPT-5 at golf forecasting via auto-labeled data pipeline; replicable recipe for any domain via SDK.
CC0 data bundles with Annex IV reports for EU AI Act compliance before August 2026.