
ProCUA-SFT: Automatically Generated Training Data for Desktop Agents


The bottom line: Automatically synthesized training data raises a desktop agent's OSWorld success rate by 18.7 percentage points over the base model, and by more than 35 points over training on the largest previous public dataset.

Researchers present a dataset of 3.1 million step-level samples for improving AI models that interact with screens and input devices. Fine-tuning on AgentNet, the largest previous public dataset, degrades performance, whereas the new ProCUA-SFT dataset yields substantial improvements.

Training computer-use agents (CUAs), models that interact with desktop environments via screenshots and keyboard and mouse input, requires large amounts of diverse trajectory data. However, the largest public dataset, AgentNet, with 22,500 human trajectories, leads to negative transfer: fine-tuning the UI-TARS 7B model on AgentNet reduces its OSWorld success rate from 26.3% to 8–10%.

ProCUA-SFT relies on a fully automated pipeline: it synthesizes 93,000 trajectories across 2,484 different application combinations and distills them into 3.1 million step-level SFT samples. The tasks are grounded in live desktop environments with real content: 912 spreadsheets from SpreadsheetBench, approximately 10,000 license-free presentations from Zenodo, and multi-app configurations from OSWorld. A single vision-language model (Kimi-K2.5) generates goals, checks preconditions, and executes trajectories, so there is no mismatch between a separate planner and executor. Each trajectory is then expanded into step-prefix samples that exactly reproduce the context layout the model sees at inference time.
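The step-prefix expansion can be illustrated with a short Python sketch. The article does not describe the actual prompt format, so the layout below (goal, then past actions as text, then only the most recent screenshots) and every name in it (`Step`, `Sample`, `build_context`, `expand_trajectory`, `max_images`) are hypothetical. What it shows is the general idea: each step of a trajectory becomes one training sample whose input is everything that came before that step, assembled by the same function an agent would use at inference time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    screenshot: str  # identifier of the screenshot observed before acting
    action: str      # the action taken, e.g. "click(120, 340)"


@dataclass
class Sample:
    prompt: List[str]  # context items in the order the model would see them
    target: str        # the action the model should predict for this step


def build_context(goal: str, history: List[Step], current_screenshot: str,
                  max_images: int = 1) -> List[str]:
    """Hypothetical inference-time layout (an assumption, not the paper's):
    goal, past actions as text, then the most recent `max_images` screenshots."""
    context = [f"GOAL: {goal}"]
    context += [f"ACTION: {s.action}" for s in history]
    screenshots = [s.screenshot for s in history] + [current_screenshot]
    context += [f"IMAGE: {img}" for img in screenshots[-max_images:]]
    return context


def expand_trajectory(goal: str, steps: List[Step],
                      max_images: int = 1) -> List[Sample]:
    """Turn one n-step trajectory into n step-prefix samples."""
    samples = []
    for t, step in enumerate(steps):
        # The prefix steps[:t] is what the agent would have seen before
        # taking step t at inference time; step t's action is the label.
        prompt = build_context(goal, steps[:t], step.screenshot, max_images)
        samples.append(Sample(prompt=prompt, target=step.action))
    return samples


if __name__ == "__main__":
    demo = [
        Step("s0.png", "click(120, 340)"),
        Step("s1.png", "type('Q3 totals')"),
        Step("s2.png", "hotkey('ctrl', 's')"),
    ]
    for sample in expand_trajectory("Rename the sheet and save", demo):
        print(sample.target, "<-", sample.prompt)
```

Routing both data generation and live inference through one `build_context` function is a simple way to keep the two layouts from drifting apart. Under this one-sample-per-step assumption, 3.1 million samples from 93,000 trajectories would correspond to roughly 33 steps per trajectory on average.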

Training UI-TARS 7B on ProCUA-SFT for one epoch achieves a 45.0% success rate on OSWorld – an improvement of 18.7 percentage points over the base model and over 35 percentage points above AgentNet-trained counterparts. Part of the ProCUA dataset was integrated into the training data for Nvidia’s Nemotron 3 Nano Omni model.
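The headline figures are absolute differences in percentage points, not relative improvements. A minimal Python check using only the success rates quoted in this article (the function and constant names are illustrative):

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute change between two success rates, in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Change relative to the starting rate, in percent."""
    return (after - before) / before * 100


BASE = 26.3             # UI-TARS 7B on OSWorld before fine-tuning, in %
PROCUA = 45.0           # after one epoch on ProCUA-SFT, in %
AGENTNET = (8.0, 10.0)  # reported range after fine-tuning on AgentNet, in %

print(f"vs base: {pp_gain(BASE, PROCUA):.1f} pp, "
      f"{relative_gain(BASE, PROCUA):.1f}% relative")
# prints "vs base: 18.7 pp, 71.1% relative"
for rate in AGENTNET:
    print(f"vs AgentNet at {rate:.0f}%: {pp_gain(rate, PROCUA):.1f} pp")
# prints "vs AgentNet at 8%: 37.0 pp", then "vs AgentNet at 10%: 35.0 pp"
```

The 18.7-point gain thus amounts to roughly a 71% relative improvement over the base model, and the gap to AgentNet-trained models is 35 to 37 points depending on which end of the reported range is used.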


Source: arxiv.org · Published June 14, 2026
Lumi AI News — AI-assisted curation in accordance with Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.7.1.
