DigiData

Training and Evaluating General-Purpose Mobile Control Agents

Yuxuan Sun1* Manchen Wang1* Shengyi Qian1* William R. Wong1* Eric Gan1* Pierluca D'Oro1* Alejandro Castillejo Munoz1* Sneha Silwal1 Pedro Matias2 Nitin Kamra2 Satwik Kottur1 Nick Raines2 Xuanyi Zhao1 Joy Chen1 Joseph Greer1 Andrea Madotto1 Allen Bolourchi1,3 James Valori1 Kevin Carlberg2 Karl Ridgeway2† Joseph Tighe1†
1FAIR at Meta, 2Meta Reality Labs Research, 3University of Southern California
*Core contribution, †Project supervision

Abstract

AI agents capable of controlling user interfaces have the potential to transform human interaction with digital devices. To accelerate this transformation, two fundamental building blocks are essential: high-quality datasets that enable agents to achieve complex and human-relevant goals, and robust evaluation methods that allow researchers and practitioners to rapidly enhance agent performance. In this paper, we introduce DigiData, a large-scale, high-quality, diverse, multi-modal dataset designed for training mobile control agents. Unlike existing datasets, which derive goals from unstructured interactions, DigiData is meticulously constructed through comprehensive exploration of app features, resulting in greater diversity and higher goal complexity. Additionally, we present DigiData-Bench, a benchmark for evaluating mobile control agents on real-world complex tasks. We demonstrate that the commonly used step-accuracy metric falls short in reliably assessing mobile control agents and, to address this, we propose dynamic evaluation protocols and AI-powered evaluations as rigorous alternatives for agent assessment. Our contributions aim to significantly advance the development of mobile control agents, paving the way for more intuitive and effective human-device interactions.

Dataset

DigiData is a dataset designed to offer diverse and high-quality data for training mobile control agents. Unlike existing datasets, DigiData is created with a data collection protocol that attempts to comprehensively cover all app features while simultaneously ensuring high data quality.
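To make the structure of such data concrete, here is a minimal, hypothetical sketch of what a single DigiData-style episode could look like as a Python record: a natural-language goal paired with a sequence of (screenshot, action) steps. The field names and action encoding are illustrative assumptions, not the released schema.

```python
# Hypothetical episode record for a mobile control dataset.
# Field names and the action dictionary format are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Step:
    screenshot_path: str  # UI state observed before the action
    action: dict          # e.g. {"kind": "tap", "x": 0.42, "y": 0.77}

@dataclass
class Episode:
    app: str                    # app the goal targets
    goal: str                   # natural-language instruction
    steps: list[Step] = field(default_factory=list)
    verified: bool = False      # passed LLM/human trajectory verification

episode = Episode(
    app="Gallery",
    goal="Create a new folder named 'Mountains' and move the latest photo into it",
)
episode.steps.append(Step("step_000.png", {"kind": "tap", "x": 0.12, "y": 0.88}))
```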

Dataset Characteristics

[Figure: trajectory distribution barplot (left); pairwise cosine distance distributions (right)]

Left: Percentage of data distribution for DigiData's top apps. The dataset presents no major imbalance towards specific apps. Right: Comparison of distribution of pairwise cosine distances across datasets. DigiData exhibits the largest degree of goal diversity, especially compared to AitW.
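As a concrete illustration of the diversity metric in the right panel, the sketch below computes pairwise cosine distances between goal-text embeddings. The embedding model (`all-MiniLM-L6-v2` from sentence-transformers) is an illustrative stand-in; it is not necessarily the encoder used to produce the figure.

```python
# Sketch: goal diversity measured as the distribution of pairwise
# cosine distances between goal-text embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

def pairwise_cosine_distances(goals: list[str]) -> np.ndarray:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
    emb = model.encode(goals, normalize_embeddings=True)  # unit-norm rows
    sims = emb @ emb.T          # cosine similarities
    dists = 1.0 - sims          # cosine distances
    iu = np.triu_indices(len(goals), k=1)  # each unordered pair once
    return dists[iu]

dists = pairwise_cosine_distances([
    "Create a folder named 'Mountains' in the gallery app",
    "Order a medium pepperoni pizza for delivery",
    "Mute notifications for the group chat 'Family'",
])
print(f"mean pairwise distance: {dists.mean():.3f}")
```

A more diverse goal set shifts this distribution toward larger distances, which is the comparison the right panel makes across datasets.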

Data Collection Pipeline

[Figure: data collection pipeline]

A representation of our data collection pipeline. For each app, the pipeline includes three phases. In the first phase (goal curation), human workers exhaustively explore the app and curate a list of goals that attempts to cover all of its features. In the second phase (demonstration collection), human annotators create a set of demonstrations, generating trajectories that achieve the specified goals. In the third phase (trajectory verification), a verification system based on a combination of LLMs and humans filters out trajectories that do not achieve their corresponding goal. Overall, this pipeline enables the collection of in-depth, high-quality mobile control data.
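To illustrate the verification phase, here is a minimal sketch of how an LLM-based check could flag trajectories that miss their goal, with failures routed to human review. The prompt and the `llm_complete` / `human_review` callables are hypothetical placeholders, not the paper's exact system.

```python
# Sketch of LLM-assisted trajectory verification: the LLM sees the goal and
# a textual summary of the recorded actions, and judges goal achievement.
def verify_trajectory(goal: str, action_log: list[str], llm_complete) -> bool:
    prompt = (
        "You are auditing a mobile-device demonstration.\n"
        f"Goal: {goal}\n"
        "Actions taken:\n"
        + "\n".join(f"{i + 1}. {a}" for i, a in enumerate(action_log))
        + "\nDid these actions achieve the goal? Answer YES or NO."
    )
    return llm_complete(prompt).strip().upper().startswith("YES")

def filter_dataset(trajectories, llm_complete, human_review):
    kept = []
    for goal, log in trajectories:
        # Keep a demonstration if the LLM verifies it, or a human reviewer
        # overrides the LLM's rejection; drop it otherwise.
        if verify_trajectory(goal, log, llm_complete) or human_review(goal, log):
            kept.append((goal, log))
    return kept
```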

Comparison with Existing Datasets

[Figure: feature comparison with existing mobile control datasets]

Visualization of different features of existing mobile control datasets. DigiData constitutes a step change in terms of goal depth, being the first large-scale dataset obtained by comprehensive exploration of the functionalities of mobile device apps.

Benchmark

DigiData-Bench evaluates mobile control agents through two complementary approaches: offline evaluation using step-accuracy on 309 tasks with ground-truth demonstrations, and online evaluation in live environments on the same task set, of which 150+ tasks are fully automated through DigiData-Bench-Auto.

Task Statistics & Distribution

App Diversity

DigiData-Bench comprises 309 goals across 37 Android apps spanning 8 categories: Communications, Food, Management, Media, Navigation, Restricted Search, Shopping, and Misfit.

Offline Evaluation

Offline evaluation uses ground truth demonstration trajectories to assess agent performance without requiring live environment execution. Agents are evaluated step-by-step against reference interaction traces, where each predicted action is compared to the corresponding ground truth action from human demonstrations.
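A minimal sketch of the step-accuracy computation follows. The action schema (an action type plus normalized tap coordinates) and the spatial matching tolerance are assumptions for illustration; the benchmark's exact matching rules may differ.

```python
# Sketch: step accuracy compares each predicted action to the ground-truth
# action at the same step of a human demonstration.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # e.g. "tap", "swipe", "type", "done"
    x: float = 0.0  # normalized screen coordinates in [0, 1]
    y: float = 0.0
    text: str = ""

def actions_match(pred: Action, gt: Action, tol: float = 0.1) -> bool:
    if pred.kind != gt.kind:
        return False
    if pred.kind == "tap":  # taps match within a spatial tolerance (assumed)
        return abs(pred.x - gt.x) <= tol and abs(pred.y - gt.y) <= tol
    if pred.kind == "type":
        return pred.text.strip().lower() == gt.text.strip().lower()
    return True

def step_accuracy(preds: list[Action], gts: list[Action]) -> float:
    # Score against every ground-truth step; missing predictions count as wrong.
    correct = sum(actions_match(p, g) for p, g in zip(preds, gts))
    return correct / len(gts)
```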

[Figure: example ground-truth trajectory]

Goal: Using Gallery (Google) app, create a new folder named "Mountains" and move the most recently taken photo of a mountain landscape into it.

Online Dynamic Evaluation

Our online evaluation protocol assesses true task completion in live Android environments. Human evaluators monitor agent execution for safety while judging overall success. DigiData-Bench-Auto automates this process for 150+ tasks using LLM judges, enabling scalable evaluation without human oversight.
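The sketch below illustrates the shape of such an automated judge: after the agent stops, the final screens are summarized and an LLM grades completion. The `llm_judge` callable and the prompt format are hypothetical placeholders, not the exact DigiData-Bench-Auto implementation.

```python
# Sketch: LLM-judged task success over live rollouts.
def judge_episode(goal: str, screen_summaries: list[str], llm_judge) -> bool:
    prompt = (
        f"Task goal: {goal}\n"
        "Observed screens after the agent finished:\n"
        + "\n".join(f"- {s}" for s in screen_summaries)
        + "\nWas the goal fully accomplished? Answer SUCCESS or FAILURE."
    )
    return llm_judge(prompt).strip().upper().startswith("SUCCESS")

def success_rate(episodes, llm_judge) -> float:
    # episodes: iterable of (goal, screen_summaries) pairs from live rollouts
    results = [judge_episode(goal, screens, llm_judge) for goal, screens in episodes]
    return sum(results) / len(results)
```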

[Figure: online evaluation protocol]

A visual representation of our online evaluation protocol. Human workers initialize app states, monitor agent execution, and assess task completion success in live environments.

Key Results

Our experiments demonstrate that agents trained on DigiData achieve superior performance compared to those trained on existing datasets, with our best model reaching a 47.3% task success rate on DigiData-Bench. We also show that Chain-of-Thought data significantly improves both performance and explainability, and we highlight fundamental limitations of step-accuracy metrics for evaluating mobile control agents.

Model            Step-accuracy                                 Success rate (DigiData-Bench)
                 DigiData-Bench   AitW   ACtrl(H)   ACtrl(L)   All    Seen   Familiar   Novel   All (LLM)
GPT4o            40.0             -      -          -          27.8   26.7   33.3       26.5    38.0
Qwen2.5VL        49.2             -      -          -          39.2   37.9   42.6       40.8    46.2
AitW             -                73.1   -          -          -      -      -          -       -
AndroidControl   -                -      71.5       86.6       -      -      -          -       -
CogAgent         -                76.8   -          -          -      -      -          -       -
Ours 1B          67.6             77.4   64.1       78.5       35.0   38.4   37.0       18.4    37.7
Ours 3B          70.7             78.0   65.3       82.1       44.3   49.5   46.3       20.4    49.8
Ours 8B          70.7             78.1   64.1       81.4       42.1   45.2   44.4       26.5    48.5
Ours 8B CoT      72.8             78.7   63.8       77.8       47.3   51.0   42.6       36.7    53.6

Comparison of step-accuracy across different datasets and success rate on DigiData-Bench. ACtrl(H) and ACtrl(L) refer to high-level and low-level tasks from the Android Control test set, while success rate on DigiData-Bench is reported on three app-novelty subsets (Seen, Familiar, Novel). All (LLM) reports success rate with GPT4o as the judge model. Ours 8B CoT achieves the highest success rate on DigiData-Bench under both human and LLM-judge evaluation, showing the effectiveness of DigiData and the synthetic CoT data.
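For intuition on the Chain-of-Thought variant, the sketch below shows one plausible way a CoT training target could be serialized: a short reasoning string emitted before the action. The `<think>` tag format and field names are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative CoT training example: the model learns to produce a brief
# thought before the action. The serialization format here is assumed.
example_target = {
    "goal": "Mute notifications for the group chat 'Family'",
    "thought": (
        "The chat list is open. I need to long-press the 'Family' chat "
        "to reveal the mute option."
    ),
    "action": {"kind": "long_press", "x": 0.50, "y": 0.31},
}

# Thought and action serialized into a single target string for training.
target = f"<think>{example_target['thought']}</think> {example_target['action']}"
print(target)
```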

BibTeX

@misc{sun2025digidatatrainingevaluatinggeneralpurpose,
  title         = {DigiData: Training and Evaluating General-Purpose Mobile Control Agents}, 
  author        = {Yuxuan Sun and Manchen Wang and Shengyi Qian and William R. Wong and Eric Gan and Pierluca D'Oro and Alejandro Castillejo Munoz and Sneha Silwal and Pedro Matias and Nitin Kamra and Satwik Kottur and Nick Raines and Xuanyi Zhao and Joy Chen and Joseph Greer and Andrea Madotto and Allen Bolourchi and James Valori and Kevin Carlberg and Karl Ridgeway and Joseph Tighe},
  year          = {2025},
  eprint        = {2511.07413},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2511.07413}, 
}