Google researchers have published a paper detailing a method for extracting user intent from sequences of UI interactions, which can be applied to autonomous agents. The described approach uses small on-device models, eliminating the need to send data back to Google and thereby protecting user privacy.
The researchers addressed the problem by dividing it into two distinct tasks. The resulting solution reportedly outperformed baseline multi-modal large language models (MLLMs) running in data centers.
The core of the research focuses on identifying user intent from a sequence of actions on a mobile device or in a browser, while keeping all processing and data on the device itself.
The Two-Stage Approach
This was achieved through a two-stage process:
- The first stage involves an on-device model summarizing individual user actions.
- These sequential summaries are then fed into a second model, which identifies the overall user intent (see the sketch after this list).
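Conceptually, the pipeline chains two on-device model calls. The following is a minimal sketch of that flow; the `run_on_device_model` helper and the prompt wording are illustrative assumptions, not Google's actual runtime or prompts:

```python
from typing import Optional

def run_on_device_model(prompt: str, image: Optional[bytes] = None) -> str:
    """Hypothetical wrapper around an on-device small (M)LLM call."""
    raise NotImplementedError  # stand-in for a real on-device inference runtime

def summarize_step(screenshot: bytes, action: str) -> str:
    # Stage one: produce one summary per interaction
    # (a screenshot plus a textual action representation).
    prompt = f"Summarize this screen and the user's action: {action}"
    return run_on_device_model(prompt, image=screenshot)

def extract_intent(steps: list[tuple[bytes, str]]) -> str:
    # Stage two: feed the ordered summaries to a second model,
    # which infers the overall user intent.
    summaries = [summarize_step(img, act) for img, act in steps]
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(summaries))
    return run_on_device_model(
        "Given these step-by-step summaries of a user's session, "
        "state the overall user intent in one sentence:\n" + numbered
    )
```

Because no data ever leaves the device, both model calls must fit within on-device compute budgets, which is why the approach relies on small models rather than a single large MLLM.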
The researchers stated that their two-stage method demonstrated superior performance compared to both smaller models and a leading large MLLM, regardless of dataset or model type. It also handled noisy data more effectively than traditional supervised fine-tuning methods.
Background and Adaptation
The technique of intent extraction from UI interactions using MLLMs was first proposed in 2025. Google's researchers adapted this approach with an improved prompt.
Extracting intent is a complex, error-prone problem. The researchers define a "trajectory" as a user's journey within an application, represented by a sequence of interactions. Each interaction step consists of two parts: an observation (the visual state of the screen) and an action (a user interaction such as clicking or typing).
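Under these definitions, a trajectory can be pictured as an ordered list of observation/action pairs. A minimal sketch follows; the field names and example values are assumptions for illustration, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class Step:
    observation: bytes  # the visual state of the screen, e.g. a screenshot
    action: str         # the user interaction, e.g. clicking or typing

# A trajectory is a user's journey within an application:
# the ordered sequence of interaction steps.
Trajectory = list[Step]

example: Trajectory = [
    Step(observation=b"<screenshot-1>", action="tap the search box"),
    Step(observation=b"<screenshot-2>", action="type 'running shoes'"),
    Step(observation=b"<screenshot-3>", action="tap the first result"),
]
```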
Defining Effective Intent Extraction
The researchers identified the key qualities of an effective extracted intent as:
- Faithful: Accurately describes what occurred in the trajectory.
- Comprehensive: Provides all necessary information to re-enact the trajectory.
- Relevant: Excludes extraneous information.
Evaluation Challenges
Evaluating extracted intents presents challenges due to complex details (e.g., dates, transaction data) and the inherent subjectivity and ambiguity of user motivations. Previous research found that even human annotators agreed on intent only 80% of the time for web trajectories and 76% for mobile trajectories.
After considering and ruling out techniques such as Chain of Thought (CoT) reasoning, which small language models handle poorly, the researchers selected the two-stage approach, which emulates CoT reasoning through explicit intermediate steps.
Stage One: Summarizing Individual Actions
In the first stage, prompting is used to generate a summary of each interaction, which consists of a visual screenshot and a textual representation of the action. This stage is prompt-based because no training data with summary labels for individual interactions is currently available. The summary of each interaction is divided into a description of the screen, a description of the user's action, and a