VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

¹National University of Singapore  ·  ²Meituan  ·  ³University of Science and Technology of China
⁴Beijing University of Posts and Telecommunications  ·  ⁵Zhejiang University
Corresponding authors: guqi03@meituan.com, an_zhang@ustc.edu.cn

Abstract

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios.

To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. Tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures.

We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.

Figure 1. Overview of VitaBench 2.0. Tasks are organized as temporally ordered sequences for each user. Agents must infer evolving user preferences from fragmented dialogues and behavior logs, maintain these preferences through a memory mechanism, and make personalized and proactive decisions in executable real-life service environments.

Overview

VitaBench 2.0 evaluates personalized and proactive agents through long-term, user-specific task sequences. Rather than treating each task as an isolated prompt, the benchmark places every request within an evolving user trajectory. The agent must rely on interaction histories, memory mechanisms, executable tools, and environment feedback to make decisions that are consistent with the user's preferences and current context.

The benchmark is organized around three complementary components. First, each user is associated with a structured profile and a set of fine-grained preferences that are revealed only indirectly through fragmented dialogues and behavior logs. Second, tasks are instantiated in executable service environments, where the user agent interacts with task-specific agents and tools to complete concrete service requests. These tasks are separated by temporal intervals, during which new interaction histories are generated and introduced into the user trajectory, requiring the agent to update its understanding of the user over time. Third, VitaBench 2.0 provides a unified memory interface that separates memory construction from task execution, allowing different memory mechanisms to be compared under the same tasks, tools, and evaluation rubrics.

The following sections describe these components and summarize the main capabilities evaluated by the benchmark.

1 What we evaluate

VitaBench 2.0 evaluates four core capabilities required for personalized and proactive agents.

  • Preference Extraction — infer stable user preferences from fragmented dialogues and behavior logs, while filtering out noisy or misleading signals.
  • Preference Utilization — apply the relevant preferences to the current task and consistently incorporate them into multi-step, multi-tool decision-making.
  • Preference Updating — track preference drift over time, including newly emerging preferences, inactive preferences, and modified preferences.
  • Proactive Interaction — identify missing information that is necessary for decision-making and actively acquire it through user interaction or environmental exploration before taking action.

2 How VitaBench 2.0 runs

VitaBench 2.0 is a sequential user-agent interaction benchmark. For each user, tasks arrive in temporal order. At each task, a user simulator issues a concrete request, and the agent must fulfill it by interacting with domain-specific tools and an executable environment. Between consecutive tasks, the agent is exposed to newly generated interaction histories that reflect emerging preferences or preference drift. The agent may then update its memory to maintain an evolving representation of the user. This design makes each user trajectory a continuous long-term interaction rather than a collection of independent prompts.
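The run procedure above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: `Task`, `run_user_trajectory`, the `solve` callback, and the plain list used as memory are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    """One request in a user's trajectory (hypothetical structure)."""
    request: str
    # Interaction histories that become visible before this task is issued.
    new_histories: List[str] = field(default_factory=list)


def run_user_trajectory(
    tasks: List[Task],
    solve: Callable[[str, List[str]], bool],
) -> List[bool]:
    """Process tasks in temporal order, updating memory between tasks.

    `solve` stands in for the agent plus environment: it receives the
    request and a snapshot of the memory, and returns pass/fail.
    """
    memory: List[str] = []          # assumption: memory is a flat list
    outcomes: List[bool] = []
    for task in tasks:
        memory.extend(task.new_histories)               # update step
        outcomes.append(solve(task.request, list(memory)))  # task step
    return outcomes
```

The point of the sketch is that memory persists across tasks for the same user, so a later task sees everything revealed before it, which is what makes a trajectory different from a set of independent prompts.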

Two design choices make the evaluation better aligned with realistic personalization. First, the user simulator does not directly reveal the underlying preferences. It issues requests from a predefined task list and provides only limited, controlled feedback, preventing agents from shortcutting preference inference through explicit simulator responses. Second, evaluation is rubric-based. Each task is paired with atomic constraints that reflect the user's preferences, such as item attributes, price ranges, temporal conditions, and contextual requirements. These rubrics are applied to both the agent trajectory and the final outcome, allowing VitaBench 2.0 to evaluate not only whether the final decision is correct, but also whether the agent follows a preference-consistent and information-seeking interaction process.
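As a rough illustration of rubric-based scoring, the sketch below represents each atomic constraint as a named predicate over a record of the run. The names `Rubric`, `evaluate`, and `task_passed`, the record layout, and the all-rubrics-must-hold rule are assumptions made for this example. In the benchmark itself the rubrics are judged by an LLM, so plain Python predicates only stand in for that judge here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Rubric:
    """One atomic constraint (hypothetical structure)."""
    description: str
    check: Callable[[Dict[str, Any]], bool]


def evaluate(rubrics: List[Rubric], record: Dict[str, Any]) -> Dict[str, bool]:
    """Apply every rubric to the run record and report each verdict."""
    return {r.description: r.check(record) for r in rubrics}


def task_passed(results: Dict[str, bool]) -> bool:
    """Assumption: a task counts as passed only if all rubrics hold."""
    return all(results.values())


rubrics = [
    # Outcome rubrics: item attribute and price range.
    Rubric("dish is vegetarian", lambda r: r["order"]["vegetarian"]),
    Rubric("price is at most 60", lambda r: r["order"]["price"] <= 60),
    # Trajectory rubric: the agent sought missing information first.
    Rubric("asked for delivery time",
           lambda r: "ask_delivery_time" in r["trajectory"]),
]

record = {
    "order": {"vegetarian": True, "price": 72},
    "trajectory": ["search", "ask_delivery_time", "place_order"],
}
results = evaluate(rubrics, record)
```

In this toy record the order violates the price constraint, so the task fails even though the other two rubrics hold. Reporting per-rubric verdicts shows which preference was missed.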

3 How we build the data

VitaBench 2.0 contains 56 users with carefully constructed profiles and more than 2,000 fine-grained preferences. User profiles cover diverse demographic, geographic, occupational, family, and lifestyle characteristics, while the preference set spans dining, travel, lodging, time, budget, leisure, shopping, and conditional preferences. Each user is associated with a temporally ordered sequence of 10-20 tasks, resulting in 819 subtasks across three real-life service domains: Delivery, In-store Consumption, and Online Travel Agency. These tasks are supported by 66 executable tools.

Preferences are not directly exposed to the agent. Instead, they are embedded in fragmented interaction histories consisting of dialogues and behavior logs, such as browsing, ordering, reviewing, and searching records. These histories contain both preference-relevant signals and noisy interactions that are irrelevant, ambiguous, or context-dependent. This setting requires agents to distinguish stable user preferences from incidental behaviors. To simulate long-term user dynamics, VitaBench 2.0 further introduces temporally grounded preference drift, where preferences may be added, deleted, or modified across the user's task sequence. Each task is paired with atomic rubrics that verify the final environment state rather than only the generated text.
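The three kinds of drift (added, deleted, modified) can be pictured as events applied to a preference set between tasks. The sketch below is illustrative only: `DriftEvent`, `apply_drift`, and the key-value representation of preferences are hypothetical and are not taken from the benchmark's data format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DriftEvent:
    """A single preference change (hypothetical structure)."""
    op: str                      # "add", "delete", or "modify"
    key: str
    value: Optional[str] = None  # unused for "delete"


def apply_drift(prefs: Dict[str, str],
                events: List[DriftEvent]) -> Dict[str, str]:
    """Return a new preference set with the events applied in order."""
    updated = dict(prefs)        # leave the caller's dict untouched
    for event in events:
        if event.op in ("add", "modify"):
            if event.value is None:
                raise ValueError(f"{event.op} needs a value: {event.key}")
            updated[event.key] = event.value
        elif event.op == "delete":
            updated.pop(event.key, None)
        else:
            raise ValueError(f"unknown drift op: {event.op}")
    return updated


before = {"spice": "mild", "budget": "under 60"}
after = apply_drift(before, [
    DriftEvent("add", "hotel", "quiet room"),
    DriftEvent("modify", "spice", "no spicy food"),
    DriftEvent("delete", "budget"),
])
```

The agent never sees such events directly. It only sees the dialogues and behavior logs that the changed preferences produce, and it has to infer the change from them.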

Figure 2. Overview of user profiles in VitaBench 2.0. The curated users cover diverse demographic, geographic, occupational, and family contexts, while focusing on active online-service scenarios to support realistic personalized task construction.
Figure 3. Overview of user preferences in VitaBench 2.0. Fine-grained preferences span multiple daily-life categories and service domains, with controlled preference drift introduced across temporal task sequences to evaluate long-term user modeling.

4 Unified memory interface

To systematically study the role of memory in personalization, VitaBench 2.0 defines an extensible memory interface with two operations: UPDATE(M, H) integrates newly observed interaction histories into the user's memory, while RETRIEVE(M, q) returns task-relevant information for the current query. When memory is enabled, the agent does not directly access the full interaction history during task execution. Instead, it must rely on the information returned by the memory module. This interface allows different memory architectures to be plugged into the same tasks, tools, and rubrics, enabling controlled comparison.

  • Agentic Memory. The memory module maintains a structured representation of the user, and the agent decides what information should be retained, updated, or discarded after each new history batch. This design tests the model's ability to perform selective abstraction, resolve conflicts, and preserve long-term consistency. Our reference implementation is based on MemAgent.
  • RAG Memory. Interaction records are stored in a vector database. UPDATE indexes new records, and RETRIEVE performs similarity-based lookup for the current query. This provides a strong retrieval-based baseline, but it does not explicitly control which information should be retained or how conflicting preferences should be resolved.
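One way to picture the two-operation interface is as an abstract base class that each memory architecture implements. The sketch below is a hedged illustration and not the benchmark's code: the class names, the `k` parameter, and the use of word overlap in place of vector similarity are all assumptions made to keep the example dependency-free.

```python
from abc import ABC, abstractmethod
from typing import List


class Memory(ABC):
    """Hypothetical rendering of UPDATE(M, H) and RETRIEVE(M, q)."""

    @abstractmethod
    def update(self, histories: List[str]) -> None:
        """Integrate newly observed interaction histories."""

    @abstractmethod
    def retrieve(self, query: str, k: int = 3) -> List[str]:
        """Return up to k records relevant to the current query."""


class OverlapMemory(Memory):
    """Toy retrieval memory: word overlap stands in for embeddings."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def update(self, histories: List[str]) -> None:
        self._records.extend(histories)   # index everything, discard nothing

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        q = set(query.lower().split())
        scored = [(len(q & set(rec.lower().split())), i, rec)
                  for i, rec in enumerate(self._records)]
        hits = [t for t in scored if t[0] > 0]
        hits.sort(key=lambda t: (-t[0], t[1]))  # best score, then oldest
        return [rec for _, _, rec in hits[:k]]


memory = OverlapMemory()
memory.update([
    "ordered vegetarian noodles for lunch",
    "searched hotels near the airport",
    "reviewed a vegetarian restaurant",
])
top = memory.retrieve("vegetarian lunch options", k=2)
```

Because the agent only calls `update` and `retrieve`, a different architecture can be swapped in without touching the tasks, tools, or rubrics. The toy class also shows the limitation noted for retrieval baselines: it matches surface words and has no notion of which of two conflicting records is current.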

Leaderboard

Performance under three memory settings, sorted by Avg@4 on Full Context. Bold = best in column.

 #  Model                Full Context            Agentic Memory          RAG Memory
                         Avg@4   Pass@4  Pass^4  Avg@4   Pass@4  Pass^4  Avg@4   Pass@4  Pass^4
 1  Claude-Opus-4.6      0.503   0.664   0.337   0.454   0.645   0.259   0.430   0.566   0.299
 2  Doubao-Seed-2.0-pro  0.474   0.683   0.270   0.428   0.650   0.225   0.339   0.496   0.205
 3  DeepSeek-V4-Pro      0.472   0.649   0.295   0.449   0.656   0.255   0.430   0.584   0.271
 4  GPT-5                0.441   0.658   0.226   0.421   0.647   0.204   0.410   0.591   0.236
 5  Claude-4.5-Sonnet    0.417   0.658   0.197   0.397   0.642   0.178   0.374   0.573   0.186
 6  o3                   0.403   0.653   0.169   0.401   0.669   0.154   0.362   0.587   0.158
 7  DeepSeek-R1-0528     0.396   0.691   0.131   0.412   0.712   0.118   0.390   0.643   0.153
 8  GLM-5.1              0.394   0.587   0.213   0.352   0.556   0.150   0.328   0.485   0.185
 9  Doubao-Seed-1.6      0.373   0.599   0.176   0.383   0.646   0.123   0.375   0.591   0.179
10  GLM-4.5              0.364   0.623   0.156   0.311   0.596   0.106   0.336   0.555   0.147
11  GLM-4.6              0.359   0.612   0.116   0.351   0.625   0.107   0.336   0.574   0.135
12  MiniMax-M2.7         0.345   0.584   0.145   0.351   0.609   0.124   0.314   0.518   0.143
13  Gemini-2.5-Pro       0.331   0.605   0.109   0.378   0.638   0.138   0.320   0.579   0.109
14  Kimi-K2.6            0.293   0.533   0.099   0.280   0.508   0.088   0.303   0.511   0.118
15  Qwen3-Max            0.284   0.499   0.105   0.324   0.599   0.091   0.315   0.519   0.134
16  Gemini-2.5-Flash     0.282   0.556   0.063   0.312   0.567   0.098   0.309   0.544   0.107
17  o4-mini              0.210   0.433   0.047   0.270   0.533   0.073   0.261   0.452   0.091

 #  Model                Full Context            Agentic Memory          RAG Memory
                         Avg@4   Pass@4  Pass^4  Avg@4   Pass@4  Pass^4  Avg@4   Pass@4  Pass^4
 1  DeepSeek-V4-Pro      0.456   0.652   0.267   0.427   0.658   0.207   0.424   0.618   0.247
 2  Doubao-Seed-2.0-pro  0.428   0.649   0.218   0.426   0.665   0.198   0.406   0.625   0.208
 3  GLM-5.1              0.420   0.654   0.204   0.423   0.664   0.182   0.383   0.585   0.200
 4  Kimi-K2.6            0.378   0.632   0.147   0.397   0.674   0.145   0.383   0.621   0.163
 5  GLM-4.6              0.342   0.612   0.113   0.336   0.623   0.084   0.317   0.555   0.123
 6  Doubao-Seed-1.6      0.326   0.512   0.171   0.340   0.576   0.129   0.351   0.543   0.174
 7  GLM-4.5              0.307   0.529   0.127   0.330   0.569   0.112   0.316   0.523   0.152
 8  LongCat-Flash-Chat   0.298   0.510   0.123   0.302   0.537   0.105   0.290   0.471   0.136
 9  GPT-3.5-Turbo        0.140   0.314   0.019   0.231   0.467   0.056   0.205   0.409   0.059
10  GPT-4o-mini          0.067   0.180   0.006   0.084   0.229   0.008   0.094   0.227   0.011

Each result is the average over four independent runs, evaluated with gpt-4.1-2025-04-14 as user simulator and rubric judge. Last update: May 2026
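This page does not define the three leaderboard metrics, so the sketch below assumes the conventional readings: Avg@k is the mean success rate over k runs, Pass@k is the fraction of tasks solved in at least one of the k runs, and Pass^k is the fraction solved in all k runs. The function name `summarize` and the input layout are hypothetical. If the paper defines the metrics differently, the paper takes precedence.

```python
from typing import Dict, List


def summarize(runs: List[List[bool]]) -> Dict[str, float]:
    """Compute Avg@k, Pass@k, and Pass^k under the assumed definitions.

    runs[i] holds the pass/fail outcomes of task i across its k runs.
    """
    if not runs or any(len(r) == 0 for r in runs):
        raise ValueError("every task needs at least one run")
    n = len(runs)
    avg_at_k = sum(sum(r) / len(r) for r in runs) / n   # mean success rate
    pass_at_k = sum(any(r) for r in runs) / n           # solved at least once
    pass_hat_k = sum(all(r) for r in runs) / n          # solved every time
    return {"avg@k": avg_at_k, "pass@k": pass_at_k, "pass^k": pass_hat_k}


# Four tasks, four runs each.
scores = summarize([
    [True, True, True, True],
    [True, False, False, True],
    [False, False, False, False],
    [False, True, False, False],
])
# scores == {"avg@k": 0.4375, "pass@k": 0.75, "pass^k": 0.25}
```

Under these definitions Pass^k can never exceed Avg@k, which in turn can never exceed Pass@k. The gap between Pass@4 and Pass^4 is therefore a rough measure of how inconsistent a model is across repeated attempts at the same task.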

Key Findings

01

Real-world personalization remains challenging for current agents.

Even under the Full Context setting, where agents have access to the complete interaction history, the best-performing models achieve only moderate scores. This suggests that current agents still struggle to identify, prioritize, and apply user preferences in long, noisy, and evolving interaction histories.

02

Memory is essential for performance, yet current models cannot effectively leverage it.

Although external memory is essential for long-term user modeling, existing memory mechanisms do not always translate stored user information into better task performance. Agentic Memory may lose or distort information during memory updates, while RAG Memory may retrieve surface-level matches that are not truly relevant to the current decision.

Figure 4. Avg@4 versus interaction turns for thinking and non-thinking models. Thinking and non-thinking models occupy overlapping regions on both axes — enabling thinking mode does not consistently yield higher performance (Avg@4) nor better efficiency (number of turns).
Figure 5. Performance across temporal task indices under different memory settings. Performance declines as the user trajectory progresses, indicating that preference drift, accumulated noise, and memory errors make later tasks more challenging.
03

There is a clear gap between general reasoning ability and personalization capability.

Figure 4 shows that thinking and non-thinking models occupy overlapping regions on both performance (Avg@4) and efficiency (number of turns), with no consistent advantage from enabling thinking mode. This indicates that personalization requires capabilities beyond step-by-step reasoning, including preference extraction, noise filtering, drift handling, and recognition of missing information.

04

Model task success rate declines as user interactions accumulate.

Figure 5 shows that average performance declines as the user trajectory progresses, under all three memory settings. As interaction histories become longer, preferences evolve, and memory errors accumulate, agents face increasing difficulty in maintaining a consistent and up-to-date understanding of the user.

Figure 6. Proactiveness and ground-truth preference analysis. Left: across model families, proactive task performance consistently lags behind personalization performance. Right: even when ground-truth user preferences are provided, top models achieve only moderate performance, suggesting that preference utilization remains a key bottleneck.
05

SOTA models often fail to realize they lack the information they need.

The left panel of Figure 6 shows that proactive task performance consistently trails personalization performance across model families. Agents often fail to recognize when the current query and memory are insufficient, leading them to make premature decisions instead of asking clarifying questions or seeking additional context.

06

Preference utilization alone is challenging for SOTA models.

The right panel of Figure 6 shows that even when ground-truth user preferences are provided, top models still reach only moderate performance. This suggests that the main challenge is not only extracting preferences from history, but also correctly ranking, combining, and applying them across multi-step tool-based decisions.

Figure 7. Failure-mode distribution for DeepSeek-R1 and DeepSeek-V4-Pro. As model capability advances from R1 to V4-Pro, tool-related errors (A1–A3) shrink while preference-related errors (B1–B4) grow to dominate — signaling that as models get stronger, personalization is emerging as the new bottleneck.
07

Personalization is becoming the new bottleneck as models get stronger.

Figure 7 shows that as model capability advances from DeepSeek-R1 to DeepSeek-V4-Pro, tool-related errors (A1–A3) shrink while preference-related errors (B1–B4) grow to dominate. Basic tool invocation is no longer the limiting factor for stronger models. Instead, frontier models still cannot reliably capture, prioritize, and apply user preferences, which suggests that future progress on real-world assistants depends more on personalization capability than on raw tool-use proficiency.

Citation

@article{chen2026vitabench,
  title   = {VitaBench 2.0: Evaluating Personalized and Proactive Agents
             in Long-Term User Interactions},
  author  = {Chen, Yuxin and Zhang, Yi and Cai, Zhengzhou and Shi, Yaorui
             and Yao, Zhiyuan and Cui, Chenhang and Zheng, Jingnan
             and Huo, Yaqi and Su, Xi and Gu, Qi and Cai, Xunliang
             and Wang, Xiang and Zhang, An and Chua, Tat-Seng},
  journal = {arXiv preprint arXiv:2605.27141},
  year    = {2026}
}