The first large-scale benchmark for LLM agents based on years of authentic, cross-domain user behavioral data rather than synthetic personas.
March 30, 2026
Original Paper
MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization
arXiv · 2603.25973
The Takeaway
Most 'long memory' benchmarks rely on toy tasks or short sessions. MemoryCD provides a realistic testbed for lifelong personalization, revealing that current state-of-the-art models fail to maintain user context over time and across disparate domains (e.g., shopping vs. healthcare).
From the abstract
Recent advancements in Large Language Models (LLMs) have expanded context windows to million-token scales, yet benchmarks for evaluating memory remain limited to short-session synthetic dialogues. We introduce MemoryCD, the first large-scale, user-centric, cross-domain memory benchmark derived from lifelong real-world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, MemoryCD tracks authentic …
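To make the construction idea concrete, here is a minimal sketch of how per-user, cross-domain timelines might be assembled from review-style records. This is not the paper's actual pipeline: the record schema, the thresholds, and the helper `build_user_timelines` are illustrative assumptions, not details from the abstract.

```python
from collections import defaultdict

# Hypothetical toy records in the spirit of the Amazon Review dataset.
# Field names (user_id, category, timestamp, text) are illustrative
# assumptions, not the benchmark's actual schema.
reviews = [
    {"user_id": "u1", "category": "Books", "timestamp": 1546300800, "text": "Loved this sci-fi novel."},
    {"user_id": "u1", "category": "Health", "timestamp": 1609459200, "text": "The vitamins worked well."},
    {"user_id": "u1", "category": "Electronics", "timestamp": 1672531200, "text": "Headphones broke quickly."},
    {"user_id": "u2", "category": "Books", "timestamp": 1580515200, "text": "Dry but informative."},
]

def build_user_timelines(records, min_events=2, min_domains=2):
    """Group reviews by user, order them chronologically, and keep only
    users whose history spans enough events and distinct domains."""
    by_user = defaultdict(list)
    for r in records:
        by_user[r["user_id"]].append(r)

    timelines = {}
    for user, events in by_user.items():
        events.sort(key=lambda r: r["timestamp"])
        domains = {r["category"] for r in events}
        if len(events) >= min_events and len(domains) >= min_domains:
            timelines[user] = events
    return timelines

if __name__ == "__main__":
    for user, events in build_user_timelines(reviews).items():
        span_years = (events[-1]["timestamp"] - events[0]["timestamp"]) / (365 * 24 * 3600)
        print(f"{user}: {len(events)} events across "
              f"{len({e['category'] for e in events})} domains, ~{span_years:.1f} years")
```

Filtering on both event count and domain count mirrors the "lifelong, cross-domain" framing in the abstract; the benchmark's actual selection criteria and question-generation steps are not specified in the excerpt above.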