Streaming Memory Benchmark: Diagnosing Memory with Evidence-Grounded Episodes
Introduction
Most existing memory benchmarks evaluate agents in a static setting: the full conversation history is provided at once, and questions are asked in aggregate. This is convenient for offline comparison, but it diverges from how memory works in deployed systems [1,2,3].
In production, memory is inherently streaming: interactions arrive incrementally, and the system must decide, under strict latency and token budgets, what to write, how to organize it, what to retrieve later, and how to apply the retrieved context. These constraints mean that each stage of the memory pipeline incurs real-time costs that directly impact deployability. As a result, what matters is not only end-task accuracy, but also retrieval fidelity and system efficiency at each stage. Existing benchmarks, however, focus primarily on aggregate accuracy, offering no visibility into which stage drives the cost or where failures occur.
To enable stage-level diagnosis, we ground our evaluation framework in cognitive psychology, which characterizes memory as a sequential process comprising four universal phases: encoding, maintenance, retrieval, and execution [4,5,6]. Building on this theoretical foundation, we propose Evidence-Based Streaming Evaluation by restructuring interactions into streaming episodes, where information is introduced as explicit Evidence (E) and must be retained to answer future Queries (Q). This yields a transparent memory pipeline, Formation (encoding), Management (maintenance), Retrieval, and Application (execution), and enables stage-level accountability for accuracy, latency, and token cost. Figure 1 visualizes this paradigm shift from static, opaque evaluation to evidence-based streaming evaluation.

Figure 1: The paradigm shift from current static, opaque evaluation to evidence-based streaming evaluation.
1. Evidence-Based Streaming Transformation
At the core of memory lies a universal truth: every valid recall is grounded in a specific antecedent, an Evidence. Regardless of the complexity of a dataset, any memory task can be traced back to this fundamental unit where user intent or factual information is first introduced. We leverage this by restructuring static dialogues into Streaming Episodes, where each episode is strictly designed to carry the necessary Evidence (E) required to resolve a future Query (Q).
This transformation enforces a strict Streaming Dependency, formalized as a sequence ⟨(E_1, Q_1), (E_2, Q_2), …, (E_n, Q_n)⟩, where each episode i introduces new evidence E_i and may trigger a query Q_i. The system must dynamically process and retain E_i in a continuous stream to successfully address Q_i. While we present single evidence-query pairs for clarity, this naturally generalizes: complex episodes decompose into multiple pairs. This design not only aligns with the dynamic state updates seen in real deployments but also offers a universal protocol for converting any static dataset into a controllable streaming benchmark. By mapping each episode to specific evidence, we gain granular visibility into the pipeline, enabling precise diagnosis across Formation, Management, Retrieval, and Application without the overhead of lengthy qualitative descriptions.
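As a concrete sketch, the transformation can be expressed as a small restructuring pass over an annotated static dialogue. The record format below (`evidence_turn_ids`, one turn per episode) is a hypothetical simplification for illustration, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One streaming unit: the turns delivered now, plus any query they unlock."""
    turns: list                                   # dialogue turns in this episode
    evidence: list                                # evidence introduced here (E)
    queries: list = field(default_factory=list)   # queries triggered here (Q)

def to_streaming_episodes(dialogue, qa_pairs):
    """Restructure a static dialogue into evidence-carrying streaming episodes.

    `dialogue` is a list of (turn_id, text) pairs; each item of `qa_pairs`
    has 'question', 'answer', and 'evidence_turn_ids' (the turns the query
    depends on) -- a hypothetical annotation format.
    """
    # A query becomes answerable only once its last evidence turn has streamed in.
    trigger = {}
    for qa in qa_pairs:
        last = max(qa["evidence_turn_ids"])
        trigger.setdefault(last, []).append(qa)

    episodes = []
    for turn_id, text in dialogue:
        ep = Episode(turns=[text], evidence=[])
        for qa in trigger.get(turn_id, []):
            ep.evidence.append(text)          # the turn carrying the final evidence
            ep.queries.append(qa["question"])
        episodes.append(ep)
    return episodes
```

Under this streaming order, no query ever precedes its evidence, which is exactly the dependency the protocol enforces.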
2. Memory Evaluation Protocols
Building on the four-stage framework introduced above, we focus on three evaluation protocols: Formation (F), Retrieval (R), and Application (A). Since Management operates without clear evidence-triggered boundaries, we leave its direct evaluation as future work.
Figure 2 illustrates this workflow with a concrete example: upon receiving user input about a yoga retreat, the system encodes the emphasis on "personal growth and mindfulness" (Formation), retrieves relevant historical context such as a recent museum visit (Retrieval), and synthesizes user preference evolution to generate the final response (Application).

Figure 2: The illustrative operational workflow of the memory evaluation protocols.
For each protocol X ∈ {F, R, A}, we measure both accuracy (Acc_X, a binary 0/1 judgment by an LLM evaluator) and token cost (Tok_X), then combine them into a unified Efficiency Index:

Eff_X = Acc_X / Tok_X

- Formation (Acc_F, Tok_F): whether the required evidence is encoded in newly generated memories; token cost is the size of the stored memories.
- Retrieval (Acc_R, Tok_R): whether the retrieved context contains the required evidence; token cost is the size of the retrieved context.
- Application (Acc_A, Tok_A): whether the final answer is correct; token cost is the size of the generated response.
Since raw efficiency values are hard to interpret across different systems, we report a relative score by normalizing against a simple LLM Baseline (details in Section 4):

Eff_X (reported) = Eff_X(system) / Eff_X(LLM Baseline)

so the baseline scores 1.000 by construction.
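The two helpers below sketch this index under the definitions above; the numbers in the usage line are taken from Table 4 (EverMemOS formation on LoCoMo vs. the baseline):

```python
def efficiency(acc, tok):
    """Raw efficiency: accuracy per token, Eff_X = Acc_X / Tok_X."""
    return acc / tok

def relative_efficiency(acc, tok, base_acc, base_tok):
    """Efficiency normalized against the LLM Baseline (baseline scores 1.0)."""
    return efficiency(acc, tok) / efficiency(base_acc, base_tok)

# Reproducing a Table 4 entry: EverMemOS formation on LoCoMo
# (Acc_F = 0.966, Tok_F = 287.62) vs. baseline (0.966, 119.40).
score = relative_efficiency(0.966, 287.62, 0.966, 119.40)  # ≈ 0.415
```

Note that when two systems tie on accuracy, the relative score reduces to a pure token-cost ratio, which is why Table 4 separates systems mostly by Tok_F.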
3. Dataset Characteristics
We apply streaming transformation to two primary datasets, LoCoMo [2] and PersonaMem [3], each representing a different conversational scale, as shown in Table 1.
| Dataset | Focus | Avg. Turns / Episode | Avg. Evidence / Episode | Core Challenge |
|---|---|---|---|---|
| LoCoMo | Fact-based Reasoning | ~5 | 2.06 | Frequent updates and high-density factual evidence |
| PersonaMem | Personal Preferences | ~20 | 2.84 | Long-range context tracking and sparse preference evolution |
Table 1: Dataset characteristics after streaming transformation.
The comparable evidence density (2.06 vs. 2.84 evidence items per episode) ensures a fair baseline for information density. The contrast in episode length (~5 vs. ~20 turns) allows us to evaluate memory systems across different temporal granularities: LoCoMo tests short episodes with dense updates, while PersonaMem tests long episodes with sparse signals.
To reflect real-world scenarios where queries occur long after the relevant information is introduced, we introduce an Episode-Delay Protocol: for a subset of questions, we delay the query by Δ episodes after its evidence appears, with dataset-specific values of Δ for LoCoMo and PersonaMem.
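The delay transformation itself is a simple shift of each query's trigger point. A minimal sketch, assuming a simplified dict-based episode format (the benchmark's actual data layout may differ):

```python
def apply_episode_delay(episodes, delay):
    """Shift each query `delay` episodes after the episode carrying its evidence.

    `episodes` is a list of dicts with 'evidence' and 'queries' keys,
    a simplified stand-in for the benchmark's episode format.
    """
    delayed = [{"evidence": ep["evidence"], "queries": []} for ep in episodes]
    for i, ep in enumerate(episodes):
        target = min(i + delay, len(episodes) - 1)  # clamp at the final episode
        delayed[target]["queries"].extend(ep["queries"])
    return delayed
```

Evidence stays in place; only the query moves, so the system must retain the evidence across the intervening episodes.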
4. System Components by Pipeline Stage
We analyze memory systems based on three dimensions that directly correspond to our granular evaluation protocols (F, R, A): Memory Form (evaluated by F), Retrieval Method (evaluated by R), and Answering Context (evaluated by A). The systems compared include EverMemOS [7], Mem0 [9], MemOS [8], and MemoryOS [10], plus an LLM baseline, as shown in Table 2.
| Method | Memory Form | Retrieval Method | Answering Context |
|---|---|---|---|
| EverMemOS | Event-centric Atomic Fact (Memcells) | Hybrid (BM25 + Vector) | Memcells (7-step CoT) |
| Mem0 | Attribute-centric Fact | Vector (Qdrant) | Fact-based Context |
| MemOS | Structured Key-Value Metadata | Vector (Qdrant) | Textual Memories |
| MemoryOS | Raw Dialogue with Hierarchical Memory | Multi-layer Search | Multi-source Context |
| LLM Baseline | Independent Fact | Vector (Cosine Sim) | Fact-based Context |
Table 2: System components by pipeline stage (formation, retrieval, answering).
Comparative Performance Analysis
In our experiments, we use Gemini-3-flash as the backbone model for all memory systems and as the LLM evaluator for accuracy judgments. We evaluate 50 episodes from each of LoCoMo and PersonaMem.
1. Latency Analysis: The Hidden Cost of Memory
We measure per-episode latency for each pipeline stage and the Total Latency (Formation + Retrieval + Application) on both LoCoMo and PersonaMem, using the same underlying model (Table 3).
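Per-stage latency can be collected by wrapping each stage call with a wall-clock timer. The `form`/`retrieve`/`answer` interface below is a hypothetical stand-in for whatever API a given memory system exposes, not the benchmark's actual harness:

```python
import time

def timed(fn, *args, **kwargs):
    """Run a stage callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def run_episode(system, episode, query):
    """Per-episode stage timing: Formation + Retrieval + Application.

    `system` is any object exposing form/retrieve/answer methods
    (an assumed interface for illustration).
    """
    _, t_form = timed(system.form, episode)       # Formation: write memories
    ctx, t_ret = timed(system.retrieve, query)    # Retrieval: fetch context
    ans, t_app = timed(system.answer, query, ctx) # Application: generate answer
    return ans, {"formation": t_form, "retrieval": t_ret,
                 "application": t_app, "total": t_form + t_ret + t_app}
```

Averaging these per-episode dicts over the evaluation set yields the stage and total columns of Table 3.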
| System | Dataset | Formation (s) | Retrieval (s) | Answer (s) | Total (s) |
|---|---|---|---|---|---|
| LLM Baseline | LoCoMo | 9.60 | 0.24 | 2.48 | 12.32 |
| LLM Baseline | PersonaMem | 11.77 | 0.31 | 3.44 | 15.52 |
| EverMemOS | LoCoMo | 7.93 | 0.40 | 8.08 | 16.41 |
| EverMemOS | PersonaMem | 9.21 | 0.51 | 9.46 | 19.18 |
| MemOS | LoCoMo | 23.96 | 0.76 | 4.45 | 29.17 |
| MemOS | PersonaMem | 40.05 | 0.92 | 8.35 | 49.32 |
| Mem0 | LoCoMo | 14.38 | 0.27 | 2.93 | 17.58 |
| Mem0 | PersonaMem | 14.43 | 0.30 | 2.93 | 17.66 |
| MemoryOS | LoCoMo | 24.15 | 1.04 | 3.58 | 28.77 |
| MemoryOS | PersonaMem | 40.35 | 1.90 | 3.88 | 46.13 |
Table 3: Per-episode latency by stage and total latency.
Formation emerges as the primary optimization target, with per-episode latencies ranging from 7.93 s (EverMemOS on LoCoMo) to over 40 s (MemOS and MemoryOS on PersonaMem). For production-facing interactive agents, a common expectation is single-digit seconds end-to-end, making the 10–50 s range observed here challenging to deploy. This underscores that formation-time optimization is critical for streaming memory systems.
2. Formation Results: Accuracy–Efficiency Trade-offs
Table 4 below reports formation accuracy (Acc_F), formation token cost (Tok_F), and the relative formation efficiency Eff_F.
| Dataset | System | Acc_F | Tok_F | Eff_F |
|---|---|---|---|---|
| LoCoMo | LLM Baseline | 0.966 | 119.40 | 1.000 |
| LoCoMo | EverMemOS | 0.966 | 287.62 | 0.415 |
| LoCoMo | MemOS | 0.986 | 161.35 | 0.755 |
| LoCoMo | Mem0 | 0.972 | 94.08 | 1.277 |
| LoCoMo | MemoryOS | 1.0* | 245.68 | 0.503 |
| PersonaMem | LLM Baseline | 0.935 | 199.20 | 1.000 |
| PersonaMem | EverMemOS | 0.987 | 511.06 | 0.415 |
| PersonaMem | MemOS | 1.0 | 293.24 | 0.755 |
| PersonaMem | Mem0 | 1.0 | 156.62 | 1.360 |
| PersonaMem | MemoryOS | 1.0* | 2329.97 | 0.091 |
Table 4: Formation results on LoCoMo (left) and PersonaMem (right).
(*) For MemoryOS, the Formation stage stores the original dialogue directly, so Acc_F is treated as 1.0 by definition.
Our results show a clear formation cost–quality frontier: near-perfect Acc_F is common, but token cost varies widely, making Tok_F (and hence Eff_F) the key differentiator. Notably, Mem0 maintains near-perfect Acc_F while improving Eff_F on the longer PersonaMem dataset (1.28 → 1.36), suggesting that attribute-centric facts can scale with episode length without proportional cost growth.
3. Category Analysis: When Retrieval and Application Diverge
To make the stage metrics easier to interpret, Table 5 compares Acc_R and Acc_A on two representative categories from the PersonaMem dataset and reports their gap (Acc_R − Acc_A).
| Category | Model | Acc_R | Acc_A | Gap (R−A) |
|---|---|---|---|---|
| suggest_new_ideas | LLM Baseline | 0.78 | 0.44 | +0.34 |
| | EverMemOS | 0.78 | 0.22 | +0.56 |
| | MemOS | 0.91 | 0.45 | +0.46 |
| | Mem0 | 0.89 | 0.33 | +0.56 |
| | MemoryOS | 0.83 | 0.25 | +0.58 |
| track_full_preference_evolution | LLM Baseline | 0.47 | 0.63 | -0.16 |
| | EverMemOS | 0.37 | 0.79 | -0.42 |
| | MemOS | 0.42 | 0.68 | -0.26 |
| | Mem0 | 0.58 | 0.68 | -0.10 |
| | MemoryOS | 0.32 | 0.42 | -0.10 |
Table 5: Category-level divergence between retrieval and answering (PersonaMem).
The table reveals two distinct patterns: retrieval succeeds but answering fails (Acc_R > Acc_A), or answering succeeds despite retrieval failure (Acc_A > Acc_R).
In suggest_new_ideas, most systems retrieve the right information but fail to answer correctly. The problem is that the retrieved context often contains multiple semantically similar options. For example, Mem0 retrieves 10+ relevant chunks (Acc_R = 0.89), but when faced with several similar suggestions, the model gets distracted by more salient or detailed descriptions and picks the wrong one (Acc_A = 0.33).
In track_full_preference_evolution, systems answer correctly more often than they retrieve the designated evidence. This occurs because PersonaMem's evidence is defined as the most recent preference mention before the query, but not the only one. Memory systems can leverage other nearby preference descriptions and the user's latest input to make approximately correct inferences, even when they miss the designated evidence.
These patterns reveal that retrieval and application must be measured separately, as end-task accuracy alone can mask whether the system truly retained the information or simply inferred it. Beyond measurement, these divergences highlight that retrieval and application must be jointly designed: what is retrieved should align with what the application stage can effectively process.
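The category-level numbers in Table 5 reduce to simple averages over per-question binary judgments. A sketch, assuming a hypothetical log format with 0/1 LLM-judge outcomes per stage:

```python
from collections import defaultdict

def stage_gaps(records):
    """Per-category Acc_R, Acc_A, and their gap from binary stage judgments.

    Each record is a dict with 'category', 'retrieval_correct', and
    'answer_correct' (0/1 LLM-judge outcomes) -- an assumed log format.
    """
    buckets = defaultdict(lambda: {"r": 0, "a": 0, "n": 0})
    for rec in records:
        b = buckets[rec["category"]]
        b["r"] += rec["retrieval_correct"]
        b["a"] += rec["answer_correct"]
        b["n"] += 1
    # Gap > 0: retrieval outpaces answering; gap < 0: answers recovered by inference.
    return {cat: {"acc_r": b["r"] / b["n"],
                  "acc_a": b["a"] / b["n"],
                  "gap": (b["r"] - b["a"]) / b["n"]}
            for cat, b in buckets.items()}
```

Because the two judgments are recorded per question, the sign of the gap directly distinguishes the two failure patterns discussed above.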
4. Results of Episode-Delay Settings
We evaluate long-range retention by delaying the query so that it is triggered Δ episodes after its evidence appears. Results are shown in Figure 3.

Figure 3: Performance changes under episode delay Δ.
The diagonal line represents equal degradation in retrieval and application. All systems fall to the left of the line, meaning retrieval accuracy drops more than application accuracy under delay. This indicates that models can partially compensate for retrieval failures through inference, maintaining reasonable application performance even when retrieval degrades. We also experimented with varying the delay amount, but found that performance changes remain relatively modest. This is likely because current datasets have limited conversation lengths and we did not constrain memory storage capacity—more targeted benchmark designs for long-range retention remain future work.
One notable exception is MemOS on LoCoMo, where application accuracy actually increases under delay. This occurs because: (1) MemOS may initially extract incomplete information from early conversations, and (2) when the same information reappears in later episodes, MemOS not only captures it correctly but also consolidates it with earlier memories. Under delay, some queries benefit from this more complete, consolidated memory state.
Overall, retrieval accuracy is more sensitive to delay than application accuracy, highlighting the importance of measuring retrieval fidelity separately, as end-task performance alone can mask underlying degradation of memory retrieval.
Conclusion
We introduced a Streaming Memory Benchmark designed for deployment-realistic evaluation. The motivation is simple: while most benchmarks are static, real memory systems are streaming and must operate under strict latency and token budgets—making retrieval fidelity and system efficiency as important as end-task accuracy.
Our benchmark operationalizes this setting through evidence-grounded episodes and a transparent pipeline. It enables: (1) stage-level diagnosis (formation vs. retrieval vs. application), (2) cost attribution (latency/token per stage), and (3) retention fidelity checks that reduce the risk of “getting it right for the wrong reasons.”
More broadly, this shifts memory evaluation toward a deployment-oriented research loop: track retrieval correctness and stage-wise efficiency over time, identify regressions, and optimize for real budgets rather than one-shot QA scores. Ultimately, a streaming benchmark turns evaluation into instrumentation—and instrumentation is what makes iteration, and eventually Learning from Real Experience, possible.
References
[1] Memory in the Age of AI Agents (Hu Y et al., 2025)
[2] Evaluating Very Long-Term Conversational Memory of LLM Agents (Maharana A et al., 2024)
[3] Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale (Jiang B et al., 2025)
[4] Complex Prospective Memory and Executive Control of Working Memory: A Process Model (Kliegel M et al., 2002)
[5] Working Memory: Maintenance, Updating, and the Realization of Intentions (Nyberg L et al., 2016)
[6] From Correlation to Causation: Understanding Episodic Memory Networks (Khan A et al., 2025)
[7] EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning (Hu C et al., 2026)
[8] MemOS: A Memory OS for AI System (Li Z et al., 2025)
[9] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory (Chhikara P et al., 2025)
[10] Memory OS of AI Agent (Kang J et al., 2025)
Author
Mind Lab
Core Contributors
Guanming Liu, Alex Yin, Tianchen Li, Andrew Chen, Pony Ma
Team
Andrew Chen, Kaijie Chen, Songlin Jiang, Yuhua Jiang, Xiang Lei, Guanming Liu, Qihan Liu, Tianchen Li, Yiwen Lu, Pony Ma, Warrior Xu, Alex Yin, Rio Yang and Mindverse Team
Acknowledgement
Special thanks to our guest scientists, Hongru Wang (University of Edinburgh) and Sikuan Yan (Ludwig Maximilian University of Munich), for their valuable feedback on this blog.
Citation
Please cite this work using the BibTeX citation:
@misc{liu2026streamingbenchmark, author = {Guanming Liu and Alex Yin and Tianchen Li and Hongru Wang and Sikuan Yan and Andrew Chen and Pony Ma and {Mind Lab}}, title = {Streaming Memory Benchmark: Diagnosing Memory with Evidence-Grounded Episodes}, year = {2026}, howpublished = {Mind Lab: A Lab for Experiential Intelligence}, note = {https://macaron.im/mindlab/research/streaming-memory-benchmark-diagnosing-memory-with-evidence-grounded-episodes} }