Streaming Memory Benchmark: Diagnosing Memory with Evidence-Grounded Episodes

Introduction

Most existing memory benchmarks evaluate agents in a static setting: the full conversation history is provided at once, and questions are asked in aggregate. This is convenient for offline comparison, but it diverges from how memory works in deployed systems [1,2,3].

In production, memory is inherently streaming: interactions arrive incrementally, and the system must decide, under strict latency and token budgets, what to write, how to organize it, what to retrieve later, and how to apply the retrieved context. These constraints mean that each stage of the memory pipeline incurs real-time costs that directly impact deployability. As a result, what matters is not only end-task accuracy, but also retrieval fidelity and system efficiency at each stage. However, existing benchmarks focus primarily on aggregate accuracy, offering no visibility into which stage drives the cost or where failures occur.

To enable stage-level diagnosis, we ground our evaluation framework in cognitive psychology, which characterizes memory as a sequential process comprising four universal phases: encoding, maintenance, retrieval, and execution [4,5,6]. Building on this theoretical foundation, we propose Evidence-Based Streaming Evaluation by restructuring interactions into streaming episodes, where information is introduced as explicit Evidence (E) and must be retained to answer future Queries (Q). This yields a transparent memory pipeline, Formation (encoding), Management (maintenance), Retrieval, and Application (execution), and enables stage-level accountability for accuracy, latency, and token cost. Figure 1 visualizes this paradigm shift from static, opaque evaluation to evidence-based streaming evaluation.


Figure 1: The paradigm shift from current static, opaque evaluation to evidence-based streaming evaluation.

1. Evidence-Based Streaming Transformation

At the core of memory lies a universal truth: every valid recall is grounded in a specific antecedent, an Evidence. Regardless of the complexity of a dataset, any memory task can be traced back to this fundamental unit, where user intent or factual information is first introduced. We leverage this by restructuring static dialogues into Streaming Episodes, where each episode is strictly designed to carry the Evidence ($E$) required to resolve a future Query ($Q$).

This transformation enforces a strict Streaming Dependency, formalized as a sequence $S = \langle (E_1, Q_1), \dots, (E_t, Q_t), \dots \rangle$, where each episode $t$ provides new evidence $E_t$ and may trigger a query $Q_t$. The system must dynamically process and retain $E_t$ in a continuous stream to successfully address $Q_t$. While we present single evidence-query pairs for clarity, this naturally generalizes: complex episodes decompose into multiple $(E, Q)$ pairs. This design not only aligns with the reality of dynamic state updates in deployed systems but also offers a universal protocol for converting any static dataset into a controllable streaming benchmark. By mapping each episode to specific evidence, we gain granular visibility into the pipeline, enabling precise diagnosis across Formation, Management, Retrieval, and Application without the overhead of lengthy qualitative descriptions.

2. Memory Evaluation Protocols

Building on the four-stage framework introduced above, we focus on three evaluation protocols: Formation (F), Retrieval (R), and Application (A). Since Management operates without clear evidence-triggered boundaries, we leave its direct evaluation as future work.

Figure 2 illustrates this workflow with a concrete example: upon receiving user input about a yoga retreat, the system encodes the emphasis on "personal growth and mindfulness" (Formation), retrieves relevant historical context such as a recent museum visit (Retrieval), and synthesizes user preference evolution to generate the final response (Application).


Figure 2: The illustrative operational workflow of the memory evaluation protocols.

For each protocol, we measure both accuracy ($s_{acc}$, a binary 0/1 judgment by an LLM evaluator) and token cost ($s_{tok}$), then combine them into a unified Efficiency Index:

$$I_{s} = \frac{s_{acc}}{s_{tok}}, \quad s \in \{F, R, A\}$$

Formation ($F_{acc}$, $F_{tok}$): whether the required evidence is encoded in newly generated memories; token cost is the size of the stored memories.

Retrieval ($R_{acc}$, $R_{tok}$): whether the retrieved context contains the required evidence; token cost is the size of the retrieved context.

Application ($A_{acc}$, $A_{tok}$): whether the final answer is correct; token cost is the size of the generated response.
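A minimal sketch of how one stage could be scored under these definitions. The substring check stands in for the LLM judge the benchmark actually uses, and the whitespace split is a stand-in for a real tokenizer:

```python
def stage_metrics(stage_output: str, required_evidence: list[str]) -> tuple[int, int]:
    """Score one pipeline stage: binary accuracy s_acc (does the stage output
    contain all required evidence?) and token cost s_tok (size of the output).
    Substring matching approximates the LLM judge; a real tokenizer would
    replace the whitespace split."""
    contains_all = all(ev.lower() in stage_output.lower() for ev in required_evidence)
    s_acc = 1 if contains_all else 0
    s_tok = len(stage_output.split())
    return s_acc, s_tok
```

The same scorer applies to stored memories (Formation), retrieved context (Retrieval), and the generated response (Application), with the appropriate text passed in.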

Since raw efficiency values are hard to interpret across different systems, we report a relative score by normalizing against a simple LLM Baseline (details in Section 4):

$$I_{s}^{rel} = \frac{I_{s}}{I_{s}^{base}}$$

3. Dataset Characteristics

We apply streaming transformation to two primary datasets, LoCoMo [2] and PersonaMem [3], each representing a different conversational scale, as shown in Table 1.

| Dataset | Focus | Avg. Turns / Episode | Avg. Evidence / Episode | Core Challenge |
| --- | --- | --- | --- | --- |
| LoCoMo | Fact-based Reasoning | ~5 | 2.06 | Frequent updates and high-density factual evidence |
| PersonaMem | Personal Preferences | ~20 | 2.84 | Long-range context tracking and sparse preference evolution |

Table 1: Dataset characteristics after streaming transformation.

The similarity in "Evidence per Episode" (2.06 vs. 2.84) ensures a fair baseline for information density. The contrast in episode length (~5 vs. ~20 turns) allows us to evaluate memory systems across different temporal granularities: LoCoMo tests short episodes with dense updates, while PersonaMem tests long episodes with sparse signals.

To reflect real-world scenarios where queries occur long after the relevant information is introduced, we introduce an Episode-Delay Protocol: for a subset of questions, we delay the query by $\Delta$ episodes after its evidence appears ($\Delta = 15$ for LoCoMo, $\Delta = 5$ for PersonaMem).
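A sketch of the delay transformation, assuming each episode is a simple (evidence, query) pair; the real protocol applies this only to a subset of questions:

```python
def apply_episode_delay(stream, delta):
    """Shift each query delta episodes later than the episode that triggers it.
    `stream`: list of (evidence_list, query_or_None) pairs. Returns a list of
    (evidence_list, [queries]); queries whose delayed slot falls past the end
    of the stream are appended as extra evidence-free episodes."""
    n = len(stream)
    out = [(evidence, []) for evidence, _ in stream]
    overflow = []
    for t, (_, query) in enumerate(stream):
        if query is None:
            continue
        target = t + delta
        if target < n:
            out[target][1].append(query)
        else:
            overflow.append(([], [query]))
    return out + overflow
```

With this shape, the retention window between evidence and query is controlled explicitly by `delta`.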

4. System Components by Pipeline Stage

We analyze memory systems based on three dimensions that directly correspond to our granular evaluation protocols (F, R, A): Memory Form (evaluated by F), Retrieval Method (evaluated by R), and Answering Context (evaluated by A). The systems compared include EverMemOS [7], Mem0 [9], MemOS [8], and MemoryOS [10], plus an LLM baseline, as shown in Table 2.

| Method | Memory Form | Retrieval Method | Answering Context |
| --- | --- | --- | --- |
| EverMemOS | Event-centric Atomic Fact (Memcells) | Hybrid (BM25 + Vector) | Memcells (7-step CoT) |
| Mem0 | Attribute-centric Fact | Vector (Qdrant) | Fact-based Context |
| MemOS | Structured Key-Value Metadata | Vector (Qdrant) | Textual Memories |
| MemoryOS | Raw Dialogue with Hierarchical Memory | Multi-layer Search | Multi-source Context |
| LLM Baseline | Independent Fact | Vector (Cosine Sim) | Fact-based Context |

Table 2: System components by pipeline stage (formation, retrieval, answering).

Comparative Performance Analysis

In our experiments, we use Gemini-3-flash as the backbone model for all memory systems and as the LLM evaluator for accuracy judgments. We evaluate 50 episodes from each of LoCoMo and PersonaMem.

1. Latency Analysis: The Hidden Cost of Memory

We measure per-episode latency for each pipeline stage and the Total Latency (Formation + Retrieval + Application) on both LoCoMo and PersonaMem datasets, using the same underlying model (Table 3).

| System | Dataset | Formation (s) | Retrieval (s) | Answer (s) | Total (s) |
| --- | --- | --- | --- | --- | --- |
| LLM Baseline | LoCoMo | 9.60 | 0.24 | 2.48 | 12.32 |
| LLM Baseline | PersonaMem | 11.77 | 0.31 | 3.44 | 15.52 |
| EverMemOS | LoCoMo | 7.93 | 0.40 | 8.08 | 16.41 |
| EverMemOS | PersonaMem | 9.21 | 0.51 | 9.46 | 19.18 |
| MemOS | LoCoMo | 23.96 | 0.76 | 4.45 | 29.17 |
| MemOS | PersonaMem | 40.05 | 0.92 | 8.35 | 49.32 |
| Mem0 | LoCoMo | 14.38 | 0.27 | 2.93 | 17.58 |
| Mem0 | PersonaMem | 14.43 | 0.30 | 2.93 | 17.66 |
| MemoryOS | LoCoMo | 24.15 | 1.04 | 3.58 | 28.77 |
| MemoryOS | PersonaMem | 40.35 | 1.90 | 3.88 | 46.13 |

Table 3: Per-episode latency by stage and total latency.
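The per-stage breakdown above can be collected with a simple timing harness. The stage entry points (`form_fn`, `retrieve_fn`, `answer_fn`) are hypothetical signatures; real memory systems expose different APIs:

```python
import time

def timed(fn, *args, **kwargs):
    """Run one pipeline stage and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def episode_latency(form_fn, retrieve_fn, answer_fn, episode, query):
    """Per-episode latency broken down by stage, mirroring Table 3's columns."""
    _memories, t_form = timed(form_fn, episode)          # Formation: write/encode
    context, t_ret = timed(retrieve_fn, query)           # Retrieval: fetch context
    _answer, t_ans = timed(answer_fn, query, context)    # Application: generate answer
    return {"formation": t_form, "retrieval": t_ret,
            "answer": t_ans, "total": t_form + t_ret + t_ans}
```

`time.perf_counter` is monotonic and high-resolution, which makes it suitable for wall-clock stage timing.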

Formation emerges as the primary optimization target, with latencies ranging from 7.93s (EverMemOS) to 40s (MemOS/MemoryOS on PersonaMem). For production-facing interactive agents, a common expectation is single-digit seconds end-to-end, making the 10–50s range observed here challenging to deploy. This underscores that formation-time optimization is critical for streaming memory systems.

2. Formation Results: Accuracy–Efficiency Trade-offs

Table 4 below reports formation accuracy ($F_{acc}$), formation token cost ($F_{tok}$), and the relative formation efficiency $I_{F}^{rel}$.

| Dataset | System | $F_{acc}$ | $F_{tok}$ | $I_{F}^{rel}$ |
| --- | --- | --- | --- | --- |
| LoCoMo | LLM Baseline | 0.966 | 119.40 | 1.000 |
| LoCoMo | EverMemOS | 0.966 | 287.62 | 0.415 |
| LoCoMo | MemOS | 0.986 | 161.35 | 0.755 |
| LoCoMo | Mem0 | 0.972 | 94.08 | 1.277 |
| LoCoMo | MemoryOS | 1.0* | 245.68 | 0.503 |
| PersonaMem | LLM Baseline | 0.935 | 199.20 | 1.000 |
| PersonaMem | EverMemOS | 0.987 | 511.06 | 0.411 |
| PersonaMem | MemOS | 1.0 | 293.24 | 0.727 |
| PersonaMem | Mem0 | 1.0 | 156.62 | 1.360 |
| PersonaMem | MemoryOS | 1.0* | 2329.97 | 0.091 |

Table 4: Formation results on LoCoMo and PersonaMem.

  • For MemoryOS, the Formation stage stores the original dialogue directly, so $F_{acc}$ is treated as 1.0 by definition (marked 1.0* in Table 4).

Our results show a clear formation cost–quality frontier: near-perfect $F_{acc}$ is common, but token cost varies widely, making $I_{F}^{rel}$ the key differentiator. Notably, Mem0 maintains near-perfect $F_{acc}$ while improving $I_{F}^{rel}$ on the longer PersonaMem dataset (1.28 → 1.36), suggesting attribute-centric facts can scale with episode length without proportional cost growth.
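The relative-efficiency column follows directly from the Section 2 definition, so Table 4's entries can be recomputed from the accuracy and token columns; for example, Mem0 reproduces the reported 1.277 on LoCoMo and 1.360 on PersonaMem:

```python
def relative_formation_efficiency(acc, tok, base_acc, base_tok):
    """I_F^rel = (F_acc / F_tok) / (F_acc^base / F_tok^base)."""
    return (acc / tok) / (base_acc / base_tok)

# Mem0 vs. the LLM baseline, values taken from Table 4
locomo_rel = relative_formation_efficiency(0.972, 94.08, 0.966, 119.40)
personamem_rel = relative_formation_efficiency(1.0, 156.62, 0.935, 199.20)
```

This is just the efficiency-index ratio written out; it is shown here only to make the table's derivation explicit.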

3. Category Analysis: When Retrieval and Application Diverge

To make the stage metrics easier to interpret, Table 5 compares $R_{acc}$ and $A_{acc}$ on two representative categories from the PersonaMem dataset and reports their gap ($R - A$).

| Category | Model | $R_{acc}$ | $A_{acc}$ | Gap (R-A) |
| --- | --- | --- | --- | --- |
| suggest_new_ideas | LLM Baseline | 0.78 | 0.44 | +0.34 |
| | EverMemOS | 0.78 | 0.22 | +0.56 |
| | MemOS | 0.91 | 0.45 | +0.46 |
| | Mem0 | 0.89 | 0.33 | +0.56 |
| | MemoryOS | 0.83 | 0.25 | +0.58 |
| track_full_preference_evolution | LLM Baseline | 0.47 | 0.63 | -0.16 |
| | EverMemOS | 0.37 | 0.79 | -0.42 |
| | MemOS | 0.42 | 0.68 | -0.26 |
| | Mem0 | 0.58 | 0.68 | -0.10 |
| | MemoryOS | 0.32 | 0.42 | -0.10 |

Table 5: Category-level divergence between retrieval and answering (PersonaMem).

The table reveals two distinct patterns: retrieval succeeds but answering fails ($R > A$), or answering succeeds despite retrieval failure ($A > R$).

In suggest_new_ideas, most systems retrieve the right information but fail to answer correctly. The problem is that the retrieved context often contains multiple semantically similar options. For example, Mem0 retrieves 10+ relevant chunks ($R_{acc} = 0.89$), but when faced with several similar suggestions, the model gets distracted by more salient or detailed descriptions and picks the wrong one ($A_{acc} = 0.33$).

In track_full_preference_evolution, systems answer correctly more often than they retrieve the designated evidence. This occurs because PersonaMem defines the evidence as the most recent preference mention before the query, yet that mention is rarely the only relevant one: memory systems can leverage other nearby preference descriptions and the user's latest input to make approximately correct inferences, even when they miss the designated evidence.

These patterns reveal that retrieval and application must be measured separately, as end-task accuracy alone can mask whether the system truly retained the information or simply inferred it. Beyond measurement, these divergences highlight that retrieval and application must be jointly designed: what is retrieved should align with what the application stage can effectively process.

4. Results of Episode-Delay Settings

We evaluate long-range retention by delaying the query by $\Delta$ episodes after its evidence appears. Results are shown in Figure 3.


Figure 3: Performance changes under episode delay.

The diagonal line represents equal degradation in retrieval and application. All systems fall to the left of the line, meaning retrieval accuracy drops more than application accuracy under delay. This indicates that models can partially compensate for retrieval failures through inference, maintaining reasonable application performance even when retrieval degrades. We also experimented with varying the delay amount, but found that performance changes remain relatively modest. This is likely because current datasets have limited conversation lengths and we did not constrain memory storage capacity—more targeted benchmark designs for long-range retention remain future work.

One notable exception is MemOS on LoCoMo, where application accuracy actually increases under delay. This occurs because: (1) MemOS may initially extract incomplete information from early conversations, and (2) when the same information reappears in later episodes, MemOS not only captures it correctly but also consolidates it with earlier memories. Under delay, some queries benefit from this more complete, consolidated memory state.

Overall, retrieval accuracy is more sensitive to delay than application accuracy, highlighting the importance of measuring retrieval fidelity separately, as end-task performance alone can mask underlying degradation of memory retrieval.

Conclusion

We introduced a Streaming Memory Benchmark designed for deployment-realistic evaluation. The motivation is simple: while most benchmarks are static, real memory systems are streaming and must operate under strict latency and token budgets—making retrieval fidelity and system efficiency as important as end-task accuracy.

Our benchmark operationalizes this setting through evidence-grounded episodes and a transparent pipeline. It enables: (1) stage-level diagnosis (formation vs. retrieval vs. application), (2) cost attribution (latency/token per stage), and (3) retention fidelity checks that reduce the risk of “getting it right for the wrong reasons.”

More broadly, this shifts memory evaluation toward a deployment-oriented research loop: track retrieval correctness and stage-wise efficiency over time, identify regressions, and optimize for real budgets rather than one-shot QA scores. Ultimately, a streaming benchmark turns evaluation into instrumentation—and instrumentation is what makes iteration, and eventually Learning from Real Experience, possible.

References

[1] Memory in the Age of AI Agents (Hu Y et al, 2025)

[2] Evaluating very long-term conversational memory of llm agents (Maharana A et al, 2024)

[3] Know me, respond to me: Benchmarking llms for dynamic user profiling and personalized responses at scale (Jiang B et al, 2025)

[4] Complex prospective memory and executive control of working memory: A process model (Kliegel M et al, 2002)

[5] Working memory: Maintenance, updating, and the realization of intentions (Nyberg L et al, 2016)

[6] From Correlation to Causation: Understanding Episodic Memory Networks (Khan A et al, 2025)

[7] EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning (Hu C et al, 2026)

[8] Memos: A memory os for ai system (Li Z et al, 2025)

[9] Mem0: Building production-ready ai agents with scalable long-term memory (Chhikara P et al, 2025)

[10] Memory OS of AI Agent (Kang J et al, 2025)

Author

Mind Lab

Core Contributors

Guanming Liu, Alex Yin, Tianchen Li, Andrew Chen, Pony Ma

Team

Andrew Chen, Kaijie Chen, Songlin Jiang, Yuhua Jiang, Xiang Lei, Guanming Liu, Qihan Liu, Tianchen Li, Yiwen Lu, Pony Ma, Warrior Xu, Alex Yin, Rio Yang and Mindverse Team

Acknowledgement

Special thanks to our guest scientists, Hongru Wang (University of Edinburgh) and Sikuan Yan (Ludwig Maximilian University of Munich), for their valuable feedback on this blog.

Citation

Please cite this work using the BibTeX citation:

@misc{liu2026streamingbenchmark,
  author       = {Guanming Liu and Alex Yin and Tianchen Li and Hongru Wang and Sikuan Yan and Andrew Chen and Pony Ma and {Mind Lab}},
  title        = {Streaming Memory Benchmark: Diagnosing Memory with Evidence-Grounded Episodes},
  year         = {2026},
  howpublished = {Mind Lab: A Lab for Experiential Intelligence},
  note         = {https://macaron.im/mindlab/research/streaming-memory-benchmark-diagnosing-memory-with-evidence-grounded-episodes}
}

Mind Lab © 2025 · contact@mindlab.ltd