From Context Engineering to Context Learning
AI agent development is at an inflection point: from Context Engineering to Context Learning.
This transition extends our ongoing research at Mind Lab on "Experiential Intelligence"—intelligence that continuously evolves through real-world interactions. It also addresses the most pressing practical question in the industry today: now that we have powerful foundation models and mature agent stacks, how do we transform the temporary gains of Context Engineering into a model's permanent capability and long-term memory?
1. The Bottleneck: From "Using Context" to "Learning Context"

Figure 1: From using context to learning context.
Today's pre-trained models are great at passing tests but often underwhelming on real-world tasks: they crush benchmarks, yet they struggle to handle the messy reality of production environments.
The obvious fix is post-training, specifically Reinforcement Learning (RL). But in practice, RL is a nightmare to get right. Most developers get stuck at the very beginning: the reward model. Unless you're working on math or code, it's almost impossible to design a reward that the model won't immediately "hack".
Because we couldn't easily train models for specific tasks, we found a workaround: Context Engineering. Since we couldn't change the model's brain, we focused on cramming the right info into its "working memory." We built RAG pipelines, tool-use protocols like MCP, and elaborate agent stacks just to help the base model do its job [1,2].
But Context Engineering has a low ceiling: it doesn't stick. The model isn't actually getting smarter; it's just reading from better notes. If the retrieval fails or the prompt isn't perfect, the model instantly "forgets" the task and acts like a stranger. You end up in a manual loop, building new context pipelines for every single new task.
The real question is: how do we take the knowledge we've gathered in Context Engineering and bake it directly into the model? We want the model to internalize its experience and self-iterate. That is the shift to Context Learning.
2. The Three Pillars of Agentic RL

Figure 2: The three pillars of Agentic RL.
In the post-training era, Agentic Reinforcement Learning (RL) has emerged as a crucial focus [3]. It rests on three major pillars:
- Reasoning: Recently propelled to new heights by models like DeepSeek-R1 and the integration of Chain-of-Thought [4,5].
- Tool Use: Ecosystems like Claude (including MCP and computer-use capabilities) have established mature paradigms for utilizing systemic tools (grep, sed, bash) and orchestrating diverse skills [6,7,8,9].
- Memory: The third pillar, and the one that hasn't been cracked yet [10].
While reasoning and tool use now have clear developmental recipes and successful commercial products, memory remains difficult. And memory is what ultimately dictates the user experience. Working memory (anything within the context window) is manageable. True friction arises with long-term memory, the accumulation of user habits, preferences, constraints, and historical context. Ultimately, seamless long-term memory is indistinguishable from a model's ability to internalize information through training.
2.1 The Need for Parametric Memory

Figure 3: Parametric memory vs search-and-stuff memory.
A robust memory system should remember who we are, what we have built together, and our specific preferences. It should not operate on a disjointed "search-and-stuff" basis, where every new query triggers a frantic RAG search to populate the prompt. If the retrieval fails, the agent "forgets" the user and acts like a stranger.
Real long-term memory must be parametric. A Large Language Model (LLM) is the ultimate parametric memory engine: it flawlessly remembers the world's general knowledge. But it struggles to remember "me" or maintain a stable, accurate feature list for "my product."
2.2 Why Long-Term Memory is Hard to Train

Figure 4: Why long-term memory is hard: credit assignment and reward gaps.
Why haven't we solved parametric memory? Because the RL tricks that work for reasoning and tool use fail when applied to long-term memory. There are two core hurdles:
- Credit Assignment: It is notoriously difficult to trace today's correct output back to a specific fact learned or generalized months ago [3].
- Lack of Verifiable Rewards: Memory quality involves long-horizon, subtle improvements. Unlike math or code, where correctness is binary and easily verifiable, memory is subjective.
As a result, most teams default to Context Engineering [1,2]. But even the most sophisticated retrieval systems can yield disjointed experiences. We need a new direction: Context Learning, an improvement mechanism that works alongside Context Engineering without relying on strict, verifiable rewards.
Terminology Note: Context Learning vs. In-Context Learning
- In-Context Learning (ICL): A test-time prompting technique where demonstrations are placed within the context window [5,11].
- Context Learning: Training the model's parameters so the context-driven improvement becomes a permanent, long-term capability. The term was recently formalized by the CL-bench benchmark, which evaluates a model's ability to internalize new rules [12]. Our philosophy shares this spirit but is learning-first rather than evaluation-first.
3. The Mechanics of Growth: Internalizing the Signal

Figure 5: The mechanics of growth: signal and internalization.
For an AI system to truly iterate and improve, it requires two fundamental properties:
- A mechanism to generate a signal that identifies "this outcome is better."
- A mechanism to internalize that signal into its memory.
In modern LLM post-training, "storing the gain" is efficiently handled by Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA [13]. Finding the right learning signal, however, is much harder.
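For intuition on why PEFT makes "storing the gain" cheap: LoRA keeps the base weight W frozen and trains only a low-rank delta B @ A, so each update touches far fewer parameters than full fine-tuning. A dependency-free toy sketch (all shapes and values here are illustrative, not a real training setup):

```python
# Toy LoRA arithmetic: W_eff = W + B @ A, with rank r << min(d_out, d_in).

def matmul(B, A):
    """Multiply two matrices given as lists of rows."""
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def lora_effective_weight(W, B, A):
    """Frozen base weight W plus the trainable low-rank delta B @ A."""
    delta = matmul(B, A)
    return [[W[i][j] + delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# d_out = 2, d_in = 3, rank r = 1: six frozen parameters in W,
# only 2 + 3 = 5 trainable ones in B and A (the gap widens as d grows).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[0.5],
     [1.0]]            # d_out x r
A = [[0.2, 0.0, 0.4]]  # r x d_in

W_eff = lora_effective_weight(W, B, A)
# W_eff ≈ [[1.1, 0.0, 0.2], [0.2, 1.0, 0.4]]
```

Only B and A ever change during an update; discarding them recovers the base model exactly, which is what makes per-task or per-user adapters practical.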
Forcing rigid, rubric-based rewards on non-deterministic tasks often collapses the model's goal into merely "following the rules," which is fundamentally different from internalizing context. On-Policy Distillation takes a different approach [14]: rather than pure Supervised Fine-Tuning (SFT), it uses RL-style online updates to distill knowledge, which has proven effective at resisting catastrophic forgetting. This is the foundation for our framework: Context Distillation.
4. Context Distillation: Writing "Context Gains" into Parameters

Figure 6: Context distillation: writing context gains into parameters.
Context Distillation asks a simple question: If our Context Engineering pipeline makes the model better at test time, can we systematically encode that gain into the model's parameters so it retains the capability even without the external context?
Practically, it is a "context → parameters" transfer. At test time, we leverage demonstrations, RAG results, tool specs, and execution outcomes. Feeding the model the query plus this context yields a dense, high-quality scoring signal for the output it generated from the query alone. We then convert this scoring signal into training updates.
```python
def context_distill(model, query, build_context, rl_update):
    # Step 1: Query-only produces an on-policy rollout
    out = model.sample(query)
    # Step 2: Query + context scores the rollout with token-level rewards
    ctx = build_context(query)
    r_tok = model.token_reward(query, ctx, out)
    # Step 3: RL-style update using only the rewards
    return rl_update(model, query, out, r_tok)
```
Terminology Note: Context Distillation
The term "Context Distillation" was introduced by Snell et al. (2022) as an off-policy method (producing targets with context, then learning them via SFT without context) [15]. Our approach is an on-policy variant: the context is exclusively used to generate token-level rewards for RL-style learning, never as visible input during the update itself [14].
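The two flavors can be put side by side in code. This is a structural sketch only: every helper is passed in as a callable, and all of the names (generate_with_context, sft_update, sample, token_reward, rl_update) are illustrative stand-ins, not a real API.

```python
# Off-policy (Snell et al., 2022): the teacher generates WITH context,
# and the student is fine-tuned to imitate that target WITHOUT context.
def off_policy_context_distillation(model, query, build_context,
                                    generate_with_context, sft_update):
    ctx = build_context(query)
    target = generate_with_context(model, query, ctx)  # teacher target
    return sft_update(model, query, target)            # imitate, no context

# On-policy variant: the student samples WITHOUT context, and the context
# is used only to score that rollout with token-level rewards.
def on_policy_context_distillation(model, query, build_context,
                                   sample, token_reward, rl_update):
    out = sample(model, query)                         # student rollout
    ctx = build_context(query)
    rewards = token_reward(model, query, ctx, out)     # context as judge
    return rl_update(model, query, out, rewards)       # update from rewards
```

The key difference is where the context enters: as the source of an imitation target in the off-policy case, versus purely as a scoring signal in the on-policy case.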
Under this framework, recent self-distillation methods—whether turning mistakes into dense learning signals or reducing continual learning forgetfulness—are effectively special cases of using extra information (context) to create stable parameter updates [16,17].
5. Context Learning: Policy Iteration via Context Distillation
By stringing together these distillation steps, we establish a continuous learning loop: Context Learning.
For every real-world query, the model generates an on-policy output (query-only). It then evaluates itself (query + context) to generate rewards, and updates its parameters. Over time, this mirrors the classic Policy Iteration process [3]:
```python
def context_learning(model, queries, build_context, rl_update, steps):
    for _ in range(steps):
        model = context_distill(model, next(queries), build_context, rl_update)
    return model
```
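To make the loop tangible, here is a self-contained toy instantiation: the "model" is a one-parameter Bernoulli policy, the "context" is an oracle that always knows the correct token, and ToyModel, build_context, and rl_update are all illustrative stand-ins rather than real training code.

```python
import random

class ToyModel:
    """A one-parameter Bernoulli 'policy': emit token 1 with probability p."""
    def __init__(self, p):
        self.p = p

    def sample(self, query):
        # Query-only rollout.
        return 1 if random.random() < self.p else 0

    def token_reward(self, query, ctx, out):
        # Query + context: the oracle context scores the rollout.
        return 1.0 if out == ctx["answer"] else 0.0

def context_distill(model, query, build_context, rl_update):
    out = model.sample(query)                   # on-policy rollout
    ctx = build_context(query)
    r = model.token_reward(query, ctx, out)     # context scores it
    return rl_update(model, query, out, r)

def context_learning(model, queries, build_context, rl_update, steps):
    for _ in range(steps):
        model = context_distill(model, next(queries), build_context, rl_update)
    return model

def build_context(query):
    return {"answer": 1}                        # retrieval finds the answer

def rl_update(model, query, out, reward, lr=0.05):
    if reward > 0:                              # reinforce rewarded rollouts
        model.p = min(0.99, model.p + lr * reward)
    return model

random.seed(0)
model = context_learning(ToyModel(p=0.2), iter(lambda: "q", None),
                         build_context, rl_update, steps=200)
# model.p has drifted well above its starting value of 0.2
```

Each iteration follows the same three beats as the real loop: sample without context, score with context, update from the score alone.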
This works because of the Test-Time Scaling Law, driven by two dimensions [4,5]:
- More Compute/Reasoning: Extra test-time reasoning yields better decisions and checks.
- More Information: Richer context (RAG, tool feedback) closes information gaps.

Figure 7: Test-time scaling: more reasoning and more information.
5.1 An Example: Internalizing RAG Knowledge
To make this loop concrete, consider a memory system powered by RAG. RAG is one of the most widely used Context Engineering techniques in production: retrieve external knowledge and inject it at runtime to improve quality [18].
Its limitation is that the gain is still external and transient. If retrieval misses or context budget is tight, quality drops. Context Learning addresses this by converting external retrieval gains into internal parametric memory.
Let's walk through a concrete iteration loop. The symbols M1...M10 below represent retrieval slices from a larger memory store.
The iteration:
- Query A arrives. The retriever returns M1...M5.
- The model answers with query-only under policy π0, then uses query + M1...M5 to score that rollout and update once. After this step, useful parts of M1...M5 are partially internalized in parameters.
- Query B arrives. The retriever now returns M6...M10; M1...M5 may not appear in the context window.
- Even without explicitly retrieving M1...M5, policy π1 starts Query B from a stronger foundation than π0, because the previous update already stored part of that knowledge.
This is the RAG2LoRA move: context retrieved from an external RAG system is progressively internalized into a LoRA-based parametric memory.
For the same Query B, π1 with M6...M10 outperforms the one-shot baseline π0 with M6...M10. Each query doesn't just benefit from its own retrieved context, but from all previously learned context across the entire history. Repeated over many queries, this is Policy Iteration in Context Learning made concrete.
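The walkthrough above can be caricatured in a few lines. Here "parameters" are modeled as a plain set of absorbed fact IDs, which deliberately overstates how cleanly internalization works; real internalization is partial and lives in LoRA weights, not a lookup table. All names are illustrative.

```python
# External memory store holding slices M1...M10.
store = {f"M{i}": f"fact-{i}" for i in range(1, 11)}

def retrieve(slice_ids):
    """Stand-in for the RAG retriever."""
    return {k: store[k] for k in slice_ids}

def answerable(internal, retrieved, needed):
    """Can the model cover `needed` facts from parameters plus context?"""
    return all(k in internal or k in retrieved for k in needed)

internal = set()                       # "parametric" memory, initially empty

# Query A: retriever returns M1...M5; the distillation step internalizes them.
ctx_a = retrieve(["M1", "M2", "M3", "M4", "M5"])
internal |= set(ctx_a)                 # context -> parameters

# Query B: retriever returns M6...M10 only, but B also needs M2.
ctx_b = retrieve(["M6", "M7", "M8", "M9", "M10"])
# With a fresh model, M2 would be missing; after the Query A update,
# coverage comes from parameters plus context combined.
```

The point of the caricature is the coverage check: after the Query A update, Query B no longer depends on the retriever resurfacing M1...M5.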
5.2 Context Engineering Rewritten: From Runtime Support to Teacher-Signal Design
Under Context Learning, Context Engineering is no longer just a real-time serving layer that patches missing information at inference. It becomes a teacher-signal design stack that decides what the model should learn and how strongly it should update.
This shift unlocks three fundamental differences:
- More Reasoning Effort for the Teacher Signal. Because this signal is generated for training updates rather than user-facing latency, it can afford deeper reasoning and more expensive verification passes. We can run multi-step verification, cross-check with external tools, or even invoke slower but more accurate models to score outputs.
- Bidirectional and Post-Outcome Evaluation. Real-time systems can only look backward at past user history and previous turns. A teacher signal can look both forward and backward. Like BERT's bidirectional attention, we can see the execution result before scoring the intermediate steps. We don't need to obey strict online Markovian constraints. This is similar to how reward models in RLHF can score an entire trajectory after seeing the outcome.
- From Context Payloads to Meta-Knowledge Signals. The goal is not only to provide relevant snippets for one response, but to construct higher-quality meta-knowledge that produces better parameter updates. This means the Context Engineering pipeline must optimize for learnability, not just immediate answer quality. We care about what sticks in the model's parameters, not just what helps right now.
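As a sketch of what a teacher signal can do that a serving-time context pipeline cannot: score the whole trajectory only after the outcome is known, using an arbitrarily expensive verifier. The verify callable and the linear credit weighting below are illustrative assumptions, not a prescribed design.

```python
def score_trajectory(steps, outcome, verify):
    """Assign a per-step teacher signal using the final outcome.

    `verify(outcome)` stands in for an expensive post-hoc check (extra
    reasoning, tool calls, a slower judge model) that a latency-bound
    serving path could never afford.
    """
    outcome_score = verify(outcome)     # post-outcome, "bidirectional" view
    # Later steps sit closer to the outcome, so give them more credit;
    # this is one simple credit-assignment heuristic, not the only one.
    n = len(steps)
    return [outcome_score * (i + 1) / n for i in range(n)]

# Toy usage: three intermediate steps, a verified-good outcome.
rewards = score_trajectory(["plan", "call_tool", "draft"],
                           outcome="final_answer",
                           verify=lambda o: 1.0)
# rewards == [1/3, 2/3, 1.0]
```

Because the signal is produced offline, verify can be swapped for multi-step verification or a slower judge model without touching user-facing latency.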
This is a paradigm shift: Context Engineering evolves from "helping the model answer right now" to "teaching the model to learn for the future." The infrastructure we build is no longer just for serving better responses, but for generating better training signals that compound over time.
6. Three Agent Experiences Reshaped

Figure 8: Three agent experiences reshaped by context learning.
We believe this paradigm will change how we build and interact with AI.
6.1 A New Division of Labor in Agent Building
Historically, agent development required hand-crafted rewards and constant battles against reward hacking. Context Learning shifts this burden. Product Managers and Engineers can now focus entirely on building a superior Context Engineering pipeline (better retrieval, tool orchestration, and reasoning triggers). This pipeline naturally produces the learning signals that the model internalizes, drastically shortening the path from "system design" to "lasting capability." Trajectories stop being disposable logs; they become reusable fuel for growth.
6.2 Self-Iteration via Self-Evolution
Many modern systems can already reflect on their mistakes and generate reusable procedures from agent trajectories [19,20,21]. Under the Context Learning paradigm, these dynamically generated skills are not just saved as text files for future context—they become training inputs. This bridges system-level evolution and model-level evolution, turning dynamically discovered skills into stable, parametric knowledge.
6.3 True Personalization
A truly personalized model has long been out of reach, largely because defining a clean objective function for individual human preference is nearly impossible. Context Learning points toward a practical path. By building a personalized pipeline that retrieves a user's specific memories as context, repeated on-policy updates gradually weave those preferences into the model's parameters. Paired with PEFT, every user could maintain a personal LoRA adapter (100MB–1GB), continually updating and evolving alongside them [13].
7. This is Only the Beginning
We are currently running broad experiments around Context Learning to map its upper limits, focusing on personalization quality, advanced parametric memory, and long-horizon coding tasks.
At Mind Lab, our core vision is Experiential Intelligence: models that learn continuously from real users and real products. Context Learning is what closes this loop, turning transient test-time gains into permanent training-time growth.
We think this is a concrete step toward true continual learning and models that genuinely "grow with you." We will keep working in this direction, and we invite researchers, builders, and entrepreneurs to join us.
References
[1] Context Engineering for AI Agents: Lessons from Building Manus (Ji Y. et al., 2025)
[2] Effective context engineering for AI agents (Anthropic, 2025)
[3] Reinforcement Learning: An Introduction (Sutton R. S. and Barto A. G., 2018)
[4] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025)
[5] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei J. et al., 2022)
[6] Introducing the Model Context Protocol (Anthropic, 2024)
[7] Model Context Protocol (MCP) Specification (Anthropic, 2025)
[8] Computer Use Tool (Anthropic, 2024)
[9] Developing computer use (Anthropic, 2024)
[10] MemGPT: Towards LLMs as Operating Systems (Packer C. et al., 2023)
[11] Language Models are Few-Shot Learners (Brown T. et al., 2020)
[12] CL-bench: A Benchmark for Context Learning (Dou et al., 2026)
[13] LoRA: Low-Rank Adaptation of Large Language Models (Hu E. J. et al., 2021)
[14] On-Policy Distillation (Lu K. et al., 2025)
[15] Learning by Distilling Context (Snell C. et al., 2022)
[16] Reinforcement Learning via Self-Distillation (Hübotter J. et al., 2026)
[17] Self-Distillation Enables Continual Learning (Shenfeld I. et al., 2026)
[18] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis P. et al., 2020)
[19] SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning (Xia P. et al., 2026)
[20] MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents (Zhang H. et al., 2026)
[21] Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn N. et al., 2023)
Author
Mind Lab
Core Contributors
Pony Ma, Andrew Chen
Team
Andrew Chen, Kaijie Chen, Song Cao, Nolan Ho, Songlin Jiang, Fancy Kong, Xiang Lei, Lucian Li, Qihan Liu, Tianchen Li, Yiwen Lu, Pony Ma, Warrior Xu, Wenbin Wang, Alex Yin, Rio Yang, Di Zhang, Conley Zhao, Congjie Zheng and Mindverse Team
Names are listed alphabetically within team.
Citation
Please cite this work using the BibTeX citation:
@misc{ma2026contextlearning,
  author = {Pony Ma and Andrew Chen and {Mind Lab}},
  title = {From Context Engineering to Context Learning},
  year = {2026},
  howpublished = {Mind Lab: A Lab for Experiential Intelligence},
  note = {https://macaron.im/mindlab/research/from-context-engineering-to-context-learning}
}