Inside Macaron's Memory Engine: Compression, Retrieval and Dynamic Gating

Author: Boxu Li at Macaron


Introduction

While the novelty of Macaron AI often draws attention to its ability to generate custom mini‑apps or to act as an empathetic friend, its true backbone is an intricate memory engine. This system allows Macaron to remember what matters, forget what doesn't, and retrieve relevant experiences quickly and safely. A simple conversation about music can lead to reminders about a concert next month, an automatically compiled playlist, or the generation of a karaoke assistant. None of this is possible without memory mechanisms capable of handling long dialogues and diverse topics. This blog provides a deep technical dive into Macaron's memory engine, discussing hierarchical compression, vector retrieval, reinforcement‑guided gating and privacy control. We compare Macaron's design with other retrieval‑augmented generation (RAG) systems and discuss how these mechanisms enable Japanese and Korean users to enjoy personalized experiences.

1 Hierarchical Memory Representation

1.1 Multi‑store architecture: short‑term, episodic and long‑term

Macaron organizes memory into multiple stores. The short‑term store maintains the current conversation and spans roughly 8–16 messages. It behaves like a standard transformer context window: tokens are processed with full self‑attention. The episodic store holds recent interactions (e.g., the last few days) and is refreshed periodically. Here, Macaron employs a compressive transformer: messages are compressed into summary vectors using convolutional attention, enabling the model to maintain context beyond the native window length. The long‑term store keeps important events, facts and mini‑app configurations and is implemented as a vector database. Each memory item includes metadata (timestamp, domain tags, language tags) and an embedding produced by a multilingual encoder.
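
To make the layout concrete, here is a minimal sketch of the three stores as plain Python structures. The field names, the 16‑message eviction rule and the hand‑off to the episodic store are illustrative assumptions, not Macaron's actual schema.

```python
from dataclasses import dataclass, field
import time

# Illustrative memory item; field names are assumptions, not Macaron's schema.
@dataclass
class MemoryItem:
    embedding: list[float]                 # from a multilingual encoder
    text: str                              # raw or summarized content
    timestamp: float = field(default_factory=time.time)
    domain_tags: tuple[str, ...] = ()      # e.g. ("music", "events")
    language: str = "ja"                   # language tag, e.g. "ja" or "ko"

class MemoryEngine:
    def __init__(self, short_term_limit: int = 16):
        self.limit = short_term_limit
        self.short_term: list[str] = []        # current conversation window
        self.episodic: list[MemoryItem] = []   # compressed recent summaries
        self.long_term: list[MemoryItem] = []  # vector-indexed in production

    def observe(self, message: str) -> None:
        self.short_term.append(message)
        while len(self.short_term) > self.limit:
            # Evicted messages would be summarized into the episodic store
            # by the compression layer described in section 1.2.
            self.short_term.pop(0)
```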

1.2 Compression via latent summarization and autoencoding

One of the key challenges in long conversations is that the cost of self‑attention grows quadratically with sequence length. To manage this, Macaron employs a latent summarization layer: rather than attending to every token, the model learns to identify salient segments and compress them into a fixed‑length representation. This layer is trained using an autoencoding objective that reconstructs hidden states from compressed summaries. Reinforcement learning fine‑tunes the summarizer: if the agent fails to recall important details later, the policy is penalized, encouraging it to retain more information about similar events in the future.
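
One plausible shape for such a layer, sketched in PyTorch, is a bank of learned query slots that cross‑attends over a token segment to produce a fixed‑length summary, with a reconstruction loss providing the autoencoding signal. Dimensions and module choices here are assumptions rather than Macaron's implementation.

```python
import torch
import torch.nn as nn

class LatentSummarizer(nn.Module):
    """Compress a variable-length segment into n_summary fixed slots."""

    def __init__(self, d_model: int = 512, n_summary: int = 8):
        super().__init__()
        # Learned query vectors pool the segment via cross-attention.
        self.queries = nn.Parameter(torch.randn(n_summary, d_model))
        self.pool = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.decoder = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, d_model) token states to compress
        q = self.queries.unsqueeze(0).expand(hidden.size(0), -1, -1)
        summary, _ = self.pool(q, hidden, hidden)          # (batch, n_summary, d_model)
        recon, _ = self.decoder(hidden, summary, summary)  # reconstruct from summary
        loss = nn.functional.mse_loss(recon, hidden)       # autoencoding objective
        return summary, loss
```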

1.3 Dynamic memory token as a pointer network

The memory token described in the Taiwan news article functions like a pointer that traverses memory to pick relevant items. During recall, the token iteratively queries the memory bank: it retrieves a candidate memory, evaluates its relevance to the current context using a learned scoring function, and decides whether to return it or continue searching. This process is akin to a pointer network used in neural combinatorial optimization. Reinforcement signals guide the token to select sequences of memories that maximize user satisfaction (e.g., correctly predicting a user's preference for jazz). The token can also update the memory: when new information arrives, it decides whether to merge it with existing memories or allocate a new slot.
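
A toy version of that recall loop might look like the following; the scoring function is a stand‑in for the learned relevance model, and the simple averaging update is a placeholder for the learned pointer transition.

```python
import numpy as np

def recall(query_vec, memory_bank, score, max_hops=5, accept=0.8):
    """Pointer-style recall: retrieve, evaluate, return or keep searching."""
    state = np.asarray(query_vec, dtype=float)
    remaining = list(memory_bank)
    selected = []
    for _ in range(max_hops):
        if not remaining:
            break
        scores = [score(state, m) for m in remaining]
        best = int(np.argmax(scores))
        if scores[best] < accept:
            break                          # nothing relevant enough: stop searching
        item = remaining.pop(best)
        selected.append(item)
        # Move the pointer toward the retrieved memory so the next hop
        # can follow a chain of related items.
        state = (state + np.asarray(item.embedding, dtype=float)) / 2
    return selected
```

In a trained system, the acceptance threshold and state update would themselves be shaped by the reinforcement signal rather than fixed.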

2 Vector Retrieval and Query Expansion

2.1 Approximate nearest neighbour search

Macaron's long‑term memory uses a high‑dimensional vector database. Queries are converted into embeddings via a multilingual encoder; then an approximate nearest neighbour (ANN) search returns the top‑k memories. The system uses product quantization to accelerate search and maintain a latency below 50 ms, even when storing millions of memory items. To avoid retrieving trivial duplicates, the system applies maximal marginal relevance (MMR), balancing similarity and diversity among results.
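
Using Faiss as a representative ANN backend, a product‑quantized index plus an MMR re‑ranker could be wired together roughly as below; the index parameters and the λ trade‑off are illustrative.

```python
import numpy as np
import faiss  # assumes the faiss library is available

d, nlist, m = 384, 100, 48             # embedding dim, coarse cells, PQ sub-vectors
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per sub-code
# index.train(embeddings); index.add(embeddings)     # offline build step
# dists, ids = index.search(query_vecs, k)           # fast ANN lookup

def mmr(query, candidates, k=5, lam=0.7):
    """Maximal marginal relevance over unit-normalized embeddings."""
    sims = candidates @ query                  # similarity to the query
    selected, rest = [], list(range(len(candidates)))
    while rest and len(selected) < k:
        def mmr_score(i):
            redundancy = max((candidates[i] @ candidates[j] for j in selected),
                             default=0.0)
            return lam * sims[i] - (1 - lam) * redundancy
        best = max(rest, key=mmr_score)
        selected.append(best)
        rest.remove(best)
    return selected                            # indices into candidates
```

Lowering λ pushes the re‑ranker toward diversity, which is what suppresses the near‑duplicate memories mentioned above.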

2.2 Query expansion using context and user goals

Simple keyword matching is not enough to capture user intent. Macaron expands queries using the user's current goal and latent intent. For example, if a user in Tokyo mentions "花火大会" (fireworks festival), the system expands the query to include "tickets", "date" and "weather" based on typical actions related to festivals. If a Korean user asks about "김치전 만드는 법" (how to make kimchi pancakes), the system also searches for past cooking experiences, nutrition data and local ingredient availability. Query expansion is handled by a goal predictor trained to map conversation context to a set of relevant subtopics.
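
The table‑driven sketch below stands in for that goal predictor purely to show how expansion feeds retrieval; a production predictor would be a trained model, not a lookup.

```python
# Toy stand-in for the learned goal predictor: detected topic -> subtopics.
SUBTOPICS = {
    "fireworks festival": ["tickets", "date", "weather"],
    "kimchi pancakes": ["past cooking sessions", "nutrition", "local ingredients"],
}

def expand_query(query: str, topic: str) -> list[str]:
    """Return the original query plus one expanded query per subtopic."""
    return [query] + [f"{query} {sub}" for sub in SUBTOPICS.get(topic, [])]

# expand_query("花火大会", "fireworks festival")
# -> ["花火大会", "花火大会 tickets", "花火大会 date", "花火大会 weather"]
```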

2.3 Cross‑domain retrieval and relevance federation

The memory engine must handle queries that span multiple domains. The relevance federation mechanism described in Macaron's self‑model article allows the system to access memories across domain boundaries. When the agent helps a Japanese user plan a wedding, it might need to retrieve travel memories (honeymoon destinations), finance memories (budget) and cultural memories (wedding etiquette). Each domain has its own retrieval index, and the system uses a softmax gating function to distribute retrieval probabilities across domains. The gating function is trained with RL to minimize retrieval of irrelevant items while ensuring important cross‑domain connections aren't missed. For cross‑lingual queries, the gating function also considers language tags to prefer same‑language memories but allows cross‑language retrieval when semantic similarity is high.
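
One way to read this mechanism is sketched below: a gate scores each domain index, the scores are softmaxed into a retrieval budget, and each index is queried in proportion. The gate_logits function and the per‑index search method are assumed interfaces, not Macaron's actual API.

```python
import numpy as np

def federated_retrieve(query_vec, domain_indices, gate_logits, k_total=10):
    """Split a retrieval budget across domain indices by softmax gating."""
    names = list(domain_indices)
    logits = np.array([gate_logits(query_vec, name) for name in names])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over domains
    results = []
    for name, p in zip(names, probs):
        k = max(1, round(k_total * float(p)))  # budget proportional to gate prob
        results.extend(domain_indices[name].search(query_vec, k))
    return results
```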

3 Reinforcement‑Guided Memory Gating

3.1 Reward modelling and FireAct inspiration

The Macaron team was inspired by the FireAct project, which demonstrated that RL post‑training improves reasoning accuracy by 77% compared with prompt‑based methods. In Macaron, RL is used to train the memory gating policy: a neural network that decides whether to store, update or discard information and how strongly to weight retrieved memories. The reward function combines multiple signals: task completion, user satisfaction, privacy compliance and computational efficiency. For instance, retrieving too many memories slows down responses, so the reward penalizes unnecessary recall. Forgetting relevant details lowers user satisfaction, so the policy learns to keep them longer. The reward function is tuned differently for the Japanese and Korean markets: for Japanese users it may penalize oversharing of private details more heavily, while for Korean users it may weight speed and proactive suggestions more strongly.
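
A composite reward of this kind might be combined as in the following sketch; the weights and signal names are assumptions and, as noted above, would be tuned per market.

```python
def gating_reward(task_done: float, satisfaction: float,
                  privacy_ok: bool, n_retrieved: int,
                  w=(1.0, 0.5, 2.0, 0.05)):
    """Combine task, satisfaction, privacy and efficiency signals."""
    w_task, w_sat, w_priv, w_cost = w
    reward = w_task * task_done + w_sat * satisfaction
    if not privacy_ok:
        reward -= w_priv                 # hard penalty for policy violations
    reward -= w_cost * n_retrieved       # unnecessary recall slows responses
    return reward
```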

3.2 Temporal credit assignment and time weaving

Reinforcement learning often struggles with long horizons: actions taken now may affect outcomes far in the future. Macaron addresses this through time weaving, a mechanism where events across time are connected by timestamps and narrative threads. When evaluating the impact of recalling an old memory, the system can trace the chain of interactions that followed. This allows the RL agent to assign credit or blame to specific retrieval decisions. For example, if referencing a forgotten anniversary improves a relationship, the system assigns positive reward to the memory gate that preserved the anniversary memory. If re‑surfacing an embarrassing moment causes discomfort, the gate receives a negative reward.
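
In code, one simple form of this credit assignment walks the woven event chain backwards and discounts the outcome reward at each step; the chain and event fields here are hypothetical illustrations.

```python
def assign_credit(event_chain, outcome_reward, decay=0.9):
    """Propagate a delayed outcome back along a timestamp-linked chain."""
    credits = {}
    r = outcome_reward
    for event in reversed(event_chain):      # walk newest to oldest
        if event.kind == "retrieval":        # only gate decisions are credited
            credits[event.id] = credits.get(event.id, 0.0) + r
        r *= decay                           # older decisions get less credit
    return credits
```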

3.3 Hierarchical RL and modular gating policies

Macaron uses hierarchical reinforcement learning to manage complexity. A high‑level controller selects modules (e.g., retrieval, summarization, compression) based on the user's current goal, while low‑level policies handle specific actions within each module. This modular design facilitates transfer learning: a gating policy trained for Japanese cooking conversations can be reused for Korean recipes. It also allows Macaron to update individual modules without retraining the entire system. To ensure stability, Macaron employs proximal policy optimization (PPO) with trust region clipping, balancing exploration and exploitation and preventing catastrophic forgetting.
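
For reference, the clipped surrogate at the heart of PPO is standard and fits in a few lines; this is the textbook formulation, not Macaron‑specific code.

```python
import torch

def ppo_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate: limit how far the new policy moves per update."""
    ratio = torch.exp(logp_new - logp_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()  # negated for gradient descent
```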

4 Comparison with Other Memory Systems

4.1 Retrieval‑augmented generation (RAG)

Many AI systems use retrieval‑augmented generation to improve factual accuracy by pulling information from external databases. Models like GPT‑4 with RAG rely on static knowledge bases and do not adapt retrieval based on user feedback. Macaron's memory engine differs in three key ways:

  1. Personalized content: memories are user‑specific rather than generic web documents. Retrieval yields experiences and goals, not encyclopedic facts.

  2. Reinforcement‑guided storage: the system learns what to store or forget based on reward signals, whereas RAG systems often store everything indiscriminately.

  3. Privacy and policy binding: each memory includes privacy metadata, and retrieval respects access rules. Most RAG implementations lack such fine‑grained control.

4.2 Long‑context language models

Recent LLMs like Anthropic's Claude 3 and Google's Gemini can handle contexts of hundreds of thousands of tokens by scaling the attention window. These models do not perform explicit retrieval; instead, they rely on the ability to attend to long sequences. While this allows them to recall earlier conversation segments, it is computationally expensive and does not support user‑controlled forgetting. Macaron combines a moderately sized context window with retrieval to achieve similar coverage at lower cost and with greater privacy control. The dynamic memory token acts as a pointer to external storage, enabling the model to handle years of data without storing everything in active context.

4.3 Vector databases and memory networks

Vector stores such as Pinecone and similarity‑search libraries such as Faiss are often used to hold embeddings for retrieval tasks. Macaron's long‑term store builds on these technologies but integrates them with RL‑controlled gating. Meanwhile, early memory networks like the End‑to‑End Memory Network precompute a fixed set of memory slots and attend over them with soft attention. Macaron extends this by allowing the number of slots to grow or shrink dynamically and by using RL to decide which slots remain. In this sense, Macaron's memory engine is more akin to a neural Turing machine with a learned controller that reads and writes to an external memory tape.

5 Privacy and Regulatory Alignment

5.1 Policy binding and differentiated transparency

Compliance with regional regulations is crucial. Policy binding attaches machine‑readable privacy rules to data. For instance, a memory containing financial data might include a rule that it can only be accessed after biometric authentication. Differentiated transparency offers varying levels of disclosure to different stakeholders: a Japanese consumer can review their own data, a Korean regulator can see aggregated statistics, and developers get anonymized feedback for model improvement. These mechanisms align with the AI Promotion Act's emphasis on transparency and Korea's AI Framework Act requirements for risk management and human oversight.
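
Policy binding can be made concrete with a small sketch: the rules travel with the memory item and are evaluated before retrieval returns it. The rule fields below are illustrative, not Macaron's actual policy schema.

```python
from dataclasses import dataclass

@dataclass
class PrivacyPolicy:
    """Machine-readable rules attached to a memory item."""
    requires_biometric: bool = False
    visible_to: tuple[str, ...] = ("owner",)   # e.g. owner, regulator, developer

def can_access(policy: PrivacyPolicy, role: str, biometric_verified: bool) -> bool:
    if policy.requires_biometric and not biometric_verified:
        return False           # e.g. financial data gated behind authentication
    return role in policy.visible_to
```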

5.2 Name‑and‑shame enforcement and accountability

Japan's AI Promotion Act lacks direct penalties but uses a name‑and‑shame mechanism to publicly identify non‑compliant companies. Macaron's audit logs track memory access and policy decisions, allowing the company to demonstrate compliance if audited. Korea's framework may impose modest fines (up to KRW 30 million) for violations. By attaching metadata to every memory event, Macaron can generate compliance reports automatically. The system also allows users to export and delete their data, aligning with the emerging global norm of data portability.
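
An append‑only audit trail of this kind could look like the following; the entry fields are assumptions chosen to show how per‑event metadata enables automated reporting.

```python
import json
import time

def log_memory_access(log_file, memory_id: str, action: str, policy_ok: bool):
    """Append one audit entry per memory event, as JSON lines."""
    entry = {
        "ts": time.time(),
        "memory_id": memory_id,
        "action": action,         # "read", "write", "delete" or "export"
        "policy_ok": policy_ok,   # outcome of the policy-binding check
    }
    log_file.write(json.dumps(entry) + "\n")
```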

6 Analogies to Human Memory

Macaron's memory system echoes the architecture of human memory. Cognitive scientists describe working memory as a limited buffer in the prefrontal cortex, episodic memory as event‑based storage mediated by the hippocampus, and semantic memory as general knowledge distributed across the cortex. Similarly, Macaron has a short‑term context window, an episodic store and a long‑term vector database. Reference decay resembles the human forgetting curve: memories fade unless reinforced. Time weaving parallels the way humans create life narratives by linking events across time. By mimicking these mechanisms, Macaron not only optimizes computational resources but also produces more natural interactions. When a user reminisces about a childhood festival, the agent can recall related events and weave them into the current conversation, much like a human friend would.

7 Future Research Directions

Despite its sophistication, Macaron's memory engine leaves open questions. One area is self‑compressing memory: developing neural modules that automatically summarize and compress memories without external supervision. Another is lifelong learning: enabling the agent to continually adapt its memory strategies as user behaviour evolves. Cross‑lingual alignment remains an active research topic; future models may employ contrastive representation learning to align memories across Japanese, Korean and other languages more seamlessly. Researchers are also exploring neuromorphic hardware and spiking neural networks to implement memory at lower energy cost. Finally, integrating federated learning will allow users to train Macaron's memory models locally, sharing only model updates rather than raw data, thus enhancing privacy while improving collective performance.
