Author: Boxu Li at Macaron
Reinforcement learning (RL) has become a cornerstone of modern AI, enabling agents to learn optimal policies through trial and error. In the context of personal AI, however, RL faces unique challenges: rewards are subjective, environments are non‑stationary, and ethical considerations abound. Macaron AI's designers confronted these challenges head‑on, building a multi‑layered RL system that governs memory management, code synthesis, conversation style and more. This blog examines how Macaron applies hierarchical RL, reward modelling, credit assignment and fairness constraints to craft a truly personalized agent. We also contrast Macaron's RL approach with RL in other domains and explore future directions.
Unlike board games or simulated environments, personal agents operate in open‑ended spaces where reward cannot be derived solely from task success. Macaron gathers implicit feedback (conversation length, frequency of use, tone of user responses) and explicit feedback (ratings, thumbs up/down) to construct a reward signal. For example, if a Japanese user engages in longer conversations after the agent uses polite language, this positive correlation increases the reward for similar behaviour. If a Korean user rates a generated mini‑app poorly due to cluttered design, the reward for that UI pattern decreases. These signals feed into a reward model that predicts user satisfaction for a given state and action.
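To make this concrete, here is a minimal sketch of how implicit and explicit signals could be folded into a single per-interaction reward. The field names, weights, and caps are illustrative assumptions, not Macaron's production schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeedbackSignals:
    """Illustrative per-interaction signals; field names are hypothetical."""
    conversation_turns: int           # implicit: longer sessions suggest engagement
    return_within_24h: bool           # implicit: did the user come back?
    sentiment_score: float            # implicit: tone of user replies, in [-1, 1]
    explicit_rating: Optional[float]  # explicit: thumbs up/down mapped to +1/-1, or None

def interaction_reward(fb: FeedbackSignals) -> float:
    """Blend implicit and explicit feedback into one scalar reward."""
    implicit = (
        0.1 * min(fb.conversation_turns, 20)   # cap so marathon chats don't dominate
        + 0.5 * float(fb.return_within_24h)
        + 0.4 * fb.sentiment_score
    )
    # When explicit feedback exists, weight it more heavily than inferred signals.
    if fb.explicit_rating is not None:
        return 0.4 * implicit + 0.6 * fb.explicit_rating
    return implicit
```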
Macaron's RL is multi‑objective. In addition to user satisfaction, the reward includes terms for privacy, compliance, resource usage and ethics. Sharing sensitive information without proper consent incurs a penalty, while compressing memory effectively yields a bonus. For code generation, efficiency and maintainability influence the reward: excessive complexity (e.g., generating 100,000 lines unnecessarily) results in negative rewards. The reward weights are tuned for different regions. Japan's emphasis on privacy and transparency increases the penalty for privacy violations, while Korea's focus on innovation may place higher weight on speed and novelty. Balancing these objectives requires careful design; Macaron uses a scalarization function that converts multiple objectives into a single reward through weighted sums and dynamic scaling.
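A weighted-sum scalarization with region-tuned weights might look like the sketch below; the objective names and the Japan/Korea weightings are invented for illustration, not Macaron's actual tuning.

```python
# Hypothetical per-objective scores in [0, 1] (higher is better) and
# region-specific weights; neither reflects Macaron's production values.
OBJECTIVE_WEIGHTS = {
    "JP": {"satisfaction": 0.35, "privacy": 0.30, "compliance": 0.20, "efficiency": 0.15},
    "KR": {"satisfaction": 0.35, "privacy": 0.20, "compliance": 0.15, "efficiency": 0.30},
}

def scalarize(objectives: dict[str, float], region: str) -> float:
    """Collapse multiple objective scores into a single reward via a weighted sum."""
    weights = OBJECTIVE_WEIGHTS[region]
    return sum(weights[name] * objectives.get(name, 0.0) for name in weights)

# The same behaviour scores differently under Japanese and Korean weightings.
scores = {"satisfaction": 0.8, "privacy": 0.4, "compliance": 0.9, "efficiency": 0.7}
print(scalarize(scores, "JP"), scalarize(scores, "KR"))
```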
Human feedback is crucial for aligning AI systems with values. Macaron implements preference elicitation by presenting alternative responses or mini‑app designs and asking users which they prefer. This data trains a preference model that infers a latent utility function over possible actions. The approach is similar to RLHF (Reinforcement Learning from Human Feedback) used to train large language models, but Macaron extends it by incorporating cultural annotations: Japanese annotators comment on politeness and context, while Korean annotators note communal vs. individualistic phrasing. The resulting reward model reflects nuanced preferences across cultures.
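The standard way to turn pairwise choices into a trainable signal is a Bradley-Terry style loss, as in RLHF. The snippet below is a generic sketch of that loss under the assumption of a scalar utility network; it is not Macaron's actual preference model.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the preferred option's utility above the rejected one's.

    Both arguments are scalar utilities produced by a learned utility model
    for the two alternatives shown to the user.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Usage sketch: scores would come from the utility model over candidate responses.
preferred = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.9])
loss = preference_loss(preferred, rejected)   # backpropagate this to update the model
```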
Macaron's tasks range from casual chat to generating complex software. To manage this diversity, the system employs hierarchical RL. At the top level, a meta‑controller selects among modules: conversation manager, memory manager, synthesis engine, emotion regulator, etc. Each module is itself controlled by a separate RL policy. For example, the memory manager uses RL to decide what to store or forget, while the synthesis engine uses RL to choose code templates. The meta‑controller receives a high‑level reward combining all module rewards and learns when to delegate tasks. This decomposition reduces the search space and improves sample efficiency.
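In simplified form, the meta-controller can be viewed as a bandit-style policy over modules that tracks how much combined reward each delegation earns. The classes below are a toy illustration under that assumption; Macaron's real controller is richer than an epsilon-greedy bandit.

```python
import random

class ModulePolicy:
    """Stand-in for a module-level RL policy (conversation, memory, synthesis, ...)."""
    def __init__(self, name: str):
        self.name = name
    def act(self, state: dict) -> str:
        return f"{self.name}:noop"   # a real policy would pick a module-specific action
    def update(self, state: dict, action: str, reward: float) -> None:
        pass                         # module-specific learning rule goes here

class MetaController:
    """Top-level policy that learns which module to delegate the current state to."""
    def __init__(self, modules: list[ModulePolicy], epsilon: float = 0.1):
        self.modules = modules
        self.epsilon = epsilon
        self.values = {m.name: 0.0 for m in modules}   # running value estimate per module
        self.counts = {m.name: 0 for m in modules}

    def select(self, state: dict) -> ModulePolicy:
        if random.random() < self.epsilon:             # explore occasionally
            return random.choice(self.modules)
        return max(self.modules, key=lambda m: self.values[m.name])   # otherwise exploit

    def update(self, module: ModulePolicy, reward: float) -> None:
        self.counts[module.name] += 1
        n = self.counts[module.name]
        # incremental mean of the combined reward attributed to this delegation
        self.values[module.name] += (reward - self.values[module.name]) / n
```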
Within modules, Macaron uses the options framework to represent reusable sub‑policies. An "option" corresponds to a sequence of actions achieving a subgoal, such as "summarize last month's expenses" or "recommend a bilingual study plan." Options discovered in the Japanese domain can transfer to the Korean domain if underlying structure aligns. When Macaron learns an effective way to handle a user's request in one language, it can apply the same option when the concept appears in another language, accelerating adaptation.
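In code, an option bundles an initiation set, an intra-option policy and a termination condition. The example option below is hypothetical and only meant to show the shape of the abstraction.

```python
from dataclasses import dataclass
from typing import Callable

State = dict

@dataclass
class Option:
    """An option in the options framework: initiation set, intra-option policy, termination."""
    name: str
    can_start: Callable[[State], bool]         # initiation set I
    policy: Callable[[State], str]             # intra-option policy pi
    should_terminate: Callable[[State], bool]  # termination condition beta

# Hypothetical option that could transfer across languages if the task structure matches.
summarize_expenses = Option(
    name="summarize_last_month_expenses",
    can_start=lambda s: "expense_records" in s,
    policy=lambda s: "aggregate_and_summarize",
    should_terminate=lambda s: s.get("summary_done", False),
)
```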
Temporal abstraction allows RL agents to reason over different time scales. Macaron defines macro‑actions that encapsulate multi‑turn dialogues or prolonged computations. For instance, planning a Korean family vacation involves a macro‑action encompassing destination selection, transportation, accommodation and itinerary design. RL agents evaluate the macro‑action based on cumulative reward rather than short‑term signals. This encourages the agent to consider long‑term satisfaction, such as ensuring the trip aligns with school holidays or avoiding scheduling conflicts.
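Scoring a macro-action on cumulative rather than immediate reward is simply a discounted sum over its constituent steps, as in the sketch below; the step rewards are made up for illustration.

```python
def macro_action_return(step_rewards: list[float], gamma: float = 0.99) -> float:
    """Discounted cumulative reward for a multi-step macro-action.

    The macro-action (e.g. planning a family vacation) is evaluated on the
    whole trajectory rather than on each short-term step.
    """
    return sum(gamma ** t * r for t, r in enumerate(step_rewards))

# Rewards for destination choice, transport, accommodation, itinerary (illustrative).
print(macro_action_return([0.2, 0.1, 0.3, 0.9]))
```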
Assigning credit to specific actions is challenging when rewards arrive late. Macaron employs time weaving, connecting events across time with narrative threads. The agent builds a graph of interactions where nodes represent memories and edges represent causal relationships. When evaluating an outcome, the system traverses the graph backward to identify which retrievals or actions contributed. For example, if recommending a Japanese festival increased user happiness weeks later, the agent attributes part of the reward to retrieving the festival memory and to generating a corresponding mini‑app. This explicit causal analysis helps the RL policy learn effective retrieval strategies.
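A toy version of this backward traversal over an interaction graph might look as follows; the node names are invented for illustration and the real graph carries far richer metadata.

```python
from collections import defaultdict

class InteractionGraph:
    """Toy interaction graph: nodes are memories/actions, edges point cause -> effect."""
    def __init__(self):
        self.parents: dict[str, list[str]] = defaultdict(list)

    def add_edge(self, cause: str, effect: str) -> None:
        self.parents[effect].append(cause)

    def contributors(self, outcome: str) -> set[str]:
        """Walk backward from an observed outcome to every node that fed into it."""
        found, stack = set(), [outcome]
        while stack:
            node = stack.pop()
            for cause in self.parents[node]:
                if cause not in found:
                    found.add(cause)
                    stack.append(cause)
        return found

# Example: a happiness spike traced back to the memory retrieval and mini-app that enabled it.
g = InteractionGraph()
g.add_edge("retrieve_festival_memory", "recommend_festival")
g.add_edge("recommend_festival", "user_happiness_spike")
g.add_edge("generate_festival_miniapp", "user_happiness_spike")
print(g.contributors("user_happiness_spike"))
```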
To improve credit assignment, Macaron uses counterfactual anchoring. The agent considers alternative actions it could have taken and estimates the outcome difference. If not reminding a Korean user about a family event would have resulted in embarrassment, the actual reminder receives a positive counterfactual reward. This encourages the agent to anticipate the consequences of forgetting or recalling information. Counterfactual reasoning also helps avoid overfitting: the agent does not automatically assume that repeating a successful action will always yield the same reward; instead, it tests whether the action truly causes the outcome.
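At its simplest, counterfactual credit is the gap between the observed outcome and what a model predicts would have happened without the action. The helper below assumes such a counterfactual estimator exists; it is a sketch of the idea, not Macaron's estimator.

```python
def counterfactual_advantage(actual_outcome: float, estimated_counterfactual: float) -> float:
    """Credit an action by how much better the world is than if it had not been taken.

    `estimated_counterfactual` comes from a model predicting the outcome had the
    agent skipped the action (e.g. never sent the reminder).
    """
    return actual_outcome - estimated_counterfactual

# The reminder was sent (outcome 0.9); the model predicts embarrassment without it (0.2).
print(counterfactual_advantage(0.9, 0.2))   # 0.7: a strong positive counterfactual reward
```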
Macaron's RL implementation incorporates eligibility traces, a mechanism that assigns credit to states and actions that precede rewards. When the agent receives a delayed reward (e.g., a user's satisfaction after using a mini‑app for weeks), the trace helps propagate the signal back to earlier decisions such as memory selection, conversation tone and code module choices. Eligibility traces are weighted by a decay factor; states closer to the reward receive higher credit. This mechanism encourages the agent to optimize long‑term satisfaction rather than short‑term gains.
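For readers unfamiliar with the mechanism, here is a textbook tabular TD(λ) update with accumulating eligibility traces; it illustrates how a delayed reward propagates back to earlier decisions and is not Macaron's implementation.

```python
from collections import defaultdict

def td_lambda_update(values, trajectory, alpha=0.1, gamma=0.99, lam=0.9):
    """One-episode TD(lambda) update with accumulating eligibility traces.

    `trajectory` is a list of (state, reward, next_state) tuples; states visited
    closer to a delayed reward keep a larger trace and therefore receive more credit.
    """
    traces = defaultdict(float)
    for state, reward, next_state in trajectory:
        td_error = reward + gamma * values[next_state] - values[state]
        traces[state] += 1.0                       # mark the visited state as eligible
        for s in list(traces):
            values[s] += alpha * td_error * traces[s]
            traces[s] *= gamma * lam               # decay credit for older states
    return values

values = defaultdict(float)
episode = [("pick_memory", 0.0, "set_tone"),
           ("set_tone", 0.0, "ship_miniapp"),
           ("ship_miniapp", 1.0, "done")]          # delayed satisfaction arrives at the end
td_lambda_update(values, episode)
```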
Reinforcement learning can inadvertently learn biases from feedback data. Macaron mitigates this by incorporating fairness constraints into the reward function. For example, the agent is penalized if it consistently recommends gender‑specific activities without being asked. The system monitors recommendation patterns across demographic groups and adjusts rewards to equalize opportunities. When dealing with sensitive topics like finance or health, the agent consults an ethical policy library that encodes cultural norms and legal requirements. Breaching these guidelines triggers a negative reward or blocks the action entirely.
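One simple way to encode such a constraint is to subtract a penalty proportional to the spread of recommendation rates across demographic groups, as in this illustrative helper; the group names and rates are hypothetical.

```python
def fairness_penalty(recommendation_rates: dict[str, float], weight: float = 1.0) -> float:
    """Penalty proportional to the spread of recommendation rates across groups.

    `recommendation_rates` maps a demographic group to the fraction of its users
    shown a given category of recommendation; a balanced pattern yields zero penalty.
    """
    rates = list(recommendation_rates.values())
    return weight * (max(rates) - min(rates))

# Gender-skewed activity suggestions incur a penalty subtracted from the reward.
print(fairness_penalty({"female": 0.72, "male": 0.31}))
```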
Korea's AI Framework Act requires human oversight for high‑impact systems and notification when generative AI is used. Macaron complies by including a human‑in‑the‑loop for major decisions such as financial planning or healthcare advice. When a Korean user generates a high‑stakes mini‑app, the system prompts them to review and approve actions. Japan's AI Promotion Act emphasizes transparency; thus, Macaron logs RL decisions and provides users with explanations of why certain memories or modules were selected. These measures build trust and ensure accountability.
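A minimal human-in-the-loop gate could look like the sketch below; the domain list and the approval callback are hypothetical stand-ins for Macaron's actual review flow.

```python
HIGH_STAKES_DOMAINS = {"finance", "healthcare"}   # illustrative; the real taxonomy is richer

def execute_with_oversight(action, domain: str, request_user_approval) -> bool:
    """Gate high-impact actions behind explicit user review before execution.

    `action` is a zero-argument callable that performs the step;
    `request_user_approval` surfaces the proposed action to the user and
    returns True only if they approve it.
    """
    if domain in HIGH_STAKES_DOMAINS and not request_user_approval(action):
        return False   # blocked: record the refusal, do not execute
    action()
    return True
```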
Japan's AI law implements a name‑and‑shame mechanism for non‑compliance. Macaron's RL logs include not just rewards but the rationale behind decisions. If regulators investigate, the company can demonstrate that biases were addressed and privacy rules were respected. The logs also support user audits; individuals can see how their feedback influenced the agent's behaviour. Such transparency deters misuse of RL and fosters ethical innovation.
RL has delivered impressive results in gaming (AlphaGo, Dota 2), robotics and recommendation systems. However, these environments offer explicit goals (winning a game, minimizing error) and clear rewards. Personal AI, by contrast, must infer goals from messy data and align with human values. In gaming, exploration is often unconstrained; an agent may sacrifice a pawn to gain positional advantage. In personal AI, sacrificing user trust for short‑term engagement is unacceptable. Macaron's reward model explicitly penalizes actions that degrade trust, making the system conservative when necessary.
Some open‑source projects offer RL‑driven personal assistants that schedule tasks or automate workflows. These systems often assume constant user feedback and treat tasks as independent. Macaron diverges by integrating tasks through its memory engine and by using hierarchical RL to manage interactions. Its RL model is deeply entangled with cultural context, privacy rules and code generation, making it more complex but also more capable. While other agents might use RL to recommend songs based on listening history, Macaron uses RL to decide whether to remind you to call your mother before generating a gift recommendation.
Researchers have proposed RL methods for controlling large language models, such as RLHF and unsupervised environment design. Macaron contributes to this literature by demonstrating RL in a real‑world, multi‑domain, cross‑lingual environment. The FireAct project previously reported that RL improves reasoning accuracy by 77% over prompt‑based agents; Macaron extends this idea by training RL policies not only on reasoning tasks but also on memory management, code synthesis and dialogue style. It highlights the importance of hierarchical design, credit assignment and fairness constraints in scaling RL to personal agents.
Reinforcement learning optimizes for reward, but reward functions encode human values that differ across cultures. Meta‑ethical questions arise: Should the agent maximize happiness, adhere to duty‑based ethics, or balance fairness with autonomy? Macaron addresses this by learning normative priors from cultural data. In Japan, where harmony and respect for social order are prized, the reward model emphasizes politeness, consensus and subtlety. In Korea, which values community resilience and bold innovation, the model rewards proactive assistance and transparency. These normative frameworks are not static; users can adjust ethical sliders, and Macaron explores value space under constraints. An ongoing research direction is integrating formal ethical theories—utilitarianism, deontology, virtue ethics—into RL agents so that they can explain the moral trade‑offs behind their actions. This is especially important for high‑impact decisions such as financial planning or healthcare recommendations.
Personal agents increasingly mediate interactions within families, teams and communities. Social reinforcement learning extends RL to multi‑agent settings, where agents must consider the welfare of multiple stakeholders. For example, when scheduling a family event, Macaron must balance individual preferences (privacy, workload) with collective satisfaction. Group rewards can be shaped using Pareto efficiency—ensuring that improving one member's outcome does not harm others—or fair division principles. In cross‑lingual contexts, group communication may happen in multiple languages; the agent must unify rewards across language boundaries while respecting cultural norms. Future research will explore equitable RL where marginalized voices are weighted more heavily, ensuring inclusivity. Other avenues include self‑play to simulate interactions between agents, meta‑learning to adapt to new group dynamics, and causal inference to disentangle correlation from causation in social feedback. These advances will allow Macaron and similar personal AIs to move from one‑to‑one interactions to orchestrating social experiences, making them invaluable partners in both Japanese and Korean society.
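As a small illustration of the Pareto criterion mentioned above, the check below accepts a proposed group plan only if it helps at least one member and harms none; the members and utility values are invented.

```python
def is_pareto_improvement(current: dict[str, float], proposed: dict[str, float]) -> bool:
    """True if the proposed plan helps at least one member and harms none.

    Both dicts map a family or team member to their estimated utility for a plan.
    """
    no_one_worse = all(proposed[m] >= current[m] for m in current)
    someone_better = any(proposed[m] > current[m] for m in current)
    return no_one_worse and someone_better

current = {"mom": 0.6, "dad": 0.5, "kid": 0.7}
proposed = {"mom": 0.7, "dad": 0.5, "kid": 0.8}
print(is_pareto_improvement(current, proposed))   # True: improves two members, harms no one
```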