Catalysing Macaron's Capabilities with Claude & DeepSeek Updates

Author: Boxu Li at Macaron


Macaron AI isn’t just a productivity tool – it’s a platform that turns our conversations into mini‑applications that manage calendars, plan trips and explore hobbies. Underneath the friendly surface is a sophisticated reinforcement learning (RL) system and a memory engine that remembers what matters and forgets what doesn’t[1]. As Macaron prepares to integrate Claude Sonnet 4.5 and DeepSeek V3.2‑Exp, together with the Claude Agent SDK/Code 2.0, this blog explores how these new models and tools can raise the quality of Macaron’s output, shorten mini‑app creation time and reduce bugs. We combine technical insights from Anthropic’s developer updates, DeepSeek research and Macaron’s own engineering blogs to build a clear picture of what’s ahead.

1 Macaron’s internal engine: RL, memory and ethics

Before comparing models, it helps to understand what makes Macaron unique. Macaron uses a multi‑layered RL system to convert everyday conversations into tasks and code. The system breaks down the problem into several modules – conversation management, memory selection, code synthesis and simulator feedback – and applies hierarchical reinforcement learning (HRL) to coordinate them[2]. A high‑level meta‑controller decides which module to activate next, while lower‑level RL policies decide on specific actions such as retrieving a memory, calling an API or executing generated code[2]. This design enables Macaron to decompose complex goals – from planning a trip to organising finances – into manageable sub‑tasks.
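To make the hierarchy concrete, here is a toy sketch of a meta‑controller choosing between modules while a low‑level policy picks the concrete action inside each. The module names, Q‑learning update and action table are illustrative assumptions, not Macaron's production code.

```python
import random

MODULES = ["conversation", "memory_selection", "code_synthesis", "simulator_feedback"]

class MetaController:
    """High-level policy: decides which module handles the next step."""
    def __init__(self):
        # One value per module; in practice this would be a learned network.
        self.q = {m: 0.0 for m in MODULES}

    def select_module(self, epsilon=0.1):
        if random.random() < epsilon:        # occasionally explore
            return random.choice(MODULES)
        return max(self.q, key=self.q.get)   # otherwise exploit

    def update(self, module, reward, lr=0.1):
        self.q[module] += lr * (reward - self.q[module])

def low_level_policy(module, state):
    """Stand-in for the per-module RL policies described above."""
    actions = {
        "conversation": "ask_clarifying_question",
        "memory_selection": "retrieve_memory",
        "code_synthesis": "generate_code",
        "simulator_feedback": "run_tests",
    }
    return actions[module]

controller = MetaController()
module = controller.select_module()
action = low_level_policy(module, state={})
controller.update(module, reward=1.0)   # reward arrives after execution
print(module, "->", action)
```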

1.1 Reward modelling and human feedback

In personal AI, there is no single “win condition”; user satisfaction, privacy, timeliness and cultural nuance all matter. Macaron constructs its reward function by combining implicit and explicit feedback. Implicit signals include conversation length, frequency of use and tone, while explicit ratings and thumbs‑up/down help calibrate preferences[3]. Macaron also uses preference elicitation, presenting alternative responses or mini‑app designs and asking users which they prefer. An inference model then learns a latent utility function over possible actions, similar to reinforcement learning from human feedback (RLHF) but extended with cultural annotations – Japanese raters emphasise politeness and context, while Korean raters highlight communal versus individualistic phrasing[4]. These signals feed into a reward model that predicts user satisfaction and encourages the agent to follow local norms.
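As a rough illustration, a reward model of this kind can start life as a weighted blend of signals before being replaced by a learned latent utility function. The signal names and weights below are placeholders, not Macaron's actual schema.

```python
def combined_reward(signals, weights=None):
    """Blend implicit and explicit feedback into one scalar reward.
    All names and weights here are illustrative placeholders."""
    weights = weights or {
        "conversation_length": 0.2,   # implicit: engaged sessions run longer
        "return_frequency": 0.2,      # implicit: the user keeps coming back
        "sentiment": 0.2,             # implicit: tone of the user's messages
        "explicit_rating": 0.3,       # explicit: thumbs-up/down
        "preference_win": 0.1,        # explicit: chosen in preference elicitation
    }
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

# Example: a session with positive tone, a thumbs-up and a preference win.
r = combined_reward({
    "conversation_length": 0.8,
    "return_frequency": 0.5,
    "sentiment": 0.9,
    "explicit_rating": 1.0,
    "preference_win": 1.0,
})
print(round(r, 3))  # ≈ 0.84
```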

1.2 Hierarchical RL and macro‑actions

To manage diverse user tasks, Macaron leverages HRL to select modules and sub‑policies. Within modules, it uses the options framework: a sequence of actions achieving a sub‑goal is treated as a single option (for example “summarise last month’s expenses” or “recommend a bilingual study plan”)[3]. Options discovered in one domain can transfer to another if underlying structures align. Macaron also defines macro‑actions that encapsulate multi‑turn dialogues or prolonged computations, like planning a family vacation (destination, transportation, accommodation and itinerary)[3]. RL agents evaluate macro‑actions based on cumulative reward rather than short‑term signals, encouraging the agent to optimise long‑term satisfaction.
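The options framework can be sketched in a few lines: an option bundles an initiation condition, an intra‑option policy and a termination condition, and the caller treats the whole rollout as a single macro‑action. The fields follow the standard options formulation; the expense‑summary example is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Option:
    """One option: a named sub-policy with start and stop conditions."""
    name: str
    can_start: Callable[[dict], bool]   # initiation set I(s)
    policy: Callable[[dict], str]       # intra-option policy pi(s)
    is_done: Callable[[dict], bool]     # termination condition beta(s)

def run_option(option: Option, state: dict, max_steps: int = 10) -> List[str]:
    """Execute an option to termination; the caller sees one macro-action."""
    if not option.can_start(state):
        return []
    actions = []
    for _ in range(max_steps):
        if option.is_done(state):
            break
        actions.append(option.policy(state))
        state["step"] = state.get("step", 0) + 1
    return actions

summarise_expenses = Option(
    name="summarise_last_month_expenses",
    can_start=lambda s: "transactions" in s,
    policy=lambda s: "aggregate_next_category",
    is_done=lambda s: s.get("step", 0) >= 3,
)
print(run_option(summarise_expenses, {"transactions": []}))
```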

1.3 Credit assignment and time weaving

Assigning credit to specific actions when rewards arrive late is difficult. Macaron employs time weaving, connecting events across time with narrative threads. The system builds a graph of interactions where nodes represent memories and edges represent causal relationships; when evaluating an outcome it traverses the graph backward to identify which retrievals or actions contributed[2]. Counterfactual reasoning helps assess what would have happened if alternative actions were taken, preventing the agent from automatically assuming that repeating a successful action always yields the same reward[2]. Macaron also uses delayed rewards and eligibility traces to propagate the signal back to earlier decisions – such as memory selection or conversation tone – encouraging the agent to optimise long‑term satisfaction[5].
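A minimal version of backward credit propagation over such a graph might look like the sketch below; the exponential decay with graph distance is an assumption standing in for Macaron's learned eligibility traces.

```python
def assign_credit(edges, outcome, reward, decay=0.8):
    """Walk the interaction graph backward from an outcome and spread
    reward to contributing events, decaying with graph distance.

    edges: dict mapping event -> list of events it causally influenced.
    """
    # Invert edges so we can traverse from the outcome to its causes.
    parents = {}
    for src, dsts in edges.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)

    credit, frontier, depth, seen = {}, [outcome], 0, {outcome}
    while frontier:
        for event in frontier:
            credit[event] = reward * (decay ** depth)
        nxt = []
        for event in frontier:
            for p in parents.get(event, []):
                if p not in seen:
                    seen.add(p)
                    nxt.append(p)
        frontier, depth = nxt, depth + 1
    return credit

# Memory retrieval -> code generation -> mini-app shipped (user satisfied).
graph = {"retrieve_memory": ["generate_code"], "generate_code": ["ship_app"]}
print(assign_credit(graph, outcome="ship_app", reward=1.0))
# {'ship_app': 1.0, 'generate_code': 0.8, 'retrieve_memory': ≈0.64}
```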

1.4 Fairness, safety and ethics

Personal AI agents must avoid bias and comply with regulations. Macaron incorporates fairness constraints into the reward function; for example the agent is penalised if it consistently recommends gender‑specific activities without being asked[5]. An ethical policy library encodes cultural norms and legal requirements, and violating these guidelines triggers a negative reward or blocks the action entirely[5]. Human oversight is built into high‑impact decisions like financial planning or healthcare advice, satisfying the Korean AI Framework Act and Japan’s AI Promotion Act[5]. Macaron logs RL decisions and provides users with explanations of why certain memories or modules were selected, supporting audits and transparency[5].

1.5 The memory engine: compression, retrieval and gating

Macaron’s memory engine is the backbone of personalisation. It organises memories into short‑term, episodic and long‑term stores. The short‑term store keeps the current conversation (8–16 messages); the episodic store holds recent interactions compressed via convolutional attention; and the long‑term store uses a high‑dimensional vector database with metadata tags (timestamp, domain, language)[6]. To manage cost, Macaron uses latent summarisation to identify salient segments and compress them into fixed‑length vectors; an autoencoding objective reconstructs hidden states from compressed summaries, and RL fine‑tunes the summariser to retain information important for later recall[7]. A dynamic memory token acts as a pointer network: it retrieves candidate memories, evaluates relevance and decides whether to return them or keep searching[8].
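The three‑tier layout can be pictured with a small sketch. The capacities, embedding and summariser below are trivial stand‑ins; in Macaron the compression step is learned, as described above.

```python
from collections import deque

class MemoryEngine:
    """Three-tier memory layout; all components here are placeholders."""
    def __init__(self, short_term_size=16):
        self.short_term = deque(maxlen=short_term_size)  # raw recent messages
        self.episodic = []    # compressed summaries of past sessions
        self.long_term = []   # (vector, metadata) pairs for a vector store

    def observe(self, message: str):
        self.short_term.append(message)

    def end_session(self, embed, summarise):
        """Compress the session into the episodic store, then index it."""
        summary = summarise(list(self.short_term))
        self.episodic.append(summary)
        self.long_term.append((embed(summary),
                               {"domain": "general", "lang": "en"}))
        self.short_term.clear()

engine = MemoryEngine()
engine.observe("Plan a fireworks trip to Tokyo in July")
engine.end_session(embed=lambda t: [hash(t) % 97],        # stand-in embedding
                   summarise=lambda msgs: " | ".join(msgs))
print(engine.episodic, engine.long_term[0][1])
```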

Retrieval involves approximate nearest neighbour search with product quantisation and maximal marginal relevance to balance similarity and diversity[9]. Query expansion uses the user’s goal and latent intent; for instance a Japanese request for "花火大会" (fireworks festival) expands to include tickets, date and weather[10]. Relevance federation handles cross‑domain queries, using a softmax gating function to distribute retrieval probabilities across domains and languages[11]. These components are trained with RL, and credit assignment via time weaving ensures the agent learns which memories were crucial[12]. Macaron’s memory system differs from traditional retrieval‑augmented generation (RAG) because memories are user‑specific, storage and retrieval are guided by RL, and each memory includes privacy metadata governing access[13].
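Maximal marginal relevance, mentioned above, is simple enough to show directly: it greedily selects memories that are relevant to the query yet dissimilar to what has already been picked. This is a generic MMR implementation, not Macaron's retrieval code.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, candidates, sim, k=3, lam=0.5):
    """Greedy MMR: lam trades query relevance against redundancy
    with already-selected items."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            relevance = sim(query_vec, c)
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

memories = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(mmr([1.0, 0.0], memories, cosine, k=2, lam=0.3))
# With a diversity-heavy lambda: picks [1,0], then [0,1]
# over the nearly redundant [0.9, 0.1].
```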

2 The Claude Agent SDK and Claude Code 2.0

While Macaron’s internal architecture is robust, building mini‑apps still requires reading and writing files, executing code, using version control and interacting with web APIs. Anthropic’s Claude Agent SDK provides exactly these capabilities, exposing the same agent harness that powers Claude Code’s terminal assistant[14]. It packages fine‑grained tools: file operations (read, write, grep, glob), bash commands, web fetch, multi‑language code execution and Git operations[15]. Unlike assistants that pre‑index a codebase, Claude agents search on demand using grep/find/glob to locate files, making them more flexible in dynamic repos[16]. The SDK includes large context windows with automatic compaction and summarisation, allowing agents to hold substantial code context without hitting token limits[17]. Developers can specify allowed tools and permission modes and add hooks for safety, enabling autonomy with guardrails[18].

Core building blocks of the SDK

  1. Tools – The SDK lets engineers select which tools (file I/O, bash, web fetch, code execution) are available to an agent[19].

  2. MCP extensions – Integration with the Model Context Protocol allows external servers (databases, email search, vector search) to extend the toolset[20].

  3. Sub‑agents – Agents defined in .claude/agents have their own system prompts, restricted toolsets and optional model selection; tasks can be delegated to these sub‑agents[21].

  4. Memory & project context – A persistent scratchpad (CLAUDE.md) maintains context across sessions and honours repo‑level configuration[22].

  5. Context management & runtime – Automatic context compaction, streaming responses and typed error handling simplify long‑running tasks[23].
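Putting these blocks together, a minimal Macaron‑style invocation might look like the Python sketch below. It follows the publicly documented claude‑agent‑sdk interface as we understand it; exact option names can vary between SDK versions, and the prompt is hypothetical.

```python
import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions

async def main():
    # Restrict the agent to exploration plus file edits; option names
    # follow the SDK docs as of Claude Code 2.0 and may change.
    options = ClaudeAgentOptions(
        allowed_tools=["Read", "Grep", "Glob", "Write", "Bash"],
        permission_mode="acceptEdits",   # auto-approve file edits only
        max_turns=10,
    )
    # query() streams messages as the agent works through the task.
    async for message in query(
        prompt="Scaffold a React mini-app that tracks travel expenses",
        options=options,
    ):
        print(message)

asyncio.run(main())
```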

New features in Claude Code 2.0

Claude Code 2.0 brings developer‑friendly updates: checkpoints let developers save progress and roll back when the agent makes mistakes[24]. A VS Code extension embeds the agent into the IDE, while a refreshed terminal interface improves state management[25]. The Claude API gains context editing and a memory tool that help agents run longer by automatically clearing context and retrieving relevant pieces[26]. Claude’s app and API can now execute code, create files and analyse data[27], turning an LLM into a full coding assistant. These features are particularly relevant for Macaron’s mini‑app pipeline, which involves generating program code, testing it in a sandbox, correcting errors and interacting with external services.

3 Claude Sonnet 4.5: long autonomy and higher quality

Claude Sonnet 4.5 is Anthropic’s most capable model for coding, agentic tasks and computer use. DevOps.com reports that Sonnet 4.5 can operate autonomously for over 30 hours, far longer than the seven hours of its predecessor. It excels in instruction following, code refactoring and production‑ready output, and leads the SWE‑Bench Verified benchmark on realistic coding tasks. In real‑world deployments the improvements are tangible: Replit’s internal benchmarks saw code editing errors drop from 9 % with Sonnet 4 to 0 % with Sonnet 4.5, while cybersecurity teams cut vulnerability intake time by 44 % and improved accuracy by 25 %. Netflix engineers describe Sonnet 4.5 as “excellent at software development tasks, learning our codebase patterns to deliver precise implementations”.

Sonnet 4.5’s developer tooling and memory features synergise with the Agent SDK. The model supports context editing and memory management, which automatically clear old context and bring relevant pieces back into focus[24]. It can navigate GUIs by clicking, typing and interacting with menus, enabling automation of tools without APIs. Combined with the SDK’s sub‑agent architecture and checkpoints, this means Macaron can build mini‑apps across multi‑day sessions without losing context, and roll back mistakes when necessary.

4 DeepSeek V3.2‑Exp: efficiency through sparse attention

While Sonnet 4.5 focuses on quality and autonomy, DeepSeek V3.2‑Exp emphasises efficiency. The model introduces DeepSeek Sparse Attention (DSA), selecting only the most important tokens during attention. This reduces complexity from quadratic O(n²) to O(nk), delivering 2–3× faster inference on long contexts, 30–40 % lower memory usage and a 50 %+ reduction in API prices[28]. Despite these savings, V3.2‑Exp maintains parity with the previous V3.1‑Terminus model on most benchmarks[29]. The open‑source release allows Macaron to run the model locally, fine‑tune it and explore novel architectures[30]. Reuters notes that DeepSeek views this as an intermediate step toward its next‑generation architecture; the DSA mechanism cuts computing costs while boosting some types of performance[31], and the service automatically upgrades to V3.2‑Exp with a massive price cut for users[32].
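The core idea of DSA – each query attends only to its k most relevant keys – can be illustrated with a toy NumPy implementation. Note that this sketch still computes all scores before masking, so it shows the O(nk) attention pattern but not DeepSeek's actual speedup, which depends on a cheaper token‑selection mechanism.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=4):
    """Each query attends only to its k highest-scoring keys, so the
    softmax runs over k entries per query (O(n*k)) instead of n (O(n^2))."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n_q, n_k) raw scores
    # Keep the top-k keys per query; mask out everything else.
    idx = np.argpartition(scores, -k, axis=-1)[:, -k:]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, idx,
                      np.take_along_axis(scores, idx, axis=-1), axis=-1)
    # Softmax over the surviving entries only (-inf entries become 0).
    weights = np.exp(mask - mask.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = topk_sparse_attention(Q, K, V, k=32)   # each token attends to 32 of 1024
print(out.shape)  # (1024, 64)
```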

DeepSeek V3.2‑Exp inherits the mixture‑of‑experts design and adds mixed precision and multi‑head latent attention[33]. However, being experimental, it shows minor regressions on complex reasoning tasks[34] and lacks the integrated agent tooling of the Claude ecosystem. For Macaron this means V3.2‑Exp is better suited for cost‑sensitive tasks or prototyping, where speed and throughput are more important than highest coding accuracy.

5 Comparing Sonnet 4.5 and DeepSeek V3.2‑Exp for Macaron

Macaron’s decision to connect to both models invites a comparison of their strengths and weaknesses. The table below summarises key attributes:

| Feature | Sonnet 4.5 | DeepSeek V3.2‑Exp |
| --- | --- | --- |
| Focus | High‑quality coding, agentic tasks, long autonomy | Efficient long‑context processing[35] |
| Architecture | Proprietary model with long‑duration autonomy (>30 hours) and strong instruction following | Mixture‑of‑experts with sparse attention reducing compute[28] |
| Memory & context | Large context windows; automatic memory management via memory tool[24] | Supports long contexts via sparse attention; memory usage reduced[28] |
| Developer tooling | Agent SDK with sub‑agents, checkpoints, VS Code integration[36][24] | No official SDK; open‑source code allows custom integrations but lacks built‑in memory tooling |
| Cost | Unchanged from Sonnet 4: $3/M input tokens and $15/M output tokens[37] | 50 %+ API price cut[38]; free to self‑host |
| Strengths | Highest coding accuracy (SWE‑Bench Verified 77–82 %), extended autonomy, robust safety | Exceptional efficiency; 2–3× faster inference and lower memory use[28]; open‑source |
| Weaknesses | Higher token costs; proprietary API; may require careful prompt management | Experimental status; minor regressions on complex reasoning[34]; lacks integrated tooling |

From this comparison, we can derive a hybrid strategy. Macaron could use DeepSeek V3.2‑Exp for initial drafts, benefiting from low latency and cost, then refine or validate with Sonnet 4.5 to ensure correctness and security. For complex mini‑apps requiring deep reasoning, Sonnet 4.5 remains the best choice, while V3.2‑Exp excels in rapid iterations or large‑batch generation.
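A hybrid routing policy of this kind reduces to a few lines of orchestration. The routing rule and model wrappers below are illustrative assumptions, not Macaron's actual dispatch logic.

```python
def generate_mini_app(spec, deepseek, sonnet, needs_deep_reasoning):
    """Draft-then-refine cascade. `deepseek` and `sonnet` are any callables
    wrapping the respective APIs; the routing rule is illustrative."""
    if needs_deep_reasoning(spec):
        return sonnet(f"Build this mini-app:\n{spec}")   # quality-critical path
    draft = deepseek(f"Draft this mini-app quickly:\n{spec}")
    # Sonnet 4.5 reviews and hardens the cheap draft before shipping.
    return sonnet(f"Review, fix and harden this draft:\n{draft}")

HIGH_STAKES = ("finance", "medical", "security")
route = lambda spec: any(kw in spec.lower() for kw in HIGH_STAKES)

draft_model = lambda p: f"[deepseek] {p[:40]}..."
strong_model = lambda p: f"[sonnet] {p[:40]}..."
print(generate_mini_app("Budget tracker with stock charts",
                        draft_model, strong_model, route))
```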

6 How new models will improve Macaron’s mini‑app pipeline

The core question for Macaron is whether Sonnet 4.5 and DeepSeek V3.2‑Exp can improve quality, shorten development time and reduce bugs. We analyse each factor in the context of Macaron’s pipeline:

6.1 Quality of code and output

Sonnet 4.5 delivers higher code quality and fewer errors. According to Replit, code editing errors dropped from 9 % to zero when moving from Sonnet 4 to Sonnet 4.5. This means mini‑apps generated by Macaron will compile more reliably, with fewer syntax mistakes or missing imports. The model’s improved instruction following helps Macaron understand user specifications more accurately; its enhanced code refactoring ensures that generated modules are clean and modular. In cybersecurity deployments, Sonnet 4.5 cut vulnerability intake time by 44 % and improved accuracy by 25 %, suggesting similar gains for Macaron’s travel and wellness apps. DeepSeek V3.2‑Exp, while slightly weaker on complex reasoning, still maintains performance comparable to V3.1 with better efficiency[29]; when fine‑tuned on Macaron’s domain, it could deliver sufficiently high accuracy for simpler mini‑apps.

6.2 Speed of mini‑app creation

Sonnet 4.5’s ability to run autonomously for over 30 hours means Macaron can generate end‑to‑end mini‑apps in a single continuous session without manual resets. Combined with the Agent SDK’s context management and checkpoints, this reduces time spent restarting tasks or reloading context. The sub‑agent architecture allows Macaron to parallelise tasks: one agent can handle UI generation while another manages API integration, each with its own context and tools. Meanwhile, DeepSeek V3.2‑Exp’s 2–3× faster inference and lower memory usage translate into quicker responses[28]. For example, if generating a travel itinerary required 30 seconds using Sonnet 4.5, V3.2‑Exp could produce a rough draft in 10–15 seconds; Sonnet 4.5 would then refine it. The net effect is a shorter time to first usable version, enabling rapid user feedback loops.

6.3 Smoother processes and fewer bugs

Automation reduces human errors, but autonomy can introduce new bugs if not properly managed. The Agent SDK’s checkpoints let developers save and roll back the agent’s state[24]. If Macaron makes an incorrect API call or writes to the wrong file during mini‑app generation, the developer can revert to a previous checkpoint instead of starting over. Context editing prevents token exhaustion and ensures that only relevant context is kept, minimising hallucinations. For DeepSeek, the open‑source release allows Macaron’s team to inspect and modify the model, integrate custom safety checks and fine‑tune for domain‑specific tasks. Additionally, Macaron’s own RL mechanisms – time weaving, counterfactual reasoning and fairness constraints – continue to monitor user satisfaction and penalise harmful behaviour[2][5], reducing the risk of bugs and ethical violations.

6.4 Cost considerations

High‑quality models come at a price. Sonnet 4.5’s token pricing remains unchanged from Sonnet 4 ($3/M input tokens, $15/M output tokens)[37]. DeepSeek V3.2‑Exp halves the cost of API calls[38] and, because it is open‑source, can be self‑hosted. Macaron can therefore optimise costs by using V3.2‑Exp for initial drafts or low‑stakes tasks (e.g., generating UI components or simple calculators) and reserving Sonnet 4.5 for high‑stakes tasks (e.g., financial planning, medical advice) where correctness and compliance are critical. Savings from faster inference and reduced GPU usage (discussed below) also offset compute costs.
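The trade‑off is easy to quantify. Using the Sonnet 4.5 list prices above and placeholder DeepSeek rates (the exact post‑cut figures depend on DeepSeek's current price sheet), a per‑request estimate looks like this:

```python
def request_cost(in_tokens, out_tokens, in_rate, out_rate):
    """Dollar cost of one request given per-million-token rates."""
    return (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000

# Sonnet 4.5 list prices from the text above: $3/M in, $15/M out.
sonnet = request_cost(50_000, 10_000, in_rate=3.00, out_rate=15.00)

# Hypothetical post-cut DeepSeek rates, for comparison only.
deepseek = request_cost(50_000, 10_000, in_rate=0.30, out_rate=0.60)

print(f"Sonnet 4.5:  ${sonnet:.3f}")    # $0.300
print(f"DeepSeek:    ${deepseek:.3f}")  # $0.021 under the assumed rates
```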

7 Macaron’s RL training innovations: DAPO, LoRA and All‑Sync RL

Improving the model is only part of the story; training efficiency affects how quickly Macaron can iterate on RL policies. MIND LABS describes a system that combines Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with Low‑Rank Adaptation (LoRA) in an All‑Sync RL architecture to train a 671B DeepSeek model using just 48 H800 GPUs – a 10× reduction compared with the 512 GPUs needed for standard RL[39]. Pipeline parallelism using Coati and SGLang, plus accelerated LoRA merge and quantisation, eliminate “GPU bubbles” where GPUs sit idle waiting for inference[40]. The result is a reduction of the wall‑clock time for a single training step from 9 hours to 1.5 hours[41]. These advances mean Macaron can retrain its reward models or memory gates faster, incorporate feedback more quickly and roll out improvements to users sooner.

Figure 1 – GPU usage drops from 512 to 48 H800 GPUs when using All‑Sync RL with LoRA, enabling more accessible RL research and faster experimentation[39].

Beyond efficiency, LoRA’s low‑rank updates reduce model weight communication costs, and dynamic sampling stabilises training by filtering prompts and shaping rewards[42]. For Macaron, these techniques mean that future memory and policy updates can be trained quickly without incurring prohibitive compute costs.
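To see why LoRA cuts communication, compare the parameter counts: a rank‑r adapter trains r·(d_in + d_out) values instead of the full d_in·d_out matrix. A minimal sketch, with illustrative initialisation:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A.
    Only A and B are trained and communicated between workers."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        self.W = np.random.randn(d_out, d_in) * 0.02   # frozen base weight
        self.A = np.random.randn(r, d_in) * 0.01       # trainable
        self.B = np.zeros((d_out, r))                  # trainable, starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def merge(self):
        """Fold the adapter into W for fast inference (the 'LoRA merge' step)."""
        return self.W + self.scale * (self.B @ self.A)

layer = LoRALinear(d_in=4096, d_out=4096, r=8)
y = layer.forward(np.random.randn(4096))
# Trainable parameters: 8 * (4096 + 4096) = 65,536,
# versus ~16.8M for full fine-tuning of the 4096x4096 matrix.
```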

8 Developer workflow: integrating Sonnet 4.5 and DeepSeek into Macaron

Creating a mini‑app with Macaron involves several stages; a runnable sketch of the full loop follows the list:

  • Intent understanding – Macaron parses the user’s request and identifies the necessary components (e.g., data sources, UI elements, external APIs). Sonnet 4.5’s improved instruction following helps extract accurate intent and plan execution steps, while V3.2‑Exp can rapidly prototype potential intents for user selection.

  • Program synthesis – The agent uses the Claude Agent SDK to generate code, search the repository, read templates and write new files. Sub‑agents may specialise in front‑end (React) or back‑end (Python), and context management ensures the right code is available without overloading memory. Sonnet 4.5’s long context and code refactoring capabilities produce cleaner, more maintainable programs, while V3.2‑Exp speeds up the first draft.

  • Sandbox execution – Generated code is executed in a secure environment. The agent reads logs, captures errors and iteratively fixes bugs. Checkpoints provide safe fallbacks, and RL reward signals penalise code that fails tests. Macaron may also perform integration tests against external services using the Agent SDK’s bash and web fetch tools.

  • Interaction and refinement – The agent presents the mini‑app to the user through Macaron’s conversational interface. The memory engine stores the conversation and uses RL to decide which memories to recall in future interactions. Feedback from the user updates the reward model and influences future generations.
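Tying the stages together, the sketch below mirrors the pipeline with trivial stubs so it executes end to end; every collaborator is a hypothetical stand‑in for Macaron's real components.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    passed: bool
    logs: str

def build_mini_app(request, draft, refine, fix, run, recall, store,
                   max_fix_rounds=3):
    """End-to-end pipeline sketch; all callables are hypothetical stand-ins."""
    spec = f"{request} | context: {recall(request)}"   # 1. intent understanding
    code = refine(draft(spec))                         # 2. draft fast, refine well
    result = RunResult(passed=False, logs="")
    for _ in range(max_fix_rounds):                    # 3. sandbox + fix loop
        result = run(code)
        if result.passed:
            break
        code = fix(code, result.logs)
    store(request, code, result)                       # 4. persist for RL/memory
    return code

# Wiring with trivial stubs so the sketch runs end to end:
app = build_mini_app(
    "travel expense tracker",
    draft=lambda s: f"// draft for {s}",
    refine=lambda c: c + "\n// refined",
    fix=lambda c, logs: c + "\n// fixed",
    run=lambda c: RunResult(passed=True, logs=""),
    recall=lambda q: "[]",
    store=lambda *a: None,
)
print(app)
```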

By integrating Sonnet 4.5 and DeepSeek V3.2‑Exp, Macaron can tailor this workflow. For example, a travel planning app might have the UI generator agent using DeepSeek to propose layouts quickly, while the itinerary logic and schedule optimisation use Sonnet 4.5 to ensure accuracy and proper handling of calendars. A budgeting app might rely on DeepSeek for initial charts and tables but use Sonnet 4.5 for complex financial calculations and compliance with regulations.

9 Visualization of improvements

To illustrate the tangible benefits of these technologies, the following charts summarise key metrics.

Figure 2 – A comparative view of Sonnet 4.5 and DeepSeek V3.2‑Exp across coding accuracy, relative speed, cost and autonomy. Higher bars represent better values for accuracy and autonomy; lower bars indicate better (faster or cheaper) performance on efficiency and cost.

Figure 3 – Replit’s internal benchmarks show code editing errors dropped from 9 % with Sonnet 4 to zero with Sonnet 4.5. Improved instruction following and code refactoring lead to more reliable mini‑apps.

Figure 4 – Combining DAPO and LoRA in an All‑Sync RL pipeline reduces the wall‑clock time of a training step from 9 hours to 1.5 hours[41], enabling faster updates to reward models and memory policies.

These visualisations underscore that the benefits are not theoretical. Reduced GPU requirements, faster training, higher accuracy and lower costs all contribute to a smoother, more efficient mini‑app pipeline.

10 Future directions

Looking ahead, both Anthropic and DeepSeek have hinted at more ambitious architectures. Sonnet 4.5’s successor may expand context windows, improve multilingual reasoning and support more complex tool interactions. DeepSeek’s next‑generation architecture is expected to build on sparse attention to achieve even higher performance at lower cost[31]. For Macaron, further research into self‑compressing memory, lifelong learning and cross‑lingual alignment could enhance personalisation and privacy[43]. Integrating federated learning would allow users to train memory models locally, sharing only model updates, thus improving collective performance while preserving privacy[43]. On the RL side, Macaron’s approach could incorporate normative theories – utilitarianism, deontology, virtue ethics – to provide explanations for its actions[44].

In summary, Macaron’s decision to connect to Claude Sonnet 4.5 and DeepSeek V3.2‑Exp, powered by the Claude Agent SDK, positions it at the forefront of personal AI. Sonnet 4.5 offers unmatched quality, extended autonomy and rich developer tooling; DeepSeek provides speed, efficiency and open‑source flexibility. Combined with Macaron’s innovative RL training techniques and memory engine, these models will help Macaron build mini‑apps faster, more smoothly and with fewer bugs. As personal AI continues to evolve, Macaron’s blend of autonomy, safety, ethics and efficiency serves as a blueprint for responsible innovation.


[1] [6] [7] [8] [9] [10] [11] [12] [13] [43] Inside Macaron's Memory Engine: Compression, Retrieval and Dynamic Gating - Macaron

https://macaron.im/memory-engine

[2] [3] [4] [5] [44] [title unknown]

https://macaron.im/reinforcement-learning

[14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [36] Building Agents with Claude Code's SDK

https://blog.promptlayer.com/building-agents-with-claude-codes-sdk/

[24] [25] [26] [27] [37] Anthropic Claude Sonnet 4.5: Features, Pricing And Comparison - Dataconomy

https://dataconomy.com/2025/09/30/anthropic-claude-sonnet-4-5-features-pricing-and-comparison/

[28] [29] [30] [32] [33] [34] [35] AI on AI: DeepSeek-3.2-Exp and DSA – Champaign Magazine

https://champaignmagazine.com/2025/09/29/ai-on-ai-deepseek-3-2-exp-and-dsa/

[31] [38] China's DeepSeek releases 'intermediate' AI model on route to next generation | Reuters

https://www.reuters.com/technology/deepseek-releases-model-it-calls-intermediate-step-towards-next-generation-2025-09-29/

[39] [40] [41] [42] MIND LABS | Scaling All-Sync RL with DAPO and LoRA

https://mindlabs.macaron.im/
