Macaron-A2UI: A Model for Generative UI in Personal Agent

April 30, 2026

Most software still treats the interface as a fixed artifact. Generative UI changes that assumption: when a Personal Agent understands the user's intent, context, and next step, it can render a small trusted surface for the current decision instead of forcing every interaction into a static screen or another paragraph of text. The useful middle ground is not plain markdown and not unconstrained HTML; it is an interface expressive enough for interaction and controlled enough to render safely. In Macaron's food flow, this may mean confirming dietary restrictions, guiding restaurant search, presenting favorite options as structured cards, opening map navigation, or staying text-only when the user simply wants advice on eating better.

Macaron has already explored this direction through MiniApps, turning natural-language requests into durable life tools such as meal logs, cooking journals, fitness trackers, and travel planners; Macaron-A2UI now brings the same tool-making impulse into the turn-by-turn agent loop. It lets the assistant create, update, or dismiss interface surfaces as the task evolves. This space is sometimes discussed as dynamic UI or conversational UI; here we use Generative UI to mean model-generated, protocol-grounded interface actions inside a Personal Agent conversation.

Under the A2UI protocol [2], one assistant turn can contain natural language for the user and executable UI actions for a trusted client to validate and render. Macaron-A2UI learns when UI helps, what form it should take, and how to keep the interaction valid across turns. This also matters for deployment: instead of pasting a full schema into every prompt, much of the protocol competence lives in the model, enabling shorter prompts and lower-latency interactions. For Mind Lab, these generated surfaces are also part of the Experiential Intelligence loop: they help Personal Agents act, receive grounded feedback, and learn from real use.

Figure 1 summarizes three recurring patterns: collecting information with chips or sliders, structuring actions into compact plans, and presenting results with ranked cards and action buttons.

Figure 1: Many dialogue turns that are cumbersome in plain text become more efficient when the assistant can render lightweight structured interfaces.

TL;DR

Macaron-A2UI turns Generative UI into a model capability for Personal Agents: the assistant can reply with natural language plus executable UI actions inside the same turn.
We build the data and evaluation stack for this setting: 14,245 A2UI-grounded training samples and A2UI-Bench, a 300-task benchmark with language-side and visual-side metrics.
SFT teaches the model to speak the protocol; GRPO teaches it to use the protocol better. On Qwen3-235B, the overall score rises from 21.6 to 63.6 after SFT and 74.2 after GRPO.
The main deployment point is interaction speed: Macaron-A2UI uses prompts that are 27x shorter than full-schema prompting in our setup, and the preview stack uses faster inference infrastructure for a lower-latency Generative UI experience.

Why this problem matters

Today’s assistants are caught between several interaction styles. The default is still plain-text dialogue, even when the user is really trying to compare options, confirm details, or keep track of changing state. Text can express these things, but it often pushes the work of organization back onto the user.

Approach	In chat	Rich UI	Creates UI	Updates surfaces	Controlled render	Turn updates
Text-only dialogue	✓	✗	✗	✗	✓	△
Generated apps / webpages	△	✓	✓	✗	△	△
Agents on existing GUIs	△	✓	✗	✓	△	△
Macaron-A2UI	✓	✓	✓	✓	✓	✓

Table 1: Four interaction styles for agent interfaces. ✓ indicates a core strength, △ indicates possible but indirect or heavier support, and ✗ indicates behavior outside the main design.

A second line of work turns natural language or visual inputs into markup, code, or full webpages [10,11]. This is powerful, but it is often heavier than what a tight multi-turn assistant loop needs. A third line builds agents that operate on existing screens and GUIs [12]. Those agents are important, but they usually act on interfaces that already exist rather than creating new conversational surfaces inside the chat itself.

Generative UI work points in the other direction: interfaces can be outputs of foundation models, not just containers around them. Recent work studies LLM-generated interfaces along functional and interaction-centered axes [7], preference-aligned layout [8], and the ability of large models to assemble structured task layouts beyond linear text [9].

Macaron-A2UI targets the space between these threads. It focuses on assistant-side Generative UI under a fixed declarative protocol [2]: the model emits validated interface actions, the client renders them through a trusted stack, and training stresses when a compact surface is better than another paragraph. The goal is not open-ended webpage synthesis on every turn, but repeatable and auditable interaction contracts inside the dialogue.

Macaron-A2UI treats UI as part of the assistant’s response. Each turn can contain ordinary language for the user and a list of structured interface actions for the client:

{
  "text_response": "...",
  "a2ui": [ ... ]
}

The text response explains what the assistant is doing. The A2UI messages describe what the client should render: a surface, a set of components, state updates, or cleanup when the surface is no longer needed. The client owns the rendering stack, so the model is not writing arbitrary front-end code. It is choosing and filling a controlled interface contract. In this sense, Macaron-A2UI is a conversational UI model, but its output space is deliberately constrained: it generates validated A2UI actions rather than arbitrary front-end implementations.

That contract is what makes the problem interesting. A valid response is not enough. The model has to decide whether UI belongs in the turn at all, choose components that match the user’s need, ground every label and option in the conversation, and leave the user with a clear next action.

Building an A2UI-grounded Generative UI corpus

There is no natural dataset where every assistant turn already says whether a UI should appear, what it should look like, and how it should connect back to the dialogue. We therefore built an A2UI-grounded training set for Generative UI from four dialogue sources: MultiWOZ [3], Schema-Guided Dialogue [4], ESConv [5], and AnnoMI [6].

These sources play different roles. MultiWOZ and SGD provide task-oriented interactions where UI is often natural: collecting constraints, showing candidates, and confirming bookings. ESConv and AnnoMI provide a very different signal. In emotional-support and counseling-style conversations, many turns should stay as text. A model that renders UI everywhere would feel pushy and unnatural, so these datasets help teach restraint.

Source	Domain	Samples	UI ratio
MultiWOZ	Task-oriented assistance	5,424	80.4%
SGD	Task-oriented assistance	4,757	79.1%
ESConv	Emotional support	1,098	55.0%
AnnoMI	Motivational interviewing	2,966	50.0%
Total	—	14,245	71.7%

Table 2: Corpus composition by source. Samples are assistant-turn training pairs; UI ratio is the fraction of samples whose assistant response includes a non-empty A2UI payload. The Total row sums 14,245 turns spanning task-oriented dialogue (MultiWOZ, SGD), emotional support (ESConv), and motivational interviewing (AnnoMI).

In total, the corpus contains 14,245 assistant-turn samples: 10,210 UI turns and 4,035 text-only turns. The overall UI ratio is 71.7%, but the split is not uniform. Task-oriented sources sit around 80% UI, while ESConv and AnnoMI are close to balanced. This contrast is important: the model must learn both sides of the decision, when a compact interface helps and when another human sentence is the better response.

The corpus also covers a range of interface shapes. At the component level, it includes common layout primitives such as rows, columns, and cards, as well as interactive widgets such as buttons, selection lists, sliders, and date/time inputs. At the response level, the supervision spans selection or slider turns, button-driven actions, form-like inputs, display-only UI, and text-only responses. This matters because a Generative UI model should not only learn one template; it should learn a vocabulary of interaction patterns.

Figure 2: Corpus coverage. Up: top A2UI component types. Do: response-level supervision archetypes.

How we build the data

The annotation pipeline is deliberately hybrid. When the source dataset already contains structured signals, we use them. For MultiWOZ and SGD, dialogue acts, slots, and booking states map naturally into surfaces such as forms, candidate lists, and confirmations. Rules handle most of this conversion, while LLMs mainly polish user-facing language.

For ESConv and AnnoMI, the right UI is less explicit. Here we use a two-step LLM process. An Editor decides whether the turn should contain UI and what interaction pattern fits. An Author then fills in the labels, options, and layout for the selected turns. This separation keeps the “should we render?” decision distinct from the “what should it say?” decision.

Every generated UI turn then goes through deterministic repair, validation, and retry. After this process, 99.2% of UI turns are renderable. This pipeline turns model-assisted annotation into a reliable training signal by combining flexible generation with deterministic checks.

Figure 3: Overview of the A2UI corpus construction pipeline: dialogue normalization and intermediate action mapping; rule-first conversion for task-oriented data; Editor and Author generation for open-domain data; deterministic post-processing; validation and retry.

A2UI-Bench: a benchmark for useful Generative UI

Training data gives the model examples. Evaluation needs a harder question: did the model produce an interface that actually works for this turn?

How we build the data

A2UI-Bench: a benchmark for useful Generative UI

Training data gives the model examples. Evaluation needs a harder question: did the model produce an interface that actually works for this turn?

A2UI-Bench is our 300-task benchmark for that question. It covers three task shapes. Atomic tasks test one-turn Generative UI: should the assistant render anything, and if so, what surface fits the user's request? Depth tasks roll through short multi-turn episodes using the model's own previous outputs, so state and surface updates have to stay coherent. Width tasks combine multiple needs in one turn, such as comparing options while also collecting a constraint, and test whether the model can organize the result without overloading the user. The benchmark also includes no-UI cases. This is important because a model that renders something on every turn can look active while making the conversation worse. Sometimes the right answer is simply a clear sentence.

We score each output from two views. The language-side evaluation asks whether the response is valid, useful, and well-paced. L1 checks protocol correctness: parsing, schema, references, required fields, and value formats. L2 checks task construction: whether the UI trigger is appropriate, whether the component matches the intent, whether labels and options are grounded in the text, whether state is tracked, and whether actions form a usable loop. L3 checks user experience: whether the UI adds value over text, feels natural in context, and keeps cognitive load manageable.

Figure 4: Language-side evaluation asks whether each response is valid, useful, and well paced across protocol correctness (L1), task construction (L2), and user experience (L3).

The visual-side evaluation looks at the rendered result. We pass the generated A2UI through the same client renderer, capture a screenshot, and ask a VLM judge to score visual integrity, task alignment, and action clarity. This catches failures that raw JSON cannot show: clipped text, awkward spacing, invisible controls, or a surface that looks clean but does not help the user.

Training Macaron-A2UI

Macaron-A2UI is trained in two stages on fixed backbones (Qwen3-30B-A3B-Instruct and Qwen3-235B-A22B-Instruct, same protocol at both scales).

Supervised fine-tuning teaches the unified JSON format and joint text and UI grounding; the model must learn to “speak A2UI” at all.

GRPO (group-relative policy optimization) continues from the SFT checkpoint with rewards aligned to executable interaction quality: hard gates zero out malformed or non-renderable outputs; passing responses are scored with the same L1/L2/L3 structure as the benchmark, including a simplified reward path for appropriate text-only turns. This targets properties imitation alone struggles to pin down, such as trigger timing and completion of interaction loops.

The headline result

The central empirical message matches the paper: Generative UI competence can be internalized such that inference relies on minimal instructions only, with no need to paste the full protocol into every prompt.

Figure 5: Training-pipeline ablation under the prompt-without-full-schema regime and comparison to full-prompt upper bounds. Solid bars on the left of each panel show results with only lightweight instructions for untuned, SFT, and SFT+RL models, together with untuned frontier references. Hatched bars on the right of the dashed separator show upper bounds where models receive the complete A2UI schema and protocol specification. Scores are reported on A2UI-Bench for L1, L2, L3, and the overall language-side score (GRPO is the RL stage in our pipeline).

With only lightweight instructions in the prompt (no full schema text), Qwen3-30B moves from 19.8 (base) to 37.2 after SFT and 58.8 after GRPO. Qwen3-235B moves from 21.6 to 63.6 after SFT; GRPO then reaches 69.5 and 74.2 at two RL checkpoints. The 74.2 checkpoint is the one we highlight upstream.

For perspective on untuned frontier models under the same short-instruction setup, the paper reports overall scores of 25.5 (GPT-4o mini), 23.9 (GPT-5.4), and 21.9 (DeepSeek-V3.1), far below the trained models. Generic instruction alone appears insufficient for stable A2UI competence.

The headline comparison to frontier models is stated carefully in the paper and worth repeating here. Our best 235B model reaches an overall language-side score of 74.2, slightly above GPT-5.4 at 74.1 when GPT-5.4 is evaluated with the full schema and protocol in the prompt. Strong A2UI behavior can emerge without embedding the entire schema in every request. The aggregate edge comes mainly from stronger L1 and competitive L2, while GPT-5.4 remains stronger on higher-level dimensions such as L3. The takeaway is internalized protocol and structure under light prompts, including cases where scores diverge across layers.

Frontier models: minimal prompts vs full specification

Keep these two evaluation setups distinct. At inference time, Macaron-A2UI is meant to run with minimal instructions only: the model must rely on internalized protocol competence. Full specification in the prompt gives frontier models the complete A2UI schema and protocol text and is reported as a strong upper bound. It answers a different question than our trained inference-time setup.

When models do receive that full specification, frontier systems score highly: GPT-5.4 reaches 74.1, Gemini-3.1 Pro 71.0, and DeepSeek-V3.1 63.8 on overall language-side score in the paper’s aggregate reporting. The same frontier models without that injection remain comparatively weak under minimal instructions alone (see numbers above). Training closes most of that gap where prompting alone falls short.

Model	Prompt	L1	L2	L3	V1	V2	V3	Avg.
GPT-5.4	w/ schema	4.02	3.59	3.27	3.46	3.73	3.17	3.54
Gemini-3.1-Pro	w/ schema	4.25	3.20	2.96	3.53	3.55	3.04	3.42
Qwen3-235B-A22B-Instruct	w/ schema	4.00	2.87	2.76	3.32	3.32	2.95	3.20
DeepSeek-V3.1	w/ schema	4.19	2.54	2.47	3.27	3.35	2.95	3.13
Qwen3-30B-A3B-Instruct	w/ schema	3.13	2.13	2.09	2.92	2.64	2.51	2.57
GPT-4o-mini	w/ schema	3.45	2.27	2.15	2.78	2.28	2.06	2.50
Macaron-A2UI-235B	w/o schema	4.67	3.22	2.91	3.95	3.74	3.47	3.66

Table 3: Per-metric averages on language-side (L1–L3) and visual-side (V1–V3) scores, plus a simple mean Avg. over those six components. Frontier rows use prompt w/ schema (the model receives the full A2UI schema and protocol in the prompt), a strong upper bound and not a like-for-like comparison to minimal-prompt training. The last row is Macaron-A2UI under w/o schema__, shown on the same grid for contrast. Higher is better on every metric.

The disciplined claim is that with data, benchmark design, and GRPO rewards, models evaluated with minimal prompts alone can match or slightly exceed the strongest full-specification baseline in aggregate score. For deployment, brief prompts stay cheaper and easier to operate than injecting the entire protocol on every request. In our setup, Macaron-A2UI uses prompts that are 27x shorter than the full-schema baseline while preserving strong aggregate performance.

What changes after training

SFT and GRPO improve different parts of the A2UI skill. SFT teaches the response format, common components, and the basic link between the text response and the rendered surface. GRPO changes the model in a more interaction-level way: for the same user turn, there are often many valid UI responses, but some only wrap text in a card, ask for too much information, or miss the next action. GRPO teaches the model to make A2UI valid, task-aware, and user-friendly under the reward signal.

This is where the trained model starts to show a more interesting form of generalization. It can combine familiar controls such as chips, sliders, cards, and action buttons across domains, and it can decide when plain text is the better response. The model is not only memorizing a schema; it is learning a Generative UI interaction style.

Why brief prompts matter at deployment

Keeping the full A2UI schema and protocol specification out of every user turn shrinks context windows, simplifies versioning when the protocol evolves, and avoids leaking long structural prompts into logs or partner integrations. Macaron-A2UI is trained precisely for that minimal-instruction regime: weights hold most protocol machinery while prompts supply short steering signals each turn.

This is also a latency argument. In a Generative UI product, delay changes the user's perception of agency: a restaurant card, preference selector, or map action needs to appear while the user still feels the agent is helping with the current decision. We therefore optimize deployment along two axes. First, Macaron-A2UI reduces the number of prompt tokens needed for reliable A2UI behavior by internalizing much of the protocol. Second, the serving stack can use faster inference frameworks, such as TileRT, to reduce runtime overhead and make generated interfaces feel responsive rather than appended after the conversation has moved on. With the combined model-training and serving optimizations in the preview stack, UI generation can typically be kept within 1 to 3 seconds.

That trade has limits: frontier models evaluated with the full specification still lead on some subjective dimensions (L3), so products may combine internalized competence with selective schema hints or judges for high-stakes flows. Training can still deliver strong aggregate performance without perpetual schema stuffing if it aligns with the benchmark’s layered goals.

Preview and open-source release

We have opened a Macaron-A2UI preview so users can try the new Generative UI flow directly. The preview focuses on the product path where latency and interface timing matter most: helping users eat better, clarify dietary preferences, search for restaurants, inspect candidate restaurants as cards, and move into map navigation when a choice is made.

We have also released the trained Macaron-A2UI model checkpoints on Hugging Face at mindlab-research/Macaron-A2UI-Tall. The goal is to make the research artifact inspectable and reusable, not only to report benchmark numbers. A protocol-grounded Generative UI model should be evaluated in papers, tested in products, and available for the community to study.

What the interaction looks like in practice

Example 1: from intention to action

Motivational interviewing often progresses through reflective dialogue; pushing another dense paragraph can feel dismissive or rushed. In the atomic AnnoMI-style setting below, the assistant pairs empathetic language with a compact surface that turns an abstract intention into a specific, user-owned next step, for example capturing a commitment the user can revisit without re-reading a wall of text.

The point is timing and proportion: the UI supplements the therapeutic stance of the reply and anchors a decision so the conversation can move forward without forcing the user to mentally track every nuance from prior turns.

Figure 6: A reflective motivational exchange culminates in a concise reminder card that turns an abstract goal into a concrete next step.

Example 2: making transactional state glanceable

Task-oriented assistants routinely reach moments where the user must trust transactional details: times, locations, IDs, confirmation codes. A prose recap can be correct yet error-prone for humans to verify; surfacing the same state as structured fields exploits visual grouping and alignment that plain text rarely achieves at similar length.

The MultiWOZ-style example below illustrates that pattern: booking parameters appear as inspectable facts tied to the assistant’s wording, reducing the cognitive work of reconciling “what was said” with “what we think is true” before the user acts.

Figure 7: The final booking turn is rendered as a compact confirmation surface, making the transaction state easy to verify at a glance.

Conclusion

Macaron-A2UI is only a first step toward Generative UI for Personal Agents. Today, most interfaces are still designed as fixed surfaces, and most assistants still answer as if text were the only medium. We think the next generation of AI products will need something more flexible: interfaces that can be generated, evaluated, and improved as part of the model's behavior.

A2UI gives us a concrete place to start. It turns Generative UI into something that can be trained, rendered, validated, and measured. Macaron-A2UI shows that models can internalize part of this interface competence, so dynamic UI does not have to depend on pasting a long schema into every request. But there is still a long path ahead. The training data still depends partly on model-assisted annotation, visual evaluation still relies on judge models, and longer-horizon interaction remains underexplored. Future systems should handle longer tasks, richer visual layouts, stronger accessibility guarantees, personalization, and human feedback on pacing and trust. They should also learn when to stay quiet.

For Mind Lab, the broader implication is that the frontend of a Personal Agent can become part of the learning loop. A generated surface is not only a presentation layer; it can expose state, ask for grounded feedback, and create structured evidence about whether the agent helped. That is why Generative UI matters for Experiential Intelligence. The interface should meet users where they are, change as their intent changes, and help Personal Agents learn from real experience with less friction. Macaron-A2UI is our first step in that direction.

References

[1] Macaron-A2UI: A Model for Generative UI in Personal Agent (Mind Lab et al, 2026)

[2] A2UI Protocol v0.8 (Stable) (A2UI Project et al, 2026)

[3] MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling (Budzianowski et al, 2018)

[4] Towards Scalable Multi-Domain Conversational Agents: The Schema-Guided Dialogue Dataset (Rastogi et al, 2020)

[5] Towards Emotional Support Dialog Systems (Liu et al, 2021)

[6] Anno-MI: A Dataset of Expert-Annotated Counselling Dialogues (Wu et al, 2022)

[7] Generative Interfaces for Language Models (Chen et al, 2025)

[8] AlignUI: A Method for Designing LLM-Generated UIs Aligned with User Preferences (Liu et al, 2026)

[9] Generative UI: LLMs are Effective UI Generators (Leviathan et al, 2025)

[10] UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback (Wu et al, 2024)

[11] Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering (Si et al, 2025)

[12] WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point (Zhao et al, 2025)

Author

Mind Lab

Core Contributors

Fancy Kong, Congjie Zheng, Murphy Zhuang, Rio Yang, Sueky Zhang, Arthur Fu, Gene Jin, Song Cao, Kaijie Chen, Andrew Chen, Pony Ma

Team

Andrew Chen, Kaijie Chen, Song Cao, Cleon Cheng, Steven Chiang, Nolan Ho, Charles Huang, Fancy Kong, Kyrie Lei, Andrew Lei, Lucian Li, Ray Li, Theo Li, Logan Liu, Kieran Liu, Xiang Liu, Irvine Lu, Pony Ma, Vincent Wang, Guikun Yang, Rio Yang, Shiro Yang, Maxwell Yao, Regis Ye, Di Zhang, Ruijia Zhang, Conley Zhao, Congjie Zheng, Adrian Zhou, Murphy Zhuang and Mindverse Team

Names are listed alphabetically within team and acknowledgement.

Citation

Please cite this work using the BibTeX citation:

@misc{kong2026macaron_a2ui,
  author = {Fancy Kong and Congjie Zheng and Murphy Zhuang and Rio Yang and Sueky Zhang and Arthur Fu and Gene Jin and Song Cao and Kaijie Chen and Andrew Chen and Pony Ma and {Mind Lab}},
  title = {Macaron-A2UI: A Model for Generative UI in Personal Agent},
  year = {2026},
  howpublished = {Mind Lab: A Lab for Experiential Intelligence},
  note = {https://macaron.im/mindlab/research/macaron-a2ui-generative-ui-personal-agent}
}

Share to