
Author: Boxu LI
In the evolving landscape of artificial intelligence, where pretraining at extreme scale has yielded formidable but static capabilities, the frontier is shifting from building ever-larger frozen models to creating agentic systems: AI agents that can reason deeply, use tools, see and remember, and continuously learn from experience[1].
Thinking Machines Lab's Tinker platform, with its general availability announcement on December 12, 2025, represents a pivotal infrastructural leap, democratizing access to fine-tuning and multimodal extension of trillion-parameter models. Concurrently, Mind Lab, the research division of Macaron AI, articulates a philosophical and technical framework for "experiential intelligence," wherein models transition from frozen repositories of knowledge to dynamic processes that refine themselves via real-world feedback. This convergence offers profound opportunities for co-designing research and product, closing the loop between algorithmic innovation and deployed adaptation.
Key Innovations in Tinker's Updates
In this post, we’ll dive into Tinker’s new Kimi K2 reasoning model, OpenAI-compatible interface, and Qwen3-VL vision models, then explore Mind Lab’s philosophy of experiential intelligence, their trillion-parameter reinforcement learning (RL) breakthroughs, memory diffusion approach, and the strategic implications for building the next generation of AI systems.
Tinker is an AI training platform designed to let researchers fine-tune and deploy cutting-edge models without worrying about infrastructure[2][3]. In December 2025, Tinker announced several major updates that bolster the reasoning capabilities, tool use, and vision understanding of AI models[4]: support for the trillion-parameter Kimi K2 reasoning model, an OpenAI-compatible sampling interface, and the Qwen3-VL family of vision-language models. The vision update is best illustrated by a few-shot image classification comparison.
Figure: Comparison of fine-tuned Qwen3-VL-235B (vision-language model) vs. DINOv2 (vision-only baseline) on image classification tasks with limited labeled examples. Qwen3-VL achieves higher accuracy, especially in the low-data regime (far left), thanks to its language-informed visual understanding.[15]
Even with only one example per class, the 235B Qwen3-VL model attained reasonable accuracy, significantly outperforming DINOv2 in this extreme low-data regime[15]. As the number of examples increased, both models improved, but Qwen3-VL retained an edge, demonstrating stronger few-shot generalization[16]. The advantage comes from the model’s built-in language and world knowledge – for instance, Qwen3-VL already has a concept of what a “sunflower” or “golden retriever” looks like or is described as, by virtue of its multimodal pretraining[16]. This means it can recognize or categorize novel images with minimal new examples. In practical terms, Tinker’s users can achieve high accuracy on vision tasks with very small datasets by leveraging these large vision-language models. This data-efficient vision capability is crucial for real-world scenarios where labeled data is scarce. It also hints at the power of tool-augmented reasoning: a model that “sees” can leverage both visual cues and linguistic context, making it a more versatile agent (for example, reading a diagram and explaining it, or using an image as part of a reasoning chain). Overall, the addition of Qwen3-VL to Tinker extends the platform’s reach from pure text to the visual domain, enabling multi-modal reasoning workflows under the same unified training API.
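To make the low-data regime concrete, here is a minimal sketch of how a k-shot fine-tuning set could be assembled, one chat-style record per labeled image. The JSONL schema, folder layout, and file names are illustrative assumptions, not Tinker's actual data format.

```python
# Minimal sketch: assemble a k-shot image-classification fine-tuning set as
# chat-style records. The record schema and directory layout are illustrative only.
import json
import random
from collections import defaultdict
from pathlib import Path

def build_k_shot_dataset(image_dir: str, k: int, out_path: str) -> int:
    """Expects <image_dir>/<class_name>/<image>.jpg; writes one JSONL record per example."""
    by_class = defaultdict(list)
    for img in Path(image_dir).glob("*/*.jpg"):
        by_class[img.parent.name].append(img)

    records = []
    for label, images in by_class.items():
        for img in random.sample(images, min(k, len(images))):
            records.append({
                "messages": [
                    {"role": "user", "content": "Which class is shown in this image?",
                     "image": str(img)},
                    {"role": "assistant", "content": label},
                ]
            })

    with open(out_path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return len(records)

# One example per class: the extreme low-data regime discussed above.
n = build_k_shot_dataset("data/flowers", k=1, out_path="k1_train.jsonl")
print(f"wrote {n} training records")
```

A LoRA fine-tune of a vision-language model such as Qwen3-VL would then consume a file like this; the point is simply how little labeled data the few-shot regime demands.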
On the research front, Mind Lab – a new frontier research lab affiliated with Macaron AI – is tackling the challenge of making AI agents truly adaptive and experiential. Mind Lab’s ethos is that “real intelligence comes from real experience, not just bigger pre-training”[17]. In other words, simply scaling up models on static datasets is not enough; the next leap in AI will come from systems that learn continually from interactions, much like humans accumulating experience. Mind Lab frames this vision as Experiential Intelligence – moving from static “brains” to adaptive “minds” that can form internal world models, update their knowledge through feedback, have explicit goals or values, and even reflect on their own actions[18]. This is a direct response to the limitations of current LLMs, which are often powerful but frozen after pre-training[18]. By introducing mechanisms for genuine adaptation – such as continual reinforcement learning and dynamic memory – Mind Lab aims to create agents that evolve with use.
Two core pillars of Mind Lab’s work are: (1) Efficient RL fine-tuning of massive models to instill new behaviors, and (2) Advanced memory systems that allow agents to retain and utilize long-term knowledge. Both are geared toward making AI more agentic (autonomously deciding and improving) and tightly coupling research advances with product deployment.
One of Mind Lab’s headline achievements is demonstrating reinforcement learning at trillion-parameter scale – and doing so in a practical, cost-effective way. In December 2025 they announced the first end-to-end RL pipeline on the 1.04T-parameter Kimi K2 reasoning model, achieved with only ~10% of the GPU resources that such training would normally require[19]. How was this possible? The team built a specialized training engine that combines parameter-efficient finetuning (LoRA) with hybrid parallelism across the model’s Mixture-of-Experts structure[20][21].
Instead of tuning all trillion weights, Mind Lab’s approach injects low-rank adaptation matrices into selected layers of Kimi K2 (both in the dense backbone and within expert layers) and updates only those during RL[22]. This dramatically reduces the number of trainable parameters (for example, a LoRA rank of a few tens or hundreds per layer, instead of full matrices) and hence cuts memory and compute usage by an order of magnitude. At the same time, training a model of this size requires distributing the workload across many GPUs efficiently. The team employed a hybrid-parallel strategy: a coordinated use of tensor parallelism, pipeline parallelism, expert parallelism (for the MoE experts), and sequence parallelism (for long sequence training), all made compatible with sharded LoRA updates[23]. In practice, this meant leveraging existing large-model training frameworks (NVIDIA’s Megatron and ByteDance’s VolcEngine RL), augmenting them to handle LoRA on MoE, and carefully balancing the computation across 64 GPUs in a cluster[24]. The result was stable on-policy RL training (akin to a PPO-style algorithm) on the full Kimi K2 model with a reward model providing feedback on reasoning quality[22] – something previously thought infeasible for most teams due to cost.
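As a rough illustration of why this is so much cheaper, the sketch below wraps a frozen linear projection (dense backbone or MoE expert) with a rank-r adapter so that only two small matrices receive gradients. The shapes, rank, and scaling are placeholders, not Kimi K2's actual configuration.

```python
# Sketch of the parameter-efficiency argument behind LoRA: freeze the pretrained
# weight and train only a low-rank update W + (alpha/r) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 32, alpha: float = 64.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                       # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# One 8192x8192 projection: full fine-tuning would train ~67M weights here,
# while a rank-32 adapter trains only ~0.5M (two 8192x32 matrices).
layer = LoRALinear(nn.Linear(8192, 8192, bias=False), rank=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
```

Applying a wrapper like this only to selected projections across the dense layers and experts is what keeps the trainable-parameter count, and with it gradient and optimizer-state memory, an order of magnitude below full fine-tuning; the hybrid tensor/pipeline/expert/sequence parallelism then spreads the frozen trillion-parameter forward pass across the GPU cluster.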
Equally important, it worked: the LoRA-finetuned Kimi K2 achieved significant improvements on long-horizon reasoning tasks, with smooth learning curves and no divergence[25]. Crucially, the adapted model retained the general skills of the base model (thanks to only minimal, focused weight changes) while gaining new task-specific behaviors[26]. This means the base model’s massive prior knowledge was not overwritten, only augmented – a key benefit of LoRA finetuning. In fact, Mind Lab’s experiments confirmed that larger models provide a stronger foundation for RL. Under a fixed training budget, a large model plus small LoRA adapters outperformed a smaller model trained with full tuning, both on in-domain tasks and transferring to new ones[27]. As the team puts it, RL is “prior-limited” – if the base model can’t generate high-quality trajectories to begin with, RL has little signal to amplify[27]. A powerful pretrained prior like Kimi K2 gives RL a rich set of behaviors to hone in on, whereas training a small model from scratch has to invent those behaviors anew. This insight flips the conventional wisdom: it can be more compute-efficient to do RL on a large model (with a strong prior and LoRA efficiency) than to do RL on a smaller model, even if the smaller model is cheaper per step[28]. Mind Lab’s contribution here is not just an algorithm, but an infrastructure strategy – a blueprint for making continuous learning feasible on the biggest models. They have upstreamed their methods into open-source projects (Megatron-Bridge, VERL)[29], so the community can reproduce and build on this work, potentially enabling many groups to fine-tune trillion-parameter agents on modest hardware budgets.

Another frontier for Mind Lab is how an AI agent can handle long-term memories of its interactions. Many current systems bolt on a vector database for retrieving past conversation snippets or use summary techniques to compress history. Mind Lab proposes a more integrated, “model-native” memory system called Memory Diffusion[30]. The idea is to treat the entire sequence of an agent’s dialogue or trajectory as editable memory within the model’s context, rather than something stored externally. Memory Diffusion works by iteratively maintaining a fixed-size window of context via a mask–allocate–refill loop[30]. At each step, the model decides which tokens (pieces of past conversation) to keep and which to mask out, then refills the freed space with newly incoming content, all while respecting a strict token budget for the context length[30]. Essentially, the model is learning to manage its own context, compressing or forgetting less relevant details and retaining important facts as the interaction grows. This is analogous to intelligent forgetting, where the goal isn’t to remember everything indefinitely (which isn’t feasible given context length limits), but to remember usefully under real constraints[30].
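A toy version of that loop, with a hand-written importance score standing in for the model's learned keep/drop decision, is sketched below. It only illustrates the budgeted mask–allocate–refill mechanics; the scores, budget, and token granularity are placeholders, not Mind Lab's implementation.

```python
# Toy mask–allocate–refill loop over a fixed token budget. In Memory Diffusion the
# keep/drop decision is learned by the model; here `importance` is a hard-coded stand-in.
from dataclasses import dataclass

BUDGET = 4  # maximum items kept in the working context

@dataclass
class Token:
    text: str
    importance: float  # stand-in for the model's learned retention signal

def step(context: list[Token], incoming: list[Token]) -> list[Token]:
    keep = sorted(context, key=lambda t: t.importance, reverse=True)  # mask: rank what to retain
    room = max(BUDGET - len(incoming), 0)                             # allocate: reserve space for new content
    return (keep[:room] + incoming)[:BUDGET]                          # refill: add the new tokens under budget

context: list[Token] = []
turns = [
    [Token("user prefers metric units", 0.9), Token("hello", 0.1)],
    [Token("small talk about weather", 0.2), Token("deadline is Friday", 0.8)],
    [Token("ok", 0.05), Token("thanks", 0.05)],
]
for incoming in turns:
    context = step(context, incoming)

print([t.text for t in context])  # the high-importance facts survive; early chit-chat is pruned
```

The real system learns these decisions end to end and operates on actual token sequences, but the constraint is the same: whatever happens, the working context never exceeds its fixed budget.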
By operating at the token sequence level, Memory Diffusion avoids the need for external embeddings or similarity search; the “memory” lives in the same representational space as the model’s working context. Mind Lab reports that this approach achieves state-of-the-art long-horizon memory performance, meaning the agent can carry on extended conversations or tasks without losing pertinent information, all through learned in-model mechanisms[31]. It also runs in constant time relative to context size – no explosion of retrieval cost as history grows, since the context length is fixed and managed via the mask/refill operations[31]. In practical terms, an agent with Memory Diffusion could engage in a conversation lasting thousands of turns, and while it cannot explicitly keep every detail, it will continuously decide what to keep in mind. Important user preferences or unresolved questions will persist, while trivial chit-chat from much earlier might be pruned away. This approach treats memory as a first-class component of the model’s cognition, aligning with Mind Lab’s view that memory should be an active, learning part of the system rather than a passive datastore[30].
Tinker's infrastructural affordances and Mind Lab's algorithmic efficiencies form a natural symbiosis. Tinker enables direct application of Mind Lab's hybrid LoRA RL to Kimi K2 and Qwen3-VL, facilitating multimodal agentic loops.
This symbiosis manifests most directly in research–product co-design, Mind Lab's core tenet.
Strategically, this paradigm accelerates iteration: products become experimental testbeds, yielding high-fidelity data that refines research hypotheses. For instance, few-shot vision classification gains from Tinker can seed RL objectives in deployed visual agents, progressively aligning perceptual policies with user preferences.
Traditionally, AI research would produce a model or algorithm, and then separately a product team might figure out how to deploy it, with relatively slow iteration between the two. Mind Lab instead operates on a philosophy of research–product co-design: every new technique is quickly tested in a live agent setting, and real user interactions generate data to refine the research[32].
“Research and product are no longer separate tracks. They are a closed feedback loop: user experience → data → RL training → deployment → better UX → richer data → repeat.”[33]. In practice, this means that when Mind Lab improves their RL algorithm or memory system, they integrate it into an actual user-facing agent (for example, Macaron’s personal AI assistant) and observe how it performs with real users. The usage data – what questions users ask, where the agent fails or succeeds, explicit feedback – is then fed back as training signal (through supervised fine-tuning or reinforcement learning) for the next model update. This tight loop greatly accelerates learning: the product is the experiment.
One implication is the use of streaming reward models and online RLHF (Reinforcement Learning from Human Feedback). Instead of collecting a static dataset of human preference comparisons and training a reward model once, Mind Lab’s framework envisions continuously updating the reward model as new feedback comes in during deployment. For example, if an agent is solving tasks for users and occasionally gets a thumbs-down or correction, those signals can be streamed into the reward model to refine its notion of “good” behavior on the fly. The next time RL is run (which could be in a scheduled cadence or even asynchronously), the updated reward model guides the policy to better align with user preferences. This streaming RL paradigm turns deployment into an extension of training – the longer the agent runs in the real world, the more experience it gathers, and the better it becomes. The OpenAI-compatible interface provided by Tinker actually complements this strategy: it allows these continuously-learned models to be plugged into existing products and tools easily, meaning a research lab can rapidly push new model versions to a product and observe results, without needing to rebuild the integration each time.
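As a concrete sketch of the loop's shape, the snippet below queries the deployed policy through an OpenAI-compatible endpoint and appends each user reaction to a preference log that a later reward-model or RL update could stream in. The base URL, model name, and log format are placeholders; this illustrates the pattern, not Mind Lab's or Tinker's actual pipeline.

```python
# Serve the current policy via an OpenAI-compatible endpoint and log user feedback
# as streaming reward-model training data. Endpoint and model name are placeholders.
import json
import time
from openai import OpenAI

client = OpenAI(base_url="https://example-tinker-endpoint/v1", api_key="YOUR_API_KEY")
MODEL = "kimi-k2-lora-experiment-0042"  # placeholder name for a fine-tuned checkpoint

def answer(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def log_feedback(prompt: str, completion: str, thumbs_up: bool,
                 path: str = "preference_log.jsonl") -> None:
    """Append one preference example; a scheduled job can stream these into reward-model updates."""
    with open(path, "a") as f:
        f.write(json.dumps({
            "timestamp": time.time(),
            "prompt": prompt,
            "completion": completion,
            "label": 1.0 if thumbs_up else 0.0,
        }) + "\n")

# In the product: serve the user, then record their reaction.
prompt = "Summarize today's unresolved tickets."
completion = answer(prompt)
log_feedback(prompt, completion, thumbs_up=True)
```

Because the serving interface follows the OpenAI format, swapping in the next fine-tuned checkpoint is a one-line change to the model name rather than a new integration.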
From Tinker’s side, the platform’s ability to sample from a model mid-training[10] could facilitate such iterative loops by enabling intermediate evaluations and fine-grained tuning decisions. On Mind Lab’s side, the co-design loop ensures that their innovations (like trillion-scale RL or memory diffusion) are stress-tested in real use cases. This approach surfaces practical challenges early (e.g., how to handle latency or unexpected user inputs) and closes the gap between cutting-edge research and user-facing AI products. The strategic payoff is that improvements are driven by real-world needs and directly validated against real-world use. As Mind Lab notes, genuine progress comes from “continuous learning from user–product interactions”[33], and an agent that can adapt in situ will ultimately deliver a far better user experience than one that is fixed at deployment.
Taken together, the advances from Tinker and Mind Lab highlight a profound shift in how we build AI systems: from static models to adaptive agents co-designed with their environments.
As static scaling laws plateau, the synthesis exemplified by Tinker's accessible trillion-scale customization and Mind Lab's efficient experiential RL heralds a transformative era. By embedding adaptation into the product loop, we move beyond brittle brains toward resilient minds—systems that not only reason and perceive at frontier levels but grow symbiotically with their environments. This co-evolutionary trajectory promises AI that is not merely capable, but continually becoming more attuned to human needs and the complexities of the real world.
[1] [34] [35] [36] [2507.20534] Kimi K2: Open Agentic Intelligence
https://ar5iv.labs.arxiv.org/html/2507.20534
[2] [3] [8] [9] Tinker - Thinking Machines Lab
https://thinkingmachines.ai/tinker/
[4] [5] [6] [10] [11] [12] [13] [14] [15] [16] Tinker: General Availability and Vision Input - Thinking Machines Lab
https://thinkingmachines.ai/blog/tinker-general-availability/
[7] [20] [21] [22] [23] [24] [25] [26] [27] [28] [37] How We Build Trillion Parameter Reasoning RL with 10% GPUs
[17] [30] [33] Macaron AI | LinkedIn
https://www.linkedin.com/company/macaronaiofficial
[18] [19] [29] [31] [32] Introducing Mind Lab — Macaron AI's Research Arm