
Author: Boxu Li
After a decade dominated by large-scale pre-training, the AI community is entering what some call the “second half” of AI development[1][2]. In the first half, progress was driven by new model architectures and training methods that relentlessly hill-climbed benchmarks[3] – from convnets and LSTMs to Transformers – all optimized via supervised or self-supervised learning on static datasets. But today, frontier models like GPT-4 have essentially saturated many benchmarks, and simply scaling data and parameters yields diminishing returns[2]. This shift has sparked a re-examination of how we achieve further intelligence and utility from AI.
One emerging consensus is that Reinforcement Learning (RL) will play an outsized role in this next phase. RL has long been considered the “end game” of AI – a framework powerful enough to eventually win at arbitrary tasks by optimizing long-term rewards[4]. Indeed, it’s hard to imagine superhuman systems like AlphaGo or AlphaStar without RL at their core[4]. Now, with large pre-trained models as a foundation, many researchers argue that “pre-training is over” – the future breakthroughs will come from post-training these models in interactive environments via RL. As one recent essay put it, once we have massive pretrained models (the “priors”) and suitable environments, “the RL algorithm might be the most trivial part” of building advanced agents[5]. In other words, we’ve baked the cake with pre-training; reinforcement learning is the key to frosting it with reasoning and agency.
Shunyu Yao, in The Second Half, articulates this ethos. He notes that modern AI already provides a “working recipe” – large language model pre-training + scaling + reasoning – that can solve many tasks without new algorithms[2][6]. Thus, the game has changed: simply inventing another architecture won’t yield the leaps it once did. Instead, we must focus on evaluation and environments – essentially, on tasks that force the AI to truly think and act, not just predict the next token[7][8]. And that inevitably means using RL. Yao calls RL “the endgame of AI” and argues that now that we have the right ingredients (powerful priors from pre-training, plus richer environments with language and tools), “the recipe is completely changing the game” in this second half[1]. We should expect a pivot from static benchmarks to interactive tasks, and from one-and-done evaluations to continuous learning in the wild. In short, reinforcement learning is becoming central to how we advance AI from here on out.
Why the renewed focus on RL? Simply put, reinforcement learning enables capabilities that supervised learning alone can’t easily attain. Large Language Models (LLMs) are a case in point. A transformer like GPT-4, pre-trained on internet text, learns a tremendous amount of knowledge and linguistic pattern recognition – yet on its own it still lacks true agency. Pre-training teaches “how to talk,” but not necessarily what decisions to make in an interactive setting. By contrast, RL can teach an AI what goals to pursue and how to take actions to achieve them, by maximizing rewards that reflect those goals. This shift from passively predicting to actively experimenting and receiving feedback is crucial for reasoning, planning, and alignment.
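To make that contrast concrete, here is a minimal, self-contained sketch – a toy two-armed bandit with a REINFORCE-style update, invented for illustration and not drawn from any of the systems cited here. Unlike a supervised learner, the agent is never shown a “correct” label; it acts, observes a reward, and shifts its policy toward the actions that paid off.

```python
# Toy illustration of the RL feedback loop: act, observe a reward, update the policy.
import random, math

logits = [0.0, 0.0]        # policy parameters for two possible actions
true_payoff = [0.2, 0.8]   # hidden reward probabilities (unknown to the agent)
lr = 0.1

def policy():
    # softmax over the logits gives action probabilities
    exps = [math.exp(v) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

for step in range(2000):
    probs = policy()
    action = 0 if random.random() < probs[0] else 1                   # act
    reward = 1.0 if random.random() < true_payoff[action] else 0.0    # environment feedback
    for a in range(2):
        # REINFORCE-style update: raise the log-probability of the taken action,
        # scaled by the reward it earned
        logits[a] += lr * reward * ((1.0 if a == action else 0.0) - probs[a])

print("learned action probabilities:", policy())  # should strongly favour the higher-payoff arm
```

The same loop – act, receive feedback, adjust – is what scales up, with vastly more machinery, to RLHF and agentic fine-tuning.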
Recent work on LLM-based agents demonstrates how RL unlocks new levels of performance. For example, the open-source Kimi K2 model was fine-tuned end-to-end with reinforcement learning, which “teaches the model to plan, react, and self-correct through long reasoning chains instead of relying solely on supervised post-training”[9]. Through RL, K2 acquired autonomous reasoning patterns – it learns to cross-check facts, iterate on hypotheses, and stay cautious even when a question looks easy[10]. The result is a model that doesn’t just regurgitate training data, but actively figures out how to solve novel problems. Similarly, the K2 project emphasizes reliability: the agent prefers to verify answers before finalizing them, reflecting an RL-trained tendency to maximize correctness over speed[11]. In essence, reinforcement learning imbued the model with an internal “agentic” loop of planning and reflection, moving it beyond the limits of next-token prediction.
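As a hedged illustration only – this is not Kimi K2’s published reward design – one way to encode “correctness over speed” is a shaped reward that pays far more for a correct final answer than it charges for extra reasoning or verification steps:

```python
# Hypothetical reward shaping: a large bonus/penalty for correctness, a small per-step cost.
def shaped_reward(is_correct: bool, num_steps: int, step_penalty: float = 0.01) -> float:
    correctness = 1.0 if is_correct else -1.0
    return correctness - step_penalty * num_steps

print(shaped_reward(True, num_steps=12))   #  0.88 – slower but right
print(shaped_reward(False, num_steps=3))   # -1.03 – fast but wrong
```

With the step cost kept small relative to the correctness bonus, the policy that maximizes expected reward is one that spends a few extra steps cross-checking rather than finalizing quickly and being wrong.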
We see this pattern with other advanced systems as well. ChatGPT’s own improvement from GPT-3 came largely via Reinforcement Learning from Human Feedback (RLHF). After pre-training the model on text, OpenAI fine-tuned it with human feedback and reward models, which dramatically improved its helpfulness and adherence to instructions. John Schulman – a lead researcher on ChatGPT – describes that process: human testers provided a reward signal that made the model much better at holding coherent conversations, staying on track, and avoiding undesired outputs[12]. In other words, RLHF aligned the model with human preferences and conversational norms. This technique has become a de facto standard for turning raw LLMs into helpful assistants. As a WIRED piece notes, reinforcement learning is now an “increasingly popular” method for fine-tuning models by giving them feedback-based rewards to optimize[13]. Whether it’s to make a chatbot follow instructions or to imbue a large model with problem-solving skills, RL is the tool of choice once pre-training has done all it can.
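A minimal sketch of the reward-modelling step behind RLHF, under toy assumptions (a tiny MLP over stand-in feature vectors instead of a full transformer, and synthetic preference data): a Bradley–Terry style loss pushes the score of the human-preferred response above the rejected one, and the trained scorer then supplies the scalar reward that a policy-gradient method such as PPO maximizes when fine-tuning the language model.

```python
# Sketch of reward-model training from pairwise human preferences.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(chosen_feats, rejected_feats):
    # maximize log sigmoid(r_chosen - r_rejected): the preferred response should score higher
    r_chosen = reward_model(chosen_feats)
    r_rejected = reward_model(rejected_feats)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# stand-in "features" for response pairs labelled by human raters (purely synthetic)
chosen = torch.randn(32, 16) + 0.5
rejected = torch.randn(32, 16)

for epoch in range(200):
    loss = preference_loss(chosen, rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a full RLHF pipeline, the frozen reward model scores the policy’s sampled responses, and those scores become the rewards of the subsequent RL stage.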
The significance of RL goes beyond just fine-tuning for politeness; it’s about teaching models to make decisions. A recent technical blog from Macaron AI’s Mind Labs encapsulated this: “As LLMs evolve beyond pre-training toward experiential learning, Reinforcement Learning has emerged as the key to unlocking advanced reasoning capabilities.”[14] Rather than treating RL as an afterthought, cutting-edge projects treat it as a “first-class design pillar for agentic behavior, not just a final polish step”[15]. In practical terms, that means training AI systems by placing them in simulated or real environments where they must act, get feedback, and improve – be it an LLM agent browsing tools or a robot learning to navigate. Experiential learning through RL is how AI will acquire skills that can’t be captured in static datasets.
It’s telling that new AI labs are forming around this philosophy. Thinking Machines Lab, a startup founded by former OpenAI leaders, just launched with a massive $2 billion seed round to build tools for fine-tuning frontier models via RL and other techniques. Their flagship product “Tinker” aims to automate RL fine-tuning of large models, betting that empowering many people to “coax new abilities out of big models by leveraging reinforcement learning” will be the next big thing in AI[16][17]. Likewise, Macaron AI (a new research venture) is designing custom RL optimizers and infrastructure to scale RL to trillion-parameter models[18][19]. Efforts like these underscore a broader trend: the AI community sees huge opportunity in RL to push models to new frontiers – whether that’s making them better at tool use and reasoning (as with Kimi K2 and Macaron’s agents) or more aligned and customized (as with ChatGPT and Tinker). In sum, RL is now viewed as a key enabling technology to realize the full potential of the foundation models built in the last decade.

Perhaps the most compelling reason for RL’s rising prominence is its success in tackling problems beyond the sandbox of static datasets – often achieving feats that were long out of reach. Game-playing milestones were the first dramatic proof: DeepMind’s AlphaGo and AlphaZero and OpenAI Five conquered Go, chess, and even complex video games through deep reinforcement learning. These systems demonstrated that, given a well-defined reward (like winning a game), RL agents can surpass human champions via sheer practice and optimization[4]. Notably, OpenAI Five’s victory over the world champion Dota 2 team in 2019 was achieved by training purely via self-play RL at unprecedented scale – showcasing the “surprising power” of today’s RL algorithms when enough experience is provided[20]. That project highlighted both RL’s potential and its challenges: it required massive simulation (equivalent to hundreds of years of gameplay) and ingenious engineering to work, but it did work, producing teamwork and strategies beyond what any rule-based AI could do.
Crucially, RL is no longer confined to games. A landmark achievement in 2022 saw DeepMind use deep RL to control a nuclear fusion plasma in real-time, something previously impossible with manual controllers. By training in a simulator and then deploying to a tokamak reactor, their agent learned to manipulate magnetic coils to contain the plasma, successfully learning to stabilize a fusion reaction autonomously[21]. This demonstrated how RL can handle high-dimensional, dynamic control problems in physics – opening new avenues for scientific research that relies on precise sequential decision-making[21].
Another domain where RL is proving its real-world mettle is multi-agent interaction and game theory. A striking example is Meta’s CICERO, the first AI to achieve human-level performance in the game Diplomacy, which requires negotiation and alliance-building among multiple players. CICERO combines an LLM for language with an RL-trained planning module; it must devise strategies, model other players’ intentions, and negotiate persuasively through dialogue. The result was a breakthrough – CICERO managed to cooperate and compete effectively with humans, even in the presence of lies and bluffing. As observers noted, it’s “the first AI to achieve human-level performance in Diplomacy, a strategy game requiring trust, negotiation and cooperation with multiple players.”[22] This goes beyond board-game tactics; it hints that RL agents can handle social strategy and dynamic game-theoretic environments. Such capabilities are essential for AI that might one day navigate economies, negotiations, or complex organizational decisions.
Finally, and perhaps most dramatically, RL is venturing off Earth entirely. In the past year, researchers have achieved what can only be described as science fiction made real: autonomous satellites and robots in orbit controlled by reinforcement learning. In a U.S. Naval Research Lab experiment on the International Space Station, an RL algorithm (trained in simulation) took control of an Astrobee free-flying robot and successfully performed autonomous maneuvers in microgravity[23][24]. NRL’s team noted this is “the first autonomous robotic control in space using reinforcement learning algorithms”, and it builds confidence that RL can handle the unforgiving conditions of space operations[23]. Even more recently, on October 30, 2025, a University of Würzburg team achieved a world-first in-orbit demo: their small InnoCube satellite executed an attitude alignment maneuver entirely under the control of an onboard RL agent[25][26]. As the lead researcher put it, “we have achieved the world’s first practical proof that a satellite attitude controller trained using Deep Reinforcement Learning can operate successfully in orbit.”[26] This is a watershed moment – RL has graduated from simulations and labs to controlling physical systems in space. The AI controller learned in a high-fidelity simulator and was uploaded to the satellite, where it performed precise orientation tasks with no human in the loop[27][28]. The usual months-long process of hand-tuning a satellite’s control algorithm was replaced by an RL agent that can adapt on the fly[29]. These successes in space robotics highlight RL’s ability to produce policies that adapt and generalize under real-world uncertainty – a key stepping stone toward more autonomous vehicles, drones, and robots here on Earth as well.
All these examples underscore a pivotal point: Reinforcement learning is coming of age just when we need it the most. As AI moves into the “second half,” where the challenge is not just predicting but performing, RL provides the framework for experimentation, adaptation, and long-horizon optimization. Unlike supervised learning, which is tethered to past data, RL enables systems to learn from their own experience and improve through trial-and-error. This is essential for any AI that must operate in unstructured, novel situations – whether it’s an assistant solving a new user query or a robot coping with unexpected obstacles.
There are also deeper implications for how we measure progress in AI. We can no longer rely solely on static benchmarks to gauge a model’s intelligence. Instead, researchers are proposing new evaluation setups that mirror the real world: continuous tasks, human-in-the-loop interactions, and non-i.i.d. scenarios[8][30]. By coupling such rich environments with RL training, we force our models to develop more robust, generalizable behaviors. In Yao’s words, the second half will be about creating agents that break out of the benchmark loop and actually deliver real-world utility[31][32]. The flurry of investment in RL-centric labs and the rapid adoption of RLHF in industry reflect a recognition that now is the time to make this leap.
That said, embracing RL doesn’t come without challenges. RL training can be unstable and resource-intensive (OpenAI Five’s costly training is a case in point[20]). It often demands fast simulations or environments where mistakes are cheap – something not always available in high-stakes domains. However, progress is being made on these fronts too. New algorithms and frameworks (like Macaron’s All-Sync RL with DAPO optimizations) are dramatically improving the efficiency of large-scale RL training[19][33]. Techniques like sim2real transfer, reward modeling, and safer exploration strategies are helping RL systems make the jump to real deployments without catastrophic failures[34][35]. Importantly, the community is learning how to blend RL with other paradigms – for example, using language models as critics or planners, using human demonstrations to guide RL (a kind of hybrid imitation learning), and more. These hybrid approaches often get the best of both worlds: the knowledge of pre-training and the decision-making of reinforcement learning.
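That hybrid recipe can be sketched in a few lines, again under toy assumptions (a 1-D corridor environment and a tabular policy invented for illustration): behaviour cloning on human demonstrations gives the policy a sensible starting point, and reward-driven updates then refine it through the agent’s own experience.

```python
# Phase 1: imitate demonstrations (behaviour cloning). Phase 2: RL fine-tuning on environment reward.
import random, math

N_STATES, GOAL = 5, 4
logits = [[0.0, 0.0] for _ in range(N_STATES)]   # per-state preferences for [left, right]
lr = 0.2

def probs(s):
    exps = [math.exp(v) for v in logits[s]]
    z = sum(exps)
    return [e / z for e in exps]

# Phase 1: clone (state, action) demonstration pairs – the demonstrated behaviour is "move right"
for s, a in [(s, 1) for s in range(GOAL) for _ in range(10)]:
    p = probs(s)
    for i in range(2):
        logits[s][i] += lr * ((1.0 if i == a else 0.0) - p[i])

# Phase 2: REINFORCE-style updates from a sparse environment reward (reaching the goal)
for episode in range(500):
    s, trajectory = 0, []
    for t in range(20):
        p = probs(s)
        a = 0 if random.random() < p[0] else 1
        trajectory.append((s, a, p))
        s = max(0, min(GOAL, s + (1 if a == 1 else -1)))
        if s == GOAL:
            break
    reward = 1.0 if s == GOAL else 0.0
    for s_t, a_t, p_t in trajectory:
        for i in range(2):
            logits[s_t][i] += lr * reward * ((1.0 if i == a_t else 0.0) - p_t[i])

print("P(move right) per state:", [round(probs(s)[1], 2) for s in range(N_STATES)])
```

The imitation phase narrows exploration to plausible behaviour, and the RL phase then optimizes for the actual objective – the same division of labour that lets pre-trained priors and reinforcement learning complement each other at scale.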
In conclusion, focusing on reinforcement learning now is not a matter of hype for its own sake – it’s a recognition of where the needs and opportunities lie. We stand at a juncture where our AI systems have vast latent capabilities (thanks to pre-training), and the way to activate those capabilities is through goal-directed learning. Whether it’s aligning AI behavior with human values, endowing robots with true autonomy, or pushing AI to solve new scientific and engineering problems, RL provides the tools to iteratively refine and improve AI through feedback. We are witnessing the transition from an era of passive learning to one of active learning and doing. As the saying goes, “what got us here won’t get us there.” The heavy lifting of representation learning might be largely done by giant models, but turning those models into useful, adaptive, and trustworthy agents – that is the work of reinforcement learning. By investing in RL research and applications now, we’re essentially tackling the hard problems head-on: making AI that can think in steps, explore alternatives, recover from errors, and ultimately master open-ended tasks. In the grand trajectory of AI, this shift is as significant as the deep learning revolution of the 2010s. The second half has only just begun, and reinforcement learning is poised to be its driving force.
References:
[1] [2] [3] [4] [5] [6] [7] [8] [30] [31] [32] The Second Half – Shunyu Yao – 姚顺雨
https://ysymyth.github.io/The-Second-Half/
[9] [10] [11] [15] Introducing Kimi K2 Thinking | Blog
https://kimik2thinking.org/blog/introducing-kimi-k2-thinking
[12] [13] [16] [17] Exclusive: Mira Murati’s Stealth AI Lab Launches Its First Product | WIRED
https://www.wired.com/story/thinking-machines-lab-first-product-fine-tune/
[14] [19] [33] MIND LABS | Scaling All-Sync RL with DAPO and LoRA
[18] A Macaron Analysis: Kimi K2 “Thinking” Model: Advancing Open Agentic AI - Macaron
https://macaron.im/blog/kimi-k2-thinking
[20] OpenAI Five defeats Dota 2 world champions | OpenAI
https://openai.com/index/openai-five-defeats-dota-2-world-champions/
[21] Accelerating fusion science through learned plasma control - Google DeepMind
https://deepmind.google/blog/accelerating-fusion-science-through-learned-plasma-control/
[22] CICERO: AI In Diplomacy and Relations | blog_posts – Weights & Biases
https://wandb.ai/vincenttu/blog_posts/reports/CICERO-AI-In-Diplomacy-and-Relations--VmlldzozMzIzNDQ5
[23] [24] [34] [35] Reinforcement Learning is Making a Buzz in Space > U.S. Naval Research Laboratory > NRL News
[25] [26] [27] [28] [29] World Premiere in Space: Würzburg AI Controls Satellite
https://www.uni-wuerzburg.de/en/news-and-events/news/detail/news/world-premiere-ai-control/