MinT Cookbook: From Reproducible Baselines to Continuous Autoresearch

May 29, 2026

AI systems research is entering a new phase. The bottleneck is no longer model capability, but our ability to quickly organize research questions into sustainable, self-improving experiment loops.

A new paradigm is taking shape: autoresearch. Instead of AI acting as a passive coding assistant, it participates directly in the experiment loop itself: modifying code, launching runs, reading metrics, and deciding what to try next. In March 2026, Andrej Karpathy gave this paradigm its first concrete form: an AI agent running hundreds of small experiments overnight on a single GPU, compounding tiny improvements into real gains. It was the spark.

To scale autoresearch beyond single-GPU prototypes into the kinds of experiments that drive modern LLM research (distributed training, multi-stage evaluation, complex reward models), you need two things. First, infrastructure that can keep the entire loop running coherently over hours or days without human intervention. Second, a standardized way to package experiments so that agents can jump in, understand the conventions, and start iterating without a human writing glue code for every new benchmark.

MinT provides the first piece: a production platform that handles distributed orchestration, checkpoint management, failure recovery, and consistent evaluation pipelines, all exposed through a clean Python SDK.

MinT Cookbook provides the second: a standardized experiment recipe layered on top of MinT. It transforms a benchmark into a reproducible baseline, and then into a continuous research loop that an AI agent can operate on directly.

TL;DR: we have a powerful platform (MinT). We have a proven methodology (autoresearch). What's missing is the standardized practice that combines them efficiently and lets anyone get started quickly. That practice is the MinT Cookbook.

What is Autoresearch?

In March 2026, Andrej Karpathy introduced autoresearch to fully automate the machine learning experimentation cycle [1]. Instead of manual trial-and-error, researchers simply define high-level goals in a text file (program.md), allowing an AI agent to autonomously and continuously modify the training script (train.py). The agent runs 5-minute experiments, keeping code changes that improve an objective metric via Git commits and instantly reverting failures. This continuous "ratchet" loop enables the agent to run hundreds of experiments overnight, systematically compounding small discoveries into real performance gains.
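The ratchet described above can be sketched in a few lines. This is a toy illustration, not Karpathy's implementation: `ratchet`, `propose_lr`, and `toy_metric` are hypothetical names, the "experiment" is a one-line function rather than a 5-minute training run, and accepting or dropping a candidate stands in for the Git commit and revert.

```python
import random
from typing import Callable, TypeVar

State = TypeVar("State")


def ratchet(
    state: State,
    propose: Callable[[State, random.Random], State],
    evaluate: Callable[[State], float],
    steps: int,
    seed: int = 0,
) -> tuple[State, float, list[bool]]:
    """Keep a proposed change only if it strictly improves the metric."""
    rng = random.Random(seed)
    best_score = evaluate(state)
    history: list[bool] = []
    for _ in range(steps):
        candidate = propose(state, rng)
        score = evaluate(candidate)
        accepted = score > best_score
        if accepted:
            # Analogous to committing the change.
            state, best_score = candidate, score
        # Otherwise the candidate is dropped, analogous to a revert.
        history.append(accepted)
    return state, best_score, history


# Toy stand-in for "edit train.py and run a short experiment":
# the state is a single learning rate and the metric peaks at 0.3.
def propose_lr(lr: float, rng: random.Random) -> float:
    return lr + rng.uniform(-0.1, 0.1)


def toy_metric(lr: float) -> float:
    return -((lr - 0.3) ** 2)


final_lr, final_score, accepted = ratchet(1.0, propose_lr, toy_metric, steps=200)
print(f"final_lr={final_lr:.3f} accepted={sum(accepted)}/{len(accepted)}")
```

Because rejected candidates never change the state, the best score can only stay flat or go up, which is what lets small gains compound over many runs.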

Karpathy's setup proved something crucial: autonomous research loops work. On a single GPU, the agent reliably improved a model overnight. It was the first demonstration that an AI agent can participate in the experiment loop, not just as a coding assistant, but as an active experimenter making decisions based on evidence. Other systems, including ML‑Intern [2], AI Scientist‑v2 [3], Auto Research with Specialist Agents [4], and AutoResearchClaw [5], have explored similar paradigms.

The natural next question is: how do you take this proven paradigm and scale it to the experiments that actually matter in modern LLM research (distributed training, multi-stage evaluation, complex reward models)?

MinT [6] already provides the infrastructure answer. Its platform handles the heavy engineering: distributed orchestration, checkpoint management, failure recovery, and consistent evaluation pipelines. With MinT, the foundation for large-scale autonomous research is already in place.

What's missing is the organizational layer on top of that foundation. Infrastructure gives you a clean lab. You still need a standardized way to set up experiments so that an agent can jump in, understand the conventions, and start iterating without a human writing glue code for every new benchmark.

That is what MinT Cookbook provides. It takes the proven autoresearch paradigm, runs it on MinT's production infrastructure, and packages every experiment into a standardized recipe that both developers and agents can work with seamlessly.

MinT Cookbook: AI-native Experiment Recipe

MinT handles the heavy lifting: distributed training, failure recovery, model state management. But infrastructure alone doesn't tell you how to start a new benchmark, how to organize experiments so an agent can continuously operate on them, or what a reproducible research iteration actually looks like in practice.

MinT Cookbook fills that gap. Its core design principle is simple:

Every experiment should be a minimal, self-contained unit that both developers and agents can understand quickly.

The goal is not that every experiment uses identical code. The goal is that every experiment follows a similar shape.

Every experiment directory is self-contained. Enter experiments/<name>/, and you don't need to re-learn the whole repo. Just follow a few local conventions to get the baseline running and start iterating. train.py stays a single, executable file. It's not just a training script. It's the executable truth of the experiment: CLI entry point, evaluation path, training flow, checkpoint semantics, metric output, and artifact writing all live right there.

This design gives you two immediate wins:

It drives the cost of setting up a baseline low enough that a benchmark can go from "idea" to "runnable, evaluable, iterable starting point" much faster.
It defines the experiment boundary clearly enough that an agent can take over without stumbling through hidden state and cross-directory dependencies.
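To make the single-file convention concrete, here is a minimal sketch of what such a train.py could look like. It is an assumption-laden skeleton, not the Cookbook's actual script: the flag names `--dry-run`, `--eval-only`, and `--steps`, the metric names, and the placeholder scorer are all hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names are illustrative assumptions, not the Cookbook's real CLI.
    parser = argparse.ArgumentParser(description="Single-file experiment entry point (sketch)")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--dry-run", action="store_true", help="validate setup without training")
    mode.add_argument("--eval-only", action="store_true", help="run evaluation and exit")
    parser.add_argument("--steps", type=int, default=10, help="number of training steps")
    return parser


def evaluate(step: int) -> dict[str, float]:
    # Placeholder scorer: a real experiment would score the benchmark here.
    return {"EVAL_SCORE": min(1.0, 0.5 + 0.01 * step)}


def emit(metrics: dict[str, float]) -> None:
    # Metrics go to stdout, one NAME=value pair per line.
    for name, value in metrics.items():
        print(f"{name}={value}")


def main(argv: list[str]) -> int:
    args = build_parser().parse_args(argv)
    if args.dry_run:
        print("DRY_RUN_OK=1")
        return 0
    if args.eval_only:
        emit(evaluate(step=0))
        return 0
    for _step in range(1, args.steps + 1):
        pass  # placeholder for one training step
    emit(evaluate(step=args.steps))
    return 0


# Example invocation; a real script would pass sys.argv[1:] instead.
exit_code = main(["--eval-only"])
```

Keeping the CLI, evaluation path, and metric output in one file means an agent can read a single place to learn how the experiment is run and scored.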

A Shared Experiment Shape. MinT Cookbook doesn't force all experiments to share the same code, but it does require them to follow a shared "experiment shape." Every experiment obeys these conventions:

Eval-first lifecycle – Every training script must include evaluation, and evaluation can be run standalone.
Unified metric format – All metrics are printed to stdout as METRIC_NAME=value.
Consistent checkpoint semantics – Checkpoints are saved and loaded the same way across experiments.
Resumable training – Every run supports seamless recovery from interruptions.
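The unified metric format is what lets an agent read results without experiment-specific parsing. A minimal sketch of such a reader follows, assuming metric names are identifier-like and values are plain numbers; `parse_metrics` and the names `PROXY_SCORE` and `FULL_SCORE` are hypothetical, not part of the Cookbook.

```python
import re

# One metric per line: an identifier-like name, "=", then a number.
METRIC_LINE = re.compile(
    r"^([A-Za-z_][A-Za-z0-9_]*)=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$"
)


def parse_metrics(stdout: str) -> dict[str, float]:
    """Extract NAME=value lines from captured stdout, ignoring other log lines."""
    metrics: dict[str, float] = {}
    for line in stdout.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            # If a metric is printed more than once, the last value wins.
            metrics[match.group(1)] = float(match.group(2))
    return metrics


sample_stdout = """\
loading checkpoint...
PROXY_SCORE=0.592
step 100 done
FULL_SCORE=53.91
"""
parsed = parse_metrics(sample_stdout)
print(parsed)
```

Anchoring the pattern to the whole line keeps ordinary log output from being mistaken for a metric.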

Each experiment can still keep its own unique pieces: how benchmark data is interpreted and processed, prompt templates and scoring logic, task-specific adapters, details of the reward or objective function, and experiment-specific autoresearch protocols.

This "shared conventions, local differences" design gives MinT Cookbook enough flexibility while keeping all experiments running in a similar way, so an AI agent can switch between experiments without learning new rules every time.

Putting it into practice: three prompts that drive real work. The following examples show how a researcher delegates standard Cookbook tasks to an AI agent using nothing but natural language: no special commands, no glue code. The agent follows the shared experiment shape automatically.

Example 1: Run an existing baseline

In mint-cookbook, run the baseline for experiments/fingpt.
Follow the experiment’s own README and AGENTS.md to set up the environment,
then execute the canonical dry-run and eval-only commands.
Summarize what worked and any blockers.

Example 2: Spin up a new benchmark baseline

Create a new Cookbook experiment at experiments/my-benchmark for benchmark [name].
Use the repo’s skill workflow: first write the requirement doc, then scaffold the eval‑first baseline.
Make sure uv sync, dry‑run, and eval‑only all pass.
Report the requirement path, chosen targets, and validation status.

Example 3: Launch an autoresearch loop

Start one autoresearch run on experiments/lawbench.
Validate the contract (dry‑run), then review the current evidence in autoresearch.md,
pick the most promising hypothesis, and run bash autoresearch.sh with the needed overrides.
Report what hypothesis you chose and what evidence the run returned.

How Autoresearch Works Inside MinT Cookbook

Autoresearch in MinT Cookbook isn't about blindly running more experiments. It's about systematically tackling the four places where research tends to go off the rails.

Baselines must be trivially easy to set up.

Many benchmarks stall not because they lack research value, but because the starting point is too heavy. MinT Cookbook provides recipes (SFT, GRPO, DPO, OPSD, and more) that do more than just set up an environment. They quickly shape a benchmark into a runnable, evaluable, iterable baseline, so both developers and agents can jump straight into the real research loop.

This aligns with the goals of other systems, such as MLE-Agent [7], which is designed to lower the barrier for ML engineers and researchers through autonomous baseline creation and integration with the latest research resources.

Use a proxy to search cheaply.

Once the baseline runs, what really happens at high frequency is proxy evaluation. Its job is to rapidly screen directions (or to pick promising checkpoints along a training run) using as little compute as possible. Kill bad leads early and keep the candidates worth pursuing.

The full benchmark only delivers the final verdict.

The proxy builds a shortlist; it never declares victory. Only when the proxy gives a strong enough positive signal does the agent send a candidate to full benchmark confirmation. This way the system maintains search efficiency without mistaking a local spike for real progress. In short: the proxy decides which candidates are worth a closer look; the full benchmark confirmation decides which results are worth trusting.

Both results and failures must be written back into the system.

Autoresearch logs more than scores. It records failure reasons, false peaks, runtime issues, and why a line of investigation should continue or stop. More importantly, that information feeds back to refine the search protocol for future runs [8].

Autoresearch therefore shifts the focus from “did the experiment finish?” to “was this version correctly analyzed, compared, and archived?” It is neither a training script nor a thin wrapper. It is a full research process system that treats every run as an evidence-gathering step.
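One way to picture "writing failures back into the system" is an append-only run log that later runs can query. The sketch below is hypothetical: the `RunRecord` fields, the JSON Lines layout, and the example entries are made up for illustration and are not the Cookbook's actual evidence format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunRecord:
    run_id: str
    hypothesis: str
    status: str  # "confirmed", "discarded", or "crashed"
    proxy_score: Optional[float] = None
    failure_reason: Optional[str] = None
    next_action: str = ""


def to_jsonl(records: list[RunRecord]) -> str:
    """Serialize records as JSON Lines, one run per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def lessons(jsonl: str) -> list[str]:
    """Collect failure reasons so a later run can avoid repeating them."""
    rows = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
    return [row["failure_reason"] for row in rows if row["failure_reason"]]


# Made-up entries covering a success, a discarded idea, and a runtime failure.
log = to_jsonl([
    RunRecord("run-001", "upweight task A data", "confirmed",
              proxy_score=0.51, next_action="confirm on full benchmark"),
    RunRecord("run-002", "double the learning rate", "discarded",
              proxy_score=0.47, failure_reason="proxy score dropped below baseline"),
    RunRecord("run-003", "extend training steps", "crashed",
              failure_reason="out of memory at checkpoint save"),
])
print(lessons(log))
```

Recording the reason alongside the score is what turns a failed run into evidence the next iteration can use.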

Experiment results

Think of a machine learning experiment as a dish.

Reproducible Baseline: follow the recipe exactly, just to confirm the kitchen works.
Baseline Optimization Autoresearch: keep the same cooking method but continuously taste, adjust, and improve based on what the evidence tells you.
Idea-Driven Autoresearch: go beyond the recipe by reading cookbooks, inventing new techniques, and chasing ideas wherever they lead.

We evaluated MinT Cookbook across these three stages, using LawBench (legal AI) [9], FinGPT (financial AI) [10], and AIME (math) to show how the loop moves from simple reproduction to fully autonomous method-level exploration.

Stage 1: Reproducible Baseline

We validated MinT Cookbook's SFT training recipe on the standard FinGPT sentiment analysis benchmarks, covering four core tasks: FPB, FiQA-SA, TFNS, and NWGI.

Training configuration: batch_size=32, learning_rate=1e-4, steps=2000.

The validation set performance of all models is summarized in the following table.

Model	FPB	FiQA-SA	TFNS	NWGI
GPT-4	0.833	0.630	0.808	-
FinGPT v3.3	0.882	0.874	0.903	0.643
Qwen/Qwen3-4B-Instruct-2507 (Base)	0.689	0.832	0.591	0.450
Qwen/Qwen3-4B-Instruct-2507 (SFT)	0.881	0.870	0.910	0.563

Stage 2: Baseline Optimization Autoresearch

Here the training algorithm remains fixed. The agent’s job is to take the reproducible baseline from Stage 1 and optimize it, systematically searching for better data mixtures, hyperparameters, training steps, and checkpoint selections that push performance higher without changing the underlying method.

We demonstrated this on LawBench, a comprehensive legal LLM benchmark covering 20 tasks across three cognitive levels. The agent ran 28 experiments in 2–3 hour iteration cycles, autonomously trying different data weights, document corrections, task-specific optimization, learning rate adjustments, and extended training steps. A cheap proxy metric screened every candidate; only configurations that improved the signal moved forward to full benchmark confirmation.

The blue line traces the validated gains: starting from a baseline proxy score of 0.492, the agent improved it to 0.592 through a sequence of confirmed changes. On the full benchmark, the final checkpoint reached 53.91, outperforming both the base model (47.93) and the standard SFT control (49.14).

We call this SFT-AR (SFT with autoresearch): the agent continuously improves the training recipe itself — selecting data mixtures, adjusting hyperparameters, and confirming only the most promising candidates on the full benchmark.

Note: the proxy score and the full benchmark score are on different scales; the proxy is designed for fast relative comparison only.
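The proxy-then-confirm gate can be sketched as a two-stage filter. This is a simplified illustration under stated assumptions: `Candidate`, `screen_and_confirm`, the `min_gain` threshold, the candidate names, and every score except the 0.492 baseline are made up, and the full benchmark is replaced by a stub.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    name: str
    proxy_score: float
    full_score: Optional[float] = None  # filled in only after confirmation


def screen_and_confirm(
    candidates: list[Candidate],
    baseline_proxy: float,
    min_gain: float,
    run_full_benchmark: Callable[[Candidate], float],
) -> list[Candidate]:
    """Shortlist by proxy gain, then spend full-benchmark compute only on the shortlist."""
    shortlist = [c for c in candidates if c.proxy_score - baseline_proxy >= min_gain]
    for candidate in shortlist:
        candidate.full_score = run_full_benchmark(candidate)
    return shortlist


full_runs: list[str] = []


def fake_full_benchmark(candidate: Candidate) -> float:
    # Stub standing in for an expensive full evaluation.
    full_runs.append(candidate.name)
    return 50.0  # made-up score on the full-benchmark scale


candidates = [
    Candidate("reweight-data", proxy_score=0.480),
    Candidate("lower-lr", proxy_score=0.500),
    Candidate("more-steps", proxy_score=0.540),
]
confirmed = screen_and_confirm(
    candidates, baseline_proxy=0.492, min_gain=0.02,
    run_full_benchmark=fake_full_benchmark,
)
print([c.name for c in confirmed], full_runs)
```

The proxy scores are only ever compared with each other, while the value stored in `full_score` lives on the separate full-benchmark scale, which mirrors the note above.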

Scaling with zero re-engineering. Using the same LawBench recipe, data pipeline, scorer, wrapper, and artifact protocol, we ran comparable experiments on larger Qwen3 models. On Qwen3-30B, the standard SFT control improved the score from 53.76 to 56.05, while the current SFT-AR reference reached 59.07. On Qwen3-235B, the standard SFT control improved the score from 59.44 to 61.13, and the current SFT-AR reference reached 63.93.

This is the key point: with MinT, scaling from 4B to 30B and 235B is not a new engineering project. It is the same experiment recipe running on a stronger execution platform. MinT provides the infrastructure; Cookbook provides the standardized experimental practice that makes the loop portable across model scales.
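For a side-by-side view, the LawBench full-benchmark scores quoted above can be turned into gains per model scale. The numbers are copied from the text; the dictionary layout and key names are just a convenience for this worked calculation.

```python
# (base, standard SFT control, SFT-AR) full-benchmark scores quoted in the text.
lawbench = {
    "4B": (47.93, 49.14, 53.91),
    "30B": (53.76, 56.05, 59.07),
    "235B": (59.44, 61.13, 63.93),
}

gains = {}
for scale, (base, sft, sft_ar) in lawbench.items():
    gains[scale] = {
        "sft_vs_base": round(sft - base, 2),
        "sft_ar_vs_base": round(sft_ar - base, 2),
        "sft_ar_vs_sft": round(sft_ar - sft, 2),
    }
    print(scale, gains[scale])
```

At all three scales the SFT-AR reference scores higher than the standard SFT control, by 4.77, 3.02, and 2.80 points respectively.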

Stage 3: Idea-Driven Autoresearch

This is the stage where the agent moves beyond optimizing a fixed baseline and starts chasing ideas. An idea can come from a published paper, a technical report, or the agent's own hypothesis formed during earlier runs. The loop shifts from "how can we make this baseline better?" to "what completely new approach should we try?"

In the current demonstration, the idea is "Can we improve math reasoning by changing how on-policy self-distillation (OPSD) [11] filters training samples?" Future iterations will source ideas directly from the latest papers on arXiv, automate reproduction, and run cross-method comparisons, all within the same standardized experiment framework.

Step 1: Reproduce the idea. MinT Cookbook provides a runnable reproduction of the OPSD training recipe for competition-level math reasoning on Qwen3-30B-A3B-Instruct-2507. Under the locked vanilla OPSD setup and a fixed evaluation contract (temperature=0.6, num_samples=32, mean@32), we ran two identical-seed reproductions, obtaining 0.7128 and 0.7250 (base model: 0.7056). We report both runs to characterize the pipeline's stochasticity (~0.012 noise band).
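For readers unfamiliar with the metric, the sketch below shows one common reading of mean@k: sample k answers per problem, take the fraction that are correct, and average that fraction over problems. That reading is an assumption here, since the exact evaluation code is not shown, and `mean_at_k` is a hypothetical helper.

```python
def mean_at_k(correct: list[list[bool]]) -> float:
    """Average per-problem accuracy over k sampled answers, then over problems."""
    if not correct:
        raise ValueError("need at least one problem")
    k = len(correct[0])
    if k == 0 or any(len(row) != k for row in correct):
        raise ValueError("every problem needs the same non-zero number of samples")
    per_problem = [sum(row) / k for row in correct]
    return sum(per_problem) / len(per_problem)


# Toy example with k=4 instead of 32: three problems, four samples each.
toy = [
    [True, True, False, False],   # 2 of 4 correct
    [True, True, True, True],     # 4 of 4 correct
    [False, False, False, False], # 0 of 4 correct
]
score = mean_at_k(toy)
print(f"mean@4={score}")
```

Because each answer is sampled at temperature 0.6, two runs of the same pipeline can land on slightly different scores, which is why both reproductions are reported.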

Step 2: Explore the idea. With the idea's baseline reproduced and its noise level characterized, the agent tested 7 algorithmic variants: different rollout lengths, advantage scaling strategies, and correct-only filtering.

Version	Algorithm change	mean@32	Outcome
v0	Base model (no training)	0.7056	Baseline
v1a / v1b	Vanilla OPSD (two runs)	0.7128 / 0.7250	Reproduction ✓
v2 / v3	Shorter rollouts (768, 640 tokens)	0.7236 / 0.7167	No help → discarded
v4	Zero out negative advantages	0.7038	Hurt performance → discarded
v5	STaR: train only on correct rollouts	0.7333	Best result
v6	Soft advantage downscaling	0.7097	No help → discarded
v7	OGLS: softer incorrect scaling	0.7160	No help → discarded
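A quick way to read this table is to compare each variant against the vanilla OPSD runs and the spread between them. The comparison rule below (gain over the mean of the two vanilla runs must exceed their spread) is an assumption made for illustration, not the selection rule the agent is documented to use; the scores are copied from the table.

```python
base = 0.7056
vanilla_runs = [0.7128, 0.7250]
variants = {
    "v2": 0.7236, "v3": 0.7167, "v4": 0.7038,
    "v5": 0.7333, "v6": 0.7097, "v7": 0.7160,
}

# Spread between the two identical-seed reproductions.
noise_band = round(max(vanilla_runs) - min(vanilla_runs), 4)
vanilla_mean = round(sum(vanilla_runs) / len(vanilla_runs), 4)

deltas = {name: round(score - vanilla_mean, 4) for name, score in variants.items()}
beyond_noise = [name for name, delta in deltas.items() if delta > noise_band]

print(f"noise_band={noise_band} vanilla_mean={vanilla_mean}")
print(deltas)
print(beyond_noise)
```

Under this rule only v5 clears the band. With just two reproductions the band is a rough estimate, so this is a screening heuristic rather than a significance test.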

One variant survived: STaR-style correct-only filtering [12]. During training, roughly 65% of rollouts end in a wrong answer; these are filtered out, so only correct reasoning is reinforced.
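Correct-only filtering itself is a small step. The sketch below shows the idea in isolation, assuming answers can be checked by exact string match after trimming whitespace; real math scoring is usually more forgiving, and `Rollout`, `is_correct`, and `keep_correct` are hypothetical names rather than the recipe's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    prompt_id: str
    completion: str
    predicted_answer: str
    reference_answer: str


def is_correct(rollout: Rollout) -> bool:
    # Assumption: exact match after trimming whitespace.
    return rollout.predicted_answer.strip() == rollout.reference_answer.strip()


def keep_correct(rollouts: list[Rollout]) -> tuple[list[Rollout], float]:
    """Return the correct rollouts and the fraction that was dropped."""
    kept = [r for r in rollouts if is_correct(r)]
    dropped_fraction = 1 - len(kept) / len(rollouts) if rollouts else 0.0
    return kept, dropped_fraction


# Made-up batch: four rollouts, one of which reaches the reference answer.
batch = [
    Rollout("p1", "reasoning A", "42", "42"),
    Rollout("p1", "reasoning B", "41", "42"),
    Rollout("p2", "reasoning C", "7", "9"),
    Rollout("p2", "reasoning D", " 8", "9"),
]
kept, dropped = keep_correct(batch)
print(len(kept), dropped)
```

Only the kept rollouts would contribute to the training update, so wrong reasoning traces are never reinforced.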

Model	AIME2024	AIME2025	AIME2026	HMMT25
Qwen3-30B-A3B-Instruct-2507 (Base)	0.7479	0.6260	0.7427	0.4188
Qwen3-30B-A3B-Instruct-2507 (OPSD run1)	0.7583	0.6229	0.7573	0.3969
Qwen3-30B-A3B-Instruct-2507 (OPSD run2)	0.7667	0.6438	0.7646	0.4094
Qwen3-30B-A3B-Instruct-2507 (OPSD+STaR)	0.7740	0.6458	0.7802	0.4104

From idea to improvement. The agent took a published method, reproduced it, characterized its noise, and discovered a simple but effective modification. This is the core loop of Idea-Driven Autoresearch, one that will only accelerate as the system gains the ability to ingest new ideas directly from the research literature.

Conclusion

MinT provides the scalable platform. Autoresearch provides the proven methodology. MinT Cookbook combines them into a standardized practice that any research team can adopt.

AI research is shifting from human-led to human–AI collaborative research.

Autoresearch redefines the research process: AI moves from a passive tool to an active participant in the experiment loop.
MinT provides the scalable foundation: enabling large-scale autonomous research across thousands of experiments.
MinT Cookbook standardizes the protocol: turning autoresearch from a concept into a reproducible practice that any research team can adopt.

MinT Cookbook isn't here to replace researchers. It's here to free them from repetitive experimental overhead so they can focus on what they do best: asking better questions, designing better research protocols, interpreting results, and discovering new scientific principles.

MinT Cookbook is just the beginning.

Standardized experiment conventions unlock a new capability: end-to-end translation from research papers to runnable, validated code. Future agents will automatically scan the literature, verify reproducibility, convert implementations into MinT-compatible recipes, and run cross-model comparisons. The result is a living library of reproducible methods, making state-of-the-art techniques instantly accessible and accelerating AI research for everyone.

MinT Cookbook is open. Contribute recipes, improve autoresearch protocols, or run reproducibility validations. Our roadmap includes more tasks (multi-step reasoning, tool use) and larger models (Qwen, GLM, Kimi, MiniMax).

The fastest way to get a feel for the Cookbook loop is to run a training session yourself at mintui-preview.macaron.im, with no install and no setup. If you don't yet have a MinT API key, register and sign in at https://macaron.im/mindlab/mint first. To run larger experiments on MinT infrastructure, apply for a community edition account on the same page.

References

[1] karpathy/autoresearch (Karpathy, 2026)

[2] ML-Intern: an agent that autonomously researches, writes, and ships good quality ML related code using the Hugging Face ecosystem (Reedi et al., 2026)

[3] The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search (Yamada et al., 2025)

[4] Auto Research with Specialist Agents (Ning et al., 2026)

[5] AutoResearchClaw: Fully Autonomous Research from Idea to Paper (Liu et al., 2026)

[6] MinT: RL Infrastructure for Experiential Intelligence (Lu et al., 2026)

[7] MLE-Agent: Your Intelligent Companion for Seamless AI Engineering and Research (Zhang et al., 2024)

[8] ARIS: Fully Autonomous Research via Adversarial Multi-Agent Collaboration (Yang et al., 2026)

[9] LawBench: Benchmarking Legal Knowledge of Large Language Models (Fei et al., 2023)

[10] FinGPT: Open-Source Financial Large Language Models (Yang et al., 2023)

[11] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models (Zhao et al., 2026)

[12] STaR: Bootstrapping Reasoning With Reasoning (Zelikman et al., 2022)

Author

Mind Lab

Core Contributors

Andrew Lei, Lucian Li, Nolan Ho, Di Zhang, Kieran Liu, Andrew Chen, Pony Ma

Team

Andrew Chen, Kaijie Chen, Song Cao, Cleon Cheng, Steven Chiang, Nolan Ho, Charles Huang, Fancy Kong, Andrew Lei, Lucian Li, Ray Li, Theo Li, Wenhao Li, Logan Liu, Kieran Liu, Xiang Liu, Irvine Lu, Pony Ma, Vincent Wang, Guikun Yang, Rio Yang, Shiro Yang, Maxwell Yao, Regis Ye, Di Zhang, Ruijia Zhang, Conley Zhao, Congjie Zheng, Adrian Zhou, Murphy Zhuang and Mindverse Team

Names are listed alphabetically within team and acknowledgement.

Citation

Please cite this work using the BibTeX citation:

@misc{lei2026mintcookbook,
  author = {Andrew Lei and Lucian Li and Nolan Ho and Di Zhang and Kieran Liu and Andrew Chen and Pony Ma and {Mind Lab}},
  title = {MinT Cookbook: From Reproducible Baselines to Continuous Autoresearch},
  year = {2026},
  howpublished = {Mind Lab: A Lab for Experiential Intelligence},
  note = {https://macaron.im/mindlab/research/mint-cookbook-from-reproducible-baselines-to-continuous-autoresearch}
}
