Learn-to-Steer: NVIDIA’s Data‑Driven Solution to Spatial Reasoning in Text-to-Image Diffusion

Author: Boxu Li

Text-to-image diffusion models can generate stunning visuals, but they have a notorious blind spot: spatial reasoning. Today’s best models often misplace objects in a scene or merge them oddly when asked for specific layouts. For example, a prompt like “a dog to the right of a teddy bear” might confuse a model – it may put the dog on the left or even fuse the dog and teddy together. These are tasks a young child finds trivial, yet diffusion models frequently fail at them[1]. The problem becomes even more pronounced with unusual combinations (imagine a giraffe standing above an airplane)[1]. Traditional fixes involve either fine-tuning models on special data or adding handcrafted spatial losses at generation time, but both approaches have drawbacks[1]. Fine-tuning requires expensive retraining and risks altering the model’s creativity or style. Handcrafted losses, on the other hand, encode our own imperfect assumptions about spatial relationships, often yielding suboptimal results.

Enter Learn-to-Steer, NVIDIA’s novel approach (to appear at WACV 2026) that tackles spatial reasoning by learning directly from the model itself. Instead of hard-coding where objects should go, the idea is to teach the model how to guide itself during image generation using data-driven loss functions. In this blog post, we’ll explore the challenges of spatial reasoning in diffusion models and how NVIDIA’s Learn-to-Steer method works under the hood. We’ll delve into its architecture – including how it leverages cross-attention maps and a learned classifier at inference – and review quantitative gains on benchmarks. We’ll also critically examine the trade-offs of optimizing at inference time (like compute cost and generalizability) and consider the broader implications for prompt fidelity, multimodal alignment, and the future of generative model design.

Spatial Reasoning: The Missing Piece in Diffusion Models

Modern diffusion models like Stable Diffusion can paint photorealistic or fantastical scenes with impressive detail. However, ask for a simple spatial arrangement and you might be disappointed. Spatial reasoning – understanding and generating correct relative positions (left/right, above/below, inside/outside) – remains a stumbling block. Prompts specifying object relationships often yield images that misalign with the request. For instance, a prompt “a cat on top of a bookshelf” might produce a cat next to the bookshelf or a surreal cat-bookshelf hybrid. Why does this happen?

One reason is that diffusion models learn from huge image-text datasets where explicit spatial relationships are rare or ambiguous. They excel at style and object fidelity, but the training data may not strongly enforce where each object should appear relative to others. As a result, the model’s internal representation of spatial terms (“on top of”, “to the right of”) is weak. Recent benchmarks confirm that even state-of-the-art text-to-image models struggle on spatial tasks involving simple geometric relations[2]. These failures show up as three main issues: incorrect object placement, missing objects that were in the prompt, or fused, chimeric objects when the model tries to mash two things together[3]. In short, the model often knows what you asked for, but not where to put it.

Existing methods have attempted to address this gap. Some researchers fine-tune diffusion models on images with known layouts or relations, effectively retraining the model to be spatially aware. Others use test-time interventions: for example, guiding generation with extra loss terms that penalize overlap or reward correct ordering of objects. However, manually designing such loss functions is tricky – it requires guessing how to measure “left of” or “above” using the model’s internal data. These handcrafted losses may work for simple cases but can encode suboptimal heuristics, failing on more complex scenes[4]. Fine-tuning, meanwhile, can achieve good spatial accuracy (e.g. the COMPASS method retrains a model with spatially aligned data[5]) but it’s resource-intensive and can inadvertently degrade other image qualities (in one case, color accuracy and object counting worsened after fine-tuning for spatial relations[6]). There’s a need for a solution that improves spatial fidelity without retraining the whole model or relying on brittle heuristics.

Learning to Steer Diffusion with Data‑Driven Losses

https://research.nvidia.com/publication/2025-11_data-driven-loss-functions-inference-time-optimization-text-image

NVIDIA’s Learn-to-Steer framework offers a fresh take: instead of imposing rules, learn them from the model’s own signals[7]. The key insight is that diffusion models already produce rich internal data during generation – particularly in the form of cross-attention maps – which can be mined to understand spatial relationships. Cross-attention maps are generated at each step of the diffusion denoising process and essentially tell us which image regions are attending to a given word in the prompt[8]. In other words, they form a bridge between textual tokens (like “dog”, “teddy bear”, “to the right of”) and image locations[8]. Prior work noticed that these attention maps can be interpreted to locate objects, so it’s natural to use them as a guide. Test-time optimization methods often choose cross-attention maps as the target for their spatial losses because of this interpretability and direct text-image alignment[9].
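
To make the idea concrete, here is a minimal sketch of how a per-token cross-attention map arises inside a diffusion model's cross-attention layer. The tensor names, shapes, and projection matrices below are illustrative placeholders, not the Learn-to-Steer code.

```python
import torch

def cross_attention_maps(image_feats, text_embeds, w_q, w_k):
    """Illustrative cross-attention computation.

    image_feats: (hw, d_model)     flattened spatial features at one layer
    text_embeds: (n_tokens, d_model) prompt token embeddings
    w_q, w_k:    (d_model, d_head)  query/key projections of that layer

    Returns (n_tokens, hw): how strongly each image location attends to
    each prompt token -- the bridge between words and image regions.
    """
    q = image_feats @ w_q                      # queries come from the image
    k = text_embeds @ w_k                      # keys come from the text
    scores = q @ k.T / (k.shape[-1] ** 0.5)    # (hw, n_tokens)
    attn = scores.softmax(dim=-1)              # normalize over prompt tokens
    return attn.T                              # one spatial map per token

# e.g. the map for the token "dog" (hypothetical index 2) at a 16x16 layer:
# dog_map = cross_attention_maps(feats, embeds, Wq, Wk)[2].reshape(16, 16)
```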

Learn-to-Steer (L2S) builds on this idea by learning an objective function from data instead of hand-crafting one. It introduces a lightweight relation classifier that is trained offline to recognize spatial relationships from the diffusion model’s cross-attention patterns[7]. During inference, this classifier acts as a learned loss function: it evaluates whether the generated image (so far) reflects the prompt’s relation correctly, and if not, it steers the generation in the right direction[7]. Essentially, NVIDIA’s team taught the diffusion model to critique its own attention maps and adjust accordingly, all on the fly without altering the model weights.
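
The paper describes this classifier as lightweight; the sketch below shows one plausible form such a classifier could take, operating on a stacked pair of attention maps. The layer sizes and the relation vocabulary are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

RELATIONS = ["left of", "right of", "above", "below"]  # assumed label set

class RelationClassifier(nn.Module):
    """Small CNN mapping a (subject, object) pair of cross-attention maps
    to relation logits. Architecture details are illustrative only."""
    def __init__(self, n_relations: int = len(RELATIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, n_relations),
        )

    def forward(self, subject_map: torch.Tensor, object_map: torch.Tensor):
        # each map: (B, H, W) attention mass per image location for one token
        x = torch.stack([subject_map, object_map], dim=1)  # (B, 2, H, W)
        return self.net(x)                                 # relation logits
```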

Training this relation classifier turned out to be more nuanced than it sounds. A straightforward approach might be: take a bunch of images with known relationships (e.g., images annotated that “dog is left of cat”), run the diffusion model’s inversion process to get attention maps for “dog” and “cat”, then train the classifier to output “left-of” for those maps. This indeed provides supervision. However, an unexpected pitfall emerged – something the authors call the “relation leakage” problem[10][11]. The classifier started cheating by picking up on linguistic traces of the relation in the attention maps, rather than truly understanding spatial layout. How is that possible? It turns out that when you invert an image using the correct descriptive prompt (say “a dog to the left of a cat”), subtle clues about the word “left” might get encoded in the model’s internal activations. The classifier then latches onto these clues (effectively reading the prompt back out of the attention map) instead of learning the visual concept of “left of”[10][12]. The result: it performs well on training data but is useless during generation, because a classifier that merely echoes the prompt’s relation word will always report the requested relation as satisfied, providing no corrective signal when the actual layout is wrong.

To solve this, Learn-to-Steer uses a clever dual-inversion training strategy[13][14]. For each training image, they generate two versions of the attention maps: one from a positive prompt that correctly describes the spatial relation (e.g. “A dog to the left of a cat”) and one from a negative prompt that deliberately uses the wrong relation (e.g. “A dog above a cat”)[15][16]. Both sets of attention maps are labeled with the true relation (“left of” in this example), based on the actual image layout. By seeing the same image relation with conflicting textual descriptions, the classifier is forced to ignore the unreliable linguistic cue and focus on the genuine geometric pattern in the attention maps[14]. This ensures it learns invariance: whether the prompt said “left” or “above,” the classifier must still detect the dog is actually left of the cat from the spatial evidence alone. This dual-inversion approach neutralizes the leakage problem, yielding a classifier that genuinely understands spatial relations in terms of the model’s vision, not just the text prompts[17].
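
In pseudocode, the dual-inversion labeling scheme looks roughly like the following. The inversion helper and the prompt templates are placeholders; the point is that both attention-map pairs receive the true label.

```python
RELATIONS = ["left of", "right of", "above", "below"]
PHRASE = {"left of": "to the left of", "right of": "to the right of",
          "above": "above", "below": "below"}

def dual_inversion_samples(image, subject, obj, true_rel, wrong_rel,
                           invert_and_collect_attention):
    """Build two training samples from one annotated image.

    `invert_and_collect_attention` is a placeholder for whatever routine
    inverts the image under a prompt and returns the cross-attention maps
    for the subject and object tokens.
    """
    pos_prompt = f"a {subject} {PHRASE[true_rel]} a {obj}"   # correct relation
    neg_prompt = f"a {subject} {PHRASE[wrong_rel]} a {obj}"  # deliberately wrong

    samples = []
    for prompt in (pos_prompt, neg_prompt):
        subj_map, obj_map = invert_and_collect_attention(image, prompt,
                                                         subject, obj)
        # The label is always the relation actually present in the image,
        # regardless of what the prompt claimed. Seeing contradictory text
        # paired with the same label forces the classifier to ignore
        # linguistic leakage and rely on the spatial evidence alone.
        samples.append((subj_map, obj_map, RELATIONS.index(true_rel)))
    return samples
```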

Another important aspect is the training data for this classifier. The team drew from both real images and synthetic images to cover a wide range of scenarios[18]. Real images (from a dataset called GQA) provide natural complexity and varied object arrangements, though their attention maps can be noisy when scenes are crowded[18]. Synthetic images, generated in a controlled way (using an Image-Generation-CoT method), offer simpler scenes with clearer attention patterns more akin to those encountered during diffusion generation[18]. By blending real and synthetic data, the classifier benefits from both realism and clarity. An ablation study confirmed that using both data sources led to better accuracy than either alone[19].

Inference-Time Steering with Learned Loss Functions

Once the relation classifier is trained, Learn-to-Steer plugs it into the diffusion process to steer images as they are generated. This happens during inference (generation time) and does not require any changes to the diffusion model’s weights. Here’s how it works:

When given a text prompt that includes a spatial relation (for example, “a dog to the right of a teddy bear”), the system first parses the prompt to identify the subject, object, and relation (in this case, subject: dog, relation: to the right of, object: teddy bear)[20]. As the diffusion model begins to denoise random latent noise into an image, Learn-to-Steer intervenes at certain timesteps. At a chosen frequency (e.g. at each step or every few steps in the first half of the diffusion process), it extracts the cross-attention maps corresponding to the two objects in question[20]. These are essentially the model’s current “belief” about where each object might be in the emerging image. The extracted attention maps are fed into the trained relation classifier, which produces a probability distribution over possible relations (left-of, right-of, above, below, etc.)[20][21]. Since the desired relation is known from the prompt, the system can compute a loss – for example, a cross-entropy loss that is large whenever the classifier is not confident in the correct relation[20][22].
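
Sketched in code, one such evaluation might look like this. Here `get_token_maps` (used in the next sketch) is a placeholder for extracting the two objects' attention maps at the current step, and the classifier is the illustrative one sketched earlier; the hardcoded parse is hypothetical.

```python
import torch
import torch.nn.functional as F

def relation_loss(classifier, subj_map, obj_map, target_idx):
    """Cross-entropy between the classifier's prediction on the current
    attention maps and the relation requested in the prompt. The loss is
    low only when the classifier is confident in the requested relation."""
    logits = classifier(subj_map.unsqueeze(0), obj_map.unsqueeze(0))
    target = torch.tensor([target_idx], device=logits.device)
    return F.cross_entropy(logits, target)

# Hypothetical parse of "a dog to the right of a teddy bear":
# subject, relation, obj = "dog", "right of", "teddy bear"
# target_idx = RELATIONS.index(relation)
```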

Now comes the steering part: the gradient of this loss is backpropagated into the diffusion model’s latent representation (the noisy image-in-progress) at that timestep[23]. In practice, this means nudging the latent variables in a direction that should increase the probability of the correct relation according to the classifier. Intuitively, if the classifier thinks the dog is not sufficiently to the right of the teddy bear in the current partial image, the gradient will shift the latent in a way that moves the dog’s features rightward (or the teddy’s leftward). The diffusion process then continues with this slightly adjusted latent and noise. By iteratively applying these guided updates, the generation is “steered” toward an image that conforms to the spatial instruction without ever explicitly telling the model where to draw each object. It’s as if the model has a coach whispering during painting: “move the dog a bit more to the right.”
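
A single steering update might look like the sketch below, reusing `relation_loss` from above. The `get_token_maps` helper, the step size, and doing one update per steered timestep are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import torch

def steer_latents(latents, t, prompt_embeds, classifier, subject, obj,
                  target_idx, get_token_maps, step_size=20.0):
    """One guided update of the noisy latent at timestep t (sketch)."""
    latents = latents.detach().requires_grad_(True)
    # Run the model far enough to read out the cross-attention maps for the
    # two object tokens at this timestep (placeholder helper).
    subj_map, obj_map = get_token_maps(latents, t, prompt_embeds, subject, obj)
    loss = relation_loss(classifier, subj_map, obj_map, target_idx)
    # Backpropagate through the diffusion model into the latent itself,
    # nudging it toward a layout the classifier scores as correct.
    (grad,) = torch.autograd.grad(loss, latents)
    return (latents - step_size * grad).detach()
```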

An exciting aspect of Learn-to-Steer is that it works across different diffusion architectures. The authors demonstrated it on both Stable Diffusion (a popular UNet-based model) and Flux (an MMDiT-based diffusion model), with minimal changes[24]. The approach is architecture-agnostic because it relies on generic signals (attention maps) and a separate classifier. This means future or alternative text-to-image models could potentially be “plugged into” the same steering mechanism by training a new classifier on that model’s attention outputs. Additionally, although the system was trained on single-object-pair relations, it can handle prompts that chain multiple relations. For instance, consider a prompt: “a frog above a sneaker below a teapot.” This has two relations (“frog above sneaker” and “sneaker below teapot”) involving three objects. Learn-to-Steer tackles such cases by alternating the optimization focus between relations at different timesteps[25][26]. It will optimize the latent for the frog-sneaker relation on one step, then the sneaker-teapot relation on the next, and so on in a round-robin fashion. Using this strategy, the method was able to enforce multiple spatial constraints in a single image, something that static loss functions or naive prompting often fail to achieve. (In practice, the authors found that phrasing a multi-relation prompt in a simple chained manner – e.g. “A frog above a sneaker below a teapot” – yielded better results than a more verbose sentence with conjunctions[27].)
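
For multi-relation prompts, the round-robin schedule can be expressed in a few lines, reusing the `steer_latents` and `RELATIONS` placeholders sketched earlier; `denoise_step` stands in for the model's normal denoising update, and restricting steering to the first half of the schedule is an assumption carried over from the single-relation case.

```python
def steer_multi_relation(latents, steering_timesteps, prompt_embeds,
                         classifier, relations, get_token_maps, denoise_step):
    """Alternate which relation is optimized at each steered timestep."""
    for i, t in enumerate(steering_timesteps):      # e.g. first half of schedule
        subject, rel, obj = relations[i % len(relations)]
        latents = steer_latents(latents, t, prompt_embeds, classifier,
                                subject, obj, RELATIONS.index(rel),
                                get_token_maps)
        latents = denoise_step(latents, t, prompt_embeds)  # usual diffusion step
    return latents

# relations for "a frog above a sneaker below a teapot":
# [("frog", "above", "sneaker"), ("sneaker", "below", "teapot")]
```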

Quantitative Gains on Spatial Benchmarks

How much does Learn-to-Steer improve spatial understanding in generated images? The paper reports significant leaps in accuracy on standard text-to-image evaluation benchmarks for spatial relations. Two benchmarks are used: GenEval (which checks if generated images satisfy a given relation prompt) and T2I-CompBench (Text-to-Image Composition Benchmark, another test for spatial arrangements). The team evaluated four different diffusion models – two Flux variants and Stable Diffusion 2.1 and 1.4 – comparing vanilla generation versus various methods. The results tell a clear story: learned steering objectives outperform both the unguided models and prior methods by a wide margin[28]. Some highlights:

  • Stable Diffusion 2.1 (SD2.1): Spatial accuracy on GenEval jumped from 0.07 (7%) to 0.54 when using Learn-to-Steer[29]. In other words, a model that “barely works” for spatial tasks was transformed into one that gets it right more than half the time[29]. On the T2I-CompBench metric, SD2.1 went from 0.089 to 0.365, showing a similarly large improvement[29].
  • Flux 1.0-dev (MMDiT-based): Accuracy rose from 0.20 to 0.61 on GenEval (20% to 61%) with Learn-to-Steer, and a related metric from 0.177 to 0.392[30]. This effectively turned a hit-or-miss model into a reliably accurate one for spatial inputs.
  • Outperforming Handcrafted Losses: Competing test-time methods that rely on manually designed losses saw lower scores across the board. For example, a prior approach called STORM achieved only 0.19 on SD2.1 GenEval, whereas Learn-to-Steer hit 0.54 on the same test[31]. Another baseline, FOR (Fast Optimizer for Restoration) and its spatial variant, reached around 0.26–0.35 on SD2.1, still far behind L2S’s performance[32]. These gaps illustrate that the data-driven learned loss is more effective than guesswork losses encoded by humans.
  • Matching Fine-Tuned Models: Perhaps most impressively, the learned steering nearly matches or exceeds the accuracy of models that were explicitly fine-tuned for spatial relations. The COMPASS method (which retrains the diffusion model with spatially aware data and a special token ordering) achieved 0.60 on Flux’s benchmark[33]. Learn-to-Steer, without any model retraining, scored 0.61 – essentially on par[33]. This demonstrates that test-time optimization can attain state-of-the-art fidelity that previously required heavy model training. Moreover, it did so while keeping the base model’s other capabilities intact (COMPASS, in contrast, improved spatial skill but caused drops in color and counting accuracy as a side effect[34]).
  • Multiple Relations Generalization: Even though the relation classifier was trained only on single relations, Learn-to-Steer showed a capacity to handle prompts with multiple simultaneous relations. In a stress-test with 3–5 objects and up to three relations in a prompt, the base model alone almost always failed (virtually 0% success)[35][36]. With L2S enabled, the model managed a substantial increase – for example, about 28% accuracy on prompts with two relations among three objects, and around 10–12% accuracy for very complex cases of three relations among four or five objects[37][38]. These numbers aren’t high in absolute terms, but they are orders of magnitude better than the near-zero of the unassisted model, indicating that the method can compose multiple learned objectives to some extent. Importantly, performance degrades gracefully as more relations are added, rather than collapsing – hinting that each relation can be handled somewhat independently by the approach[39]. This compositional generalization is a promising sign for tackling more elaborate scene descriptions in the future.

Equally telling are the qualitative results. The paper’s examples show that with Learn-to-Steer, generated images faithfully reflect the spatial instructions in the prompt while maintaining high image quality[40]. In scenarios where vanilla diffusion or other methods would place objects incorrectly or omit some entities, L2S produces images where the objects are correctly arranged and all present. It also handles unusual requests gracefully – e.g. it can render “a bus below a toothbrush” or “an elephant below a surfboard” with the correct spatial ordering and without the bizarre mergings that other methods produce[41]. The NVIDIA team points out that their method overcomes the three common failure modes: it fixes object misplacement, prevents entity neglect (every object in the prompt appears in the image), and avoids object fusion (no more surreal hybrids caused by the model conflating two items)[3]. In side-by-side comparisons, other baselines might omit a vase or zebra from a scene or entangle them, whereas Learn-to-Steer’s outputs include all the right pieces in the right configuration[3]. This boost in prompt fidelity – getting exactly what was asked, where it was asked for – is a big step forward for the reliability of generative AI outputs.

Inference-Time Optimization: Costs and Trade-Offs

Learn-to-Steer’s approach of optimizing during inference brings both advantages and considerations. On the plus side, test-time optimization means we don’t need to tamper with the model’s weights or perform expensive fine-tuning for spatial tasks[42]. The same pretrained model can be flexibly “steered” only when needed – preserving its original versatility when spatial control isn’t required[34]. This avoids the kind of trade-off seen with fine-tuned models that might overfit to spatial relations at the expense of other skills (like color accuracy or counting)[34]. In NVIDIA’s approach, if a prompt doesn’t specify spatial relations, one could simply run the diffusion model normally with no additional overhead, maintaining the original speed and output characteristics. The steering kicks in only for prompts that demand it[43].

However, the flip side is that when we do invoke this inference-time loss, it comes with a computational cost. The process requires running the classifier and backpropagating gradients multiple times during generation, which can slow down image synthesis considerably. The authors measured how much slower things get: for the smaller Flux 1.0-schnell model, generation went from ~0.5 seconds per image to ~16.5 seconds with Learn-to-Steer – roughly a 33× slowdown[44]. For the larger Flux 1.0-dev, 11 seconds became 6 minutes (~33× slower). Stable Diffusion 2.1, which normally takes about 4.5 seconds per image on their hardware, climbed to ~90 seconds with steering (~20× slower)[44]. SD1.4 saw a similar jump (4.5s to ~80s)[44]. These are non-trivial overheads. In scenarios where speed and scalability are crucial (e.g. high-throughput image generation or real-time applications), applying test-time optimization to every single image may be impractical.

There are some ways to mitigate this. One is to limit when and how the optimization is applied. Learn-to-Steer only optimizes during the first half of the diffusion steps in their implementation[23], which they found sufficient to set the course for the image. Additionally, as mentioned, it can be used selectively: an AI image service could generate an image normally, and only if the result looks spatially off (or the user explicitly requests a strict spatial layout) would it then run a second pass with L2S enabled. Another angle is improving efficiency: since the relation classifier is quite small and only a few attention maps are involved, the overhead mainly comes from doing backpropagation through the large diffusion model for multiple steps. Future research might explore accelerating this with better optimizers or partial updates. Nonetheless, at present, the method is best suited for cases where accuracy matters more than speed – e.g. generating a precise diagram or scene for a design, or handling relatively small batches of images where quality trumps quantity.
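
One simple way to implement that "only when needed" workflow is to generate normally, run a cheap 2D check on detected object centroids, and re-generate with steering only if the check fails. The checker below is a minimal sketch of such a 2D test; the detector and the re-generation call in the comments are hypothetical, not part of the paper.

```python
def relation_holds(subj_center, obj_center, relation):
    """2D centroid check; image y grows downward, so 'above' means smaller y."""
    (sx, sy), (ox, oy) = subj_center, obj_center
    return {"left of":  sx < ox,
            "right of": sx > ox,
            "above":    sy < oy,
            "below":    sy > oy}.get(relation, True)

# Hypothetical two-pass workflow:
# image = generate(prompt)                           # fast, unguided pass
# boxes = detect_objects(image)                      # any off-the-shelf detector
# if not relation_holds(boxes[subject], boxes[obj], relation):
#     image = generate_with_learn_to_steer(prompt)   # slower, steered pass
```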

Generality and robustness involve trade-offs of their own. The Learn-to-Steer framework proved surprisingly general across model architectures (UNet vs MMDiT)[24], which suggests it could be applicable to other diffusion models or future systems with minimal adaptation. The requirement is that one can extract cross-attention or a similar alignment signal from the model. It also shows robustness in handling multiple relations and never-before-seen object combinations by virtue of how it was trained (focusing on generic attention patterns). However, it’s worth noting some limitations. The paper’s analysis points out that what counts as “above” or “below” is judged in 2D – by the image’s pixels and attention – which might not always align with true 3D spatial understanding[45]. For instance, if an object is in front of another in 3D space, from a certain camera angle it might appear below the other in the 2D image, confusing the spatial relation. Learn-to-Steer doesn’t explicitly model depth or real-world size relationships; it purely learns from visual attention overlays. So in complex scenes with perspective, it might enforce a relation that makes sense in the 2D projection but not in a truly physical sense[45]. Moreover, while the method can handle up to three relations, its accuracy drops as scenes get very crowded[46]. Generating a perfect scene with, say, five objects, all relative to each other is still an open challenge – sometimes the method succeeds, other times not[37]. These limitations highlight that there is room to improve, possibly by incorporating more sophisticated reasoning or multi-step planning for complex prompts.

Broader Implications: Prompt Fidelity and Future Model Design

By dramatically improving spatial fidelity, NVIDIA’s Learn-to-Steer marks an important step toward more trustworthy multimodal systems. For users – whether they are artists, designers, or enterprise developers – having a text-to-image model that actually respects spatial instructions means less frustration and manual correction. It brings us closer to “what you prompt is what you get.” This fidelity is not just about pretty pictures; it’s about aligning the AI’s output with the user’s intent in a controllable way. In a sense, it enhances multimodal alignment: the textual modality (relations described in language) is more faithfully reflected in the visual modality (the generated image)[3]. Improved alignment on spatial reasoning may also carry over to other aspects of the prompt, since the approach shows it’s possible to target specific failure modes (like object placement) without ruining others (like color, count, or overall coherence)[34]. It’s a demonstration that we can inject domain-specific “common sense” into a large generative model post-hoc, rather than hoping a single giant model gets everything right out of the box.

The success of using cross-attention maps as a teaching signal could influence future model designs and training regimes. One implication is that future diffusion models might integrate modules that monitor or enforce certain constraints internally. For example, a next-generation model could include a learned loss (like this classifier) as part of its training, not just inference. Such a model would effectively train with a tutor that penalizes it whenever it arranges objects incorrectly, potentially internalizing spatial reasoning end-to-end. That could reduce the need for test-time optimization in the long run. In the meantime, approaches like Learn-to-Steer provide a versatile toolkit: they can be layered on top of existing models as a form of post-training specialization. This is attractive for enterprise use-cases where one might take a general pre-trained model and safely adapt it to a niche requirement (like always obeying layout instructions) without risking the model’s integrity on other tasks.

There’s also a broader message about data-driven loss design. Handcrafting a loss function is essentially guessing how the model should behave, whereas learning a loss function lets the model tell us what works. Here, by probing the model’s own attention, the researchers let the data (inverted images and attention maps) reveal the right objective. This principle could be applied to other generative alignment problems. We might see analogous “learned steering” for ensuring attribute consistency (e.g. that a “red cube” comes out red), counting (ensuring a prompt for five apples yields five distinct apples), or even style consistency across multiple images. Each would involve training a small network on the model’s internals to guide a specific aspect of generation.

Finally, prompt engineering could become less of an art and more of a science thanks to such techniques. Instead of contorting our text prompts to coax a model into doing what we mean (“maybe if I say ‘a dog on the far right of a teddy bear’ it will listen…”), we can rely on learned controllers to enforce interpretation. This frees users to specify what they want in straightforward terms and trust the system to handle the rest. In multi-part prompts or complex scenes, having the ability to maintain control over each relation or detail means generative models can be used for more compositional tasks – like drafting a storyboard, designing a user interface layout, or generating scientific diagrams – where spatial accuracy is crucial.

In summary, NVIDIA’s Learn-to-Steer paper demonstrates an insightful balance of machine learning and practical problem-solving. By leveraging a model’s own knowledge (via cross-attention) and injecting a learned objective at inference, it achieves a new level of prompt fidelity for spatial requests. The approach does come with trade-offs in compute cost, but it opens the door to highly targeted improvements of generative models without retraining them from scratch. As diffusion models become ever more central in AI content creation, solutions like this ensure that “minor details” like where things are in the image won’t be so easily overlooked. It’s a compelling example of how a bit of additional intelligence – in the form of a learned loss function – can steer a massive generative model to even greater heights of alignment with human intent[3][47]. The road ahead may involve integrating such mechanisms directly into model training or expanding them to new types of constraints, but one thing is clear: letting models learn how to steer themselves is a powerful idea that we’re likely to see much more of in the future.


[1] [4] [7] Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image Generation

https://learn-to-steer-paper.github.io/

[2] [3] [5] [6] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image Generation

https://arxiv.org/html/2509.02295v1

Boxu earned his Bachelor's Degree at Emory University, majoring in Quantitative Economics. Before joining Macaron, Boxu spent most of his career in the Private Equity and Venture Capital space in the US. He is now the Chief of Staff and VP of Marketing at Macaron AI, handling finances, logistics and operations, and overseeing marketing.
