Bryson Jones

Methods for Conditioning Diffusion Models

Bryson Jones — Thu, 22 Jan 2026 19:38:56 GMT

I’ve been spending time on research into different multi-modal generation techniques, and was recently discussing with someone all the different ways you can condition diffusion or flow-matching models.

It made me want to make a quick blog post about it, so here you go!

Here is a look at the four conditioning paradigms that define the current landscape of diffusion modeling. There’s obviously more work going on than just these four, but these are the ones I’ve seen used prominently and work effectively. If you have other methods you think are critical to know, mention them below!

1. Early Days: Simple Cross-Attention

The first widely successful approach to conditioning diffusion models came from using cross-attention to merge latent information from different signals, and this was before transformers were the common backbone of diffusion models.

In early latent diffusion systems, the problem was framed very simply: how do you inject a text description into an image denoising process to augment the generation process?

The initial approach that saw success was popularized by High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022), the paper behind the early Stable Diffusion models such as 1.5. The idea was to keep the two streams largely separate and connect them through cross-attention.

Text is encoded once by a frozen model such as CLIP, producing a set of keys (K) and values (V). The image latent, evolving through a U-Net backbone during denoising, produces the queries (Q). At each cross-attention layer, the model combines them through a classic attention block:
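Assuming the standard scaled dot-product form (the notation here is the usual convention, with Q projected from image features, K and V projected from the text embeddings, and d the key dimension):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
$$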

The cross-attention effectively “injects” conditioning by projecting text embeddings into the image denoiser, where image features act as queries over text-derived keys and values. This mechanism biases the denoising updates toward prompt-consistent semantics while preserving a separation between text representation and spatial image processing.
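To make the query/key/value roles concrete, here is a minimal NumPy sketch of a single-head cross-attention step. The shapes, the random projection matrices (`w_q`, `w_k`, `w_v`), and the function names are illustrative assumptions, not the layout of any particular Stable Diffusion release.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    # Queries come from the image latent; keys and values come from the text.
    q = image_tokens @ w_q          # (n_img, d)
    k = text_tokens @ w_k           # (n_txt, d)
    v = text_tokens @ w_v           # (n_txt, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each image position distributes attention over text tokens
    return weights @ v, weights

# Hypothetical sizes: 16 latent positions (width 32), 5 prompt tokens (width 24).
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16, 32))
text_tokens = rng.normal(size=(5, 24))
w_q = rng.normal(size=(32, 8))
w_k = rng.normal(size=(24, 8))
w_v = rng.normal(size=(24, 8))

out, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
```

Note that the output has one row per image position: the text never gets updated by the image, which is the one-directional separation discussed here.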

This design choice was extremely pragmatic and simple. For straightforward text-to-image synthesis, this ended up working remarkably well and defined the baseline for an entire generation of models.

However, the same separation that gives cross-attention its stability also limits its expressiveness. Because text and image representations never fully merge, the model struggles with fine-grained spatial reasoning, compositional logic, and especially typography (remember seeing letters scrambled like they were written schizophrenically?). The conditioning signal is directional and coarse: the text can guide what appears, but not reliably where or how precisely.

2. AdaLN (Adaptive Layernorm): Global Conditioning for Diffusion Transformers

When the paper Scalable Diffusion Models with Transformers (Peebles & Xie, 2022) came out, the field began shifting from U-Nets to Diffusion Transformers (DiTs) as the backbones of diffusion models. The same paper brought AdaLN conditioning to these transformer blocks, and it has become a preferred method for injecting global context effectively.

The idea is that, rather than adding extra tokens or specialized attention layers, AdaLN modulates the layer normalization blocks. The model learns to predict the scale (γ) and shift (β) parameters of the Layer Normalization based on a condition vector c.

The operation is defined as:
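Assuming the common formulation, where the LayerNorm carries no learned affine parameters of its own and γ(c), β(c) are produced from the condition vector c by a small learned network:

$$
\mathrm{AdaLN}(x, c) = \gamma(c) \odot \mathrm{LayerNorm}(x) + \beta(c)
$$

Many implementations parameterize the scale as 1 + γ(c), so that a zero output from the conditioning network leaves the normalized activations unchanged.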

This ends up being a really efficient way of applying global conditioning to the diffusion transformer backbone, such as timestep or style conditioning.

AdaLN is nice because it’s computationally cheap: the modulation itself is an elementwise O(D) operation per token, versus attention, whose cost grows quadratically with sequence length. It influences the entire generation uniformly, making it excellent for setting global style or coherence.
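As a rough sketch of the mechanism, the NumPy snippet below predicts a scale and shift from a condition vector with a single linear layer and applies them after a parameter-free LayerNorm. The single-layer projection, the `1 + gamma` parameterization, and all names and sizes are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature dimension, with no learned affine.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln(x, c, w, b):
    # x: (n_tokens, D) activations, c: (C,) condition vector.
    # w: (C, 2 * D) and b: (2 * D,) form a hypothetical linear layer predicting scale and shift.
    gamma, beta = np.split(c @ w + b, 2)
    # The same (gamma, beta) is broadcast to every token: conditioning is global.
    return (1.0 + gamma) * layer_norm(x) + beta

rng = np.random.default_rng(0)
n_tokens, D, C = 10, 16, 4
x = rng.normal(size=(n_tokens, D))
c = rng.normal(size=(C,))

# With zero weights the block reduces to a plain LayerNorm.
y_identity = adaln(x, c, np.zeros((C, 2 * D)), np.zeros(2 * D))

# With non-zero weights the condition rescales and shifts every token identically.
w_rand = 0.1 * rng.normal(size=(C, 2 * D))
y_cond = adaln(x, c, w_rand, np.zeros(2 * D))
```

Because one scale and shift is shared across all tokens, there is nothing position-specific in the signal, which is where the lack of spatial precision comes from.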

One of the challenges, though, is that it lacks precision, and its edits can be coarse or blunt. It can’t easily tell the model to place a specific object at specific coordinates, or adjust the orientation of that object.

Still, it is one of the most common and powerful ways to apply conditioning to a DiT model, and its implementation is elegantly simple, making it easy to understand.

Side Note:

This is one of the most effective ways we’ve seen to condition diffusion and flow-matching action-chunking policies and action heads for robotics. You take all of your observations (camera view encodings, joint angles, task description, etc.) and concatenate them into an AdaLN conditioning vector.

You can see this in action in a repo I’ve open-sourced for Multitask Diffusion Policy, here: https://github.com/brysonjones/multitask_dit_policy/tree/main

3. In-Context Learning (ICL): Few-Shot Adaptation

In-context learning applies conditioning through the inputs of the diffusion transformer, rather than through an architectural modification that injects conditioning during processing.

One of the first papers we saw this with was In-Context Learning Unlocked for Diffusion Models (Wang et al., 2023), where the idea almost exactly mirrors few-shot prompting in LLMs. The model is given one or more example pairs, each an input image and its desired output (segmentation, edge detection, style transfer, etc.), followed by a new query input.

We end up effectively treating these image tokens as "visual prompts" alongside or instead of text. Just as you might show a language model a few examples of English-to-French translation to teach it the pattern, in diffusion ICL, you feed the model a "context pair" consisting of a source image and a transformed version.

This is the architecture from “PromptDiffusion” in In-Context Learning Unlocked for Diffusion Models, but there are many different ways to accomplish this.

Practically, this is implemented either by concatenating images into a single tensor (for example: stacking an edge map and its target photo next to a new edge map), or by using attention masking so that query tokens can attend to example tokens. The diffusion process implicitly learns the transformation by pattern matching within its context window.
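Both options can be sketched in a few lines of NumPy. The 2x2 grid layout, the mask convention (`True` means "may attend"), and the choice to keep context tokens from attending to query tokens are illustrative assumptions, not the layout used by any specific paper.

```python
import numpy as np

def make_icl_grid(example_src, example_tgt, query_src, query_noisy):
    # Option 1: tile the demonstration pair and the query pair into one image tensor.
    # Each input is (H, W, C); the result is (2H, 2W, C).
    top = np.concatenate([example_src, example_tgt], axis=1)    # side by side along width
    bottom = np.concatenate([query_src, query_noisy], axis=1)
    return np.concatenate([top, bottom], axis=0)                # stacked along height

def context_attention_mask(n_context, n_query):
    # Option 2: a boolean mask over a token sequence laid out as [context, query].
    n = n_context + n_query
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_context, :n_context] = True   # context tokens attend only to context
    mask[n_context:, :] = True            # query tokens attend to context and to each other
    return mask

rng = np.random.default_rng(0)
H, W, C = 8, 8, 3
edge_example, photo_example, edge_query = (rng.random((H, W, C)) for _ in range(3))
noisy_target = rng.normal(size=(H, W, C))   # the region the model is asked to denoise

grid = make_icl_grid(edge_example, photo_example, edge_query, noisy_target)
mask = context_attention_mask(n_context=4, n_query=2)
```

In both cases the sequence the model sees grows with every demonstration, which is where the context-length cost of this approach comes from.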

This approach ends up working quite well, and multitask demonstrations end up teaching the model how to combine and generalize these editing concepts. It allows a single pretrained model to perform edge-to-image, colorization, style transfer, and other conditional tasks without specialized adapters.

But this flexibility and expressivity come at a cost. Because the model is not explicitly optimized for the task, output fidelity usually lags behind fine-tuned or adapter-based methods (though the gap starts to close with scale, as it does for most models). Additionally, the approach is fundamentally bounded by context length (attention compute grows quadratically with it) and GPU memory, making it difficult to scale to complex or high-resolution demonstrations.

4. Joint Attention for Multiple Modalities (MM-DiT)

Joint Attention with MM-DiTs represents one of the most recent big leaps in diffusion conditioning. This approach basically creates two separate streams of data that run in parallel through the model layers:

Image Stream: Processes the visual latent patches (the noisy image being denoised).
Text Stream: Processes the text tokens (from the prompt).

Importantly, each stream has its own set of weights. This means the model learns separate parameters to process visual data and textual data, acknowledging that pixels and words behave very differently and have different patterns to learn, etc.

Then within the MM-DiT blocks, image and text tokens are concatenated into a single sequence, processed with attention, and then split back into their distinct streams.
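The concatenate, attend, split pattern can be sketched as follows. This is a single-head NumPy illustration that assumes both streams share one width and that omits the output projections, MLPs, and modulation a full block would have; the helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_stream(rng, width, d_head):
    # Each modality gets its own q/k/v projection weights.
    return {name: rng.normal(size=(width, d_head)) / np.sqrt(width) for name in ("q", "k", "v")}

def joint_attention(img, txt, img_w, txt_w):
    n_txt = txt.shape[0]
    # Project each stream with its own weights, then concatenate into one sequence.
    q = np.concatenate([txt @ txt_w["q"], img @ img_w["q"]], axis=0)
    k = np.concatenate([txt @ txt_w["k"], img @ img_w["k"]], axis=0)
    v = np.concatenate([txt @ txt_w["v"], img @ img_w["v"]], axis=0)
    # One attention over the joint sequence: text sees image and image sees text.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = weights @ v
    # Split back into the two streams.
    return out[n_txt:], out[:n_txt]

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 32))   # 16 image patches
txt = rng.normal(size=(5, 32))    # 5 prompt tokens
img_w = init_stream(rng, 32, 8)
txt_w = init_stream(rng, 32, 8)
img_out, txt_out = joint_attention(img, txt, img_w, txt_w)
```

Unlike plain cross-attention, the text tokens are updated too, so both representations evolve together from layer to layer.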

This iterative merging and separation helps ensure the model retains exact character sequences and resolves ambiguity, leading to superior typography (no more misspellings, mangled letters, etc) and complex prompt adherence.

This idea was introduced in Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Esser et al., 2024) for Stable Diffusion 3. The architecture evolves the simple cross-attention concept into a model that more fully merges the multi-modal signals; the paper’s architecture diagram shows the full block.

There are quite a few trade-offs, though:

Just like ICL, because we are increasing our input context for these signals, compute cost grows quadratically with context/conditioning length.
As the architecture diagram in the paper shows, this architecture is quite complicated, with lots of implementation sub-details, making implementation and hyperparameter tuning cumbersome.

There’s a lot of exciting work going on with diffusion/flow-matching generative models right now, and it’s easy to get lost in the sauce of all the different methods. I found trying to group and categorize approaches like this helped me grasp when and where to try and apply the different strategies.

I’m working on diffusion models and other architectures for multi-modal generation, representation learning research, and beyond.

Most of my work is focused on robot manipulation, world-modeling, and decision-making. Reach out if you are interested in chatting (https://x.com/brysonkjones), and share this post with others!


Why Manipulation Is "Harder" Than Locomotion

Bryson Jones — Sun, 21 Sep 2025 05:49:50 GMT

When people imagine the frontier of robotics, they often think of humanoid robots or quadrupeds sprinting, climbing, dancing, doing backflips, etc. Locomotion captures our attention because it’s visually striking and straightforward to grok how difficult the tasks are (most of us can’t do backflips…).

However, the frontier challenges we face are very much in manipulation, or the ability to handle, assemble, and interact with the physical world with our robotic systems.

We don’t think of the ability to pick up an egg without cracking it or fold a blanket as the most impressive tasks humans are capable of, but they are some of the hardest problems to build intelligence for.

This is a prime example of Moravec’s Paradox, effectively: the hard things are easy, and the easy things are hard.

What makes up robot autonomy?

The problems within robotics can largely be broken up into:

Navigation
Locomotion
Manipulation

Jitendra Malik gave a keynote talk over the past year (I unfortunately forget which conference this was [1]) that went as far as spicily saying something along the lines of:

“Navigation is solved, locomotion is basically solved, and while we are making progress on manipulation, we really don’t have what the recipe is yet to solve it”

I’m a huge fan of Jitendra, and while I know part of his goal with this slide was to be provocative, I generally agree (although I’m more bullish on the track we’re on in manipulation than he may be).

Three ways to train manipulation policies

Researchers are attacking manipulation from three main directions:

Behavior cloning from tele-operation data – teaching robots by having humans control them and mimicking the demonstrations.
Large-scale unsupervised learning – using internet-scale video data to learn broad priors about how actors in a scene interact with objects and the environment they’re in.
Reinforcement learning in simulation – training policies in massive parallelized environments where billions of interactions can be run safely and cheaply.

Each approach has made progress, but none has cracked the full reliability and generalization needed for everyday manipulation tasks.

I don’t think it’s controversial to say that #1 (BC through tele-op) is the leading approach in terms of performance. That said, there’s a lot of work going on behind the scenes at places like SkildAI, Tesla, etc. that we aren’t privy to, and things are moving so quickly that this could be wrong by the time you read this.

Why locomotion is simpler

“Making this huge hunk of metal run looks pretty hard, man.”

Locomotion benefits from properties that make it extremely well-suited for massively parallel reinforcement learning in simulation:

Clear reward functions:
The reward function for locomotion is actually pretty easy to formulate: move forward, have smooth motions, minimize energy usage. These are mathematically compact objectives, and almost feel tailor-made for reinforcement learning.
Minimal sensing needs:
Proprioceptive feedback (joint angles, velocities, foot contacts) often suffices. Many locomotion policies succeed even without cameras, relying only on physical feedback.
Representation efficiency:
High-performing locomotion policies can evidently be squeezed into ~10 million parameters (you could argue more for multiple embodiments, but I’ll make the generalization).
Structural regularity:
Locomotion has strong symmetries where gaits are often periodic and physics is uniform across terrains. This allows policies to generalize well, with intelligent behaviors like efficient gait cycles or arm-swinging for energy minimization emerging naturally.
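To show how compact such an objective can be, here is a hedged sketch of a per-step locomotion reward with the three terms mentioned above (move forward, move smoothly, save energy). The functional forms, weights, and argument names are illustrative assumptions, not taken from any particular simulator or paper.

```python
import numpy as np

def locomotion_reward(forward_velocity, joint_torques, joint_velocities,
                      action, prev_action, target_velocity=1.0,
                      w_track=1.0, w_energy=1e-3, w_smooth=1e-2):
    # Move forward: peaks at 1.0 when the base velocity matches the target.
    tracking = np.exp(-(forward_velocity - target_velocity) ** 2)
    # Minimize energy: penalize mechanical power summed over joints.
    energy = np.sum(np.abs(joint_torques * joint_velocities))
    # Smooth motions: penalize large changes between consecutive actions.
    smoothness = np.sum((action - prev_action) ** 2)
    return w_track * tracking - w_energy * energy - w_smooth * smoothness

n_joints = 12  # hypothetical quadruped with 12 actuated joints
zeros = np.zeros(n_joints)

# On-target velocity, no effort, no action change: the maximum reward of 1.0.
r_ideal = locomotion_reward(1.0, zeros, zeros, zeros, zeros)

# Same velocity, but with a jerky, high-effort step: strictly lower reward.
r_jerky = locomotion_reward(1.0, np.full(n_joints, 20.0), np.full(n_joints, 5.0),
                            np.ones(n_joints), -np.ones(n_joints))
```

Everything here is computable from proprioceptive state alone, which ties back to the minimal sensing point in the list above.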

All of these aspects lead to the ability to leverage large-scale simulation + RL to train super-human (super-dog for quadrupeds??) locomotion policies.

What makes manipulation different (and harder)?

“How hard can folding these f&%$ing clothes be?”

Manipulation poses a more sobering challenge:

No compact reward functions
The “cost function” for inserting a screw or cooking an omelet is hard to specify. Rewards often need careful shaping or human priors, leading to brittleness and unexpected outcomes from reward-hacking.
Rich sensing requirements
Unlike locomotion, manipulation usually requires vision and tactile feedback to estimate object shape, pose, affordances, and contacts. Tactile sensing hardware is still immature. The example I like to give people is: “imagine dipping your hand in anesthetic and trying to pick your phone up… it’s basically impossible”
Discrete modes and long horizons introduce unique complexity
Manipulation often involves discrete modes for task completion: grasping, lifting, pushing, where each object can behave differently. Contact dynamics are messy, occlusions are common, and the precision bar is much higher than for locomotion.
Few emergent behaviors
Unlike locomotion, manipulation doesn’t “discover” elegant solutions on its own (so far). Without demonstrations or heavy engineering, policies struggle to converge on useful strategies.

This is why we haven’t been able to just “zero-shot” transfer all of the methods that accelerated locomotion progress in the past 5 years into manipulation.

Where does this leave us?

This field is moving so fast that some of these challenges in manipulation could be solved in the next month, or it may take a few years. I don’t want to paint too bleak a picture; I really just wanted to highlight that progress in locomotion often gets conflated with how to tackle manipulation, and I don’t believe that’s fair to do.

I’m personally much more optimistic than many in the field are about how quickly behavior cloning with tele-operation data collection will result in useful [2] manipulation policies that can be deployed into real-world settings and kick off a data flywheel of self-improvement (this is the bet we [3] are making).

On the topic of reward functions, there’s a lot of amazing work happening leveraging large pre-trained VLMs (and other foundation models) to bootstrap reward models that can rapidly be adapted to new tasks.

I could talk for hours about all of the exciting work going on in the field right now, but will save that for a future post.

If you liked this post, feel free to subscribe, share with others that might enjoy, or reach out to chat!


“The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.” - George Bernard Shaw

[1] Please comment and let me know if you find it!

[2] We don’t need to solve general manipulation to have useful policies in niche and specific tasks, even if we leverage data that is as general and broad as is available for pre-training.

[3] My company is developing manipulation policies and tools to deploy them for skilled labor tasks. If this is something you’re interested in working on, I would love to talk to you about joining our team.