<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Bryson Jones]]></title><description><![CDATA[I'm working on robot autonomy, machine learning, and occasionally try to organize my thoughts here with posts!

Excited to have you read some of my musings, and hopefully share your own opinions. You can reach out to me directly if you want to talk more.]]></description><link>https://brysonkjones.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!peXz!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbrysonkjones.substack.com%2Fimg%2Fsubstack.png</url><title>Bryson Jones</title><link>https://brysonkjones.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 10 May 2026 18:28:29 GMT</lastBuildDate><atom:link href="https://brysonkjones.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Bryson Jones]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[brysonkjones@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[brysonkjones@substack.com]]></itunes:email><itunes:name><![CDATA[Bryson Jones]]></itunes:name></itunes:owner><itunes:author><![CDATA[Bryson Jones]]></itunes:author><googleplay:owner><![CDATA[brysonkjones@substack.com]]></googleplay:owner><googleplay:email><![CDATA[brysonkjones@substack.com]]></googleplay:email><googleplay:author><![CDATA[Bryson Jones]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy]]></title><description><![CDATA[There&#8217;s been a boom around different end-to-end learning approaches for robot manipulation for a while, but it&#8217;s really started to boil over in the past year.]]></description><link>https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy</link><guid isPermaLink="false">https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy</guid><dc:creator><![CDATA[Bryson Jones]]></dc:creator><pubDate>Mon, 30 Mar 2026 19:11:38 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!_Yqq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There&#8217;s been a boom around different end-to-end learning approaches for robot manipulation for a while, but it&#8217;s really started to boil over in the past year. Late last year, <a href="https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/">Boston Dynamics showed a whole-body manipulation demo</a>, and then <a href="https://www.youtube.com/watch?v=9e0SQn9uUlw">made another big splash at CES</a>, demonstrating impressive capabilities with TRI adapting their Large Behavior Model (LBM) work to Atlas<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FFBG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FFBG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg 424w, https://substackcdn.com/image/fetch/$s_!FFBG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!FFBG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!FFBG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FFBG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg" width="777" height="389" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:389,&quot;width&quot;:777,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Why Did You Drop It Again?\&quot;... Hyundai Robot Unfazed by Human Interference  - The Asia Business Daily&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Why Did You Drop It Again?&quot;... Hyundai Robot Unfazed by Human Interference  - The Asia Business Daily" title="Why Did You Drop It Again?&quot;... 
Hyundai Robot Unfazed by Human Interference  - The Asia Business Daily" srcset="https://substackcdn.com/image/fetch/$s_!FFBG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg 424w, https://substackcdn.com/image/fetch/$s_!FFBG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg 848w, https://substackcdn.com/image/fetch/$s_!FFBG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!FFBG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208b0f0d-06b5-430f-ba93-633a243ff1d8_777x389.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" 
height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Atlas with Multitask Diffusion Transformer Policy Trajectory Generator <a href="https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/">[2]</a></figcaption></figure></div><p>The policy used for trajectory generation here is a <em>Multitask Diffusion Transformer Policy</em> <a href="https://arxiv.org/pdf/2507.05331">[1]</a> <a href="https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/">[2]</a>.</p><p>It&#8217;s worth noting that this isn&#8217;t a VLA<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> in the sense that it doesn&#8217;t use a pretrained VLM backbone. It instead uses much smaller vision and text encoders, yet still shows extremely impressive performance. I don&#8217;t like to throw around the phrase &#8220;SOTA&#8221; in manipulation, because the benchmarking and evaluation process is quite nascent/hand-wavy in the field right now<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> for anything beyond straightforward tasks (<a href="https://libero-project.github.io/main.html">LIBERO</a>, etc.), so without recreating the results on your own setup it&#8217;s impossible to know. 
With that being said, the results are pretty majestic.</p><p>But with no code publicly available, it&#8217;s not possible to compare this against other approaches.</p><p>This motivated me to work out an implementation from what has been released publicly, literature from TRI, additional digging, and intuition. I&#8217;ve built a version of this policy, broken down the implementation in this post, and released an open-source implementation for the community.</p><p>What made it stand out to me is how far you can push performance without a massive compute budget. I&#8217;ve found you can achieve amazing results on single-task to few-task datasets with ~100-200 examples and ~10 hours of training on a single H200 <em>(so for tens of dollars you can iterate on policies)</em>.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;145c9374-6c21-4be2-b032-3d20d97eb40e&quot;,&quot;duration&quot;:null}"></div><p>I&#8217;ve open-sourced the implementation and made it available in two places:</p><ul><li><p>A standalone repo with a simple training pipeline and example inference loop, which can be found <a href="https://github.com/brysonjones/multitask_dit_policy">here</a></p><ul><li><p><em>this doesn&#8217;t directly integrate with any downstream hardware, so users will need to do that themselves</em></p></li><li><p>there are a few extension pieces added here, like experimental vision encoders, along with the modularity to experiment with future observation and training objective strategies</p></li></ul></li><li><p>A direct integration into Hugging Face&#8217;s <a href="https://github.com/huggingface/lerobot">LeRobot</a> project, as I believe this is the best way to make the work accessible and easily deployable on hardware</p><ul><li><p>this implementation is as close to the vanilla policy implementation as I could make it; it should be simple and easy to understand, and give users a 
solid baseline to work from</p></li></ul></li></ul><h3>Policy Architecture and Details:</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_Yqq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_Yqq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png 424w, https://substackcdn.com/image/fetch/$s_!_Yqq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png 848w, https://substackcdn.com/image/fetch/$s_!_Yqq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png 1272w, https://substackcdn.com/image/fetch/$s_!_Yqq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_Yqq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png" width="727" height="318.1387800783436" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:782,&quot;width&quot;:1787,&quot;resizeWidth&quot;:727,&quot;bytes&quot;:205903,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://brysonkjones.substack.com/i/178113708?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe9af8ad-5850-44bd-93cc-7ef261a618ac_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_Yqq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png 424w, https://substackcdn.com/image/fetch/$s_!_Yqq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png 848w, https://substackcdn.com/image/fetch/$s_!_Yqq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png 1272w, https://substackcdn.com/image/fetch/$s_!_Yqq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46e1b73f-f77b-4f8d-9cb7-c69bee6ee632_1787x782.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A similar diagram appears in the TRI LBM paper [1], but I found it helpful to spell out all of the details rather than leave them open to interpretation</figcaption></figure></div><p>The policy follows a diffusion transformer (DiT) architecture adapted for 1D sequence modeling instead of 2D image generation, and is conditioned on all of the relevant observation information during generation <a href="https://arxiv.org/pdf/2212.09748">[3]</a>.</p><p>This policy is trained with classic behavior cloning on data gathered through teleoperation. TRI presented a pretty in-depth analysis of the performance you can expect from this model, both when training on a single task and when leveraging large-scale pretraining, so refer to that to calibrate expectations <a href="https://arxiv.org/pdf/2507.05331">[1]</a>. 
Additionally, I&#8217;ve trained and evaluated the model against the LIBERO benchmark as a sanity check, and saw the following performance which may be helpful as well:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5E-0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5E-0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png 424w, https://substackcdn.com/image/fetch/$s_!5E-0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png 848w, https://substackcdn.com/image/fetch/$s_!5E-0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png 1272w, https://substackcdn.com/image/fetch/$s_!5E-0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5E-0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png" width="375" height="99.16748285994123" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:1021,&quot;resizeWidth&quot;:375,&quot;bytes&quot;:310788,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://brysonkjones.substack.com/i/178113708?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdde0c847-f63d-4c9f-bb31-fa40554b1906_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5E-0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png 424w, https://substackcdn.com/image/fetch/$s_!5E-0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png 848w, https://substackcdn.com/image/fetch/$s_!5E-0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png 1272w, https://substackcdn.com/image/fetch/$s_!5E-0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f8d7ab-8bbe-434c-86db-38a38f5bd2bf_1021x270.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>I&#8217;ve aimed to make this implementation relatively modular without introducing a ton of complexity, making it easier to extend or swap components for future experiments and 
ablations.</p><div><hr></div><h3>Submodules:</h3><h4><strong>Vision Encoder</strong></h4><p>The vision encoder is a CLIP ViT (specifically ViT-B/16) that only uses the CLS token encoding to represent information for each image <a href="https://arxiv.org/pdf/2103.00020">[4]</a>. The majority of research on Diffusion Policy models still seems to use ResNet vision encoders, but when ablating between CLIP and ResNet there was a significant performance boost from CLIP, which isn&#8217;t surprising.</p><p>The idea of only using the CLS token and none of the patch embeddings is to keep the representation compact while retaining rich semantic grounding from the pretrained backbone. </p><p>The encoder is <strong>co-trained end-to-end</strong> and, following the setup described in prior work, uses a <strong>learning rate 1/10th</strong> that of the rest of the model parameters to maintain the stability of pretrained features. This balance allows the vision module to adapt slightly to your environment/task without losing the rich generalization capabilities learned from large-scale pretraining.</p><p><strong>Additional Note on Vision Encoders:</strong></p><p>In the released implementation, I&#8217;ve also added a separate vision encoder module, DINOv3 <a href="https://arxiv.org/pdf/2508.10104">[6]</a>, that can be used to extract visual features. The usage is identical to CLIP in that only the CLS token is used. I did not see better performance with DINOv3 (and actually saw degraded performance), but I did not experiment very heavily. 
</p><p>There are many reasons this could be: the objectives of the models are inherently very different, and there&#8217;s no semantic grounding in the DINOv3 features, which may impede performance.</p><p>The rich spatial features in DINOv3 make me think there could be an interesting performance boost in there somewhere (whether for single-task performance or generalizability), so I&#8217;ve left it in so that it can be experimented with. In addition, I&#8217;ve made it straightforward to add and swap in other vision encoders in the future (such as SigLIP).</p><h4><strong>Text Encoder</strong></h4><p>For language conditioning, the policy uses the CLIP text tokenizer and embedding layers, but these are kept frozen throughout training. The only learnable component is a single projection layer that maps the CLIP text embedding into the features that are concatenated with the rest of the conditioning vector. </p><p>This keeps compute requirements minimal and stabilizes training, but it also leaves the <strong>language steering</strong> relatively weak (I talk about this more below), as the conditioning is a shallow text embedding rather than a deeply aligned joint representation. Still, this approach strikes a good balance for multitask learning without requiring full-scale language models and the downstream compute implications.</p><h4>Proprioceptive Observations:</h4><p>The proprioceptive observations (joint angles, or otherwise) are notably not processed or embedded prior to being concatenated into the conditioning vector. 
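</p><p>The raw concatenation described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the feature sizes (512 per camera image, 512 for the projected text feature, 256 for the time-step embedding), the two-camera setup, the 14 joint values, the sinusoidal time-step embedding, and the helper names are all hypothetical and are not taken from the released implementation.</p>
<pre>
```python
import math

# Hypothetical feature sizes, chosen only for illustration.
IMAGE_FEATURE_DIM = 512   # one CLS-token feature per camera image
TEXT_FEATURE_DIM = 512    # output of the learned text projection layer
TIMESTEP_EMBED_DIM = 256  # embedding of the diffusion time step
NUM_JOINTS = 14           # raw proprioceptive values (e.g. joint angles)


def timestep_embedding(step, dim=TIMESTEP_EMBED_DIM):
    """Sinusoidal embedding of a diffusion time step (an assumed choice)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(step * f) for f in freqs] + [math.cos(step * f) for f in freqs]


def build_conditioning(image_features, text_feature, joint_angles, step):
    """Concatenate all conditioning inputs into one flat vector.

    The proprioceptive values are appended as-is, with no encoder or
    projection applied to them first.
    """
    cond = []
    for feature in image_features:   # one feature vector per camera
        cond.extend(feature)
    cond.extend(text_feature)
    cond.extend(joint_angles)        # raw, unembedded proprioception
    cond.extend(timestep_embedding(step))
    return cond


# Placeholder inputs standing in for real encoder outputs.
image_features = [[0.0] * IMAGE_FEATURE_DIM for _ in range(2)]  # two cameras
text_feature = [0.0] * TEXT_FEATURE_DIM
joint_angles = [0.1] * NUM_JOINTS

cond = build_conditioning(image_features, text_feature, joint_angles, step=10)
proprio_share = NUM_JOINTS / len(cond)
print(len(cond), round(proprio_share, 4))
```
</pre>
<p>With these made-up sizes, the 14 joint values account for well under 1% of the entries in the conditioning vector.</p>
<p>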
In practice this seems to work well, but it is interesting that the number of proprioceptive features in the conditioning vector is an order of magnitude or more smaller than the number of embedded exteroceptive and language features.</p><p>My intuition is that it could be interesting to explore richer feature encoding strategies, as has been seen in many other papers, but the likely result is only marginal improvement.</p><h4><strong>Diffusion Transformer Prediction Model</strong></h4><p>The core prediction model is a <strong>Diffusion Transformer</strong>, adapted from the original image-generation formulation from Meta to 1D sequence modeling for joint trajectory generation <a href="https://arxiv.org/pdf/2212.09748">[3]</a>. </p><p>The network models action sequences as temporally correlated trajectories, where each denoising step uses the model to predict the noise and iteratively refines the predicted actions. This noise prediction is conditioned on an observation vector that concatenates the encoded images, encoded task description, joint angles, and embedded diffusion time step.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GOla!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GOla!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png 424w, https://substackcdn.com/image/fetch/$s_!GOla!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png 848w, 
https://substackcdn.com/image/fetch/$s_!GOla!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png 1272w, https://substackcdn.com/image/fetch/$s_!GOla!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GOla!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png" width="321" height="493.9917452830189" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1305,&quot;width&quot;:848,&quot;resizeWidth&quot;:321,&quot;bytes&quot;:206675,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://brysonkjones.substack.com/i/178113708?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GOla!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png 424w, 
https://substackcdn.com/image/fetch/$s_!GOla!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png 848w, https://substackcdn.com/image/fetch/$s_!GOla!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png 1272w, https://substackcdn.com/image/fetch/$s_!GOla!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0176ef7c-c63e-4ec0-8651-f1c9b448ed81_848x1305.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DiT w/ AdaLN Conditioning Block Diagram <a href="https://arxiv.org/pdf/2507.05331">[1]</a></figcaption></figure></div><p>The actions have shape (N, H), where N is the dimension of your action space and H is the horizon length. They are initially sampled from a normal distribution and iteratively denoised into the final action trajectory.</p><p>In the implementation, I&#8217;ve added both standard positional encodings and rotary positional embeddings (RoPE) within the transformer blocks. In practice, RoPE consistently shows clear improvements in performance and robustness, especially in diffusion transformers, by better capturing relative positional structure across long horizons. Because of this, RoPE is used as the default positional encoding throughout the model.</p><h3>Bringing All the Conditioning Together</h3><p>Before being passed into the DiT model, the visual, task-instruction, proprioceptive, and diffusion time-step features are all concatenated into a single feature vector that the DiT conditions on during the denoising process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-1N2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-1N2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png 424w, 
https://substackcdn.com/image/fetch/$s_!-1N2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png 848w, https://substackcdn.com/image/fetch/$s_!-1N2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png 1272w, https://substackcdn.com/image/fetch/$s_!-1N2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-1N2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png" width="477" height="400.3974943052392" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:737,&quot;width&quot;:878,&quot;resizeWidth&quot;:477,&quot;bytes&quot;:107600,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://brysonkjones.substack.com/i/178113708?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!-1N2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png 424w, https://substackcdn.com/image/fetch/$s_!-1N2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png 848w, https://substackcdn.com/image/fetch/$s_!-1N2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png 1272w, https://substackcdn.com/image/fetch/$s_!-1N2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a58e64-0491-4662-923f-f81bc36a4f2c_878x737.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This conditioning vector is the single point through which all of the observation information is used to generate the output trajectory.</p><div><hr></div><h3><strong>Model Objectives</strong></h3><p>To make the policy flexible, I&#8217;ve implemented support for multiple generative objectives:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6XfL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6XfL!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif 424w, https://substackcdn.com/image/fetch/$s_!6XfL!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif 848w, https://substackcdn.com/image/fetch/$s_!6XfL!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif 1272w, https://substackcdn.com/image/fetch/$s_!6XfL!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6XfL!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif" width="1456" height="485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Flow Matching vs Diffusion. Briefly going into mathematical&#8230; | by Harsh  Maheshwari | Medium&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Flow Matching vs Diffusion. Briefly going into mathematical&#8230; | by Harsh  Maheshwari | Medium" title="Flow Matching vs Diffusion. 
Briefly going into mathematical&#8230; | by Harsh  Maheshwari | Medium" srcset="https://substackcdn.com/image/fetch/$s_!6XfL!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif 424w, https://substackcdn.com/image/fetch/$s_!6XfL!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif 848w, https://substackcdn.com/image/fetch/$s_!6XfL!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif 1272w, https://substackcdn.com/image/fetch/$s_!6XfL!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F569ef3a9-49ad-43f4-8e69-5b28aa594aec_1800x600.gif 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Visualization of Diffusion and Flow Matching Evolution over Time <a href="https://harshm121.medium.com/flow-matching-vs-diffusion-79578a16c510">[8]</a></figcaption></figure></div><ul><li><p><strong>Classic Diffusion Objective:</strong></p><p>This follows the standard denoising diffusion formulation, where Gaussian noise is iteratively added to the action sequence, and the model learns to predict and remove it step-by-step. The objective minimizes the expected L2 loss between the predicted and true noise, which effectively learns a score function over action trajectories.<br><br>The implementation supports both DDPM and DDIM sampling strategies, as is common for diffusion models.<br></p></li><li><p><strong>Flow Matching Objective:</strong></p><p>This replaces the discrete diffusion process with flow matching, which models the trajectory evolution as an ordinary differential equation. The model learns the instantaneous velocity field that transports noisy samples toward the data distribution, providing smoother training dynamics and often faster convergence.<br><br>This objective can be configured as an alternative loss mode in the policy configuration. Additionally, I was interested in the beta distribution time step sampling that Physical Intelligence described for their <em>&#960;0</em> model, and implemented that as a time step sampling option as well <a href="https://www.physicalintelligence.company/download/pi0.pdf">[5]</a>. <br><br>Note that their use of a flow-matching objective was for an action expert that&#8217;s integrated into a different architecture than the one described in this post. 
But, in practice, I found that this beta distribution did result in faster training and higher-quality samples at inference time.</p></li></ul><h3>Language Steerability Limitations</h3><p>One of the trade-offs with this architecture is <em>limited language steerability</em>. Because the model only uses a small set of embeddings from the CLIP language encoder, the conditioning signal is relatively weak. </p><p>In practice, this means that if you try to compose instructions, for example by asking the policy to perform a sequence of skills it has seen individually, the instruction is often followed inconsistently or ignored. The model tends to &#8220;latch&#8221; onto one dominant skill from training rather than synthesizing or sequencing behaviors based on nuanced phrasing.</p><p>Whether this matters depends on what you&#8217;re trying to do. For focused, narrow-task applications, it&#8217;s not a deal breaker. In fact, I&#8217;d argue we&#8217;re at the point where these lightweight LBMs can deliver real value in production environments with well-scoped manipulation tasks, and the constrained language conditioning actually simplifies evaluation. If your goal is open-world generalization or robust instruction following, though, this approach may hit its ceiling early. Still, it&#8217;s a strong baseline for benchmarking and a pragmatic step toward scalable multitask manipulation.</p><h3>Common Failure Modes and How to Debug Them</h3><p>These models are still early, and they can exhibit finicky, hard-to-interpret failures. Below is a distilled guide to the <em>most common, reproducible failure modes</em> I&#8217;ve encountered while training this model, along with practical debugging strategies to try.</p><h4>1. <strong>Idling / Output Collapse (No Motion)</strong></h4><p>At inference time, the policy produces near-zero or static actions. 
The collapse often happens immediately at the trajectory start, but can also appear mid-task.</p><p><strong>Potential Causes:</strong><br>When the observations do not sufficiently disambiguate between multiple possible behaviors, the generation process seems to gravitate toward an &#8220;average&#8221; trajectory where the output can stall out. Two common causes I&#8217;ve seen for this:</p><ol><li><p><strong>Insufficient training data for a given task</strong></p><ul><li><p>With only ~20&#8211;50 demonstrations, the policy often cannot resolve action modes and collapses.</p></li><li><p>Once you&#8217;ve exceeded ~300 examples for a <em>single</em> task, this collapse seems less likely to occur; if it persists, the task likely has hidden complexity or unobservable state information that you may be relying on when gathering demonstrations.</p></li></ul></li><li><p><strong>Multiple similar tasks in the dataset</strong></p><ul><li><p>Example: moving <em>two different objects</em> that share similar geometry or visual appearance.</p></li><li><p>The CLIP text conditioning is shallow in this architecture, and it may not be strong enough to sharply separate tasks or environments with semantically similar observations and instructions.</p></li></ul></li></ol><p><strong>How to Debug:</strong></p><ul><li><p><strong>Increase dataset size</strong>: double the example count until you exceed ~300 demos per task.</p></li><li><p><strong>Train longer</strong>: don&#8217;t stop when loss plateaus; the model often needs ~50k&#8211;100k steps to converge.</p></li><li><p><strong>Strengthen or diversify task instructions:</strong> language steerability is limited; more explicit phrasing helps, but don&#8217;t make the description too long.</p></li></ul><div><hr></div><h4>2. 
<strong>Performing the Wrong Task</strong></h4><p>As described above, at inference time, the robot may confidently perform a behavior from your dataset&#8230; but not the one you instructed.</p><p><strong>Potential Causes:</strong></p><ul><li><p><strong>Ambiguous or overly similar text prompts: </strong>with the shallow CLIP text encoder frozen during training, small description differences often fail to meaningfully steer the model.</p></li><li><p><strong>Insufficient per-task data: </strong>a task with only a handful of examples is easily overshadowed by more popular tasks in the dataset.</p></li><li><p><strong>High overlap between tasks: </strong>when two tasks share visual structure, dynamics, or geometry, the model might &#8220;default&#8221; to whichever appears more frequently.</p></li></ul><p><strong>How to Debug:</strong></p><ul><li><p><strong>Diversify task descriptions (same as above)</strong></p></li><li><p><strong>Rebalance task distribution</strong>: up-weight the underperforming task when sampling during training.</p></li><li><p><strong>Perform task-specific fine-tuning: </strong>a few thousand steps on only the failing task can dramatically improve instruction adherence.</p></li></ul><h3>&#8220;How Can I Use This?&#8221;</h3><p>I&#8217;ve released this in two places, and provide initial suggestions for dataset size and training parameters:</p><ol><li><p>In a standalone repo that is stripped down to the bare bones of the implementation, for those who want to dig into the internals and build on them. 
This can be found <a href="https://github.com/brysonjones/multitask_dit_policy">here</a>.</p></li><li><p>In HuggingFace&#8217;s LeRobot project, where integrating it makes it most accessible for others to actually get it running on hardware<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. This can be found <a href="https://github.com/huggingface/lerobot/tree/main/src/lerobot/policies/multi_task_dit">here</a>.</p></li></ol><p><strong>Note:</strong> I have not trained and released any pre-trained base-model weights. If there&#8217;s interest in this, please reach out to me!</p><p>I&#8217;ve defaulted to using the <a href="https://github.com/huggingface/lerobot/blob/main/src/lerobot/datasets/lerobot_dataset.py">LeRobotDataset</a> format in both places, and if you have an existing dataset in that format, training this policy should work out of the box.</p><p>I won&#8217;t go into detail about LeRobot specifics, and defer to their own documentation within the repo, but within the policy README, I note high-level training and inference suggestions.</p><p>So far, I&#8217;ve only trained and tested policies for simple single- and bi-manual manipulation setups, but I expect it would work well for whole-body manipulation on humanoids (as Boston Dynamics has shown).</p><p>Please share any demos of trained policies you create; I&#8217;d love to see them!</p><h3>What&#8217;s Next?</h3><p>The majority of my current work is focused on RL fine-tuning, representation learning, and training encoders for new sensing modalities, all areas that are rich for robot learning research right now!</p><p>If this is an area you&#8217;re interested in or working on, I&#8217;d love to chat with you, so please reach out!</p><h3>Acknowledgements:</h3><p>Special thanks to the HuggingFace LeRobot team: Khalil Meftah, Steven Palma, Thomas Wolf, Michel Aractingi, as well as Aditya Bhat and others at 
South Park Commons for their help reviewing this work.</p><h3>Cite This Work</h3><pre><code>@misc{jones2025multitaskditpolicy,
  author = {Bryson Jones},
  title = {Dissecting and Open-Sourcing Multitask Diffusion Transformer Policy},
  year = {2026},
  url = {https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy},
  note = {Blog post}
}</code></pre><div><hr></div><p>If you liked this post, please subscribe and share with others! I&#8217;m also always happy to chat about work like this, so please reach out if you&#8217;re interested.</p><h3>References</h3><ol><li><p><a href="https://arxiv.org/pdf/2507.05331">A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation</a> - TRI LBM Team</p></li><li><p><a href="https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/">Large Behavior Models and Atlas Find New Footing</a> - Eric Cousineau, Scott Kuindersma, Lucas Manuelli, Pat Marion</p></li><li><p><a href="https://arxiv.org/pdf/2212.09748">Scalable Diffusion Models with Transformers</a> - William 
Peebles, Saining Xie</p></li><li><p><a href="https://arxiv.org/pdf/2103.00020">Learning Transferable Visual Models From Natural Language Supervision</a> - Radford et al.</p></li><li><p><a href="https://www.physicalintelligence.company/download/pi0.pdf">&#960;0: A Vision-Language-Action Flow Model for General Robot Control</a> - Black et al.</p></li><li><p><a href="https://arxiv.org/pdf/2508.10104">DINOv3</a> - Sim&#233;oni et al.</p></li><li><p><a href="https://github.com/huggingface/lerobot">LeRobot</a></p></li><li><p><a href="https://harshm121.medium.com/flow-matching-vs-diffusion-79578a16c510">Flow Matching vs Diffusion</a> - Blog Post - Harsh Maheshwari</p></li><li><p><a href="https://www.fastumi.com/">FastUMI</a> - Ding et al.</p></li></ol><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The BD team has a high-quality low-level whole-body MPC that tracks commands at high frequency and ensures smooth motion execution; people will split hairs on what &#8220;end-to-end&#8221; means, but I digress.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>These terms are all made up, so someone is sure to argue with me, but in general, I classify VLAs as using a pretrained VLM backbone and adapting the vocabulary to incorporate robot actions. Because of this, I generally prefer the term LBM, since it is all-encompassing. 
</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>There&#8217;s a lot of &#8220;vibe-benchmarking&#8221; going on, but there&#8217;s a lot of good work going on to change this!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The only thing that matters in robotics :)</p></div></div>]]></content:encoded></item><item><title><![CDATA[It’s Not About Modality, It’s About Scale]]></title><description><![CDATA[Modern robot learning, like all deep learning, is a numbers game. You should play it that way.]]></description><link>https://brysonkjones.substack.com/p/its-not-about-modality-its-about</link><guid isPermaLink="false">https://brysonkjones.substack.com/p/its-not-about-modality-its-about</guid><dc:creator><![CDATA[Bryson Jones]]></dc:creator><pubDate>Wed, 25 Feb 2026 20:20:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3687e143-78bb-4e1c-9d3c-e6470acb6e90_3219x1966.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There&#8217;s a ton of discourse in the ecosystem right now about the right way to approach robot learning, and I think a lot of the discussion is talking past each other.</p><p>&#8220;Is vision really enough?&#8221; </p><p>&#8220;Tactile sensing is necessary to solve manipulation!&#8221;</p><p>&#8220;What about depth and LiDAR?&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9lF5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9lF5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png 424w, https://substackcdn.com/image/fetch/$s_!9lF5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png 848w, https://substackcdn.com/image/fetch/$s_!9lF5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png 1272w, https://substackcdn.com/image/fetch/$s_!9lF5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9lF5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png" width="453" height="332.0699677072121" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dea74126-24a2-42e4-a7da-996a67a72b51_929x681.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:681,&quot;width&quot;:929,&quot;resizeWidth&quot;:453,&quot;bytes&quot;:446544,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://brysonkjones.substack.com/i/189096100?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9lF5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png 424w, https://substackcdn.com/image/fetch/$s_!9lF5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png 848w, https://substackcdn.com/image/fetch/$s_!9lF5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png 1272w, https://substackcdn.com/image/fetch/$s_!9lF5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea74126-24a2-42e4-a7da-996a67a72b51_929x681.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Wrist camera view of shape pick and place task</figcaption></figure></div><p>If you&#8217;ve been working on robot manipulation over the past few years, you&#8217;ve no doubt either said this, or heard someone say it to you:</p><p>&#8220;It <em>makes sense</em> that using richer sensing modalities should produce a better foundation model, so why do the best methods not use them? Why not just build an advanced gripper with rich tactile data? Why are we ignoring all these amazing sensing modalities? 
Are people just stupid?&#8221;</p><p>Yes, there has been a ton of great research in this area over the past decade, but the leading manipulation foundation models don&#8217;t incorporate any modalities besides vision and joint angles<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Why?</p><p>Alas, you&#8217;re not playing the game you think you are.</p><p>It&#8217;s not about which modality has more information or a richer representation. Those are important, but they aren&#8217;t the high-order bit<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p><em><strong>It&#8217;s about scale.</strong></em></p><p>It&#8217;s all about data scale and diversity. That sounds obvious and unhelpful, I know, but it really is worth stating plainly. &#8220;Everyone knows data is the important part of training ML models!&#8221;</p><p>But do you really internalize the implications of that? </p><p>RGB cameras are dominating for a few reasons here:</p><ul><li><p>Cameras are dirt cheap</p></li><li><p>There&#8217;s effectively zero integration lift: camera drivers and software to capture and process the data already exist, and beyond that, are highly optimized</p></li><li><p>Large amounts of internet-scale data already exist for this modality &#8594; this enables large, general, pre-trained encoders (CLIP, SigLIP, etc.) to be built, which produce rich representations that robotics models can leverage &#8220;for free&#8221;</p></li></ul><p>Even though cameras alone aren&#8217;t able to provide any rich tactile feedback for grasping and dexterous manipulation tasks, you get to ride the coattails of all of these advantages. 
Vision is just inherently very easy to use, deploy, and scale, so it wins by default.</p><p>This is also logically consistent with why the &#8220;new hotness&#8221; in robotics foundation models is video-model backbones.</p><p>I seriously think the last point, on the availability of pre-trained modality-specific encoders, is vastly underappreciated when considering sensing strategies. It&#8217;s an easy trap to over-extrapolate from previous paradigms of automation when trying to inform your strategy, and thinking by analogy can get you into serious trouble. </p><p>Previous wisdom was to start with the most expensive sensor stack you can buy to make your classical, deterministic, optimal control stack work, and once it&#8217;s reliable, apply pressure to bring the sensing package down the cost curve. But doing that here just leaves you with a sensing stack that&#8217;s too expensive and specialized to gather enough data to train a useful model, so your &#8220;high-precision-ultra-super-mega-sensor&#8221; is likely a hindrance, not a stepping stone.</p><h4>So are new sensing modalities useless?</h4><p>No! I think working on new approaches to bring in new sensors is a great place to focus right now. 
However, you should attack it from the right &#8220;angle&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gNJ1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gNJ1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png 424w, https://substackcdn.com/image/fetch/$s_!gNJ1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png 848w, https://substackcdn.com/image/fetch/$s_!gNJ1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png 1272w, https://substackcdn.com/image/fetch/$s_!gNJ1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gNJ1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png" width="1210" height="423" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:423,&quot;width&quot;:1210,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Introducing the GelSight Mini: A Human Resolution Tactile Sensor for  Engineers and Scientists&#65532; - GelSight&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Introducing the GelSight Mini: A Human Resolution Tactile Sensor for  Engineers and Scientists&#65532; - GelSight" title="Introducing the GelSight Mini: A Human Resolution Tactile Sensor for  Engineers and Scientists&#65532; - GelSight" srcset="https://substackcdn.com/image/fetch/$s_!gNJ1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png 424w, https://substackcdn.com/image/fetch/$s_!gNJ1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png 848w, https://substackcdn.com/image/fetch/$s_!gNJ1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png 1272w, https://substackcdn.com/image/fetch/$s_!gNJ1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d9cdfee-a60b-4ab2-a23f-7bf46f6dbc1d_1210x423.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft 
pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">GelSight Vision-based Tactile Sensing</figcaption></figure></div><p>What modalities and sensing technologies are inherently compatible with the needs to drastically scale?</p><ul><li><p>Are they cheap enough to deploy and replace? As in, tens of dollars</p></li><li><p>Are they easy to integrate into HW and SW platforms? Connectors, drivers, communication protocols, SDKs, etc</p></li><li><p>What are different ways to build standalone pre-trained encoders for this modality that can be used downstream in various models?</p></li><li><p><em>Bonus:</em> is there the possibility of some way to extract data of this modality without collecting it all yourself? 
Yes, the siren song of &#8220;we&#8217;ll just push until the Fairytale Never-Ending Data Flywheel Deployment happens and all will work out&#8221; can be strong, but I encourage you to be intellectually honest about this.</p></li></ul><p>A $10k tactile sensing gripper is unlikely to be the silver bullet here<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><p></p><p>I&#8217;ve got some work in this exact area going on right now, so if this is something that interests you, please reach out!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>It&#8217;s easy to pick a random task, gather a small dataset with a new sensor added to the input space, and beat a baseline model. That doesn&#8217;t get you close to the frontier of generalization though, which is the important aspect of making these models <em>useful</em>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>At least not yet</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Unless your insight is on how you will innovate on mass manufacturing to bring cost down by three orders of magnitude, which the US desperately needs more people focused on :)</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Methods for Conditioning Diffusion Models]]></title><description><![CDATA[A simple overview of different conditioning strategies and their origins]]></description><link>https://brysonkjones.substack.com/p/methods-for-conditioning-diffusion</link><guid isPermaLink="false">https://brysonkjones.substack.com/p/methods-for-conditioning-diffusion</guid><dc:creator><![CDATA[Bryson Jones]]></dc:creator><pubDate>Thu, 22 Jan 2026 19:38:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0WpW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been spending time on research around different multi-modal generation techniques, and was recently discussing with someone all the different ways you can condition diffusion or flow-matching models. </p><p>It made me want to make a quick blog post about it, so here you go!</p><p><em>Here is a look at the four conditioning paradigms that define the current landscape of diffusion modeling. There&#8217;s obviously more work going on than just these four, but these are the ones I&#8217;ve seen used prominently and work effectively. 
If you have other methods you think are critical to know, mention them below!</em></p><h3><strong>1. Early Days: Simple Cross-Attention</strong></h3><p>The first widely successful approach to conditioning diffusion models used cross-attention to merge latent information from different signals, back before transformers were the common backbone of diffusion models. </p><p>In early latent diffusion systems, the problem was framed very simply: how do you inject a text description into an image denoising process to guide the generation process?</p><p>The initial approach that saw success was popularized by <em>High-Resolution Image Synthesis with Latent Diffusion Models</em> (Rombach et al., 2022), the work behind Stable Diffusion 1.5 and its successors. The idea was to keep the two streams largely separate and connect them through cross-attention. </p><p>Text is encoded once by a frozen model such as CLIP, and the resulting embeddings are projected into a set of <em>keys</em> <strong>(K)</strong> and <em>values <strong>(V)</strong></em>. 
The image latent, evolving through a U-Net backbone during denoising, produces the <em>queries <strong>(Q)</strong></em><strong>.</strong> At each cross-attention layer, the model combines them through a classic attention block:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0WpW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0WpW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png 424w, https://substackcdn.com/image/fetch/$s_!0WpW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png 848w, https://substackcdn.com/image/fetch/$s_!0WpW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png 1272w, https://substackcdn.com/image/fetch/$s_!0WpW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0WpW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png" width="578" height="289.79395604395603" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:1456,&quot;resizeWidth&quot;:578,&quot;bytes&quot;:349988,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://brysonkjones.substack.com/i/185434785?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0WpW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png 424w, https://substackcdn.com/image/fetch/$s_!0WpW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png 848w, https://substackcdn.com/image/fetch/$s_!0WpW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png 1272w, https://substackcdn.com/image/fetch/$s_!0WpW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb001a242-3e14-4e1d-b60f-da222dc4e557_2571x1289.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The cross-attention effectively &#8220;injects&#8221; conditioning by projecting text embeddings into the image denoiser, where image features act as queries over text-derived keys and values. This mechanism biases the denoising updates toward prompt-consistent semantics while preserving a separation between text representation and spatial image processing.</p><p>This design choice was extremely pragmatic and simple. For straightforward text-to-image synthesis, this ended up working remarkably well and defined the baseline for an entire generation of models.</p><p>However, the same separation that gives cross-attention its stability also limits its expressiveness. 
Because text and image representations never fully merge, the model struggles with fine-grained spatial reasoning, compositional logic, and especially typography (remember seeing letters scrambled like they were written at random?). The conditioning signal is directional and coarse: the text can guide <em>what</em> appears, but not reliably <em>where</em> or <em>how precisely</em>.</p><h3><strong>2. AdaLN (Adaptive Layernorm): Global Conditioning for Diffusion Transformers</strong></h3><p>When the paper <em>Scalable Diffusion Models with Transformers</em> (Peebles &amp; Xie, 2022) came out, the field began shifting from U-Nets to Diffusion Transformers (DiTs) as the backbones of diffusion models. This same paper applied AdaLN conditioning to diffusion transformers, and it has become a preferred method for injecting global context.</p><p>The idea is that, rather than adding extra tokens or specialized attention layers, AdaLN modulates the layer normalization blocks. The model learns to predict the <code>scale</code><strong> (&#947;)</strong> and <code>shift</code><strong> (&#946;)</strong> parameters of the Layer Normalization based on a condition vector <code>c</code>.</p><ul><li><p>The operation is defined as: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;$AdaLN(x, c) = \\gamma(c) \\cdot Norm(x) + \\beta(c)$&quot;,&quot;id&quot;:&quot;DGSOGGWNRJ&quot;}" data-component-name="LatexBlockToDOM"></div></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F7wY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!F7wY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp 424w, https://substackcdn.com/image/fetch/$s_!F7wY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp 848w, https://substackcdn.com/image/fetch/$s_!F7wY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp 1272w, https://substackcdn.com/image/fetch/$s_!F7wY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F7wY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp" width="830" height="343" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:343,&quot;width&quot;:830,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22898,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://brysonkjones.substack.com/i/185434785?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F7wY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp 424w, https://substackcdn.com/image/fetch/$s_!F7wY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp 848w, https://substackcdn.com/image/fetch/$s_!F7wY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp 1272w, https://substackcdn.com/image/fetch/$s_!F7wY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78a4290f-f62d-4799-8331-3fbac8f18e18_830x343.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>This ends up being a really efficient way of applying global conditioning to the diffusion transformer backbone, like adding timestep or style conditioning.</p><p>AdaLN is nice because it&#8217;s computationally cheap: the modulation itself is <em><strong>O(D)</strong></em> per token, versus the quadratic cost of attention in sequence length. It influences the entire generation uniformly, making it excellent for setting global style or coherence.</p><p>One of the challenges, though, is that it lacks precision, and its edits can be coarse or blunt. It can&#8217;t easily tell the model to place a specific object at specific coordinates, or adjust the orientation of that object, etc.</p><p>Still, it is one of the most common and powerful ways to apply conditioning to one of these DiT models, and its implementation simplicity is quite elegant, making it easy to understand.</p><h4>Side Note:</h4><p>This is one of the most effective ways we&#8217;ve seen to condition diffusion and flow matching action-chunking policies and action heads for robotics. 
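</p><p>For reference, the AdaLN operation defined earlier can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the code from any particular DiT implementation: the helper names (<code>layer_norm</code>, <code>linear</code>, <code>adaln</code>) and the single linear layers standing in for the learned &#947;(c) and &#946;(c) predictors are hypothetical and chosen only for illustration.</p>

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one token's feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, c):
    """Plain affine map: one output per row of `weights`."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weights, bias)]


def adaln(x, c, w_gamma, b_gamma, w_beta, b_beta):
    """AdaLN(x, c) = gamma(c) * Norm(x) + beta(c), applied to a single token."""
    gamma = linear(w_gamma, b_gamma, c)  # scale predicted from the condition
    beta = linear(w_beta, b_beta, c)     # shift predicted from the condition
    normed = layer_norm(x)
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]


# Toy usage: D = 4 features, condition vector of size 2 (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0]
c = [0.5, -1.0]
w_gamma = [[0.0, 0.0]] * 4  # zero weights plus a bias of 1 give gamma = 1
b_gamma = [1.0] * 4
w_beta = [[1.0, 0.0]] * 4   # beta equals c[0] for every feature
b_beta = [0.0] * 4
out = adaln(x, c, w_gamma, b_gamma, w_beta, b_beta)
```

<p>The same <code>c</code> modulates every token, which is why the conditioning acts globally. 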
You take all of your observations, camera view encodings, joint angles, task description, etc., and concatenate them into an AdaLN conditioning vector.</p><p>You can see this in action in a repo I&#8217;ve open-sourced for Multitask Diffusion Policy, here: <a href="https://github.com/brysonjones/multitask_dit_policy/tree/main">https://github.com/brysonjones/multitask_dit_policy/tree/main</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ap-4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ap-4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp 424w, https://substackcdn.com/image/fetch/$s_!ap-4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp 848w, https://substackcdn.com/image/fetch/$s_!ap-4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp 1272w, https://substackcdn.com/image/fetch/$s_!ap-4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ap-4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp" width="1456" 
height="637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:637,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42670,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://brysonkjones.substack.com/i/185434785?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ap-4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp 424w, https://substackcdn.com/image/fetch/$s_!ap-4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp 848w, https://substackcdn.com/image/fetch/$s_!ap-4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp 1272w, https://substackcdn.com/image/fetch/$s_!ap-4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe55205ea-781e-4e26-a6a9-7e571c399622_1787x782.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h3><strong>3. In-Context Learning (ICL): Few-Shot Adaptation</strong></h3><p>In-context learning applies conditioning through the inputs of the diffusion model rather than through an architectural modification that injects conditioning during processing. </p><p>One of the first papers we saw this with was <em>In-Context Learning Unlocked for Diffusion Models</em> (Wang et al., 2023), where the idea almost exactly mirrors few-shot prompting in LLMs. The model is given one or more example pairs, like an input image and its desired output (segmentation, edge detection, style transfer, etc.), followed by a new query input.</p><p>We end up effectively treating these image tokens as "visual prompts" alongside or instead of text. 
Just as you might show a language model a few examples of English-to-French translation to teach it the pattern, in diffusion ICL, you feed the model a "context pair" consisting of a source image and a transformed version.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5OLU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5OLU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png 424w, https://substackcdn.com/image/fetch/$s_!5OLU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png 848w, https://substackcdn.com/image/fetch/$s_!5OLU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!5OLU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5OLU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png" width="546" height="420" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1120,&quot;width&quot;:1456,&quot;resizeWidth&quot;:546,&quot;bytes&quot;:686633,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://brysonkjones.substack.com/i/185434785?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5OLU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png 424w, https://substackcdn.com/image/fetch/$s_!5OLU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png 848w, https://substackcdn.com/image/fetch/$s_!5OLU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!5OLU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcac5439e-dafb-4d27-adfd-74cfffe0492c_1701x1308.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">This is the &#8220;PromptDiffusion&#8221; architecture from <em>In-Context Learning Unlocked for Diffusion Models</em>, but there are many different ways to accomplish this</figcaption></figure></div><p>Practically, this is implemented either by concatenating images into a single tensor (for example: stacking an edge map and its target photo next to a new edge map), or by using attention masking so that query tokens can attend to example tokens. The diffusion process implicitly learns the transformation by pattern matching within its context window.</p><p>This approach ends up working quite well, and multitask demonstrations end up teaching the model how to combine and generalize these editing concepts. 
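</p><p>To make the attention-masking variant a bit more concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the method from any specific paper: the function name <code>build_icl_mask</code>, the token layout (example tokens first, query tokens last), and the choice to stop example tokens from looking at the query are all hypothetical.</p>

```python
def build_icl_mask(num_example_tokens, num_query_tokens):
    """Return mask[i][j] = True when token i is allowed to attend to token j.

    Assumed layout for this sketch: example-pair tokens come first,
    followed by the tokens of the new query.
    """
    total = num_example_tokens + num_query_tokens
    mask = []
    for i in range(total):
        row_is_query = i >= num_example_tokens
        row = []
        for j in range(total):
            col_is_query = j >= num_example_tokens
            # Query rows see everything; example rows only see example tokens.
            row.append(row_is_query or not col_is_query)
        mask.append(row)
    return mask


# Toy usage: 4 tokens from the example pair, 2 tokens from the new query.
mask = build_icl_mask(num_example_tokens=4, num_query_tokens=2)
```

<p>With a mask like this, the query can read the demonstration while the demonstration stays unaffected by the query. 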
It allows a single pretrained model to perform edge-to-image, colorization, style transfer, and other conditional tasks without specialized adapters.</p><p>But this flexibility and expressivity come at a cost. Because the model is not explicitly optimized for the task, output fidelity usually lags behind fine-tuned or adapter-based methods (although, as with most models, this starts to change as you scale). Additionally, the approach is fundamentally bounded by context length (attention compute grows quadratically with it) and GPU memory, making it difficult to scale to complex or high-resolution demonstrations.</p><h3><strong>4. Joint Attention for Multiple Modalities (MM-DiT)</strong></h3><p>Joint Attention with MM-DiTs represents one of the most recent big leaps in diffusion conditioning. This approach creates two separate streams of data that run in parallel through the model layers:</p><ul><li><p><strong>Image Stream:</strong> Processes the visual latent patches (the noisy image being denoised).</p></li><li><p><strong>Text Stream:</strong> Processes the text tokens (from the prompt).</p></li></ul><p>Importantly, <strong>each stream has its own set of weights</strong>. This means the model learns separate parameters to process visual data and textual data, acknowledging that pixels and words behave very differently and have different patterns to learn.</p><p>Then, within the MM-DiT blocks, image and text tokens are concatenated into a single sequence, processed with attention, and split back into their distinct streams.</p><p>This iterative merging and separation helps the model retain exact character sequences and resolve ambiguity, leading to superior typography (far fewer misspellings, mangled letters, etc.) and better adherence to complex prompts.</p><p>This idea was introduced in <em>Scaling Rectified Flow Transformers for High-Resolution Image Synthesis</em> (Esser et al., 2024) for Stable Diffusion 3. 
The architecture evolves the simple cross-attention concept into a model that more fully merges the multi-modal signals, as you can see in the diagram below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tw7P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tw7P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png 424w, https://substackcdn.com/image/fetch/$s_!tw7P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png 848w, https://substackcdn.com/image/fetch/$s_!tw7P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png 1272w, https://substackcdn.com/image/fetch/$s_!tw7P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tw7P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png" width="1456" height="1027" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1027,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:420131,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://brysonkjones.substack.com/i/185434785?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tw7P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png 424w, https://substackcdn.com/image/fetch/$s_!tw7P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png 848w, https://substackcdn.com/image/fetch/$s_!tw7P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png 1272w, https://substackcdn.com/image/fetch/$s_!tw7P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5969ea9d-7c8e-4c43-a5ed-497c16380fe8_2446x1726.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>There are quite a few trade-offs, though:</p><ul><li><p>Just like ICL, because we are increasing the input context with these signals, compute cost grows quadratically with context/conditioning length.</p></li><li><p>As you can see from the diagram above, this architecture is quite complicated, with lots of implementation sub-details, which makes implementation and hyperparameter tuning cumbersome.</p></li></ul><p></p><div><hr></div><p>There&#8217;s a lot of exciting work going on with diffusion/flow-matching generative models right now, and it&#8217;s easy to get lost in the sauce of all the different methods. 
I found trying to group and categorize approaches like this helped me grasp when and where to try and apply the different strategies.</p><div><hr></div><p><em>I&#8217;m working on diffusion models and other architectures for multi-modal generation, representation learning research, and beyond. </em></p><p><em>Most of my work is focused on robot manipulation, world-modeling, and decision-making. Reach out if you are interested in chatting (<a href="https://x.com/brysonkjones">https://x.com/brysonkjones</a>), and share this post with others!</em></p><p></p>]]></content:encoded></item><item><title><![CDATA[Why Manipulation Is "Harder" Than Locomotion]]></title><description><![CDATA[Locomotion has taught the field a lot, but we need more to reach functional manipulation foundation models]]></description><link>https://brysonkjones.substack.com/p/why-manipulation-is-harder-than-locomotion</link><guid isPermaLink="false">https://brysonkjones.substack.com/p/why-manipulation-is-harder-than-locomotion</guid><dc:creator><![CDATA[Bryson Jones]]></dc:creator><pubDate>Sun, 21 Sep 2025 05:49:50 GMT</pubDate><enclosure 
url="https://substack-post-media.s3.amazonaws.com/public/images/1cf298d7-c7ff-4b1e-973b-80524dcff7bf_356x287.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When people imagine the frontier of robotics, they often think of humanoid robots or quadrupeds sprinting, climbing, dancing, doing backflips, etc. Locomotion captures our attention because it&#8217;s visually striking and it&#8217;s straightforward to grok how difficult the tasks are (most of us can&#8217;t do backflips&#8230;). </p><p>However, the frontier challenges we face are very much in manipulation: the ability of our robotic systems to handle, assemble, and interact with the physical world.</p><p>We don&#8217;t think of picking up an egg without cracking it or folding a blanket as the most impressive tasks humans are capable of, but they are some of the hardest problems to build <em>intelligence</em> for.</p><p>This is a prime example of <a href="https://en.wikipedia.org/wiki/Moravec%27s_paradox">Moravec&#8217;s Paradox</a>, effectively: <em>the hard things are easy, and the easy things are hard.</em></p><h3>What makes up robot autonomy?</h3><p>The problems within robotics can largely be broken up into:</p><ul><li><p>Navigation</p></li><li><p>Locomotion</p></li><li><p>Manipulation</p></li></ul><p><a href="https://people.eecs.berkeley.edu/~malik/">Jitendra Malik</a> gave a keynote talk in the past year (I unfortunately forget which conference this was<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>) that went as far as spicily saying something along the lines of:</p><p><em>&#8220;Navigation is solved, locomotion is basically solved, and while we are making progress on manipulation, we really don&#8217;t have what the recipe is yet to solve it&#8221;</em></p><p>I&#8217;m a huge fan of Jitendra, and while I know part of his goal with this slide was to be provocative, I generally agree (although 
I&#8217;m more bullish on the track we&#8217;re on in manipulation than he may be).</p><div><hr></div><h3>Three ways to train manipulation policies</h3><p>Researchers are attacking manipulation from three main directions:</p><ol><li><p><strong>Behavior cloning from tele-operation data</strong> &#8211; teaching robots by having humans control them and training policies to mimic the demonstrations.</p></li><li><p><strong>Large-scale unsupervised learning</strong> &#8211; using internet-scale video data to learn broad priors about how actors in a scene interact with objects and the environment they&#8217;re in.</p></li><li><p><strong>Reinforcement learning in simulation</strong> &#8211; training policies in massively parallelized environments where billions of interactions can be run safely and cheaply.</p></li></ol><p>Each approach has made progress, but none has cracked the full reliability and generalization needed for everyday manipulation tasks.</p><p>I don&#8217;t think it&#8217;s controversial to say that #1 (BC through tele-op) is the leading approach in terms of performance. That said, there&#8217;s a lot of work going on behind the scenes at places like SkildAI, Tesla, etc. that we aren&#8217;t privy to, and things are moving so quickly that this could be wrong by the time you read this.</p><h3>Why locomotion is simpler</h3><p><em>&#8220;Making this huge hunk of metal run looks pretty hard, man.&#8221;</em></p><p>Locomotion benefits from properties that make it extremely well-suited for massively parallel reinforcement learning in simulation:</p><ul><li><p><strong>Clear reward functions:</strong></p><p>The reward function for locomotion is actually pretty easy to formulate: move forward, have smooth motions, minimize energy usage. These are mathematically compact objectives, and almost feel tailor-made for reinforcement learning.</p></li><li><p><strong>Minimal sensing needs:</strong></p><p>Proprioceptive feedback (joint angles, velocities, foot contacts) often suffices. 
Many locomotion policies succeed even without cameras, relying only on physical feedback.</p></li><li><p><strong>Representation efficiency:</strong> </p><p>High-performing locomotion policies can <a href="https://arxiv.org/pdf/2109.11978">evidently</a> be squeezed into ~10 million parameters (you could argue more for multiple embodiments, but I&#8217;ll make the generalization).</p></li><li><p><strong>Structural regularity:</strong></p><p>Locomotion has strong symmetries: gaits are often periodic and physics is uniform across terrains. This allows policies to generalize well, with intelligent behaviors like efficient gait cycles or arm-swinging for energy minimization emerging naturally.</p></li></ul><p>Together, these properties make it possible to leverage large-scale simulation + RL to train super-human (super-dog for quadrupeds??) locomotion policies. </p><h3>What makes manipulation different (and harder)?</h3><p><em>&#8220;How hard can folding these f&amp;%$ing clothes be?&#8221;</em></p><p>Manipulation poses a more sobering challenge:</p><ul><li><p><strong>No compact reward functions</strong></p><p>The &#8220;cost function&#8221; for inserting a screw or cooking an omelet is hard to specify. Rewards often need careful shaping or human priors, leading to brittleness and unexpected outcomes from <em>reward-hacking.</em></p></li><li><p><strong>Rich sensing requirements</strong><br>Unlike locomotion, manipulation usually requires vision and tactile feedback to estimate object shape, pose, affordances, and contacts. Tactile sensing hardware is still immature. The example I like to give people is: &#8220;imagine dipping your hand in anesthetic and trying to pick your phone up&#8230; it&#8217;s basically impossible.&#8221;</p></li><li><p><strong>Discrete modes and long horizons introduce unique complexity</strong></p><p>Manipulation often involves <em>discrete</em> modes for task completion (grasping, lifting, pushing), where each object can behave differently. 
Contact dynamics are messy, occlusions are common, and the precision bar is much higher than for locomotion.</p></li><li><p><strong>Few emergent behaviors</strong></p><p>Unlike locomotion, manipulation doesn&#8217;t &#8220;discover&#8221; elegant solutions on its own (so far). Without demonstrations or heavy engineering, policies struggle to converge on useful strategies.</p></li></ul><p>This is why we haven&#8217;t been able to just &#8220;zero-shot&#8221; transfer all of the methods that accelerated locomotion progress in the past 5 years into manipulation.</p><div><hr></div><h3>Where does this leave us?</h3><p>This field is moving so fast that some of these challenges in manipulation could be solved in the next month, or they may take a few years. I don&#8217;t want to paint too bleak a picture; I just want to highlight that progress in locomotion often gets conflated with progress on how to tackle manipulation, and I don&#8217;t believe that&#8217;s fair to do.</p><p>I&#8217;m personally much more optimistic than many in the field are about how quickly behavior cloning with tele-operation data collection will result in <em>useful</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> manipulation policies that can be deployed into real-world settings and kick off a data flywheel of self-improvement (this is the bet we<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> are making).</p><p>On the topic of reward functions, there&#8217;s a lot of amazing work that leverages large pre-trained VLMs (and other foundation models) to bootstrap reward models that can rapidly be adapted to new tasks.</p><p>I could talk for hours about all of the exciting work going on in the field right now, but will save that for a future post.</p><div><hr></div><p>If you liked this post, feel free to subscribe, share with 
others who might enjoy it, or reach out to chat!</p><div class="preformatted-block" data-component-name="PreformattedTextBlockToDOM"><pre class="text"><em>&#8220;The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.&#8221; - George Bernard Shaw</em></pre></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Please comment and let me know if you find it!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>We don&#8217;t need to solve general manipulation to have useful policies for niche, specific tasks, even if we leverage the most general and broad data available for pre-training.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>My company is developing manipulation policies and tools to deploy them for skilled labor tasks. 
If this is something you&#8217;re interested in working on, I would love to talk to you about joining our team.</p></div></div>]]></content:encoded></item></channel></rss>