By Gowthami Somepalli and Sravani Somepalli · Part 2 of the Latent Scaffolding series. This is original, independent research. A full paper with detailed analysis and reproducible experiments is in preparation.
In Part 1, we showed how a simple weight splice turns Z-Image into an image-to-image model — zero training, just swapping QwenVL's LM weights with Z-Image's. The results were surprisingly clean. However...
Look at the image-to-image results from the spliced model. (Setup note: all experiments in this post use layer 34 — the penultimate LLM layer, i.e. hidden_states[-2] — for conditioning unless stated otherwise. This is the same default layer Z-Image uses for text-to-image. Resolution is 768×768 max_pixels for VL encoding.)
Two things bothered us: 1) the generated images are too similar to the original — same color palette, same outfits, same composition. This defeats the whole point if variations are just slightly reshuffled copies. 2) generations across different seeds are nearly identical. Z-Image-Turbo has well-known mode collapse, but it's particularly bad with image conditioning. The VL embeddings are so strong that the DiT has very little room to explore.
So we set out to fix at least the first problem and hopefully get the second one for free along the way.
Different Ablations
We tried a bunch of standard approaches to increase diversity. Here's the rundown — most of them didn't work.
| Method | Idea | Result |
|---|---|---|
| Reduce resolution | Fewer visual tokens → coarser representation | Color palette and structure stubbornly retained |
| Embedding noise | Add Gaussian noise to prompt embeddings | Either ignored or destroys the image — no sweet spot |
| PCA perturbation | Perturb along top-k principal components | Absolute failure — garbled outputs (see Figure 2) |
| Attention temperature | Soften VL self-attention (T=0.5 to 5.0) | No meaningful diversity gains |
| Layer interpolation | LERP/SLERP between layers 24 and 34 | Interesting for creative control, doesn't solve diversity |
| Multi-crop ensemble | Average embeddings from augmented views | Modest gains, not enough |
| Token shuffle | Randomly permute visual tokens | More diverse but spatially confused |
None of these cracked it. Here's a side-by-side of the best setting from each method:
Token Dropout: The Clear Winner
Then we tried the simplest possible thing: randomly remove a fraction of tokens from the conditioning sequence before passing it to the DiT.
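In code, the idea is a single index selection. A minimal sketch with NumPy stand-ins for the conditioning embeddings (function and variable names are ours, not from the Z-Image codebase):

```python
import numpy as np

def random_token_dropout(cond_tokens, drop_rate, seed=None):
    """Randomly drop a fraction of conditioning tokens.

    cond_tokens: (seq_len, dim) array of conditioning embeddings.
    Returns the surviving tokens with their original order preserved.
    """
    rng = np.random.default_rng(seed)
    seq_len = cond_tokens.shape[0]
    n_keep = max(1, round(seq_len * (1.0 - drop_rate)))
    keep_idx = np.sort(rng.choice(seq_len, size=n_keep, replace=False))
    return cond_tokens[keep_idx]

# Stand-in conditioning sequence (~1200 tokens); 75% dropout leaves 300
cond = np.random.default_rng(0).normal(size=(1200, 64)).astype(np.float32)
kept = random_token_dropout(cond, drop_rate=0.75, seed=0)
```

Note this version drops uniformly over the whole sequence, template tokens included, which is exactly the naive setup whose failure mode shows up later in the post.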
Here's what happens as you increase the dropout rate:
At low dropout the changes are subtle — maybe a slightly different wrench angle. By 50% the model is making real creative choices. At 75% it's a completely different illustration.
And when you look across multiple seeds at 75% dropout:
Instead of four copies of the same plumber, we get four genuinely different takes. Different poses, different tools, (somewhat) different compositions. The model finally has room to be creative because the conditioning signal is sparse enough that it has to fill in the gaps.
But there's a catch. Look at seed s2024 — the one with the red border in Figure 4. Weird noisy artifacts, color shifts. Zooming in:
Some seeds consistently produce these artifacts. It's not random quality variation — it's a specific failure mode. What's going on?
The Side Quest: Hunting Attention Sinks
Why were some seeds breaking while others weren't, when the only difference was which tokens got randomly dropped? That led to a hypothesis: maybe some tokens are more important than others. At 75% random dropout, we're probably dropping some critical tokens that the DiT really needs.
To test this, we dug into what tokens actually exist in the conditioning sequence. The full VL embedding isn't just visual tokens — it's wrapped in a chat template:
<|im_start|> system \n You are a helpful assistant. <|im_end|> \n
<|im_start|> user \n <|vision_start|> [visual tokens] <|vision_end|> <|im_end|> \n
<|im_start|> assistant \n
There are template tokens (the chat scaffolding — ~20 tokens) and visual tokens (the actual image information — ~1000–1500 tokens at 768×768 depending on aspect ratio). Random dropout doesn't distinguish between them.
We captured the DiT's attention patterns over the conditioning tokens during denoising, and something jumped right out:
Figure 6 (caption): DiT attention over the conditioning tokens during denoising. The bright lines at the top of the map are the prefix template tokens (<|im_start|> system \n You are a helpful assistant. <|im_end|> \n <|im_start|> user \n). The bright lines at the bottom (~index 240–280) are the suffix template tokens (<|vision_end|> <|im_end|> \n <|im_start|> assistant \n). The dark middle region is the actual visual tokens — each one receives relatively little individual attention. Bottom — the corresponding denoising progression from noise to final image.
Figure 6 tells the story clearly: the bright horizontal lines at the top and bottom are template tokens — they receive 10–50x more attention than individual visual tokens. The prefix tokens (token 0: <|im_start|>, token 1: system) are the biggest attention magnets, but the suffix tokens (<|im_end|>, assistant) also absorb significant attention. These aren't just background scaffolding — they're attention sinks.
Attention sinks are a well-studied phenomenon in causal language models. (In any causal, left-to-right transformer, early tokens in the sequence accumulate disproportionate attention mass simply because every subsequent token can attend to them. They become “sinks” — dumping grounds for attention that doesn't have anywhere more specific to go.) There are two kinds of sinks in our conditioning sequence, and they behave very differently under dropout:
Template (text) sinks are the chat scaffolding tokens (<|im_start|>, system, <|im_end|>, assistant, etc.) at the start and end of the sequence. These are structural sinks — they sit at extreme positions in the causal ordering and absorb massive attention from the DiT. Crucially, each template token is unique and irreplaceable. There's only one <|im_start|>, one system token. The DiT has learned to rely on these specific tokens as global context anchors during generation. Drop any of them and there's nothing else in the sequence that can compensate. These are critical and must never be dropped.
Visual sinks are the top-left visual patch tokens, which absorb extra attention simply because they appear first in the visual token sequence (due to raster-scan ordering + causal attention). These are positional sinks — high-attention but not actually carrying unique information. After passing through 34 layers of the VL LLM with causal self-attention, image information gets redundantly spread across all visual tokens. Each visual token encodes a mixed representation of the full image context, not just its local patch. So dropping any individual visual token — even a high-attention one — barely matters, because the same information is recoverable from neighboring tokens. This redundancy is why we can drop 95% of visual tokens and still get coherent generations.
This is why random dropout at 75% causes artifacts intermittently. With ~20 template tokens out of ~1000+ total, they're a tiny fraction — but at 75% dropout you're dropping ~750 tokens uniformly at random. The chance of hitting at least one of the ~20 template tokens is essentially 100%. The damage depends on which template token gets dropped — dropping <|im_start|> at position 0 (the biggest sink) is catastrophic, while dropping a less critical one might be survivable.
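The "essentially 100%" claim checks out with a quick hypergeometric calculation. The exact counts below (1020 total tokens, 20 of them template, 765 dropped) are illustrative stand-ins for the approximate numbers above:

```python
from math import comb

total, template, dropped = 1020, 20, 765  # illustrative: ~75% uniform dropout

# P(all template tokens survive) under uniform dropout without replacement
# follows the hypergeometric distribution:
#   C(total - template, dropped) / C(total, dropped)
p_all_survive = comb(total - template, dropped) / comb(total, dropped)
p_hit_template = 1.0 - p_all_survive  # vanishingly close to 1
```

With these numbers, p_all_survive is on the order of (1/4)^20, i.e. below one in a trillion, so hitting at least one template token is effectively certain on every generation.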
So we ran a systematic ablation — keep or drop specific token groups and see what breaks:
The findings from Figure 7 were clear:
Dropping all template tokens (“Vis Only”) produces consistent artifacts. This is the smoking gun. These template tokens are load-bearing attention sinks that the DiT relies on for global context.
The prefix matters more than the suffix. Dropping just the prefix (system prompt + user header before <|vision_start|>) produces more severe artifacts than dropping the suffix (tokens after <|vision_end|>). The prefix tokens, particularly <|im_start|> and system, are the primary sinks.
Dropping visual tokens is relatively safe. Even dropping the top 10–25% of visual tokens (the positional sinks — top-left patches that absorb extra attention due to causal ordering) degrades quality much less than dropping any template tokens.
What about data-driven sink removal? The “Drop Sinks” column in Figure 7 takes a smarter approach — instead of dropping visual tokens randomly by position, we run a single DiT forward pass to capture which visual tokens receive the most attention, then drop the top 10% highest-attention visual tokens. The idea: surgically remove the visual sinks. The result: it barely changes anything. The visual sinks aren't carrying critical information — they're just absorbing excess attention due to causal position bias. Dropping them doesn't hurt, but it doesn't increase diversity either.
We also tried more aggressive post-encoding sink removal — dropping 5–30% of highest-attention visual tokens after VL encoding. Same story: at layer 34, every remaining token already encodes a mixed representation of the full image. Removing individual tokens after the LLM has already done its mixing doesn't selectively remove any semantic content. The information has already been blended.
This is actually good news for the practical approach: random dropout within visual tokens works just as well as sink-aware dropout. Since visual sinks don't carry special information, there's no benefit to identifying and preserving them. And sink-aware dropping requires an extra DiT forward pass to capture attention patterns — a significant computational overhead. Random dropout is free.
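For reference, the sink-aware variant looks roughly like this. It's a sketch: in the real pipeline the attention vector comes from the probe DiT forward pass (the extra cost mentioned above), while here it's a synthetic array:

```python
import numpy as np

def drop_visual_sinks(embeddings, attn_per_token, vis_start, vis_end, frac=0.10):
    """Drop the top-`frac` highest-attention visual tokens (positional sinks).

    embeddings:     (seq_len, dim) conditioning embeddings.
    attn_per_token: (seq_len,) mean attention each token received in a probe
                    DiT forward pass. Template tokens are never touched.
    """
    vis = np.arange(vis_start, vis_end)
    n_drop = int(len(vis) * frac)
    # Highest-attention visual tokens first
    sinks = vis[np.argsort(attn_per_token[vis])[::-1][:n_drop]]
    keep = np.setdiff1d(np.arange(embeddings.shape[0]), sinks)
    return embeddings[keep]

# Toy demo: 20 tokens, visual range [5, 15), two strong sinks at indices 5 and 6
emb = np.arange(20, dtype=np.float32)[:, None].repeat(3, axis=1)
attn = np.zeros(20)
attn[5], attn[6] = 5.0, 4.0
pruned = drop_visual_sinks(emb, attn, vis_start=5, vis_end=15, frac=0.2)
```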
The Fix: Vision-Only Token Dropout
So the fix is obvious: only drop visual tokens, never touch the template tokens.
We identify visual token boundaries using the <|vision_start|> and <|vision_end|> special token IDs, then apply dropout exclusively within that range. Template tokens — the system prompt, chat formatting, everything outside the vision markers — are always preserved.
No more random artifacts. The template sink tokens are always intact, so the DiT always has its global context. And we can push the dropout rate much higher — 85%, even 95% — without the artifact problem because we're only dropping visual information, not structural anchors.
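A sketch of the fix, again with NumPy stand-ins. The special-token ids below are placeholders in the style of the Qwen2 tokenizer family; look up the real values from your tokenizer rather than trusting these constants:

```python
import numpy as np

VISION_START_ID, VISION_END_ID = 151652, 151653  # placeholder ids; verify against the tokenizer

def vision_only_dropout(token_ids, embeddings, drop_rate, seed=None):
    """Drop visual tokens only; template tokens are always preserved.

    token_ids:  (seq_len,) conditioning token ids.
    embeddings: (seq_len, dim) conditioning embeddings.
    """
    token_ids = np.asarray(token_ids)
    start = int(np.flatnonzero(token_ids == VISION_START_ID)[0]) + 1
    end = int(np.flatnonzero(token_ids == VISION_END_ID)[0])  # exclusive

    vis = np.arange(start, end)
    rng = np.random.default_rng(seed)
    n_keep = max(1, round(len(vis) * (1.0 - drop_rate)))
    kept_vis = np.sort(rng.choice(vis, size=n_keep, replace=False))

    keep = np.concatenate([np.arange(start), kept_vis, np.arange(end, len(token_ids))])
    return embeddings[keep]

# Toy demo: 4 prefix template ids, 100 visual ids, 3 suffix template ids
ids = [1, 2, 3, VISION_START_ID] + [7] * 100 + [VISION_END_ID, 4, 5]
emb = np.arange(len(ids), dtype=np.float32)[:, None]
out = vision_only_dropout(ids, emb, drop_rate=0.95, seed=0)  # keeps 12 of 107 tokens
```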
The sweet spot depends on what you want:
- 75% dropout: Good diversity while keeping subject identity strong
- 85% dropout: More creative variations — the model takes bigger liberties with composition and color
- 95% dropout: Very loose interpretation — captures the broad semantic gist but freely reimagines everything else
- 99% dropout: Barely connected to the input
And the seed diversity problem from Figure 1? Solved:
Caveats
It's not a perfect solution, of course. At high dropout rates, things can go wrong in interesting ways. Subjects sometimes vanish entirely — a woman standing next to a train just… disappears from the scene. The rendering style can shift unexpectedly — a stylized 3D illustration becomes photorealistic. Object counts get mangled — two cats merge into one. One interesting pattern: the model tends to recycle colors from the original image (or at least as the VLM perceives them) rather than introducing genuinely new ones — the color palette stays anchored to whatever information remains. These are the tradeoffs of aggressive information dropping: the model fills in what it doesn't know, and sometimes it fills in wrong.
In practice, the dropout rate is a user-controlled knob with a clear tradeoff. At 75% dropout, the subject identity is preserved — you get the same person, object, or scene but in a different composition or pose. At 95% dropout, the model takes much more creative liberty — you get the gist of the input but reimagined freely. The right setting depends on the use case.
Steering color with a palette image
One limitation noted above: the model tends to recycle colors from the surviving visual tokens. What if you could steer the color palette while still getting diverse compositions?
The VL model can encode multiple images in a single pass — they just become separate blocks of visual tokens in the conditioning sequence. So we tried a simple trick: encode the input image alongside a second “palette” image (a photo with the desired color scheme), apply 95% dropout to both images' visual tokens, and generate.
The results are (somewhat) promising. In most cases the background picks up color cues from the palette image, and sometimes the main subject does too. But it's not really controllable — the palette influence is more of a nudge than a hard constraint, since both images are getting aggressively dropped. One thing we found: applying dropout to both images is important. Keeping the palette image intact produces striped/tiled background artifacts where the model copies palette patches verbatim. This isn't a final solution; more work is needed here.
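With two images in the sequence, the same dropout just runs per vision block. A sketch (same placeholder token ids as before; names are ours):

```python
import numpy as np

VISION_START_ID, VISION_END_ID = 151652, 151653  # placeholder ids; verify against the tokenizer

def multi_image_dropout(token_ids, embeddings, drop_rate, seed=None):
    """Apply dropout independently inside each vision block.

    Dropping both blocks matters: keeping the palette block intact
    produced the tiled copy-paste artifacts described above.
    """
    token_ids = np.asarray(token_ids)
    starts = np.flatnonzero(token_ids == VISION_START_ID) + 1
    ends = np.flatnonzero(token_ids == VISION_END_ID)
    rng = np.random.default_rng(seed)

    drop = np.zeros(len(token_ids), dtype=bool)
    for s, e in zip(starts, ends):
        vis = np.arange(s, e)
        n_drop = int(len(vis) * drop_rate)
        drop[rng.choice(vis, size=n_drop, replace=False)] = True
    return embeddings[~drop]

# Toy demo: two vision blocks of 40 tokens each, 95% dropout in both
ids = ([1, VISION_START_ID] + [7] * 40 + [VISION_END_ID, 2, VISION_START_ID]
       + [8] * 40 + [VISION_END_ID, 3])
emb = np.arange(len(ids), dtype=np.float32)[:, None]
out = multi_image_dropout(ids, emb, drop_rate=0.95, seed=0)  # 2 visual tokens survive per block
```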
Layer choice as a second diversity knob
All results so far use layer 34 — the penultimate LLM layer. But what happens if we tap an earlier layer? We ran the same dropout sweep at layer 24 and compared layer 34 at 85% dropout against layer 24 at 95% dropout:
This is surprising at first glance: layer 24 with more aggressive dropout produces more diversity than layer 34 with less dropout. But it makes sense when you think about what each layer represents. By layer 34, the LLM has had 34 layers of self-attention to thoroughly extract and encode every detail of the input image — color palette, spatial layout, fine textures, object identities. The representation is rich and specific. Even after dropping 85% of tokens, the surviving 15% still carry enough of that detailed signal to strongly constrain the DiT.
Layer 24, on the other hand, sits earlier in the LLM stack. The representation is more semantic and less pixel-faithful — it captures what's in the image more than exactly how it looks. When you then drop 95% of those already-coarser tokens, the DiT receives a very loose semantic sketch: “there's a cat,” “there's a person with a headwrap,” but without the fine-grained detail that would lock it into reproducing the original. The model has to fill in much more on its own, and it does so creatively.
This gives us two orthogonal controls: dropout rate determines how much information survives, while layer choice determines how abstract that information was to begin with. A practical rule of thumb: if you want variations that preserve the look and feel of the original but change composition, use layer 34 with moderate dropout. If you want the model to freely reinterpret the subject, go earlier.
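The two knobs compose into simple presets. A hedged sketch: the preset values come from the sweeps above and are starting points, not canon, and `hidden_states` simulates the tuple a HuggingFace-style encoder returns with `output_hidden_states=True` (index 0 is the embedding layer):

```python
import numpy as np

# Rule-of-thumb presets from our sweeps; treat as starting points
PRESETS = {
    "preserve_look":   {"layer": 34, "drop_rate": 0.75},
    "loose_variation": {"layer": 34, "drop_rate": 0.95},
    "reinterpret":     {"layer": 24, "drop_rate": 0.95},
}

def conditioning_from_hidden_states(hidden_states, preset, dropout_fn):
    """Pick a layer, then thin it out.

    hidden_states: per-layer (seq_len, dim) arrays (index 0 = embeddings).
    dropout_fn:    any token-dropout callable (tokens, rate) -> tokens.
    """
    cfg = PRESETS[preset]
    return dropout_fn(hidden_states[cfg["layer"]], cfg["drop_rate"])

# Stand-in for the hidden-state tuple; each layer's array is filled with its index
hs = [np.full((10, 4), i, dtype=np.float32) for i in range(36)]
drop = lambda x, r: x[: max(1, round(len(x) * (1 - r)))]  # placeholder dropout
cond = conditioning_from_hidden_states(hs, "reinterpret", drop)
```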
Does this generalize to text-to-image?
The natural question: if dropping visual tokens increases diversity in image-to-image, does the same trick work for text-to-image? Z-Image encodes text prompts through the same VL model (just without an image), producing a sequence of text token embeddings wrapped in the same chat template. The template sinks are identical — <|im_start|>, system, assistant — but instead of visual patch tokens in the middle, you get text content tokens.
The same logic should apply. Template sinks are structural anchors the DiT needs — don't touch them. Text content tokens, after passing through 34 LLM layers, should carry redundantly mixed representations of the prompt's semantics, just like visual tokens carry redundantly mixed image information. If that's true, we should be able to drop a large fraction of text content tokens and get diverse generations from the same prompt.
We ran this experiment: 10 diverse prompts, 4 seeds each, dropping [0%, 10%, 25%, 50%, 75%, 95%] of text content tokens while keeping all template tokens intact. The content/template boundary is identified by encoding an empty string to find the template token count, then comparing token IDs to locate where content tokens are inserted.
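The boundary-finding step can be sketched as a prefix/suffix diff against the empty-prompt encoding (pure-Python sketch; real ids would come from the tokenizer's chat template):

```python
def find_content_span(prompt_ids, template_ids):
    """Locate content tokens by diffing a prompt's ids against the ids of an
    empty prompt encoded with the same chat template.

    Returns (start, end): content tokens live at prompt_ids[start:end];
    everything outside that span is template and must be preserved.
    Assumes the content doesn't begin or end with the same ids as the
    adjoining template tokens.
    """
    # Longest common prefix with the empty-prompt encoding
    start = 0
    while start < len(template_ids) and prompt_ids[start] == template_ids[start]:
        start += 1
    # Longest common suffix over the remaining template tokens
    suffix = 0
    while (suffix < len(template_ids) - start
           and prompt_ids[-1 - suffix] == template_ids[-1 - suffix]):
        suffix += 1
    return start, len(prompt_ids) - suffix

# Toy ids: template = [1, 2, 3] + [9, 8, 7]; content = [50, 51, 52]
span = find_content_span([1, 2, 3, 50, 51, 52, 9, 8, 7], [1, 2, 3, 9, 8, 7])
```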
The answer: yes, but with a much tighter sweet spot. Text dropout at 10–25% produces meaningful variation while keeping the prompt's core intent intact. But it degrades much faster than vision dropout — by 50–75% the generations are already losing key prompt details. This makes sense: text prompts have far fewer content tokens than images (tens versus hundreds to thousands), so each dropped token removes a larger fraction of the semantic signal. There's less redundancy to lean on.
The practical upshot: text token dropout works as a diversity mechanism, but it's a scalpel rather than a sledgehammer. The safe range is roughly 10–25% — enough to get the model to reinterpret details while respecting the overall prompt. This confirms that token dropout is a general phenomenon, not something specific to visual tokens. The underlying mechanism is the same: after 34 LLM layers, token representations are redundantly mixed, so dropping a fraction forces the downstream model to fill in gaps creatively.
Why This Matters
The fact that we can drop 95% of visual tokens and still get coherent generations tells us something about how QwenVL represents images internally: by the later layers, image information is massively redundant across tokens. The subject identity, the broad color story, the semantic gist — these persist even when almost everything is thrown away. What doesn't survive is the fine-grained detail: exact spatial layout, precise colors, specific textures. That's the information that lives in the long tail of tokens, and that's what dropout strips away to create room for diversity.
There's an efficiency angle here too. If 95% of visual tokens can be dropped without losing semantic coherence, then VL-conditioned diffusion models are dramatically over-conditioned for many use cases. There may be an efficiency play here — faster inference by pruning the conditioning sequence, or more efficient training by working with compressed representations.
The attention sink thing is interesting on its own. The DiT doesn't just consume a bag of embedding vectors. It has learned to use specific tokens as structural anchors. The template tokens from the chat format — which might seem like irrelevant boilerplate — have become load-bearing infrastructure that the generation process relies on. The model has developed an implicit hierarchy: a handful of structural tokens that must always be present, and hundreds of content tokens that are individually expendable.
If you're doing any kind of conditioning token manipulation in diffusion transformers, be careful about which tokens you're touching. The boilerplate chat template tokens might matter more than the actual content.
What's Next
We've now got a pretty good image variation pipeline: the VL splice from Part 1 gives us zero-shot I2I, and vision-only dropout gives us controllable diversity. In Part 3, we will show how to take rough cut-paste composites and use SDEdit-style denoising to clean them into coherent images — approximate object composition for free.
If you would like to cite this post in an academic context, you can use this BibTeX snippet:
@misc{somepalli2026latentscaffold,
author = {Somepalli, Gowthami and Somepalli, Sravani},
title = {Latent Scaffolding Image Generation Models},
url = {https://somepago.github.io/posts/latent-scaffolding-series/},
year = {2026}
}