By Gowthami Somepalli and Sravani Somepalli · Part 2 of the Latent Scaffolding series. This is original, independent research. A full paper with detailed analysis and reproducible experiments is in preparation.

Token Dropout Before/After comparison showing diverse image variations
Same input image, same model, same pipeline. Left: four seeds produce nearly identical outputs. Right: with vision-only token dropout, each seed gives a genuinely different variation. The fix? Randomly dropping 85% of the visual tokens before generation.

In Part 1, we showed how a simple weight splice turns Z-Image into an image-to-image model — zero training, just swapping QwenVL's LM weights with Z-Image's. The results were surprisingly clean. However...

Look at the image-to-image results from the spliced model (setup note: all experiments in this post use layer 34, the penultimate LLM layer, i.e. hidden_states[-2], for conditioning unless stated otherwise; this is the same default layer Z-Image uses for text-to-image, and images are encoded at 768×768 max_pixels):

Four different random seeds producing nearly identical outputs
Figure 1: Four different random seeds, nearly identical outputs. Look at the first row — yellow helmet, blue overalls, similar pose, similar wrench. The only differences are minor: hand position, tool angle. For all practical purposes these are the same image four times.

Two things bothered us:

  • The generated images are too similar to the original — same color palette, same outfits, same composition. This defeats the whole point if variations are just slightly reshuffled copies.
  • Generations across different seeds are nearly identical. Z-Image-Turbo has well-known mode collapse, but it's particularly bad with image conditioning. The VL embeddings are so strong that the DiT has very little room to explore.

So we set out to fix at least the first problem and hopefully get the second one for free along the way.

Different Ablations

We tried a bunch of standard approaches to increase diversity. Here's the rundown — most of them didn't work.

Method | Idea | Result
Reduce resolution | Fewer visual tokens → coarser representation | Color palette and structure stubbornly retained
Embedding noise | Add Gaussian noise to prompt embeddings | Either ignored or destroys the image — no sweet spot
PCA perturbation | Perturb along top-k principal components | Absolute failure — garbled outputs (see Figure 2)
Attention temperature | Soften VL self-attention (T=0.5 to 5.0) | No meaningful diversity gains
Layer interpolation | LERP/SLERP between layers 24 and 34 | Interesting for creative control, doesn't solve diversity
Multi-crop ensemble | Average embeddings from augmented views | Modest gains, not enough
Token shuffle | Randomly permute visual tokens | More diverse but spatially confused

None of these cracked it. Here's a side-by-side of the best setting from each method:

Ablation comparison showing different diversity methods
Figure 2: Baseline vs one setting per method. Most methods at seed 42 — dropout shown at seed 777 to highlight the diversity. Noise and temperature barely change anything from baseline. Multicrop and shuffle produce slight variation. PCA is straight up garbled. Dropout at 75% produces a completely different outfit and pose. The simplest method wins.

Token Dropout: The Clear Winner

Then we tried the simplest possible thing: randomly remove a fraction of tokens from the conditioning sequence before passing it to the DiT.
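In code, the idea is tiny. Here's a minimal sketch (the helper is ours, not Z-Image's API; we assume the conditioning arrives as a [seq_len, dim] tensor of prompt embeddings, and the embedding dim below is a placeholder):

```python
import torch

def random_token_dropout(cond, rate, generator=None):
    """Keep a random (1 - rate) fraction of conditioning tokens.

    cond: [seq_len, dim] prompt embeddings from the VL encoder.
    Survivors keep their original order, so positional structure is preserved.
    """
    seq_len = cond.shape[0]
    n_keep = max(1, round(seq_len * (1.0 - rate)))
    perm = torch.randperm(seq_len, generator=generator)
    keep = perm[:n_keep].sort().values  # sorted indices of surviving tokens
    return cond[keep]

cond = torch.randn(1000, 512)            # 512 is a placeholder embedding dim
sparse = random_token_dropout(cond, rate=0.75)
print(sparse.shape)                      # torch.Size([250, 512])
```

The sorted-index gather matters: the DiT still sees tokens in their original order, just with gaps.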

Here's what happens as you increase the dropout rate:

Token dropout from 0% to 75%
Figure 3: Token dropout from 0% to 75%, single seed. At 10–25%, minor changes. At 50%, noticeably different composition. At 75%, a completely different take on “plumber.”

At low dropout the changes are subtle — maybe a slightly different wrench angle. By 50% the model is making real creative choices. At 75% it's a completely different illustration.

And when you look across multiple seeds at 75% dropout:

Dropout comparison across multiple seeds
Figure 4: Left: baseline (0% dropout), 4 seeds — nearly identical. Right: 75% token dropout, 4 seeds — real diversity in pose, composition, and style. But notice the last seed (red border) — something's off.

Instead of four copies of the same plumber, we get four genuinely different takes. Different poses, different tools, (somewhat) different compositions. The model finally has room to be creative because the conditioning signal is sparse enough that it has to fill in the gaps.

But there's a catch. Look at seed s2024 — the one with the red border in Figure 4. Weird noisy artifacts, color shifts. Zooming in:

Zoomed-in crops showing artifacts from random dropout
Figure 5: Zoomed-in crops from broken seeds at 75% random dropout. High-frequency noise, color distortion, incoherent regions. Not every seed — maybe 1 in 4 — but enough to be a problem.

Some seeds consistently produce these artifacts. It's not random quality variation — it's a specific failure mode. What's going on?

The Side Quest: Hunting Attention Sinks

Why are some seeds breaking while others aren't, when the only difference between them is which tokens got randomly dropped? That led to a hypothesis: maybe some tokens are more important than others. At 75% random dropout, we're probably dropping critical tokens that the DiT really needs.

To test this, we dug into what tokens actually exist in the conditioning sequence. The full VL embedding isn't just visual tokens — it's wrapped in a chat template:

<|im_start|> system \n You are a helpful assistant. <|im_end|> \n
<|im_start|> user \n <|vision_start|> [visual tokens] <|vision_end|> <|im_end|> \n
<|im_start|> assistant \n

There are template tokens (the chat scaffolding — ~20 tokens) and visual tokens (the actual image information — ~1000–1500 tokens at 768×768 depending on aspect ratio). Random dropout doesn't distinguish between them.

We captured the DiT's attention patterns over the conditioning tokens during denoising, and something jumped right out:

Attention heatmap showing template sinks and denoising progression
Figure 6: Top — attention mass per conditioning token (y-axis) across denoising timesteps (x-axis). Bright lines = high attention. The first few tokens at index 0–15 are the prefix template tokens (<|im_start|> system \n You are a helpful assistant. <|im_end|> \n <|im_start|> user \n). The bright lines at the bottom (~index 240–280) are the suffix template tokens (<|vision_end|> <|im_end|> \n <|im_start|> assistant \n). The dark middle region is the actual visual tokens — each one receives relatively little individual attention. Bottom — the corresponding denoising progression from noise to final image.

Figure 6 tells the story clearly: the bright horizontal lines at the top and bottom are template tokens — they receive 10–50x more attention than individual visual tokens. The prefix tokens (token 0: <|im_start|>, token 1: system) are the biggest attention magnets, but the suffix tokens (<|im_end|>, assistant) also absorb significant attention. These aren't just background scaffolding — they're attention sinks.
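For reference, the per-token attention mass behind a heatmap like Figure 6 can be computed from captured weights roughly like this (a sketch; we assume a forward hook hands you one DiT block's image-to-condition attention as a [heads, queries, cond_tokens] tensor — the hook mechanics are omitted):

```python
import torch

def attention_mass_per_token(attn):
    """Total attention each conditioning token receives.

    attn: [heads, n_image_queries, n_cond_tokens] softmaxed attention
    weights from one DiT block at one timestep (captured via a hook).
    """
    return attn.sum(dim=(0, 1))  # [n_cond_tokens]

# Toy example: token 0 acts as a sink, absorbing most of each query's attention.
attn = torch.full((2, 4, 10), 0.05)
attn[:, :, 0] = 0.55                    # each query row still sums to 1
mass = attention_mass_per_token(attn)
print(mass.argmax().item())             # 0 — the sink dominates
```

Stacking this quantity across timesteps gives the token-by-timestep heatmap.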

Attention sinks are a well-studied phenomenon in causal language models: in any causal (left-to-right) transformer, early tokens in the sequence accumulate disproportionate attention mass simply because every subsequent token can attend to them. They become “sinks” — dumping grounds for attention that doesn't have anywhere more specific to go. There are two kinds of sinks in our conditioning sequence, and they behave very differently under dropout:

Template (text) sinks are the chat scaffolding tokens (<|im_start|>, system, <|im_end|>, assistant, etc.) at the start and end of the sequence. These are structural sinks — they sit at extreme positions in the causal ordering and absorb massive attention from the DiT. Crucially, each template token is unique and irreplaceable. There's only one <|im_start|>, one system token. The DiT has learned to rely on these specific tokens as global context anchors during generation. Drop any of them and there's nothing else in the sequence that can compensate. These are critical and must never be dropped.

Visual sinks are the top-left visual patch tokens, which absorb extra attention simply because they appear first in the visual token sequence (due to raster-scan ordering + causal attention). These are positional sinks — high-attention but not actually carrying unique information. After passing through 34 layers of the VL LLM with causal self-attention, image information gets redundantly spread across all visual tokens. Each visual token encodes a mixed representation of the full image context, not just its local patch. So dropping any individual visual token — even a high-attention one — barely matters, because the same information is recoverable from neighboring tokens. This redundancy is why we can drop 95% of visual tokens and still get coherent generations.

This is why random dropout at 75% causes artifacts intermittently. With ~20 template tokens out of ~1000+ total, they're a tiny fraction — but at 75% dropout you're dropping ~750 tokens uniformly at random. The chance of hitting at least one of the ~20 template tokens is essentially 100%. The damage depends on which template token gets dropped — dropping <|im_start|> at position 0 (the biggest sink) is catastrophic, while dropping a less critical one might be survivable.
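The back-of-envelope math checks out. Dropping k of N tokens uniformly without replacement, the chance of sparing all t template tokens is hypergeometric (token counts below are illustrative, matching the rough figures above):

```python
from math import comb

N, t = 1000, 20        # total tokens, template tokens (illustrative counts)
k = int(0.75 * N)      # tokens dropped at 75% random dropout

p_all_safe = comb(N - t, k) / comb(N, k)   # no template token is dropped
p_hit = 1 - p_all_safe                     # at least one template token dropped
print(p_hit)                               # ≈ 1, within ~1e-12 of certainty

# Any *specific* template token (e.g. <|im_start|> at position 0) is
# dropped with probability k / N = 0.75 on any given seed — which matches
# the intermittent, seed-dependent failures.
```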

So we ran a systematic ablation — keep or drop specific token groups and see what breaks:

Ablation showing effects of dropping different token groups
Figure 7: Selectively dropping different token groups across 3 test images. “Vis Only” (all template dropped) and “No Prefix” (system prompt dropped) produce the worst artifacts (red borders). Dropping visual tokens — even the high-attention ones — is much safer.

The findings from Figure 7 were clear:

Dropping all template tokens (“Vis Only”) produces consistent artifacts. This is the smoking gun. These template tokens are load-bearing attention sinks that the DiT relies on for global context.

The prefix matters more than the suffix. Dropping just the prefix (system prompt + user header before <|vision_start|>) produces more severe artifacts than dropping the suffix (tokens after <|vision_end|>). The prefix tokens, particularly <|im_start|> and system, are the primary sinks.

Dropping visual tokens is relatively safe. Even dropping the top 10–25% of visual tokens (the positional sinks — top-left patches that absorb extra attention due to causal ordering) degrades quality much less than dropping any template tokens.

What about data-driven sink removal? The “Drop Sinks” column in Figure 7 takes a smarter approach — instead of dropping visual tokens randomly by position, we run a single DiT forward pass to capture which visual tokens receive the most attention, then drop the top 10% highest-attention visual tokens. The idea: surgically remove the visual sinks. The result: it barely changes anything. The visual sinks aren't carrying critical information — they're just absorbing excess attention due to causal position bias. Dropping them doesn't hurt, but it doesn't increase diversity either.

We also tried more aggressive post-encoding sink removal — dropping 5–30% of highest-attention visual tokens after VL encoding. Same story: at layer 34, every remaining token already encodes a mixed representation of the full image. Removing individual tokens after the LLM has already done its mixing doesn't selectively remove any semantic content. The information has already been blended.

This is actually good news for the practical approach: random dropout within visual tokens works just as well as sink-aware dropout. Since visual sinks don't carry special information, there's no benefit to identifying and preserving them. And sink-aware dropping requires an extra DiT forward pass to capture attention patterns — a significant computational overhead. Random dropout is free.

The Fix: Vision-Only Token Dropout

So the fix is obvious: only drop visual tokens, never touch the template tokens.

We identify visual token boundaries using the <|vision_start|> and <|vision_end|> special token IDs, then apply dropout exclusively within that range. Template tokens — the system prompt, chat formatting, everything outside the vision markers — are always preserved.
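A sketch of that selection logic (helper and variable names are ours; the marker IDs below are placeholders — fetch the real ones from the tokenizer, e.g. via tokenizer.convert_tokens_to_ids("<|vision_start|>")):

```python
import torch

VISION_START_ID = 151652   # placeholder IDs; get the real values
VISION_END_ID = 151653     # from the tokenizer's special-token map

def vision_only_dropout(embeds, token_ids, rate, generator=None):
    """Drop tokens only inside the <|vision_start|>...<|vision_end|> span.

    embeds:    [seq_len, dim] conditioning embeddings
    token_ids: [seq_len] input token IDs aligned with embeds
    Template tokens (and the vision markers themselves) always survive.
    """
    start = (token_ids == VISION_START_ID).nonzero()[0].item() + 1
    end = (token_ids == VISION_END_ID).nonzero()[0].item()
    vis_len = end - start
    n_keep = max(1, round(vis_len * (1.0 - rate)))
    perm = torch.randperm(vis_len, generator=generator)
    keep_vis = perm[:n_keep].sort().values + start
    keep = torch.cat([torch.arange(start), keep_vis,
                      torch.arange(end, embeds.shape[0])])
    return embeds[keep]

# Toy sequence: 10 prefix tokens, 100 visual tokens, 8 suffix tokens.
ids = torch.tensor([0] * 10 + [VISION_START_ID] + [1] * 100
                   + [VISION_END_ID] + [2] * 8)
emb = torch.randn(len(ids), 16)
out = vision_only_dropout(emb, ids, rate=0.85)
# 11 prefix (incl. start marker) + 15 of 100 visual + 9 suffix (incl. end
# marker) = 35 surviving tokens
```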

Vision-only token dropout at increasing rates
Figure 8: Vision-only token dropout at increasing rates. 0% = baseline (faithful reproduction). 75% = nice diversity while preserving subject. 85–95% = increasingly creative variations. 99% = very loose semantic connection.

No more random artifacts. The template sink tokens are always intact, so the DiT always has its global context. And we can push the dropout rate much higher — 85%, even 95% — without the artifact problem because we're only dropping visual information, not structural anchors.

The sweet spot depends on what you want:

  • 75% dropout: Good diversity while keeping subject identity strong
  • 85% dropout: More creative variations — the model takes bigger liberties with composition and color
  • 95% dropout: Very loose interpretation — captures the broad semantic gist but freely reimagines everything else
  • 99% dropout: Barely connected to the input

And the seed diversity problem from Figure 1? Solved:

Seed diversity comparison with vision-only dropout
Figure 9: 9 images, 4 seeds each. Baseline (left): nearly identical outputs across seeds. With 85% vision-only dropout (right): genuinely diverse variations. The mode collapse problem from Figure 1 is fixed.

Caveats

It's not a perfect solution, of course. At high dropout rates, things can go wrong in interesting ways. Subjects sometimes vanish entirely — a woman standing next to a train just… disappears from the scene. The rendering style can shift unexpectedly — a stylized 3D illustration becomes photorealistic. Object counts get mangled — two cats merge into one. One interesting pattern: the model tends to recycle colors from the original image (or at least as the VLM perceives it) rather than introducing genuinely new ones — the color palette stays anchored to whatever information remains. These are the tradeoffs of aggressive information dropping: the model fills in what it doesn't know, and sometimes it fills in wrong.

In practice, the dropout rate is a user-controlled knob with a clear tradeoff. At 75% dropout, the subject identity is preserved — you get the same person, object, or scene but in a different composition or pose. At 95% dropout, the model takes much more creative liberty — you get the gist of the input but reimagined freely. The right setting depends on the use case.

Steering color with a palette image

One limitation noted above: the model tends to recycle colors from the surviving visual tokens. What if you could steer the color palette while still getting diverse compositions?

The VL model can encode multiple images in a single pass — they just become separate blocks of visual tokens in the conditioning sequence. So we tried a simple trick: encode the input image alongside a second “palette” image (a photo with the desired color scheme), apply 95% dropout to both images' visual tokens, and generate.

Color palette steering with dual image encoding
Figure 10: Input image + palette reference → diverse generations that pick up color cues from the palette image. Both images get 95% vision dropout. The palette image doesn't control composition — it nudges the color distribution.

The results are (somewhat) promising. In most cases the background picks up color cues from the palette image, and sometimes the main subject does too. But it's not really controllable — the palette influence is more of a nudge than a hard constraint, since both images are getting aggressively dropped. One thing we found: applying dropout to both images matters. Keeping the palette image intact produces striped, tiled background artifacts where the model copies palette patches verbatim. This isn't a final solution; more work is needed here.

Layer choice as a second diversity knob

All results so far use layer 34 — the penultimate LLM layer. But what happens if we tap an earlier layer? We ran the same dropout sweep at layer 24 and compared layer 34 at 85% dropout against layer 24 at 95% dropout:

Layer 24 vs layer 34 comparison
Figure 11: Layer 34 at 85% dropout vs layer 24 at 95% dropout, 3 seeds each. Despite dropping more tokens, layer 24 produces more diverse outputs — different compositions, color choices, and structural interpretations of the same input.

This is surprising at first glance: layer 24 with more aggressive dropout produces more diversity than layer 34 with less dropout. But it makes sense when you think about what each layer represents. By layer 34, the LLM has had 34 layers of self-attention to thoroughly extract and encode every detail of the input image — color palette, spatial layout, fine textures, object identities. The representation is rich and specific. Even after dropping 85% of tokens, the surviving 15% still carry enough of that detailed signal to strongly constrain the DiT.

Layer 24, on the other hand, sits earlier in the LLM stack. The representation is more semantic and less pixel-faithful — it captures what's in the image more than exactly how it looks. When you then drop 95% of those already-coarser tokens, the DiT receives a very loose semantic sketch: “there's a cat,” “there's a person with a headwrap,” but without the fine-grained detail that would lock it into reproducing the original. The model has to fill in much more on its own, and it does so creatively.

This gives us two orthogonal controls: dropout rate determines how much information survives, while layer choice determines how abstract that information was to begin with. A practical rule of thumb: if you want variations that preserve the look and feel of the original but change composition, use layer 34 with moderate dropout. If you want the model to freely reinterpret the subject, go earlier.
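The rule of thumb can be written down as a couple of presets (illustrative, not tuned recommendations; the layer/dropout pairs are simply the settings explored in this post):

```python
# Two orthogonal knobs: conditioning layer (how abstract the signal is)
# and vision-token dropout rate (how much of it survives).
PRESETS = {
    "faithful":    {"layer": 34, "dropout": 0.00},  # near-reproduction
    "variations":  {"layer": 34, "dropout": 0.85},  # same look, new composition
    "reinterpret": {"layer": 24, "dropout": 0.95},  # loose semantic sketch
}

def pick_preset(preserve_look):
    """Rule of thumb from above: stay at layer 34 to preserve look and feel,
    go earlier (layer 24) to let the model reinterpret freely."""
    return PRESETS["variations"] if preserve_look else PRESETS["reinterpret"]
```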

Does this generalize to text-to-image?

The natural question: if dropping visual tokens increases diversity in image-to-image, does the same trick work for text-to-image? Z-Image encodes text prompts through the same VL model (just without an image), producing a sequence of text token embeddings wrapped in the same chat template. The template sinks are identical — <|im_start|>, system, assistant — but instead of visual patch tokens in the middle, you get text content tokens.

The same logic should apply. Template sinks are structural anchors the DiT needs — don't touch them. Text content tokens, after passing through 34 LLM layers, should carry redundantly mixed representations of the prompt's semantics, just like visual tokens carry redundantly mixed image information. If that's true, we should be able to drop a large fraction of text content tokens and get diverse generations from the same prompt.

We ran this experiment: 10 diverse prompts, 4 seeds each, dropping [0%, 10%, 25%, 50%, 75%, 95%] of text content tokens while keeping all template tokens intact. The content/template boundary is identified by encoding an empty string to find the template token count, then comparing token IDs to locate where content tokens are inserted.
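That boundary-finding step can be sketched on plain token-ID lists (the function name is ours; it assumes the chat template inserts the prompt as one contiguous span, and could mis-align by a token in the rare case where the prompt starts with the same ID the template continues with):

```python
def content_span(template_ids, prompt_ids):
    """Locate the prompt's content tokens inside the full chat-template encoding.

    template_ids: token IDs for an empty prompt (template only)
    prompt_ids:   token IDs for the real prompt (template + content)
    Returns (start, end) such that prompt_ids[start:end] are content tokens.
    """
    i = 0  # length of the shared template prefix
    while i < len(template_ids) and template_ids[i] == prompt_ids[i]:
        i += 1
    n_content = len(prompt_ids) - len(template_ids)
    return i, i + n_content

# Toy IDs: the template [1, 2, 3, ..., 9, 9] wraps content tokens [5, 6, 7].
template = [1, 2, 3, 9, 9]
prompt = [1, 2, 3, 5, 6, 7, 9, 9]
print(content_span(template, prompt))   # (3, 6)
```

Dropout is then applied only within that span, exactly as in the vision-only case.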

Text content token dropout from 0% to 95%
Figure 12: Text content token dropout from 0% to 95% across 5 prompts. At 10–25%, the semantic core is preserved but details shift — the Wonderland poster gets a different layout, the portrait subtly changes styling, the cats shift colors. By 50–75%, prompts start breaking — the train loses its foliage tunnel, the angel loses coherence. At 95%, it's barely connected to the original prompt.

The answer: yes, but with a much tighter sweet spot. Text dropout at 10–25% produces meaningful variation while keeping the prompt's core intent intact. But it degrades much faster than vision dropout — by 50–75% the generations are already losing key prompt details. This makes sense: text prompts have far fewer content tokens than images (tens versus hundreds to thousands), so each dropped token removes a larger fraction of the semantic signal. There's less redundancy to lean on.

Text dropout seed diversity comparison
Figure 13: Baseline (0%) vs 25% text dropout, 3 seeds each. Five prompts: portrait with veil, line sketch, silhouette illustration, Japanese train scene, floral portrait. T2I already has more seed diversity than I2I at baseline (mode collapse is less severe without image conditioning). Text dropout at 25% adds further variation — subtle changes in pose, styling, color balance, and composition.

The practical upshot: text token dropout works as a diversity mechanism, but it's a scalpel rather than a sledgehammer. The safe range is roughly 10–25% — enough to get the model to reinterpret details while respecting the overall prompt. This confirms that token dropout is a general phenomenon, not something specific to visual tokens. The underlying mechanism is the same: after 34 LLM layers, token representations are redundantly mixed, so dropping a fraction forces the downstream model to fill in gaps creatively.

Why This Matters

The fact that we can drop 95% of visual tokens and still get coherent generations tells us something about how QwenVL represents images internally: by the later layers, image information is massively redundant across tokens. The subject identity, the broad color story, the semantic gist — these persist even when almost everything is thrown away. What doesn't survive is the fine-grained detail: exact spatial layout, precise colors, specific textures. That's the information that lives in the long tail of tokens, and that's what dropout strips away to create room for diversity.

There's an efficiency angle here too. If 95% of visual tokens can be dropped without losing semantic coherence, then VL-conditioned diffusion models are dramatically over-conditioned for many use cases. There may be an efficiency play here — faster inference by pruning the conditioning sequence, or more efficient training by working with compressed representations.

The attention sink thing is interesting on its own. The DiT doesn't just consume a bag of embedding vectors. It has learned to use specific tokens as structural anchors. The template tokens from the chat format — which might seem like irrelevant boilerplate — have become load-bearing infrastructure that the generation process relies on. The model has developed an implicit hierarchy: a handful of structural tokens that must always be present, and hundreds of content tokens that are individually expendable.

If you're doing any kind of conditioning token manipulation in diffusion transformers, be careful about which tokens you're touching. The boilerplate chat template tokens might matter more than the actual content.

What's Next

We've now got a pretty good image variation pipeline: the VL splice from Part 1 gives us zero-shot I2I, and vision-only dropout gives us controllable diversity. In Part 3, we will show how to take rough cut-paste composites and use SDEdit-style denoising to clean them into coherent images — approximate object composition for free.


If you would like to cite this post in an academic context, you can use this BibTeX snippet:

@misc{somepalli2026latentscaffold,
  author = {Somepalli, Gowthami and Somepalli, Sravani},
  title = {Latent Scaffolding Image Generation Models},
  url = {https://somepago.github.io/posts/latent-scaffolding-series/},
  year = {2026}
}