By Gowthami Somepalli and Sravani Somepalli
A series exploring emergent capabilities hidden inside VL-conditioned single-stream diffusion transformers. No fine-tuning, no additional training — just scaffolding: architectural hacking and careful probing of what these models already know.
Posts
A simple architectural splice unlocks zero-shot image-to-image variations with no training.
Vision-only token dropout solves mode collapse. Hunting attention sinks and finding two orthogonal knobs for diversity.
SDEdit-style denoising for approximate object composition.
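The SDEdit idea referenced above can be summarized as: noise a guide image partway into the diffusion chain, then run the reverse process from that intermediate timestep so the output stays close to the guide. The following is a minimal illustrative sketch, not the series' actual implementation; the function names (`sdedit`, `toy_step`), the linear noise schedule, and the placeholder denoiser are all assumptions for demonstration.

```python
import numpy as np

def sdedit(x0, denoise_step, t_start=0.5, num_steps=50, seed=0):
    """SDEdit-style sketch: forward-noise an input to an intermediate
    timestep, then denoise back. `denoise_step` stands in for the
    model's reverse update rule."""
    rng = np.random.default_rng(seed)
    # Illustrative linear alpha-bar schedule (not the actual model's schedule).
    alphas_bar = np.linspace(1.0, 0.01, num_steps)
    k = int(t_start * (num_steps - 1))  # start partway through the chain
    a = alphas_bar[k]
    # Forward-noise the guide image to timestep k (DDPM forward marginal).
    x = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)
    # Reverse process from k back to 0 using the provided update rule.
    for i in range(k, 0, -1):
        x = denoise_step(x, alphas_bar[i], alphas_bar[i - 1])
    return x

# Placeholder denoiser: rescales samples toward the clean-signal variance.
def toy_step(x, a_t, a_prev):
    return x * np.sqrt(a_prev / a_t)

out = sdedit(np.ones((4, 4)), toy_step)
print(out.shape)  # (4, 4)
```

Choosing `t_start` trades faithfulness to the guide (small values) against edit strength (large values), which is the key knob in SDEdit-style composition.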
If you would like to cite this series in an academic context, you can use this BibTeX snippet:
@misc{somepalli2026latentscaffold,
  author = {Somepalli, Gowthami and Somepalli, Sravani},
  title = {Latent Scaffolding Image Generation Models},
  url = {https://somepago.github.io/posts/latent-scaffolding-series/},
  year = {2026}
}