Multimodal AI researcher obsessed with how machines perceive, remember, and generate the world. Now at World Labs, working on world models. Based in Mountain View, CA.
PhD from UMD focused on diffusion model memorization — built memorization evals for diffusion models and CSD, a widely-used style similarity metric. Also built video understanding evals: CinePile, a long-video QA benchmark (Best Paper at CVPR 2024 SynCV), and ARGUS for hallucination/omission detection in dense captions. Friends call me the "Evals Shill" for a reason.
Before academia: did SGD in industry in India for a while; IIT Madras alum; founded a Fashion AI startup that was way too early to the party.
Open to collabs on generative modeling (evals + post-training). Hit me up: gowthami [dot] somepalli [at] gmail.com
// featured writing
Latent Scaffolding: Z-Image Is Secretly an I2I Model
A simple architectural splice unlocks zero-shot image-to-image variations with no training.
Latent Scaffolding, Part 2: Token Dropout for Diverse Image Variations
Vision-only token dropout solves mode collapse. Hunting attention sinks and finding two orthogonal knobs for diversity.