Sparse autoencoders (SAEs) have emerged as a promising approach to reverse engineering large language models (LLMs). They have been shown to decompose intermediate LLM representations, which are often not directly interpretable, into sparse sums of interpretable features, facilitating better control and subsequent analysis. Thus far, similar approaches have remained under-explored for text-to-image models.
We investigated the possibility of using SAEs to learn interpretable features for few-step text-to-image diffusion models such as SDXL Turbo. To this end, we trained SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net.
We find that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. In particular, we find blocks specializing in image composition, adding local details, color, illumination, and style. Our work thus represents an encouraging first step towards better understanding the internals of generative text-to-image models.
Sparse Autoencoders (SAEs) are two-layer neural networks designed to reconstruct input vectors while enforcing sparsity constraints in the hidden layer (i.e., only a few neurons are activated at once). This sparsity enables SAEs to extract interpretable features, which have been shown to be useful in large language models. Given this success, we explored the potential of SAEs in the domain of text-to-image generative models.
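To make this concrete, here is a minimal PyTorch sketch of a sparse autoencoder with a top-k constraint. The class, the layer sizes, and the top-k mechanism are illustrative choices for this post, not necessarily the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Two-layer autoencoder with a top-k sparsity constraint on the hidden layer."""

    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        self.k = k  # number of hidden units allowed to stay active per input

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        acts = torch.relu(self.encoder(x))
        # keep only the k strongest activations, zero out the rest
        topk = torch.topk(acts, self.k, dim=-1)
        sparse = torch.zeros_like(acts)
        sparse.scatter_(-1, topk.indices, topk.values)
        return sparse

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # reconstruct the input from its sparse code
        return self.decoder(self.encode(x))
```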
We used SDXL Turbo, a model that integrates a transformer-powered U-net architecture to process input prompts for text-to-image generation. We selected four specific transformer blocks within the U-net -- down.2.1, mid.0, up.0.0, and up.0.1 -- to analyze their roles in the generative process. Using a dataset of 1.5 million textual prompts from LAION-COCO, we stored intermediate feature-map representations from these blocks and trained four sparse autoencoders to reconstruct them. The resulting learned features were not only interpretable but also had a controllable causal effect on generated images. Manipulating specific features revealed distinct transformer-block specializations and enabled control over the generated images, such as adjusting semantic content or enhancing visual details.
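For readers who want to reproduce this kind of data collection, the sketch below shows one way to capture the blocks' updates (output minus input) with PyTorch forward hooks in diffusers. The module paths in `blocks`, the prompt, and the dtype settings are placeholders and may not match the exact diffusers layout or our training pipeline.

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# placeholder attribute paths for the four blocks discussed in this post
blocks = {
    "down.2.1": pipe.unet.down_blocks[2].attentions[1],
    "mid.0": pipe.unet.mid_block.attentions[0],
    "up.0.0": pipe.unet.up_blocks[0].attentions[0],
    "up.0.1": pipe.unet.up_blocks[0].attentions[1],
}

captured = {}

def make_capture_hook(name):
    def hook(module, inputs, output):
        # diffusers may return a dataclass with .sample or a tuple, depending on return_dict
        out = output.sample if hasattr(output, "sample") else output[0]
        # store the update the block adds to its input, one vector per spatial position
        update = (out - inputs[0]).detach()
        captured[name] = update.permute(0, 2, 3, 1).flatten(0, 2).cpu()
    return hook

handles = [blk.register_forward_hook(make_capture_hook(n)) for n, blk in blocks.items()]
pipe("a kitchen with a large island", num_inference_steps=1, guidance_scale=0.0)
for h in handles:
    h.remove()
```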
For every feature, we collected the images that activate it most strongly. Interestingly, these images reflect similar objects or concepts, like folders (down.2.1.#0), kitchen islands (down.2.1.#1), and so on. Note that the activation regions range from quite scattered in blocks down.2.1 and up.0.1 to more localized and concentrated in mid.0 and up.0.0.
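A rough sketch of how such top-activating images can be collected, assuming an SAE and the stored updates from the previous step; the helper name and the max-over-positions aggregation are our illustrative choices, not the exact procedure behind the figures.

```python
import torch

def top_images_for_feature(sae, updates, image_ids, feature_idx, n=9):
    """Return the ids of the n images whose updates activate one feature most strongly.

    `updates` is an (N, d_model) tensor of stored block updates and `image_ids`
    is a length-N list mapping each row back to its source image.
    """
    with torch.no_grad():
        acts = sae.encode(updates)[:, feature_idx]  # one activation per spatial position
    per_image = {}
    for img_id, a in zip(image_ids, acts.tolist()):
        # aggregate spatial positions by keeping the maximum activation per image
        per_image[img_id] = max(per_image.get(img_id, 0.0), a)
    return sorted(per_image, key=per_image.get, reverse=True)[:n]
```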
Now, let's break into the image generation process and manipulate feature values. You can increase a feature's activation using the sliders below and observe its effect on the generated image.
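Under the hood, such a slider can be thought of as a forward hook that re-encodes the block's update with the SAE, scales one feature, and writes the edited update back into the forward pass. The sketch below relies on the same assumptions as the capture hook above (placeholder module paths, an SAE named `sae_up_0_1` living on the UNet's device) and is not the exact code behind the demo.

```python
import torch

def make_intervention_hook(sae, feature_idx, strength):
    def hook(module, inputs, output):
        out = output.sample if hasattr(output, "sample") else output[0]
        update = out - inputs[0]                     # residual update added by the block
        b, c, h, w = update.shape
        flat = update.permute(0, 2, 3, 1).reshape(-1, c)
        z = sae.encode(flat.float())
        z[:, feature_idx] *= strength                # the "slider": scale one feature's activation
        edited = sae.decoder(z).to(update.dtype).reshape(b, h, w, c).permute(0, 3, 1, 2)
        new_out = inputs[0] + edited                 # rebuild the block output from the edited update
        if hasattr(output, "sample"):
            output.sample = new_out
            return output
        return (new_out,) + tuple(output[1:])
    return hook

# example: amplify one (arbitrarily chosen) feature of the up.0.1 SAE during generation
handle = blocks["up.0.1"].register_forward_hook(make_intervention_hook(sae_up_0_1, 123, 5.0))
image = pipe("a cat sitting on a sofa", num_inference_steps=1, guidance_scale=0.0).images[0]
handle.remove()
```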
Up until now, we have seen how features affect images generated from a meaningful prompt. But do the features make sense independently of any textual description? To investigate this, we generated images from an empty prompt while turning off all the features but one. Here is what happened:
Features from down.2.1 and up.0.1 still have a visible effect even when turned on during empty-prompt generation: down.2.1 interventions result in meaningful images, whereas up.0.1 interventions generate texture-like images.
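A rough sketch of this empty-prompt experiment, under the same assumptions as the hooks above (placeholder module paths, SAEs on the UNet's device); the feature index and activation value are arbitrary.

```python
import torch

def make_single_feature_hook(sae, feature_idx, value):
    def hook(module, inputs, output):
        out = output.sample if hasattr(output, "sample") else output[0]
        b, c, h, w = out.shape
        # discard the block's own update and rebuild it from a single SAE feature
        z = torch.zeros(b * h * w, sae.decoder.in_features, device=out.device)
        z[:, feature_idx] = value                    # the only feature left "on"
        edited = sae.decoder(z).to(out.dtype).reshape(b, h, w, c).permute(0, 3, 1, 2)
        new_out = inputs[0] + edited
        if hasattr(output, "sample"):
            output.sample = new_out
            return output
        return (new_out,) + tuple(output[1:])
    return hook

# empty prompt, one down.2.1 feature switched on
handle = blocks["down.2.1"].register_forward_hook(make_single_feature_hook(sae_down_2_1, 42, 10.0))
image = pipe("", num_inference_steps=1, guidance_scale=0.0).images[0]
handle.remove()
```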
We trained SAEs on SDXL Turbo's intermediate representations and showed that the learned features are interpretable and causal.
Moreover, the experiments demonstrate that the blocks in SDXL Turbo's U-net have different roles in the image generation process.
Our hypothesis is:
- down.2.1 defines what will be shown on the images (compositional block)
- mid.0 assigns some low-level abstract semantics
- up.0.0 adds details (details block)
- up.0.1 generates colors and textures (style block)
Now it is your turn! Try our demo application and explore 20K+ features.
@misc{surkov2024unpackingsdxlturbointerpreting,
title={Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders},
author={Viacheslav Surkov and Chris Wendler and Mikhail Terekhov and Justin Deschenaux and Robert West and Caglar Gulcehre},
year={2024},
eprint={2410.22366},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2410.22366},
}