Unboxing SDXL Turbo

How Sparse Autoencoders Unlock the Inner Workings of Text-to-Image Models

We decomposed the generative process of SDXL Turbo into interpretable features using Sparse Autoencoders. The features are causal and can be manipulated to control the generated images.

Abstract

For large language models (LLMs), sparse autoencoders (SAEs) have been shown to decompose intermediate representations, which are often not directly interpretable, into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image diffusion model. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net in its 1-step setting. Interestingly, we find that they generalize to 4-step SDXL Turbo and even to the multi-step SDXL base model (i.e., a different model) without additional training. In addition, we show that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. We do so by creating RIEBench, a representation-based image editing benchmark, for editing images during generation by turning individual SAE features on and off. This allows us to track which transformer blocks' features are the most impactful depending on the edit category. Our work is the first investigation of SAEs for interpretability in text-to-image diffusion models and our results establish SAEs as a promising approach for understanding and manipulating the internal mechanisms of text-to-image models.

How it works

Sparse Autoencoders (SAEs) are two-layer neural networks designed to reconstruct input vectors while enforcing sparsity constraints in the hidden layer (i.e., only a few neurons are activated at once). This sparsity enables SAEs to extract interpretable features, which have been shown to be useful in large language models. Given this success, we explored the potential of SAEs in the domain of text-to-image generative models.
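To make this concrete, here is a minimal sketch of a top-k sparse autoencoder of the kind described above; the input dimensionality, dictionary size, k, and training details below are illustrative assumptions rather than the exact configuration from the paper.

```python
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    """Minimal k-sparse autoencoder: keep only the k largest latent activations."""

    def __init__(self, d_model: int, expansion: int = 4, k: int = 10):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, expansion * d_model)
        self.decoder = nn.Linear(expansion * d_model, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoder(x))
        # Zero out everything except the k largest activations per input vector.
        topk = torch.topk(z, self.k, dim=-1)
        return torch.zeros_like(z).scatter(-1, topk.indices, topk.values)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))


# Training sketch: x stands in for a batch of stored feature-map vectors.
sae = TopKSAE(d_model=1280)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(64, 1280)
opt.zero_grad()
loss = nn.functional.mse_loss(sae(x), x)
loss.backward()
opt.step()
```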


We used SDXL Turbo, a few-step text-to-image diffusion model whose denoising U-net relies on transformer blocks to incorporate the input prompt. We selected four transformer blocks within the U-net -- down.2.1, mid.0, up.0.0, and up.0.1 -- to analyze their roles in the generative process. Using a dataset of 1.5 million textual prompts from LAION-COCO, we stored the intermediate feature-map updates produced by these blocks and trained four sparse autoencoders to reconstruct them. The resulting learned features were not only interpretable but also had a controllable causal effect on generated images. Manipulating specific features revealed distinct transformer block specializations and enabled control over the generated images, such as adjusting semantic content or enhancing visual details.
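To collect the training data, the feature maps can be captured with forward hooks. The sketch below is one plausible way to do this with diffusers: the module paths are our guess at how down.2.1, mid.0, up.0.0, and up.0.1 map onto the SDXL U-net, and storing each block's update (output minus input) follows the description above.

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")

# Assumed mapping of the blocks in this post onto diffusers module paths.
BLOCKS = {
    "down.2.1": "down_blocks.2.attentions.1",
    "mid.0": "mid_block.attentions.0",
    "up.0.0": "up_blocks.0.attentions.0",
    "up.0.1": "up_blocks.0.attentions.1",
}

captured = {name: [] for name in BLOCKS}


def make_capture_hook(name):
    def hook(module, inputs, output):
        hidden_in = inputs[0]
        hidden_out = output[0] if isinstance(output, tuple) else output.sample
        # Store the block's update (output minus input) as SAE training data.
        captured[name].append((hidden_out - hidden_in).detach().float().cpu())
    return hook


modules = dict(pipe.unet.named_modules())
handles = [modules[path].register_forward_hook(make_capture_hook(name))
           for name, path in BLOCKS.items()]

# 1-step SDXL Turbo generation; each prompt yields one feature map per block.
pipe("a photo of a kitchen island", num_inference_steps=1, guidance_scale=0.0)

for h in handles:
    h.remove()
```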

Where features are active

For every feature, we collected the images that activate it most strongly. Interestingly, the images for a given feature reflect similar objects or concepts, like folders (down.2.1 #0), kitchen islands (down.2.1 #1), and so on. Note that the activation regions range from quite scattered in the blocks down.2.1 and up.0.1 to more localized and concentrated in mid.0 and up.0.0.

Top-activating image grids for three example features per block: down.2.1 (#0, #1, #2), mid.0 (#3, #50, #609), up.0.1 (#0, #1, #2), and up.0.0 (#0, #1, #2).

These features are highly active during the image generation process. Notably, they appear to encode both concrete objects and abstract concepts.

Features are causal

Now, let's intervene in the image generation process and manipulate feature values. You can increase a feature's activation using the sliders below and observe the effect on the generated image.

Precomputed examples: down.2.1 #1802, down.2.1 #89, up.0.0 #4473, down.2.1 #1678, up.0.1 #4977, up.0.0 #4907, and mid.0 #4227.
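For readers curious about the mechanics behind these sliders, here is a rough sketch of one way such an intervention can be implemented, reusing the `sae` and `pipe` from the sketches above: reconstruct the block's update through the SAE, boost the chosen feature's code by the slider value, and feed the reconstruction back into the U-net. The hook plumbing, module path, and strength value are assumptions.

```python
import torch


def make_intervention_hook(sae, feature_idx: int, strength: float):
    """Replace a block's update with its SAE reconstruction plus a boosted feature."""

    def hook(module, inputs, output):
        hidden_in = inputs[0]
        hidden_out = output[0] if isinstance(output, tuple) else output.sample
        update = hidden_out - hidden_in                    # (B, C, H, W)

        b, c, h, w = update.shape
        flat = update.permute(0, 2, 3, 1).reshape(-1, c)   # one vector per position
        codes = sae.encode(flat.float())
        codes[:, feature_idx] += strength                  # the "slider" value
        recon = sae.decoder(codes).to(update.dtype)
        new_out = hidden_in + recon.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return (new_out,) if isinstance(output, tuple) else new_out

    return hook


# Example: boost feature #4473 of the up.0.0 SAE during generation.
sae = sae.to("cuda").eval()
module = dict(pipe.unet.named_modules())["up_blocks.0.attentions.0"]
handle = module.register_forward_hook(make_intervention_hook(sae, 4473, strength=20.0))
image = pipe("a photo of a dog", num_inference_steps=1, guidance_scale=0.0).images[0]
handle.remove()
```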

Generalization to multi-step

Interestingly, the features we trained on 1-step SDXL Turbo generalize to 4-step SDXL Turbo and even to the multi-step SDXL base model without additional training. Here you can explore the effect of adding features across different ranges of timesteps. Moving the slider changes the range of timesteps at which the intervention is performed in SDXL base: we start from [0, 25] and move the left endpoint to obtain [5, 25], [10, 25], [15, 25], and [20, 25].
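One possible way to express this timestep-restricted intervention is to count how often the hooked block is called and only modify its output inside the chosen window; this counter-based bookkeeping is our own simplification rather than the exact implementation, and it reuses `make_intervention_hook`, `sae`, and `module` from the previous sketch.

```python
class TimestepWindowHook:
    """Apply a base hook only for denoising steps in [start, end)."""

    def __init__(self, base_hook, start: int, end: int):
        self.base_hook, self.start, self.end, self.step = base_hook, start, end, 0

    def __call__(self, module, inputs, output):
        step, self.step = self.step, self.step + 1
        if self.start <= step < self.end:
            return self.base_hook(module, inputs, output)
        return output  # leave the block untouched outside the window


# Intervene only during steps [5, 25) of a 25-step run.
windowed = TimestepWindowHook(make_intervention_hook(sae, 4473, 20.0), start=5, end=25)
handle = module.register_forward_hook(windowed)
```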

FLUX Schnell

We also trained a SAE on FLUX Schnell. Sliding to the right increases the intervention strength (= feature coefficient).

FLUX dev

Our FLUX Schnell SAE also transfers to FLUX dev. Sliding to the right increases the intervention strength (= feature coefficient).

RIEBench

We introduce a new representation-based image editing benchmark, RIEBench. The key idea is to transfer features between parallel forward passes, as shown below. Dragging the slider varies the number of features transferred within the indicated blocks. We precomputed results for a SAE with k=10 and expansion factor 4, transporting 1, 2, 4, 8, 16, 32, or 64 features.
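As a rough sketch of this feature-transfer idea, under our own assumptions about how the transferred features are chosen (here, ranked by total activation in the source pass): encode both blocks' updates with the same SAE, copy the selected source codes into the target's codes, and decode the result for the target pass.

```python
import torch


def transfer_features(sae, source_update, target_update, n_features: int):
    """Copy the n strongest SAE features of the source update into the target update."""
    b, c, h, w = target_update.shape
    src = sae.encode(source_update.permute(0, 2, 3, 1).reshape(-1, c).float())
    tgt = sae.encode(target_update.permute(0, 2, 3, 1).reshape(-1, c).float())

    # Rank features by their total activation in the source forward pass.
    top = torch.topk(src.sum(dim=0), n_features).indices
    tgt[:, top] = src[:, top]                              # transplant the features

    recon = sae.decoder(tgt).to(target_update.dtype)
    return recon.reshape(b, h, w, c).permute(0, 3, 1, 2)
```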

The same intervention can be performed directly on neurons. Dragging the slider varies the number of neurons transferred within the indicated blocks. We precomputed results for transporting 1000, 2000, 4000, 8000, 16000, 32000, and 64000 neurons. For more information, check out our RIEBench repository.

Single feature generates meaningful image

Up until now, we have explored how features affect images generated from a meaningful prompt. But would the features make sense independently of any textual description? To investigate this, we generated images from an empty prompt while turning off all the features but one. Here is what happened:

Example single-feature, empty-prompt generations are shown for down.2.1 and up.0.1.

Both down.2.1 and up.0.1 features produce coherent outputs even when used during empty-prompt generation: down.2.1 interventions result in meaningful images, whereas up.0.1 interventions generate texture-like images.
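In the same hook style as above (again with our assumed plumbing), this empty-prompt experiment can be sketched by masking every SAE code except one before decoding the block's update. Depending on the feature, one may also need to set the kept code to a fixed positive value, since a feature that is inactive for the empty prompt would otherwise contribute nothing.

```python
import torch


def make_single_feature_hook(sae, feature_idx: int):
    """Keep only one SAE feature: zero every other code before decoding the update."""

    def hook(module, inputs, output):
        hidden_in = inputs[0]
        hidden_out = output[0] if isinstance(output, tuple) else output.sample
        update = hidden_out - hidden_in
        b, c, h, w = update.shape
        codes = sae.encode(update.permute(0, 2, 3, 1).reshape(-1, c).float())
        mask = torch.zeros_like(codes)
        mask[:, feature_idx] = 1.0                         # keep a single feature
        recon = sae.decoder(codes * mask).to(update.dtype)
        new_out = hidden_in + recon.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return (new_out,) if isinstance(output, tuple) else new_out

    return hook


# Empty-prompt generation with only feature #0 of the down.2.1 SAE active.
module = dict(pipe.unet.named_modules())["down_blocks.2.attentions.1"]
handle = module.register_forward_hook(make_single_feature_hook(sae, 0))
image = pipe("", num_inference_steps=1, guidance_scale=0.0).images[0]
handle.remove()
```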

Empty-prompt generations as previews

The empty-prompt generations can be used as a visual representation or preview of the features. Here is the brush animation from the beginning with the corresponding preview images. In a recent blog post, Goodfire used our technique in their UMAP visualization, where our zero-shot intervention previews appear when hovering over the points.

Try it yourself!

We trained SAEs on SDXL Turbo's intermediate representations and showed that the learned features are interpretable and causal. Moreover, they demonstrate that the transformer blocks in SDXL Turbo's U-net play different roles in the image generation process. Our hypothesis is:
- down.2.1 defines what will be shown on the images (compositional block)
- mid.0 assigns some low-level abstract semantics
- up.0.0 adds details (details block)
- up.0.1 generates colors and textures (style block)

Now it is your turn! Try our demo application and explore 20K+ features.

BibTeX

The first version of our paper, which is the first work leveraging SAEs within text-to-image diffusion models.

@misc{surkov2024unpackingsdxlturbointerpreting,
      title={Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders}, 
      author={Viacheslav Surkov and Chris Wendler and Mikhail Terekhov and Justin Deschenaux and Robert West and Caglar Gulcehre},
      year={2024},
      eprint={2410.22366},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.22366}, 
}

Updated version of the paper, showcasing multi-step generalization and a few more nice findings that we made along the way.

@misc{surkov2025onestepenoughsparseautoencoders,
      title={One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models}, 
      author={Viacheslav Surkov and Chris Wendler and Antonio Mari and Mikhail Terekhov and Justin Deschenaux and Robert West and Caglar Gulcehre and David Bau},
      year={2025},
      eprint={2410.22366},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.22366}, 
}