
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

Kwanyoung Kim*, Byeongsu Sim
Samsung Research
*First and Corresponding Author

[Main figure]

Qualitative comparison. (Top) Guidance sampling methods (CFG, PAG, SEG). (Middle) Guidance-distilled models (DMD2, SDXL-Lightning, Hyper-SDXL). (Bottom) Other backbones, such as Stable Diffusion 1.5 and SANA, combined with our proposed method, PLADIS (Ours). PLADIS is compatible with all guidance techniques and also supports guidance-distilled models across various backbones. It produces plausible generations with improved text alignment without any additional training or extra inference.

Abstract

Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or extra neural function evaluations (NFEs), making them incompatible with guidance-distilled models, and they rely on heuristic approaches that require identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution.

Concept Overview of PLADIS

Existing guidance methods require extra inference passes for undesired paths, such as null conditions or self-attention perturbed with an identity matrix or blurred attention weights. In contrast, PLADIS avoids additional inference paths by computing both sparse and dense attention within all cross-attention modules and combining them with a scaling factor λ. Moreover, PLADIS can be easily integrated with existing guidance approaches, and even with guidance-distilled models, by simply replacing the cross-attention module; a code sketch of this idea follows below.
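
As a rough illustration, the sketch below replaces a standard softmax cross-attention with an extrapolation between dense and sparse attention weights. It is a minimal sketch, not the official implementation: the function names (`sparsemax`, `pladis_cross_attention`), the use of sparsemax as the sparse transform (the paper's sparse operator may differ, e.g. an α-entmax variant), and the exact extrapolation form `dense + λ·(sparse − dense)` (chosen so that λ = 0 recovers plain softmax attention, consistent with the description of λ later on this page) are our assumptions.

```python
import torch

def sparsemax(logits: torch.Tensor) -> torch.Tensor:
    # Sparsemax over the last dimension (Martins & Astudillo, 2016):
    # a projection onto the simplex that yields exactly-zero attention weights.
    z, _ = torch.sort(logits, dim=-1, descending=True)
    cssv = z.cumsum(dim=-1) - 1.0
    k = torch.arange(1, logits.size(-1) + 1, device=logits.device, dtype=logits.dtype)
    support = k * z > cssv                   # entries kept in the support
    k_z = support.sum(dim=-1, keepdim=True)  # support size (always >= 1)
    tau = cssv.gather(-1, k_z - 1) / k_z.to(logits.dtype)
    return torch.clamp(logits - tau, min=0.0)

def pladis_cross_attention(q, k, v, lam: float = 2.0):
    # q: image-side queries, k/v: text-side keys/values, shape (batch, heads, tokens, dim).
    scale = q.size(-1) ** -0.5
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    dense = scores.softmax(dim=-1)            # standard (dense) attention weights
    sparse = sparsemax(scores)                # sparse counterpart of the same scores
    attn = dense + lam * (sparse - dense)     # extrapolation; lam = 0 -> plain softmax
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)
```

Because only the cross-attention weights change, such a replacement can be dropped into an existing sampler without additional denoiser evaluations, which is consistent with the training-free, no-extra-NFE claim above.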

[Concept overview figure]

Why Does PLADIS Work?

Our study reveals that the sparse attention mechanism within diffusion models significantly contributes to noise robustness. In contrast to conventional modern Hopfield networks, sparse Hopfield networks exhibit enhanced resilience to noise. Remarkably, the update rule derived from the sparse Hopfield energy is mathematically identical to that used in sparse attention. Furthermore, we derive a retrieval error bound for the dynamics of the intermediate case and establish the noise robustness of those cases as well.
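
For reference, the retrieval dynamics behind this connection can be sketched as follows, where Ξ denotes the stored patterns, ξ the (possibly noisy) query pattern, and β an inverse-temperature parameter; the notation here is ours and is only a sketch of the standard dense update and its sparse counterpart, not the paper's exact formulation.

```latex
\[
\xi^{\mathrm{new}} = \Xi\,\mathrm{softmax}\!\left(\beta\,\Xi^{\top}\xi\right)
\qquad \text{(dense modern Hopfield update)}
\]
\[
\xi^{\mathrm{new}} = \Xi\,\mathrm{sparsemax}\!\left(\beta\,\Xi^{\top}\xi\right)
\qquad \text{(sparse Hopfield update, identical in form to sparse attention)}
\]
```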

In text-to-image diffusion, the cross-attention mechanism operates on query, key, and value matrices derived from the noisy image features and the text prompt. Because the diffusion process corrupts the image with Gaussian noise, the query matrix is naturally perturbed and therefore benefits from the robustness offered by sparse attention. This advantage leads to improved text alignment and overall sample quality.
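
To make this concrete, the toy snippet below perturbs a query with Gaussian noise and measures how much the dense (softmax) and sparse (sparsemax) attention weights move. It reuses the `sparsemax` helper from the sketch above and is purely an illustrative experiment of ours, not a result from the paper; the noise level, seed, and shapes are arbitrary choices.

```python
import torch

torch.manual_seed(0)
d, n_keys = 64, 16
q = torch.randn(1, 1, 1, d)        # one image-side query vector
k = torch.randn(1, 1, n_keys, d)   # text-side keys

def weights(query, transform):
    # Attention weights over the keys under a given normalization transform.
    scores = torch.einsum("bhqd,bhkd->bhqk", query, k) * d ** -0.5
    return transform(scores)

for name, transform in [("softmax", lambda s: s.softmax(-1)),
                        ("sparsemax", sparsemax)]:
    clean = weights(q, transform)
    noisy = weights(q + 0.3 * torch.randn_like(q), transform)  # Gaussian perturbation
    drift = (noisy - clean).abs().sum().item()                 # total change in weights
    print(f"{name:9s} attention drift under query noise: {drift:.3f}")
```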

Effectiveness of PLADIS

Qualitative comparison varying the scale λ. As λ increases, images exhibit improved plausibility and enhanced text alignment, but too high a value leads to smoother textures and potential artifacts, similar to those seen with CFG. PLADIS is applied whenever λ is greater than 0. In our configuration, λ is set to 2.0.

[Figure: effect of the scale λ]
Qualitative Results of PLADIS

Qualitative Results of PLADIS for Boosting Guidance Sampling Approaches

[Figures: CFG, PAG, SEG]

Qualitative Results of PLADIS for Boosting Guidance-Distilled Models

[Figures: One-Step Sampling, Four-Step Sampling]

Qualitative Results of PLADIS for Other Backbones

[Figures: Stable Diffusion 1.5, SANA]