The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring multimodal knowledge to pixel-level classification.
However, existing approaches still struggle to leverage pre-trained CLIP knowledge to closely align text embeddings with pixel embeddings.
To address this issue, we propose OTSeg, a novel multimodal attention mechanism designed to enhance the ability of multiple text prompts to match their associated pixel embeddings.
We first propose Multi-Prompts Sinkhorn (MPS), based on the Optimal Transport (OT) algorithm, which guides multiple text prompts to selectively focus on different semantic features within image pixels.
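As a rough illustration of the underlying mechanism, the sketch below applies standard Sinkhorn-Knopp iterations to a prompt-pixel similarity matrix; the function name, tensor shapes, uniform marginals, and hyperparameters (`eps`, `n_iters`) are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def multi_prompt_sinkhorn(sim, eps=0.05, n_iters=50):
    """Minimal sketch of a Sinkhorn-based multi-prompt assignment.

    sim: (P, N) similarity scores between P text prompts and N pixel
    embeddings for one class. Returns a transport plan of the same shape
    whose marginals are uniform, so each prompt is softly assigned to a
    distinct subset of pixels rather than all prompts collapsing onto
    the same pixels.
    """
    P, N = sim.shape
    # Entropic OT kernel: higher similarity corresponds to lower cost.
    K = torch.exp(sim / eps)                  # (P, N)
    a = torch.full((P,), 1.0 / P)             # uniform marginal over prompts
    b = torch.full((N,), 1.0 / N)             # uniform marginal over pixels
    u = torch.ones(P)
    # Sinkhorn-Knopp: alternate row and column scaling until convergence.
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    # Transport plan: diag(u) @ K @ diag(v).
    return u.unsqueeze(1) * K * v.unsqueeze(0)
```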
Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce an extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
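For concreteness, the sketch below shows one plausible way Sinkhorn normalization could stand in for the row-wise softmax of cross-attention, in the spirit of Sinkformers; `mps_attention` and its rescaling step are hypothetical and reuse the `multi_prompt_sinkhorn` sketch above, rather than reproducing the paper's MPSA module.

```python
def mps_attention(q_text, k_pix, v_pix, eps=0.05, n_iters=50):
    """Hypothetical cross-attention with Sinkhorn-normalized scores.

    q_text: (P, d) text-prompt queries; k_pix, v_pix: (N, d) pixel
    keys/values. The usual softmax over pixel scores is replaced by a
    doubly-normalized transport plan, as in Sinkformers.
    """
    d = q_text.shape[-1]
    scores = (q_text @ k_pix.t()) / d ** 0.5            # (P, N) raw scores
    plan = multi_prompt_sinkhorn(scores, eps, n_iters)  # doubly normalized
    # With a uniform prompt marginal, each row of the plan sums to 1/P;
    # rescale so every prompt's attention weights sum to 1.
    attn = plan * plan.shape[0]
    return attn @ v_pix                                 # (P, d) attended features
```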