ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts

Under review. Code will be released soon.

Abstract

The recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has shown great promise for zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/fine-tuning of the CLIP module. Here, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport. In particular, we introduce a Multiple Prompt Optimal Transport Solver (MPOT), which learns an optimal mapping between multiple text prompts and the visual feature maps of the frozen image encoder's hidden layers. This mapping encourages each of the multiple text prompts to focus on distinct visual semantic attributes. Through extensive experiments on benchmark datasets, we show that our method achieves state-of-the-art (SOTA) performance over existing Zero-shot Semantic Segmentation (ZS3) approaches.
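As background for how text prompts are matched against frozen image embeddings, the sketch below computes CLIP-style pixel-text score maps as cosine similarity between prompt embeddings and per-pixel features. This is a minimal illustration, not the paper's implementation; the function name, shapes, and use of NumPy are our own assumptions.

```python
import numpy as np

def pixel_text_score_map(pixel_feats, text_embeds):
    """Per-prompt pixel-text score maps via cosine similarity.

    pixel_feats: (H, W, D) features from a frozen image encoder layer.
    text_embeds: (K, D) embeddings of K learned text prompts.
    Returns a (K, H, W) score map, one map per prompt.
    (Illustrative sketch only; shapes and names are assumptions.)
    """
    # L2-normalize both sides so the dot product is cosine similarity
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    # Contract the feature dimension: (H,W,D) x (K,D) -> (K,H,W)
    return np.einsum("hwd,kd->khw", p, t)
```

Without a mechanism like MPOT, all K maps for a class tend to look alike, which is the failure mode the figure below illustrates.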

Effectiveness of ZegOT

Visualization of text prompt-to-pixel alignment. Without the proposed MPOT, the multiple prompts P for each class collapse together, and their score maps resemble one another. With MPOT, each prompt P is diversely projected, and each prompt-related score map focuses on different semantic attributes.


Multi Prompt OT Solver (MPOT)

A schematic diagram of the Multi Prompt Optimal Transport (MPOT) solver. The pixel-text score matrix from each image encoder layer is refined through the optimal transport plan.
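To make the refinement step concrete, here is a minimal entropy-regularized optimal transport (Sinkhorn) sketch between K prompts and M pixels with uniform marginals. This is a generic Sinkhorn implementation under our own assumptions (uniform masses, `eps`, iteration count), not the paper's exact solver.

```python
import numpy as np

def sinkhorn_transport_plan(cost, eps=0.05, n_iters=200):
    """Entropy-regularized OT plan via Sinkhorn iterations.

    cost: (K, M) cost matrix between K text prompts and M pixels
          (e.g. 1 - cosine similarity). Uniform marginals are an
          illustrative assumption.
    Returns a (K, M) plan T whose rows sum to 1/K and columns to 1/M.
    """
    K, M = cost.shape
    mu = np.full(K, 1.0 / K)      # mass over prompts
    nu = np.full(M, 1.0 / M)      # mass over pixels
    G = np.exp(-cost / eps)       # Gibbs kernel
    u, v = np.ones(K), np.ones(M)
    for _ in range(n_iters):
        u = mu / (G @ v)          # match row marginals
        v = nu / (G.T @ u)        # match column marginals
    return u[:, None] * G * v[None, :]
```

In this view, the plan reweights each prompt's score map so that different prompts transport their mass to different pixels, which is how MPOT encourages prompts to attend to distinct semantic attributes; the exact cost design and layer choices follow the paper.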

Analysis of MPOT

In the baseline method without MPOT, the learned text prompts P show tightly grouped distributions within their respective class clusters (indicated by different colors in the left image). In contrast, our ZegOT with MPOT produces dispersed distributions, with each prompt cluster evenly spread (indicated by gray circles in the right image).


Qualitative Result of ZegOT

PASCAL Context

Qualitative zero-shot segmentation results on the PASCAL Context dataset. The yellow tag indicates seen classes, while the green tag indicates unseen classes. Our ZegOT segments unseen semantic objects most accurately compared to previous SOTA methods.


COCO-Stuff164K

Qualitative zero-shot segmentation results on the COCO-Stuff164K dataset. The yellow tag indicates seen classes, while the green tag indicates unseen classes. Our ZegOT demonstrates superior performance in accurately categorizing classes and precisely delineating boundaries of unseen objects, unlike previous SOTA methods, which tend to misclassify the category (see red arrows).


Our ZegOT Results on Real-World Data

Qualitative zero-shot segmentation results on real-world data. The yellow tag indicates seen classes, while the green tag indicates unseen classes. We demonstrate that our ZegOT can be successfully applied to real-world data that were not available during training.
