PosSAM: Panoptic Open-vocabulary
Segment Anything

Johns Hopkins University, Qualcomm AI Research

Motivation & Contribution

  • SAM possesses exceptional spatial awareness and promptable segmentation capabilities but lacks class/semantic awareness and tends to over-segment objects into multiple regions.
  • Because the SAM decoder is deliberately lightweight, the SAM encoder carries out most of the segmentation work itself, making its features a strong foundation to build on.
  • Therefore, by unifying the SAM encoder with CLIP, we introduce PosSAM, an end-to-end trainable open-vocabulary panoptic segmentation framework that generates class-aware and instance-aware masks for a wide variety of visual concepts.
  • Further, we develop a novel Local Discriminative Pooling (LDP) module for unbiased open-vocabulary classification and a Mask-Aware Selective Ensembling (MASE) algorithm for robust real-world open-vocabulary segmentation (an illustrative sketch of the ensembling idea follows this list).
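
The exact MASE formulation is given in the paper; purely as an illustration of the idea, the sketch below combines in-vocabulary and CLIP-based class probabilities for each predicted mask, trusting one or the other more depending on the mask's predicted IoU. The function name ensemble_logits, the exponents alpha and beta, and the 0.5 threshold are hypothetical placeholders, not values from the paper.

  import torch

  def ensemble_logits(in_vocab_probs: torch.Tensor,
                      clip_probs: torch.Tensor,
                      iou_scores: torch.Tensor,
                      alpha: float = 0.4,
                      beta: float = 0.8) -> torch.Tensor:
      """Hypothetical mask-aware ensemble of two classifiers.

      in_vocab_probs: [N, C] probabilities from the in-vocabulary head.
      clip_probs:     [N, C] probabilities from CLIP text-image matching.
      iou_scores:     [N]    predicted mask quality (IoU) per mask query.
      alpha, beta:    geometric-mean exponents for low/high-quality masks.
      """
      # Trust the in-vocabulary head more when the predicted mask quality is
      # high (likely a well-localized, seen-class object), and lean on the
      # CLIP prediction more when quality is low (possibly a novel class).
      w = torch.where(iou_scores > 0.5,
                      torch.full_like(iou_scores, beta),
                      torch.full_like(iou_scores, alpha)).unsqueeze(-1)
      fused = in_vocab_probs ** w * clip_probs ** (1.0 - w)
      return fused / fused.sum(dim=-1, keepdim=True)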

PosSAM Framework

Overview of our PosSAM training pipeline. We first encode the input image with the SAM backbone to extract spatially rich features, which a Feature Pyramid Network turns into hierarchical multi-scale features; these are decoded into mask features used to predict class-agnostic masks. Concurrently, we train an IoU predictor that measures the quality of each mask. For classification, our proposed LDP module enriches discriminative CLIP features with class-agnostic SAM features, yielding unbiased open-vocabulary (OV) classification. These LDP features are then classified under standard open-vocabulary supervision, with ground-truth category labels embedded by the CLIP text encoder; a rough sketch of this mask-guided pooling is given below.
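
As an informal illustration of the mask-guided pooling step described above, the sketch below pools CLIP and SAM features under each class-agnostic mask and fuses them into a per-mask embedding. It assumes a PyTorch setting; the class name LDPSketch, the concatenation-plus-projection fusion, and the tensor shapes are our assumptions, not the authors' implementation.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class LDPSketch(nn.Module):
      """Illustrative sketch of mask-guided pooling that mixes CLIP and SAM features.

      This is NOT the paper's implementation; it only illustrates the idea of
      enriching discriminative CLIP features with class-agnostic SAM features
      before open-vocabulary classification.
      """

      def __init__(self, clip_dim: int, sam_dim: int, out_dim: int):
          super().__init__()
          self.proj = nn.Linear(clip_dim + sam_dim, out_dim)

      def forward(self, clip_feats, sam_feats, masks):
          # clip_feats: [B, C_clip, H, W]  discriminative CLIP image features
          # sam_feats:  [B, C_sam, H, W]   class-agnostic SAM encoder features
          # masks:      [B, Q, H, W]       soft class-agnostic masks in [0, 1]
          weights = masks / masks.sum(dim=(-2, -1), keepdim=True).clamp(min=1e-6)
          clip_pool = torch.einsum('bqhw,bchw->bqc', weights, clip_feats)
          sam_pool = torch.einsum('bqhw,bchw->bqc', weights, sam_feats)
          fused = self.proj(torch.cat([clip_pool, sam_pool], dim=-1))  # [B, Q, out_dim]
          # Normalized per-mask embeddings can then be matched against CLIP
          # text embeddings of the category names to produce class logits.
          return F.normalize(fused, dim=-1)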

Quantitative Results

Qualitative Results

Zero-shot panoptic segmentation from COCO to ADE20K on unseen classes, compared with recent SOTA approaches. Only novel classes are shown. PosSAM accurately segments objects it has never seen before, such as paintings, dishwashers, and exhausts, showing clear advantages over other SOTA methods.

BibTeX


@article{vs2024possam,
  title={PosSAM: Panoptic Open-vocabulary Segment Anything},
  author={VS, Vibashan and Borse, Shubhankar and Park, Hyojin and Das, Debasmit and Patel, Vishal and Hayat, Munawar and Porikli, Fatih},
  journal={arXiv preprint arXiv:2403.09620},
  year={2024}
}
Acknowledgement: The website template is taken from Nerfies.