Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation

CVPR 2026

¹University of Ljubljana, ²ETH Zurich, ³University of Zagreb

OVRCOAT segmentation examples.



Abstract

Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision–language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP’s region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% relative PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% relative PQ, respectively).
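The percentage gains in the abstract can be cross-checked against the PQ values in the Quantitative Results table: they match relative improvements over the strongest prior method on each dataset. The short sketch below reproduces that arithmetic. The helper name `relative_gain` and the dictionary layout are illustrative assumptions; the numbers are copied from the table (best prior PQ is MAFT+ on ADE20K and FC-CLIP on Mapillary and Cityscapes).

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (OVRCOAT PQ, best prior PQ) per dataset, taken from the results table.
pq_pairs = {
    "ADE20K": (28.6, 27.1),            # best prior: MAFT+
    "Mapillary Vistas": (19.6, 18.3),  # best prior: FC-CLIP
    "Cityscapes": (45.3, 44.0),        # best prior: FC-CLIP
}

# Round to one decimal, matching the precision used in the abstract.
gains = {
    name: round(relative_gain(new, base), 1)
    for name, (new, base) in pq_pairs.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain}% relative PQ")
```

Rounded to one decimal, this gives 5.5, 7.1 and 3.0 percent, which lines up with the figures quoted in the abstract.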


Qualitative Results on ADE20K



Quantitative Results

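Method    | ADE20K                                | Mapillary                             | Cityscapes                            | COCO
          | PQ           SQ           RQ          | PQ           SQ           RQ          | PQ           SQ           RQ          | PQ    SQ    RQ
MaskCLIP  | 15.1         70.5         19.2        | -            -            -           | -            -            -           | -     -     -
FreeSeg   | 16.3         71.8         21.6        | -            -            -           | -            -            -           | -     -     -
OPSNet    | 19.0         52.4         23.0        | -            -            -           | 41.5         67.5         50.0        | 52.4  83.5  62.1
ODISE     | 23.4         78.1         28.3        | 14.2         61.0         17.2        | 23.9         75.3         29.0        | 55.4  -     -
FC-CLIP   | 26.8         71.2         32.3        | 18.3         56.0         23.1        | 44.0         75.4         53.6        | 54.4  -     -
MAFT+     | 27.1         73.5         32.9        | 15.7         55.5         19.8        | 38.3         70.2         46.9        | 50.3  82.2  60.3
OVRCOAT   | 28.6 (+1.5)  77.3 (-0.8)  34.7 (+1.8) | 19.6 (+1.3)  65.7 (+4.7)  24.8 (+1.7) | 45.3 (+1.3)  78.7 (+3.3)  55.6 (+2.0) | 54.6  82.9  65.1

Values in parentheses are absolute differences from the best prior method in that column; "-" marks entries that are not listed.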
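For readers less familiar with the three metrics in the table, the following is a minimal sketch of the standard per-class panoptic quality definition, where PQ factors into segmentation quality (SQ, the mean IoU of matched segments) and recognition quality (RQ, an F1-style detection score). The function name `panoptic_quality` and the toy inputs are illustrative assumptions, not part of the OVRCOAT code. Benchmark scores such as those in the table are averaged over classes, so a row's PQ need not equal its SQ multiplied by its RQ.

```python
from typing import Sequence, Tuple


def panoptic_quality(
    matched_ious: Sequence[float], num_fp: int, num_fn: int
) -> Tuple[float, float, float]:
    """Return (PQ, SQ, RQ) for one class.

    matched_ious: IoU of each true-positive match (prediction and ground
                  truth segments with IoU > 0.5).
    num_fp:       unmatched predicted segments.
    num_fn:       unmatched ground-truth segments.
    """
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        # The class appears in neither predictions nor ground truth.
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / num_tp if num_tp else 0.0  # mean IoU of matches
    rq = num_tp / denom                                 # F1-style score
    return sq * rq, sq, rq


# Toy example: three matches, one spurious prediction, one missed segment.
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
print(f"PQ={pq:.3f} SQ={sq:.3f} RQ={rq:.3f}")  # PQ=0.600 SQ=0.800 RQ=0.750
```

In this reading, a method can gain PQ either by producing tighter masks (higher SQ) or by finding and correctly labeling more segments (higher RQ), which is why the table reports all three.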

Qualitative Results on Internet Images

Out-of-Vocabulary examples

BibTeX

@inproceedings{ovrcoat,
  title={Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation},
  author={Nikolay Kormushev and Josip Šarić and Matej Kristan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}