\begin{abstract}
Image segmentation is a fundamental task in computer vision. Mask2Former is a successful application of Transformers to image segmentation that achieves state-of-the-art results on all three segmentation tasks. Although its masked attention makes the model easy to train, it remains an initial attempt and suffers from problems such as inconsistent matching among decoder layers and meaningless decoder queries. Inspired by the success of DN-DETR in detection, we propose two improvements to the Mask2Former architecture: we use mask-guided training to solve the matching inconsistency problem, and we set the decoder queries to class embeddings. With these improvements, our \M improves performance on all three image segmentation tasks. On COCO, our model trained for $36$ epochs exceeds Mask2Former trained for $50$ epochs on instance and panoptic segmentation. Notably, our Mask3Former exceeds Mask2Former within half the training epochs on semantic segmentation. After convergence, we also achieve $+1.1$ AP, $+0.8$ PQ, and $+0.9$ mIoU on instance, panoptic, and semantic segmentation, respectively, with a ResNet-50 backbone. In addition, our method introduces only negligible computation during training and no extra computation during inference. Beyond the performance improvements, we provide detailed theoretical analysis and visualizations to demonstrate the effectiveness of our method. Our code will be released after blind review.
\end{abstract}