\begin{abstract}
Image segmentation is a fundamental task in computer vision. Mask2Former is a successful application of the Transformer to image segmentation and pushes all three segmentation tasks to the state of the art. Although its masked attention makes the model easy to train, it is still an initial attempt that suffers from problems such as inconsistent matching among decoder layers and meaningless decoder queries. To address these problems, we propose two improvements to the Mask2Former architecture: we use mask-guided training and set the decoder queries as class embeddings to solve the matching inconsistency problem. With these improvements, our Mask3Former trained for $36$ epochs exceeds Mask2Former trained for $50$ epochs on the instance and panoptic segmentation tasks. Notably, on semantic segmentation, Mask3Former exceeds Mask2Former with half the training epochs. After convergence, we also achieve $+1.1$ AP, $+0.8$ PQ, and $+0.9$ mIoU on instance, panoptic, and semantic segmentation, respectively, when trained with a ResNet-$50$ backbone. In addition, our method introduces only negligible computation during training and no extra computation during inference. Beyond the performance improvements, we also provide detailed theoretical analysis and visualization to demonstrate the effectiveness of our method. Our code will be released after blind review.
\end{abstract}