Summary: | Existing methods for video instance segmentation (VIS) mostly rely on one of two strategies: (1) building sophisticated post-processing to associate frame-level segmentation results, or (2) modeling a video clip as a 3D spatio-temporal volume, which limits resolution and clip length due to memory constraints. In this work, we propose a frame-to-frame method built upon transformers. We use a set of queries, called instance sequence queries (ISQs), to drive the transformer decoder and produce results at each frame. Each query represents one instance in a video clip. By extending the bipartite matching loss to two frames, our training procedure enables the decoder to adjust the ISQs during inference. The consistency of instances is preserved by the corresponding order between query slots and network outputs. As a result, there is no need for complex data association. On a TITAN Xp GPU, our method achieves a competitive 34.4% mAP at 33.5 FPS with ResNet-50 and 35.5% mAP at 26.6 FPS with ResNet-101 on the YouTube-VIS dataset.
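As a rough illustration of the two-frame bipartite matching mentioned above, the sketch below assigns each ground-truth instance to one query slot by minimizing a cost accumulated over two consecutive frames with the Hungarian algorithm. The cost terms, tensor shapes, and equal weighting are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch: two-frame bipartite matching between query slots and
# ground-truth instances. Cost terms and weights are illustrative assumptions.
import torch
from scipy.optimize import linear_sum_assignment


def match_two_frames(pred_logits, gt_labels, pred_masks, gt_masks):
    """
    pred_logits: (num_queries, num_classes) class scores per query slot
                 (shared across the two frames in this sketch).
    gt_labels:   (num_gt,) ground-truth class indices.
    pred_masks:  (2, num_queries, H*W) predicted mask logits for frames t, t+1.
    gt_masks:    (2, num_gt, H*W) binary ground-truth masks for frames t, t+1.
    Returns (query_indices, gt_indices): each instance keeps the same slot
    in both frames, so identities carry over without extra data association.
    """
    prob = pred_logits.softmax(-1)                 # (num_queries, num_classes)
    cost_class = -prob[:, gt_labels]               # (num_queries, num_gt)

    # L1 mask cost summed over the two frames.
    cost_mask = 0.0
    for f in range(2):
        pm = pred_masks[f].sigmoid()               # (num_queries, H*W)
        gm = gt_masks[f].float()                   # (num_gt, H*W)
        cost_mask = cost_mask + torch.cdist(pm, gm, p=1) / pm.shape[-1]

    cost = cost_class + cost_mask                  # equal weights: an assumption
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return q_idx, g_idx
```

Because the assignment is shared across both frames, the matched query slot keeps representing the same instance from frame to frame, which is what removes the need for a separate association step.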