Instance Sequence Queries for Video Instance Segmentation with Transformers
Existing methods for video instance segmentation (VIS) mostly rely on two strategies: (1) building a sophisticated post-processing to associate frame-level segmentation results and (2) modeling a video clip as a 3D spatial-temporal volume with a limit of resolution and length due to memory constraints. In this work, we propose a frame-to-frame method built upon transformers. We use a set of queries, called instance sequence queries (ISQs), to drive the transformer decoder and produce results at each frame. Each query represents one instance in a video clip. By extending the bipartite matching loss to two frames, our training procedure enables the decoder to adjust the ISQs during inference. The consistency of instances is preserved by the corresponding order between query slots and network outputs. As a result, there is no need for complex data association. On a TITAN Xp GPU, our method achieves a competitive 34.4% mAP at 33.5 FPS with ResNet-50 and 35.5% mAP at 26.6 FPS with ResNet-101 on the YouTube-VIS dataset.
Main Authors: | Zhujun Xu, Damien Vivet |
---|---|
Affiliation: | Institut Supérieur de l’Aéronautique et de l’Espace (ISAE-SUPAERO), University of Toulouse, 31400 Toulouse, France |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-06-01 |
Series: | Sensors |
ISSN: | 1424-8220 |
DOI: | 10.3390/s21134507 |
Subjects: | video instance segmentation, transformer, query |
Online Access: | https://www.mdpi.com/1424-8220/21/13/4507 |
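
The abstract describes extending the bipartite matching loss to two frames so that each query slot keeps representing the same instance across a clip. Below is a minimal, hypothetical sketch of such a two-frame matching step using the Hungarian algorithm. The tensor shapes, the choice of a class-probability plus box-L1 cost, and the weights are illustrative assumptions only and are not taken from the paper, which may for instance use mask-based costs instead.

```python
# Hypothetical two-frame bipartite matching sketch (not the authors' implementation).
import numpy as np
from scipy.optimize import linear_sum_assignment


def two_frame_matching(pred_probs, pred_boxes, gt_labels, gt_boxes,
                       w_cls=1.0, w_box=5.0):
    """Match query slots to ground-truth instances jointly over two frames.

    pred_probs : (2, Q, C) softmax class probabilities per frame and query slot
    pred_boxes : (2, Q, 4) predicted boxes per frame and query slot
    gt_labels  : (G,)      class index of each ground-truth instance
    gt_boxes   : (2, G, 4) ground-truth boxes of each instance in both frames
    Returns (query_idx, gt_idx): one assignment shared by both frames, so the
    same query slot is tied to the same instance in frame t and frame t+1.
    """
    Q, G = pred_probs.shape[1], gt_labels.shape[0]
    cost = np.zeros((Q, G))
    for t in range(2):  # accumulate per-frame costs into one joint cost matrix
        cls_cost = -pred_probs[t][:, gt_labels]                       # (Q, G)
        box_cost = np.abs(pred_boxes[t][:, None, :]
                          - gt_boxes[t][None, :, :]).sum(-1)          # (Q, G)
        cost += w_cls * cls_cost + w_box * box_cost
    return linear_sum_assignment(cost)  # Hungarian algorithm on the joint cost


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(5), size=(2, 10))   # 2 frames, 10 queries, 5 classes
    boxes = rng.random((2, 10, 4))
    q_idx, g_idx = two_frame_matching(probs, boxes,
                                      gt_labels=np.array([1, 3]),
                                      gt_boxes=rng.random((2, 2, 4)))
    print("query slots", q_idx, "-> instances", g_idx)
```

Because a single assignment is computed over the summed two-frame cost, the supervision encourages the decoder to keep each query slot on one instance from frame to frame, which is what removes the need for a separate data-association step.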