Instance Sequence Queries for Video Instance Segmentation with Transformers

Existing methods for video instance segmentation (VIS) mostly rely on one of two strategies: (1) building sophisticated post-processing to associate frame-level segmentation results, or (2) modeling a video clip as a 3D spatio-temporal volume, with limits on resolution and clip length due to memory constraints. In this work, we propose a frame-to-frame method built upon transformers. We use a set of queries, called instance sequence queries (ISQs), to drive the transformer decoder and produce results at each frame. Each query represents one instance in a video clip. By extending the bipartite matching loss to two frames, our training procedure enables the decoder to adjust the ISQs during inference. The consistency of instances is preserved by the corresponding order between query slots and network outputs. As a result, there is no need for complex data association. On a TITAN Xp GPU, our method achieves a competitive 34.4% mAP at 33.5 FPS with ResNet-50 and 35.5% mAP at 26.6 FPS with ResNet-101 on the YouTube-VIS dataset.
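The two-frame matching idea can be pictured with a short sketch. The following Python/PyTorch snippet is an illustrative assumption, not the authors' released code: the function name two_frame_matching, the tensor shapes, and the cost terms (negative class probability plus an L1 box distance) are all hypothetical. It only shows the mechanism described in the abstract: matching costs are summed over a pair of frames before a single Hungarian assignment, so each query slot is bound to the same ground-truth instance in both frames.

# Illustrative sketch only (assumed shapes and cost terms, not the authors' code).
import torch
from scipy.optimize import linear_sum_assignment

def two_frame_matching(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (2, Q, C) per-frame class logits for Q query slots
    # pred_boxes:  (2, Q, 4) per-frame box predictions
    # gt_labels:   (N,) instance class labels, shared by both frames
    # gt_boxes:    (2, N, 4) per-frame ground-truth boxes
    cost = 0.0
    for t in range(2):                                            # accumulate cost over the two frames
        prob = pred_logits[t].softmax(-1)                         # (Q, C)
        cost_class = -prob[:, gt_labels]                          # (Q, N) negative class probability
        cost_box = torch.cdist(pred_boxes[t], gt_boxes[t], p=1)   # (Q, N) L1 box distance
        cost = cost + cost_class + cost_box
    # One assignment for the frame pair: query slot q keeps instance gt_idx[q]
    # in both frames, which is what lets slot order carry identity.
    q_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return q_idx, gt_idx

Under such a scheme, the index of a query slot can serve directly as the instance identity from frame to frame at inference time, which is why no separate data-association step is needed.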


Bibliographic Details
Main Authors: Zhujun Xu, Damien Vivet (Institut Supérieur de l’Aéronautique et de l’Espace (ISAE-SUPAERO), University of Toulouse, 31400 Toulouse, France)
Format: Article
Language: English
Published: MDPI AG, 2021-06-01
Series: Sensors
ISSN: 1424-8220
Subjects: video instance segmentation; transformer; query
Online Access: https://www.mdpi.com/1424-8220/21/13/4507