Instance Sequence Queries for Video Instance Segmentation with Transformers
Existing methods for video instance segmentation (VIS) mostly rely on two strategies: (1) building a sophisticated post-processing to associate frame-level segmentation results and (2) modeling a video clip as a 3D spatial-temporal volume with a limit of resolution and length due to memory constraints. In this work, we propose a frame-to-frame method built upon transformers. We use a set of queries, called instance sequence queries (ISQs), to drive the transformer decoder and produce results at each frame. Each query represents one instance in a video clip. By extending the bipartite matching loss to two frames, our training procedure enables the decoder to adjust the ISQs during inference. The consistency of instances is preserved by the corresponding order between query slots and network outputs. As a result, there is no need for complex data association. On a TITAN Xp GPU, our method achieves a competitive 34.4% mAP at 33.5 FPS with ResNet-50 and 35.5% mAP at 26.6 FPS with ResNet-101 on the YouTube-VIS dataset.
Main Authors: | Zhujun Xu, Damien Vivet |
---|---|
Affiliation: | Institut Supérieur de l’Aéronautique et de l’Espace (ISAE-SUPAERO), University of Toulouse, 31400 Toulouse, France |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-06-01 |
Series: | Sensors |
ISSN: | 1424-8220 |
DOI: | 10.3390/s21134507 |
Subjects: | video instance segmentation, transformer, query |
Online Access: | https://www.mdpi.com/1424-8220/21/13/4507 |
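
The abstract describes extending the bipartite matching loss to two frames so that each query slot keeps representing the same instance across a clip. Below is a minimal, hypothetical sketch of such a two-frame matching step using the Hungarian algorithm. The tensor shapes, the choice of a class-probability plus box-L1 cost, and the weights are illustrative assumptions only and are not taken from the paper, which may for instance use mask-based costs instead.

```python
# Hypothetical two-frame bipartite matching sketch (not the authors' implementation).
import numpy as np
from scipy.optimize import linear_sum_assignment


def two_frame_matching(pred_probs, pred_boxes, gt_labels, gt_boxes,
                       w_cls=1.0, w_box=5.0):
    """Match query slots to ground-truth instances jointly over two frames.

    pred_probs : (2, Q, C) softmax class probabilities per frame and query slot
    pred_boxes : (2, Q, 4) predicted boxes per frame and query slot
    gt_labels  : (G,)      class index of each ground-truth instance
    gt_boxes   : (2, G, 4) ground-truth boxes of each instance in both frames
    Returns (query_idx, gt_idx): one assignment shared by both frames, so the
    same query slot is tied to the same instance in frame t and frame t+1.
    """
    Q, G = pred_probs.shape[1], gt_labels.shape[0]
    cost = np.zeros((Q, G))
    for t in range(2):  # accumulate per-frame costs into one joint cost matrix
        cls_cost = -pred_probs[t][:, gt_labels]                       # (Q, G)
        box_cost = np.abs(pred_boxes[t][:, None, :]
                          - gt_boxes[t][None, :, :]).sum(-1)          # (Q, G)
        cost += w_cls * cls_cost + w_box * box_cost
    return linear_sum_assignment(cost)  # Hungarian algorithm on the joint cost


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(5), size=(2, 10))   # 2 frames, 10 queries, 5 classes
    boxes = rng.random((2, 10, 4))
    q_idx, g_idx = two_frame_matching(probs, boxes,
                                      gt_labels=np.array([1, 3]),
                                      gt_boxes=rng.random((2, 2, 4)))
    print("query slots", q_idx, "-> instances", g_idx)
```

Because a single assignment is computed over the summed two-frame cost, the supervision encourages the decoder to keep each query slot on one instance from frame to frame, which is what removes the need for a separate data-association step.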