Video Person Re‐Identification with Frame Sampling–Random Erasure and Mutual Information–Temporal Weight Aggregation


Bibliographic Details
Main Authors: Li, J. (Author), Piao, Y. (Author)
Format: Article
Language: English
Published: MDPI 2022
Subjects:
Online Access: View Fulltext in Publisher
LEADER 02626nam a2200337Ia 4500
001 10-3390-s22083047
008 220425s2022 CNT 000 0 eng d
022 |a 1424-8220 (ISSN) 
245 1 0 |a Video Person Re‐Identification with Frame Sampling–Random Erasure and Mutual Information–Temporal Weight Aggregation 
260 0 |b MDPI  |c 2022 
856 |z View Fulltext in Publisher  |u https://doi.org/10.3390/s22083047 
520 3 |a Partial occlusion and background clutter in camera video surveillance reduce the accuracy of video‐based person re‐identification (re‐ID). To address these problems, we propose a person re‐ID method based on frame sampling–random erasure and mutual information–temporal weight aggregation of partial and global features. First, for cases in which the target person is subject to interference or partial occlusion, the frame sampling–random erasure (FSE) method is used for data augmentation to effectively alleviate the occlusion problem, improve the generalization ability of the model, and match persons more accurately. Second, to further improve video‐based person re‐ID accuracy and learn more discriminative feature representations, we use a ResNet‐50 network to extract global and partial features and fuse them to obtain frame‐level features. In the temporal dimension, a mutual information–temporal weight aggregation (MI–TWA) module aggregates the partial features with different per‐frame weights and the global features with equal weights, then concatenates them to output sequence‐level features. The proposed method was extensively evaluated on three public video datasets, MARS, DukeMTMC‐VideoReID, and PRID‐2011; the mean average precision (mAP) values are 82.4%, 94.1%, and 95.3%, and the Rank‐1 values are 86.4%, 94.8%, and 95.2%, respectively. © 2022 by the authors. Licensee MDPI, Basel, Switzerland. 
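The FSE augmentation described in the abstract pairs restricted frame sampling with random erasing to simulate occlusion at training time. The sketch below is a minimal PyTorch illustration of that idea, assuming a tracklet of at least `num_frames` frames; the sampling scheme, erasure probability, and area/aspect ranges are illustrative assumptions, not the authors' published settings.

```python
# Hypothetical sketch of frame sampling–random erasure (FSE): sample frames
# from a tracklet in temporal order, then randomly erase a rectangle in each
# sampled frame to mimic partial occlusion. Parameters are assumptions.
import random
import torch

def frame_sampling_random_erasure(tracklet, num_frames=8, p=0.5,
                                  area_ratio=(0.02, 0.2), aspect=(0.3, 3.3)):
    """tracklet: (T, C, H, W) tensor, T >= num_frames; returns (num_frames, C, H, W)."""
    T = tracklet.size(0)
    # Restricted random sampling: split the tracklet into equal chunks and
    # draw one frame per chunk, preserving temporal order.
    idx = [random.randint(i * T // num_frames, (i + 1) * T // num_frames - 1)
           for i in range(num_frames)]
    frames = tracklet[idx].clone()
    _, _, H, W = frames.shape
    for f in frames:
        if random.random() > p:
            continue  # leave this frame unerased
        for _ in range(10):  # retry until a valid rectangle fits
            area = random.uniform(*area_ratio) * H * W
            ratio = random.uniform(*aspect)
            h, w = int((area * ratio) ** 0.5), int((area / ratio) ** 0.5)
            if 0 < h < H and 0 < w < W:
                y, x = random.randint(0, H - h), random.randint(0, W - w)
                f[:, y:y + h, x:x + w] = torch.rand(f.size(0), h, w)
                break
    return frames
```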
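For the MI–TWA step, the abstract states that partial features are aggregated over time with different weights while global features receive equal weights, with the results concatenated into a sequence-level descriptor. The sketch below follows that structure but substitutes a cosine-similarity affinity for the paper's mutual-information score; the proxy, the `mi_twa` name, and the feature dimensions are assumptions for illustration.

```python
# Hypothetical sketch of mutual information–temporal weight aggregation
# (MI–TWA): weight each frame's partial features by their affinity to that
# frame's global feature, average global features with equal weights, and
# concatenate the two results into a sequence-level descriptor.
import torch
import torch.nn.functional as F

def mi_twa(global_feats, partial_feats):
    """global_feats: (T, D); partial_feats: (T, P, D) for P body parts."""
    T, P, D = partial_feats.shape
    # Stand-in for the mutual-information score: cosine similarity between
    # each partial feature and the same frame's global feature.
    sim = F.cosine_similarity(partial_feats,
                              global_feats.unsqueeze(1).expand(T, P, D), dim=-1)
    weights = torch.softmax(sim, dim=0)            # normalize over time, (T, P)
    part_seq = (weights.unsqueeze(-1) * partial_feats).sum(dim=0)  # (P, D)
    global_seq = global_feats.mean(dim=0)          # equal temporal weights, (D,)
    return torch.cat([global_seq, part_seq.flatten()], dim=0)      # (D + P*D,)
```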
650 0 4 |a Background clutter 
650 0 4 |a Deep learning 
650 0 4 |a Frame sampling–random erasure (FSE) 
650 0 4 |a Global feature 
650 0 4 |a Mutual information 
650 0 4 |a Mutual information–temporal weight aggregation (MI–TWA) 
650 0 4 |a Partial occlusion 
650 0 4 |a Person re‐identification 
650 0 4 |a Security systems 
650 0 4 |a Video person re‐identification 
650 0 4 |a Video surveillance 
700 1 |a Li, J.  |e author 
700 1 |a Piao, Y.  |e author 
773 |t Sensors