Summary: | Multiple object tracking lays the foundation for many intelligent video applications. The authors present a novel tracking solution that uses the ability of recurrent neural networks to effectively model complex temporal dynamics between objects, irrespective of appearance, pose, occlusion, and illumination changes. For online tracking, the real-time and accurate association of detected objects with active tracks poses the major algorithmic challenge; the re-entry of objects must also be resolved correctly. The authors follow a tracking-by-detection methodology, using a hierarchical long short-term memory (LSTM) network to model the motion dynamics between objects by learning a fusion of appearance and motion cues. Existing works represent each object for tracking only by its detected bounding box; the authors additionally incorporate object instance segments into track modelling by applying the Mask R-CNN detector. They present a novel motion coding scheme that anchors the LSTM structure, effectively modelling the motion and relative position between objects in a single representation. The proposed motion representation and deep features representing objects' appearances are fused in an embedding space learned by the hierarchical LSTM structure to predict object-to-track associations. The authors validate the proposed approach experimentally on multiple object tracking challenge datasets and demonstrate that their solution naturally handles the major tracking challenges under these uncertainties.
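
To make the described architecture concrete, the following is a minimal sketch of the summarized design: per-cue LSTMs feeding a higher-level fusion LSTM that scores a detection-to-track association. All layer sizes, the two-level layout, and the names (`HierarchicalAssociationLSTM`, `assoc_head`) are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch of a hierarchical LSTM for detection-to-track association.
# Assumptions (not from the paper): feature dimensions, one LSTM per cue,
# concatenation-based fusion, and a linear scoring head.
import torch
import torch.nn as nn

class HierarchicalAssociationLSTM(nn.Module):
    def __init__(self, app_dim=512, motion_dim=8, hidden=128):
        super().__init__()
        # Lower level: one LSTM per cue along the track's time steps.
        self.app_lstm = nn.LSTM(app_dim, hidden, batch_first=True)       # appearance features
        self.motion_lstm = nn.LSTM(motion_dim, hidden, batch_first=True)  # motion/relative-position code
        # Upper level: fusion LSTM over the concatenated cue embeddings.
        self.fusion_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.assoc_head = nn.Linear(hidden, 1)  # association score (logit)

    def forward(self, app_seq, motion_seq):
        # app_seq:    (B, T, app_dim) deep appearance features along a track
        # motion_seq: (B, T, motion_dim) encoded motion and relative position
        a, _ = self.app_lstm(app_seq)
        m, _ = self.motion_lstm(motion_seq)
        fused, _ = self.fusion_lstm(torch.cat([a, m], dim=-1))
        # Score the association from the final fused state.
        return self.assoc_head(fused[:, -1])

# Usage: score how well a candidate detection's cue sequences match an active track.
model = HierarchicalAssociationLSTM()
score = model(torch.randn(4, 10, 512), torch.randn(4, 10, 8))
print(score.shape)  # torch.Size([4, 1])
```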