The Traffic Scene Understanding and Prediction Based on Image Captioning

Bibliographic Details
Main Authors: Wei Li, Zhaowei Qu, Haiyu Song, Pengjie Wang, Bo Xue
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9306804/
Description
Summary: Traffic scene understanding is a core technology in Intelligent Transportation Systems (ITS) and Advanced Driver Assistance Systems (ADAS), and it is becoming increasingly important for smart and autonomous vehicles. Recent methods for traffic scene understanding, such as Traffic Sign Recognition (TSR), Pedestrian Detection, and Vehicle Detection, have three major shortcomings. First, most models are customized to recognize a specific category of traffic target rather than general traffic targets. Second, these recognition modules frame traffic scene understanding as recognizing objects rather than as making driving suggestions or strategies. Third, a collection of independent recognition modules makes it difficult to fuse multi-modal information into a comprehensive driving decision for complicated traffic scenes. In this paper, we introduce an image captioning model to alleviate these shortcomings. Unlike existing methods, our primary idea is to accurately identify all categories of traffic objects, to understand traffic scenes by making full use of all available information, and to express suggestions or strategies for driving operations in natural language, generated by a Long Short-Term Memory (LSTM) network, rather than as keywords. The proposed solution naturally addresses the problems of feature fusion, general object recognition, and low-level semantic understanding. We evaluated the solution on a traffic scene image dataset we created for image captioning. Extensive quantitative and qualitative experiments demonstrate that the proposed solution identifies more objects and produces higher-level semantic information than state-of-the-art methods.
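The record itself contains no code. As a minimal sketch of the generic encoder-decoder image-captioning pattern the abstract refers to (a CNN extracting visual features that condition an LSTM, which then emits a caption token by token), the following PyTorch snippet may help; the ResNet-50 backbone, dimensions, and all names here are illustrative assumptions, not the authors' published architecture.

```python
# Illustrative CNN-encoder / LSTM-decoder captioning sketch (PyTorch).
# Backbone, dimensions, and vocabulary handling are assumptions for
# illustration, not taken from the paper.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptioningModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Visual encoder: pretrained ResNet-50 with its classifier removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the visual features frozen
        self.img_proj = nn.Linear(resnet.fc.in_features, embed_dim)
        # Language decoder: an LSTM that emits the caption token by token.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids, teacher-forced.
        feats = self.encoder(images).flatten(1)        # (B, 2048)
        feats = self.img_proj(feats).unsqueeze(1)      # (B, 1, E)
        words = self.embed(captions[:, :-1])           # (B, T-1, E)
        seq = torch.cat([feats, words], dim=1)         # image token first
        hidden, _ = self.lstm(seq)                     # (B, T, H)
        return self.out(hidden)                        # (B, T, vocab_size)
```

In a setup like this, training would compare the logits against the caption tokens with cross-entropy loss; at inference time the decoder would instead feed back its own predictions (greedily or via beam search) to produce a full sentence, such as a natural-language driving suggestion.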
ISSN: 2169-3536