CoReHAR: A Hybrid Deep Network for Video Action Recognition

Document Type: Original Article


Shahid Chamran University of Ahvaz


Automating the processing of videos in applications such as surveillance, sports commentary and activity detection, human-machine interaction, and health/disability care is crucial to their correct functioning. In such video processing tasks, recognizing human actions is pivotal to understanding videos correctly and making decisions based on them. Accurate human action recognition is a complex process that demands high computing capabilities and intelligent algorithms. Several factors, such as object occlusion, camera movement, and background clutter, further challenge the task and its accuracy, leaving deep learning approaches as essentially the only viable option for properly detecting human actions in videos. In this study, we propose CoReHAR, a novel Human Action Recognition method that employs both deep Convolutional and Recurrent neural networks on raw video frames. Deep features are first extracted from the video frames using the pre-trained ResNet152 CNN. The sequential information of the frames is then learned using a DB-LSTM RNN. Multiple stacked layers in the forward and backward passes of the DB-LSTM provide increased network depth for higher accuracy. A number of techniques are also applied to improve CoReHAR’s processing speed on heterogeneous GPU-enabled systems. The proposed method is implemented and evaluated using PyTorch and compared with state-of-the-art methods, showing a considerable efficiency increase, with nearly 95% recognition accuracy measured as an average over all splits of the challenging UCF101 dataset.