CoReHAR: A Hybrid Deep Network for Video Action Recognition

Document Type: Original Article

Authors

Akram Mihanpour, Mohammad Javad Rashti, and Seyed Enayatallah Alavi

Department of Computer Engineering, Shahid Chamran University of Ahvaz, Ahvaz, Iran
a-mihanpoor@stu.scu.ac.ir, {mohammad.rashti, se.alavi}@scu.ac.ir

Abstract

Automated video processing is crucial to the correct functioning of applications such as surveillance, sports commentary, activity detection, human-machine interaction, and health/disability care. In such video processing tasks, recognizing various human actions is a pivotal component of correctly understanding a video and making decisions based on that understanding. Accurately recognizing human actions is a complex process, demanding high computing capabilities and intelligent algorithms. Several factors, such as object occlusion, camera movement, and background clutter, further challenge the task and its accuracy, essentially leaving deep learning approaches as the only viable option for properly detecting human actions in videos. In this study, we propose CoReHAR, a novel Human Action Recognition method that employs both deep Convolutional and Recurrent neural networks on raw video frames. Using the pre-trained ResNet152 CNN, deep features are first extracted from video frames. The sequential information of the frames is then learned using a DB-LSTM RNN, whose multiple stacked layers in the forward and backward passes increase network depth for higher accuracy. A number of techniques are also applied to improve CoReHAR's processing speed on heterogeneous GPU-enabled systems. The proposed method is implemented in PyTorch and compared to state-of-the-art methods, showing a considerable efficiency increase, with nearly 95% recognition accuracy measured as an average over all splits of the challenging UCF101 dataset.
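To make the pipeline concrete, the following is a minimal PyTorch sketch of the architecture the abstract describes: a pre-trained ResNet152 with its classification head removed serves as a per-frame feature extractor, and a stacked bidirectional LSTM learns the frame sequence. The hidden size (512), the number of stacked layers (2), the frozen backbone, and classifying from the last time step are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
from torchvision import models


class CoReHARSketch(nn.Module):
    # Sketch of the CoReHAR pipeline: a frozen, pre-trained ResNet152
    # extracts per-frame features, and a stacked bidirectional LSTM
    # (DB-LSTM) models the frame sequence before classification.
    def __init__(self, num_classes=101, hidden_size=512, num_layers=2):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        # Drop the final FC layer; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False  # CNN used as a fixed feature extractor
        self.dblstm = nn.LSTM(
            input_size=2048,
            hidden_size=hidden_size,
            num_layers=num_layers,  # stacked layers, forward and backward
            bidirectional=True,
            batch_first=True,
        )
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, time, 3, 224, 224) raw RGB frames
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))  # (b*t, 2048, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)     # (b, t, 2048)
        out, _ = self.dblstm(feats)                 # (b, t, 2*hidden_size)
        return self.classifier(out[:, -1])          # logits over action classes


# Example: a batch of two 16-frame clips.
logits = CoReHARSketch()(torch.randn(2, 16, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 101])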

Keywords

Human action recognition, deep learning, convolutional neural network, deep bidirectional LSTM, video processing
 
Akram Mihanpour received her bachelor's degree in software engineering from Shahid Chamran University of Ahvaz (SCU), Iran, in 2015, with research in cloud computing. She received her master's degree in artificial intelligence from the same university in 2019. Her research interests include deep learning, image and video processing, optimization algorithms, feature extraction, pattern recognition, machine learning, data mining, and computer vision.
 
Mohammad Javad Rashti is an assistant professor of computer engineering at Shahid Chamran University of Ahvaz (SCU), Iran. He received his BSc, MSc, and Ph.D. degrees from the University of Tehran, Sharif University of Technology, and Queen's University at Kingston, respectively. He has conducted his research in the area of high-performance computing and networking, in collaboration with leading companies, universities, and national labs in Canada, the USA, and Iran, publishing several scholarly papers in these areas. He is the founder of the Innovation and Creativity Center and the Director of IT Services at SCU.
 
Seyed Enayatallah Alavi is an assistant professor at the Department of Computer Engineering, Shahid Chamran University of Ahvaz (SCU), Iran. He received his B.Sc. degree in computer engineering from the Isfahan University of Technology, Isfahan, Iran, in 1992 and his M.Sc. degree in computer engineering (machine intelligence and robotics) from Shiraz University, Shiraz, Iran, in 1996. In 2011, he received his Ph.D. degree in computer engineering, majoring in artificial intelligence, from Belarusian National Technical University, Minsk, Belarus. He has over 17 years of academic experience and has published more than 60 papers in international and national conferences and more than 20 papers in international and national journals, in addition to five books. His current research interests are deep learning and evolutionary processing.