Improving Action Recognition using Temporal Regions

Authors

  • Roger Granada, Pontifícia Universidade Católica do Rio Grande do Sul
  • João Paulo Aires, Pontifícia Universidade Católica do Rio Grande do Sul
  • Juarez Monteiro, Pontifícia Universidade Católica do Rio Grande do Sul
  • Felipe Meneguzzi, Pontifícia Universidade Católica do Rio Grande do Sul
  • Rodrigo C. Barros, Pontifícia Universidade Católica do Rio Grande do Sul

DOI:

https://doi.org/10.5753/jidm.2018.2047

Keywords:

Action Recognition, Convolutional Neural Networks, Neural Networks

Abstract

Recognizing actions in videos is an important task in computer vision, with applications such as surveillance and the assistance of sick and disabled people. Automating this task can improve how we monitor actions, since it removes the need for a human to watch a video at all times. However, classifying actions in a video is challenging, since we must identify the temporal features that best represent each action. In this work, we propose an approach that obtains temporal features from videos by dividing the sequence of frames of a video into regions. Frames from these regions are merged in order to capture the temporal aspect that characterizes each action in a video. Our approach yields better results than frame-by-frame classification.
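The abstract only outlines the mechanism, so the sketch below gives a minimal illustration of the idea: split a clip into contiguous temporal regions and merge the frames of each region into a single summary image. The function names, the pixel-wise averaging rule, and the choice of five regions are assumptions made for this sketch, not the paper's actual merging strategy.

```python
import numpy as np

def temporal_regions(frames, num_regions):
    """Split a video's frame sequence into contiguous temporal regions.

    frames: array of shape (T, H, W, C) holding T decoded video frames.
    Returns a list of num_regions sub-arrays covering the whole sequence.
    """
    # np.array_split keeps the regions contiguous and near-equal in length,
    # even when T is not divisible by num_regions.
    return np.array_split(frames, num_regions, axis=0)

def merge_region(region):
    """Merge one region's frames into a single image by pixel-wise averaging,
    one simple (assumed) way to summarize the motion within the region."""
    return region.mean(axis=0)

# Toy usage: a 30-frame clip reduced to 5 merged images, which could then be
# fed to a CNN classifier in place of the raw frame-by-frame input.
video = np.random.rand(30, 224, 224, 3)   # stand-in for decoded frames
merged = [merge_region(r) for r in temporal_regions(video, num_regions=5)]
print(len(merged), merged[0].shape)        # 5 (224, 224, 3)
```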

Published

2018-10-01

How to Cite

Granada, R., Aires, J. P., Monteiro, J., Meneguzzi, F., & Barros, R. C. (2018). Improving Action Recognition using Temporal Regions. Journal of Information and Data Management, 9(2), 108. https://doi.org/10.5753/jidm.2018.2047

Section

KDMILE 2017