Robust Face Super-Resolution and Recognition Through Multi-Feature Aggregation in Diffusion Models
DOI:
https://doi.org/10.5753/jbcs.2026.5884
Keywords:
Diffusion models, Super-Resolution, Face Recognition
Abstract
Images acquired in surveillance environments often suffer from low resolution, pose variation, irregular illumination, and occlusion, and face recognition algorithms often struggle with such low-quality input. This limitation can be addressed with super-resolution techniques that enhance image detail. However, because the problem is so difficult, most super-resolution algorithms tend to distort both the image and the individual's identity, so additional information must be incorporated to improve recognition robustness. Surveillance cameras can capture multiple images, even at low quality, and data extracted from these images, such as consecutive video frames, can significantly enhance both super-resolution and face recognition. In this work, we introduce FASR++, a diffusion-model-based super-resolution algorithm. It leverages a reference low-resolution image and features extracted from multiple auxiliary low-quality images to generate a super-resolved output while minimizing distortions in the individual's identity. Our approach recovers facial features without explicitly providing soft attributes or computing a function gradient to guide the reconstruction process. FASR++ generates high-quality images that can considerably improve performance in face recognition tasks when used as a pre-processing step. We validate our approach on two standard face recognition datasets and attain state-of-the-art results for verification, face recognition, and image quality metrics such as PSNR, SSIM, and LPIPS.
License
Copyright (c) 2026 Marcelo dos Santos, Rayson Laroca, João Carlos Raposo Neves, David Menotti

This work is licensed under a Creative Commons Attribution 4.0 International License.

