CNNs for JPEGs: Designing Cost-Efficient Stems

Authors

Samuel Felipe dos Santos, Nicu Sebe, Jurandy Almeida

DOI:

https://doi.org/10.5753/jbcs.2026.5873

Keywords:

JPEG, Compressed Domain, DCT, Cost-Efficient Models

Abstract

Convolutional neural networks (CNNs) have achieved astonishing advances over the past decade, pushing the state of the art in several computer vision tasks. CNNs are capable of learning robust representations directly from RGB pixels. However, most image data are stored and transmitted in compressed formats, of which JPEG is the most widely used. Consequently, a preliminary decoding step with a high computational load and memory usage is required. Image decoding can be a performance bottleneck for devices with limited computational resources, such as embedded devices, even when hardware accelerators are used. For this reason, deep learning methods capable of learning directly from the compressed domain have been gaining attention in recent years. These methods usually extract a frequency-domain representation of the image, such as the DCT, via a partial decoding, and then adapt typical CNN architectures to work with it. In this paper, we perform an in-depth study of the computational cost of deep models designed for the frequency domain, evaluating the cost of decoding and of passing images through the network. We observe that previous work increased the models' computational complexity to accommodate the compressed input, nullifying the speed-up gained by not decoding images. We propose to remove the changes to the model that increase the computational cost, replacing them with our designed lightweight stems. This way, we can take full advantage of the speed-up obtained by avoiding the decoding. Our strategies were successful in generating models that balance efficiency and effectiveness, allowing deep models to be deployed on a wider array of devices. We achieve up to a 25.91% reduction in computational complexity (FLOPs), while decreasing accuracy by at most 2.97%.
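To illustrate the frequency-domain representation such methods consume: a JPEG stores each channel as 8×8 blocks of DCT coefficients, so partial decoding of an H×W image yields a tensor with 8× lower spatial resolution and 64 frequency channels. The sketch below (not the paper's pipeline, just an illustrative blockwise DCT using SciPy) shows the resulting shape a compressed-domain stem must accept:

```python
import numpy as np
from scipy.fft import dctn


def blockwise_dct(img, block=8):
    """Split a grayscale image into 8x8 blocks and apply a 2D DCT-II to
    each, mimicking the representation stored inside a JPEG file."""
    h, w = img.shape
    assert h % block == 0 and w % block == 0
    out = np.zeros((h // block, w // block, block * block), dtype=np.float32)
    for i in range(0, h, block):
        for j in range(0, w, block):
            coeffs = dctn(img[i:i + block, j:j + block], norm="ortho")
            out[i // block, j // block] = coeffs.ravel()
    return out


img = np.random.rand(224, 224).astype(np.float32)
dct = blockwise_dct(img)
print(dct.shape)  # (28, 28, 64)
```

A stem designed for this input must bridge the gap between the (28, 28, 64) coefficient grid and the feature-map resolution an RGB backbone expects at the same depth, which is where prior work added extra (costly) layers.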
We also propose the efficiency-effectiveness score SE to highlight models with favorable trade-offs between accuracy, computational cost, and number of parameters.
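A trade-off score of this kind can be sketched as follows. Note this is a hypothetical formulation (accuracy divided by the geometric mean of normalized FLOPs and parameter counts), not the paper's exact definition of SE, which is given in the full text; the reference values and model numbers are invented for illustration:

```python
def efficiency_effectiveness_score(acc, flops, params, ref_flops, ref_params):
    """Hypothetical trade-off score: accuracy relative to the geometric
    mean of compute and size costs, normalized by a reference model.
    Higher is better; the reference model scores exactly its accuracy."""
    cost = ((flops / ref_flops) * (params / ref_params)) ** 0.5
    return acc / cost


# Invented example: a baseline vs. a lighter variant with slightly lower
# accuracy but 25% fewer FLOPs and 20% fewer parameters.
baseline = efficiency_effectiveness_score(0.76, 4e9, 25e6, ref_flops=4e9, ref_params=25e6)
variant = efficiency_effectiveness_score(0.74, 3e9, 20e6, ref_flops=4e9, ref_params=25e6)
print(variant > baseline)  # True: the small accuracy drop is outweighed by the savings
```

The design choice in any such score is how strongly to penalize cost relative to accuracy; a geometric mean keeps the two cost axes on equal footing and is scale-free with respect to the reference model.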


Author Biographies

Samuel Felipe dos Santos, Federal University of São Carlos

Samuel Felipe dos Santos is a postdoctoral researcher in Computer Science at the Federal University of São Carlos (UFSCar), Sorocaba campus. He received his B.Sc. in Science and Technology (2016) and in Computer Science (2018), and his M.Sc. (2019) and Ph.D. (2023) in Computer Science from the Federal University of São Paulo (UNIFESP). During his Ph.D., he participated in the Sandwich Doctorate Program (PDSE) supported by CAPES, carrying out a research internship in the Dept. of Information Engineering and Computer Science, University of Trento, Italy (01/2022-06/2022). His research interests include content-based image retrieval, computer vision, machine learning, and deep learning.

Nicu Sebe, University of Trento

Nicu Sebe is a Professor in the Dept. of Information Engineering and Computer Science, University of Trento, Italy, where he leads research in the areas of multimedia analysis and human behavior understanding. He was General Co-Chair of IEEE FG 2008 and ACM Multimedia 2013, a program chair of ACM Multimedia 2007 and 2011, ECCV 2016, ICCV 2017, and ICPR 2020, and a general chair of ACM Multimedia 2022. He is a fellow of ELLIS and of IAPR.

Jurandy Almeida, Federal University of São Carlos

Jurandy Almeida is an Associate Professor in the Department of Computing at the Federal University of São Carlos, campus of Sorocaba, Brazil. He previously held positions as an Assistant Professor in the Institute of Science and Technology at the Federal University of São Paulo, campus of São José dos Campos, Brazil (2013-2022) and as an Associate Researcher in the Institute of Computing at the University of Campinas, Brazil (2011-2013). Jurandy received his B.Sc. in Computer Science (2004) from São Paulo State University, Brazil, and his M.Sc. (2007) and Ph.D. (2011) degrees in Computer Science from the University of Campinas, Brazil. He is a productivity research fellow (2018-present) of CNPq (the Brazilian National Council for Scientific and Technological Development), and a member of IEEE, IAPR, and SBC (the Brazilian Computer Society).


Published

2026-03-02

How to Cite

dos Santos, S. F., Sebe, N., & Almeida, J. (2026). CNNs for JPEGs: Designing Cost-Efficient Stems. Journal of the Brazilian Computer Society, 32(1), 201–215. https://doi.org/10.5753/jbcs.2026.5873

Issue

Section

Regular Issue