Learning Self-distilled Features for Facial Deepfake Detection Using Visual Foundation Models: General Results and Demographic Analysis

Authors

Y. M. B. G. Cunha, B. R. Gomes, J. M. C. Boaro, D. de S. Moraes, A. J. G. Busson, J. C. Duarte, and S. Colcher

DOI:

https://doi.org/10.5753/jis.2024.4120

Keywords:

Deepfake Detection, Foundation Models, Machine Learning, Demographic Analysis, Self-Supervised Methods

Abstract

Modern deepfake techniques produce highly realistic false media content with the potential for spreading harmful information, including fake news and incitement to violence. Deepfake detection methods aim to identify and counteract such content by employing machine learning algorithms, focusing mainly on detecting the presence of manipulation through spatial and temporal features. These methods often rely on Foundation Models trained on extensive unlabeled data through self-supervised approaches. This work extends previous research on deepfake detection, examining the effectiveness of these models while also considering biases, particularly concerning age, gender, and ethnicity, as part of an ethical analysis. Experiments used DINOv2, a recent Vision Transformer-based Foundation Model, trained on the diverse Deepfake Detection Challenge Dataset, which encompasses varied lighting conditions, resolutions, and demographic attributes. Combined with a CNN classifier, DINOv2 demonstrated improved deepfake detection with minimal bias towards these demographic characteristics.
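To make the described pipeline concrete, the sketch below shows one plausible reading of the approach: a frozen, pretrained DINOv2 backbone extracts self-supervised patch features from cropped face images, and a small CNN classifier head is trained on top of them to output a real/fake score. This is a minimal illustration under stated assumptions, not the authors' exact implementation; the checkpoint name (dinov2_vits14), the 224x224 input size, and the head layout are assumptions made for the example.

# Sketch: frozen DINOv2 features + small CNN head for real/fake classification.
# Checkpoint, input size, and head architecture are illustrative assumptions.
import torch
import torch.nn as nn

# Load a pretrained DINOv2 ViT-S/14 backbone from torch.hub and freeze it.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

class CNNHead(nn.Module):
    """Small convolutional classifier over the DINOv2 patch-token grid."""
    def __init__(self, embed_dim=384, grid=16):  # ViT-S/14 on 224x224 -> 16x16 patches
        super().__init__()
        self.grid = grid
        self.net = nn.Sequential(
            nn.Conv2d(embed_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, 1),  # single logit: fake (1) vs. real (0)
        )

    def forward(self, tokens):  # tokens: (B, grid*grid, embed_dim)
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        return self.net(x)

head = CNNHead()

# faces: a batch of cropped, normalized face images, shape (B, 3, 224, 224).
faces = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    feats = backbone.forward_features(faces)["x_norm_patchtokens"]
logits = head(feats)  # train `head` with nn.BCEWithLogitsLoss on real/fake labels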

References

Afchar, D., Nozick, V., Yamagishi, J., and Echizen, I. (2018). MesoNet: a compact facial video forgery detection network. In 2018 IEEE international workshop on information forensics and security (WIFS), pages 1–7. IEEE. DOI: https://doi.org/10.1109/WIFS.2018.8630761.

Almond Solutions (2021). Why do people post on social media. [link]. Accessed: 09 July 2024.

Beaumont-Thomas, B. (2024). Taylor Swift deepfake pornography sparks renewed calls for US legislation. [link]. Accessed: 09 July 2024.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. DOI: https://doi.org/10.48550/arXiv.2108.07258.

Bonettini, N., Cannas, E. D., Mandelli, S., Bondi, L., Bestagini, P., and Tubaro, S. (2021). Video face manipulation detection through ensemble of CNNs. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 5012–5019. DOI: https://doi.org/10.1109/ICPR48806.2021.9412711.

Brock, A., Donahue, J., and Simonyan, K. (2019). Large scale GAN training for high fidelity natural image synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. DOI: https://doi.org/10.48550/arXiv.1809.11096.

Bulat, A. and Tzimiropoulos, G. (2017). How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision, pages 1021–1030. DOI: https://doi.org/10.1109/ICCV.2017.116.

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660. DOI: https://doi.org/10.1109/ICCV48922.2021.00951.

Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., and Choo, J. (2018). StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797. DOI: https://doi.org/10.1109/CVPR.2018.00916.

Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258. DOI: https://doi.org/10.1109/CVPR.2017.195.

Coccomini, D. A., Messina, N., Gennaro, C., and Falchi, F. (2022). Combining EfficientNet and vision transformers for video deepfake detection. In Sclaroff, S., Distante, C., Leo, M., Farinella, G. M., and Tombari, F., editors, Image Analysis and Processing – ICIAP 2022, pages 219–229, Cham. Springer International Publishing. DOI: https://doi.org/10.1007/978-3-031-06433-3_19.

Corvi, R., Cozzolino, D., Zingarini, G., Poggi, G., Nagano, K., and Verdoliva, L. (2023). On the detection of synthetic images generated by diffusion models. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. DOI: https://doi.org/10.1109/ICASSP49357.2023.10095167.

Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., and Bharath, A. A. (2018). Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65. DOI: https://doi.org/10.1109/MSP.2017.2765202.

Dhariwal, P. and Nichol, A. (2021). Diffusion models beat GANs on image synthesis. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, Red Hook, NY, USA. Curran Associates Inc. DOI: https://dl.acm.org/doi/10.5555/3540261.3540933.

Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., and Ferrer, C. C. (2020a). The deepfake detection challenge dataset. arXiv preprint arXiv:2006.07397.

Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., and Ferrer, C. C. (2020b). The deepfake detection challenge (DFDC) dataset. https://doi.org/10.48550/arXiv.2006.07397. Accessed: 09 July 2024.

Dufour, N. and Gully, A. (2019). Contributing data to deepfake detection research. [link]. Accessed: 09 July 2024.

Feng, Y., Wu, F., Shao, X., Wang, Y., and Zhou, X. (2018). Joint 3D face reconstruction and dense alignment with position map regression network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 534–551. DOI: https://doi.org/10.1007/978-3-030-01264-9_33.

Gomes, B. R., Busson, A. J. G., Boaro, J., and Colcher, S. (2023). Realistic facial deep fakes detection through self-supervised features generated by a self-distilled vision transformer. In Proceedings of the 29th Brazilian Symposium on Multimedia and the Web, WebMedia ’23, page 177–183, New York, NY, USA. Association for Computing Machinery. DOI: https://doi.org/10.1145/3617023.3617047.

Heo, Y.-J., Choi, Y.-J., Lee, Y.-W., and Kim, B.-G. (2021). Deepfake detection scheme based on vision transformer and distillation. arXiv preprint arXiv:2104.01353. DOI: https://doi.org/10.48550/arXiv.2104.01353.

Iglovikov, V. and Shvets, A. (2018). TernausNet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation. arXiv preprint arXiv:1801.05746. DOI: https://doi.org/10.48550/arXiv.1801.05746.

Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134. DOI: https://doi.org/10.1109/CVPR.2017.632.

Jiang, L., Li, R., Wu, W., Qian, C., and Loy, C. C. (2020). DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2886–2895. DOI: https://doi.org/10.1109/CVPR42600.2020.00296.

Johnson, J., Alahi, A., and Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer. DOI: https://doi.org/10.1007/978-3-319-46475-6_43.

Kae, A., Sohn, K., Lee, H., and Learned-Miller, E. (2013). Augmenting CRFs with Boltzmann machine shape priors for image labeling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2019–2026. DOI: https://doi.org/10.1109/CVPR.2013.263.

Khalid, H. and Woo, S. S. (2020). OC-FakeDect: Classifying deepfakes using one-class variational autoencoder. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 656–657. DOI: https://doi.org/10.1109/CVPRW50498.2020.00336.

Kiefer, B. (2023). This brand’s social experiment uses AI to expose the dark side of ’sharenting’. [link]. Accessed: 09 July 2024.

King, D. E. (2009). Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10:1755–1758. DOI: https://dl.acm.org/doi/10.5555/1577069.1755843.

Langner, O., Dotsch, R., Bijlstra, G., Wigboldus, D. H., Hawk, S. T., and Van Knippenberg, A. (2010). Presentation and validation of the Radboud Faces Database. Cognition and Emotion, 24(8):1377–1388. DOI: https://doi.org/10.1080/02699930903485076.

Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., and Guo, B. (2020a). Face x-ray for more general face forgery detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5000–5009. DOI: https://doi.org/10.1109/CVPR42600.2020.00505.

Li, M., Zuo, W., and Zhang, D. (2016). Deep identity-aware transfer of facial attributes. arXiv preprint arXiv:1610.05586. DOI: https://doi.org/10.48550/arXiv.1610.05586.

Li, Y., Sun, P., Qi, H., and Lyu, S. (2022). Toward the creation and obstruction of deepfakes. In Handbook of Digital Face Manipulation and Detection, pages 71–96. Springer, Cham. DOI: https://doi.org/10.1007/978-3-030-87664-7_4.

Li, Y., Yang, X., Sun, P., Qi, H., and Lyu, S. (2020b). Celeb-DF: A large-scale challenging dataset for deepfake forensics. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3204–3213. DOI: https://doi.org/10.1109/CVPR42600.2020.00327.

Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV). DOI: https://doi.org/10.1109/ICCV.2015.425.

Maze, B., Adams, J., Duncan, J. A., Kalka, N., Miller, T., Otto, C., Jain, A. K., Niggel, W. T., Anderson, J., Cheney, J., et al. (2018). IARPA Janus Benchmark-C: Face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pages 158–165. IEEE. DOI: https://doi.org/10.1109/ICB2018.2018.00033.

Mehta, S., Mercan, E., Bartlett, J., Weaver, D., Elmore, J. G., and Shapiro, L. (2018). Y-Net: joint segmentation and classification for diagnosis of breast biopsy images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 893–901. Springer. DOI: https://doi.org/10.1007/978-3-030-00934-2_99.

Nguyen, H. H., Yamagishi, J., and Echizen, I. (2019). Capsule-Forensics: Using capsule networks to detect forged images and videos. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2307–2311. DOI: https://doi.org/10.1109/ICASSP.2019.8682602.

Nirkin, Y., Keller, Y., and Hassner, T. (2019). FSGAN: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7184–7193. DOI: https://doi.org/10.1109/ICCV.2019.00728.

Nirkin, Y., Masi, I., Tuan, A. T., Hassner, T., and Medioni, G. (2018). On face segmentation, face swapping, and face perception. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 98–105. IEEE. DOI: https://doi.org/10.1109/FG.2018.00024.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. (2024). DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research. DOI: https://doi.org/10.48550/arXiv.2304.07193.

Perarnau, G., Van De Weijer, J., Raducanu, B., and Álvarez, J. M. (2016). Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355. DOI: https://doi.org/10.48550/arXiv.1611.06355.

Perov, I., Gao, D., Chervoniy, N., Liu, K., Marangonda, S., Umé, C., Dpfks, M., Facenheim, C. S., RP, L., Jiang, J., et al. (2023). DeepFaceLab: Integrated, flexible and extensible face-swapping framework. Pattern Recognition, 141(C). DOI: https://doi.org/10.1016/j.patcog.2023.109628.

Pokroy, A. A. and Egorov, A. D. (2021). EfficientNets for deepfake detection: Comparison of pretrained models. In 2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus), pages 598–600. IEEE. DOI: https://doi.org/10.1109/ElConRus51938.2021.9396092.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR. DOI: https://doi.org/10.48550/arXiv.2103.00020.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685. DOI: https://doi.org/10.1109/CVPR52688.2022.01042.

Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., and Nießner, M. (2019). FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11. DOI: https://doi.org/10.1109/ICCV.2019.00009.

Schmunk, R. (2024). Explicit fake images of Taylor Swift prove laws haven’t kept pace with tech, experts say. [link]. Accessed: 09 July 2024.

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22. DOI: https://dl.acm.org/doi/10.5555/3600270.3602103.

Tan, M. and Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR. DOI: https://doi.org/10.48550/arXiv.1905.11946.

Tjon, E., Moh, M., and Moh, T.-S. (2021). Eff-YNet: A dual task network for deepfake detection and segmentation. In 2021 15th International Conference on Ubiquitous Information Management and Communication (IMCOM), pages 1–8. IEEE. DOI: https://doi.org/10.1109/IMCOM51814.2021.9377373.

Trinh, L. and Liu, Y. (2021). An examination of fairness of AI models for deepfake detection. In Zhou, Z.-H., editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 567–574. International Joint Conferences on Artificial Intelligence Organization. Main Track. DOI: https://doi.org/10.24963/ijcai.2021/79.

Wang, J., Wu, Z., Chen, J., and Jiang, Y.-G. (2022). M2TR: Multi-modal multi-scale transformers for deepfake detection. In Proceedings of the 2022 International Conference on Multimedia Retrieval, page 615–623. DOI: https://doi.org/10.1145/3512527.3531415.

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612. DOI: https://doi.org/10.1109/TIP.2003.819861.

Xu, Y., Terhörst, P., Raja, K., and Pedersen, M. (2024). Analyzing fairness in deepfake detection with massively annotated databases. IEEE Transactions on Technology and Society, 5(1):93–106. DOI: https://doi.org/10.1109/TTS.2024.3365421.

Zhang, K., Zhang, Z., Li, Z., and Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503. DOI: https://doi.org/10.1109/LSP.2016.2603342.

Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., and Li, S. Z. (2017). S3FD: Single shot scale-invariant face detector. In Proceedings of the IEEE international conference on computer vision, pages 192–201. DOI: https://doi.org/10.1109/ICCV.2017.30.

Zhao, H., Zhou, W., Chen, D., Zhang, W., and Yu, N. (2022). Self-supervised transformer for deepfake detection. arXiv preprint arXiv:2203.01265. DOI: https://doi.org/10.48550/arXiv.2203.01265.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2242–2251. DOI: https://doi.org/10.1109/ICCV.2017.244.

Published

2024-07-09

How to Cite

CUNHA, Y. M. B. G.; GOMES, B. R.; BOARO, J. M. C.; MORAES, D. de S.; BUSSON, A. J. G.; DUARTE, J. C.; COLCHER, S. Learning Self-distilled Features for Facial Deepfake Detection Using Visual Foundation Models: General Results and Demographic Analysis. Journal on Interactive Systems, Porto Alegre, RS, v. 15, n. 1, p. 682–694, 2024. DOI: 10.5753/jis.2024.4120. Available at: https://journals-sol.sbc.org.br/index.php/jis/article/view/4120. Accessed: 7 Sep. 2024.

Issue

Section

Regular Paper