Advancing Biodiversity Monitoring by Integrating Multimodal AI Models into Camera Trap Workflow

Authors

L. Alencar, F. Cunha, and E. M. dos Santos

DOI:

https://doi.org/10.5753/jbcs.2026.5894

Keywords:

Multimodal language models, Camera trap data, Animal detection, Species identification, Behavior analysis

Abstract

Camera trapping is an important non-invasive technique for wildlife monitoring. A typical camera-trap workflow involves several related tasks, such as filtering empty images, classifying animal species, and identifying animal behavior. In this study, we explore the application of large-scale multimodal language models (MLLMs) to camera-trap images across these three tasks. We evaluate four state-of-the-art models, namely BLIP, CLIP, Gemini, and GPT, under zero-shot and few-shot learning methodologies. Our experiments yielded several interesting results. First, few-shot learning significantly enhanced model performance in filtering empty images, with BLIP achieving a much higher accuracy (91.0%) than its zero-shot counterpart (7.61%). In animal species classification, Gemini showed strong baseline performance, reaching 75.89% accuracy in the zero-shot setting. For identifying animal behavior, two scenarios were investigated: using a single image or a sequence of images. The results indicate that sequence-based processing improves behavioral analysis, with BLIP attaining the highest accuracy (75.57%) in this task. Overall, our study emphasizes the limitations of the zero-shot approach in complex tasks while highlighting the potential of few-shot and sequence-based learning to address challenging problems such as empty-image filtering and species misclassification. These findings demonstrate the efficacy of advanced MLLMs in automating biodiversity monitoring, offering a scalable and accurate solution for processing large-scale datasets and advancing conservation science.
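The three tasks described in the abstract form a natural pipeline: filter empty frames first, classify species per retained image, then judge behavior over the whole image sequence (the setting the study found most accurate). The sketch below shows that control flow only; `ask` is a hypothetical stand-in for any MLLM query (BLIP, CLIP, Gemini, or GPT), injected as a function so the logic can be illustrated without model weights or API keys, and the prompts are illustrative, not those used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Annotation:
    image_id: str
    species: Optional[str] = None
    behavior: Optional[str] = None

def process_sequence(image_ids: List[str],
                     ask: Callable[[str, str], str]) -> List[Annotation]:
    """Run the filter -> species -> behavior workflow on one image burst."""
    # Stage 1: discard empty frames.
    kept = [i for i in image_ids
            if ask(i, "Does this image contain an animal? yes/no") == "yes"]
    # Stage 2: classify species per retained frame.
    annotations = [Annotation(i, species=ask(i, "Which species is shown?"))
                   for i in kept]
    # Stage 3: label behavior once for the whole sequence
    # (sequence-based processing outperformed single images in the study).
    if kept:
        behavior = ask(",".join(kept), "What behavior does the sequence show?")
        for a in annotations:
            a.behavior = behavior
    return annotations

# Stubbed model for illustration: frame "b2" is empty,
# the others show a grazing zebra.
def fake_mllm(image: str, prompt: str) -> str:
    if "yes/no" in prompt:
        return "no" if image == "b2" else "yes"
    if "species" in prompt:
        return "zebra"
    return "grazing"

result = process_sequence(["b1", "b2", "b3"], fake_mllm)
print([(a.image_id, a.species, a.behavior) for a in result])
```

Injecting the model call keeps the workflow identical whether `ask` wraps a local vision-language model or a hosted API, which is how zero-shot and few-shot variants (differing only in the prompt contents) could be swapped in.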

References

Alencar, L., Cunha, F., and dos Santos, E. M. (2023). A context-aware approach for filtering empty images in camera trap data using siamese network. In 2023 36th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pages 85-90. IEEE. DOI: 10.1109/sibgrapi59091.2023.10347159.

Alencar, L., Cunha, F., and dos Santos, E. M. (2024). Zero and few-shot learning with modern mllms to filter empty images in camera trap data. In 2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pages 1-6. IEEE. DOI: 10.1109/sibgrapi62404.2024.10716305.

Beery, S., Van Horn, G., and Perona, P. (2018). Recognition in terra incognita. In Proceedings of the European conference on computer vision (ECCV), pages 456-473. DOI: 10.1007/978-3-030-01270-0_28.

Binta Islam, S., Valles, D., Hibbitts, T. J., Ryberg, W. A., Walkup, D. K., and Forstner, M. R. (2023). Animal species recognition with deep convolutional neural networks from ecological camera trap images. Animals, 13(9):1526. DOI: 10.3390/ani13091526.

Choiński, M., Rogowski, M., Tynecki, P., Kuijper, D. P., Churski, M., and Bubnicki, J. W. (2021). A first step towards automated species recognition from camera trap images of mammals using ai in a european temperate forest. In Computer Information Systems and Industrial Management: 20th International Conference, CISIM 2021, Ełk, Poland, September 24-26, 2021, Proceedings 20, pages 299-310. Springer. DOI: 10.1007/978-3-030-84340-3_24.

Cunha, F., dos Santos, E. M., Barreto, R., and Colonna, J. G. (2021). Filtering empty camera trap images in embedded systems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2438-2446. DOI: 10.1109/cvprw53098.2021.00276.

Cunha, F., dos Santos, E. M., and Colonna, J. G. (2023). Bag of tricks for long-tail visual recognition of animal species in camera-trap images. Ecological Informatics, 76:102060. DOI: 10.1016/j.ecoinf.2023.102060.

Dorm, F., Millard, J., Purves, D., Harfoot, M., and Mac Aodha, O. (2025). Large language models possess some ecological knowledge, but how much? bioRxiv, pages 2025-02. DOI: 10.1016/j.ecoinf.2026.103699.

Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. (2023). Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378. DOI: 10.48550/arxiv.2303.03378.

Dussert, G., Miele, V., Van Reeth, C., Delestrade, A., Dray, S., and Chamaille-Jammes, S. (2024). Zero-shot animal behavior classification with image-text foundation models. bioRxiv, pages 2024-04. DOI: 10.1101/2024.04.05.588078.

Fabian, Z., Miao, Z., Li, C., Zhang, Y., Liu, Z., Hernandez, A., Arbelaez, P., Link, A., Montes-Rojas, A., Escucha, R., et al. (2023). Knowledge augmented instruction tuning for zero-shot animal species recognition. NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.

Fennell, M., Beirne, C., and Burton, A. C. (2022). Use of object detection in camera trap image identification: Assessing a method to rapidly and accurately classify human and animal detections for research and application in recreation ecology. Global Ecology and Conservation, 35:e02104. DOI: 10.1016/j.gecco.2022.e02104.

Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N. A., Ma, W.-C., and Krishna, R. (2024). Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148-166. Springer. DOI: 10.1007/978-3-031-73337-6_9.

Gabeff, V., Rußwurm, M., Tuia, D., and Mathis, A. (2024). Wildclip: Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models. International Journal of Computer Vision, 132(9):3770-3786. DOI: 10.1007/s11263-024-02026-6.

Guo, C., Miguel, A., and Maciejewski, A. A. (2024). Automatic identification of individual african leopards in unlabeled camera trap images. IEEE Transactions on Automation Science and Engineering. DOI: 10.1109/tase.2024.3379553.

Iannarilli, F., Erb, J., Arnold, T. W., and Fieberg, J. R. (2021). Evaluating species-specific responses to camera-trap survey designs. Wildlife Biology, 2021(1):1-12. DOI: 10.2981/wlb.00726.

Islam, R. and Moushi, O. M. (2024). Gpt-4o: The cutting-edge advancement in multimodal llm. Authorea Preprints. DOI: 10.36227/techrxiv.171986596.65533294/v1.

Koh, J. Y., Fried, D., and Salakhutdinov, R. R. (2024). Generating images with multimodal language models. Advances in Neural Information Processing Systems, 36. DOI: 10.52202/075280-0939.

Leorna, S. and Brinkman, T. (2022). Human vs. machine: Detecting wildlife in camera trap images. Ecological Informatics, 72:101876. DOI: 10.1016/j.ecoinf.2022.101876.

Li, J., Li, D., Xiong, C., and Hoi, S. (2022). Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888-12900. PMLR. DOI: 10.48550/arXiv.2201.12086.

Liu, H., Li, C., Li, Y., and Lee, Y. J. (2024). Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296-26306. DOI: 10.1109/cvpr52733.2024.02484.

Ma, Y., Cao, Y., Sun, J., Pavone, M., and Xiao, C. (2024). Dolphins: Multimodal language model for driving. In European Conference on Computer Vision, pages 403-420. Springer. DOI: 10.1007/978-3-031-72995-9_23.

Muhtar, D., Li, Z., Gu, F., Zhang, X., and Xiao, P. (2024). Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In European Conference on Computer Vision, pages 440-457. Springer. DOI: 10.1007/978-3-031-72904-1_26.

Norouzzadeh, M. S., Nguyen, A., Kosmala, M., Swanson, A., Palmer, M. S., Packer, C., and Clune, J. (2018). Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proceedings of the National Academy of Sciences, 115(25):E5716-E5725. DOI: 10.1073/pnas.1719367115.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR. DOI: 10.48550/arxiv.2103.00020.

Santamaria, J. D., Isaza, C., and Giraldo, J. H. (2024). Catalog: A camera trap language-guided contrastive learning model. arXiv preprint arXiv:2412.10624. DOI: 10.1109/wacv61041.2025.00124.

Schneider, S., Greenberg, S., Taylor, G. W., and Kremer, S. C. (2020). Three critical factors affecting automated image species recognition performance for camera traps. Ecology and evolution, 10(7):3503-3517. DOI: 10.1002/ece3.6147.

Swanson, A., Kosmala, M., Lintott, C., Simpson, R., Smith, A., and Packer, C. (2015a). Data from: Snapshot serengeti, high-frequency annotated camera trap images of 40 mammalian species in an african savanna. Scientific Data. DOI: 10.5061/dryad.5pt92.

Swanson, A., Kosmala, M., Lintott, C., Simpson, R., Smith, A., and Packer, C. (2015b). Snapshot serengeti, high-frequency annotated camera trap images of 40 mammalian species in an african savanna. Scientific data, 2(1):1-14. DOI: 10.1038/sdata.2015.26.

Tabak, M. A., Norouzzadeh, M. S., Wolfson, D. W., Sweeney, S. J., VerCauteren, K. C., Snow, N. P., Halseth, J. M., Di Salvo, P. A., Lewis, J. S., White, M. D., et al. (2019). Machine learning to classify animal species in camera trap images: Applications in ecology. Methods in Ecology and Evolution, 10(4):585-590. DOI: 10.1111/2041-210x.13120.

Tan, M., Chao, W., Cheng, J.-K., Zhou, M., Ma, Y., Jiang, X., Ge, J., Yu, L., and Feng, L. (2022). Animal detection and classification from camera trap images using different mainstream object detection architectures. Animals, 12(15):1976. DOI: 10.3390/ani12151976.

Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805. DOI: 10.48550/arXiv.2312.11805.

Vecvanags, A., Aktas, K., Pavlovs, I., Avots, E., Filipovs, J., Brauns, A., Done, G., Jakovels, D., and Anbarjafari, G. (2022). Ungulate detection and species classification from camera trap images using retinanet and faster r-cnn. Entropy, 24(3):353. DOI: 10.3390/e24030353.

Vélez, J., McShea, W., Shamon, H., Castiblanco-Camacho, P. J., Tabak, M. A., Chalmers, C., Fergus, P., and Fieberg, J. (2023). An evaluation of platforms for processing camera-trap data using artificial intelligence. Methods in Ecology and Evolution, 14(2):459-477. DOI: 10.1111/2041-210X.14044.

Vyskočil, J. and Picek, L. (2024). Towards zero-shot camera trap image categorization. arXiv preprint arXiv:2410.12769. DOI: 10.48550/arXiv.2410.12769.

Wang, Z., Cai, S., Liu, A., Jin, Y., Hou, J., Zhang, B., Lin, H., He, Z., Zheng, Z., Yang, Y., et al. (2024). Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. IEEE Transactions on Pattern Analysis and Machine Intelligence. DOI: 10.48550/arxiv.2311.05997.

Willi, M., Pitman, R. T., Cardoso, A. W., Locke, C., Swanson, A., Boyer, A., Veldthuis, M., and Fortson, L. (2019). Identifying animal species in camera trap images using deep learning and citizen science. Methods in Ecology and Evolution, 10(1):80-91. DOI: 10.1111/2041-210x.13099.

Wu, J., Gan, W., Chen, Z., Wan, S., and Philip, S. Y. (2023). Multimodal large language models: A survey. In 2023 IEEE International Conference on Big Data (BigData), pages 2247-2256. IEEE. DOI: 10.1109/bigdata59044.2023.10386743.

Yang, D.-Q., Li, T., Liu, M.-T., Li, X.-W., and Chen, B.-H. (2021). A systematic study of the class imbalance problem: Automatically identifying empty camera trap images using convolutional neural networks. Ecological Informatics, 64:101350. DOI: 10.1016/j.ecoinf.2021.101350.

Published

2026-04-15

How to Cite

Alencar, L., Cunha, F., & dos Santos, E. M. (2026). Advancing Biodiversity Monitoring by Integrating Multimodal AI Models into Camera Trap Workflow. Journal of the Brazilian Computer Society, 32(1), 677–689. https://doi.org/10.5753/jbcs.2026.5894

Issue

Section

Regular Issue