Crowd-Powered Sampling for Machine Learning: Leveraging Citizen Scientist Response Patterns in AutoML Workflows

Hugo Resende; Eduardo B. Neto; Fabio A. M. Cappabianco; Álvaro L. Fazenda; Fabio A. Faria

doi:10.5753/jbcs.2026.5888

Authors

Hugo Resende, Institute of Science and Technology, Universidade Federal de São Paulo, https://orcid.org/0000-0001-9735-905X
Eduardo B. Neto, Institute of Science and Technology, Universidade Federal de São Paulo, https://orcid.org/0000-0001-6515-0403
Fabio A. M. Cappabianco, Institute of Science and Technology, Universidade Federal de São Paulo, https://orcid.org/0000-0002-2139-7938
Álvaro L. Fazenda, Institute of Science and Technology, Universidade Federal de São Paulo, https://orcid.org/0000-0002-4052-1113
Fabio A. Faria, Instituto Superior Tecnico, Universidade de Lisboa, https://orcid.org/0000-0003-2956-6326

Keywords:

Sampling Approaches, Citizen Science Data, AutoML, ForestEyes Project, Deforestation Detection

Abstract

Defining effective models for data classification is challenging, especially in complex contexts. Automated Machine Learning (AutoML) tools can assist in this process by ranking candidate models according to the nature of the data and the problem. In this work, we investigate the performance of five classifiers on the task of deforestation segment classification, using data labeled through a citizen science campaign of the ForestEyes project. We selected the SVM, Ridge, AdaBoost, KNN, and MLP models from a ranking generated with the PyCaret AutoML library, prioritizing diverse modeling approaches. First, the models are assessed using an incremental training strategy based on the entropy of the volunteers' classifications. Then, a new training strategy is proposed based on the median response time of volunteers when evaluating each segment, exploring three orderings: ascending, descending, and edge-based. Experimental results agreed with the PyCaret ranking: SVM achieved the best performance, followed by Ridge and AdaBoost, especially when trained on smaller and more reliable data subsets. Both the entropy-based approach and the new median-response-time strategy showed strong potential for efficiently training machine learning models when data are scarce, as is typical in citizen science campaigns.
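
The two sampling signals named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions: the entropy is taken to be the Shannon entropy of each segment's volunteer votes, the "edge-based" ordering is assumed to alternate between the fastest and slowest remaining segments (the abstract does not define it), and the function names, segment identifiers, and response times are all hypothetical.

```python
import math
from statistics import median


def vote_entropy(votes):
    """Shannon entropy (in bits) of the volunteer labels given to one segment.

    ASSUMPTION: plain Shannon entropy; 0.0 means full agreement.
    """
    n = len(votes)
    counts = {}
    for label in votes:
        counts[label] = counts.get(label, 0) + 1
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def order_by_median_time(times_by_segment, strategy="ascending"):
    """Order segment ids by the median volunteer response time."""
    med = {seg: median(ts) for seg, ts in times_by_segment.items()}
    ascending = sorted(med, key=med.get)
    if strategy == "ascending":
        return ascending
    if strategy == "descending":
        return ascending[::-1]
    if strategy == "edge":
        # ASSUMPTION: "edge-based" is read here as alternating between
        # the fastest and the slowest remaining segment.
        ordered = []
        lo, hi = 0, len(ascending) - 1
        while lo <= hi:
            ordered.append(ascending[lo])
            if lo != hi:
                ordered.append(ascending[hi])
            lo += 1
            hi -= 1
        return ordered
    raise ValueError(f"unknown strategy: {strategy}")


def incremental_subsets(ordered_segments, step):
    """Yield growing training subsets: the first step, 2*step, ... segments."""
    for end in range(step, len(ordered_segments) + step, step):
        yield ordered_segments[:end]


# Hypothetical response times (seconds) for four segments.
times = {
    "s1": [4.0, 6.0, 5.0],
    "s2": [1.0, 2.0, 3.0],
    "s3": [9.0, 10.0, 11.0],
    "s4": [7.0, 7.5, 8.0],
}
edge_order = order_by_median_time(times, "edge")
subsets = list(incremental_subsets(edge_order, step=2))
```

Under these assumptions, a classifier would be retrained on each successive subset, so the early rounds see only the segments that the chosen ordering ranks first.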
