Efficient online tree, rule-based and distance-based algorithms

Authors

Saulo Martiello Mastelini, André Carlos Ponce de Leon Ferreira de Carvalho

DOI:

https://doi.org/10.5753/jbcs.2025.5117

Keywords:

Online machine learning, Hoeffding trees, Ensembles, Regression, Online nearest neighbors, River library

Abstract

The rapid development of technology has led to the constant production of data in different forms and from different sources. Unlike the scenarios addressed by early machine learning (ML) research, there may now be too much data for traditional batch algorithms to handle. Changes in the underlying data distributions may also render traditional ML solutions useless in real-world applications. Online ML (OML) aims to create solutions that process data incrementally, use limited computational resources, and cope with time-changing data distributions. Unfortunately, there is a growing trend of OML algorithms that focus solely on predictive performance and overlook computational costs. In regression tasks, the problem is even more pronounced for some of the most popular OML solutions: decision trees, decision rules, and ensembles of these models. We created improved and efficient OML algorithms from these algorithmic families, focusing on decreasing time and memory costs while maintaining competitive predictive performance. In this paper, we present an overview of the main contributions of the Ph.D. thesis of the first author, which was awarded the best thesis prize by the Brazilian Computer Society in the 37th edition of the competition. Our proposals are either novel standalone OML algorithms or additions that can be paired with any existing tree or decision rule regressor.
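
As an illustration of the incremental (test-then-train) processing model described above, the sketch below runs a Hoeffding tree regressor from the River library, which is listed in the keywords, over a synthetic data stream. The synthetic generator, the grace_period value, and the evaluation metric are illustrative assumptions for this sketch, not the experimental setup evaluated in the thesis.

```python
import random

from river import metrics, tree


def synthetic_stream(n_instances, seed=42):
    """Illustrative stream of (features, target) pairs; the target is a noisy
    linear function of two features (an assumption made only for this sketch)."""
    rng = random.Random(seed)
    for _ in range(n_instances):
        x = {f"x{i}": rng.uniform(-1.0, 1.0) for i in range(5)}
        y = 3.0 * x["x0"] - 2.0 * x["x1"] + rng.gauss(0.0, 0.1)
        yield x, y


# The model is updated one instance at a time and never stores the raw stream,
# which keeps memory usage bounded.
model = tree.HoeffdingTreeRegressor(grace_period=100)
mae = metrics.MAE()

for x, y in synthetic_stream(10_000):
    y_pred = model.predict_one(x)  # test first...
    mae.update(y, y_pred)          # ...score the prediction...
    model.learn_one(x, y)          # ...then train on the same instance.

print(f"Prequential MAE: {mae.get():.4f}")
```

Because River estimators share the learn_one/predict_one interface, the same loop can be reused with rule-based or nearest-neighbor regressors simply by swapping the model object.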

Published

2025-06-18

How to Cite

Mastelini, S. M., & de Carvalho, A. C. P. de L. F. (2025). Efficient online tree, rule-based and distance-based algorithms. Journal of the Brazilian Computer Society, 31(1), 426–434. https://doi.org/10.5753/jbcs.2025.5117

Issue

Vol. 31 No. 1 (2025)

Section

Articles