Adaptive Fast XGBoost for Multiclass Classification

Authors

Baldo, F., Grando, J., Yamada Correa, Y., Amorim Policarpo, D.

DOI:

https://doi.org/10.5753/jidm.2023.3150

Keywords:

Multiclass Classification, XGBoost, Fast Classification, Data Stream Mining, Supervised Classification

Abstract

The popularization of sensing and connectivity technologies such as 5G and IoT is boosting the generation of data streams. Such data are one of the last frontiers of data mining applications. However, data streams are massive, unbounded sequences of non-stationary data objects that are continuously generated at rapid rates. To deal with these challenges, learning algorithms should analyze the data just once and update their classifiers to handle concept drifts. The literature presents some algorithms for classifying multiclass data streams; however, most of them have high processing times. Therefore, this work proposes an XGBoost-based classifier called AFXGB-MC to quickly classify non-stationary data streams with multiple classes. We compared it with six state-of-the-art algorithms for multiclass classification found in the literature. The results indicate that AFXGB-MC achieves similar accuracy with a shorter processing time, being twice as fast as the second-fastest algorithm from the literature, and recovers quickly from drift.
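
The single-pass setting described in the abstract can be illustrated with a small test-then-train (prequential) loop. The sketch below is not AFXGB-MC and does not use XGBoost: the learner is a hypothetical stand-in (nearest class centroid over a sliding window), and the synthetic stream, the drift point, and every name in the code are assumptions made for illustration only. It shows each instance being predicted once and then learned from, and a window-based model recovering after an abrupt concept drift.

```python
import random
from collections import deque


def make_stream(n, drift_at, seed=0):
    """Yield (x, y) pairs for 3 classes; the label-to-center mapping rotates at drift_at."""
    rng = random.Random(seed)
    centers = [0.0, 5.0, 10.0]
    for i in range(n):
        y = rng.randrange(3)
        # Abrupt concept drift: after drift_at, class y is drawn around the next center.
        c = centers[y] if i < drift_at else centers[(y + 1) % 3]
        yield rng.gauss(c, 0.5), y


class WindowCentroid:
    """Stand-in learner (assumption): nearest class centroid over a sliding window."""

    def __init__(self, window=200):
        self.buf = deque(maxlen=window)  # old instances fall out automatically

    def predict(self, x):
        sums, counts = {}, {}
        for xi, yi in self.buf:
            sums[yi] = sums.get(yi, 0.0) + xi
            counts[yi] = counts.get(yi, 0) + 1
        if not counts:
            return 0  # arbitrary default before any instance has been seen
        return min(counts, key=lambda c: abs(x - sums[c] / counts[c]))

    def learn(self, x, y):
        self.buf.append((x, y))


def prequential(stream, model):
    """Test-then-train: predict each instance exactly once, then learn from it."""
    hits = []
    for x, y in stream:
        hits.append(model.predict(x) == y)
        model.learn(x, y)
    return hits


hits = prequential(make_stream(3000, drift_at=1500), WindowCentroid(window=200))
acc_before = sum(hits[1000:1500]) / 500      # stable first concept
acc_at_drift = sum(hits[1500:1550]) / 50     # window still holds the old concept
acc_after = sum(hits[2500:3000]) / 500       # window now holds only the new concept
```

Comparing `acc_before`, `acc_at_drift`, and `acc_after` gives a rough picture of drift recovery in this toy setting. The window size trades recovery speed against stability, and a real stream classifier would replace the stand-in learner with an incremental or periodically retrained model.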

Published

2023-10-31

How to Cite

Baldo, F., Grando, J., Yamada Correa, Y., & Amorim Policarpo, D. (2023). Adaptive Fast XGBoost for Multiclass Classification. Journal of Information and Data Management, 14(1). https://doi.org/10.5753/jidm.2023.3150

Issue

Section

SBBD 2022 Full papers - Extended Papers