Grid-Ordering for Outlier Detection in Massive Data Streams
DOI:
https://doi.org/10.5753/jidm.2025.4116Keywords:
Distance-based Outlier Detection, Distributed Computing, Data Streams, Apache SparkAbstract
Outlier detection is critical in data mining, encompassing the revelation of hidden insights or identification of potentially disruptive anomalies. While numerous strategies have been proposed for serial-processing outlier detection, the ever-expanding realm of big data applications demands efficient distributed computing solutions. This paper addresses the challenge of real-time outlier detection in multidimensional data streams with high-frequency arrivals, by presenting GOOST. This novel algorithm employs neighborhood analysis by leveraging grid-based data sorting. GOOST efficiently detects distance-based outliers, ensuring accurate detection in distributed environments within a competitive and much more stable processing time than previous solutions. We perform experiments on 6 real and synthetic data sets with up to 1.2M events, and up to 55 dimensions. We demonstrate that GOOST outperforms 3 state-of-the-art methods in terms of quality of results (30% more accurate) within competitive (and 45% more stable) processing times for real-time analysis of multidimensional data streams and high event frequency, thus offering a promising solution for various scientific and commercial domains.
Downloads
References
Aggarwal, C. C. (2007). Data streams: models and algorithms. Springer Science & Business Media.
Aggarwal, C. C. (2017). Outlier Analysis(Second Edition). Springer International Publishing.
Aggarwal, C. C. and Yu, P. S. (2001). Outlier detection for high dimensional data. SIGMOD Record, 30(2):37–46.
Angiulli, F., Basta, S., Lodi, S., and Sartori, C. (2012). Distributed strategies for mining outliers in large data sets. TKDE, 25(7):1520–1532.
Angiulli, F. and Fassetti, F. (2007). Detecting distance-based outliers in streams of data. In CIKM, page 811.
Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5(1):4308.
Bhaduri, K., Matthews, B. L., and Giannella, C. R. (2011). Algorithms for speeding up distance-based outlier detection. In SIGKDD, pages 859–867.
Böhm, C., Braunmüller, B., Krebs, F., and Kriegel, H.-p. (2001). Epsilon grid order. SIGMOD Record, 30(2):379–388.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). Lof: Identifying density-based local outliers. SIG-MOD, 29(2):93–104.
Cao, L., Yan, Y., Kuhlman, C., Wang, Q., Rundensteiner, E. A., and Eltabakh, M. (2017). Multi-tactic distance-based outlier detection. In ICDE, pages 959–970.
Cao, L., Yang, D., Wang, Q., Yu, Y., Wang, J., and Rundensteiner, E. A. (2014). Scalable distance-based outlier detection over high-volume data streams. ICDE, D:76–87.
Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection. ACM Computing Surveys, 41(3):1–58.
Cordeiro, R. L. F., Traina Jr, C., Traina, A. J. M., Hernandez, J. L., Kang, U., and Faloutsos, C. (2011). Clustering very large multi-dimensional datasets with mapreduce. In KDD, volume 11, pages 408–516.
Faloutsos, C., Wu, L., Traina, A., and Traina Jr, C. (2010). Fast feature selection using fractal dimension. JIDM, 1(1):3–3.
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4):1–37.
Iwashita, A. S. and Papa, J. P. (2019). An overview on concept drift learning. IEEE Access, 7(Section III):1532–1547.
Jiang, S., Cordeiro, R. L. F., and Akoglu, L. (2022). D.MCA: outlier detection with explicit micro-cluster assignments. In ICDM, pages 987–992.
Knorr, E. M. and Ng, R. T. (1998). Algorithms for mining distance-based outliers in large datasets. In VLDB, pages 392–403.
Kontaki, M., Gounaris, A., Papadopoulos, A. N., Tsichlas, K., and Manolopoulos, Y. (2011). Continuous monitoring of distance-based outliers over data streams. In ICDE, pages 135–146.
Liu, F. T., Ting, K. M., and Zhou, Z.-H. (2008). Isolation forest. In ICDM, pages 413–422.
Rajaraman, A. and Ullman, J. D. (2020). Mining of massive datasets. Cambridge University Press.
Ramaswamy, S., Rastogi, R., and Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In SIG-MOD.
Sadik, M. S. and Gruenwald, L. (2010). Dbod-ds: Distance based outlier detection for data streams. In DEXA, Part I 21, pages 122–136.
Stonebraker, M., Çetintemel, U., and Zdonik, S. (2005). The 8 requirements of real-time stream processing. SIGMOD Record, 34(4):42–47.
Toliopoulos, T., Gounaris, A., Tsichlas, K., Papadopoulos, A., and Sampaio, S. (2020). Continuous outlier mining of streaming data in flink. Information Systems, 93.
Tran, L., Fan, L., and Shahabi, C. (2016). Distance-based outlier detection in data streams. PVLDB, 9(12):1089–1100.
Wang, X.-T., Shen, D.-R., Bai, M., Nie, T.-Z., Kou, Y., and Yu, G. (2015). An efficient algorithm for distributed outlier detection in large multi-dimensional datasets. JCST, 30(6):1233–1248.
Yan, Y., Cao, L., Kulhman, C., and Rundensteiner, E. (2017). Distributed local outlier detection in big data. In SIGKDD, pages 1225–1234.
Yang, D., Rundensteiner, E. A., and Ward, M. O. (2009). Neighbor-based pattern detection for windows over streaming data. In EDBT, pages 529–540.
Zhu, S., Li, J., Huang, J., Luo, S., and Peng, W. (2014). A mapreduced-based and cell-based outlier detection algorithm. WUJNS, 19:199–205.
Zimek, A., Schubert, E., and Kriegel, H.-P. (2012). A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min., 5(5):363–387.