Set Similarity Joins on Heterogeneous Clusters

Larissa Ramos Marques Silva; Leonardo Andrade Ribeiro

doi:10.5753/jidm.2023.3283

Authors

Larissa Ramos Marques Silva Implanta IT Solutions LTDA https://orcid.org/0009-0005-9651-427X
Leonardo Andrade Ribeiro Universidade Federal de Goiás https://orcid.org/0000-0002-3534-9232

DOI:

https://doi.org/10.5753/jidm.2023.3283

Keywords:

Advanced Query Processing, Distributed Computing, GPU, Heterogeneous Hardware, Set Similarity Join

Abstract

Set similarity join (SSJ) is a fundamental operation widely used in many application scenarios, including data discovery, cleaning, and integration. As this operation is computationally expensive, its runtime can be excessive on large volumes of data. Previous research has focused on improving SSJ scalability using distributed computing or the massive parallelism available in GPUs, but not both. Hence, these efforts cannot fully exploit the processing power of increasingly heterogeneous computing architectures. In this article, we present an approach to evaluating SSJ on a heterogeneous cluster of compute nodes equipped with CPU and GPU. We propose a cost model to distribute the workload between these processors and apply this model to integrate two algorithms, one distributed and the other parallel, in a coprocessing fashion. Experimental results show that our proposal is efficient, scalable, and outperforms previous work.

Downloads

Download data is not yet available.

References

Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up All Pairs Similarity Search. In Proceedings of the International World Wide Web Conferences, pages 131–140.

Bellas, C. and Gounaris, A. (2019). Exact Set Similarity Joins for Large Datasets in the GPGPU Paradigm. In Proceedings of the International Workshop on Data Management on New Hardware, pages 5:1–5:10.

Bellas, C. and Gounaris, A. (2021). HySet: A Hybrid Framework for Exact Set Similarity Join using a GPU. Parallel Computing, 104-105:102790.

Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the IEEE International Conference on Data Engineering, page 5.

Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, pages 137–150.

Fier, F., Augsten, N., Bouros, P., Leser, U., and Freytag, J. (2018). Set Similarity Joins on MapReduce: An Experimental Survey. Proceedings of the VLDB Endowment, 11(10):1110–1122.

Fier, F. and Freytag, J. (2021). Scaling Up Set Similarity Joins Using a Cost-Based Distributed-Parallel Framework. In International Conference on Similarity Search and Applications, pages 17–31.

Fier, F. and Freytag, J. (2022). Parallelizing Filter-and-verification based Exact Set Similarity Joins on Multicores. Information Systems, 108:101912.

Mann, W., Augsten, N., and Bouros, P. (2016). An Empirical Evaluation of Set Similarity Join Techniques. Proceedings of the VLDB Endowment, 9(9):636–647.

Oliveira, D., Borges, F. F., and Ribeiro, L. A. (2017). Uma Abordagem para Processamento Distribuído de Junção por Similaridade sobre Múltiplos Atributos. In Proceedings of the Brazilian Symposium on Databases, pages 300–305.

Oliveira, D. J. C., Borges, F. F., Ribeiro, L. A., and Cuzzocrea, A. (2018). Set Similarity Joins with Complex Expressions on Distributed Platforms. In Proceedings of the Symposium on Advances in Databases and Information Systems, pages 216–230.

Quirino, R. D., Ribeiro-Júnior, S., Ribeiro, L. A., and Martins, W. S. (2017). fgssjoin: A GPU-based Algorithm for Set Similarity Joins. In International Conference on Enterprise Information Systems, pages 152–161. SciTePress.

Ribeiro, L. A. and Härder, T. (2011). Generalizing Prefix Filtering to Improve Set Similarity Joins. Information Systems, 36(1):62–78.

Ribeiro-Júnior, S., Quirino, R. D., Ribeiro, L. A., and Martins, W. S. (2016). gSSJoin: a GPU-based Set Similarity Join Algorithm. In Proceedings of the Brazilian Symposium on Databases, pages 64–75. SBC.

Ribeiro-Júnior, S., Quirino, R. D., Ribeiro, L. A., and Martins, W. S. (2017). Fast Parallel Set Similarity Joins on Many-core Architectures. Journal of Information and Data Management, 8(3):255–270.

Rosenfeld, V., Breß, S., and Markl, V. (2022). Query Processing on Heterogeneous CPU/GPU Systems. ACM Computing Surveys, 55(1).

Sarawagi, S. and Kirpal, A. (2004). Efficient Set Joins on Similarity Predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 743–754. ACM.

Shanbhag, A., Madden, S., and Yu, X. (2020). A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 1617–1632.

Silva, L. R. M. and Ribeiro, L. A. (2022). Junções por Similaridade usando Processamento Distribuído e Paralelismo Massivo. In Proceedings of the Brazilian Symposium on Databases, pages 421–426. SBC.

Sun, J., Shang, Z., Li, G., Bao, Z., and Deng, D. (2019). Balance-Aware Distributed String Similarity-Based Query Processing System. Proceedings of the VLDB Endowment, 12(9):961–974.

Suri, S., Ilyas, I. F., Ré, C., and Rekatsinas, T. (2021). Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins. Proceedings of the VLDB Endowment, 15(3):699–712.

Vernica, R., Carey, M. J., and Li, C. (2010). Efficient Parallel Set-similarity Joins using MapReduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 495–506.

Wang, X., Qin, L., Lin, X., Zhang, Y., and Chang, L. (2017). Leveraging Set Relations in Exact Set Similarity Join. Proceedings of the VLDB Endowment, 10(9):925–936.

Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. (2011). Efficient Similarity Joins for Near-Duplicate Detection. ACM Transactions on Database Systems, 36(3):15:1–15:41.

Xu, L., Butt, A. R., Lim, S., and Kannan, R. (2018). A Heterogeneity-Aware Task Scheduler for Spark. In Proceedings of the IEEE International Conference on Cluster Computing, pages 245–256.

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation, pages 15–28.

Set Similarity Joins on Heterogeneous Clusters

Authors

DOI:

Keywords:

Abstract

Downloads

References

Downloads

Published

How to Cite

Issue

Section

Make a Submission

Metrics: