Set Similarity Joins on Heterogeneous Clusters




Advanced Query Processing, Distributed Computing, GPU, Heterogeneous Hardware, Set Similarity Join


Set similarity join (SSJ) is a fundamental operation in applications such as data discovery, cleaning, and integration. Because the operation is computationally expensive, its runtime can be excessive on large volumes of data. Previous research has improved SSJ scalability using either distributed computing or the massive parallelism of GPUs, but not both. Hence, these efforts cannot fully exploit the processing power of increasingly heterogeneous computing architectures. In this article, we present an approach to evaluating SSJ on a heterogeneous cluster of compute nodes equipped with CPUs and GPUs. We propose a cost model to distribute the workload between these processors and apply this model to integrate two algorithms, one distributed and the other parallel, in a coprocessing fashion. Experimental results show that our proposal is efficient and scalable, and that it outperforms previous work.




Ramos Marques Silva, L., & Andrade Ribeiro, L. (2023). Set Similarity Joins on Heterogeneous Clusters. Journal of Information and Data Management, 14(2).



SBBD 2022 Short papers - Extended papers