Exploiting Machine Learning Algorithms in the Classification Step of Record Linkage
DOI:
https://doi.org/10.5753/jidm.2025.4303
Keywords:
Record Linkage, Machine Learning, Classification, Deduplication
Abstract
Record linkage is a well-known task that aims to identify pairs of duplicate records in datasets. In this work, we evaluated several machine learning classification algorithms (AdaBoost, MLP, SVM, Random Forest, and XGBoost) in the context of record linkage. We conducted experiments to assess how balanced and unbalanced training sets influence the efficacy of the record linkage classification step. We also explored the use of scatterplots to support the qualitative discussion of the experimental results. In these experiments, the Random Forest algorithm achieved the highest F-measure on the evaluated datasets. The XGBoost model also produced competitive results, especially on bibliographic and movie datasets.
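As a rough illustration of two notions mentioned in the abstract, the sketch below computes the F-measure for the match class and builds a balanced training set by random undersampling. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names `f_measure` and `undersample_balanced`, the label encoding (1 = match, 0 = non-match), and the toy similarity vectors are all hypothetical, and the paper may balance its training sets differently.

```python
import random


def f_measure(y_true, y_pred):
    """F-measure (F1) of the match class; labels are 1 = match, 0 = non-match."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def undersample_balanced(pairs, labels, seed=0):
    """Randomly drop majority-class pairs so both classes end up the same size."""
    rng = random.Random(seed)
    matches = [i for i, y in enumerate(labels) if y == 1]
    non_matches = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(matches), len(non_matches))
    keep = sorted(rng.sample(matches, n) + rng.sample(non_matches, n))
    return [pairs[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: one similarity vector per candidate record pair.
pairs = [(0.9, 0.8), (0.95, 0.7), (0.6, 0.9), (0.2, 0.1),
         (0.3, 0.2), (0.1, 0.4), (0.5, 0.3), (0.2, 0.2)]
labels = [1, 1, 1, 0, 0, 0, 0, 0]          # unbalanced: 3 matches, 5 non-matches
predicted = [1, 1, 0, 1, 0, 0, 0, 0]       # output of some hypothetical classifier

balanced_pairs, balanced_labels = undersample_balanced(pairs, labels)
score = f_measure(labels, predicted)       # precision = recall = 2/3
```

Here the unbalanced set has three matches and five non-matches, so undersampling keeps three pairs of each class. In real record linkage data the imbalance is typically far larger, which is why the choice between balanced and unbalanced training sets can matter.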

