A Framework to Compute Entity Relatedness in Large RDF Knowledge Bases
DOI:
https://doi.org/10.5753/jidm.2022.2435
Keywords:
entity relatedness, RDF graph, path search strategy, entity similarity, path ranking, DBpedia, SPARK
Abstract
The entity relatedness problem is the problem of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. This article addresses the problem by combining distributed RDF path search and ranking strategies in a framework called DCoEPinKB, which helps reduce overall execution time on large RDF graphs while maintaining adequate ranking accuracy. The framework supports the implementation of different strategies and enables their comparison. The article also reports experiments with data from DBpedia, which provide insights into the performance of the different strategies.