Usage of the Bag Distance Filtering with In-Memory Metric Trees

Authors

  • Sergio Luis Sardi Mergen Universidade Federal de Santa Maria

DOI:

https://doi.org/10.5753/jidm.2024.3060

Keywords:

Metric space, Metric trees, Similarity search

Abstract

Metric trees are efficient indexing structures for multidimensional objects defined in terms of a metric space. One possible application is for string similarity search, using the edit distance as the metric function. A previous work proposes clustering objects under leaf nodes and using the bag distance as a filtering step before the edit distance is computed. Cost predictions estimate that the filtering compensates in practical scenarios. The work has important implications when data resides on secondary storage, where nodes have a fixed size that aligns with page disks. In this paper, we expand the discussion by using the bag distance filtering step for in-memory metric trees, where the clusters have no size constraints. We adjust existing metric trees to support leaf nodes with arbitrary cluster sizes and incorporate parameters based on size and density to decide when a leaf node should be subdivided. Experiments show that cluster size can have a substantial impact during both index construction and search. We report the gains achieved in terms of processing cost and the number of distance computations when using the most suited values for the cluster size and density parameters.

Downloads

Download data is not yet available.

References

Bartolini, I., Ciaccia, P., and Patella, M. (2002). String matching with metric trees using an approximate distance. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval, SPIRE 2002, pages 271–283, London, UK, UK. Springer-Verlag.

Blumenthal, L. M. (1954). Theory and applications of distance geometry. Clarendon Press - Oxford.

Bozkaya, T. and Ozsoyoglu, M. (1997). Distance-based indexing for high-dimensional metric spaces. In ACM SIGMOD Record, volume 26, pages 357–368. ACM.

Burkhard, W. A. and Keller, R. M. (1973). Some approaches to best-match file searching. Communications of the ACM, 16(4):230–236.

Chen, L., Gao, Y., Li, X., Jensen, C. S., and Chen, G. (2015). Efficient metric indexing for similarity search. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 591–602. IEEE.

Chen, L., Gao, Y., Zheng, B., Jensen, C. S., Yang, H., and Yang, K. (2017). Pivot-based metric indexing. Proceedings of the VLDB Endowment, 10(10):1058–1069.

Deng, D., Li, G., Wen, H., Jagadish, H., and Feng, J. (2016). Meta: an efficient matching-based method for error-tolerant autocompletion. Proceedings of the VLDB Endowment, 9(10):828–839.

Deza, M. M. and Deza, E. (2009). Encyclopedia of distances. In Encyclopedia of Distances, pages 1–583. Springer.

Figueroa, K., Navarro, G., and Chávez, E. (2007). Metric spaces library. Available at [link].

Hjaltason, G. R. and Samet, H. (1999). Distance browsing in spatial databases. ACM Trans. Database Syst., 24(2):265–318. DOI: 10.1145/320248.320255.

Jensen, A. H., Lauridsen, F. A., Zardbani, F., Idreos, S., and Karras, P. (2021). Revisiting multidimensional adaptive indexing. EDBT, pages 469ś474.

Kahveci, T. and Singh, A. K. (2001). An efficient index structure for string databases. In VLDB, volume 1, pages 351–360.

Lee, H., Ng, R. T., and Shim, K. (2007). Extending q-grams to estimate selectivity of string matching with low edit distance. In Proceedings of the 33rd international conference on Very large data bases, pages 195–206. VLDB Endowment.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710.

Li, H., Ni, B., Wong, M.-H., and Leung, K.-S. (2011). A fast cuda implementation of agrep algorithm for approximate nucleotide sequence matching. In Application Specific Processors (SASP), 2011 IEEE 9th Symposium on, pages 74–77. IEEE.

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3):443–453.

Papamichail, D. and Papamichail, G. (2009). Improved algorithms for approximate string matching. BMC bioinformatics, 10(1):S10.

Sprenger, S., Schäfer, P., and Leser, U. (2019). Bb-tree: A main-memory index structure for multidimensional range queries. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1566–1569. DOI: 10.1109/ICDE.2019.00143.

Traina, C., Traina, A., Faloutsos, C., and Seeger, B. (2002). Fast indexing and visualization of metric data sets using slim-trees. IEEE Transactions on Knowledge and Data Engineering, 14(2):244–260.

Traina Jr, C., Traina, A. J., Vieira, M. R., Faloutsos, C., et al. (2007). The omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient. The VLDB Journal—The International Journal on Very Large Data Bases, 16(4):483–505.

Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, volume 93, pages 311–321.

Zezula, P., Amato, G., Dohnal, V., and Batko, M. (2006). Similarity search: the metric space approach, volume 32. Springer Science & Business Media.

Zhu, Y., Chen, L., Gao, Y., and Jensen, C. S. (2022). Pivot selection algorithms in metric spaces: a survey and experimental study. The VLDB Journal, pages 1–25.

Downloads

Published

2024-04-17

How to Cite

Luis Sardi Mergen, S. (2024). Usage of the Bag Distance Filtering with In-Memory Metric Trees. Journal of Information and Data Management, 15(1), 255–267. https://doi.org/10.5753/jidm.2024.3060

Issue

Section

Regular Papers