An extension to Kendall’s Tau metric to evaluate dissimilarities between data series

Bruno Erbisti; David Kohan Marzagão; Vanessa Braganholo

doi:10.5753/jbcs.2024.2803

Authors

Bruno Erbisti Universidade Federal Fluminense https://orcid.org/0000-0002-0920-0354
David Kohan Marzagão King's College London, University of Oxford https://orcid.org/0000-0001-8475-7913
Vanessa Braganholo Universidade Federal Fluminense https://orcid.org/0000-0002-1184-8192

DOI:

https://doi.org/10.5753/jbcs.2024.2803

Keywords:

Kendall's Tau, metrics, data analysis, dissimilarity

Abstract

Data analysis is performed to examine, interpret, and extract information from data series, and it includes applying various methods and techniques to understand patterns and compare data. An approach to compare data is to use rank metrics that help identify how distinct two data series are when compared to each other according to patterns, formats, criteria, and dimensions in both data series. Among these metrics, Kendall’s Tau metric stands out, as it is robust and inexpensive, widely used in analyzing sequences and genomes, to detect errors in flash memories, and to compare distributions and top-k ranked values. However, a challenge arises when comparing lists with different lengths or when lists do not share the same elements. This happens, for example, when lists are defined by top-k elements, commonly called k-list. In this case, there is no guarantee that two k-lists share the same set of elements. Traditional metrics like Kendall’s Tau are designed to quantify differences only between shared elements in lists. Recognizing this limitation, a possible solution is to apply the metric to the shared elements of the lists. Another solution, named the generalization of Kendall’s Tau, proposed by Fagin et al., considers all elements in two lists. However, this generalization of Kendall Tau is a semi-metric, as it does not satisfy the triangular inequality. To solve this problem, we propose the Extended Kendall Tau (EKT) metric that meets all the conditions of a metric and simultaneously considers the distinct elements of the compared lists. The proposed metric was evaluated by applying conventional Kendall’s Tau and the extended Kendall’s Tau over 40 text files divided into five different languages (eight files per language). We compared KT and EKT measures within the ”same language” and across ”other language” files for the two scenarios. The results revealed that both methods could accurately identify the differences between the groups of texts of the ”same language” and ”other language”. However, the numerical results show that EKT is able to more significantly highlight the difference between groups of texts of different languages.

Downloads

Download data is not yet available.

References

Abdi, H. (2007). The kendall rank correlation coefficient. Encyclopedia of Measurement and Statistics. Sage, Thousand Oaks, CA, pages 508-510. Available online [link].

Almeida-Neto, M., Guimaraes, P., Guimaraes Jr, P. R., Loyola, R. D., and Ulrich, W. (2008). A consistent metric for nestedness analysis in ecological systems: reconciling concept and measurement. Oikos, 117(8):1227-1239. DOI: 10.1111/j.0030-1299.2008.16644.x.

Astrachan, O. (2003). Bubble sort: an archaeological algorithmic analysis. ACM Sigcse Bulletin, 35(1):1-5. DOI: 10.1145/792548.611918.

Bakkelund, D. (2009). An lcs-based string metric. Olso, Norway: University of Oslo. Available online [link].

Bar-Ilan, J. (2010). Rankings of information and library science journals by jif and by h-type indices. Journal of informetrics, 4(2):141-147. DOI: 10.1016/j.joi.2009.11.006.

Baumeister, D., Hogrebe, T., and Rey, L. (2019). Generalized distance bribery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1764-1771. DOI: 10.1609/aaai.v33i01.33011764.

Berger, B., Waterman, M. S., and Yu, Y. W. (2020). Levenshtein distance, sequence comparison and biological database search. IEEE transactions on information theory, 67(6):3287-3294. DOI: 10.1109/TIT.2020.2996543.

Bewick, V., Cheek, L., and Ball, J. (2003). Statistics review 7: Correlation and regression. Critical care, 7:1-9. DOI: 10.1186/cc2401.

Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3-62. Available online [link].

Bookstein, A., Kulyukin, V. A., and Raita, T. (2002). Generalized hamming distance. Information Retrieval, 5:353-375. DOI: 10.1023/A:1020499411651.

Buzaglo, S. and Etzion, T. (2015). Bounds on the size of permutation codes with the kendall $tau$-metric. IEEE Transactions on Information Theory, 61(6):3241-3250. DOI: 10.1109/TIT.2015.2424701.

Cayley, A. (1849). Lxxvii. note on the theory of permutations. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 34(232):527-529. DOI: 10.1080/14786444908646287.

Chee, Y. M. et al. (2014). Breakpoint analysis and permutation codes in generalized kendall tau and cayley metrics. In 2014 IEEE International Symposium on Information Theory, pages 2959-2963. IEEE. DOI: 10.1109/ISIT.2014.6875376.

Choudhary, B. (1993). The elements of complex analysis. New Age International. Book.

Cicirello, V. A. (2020). Kendall tau sequence distance: Extending kendall tau from ranks to sequences. EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, 7(23). DOI: 10.4108/eai.13-7-2018.163925.

Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic press. DOI: 10.4324/9780203771587.

Collas, F. and Irurozki, E. (2021). Concentric mixtures of mallows models for top-$ k $ rankings: sampling and identifiability. In International Conference on Machine Learning, pages 2079-2088. PMLR. Available online [link].

Copson, E. T. (1988). Metric spaces. Number 57. CUP Archive. Book.

Cormode, G. and Muthukrishnan, S. (2005). Substring compression problems. In SODA, volume 5, pages 321-330. Citeseer. Available online [link].

de Lima, T. A. and Ayala-Rincón, M. (2012). Complexity of cayley distance and other general metrics on permutation groups. In 2012 7th Colombian Computing Congress (CCC), pages 1-6. IEEE. DOI: 10.1109/ColombianCC.2012.6398020.

DeWitt, Y. W. D. J. (2004). Computing pagerank in a distributed internet search system. In Proceedings of the 30th VLDB Conference, Toronto, Canada, volume 30, pages 420-431. Book.

Elhadi, M. and Al-Tobi, A. (2009). Webpage duplicate detection using combined pos and sequence alignment algorithm. In 2009 WRI world congress on computer science and information engineering, volume 1, pages 630-634. IEEE. DOI: 10.1109/CSIE.2009.771.

Fagin, R., Kumar, R., and Sivakumar, D. (2003). Comparing top k lists. SIAM Journal on discrete mathematics, 17(1):134-160. DOI: 10.1137/S0895480102412856.

Fligner, M. A. and Verducci, J. S. (1986). Distance based ranking models. Journal of the Royal Statistical Society: Series B (Methodological), 48(3):359-369. DOI: 10.1111/j.2517-6161.1986.tb01420.x.

Galley, M., Brockett, C., Sordoni, A., Ji, Y., Auli, M., Quirk, C., Mitchell, M., Gao, J., and Dolan, B. (2015). deltableu: A discriminative metric for generation tasks with intrinsically diverse targets. arXiv preprint arXiv:1506.06863. DOI: 10.48550/arXiv.1506.06863.

Gillette, T. A., Hosseini, P., and Ascoli, G. A. (2015). Topological characterization of neuronal arbor morphology via sequence representation: Ii-global alignment. BMC bioinformatics, 16(1):1-17. DOI: 10.1186/s12859-015-0605-1.

Gilpin, A. R. (1993). Table for conversion of kendall's tau to spearman's rho within the context of measures of magnitude of effect for meta-analysis. Educational and psychological measurement, 53(1):87-92. DOI: 10.1177/0013164493053001007.

Hamming, R. W. (1950). Error detecting and error correcting codes. The Bell system technical journal, 29(2):147-160. DOI: 10.1002/j.1538-7305.1950.tb00463.x.

Hong, Y., Vaidya, J., and Lu, H. (2011). Search engine query clustering using top-k search results. In 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, volume 1, pages 112-119. IEEE. DOI: 10.1109/WI-IAT.2011.224.

Hunt, J. W. and MacIlroy, M. D. (1976). An algorithm for differential file comparison. Bell Laboratories Murray Hill. Available online [link].

Jadoon, S., Solehria, S. F., Rehman, S. u., and Jan, H. (2011). Design and analysis of optimized selection sort algorithm. International Journal of Electric & Computer Sciences (IJECS-IJENS), 11(01):16-22. Available online [link].

Jiang, A., Mateescu, R., Schwartz, M., and Bruck, J. (2009). Rank modulation for flash memories. IEEE Transactions on Information Theory, 55(6):2659-2673. DOI: 10.1109/TIT.2009.2018336.

Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2):81-93. DOI: 10.1093/biomet/30.1-2.81.

Kendall, M. G. (1976). Rank correlation methods. Charles Griffin & Co., London. Book.

Kruskal, J. B. (1983). An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM review, 25(2):201-237. DOI: 10.1137/1025045.

Kukich, K. (1992). Techniques for automatically correcting words in text. ACM computing surveys (CSUR), 24(4):377-439. DOI: 10.1145/146370.146380.

Lin, C.-Y. and Och, F. J. (2004). Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 605-612. Available online [link].

MacKay, D. J. (1999). Good error-correcting codes based on very sparse matrices. IEEE transactions on Information Theory, 45(2):399-431. DOI: 10.1109/18.748992.

Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50-60. DOI: 10.1214/aoms/1177730491.

Monjardet, B. (1998). On the comparison of the spearman and kendall metrics between linear orders. Discrete mathematics, 192(1-3):281-292. DOI: 10.1016/S0012-365X(98)00076-4.

Moulton, V. and Steel, M. (2012). The ‘butterfly effect’in cayley graphs with applications to genomics. Journal of mathematical biology, 65(6-7):1267-1284. DOI: 10.1007/s00285-011-0498-1.

Nalepa, B. and Gwiazda, A. (2019). Spearman’s rho modification in digital image processing. In IOP Conference Series: Materials Science and Engineering, volume 591, page 012060. IOP Publishing. DOI: 10.1088/1757-899X/591/1/012060.

Norouzi, M., Fleet, D. J., and Salakhutdinov, R. R. (2012). Hamming distance metric learning. Advances in neural information processing systems, 25. Available online [link].

Pal, K. and Michel, S. (2016). Efficient similarity search across top-k lists under the kendall's tau distance. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management, pages 1-12. DOI: 10.1145/2949689.2949709.

Pearson, K. (1895). Vii. note on regression and inheritance in the case of two parents. proceedings of the royal society of London, 58(347-352):240-242. DOI: 10.1098/rspl.1895.0041.

Schober, P., Boer, C., and Schwarte, L. A. (2018). Correlation coefficients: appropriate use and interpretation. Anesthesia & analgesia, 126(5):1763-1768. DOI: 10.1213/ANE.0000000000002864.

Schröder, G., Thiele, M., and Lehner, W. (2011). Setting goals and choosing metrics for recommender system evaluations. In UCERSTI2 workshop at the 5th ACM conference on recommender systems, Chicago, USA, volume 23, page 53. Available online [link].

Sculley, D. (2007). Rank aggregation for similar items. In Proceedings of the 2007 SIAM international conference on data mining, pages 587-592. SIAM. DOI: 10.1137/1.9781611972771.66.

Shyu, S. J. and Tsai, C.-Y. (2009). Finding the longest common subsequence for multiple biological sequences by ant colony optimization. Computers & Operations Research, 36(1):73-91. DOI: 10.1016/j.cor.2007.07.006.

Spearman, C. (1904). The proof and measurement of association between two things.. DOI: 10.1037/11491-005.

Spearman, C. (1906). Footrule for measuring correlation. British Journal of Psychology, 2(1):89. Available online [link].

Sridharamurthy, R., Masood, T. B., Kamakshidasan, A., and Natarajan, V. (2018). Edit distance between merge trees. IEEE transactions on visualization and computer graphics, 26(3):1518-1531. DOI: 10.1109/TVCG.2018.2873612.

Wagner, R. A. and Fischer, M. J. (1974). The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168-173. DOI: 10.1145/321796.321811.

Weiss, S., Van Treuren, W., Lozupone, C., Faust, K., Friedman, J., Deng, Y., Xia, L. C., Xu, Z. Z., Ursell, L., Alm, E. J., et al. (2016). Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. The ISME journal, 10(7):1669-1681. DOI: 10.1038/ismej.2015.235.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80-83. DOI: 10.1007/978-1-4612-4380-9_16.

Wilson, W. A. (1931). On semi-metric spaces. American Journal of Mathematics, 53(2):361-373. DOI: 10.2307/2370790.

Wing, C., Simon, K., and Bello-Gomez, R. A. (2018). Designing difference in difference studies: best practices for public health policy research. Annual review of public health, 39:453-469. DOI: 10.1146/annurev-publhealth-040617-013507.

Xiong, H., Brodie, M., and Ma, S. (2006). Top-cop: Mining top-k strongly correlated pairs in large databases. In Sixth International Conference on Data Mining (ICDM'06), pages 1162-1166. IEEE. DOI: 10.1109/ICDM.2006.161.

Yang, S., Schoeny, C., and Dolecek, L. (2019). Theoretical bounds and constructions of codes in the generalized cayley metric. IEEE Transactions on Information Theory, 65(8):4746-4763.

Zhang, Y. and Ge, G. (2015). Snake-in-the-box codes for rank modulation under kendall’s $tau$ -metric. IEEE Transactions on Information Theory, 62(1):151-158. DOI: 10.1109/TIT.2015.2502602.

Zhou, Y. and Liu, Y. (2014). Correlation analysis of performance metrics for classifier. In Decision Making and Soft Computing: Proceedings of the 11th International FLINS Conference, pages 487-492. World Scientific. DOI: 10.1142/9789814619998_0081.