Beyond Species: Enhancing Botanical Data Integrity Using Similarity Metrics in Authorship Attribution

Authors

Rios Delponte, L., Friedrich Dorneles, C., and Silmara Werner, S.

DOI:

https://doi.org/10.5753/jidm.2026.5847

Keywords:

Botanical Databases, Text Similarity, International Code of Nomenclature

Abstract

This study extends prior work on the deduplication of botanical authorship records governed by the International Code of Nomenclature (ICN). We introduce two new datasets (Sargassaceae and Agaricaceae) to evaluate the performance of several text similarity algorithms: Jaccard, Levenshtein, Jaro-Winkler, Metaphone, N-grams, Smith-Waterman, and Fingerprinting. Our updated methodology incorporates enhanced preprocessing strategies, new threshold calibration techniques, and a comprehensive evaluation based on precision, recall, and F1 score. The results reaffirm the robustness of Smith-Waterman and highlight the dataset-dependent behavior of Metaphone and Fingerprinting. This expanded analysis contributes to a more general understanding of text similarity challenges in biological databases and reinforces the importance of selecting algorithms to suit the taxonomic structure and data quality of each dataset.
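To make the evaluation setup in the abstract concrete, the following is a minimal Python sketch of threshold-based duplicate detection scored with precision, recall, and F1. It is not the authors' implementation: the function names, the normalization rule, the sample author strings, and the 0.8 threshold are all assumptions chosen for illustration, and only two of the listed measures (Jaccard over character bigrams and Levenshtein) are shown.

```python
from itertools import combinations


def normalize(name):
    """Assumed preprocessing: lowercase, drop punctuation, collapse spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def bigrams(text):
    """Set of character 2-grams; falls back to the whole string if too short."""
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}


def jaccard(a, b):
    """Jaccard similarity between the bigram sets of two strings."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def candidate_duplicates(records, similarity, threshold):
    """Index pairs whose normalized strings score at or above the threshold."""
    cleaned = [normalize(r) for r in records]
    return {(i, j) for i, j in combinations(range(len(records)), 2)
            if similarity(cleaned[i], cleaned[j]) >= threshold}


def precision_recall_f1(predicted, actual):
    """Score predicted duplicate pairs against a labeled set of true pairs."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative records and labels (assumptions, not data from the study).
records = ["C. Agardh", "C.Agardh", "Kütz."]
true_pairs = {(0, 1)}
found = candidate_duplicates(records, levenshtein_similarity, threshold=0.8)
print(precision_recall_f1(found, true_pairs))
```

Sweeping the threshold over a range of values and recomputing the three scores at each step is one simple way to see how sensitive a given measure is on a given dataset.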

Published

2026-03-13

How to Cite

Rios Delponte, L., Friedrich Dorneles, C., & Silmara Werner, S. (2026). Beyond Species: Enhancing Botanical Data Integrity Using Similarity Metrics in Authorship Attribution. Journal of Information and Data Management, 17(1), 26–35. https://doi.org/10.5753/jidm.2026.5847

Issue

Vol. 17, No. 1 (2026)

Section

SBBD 2024 Full papers - Extended papers