Exploring Evolutionary Patterns: A Jupyter Notebook for Discovering Frequent Subtrees in Phylogenetic Tree Databases

DOI:

https://doi.org/10.5753/jidm.2025.4361

Keywords:

phylogenetics, frequent subtrees, bioinformatics, workflows

Abstract

Exploratory analysis of the evolutionary information in a phylogenetic tree database is a crucial task in bioinformatics. Phylogenetic trees are typically built by trying several evolutionary and tree construction methods. For instance, Maximum Parsimony, Maximum Likelihood, and Neighbor-Joining may yield slightly different trees because they infer phylogenies in distinct ways (e.g., distance-based versus character-based methods). Analyzing evolutionary data therefore often entails identifying frequent subtrees, that is, subtrees that recur across a given set of phylogenetic trees. However, this identification can be computationally intensive, depending on the size of the input tree database. In this manuscript, we introduce the NMFSt.P Notebook, which aims to simplify the comparison of multiple phylogenetic trees in order to identify the frequent subtrees in the database. Our experiments show that NMFSt.P produces results comparable to the baseline approach while giving the scientist additional flexibility.
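To make the idea of a frequent subtree concrete, the following is a minimal sketch, not the NMFSt.P implementation. It assumes rooted trees given as label-only Newick strings (no branch lengths, no whitespace inside labels) and approximates a "subtree" by a clade, that is, the set of leaf labels below an internal node. The function names `parse_newick`, `collect_clades`, and `frequent_clades`, as well as the `min_count` parameter, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def parse_newick(text):
    """Parse a label-only Newick string into nested tuples.

    Leaves become strings; internal nodes become tuples of children.
    Assumption: no branch lengths, comments, or quoted labels.
    """
    text = "".join(text.split()).rstrip(";")  # drop whitespace and the final ';'
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            pos += 1  # consume ')'
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


def collect_clades(node, found):
    """Return the leaf set below `node`, adding each internal clade to `found`."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset().union(*(collect_clades(child, found) for child in node))
    found.add(leaves)
    return leaves


def frequent_clades(newick_trees, min_count):
    """Map each clade that occurs in at least `min_count` trees to its tree count."""
    counts = Counter()
    for newick in newick_trees:
        found = set()  # a clade is counted at most once per tree
        collect_clades(parse_newick(newick), found)
        counts.update(found)
    return {clade: n for clade, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Three hypothetical trees over the same four taxa.
    trees = ["((A,B),(C,D));", "(((A,B),C),D);", "((A,C),(B,D));"]
    for clade, n in frequent_clades(trees, min_count=2).items():
        print(sorted(clade), n)
```

Counting clades per tree is much cheaper than general frequent-subtree mining, which must also consider subtrees induced on subsets of taxa; the sketch only illustrates why the work grows with the number and size of the trees in the database. The clade containing all taxa appears in every tree and is therefore always reported.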

Published

2025-08-23

How to Cite

Moraes, J. V., Ferrari, C., Rosseti, I., & de Oliveira, D. (2025). Exploring Evolutionary Patterns: A Jupyter Notebook for Discovering Frequent Subtrees in Phylogenetic Tree Databases. Journal of Information and Data Management, 16(1), 232–239. https://doi.org/10.5753/jidm.2025.4361

Section

Brazilian eScience Workshop 2023 Best Papers - Extended Versions