Hierarchical and Non-Hierarchical Classification of Transposable Elements with a Genetic Algorithm
DOI:
https://doi.org/10.5753/jidm.2018.2051Keywords:
Genetic Algorithms, Hierarchical Classification, Machine Learning, Rule Induction, Transposable ElementsAbstract
In traditional classification an instance is assigned to one class within a small set of classes. However, there are problems where an instance is related to many classes hierarchically structured, known as Hierarchical Classification
(HC), which is present in many domains like Text Categorization, Music Genre Classification and Bioinformatics. A topic that has gained attention recently is the classification of Transposable Elements (TEs), which are DNA sequences capable of moving inside the genome. TEs have a great importance in the genetic variability of species, since they can modify the functionality of host genes. Despite the research relevance, just a few tools perform its automatic classification and most of them do not use more elaborated strategies, like using Machine Learning to learn models from data. Moreover, the interpretability of these methods is still an issue. In this work, we extend the original study that proposed the global method HC-GA, presenting some improvements such as new fitness functions which were compared and analyzed per level and in the leaf nodes, as well as with respect to the interpretability of the rules generated. Besides, we compare the new best results of HC-GA with the results of flat methods previously reported in the original paper for the task of predicting the TEs leaf node classes. As experiments showed, flat methods do not performed well for HC datasets, even ignoring the hierarchical relationships among classes. We believe this occurred due the high imbalance of these datasets, which is something that flat methods do not handle well, unlike HC ones. HC-GA overcame flat classifiers, presenting promising results for multiple class levels including leaf node classes, even though it was not originally designed for this purpose, and considering the difficulty of predicting lower classes in a hierarchy.
Downloads
References
Abrusán, G., Grundmann, N., DeMester, L., and Makalowski, W. Teclass—a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 25 (10): 1329–1330, 2009.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. Basic local alignment search tool. Journal of molecular biology 215 (3): 403–410, 1990.
Biémont, C. A brief history of the status of transposable elements: from junk dna to major players in evolution. Genetics 186 (4): 1085–1093, 2010.
Costa, E. P., Lorena, A. C., Carvalho, A. C. P. L. F., and Freitas, A. A. Comparing several approaches for hierarchical classification of proteins with decision trees. In II Brazilian Symposium on Bioinformatics. Lecture Notes in Bioinformatics, vol. 4643. Springer-Verlag, Berlin, Heidelberg, pp. 126–137, 2007.
Davis, J. and Goadrich, M. The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning. ACM, pp. 233–240, 2006.
De Castro, L. N. Fundamentals of natural computing: basic concepts, algorithms, and applications. CRC Press, 2006.
Feschotte, C. Transposable elements and the evolution of regulatory networks. Nature Reviews Genetics 9 (5): 397–405, 2008.
Feschotte, C., Keswani, U., Ranganathan, N., Guibotsy, M. L., and Levine, D. Exploring repetitive DNA landscapes using REPCLASS, a tool that automates the classification of transposable elements in eukaryotic genomes. Genome biology and evolution vol. 1, pp. 205–220, 2009.
Ghazi, D., Inkpen, D., and Szpakowicz, S. Hierarchical versus flat classification of emotions in text. In Proceedings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text. Association for Computational Linguistics, pp. 140–146, 2010.
Hoede, C., Arnoux, S., Moisset, M., Chaumier, T., Inizan, O., Jamilloux, V., and Quesneville, H. Pastec: an automatic transposable element classification tool. PLoS One 9 (5): e91929, 2014.
Holland, J. H. Escaping brittleness: The possibilities of general-purpose learnlng algorithms applied to parallel rule-based systems. Machine learning: An artificial intelligence approach vol. 2, 1986.
Jurka, J., Kapitonov, V. V., Pavlicek, A., Klonowski, P., Kohany, O., and Walichiewicz, J. Repbase update, a database of eukaryotic repetitive elements. Cytogenetic and genome research 110 (1-4): 462–467, 2005.
Kazazian, H. H. Mobile elements: drivers of genome evolution. science 303 (5664): 1626–1632, 2004.
Kiritchenko, S., Matwin, S., and Famili, A. F. Functional annotation of genes using hierarchical text categorization. In in Proc. of the BioLINK SIG: Linking Literature, Information and Knowledge for Biology (held at ISMB-05. Citeseer, Uottawa CA, 2005.
Kiritchenko, S., Matwin, S., Nock, R., and Famili, A. F. Learning and evaluation in the presence of class hierarchies: Application to text categorization. In Conference of the Canadian Society for Computational Studies of Intelligence. Springer, Springer Berlin Heidelberg, pp. 395–406, 2006.
Loureiro, T., Camacho, R., Vieira, J., and Fonseca, N. A. Boosting the Detection of Transposable Elements Using Machine Learning. 6th International Conference on Practical Applications of Computational Biology & Bioinformatics 154 (222): 85–91, 2013.
McClintock, B. The significance of responses of the genome to challenge, 1993.
Melsted, P. and Pritchard, J. K. Efficient counting of k-mers in dna sequences using a bloom filter. BMC bioinformatics 12 (1): 333, 2011.
Mitchell, T. M. et al. Machine learning, 1997.
Nussbaumer, T., Martis, M. M., Roessner, S. K., Pfeifer, M., Bader, K. C., Sharma, S., Gundlach, H., and Spannagl, M. Mips plantsdb: a database framework for comparative plant genome research. Nucleic acids research 41 (D1): D1144–D1151, 2012.
Otero, F. E., Freitas, A. A., and Johnson, C. G. A new sequential covering strategy for inducing classification rules with ant colony algorithms. IEEE Transactions on Evolutionary Computation 17 (1): 64–76, 2013.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research vol. 12, pp. 2825–2830, 2011.
Pereira, G. T. and Cerri, R. Classificação hierárquica e não hierárquica de elementos transponíveis. Proceedings of the 5th Symposium on Knowledge Discovery, Mining and Learning (KDMiLe), 2017.
Rebollo, R., Romanish, M. T., and Mager, D. L. Transposable elements: an abundant and natural source of regulatory sequences for host genes. Annual review of genetics vol. 46, pp. 21–42, 2012.
Silla, C. N. and Kaestner, C. A. Hierarchical classification of bird species using their audio recorded songs. In Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on. IEEE, pp. 1895–1900, 2013.
Silla, CarlosN., J. and Freitas, A. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22 (1-2): 31–72, 2011.
Smit, A. F., Hubley, R., and Green, P. RepeatMasker, 1996.
Smith, S. F. A learning system based on genetic adaptive algorithms. Ph.D. thesis, University of Pittsburgh, Pittsburgh, PA, 1980.
Srinivasan, S. and Ramakrishnan, S. Evolutionary multi objective optimization for rule mining: a review. Artificial Intelligence Review 36 (3): 205, 2011.
Steinbiss, S., Willhoeft, U., Gremme, G., and Kurtz, S. Fine-grained annotation and classification of de novo predicted ltr retrotransposons. Nucleic acids research 37 (21): 7002–7013, 2009.
van de Lagemaat, L. N., Landry, J.-R., Mager, D. L., and Medstrand, P. Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. Trends in Genetics 19 (10): 530–536, 2003.
Vens, C., Struyf, J., Schietgat, L., Džeroski, S., and Blockeel, H. Decision trees for hierarchical multi-label classification. Machine Learning 73 (2): 185–214, 2008.
Wetzel, A. Evaluation of the effectiveness of genetic algorithms in combinatorial optimization. Ph.D. thesis, University of Pittsburgh, Pittsburgh, PA, 1983.
Wheeler, T. J., Clements, J., Eddy, S. R., Hubley, R., Jones, T. A., Jurka, J., Smit, A. F., and Finn, R. D. Dfam: a database of repetitive dna based on profile hidden markov models. Nucleic acids research 41 (D1): D70–D82, 2012.
Wicker, T., Sabot, F., Hua-Van, A., Bennetzen, J. L., Capy, P., Chalhoub, B., Flavell, A., Leroy, P., Morgante, M., Panaud, O., et al. A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics 8 (12): 973–982, 2007.
Zimek, A., Buchwald, F., Frank, E., and Kramer, S. A study of hierarchical and flat classification of proteins. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7 (3): 563–571, 2010.