Cascade Support Vector Machines applied to the Translation Initiation Site prediction problem
DOI:
https://doi.org/10.5753/jidm.2018.2052
Keywords:
Translation Initiation Site, Cascade SVM, Data Mining, Machine Learning
Abstract
The correct identification of protein coding regions is an important and still open problem in biology. The challenge stems from limited knowledge of biological systems, specifically of the conserved characteristics of messenger ribonucleic acid (mRNA). Computational methods are therefore fundamental to discovering patterns around the Translation Initiation Site (TIS). Machine learning algorithms have been widely applied in Bioinformatics, among them Support Vector Machines (SVM), which are based on inductive inference. However, SVM incurs a high computational cost on large data sets, since its training time can grow up to quadratically with the data set size. In this study, to tackle this challenge and analyse the algorithm’s behavior, we applied a Cascade SVM approach to the TIS prediction problem. This strategy aims to accelerate model training and reduce the number of support vectors. Our results show that the Cascade SVM approach significantly reduces model training time while maintaining accuracy and F-measure rates similar to those of the conventional SVM. We also identify the scenarios in which the cascade approach is more suitable for reducing training time.
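To make the training-time argument concrete, the following is a rough back-of-the-envelope sketch, not the procedure used in the study. It assumes a pure quadratic cost model (training on n samples costs n² units), a binary cascade in which each layer merges the support vectors of pairs of SVMs, and a fixed fraction of samples retained as support vectors at every layer. The function name `cascade_cost` and the parameter `sv_fraction` are hypothetical and chosen only for illustration.

```python
def cascade_cost(n, partitions, sv_fraction):
    """Estimate (total_work, critical_path) for a binary cascade.

    Assumptions (illustrative only): training on m samples costs m**2,
    and every SVM keeps a fixed fraction `sv_fraction` of its samples
    as support vectors.
    """
    if partitions < 1 or partitions & (partitions - 1):
        raise ValueError("partitions must be a power of two")
    size = n / partitions   # samples seen by each first-layer SVM
    count = partitions      # number of SVMs trained in the current layer
    total = 0.0             # work summed over all SVMs (sequential run)
    critical = 0.0          # work along one path (fully parallel run)
    while True:
        total += count * size ** 2
        critical += size ** 2
        if count == 1:
            break
        # Each SVM passes on only its support vectors; pairs are merged.
        size = 2 * sv_fraction * size
        count //= 2
    return total, critical


baseline, _ = cascade_cost(1000, 1, 0.25)          # single conventional SVM
total, critical = cascade_cost(1000, 4, 0.25)      # four-partition cascade
print(total / baseline, critical / baseline)
```

Under these assumptions the benefit depends strongly on how many samples survive as support vectors: when `sv_fraction` approaches 0.5, the merged sets do not shrink from layer to layer and the saving is much smaller, which is one way to picture why the cascade suits some scenarios better than others.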