Feature selection and comparison of classifiers for predicting protein class
Feature Selection, Classifiers, Multi-Objective Genetic Algorithm, Protein Prediction
Abstract
Knowing the function of proteins is essential for understanding several biological systems. Laboratory experiments to determine a protein's class are costly and time-consuming. Therefore, efficient computational models are needed to identify the class to which a protein belongs. A significant volume of information about proteins and their structure is continually being made available in public data repositories. For example, the STING_DB database contains extensive information extracted from all protein structural levels (primary, secondary, tertiary, and quaternary), which is frequently used in classification models for this type of problem. However, it is unknown which physical-chemical properties contribute most to the prediction of the class, so the most suitable subset of properties must be identified. In this work, we propose an approach based on a multi-objective genetic algorithm combined with the k-NN classifier to select the best physical-chemical properties. The strategy seeks a smaller subset of features that contributes significantly to the prediction problem. To improve prediction performance, we apply a post-enrichment process and then compare the performance of our methodology across several classifiers: ANN, SVM, Random Forest, and k-NN. Our method achieved an average F-measure of 70.22% with the Random Forest classifier. Finally, a comparative analysis, supported by statistical significance testing, shows the relevance of our approach relative to other methodologies.
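To make the idea of multi-objective feature selection with k-NN concrete, the following minimal Python sketch scores a binary feature mask on two objectives (k-NN classification error and number of selected features) and keeps the non-dominated masks. It is an illustration under stated assumptions, not the method evaluated in this work: the toy data, the function names (`knn_loo_accuracy`, `objectives`, `dominates`, `pareto_front`), the choice of leave-one-out accuracy as the k-NN objective, and `k=3` are all hypothetical, and an exhaustive enumeration of masks stands in for the population that a genetic algorithm would evolve.

```python
from collections import Counter
from itertools import product


def knn_loo_accuracy(X, y, mask, k=3):
    """Leave-one-out k-NN accuracy using only features where mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # no features selected: treat as uninformative
    correct = 0
    for i, xi in enumerate(X):
        # Squared Euclidean distance to every other sample on the selected features.
        dists = sorted(
            (sum((xi[j] - xo[j]) ** 2 for j in cols), y[o])
            for o, xo in enumerate(X)
            if o != i
        )
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] == y[i]:
            correct += 1
    return correct / len(X)


def objectives(X, y, mask, k=3):
    """Two objectives, both minimized: (classification error, feature count)."""
    return (1.0 - knn_loo_accuracy(X, y, mask, k), sum(mask))


def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    return all(p <= q for p, q in zip(a, b)) and any(p < q for p, q in zip(a, b))


def pareto_front(scored):
    """Keep the (mask, score) pairs whose score is not dominated by any other."""
    return [(m, s) for m, s in scored if not any(dominates(t, s) for _, t in scored)]


# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [0.0, 5.0, 1.0], [0.1, 1.0, 9.0], [0.2, 8.0, 4.0], [0.3, 3.0, 7.0],
    [1.0, 4.0, 2.0], [1.1, 9.0, 8.0], [1.2, 2.0, 5.0], [1.3, 7.0, 3.0],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Score every non-empty mask; a genetic algorithm would sample and evolve these instead.
scored = [(m, objectives(X, y, m)) for m in product([0, 1], repeat=3) if any(m)]
front = pareto_front(scored)
print(front)
```

On this toy data the only non-dominated mask selects feature 0 alone, which mirrors the stated goal of obtaining a smaller subset of properties without sacrificing prediction quality. In a real setting, a multi-objective genetic algorithm would replace the exhaustive enumeration, since the number of masks grows exponentially with the number of properties.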
