MDLText applied to automatic filtering of SPIM and SMS Spam
DOI:
https://doi.org/10.5753/isys.2018.359
Keywords:
Online learning, Occam’s razor, Text categorization, Machine learning
Abstract
Spam filtering in online instant messages and SMS remains a challenging problem because the messages are often very short and rife with slang, idioms, symbols, emoticons, and abbreviations, which hamper prediction and knowledge discovery. To address this problem, we evaluated a simple, fast, scalable, multiclass, online text classification method based on the minimum description length principle. Experiments on a real, public dataset demonstrate that the method is effective for instant messaging and SMS spam filtering in both online and offline learning contexts.