Identifying Brazilian sexual predators in online text conversations using machine learning

Authors

  • Leonardo Ferreira dos Santos, Centro Federal de Educação Tecnológica Celso Suckow da Fonseca (CEFET/RJ)
  • Gustavo Guedes, Centro Federal de Educação Tecnológica Celso Suckow da Fonseca (CEFET/RJ), http://orcid.org/0000-0001-8593-1506

DOI:

https://doi.org/10.5753/isys.2020.822

Keywords:

Pedophilia, PAN-2012, Sexual predator identification, Machine learning, Convolutional Neural Networks, Support vector machines, Decision tree, Naïve Bayes, Random Forests, Social networks, Online conversations

Abstract

Nowadays, a large number of children and adolescents use social applications. Easily accessible, these applications offer benefits and opportunities. At the same time, however, they expose users to a range of risks, among them sexual predatory activity. Such activity has several aims, including obtaining child pornography, extortion, and sexual abuse. This work has three main objectives: (i) to create a dataset of text conversations containing real sexual predatory activity in Brazilian Portuguese; (ii) to carry out a statistical analysis of the text conversations in this dataset; and (iii) to carry out an experimental evaluation, on the constructed dataset, of the machine learning algorithms most popular in this research domain. This evaluation uses the F1 measure as its basis. The results of contributions (i) and (ii) enable new studies to focus on the problem of identifying sexual predators in text conversations in Brazilian Portuguese. The results of contribution (iii) show that Support Vector Machines performed best, achieving a result of 89.87%.
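
As a brief illustration of the F1 measure mentioned in the abstract, the following minimal sketch computes F1 for a positive class as the harmonic mean of precision and recall. The function name `f1_score` and the toy labels are assumptions made for illustration only; they are not taken from the paper's dataset or code.

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: F1 is taken to be 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = predatory conversation, 0 = non-predatory.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.4f}")
```

Because F1 ignores true negatives, it is commonly preferred over plain accuracy when the positive class is rare.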

Published

2020-07-29

How to Cite

dos Santos, L. F., & Guedes, G. (2020). Identificação de predadores sexuais brasileiros em conversas textuais na internet por meio de aprendizagem de máquina. ISys - Revista Brasileira De Sistemas De Informação, 13(4), 22–47. https://doi.org/10.5753/isys.2020.822

Section

Extended versions of selected papers