Detecting and Analysing Duplicate Consumer Complaints and Collective Demands Across Multiple Platforms

Gestefane Rabbi; Júlia Viterbo; Gabriel Kakizaki; Zilton Cordeiro Junior; Raquel O. Prates; Julio C. S. Reis; Marcos André Gonçalves

doi:10.5753/jidm.2026.5959

Authors

Gestefane Rabbi, Federal University of Minas Gerais (UFMG), https://orcid.org/0009-0006-4484-3773
Júlia Viterbo, Federal University of Minas Gerais (UFMG), https://orcid.org/0009-0003-0715-3982
Gabriel Kakizaki, Federal University of Viçosa (UFV), https://orcid.org/0000-0002-2589-9222
Zilton Cordeiro Junior, Federal University of Minas Gerais (UFMG), https://orcid.org/0009-0006-0892-1455
Raquel O. Prates, Federal University of Minas Gerais (UFMG), https://orcid.org/0000-0002-7128-4974
Julio C. S. Reis, Federal University of Viçosa (UFV), https://orcid.org/0000-0003-0563-0434
Marcos André Gonçalves, Federal University of Minas Gerais (UFMG), https://orcid.org/0000-0002-2075-3363

Keywords:

Consumer Complaints, Duplicates, Collective Demands, PROCON, Consumidor.gov, Sindec

Abstract

The increasing volume of data in consumer complaint repositories poses considerable challenges for the effective management and analysis of this information. A primary issue is the prevalence of duplicate complaints, often submitted by the same consumer across different platforms as a strategy to exert pressure on service providers. In addition, identifying the collective consumer demands embedded within these complaints is essential for revealing systemic issues that affect broader consumer groups. This study proposes a computational framework to address both challenges: (i) the detection of duplicate complaints through temporal correlation and cross-platform matching of key attributes (such as consumer identity, service provider, and complaint subject) and (ii) the identification of collective demands via clustering techniques based on semantic similarity. To this end, natural language processing (NLP) methods are employed to extract and represent semantic content from unstructured complaint texts. Empirical results indicate that 95% of duplicate complaints are submitted within 30 days of the original entry. The clustering approach was also validated as effective, improving the management of unstructured consumer complaint data and supporting more efficient conflict resolution and better-informed decision-making for regulatory agencies and service providers.
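
As a rough illustration of challenge (i), the sketch below pairs complaints that share consumer, provider, and subject and were filed on different platforms within a 30-day window. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Complaint` fields, the exact-match rule after lowercasing, the requirement that the two platforms differ, and the sample records are all hypothetical. Only the 30-day window comes from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Complaint:
    # Hypothetical record layout; field names are illustrative only.
    complaint_id: str
    platform: str
    consumer_id: str
    provider: str
    subject: str
    filed_on: date


def find_duplicate_pairs(complaints, window_days=30):
    """Return (earlier_id, later_id) pairs filed on different platforms that
    share consumer, provider, and subject within `window_days` of each other."""
    # Group by the matching key so only plausible candidates are compared.
    groups = defaultdict(list)
    for c in complaints:
        key = (c.consumer_id, c.provider.strip().lower(), c.subject.strip().lower())
        groups[key].append(c)

    pairs = []
    for group in groups.values():
        group.sort(key=lambda c: c.filed_on)  # earlier complaint comes first
        for first, second in combinations(group, 2):
            # Assumption: only cross-platform matches count as duplicates here.
            if first.platform == second.platform:
                continue
            if (second.filed_on - first.filed_on).days <= window_days:
                pairs.append((first.complaint_id, second.complaint_id))
    return pairs


# Made-up sample records for demonstration.
complaints = [
    Complaint("A1", "Consumidor.gov", "c-001", "TelcoX", "Billing", date(2024, 1, 5)),
    Complaint("B7", "Sindec", "c-001", "telcox", "billing", date(2024, 1, 20)),
    Complaint("B9", "Sindec", "c-001", "TelcoX", "Billing", date(2024, 6, 1)),
    Complaint("A2", "Consumidor.gov", "c-002", "TelcoX", "Billing", date(2024, 1, 6)),
]
duplicate_pairs = find_duplicate_pairs(complaints)
print(duplicate_pairs)
```

Grouping by key before comparing keeps the pairwise check confined to complaints that already agree on the matched attributes. A real system would likely need approximate matching for provider names and subjects, which this sketch does not attempt.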
