An automatic method for labeling and categorizing medical documents

Authors

DOI:

https://doi.org/10.5753/isys.2022.2260

Keywords:

Natural language processing, Prescriptions, Term frequency-inverse document frequency

Abstract

The widespread adoption of systems for managing and recording medical documents (MD) has generated a large volume of unstructured data. These data consist of free text containing ambiguous expressions that report the same clinical condition or procedure, which makes manual categorization of MD error-prone. This work aims to label and classify MD written in Portuguese using binary labeling (Prescription and Others) and multiclass labeling (Prescriptions, Exams, Medical Certificates, and Others). N-grams and term frequency-inverse document frequency (TF-IDF) were used in the text vectorization step. The results are promising: Kappa reached 0.99 for binary classification and 0.97 for multiclass classification. Classifying MD in this way makes it possible to segment the information in order to manage prescribed medications.
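To make the vectorization and evaluation steps named in the abstract concrete, the following is a minimal Python sketch. It is not the authors' implementation. The function names, the toy documents, and the toy labels are hypothetical, and the weighting is the plain textbook form (tf = count / document length, idf = log(N / df)), whereas the paper may have used a smoothed library variant.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a lowercased text."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(docs, n_max=2):
    """Map each document to a sparse dict {n-gram: tf * idf}.

    Assumption: unsmoothed weighting, tf = count / len(doc), idf = log(N / df).
    """
    grams = [word_ngrams(d, n_max) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(g for doc in grams for g in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in grams:
        counts = Counter(doc)
        vectors.append({
            g: (c / len(doc)) * math.log(n_docs / df[g])
            for g, c in counts.items()
        })
    return vectors


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement between reference and predicted labels.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance from the label marginals.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if p_e == 1:
        return 1.0  # degenerate case: a single label on both sides
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data, invented for illustration only.
docs = [
    "tomar dipirona 500 mg",
    "solicito exame de sangue",
    "tomar amoxicilina 500 mg",
]
vectors = tfidf_vectors(docs)

y_true = ["Prescription", "Others", "Prescription", "Others"]
y_pred = ["Prescription", "Prescription", "Prescription", "Others"]
kappa = cohen_kappa(y_true, y_pred)  # 0.5 for these toy labels
```

In this sketch, an n-gram that occurs in only one document (such as "dipirona") receives a higher weight than one shared across documents (such as "tomar"). Kappa corrects raw agreement for the agreement expected by chance, which is why it is a stricter measure than accuracy when classes are imbalanced.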

Published

2022-10-18

How to Cite

L. V. de Sousa, O., M. V. Magalhães, D., E. S. Campelo, V., & R. V. e Silva, R. (2022). Um método automático para rotulagem de documentos médicos e categorização. ISys - Revista Brasileira De Sistemas De Informação, 15(1), 13:1–13:13. https://doi.org/10.5753/isys.2022.2260

Issue

Section

Special Issue Articles