Quantifying Color and Distortion Biases in the NCT-CRC-HE-100K Histopathology Dataset

Authors

DOI:

https://doi.org/10.5753/jbcs.2026.7045

Keywords:

Bias analysis, colorectal cancer, histopathology, stain normalization

Abstract

Colorectal cancer (CRC) represents a persistent challenge for healthcare systems, and the development of reliable deep learning systems for histopathology depends on unbiased datasets. The widely used NCT-CRC-HE-100K dataset has been shown to contain color inconsistencies, distortion artifacts, and corrupted patches, yet prior analyses offered only limited quantitative evidence. In this work, we extend these observations by evaluating color signatures, stain-normalization behavior, and class-dependent variations in image quality. We compare classical and deep-learning-based stain normalization methods to assess their impact on image quality metrics and their potential to reduce class-specific biases in computational pathology. Our results show that, while normalization reduces color-based class distinguishability, none of the evaluated methods completely eliminates tissue-specific color signatures. We also show that distortion artifacts disproportionately affect one class in the dataset, introducing technical biases unrelated to morphology. In addition, CNN classifiers trained on each normalized dataset, and on the unnormalized dataset, show no significant difference in performance, despite the reductions in color-based separability. Overall, our study provides quantitative evidence that color, saturation, and distortion biases persist across normalization techniques, emphasizing the need for caution when using NCT-CRC-HE-100K to assess histopathology models.
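
The abstract does not state how color-based class distinguishability is measured. As a rough illustration only, one possible probe is to reduce every patch to its mean RGB color and check how well a nearest-centroid rule recovers the class label from that color alone. The sketch below assumes this probe; the function names, the tint values, and the synthetic patches are all hypothetical and are not taken from the paper or from NCT-CRC-HE-100K.

```python
import numpy as np


def mean_color_features(patches):
    """Reduce each RGB patch of shape (H, W, 3) to its mean color (3 values)."""
    return patches.reshape(patches.shape[0], -1, 3).mean(axis=1)


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Classify by closest class-mean color; return accuracy on the test split."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Distance from every test feature to every class centroid.
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return float((predictions == test_y).mean())


def make_synthetic_patches(rng, n_per_class, class_tints, noise=0.05, size=8):
    """Hypothetical stand-in data: one flat tint per class plus pixel noise."""
    patches, labels = [], []
    for label, tint in enumerate(class_tints):
        pixels = np.asarray(tint) + noise * rng.standard_normal(
            (n_per_class, size, size, 3)
        )
        patches.append(np.clip(pixels, 0.0, 1.0))
        labels.append(np.full(n_per_class, label))
    return np.concatenate(patches), np.concatenate(labels)


def color_probe_accuracy(rng, class_tints, n_per_class=100):
    """Train and test the mean-color probe on independently drawn patches."""
    train_p, train_y = make_synthetic_patches(rng, n_per_class, class_tints)
    test_p, test_y = make_synthetic_patches(rng, n_per_class, class_tints)
    return nearest_centroid_accuracy(
        mean_color_features(train_p), train_y,
        mean_color_features(test_p), test_y,
    )


rng = np.random.default_rng(0)

# Two classes with different (made-up) tints: color alone separates them.
acc_tinted = color_probe_accuracy(rng, [(0.8, 0.5, 0.7), (0.6, 0.4, 0.7)])

# Two classes sharing one tint, mimicking an idealized normalization that
# removes the class-specific color signature: the probe drops toward chance.
acc_matched = color_probe_accuracy(rng, [(0.7, 0.45, 0.7), (0.7, 0.45, 0.7)])

print(f"tinted classes:  {acc_tinted:.2f}")
print(f"matched classes: {acc_matched:.2f}")
```

Under this reading, a probe accuracy well above chance after normalization would suggest that a class-specific color signature survives, which is the kind of residual bias the abstract reports.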

Published

2026-05-07

How to Cite

Rodrigues, G. A. P., Serrano, A. L. M., Filho, G. P. R., Bonacin, R., Gonçalves, V. P., Rajarajan, M., & Meneguette, R. I. (2026). Quantifying Color and Distortion Biases in the NCT-CRC-HE-100K Histopathology Dataset. Journal of the Brazilian Computer Society, 32(1), 1317–1330. https://doi.org/10.5753/jbcs.2026.7045

Section

Regular Issue