Biotext with SWeePtex: Bioinformatics Tricks to Perform Fast, Accurate, and Content-specific String Embedding
DOI: https://doi.org/10.5753/jbcs.2026.6198

Keywords: Text Mining, Vector Embedding, Bioinformatics, Random Projection

Abstract
The escalating demand for adaptable Artificial Intelligence (AI) systems presents a critical hurdle: generating efficient text embeddings tailored to specific problems. While Large Language Models (LLMs) excel in general contexts, they struggle in specialized domains due to their massive data requirements, opaque embedding strategies, and high computational costs. We introduce Biotext, featuring SWeePtex, a novel framework that adapts established Bioinformatics techniques for text embedding. By converting text to the Biological Sequence-Like (BSL) format, our Python package enables the application of SWeeP, a tool originally developed for biological sequences, to create content-addressable vectors for natural language, employing the random projection paradigm. We validated the approach with unsupervised machine learning, analyzing data from 14,984 MEDLINE abstracts on the thioredoxin theme. Biotext, through SWeePtex, constructs a unified vector space for words and documents from scratch, capturing rich contextual relationships and offering scalable processing. Our usage example demonstrates that this Bioinformatics-inspired method effectively addresses key challenges in Natural Language Processing (NLP), providing interpretable, computationally efficient, and content-addressable linguistic representations for document exploration. Ultimately, Biotext demonstrates that bridging Bioinformatics and NLP yields powerful, efficient, and accessible text analysis tools that balance analytical power with interpretability, particularly valuable in specialized domains and resource-constrained environments. The Biotext Python package is freely available from the PyPI repository.
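The pipeline the abstract describes (text converted to a biological-sequence-like representation, turned into counts, then compressed by random projection) can be sketched in a few lines of NumPy. This is a minimal illustration of the paradigm only: the character-to-amino-acid mapping, the k-mer size, and the projection dimension below are assumptions chosen for the example, not Biotext's or SWeeP's actual encoding.

```python
import numpy as np
from itertools import product

# 20-letter amino-acid alphabet used as the target of a toy
# biological-sequence-like (BSL) encoding. This mapping is a stand-in
# for illustration; the real Biotext/SWeePtex encoding may differ.
AA = "ACDEFGHIKLMNPQRSTVWY"

def to_bsl(text: str) -> str:
    """Map each alphabetic character onto an amino-acid letter (illustrative)."""
    return "".join(AA[(ord(c) - ord("a")) % len(AA)]
                   for c in text.lower() if c.isalpha())

def kmer_counts(seq: str, k: int = 2) -> np.ndarray:
    """Count k-mer occurrences over the amino-acid alphabet (20**k dims)."""
    index = {"".join(p): i for i, p in enumerate(product(AA, repeat=k))}
    v = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        v[index[seq[i:i + k]]] += 1.0
    return v

def random_projection(X: np.ndarray, dim: int = 64, seed: int = 0) -> np.ndarray:
    """Compress count vectors with one fixed Gaussian random matrix,
    following the Johnson-Lindenstrauss random-projection paradigm."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], dim)) / np.sqrt(dim)
    return X @ R

docs = [
    "thioredoxin regulates redox signalling",
    "thioredoxin reductase activity in cells",
    "unrelated text about stock markets",
]
X = np.vstack([kmer_counts(to_bsl(d)) for d in docs])  # shape (3, 400)
E = random_projection(X)                               # shape (3, 64)
```

Because random projection approximately preserves inner products, the two thioredoxin-related documents, which share many k-mers, should typically stay closer to each other in the 64-dimensional space than either is to the unrelated one; that preserved neighborhood structure is what makes the compact vectors usable for downstream unsupervised clustering.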
License
Copyright (c) 2026 Diogo de J. S. Machado, Camilla R. De Pierri, Antonio C. da Silva Filho, Flávia de F. Costa, Nelson A. de M. Lemos, Camila P. Perico, Letícia G. C. Santos, Maricel G. Kann, Fábio de O. Pedrosa, Roberto Tadeu Raittz

This work is licensed under a Creative Commons Attribution 4.0 International License.

