Tourist Data Collection with Web Scraping: Problems, Solutions and Optimizations

Authors

de Paula, J. P. M., Marynowski, J. E., and Feger, J. E.

DOI:

https://doi.org/10.5753/isys.2024.3644

Keywords:

Web Scraping, Data Extraction, Selenium, Tripadvisor, Data acquisition, Comments

Abstract

The Internet can be considered one of humanity's largest data sources, but its content is designed for human readers, not for automated reading. Automated data extraction from the Internet faces obstacles involving how data is arranged and the frequent updates that websites undergo. This study analyzes the extraction of tourist data for later categorization, listing the problems encountered, such as page navigation and program performance, and presenting their solutions. Scraping algorithms are presented based on a general process and on the different page-loading methods, and are explored in a case study. The use of computational resources, the scraping rate, and the security impacts are evaluated; the implementation is modified according to the problems that arise during processing, and the solutions and optimizations are documented, e.g., avoiding Denial of Service (DoS) and applying parallel programming (multithreading). The data and scripts generated are available in a public repository for reproduction and inspiration.
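As a rough illustration of the two optimizations named in the abstract, the sketch below combines a thread pool (multithreading) with a shared throttle that spaces out request starts so the target server is not flooded. It is not the authors' implementation, which is based on Selenium and is available in their repository; the names `RateLimiter`, `scrape_all`, `fetch`, `max_workers`, and `min_interval` are hypothetical and chosen only for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most one request start every `min_interval` seconds, across threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def scrape_all(urls, fetch, max_workers=4, min_interval=1.0):
    """Fetch every URL with a thread pool while throttling request starts.

    `fetch` is a hypothetical callable (url -> result); in practice it could
    wrap a browser driver or an HTTP client.
    """
    limiter = RateLimiter(min_interval)

    def task(url):
        limiter.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as `urls`.
        return list(pool.map(task, urls))
```

The throttle bounds the request rate regardless of how many worker threads are used, so raising `max_workers` helps overlap slow page loads without increasing the load placed on the site.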

Published

2024-07-13

How to Cite

de Paula, J. P. M., Marynowski, J. E., & Feger, J. E. (2024). Tourist Data Collection with Web Scraping: Problems, Solutions and Optimizations. ISys - Brazilian Journal of Information Systems, 17(1), 8:1 – 8. https://doi.org/10.5753/isys.2024.3644

Section

Regular articles