Tourist Data Collection with Web Scraping: Problems, Solutions and Optimizations




Web Scraping, Data Extraction, Selenium, Tripadvisor, Data acquisition, Comments


The Internet can be considered one of the biggest data sources for humanity, but its content is intended for human readers, not for automatic reading. Automated extraction of Internet data faces obstacles involving data arrangement and the frequent updates that websites undergo. This study analyzes the extraction of tourist data for later categorization, listing the problems faced, such as page navigation and program performance, and presenting their solutions. Scraping algorithms are presented based on a general process and the different page-loading methods, and are explored in a case study. The use of computational resources, the scraping rate, and the security impacts are evaluated; the implementation is modified according to the problems that arise during processing, and the solutions and optimizations are documented, e.g. avoiding Denial of Service (DoS) and applying parallel programming (multithreading). The data and scripts generated are available in a public repository for reproduction and inspiration.
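As an illustration of the two optimizations named in the abstract, the following is a minimal sketch, not the authors' implementation, of how multithreaded scraping can be combined with a shared request rate limit so that the target site is not flooded. It assumes one reading of "avoiding DoS", namely spacing out requests. The names `RateLimiter`, `scrape_all`, and `fake_fetch` are hypothetical; `fake_fetch` stands in for whatever page-loading routine is actually used (for example a Selenium driver call), so the sketch runs without network access.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Space out request starts by at least `min_interval` seconds across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def scrape_all(urls, fetch, max_workers=4, min_interval=0.5):
    """Fetch every URL with a thread pool; results keep the order of `urls`."""
    limiter = RateLimiter(min_interval)

    def task(url):
        limiter.wait()  # throttle before touching the site
        return fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, urls))


def fake_fetch(url):
    # Hypothetical stand-in for a real page load; returns a dummy document.
    return f"<html>{url}</html>"


pages = scrape_all(["page-1", "page-2", "page-3"], fake_fetch,
                   max_workers=2, min_interval=0.05)
```

The threads overlap the time spent waiting on page loads, while the single shared limiter bounds the overall request rate regardless of how many workers are running. Suitable values for `max_workers` and `min_interval` depend on the target site and are assumptions here.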




de Paula, J. P. M., Marynowski, J. E., & Feger, J. E. (2024). Tourist Data Collection with Web Scraping: Problems, Solutions and Optimizations. ISys - Brazilian Journal of Information Systems, 17(1), 8:1 – 8.



Regular articles