SmarT: Machine Learning Approach for Efficient Filtering and Retrieval of Spatial and Temporal Data in Big Data
DOI:
https://doi.org/10.5753/jidm.2021.1951
Keywords:
Big Data, Machine Learning, Time Series Analysis
Abstract
Spatiotemporal data has always been big data. Big data analytics for spatiotemporal data is now receiving considerable attention because it allows users to analyze huge amounts of data. Traditional big data platforms cannot handle all the challenges of processing spatiotemporal data. Although several big data platforms have been proposed to process massive volumes of spatiotemporal data, none is considered a clear winner for all possible scenarios. This paper presents the SmarT query engine, a machine learning-based solution that chooses, on the fly, the best big data platform for processing each spatiotemporal query. In a detailed experimental evaluation considering the Apache Spark, Elasticsearch, and SciDB big data platforms, the response time decreased by up to 22% when using SmarT.