SAVIME: An Array DBMS for Simulation Analysis and ML Models Prediction
DOI:
https://doi.org/10.5753/jidm.2020.2021
Keywords:
Scientific Data Management, Multidimensional Array, Machine Learning
Abstract
Limitations in current DBMSs prevent their wide adoption in scientific applications. To give these applications the benefits of DBMS support, enabling declarative data analysis and visualization over scientific data, we present SAVIME, an in-memory array DBMS. In this work we describe SAVIME along with its data model. Our preliminary evaluation shows how SAVIME, by using a simple storage definition language (SDL), can outperform the state-of-the-art array database system, SciDB, during data ingestion. We also show that SAVIME can be used as a storage alternative for a numerical solver without affecting the solver's scalability, making it useful for modern ML-based applications.