Reducing Persistence Overhead in Parallel State Machine Replication through Time-Phased Partitioned Checkpoint
DOI:
https://doi.org/10.5753/jisa.2024.3891Keywords:
Checkpoint/restore, Recovery, State Machine Replication, High-availabilityAbstract
Dependable systems usually rely on replication to provide resilience and availability. However, for long-lived systems, replication is not enough since given a sufficient amount of time, there might be more faulty replicas than the threshold tolerated in the system. In order to overcome this limitation, checkpoint and recovery techniques are used to update and resume failed replicas. In this sense, checkpointing procedures periodically capture snapshots of the system state during failure-free execution, enabling recovery processes to resume from a previously stored and consistent state. Nevertheless, saving checkpoints introduces overhead, requiring synchronization with the processing of incoming requests to prevent inconsistencies.
This overhead becomes even more pronounced in high-throughput systems like Parallel State Machine Replication, where workloads dominated by independent requests leverage multi-threading parallelism.
This work addresses the costly nature of checkpointing by proposing a novel approach that divides the replica's state into partitions and takes snapshots of only a few partitions at a time. Replicas continue executing requests targeted to other partitions without interruption. Thus, incoming requests experience delays during a checkpoint only if they access a partition currently being saved. Combining this approach with the Parallel State Machine Replication yields reduced snapshot durations and lower client latency during checkpointing. Additionally, the proposed approach accelerates replicas recovery through collaborative state transfer, enabling workload distribution among replicas and parallel execution of transfer and installation of the recovering state.
Downloads
References
Aguilera, M. K., Chen, W., and Toueg, S. (2000). Failure detection and consensus in the crash-recovery model. Distributed computing, 13:99-125. DOI: 10.1007/s004460050070.
Alchieri, E., Dotti, F., Mendizabal, O. M., and Pedone, F. (2017). Reconfiguring parallel state machine replication. In 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pages 104-113. IEEE. DOI: 10.1109/SRDS.2017.23.
Amazon (2012a). Summary of the december 24, 2012 amazon elb service event in the us-east region. Available online [link].
Amazon (2012b). Summary of windows azure service disruption on feb 29th. Available online [link].
Bessani, A., Sousa, J., and Alchieri, E. E. P. (2014). State machine replication for the masses with bft-smart. In DSN, pages 355-362. DOI: 10.1109/DSN.2014.43.
Bessani, A. N., Santos, M., Felix, J., Neves, N. F., and Correia, M. (2013). On the efficiency of durable state machine replication. In USENIX ATC, pages 169-180. Available online [link].
Bobrow, D. G., Burchfiel, J. D., Murphy, D. L., and Tomlinson, R. S. (1972). Tenex, a paged time sharing system for the pdp-10. Communications of the ACM, 15(3):135-143. DOI: 10.1145/361268.36127.
Burrows, M. (2006). The chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI '06, pages 335-350, Berkeley, CA, USA. USENIX Association. Available online [link].
Castro, M. and Liskov, B. (2000). Proactive recovery in a byzantine-fault-tolerant system. In Proceedings of the 4th conference on Symposium on Operating System Design & Implementation-Volume 4, page 19. USENIX Association. Available online [link].
Chen, J., Arumaithurai, M., Jiao, L., Fu, X., and Ramakrishnan, K. (2011). Copss: An efficient content oriented publish/subscribe system. In 2011 ACM/IEEE Seventh Symposium on Architectures for Networking and Communications Systems, pages 99-110. IEEE. DOI: 10.1109/ANCS.2011.27.
Cheng, D., Chen, Y., Zhou, X., Gmach, D., and Milojicic, D. (2017). Adaptive scheduling of parallel jobs in spark streaming. In IEEE INFOCOM 2017-IEEE Conference on Computer Communications, pages 1-9. IEEE. DOI: 10.1109/INFOCOM.2017.8057206.
Clement, A., Kapritsos, M., Lee, S., Wang, Y., Alvisi, L., Dahlin, M., and Riche, T. (2009). Upright cluster services. In Proceedings of the ACM SIGOPS 22Nd Symposium on Operating Systems Principles, SOSP '09, pages 277-290, New York, NY, USA. ACM. DOI: 10.1145/1629575.1629602.
Coelho, P. and Pedone, F. (2018). Geographic state machine replication. In 2018 IEEE 37th Symposium on Reliable Distributed Systems (SRDS), pages 221-230. DOI: 10.1109/SRDS.2018.00034.
Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., Hsieh, W., Kanthak, S., Kogan, E., Li, H., Lloyd, A., Melnik, S., Mwaura, D., Nagle, D., Quinlan, S., Rao, R., Rolig, L., Woodford, D., Saito, Y., Taylor, C., Szymaniak, M., and Wang, R. (2012). Spanner: Google's globally-distributed database. In OSDI. DOI: 10.1145/249124.
Curino, C., Jones, E., Zhang, Y., and Madden, S. (2010). Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 3(1-2):48-57. Available online [link].
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. ACM SIGOPS operating systems review, 41(6):205-220. DOI: 10.1145/1323293.1294281.
Défago, X., Schiper, A., and Urbán, P. (2004). Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Computing Surveys (CSUR), 36(4):372-421. DOI: 10.1145/1041680.1041682.
Egwutuoha, I. P., Levy, D., Selic, B., and Chen, S. (2013). A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. Journal of Supercomputing, 65(3):1302-1326. DOI: 10.1007/s11227-013-0884-0.
Elnozahy, E. N., Alvisi, L., Wang, Y.-M., and Johnson, D. B. (2002). A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR), 34(3):375-408. DOI: 10.1145/568522.568525.
Eugster, P. T., Felber, P. A., Guerraoui, R., and Kermarrec, A.-M. (2003). The many faces of publish/subscribe. ACM computing surveys (CSUR), 35(2):114-131. DOI: 10.1145/857076.857078.
Frank, A., Baumgartner, M., Salkhordeh, R., and Brinkmann, A. (2021). Improving checkpointing intervals by considering individual job failure probabilities. In IPDPS. DOI: 10.1109/IPDPS49936.2021.00038.
Gitlab (2017). Gitlab.com databse incident. Available online [link].
Goulart, H., Álvaro Franco, and Mendizabal, O. (2023a). Checkpointing techniques in distributed systems: A synopsis of diverse strategies over the last decades. In WTF, pages 15-28, Porto Alegre, RS, Brasil. SBC. DOI: 10.5753/wtf.2023.785.
Goulart, H. S., Trombeta, J., Franco, A., and Mendizabal, O. M. (2023b). Achieving enhanced performance combining checkpointing and dynamic state partitioning. In 2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 149-159. DOI: 10.1109/SBAC-PAD59825.2023.00024.
Herlihy, M. P. and Wing, J. M. (1990). Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 12(3):463-492. DOI: 10.1145/78969.78972.
Huang, P., Guo, C., Zhou, L., Lorch, J. R., Dang, Y., Chintalapati, M., and Yao, R. (2017). Gray failure: The achilles' heel of cloud-scale systems. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems, pages 150-155. DOI: 10.1145/3102980.310300.
Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. (2010). Zookeeper: Wait-free coordination for internet-scale systems. In USENIX annual technical conference, volume 8. Boston, MA, USA. Available online [link].
Hurfin, M., Mostefaoui, A., and Raynal, M. (1998). Consensus in asynchronous systems where processes can crash and recover. In Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No. 98CB36281), pages 280-286. IEEE. DOI: 10.1109/RELDIS.1998.740510.
Izraelevitz, J., Wang, G., Hanscom, R., Silvers, K., Lehman, T. S., Chockler, G., and Gotsman, A. (2022). Acuerdo: Fast atomic broadcast over rdma. In Proceedings of the 51st International Conference on Parallel Processing, pages 1-11. DOI: 10.1145/3545008.3545041.
Junior, E. G., Alchieri, E., Dotti, F., and Mendizabal, O. (2023). A time-phased partitioned checkpoint approach to reduce state snapshot overhead. In Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing, LADC '23, page 100–109, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3615366.3615417.
Kapitza, R., Behl, J., Cachin, C., Distler, T., Kuhnle, S., Mohammadi, S. V., Schröder-Preikschat, W., and Stengel, K. (2012). Cheapbft: resource-efficient byzantine fault tolerance. In Eurosys. DOI: 10.1145/2168836.2168866.
Kapritsos, M., Wang, Y., Quema, V., Clement, A., Alvisi, L., Dahlin, M., et al. (2012). All about eve: Execute-verify replication for multi-core servers. In OSDI, volume 12, pages 237-250. Available online [link].
Kotla, R. and Dahlin, M. (2004). High throughput byzantine fault tolerance. In DSN, pages 575-584. IEEE. DOI: 10.1109/DSN.2004.1311928.
Kumar, K. A., Deshpande, A., and Khuller, S. (2013). Data placement and replica selection for improving co-location in distributed environments. arXiv preprint arXiv:1302.4168. DOI: 10.48550/arXiv.1302.4168.
Lamport, L. (1978). Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558-565. DOI: 10.1145/3335772.3335934.
Lamport, L. (1998). The part-time parliament. ACM Trans. Comput. Syst., 16(2):133-169. DOI: 10.1145/279227.279229.
Lamport, L. et al. (2001). Paxos made simple. ACM Sigact News, 32(4):18-25. Available online [link].
Le, L. H., Fynn, E., Eslahi-Kelorazi, M., Soulé, R., and Pedone, F. (2019). Dynastar: Optimized dynamic partitioning for scalable state machine replication. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pages 1453-1465. IEEE. DOI: 10.1109/ICDCS.2019.00145.
Li, B., Xu, W., and Kapitza, R. (2018). Dynamic state partitioning in parallelized byzantine fault tolerance. In 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), pages 158-163. IEEE. DOI: 10.1109/DSN-W.2018.00056.
Marandi, P. J., Bezerra, C. E., and Pedone, F. (2014). Rethinking state-machine replication for parallelism. In 2014 IEEE 34th International Conference on Distributed Computing Systems, pages 368-377. IEEE. DOI: 10.1109/ICDCS.2014.45.
Marandi, P. J. and Pedone, F. (2014). Optimistic parallel state-machine replication. In SRDS, pages 57-66. IEEE. DOI: 10.1109/SRDS.2014.25.
Masti, S. (2021). How we built a general purpose key value store for facebook with zippydb. Available online [link].
Mendizabal, O. M., De Moura, R. S., Dotti, F. L., and Pedone, F. (2017a). Efficient and deterministic scheduling for parallel state machine replication. In IPDPS, pages 748-757. IEEE. DOI: 10.1109/IPDPS.2017.29.
Mendizabal, O. M., Dotti, F. L., and Pedone, F. (2016). Analysis of checkpointing overhead in parallel state machine replication. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, SAC '16, page 534–537, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/2851613.285187.
Mendizabal, O. M., Dotti, F. L., and Pedone, F. (2017b). High performance recovery for parallel state machine replication. In ICDCS, pages 34-44. IEEE. DOI: 10.1109/ICDCS.2017.193.
Mendizabal, O. M., Marandi, P. J., Dotti, F. L., and Pedone, F. (2014). Checkpointing in parallel state-machine replication. In International Conference on Principles of Distributed Systems, pages 123-138. Springer. DOI: 10.1007/978-3-319-14472-6_9.
Moraru, I., Andersen, D. G., and Kaminsky, M. (2013). There is more consensus in egalitarian parliaments. In SOSP. DOI: 10.1145/2517349.2517350.
Ongaro, D. and Ousterhout, J. (2014). In search of an understandable consensus algorithm. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 305-319. Available online [link].
Quamar, A., Kumar, K. A., and Deshpande, A. (2013). SWORD: scalable workload-aware data placement for transactional workloads. In Proceedings of the 16th International Conference on Extending Database Technology, pages 430-441. DOI: 10.1145/2452376.2452427.
Sachs, K., Appel, S., Kounev, S., and Buchmann, A. (2010). Benchmarking publish/subscribe-based messaging systems. In International Conference on Database Systems for Advanced Applications, pages 203-214. Springer. DOI: 10.1007/978-3-642-14589-6_21.
Schneider, F. B. (1990). Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys (CSUR), 22(4):299-319. DOI: 10.1145/98163.98167.
Schuller, P. (2014). Manhattan, our real-time, multi-tenant distributed database for twitter scale. Available online [link].
Tiwari, D., Gupta, S., and Vazhkudai, S. S. (2014). Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems. In DSN, pages 25-36. IEEE. DOI: 10.1109/DSN.2014.101.
Viel, E. and Ueda, H. (2014). Data stream partitioning re-optimization based on runtime dependency mining. In 2014 IEEE 30th International Conference on Data Engineering Workshops, pages 199-206. IEEE. DOI: 10.1109/ICDEW.2014.6818327.
White, B., Lepreau, J., Stoller, L., Ricci, R., Guruprasad, S., Newbold, M., Hibler, M., Barb, C., and Joglekar, A. (2002). An integrated experimental environment for distributed systems and networks. ACM SIGOPS Operating Systems Review, 36(SI):255-270. DOI: 10.1145/844128.844152.
Zheng, W., Tu, S., Kohler, E., and Liskov, B. (2014). Fast databases with fast durability and recovery through multicore parallelism. In OSDI, pages 465-477. Available online [link].
Downloads
Published
How to Cite
Issue
Section
License
Copyright (c) 2024 Journal of Internet Services and Applications
This work is licensed under a Creative Commons Attribution 4.0 International License.