Composing State Machine Replication
DOI:
https://doi.org/10.5753/jisa.2025.5914Keywords:
Distributed Computing, Fault Tolerance, State Machine Replication, CompositionAbstract
High availability is a fundamental requirement in large-scale distributed systems, where replication strategies are central in keeping applications operational despite a bounded number of failures. State Machine Replication (SMR) is one of the most widely adopted approaches for implementing highly available, fault-tolerant services, as it increases uptime while ensuring strong consistency. In recent years, research on SMR has yielded numerous variations tailored to enhance resilience, performance, and scalability. In this paper, we revisit SMR from a new perspective by introducing Composing State Machine Replication (CSMR), a method that enables fault-tolerant service composition. By composing SMRs, we promote the reuse of existing services to construct more complex and reliable systems. This modular approach fosters loosely coupled, flexible architectures, contributing to the theoretical foundations of SMR and aligning with common development practices in cloud computing and microservices. We formally define CSMR and demonstrate how composition can be used to extend existing SMR specifications with new features. For example, CSMR allows the semantics of a service operation to be extended by enabling different state machine replicas to execute complementary steps of the same operation. Additionally, SMR composition facilitates sharding and state partitioning by assigning disjoint state variables to separate SMRs. Beyond formalization, the paper provides illustrative examples of CSMR and introduces a high-level CSMR architecture that highlights the essential components, their responsibilities, and their interactions in supporting the composition process. To further demonstrate practicability, we present an API for building CSMR systems that combines RPC-based communication with declarative configuration in YAML format.
Downloads
References
Alchieri, E., Dotti, F., Mendizabal, O. M., and Pedone, F. (2017). Reconfiguring parallel state machine replication. In 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pages 104-113, China. IEEE. DOI: 10.1109/SRDS.2017.23.
Altinbuken, D. and Sirer, E. G. (2012). Commodifying replicated state machines with openreplica. Technical report, Cornell University. Available at:[link].
Alves, C. M., Idalino, T. B., and Mendizabal, O. (2024). Extending state machine replication through composition. In Proceedings of the 13th Latin-American Symposium on Dependable and Secure Computing, LADC '24, page 231–240, New York, NY, USA. Association for Computing Machinery. DOI: 10.1145/3697090.3697106.
Attiya, H. and Welch, J. (2004). Distributed Computing: Fundamentals, Simulations, and Advanced Topics. Wiley-Interscience. Book.
Avizienis, A. and Kelly, J. P. (1984). Fault tolerance by design diversity: Concepts and experiments. Computer, 17:67-80. DOI: 10.1109/MC.1984.1659219.
Batista, E., Alchieri, E., Dotti, F., and Pedone, F. (2022). Early scheduling on steroids: Boosting parallel state machine replication. Journal of Parallel and Distributed Computing, 163:269-282. DOI: 10.1016/j.jpdc.2022.02.001.
Berger, C., Reiser, H. P., and Bessani, A. (2021). Making reads in bft state machine replication fast, linearizable, and live. In 2021 40th International Symposium on Reliable Distributed Systems (SRDS), pages 1-12. IEEE. DOI: 10.1109/srds53918.2021.00010.
Bessani, A., Sousa, J., and Alchieri, E. E. (2014). State machine replication for the masses with bft-smart. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 355-362, Atlanta, GA, USA. IEEE. DOI: 10.1109/DSN.2014.43.
Bezerra, C. E., Pedone, F., and Van Renesse, R. (2014). Scalable state-machine replication. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 331-342, USA. IEEE. DOI: 10.1109/DSN.2014.41.
Burgos, A., Alchieri, E., Dotti, F., and Pedone, F. (2021). Exploiting concurrency in sharded parallel state machine replication. IEEE Transactions on Parallel and Distributed Systems, 33:2133-2147. DOI: 10.1109/TPDS.2021.3135761.
Burrows, M. (2006). The chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th symposium on Operating systems design and implementation, pages 335-350, USA. USENIX Association. DOI: 10.5555/1298455.1298487.
Castro, M. and Liskov, B. (2002). Practical byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems (TOCS), 20:398-461. DOI: 10.1145/571637.571640.
Castro, M., Liskov, B., et al. (1999). Practical byzantine fault tolerance. In OSDI, pages 173-186, USA. USENIX Association. DOI: 10.5555/296806.296824.
Chandra, T. D., Griesemer, R., and Redstone, J. (2007). Paxos made live: an engineering perspective. In Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, pages 398-407, USA. ACM. DOI: 10.1145/1281100.1281103.
Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J. J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., et al. (2013). Spanner: Google’s globally distributed database. ACM Transactions on Computer Systems (TOCS), 31(3):1-22. Available at:[link].
Cui, H., Gu, R., Liu, C., Chen, T., and Yang, J. (2015). Paxos made transparent. In Proceedings of the 25th Symposium on Operating Systems Principles, pages 105-120, USA. ACM. DOI: 10.1145/2815400.2815427.
Dang, Z., Ibarra, O. H., and Su, J. (2004). Composability of infinite-state activity automata. In International Symposium on Algorithms and Computation, pages 377-388, Berlim. Springer. DOI: 10.1007/978-3-540-30551-4_34.
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. ACM SIGOPS operating systems review, 41:205-220. DOI: 10.1145/1323293.1294281.
Défago, X., Schiper, A., and Urbán, P. (2004). Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Computing Surveys (CSUR), 36:372-421. DOI: 10.1145/1041680.1041682.
Eslahi-Kelorazi, M., Le, L. H., and Pedone, F. (2020). Developing complex data structures over partitioned state machine replication. In 2020 16th European Dependable Computing Conference (EDCC), pages 9-16. DOI: 10.1109/EDCC51268.2020.00012.
Etcd (2013). etcd: A distributed, reliable key-value store for the most critical data of a distributed system. Available at:[link].
Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The google file system. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29-43, USA. ACM. DOI: 10.1145/1165389.945450.
Herlihy, M. P. and Wing, J. M. (1990). Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12:463-492. DOI: 10.1145/78969.78972.
Howard, H., Schwarzkopf, M., Madhavapeddy, A., and Crowcroft, J. (2015). Raft refloated: Do we have consensus? ACM SIGOPS Operating Systems Review, 49:12-21. DOI: 10.1145/2723872.2723876.
Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. (2010). Zookeeper: Wait-free coordination for internet-scale systems. In 2010 USENIX Annual Technical Conference (USENIX ATC 10), page 11, USA. USENIX Association. DOI: 10.5555/1855840.1855851.
Kapritsos, M., Wang, Y., Quema, V., Clement, A., Alvisi, L., and Dahlin, M. (2012). All about eve: execute-verify replication for multi-core servers. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation, pages 237-250. USENIX Association. Available at:[link].
Kirsch, J. and Amir, Y. (2008). Paxos for system builders: An overview. In Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, pages 1-6, USA. ACM. DOI: 10.1145/1529974.1529979.
Kotla, R. and Dahlin, M. (2004). High throughput byzantine fault tolerance. In DSN, pages 575-584, Florence, Italy. IEEE. DOI: 10.1109/DSN.2004.1311928.
Lamport, L. (1978). Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21:558-565. DOI: 10.1145/359545.359563.
Lamport, L. (2001). Paxos made simple. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001), 32:51-58. Available at:[link].
Lamport, L. (2006). Lower bounds for asynchronous consensus. Distributed Computing, 19(2):104-125. DOI: 10.1007/s00446-006-0155-x.
Lamport, L., Malkhi, D., and Zhou, L. (2010). Reconfiguring a state machine. ACM SIGACT News, 41:63-73. DOI: 10.1145/1753171.1753191.
Lampson, B. W. (1996). How to build a highly available system using consensus. In International Workshop on Distributed Algorithms, pages 1-17, Berlin, Heidelberg. Springer. DOI: 10.5555/645953.675640.
Lynch, N. and Musco, C. (2022). A Basic Compositional Model for Spiking Neural Networks, pages 403-449. Springer Nature Switzerland, Cham. DOI: 10.1007/978-3-031-15629-8_22.
Lynch, N. A. (1996). Distributed Algorithms. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. DOI: 10.1007/bfb0022433.
Manhattan, T. (2014). Manhattan, our real-time, multi-tenant distributed database for twitter scale. Available at:[link] [Accessed: Jun. 2024].
Marandi, P. J., Bezerra, C. E. B., and Pedone, F. (2014). Rethinking state-machine replication for parallelism. In ICDCS, pages 368-377, Madrid, Spain. IEEE. DOI: 10.1109/ICDCS.2014.45.
Masti, S. (2021). How we built a general purpose key value store for facebook with zippydb. Available at:[link] [Accessed: Jun. 2024].
Mendizabal, O. M., De Moura, R. S. T., Dotti, F. L., and Pedone, F. (2017). Efficient and deterministic scheduling for parallel state machine replication. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 748-757, USA. IEEE. DOI: 10.1109/IPDPS.2017.29.
Oliveira Vale, A., Shao, Z., and Chen, Y. (2024). A compositional theory of linearizability. Journal of the ACM, 71(2):1-107. DOI: 10.1145/3571231.
Olson, M. A., Bostic, K., and Seltzer, M. I. (1999). Berkeley db. In USENIX Annual Technical Conference, FREENIX Track, pages 183-191. Book.
Pacheco, L., Dotti, F., and Pedone, F. (2022). Strengthening atomic multicast for partitioned state machine replication. In Proceedings of the 11th Latin-American Symposium on Dependable Computing, pages 51-60. DOI: 10.1145/3569902.3569909.
Pereira, P. M., Dotti, F. L., Meinhardt, C., and Mendizabal, O. M. (2019). A library for services transparent replication. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pages 268-275, USA. ACM. DOI: 10.1145/3297280.3297308.
Perone, M. and Karachalias, G. (2023). Crème de la crem: Composable representable executable machines. In Proceedings of the 1st ACM SIGPLAN International Workshop on Functional Software Architecture, page 11–19, USA. ACM. DOI: 10.1145/3609025.3609480.
Rodeh, O., Bacik, J., and Mason, C. (2013). Btrfs: The linux b-tree filesystem. ACM Transactions on Storage (TOS), 9(3):1-32. DOI: 10.1145/2501620.2501623.
Scharf, J. a. L., Xavier, L. G. C., and Mendizabal, O. M. (2023). Joining parallel and partitioned state machine replication models for enhanced shared logging performance. In Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing, page 90–99, USA. ACM. DOI: 10.1145/3615366.3615422.
Schneider, F. B. (1990). Implementing fault-tolerant services using the state machine approach: A tutorial. ACM CSUR 1990, 22:299-319. DOI: 10.1145/98163.98167.
Sela, G., Herlihy, M., and Petrank, E. (2021). Brief announcement: Linearizability: A typo. In Proceedings of the 2021 ACM Symposium on Principles of Distributed Computing, pages 561-564. DOI: 10.1145/3465084.3467944.
Shvachko, K., Kuang, H., Radia, S., and Chansler, R. (2010). The hadoop distributed file system. In MSST, pages 1-10, USA. IEEE. DOI: 10.1109/MSST.2010.5496972.
Xavier, L. G., Dotti, F., Meinhardt, C., and Mendizabal, O. (2020). Scalable and decoupled logging for state machine replication. In Proceedings of the 38th Brazilian Symposium on Computer Networks and Distributed Systems, pages 267-280, Brasil. SBC. DOI: 10.5753/sbrc.2020.12288.
Downloads
Published
How to Cite
Issue
Section
License
Copyright (c) 2025 Journal of Internet Services and Applications

This work is licensed under a Creative Commons Attribution 4.0 International License.

