Dependable Microservices in the Kubernetes era: A Practitioners Survey

DOI:

https://doi.org/10.5753/jisa.2024.4000

Keywords:

Microservices, Dependability, Faults, Failures, Countermeasure Techniques, Technologies

Abstract

The microservices architectural style offers several advantages to software development, including independence among development teams, greater autonomy for developers, faster product development, and improved scalability. However, since the communication topology relies on distributed systems, faults become more frequent and harder to manage, posing challenges to reliability and availability, which are key attributes of business-critical services. To address these concerns, fault patterns, countermeasures, and technologies have been explored and implemented in both industry and academia to prevent, tolerate, mitigate, and predict faults in microservices. To understand current industry practices for achieving dependable microservices, we present the results of an opinion survey with microservice practitioners, aiming to identify the main fault and failure patterns, countermeasure techniques, supporting technologies, existing gaps, and the evolution of the field. We also provide a review of academic research in this area, examining the connections between industry practices and academic literature, highlighting key findings, challenges, and opportunities.

Published

2024-12-14

How to Cite

Souza, V. J. S., Neves, V. O., & Kimura, B. Y. L. (2024). Dependable Microservices in the Kubernetes era: A Practitioners Survey. Journal of Internet Services and Applications, 15(1), 561–583. https://doi.org/10.5753/jisa.2024.4000

Section

Research article