Computer Standards & Interfaces 97 (2026) 104116
Journal homepage: www.elsevier.com/locate/csi

Chaos experiments in microservice architectures: A systematic literature review

Emrah Esen a, Akhan Akbulut a, Cagatay Catal b,∗

a Department of Computer Engineering, Istanbul Kültür University, 34536, Istanbul, Turkey
b Department of Computer Science and Engineering, Qatar University, Doha 2713, Qatar

∗ Corresponding author. E-mail address: ccatal@qu.edu.qa (C. Catal).

https://doi.org/10.1016/j.csi.2025.104116
Received 22 September 2024; Received in revised form 28 November 2025; Accepted 12 December 2025; Available online 15 December 2025
0920-5489/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

ARTICLE INFO

Keywords: Chaos engineering; Microservice; Systematic literature review

ABSTRACT

This study analyzes the implementation of Chaos Engineering in modern microservice systems. It identifies key methods, tools, and practices used to effectively enhance the resilience of software systems in production environments. In this context, our Systematic Literature Review (SLR) of 31 research articles has uncovered 38 tools crucial for carrying out fault injection methods, including several tools such as Chaos Toolkit, Gremlin, and Chaos Machine. The study also explores the platforms used for chaos experiments and how centralized management of chaos engineering can facilitate the coordination of these experiments across complex systems. The evaluated literature reveals the efficacy of chaos engineering in improving fault tolerance and robustness of software systems, particularly those based on microservice architectures. The paper underlines the importance of careful planning and execution in implementing chaos engineering and encourages further research in this field to uncover more effective practices for the resilience improvement of microservice systems.

Contents

1. Introduction
2. Background
2.1. Microservice architecture
2.2. Microservice principles
2.3. Challenges/Troubleshooting/Failures in microservice architecture
2.4. Chaos engineering
3. Review protocol
3.1. Research questions
3.2. Search strategy
3.3. Study selection criteria
3.4. Study quality assessment
3.5. Data extraction
3.6. Data synthesis
4. Results
4.1. Main statistics
4.2. How is Chaos engineering effectively applied in production environments to enhance the resilience of software systems?
4.3. Which platforms have been used for chaos experiments?
4.4. How can Chaos engineering be effectively applied to microservice architecture to ensure successful implementation and enhance system resilience?
4.5. To what extent can the centralized provision of Chaos engineering effectively facilitate the management of chaos experiments across complex systems?
4.6. What are the challenges reported in the relevant papers?
5. Discussion
5.1. General discussion
5.2. Threats to validity
6. Conclusion
CRediT authorship contribution statement
Declaration of competing interest
Data availability
References

1. Introduction

In recent years, the adoption of microservice architecture has led to the transformation of application infrastructures into distributed systems. These systems are designed to enhance maintainability by decoupling services. The primary benefit of this architecture is the ease of maintenance of individual services within the microservice ecosystem due to their smaller and more modular nature [1]. However, despite these advantages, the distributed nature of microservices introduces significant challenges. Specifically, the complex management of services and their tight integration can considerably complicate software debugging. Debugging becomes complex in this architecture due to its distributed nature, the necessity to pinpoint the exact service causing the problem, and the dynamic characteristics of microservices. Consequently, debugging in microservice architecture demands a greater level of effort and specialized expertise compared to conventional monolithic architectures [2]. Moreover, it becomes quite challenging to predict what will happen if there is an unexpected error or if a service on the network goes out of service. Service outages can be caused by anything from a malicious cyberattack to a hardware failure to simple human error, and they can have devastating financial consequences. Although such unexpected situations are rare, they can interfere with the operation of distributed systems and devastatingly affect the live environment in which the application is located [3]. It is necessary to detect weak points in the system before an error occurs and spreads to the entire system.

Microservice architecture applications undergo testing procedures to ensure their quality and dependability. These include unit testing, service testing, end-to-end testing, behavior-driven testing, integration testing, and regression testing [4]. The comprehensive approach to microservices testing also encompasses live testing strategies for complex systems [5]. This thorough process emphasizes different aspects such as functionality, interoperability, and performance of individual services within the architecture. It aims to detect and resolve issues early to ensure stable and high-quality microservice applications [1,6]. However, considering that microservices consist of multiple services, the application should not have an impact on the user experience in cases such as network failures and suddenly increased service loads. For example, if the microservice that adds the product to favorites on a shopping site fails or responds late, the user should be able to continue the shopping experience. Therefore, testing operations in production-like environments become inevitable. No matter how distributed or complex the system is, there is a need for a method that manages unforeseeable situations and builds trust in the system against unexpected failures. Chaos engineering is defined as the discipline of conducting experiments in a live environment to test or verify the reliability of software [7].

The primary objective of this research is to conduct a thorough investigation into how chaos experiments are performed in the widely used microservices-based systems of today. Microservice architectures have come to the forefront in modern software development processes due to their advantages such as flexibility, scalability, and rapid development. However, these architectures also bring unique challenges due to complex service dependencies and dynamic operational environments. This study aims to comprehensively address the methodologies, application scenarios, and impacts of chaos experiments conducted to test the resilience of microservice systems and identify potential weak points. The research intends to present the current state of chaos engineering practices by analyzing them, highlighting best practices, challenges faced, and solutions. In addition, it will assess the effectiveness of chaos experiments in enhancing the reliability and robustness of microservice systems by using data obtained from real-world scenarios to develop strategic recommendations. This study is a critical step in understanding the applicability and impact of chaos engineering within the complexity of microservice architectures and aims to make significant contributions to the body of knowledge in this field. Recent research has applied chaos engineering to this architectural style; however, a systematic overview of the state-of-the-art on the use of chaos engineering in the microservice architecture is lacking. Therefore, a Systematic Literature Review (SLR) has been performed to provide an overview of how chaos engineering was applied.

This article primarily targets peer-reviewed research papers to maintain methodological consistency and ensure scholarly rigor. We specifically chose a systematic literature review (SLR) methodology because peer-reviewed academic studies are subject to rigorous validation processes, which enhance the reliability and validity of our findings [8,9]. Although excluding industry-specific, grey literature may restrict certain practical perspectives, this choice was deliberately made to avoid potential biases and uphold the scientific integrity of our review [10,11]. However, future studies could broaden the scope to incorporate industrial case studies and practical experiences, which would enrich our understanding of chaos engineering's applicability beyond the academic context.

The main contributions of this study are listed as follows:

1. To the best of our knowledge, this is the first study to employ a systematic literature review approach in the field of chaos engineering on microservice architecture applications [12]. The study provides an extensive systematic literature review of how chaos engineering can be applied to enhance the resilience of microservice architectures. It collates findings from various sources to provide insights into the current state of research and practice in this field.

2. The study categorizes and summarizes the range of chaos engineering tools and methods used in industry and academia, highlighting their functionalities in process/service termination, network simulation, load stressing, security testing, and fault injection within application code.

3. This research paper discusses contemporary techniques and approaches for implementing chaos engineering in microservice architectures. It also emphasizes the ongoing work in this field, offering a significant reference for future research endeavors. The paper systematically reviews existing literature to showcase how chaos engineering can enhance system resilience, laying a comprehensive groundwork for further exploration into chaos experimentation strategies and innovating new fault injection methods or tools within microservice architectures.

The rest of the paper is structured as follows: Section 2 explains the background and related work. Section 3 presents the methodology of the research. Section 4 presents the results and Section 5 comprehensively discusses the presented answers to research questions and validity threats. Lastly, the conclusion is presented in Section 6.

2. Background

The microservice approach breaks down a large application into a network of small, self-contained units, each running its own process and often communicating through web APIs. Unlike large, single-piece monolithic systems, these small services are robust, easy to scale up or down, and can be updated individually using various programming languages and technologies. This structure allows development teams to be smaller and more agile, leading to faster updates and improvements. Yet, managing many interconnected services can become complicated, especially when something goes wrong. To enhance system reliability and resilience, a method known as chaos engineering is employed. This involves deliberately introducing problems into the live system to test its ability to cope and recover. This technique helps to uncover and rectify flaws, thereby making the system stronger overall. Regular and automated tests mimic real-life problems to ensure that the system can handle unexpected challenges and remain stable and efficient.

2.1. Microservice architecture

Microservice architectures have gained significant popularity in the software industry due to their ability to address the challenges and complexities of developing modern applications [6,13].

2.2. Microservice principles

Microservice architectures are based on the concept of decentralization, where each service is independently developed, deployed, and managed. This emphasizes autonomy and minimal inter-service dependencies. Each microservice is designed to focus on a single function or closely related set of functions and supports technology heterogeneity by allowing different services to use different technology stacks that best suit their needs. Resilience is a core aspect, with services built to withstand failures without affecting the entire system, while scalability enables services to be scaled independently as per demand. Communication occurs through lightweight mechanisms like HTTP/REST APIs, supporting continuous delivery and deployment practices. Due to the distributed nature of microservice architecture, comprehensive monitoring and logging for observability become crucial. Additionally, there is often an alignment between the microservice architecture and organizational structure involving small cross-functional teams responsible for individual services [14].

It is helpful to compare the microservice architecture to the monolithic architecture. The main difference between them is the size of the developed applications. The microservice architecture can be thought of as developing an application as a suite of smaller services, rather than as a single, monolithic structure. Enterprise applications usually consist of three main parts: a client-side user interface (i.e., HTML pages and JavaScript running in a browser on the user's machine), a database (i.e., many tables in a common, often relational, database management system), and a server-side application. In the server-side application, HTTP requests are processed, business logic is executed, data is retrieved from and updated in the database, and HTML views are prepared and sent to the browser. This structure is a good example of a monolith. Any changes to the system involve creating and deploying a new version of the server-side application [15]. The cycles of change are interdependent. A change to a small part of the application requires rebuilding and deploying the entire monolith [6].

Microservice architecture, on the other hand, has some common features, unlike monolithic architecture. These are componentization with services, organizing around business capabilities, smart interfaces and simple communication, decentralized governance, decentralized data management, infrastructure automation, and design for failure [16]. Today, although modern internet applications seem like a single application, they use microservice architectures behind them. Microservice architecture basically refers to small, autonomous, and interoperable services. It has emerged due to increasing needs such as technology diversity, flexibility, scaling, ease of deployment, organization and management, and provides various advantages in these matters. Its advantages are described as follows [17]:

Technology heterogeneity. Microservices are treated as small services, each running independently and communicating with each other using open protocols. While monolithic applications are developed with a single programming language and database system, services included in a microservice ecosystem may use a different programming language and database. This allows the advantages of each programming language and database to be used.

Resilience. When an error occurs in the system in monolithic applications, the whole system is affected. In the microservice architecture, only the part under the responsibility of the relevant service is affected; the parts belonging to other services are not affected, and the user experience continues.

Scalability. While the scaling process on monolithic applications covers the entire application, only the services that are under heavy load need to be scaled in applications developed with microservice architecture. This avoids unnecessary resource costs for parts that do not need to be scaled and improves the user experience.

Deployment. Microservice architecture facilitates the autonomous deployment of individual services, enabling updates or changes without impacting others. Various deployment strategies, including blue–green, canary, and rolling deployment, minimize disruptions during the deployment process [18]. As a result, microservice architecture provides increased flexibility and resilience in deployment, distinguishing it from monolithic applications.

Organizational alignment. In software development processes, some challenges may be encountered due to large teams and large pieces of code. It is possible to make these challenges more manageable by establishing smaller teams. At the same time, this is an indication that microservices applications allow us to form smaller and more cohesive teams. Each team is responsible for its own microservice and can take action by making improvements if necessary.

2.3. Challenges/Troubleshooting/Failures in microservice architecture

Microservice architectures pose numerous challenges. As the number of services increases, the complexity of service interactions also grows. Reliance on network communication leads to latency and network failure issues, while ensuring data consistency across multiple databases requires careful design and implementation of distributed transactions or eventual consistency models. Microservices bring typical distributed system challenges such as handling partial failures, dealing with latency and asynchrony, complex service discovery, load balancing in dynamic scaling environments, and managing configurations across multiple services and environments. Security concerns are heightened due to the increased surface area of inter-service communication. Testing becomes more complex, involving individual service testing along with testing their interactions; deployment is challenging, especially when there are dependencies between services; effective observability and monitoring become crucial for timely issue resolution; versioning management is critical for maintaining system stability; lastly, assembling skilled teams proficient in DevOps, cloud computing, and programming languages presents a significant challenge. Microservice architecture faces various challenges, troubleshooting, and failures. While adopting a distributed architecture enhances modularity, it inherently introduces operational complexities that differ significantly from monolithic structures. Recent research has also explored the use of hybrid bio-inspired algorithms to optimize this process dynamically. For instance, the Hybrid Kookaburra–Pelican Optimization Algorithm has been shown to improve load distribution and system scalability in cloud and microservice-based environments [19].

In conclusion, while microservices offer numerous advantages such as improved scalability, flexibility, and agility, they also introduce significant challenges in terms of system complexity, operational demands, and the need for skilled personnel and sophisticated tooling [20].

2.4. Chaos engineering

“Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production-like environment” [7,21]. It is the careful and planned execution of experiments to show how the distributed system will respond to a failure. It is necessary for large-scale software systems because it is practically impossible to simulate real events in test environments. Experiments based on real events are created together with chaos engineering [22]. By analyzing the test results, improvements are made where necessary, and in this way, the aim is to increase the reliability of the software in the production environment.

Thanks to an experimental and systems-based approach, confidence is established for the survivability of these systems during collapses. Canary analysis collects data on how distributed systems react to failure scenarios by observing their behavior in abnormal situations and performing controlled experiments [23]. This method involves applying new updates or changes to a specific aspect of the system, enabling early detection of potential problems before they affect a larger scale. Chaos experiments consist of the following principles [24,25]:

• Hypothesize steady state: The first step is to hypothesize the steady state of the system under normal conditions.
• Vary real-world events: The next step is to vary real-world events that can cause turbulence in the system.
• Run experiments in production: Experimenters should run the experiments in a production-like environment to simulate real-world conditions.
• Automate experiments to run continuously: Experimenters should automate the experiments to run continuously, ensuring that the system can withstand turbulence over time.
• Minimize blast radius: The experiments should be designed to minimize blast radius, i.e., the impact of the experiment on the system should be limited to a small area.
• Analyze results: Experimenters should analyze the results of the experiments to determine the system's behavior under turbulent conditions.
• Repeat experiments: The experiments should be repeated to ensure that the system can consistently withstand turbulence. When the experiment is finished, information about the actual effect will be provided to the system.

3. Review protocol

Systematic review studies must be conducted using a well-defined and specific protocol. To conduct a systematic review study, all studies on a particular topic must be examined [12]. We followed the systematic review process shown in Fig. 1 and took all the steps to reduce the risk of bias in this study. Multiple reviewers were involved in the SLR process, and in cases of conflict, a brief meeting was organized to facilitate consensus. The first step was to define the research questions. Then, the most appropriate databases were selected. Based on the selected databases, automated searches were conducted and several articles were identified. Selection criteria were then established to determine which studies should be included and excluded in this research. The titles and abstracts of all studies were reviewed. In cases of doubt, the full text of the publication was reviewed. Then, after the studies were analyzed in detail, selection criteria were applied. All selected studies were assessed using a quality assessment process. Subsequently, the results were synthesized, listed, and summarized in a clear and understandable manner.

3.1. Research questions

Research Questions (RQs) and their corresponding motivations are presented as follows:

• RQ1: How is Chaos engineering effectively applied in production environments to enhance the resilience of software systems?
  Motivation: Understanding the practical implementation of Chaos engineering in production environments is crucial for ensuring the resilience of software systems under real-world operating conditions.
• RQ2: Which platforms have been used for Chaos experiments?
  Motivation: Identifying the platforms provides insights into the technological landscape and tools available for conducting Chaos engineering practices.
• RQ3: How is Chaos engineering effectively applied to microservice architectures to ensure its successful implementation in enhancing system resilience?
  Motivation: Microservice architectures introduce new challenges in system design. Exploring the application of Chaos engineering in this context can help improve the resilience and fault tolerance of microservice systems.
• RQ4: To what extent can the centralized provision of Chaos engineering effectively facilitate the management of Chaos experiments across complex systems?
  Motivation: Understanding the feasibility of providing Chaos engineering as a centralized service enables organizations to coordinate Chaos experiments across complex systems.
• RQ5: What are the challenges reported in the relevant papers?
  Motivation: Identifying these challenges provides valuable insights into overcoming obstacles and advancing the adoption of Chaos engineering practices.

3.2. Search strategy

The primary studies were carefully selected from the papers published between 2010 and 2022 because the topic has only become relevant in recent years. The databases are IEEE Xplore, ACM Digital Library, Science Direct, Springer, Wiley, MDPI, and Scopus. The initial search involved reviewing the titles, abstracts, and keywords of the studies identified in the databases. The search results obtained from the databases were stored in the data extraction form using a spreadsheet tool. Furthermore, this systematic review was conducted collaboratively by three authors.

The following search string was used to broaden the search scope:

((chaos engineering) OR (chaos experiments)) OR (microservices)

The results of the searches made in the databases mentioned above are shown in Fig. 2.

3.3. Study selection criteria

After applying the exclusion and inclusion criteria, 55 articles were obtained. The exclusion criteria in our study are shown as follows:

• EC-1: Duplicate papers from multiple sources
• EC-2: Papers without full-text availability
• EC-3: Papers not written in English
• EC-4: Survey papers
• EC-5: Papers not related to Chaos engineering

The inclusion criteria in our study are shown as follows:

• IC-1: Primary papers discussing the use of Chaos experiments in a microservice architecture
• IC-2: Primary publications that focus on Chaos engineering

Fig. 1. SLR review protocol. Source: Adapted from [26–28].

Fig. 2. Distribution of selected papers per database.

3.4. Study quality assessment

The assessment of each study's quality is an indicator of the strength of evidence provided by the systematic review. The quality of studies was assessed using various questions. Studies of poor quality were not included in the present study. These criteria, based on quality instruments, were adopted from guidelines and other SLR research [12]. The following questions were used to assess the quality of the studies.

• Q1. Are the aims of the study clearly stated?
• Q2. Are the scope and experimental design of the study clearly defined?
• Q3. Is the research process documented adequately?
• Q4. Are all the study questions answered?
• Q5. Are the negative findings presented?
• Q6. Do the conclusions relate to the aim of the study and are they reliable?

In this study, considering all these criteria, a general quality assessment was performed for each paper. The rating was 2 points for the “yes” option, 0 points for the “no” option, and 1 point for the “somewhat” option. The decision threshold for classifying the paper as poor quality was determined based on the mean value, which corresponds to a total of 5 points.

Fig. 2 presents the distribution of papers based on databases where they were found at different selection stages. After the initial search, 4520 papers were retrieved, of which 55 remained after applying the selection criteria. After quality assessment, 31 papers were selected as primary studies. The 55 papers were carefully read in full and the required data for answering the research questions were extracted. All the collected articles are listed in Table 1.

3.5. Data extraction

The data required to answer the research questions were extracted from the selected articles using a data extraction form. The data extraction form consists of several metadata such as the author's first and last name, the title of the study, the publication year, and the type of study. In addition to this metadata, several columns were created to store the required information related to the research questions. By employing a data extraction form, we ensured that the relevant data required to answer each research question were systematically captured from the selected publications. This approach facilitated the subsequent synthesis of the findings. The data extraction process involved meticulous attention to detail and ensured the reliability and integrity of the data used in our systematic literature review.

Table 1
Selected primary studies.
ID | Reference | Title | Year | Database
S1 | [29] | Automating Chaos Experiments in Production | 2019 | ACM
S2 | [25] | Getting Started with Chaos engineering - design of an implementation framework in practice | 2020 | ACM
S3 | [30] | Human-AI Partnerships for Chaos engineering | 2020 | ACM
S4 | [31] | 3MileBeach: A Tracer with Teeth | 2021 | ACM
S5 | [32] | Service-Level Fault Injection Testing | 2021 | ACM
S6 | [33] | A Platform for Automating Chaos Experiments | 2016 | IEEE Xplore
S7 | [34] | Automated Fault-Tolerance Testing | 2016 | IEEE Xplore
S8 | [35] | Gremlin: Systematic Resilience Testing of Microservices | 2016 | IEEE Xplore
S9 | [36] | Fault Injection Techniques - A Brief Review | 2018 | IEEE Xplore
S10 | [37] | ORCAS: Efficient Resilience Benchmarking of Microservice Architectures | 2018 | IEEE Xplore
S11 | [38] | The Business Case for Chaos engineering | 2018 | IEEE Xplore
S12 | [39] | Use of Self-Healing Techniques to Improve the Reliability of a Dynamic and Geo-Distributed Ad Delivery Service | 2018 | IEEE Xplore
S13 | [40] | Security Chaos engineering for Cloud Services: Work In Progress | 2019 | IEEE Xplore
S14 | [41] | A Framework of Virtual War Room and Matrix Sketch-Based Streaming Anomaly Detection for Microservice Systems | 2020 | IEEE Xplore
S15 | [42] | CloudStrike: Chaos engineering for Security and Resiliency in Cloud Infrastructure | 2020 | IEEE Xplore
S16 | [43] | Identifying and Prioritizing Chaos Experiments by Using Established Risk Analysis Techniques | 2020 | IEEE Xplore
S17 | [44] | Fitness-guided Resilience Testing of Microservice-based Applications | 2020 | IEEE Xplore
S18 | [24] | A Chaos engineering System for Live Analysis and Falsification of Exception-Handling in the JVM | 2021 | IEEE Xplore
S19 | [45] | A Study on Chaos engineering for Improving Cloud Software Quality and Reliability | 2021 | IEEE Xplore
S20 | [46] | Chaos engineering for Enhanced Resilience of Cyber–Physical Systems | 2021 | IEEE Xplore
S21 | [47] | ChaosTwin: A Chaos engineering and Digital Twin Approach for the Design of Resilient IT Services | 2021 | IEEE Xplore
S22 | [48] | Platform Software Reliability for Cloud Service Continuity - Challenges and Opportunities | 2021 | IEEE Xplore
S23 | [49] | Trace-based Intelligent Fault Diagnosis for Microservices with Deep Learning | 2021 | IEEE Xplore
S24 | [50] | A Guided Approach Towards Complex Chaos Selection, Prioritization and Injection | 2022 | IEEE Xplore
S25 | [51] | Chaos Driven Development for Software Robustness Enhancement | 2022 | IEEE Xplore
S26 | [22] | Maximizing Error Injection Realism for Chaos engineering With System Calls | 2022 | IEEE Xplore
S27 | [52] | On Evaluating Self-Adaptive and Self-Healing Systems using Chaos engineering | 2022 | IEEE Xplore
S28 | [53] | Observability and chaos engineering on system calls for containerized applications in Docker | 2021 | ScienceDirect
S29 | [54] | Scalability resilience framework using application-level fault injection for cloud-based software services | 2022 | Springer
S30 | [55] | Chaos as a Software Product Line - A platform for improving open hybrid-cloud systems resiliency | 2022 | Wiley
S31 | [56] | The Observability, Chaos engineering, and Remediation for Cloud-Native Reliability | 2022 | Wiley

3.6. Data synthesis

To answer the research questions, the data obtained were collected and summarized in an appropriate manner, a step called data synthesis. To perform the data synthesis, a qualitative analysis was conducted on the data obtained. For instance, synonyms used for different categories were identified and merged in the respective fields. This data synthesis approach allowed us to derive insights and draw conclusions from the collected information.

4. Results

This section provides insights into how chaos engineering is applied in production environments, particularly its use in improving the resilience and reliability of microservice architecture applications. It discusses how fault detection is developed using chaos engineering tools and is mainly used in production for troubleshooting. Chaos experiments are usually conducted in the production environment to provide realistic results. The section further enumerates several tools that have been used for chaos experiments and discusses general principles such as defining a steady state, forming a hypothesis, conducting the experiment, and proving or refuting the hypothesis. These principles and tools help detect problems such as hardware issues, software errors, network interruptions, security vulnerabilities, and configuration mistakes within their respective contexts.

4.1. Main statistics

Fig. 3 shows the results of the quality assessment. The distribution of the years of publication is shown in Fig. 4. Most of the selected studies were published in the last few years, which shows that researchers' interest in chaos engineering has grown. Most of the included studies were indexed in the IEEE Xplore database.

Fig. 5 presents the distribution of the type of publications and the corresponding databases. While there are many journal papers, conference proceedings also appear among the selected papers.

Chaos engineering involves several categories of functionality that serve distinct purposes in resilience testing. The first category involves intentionally terminating processes or services to evaluate system behavior and recovery from failures [7]. Another category is network simulation, which allows engineers to replicate adverse network conditions to assess system performance and reliability [25]. In the Stressing Machine category, engineers subject the system to extreme loads to identify limits and potential bottlenecks [7]. In security testing, engineers simulate breaches or attacks to assess the system's response and enhance defenses [7]. Lastly, in the fault application code category, engineers inject targeted faults or errors into the codebase to assess system resilience and error-handling capabilities [24]. These categories help organizations proactively identify weaknesses, strengthen system robustness, and enhance reliability in complex technology landscapes [7]. Functionality categories of tools are presented in Fig. 6.

The tools utilized in industry settings are not comprehensively addressed in the articles. To provide insights for future research, the tools identified in an additional examination were categorized based on their functionality, as presented in Tables 2 and 3. Table 2 displays the tools obtained from the studies, while Table 3 presents additional tools that have been examined. Tools listed in the tables with corresponding references were included in the referenced articles.

4.2. How is Chaos engineering effectively applied in production environments to enhance the resilience of software systems?

Table 4 examines the successful implementation of Chaos Engineering in operational settings, covering different aspects such as goals, techniques and resources, guiding principles, findings, limitations and substitutes, as well as the general strategy.

4.3. Which platforms have been used for chaos experiments?

Table 5 provides a concise summary of various tools and platforms used in Chaos experiments, along with their specific functionalities or characteristics. Each platform is described in detail with the necessary references.

Fig. 3. Quality assessment scores.

Fig. 4. Year of publication.

Fig. 5. Diagram of the distribution of studies per search database.

Fig. 6. Functionality of chaos engineering tools.

Table 2. Chaos engineering tools from studies.
Chaos engineering tool; Termination; Network simulating; Stressing machine; Security; Fault application code
Chaos Monkey [57] ×
Gremlin [35] × × × × ×
Chaos Toolkit [45] × × × × ×
Pumba [55] × ×
LitmusChaos [45] × × × ×
ToxiProxy [45] × ×
PowerfulSeal [45] × × × ×
Pod Reaper [25] ×
Netflix Simian Army [36] × × ×
WireMock [25] × ×
KubeMonkey [25] × × ×
Chaosblade [45] × × ×
ChaosTwin [47] × × × ×
Chaos Machine [24] × × ×
Cloud Strike [42] ×
Phoebe [22] ×
Mjolnirr [58] ×
ChaosOrca [37] × × ×
3MileBeach [31] × ×
Muxy [25] × × ×
Blockade [25] ×
Chaos Lambda [25] × ×
Byte-Monkey [25] ×
Turbulence [25] × × ×
Cthulhu [25] × × × ×
Byteman [25] × ×
ChaosCube [55] ×
Chaos Lemur [25] ×
Chaos HTTP Proxy [25] ×
Chaos Mesh [45] × × ×
Istio Chaos [45] ×
ChAP [33] × ×
IntelliFT [44] × × × ×

Table 3. Chaos engineering tools from our search.

Chaos engineering tool; Termination; Network simulating; Stressing machine; Security; Fault application code
Pod Chaos × × ×
DNS Chaos ×
AWS Chaos × × ×
Azure Chaos × × × ×
GCP Chaos × × × ×

Table 4. Chaos engineering in production environments.

Category | Description
Objective | The primary objective of applying chaos engineering in production environments is to enhance the resilience of software systems. This involves troubleshooting to identify and address potential malfunctions before they occur. The overarching goal is to minimize issues in production through the use of chaos engineering tools, enabling automatic fault detection [24,53].
Methods and tools | Chaos engineering relies on specific tools to facilitate its effective application in production environments. These tools aid in automatic fault detection, a crucial aspect of troubleshooting to minimize potential issues in the production environment [24,53].
Principles and considerations | The effective application of chaos engineering is closely tied to key principles and considerations. These include continuous experimentation, serving as a form of robustness testing conducted in real-world operational conditions. Fundamental principles of Chaos Experiments involve defining a steady state, hypothesizing about its impact, conducting the experiment, and then demonstrating or refuting the hypothesis [53].
Insights and results | Chaos experiments conducted in the production environment provide valuable insights into the behavior of the system. This is particularly significant as the production environment may exhibit unpredictable behavior that differs from staging environments in some cases [24].
Constraints and alternatives | While conducting chaos experiments in production is ideal, it is acknowledged that legal or technical constraints may sometimes prevent this. In such cases, an alternative approach is considered, starting chaos experiments in a staging environment and gradually transitioning to the production environment [25].
Overall approach | The overall approach for the effective application of chaos engineering in production environments involves the systematic execution of chaos experiments. This includes leveraging chaos engineering tools and taking into account the constraints and challenges associated with conducting experiments in real-world operational settings. The aim is to proactively identify and address potential issues before they impact the production environment, ultimately enhancing the resilience of software systems.

Table 5. Chaos engineering tools identified from selected papers.

Platform/Tool | Description
The Chaos Machine | A tool for conducting chaos experiments at the application level on Java Virtual Machine (JVM), using exception injection to analyze try-catch blocks for error processing [24].
Screwdriver | An automated fault-tolerance testing tool for on-premise applications and services, creating realistic error models and collecting metrics by injecting errors into the system [34].
Chaos Monkey | Designed by Netflix, this tool tests the system's resilience by randomly killing partitions to check system functionality [7,45].
Cloud Strike | A security chaos engineering system for multi-cloud security, extending chaos engineering to security by injecting faults impacting confidentiality, integrity, and availability [42].
ChaosMesh | An open-source chaos engineering platform for testing the resilience and reliability of distributed systems by intentionally injecting failures and disruptions [55].
Powerfulseal | An open-source tool for testing the resilience of Kubernetes clusters by simulating real-world failures and disruptions [55].
IntelliFT | A feedback-based, automated failure testing technique for microservice applications, focusing on exposing defects in fault-handling logic [44].
The Chaos Toolkit | Open-source software that runs experiments against the system to confirm a hypothesis [25,55].
Phoebe | A fault injection framework for reliability analysis concerning system call invocation errors, enabling full observability of system call invocations and automatic experimentation [22].
Mjolnirr | A private cloud platform with a built-in Chaos Monkey service for developing private PaaS cloud infrastructure [58].
ChaosOrca | A tool for Chaos engineering on containers, perturbing system calls for processes inside containers and monitoring their effects [37].
Gremlin | Offered as a SaaS technology, Gremlin tests system resilience on various parameters and conditions, with capabilities for automation and integration with Kubernetes clusters and public clouds [35].
3MileBeach | A distributed tracing and fault injection framework for microservices, enabling chaos experiments through message serialization library manipulation [31].
ChAP | A software platform for running automated chaos experiments, simulating various failure scenarios and providing insights into system behavior under stress [29,33].
ChaosTwin | Utilizes a digital twin approach in Chaos Engineering to mitigate impacts of unforeseen events, constructing models across workload, network, and service layers [47].
Litmus Chaos | An open-source cloud-native framework for Chaos Engineering in Kubernetes environments, offering a range of chaos experiments and workflows [50].
Filibuster | A testing method in chaos engineering that introduces errors into microservice architecture to validate resilience and error tolerance [32].

Table 6. Chaos engineering in microservices: approaches, descriptions, and expected outcomes.

Approach | Description | Expected impact
Fault injection testing | This method involves intentionally introducing errors into the system to assess its response, particularly in microservices by simulating various failure modes such as network issues, service outages, or resource shortages within or between microservices, to evaluate the system's resilience and stability [52]. | Evaluating and enhancing the system's resilience and stability.
Hypothesis-driven experiments | Key to chaos engineering is conducting experiments based on well-defined hypotheses about the normal state of the system and its expected behavior during failure scenarios. This strategic approach enables focused experiments that assess the resilience of both individual microservices and the overall system [45,53]. | Identifying system weaknesses and increasing resilience.
Blast radius management | Managing the "blast radius" of experiments is crucial in microservices. It involves understanding the potential impact of introduced failures, starting with small experiments and then expanding, to manage failure impacts while identifying system vulnerabilities [45]. | Better understanding and enhancing the system's resilience.
Resilience requirement elicitation | Utilizing chaos engineering to determine and analyze the resilience requirements of microservice architectures. This process involves observing the system's response to induced faults to identify specific resilience needs of each microservice and their interactions [52]. | Understanding specific resilience needs of each microservice and their interactions.
Continuous testing and improvement | Regularly conducting chaos experiments as part of an ongoing testing process ensures that microservices remain resilient against unforeseen issues. This continuous approach aids in proactively finding and fixing potential system weaknesses [56]. | Proactive identification and resolution of system weaknesses, leading to continual improvement and increased resilience.
Observability and remediation | Integrating chaos engineering with observability tools enhances the monitoring of microservices during fault injection, allowing for real-time tracking of responses to failures, aiding in the development of effective remediation strategies and overall system resilience improvement [56]. | Real-time tracking of responses to failures and development of effective remediation strategies for overall system resilience improvement.

4.4. How can Chaos engineering be effectively applied to microservice architecture to ensure successful implementation and enhance system resilience?

Table 6 provides a comprehensive overview of the different facets and projected implications of implementing chaos engineering within microservice architecture.

5.1. General discussion

In this article, we reviewed the literature on the application of chaos engineering in microservice architecture to understand the state-of-the-art. For this purpose, six research questions were defined and answered.
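The hypothesis-driven experiments and blast radius management listed in Table 6 can be illustrated with a short Python sketch. It is not taken from any of the reviewed tools: the names StepResult, run_experiment, and simulated_service, the 5% error-rate tolerance, and the toy failure model are all assumptions made for illustration. The sketch states a steady-state hypothesis (the error rate stays below a tolerance), injects a fault into a growing fraction of instances, and stops widening the blast radius as soon as the hypothesis is refuted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    blast_radius: float     # fraction of instances that received the fault
    error_rate: float       # error rate observed while the fault was active
    hypothesis_held: bool   # True if the steady state stayed within tolerance


def run_experiment(
    observe_error_rate: Callable[[float], float],
    max_error_rate: float,
    radii: List[float],
) -> List[StepResult]:
    """Inject a fault at growing blast radii and stop at the first refutation."""
    steps: List[StepResult] = []
    for radius in sorted(radii):
        observed = observe_error_rate(radius)   # inject the fault, then measure
        held = observed <= max_error_rate       # compare against the steady state
        steps.append(StepResult(radius, observed, held))
        if not held:
            break  # a weakness was found: do not widen the blast radius further
    return steps


def simulated_service(radius: float) -> float:
    """Hypothetical stand-in for a real system under fault injection.

    Assumes retries absorb the fault while at most 25% of instances are hit.
    """
    baseline = 0.01
    if radius <= 0.25:
        return baseline
    return baseline + 0.5 * (radius - 0.25)


steps = run_experiment(simulated_service, max_error_rate=0.05,
                       radii=[0.05, 0.25, 0.5, 1.0])
for step in steps:
    print(step)
```

In a real setting, observe_error_rate would wrap a fault injection tool and a monitoring query rather than a toy function. Stopping at the first refuted hypothesis is one possible design choice for keeping the impact of an experiment small.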
By implementing these approaches and strategies, organizations can effectively integrate chaos engineering into their microservice architectures to uncover vulnerabilities and enhance the overall dependability of their systems.

4.5. To what extent can the centralized provision of Chaos engineering effectively facilitate the management of chaos experiments across complex systems?

Table 7 provides an overview of the ways in which centralized chaos engineering can simplify experiment management in intricate systems. It emphasizes advantages such as standardization, resource utilization, and risk mitigation, which result in enhanced system resilience and performance.

4.6. What are the challenges reported in the relevant papers?

Table 8 concisely presents the primary obstacles in the area of chaos engineering and their respective resolutions. These obstacles encompass system intricacy, hazards to live environments, resource demands, security issues, and automation complexities. The proposed resolutions involve phased implementation, risk assessment, knowledge enhancement, robust security protocols, and automation approaches.

5. Discussion

In this section, we summarize the answers to the research questions. The answers indicate that chaos engineering can improve robustness by simulating real-world failure scenarios and exploring system reactions, especially in microservice architectures. Various tools for implementing chaos engineering were listed and compared. Overall, applying chaos engineering requires careful planning because of its inherent challenges, but it has the potential to greatly improve system resilience.

In RQ1, we aimed to understand how chaos engineering is applied to production environments. When applied well in production settings, chaos engineering is an important tool for improving the robustness of software systems. The approach entails conducting deliberate and controlled chaos experiments within the production environment, which helps uncover and rectify potential issues before they escalate into full-blown system failures, thereby improving system uptime [38]. Moreover, chaos engineering is characterized by the intentional injection of faults into systems. This methodology is crucial for identifying and addressing security flaws and risks, laying the groundwork for the development of resilient application architectures [56]. By replicating adverse conditions that could naturally arise in production settings, chaos engineering helps detect inherent system vulnerabilities and structural deficiencies, fostering a proactive stance towards issue mitigation [38].

Additionally, this practice involves comprehensive testing of real-world scenarios on operational systems. Such testing is vital for assessing the complete spectrum of software systems, encompassing both hardware malfunctions and software glitches, within their actual deployment contexts. This approach contributes significantly to overall system resilience [38]. To implement chaos engineering effectively, it is recommended to start with less complex experiments, automate these experiments, and focus on areas with either high impact or high frequency of issues. Observing the system at its limits is also crucial for reinforcing resilience [25].

In RQ2, we discuss various platforms that aim to increase the flexibility and reliability of microservice architectures through chaos experiments. Tools like Gremlin, Chaos Monkey, Chaos Toolkit, Pumba, LitmusChaos, ToxiProxy and PowerfulSeal have been utilized in industry settings to simulate different failure scenarios. These tools provide functions such as terminating processes, simulating network conditions, applying stress tests, testing security measures, and injecting faults to proactively identify weaknesses and strengthen system robustness across different technology landscapes.

Table 7. Centralized provision in chaos engineering.

Approach | Description | Expected impact
Standardization | Centralized provision allows for the standardization of chaos engineering practices and tools across the organization. This ensures that all teams follow consistent processes and use approved tools, leading to better coordination and more reliable results [42]. | Improved coordination and reliability of results.
Resource optimization | Centralized provision enables efficient allocation of resources for chaos experiments. It allows pooling of expertise, tools, and infrastructure, reducing redundancy and optimizing resource utilization [38]. | Enhanced resource utilization and reduced redundancy.
Risk management | Centralized provision facilitates better risk management by providing oversight and governance for chaos experiments. It establishes clear guidelines, safety measures, and expected states for running experiments in production environments, ensuring controlled experimentation [42]. | Controlled experimentation and effective risk management.
Automation and continuous testing | Centralized provision supports the automation of chaos experiments to run continuously. This ensures regular conduction of experiments, leading to ongoing validation of system resilience and identification of potential issues before they manifest as outages [38,42]. | Ongoing validation of system resilience and early identification of potential issues.
Knowledge sharing and collaboration | A centralized approach encourages knowledge sharing and collaboration among teams. It facilitates the dissemination of best practices, lessons learned, and successful experiment designs, fostering a culture of continuous improvement and shared learning [25]. | Promotion of a continuous improvement culture and shared learning.
Performance metrics and analysis | Centralized provision enables the establishment of standardized performance metrics and analysis methods for chaos experiments. This allows for consistent measurement of system health and identification of deviations from steady-state, leading to more effective decision-making and system improvements [43]. | Consistent system health measurement and more effective decision-making.

Table 8. Challenges and solutions in chaos engineering.

Category | Challenges | Possible solutions | References
Complexity | Designing and executing effective chaos experiments in large systems is complex due to intricate interdependencies within these systems. | To mitigate complexity, it is recommended to start with smaller, more manageable experiments and gradually expand the scope of chaos engineering practices. | [25,43]
Risk of impact | Concerns about causing disruptions in the production environment, affecting users and business operations. | Implementing risk analysis techniques can help prioritize experiments, focusing on less critical system components first to minimize potential impacts. | [45,50]
Resource intensiveness | Significant resources needed including time, expertise, and infrastructure, posing a barrier for many organizations. | Addressing resource intensiveness involves providing comprehensive training and education on chaos engineering best practices and tools to equip teams with the necessary skills and knowledge. | [7,47]
Security concerns | Introducing controlled failures can raise security issues, potentially exposing vulnerabilities or sensitive data. | To combat security concerns, robust security measures should be implemented during experiments to safeguard sensitive data and prevent unauthorized access. | [42,47]
Tooling and automation | Developing tools for automated chaos experiments is challenging in heterogeneous and dynamic environments. | Overcoming tooling and automation challenges requires the development and use of automated tools for Chaos experiments, which reduce manual efforts and facilitate continuous, unattended testing. | [7,33,38,40,42]

Recent studies have emphasized the growing intersection between artificial intelligence and cybersecurity within the context of chaos engineering. AI-driven techniques are now used for real-time threat detection, anomaly prediction, and automated response mechanisms in enterprise systems. For example, generative AI models have been proposed to enhance cybersecurity frameworks by improving data privacy management and identifying potential attack vectors [59].

In RQ3, we focused on understanding how chaos engineering is implemented in microservice architectures. To enhance system resilience in microservice architectures through chaos engineering, organizations should use fault injection testing to replicate failures within microservices. They should also conduct hypothesis-driven experiments based on a solid understanding of the normal state and the anticipated behavior during disruptions, while managing the scope of these experiments to minimize impact. Additionally, it is essential to identify and analyze resilience requirements, to engage in continuous testing and improvement, and to integrate observability tools for real-time monitoring during fault injection tests. Moreover, organizations need to establish clear communication channels across the teams involved to ensure effective collaboration and knowledge sharing.

The answer to RQ4 highlights the significance of centralized management and monitoring in conducting chaos experiments within large-scale microservices ecosystems. It discusses the use of software solutions like Netflix's Chaos Automation Platform (ChAP) and fault injection techniques such as service call manipulation. The emphasis is placed on the need for careful planning, effective communication, risk management, and continuous learning to ensure comprehensive and valuable chaos experiments for enhancing overall system resilience.

In response to RQ5, our discussion concludes that the practical implementation of chaos engineering, despite its promise to enhance system resilience, presents numerous challenges. These challenges include potential business impacts, difficulty in determining scope, the unpredictability of outcomes, time and resource constraints, system complexities, skill and knowledge prerequisites, interpretation of results, cultural readiness, and selection of appropriate tools. All of these call for meticulous planning and skilled execution.

Recent studies explore the convergence of Chaos Engineering and Artificial Intelligence (AI). Large language models (LLMs) have been used to automate the chaos engineering lifecycle, managing phases from hypothesis creation to experiment orchestration and remediation [60]. Meanwhile, advances in applying chaos engineering to multi-agent AI systems suggest new directions: for example, chaos experiments applied to LLM-based multi-agent systems can surface vulnerabilities such as hallucinations, agent failures, or inter-agent communication breakdowns [61]. Together, these works show how intelligent,
adaptive chaos frameworks might evolve in microservice-based systems as well.

Recent research also discusses specific operational challenges such as load balancing and security in the context of chaos engineering. For example, an empirical study applies delay injections under different user loads in cloud-native systems to observe how throughput and latency change under stress, providing insights into how load balancing policies perform under fault conditions [62]. In parallel, several frameworks have begun integrating security-focused chaos tests that intentionally inject faults into authentication, identity management, and access control components to ensure that security mechanisms remain effective under stress conditions [63]. These studies highlight how chaos engineering can be extended beyond performance reliability to proactively strengthen both load distribution and security resilience in microservice environments.

The main challenges faced by previous researchers and possible solutions have been discussed in the paper. The collected challenges were mainly related to the correct interpretation of chaos experiments and making sense of them. There may be more challenges, but if they were not mentioned in these articles, we could not include them. We believe that chaos engineering is still in its early stages and that its adoption in the software industry will take some time.

5.2. Threats to validity

Internal validity

The validity of this systematic literature review is threatened by issues related to defining the candidate pool of papers, potential bias in selecting primary studies, data extraction, and data synthesis. The application of exclusion criteria can be influenced by the researchers' biases, posing a potential threat to validity. We compiled a comprehensive list of exclusion criteria, and all conflicts were documented and resolved through discussions among us. Data extraction validity is crucial as it directly impacts the study results. Whenever any of us was uncertain about data extraction, the case was recorded for resolution through discussions with the team. Multiple meetings were held to minimize researcher bias.

External validity

The search for candidate papers involved using general search terms to minimize the risk of excluding relevant studies. Despite using a broad search query to acquire more articles, there remains a possibility that some papers were overlooked in electronic databases or missed due to recent publications. Furthermore, although seven widely used online databases in computer science and software engineering were searched, recently published papers may not have been included.

6. Conclusion

Our systematic literature review (SLR) on chaos engineering has explored its role in enhancing the resilience of software systems in production environments. Through our review, we have identified several crucial aspects that underline the effective application and challenges of chaos engineering [25].

Firstly, Chaos Engineering serves as a proactive troubleshooting approach in production environments [25]. By identifying and addressing potential malfunctions before they occur, it preempts system disruptions. This proactive strategy is largely realized through chaos engineering tools that assist in automatic fault detection, thereby minimizing potential issues in these critical environments [50].

Secondly, the essence of chaos engineering is rooted in continuous experimentation and robustness testing under real-world operational conditions. The methodology involves a systematic approach: defining a steady state, hypothesizing its impacts, conducting controlled experiments, and subsequently confirming or refuting the hypotheses. These experiments are insightful, as they reveal system behaviors in production environments, which often differ unpredictably from staging environments [36,53].

Furthermore, the effectiveness of chaos engineering is contingent on the systematic execution of chaos experiments. These experiments, which rely on chaos engineering tools, need to navigate the constraints and challenges inherent in real-world operational settings. The main objective is the enhancement of system resilience, achieved by proactively identifying and preemptively addressing potential issues [46].

However, it is acknowledged that conducting chaos experiments directly in production environments might be impeded by legal or technical constraints. In such scenarios, initiating experiments in a staging environment and then gradually transitioning to the production environment offers a viable alternative. This approach ensures that the benefits of chaos engineering can still be realized, but in a more controlled and possibly less direct manner.

Our review highlights that chaos engineering is a critical methodology for ensuring the resilience and robustness of software systems. Through continuous experimentation and proactive troubleshooting, it offers a pathway to address the challenges faced in complex production environments. This SLR contributes to the scientific community by discussing these methodologies and their applications, thereby providing a framework for future research and practical implementation in the field of software system resilience.

CRediT authorship contribution statement

Emrah Esen: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Methodology, Investigation, Formal analysis, Data curation. Akhan Akbulut: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Formal analysis, Data curation. Cagatay Catal: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] P. Jamshidi, C. Pahl, N.C. Mendonça, J. Lewis, S. Tilkov, Microservices: The journey so far and challenges ahead, IEEE Softw. 35 (3) (2018) 24–35, http://dx.doi.org/10.1109/MS.2018.2141039.
[2] I. Beschastnikh, P. Wang, Y. Brun, M.D. Ernst, Debugging distributed systems, Commun. ACM 59 (8) (2016) 32–37, http://dx.doi.org/10.1145/2909480.
[3] W. Ahmed, Y.W. Wu, A survey on reliability in distributed systems, J. Comput. System Sci. 79 (8) (2013) 1243–1255, http://dx.doi.org/10.1016/j.jcss.2013.02.006.
[4] D. Ma'ruf, S. Sulistyo, L. Nugroho, Applying integrating testing of microservices in airline ticketing system, Ijitee (Int. J. Inf. Technol. Electr. Eng.) 4 (2020) 39, http://dx.doi.org/10.22146/ijitee.55491.
[5] F. Dai, H. Chen, Z. Qiang, Z. Liang, B. Huang, L. Wang, Automatic analysis of complex interactions in microservice systems, Complexity 2020 (2020) 1–12, http://dx.doi.org/10.1155/2020/2128793.
[6] J. Lewis, M. Fowler, Microservices: a definition of this new architectural term (2014), 2014, URL: http://martinfowler.com/articles/microservices.html (cit. p. 26).
[7] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, C. [31] J. Zhang, R. Ferydouni, A. Montana, D. Bittman, P. Alvaro, 3MileBeach: A Rosenthal, Chaos engineering, IEEE Softw. 33 (3) (2016) 35–41, http://dx.doi.
tracer with teeth, in: Proceedings of the ACM Symposium on Cloud Computing, org/10.1109/MS.2016.60. SoCC ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. [8] R.T. Munodawafa, S.K. Johl, A systematic review of eco-innovation and perfor- 458–472, http://dx.doi.org/10.1145/3472883.3486986. mance from the resource-based and stakeholder perspectives, Sustainability 11 [32] C.S. Meiklejohn, A. Estrada, Y. Song, H. Miller, R. Padhye, Service-level fault (2019) 6067, http://dx.doi.org/10.3390/su11216067. injection testing, in: Proceedings of the ACM Symposium on Cloud Computing, [9] J.M. Macharia, Systematic literature review of interventions supported by inte- SoCC ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. gration of ict in education to improve learners’ academic performance in stem 388–402, http://dx.doi.org/10.1145/3472883.3487005. subjects in kenya, J. Educ. Pract. 6 (2022) 52–75, http://dx.doi.org/10.47941/ [33] A. Blohowiak, A. Basiri, L. Hochstein, C. Rosenthal, A platform for automating jep.979. chaos experiments, in: 2016 IEEE International Symposium on Software Reliabil- [10] P. Gerli, J.N. Marco, J. Whalley, What makes a smart village smart? a review ity Engineering Workshops, ISSREW, 2016, pp. 5–8, http://dx.doi.org/10.1109/ of the literature, Transform. Gov.: People Process. Policy 16 (2022) 292–304, ISSREW.2016.52. http://dx.doi.org/10.1108/tg-07-2021-0126. [34] A. Nagarajan, A. Vaddadi, Automated fault-tolerance testing, in: 2016 IEEE [11] R. Coppola, L. Ardito, Quality assessment methods for textual conversational Ninth International Conference on Software Testing, Verification and Validation interfaces: a multivocal literature review, Information 12 (2021) 437, http: Workshops, ICSTW, 2016, pp. 275–276, http://dx.doi.org/10.1109/ICSTW.2016. //dx.doi.org/10.3390/info12110437. 34. [12] B. Kitchenham, O. Pearl Brereton, D. Budgen, M. Turner, J. Bailey, S. Linkman, [35] V. Heorhiadi, S. Rajagopalan, H. 
Jamjoom, M.K. Reiter, V. Sekar, Gremlin: Systematic literature reviews in software engineering – A systematic literature Systematic resilience testing of microservices, in: 2016 IEEE 36th International review, Inf. Softw. Technol. 51 (1) (2009) 7–15, http://dx.doi.org/10.1016/j. Conference on Distributed Computing Systems, ICDCS, 2016, pp. 57–66, http: infsof.2008.09.009, Special Section - Most Cited Articles in 2002 and Regular //dx.doi.org/10.1109/ICDCS.2016.11. Research Papers. [36] R.K. Lenka, S. Padhi, K.M. Nayak, Fault injection techniques - a brief review, [13] N. Dragoni, S. Giallorenzo, A.L. Lafuente, M. Mazzara, F. Montesi, R. Mustafin, L. in: 2018 International Conference on Advances in Computing, Communication Safina, Microservices: yesterday, today, and tomorrow, 2017, arXiv:1606.04036. Control and Networking, ICACCCN, 2018, pp. 832–837, http://dx.doi.org/10. [14] P.D. Francesco, I. Malavolta, P. Lago, Research on architecting microservices: 1109/ICACCCN.2018.8748585. Trends, focus, and potential for industrial adoption, in: 2017 IEEE International [37] A. van Hoorn, A. Aleti, T.F. Düllmann, T. Pitakrat, ORCAS: Efficient resilience Conference on Software Architecture, ICSA, 2017, pp. 21–30, http://dx.doi.org/ benchmarking of microservice architectures, in: 2018 IEEE International Sym- 10.1109/ICSA.2017.24. posium on Software Reliability Engineering Workshops, ISSREW, 2018, pp. [15] M. Fowler, Patterns of Enterprise Application Architecture, Addison-Wesley 146–147, http://dx.doi.org/10.1109/ISSREW.2018.00-10. Longman Publishing Co., Inc., USA, 2002. [38] H. Tucker, L. Hochstein, N. Jones, A. Basiri, C. Rosenthal, The business case for chaos engineering, IEEE Cloud Comput. 5 (3) (2018) 45–54, http://dx.doi.org/ [16] J. Lewis, M. Fowler, Microservices, 2014, https://martinfowler.com/articles/ 10.1109/MCC.2018.032591616. microservices.html. [39] N. Brousse, O. Mykhailov, Use of self-healing techniques to improve the [17] S. 
Newman, Building Microservices: Designing Fine-Grained Systems, " O’Reilly reliability of a dynamic and geo-distributed ad delivery service, in: 2018 Media, Inc.", 2021. IEEE International Symposium on Software Reliability Engineering Workshops, [18] C.K. Rudrabhatla, Comparison of zero downtime based deployment techniques in ISSREW, 2018, pp. 1–5, http://dx.doi.org/10.1109/ISSREW.2018.00-40. public cloud infrastructure, in: 2020 Fourth International Conference on I-SMAC [40] K.A. Torkura, M.I. Sukmana, F. Cheng, C. Meinel, Security chaos engineering for (IoT in Social, Mobile, Analytics and Cloud), I-SMAC, 2020, pp. 1082–1086, cloud services: Work in progress, in: 2019 IEEE 18th International Symposium http://dx.doi.org/10.1109/I-SMAC49090.2020.9243605. on Network Computing and Applications, NCA, 2019, pp. 1–3, http://dx.doi.org/ [19] S.R. Addula, P. Perugu.P, M.K. Kumar, D. Kumar, B. Ananthan, R. R, S. P, S. 10.1109/NCA.2019.8935046. G, Dynamic load balancing in cloud computing using hybrid Kookaburra-Pelican [41] H. Chen, P. Chen, G. Yu, A framework of virtual war room and matrix sketch- optimization algorithms, in: 2024 International Conference on Augmented Re- based streaming anomaly detection for microservice systems, IEEE Access 8 ality, Intelligent Systems, and Industrial Automation, ARIIA, 2024, pp. 1–7, (2020) 43413–43426, http://dx.doi.org/10.1109/ACCESS.2020.2977464. http://dx.doi.org/10.1109/ARIIA63345.2024.11051893. [42] K.A. Torkura, M.I.H. Sukmana, F. Cheng, C. Meinel, CloudStrike: Chaos engi- [20] M. Waseem, P. Liang, M. Shahin, A systematic mapping study on microservices neering for security and resiliency in cloud infrastructure, IEEE Access 8 (2020) architecture in devops, J. Syst. Softw. 170 (2020) 110798, http://dx.doi.org/10. 123044–123060, http://dx.doi.org/10.1109/ACCESS.2020.3007338. 1016/j.jss.2020.110798. [43] D. Kesim, A. van Hoorn, S. Frank, M. H00E4ussler, Identifying and prioritizing [21] C. Rosenthal, N. 
Jones, Chaos Engineering: System Resiliency in Practice, O’Reilly chaos experiments by using established risk analysis techniques, in: 2020 IEEE Media, 2020. 31st International Symposium on Software Reliability Engineering, ISSRE, 2020, [22] L. Zhang, B. Morin, B. Baudry, M. Monperrus, Maximizing error injection realism pp. 229–240, http://dx.doi.org/10.1109/ISSRE5003.2020.00030. for chaos engineering with system calls, IEEE Trans. Dependable Secur. Comput. [44] Z. Long, G. Wu, X. Chen, C. Cui, W. Chen, J. Wei, Fitness-guided resilience 19 (4) (2022) 2695–2708, http://dx.doi.org/10.1109/TDSC.2021.3069715. testing of microservice-based applications, 2020, pp. 151–158, http://dx.doi.org/ [23] Š. Davidovič, B. Beyer, Canary analysis service, Commun. ACM 61 (5) (2018) 10.1109/ICWS49710.2020.00027. 54–62, http://dx.doi.org/10.1145/3190566. [45] S. De, A study on chaos engineering for improving cloud software quality [24] L. Zhang, B. Morin, P. Haller, B. Baudry, M. Monperrus, A chaos engineering and reliability, in: 2021 International Conference on Disruptive Technologies system for live analysis and falsification of exception-handling in the JVM, IEEE for Multi-Disciplinary Research and Applications, CENTCON, Vol. 1, 2021, pp. Trans. Softw. Eng. 47 (11) (2021) 2534–2548, http://dx.doi.org/10.1109/TSE. 289–294, http://dx.doi.org/10.1109/CENTCON52345.2021.9688292. 2019.2954871. [46] C. Konstantinou, G. Stergiopoulos, M. Parvania, P. Esteves-Verissimo, Chaos [25] H. Jernberg, P. Runeson, E. Engström, Getting started with chaos engineering engineering for enhanced resilience of cyber-physical systems, in: 2021 Re- - design of an implementation framework in practice, in: Proceedings of the silience Week, RWS, 2021, pp. 1–10, http://dx.doi.org/10.1109/RWS52686. 14th ACM / IEEE International Symposium on Empirical Software Engineering 2021.9611797. and Measurement, ESEM, ESEM ’20, Association for Computing Machinery, New [47] F. Poltronieri, M. Tortonesi, C. 
Stefanelli, ChaosTwin: A chaos engineering and York, NY, USA, 2020, http://dx.doi.org/10.1145/3382494.3421464. digital twin approach for the design of resilient IT services, in: 2021 17th [26] A. Alkhateeb, C. Catal, G. Kar, A. Mishra, Hybrid blockchain platforms for the International Conference on Network and Service Management, CNSM, 2021, internet of things (IoT): A systematic literature review, Sensors 22 (4) (2022) pp. 234–238, http://dx.doi.org/10.23919/CNSM52442.2021.9615519. http://dx.doi.org/10.3390/s22041304. [48] N. Luo, Y. Xiong, Platform software reliability for cloud service continuity [27] R. van Dinter, B. Tekinerdogan, C. Catal, Predictive maintenance using digital - challenges and opportunities, in: 2021 IEEE 21st International Conference twins: A systematic literature review, Inf. Softw. Technol. 151 (2022) 107008, on Software Quality, Reliability and Security, QRS, 2021, pp. 388–393, http: http://dx.doi.org/10.1016/j.infsof.2022.107008. //dx.doi.org/10.1109/QRS54544.2021.00050. [28] M. Jorayeva, A. Akbulut, C. Catal, A. Mishra, Machine learning-based software [49] H. Chen, K. Wei, A. Li, T. Wang, W. Zhang, Trace-based intelligent fault diagnosis defect prediction for mobile applications: A systematic literature review, Sensors for microservices with deep learning, in: 2021 IEEE 45th Annual Computers, 22 (7) (2022) http://dx.doi.org/10.3390/s22072551. Software, and Applications Conference, COMPSAC, 2021, pp. 884–893, http: [29] A. Basiri, L. Hochstein, N. Jones, H. Tucker, Automating chaos experiments //dx.doi.org/10.1109/COMPSAC51774.2021.00121. in production, in: 2019 IEEE/ACM 41st International Conference on Software [50] O. Sharma, M. Verma, S. Bhadauria, P. Jayachandran, A guided approach Engineering: Software Engineering in Practice, ICSE-SEIP, 2019, pp. 31–40, towards complex chaos selection, prioritisation and injection, in: 2022 IEEE http://dx.doi.org/10.1109/ICSE-SEIP.2019.00012. 
15th International Conference on Cloud Computing, CLOUD, 2022, pp. 91–93, [30] L.B. Canonico, V. Vakeel, J. Dominic, P. Rodeghero, N. McNeese, Human-AI http://dx.doi.org/10.1109/CLOUD55607.2022.00025. partnerships for chaos engineering, in: Proceedings of the IEEE/ACM 42nd [51] N. Luo, L. Zhang, Chaos driven development for software robustness enhance- International Conference on Software Engineering Workshops, ICSEW ’20, As- ment, in: 2022 9th International Conference on Dependable Systems and their sociation for Computing Machinery, New York, NY, USA, 2020, pp. 499–503, Applications, DSA, 2022, pp. 1029–1034, http://dx.doi.org/10.1109/DSA56465. http://dx.doi.org/10.1145/3387940.3391493. 2022.00154. 13 E. Esen et al. Computer Standards & Interfaces 97 (2026) 104116 [52] M.A. Naqvi, S. Malik, M. Astekin, L. Moonen, On evaluating self-adaptive [58] D. Savchenko, G. Radchenko, O. Taipale, Microservices validation: Mjolnirr and self-healing systems using chaos engineering, in: 2022 IEEE International platform case study, in: 2015 38th International Convention on Information and Conference on Autonomic Computing and Self-Organizing Systems, ACSOS, 2022, Communication Technology, Electronics and Microelectronics, MIPRO, 2015, pp. pp. 1–10, http://dx.doi.org/10.1109/ACSOS55765.2022.00018. 235–240, http://dx.doi.org/10.1109/MIPRO.2015.7160271. [53] J. Simonsson, L. Zhang, B. Morin, B. Baudry, M. Monperrus, Observability and [59] G.S. Nadella, S.R. Addula, A.R. Yadulla, G.S. Sajja, M. Meesala, M.H. Maturi, chaos engineering on system calls for containerized applications in Docker, K. Meduri, H. Gonaygunta, Generative AI-enhanced cybersecurity framework for Future Gener. Comput. Syst. 122 (2021) 117–129, http://dx.doi.org/10.1016/ enterprise data privacy management, Computers 14 (2) (2025) http://dx.doi.org/ j.future.2021.04.001. 10.3390/computers14020055. [54] A.A.-S. Ahmad, P. Andras, Scalability resilience framework using application- [60] D. Kikuta, H. Ikeuchi, K. 
Tajiri, Y. Nakano, ChaosEater: Fully automating chaos level fault injection for cloud-based software services, J. Cloud Comput. 11 (1) engineering with large language models, 2025, arXiv preprint arXiv:2501.11107. (2022) 1, http://dx.doi.org/10.1186/s13677-021-00277-z. URL https://arxiv.org/abs/2501.11107. [55] C. Camacho, P.C. Cañizares, L. Llana, A. Núñez, Chaos as a software product [61] J. Owotogbe, Assessing and enhancing the robustness of LLM-based multi- line—A platform for improving open hybrid-cloud systems resiliency, Softw.: agent systems through chaos engineering, in: 2025 IEEE/ACM 4th International Pract. Exp. 52 (7) (2022) 1581–1614, http://dx.doi.org/10.1002/spe.3076. Conference on AI Engineering – Software Engineering for AI, CAIN, 2025, pp. [56] P. Raj, S. Vanga, A. Chaudhary, The observability, chaos engineering, and 250–252, http://dx.doi.org/10.1109/CAIN66642.2025.00039. remediation for cloud-native reliability, in: Cloud-Native Computing: How To [62] A. Al-Said Ahmad, L.F. Al-Qora’n, A. Zayed, Exploring the impact of chaos Design, Develop, and Secure Microservices and Event-Driven Applications, 2023, engineering with various user loads on cloud native applications: An exploratory pp. 71–93, http://dx.doi.org/10.1002/9781119814795.ch4. empirical study, Computing 106 (2024) 2389–2425, http://dx.doi.org/10.1007/ [57] M.A. Chang, B. Tschaen, T. Benson, L. Vanbever, Chaos monkey: Increasing sdn s00607-024-01292-z. reliability through systematic network destruction, in: Proceedings of the 2015 [63] K.A. Torkura, M.I. Sukmana, F. Cheng, C. Meinel, Security chaos engineering for ACM Conference on Special Interest Group on Data Communication, 2015, pp. cloud services: Work in progress, in: 2019 IEEE 18th International Symposium 371–372. on Network Computing and Applications, NCA, 2019, pp. 1–3, http://dx.doi.org/ 10.1109/NCA.2019.8935046. 14