
                                                                             Computer Standards & Interfaces 97 (2026) 104116




Chaos experiments in microservice architectures: A systematic literature review

Emrah Esen a, Akhan Akbulut a, Cagatay Catal b,∗

a Department of Computer Engineering, Istanbul Kültür University, 34536, Istanbul, Turkey
b Department of Computer Science and Engineering, Qatar University, Doha 2713, Qatar


ARTICLE INFO

Keywords: Chaos engineering; Microservice; Systematic literature review

ABSTRACT

This study analyzes the implementation of Chaos Engineering in modern microservice systems. It identifies key methods, tools, and practices used to effectively enhance the resilience of software systems in production environments. In this context, our Systematic Literature Review (SLR) of 31 research articles has uncovered 38 tools crucial for carrying out fault injection methods, including several tools such as Chaos Toolkit, Gremlin, and Chaos Machine. The study also explores the platforms used for chaos experiments and how centralized management of chaos engineering can facilitate the coordination of these experiments across complex systems. The evaluated literature reveals the efficacy of chaos engineering in improving fault tolerance and robustness of software systems, particularly those based on microservice architectures. The paper underlines the importance of careful planning and execution in implementing chaos engineering and encourages further research in this field to uncover more effective practices for the resilience improvement of microservice systems.


Contents

    1.  Introduction
    2.  Background
        2.1.  Microservice architecture
        2.2.  Microservice principles
        2.3.  Challenges/Troubleshooting/Failures in microservice architecture
        2.4.  Chaos engineering
    3.  Review protocol
        3.1.  Research questions
        3.2.  Search strategy
        3.3.  Study selection criteria
        3.4.  Study quality assessment
        3.5.  Data extraction
        3.6.  Data synthesis
    4.  Results
        4.1.  Main statistics
        4.2.  How is Chaos engineering effectively applied in production environments to enhance the resilience of software systems?
        4.3.  Which platforms have been used for chaos experiments?
        4.4.  How can Chaos engineering be effectively applied to microservice architecture to ensure successful implementation and enhance system resilience?
        4.5.  To what extent can the centralized provision of Chaos engineering effectively facilitate the management of chaos experiments across complex systems?
        4.6.  What are the challenges reported in the relevant papers?
    5.  Discussion
        5.1.  General discussion
        5.2.  Threats to validity
    6.  Conclusion
        CRediT authorship contribution statement
        Declaration of competing interest
        Data availability
        References

∗ Corresponding author.
E-mail address: ccatal@qu.edu.qa (C. Catal).

https://doi.org/10.1016/j.csi.2025.104116
Received 22 September 2024; Received in revised form 28 November 2025; Accepted 12 December 2025
Available online 15 December 2025
0920-5489/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.


1. Introduction

    In recent years, the adoption of microservice architecture has led to the transformation of application infrastructures into distributed systems. These systems are designed to enhance maintainability by decoupling services. The primary benefit of this architecture is the ease of maintenance of individual services within the microservice ecosystem due to their smaller and more modular nature [1]. However, despite these advantages, the distributed nature of microservices introduces significant challenges. Specifically, the complex management of services and their tight integration can considerably complicate software debugging. Debugging becomes complex in this architecture due to its distributed nature, the necessity to pinpoint the exact service causing the problem, and the dynamic characteristics of microservices. Consequently, debugging in microservice architecture demands a greater level of effort and specialized expertise compared to conventional monolithic architectures [2]. Moreover, it is difficult to predict what will happen if there is an unexpected error or if a service on the network goes out of service. Service outages can be caused by anything from a malicious cyberattack to a hardware failure to simple human error, and they can have devastating financial consequences. Although such unexpected situations are rare, they can interfere with the operation of distributed systems and severely affect the live environment in which the application is located [3]. It is necessary to detect weak points in the system before an error occurs and spreads to the entire system.

    Microservice architecture applications undergo testing procedures to ensure their quality and dependability. These include unit, service, end-to-end, behavior-driven, integration, and regression testing [4]. The comprehensive approach to microservices testing also encompasses live testing strategies for complex systems [5]. This thorough process emphasizes different aspects such as functionality, interoperability, and performance of individual services within the architecture. It aims to detect and resolve issues early to ensure stable and high-quality microservice applications [1,6]. However, considering that microservices consist of multiple services, the user experience should not be affected in cases such as network failures and sudden increases in service load. For example, if the microservice that adds the product to favorites on a shopping site fails or responds late, the user should be able to continue the shopping experience. Therefore, testing operations in production-like environments become inevitable. No matter how distributed or complex the system is, there is a need for a method that handles unforeseeable situations and builds trust in the system's ability to withstand unexpected failures. Chaos engineering is defined as the discipline of conducting experiments in a live environment to test or verify the reliability of software [7].

    The primary objective of this research is to conduct a thorough investigation into how chaos experiments are performed in the widely used microservices-based systems of today. Microservice architectures have come to the forefront in modern software development processes due to their advantages such as flexibility, scalability, and rapid development. However, these architectures also bring unique challenges due to complex service dependencies and dynamic operational environments. This study aims to comprehensively address the methodologies, application scenarios, and impacts of chaos experiments conducted to test the resilience of microservice systems and identify potential weak points. The research intends to present the current state of chaos engineering practices by analyzing them, highlighting best practices, challenges faced, and solutions. In addition, it will assess the effectiveness of chaos experiments in enhancing the reliability and robustness of microservice systems by using data obtained from real-world scenarios to develop strategic recommendations. This study is a critical step in understanding the applicability and impact of chaos engineering within the complexity of microservice architectures and aims to make significant contributions to the body of knowledge in this field. Recent research has applied chaos engineering to this architectural style; however, a systematic overview of the state-of-the-art on the use of chaos engineering in the microservice architecture is lacking. Therefore, a Systematic Literature Review (SLR) has been performed to provide an overview of how chaos engineering was applied.

    This article primarily targets peer-reviewed research papers to maintain methodological consistency and ensure scholarly rigor. We specifically chose a systematic literature review (SLR) methodology because peer-reviewed academic studies are subject to rigorous validation processes, which enhance the reliability and validity of our findings [8,9]. Although excluding industry-specific, grey literature may restrict certain practical perspectives, this choice was deliberately made to avoid potential biases and uphold the scientific integrity of our review [10,11]. However, future studies could broaden the scope to incorporate industrial case studies and practical experiences, which would enrich our understanding of chaos engineering’s applicability beyond the academic context.

    The main contributions of this study are listed as follows:

    1. To the best of our knowledge, this is the first study to employ a systematic literature review approach in the field of chaos engineering on microservice architecture applications [12]. The study provides an extensive systematic literature review of how chaos engineering can be applied to enhance the resilience of microservice architectures. It collates findings from various sources to provide insights into the current state of research and practice in this field.
    2. The study categorizes and summarizes the range of chaos engineering tools and methods used in industry and academia, highlighting their functionalities in process/service termination, network simulation, load stressing, security testing, and fault injection within application code.
    3. This research paper discusses contemporary techniques and approaches for implementing chaos engineering in microservice architectures. It also emphasizes the ongoing work in this field, offering a significant reference for future research endeavors. The paper systematically reviews existing literature to showcase how chaos engineering can enhance system resilience, laying a comprehensive groundwork for further exploration into chaos experimentation strategies and innovating new fault injection methods or tools within microservice architectures.

    The rest of the paper is structured as follows: Section 2 explains the background and related work. Section 3 presents the methodology of the research. Section 4 presents the results and Section 5 comprehensively discusses the presented answers to research questions and validity threats. Lastly, the conclusion is presented in Section 6.

2. Background

    The microservice approach breaks down a large application into a network of small, self-contained units, each running its own process and often communicating through web APIs. Unlike large, single-piece



monolithic systems, these small services are robust, easy to scale up or down, and can be updated individually using various programming languages and technologies. This structure allows development teams to be smaller and more agile, leading to faster updates and improvements. Yet, managing many interconnected services can become complicated, especially when something goes wrong. To enhance system reliability and resilience, a method known as chaos engineering is employed. This involves deliberately introducing problems into the live system to test its ability to cope and recover. This technique helps to uncover and rectify flaws, thereby making the system stronger overall. Regular and automated tests mimic real-life problems to ensure that the system can handle unexpected challenges and remain stable and efficient.

2.1. Microservice architecture

    Microservice architectures have gained significant popularity in the software industry due to their ability to address the challenges and complexities of developing modern applications [6,13].

2.2. Microservice principles

    Microservice architectures are based on the concept of decentralization, where each service is independently developed, deployed, and managed. This emphasizes autonomy and minimal inter-service dependencies. Each microservice is designed to focus on a single function or closely related set of functions and supports technology heterogeneity by allowing different services to use different technology stacks that best suit their needs. Resilience is a core aspect, with services built to withstand failures without affecting the entire system, while scalability enables services to be scaled independently as per demand. Communication occurs through lightweight mechanisms like HTTP/REST APIs, supporting continuous delivery and deployment practices. Due to the distributed nature of microservice architecture, comprehensive monitoring and logging for observability become crucial. Additionally, there is often an alignment between the microservice architecture and organizational structure involving small cross-functional teams responsible for individual services [14].

    It is helpful to compare the microservice architecture to the monolithic architecture. The main difference between them is the dimensions of the developed applications. The microservice architecture can be thought of as developing an application as a suite of smaller services, rather than as a single, monolithic structure. Enterprise applications usually consist of three main parts: a client-side user interface (i.e., HTML pages and JavaScript running in a browser on the user’s machine), a database (i.e., many tables in a common, usually relational, database management system), and a server-side application. In the server-side application, HTTP requests are processed, business logic is executed, data is retrieved from and updated in the database, and HTML views are prepared and sent to the browser. This structure is a good example of monoliths. Any changes to the system involve creating and deploying a new version of the server-side application [15]. The cycles of change are interdependent. A change to a small part of the application requires rebuilding and deploying the entire monolith [6].

    Microservice architecture, on the other hand, has some common features that distinguish it from monolithic architecture. These are componentization via services, organization around business capabilities, smart endpoints and simple communication, decentralized governance, decentralized data management, infrastructure automation, and design for failure [16]. Today, although modern internet applications seem like a single application, they use microservice architectures behind them. Microservice architecture basically refers to small, autonomous, interoperable services. It has emerged due to increasing needs such as technology diversity, flexibility, scaling, ease of deployment, organization and management, and provides various advantages in these matters. Its advantages are described as follows [17]:

Technology heterogeneity. Microservices are treated as small services, each running independently and communicating with each other using open protocols. While monolithic applications are developed with a single programming language and database system, services included in a microservice ecosystem may use a different programming language and database. This allows the advantages of each programming language and database to be used.

Resilience. When an error occurs in the system in monolithic applications, the whole system is affected. In the microservice architecture, only the part under the responsibility of the relevant service is affected; the parts belonging to other services remain unaffected, and the user experience continues.

Scalability. While the scaling process on monolithic applications covers the entire application, the services that are under heavy load can be scaled in applications developed with microservice architecture. This avoids unnecessary resource costs for parts that do not need to be scaled and improves the user experience.

Deployment. Microservice architecture facilitates the autonomous deployment of individual services, enabling updates or changes without impacting others. Various deployment strategies, including blue–green, canary, and rolling deployment, minimize disruptions during the deployment process [18]. As a result, microservice architecture provides increased flexibility and resilience in deployment, distinguishing it from monolithic applications.

Organizational alignment. In software development processes, some challenges may be encountered due to large teams and large codebases. These challenges become more manageable when smaller teams are established. At the same time, this is an indication that microservice applications allow us to form smaller and more cohesive teams. Each team is responsible for its own microservice and can take action by making improvements if necessary.

2.3. Challenges/Troubleshooting/Failures in microservice architecture

    Microservice architectures pose numerous challenges. As the number of services increases, the complexity of service interactions also grows. Reliance on network communication leads to latency and network failure issues, while ensuring data consistency across multiple databases requires careful design and implementation of distributed transactions or eventual consistency models. Microservices bring typical distributed system challenges such as handling partial failures, dealing with latency and asynchrony, complex service discovery, load balancing in dynamic scaling environments, and managing configurations across multiple services and environments. Security concerns are heightened due to the increased surface area of inter-service communication. Testing becomes more complex, involving individual service testing along with testing their interactions; deployment is challenging, especially when there are dependencies between services; effective observability and monitoring become crucial for timely issue resolution; versioning management is critical for maintaining system stability; lastly, assembling skilled teams proficient in DevOps, cloud computing, and programming languages presents a significant challenge. Microservice architecture thus faces a range of operational challenges, troubleshooting difficulties, and failure modes. While adopting a distributed architecture enhances modularity, it inherently introduces operational complexities that differ significantly from monolithic structures. Recent research has also explored the use of hybrid bio-inspired algorithms to dynamically optimize load balancing. For instance, the Hybrid Kookaburra–Pelican Optimization Algorithm has been shown to improve load distribution and system scalability in cloud and microservice-based environments [19].

    In conclusion, while microservices offer numerous advantages such as improved scalability, flexibility, and agility, they also introduce significant challenges in terms of system complexity, operational demands, and the need for skilled personnel and sophisticated tooling [20].



2.4. Chaos engineering                                                             3.1. Research questions


    ‘‘Chaos engineering is the discipline of experimenting on a dis-                  Research Questions (RQs) and their corresponding motivations are
tributed system in order to build confidence in the system’s capability            presented as follows:
to withstand turbulent conditions in production-like environment’’ [7,
                                                                                       • RQ1: How is Chaos engineering effectively applied in production
21]. It is the careful and planned execution of experiments to show how
                                                                                         environments to enhance the resilience of software systems?
the distributed system will respond to a failure. It is necessary for large-
                                                                                         Motivation: Understanding the practical implementation of Chaos
scale software systems because it is practically impossible to simulate
                                                                                         engineering in production environments is crucial for ensuring
real events in test environments. Experiments based on real events are                   the resilience of software systems under real-world operating
created together with chaos engineering [22]. By analyzing the test                      conditions.
results, improvements are made where necessary, and in this way, it                    • RQ2: Which platforms have been used for Chaos experiments?
is aimed to increase the reliability of the software in the production                   Motivation: Identifying the platforms provides insights into the
environment.                                                                             technological landscape and tools available for conducting Chaos
    Thanks to an experimental and systems-based approach, confidence                     engineering practices.
is established for the survivability of these systems during collapses.                • RQ3: How is Chaos engineering effectively applied to microser-
Canary analysis collects data on how distributed systems react to                        vice architectures to ensure its successful implementation in en-
failure scenarios by observing their behavior in abnormal situations and                 hancing system resilience?
performing controlled experiments [23]. This method involves applying                    Motivation: Microservice architectures introduce new challenges
new updates or changes to a specific aspect of the system, enabling                      in system design. Exploring the application of Chaos engineering
early detection of potential problems before they affect a larger scale.                 in this context can help improve the resilience and fault tolerance
    Chaos experiments consist of the following principles [24,25]:                       of microservice systems.
                                                                                       • RQ4: To what extent can the centralized provision of Chaos
     • Hypothesize steady state: The first step is to hypothesize the                    engineering effectively facilitate the management of Chaos exper-
       steady state of the system under normal conditions.                               iments across complex systems?
     • Vary real-world events: The next step is to vary real-world events                Motivation: Understanding the feasibility of providing Chaos en-
       that can cause turbulence in the system.                                          gineering as a centralized service enables organizations to coor-
     • Run experiments in production: Experimenters should run the ex-                   dinate Chaos experiments across complex systems.
       periments in a production-like environment to simulate real-world               • RQ5: What are the challenges reported in the relevant papers?
       conditions.                                                                       Motivation: Identifying these challenges provides valuable in-
     • Automate experiments to run continuously: Experimenters should                    sights into overcoming obstacles and advancing the adoption of
       automate the experiments to run continuously, ensuring that the                   Chaos engineering practices.
       system can withstand turbulence over time.
     • Minimize blast radius: The experiments should be designed to                3.2. Search strategy
       minimize blast radius, i.e., the impact of the experiment on the
       system should be limited to a small area.                                       The primary studies were carefully selected from the papers pub-
     • Analyze results: Experimenters should analyze the results of the            lished between 2010 and 2022 because the topic has only become relevant in
       experiments to determine the system’s behavior under turbulent              recent years. The databases searched were IEEE Xplore, ACM Digital Library,
       conditions.                                                                 ScienceDirect, Springer, Wiley, MDPI, and Scopus.
     • Repeat experiments: The experiments should be repeated to en-               The initial search involved reviewing the titles, abstracts, and keywords
       sure that the system can consistently withstand turbulence.                 of the studies identified in the databases. The search results obtained
       When the experiment is finished, information about the actual               from the databases were stored in the data extraction form using a
       effect will be provided to the system.                                      spreadsheet tool. Furthermore, this systematic review was conducted
                                                                                   collaboratively by three authors.
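
The principles listed in Section 2.4 can be read as the steps of a small control loop. The following Python sketch is only an illustration under stated assumptions: the function names, the `ExperimentResult` type, and the toy three-replica system are hypothetical and are not taken from any of the reviewed tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool    # was the system healthy before injection?
    hypothesis_held: bool  # did steady state survive the fault?
    rolled_back: bool      # was the fault removed afterwards?


def run_chaos_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one experiment: check steady state, inject, re-check, roll back."""
    # Hypothesize steady state: do not start if the system is already unhealthy.
    if not steady_state():
        return ExperimentResult(False, False, False)
    try:
        # Vary real-world events: inject one fault into a small target.
        inject_fault()
        # Analyze results: the hypothesis holds if steady state survives.
        held = steady_state()
    finally:
        # Minimize blast radius: always undo the fault, even on errors.
        rollback()
    return ExperimentResult(True, held, True)


def run_repeatedly(experiment: Callable[[], ExperimentResult], rounds: int) -> float:
    """Repeat an experiment and return the fraction of rounds that held."""
    results = [experiment() for _ in range(rounds)]
    return sum(r.hypothesis_held for r in results) / rounds


# Toy system (hypothetical): three replicas, healthy while at least two are up.
replicas = {"a": True, "b": True, "c": True}
result = run_chaos_experiment(
    steady_state=lambda: sum(replicas.values()) >= 2,
    inject_fault=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
```

In this toy run the hypothesis holds because losing one of three replicas still leaves two available; a stricter steady-state check that requires all replicas would be refuted by the same fault.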
                                                                                       The following search string was used to broaden the search scope:
3. Review protocol                                                                 ((chaos engineering) OR (chaos experiments)) OR (microservices)
                                                                                       The results of the searches made in the databases mentioned above
    Systematic review studies must be conducted using a well-defined               are shown in Fig. 2.
and specific protocol. To conduct a systematic review study, all studies
on a particular topic must be examined [12]. We followed the system-               3.3. Study selection criteria
atic review process shown in Fig. 1 and took steps to reduce the risk of
bias in this study. Multiple reviewers were involved in the SLR process,               After applying the exclusion and inclusion criteria, 55 articles were ob-
and in cases of conflict, a brief meeting was organized to facilitate              tained. The exclusion criteria in our study are as follows:
consensus. The first step was to define the research questions. Then,
the most appropriate databases were selected. Based on the selected                    • EC-1: Duplicate papers from multiple sources
databases, automated searches were conducted and several articles                      • EC-2: Papers without full-text availability
were identified. Selection criteria were then established to determine                 • EC-3: Papers not written in English
                                                                                       • EC-4: Survey papers
which studies should be included and excluded in this research. The
                                                                                       • EC-5: Papers not related to Chaos engineering
titles and abstracts of all studies were reviewed. In cases of doubt,
the full text of the publication was reviewed. Then, after the studies                The inclusion criteria in our study are shown as follows:
were analyzed in detail, selection criteria were applied. All selected
studies were assessed using a quality assessment process. Subsequently,                • IC-1: Primary papers discussing the use of Chaos experiments in
the results were synthesized, listed, and summarized in a clear and                      a microservice architecture
understandable manner.                                                                 • IC-2: Primary publications that focus on Chaos engineering
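
The exclusion and inclusion criteria of Section 3.3 can be expressed as a simple filter over candidate records. The sketch below is an assumption-laden illustration: the `Paper` fields are hypothetical, duplicates are detected by normalized title only, and IC-1 and IC-2 are combined with a logical OR, which the protocol does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    full_text_available: bool
    language: str
    is_survey: bool
    is_primary: bool
    about_chaos_engineering: bool
    chaos_in_microservices: bool = False


def select(papers: list[Paper]) -> list[Paper]:
    """Apply EC-1..EC-5, then IC-1/IC-2, to a list of candidate papers."""
    seen_titles: set[str] = set()
    selected: list[Paper] = []
    for paper in papers:
        key = paper.title.strip().lower()
        excluded = (
            key in seen_titles                    # EC-1: duplicate
            or not paper.full_text_available      # EC-2: no full text
            or paper.language != "English"        # EC-3: not in English
            or paper.is_survey                    # EC-4: survey paper
            or not paper.about_chaos_engineering  # EC-5: off topic
        )
        seen_titles.add(key)
        # Assumption: satisfying either IC-1 or IC-2 is sufficient.
        included = paper.is_primary and (
            paper.chaos_in_microservices or paper.about_chaos_engineering
        )
        if included and not excluded:
            selected.append(paper)
    return selected


# Hypothetical candidate records, for illustration only.
candidates = [
    Paper("Chaos in microservices", True, "English", False, True, True, True),
    Paper("chaos in microservices ", True, "English", False, True, True, True),
    Paper("A survey of chaos tools", True, "English", True, False, True),
    Paper("Database indexing", True, "English", False, True, False),
]
kept = [p.title for p in select(candidates)]
```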



                                                              Fig. 1. SLR review protocol.
                                                              Source: Adapted from [26–28].


                                                   Fig. 2. Distribution of selected papers per database.


3.4. Study quality assessment                                                        Fig. 2 presents the distribution of papers based on databases where
                                                                                 they were found at different selection stages. After the initial search,
    The assessment of each study’s quality is an indicator of the strength       4520 papers were retrieved, of which 55 remained after applying the
of evidence provided by the systematic review. The quality of studies            selection criteria. After quality assessment, 31 papers were selected
was assessed using various questions. Studies of poor quality were               as primary studies. The 55 papers were carefully read in full and the
not included in the present study. These criteria, based on quality              required data for answering the research questions were extracted.
instruments, were adopted from guidelines and other SLR studies [12]. The            All the collected articles are listed in Table 1.
following questions were used to assess the quality of the studies.
                                                                                 3.5. Data extraction
     • Q1. Are the aims of the study clearly stated?
     • Q2. Are the scope and experimental design of the study clearly
       defined?                                                                      The data required to answer the research questions were extracted
     • Q3. Is the research process documented adequately?                        from the selected articles. A data
     • Q4. Are all the study questions answered?                                 extraction form was created for this purpose. The data
     • Q5. Are the negative findings presented?                                  extraction form consists of several metadata fields such as the author’s first
     • Q6. Do the conclusions relate to the aim of the study, and are            and last name, the title of the study, the publication year, and the type
       they reliable?                                                            of study. In addition to this metadata, several columns were created
                                                                                 to store the required information related to the research questions. By
    In this study, considering all these criteria, a general quality as-         employing a data extraction form, we ensured that the relevant data
sessment was performed for each paper. The rating was 2 points for               required to answer each research question were systematically captured
the ‘‘yes’’ option, 0 points for the ‘‘no’’ option, and 1 point for the          from the selected publications. This approach facilitated the subsequent
‘‘somewhat’’ option. The decision threshold for classifying the paper            synthesis of the findings. The data extraction process involved meticu-
as poor quality was determined based on the mean value, which                    lous attention to detail and ensured the reliability and integrity of the
corresponds to a total of 5 points.                                              data used in our systematic literature review.
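
The scoring rule of Section 3.4 (2 points for "yes", 1 for "somewhat", 0 for "no", over questions Q1 to Q6) can be written out as a short calculation. In the sketch below the function names are hypothetical, and treating a total strictly below 5 as poor quality is an assumption, since the text does not say whether the threshold itself is inclusive.

```python
POINTS = {"yes": 2, "somewhat": 1, "no": 0}
QUESTIONS = ("Q1", "Q2", "Q3", "Q4", "Q5", "Q6")
POOR_QUALITY_THRESHOLD = 5  # total stated in Section 3.4


def quality_score(answers: dict[str, str]) -> int:
    """Sum the points for the six quality questions (maximum 12)."""
    missing = [q for q in QUESTIONS if q not in answers]
    if missing:
        raise ValueError(f"missing answers for: {missing}")
    return sum(POINTS[answers[q]] for q in QUESTIONS)


def is_poor_quality(answers: dict[str, str]) -> bool:
    # Assumption: a paper is poor when its total is strictly below the threshold.
    return quality_score(answers) < POOR_QUALITY_THRESHOLD


# Hypothetical assessment of one paper: 2 + 2 + 1 + 1 + 0 + 1 = 7 points.
example = {"Q1": "yes", "Q2": "yes", "Q3": "somewhat",
           "Q4": "somewhat", "Q5": "no", "Q6": "somewhat"}
example_score = quality_score(example)
```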



Table 1
Selected primary studies.
 ID          Reference      Title                                                                                                               Year        Database
 S1          [29]           Automating Chaos Experiments in Production                                                                          2019        ACM
 S2          [25]           Getting Started with Chaos engineering—design of an implementation framework in practice                            2020        ACM
 S3          [30]           Human-AI Partnerships for Chaos engineering                                                                         2020        ACM
 S4          [31]           3MileBeach: A Tracer with Teeth                                                                                     2021        ACM
 S5          [32]           Service-Level Fault Injection Testing                                                                               2021        ACM
 S6          [33]           A Platform for Automating Chaos Experiments                                                                         2016        IEEE Xplore
 S7          [34]           Automated Fault-Tolerance Testing                                                                                   2016        IEEE Xplore
 S8          [35]           Gremlin: Systematic Resilience Testing of Microservices                                                             2016        IEEE Xplore
 S9          [36]           Fault Injection Techniques - A Brief Review                                                                         2018        IEEE Xplore
 S10         [37]           ORCAS: Efficient Resilience Benchmarking of Microservice Architectures                                              2018        IEEE Xplore
 S11         [38]           The Business Case for Chaos engineering                                                                             2018        IEEE Xplore
 S12         [39]           Use of Self-Healing Techniques to Improve the Reliability of a Dynamic and Geo-Distributed Ad Delivery Service      2018        IEEE Xplore
 S13         [40]           Security Chaos engineering for Cloud Services: Work In Progress                                                     2019        IEEE Xplore
 S14         [41]           A Framework of Virtual War Room and Matrix Sketch-Based Streaming Anomaly Detection for Microservice Systems        2020        IEEE Xplore
 S15         [42]           CloudStrike: Chaos engineering for Security and Resiliency in Cloud Infrastructure                                  2020        IEEE Xplore
 S16         [43]           Identifying and Prioritizing Chaos Experiments by Using Established Risk Analysis Techniques                        2020        IEEE Xplore
 S17         [44]           Fitness-guided Resilience Testing of Microservice-based Applications                                                2020        IEEE Xplore
 S18         [24]           A Chaos engineering System for Live Analysis and Falsification of Exception-Handling in the JVM                     2021        IEEE Xplore
 S19         [45]           A Study on Chaos engineering for Improving Cloud Software Quality and Reliability                                   2021        IEEE Xplore
 S20         [46]           Chaos engineering for Enhanced Resilience of Cyber–Physical Systems                                                 2021        IEEE Xplore
 S21         [47]           ChaosTwin: A Chaos engineering and Digital Twin Approach for the Design of Resilient IT Services                    2021        IEEE Xplore
 S22         [48]           Platform Software Reliability for Cloud Service Continuity—Challenges and Opportunities                             2021        IEEE Xplore
 S23         [49]           Trace-based Intelligent Fault Diagnosis for Microservices with Deep Learning                                        2021        IEEE Xplore
 S24         [50]           A Guided Approach Towards Complex Chaos Selection, Prioritization and Injection                                     2022        IEEE Xplore
 S25         [51]           Chaos Driven Development for Software Robustness Enhancement                                                        2022        IEEE Xplore
 S26         [22]           Maximizing Error Injection Realism for Chaos engineering With System Calls                                          2022        IEEE Xplore
 S27         [52]           On Evaluating Self-Adaptive and Self-Healing Systems using Chaos engineering                                        2022        IEEE Xplore
 S28         [53]           Observability and chaos engineering on system calls for containerized applications in Docker                        2021        ScienceDirect
 S29         [54]           Scalability resilience framework using application-level fault injection for cloud-based software services          2022        Springer
 S30         [55]           Chaos as a Software Product Line—A platform for improving open hybrid-cloud systems resiliency                      2022        Wiley
 S31         [56]           The Observability, Chaos engineering, and Remediation for Cloud-Native Reliability                                  2022        Wiley


3.6. Data synthesis                                                                         Chaos engineering involves several categories of functionality that
                                                                                        serve distinct purposes in resilience testing. The first category involves
    To answer the research questions, the data obtained are collected                   intentionally terminating processes or services to evaluate system be-
and summarized in an appropriate manner, which is called data syn-                      havior and recovery from failures [7]. Another category is network
thesis. To perform the data synthesis, a qualitative analysis process                   simulation, which allows engineers to replicate adverse network condi-
was conducted on the data obtained. For instance, synonyms used                         tions to assess system performance and reliability [25]. In the Stressing
for different categories were identified and merged in the respective                   Machine category, engineers subject the system to extreme loads to
fields. This comprehensive data synthesis approach allowed us to derive                 identify limits and potential bottlenecks [7]. In security testing, en-
insights and draw conclusions from the collected information.                           gineers simulate breaches or attacks to assess the system’s response
                                                                                        and enhance defenses [7]. Lastly, engineers use fault application code
4. Results                                                                              to inject targeted faults or errors into the codebase, assessing system
                                                                                        resilience and error-handling capabilities [24]. These categories help
    The result section of the paper provides various insights into how                  organizations proactively identify weaknesses, strengthen system ro-
chaos engineering is applied in production environments, particularly                   bustness, and enhance reliability in complex technology landscapes [7].
its use in improving the resilience and reliability of microservice ar-                 Functionality categories of tools are presented in Fig. 6.
chitecture applications. The section discusses how fault detection is                       The tools utilized in industry settings are not comprehensively ad-
developed using chaos engineering tools and is mainly used in pro-                      dressed in articles. To provide insights for future research, the identified
                                                                                        tools from the additional examination were categorized based on their
duction for troubleshooting. Chaos Experiments are usually conducted
                                                                                        functionality, as presented in Tables 2 and 3. Table 2 displays the
in the production environment to provide realistic results. The section
                                                                                        tools obtained from the study, while Table 3 presents additional tools
further enumerates several tools that have been used for Chaos experi-
                                                                                        that have been examined. Tools listed in the table with corresponding
ments, as well as discussing general principles such as defining a steady
                                                                                        references indicate their inclusion in the referenced articles.
state, forming a hypothesis, conducting the experiment, and proving or
refuting the hypothesis. These principles and tools help detect problems
                                                                                        4.2. How is Chaos engineering effectively applied in production environ-
like hardware issues, software errors, network interruptions, security
                                                                                        ments to enhance the resilience of software systems?
vulnerabilities, and configuration mistakes within their respective contexts.
                                                                                           Table 4 examines the successful implementation of Chaos Engineer-
4.1. Main statistics                                                                    ing in operational settings, covering different aspects such as goals,
                                                                                        techniques and resources, guiding principles, findings, limitations and
    Fig. 3 shows the results of the quality assessment. The distribution of             substitutes, as well as the general strategy.
the years of publication is shown in Fig. 4. Most of the selected studies
were published between 2020 and 2022. This shows that researchers’                      4.3. Which platforms have been used for chaos experiments?
interest in chaos engineering has increased in recent years. Most of the
studies included were indexed in the IEEE Xplore database.                                 Table 5 provides a concise summary of various tools and platforms
    Fig. 5 presents the distribution of the type of publications and                    used in Chaos experiments, along with their specific functionalities
the corresponding databases. While there are many journal papers,                       or characteristics. It offers comprehensive insights into each platform
conference proceedings also appear in the selected papers.                              through detailed descriptions accompanied by the necessary references.



                                 Fig. 3. Quality assessment scores.


                                    Fig. 4. Year of publication.


                 Fig. 5. Diagram of the distribution of studies per search database.




                                                           Fig. 6. Functionality of chaos engineering tools.


                 Table 2
                 Chaos engineering tools from studies.
                  Chaos engineering tool         Termination         Network simulating    Stressing machine   Security         Fault application code
                  Chaos Monkey [57]              ×
                  Gremlin [35]                   ×                   ×                     ×                   ×                ×
                  Chaos Toolkit [45]             ×                   ×                     ×                   ×                ×
                  Pumba [55]                                         ×                     ×
                  LitmusChaos [45]               ×                   ×                     ×                   ×
                  ToxiProxy [45]                                     ×                                         ×
                  PowerfulSeal [45]              ×                   ×                     ×                   ×
                  Pod Reaper [25]                ×
                  Netflix Simian Army [36]       ×                   ×                                         ×
                  WireMock [25]                                      ×                                                          ×
                  KubeMonkey [25]                ×                   ×                     ×
                  Chaosblade [45]                ×                   ×                     ×
                  ChaosTwin [47]                 ×                   ×                     ×                                    ×
                  Chaos Machine [24]                                 ×                     ×                   ×
                  Cloud Strike [42]                                                                            ×
                  Phoebe [22]                                                                                                   ×
                  Mjolnirr [58]                                                                                                 ×
                  ChaosOrca [37]                                     ×                     ×                   ×
                  3MileBeach [31]                                    ×                                                          ×
                  Muxy [25]                                          ×                     ×                                    ×
                  Blockade [25]                                      ×
                  Chaos Lambda [25]              ×                                                                              ×
                  Byte-Monkey [25]                                                                                              ×
                  Turbulence [25]                ×                                         ×                   ×
                  Cthulhu [25]                   ×                   ×                     ×                                    ×
                  Byteman [25]                                                                                 ×                ×
                  ChaosCube [55]                 ×
                  Chaos Lemur [25]               ×
                  Chaos HTTP Proxy [25]                              ×
                  Chaos Mesh [45]                ×                   ×                     ×
                  Istio Chaos [45]                                   ×
                  ChAP [33]                                          ×                                                          ×
                  IntelliFT [44]                 ×                   ×                     ×                                    ×
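
A capability matrix such as Table 2 lends itself to simple set-based queries. The sketch below encodes only a handful of rows from Table 2 as an example; the `Capability` enum and the `tools_with` helper are hypothetical names introduced for illustration.

```python
from enum import Enum


class Capability(Enum):
    TERMINATION = "Termination"
    NETWORK = "Network simulating"
    STRESS = "Stressing machine"
    SECURITY = "Security"
    FAULT_CODE = "Fault application code"


# A subset of the rows of Table 2, transcribed as capability sets.
TOOLS: dict[str, set[Capability]] = {
    "Chaos Monkey": {Capability.TERMINATION},
    "Gremlin": set(Capability),  # marked in all five columns
    "Pumba": {Capability.NETWORK, Capability.STRESS},
    "Pod Reaper": {Capability.TERMINATION},
    "Blockade": {Capability.NETWORK},
    "Cloud Strike": {Capability.SECURITY},
    "Phoebe": {Capability.FAULT_CODE},
}


def tools_with(*required: Capability) -> list[str]:
    """Return the names of tools that offer every required capability."""
    needed = set(required)
    return sorted(name for name, caps in TOOLS.items() if needed <= caps)


network_tools = tools_with(Capability.NETWORK)
```

Extending the dictionary with the remaining rows would allow the same query to cover the full table.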


                 Table 3
                 Chaos engineering tools from our search.
                  Chaos engineering tool     Termination         Network simulating       Stressing machine    Security         Fault application code
                  Pod Chaos                  ×                   ×                        ×
                  DNS Chaos                                      ×
                  AWS Chaos                  ×                                            ×                    ×
                  Azure Chaos                ×                   ×                        ×                    ×
                  GCP Chaos                  ×                   ×                        ×                    ×




                 Table 4
                 Chaos engineering in production environments.
                  Category                         Description
                  Objective                        The primary objective of applying chaos engineering in production environments is to enhance the
                                                   resilience of software systems. This involves troubleshooting to identify and address potential
                                                   malfunctions before they occur. The overarching goal is to minimize issues in production through the
                                                   use of chaos engineering tools, enabling automatic fault detection [24,53].
                  Methods and tools                Chaos engineering relies on specific tools to facilitate its effective application in production
                                                   environments. These tools aid in automatic fault detection, a crucial aspect of troubleshooting to
                                                   minimize potential issues in the production environment [24,53].
                  Principles and considerations    The effective application of chaos engineering is closely tied to key principles and considerations.
                                                   These include continuous experimentation, serving as a form of robustness testing conducted in
                                                   real-world operational conditions. Fundamental principles of Chaos Experiments involve defining a
                                                   steady state, hypothesizing about its impact, conducting the experiment, and then demonstrating or
                                                   refuting the hypothesis [53].
                  Insights and results             Chaos experiments conducted in the production environment provide valuable insights into the
                                                   behavior of the system. This is particularly significant as the production environment may exhibit
                                                   unpredictable behavior that differs from staging environments in some cases [24].
                  Constraints and alternatives     While conducting chaos experiments in production is ideal, it is acknowledged that legal or technical
                                                   constraints may sometimes prevent this. In such cases, an alternative approach is considered, starting
                                                   chaos experiments in a staging environment and gradually transitioning to the production
                                                   environment [25].
                  Overall approach                 The overall approach for the effective application of chaos engineering in production environments
                                                   involves the systematic execution of chaos experiments. This includes leveraging chaos engineering
                                                   tools and taking into account the constraints and challenges associated with conducting experiments in
                                                   real-world operational settings. The aim is to proactively identify and address potential issues before
                                                   they impact the production environment, ultimately enhancing the resilience of software systems.
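
The four-step cycle listed under "Principles and considerations" in Table 4 (define a steady state, state a
hypothesis, run the experiment, confirm or refute the hypothesis) can be sketched in a few lines of Python.
This is a minimal illustration, not code from any of the reviewed tools: `run_experiment`, the `probe`,
`inject`, and `rollback` callables, the toy two-replica system, and the 5% tolerance are all assumptions made
for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float         # steady-state metric measured before the fault
    under_fault: float      # same metric measured while the fault is active
    hypothesis_holds: bool  # True if the deviation stayed within tolerance


def run_experiment(
    probe: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float = 0.05,
) -> ExperimentResult:
    """Run one chaos experiment (hypothetical helper).

    Hypothesis: while the fault is active, the probed metric stays within
    `tolerance` (relative) of its steady-state value.
    """
    baseline = probe()                      # 1. define the steady state
    if baseline <= 0:
        raise ValueError("steady-state metric must be positive")
    inject()                                # 3. conduct the experiment
    try:
        under_fault = probe()
    finally:
        rollback()                          # always undo the fault
    deviation = abs(under_fault - baseline) / baseline
    return ExperimentResult(baseline, under_fault, deviation <= tolerance)  # 4. verdict


# Toy system (assumption): the success rate depends on the replica count.
state = {"replicas": 2}


def success_rate() -> float:
    return 0.99 if state["replicas"] >= 2 else 0.80


def kill_one_replica() -> None:
    state["replicas"] -= 1


def restore_replica() -> None:
    state["replicas"] += 1


result = run_experiment(success_rate, kill_one_replica, restore_replica)
print(result.hypothesis_holds)  # prints False: the hypothesis is refuted
```

A refuted hypothesis is the useful outcome here: it shows that losing one replica degrades the toy service
more than the stated tolerance allows. The rollback runs in a `finally` block so that the fault is removed
even if the probe itself fails.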


                 Table 5
                 Chaos engineering tools identified from selected papers.
                  Platform/Tool                    Description
                  The Chaos Machine                A tool for conducting chaos experiments at the application level on Java Virtual Machine (JVM),
                                                   using exception injection to analyze try-catch blocks for error processing [24].
                  Screwdriver                      An automated fault-tolerance testing tool for on-premise applications and services, creating realistic
                                                   error models and collecting metrics by injecting errors into the system [34].
                  Chaos Monkey                     Designed by Netflix, this tool tests the system’s resilience by randomly terminating instances to
                                                   check that the system keeps functioning [7,45].
                  Cloud Strike                     A security chaos engineering system for multi-cloud security, extending chaos engineering to security
                                                   by injecting faults impacting confidentiality, integrity, and availability [42].
                  Chaos Mesh                       An open-source chaos engineering platform for testing the resilience and reliability of distributed
                                                   systems by intentionally injecting failures and disruptions [55].
                  PowerfulSeal                     An open-source tool for testing the resilience of Kubernetes clusters by simulating real-world failures
                                                   and disruptions [55].
                  IntelliFT                        A feedback-based, automated failure testing technique for microservice applications, focusing on
                                                   exposing defects in fault-handling logic [44].
                  The Chaos Toolkit                Open-source software that runs experiments against the system to confirm a hypothesis [25,55].
                  Phoebe                           A fault injection framework for reliability analysis concerning system call invocation errors, enabling
                                                   full observability of system call invocations and automatic experimentation [22].
                  Mjolnirr                         A private cloud platform with a built-in Chaos Monkey service for developing private PaaS cloud
                                                   infrastructure [58].
                  ChaosOrca                        A tool for Chaos engineering on containers, perturbing system calls for processes inside containers
                                                   and monitoring their effects [37].
                  Gremlin                          Offered as a SaaS technology, Gremlin tests system resilience on various parameters and conditions,
                                                   with capabilities for automation and integration with Kubernetes clusters and public clouds [35].
                  3MileBeach                       A distributed tracing and fault injection framework for microservices, enabling chaos experiments
                                                   through message serialization library manipulation [31].
                  ChAP                             A software platform for running automated chaos experiments, simulating various failure scenarios
                                                   and providing insights into system behavior under stress [29,33].
                  ChaosTwin                        Utilizes a digital twin approach in Chaos Engineering to mitigate impacts of unforeseen events,
                                                   constructing models across workload, network, and service layers [47].
                  Litmus Chaos                     An open-source cloud-native framework for Chaos Engineering in Kubernetes environments, offering a
                                                   range of chaos experiments and workflows [50].
                  Filibuster                       A testing method in chaos engineering that introduces errors into microservice architecture to validate
                                                   resilience and error tolerance [32].




Table 6
Chaos engineering in microservices: approaches, descriptions, and expected outcomes.
 Approach                   Description                                                                              Expected impact
 Fault injection testing    This method involves intentionally introducing errors into the system to assess its      Evaluating and enhancing the system’s resilience
                            response, particularly in microservices by simulating various failure modes such as      and stability.
                            network issues, service outages, or resource shortages within or between
                            microservices, to evaluate the system’s resilience and stability [52].
 Hypothesis-driven          Key to chaos engineering is conducting experiments based on well-defined                 Identifying system weaknesses and increasing
 experiments                hypotheses about the normal state of the system and its expected behavior during         resilience.
                            failure scenarios. This strategic approach enables focused experiments that assess the
                            resilience of both individual microservices and the overall system [45,53].
 Blast radius               Managing the ‘‘blast radius’’ of experiments is crucial in microservices. It involves    Better understanding and enhancing the system’s
 management                 understanding the potential impact of introduced failures, starting with small           resilience.
                            experiments and then expanding, to manage failure impacts while identifying system
                            vulnerabilities [45].
 Resilience requirement     Utilizing chaos engineering to determine and analyze the resilience requirements of      Understanding specific resilience needs of each
 elicitation                microservice architectures. This process involves observing the system’s response to     microservice and their interactions.
                            induced faults to identify specific resilience needs of each microservice and their
                            interactions [52].
 Continuous testing and     Regularly conducting chaos experiments as part of an ongoing testing process             Proactive identification and resolution of system
 improvement                ensures that microservices remain resilient against unforeseen issues. This continuous   weaknesses, leading to continual improvement and
                            approach aids in proactively finding and fixing potential system weaknesses [56].        increased resilience.
 Observability and          Integrating chaos engineering with observability tools enhances the monitoring of        Real-time tracking of responses to failures and
 remediation                microservices during fault injection, allowing for real-time tracking of responses to    development of effective remediation strategies for
                            failures, aiding in the development of effective remediation strategies and overall      overall system resilience improvement.
                            system resilience improvement [56].
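
The "blast radius management" row of Table 6 describes starting with small experiments and then expanding.
One way to picture this is a staged rollout that targets a growing share of instances and stops as soon as a
stage fails. The sketch below is illustrative only: `staged_rollout`, the percentage steps, and the toy
`run_stage` check are hypothetical and do not come from the reviewed papers.

```python
import random
from typing import Callable, List, Sequence


def staged_rollout(
    instances: Sequence[str],
    run_stage: Callable[[List[str]], bool],
    percents: Sequence[int] = (1, 5, 25),
    seed: int = 0,
) -> List[int]:
    """Widen the blast radius stage by stage; stop after the first failed stage.

    `run_stage` injects the fault into the given targets and returns True if
    the system stayed healthy. Returns the number of instances targeted in
    each stage that was executed.
    """
    if not instances:
        return []
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    sizes = []
    for pct in percents:
        count = max(1, len(instances) * pct // 100)
        targets = rng.sample(list(instances), count)
        sizes.append(count)
        if not run_stage(targets):
            break  # do not expand the experiment any further
    return sizes


# Toy check (assumption): the system tolerates at most 5 faulty instances.
fleet = [f"svc-{i}" for i in range(200)]
sizes = staged_rollout(fleet, run_stage=lambda targets: len(targets) <= 5)
print(sizes)  # prints [2, 10]: the 5% stage failed, so the 25% stage never ran
```

Stopping at the first unhealthy stage keeps the impact bounded by the size of that stage, which is the point
of expanding gradually.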


4.4. How can Chaos engineering be effectively applied to microservice architecture
to ensure successful implementation and enhance system resilience?

    Table 6 provides a comprehensive overview of the different facets and projected implications of
implementing chaos engineering within microservice architecture.
    By implementing these approaches and strategies, organizations can effectively integrate chaos
engineering into their microservice architectures to uncover vulnerabilities and enhance the overall
dependability of their systems.

4.5. To what extent can the centralized provision of Chaos engineering effectively
facilitate the management of chaos experiments across complex systems?

    Table 7 provides an overview of the ways in which centralized chaos engineering can simplify
experiment management in intricate systems. It emphasizes advantages like standardization, resource
utilization, risk mitigation, and more, resulting in enhanced system resilience and performance.

4.6. What are the challenges reported in the relevant papers?

    Table 8 concisely presents the primary obstacles in the area of chaos engineering and their
respective resolutions. These obstacles encompass system intricacy, hazards to live environments,
resource demands, security issues, and automation complexities. The proposed resolutions involve
phased implementation, risk assessment, knowledge enhancement, robust security protocols, and
automation approaches.

5. Discussion

    In this section, we summarize the answers to the research questions. The answers indicate that
chaos engineering can improve robustness by simulating real-world failure scenarios and exploring
system reactions, especially in microservice architectures. Various tools for implementing chaos
engineering are listed and compared. Overall, the application of chaos engineering requires careful
planning due to inherent challenges but has the potential to greatly improve system resilience.

5.1. General discussion

    In this article, we reviewed the literature on the application of chaos engineering in
microservice architecture to understand the state-of-the-art. For this purpose, six research
questions were defined and answered.
    In RQ1, we aimed to understand how chaos engineering is applied to production environments.
Chaos engineering, when adeptly applied in production settings, serves as a pivotal tool for
augmenting the robustness of software systems. This approach entails conducting deliberate and
controlled chaos experiments within the production environment, a strategy that is instrumental in
uncovering and rectifying potential issues before they escalate into full-blown system failures,
thereby bolstering system uptime [38]. Moreover, chaos engineering is characterized by the
intentional injection of faults into systems. This methodology is crucial for identifying and
addressing security flaws and risks, laying the groundwork for the development of resilient
application architectures [56]. By replicating adverse conditions that could naturally arise in
production settings, chaos engineering helps detect inherent system vulnerabilities and structural
deficiencies, fostering a proactive stance towards issue mitigation [38].
    Additionally, this practice involves comprehensive testing of real-world scenarios on
operational systems. Such testing is vital for assessing the complete spectrum of software systems,
encompassing both hardware malfunctions and software glitches, within their actual deployment
contexts. This approach significantly contributes to the enhancement of overall system
resilience [38]. To effectively implement chaos engineering, it is recommended to start with less
complex experiments, leverage automation for these experiments, and focus on areas with either high
impact or high frequency of issues. Observing the system at its limits is also crucial for
reinforcing resilience [25].
    In RQ2, we discuss various platforms that aim to increase the flexibility and reliability of
microservice architectures through chaos experiments. Tools like Gremlin, Chaos Monkey, Chaos
Toolkit, Pumba, LitmusChaos, ToxiProxy and PowerfulSeal have been utilized in industry settings to
simulate different failure scenarios. These tools provide functions such as terminating processes,
simulating network conditions, applying stress, testing security measures, and injecting faults to
proactively identify weaknesses and strengthen system robustness across different technology
landscapes.
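
Two of the fault types that such tools provide, added network delay and injected failures, can be
imitated at the application level with a small wrapper around a service call. The sketch below is not
the API of any tool named above; `inject_faults`, `InjectedFault`, and the `get_price` functions are
hypothetical names used only to show the idea and to check that a caller's fallback path works.

```python
import random
import time
from functools import wraps


class InjectedFault(RuntimeError):
    """Raised in place of a real downstream failure."""


def inject_faults(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Decorator that delays a call and fails it with probability `error_rate`."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)                  # simulated network delay
            if rng.random() < error_rate:         # simulated service outage
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_faults(latency_s=0.01, error_rate=1.0)    # every call fails
def get_price(item: str) -> int:
    return 42


@inject_faults(error_rate=0.0)                    # no fault is injected
def get_price_healthy(item: str) -> int:
    return 42


def get_price_with_fallback(item: str):
    """Caller under test: it should degrade gracefully, not crash."""
    try:
        return get_price(item)
    except InjectedFault:
        return None                               # fallback value


print(get_price_with_fallback("book"))  # prints None
print(get_price_healthy("book"))        # prints 42
```

The `sleep` and `rng` parameters are injectable so that an experiment can be made deterministic and
fast when it runs as part of an automated test suite.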



Table 7
Centralized provision in chaos engineering.
 Approach                            Description                                                                                    Expected impact
 Standardization                     Centralized provision allows for the standardization of chaos engineering practices            Improved coordination and reliability of
                                     and tools across the organization. This ensures that all teams follow consistent               results.
                                     processes and use approved tools, leading to better coordination and more reliable
                                     results [42].
 Resource optimization               Centralized provision enables efficient allocation of resources for chaos experiments.         Enhanced resource utilization and reduced
                                     It allows pooling of expertise, tools, and infrastructure, reducing redundancy and             redundancy.
                                     optimizing resource utilization [38].
 Risk management                     Centralized provision facilitates better risk management by providing oversight and            Controlled experimentation and effective
                                     governance for chaos experiments. It establishes clear guidelines, safety measures,            risk management.
                                     and expected states for running experiments in production environments, ensuring
                                     controlled experimentation [42].
 Automation and                      Centralized provision supports the automation of chaos experiments to run                      Ongoing validation of system resilience and
 continuous testing                  continuously. This ensures regular conduction of experiments, leading to ongoing               early identification of potential issues.
                                     validation of system resilience and identification of potential issues before they
                                     manifest as outages [38,42].
 Knowledge sharing and               A centralized approach encourages knowledge sharing and collaboration among                    Promotion of a continuous improvement
 collaboration                       teams. It facilitates the dissemination of best practices, lessons learned, and                culture and shared learning.
                                     successful experiment designs, fostering a culture of continuous improvement and
                                     shared learning [25].
 Performance metrics and             Centralized provision enables the establishment of standardized performance metrics            Consistent system health measurement and
 analysis                            and analysis methods for chaos experiments. This allows for consistent measurement             more effective decision-making.
                                     of system health and identification of deviations from steady-state, leading to more
                                     effective decision-making and system improvements [43].
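
The "Performance metrics and analysis" row of Table 7 mentions measuring system health consistently and
identifying deviations from steady state. A shared check of that kind could look like the sketch below. It
is an assumption-laden example: the function name `deviates_from_steady_state`, the use of mean latency as
the health metric, and the three-sigma threshold are illustrative choices, not something the reviewed papers
prescribe.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_steady_state(
    baseline: Sequence[float],
    observed: Sequence[float],
    max_sigma: float = 3.0,
) -> bool:
    """Return True if the observed mean lies more than `max_sigma` baseline
    standard deviations away from the baseline mean.

    `baseline` needs at least two samples so that a spread can be estimated.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return mean(observed) != mu
    return abs(mean(observed) - mu) / sigma > max_sigma


# Hypothetical latency samples in milliseconds.
steady = [100, 102, 98, 101, 99]
during_fault = [130, 128, 135]
after_rollback = [101, 99, 100]

print(deviates_from_steady_state(steady, during_fault))    # prints True
print(deviates_from_steady_state(steady, after_rollback))  # prints False
```

Using one such function across teams is what makes results comparable: every experiment is judged against
the same definition of "deviation".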


Table 8
Challenges and solutions in chaos engineering.
 Category             Challenges                                                  Possible solutions                                                             References
 Complexity           Designing and executing effective chaos experiments         To mitigate complexity, it is recommended to start with smaller, more          [25,43]
                      in large systems is complex due to intricate                manageable experiments and gradually expand the scope of chaos
                      interdependencies within these systems.                     engineering practices.
 Risk of impact       Concerns about causing disruptions in the production        Implementing risk analysis techniques can help prioritize experiments,         [45,50]
                      environment, affecting users and business operations.       focusing on less critical system components first to minimize potential
                                                                                  impacts.
 Resource             Significant resources needed including time, expertise,     Addressing resource intensiveness involves providing comprehensive             [7,47]
 intensiveness        and infrastructure, posing a barrier for many               training and education on chaos engineering best practices and tools to
                      organizations.                                              equip teams with the necessary skills and knowledge.
 Security             Introducing controlled failures can raise security          To combat security concerns, robust security measures should be                [42,47]
 concerns             issues, potentially exposing vulnerabilities or sensitive   implemented during experiments to safeguard sensitive data and prevent
                      data.                                                       unauthorized access.
 Tooling and          Developing tools for automated chaos experiments is         Overcoming tooling and automation challenges requires the development          [7,33,38,40,42]
 automation           challenging in heterogeneous and dynamic                    and use of automated tools for chaos experiments, which reduce manual
                      environments.                                               efforts and facilitate continuous, unattended testing.


    Recent studies have emphasized the growing intersection between artificial intelligence and
cybersecurity within the context of chaos engineering. AI-driven techniques are now used for
real-time threat detection, anomaly prediction, and automated response mechanisms in enterprise
systems. For example, generative AI models have been proposed to enhance cybersecurity frameworks
by improving data privacy management and identifying potential attack vectors [59].
    In RQ3, we focused on understanding how chaos engineering is implemented in microservice
architectures. To enhance system resilience in microservice architectures through chaos
engineering, organizations should utilize fault injection testing to replicate failures within
microservices. They should also conduct hypothesis-driven experiments with a solid comprehension of
the normal state and anticipated behavior during disruptions, while managing the scope of these
experiments to minimize impact. Additionally, it is essential to identify and analyze resilience
requirements, participate in continuous testing and improvement efforts, as well as integrate
observability tools for real-time monitoring during fault injection tests. Moreover, organizations
need to establish clear communication channels across teams involved in order to ensure effective
collaboration and knowledge sharing.
    The answer to RQ4 highlights the significance of centralized management and monitoring in
conducting chaos experiments within large-scale microservices ecosystems. It discusses the
utilization of software solutions like Netflix’s Chaos Automation Platform (ChAP) and fault
injection techniques such as service call manipulation. The emphasis is placed on the need for
careful planning, effective communication, risk management, and continuous learning to ensure
comprehensive and valuable chaos experiments for enhancing overall system resilience.
    In response to RQ5, our discussion concludes that the practical implementation of chaos
engineering, despite its promise to enhance system resilience, presents numerous challenges. These
challenges include potential business impacts, difficulty in determining scope, the
unpredictability of outcomes, time and resource constraints, system complexities, skill and
knowledge prerequisites, interpretation of results, cultural readiness, and selection of
appropriate tools. These all necessitate meticulous planning and skilled execution for
effectiveness.
    Recent studies explore the convergence of Chaos Engineering and Artificial Intelligence (AI).
Large language models (LLMs) have been used to automate the chaos engineering lifecycle, managing
phases from hypothesis creation to experiment orchestration and remediation [60]. Meanwhile,
advances in applying chaos engineering to multi-agent AI systems suggest new directions: for
example, chaos experiments applied to LLM-based multi-agent systems can surface vulnerabilities
such as hallucinations, agent failures, or inter-agent communication breakdowns [61]. Together,
these works show how intelligent,



adaptive chaos frameworks might evolve in microservice-based systems as well.
    Recent research also discusses specific operational challenges such as load balancing and
security in the context of chaos engineering. For example, an empirical study applies delay
injections under different user loads in cloud-native systems to observe how throughput and latency
change under stress, providing insights into how load balancing policies perform under fault
conditions [62]. In parallel, several frameworks have begun integrating security-focused chaos
tests that intentionally inject faults into authentication, identity management, and access control
components to ensure that security mechanisms remain effective under stress conditions [63]. These
studies highlight how chaos engineering can be extended beyond performance reliability to
proactively strengthen both load distribution and security resilience in microservice environments.
    The main challenges faced by previous researchers and possible solutions have been discussed in
the paper. The collected challenges were mainly related to the correct interpretation of chaos
experiments and making sense of them. There may be more challenges, but if they were not mentioned
in these articles, we could not include them. We believe that chaos engineering is still in the
early stages and the adoption in the software industry will take some time.

5.2. Threats to validity

Internal validity
    The validity of this systematic literature review is threatened by issues related to defining
the candidate pool of papers, potential bias in selecting primary studies, data extraction, and
data synthesis. The application of exclusion criteria can be influenced by the researchers’ biases,
posing a potential threat to validity. We compiled a comprehensive list of exclusion criteria, and
all conflicts were documented and resolved through discussions among us. Data extraction validity
is crucial as it directly impacts the study results. Whenever any of us was uncertain about data
extraction, the case was recorded for resolution through discussions with the team. Multiple
meetings were held to minimize researcher bias.

External validity
    The search for candidate papers involved using general search terms to minimize the risk of
excluding relevant studies. Despite using a broad search query to acquire more articles, there
remains a possibility that some papers were overlooked in electronic databases or missed due to
recent publications. Furthermore, although seven widely used online databases in computer science
and software engineering were searched, new papers may not have been included.

6. Conclusion

    Our systematic literature review (SLR) on chaos engineering has explored its role in enhancing
the resilience of software systems in production environments. Through our review, we have
identified several crucial aspects that underline the effective application and challenges of chaos
engineering [25].
    Firstly, Chaos Engineering serves as a proactive troubleshooting approach in production
environments [25]. By identifying and addressing potential malfunctions before they occur, it
effectively preempts system disruptions. This proactive strategy is significantly implemented by
chaos engineering tools that assist in automatic fault detection, thereby minimizing potential
issues in these critical environments [50].

experiments are insightful, as they reveal system behaviors in production environments, which often
differ unpredictably from staging environments [36,53].
    Furthermore, the effectiveness of chaos engineering is contingent on the systematic execution
of chaos experiments. These experiments, utilizing advanced chaos engineering tools, need to
navigate the constraints and challenges inherent in real-world operational settings. The main
objective is the enhancement of system resilience, achieved by proactively identifying and
preemptively addressing potential issues [46].
    However, it is acknowledged that conducting chaos experiments directly in production
environments might be impeded by legal or technical constraints. In such scenarios, initiating
experiments in a staging environment and then gradually transitioning to the production environment
offers a viable alternative. This approach ensures that the benefits of chaos engineering can still
be realized, but in a more controlled and possibly less direct manner.
    Our review highlights that chaos engineering is a critical methodology for ensuring the
resilience and robustness of software systems. By following continuous experimentation and
proactive troubleshooting, it offers a pathway to address the challenges faced in complex
production environments. This SLR contributes to the scientific community by discussing these
methodologies and their applications, thereby providing a framework for future research and
practical implementation in the field of software system resilience.

CRediT authorship contribution statement

    Emrah Esen: Writing – review & editing, Writing – original draft, Visualization, Validation,
Software, Methodology, Investigation, Formal analysis, Data curation. Akhan Akbulut: Writing –
review & editing, Writing – original draft, Visualization, Validation, Supervision, Software,
Resources, Project administration, Methodology, Investigation, Formal analysis, Data curation.
Cagatay Catal: Writing – review & editing, Writing – original draft, Visualization, Validation,
Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding
acquisition, Formal analysis, Data curation.

Declaration of competing interest

    The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.

Data availability

    Data will be made available on request.

References

 [1] P. Jamshidi, C. Pahl, N.C. Mendonça, J. Lewis, S. Tilkov, Microservices: The journey so far and
     challenges ahead, IEEE Softw. 35 (3) (2018) 24–35, http://dx.doi.org/10.1109/MS.2018.2141039.
 [2] I. Beschastnikh, P. Wang, Y. Brun, M.D. Ernst, Debugging distributed systems, Commun. ACM 59
     (8) (2016) 32–37, http://dx.doi.org/10.1145/2909480.
 [3] W. Ahmed, Y.W. Wu, A survey on reliability in distributed systems, J. Comput. System Sci. 79
     (8) (2013) 1243–1255, http://dx.doi.org/10.1016/j.jcss.2013.02.006.
 [4] D. Ma’ruf, S. Sulistyo, L. Nugroho, Applying integrating testing of microservices in airline
     ticketing system, Ijitee (Int. J. Inf. Technol. Electr. Eng.) 4 (2020) 39,
     http://dx.doi.org/10.22146/ijitee.55491.
    Secondly, the essence of chaos engineering is rooted in continuous            [5] F. Dai, H. Chen, Z. Qiang, Z. Liang, B. Huang, L. Wang, Automatic analysis
experimentation and robustness testing under real-world operational                   of complex interactions in microservice systems, Complexity 2020 (2020) 1–12,
                                                                                      http://dx.doi.org/10.1155/2020/2128793.
conditions. The methodology involves a systematic approach: defining              [6] J. Lewis, M. Fowler, Microservices: a definition of this new architectural term
a steady state, hypothesizing its impacts, conducting controlled exper-               (2014), 2014, URL: http://martinfowler.com/articles/microservices.html (cit. p.
iments, and subsequently confirming or refuting the hypotheses. These                 26).




 [7] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, C. Rosenthal,
     Chaos engineering, IEEE Softw. 33 (3) (2016) 35–41, http://dx.doi.org/10.1109/MS.2016.60.
 [8] R.T. Munodawafa, S.K. Johl, A systematic review of eco-innovation and performance from the
     resource-based and stakeholder perspectives, Sustainability 11 (2019) 6067,
     http://dx.doi.org/10.3390/su11216067.
 [9] J.M. Macharia, Systematic literature review of interventions supported by integration of ict
     in education to improve learners’ academic performance in stem subjects in kenya, J. Educ.
     Pract. 6 (2022) 52–75, http://dx.doi.org/10.47941/jep.979.
[10] P. Gerli, J.N. Marco, J. Whalley, What makes a smart village smart? a review of the
     literature, Transform. Gov.: People Process. Policy 16 (2022) 292–304,
     http://dx.doi.org/10.1108/tg-07-2021-0126.
[11] R. Coppola, L. Ardito, Quality assessment methods for textual conversational interfaces: a
     multivocal literature review, Information 12 (2021) 437,
     http://dx.doi.org/10.3390/info12110437.
[12] B. Kitchenham, O. Pearl Brereton, D. Budgen, M. Turner, J. Bailey, S. Linkman, Systematic
     literature reviews in software engineering – A systematic literature review, Inf. Softw.
     Technol. 51 (1) (2009) 7–15, http://dx.doi.org/10.1016/j.infsof.2008.09.009, Special Section -
     Most Cited Articles in 2002 and Regular Research Papers.
[13] N. Dragoni, S. Giallorenzo, A.L. Lafuente, M. Mazzara, F. Montesi, R. Mustafin, L. Safina,
     Microservices: yesterday, today, and tomorrow, 2017, arXiv:1606.04036.
[14] P.D. Francesco, I. Malavolta, P. Lago, Research on architecting microservices: Trends, focus,
     and potential for industrial adoption, in: 2017 IEEE International Conference on Software
     Architecture, ICSA, 2017, pp. 21–30, http://dx.doi.org/10.1109/ICSA.2017.24.
[15] M. Fowler, Patterns of Enterprise Application Architecture, Addison-Wesley Longman Publishing
     Co., Inc., USA, 2002.
[16] J. Lewis, M. Fowler, Microservices, 2014, https://martinfowler.com/articles/microservices.html.
[17] S. Newman, Building Microservices: Designing Fine-Grained Systems, O’Reilly Media, Inc., 2021.
[18] C.K. Rudrabhatla, Comparison of zero downtime based deployment techniques in public cloud
     infrastructure, in: 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile,
     Analytics and Cloud), I-SMAC, 2020, pp. 1082–1086,
     http://dx.doi.org/10.1109/I-SMAC49090.2020.9243605.
[19] S.R. Addula, P. Perugu.P, M.K. Kumar, D. Kumar, B. Ananthan, R. R, S. P, S. G, Dynamic load
     balancing in cloud computing using hybrid Kookaburra-Pelican optimization algorithms, in: 2024
     International Conference on Augmented Reality, Intelligent Systems, and Industrial Automation,
     ARIIA, 2024, pp. 1–7, http://dx.doi.org/10.1109/ARIIA63345.2024.11051893.
[20] M. Waseem, P. Liang, M. Shahin, A systematic mapping study on microservices architecture in
     devops, J. Syst. Softw. 170 (2020) 110798, http://dx.doi.org/10.1016/j.jss.2020.110798.
[21] C. Rosenthal, N. Jones, Chaos Engineering: System Resiliency in Practice, O’Reilly Media, 2020.
[22] L. Zhang, B. Morin, B. Baudry, M. Monperrus, Maximizing error injection realism for chaos
     engineering with system calls, IEEE Trans. Dependable Secur. Comput. 19 (4) (2022) 2695–2708,
     http://dx.doi.org/10.1109/TDSC.2021.3069715.
[23] Š. Davidovič, B. Beyer, Canary analysis service, Commun. ACM 61 (5) (2018) 54–62,
     http://dx.doi.org/10.1145/3190566.
[24] L. Zhang, B. Morin, P. Haller, B. Baudry, M. Monperrus, A chaos engineering system for live
     analysis and falsification of exception-handling in the JVM, IEEE Trans. Softw. Eng. 47 (11)
     (2021) 2534–2548, http://dx.doi.org/10.1109/TSE.2019.2954871.
[25] H. Jernberg, P. Runeson, E. Engström, Getting started with chaos engineering - design of an
     implementation framework in practice, in: Proceedings of the 14th ACM / IEEE International
     Symposium on Empirical Software Engineering and Measurement, ESEM, ESEM ’20, Association for
     Computing Machinery, New York, NY, USA, 2020, http://dx.doi.org/10.1145/3382494.3421464.
[26] A. Alkhateeb, C. Catal, G. Kar, A. Mishra, Hybrid blockchain platforms for the internet of
     things (IoT): A systematic literature review, Sensors 22 (4) (2022)
     http://dx.doi.org/10.3390/s22041304.
[27] R. van Dinter, B. Tekinerdogan, C. Catal, Predictive maintenance using digital twins: A
     systematic literature review, Inf. Softw. Technol. 151 (2022) 107008,
     http://dx.doi.org/10.1016/j.infsof.2022.107008.
[28] M. Jorayeva, A. Akbulut, C. Catal, A. Mishra, Machine learning-based software defect
     prediction for mobile applications: A systematic literature review, Sensors 22 (7) (2022)
     http://dx.doi.org/10.3390/s22072551.
[29] A. Basiri, L. Hochstein, N. Jones, H. Tucker, Automating chaos experiments in production, in:
     2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in
     Practice, ICSE-SEIP, 2019, pp. 31–40, http://dx.doi.org/10.1109/ICSE-SEIP.2019.00012.
[30] L.B. Canonico, V. Vakeel, J. Dominic, P. Rodeghero, N. McNeese, Human-AI partnerships for
     chaos engineering, in: Proceedings of the IEEE/ACM 42nd International Conference on Software
     Engineering Workshops, ICSEW ’20, Association for Computing Machinery, New York, NY, USA,
     2020, pp. 499–503, http://dx.doi.org/10.1145/3387940.3391493.
[31] J. Zhang, R. Ferydouni, A. Montana, D. Bittman, P. Alvaro, 3MileBeach: A tracer with teeth,
     in: Proceedings of the ACM Symposium on Cloud Computing, SoCC ’21, Association for Computing
     Machinery, New York, NY, USA, 2021, pp. 458–472, http://dx.doi.org/10.1145/3472883.3486986.
[32] C.S. Meiklejohn, A. Estrada, Y. Song, H. Miller, R. Padhye, Service-level fault injection
     testing, in: Proceedings of the ACM Symposium on Cloud Computing, SoCC ’21, Association for
     Computing Machinery, New York, NY, USA, 2021, pp. 388–402,
     http://dx.doi.org/10.1145/3472883.3487005.
[33] A. Blohowiak, A. Basiri, L. Hochstein, C. Rosenthal, A platform for automating chaos
     experiments, in: 2016 IEEE International Symposium on Software Reliability Engineering
     Workshops, ISSREW, 2016, pp. 5–8, http://dx.doi.org/10.1109/ISSREW.2016.52.
[34] A. Nagarajan, A. Vaddadi, Automated fault-tolerance testing, in: 2016 IEEE Ninth
     International Conference on Software Testing, Verification and Validation Workshops, ICSTW,
     2016, pp. 275–276, http://dx.doi.org/10.1109/ICSTW.2016.34.
[35] V. Heorhiadi, S. Rajagopalan, H. Jamjoom, M.K. Reiter, V. Sekar, Gremlin: Systematic
     resilience testing of microservices, in: 2016 IEEE 36th International Conference on
     Distributed Computing Systems, ICDCS, 2016, pp. 57–66,
     http://dx.doi.org/10.1109/ICDCS.2016.11.
[36] R.K. Lenka, S. Padhi, K.M. Nayak, Fault injection techniques - a brief review, in: 2018
     International Conference on Advances in Computing, Communication Control and Networking,
     ICACCCN, 2018, pp. 832–837, http://dx.doi.org/10.1109/ICACCCN.2018.8748585.
[37] A. van Hoorn, A. Aleti, T.F. Düllmann, T. Pitakrat, ORCAS: Efficient resilience benchmarking
     of microservice architectures, in: 2018 IEEE International Symposium on Software Reliability
     Engineering Workshops, ISSREW, 2018, pp. 146–147,
     http://dx.doi.org/10.1109/ISSREW.2018.00-10.
[38] H. Tucker, L. Hochstein, N. Jones, A. Basiri, C. Rosenthal, The business case for chaos
     engineering, IEEE Cloud Comput. 5 (3) (2018) 45–54,
     http://dx.doi.org/10.1109/MCC.2018.032591616.
[39] N. Brousse, O. Mykhailov, Use of self-healing techniques to improve the reliability of a
     dynamic and geo-distributed ad delivery service, in: 2018 IEEE International Symposium on
     Software Reliability Engineering Workshops, ISSREW, 2018, pp. 1–5,
     http://dx.doi.org/10.1109/ISSREW.2018.00-40.
[40] K.A. Torkura, M.I. Sukmana, F. Cheng, C. Meinel, Security chaos engineering for cloud
     services: Work in progress, in: 2019 IEEE 18th International Symposium on Network Computing
     and Applications, NCA, 2019, pp. 1–3, http://dx.doi.org/10.1109/NCA.2019.8935046.
[41] H. Chen, P. Chen, G. Yu, A framework of virtual war room and matrix sketch-based streaming
     anomaly detection for microservice systems, IEEE Access 8 (2020) 43413–43426,
     http://dx.doi.org/10.1109/ACCESS.2020.2977464.
[42] K.A. Torkura, M.I.H. Sukmana, F. Cheng, C. Meinel, CloudStrike: Chaos engineering for
     security and resiliency in cloud infrastructure, IEEE Access 8 (2020) 123044–123060,
     http://dx.doi.org/10.1109/ACCESS.2020.3007338.
[43] D. Kesim, A. van Hoorn, S. Frank, M. Häussler, Identifying and prioritizing chaos experiments
     by using established risk analysis techniques, in: 2020 IEEE 31st International Symposium on
     Software Reliability Engineering, ISSRE, 2020, pp. 229–240,
     http://dx.doi.org/10.1109/ISSRE5003.2020.00030.
[44] Z. Long, G. Wu, X. Chen, C. Cui, W. Chen, J. Wei, Fitness-guided resilience testing of
     microservice-based applications, 2020, pp. 151–158,
     http://dx.doi.org/10.1109/ICWS49710.2020.00027.
[45] S. De, A study on chaos engineering for improving cloud software quality and reliability, in:
     2021 International Conference on Disruptive Technologies for Multi-Disciplinary Research and
     Applications, CENTCON, Vol. 1, 2021, pp. 289–294,
     http://dx.doi.org/10.1109/CENTCON52345.2021.9688292.
[46] C. Konstantinou, G. Stergiopoulos, M. Parvania, P. Esteves-Verissimo, Chaos engineering for
     enhanced resilience of cyber-physical systems, in: 2021 Resilience Week, RWS, 2021, pp. 1–10,
     http://dx.doi.org/10.1109/RWS52686.2021.9611797.
[47] F. Poltronieri, M. Tortonesi, C. Stefanelli, ChaosTwin: A chaos engineering and digital twin
     approach for the design of resilient IT services, in: 2021 17th International Conference on
     Network and Service Management, CNSM, 2021, pp. 234–238,
     http://dx.doi.org/10.23919/CNSM52442.2021.9615519.
[48] N. Luo, Y. Xiong, Platform software reliability for cloud service continuity - challenges and
     opportunities, in: 2021 IEEE 21st International Conference on Software Quality, Reliability
     and Security, QRS, 2021, pp. 388–393, http://dx.doi.org/10.1109/QRS54544.2021.00050.
[49] H. Chen, K. Wei, A. Li, T. Wang, W. Zhang, Trace-based intelligent fault diagnosis for
     microservices with deep learning, in: 2021 IEEE 45th Annual Computers, Software, and
     Applications Conference, COMPSAC, 2021, pp. 884–893,
     http://dx.doi.org/10.1109/COMPSAC51774.2021.00121.
[50] O. Sharma, M. Verma, S. Bhadauria, P. Jayachandran, A guided approach towards complex chaos
     selection, prioritisation and injection, in: 2022 IEEE 15th International Conference on Cloud
     Computing, CLOUD, 2022, pp. 91–93, http://dx.doi.org/10.1109/CLOUD55607.2022.00025.
[51] N. Luo, L. Zhang, Chaos driven development for software robustness enhancement, in: 2022 9th
     International Conference on Dependable Systems and their Applications, DSA, 2022,
     pp. 1029–1034, http://dx.doi.org/10.1109/DSA56465.2022.00154.




[52] M.A. Naqvi, S. Malik, M. Astekin, L. Moonen, On evaluating self-adaptive and self-healing
     systems using chaos engineering, in: 2022 IEEE International Conference on Autonomic
     Computing and Self-Organizing Systems, ACSOS, 2022, pp. 1–10,
     http://dx.doi.org/10.1109/ACSOS55765.2022.00018.
[53] J. Simonsson, L. Zhang, B. Morin, B. Baudry, M. Monperrus, Observability and chaos
     engineering on system calls for containerized applications in Docker, Future Gener. Comput.
     Syst. 122 (2021) 117–129, http://dx.doi.org/10.1016/j.future.2021.04.001.
[54] A.A.-S. Ahmad, P. Andras, Scalability resilience framework using application-level fault
     injection for cloud-based software services, J. Cloud Comput. 11 (1) (2022) 1,
     http://dx.doi.org/10.1186/s13677-021-00277-z.
[55] C. Camacho, P.C. Cañizares, L. Llana, A. Núñez, Chaos as a software product line – A platform
     for improving open hybrid-cloud systems resiliency, Softw.: Pract. Exp. 52 (7) (2022)
     1581–1614, http://dx.doi.org/10.1002/spe.3076.
[56] P. Raj, S. Vanga, A. Chaudhary, The observability, chaos engineering, and remediation for
     cloud-native reliability, in: Cloud-Native Computing: How To Design, Develop, and Secure
     Microservices and Event-Driven Applications, 2023, pp. 71–93,
     http://dx.doi.org/10.1002/9781119814795.ch4.
[57] M.A. Chang, B. Tschaen, T. Benson, L. Vanbever, Chaos monkey: Increasing sdn reliability
     through systematic network destruction, in: Proceedings of the 2015 ACM Conference on Special
     Interest Group on Data Communication, 2015, pp. 371–372.
[58] D. Savchenko, G. Radchenko, O. Taipale, Microservices validation: Mjolnirr platform case
     study, in: 2015 38th International Convention on Information and Communication Technology,
     Electronics and Microelectronics, MIPRO, 2015, pp. 235–240,
     http://dx.doi.org/10.1109/MIPRO.2015.7160271.
[59] G.S. Nadella, S.R. Addula, A.R. Yadulla, G.S. Sajja, M. Meesala, M.H. Maturi, K. Meduri,
     H. Gonaygunta, Generative AI-enhanced cybersecurity framework for enterprise data privacy
     management, Computers 14 (2) (2025) http://dx.doi.org/10.3390/computers14020055.
[60] D. Kikuta, H. Ikeuchi, K. Tajiri, Y. Nakano, ChaosEater: Fully automating chaos engineering
     with large language models, 2025, arXiv preprint arXiv:2501.11107. URL
     https://arxiv.org/abs/2501.11107.
[61] J. Owotogbe, Assessing and enhancing the robustness of LLM-based multi-agent systems through
     chaos engineering, in: 2025 IEEE/ACM 4th International Conference on AI Engineering –
     Software Engineering for AI, CAIN, 2025, pp. 250–252,
     http://dx.doi.org/10.1109/CAIN66642.2025.00039.
[62] A. Al-Said Ahmad, L.F. Al-Qora’n, A. Zayed, Exploring the impact of chaos engineering with
     various user loads on cloud native applications: An exploratory empirical study, Computing
     106 (2024) 2389–2425, http://dx.doi.org/10.1007/s00607-024-01292-z.
[63] K.A. Torkura, M.I. Sukmana, F. Cheng, C. Meinel, Security chaos engineering for cloud
     services: Work in progress, in: 2019 IEEE 18th International Symposium on Network Computing
     and Applications, NCA, 2019, pp. 1–3, http://dx.doi.org/10.1109/NCA.2019.8935046.

