Computer Standards & Interfaces 97 (2026) 104116
journal homepage: www.elsevier.com/locate/csi

Chaos experiments in microservice architectures: A systematic literature review

Emrah Esen a, Akhan Akbulut a, Cagatay Catal b,∗

a Department of Computer Engineering, Istanbul Kültür University, 34536, Istanbul, Turkey
b Department of Computer Science and Engineering, Qatar University, Doha 2713, Qatar

ARTICLE INFO

Keywords: Chaos engineering; Microservice; Systematic literature review

ABSTRACT

This study analyzes the implementation of Chaos Engineering in modern microservice systems. It identifies key methods, tools, and practices used to effectively enhance the resilience of software systems in production environments. In this context, our Systematic Literature Review (SLR) of 31 research articles has uncovered 38 tools crucial for carrying out fault injection methods, including Chaos Toolkit, Gremlin, and Chaos Machine. The study also explores the platforms used for chaos experiments and how centralized management of chaos engineering can facilitate the coordination of these experiments across complex systems. The evaluated literature reveals the efficacy of chaos engineering in improving the fault tolerance and robustness of software systems, particularly those based on microservice architectures. The paper underlines the importance of careful planning and execution in implementing chaos engineering and encourages further research in this field to uncover more effective practices for improving the resilience of microservice systems.

Contents

1. Introduction
2. Background
    2.1. Microservice architecture
    2.2. Microservice principles
    2.3. Challenges/Troubleshooting/Failures in microservice architecture
    2.4. Chaos engineering
3. Review protocol
    3.1. Research questions
    3.2. Search strategy
    3.3. Study selection criteria
    3.4. Study quality assessment
    3.5. Data extraction
    3.6. Data synthesis
4. Results
    4.1. Main statistics
    4.2. How is Chaos engineering effectively applied in production environments to enhance the resilience of software systems?
    4.3. Which platforms have been used for chaos experiments?
    4.4. How can Chaos engineering be effectively applied to microservice architecture to ensure successful implementation and enhance system resilience?
    4.5. To what extent can the centralized provision of Chaos engineering effectively facilitate the management of chaos experiments across complex systems?
    4.6. What are the challenges reported in the relevant papers?
5. Discussion
    5.1. General discussion
    5.2. Threats to validity
6. Conclusion
CRediT authorship contribution statement
Declaration of competing interest
Data availability
References

∗ Corresponding author.
E-mail address: ccatal@qu.edu.qa (C. Catal).

https://doi.org/10.1016/j.csi.2025.104116
Received 22 September 2024; Received in revised form 28 November 2025; Accepted 12 December 2025
Available online 15 December 2025
0920-5489/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

1. Introduction

In recent years, the adoption of microservice architecture has led to the transformation of application infrastructures into distributed systems. These systems are designed to enhance maintainability by decoupling services. The primary benefit of this architecture is the ease of maintenance of individual services within the microservice ecosystem due to their smaller and more modular nature [1]. However, despite these advantages, the distributed nature of microservices introduces significant challenges. Specifically, the complex management of services and their tight integration can considerably complicate software debugging. Debugging becomes complex in this architecture due to its distributed nature, the necessity to pinpoint the exact service causing the problem, and the dynamic characteristics of microservices. Consequently, debugging in microservice architecture demands a greater level of effort and specialized expertise compared to conventional monolithic architectures [2]. Moreover, it becomes quite challenging to predict what will happen if there is an unexpected error or if a service on the network goes out of service. Service outages can be caused by anything from a malicious cyberattack to a hardware failure to simple human error, and they can have devastating financial consequences. Although such unexpected situations are rare, they can interfere with the operation of distributed systems and devastatingly affect the live environment in which the application is located [3]. It is therefore necessary to detect weak points in the system before an error occurs and spreads to the entire system.

Microservice architecture applications undergo testing procedures to ensure their quality and dependability. These include unit testing, service testing, end-to-end testing, behavior-driven testing, integration testing, and regression testing [4]. The comprehensive approach to microservices testing also encompasses live testing strategies for complex systems [5]. This thorough process emphasizes different aspects such as the functionality, interoperability, and performance of individual services within the architecture. It aims to detect and resolve issues early to ensure stable and high-quality microservice applications [1,6]. However, because microservice applications consist of multiple services, the user experience should not be affected in cases such as network failures and sudden increases in service load. For example, if the microservice that adds a product to favorites on a shopping site fails or responds late, the user should still be able to continue shopping. Therefore, testing in production-like environments becomes inevitable. No matter how distributed or complex the system is, there is a need for a method to manage unforeseeable situations that can build trust in the system against unexpected failures. Chaos engineering is defined as the discipline of conducting experiments in a live environment to test or verify the reliability of software [7].

The primary objective of this research is to conduct a thorough investigation into how chaos experiments are performed in the widely used microservices-based systems of today. Microservice architectures have come to the forefront in modern software development processes due to their advantages such as flexibility, scalability, and rapid development. However, these architectures also bring unique challenges due to complex service dependencies and dynamic operational environments. This study aims to comprehensively address the methodologies, application scenarios, and impacts of chaos experiments conducted to test the resilience of microservice systems and identify potential weak points. The research intends to present the current state of chaos engineering practices by analyzing them, highlighting best practices, challenges faced, and solutions. In addition, it will assess the effectiveness of chaos experiments in enhancing the reliability and robustness of microservice systems by using data obtained from real-world scenarios to develop strategic recommendations. This study is a critical step in understanding the applicability and impact of chaos engineering within the complexity of microservice architectures and aims to make significant contributions to the body of knowledge in this field. Recent research has applied chaos engineering to this architectural style; however, a systematic overview of the state of the art on the use of chaos engineering in the microservice architecture is lacking. Therefore, a Systematic Literature Review (SLR) has been performed to provide an overview of how chaos engineering was applied.

This article primarily targets peer-reviewed research papers to maintain methodological consistency and ensure scholarly rigor. We specifically chose a systematic literature review (SLR) methodology because peer-reviewed academic studies are subject to rigorous validation processes, which enhance the reliability and validity of our findings [8,9]. Although excluding industry-specific grey literature may restrict certain practical perspectives, this choice was deliberately made to avoid potential biases and uphold the scientific integrity of our review [10,11]. However, future studies could broaden the scope to incorporate industrial case studies and practical experiences, which would enrich our understanding of chaos engineering’s applicability beyond the academic context.

The main contributions of this study are listed as follows:

1. To the best of our knowledge, this is the first study to employ a systematic literature review approach in the field of chaos engineering on microservice architecture applications [12]. The study provides an extensive systematic literature review of how chaos engineering can be applied to enhance the resilience of microservice architectures. It collates findings from various sources to provide insights into the current state of research and practice in this field.
2. The study categorizes and summarizes the range of chaos engineering tools and methods used in industry and academia, highlighting their functionalities in process/service termination, network simulation, load stressing, security testing, and fault injection within application code.
3. This research paper discusses contemporary techniques and approaches for implementing chaos engineering in microservice architectures. It also emphasizes the ongoing work in this field, offering a significant reference for future research endeavors. The paper systematically reviews existing literature to showcase how chaos engineering can enhance system resilience, laying a comprehensive groundwork for further exploration into chaos experimentation strategies and innovating new fault injection methods or tools within microservice architectures.

The rest of the paper is structured as follows: Section 2 explains the background and related work. Section 3 presents the methodology of the research. Section 4 presents the results, and Section 5 comprehensively discusses the presented answers to the research questions and validity threats. Lastly, the conclusion is presented in Section 6.

2. Background

The microservice approach breaks down a large application into a network of small, self-contained units, each running its own process and often communicating through web APIs. Unlike large, single-piece
monolithic systems, these small services are robust, easy to scale up or down, and can be updated individually using various programming languages and technologies. This structure allows development teams to be smaller and more agile, leading to faster updates and improvements. Yet, managing many interconnected services can become complicated, especially when something goes wrong. To enhance system reliability and resilience, a method known as chaos engineering is employed. This involves deliberately introducing problems into the live system to test its ability to cope and recover. This technique helps to uncover and rectify flaws, thereby making the system stronger overall. Regular and automated tests mimic real-life problems to ensure that the system can handle unexpected challenges and remain stable and efficient.

2.1. Microservice architecture

Microservice architectures have gained significant popularity in the software industry due to their ability to address the challenges and complexities of developing modern applications [6,13].

2.2. Microservice principles

Microservice architectures are based on the concept of decentralization, where each service is independently developed, deployed, and managed. This emphasizes autonomy and minimal inter-service dependencies. Each microservice is designed to focus on a single function or a closely related set of functions and supports technology heterogeneity by allowing different services to use the technology stacks that best suit their needs. Resilience is a core aspect, with services built to withstand failures without affecting the entire system, while scalability enables services to be scaled independently according to demand. Communication occurs through lightweight mechanisms like HTTP/REST APIs, supporting continuous delivery and deployment practices. Due to the distributed nature of microservice architecture, comprehensive monitoring and logging for observability become crucial. Additionally, there is often an alignment between the microservice architecture and the organizational structure, involving small cross-functional teams responsible for individual services [14].

It is helpful to compare the microservice architecture to the monolithic architecture. The main difference between them is the size of the developed applications. The microservice architecture can be thought of as developing an application as a suite of smaller services, rather than as a single, monolithic structure. Enterprise applications usually consist of three main parts: a client-side user interface (i.e., HTML pages and JavaScript running in a browser on the user’s machine), a database (i.e., many tables in a common, usually relational, database management system), and a server-side application. In the server-side application, HTTP requests are processed, business logic is executed, data is retrieved from and updated in the database, and HTML views are prepared and sent to the browser. This structure is a good example of a monolith. Any changes to the system involve creating and deploying a new version of the server-side application [15]. The cycles of change are interdependent. A change to a small part of the application requires rebuilding and deploying the entire monolith [6].

Microservice architecture, on the other hand, has some common features that distinguish it from monolithic architecture. These are componentization with services, organizing around job capabilities, smart interfaces and simple communication, decentralized governance, decentralized data management, infrastructure automation, and design for failure [16]. Today, although modern internet applications seem like a single application, they use microservice architectures behind the scenes. Microservice architecture basically refers to small, autonomous, and interoperable services. It has emerged due to increasing needs such as technology diversity, flexibility, scaling, ease of deployment, organization and management, and provides various advantages in these matters. Its advantages are described as follows [17]:

Technology heterogeneity. Microservices are treated as small services, each running independently and communicating with each other using open protocols. While monolithic applications are developed with a single programming language and database system, services included in a microservice ecosystem may each use a different programming language and database. This allows the advantages of each programming language and database to be used.

Resilience. When an error occurs in a monolithic application, the whole system is affected. In the microservice architecture, only the part under the responsibility of the relevant service is affected; the parts belonging to other services are not affected, and the user experience continues.

Scalability. While scaling a monolithic application covers the entire application, in applications developed with microservice architecture only the services that are under heavy load can be scaled. This avoids extra resource costs for parts that do not need to be scaled and improves the user experience.

Deployment. Microservice architecture facilitates the autonomous deployment of individual services, enabling updates or changes without impacting others. Various deployment strategies, including blue–green, canary, and rolling deployment, minimize disruptions during the deployment process [18]. As a result, microservice architecture provides increased flexibility and resilience in deployment, distinguishing it from monolithic applications.

Organizational alignment. In software development processes, some challenges may be encountered due to large teams and large pieces of code. These challenges can be made more manageable by establishing smaller teams. At the same time, this indicates that microservice applications allow us to form smaller and more cohesive teams. Each team is responsible for its own microservice and can take action by making improvements if necessary.

2.3. Challenges/Troubleshooting/Failures in microservice architecture

Microservice architectures pose numerous challenges. As the number of services increases, the complexity of service interactions also grows. Reliance on network communication leads to latency and network failure issues, while ensuring data consistency across multiple databases requires careful design and implementation of distributed transactions or eventual consistency models. Microservices bring typical distributed system challenges such as handling partial failures, dealing with latency and asynchrony, complex service discovery, load balancing in dynamic scaling environments, and managing configurations across multiple services and environments. Security concerns are heightened due to the increased surface area of inter-service communication. Testing becomes more complex, involving individual service testing along with testing of service interactions; deployment is challenging, especially when there are dependencies between services; effective observability and monitoring become crucial for timely issue resolution; version management is critical for maintaining system stability; lastly, assembling skilled teams proficient in DevOps, cloud computing, and programming languages presents a significant challenge. Microservice architecture thus involves various challenges, troubleshooting demands, and failure modes. While adopting a distributed architecture enhances modularity, it inherently introduces operational complexities that differ significantly from monolithic structures. Recent research has also explored the use of hybrid bio-inspired algorithms to optimize this process dynamically. For instance, the Hybrid Kookaburra–Pelican Optimization Algorithm has been shown to improve load distribution and system scalability in cloud and microservice-based environments [19].

In conclusion, while microservices offer numerous advantages such as improved scalability, flexibility, and agility, they also introduce significant challenges in terms of system complexity, operational demands, and the need for skilled personnel and sophisticated tooling [20].

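The partial-failure and latency challenges described above can be made concrete with a small sketch. The following Python example is only an illustration under stated assumptions, not a method taken from the reviewed studies: the function names (`call_with_fallback`, `render_product_page`) and the simulated favorites services are hypothetical, and the timeout value is arbitrary. It shows one common way to keep a page usable when a non-critical dependency fails or responds late.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for outbound calls (pool size is an arbitrary assumption).
_POOL = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(remote_call, fallback, timeout_s=0.2):
    """Run remote_call; return fallback if it raises or exceeds timeout_s."""
    future = _POOL.submit(remote_call)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the call has not started yet; a call that is
        # already running keeps its worker thread until it finishes.
        future.cancel()
        return fallback
    except Exception:
        # Any error raised by the dependency is treated as a partial failure.
        return fallback


# Hypothetical stand-ins for a "favorites" microservice in three conditions.
def healthy_favorites():
    return ["book-42"]


def slow_favorites():
    time.sleep(0.6)  # simulated latency well above the timeout
    return ["book-42"]


def broken_favorites():
    raise ConnectionError("favorites service unreachable")


def render_product_page(favorites_call):
    """Build the page; favorites are optional, so they degrade to an empty list."""
    favorites = call_with_fallback(favorites_call, fallback=[])
    return {"product": "book-42", "favorites": favorites, "page_ok": True}
```

With these stand-ins, `render_product_page(slow_favorites)` and `render_product_page(broken_favorites)` both return a page with an empty favorites list instead of failing, which is the kind of behavior a chaos experiment would try to confirm.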
2.4. Chaos engineering

“Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production-like environment” [7,21]. It is the careful and planned execution of experiments to show how the distributed system will respond to a failure. It is necessary for large-scale software systems because it is practically impossible to simulate real events in test environments. With chaos engineering, experiments are designed around real events [22]. By analyzing the test results, improvements are made where necessary, with the aim of increasing the reliability of the software in the production environment.

Thanks to an experimental and systems-based approach, confidence is established in the survivability of these systems during collapses. Canary analysis collects data on how distributed systems react to failure scenarios by observing their behavior in abnormal situations and performing controlled experiments [23]. This method involves applying new updates or changes to a specific part of the system, enabling early detection of potential problems before they affect the system on a larger scale. Chaos experiments are based on the following principles [24,25]:

• Hypothesize steady state: The first step is to hypothesize the steady state of the system under normal conditions.
• Vary real-world events: The next step is to vary real-world events that can cause turbulence in the system.
• Run experiments in production: Experimenters should run the experiments in a production-like environment to simulate real-world conditions.
• Automate experiments to run continuously: Experimenters should automate the experiments to run continuously, ensuring that the system can withstand turbulence over time.
• Minimize blast radius: The experiments should be designed to minimize the blast radius, i.e., the impact of the experiment on the system should be limited to a small area.
• Analyze results: Experimenters should analyze the results of the experiments to determine the system’s behavior under turbulent conditions.
• Repeat experiments: The experiments should be repeated to ensure that the system can consistently withstand turbulence. When the experiment is finished, information about the actual effect will be provided to the system.

3. Review protocol

Systematic review studies must be conducted using a well-defined and specific protocol. To conduct a systematic review study, all studies on a particular topic must be examined [12]. We followed the systematic review process shown in Fig. 1 and took all the steps to reduce the risk of bias in this study. Multiple reviewers were involved in the SLR process, and in cases of conflict, a brief meeting was organized to facilitate consensus. The first step was to define the research questions. Then, the most appropriate databases were selected. Based on the selected databases, automated searches were conducted and several articles were identified. Selection criteria were then established to determine which studies should be included in and excluded from this research. The titles and abstracts of all studies were reviewed. In cases of doubt, the full text of the publication was reviewed. Then, after the studies were analyzed in detail, the selection criteria were applied. All selected studies were assessed using a quality assessment process. Subsequently, the results were synthesized, listed, and summarized in a clear and understandable manner.

3.1. Research questions

Research Questions (RQs) and their corresponding motivations are presented as follows:

• RQ1: How is Chaos engineering effectively applied in production environments to enhance the resilience of software systems?
  Motivation: Understanding the practical implementation of Chaos engineering in production environments is crucial for ensuring the resilience of software systems under real-world operating conditions.
• RQ2: Which platforms have been used for Chaos experiments?
  Motivation: Identifying the platforms provides insights into the technological landscape and tools available for conducting Chaos engineering practices.
• RQ3: How is Chaos engineering effectively applied to microservice architectures to ensure its successful implementation in enhancing system resilience?
  Motivation: Microservice architectures introduce new challenges in system design. Exploring the application of Chaos engineering in this context can help improve the resilience and fault tolerance of microservice systems.
• RQ4: To what extent can the centralized provision of Chaos engineering effectively facilitate the management of Chaos experiments across complex systems?
  Motivation: Understanding the feasibility of providing Chaos engineering as a centralized service enables organizations to coordinate Chaos experiments across complex systems.
• RQ5: What are the challenges reported in the relevant papers?
  Motivation: Identifying these challenges provides valuable insights into overcoming obstacles and advancing the adoption of Chaos engineering practices.

3.2. Search strategy

The primary studies were carefully selected from the papers published between 2010 and 2022 because the topic has become relevant only in recent years. The databases are IEEE Xplore, ACM Digital Library, ScienceDirect, Springer, Wiley, MDPI, and Scopus. The initial search involved reviewing the titles, abstracts, and keywords of the studies identified in the databases. The search results obtained from the databases were stored in the data extraction form using a spreadsheet tool. Furthermore, this systematic review was conducted collaboratively by three authors.

The following search string was used to broaden the search scope:

((chaos engineering) OR (chaos experiments)) OR (microservices)

The results of the searches made in the databases mentioned above are shown in Fig. 2.

3.3. Study selection criteria

After applying the exclusion and inclusion criteria, 55 articles were obtained. The exclusion criteria in our study are as follows:

• EC-1: Duplicate papers from multiple sources
• EC-2: Papers without full-text availability
• EC-3: Papers not written in English
• EC-4: Survey papers
• EC-5: Papers not related to Chaos engineering

The inclusion criteria in our study are as follows:

• IC-1: Primary papers discussing the use of Chaos experiments in a microservice architecture
• IC-2: Primary publications that focus on Chaos engineering

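The selection criteria above can be read as a simple screening procedure. The sketch below is a hypothetical illustration only: the paper does not describe any screening tool, the `Paper` fields and sample records are invented for the example, and two points are assumptions rather than statements from the protocol (exclusion criteria are checked in the listed order, and a paper is kept if it satisfies at least one inclusion criterion).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # All fields are hypothetical screening attributes filled in by a reviewer.
    title: str
    has_full_text: bool
    language: str
    is_survey: bool
    about_chaos_engineering: bool
    is_primary: bool
    chaos_in_microservices: bool


def exclusion_reason(paper, seen_titles):
    """Return the first matching exclusion code (EC-1..EC-5), or None."""
    if paper.title.lower() in seen_titles:
        return "EC-1"  # duplicate from another source
    if not paper.has_full_text:
        return "EC-2"
    if paper.language != "en":
        return "EC-3"
    if paper.is_survey:
        return "EC-4"
    if not paper.about_chaos_engineering:
        return "EC-5"
    return None


def inclusion_codes(paper):
    """Return the inclusion criteria (IC-1, IC-2) that the paper satisfies."""
    codes = []
    if paper.is_primary and paper.chaos_in_microservices:
        codes.append("IC-1")
    if paper.is_primary and paper.about_chaos_engineering:
        codes.append("IC-2")
    return codes


def screen(papers):
    """Split papers into kept ones and (title, reason) pairs for rejected ones."""
    seen, kept, rejected = set(), [], []
    for paper in papers:
        reason = exclusion_reason(paper, seen)
        seen.add(paper.title.lower())
        if reason is None and not inclusion_codes(paper):
            reason = "no inclusion criterion met"
        if reason is None:
            kept.append(paper)
        else:
            rejected.append((paper.title, reason))
    return kept, rejected


# Invented sample records, used only to exercise the functions above.
sample = [
    Paper("Chaos in service meshes", True, "en", False, True, True, True),
    Paper("Chaos in service meshes", True, "en", False, True, True, True),
    Paper("A survey of resilience", True, "en", True, True, False, False),
    Paper("Monolith refactoring", True, "en", False, False, True, False),
]
kept, rejected = screen(sample)
```

Recording a reason code for every rejected paper, as `screen` does, makes it straightforward to report how many papers were removed at each stage.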

Fig. 1. SLR review protocol. Source: Adapted from [26–28].

Fig. 2. Distribution of selected papers per database.

3.4. Study quality assessment

The assessment of each study's quality is an indicator of the strength of evidence provided by the systematic review. The quality of studies was assessed using various questions. Studies of poor quality were not included in the present study. These criteria, based on quality instruments, were adopted from guidelines and other SLR research [12]. The following questions were used to assess the quality of the studies.

• Q1. Are the aims of the study clearly stated?
• Q2. Are the scope and experimental design of the study clearly defined?
• Q3. Is the research process documented adequately?
• Q4. Are all the study questions answered?
• Q5. Are the negative findings presented?
• Q6. Do the conclusions relate to the aim of the study and are they reliable?

In this study, considering all these criteria, a general quality assessment was performed for each paper. The rating was 2 points for the "yes" option, 0 points for the "no" option, and 1 point for the "somewhat" option. The decision threshold for classifying the paper as poor quality was determined based on the mean value, which corresponds to a total of 5 points.

Fig. 2 presents the distribution of papers based on databases where they were found at different selection stages. After the initial search, 4520 papers were retrieved, of which 55 remained after applying the selection criteria. After quality assessment, 31 papers were selected as primary studies. The 55 papers were carefully read in full and the required data for answering the research questions were extracted. All the collected articles are listed in Table 1.

3.5. Data extraction

Data required for answering the research questions were extracted from the selected articles. A data extraction form was created for this purpose. The form consists of several metadata fields, such as the author's first and last name, the title of the study, the publication year, and the type of study. In addition to this metadata, several columns were created to store the required information related to the research questions. By employing a data extraction form, we ensured that the relevant data required to answer each research question were systematically captured from the selected publications. This approach facilitated the subsequent synthesis of the findings. The data extraction process involved meticulous attention to detail and ensured the reliability and integrity of the data used in our systematic literature review.
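
The scoring scheme of the quality assessment (2, 1, or 0 points for each of the questions Q1 to Q6, with a threshold of 5 points) can be sketched in Python as follows. The text does not state whether a total of exactly 5 points passes, so treating only totals below 5 as poor quality is an assumption of this sketch, and the function names are illustrative.

```python
from __future__ import annotations

# Points per answer option, as described in the quality assessment.
SCORES = {"yes": 2, "somewhat": 1, "no": 0}
QUESTIONS = ("Q1", "Q2", "Q3", "Q4", "Q5", "Q6")

# Threshold reported in the text. Treating "below 5" as poor quality is an assumption.
THRESHOLD = 5


def quality_score(answers: dict[str, str]) -> int:
    """Sum the points over Q1..Q6; every question must be answered."""
    missing = [q for q in QUESTIONS if q not in answers]
    if missing:
        raise ValueError(f"unanswered questions: {missing}")
    return sum(SCORES[answers[q]] for q in QUESTIONS)


def is_poor_quality(answers: dict[str, str]) -> bool:
    """True if the total score falls below the threshold."""
    return quality_score(answers) < THRESHOLD
```

With six questions the total ranges from 0 to 12, so a paper answering "yes" to two questions and "somewhat" to one reaches exactly the threshold.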

Table 1
Selected primary studies.
ID | Reference | Title | Year | Database
S1 | [29] | Automating Chaos Experiments in Production | 2019 | ACM
S2 | [25] | Getting Started with Chaos Engineering—design of an implementation framework in practice | 2020 | ACM
S3 | [30] | Human-AI Partnerships for Chaos Engineering | 2020 | ACM
S4 | [31] | 3MileBeach: A Tracer with Teeth | 2021 | ACM
S5 | [32] | Service-Level Fault Injection Testing | 2021 | ACM
S6 | [33] | A Platform for Automating Chaos Experiments | 2016 | IEEE Xplore
S7 | [34] | Automated Fault-Tolerance Testing | 2016 | IEEE Xplore
S8 | [35] | Gremlin: Systematic Resilience Testing of Microservices | 2016 | IEEE Xplore
S9 | [36] | Fault Injection Techniques - A Brief Review | 2018 | IEEE Xplore
S10 | [37] | ORCAS: Efficient Resilience Benchmarking of Microservice Architectures | 2018 | IEEE Xplore
S11 | [38] | The Business Case for Chaos Engineering | 2018 | IEEE Xplore
S12 | [39] | Use of Self-Healing Techniques to Improve the Reliability of a Dynamic and Geo-Distributed Ad Delivery Service | 2018 | IEEE Xplore
S13 | [40] | Security Chaos Engineering for Cloud Services: Work In Progress | 2019 | IEEE Xplore
S14 | [41] | A Framework of Virtual War Room and Matrix Sketch-Based Streaming Anomaly Detection for Microservice Systems | 2020 | IEEE Xplore
S15 | [42] | CloudStrike: Chaos Engineering for Security and Resiliency in Cloud Infrastructure | 2020 | IEEE Xplore
S16 | [43] | Identifying and Prioritizing Chaos Experiments by Using Established Risk Analysis Techniques | 2020 | IEEE Xplore
S17 | [44] | Fitness-guided Resilience Testing of Microservice-based Applications | 2020 | IEEE Xplore
S18 | [24] | A Chaos Engineering System for Live Analysis and Falsification of Exception-Handling in the JVM | 2021 | IEEE Xplore
S19 | [45] | A Study on Chaos Engineering for Improving Cloud Software Quality and Reliability | 2021 | IEEE Xplore
S20 | [46] | Chaos Engineering for Enhanced Resilience of Cyber–Physical Systems | 2021 | IEEE Xplore
S21 | [47] | ChaosTwin: A Chaos Engineering and Digital Twin Approach for the Design of Resilient IT Services | 2021 | IEEE Xplore
S22 | [48] | Platform Software Reliability for Cloud Service Continuity—Challenges and Opportunities | 2021 | IEEE Xplore
S23 | [49] | Trace-based Intelligent Fault Diagnosis for Microservices with Deep Learning | 2021 | IEEE Xplore
S24 | [50] | A Guided Approach Towards Complex Chaos Selection, Prioritization and Injection | 2022 | IEEE Xplore
S25 | [51] | Chaos Driven Development for Software Robustness Enhancement | 2022 | IEEE Xplore
S26 | [22] | Maximizing Error Injection Realism for Chaos Engineering With System Calls | 2022 | IEEE Xplore
S27 | [52] | On Evaluating Self-Adaptive and Self-Healing Systems using Chaos Engineering | 2022 | IEEE Xplore
S28 | [53] | Observability and chaos engineering on system calls for containerized applications in Docker | 2021 | ScienceDirect
S29 | [54] | Scalability resilience framework using application-level fault injection for cloud-based software services | 2022 | Springer
S30 | [55] | Chaos as a Software Product Line—A platform for improving open hybrid-cloud systems resiliency | 2022 | Wiley
S31 | [56] | The Observability, Chaos Engineering, and Remediation for Cloud-Native Reliability | 2022 | Wiley

3.6. Data synthesis

To answer the research questions, the data obtained are collected and summarized in an appropriate manner, which is called data synthesis. To perform the data synthesis, a qualitative analysis process was conducted on the data obtained. For instance, synonyms used for different categories were identified and merged in the respective fields. This comprehensive data synthesis approach allowed us to derive insights and draw conclusions from the collected information.

4. Results

The results section of the paper provides various insights into how chaos engineering is applied in production environments, particularly its use in improving the resilience and reliability of microservice architecture applications. The section discusses how fault detection is developed using chaos engineering tools and is mainly used in production for troubleshooting. Chaos experiments are usually conducted in the production environment to provide realistic results. The section further enumerates several tools that have been used for chaos experiments and discusses general principles such as defining a steady state, forming a hypothesis, conducting the experiment, and proving or refuting the hypothesis. These principles and tools help detect problems like hardware issues, software errors, network interruptions, security vulnerabilities, and configuration mistakes within their respective contexts.

4.1. Main statistics

Fig. 3 shows the results of the quality assessment. The distribution of the years of publication is shown in Fig. 4. Most of the selected studies (22 of 31, see Table 1) were published between 2020 and 2022. This shows that researchers' interest in chaos engineering has increased in recent years. Most of the studies included were indexed in the IEEE Xplore database.

Fig. 5 presents the distribution of the type of publications and the corresponding databases. While there are many journal papers, conference proceedings also appear in the selected papers.

Chaos engineering involves several categories of functionality that serve distinct purposes in resilience testing. The first category involves intentionally terminating processes or services to evaluate system behavior and recovery from failures [7]. Another category is network simulation, which allows engineers to replicate adverse network conditions to assess system performance and reliability [25]. In the Stressing Machine category, engineers subject the system to extreme loads to identify limits and potential bottlenecks [7]. In security testing, engineers simulate breaches or attacks to assess the system's response and enhance defenses [7]. Lastly, engineers use fault application code to inject targeted faults or errors into the codebase, assessing system resilience and error-handling capabilities [24]. These categories help organizations proactively identify weaknesses, strengthen system robustness, and enhance reliability in complex technology landscapes [7]. Functionality categories of tools are presented in Fig. 6.

The tools utilized in industry settings are not comprehensively addressed in articles. To provide insights for future research, the identified tools from the additional examination were categorized based on their functionality, as presented in Tables 2 and 3. Table 2 displays the tools obtained from the selected studies, while Table 3 presents additional tools that have been examined. A reference next to a tool indicates that the tool appears in the cited article.

4.2. How is Chaos engineering effectively applied in production environments to enhance the resilience of software systems?

Table 4 examines the successful implementation of Chaos Engineering in operational settings, covering different aspects such as goals, techniques and resources, guiding principles, findings, limitations and substitutes, as well as the general strategy.

4.3. Which platforms have been used for chaos experiments?

Table 5 provides a concise summary of various tools and platforms used in Chaos experiments, along with their specific functionalities or characteristics. It offers comprehensive insights into each platform through detailed descriptions accompanied by the necessary references.

Fig. 3. Quality assessment scores.

Fig. 4. Year of publication.

Fig. 5. Diagram of the distribution of studies per search database.

Fig. 6. Functionality of chaos engineering tools.

Table 2
Chaos engineering tools from studies.
Chaos engineering tool | Termination | Network simulating | Stressing machine | Security | Fault application code
Chaos Monkey [57] ×
Gremlin [35] × × × × ×
Chaos Toolkit [45] × × × × ×
Pumba [55] × ×
LitmusChaos [45] × × × ×
ToxiProxy [45] × ×
PowerfulSeal [45] × × × ×
Pod Reaper [25] ×
Netflix Simian Army [36] × × ×
WireMock [25] × ×
KubeMonkey [25] × × ×
Chaosblade [45] × × ×
ChaosTwin [47] × × × ×
Chaos Machine [24] × × ×
Cloud Strike [42] ×
Phoebe [22] ×
Mjolnirr [58] ×
ChaosOrca [37] × × ×
3MileBeach [31] × ×
Muxy [25] × × ×
Blockade [25] ×
Chaos Lambda [25] × ×
Byte-Monkey [25] ×
Turbulence [25] × × ×
Cthulhu [25] × × × ×
Byteman [25] × ×
ChaosCube [55] ×
Chaos Lemur [25] ×
Chaos HTTP Proxy [25] ×
Chaos Mesh [45] × × ×
Istio Chaos [45] ×
ChAP [33] × ×
IntelliFT [44] × × × ×

Table 3
Chaos engineering tools from our search.
Chaos engineering tool | Termination | Network simulating | Stressing machine | Security | Fault application code
Pod Chaos X X X
DNS Chaos X
AWS Chaos X X X
Azure Chaos X X X X
GCP Chaos X X X X

Table 4
Chaos engineering in production environments.
Category | Description
Objective | The primary objective of applying chaos engineering in production environments is to enhance the resilience of software systems. This involves troubleshooting to identify and address potential malfunctions before they occur. The overarching goal is to minimize issues in production through the use of chaos engineering tools, enabling automatic fault detection [24,53].
Methods and tools | Chaos engineering relies on specific tools to facilitate its effective application in production environments. These tools aid in automatic fault detection, a crucial aspect of troubleshooting to minimize potential issues in the production environment [24,53].
Principles and considerations | The effective application of chaos engineering is closely tied to key principles and considerations. These include continuous experimentation, serving as a form of robustness testing conducted in real-world operational conditions. Fundamental principles of Chaos Experiments involve defining a steady state, hypothesizing about its impact, conducting the experiment, and then demonstrating or refuting the hypothesis [53].
Insights and results | Chaos experiments conducted in the production environment provide valuable insights into the behavior of the system. This is particularly significant as the production environment may exhibit unpredictable behavior that differs from staging environments in some cases [24].
Constraints and alternatives | While conducting chaos experiments in production is ideal, it is acknowledged that legal or technical constraints may sometimes prevent this. In such cases, an alternative approach is considered, starting chaos experiments in a staging environment and gradually transitioning to the production environment [25].
Overall approach | The overall approach for the effective application of chaos engineering in production environments involves the systematic execution of chaos experiments. This includes leveraging chaos engineering tools and taking into account the constraints and challenges associated with conducting experiments in real-world operational settings. The aim is to proactively identify and address potential issues before they impact the production environment, ultimately enhancing the resilience of software systems.
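
The four principles named in Table 4 (define a steady state, form a hypothesis, conduct the experiment, demonstrate or refute the hypothesis) can be made concrete with a small Python sketch. Everything in it is hypothetical: `run_experiment`, `ExperimentResult`, and `ToyService` are names invented for this illustration, and the toy service stands in for a real system and its monitoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ExperimentResult:
    baseline: float          # steady-state metric before the fault
    observed: float          # same metric while the fault is active
    hypothesis_holds: bool   # True if the deviation stayed within the tolerance


def run_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
) -> ExperimentResult:
    # 1. Define the steady state by measuring the metric under normal conditions.
    baseline = measure()
    # 2. Hypothesis: the metric stays within `tolerance` of the baseline under the fault.
    # 3. Conduct the experiment; the rollback runs even if the measurement fails.
    inject_fault()
    try:
        observed = measure()
    finally:
        rollback()
    # 4. Demonstrate or refute the hypothesis.
    return ExperimentResult(baseline, observed, abs(observed - baseline) <= tolerance)


class ToyService:
    """Stand-in for a replicated service that needs two replicas to serve all requests."""

    def __init__(self, replicas: int = 3) -> None:
        self.replicas = replicas

    def success_rate(self) -> float:
        return 1.0 if self.replicas >= 2 else 0.5

    def kill(self, count: int) -> None:
        self.replicas = max(0, self.replicas - count)

    def restore(self, replicas: int = 3) -> None:
        self.replicas = replicas
```

For example, `run_experiment(svc.success_rate, lambda: svc.kill(1), svc.restore, tolerance=0.01)` on a fresh `ToyService` confirms the hypothesis, whereas killing two replicas refutes it. Placing the rollback in a `finally` block reflects the concern that an experiment should not leave the system degraded.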

Table 5
Chaos engineering tools identified from selected papers.
Platform/Tool | Description
The Chaos Machine | A tool for conducting chaos experiments at the application level on Java Virtual Machine (JVM), using exception injection to analyze try-catch blocks for error processing [24].
Screwdriver | An automated fault-tolerance testing tool for on-premise applications and services, creating realistic error models and collecting metrics by injecting errors into the system [34].
Chaos Monkey | Designed by Netflix, this tool tests the system's resilience by randomly killing partitions to check system functionality [7,45].
Cloud Strike | A security chaos engineering system for multi-cloud security, extending chaos engineering to security by injecting faults impacting confidentiality, integrity, and availability [42].
Chaos Mesh | An open-source chaos engineering platform for testing the resilience and reliability of distributed systems by intentionally injecting failures and disruptions [55].
PowerfulSeal | An open-source tool for testing the resilience of Kubernetes clusters by simulating real-world failures and disruptions [55].
IntelliFT | A feedback-based, automated failure testing technique for microservice applications, focusing on exposing defects in fault-handling logic [44].
The Chaos Toolkit | Open-source software that runs experiments against the system to confirm a hypothesis [25,55].
Phoebe | A fault injection framework for reliability analysis concerning system call invocation errors, enabling full observability of system call invocations and automatic experimentation [22].
Mjolnirr | A private cloud platform with a built-in Chaos Monkey service for developing private PaaS cloud infrastructure [58].
ChaosOrca | A tool for Chaos engineering on containers, perturbing system calls for processes inside containers and monitoring their effects [37].
Gremlin | Offered as a SaaS technology, Gremlin tests system resilience on various parameters and conditions, with capabilities for automation and integration with Kubernetes clusters and public clouds [35].
3MileBeach | A distributed tracing and fault injection framework for microservices, enabling chaos experiments through message serialization library manipulation [31].
ChAP | A software platform for running automated chaos experiments, simulating various failure scenarios and providing insights into system behavior under stress [29,33].
ChaosTwin | Utilizes a digital twin approach in Chaos Engineering to mitigate impacts of unforeseen events, constructing models across workload, network, and service layers [47].
LitmusChaos | An open-source cloud-native framework for Chaos Engineering in Kubernetes environments, offering a range of chaos experiments and workflows [50].
Filibuster | A testing method in chaos engineering that introduces errors into microservice architecture to validate resilience and error tolerance [32].
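
Several entries in Table 5 rely on application-level exception injection to exercise error-handling code. The Python sketch below illustrates that general idea only; it is not the implementation of any listed tool (the Chaos Machine, for instance, targets the JVM), and `InjectedFault`, `inject_exception`, `price_lookup`, and `price_with_fallback` are hypothetical names chosen for this example.

```python
import functools
import random


class InjectedFault(RuntimeError):
    """Marks exceptions raised on purpose by the fault injector."""


def inject_exception(probability: float, rng: random.Random):
    """Decorator factory: raise InjectedFault before the call with the given probability."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def price_lookup(item: str) -> float:
    """Toy dependency that the caller wants to be resilient against."""
    return {"book": 12.5}.get(item, 0.0)


def price_with_fallback(lookup, item: str, default: float = -1.0) -> float:
    """Caller under test: its except block is what the experiment exercises."""
    try:
        return lookup(item)
    except InjectedFault:
        return default
```

Wrapping the dependency with `inject_exception(1.0, random.Random(0))(price_lookup)` forces the fallback path on every call, while a probability of 0.0 leaves the call untouched. Passing an explicit `random.Random` instance keeps experiments reproducible.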

Table 6
Chaos engineering in microservices: approaches, descriptions, and expected outcomes.
Approach | Description | Expected impact
Fault injection testing | This method involves intentionally introducing errors into the system to assess its response, particularly in microservices by simulating various failure modes such as network issues, service outages, or resource shortages within or between microservices, to evaluate the system's resilience and stability [52]. | Evaluating and enhancing the system's resilience and stability.
Hypothesis-driven experiments | Key to chaos engineering is conducting experiments based on well-defined hypotheses about the normal state of the system and its expected behavior during failure scenarios. This strategic approach enables focused experiments that assess the resilience of both individual microservices and the overall system [45,53]. | Identifying system weaknesses and increasing resilience.
Blast radius management | Managing the "blast radius" of experiments is crucial in microservices. It involves understanding the potential impact of introduced failures, starting with small experiments and then expanding, to manage failure impacts while identifying system vulnerabilities [45]. | Better understanding and enhancing the system's resilience.
Resilience requirement elicitation | Utilizing chaos engineering to determine and analyze the resilience requirements of microservice architectures. This process involves observing the system's response to induced faults to identify specific resilience needs of each microservice and their interactions [52]. | Understanding specific resilience needs of each microservice and their interactions.
Continuous testing and improvement | Regularly conducting chaos experiments as part of an ongoing testing process ensures that microservices remain resilient against unforeseen issues. This continuous approach aids in proactively finding and fixing potential system weaknesses [56]. | Proactive identification and resolution of system weaknesses, leading to continual improvement and increased resilience.
Observability and remediation | Integrating chaos engineering with observability tools enhances the monitoring of microservices during fault injection, allowing for real-time tracking of responses to failures, aiding in the development of effective remediation strategies and overall system resilience improvement [56]. | Real-time tracking of responses to failures and development of effective remediation strategies for overall system resilience improvement.
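
The blast radius management approach in Table 6 (start with small experiments and then expand) might look like the following Python sketch. The function name, the fractions, and the `experiment` callback are assumptions for illustration; a real setup would target live instances and evaluate a steady-state check instead of a toy predicate.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence


def expand_blast_radius(
    instances: Sequence[str],
    fractions: Sequence[float],
    experiment: Callable[[Sequence[str]], bool],
) -> tuple[float, Optional[float]]:
    """Run `experiment` on growing fractions of `instances`.

    `experiment` returns True when the system stayed healthy for the given targets.
    Returns (largest fraction that passed, first fraction that failed or None).
    """
    largest_safe = 0.0
    for fraction in sorted(fractions):
        # Always target at least one instance, even for very small fractions.
        count = max(1, int(len(instances) * fraction))
        if not experiment(instances[:count]):
            return largest_safe, fraction  # stop expanding at the first failure
        largest_safe = fraction
    return largest_safe, None
```

With ten instances and fractions of 0.1, 0.25, 0.5, and 1.0, the experiment touches 1, 2, 5, and 10 instances in turn and halts as soon as one step fails, so a weakness is found with the smallest scope that exposes it.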

4.4. How can Chaos engineering be effectively applied to microservice architecture to ensure successful implementation and enhance system resilience?

Table 6 provides a comprehensive overview of the different facets and projected implications of implementing chaos engineering within microservice architecture.

By implementing these approaches and strategies, organizations can effectively integrate chaos engineering into their microservice architectures to uncover vulnerabilities and enhance the overall dependability of their systems.

4.5. To what extent can the centralized provision of Chaos engineering effectively facilitate the management of chaos experiments across complex systems?

Table 7 provides an overview of the ways in which centralized chaos engineering can simplify experiment management in intricate systems. It emphasizes advantages like standardization, resource utilization, risk mitigation, and more, resulting in enhanced system resilience and performance.

4.6. What are the challenges reported in the relevant papers?

Table 8 concisely presents the primary obstacles in the area of chaos engineering and their respective resolutions. These obstacles encompass system intricacy, hazards to live environments, resource demands, security issues, and automation complexities. The proposed resolutions involve phased implementation, risk assessment, knowledge enhancement, robust security protocols, and automation approaches.

5. Discussion

In the discussion section, we summarize the answers to the research questions. These answers indicate that chaos engineering can improve robustness by simulating real-world failure scenarios and exploring system reactions, especially in microservice architectures. Various tools for implementing chaos engineering were listed and compared. The discussion concludes that the application of chaos engineering requires careful planning due to inherent challenges but has the potential to greatly improve system resilience.

5.1. General discussion

In this article, we reviewed the literature on the application of chaos engineering in microservice architecture to understand the state-of-the-art. For this purpose, six research questions were defined and answered.

In RQ1, we aimed to understand how chaos engineering is applied to production environments. Chaos engineering, when adeptly applied in production settings, serves as a pivotal tool for augmenting the robustness of software systems. This approach entails conducting deliberate and controlled chaos experiments within the production environment, a strategy that is instrumental in uncovering and rectifying potential issues before they escalate into full-blown system failures, thereby bolstering system uptime [38]. Moreover, chaos engineering is characterized by the intentional injection of faults into systems. This methodology is crucial for identifying and addressing security flaws and risks, laying the groundwork for the development of resilient application architectures [56]. By replicating adverse conditions that could naturally arise in production settings, chaos engineering helps detect inherent system vulnerabilities and structural deficiencies, fostering a proactive stance towards issue mitigation [38].

Additionally, this practice involves comprehensive testing of real-world scenarios on operational systems. Such testing is vital for assessing the complete spectrum of software systems, encompassing both hardware malfunctions and software glitches, within their actual deployment contexts. This approach significantly contributes to the enhancement of overall system resilience [38]. To effectively implement chaos engineering, it is recommended to initiate with less complex experiments, leverage automation for these experiments, and focus on areas with either high impact or high frequency of issues. Observing the system at its limits is also crucial for reinforcing resilience [25].

In RQ2, we discuss various platforms that aim to increase the flexibility and reliability of microservice architectures through chaos experiments. Tools like Gremlin, Chaos Monkey, Chaos Toolkit, Pumba, LitmusChaos, ToxiProxy, and PowerfulSeal have been utilized in industry settings to simulate different failure scenarios. These tools provide functions such as terminating processes, simulating network conditions, applying stress tests, testing security measures, and injecting faults to proactively identify weaknesses and strengthen system robustness across different technology landscapes.

Table 7
Centralized provision in chaos engineering.
Approach | Description | Expected impact
Standardization | Centralized provision allows for the standardization of chaos engineering practices and tools across the organization. This ensures that all teams follow consistent processes and use approved tools, leading to better coordination and more reliable results [42]. | Improved coordination and reliability of results.
Resource optimization | Centralized provision enables efficient allocation of resources for chaos experiments. It allows pooling of expertise, tools, and infrastructure, reducing redundancy and optimizing resource utilization [38]. | Enhanced resource utilization and reduced redundancy.
Risk management | Centralized provision facilitates better risk management by providing oversight and governance for chaos experiments. It establishes clear guidelines, safety measures, and expected states for running experiments in production environments, ensuring controlled experimentation [42]. | Controlled experimentation and effective risk management.
Automation and continuous testing | Centralized provision supports the automation of chaos experiments to run continuously. This ensures regular conduction of experiments, leading to ongoing validation of system resilience and identification of potential issues before they manifest as outages [38,42]. | Ongoing validation of system resilience and early identification of potential issues.
Knowledge sharing and collaboration | A centralized approach encourages knowledge sharing and collaboration among teams. It facilitates the dissemination of best practices, lessons learned, and successful experiment designs, fostering a culture of continuous improvement and shared learning [25]. | Promotion of a continuous improvement culture and shared learning.
Performance metrics and analysis | Centralized provision enables the establishment of standardized performance metrics and analysis methods for chaos experiments. This allows for consistent measurement of system health and identification of deviations from steady-state, leading to more effective decision-making and system improvements [43]. | Consistent system health measurement and more effective decision-making.

Table 8
Challenges and solutions in chaos engineering.
Category | Challenges | Possible solutions | References
Complexity | Designing and executing effective chaos experiments in large systems is complex due to intricate interdependencies within these systems. | To mitigate complexity, it is recommended to start with smaller, more manageable experiments and gradually expand the scope of chaos engineering practices. | [25,43]
Risk of impact | Concerns about causing disruptions in the production environment, affecting users and business operations. | Implementing risk analysis techniques can help prioritize experiments, focusing on less critical system components first to minimize potential impacts. | [45,50]
Resource intensiveness | Significant resources needed including time, expertise, and infrastructure, posing a barrier for many organizations. | Addressing resource intensiveness involves providing comprehensive training and education on chaos engineering best practices and tools to equip teams with the necessary skills and knowledge. | [7,47]
Security concerns | Introducing controlled failures can raise security issues, potentially exposing vulnerabilities or sensitive data. | To combat security concerns, robust security measures should be implemented during experiments to safeguard sensitive data and prevent unauthorized access. | [42,47]
Tooling and automation | Developing tools for automated chaos experiments is challenging in heterogeneous and dynamic environments. | Overcoming tooling and automation challenges requires the development and use of automated tools for Chaos experiments, which reduce manual efforts and facilitate continuous, unattended testing. | [7,33,38,40,42]
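
Table 8 suggests prioritizing experiments with risk analysis and starting with less critical components. One way to express such an ordering is sketched below in Python. The `Candidate` fields, the 1 to 5 criticality scale, and the tie-breaking rule are assumptions for this illustration and are not taken from the cited risk analysis techniques.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    component: str
    criticality: int   # hypothetical scale: 1 (low business impact) to 5 (high)
    likelihood: float  # hypothetical estimate that the injected fault occurs naturally


def prioritize(candidates: list[Candidate]) -> list[Candidate]:
    """Order experiments: least critical components first, then most likely faults first."""
    return sorted(candidates, key=lambda c: (c.criticality, -c.likelihood))
```

Sorting by criticality first keeps early experiments away from components where a failed experiment would affect users most, while the secondary key favors faults that are more likely to be encountered in practice.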

Recent studies have emphasized the growing intersection between artificial intelligence and cybersecurity within the context of chaos engineering. AI-driven techniques are now used for real-time threat detection, anomaly prediction, and automated response mechanisms in enterprise systems. For example, generative AI models have been proposed to enhance cybersecurity frameworks by improving data privacy management and identifying potential attack vectors [59].

In RQ3, we focused on understanding how chaos engineering is implemented in microservice architectures. To enhance system resilience in microservice architectures through chaos engineering, organizations should utilize fault injection testing to replicate failures within microservices. They should also conduct hypothesis-driven experiments with a solid comprehension of the normal state and anticipated behavior during disruptions, while managing the scope of these experiments to minimize impact. Additionally, it is essential to identify and analyze resilience requirements, participate in continuous testing and improvement efforts, and integrate observability tools for real-time monitoring during fault injection tests. Moreover, organizations need to establish clear communication channels across the teams involved to ensure effective collaboration and knowledge sharing.

The answer to RQ4 highlights the significance of centralized management and monitoring in conducting chaos experiments within large-scale microservices ecosystems. It discusses the utilization of software solutions like Netflix's Chaos Automation Platform (ChAP) and fault injection techniques such as service call manipulation. The emphasis is placed on the need for careful planning, effective communication, risk management, and continuous learning to ensure comprehensive and valuable chaos experiments for enhancing overall system resilience.

In response to RQ5, our discussion concludes that the practical implementation of chaos engineering, despite its promise to enhance system resilience, presents numerous challenges. These challenges include potential business impacts, difficulty in determining scope, the unpredictability of outcomes, time and resource constraints, system complexities, skill and knowledge prerequisites, interpretation of results, cultural readiness, and selection of appropriate tools. These all necessitate meticulous planning and skilled execution for effectiveness.

Recent studies explore the convergence of Chaos Engineering and Artificial Intelligence (AI). Large language models (LLMs) have been used to automate the chaos engineering lifecycle, managing phases from hypothesis creation to experiment orchestration and remediation [60]. Meanwhile, advances in applying chaos engineering to multi-agent AI systems suggest new directions: for example, chaos experiments applied to LLM-based multi-agent systems can surface vulnerabilities such as hallucinations, agent failures, or inter-agent communication breakdowns [61]. Together, these works show how intelligent,
|
||
|
||
11
|
||
E. Esen et al. Computer Standards & Interfaces 97 (2026) 104116
|
||
|
||
|
||
adaptive chaos frameworks might evolve in microservice-based systems as well.

Recent research also discusses specific operational challenges such as load balancing and security in the context of chaos engineering. For example, an empirical study applies delay injections under different user loads in cloud-native systems to observe how throughput and latency change under stress, providing insights into how load balancing policies perform under fault conditions [62]. In parallel, several frameworks have begun integrating security-focused chaos tests that intentionally inject faults into authentication, identity management, and access control components to ensure that security mechanisms remain effective under stress conditions [63]. These studies highlight how chaos engineering can be extended beyond performance reliability to proactively strengthen both load distribution and security resilience in microservice environments.

The main challenges faced by previous researchers and possible solutions have been discussed in the paper. The collected challenges were mainly related to the correct interpretation of chaos experiments and making sense of them. There may be more challenges, but if they were not mentioned in these articles, we could not include them. We believe that chaos engineering is still in its early stages and that its adoption in the software industry will take some time.

5.2. Threats to validity

Internal validity

The validity of this systematic literature review is threatened by issues related to defining the candidate pool of papers, potential bias in selecting primary studies, data extraction, and data synthesis. The application of exclusion criteria can be influenced by the researchers’ biases, posing a potential threat to validity. We compiled a comprehensive list of exclusion criteria, and all conflicts were documented and resolved through discussions among us. Data extraction validity is crucial as it directly impacts the study results. Whenever any of us was uncertain about data extraction, the case was recorded for resolution through discussions with the team. Multiple meetings were held to minimize researcher bias.

External validity

The search for candidate papers involved using general search terms to minimize the risk of excluding relevant studies. Despite using a broad search query to acquire more articles, there remains a possibility that some papers were overlooked in electronic databases or missed due to recent publications. Furthermore, although seven widely used online databases in computer science and software engineering were searched, new papers may not have been included.

6. Conclusion

Our systematic literature review (SLR) on chaos engineering has explored its role in enhancing the resilience of software systems in production environments. Through our review, we have identified several crucial aspects that underline the effective application and challenges of chaos engineering [25].

Firstly, Chaos Engineering serves as a proactive troubleshooting approach in production environments [25]. By identifying and addressing potential malfunctions before they occur, it effectively preempts system disruptions. This proactive strategy is largely implemented through chaos engineering tools that assist in automatic fault detection, thereby minimizing potential issues in these critical environments [50].

Secondly, the essence of chaos engineering is rooted in continuous experimentation and robustness testing under real-world operational conditions. The methodology involves a systematic approach: defining a steady state, hypothesizing how disruptions will affect it, conducting controlled experiments, and subsequently confirming or refuting the hypotheses. These experiments are insightful, as they reveal system behaviors in production environments, which often differ unpredictably from staging environments [36,53].

Furthermore, the effectiveness of chaos engineering is contingent on the systematic execution of chaos experiments. These experiments, utilizing advanced chaos engineering tools, need to navigate the constraints and challenges inherent in real-world operational settings. The main objective is the enhancement of system resilience, achieved by proactively identifying and preemptively addressing potential issues [46].

However, it is acknowledged that conducting chaos experiments directly in production environments might be impeded by legal or technical constraints. In such scenarios, initiating experiments in a staging environment and then gradually transitioning to the production environment offers a viable alternative. This approach ensures that the benefits of chaos engineering can still be realized, but in a more controlled and possibly less direct manner.

Our review highlights that chaos engineering is a critical methodology for ensuring the resilience and robustness of software systems. Through continuous experimentation and proactive troubleshooting, it offers a pathway to address the challenges faced in complex production environments. This SLR contributes to the scientific community by discussing these methodologies and their applications, thereby providing a framework for future research and practical implementation in the field of software system resilience.

CRediT authorship contribution statement

Emrah Esen: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Methodology, Investigation, Formal analysis, Data curation. Akhan Akbulut: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Formal analysis, Data curation. Cagatay Catal: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] P. Jamshidi, C. Pahl, N.C. Mendonça, J. Lewis, S. Tilkov, Microservices: The journey so far and challenges ahead, IEEE Softw. 35 (3) (2018) 24–35, http://dx.doi.org/10.1109/MS.2018.2141039.
[2] I. Beschastnikh, P. Wang, Y. Brun, M.D. Ernst, Debugging distributed systems, Commun. ACM 59 (8) (2016) 32–37, http://dx.doi.org/10.1145/2909480.
[3] W. Ahmed, Y.W. Wu, A survey on reliability in distributed systems, J. Comput. System Sci. 79 (8) (2013) 1243–1255, http://dx.doi.org/10.1016/j.jcss.2013.02.006.
[4] D. Ma’ruf, S. Sulistyo, L. Nugroho, Applying integrating testing of microservices in airline ticketing system, Ijitee (Int. J. Inf. Technol. Electr. Eng.) 4 (2020) 39, http://dx.doi.org/10.22146/ijitee.55491.
[5] F. Dai, H. Chen, Z. Qiang, Z. Liang, B. Huang, L. Wang, Automatic analysis of complex interactions in microservice systems, Complexity 2020 (2020) 1–12, http://dx.doi.org/10.1155/2020/2128793.
[6] J. Lewis, M. Fowler, Microservices: a definition of this new architectural term (2014), 2014, URL: http://martinfowler.com/articles/microservices.html (cit. p. 26).
[7] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, C. Rosenthal, Chaos engineering, IEEE Softw. 33 (3) (2016) 35–41, http://dx.doi.org/10.1109/MS.2016.60.
[8] R.T. Munodawafa, S.K. Johl, A systematic review of eco-innovation and performance from the resource-based and stakeholder perspectives, Sustainability 11 (2019) 6067, http://dx.doi.org/10.3390/su11216067.
[9] J.M. Macharia, Systematic literature review of interventions supported by integration of ict in education to improve learners’ academic performance in stem subjects in kenya, J. Educ. Pract. 6 (2022) 52–75, http://dx.doi.org/10.47941/jep.979.
[10] P. Gerli, J.N. Marco, J. Whalley, What makes a smart village smart? a review of the literature, Transform. Gov.: People Process. Policy 16 (2022) 292–304, http://dx.doi.org/10.1108/tg-07-2021-0126.
[11] R. Coppola, L. Ardito, Quality assessment methods for textual conversational interfaces: a multivocal literature review, Information 12 (2021) 437, http://dx.doi.org/10.3390/info12110437.
[12] B. Kitchenham, O. Pearl Brereton, D. Budgen, M. Turner, J. Bailey, S. Linkman, Systematic literature reviews in software engineering – A systematic literature review, Inf. Softw. Technol. 51 (1) (2009) 7–15, http://dx.doi.org/10.1016/j.infsof.2008.09.009, Special Section - Most Cited Articles in 2002 and Regular Research Papers.
[13] N. Dragoni, S. Giallorenzo, A.L. Lafuente, M. Mazzara, F. Montesi, R. Mustafin, L. Safina, Microservices: yesterday, today, and tomorrow, 2017, arXiv:1606.04036.
[14] P.D. Francesco, I. Malavolta, P. Lago, Research on architecting microservices: Trends, focus, and potential for industrial adoption, in: 2017 IEEE International Conference on Software Architecture, ICSA, 2017, pp. 21–30, http://dx.doi.org/10.1109/ICSA.2017.24.
[15] M. Fowler, Patterns of Enterprise Application Architecture, Addison-Wesley Longman Publishing Co., Inc., USA, 2002.
[16] J. Lewis, M. Fowler, Microservices, 2014, https://martinfowler.com/articles/microservices.html.
[17] S. Newman, Building Microservices: Designing Fine-Grained Systems, O’Reilly Media, Inc., 2021.
[18] C.K. Rudrabhatla, Comparison of zero downtime based deployment techniques in public cloud infrastructure, in: 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud), I-SMAC, 2020, pp. 1082–1086, http://dx.doi.org/10.1109/I-SMAC49090.2020.9243605.
[19] S.R. Addula, P. Perugu.P, M.K. Kumar, D. Kumar, B. Ananthan, R. R, S. P, S. G, Dynamic load balancing in cloud computing using hybrid Kookaburra-Pelican optimization algorithms, in: 2024 International Conference on Augmented Reality, Intelligent Systems, and Industrial Automation, ARIIA, 2024, pp. 1–7, http://dx.doi.org/10.1109/ARIIA63345.2024.11051893.
[20] M. Waseem, P. Liang, M. Shahin, A systematic mapping study on microservices architecture in devops, J. Syst. Softw. 170 (2020) 110798, http://dx.doi.org/10.1016/j.jss.2020.110798.
[21] C. Rosenthal, N. Jones, Chaos Engineering: System Resiliency in Practice, O’Reilly Media, 2020.
[22] L. Zhang, B. Morin, B. Baudry, M. Monperrus, Maximizing error injection realism for chaos engineering with system calls, IEEE Trans. Dependable Secur. Comput. 19 (4) (2022) 2695–2708, http://dx.doi.org/10.1109/TDSC.2021.3069715.
[23] Š. Davidovič, B. Beyer, Canary analysis service, Commun. ACM 61 (5) (2018) 54–62, http://dx.doi.org/10.1145/3190566.
[24] L. Zhang, B. Morin, P. Haller, B. Baudry, M. Monperrus, A chaos engineering system for live analysis and falsification of exception-handling in the JVM, IEEE Trans. Softw. Eng. 47 (11) (2021) 2534–2548, http://dx.doi.org/10.1109/TSE.2019.2954871.
[25] H. Jernberg, P. Runeson, E. Engström, Getting started with chaos engineering - design of an implementation framework in practice, in: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM, ESEM ’20, Association for Computing Machinery, New York, NY, USA, 2020, http://dx.doi.org/10.1145/3382494.3421464.
[26] A. Alkhateeb, C. Catal, G. Kar, A. Mishra, Hybrid blockchain platforms for the internet of things (IoT): A systematic literature review, Sensors 22 (4) (2022) http://dx.doi.org/10.3390/s22041304.
[27] R. van Dinter, B. Tekinerdogan, C. Catal, Predictive maintenance using digital twins: A systematic literature review, Inf. Softw. Technol. 151 (2022) 107008, http://dx.doi.org/10.1016/j.infsof.2022.107008.
[28] M. Jorayeva, A. Akbulut, C. Catal, A. Mishra, Machine learning-based software defect prediction for mobile applications: A systematic literature review, Sensors 22 (7) (2022) http://dx.doi.org/10.3390/s22072551.
[29] A. Basiri, L. Hochstein, N. Jones, H. Tucker, Automating chaos experiments in production, in: 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP, 2019, pp. 31–40, http://dx.doi.org/10.1109/ICSE-SEIP.2019.00012.
[30] L.B. Canonico, V. Vakeel, J. Dominic, P. Rodeghero, N. McNeese, Human-AI partnerships for chaos engineering, in: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, ICSEW ’20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 499–503, http://dx.doi.org/10.1145/3387940.3391493.
[31] J. Zhang, R. Ferydouni, A. Montana, D. Bittman, P. Alvaro, 3MileBeach: A tracer with teeth, in: Proceedings of the ACM Symposium on Cloud Computing, SoCC ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 458–472, http://dx.doi.org/10.1145/3472883.3486986.
[32] C.S. Meiklejohn, A. Estrada, Y. Song, H. Miller, R. Padhye, Service-level fault injection testing, in: Proceedings of the ACM Symposium on Cloud Computing, SoCC ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 388–402, http://dx.doi.org/10.1145/3472883.3487005.
[33] A. Blohowiak, A. Basiri, L. Hochstein, C. Rosenthal, A platform for automating chaos experiments, in: 2016 IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW, 2016, pp. 5–8, http://dx.doi.org/10.1109/ISSREW.2016.52.
[34] A. Nagarajan, A. Vaddadi, Automated fault-tolerance testing, in: 2016 IEEE Ninth International Conference on Software Testing, Verification and Validation Workshops, ICSTW, 2016, pp. 275–276, http://dx.doi.org/10.1109/ICSTW.2016.34.
[35] V. Heorhiadi, S. Rajagopalan, H. Jamjoom, M.K. Reiter, V. Sekar, Gremlin: Systematic resilience testing of microservices, in: 2016 IEEE 36th International Conference on Distributed Computing Systems, ICDCS, 2016, pp. 57–66, http://dx.doi.org/10.1109/ICDCS.2016.11.
[36] R.K. Lenka, S. Padhi, K.M. Nayak, Fault injection techniques - a brief review, in: 2018 International Conference on Advances in Computing, Communication Control and Networking, ICACCCN, 2018, pp. 832–837, http://dx.doi.org/10.1109/ICACCCN.2018.8748585.
[37] A. van Hoorn, A. Aleti, T.F. Düllmann, T. Pitakrat, ORCAS: Efficient resilience benchmarking of microservice architectures, in: 2018 IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW, 2018, pp. 146–147, http://dx.doi.org/10.1109/ISSREW.2018.00-10.
[38] H. Tucker, L. Hochstein, N. Jones, A. Basiri, C. Rosenthal, The business case for chaos engineering, IEEE Cloud Comput. 5 (3) (2018) 45–54, http://dx.doi.org/10.1109/MCC.2018.032591616.
[39] N. Brousse, O. Mykhailov, Use of self-healing techniques to improve the reliability of a dynamic and geo-distributed ad delivery service, in: 2018 IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW, 2018, pp. 1–5, http://dx.doi.org/10.1109/ISSREW.2018.00-40.
[40] K.A. Torkura, M.I. Sukmana, F. Cheng, C. Meinel, Security chaos engineering for cloud services: Work in progress, in: 2019 IEEE 18th International Symposium on Network Computing and Applications, NCA, 2019, pp. 1–3, http://dx.doi.org/10.1109/NCA.2019.8935046.
[41] H. Chen, P. Chen, G. Yu, A framework of virtual war room and matrix sketch-based streaming anomaly detection for microservice systems, IEEE Access 8 (2020) 43413–43426, http://dx.doi.org/10.1109/ACCESS.2020.2977464.
[42] K.A. Torkura, M.I.H. Sukmana, F. Cheng, C. Meinel, CloudStrike: Chaos engineering for security and resiliency in cloud infrastructure, IEEE Access 8 (2020) 123044–123060, http://dx.doi.org/10.1109/ACCESS.2020.3007338.
[43] D. Kesim, A. van Hoorn, S. Frank, M. Häussler, Identifying and prioritizing chaos experiments by using established risk analysis techniques, in: 2020 IEEE 31st International Symposium on Software Reliability Engineering, ISSRE, 2020, pp. 229–240, http://dx.doi.org/10.1109/ISSRE5003.2020.00030.
[44] Z. Long, G. Wu, X. Chen, C. Cui, W. Chen, J. Wei, Fitness-guided resilience testing of microservice-based applications, 2020, pp. 151–158, http://dx.doi.org/10.1109/ICWS49710.2020.00027.
[45] S. De, A study on chaos engineering for improving cloud software quality and reliability, in: 2021 International Conference on Disruptive Technologies for Multi-Disciplinary Research and Applications, CENTCON, Vol. 1, 2021, pp. 289–294, http://dx.doi.org/10.1109/CENTCON52345.2021.9688292.
[46] C. Konstantinou, G. Stergiopoulos, M. Parvania, P. Esteves-Verissimo, Chaos engineering for enhanced resilience of cyber-physical systems, in: 2021 Resilience Week, RWS, 2021, pp. 1–10, http://dx.doi.org/10.1109/RWS52686.2021.9611797.
[47] F. Poltronieri, M. Tortonesi, C. Stefanelli, ChaosTwin: A chaos engineering and digital twin approach for the design of resilient IT services, in: 2021 17th International Conference on Network and Service Management, CNSM, 2021, pp. 234–238, http://dx.doi.org/10.23919/CNSM52442.2021.9615519.
[48] N. Luo, Y. Xiong, Platform software reliability for cloud service continuity - challenges and opportunities, in: 2021 IEEE 21st International Conference on Software Quality, Reliability and Security, QRS, 2021, pp. 388–393, http://dx.doi.org/10.1109/QRS54544.2021.00050.
[49] H. Chen, K. Wei, A. Li, T. Wang, W. Zhang, Trace-based intelligent fault diagnosis for microservices with deep learning, in: 2021 IEEE 45th Annual Computers, Software, and Applications Conference, COMPSAC, 2021, pp. 884–893, http://dx.doi.org/10.1109/COMPSAC51774.2021.00121.
[50] O. Sharma, M. Verma, S. Bhadauria, P. Jayachandran, A guided approach towards complex chaos selection, prioritisation and injection, in: 2022 IEEE 15th International Conference on Cloud Computing, CLOUD, 2022, pp. 91–93, http://dx.doi.org/10.1109/CLOUD55607.2022.00025.
[51] N. Luo, L. Zhang, Chaos driven development for software robustness enhancement, in: 2022 9th International Conference on Dependable Systems and their Applications, DSA, 2022, pp. 1029–1034, http://dx.doi.org/10.1109/DSA56465.2022.00154.
[52] M.A. Naqvi, S. Malik, M. Astekin, L. Moonen, On evaluating self-adaptive and self-healing systems using chaos engineering, in: 2022 IEEE International Conference on Autonomic Computing and Self-Organizing Systems, ACSOS, 2022, pp. 1–10, http://dx.doi.org/10.1109/ACSOS55765.2022.00018.
[53] J. Simonsson, L. Zhang, B. Morin, B. Baudry, M. Monperrus, Observability and chaos engineering on system calls for containerized applications in Docker, Future Gener. Comput. Syst. 122 (2021) 117–129, http://dx.doi.org/10.1016/j.future.2021.04.001.
[54] A.A.-S. Ahmad, P. Andras, Scalability resilience framework using application-level fault injection for cloud-based software services, J. Cloud Comput. 11 (1) (2022) 1, http://dx.doi.org/10.1186/s13677-021-00277-z.
[55] C. Camacho, P.C. Cañizares, L. Llana, A. Núñez, Chaos as a software product line—A platform for improving open hybrid-cloud systems resiliency, Softw.: Pract. Exp. 52 (7) (2022) 1581–1614, http://dx.doi.org/10.1002/spe.3076.
[56] P. Raj, S. Vanga, A. Chaudhary, The observability, chaos engineering, and remediation for cloud-native reliability, in: Cloud-Native Computing: How To Design, Develop, and Secure Microservices and Event-Driven Applications, 2023, pp. 71–93, http://dx.doi.org/10.1002/9781119814795.ch4.
[57] M.A. Chang, B. Tschaen, T. Benson, L. Vanbever, Chaos monkey: Increasing sdn reliability through systematic network destruction, in: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, 2015, pp. 371–372.
[58] D. Savchenko, G. Radchenko, O. Taipale, Microservices validation: Mjolnirr platform case study, in: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO, 2015, pp. 235–240, http://dx.doi.org/10.1109/MIPRO.2015.7160271.
[59] G.S. Nadella, S.R. Addula, A.R. Yadulla, G.S. Sajja, M. Meesala, M.H. Maturi, K. Meduri, H. Gonaygunta, Generative AI-enhanced cybersecurity framework for enterprise data privacy management, Computers 14 (2) (2025) http://dx.doi.org/10.3390/computers14020055.
[60] D. Kikuta, H. Ikeuchi, K. Tajiri, Y. Nakano, ChaosEater: Fully automating chaos engineering with large language models, 2025, arXiv preprint arXiv:2501.11107. URL https://arxiv.org/abs/2501.11107.
[61] J. Owotogbe, Assessing and enhancing the robustness of LLM-based multi-agent systems through chaos engineering, in: 2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI, CAIN, 2025, pp. 250–252, http://dx.doi.org/10.1109/CAIN66642.2025.00039.
[62] A. Al-Said Ahmad, L.F. Al-Qora’n, A. Zayed, Exploring the impact of chaos engineering with various user loads on cloud native applications: An exploratory empirical study, Computing 106 (2024) 2389–2425, http://dx.doi.org/10.1007/s00607-024-01292-z.
[63] K.A. Torkura, M.I. Sukmana, F. Cheng, C. Meinel, Security chaos engineering for cloud services: Work in progress, in: 2019 IEEE 18th International Symposium on Network Computing and Applications, NCA, 2019, pp. 1–3, http://dx.doi.org/10.1109/NCA.2019.8935046.