Computer Standards & Interfaces 97 (2026) 104116
Contents lists available at ScienceDirect
Computer Standards & Interfaces
journal homepage: www.elsevier.com/locate/csi
Chaos experiments in microservice architectures: A systematic literature review
Emrah Esen a, Akhan Akbulut a, Cagatay Catal b,*
a Department of Computer Engineering, Istanbul Kültür University, 34536, Istanbul, Turkey
b Department of Computer Science and Engineering, Qatar University, Doha 2713, Qatar
ARTICLE INFO

Keywords: Chaos engineering; Microservice; Systematic literature review

ABSTRACT

This study analyzes the implementation of Chaos Engineering in modern microservice systems. It identifies key methods, tools, and practices used to effectively enhance the resilience of software systems in production environments. In this context, our Systematic Literature Review (SLR) of 31 research articles has uncovered 38 tools crucial for carrying out fault injection methods, including several tools such as Chaos Toolkit, Gremlin, and Chaos Machine. The study also explores the platforms used for chaos experiments and how centralized management of chaos engineering can facilitate the coordination of these experiments across complex systems. The evaluated literature reveals the efficacy of chaos engineering in improving fault tolerance and robustness of software systems, particularly those based on microservice architectures. The paper underlines the importance of careful planning and execution in implementing chaos engineering and encourages further research in this field to uncover more effective practices for the resilience improvement of microservice systems.
Contents
1. Introduction
2. Background
2.1. Microservice architecture
2.2. Microservice principles
2.3. Challenges/Troubleshooting/Failures in microservice architecture
2.4. Chaos engineering
3. Review protocol
3.1. Research questions
3.2. Search strategy
3.3. Study selection criteria
3.4. Study quality assessment
3.5. Data extraction
3.6. Data synthesis
4. Results
4.1. Main statistics
4.2. How is Chaos engineering effectively applied in production environments to enhance the resilience of software systems?
4.3. Which platforms have been used for chaos experiments?
4.4. How can Chaos engineering be effectively applied to microservice architecture to ensure successful implementation and enhance system resilience?
4.5. To what extent can the centralized provision of Chaos engineering effectively facilitate the management of chaos experiments across complex systems?
4.6. What are the challenges reported in the relevant papers?
5. Discussion
5.1. General discussion
5.2. Threats to validity
6. Conclusion
CRediT authorship contribution statement
Declaration of competing interest
Data availability
References

* Corresponding author.
E-mail address: ccatal@qu.edu.qa (C. Catal).
https://doi.org/10.1016/j.csi.2025.104116
Received 22 September 2024; Received in revised form 28 November 2025; Accepted 12 December 2025
Available online 15 December 2025
0920-5489/© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
1. Introduction

In recent years, the adoption of microservice architecture has led to the transformation of application infrastructures into distributed systems. These systems are designed to enhance maintainability by decoupling services. The primary benefit of this architecture is the ease of maintenance of individual services within the microservice ecosystem due to their smaller and more modular nature [1]. However, despite these advantages, the distributed nature of microservices introduces significant challenges. Specifically, the complex management of services and their tight integration can considerably complicate software debugging. Debugging becomes complex in this architecture due to its distributed nature, the necessity to pinpoint the exact service causing the problem, and the dynamic characteristics of microservices. Consequently, debugging in microservice architecture demands a greater level of effort and specialized expertise compared to conventional monolithic architectures [2]. Moreover, it is quite challenging to predict what will happen if there is an unexpected error or if a service on the network goes out of service. Service outages can be caused by anything from a malicious cyberattack to a hardware failure to simple human error, and they can have devastating financial consequences. Although such unexpected situations are rare, they can interfere with the operation of distributed systems and severely affect the live environment in which the application is located [3]. It is therefore necessary to detect weak points in the system before an error occurs and spreads to the entire system.

Microservice architecture applications undergo testing procedures to ensure their quality and dependability. These include unit, service, end-to-end, behavior-driven, integration, and regression testing [4]. The comprehensive approach to microservices testing also encompasses live testing strategies for complex systems [5]. This thorough process emphasizes different aspects such as the functionality, interoperability, and performance of individual services within the architecture. It aims to detect and resolve issues early to ensure stable and high-quality microservice applications [1,6]. However, because a microservice application consists of multiple services, events such as network failures and sudden increases in service load should not degrade the user experience. For example, if the microservice that adds a product to favorites on a shopping site fails or responds late, the user should still be able to continue shopping. Therefore, testing in production-like environments becomes inevitable. No matter how distributed or complex the system is, there is a need for a method to manage unforeseeable situations that can build trust in the system against unexpected failures. Chaos engineering is defined as the discipline of conducting experiments in a live environment to test or verify the reliability of software [7].

The primary objective of this research is to conduct a thorough investigation into how chaos experiments are performed in the widely used microservices-based systems of today. Microservice architectures have come to the forefront in modern software development processes due to their advantages such as flexibility, scalability, and rapid development. However, these architectures also bring unique challenges due to complex service dependencies and dynamic operational environments. This study aims to comprehensively address the methodologies, application scenarios, and impacts of chaos experiments conducted to test the resilience of microservice systems and identify potential weak points. The research intends to present the current state of chaos engineering practices by analyzing them, highlighting best practices, challenges faced, and solutions. In addition, it will assess the effectiveness of chaos experiments in enhancing the reliability and robustness of microservice systems by using data obtained from real-world scenarios to develop strategic recommendations. This study is a critical step in understanding the applicability and impact of chaos engineering within the complexity of microservice architectures and aims to make significant contributions to the body of knowledge in this field. Recent research has applied chaos engineering to this architectural style; however, a systematic overview of the state of the art on the use of chaos engineering in the microservice architecture is lacking. Therefore, a Systematic Literature Review (SLR) has been performed to provide an overview of how chaos engineering was applied.

This article primarily targets peer-reviewed research papers to maintain methodological consistency and ensure scholarly rigor. We specifically chose a systematic literature review (SLR) methodology because peer-reviewed academic studies are subject to rigorous validation processes, which enhance the reliability and validity of our findings [8,9]. Although excluding industry-specific, grey literature may restrict certain practical perspectives, this choice was deliberately made to avoid potential biases and uphold the scientific integrity of our review [10,11]. However, future studies could broaden the scope to incorporate industrial case studies and practical experiences, which would enrich our understanding of chaos engineering's applicability beyond the academic context.

The main contributions of this study are listed as follows:

1. To the best of our knowledge, this is the first study to employ a systematic literature review approach in the field of chaos engineering on microservice architecture applications [12]. The study provides an extensive systematic literature review of how chaos engineering can be applied to enhance the resilience of microservice architectures. It collates findings from various sources to provide insights into the current state of research and practice in this field.
2. The study categorizes and summarizes the range of chaos engineering tools and methods used in industry and academia, highlighting their functionalities in process/service termination, network simulation, load stressing, security testing, and fault injection within application code.
3. This research paper discusses contemporary techniques and approaches for implementing chaos engineering in microservice architectures. It also emphasizes the ongoing work in this field, offering a significant reference for future research endeavors. The paper systematically reviews existing literature to showcase how chaos engineering can enhance system resilience, laying a comprehensive groundwork for further exploration into chaos experimentation strategies and innovating new fault injection methods or tools within microservice architectures.

The rest of the paper is structured as follows: Section 2 explains the background and related work. Section 3 presents the methodology of the research. Section 4 presents the results and Section 5 comprehensively discusses the presented answers to research questions and validity threats. Lastly, the conclusion is presented in Section 6.

2. Background

The microservice approach breaks down a large application into a network of small, self-contained units, each running its own process and often communicating through web APIs. Unlike large, single-piece
monolithic systems, these small services are robust, easy to scale up or down, and can be updated individually using various programming languages and technologies. This structure allows development teams to be smaller and more agile, leading to faster updates and improvements. Yet, managing many interconnected services can become complicated, especially when something goes wrong. To enhance system reliability and resilience, a method known as chaos engineering is employed. This involves deliberately introducing problems into the live system to test its ability to cope and recover. This technique helps to uncover and rectify flaws, thereby making the system stronger overall. Regular and automated tests mimic real-life problems to ensure that the system can handle unexpected challenges and remain stable and efficient.

2.1. Microservice architecture

Microservice architectures have gained significant popularity in the software industry due to their ability to address the challenges and complexities of developing modern applications [6,13].

2.2. Microservice principles

Microservice architectures are based on the concept of decentralization, where each service is independently developed, deployed, and managed. This emphasizes autonomy and minimal inter-service dependencies. Each microservice is designed to focus on a single function or closely related set of functions and supports technology heterogeneity by allowing different services to use different technology stacks that best suit their needs. Resilience is a core aspect, with services built to withstand failures without affecting the entire system, while scalability enables services to be scaled independently as per demand. Communication occurs through lightweight mechanisms like HTTP/REST APIs, supporting continuous delivery and deployment practices. Due to the distributed nature of microservice architecture, comprehensive monitoring and logging for observability become crucial. Additionally, there is often an alignment between the microservice architecture and organizational structure involving small cross-functional teams responsible for individual services [14].

It is helpful to compare the microservice architecture to the monolithic architecture. The main difference between them is the size of the developed applications. The microservice architecture can be thought of as developing an application as a suite of smaller services, rather than as a single, monolithic structure. Enterprise applications usually consist of three main parts: a client-side user interface (i.e., HTML pages and JavaScript running in a browser on the user's machine), a database (i.e., many tables in a common, often relational, database management system), and a server-side application. The server-side application processes HTTP requests, executes business logic, retrieves and updates data in the database, and prepares the HTML views that are sent to the browser. This structure is a good example of a monolith. Any change to the system involves creating and deploying a new version of the server-side application [15]. The cycles of change are interdependent. A change to a small part of the application requires rebuilding and deploying the entire monolith [6].

Microservice architectures, on the other hand, share some common features that distinguish them from monolithic architecture. These are componentization with services, organizing around job capabilities, smart interfaces and simple communication, decentralized governance, decentralized data management, infrastructure automation, and design for failure [16]. Today, although modern internet applications seem like a single application, they use microservice architectures behind them. Microservice architecture basically refers to small, autonomous, and interoperable services. It has emerged due to increasing needs such as technology diversity, flexibility, scaling, ease of deployment, organization and management, and provides various advantages in these matters. Its advantages are described as follows [17]:

Technology heterogeneity. Microservices are treated as small services, each running independently and communicating with each other using open protocols. While monolithic applications are developed with a single programming language and database system, services included in a microservice ecosystem may use a different programming language and database. This allows the advantages of each programming language and database to be used.

Resilience. When an error occurs in a monolithic application, the whole system is affected. In the microservice architecture, only the part under the responsibility of the relevant service is affected; the parts belonging to other services are unaffected, and the user experience continues.

Scalability. While scaling a monolithic application covers the entire application, in applications developed with microservice architecture only the services that are under heavy load need to be scaled. This avoids unnecessary resource costs for parts that do not need to be scaled and improves the user experience.

Deployment. Microservice architecture facilitates the autonomous deployment of individual services, enabling updates or changes without impacting others. Various deployment strategies, including blue-green, canary, and rolling deployment, minimize disruptions during the deployment process [18]. As a result, microservice architecture provides increased flexibility and resilience in deployment, distinguishing it from monolithic applications.

Organizational alignment. Large teams and large pieces of code can cause challenges in software development processes. These challenges become more manageable with smaller teams, and microservice applications make it possible to form smaller and more cohesive teams. Each team is responsible for its own microservice and can take action by making improvements if necessary.

2.3. Challenges/Troubleshooting/Failures in microservice architecture

Microservice architectures pose numerous challenges. As the number of services increases, the complexity of service interactions also grows. Reliance on network communication leads to latency and network failure issues, while ensuring data consistency across multiple databases requires careful design and implementation of distributed transactions or eventual consistency models. Microservices bring typical distributed system challenges such as handling partial failures, dealing with latency and asynchrony, complex service discovery, load balancing in dynamic scaling environments, and managing configurations across multiple services and environments. Security concerns are heightened due to the increased surface area of inter-service communication. Testing becomes more complex, involving individual service testing along with testing their interactions; deployment is challenging, especially when there are dependencies between services; effective observability and monitoring become crucial for timely issue resolution; versioning management is critical for maintaining system stability; lastly, assembling skilled teams proficient in DevOps, cloud computing, and programming languages presents a significant challenge. Microservice architecture thus involves a range of challenges, troubleshooting demands, and failure modes. While adopting a distributed architecture enhances modularity, it inherently introduces operational complexities that differ significantly from monolithic structures. Recent research has also explored the use of hybrid bio-inspired algorithms to optimize this process dynamically. For instance, the Hybrid Kookaburra-Pelican Optimization Algorithm has been shown to improve load distribution and system scalability in cloud and microservice-based environments [19].

In conclusion, while microservices offer numerous advantages such as improved scalability, flexibility, and agility, they also introduce significant challenges in terms of system complexity, operational demands, and the need for skilled personnel and sophisticated tooling [20].
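Partial failures and latency, listed above among the typical distributed-system challenges, are usually handled by letting a caller degrade gracefully when a dependency is slow or down. The following is a minimal sketch of that idea, not a method taken from the reviewed studies: the function call_with_fallback and the three stand-in "favorites" services are hypothetical names introduced only for illustration, and the timeout values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool. A call that times out keeps running in the background;
# the caller simply stops waiting for it.
_POOL = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(service_call, fallback, timeout_s=1.0):
    """Return service_call(), or `fallback` if it fails or exceeds timeout_s."""
    future = _POOL.submit(service_call)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # dependency answered too late: degrade, do not block
    except Exception:
        return fallback  # dependency failed: degrade, do not propagate


# Hypothetical stand-ins for a remote "favorites" service.
def healthy_favorites():
    return ["item-1", "item-2"]


def slow_favorites():
    time.sleep(0.3)  # simulated latency
    return ["item-1"]


def broken_favorites():
    raise ConnectionError("favorites service unreachable")


if __name__ == "__main__":
    print(call_with_fallback(healthy_favorites, fallback=[]))
    print(call_with_fallback(slow_favorites, fallback=[], timeout_s=0.05))
    print(call_with_fallback(broken_favorites, fallback=[]))
```

With a wrapper of this kind, a shopping page could still render with an empty favorites list when the favorites service is unavailable, which is the kind of behavior a chaos experiment is meant to confirm.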
2.4. Chaos engineering

Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in a production-like environment [7,21]. It is the careful and planned execution of experiments to show how the distributed system will respond to a failure. It is necessary for large-scale software systems because it is practically impossible to simulate real events in test environments. With chaos engineering, experiments are designed on the basis of real events [22]. By analyzing the test results, improvements are made where necessary; in this way, the aim is to increase the reliability of the software in the production environment.

Thanks to an experimental and systems-based approach, confidence is established in the survivability of these systems during collapses. Canary analysis collects data on how distributed systems react to failure scenarios by observing their behavior in abnormal situations and performing controlled experiments [23]. This method involves applying new updates or changes to a specific aspect of the system, enabling early detection of potential problems before they spread to a larger part of the system. Chaos experiments follow these principles [24,25]:

• Hypothesize steady state: The first step is to hypothesize the steady state of the system under normal conditions.
• Vary real-world events: The next step is to vary real-world events that can cause turbulence in the system.
• Run experiments in production: Experimenters should run the experiments in a production-like environment to simulate real-world conditions.
• Automate experiments to run continuously: Experimenters should automate the experiments to run continuously, ensuring that the system can withstand turbulence over time.
• Minimize blast radius: The experiments should be designed to minimize blast radius, i.e., the impact of the experiment on the system should be limited to a small area.
• Analyze results: Experimenters should analyze the results of the experiments to determine the system's behavior under turbulent conditions.
• Repeat experiments: The experiments should be repeated to ensure that the system can consistently withstand turbulence.

When the experiment is finished, information about the actual effect will be provided to the system.

3. Review protocol

Systematic review studies must be conducted using a well-defined and specific protocol. To conduct a systematic review study, all studies on a particular topic must be examined [12]. We followed the systematic review process shown in Fig. 1 and took all the steps to reduce the risk of bias in this study. Multiple reviewers were involved in the SLR process, and in cases of conflict, a brief meeting was organized to facilitate consensus. The first step was to define the research questions. Then, the most appropriate databases were selected. Based on the selected databases, automated searches were conducted and several articles were identified. Selection criteria were then established to determine which studies should be included and excluded in this research. The titles and abstracts of all studies were reviewed. In cases of doubt, the full text of the publication was reviewed. Then, after the studies were analyzed in detail, selection criteria were applied. All selected studies were assessed using a quality assessment process. Subsequently, the results were synthesized, listed, and summarized in a clear and understandable manner.

3.1. Research questions

Research Questions (RQs) and their corresponding motivations are presented as follows:

• RQ1: How is Chaos engineering effectively applied in production environments to enhance the resilience of software systems?
Motivation: Understanding the practical implementation of Chaos engineering in production environments is crucial for ensuring the resilience of software systems under real-world operating conditions.
• RQ2: Which platforms have been used for Chaos experiments?
Motivation: Identifying the platforms provides insights into the technological landscape and tools available for conducting Chaos engineering practices.
• RQ3: How is Chaos engineering effectively applied to microservice architectures to ensure its successful implementation in enhancing system resilience?
Motivation: Microservice architectures introduce new challenges in system design. Exploring the application of Chaos engineering in this context can help improve the resilience and fault tolerance of microservice systems.
• RQ4: To what extent can the centralized provision of Chaos engineering effectively facilitate the management of Chaos experiments across complex systems?
Motivation: Understanding the feasibility of providing Chaos engineering as a centralized service enables organizations to coordinate Chaos experiments across complex systems.
• RQ5: What are the challenges reported in the relevant papers?
Motivation: Identifying these challenges provides valuable insights into overcoming obstacles and advancing the adoption of Chaos engineering practices.

3.2. Search strategy

The primary studies were carefully selected from the papers published between 2010 and 2022 because the topic has become relevant only in recent years. The databases are IEEE Xplore, ACM Digital Library, ScienceDirect, Springer, Wiley, MDPI, and Scopus. The initial search involved reviewing the titles, abstracts, and keywords of the studies identified in the databases. The search results obtained from the databases were stored in the data extraction form using a spreadsheet tool. Furthermore, this systematic review was conducted collaboratively by three authors.

The following search string was used to broaden the search scope:

((chaos engineering) OR (chaos experiments)) OR (microservices)

The results of the searches made in the databases mentioned above are shown in Fig. 2.

3.3. Study selection criteria

After applying the exclusion and inclusion criteria, 55 articles were obtained. The exclusion criteria in our study are shown as follows:

• EC-1: Duplicate papers from multiple sources
• EC-2: Papers without full-text availability
• EC-3: Papers not written in English
• EC-4: Survey papers
• EC-5: Papers not related to Chaos engineering

The inclusion criteria in our study are shown as follows:

• IC-1: Primary papers discussing the use of Chaos experiments in a microservice architecture
• IC-2: Primary publications that focus on Chaos engineering
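The search string and the exclusion criteria can be read as a simple screening pipeline over candidate records. The sketch below is an illustrative assumption rather than the authors' actual tooling (the review used a spreadsheet): the Record fields, the title-based duplicate check, and the about_chaos_engineering flag, which stands in for a reviewer's manual judgement under EC-5, are all hypothetical.

```python
from dataclasses import dataclass

# Terms of the search string quoted in Section 3.2.
SEARCH_TERMS = ("chaos engineering", "chaos experiments", "microservices")


@dataclass(frozen=True)
class Record:
    """One candidate paper; all field names are illustrative."""
    title: str
    abstract: str = ""
    keywords: str = ""
    language: str = "en"
    has_full_text: bool = True
    is_survey: bool = False
    about_chaos_engineering: bool = True  # reviewer judgement, not computed


def matches_search(rec):
    """True if any search term occurs in the title, abstract, or keywords."""
    text = " ".join((rec.title, rec.abstract, rec.keywords)).lower()
    return any(term in text for term in SEARCH_TERMS)


def screen(records):
    """Apply the search string and EC-1..EC-5; return the surviving records."""
    seen_titles = set()
    kept = []
    for rec in records:
        key = " ".join(rec.title.lower().split())  # normalise case and spacing
        if key in seen_titles:
            continue  # EC-1: duplicate from another database
        seen_titles.add(key)
        if not matches_search(rec):
            continue  # not returned by the search string
        if not rec.has_full_text:
            continue  # EC-2
        if rec.language != "en":
            continue  # EC-3
        if rec.is_survey:
            continue  # EC-4
        if not rec.about_chaos_engineering:
            continue  # EC-5
        kept.append(rec)
    return kept


if __name__ == "__main__":
    candidates = [
        Record("Automating Chaos Experiments in Production"),
        Record("Automating  chaos experiments in production"),  # duplicate
        Record("A survey of microservices testing", is_survey=True),
        Record("Scaling microservices", about_chaos_engineering=False),
    ]
    for rec in screen(candidates):
        print(rec.title)
```

Because the three terms are joined with OR, a record that mentions only microservices passes the search step, which is why the relevance check under EC-5 remains necessary afterwards.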
Fig. 1. SLR review protocol.
Source: Adapted from [26–28].
Fig. 2. Distribution of selected papers per database.
3.4. Study quality assessment
The assessment of each study's quality is an indicator of the strength
of evidence provided by the systematic review. The quality of studies
was assessed using the questions listed below, and studies of poor
quality were not included in the present study. These criteria, based on
quality instruments, were adopted from guidelines and other SLR
research [12]. The following questions were used to assess the quality
of the studies.
• Q1. Are the aims of the study clearly stated?
• Q2. Are the scope and experimental design of the study clearly defined?
• Q3. Is the research process documented adequately?
• Q4. Are all the study questions answered?
• Q5. Are the negative findings presented?
• Q6. Do the conclusions relate to the aim of the study and are they reliable?
In this study, considering all these criteria, a general quality
assessment was performed for each paper. The rating was 2 points for
the "yes" option, 0 points for the "no" option, and 1 point for the
"somewhat" option. The decision threshold for classifying the paper
as poor quality was determined based on the mean value, which
corresponds to a total of 5 points.
Fig. 2 presents the distribution of papers based on databases where
they were found at different selection stages. After the initial search,
4520 papers were retrieved, of which 55 remained after applying the
selection criteria. After quality assessment, 31 papers were selected
as primary studies. The 55 papers were carefully read in full and the
required data for answering the research questions were extracted.
All the collected articles are listed in Table 1.
3.5. Data extraction
The data required to answer the research questions were extracted
from the selected articles using a data extraction form. The form
consists of metadata such as the authors' first and last names, the
title of the study, the publication year, and the type of study. In
addition to this metadata, several columns were created to store the
required information related to the research questions. By employing a
data extraction form, we ensured that the relevant data required to
answer each research question were systematically captured from the
selected publications. This approach facilitated the subsequent
synthesis of the findings. The data extraction process involved
meticulous attention to detail and ensured the reliability and integrity
of the data used in our systematic literature review.
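The scoring rule of Section 3.4 (2 points for "yes", 1 for "somewhat", 0 for "no", summed over Q1 to Q6) can be illustrated with a minimal sketch. The function names are hypothetical, and the text does not state whether a paper scoring exactly 5 is kept, so the sketch assumes that only totals strictly below 5 are treated as poor quality.

```python
# Answer weights as stated in Section 3.4.
SCORES = {"yes": 2, "somewhat": 1, "no": 0}
THRESHOLD = 5  # total reported in the text


def quality_score(answers):
    """Sum the points for the six answers (Q1..Q6)."""
    if len(answers) != 6:
        raise ValueError("expected one answer per question Q1..Q6")
    return sum(SCORES[a] for a in answers)


def is_poor_quality(answers, threshold=THRESHOLD):
    # Assumption: a total strictly below the threshold counts as poor;
    # the paper does not specify how a total of exactly 5 is handled.
    return quality_score(answers) < threshold
```

Under this assumption, a paper answering "yes" to all six questions reaches the maximum of 12 points, while a paper with a total of 4 would be excluded.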
Table 1
Selected primary studies.
ID Reference Title Year Database
S1 [29] Automating Chaos Experiments in Production 2019 ACM
S2 [25] Getting Started with Chaos engineering—design of an implementation framework in practice 2020 ACM
S3 [30] Human-AI Partnerships for Chaos engineering 2020 ACM
S4 [31] 3MileBeach: A Tracer with Teeth 2021 ACM
S5 [32] Service-Level Fault Injection Testing 2021 ACM
S6 [33] A Platform for Automating Chaos Experiments 2016 IEEE Xplore
S7 [34] Automated Fault-Tolerance Testing 2016 IEEE Xplore
S8 [35] Gremlin: Systematic Resilience Testing of Microservices 2016 IEEE Xplore
S9 [36] Fault Injection Techniques - A Brief Review 2018 IEEE Xplore
S10 [37] ORCAS: Efficient Resilience Benchmarking of Microservice Architectures 2018 IEEE Xplore
S11 [38] The Business Case for Chaos engineering 2018 IEEE Xplore
S12 [39] Use of Self-Healing Techniques to Improve the Reliability of a Dynamic and Geo-Distributed Ad Delivery Service 2018 IEEE Xplore
S13 [40] Security Chaos engineering for Cloud Services: Work In Progress 2019 IEEE Xplore
S14 [41] A Framework of Virtual War Room and Matrix Sketch-Based Streaming Anomaly Detection for Microservice Systems 2020 IEEE Xplore
S15 [42] CloudStrike: Chaos engineering for Security and Resiliency in Cloud Infrastructure 2020 IEEE Xplore
S16 [43] Identifying and Prioritizing Chaos Experiments by Using Established Risk Analysis Techniques 2020 IEEE Xplore
S17 [44] Fitness-guided Resilience Testing of Microservice-based Applications 2020 IEEE Xplore
S18 [24] A Chaos engineering System for Live Analysis and Falsification of Exception-Handling in the JVM 2021 IEEE Xplore
S19 [45] A Study on Chaos engineering for Improving Cloud Software Quality and Reliability 2021 IEEE Xplore
S20 [46] Chaos engineering for Enhanced Resilience of Cyber-Physical Systems 2021 IEEE Xplore
S21 [47] ChaosTwin: A Chaos engineering and Digital Twin Approach for the Design of Resilient IT Services 2021 IEEE Xplore
S22 [48] Platform Software Reliability for Cloud Service Continuity—Challenges and Opportunities 2021 IEEE Xplore
S23 [49] Trace-based Intelligent Fault Diagnosis for Microservices with Deep Learning 2021 IEEE Xplore
S24 [50] A Guided Approach Towards Complex Chaos Selection, Prioritization and Injection 2022 IEEE Xplore
S25 [51] Chaos Driven Development for Software Robustness Enhancement 2022 IEEE Xplore
S26 [22] Maximizing Error Injection Realism for Chaos engineering With System Calls 2022 IEEE Xplore
S27 [52] On Evaluating Self-Adaptive and Self-Healing Systems using Chaos engineering 2022 IEEE Xplore
S28 [53] Observability and chaos engineering on system calls for containerized applications in Docker 2021 ScienceDirect
S29 [54] Scalability resilience framework using application-level fault injection for cloud-based software services 2022 Springer
S30 [55] Chaos as a Software Product Line—A platform for improving open hybrid-cloud systems resiliency 2022 Wiley
S31 [56] The Observability, Chaos engineering, and Remediation for Cloud-Native Reliability 2022 Wiley
3.6. Data synthesis
To answer the research questions, the data obtained are collected
and summarized in an appropriate manner, which is called data
synthesis. To perform the data synthesis, a qualitative analysis process
was conducted on the data obtained. For instance, synonyms used
for different categories were identified and merged in the respective
fields. This comprehensive data synthesis approach allowed us to derive
insights and draw conclusions from the collected information.
4. Results
The results section of the paper provides various insights into how
chaos engineering is applied in production environments, particularly
its use in improving the resilience and reliability of microservice
architecture applications. The section discusses how fault detection is
developed using chaos engineering tools and is mainly used in
production for troubleshooting. Chaos experiments are usually conducted
in the production environment to provide realistic results. The section
further enumerates several tools that have been used for chaos
experiments, as well as discussing general principles such as defining a
steady state, forming a hypothesis, conducting the experiment, and
proving or refuting the hypothesis. These principles and tools help
detect problems such as hardware issues, software errors, network
interruptions, security vulnerabilities, and configuration mistakes
within their respective contexts.
4.1. Main statistics
Fig. 3 shows the results of the quality assessment. The distribution of
the years of publication is shown in Fig. 4. Most of the selected studies
were published in 2020 or later (see Table 1), which shows that
researchers' interest in chaos engineering has increased in recent
years. Most of the studies included were indexed in the IEEE Xplore
database.
Fig. 5 presents the distribution of the type of publications and
the corresponding databases. While there are many journal papers,
conference proceedings also appear in the selected papers.
Chaos engineering involves several categories of functionality that
serve distinct purposes in resilience testing. The first category involves
intentionally terminating processes or services to evaluate system
behavior and recovery from failures [7]. Another category is network
simulation, which allows engineers to replicate adverse network
conditions to assess system performance and reliability [25]. In the
Stressing Machine category, engineers subject the system to extreme
loads to identify limits and potential bottlenecks [7]. In security
testing, engineers simulate breaches or attacks to assess the system's
response and enhance defenses [7]. Lastly, engineers use fault
application code to inject targeted faults or errors into the codebase,
assessing system resilience and error-handling capabilities [24]. These
categories help organizations proactively identify weaknesses,
strengthen system robustness, and enhance reliability in complex
technology landscapes [7]. Functionality categories of tools are
presented in Fig. 6.
The tools utilized in industry settings are not comprehensively
addressed in articles. To provide insights for future research, the
identified tools from the additional examination were categorized based
on their functionality, as presented in Tables 2 and 3. Table 2 displays
the tools obtained from the study, while Table 3 presents additional
tools that have been examined. Tools listed in the table with
corresponding references indicate their inclusion in the referenced
articles.
4.2. How is Chaos engineering effectively applied in production environments to enhance the resilience of software systems?
Table 4 examines the successful implementation of Chaos Engineering
in operational settings, covering different aspects such as goals,
techniques and resources, guiding principles, findings, limitations and
substitutes, as well as the general strategy.
4.3. Which platforms have been used for chaos experiments?
Table 5 provides a concise summary of various tools and platforms
used in Chaos experiments, along with their specific functionalities
or characteristics. It offers comprehensive insights into each platform
through detailed descriptions accompanied by the necessary references.
Fig. 3. Quality assessment scores.
Fig. 4. Year of publication.
Fig. 5. Diagram of the distribution of studies per search database.
Fig. 6. Functionality of chaos engineering tools.
Table 2
Chaos engineering tools from studies.
Chaos engineering tool Termination Network simulating Stressing machine Security Fault application code
Chaos Monkey [57] ×
Gremlin [35] × × × × ×
Chaos Toolkit [45] × × × × ×
Pumba [55] × ×
LitmusChaos [45] × × × ×
ToxiProxy [45] × ×
PowerfulSeal [45] × × × ×
Pod Reaper [25] ×
Netflix Simian Army [36] × × ×
WireMock [25] × ×
KubeMonkey [25] × × ×
Chaosblade [45] × × ×
ChaosTwin [47] × × × ×
Chaos Machine [24] × × ×
Cloud Strike [42] ×
Phoebe [22] ×
Mjolnirr [58] ×
ChaosOrca [37] × × ×
3MileBeach [31] × ×
Muxy [25] × × ×
Blockade [25] ×
Chaos Lambda [25] × ×
Byte-Monkey [25] ×
Turbulence [25] × × ×
Cthulhu [25] × × × ×
Byteman [25] × ×
ChaosCube [55] ×
Chaos Lemur [25] ×
Chaos HTTP Proxy [25] ×
Chaos Mesh [45] × × ×
Istio Chaos [45] ×
ChAP [33] × ×
IntelliFT [44] × × × ×
Table 3
Chaos engineering tools from our search.
Chaos engineering tool Termination Network simulating Stressing machine Security Fault application code
Pod Chaos X X X
DNS Chaos X
AWS Chaos X X X
Azure Chaos X X X X
GCP Chaos X X X X
Table 4
Chaos engineering in production environments.
Category Description
Objective The primary objective of applying chaos engineering in production environments is to enhance the
resilience of software systems. This involves troubleshooting to identify and address potential
malfunctions before they occur. The overarching goal is to minimize issues in production through the
use of chaos engineering tools, enabling automatic fault detection [24,53].
Methods and tools Chaos engineering relies on specific tools to facilitate its effective application in production
environments. These tools aid in automatic fault detection, a crucial aspect of troubleshooting to
minimize potential issues in the production environment [24,53].
Principles and considerations The effective application of chaos engineering is closely tied to key principles and considerations.
These include continuous experimentation, serving as a form of robustness testing conducted in
real-world operational conditions. Fundamental principles of Chaos Experiments involve defining a
steady state, hypothesizing about its impact, conducting the experiment, and then demonstrating or
refuting the hypothesis [53].
Insights and results Chaos experiments conducted in the production environment provide valuable insights into the
behavior of the system. This is particularly significant as the production environment may exhibit
unpredictable behavior that differs from staging environments in some cases [24].
Constraints and alternatives While conducting chaos experiments in production is ideal, it is acknowledged that legal or technical
constraints may sometimes prevent this. In such cases, an alternative approach is considered, starting
chaos experiments in a staging environment and gradually transitioning to the production
environment [25].
Overall approach The overall approach for the effective application of chaos engineering in production environments
involves the systematic execution of chaos experiments. This includes leveraging chaos engineering
tools and taking into account the constraints and challenges associated with conducting experiments in
real-world operational settings. The aim is to proactively identify and address potential issues before
they impact the production environment, ultimately enhancing the resilience of software systems.
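The four principles named in Table 4 (define a steady state, state a hypothesis, run the experiment, confirm or refute the hypothesis) can be illustrated with a minimal sketch. It is not taken from any of the reviewed tools: the function name, the in-memory `system` dictionary, and the 5% error-rate tolerance are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_state_before: bool  # was the system healthy before injection?
    hypothesis_held: bool      # did it stay within tolerance under fault?


def run_chaos_experiment(
    probe: Callable[[], float],
    within_tolerance: Callable[[float], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    # 1. Verify the steady state first; abort if the system is already unhealthy.
    if not within_tolerance(probe()):
        return ExperimentResult(False, False)
    # 2./3. Hypothesis: the metric stays within tolerance while the fault is active.
    inject_fault()
    try:
        held = within_tolerance(probe())
    finally:
        rollback()  # always undo the fault, even if the probe raises
    # 4. held == True supports the hypothesis, False refutes it.
    return ExperimentResult(True, held)


# Toy in-memory "system" (hypothetical): the fault raises the error rate.
system = {"error_rate": 0.01}
result = run_chaos_experiment(
    probe=lambda: system["error_rate"],
    within_tolerance=lambda rate: rate < 0.05,
    inject_fault=lambda: system.update(error_rate=0.20),
    rollback=lambda: system.update(error_rate=0.01),
)
```

In this toy run the hypothesis is refuted, because the injected fault pushes the error rate above the tolerance, and the rollback restores the original state afterwards.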
Table 5
Chaos engineering tools identified from selected papers.
Platform/Tool Description
The Chaos Machine A tool for conducting chaos experiments at the application level on Java Virtual Machine (JVM),
using exception injection to analyze try-catch blocks for error processing [24].
Screwdriver An automated fault-tolerance testing tool for on-premise applications and services, creating realistic
error models and collecting metrics by injecting errors into the system [34].
Chaos Monkey Designed by Netflix, this tool tests the system's resilience by randomly terminating instances to check
system functionality [7,45].
Cloud Strike A security chaos engineering system for multi-cloud security, extending chaos engineering to security
by injecting faults impacting confidentiality, integrity, and availability [42].
ChaosMesh An open-source chaos engineering platform for testing the resilience and reliability of distributed
systems by intentionally injecting failures and disruptions [55].
Powerfulseal An open-source tool for testing the resilience of Kubernetes clusters by simulating real-world failures
and disruptions [55].
IntelliFT A feedback-based, automated failure testing technique for microservice applications, focusing on
exposing defects in fault-handling logic [44].
The Chaos Toolkit Open-source software that runs experiments against the system to confirm a hypothesis [25,55].
Phoebe A fault injection framework for reliability analysis concerning system call invocation errors, enabling
full observability of system call invocations and automatic experimentation [22].
Mjolnirr A private cloud platform with a built-in Chaos Monkey service for developing private PaaS cloud
infrastructure [58].
ChaosOrca A tool for Chaos engineering on containers, perturbing system calls for processes inside containers
and monitoring their effects [37].
Gremlin Offered as a SaaS technology, Gremlin tests system resilience on various parameters and conditions,
with capabilities for automation and integration with Kubernetes clusters and public clouds [35].
3MileBeach A distributed tracing and fault injection framework for microservices, enabling chaos experiments
through message serialization library manipulation [31].
ChAP A software platform for running automated chaos experiments, simulating various failure scenarios
and providing insights into system behavior under stress [29,33].
ChaosTwin Utilizes a digital twin approach in Chaos Engineering to mitigate impacts of unforeseen events,
constructing models across workload, network, and service layers [47].
Litmus Chaos An open-source cloud-native framework for Chaos Engineering in Kubernetes environments, offering a
range of chaos experiments and workflows [50].
Filibuster A testing method in chaos engineering that introduces errors into microservice architecture to validate
resilience and error tolerance [32].
Table 6
Chaos engineering in microservices: approaches, descriptions, and expected outcomes.
Approach | Description | Expected impact
Fault injection testing | This method involves intentionally introducing errors into the system to assess its response, particularly in microservices by simulating various failure modes such as network issues, service outages, or resource shortages within or between microservices, to evaluate the system's resilience and stability [52]. | Evaluating and enhancing the system's resilience and stability.
Hypothesis-driven experiments | Key to chaos engineering is conducting experiments based on well-defined hypotheses about the normal state of the system and its expected behavior during failure scenarios. This strategic approach enables focused experiments that assess the resilience of both individual microservices and the overall system [45,53]. | Identifying system weaknesses and increasing resilience.
Blast radius management | Managing the blast radius of experiments is crucial in microservices. It involves understanding the potential impact of introduced failures, starting with small experiments and then expanding, to manage failure impacts while identifying system vulnerabilities [45]. | Better understanding and enhancing the system's resilience.
Resilience requirement elicitation | Utilizing chaos engineering to determine and analyze the resilience requirements of microservice architectures. This process involves observing the system's response to induced faults to identify specific resilience needs of each microservice and their interactions [52]. | Understanding specific resilience needs of each microservice and their interactions.
Continuous testing and improvement | Regularly conducting chaos experiments as part of an ongoing testing process ensures that microservices remain resilient against unforeseen issues. This continuous approach aids in proactively finding and fixing potential system weaknesses [56]. | Proactive identification and resolution of system weaknesses, leading to continual improvement and increased resilience.
Observability and remediation | Integrating chaos engineering with observability tools enhances the monitoring of microservices during fault injection, allowing for real-time tracking of responses to failures, aiding in the development of effective remediation strategies and overall system resilience improvement [56]. | Real-time tracking of responses to failures and development of effective remediation strategies for overall system resilience improvement.
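The fault injection and blast radius rows of Table 6 can be combined in a small sketch: a wrapper fails a configurable fraction of calls, and the fraction is widened step by step until an error budget is exceeded. All names (`with_fault_injection`, `escalate`, `FaultInjected`) and the use of a seeded random generator are illustrative assumptions, not the interface of any tool in the reviewed studies.

```python
import random


class FaultInjected(Exception):
    """Raised in place of a real service failure."""


def with_fault_injection(call, failure_rate, rng=None):
    """Wrap a service call so that a fraction of invocations fail."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise FaultInjected("injected failure")
        return call(*args, **kwargs)

    return wrapped


def observed_error_rate(call, requests):
    """Invoke the call repeatedly and return the share of injected failures."""
    errors = 0
    for _ in range(requests):
        try:
            call()
        except FaultInjected:
            errors += 1
    return errors / requests


def escalate(call, rates, max_error_rate, requests=1000, seed=0):
    """Widen the blast radius step by step; stop once the error budget is exceeded."""
    results = []
    for rate in rates:
        faulty = with_fault_injection(call, rate, random.Random(seed))
        err = observed_error_rate(faulty, requests)
        results.append((rate, err))
        if err > max_error_rate:
            break  # do not expand the experiment any further
    return results
```

A call such as `escalate(service_call, [0.01, 0.05, 0.25], max_error_rate=0.1)` would start with a small share of affected requests and stop expanding as soon as the observed error rate exceeds the budget. In a real system the wrapped call would be a client with retries or fallbacks, so the observed error rate would reflect how well those mechanisms absorb the fault.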
4.4. How can Chaos engineering be effectively applied to microservice architecture to ensure successful implementation and enhance system resilience?
Table 6 provides a comprehensive overview of the different facets
and projected implications of implementing chaos engineering within
microservice architecture.
By implementing these approaches and strategies, organizations can
effectively integrate chaos engineering into their microservice
architectures to uncover vulnerabilities and enhance the overall
dependability of their systems.
4.5. To what extent can the centralized provision of Chaos engineering effectively facilitate the management of chaos experiments across complex systems?
Table 7 provides an overview of the ways in which centralized chaos
engineering can simplify experiment management in intricate systems.
It emphasizes advantages like standardization, resource utilization, risk
mitigation, and more, resulting in enhanced system resilience and
performance.
4.6. What are the challenges reported in the relevant papers?
Table 8 concisely presents the primary obstacles in the area of
chaos engineering and their respective resolutions. These obstacles
encompass system intricacy, hazards to live environments, resource
demands, security issues, and automation complexities. The proposed
resolutions involve phased implementation, risk assessment, knowledge
enhancement, robust security protocols, and automation approaches.
5. Discussion
In the discussion section, we summarize answers to the research
questions. We note that chaos engineering can improve robustness by
simulating real-world failure scenarios and exploring system reactions,
especially in microservice architectures. Various tools for
implementing chaos engineering were listed and compared. We conclude
that the application of chaos engineering requires careful planning due
to inherent challenges but has the potential to greatly improve system
resilience.
5.1. General discussion
In this article, we reviewed the literature on the application of
chaos engineering in microservice architecture to understand the
state-of-the-art. For this purpose, six research questions were defined
and answered.
In RQ1, we aimed to understand how chaos engineering is applied
to production environments. Chaos engineering, when adeptly
applied in production settings, serves as a pivotal tool for augmenting
the robustness of software systems. This approach entails conducting
deliberate and controlled chaos experiments within the production
environment, a strategy that is instrumental in uncovering and
rectifying potential issues before they escalate into full-blown system
failures, thereby bolstering system uptime [38]. Moreover, chaos
engineering is characterized by the intentional injection of faults into
systems. This methodology is crucial for identifying and addressing
security flaws and risks, laying the groundwork for the development of
resilient application architectures [56]. By replicating adverse
conditions that could naturally arise in production settings, chaos
engineering helps detect inherent system vulnerabilities and structural
deficiencies, fostering a proactive stance towards issue mitigation [38].
Additionally, this practice involves comprehensive testing of
real-world scenarios on operational systems. Such testing is vital for
assessing the complete spectrum of software systems, encompassing both
hardware malfunctions and software glitches, within their actual
deployment contexts. This approach significantly contributes to the
enhancement of overall system resilience [38]. To effectively implement
chaos engineering, it is recommended to start with less complex
experiments, leverage automation for these experiments, and focus on
areas with either high impact or high frequency of issues. Observing
the system at its limits is also crucial for reinforcing resilience [25].
In RQ2, we discuss various platforms that aim to increase the
flexibility and reliability of microservice architectures through chaos
experiments. Tools like Gremlin, Chaos Monkey, Chaos Toolkit, Pumba,
LitmusChaos, ToxiProxy, and PowerfulSeal have been utilized in
industry settings to simulate different failure scenarios. These tools
provide functions such as terminating processes, simulating network
conditions, applying stress tests, testing security measures, and
injecting faults to proactively identify weaknesses and strengthen
system robustness across different technology landscapes.
Table 7
Centralized provision in chaos engineering.
Approach | Description | Expected impact
Standardization | Centralized provision allows for the standardization of chaos engineering practices and tools across the organization. This ensures that all teams follow consistent processes and use approved tools, leading to better coordination and more reliable results [42]. | Improved coordination and reliability of results.
Resource optimization | Centralized provision enables efficient allocation of resources for chaos experiments. It allows pooling of expertise, tools, and infrastructure, reducing redundancy and optimizing resource utilization [38]. | Enhanced resource utilization and reduced redundancy.
Risk management | Centralized provision facilitates better risk management by providing oversight and governance for chaos experiments. It establishes clear guidelines, safety measures, and expected states for running experiments in production environments, ensuring controlled experimentation [42]. | Controlled experimentation and effective risk management.
Automation and continuous testing | Centralized provision supports the automation of chaos experiments to run continuously. This ensures regular conduction of experiments, leading to ongoing validation of system resilience and identification of potential issues before they manifest as outages [38,42]. | Ongoing validation of system resilience and early identification of potential issues.
Knowledge sharing and collaboration | A centralized approach encourages knowledge sharing and collaboration among teams. It facilitates the dissemination of best practices, lessons learned, and successful experiment designs, fostering a culture of continuous improvement and shared learning [25]. | Promotion of a continuous improvement culture and shared learning.
Performance metrics and analysis | Centralized provision enables the establishment of standardized performance metrics and analysis methods for chaos experiments. This allows for consistent measurement of system health and identification of deviations from steady-state, leading to more effective decision-making and system improvements [43]. | Consistent system health measurement and more effective decision-making.
Table 8
Challenges and solutions in chaos engineering.
Category | Challenges | Possible solutions | References
Complexity | Designing and executing effective chaos experiments in large systems is complex due to intricate interdependencies within these systems. | To mitigate complexity, it is recommended to start with smaller, more manageable experiments and gradually expand the scope of chaos engineering practices. | [25,43]
Risk of impact | Concerns about causing disruptions in the production environment, affecting users and business operations. | Implementing risk analysis techniques can help prioritize experiments, focusing on less critical system components first to minimize potential impacts. | [45,50]
Resource intensiveness | Significant resources needed including time, expertise, and infrastructure, posing a barrier for many organizations. | Addressing resource intensiveness involves providing comprehensive training and education on chaos engineering best practices and tools to equip teams with the necessary skills and knowledge. | [7,47]
Security concerns | Introducing controlled failures can raise security issues, potentially exposing vulnerabilities or sensitive data. | To combat security concerns, robust security measures should be implemented during experiments to safeguard sensitive data and prevent unauthorized access. | [42,47]
Tooling and automation | Developing tools for automated chaos experiments is challenging in heterogeneous and dynamic environments. | Overcoming tooling and automation challenges requires the development and use of automated tools for Chaos experiments, which reduce manual efforts and facilitate continuous, unattended testing. | [7,33,38,40,42]
Recent studies have emphasized the growing intersection between
artificial intelligence and cybersecurity within the context of chaos
engineering. AI-driven techniques are now used for real-time
threat detection, anomaly prediction, and automated response
mechanisms in enterprise systems. For example, generative AI models have
been proposed to enhance cybersecurity frameworks by improving data
privacy management and identifying potential attack vectors [59].
In RQ3, we focused on understanding how chaos engineering is
implemented in microservice architectures. To enhance system resilience
in microservice architectures through chaos engineering, organizations
should utilize fault injection testing to replicate failures within
microservices. They should also conduct hypothesis-driven experiments
with a solid comprehension of the normal state and anticipated
behavior during disruptions, while managing the scope of these
experiments to minimize impact. Additionally, it is essential to
identify and analyze resilience requirements, participate in continuous
testing and improvement efforts, as well as integrate observability
tools for real-time monitoring during fault injection tests. Moreover,
organizations need to establish clear communication channels across the
teams involved in order to ensure effective collaboration and knowledge
sharing.
The answer to RQ4 highlights the significance of centralized
management and monitoring in conducting chaos experiments within
large-scale microservices ecosystems. It discusses the utilization of
software solutions like Netflix's Chaos Automation Platform (ChAP) and
fault injection techniques such as service call manipulation. The
emphasis is placed on the need for careful planning, effective
communication, risk management, and continuous learning to ensure
comprehensive and valuable chaos experiments for enhancing overall
system resilience.
In response to RQ5, our discussion concludes that the practical
implementation of chaos engineering, despite its promise to enhance
system resilience, presents numerous challenges. These challenges
include potential business impacts, difficulty in determining scope, the
unpredictability of outcomes, time and resource constraints, system
complexities, skill and knowledge prerequisites, interpretation of
results, cultural readiness, and selection of appropriate tools. These
all necessitate meticulous planning and skilled execution for
effectiveness.
Recent studies explore the convergence of Chaos Engineering and
Artificial Intelligence (AI). Large language models (LLMs) have been
used to automate the chaos engineering lifecycle, managing phases
from hypothesis creation to experiment orchestration and
remediation [60]. Meanwhile, advances in applying chaos engineering to
multi-agent AI systems suggest new directions: for example, chaos
experiments applied to LLM-based multi-agent systems can surface
vulnerabilities such as hallucinations, agent failures, or inter-agent
communication breakdowns [61]. Together, these works show how intelligent,
adaptive chaos frameworks might evolve in microservice-based systems
as well.
Recent research also discusses specific operational challenges such
as load balancing and security in the context of chaos engineering. For
example, an empirical study applies delay injections under different
user loads in cloud-native systems to observe how throughput and
latency change under stress, providing insights into how load
balancing policies perform under fault conditions [62]. In parallel,
several frameworks have begun integrating security-focused chaos tests
that intentionally inject faults into authentication, identity
management, and access control components to ensure that security
mechanisms remain effective under stress conditions [63]. These studies
highlight how chaos engineering can be extended beyond performance
reliability to proactively strengthen both load distribution and
security resilience in microservice environments.
The main challenges faced by previous researchers and possible
solutions have been discussed in the paper. The collected challenges
were mainly related to the correct interpretation of chaos experiments
and making sense of them. There may be more challenges, but if
they were not mentioned in these articles, we could not include them.
We believe that chaos engineering is still in the early stages and the
adoption in the software industry will take some time.
5.2. Threats to validity
Internal validity
The validity of this systematic literature review is threatened by
issues related to defining the candidate pool of papers, potential bias
in selecting primary studies, data extraction, and data synthesis. The
application of exclusion criteria can be influenced by the researchers'
biases, posing a potential threat to validity. We compiled a
comprehensive list of exclusion criteria, and all conflicts were
documented and resolved through discussions among us. Data extraction
validity is crucial as it directly impacts the study results. Whenever
any of us was uncertain about data extraction, the case was recorded for
resolution
experiments are insightful, as they reveal system behaviors in
production environments, which often differ unpredictably from staging
environments [36,53].
Furthermore, the effectiveness of chaos engineering is contingent
on the systematic execution of chaos experiments. These experiments,
utilizing advanced chaos engineering tools, need to navigate the
constraints and challenges inherent in real-world operational settings.
The main objective is the enhancement of system resilience, achieved
by proactively identifying and preemptively addressing potential
issues [46].
However, it is acknowledged that conducting chaos experiments
directly in production environments might be impeded by legal or
technical constraints. In such scenarios, initiating experiments in a
staging environment and then gradually transitioning to the production
environment offers a viable alternative. This approach ensures that
the benefits of chaos engineering can still be realized, but in a more
controlled and possibly less direct manner.
Our review highlights that chaos engineering is a critical
methodology for ensuring the resilience and robustness of software
systems. By following continuous experimentation and proactive
troubleshooting, it offers a pathway to address the challenges faced in
complex production environments. This SLR contributes to the scientific
community by discussing these methodologies and their applications,
thereby providing a framework for future research and practical
implementation in the field of software system resilience.
CRediT authorship contribution statement
Emrah Esen: Writing – review & editing, Writing – original draft,
Visualization, Validation, Software, Methodology, Investigation,
Formal analysis, Data curation. Akhan Akbulut: Writing – review &
editing, Writing – original draft, Visualization, Validation,
Supervision, Software, Resources, Project administration, Methodology,
Investigation, Formal analysis, Data curation. Cagatay Catal: Writing –
review & editing, Writing – original draft, Visualization, Validation,
through discussions with the team. Multiple meetings were held to Supervision, Software, Resources, Project administration, Methodology,
minimize researcher bias. Investigation, Funding acquisition, Formal analysis, Data curation.
External validity Declaration of competing interest
The search for candidate papers involved using general search terms
to minimize the risk of excluding relevant studies. Despite using a broad The authors declare that they have no known competing finan-
search query to acquire more articles, there remains a possibility that cial interests or personal relationships that could have appeared to
some papers were overlooked in electronic databases or missed due to influence the work reported in this paper.
recent publications. Furthermore, although seven widely used online
databases in computer science and software engineering were searched, Data availability
new papers may not have been included.
Data will be made available on request.
6. Conclusion
Our systematic literature review (SLR) on chaos engineering has References
explored its role in enhancing the resilience of software systems in pro-
duction environments. Through our review, we have identified several [1] P. Jamshidi, C. Pahl, N.C. Mendonça, J. Lewis, S. Tilkov, Microservices: The
journey so far and challenges ahead, IEEE Softw. 35 (3) (2018) 2435, http:
crucial aspects that underline the effective application and challenges
//dx.doi.org/10.1109/MS.2018.2141039.
of chaos engineering [25]. [2] I. Beschastnikh, P. Wang, Y. Brun, M.D. Ernst, Debugging distributed systems,
Firstly, Chaos Engineering serves as a proactive troubleshooting ap- Commun. ACM 59 (8) (2016) 3237, http://dx.doi.org/10.1145/2909480.
proach in production environments [25]. By identifying and addressing [3] W. Ahmed, Y.W. Wu, A survey on reliability in distributed systems, J. Comput.
potential malfunctions before they occur, it effectively preempts system System Sci. 79 (8) (2013) 12431255, http://dx.doi.org/10.1016/j.jcss.2013.02.
006.
disruptions. This proactive strategy is significantly implemented by
[4] D. Maruf, S. Sulistyo, L. Nugroho, Applying integrating testing of microservices
chaos engineering tools that assist in automatic fault detection, thereby in airline ticketing system, Ijitee (Int. J. Inf. Technol. Electr. Eng.) 4 (2020) 39,
minimizing potential issues in these critical environments [50]. http://dx.doi.org/10.22146/ijitee.55491.
Secondly, the essence of chaos engineering is rooted in continuous [5] F. Dai, H. Chen, Z. Qiang, Z. Liang, B. Huang, L. Wang, Automatic analysis
experimentation and robustness testing under real-world operational of complex interactions in microservice systems, Complexity 2020 (2020) 112,
http://dx.doi.org/10.1155/2020/2128793.
conditions. The methodology involves a systematic approach: defining [6] J. Lewis, M. Fowler, Microservices: a definition of this new architectural term
a steady state, hypothesizing its impacts, conducting controlled exper- (2014), 2014, URL: http://martinfowler.com/articles/microservices.html (cit. p.
iments, and subsequently confirming or refuting the hypotheses. These 26).
[7] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, C. Rosenthal, Chaos engineering, IEEE Softw. 33 (3) (2016) 35–41, http://dx.doi.org/10.1109/MS.2016.60.
[8] R.T. Munodawafa, S.K. Johl, A systematic review of eco-innovation and performance from the resource-based and stakeholder perspectives, Sustainability 11 (2019) 6067, http://dx.doi.org/10.3390/su11216067.
[9] J.M. Macharia, Systematic literature review of interventions supported by integration of ict in education to improve learners' academic performance in stem subjects in kenya, J. Educ. Pract. 6 (2022) 52–75, http://dx.doi.org/10.47941/jep.979.
[10] P. Gerli, J.N. Marco, J. Whalley, What makes a smart village smart? a review of the literature, Transform. Gov.: People Process. Policy 16 (2022) 292–304, http://dx.doi.org/10.1108/tg-07-2021-0126.
[11] R. Coppola, L. Ardito, Quality assessment methods for textual conversational interfaces: a multivocal literature review, Information 12 (2021) 437, http://dx.doi.org/10.3390/info12110437.
[12] B. Kitchenham, O. Pearl Brereton, D. Budgen, M. Turner, J. Bailey, S. Linkman, Systematic literature reviews in software engineering – A systematic literature review, Inf. Softw. Technol. 51 (1) (2009) 7–15, http://dx.doi.org/10.1016/j.infsof.2008.09.009, Special Section - Most Cited Articles in 2002 and Regular Research Papers.
[13] N. Dragoni, S. Giallorenzo, A.L. Lafuente, M. Mazzara, F. Montesi, R. Mustafin, L. Safina, Microservices: yesterday, today, and tomorrow, 2017, arXiv:1606.04036.
[14] P.D. Francesco, I. Malavolta, P. Lago, Research on architecting microservices: Trends, focus, and potential for industrial adoption, in: 2017 IEEE International Conference on Software Architecture, ICSA, 2017, pp. 21–30, http://dx.doi.org/10.1109/ICSA.2017.24.
[15] M. Fowler, Patterns of Enterprise Application Architecture, Addison-Wesley Longman Publishing Co., Inc., USA, 2002.
[16] J. Lewis, M. Fowler, Microservices, 2014, https://martinfowler.com/articles/microservices.html.
[17] S. Newman, Building Microservices: Designing Fine-Grained Systems, O'Reilly Media, Inc., 2021.
[18] C.K. Rudrabhatla, Comparison of zero downtime based deployment techniques in public cloud infrastructure, in: 2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud), I-SMAC, 2020, pp. 1082–1086, http://dx.doi.org/10.1109/I-SMAC49090.2020.9243605.
[19] S.R. Addula, P. Perugu.P, M.K. Kumar, D. Kumar, B. Ananthan, R. R, S. P, S. G, Dynamic load balancing in cloud computing using hybrid Kookaburra-Pelican optimization algorithms, in: 2024 International Conference on Augmented Reality, Intelligent Systems, and Industrial Automation, ARIIA, 2024, pp. 1–7, http://dx.doi.org/10.1109/ARIIA63345.2024.11051893.
[20] M. Waseem, P. Liang, M. Shahin, A systematic mapping study on microservices architecture in devops, J. Syst. Softw. 170 (2020) 110798, http://dx.doi.org/10.1016/j.jss.2020.110798.
[21] C. Rosenthal, N. Jones, Chaos Engineering: System Resiliency in Practice, O'Reilly Media, 2020.
[22] L. Zhang, B. Morin, B. Baudry, M. Monperrus, Maximizing error injection realism for chaos engineering with system calls, IEEE Trans. Dependable Secur. Comput. 19 (4) (2022) 2695–2708, http://dx.doi.org/10.1109/TDSC.2021.3069715.
[23] Š. Davidovič, B. Beyer, Canary analysis service, Commun. ACM 61 (5) (2018) 54–62, http://dx.doi.org/10.1145/3190566.
[24] L. Zhang, B. Morin, P. Haller, B. Baudry, M. Monperrus, A chaos engineering system for live analysis and falsification of exception-handling in the JVM, IEEE Trans. Softw. Eng. 47 (11) (2021) 2534–2548, http://dx.doi.org/10.1109/TSE.2019.2954871.
[25] H. Jernberg, P. Runeson, E. Engström, Getting started with chaos engineering - design of an implementation framework in practice, in: Proceedings of the 14th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM, ESEM '20, Association for Computing Machinery, New York, NY, USA, 2020, http://dx.doi.org/10.1145/3382494.3421464.
[26] A. Alkhateeb, C. Catal, G. Kar, A. Mishra, Hybrid blockchain platforms for the internet of things (IoT): A systematic literature review, Sensors 22 (4) (2022) http://dx.doi.org/10.3390/s22041304.
[27] R. van Dinter, B. Tekinerdogan, C. Catal, Predictive maintenance using digital twins: A systematic literature review, Inf. Softw. Technol. 151 (2022) 107008, http://dx.doi.org/10.1016/j.infsof.2022.107008.
[28] M. Jorayeva, A. Akbulut, C. Catal, A. Mishra, Machine learning-based software defect prediction for mobile applications: A systematic literature review, Sensors 22 (7) (2022) http://dx.doi.org/10.3390/s22072551.
[29] A. Basiri, L. Hochstein, N. Jones, H. Tucker, Automating chaos experiments in production, in: 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP, 2019, pp. 31–40, http://dx.doi.org/10.1109/ICSE-SEIP.2019.00012.
[30] L.B. Canonico, V. Vakeel, J. Dominic, P. Rodeghero, N. McNeese, Human-AI partnerships for chaos engineering, in: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, ICSEW '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 499–503, http://dx.doi.org/10.1145/3387940.3391493.
[31] J. Zhang, R. Ferydouni, A. Montana, D. Bittman, P. Alvaro, 3MileBeach: A tracer with teeth, in: Proceedings of the ACM Symposium on Cloud Computing, SoCC '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 458–472, http://dx.doi.org/10.1145/3472883.3486986.
[32] C.S. Meiklejohn, A. Estrada, Y. Song, H. Miller, R. Padhye, Service-level fault injection testing, in: Proceedings of the ACM Symposium on Cloud Computing, SoCC '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 388–402, http://dx.doi.org/10.1145/3472883.3487005.
[33] A. Blohowiak, A. Basiri, L. Hochstein, C. Rosenthal, A platform for automating chaos experiments, in: 2016 IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW, 2016, pp. 5–8, http://dx.doi.org/10.1109/ISSREW.2016.52.
[34] A. Nagarajan, A. Vaddadi, Automated fault-tolerance testing, in: 2016 IEEE Ninth International Conference on Software Testing, Verification and Validation Workshops, ICSTW, 2016, pp. 275–276, http://dx.doi.org/10.1109/ICSTW.2016.34.
[35] V. Heorhiadi, S. Rajagopalan, H. Jamjoom, M.K. Reiter, V. Sekar, Gremlin: Systematic resilience testing of microservices, in: 2016 IEEE 36th International Conference on Distributed Computing Systems, ICDCS, 2016, pp. 57–66, http://dx.doi.org/10.1109/ICDCS.2016.11.
[36] R.K. Lenka, S. Padhi, K.M. Nayak, Fault injection techniques - a brief review, in: 2018 International Conference on Advances in Computing, Communication Control and Networking, ICACCCN, 2018, pp. 832–837, http://dx.doi.org/10.1109/ICACCCN.2018.8748585.
[37] A. van Hoorn, A. Aleti, T.F. Düllmann, T. Pitakrat, ORCAS: Efficient resilience benchmarking of microservice architectures, in: 2018 IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW, 2018, pp. 146–147, http://dx.doi.org/10.1109/ISSREW.2018.00-10.
[38] H. Tucker, L. Hochstein, N. Jones, A. Basiri, C. Rosenthal, The business case for chaos engineering, IEEE Cloud Comput. 5 (3) (2018) 45–54, http://dx.doi.org/10.1109/MCC.2018.032591616.
[39] N. Brousse, O. Mykhailov, Use of self-healing techniques to improve the reliability of a dynamic and geo-distributed ad delivery service, in: 2018 IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW, 2018, pp. 1–5, http://dx.doi.org/10.1109/ISSREW.2018.00-40.
[40] K.A. Torkura, M.I. Sukmana, F. Cheng, C. Meinel, Security chaos engineering for cloud services: Work in progress, in: 2019 IEEE 18th International Symposium on Network Computing and Applications, NCA, 2019, pp. 1–3, http://dx.doi.org/10.1109/NCA.2019.8935046.
[41] H. Chen, P. Chen, G. Yu, A framework of virtual war room and matrix sketch-based streaming anomaly detection for microservice systems, IEEE Access 8 (2020) 43413–43426, http://dx.doi.org/10.1109/ACCESS.2020.2977464.
[42] K.A. Torkura, M.I.H. Sukmana, F. Cheng, C. Meinel, CloudStrike: Chaos engineering for security and resiliency in cloud infrastructure, IEEE Access 8 (2020) 123044–123060, http://dx.doi.org/10.1109/ACCESS.2020.3007338.
[43] D. Kesim, A. van Hoorn, S. Frank, M. Häussler, Identifying and prioritizing chaos experiments by using established risk analysis techniques, in: 2020 IEEE 31st International Symposium on Software Reliability Engineering, ISSRE, 2020, pp. 229–240, http://dx.doi.org/10.1109/ISSRE5003.2020.00030.
[44] Z. Long, G. Wu, X. Chen, C. Cui, W. Chen, J. Wei, Fitness-guided resilience testing of microservice-based applications, 2020, pp. 151–158, http://dx.doi.org/10.1109/ICWS49710.2020.00027.
[45] S. De, A study on chaos engineering for improving cloud software quality and reliability, in: 2021 International Conference on Disruptive Technologies for Multi-Disciplinary Research and Applications, CENTCON, Vol. 1, 2021, pp. 289–294, http://dx.doi.org/10.1109/CENTCON52345.2021.9688292.
[46] C. Konstantinou, G. Stergiopoulos, M. Parvania, P. Esteves-Verissimo, Chaos engineering for enhanced resilience of cyber-physical systems, in: 2021 Resilience Week, RWS, 2021, pp. 1–10, http://dx.doi.org/10.1109/RWS52686.2021.9611797.
[47] F. Poltronieri, M. Tortonesi, C. Stefanelli, ChaosTwin: A chaos engineering and digital twin approach for the design of resilient IT services, in: 2021 17th International Conference on Network and Service Management, CNSM, 2021, pp. 234–238, http://dx.doi.org/10.23919/CNSM52442.2021.9615519.
[48] N. Luo, Y. Xiong, Platform software reliability for cloud service continuity - challenges and opportunities, in: 2021 IEEE 21st International Conference on Software Quality, Reliability and Security, QRS, 2021, pp. 388–393, http://dx.doi.org/10.1109/QRS54544.2021.00050.
[49] H. Chen, K. Wei, A. Li, T. Wang, W. Zhang, Trace-based intelligent fault diagnosis for microservices with deep learning, in: 2021 IEEE 45th Annual Computers, Software, and Applications Conference, COMPSAC, 2021, pp. 884–893, http://dx.doi.org/10.1109/COMPSAC51774.2021.00121.
[50] O. Sharma, M. Verma, S. Bhadauria, P. Jayachandran, A guided approach towards complex chaos selection, prioritisation and injection, in: 2022 IEEE 15th International Conference on Cloud Computing, CLOUD, 2022, pp. 91–93, http://dx.doi.org/10.1109/CLOUD55607.2022.00025.
[51] N. Luo, L. Zhang, Chaos driven development for software robustness enhancement, in: 2022 9th International Conference on Dependable Systems and their Applications, DSA, 2022, pp. 1029–1034, http://dx.doi.org/10.1109/DSA56465.2022.00154.
[52] M.A. Naqvi, S. Malik, M. Astekin, L. Moonen, On evaluating self-adaptive and self-healing systems using chaos engineering, in: 2022 IEEE International Conference on Autonomic Computing and Self-Organizing Systems, ACSOS, 2022, pp. 1–10, http://dx.doi.org/10.1109/ACSOS55765.2022.00018.
[53] J. Simonsson, L. Zhang, B. Morin, B. Baudry, M. Monperrus, Observability and chaos engineering on system calls for containerized applications in Docker, Future Gener. Comput. Syst. 122 (2021) 117–129, http://dx.doi.org/10.1016/j.future.2021.04.001.
[54] A.A.-S. Ahmad, P. Andras, Scalability resilience framework using application-level fault injection for cloud-based software services, J. Cloud Comput. 11 (1) (2022) 1, http://dx.doi.org/10.1186/s13677-021-00277-z.
[55] C. Camacho, P.C. Cañizares, L. Llana, A. Núñez, Chaos as a software product line—A platform for improving open hybrid-cloud systems resiliency, Softw.: Pract. Exp. 52 (7) (2022) 1581–1614, http://dx.doi.org/10.1002/spe.3076.
[56] P. Raj, S. Vanga, A. Chaudhary, The observability, chaos engineering, and remediation for cloud-native reliability, in: Cloud-Native Computing: How To Design, Develop, and Secure Microservices and Event-Driven Applications, 2023, pp. 71–93, http://dx.doi.org/10.1002/9781119814795.ch4.
[57] M.A. Chang, B. Tschaen, T. Benson, L. Vanbever, Chaos monkey: Increasing sdn reliability through systematic network destruction, in: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, 2015, pp. 371–372.
[58] D. Savchenko, G. Radchenko, O. Taipale, Microservices validation: Mjolnirr platform case study, in: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO, 2015, pp. 235–240, http://dx.doi.org/10.1109/MIPRO.2015.7160271.
[59] G.S. Nadella, S.R. Addula, A.R. Yadulla, G.S. Sajja, M. Meesala, M.H. Maturi, K. Meduri, H. Gonaygunta, Generative AI-enhanced cybersecurity framework for enterprise data privacy management, Computers 14 (2) (2025) http://dx.doi.org/10.3390/computers14020055.
[60] D. Kikuta, H. Ikeuchi, K. Tajiri, Y. Nakano, ChaosEater: Fully automating chaos engineering with large language models, 2025, arXiv preprint arXiv:2501.11107. URL https://arxiv.org/abs/2501.11107.
[61] J. Owotogbe, Assessing and enhancing the robustness of LLM-based multi-agent systems through chaos engineering, in: 2025 IEEE/ACM 4th International Conference on AI Engineering – Software Engineering for AI, CAIN, 2025, pp. 250–252, http://dx.doi.org/10.1109/CAIN66642.2025.00039.
[62] A. Al-Said Ahmad, L.F. Al-Qoran, A. Zayed, Exploring the impact of chaos engineering with various user loads on cloud native applications: An exploratory empirical study, Computing 106 (2024) 2389–2425, http://dx.doi.org/10.1007/s00607-024-01292-z.
[63] K.A. Torkura, M.I. Sukmana, F. Cheng, C. Meinel, Security chaos engineering for cloud services: Work in progress, in: 2019 IEEE 18th International Symposium on Network Computing and Applications, NCA, 2019, pp. 1–3, http://dx.doi.org/10.1109/NCA.2019.8935046.