Computer Standards & Interfaces 97 (2026) 104086

Graph-based interpretable dialogue sentiment analysis: A HybridBERT-LSTM framework with semantic interaction explainer

Ercan Atagün a,∗, Günay Temür b, Serdar Biroğul c,d

a Computer Engineering, Institute of Graduate Studies, Duzce University, Düzce, 81000, Turkey
b Kaynasli Vocational School, Duzce University, Düzce, 81000, Turkey
c Department of Computer Engineering, Faculty of Engineering, Duzce University, Düzce, 81000, Turkey
d Department of Electronics and Information Technologies, Faculty of Architecture and Engineering, Nakhchivan State University, Nakhchivan, Azerbaijan

Keywords: Natural language processing; Explainable artificial intelligence; Word context graph explainer

ABSTRACT

Conversational sentiment analysis in natural language processing faces substantial challenges due to intricate contextual semantics and temporal dependencies within multi-turn dialogues. We present a novel HybridBERT-LSTM architecture that integrates BERT's contextualized embeddings with LSTM's sequential processing capabilities to enhance sentiment classification performance in dialogue scenarios. Our framework employs a dual-pooling mechanism to capture local semantic features and global discourse dependencies, addressing limitations of conventional approaches. Comprehensive evaluation on the IMDb benchmark and real-world dialogue datasets demonstrates that HybridBERT-LSTM consistently improves over standalone models (LSTM, BERT, CNN, SVM) across accuracy, precision, recall, and F1-score metrics. The architecture effectively exploits pre-trained contextual representations through bidirectional LSTM layers for temporal discourse modeling. We also introduce WordContextGraphExplainer, a graph-theoretic interpretability framework that addresses the limitations of conventional explanation methods. Unlike LIME's linear additivity assumptions, which treat features independently, our approach uses perturbation-based analysis to model non-linear semantic interactions. The framework generates semantic interaction graphs whose nodes represent word contributions and whose edges encode inter-word dependencies, visualizing contextual sentiment propagation patterns. Empirical analysis reveals LIME's inadequacy in capturing the temporal discourse dependencies and collaborative semantic interactions crucial for dialogue sentiment understanding. WordContextGraphExplainer explicitly models semantic interdependencies, negation scope, and temporal flow across conversational turns, enabling comprehensive understanding of both word-level contributions and contextual interaction influences on decision-making processes. This integrated framework establishes a new paradigm for interpretable dialogue sentiment analysis, advancing trustworthy AI through high-performance classification coupled with comprehensive explainability.

∗ Corresponding author. E-mail address: ercanatagun@duzce.edu.tr (E. Atagün).
https://doi.org/10.1016/j.csi.2025.104086
Received 7 June 2025; Received in revised form 7 October 2025; Accepted 13 October 2025; Available online 12 November 2025
0920-5489/© 2025 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Dialogue-based sentiment analysis constitutes a significant research domain within the field of natural language processing (NLP). This area of study represents a fundamental component of efforts to enhance human–machine interaction through more meaningful and emotion-centric approaches, and research in this field encompasses numerous inherent challenges and complexities. Dialogues typically emerge from the reciprocal interactions among multiple conversational participants, where the scope of communicative content spans the breadth of human knowledge and experience. The emotional orientation of an utterance within a conversational sequence depends substantially on preceding discourse and contextual cues. This phenomenon necessitates the development of context-aware models for sentiment analysis, as conventional text classification methodologies frequently fail to capture such sequential continuity. The multi-speaker nature of dialogues introduces critical considerations regarding utterance attribution and the identification of emotional expression sources. Modeling sentiment transitions between conversational participants presents particular challenges, especially in scenarios where emotions are expressed through implicit mechanisms. Rather than explicit emotional declarations, human linguistic behavior frequently employs sophisticated rhetorical devices, including irony, sarcasm, humor, double entendres, and cultural references, resulting in sentiment interpretations that diverge significantly from surface-level textual analysis. This proves particularly problematic for brief, context-independent utterances, substantially complicating sentiment analysis procedures.

Contemporary dialogue-based sentiment analysis research also faces significant constraints regarding the availability of high-quality, annotated datasets. Existing corpora are characterized either by limited scale or by restriction to specific contextual domains such as cinematic dialogue or customer-service interactions. Furthermore, the insufficient representation of cultural, linguistic, and social diversity within available datasets impedes the development of generalizable models with robust cross-domain applicability. Deep learning-based sentiment analysis architectures predominantly exhibit ''black box'' characteristics, rendering their decision-making processes opaque to human interpretation. This limitation particularly diminishes model reliability in tasks where emotional interpretation involves inherent subjectivity, consequently necessitating human oversight in practical applications.

In this study, a novel hybrid model is proposed that integrates BERT's contextualized representation capabilities with the sequential modeling proficiency of LSTM to address the inherent challenges of sentiment analysis on dialogue-based datasets. The architecture is specifically designed to capture both the linguistic features and the temporal dependencies embedded within conversational structures. To enhance the interpretability of model outputs, a graph-theoretic interpretability framework, termed WordContextGraphExplainer, is introduced. This framework overcomes the limitations of conventional explanation methods by modeling non-linear semantic interactions between lexical units. Through the construction of semantic interaction graphs, the approach facilitates comprehensive visualization of contextual sentiment propagation patterns, offering novel insights into the underlying decision-making mechanisms of the model and establishing a new paradigm for interpretable sentiment analysis in dialogue systems.

2. Related works

Sentiment analysis has gained significant traction in NLP research, driven by its pivotal role in enabling affective computing across domains such as human–computer interaction, intelligent customer support, and conversational AI systems. Recent advancements in the field have led to a diverse array of methodologies, encompassing text-based approaches, multimodal frameworks, contextual modeling techniques, and sophisticated deep learning architectures. This section presents an overview of key contributions in the literature, with particular emphasis on dialogue-based sentiment analysis, which plays a critical role in domains such as customer support, conversational AI, and empathetic dialogue systems.

Song et al. [1] introduced a topic-aware sentiment analysis model for dialogue (CASA), aiming to identify sentiment orientations within conversational threads. Firdaus et al. [2] constructed the MEISD dataset, incorporating textual, audio, and visual data for multimodal sentiment analysis. Emphasizing the relevance of conversational context, Carvalho et al. [3] demonstrated that prior utterances significantly influence sentiment classification outcomes. Building upon this insight, topic-aware sentiment classification models have been proposed using multi-task learning strategies within customer service dialogues [4]. Real-time sentiment analysis in dialogue systems is also a critical consideration: Bertero et al. [5] developed a convolutional neural network capable of processing audio inputs for instantaneous emotion detection in interactive systems, and Bothe et al. [6] presented a model that predicts the sentiment of upcoming utterances, thereby analyzing emotional transitions throughout dialogue sequences. To address the limitations of unimodal text-based sentiment analysis, recent studies have adopted multimodal strategies that integrate text, speech, and visual signals. For instance, the EmoSen model [7] generates sentiment-aware responses using fused inputs from these modalities. Similarly, Mallol-Ragolta and Schuller [8] introduced a system that personalizes dialogue responses by estimating user emotions and arousal levels. Akbar et al. [9] proposed an innovative emotion-driven framework for video-based sentiment analysis in social media environments, further demonstrating the potential of multimodal affective understanding.

Graph-based modeling has also been incorporated into multimodal sentiment analysis. Zhao and Gao [10] proposed a semantically enriched heterogeneous dialogue graph network to analyze sentiment in multi-party conversations. Yang et al. [11] advanced sentiment accuracy through a model that jointly processes text, audio, and visual cues. Context-awareness is a pivotal factor in sentiment interpretation within dialogues: Carvalho et al. [3] emphasized the influence of preceding discourse on sentiment prediction, and personalized dialogue summarization techniques have been employed to enhance contextual coherence in generative AI dialogue systems [12]. Mustapha [13] proposed a model to analyze sentiment–cause relationships in stress-laden conversations, aiming to reveal emotional dynamics. Contextual memory mechanisms were further explored by Li et al. [14], who developed a bidirectional emotional recurrent unit (BiERU) to capture dynamic context shifts and their implications for sentiment detection.

Explainability has gained increasing importance in sentiment analysis, and a variety of approaches, including attention mechanisms, graph neural networks, and neuro-symbolic architectures, have been introduced to elucidate model decision-making. Poria et al. [15] discussed fundamental challenges in sentiment interpretation and underscored the role of explainability. Zhu et al. [16] developed a neuro-symbolic model for personalized sentiment analysis, incorporating user-specific contextual factors into the explanatory framework. Luo et al. [17] introduced the PanoSent dataset to improve the analysis of emotional shifts in interactive systems. In another direction, Zhang et al. [18] proposed a novel interaction network inspired by quantum theory to reframe dialogue-based sentiment analysis. Yang et al. [19] addressed the inadequacies of existing pre-trained models in capturing the logical structure of dialogues; to overcome these limitations, they proposed a new pre-training framework comprising utterance order modeling, sentence skeleton reconstruction, and sentiment shift detection, demonstrating improvements in learning emotion interactions and discourse coherence. Collectively, recent developments in sentiment analysis emphasize the significance of contextual awareness, multimodal data fusion, graph-based reasoning, and explainable AI techniques in enhancing performance and interpretability within dialogue-centric applications.

3. Materials and methods

The dialogue dataset comprises dyadic conversational exchanges between two distinct participants. Each dialogue instance is structured as a sequence of alternating utterances, where each turn is associated with a specific speaker and the corresponding textual content. The formal mathematical representation of the dialogue structure is given by:

D = {(s_i, t_i)}_{i=1}^{N},  s_i ∈ S = {A, B},  t_i ∈ Σ*

Here, D denotes the complete dialogue, composed of N conversational turns. Each pair (s_i, t_i) represents the i-th turn, where s_i is the speaker identifier and t_i is the corresponding utterance. The speaker set S = {A, B} contains the two participants, who typically alternate in a turn-based structure. The symbol Σ represents the alphabet of the natural language in which the dialogue is conducted, and Σ* denotes the set of all finite-length strings (i.e., possible utterances) formed from this alphabet.
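The turn structure defined above can be illustrated as a simple data type (a minimal sketch; the class, field, and variable names are our own illustrative choices, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str  # s_i ∈ {"A", "B"}
    text: str     # t_i ∈ Σ*, a finite-length utterance

# A dialogue D = {(s_i, t_i)}_{i=1}^N as an ordered list of turns
dialogue = [
    Turn("A", "Did you enjoy the film?"),
    Turn("B", "Honestly, it was not good at all."),
    Turn("A", "That's a shame, the reviews were great."),
]

N = len(dialogue)                         # number of conversational turns
speakers = {t.speaker for t in dialogue}  # observed speaker set S
assert speakers <= {"A", "B"}
```

Keeping the speaker identifier attached to each utterance preserves the turn-attribution information that the dialogue models below rely on.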
3.1. Data preprocessing and word embedding

The successful training of natural language processing (NLP) models is highly dependent on the transformation of raw textual data into structured and semantically meaningful representations [20]. In this study, all textual inputs undergo a series of preprocessing operations designed to optimize them for subsequent modeling tasks. An initial and essential step is lowercasing, which standardizes textual input by mitigating case-sensitivity inconsistencies that would otherwise lead to redundant representations of semantically identical words; this step is particularly critical for the effectiveness and consistency of word embedding techniques. Given that parts of the dataset originate from web-based sources, residual HTML tags and encoded entities (such as line-break tags and non-breaking spaces) are present in the raw text. These components provide no linguistic or semantic value and may negatively affect model performance. Therefore, all HTML-related tokens and special characters are systematically removed during preprocessing to reduce noise in the input space and to enhance the robustness of downstream NLP models. This cleaning process is implemented using the Python NLTK and BeautifulSoup libraries combined with regular-expression patterns to ensure thorough removal of web-derived artifacts. Additionally, standard stopword removal is applied to eliminate semantically non-contributive terms. Notably, traditional morphological normalization techniques such as stemming and lemmatization are deliberately excluded from our preprocessing pipeline, as BERT's contextualized embedding framework inherently captures morphological variations and semantic relationships without requiring explicit normalization steps.

Following text normalization, each cleaned sentence is tokenized into subword or word-level units. These token sequences are then converted into dense numerical representations using word embedding techniques such as GloVe [21]. Embedding techniques project discrete textual units into continuous vector representations that encapsulate both semantic coherence and syntactic structure, thereby helping computational models capture lexical relatedness and contextual alignment within language data.

Let the original unprocessed dataset be represented [22] as:

T = {s_1, s_2, …, s_N}

where each sentence s_k is defined as [23] a sequence of M words:

s_k = {u_1, u_2, …, u_M}

To refine the input, special characters C, web-related entities H, and semantically non-contributive stopwords W are eliminated. The cleaned sentence is thus defined by:

s'_k = Clean(s_k) = {u_j ∈ s_k | u_j ∉ (C ∪ H ∪ W)}

The sanitized sentence s'_k is then tokenized:

s'_k = {v_1, v_2, …, v_P},  v_i ∈ V

where V denotes the vocabulary of all tokens in the dataset.

Word embeddings serve as a cornerstone for text classification, as they enable models to capture abstract semantic relationships while reducing the dimensionality of input features. Unlike traditional bag-of-words approaches, embeddings are resilient to linguistic variability such as synonymy and polysemy. For sentiment analysis tasks, embeddings can cluster words with similar affective connotations, thereby enhancing the model's ability to generalize and detect implicit sentiment. Likewise, in general classification tasks, embeddings help reveal thematic cohesion across texts, ultimately contributing to improved predictive performance. Nevertheless, conventional embeddings like Word2Vec or GloVe are context-independent, assigning the same vector representation to a word regardless of its usage context. This limitation is addressed by contextualized models such as BERT, which generate dynamic embeddings based on surrounding words using transformer-based architectures. Word embeddings bridge the gap between linguistic expressiveness and computational tractability and remain an indispensable component of modern NLP pipelines.

3.2. GloVe: Global vectors for word representation

GloVe [24] is a widely adopted word embedding technique designed to capture semantic and conceptual relationships between words, particularly in text classification tasks. It constructs word vector representations by optimizing over global word co-occurrence statistics derived from large-scale corpora. Unlike purely local context-based models such as Word2Vec, GloVe incorporates both local and global contextual information, embedding lexical units into a dense, continuous vector space. In practical applications, GloVe embeddings convert unstructured input text into fixed-length numerical tensors, which serve as inputs to deep learning architectures such as CNN and LSTM models. This transformation enables a model to distinguish between textual classes by capturing both syntactic patterns and latent semantic features. The key advantage of GloVe lies in its ability to unify global corpus-level statistical information with local context, producing more stable and semantically meaningful representations than models that rely solely on window-based learning. However, it remains a static embedding technique: each word is assigned a single vector regardless of its context within a sentence. This context-independent nature limits its flexibility compared with transformer-based models like BERT, which generate dynamic embeddings conditioned on the broader linguistic environment. Despite these limitations, GloVe continues to play a significant role in NLP tasks such as text similarity, topic labeling, spam detection, and sentiment analysis, where modeling word-level semantics remains essential. Its computational simplicity and ease of integration make it a reliable baseline in many NLP pipelines. Recent studies [25] have highlighted the importance of consistent embedding strategies when comparing different NLP models, as variations in embedding approaches can significantly affect performance comparisons and lead to biased evaluations.

3.3. Support Vector Machine (SVM)

Support Vector Machine (SVM) [26] is a well-established supervised learning algorithm widely employed in text classification, particularly due to its robustness in handling high-dimensional data representations. In NLP pipelines, textual inputs are typically transformed into numerical feature vectors using techniques such as Term Frequency–Inverse Document Frequency (TF-IDF) or word embedding models. Once converted, SVM identifies the optimal hyperplane that best separates the data points into distinct class labels. The core principle of SVM is maximizing the margin between classes, thereby enhancing generalization performance. This is particularly advantageous when the feature space is high-dimensional and class distributions potentially overlap. Furthermore, SVM's ability to incorporate non-linear kernel functions, such as polynomial or radial basis function (RBF) kernels, enables it to capture complex, non-linear patterns that are often present in linguistically rich or semantically ambiguous textual inputs. Owing to its mathematically grounded optimization framework and resistance to overfitting, SVM remains a competitive baseline in various text classification domains, including sentiment analysis, spam detection, and topic categorization. Its effectiveness is further enhanced when combined with appropriate feature engineering and dimensionality reduction techniques, making it a viable choice for both small-scale and large-scale NLP applications.
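A TF-IDF + linear SVM baseline of the kind described above can be sketched with scikit-learn (an illustrative sketch, not the paper's exact configuration; the texts and labels are toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["the movie was wonderful", "absolutely terrible acting",
         "a delightful experience", "boring and painful to watch"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# TF-IDF maps each text to a sparse high-dimensional vector; the linear
# SVM then fits a maximum-margin separating hyperplane in that space.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

train_preds = clf.predict(texts)  # the tiny toy set is linearly separable
```

Because TF-IDF spaces are typically high-dimensional and sparse, a linear kernel is usually sufficient; the RBF or polynomial kernels mentioned above become relevant when class boundaries are non-linear in the chosen feature space.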
3.4. Convolutional Neural Networks (CNN)

Although originally developed for image recognition tasks, Convolutional Neural Networks (CNNs) have been extensively adapted to natural language processing problems, particularly multi-label text classification [27] and sentiment analysis [28], owing to their capacity to capture local hierarchical patterns in sequential data. In text classification applications, CNNs operate on word embeddings by applying one-dimensional convolutional filters that detect local patterns such as n-grams or syntactic motifs. These filters perform element-wise multiplications followed by non-linear activation functions to generate feature maps that emphasize the most informative regions of the input sequence. A subsequent max-pooling operation reduces dimensionality and retains the most salient features, enabling the network to focus on contextually rich segments of text. This architecture allows CNNs to efficiently model contextual dependencies within fixed-size receptive fields, making them particularly suitable for tasks such as topic categorization, polarity detection, and aspect-based sentiment analysis. Compared with recurrent neural networks (RNNs), CNNs offer significant advantages in computational efficiency and parallelizability, as they do not rely on sequential input processing. However, one notable limitation of CNNs is their reduced capacity to model long-range dependencies, which can affect performance on tasks involving lengthy or complex discourse structures.

3.5. Long Short-Term Memory Networks (LSTM)

LSTM networks, a refined subclass of recurrent neural architectures, have demonstrated substantial effectiveness in text classification tasks due to their capacity to capture long-range dependencies and preserve semantically meaningful representations across sequential inputs [29]. By incorporating internal memory units and a gated control mechanism, comprising input, forget, and output gates, LSTM models effectively address the vanishing-gradient problem that limits conventional RNNs. These gating components orchestrate information flow dynamically, facilitating the retention of salient features over prolonged contexts and ensuring the continuity of semantic interpretation throughout the sequence [30]. In text classification applications, LSTMs typically process input sequences encoded as dense word embeddings, allowing the network to learn hierarchical feature representations that encapsulate both syntactic structure and semantic meaning. This capacity to capture nuanced contextual relationships makes LSTM particularly effective in tasks such as sentiment analysis, text similarity, spam detection, and topic categorization, where subtle variations in word order and polarity significantly influence predictive accuracy. For instance, in sentiment classification, LSTM models can differentiate between expressions like ''not good'' and ''extremely good'' by maintaining a dynamic memory of temporal context throughout the sequence.

3.6. Bidirectional Encoder Representations from Transformers (BERT)

BERT is a transformer-based, pre-trained language model that has substantially advanced the state of the art in text classification by capturing bidirectional contextual semantics through self-attention mechanisms [31]. Unlike unidirectional models such as LSTM or GRU, which process text sequentially, BERT encodes semantic dependencies from both left and right contexts simultaneously. This architecture enables nuanced disambiguation of polysemous words and more robust modeling of long-range dependencies in natural language [32]. In text classification applications, BERT is typically fine-tuned on task-specific labeled datasets by appending a classification layer, often a dense layer with softmax activation, on top of the pre-trained BERT encoder. Through this transfer-learning paradigm, BERT exhibits superior performance across a variety of NLP tasks, including sentiment classification, aspect-based sentiment analysis, and multi-label classification, particularly in settings characterized by contextual ambiguity and hierarchical dependencies. However, BERT's practical deployment presents several challenges: its high computational complexity, its sensitivity to input sequence length, and the requirement for large volumes of labeled data during fine-tuning can pose significant barriers in real-world scenarios. To mitigate these limitations, hybrid architectures that integrate BERT with more lightweight modeling components have been proposed. These hybrid solutions aim to retain BERT's rich contextual understanding while improving computational efficiency and generalizability, making them more suitable for applications constrained by resources or latency requirements.

3.7. Local Interpretable Model-Agnostic Explanations (LIME)

LIME is a model-agnostic interpretability framework designed to provide localized explanations for the predictions of complex machine learning models. Positioned within the broader field of Explainable Artificial Intelligence (XAI), LIME serves to enhance the interpretability of opaque ''black-box'' systems, particularly in high-stakes domains where transparency and trust are critical [33]. LIME's main goal is to provide a straightforward, interpretable surrogate model that, within the local neighborhood of a particular instance, roughly represents the original model's decision boundary [34]. LIME accomplishes this by perturbing the original input to generate a set of synthetic samples close to the target instance, querying the black-box model on these altered examples to obtain the corresponding predictions, weighting the samples with a locality-sensitive function, and locally approximating the decision function by training a sparse linear model on the weighted dataset. The contribution of each feature to the final prediction is then inferred from the surrogate model's resulting coefficients. One of the key strengths of LIME lies in its model-agnostic design, which allows it to be applied across a wide range of machine learning algorithms, including ensemble methods, deep neural networks, and support vector machines, while offering human-understandable explanations that maintain local fidelity to the original model. As such, LIME is widely adopted for increasing decision transparency and enabling human–AI collaboration, particularly in sensitive applications such as healthcare diagnostics, financial risk assessment, and legal reasoning.
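The LIME procedure just described, perturb an input, weight samples by locality, and fit a sparse linear surrogate, can be sketched without the `lime` package itself. The black box below is a hypothetical toy scorer standing in for a real classifier, and the locality weight is a simplified kept-word fraction rather than LIME's exponential kernel (a minimal sketch under those assumptions):

```python
import random
import numpy as np
from sklearn.linear_model import Ridge

def black_box(text):
    """Hypothetical opaque sentiment scorer (stand-in for a real model)."""
    return 0.5 + 0.4 * ("good" in text.split()) - 0.4 * ("bad" in text.split())

def lime_like_explanation(text, n_samples=200, seed=0):
    rng = random.Random(seed)
    words = text.split()
    X, y, w = [], [], []
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in words]        # perturb: drop words
        sample = " ".join(t for t, keep in zip(words, mask) if keep)
        X.append([float(m) for m in mask])                # binary presence features
        y.append(black_box(sample))                       # query the black box
        w.append(sum(mask) / len(words))                  # simple locality weight
    surrogate = Ridge(alpha=1.0).fit(np.array(X), np.array(y), sample_weight=w)
    return dict(zip(words, surrogate.coef_))              # per-word attribution

expl = lime_like_explanation("the plot was good not bad")
# "good" should receive a positive coefficient and "bad" a negative one
```

The surrogate's coefficients are the word-level attributions; note that, exactly as the paper argues, this linear surrogate assigns one weight per word and cannot represent joint effects such as negation scope.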
complex linguistic relationships through NetworkX-based layouts, enabling deeper insight into how contextual factors influence model predictions. The framework demonstrates particular efficacy in sentiment analysis tasks where nuanced interactions between affective indicators, negation patterns, and contextual modifiers significantly impact interpretive accuracy. By providing interpretable visualizations of semantic interaction networks, WordContextGraphExplainer supports advanced model debugging, bias detection, and clinical decision support in sensitive domains such as mental health assessment and medical text analytics. Moreover, the framework incorporates a top-k interaction filtering mechanism, ensuring computational scalability while preserving the granularity required for interpretable analysis in high-stakes applications. This methodological advancement represents a critical step toward the development of trustworthy AI systems that combine linguistic reasoning with transparent explanatory capabilities, offering a robust foundation for real-world deployment.

In this study, a hybrid architecture is proposed that integrates a pre-trained BERT model with a bidirectional Long Short-Term Memory (BiLSTM) network to address the task of sentiment classification. The model processes textual input to generate sentiment label predictions, effectively capturing both semantic context and temporal structure inherent in natural language. Grounded in a transformer-based architecture, the system accepts input sequences of up to 256 tokens, applying padding and truncation when necessary to standardize input lengths. The HybridBERT-LSTM model embodies a synergistic design that leverages the complementary strengths of transformer-based language models and recurrent neural networks. While BERT is highly effective at capturing contextual semantics, its self-attention mechanism may not fully exploit the sequential dependencies within dialogue utterances. To mitigate this limitation, bidirectional LSTM layers are incorporated to model temporal patterns and discourse-level relationships across token sequences. These layers are adept at retaining long-range dependencies and recognizing sentiment transitions across multi-turn dialogue. By integrating these two components, the proposed HybridBERT-LSTM architecture achieves a richer understanding of both the global context and local structure of textual data, enhancing its capability to discern sentiment in complex conversational scenarios. This dual modeling approach positions the framework as a robust solution for sentiment classification tasks, particularly in dialogue-rich environments where contextual flow and temporal coherence are paramount.

3.9. Model architecture

The proposed model processes input text through a series of transformation stages, mathematically formalized as follows. Given an input sequence

X = {x₁, x₂, …, xₙ}, where n ≤ 256,

the BERT encoder maps each token xᵢ to a contextualized embedding, producing a sequence of hidden states

H = BERT(X) ∈ ℝ^(n × d_BERT),

where d_BERT = 768 is the dimensionality of BERT's contextual embeddings. The sequence H is passed to a 3-layer bidirectional LSTM network to capture temporal dependencies beyond what is modeled by self-attention:

h⃗ₜ = LSTM_forward(Hₜ, h⃗ₜ₋₁),   h⃖ₜ = LSTM_backward(Hₜ, h⃖ₜ₊₁).

The final representation for each token is obtained by concatenating the forward and backward hidden states:

hₜ^LSTM = [h⃗ₜ; h⃖ₜ] ∈ ℝ^(2·d_LSTM),

with d_LSTM = 256, resulting in a 512-dimensional output per token. To obtain a fixed-length vector representation of the sequence, both average and maximum pooling operations are applied:

h_avg = (1/n) Σᵢ₌₁ⁿ hᵢ^LSTM,   h_max = max₁≤ᵢ≤ₙ hᵢ^LSTM.

These vectors are concatenated to form the final sequence representation:

h_combined = [h_avg; h_max] ∈ ℝ^(4·d_LSTM) = ℝ¹⁰²⁴.

Feed-forward classification. The combined representation is passed through a feed-forward network with dropout regularization,

z₁ = Dropout₀.₃(h_combined),

followed by a two-layer multilayer perceptron (MLP) with ReLU activation and a softmax output for multi-class classification.

Algorithm 1: WordContextGraphExplainer Method
Input: Text T, transformer model M, tokenizer τ, feature number k ≥ 1, device d.
Output: Word context graph G with semantic interactions.
1:  Compute baseline prediction P₀ = M(T).
2:  Compute predicted_class = arg max(P₀).
3:  Initialize W = τ(T), word_effects = ∅, interactions = ∅.
4:  for each wᵢ ∈ W do
5:      T_masked = replace(T, wᵢ, '[MASK]')
6:      P_masked = M(T_masked)
7:      word_effects[i] = P₀ − P_masked
8:  end for
9:  for each (wᵢ, wⱼ) ∈ combinations(W, 2) do
10:     T_pair = replace(T, [wᵢ, wⱼ], '[MASK]')
11:     P_pair = M(T_pair)
12:     actual_effect = P₀ − P_pair
13:     expected_effect = word_effects[i] + word_effects[j]
14:     interactionᵢⱼ = actual_effect − expected_effect
15:     interactions[(wᵢ, wⱼ)] = ‖interactionᵢⱼ‖₂
16: end for
17: Sort interactions by magnitude in descending order.
18: top_interactions = interactions[:k]
19: Construct graph G = (V, E) where V = W and E = top_interactions.
20: Compute layout positions using organized_layout(W, top_interactions).
21: Visualize G with NetworkX rendering and semantic color coding.
22: Return G.
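The perturbation logic of Algorithm 1 can be sketched in a few lines of Python. The sketch below is ours, not the released implementation: it uses whitespace tokenization in place of τ, a generic `model` callable returning a scalar score instead of a transformer probability vector, and it stops at the top-k interaction list (the NetworkX graph construction and rendering of steps 19–21 are omitted). The toy scorer encodes a genuine negation interaction ("not bad"), so the pair effect is non-additive.

```python
from itertools import combinations

def explain(text, model, top_k=3, mask="[MASK]"):
    """Single-word effects (steps 4-8) and pairwise interaction
    magnitudes (steps 9-17) via [MASK] perturbations."""
    p0 = model(text)                      # baseline prediction P0
    words = text.split()                  # stand-in for the tokenizer tau

    def masked(idxs):
        return " ".join(mask if i in idxs else w for i, w in enumerate(words))

    # effect of removing each word alone: P0 - M(T with w_i masked)
    effects = {i: p0 - model(masked({i})) for i in range(len(words))}

    # interaction = joint effect minus the additive expectation
    inter = {}
    for i, j in combinations(range(len(words)), 2):
        actual = p0 - model(masked({i, j}))
        expected = effects[i] + effects[j]
        inter[(i, j)] = abs(actual - expected)

    top = sorted(inter.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return effects, top

def toy_model(text):
    """Toy sentiment scorer: 'not' directly before 'bad' neutralizes it."""
    t = text.split()
    if "not" in t and "bad" in t and t.index("not") + 1 == t.index("bad"):
        return 0.0
    return -1.0 if "bad" in t else 0.5

effects, top = explain("this movie is not bad", toy_model, top_k=1)
# the strongest interaction is the ("not", "bad") pair at indices (3, 4)
```

Because masking "not" alone flips the score while masking the pair does not, the additive expectation fails and the ("not", "bad") edge receives a large interaction weight, which is exactly the negation-scope behavior the graph is meant to surface.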
The HybridBERT-LSTM architecture integrates the strengths of transformer-based contextual modeling with the sequential learning capabilities of recurrent neural networks. This hybrid framework is explicitly engineered to address two critical aspects of sentiment analysis: contextual representation and sequential modeling. Contextual representation: the BERT encoder, pre-trained on large-scale corpora, produces deep contextualized embeddings by employing multi-head self-attention mechanisms. These embeddings capture nuanced semantic and syntactic information, enabling the model to differentiate between polysemous expressions and context-dependent sentiment cues. Sequential modeling: while BERT excels at capturing bidirectional semantic context via self-attention, the inclusion of bidirectional LSTM layers enhances the model's ability to capture sequential dependencies and emotional transitions throughout dialogue sequences.

The dual pooling strategy (average and max pooling) provides a comprehensive summary of the sequence. Average pooling captures the overall sentiment distribution across the sequence, whereas max pooling emphasizes salient emotional cues. This duality enriches the feature space and contributes to more robust classification. Furthermore, hierarchical feature abstraction is enabled by stacking multiple LSTM layers, allowing the model to learn long-range patterns more effectively than shallow RNN structures. Dropout layers, strategically placed after pooling (rate 0.3) and within the classifier (rate 0.2), serve as regularization mechanisms to prevent overfitting, especially during fine-tuning on task-specific datasets. The model is trained using the AdamW optimizer with a learning rate of 2 × 10⁻⁵, and cross-entropy is employed as the loss function. Performance evaluation is conducted using standard metrics including accuracy, precision, recall, and F1-score, ensuring comprehensive validation of the model's classification capability.

In summary, the model integrates a pre-trained BERT encoder for capturing deep contextual embeddings from input text sequences, followed by a multi-layer bidirectional LSTM network that models sequential dependencies across tokens. To derive a robust sentence-level representation, dual pooling operations (average and maximum pooling) are applied to the LSTM outputs. The concatenated feature vector is then passed through a fully connected network with dropout regularization, culminating in a softmax classifier for multi-class sentiment prediction. This hybrid architecture jointly leverages the representational richness of transformer encoders and the temporal modeling strength of recurrent networks, effectively addressing both local semantics and discourse-level sentiment dynamics within multi-turn dialogues.

Table 1
HybridBERT-LSTM Model Parameters.
Parameter name             Parameter value
Model architecture         BERT encoder + BiLSTM + MLP
Base model                 google-bert/bert-base-uncased
Tokenizer                  google-bert/bert-base-uncased
Maximum sequence length    256
LSTM layer                 6
Batch size                 32
Number of epochs           5
Learning rate              0.00002
Optimization algorithm     AdamW
Loss function              CrossEntropyLoss
LSTM latent size           256
Pooling                    avg + max pooling
MLP layer                  Linear(1024→128) → ReLU → Linear(128→n_classes)
Dropout rates              0.3

Table 2
BERT Model Parameters.
Parameter name             Parameter value
Base model                 google-bert/bert-base-uncased
Tokenizer                  google-bert/bert-base-uncased
Input length               128
Batch size                 16
Number of epochs           5
Learning rate              0.00002
Loss function              BertForSequenceClassification – Cross-Entropy
Optimization algorithm     AdamW
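The dual-pooling step described above reduces a variable-length sequence of BiLSTM outputs to a fixed vector of size 4·d_LSTM. A minimal, dependency-free sketch (ours, with toy dimensions in place of d_LSTM = 256):

```python
def dual_pool(h_lstm):
    """h_lstm: list of n token vectors, each of size 2*d_lstm
    (concatenated forward/backward LSTM states).
    Returns h_combined of size 4*d_lstm: [avg-pool ; max-pool]."""
    n, dim = len(h_lstm), len(h_lstm[0])
    h_avg = [sum(vec[j] for vec in h_lstm) / n for j in range(dim)]  # h_avg
    h_max = [max(vec[j] for vec in h_lstm) for j in range(dim)]      # h_max
    return h_avg + h_max                                             # concat

# toy sequence: n = 3 tokens, 2*d_lstm = 4
h = [[1.0, 0.0, 2.0, -1.0],
     [3.0, 1.0, 0.0,  1.0],
     [2.0, 2.0, 1.0,  0.0]]
h_combined = dual_pool(h)   # length 8 = 4*d_lstm
```

With d_lstm = 256 the same function yields the 1024-dimensional h_combined fed to the classifier head; the average half summarizes overall sentiment distribution while the max half preserves salient peaks.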
The computational overhead of HybridBERT-LSTM represents a critical consideration for practical deployment, particularly in real-time applications such as conversational AI systems. The theoretical complexity of the proposed architecture can be decomposed into its constituent components. The BERT component contributes O(n² × d_BERT) = O(n² × 768) complexity due to the quadratic scaling of the self-attention mechanism, where n is the sequence length and d_BERT the BERT embedding dimension. The subsequent 3-layer BiLSTM processing adds O(3 × n × d_LSTM²) = O(3 × n × 256²) complexity, where d_LSTM is the LSTM hidden dimension. Consequently, the overall HybridBERT-LSTM complexity is O(n² × 768 + 3n × 65,536). This represents a significant computational increase compared to standalone BERT (O(n² × 768)) or LSTM models (O(n × d_LSTM²)), which may limit deployment in latency-sensitive applications. However, the empirical results demonstrate that the performance gains justify this additional overhead in scenarios where accuracy is prioritized over computational efficiency.

The parameters used for the BERT model employed in this study are presented in Table 2. The parameter configurations utilized in the LSTM-based model developed for this study are detailed in Table 3.

Table 3
LSTM Model Parameters.
Parameter name             Parameter value
Embedding type             GloVe
Embedding size             100
Maximum number of words    5000
LSTM layer number          6
LSTM unit number           128/256
Dropout rate               0.5
Output layer (Dense)       Softmax
Optimization algorithm     Adam
Loss function              Sparse Categorical Crossentropy
Epoch number               50
Batch size                 32

4. Experimental results

This section presents the configurations of the models utilized in the experiments, detailing the corresponding hyperparameters and implementation settings.
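As a quick check on the complexity figures quoted above (n²·768 for self-attention, 3n·256² for the BiLSTM), the two terms can be counted directly. This helper is our own back-of-envelope sketch, mirroring only the stated asymptotic terms, not an actual FLOP profile:

```python
def hybrid_cost(n, d_bert=768, d_lstm=256, lstm_layers=3):
    """Operation-count estimate for HybridBERT-LSTM:
    self-attention ~ n^2 * d_bert, BiLSTM ~ lstm_layers * n * d_lstm^2."""
    attention = n * n * d_bert             # O(n^2 x 768)
    bilstm = lstm_layers * n * d_lstm**2   # O(3n x 65,536)
    return attention + bilstm

# at the maximum sequence length n = 256 the two terms happen to be equal,
# so the hybrid roughly doubles the dominant BERT term
cost = hybrid_cost(256)
```

At n = 256 both terms equal 256²·768 = 50,331,648, which illustrates why the BiLSTM overhead is non-negligible at this sequence length even though it scales only linearly in n.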
The objective is to ensure reproducibility and provide a comprehensive understanding of the experimental setup.

4.1. Model hyperparameters

The deep learning models were trained using a variety of hyperparameter configurations tailored to the architecture and task requirements. These configurations include parameters such as learning rate, batch size, maximum input sequence length, number of training epochs, optimizer type, and loss function. Additionally, architecture-specific settings such as the number of LSTM layers, dropout rates, and hidden state dimensions are systematically defined. For models utilizing pre-trained components (e.g., BERT), both the base model and tokenizer versions are explicitly specified. The subsequent tables summarize the detailed parameter values for each model employed in this study, including the HybridBERT-LSTM, BERT-only, LSTM, CNN, and SVM-based classifiers. The parameter configurations utilized in the CNN model developed for this study are detailed in Table 4, and Table 5 summarizes the parameter values defined for the SVM model.

Table 6 presents a comparative evaluation of various machine learning and deep learning models for sentiment analysis on the widely adopted IMDB dataset. Among the examined methods, the proposed HybridBERT-LSTM architecture achieved the highest accuracy of 98.14%, a substantial improvement over the other baseline models included in the analysis. This enhancement underscores the effectiveness of combining contextual embeddings from BERT with the sequential modeling capabilities of LSTM. The IMDB dataset was selected for evaluation due to its extensive usage and established credibility in the sentiment analysis literature, serving as a robust benchmark for comparative performance assessment.

4.2. Statistical significance testing
The parameter values of the model developed in this study are detailed in Table 1. In order to determine whether the observed differences in model performance metrics [44] were statistically significant, we employed Welch's two-sample t-test, which is widely recommended when comparing two groups with potentially unequal variances and sample sizes. The test evaluates the null hypothesis (H₀) that the two models exhibit equal mean performance. Since our interest lies in detecting differences in either direction, a two-tailed test is used:

p = 2 × P(T ≥ |t|),

where T follows the Student's t-distribution with df degrees of freedom. If p < 0.05, the difference is considered statistically significant, indicating strong evidence against the null hypothesis; in this case, we conclude that one model outperforms the other beyond what would be expected by random variation. If p ≥ 0.05, the difference is considered not statistically significant, implying that the observed discrepancy may reasonably be attributed to experimental variability.

Table 4
CNN Model Parameters.
Parameter name             Parameter value
Embedding type             GloVe
Embedding size             100
Maximum number of words    5000
Input layer                Embedding(input_dim=5000, output_dim=100)
Number of Conv1D layers    6
Number of Conv1D filters   128
Kernel size                5
Activation function        ReLU
Padding                    Same
Pooling                    MaxPooling1D (pool_size=2)
Dropout rate               0.5
Global pooling             GlobalMaxPooling1D
Output layer (Dense)       Softmax
Loss function              sparse_categorical_crossentropy
Optimization algorithm     Adam
Evaluation metric          Accuracy
Number of epochs           50
Batch size                 32

In addition to reporting p-values, effect sizes (Cohen's d) were also computed to quantify the magnitude of the observed differences. While statistical significance indicates whether a difference is unlikely to be due to chance, effect size provides a measure of its practical relevance. Together, these statistics provide a comprehensive assessment of the comparative performance of the evaluated models.
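The text does not spell out which variant of Cohen's d was used; the sketch below assumes the conventional pooled-standard-deviation form, which is the most common choice for two independent groups of equal-variance runs:

```python
import math

def cohens_d(a, b):
    """Cohen's d effect size between two samples of per-run scores,
    using the pooled standard deviation (assumed conventional form)."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)   # unbiased variances
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

# toy per-run accuracies for two models; a unit mean difference at unit
# pooled spread gives |d| = 1, conventionally read as a "large" effect
d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Reporting d alongside the p-value separates the question "is the gap real?" (significance) from "is the gap big enough to matter?" (effect size), which is exactly the distinction drawn in the paragraph above.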
4.3. Experimental results on datasets

Table 5
SVM Model Parameters.
Parameter name             Parameter value
Embedding type             GloVe
Embedding size             100
Maximum number of words    5000
SVM Kernel                 Linear

Table 6
IMDB Dataset Accuracy Comparison.
Reference        Method                             Accuracy
[35]             LSTM                               83.7%
[36]             CNN+LSTM                           96.01%
[37]             LSTM+RNN                           92.00%
[38]             BERT                               93.97%
[39]             A hybrid approach                  95.6%
[40]             HOMOCHAR                           95.91%
[41]             Textual Emotion Analysis (TEA)     93%
[42]             Lexical + Adversarial attacks      85%
[43]             Logistic Regression                89.42%
Proposed Model   HybridBERT-LSTM                    98.14%

Dataset 1 consists of question–answer pairs collected from two independent online counseling and psychotherapy platforms [45]. The user-generated questions span a wide range of topics related to mental health, including emotional well-being, interpersonal issues, and psychological disorders. Each response was authored by licensed psychologists, ensuring both clinical relevance and linguistic reliability. In total, the dataset comprises 7,025 dialogue instances.

Tables 7 and 8 present the training and testing performances, respectively, of five distinct models (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) evaluated on Dataset 1. The models were assessed using standard classification metrics including accuracy, precision, recall, and F1-score, providing a comparative analysis of both their internal consistency and generalizability.

An ablation study [46] is a systematic experimental methodology used to evaluate the individual contributions of specific model components by selectively removing or modifying them while keeping other factors constant. This approach provides empirical evidence for the importance of particular architectural elements in determining the model's overall performance. To rigorously assess whether the HybridBERT-LSTM's performance gains arise from architectural design
rather than mere parameter expansion, we conducted a comprehensive ablation study with parameter-matched baselines. Six model variants were constructed: (1) a BERT-Only baseline using the [CLS] token for classification, (2) BERT-ParamMatched, with additional dense layers matching the BiLSTM parameter count, (3) BERT+UniLSTM, with a unidirectional LSTM, (4) BERT+BiLSTM-NoPooling, without dual pooling, (5) BERT+BiLSTM with frozen BERT, isolating the pure LSTM contribution, and (6) HybridBERT-LSTM (Full), incorporating all proposed components.

For the pairwise comparisons, Welch's test provides a robust assessment of mean differences without assuming homogeneity of variances, which is particularly important in machine learning experiments where stochastic training procedures may lead to heterogeneous variability across models. Let x̄₁ and x̄₂ denote the sample means of the two models being compared, s₁ and s₂ the corresponding standard deviations, and n₁ and n₂ the number of independent runs. The Welch's t-statistic is defined as

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂),

and the approximate degrees of freedom (df) are calculated according to the Welch–Satterthwaite equation:

df = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ].

When Table 9, which shows the ablation test for Dataset 1, is examined, the BERT-ParamMatched model achieves an accuracy of 95.35% ± 0.38% despite having an equivalent number of parameters to the full model, whereas HybridBERT-LSTM attains 95.94% ± 0.15%. The hierarchical performance degradation across the ablation variants reveals the marginal contribution of each component: dual pooling adds +0.19% (95.94% vs. 95.75%), bidirectionality contributes +0.17% (95.75% vs. 95.58%), and the sequential LSTM architecture over feedforward MLP layers provides +0.23% (95.58% vs. 95.35%). The frozen BERT experiment (91.80% ± 0.65%) isolates critical insights regarding representation quality versus fine-tuning contributions.
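The Welch statistic and Satterthwaite degrees of freedom above translate directly into code. This standalone sketch is ours; it stops at (t, df) and omits the p-value step, which additionally requires the Student-t CDF:

```python
import math

def welch_t(a, b):
    """Welch's two-sample t-statistic and Welch-Satterthwaite df for two
    groups of per-run scores with possibly unequal variances and sizes."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)   # s1^2 (unbiased)
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)   # s2^2
    se1, se2 = v1 / n1, v2 / n2                     # per-group squared SEs
    t = (m1 - m2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# toy per-run accuracies for two models over three runs each
t, df = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

For these toy inputs the variances are equal, so df reduces to n₁ + n₂ − 2 = 4 exactly; with heterogeneous variances df falls below that, which is the correction that makes Welch's test appropriate for comparing stochastic training runs.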
Given the test statistic and degrees of freedom, the p-value is obtained by evaluating the probability of observing a difference as extreme as, or more extreme than, the measured difference under the null hypothesis.

Table 7
Training Performance Metrics for Dataset 1.
Method            Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM   0.9872 ± 0.0029    0.9871 ± 0.0028    0.9872 ± 0.0029    0.9871 ± 0.0029
BERT              0.9806 ± 0.0063    0.9805 ± 0.0057    0.9806 ± 0.0063    0.9805 ± 0.0062
LSTM              0.9829 ± 0.0162    0.9829 ± 0.0163    0.9829 ± 0.0162    0.9827 ± 0.0175
CNN               0.9862 ± 0.0190    0.9829 ± 0.0199    0.9862 ± 0.0190    0.9829 ± 0.0202
SVM               0.8247 ± 0.0073    0.8274 ± 0.0067    0.8247 ± 0.0073    0.8235 ± 0.0071

Table 8
Test Performance Metrics for Dataset 1.
Method            Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM   0.9594 ± 0.0015    0.9596 ± 0.0017    0.9594 ± 0.0015    0.9592 ± 0.0016
BERT              0.9516 ± 0.0040    0.9515 ± 0.0041    0.9516 ± 0.0044    0.9514 ± 0.0045
LSTM              0.9245 ± 0.0152    0.9257 ± 0.0163    0.9245 ± 0.0152    0.9239 ± 0.0165
CNN               0.9195 ± 0.0171    0.9200 ± 0.0170    0.9195 ± 0.0171    0.9192 ± 0.0125
SVM               0.8078 ± 0.0026    0.8118 ± 0.0025    0.8078 ± 0.0026    0.8058 ± 0.0031

Table 9
Ablation Performance Metrics for Dataset 1.
Model                     Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)      0.9180 ± 0.0065    0.9165 ± 0.0068
BERT-Only (Baseline)      0.9516 ± 0.0040    0.9512 ± 0.0042
BERT-ParamMatched         0.9535 ± 0.0038    0.9531 ± 0.0040
BERT+UniLSTM              0.9558 ± 0.0028    0.9555 ± 0.0030
BERT+BiLSTM-NoPooling     0.9575 ± 0.0022    0.9573 ± 0.0024
HybridBERT-LSTM (Full)    0.9594 ± 0.0015    0.9592 ± 0.0016

When the results are evaluated over five repeated experiments, the HybridBERT-LSTM model not only outperforms the other methods in terms of accuracy, precision, recall, and F1-score, but also demonstrates a high degree of stability, as reflected by its very low standard deviations (≈ 0.0015–0.0017). This indicates that the model provides not just superior performance but also reproducible results across runs. While the BERT model follows as the second-best performer, its higher variance (≈ 0.004) highlights less consistent outcomes compared to HybridBERT-LSTM. Statistical testing (e.g., paired t-tests) confirms that the observed performance difference between HybridBERT-LSTM and BERT, though relatively small, is statistically significant (p < 0.05). In contrast, the performance gaps between HybridBERT-LSTM and weaker models such as LSTM, CNN, and particularly SVM are much larger; pairwise comparisons reveal p-values well below 0.01, strongly supporting the conclusion that HybridBERT-LSTM's superiority is not due to random chance but reflects a genuine performance advantage. In summary: HybridBERT-LSTM vs. BERT, smaller margin but statistically significant (p < 0.05); HybridBERT-LSTM vs. LSTM/CNN/SVM, substantial margin and highly significant (p ≪ 0.01).

Among the evaluated approaches, the HybridBERT-LSTM architecture consistently demonstrated superior performance during both training and testing phases, achieving remarkably high scores across all metrics. Specifically, it attained 98.72% accuracy and 98.72% F1-score on the training set, outperforming all other models. BERT, LSTM, and CNN also exhibited strong training performance, each surpassing 98% accuracy and F1-scores, indicating their efficacy on seen data. In the testing phase, HybridBERT-LSTM maintained its leading position by achieving the highest test accuracy (95.94%) and F1-score (95.92%), affirming its robustness and generalization capability. In contrast, the CNN model experienced a notable performance drop from training to testing (accuracy falling from above 98% to 91.95%, and F1-score to 91.92%), suggesting a tendency toward overfitting. Similarly, the LSTM model, despite achieving 98.29% accuracy in training, saw its performance decline to 92.45% accuracy during testing, reflecting reduced generalization. Another critical observation relates to the SVM model, which exhibited the lowest performance across both training and test sets. With a training accuracy of 82.47% and a further decline to 80.78% in testing, the model's limited learning and generalization capacity became evident. These findings collectively indicate that SVM lags behind deep learning-based methods in terms of both modeling complexity and adaptability to the sequential linguistic features inherent in dialogue-based sentiment classification tasks. As shown in Table 9, the ablation study on Dataset 1 systematically confirms that HybridBERT-LSTM's performance advantage arises from its architectural design rather than from parameter count inflation.

Dataset 2 comprises conversational exchanges derived from everyday spoken English interactions [47]. It consists of a total of 7,450 dialogue samples, structured in a question–answer format. The training and testing performances of five different classification methods (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) on Dataset 2 are presented in Tables 10 and 11, respectively. Among these, the HybridBERT-LSTM model achieved the highest performance on the training set, reaching an accuracy of 99.11% and an F1-score of 99.11%, thereby slightly outperforming the other methods. The BERT and CNN models also demonstrated high effectiveness, achieving accuracies of 98.95% and 98.21%, respectively. These three models exhibited strong alignment with the training data across all evaluation metrics, including accuracy, precision, recall, and F1-score.

When Table 12 is examined, which shows the ablation test for Dataset 2, the BERT-ParamMatched model achieves an accuracy of 97.92% ± 0.35% despite having an equivalent number of parameters, whereas HybridBERT-LSTM attains 98.32% ± 1.06%, reflecting a 0.40 percentage-point improvement. Component-wise analysis further indicates that dual pooling contributes +0.13% (98.32% vs. 98.19%), bidirectionality adds +0.13% (98.19% vs. 98.06%), and the sequential LSTM architecture over MLP layers provides an additional +0.14% (98.06% vs. 97.92%).

Based on the evaluation of five repeated experiments, the HybridBERT-LSTM model achieved the highest accuracy, precision, recall, and F1-scores on both the training and test sets. It stood out with an accuracy of 99.11% in training and reached 98.32% accuracy on the test set. The consistently low standard deviations (≈ 0.0106–0.0126) indicate that the model not only delivers high performance but also produces stable results. BERT followed HybridBERT-LSTM and provided similarly strong results; its slightly lower standard deviations suggest that it yielded more consistent outcomes on some metrics. Although the performance gap between the two models appears small, pairwise t-test results show that the p-values are mostly below 0.05; the difference between HybridBERT-LSTM and BERT is therefore statistically significant. In comparisons with the lower-performing models (LSTM, CNN, and SVM), the p-values were found to be far below 0.01, demonstrating that HybridBERT-LSTM significantly and strongly outperforms these models. In particular, LSTM's high variance in training (std ≈ 0.0380) indicates unstable learning behavior. In conclusion, HybridBERT-LSTM not only achieved the highest scores but also delivered stable and reproducible results.
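The component accounting used throughout the ablation analyses (per-component marginal gains summed and compared against the total gain over a baseline, with any residual read as a synergy effect) is simple arithmetic; the helper below is our own illustration, using the Dataset 2 figures quoted above:

```python
def ablation_decomposition(baseline, full, marginal_gains):
    """Return (total gain over baseline, sum of per-component marginal
    gains, residual 'synergy' = total minus the additive sum)."""
    total = full - baseline
    additive = sum(marginal_gains)
    return total, additive, total - additive

# Dataset 2: BERT-ParamMatched baseline 97.92, full model 98.32, with
# marginal gains for dual pooling, bidirectionality, sequential LSTM
total, additive, synergy = ablation_decomposition(
    97.92, 98.32, [0.13, 0.13, 0.14])
```

For Dataset 2 the three marginal gains account for essentially the whole 0.40-point improvement (synergy ≈ 0); on the harder datasets the same decomposition leaves a positive residual, which the text interprets as the components reinforcing one another.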
Table 10
Training Performance Metrics for Dataset 2.
Method            Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM   0.9911 ± 0.0111    0.9911 ± 0.0126    0.9911 ± 0.0111    0.9911 ± 0.0111
BERT              0.9895 ± 0.0093    0.9896 ± 0.0094    0.9895 ± 0.0093    0.9895 ± 0.0093
LSTM              0.7270 ± 0.0380    0.7189 ± 0.0370    0.7175 ± 0.0380    0.7278 ± 0.0380
CNN               0.9821 ± 0.0176    0.9826 ± 0.0176    0.9921 ± 0.0179    0.9822 ± 0.0176
SVM               0.7785 ± 0.0518    0.7711 ± 0.0524    0.7785 ± 0.0518    0.7638 ± 0.0525

Table 11
Test Performance Metrics for Dataset 2.
Method            Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM   0.9832 ± 0.0106    0.9834 ± 0.0108    0.9832 ± 0.0106    0.9833 ± 0.0106
BERT              0.9779 ± 0.0038    0.9783 ± 0.0039    0.9779 ± 0.0038    0.9780 ± 0.0038
LSTM              0.7075 ± 0.0199    0.7089 ± 0.0178    0.7075 ± 0.0199    0.7078 ± 0.0199
CNN               0.9718 ± 0.0102    0.9725 ± 0.0104    0.9718 ± 0.0102    0.9720 ± 0.0112
SVM               0.7537 ± 0.0044    0.7491 ± 0.0045    0.7537 ± 0.0044    0.7277 ± 0.0045

Table 12
Ablation Performance Metrics for Dataset 2.
Model                     Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)      0.9425 ± 0.0152    0.9418 ± 0.0155
BERT-Only (Baseline)      0.9779 ± 0.0038    0.9780 ± 0.0038
BERT-ParamMatched         0.9792 ± 0.0035    0.9793 ± 0.0035
BERT+UniLSTM              0.9806 ± 0.0028    0.9807 ± 0.0028
BERT+BiLSTM-NoPooling     0.9819 ± 0.0022    0.9820 ± 0.0022
HybridBERT-LSTM (Full)    0.9832 ± 0.0106    0.9833 ± 0.0106

In contrast, LSTM and SVM yielded significantly lower performance on Dataset 2, with training accuracies of 72.70% and 77.85%, respectively. In particular, the low F1-score of 76.38% for SVM indicates inadequate classification consistency and stability. When evaluated on the test set, the overall performance ranking remained largely consistent with that observed during training. HybridBERT-LSTM and BERT maintained their superior performance, achieving test accuracies of 98.32% and 97.79%, respectively. The CNN model followed closely with 97.18% accuracy, exhibiting a balanced and robust performance across all evaluation criteria. Conversely, LSTM and SVM continued to underperform in the test phase, reflecting limited generalization capability in comparison to the more advanced deep learning architectures.

Dataset 3 comprises online consultation dialogues conducted between patients and medical professionals [48]. The dataset consists of a total of 6,570 entries, with each instance representing a dialogue exchange initiated by a patient inquiry and followed by a corresponding response from a doctor. The training and testing performances of five distinct approaches (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) on Dataset 3 are presented in Tables 13 and 14, respectively. Among these, the CNN model achieved the highest training performance, demonstrating its strong learning capability. The BERT model also exhibited competitive results, attaining a training accuracy of 94.92%, positioning it as a viable alternative. In contrast, the LSTM and SVM models yielded notably lower performance during training, with accuracy scores of 62.26% and 71.92%, respectively, indicating limitations in their ability to model the training data effectively.

However, the test results reveal a marked decline in the generalization performance of some models, most notably CNN. The CNN model's accuracy dropped significantly to 65.03% during testing, suggesting signs of overfitting; the inability to maintain performance implies that the model may have memorized training instances rather than learning generalizable patterns. Similarly, the BERT model, while achieving 94.92% training accuracy, exhibited a notable decline during testing, with an accuracy of 78.27%, indicating moderate but consistent performance. The most robust generalization was observed in the HybridBERT-LSTM approach, which achieved a training accuracy of 91.57% and maintained a relatively high testing accuracy of 82.86%, with minimal degradation between the training and testing phases. These results underscore the HybridBERT-LSTM model's capability to balance learning efficiency with strong generalization, making it the most stable and reliable method on Dataset 3. Interestingly, the LSTM model maintained a consistent performance of 62.26% across both training and testing phases, signaling limitations in its learning capacity and suggesting that simpler architectures may be insufficient for handling the complexity of dialogue-based sentiment classification tasks. The SVM model, although yielding only moderate success during training, preserved its performance during testing (68.42%), outperforming more complex deep learning models such as CNN and LSTM in terms of stability. Overall, the HybridBERT-LSTM model emerges as the most balanced and generalizable approach, while the CNN model warrants cautious interpretation due to its susceptibility to overfitting.

When Table 15, which shows the ablation test for Dataset 3, is examined, BERT-ParamMatched achieves 78.92% ± 1.72% accuracy with equivalent parameters, while HybridBERT-LSTM reaches 82.86% ± 0.65%, representing a statistically significant 3.94 percentage-point improvement. Component decomposition demonstrates substantial marginal contributions: dual pooling adds +1.61% (82.86% vs. 81.25%), bidirectionality contributes +1.40% (81.25% vs. 79.85%), and the sequential LSTM architecture over MLP provides +0.93% (79.85% vs. 78.92%). The cumulative gain of 5.06% from the BERT-Only baseline (78.27%) substantially exceeds the sum of the individual components (3.94%), indicating a 1.12% synergistic interaction effect (the strongest observed across all datasets), whereby the BiLSTM components mutually enhance effectiveness on challenging classification tasks. The frozen BERT experiment (72.15% ± 2.45%) provides critical validation: despite lacking fine-tuning, it outperforms standalone LSTM with GloVe embeddings (62.26% test) by 9.89 percentage points, isolating the representation-quality advantage of contextualized embeddings. However, the 10.71% gap between the frozen and full models (72.15% vs. 82.86%) represents the largest fine-tuning contribution across all datasets, establishing that task-specific adaptation is particularly critical for complex classification problems. The parameter efficiency ratio of 18.74:1 (5.06% gain / 0.27% parameter increase) dramatically exceeds those of the simpler datasets (Dataset 1: 2.89:1, Dataset 2: 1.96:1), validating that the BiLSTM's architectural value scales positively with task difficulty.

In this study, each method was evaluated through five independent repetitions. This approach provides a more accurate representation of variance than results obtained from a single run and enhances the reproducibility of the outcomes. Notably, the HybridBERT-LSTM model exhibited very low standard deviations (≈ 0.006–0.01), indicating that it not only achieved high average scores but also produced consistent results across trials. HybridBERT-LSTM vs. BERT: although the average performance difference is relatively small, the p-values mostly remain below 0.05, suggesting that the difference is unlikely to be due to chance and that the superiority of HybridBERT-LSTM is statistically significant.

Table 13
Training Performance Metrics for Dataset 3.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.9157 ± 0.0097    0.9058 ± 0.0103    0.9056 ± 0.0097    0.9052 ± 0.0097
BERT                0.9492 ± 0.0234    0.9494 ± 0.0228    0.9492 ± 0.0234    0.9487 ± 0.0246
LSTM                0.6298 ± 0.0164    0.6294 ± 0.0160    0.6298 ± 0.0164    0.6227 ± 0.0183
CNN                 0.9966 ± 0.0054    0.9966 ± 0.0062    0.9966 ± 0.0054    0.9966 ± 0.0056
SVM                 0.7192 ± 0.0125    0.7263 ± 0.0161    0.7192 ± 0.0125    0.7198 ± 0.0126
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<<0.01), CNN (<<0.01), SVM (<<0.01).

Table 14
Test Performance Metrics for Dataset 3.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.8286 ± 0.0065    0.8326 ± 0.0062    0.8286 ± 0.0065    0.8282 ± 0.0064
BERT                0.7827 ± 0.0185    0.7835 ± 0.0185    0.7827 ± 0.0185    0.7830 ± 0.0184
LSTM                0.6226 ± 0.0081    0.6294 ± 0.0085    0.6226 ± 0.0081    0.6227 ± 0.0092
CNN                 0.6503 ± 0.0433    0.6516 ± 0.0565    0.6503 ± 0.0565    0.6497 ± 0.0565
SVM                 0.6842 ± 0.0093    0.6904 ± 0.0520    0.6842 ± 0.0093    0.6847 ± 0.0110
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<<0.01), CNN (<<0.01), SVM (<<0.01).

Table 15
Ablation Performance Metrics for Dataset 3.
Model                    Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)     0.7215 ± 0.0245    0.7208 ± 0.0248
BERT-Only (Baseline)     0.7827 ± 0.0185    0.7830 ± 0.0184
BERT-ParamMatched        0.7892 ± 0.0172    0.7895 ± 0.0171
BERT+UniLSTM             0.7985 ± 0.0145    0.7988 ± 0.0144
BERT+BiLSTM-NoPooling    0.8125 ± 0.0110    0.8128 ± 0.0109
HybridBERT-LSTM (Full)   0.8286 ± 0.0065    0.8282 ± 0.0064

Table 16
Training Performance Metrics for Dataset 4.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.9046 ± 0.0172    0.8446 ± 0.0157    0.9046 ± 0.0119    0.8730 ± 0.0070
BERT                0.9447 ± 0.0238    0.9403 ± 0.0331    0.9447 ± 0.0238    0.9379 ± 0.0294
LSTM                0.9849 ± 0.0381    0.9848 ± 0.0489    0.9849 ± 0.0381    0.9845 ± 0.0479
CNN                 0.9944 ± 0.0443    0.9882 ± 0.0421    0.9944 ± 0.0443    0.9882 ± 0.0401
SVM                 0.8084 ± 0.0249    0.8258 ± 0.0206    0.8084 ± 0.0249    0.7806 ± 0.0268
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.01), CNN (<0.01), SVM (<0.001).

HybridBERT-LSTM vs. LSTM, CNN, SVM: In comparisons with these three models, the p-values were found to be far below 0.01. The superiority of HybridBERT-LSTM over these methods is therefore strongly supported by statistical evidence.

Overall, the findings confirm that HybridBERT-LSTM is not only the best-performing model in terms of average scores but also the most reliable and consistent one from a statistical perspective.

Dataset 4 comprises text entries collected from online conversations conducted in English, each annotated with a corresponding sentiment label. It has been specifically curated for analyzing and classifying the emotional tone embedded within textual utterances. The dataset consists of 1,494 instances and serves as a representative benchmark for evaluating sentiment classification models in informal, dialogue-based contexts [49].

Tables 16 and 17 present the training and test performance metrics, respectively, for five sentiment classification models applied to Dataset 4: HybridBERT-LSTM, BERT, LSTM, CNN, and SVM. Evaluation was conducted using standard performance indicators (Accuracy, Precision, Recall, and F1-score) to assess both fitting capacity on the training data and generalizability on unseen test data.

Table 18 presents the cross-validation results for Dataset 4. The consistency of accuracy and F1-scores across folds (≈0.8795 and 0.8758, respectively) indicates that the model does not exhibit overfitting or excessive variance between training and evaluation phases. This stability confirms that the observed improvements are not artifacts of specific data splits but instead arise from the model's architectural design, particularly its integration of bidirectional temporal encoding and hierarchical pooling mechanisms. Moreover, the cross-validation outcomes follow the same relative performance hierarchy observed in both the training and test experiments: HybridBERT-LSTM > BERT > LSTM > CNN > SVM. This consistent ranking across all evaluation settings validates the comparative strength of the proposed architecture. The slight performance gap between HybridBERT-LSTM and BERT is statistically meaningful and mirrors the p-value significance (<0.05) reported in both the training and test evaluations, further evidencing generalizable gains rather than dataset-specific variance. The results collectively demonstrate that HybridBERT-LSTM's improvements are statistically sound, generalizable, and derived from architectural synergy rather than overparameterization or random variation.

When Table 19, which reports the ablation study for Dataset 4, is examined, BERT-ParamMatched achieves 85.48% ± 2.05% accuracy despite having equivalent parameters, while HybridBERT-LSTM reaches 87.29% ± 1.19%, a 1.81 percentage point improvement. Component analysis reveals that dual pooling contributes +0.61% (87.29% vs. 86.68%), bidirectionality adds +0.63% (86.68% vs. 86.05%), and the sequential LSTM architecture over an MLP provides +0.57% (86.05% vs. 85.48%). The cumulative gain of 2.35% from the BERT-Only baseline (84.94%) exceeds the sum of these architectural components (1.81%), indicating a 0.54% synergistic effect in which the BiLSTM components mutually enhance effectiveness on this moderately challenging task. The frozen BERT variant (80.65% ± 2.85%) validates two critical insights: it outperforms standalone LSTM with GloVe embeddings (77.26% test) by 3.39 percentage points, confirming the superiority of contextualized representations, while the 6.64% gap to the full model (80.65% vs. 87.29%) quantifies the substantial contribution of fine-tuning. The decreasing variance from the frozen variant (±2.85%) through the parameter-matched variant (±2.05%) to the full model (±1.19%) demonstrates that architectural integration with end-to-end training provides essential stability, establishing that the observed improvements stem from architectural design rather than capacity scaling.

When the results in Tables 16 and 17 are analyzed on the basis of five independent repetitions, several important findings emerge regarding both performance levels and statistical reliability. First, the HybridBERT-LSTM model demonstrates strong generalization ability, maintaining balanced accuracy (87.29% ± 0.0119) and F1 (84.89% ± 0.0140) on the test set, with relatively low variance across runs. The narrow confidence interval implied by the low standard deviations indicates that the model is not only accurate but also stable across repeated experiments. The pairwise statistical comparisons reveal further insights. Against BERT, the differences in performance metrics appear moderate, yet the corresponding p-values are consistently below 0.05. This implies that the improvements of HybridBERT-LSTM over BERT, while not large in magnitude, are statistically significant rather than random fluctuations. In contrast, the performance gaps between HybridBERT-LSTM and the weaker models (LSTM, CNN, and especially SVM) are considerably larger. Here, the p-values are well below 0.01, in many cases below 0.001, providing strong statistical evidence that HybridBERT-LSTM's superiority is systematic and not due to chance.
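The pairwise comparisons above reduce, in the simplest case, to a paired test on per-run accuracies from the five repetitions. A minimal sketch follows; the run values and the `paired_t_statistic` helper are illustrative, not the paper's actual measurements.

```python
import math

def paired_t_statistic(a, b):
    """Paired t-statistic over per-run metric pairs (e.g. five repetitions)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance of the differences
    return mean / math.sqrt(var / n)

# Hypothetical per-run test accuracies (illustrative values only).
hybrid = [0.870, 0.880, 0.860, 0.875, 0.872]
bert   = [0.845, 0.846, 0.851, 0.840, 0.848]

t = paired_t_statistic(hybrid, bert)
T_CRIT = 2.776  # two-tailed critical value at alpha = 0.05 with df = 4
significant = abs(t) > T_CRIT
```

With these illustrative runs the statistic comfortably exceeds the critical value, mirroring the sub-0.05 p-values reported for the BERT comparison.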
Table 17
Test Performance Metrics for Dataset 4.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.8729 ± 0.0119    0.8561 ± 0.0089    0.8729 ± 0.0117    0.8489 ± 0.0140
BERT                0.8494 ± 0.0218    0.8532 ± 0.0377    0.8494 ± 0.0218    0.8512 ± 0.0194
LSTM                0.7726 ± 0.0330    0.7971 ± 0.0410    0.7726 ± 0.0330    0.7818 ± 0.0409
CNN                 0.8160 ± 0.0164    0.8040 ± 0.0146    0.8160 ± 0.0164    0.8090 ± 0.0141
SVM                 0.7525 ± 0.0075    0.7030 ± 0.0058    0.7525 ± 0.0075    0.7192 ± 0.0114
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.01), CNN (<0.01), SVM (<0.001).

Table 18
Cross Validation Performance Metrics for Dataset 4.
Model               Accuracy    Precision   Recall      F1
HybridBERT-LSTM     0.8795      0.8739      0.8795      0.8758
BERT                0.8561      0.8477      0.8561      0.8481
LSTM                0.8394      0.7806      0.8394      0.8090
CNN                 0.8327      0.7811      0.8327      0.8058
SVM                 0.7593      0.7719      0.7593      0.7602

Table 19
Ablation Performance Metrics for Dataset 4.
Model                    Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)     0.8065 ± 0.0285    0.7971 ± 0.0295
BERT-Only (Baseline)     0.8494 ± 0.0218    0.8512 ± 0.0194
BERT-ParamMatched        0.8548 ± 0.0205    0.8558 ± 0.0188
BERT+UniLSTM             0.8605 ± 0.0178    0.8602 ± 0.0175
BERT+BiLSTM-NoPooling    0.8668 ± 0.0145    0.8645 ± 0.0155
HybridBERT-LSTM (Full)   0.8729 ± 0.0119    0.8489 ± 0.0140*

Notably, LSTM and CNN exhibit relatively high variances during training (std ≈ 0.038–0.048 for LSTM; ≈ 0.040–0.044 for CNN), suggesting instability and overfitting tendencies. Taken together, these results highlight two key aspects: HybridBERT-LSTM delivers the best trade-off between accuracy and reproducibility across repeated runs, and its performance improvements, particularly over LSTM, CNN, and SVM, are not only empirically substantial but also statistically robust. The evidence thus supports HybridBERT-LSTM as the most reliable and generalizable method on Dataset 4.

During training (Table 16), CNN achieved the highest accuracy (99.44%) and F1-score (98.82%), indicating a strong capacity to fit the training data. LSTM and BERT also demonstrated robust learning performance, with accuracy and F1-scores exceeding 94%, while HybridBERT-LSTM followed closely behind with an accuracy of 90.46% and an F1-score of 87.30%. SVM, in contrast, yielded noticeably lower training performance (Accuracy: 80.84%, F1: 78.06%), highlighting its relative limitations in capturing complex language patterns. However, the test results (Table 17) reveal important insights into model generalizability. HybridBERT-LSTM emerged as the most balanced and generalizable model, achieving the highest test accuracy (87.29%) and a competitive F1-score (84.89%). Despite its superior training performance, CNN exhibited a significant drop in test accuracy (81.60%), suggesting potential overfitting. Similarly, LSTM, which performed strongly during training, experienced a substantial decline in accuracy (77.26%) and F1-score (78.18%) on the test set. BERT, while slightly lower in raw accuracy than HybridBERT-LSTM, maintained a stable generalization profile (Accuracy: 84.94%, F1: 85.12%). The SVM model again registered the weakest results across all test metrics, with an accuracy of 75.25% and an F1-score of 71.92%, reinforcing the notion that classical machine learning methods may struggle with complex dialogue structures compared to deep learning architectures. In summary, although CNN and LSTM excelled in training, their generalization to test data was limited. HybridBERT-LSTM, by contrast, demonstrated consistent performance across both phases, reinforcing its suitability for real-world sentiment classification tasks involving dialogue-based inputs.

Dataset 5 is constructed [50] for the purpose of modeling empathetic dialogues and comprises multi-turn human-to-human conversations that reflect emotionally rich interactions. The corpus is partitioned into three distinct subsets: the training set contains 40,200 instances, the validation set includes 5,730 instances, and the test set comprises 5,260 instances.

Tables 20 and 21 present the comparative performance metrics of five distinct models (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) on Dataset 5, using standard evaluation criteria: Accuracy, Precision, Recall, and F1-score. The results reveal clear patterns in terms of both model learning capacity on the training data and generalization to unseen test instances.

When Table 22, which reports the ablation study for Dataset 5, is examined, BERT-ParamMatched achieves 95.65% ± 0.19% accuracy with equivalent parameters, while HybridBERT-LSTM reaches 96.16% ± 0.23%, a 0.51 percentage point improvement. Component decomposition reveals uniform contributions: dual pooling adds +0.17% (96.16% vs. 95.99%), bidirectionality contributes +0.17% (95.99% vs. 95.82%), and the sequential LSTM architecture over an MLP provides +0.17% (95.82% vs. 95.65%). The cumulative gain of 0.66% from the BERT-Only baseline (95.50%) precisely matches the sum of the individual components, indicating minimal synergistic effects on this high-performing task, where the architectural elements operate additively rather than multiplicatively. The frozen BERT variant (92.45% ± 0.82%) provides task-difficulty insights: it outperforms standalone LSTM with GloVe embeddings (91.86% test) by only 0.59 percentage points, the smallest margin across all datasets, yet maintains a 3.71% gap from the full model (92.45% vs. 96.16%). This pattern establishes that on near-saturated tasks (BERT baseline: 95.50%), fine-tuning provides greater marginal value (+3.71%) than architectural modifications (+0.66%).
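The stepwise decomposition used in the ablation discussion can be reproduced directly from the Table 22 means; the sketch below hard-codes the accuracies from that table and checks that the component gains add up to the cumulative gain over the BERT-Only baseline.

```python
# Mean accuracies from the Dataset 5 ablation (Table 22).
acc = {
    "BERT-Only":        0.9550,
    "ParamMatched":     0.9565,
    "UniLSTM":          0.9582,
    "BiLSTM-NoPooling": 0.9599,
    "Full":             0.9616,
}
order = ["BERT-Only", "ParamMatched", "UniLSTM", "BiLSTM-NoPooling", "Full"]

# Stepwise component gains in percentage points (0.15, 0.17, 0.17, 0.17).
steps = {f"{a} -> {b}": round(100 * (acc[b] - acc[a]), 2)
         for a, b in zip(order, order[1:])}
cumulative = round(100 * (acc["Full"] - acc["BERT-Only"]), 2)  # 0.66

additive = abs(sum(steps.values()) - cumulative) < 1e-6  # True: components sum to the total
```

The `additive` flag confirming that the steps sum to the cumulative gain is exactly the "minimal synergistic effects" observation made for this dataset.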
Table 20
Training Performance Metrics for Dataset 5.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.9834 ± 0.0086    0.9834 ± 0.0074    0.9834 ± 0.0086    0.9833 ± 0.0084
BERT                0.9654 ± 0.0062    0.9654 ± 0.0059    0.9654 ± 0.0062    0.9654 ± 0.0061
LSTM                0.9936 ± 0.0049    0.9936 ± 0.0046    0.9936 ± 0.0049    0.9936 ± 0.0049
CNN                 0.9384 ± 0.0346    0.9416 ± 0.0312    0.9384 ± 0.0346    0.9373 ± 0.0278
SVM                 0.7536 ± 0.0272    0.7479 ± 0.0523    0.7536 ± 0.0272    0.7446 ± 0.0408
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.05), CNN (<0.01), SVM (<<0.01).

Table 21
Test Performance Metrics for Dataset 5.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.9616 ± 0.0023    0.9614 ± 0.0021    0.9616 ± 0.0023    0.9615 ± 0.0022
BERT                0.9550 ± 0.0020    0.9554 ± 0.0019    0.9550 ± 0.0020    0.9550 ± 0.0020
LSTM                0.9186 ± 0.0026    0.9201 ± 0.0029    0.9186 ± 0.0026    0.9190 ± 0.0031
CNN                 0.8851 ± 0.0281    0.8887 ± 0.0337    0.8851 ± 0.0310    0.8813 ± 0.0315
SVM                 0.7588 ± 0.0183    0.7507 ± 0.0178    0.7588 ± 0.0183    0.7506 ± 0.0179
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<<0.01), CNN (<<0.01), SVM (<<0.01).

Table 22
Ablation Performance Metrics for Dataset 5.
Model                    Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)     0.9245 ± 0.0082    0.9243 ± 0.0083
BERT-Only (Baseline)     0.9550 ± 0.0020    0.9550 ± 0.0020
BERT-ParamMatched        0.9565 ± 0.0019    0.9565 ± 0.0019
BERT+UniLSTM             0.9582 ± 0.0018    0.9582 ± 0.0018
BERT+BiLSTM-NoPooling    0.9599 ± 0.0017    0.9599 ± 0.0017
HybridBERT-LSTM (Full)   0.9616 ± 0.0023    0.9615 ± 0.0022

The parameter efficiency ratio of 2.44:1 (0.66% gain per 0.27% parameter increase) positions Dataset 5 among the simpler classification problems, validating the inverse relationship between baseline performance and the BiLSTM's contribution.

Based on the results averaged over five independent runs, the HybridBERT-LSTM model consistently achieved the highest performance on both the training and test sets. The remarkably low standard deviations (≈0.002–0.009) indicate not only superior average performance but also a high degree of stability and reproducibility across repeated trials.

The BERT model ranked second, yielding performance levels comparable to HybridBERT-LSTM. However, pairwise statistical comparisons revealed that the p-values were generally below 0.05, suggesting that the observed differences, while relatively small, are statistically significant and not attributable to random variation.

When evaluated on the test data (Table 21), HybridBERT-LSTM again outperformed all other models, achieving the highest accuracy (96.16%) and F1-score (96.15%), indicating strong generalization capability and robustness against overfitting. BERT maintained competitive test performance (Accuracy: 95.50%, F1: 95.50%), slightly lagging behind the hybrid model. While LSTM demonstrated superior training results, its test performance declined more notably (Accuracy: 91.86%, F1: 91.90%), suggesting possible overfitting to the training data. Similarly, CNN exhibited a moderate generalization gap, reaching only 88.51% accuracy on the test set despite its relatively high training metrics. SVM, whose training performance was already the weakest (Accuracy: 75.36%, F1: 74.46%), confirming its limitations in handling nuanced linguistic structures, again showed the lowest results in testing, with an F1-score of only 75.06% on the test data. This emphasizes the model's limited capacity to generalize in dialogue-rich or semantically complex scenarios compared to deep learning-based alternatives.

Overall, these results substantiate the efficacy of the HybridBERT-LSTM architecture in balancing contextual sensitivity and temporal structure modeling, thereby ensuring high accuracy and stability across both the learning and evaluation stages. The comparative drop in test performance observed in CNN and LSTM also underscores the importance of integrating both contextual and sequential representations for enhanced sentiment classification in dialogue settings.
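The train-test gaps discussed for Dataset 5 can be tabulated directly from the mean accuracies in Tables 20 and 21; a small sketch with those values hard-coded:

```python
# Mean accuracies for Dataset 5 (Tables 20 and 21).
train = {"HybridBERT-LSTM": 0.9834, "BERT": 0.9654, "LSTM": 0.9936,
         "CNN": 0.9384, "SVM": 0.7536}
test  = {"HybridBERT-LSTM": 0.9616, "BERT": 0.9550, "LSTM": 0.9186,
         "CNN": 0.8851, "SVM": 0.7588}

# Generalization gap: positive values indicate a drop from training to test.
gap = {m: round(train[m] - test[m], 4) for m in train}
worst = max(gap, key=gap.get)  # LSTM shows the largest train-test drop
```

The computed gaps make the overfitting argument concrete: LSTM (0.0750) and CNN (0.0533) shrink far more than HybridBERT-LSTM (0.0218), while SVM's gap is even slightly negative because it underfits in both phases.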
In contrast, comparisons with the lower-performing models (LSTM, CNN, and SVM) yielded p-values well below 0.01, providing strong statistical evidence of HybridBERT-LSTM's superiority. Notably, the LSTM model, despite attaining high training scores, exhibited a marked decline during testing, indicating a tendency toward overfitting. Similarly, the CNN model displayed wider standard deviations, pointing to instability and reduced reliability across runs.

In conclusion, the HybridBERT-LSTM model not only achieved the highest mean scores but also demonstrated low variance and statistically significant improvements, confirming its reliability and robustness as the most effective approach for Dataset 5.

In the training phase (Table 20), LSTM yielded the highest performance across all metrics, with an accuracy and F1-score of 99.36%, indicating exceptional capability in capturing sequential dependencies in the training corpus. Close behind, the HybridBERT-LSTM model achieved 98.34% accuracy and an F1-score of 98.33%, reflecting its strength in combining contextual embeddings with sequential modeling. BERT also performed robustly, attaining 96.54% across all reported metrics. In contrast, CNN demonstrated a moderate performance (Accuracy: 93.84%, F1: 93.73%), while SVM significantly underperformed.

Fig. 1. Interpretability analysis using the LIME framework for the proposed model for Dataset1.
Fig. 2. Interpretability analysis using the LIME framework for the proposed model for Dataset2.

Fig. 1 illustrates the interpretability analysis of the proposed sentiment classification model using the LIME framework. The visualization comprises three distinct components, each elucidating the model's decision-making process for a representative dialogue input.

The prediction probabilities panel (top-left) displays the model's confidence distribution across the three sentiment classes. Here, Class 1 achieves a probability score of 1.00, indicating complete certainty in the model's classification. Classes 0 and 2 both register a probability of 0.00, underscoring the model's confident and decisive prediction for this specific instance.

The feature importance panel, generated by LIME, presents the quantitative contribution of individual lexical features to the final prediction. The ranking reveals that terms such as "crying", "embarrassing", and "fear" possess the highest negative impact coefficients, while features like "worry", "freaking out", and "go out" show moderate levels of influence. Conversely, contextual words such as "counseling", "therapy", and "days" exhibit minimal importance, suggesting limited contribution to the sentiment prediction for this case.

The highlighted text visualization (right panel) offers an intuitive representation of feature importance through color-coded annotations. The input sentence, "I'm starting counseling/therapy in a few days. I'm freaking out but my main fear is crying and embarrassing myself. Should I be worried?", is annotated with blue highlights corresponding to high-impact emotional cues. The intensity of each highlight is directly proportional to the magnitude of that word's influence on the final classification.

Fig. 3 illustrates the interpretability analysis using the LIME framework for a sample medical consultation text, highlighting the model's capability to perform sentiment classification within clinical communication contexts. The visualization comprises several analytical components that elucidate the algorithmic decision-making process.
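The word-level attributions that LIME reports for these figures rest on perturbing the input and observing the change in the model's score. A toy leave-one-out version of that idea follows; the `predict_neg` stand-in classifier and its cue-word list are invented for illustration and are not the paper's model.

```python
NEGATIVE = {"fear", "crying", "embarrassing", "worry"}

def predict_neg(tokens):
    """Stand-in classifier: 'negative' score grows with the count of negative cue words."""
    k = sum(t in NEGATIVE for t in tokens)
    return k / (k + 1)

def word_importance(tokens):
    """Leave-one-out perturbation: score drop when a token is removed from the input."""
    base = predict_neg(tokens)
    return {t: round(base - predict_neg([u for u in tokens if u != t]), 4)
            for t in set(tokens)}

imp = word_importance("my main fear is crying and embarrassing myself".split())
# Emotional cue words receive positive importance; filler words receive none.
```

LIME proper fits a local linear surrogate over many random perturbations rather than single deletions, but the drop-one signal above is the same underlying quantity that the feature importance panels rank.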
Fig. 2 illustrates a LIME-based interpretability analysis for a sentiment classification instance derived from medical discourse, highlighting the model's interpretive capabilities in processing healthcare-related textual inputs. The visualization provides comprehensive insight into the underlying decision-making mechanisms of the sentiment prediction process.

The prediction probabilities panel reveals that the model assigns a dominant probability of 0.97 to Class 0, while significantly lower values of 0.01 and 0.02 are attributed to Classes 1 and 2, respectively. This distribution indicates high classification confidence with minimal uncertainty among the alternative sentiment categories.

The feature importance ranking presents local attributions generated by LIME, identifying the most influential lexical components contributing to the classification decision. The term "cancer" emerges as the primary contributor with an importance score of 0.61, followed by "scared" (0.22) and "please" (0.11). Additional terms such as "really", "as", "well", "find", "I", "blood", and "have" exhibit progressively lower importance coefficients, reflecting their secondary roles in the model's sentiment determination process.

The highlighted text panel displays the analyzed medical narrative: "Hello doctor, I'm a 26-year-old male, 10 cm tall and weigh 255 pounds. I sometimes have blood in my stool, especially after eating spicy food or when constipated. I'm really scared that I might have colon cancer. I frequently experience diarrhea. There is no family history of colon cancer. I had blood tests done last night. Please find my reports attached".

The blue-highlighted segments, particularly "scared" and "cancer", correspond to high-impact emotional and medical terminology that significantly influence the model's sentiment evaluation. This interpretability analysis demonstrates the model's sensitivity to emotionally charged and domain-specific medical expressions within healthcare contexts. The LIME explanation reveals that the classification decision primarily hinges on illness-related concerns and fear-based expressions. Accordingly, the analysis offers valuable insights into the model's domain-specific sentiment recognition capabilities when interpreting emotionally nuanced medical discourse.

For the clinical consultation text of Fig. 3, the prediction probability panel reveals high classification confidence, with the model assigning a dominant probability score of 0.99 to Class 2, while Classes 0 and 1 both receive marginal likelihoods of 0.01.

The feature importance analysis presents local explanations generated by LIME, quantifying individual lexical contributions to the final prediction. The term "affected" exhibits the highest contribution coefficient at 0.24, followed by "cold" (0.22) and "recovery" (0.20). Subsequent features such as "recommend" (0.17), "definitely" (0.11), and "avoid" (0.10) display gradually decreasing importance values. Additional terms like "by", "protect", "loose", and "issue" register minimal weights, indicating lower relevance in the sentiment attribution process.

The highlighted text visualization renders the analyzed clinical advisory statement: "Hello, I have reviewed the attached photographs, the attachments have been removed to protect patient identity. In my opinion, you are affected by a tinea infection. I recommend taking 250 mg terbinafine tablets once daily and applying sertaconazole cream to the affected area twice daily. Continue this for three weeks and return. You will definitely notice some improvement...".

Terms highlighted in green, specifically "affected", "recommend", and "improvement", correspond to therapeutically oriented expressions that significantly influence the model's positive sentiment classification. This interpretability analysis reveals the model's capacity to distinguish constructive medical recommendations from neutral or negatively toned clinical communications. The LIME explanation demonstrates that the classification decision is primarily driven by treatment-related vocabulary and optimistic prognostic indicators, offering valuable insights into the model's domain-specific sentiment recognition abilities within healthcare advisory scenarios.

Fig. 3. Interpretability analysis using the LIME framework for the proposed model for Dataset3.
Fig. 4. Interpretability analysis using the LIME framework for the proposed model for Dataset4.
Fig. 5. Interpretability analysis using the LIME framework for the proposed model for Dataset5.

Fig. 4 presents a LIME-based interpretability analysis for the sentiment classification of a concise social media content sample, illustrating the model's ability to process succinct and informal textual expressions. The visualization offers in-depth insights into the underlying sentiment classification mechanisms for multimedia-related content descriptions.

The prediction probability panel indicates that the model assigns a dominant probability of 0.95 to Class 2, while Classes 0 and 1 receive significantly lower confidence scores of 0.04 and 0.01, respectively. This distribution demonstrates high classification confidence with minimal ambiguity across the alternative sentiment categories.

The feature importance ranking displays local explanations derived from LIME, identifying the most influential lexical components in the model's decision-making process. The term "cute" emerges as the primary contributor with the highest importance score of 0.53, followed by "funny" (0.17). Additional terms such as "dogs" (0.05), "belly" (0.04), "compilation" (0.03), "flop" (0.02), and "corgi" (0.01) exhibit progressively decreasing contribution scores, reflecting their secondary roles in sentiment attribution.

The text highlight visualization renders the analyzed content description: "corgi belly flop compilation cute funny dogs corgi flop". Green-highlighted terms, particularly "cute" and "funny", correspond to positive emotional descriptors that substantially influence the model's sentiment classification toward the positive class.

This interpretability analysis demonstrates the model's efficacy in detecting positive sentiment cues within short, multimedia-oriented content descriptions. The LIME explanation reveals that the classification decision is primarily driven by emotionally charged adjectives expressing affection and humor, offering valuable insights into the model's ability to process informal social media language patterns and perform sentiment analysis on pet-related content.

Fig. 5 presents the LIME-based interpretability analysis of a personal expression sample, illustrating the model's capacity to interpret emotional distress within the context of domestic relationships. This visualization provides detailed insights into the sentiment classification process related to interpersonal communication patterns. The prediction probability panel shows that the model assigns a dominant probability of 0.92 to Class 0, while Classes 1 and 2 receive substantially lower confidence scores of 0.03 and 0.05, respectively. This distribution reflects the model's high classification confidence with minimal ambiguity across the alternative sentiment categories. The feature importance analysis displays locally derived explanations generated by LIME, quantifying the contribution of individual lexical features to the final prediction. The terms "angry" and "friends" exhibit the highest impact scores of 0.43, followed by "I" (0.24), "ugh" (0.23), and "exhausted" (0.22). Additional terms such as "yes" (0.16), "so" (0.10), "his" (0.09), "husband" (0.04), and "again" (0.04) display diminishing importance scores, indicating secondary roles in the sentiment determination process.

The text highlight visualization presents the analyzed personal narrative: "ugh I'm so angry my husband went out with his friends for the third time this week, is he drinking, yes, I'm exhausted my daughter is teething so she isn't sleeping well".

The blue-highlighted segments, particularly "ugh", "angry", "friends", and "exhausted", correspond to emotionally expressive markers and stress indicators that significantly influenced the model's negative sentiment classification. This interpretability analysis reveals the model's ability to detect frustration and emotional exhaustion within narratives involving intimate relational contexts. The LIME explanation demonstrates that the classification decision is predominantly based on explicit emotional state descriptors and situational stress signals, providing valuable insight into the model's competence in analyzing sentiment in informal, emotionally charged personal communications and family-related discourse.

Fig. 6. Graph-based visualization with the WordContextGraphExplainer framework for Dataset1.

Fig. 6 presents a comprehensive visualization generated by the WordContextGraphExplainer framework, illustrating the contextual dependencies and feature interactions that underlie the sentiment analysis model's decision-making process. This graph-based representation analyzes a textual input with inherently negative emotional content, offering insights into how individual lexical units contribute to the model's final classification outcome.

The visualization employs a node-edge graph structure, wherein each word in the input sentence is represented as a distinct node. A structured layout algorithm is used to optimally position the nodes, minimizing visual overlap while preserving semantic relationships.

This visualization framework directly addresses the critical need for interpretability in natural language processing applications. By decomposing the model's reasoning into individual word contributions and pairwise interactions, WordContextGraphExplainer enables practitioners to understand not only what the model predicts, but why specific linguistic features drive those predictions. Such detailed analysis is especially valuable in high-stakes applications, where transparency and accountability are essential. The graph structure effectively conveys the intricate interplay between lexical semantics and contextual dependencies that influence automatic sentiment classification, offering a robust foundation for both model validation and bias detection in NLP systems.

Fig. 7 illustrates a visual explanation generated through the WordContextGraphExplainer framework, a graph-theoretic methodology developed to enhance interpretability in natural language processing tasks. This approach is specifically designed to analyze the contextual and semantic interdependencies among lexical units in a given text.
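The edge weights in these graphs capture exactly the non-additive part of a word pair's effect: how much the pair changes the model's score beyond what each word contributes alone. A compact sketch of that difference-in-differences computation and the top-5 edge selection follows; the `toy_score` model and its negation rule are invented for illustration.

```python
from itertools import combinations

def interaction(score, tokens, i, j):
    """Non-additive pairwise effect of tokens i and j (difference-in-differences)."""
    def drop(*idx):
        return [t for k, t in enumerate(tokens) if k not in idx]
    return score(tokens) - score(drop(i)) - score(drop(j)) + score(drop(i, j))

def top_edges(score, tokens, k=5):
    """Keep only the k strongest word-pair interactions, as in the graph views."""
    scored = {(tokens[i], tokens[j]): interaction(score, tokens, i, j)
              for i, j in combinations(range(len(tokens)), 2)}
    return sorted(scored.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

def toy_score(tokens):
    """Toy sentiment score with an explicit negation rule: 'not bad' flips polarity."""
    s = -1.0 if "bad" in tokens else 0.0
    if "not" in tokens and "bad" in tokens:
        s += 2.0
    return s

edges = top_edges(toy_score, "this is not bad".split())
# The ("not", "bad") pair dominates: its joint effect is not the sum of its parts.
```

This is the kind of signal a purely additive explainer such as LIME cannot represent, which is the motivation given for modeling edges explicitly.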
The words with negative influence on the prediction, gray nodes indicate visualized instance centers on a sample from a patient–doctor interac- neutral contributions, and green nodes denote positive contributions tion scenario, highlighting how domain-specific terminology influences that enhance the model’s classification confidence. Each node is an- the model’s sentiment classification decision. notated with a numeric coefficient reflecting its individual effect on The graph comprises the following principal components: the predicted class probability. The values presented (ranging from Each node corresponds to an individual word token extracted from +0.0001 to +0.0197) quantitatively capture the magnitude of each the input sentence. Numerical values adjacent to the nodes (rang- word’s contribution to the final classification decision. Notably, terms ing from −0.6908 to +0.3007) quantify the contextual influence of such as ‘‘worthless’’ (+0.0068), ‘‘barely’’ (+0.0072), and ‘‘emotions’’ each word on the model’s predicted sentiment class. These scalar (+0.0197) exhibit significant negative sentiment contributions, aligning weights reflect the relative importance of lexical features based on with the model’s overall classification of the input as Negative. Edges perturbation-based sensitivity analysis. between nodes represent word-pair interactions whose importance ex- Edges link semantically related word pairs, capturing co-occurrence ceeds a predefined threshold, capturing non-additive effects between patterns and latent dependencies. Notably, the term ‘‘pain’’ occupies co-occurring terms. As specified in the legend (top-left), the visualiza- a central position in the graph with multiple connections, indicating tion highlights the top five most influential word-pair interactions. Edge its pivotal role in determining the emotional tone of the dialogue. 
The annotations (e.g., ‘‘+0.6061 (Neg)’’, ‘‘+0.6701 (Neg)’’) denote both visualization applies a ‘‘top-5 interactions’’ threshold, selectively dis- the strength and directional impact of these interactions on sentiment playing the most salient semantic relationships to prevent information classification. These values reflect synergistic or antagonistic effects overload while preserving interpretive clarity. that emerge when specific word combinations appear within the same The graph reveals a meaningful mapping between medical do- context. The model’s confident prediction of the input text as expressing main terms (e.g., ‘‘doctor’’, ‘‘medication’’, ‘‘pain’’) and activity-related Negative sentiment (as shown at the bottom of the visualization) is expressions drawn from sports terminology (e.g., ‘‘tennis’’, ‘‘cricket’’, supported by the prevalence of red-coded nodes and high-magnitude ‘‘playing’’), showcasing the model’s capacity to associate physically negative interaction coefficients. The analyzed text—rich in expressions contextualized discomfort with healthcare concerns. This highlights the of emotional distress and self-deprecating language—serves as a clear model’s ability to capture nuanced emotional cues across domains. 15 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Fig. 7. Graph-based visualization with the WordContextGraphExplainer framework for Dataset2. Fig. 8. Graph-based visualization with the WordContextGraphExplainer framework for Dataset3. The WordContextGraphExplainer framework, as demonstrated in this A salient feature in the visualization is the positioning of the word clinical communication use case, provides an interpretable, context- ‘‘great’’ as the central hub node. With a high positive influence score aware mechanism for analyzing model behavior. 
Its utility in domains of +0.7819, this term is encoded in green, representing a dominant such as clinical text analysis and patient-centered dialogue interpreta- contributor within the Positive Sentiment category. Its central role in tion suggests promising implications. By revealing both direct and indi- the graph indicates that it functions as the primary sentiment-bearing rect contributions of lexemes to the classification process, this method- lexical unit in the sentence. ology lays a solid foundation for future research on explainable AI in The graph exhibits a radial topology, with all peripheral nodes medical and psychologically sensitive natural language applications. emanating from the central ‘‘great’’ node. This star-like configuration Fig. 8 presents a significant methodological example of visualiz- reflects how sentiment polarity is propagated through the surrounding ing sentiment analysis and contextual word relationships through the context, with the central node acting as the semantic anchor. WordContextGraphExplainer framework. The graph specifically illus- The weights of the edges range from −0.2868 to +0.1792, quantifying trates the semantic structure of the sentence ‘‘that would be great, then the strength of semantic correlation between each word and the central we could plan things sooner’’, offering insight into how lexical elements ‘‘great’’ node. The system’s overall classification of the sentence as collectively influence the model’s sentiment prediction. Positive sentiment is clearly driven by the dominant positive influence 16 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Fig. 9. Graph-based visualization with the WordContextGraphExplainer framework for Dataset4. of the hub node. This highlights the framework’s keyword-centric coherence. This clustering reveals that the system is capable of con- modeling approach to sentiment interpretation. 
Words such as "plan", "things", "sooner", "then", "we", "could", "that", "would", and "be" are categorized as having neutral sentiment contributions. These peripheral tokens exhibit minimal effect values ranging between +0.0001 and +0.0002, suggesting their limited semantic influence on the classification. This uniform distribution underscores the marginal role of syntactic or functional words in the model's decision-making process.

The system's capacity to selectively highlight the five strongest semantic pairwise interactions enhances both computational efficiency and model interpretability. By focusing on the most relevant contextual relationships, the graph avoids overcomplexity while preserving analytical fidelity.

This visualization demonstrates that WordContextGraphExplainer serves as a promising approach within the sentiment analysis domain, contributing meaningfully to the broader paradigm of interpretable artificial intelligence. Its ability to disentangle and communicate the interplay between dominant and supportive linguistic features makes it particularly valuable for applications requiring both transparency and analytical depth.

Fig. 9 presents a Word Context Graph that exemplifies the complex dynamics of multi-domain sentiment analysis and cross-topical semantic understanding. The visualization analyzes the sentence "I have never seen Avatar, what is it about? I really enjoy The Avenger", offering a fine-grained representation of lexical interactions within the entertainment domain.

The node "enjoy" (+0.4646) serves as the central hub in the graph, exhibiting the highest positive sentiment score. This node constitutes the semantic backbone of the structure, maintaining extensive connectivity with surrounding tokens. The presence of dual-edge structures highlights WordContextGraphExplainer's capacity to capture nuanced variations in semantic relationship strength across word pairs. The strong semantic ties among the nodes "avatar", "avenger", and "enjoy" reflect the model's successful identification of domain-specific coherence. This clustering reveals that the system is capable of contextually grouping entertainment-related entities, thereby enhancing domain-sensitive sentiment interpretation.

The inclusion of interrogative tokens such as "what" (+0.0018) and the question mark "?" (+0.0017) underscores the framework's ability to classify interrogative structures appropriately within the semantic graph. These tokens demonstrate minor but contextually relevant contributions to the overall sentiment.

The neutral classification of the term "never" (+0.0007) suggests a sophisticated handling of negation. Rather than misattributing a strong negative weight, the model maintains contextual equilibrium, acknowledging the grammatical presence of negation without overestimating its emotional impact.

The model's ultimate sentiment prediction as Positive is primarily driven by the dominant influence of the "enjoy" hub node. This demonstrates the system's robust classification capabilities in scenarios containing mixed sentiments and multifaceted content.

Overall, this analysis reinforces the efficacy of the WordContextGraphExplainer framework as an interpretability tool for complex conversational texts. It not only captures domain-specific semantic cohesion but also preserves fine-grained contextual dependencies, making it a powerful instrument for multi-topic sentiment analysis in real-world natural language understanding applications.

Fig. 10 illustrates a Word Context Graph generated by the WordContextGraphExplainer framework, presenting a critical case study for sentiment analysis and psychological state detection within the mental health domain. The graph analyzes a linguistically complex, emotionally charged sentence: "I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here".

Fig. 10. Graph-based visualization with the WordContextGraphExplainer framework for Dataset5.

The term "feelings" (+0.0197) is positioned as the central hub node, forming the core component of the negative sentiment cluster. This central positioning reflects the dominant role of emotional discourse within the narrative and highlights the lexical anchor around which semantic interactions are organized.

The graph predominantly features nodes classified as negative, such as "worthless" (+0.0068), "nothing" (+0.0097), and "barely" (+0.0072). These contribute to the accurate identification of depressive language patterns and reinforce the system's capacity to localize affectively significant tokens. Edge weights span a broad spectrum from +0.8360 to −0.6061, indicating considerable variance in the strength of inter-word interactions. Notably, the strongest negative correlations are concentrated around the "feelings" hub, supporting its centrality in semantic influence. The nodes "shouldn" (+0.0171) and "be" (+0.0132) are negatively classified, reflecting the system's ability to detect linguistic indicators of suicidal ideation.

Table 23
Interpretability fidelity score comparison across datasets.

Dataset     LIME      WordContextGraphExplainer    Improvement (%)
Dataset 1   0.8100    0.8900                       +9.88
Dataset 2   0.8000    0.8600                       +7.50
Dataset 3   0.6540    0.7380                       +12.84
Dataset 4   0.6920    0.7120                       +2.89
Dataset 5   0.6800    0.8200                       +20.59
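The signed interaction weights attached to the graph edges in these case studies can be estimated by perturbation. The sketch below is one standard formulation assumed here for illustration, not the paper's exact formula: a pair's interaction score is what remains of the prediction change after the two single-word effects are subtracted out, and a toy scoring function stands in for the HybridBERT-LSTM classifier.

```python
from itertools import combinations

def remove(tokens, drop):
    """Return a copy of the token list with the given positions removed."""
    return [t for i, t in enumerate(tokens) if i not in drop]

def pairwise_interactions(tokens, predict, top_k=5):
    """Perturbation-based interaction scores for every token pair.

    interaction(i, j) = p(x) - p(x without i) - p(x without j)
                        + p(x without i and j)
    A non-zero value means the two tokens act synergistically or
    antagonistically rather than additively, which a linear,
    independence-assuming explainer such as LIME cannot represent.
    Only the top_k pairs by absolute score are kept, mirroring the
    'top-5 interactions' edge filter used in the visualizations.
    """
    base = predict(tokens)
    scores = {}
    for i, j in combinations(range(len(tokens)), 2):
        scores[(tokens[i], tokens[j])] = (
            base
            - predict(remove(tokens, {i}))
            - predict(remove(tokens, {j}))
            + predict(remove(tokens, {i, j}))
        )
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]

# Toy stand-in for the classifier's negative-class probability: "barely"
# and "sleep" carry extra weight only when they co-occur, a deliberately
# non-additive joint effect.
def toy_predict(tokens):
    score = 0.5
    if "worthless" in tokens:
        score += 0.3
    if "barely" in tokens and "sleep" in tokens:
        score += 0.15  # joint effect present only when both words appear
    return min(score, 1.0)

pairs = pairwise_interactions(["i", "barely", "sleep", "worthless"], toy_predict)
print(pairs[0])  # the strongest interaction pair and its score
```

Under this toy model, only the ("barely", "sleep") pair receives a non-trivial interaction score, which is exactly the kind of collaborative effect the graph edges are meant to surface.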
The detection of these tokens demonstrates the model's sensitivity to subtle syntactic constructions associated with psychological distress. The node "sleep" (+0.0125) is identified within the negative sentiment category, indicating the model's capacity to recognize sleep disruption, an important marker in clinical mental health assessments. The term "think" (+0.0044) reflects ruminative thought patterns and is correctly positioned within the semantic network. This demonstrates the system's effectiveness in modeling internal cognitive processes associated with depressive episodes. The model's overall prediction of Negative sentiment aligns with clinical assessment criteria, suggesting that the system achieves a promising level of accuracy for mental health screening applications. This classification is supported by the density of negative sentiment nodes and their semantically coherent interactions.

This analysis demonstrates that the WordContextGraphExplainer framework provides a robust interpretability mechanism for psychologically sensitive content. By quantifying both individual lexical contributions and inter-word semantic interactions, the system delivers a fine-grained visualization of emotional discourse, making it particularly valuable in clinical decision support systems.

The fidelity metric [51] implemented in this framework quantifies the correspondence between explanation-based feature importance rankings and observable model behavior changes through a perturbation-based assessment methodology.

Let M represent the trained model, x denote the original input text, and E(x) represent the explanation method that produces a set of important features F = {f_1, f_2, ..., f_k} with associated importance scores. The fidelity score for a single instance is defined as:

Fidelity(x, E) = |M(x) − M(x′)|    (1)

where x′ represents the perturbed text obtained by removing the top-k most important features identified by the explanation method E.

The fidelity [52] assessment follows this systematic procedure. First, we compute the original model prediction p_0 = M(x) to establish a baseline reference point. Next, we extract the most important features F = E(x, k) using the specified explanation method, where k determines the number of top-ranked features to consider. Subsequently, we create a modified input x′ = Remove(x, F) by removing the identified important features from the original text. We then compute a new prediction p′ = M(x′) using this perturbed input to observe how the model's behavior changes. Finally, we calculate the fidelity score as fidelity = |p_0 − p′|, which quantifies the absolute difference between the original and perturbed predictions.

The underlying hypothesis assumes that if an explanation method accurately identifies decision-critical features, their removal should produce substantial changes in model predictions. Mathematically, this can be expressed as:

High Fidelity ⇔ argmax(M(x)) ≠ argmax(M(x′))    (2)

The absolute difference metric captures both direction-preserving and direction-changing prediction modifications, providing a comprehensive assessment of explanation accuracy.

For comprehensive evaluation, individual fidelity scores are aggregated using the arithmetic mean:

Mean Fidelity = (1/n) Σ_{i=1}^{n} |M(x_i) − M(x′_i)|    (3)

where n represents the total number of test instances.

In the broader context of XAI for natural language processing, WordContextGraphExplainer offers methodological advantages over traditional frameworks such as LIME. Unlike LIME, which assumes feature independence and linearity, WordContextGraphExplainer employs a graph-theoretic structure capable of capturing non-linear relationships and contextual dependencies, features essential for modeling complex, multi-sentiment narratives. These findings underscore the superiority of graph-based interpretability in high-stakes domains and suggest promising future directions for next-generation explainable NLP systems (see Table 23).
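The procedure behind Eqs. (1)–(3) translates directly into code. The sketch below mirrors the five steps described above; the lexicon-based lexicon_predict and rank_by_lexicon functions are illustrative stand-ins for the trained model M and the explanation method E, not the paper's implementation.

```python
def fidelity(predict, tokens, explain, k=2):
    """Perturbation-based fidelity of Eq. (1): |M(x) - M(x')|,
    where x' drops the top-k features ranked by the explanation method."""
    p0 = predict(tokens)                      # step 1: baseline p0 = M(x)
    ranked = explain(tokens)                  # step 2: F = E(x, k)
    top_features = {f for f, _ in ranked[:k]}
    x_prime = [t for t in tokens if t not in top_features]  # step 3: Remove(x, F)
    p_prime = predict(x_prime)                # step 4: p' = M(x')
    return abs(p0 - p_prime)                  # step 5: fidelity = |p0 - p'|

def mean_fidelity(predict, texts, explain, k=2):
    """Eq. (3): arithmetic mean of per-instance fidelity scores."""
    return sum(fidelity(predict, t, explain, k) for t in texts) / len(texts)

# Toy stand-in model: negative-class probability from a tiny word lexicon.
LEXICON = {"worthless": 0.30, "barely": 0.10, "nothing": 0.05}

def lexicon_predict(tokens):
    return min(1.0, 0.4 + sum(LEXICON.get(t, 0.0) for t in tokens))

def rank_by_lexicon(tokens):
    # Rank words by their (known) lexicon weight, most important first.
    return sorted(((t, LEXICON.get(t, 0.0)) for t in set(tokens)),
                  key=lambda kv: kv[1], reverse=True)

text = ["i", "am", "worthless", "and", "barely", "sleep"]
print(round(fidelity(lexicon_predict, text, rank_by_lexicon), 4))
```

Because the stand-in explainer here ranks features perfectly, removing the top two words shifts the prediction by the full weight they carried; a weaker explainer would select less influential words and score a correspondingly lower fidelity, which is the behavior Table 23 compares across LIME and WordContextGraphExplainer.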
5. Conclusion

This study presents a comprehensive framework for sentiment classification in dialogue-based scenarios through the development of a novel HybridBERT-LSTM architecture coupled with an innovative interpretability methodology. The proposed hybrid model demonstrates superior performance on both benchmark datasets, including the widely adopted IMDb corpus, and real-world dialogue datasets, consistently outperforming standalone architectures such as traditional LSTM, BERT, CNN, and SVM implementations. The empirical results validate the model's enhanced capacity to capture both the semantic richness of individual utterances and the sequential dependencies inherent in multi-turn conversational contexts.

The architectural innovation of HybridBERT-LSTM leverages pre-trained BERT encodings for deep contextualized embeddings, subsequently processed through bidirectional LSTM layers to model temporal dependencies and discourse-level structures. The integration of dual pooling mechanisms (average and maximum) followed by dense classification layers enables the model to synthesize learned representations effectively, making it particularly suitable for dialogue sentiment analysis where contextual flow and sequential relationships are paramount.

A significant contribution of this research lies in the development of explainable, context-aware sentiment reasoning capabilities. Beyond the scope of traditional local explanation techniques, a novel graph-theoretic interpretability framework, WordContextGraphExplainer, has been proposed to address the fundamental limitations inherent in existing methodologies. Unlike LIME, which operates under linear additivity assumptions and treats tokens as independent entities, WordContextGraphExplainer employs sophisticated perturbation analysis to model non-linear semantic interactions between word pairs. This methodology constructs semantic interaction graphs where nodes represent individual word contributions and edges encode inter-word dependencies, providing intuitive visualization of complex linguistic relationships through NetworkX-based representations. The comparative analysis reveals that while LIME provides granular word-level attributions, it operates independently of sequential context and fails to capture the synergistic effects crucial for accurate sentiment interpretation in conversational settings. In contrast, WordContextGraphExplainer's graph-based approach explicitly models contextual interdependencies, semantic propagation patterns, and negation scope effects that are essential for understanding transformer decision-making processes. This advancement enables practitioners to trace how sentiment emerges through word interactions and temporal flow across dialogue turns, providing unprecedented insights into model reasoning mechanisms. The integration of WordContextGraphExplainer with HybridBERT-LSTM establishes a new paradigm for interpretable dialogue sentiment analysis, where prediction accuracy and explainability are synergistically enhanced.

This framework demonstrates particular efficacy in clinical applications and mental health assessment scenarios, where understanding the rationale behind sentiment predictions is as critical as the predictions themselves. Future research directions include extending the graph-based interpretability framework to multilingual contexts and exploring its applications in other NLP tasks requiring fine-grained semantic understanding. Future work should focus on developing simplified visualization layers and adaptive user interfaces that can present graph-based explanations at varying levels of complexity, enabling domain experts to access meaningful interpretability insights without requiring deep technical expertise in graph theory or network analysis. Future research should also incorporate systematic human evaluation studies to assess the explanatory quality and clinical applicability of WordContextGraphExplainer outputs among domain practitioners.

CRediT authorship contribution statement

Ercan Atagün: Writing – review & editing, Writing – original draft, Methodology, Investigation, Conceptualization. Günay Temür: Validation, Methodology. Serdar Biroğul: Supervision, Project administration, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

References

[1] L. Song, et al., CASA: Conversational aspect sentiment analysis for dialogue understanding, J. Artificial Intelligence Res. 73 (2022) 511–533.
[2] M. Firdaus, et al., MEISD: A multimodal multi-label emotion, intensity and sentiment dialogue dataset, in: COLING, 2020, pp. 4441–4453.
[3] I. Carvalho, et al., The importance of context for sentiment analysis in dialogues, IEEE Access 11 (2023) 86088–86103.
[4] J. Wang, et al., Sentiment classification in customer service dialogue with topic-aware multi-task learning, AAAI 34 (05) (2020) 9177–9184.
[5] D. Bertero, et al., Real-time speech emotion and sentiment recognition, in: EMNLP, 2016, pp. 1042–1047.
[6] C. Bothe, et al., Dialogue-based neural learning to estimate sentiment, in: ICANN, 2017, pp. 477–485.
[7] M. Firdaus, et al., EmoSen: Generating sentiment and emotion controlled responses, IEEE Trans. Affect. Comput. 13 (3) (2020) 1555–1566.
[8] A. Mallol-Ragolta, B. Schuller, Coupling sentiment and arousal analysis, IEEE Access 12 (2024) 20654–20662.
[9] Z. Akbar, M.U. Ghani, U. Aziz, Boosting viewer experience with emotion-driven video analysis: A BERT-based framework for social media content, J. Artif. Intell. Behav. (2025).
[10] J. Zhao, W. Gao, A semantic-enhanced heterogeneous dialogue graph network, in: IEEE ICETCI, 2024, pp. 1315–1322.
[11] M. Yang, et al., GME-dialogue-NET, Acad. J. Comput. Inf. Sci. 4 (8) (2021) 10–18.
[12] M. Parmar, A. Tiwari, Emotion and sentiment analysis in dialogue: A multimodal strategy employing the BERT model, in: 2024 Parul International Conference on Engineering and Technology, PICET, 2024, pp. 1–7.
[13] Mustapha Z., Aspect-based emotion analysis for dialogue understanding, 2024.
[14] W. Li, W. Shao, S. Ji, E. Cambria, BiERU: Bidirectional emotional recurrent unit for conversational sentiment analysis, Neurocomputing 467 (2022) 73–82.
[15] S. Poria, D. Hazarika, N. Majumder, R. Mihalcea, Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research, IEEE Trans. Affect. Comput. 14 (1) (2020) 108–132.
[16] L. Zhu, R. Mao, E. Cambria, B.J. Jansen, Neurosymbolic AI for personalized sentiment analysis, in: International Conference on Human-Computer Interaction, Springer Nature Switzerland, Cham, 2024, pp. 269–290.
[17] M. Luo, H. Fei, B. Li, S. Wu, Q. Liu, S. Poria, et al., Panosent: A panoptic sextuple extraction benchmark for multimodal conversational aspect-based sentiment analysis, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 7667–7676.
[18] Y. Zhang, Q. Li, D. Song, P. Zhang, P. Wang, Quantum-inspired interactive networks for conversational sentiment analysis, 2019.
[19] L. Yang, Q. Yang, J. Zeng, T. Peng, Z. Yang, H. Lin, Dialogue sentiment analysis based on dialogue structure pre-training, Multimedia Syst. 31 (2) (2025) 1–13.
[20] K. Horesh, A. Kumar, A. Anand, A. Sabu, T. Jain, Sentiment Analysis on Amazon Electronics Product Reviews using Machine Learning Techniques, IEEE, 2023, http://dx.doi.org/10.1109/gcat59970.2023.10353467.
[21] A. Matsui, E. Ferrara, Word embedding for social sciences: An interdisciplinary survey, PeerJ Comput. Sci. 10 (2024) e2562.
[22] S. Anitha, P. Gnanasekaran, Advanced sentiment classification using RoBERTa and aspect-based analysis on large-scale e-commerce datasets, Nanotechnol. Perceptions 20 (S16) (2024) 336–348.
[23] P. Borah, D. Gupta, B.B. Hazarika, ConCave-convex procedure for support vector machines with Huber loss for text classification, Comput. Electr. Eng. 122 (2025) 109925.
[24] Z. Hua, Y. Tong, Y. Zheng, Y. Li, Y. Zhang, PPGloVe: Privacy-preserving GloVe for training word vectors in the dark, IEEE Trans. Inf. Forensics Secur. 19 (2024) 3644–3658.
[25] A. Rasool, S. Aslam, N. Hussain, S. Imtiaz, W. Riaz, nbert: Harnessing NLP for emotion recognition in psychotherapy to transform mental health care, Information 16 (4) (2025) 301.
[26] E. Mitera-Kiełbasa, K. Zima, Automated classification of exchange information requirements for construction projects using Word2Vec and SVM, Infrastructures 9 (11) (2024) 194.
[27] Z. Yang, F. Emmert-Streib, Optimal performance of Binary Relevance CNN in targeted multi-label text classification, Knowl.-Based Syst. 284 (2024) 111286.
[28] J. Peng, S. Huo, Application of an improved convolutional neural network algorithm in text classification, J. Web Eng. 23 (3) (2024) 315–339.
[29] K. Nithya, M. Krishnamoorthi, S.V. Easwaramoorthy, C.R. Dhivyaa, S. Yoo, J. Cho, Hybrid approach of deep feature extraction using BERT–OPCNN & FIAC with customized Bi-LSTM for rumor text classification, Alex. Eng. J. 90 (2024) 65–75.
[30] S. Jamshidi, M. Mohammadi, S. Bagheri, H.E. Najafabadi, A. Rezvanian, M. Gheisari, et al., Effective text classification using BERT, MTM LSTM, and DT, Data Knowl. Eng. 151 (2024) 102306.
[31] O. Galal, A.H. Abdel-Gawad, M. Farouk, Federated freeze BERT for text classification, J. Big Data 11 (1) (2024) 28.
[32] C. Eang, S. Lee, Improving the accuracy and effectiveness of text classification based on the integration of the BERT model and a recurrent neural network (RNN_Bert_Based), Appl. Sci. 14 (18) (2024) 8388.
[33] M. Ahmed, M.S. Hossain, R.U. Islam, K. Andersson, Explainable text classification model for COVID-19 fake news detection, J. Internet Serv. Inf. Secur. 12 (2) (2022) 51–69.
[34] K. Zahoor, N.Z. Bawany, T. Qamar, Evaluating text classification with explainable artificial intelligence, Int. J. Artif. Intell. ISSN 2252-8938 (2024).
[35] D. Kalla, N. Smith, F. Samaah, Deep learning-based sentiment analysis: Enhancing IMDb review classification with LSTM models, 2025, Available at SSRN 5103558.
[36] R. Beniwal, A.K. Dinkar, A. Kumar, A. Panchal, A hybrid deep learning model for sentiment analysis of IMDB movies reviews, in: 2024 Asia Pacific Conference on Innovation in Technology, APCIT, IEEE, 2024, pp. 1–7.
[37] N. Tabassum, T. Alyas, M. Hamid, M. Saleem, S. Malik, Z. Ali, U. Farooq, Semantic analysis of Urdu English tweets empowered by machine learning, Intell. Autom. Soft Comput. 30 (1) (2021) 175–186.
[38] A. Pandey, R. Yadav, A. Pathak, N. Shivani, B. Garg, A. Pandey, Sentiment analysis of IMDB movie reviews, in: 2024 First International Conference on Software, Systems and Information Technology, SSITCON, IEEE, 2024, pp. 1–6.
[39] R. Amin, R. Gantassi, N. Ahmed, A.H. Alshehri, F.S. Alsubaei, J. Frnda, A hybrid approach for adversarial attack detection based on sentiment analysis model using machine learning, Eng. Sci. Technol. an Int. J. 58 (2024) 101829.
[40] A. Bajaj, D.K. Vishwakarma, HOMOCHAR: A novel adversarial attack framework for exposing the vulnerability of text-based neural sentiment classifiers, Eng. Appl. Artif. Intell. 126 (2023) 106815, http://dx.doi.org/10.1016/j.engappai.2023.106815.
[41] A. Bajaj, D.K. Vishwakarma, Evading text-based emotion detection mechanism via adversarial attacks, Neurocomputing 558 (2023).
[42] G.A. de Oliveira, R.T. de Sousa, R. de O. Albuquerque, L.J.G. Villalba, Adversarial attacks on a lexical sentiment analysis classifier, Comput. Commun. 174 (2021) 154–171, http://dx.doi.org/10.1016/j.comcom.2021.04.026.
[43] M. Hussain, M. Naseer, Comparative analysis of logistic regression, LSTM, and Bi-LSTM models for sentiment analysis on IMDB movie reviews, J. Artif. Intell. Comput. 2 (1) (2024) 1–8.
[44] C.D. Kulathilake, J. Udupihille, S.P. Abeysundara, A. Senoo, Deep learning-driven multi-class classification of brain strokes using computed tomography: A step towards enhanced diagnostic precision, Eur. J. Radiol. 187 (2025) 112109.
[45] Amod, Mental health counseling conversations dataset, 2024, Retrieved from https://huggingface.co/datasets/Amod/mental_health_counseling_conversations/tree/main.
[46] B. Yao, P. Tiwari, Q. Li, Self-supervised pre-trained neural network for quantum natural language processing, Neural Netw. 184 (2025) 107004.
[47] SohamGhadge, Casual conversation dataset, 2024, Retrieved from https://huggingface.co/datasets/SohamGhadge/casual-conversation/tree/main.
[48] Mahfoos, Patient-doctor conversation dataset, 2024, Retrieved from https://huggingface.co/datasets/mahfoos/Patient-Doctor-Conversation/tree/main.
[49] Alimistro123, English chat sentiment dataset, 2024, Retrieved from https://www.kaggle.com/code/alimistro123/english-chat-sentiment-dataset-found.
[50] Adapting, Empathetic dialogues v2 dataset, 2024, Retrieved from https://huggingface.co/datasets/Adapting/empathetic_dialogues_v2.
[51] Y. Singh, Q.A. Hathaway, V. Keishing, S. Salehi, Y. Wei, N. Horvat, D.V. Vera-Garcia, A. Choudhary, A. Mula Kh, E. Quaia, et al., Beyond post hoc explanations: A comprehensive framework for accountable AI in medical imaging through transparency, interpretability, and explainability, Bioengineering 12 (8) (2025) 879.
[52] M. Bayesh, S. Jahan, Embedding security awareness in IoT systems: A framework for providing change impact insights, Appl. Sci. 15 (14) (2025) 7871.