Computer Standards & Interfaces 97 (2026) 104086
Contents lists available at ScienceDirect
Computer Standards & Interfaces
journal homepage: www.elsevier.com/locate/csi
Graph-based interpretable dialogue sentiment analysis: A HybridBERT-LSTM
framework with semantic interaction explainer
Ercan Atagün a,*, Günay Temür b, Serdar Biroğul c,d
a Computer Engineering, Institute of Graduate Studies, Duzce University, Düzce, 81000, Turkey
b Kaynasli Vocational School, Duzce University, Düzce, 81000, Turkey
c Department of Computer Engineering, Faculty of Engineering, Duzce University, Düzce, 81000, Turkey
d Department of Electronics and Information Technologies, Faculty of Architecture and Engineering, Nakhchivan State University, Nakhchivan, Azerbaijan
ARTICLE INFO

Keywords:
Natural language processing
Explainable artificial intelligence
Word context graph explainer

ABSTRACT

Conversational sentiment analysis in natural language processing faces substantial challenges due to intricate contextual semantics and temporal dependencies within multi-turn dialogues. We present a novel HybridBERT-LSTM architecture that integrates BERT's contextualized embeddings with LSTM's sequential processing capabilities to enhance sentiment classification performance in dialogue scenarios. Our framework employs a dual-pooling mechanism to capture local semantic features and global discourse dependencies, addressing limitations of conventional approaches. Comprehensive evaluation on the IMDb benchmark and real-world dialogue datasets demonstrates that HybridBERT-LSTM consistently improves over standalone models (LSTM, BERT, CNN, SVM) across accuracy, precision, recall, and F1-score metrics. The architecture effectively exploits pre-trained contextual representations through bidirectional LSTM layers for temporal discourse modeling. We introduce WordContextGraphExplainer, a graph-theoretic interpretability framework addressing the limitations of conventional explanation methods. Unlike LIME's linear additivity assumptions, which treat features independently, our approach utilizes perturbation-based analysis to model non-linear semantic interactions. The framework generates semantic interaction graphs with nodes representing word contributions and edges encoding inter-word dependencies, visualizing contextual sentiment propagation patterns. Empirical analysis reveals LIME's inadequacies in capturing the temporal discourse dependencies and collaborative semantic interactions crucial for dialogue sentiment understanding. WordContextGraphExplainer explicitly models semantic interdependencies, negation scope, and temporal flow across conversational turns, enabling comprehensive understanding of both word-level contributions and contextual interaction influences on decision-making processes. This integrated framework establishes a new paradigm for interpretable dialogue sentiment analysis, advancing trustworthy AI through high-performance classification coupled with comprehensive explainability.
1. Introduction

Dialogue-based sentiment analysis constitutes a significant research domain within the field of natural language processing (NLP). This area of study represents a fundamental component of efforts to enhance human-machine interaction through more meaningful and emotion-centric approaches. Research endeavors in this field encompass numerous inherent challenges and complexities. Dialogues typically emerge from the reciprocal interactions among multiple conversational participants, where the scope of communicative content spans the breadth of human knowledge and experience. The emotional orientation of an utterance within a conversational sequence demonstrates substantial dependency upon preceding discourse and contextual cues. This phenomenon necessitates the development of context-aware models for sentiment analysis, as conventional text classification methodologies frequently fail to adequately capture such sequential continuity. The multi-speaker nature of dialogues introduces critical considerations regarding utterance attribution and the identification of emotional expression sources. Modeling sentiment transitions between conversational participants presents particular challenges, especially in scenarios where emotions are expressed through implicit mechanisms. Rather than explicit emotional declarations, human linguistic behavior frequently employs sophisticated rhetorical devices including irony, sarcasm, humor, double entendres, and cultural references, resulting in sentiment interpretations that diverge significantly from surface-level textual analysis. This phenomenon proves particularly problematic in
Corresponding author.
E-mail address: ercanatagun@duzce.edu.tr (E. Atagün).
https://doi.org/10.1016/j.csi.2025.104086
Received 7 June 2025; Received in revised form 7 October 2025; Accepted 13 October 2025
Available online 12 November 2025
0920-5489/© 2025 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
brief, context-independent utterances, substantially complicating sentiment analysis procedures. Contemporary dialogue-based sentiment analysis research faces significant constraints regarding the availability of high-quality, annotated datasets. Existing corpora are characterized by either limited scale or restriction to specific contextual domains such as cinematic dialogue or customer service interactions. Furthermore, the insufficient representation of cultural, linguistic, and social diversity within available datasets impedes the development of generalizable models with robust cross-domain applicability. Deep learning-based sentiment analysis architectures predominantly exhibit black-box characteristics, rendering their decision-making processes opaque to human interpretation. This limitation particularly diminishes model reliability in tasks where emotional interpretation involves inherent subjectivity, consequently necessitating human oversight in practical applications. In this study, a novel hybrid model is proposed, integrating BERT's contextualized representation capabilities with the sequential modeling proficiency of LSTM to address the inherent challenges of sentiment analysis in dialogue-based datasets. The architecture is specifically designed to capture both linguistic features and temporal dependencies embedded within conversational structures. To enhance the interpretability of model outputs, a graph-theoretic interpretability framework, termed WordContextGraphExplainer, is introduced. This framework overcomes the limitations of conventional explanation methods by modeling non-linear semantic interactions between lexical units. Through the construction of semantic interaction graphs, the approach facilitates comprehensive visualization of contextual sentiment propagation patterns, offering novel insights into the underlying decision-making mechanisms of the model and establishing a new paradigm for interpretable sentiment analysis in dialogue systems.

2. Related works

Sentiment analysis has gained significant traction in NLP research, driven by its pivotal role in enabling affective computing across domains such as human-computer interaction, intelligent customer support, and conversational AI systems. Recent advancements in the field have led to the development of a diverse array of methodologies, encompassing text-based approaches, multimodal frameworks, contextual modeling techniques, and sophisticated deep learning architectures. This section presents an overview of key contributions in the literature, with a particular emphasis on dialogue-based sentiment analysis, which plays a critical role in domains such as customer support, conversational AI, and empathetic dialogue systems. Song et al. [1] introduced a topic-aware sentiment analysis model for dialogue (CASA), aiming to identify sentiment orientations within conversational threads. Firdaus et al. [2] constructed the MEISD dataset, incorporating textual, audio, and visual data for multimodal sentiment analysis. Emphasizing the relevance of conversational context, Carvalho et al. [3] demonstrated that prior utterances significantly influence sentiment classification outcomes. Building upon this insight, topic-aware sentiment classification models have been proposed using multi-task learning strategies within customer service dialogues [4]. Real-time sentiment analysis in dialogue systems is also a critical consideration. Bertero et al. [5] developed a convolutional neural network capable of processing audio inputs for instantaneous emotion detection in interactive systems. Bothe et al. [6] presented a model to predict the sentiment of upcoming utterances, thereby analyzing emotional transitions throughout dialogue sequences. To address the limitations of unimodal text-based sentiment analysis, recent studies have adopted multimodal strategies by integrating text, speech, and visual signals. For instance, the EmoSen model [7] generates sentiment-aware responses using fused inputs from these modalities. Similarly, Mallol-Ragolta and Schuller [8] introduced a system that personalizes dialogue responses by estimating user emotions and arousal levels. Akbar et al. [9] proposed an innovative emotion-driven framework for video-based sentiment analysis in social media environments, further demonstrating the potential of multimodal affective understanding.

Graph-based modeling has also been incorporated into multimodal sentiment analysis. Zhao and Gao [10] proposed a semantically enriched heterogeneous dialogue graph network to analyze sentiment in multi-party conversations. Yang et al. [11] advanced sentiment accuracy through a model that jointly processes text, audio, and visual cues. Context-awareness is a pivotal factor in sentiment interpretation within dialogues. Carvalho et al. [3] emphasized the influence of preceding discourse on sentiment prediction. To enhance contextual coherence in generative AI dialogue systems, personalized dialogue summarization techniques have been employed [12]. Mustapha [13] proposed a model to analyze sentiment-cause relationships in stress-laden conversations, aiming to reveal emotional dynamics. Contextual memory mechanisms were further explored by Li et al. [14], who developed a bidirectional emotional recurrent unit (BiERU) to capture dynamic context shifts and their implications for sentiment detection. Explainability has gained increasing importance in sentiment analysis. A variety of approaches, including attention mechanisms, graph neural networks, and neuro-symbolic architectures, have been introduced to elucidate model decision-making. Poria et al. [15] discussed fundamental challenges in sentiment interpretation and underscored the role of explainability. Zhu et al. [16] developed a neuro-symbolic model for personalized sentiment analysis, incorporating user-specific contextual factors into the explanatory framework. Luo et al. [17] introduced the PanoSent dataset to improve the analysis of emotional shifts in interactive systems. In another direction, Zhang et al. proposed a novel interaction network inspired by quantum theory to reframe dialogue-based sentiment analysis [18]. Yang et al. [19] addressed the inadequacies of existing pre-trained models in capturing the logical structure of dialogues. To overcome these limitations, they proposed a new pre-training framework comprising utterance order modeling, sentence skeleton reconstruction, and sentiment shift detection, demonstrating improvements in learning emotion interactions and discourse coherence. Collectively, recent developments in sentiment analysis emphasize the significance of contextual awareness, multimodal data fusion, graph-based reasoning, and explainable AI techniques in enhancing performance and interpretability within dialogue-centric applications.

3. Materials and methods

The dialogue dataset comprises dyadic conversational exchanges between two distinct participants. Each dialogue instance is structured as a sequence of alternating utterances, where each turn is associated with a specific speaker and the corresponding textual content. The formal mathematical representation of the dialogue structure is given by:

𝒟 = {(s_i, t_i)}_{i=1}^{N}, s_i ∈ 𝒮 = {A, B}, t_i ∈ Σ*

Here, 𝒟 denotes the complete dialogue dataset composed of N conversational turns. Each pair (s_i, t_i) represents the i-th turn in the dialogue, where s_i is the speaker identifier and t_i is the corresponding utterance. The speaker set 𝒮 = {A, B} contains two participants, typically alternating in a turn-based structure. The term Σ represents the alphabet of the natural language in which the dialogue is conducted, and Σ* denotes the set of all finite-length strings (i.e., possible utterances) formed from this alphabet.
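As a concrete illustration (not part of the original implementation), the formalization above maps naturally onto a list of (speaker, utterance) pairs; names such as `is_well_formed` are ours, chosen only for this sketch:

```python
from typing import List, Tuple

# A dialogue D = {(s_i, t_i)} as an ordered list of (speaker, utterance) pairs.
Dialogue = List[Tuple[str, str]]

def is_well_formed(d: Dialogue, speakers=("A", "B")) -> bool:
    """Check s_i ∈ S = {A, B} and that consecutive turns alternate speakers."""
    if not d:
        return False
    if any(s not in speakers for s, _ in d):
        return False
    return all(d[i][0] != d[i + 1][0] for i in range(len(d) - 1))

dialogue: Dialogue = [
    ("A", "I loved the film we watched yesterday."),
    ("B", "Really? I found the ending disappointing."),
    ("A", "The ending was weak, but the acting carried it."),
]
print(is_well_formed(dialogue))  # True
```

The alternation check encodes the "typically alternating" dyadic structure; corpora with consecutive turns by the same speaker would simply relax that condition.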
3.1. Data preprocessing and word embedding

The successful training of natural language processing (NLP) models is highly dependent on the transformation of raw textual data into structured and semantically meaningful representations [20]. In this study, all textual inputs undergo a series of preprocessing operations designed to optimize them for subsequent modeling tasks. An initial and essential preprocessing step involves lowercasing, which standardizes textual input by mitigating case-sensitivity inconsistencies that would otherwise lead to redundant representations of semantically identical words. This step is particularly critical for ensuring the effectiveness and consistency of word embedding techniques. Given that parts of the dataset originate from web-based sources, residual HTML tags and encoded entities such as <br> and &nbsp; are present in the raw text. These components provide no linguistic or semantic value and may negatively affect model performance. Therefore, all HTML-related tokens and special characters are systematically removed in the preprocessing phase to reduce noise within the input space and to enhance the robustness of downstream NLP models. This comprehensive cleaning process is implemented using the Python NLTK and BeautifulSoup libraries combined with regular-expression patterns to ensure thorough removal of web-derived artifacts. Additionally, standard stopword removal is applied to eliminate semantically non-contributive terms. Notably, traditional morphological normalization techniques such as stemming and lemmatization are deliberately excluded from our preprocessing pipeline, as BERT's contextualized embedding framework inherently captures morphological variations and semantic relationships without requiring explicit normalization steps.

Following text normalization, each cleaned sentence is tokenized into subword or word-level units. These token sequences are then converted into dense numerical representations using word embedding techniques such as GloVe [21]. Embedding techniques project discrete textual units into continuous vector representations that encapsulate both semantic coherence and syntactic structure, thereby facilitating computational models in capturing lexical relatedness and contextual alignment within language data. Let the original, unprocessed dataset be represented [22] as:

T = {s_1, s_2, …, s_N}

where each sentence s_k is defined [23] as a sequence of M words:

s_k = {u_1, u_2, …, u_M}

To refine the input, special characters 𝒞, web-related entities 𝒲, and semantically non-contributive stopwords 𝒮 are eliminated. The cleaned sentence is thus defined by:

s_k′ = Clean(s_k) = {u_j ∈ s_k | u_j ∉ (𝒞 ∪ 𝒲 ∪ 𝒮)}

The sanitized sentence s_k′ is then tokenized:

s_k′ = {v_1, v_2, …, v_P}, v_i ∈ 𝒱

where 𝒱 denotes the vocabulary of all tokens in the dataset.

Word embeddings serve as a cornerstone for text classification, as they enable models to capture abstract semantic relationships while reducing the dimensionality of input features. Unlike traditional bag-of-words approaches, embeddings are resilient to linguistic variability such as synonymy and polysemy. For sentiment analysis tasks, embeddings can cluster words with similar affective connotations, thereby enhancing the model's ability to generalize and detect implicit sentiments. Likewise, in general classification tasks, embeddings help reveal thematic cohesion across texts, ultimately contributing to improved predictive performance. Nevertheless, conventional embeddings like Word2Vec or GloVe are context-independent, assigning the same vector representation to a word regardless of its usage context. This limitation is addressed by contextualized models such as BERT, which generate dynamic embeddings based on surrounding words using transformer-based architectures. Word embeddings bridge the gap between linguistic expressiveness and computational tractability and remain an indispensable component of modern NLP pipelines.

3.2. GloVe: Global vectors for word representation

GloVe [24] is a widely adopted word embedding technique designed to capture semantic and conceptual relationships between words, particularly in text classification tasks. It operates by constructing word vector representations through the optimization of global word co-occurrence statistics derived from large-scale corpora. Unlike local context-based models such as Word2Vec, GloVe incorporates both local and global contextual information, embedding lexical units into a dense, continuous vector space. In practical applications, GloVe embeddings are employed to convert unstructured input text into fixed-length numerical tensors, which serve as inputs to deep learning architectures such as CNN and LSTM models. This transformation enables the model to effectively distinguish between textual classes by capturing both syntactic patterns and latent semantic features. The key advantage of GloVe lies in its ability to unify global corpus-level statistical information with local context, producing more stable and semantically meaningful representations compared to models relying solely on window-based learning. However, it remains a static embedding technique; each word is assigned a single vector regardless of its context within a sentence. This context-independent nature limits its flexibility when compared to transformer-based models like BERT, which generate dynamic embeddings conditioned on the broader linguistic environment. Despite these limitations, GloVe continues to play a significant role in various NLP tasks such as text similarity, topic labeling, spam detection, and sentiment analysis, where modeling word-level semantics remains essential. Its computational simplicity and ease of integration make it a reliable baseline in many NLP pipelines. Recent studies [25] have highlighted the importance of consistent embedding strategies when comparing different NLP models, as variations in embedding approaches can significantly impact performance comparisons and lead to biased evaluations.

3.3. Support Vector Machine (SVM)

Support Vector Machine (SVM) [26] is a well-established supervised learning algorithm widely employed in text classification tasks, particularly due to its robustness in handling high-dimensional data representations. In natural language processing pipelines, textual inputs are typically transformed into numerical feature vectors using techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) or various word embedding models. Once converted, SVM operates by identifying the optimal hyperplane that best separates the data points into distinct class labels. The core principle of SVM lies in maximizing the margin between classes, thereby enhancing generalization performance. This is particularly advantageous in scenarios where the feature space exhibits high dimensionality and potential overlap between class distributions. Furthermore, SVM's ability to incorporate non-linear kernel functions such as polynomial or radial basis function (RBF) kernels enables it to capture complex, non-linear patterns within the data, which are often present in linguistically rich or semantically ambiguous textual inputs. Due to its mathematically grounded optimization framework and resistance to overfitting, SVM remains a competitive baseline in various text classification domains, including sentiment analysis, spam detection, and topic categorization. Its effectiveness is further enhanced when combined with appropriate feature engineering and dimensionality-reduction techniques, making it a viable choice for both small-scale and large-scale NLP applications.
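The cleaning steps described in Section 3.1 can be sketched as follows. The paper implements them with NLTK, BeautifulSoup, and regular expressions; this stand-in uses only the standard library and a tiny illustrative stopword list, so it approximates rather than reproduces the actual pipeline:

```python
import html
import re

# Small illustrative stopword list; the paper uses a full English stopword list (NLTK).
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "in"}

def clean(sentence: str) -> list:
    text = sentence.lower()               # case normalization
    text = html.unescape(text)            # decode entities: &amp; -> &, &nbsp; -> space
    text = re.sub(r"<[^>]+>", " ", text)  # strip residual HTML tags such as <br>
    tokens = re.findall(r"[a-z']+", text) # keep word-level tokens only
    return [t for t in tokens if t not in STOPWORDS]

print(clean("The acting<br>was &amp; is absolutely wonderful!"))
# ['acting', 'was', 'absolutely', 'wonderful']
```

Note that no stemming or lemmatization is applied, mirroring the paper's decision to leave morphological normalization to BERT's contextual embeddings.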
3.4. Convolutional Neural Networks (CNN)

Although originally developed for image recognition tasks, Convolutional Neural Networks (CNNs) have been extensively adapted for various natural language processing problems, particularly in multi-label text classification [27] and sentiment analysis [28], due to their capacity to capture local hierarchical patterns in sequential data. In text classification applications, CNNs operate on word embeddings by applying one-dimensional convolutional filters to detect local patterns such as n-grams or syntactic motifs. These filters perform element-wise multiplications followed by non-linear activation functions to generate feature maps that emphasize the most informative regions of the input sequence. A subsequent max-pooling operation reduces the dimensionality and retains the most salient features, thereby enabling the network to focus on contextually rich segments of text. This architecture allows CNNs to efficiently model contextual dependencies within fixed-size receptive fields, making them particularly suitable for tasks such as topic categorization, polarity detection, and aspect-based sentiment analysis. Compared to recurrent neural networks (RNNs), CNNs offer significant advantages in terms of computational efficiency and parallelizability, as they do not rely on sequential input processing. However, one notable limitation of CNNs is their reduced capacity to model long-range dependencies, which can affect performance in tasks involving lengthy or complex discourse structures.

3.5. Long Short-Term Memory Networks (LSTM)

LSTM networks, as a refined subclass of recurrent neural architectures, have demonstrated substantial effectiveness in text classification tasks due to their capacity to capture long-range dependencies and preserve semantically meaningful representations across sequential data inputs [29]. By incorporating internal memory units and a gated control mechanism comprising input, forget, and output gates, LSTM models effectively address the vanishing-gradient challenge that limits conventional RNNs. These gating components orchestrate information flow dynamically, facilitating the retention of salient features over prolonged contexts and ensuring the continuity of semantic interpretation throughout the sequence [30]. In text classification applications, LSTMs typically process input sequences encoded as dense word embeddings, allowing the network to learn hierarchical feature representations that encapsulate both syntactic structure and semantic meaning. This capacity to capture nuanced contextual relationships makes LSTM particularly effective in tasks such as sentiment analysis, text similarity, spam detection, and topic categorization, where subtle variations in word order and polarity significantly influence predictive accuracy. For instance, in sentiment classification, LSTM models can differentiate between expressions like "not good" and "extremely good" by maintaining a dynamic memory of temporal context throughout the sequence.

3.6. Bidirectional Encoder Representations from Transformers (BERT)

BERT is a transformer-based, pre-trained language model that has substantially advanced the state of the art in text classification tasks by capturing bidirectional contextual semantics through self-attention mechanisms [31]. Unlike unidirectional models such as LSTM or GRU, which process text sequentially, BERT encodes semantic dependencies from both left and right contexts simultaneously. This architecture enables nuanced disambiguation of polysemous words and more robust modeling of long-range dependencies in natural language [32]. In text classification applications, BERT is typically fine-tuned on task-specific labeled datasets. This involves appending a classification layer, often a dense layer with softmax activation, on top of the pre-trained BERT encoder. Through this transfer learning paradigm, BERT exhibits superior performance across a variety of NLP tasks including sentiment classification, aspect-based sentiment analysis, and multi-label classification, particularly in settings characterized by contextual ambiguity and hierarchical dependencies. However, BERT's practical deployment presents several challenges. Its high computational complexity, sensitivity to input sequence length, and the requirement for large volumes of labeled data during fine-tuning can pose significant barriers in real-world scenarios. To mitigate these limitations, hybrid architectures that integrate BERT with more lightweight modeling components have been proposed. These hybrid solutions aim to retain BERT's rich contextual understanding while improving computational efficiency and generalizability, making them more suitable for applications constrained by resources or latency requirements.

3.7. Local Interpretable Model-Agnostic Explanations (LIME)

LIME is a model-agnostic interpretability framework designed to provide localized explanations for the predictions of complex machine learning models. Positioned within the broader field of Explainable Artificial Intelligence (XAI), LIME serves to enhance the interpretability of opaque black-box systems, particularly in high-stakes domains where transparency and trust are critical [33]. LIME's main goal is to provide a straightforward, interpretable surrogate model that, within the local neighborhood of a particular instance, roughly represents the original model's decision boundary [34]. LIME accomplishes this by perturbing the original input to generate a set of synthetic samples close to the target instance. The black-box model is then applied to these altered examples to derive the corresponding predictions. These cases are then subjected to a locality-sensitive weighting function, and the decision function is locally approximated by training a sparse linear model on the weighted dataset. The contribution of each feature to the final prediction is inferred from the surrogate model's resulting coefficients. One of the key strengths of LIME lies in its model-agnostic design, allowing it to be applied across a wide range of machine learning algorithms, including ensemble methods, deep neural networks, and support vector machines. It offers human-understandable explanations while maintaining local fidelity to the original model. As such, LIME is widely adopted for increasing decision transparency and enabling human-AI collaboration, particularly in sensitive applications such as healthcare diagnostics, financial risk assessment, and legal reasoning.

3.8. WordContextGraphExplainer

The exponential growth in transformer-based natural language processing (NLP) architectures has created an unprecedented demand for interpretability frameworks capable of elucidating the complex decision-making processes underlying these black-box models. While widely adopted XAI techniques such as LIME (Local Interpretable Model-Agnostic Explanations) and SHAP (SHapley Additive exPlanations) offer valuable insights through feature attribution, they inherently rely on linear additivity assumptions among input features. This assumption falls short in capturing the intricate semantic dependencies and non-linear interactions that characterize deep language understanding. A fundamental limitation of existing approaches lies in their inability to model contextual interdependencies between words, relationships that are crucial for interpreting sentiment propagation, negation scope, and semantic coherence in complex linguistic structures. Traditional token-level attribution methods treat individual words as independent contributors, failing to account for the synergistic effects that emerge from word pairings and contextual associations in the semantic space. In this paper, WordContextGraphExplainer is introduced as a novel graph-theoretic interpretability framework developed to enhance the transparency of transformer-based sentiment classification systems. The methodology is built upon a systematic perturbation-analysis paradigm, in which masked language modeling is employed to estimate both individual lexical contributions and pairwise semantic interactions. In contrast to linear attribution methods, this approach explicitly models non-linear dependencies by quantifying the divergence between observed joint effects and the expected additive influence of word pairs. At the core of the framework is the construction of a semantic interaction graph, where nodes represent individual words annotated with their relative sentiment contributions, and edges encode the magnitude and directionality of inter-word dependencies. This graph-based representation facilitates intuitive visualization of complex linguistic relationships through NetworkX-based layouts, enabling deeper insight into how contextual factors influence model predictions.
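The single-word half of this perturbation analysis can be illustrated with a toy scorer standing in for the transformer M (the lexicon, function names, and scores below are illustrative only, not the paper's implementation):

```python
from typing import Dict, List

# Toy sentiment "model": a lexicon score, a stand-in for the real transformer M.
LEXICON = {"wonderful": 2.0, "good": 1.0, "weak": -1.0, "terrible": -2.0}

def score(tokens: List[str]) -> float:
    return sum(LEXICON.get(t, 0.0) for t in tokens)

def word_effects(tokens: List[str]) -> Dict[str, float]:
    """effect(i) = P0 - P(masked_i): score drop when word i is replaced by [MASK].

    Keys are tokens, so this sketch assumes each token occurs once.
    """
    p0 = score(tokens)
    effects = {}
    for i, w in enumerate(tokens):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        effects[w] = p0 - score(masked)
    return effects

print(word_effects(["the", "acting", "was", "wonderful"]))
# {'the': 0.0, 'acting': 0.0, 'was': 0.0, 'wonderful': 2.0}
```

Here only the affect-bearing word receives a non-zero contribution; with an actual masked language model, the effect would be the shift in the predicted class probability rather than a lexicon sum.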
The framework demonstrates particular efficacy in sentiment analysis tasks where nuanced interactions between affective indicators, negation patterns, and contextual modifiers significantly impact interpretive accuracy. By providing interpretable visualizations of semantic interaction networks, WordContextGraphExplainer supports advanced model debugging, bias detection, and clinical decision support in sensitive domains such as mental health assessment and medical text analytics. Moreover, the framework incorporates a top-k interaction filtering mechanism, ensuring computational scalability while preserving the granularity required for interpretable analysis in high-stakes applications. This methodological advancement represents a critical step toward the development of trustworthy AI systems that combine linguistic reasoning with transparent explanatory capabilities, offering a robust foundation for real-world deployment.

Algorithm 1: WordContextGraphExplainer Method
Input: Text T, transformer model M, tokenizer τ, feature number k ≥ 1, device d.
Output: Word context graph G with semantic interactions.
1: Compute baseline prediction P_0 = M(T).
2: Compute predicted_class = arg max(P_0).
3: Initialize W = τ(T), word_effects = ∅, interactions = ∅.
4: for each w_i ∈ W do
5:   T_masked = replace(T, w_i, [MASK])
6:   P_masked = M(T_masked)
7:   word_effects[i] = P_0 - P_masked
8: end for
9: for each (w_i, w_j) ∈ combinations(W, 2) do
10:  T_pair = replace(T, [w_i, w_j], [MASK])
11:  P_pair = M(T_pair)
12:  actual_effect = P_0 - P_pair
13:  expected_effect = word_effects[i] + word_effects[j]
14:  interaction_ij = actual_effect - expected_effect
15:  interactions[(w_i, w_j)] = ‖interaction_ij‖_2
16: end for
17: Sort interactions by magnitude in descending order.

While BERT is highly effective at capturing contextual semantics, its self-attention mechanism may not fully exploit the sequential dependencies within dialogue utterances. To mitigate this limitation, bidirectional LSTM layers are incorporated to model temporal patterns and discourse-level relationships across token sequences. These layers are adept at retaining long-range dependencies and recognizing sentiment transitions across multi-turn dialogue. By integrating these two components, the proposed HybridBERT-LSTM architecture achieves a richer understanding of both the global context and the local structure of textual data, enhancing its capability to discern sentiment in complex conversational scenarios. This dual modeling approach positions the framework as a robust solution for sentiment classification tasks, particularly in dialogue-rich environments where contextual flow and temporal coherence are paramount.

3.9. Model architecture

The proposed model processes input text through a series of transformation stages, mathematically formalized as follows. Given an input sequence

X = {x_1, x_2, …, x_n}, where n ≤ 256,

the BERT encoder maps each token x_i to a contextualized embedding, producing a sequence of hidden states:

H = BERT(X) ∈ ℝ^{n × d_BERT}

where d_BERT = 768 represents the dimensionality of BERT's contextual embeddings. The sequence H is passed to a 3-layer bidirectional LSTM network to capture temporal dependencies beyond what is modeled by self-attention:

h⃗_t = LSTM_forward(H_t, h⃗_{t-1}),  h⃖_t = LSTM_backward(H_t, h⃖_{t+1})

The final representation for each token is obtained by concatenating the forward and backward hidden states:

h_t^LSTM = [h⃗_t ; h⃖_t] ∈ ℝ^{2·d_LSTM}

with d_LSTM = 256, resulting in a 512-dimensional output per token. To obtain a fixed-length vector representation of the sequence, both
18: 𝑡𝑜𝑝_𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠 = 𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠[ 𝑘]
average and maximum pooling operations are applied:
19: Construct graph 𝐺 = (𝑉 , 𝐸) where 𝑉 = 𝑊 and 𝐸 =
1 ∑ LSTM
𝑛
𝑡𝑜𝑝_𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠
avg = , max = max LSTM
20: Compute layout positions using 𝑜𝑟𝑔𝑎𝑛𝑖𝑧𝑒𝑑_𝑙𝑎𝑦𝑜𝑢𝑡(𝑊 , 𝑛 𝑖=1 𝑖 𝑖
1≤𝑖𝑛
𝑡𝑜𝑝_𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠)
21: Visualize 𝐺 with NetworkX rendering and semantic color These vectors are concatenated to form the final sequence representa-
coding tion:
22: Return 𝐺 combined = [avg ; max ] ∈ R4𝑑LSTM = R1024
In this study, a hybrid architecture is proposed that integrates a Feed-forward classification
pre-trained BERT model with a bidirectional Long Short-Term Memory
(BiLSTM) network to address the task of sentiment classification. The The combined representation is passed through a feed-forward neu-
model processes textual input to generate sentiment label predictions, ral network with dropout regularization:
effectively capturing both semantic context and temporal structure
𝑧1 = Dropout0.3 (combined )
inherent in natural language. Grounded in a transformer-based archi-
tecture, the system accepts input sequences of up to 256 tokens, apply- followed by a two-layer multilayer perceptron (MLP) with ReLU acti-
ing appropriate padding and truncation mechanisms when necessary vation and softmax output for multi-class classification.
to standardize input lengths. The HybridBERT-LSTM model embodies This is followed by a two-layer MLP classifier, using a ReLU acti-
a synergistic design that leverages the complementary strengths of vation and softmax output for multi-class prediction. The HybridBERT-
transformer-based language models and recurrent neural networks. LSTM architecture integrates the strengths of transformer-based con-
This hybrid framework is explicitly engineered to address two crit- textual modeling with the sequential learning capabilities of recurrent
ical aspects of sentiment analysis: contextual representation and se- neural networks. While BERT excels in capturing bidirectional semantic
quential modeling. Contextual Representation: The BERT encoder, pre- context via self-attention, the inclusion of bidirectional LSTM layers
trained on large-scale corpora, produces deep contextualized embed- enhances the models ability to capture sequential dependencies and
dings by employing multi-head self-attention mechanisms. These em- emotional transitions throughout dialogue sequences. The dual pooling
beddings capture nuanced semantic and syntactic information, enabling strategy(average and max pooling) provides a comprehensive summary
the model to differentiate between polysemous expressions and context- of the sequence. Average pooling captures the overall sentiment distri-
dependent sentiment cues. Sequential Modeling: While BERT excels at bution across the sequence, whereas max pooling emphasizes salient
emotional cues. This duality enriches the feature space and contributes to more robust classification. Furthermore, hierarchical feature abstraction is enabled by stacking multiple LSTM layers, allowing the model to learn long-range patterns more effectively than shallow RNN structures. Dropout layers, strategically placed after pooling (with a rate of 0.3) and within the classifier (rate 0.2), serve as regularization mechanisms to prevent overfitting, especially during fine-tuning on task-specific datasets. The model is trained using the AdamW optimizer with a learning rate of 2 × 10^-5, and the cross-entropy loss function is employed as the objective. Performance evaluation is conducted using standard metrics including accuracy, precision, recall, and F1-score, ensuring comprehensive validation of the model's classification capability. The model integrates a pre-trained BERT encoder for capturing deep contextual embeddings from input text sequences, followed by a multi-layer bidirectional LSTM network that models sequential dependencies across tokens. To derive a robust sentence-level representation, dual pooling operations (average and maximum pooling) are applied to the LSTM outputs. The concatenated feature vector is then passed through a fully connected neural network with dropout regularization, culminating in a softmax classifier for multi-class sentiment prediction. This hybrid architecture is designed to jointly leverage the representational richness of transformer encoders and the temporal modeling strength of recurrent networks, effectively addressing both local semantics and discourse-level sentiment dynamics within multi-turn dialogues.

The computational overhead of HybridBERT-LSTM represents a critical consideration for practical deployment, particularly in real-time applications such as conversational AI systems. The theoretical complexity of the proposed architecture can be decomposed into its constituent components to understand the computational requirements. The BERT component contributes O(n^2 × d_BERT) = O(n^2 × 768) complexity due to the quadratic scaling of the self-attention mechanism, where n represents the sequence length and d_BERT denotes the BERT embedding dimension. The subsequent 3-layer BiLSTM processing adds O(3 × n × d_LSTM^2) = O(3 × n × 256^2) complexity, where d_LSTM represents the LSTM hidden dimension. Consequently, the overall HybridBERT-LSTM complexity is O(n^2 × 768 + 3n × 65,536). This represents a significant computational increase compared to standalone BERT (O(n^2 × 768)) or LSTM models (O(n × d_LSTM^2)), which may limit deployment in latency-sensitive applications. However, the empirical results demonstrate that the performance gains justify this additional overhead in scenarios where accuracy is prioritized over computational efficiency.

Table 1
HybridBERT-LSTM Model Parameters.
Parameter name             Parameter value
Model architecture         BERT encoder + BiLSTM + MLP
Base model                 google-bert/bert-base-uncased
Tokenizer                  google-bert/bert-base-uncased
Maximum sequence length    256
LSTM layer                 6
Batch size                 32
Number of epochs           5
Learning rate              0.00002
Optimization algorithm     AdamW
Loss function              CrossEntropyLoss
LSTM latent size           256
Pooling                    avg + max pooling
MLP layer                  Linear(1024→128) → ReLU → Linear(128→n_classes)
Dropout rates              0.3

Table 2
BERT Model Parameters.
Parameter name             Parameter value
Base model                 google-bert/bert-base-uncased
Tokenizer                  google-bert/bert-base-uncased
Input length               128
Batch size                 16
Number of epochs           5
Learning rate              0.00002
Loss function              BertForSequenceClassification Cross-Entropy
Optimization algorithm     AdamW

Table 3
LSTM Model Parameters.
Parameter name             Parameter value
Embedding type             GloVe
Embedding size             100
Maximum number of words    5000
LSTM layer number          6
LSTM unit number           128/256
Dropout rate               0.5
Output layer (Dense)       Softmax
Optimization algorithm     Adam
Loss function              Sparse Categorical Crossentropy
Epoch number               50
Batch size                 32
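Before turning to the experiments, the masking-and-interaction procedure of Algorithm 1 can be sketched in plain Python. This is a minimal illustration rather than the paper's implementation: a stub model callable (token list → class probabilities) stands in for the fine-tuned classifier, whitespace splitting stands in for the BERT tokenizer, and the organized_layout and color-coding steps are omitted.

```python
from itertools import combinations

def word_context_edges(text, model, k=3, mask="[MASK]"):
    """Sketch of Algorithm 1: score pairwise word interactions by masking.

    A pair's interaction is the L2 norm of (joint masking effect) minus
    (sum of the two single-word masking effects); the top-k pairs become
    the edges of the word context graph.
    """
    words = text.split()          # stand-in for the real tokenizer
    p0 = model(words)             # baseline prediction P_0

    def masked(idxs):
        return [mask if i in idxs else w for i, w in enumerate(words)]

    def diff(p, q):
        return [a - b for a, b in zip(p, q)]

    def l2(v):
        return sum(x * x for x in v) ** 0.5

    # Single-word effects: P_0 - P_masked (kept as vectors)
    effect = {i: diff(p0, model(masked({i}))) for i in range(len(words))}

    # Pairwise interactions: actual joint effect vs. sum of single effects
    scores = {}
    for i, j in combinations(range(len(words)), 2):
        actual = diff(p0, model(masked({i, j})))
        expected = [a + b for a, b in zip(effect[i], effect[j])]
        scores[(i, j)] = l2(diff(actual, expected))

    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Edges of the word context graph: (word_i, word_j, interaction weight)
    return [(words[i], words[j], w) for (i, j), w in top]
```

The returned weighted triples can then be handed to NetworkX via G.add_weighted_edges_from(edges) for the visualization step.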
4. Experimental results

This section presents the configurations of the models utilized in the experiments, detailing the corresponding hyperparameters and implementation settings. The objective is to ensure reproducibility and provide a comprehensive understanding of the experimental setup.

4.1. Model hyperparameters

The deep learning models were trained using a variety of hyperparameter configurations tailored to the architecture and task requirements. These configurations include parameters such as learning rate, batch size, maximum input sequence length, number of training epochs, optimizer type, and loss function. Additionally, architecture-specific settings such as the number of LSTM layers, dropout rates, and hidden state dimensions are systematically defined. For models utilizing pre-trained components (e.g., BERT), both the base model and tokenizer versions are explicitly specified. The subsequent tables summarize the detailed parameter values for each model employed in this study, including HybridBERT-LSTM, BERT-only, LSTM, CNN, and SVM-based classifiers.

The parameter values of the model developed in this study are detailed in Table 1. The parameters used for the BERT model employed in this study are presented in Table 2. The parameter configurations utilized in the LSTM-based model developed for this study are detailed in Table 3. The parameter configurations utilized in the CNN model developed for this study are detailed in Table 4. Table 5 summarizes the parameter values defined for the SVM model.

Table 6 presents a comparative evaluation of various machine learning and deep learning models in the context of sentiment analysis on the widely adopted IMDB dataset. Among the examined methods, the proposed HybridBERT-LSTM architecture achieved the highest accuracy rate of 98.14%, demonstrating a substantial improvement over other baseline models included in the analysis. This notable enhancement underscores the effectiveness of combining contextual embeddings from BERT with the sequential modeling capabilities of LSTM. The IMDB dataset was selected for evaluation due to its extensive usage and established credibility in the sentiment analysis literature, serving as a robust benchmark for comparative performance assessment.

4.2. Statistical significance testing

In order to determine whether the observed differences in model performance metrics [44] were statistically significant, we employed
the Welch two-sample t-test, which is widely recommended when comparing two groups with potentially unequal variances and sample sizes. This approach provides a robust test of mean differences without assuming homogeneity of variances, which is particularly important in machine learning experiments where stochastic training procedures may lead to heterogeneous variability across models.

Let x̄_1 and x̄_2 denote the sample means of the two models being compared, s_1 and s_2 the corresponding standard deviations, and n_1 and n_2 the number of independent runs. The Welch t-statistic is defined as:

t = (x̄_1 - x̄_2) / sqrt(s_1^2/n_1 + s_2^2/n_2)

The approximate degrees of freedom (df) for this test are calculated according to the Welch-Satterthwaite equation:

df = (s_1^2/n_1 + s_2^2/n_2)^2 / [ (s_1^2/n_1)^2/(n_1 - 1) + (s_2^2/n_2)^2/(n_2 - 1) ]

Given the test statistic and degrees of freedom, the p-value is obtained by evaluating the probability of observing a difference as extreme as, or more extreme than, the measured difference under the null hypothesis (H_0) that the two models exhibit equal mean performance. Since our interest lies in detecting differences in either direction, a two-tailed test is used:

p = 2 × P(T ≥ |t|),

where T follows the Student's t-distribution with df degrees of freedom.

If p < 0.05, the difference is considered statistically significant, indicating strong evidence against the null hypothesis. In this case, we conclude that one model outperforms the other beyond what would be expected by random variation. If p ≥ 0.05, the difference is considered not statistically significant, implying that the observed discrepancy may reasonably be attributed to experimental variability.

In addition to reporting p-values, effect sizes (Cohen's d) were also computed to quantify the magnitude of the observed differences. While statistical significance indicates whether a difference is unlikely to be due to chance, effect size provides a measure of its practical relevance. Together, these statistics provide a comprehensive assessment of the comparative performance of the evaluated models.

Table 4
CNN Model Parameters.
Parameter name             Parameter value
Embedding type             GloVe
Embedding size             100
Maximum number of words    5000
Input layer                Embedding(input_dim=5000, output_dim=100)
Number of Conv1D layers    6
Number of Conv1D filters   128
Kernel size                5
Activation function        ReLU
Padding                    Same
Pooling                    MaxPooling1D (pool_size=2)
Dropout rate               0.5
Global pooling             GlobalMaxPooling1D
Output layer (Dense)       Softmax
Loss function              sparse_categorical_crossentropy
Optimization algorithm     Adam
Evaluation metric          Accuracy
Number of epochs           50
Batch size                 32

Table 5
SVM Model Parameters.
Parameter name             Parameter value
Embedding type             GloVe
Embedding size             100
Maximum number of words    5000
SVM Kernel                 Linear

Table 6
IMDB Dataset Accuracy Comparison.
Reference        Method                            Accuracy
[35]             LSTM                              83.7%
[36]             CNN+LSTM                          96.01%
[37]             LSTM+RNN                          92.00%
[38]             BERT                              93.97%
[39]             A hybrid approach                 95.6%
[40]             HOMOCHAR                          95.91%
[41]             Textual Emotion Analysis (TEA)    93%
[42]             Lexical + Adversarial attacks     85%
[43]             Logistic Regression               89.42%
Proposed Model   HybridBERT-LSTM                   98.14%

4.3. Experimental results on datasets

Dataset 1 consists of question-answer pairs collected from two independent online counseling and psychotherapy platforms [45]. The user-generated questions span a wide range of topics related to mental health, including emotional well-being, interpersonal issues, and psychological disorders. Each response was authored by licensed psychologists, ensuring both clinical relevance and linguistic reliability. In total, the dataset comprises 7,025 dialogue instances.

Tables 7 and 8 present the training and testing performances, respectively, of five distinct models (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) evaluated on Dataset 1. The models were assessed using standard classification metrics including accuracy, precision, recall, and F1-score, providing a comparative analysis of both their internal consistency and generalizability.

An ablation study [46] is a systematic experimental methodology used to evaluate the individual contributions of specific model components by selectively removing or modifying them while keeping other factors constant. This approach provides empirical evidence for the importance of particular architectural elements in determining the model's overall performance. To rigorously assess whether HybridBERT-LSTM's performance gains arise from architectural design rather than mere parameter expansion, we conducted a comprehensive ablation study with parameter-matched baselines. Six model variants were constructed: (1) BERT-Only baseline using the [CLS] token for classification, (2) BERT-ParamMatched with additional dense layers matching the BiLSTM parameter count, (3) BERT+UniLSTM with a unidirectional LSTM, (4) BERT+BiLSTM-NoPooling without dual pooling, (5) BERT+BiLSTM with frozen BERT isolating pure LSTM contribution, and (6) HybridBERT-LSTM (Full) incorporating all proposed components.

When Table 9 is examined, which shows the ablation test for Dataset 1, the BERT-ParamMatched model achieves an accuracy of 95.35% ± 0.38% despite having an equivalent number of parameters to the full model, whereas HybridBERT-LSTM attains 95.94% ± 0.15%. The hierarchical performance degradation across ablation variants reveals the marginal contribution of each component: dual pooling adds +0.19% (95.94% vs. 95.75%), bidirectionality contributes +0.17% (95.75% vs. 95.58%), and the sequential LSTM architecture over feedforward MLP layers provides +0.23% (95.58% vs. 95.35%). The frozen BERT experiment (91.80% ± 0.65%) isolates critical insights regarding representation quality versus fine-tuning contributions. As shown in Table 9, the ablation study on Dataset 1 systematically confirms that HybridBERT-LSTM's performance advantage arises from its architectural design rather than from parameter count inflation.
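The quantities used in Section 4.2 (the Welch t statistic, the Welch-Satterthwaite degrees of freedom, and Cohen's d) can be computed directly from two sets of run-level scores. A sketch in plain Python follows; the pooled-standard-deviation form of Cohen's d is one common choice, and the final p-value lookup, which requires a t-distribution survival function such as scipy.stats.t.sf, is omitted.

```python
import math

def welch_t_and_effect(x, y):
    """Welch's t statistic, Welch-Satterthwaite df, and Cohen's d
    for two independent samples of per-run metric values."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((a - m1) ** 2 for a in x) / (n1 - 1)  # sample variances
    v2 = sum((b - m2) ** 2 for b in y) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximate degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    # Cohen's d with a pooled standard deviation (effect size)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    return t, df, d
```

For example, comparing two hypothetical sets of five per-run test accuracies, welch_t_and_effect([0.959, 0.960, 0.958, 0.961, 0.959], [0.951, 0.954, 0.949, 0.952, 0.950]) yields a large positive t and d; the two-tailed p-value is then p = 2 × P(T ≥ |t|) under the t-distribution with the returned df.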
Table 7
Training Performance Metrics for Dataset 1.
Method Accuracy ± std Precision ± std Recall ± std F1 ± std
HybridBERT-LSTM 0.9872 ± 0.0029 0.9871 ± 0.0028 0.9872 ± 0.0029 0.9871 ± 0.0029
BERT 0.9806 ± 0.0063 0.9805 ± 0.0057 0.9806 ± 0.0063 0.9805 ± 0.0062
LSTM 0.9829 ± 0.0162 0.9829 ± 0.0163 0.9829 ± 0.0162 0.9827 ± 0.0175
CNN 0.9862 ± 0.0190 0.9829 ± 0.0199 0.9862 ± 0.0190 0.9829 ± 0.0202
SVM 0.8247 ± 0.0073 0.8274 ± 0.0067 0.8247 ± 0.0073 0.8235 ± 0.0071
Table 8
Test Performance Metrics for Dataset 1.
Method Accuracy ± std Precision ± std Recall ± std F1 ± std
HybridBERT-LSTM 0.9594 ± 0.0015 0.9596 ± 0.0017 0.9594 ± 0.0015 0.9592 ± 0.0016
BERT 0.9516 ± 0.0040 0.9515 ± 0.0041 0.9516 ± 0.0044 0.9514 ± 0.0045
LSTM 0.9245 ± 0.0152 0.9257 ± 0.0163 0.9245 ± 0.0152 0.9239 ± 0.0165
CNN 0.9195 ± 0.0171 0.9200 ± 0.0170 0.9195 ± 0.0171 0.9192 ± 0.0125
SVM 0.8078 ± 0.0026 0.8118 ± 0.0025 0.8078 ± 0.0026 0.8058 ± 0.0031
Table 9
Ablation Performance Metrics for Dataset 1.
Model                     Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)      0.9180 ± 0.0065    0.9165 ± 0.0068
BERT-Only (Baseline)      0.9516 ± 0.0040    0.9512 ± 0.0042
BERT-ParamMatched         0.9535 ± 0.0038    0.9531 ± 0.0040
BERT+UniLSTM              0.9558 ± 0.0028    0.9555 ± 0.0030
BERT+BiLSTM-NoPooling     0.9575 ± 0.0022    0.9573 ± 0.0024
HybridBERT-LSTM (Full)    0.9594 ± 0.0015    0.9592 ± 0.0016

When the results are evaluated over five repeated experiments, the HybridBERT-LSTM model not only outperforms the other methods in terms of accuracy, precision, recall, and F1-score, but also demonstrates a high degree of stability, as reflected by its very low standard deviations (≈ 0.0015–0.0017). This indicates that the model provides not just superior performance but also reproducible results across runs.

While the BERT model follows as the second-best performer, its higher variance (≈ 0.004) highlights less consistent outcomes compared to HybridBERT-LSTM. Statistical testing (e.g., paired t-tests) confirms that the observed performance difference between HybridBERT-LSTM and BERT, though relatively small, is statistically significant (p < 0.05). In contrast, the performance gaps between HybridBERT-LSTM and weaker models such as LSTM, CNN, and particularly SVM are much larger. Pairwise comparisons reveal p-values well below 0.01, strongly supporting the conclusion that HybridBERT-LSTM's superiority is not due to random chance but reflects a genuine performance advantage. HybridBERT-LSTM vs. BERT: smaller margin, but statistically significant (p < 0.05). HybridBERT-LSTM vs. LSTM/CNN/SVM: substantial margin, highly significant (p << 0.01).

Among the evaluated approaches, the HybridBERT-LSTM architecture consistently demonstrated superior performance during both training and testing phases, achieving remarkably high scores across all metrics. Specifically, it attained 98.72% accuracy and 98.72% F1-score on the training set, outperforming all other models. BERT, LSTM, and CNN also exhibited strong training performance, each surpassing 98% accuracy and F1-scores, indicating their efficacy on seen data.

In the testing phase, HybridBERT-LSTM maintained its leading position by achieving the highest test accuracy (95.94%) and F1-score (95.92%), affirming its robustness and generalization capability. In contrast, the CNN model experienced a notable performance drop from training to testing (accuracy falling from above 98% to 91.95% and F1-score to 91.92%), suggesting a tendency toward overfitting. Similarly, the LSTM model, despite achieving 98.29% accuracy in training, saw its performance decline to 92.45% accuracy during testing, reflecting reduced generalization.

Another critical observation is related to the SVM model, which exhibited the lowest performance across both training and test sets. With a training accuracy of 82.47% and a further decline to 80.78% in testing, the model's limited learning and generalization capacity became evident. These findings collectively indicate that SVM lags behind deep learning-based methods in terms of both modeling complexity and adaptability to sequential linguistic features inherent in dialogue-based sentiment classification tasks.

Dataset 2 comprises conversational exchanges derived from everyday spoken English interactions [47]. It consists of a total of 7,450 dialogue samples, structured in a question-answer format. The training and testing performances of five different classification methods (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) on Dataset 2 are presented in Tables 10 and 11, respectively. Among these, the HybridBERT-LSTM model achieved the highest performance on the training set, reaching an accuracy of 99.11% and an F1-score of 99.11%, thereby slightly outperforming the other methods. The BERT and CNN models also demonstrated high effectiveness, achieving accuracies of 98.95% and 98.21%, respectively. These three models exhibited strong alignment with the training data across all evaluation metrics, including accuracy, precision, recall, and F1-score.

When Table 12 is examined, which shows the ablation test for Dataset 2, the BERT-ParamMatched model achieves an accuracy of 97.92% ± 0.35% despite having an equivalent number of parameters, whereas HybridBERT-LSTM attains 98.32% ± 1.06%, reflecting a 0.40 percentage-point improvement. Component-wise analysis further indicates that dual pooling contributes +0.13% (98.32% vs. 98.19%), bidirectionality adds +0.13% (98.19% vs. 98.06%), and the sequential LSTM architecture over MLP layers provides an additional +0.14% (98.06% vs. 97.92%).

Based on the evaluation of five repeated experiments, the HybridBERT-LSTM model achieved the highest accuracy, precision, recall, and F1-scores on both the training and test sets. It stood out with an accuracy of 99.11% in training and reached 98.32% accuracy on the test set. The consistently low standard deviations (≈ 0.0106–0.0126) indicate that the model not only delivers high performance but also produces stable results.

BERT followed HybridBERT-LSTM and provided similarly strong results. However, its slightly lower standard deviations suggest that it yielded more consistent outcomes in some metrics. Although the performance gap between the two models appears small, pairwise t-test results show that the p-values are mostly below 0.05. Therefore, the difference between HybridBERT-LSTM and BERT is statistically significant.

In comparisons with the lower-performing models (LSTM, CNN, and SVM), the p-values were found to be far below 0.01. This demonstrates that HybridBERT-LSTM significantly and strongly outperforms these models. In particular, LSTM's high variance in training (std ≈ 0.0380) indicates unstable learning behavior.

In conclusion, HybridBERT-LSTM not only achieved the highest scores but also delivered stable and reproducible results.
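The component-wise contributions quoted in these ablation discussions are simple successive differences of the ablation-table accuracies. A sketch using the Dataset 1 values from Table 9 (in percent):

```python
# Successive differences of ablation accuracies (Table 9, Dataset 1, in %).
acc = {
    "BERT-Only (Baseline)":   95.16,
    "BERT-ParamMatched":      95.35,
    "BERT+UniLSTM":           95.58,
    "BERT+BiLSTM-NoPooling":  95.75,
    "HybridBERT-LSTM (Full)": 95.94,
}
names = list(acc)
marginal = {names[i + 1]: round(acc[names[i + 1]] - acc[names[i]], 2)
            for i in range(len(names) - 1)}
print(marginal)
```

Here the BERT+UniLSTM entry reproduces the +0.23% credited to the sequential LSTM over MLP layers, the NoPooling entry the +0.17% for bidirectionality, and the Full entry the +0.19% for dual pooling.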
Table 10
Training Performance Metrics for Dataset 2.
Method Accuracy ± std Precision ± std Recall ± std F1 ± std
HybridBERT-LSTM 0.9911 ± 0.0111 0.9911 ± 0.0126 0.9911 ± 0.0111 0.9911 ± 0.0111
BERT 0.9895 ± 0.0093 0.9896 ± 0.0094 0.9895 ± 0.0093 0.9895 ± 0.0093
LSTM 0.7270 ± 0.0380 0.7189 ± 0.0370 0.7175 ± 0.0380 0.7278 ± 0.0380
CNN 0.9821 ± 0.0176 0.9826 ± 0.0176 0.9921 ± 0.0179 0.9822 ± 0.0176
SVM 0.7785 ± 0.0518 0.7711 ± 0.0524 0.7785 ± 0.0518 0.7638 ± 0.0525
Table 11
Test Performance Metrics for Dataset 2.
Method Accuracy ± std Precision ± std Recall ± std F1 ± std
HybridBERT-LSTM 0.9832 ± 0.0106 0.9834 ± 0.0108 0.9832 ± 0.0106 0.9833 ± 0.0106
BERT 0.9779 ± 0.0038 0.9783 ± 0.0039 0.9779 ± 0.0038 0.9780 ± 0.0038
LSTM 0.7075 ± 0.0199 0.7089 ± 0.0178 0.7075 ± 0.0199 0.7078 ± 0.0199
CNN 0.9718 ± 0.0102 0.9725 ± 0.0104 0.9718 ± 0.0102 0.9720 ± 0.0112
SVM 0.7537 ± 0.0044 0.7491 ± 0.0045 0.7537 ± 0.0044 0.7277 ± 0.0045
Table 12
Ablation Performance Metrics for Dataset 2.
Model                     Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)      0.9425 ± 0.0152    0.9418 ± 0.0155
BERT-Only (Baseline)      0.9779 ± 0.0038    0.9780 ± 0.0038
BERT-ParamMatched         0.9792 ± 0.0035    0.9793 ± 0.0035
BERT+UniLSTM              0.9806 ± 0.0028    0.9807 ± 0.0028
BERT+BiLSTM-NoPooling     0.9819 ± 0.0022    0.9820 ± 0.0022
HybridBERT-LSTM (Full)    0.9832 ± 0.0106    0.9833 ± 0.0106

In contrast, LSTM and SVM yielded significantly lower performance, with training accuracies of 72.70% and 77.85%, respectively. Particularly, the low F1-score of 76.38% for SVM indicates inadequate classification consistency and stability. When evaluated on the test set, the overall performance ranking remained largely consistent with that observed during training. HybridBERT-LSTM and BERT maintained their superior performance, achieving test accuracies of 98.32% and 97.79%, respectively. The CNN model followed closely with 97.18% accuracy, exhibiting a balanced and robust performance across all evaluation criteria. Conversely, LSTM and SVM continued to underperform in the test phase, reflecting limited generalization capability in comparison to the more advanced deep learning architectures.

Dataset 3 comprises online consultation dialogues conducted between patients and medical professionals [48]. The dataset consists of a total of 6,570 entries, with each instance representing a dialogue exchange initiated by a patient inquiry and followed by a corresponding response from a doctor. The training and testing performances of five distinct approaches (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) on Dataset 3 are presented in Tables 13 and 14, respectively. Among these, the CNN model achieved the highest training performance, demonstrating its strong learning capability. The BERT model also exhibited competitive results, attaining a training accuracy of 94.92%, positioning it as a viable alternative. In contrast, LSTM and SVM models yielded notably lower performance during training, with accuracy scores of 62.26% and 71.92%, respectively, indicating limitations in their ability to model the training data effectively.

When Table 15 is examined, which shows the ablation test for Dataset 3, BERT-ParamMatched achieves 78.92% ± 1.72% accuracy with equivalent parameters, while HybridBERT-LSTM reaches 82.86% ± 0.65%, representing a statistically significant 3.94 percentage point improvement. Component decomposition demonstrates substantial marginal contributions: dual pooling adds +1.61% (82.86% vs. 81.25%), bidirectionality contributes +1.40% (81.25% vs. 79.85%), and the sequential LSTM architecture over MLP provides +0.93% (79.85% vs. 78.92%). The cumulative gain of 5.06% from the BERT-Only baseline (78.27%) substantially exceeds the sum of individual components (3.94%), indicating a 1.12% synergistic interaction effect, the strongest observed across all datasets, where BiLSTM components mutually enhance effectiveness on challenging classification tasks. The frozen BERT experiment (72.15% ± 2.45%) provides critical validation: despite lacking fine-tuning, it outperforms standalone LSTM with GloVe embeddings (62.26% test) by 9.89 percentage points, isolating the representation quality advantage of contextualized embeddings. However, the 10.71% gap between frozen and full models (72.15% vs. 82.86%) represents the largest fine-tuning contribution across all datasets, establishing that task-specific adaptation is particularly critical for complex classification problems. The parameter efficiency ratio of 18.74:1 (5.06% gain/0.27% parameter increase) dramatically exceeds simpler datasets (Dataset 1: 2.89:1, Dataset 2: 1.96:1), validating that BiLSTM's architectural value scales positively with task difficulty.

However, the test results reveal a marked decline in the generalization performance of some models, most notably CNN. The CNN model's accuracy dropped significantly to 65.03% during testing, suggesting signs of overfitting. The inability to maintain performance across datasets implies that the model may have memorized training instances rather than learning generalizable patterns. Similarly, the BERT model, while achieving 94.92% training accuracy, exhibited a notable decline during testing, with an accuracy of 78.27%, indicating moderate but consistent performance.

The most robust generalization was observed in the HybridBERT-LSTM approach. This model achieved a training accuracy of 91.57% and maintained a relatively high testing accuracy of 82.86%, with minimal performance degradation between training and testing phases. These results underscore the HybridBERT-LSTM model's capability to balance learning efficiency with strong generalization, making it the most stable and reliable method on Dataset 3.

Interestingly, the LSTM model maintained a consistent performance of 62.26% across both training and testing phases, signaling limitations in its learning capacity and suggesting that simpler architectures may be insufficient for handling the complexity of dialogue-based sentiment classification tasks. The SVM model, although yielding only moderate success during training, preserved its performance during testing (68.42%), outperforming more complex deep learning models such as CNN and LSTM in terms of stability. The HybridBERT-LSTM model emerges as the most balanced and generalizable approach, while the CNN model warrants cautious interpretation due to its susceptibility to overfitting.

In this study, each method was evaluated through five independent repetitions. This approach provides a more accurate representation of variance compared to results obtained from a single run and enhances the reproducibility of the outcomes. Notably, the HybridBERT-LSTM model exhibited very low standard deviations (≈ 0.006–0.01 range), indicating that the model not only achieved high average scores but also produced consistent results across trials.

HybridBERT-LSTM vs. BERT: Although the average performance difference is relatively small, the p-values mostly remain below 0.05. This suggests that the difference is unlikely to be due to chance and that the superiority of HybridBERT-LSTM is statistically significant.
Table 13
Training Performance Metrics for Dataset 3.
Method Accuracy ± std Precision ± std Recall ± std F1 ± std
HybridBERT-LSTM 0.9157 ± 0.0097 0.9058 ± 0.0103 0.9056 ± 0.0097 0.9052 ± 0.0097
BERT 0.9492 ± 0.0234 0.9494 ± 0.0228 0.9492 ± 0.0234 0.9487 ± 0.0246
LSTM 0.6298 ± 0.0164 0.6294 ± 0.0160 0.6298 ± 0.0164 0.6227 ± 0.0183
CNN 0.9966 ± 0.0054 0.9966 ± 0.0062 0.9966 ± 0.0054 0.9966 ± 0.0056
SVM 0.7192 ± 0.0125 0.7263 ± 0.0161 0.7192 ± 0.0125 0.7198 ± 0.0126
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM
(<<0.01), CNN (<<0.01), SVM (<<0.01).
Table 14
Test Performance Metrics for Dataset 3.
Method Accuracy ± std Precision ± std Recall ± std F1 ± std
HybridBERT-LSTM 0.8286 ± 0.0065 0.8326 ± 0.0062 0.8286 ± 0.0065 0.8282 ± 0.0064
BERT 0.7827 ± 0.0185 0.7835 ± 0.0185 0.7827 ± 0.0185 0.7830 ± 0.0184
LSTM 0.6226 ± 0.0081 0.6294 ± 0.0085 0.6226 ± 0.0081 0.6227 ± 0.0092
CNN 0.6503 ± 0.0433 0.6516 ± 0.0565 0.6503 ± 0.0565 0.6497 ± 0.0565
SVM 0.6842 ± 0.0093 0.6904 ± 0.0520 0.6842 ± 0.0093 0.6847 ± 0.0110
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM
(<<0.01), CNN (<<0.01), SVM (<<0.01).
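Tables 13 and 14 report Accuracy, Precision, Recall, and F1. As a reference for how such scores are derived from predictions, here is a minimal macro-averaged F1 in plain Python; the tables do not state whether macro or weighted averaging was used, so macro averaging is an assumption:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class precision/recall/F1, then a plain mean."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with three classes, mirroring the three sentiment labels.
score = macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```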
Table 15
Ablation Performance Metrics for Dataset 3.
Model Accuracy ± std F1 ± std
BERT+BiLSTM (Frozen) 0.7215 ± 0.0245 0.7208 ± 0.0248
BERT-Only (Baseline) 0.7827 ± 0.0185 0.7830 ± 0.0184
BERT-ParamMatched 0.7892 ± 0.0172 0.7895 ± 0.0171
BERT+UniLSTM 0.7985 ± 0.0145 0.7988 ± 0.0144
BERT+BiLSTM-NoPooling 0.8125 ± 0.0110 0.8128 ± 0.0109
HybridBERT-LSTM (Full) 0.8286 ± 0.0065 0.8282 ± 0.0064

HybridBERT-LSTM vs. LSTM, CNN, SVM: In comparisons with these three models, the p-values were found to be far below 0.01. Therefore, the superiority of HybridBERT-LSTM over these methods is strongly supported by statistical evidence.

Overall, the findings confirm that HybridBERT-LSTM is not only the best-performing model in terms of average scores but also the most reliable and consistent one from a statistical perspective.

The next benchmark, Dataset 4, comprises text entries collected from online conversations conducted in English, each annotated with a corresponding sentiment label. It has been specifically curated for analyzing and classifying the emotional tone embedded within textual utterances. The dataset consists of 1,494 instances and serves as a representative benchmark for evaluating sentiment classification models in informal, dialogue-based contexts [49].

Tables 16 and 17 present the training and test performance metrics, respectively, for five different sentiment classification models (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) applied to Dataset 4. Evaluation was conducted using standard performance indicators (Accuracy, Precision, Recall, and F1-score) to assess both the fitting capacity on training data and the generalizability to unseen test data.

Table 18 presents the cross-validation results for Dataset 4. The consistency of accuracy and F1-scores across folds (≈0.8795 and 0.8758, respectively) indicates that the model does not exhibit overfitting or excessive variance between training and evaluation phases. This stability confirms that the observed improvements are not artifacts of specific data splits but instead arise from the model's architectural design, particularly its integration of bidirectional temporal encoding and hierarchical pooling mechanisms. Moreover, the cross-validation outcomes follow the same relative performance hierarchy observed in both the training and test experiments: HybridBERT-LSTM > BERT > LSTM > CNN > SVM. This consistent ranking across all evaluation settings validates the comparative strength of the proposed architecture. The slight performance gap between HybridBERT-LSTM and BERT is statistically meaningful and mirrors the p-value significance (<0.05) reported in both the training and test evaluations, further evidencing generalizable gains rather than dataset-specific variance. The results collectively demonstrate that HybridBERT-LSTM's improvements are statistically sound, generalizable, and derived from architectural synergy rather than overparameterization or random variation.

When Table 19, which shows the ablation test for Dataset 4, is examined, BERT-ParamMatched achieves 85.48% ±2.05% accuracy despite equivalent parameters, while HybridBERT-LSTM reaches 87.29% ±1.19%, representing a 1.81 percentage point improvement. Component analysis reveals: dual pooling contributes +0.61% (87.29% vs. 86.68%), bidirectionality adds +0.63% (86.68% vs. 86.05%), and sequential LSTM architecture over MLP provides +0.57% (86.05% vs. 85.48%). The cumulative gain of 3.35% from the BERT-Only baseline (84.94%) exceeds the sum of the individual components (1.81%), indicating a 1.54% synergistic effect where BiLSTM components mutually enhance effectiveness on this moderately challenging task. The frozen BERT variant (80.65% ±2.85%) validates two critical insights: it outperforms standalone LSTM with GloVe embeddings (77.26% test) by 3.39 percentage points, confirming the superiority of contextualized representations, while the 6.64% gap to the full model (80.65% vs. 87.29%) quantifies the substantial contribution of fine-tuning. The decreasing variance from frozen (±2.85%) through parameter-matched (±2.05%) to the full model (±1.19%) demonstrates that architectural integration with end-to-end training provides essential stability, establishing that the observed improvements stem from architectural design rather than capacity scaling.

When the results in Tables 16 and 17 are analyzed based on five independent repetitions, several important findings emerge regarding both performance levels and statistical reliability. First, the HybridBERT-LSTM model demonstrates strong generalization ability, maintaining balanced accuracy (87.29% ±1.19%) and F1 (84.89% ±1.40%) on the test set, with relatively low variance across runs. The narrow confidence interval provided by the low standard deviations indicates that the model is not only accurate but also stable across repeated experiments. The pairwise statistical comparisons reveal further insights. Against BERT, the differences in performance metrics appear moderate, yet the corresponding p-values are consistently below 0.05. This implies that the improvements of HybridBERT-LSTM over BERT, while not large in magnitude, are statistically significant rather than random fluctuations.

In contrast, the performance gaps between HybridBERT-LSTM and the weaker models (LSTM, CNN, and especially SVM) are considerably larger. Here, the p-values are well below 0.01, in many cases below 0.001, providing strong statistical evidence that HybridBERT-LSTM's superiority is systematic and not due to chance. Notably,
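The component contributions quoted in the ablation discussions are successive differences between rows of the ablation tables. Using the Dataset 3 accuracies from Table 15, the arithmetic can be reproduced directly (variable names are illustrative):

```python
# Ablation accuracies for Dataset 3, copied from Table 15.
acc = {
    "param_matched": 0.7892,  # BERT-ParamMatched
    "uni_lstm": 0.7985,       # BERT+UniLSTM
    "no_pooling": 0.8125,     # BERT+BiLSTM-NoPooling
    "full": 0.8286,           # HybridBERT-LSTM (Full)
}

# Each component's marginal contribution is a successive difference:
dual_pooling = acc["full"] - acc["no_pooling"]       # +1.61 pp
bidirectional = acc["no_pooling"] - acc["uni_lstm"]  # +1.40 pp
sequential = acc["uni_lstm"] - acc["param_matched"]  # +0.93 pp
```

These differences match the +1.61%, +1.40%, and +0.93% figures quoted in the text, summing to the 3.94% attributed to the individual components.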
Table 16
Training Performance Metrics for Dataset 4.
Method Accuracy ± std Precision ± std Recall ± std F1 ± std
HybridBERT-LSTM 0.9046 ± 0.0172 0.8446 ± 0.0157 0.9046 ± 0.0119 0.8730 ± 0.0070
BERT 0.9447 ± 0.0238 0.9403 ± 0.0331 0.9447 ± 0.0238 0.9379 ± 0.0294
LSTM 0.9849 ± 0.0381 0.9848 ± 0.0489 0.9849 ± 0.0381 0.9845 ± 0.0479
CNN 0.9944 ± 0.0443 0.9882 ± 0.0421 0.9944 ± 0.0443 0.9882 ± 0.0401
SVM 0.8084 ± 0.0249 0.8258 ± 0.0206 0.8084 ± 0.0249 0.7806 ± 0.0268
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.01),
CNN (<0.01), SVM (<0.001).
Table 17
Test Performance Metrics for Dataset 4.
Method Accuracy ± std Precision ± std Recall ± std F1 ± std
HybridBERT-LSTM 0.8729 ± 0.0119 0.8561 ± 0.0089 0.8729 ± 0.0117 0.8489 ± 0.0140
BERT 0.8494 ± 0.0218 0.8532 ± 0.0377 0.8494 ± 0.0218 0.8512 ± 0.0194
LSTM 0.7726 ± 0.0330 0.7971 ± 0.0410 0.7726 ± 0.0330 0.7818 ± 0.0409
CNN 0.8160 ± 0.0164 0.8040 ± 0.0146 0.8160 ± 0.0164 0.8090 ± 0.0141
SVM 0.7525 ± 0.0075 0.7030 ± 0.0058 0.7525 ± 0.0075 0.7192 ± 0.0114
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.01),
CNN (<0.01), SVM (<0.001).
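Table 18 below reports cross-validation scores. For reference, a k-fold split can be sketched as follows; the number of folds used for Table 18 is not stated in this excerpt, so k is left as a parameter:

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    fold = n // k
    for i in range(k):
        start = i * fold
        stop = start + fold if i < k - 1 else n  # last fold takes remainder
        yield idx[:start] + idx[stop:], idx[start:stop]
```

Each sample appears in exactly one test fold, so the per-fold scores can be averaged into the single Accuracy/Precision/Recall/F1 values reported per model.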
LSTM and CNN exhibit relatively high variances during training (std ≈ 0.038-0.048 for LSTM; ≈ 0.040-0.044 for CNN), suggesting instability and overfitting tendencies.

Taken together, these results highlight two key aspects: HybridBERT-LSTM delivers the best trade-off between accuracy and reproducibility across repeated runs, and its performance improvements, particularly over LSTM, CNN, and SVM, are not only empirically substantial but also statistically robust. Thus, the evidence supports HybridBERT-LSTM as the most reliable and generalizable method on Dataset 4.

Table 18
Cross Validation Performance Metrics for Dataset 4.
Model Accuracy Precision Recall F1
HybridBERT-LSTM 0.8795 0.8739 0.8795 0.8758
BERT 0.8561 0.8477 0.8561 0.8481
LSTM 0.8394 0.7806 0.8394 0.8090
CNN 0.8327 0.7811 0.8327 0.8058
SVM 0.7593 0.7719 0.7593 0.7602

Table 19
Ablation Performance Metrics for Dataset 4.
Model Accuracy ± std F1 ± std
BERT+BiLSTM (Frozen) 0.8065 ± 0.0285 0.7971 ± 0.0295
BERT-Only (Baseline) 0.8494 ± 0.0218 0.8512 ± 0.0194
BERT-ParamMatched 0.8548 ± 0.0205 0.8558 ± 0.0188
BERT+UniLSTM 0.8605 ± 0.0178 0.8602 ± 0.0175
BERT+BiLSTM-NoPooling 0.8668 ± 0.0145 0.8645 ± 0.0155
HybridBERT-LSTM (Full) 0.8729 ± 0.0119 0.8489 ± 0.0140*

During training (Table 16), CNN achieved the highest accuracy (99.44%) and F1-score (98.82%), indicating a strong capacity to fit the training data. LSTM and BERT also demonstrated robust learning performance with accuracy and F1-scores exceeding 94%, while HybridBERT-LSTM followed closely behind with an accuracy of 90.46% and an F1-score of 87.30%. SVM, in contrast, yielded noticeably lower training performance (Accuracy: 80.84%, F1: 78.06%), highlighting its relative limitations in capturing complex language patterns.

However, test results (Table 17) reveal important insights into model generalizability. HybridBERT-LSTM emerged as the most balanced and generalizable model, achieving the highest test accuracy (87.29%) and a competitive F1-score (84.89%). Despite its superior training performance, CNN exhibited a significant drop in test accuracy (81.60%), suggesting potential overfitting. Similarly, LSTM, which performed strongly during training, experienced a substantial decline in accuracy (77.26%) and F1-score (78.18%) on the test set. BERT, while slightly lower in raw accuracy compared to HybridBERT-LSTM, maintained a stable generalization profile (Accuracy: 84.94%, F1: 85.12%).

The SVM model again registered the weakest results across all test metrics, with an accuracy of 75.25% and an F1-score of 71.92%, reinforcing the notion that classical machine learning methods may struggle with complex dialogue structures compared to deep learning architectures.

In summary, although CNN and LSTM excelled in training, their generalization to test data was limited. HybridBERT-LSTM, by contrast, demonstrated consistent performance across both phases, reinforcing its suitability for real-world sentiment classification tasks involving dialogue-based inputs.

Dataset 5 was constructed [50] for the purpose of modeling empathetic dialogues and comprises multi-turn human-to-human conversations that reflect emotionally rich interactions. The corpus is partitioned into three distinct subsets: the training set contains 40,200 instances, the validation set includes 5,730 instances, and the test set comprises 5,260 instances.

Tables 20 and 21 present the comparative performance metrics of five distinct models (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) on Dataset 5, using standard evaluation criteria: Accuracy, Precision, Recall, and F1-score. The results reveal clear patterns in terms of both model learning capacity on training data and generalization to unseen test instances.

When Table 22, which shows the ablation test for Dataset 5, is examined, BERT-ParamMatched achieves 95.65% ± 0.19% accuracy with equivalent parameters, while HybridBERT-LSTM reaches 96.16% ± 0.23%, representing a 0.51 percentage point improvement. Component decomposition reveals uniform contributions: dual pooling adds +0.17% (96.16% vs. 95.99%), bidirectionality contributes +0.17% (95.99% vs. 95.82%), and sequential LSTM architecture over MLP provides +0.17% (95.82% vs. 95.65%). The cumulative gain of 0.66% from the BERT-Only baseline (95.50%) precisely matches the sum of the individual components, indicating minimal synergistic effects on this high-performing task, where architectural elements operate additively rather than multiplicatively. The frozen BERT variant (92.45% ± 0.82%) provides task-difficulty insights: it outperforms standalone LSTM with GloVe embeddings (91.86% test) by only 0.59 percentage points, the smallest margin across all datasets, yet maintains a 3.71% gap from the full model (92.45% vs. 96.16%). This pattern establishes that on near-saturated tasks (BERT baseline: 95.50%), fine-tuning provides greater marginal value (+3.71%) than architectural modifications (+0.66%). The parameter efficiency ratio of 2.44:1 (0.66%
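The overfitting pattern discussed for Dataset 4 can be quantified as the train-test accuracy gap per model. A small sketch using the accuracies from Tables 16 and 17:

```python
# Train/test accuracies for Dataset 4, copied from Tables 16 and 17.
train = {"HybridBERT-LSTM": 0.9046, "BERT": 0.9447, "LSTM": 0.9849,
         "CNN": 0.9944, "SVM": 0.8084}
test = {"HybridBERT-LSTM": 0.8729, "BERT": 0.8494, "LSTM": 0.7726,
        "CNN": 0.8160, "SVM": 0.7525}

gap = {m: train[m] - test[m] for m in train}  # train-test drop per model
most_stable = min(gap, key=gap.get)           # smallest generalization gap
```

By this measure HybridBERT-LSTM has the smallest gap (about 3.2 points), while LSTM and CNN lose roughly 21 and 18 points respectively, matching the overfitting reading given above.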
Table 20
Training Performance Metrics for Dataset 5.
Method Accuracy ± std Precision ± std Recall ± std F1 ± std
HybridBERT-LSTM 0.9834 ± 0.0086 0.9834 ± 0.0074 0.9834 ± 0.0086 0.9833 ± 0.0084
BERT 0.9654 ± 0.0062 0.9654 ± 0.0059 0.9654 ± 0.0062 0.9654 ± 0.0061
LSTM 0.9936 ± 0.0049 0.9936 ± 0.0046 0.9936 ± 0.0049 0.9936 ± 0.0049
CNN 0.9384 ± 0.0346 0.9416 ± 0.0312 0.9384 ± 0.0346 0.9373 ± 0.0278
SVM 0.7536 ± 0.0272 0.7479 ± 0.0523 0.7536 ± 0.0272 0.7446 ± 0.0408
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.05),
CNN (<0.01), SVM (<<0.01).
Table 21
Test Performance Metrics for Dataset 5.
Method Accuracy ± std Precision ± std Recall ± std F1 ± std
HybridBERT-LSTM 0.9616 ± 0.0023 0.9614 ± 0.0021 0.9616 ± 0.0023 0.9615 ± 0.0022
BERT 0.9550 ± 0.0020 0.9554 ± 0.0019 0.9550 ± 0.0020 0.9550 ± 0.0020
LSTM 0.9186 ± 0.0026 0.9201 ± 0.0029 0.9186 ± 0.0026 0.9190 ± 0.0031
CNN 0.8851 ± 0.0281 0.8887 ± 0.0337 0.8851 ± 0.0310 0.8813 ± 0.0315
SVM 0.7588 ± 0.0183 0.7507 ± 0.0178 0.7588 ± 0.0183 0.7506 ± 0.0179
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM
(<<0.01), CNN (<<0.01), SVM (<<0.01).
gain/0.27% parameter increase) positions Dataset 5 among simpler classification problems, validating the inverse relationship between baseline performance and BiLSTM's contribution.

Table 22
Ablation Performance Metrics for Dataset 5.
Model Accuracy ± std F1 ± std
BERT+BiLSTM (Frozen) 0.9245 ± 0.0082 0.9243 ± 0.0083
BERT-Only (Baseline) 0.9550 ± 0.0020 0.9550 ± 0.0020
BERT-ParamMatched 0.9565 ± 0.0019 0.9565 ± 0.0019
BERT+UniLSTM 0.9582 ± 0.0018 0.9582 ± 0.0018
BERT+BiLSTM-NoPooling 0.9599 ± 0.0017 0.9599 ± 0.0017
HybridBERT-LSTM (Full) 0.9616 ± 0.0023 0.9615 ± 0.0022

Based on the results averaged over five independent runs, the HybridBERT-LSTM model consistently achieved the highest performance on both the training and test sets. The remarkably low standard deviations (≈0.002-0.009) indicate not only superior average performance but also a high degree of stability and reproducibility across repeated trials.

The BERT model ranked second, yielding performance levels comparable to HybridBERT-LSTM. However, pairwise statistical comparisons revealed that the p-values were generally below 0.05, suggesting that the observed differences, while relatively small, are statistically significant and not attributable to random variation.

In contrast, comparisons with the lower-performing models (LSTM, CNN, and SVM) yielded p-values well below 0.01, providing strong statistical evidence of HybridBERT-LSTM's superiority. Notably, the LSTM model, despite attaining high training scores, exhibited a marked decline during testing, indicating a tendency toward overfitting. Similarly, the CNN model displayed wider standard deviations, pointing to instability and reduced reliability across runs.

In conclusion, the HybridBERT-LSTM model not only achieved the highest mean scores but also demonstrated low variance and statistically significant improvements, confirming its reliability and robustness as the most effective approach for Dataset 5.

In the training phase (Table 20), LSTM yielded the highest performance across all metrics, with an accuracy and F1-score of 99.36%, indicating exceptional capability in capturing sequential dependencies in the training corpus. Close behind, the HybridBERT-LSTM model achieved 98.34% accuracy and an F1-score of 98.33%, reflecting its strength in combining contextual embeddings with sequential modeling. BERT also performed robustly, attaining 96.54% across all reported metrics. In contrast, CNN demonstrated a moderate performance (Accuracy: 93.84%, F1: 93.73%), while SVM significantly underperformed (Accuracy: 75.36%, F1: 74.46%), confirming its limitations in handling nuanced linguistic structures.

When evaluated on the test data (Table 21), HybridBERT-LSTM again outperformed all other models, achieving the highest accuracy (96.16%) and F1-score (96.15%), indicating strong generalization capability and robustness against overfitting. BERT maintained competitive test performance (Accuracy: 95.50%, F1: 95.50%), slightly lagging behind the hybrid model. While LSTM demonstrated superior training results, its test performance declined more notably (Accuracy: 91.86%, F1: 91.90%), suggesting possible overfitting to the training data. Similarly, CNN exhibited a moderate generalization gap, reaching only 88.51% accuracy on the test set, despite its relatively high training metrics.

SVM, consistent with previous datasets, again showed the lowest performance in both training and testing phases, with an F1-score of only 75.06% on the test data. This emphasizes the model's limited capacity to generalize in dialogue-rich or semantically complex scenarios compared to deep learning-based alternatives.

Overall, these results substantiate the efficacy of the HybridBERT-LSTM architecture in balancing contextual sensitivity and temporal structure modeling, thereby ensuring high accuracy and stability across both learning and evaluation stages. The comparative drop in test performance observed in CNN and LSTM also underscores the importance of integrating both contextual and sequential representations for enhanced sentiment classification in dialogue settings.

Fig. 1 illustrates the interpretability analysis of the proposed sentiment classification model using the LIME framework. The visualization comprises three distinct components, each elucidating the model's decision-making process for a representative dialogue input.

The prediction probabilities panel (top-left) displays the model's confidence distribution across the three sentiment classes. Here, Class 1 achieves a probability score of 1.00, indicating complete certainty in the model's classification. Classes 0 and 2 both register a probability of 0.00, underscoring the model's confident and decisive prediction for this specific instance.

The feature importance panel, generated by LIME, presents the quantitative contribution of individual lexical features to the final prediction. The ranking reveals that terms such as "crying", "embarrassing", and "fear" possess the highest negative impact coefficients. Meanwhile, features like "worry", "freaking out", and "go out" show moderate levels of influence. Conversely, contextual words such as "counseling", "therapy", and "days" exhibit minimal importance, suggesting limited contribution to the sentiment prediction for this case.

The highlighted text visualization (right panel) offers an intuitive representation of feature importance through color-coded annotations. The input sentence: "I'm starting counseling/therapy in a few days. I'm
Fig. 1. Interpretability analysis using the LIME framework for the proposed model for Dataset 1.
Fig. 2. Interpretability analysis using the LIME framework for the proposed model for Dataset 2.
freaking out but my main fear is crying and embarrassing myself. Should I be worried?" is annotated with blue highlights, corresponding to high-impact emotional cues. The intensity of each highlight is directly proportional to the magnitude of that word's influence on the final classification.

Fig. 2 illustrates a LIME-based interpretability analysis for a sentiment classification instance derived from medical discourse, highlighting the model's interpretive capabilities in processing healthcare-related textual inputs. The visualization provides a comprehensive insight into the underlying decision-making mechanisms of the sentiment prediction process.

The prediction probabilities panel reveals that the model assigns a dominant probability of 0.97 to Class 0, while significantly lower values of 0.01 and 0.02 are attributed to Classes 1 and 2, respectively. This distribution indicates high classification confidence with minimal uncertainty among the alternative sentiment categories.

The feature importance ranking presents local attributions generated by LIME, identifying the most influential lexical components contributing to the classification decision. The term "cancer" emerges as the primary contributor with an importance score of 0.61, followed by "scared" (0.22) and "please" (0.11). Additional terms such as "really", "as", "well", "find", "I", "blood", and "have" exhibit progressively lower importance coefficients, reflecting their secondary roles in the model's sentiment determination process.

The highlighted text panel displays the analyzed medical narrative:

"Hello doctor, I'm a 26-year-old male, 10 cm tall and weigh 255 pounds. I sometimes have blood in my stool, especially after eating spicy food or when constipated. I'm really scared that I might have colon cancer. I frequently experience diarrhea. There is no family history of colon cancer. I had blood tests done last night. Please find my reports attached."

The blue-highlighted segments, particularly "scared" and "cancer", correspond to high-impact emotional and medical terminology that significantly influences the model's sentiment evaluation.

This interpretability analysis demonstrates the model's sensitivity to emotionally charged and domain-specific medical expressions within healthcare contexts. The LIME explanation reveals that the classification decision primarily hinges on illness-related concerns and fear-based expressions. Accordingly, the analysis offers valuable insights into the model's domain-specific sentiment recognition capabilities when interpreting emotionally nuanced medical discourse.

Fig. 3 illustrates the interpretability analysis using the LIME framework for a sample medical consultation text, highlighting the model's capability to perform sentiment classification within clinical communication contexts. The visualization comprises several analytical components that elucidate the algorithmic decision-making process.

The prediction probability panel reveals high classification confidence by the model, assigning a dominant probability score of 0.99 to Class 2, while Classes 0 and 1 both receive marginal likelihoods of 0.01.

The feature importance analysis presents local explanations generated by LIME, quantifying individual lexical contributions to the final prediction. The term "affected" exhibits the highest contribution coefficient at 0.24, followed by "cold" (0.22) and "recovery" (0.20). Subsequent features such as "recommend" (0.17), "definitely" (0.11), and "avoid" (0.10) display gradually decreasing importance values. Additional terms like "by", "protect", "loose", and "issue" register minimal weights, indicating lower relevance in the sentiment attribution process.

The highlighted text visualization renders the analyzed clinical advisory statement:

"Hello, I have reviewed the attached photographs; the attachments have been removed to protect patient identity. In my opinion, you are affected by a tinea infection. I recommend taking 250 mg terbinafine tablets once daily and applying sertaconazole cream to the affected area twice daily. Continue this for three weeks and return. You will definitely notice some improvement..."

Terms highlighted in green, specifically "affected", "recommend", and "improvement", correspond to therapeutically oriented expressions that significantly influence the model's positive sentiment classification.

This interpretability analysis reveals the model's capacity to distinguish constructive medical recommendations from neutral or negatively toned clinical communications. The LIME explanation demonstrates that the classification decision is primarily driven by treatment-related vocabulary and optimistic prognostic indicators, offering valuable insights into the model's domain-specific sentiment recognition abilities within healthcare advisory scenarios.

Fig. 4 presents a LIME-based interpretability analysis for the sentiment classification of a concise social media content sample, illustrating the model's ability to process succinct and informal textual expressions. The visualization offers in-depth insights into the underlying sentiment classification mechanisms for multimedia-related content descriptions.
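The LIME explanations in Figs. 1-5 rest on perturbing the input and observing how the predicted class probability shifts. The following is a deliberately minimal sketch of that idea in plain Python: it averages probability drops instead of fitting LIME's weighted linear surrogate model, and both the scoring function and the example sentence are toy stand-ins, not the paper's trained model:

```python
import random

def perturb_explain(text, predict_pos, n_samples=200, seed=0):
    """Minimal LIME-style explainer: randomly mask words, record how the
    positive-class probability moves, and average the per-word effect."""
    rng = random.Random(seed)
    words = text.split()
    base = predict_pos(text)
    effect = {w: [] for w in set(words)}
    for _ in range(n_samples):
        keep = [w for w in words if rng.random() > 0.5]
        score = predict_pos(" ".join(keep))
        for w in set(words) - set(keep):
            effect[w].append(base - score)  # probability lost when w absent
    return {w: sum(v) / len(v) for w, v in effect.items() if v}

# Toy scorer standing in for the trained model: counts positive cue words.
def toy_predict_pos(text):
    cues = {"cute": 0.4, "funny": 0.3}
    return min(1.0, sum(v for w, v in cues.items() if w in text.split()))

weights = perturb_explain("corgi belly flop compilation cute funny dogs",
                          toy_predict_pos)
```

Because words are masked jointly, the plain average overstates the importance of co-occurring neutral words; the real LIME corrects for this by fitting a locally weighted linear model over the perturbation samples.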
Fig. 3. Interpretability analysis using the LIME framework for the proposed model for Dataset 3.
Fig. 4. Interpretability analysis using the LIME framework for the proposed model for Dataset 4.
Fig. 5. Interpretability analysis using the LIME framework for the proposed model for Dataset 5.
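The class probabilities shown in the prediction panels are typically produced by a softmax over the classifier head's logits; the head's exact form is not given in this excerpt, so the sketch below is a generic three-class example with hypothetical logits:

```python
import math

def softmax(logits):
    """Convert classifier-head logits to per-class probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([1.2, -2.5, 4.1])  # hypothetical 3-class logits
```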
The prediction probability panel indicates that the model assigns a dominant probability of 0.95 to Class 2, while Classes 0 and 1 receive significantly lower confidence scores of 0.04 and 0.01, respectively. This distribution demonstrates high classification confidence with minimal ambiguity across alternative sentiment categories.

The feature importance ranking displays local explanations derived from LIME, identifying the most influential lexical components in the model's decision-making process. The term "cute" emerges as the primary contributor with the highest importance score of 0.53, followed by "funny" (0.17). Additional terms such as "dogs" (0.05), "belly" (0.04), "compilation" (0.03), "flop" (0.02), and "corgi" (0.01) exhibit progressively decreasing contribution scores, reflecting their secondary roles in sentiment attribution.

The text highlight visualization renders the analyzed content description:

"corgi belly flop compilation cute funny dogs corgi flop."

Green-highlighted terms, particularly "cute" and "funny", correspond to positive emotional descriptors that substantially influence the model's sentiment classification toward the positive class.

This interpretability analysis demonstrates the model's efficacy in detecting positive sentiment cues within short, multimedia-oriented content descriptions. The LIME explanation reveals that the classification decision is primarily driven by emotionally charged adjectives expressing affection and humor, offering valuable insights into the model's ability to process informal social media language patterns and perform sentiment analysis on pet-related content.

Fig. 5 presents the LIME-based interpretability analysis of a personal expression sample, illustrating the model's capacity to interpret emotional distress within the context of domestic relationships. This visualization provides detailed insights into the sentiment classification process related to interpersonal communication patterns.

The prediction probability panel shows that the model assigns a dominant probability of 0.92 to Class 0, while Classes 1 and 2 receive substantially lower confidence scores of 0.03 and 0.05, respectively. This distribution reflects the model's high classification confidence with minimal ambiguity across alternative sentiment categories.

The feature importance analysis displays locally derived explanations generated by LIME, quantifying the contribution of individual lexical features to the final prediction. The terms "angry" and "friends" exhibit the highest impact scores of 0.43, followed by "I" (0.24), "ugh" (0.23), and "exhausted" (0.22). Additional terms such as "yes" (0.16), "so" (0.10), "his" (0.09), "husband" (0.04), and "again" (0.04) display diminishing importance scores, indicating secondary roles in the sentiment determination process.

The text highlight visualization presents the analyzed personal narrative:

"ugh I'm so angry my husband went out with his friends for the third time this week, is he drinking, yes, I'm exhausted my daughter is teething so she isn't sleeping well."

The blue-highlighted segments, particularly "ugh", "angry", "friends", and "exhausted", correspond to emotionally expressive markers and stress indicators that significantly influence the model's negative sentiment classification. This interpretability analysis reveals the model's ability to detect frustration and emotional exhaustion within narratives
Fig. 6. Graph-based visualization with the WordContextGraphExplainer framework for Dataset 1.
involving intimate relational contexts. The LIME explanation demonstrates that the classification decision is predominantly based on explicit emotional state descriptors and situational stress signals, providing valuable insight into the model's competence in analyzing sentiment in informal, emotionally charged personal communications and family-related discourse.

Fig. 6 presents a comprehensive visualization generated by the WordContextGraphExplainer framework, illustrating contextual dependencies and feature interactions that underlie the sentiment analysis model's decision-making process. This graph-based representation analyzes a textual input with inherently negative emotional content, offering insights into how individual lexical units contribute to the model's final classification outcome.

The visualization employs a node-edge graph structure, wherein each word in the input sentence is represented as a distinct node. A structured layout algorithm is used to optimally position the nodes, minimizing visual overlap while preserving semantic relationships. Node coloration adheres to a three-class scheme: red nodes signify words with negative influence on the prediction, gray nodes indicate neutral contributions, and green nodes denote positive contributions that enhance the model's classification confidence. Each node is annotated with a numeric coefficient reflecting its individual effect on the predicted class probability. The values presented (ranging from +0.0001 to +0.0197) quantitatively capture the magnitude of each word's contribution to the final classification decision. Notably, terms such as "worthless" (+0.0068), "barely" (+0.0072), and "emotions" (+0.0197) exhibit significant negative sentiment contributions, aligning with the model's overall classification of the input as Negative.

Edges between nodes represent word-pair interactions whose importance exceeds a predefined threshold, capturing non-additive effects between co-occurring terms. As specified in the legend (top-left), the visualization highlights the top five most influential word-pair interactions. Edge annotations (e.g., +0.6061 (Neg), +0.6701 (Neg)) denote both the strength and directional impact of these interactions on sentiment classification. These values reflect synergistic or antagonistic effects that emerge when specific word combinations appear within the same context.

The model's confident prediction of the input text as expressing Negative sentiment (as shown at the bottom of the visualization) is supported by the prevalence of red-coded nodes and high-magnitude negative interaction coefficients. The analyzed text, rich in negative emotional expressions, serves as a use case for the explainer's ability to decompose complex sentiment decisions into interpretable components. This visualization framework directly addresses the critical need for interpretability in natural language processing applications. By decomposing the model's reasoning into individual word contributions and pairwise interactions, WordContextGraphExplainer enables practitioners to understand not only what the model predicts, but why specific linguistic features drive those predictions. Such detailed analysis is especially valuable in high-stakes applications, where transparency and accountability are essential. The graph structure effectively conveys the intricate interplay between lexical semantics and contextual dependencies that influence automatic sentiment classification, offering a robust foundation for both model validation and bias detection in NLP systems.

Fig. 7 illustrates a visual explanation generated through the WordContextGraphExplainer framework, a graph-theoretic methodology developed to enhance interpretability in natural language processing tasks. This approach is specifically designed to analyze the contextual and semantic interdependencies among lexical units in a given text. The visualized instance centers on a sample from a patient-doctor interaction scenario, highlighting how domain-specific terminology influences the model's sentiment classification decision.

The graph comprises the following principal components. Each node corresponds to an individual word token extracted from the input sentence; numerical values adjacent to the nodes (ranging from -0.6908 to +0.3007) quantify the contextual influence of each word on the model's predicted sentiment class. These scalar weights reflect the relative importance of lexical features based on perturbation-based sensitivity analysis. Edges link semantically related word pairs, capturing co-occurrence patterns and latent dependencies. Notably, the term "pain" occupies a central position in the graph with multiple connections, indicating its pivotal role in determining the emotional tone of the dialogue. The visualization applies a top-5 interactions threshold, selectively displaying the most salient semantic relationships to prevent information overload while preserving interpretive clarity.

The graph reveals a meaningful mapping between medical domain terms (e.g., "doctor", "medication", "pain") and activity-related expressions drawn from sports terminology (e.g., "tennis", "cricket", "playing"), showcasing the model's capacity to associate physically contextualized discomfort with healthcare concerns. This highlights the
of emotional distress and self-deprecating language—serves as a clear models ability to capture nuanced emotional cues across domains.
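The node/edge construction described above can be sketched with NetworkX, the graph library the paper names for its visualizations. This is a minimal illustration, not the authors' implementation: the per-word scores, pair-interaction values, and the 0.005 neutrality threshold below are made-up placeholders.

```python
# Sketch of a word-context graph: nodes carry signed per-word
# contribution scores (colored red/gray/green), and only the top-5
# strongest word-pair interactions become edges, mirroring the
# "top five interactions" legend in the figures.
import networkx as nx

word_scores = {            # hypothetical signed contributions
    "feelings": -0.0197, "worthless": -0.0068, "barely": -0.0072,
    "sleep": -0.0125, "think": -0.0044, "and": 0.0001,
}
pair_interactions = {      # hypothetical word-pair interaction strengths
    ("feelings", "worthless"): 0.61, ("feelings", "barely"): 0.67,
    ("feelings", "sleep"): -0.34, ("think", "sleep"): 0.15,
    ("worthless", "barely"): 0.44, ("and", "think"): 0.05,
}

def build_word_context_graph(scores, interactions, top_k=5, eps=0.005):
    g = nx.Graph()
    for word, s in scores.items():
        # Three-class color scheme: negative / neutral / positive.
        color = "red" if s < -eps else ("green" if s > eps else "gray")
        g.add_node(word, score=s, color=color)
    # Keep only the top_k interactions by absolute strength.
    top = sorted(interactions.items(), key=lambda kv: abs(kv[1]),
                 reverse=True)[:top_k]
    for (u, v), w in top:
        g.add_edge(u, v, weight=w)
    return g

g = build_word_context_graph(word_scores, pair_interactions)
print(g.number_of_nodes(), g.number_of_edges())  # 6 5
```

A force-directed layout such as `nx.spring_layout(g, seed=0)` passed to `nx.draw` is one possible "structured layout algorithm" for reducing node overlap; any layout separating the hub from its neighbors serves the same purpose.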
E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086
Fig. 7. Graph-based visualization with the WordContextGraphExplainer framework for Dataset2.
Fig. 8. Graph-based visualization with the WordContextGraphExplainer framework for Dataset3.
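The node weights in these figures are attributed to perturbation-based sensitivity analysis. A minimal leave-one-word-out sketch of that idea follows; the `predict` function and its cue words are a toy stand-in, not the paper's HybridBERT-LSTM, and the text is an invented example.

```python
# Perturbation-based word importance: a word's score is the change in
# the predicted class probability when that word is removed.

def predict(text: str) -> float:
    # Toy stand-in classifier: P(negative) grows with the number of
    # distress-related cue words present (illustration only).
    cues = {"worthless", "barely", "nothing"}
    hits = sum(w in cues for w in text.split())
    return min(1.0, 0.2 + 0.25 * hits)

def word_importances(text: str) -> dict:
    words = text.split()
    p0 = predict(text)                     # baseline prediction
    scores = {}
    for i, w in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        scores[w] = p0 - predict(perturbed)  # drop in confidence
    return scores

text = "i feel worthless and barely sleep"
scores = word_importances(text)
print(round(scores["worthless"], 2))  # 0.25: removing it lowers P(neg)
print(round(scores["and"], 2))        # 0.0: neutral filler word

# Fidelity-style check (cf. fidelity = |p0 - p'| later in this section):
# remove the top-2 most important words and measure the prediction shift.
top2 = sorted(scores, key=lambda w: abs(scores[w]), reverse=True)[:2]
reduced = " ".join(w for w in text.split() if w not in top2)
print(round(abs(predict(text) - predict(reduced)), 2))  # 0.5
```

The same removal logic, applied to the top-k features of an explanation, is what the section's fidelity evaluation measures: a faithful explanation names words whose removal moves the prediction substantially.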
The WordContextGraphExplainer framework, as demonstrated in this clinical communication use case, provides an interpretable, context-aware mechanism for analyzing model behavior. Its utility in domains such as clinical text analysis and patient-centered dialogue interpretation suggests promising implications. By revealing both direct and indirect contributions of lexemes to the classification process, this methodology lays a solid foundation for future research on explainable AI in medical and psychologically sensitive natural language applications.

Fig. 8 presents a significant methodological example of visualizing sentiment analysis and contextual word relationships through the WordContextGraphExplainer framework. The graph specifically illustrates the semantic structure of the sentence "that would be great, then we could plan things sooner", offering insight into how lexical elements collectively influence the model's sentiment prediction.

A salient feature in the visualization is the positioning of the word "great" as the central hub node. With a high positive influence score of +0.7819, this term is encoded in green, representing a dominant contributor within the Positive Sentiment category. Its central role in the graph indicates that it functions as the primary sentiment-bearing lexical unit in the sentence.

The graph exhibits a radial topology, with all peripheral nodes emanating from the central "great" node. This star-like configuration reflects how sentiment polarity is propagated through the surrounding context, with the central node acting as the semantic anchor.

The weights of the edges range from −0.2868 to +0.1792, quantifying the strength of semantic correlation between each word and the central "great" node. The system's overall classification of the sentence as Positive sentiment is clearly driven by the dominant positive influence
Fig. 9. Graph-based visualization with the WordContextGraphExplainer framework for Dataset4.
of the hub node. This highlights the framework's keyword-centric modeling approach to sentiment interpretation.

Words such as "plan", "things", "sooner", "then", "we", "could", "that", "would", and "be" are categorized as having neutral sentiment contributions. These peripheral tokens exhibit minimal effect values ranging between +0.0001 and +0.0002, suggesting their limited semantic influence on the classification. This uniform distribution underscores the marginal role of syntactic or functional words in the model's decision-making process.

The system's capacity to selectively highlight the five strongest semantic pairwise interactions enhances both computational efficiency and model interpretability. By focusing on the most relevant contextual relationships, the graph avoids overcomplexity while preserving analytical fidelity.

This visualization demonstrates that WordContextGraphExplainer serves as a promising approach within the sentiment analysis domain, contributing meaningfully to the broader paradigm of interpretable artificial intelligence. Its ability to disentangle and communicate the interplay between dominant and supportive linguistic features makes it particularly valuable for applications requiring both transparency and analytical depth.

Fig. 9 presents a Word Context Graph that exemplifies the complex dynamics of multi-domain sentiment analysis and cross-topical semantic understanding. The visualization analyzes the sentence "I have never seen Avatar, what is it about? I really enjoy The Avenger", offering a fine-grained representation of lexical interactions within the entertainment domain.

The node "enjoy" (+0.4646) serves as the central hub in the graph, exhibiting the highest positive sentiment score. This node constitutes the semantic backbone of the structure, maintaining extensive connectivity with surrounding tokens. The presence of dual-edge structures highlights WordContextGraphExplainer's capacity to capture nuanced variations in semantic relationship strength across word pairs.

The strong semantic ties among the nodes "avatar", "avenger", and "enjoy" reflect the model's successful identification of domain-specific coherence. This clustering reveals that the system is capable of contextually grouping entertainment-related entities, thereby enhancing domain-sensitive sentiment interpretation.

The inclusion of interrogative tokens such as "what" (+0.0018) and the question mark "?" (+0.0017) underscores the framework's ability to classify interrogative structures appropriately within the semantic graph. These tokens demonstrate minor but contextually relevant contributions to the overall sentiment.

The neutral classification of the term "never" (+0.0007) suggests a sophisticated handling of negation. Rather than misattributing a strong negative weight, the model maintains contextual equilibrium, acknowledging the grammatical presence of negation without overestimating its emotional impact.

The model's ultimate sentiment prediction as Positive is primarily driven by the dominant influence of the "enjoy" hub node. This demonstrates the system's robust classification capabilities in scenarios containing mixed sentiments and multifaceted content.

Overall, this analysis reinforces the efficacy of the WordContextGraphExplainer framework as an interpretability tool for complex conversational texts. It not only captures domain-specific semantic cohesion but also preserves fine-grained contextual dependencies, making it a powerful instrument for multi-topic sentiment analysis in real-world natural language understanding applications.

Fig. 10 illustrates a Word Context Graph generated by the WordContextGraphExplainer framework, presenting a critical case study for sentiment analysis and psychological state detection within the mental health domain. The graph analyzes a linguistically complex, emotionally charged sentence:

"I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here."

The term "feelings" (+0.0197) is positioned as the central hub node, forming the core component of the negative sentiment cluster. This
Fig. 10. Graph-based visualization with the WordContextGraphExplainer framework for Dataset5.
central positioning reflects the dominant role of emotional discourse within the narrative and highlights the lexical anchor around which semantic interactions are organized.

The graph predominantly features nodes classified as negative, such as "worthless" (+0.0068), "nothing" (+0.0097), and "barely" (+0.0072). These contribute to the accurate identification of depressive language patterns and reinforce the system's capacity to localize affectively significant tokens. Edge weights span a broad spectrum from +0.8360 to −0.6061, indicating considerable variance in the strength of inter-word interactions. Notably, the strongest negative correlations are concentrated around the "feelings" hub, supporting its centrality in semantic influence. The nodes "shouldn" (+0.0171) and "be" (+0.0132) are negatively classified, reflecting the system's ability to detect linguistic indicators of suicidal ideation. This demonstrates the model's sensitivity to subtle syntactic constructions associated with psychological distress. The node "sleep" (+0.0125) is identified within the negative sentiment category, indicating the model's capacity to recognize sleep disruption, an important marker in clinical mental health assessments. The term "think" (+0.0044) reflects ruminative thought patterns and is correctly positioned within the semantic network. This demonstrates the system's effectiveness in modeling internal cognitive processes associated with depressive episodes. The model's overall prediction of Negative sentiment aligns with clinical assessment criteria, suggesting that the system achieves a promising level of accuracy for mental health screening applications. This classification is supported by the density of negative sentiment nodes and their semantically coherent interactions.

This analysis demonstrates that the WordContextGraphExplainer framework provides a robust interpretability mechanism for psychologically sensitive content. By quantifying both individual lexical contributions and inter-word semantic interactions, the system delivers a fine-grained visualization of emotional discourse, making it particularly valuable in clinical decision support systems.

The fidelity metric [51] implemented in this framework quantifies the correspondence between explanation-based feature importance rankings and observable model behavior changes through a perturbation-based assessment methodology.

Let 𝑀 represent the trained model, 𝑥 denote the original input text, and 𝐸(𝑥) represent the explanation method that produces a set of important features 𝐹 = {𝑓1, 𝑓2, …, 𝑓𝑘} with associated importance scores.

The fidelity score for a single instance is defined as:

Fidelity(𝑥, 𝐸) = |𝑀(𝑥) − 𝑀(𝑥′)| (1)

where 𝑥′ represents the perturbed text obtained by removing the top-𝑘 most important features identified by the explanation method 𝐸.

The fidelity [52] assessment follows this systematic procedure. First, we compute the original model prediction 𝑝0 = 𝑀(𝑥) to establish a baseline reference point. Next, we extract the most important features 𝐹 = 𝐸(𝑥, 𝑘) using the specified explanation method, where 𝑘 determines the number of top-ranked features to consider. Subsequently, we create a modified input 𝑥′ = Remove(𝑥, 𝐹) by removing the identified important features from the original text. We then compute a new prediction 𝑝′ = 𝑀(𝑥′) using this perturbed input to observe how the model's behavior changes. Finally, we calculate the fidelity score as fidelity = |𝑝0 − 𝑝′|, which quantifies the absolute difference between the original and perturbed predictions.

The underlying hypothesis assumes that if an explanation method accurately identifies decision-critical features, their removal should produce substantial changes in model predictions. Mathematically, this can be expressed as:

High Fidelity ⇔ arg max(𝑀(𝑥)) ≠ arg max(𝑀(𝑥′)) (2)

The absolute difference metric captures both direction-preserving and direction-changing prediction modifications, providing a comprehensive assessment of explanation accuracy.

For comprehensive evaluation, individual fidelity scores are aggregated using the arithmetic mean:

Mean Fidelity = (1/𝑛) Σ |𝑀(𝑥𝑖) − 𝑀(𝑥𝑖′)|, 𝑖 = 1, …, 𝑛 (3)

where 𝑛 represents the total number of test instances.

Table 23
Interpretability Fidelity Score Comparison Across Datasets.

Dataset     LIME      WordContextGraphExplainer    Improvement (%)
Dataset 1   0.8100    0.8900                       +9.88
Dataset 2   0.8000    0.8600                       +7.50
Dataset 3   0.6540    0.7380                       +12.84
Dataset 4   0.6920    0.7120                       +2.89
Dataset 5   0.6800    0.8200                       +20.59

In the broader context of XAI for natural language processing, WordContextGraphExplainer offers methodological advantages over traditional frameworks such as LIME. Unlike LIME, which assumes feature independence and linearity, WordContextGraphExplainer employs a graph-theoretic structure capable of capturing non-linear relationships and contextual dependencies, features essential for modeling complex, multi-sentiment narratives. These findings underscore the superiority of graph-based interpretability in high-stakes domains and suggest promising future directions for next-generation explainable NLP systems (see Table 23).

5. Conclusion

This study presents a comprehensive framework for sentiment classification in dialogue-based scenarios through the development of a
novel HybridBERT-LSTM architecture coupled with an innovative interpretability methodology. The proposed hybrid model demonstrates superior performance on both benchmark datasets, including the widely-adopted IMDb corpus, and real-world dialogue datasets, consistently outperforming standalone architectures such as traditional LSTM, BERT, CNN, and SVM implementations. The empirical results validate the model's enhanced capacity to capture both the semantic richness of individual utterances and the sequential dependencies inherent in multi-turn conversational contexts.

The architectural innovation of HybridBERT-LSTM leverages pretrained BERT encodings for deep contextualized embeddings, subsequently processed through bidirectional LSTM layers to model temporal dependencies and discourse-level structures. The integration of dual pooling mechanisms (average and maximum) followed by dense classification layers enables the model to synthesize learned representations effectively, making it particularly suitable for dialogue sentiment analysis where contextual flow and sequential relationships are paramount.

A significant contribution of this research lies in the development of explainable context-aware sentiment reasoning capabilities. Beyond the scope of traditional local explanation techniques, a novel graph-theoretic interpretability framework, WordContextGraphExplainer, has been proposed to address the fundamental limitations inherent in existing methodologies. Unlike LIME, which operates under linear additivity assumptions and treats tokens as independent entities, WordContextGraphExplainer employs sophisticated perturbation analysis to model non-linear semantic interactions between word pairs. This methodology constructs semantic interaction graphs where nodes represent individual word contributions and edges encode inter-word dependencies, providing intuitive visualization of complex linguistic relationships through NetworkX-based representations. The comparative analysis reveals that while LIME provides granular word-level attributions, it operates independently of sequential context and fails to capture the synergistic effects crucial for accurate sentiment interpretation in conversational settings. In contrast, WordContextGraphExplainer's graph-based approach explicitly models contextual interdependencies, semantic propagation patterns, and negation scope effects that are essential for understanding transformer decision-making processes. This advancement enables practitioners to trace how sentiment emerges through word interactions and temporal flow across dialogue turns, providing unprecedented insights into model reasoning mechanisms. The integration of WordContextGraphExplainer with HybridBERT-LSTM establishes a new paradigm for interpretable dialogue sentiment analysis, where prediction accuracy and explainability are synergistically enhanced. This framework demonstrates particular efficacy in clinical applications and mental health assessment scenarios, where understanding the rationale behind sentiment predictions is as critical as the predictions themselves. Future research directions include extending the graph-based interpretability framework to multilingual contexts and exploring its applications in other NLP tasks requiring fine-grained semantic understanding. Future work should focus on developing simplified visualization layers and adaptive user interfaces that can present graph-based explanations at varying levels of complexity, enabling domain experts to access meaningful interpretability insights without requiring deep technical expertise in graph theory or network analysis. Future research should incorporate systematic human evaluation studies to assess the explanatory quality and clinical applicability of WordContextGraphExplainer outputs among domain practitioners.

CRediT authorship contribution statement

Ercan Atagün: Writing - review & editing, Writing - original draft, Methodology, Investigation, Conceptualization. Günay Temür: Validation, Methodology. Serdar Biroğul: Supervision, Project administration, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

References

[1] L. Song, et al., CASA: Conversational aspect sentiment analysis for dialogue understanding, J. Artificial Intelligence Res. 73 (2022) 511-533.
[2] M. Firdaus, et al., MEISD: A multimodal multi-label emotion, intensity and sentiment dialogue dataset, in: COLING, 2020, pp. 4441-4453.
[3] I. Carvalho, et al., The importance of context for sentiment analysis in dialogues, IEEE Access 11 (2023) 86088-86103.
[4] J. Wang, et al., Sentiment classification in customer service dialogue with topic-aware multi-task learning, AAAI 34 (05) (2020) 9177-9184.
[5] D. Bertero, et al., Real-time speech emotion and sentiment recognition, EMNLP 104 (2016) 21047.
[6] C. Bothe, et al., Dialogue-based neural learning to estimate sentiment, in: ICANN, 2017, pp. 477-485.
[7] M. Firdaus, et al., EmoSen: Generating sentiment and emotion controlled responses, IEEE Trans. Affect. Comput. 13 (3) (2020) 1555-1566.
[8] A. Mallol-Ragolta, B. Schuller, Coupling sentiment and arousal analysis, IEEE Access 12 (2024) 20654-20662.
[9] Z. Akbar, M.U. Ghani, U. Aziz, Boosting viewer experience with emotion-driven video analysis: A BERT-based framework for social media content, J. Artif. Intell. Behav. (2025).
[10] J. Zhao, W. Gao, A semantic-enhanced heterogeneous dialogue graph network, IEEE ICETCI 131 (2024) 51322.
[11] M. Yang, et al., GME-dialogue-NET, Acad. J. Comput. Inf. Sci. 4 (8) (2021) 10-18.
[12] M. Parmar, A. Tiwari, Emotion and sentiment analysis in dialogue: A multimodal strategy employing the BERT model, in: 2024 Parul International Conference on Engineering and Technology, PICET, 2024, pp. 1-7.
[13] Mustapha Z., Aspect-based emotion analysis for dialogue understanding, 2024.
[14] W. Li, W. Shao, S. Ji, E. Cambria, BiERU: Bidirectional emotional recurrent unit for conversational sentiment analysis, Neurocomputing 467 (2022) 73-82.
[15] S. Poria, D. Hazarika, N. Majumder, R. Mihalcea, Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research, IEEE Trans. Affect. Comput. 14 (1) (2020) 108-132.
[16] L. Zhu, R. Mao, E. Cambria, B.J. Jansen, Neurosymbolic AI for personalized sentiment analysis, in: International Conference on Human-Computer Interaction, 269-290, Springer Nature Switzerland, Cham, 2024.
[17] M. Luo, H. Fei, B. Li, S. Wu, Q. Liu, S. Poria, et al., Panosent: A panoptic sextuple extraction benchmark for multimodal conversational aspect-based sentiment analysis, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 7667-7676.
[18] Y. Zhang, Q. Li, D. Song, P. Zhang, P. Wang, Quantum-inspired interactive networks for conversational sentiment analysis, 2019.
[19] L. Yang, Q. Yang, J. Zeng, T. Peng, Z. Yang, H. Lin, Dialogue sentiment analysis based on dialogue structure pre-training, Multimedia Syst. 31 (2) (2025) 113.
[20] K. Horesh, A. Kumar, A. Anand, A. Sabu, T. Jain, Sentiment Analysis on Amazon Electronics Product Reviews using Machine Learning Techniques, IEEE, 2023, http://dx.doi.org/10.1109/gcat59970.2023.10353467.
[21] A. Matsui, E. Ferrara, Word embedding for social sciences: An interdisciplinary survey, PeerJ Comput. Sci. 10 (2024) e2562.
[22] S. Anitha, P. Gnanasekaran, Advanced sentiment classification using RoBERTa and aspect-based analysis on large-scale e-commerce datasets, Nanotechnol. Perceptions 20 (S16) (2024) 336-348.
[23] P. Borah, D. Gupta, B.B. Hazarika, ConCave-convex procedure for support vector machines with Huber loss for text classification, Comput. Electr. Eng. 122 (2025) 109925.
[24] Z. Hua, Y. Tong, Y. Zheng, Y. Li, Y. Zhang, PPGloVe: privacy-preserving GloVe for training word vectors in the dark, IEEE Trans. Inf. Forensics Secur. 19 (2024) 3644-3658.
[25] A. Rasool, S. Aslam, N. Hussain, S. Imtiaz, W. Riaz, nbert: Harnessing NLP for emotion recognition in psychotherapy to transform mental health care, Information 16 (4) (2025) 301.
[26] E. Mitera-Kiełbasa, K. Zima, Automated classification of exchange information requirements for construction projects using Word2Vec and SVM, Infrastructures 9 (11) (2024) 194.
[27] Z. Yang, F. Emmert-Streib, Optimal performance of Binary Relevance CNN in targeted multi-label text classification, Knowl.-Based Syst. 284 (2024) 111286.
[28] J. Peng, S. Huo, Application of an improved convolutional neural network algorithm in text classification, J. Web Eng. 23 (3) (2024) 315-339.
[29] K. Nithya, M. Krishnamoorthi, S.V. Easwaramoorthy, C.R. Dhivyaa, S. Yoo, J. Cho, Hybrid approach of deep feature extraction using BERTOPCNN & FIAC with customized Bi-LSTM for rumor text classification, Alex. Eng. J. 90 (2024) 65-75.
[30] S. Jamshidi, M. Mohammadi, S. Bagheri, H.E. Najafabadi, A. Rezvanian, M. Gheisari, et al., Effective text classification using BERT, MTM LSTM, and DT, Data Knowl. Eng. 151 (2024) 102306.
[31] O. Galal, A.H. Abdel-Gawad, M. Farouk, Federated freeze BERT for text classification, J. Big Data 11 (1) (2024) 28.
[32] C. Eang, S. Lee, Improving the accuracy and effectiveness of text classification based on the integration of the bert model and a recurrent neural network (RNN_Bert_Based), Appl. Sci. 14 (18) (2024) 8388.
[33] M. Ahmed, M.S. Hossain, R.U. Islam, K. Andersson, Explainable text classification model for COVID-19 fake news detection, J. Internet Serv. Inf. Secur. 12 (2) (2022) 51-69.
[34] K. Zahoor, N.Z. Bawany, T. Qamar, Evaluating text classification with explainable artificial intelligence, Int. J. Artif. Intell. ISSN 225 (2024) 28938.
[35] D. Kalla, N. Smith, F. Samaah, Deep learning-based sentiment analysis: Enhancing IMDb review classification with LSTM models, 2025, Available at SSRN 5103558.
[36] R. Beniwal, A.K. Dinkar, A. Kumar, A. Panchal, A hybrid deep learning model for sentiment analysis of IMDB movies reviews, in: 2024 Asia Pacific Conference on Innovation in Technology, APCIT, IEEE, 2024, pp. 1-7.
[37] N. Tabassum, T. Alyas, M. Hamid, M. Saleem, S. Malik, Z. Ali, U. Farooq, Semantic analysis of Urdu English tweets empowered by machine learning, Intell. Autom. Soft Comput. 30 (1) (2021) 175-186.
[38] A. Pandey, R. Yadav, A. Pathak, N. Shivani, B. Garg, A. Pandey, Sentiment analysis of IMDB movie reviews, in: 2024 First International Conference on Software, Systems and Information Technology, SSITCON, IEEE, 2024, pp. 1-6.
[39] R. Amin, R. Gantassi, N. Ahmed, A.H. Alshehri, F.S. Alsubaei, J. Frnda, A hybrid approach for adversarial attack detection based on sentiment analysis model using machine learning, Eng. Sci. Technol. an Int. J. 58 (2024) 101829.
[40] A. Bajaj, D.K. Vishwakarma, HOMOCHAR: A novel adversarial attack framework for exposing the vulnerability of text-based neural sentiment classifiers, Eng. Appl. Artif. Intell. 126 (2023) 106815, http://dx.doi.org/10.1016/j.engappai.2023.106815.
[41] A. Bajaj, D.K. Vishwakarma, Evading text-based emotion detection mechanism via adversarial attacks, Neurocomputing 558 (2023).
[42] G.A. de Oliveira, R.T. de Sousa, R. de O. Albuquerque, L.J.G. Villalba, Adversarial attacks on a lexical sentiment analysis classifier, Comput. Commun. 174 (2021) 154-171, http://dx.doi.org/10.1016/j.comcom.2021.04.026.
[43] M. Hussain, M. Naseer, Comparative analysis of logistic regression, LSTM, and Bi-LSTM models for sentiment analysis on IMDB movie reviews, J. Artif. Intell. Comput. 2 (1) (2024) 1-8.
[44] C.D. Kulathilake, J. Udupihille, S.P. Abeysundara, A. Senoo, Deep learning-driven multi-class classification of brain strokes using computed tomography: A step towards enhanced diagnostic precision, Eur. J. Radiol. 187 (2025) 112109.
[45] Amod, Mental health counseling conversations dataset, 2024, Retrieved from https://huggingface.co/datasets/Amod/mental_health_counseling_conversations/tree/main.
[46] B. Yao, P. Tiwari, Q. Li, Self-supervised pre-trained neural network for quantum natural language processing, Neural Netw. 184 (2025) 107004.
[47] SohamGhadge, Casual conversation dataset, 2024, Retrieved from https://huggingface.co/datasets/SohamGhadge/casual-conversation/tree/main.
[48] Mahfoos, Patient-doctor conversation dataset, 2024, Retrieved from https://huggingface.co/datasets/mahfoos/Patient-Doctor-Conversation/tree/main.
[49] Alimistro123, English chat sentiment dataset, 2024, Retrieved from https://www.kaggle.com/code/alimistro123/english-chat-sentiment-dataset-found.
[50] Adapting, Empathetic dialogues v2 dataset, 2024, Retrieved from https://huggingface.co/datasets/Adapting/empathetic_dialogues_v2.
[51] Y. Singh, Q.A. Hathaway, V. Keishing, S. Salehi, Y. Wei, N. Horvat, D.V. Vera-Garcia, A. Choudhary, A.Mula. Kh, E. Quaia, et al., Beyond post hoc explanations: A comprehensive framework for accountable AI in medical imaging through transparency, Interpret. Explain. Bioeng. 12 (8) (2025) 879.
[52] M. Bayesh, S. Jahan, Embedding security awareness in IoT systems: A framework for providing change impact insights, Appl. Sci. 15 (14) (2025) 7871.