Computer Standards & Interfaces 97 (2026) 104086
Graph-based interpretable dialogue sentiment analysis: A HybridBERT-LSTM framework with semantic interaction explainer

Ercan Atagün a,∗, Günay Temür b, Serdar Biroğul c,d

a Computer Engineering, Institute of Graduate Studies, Duzce University, Düzce, 81000, Turkey
b Kaynasli Vocational School, Duzce University, Düzce, 81000, Turkey
c Department of Computer Engineering, Faculty of Engineering, Duzce University, Düzce, 81000, Turkey
d Department of Electronics and Information Technologies, Faculty of Architecture and Engineering, Nakhchivan State University, Nakhchivan, Azerbaijan
ARTICLE INFO

Keywords:
Natural language processing
Explainable artificial intelligence
Word context graph explainer

ABSTRACT

Conversational sentiment analysis in natural language processing faces substantial challenges due to intricate contextual semantics and temporal dependencies within multi-turn dialogues. We present a novel HybridBERT-LSTM architecture that integrates BERT’s contextualized embeddings with LSTM’s sequential processing capabilities to enhance sentiment classification performance in dialogue scenarios. Our framework employs a dual-pooling mechanism to capture local semantic features and global discourse dependencies, addressing limitations of conventional approaches. Comprehensive evaluation on the IMDb benchmark and real-world dialogue datasets demonstrates that HybridBERT-LSTM consistently improves over standalone models (LSTM, BERT, CNN, SVM) across accuracy, precision, recall, and F1-score metrics. The architecture effectively exploits pre-trained contextual representations through bidirectional LSTM layers for temporal discourse modeling. We introduce WordContextGraphExplainer, a graph-theoretic interpretability framework that addresses the limitations of conventional explanation methods. Unlike LIME’s linear additivity assumptions, which treat features independently, our approach utilizes perturbation-based analysis to model non-linear semantic interactions. The framework generates semantic interaction graphs with nodes representing word contributions and edges encoding inter-word dependencies, visualizing contextual sentiment propagation patterns. Empirical analysis reveals LIME’s inadequacies in capturing the temporal discourse dependencies and collaborative semantic interactions crucial for dialogue sentiment understanding. WordContextGraphExplainer explicitly models semantic interdependencies, negation scope, and temporal flow across conversational turns, enabling comprehensive understanding of both word-level contributions and contextual interaction influences on decision-making processes. This integrated framework establishes a new paradigm for interpretable dialogue sentiment analysis, advancing trustworthy AI through high-performance classification coupled with comprehensive explainability.
1. Introduction

Dialogue-based sentiment analysis constitutes a significant research domain within the field of natural language processing (NLP). This area of study represents a fundamental component of efforts to enhance human–machine interaction through more meaningful and emotion-centric approaches. Research endeavors in this field encompass numerous inherent challenges and complexities. Dialogues typically emerge from the reciprocal interactions among multiple conversational participants, where the scope of communicative content spans the breadth of human knowledge and experience. The emotional orientation of an utterance within a conversational sequence demonstrates substantial dependency upon preceding discourse and contextual cues. This phenomenon necessitates the development of context-aware models for sentiment analysis, as conventional text classification methodologies frequently fail to adequately capture such sequential continuity. The multi-speaker nature of dialogues introduces critical considerations regarding utterance attribution and the identification of emotional expression sources. Modeling sentiment transitions between conversational participants presents particular challenges, especially in scenarios where emotions are expressed through implicit mechanisms. Rather than explicit emotional declarations, human linguistic behavior frequently employs sophisticated rhetorical devices including irony, sarcasm, humor, double entendres, and cultural references, resulting in sentiment interpretations that diverge significantly from surface-level textual analysis. This phenomenon proves particularly problematic in
∗ Corresponding author.
E-mail address: ercanatagun@duzce.edu.tr (E. Atagün).

https://doi.org/10.1016/j.csi.2025.104086
Received 7 June 2025; Received in revised form 7 October 2025; Accepted 13 October 2025
Available online 12 November 2025
0920-5489/© 2025 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
brief, context-independent utterances, substantially complicating sentiment analysis procedures. Contemporary dialogue-based sentiment analysis research faces significant constraints regarding the availability of high-quality, annotated datasets. Existing corpora are characterized by either limited scale or restriction to specific contextual domains such as cinematic dialogue or customer service interactions. Furthermore, the insufficient representation of cultural, linguistic, and social diversity within available datasets impedes the development of generalizable models with robust cross-domain applicability. Deep learning-based sentiment analysis architectures predominantly exhibit ‘‘black box’’ characteristics, rendering their decision-making processes opaque to human interpretation. This limitation particularly diminishes model reliability in tasks where emotional interpretation involves inherent subjectivity, consequently necessitating human oversight in practical applications. In this study, a novel hybrid model is proposed, integrating BERT’s contextualized representation capabilities with the sequential modeling proficiency of LSTM to address the inherent challenges of sentiment analysis in dialogue-based datasets. The architecture is specifically designed to capture both linguistic features and temporal dependencies embedded within conversational structures. To enhance the interpretability of model outputs, a graph-theoretic interpretability framework, termed WordContextGraphExplainer, is introduced. This framework overcomes the limitations of conventional explanation methods by modeling non-linear semantic interactions between lexical units. Through the construction of semantic interaction graphs, the approach facilitates comprehensive visualization of contextual sentiment propagation patterns, offering novel insights into the underlying decision-making mechanisms of the model and establishing a new paradigm for interpretable sentiment analysis in dialogue systems.

2. Related works

Sentiment analysis has gained significant traction in NLP research, driven by its pivotal role in enabling affective computing across domains such as human–computer interaction, intelligent customer support, and conversational AI systems. Recent advancements in the field have led to the development of a diverse array of methodologies, encompassing text-based approaches, multimodal frameworks, contextual modeling techniques, and sophisticated deep learning architectures. This section presents an overview of key contributions in the literature, with a particular emphasis on dialogue-based sentiment analysis, which plays a critical role in domains such as customer support, conversational AI, and empathetic dialogue systems. Song et al. [1] introduced a topic-aware sentiment analysis model for dialogue (CASA), aiming to identify sentiment orientations within conversational threads. Firdaus et al. [2] constructed the MEISD dataset, incorporating textual, audio, and visual data for multimodal sentiment analysis. Emphasizing the relevance of conversational context, Carvalho et al. [3] demonstrated that prior utterances significantly influence sentiment classification outcomes. Building upon this insight, topic-aware sentiment classification models have been proposed using multi-task learning strategies within customer service dialogues [4]. Real-time sentiment analysis in dialogue systems is also a critical consideration. Bertero et al. [5] developed a convolutional neural network capable of processing audio inputs for instantaneous emotion detection in interactive systems. Bothe et al. [6] presented a model to predict the sentiment of upcoming utterances, thereby analyzing emotional transitions throughout dialogue sequences. To address the limitations of unimodal text-based sentiment analysis, recent studies have adopted multimodal strategies by integrating text, speech, and visual signals. For instance, the EmoSen model [7] generates sentiment-aware responses using fused inputs from these modalities. Similarly, Mallol-Ragolta and Schuller [8] introduced a system that personalizes dialogue responses by estimating user emotions and arousal levels. Akbar et al. [9] proposed an innovative emotion-driven framework for video-based sentiment analysis in social media environments, further demonstrating the potential of multimodal affective understanding.

Graph-based modeling has also been incorporated into multimodal sentiment analysis. Zhao and Gao [10] proposed a semantically enriched heterogeneous dialogue graph network to analyze sentiment in multi-party conversations. Yang et al. [11] advanced sentiment accuracy through a model that jointly processes text, audio, and visual cues. Context-awareness is a pivotal factor in sentiment interpretation within dialogues. Carvalho et al. [3] emphasized the influence of preceding discourse on sentiment prediction. To enhance contextual coherence in generative AI dialogue systems, personalized dialogue summarization techniques have been employed [12]. Mustapha [13] proposed a model to analyze sentiment-cause relationships in stress-laden conversations, aiming to reveal emotional dynamics. Contextual memory mechanisms were further explored by Li et al. [14], who developed a bidirectional emotional recurrent unit (BiERU) to capture dynamic context shifts and their implications for sentiment detection. Explainability has gained increasing importance in sentiment analysis. A variety of approaches, including attention mechanisms, graph neural networks, and neuro-symbolic architectures, have been introduced to elucidate model decision-making. Poria et al. [15] discussed fundamental challenges in sentiment interpretation and underscored the role of explainability. Zhu et al. [16] developed a neuro-symbolic model for personalized sentiment analysis, incorporating user-specific contextual factors into the explanatory framework. Luo et al. [17] introduced the PanoSent dataset to improve the analysis of emotional shifts in interactive systems. In another direction, Zhang et al. proposed a novel interaction network inspired by quantum theory to reframe dialogue-based sentiment analysis [18]. Yang et al. [19] addressed the inadequacies of existing pre-trained models in capturing the logical structure of dialogues. To overcome these limitations, they proposed a new pre-training framework comprising utterance order modeling, sentence skeleton reconstruction, and sentiment shift detection, demonstrating improvements in learning emotion interactions and discourse coherence. Collectively, recent developments in sentiment analysis emphasize the significance of contextual awareness, multimodal data fusion, graph-based reasoning, and explainable AI techniques in enhancing performance and interpretability within dialogue-centric applications.

3. Materials and methods

The dialogue dataset comprises dyadic conversational exchanges between two distinct participants. Each dialogue instance is structured as a sequence of alternating utterances, where each turn is associated with a specific speaker and the corresponding textual content. The formal mathematical representation of the dialogue structure is given by:

𝒟 = {(𝑠𝑖, 𝑡𝑖)} for 𝑖 = 1, …, 𝑁, with 𝑠𝑖 ∈ 𝑆 = {𝐴, 𝐵} and 𝑡𝑖 ∈ 𝛴∗

Here, 𝒟 denotes the complete dialogue dataset composed of 𝑁 conversational turns. Each pair (𝑠𝑖, 𝑡𝑖) represents the 𝑖-th turn in the dialogue, where 𝑠𝑖 is the speaker identifier and 𝑡𝑖 is the corresponding utterance. The speaker set 𝑆 = {𝐴, 𝐵} contains two participants, typically alternating in a turn-based structure. The term 𝛴 represents the alphabet of the natural language in which the dialogue is conducted, and 𝛴∗ denotes the set of all finite-length strings (i.e., possible utterances) formed from this alphabet.
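Concretely, this formalization maps onto an ordered list of (speaker, utterance) pairs. The following minimal Python sketch is our own illustration; the paper prescribes no particular implementation, and all names are hypothetical:

```python
# Illustrative sketch of the dialogue representation D = {(s_i, t_i)}:
# each turn pairs a speaker s_i from S = {"A", "B"} with an utterance t_i.

SPEAKERS = {"A", "B"}  # the speaker set S

def is_valid_dialogue(dialogue):
    """Check that every turn is a (speaker, utterance) pair with s_i in S."""
    return all(
        speaker in SPEAKERS and isinstance(utterance, str)
        for speaker, utterance in dialogue
    )

dialogue = [
    ("A", "Did you enjoy the film?"),
    ("B", "Honestly, it was not good at all."),
    ("A", "That is a shame, the reviews were great."),
]

assert is_valid_dialogue(dialogue)
print(len(dialogue))  # N, the number of conversational turns
```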
3.1. Data preprocessing and word embedding

The successful training of natural language processing (NLP) models is highly dependent on the transformation of raw textual data into structured and semantically meaningful representations [20]. In this study, all textual inputs undergo a series of preprocessing operations designed to optimize them for subsequent modeling tasks. An initial and essential preprocessing step involves lowercasing, which standardizes textual input by mitigating case sensitivity inconsistencies that would otherwise lead to redundant representations of semantically identical words. This step is particularly critical for ensuring the effectiveness and consistency of word embedding techniques. Given that parts of the dataset originate from web-based sources, residual HTML tags and encoded entities such as <br> are present in the raw text. These components provide no linguistic or semantic value and may negatively affect model performance. Therefore, all HTML-related tokens and special characters are systematically removed in the preprocessing phase to reduce noise within the input space and to enhance the robustness of downstream NLP models. This comprehensive cleaning process is implemented using the Python NLTK and BeautifulSoup libraries combined with regular expression patterns to ensure thorough removal of web-derived artifacts. Additionally, standard stopword removal is applied to eliminate semantically non-contributive terms. Notably, traditional morphological normalization techniques such as stemming and lemmatization are deliberately excluded from our preprocessing pipeline, as BERT’s contextualized embedding framework inherently captures morphological variations and semantic relationships without requiring explicit normalization steps.

Following text normalization, each cleaned sentence is tokenized into subword or word-level units. These token sequences are then converted into dense numerical representations using word embedding techniques such as GloVe [21]. Embedding techniques project discrete textual units into continuous vector representations that encapsulate both semantic coherence and syntactic structure, thereby facilitating computational models in capturing lexical relatedness and contextual alignment within language data. Let the original unprocessed dataset be represented [22] as:

𝑇 = {𝑠1, 𝑠2, …, 𝑠𝑁}

where each sentence 𝑠𝑘 is defined as [23] a sequence of 𝑀 words:

𝑠𝑘 = {𝑢1, 𝑢2, …, 𝑢𝑀}

To refine the input, special characters 𝒞, web-related entities 𝒲, and semantically non-contributive stopwords 𝒮 are eliminated. The cleaned sentence is thus defined by:

𝑠′𝑘 = Clean(𝑠𝑘) = {𝑢𝑗 ∈ 𝑠𝑘 ∣ 𝑢𝑗 ∉ (𝒞 ∪ 𝒲 ∪ 𝒮)}

The sanitized sentence 𝑠′𝑘 is then tokenized:

𝑠′𝑘 = {𝑣1, 𝑣2, …, 𝑣𝑃}, 𝑣𝑖 ∈ 𝒱

where 𝒱 denotes the vocabulary of all tokens in the dataset.

Word embeddings serve as a cornerstone for text classification, as they enable models to capture abstract semantic relationships while reducing the dimensionality of input features. Unlike traditional bag-of-words approaches, embeddings are resilient to linguistic variability such as synonymy and polysemy. For sentiment analysis tasks, embeddings can cluster words with similar affective connotations, thereby enhancing the model’s ability to generalize and detect implicit sentiments. Likewise, in general classification tasks, embeddings help reveal thematic cohesion across texts, ultimately contributing to improved predictive performance. Nevertheless, conventional embeddings like Word2Vec or GloVe are context-independent, assigning the same vector representation to a word regardless of its usage context. This limitation is addressed by contextualized models such as BERT, which generate dynamic embeddings based on surrounding words using transformer-based architectures. Word embeddings bridge the gap between linguistic expressiveness and computational tractability and remain an indispensable component of modern NLP pipelines.

3.2. GloVe: Global vectors for word representation

GloVe [24] is a widely adopted word embedding technique designed to capture semantic and conceptual relationships between words, particularly in text classification tasks. It operates by constructing word vector representations through the optimization of global word co-occurrence statistics derived from large-scale corpora. Unlike local context-based models such as Word2Vec, GloVe incorporates both local and global contextual information, embedding lexical units into a dense, continuous vector space. In practical applications, GloVe embeddings are employed to convert unstructured input text into fixed-length numerical tensors, which serve as inputs to deep learning architectures such as CNN and LSTM models. This transformation enables the model to effectively distinguish between textual classes by capturing both syntactic patterns and latent semantic features. The key advantage of GloVe lies in its ability to unify global corpus-level statistical information with local context, producing more stable and semantically meaningful representations compared to models relying solely on window-based learning. However, it remains a static embedding technique; each word is assigned a single vector regardless of its context within a sentence. This context-independent nature limits its flexibility when compared to transformer-based models like BERT, which generate dynamic embeddings conditioned on the broader linguistic environment. Despite these limitations, GloVe continues to play a significant role in various NLP tasks such as text similarity, topic labeling, spam detection, and sentiment analysis, where modeling word-level semantics remains essential. Its computational simplicity and ease of integration make it a reliable baseline in many NLP pipelines. Recent studies [25] have highlighted the importance of consistent embedding strategies when comparing different NLP models, as variations in embedding approaches can significantly impact performance comparisons and lead to biased evaluations.

3.3. Support Vector Machine (SVM)

Support Vector Machine (SVM) [26] is a well-established supervised learning algorithm widely employed in text classification tasks, particularly due to its robustness in handling high-dimensional data representations. In natural language processing pipelines, textual inputs are typically transformed into numerical feature vectors using techniques such as Term Frequency–Inverse Document Frequency (TF-IDF) or various word embedding models. Once converted, SVM operates by identifying the optimal hyperplane that best separates the data points into distinct class labels. The core principle of SVM lies in maximizing the margin between classes, thereby enhancing generalization performance. This is particularly advantageous in scenarios where the feature space exhibits high dimensionality and potential overlap between class distributions. Furthermore, SVM’s ability to incorporate non-linear kernel functions such as polynomial or radial basis function (RBF) kernels enables it to capture complex, non-linear patterns within the data, which are often present in linguistically rich or semantically ambiguous textual inputs. Due to its mathematically grounded optimization framework and resistance to overfitting, SVM remains a competitive baseline in various text classification domains, including sentiment analysis, spam detection, and topic categorization. Its effectiveness is further enhanced when combined with appropriate feature engineering and dimensionality reduction techniques, making it a viable choice for both small-scale and large-scale NLP applications.

3.4. Convolutional Neural Networks (CNN)

Although originally developed for image recognition tasks, Convolutional Neural Networks (CNNs) have been extensively adapted for various natural language processing problems, particularly in multi-label text classification [27] and sentiment analysis [28], due to their capacity to capture local hierarchical patterns in sequential data.
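A text CNN's basic operation, sliding a filter over embedded n-grams, applying a non-linear activation, and max-pooling the responses, can be sketched in plain Python. The embedding and filter values below are toy numbers chosen for illustration only, not learned parameters or a configuration from this paper:

```python
# Sketch of 1D convolution + max-pooling over a sequence of word embeddings:
# a filter of width w slides over consecutive w-gram windows, each window is
# scored by an element-wise multiply-and-sum, ReLU is applied, and
# max-pooling keeps the strongest response across positions.

def conv1d_max(embeddings, kernel):
    """Apply one convolutional filter over all windows and max-pool."""
    w = len(kernel)
    feature_map = []
    for i in range(len(embeddings) - w + 1):
        score = sum(
            e * k
            for emb, ker in zip(embeddings[i:i + w], kernel)
            for e, k in zip(emb, ker)
        )
        feature_map.append(max(0.0, score))  # ReLU activation
    return max(feature_map)                  # max-pooling over positions

# Toy 2-dimensional embeddings for a 4-token sentence and one bigram filter.
sent = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.1]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
print(conv1d_max(sent, bigram_filter))  # strongest bigram response: 1.5
```

In a real model many such filters of varying widths run in parallel, and their pooled outputs form the feature vector passed to the classifier.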
In text classification applications, CNNs operate on word embeddings by applying one-dimensional convolutional filters to detect local patterns such as n-grams or syntactic motifs. These filters perform element-wise multiplications followed by non-linear activation functions to generate feature maps that emphasize the most informative regions of the input sequence. A subsequent max-pooling operation reduces the dimensionality and retains the most salient features, thereby enabling the network to focus on contextually rich segments of text. This architecture allows CNNs to efficiently model contextual dependencies within fixed-size receptive fields, making them particularly suitable for tasks such as topic categorization, polarity detection, and aspect-based sentiment analysis. Compared to recurrent neural networks (RNNs), CNNs offer significant advantages in terms of computational efficiency and parallelizability, as they do not rely on sequential input processing. However, one notable limitation of CNNs is their reduced capacity to model long-range dependencies, which can affect performance in tasks involving lengthy or complex discourse structures.

3.5. Long Short-Term Memory Networks (LSTM)

LSTM networks, as a refined subclass of recurrent neural architectures, have demonstrated substantial effectiveness in text classification tasks due to their capacity to capture long-range dependencies and preserve semantically meaningful representations across sequential data inputs [29]. By incorporating internal memory units and a gated control mechanism – comprising input, forget, and output gates – LSTM models effectively address the vanishing gradient challenge that limits conventional RNNs. These gating components orchestrate information flow dynamically, facilitating the retention of salient features over prolonged contexts and ensuring the continuity of semantic interpretation throughout the sequence [30]. In text classification applications, LSTMs typically process input sequences encoded as dense word embeddings, allowing the network to learn hierarchical feature representations that encapsulate both syntactic structure and semantic meaning. This capacity to capture nuanced contextual relationships makes LSTM particularly effective in tasks such as sentiment analysis, text similarity, spam detection, and topic categorization, where subtle variations in word order and polarity significantly influence predictive accuracy. For instance, in sentiment classification, LSTM models can differentiate between expressions like ‘‘not good’’ and ‘‘extremely good’’ by maintaining a dynamic memory of temporal context throughout the sequence.

3.6. Bidirectional Encoder Representations from Transformers (BERT)

BERT is a transformer-based, pre-trained language model that has substantially advanced the state of the art in text classification tasks by capturing bidirectional contextual semantics through self-attention mechanisms [31]. Unlike unidirectional models such as LSTM or GRU, which process text sequentially, BERT encodes semantic dependencies from both left and right contexts simultaneously. This architecture enables nuanced disambiguation of polysemous words and more robust modeling of long-range dependencies in natural language [32]. In text classification applications, BERT is typically fine-tuned on task-specific labeled datasets. This involves appending a classification layer, often a dense layer with softmax activation, on top of the pre-trained BERT encoder. Through this transfer learning paradigm, BERT exhibits superior performance across a variety of NLP tasks including sentiment classification, aspect-based sentiment analysis, and multi-label classification, particularly in settings characterized by contextual ambiguity and hierarchical dependencies. However, BERT’s practical deployment presents several challenges. Its high computational complexity, sensitivity to input sequence length, and the requirement for large volumes of labeled data during fine-tuning can pose significant barriers in real-world scenarios. To mitigate these limitations, hybrid architectures that integrate BERT with more lightweight modeling components have been proposed. These hybrid solutions aim to retain BERT’s rich contextual understanding while improving computational efficiency and generalizability, making them more suitable for applications constrained by resources or latency requirements.

3.7. Local Interpretable Model-Agnostic Explanations (LIME)

LIME is a model-agnostic interpretability framework designed to provide localized explanations for the predictions of complex machine learning models. Positioned within the broader field of Explainable Artificial Intelligence (XAI), LIME serves to enhance the interpretability of opaque ‘‘black-box’’ systems, particularly in high-stakes domains where transparency and trust are critical [33]. LIME’s main goal is to provide a straightforward, interpretable surrogate model that, within the local neighborhood of a particular instance, roughly represents the original model’s decision boundary [34]. LIME accomplishes this by perturbing the original input to generate a set of synthetic samples close to the target instance. The black-box model is then applied to these altered examples to obtain the corresponding predictions. These cases are then subjected to a locality-sensitive weighting function, and the decision function is locally approximated by training a sparse linear model on the weighted dataset. The contribution of each feature to the final prediction is inferred from the surrogate model’s resulting coefficients. One of the key strengths of LIME lies in its model-agnostic design, allowing it to be applied across a wide range of machine learning algorithms, including ensemble methods, deep neural networks, and support vector machines. It offers human-understandable explanations while maintaining local fidelity to the original model. As such, LIME is widely adopted for increasing decision transparency and enabling human-AI collaboration, particularly in sensitive applications such as healthcare diagnostics, financial risk assessment, and legal reasoning.

3.8. WordContextGraphExplainer

The exponential growth in transformer-based natural language processing (NLP) architectures has created an unprecedented demand for interpretability frameworks capable of elucidating the complex decision-making processes underlying these black-box models. While widely adopted XAI techniques such as LIME (Local Interpretable Model-Agnostic Explanations) and SHAP (SHapley Additive Explanations) offer valuable insights through feature attribution, they inherently rely on linear additivity assumptions among input features. This assumption falls short in capturing the intricate semantic dependencies and non-linear interactions that characterize deep language understanding. A fundamental limitation of existing approaches lies in their inability to model contextual interdependencies between words: relationships that are crucial for interpreting sentiment propagation, negation scope, and semantic coherence in complex linguistic structures. Traditional token-level attribution methods treat individual words as independent contributors, failing to account for the synergistic effects that emerge from word pairings and contextual associations in the semantic space. In this paper, WordContextGraphExplainer is introduced as a novel graph-theoretic interpretability framework developed to enhance the transparency of transformer-based sentiment classification systems. The methodology is built upon a systematic perturbation analysis paradigm, in which masked language modeling is employed to estimate both individual lexical contributions and pairwise semantic interactions. In contrast to linear attribution methods, this approach explicitly models non-linear dependencies by quantifying the divergence between observed joint effects and the expected additive influence of word pairs. At the core of the framework is the construction of a semantic interaction graph, where nodes represent individual words annotated with their relative sentiment contributions, and edges encode the magnitude and directionality of inter-word dependencies.
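The graph construction just described can be sketched as follows. This is our own minimal illustration, not the paper's implementation: the scores are toy stand-ins for the masked-perturbation estimates, and the function and variable names are hypothetical. Nodes carry per-word contributions; edge weights hold the deviation of a pair's observed joint effect from the additive expectation, with top-k filtering applied:

```python
import itertools

import networkx as nx  # the paper uses NetworkX for graph layouts

def build_interaction_graph(words, single_effect, joint_effect, top_k=3):
    """Build a semantic interaction graph: nodes hold per-word sentiment
    contributions; edges hold non-linear interaction strengths, measured
    as the deviation of a pair's joint effect from additivity."""
    graph = nx.Graph()
    for w in words:
        graph.add_node(w, contribution=single_effect[w])
    interactions = []
    for u, v in itertools.combinations(words, 2):
        deviation = joint_effect[(u, v)] - (single_effect[u] + single_effect[v])
        interactions.append((u, v, abs(deviation)))
    # top-k interaction filtering for computational scalability
    interactions.sort(key=lambda t: t[2], reverse=True)
    for u, v, weight in interactions[:top_k]:
        graph.add_edge(u, v, weight=weight)
    return graph

# Toy scores standing in for masked-perturbation estimates.
single = {"not": -0.2, "good": 0.6, "movie": 0.1}
joint = {("not", "good"): -0.7, ("not", "movie"): -0.1, ("good", "movie"): 0.7}
g = build_interaction_graph(list(single), single, joint, top_k=2)
print(sorted(g.edges(data="weight")))
```

Note how the ("not", "good") pair receives the largest edge weight: its strongly negative joint effect diverges sharply from the mildly positive additive expectation, which is exactly the kind of negation-scope interaction a linear attribution method would miss.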
This graph-based representation facilitates intuitive visualization of complex linguistic relationships through NetworkX-based layouts, enabling deeper insight into how contextual factors influence model predictions. The framework demonstrates particular efficacy in sentiment analysis tasks where nuanced interactions between affective indicators, negation patterns, and contextual modifiers significantly impact interpretive accuracy. By providing interpretable visualizations of semantic interaction networks, WordContextGraphExplainer supports advanced model debugging, bias detection, and clinical decision support in sensitive domains such as mental health assessment and medical text analytics. Moreover, the framework incorporates a top-k interaction filtering mechanism, ensuring computational scalability while preserving the granularity required for interpretable analysis in high-stakes applications. This methodological advancement represents a critical step toward the development of trustworthy AI systems that combine linguistic reasoning with transparent explanatory capabilities, offering a robust foundation for real-world deployment.

Algorithm 1: WordContextGraphExplainer Method
Input: Text 𝑇, transformer model 𝑀, tokenizer 𝜏, feature number 𝑘 ≥ 1, device 𝑑.
Output: Word context graph 𝐺 with semantic interactions.
1: Compute baseline prediction 𝑃0 = 𝑀(𝑇).

… capturing contextual semantics, its self-attention mechanism may not fully exploit the sequential dependencies within dialogue utterances. To mitigate this limitation, bidirectional LSTM layers are incorporated to model temporal patterns and discourse-level relationships across token sequences. These layers are adept at retaining long-range dependencies and recognizing sentiment transitions across multi-turn dialogue. By integrating these two components, the proposed HybridBERT-LSTM architecture achieves a richer understanding of both the global context and local structure of textual data, enhancing its capability to discern sentiment in complex conversational scenarios. This dual modeling approach positions the framework as a robust solution for sentiment classification tasks, particularly in dialogue-rich environments where contextual flow and temporal coherence are paramount.

3.9. Model architecture

The proposed model processes input text through a series of transformation stages, mathematically formalized as follows. Given an input sequence

𝑋 = {𝑥1, 𝑥2, …, 𝑥𝑛}, where 𝑛 ≤ 256,

the BERT encoder maps each token 𝑥𝑖 to a contextualized embedding,
|
||
2: Compute predicted_class = arg max(𝑃0 ). producing a sequence of hidden states:
|
||
3: Initialize 𝑊 = 𝜏(𝑇 ), word_effects = ∅, interactions = ∅. 𝐻 = BERT(𝑋) ∈ R𝑛×𝑑BERT
|
||
4: for each 𝑤𝑖 ∈ 𝑊 do 5: 𝑇masked = replace(𝑇 , 𝑤𝑖 , ‘[𝙼𝙰𝚂𝙺]’)
|
||
6: 𝑃masked = 𝑀(𝑇masked ) where 𝑑BERT = 768 represents the dimensionality of BERT’s contextual
|
||
7: word_effects[𝑖] = 𝑃0 − 𝑃masked embeddings.
|
||
8: end for The sequence 𝐻 is passed to a 3-layer bidirectional LSTM net-
|
||
9: for each (𝑤𝑖 , 𝑤𝑗 ) ∈ combinations(𝑊 , 2) do work to capture temporal dependencies beyond what is modeled by
|
||
10: 𝑇pair = replace(𝑇 , [𝑤𝑖 , 𝑤𝑗 ], ‘[𝙼𝙰𝚂𝙺]’) self-attention:
|
||
11: 𝑃pair = 𝑀(𝑇pair ) ⃖⃗𝑡 = LSTMforward (𝐻𝑡 , ℎ
|
||
ℎ ⃖⃗𝑡−1 ), ⃖⃖
|
||
ℎ𝑡 = LSTMbackward (𝐻𝑡 , ⃖⃖
|
||
ℎ𝑡+1 )
|
||
12: actual_effect = 𝑃0 − 𝑃pair
|
||
13: expected_effect = word_effects[𝑖] + word_effects[𝑗] The final representation for each token is obtained by concatenating
|
||
14: interaction𝑖𝑗 = actual_effect − expected_effect the forward and backward hidden states:
|
||
‖ ‖
|
||
15: interactions[(𝑤𝑖 , 𝑤𝑗 )] = ‖interaction𝑖𝑗 ‖ ℎLSTM ℎ𝑡 ] ∈ R2𝑑LSTM
|
||
⃖⃗𝑡 ; ⃖⃖
|
||
= [ℎ
|
||
‖ ‖2 𝑡
|
||
16: end for
|
||
17: Sort interactions by magnitude in descending order. with 𝑑LSTM = 256, resulting in a 512-dimensional output per token.
|
||
To obtain a fixed-length vector representation of the sequence, both
|
||
18: 𝑡𝑜𝑝_𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠 = 𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠[∶ 𝑘]
|
||
average and maximum pooling operations are applied:
|
||
19: Construct graph 𝐺 = (𝑉 , 𝐸) where 𝑉 = 𝑊 and 𝐸 =
|
||
1 ∑ LSTM
|
||
𝑛
|
||
𝑡𝑜𝑝_𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠
|
||
ℎavg = ℎ , ℎmax = max ℎLSTM
|
||
20: Compute layout positions using 𝑜𝑟𝑔𝑎𝑛𝑖𝑧𝑒𝑑_𝑙𝑎𝑦𝑜𝑢𝑡(𝑊 , 𝑛 𝑖=1 𝑖 𝑖
|
||
1≤𝑖≤𝑛
|
||
𝑡𝑜𝑝_𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠)
|
||
21: Visualize 𝐺 with NetworkX rendering and semantic color These vectors are concatenated to form the final sequence representa-
|
||
coding tion:
|
||
22: Return 𝐺 ℎcombined = [ℎavg ; ℎmax ] ∈ R4𝑑LSTM = R1024
|
||
|
||
In this study, a hybrid architecture is proposed that integrates a Feed-forward classification
|
||
pre-trained BERT model with a bidirectional Long Short-Term Memory
|
||
(BiLSTM) network to address the task of sentiment classification. The The combined representation is passed through a feed-forward neu-
|
||
model processes textual input to generate sentiment label predictions, ral network with dropout regularization:
|
||
effectively capturing both semantic context and temporal structure
|
||
𝑧1 = Dropout0.3 (ℎcombined )
|
||
inherent in natural language. Grounded in a transformer-based archi-
|
||
tecture, the system accepts input sequences of up to 256 tokens, apply- followed by a two-layer multilayer perceptron (MLP) with ReLU acti-
|
||
ing appropriate padding and truncation mechanisms when necessary vation and softmax output for multi-class classification.
|
||
to standardize input lengths. The HybridBERT-LSTM model embodies This is followed by a two-layer MLP classifier, using a ReLU acti-
|
||
a synergistic design that leverages the complementary strengths of vation and softmax output for multi-class prediction. The HybridBERT-
|
||
transformer-based language models and recurrent neural networks. LSTM architecture integrates the strengths of transformer-based con-
|
||
This hybrid framework is explicitly engineered to address two crit- textual modeling with the sequential learning capabilities of recurrent
|
||
ical aspects of sentiment analysis: contextual representation and se- neural networks. While BERT excels in capturing bidirectional semantic
|
||
quential modeling. Contextual Representation: The BERT encoder, pre- context via self-attention, the inclusion of bidirectional LSTM layers
|
||
trained on large-scale corpora, produces deep contextualized embed- enhances the model’s ability to capture sequential dependencies and
|
||
dings by employing multi-head self-attention mechanisms. These em- emotional transitions throughout dialogue sequences. The dual pooling
|
||
beddings capture nuanced semantic and syntactic information, enabling strategy(average and max pooling) provides a comprehensive summary
|
||
the model to differentiate between polysemous expressions and context- of the sequence. Average pooling captures the overall sentiment distri-
|
||
dependent sentiment cues. Sequential Modeling: While BERT excels at bution across the sequence, whereas max pooling emphasizes salient
|
||
|
||
5
|
||
E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086
|
||
|
||
|
||
emotional cues. This duality enriches the feature space and contributes to more robust classification. Furthermore, hierarchical feature abstraction is enabled by stacking multiple LSTM layers, allowing the model to learn long-range patterns more effectively than shallow RNN structures. Dropout layers, strategically placed after pooling (with a rate of 0.3) and within the classifier (rate 0.2), serve as regularization mechanisms to prevent overfitting, especially during fine-tuning on task-specific datasets. The model is trained using the AdamW optimizer with a learning rate of 2 × 10⁻⁵, and the cross-entropy loss function is employed as the objective. Performance evaluation is conducted using standard metrics including accuracy, precision, recall, and F1-score, ensuring comprehensive validation of the model's classification capability. The model integrates a pre-trained BERT encoder for capturing deep contextual embeddings from input text sequences, followed by a multi-layer bidirectional LSTM network that models sequential dependencies across tokens. To derive a robust sentence-level representation, dual pooling operations (average and maximum pooling) are applied to the LSTM outputs. The concatenated feature vector is then passed through a fully connected neural network with dropout regularization, culminating in a softmax classifier for multi-class sentiment prediction. This hybrid architecture is designed to jointly leverage the representational richness of transformer encoders and the temporal modeling strength of recurrent networks, effectively addressing both local semantics and discourse-level sentiment dynamics within multi-turn dialogues.

The computational overhead of HybridBERT-LSTM represents a critical consideration for practical deployment, particularly in real-time applications such as conversational AI systems. The theoretical complexity of the proposed architecture can be decomposed into its constituent components to understand the computational requirements. The BERT component contributes O(n² × d_BERT) = O(n² × 768) complexity due to the quadratic scaling of the self-attention mechanism, where n represents the sequence length and d_BERT denotes the BERT embedding dimension. The subsequent 3-layer BiLSTM processing adds O(3 × n × d_LSTM²) = O(3 × n × 256²) complexity, where d_LSTM represents the LSTM hidden dimension. Consequently, the overall HybridBERT-LSTM complexity is O(n² × 768 + 3n × 65,536). This represents a significant computational increase compared to standalone BERT (O(n² × 768)) or LSTM models (O(n × d_LSTM²)), which may limit deployment in latency-sensitive applications. However, the empirical results demonstrate that the performance gains justify this additional overhead in scenarios where accuracy is prioritized over computational efficiency.

4. Experimental results

This section presents the configurations of the models utilized in the experiments, detailing the corresponding hyperparameters and implementation settings. The objective is to ensure reproducibility and provide a comprehensive understanding of the experimental setup.

4.1. Model hyperparameters

The deep learning models were trained using a variety of hyperparameter configurations tailored to the architecture and task requirements. These configurations include parameters such as learning rate, batch size, maximum input sequence length, number of training epochs, optimizer type, and loss function. Additionally, architecture-specific settings such as the number of LSTM layers, dropout rates, and hidden state dimensions are systematically defined. For models utilizing pre-trained components (e.g., BERT), both the base model and tokenizer versions are explicitly specified. The subsequent tables summarize the detailed parameter values for each model employed in this study, including HybridBERT-LSTM, BERT-only, LSTM, CNN, and SVM-based classifiers.

The parameter values of the model developed in this study are detailed in Table 1.

Table 1
HybridBERT-LSTM Model Parameters.
Parameter name             Parameter value
Model architecture         BERT encoder + BiLSTM + MLP
Base model                 google-bert/bert-base-uncased
Tokenizer                  google-bert/bert-base-uncased
Maximum sequence length    256
LSTM layer                 6
Batch size                 32
Number of epochs           5
Learning rate              0.00002
Optimization algorithm     AdamW
Loss function              CrossEntropyLoss
LSTM latent size           256
Pooling                    avg + max pooling
MLP layer                  Linear(1024→128) → ReLU → Linear(128→n_classes)
Dropout rates              0.3

The parameters used for the BERT model employed in this study are presented in Table 2.

Table 2
BERT Model Parameters.
Parameter name             Parameter value
Base model                 google-bert/bert-base-uncased
Tokenizer                  google-bert/bert-base-uncased
Input length               128
Batch size                 16
Number of epochs           5
Learning rate              0.00002
Loss function              BertForSequenceClassification – Cross-Entropy
Optimization algorithm     AdamW

The parameter configurations utilized in the LSTM-based model developed for this study are detailed in Table 3.

Table 3
LSTM Model Parameters.
Parameter name             Parameter value
Embedding type             GloVe
Embedding size             100
Maximum number of words    5000
LSTM layer number          6
LSTM unit number           128/256
Dropout rate               0.5
Output layer (Dense)       Softmax
Optimization algorithm     Adam
Loss function              Sparse Categorical Crossentropy
Epoch number               50
Batch size                 32

The parameter configurations utilized in the CNN model developed for this study are detailed in Table 4.

Table 4
CNN Model Parameters.
Parameter name             Parameter value
Embedding type             GloVe
Embedding size             100
Maximum number of words    5000
Input layer                Embedding(input_dim=5000, output_dim=100)
Number of Conv1D layers    6
Number of Conv1D filters   128
Kernel size                5
Activation function        ReLU
Padding                    Same
Pooling                    MaxPooling1D (pool_size=2)
Dropout rate               0.5
Global pooling             GlobalMaxPooling1D
Output layer (Dense)       Softmax
Loss function              sparse_categorical_crossentropy
Optimization algorithm     Adam
Evaluation metric          Accuracy
Number of epochs           50
Batch size                 32

Table 5 summarizes the parameter values defined for the SVM model.

Table 5
SVM Model Parameters.
Parameter name             Parameter value
Embedding type             GloVe
Embedding size             100
Maximum number of words    5000
SVM kernel                 Linear

Table 6 presents a comparative evaluation of various machine learning and deep learning models in the context of sentiment analysis on the widely adopted IMDB dataset. Among the examined methods, the proposed HybridBERT-LSTM architecture achieved the highest accuracy rate of 98.14%, demonstrating a substantial improvement over other baseline models included in the analysis. This notable enhancement underscores the effectiveness of combining contextual embeddings from BERT with the sequential modeling capabilities of LSTM. The IMDB dataset was selected for evaluation due to its extensive usage and established credibility in the sentiment analysis literature, serving as a robust benchmark for comparative performance assessment.

Table 6
IMDB Dataset Accuracy Comparison.
Reference        Method                           Accuracy
[35]             LSTM                             83.7%
[36]             CNN+LSTM                         96.01%
[37]             LSTM+RNN                         92.00%
[38]             BERT                             93.97%
[39]             A hybrid approach                95.6%
[40]             HOMOCHAR                         95.91%
[41]             Textual Emotion Analysis (TEA)   93%
[42]             Lexical + Adversarial attacks    85%
[43]             Logistic Regression              89.42%
Proposed Model   HybridBERT-LSTM                  98.14%

4.2. Statistical significance testing

In order to determine whether the observed differences in model performance metrics [44] were statistically significant, we employed the Welch's two-sample t-test, which is widely recommended when comparing two groups with potentially unequal variances and sample sizes. This approach provides a robust test of mean differences without assuming homogeneity of variances, which is particularly important in machine learning experiments where stochastic training procedures may lead to heterogeneous variability across models.

Let x̄₁ and x̄₂ denote the sample means of the two models being compared, s₁ and s₂ the corresponding standard deviations, and n₁ and n₂ the number of independent runs. The Welch's t-statistic is defined as:

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

The approximate degrees of freedom (df) for this test are calculated according to the Welch–Satterthwaite equation:

df = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]

Given the test statistic and degrees of freedom, the p-value is obtained by evaluating the probability of observing a difference as extreme as, or more extreme than, the measured difference under the null hypothesis (H₀) that the two models exhibit equal mean performance. Since our interest lies in detecting differences in either direction, a two-tailed test is used:

p = 2 × P(T ≥ |t|),

where T follows the Student's t-distribution with df degrees of freedom.

If p < 0.05, the difference is considered statistically significant, indicating strong evidence against the null hypothesis. In this case, we conclude that one model outperforms the other beyond what would be expected by random variation. If p ≥ 0.05, the difference is considered not statistically significant, implying that the observed discrepancy may reasonably be attributed to experimental variability.

In addition to reporting p-values, effect sizes (Cohen's d) were also computed to quantify the magnitude of the observed differences. While statistical significance indicates whether a difference is unlikely to be due to chance, effect size provides a measure of its practical relevance. Together, these statistics provide a comprehensive assessment of the comparative performance of the evaluated models.

4.3. Experimental results on datasets

Dataset 1 consists of question–answer pairs collected from two independent online counseling and psychotherapy platforms [45]. The user-generated questions span a wide range of topics related to mental health, including emotional well-being, interpersonal issues, and psychological disorders. Each response was authored by licensed psychologists, ensuring both clinical relevance and linguistic reliability. In total, the dataset comprises 7,025 dialogue instances.

Tables 7 and 8 present the training and testing performances, respectively, of five distinct models (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) evaluated on Dataset 1. The models were assessed using standard classification metrics including accuracy, precision, recall, and F1-score, providing a comparative analysis of both their internal consistency and generalizability.

An ablation study [46] is a systematic experimental methodology used to evaluate the individual contributions of specific model components by selectively removing or modifying them while keeping other factors constant. This approach provides empirical evidence for the importance of particular architectural elements in determining the model's overall performance. To rigorously assess whether HybridBERT-LSTM's performance gains arise from architectural design rather than mere parameter expansion, we conducted a comprehensive ablation study with parameter-matched baselines. Six model variants were constructed: (1) BERT-Only baseline using the [CLS] token for classification, (2) BERT-ParamMatched with additional dense layers matching the BiLSTM parameter count, (3) BERT+UniLSTM with a unidirectional LSTM, (4) BERT+BiLSTM-NoPooling without dual pooling, (5) BERT+BiLSTM with frozen BERT isolating pure LSTM contribution, and (6) HybridBERT-LSTM (Full) incorporating all proposed components.

When Table 9 is examined, which shows the ablation test for Dataset 1, the BERT-ParamMatched model achieves an accuracy of 95.35% ± 0.38% despite having an equivalent number of parameters to the full model, whereas HybridBERT-LSTM attains 95.94% ± 0.15%. The hierarchical performance degradation across ablation variants reveals the marginal contribution of each component: dual pooling adds +0.19% (95.94% vs. 95.75%), bidirectionality contributes +0.17% (95.75% vs. 95.58%), and the sequential LSTM architecture over feedforward MLP layers provides +0.23% (95.58% vs. 95.35%). The frozen BERT experiment (91.80% ± 0.65%) isolates critical insights regarding representation quality versus fine-tuning contributions. As shown in Table 9, the ablation study on Dataset 1 systematically confirms that HybridBERT-LSTM's performance advantage arises from its architectural design rather than from parameter count inflation.
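The Welch statistics used for these pairwise comparisons (Section 4.2) reduce to a few lines of code. The following is a minimal sketch using only the standard library; `welch_t_test` is an illustrative helper, and the run scores below are invented placeholders, not measurements from this study:

```python
import math
from statistics import mean, variance  # variance() is the sample variance s^2

def welch_t_test(a, b):
    """Welch's two-sample t-statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a), variance(b)          # s1^2, s2^2
    se2 = v1 / n1 + v2 / n2                    # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Illustrative scores from repeated runs of two hypothetical models.
model_a = [10, 12, 14]
model_b = [11, 13, 15, 17]
t, df = welch_t_test(model_a, model_b)
print(round(t, 4), round(df, 3))   # t ≈ -1.1547, df ≈ 4.959
```

The two-tailed p-value then follows as p = 2 × P(T ≥ |t|) with df degrees of freedom, for instance via `scipy.stats.t.sf(abs(t), df) * 2` when SciPy is available.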
Table 7
Training Performance Metrics for Dataset 1.
Method            Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM   0.9872 ± 0.0029    0.9871 ± 0.0028    0.9872 ± 0.0029    0.9871 ± 0.0029
BERT              0.9806 ± 0.0063    0.9805 ± 0.0057    0.9806 ± 0.0063    0.9805 ± 0.0062
LSTM              0.9829 ± 0.0162    0.9829 ± 0.0163    0.9829 ± 0.0162    0.9827 ± 0.0175
CNN               0.9862 ± 0.0190    0.9829 ± 0.0199    0.9862 ± 0.0190    0.9829 ± 0.0202
SVM               0.8247 ± 0.0073    0.8274 ± 0.0067    0.8247 ± 0.0073    0.8235 ± 0.0071

Table 8
Test Performance Metrics for Dataset 1.
Method            Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM   0.9594 ± 0.0015    0.9596 ± 0.0017    0.9594 ± 0.0015    0.9592 ± 0.0016
BERT              0.9516 ± 0.0040    0.9515 ± 0.0041    0.9516 ± 0.0044    0.9514 ± 0.0045
LSTM              0.9245 ± 0.0152    0.9257 ± 0.0163    0.9245 ± 0.0152    0.9239 ± 0.0165
CNN               0.9195 ± 0.0171    0.9200 ± 0.0170    0.9195 ± 0.0171    0.9192 ± 0.0125
SVM               0.8078 ± 0.0026    0.8118 ± 0.0025    0.8078 ± 0.0026    0.8058 ± 0.0031

Table 9
Ablation Performance Metrics for Dataset 1.
Model                     Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)      0.9180 ± 0.0065    0.9165 ± 0.0068
BERT-Only (Baseline)      0.9516 ± 0.0040    0.9512 ± 0.0042
BERT-ParamMatched         0.9535 ± 0.0038    0.9531 ± 0.0040
BERT+UniLSTM              0.9558 ± 0.0028    0.9555 ± 0.0030
BERT+BiLSTM-NoPooling     0.9575 ± 0.0022    0.9573 ± 0.0024
HybridBERT-LSTM (Full)    0.9594 ± 0.0015    0.9592 ± 0.0016

When the results are evaluated over five repeated experiments, the HybridBERT-LSTM model not only outperforms the other methods in terms of accuracy, precision, recall, and F1-score, but also demonstrates a high degree of stability, as reflected by its very low standard deviations (≈ 0.0015–0.0017). This indicates that the model provides not just superior performance but also reproducible results across runs.

While the BERT model follows as the second-best performer, its higher variance (≈ 0.004) highlights less consistent outcomes compared to HybridBERT-LSTM. Statistical testing (e.g., paired t-tests) confirms that the observed performance difference between HybridBERT-LSTM and BERT, though relatively small, is statistically significant (p < 0.05). In contrast, the performance gaps between HybridBERT-LSTM and weaker models such as LSTM, CNN, and particularly SVM are much larger. Pairwise comparisons reveal p-values well below 0.01, strongly supporting the conclusion that HybridBERT-LSTM's superiority is not due to random chance but reflects a genuine performance advantage. HybridBERT-LSTM vs. BERT: smaller margin, but statistically significant (p < 0.05). HybridBERT-LSTM vs. LSTM/CNN/SVM: substantial margin, highly significant (p ≪ 0.01).

Among the evaluated approaches, the HybridBERT-LSTM architecture consistently demonstrated superior performance during both training and testing phases, achieving remarkably high scores across all metrics. Specifically, it attained 98.72% accuracy and 98.72% F1-score on the training set, outperforming all other models. BERT, LSTM, and CNN also exhibited strong training performance, each surpassing 98% accuracy and F1-scores, indicating their efficacy on seen data.

In the testing phase, HybridBERT-LSTM maintained its leading position by achieving the highest test accuracy (95.94%) and F1-score (95.92%), affirming its robustness and generalization capability. In contrast, the CNN model experienced a notable performance drop from training to testing (accuracy falling from above 98% to 91.95% and F1-score to 91.92%), suggesting a tendency toward overfitting. Similarly, the LSTM model, despite achieving 98.29% accuracy in training, saw its performance decline to 92.45% accuracy during testing, reflecting reduced generalization.

Another critical observation is related to the SVM model, which exhibited the lowest performance across both training and test sets. With a training accuracy of 82.47% and a further decline to 80.78% in testing, the model's limited learning and generalization capacity became evident. These findings collectively indicate that SVM lags behind deep learning-based methods in terms of both modeling complexity and adaptability to sequential linguistic features inherent in dialogue-based sentiment classification tasks.

Dataset 2 comprises conversational exchanges derived from everyday spoken English interactions [47]. It consists of a total of 7,450 dialogue samples, structured in a question–answer format. The training and testing performances of five different classification methods (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) on Dataset 2 are presented in Tables 10 and 11, respectively. Among these, the HybridBERT-LSTM model achieved the highest performance on the training set, reaching an accuracy of 99.11% and an F1-score of 99.11%, thereby slightly outperforming the other methods. The BERT and CNN models also demonstrated high effectiveness, achieving accuracies of 98.95% and 98.21%, respectively. These three models exhibited strong alignment with the training data across all evaluation metrics, including accuracy, precision, recall, and F1-score.

When Table 12 is examined, which shows the ablation test for Dataset 2, the BERT-ParamMatched model achieves an accuracy of 97.92% ± 0.35% despite having an equivalent number of parameters, whereas HybridBERT-LSTM attains 98.32% ± 1.06%, reflecting a 0.40 percentage-point improvement. Component-wise analysis further indicates that dual pooling contributes +0.13% (98.32% vs. 98.19%), bidirectionality adds +0.13% (98.19% vs. 98.06%), and the sequential LSTM architecture over MLP layers provides an additional +0.14% (98.06% vs. 97.92%).

Based on the evaluation of five repeated experiments, the HybridBERT-LSTM model achieved the highest accuracy, precision, recall, and F1-scores on both the training and test sets. It stood out with an accuracy of 99.11% in training and reached 98.32% accuracy on the test set. The consistently low standard deviations (≈ 0.0106–0.0126) indicate that the model not only delivers high performance but also produces stable results.

BERT followed HybridBERT-LSTM and provided similarly strong results. However, its slightly lower standard deviations suggest that it yielded more consistent outcomes in some metrics. Although the performance gap between the two models appears small, pairwise t-test results show that the p-values are mostly below 0.05. Therefore, the difference between HybridBERT-LSTM and BERT is statistically significant.

In comparisons with the lower-performing models (LSTM, CNN, and SVM), the p-values were found to be far below 0.01. This demonstrates that HybridBERT-LSTM significantly and strongly outperforms these models. In particular, LSTM's high variance in training (std ≈ 0.0380) indicates unstable learning behavior.

In conclusion, HybridBERT-LSTM not only achieved the highest scores but also delivered stable and reproducible results.
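The per-run aggregation behind the "mean ± std" entries in the tables, together with the Cohen's d effect size reported alongside the p-values, can be sketched in a few lines. The helper names and the per-run accuracies below are illustrative assumptions, not the paper's raw measurements:

```python
from statistics import mean, stdev  # stdev() is the sample standard deviation

def summarize(runs):
    """Mean ± sample standard deviation over repeated independent runs."""
    return mean(runs), stdev(runs)

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation of two run groups."""
    n1, n2 = len(a), len(b)
    pooled = (((n1 - 1) * stdev(a) ** 2 + (n2 - 1) * stdev(b) ** 2)
              / (n1 + n2 - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled

# Illustrative per-run test accuracies for two hypothetical models.
hybrid = [0.96, 0.97, 0.98]
bert = [0.94, 0.95, 0.96]
m, s = summarize(hybrid)
print(f"{m:.4f} ± {s:.4f}")               # 0.9700 ± 0.0100
print(round(cohens_d(hybrid, bert), 2))   # 2.0
```

Reporting the effect size next to the p-value distinguishes a difference that is merely detectable from one that is practically meaningful, which is how the comparisons above are interpreted.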
Table 10
Training Performance Metrics for Dataset 2.
Method            Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM   0.9911 ± 0.0111    0.9911 ± 0.0126    0.9911 ± 0.0111    0.9911 ± 0.0111
BERT              0.9895 ± 0.0093    0.9896 ± 0.0094    0.9895 ± 0.0093    0.9895 ± 0.0093
LSTM              0.7270 ± 0.0380    0.7189 ± 0.0370    0.7175 ± 0.0380    0.7278 ± 0.0380
CNN               0.9821 ± 0.0176    0.9826 ± 0.0176    0.9921 ± 0.0179    0.9822 ± 0.0176
SVM               0.7785 ± 0.0518    0.7711 ± 0.0524    0.7785 ± 0.0518    0.7638 ± 0.0525

Table 11
Test Performance Metrics for Dataset 2.
Method            Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM   0.9832 ± 0.0106    0.9834 ± 0.0108    0.9832 ± 0.0106    0.9833 ± 0.0106
BERT              0.9779 ± 0.0038    0.9783 ± 0.0039    0.9779 ± 0.0038    0.9780 ± 0.0038
LSTM              0.7075 ± 0.0199    0.7089 ± 0.0178    0.7075 ± 0.0199    0.7078 ± 0.0199
CNN               0.9718 ± 0.0102    0.9725 ± 0.0104    0.9718 ± 0.0102    0.9720 ± 0.0112
SVM               0.7537 ± 0.0044    0.7491 ± 0.0045    0.7537 ± 0.0044    0.7277 ± 0.0045

Table 12
Ablation Performance Metrics for Dataset 2.
Model                     Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)      0.9425 ± 0.0152    0.9418 ± 0.0155
BERT-Only (Baseline)      0.9779 ± 0.0038    0.9780 ± 0.0038
BERT-ParamMatched         0.9792 ± 0.0035    0.9793 ± 0.0035
BERT+UniLSTM              0.9806 ± 0.0028    0.9807 ± 0.0028
BERT+BiLSTM-NoPooling     0.9819 ± 0.0022    0.9820 ± 0.0022
HybridBERT-LSTM (Full)    0.9832 ± 0.0106    0.9833 ± 0.0106

In contrast, LSTM and SVM yielded significantly lower performance, with training accuracies of 72.70% and 77.85%, respectively. Particularly, the low F1-score of 76.38% for SVM indicates inadequate classification consistency and stability. When evaluated on the test set, the overall performance ranking remained largely consistent with that observed during training. HybridBERT-LSTM and BERT maintained their superior performance, achieving test accuracies of 98.32% and 97.79%, respectively. The CNN model followed closely with 97.18% accuracy, exhibiting a balanced and robust performance across all evaluation criteria. Conversely, LSTM and SVM continued to underperform in the test phase, reflecting limited generalization capability in comparison to the more advanced deep learning architectures.

Dataset 3 comprises online consultation dialogues conducted between patients and medical professionals [48]. The dataset consists of a total of 6,570 entries, with each instance representing a dialogue exchange initiated by a patient inquiry and followed by a corresponding response from a doctor. The training and testing performances of five distinct approaches (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) on Dataset 3 are presented in Tables 13 and 14, respectively. Among these, the CNN model achieved the highest training performance, demonstrating its strong learning capability. The BERT model also exhibited competitive results, attaining a training accuracy of 94.92%, positioning it as a viable alternative. In contrast, LSTM and SVM models yielded notably lower performance during training, with accuracy scores of 62.26% and 71.92%, respectively, indicating limitations in their ability to model the training data effectively.

When Table 15 is examined, which shows the ablation test for Dataset 3, the BERT-ParamMatched model achieves 78.92% ± 1.72% accuracy with equivalent parameters, while HybridBERT-LSTM reaches 82.86% ± 0.65%, representing a statistically significant 3.94 percentage-point improvement. Component decomposition demonstrates substantial marginal contributions: dual pooling adds +1.61% (82.86% vs. 81.25%), bidirectionality contributes +1.40% (81.25% vs. 79.85%), and the sequential LSTM architecture over MLP provides +0.93% (79.85% vs. 78.92%), indicating that the individual components mutually enhance effectiveness on challenging classification tasks. The frozen BERT experiment (72.15% ± 2.45%) provides critical validation: despite lacking fine-tuning, it outperforms standalone LSTM with GloVe embeddings (62.26% test) by 9.89 percentage points, isolating the representation quality advantage of contextualized embeddings. However, the 10.71% gap between frozen and full models (72.15% vs. 82.86%) represents the largest fine-tuning contribution across all datasets, establishing that task-specific adaptation is particularly critical for complex classification problems. The parameter efficiency ratio of 18.74:1 (5.06% gain/0.27% parameter increase) dramatically exceeds simpler datasets (Dataset 1: 2.89:1, Dataset 2: 1.96:1), validating that BiLSTM's architectural value scales positively with task difficulty.

However, the test results reveal a marked decline in the generalization performance of some models, most notably CNN. The CNN model's accuracy dropped significantly to 65.03% during testing, suggesting signs of overfitting. The inability to maintain performance across datasets implies that the model may have memorized training instances rather than learning generalizable patterns. Similarly, the BERT model, while achieving 94.92% training accuracy, exhibited a notable decline during testing, with an accuracy of 78.27%, indicating moderate but consistent performance.

The most robust generalization was observed in the HybridBERT-LSTM approach. This model achieved a training accuracy of 91.57% and maintained a relatively high testing accuracy of 82.86%, with minimal performance degradation between training and testing phases. These results underscore the HybridBERT-LSTM model's capability to balance learning efficiency with strong generalization, making it the most stable and reliable method on Dataset 3.

Interestingly, the LSTM model maintained a consistent performance of 62.26% across both training and testing phases, signaling limitations in its learning capacity and suggesting that simpler architectures may be insufficient for handling the complexity of dialogue-based sentiment classification tasks. The SVM model, although yielding only moderate success during training, preserved its performance during testing (68.42%), outperforming more complex deep learning models such as CNN and LSTM in terms of stability. The HybridBERT-LSTM model emerges as the most balanced and generalizable approach, while the CNN model warrants cautious interpretation due to its susceptibility to overfitting.

In this study, each method was evaluated through five independent repetitions. This approach provides a more accurate representation of variance compared to results obtained from a single run and enhances the reproducibility of the outcomes. Notably, the HybridBERT-LSTM model exhibited very low standard deviations (≈ 0.006–0.01 range), indicating that the model not only achieved high average scores but also produced consistent results across trials.
|
||
vs. 78.92%). The cumulative gain of 5.06% from BERT-Only base- HybridBERT-LSTM vs. BERT: Although the average performance
|
||
line (78.27%) substantially exceeds the sum of individual compo- difference is relatively small, the p-values mostly remain below 0.05.
|
||
nents (3.94%), indicating a 1.12% synergistic interaction effect – the This suggests that the difference is unlikely to be due to chance and
|
||
strongest observed across all datasets – where BiLSTM components that the superiority of HybridBERT-LSTM is statistically significant.
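The paper does not state which significance test produced the p-values quoted above. With only five repetitions per system, one defensible stdlib-only choice is an exact two-sample permutation test on the run-level accuracies. The sketch below assumes this test and uses illustrative per-run values (only the five-run mean ± std is published, so the raw per-run accuracies here are invented to match Table 14's scale):

```python
from itertools import combinations
from statistics import mean, stdev

def exact_permutation_test(a, b):
    """Exact two-sample permutation test on the absolute difference of
    means: enumerate every split of the pooled scores into groups of the
    original sizes and count how often a split separates the means at
    least as strongly as the observed grouping."""
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        ga = [pooled[i] for i in idx]
        gb = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if abs(mean(ga) - mean(gb)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Illustrative five-run test accuracies (assumed, not the paper's raw runs).
hybrid = [0.8223, 0.8264, 0.8286, 0.8308, 0.8349]
bert = [0.7642, 0.7735, 0.7827, 0.7919, 0.8012]

p = exact_permutation_test(hybrid, bert)
print(f"{mean(hybrid):.4f} ± {stdev(hybrid):.4f} vs "
      f"{mean(bert):.4f} ± {stdev(bert):.4f}, p = {p:.4f}")
```

Because the two groups do not overlap at all, only the original split and its mirror reach the observed separation, giving p = 2/252 ≈ 0.008, i.e., below the 0.05 threshold the authors report.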
Table 13
Training Performance Metrics for Dataset 3.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.9157 ± 0.0097    0.9058 ± 0.0103    0.9056 ± 0.0097    0.9052 ± 0.0097
BERT                0.9492 ± 0.0234    0.9494 ± 0.0228    0.9492 ± 0.0234    0.9487 ± 0.0246
LSTM                0.6298 ± 0.0164    0.6294 ± 0.0160    0.6298 ± 0.0164    0.6227 ± 0.0183
CNN                 0.9966 ± 0.0054    0.9966 ± 0.0062    0.9966 ± 0.0054    0.9966 ± 0.0056
SVM                 0.7192 ± 0.0125    0.7263 ± 0.0161    0.7192 ± 0.0125    0.7198 ± 0.0126
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<<0.01), CNN (<<0.01), SVM (<<0.01).

Table 14
Test Performance Metrics for Dataset 3.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.8286 ± 0.0065    0.8326 ± 0.0062    0.8286 ± 0.0065    0.8282 ± 0.0064
BERT                0.7827 ± 0.0185    0.7835 ± 0.0185    0.7827 ± 0.0185    0.7830 ± 0.0184
LSTM                0.6226 ± 0.0081    0.6294 ± 0.0085    0.6226 ± 0.0081    0.6227 ± 0.0092
CNN                 0.6503 ± 0.0433    0.6516 ± 0.0565    0.6503 ± 0.0565    0.6497 ± 0.0565
SVM                 0.6842 ± 0.0093    0.6904 ± 0.0520    0.6842 ± 0.0093    0.6847 ± 0.0110
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<<0.01), CNN (<<0.01), SVM (<<0.01).
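The per-component margins quoted for the ablation studies (e.g., +0.93%, +1.40%, and +1.61% on Dataset 3) are simply the differences between successive ablation variants. A minimal sketch of that bookkeeping, using the Table 15 mean accuracies:

```python
# Test-set accuracy means from the Dataset 3 ablation (Table 15).
ablation = [
    ("BERT-ParamMatched",      0.7892),
    ("BERT+UniLSTM",           0.7985),  # + sequential LSTM head over MLP
    ("BERT+BiLSTM-NoPooling",  0.8125),  # + bidirectionality
    ("HybridBERT-LSTM (Full)", 0.8286),  # + dual pooling
]

def marginal_gains(rows):
    """Marginal contribution of each component: the accuracy difference
    between an ablation variant and the variant directly below it."""
    return {rows[i][0]: rows[i][1] - rows[i - 1][1]
            for i in range(1, len(rows))}

gains = marginal_gains(ablation)
for name, g in gains.items():
    print(f"{name}: +{g * 100:.2f} pp")
```

Running this reproduces the three marginal contributions reported in the text (+0.93, +1.40, and +1.61 percentage points).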
Table 15
Ablation Performance Metrics for Dataset 3.
Model                     Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)      0.7215 ± 0.0245    0.7208 ± 0.0248
BERT-Only (Baseline)      0.7827 ± 0.0185    0.7830 ± 0.0184
BERT-ParamMatched         0.7892 ± 0.0172    0.7895 ± 0.0171
BERT+UniLSTM              0.7985 ± 0.0145    0.7988 ± 0.0144
BERT+BiLSTM-NoPooling     0.8125 ± 0.0110    0.8128 ± 0.0109
HybridBERT-LSTM (Full)    0.8286 ± 0.0065    0.8282 ± 0.0064

HybridBERT-LSTM vs. LSTM, CNN, SVM: In comparisons with these three models, the p-values were found to be far below 0.01. Therefore, the superiority of HybridBERT-LSTM over these methods is strongly supported by statistical evidence.

Overall, the findings confirm that HybridBERT-LSTM is not only the best-performing model in terms of average scores but also the most reliable and consistent one from a statistical perspective.

Dataset 4 comprises text entries collected from online conversations conducted in English, each annotated with corresponding sentiment labels. It has been specifically curated for the purpose of analyzing and classifying the emotional tone embedded within textual utterances. The dataset consists of 1,494 instances and serves as a representative benchmark for evaluating sentiment classification models in informal, dialogue-based contexts [49].

Tables 16 and 17 present the training and test performance metrics, respectively, for five different sentiment classification models (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) applied to Dataset 4. Evaluation was conducted using standard performance indicators (Accuracy, Precision, Recall, and F1-score) to assess both the fitting capacity on training data and the generalizability on unseen test data.

Table 18 presents the cross-validation results for Dataset 4. The consistency of accuracy and F1-scores across folds (≈0.8795 and 0.8758, respectively) indicates that the model does not exhibit overfitting or excessive variance between training and evaluation phases. This stability confirms that the observed improvements are not artifacts of specific data splits but instead arise from the model’s architectural design, particularly its integration of bidirectional temporal encoding and hierarchical pooling mechanisms. Moreover, the cross-validation outcomes follow the same relative performance hierarchy observed in both the training and test experiments: HybridBERT-LSTM > BERT > LSTM > CNN > SVM. This consistent ranking across all evaluation settings validates the comparative strength of the proposed architecture. The slight performance gap between HybridBERT-LSTM and BERT is statistically meaningful and mirrors the p-value significance (<0.05) reported in both the training and test evaluations, further evidencing generalizable gains rather than dataset-specific variance. The results collectively demonstrate that HybridBERT-LSTM’s improvements are statistically sound, generalizable, and derived from architectural synergy rather than overparameterization or random variation.

When Table 19, which shows the ablation test for Dataset 4, is examined, BERT-ParamMatched achieves 85.48% ± 2.05% accuracy despite equivalent parameters, while HybridBERT-LSTM reaches 87.29% ± 1.19%, representing a 1.81 percentage point improvement. Component analysis reveals that dual pooling contributes +0.61% (87.29% vs. 86.68%), bidirectionality adds +0.63% (86.68% vs. 86.05%), and sequential LSTM architecture over MLP provides +0.57% (86.05% vs. 85.48%). The cumulative gain of 3.35% from the BERT-Only baseline (84.94%) exceeds the sum of individual components (1.81%), indicating a 1.54% synergistic effect where BiLSTM components mutually enhance effectiveness on this moderately challenging task. The frozen BERT variant (80.65% ± 2.85%) validates two critical insights: it outperforms standalone LSTM with GloVe embeddings (77.26% test) by 3.39 percentage points, confirming the superiority of contextualized representations, while the 6.64% gap to the full model (80.65% vs. 87.29%) quantifies the substantial contribution of fine-tuning. The decreasing variance from frozen (±2.85%) through parameter-matched (±2.05%) to the full model (±1.19%) demonstrates that architectural integration with end-to-end training provides essential stability, establishing that the observed improvements stem from architectural design rather than capacity scaling.

When the results in Tables 16 and 17 are analyzed based on five independent repetitions, several important findings emerge regarding both performance levels and statistical reliability. First, the HybridBERT-LSTM model demonstrates strong generalization ability, maintaining balanced accuracy (87.29% ± 1.19%) and F1 (84.89% ± 1.40%) on the test set, with relatively low variance across runs. The narrow confidence interval provided by the low standard deviations indicates that the model is not only accurate but also stable across repeated experiments. The pairwise statistical comparisons reveal further insights. Against BERT, the differences in performance metrics appear moderate, yet the corresponding p-values are consistently below 0.05. This implies that the improvements of HybridBERT-LSTM over BERT, while not large in magnitude, are statistically significant rather than random fluctuations.

In contrast, the performance gaps between HybridBERT-LSTM and the weaker models (LSTM, CNN, and especially SVM) are considerably larger. Here, the p-values are well below 0.01, in many cases below 0.001, providing strong statistical evidence that HybridBERT-LSTM’s superiority is systematic and not due to chance. Notably,
LSTM and CNN exhibit relatively high variances during training (std ≈ 0.038–0.048 for LSTM; ≈ 0.040–0.044 for CNN), suggesting instability and overfitting tendencies.

Table 16
Training Performance Metrics for Dataset 4.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.9046 ± 0.0172    0.8446 ± 0.0157    0.9046 ± 0.0119    0.8730 ± 0.0070
BERT                0.9447 ± 0.0238    0.9403 ± 0.0331    0.9447 ± 0.0238    0.9379 ± 0.0294
LSTM                0.9849 ± 0.0381    0.9848 ± 0.0489    0.9849 ± 0.0381    0.9845 ± 0.0479
CNN                 0.9944 ± 0.0443    0.9882 ± 0.0421    0.9944 ± 0.0443    0.9882 ± 0.0401
SVM                 0.8084 ± 0.0249    0.8258 ± 0.0206    0.8084 ± 0.0249    0.7806 ± 0.0268
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.01), CNN (<0.01), SVM (<0.001).

Table 17
Test Performance Metrics for Dataset 4.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.8729 ± 0.0119    0.8561 ± 0.0089    0.8729 ± 0.0117    0.8489 ± 0.0140
BERT                0.8494 ± 0.0218    0.8532 ± 0.0377    0.8494 ± 0.0218    0.8512 ± 0.0194
LSTM                0.7726 ± 0.0330    0.7971 ± 0.0410    0.7726 ± 0.0330    0.7818 ± 0.0409
CNN                 0.8160 ± 0.0164    0.8040 ± 0.0146    0.8160 ± 0.0164    0.8090 ± 0.0141
SVM                 0.7525 ± 0.0075    0.7030 ± 0.0058    0.7525 ± 0.0075    0.7192 ± 0.0114
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.01), CNN (<0.01), SVM (<0.001).

Table 18
Cross Validation Performance Metrics for Dataset 4.
Model               Accuracy   Precision   Recall   F1
HybridBERT-LSTM     0.8795     0.8739      0.8795   0.8758
BERT                0.8561     0.8477      0.8561   0.8481
LSTM                0.8394     0.7806      0.8394   0.8090
CNN                 0.8327     0.7811      0.8327   0.8058
SVM                 0.7593     0.7719      0.7593   0.7602

Table 19
Ablation Performance Metrics for Dataset 4.
Model                     Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)      0.8065 ± 0.0285    0.7971 ± 0.0295
BERT-Only (Baseline)      0.8494 ± 0.0218    0.8512 ± 0.0194
BERT-ParamMatched         0.8548 ± 0.0205    0.8558 ± 0.0188
BERT+UniLSTM              0.8605 ± 0.0178    0.8602 ± 0.0175
BERT+BiLSTM-NoPooling     0.8668 ± 0.0145    0.8645 ± 0.0155
HybridBERT-LSTM (Full)    0.8729 ± 0.0119    0.8489 ± 0.0140*

Taken together, these results highlight two key aspects: HybridBERT-LSTM delivers the best trade-off between accuracy and reproducibility across repeated runs, and its performance improvements, particularly over LSTM, CNN, and SVM, are not only empirically substantial but also statistically robust. Thus, the evidence supports HybridBERT-LSTM as the most reliable and generalizable method on Dataset 4.

During training (Table 16), CNN achieved the highest accuracy (99.44%) and F1-score (98.82%), indicating a strong capacity to fit the training data. LSTM and BERT also demonstrated robust learning performance with accuracy and F1-scores exceeding 94%, while HybridBERT-LSTM followed closely behind with an accuracy of 90.46% and an F1-score of 87.30%. SVM, in contrast, yielded noticeably lower training performance (Accuracy: 80.84%, F1: 78.06%), highlighting its relative limitations in capturing complex language patterns.

However, the test results (Table 17) reveal important insights into model generalizability. HybridBERT-LSTM emerged as the most balanced and generalizable model, achieving the highest test accuracy (87.29%) and a competitive F1-score (84.89%). Despite its superior training performance, CNN exhibited a significant drop in test accuracy (81.60%), suggesting potential overfitting. Similarly, LSTM, which performed strongly during training, experienced a substantial decline in accuracy (77.26%) and F1-score (78.18%) on the test set. BERT, while slightly lower in raw accuracy compared to HybridBERT-LSTM, maintained a stable generalization profile (Accuracy: 84.94%, F1: 85.12%).

The SVM model again registered the weakest results across all test metrics, with an accuracy of 75.25% and an F1-score of 71.92%, reinforcing the notion that classical machine learning methods may struggle with complex dialogue structures compared to deep learning architectures.

In summary, although CNN and LSTM excelled in training, their generalization to test data was limited. HybridBERT-LSTM, by contrast, demonstrated consistent performance across both phases, reinforcing its suitability for real-world sentiment classification tasks involving dialogue-based inputs.

Dataset 5 is constructed [50] for the purpose of modeling empathetic dialogues and comprises multi-turn human-to-human conversations that reflect emotionally rich interactions. The corpus is partitioned into three distinct subsets: the training set contains 40,200 instances, the validation set includes 5,730 instances, and the test set comprises 5,260 instances.

Tables 20 and 21 present the comparative performance metrics of five distinct models (HybridBERT-LSTM, BERT, LSTM, CNN, and SVM) on Dataset 5, using standard evaluation criteria: Accuracy, Precision, Recall, and F1-score. The results reveal clear patterns in terms of both model learning capacity on training data and generalization to unseen test instances.

When Table 22, which shows the ablation test for Dataset 5, is examined, BERT-ParamMatched achieves 95.65% ± 0.19% accuracy with equivalent parameters, while HybridBERT-LSTM reaches 96.16% ± 0.23%, representing a 0.51 percentage point improvement. Component decomposition reveals uniform contributions: dual pooling adds +0.17% (96.16% vs. 95.99%), bidirectionality contributes +0.17% (95.99% vs. 95.82%), and sequential LSTM architecture over MLP provides +0.17% (95.82% vs. 95.65%). The cumulative gain of 0.66% from the BERT-Only baseline (95.50%) precisely matches the sum of individual components, indicating minimal synergistic effects on this high-performing task, where architectural elements operate additively rather than multiplicatively. The frozen BERT variant (92.45% ± 0.82%) provides task-difficulty insights: it outperforms standalone LSTM with GloVe embeddings (91.86% test) by only 0.59 percentage points, the smallest margin across all datasets, yet maintains a 3.71% gap from the full model (92.45% vs. 96.16%). This pattern establishes that on near-saturated tasks (BERT baseline: 95.50%), fine-tuning provides greater marginal value (+3.71%) than architectural modifications (+0.66%). The parameter efficiency ratio of 2.44:1 (0.66%
Table 20
Training Performance Metrics for Dataset 5.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.9834 ± 0.0086    0.9834 ± 0.0074    0.9834 ± 0.0086    0.9833 ± 0.0084
BERT                0.9654 ± 0.0062    0.9654 ± 0.0059    0.9654 ± 0.0062    0.9654 ± 0.0061
LSTM                0.9936 ± 0.0049    0.9936 ± 0.0046    0.9936 ± 0.0049    0.9936 ± 0.0049
CNN                 0.9384 ± 0.0346    0.9416 ± 0.0312    0.9384 ± 0.0346    0.9373 ± 0.0278
SVM                 0.7536 ± 0.0272    0.7479 ± 0.0523    0.7536 ± 0.0272    0.7446 ± 0.0408
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.05), CNN (<0.01), SVM (<<0.01).

Table 21
Test Performance Metrics for Dataset 5.
Method              Accuracy ± std     Precision ± std    Recall ± std       F1 ± std
HybridBERT-LSTM     0.9616 ± 0.0023    0.9614 ± 0.0021    0.9616 ± 0.0023    0.9615 ± 0.0022
BERT                0.9550 ± 0.0020    0.9554 ± 0.0019    0.9550 ± 0.0020    0.9550 ± 0.0020
LSTM                0.9186 ± 0.0026    0.9201 ± 0.0029    0.9186 ± 0.0026    0.9190 ± 0.0031
CNN                 0.8851 ± 0.0281    0.8887 ± 0.0337    0.8851 ± 0.0310    0.8813 ± 0.0315
SVM                 0.7588 ± 0.0183    0.7507 ± 0.0178    0.7588 ± 0.0183    0.7506 ± 0.0179
* The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<<0.01), CNN (<<0.01), SVM (<<0.01).
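In Tables 13–21 the Accuracy and Recall columns always coincide. This is not a coincidence: with support-weighted averaging of per-class metrics, overall recall reduces algebraically to plain accuracy. A self-contained sketch with an illustrative (invented) 3-class confusion matrix:

```python
# Support-weighted averaging of per-class metrics for a 3-class task.
# The confusion-matrix counts below are illustrative, not the paper's.
cm = [  # rows: true class, columns: predicted class
    [50, 3, 2],
    [4, 60, 6],
    [1, 5, 69],
]
k = len(cm)
support = [sum(row) for row in cm]          # true instances per class
total = sum(support)
col_sum = [sum(cm[r][c] for r in range(k)) for c in range(k)]

accuracy = sum(cm[i][i] for i in range(k)) / total
recall = [cm[i][i] / support[i] for i in range(k)]
precision = [cm[i][i] / col_sum[i] if col_sum[i] else 0.0 for i in range(k)]
f1 = [2 * p * r / (p + r) if p + r else 0.0
      for p, r in zip(precision, recall)]

def weighted(values):
    """Average per-class values weighted by class support."""
    return sum(v * s for v, s in zip(values, support)) / total

# Weighted recall = sum(diagonal) / total = accuracy, by construction.
print(f"accuracy={accuracy:.4f}  weighted recall={weighted(recall):.4f}  "
      f"weighted F1={weighted(f1):.4f}")
```

Weighted precision and F1, by contrast, are genuinely distinct quantities, which is why those columns differ slightly from Accuracy in the tables.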
gain/0.27% parameter increase) positions Dataset 5 among simpler classification problems, validating the inverse relationship between baseline performance and BiLSTM’s contribution.

Table 22
Ablation Performance Metrics for Dataset 5.
Model                     Accuracy ± std     F1 ± std
BERT+BiLSTM (Frozen)      0.9245 ± 0.0082    0.9243 ± 0.0083
BERT-Only (Baseline)      0.9550 ± 0.0020    0.9550 ± 0.0020
BERT-ParamMatched         0.9565 ± 0.0019    0.9565 ± 0.0019
BERT+UniLSTM              0.9582 ± 0.0018    0.9582 ± 0.0018
BERT+BiLSTM-NoPooling     0.9599 ± 0.0017    0.9599 ± 0.0017
HybridBERT-LSTM (Full)    0.9616 ± 0.0023    0.9615 ± 0.0022

Based on the results averaged over five independent runs, the HybridBERT-LSTM model consistently achieved the highest performance on both the training and test sets. The remarkably low standard deviations (≈0.002–0.009) indicate not only superior average performance but also a high degree of stability and reproducibility across repeated trials.

The BERT model ranked second, yielding performance levels comparable to HybridBERT-LSTM. However, pairwise statistical comparisons revealed that the p-values were generally below 0.05, suggesting that the observed differences, while relatively small, are statistically significant and not attributable to random variation.

In contrast, comparisons with the lower-performing models (LSTM, CNN, and SVM) yielded p-values well below 0.01, providing strong statistical evidence of HybridBERT-LSTM’s superiority. Notably, the LSTM model, despite attaining high training scores, exhibited a marked decline during testing, indicating a tendency toward overfitting. Similarly, the CNN model displayed wider standard deviations, pointing to instability and reduced reliability across runs.

In conclusion, the HybridBERT-LSTM model not only achieved the highest mean scores but also demonstrated low variance and statistically significant improvements, confirming its reliability and robustness as the most effective approach for Dataset 5.

In the training phase (Table 20), LSTM yielded the highest performance across all metrics, with an accuracy and F1-score of 99.36%, indicating exceptional capability in capturing sequential dependencies in the training corpus. Close behind, the HybridBERT-LSTM model achieved 98.34% accuracy and an F1-score of 98.33%, reflecting its strength in combining contextual embeddings with sequential modeling. BERT also performed robustly, attaining 96.54% across all reported metrics. In contrast, CNN demonstrated a moderate performance (Accuracy: 93.84%, F1: 93.73%), while SVM significantly underperformed (Accuracy: 75.36%, F1: 74.46%), confirming its limitations in handling nuanced linguistic structures.

When evaluated on the test data (Table 21), HybridBERT-LSTM again outperformed all other models, achieving the highest accuracy (96.16%) and F1-score (96.15%), indicating strong generalization capability and robustness against overfitting. BERT maintained competitive test performance (Accuracy: 95.50%, F1: 95.50%), slightly lagging behind the hybrid model. While LSTM demonstrated superior training results, its test performance declined more notably (Accuracy: 91.86%, F1: 91.90%), suggesting possible overfitting to the training data. Similarly, CNN exhibited a moderate generalization gap, reaching only 88.51% accuracy on the test set, despite its relatively high training metrics.

SVM, consistent with previous datasets, again showed the lowest performance in both training and testing phases, with an F1-score of only 75.06% on the test data. This emphasizes the model’s limited capacity to generalize in dialogue-rich or semantically complex scenarios compared to deep learning-based alternatives.

Overall, these results substantiate the efficacy of the HybridBERT-LSTM architecture in balancing contextual sensitivity and temporal structure modeling, thereby ensuring high accuracy and stability across both learning and evaluation stages. The comparative drop in test performance observed in CNN and LSTM also underscores the importance of integrating both contextual and sequential representations for enhanced sentiment classification in dialogue settings.

Fig. 1 illustrates the interpretability analysis of the proposed sentiment classification model using the LIME framework. The visualization comprises three distinct components, each elucidating the model’s decision-making process for a representative dialogue input.

The prediction probabilities panel (top-left) displays the model’s confidence distribution across the three sentiment classes. Here, Class 1 achieves a probability score of 1.00, indicating complete certainty in the model’s classification. Classes 0 and 2 both register a probability of 0.00, underscoring the model’s confident and decisive prediction for this specific instance.

The feature importance panel, generated by LIME, presents the quantitative contribution of individual lexical features to the final prediction. The ranking reveals that terms such as "crying", "embarrassing", and "fear" possess the highest negative impact coefficients. Meanwhile, features like "worry", "freaking out", and "go out" show moderate levels of influence. Conversely, contextual words such as "counseling", "therapy", and "days" exhibit minimal importance, suggesting limited contribution to the sentiment prediction for this case.

The highlighted text visualization (right panel) offers an intuitive representation of feature importance through color-coded annotations. The input sentence: "I’m starting counseling/therapy in a few days. I’m
Fig. 1. Interpretability analysis using the LIME framework for the proposed model for Dataset 1.
Fig. 2. Interpretability analysis using the LIME framework for the proposed model for Dataset 2.
freaking out but my main fear is crying and embarrassing myself. Should I be worried?" is annotated with blue highlights, corresponding to high-impact emotional cues. The intensity of each highlight is directly proportional to the magnitude of that word’s influence on the final classification.

Fig. 2 illustrates a LIME-based interpretability analysis for a sentiment classification instance derived from medical discourse, highlighting the model’s interpretive capabilities in processing healthcare-related textual inputs. The visualization provides a comprehensive insight into the underlying decision-making mechanisms of the sentiment prediction process.

The prediction probabilities panel reveals that the model assigns a dominant probability of 0.97 to Class 0, while significantly lower values of 0.01 and 0.02 are attributed to Classes 1 and 2, respectively. This distribution indicates high classification confidence with minimal uncertainty among the alternative sentiment categories.

The feature importance ranking presents local attributions generated by LIME, identifying the most influential lexical components contributing to the classification decision. The term "cancer" emerges as the primary contributor with an importance score of 0.61, followed by "scared" (0.22) and "please" (0.11). Additional terms such as "really", "as", "well", "find", "I", "blood", and "have" exhibit progressively lower importance coefficients, reflecting their secondary roles in the model’s sentiment determination process.

The highlighted text panel displays the analyzed medical narrative: "Hello doctor, I’m a 26-year-old male, 10 cm tall and weigh 255 pounds. I sometimes have blood in my stool, especially after eating spicy food or when constipated. I’m really scared that I might have colon cancer. I frequently experience diarrhea. There is no family history of colon cancer. I had blood tests done last night. Please find my reports attached".

The blue-highlighted segments, particularly "scared" and "cancer", correspond to high-impact emotional and medical terminology that significantly influence the model’s sentiment evaluation.

This interpretability analysis demonstrates the model’s sensitivity to emotionally charged and domain-specific medical expressions within healthcare contexts. The LIME explanation reveals that the classification decision primarily hinges on illness-related concerns and fear-based expressions. Accordingly, the analysis offers valuable insights into the model’s domain-specific sentiment recognition capabilities when interpreting emotionally nuanced medical discourse.

Fig. 3 illustrates the interpretability analysis using the LIME framework for a sample medical consultation text, highlighting the model’s capability to perform sentiment classification within clinical communication contexts. The visualization comprises several analytical components that elucidate the algorithmic decision-making process.

The prediction probability panel reveals high classification confidence by the model, assigning a dominant probability score of 0.99 to Class 2, while Classes 0 and 1 both receive marginal likelihoods of 0.01.

The feature importance analysis presents local explanations generated by LIME, quantifying individual lexical contributions to the final prediction. The term "affected" exhibits the highest contribution coefficient at 0.24, followed by "cold" (0.22) and "recovery" (0.20). Subsequent features such as "recommend" (0.17), "definitely" (0.11), and "avoid" (0.10) display gradually decreasing importance values. Additional terms like "by", "protect", "loose", and "issue" register minimal weights, indicating lower relevance in the sentiment attribution process.

The highlighted text visualization renders the analyzed clinical advisory statement: "Hello, I have reviewed the attached photographs, the attachments have been removed to protect patient identity. In my opinion, you are affected by a tinea infection. I recommend taking 250 mg terbinafine tablets once daily and applying sertaconazole cream to the affected area twice daily. Continue this for three weeks and return. You will definitely notice some improvement...".

Terms highlighted in green, specifically "affected", "recommend", and "improvement", correspond to therapeutically oriented expressions that significantly influence the model’s positive sentiment classification.

This interpretability analysis reveals the model’s capacity to distinguish constructive medical recommendations from neutral or negatively toned clinical communications. The LIME explanation demonstrates that the classification decision is primarily driven by treatment-related vocabulary and optimistic prognostic indicators, offering valuable insights into the model’s domain-specific sentiment recognition abilities within healthcare advisory scenarios.

Fig. 4 presents a LIME-based interpretability analysis for the sentiment classification of a concise social media content sample, illustrating the model’s ability to process succinct and informal textual expressions. The visualization offers in-depth insights into the underlying sentiment classification mechanisms for multimedia-related content descriptions.
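The word-level importance scores shown in the LIME panels can be approximated, in spirit, by a much simpler occlusion scheme: drop one word at a time and measure how much the predicted-class score falls. (LIME itself fits a local linear surrogate over many random perturbations; the sketch below is a deterministic cousin, and `toy_positive_score` with its lexicon weights is an invented stand-in for the classifier's class probability, not the paper's model.)

```python
# Occlusion-style word attribution: the score drop caused by removing
# a single word approximates that word's contribution to the prediction.
def toy_positive_score(words):
    # Invented stand-in for the classifier's positive-class probability;
    # weights loosely mimic the kind of scores shown in the LIME panels.
    lexicon = {"cute": 0.53, "funny": 0.17, "dogs": 0.05}
    return sum(lexicon.get(w, 0.0) for w in words)

def occlusion_attributions(text):
    words = text.split()
    base = toy_positive_score(words)
    # Score with word j removed, compared against the full-text score.
    return {w: base - toy_positive_score(words[:j] + words[j + 1:])
            for j, w in enumerate(words)}

attr = occlusion_attributions("corgi belly flop compilation cute funny dogs")
top = sorted(attr.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

Words absent from the lexicon contribute nothing, so their attribution is zero, mirroring the near-zero bars for filler words in the figures.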
Fig. 3. Interpretability analysis using the LIME framework for the proposed model for Dataset 3.
Fig. 4. Interpretability analysis using the LIME framework for the proposed model for Dataset 4.
Fig. 5. Interpretability analysis using the LIME framework for the proposed model for Dataset 5.
The prediction probability panel indicates that the model assigns a dominant probability of 0.95 to Class 2, while Classes 0 and 1 receive significantly lower confidence scores of 0.04 and 0.01, respectively. This distribution demonstrates high classification confidence with minimal ambiguity across alternative sentiment categories.

The feature importance ranking displays local explanations derived from LIME, identifying the most influential lexical components in the model’s decision-making process. The term "cute" emerges as the primary contributor with the highest importance score of 0.53, followed by "funny" (0.17). Additional terms such as "dogs" (0.05), "belly" (0.04), "compilation" (0.03), "flop" (0.02), and "corgi" (0.01) exhibit progressively decreasing contribution scores, reflecting their secondary roles in sentiment attribution.

The text highlight visualization renders the analyzed content description: "corgi belly flop compilation cute funny dogs corgi flop".

Green-highlighted terms, particularly "cute" and "funny", correspond to positive emotional descriptors that substantially influence the model’s sentiment classification toward the positive class.

This interpretability analysis demonstrates the model’s efficacy in detecting positive sentiment cues within short, multimedia-oriented content descriptions. The LIME explanation reveals that the classification decision is primarily driven by emotionally charged adjectives expressing affection and humor, offering valuable insights into the model’s ability to process informal social media language patterns and perform sentiment analysis on pet-related content.

Fig. 5 presents the LIME-based interpretability analysis of a personal expression sample, illustrating the model’s capacity to interpret emotional distress within the context of domestic relationships. This visualization provides detailed insights into the sentiment classification process related to interpersonal communication patterns.

The prediction probability panel shows that the model assigns a dominant probability of 0.92 to Class 0, while Classes 1 and 2 receive substantially lower confidence scores of 0.03 and 0.05, respectively. This distribution reflects the model’s high classification confidence with minimal ambiguity across alternative sentiment categories.

The feature importance analysis displays locally derived explanations generated by LIME, quantifying the contribution of individual lexical features to the final prediction. The terms "angry" and "friends" exhibit the highest impact scores of 0.43, followed by "I" (0.24), "ugh" (0.23), and "exhausted" (0.22). Additional terms such as "yes" (0.16), "so" (0.10), "his" (0.09), "husband" (0.04), and "again" (0.04) display diminishing importance scores, indicating secondary roles in the sentiment determination process.

The text highlight visualization presents the analyzed personal narrative: "ugh I’m so angry my husband went out with his friends for the third time this week, is he drinking, yes, I’m exhausted my daughter is teething so she isn’t sleeping well".

The blue-highlighted segments, particularly "ugh", "angry", "friends", and "exhausted", correspond to emotionally expressive markers and stress indicators that significantly influenced the model’s negative sentiment classification. This interpretability analysis reveals the model’s ability to detect frustration and emotional exhaustion within narratives
Fig. 6. Graph-based visualization with the WordContextGraphExplainer framework for Dataset1.
involving intimate relational contexts. The LIME explanation demonstrates that the classification decision is predominantly based on explicit emotional state descriptors and situational stress signals, providing valuable insight into the model's competence in analyzing sentiment in informal, emotionally charged personal communications and family-related discourse.

Fig. 6 presents a comprehensive visualization generated by the WordContextGraphExplainer framework, illustrating contextual dependencies and feature interactions that underlie a sentiment analysis model's decision-making process. This graph-based representation analyzes a textual input with inherently negative emotional content, offering insights into how individual lexical units contribute to the model's final classification outcome.

The visualization employs a node–edge graph structure, wherein each word in the input sentence is represented as a distinct node. A structured layout algorithm is used to optimally position the nodes, minimizing visual overlap while preserving semantic relationships. Node coloration adheres to a three-class scheme: red nodes signify words with negative influence on the prediction, gray nodes indicate neutral contributions, and green nodes denote positive contributions that enhance the model's classification confidence. Each node is annotated with a numeric coefficient reflecting its individual effect on the predicted class probability. The values presented (ranging from +0.0001 to +0.0197) quantitatively capture the magnitude of each word's contribution to the final classification decision. Notably, terms such as ‘‘worthless’’ (+0.0068), ‘‘barely’’ (+0.0072), and ‘‘emotions’’ (+0.0197) exhibit significant negative sentiment contributions, aligning with the model's overall classification of the input as Negative. Edges between nodes represent word-pair interactions whose importance exceeds a predefined threshold, capturing non-additive effects between co-occurring terms. As specified in the legend (top-left), the visualization highlights the top five most influential word-pair interactions. Edge annotations (e.g., ‘‘+0.6061 (Neg)’’, ‘‘+0.6701 (Neg)’’) denote both the strength and directional impact of these interactions on sentiment classification. These values reflect synergistic or antagonistic effects that emerge when specific word combinations appear within the same context. The model's confident prediction of the input text as expressing Negative sentiment (as shown at the bottom of the visualization) is supported by the prevalence of red-coded nodes and high-magnitude negative interaction coefficients. The analyzed text—rich in expressions of emotional distress and self-deprecating language—serves as a clear use case for the explainer's ability to decompose complex sentiment decisions into interpretable components. This visualization framework directly addresses the critical need for interpretability in natural language processing applications. By decomposing the model's reasoning into individual word contributions and pairwise interactions, WordContextGraphExplainer enables practitioners to understand not only what the model predicts, but why specific linguistic features drive those predictions. Such detailed analysis is especially valuable in high-stakes applications, where transparency and accountability are essential. The graph structure effectively conveys the intricate interplay between lexical semantics and contextual dependencies that influence automatic sentiment classification, offering a robust foundation for both model validation and bias detection in NLP systems.

Fig. 7 illustrates a visual explanation generated through the WordContextGraphExplainer framework, a graph-theoretic methodology developed to enhance interpretability in natural language processing tasks. This approach is specifically designed to analyze the contextual and semantic interdependencies among lexical units in a given text. The visualized instance centers on a sample from a patient–doctor interaction scenario, highlighting how domain-specific terminology influences the model's sentiment classification decision.

The graph comprises the following principal components. Each node corresponds to an individual word token extracted from the input sentence. Numerical values adjacent to the nodes (ranging from −0.6908 to +0.3007) quantify the contextual influence of each word on the model's predicted sentiment class. These scalar weights reflect the relative importance of lexical features based on perturbation-based sensitivity analysis. Edges link semantically related word pairs, capturing co-occurrence patterns and latent dependencies. Notably, the term ‘‘pain’’ occupies a central position in the graph with multiple connections, indicating its pivotal role in determining the emotional tone of the dialogue. The visualization applies a ‘‘top-5 interactions’’ threshold, selectively displaying the most salient semantic relationships to prevent information overload while preserving interpretive clarity.

The graph reveals a meaningful mapping between medical domain terms (e.g., ‘‘doctor’’, ‘‘medication’’, ‘‘pain’’) and activity-related expressions drawn from sports terminology (e.g., ‘‘tennis’’, ‘‘cricket’’, ‘‘playing’’), showcasing the model's capacity to associate physically contextualized discomfort with healthcare concerns. This highlights the model's ability to capture nuanced emotional cues across domains.
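The node and edge structure described above (per-word contribution scores, pairwise interaction weights, and a top-5 interaction threshold) can be sketched in plain Python. The paper's implementation renders such graphs with NetworkX; the dictionaries below are a minimal stand-in for that representation, and every score is an illustrative placeholder rather than a value taken from the figures.

```python
# Sketch of the explainer's graph data: nodes carry per-word contribution
# scores, edges carry pairwise interaction weights, and only the top-5
# interactions by magnitude are retained for display, as in the legend.
# All words and scores here are illustrative placeholders.

node_scores = {
    "worthless": 0.0068, "barely": 0.0072, "emotions": 0.0197,
    "sleep": 0.0125, "nothing": 0.0097, "think": 0.0044,
}

pair_interactions = {
    ("worthless", "nothing"): 0.6701, ("barely", "sleep"): 0.6061,
    ("emotions", "think"): 0.4120, ("worthless", "emotions"): 0.3350,
    ("sleep", "nothing"): 0.2104, ("barely", "think"): 0.1507,
    ("emotions", "sleep"): 0.0912,
}

def top_k_edges(interactions, k=5):
    """Keep only the k strongest interactions by absolute weight."""
    ranked = sorted(interactions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:k])

edges = top_k_edges(pair_interactions)
print(len(edges), sorted(edges))
```

Thresholding on edge magnitude rather than sign keeps both strongly synergistic and strongly antagonistic word pairs visible while pruning weak interactions, which is what lets the rendered graph stay legible on longer inputs.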
Fig. 7. Graph-based visualization with the WordContextGraphExplainer framework for Dataset2.
Fig. 8. Graph-based visualization with the WordContextGraphExplainer framework for Dataset3.
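The perturbation-based scoring behind these visualizations, formalized later in this section as the fidelity metric of Eqs. (1)–(3), can be sketched as follows. The toy scorer below stands in for the trained HybridBERT-LSTM model, and the sample texts and flagged words are hypothetical; only the Table 23 improvement arithmetic uses values reported in the paper.

```python
# Sketch of the fidelity procedure: score the original text, delete the
# explainer's top-k words, rescore, and average the absolute probability
# shifts. The lexicon scorer is a hypothetical stand-in for the trained
# model M, not the paper's HybridBERT-LSTM.

def model(text):
    """Toy negative-class probability: fraction of flagged words."""
    flagged = {"worthless", "barely", "nothing"}
    words = text.split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def fidelity(text, top_features, k=2):
    p0 = model(text)                                # baseline prediction
    drop = set(top_features[:k])                    # top-k explained features
    keep = [w for w in text.split() if w not in drop]
    p_prime = model(" ".join(keep))                 # perturbed prediction
    return abs(p0 - p_prime)                        # Eq. (1)

samples = [
    ("i feel worthless and barely sleep", ["worthless", "barely"]),
    ("i do nothing but think", ["nothing", "think"]),
]
mean_fidelity = sum(fidelity(t, f) for t, f in samples) / len(samples)  # Eq. (3)

# Relative improvement as reported per dataset in Table 23 (Dataset 1):
improvement = (0.8900 - 0.8100) / 0.8100 * 100
print(round(mean_fidelity, 4), round(improvement, 2))
```

For the Dataset 1 row of Table 23, (0.8900 − 0.8100) / 0.8100 × 100 ≈ 9.88%, matching the reported improvement column.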
The WordContextGraphExplainer framework, as demonstrated in this clinical communication use case, provides an interpretable, context-aware mechanism for analyzing model behavior. Its utility in domains such as clinical text analysis and patient-centered dialogue interpretation suggests promising implications. By revealing both direct and indirect contributions of lexemes to the classification process, this methodology lays a solid foundation for future research on explainable AI in medical and psychologically sensitive natural language applications.

Fig. 8 presents a significant methodological example of visualizing sentiment analysis and contextual word relationships through the WordContextGraphExplainer framework. The graph specifically illustrates the semantic structure of the sentence ‘‘that would be great, then we could plan things sooner’’, offering insight into how lexical elements collectively influence the model's sentiment prediction.

A salient feature in the visualization is the positioning of the word ‘‘great’’ as the central hub node. With a high positive influence score of +0.7819, this term is encoded in green, representing a dominant contributor within the Positive Sentiment category. Its central role in the graph indicates that it functions as the primary sentiment-bearing lexical unit in the sentence.

The graph exhibits a radial topology, with all peripheral nodes emanating from the central ‘‘great’’ node. This star-like configuration reflects how sentiment polarity is propagated through the surrounding context, with the central node acting as the semantic anchor.

The weights of the edges range from −0.2868 to +0.1792, quantifying the strength of semantic correlation between each word and the central ‘‘great’’ node. The system's overall classification of the sentence as Positive sentiment is clearly driven by the dominant positive influence
Fig. 9. Graph-based visualization with the WordContextGraphExplainer framework for Dataset4.
of the hub node. This highlights the framework's keyword-centric modeling approach to sentiment interpretation.

Words such as ‘‘plan’’, ‘‘things’’, ‘‘sooner’’, ‘‘then’’, ‘‘we’’, ‘‘could’’, ‘‘that’’, ‘‘would’’, and ‘‘be’’ are categorized as having neutral sentiment contributions. These peripheral tokens exhibit minimal effect values ranging between +0.0001 and +0.0002, suggesting their limited semantic influence on the classification. This uniform distribution underscores the marginal role of syntactic or functional words in the model's decision-making process.

The system's capacity to selectively highlight the five strongest semantic pairwise interactions enhances both computational efficiency and model interpretability. By focusing on the most relevant contextual relationships, the graph avoids overcomplexity while preserving analytical fidelity.

This visualization demonstrates that WordContextGraphExplainer serves as a promising approach within the sentiment analysis domain, contributing meaningfully to the broader paradigm of interpretable artificial intelligence. Its ability to disentangle and communicate the interplay between dominant and supportive linguistic features makes it particularly valuable for applications requiring both transparency and analytical depth.

Fig. 9 presents a Word Context Graph that exemplifies the complex dynamics of multi-domain sentiment analysis and cross-topical semantic understanding. The visualization analyzes the sentence ‘‘I have never seen Avatar, what is it about? I really enjoy The Avenger’’, offering a fine-grained representation of lexical interactions within the entertainment domain.

The node ‘‘enjoy’’ (+0.4646) serves as the central hub in the graph, exhibiting the highest positive sentiment score. This node constitutes the semantic backbone of the structure, maintaining extensive connectivity with surrounding tokens. The presence of dual-edge structures highlights WordContextGraphExplainer's capacity to capture nuanced variations in semantic relationship strength across word pairs.

The strong semantic ties among the nodes ‘‘avatar’’, ‘‘avenger’’, and ‘‘enjoy’’ reflect the model's successful identification of domain-specific coherence. This clustering reveals that the system is capable of contextually grouping entertainment-related entities, thereby enhancing domain-sensitive sentiment interpretation.

The inclusion of interrogative tokens such as ‘‘what’’ (+0.0018) and the question mark ‘‘?’’ (+0.0017) underscores the framework's ability to classify interrogative structures appropriately within the semantic graph. These tokens demonstrate minor but contextually relevant contributions to the overall sentiment.

The neutral classification of the term ‘‘never’’ (+0.0007) suggests a sophisticated handling of negation. Rather than misattributing a strong negative weight, the model maintains contextual equilibrium, acknowledging the grammatical presence of negation without overestimating its emotional impact.

The model's ultimate sentiment prediction as Positive is primarily driven by the dominant influence of the ‘‘enjoy’’ hub node. This demonstrates the system's robust classification capabilities in scenarios containing mixed sentiments and multifaceted content.

Overall, this analysis reinforces the efficacy of the WordContextGraphExplainer framework as an interpretability tool for complex conversational texts. It not only captures domain-specific semantic cohesion but also preserves fine-grained contextual dependencies, making it a powerful instrument for multi-topic sentiment analysis in real-world natural language understanding applications.

Fig. 10 illustrates a Word Context Graph generated by the WordContextGraphExplainer framework, presenting a critical case study for sentiment analysis and psychological state detection within the mental health domain. The graph analyzes a linguistically complex, emotionally charged sentence:

‘‘I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here’’.

The term ‘‘feelings’’ (+0.0197) is positioned as the central hub node, forming the core component of the negative sentiment cluster. This
Fig. 10. Graph-based visualization with the WordContextGraphExplainer framework for Dataset5.
central positioning reflects the dominant role of emotional discourse within the narrative and highlights the lexical anchor around which semantic interactions are organized.

The graph predominantly features nodes classified as negative, such as ‘‘worthless’’ (+0.0068), ‘‘nothing’’ (+0.0097), and ‘‘barely’’ (+0.0072). These contribute to the accurate identification of depressive language patterns and reinforce the system's capacity to localize affectively significant tokens. Edge weights span a broad spectrum from +0.8360 to −0.6061, indicating considerable variance in the strength of inter-word interactions. Notably, the strongest negative correlations are concentrated around the ‘‘feelings’’ hub, supporting its centrality in semantic influence. The nodes ‘‘shouldn’’ (+0.0171) and ‘‘be’’ (+0.0132) are negatively classified, reflecting the system's ability to detect linguistic indicators of suicidal ideation. This demonstrates the model's sensitivity to subtle syntactic constructions associated with psychological distress. The node ‘‘sleep’’ (+0.0125) is identified within the negative sentiment category, indicating the model's capacity to recognize sleep disruption — an important marker in clinical mental health assessments. The term ‘‘think’’ (+0.0044) reflects ruminative thought patterns and is correctly positioned within the semantic network. This demonstrates the system's effectiveness in modeling internal cognitive processes associated with depressive episodes. The model's overall prediction of Negative sentiment aligns with clinical assessment criteria, suggesting that the system achieves a promising level of accuracy for mental health screening applications. This classification is supported by the density of negative sentiment nodes and their semantically coherent interactions.

This analysis demonstrates that the WordContextGraphExplainer framework provides a robust interpretability mechanism for psychologically sensitive content. By quantifying both individual lexical contributions and inter-word semantic interactions, the system delivers a fine-grained visualization of emotional discourse, making it particularly valuable in clinical decision support systems.

The fidelity metric [51] implemented in this framework quantifies the correspondence between explanation-based feature importance rankings and observable model behavior changes through a perturbation-based assessment methodology.

Let 𝑀 represent the trained model, 𝑥 denote the original input text, and 𝐸(𝑥) represent the explanation method that produces a set of important features 𝐹 = {𝑓1, 𝑓2, …, 𝑓𝑘} with associated importance scores. The fidelity score for a single instance is defined as:

Fidelity(𝑥, 𝐸) = |𝑀(𝑥) − 𝑀(𝑥′)|   (1)

where 𝑥′ represents the perturbed text obtained by removing the top-𝑘 most important features identified by the explanation method 𝐸.

The fidelity [52] assessment follows this systematic procedure. First, we compute the original model prediction 𝑝0 = 𝑀(𝑥) to establish a baseline reference point. Next, we extract the most important features 𝐹 = 𝐸(𝑥, 𝑘) using the specified explanation method, where 𝑘 determines the number of top-ranked features to consider. Subsequently, we create a modified input 𝑥′ = Remove(𝑥, 𝐹) by removing the identified important features from the original text. We then compute a new prediction 𝑝′ = 𝑀(𝑥′) using this perturbed input to observe how the model's behavior changes. Finally, we calculate the fidelity score as fidelity = |𝑝0 − 𝑝′|, which quantifies the absolute difference between the original and perturbed predictions.

The underlying hypothesis assumes that if an explanation method accurately identifies decision-critical features, their removal should produce substantial changes in model predictions. Mathematically, this can be expressed as:

High Fidelity ⇔ arg max(𝑀(𝑥)) ≠ arg max(𝑀(𝑥′))   (2)

The absolute difference metric captures both direction-preserving and direction-changing prediction modifications, providing a comprehensive assessment of explanation accuracy. For comprehensive evaluation, individual fidelity scores are aggregated using the arithmetic mean:

Mean Fidelity = (1/𝑛) ∑ᵢ₌₁ⁿ |𝑀(𝑥ᵢ) − 𝑀(𝑥′ᵢ)|   (3)

where 𝑛 represents the total number of test instances.

In the broader context of XAI for natural language processing, WordContextGraphExplainer offers methodological advantages over traditional frameworks such as LIME. Unlike LIME, which assumes feature independence and linearity, WordContextGraphExplainer employs a graph-theoretic structure capable of capturing non-linear relationships and contextual dependencies, features essential for modeling complex, multi-sentiment narratives. These findings underscore the superiority of graph-based interpretability in high-stakes domains and suggest promising future directions for next-generation explainable NLP systems (see Table 23).

Table 23
Interpretability fidelity score comparison across datasets.

Dataset      LIME fidelity   WordContextGraphExplainer fidelity   Improvement (%)
Dataset 1    0.8100          0.8900                               +9.88
Dataset 2    0.8000          0.8600                               +7.50
Dataset 3    0.6540          0.7380                               +12.84
Dataset 4    0.6920          0.7120                               +2.89
Dataset 5    0.6800          0.8200                               +20.59

5. Conclusion

This study presents a comprehensive framework for sentiment classification in dialogue-based scenarios through the development of a
novel HybridBERT-LSTM architecture coupled with an innovative interpretability methodology. The proposed hybrid model demonstrates superior performance on both benchmark datasets, including the widely-adopted IMDb corpus, and real-world dialogue datasets, consistently outperforming standalone architectures such as traditional LSTM, BERT, CNN, and SVM implementations. The empirical results validate the model's enhanced capacity to capture both the semantic richness of individual utterances and the sequential dependencies inherent in multi-turn conversational contexts.

The architectural innovation of HybridBERT-LSTM leverages pre-trained BERT encodings for deep contextualized embeddings, subsequently processed through bidirectional LSTM layers to model temporal dependencies and discourse-level structures. The integration of dual pooling mechanisms (average and maximum) followed by dense classification layers enables the model to synthesize learned representations effectively, making it particularly suitable for dialogue sentiment analysis where contextual flow and sequential relationships are paramount.

A significant contribution of this research lies in the development of explainable context-aware sentiment reasoning capabilities. Beyond the scope of traditional local explanation techniques, a novel graph-theoretic interpretability framework, WordContextGraphExplainer, has been proposed to address the fundamental limitations inherent in existing methodologies. Unlike LIME, which operates under linear additivity assumptions and treats tokens as independent entities, WordContextGraphExplainer employs sophisticated perturbation analysis to model non-linear semantic interactions between word pairs. This methodology constructs semantic interaction graphs where nodes represent individual word contributions and edges encode inter-word dependencies, providing intuitive visualization of complex linguistic relationships through NetworkX-based representations. The comparative analysis reveals that while LIME provides granular word-level attributions, it operates independently of sequential context and fails to capture the synergistic effects crucial for accurate sentiment interpretation in conversational settings. In contrast, WordContextGraphExplainer's graph-based approach explicitly models contextual interdependencies, semantic propagation patterns, and negation scope effects that are essential for understanding transformer decision-making processes. This advancement enables practitioners to trace how sentiment emerges through word interactions and temporal flow across dialogue turns, providing unprecedented insights into model reasoning mechanisms. The integration of WordContextGraphExplainer with HybridBERT-LSTM establishes a new paradigm for interpretable dialogue sentiment analysis, where prediction accuracy and explainability are synergistically enhanced. This framework demonstrates particular efficacy in clinical applications and mental health assessment scenarios, where understanding the rationale behind sentiment predictions is as critical as the predictions themselves. Future research directions include extending the graph-based interpretability framework to multilingual contexts and exploring its applications in other NLP tasks requiring fine-grained semantic understanding. Future work should focus on developing simplified visualization layers and adaptive user interfaces that can present graph-based explanations at varying levels of complexity, enabling domain experts to access meaningful interpretability insights without requiring deep technical expertise in graph theory or network analysis. Future research should incorporate systematic human evaluation studies to assess the explanatory quality and clinical applicability of WordContextGraphExplainer outputs among domain practitioners.

CRediT authorship contribution statement

Ercan Atagün: Writing – review & editing, Writing – original draft, Methodology, Investigation, Conceptualization. Günay Temür: Validation, Methodology. Serdar Biroğul: Supervision, Project administration, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

References

[1] L. Song, et al., CASA: Conversational aspect sentiment analysis for dialogue understanding, J. Artificial Intelligence Res. 73 (2022) 511–533.
[2] M. Firdaus, et al., MEISD: A multimodal multi-label emotion, intensity and sentiment dialogue dataset, in: COLING, 2020, pp. 4441–4453.
[3] I. Carvalho, et al., The importance of context for sentiment analysis in dialogues, IEEE Access 11 (2023) 86088–86103.
[4] J. Wang, et al., Sentiment classification in customer service dialogue with topic-aware multi-task learning, AAAI 34 (05) (2020) 9177–9184.
[5] D. Bertero, et al., Real-time speech emotion and sentiment recognition, in: EMNLP, 2016, pp. 1042–1047.
[6] C. Bothe, et al., Dialogue-based neural learning to estimate sentiment, in: ICANN, 2017, pp. 477–485.
[7] M. Firdaus, et al., EmoSen: Generating sentiment and emotion controlled responses, IEEE Trans. Affect. Comput. 13 (3) (2020) 1555–1566.
[8] A. Mallol-Ragolta, B. Schuller, Coupling sentiment and arousal analysis, IEEE Access 12 (2024) 20654–20662.
[9] Z. Akbar, M.U. Ghani, U. Aziz, Boosting viewer experience with emotion-driven video analysis: A BERT-based framework for social media content, J. Artif. Intell. Behav. (2025).
[10] J. Zhao, W. Gao, A semantic-enhanced heterogeneous dialogue graph network, in: IEEE ICETCI, 2024, pp. 1315–1322.
[11] M. Yang, et al., GME-dialogue-NET, Acad. J. Comput. Inf. Sci. 4 (8) (2021) 10–18.
[12] M. Parmar, A. Tiwari, Emotion and sentiment analysis in dialogue: A multimodal strategy employing the BERT model, in: 2024 Parul International Conference on Engineering and Technology, PICET, 2024, pp. 1–7.
[13] Z. Mustapha, Aspect-based emotion analysis for dialogue understanding, 2024.
[14] W. Li, W. Shao, S. Ji, E. Cambria, BiERU: Bidirectional emotional recurrent unit for conversational sentiment analysis, Neurocomputing 467 (2022) 73–82.
[15] S. Poria, D. Hazarika, N. Majumder, R. Mihalcea, Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research, IEEE Trans. Affect. Comput. 14 (1) (2020) 108–132.
[16] L. Zhu, R. Mao, E. Cambria, B.J. Jansen, Neurosymbolic AI for personalized sentiment analysis, in: International Conference on Human-Computer Interaction, Springer Nature Switzerland, Cham, 2024, pp. 269–290.
[17] M. Luo, H. Fei, B. Li, S. Wu, Q. Liu, S. Poria, et al., PanoSent: A panoptic sextuple extraction benchmark for multimodal conversational aspect-based sentiment analysis, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 7667–7676.
[18] Y. Zhang, Q. Li, D. Song, P. Zhang, P. Wang, Quantum-inspired interactive networks for conversational sentiment analysis, 2019.
[19] L. Yang, Q. Yang, J. Zeng, T. Peng, Z. Yang, H. Lin, Dialogue sentiment analysis based on dialogue structure pre-training, Multimedia Syst. 31 (2) (2025) 1–13.
[20] K. Horesh, A. Kumar, A. Anand, A. Sabu, T. Jain, Sentiment analysis on Amazon electronics product reviews using machine learning techniques, IEEE, 2023, http://dx.doi.org/10.1109/gcat59970.2023.10353467.
[21] A. Matsui, E. Ferrara, Word embedding for social sciences: An interdisciplinary survey, PeerJ Comput. Sci. 10 (2024) e2562.
[22] S. Anitha, P. Gnanasekaran, Advanced sentiment classification using RoBERTa and aspect-based analysis on large-scale e-commerce datasets, Nanotechnol. Perceptions 20 (S16) (2024) 336–348.
[23] P. Borah, D. Gupta, B.B. Hazarika, ConCave-convex procedure for support vector machines with Huber loss for text classification, Comput. Electr. Eng. 122 (2025) 109925.
[24] Z. Hua, Y. Tong, Y. Zheng, Y. Li, Y. Zhang, PPGloVe: Privacy-preserving GloVe for training word vectors in the dark, IEEE Trans. Inf. Forensics Secur. 19 (2024) 3644–3658.
[25] A. Rasool, S. Aslam, N. Hussain, S. Imtiaz, W. Riaz, NBERT: Harnessing NLP for emotion recognition in psychotherapy to transform mental health care, Information 16 (4) (2025) 301.
[26] E. Mitera-Kiełbasa, K. Zima, Automated classification of exchange information requirements for construction projects using Word2Vec and SVM, Infrastructures 9 (11) (2024) 194.
[27] Z. Yang, F. Emmert-Streib, Optimal performance of Binary Relevance CNN in targeted multi-label text classification, Knowl.-Based Syst. 284 (2024) 111286.
[28] J. Peng, S. Huo, Application of an improved convolutional neural network algorithm in text classification, J. Web Eng. 23 (3) (2024) 315–339.
[29] K. Nithya, M. Krishnamoorthi, S.V. Easwaramoorthy, C.R. Dhivyaa, S. Yoo, J. Cho, Hybrid approach of deep feature extraction using BERT–OPCNN & FIAC with customized Bi-LSTM for rumor text classification, Alex. Eng. J. 90 (2024) 65–75.
[30] S. Jamshidi, M. Mohammadi, S. Bagheri, H.E. Najafabadi, A. Rezvanian, M. Gheisari, et al., Effective text classification using BERT, MTM LSTM, and DT, Data Knowl. Eng. 151 (2024) 102306.
[31] O. Galal, A.H. Abdel-Gawad, M. Farouk, Federated freeze BERT for text classification, J. Big Data 11 (1) (2024) 28.
[32] C. Eang, S. Lee, Improving the accuracy and effectiveness of text classification based on the integration of the BERT model and a recurrent neural network (RNN_Bert_Based), Appl. Sci. 14 (18) (2024) 8388.
[33] M. Ahmed, M.S. Hossain, R.U. Islam, K. Andersson, Explainable text classification model for COVID-19 fake news detection, J. Internet Serv. Inf. Secur. 12 (2) (2022) 51–69.
[34] K. Zahoor, N.Z. Bawany, T. Qamar, Evaluating text classification with explainable artificial intelligence, Int. J. Artif. Intell. (2024), ISSN 2252-8938.
[35] D. Kalla, N. Smith, F. Samaah, Deep learning-based sentiment analysis: Enhancing IMDb review classification with LSTM models, 2025, Available at SSRN 5103558.
[36] R. Beniwal, A.K. Dinkar, A. Kumar, A. Panchal, A hybrid deep learning model for sentiment analysis of IMDB movies reviews, in: 2024 Asia Pacific Conference on Innovation in Technology, APCIT, IEEE, 2024, pp. 1–7.
[37] N. Tabassum, T. Alyas, M. Hamid, M. Saleem, S. Malik, Z. Ali, U. Farooq, Semantic analysis of Urdu English tweets empowered by machine learning, Intell. Autom. Soft Comput. 30 (1) (2021) 175–186.
[38] A. Pandey, R. Yadav, A. Pathak, N. Shivani, B. Garg, A. Pandey, Sentiment analysis of IMDB movie reviews, in: 2024 First International Conference on Software, Systems and Information Technology, SSITCON, IEEE, 2024, pp. 1–6.
[40] A. Bajaj, D.K. Vishwakarma, HOMOCHAR: A novel adversarial attack framework for exposing the vulnerability of text-based neural sentiment classifiers, Eng. Appl. Artif. Intell. 126 (2023) 106815, http://dx.doi.org/10.1016/j.engappai.2023.106815.
[41] A. Bajaj, D.K. Vishwakarma, Evading text-based emotion detection mechanism via adversarial attacks, Neurocomputing 558 (2023).
[42] G.A. de Oliveira, R.T. de Sousa, R. de O. Albuquerque, L.J.G. Villalba, Adversarial attacks on a lexical sentiment analysis classifier, Comput. Commun. 174 (2021) 154–171, http://dx.doi.org/10.1016/j.comcom.2021.04.026.
[43] M. Hussain, M. Naseer, Comparative analysis of logistic regression, LSTM, and Bi-LSTM models for sentiment analysis on IMDB movie reviews, J. Artif. Intell. Comput. 2 (1) (2024) 1–8.
[44] C.D. Kulathilake, J. Udupihille, S.P. Abeysundara, A. Senoo, Deep learning-driven multi-class classification of brain strokes using computed tomography: A step towards enhanced diagnostic precision, Eur. J. Radiol. 187 (2025) 112109.
[45] Amod, Mental health counseling conversations dataset, 2024, Retrieved from https://huggingface.co/datasets/Amod/mental_health_counseling_conversations/tree/main.
[46] B. Yao, P. Tiwari, Q. Li, Self-supervised pre-trained neural network for quantum natural language processing, Neural Netw. 184 (2025) 107004.
[47] SohamGhadge, Casual conversation dataset, 2024, Retrieved from https://huggingface.co/datasets/SohamGhadge/casual-conversation/tree/main.
[48] Mahfoos, Patient-doctor conversation dataset, 2024, Retrieved from https://huggingface.co/datasets/mahfoos/Patient-Doctor-Conversation/tree/main.
[49] Alimistro123, English chat sentiment dataset, 2024, Retrieved from https://www.kaggle.com/code/alimistro123/english-chat-sentiment-dataset-found.
[50] Adapting, Empathetic dialogues v2 dataset, 2024, Retrieved from https://huggingface.co/datasets/Adapting/empathetic_dialogues_v2.
[51] Y. Singh, Q.A. Hathaway, V. Keishing, S. Salehi, Y. Wei, N. Horvat, D.V. Vera-Garcia, A. Choudhary, A.Mula. Kh, E. Quaia, et al., Beyond post hoc explanations:
|
||
[39] R. Amin, R. Gantassi, N. Ahmed, A.H. Alshehri, F.S. Alsubaei, J. Frnda, A hybrid A comprehensive framework for accountable AI in medical imaging through
|
||
approach for adversarial attack detection based on sentiment analysis model transparency, Interpret. Explain. Bioeng. 12 (8) (2025) 879.
|
||
using machine learning, Eng. Sci. Technol. an Int. J. 58 (2024) 101829. [52] M. Bayesh, S. Jahan, Embedding security awareness in IoT systems: A framework
|
||
for providing change impact insights, Appl. Sci. 15 (14) (2025) 7871.