Computer Standards & Interfaces 97 (2026) 104086 Contents lists available at ScienceDirect Computer Standards & Interfaces journal homepage: www.elsevier.com/locate/csi Graph-based interpretable dialogue sentiment analysis: A HybridBERT-LSTM framework with semantic interaction explainer Ercan Atagün a ,∗, Günay Temür b , Serdar Biroğul c,d a Computer Engineering, Institute Of Graduate Studies, Duzce University, Düzce, 81000, Turkey b Kaynasli Vocational School, Duzce University, Düzce, 81000, Turkey c Department of Computer Engineering, Faculty of Engineering, Duzce University, Düzce, 81000, Turkey d Department of Electronics and Information Technologies, Faculty of Architecture and Engineering, Nakhchivan State University, Nakhchivan, Azerbaijan ARTICLE INFO ABSTRACT Keywords: Conversational sentiment analysis in natural language processing faces substantial challenges due to intricate Natural language processing contextual semantics and temporal dependencies within multi-turn dialogues. We present a novel HybridBERT- Explainable artificial intelligence LSTM architecture that integrates BERT’s contextualized embeddings with LSTM’s sequential processing Word context graph explainer capabilities to enhance sentiment classification performance in dialogue scenarios. Our framework employs a dual-pooling mechanism to capture local semantic features and global discourse dependencies, addressing limitations of conventional approaches. Comprehensive evaluation on IMDb benchmark and real-world dialogue datasets demonstrates that HybridBERT-LSTM consistently improves over standalone models (LSTM, BERT, CNN, SVM) across accuracy, precision, recall, and F1-score metrics. The architecture effectively exploits pre-trained contextual representations through bidirectional LSTM layers for temporal discourse modeling. We introduce WordContextGraphExplainer, a graph-theoretic interpretability framework addressing conventional explanation method limitations. Unlike LIME’s linear additivity assumptions treating features independently, our approach utilizes perturbation-based analysis to model non-linear semantic interactions. The framework generates semantic interaction graphs with nodes representing word contributions and edges encoding inter- word dependencies, visualizing contextual sentiment propagation patterns. Empirical analysis reveals LIME’s inadequacies in capturing temporal discourse dependencies and collaborative semantic interactions crucial for dialogue sentiment understanding. WordContextGraphExplainer explicitly models semantic interdependencies, negation scope, and temporal flow across conversational turns, enabling comprehensive understanding of both word-level contributions and contextual interaction influences on decision-making processes. This integrated framework establishes a new paradigm for interpretable dialogue sentiment analysis, advancing trustworthy AI through high-performance classification coupled with comprehensive explainability. 1. Introduction sentiment analysis, as conventional text classification methodologies frequently fail to adequately capture such sequential continuity. The Dialogue-based sentiment analysis constitutes a significant research multi-speaker nature of dialogues introduces critical considerations domain within the field of natural language processing (NLP). This area regarding utterance attribution and the identification of emotional of study represents a fundamental component of efforts to enhance expression sources. Modeling sentiment transitions between conver- human–machine interaction through more meaningful and emotion- sational participants presents particular challenges, especially in sce- centric approaches. Research endeavors in this field encompass numer- ous inherent challenges and complexities. Dialogues typically emerge narios where emotions are expressed through implicit mechanisms. from the reciprocal interactions among multiple conversational partic- Rather than explicit emotional declarations, human linguistic behavior ipants, where the scope of communicative content spans the breadth frequently employs sophisticated rhetorical devices including irony, of human knowledge and experience. The emotional orientation of an sarcasm, humor, double entenders, and cultural references, resulting in utterance within a conversational sequence demonstrates substantial sentiment interpretations that diverge significantly from surface-level dependency upon preceding discourse and contextual cues. This phe- textual analysis. This phenomenon proves particularly problematic in nomenon necessitates the development of context-aware models for ∗ Corresponding author. E-mail address: ercanatagun@duzce.edu.tr (E. Atagün). https://doi.org/10.1016/j.csi.2025.104086 Received 7 June 2025; Received in revised form 7 October 2025; Accepted 13 October 2025 Available online 12 November 2025 0920-5489/© 2025 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by- nc-nd/4.0/). E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 brief, context-independent utterances, substantially complicating sen- media environments, further demonstrating the potential of multimodal timent analysis procedures. Contemporary dialogue-based sentiment affective understanding. analysis research faces significant constraints regarding the availabil- Graph-based modeling has also been incorporated into multimodal ity of high-quality, annotated datasets. Existing corpora are charac- sentiment analysis. Zhao and Gao [10] proposed a semantically en- terized by either limited scale or restriction to specific contextual riched heterogeneous dialogue graph network to analyze sentiment in domains such as cinematic dialogue or customer service interactions. multi-party conversations. Yang et al. [11] advanced sentiment accu- Furthermore, the insufficient representation of cultural, linguistic, and racy through a model that jointly processes text, audio, and visual cues. social diversity within available datasets impedes the development Context-awareness is a pivotal factor in sentiment interpretation within of generalizable models with robust cross-domain applicability. Deep dialogues. Carvalho et al. [3] emphasized the influence of preceding learning-based sentiment analysis architectures predominantly exhibit discourse on sentiment prediction. To enhance contextual coherence in ‘‘black box’’ characteristics, rendering their decision-making processes opaque to human interpretation. This limitation particularly dimin- generative AI dialogue systems, personalized dialogue summarization ishes model reliability in tasks where emotional interpretation involves techniques have been employed [12]. Mustapha [13] proposed a model inherent subjectivity, consequently necessitating human oversight in to analyze sentiment-cause relationships in stress-laden conversations, practical applications. In this study, a novel hybrid model is pro- aiming to reveal emotional dynamics. Contextual memory mechanisms posed, integrating BERT’s contextualized representation capabilities were further explored by Li et al. [14], who developed a bidirec- with the sequential modeling proficiency of LSTM to address the in- tional emotional recurrent unit (BiERU) to capture dynamic context herent challenges of sentiment analysis in dialogue-based datasets. The shifts and their implications for sentiment detection. Explainability has architecture is specifically designed to capture both linguistic features gained increasing importance in sentiment analysis. A variety of ap- and temporal dependencies embedded within conversational structures. proaches including attention mechanisms, graph neural networks, and To enhance the interpretability of model outputs, a graph-theoretic neuro-symbolic architectures have been introduced to elucidate model interpretability framework, termed WordContextGraphExplainer, is in- decision-making. Poria et al. [15] discussed fundamental challenges troduced. This framework overcomes the limitations of conventional in sentiment interpretation and underscored the role of explainability. explanation methods by modeling non-linear semantic interactions be- Zhu et al. [16] developed a neuro-symbolic model for personalized tween lexical units. Through the construction of semantic interaction sentiment analysis, incorporating user-specific contextual factors into graphs, the approach facilitates comprehensive visualization of contex- the explanatory framework. Luo et al. [17] introduced the PanoSent tual sentiment propagation patterns, offering novel insights into the dataset to improve the analysis of emotional shifts in interactive sys- underlying decision-making mechanisms of the model and establish- tems. In another direction, Zhang et al. proposed a novel interaction ing a new paradigm for interpretable sentiment analysis in dialogue systems. network inspired by quantum theory to reframe dialogue-based sen- timent analysis [18]. Yang et al. [19] addressed the inadequacies 2. Related works of existing pre-trained models in capturing the logical structure of dialogues. To overcome these limitations, they proposed a new pre- Sentiment analysis has gained significant traction in NLP research, training framework comprising utterance order modeling, sentence driven by its pivotal role in enabling affective computing across do- skeleton reconstruction, and sentiment shift detection, demonstrating mains such as human–computer interaction, intelligent customer sup- improvements in learning emotion interactions and discourse coher- port, and conversational AI systems. Recent advancements in the field ence. Collectively, recent developments in sentiment analysis empha- have led to the development of a diverse array of methodologies, en- size the significance of contextual awareness, multimodal data fusion, compassing text-based approaches, multimodal frameworks, contextual graph-based reasoning, and explainable AI techniques in enhancing modeling techniques, and sophisticated deep learning architectures. performance and interpretability within dialogue-centric applications. This section presents an overview of key contributions in the literature, with a particular emphasis on dialogue-based sentiment analysis, which plays a critical role in domains such as customer support, conversa- 3. Materials and methods tional AI, and empathetic dialogue systems. Song et al. [1] introduced a topic-aware sentiment analysis model for dialogue (CASA), aiming to The dialogue dataset dyadic conversational exchanges between two identify sentiment orientations within conversational threads. Firdaus distinct participants. Each dialogue instance is structured as a se- et al. [2] constructed the MEISD dataset, incorporating textual, audio, quence of alternating utterances, where each turn is associated with and visual data for multimodal sentiment analysis. Emphasizing the a specific speaker and the corresponding textual content. The formal relevance of conversational context, Carvalho et al. [3] demonstrated that prior utterances significantly influence sentiment classification mathematical representation of the dialogue structure is given by: outcomes. Building upon this insight, topic-aware sentiment classifica-  = {(𝑠𝑖 , 𝑡𝑖 )}𝑁 𝑖=1 , 𝑠𝑖 ∈  = {𝐴, 𝐵}, 𝑡𝑖 ∈ 𝛴 ∗ tion models have been proposed using multi-task learning strategies within customer service dialogues [4]. Real-time sentiment analysis Here,  denotes the complete dialogue dataset in dialogue systems is also a critical consideration. Bertero et al. [5] developed a convolutional neural network capable of processing au- composed of 𝑁 conversational turns. dio inputs for instantaneous emotion detection in interactive systems. Each pair (𝑠𝑖 , 𝑡𝑖 ) represents the 𝑖-th turn in the dialogue, Bothe et al. [6] presented a model to predict the sentiment of up- where 𝑠𝑖 is the speaker identifier and coming utterances, thereby analyzing emotional transitions throughout dialogue sequences. To address the limitations of unimodal text-based 𝑡𝑖 is the corresponding utterance. sentiment analysis, recent studies have adopted multimodal strategies The speaker set  = {𝐴, 𝐵} contains two participants, by integrating text, speech, and visual signals. For instance, the EmoSen typically alternating in a turn-based structure. model [7] generates sentiment-aware responses using fused inputs The term 𝛴 represents the alphabet of the natural language from these modalities. Similarly, Mallol-Ragolta and Schuller [8] intro- duced a system that personalizes dialogue responses by estimating user in which the dialogue is conducted, and 𝛴 ∗ denotes the set emotions and arousal levels. Akbar et al. [9] proposed an innovative of all finite-length strings (i.e., possible utterances) emotion-driven framework for video-based sentiment analysis in social formed from this alphabet. 2 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 3.1. Data preprocessing and word embedding 3.2. GloVe: Global vectors for word representation The successful training of natural language processing (NLP) models GloVe [24] is a widely adopted word embedding technique designed is highly dependent on the transformation of raw textual data into to capture semantic and conceptual relationships between words, par- structured and semantically meaningful representations [20]. In this ticularly in text classification tasks. It operates by constructing word study, all textual inputs undergo a series of preprocessing operations vector representations through the optimization of global word co- designed to optimize them for subsequent modeling tasks. An initial occurrence statistics derived from large-scale corpora. Unlike local and essential preprocessing step involves lowercasing, which standard- context–based models such as Word2Vec, GloVe incorporates both izes textual input by mitigating case sensitivity inconsistencies that local and global contextual information, embedding lexical units into a would otherwise lead to redundant representations of semantically dense, continuous vector space. In practical applications, GloVe embed- identical words. This step is particularly critical for ensuring the ef- dings are employed to convert unstructured input text into fixed-length fectiveness and consistency of word embedding techniques. Given that numerical tensors, which serve as inputs to deep learning architec- parts of the dataset originate from web-based sources, residual HTML tures such as CNN and LSTM models. This transformation enables tags and encoded entities such as
and   are present in the raw text. These components provide no linguistic or semantic value and the model to effectively distinguish between textual classes by cap- may negatively affect model performance. Therefore, all HTML-related turing both syntactic patterns and latent semantic features. The key tokens and special characters are systematically removed in the prepro- advantage of GloVe lies in its ability to unify global corpus-level cessing phase to reduce noise within the input space and to enhance the statistical information with local context, producing more stable and robustness of downstream NLP models. This comprehensive cleaning semantically meaningful representations compared to models relying process is implemented using Python NLTK, BeautifulSoup libraries solely on window-based learning. However, it remains a static embed- combined with regular expression patterns to ensure thorough removal ding technique; each word is assigned a single vector regardless of of web-derived artifacts. Additionally, standard stopword removal is its context within a sentence. This context-independent nature limits applied to eliminate semantically non-contributive terms. Notably, tra- its flexibility when compared to transformer-based models like BERT, ditional morphological normalization techniques such as stemming which generate dynamic embeddings conditioned on the broader lin- and lemmatization are deliberately excluded from our preprocessing guistic environment. Despite these limitations, GloVe continues to play pipeline, as BERT’s contextualized embedding framework inherently a significant role in various NLP tasks such as text similarity, topic captures morphological variations and semantic relationships without labeling,spam detection, and sentiment analysis where modeling word- requiring explicit normalization steps. level semantics remains essential. Its computational simplicity and ease Following text normalization, each cleaned sentence is tokenized of integration make it a reliable baseline in many NLP pipelines. Recent into subword or word-level units. These token sequences are then studies [25] have highlighted the importance of consistent embedding converted into dense numerical representations using word embedding strategies when comparing different NLP models, as variations in em- techniques such as GloVe [21]. Embedding techniques project discrete bedding approaches can significantly impact performance comparisons textual units into continuous vector representations that encapsulate and lead to biased evaluations. both semantic coherence and syntactic structure, thereby facilitating computational models in capturing lexical relatedness and contextual 3.3. Support Vector Machine (SVM) alignment within language data. The original, unprocessed dataset can be denoted as follows: Let the original unprocessed dataset be represented [22] as: Support Vector Machine (SVM) [26] is a well-established super- vised learning algorithm widely employed in text classification tasks, 𝑇 = {𝑠1 , 𝑠2 , … , 𝑠𝑁 } particularly due to its robustness in handling high-dimensional data where each sentence 𝑠𝑘 is defined as [23] a sequence of 𝑀 words: representations. In natural language processing pipelines, textual inputs are typically transformed into numerical feature vectors using tech- 𝑠𝑘 = {𝑢1 , 𝑢2 , … , 𝑢𝑀 } niques such as Term Frequency–Inverse Document Frequency (TF-IDF) To refine the input, special characters , web-related entities , and or various word embedding models. Once converted, SVM operates by semantically non-contributive stopwords  are eliminated. The cleaned identifying the optimal hyperplane that best separates the data points sentence is thus defined by: into distinct class labels. The core principle of SVM lies in maximizing the margin between classes, thereby enhancing generalization perfor- 𝑠′𝑘 = Clean(𝑠𝑘 ) = {𝑢𝑗 ∈ 𝑠𝑘 ∣ 𝑢𝑗 ∉ ( ∪  ∪ )} mance. This is particularly advantageous in scenarios where the feature The sanitized sentence 𝑠′𝑘 is then tokenized: space exhibits high dimensionality and potential overlap between class 𝑠′𝑘 = {𝑣1 , 𝑣2 , … , 𝑣𝑃 }, 𝑣𝑖 ∈  distributions. Furthermore, SVM’s ability to incorporate non-linear ker- nel functions such as polynomial or radial basis function (RBF) kernels where  denotes the vocabulary of all tokens in the dataset. enables it to capture complex, non-linear patterns within the data, Word embeddings serve as a cornerstone for text classification, as which are often present in linguistically rich or semantically ambiguous they enable models to capture abstract semantic relationships while textual inputs. Due to its mathematically grounded optimization frame- reducing the dimensionality of input features. Unlike traditional bag- work and resistance to overfitting, SVM remains a competitive baseline of-words approaches, embeddings are resilient to linguistic variability in various text classification domains, including sentiment analysis, such as synonymy and polysemy. For sentiment analysis tasks, embed- spam detection, and topic categorization. Its effectiveness is further dings can cluster words with similar affective connotations, thereby enhanced when combined with appropriate feature engineering and enhancing the model’s ability to generalize and detect implicit senti- dimensionality reduction techniques, making it a viable choice for both ments. Likewise, in general classification tasks, embeddings help reveal thematic cohesion across texts, ultimately contributing to improved small-scale and large-scale NLP applications. predictive performance. Nevertheless, conventional embeddings like Word2Vec or GloVe are context-independent, assigning the same vector 3.4. Convolutional Neural Networks (CNN) representation to a word regardless of its usage context. This limitation is addressed by contextualized models such as BERT, which generate Although originally developed for image recognition tasks, Convo- dynamic embeddings based on surrounding words using transformer- lutional Neural Networks (CNNs) have been extensively adapted for based architectures. Word embeddings bridge the gap between lin- various natural language processing problems, particularly in multi- guistic expressiveness and computational tractability and remain an label text classification [27] and sentiment analysis [28], due to their indispensable component of modern NLP pipelines. capacity to capture local hierarchical patterns in sequential data. In 3 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 text classification applications, CNNs operate on word embeddings by proposed. These hybrid solutions aim to retain BERT’s rich contextual applying one-dimensional convolutional filters to detect local patterns understanding while improving computational efficiency and general- such as n-grams or syntactic motifs. These filters perform element- izability, making them more suitable for applications constrained by wise multiplications followed by non-linear activation functions to resources or latency requirements. generate feature maps that emphasize the most informative regions of the input sequence. A subsequent max-pooling operation reduces the 3.7. Local Interpretable Model-Agnostic Explanations (LIME) dimensionality and retains the most salient features, thereby enabling the network to focus on contextually rich segments of text. This ar- LIME is a model-agnostic interpretability framework designed to chitecture allows CNNs to efficiently model contextual dependencies provide localized explanations for the predictions of complex machine within fixed-size receptive fields, making them particularly suitable for learning models. Positioned within the broader field of Explainable tasks such as topic categorization, polarity detection, and aspect-based Artificial Intelligence (XAI), LIME serves to enhance the interpretability sentiment analysis. Compared to recurrent neural networks (RNNs), of opaque ‘‘black-box’’ systems, particularly in high-stakes domains CNNs offer significant advantages in terms of computational efficiency where transparency and trust are critical [33]. LIME’s main goal is to and parallelizability, as they do not rely on sequential input processing. provide a straightforward, interpretable surrogate model that, within However, one notable limitation of CNNs is their reduced capacity to the local neighborhood of a particular instance, roughly represents the model long-range dependencies, which can affect performance in tasks original model’s decision boundary [34]. LIME accomplishes this by involving lengthy or complex discourse structures. generating a set of synthetic samples close to the target instance, which perturbs the original input. The black-box model is used to these altered 3.5. Long Short-Term Memory Networks (LSTM) examples in order to derive the relevant predictions. These cases are then subjected to a locality-sensitive weighting function, and the deci- LSTM networks, as a refined subclass of recurrent neural architec- sion function is locally approximated by training a sparse linear model tures, have demonstrated substantial effectiveness in text classification on the weighted dataset. The contribution of each feature to the final tasks due to their capacity to capture long-range dependencies and pre- prediction is inferred using the surrogate model’s resulting coefficients. serve semantically meaningful representations across sequential data One of the key strengths of LIME lies in its model-agnostic design, inputs [29]. By incorporating internal memory units and a gated con- allowing it to be applied across a wide range of machine learning trol mechanism – comprising input, forget, and output gates – LSTM algorithms, including ensemble methods, deep neural networks, and models effectively address the vanishing gradient challenge that limits support vector machines. It offers human-understandable explanations conventional RNNs. These gating components orchestrate information while maintaining local fidelity to the original model. As such, LIME flow dynamically, facilitating the retention of salient features over pro- is widely adopted for increasing decision transparency and enabling longed contexts and ensuring the continuity of semantic interpretation human-AI collaboration, particularly in sensitive applications such as throughout the sequence [30]. In text classification applications, LSTM healthcare diagnostics, financial risk assessment, and legal reasoning. typically process input sequences encoded as dense word embeddings, allowing the network to learn hierarchical feature representations that 3.8. WordContextGraphExplainer encapsulate both syntactic structure and semantic meaning. This ca- pacity to capture nuanced contextual relationships makes LSTM par- ticularly effective in tasks such as sentiment analysis, text similarity, The exponential growth in transformer-based natural language pro- spam detection, and topic categorization where subtle variations in cessing (NLP) architectures has created an unprecedented demand word order and polarity significantly influence predictive accuracy. for interpretability frameworks capable of elucidating the complex For instance, in sentiment classification, LSTM models can differen- decision-making processes underlying these black-box models. While tiate between expressions like ‘‘not good’’ and ‘‘extremely good’’ by widely adopted XAI techniques such as LIME (Local Interpretable maintaining a dynamic memory of temporal context throughout the Model-Agnostic Explanations) and SHAP (SHapley Additive Explana- sequence. tions) offer valuable insights through feature attribution, they in- herently rely on linear additivity assumptions among input features. 3.6. Bidirectional Encoder Representations from Transformers (BERT) This assumption falls short in capturing the intricate semantic de- pendencies and non-linear interactions that characterize deep lan- BERT is a transformer-based, pre-trained language model that has guage understanding. A fundamental limitation of existing approaches substantially advanced the state of the art in text classification tasks lies in their inability to model contextual interdependencies between by capturing bidirectional contextual semantics through self-attention words relationships that are crucial for interpreting sentiment propa- mechanisms [31]. Unlike unidirectional models such as LSTM or GRU, gation, negation scope, and semantic coherence in complex linguistic which process text sequentially, BERT encodes semantic dependencies structures. Traditional token-level attribution methods treat individual from both left and right contexts simultaneously. This architecture words as independent contributors, failing to account for the synergistic enables nuanced disambiguation of polysemous words and more robust effects that emerge from word pairings and contextual associations modeling of long-range dependencies in natural language [32]. In in the semantic space. In this paper, WordContextGraphExplainer is text classification applications, BERT is typically fine-tuned on task- introduced as a novel graph-theoretic interpretability framework de- specific labeled datasets. This involves appending a classification layer veloped to enhance the transparency of transformer-based sentiment often a dense layer with softmax activation on top of the pre-trained classification systems. The methodology is built upon a systematic BERT encoder. Through this transfer learning paradigm, BERT exhibits perturbation analysis paradigm, in which masked language modeling is superior performance across a variety of NLP tasks including sentiment employed to estimate both individual lexical contributions and pairwise classification, aspect-based sentiment analysis, and multi-label classifi- semantic interactions. In contrast to linear attribution methods, this cation, particularly in settings characterized by contextual ambiguity approach explicitly models non-linear dependencies by quantifying the and hierarchical dependencies. However, BERT’s practical deployment divergence between observed joint effects and the expected additive presents several challenges. Its high computational complexity, sensi- influence of word pairs. At the core of the framework is the construction tivity to input sequence length, and the requirement for large volumes of a semantic interaction graph, where nodes represent individual of labeled data during fine-tuning can pose significant barriers in real- words annotated with their relative sentiment contributions, and edges world scenarios. To mitigate these limitations, hybrid architectures that encode the magnitude and directionality of inter-word dependencies. integrate BERT with more lightweight modeling components have been This graph-based representation facilitates intuitive visualization of 4 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 complex linguistic relationships through NetworkX-based layouts, en- capturing contextual semantics, its self-attention mechanism may not abling deeper insight into how contextual factors influence model fully exploit the sequential dependencies within dialogue utterances. To predictions. The framework demonstrates particular efficacy in sen- mitigate this limitation, bidirectional LSTM layers are incorporated to timent analysis tasks where nuanced interactions between affective model temporal patterns and discourse-level relationships across token indicators, negation patterns, and contextual modifiers significantly sequences. These layers are adept at retaining long-range dependencies impact interpretive accuracy. By providing interpretable visualizations and recognizing sentiment transitions across multi-turn dialogue. By of semantic interaction networks, WordContextGraphExplainer sup- integrating these two components, the proposed HybridBERT-LSTM ports advanced model debugging, bias detection, and clinical decision architecture achieves a richer understanding of both the global context support in sensitive domains such as mental health assessment and and local structure of textual data, enhancing its capability to discern medical text analytics. Moreover, the framework incorporates a top- sentiment in complex conversational scenarios. This dual modeling k interaction filtering mechanism, ensuring computational scalability approach positions the framework as a robust solution for sentiment while preserving the granularity required for interpretable analysis in classification tasks, particularly in dialogue-rich environments where high-stakes applications. This methodological advancement represents contextual flow and temporal coherence are paramount. a critical step toward the development of trustworthy AI systems that combine linguistic reasoning with transparent explanatory capabilities, 3.9. Model architecture offering a robust foundation for real-world deployment. The proposed model processes input text through a series of trans- Algorithm 1: WordContextGraphExplainer Method formation stages, mathematically formalized as follows: Given an input sequence: Input: Text 𝑇 , transformer model 𝑀, tokenizer 𝜏, feature number 𝑘 ≥ 1, device 𝑑. 𝑋 = {𝑥1 , 𝑥2 , … , 𝑥𝑛 }, where 𝑛 ≤ 256, Output: Word context graph 𝐺 with semantic interactions. 1: Compute baseline prediction 𝑃0 = 𝑀(𝑇 ). the BERT encoder maps each token 𝑥𝑖 to a contextualized embedding, 2: Compute predicted_class = arg max(𝑃0 ). producing a sequence of hidden states: 3: Initialize 𝑊 = 𝜏(𝑇 ), word_effects = ∅, interactions = ∅. 𝐻 = BERT(𝑋) ∈ R𝑛×𝑑BERT 4: for each 𝑤𝑖 ∈ 𝑊 do 5: 𝑇masked = replace(𝑇 , 𝑤𝑖 , ‘[𝙼𝙰𝚂𝙺]’) 6: 𝑃masked = 𝑀(𝑇masked ) where 𝑑BERT = 768 represents the dimensionality of BERT’s contextual 7: word_effects[𝑖] = 𝑃0 − 𝑃masked embeddings. 8: end for The sequence 𝐻 is passed to a 3-layer bidirectional LSTM net- 9: for each (𝑤𝑖 , 𝑤𝑗 ) ∈ combinations(𝑊 , 2) do work to capture temporal dependencies beyond what is modeled by 10: 𝑇pair = replace(𝑇 , [𝑤𝑖 , 𝑤𝑗 ], ‘[𝙼𝙰𝚂𝙺]’) self-attention: 11: 𝑃pair = 𝑀(𝑇pair ) ⃖⃗𝑡 = LSTMforward (𝐻𝑡 , ℎ ℎ ⃖⃗𝑡−1 ), ⃖⃖ ℎ𝑡 = LSTMbackward (𝐻𝑡 , ⃖⃖ ℎ𝑡+1 ) 12: actual_effect = 𝑃0 − 𝑃pair 13: expected_effect = word_effects[𝑖] + word_effects[𝑗] The final representation for each token is obtained by concatenating 14: interaction𝑖𝑗 = actual_effect − expected_effect the forward and backward hidden states: ‖ ‖ 15: interactions[(𝑤𝑖 , 𝑤𝑗 )] = ‖interaction𝑖𝑗 ‖ ℎLSTM ℎ𝑡 ] ∈ R2𝑑LSTM ⃖⃗𝑡 ; ⃖⃖ = [ℎ ‖ ‖2 𝑡 16: end for 17: Sort interactions by magnitude in descending order. with 𝑑LSTM = 256, resulting in a 512-dimensional output per token. To obtain a fixed-length vector representation of the sequence, both 18: 𝑡𝑜𝑝_𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠 = 𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠[∶ 𝑘] average and maximum pooling operations are applied: 19: Construct graph 𝐺 = (𝑉 , 𝐸) where 𝑉 = 𝑊 and 𝐸 = 1 ∑ LSTM 𝑛 𝑡𝑜𝑝_𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠 ℎavg = ℎ , ℎmax = max ℎLSTM 20: Compute layout positions using 𝑜𝑟𝑔𝑎𝑛𝑖𝑧𝑒𝑑_𝑙𝑎𝑦𝑜𝑢𝑡(𝑊 , 𝑛 𝑖=1 𝑖 𝑖 1≤𝑖≤𝑛 𝑡𝑜𝑝_𝑖𝑛𝑡𝑒𝑟𝑎𝑐𝑡𝑖𝑜𝑛𝑠) 21: Visualize 𝐺 with NetworkX rendering and semantic color These vectors are concatenated to form the final sequence representa- coding tion: 22: Return 𝐺 ℎcombined = [ℎavg ; ℎmax ] ∈ R4𝑑LSTM = R1024 In this study, a hybrid architecture is proposed that integrates a Feed-forward classification pre-trained BERT model with a bidirectional Long Short-Term Memory (BiLSTM) network to address the task of sentiment classification. The The combined representation is passed through a feed-forward neu- model processes textual input to generate sentiment label predictions, ral network with dropout regularization: effectively capturing both semantic context and temporal structure 𝑧1 = Dropout0.3 (ℎcombined ) inherent in natural language. Grounded in a transformer-based archi- tecture, the system accepts input sequences of up to 256 tokens, apply- followed by a two-layer multilayer perceptron (MLP) with ReLU acti- ing appropriate padding and truncation mechanisms when necessary vation and softmax output for multi-class classification. to standardize input lengths. The HybridBERT-LSTM model embodies This is followed by a two-layer MLP classifier, using a ReLU acti- a synergistic design that leverages the complementary strengths of vation and softmax output for multi-class prediction. The HybridBERT- transformer-based language models and recurrent neural networks. LSTM architecture integrates the strengths of transformer-based con- This hybrid framework is explicitly engineered to address two crit- textual modeling with the sequential learning capabilities of recurrent ical aspects of sentiment analysis: contextual representation and se- neural networks. While BERT excels in capturing bidirectional semantic quential modeling. Contextual Representation: The BERT encoder, pre- context via self-attention, the inclusion of bidirectional LSTM layers trained on large-scale corpora, produces deep contextualized embed- enhances the model’s ability to capture sequential dependencies and dings by employing multi-head self-attention mechanisms. These em- emotional transitions throughout dialogue sequences. The dual pooling beddings capture nuanced semantic and syntactic information, enabling strategy(average and max pooling) provides a comprehensive summary the model to differentiate between polysemous expressions and context- of the sequence. Average pooling captures the overall sentiment distri- dependent sentiment cues. Sequential Modeling: While BERT excels at bution across the sequence, whereas max pooling emphasizes salient 5 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 emotional cues. This duality enriches the feature space and contributes Table 1 to more robust classification. Furthermore, hierarchical feature ab- HybridBERT-LSTM Model Parameters. straction is enabled by stacking multiple LSTM layers, allowing the Parameter name Parameter value model to learn long-range patterns more effectively than shallow RNN Model architecture BERT encoder + BiLSTM + MLP structures. Dropout layers, strategically placed after pooling (with a Base Model google-bert/bert-base-uncased rate of 0.3) and within the classifier (rate 0.2), serve as regularization Tokenizer google-bert/bert-base-uncased Maximum sequence length 256 mechanisms to prevent overfitting, especially during fine-tuning on LSTM layer 6 task-specific datasets. The model is trained using the AdamW optimizer Batch size 32 with a learning rate of 2 × 10−5 , and the cross-entropy loss function Number of epochs 5 is employed as the objective. Performance evaluation is conducted Learning rate 0.00002 Optimization algorithm AdamW using standard metrics including accuracy, precision, recall, and F1- Loss function CrossEntropyLoss score, ensuring comprehensive validation of the model’s classification LSTM latent size 256 capability. The model integrates a pre-trained BERT encoder for captur- Pooling avg + max pooling ing deep contextual embeddings from input text sequences, followed MLP layer Linear(1024→128) → ReLU → Linear(128→n_classes) Dropout rates 0.3 by a multi-layer bidirectional LSTM network that models sequential dependencies across tokens. To derive a robust sentence-level repre- sentation, dual pooling operations(average and maximum pooling) are Table 2 applied to the LSTM outputs. The concatenated feature vector is then BERT Model Parameters. passed through a fully connected neural network with dropout regu- Parameter name Parameter value larization, culminating in a softmax classifier for multi-class sentiment Base model google-bert/bert-base-uncased prediction. This hybrid architecture is designed to jointly leverage the Tokenizer google-bert/bert-base-uncased representational richness of transformer encoders and the temporal Input length 128 Batch size 16 modeling strength of recurrent networks, effectively addressing both lo- Number of epochs 5 cal semantics and discourse-level sentiment dynamics within multi-turn Learning rate 0.00002 dialogues. Loss Function BertForSequenceClassification – Cross-Entropy The computational overhead of HybridBERT-LSTM represents a crit- Optimization algorithm AdamW ical consideration for practical deployment, particularly in real-time applications such as conversational AI systems. The theoretical com- Table 3 plexity of the proposed architecture can be decomposed into its con- LSTM Model Parameters. stituent components to understand the computational requirements. Parameter name Parameter value The BERT component contributes (𝑛2 ×𝑑BERT ) = (𝑛2 ×768) complexity Embedding type GloVe due to the quadratic scaling of the self-attention mechanism, where 𝑛 Embedding size 100 represents the sequence length and 𝑑BERT denotes the BERT embedding Maximum number of words 5000 dimension. The subsequent 3-layer BiLSTM processing adds (3 × 𝑛 × LSTM layer number 6 2 𝑑LSTM ) = (3 × 𝑛 × 2562 ) complexity, where 𝑑LSTM represents the LSTM unit number 128/256 Dropout rate 0.5 LSTM hidden dimension. Consequently, the overall HybridBERT-LSTM Output layer (Dense) Softmax complexity is (𝑛2 × 768 + 3𝑛 × 65, 536). This represents a significant Optimization algorithm Adam computational increase compared to standalone BERT ((𝑛2 × 768)) or Loss function Sparse Categorical Crossentropy LSTM models ((𝑛 × 𝑑LSTM 2 )), which may limit deployment in latency- Epoch number 50 sensitive applications. However, the empirical results demonstrate that Batch size 32 the performance gains justify this additional overhead in scenarios where accuracy is prioritized over computational efficiency. The parameters used for the BERT model employed in this study are 4. Experimental results presented in Table 2. The parameter configurations utilized in the LSTM-based model This section presents the configurations of the models utilized in developed for this study are detailed in Table 3. the experiments, detailing the corresponding hyperparameters and im- The parameter configurations utilized in the CNN model developed plementation settings. The objective is to ensure reproducibility and for this study are detailed in Table 4. provide a comprehensive understanding of the experimental setup. Table 5 summarizes the parameter values defined for the SVM model. 4.1. Model hyperparameters Table 6 presents a comparative evaluation of various machine learn- ing and deep learning models in the context of sentiment analysis on the The deep learning models were trained using a variety of hyper- widely adopted IMDB dataset. Among the examined methods, the pro- parameter configurations tailored to the architecture and task require- posed HybridBERT-LSTM architecture achieved the highest accuracy ments. These configurations include parameters such as learning rate, rate of 98.14%, demonstrating a substantial improvement over other batch size, maximum input sequence length, number of training epochs, baseline models included in the analysis. This notable enhancement un- optimizer type, and loss function. Additionally, architecture-specific derscores the effectiveness of combining contextual embeddings from settings such as the number of LSTM layers, dropout rates, and hidden BERT with the sequential modeling capabilities of LSTM. The IMDB state dimensionsare systematically defined. For models utilizing pre- dataset was selected for evaluation due to its extensive usage and trained components (e.g., BERT), both the base model and tokenizer established credibility in the sentiment analysis literature, serving as versions are explicitly specified. The subsequent tables summarize the a robust benchmark for comparative performance assessment. detailed parameter values for each model employed in this study, including HybridBERT-LSTM, BERT-only, LSTM, CNN, and SVM-based 4.2. Statistical significance testing classifiers. The parameter values of the model developed in this study are In order to determine whether the observed differences in model detailed in Table 1. performance metrics [44] were statistically significant, we employed 6 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Table 4 hypothesis (𝐻0 ) that the two models exhibit equal mean performance. CNN Model Parameters. Since our interest lies in detecting differences in either direction, a Parameter name Parameter value two-tailed test is used: Embedding type GloVe Embedding size 100 𝑝 = 2 × 𝑃 (𝑇 ≥ |𝑡|), Maximum number of words 5000 Input layer Embedding(input_dim=5000, output_dim=100) where 𝑇 follows the Student’s t-distribution with 𝑑𝑓 degrees of free- Number of Conv1D layers 6 dom. Number of Conv1D filters 128 If 𝑝 < 0.05, the difference is considered statistically significant, Kernel size 5 indicating strong evidence against the null hypothesis. In this case, we Activation function ReLU Padding Same conclude that one model outperforms the other beyond what would be Pooling MaxPooling1D (pool_size=2) expected by random variation. If 𝑝 ≥ 0.05, the difference is considered Dropout rate 0.5 not statistically significant, implying that the observed discrepancy may Global pooling GlobalMaxPooling1D reasonably be attributed to experimental variability. Output layer (Dense) Softmax Loss function sparse_categorical_crossentropy In addition to reporting p-values, effect sizes (Cohen’s d) were also Optimization algorithm Adam computed to quantify the magnitude of the observed differences. While Evaluation metric Accuracy statistical significance indicates whether a difference is unlikely to be Number of epochs 50 due to chance, effect size provides a measure of its practical relevance. Batch size 32 Together, these statistics provide a comprehensive assessment of the comparative performance of the evaluated models. Table 5 SVM Model Parameters. 4.3. Experimental results on datasets Parameter name Parameter value Embedding type GloVe Dataset 1 consists of question–answer pairs collected from two Embedding size 100 independent online counseling and psychotherapy platforms [45]. The Maximum number of words 5000 user-generated questions span a wide range of topics related to men- SVM Kernel Linear tal health, including emotional well-being, interpersonal issues, and psychological disorders. Each response was authored by licensed psy- Table 6 chologists, ensuring both clinical relevance and linguistic reliability. In IMDB Dataset Accuracy Comparison. total, the dataset comprises 7,025 dialogue instances. Reference Method Accuracy Tables 7 and 8 present the training and testing performances, re- [35] LSTM 83.7% spectively, of five distinct models HybridBERT-LSTM, BERT, LSTM, [36] CNN+LSTM 96.01% CNN, and SVM evaluated on Dataset 1. The models were assessed using [37] LSTM+RNN 92.00% [38] BERT 93.97% standard classification metrics including accuracy, precision, recall, [39] A hybrid approach 95.6% and F1-score, providing a comparative analysis of both their internal [40] HOMOCHAR 95.91% consistency and generalizability. [41] Textual Emotion Analysis (TEA) 93% An ablation study [46] is a systematic experimental methodol- [42] Lexical + Adversarial attacks 85% [43] Logistic Regression 89.42% ogy used to evaluate the individual contributions of specific model Proposed Model HybridBERT-LSTM 98.14% components by selectively removing or modifying them while keep- ing other factors constant. This approach provides empirical evidence for the importance of particular architectural elements in determin- ing the model’s overall performance. To rigorously assess whether the Welch’s two-sample t-test, which is widely recommended when com- HybridBERT-LSTM’s performance gains arise from architectural design paring two groups with potentially unequal variances and sample sizes. rather than mere parameter expansion, we conducted a comprehensive This approach provides a robust test of mean differences without ablation study with parameter-matched baselines. Six model variants assuming homogeneity of variances, which is particularly important were constructed: (1) BERT-Only baseline using the [CLS] token for in machine learning experiments where stochastic training procedures classification, (2) BERT-ParamMatched with additional dense layers may lead to heterogeneous variability across models. matching the BiLSTM parameter count, (3) BERT+UniLSTM with a Let 𝑥̄ 1 and 𝑥̄ 2 denote the sample means of the two models being unidirectional LSTM, (4) BERT+BiLSTM-NoPooling without dual pool- compared, 𝑠1 and 𝑠2 the corresponding standard deviations, and 𝑛1 and ing, (5) BERT+BiLSTM with frozen BERT isolating pure LSTM contri- 𝑛2 the number of independent runs. The Welch’s t-statistic is defined bution, and (6) HybridBERT-LSTM (Full) incorporating all proposed as: components. 𝑥̄ − 𝑥̄ 2 When Table 9 is examined, which shows the ablation test for 𝑡 = √1 𝑠21 𝑠2 Data Set 1, the BERT-ParamMatched model achieves an accuracy of + 𝑛2 𝑛1 2 95.35% ± 0.38% despite having an equivalent number of param- The approximate degrees of freedom (𝑑𝑓 ) for this test are calculated eters to the full model, whereas HybridBERT-LSTM attains 95.94% according to the Welch–Satterthwaite equation: ± 0.15%. The hierarchical performance degradation across ablation ( 2 )2 variants reveals the marginal contribution of each component: dual 𝑠1 𝑠22 + pooling adds +0.19% (95.94% vs. 95.75%), bidirectionality contributes 𝑛1 𝑛2 𝑑𝑓 = ( )2 ( )2 +0.17% (95.75% vs. 95.58%), and the sequential LSTM architecture 𝑠2 1 𝑠2 2 over feedforward MLP layers provides +0.23% (95.58% vs. 95.35%). 𝑛1 𝑛2 The frozen BERT experiment (91.80% ± 0.65%) isolates critical insights 𝑛1 −1 + 𝑛2 −1 regarding representation quality versus fine-tuning contributions. As Given the test statistic and degrees of freedom, the p-value is ob- shown in Table 9, the ablation study on Dataset 1 systematically tained by evaluating the probability of observing a difference as ex- confirms that HybridBERT-LSTM’s performance advantage arises from treme as, or more extreme than, the measured difference under the null its architectural design rather than from parameter count inflation. 7 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Table 7 Training Performance Metrics for Dataset 1. Method Accuracy ± std Precision ± std Recall ± std F1 ± std HybridBERT-LSTM 0.9872 ± 0.0029 0.9871 ± 0.0028 0.9872 ± 0.0029 0.9871 ± 0.0029 BERT 0.9806 ± 0.0063 0.9805 ± 0.0057 0.9806 ± 0.0063 0.9805 ± 0.0062 LSTM 0.9829 ± 0.0162 0.9829 ± 0.0163 0.9829 ± 0.0162 0.9827 ± 0.0175 CNN 0.9862 ± 0.0190 0.9829 ± 0.0199 0.9862 ± 0.0190 0.9829 ± 0.0202 SVM 0.8247 ± 0.0073 0.8274 ± 0.0067 0.8247 ± 0.0073 0.8235 ± 0.0071 Table 8 Test Performance Metrics for Dataset 1. Method Accuracy ± std Precision ± std Recall ± std F1 ± std HybridBERT-LSTM 0.9594 ± 0.0015 0.9596 ± 0.0017 0.9594 ± 0.0015 0.9592 ± 0.0016 BERT 0.9516 ± 0.0040 0.9515 ± 0.0041 0.9516 ± 0.0044 0.9514 ± 0.0045 LSTM 0.9245 ± 0.0152 0.9257 ± 0.0163 0.9245 ± 0.0152 0.9239 ± 0.0165 CNN 0.9195 ± 0.0171 0.9200 ± 0.0170 0.9195 ± 0.0171 0.9192 ± 0.0125 SVM 0.8078 ± 0.0026 0.8118 ± 0.0025 0.8078 ± 0.0026 0.8058 ± 0.0031 Table 9 in testing, the model’s limited learning and generalization capacity be- Ablation Performance Metrics for Dataset 1. came evident. These findings collectively indicate that SVM lags behind Model Accuracy ± std F1 ± std deep learning-based methods in terms of both modeling complexity and BERT+BiLSTM (Frozen) 0.9180 ± 0.0065 0.9165 ± 0.0068 adaptability to sequential linguistic features inherent in dialogue-based BERT-Only (Baseline) 0.9516 ± 0.0040 0.9512 ± 0.0042 sentiment classification tasks. BERT-ParamMatched 0.9535 ± 0.0038 0.9531 ± 0.0040 This dataset comprises conversational exchanges derived from ev- BERT+UniLSTM 0.9558 ± 0.0028 0.9555 ± 0.0030 BERT+BiLSTM-NoPooling 0.9575 ± 0.0022 0.9573 ± 0.0024 eryday spoken English interactions [47]. It consists of a total of 7,450 HybridBERT-LSTM (Full) 0.9594 ± 0.0015 0.9592 ± 0.0016 dialogue samples, structured in a question–answer format. The train- ing and testing performances of five different classification meth- ods HybridBERT-LSTM, BERT, LSTM, CNN, and SVM on Dataset 2 are presented in Tables 10 and 11, respectively. Among these, the When the results are evaluated over five repeated experiments, the HybridBERT-LSTM model achieved the highest performance on the HybridBERT-LSTM model not only outperforms the other methods in training set, reaching an accuracy of 99.11% and an F1-score of terms of accuracy, precision, recall, and F1-score, but also demonstrates 99.11%, thereby slightly outperforming the other methods. The BERT a high degree of stability, as reflected by its very low standard devia- and CNN models also demonstrated high effectiveness, achieving ac- tions (≈ 0.0015–0.0017). This indicates that the model provides not just curacies of 98.95% and 98.21%, respectively. These three models superior performance but also reproducible results across runs. exhibited strong alignment with the training data across all evaluation While the BERT model follows as the second-best performer, its metrics, including accuracy, precision, recall, and F1-score. higher variance (≈ 0.004) highlights less consistent outcomes compared When Table 12 is examined, which shows the ablation test for to HybridBERT-LSTM. Statistical testing (e.g., paired t -tests) confirms Data Set 1, the BERT-ParamMatched model achieves an accuracy of that the observed performance difference between HybridBERT-LSTM 97.92% ±0.35% despite having an equivalent number of parameters, and BERT, though relatively small, is statistically significant (𝑝 < 0.05). In contrast, the performance gaps between HybridBERT-LSTM and whereas HybridBERT-LSTM attains 98.32% ±1.06%, reflecting a 0.40 weaker models such as LSTM, CNN, and particularly SVM are much percentage-point improvement. Component-wise analysis further in- larger. Pairwise comparisons reveal p-values well below 0.01, strongly dicates that dual pooling contributes +0.13% (98.32% vs. 98.19%), supporting the conclusion that HybridBERT-LSTM’s superiority is not bidirectionality adds +0.13% (98.19% vs. 98.06%), and the sequential due to random chance but reflects a genuine performance advan- LSTM architecture over MLP layers provides an additional +0.14% tage. HybridBERT-LSTM vs. BERT: Smaller margin, but statistically (98.06% vs. 97.92%). significant (𝑝 < 0.05). HybridBERT-LSTM vs. LSTM/CNN/SVM: Sub- Based on the evaluation of five repeated experiments, the stantial margin, highly significant (p << 0.01). Among the evaluated HybridBERT-LSTM model achieved the highest accuracy, precision, approaches, the HybridBERT-LSTM architecture consistently demon- recall, and F1-scores on both the training and test sets. It stood out with strated superior performance during both training and testing phases, an accuracy of 99.11% in training and reached 98.32% accuracy on the achieving remarkably high scores across all metrics. Specifically, it test set. The consistently low standard deviations (≈ 0.0106–0.0126) attained 98.72% accuracy and 98.72% F1-score on the training set, indicate that the model not only delivers high performance but also outperforming all other models. BERT, LSTM, and CNN also exhib- produces stable results. ited strong training performance, each surpassing 98% accuracy and BERT followed HybridBERT-LSTM and provided similarly strong F1-scores, indicating their efficacy on seen data. results. However, its slightly lower standard deviations suggest that In the testing phase, HybridBERT-LSTM maintained its leading po- it yielded more consistent outcomes in some metrics. Although the sition by achieving the highest test accuracy (95.94%) and F1-score performance gap between the two models appears small, pairwise t - (95.92%), affirming its robustness and generalization capability. In test results show that the p-values are mostly below 0.05. Therefore, contrast, the CNN model experienced a notable performance drop from the difference between HybridBERT-LSTM and BERT is statistically training to testing (accuracy falling from above 98% to 91.95% and F1- significant. score to 91.92%), suggesting a tendency toward overfitting. Similarly, In comparisons with the lower-performing models (LSTM, CNN, and the LSTM model, despite achieving 98.29% accuracy in training, saw SVM), the p-values were found to be far below 0.01. This demonstrates its performance decline to 92.45% accuracy during testing, reflecting that HybridBERT-LSTM significantly and strongly outperforms these reduced generalization. models. In particular, LSTM’s high variance in training (std ≈ 0.0380) Another critical observation is related to the SVM model, which indicates unstable learning behavior. exhibited the lowest performance across both training and test sets. In conclusion, HybridBERT-LSTM not only achieved the highest With a training accuracy of 82.47% and a further decline to 80.78% scores but also delivered stable and reproducible results. 8 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Table 10 Training Performance Metrics for Dataset 2. Method Accuracy ± std Precision ± std Recall ± std F1 ± std HybridBERT-LSTM 0.9911 ± 0.0111 0.9911 ± 0.0126 0.9911 ± 0.0111 0.9911 ± 0.0111 BERT 0.9895 ± 0.0093 0.9896 ± 0.0094 0.9895 ± 0.0093 0.9895 ± 0.0093 LSTM 0.7270 ± 0.0380 0.7189 ± 0.0370 0.7175 ± 0.0380 0.7278 ± 0.0380 CNN 0.9821 ± 0.0176 0.9826 ± 0.0176 0.9921 ± 0.0179 0.9822 ± 0.0176 SVM 0.7785 ± 0.0518 0.7711 ± 0.0524 0.7785 ± 0.0518 0.7638 ± 0.0525 Table 11 Test Performance Metrics for Dataset 2. Method Accuracy ± std Precision ± std Recall ± std F1 ± std HybridBERT-LSTM 0.9832 ± 0.0106 0.9834 ± 0.0108 0.9832 ± 0.0106 0.9833 ± 0.0106 BERT 0.9779 ± 0.0038 0.9783 ± 0.0039 0.9779 ± 0.0038 0.9780 ± 0.0038 LSTM 0.7075 ± 0.0199 0.7089 ± 0.0178 0.7075 ± 0.0199 0.7078 ± 0.0199 CNN 0.9718 ± 0.0102 0.9725 ± 0.0104 0.9718 ± 0.0102 0.9720 ± 0.0112 SVM 0.7537 ± 0.0044 0.7491 ± 0.0045 0.7537 ± 0.0044 0.7277 ± 0.0045 Table 12 mutually enhance effectiveness on challenging classification tasks. The Ablation Performance Metrics for Dataset 2. frozen BERT experiment (72.15% ±2.45%) provides critical valida- Model Accuracy ± std F1 ± std tion: despite lacking fine-tuning, it outperforms standalone LSTM with BERT+BiLSTM (Frozen) 0.9425 ± 0.0152 0.9418 ± 0.0155 GloVe embeddings (62.26% test) by 9.89 percentage points, isolating BERT-Only (Baseline) 0.9779 ± 0.0038 0.9780 ± 0.0038 the representation quality advantage of contextualized embeddings. BERT-ParamMatched 0.9792 ± 0.0035 0.9793 ± 0.0035 However, the 10.71% gap between frozen and full models (72.15% BERT+UniLSTM 0.9806 ± 0.0028 0.9807 ± 0.0028 BERT+BiLSTM-NoPooling 0.9819 ± 0.0022 0.9820 ± 0.0022 vs. 82.86%) represents the largest fine-tuning contribution across all HybridBERT-LSTM (Full) 0.9832 ± 0.0106 0.9833 ± 0.0106 datasets, establishing that task-specific adaptation is particularly criti- cal for complex classification problems. The parameter efficiency ratio of 18.74:1 (5.06% gain/0.27% parameter increase) dramatically ex- ceeds simpler datasets (Dataset 1: 2.89:1, Dataset 2: 1.96:1), validating In contrast, LSTM and SVM yielded significantly lower performance, that BiLSTM’s architectural value scales positively with task difficulty. with training accuracies of 72.70% and 77.85%, respectively. Par- However, the test results reveal a marked decline in the gener- ticularly, the low F1-score of 76.38% for SVM indicates inadequate alization performance of some models, most notably CNN. The CNN classification consistency and stability. When evaluated on the test set, model’s accuracy dropped significantly to 65.03% during testing, sug- the overall performance ranking remained largely consistent with that gesting signs of overfitting. The inability to maintain performance observed during training. HybridBERT-LSTM and BERT maintained across datasets implies that the model may have memorized training their superior performance, achieving test accuracies of 98.32% and instances rather than learning generalizable patterns. Similarly, the 97.79%, respectively. The CNN model followed closely with 97.18% BERT model, while achieving 94.92% training accuracy, exhibited a accuracy, exhibiting a balanced and robust performance across all notable decline during testing, with an accuracy of 78.27%, indicating evaluation criteria. Conversely, LSTM and SVM continued to underper- moderate but consistent performance. form in the test phase, reflecting limited generalization capability in The most robust generalization was observed in the HybridBERT- comparison to the more advanced deep learning architectures. LSTM approach. This model achieved a training accuracy of 91.57% Dataset 3 comprises online consultation dialogues conducted be- and maintained a relatively high testing accuracy of 82.86%, with tween patients and medical professionals [48]. The dataset consists of minimal performance degradation between training and testing phases. a total of 6,570 entries, with each instance representing a dialogue These results underscore the HybridBERT-LSTM model’s capability to exchange initiated by a patient inquiry and followed by a corresponding balance learning efficiency with strong generalization, making it the response from a doctor. The training and testing performances of five most stable and reliable method on Dataset 3. distinct approaches HybridBERT-LSTM, BERT, LSTM, CNN, and SVM on Interestingly, the LSTM model maintained a consistent performance Dataset 3 are presented in Tables 13 and 14, respectively. Among these, of 62.26% across both training and testing phases, signaling limitations the CNN model achieved the highest training performance, demon- in its learning capacity and suggesting that simpler architectures may strating its strong learning capability. The BERT model also exhibited be insufficient for handling the complexity of dialogue-based sentiment competitive results, attaining a training accuracy of 94.92%, position- classification tasks. The SVM model, although yielding only moder- ing it as a viable alternative. In contrast, LSTM and SVM models yielded ate success during training, preserved its performance during testing notably lower performance during training, with accuracy scores of (68.42%), outperforming more complex deep learning models such as 62.26% and 71.92%, respectively, indicating limitations in their ability CNN and LSTM in terms of stability. The HybridBERT-LSTM model to model the training data effectively. emerges as the most balanced and generalizable approach, while the When Table 15 is examined, which shows the ablation test for CNN model warrants cautious interpretation due to its susceptibility Data Set 3, is examined, BERT-ParamMathes achieves 78.92% ±1.72% to overfitting. In this study, each method was evaluated through five accuracy with equivalent parameters, while HybridBERT-LSTM reaches independent repetitions. This approach provides a more accurate rep- 82.86% ±0.65%, representing a statistically significant 3.94 percentage resentation of variance compared to results obtained from a single point improvement. Component decomposition demonstrates substan- run and enhances the reproducibility of the outcomes. Notably, the tial marginal contributions: dual pooling adds +1.61% (82.86% vs. HybridBERT-LSTM model exhibited very low standard deviations (≈ 81.25%), bidirectionality contributes +1.40% (81.25% vs. 79.85%), 0.006–0.01 range), indicating that the model not only achieved high and sequential LSTM architecture over MLP provides +0.93% (79.85% average scores but also produced consistent results across trials. vs. 78.92%). The cumulative gain of 5.06% from BERT-Only base- HybridBERT-LSTM vs. BERT: Although the average performance line (78.27%) substantially exceeds the sum of individual compo- difference is relatively small, the p-values mostly remain below 0.05. nents (3.94%), indicating a 1.12% synergistic interaction effect – the This suggests that the difference is unlikely to be due to chance and strongest observed across all datasets – where BiLSTM components that the superiority of HybridBERT-LSTM is statistically significant. 9 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Table 13 Training Performance Metrics for Dataset 3. Method Accuracy ± std Precision ± std Recall ± std F1 ± std HybridBERT-LSTM 0.9157 ± 0.0097 0.9058 ± 0.0103 0.9056 ± 0.0097 0.9052 ± 0.0097 BERT 0.9492 ± 0.0234 0.9494 ± 0.0228 0.9492 ± 0.0234 0.9487 ± 0.0246 LSTM 0.6298 ± 0.0164 0.6294 ± 0.0160 0.6298 ± 0.0164 0.6227 ± 0.0183 CNN 0.9966 ± 0.0054 0.9966 ± 0.0062 0.9966 ± 0.0054 0.9966 ± 0.0056 SVM 0.7192 ± 0.0125 0.7263 ± 0.0161 0.7192 ± 0.0125 0.7198 ± 0.0126 * The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<<0.01), CNN (<<0.01), SVM (<<0.01). Table 14 Test Performance Metrics for Dataset 3. Method Accuracy ± std Precision ± std Recall ± std F1 ± std HybridBERT-LSTM 0.8286 ± 0.0065 0.8326 ± 0.0062 0.8286 ± 0.0065 0.8282 ± 0.0064 BERT 0.7827 ± 0.0185 0.7835 ± 0.0185 0.7827 ± 0.0185 0.7830 ± 0.0184 LSTM 0.6226 ± 0.0081 0.6294 ± 0.0085 0.6226 ± 0.0081 0.6227 ± 0.0092 CNN 0.6503 ± 0.0433 0.6516 ± 0.0565 0.6503 ± 0.0565 0.6497 ± 0.0565 SVM 0.6842 ± 0.0093 0.6904 ± 0.0520 0.6842 ± 0.0093 0.6847 ± 0.0110 * The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<<0.01), CNN (<<0.01), SVM (<<0.01). Table 15 reported in both the training and test evaluations, further evidenc- Ablation Performance Metrics for Dataset 3. ing generalizable gains rather than dataset-specific variance. The re- Model Accuracy ± std F1 ± std sults collectively demonstrate that HybridBERT–LSTM’s improvements BERT+BiLSTM (Frozen) 0.7215 ± 0.0245 0.7208 ± 0.0248 are statistically sound, generalizable, and derived from architectural BERT-Only (Baseline) 0.7827 ± 0.0185 0.7830 ± 0.0184 synergy rather than overparameterization or random variation. BERT-ParamMatched 0.7892 ± 0.0172 0.7895 ± 0.0171 When Table 19, which shows the ablation test for Data Set 4, is BERT+UniLSTM 0.7985 ± 0.0145 0.7988 ± 0.0144 BERT+BiLSTM-NoPooling 0.8125 ± 0.0110 0.8128 ± 0.0109 examined, the BERT-ParamMatched achieves 85.48% ±2.05% accu- HybridBERT-LSTM (Full) 0.8286 ± 0.0065 0.8282 ± 0.0064 racy despite equivalent parameters, while HybridBERT-LSTM reaches 87.29% ±1.19%, representing a 1.81 percentage point improvement. Component analysis reveals: dual pooling contributes +0.61% (87.29% vs. 86.68%), bidirectionality adds +0.63% (86.68% vs. 86.05%), and HybridBERT-LSTM vs. LSTM, CNN, SVM: In comparisons with these sequential LSTM architecture over MLP provides +0.57% (86.05% three models, the p-values were found to be far below 0.01. Therefore, vs. 85.48%). The cumulative gain of 3.35% from BERT-Only base- the superiority of HybridBERT-LSTM over these methods is strongly line (84.94%) exceeds individual components (1.81%), indicating a supported by statistical evidence. 1.54% synergistic effect where BiLSTM components mutually enhance Overall, the findings confirm that HybridBERT-LSTM is not only the effectiveness on this moderately challenging task. The frozen BERT best-performing model in terms of average scores but also the most variant (80.65% ±2.85%) validates two critical insights: it outper- reliable and consistent one from a statistical perspective. forms standalone LSTM with GloVe embeddings (77.26% test) by 3.39 This dataset comprises text entries collected from online conversa- percentage points, confirming the superiority of contextualized repre- tions conducted in English, each annotated with corresponding senti- sentations, while the 6.64% gap to the full model (80.65% vs. 87.29%) ment labels. It has been specifically curated for the purpose of analyzing quantifies the substantial contribution of fine-tuning. The decreasing and classifying the emotional tone embedded within textual utterances. variance from frozen (±2.85%) through parameter-matched (±2.05%) The dataset consists of 1,494 instances, and serves as a representative to full model (±1.19%) demonstrates that architectural integration with benchmark for evaluating sentiment classification models in informal, end-to-end training provides essential stability, establishing that the dialogue-based contexts [49]. observed improvements stem from architectural design rather than Tables 16 and 17 present the training and test performance met- capacity scaling. rics, respectively, for five different sentiment classification models: When the results in Tables 16 and 17 are analyzed based on HybridBERT-LSTM, BERT, LSTM, CNN, and SVM applied to Dataset 4. five independent repetitions, several important findings emerge re- Evaluation was conducted using standard performance indicators: Ac- garding both performance levels and statistical reliability. First, the curacy, Precision, Recall, and F1-score, to assess both the fitting capacity HybridBERT-LSTM model demonstrates strong generalization ability, on training data and generalizability on unseen test data. maintaining balanced accuracy (87.29% ±0.0119) and F1 (84.89% Table 18 presents the cross-validation results for Dataset 4. ±0.0140) on the test set, with relatively low variance across runs. The The consistency of accuracy and F1-scores across folds (≈0.8795 and narrow confidence interval provided by the low standard deviations 0.8758, respectively) indicates that the model does not exhibit over- indicates that the model is not only accurate but also stable across re- fitting or excessive variance between training and evaluation phases. peated experiments. The pairwise statistical comparisons reveal further This stability confirms that the observed improvements are not artifacts insights. Against BERT, the differences in performance metrics appear of specific data splits but instead arise from the model’s architectural moderate, yet the corresponding p-values are consistently below 0.05. design, particularly its integration of bidirectional temporal encoding This implies that the improvements of HybridBERT-LSTM over BERT, and hierarchical pooling mechanisms. Moreover, the cross-validation while not large in magnitude, are statistically significant rather than outcomes follow the same relative performance hierarchy observed in random fluctuations. both the training and test experiments: HybridBERT–LSTM > BERT > In contrast, the performance gaps between HybridBERT-LSTM and LSTM > CNN > SVM. This consistent ranking across all evaluation set- the weaker models (LSTM, CNN, and especially SVM) are consider- tings validates the comparative strength of the proposed architecture. ably larger. Here, the p-values are well below 0.01, in many cases The slight performance gap between HybridBERT–LSTM and BERT is below 0.001, providing strong statistical evidence that HybridBERT- statistically meaningful and mirrors the 𝑝-value significance (<0.05) LSTM’s superiority is systematic and not due to chance. Notably, 10 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Table 16 Training Performance Metrics for Dataset 4. Method Accuracy Precision Recall F1 HybridBERT-LSTM 0.9046 ± 0.0172 0.8446 ± 0.0157 0.9046 ± 0.0119 0.8730 ± 0.0070 BERT 0.9447 ± 0.0238 0.9403 ± 0.0331 0.9447 ± 0.0238 0.9379 ± 0.0294 LSTM 0.9849 ± 0.0381 0.9848 ± 0.0489 0.9849 ± 0.0381 0.9845 ± 0.0479 CNN 0.9944 ± 0.0443 0.9882 ± 0.0421 0.9944 ± 0.0443 0.9882 ± 0.0401 SVM 0.8084 ± 0.0249 0.8258 ± 0.0206 0.8084 ± 0.0249 0.7806 ± 0.0268 * The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.01), CNN (<0.01), SVM (<0.001). Table 17 Test Performance Metrics for Dataset 4. Method Accuracy ± std Precision ± std Recall ± std F1 ± std HybridBERT-LSTM 0.8729 ± 0.0119 0.8561 ± 0.0089 0.8729 ± 0.0117 0.8489 ± 0.0140 BERT 0.8494 ± 0.0218 0.8532 ± 0.0377 0.8494 ± 0.0218 0.8512 ± 0.0194 LSTM 0.7726 ± 0.0330 0.7971 ± 0.0410 0.7726 ± 0.0330 0.7818 ± 0.0409 CNN 0.8160 ± 0.0164 0.8040 ± 0.0146 0.8160 ± 0.0164 0.8090 ± 0.0141 SVM 0.7525 ± 0.0075 0.7030 ± 0.0058 0.7525 ± 0.0075 0.7192 ± 0.0114 * The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.01), CNN (<0.01), SVM (<0.001). Table 18 decline in accuracy (77.26%) and F1-score (78.18%) on the test set. Cross Validation Performance Metrics for Dataset 4. BERT, while slightly lower in raw accuracy compared to HybridBERT- Model Accuracy Precision Recall F1 LSTM, maintained a stable generalization profile (Accuracy: 84.94%, HybridBERT-LSTM 0.8795 0.8739 0.8795 0.8758 F1: 85.12%). BERT 0.8561 0.8477 0.8561 0.8481 The SVM model again registered the weakest results across all LSTM 0.8394 0.7806 0.8394 0.8090 test metrics, with an accuracy of 75.25% and an F1-score of 71.92%, CNN 0.8327 0.7811 0.8327 0.8058 SVM 0.7593 0.7719 0.7593 0.7602 reinforcing the notion that classical machine learning methods may struggle with complex dialogue structures compared to deep learning architectures. Table 19 In summary, although CNN and LSTM excelled in training, their Ablation Performance Metrics for Dataset 4. generalization to test data was limited. HybridBERT-LSTM, by contrast, Model Accuracy ± std F1 ± std demonstrated consistent performance across both phases, reinforcing BERT+BiLSTM (Frozen) 0.8065 ± 0.0285 0.7971 ± 0.0295 its suitability for real-world sentiment classification tasks involving BERT-Only (Baseline) 0.8494 ± 0.0218 0.8512 ± 0.0194 dialogue-based inputs. BERT-ParamMatched 0.8548 ± 0.0205 0.8558 ± 0.0188 BERT+UniLSTM 0.8605 ± 0.0178 0.8602 ± 0.0175 Dataset 5 is constructed [50] for the purpose of modeling em- BERT+BiLSTM-NoPooling 0.8668 ± 0.0145 0.8645 ± 0.0155 pathetic dialogues and comprises multi-turn human-to-human con- HybridBERT-LSTM (Full) 0.8729 ± 0.0119 0.8489 ± 0.0140* versations that reflect emotionally rich interactions. The corpus is partitioned into three distinct subsets: the training set contains 40200 instances, the validation set includes 5730 instances, and the test set comprises 5260 instances. LSTM and CNN exhibit relatively high variances during training (std Tables 20 and 21 present the comparative performance metrics of ≈ 0.038–0.048 for LSTM; ≈ 0.040–0.044 for CNN), suggesting instability five distinct models: HybridBERT-LSTM, BERT, LSTM, CNN, and SVM and overfitting tendencies. on Dataset 5, using standard evaluation criteria: Accuracy, Precision, Taken together, these results highlight two key aspects: Recall, and F1-score. The results reveal clear patterns in terms of both HybridBERT-LSTM delivers the best trade-off between accuracy and model learning capacity on training data and generalization to unseen reproducibility across repeated runs, and its performance improve- test instances. ments, particularly over LSTM, CNN, and SVM, are not only empirically When Table 22, which shows the ablation test for Data Set 5, substantial but also statistically robust. Thus, the evidence supports is examined, the BERT-ParamMatched achieves 95.65% ± 0.19% ac- HybridBERT-LSTM as the most reliable and generalizable method on curacy with equivalent parameters, while HybridBERT–LSTM reaches Dataset 4. 96.16% ± 0.23%, representing a 0.51 percentage point improvement. During training (Table 16), CNN achieved the highest accuracy Component decomposition reveals uniform contributions: dual pooling (99.44%) and F1-score (98.82%), indicating a strong capacity to fit adds +0.17% (96.16% vs. 95.99%), bidirectionality contributes +0.17% the training data. LSTM and BERT also demonstrated robust learn- (95.99% vs. 95.82%), and sequential LSTM architecture over MLP ing performance with accuracy and F1-scores exceeding 94%, while provides +0.17% (95.82% vs. 95.65%). The cumulative gain of 0.66% HybridBERT-LSTM followed closely behind with an accuracy of 90.46% from the BERT-Only baseline (95.50%) precisely matches the sum and F1-score of 87.30%. SVM, in contrast, yielded noticeably lower of individual components, indicating minimal synergistic effects on training performance (Accuracy: 80.84%, F1: 78.06%), highlighting its this high-performing task where architectural elements operate addi- relative limitations in capturing complex language patterns. tively rather than multiplicatively. The frozen BERT variant (92.45% However, test results (Table 17) reveal important insights into ± 0.82%) provides task-difficulty insights: it outperforms standalone model generalizability. HybridBERT-LSTM emerged as the most bal- LSTM with GloVe embeddings (91.86% test) by only 0.59 percentage anced and generalizable model, achieving the highest test accuracy points the smallest margin across all datasets yet maintains a 3.71% (87.29%) and a competitive F1-score (84.89%). Despite its superior gap from the full model (92.45% vs. 96.16%). This pattern establishes training performance, CNN exhibited a significant drop in test ac- that on near saturated tasks (BERT baseline: 95.50%), fine-tuning curacy (81.60%), suggesting potential overfitting. Similarly, LSTM, provides greater marginal value (+3.71%) than architectural modi- which performed strongly during training, experienced a substantial fications (+0.66%). The parameter efficiency ratio of 2.44:1 (0.66% 11 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Table 20 Training Performance Metrics for Dataset 5. Method Accuracy ± std Precision ± std Recall ± std F1 ± std HybridBERT-LSTM 0.9834 ± 0.0086 0.9834 ± 0.0074 0.9834 ± 0.0086 0.9833 ± 0.0084 BERT 0.9654 ± 0.0062 0.9654 ± 0.0059 0.9654 ± 0.0062 0.9654 ± 0.0061 LSTM 0.9936 ± 0.0049 0.9936 ± 0.0046 0.9936 ± 0.0049 0.9936 ± 0.0049 CNN 0.9384 ± 0.0346 0.9416 ± 0.0312 0.9384 ± 0.0346 0.9373 ± 0.0278 SVM 0.7536 ± 0.0272 0.7479 ± 0.0523 0.7536 ± 0.0272 0.7446 ± 0.0408 * The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<0.05), CNN (<0.01), SVM (<<0.01). Table 21 Test Performance Metrics for Dataset 5. Method Accuracy ± std Precision ± std Recall ± std F1 ± std HybridBERT-LSTM 0.9616 ± 0.0023 0.9614 ± 0.0021 0.9616 ± 0.0023 0.9615 ± 0.0022 BERT 0.9550 ± 0.0020 0.9554 ± 0.0019 0.9550 ± 0.0020 0.9550 ± 0.0020 LSTM 0.9186 ± 0.0026 0.9201 ± 0.0029 0.9186 ± 0.0026 0.9190 ± 0.0031 CNN 0.8851 ± 0.0281 0.8887 ± 0.0337 0.8851 ± 0.0310 0.8813 ± 0.0315 SVM 0.7588 ± 0.0183 0.7507 ± 0.0178 0.7588 ± 0.0183 0.7506 ± 0.0179 * The p-values for each method compared to HybridBERT-LSTM are as follows: BERT (<0.05), LSTM (<<0.01), CNN (<<0.01), SVM (<<0.01). Table 22 (Accuracy: 75.36%, F1: 74.46%), confirming its limitations in handling Ablation Performance Metrics for Dataset 5. nuanced linguistic structures. Model Accuracy ± std F1 ± std When evaluated on the test data (Table 21), HybridBERT-LSTM BERT+BiLSTM (Frozen) 0.9245 ± 0.0082 0.9243 ± 0.0083 again outperformed all other models, achieving the highest accuracy BERT-Only (Baseline) 0.9550 ± 0.0020 0.9550 ± 0.0020 (96.16%) and F1-score (96.15%), indicating strong generalization capa- BERT-ParamMatched 0.9565 ± 0.0019 0.9565 ± 0.0019 bility and robustness against overfitting. BERT maintained competitive BERT+UniLSTM 0.9582 ± 0.0018 0.9582 ± 0.0018 BERT+BiLSTM-NoPooling 0.9599 ± 0.0017 0.9599 ± 0.0017 test performance (Accuracy: 95.50%, F1: 95.50%), slightly lagging HybridBERT-LSTM (Full) 0.9616 ± 0.0023 0.9615 ± 0.0022 behind the hybrid model. While LSTM demonstrated superior training results, its test performance declined more notably (Accuracy: 91.86%, F1: 91.90%), suggesting possible overfitting to training data. Similarly, CNN exhibited a moderate generalization gap, reaching only 88.51% gain/0.27% parameter increase) positions Dataset 5 among simpler accuracy on the test set, despite its relatively high training metrics. classification problems, validating the inverse relationship between SVM, consistent with previous datasets, again showed the lowest baseline performance and BiLSTM’s contribution. performance in both training and testing phases, with an F1-score of Based on the results averaged over five independent runs, the only 75.06% on the test data. This emphasizes the model’s limited ca- HybridBERT-LSTM model consistently achieved the highest perfor- pacity to generalize in dialogue-rich or semantically complex scenarios mance on both the training and test sets. The remarkably low standard compared to deep learning-based alternatives. deviations (≈0.002–0.009) indicate not only superior average perfor- Overall, these results substantiate the efficacy of the HybridBERT- mance but also a high degree of stability and reproducibility across LSTM architecture in balancing contextual sensitivity and temporal repeated trials. structure modeling, thereby ensuring high accuracy and stability across The BERT model ranked second, yielding performance levels com- both learning and evaluation stages. The comparative drop in test parable to HybridBERT-LSTM. However, pairwise statistical compar- performance observed in CNN and LSTM also underscores the impor- isons revealed that the p-values were generally below 0.05, suggesting tance of integrating both contextual and sequential representations for that the observed differences, while relatively small, are statistically enhanced sentiment classification in dialogue settings. significant and not attributable to random variation. Fig. 1 illustrates the interpretability analysis of the proposed senti- In contrast, comparisons with the lower-performing models (LSTM, ment classification model using the LIME framework. The visualization CNN, and SVM) yielded p-values well below 0.01, providing strong comprises three distinct components, each elucidating the model’s statistical evidence of HybridBERT-LSTM’s superiority. Notably, the decision-making process for a representative dialogue input. LSTM model, despite attaining high training scores, exhibited a marked The prediction probabilities panel (top-left) displays the model’s decline during testing, indicating a tendency toward overfitting. Simi- confidence distribution across the three sentiment classes. Here, Class 1 larly, the CNN model displayed wider standard deviations, pointing to achieves a probability score of 1.00, indicating complete certainty in instability and reduced reliability across runs. the model’s classification. Classes 0 and 2 both register a probability In conclusion, the HybridBERT-LSTM model not only achieved the of 0.00, underscoring the model’s confident and decisive prediction for highest mean scores but also demonstrated low variance and statisti- this specific instance. cally significant improvements, confirming its reliability and robustness The feature importance panel, generated by LIME, presents the as the most effective approach for Dataset 5. quantitative contribution of individual lexical features to the final pre- In the training phase (Table 20), LSTM yielded the highest perfor- diction. The ranking reveals that terms such as ‘‘crying’’, ‘‘embarrassing’’, mance across all metrics, with an accuracy and F1-score of 99.36%, and ‘‘fear’’ possess the highest negative impact coefficients. Meanwhile, indicating exceptional capability in capturing sequential dependencies features like ‘‘worry’’, ‘‘freaking out’’, and ‘‘go out’’ show moderate in the training corpus. Close behind, the HybridBERT-LSTM model levels of influence. Conversely, contextual words such as ‘‘counseling’’, achieved 98.34% accuracy and an F1-score of 98.33%, reflecting its ‘‘therapy’’, and ‘‘days’’ exhibit minimal importance, suggesting limited strength in combining contextual embeddings with sequential model- contribution to the sentiment prediction for this case. ing. BERT also performed robustly, attaining 96.54% across all reported The highlighted text visualization (right panel) offers an intuitive metrics. In contrast, CNN demonstrated a moderate performance (Ac- representation of feature importance through color-coded annotations. curacy: 93.84%, F1: 93.73%), while SVM significantly underperformed The input sentence: ‘‘I’m starting counseling/therapy in a few days. I’m 12 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Fig. 1. Interpretability analysis using the LIME framework for the proposed model for Dataset1. Fig. 2. Interpretability analysis using the LIME framework for the proposed model for Dataset2. freaking out but my main fear is crying and embarrassing myself. Should Fig. 3 illustrates the interpretability analysis using the LIME frame- I be worried?’’ is annotated with blue highlights, corresponding to work for a sample medical consultation text, highlighting the model’s high-impact emotional cues. The intensity of each highlight is directly capability to perform sentiment classification within clinical communi- proportional to the magnitude of that word’s influence on the final cation contexts. The visualization comprises several analytical compo- classification. nents that elucidate the algorithmic decision-making process. Fig. 2 illustrates a LIME-based interpretability analysis for a sen- The prediction probability panel reveals high classification confi- timent classification instance derived from medical discourse, high- dence by the model, assigning a dominant probability score of 0.99 to lighting the model’s interpretive capabilities in processing healthcare- Class 2, while Classes 0 and 1 both receive marginal likelihoods of 0.01. related textual inputs. The visualization provides a comprehensive in- The feature importance analysis presents local explanations gen- sight into the underlying decision-making mechanisms of the sentiment erated by LIME, quantifying individual lexical contributions to the prediction process. final prediction. The term ‘‘affected’’ exhibits the highest contribution The prediction probabilities panel reveals that the model assigns coefficient at 0.24, followed by ‘‘cold’’ (0.22) and ‘‘recovery’’ (0.20). a dominant probability of 0.97 to Class 0, while significantly lower Subsequent features such as ‘‘recommend’’ (0.17), ‘‘definitely’’ (0.11), values of 0.01 and 0.02 are attributed to Classes 1 and 2, respectively. and ‘‘avoid’’ (0.10) display gradually decreasing importance values. This distribution indicates high classification confidence with minimal Additional terms like ‘‘by’’, ‘‘protect’’, ‘‘loose’’, and ‘‘issue’’ register min- uncertainty among the alternative sentiment categories. imal weights, indicating lower relevance in the sentiment attribution The feature importance ranking presents local attributions gen- process. erated by LIME, identifying the most influential lexical components The highlighted text visualization renders the analyzed clinical contributing to the classification decision. The term ‘‘cancer’’ emerges advisory statement: as the primary contributor with an importance score of 0.61, followed by ‘‘scared’’ (0.22) and ‘‘please’’ (0.11). Additional terms such as ‘‘re- ‘‘Hello, I have reviewed the attached photographs, the attachments have ally’’, ‘‘as’’, ‘‘well’’, ‘‘find’’, ‘‘I’’, ‘‘blood’’, and ‘‘have’’ exhibit progressively been removed to protect patient identity. In my opinion, you are affected lower importance coefficients, reflecting their secondary roles in the by a tinea infection. I recommend taking 250 mg terbinafine tablets once model’s sentiment determination process. daily and applying sertaconazole cream to the affected area twice daily. The highlighted text panel displays the analyzed medical narrative: Continue this for three weeks and return. You will definitely notice some improvement...’’ ‘‘Hello doctor, I’m a 26-year-old male, 10 cm tall and weigh 255 pounds. I sometimes have blood in my stool, especially after eating spicy Terms highlighted in green, specifically ‘‘affected’’, ‘‘recommend’’, food or when constipated. I’m really scared that I might have colon and ‘‘improvement’’, correspond to therapeutically oriented expressions cancer. I frequently experience diarrhea. There is no family history of that significantly influence the model’s positive sentiment classifica- colon cancer. I had blood tests done last night. Please find my reports tion. attached’’. This interpretability analysis reveals the model’s capacity to dis- tinguish constructive medical recommendations from neutral or neg- The blue-highlighted segments, particularly ‘‘scared’’ and ‘‘cancer’’, atively toned clinical communications. The LIME explanation demon- correspond to high-impact emotional and medical terminology that strates that the classification decision is primarily driven by treatment- significantly influence the model’s sentiment evaluation. related vocabulary and optimistic prognostic indicators, offering valu- This interpretability analysis demonstrates the model’s sensitivity to able insights into the model’s domain-specific sentiment recognition emotionally charged and domain-specific medical expressions within abilities within healthcare advisory scenarios. healthcare contexts. The LIME explanation reveals that the classifi- Fig. 4 presents a LIME-based interpretability analysis for the senti- cation decision primarily hinges on illness-related concerns and fear- ment classification of a concise social media content sample, illustrating based expressions. Accordingly, the analysis offers valuable insights the model’s ability to process succinct and informal textual expressions. into the model’s domain-specific sentiment recognition capabilities The visualization offers in-depth insights into the underlying sentiment when interpreting emotionally nuanced medical discourse. classification mechanisms for multimedia-related content descriptions. 13 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Fig. 3. Interpretability analysis using the LIME framework for the proposed model for Dataset3. Fig. 4. Interpretability analysis using the LIME framework for the proposed model for Dataset4. Fig. 5. Interpretability analysis using the LIME framework for the proposed model for Dataset5. The prediction probability panel indicates that the model assigns Fig. 5 presents the LIME-based interpretability analysis of a per- a dominant probability of 0.95 to Class 2, while Classes 0 and 1 sonal expression sample, illustrating the model’s capacity to interpret receive significantly lower confidence scores of 0.04 and 0.01, respec- emotional distress within the context of domestic relationships. This tively. This distribution demonstrates high classification confidence visualization provides detailed insights into the sentiment classification with minimal ambiguity across alternative sentiment categories. process related to interpersonal communication patterns. The predic- The feature importance ranking displays local explanations derived tion probability panel shows that the model assigns a dominant prob- from LIME, identifying the most influential lexical components in the ability of 0.92 to Class 0, while Classes 1 and 2 receive substantially model’s decision-making process. The term ‘‘cute’’ emerges as the pri- lower confidence scores of 0.03 and 0.05, respectively. This distribu- mary contributor with the highest importance score of 0.53, followed tion reflects the model’s high classification confidence with minimal by ‘‘funny’’ (0.17). Additional terms such as ‘‘dogs’’ (0.05), ‘‘belly’’ ambiguity across alternative sentiment categories. The feature impor- (0.04), ‘‘compilation’’ (0.03), ‘‘flop’’ (0.02), and ‘‘corgi’’ (0.01) exhibit tance analysis displays locally derived explanations generated by LIME, progressively decreasing contribution scores, reflecting their secondary quantifying the contribution of individual lexical features to the final roles in sentiment attribution. prediction. The terms ‘‘angry’’ and ‘‘friends’’ exhibit the highest impact The text highlight visualization renders the analyzed content de- scores of 0.43, followed by ‘‘I’’ (0.24), ‘‘ugh’’ (0.23), and ‘‘exhausted’’ scription: (0.22). Additional terms such as ‘‘yes’’ (0.16), ‘‘so’’ (0.10), ‘‘his’’ (0.09), ‘‘husband’’ (0.04), and ‘‘again’’ (0.04) display diminishing importance ‘‘corgi belly flop compilation cute funny dogs corgi flop’’. scores, indicating secondary roles in the sentiment determination pro- cess. The text highlight visualization presents the analyzed personal narrative: Green-highlighted terms, particularly ‘‘cute’’ and ‘‘funny’’, corre- spond to positive emotional descriptors that substantially influence the ‘‘ugh I’m so angry my husband went out with his friends for the third time model’s sentiment classification toward the positive class. this week, is he drinking, yes, I’m exhausted my daughter is teething so This interpretability analysis demonstrates the model’s efficacy in she isn’t sleeping well’’. detecting positive sentiment cues within short, multimedia-oriented content descriptions. The LIME explanation reveals that the classifi- The blue-highlighted segments, particularly ‘‘ugh’’, ‘‘angry’’, ‘‘friends’’, cation decision is primarily driven by emotionally charged adjectives and ‘‘exhausted’’, correspond to emotionally expressive markers and expressing affection and humor, offering valuable insights into the stress indicators that significantly influenced the model’s negative sen- model’s ability to process informal social media language patterns and timent classification. This interpretability analysis reveals the model’s perform sentiment analysis on pet-related content. ability to detect frustration and emotional exhaustion within narratives 14 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Fig. 6. Graph-based visualization with the WordContextGraphExplainer framework for Dataset1. involving intimate relational contexts. The LIME explanation demon- use case for the explainer’s ability to decompose complex sentiment strates that the classification decision is predominantly based on ex- decisions into interpretable components. This visualization framework plicit emotional state descriptors and situational stress signals, provid- directly addresses the critical need for interpretability in natural lan- ing valuable insight into the model’s competence in analyzing senti- guage processing applications. By decomposing the model’s reasoning ment in informal, emotionally charged personal communications and into individual word contributions and pairwise interactions, WordCon- family-related discourse. textGraphExplainer enables practitioners to understand not only what Fig. 6 presents a comprehensive visualization generated by the the model predicts, but why specific linguistic features drive those WordContextGraphExplainer framework, illustrating contextual depen- predictions. Such detailed analysis is especially valuable in high-stakes dencies and feature interactions that underlie a sentiment analysis applications, where transparency and accountability are essential. The model’s decision-making process. This graph-based representation an- graph structure effectively conveys the intricate interplay between alyzes a textual input with inherently negative emotional content, lexical semantics and contextual dependencies that influence automatic offering insights into how individual lexical units contribute to the sentiment classification, offering a robust foundation for both model model’s final classification outcome. validation and bias detection in NLP systems. The visualization employs a node–edge graph structure, wherein Fig. 7 illustrates a visual explanation generated through the Word- each word in the input sentence is represented as a distinct node. A ContextGraphExplainer framework, a graph-theoretic methodology de- structured layout algorithm is used to optimally position the nodes, veloped to enhance interpretability in natural language processing minimizing visual overlap while preserving semantic relationships. tasks. This approach is specifically designed to analyze the contextual Node coloration adheres to a three-class scheme: red nodes signify and semantic interdependencies among lexical units in a given text. The words with negative influence on the prediction, gray nodes indicate visualized instance centers on a sample from a patient–doctor interac- neutral contributions, and green nodes denote positive contributions tion scenario, highlighting how domain-specific terminology influences that enhance the model’s classification confidence. Each node is an- the model’s sentiment classification decision. notated with a numeric coefficient reflecting its individual effect on The graph comprises the following principal components: the predicted class probability. The values presented (ranging from Each node corresponds to an individual word token extracted from +0.0001 to +0.0197) quantitatively capture the magnitude of each the input sentence. Numerical values adjacent to the nodes (rang- word’s contribution to the final classification decision. Notably, terms ing from −0.6908 to +0.3007) quantify the contextual influence of such as ‘‘worthless’’ (+0.0068), ‘‘barely’’ (+0.0072), and ‘‘emotions’’ each word on the model’s predicted sentiment class. These scalar (+0.0197) exhibit significant negative sentiment contributions, aligning weights reflect the relative importance of lexical features based on with the model’s overall classification of the input as Negative. Edges perturbation-based sensitivity analysis. between nodes represent word-pair interactions whose importance ex- Edges link semantically related word pairs, capturing co-occurrence ceeds a predefined threshold, capturing non-additive effects between patterns and latent dependencies. Notably, the term ‘‘pain’’ occupies co-occurring terms. As specified in the legend (top-left), the visualiza- a central position in the graph with multiple connections, indicating tion highlights the top five most influential word-pair interactions. Edge its pivotal role in determining the emotional tone of the dialogue. The annotations (e.g., ‘‘+0.6061 (Neg)’’, ‘‘+0.6701 (Neg)’’) denote both visualization applies a ‘‘top-5 interactions’’ threshold, selectively dis- the strength and directional impact of these interactions on sentiment playing the most salient semantic relationships to prevent information classification. These values reflect synergistic or antagonistic effects overload while preserving interpretive clarity. that emerge when specific word combinations appear within the same The graph reveals a meaningful mapping between medical do- context. The model’s confident prediction of the input text as expressing main terms (e.g., ‘‘doctor’’, ‘‘medication’’, ‘‘pain’’) and activity-related Negative sentiment (as shown at the bottom of the visualization) is expressions drawn from sports terminology (e.g., ‘‘tennis’’, ‘‘cricket’’, supported by the prevalence of red-coded nodes and high-magnitude ‘‘playing’’), showcasing the model’s capacity to associate physically negative interaction coefficients. The analyzed text—rich in expressions contextualized discomfort with healthcare concerns. This highlights the of emotional distress and self-deprecating language—serves as a clear model’s ability to capture nuanced emotional cues across domains. 15 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Fig. 7. Graph-based visualization with the WordContextGraphExplainer framework for Dataset2. Fig. 8. Graph-based visualization with the WordContextGraphExplainer framework for Dataset3. The WordContextGraphExplainer framework, as demonstrated in this A salient feature in the visualization is the positioning of the word clinical communication use case, provides an interpretable, context- ‘‘great’’ as the central hub node. With a high positive influence score aware mechanism for analyzing model behavior. Its utility in domains of +0.7819, this term is encoded in green, representing a dominant such as clinical text analysis and patient-centered dialogue interpreta- contributor within the Positive Sentiment category. Its central role in tion suggests promising implications. By revealing both direct and indi- the graph indicates that it functions as the primary sentiment-bearing rect contributions of lexemes to the classification process, this method- lexical unit in the sentence. ology lays a solid foundation for future research on explainable AI in The graph exhibits a radial topology, with all peripheral nodes medical and psychologically sensitive natural language applications. emanating from the central ‘‘great’’ node. This star-like configuration Fig. 8 presents a significant methodological example of visualiz- reflects how sentiment polarity is propagated through the surrounding ing sentiment analysis and contextual word relationships through the context, with the central node acting as the semantic anchor. WordContextGraphExplainer framework. The graph specifically illus- The weights of the edges range from −0.2868 to +0.1792, quantifying trates the semantic structure of the sentence ‘‘that would be great, then the strength of semantic correlation between each word and the central we could plan things sooner’’, offering insight into how lexical elements ‘‘great’’ node. The system’s overall classification of the sentence as collectively influence the model’s sentiment prediction. Positive sentiment is clearly driven by the dominant positive influence 16 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Fig. 9. Graph-based visualization with the WordContextGraphExplainer framework for Dataset4. of the hub node. This highlights the framework’s keyword-centric coherence. This clustering reveals that the system is capable of con- modeling approach to sentiment interpretation. textually grouping entertainment-related entities, thereby enhancing Words such as ‘‘plan’’, ‘‘things’’, ‘‘sooner’’, ‘‘then’’, ‘‘we’’, ‘‘could’’, domain-sensitive sentiment interpretation. ‘‘that’’, ‘‘would’’, and ‘‘be’’ are categorized as having neutral sentiment The inclusion of interrogative tokens such as ‘‘what’’ (+0.0018) and contributions. These peripheral tokens exhibit minimal effect values the question mark ‘‘?’’ (+0.0017) underscores the framework’s ability ranging between +0.0001 and +0.0002, suggesting their limited seman- to classify interrogative structures appropriately within the seman- tic influence on the classification. This uniform distribution underscores tic graph. These tokens demonstrate minor but contextually relevant the marginal role of syntactic or functional words in the model’s contributions to the overall sentiment. decision-making process. The neutral classification of the term ‘‘never’’ (+0.0007) suggests a The system’s capacity to selectively highlight the five strongest sophisticated handling of negation. Rather than misattributing a strong semantic pairwise interactions enhances both computational efficiency negative weight, the model maintains contextual equilibrium, acknowl- and model interpretability. By focusing on the most relevant contex- edging the grammatical presence of negation without overestimating its tual relationships, the graph avoids overcomplexity while preserving emotional impact. analytical fidelity. The model’s ultimate sentiment prediction as Positive is primar- This visualization demonstrates that WordContextGraphExplainer ily driven by the dominant influence of the ‘‘enjoy’’ hub node. This serves as a promising approach within the sentiment analysis domain, demonstrates the system’s robust classification capabilities in scenarios contributing meaningfully to the broader paradigm of interpretable containing mixed sentiments and multifaceted content. artificial intelligence. Its ability to disentangle and communicate the Overall, this analysis reinforces the efficacy of the interplay between dominant and supportive linguistic features makes it WordContextGraphExplainer framework as an interpretability tool for particularly valuable for applications requiring both transparency and complex conversational texts. It not only captures domain-specific analytical depth. semantic cohesion but also preserves fine-grained contextual dependen- Fig. 9 presents a Word Context Graph that exemplifies the com- cies, making it a powerful instrument for multi-topic sentiment analysis plex dynamics of multi-domain sentiment analysis and cross-topical in real-world natural language understanding applications. semantic understanding. The visualization analyzes the sentence ‘‘I Fig. 10 illustrates a Word Context Graph generated by the Word- have never seen Avatar, what is it about? I really enjoy The Avenger’’, ContextGraphExplainer framework, presenting a critical case study for offering a fine-grained representation of lexical interactions within the sentiment analysis and psychological state detection within the mental entertainment domain. health domain. The graph analyzes a linguistically complex, emotion- The node ‘‘enjoy’’ (+0.4646) serves as the central hub in the graph, ally charged sentence: exhibiting the highest positive sentiment score. This node constitutes the semantic backbone of the structure, maintaining extensive connec- ‘‘I’m going through some things with my feelings and myself. I tivity with surrounding tokens. The presence of dual-edge structures barely sleep and I do nothing but think about how I’m worthless highlights WordContextGraphExplainer’s capacity to capture nuanced and how I shouldn’t be here’’. variations in semantic relationship strength across word pairs. The strong semantic ties among the nodes ‘‘avatar’’, ‘‘avenger’’, and The term ‘‘feelings’’ (+0.0197) is positioned as the central hub node, ‘‘enjoy’’ reflect the model’s successful identification of domain-specific forming the core component of the negative sentiment cluster. This 17 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 Fig. 10. Graph-based visualization with the WordContextGraphExplainer framework for Dataset5. central positioning reflects the dominant role of emotional discourse Table 23 within the narrative and highlights the lexical anchor around which Interpretability Fidelity Score Comparison Across Datasets. semantic interactions are organized. Dataset LIME WordContextGraphExplainer (%) Improvement (%) The graph predominantly features nodes classified as negative, such Dataset 1 0.8100 0.8900 +9.88 as ‘‘worthless’’ (+0.0068), ‘‘nothing’’ (+0.0097), and ‘‘barely’’ (+0.0072). Dataset 2 0.8000 0.8600 +7.50 Dataset 3 0.6540 0.7380 +12.84 These contribute to the accurate identification of depressive language Dataset 4 0.6920 0.7120 +2.89 patterns and reinforce the system’s capacity to localize affectively Dataset 5 0.6800 0.8200 +20.59 significant tokens. Edge weights span a broad spectrum from +0.8360 to −0.6061, indicating considerable variance in the strength of inter-word interactions. Notably, the strongest negative correlations are concen- 𝐹 = 𝐸(𝑥, 𝑘) using the specified explanation method, where 𝑘 deter- trated around the ‘‘feelings’’ hub, supporting its centrality in semantic mines the number of top-ranked features to consider. Subsequently, we influence. The nodes ‘‘shouldn’’ (+0.0171) and ‘‘be’’ (+0.0132) are neg- create a modified input 𝑥′ = Remove(𝑥, 𝐹 ) by removing the identified atively classified, reflecting the system’s ability to detect linguistic important features from the original text. We then compute a new indicators of suicidal ideation. This demonstrates the model’s sensitivity prediction 𝑝′ = 𝑀(𝑥′ ) using this perturbed input to observe how the to subtle syntactic constructions associated with psychological distress. model’s behavior changes. Finally, we calculate the fidelity score as The node ‘‘sleep’’ (+0.0125) is identified within the negative sentiment fidelity = |𝑝0 − 𝑝′ |, which quantifies the absolute difference between category, indicating the model’s capacity to recognize sleep disruption the original and perturbed predictions. — an important marker in clinical mental health assessments. The term The underlying hypothesis assumes that if an explanation method ‘‘think’’ (+0.0044) reflects ruminative thought patterns and is correctly accurately identifies decision-critical features, their removal should positioned within the semantic network. This demonstrates the system’s effectiveness in modeling internal cognitive processes associated with produce substantial changes in model predictions. Mathematically, this depressive episodes. The model’s overall prediction of Negative senti- can be expressed as: ment aligns with clinical assessment criteria, suggesting that the system High Fidelity ⇔ arg max(𝑀(𝑥)) ≠ arg max(𝑀(𝑥′ )) (2) achieves a promising level of accuracy for mental health screening applications. This classification is supported by the density of negative The absolute difference metric captures both direction-preserving and sentiment nodes and their semantically coherent interactions. direction-changing prediction modifications, providing a comprehen- This analysis demonstrates that the WordContextGraphExplainer sive assessment of explanation accuracy. framework provides a robust interpretability mechanism for psycho- For comprehensive evaluation, individual fidelity scores are aggre- logically sensitive content. By quantifying both individual lexical con- gated using the arithmetic mean: tributions and inter-word semantic interactions, the system delivers a 1∑ 𝑛 fine-grained visualization of emotional discourse, making it particularly Mean Fidelity = |𝑀(𝑥𝑖 ) − 𝑀(𝑥′𝑖 )| (3) 𝑛 𝑖=1 valuable in clinical decision support systems. The fidelity metric [51] implemented in this framework quan- where 𝑛 represents the total number of test instances. tifies the correspondence between explanation-based feature impor- In the broader context of XAI for natural language processing, tance rankings and observable model behavior changes through a WordContextGraphExplainer offers methodological advantages over tra- perturbation-based assessment methodology. ditional frameworks such as LIME. Unlike LIME, which assumes fea- Let 𝑀 represent the trained model, 𝑥 denote the original input ture independence and linearity, WordContextGraphExplainer employs a text, and 𝐸(𝑥) represent the explanation method that produces a set graph-theoretic structure capable of capturing non-linear relationships of important features 𝐹 = {𝑓1 , 𝑓2 , … , 𝑓𝑘 } with associated importance and contextual dependencies features essential for modeling complex, scores. multi-sentiment narratives. These findings underscore the superior- The fidelity score for a single instance is defined as: ity of graph-based interpretability in high-stakes domains and sug- gest promising future directions for next-generation explainable NLP Fidelity(𝑥, 𝐸) = |𝑀(𝑥) − 𝑀(𝑥′ )| (1) systems (see Table 23). where 𝑥′ represents the perturbed text obtained by removing the top-𝑘 most important features identified by the explanation method 𝐸. 5. Conclusion The fidelity [52] assessment follows this systematic procedure. First, we compute the original model prediction 𝑝0 = 𝑀(𝑥) to establish a This study presents a comprehensive framework for sentiment clas- baseline reference point. Next, we extract the most important features sification in dialogue-based scenarios through the development of a 18 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 novel HybridBERT-LSTM architecture coupled with an innovative inter- Declaration of competing interest pretability methodology. The proposed hybrid model demonstrates su- perior performance on both benchmark datasets, including the widely- The authors declare that they have no known competing finan- adopted IMDb corpus, and real-world dialogue datasets, consistently cial interests or personal relationships that could have appeared to outperforming standalone architectures such as traditional LSTM, influence the work reported in this paper. BERT, CNN, and SVM implementations. The empirical results validate the model’s enhanced capacity to capture both the semantic richness Data availability of individual utterances and the sequential dependencies inherent in multi-turn conversational contexts. The authors do not have permission to share data. The architectural innovation of HybridBERT-LSTM leverages pre- trained BERT encodings for deep contextualized embeddings, subse- quently processed through bidirectional LSTM layers to model tem- References poral dependencies and discourse-level structures. The integration of [1] L. Song, et al., CASA: Conversational aspect sentiment analysis for dialogue dual pooling mechanisms (average and maximum) followed by dense understanding, J. Artificial Intelligence Res. 73 (2022) 511–533. classification layers enables the model to synthesize learned represen- [2] M. Firdaus, et al., MEISD: A multimodal multi-label emotion, intensity and tations effectively, making it particularly suitable for dialogue senti- sentiment dialogue dataset, in: COLING, 2020, pp. 4441–4453. ment analysis where contextual flow and sequential relationships are [3] I. Carvalho, et al., The importance of context for sentiment analysis in dialogues, IEEE Access 11 (2023) 86088–86103. paramount. [4] J. Wang, et al., Sentiment classification in customer service dialogue with A significant contribution of this research lies in the development topic-aware multi-task learning, AAAI 34 (05) (2020) 9177–9184. of explainable context-aware sentiment reasoning capabilities. Beyond [5] D. Bertero, et al., Real-time speech emotion and sentiment recognition, EMNLP the scope of traditional local explanation techniques, a novel graph- 104 (2016) 2–1047. theoretic interpretability framework, WordContextGraphExplainer, has [6] C. Bothe, et al., Dialogue-based neural learning to estimate sentiment, in: ICANN, 2017, pp. 477–485. been proposed to address the fundamental limitations inherent in [7] M. Firdaus, et al., EmoSen: Generating sentiment and emotion controlled existing methodologies. Unlike LIME, which operates under linear responses, IEEE Trans. Affect. Comput. 13 (3) (2020) 1555–1566. additivity assumptions and treats tokens as independent entities, Word- [8] A. Mallol-Ragolta, B. Schuller, Coupling sentiment and arousal analysis, IEEE ContextGraphExplainer employs sophisticated perturbation analysis Access 12 (2024) 20654–20662. [9] Z. Akbar, M.U. Ghani, U. Aziz, Boosting viewer experience with emotion-driven to model non-linear semantic interactions between word pairs. This video analysis: A BERT-based framework for social media content, J. Artif. Intell. methodology constructs semantic interaction graphs where nodes rep- Behav. (2025). resent individual word contributions and edges encode inter-word [10] J. Zhao, W. Gao, A semantic-enhanced heterogeneous dialogue graph network, dependencies, providing intuitive visualization of complex linguistic IEEE ICETCI 131 (2024) 5–1322. relationships through NetworkX-based representations. The compar- [11] M. Yang, et al., GME-dialogue-NET, Acad. J. Comput. Inf. Sci. 4 (8) (2021) 10–18. ative analysis reveals that while LIME provides granular word-level [12] M. Parmar, A. Tiwari, Emotion and sentiment analysis in dialogue: A multimodal attributions, it operates independently of sequential context and fails strategy employing the BERT model, in: 2024 Parul International Conference on to capture the synergistic effects crucial for accurate sentiment inter- Engineering and Technology, PICET, 2024, pp. 1–7. pretation in conversational settings. In contrast, WordContextGraph- [13] Mustapha Z., Aspect-based emotion analysis for dialogue understanding, 2024. [14] W. Li, W. Shao, S. Ji, E. Cambria, BiERU: Bidirectional emotional recurrent unit Explainer’s graph-based approach explicitly models contextual inter- for conversational sentiment analysis, Neurocomputing 467 (2022) 73–82. dependencies, semantic propagation patterns, and negation scope ef- [15] S. Poria, D. Hazarika, N. Majumder, R. Mihalcea, Beneath the tip of the iceberg: fects that are essential for understanding transformer decision-making Current challenges and new directions in sentiment analysis research, IEEE Trans. processes. This advancement enables practitioners to trace how sen- Affect. Comput. 14 (1) (2020) 108–132. timent emerges through word interactions and temporal flow across [16] L. Zhu, R. Mao, E. Cambria, B.J. Jansen, Neurosymbolic AI for personalized sentiment analysis, in: International Conference on Human-Computer Interaction, dialogue turns, providing unprecedented insights into model reason- 269–290, Springer Nature Switzerland, Cham, 2024. ing mechanisms. The integration of WordContextGraphExplainer with [17] M. Luo, H. Fei, B. Li, S. Wu, Q. Liu, S. Poria, et al., Panosent: A panoptic HybridBERT-LSTM establishes a new paradigm for interpretable dia- sextuple extraction benchmark for multimodal conversational aspect-based sen- logue sentiment analysis, where prediction accuracy and explainability timent analysis, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 7667–7676. are synergistically enhanced. This framework demonstrates particular [18] Y. Zhang, Q. Li, D. Song, P. Zhang, P. Wang, Quantum-inspired interactive efficacy in clinical applications and mental health assessment scenarios, networks for conversational sentiment analysis, 2019. where understanding the rationale behind sentiment predictions is [19] L. Yang, Q. Yang, J. Zeng, T. Peng, Z. Yang, H. Lin, Dialogue sentiment analysis as critical as the predictions themselves. Future research directions based on dialogue structure pre-training, Multimedia Syst. 31 (2) (2025) 1–13. include extending the graph-based interpretability framework to mul- [20] K. Horesh, A. Kumar, A. Anand, A. Sabu, T. Jain, Sentiment Analysis on Amazon Electronics Product Reviews using Machine Learning Techniques, IEEE, 2023, tilingual contexts and exploring its applications in other NLP tasks http://dx.doi.org/10.1109/gcat59970.2023.10353467. requiring fine-grained semantic understanding. Future work should [21] A. Matsui, E. Ferrara, Word embedding for social sciences: An interdisciplinary focus on developing simplified visualization layers and adaptive user survey, PeerJ Comput. Sci. 10 (2024) e2562. interfaces that can present graph-based explanations at varying levels [22] S. Anitha, P. Gnanasekaran, Advanced sentiment classification using RoBERTa and aspect-based analysis on large-scale e-commerce datasets, Nanotechnol. of complexity, enabling domain experts to access meaningful inter- Perceptions 20 (S16) (2024) 336–348. pretability insights without requiring deep technical expertise in graph [23] P. Borah, D. Gupta, B.B. Hazarika, ConCave-convex procedure for support vector theory or network analysis. Future research should incorporate system- machines with Huber loss for text classification, Comput. Electr. Eng. 122 (2025) atic human evaluation studies to assess the explanatory quality and 109925. clinical applicability of WordContextGraphExplainer outputs among [24] Z. Hua, Y. Tong, Y. Zheng, Y. Li, Y. Zhang, PPGloVe: privacy-preserving GloVe for training word vectors in the dark, IEEE Trans. Inf. Forensics Secur. 19 (2024) domain practitioners. 3644–3658. [25] A. Rasool, S. Aslam, N. Hussain, S. Imtiaz, W. Riaz, nbert: Harnessing NLP CRediT authorship contribution statement for emotion recognition in psychotherapy to transform mental health care, Information 16 (4) (2025) 301. Ercan Atagün: Writing – review & editing, Writing – original [26] E. Mitera-Kiełbasa, K. Zima, Automated classification of exchange information requirements for construction projects using Word2Vec and SVM, Infrastructures draft, Methodology, Investigation, Conceptualization. Günay Temür: 9 (11) (2024) 194. Validation, Methodology. Serdar Biroğul: Supervision, Project admin- [27] Z. Yang, F. Emmert-Streib, Optimal performance of Binary Relevance CNN in istration, Conceptualization. targeted multi-label text classification, Knowl.-Based Syst. 284 (2024) 111286. 19 E. Atagün et al. Computer Standards & Interfaces 97 (2026) 104086 [28] J. Peng, S. Huo, Application of an improved convolutional neural network [40] A. Bajaj, D.K. Vishwakarma, HOMOCHAR: A novel adversarial attack framework algorithm in text classification, J. Web Eng. 23 (3) (2024) 315–339. for exposing the vulnerability of text-based neural sentiment classifiers, Eng. [29] K. Nithya, M. Krishnamoorthi, S.V. Easwaramoorthy, C.R. Dhivyaa, S. Yoo, J. Appl. Artif. Intell. 126 (2023) 106815, http://dx.doi.org/10.1016/j.engappai. Cho, Hybrid approach of deep feature extraction using BERT–OPCNN & FIAC 2023.106815. with customized Bi-LSTM for rumor text classification, Alex. Eng. J. 90 (2024) [41] A. Bajaj, D.K. Vishwakarma, Evading text-based emotion detection mechanism 65–75. via adversarial attacks, Neurocomputing 558 (2023). [30] S. Jamshidi, M. Mohammadi, S. Bagheri, H.E. Najafabadi, A. Rezvanian, M. [42] G.A. de Oliveira, R.T. de Sousa, R. de O. Albuquerque, L.J.G. Villalba, Adversarial Gheisari, et al., Effective text classification using BERT, MTM LSTM, and DT, attacks on a lexical sentiment analysis classifier, Comput. Commun. 174 (2021) Data Knowl. Eng. 151 (2024) 102306. 154–171, http://dx.doi.org/10.1016/j.comcom.2021.04.026. [31] O. Galal, A.H. Abdel-Gawad, M. Farouk, Federated freeze BERT for text [43] M. Hussain, M. Naseer, Comparative analysis of logistic regression, LSTM, and classification, J. Big Data 11 (1) (2024) 28. Bi-LSTM models for sentiment analysis on IMDB movie reviews, J. Artif. Intell. [32] C. Eang, S. Lee, Improving the accuracy and effectiveness of text classification Comput. 2 (1) (2024) 1–8. based on the integration of the bert model and a recurrent neural network [44] C.D. Kulathilake, J. Udupihille, S.P. Abeysundara, A. Senoo, Deep learning-driven (RNN_Bert_Based), Appl. Sci. 14 (18) (2024) 8388. multi-class classification of brain strokes using computed tomography: A step [33] M. Ahmed, M.S. Hossain, R.U. Islam, K. Andersson, Explainable text classification towards enhanced diagnostic precision, Eur. J. Radiol. 187 (2025) 112109. model for COVID-19 fake news detection, J. Internet Serv. Inf. Secur. 12 (2) [45] Amod, Mental health counseling conversations dataset, 2024, Retrieved from (2022) 51–69. https://huggingface.co/datasets/Amod/mental_health_counseling_conversations/ [34] K. Zahoor, N.Z. Bawany, T. Qamar, Evaluating text classification with explainable tree/main. artificial intelligence, Int. J. Artif. Intell. ISSN 225 (2024) 2–8938. [46] B. Yao, P. Tiwari, Q. Li, Self-supervised pre-trained neural network for quantum [35] D. Kalla, N. Smith, F. Samaah, Deep learning-based sentiment analysis: Enhancing natural language processing, Neural Netw. 184 (2025) 107004, Elsevier. IMDb review classification with LSTM models, 2025, Available at SSRN 5103558. [47] SohamGhadge, Casual conversation dataset, 2024, Retrieved from https:// [36] R. Beniwal, A.K. Dinkar, A. Kumar, A. Panchal, A hybrid deep learning model huggingface.co/datasets/SohamGhadge/casual-conversation/tree/main. for sentiment analysis of IMDB movies reviews, in: 2024 Asia Pacific Conference [48] Mahfoos, Patient-doctor conversation dataset, 2024, Retrieved from https:// on Innovation in Technology, APCIT, IEEE, 2024, pp. 1–7. huggingface.co/datasets/mahfoos/Patient-Doctor-Conversation/tree/main. [37] N. Tabassum, T. Alyas, M. Hamid, M. Saleem, S. Malik, Z. Ali, U. Farooq, [49] Alimistro123, English chat sentiment dataset, 2024, Retrieved from https://www. Semantic analysis of Urdu English tweets empowered by machine learning, Intell. kaggle.com/code/alimistro123/english-chat-sentiment-dataset-found. Autom. Soft Comput. 30 (1) (2021) 175–186. [50] Adapting, Empathetic dialogues v2 dataset, 2024, Retrieved from https:// [38] A. Pandey, R. Yadav, A. Pathak, N. Shivani, B. Garg, A. Pandey, Sentiment huggingface.co/datasets/Adapting/empathetic_dialogues_v2. analysis of IMDB movie reviews, in: 2024 First International Conference on [51] Y. Singh, Q.A. Hathaway, V. Keishing, S. Salehi, Y. Wei, N. Horvat, D.V. Vera- Software, Systems and Information Technology, SSITCON, IEEE, 2024, pp. 1–6. Garcia, A. Choudhary, A.Mula. Kh, E. Quaia, et al., Beyond post hoc explanations: [39] R. Amin, R. Gantassi, N. Ahmed, A.H. Alshehri, F.S. Alsubaei, J. Frnda, A hybrid A comprehensive framework for accountable AI in medical imaging through approach for adversarial attack detection based on sentiment analysis model transparency, Interpret. Explain. Bioeng. 12 (8) (2025) 879. using machine learning, Eng. Sci. Technol. an Int. J. 58 (2024) 101829. [52] M. Bayesh, S. Jahan, Embedding security awareness in IoT systems: A framework for providing change impact insights, Appl. Sci. 15 (14) (2025) 7871. 20