Self-Consistency Improves the Trustworthiness of Self-Interpretable GNNs
Wenxin Tai1, Ting Zhong1, Goce Trajcevski2, Fan Zhou1
1 University of Electronic Science and Technology of China 2 Iowa State University
Abstract
Graph Neural Networks (GNNs) achieve strong predictive performance but offer limited transparency in their decision-making. Self-Interpretable GNNs (SI-GNNs) address this by generating built-in explanations, yet their training objectives are misaligned with evaluation criteria such as faithfulness. This raises two key questions: (i) can faithfulness be explicitly optimized during training, and (ii) does such optimization genuinely improve explanation quality? We show that faithfulness is intrinsically tied to explanation self-consistency and can therefore be optimized directly. Empirical analysis further reveals that self-inconsistency predominantly occurs on unimportant features, linking it to redundancy-driven explanation inconsistency observed in recent work and suggesting untapped potential for improving explanation quality. Building on these insights, we introduce a simple, model-agnostic self-consistency (SC) training strategy. Without changing architectures or pipelines, SC consistently improves explanation quality across multiple dimensions and benchmarks, offering an effective and scalable pathway to more trustworthy GNN explanations.
(i) Can faithfulness be optimized during training?
(ii) Even if feasible, does it truly improve explanation quality?
Key Insight
(i) Can faithfulness be optimized? Yes—by enforcing self-consistency of explanations.
Let \( h_{G_s}(G) \) denote the subgraph \( G_s \) that the explainer \( h \) extracts from an input graph \( G \). Instead of directly constraining predictions, we require the explainer to be consistent when reusing its own output:
\[ h_{G_s}(G) = h_{G_s}(G_s) \]
If the explanation truly captures what drives the prediction, such self-consistency naturally leads to faithful behavior.
\[ \underbrace{h_{G_s}(G) = h_{G_s}(G_s)}_{\text{self-consistency}} \quad \Longrightarrow \quad \underbrace{f(G_s) = f(G)}_{\text{faithfulness}} \]
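As a toy illustration of the self-consistency condition, the sketch below re-runs an explainer on its own output and measures the gap between the two explanations. Everything here (the top-k selection mechanism, `toy_explainer`, `self_consistency_gap`) is our own illustrative assumption, not the paper's actual explainer:

```python
import numpy as np

# Toy sketch of the self-consistency check h_{G_s}(G) = h_{G_s}(G_s).
# The top-k mechanism and all names are illustrative assumptions.

def toy_explainer(edge_scores, k=2):
    """Select the k highest-scoring edges as a binary explanation mask."""
    mask = np.zeros_like(edge_scores)
    mask[np.argsort(edge_scores)[-k:]] = 1.0
    return mask

def self_consistency_gap(edge_scores, k=2):
    """L1 gap between the explanation of G and the explanation of G_s.

    Feeding G_s back is modeled by zeroing the scores of edges
    that were not selected in the first pass.
    """
    mask_g = toy_explainer(edge_scores, k)            # h_{G_s}(G)
    mask_gs = toy_explainer(edge_scores * mask_g, k)  # h_{G_s}(G_s)
    return float(np.abs(mask_g - mask_gs).sum())

scores = np.array([0.9, 0.1, 0.8, 0.05, 0.2])
print(self_consistency_gap(scores))  # → 0.0 (this toy explainer is self-consistent)
```

A learned soft explainer has no such guarantee: its second-pass importance scores can drift once the graph is pruned, which is exactly the gap the SC objective penalizes.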
(ii) Does it improve explanation quality? Yes—but not in an obvious way. Our empirical findings reveal that without self-consistency training, a model's explanations can differ substantially between the first pass (explaining \( G \)) and the second pass (explaining its own output \( G_s \)). Using benchmark datasets with ground-truth explanations, we further observe that this self-inconsistency arises primarily from instability on features labeled as unimportant, while important features remain stable.
This behavior closely relates to the explanation redundancy observed in recent work (Tai et al., 2025), where explainers allocate unnecessary importance to irrelevant features whenever the budget allows. That work further showed that reducing redundancy improves explanation quality. Since our study shows that self-inconsistency likewise concentrates on unimportant features, the two phenomena appear connected: enforcing self-consistency may suppress redundancy as well, thereby improving explanation quality in a similar manner.
Method: Self-Consistency Fine-Tuning
We adopt a simple two-stage training framework. First, an SI-GNN is trained with its standard objective. Then, we freeze the encoder and fine-tune the explainer and classifier.
During fine-tuning, given an input graph \( G \), the explainer produces an explanation \( G_s^{(1)} \). We then feed \( G_s^{(1)} \) back into the model to obtain a second explanation \( G_s^{(2)} \).
We introduce an additional self-consistency (SC) loss to align the two:
\[ \mathcal{L}_{\mathrm{SC}} = \left\| G_s^{(1)} - G_s^{(2)} \right\|. \]
This objective encourages the model to produce consistent explanations, and can be seamlessly applied to existing SI-GNNs without modifying their architectures.
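The loss above can be sketched in a few lines, under one illustrative assumption of ours: an explanation is a soft edge-importance vector in \([0, 1]\) (the variable names and the weighting hyperparameter below are hypothetical, not from the paper):

```python
import numpy as np

# Minimal sketch of the self-consistency loss, assuming (illustratively)
# that an explanation is a soft edge-importance vector in [0, 1].

def sc_loss(g_s1, g_s2):
    """Mean absolute difference between the two explanations:
    L_SC = (1/|E|) * sum_e |G_s^(1)[e] - G_s^(2)[e]|."""
    return float(np.abs(g_s1 - g_s2).mean())

# First pass: explanation of the full input graph G.
g_s1 = np.array([1.0, 0.25, 1.0, 0.5])
# Second pass: explanation produced when G_s^(1) is fed back in.
g_s2 = np.array([1.0, 0.0, 1.0, 0.0])

# The gap comes entirely from the low-importance edges (indices 1 and 3),
# matching the observation that inconsistency concentrates on
# unimportant features.
print(sc_loss(g_s1, g_s2))  # → 0.1875

# During fine-tuning, L_SC would be added to the task loss, e.g.
# total = task_loss + lam * sc_loss(g_s1, g_s2), with lam a
# hypothetical weighting hyperparameter.
```

Because the loss only compares the explainer's two passes, it needs no ground-truth explanations, which is what makes it applicable to existing SI-GNNs as-is.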
Results
We illustrate the effect of self-consistency fine-tuning through a qualitative case study. The figure below shows explanations generated by five independently trained models.
Without SC, explanations vary significantly across runs and often highlight irrelevant structures. In contrast, SC fine-tuning produces explanations that are both more stable (consistent across runs) and better aligned with human-annotated explanations (more plausible).
For more detailed analysis and quantitative results, please refer to the paper.