MICCAI

Explainable ADHD Diagnostic Framework Using Weakly-Supervised Action Recognition

The clinical diagnosis of Attention Deficit Hyperactivity Disorder (ADHD) primarily relies on scale questionnaires, clinical interviews, and executive function tests, which face challenges including limited medical...

Research Field Medical Image Analysis

Article Type Research analysis

Authors Fan et al.

Original Paper Published 2026

ISOM Posted 2026-03-12 19:39 UTC

Read Time 9M


Editorial Disclosure

ISOM follows an editorial workflow that structures the source paper into a readable analysis, then publishes the summary, source links, and metadata shown on this page so readers can verify the original work.

The goal of this page is to help readers understand the paper's core question, method, evidence, and implications before opening the original publication.

Background & Academic Lineage

Diagnosing Attention Deficit Hyperactivity Disorder (ADHD) has historically relied on subjective clinical interviews and standardized rating scales. These methods are prone to clinician-dependent bias and lack objective, quantitative measures of hyperactive behavior. Early AI approaches tried to automate diagnosis using structured records or neurophysiological data (such as EEG or MRI), and recent computer vision work has shifted toward analyzing behavioral phenotypes. The "pain point" that motivated EDWAR, the framework proposed in the paper, is the black-box nature of existing deep learning models. Previous systems may predict a diagnosis with high accuracy, but they do not explain the "why": they cannot point to the specific moments in a video where a patient exhibited ADHD-related behavior, which makes them hard to trust in clinical practice.

Figure 1. The illustration of EDWAR framework

Intuitive Domain Terms

Weakly-Supervised Learning: Imagine teaching a student to identify a specific bird in a video by only telling them "this video contains the bird," rather than pointing to the exact second it appears. The model has to figure out the "where" on its own.
Skeletal Sequences: Think of this as a "stick-figure" animation extracted from a video. By focusing only on the joints (shoulders, elbows, knees), the model ignores distracting background details like room lighting or furniture, focusing purely on the patient's movement.
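Gumbel-Softmax: A standard softmax outputs a fuzzy mix of probabilities, and simply picking the most likely option (argmax) blocks the gradients needed for training. Gumbel-Softmax adds random noise to the scores and applies a temperature-controlled softmax, which gives a differentiable approximation of drawing a discrete sample. This lets the model make a firm, clear decision (e.g., "this is a movement") while still being able to learn from its mistakes during training.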
Anomaly Activation: Think of this as a "heat map" for behavior. It is the model's way of highlighting specific timestamps in a video where a patient's movement deviates from the norm, effectively saying, "This is the exact moment the hyperactivity occurred."
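To make the "skeletal sequence" idea concrete, the sketch below flattens per-frame joint coordinates into the $T \times D$ array that the notation table calls $X$. The joint count (17) and the use of 2D coordinates are illustrative assumptions; the source does not state which pose format or feature dimension the paper uses.

```python
import numpy as np

def flatten_skeleton(keypoints: np.ndarray) -> np.ndarray:
    """Reshape joint coordinates (T, J, K) into a sequence X of shape (T, J*K)."""
    if keypoints.ndim != 3:
        raise ValueError("expected an array of shape (T, J, K)")
    T, J, K = keypoints.shape
    return keypoints.reshape(T, J * K)

# Hypothetical sizes: 4 frames, 17 joints, 2D (x, y) coordinates.
rng = np.random.default_rng(0)
keypoints = rng.random((4, 17, 2))
X = flatten_skeleton(keypoints)
print(X.shape)  # (4, 34)
```

Each row of `X` is one timestep, so everything downstream can be indexed by frame.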

Notation Table

Notation	Description
$X \in \mathbb{R}^{T \times D}$	The input skeletal sequence with $T$ timesteps and $D$ feature dimensions.
$f \in \mathbb{R}^{T \times d}$	The encoded pose features extracted by the encoder $g_\theta$.
$\alpha^{act} \in \mathbb{R}^{T \times 2}$	The activation map representing presence/absence of activity at each timestep.
$\mathbf{P}^{Act}_i$	The probability proposal for activity at timestep $i$ using Gumbel-Softmax.
$\alpha^{ano} \in \mathbb{R}^{T \times C}$	The anomaly activation matrix for $C$ different types of hyperactive behaviors.
$s \in \mathbb{R}^{C}$	The aggregated video-level anomaly score for each behavior category.
$r \in \mathbb{R}^{M}$	The standardized executive function test metrics (e.g., Stroop test results).
$p$	The final ADHD diagnosis probability output by the classifier.

The authors solve the interpretability problem by creating a two-stage collaborative framework. First, they use an Activity Segment Proposal (ASP) module to filter out static or irrelevant motion. They define the activation map $\alpha^{act}$ and use the Gumbel-Softmax trick to generate hard proposals $\mathbf{P}^{Act}_i$ that allow the model to focus only on active segments.

The core innovation is the Anomaly Activation Network (AAN), defined as:
$$\alpha^{ano} = \text{AAN}(\mathbf{P}^{Act} \odot f)$$
This equation masks the input features $f$ with the activity proposals $\mathbf{P}^{Act}$, ensuring the network only analyzes meaningful movements. The model then aggregates these into a score $s_c$ using a sigmoid function $\sigma(\cdot)$ and a learnable temperature parameter $\mathcal{T}_c$ to identify specific anomalies. Finally, the ADHD diagnosis is not just based on video, but on the concatenation of these anomaly scores and traditional test metrics $r$:
$$p = \text{MLP}(\text{concat}(s, r))$$
This joint optimization, governed by the loss function $\mathcal{L} = \mathcal{L}_{diag} + \lambda\mathcal{L}_{action}$, forces the model to learn features that are both accurate for diagnosis and clinically interpretable. It is a clever way to ensure the AI's "reasoning" aligns with human-observable clinical evidence.

Problem Definition & Constraints

The core challenge addressed by this paper is the "black-box" nature of existing AI-assisted ADHD diagnostic tools. Currently, clinicians rely on a combination of subjective rating scales, clinical interviews, and executive function tests. While previous AI models have attempted to automate this by analyzing behavioral data (like gaze or skeletal movement), they often function as opaque classifiers. This creates a significant gap: clinicians cannot trust or verify the "why" behind an AI's diagnostic decision, which is a critical requirement for clinical adoption.

The Dilemma and Constraints

The authors face a classic trade-off between predictive accuracy and interpretability.
- The Data Bottleneck: Obtaining fine-grained, frame-by-frame annotations of "abnormal" ADHD behaviors (e.g., wiggling, seat shifting) is prohibitively expensive and time-consuming. This forces the authors to rely on "weakly-supervised" learning, where they only have access to video-level labels (e.g., "this video contains ADHD symptoms") rather than precise temporal markers.
- The Noise Problem: During executive function tests, subjects perform many normal, task-related movements. A model must distinguish these from pathological, ADHD-related hyperactive behaviors.
- The Integration Wall: Simply concatenating clinical test metrics with behavioral features often leads to suboptimal performance because the two data sources exist in different "feature spaces." The authors had to design a collaborative framework that forces the model to learn features that are simultaneously discriminative for diagnosis and clinically meaningful for action recognition.

Mathematical Formulation

The authors bridge this gap by defining a two-stage collaborative reasoning framework.

Activity Segment Proposal (ASP): To handle the lack of fine-grained labels, they project encoded pose features $\mathbf{f} \in \mathbb{R}^{T \times d}$ into an activation map $\alpha^{act} \in \mathbb{R}^{T \times 2}$. To avoid the "fragmentation" of standard softmax, they use the Gumbel-Softmax trick:
$$[\mathbf{P}_i^{Act}, \mathbf{P}_i^{NoAct}] = \text{Gumbel-Softmax}([a_{i,0}, a_{i,1}]), \forall i \in \{1, \dots, T\}$$
Here $a_{i,0}$ and $a_{i,1}$ denote the two entries of row $i$ of $\alpha^{act}$. The trick keeps gradients flowing during training while enabling hard, deterministic selection of "active" segments during inference.
Anomaly Activation Network (AAN): Once the active segments are identified, the model predicts anomaly scores $\alpha^{ano}$ using:
$$\alpha^{ano} = \text{AAN}(\mathbf{P}^{Act} \odot \mathbf{f})$$
This effectively masks out irrelevant, static, or normal movements, focusing the model's attention only on the segments where ADHD-related behaviors are likely to occur.
Collaborative Optimization: The final diagnosis $p$ is obtained by concatenating the aggregated anomaly scores $\mathbf{s}$ with clinical test metrics $\mathbf{r}$ into an MLP classifier:
$$p = \text{MLP}(\text{concat}(\mathbf{s}, \mathbf{r}))$$
The entire system is trained using a multi-task loss function $\mathcal{L} = \mathcal{L}_{diag} + \lambda\mathcal{L}_{action}$. This forces the model to learn a shared representation where the diagnostic gradient acts as a supervisor for the action recognition module, ensuring the detected anomalies are actually relevant to the clinical diagnosis.
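The three stages above can be traced end to end in a few lines of NumPy. This is a minimal inference-time sketch under stated assumptions, not the authors' implementation: the AAN is replaced by a single linear layer, the classifier by a one-hidden-layer MLP, all weights are random placeholders, and every name (`edwar_forward`, `params`, the sizes) is hypothetical. Column 0 of the activation map is treated as "activity", following the order $[\mathbf{P}^{Act}, \mathbf{P}^{NoAct}]$ in the equation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def edwar_forward(f, r, params):
    """Inference-time forward pass; returns (p_act, alpha_ano, s, p)."""
    # ASP: project features (T, d) to an activation map (T, 2).
    alpha_act = f @ params["W_act"]
    # Hard selection: column 0 is "activity", matching [P_Act, P_NoAct].
    p_act = (alpha_act[:, 0] > alpha_act[:, 1]).astype(float)
    # AAN stand-in: mask inactive timesteps, then a linear map to C classes.
    alpha_ano = (p_act[:, None] * f) @ params["W_ano"]
    # Aggregate over time with a per-class temperature, then squash to (0, 1).
    s = sigmoid((p_act @ alpha_ano) / params["temperature"])
    # Classifier: concat(s, r) -> one hidden ReLU layer -> probability.
    h = np.maximum(0.0, np.concatenate([s, r]) @ params["W1"] + params["b1"])
    p = float(sigmoid(h @ params["w2"] + params["b2"]))
    return p_act, alpha_ano, s, p

# Hypothetical sizes and random placeholder weights (not trained values).
T, d, C, M, H = 6, 8, 3, 4, 5
rng = np.random.default_rng(0)
params = {
    "W_act": 0.1 * rng.normal(size=(d, 2)),
    "W_ano": 0.1 * rng.normal(size=(d, C)),
    "temperature": np.ones(C),
    "W1": 0.1 * rng.normal(size=(C + M, H)),
    "b1": np.zeros(H),
    "w2": 0.1 * rng.normal(size=H),
    "b2": 0.0,
}
f = rng.normal(size=(T, d))   # stands in for the encoder output g_theta(X)
r = rng.normal(size=M)        # stands in for standardized test metrics
p_act, alpha_ano, s, p = edwar_forward(f, r, params)
print(p_act, s.round(3), round(p, 3))
```

Because the mask is applied before the linear stand-in for the AAN, rows of `alpha_ano` at inactive timesteps are exactly zero, which is what lets the activation matrix double as a temporal explanation.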

Why This Approach

The EDWAR framework addresses the critical challenge of clinical trust in AI-assisted ADHD diagnosis by replacing "black-box" models with a transparent, weakly-supervised action recognition system.

The Inevitability of the Choice

The authors argue that standard state-of-the-art methods, such as conventional CNNs or basic Transformers, often fall short in clinical settings because they treat diagnosis as a monolithic classification task. In ADHD assessment, the "what" (the diagnosis) is insufficient without the "why" (the behavioral evidence).

Comparative Superiority (The Benchmarking Logic):
* Structural Advantage: Unlike standard models that might process an entire video clip as a single feature vector, EDWAR utilizes an Activity Segment Proposal (ASP) module. This module acts as a filter, separating relevant hyperactive behaviors from static or irrelevant motions. By employing the Gumbel-Softmax trick, the model maintains differentiability during training while enabling hard, deterministic selection during inference.
* Multimodal Synergy: The framework does not rely on vision alone. It combines behavioral video analysis with structured executive function test metrics. By concatenating the anomaly score vector $\mathbf{s}$ with test metrics $\mathbf{r}$ in the final classification layer, the model grounds the diagnosis in both measured test performance and observed behavior.

The core of the problem is to identify anomaly actions in a sequence $X \in \mathbb{R}^{T \times D}$ without frame-level labels. The authors solve this by:

Feature Encoding: Extracting features $\mathbf{f} = g_\theta(X)$ and projecting them into a $T \times 2$ activation map $\alpha^{act}$ to distinguish between activity and non-activity.
Stochastic Sampling: Using the Gumbel-Softmax distribution to generate proposals $\mathbf{P}^{Act}_i$ that allow for gradient flow.
Anomaly Localization: Predicting anomaly activations $\alpha^{ano}$ via an Anomaly Activation Network (AAN):
$$\alpha^{ano} = \text{AAN}(\mathbf{P}^{Act} \odot \mathbf{f})$$
This effectively masks out irrelevant motions, ensuring the model only analyzes segments where activity is detected.
Joint Optimization: The final diagnostic probability $p$ is derived from the concatenation of the aggregated anomaly scores $\mathbf{s}$ and test metrics $\mathbf{r}$, optimized via a multi-task loss function:
$$\mathcal{L} = \mathcal{L}_{\text{diag}} + \lambda\mathcal{L}_{\text{action}}$$
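
For reference, the stochastic sampling step can be written out using the standard form of the Gumbel-Softmax relaxation. The temperature $\tau$ and the choice to feed the activation-map entries $a_{i,k}$ in directly as logits are assumptions here; the source does not give these details.

$$g_{i,k} = -\log(-\log u_{i,k}), \quad u_{i,k} \sim \text{Uniform}(0, 1)$$
$$y_{i,k} = \frac{\exp\left((a_{i,k} + g_{i,k}) / \tau\right)}{\sum_{j \in \{0,1\}} \exp\left((a_{i,j} + g_{i,j}) / \tau\right)}, \quad k \in \{0, 1\}$$

Under this reading, $\mathbf{P}^{Act}_i = y_{i,0}$ and $\mathbf{P}^{NoAct}_i = y_{i,1}$. As $\tau \to 0$ the samples approach one-hot vectors. The "hard" variant outputs the one-hot argmax in the forward pass while using the gradient of the soft sample in the backward pass.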

Mathematical & Logical Mechanism

The EDWAR framework addresses the clinical challenge of diagnosing ADHD by combining objective behavioral analysis with traditional test metrics. The core motivation is to move away from "black-box" AI models toward a system that provides both high diagnostic accuracy and transparent, temporally localized evidence that clinicians can verify.

The Master Equation

The framework relies on a scoring function that aggregates temporal anomaly activations into a video-level probability for each behavior class. The core equation for the anomaly score $s_c$ of class $c$ is:

$$s_c = \sigma \left( \frac{\sum_{i=1}^{T} P_i^{\text{Act}} \cdot \alpha_{i,c}^{\text{ano}}}{\mathcal{T}_c} \right)$$

Tearing the equation apart:

$s_c$: The predicted probability (between 0 and 1) that an anomaly of type $c$ occurred in the video.
$\sigma(\cdot)$: The sigmoid activation function.
$\sum_{i=1}^{T}$: A summation over all $T$ timesteps in the video.
$P_i^{\text{Act}}$: The "Activity Proposal" weight at timestep $i$. This acts as a gate: it is derived from the Gumbel-Softmax sampling and effectively "turns off" (sets to 0) timesteps that the model deems static or irrelevant.
$\alpha_{i,c}^{\text{ano}}$: The anomaly activation logit for class $c$ at timestep $i$.
$\mathcal{T}_c$: A learnable temperature parameter for class $c$ that scales the summed activations before the sigmoid. It is written $\mathcal{T}_c$ to keep it distinct from the sequence length $T$.
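
A small worked example may help show how the gate and the temperature interact in the scoring equation. The function name and the toy numbers below are illustrative only.

```python
import numpy as np

def anomaly_scores(p_act, alpha_ano, temperature):
    """s_c = sigmoid(sum_i p_act[i] * alpha_ano[i, c] / temperature[c])."""
    p_act = np.asarray(p_act, dtype=float)              # (T,)
    alpha_ano = np.asarray(alpha_ano, dtype=float)      # (T, C)
    temperature = np.asarray(temperature, dtype=float)  # (C,)
    gated_sum = p_act @ alpha_ano                       # (C,) sum over timesteps
    return 1.0 / (1.0 + np.exp(-gated_sum / temperature))

# Toy example: T = 4 timesteps, C = 2 behavior classes.
p_act = [1, 0, 1, 0]
alpha_ano = [[2.0, -1.0],
             [5.0,  5.0],   # gated out because p_act is 0
             [1.0, -1.0],
             [5.0,  5.0]]   # gated out because p_act is 0
temperature = [1.0, 2.0]
s = anomaly_scores(p_act, alpha_ano, temperature)
print(np.round(s, 3))  # [0.953 0.269]
```

The gated sums are 3.0 and -2.0. Dividing by the temperatures gives 3.0 and -1.0, and the sigmoid maps these to roughly 0.953 and 0.269. The large activations at the two inactive timesteps have no effect on either score.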

Figure 2. Explainability Illustration example of constantly shifting in the seat action

Optimization Dynamics

The model learns through a multi-task objective function: $\mathcal{L} = \mathcal{L}_{\text{diag}} + \lambda\mathcal{L}_{\text{action}}$.

The optimization is a delicate balancing act. The $\mathcal{L}_{\text{diag}}$ loss forces the model to be accurate in its final clinical prediction, while $\mathcal{L}_{\text{action}}$ forces the model to correctly identify specific behaviors. Because these are trained jointly, the diagnostic gradients act as a "teacher" for the action recognition module, guiding it to focus on behaviors that are actually relevant to ADHD rather than just any random movement.
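
The weighted sum of the two objectives can be illustrated as follows. The source does not specify the form of $\mathcal{L}_{\text{diag}}$ or $\mathcal{L}_{\text{action}}$; binary cross-entropy for both, the per-class video-level action labels `y_action`, and the value of `lam` are assumptions made for this sketch.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for numerical safety."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

def joint_loss(p, y_diag, s, y_action, lam=1.0):
    """Return (L, L_diag, L_action) with L = L_diag + lam * L_action."""
    l_diag = binary_cross_entropy([y_diag], [p])
    l_action = binary_cross_entropy(y_action, s)
    return l_diag + lam * l_action, l_diag, l_action

# Hypothetical values for one video: diagnosis probability p, anomaly scores s.
total, l_diag, l_action = joint_loss(
    p=0.8, y_diag=1, s=[0.9, 0.2, 0.6], y_action=[1, 0, 1], lam=0.5
)
print(round(total, 4), round(l_diag, 4), round(l_action, 4))
```

A larger `lam` pushes the model toward getting the behavior labels right, while a smaller one prioritizes the final diagnosis.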

Results, Limitations & Conclusion

The EDWAR framework addresses a critical bottleneck in clinical psychiatry: the subjectivity and lack of quantitative transparency in diagnosing ADHD.

The Core Problem and Mathematical Solution

To solve the annotation bottleneck, the authors employ Weakly-Supervised Action Recognition. Instead of requiring frame-by-frame labels, the model only needs video-level diagnostic labels. The framework uses an Activity Segment Proposal (ASP) module to filter out irrelevant motions and focus on anomalous behaviors.

Mathematically, the model processes skeletal sequences $X \in \mathbb{R}^{T \times D}$ through an encoder $g_\theta$ to get features $\mathbf{f}$. To avoid the limitations of standard softmax, the authors use the Gumbel-Softmax trick:
$$[\mathbf{P}^{\text{Act}}_i, \mathbf{P}^{\text{NoAct}}_i] = \text{Gumbel-Softmax}([a_{i,0}, a_{i,1}]), \forall i \in \{1, \dots, T\}$$
This allows the model to maintain differentiable gradients during training while enabling hard, deterministic decisions during inference.
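
The sampling step can be sketched in NumPy as below. This shows only the forward computation. NumPy has no automatic differentiation, so the gradient behavior that makes the trick useful in training is not demonstrated here. The temperature `tau` and the input sizes are assumptions, and column 0 is read as "activity" to match the order in the equation.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=False, rng=None):
    """Sample from the Gumbel-Softmax relaxation along the last axis."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise via the inverse-CDF transform of uniform samples.
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / tau
    z = z - z.max(axis=-1, keepdims=True)   # stabilize the exponentials
    y_soft = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    if not hard:
        return y_soft
    # Hard variant: one-hot vector at the argmax of the soft sample.
    y_hard = np.zeros_like(y_soft)
    np.put_along_axis(y_hard, y_soft.argmax(axis=-1)[..., None], 1.0, axis=-1)
    return y_hard

rng = np.random.default_rng(0)
alpha_act = rng.normal(size=(5, 2))          # hypothetical (T, 2) activation map
soft = gumbel_softmax(alpha_act, tau=0.5, rng=rng)
hard = gumbel_softmax(alpha_act, tau=0.5, hard=True, rng=rng)
p_act = hard[:, 0]                           # 1.0 where "activity" was selected
print(soft.round(3))
print(p_act)
```

In an autodiff framework the hard sample is usually combined with the soft one through a straight-through estimator, so the forward pass is discrete while gradients follow the soft sample. PyTorch exposes this as `torch.nn.functional.gumbel_softmax(logits, tau, hard)`.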

Experimental Validation

The authors compared their architecture with a range of baselines, including traditional machine learning models and temporal pattern recognition models such as bi-LSTM and BERT. EDWAR reached 94.3% accuracy, 2.7 percentage points above the BERT-based hybrid-modal baseline (91.6%). The ablation study in Table 2 supports the claim that the improvement comes from combining the weakly-supervised action recognition (WSAR) module with the clinical test metrics.

Discussion and Future Perspectives

The EDWAR framework is a significant step forward, but it raises several fascinating questions for future research:
1. Cross-Disorder Generalization: Can this framework be adapted to distinguish between ADHD and other neurodevelopmental conditions?
2. Longitudinal Stability: How would the model perform if it had to analyze hours of classroom behavior?
3. Ethical and Privacy Considerations: As we move toward AI-assisted diagnosis, how do we ensure that the skeletal extraction process is handled with the highest level of privacy?

Overall, the framework provides a compelling, transparent, and highly accurate solution to a complex clinical problem, effectively bridging the gap between algorithmic decisions and human-readable evidence.

Table 2. Ablation study results of EDWAR framework components

Table 1. Comparison of ADHD diagnosis performance between different methods. T and A represent using executive function test and action information, respectively