Clinician-AI Performance Comparison on Contour Corrections

Evaluating how deep learning models compare with human experts in predicting the clinical impact of segmentation variations

Radiotherapy Quality Assurance Deep Learning Clinical Validation

Abstract

In radiation therapy, segmentation accuracy directly impacts treatment outcomes. While automated segmentation methods promise efficiency and consistency, evaluating their quality remains challenging. Traditional geometric metrics (Dice, Hausdorff distance) fail to capture what matters clinically: how segmentation errors affect radiation dose delivery.

This research addresses that gap by developing and validating methods that assess segmentation quality through clinical dosimetry. We show that deep learning models can predict dosimetric impact more accurately and consistently than expert radiation oncologists. The expert assessments also showed concerning patterns, including poor inter-rater agreement (Cohen’s κ: 0.33–0.74) and a systematic conservative bias.

Our work establishes the foundation for AI-based quality assurance systems that provide objective, dose-informed segmentation evaluation for safer, more efficient radiotherapy planning.

Dose-aware quality assessment framework comparing clinician and AI performance.

Introduction

The Problem with Geometric Metrics

Evaluation of medical image segmentation has traditionally relied on geometric metrics such as the Dice Similarity Coefficient and the Hausdorff Distance. These metrics are mathematically well defined and cheap to compute, which makes them attractive for algorithm development. However, they share an inherent limitation: they treat all spatial errors as equally significant clinically.

In radiotherapy, this assumption is fundamentally flawed. A small contour variation in a high-dose region may have profound clinical consequences, while a larger error in a low-dose area might be insignificant. The relationship between geometric accuracy and clinical impact is complex, context-dependent, and influenced by treatment technique, anatomical structure, and dose distribution.
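
The point can be illustrated with a toy example. The sketch below is a minimal illustration under stated assumptions, not data from the studies: the 1D dose profile, the voxel indices, and the helper names `dice` and `mean_dose` are all hypothetical. It builds two contour variations with identical Dice scores against a reference contour, one extending into a high-dose region and one into a low-dose region, and compares the resulting change in mean dose to the structure.

```python
def dice(a, b):
    """Dice similarity coefficient between two collections of voxel indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def mean_dose(mask, dose):
    """Mean dose (Gy) over the voxels listed in mask."""
    mask = list(mask)
    return sum(dose[i] for i in mask) / len(mask)


# Hypothetical 1D dose profile (Gy): high dose on the left, steep fall-off, low dose on the right.
dose = [60.0] * 5 + [50.0, 40.0, 30.0, 20.0, 10.0] + [5.0] * 10

reference = list(range(5, 15))    # reference contour: voxels 5..14
variation_a = list(range(3, 15))  # two extra voxels toward the high-dose side
variation_b = list(range(5, 17))  # two extra voxels toward the low-dose side

dice_a = dice(reference, variation_a)
dice_b = dice(reference, variation_b)

# Change in mean structure dose relative to the reference contour.
shift_a = mean_dose(variation_a, dose) - mean_dose(reference, dose)
shift_b = mean_dose(variation_b, dose) - mean_dose(reference, dose)

print(f"Dice A = {dice_a:.3f}, Dice B = {dice_b:.3f}")
print(f"Mean-dose shift A = {shift_a:+.2f} Gy, B = {shift_b:+.2f} Gy")
```

Both variations add the same number of voxels, so the Dice score cannot distinguish them, yet the change in mean dose differs by more than a factor of three in this constructed case.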

The Clinical Challenge

Our research revealed a striking finding: radiation oncologists themselves struggle to visually predict which contour variations will cause clinically significant dosimetric changes. This underscores the urgent need for automated computational approaches that can accurately assess clinical implications of segmentation errors.

Key Research Contributions

1. Clinician vs AI Performance Study

Our MIDL 2024 study (Kamath et al., 2024) and its extended journal version in Radiotherapy & Oncology (Willmann et al., 2025) provided the first systematic comparison of deep learning dose prediction with clinical expert assessment.

Key Findings:

Deep learning models outperformed radiation oncologists in estimating dosimetric impact
Higher correlation with ground truth dose distributions
Substantially reduced variability compared to human experts
Completed assessments in seconds vs minutes/hours

Expert Variability Analysis:

Cohen’s Kappa values: 0.33–0.74 (weak to moderate agreement)
46% false positive rate: Equivalent variations incorrectly classified as “worse”
Conservative bias: No expert identified objectively superior variations
Implication: unnecessary clinical corrections and wasted resources
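
For readers unfamiliar with these statistics, the following is a minimal sketch of how Cohen's kappa between two raters, and a false positive rate in the sense used above (truly equivalent variations rated "worse"), can be computed. The ratings and ground-truth labels are invented for illustration and are not data from the study; the function names are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_1, rater_2):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_1)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Chance agreement from each rater's marginal label frequencies.
    c1, c2 = Counter(rater_1), Counter(rater_2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def false_positive_rate(ratings, truth):
    """Fraction of truly 'equivalent' variations that were rated 'worse'."""
    rated = [r for r, t in zip(ratings, truth) if t == "equivalent"]
    if not rated:
        raise ValueError("no 'equivalent' items in ground truth")
    return sum(r == "worse" for r in rated) / len(rated)


# Invented example ratings for eight contour variations.
truth = ["worse", "equivalent", "equivalent", "equivalent",
         "worse", "equivalent", "worse", "equivalent"]
expert_1 = ["worse", "worse", "equivalent", "equivalent",
            "worse", "equivalent", "worse", "equivalent"]
expert_2 = ["worse", "equivalent", "equivalent", "worse",
            "worse", "equivalent", "worse", "worse"]

kappa = cohens_kappa(expert_1, expert_2)
fpr = false_positive_rate(expert_2, truth)
print(f"kappa = {kappa:.2f}, false positive rate (expert 2) = {fpr:.0%}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why two raters who agree on five of eight items can still score low.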

2. AutoDoseRank Framework

AutoDoseRank (Mercado et al., 2024) introduced automated dosimetry-informed segmentation ranking for radiotherapy quality assurance.

Innovation:

Deep learning-based dose predictor eliminating need for full dose recalculation
Clinical priority integration considering organ-specific importance
Patient-level assessment across all organs-at-risk
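
The idea of combining organ-specific priorities into a patient-level ranking can be sketched as follows. This is a hypothetical illustration only and not the scoring used by AutoDoseRank: the organ names, the weights, the predicted dose deviations, and the weighted-sum score are all assumptions chosen to show how clinical priorities can change a ranking.

```python
# Hypothetical clinical priority weights per organ-at-risk (higher = more critical).
ORGAN_WEIGHTS = {"brainstem": 3.0, "optic_chiasm": 2.0, "cochlea": 1.0}


def patient_score(deviation_gy, weights):
    """Priority-weighted sum of absolute predicted dose deviations (lower is better)."""
    return sum(weights[organ] * abs(dev) for organ, dev in deviation_gy.items())


def rank_candidates(candidates, weights):
    """Return candidate names ordered from least to most dosimetric impact."""
    return sorted(candidates, key=lambda name: patient_score(candidates[name], weights))


# Invented predicted dose deviations (Gy) from a reference plan for three candidate segmentations.
candidates = {
    "seg_a": {"brainstem": 0.2, "optic_chiasm": 0.1, "cochlea": 2.0},
    "seg_b": {"brainstem": 1.0, "optic_chiasm": 0.0, "cochlea": 0.1},
    "seg_c": {"brainstem": 0.1, "optic_chiasm": 0.3, "cochlea": 0.2},
}

weighted_ranking = rank_candidates(candidates, ORGAN_WEIGHTS)
# Equal weights for comparison: every organ counts the same.
unweighted_ranking = rank_candidates(candidates, {organ: 1.0 for organ in ORGAN_WEIGHTS})

print("weighted:  ", weighted_ranking)
print("unweighted:", unweighted_ranking)
```

In this constructed case the two rankings disagree on `seg_a` and `seg_b`: the brainstem deviation of `seg_b` is penalized more heavily once priorities are applied.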

Performance:

Outperformed 3 of 4 radiation oncology experts in ranking accuracy
Better inter-rater agreement (Kendall’s Tau) than human experts
Sub-second inference enabling real-time clinical integration

3. Automated Quality Assurance Integration

Our work establishes frameworks for integrating AI-based dose prediction into clinical workflows:

Objective, dose-based assessment in place of purely geometric metrics and subjective visual review
Workflow efficiency through instant, automated evaluation
Improved consistency reducing harmful inter-evaluator variability
Enhanced safety through accurate identification of significant errors
Cost-effectiveness via reduced unnecessary corrections

Methods and Approach

Deep Learning Dose Prediction

Our approach employs deep learning models trained on clinical treatment planning data to predict dose distributions without requiring full physics-based simulation. This enables:

Rapid assessment: Dose prediction in seconds vs hours
Sensitivity to contour changes: Captures the dosimetric impact of contour variations
Clinical validation: Evaluated against ground truth treatment plans

Clinical Validation Studies

Multi-Institutional Survey:

4 radiation oncologists and 3 medical physicists
54 glioblastoma target volume variations across 14 patients
Statistical analysis: Cohen’s Kappa, correlation analysis, pattern analysis

Performance Benchmarking:

Head-to-head comparison: AI vs expert clinicians
Kendall’s Tau ranking correlation
Normalized Distance-based Performance Measure (NDPM)
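
As a rough guide to the two ranking measures, the sketch below computes Kendall's tau (the tau-a variant, which assumes no ties) and NDPM in its commonly defined form, where pairs ordered opposite to the reference count fully and pairs the system leaves tied count half. Whether the studies used exactly these variants is an assumption, and the rank values are invented for illustration.

```python
from itertools import combinations


def _sign(x):
    return (x > 0) - (x < 0)


def kendall_tau_a(reference, predicted):
    """Kendall's tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("need two rankings of equal length >= 2")
    pairs = list(combinations(range(len(reference)), 2))
    total = sum(
        _sign(reference[i] - reference[j]) * _sign(predicted[i] - predicted[j])
        for i, j in pairs
    )
    return total / len(pairs)


def ndpm(reference, predicted):
    """NDPM: 0 means perfect agreement with the reference ordering, 1 means fully reversed."""
    if len(reference) != len(predicted):
        raise ValueError("rankings must have equal length")
    contradictory = tied = preferred = 0
    for i, j in combinations(range(len(reference)), 2):
        ref_pref = _sign(reference[i] - reference[j])
        if ref_pref == 0:
            continue  # reference expresses no preference for this pair
        preferred += 1
        pred_pref = _sign(predicted[i] - predicted[j])
        if pred_pref == 0:
            tied += 1
        elif pred_pref != ref_pref:
            contradictory += 1
    if preferred == 0:
        raise ValueError("reference ranking contains no strict preferences")
    return (2 * contradictory + tied) / (2 * preferred)


# Invented ranks for five contour variations (1 = least dosimetric impact).
reference_rank = [1, 2, 3, 4, 5]  # from ground-truth dose
predicted_rank = [1, 3, 2, 4, 5]  # from a model or an expert

tau = kendall_tau_a(reference_rank, predicted_rank)
distance = ndpm(reference_rank, predicted_rank)
print(f"Kendall tau-a = {tau:.2f}, NDPM = {distance:.2f}")
```

With one swapped pair out of ten, tau-a is 0.8 and NDPM is 0.1; the two measures move in opposite directions, since tau rewards agreement and NDPM measures distance from the reference ordering.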

Quality Assurance Framework

Organ-specific priority integration
Patient-level comprehensive assessment
Uncertainty quantification for clinical confidence
Real-time feedback capabilities

Conclusion

Key Achievements

This research establishes that AI-based dose prediction provides more accurate and consistent dosimetric impact assessment than expert clinicians, addressing a critical gap in radiotherapy quality assurance. Key achievements include:

First systematic clinician-AI comparison demonstrating AI superiority
Identification of concerning patterns in expert assessment (poor agreement, systematic biases)
Practical framework for automated, dose-informed quality assurance
Clinical validation across multiple institutions and expert clinicians

Clinical Impact

The integration of AI-based dose prediction into radiotherapy workflows offers:

Objective, dose-based evaluation in place of purely geometric metrics and subjective visual assessment
Improved efficiency through instant automated feedback
Enhanced safety via consistent, accurate dosimetric evaluation
Resource optimization by reducing unnecessary corrections
Clinical decision support with quantitative dose-based insights

Future Directions

Promising future research directions include:

Multi-modal integration: Incorporating functional imaging and temporal data
Real-time contouring feedback: Interactive dose-aware editing tools
Outcome-based validation: Linking segmentation quality to patient outcomes
Adaptive quality assurance: Personalized evaluation based on patient characteristics
Uncertainty quantification: Enhanced confidence estimation for clinical trust

By continuing to bridge computational methods with clinical needs, we advance toward safer, more efficient, and more effective radiation therapy through intelligent quality assurance systems.
