Dose Prediction for Contour Quality Evaluation
Bringing clinical outcome knowledge forward into earlier steps of the workflow.
The not-so-random initial state of this text is courtesy of Ai2 Scholar QA; it has been reasonably cross-checked and improved by a human. This is a work in progress (WIP): this message will be removed once sufficient progress has been made.
Introduction
Contouring quality in radiation oncology represents a critical aspect of treatment planning that directly influences treatment outcomes. Variations in how radiation oncologists delineate target volumes and organs at risk (OARs) can significantly impact dosimetric calculations, potentially affecting both tumor control probability (TCP) and normal tissue complication probability (NTCP) (Zhu et al., 2019). Research has demonstrated that accurate primary gross tumor contouring can positively influence tumor control and patient survival outcomes (Lin et al., 2019). This is particularly important because contouring variations directly affect radiation treatment planning quality, especially regarding dose distribution to OARs.
Quality assurance (QA) for clinical contours is an integral component of radiotherapy, both in routine clinical practice and in clinical trials. However, manual contour review is resource-intensive: it requires substantial anatomical knowledge and significant human and financial investment, and it may itself be subject to the same inter-observer variability present in the initial contouring process (Loo et al., 2012). The importance of contour QA has increased with the widespread adoption of highly conformal treatment techniques, which magnify the dosimetric impact of delineation errors (Nijhuis et al., 2021). Proper contour QA is particularly crucial for improving the validity and reliability of clinical trial outcomes, especially when analyzing relationships between radiation dose and treatment-related toxicity.
Automated contour quality assurance
The implementation of automatic segmentation in clinical practice faces a significant challenge: how to effectively validate and evaluate the accuracy and reliability of auto-generated contours (Cao et al., 2020). Traditional manual review of contours is resource-intensive and susceptible to the same inter-observer variability that affects initial contouring. Automated quality assurance (QA) systems offer a solution by efficiently identifying contouring errors while improving consistency (Brooks et al., 2024). Moreover, guidelines alone do not suffice to eliminate inter-observer variability; further treatment standardisation will require additional effort, for example with artificial intelligence (Veen et al., 2020).
Several approaches to automated contour QA have been developed. One popular method uses knowledge-based outlier detection with one-class training, where features calculated from high-quality contours are used to classify contours of unknown quality (Altman et al., 2015). These features typically include contour volume, shape, orientation, position, and image characteristics. The models range from univariate statistical approaches to more sophisticated multivariate statistical and deep learning models (McIntosh et al., 2013).
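As an illustration of the one-class idea, the sketch below fits a one-class SVM to features extracted from a library of high-quality contours and flags new contours whose features fall outside that distribution. The feature set (volume, centroid coordinates, sphericity) and all numerical values are synthetic placeholders, not those used in the cited work.

```python
# Sketch: one-class outlier detection for contour QA.
# Assumes per-contour features (e.g. volume, centroid coordinates, sphericity)
# have already been extracted; the values below are purely illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Features from a library of known high-quality contours (one row per contour).
good_features = rng.normal(loc=[30.0, 0.0, 0.0, 0.7],
                           scale=[3.0, 2.0, 2.0, 0.05],
                           size=(200, 4))

# New contours of unknown quality: one plausible, one with an implausible volume/shape.
new_features = np.array([
    [31.0, 0.5, -0.3, 0.71],   # consistent with the library
    [55.0, 9.0,  4.0, 0.40],   # far outside the library
])

scaler = StandardScaler().fit(good_features)
model = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
model.fit(scaler.transform(good_features))

# +1 = consistent with the high-quality library, -1 = flagged for human review.
flags = model.predict(scaler.transform(new_features))
print(flags)  # e.g. [ 1 -1]
```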
Recent research has demonstrated impressive results with convolutional neural network (CNN)-based QA tools. Rhee et al. developed a CNN-based autocontouring tool that could detect errors in multiatlas-based autocontours with high accuracy for head and neck structures. For most organs at risk, their system correctly identified clinically unacceptable contours with accuracy rates of 80-99% (Rhee et al., 2019).
In a subsequent study focused on pelvic structures, researchers found that surface Dice Similarity Coefficient (DSC) with tolerances of 1-3 mm was the most accurate metric for distinguishing clinically acceptable from unacceptable contours, achieving accuracy rates above 90% for targets and critical structures in cervical cancer patients (Rhee et al., 2022).
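The sketch below shows one voxel-based way to compute a surface DSC at a given tolerance for a pair of binary masks. It only approximates the mesh-based metric used in the cited work; the masks, voxel spacing, and tolerance are illustrative.

```python
# Sketch: voxel-based surface Dice at a tolerance for comparing an auto-contour
# against a reference contour. Toy spherical masks on a 2 mm grid.
import numpy as np
from scipy import ndimage

def surface_dice(mask_ref, mask_test, spacing_mm, tolerance_mm):
    """Fraction of surface voxels lying within `tolerance_mm` of the other surface."""
    def surface(mask):
        return mask & ~ndimage.binary_erosion(mask)

    surf_ref, surf_test = surface(mask_ref), surface(mask_test)
    # Distance from every voxel to the nearest surface voxel of the other mask.
    dist_to_ref = ndimage.distance_transform_edt(~surf_ref, sampling=spacing_mm)
    dist_to_test = ndimage.distance_transform_edt(~surf_test, sampling=spacing_mm)

    overlap = (np.sum(dist_to_test[surf_ref] <= tolerance_mm)
               + np.sum(dist_to_ref[surf_test] <= tolerance_mm))
    return overlap / (surf_ref.sum() + surf_test.sum())

# Two toy spherical "contours", offset by one voxel.
zz, yy, xx = np.ogrid[:40, :40, :40]
ref = (zz - 20) ** 2 + (yy - 20) ** 2 + (xx - 20) ** 2 <= 10 ** 2
test = (zz - 20) ** 2 + (yy - 21) ** 2 + (xx - 20) ** 2 <= 10 ** 2
print(surface_dice(ref, test, spacing_mm=(2.0, 2.0, 2.0), tolerance_mm=3.0))
```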
Duan et al. further enhanced contour QA by using multiple geometric agreement metrics (27 in total) with machine learning classification models. This approach outperformed traditional methods based on only one or two metrics, with recall values ranging from 0.727 to 0.842 and precision values from 0.762 to 0.899 for various head and neck structures (Duan et al., 2023).
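A minimal sketch of this classification idea follows: a random forest is trained on several agreement metrics to separate acceptable from unacceptable contours. The metric columns, labels, and data are synthetic stand-ins, not Duan et al.'s 27 features or their dataset.

```python
# Sketch: classifying contour acceptability from multiple agreement metrics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 300

# Columns: volumetric DSC, surface DSC @ 2 mm, 95th-percentile HD (mm),
# mean surface distance (mm) -- a small, illustrative subset of possible metrics.
acceptable = np.column_stack([
    rng.normal(0.88, 0.04, n), rng.normal(0.92, 0.04, n),
    rng.normal(2.5, 0.8, n),   rng.normal(0.8, 0.3, n)])
unacceptable = np.column_stack([
    rng.normal(0.70, 0.08, n), rng.normal(0.72, 0.08, n),
    rng.normal(7.0, 2.0, n),   rng.normal(2.5, 0.8, n)])

X = np.vstack([acceptable, unacceptable])
y = np.array([1] * n + [0] * n)  # 1 = clinically acceptable, 0 = needs review

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="recall").mean())
```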
Clinical implementation of automated segmentation with QA components has shown promising results. In a multiuser study, 98% of AI-predicted organs at risk required no revision or only minor revisions before being deemed clinically acceptable, reducing manual contouring time by 90% (Jin et al., 2022). Similarly, Dai et al. conducted clinical evaluations of automatic delineation models by having groups of radiation oncologists review the auto-generated contours and categorize them as acceptable with no corrections, acceptable with minor corrections, or unacceptable (Dai et al., 2021).
The quality of automated QA systems depends on several factors, including the size and quality of the atlas or training dataset. Research has shown that performance generally increases with the size of the atlas library, but reaches a plateau beyond a certain point (Lee et al., 2019). Additionally, the choice of segmentation algorithm influences QA performance, with modern approaches like generative adversarial networks with shape constraints showing improvements over traditional methods (Tong et al., 2019).
As these automated QA systems continue to evolve, they offer significant potential to streamline the radiation therapy workflow, reduce the burden of manual review, and improve the consistency and quality of contours used in treatment planning (Sharp et al., 2014). However, while automated QA can identify potential errors, human verification remains essential in many clinical settings, particularly for target volumes where errors can have significant dosimetric consequences (Hoque et al., 2023).
Contour quality assessment using dose predictions
Traditional contour evaluation methods rely on either time-consuming visual inspection or mathematical metrics such as Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) to compare auto-generated contours with ground truth (Terparia et al., 2020). However, these geometric metrics do not directly reflect the clinical significance of contour variations in terms of dosimetric impact.
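For reference, the sketch below computes the two metrics named above on binary masks: volumetric DSC and a symmetric Hausdorff distance. It is illustrative only; clinical tools typically operate on structure sets rather than toy masks.

```python
# Sketch: volumetric Dice and symmetric Hausdorff distance on binary masks.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(a, b):
    """Volumetric Dice similarity coefficient of two boolean masks."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def hausdorff(a, b, spacing_mm=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance (mm) between the voxel sets of two masks."""
    pa = np.argwhere(a) * spacing_mm
    pb = np.argwhere(b) * spacing_mm
    return max(directed_hausdorff(pa, pb)[0], directed_hausdorff(pb, pa)[0])

# Toy reference and auto-generated "contours", offset by one voxel.
zz, yy, xx = np.ogrid[:30, :30, :30]
ref = (zz - 15) ** 2 + (yy - 15) ** 2 + (xx - 15) ** 2 <= 8 ** 2
auto = (zz - 15) ** 2 + (yy - 17) ** 2 + (xx - 15) ** 2 <= 8 ** 2
print(dice(ref, auto), hausdorff(ref, auto, spacing_mm=(2.0, 2.0, 2.0)))
```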
AI dose prediction models offer a transformative approach to contour quality assessment by providing “dose awareness” – the ability to instantly assess how contour variations affect the dose distribution. As Poel et al. explain, “A deep-learning model that can give an accurate prediction of the dose received by an OAR instantly could provide the required information to assess the clinical impact of contour variations” (Poel et al., 2023).
This capability enables clinicians to focus quality assessment efforts on contour regions that would have the greatest impact on treatment outcomes rather than treating all geometric deviations as equally important. Several research groups have developed specialized approaches for contour quality assessment using AI-based methods. Wooten et al. demonstrated that training a random forest model on shape features of contours provides a viable method for contour quality assurance without requiring imaging or radiomic features, making it robust across different imaging platforms (Wooten et al., 2022).
For brain tumors specifically, we evaluated a Cascaded 3D UNet for dose prediction that showed good sensitivity to radiation dose changes resulting from contour variations, reporting promising mean dose scores and mean Dose Volume Histogram (DVH) scores between predicted and reference dose volumes (Kamath et al., 2023). Furthermore, we proposed ASTRA, in which “atomic” contour changes are simulated and the corresponding dose changes are visualized as a heatmap on the surface of the structures to indicate dosimetric impact (Kamath et al., 2023).
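As a rough illustration of DVH-based comparison, the sketch below computes a cumulative DVH for one structure and the mean-dose difference between a predicted and a reference dose volume. This is not the specific dose-score or DVH-score definition from the cited work; the dose grids and structure mask are synthetic.

```python
# Sketch: cumulative DVH for one structure and a simple predicted-vs-reference
# dose comparison on synthetic data.
import numpy as np

def cumulative_dvh(dose, mask, bin_width_gy=0.1):
    """Return (dose bins, fraction of structure volume receiving >= each bin)."""
    d = dose[mask]
    bins = np.arange(0.0, d.max() + bin_width_gy, bin_width_gy)
    volume_fraction = np.array([(d >= b).mean() for b in bins])
    return bins, volume_fraction

rng = np.random.default_rng(2)
shape = (40, 40, 40)
mask = np.zeros(shape, dtype=bool)
mask[15:25, 15:25, 15:25] = True                        # toy OAR

reference_dose = rng.normal(20.0, 3.0, shape).clip(min=0)      # Gy
predicted_dose = reference_dose + rng.normal(0.0, 0.5, shape)

# Mean-dose difference inside the structure (one ingredient of a dose comparison).
print(abs(predicted_dose[mask].mean() - reference_dose[mask].mean()))

bins, vf = cumulative_dvh(reference_dose, mask)
print(np.interp(0.95, vf[::-1], bins[::-1]))             # approximate D95 (Gy)
```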
The Bayesian deep learning approach offers another avenue for contour quality assessment by enabling risk analysis through uncertainty quantification. Chaves-de-Plaza et al. proposed generating an ensemble of possible contours and computing multiple dose-volume histograms (DVHs) to identify “risky uncertain areas” where metrics deviate significantly from planning values (Chaves-de-Plaza et al., 2022). This approach helps prioritize contour corrections based on their potential impact on treatment planning.
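A toy version of this ensemble idea is sketched below: several plausible contour variants are generated (here by simple erosions and dilations), a dose metric is evaluated for each, and a large spread marks a structure whose delineation deserves closer attention. This only illustrates the concept, not the method of Chaves-de-Plaza et al.

```python
# Sketch: spread of a dose metric across an ensemble of contour variants.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(3)
shape = (40, 40, 40)
dose = rng.normal(20.0, 3.0, shape).clip(min=0)          # synthetic dose grid (Gy)

base = np.zeros(shape, dtype=bool)
base[15:25, 15:25, 15:25] = True                          # nominal contour

# Ensemble of plausible contour variants (here: simple erosions/dilations).
variants = [base,
            ndimage.binary_erosion(base, iterations=1),
            ndimage.binary_dilation(base, iterations=1),
            ndimage.binary_dilation(base, iterations=2)]

mean_doses = np.array([dose[m].mean() for m in variants])
spread = mean_doses.max() - mean_doses.min()
print(mean_doses.round(2), spread.round(2))
# A large spread relative to the planning constraint marks a "risky" contour.
```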
In clinical practice, contour quality assessment has two fundamental purposes, as Berenato et al. highlight: ensuring the patient’s dose distribution is optimal (PD) and verifying that reported doses (RD) are accurate for treatment safety assessment (Berenato et al., 2024). AI dose prediction supports both objectives by providing immediate feedback on dosimetric consequences of contour variations. This feedback is particularly valuable in adaptive radiotherapy workflows, where Yoon et al. found that auto-generated contours required qualitative rating and quantitative comparison to human-adjusted “ground truth” contours (Yoon et al., 2020).
Despite advances in automated segmentation and dose prediction, human oversight remains essential. Roper et al. strongly recommend performing basic observer training with joint delineation review sessions to assess contour quality, emphasizing the importance of understanding model strengths and weaknesses while maintaining human verification to prevent blind use of AI-segmented contours (Roper et al., 2022).
A promising approach suggested by Fransson et al. involves a pipeline that separates contour prediction from dose distribution prediction, providing flexibility in choosing between automated treatment segmentation (ATS) and automated treatment planning (ATP) workflows while offering a starting point for contours that can be manually refined (Fransson et al., 2024).
Clinical applications and workflow integration
AI dose prediction systems are increasingly being incorporated into clinical radiation therapy workflows, offering practical integration with existing treatment planning systems. Mashayekhi et al. describe a comprehensive integration approach where the AI dose predictor connects directly to the treatment planning system database. Their implementation requires annotated PTV and OAR contours along with prescribed dose levels as inputs, and then returns both predicted dose distributions and uncertainty maps directly to the treatment planning system database without manual import by leveraging the Eclipse API. This seamless integration provides clinicians with immediate access to spatial dose distributions, uncertainty maps, DVH curves, and relevant dosimetric metrics for clinical decision-making. The ability to rapidly generate predicted dose distributions allows for iterative planning approaches where clinicians can quickly evaluate multiple treatment strategies or contour modifications (Mashayekhi et al., 2023).
The clinical implementation of AI dose prediction often involves a multi-step workflow that combines automated contouring with dose prediction. Kerf et al. outline a process where deep learning segmentation (DLS) first generates AI contours that are then reviewed by radiation oncologists. These clinically approved contours serve as input for the deep learning planning (DLP) model, which creates an AI-generated treatment plan. This plan can be further refined by a human planner during a “fine-tune optimization step” to meet specific clinical requirements (Kerf et al., 2023).
This workflow demonstrates how AI dose prediction can be integrated into a larger AI-driven treatment planning pipeline while maintaining appropriate human oversight at critical decision points. Beyond plan generation, these systems can flag potential issues in the planning process when implemented with uncertainty quantification capabilities, enabling radiation oncologists to focus their attention on areas of concern.
Challenges
Despite the promising advances, several key challenges remain. One significant hurdle is the generalizability of these models across different institutions and patient populations. Most current models are trained on data from single institutions with specific treatment planning protocols, limiting their applicability in centers with different planning approaches or equipment configurations. This challenge is compounded by the variability in contouring practices across institutions, creating inconsistencies in the training data that can affect model performance.
The handling of rare or complex cases represents another substantial challenge. Most AI prediction models perform well on common anatomical configurations but may struggle with unusual anatomies, rare disease presentations, or patients with prior treatments. These edge cases are precisely where expert human judgment is most critical, yet they are underrepresented in training datasets. Additionally, there remains a gap in addressing the uncertainty quantification in dose predictions—understanding not just what dose is predicted, but how confident the model is in that prediction for specific anatomical regions.
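One simple way to expose such uncertainty, sketched below under the assumption that an ensemble of dose predictions is available (for example from Monte Carlo dropout or independently trained models), is to report the voxelwise spread of the predictions alongside their mean and flag voxels that combine high predicted dose with high spread. All arrays are synthetic placeholders.

```python
# Sketch: voxelwise uncertainty map from an ensemble of dose predictions.
import numpy as np

rng = np.random.default_rng(4)
shape = (40, 40, 40)

# Shared underlying dose plus per-member noise stands in for T model predictions.
base = rng.normal(20.0, 3.0, shape).clip(min=0)
predictions = np.stack([base + rng.normal(0.0, 0.5, shape) for _ in range(10)])

mean_dose = predictions.mean(axis=0)      # point estimate shown to the user
uncertainty = predictions.std(axis=0)     # voxelwise spread = confidence map

# Flag voxels where the model predicts a high dose but is also uncertain.
review_map = (mean_dose > 25.0) & (uncertainty > uncertainty.mean() + 2 * uncertainty.std())
print(review_map.sum(), "voxels flagged for review")
```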
The integration of AI dose prediction tools with existing clinical workflows presents practical challenges related to computational resources, user interfaces, and regulatory approval. While some systems have successfully implemented APIs and database connections, many institutions still face technical barriers to smooth implementation. Furthermore, there are concerns about over-reliance on automated systems without adequate understanding of their limitations, potentially leading to a decline in physicist and physician expertise over time.
Future directions
The development of more explainable AI models is critical to improve trust and adoption in clinical practice. Rather than black-box approaches, models that can provide reasoning behind their predictions will be increasingly important for quality assurance. Multi-institutional collaborative efforts to create diverse, standardized datasets will help address the generalizability issue and enable more robust model training.
Future development is also focusing on real-time adaptive capabilities, where dose prediction models can update their estimations during treatment courses as patient anatomy changes. This would be particularly valuable for adaptive radiotherapy workflows where contours need frequent updates. Additionally, more sophisticated models that can account for uncertainties in treatment delivery, such as patient positioning variations and organ motion, will better represent the true dose distributions patients receive.
The integration of AI dose prediction with other AI-driven radiation therapy tools—including automated segmentation, beam angle optimization, and outcome prediction—offers the potential for more comprehensive end-to-end planning systems. These integrated systems could dramatically reduce planning time while maintaining or improving plan quality, though careful validation will be necessary at each step. As these technologies mature, they may enable more personalized radiation therapy approaches that tailor treatment parameters based on individual patient characteristics, treatment response predictions, and risk assessments.
References
2023
- How Sensitive are Deep Learning based Radiotherapy Dose Prediction Models to Variability in Organs at Risk Segmentation? In 20th IEEE International Symposium on Biomedical Imaging (ISBI), 2023
- ASTRA: Atomic Surface Transformations for Radiotherapy quality Assurance. In 45th IEEE Engineering in Medicine and Biology Conference (EMBC), 2023