Paper Summary: Calibrating Segmentation Networks with Margin-based Label Smoothing
This paper tackles the problem of poorly calibrated models, which produce overconfident predictions. The problem with cross-entropy-based loss functions is that they push the predicted softmax probabilities to match the one-hot label assignments, which means that the activation of the correct label must be significantly larger than the remaining activations.
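To make the point above concrete, here is a minimal NumPy sketch (not taken from the paper; the function names and the toy logits are assumptions for illustration). It shows that the cross-entropy loss keeps shrinking as the gap between the correct logit and the others grows, but never reaches zero, so training is always rewarded for widening the gap further.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class.
    return -np.log(softmax(logits)[target])

# Toy 3-class example: class 0 is correct, the other two logits stay at 0.
gaps = [1.0, 5.0, 10.0]
losses = [cross_entropy(np.array([gap, 0.0, 0.0]), target=0) for gap in gaps]

for gap, loss in zip(gaps, losses):
    print(f"logit gap {gap:4.1f} -> cross-entropy {loss:.6f}")
```

The loss is strictly decreasing in the gap, which is one way to see where the pressure towards overconfident predictions comes from.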
Major contributions of this work include:
- A unifying constrained-optimization perspective of current state-of-the-art calibration losses, which are approximations of a linear penalty (or a Lagrangian term) imposing equality constraints on logit distances.
- A simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances.
- Comprehensive experiments on a variety of public medical image segmentation benchmarks, which demonstrate new state-of-the-art results for calibration while also improving the discriminative performance.
Major Learning Points

There are many existing methods for improving calibration, including focal loss and label smoothing. The authors show in this paper that these can be viewed as different penalty functions for imposing the same logit-distance equality constraint “d(l) = 0” (where d measures each logit's distance to the logit of the winning class). Satisfying this constraint exactly would make all logits equal, which is an uninformative solution. The proposed margin-based generalization (d(l) ≤ m) of this logit-distance constraint instead only bounds the distances, and is shown to have desirable properties, such as its gradient dynamics, for calibrating neural networks.
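The inequality constraint d(l) ≤ m can be illustrated with a small NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: it assumes the constraint is enforced with a linear hinge penalty, max(0, d - m), summed over classes, and the function names and example logits are hypothetical. In training, a term like this would be added to the cross-entropy loss with some balancing weight.

```python
import numpy as np

def logit_distances(logits):
    # d(l)_k = max_j l_j - l_k for logits of shape (..., num_classes).
    # The winning class always has distance 0.
    return logits.max(axis=-1, keepdims=True) - logits

def margin_penalty(logits, margin):
    # Assumed hinge-style penalty for d(l) <= m:
    # only distances that exceed the margin are penalized.
    excess = np.maximum(0.0, logit_distances(logits) - margin)
    return excess.sum(axis=-1)

# Two toy predictions over three classes.
logits = np.array([
    [12.0, 2.0, 7.0],  # distances [0, 10, 5]: one exceeds m = 6
    [3.0, 1.0, 0.5],   # distances [0, 2, 2.5]: all within the margin
])

penalty = margin_penalty(logits, margin=6.0)
print(penalty)  # [4. 0.]
```

With margin set to 0 the same function penalizes every nonzero distance, which corresponds to the equality constraint d(l) = 0; a positive margin leaves moderately confident predictions untouched and only pulls back the extreme ones.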

The experiments are rather comprehensive; however, some of the data sets are really small. For example, the MRBrainS18 data set has 7 subjects, of which 5 are used for training and 2 for testing. The ACDC data set is split into 70 training, 10 validation and 20 test subjects, which gives it better statistical power to support reasonable inferences. With this context, the results cannot be compared apples to apples with each other (in my opinion at least), as BRATS (which has 4x as many subjects as ACDC) should be equally highly weighted while evaluating the metrics.
Interesting bits

The calibration performance metrics this paper uses include ECE (Expected Calibration Error) and CECE (Classwise ECE). These quantify the problem that the pseudo-probability of the predicted class almost always overestimates the actual probability of getting a correct answer. For example, if the largest pseudo-probability is 0.95, you don’t have a 95% chance of making a correct prediction; it is more like a 75% or 85% chance. (see here for a great explanation of how ECE is computed).
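As a rough illustration of how ECE is computed, here is a minimal NumPy sketch using the common equal-width binning scheme: predictions are grouped by confidence, and the gap between mean accuracy and mean confidence in each bin is averaged, weighted by bin size. The function name, the number of bins and the toy data are assumptions for illustration; the paper's exact evaluation setup may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: max softmax probability per prediction, in (0, 1].
    # correct: 1 if the prediction was right, 0 otherwise.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            accuracy = correct[in_bin].mean()
            avg_confidence = confidences[in_bin].mean()
            # Weight each bin's gap by the fraction of samples it holds.
            ece += in_bin.mean() * abs(accuracy - avg_confidence)
    return ece

# Four predictions made with 0.95 confidence, only three of them correct.
conf = np.array([0.95, 0.95, 0.95, 0.95])
hits = np.array([1, 1, 1, 0])
ece = expected_calibration_error(conf, hits)
print(round(ece, 3))  # 0.2
```

This mirrors the example in the text: a model that reports 0.95 but is right 75% of the time has a calibration gap of 0.2. CECE applies the same idea per class and averages over classes.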

The networks chosen to test this novel loss formulation include the now classic U-Net, in addition to Attention U-Net, UNet++, and TransUNet. What is interesting here is that U-Net forms the basis of all the subsequent networks and, in spite of being nearly a decade old, is still one of the best-performing models.