ECE quantifies how much you can trust the confidence scores your model produces. It measures the gap between predicted confidence and actual accuracy: for example, a model that assigns 90% confidence to its predictions but is correct only 70% of the time is overconfident.
How to Compute the Calibration Error?
Computing the calibration error is simple. First, run your model over a set of samples and collect all the predictions. From these we need two quantities: the accuracy of the predictions and the average confidence of the predictions. The absolute difference between the two is the calibration error.
```python
import numpy as np

def compute_accuracy(predictions, targets):
    assert predictions.shape == targets.shape
    return np.mean(predictions == targets)

def compute_caliberation_error(class_confidences, gt_idxs):
    # the predicted class is the one with the highest confidence
    predicted_class_idxs = np.argmax(class_confidences, axis=1)
    acc = compute_accuracy(predicted_class_idxs, gt_idxs)
    # average confidence of the predicted classes
    conf = np.max(class_confidences, axis=1).mean()
    return np.absolute(acc - conf), acc, conf
```
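As a quick sanity check, here is the same computation carried out by hand on three toy two-class predictions (the numbers are made up for illustration, not from the experiments below):

```python
import numpy as np

# three toy predictions over two classes (hypothetical numbers)
class_confidences = np.array([[0.9, 0.1],
                              [0.8, 0.2],
                              [0.3, 0.7]])
gt_idxs = np.array([0, 1, 1])  # the second prediction is wrong

predicted = np.argmax(class_confidences, axis=1)  # [0, 0, 1]
acc = np.mean(predicted == gt_idxs)               # 2/3
conf = np.max(class_confidences, axis=1).mean()   # (0.9 + 0.8 + 0.7) / 3 = 0.8
ce = abs(acc - conf)                              # |0.667 - 0.8| ~= 0.133
print(ce)
```

The model is 80% confident on average but only 67% accurate, so the calibration error of roughly 0.13 reflects overconfidence.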
Expected Calibration Error
A classifier is said to be well calibrated if it has a low ECE. I came across Expected Calibration Error in a recent, aptly titled paper, “Your classifier is secretly an energy based model and you should treat it like one”. It defines ECE as

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where the predictions are split into $M$ equal-width confidence bins $B_m$, $n$ is the total number of samples, and $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and average confidence within bin $B_m$.
To compute the ECE of a model, we first bin the predictions by confidence, then average the per-bin calibration errors. (The implementation below takes an unweighted mean over non-empty bins; the paper's definition weights each bin by its size.)
```python
def expected_caliberation_error(class_confidences, gt_idxs, num_bins=20):
    delta = 1.0 / num_bins
    predicted_confidences = np.max(class_confidences, axis=1)
    data = []
    for l in np.arange(0.0, 1.0, delta):
        h = l + delta
        # bin the predictions by their confidence
        idxs = np.argwhere((predicted_confidences <= h) & (predicted_confidences > l)).flatten()
        if len(idxs) == 0:
            continue
        # compute the calibration error of this bin
        ce, acc, conf = compute_caliberation_error(class_confidences[idxs, :], gt_idxs[idxs])
        data.append([l, h, ce, acc, conf])
    # average the computed calibration errors
    ece = np.mean([ce for _, _, ce, _, _ in data])
    return ece, data
```
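To see the binned computation in action without a trained model, we can generate a synthetic classifier that is well calibrated by construction: whenever it reports confidence $p$, it is correct with probability $p$. Its ECE should come out close to zero. This is a self-contained sketch of the same binning loop (the data is simulated, not from the experiments below):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# synthetic two-class problem: when the model reports confidence p,
# it is correct with probability p, i.e. it is well calibrated
conf_class0 = rng.uniform(0.5, 1.0, size=n)
class_confidences = np.stack([conf_class0, 1.0 - conf_class0], axis=1)
correct = rng.random(n) < conf_class0
gt_idxs = np.where(correct, 0, 1)

# per-bin calibration error with 20 equal-width bins, as in the post
predicted_confidences = np.max(class_confidences, axis=1)
errors = []
for l in np.arange(0.0, 1.0, 0.05):
    in_bin = (predicted_confidences > l) & (predicted_confidences <= l + 0.05)
    if in_bin.sum() == 0:
        continue
    acc = np.mean(np.argmax(class_confidences[in_bin], axis=1) == gt_idxs[in_bin])
    conf = predicted_confidences[in_bin].mean()
    errors.append(abs(acc - conf))
ece = np.mean(errors)
print(ece)  # small, since confidence matches accuracy in every bin
```

Replacing `correct = rng.random(n) < conf_class0` with something like `correct = rng.random(n) < 0.6` breaks the link between confidence and accuracy and drives the ECE up.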
For testing, I fine-tuned a model with a resnet18 stem and computed predictions over the ImageWoof dataset.

Before training, the model has a very high ECE:
```python
learner.load('stage-0')
probs, targets = learner.get_preds()
class_confidences, gt_idxs = probs.numpy(), targets.numpy()
ece, data = expected_caliberation_error(class_confidences, gt_idxs)
```
After training, the ECE is considerably reduced:
```python
learner.load('stage-1')
probs, targets = learner.get_preds()
class_confidences, gt_idxs = probs.numpy(), targets.numpy()
ece, data = expected_caliberation_error(class_confidences, gt_idxs)
```