Expected Calibration Error (ECE) quantifies how much you can trust the class confidences your model outputs: it measures the gap between the confidence the model predicts and the accuracy it actually achieves.

How to compute the Calibration Error?

Computing calibration error is simple. First, run your model over a set of samples and collect its predictions. From these we need two quantities: the accuracy of the predictions, and the average confidence of the predictions.

The absolute difference between the two is the calibration error.

import numpy as np

def compute_accuracy(predictions, targets):
    # fraction of predictions that match the ground-truth labels
    assert predictions.shape == targets.shape
    return np.mean(predictions == targets)

def compute_calibration_error(class_confidences, gt_idxs):
    # the predicted class is the one with the highest confidence
    predicted_class_idxs = np.argmax(class_confidences, axis=1)
    acc = compute_accuracy(predicted_class_idxs, gt_idxs)
    # average confidence of the predicted classes
    conf = np.max(class_confidences, axis=1).mean()
    return np.abs(acc - conf), acc, conf
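
As a quick sanity check, here is a toy run (the numbers are made up purely for illustration):

# three samples, two classes; the model is right on two of the three
confs = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.3, 0.7]])
gts = np.array([0, 1, 1])
ce, acc, conf = compute_calibration_error(confs, gts)
# acc = 2/3 ≈ 0.667, average confidence = (0.9 + 0.8 + 0.7) / 3 = 0.8
# so ce = |0.667 - 0.8| ≈ 0.133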

Expected Calibration Error

A classifier is said to be well calibrated if it has a low ECE. I came across Expected Calibration Error in a recent, aptly titled paper, “Your classifier is secretly an energy based model and you should treat it like one”. It defines ECE as

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where the $n$ predictions are split into $M$ equal-width confidence bins, $B_m$ is the set of predictions falling in bin $m$, and $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and average confidence within that bin.

To compute the ECE of a model, we first bin the predictions by their confidence, then average the calibration errors of the bins.

def expected_calibration_error(class_confidences, gt_idxs, num_bins=20):
    predicted_confidences = np.max(class_confidences, axis=1)
    bin_edges = np.linspace(0.0, 1.0, num_bins + 1)
    data = []
    for l, h in zip(bin_edges[:-1], bin_edges[1:]):
        # bin the predictions: samples whose confidence falls in (l, h]
        idxs = np.argwhere((predicted_confidences > l) & (predicted_confidences <= h)).flatten()
        if len(idxs) == 0:
            continue

        # compute the calibration error of this bin
        ce, acc, conf = compute_calibration_error(class_confidences[idxs, :], gt_idxs[idxs])
        data.append([l, h, ce, acc, conf])

    # average the per-bin calibration errors
    ece = np.mean([ce for _, _, ce, _, _ in data])
    return ece, data
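
Note that this takes a plain mean over the non-empty bins, while the formula above weights each bin by the fraction of samples it contains. The weighted version is a small change; a minimal sketch (the function name is mine):

def weighted_expected_calibration_error(class_confidences, gt_idxs, num_bins=20):
    # reuse the binning above, but weight each bin's calibration
    # error by |B_m| / n, matching the paper's definition
    n = len(gt_idxs)
    _, data = expected_calibration_error(class_confidences, gt_idxs, num_bins)
    predicted_confidences = np.max(class_confidences, axis=1)
    ece = 0.0
    for l, h, ce, _, _ in data:
        bin_size = np.sum((predicted_confidences > l) & (predicted_confidences <= h))
        ece += (bin_size / n) * ce
    return ece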

For testing, I fine-tuned a model with a ResNet-18 stem and computed predictions over the ImageWoof dataset.
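
The snippets below also call a small plotting helper, plot_figure, that isn't shown here. A minimal matplotlib sketch that draws the per-bin accuracies from the returned data against the perfectly calibrated diagonal could look like this (the name and signature are my assumptions):

import matplotlib.pyplot as plt

def plot_figure(data, delta=0.05):
    # data rows are [low_edge, high_edge, ce, acc, conf], as returned
    # by expected_calibration_error
    lows = [l for l, _, _, _, _ in data]
    accs = [acc for _, _, _, acc, _ in data]
    plt.bar(lows, accs, width=delta, align='edge', edgecolor='black', label='accuracy')
    plt.plot([0, 1], [0, 1], 'r--', label='perfect calibration')
    plt.xlabel('confidence')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()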

Before training, the model has a very high ECE:

learner.load('stage-0');
probs, targets = learner.get_preds()
class_confidences, gt_idxs = probs.numpy(), targets.numpy()
ece, data = expected_calibration_error(class_confidences, gt_idxs)
plot_figure(data)

(figure: calibration plot before training)

After training, the ECE is considerably lower:

learner.load('stage-1');
probs, targets = learner.get_preds()
class_confidences, gt_idxs = probs.numpy(), targets.numpy()
ece, data = expected_calibration_error(class_confidences, gt_idxs)
plot_figure(data)

(figure: calibration plot after training)