
Towards a comparable metric for AI model interpretability — Part 2

By: Howard Yang (GovTechSG) and Jessica Foo (GovTechSG), with inputs from Dr. Matthias Reso (Meta)

This post is part 2 of a two-part series on explainable AI (XAI). Part 1 (found here) introduces XAI and provides an overview of its common methods, which is useful for readers who are new to the topic. Part 2 (this article) focuses on our experimentation with Captum to develop comparable metrics for explainability and fairness in computer vision.

As GovTechSG’s Data Science and Artificial Intelligence Division — Video Analytics team, we use AI to analyse unstructured data in the form of images and videos. Developing models with high accuracy has always been our key focus, and the advent of XAI has inspired us to develop tools to better explain how our trained models work and to better quantify their fairness.

The field of XAI is considered nascent (at the point of writing this blog) and is an active area of research. Instead of researching this topic on our own, we decided to collaborate with industry partners with the right technical domain knowledge to explore new frontiers of XAI. To that end, we are glad to work with Meta to develop comparable metrics for explainability and fairness.

We are collaborating with engineers from Meta not only to adopt Captum as a baseline toolkit, but also to co-develop feature extensions and share learnings with each other. Captum was chosen because it covers a wide range of attribution algorithms and is model agnostic. It also has native PyTorch integration, which allows for quicker experimentation, so we can be more agile and iterate rapidly towards a working prototype.

We have identified two gaps in XAI for computer vision: (1) the lack of a single standardised metric for explainability, and (2) the need for training data to be labelled along sensitive attributes in order to compute fairness metrics.

With a variety of algorithms to choose from, how do we know which one better explains the AI model's prediction? Given that the field of XAI is fairly nascent, no consensus has been reached on a specific gold standard metric for explainability. By “gold standard metric”, we mean a widely accepted and commonly used metric for evaluation. For example, accuracy and F1-score are widely used to evaluate classification models, while mAP is accepted as the industry standard for object detection models. Unfortunately, such a metric has yet to emerge in the field of model explainability.

Some requirements for a gold standard metric for XAI would include:

We propose a metric for explainability that uses attribution masks to measure how explainable a model is. The metric evaluates the goodness of a collection of attribution maps at both the instance and global level, and can then be used to judge whether the explanations generated by various attribution algorithms are consistent with each other.

To demonstrate our proposed metric, we applied it to a crocodile detection model.

Let us use the example of a crocodile detection problem to illustrate our solution. First, we train three Faster R-CNN models on the same crocodile detection dataset for different numbers of epochs [20, 50, and 74 (early stopped when validation accuracy decreased)]. This gives us proxies for under-, medium-, and optimally-trained models. Visually, all three models return similar bounding boxes when given the same set of test images.
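As a rough illustration, the sketch below shows how such a training set-up could look with torchvision's Faster R-CNN implementation; the data loader, optimiser settings, and helper names are illustrative assumptions rather than our exact training code.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_crocodile_detector(num_classes=2):
    # Start from a COCO-pretrained Faster R-CNN and swap in a new box predictor
    # (two classes: background + crocodile).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

def train_detector(model, train_loader, num_epochs, device="cuda"):
    # Standard torchvision detection training loop: in train mode the model
    # returns a dict of losses, which we sum and backpropagate.
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)
    for _ in range(num_epochs):
        for images, targets in train_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            loss = sum(model(images, targets).values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# Three checkpoints acting as proxies for under-, medium-, and optimally-trained models.
# detectors = {n: train_detector(build_crocodile_detector(), train_loader, n) for n in (20, 50, 74)}
```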

As an example, shown below are the attribution maps generated by Deconvolution overlaid on the original image.
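For readers who want to reproduce this kind of overlay, Captum ships a visualisation helper. The snippet below is a minimal sketch assuming `attribution` is an attribution tensor of shape (1, 3, H, W) already computed for the image (for example with the detection wrapper sketched further down), and `image` is the corresponding input tensor with values in [0, 1].

```python
from captum.attr import visualization as viz

# Convert (1, 3, H, W) tensors into (H, W, 3) numpy arrays as expected by the helper.
attr_np = attribution.squeeze(0).permute(1, 2, 0).detach().cpu().numpy()
img_np = image.squeeze(0).permute(1, 2, 0).detach().cpu().numpy()

# Blend the attribution heat map onto the original image.
fig, ax = viz.visualize_image_attr(
    attr_np,
    img_np,
    method="blended_heat_map",
    sign="absolute_value",
    show_colorbar=True,
)
```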

We can see that the model trained for 74 epochs, which also has the highest test mAP, appears to have the clearest and most intuitive explanation, with attributions that are faithful to the detection (e.g., the tail and limbs of the crocodile are also highlighted as crocodile features). As expected, mask coverage of the detected crocodile increases from the 20-epoch model to the 74-epoch model. From this example, the explanations for this particular test image seem intuitive, and the model's decision-making appears logically sound.

As mentioned in part 1 (found here), there exists a variety of explainability algorithms, and it would not be feasible to use all of them. We narrow down the options as follows:

As we train the model for object detection, there are situations where multiple instances of crocodiles are detected in the same image. In such cases, we calculate the attribution map based on the most probable detection and discard the other detection(s).
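The sketch below is one way to expose that "most probable detection" as a single scalar that gradient-based attribution algorithms can differentiate; the wrapper name and the fallback for images with no detections are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TopDetectionScore(nn.Module):
    """Wraps a detection model so that it returns one scalar per image:
    the confidence score of the most probable detection. Attribution
    algorithms can then attribute this scalar back to the input pixels."""

    def __init__(self, detector):
        super().__init__()
        self.detector = detector.eval()

    def forward(self, images):
        # torchvision detectors return a list of dicts with 'boxes', 'labels', 'scores'.
        outputs = self.detector(images)
        # Keep only the highest-scoring detection per image and discard the rest.
        top_scores = [
            out["scores"].max().unsqueeze(0) if out["scores"].numel() > 0
            else torch.zeros(1, device=images.device)
            for out in outputs
        ]
        return torch.cat(top_scores)  # shape: (batch,)
```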

After filtering out the algorithms available in Captum that are non-deterministic or have long runtimes, we are left with five: Integrated Gradients, Saliency, Input X Gradient, Deconvolution, and Guided Backpropagation.

We use Captum to generate an attribution map for the same crocodile input image with each of the shortlisted algorithms, resulting in five attribution masks. You can see from the following images that there are visible differences across the masks. With these five masks in hand, we then aggregate them into a single result to obtain a meaningful scalar quantifying model explainability; a sketch of how the masks can be generated is shown below.
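In the sketch, `TopDetectionScore` is the hypothetical wrapper from the earlier sketch and `image` is a (1, 3, H, W) input tensor; both are assumptions for illustration.

```python
from captum.attr import (
    IntegratedGradients, Saliency, InputXGradient, Deconvolution, GuidedBackprop,
)

wrapped = TopDetectionScore(detector)  # scalar output per image (see earlier sketch)

algorithms = {
    "Integrated Gradients": IntegratedGradients(wrapped),
    "Saliency": Saliency(wrapped),
    "Input X Gradient": InputXGradient(wrapped),
    "Deconvolution": Deconvolution(wrapped),
    "Guided Backpropagation": GuidedBackprop(wrapped),
}

# Each call returns an attribution tensor with the same shape as the input image.
attribution_masks = {
    name: algo.attribute(image.requires_grad_()) for name, algo in algorithms.items()
}
```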

First, we calculate the pixel-wise cosine similarity (formula described below) between each pair of attribution maps, obtaining the following matrix.
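The original figure with the formula is not reproduced here; the standard pixel-wise cosine similarity between two attribution maps A and B, which is what the text describes, can be written as:

\[
\mathrm{sim}(A, B) = \frac{\sum_{i} a_i \, b_i}{\sqrt{\sum_{i} a_i^{2}} \; \sqrt{\sum_{i} b_i^{2}}}
\]

where \(a_i\) and \(b_i\) are the attribution values of pixel \(i\) in maps A and B respectively.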

We then average the cosine similarities to get the instance-level goodness score.
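A sketch of both steps, assuming `attribution_masks` is the dictionary of five attribution tensors from the earlier sketch; averaging over the distinct off-diagonal pairs is an assumption about the implementation detail.

```python
import itertools
import torch
import torch.nn.functional as F

def instance_goodness(attribution_masks):
    """Pairwise cosine similarity between flattened attribution maps,
    averaged into a single instance-level goodness score."""
    names = list(attribution_masks)
    flat = {name: attribution_masks[name].flatten() for name in names}

    # Symmetric pairwise cosine similarity matrix (diagonal = 1).
    sim = torch.eye(len(names))
    for (i, a), (j, b) in itertools.combinations(enumerate(names), 2):
        s = F.cosine_similarity(flat[a].unsqueeze(0), flat[b].unsqueeze(0)).item()
        sim[i, j] = sim[j, i] = s

    # Average over the distinct pairs (upper triangle, excluding the diagonal).
    iu = torch.triu_indices(len(names), len(names), offset=1)
    score = sim[iu[0], iu[1]].mean().item()
    return sim, score

# sim_matrix, goodness = instance_goodness(attribution_masks)
```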

Since the pixel-wise attribution values are around 70% similar across the different explainability methods for both the 20-epoch and the 74-epoch (early stopped) models, we can conclude that the attribution maps are largely robust and consistent across algorithms. A user can get a good sense of the model's decision-making process by looking at just one attribution mask for a test image, and can be more confident in the explanation with an ensemble of attribution masks.

A whole gamut of fairness approaches and metrics exists to measure different biases for different purposes. For example, parity measures (e.g., FPR, FNR) are often used to quantify the parity of statistical metrics across groups with different sensitive attributes. Causality-based approaches are also often used to establish causal relationships between variables to ascertain counterfactual fairness.
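As a concrete illustration of a parity measure, the sketch below computes the false positive rate gap between two groups split on a sensitive attribute; the function names and the choice of the absolute FPR difference as the parity statistic are illustrative assumptions.

```python
import numpy as np

def false_positive_rate(y_true, y_pred):
    # FPR = FP / (FP + TN) on binary ground truth labels and predictions.
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / (fp + tn) if (fp + tn) > 0 else 0.0

def fpr_parity_gap(y_true, y_pred, group):
    # Absolute difference in FPR between two groups defined by a sensitive attribute.
    # Note that this requires the sensitive attribute (`group`) to be labelled,
    # which is exactly the limitation discussed in the next paragraph.
    return abs(
        false_positive_rate(y_true[group == 0], y_pred[group == 0])
        - false_positive_rate(y_true[group == 1], y_pred[group == 1])
    )
```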

Unfortunately, these metrics require ground truth labelling of the sensitive attributes, which may not be readily available for the training dataset. Moreover, labelling data after collection is subject to inherent biases and may introduce inaccuracies.

Testing with Concept Activation Vectors (TCAV) is a concept-based approach for determining the sensitivity of a model to sensitive attributes without requiring the training data to be labelled along those attributes.

Upon obtaining a trained model, a separate concept dataset is used to calculate the Concept Activation Vectors (CAVs) for each concept in order to quantify fairness. We therefore work with two datasets: (1) the original dataset used to train the computer vision model, which is not annotated with sensitive attributes, and (2) the concept dataset, whose class labels represent sensitive attributes and which can be sourced from the Internet or re-used from another problem. This allows us to calculate the fairness metric with little additional labelling effort.

For this proposed metric, we demonstrate how it can be applied with personal mobility devices (PMDs) detection.

Given a trained object detection model to detect personal mobility devices (PMDs), we can ascertain whether the detection of PMDs is biased toward certain food delivery bags, without the explicit annotation of these bags on test images.

First, we source examples of concept images that represent the attributes of interest (e.g., ‘GrabFood’ (GF), ‘FoodPanda’ (FP), ‘Deliveroo’ (DL)).

Using a Faster R-CNN model as an example, we then learn relative CAVs for the convolutional layers of the last residual block in the ResNet-50 backbone. Each relative CAV is obtained by training a linear classifier to distinguish between the model's layer activations produced by examples from different pairs of concepts.

The relative CAV is then multiplied by the layer’s attributions (derived from any attribution algorithm such as Integrated Gradients) to measure the relative conceptual sensitivity of the layer’s activations to the class prediction, i.e., relative TCAV.

Lastly, we aggregate the relative TCAVs across the aforementioned convolutional layers to construct TCAV(A-B) and determine model fairness between a pair of concepts (A and B). A score of 0.5 indicates perfect fairness between the pair; a score greater than 0.5 indicates bias towards concept A, while a score less than 0.5 indicates bias against concept A, relative to concept B.
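A sketch of how this pipeline can be assembled with Captum's concept module; the concept folder paths, the specific layer names, the use of `LayerIntegratedGradients`, and the hypothetical `wrapped` scalar-output detector (analogous to the earlier crocodile wrapper, here wrapping the PMD detector) are illustrative assumptions.

```python
import glob
import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.io import read_image
from captum.attr import LayerIntegratedGradients
from captum.concept import TCAV, Concept

def make_concept(concept_id, name, folder):
    # Build a Concept from a folder of example images for one delivery-bag brand.
    tf = transforms.Compose([transforms.Resize((224, 224)),
                             transforms.ConvertImageDtype(torch.float)])
    images = [tf(read_image(path)) for path in glob.glob(f"{folder}/*.jpg")]
    return Concept(id=concept_id, name=name, data_iter=DataLoader(images, batch_size=8))

grabfood  = make_concept(0, "GrabFood",  "concepts/grabfood")
foodpanda = make_concept(1, "FoodPanda", "concepts/foodpanda")
deliveroo = make_concept(2, "Deliveroo", "concepts/deliveroo")

# Convolutional layers of the last residual block of the backbone
# (names assume a torchvision ResNet-50 FPN backbone inside the wrapper).
layers = [
    "detector.backbone.body.layer4.0.conv1",
    "detector.backbone.body.layer4.1.conv1",
    "detector.backbone.body.layer4.2.conv1",
]

tcav = TCAV(model=wrapped, layers=layers,
            layer_attr_method=LayerIntegratedGradients(wrapped, None))

# Pairwise (relative) TCAV scores: each experimental set contrasts two concepts.
scores = tcav.interpret(
    inputs=pmd_test_images,  # batch of test images containing PMDs
    experimental_sets=[[grabfood, foodpanda], [grabfood, deliveroo], [foodpanda, deliveroo]],
)
```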

From the results we obtained, all pairwise scores across the delivery bag types are relatively close to 0.5. While there is a slight lean towards concept DL when compared to concept FP, the deviation from 0.5 is not significant. As such, we conclude that the detection of PMDs by the Faster R-CNN model is largely not biased towards any particular delivery bag type.
