Vision transformers (ViTs) are now the go-to architecture for vision-based foundation models, but they can be challenging to interpret and can exhibit unexpected behaviors. Gandelsman et al. showed how to interpret CLIP-ViT components using text, but how do we interpret arbitrary ViTs, which may have different architectures (SWIN, MaxViT, DINO, DINOv2) and be trained with different pretraining objectives (ImageNet classification, self-supervised learning)?
We introduce a three-step procedure to solve this problem.
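Roughly, the idea is to decompose a ViT's final image representation into contributions from its components and interpret those contributions via text by aligning them with CLIP's joint space. Below is a minimal sketch of the decomposition step only, assuming a timm DINO ViT-B/16 and OpenAI's `clip` package; the linear map `W` into CLIP space is a random placeholder (in practice it would be learned), and the text prompts are purely illustrative.

```python
# Minimal sketch (not the paper's exact pipeline): decompose a ViT's CLS
# representation into per-block residual contributions, then score each
# contribution against text via a placeholder linear map into CLIP space.
import torch
import timm
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
vit = timm.create_model("vit_base_patch16_224.dino", pretrained=True).eval().to(device)
clip_model, _ = clip.load("ViT-B/32", device=device)

# 1) Capture the residual stream entering and leaving every transformer block.
block_inputs, block_outputs = [], []
def hook(module, inputs, output):
    block_inputs.append(inputs[0].detach())
    block_outputs.append(output.detach())
handles = [blk.register_forward_hook(hook) for blk in vit.blocks]

image = torch.randn(1, 3, 224, 224, device=device)  # stand-in for a preprocessed image
with torch.no_grad():
    _ = vit(image)
for h in handles:
    h.remove()

# 2) Each block's contribution to the CLS token is residual_out - residual_in.
cls_contribs = torch.stack(
    [(out[:, 0] - inp[:, 0]) for inp, out in zip(block_inputs, block_outputs)]
).squeeze(1)  # (num_blocks, dim)

# 3) Map contributions into CLIP space with a linear map W (random placeholder
#    here; the real map is learned) and compare them to text embeddings.
W = torch.randn(cls_contribs.shape[-1], clip_model.visual.output_dim, device=device)
texts = clip.tokenize(["a bird", "a forest background", "water"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(texts).float()
scores = torch.nn.functional.normalize(cls_contribs @ W, dim=-1) @ \
         torch.nn.functional.normalize(text_emb, dim=-1).T
print(scores.shape)  # (num_blocks, num_texts): text relevance per block
```

Because every transformer block adds its output to the residual stream, the final CLS token is exactly the embedding term plus the sum of these per-block differences, which is what makes a decomposition of this kind exact; finer-grained splits (per attention head, per token) follow the same principle.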
| Model name | Worst-group accuracy | Average-group accuracy |
|---|---|---|
| DeiT | 0.733 → 0.815 | 0.874 → 0.913 |
| CLIP | 0.507 → 0.744 | 0.727 → 0.790 |
| DINO | 0.800 → 0.911 | 0.900 → 0.938 |
| DINOv2 | 0.967 → 0.978 | 0.983 → 0.986 |
| SWIN | 0.834 → 0.871 | 0.927 → 0.944 |
| MaxViT | 0.777 → 0.814 | 0.875 → 0.887 |

Worst-group accuracy and average-group accuracy on the Waterbirds dataset, before → after intervention.
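The "intervention" above edits the representation to suppress components tied to spurious features (for Waterbirds, the background). The sketch below is hypothetical: the component indices, the frozen downstream classifier, and the use of mean-ablation (replacing a contribution with its dataset average) are illustrative choices, not a verbatim reproduction of the paper's procedure.

```python
# Hypothetical sketch: remove the contribution of components flagged as
# spurious from the final CLS representation before the downstream classifier.
import torch

def ablate_spurious(cls_contribs, base_cls, spurious_idx, mean_contribs):
    """cls_contribs: (num_components, dim) per-component contributions,
    base_cls: (dim,) embedding-layer term, spurious_idx: indices of components
    tied to spurious text, mean_contribs: (num_components, dim) dataset means."""
    edited = cls_contribs.clone()
    # Mean-ablation: replace spurious components by their average contribution.
    edited[spurious_idx] = mean_contribs[spurious_idx]
    return base_cls + edited.sum(dim=0)  # re-assembled CLS representation

# Dummy example: 12 components of dimension 768, components 3 and 7 ablated.
num_blocks, dim = 12, 768
rep = ablate_spurious(torch.randn(num_blocks, dim), torch.randn(dim),
                      spurious_idx=[3, 7],
                      mean_contribs=torch.zeros(num_blocks, dim))
print(rep.shape)  # torch.Size([768]); fed to the frozen classifier / probe
```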
| Algorithm | DeiT pixAcc | DeiT mIoU | DeiT mAP | DINO pixAcc | DINO mIoU | DINO mAP | MaxViT pixAcc | MaxViT mIoU | MaxViT mAP | SWIN pixAcc | SWIN mIoU | SWIN mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Chefer et al. | 0.7307 | 0.4785 | 0.7870 | 0.7309 | 0.4541 | 0.8080 | - | - | - | - | - | - |
| GradCAM | 0.6533 | 0.4625 | 0.7129 | 0.7045 | 0.4309 | 0.7481 | 0.4732 | 0.1705 | 0.4243 | 0.5973 | 0.2360 | 0.5365 |
| Decompose | 0.7719 | 0.5291 | 0.8305 | 0.7577 | 0.4863 | 0.8111 | 0.7163 | 0.4237 | 0.7237 | 0.7136 | 0.4338 | 0.7620 |
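Pixel accuracy and mIoU here score a binary foreground mask obtained from a relevance heatmap (mAP uses the raw heatmap as a soft score). The snippet below is a generic illustration of that evaluation, assuming a patch-level heatmap thresholded at its mean; it is not the paper's evaluation code.

```python
# Generic sketch: turn a per-patch relevance heatmap into a foreground mask
# and score it with pixel accuracy and IoU against a ground-truth mask.
import torch
import torch.nn.functional as F

def evaluate_heatmap(heatmap, gt_mask):
    """heatmap: (H_p, W_p) patch-level relevance; gt_mask: (H, W) binary mask."""
    H, W = gt_mask.shape
    up = F.interpolate(heatmap[None, None], size=(H, W), mode="bilinear",
                       align_corners=False)[0, 0]
    pred = (up > up.mean()).float()              # binarize at the mean relevance
    pix_acc = (pred == gt_mask).float().mean()
    inter = (pred * gt_mask).sum()
    union = ((pred + gt_mask) > 0).float().sum()
    iou = inter / union.clamp(min=1)
    return pix_acc.item(), iou.item()

# Dummy example: 14x14 patch heatmap vs. a 224x224 ground-truth mask.
acc, iou = evaluate_heatmap(torch.rand(14, 14), (torch.rand(224, 224) > 0.5).float())
```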
Read the paper for more details!
@inproceedings{
balasubramanian2024decomposing,
title={Decomposing and Interpreting Image Representations via Text in ViTs Beyond {CLIP}},
author={Sriram Balasubramanian and Samyadeep Basu and Soheil Feizi},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=Vhh7ONtfvV}
}