College of Computer Science and Software Engineering, SZU

C2AM: Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation

IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2022)

Jinheng Xie1,3,4    Jianfeng Xiang1,3,4    Junliang Chen1,3,4    Xianxu Hou1,3,4    Xiaodong Zhao1,3,4    Linlin Shen1,2,3,4*

1Shenzhen University    2Wenzhou-Kean University

3Shenzhen Institute of Artificial Intelligence and Robotics for Society

4Guangdong Key Laboratory of Intelligent Information Processing

Figure 1: Feature manifold of foreground objects (blue) and backgrounds (green). As the semantic information of foreground objects differs from that of backgrounds, the representations of foreground objects (blue) lie far from those of backgrounds (green) in the feature space. Foreground objects with similar appearance, or backgrounds with similar color/texture, also have similar representations. Based on these observations, positive and negative pairs can be formed for contrastive learning.

Abstract

While the class activation map (CAM) generated by an image classification network has been widely used for weakly supervised object localization (WSOL) and semantic segmentation (WSSS), such classifiers usually focus on discriminative object regions. In this paper, we propose Contrastive learning for Class-agnostic Activation Map (C$^2$AM) generation using only unlabeled image data, without the involvement of image-level supervision. The core idea comes from two observations: i) the semantic information of foreground objects usually differs from that of their backgrounds; ii) foreground objects with similar appearance, or backgrounds with similar color/texture, have similar representations in the feature space. We form positive and negative pairs based on these relations and force the network to disentangle foreground and background with a class-agnostic activation map using a novel contrastive loss. As the network is guided to discriminate foreground from background across images, the class-agnostic activation maps learned by our approach cover more complete object regions. From C$^2$AM, we extract class-agnostic object bounding boxes for object localization and background cues to refine the CAM generated by a classification network for semantic segmentation. Extensive experiments on the CUB-200-2011, ImageNet-1K, and PASCAL VOC2012 datasets show that both WSOL and WSSS can benefit from the proposed C$^2$AM.

Figure 2: Difference between (a) the class activation map (CAM) and (b) the class-agnostic activation map. CAM consists of $K$ (number of classes) activation maps, while C$^2$AM predicts only one class-agnostic activation map per image, which directly indicates the foreground and background regions. Best viewed in color.

Figure 3: The overall network architecture of the proposed method. The encoder network $h(\cdot)$ maps image $\mathbf{X}_i$ to the feature map $\mathbf{Z}_i$. In the disentangler, the activation head $\varphi(\cdot)$ produces the class-agnostic activation map $\mathbf{P}_i$. Suppose $\mathbf{P}_i$ activates the foreground regions; the background activation map can then be derived as $(1-\mathbf{P}_i)$. Based on the foreground and background activation maps, $\mathbf{Z}_i$ can be disentangled into the foreground and background feature representations, i.e., $\mathbf{v}^f_i$ and $\mathbf{v}^b_i$. In evaluation, only the trained $h(\cdot)$ and $\varphi(\cdot)$ are used to generate the class-agnostic activation map $\mathbf{P}_i$. Flatt.: matrix flattening; Trans.: matrix transpose; $\otimes$: matrix multiplication.
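The disentangling step in Figure 3 can be sketched in a few lines of numpy. This is a minimal illustration under assumed shapes and weighted-average pooling; function names and the normalization are ours, not taken from the paper's released code:

```python
import numpy as np

def disentangle(Z, P):
    """Split a feature map into foreground/background representations.

    Z: (C, H, W) feature map from the encoder h(.)
    P: (H, W) class-agnostic activation map in [0, 1] from phi(.)
    Returns (v_f, v_b), each of shape (C,).
    """
    C, H, W = Z.shape
    Z_flat = Z.reshape(C, H * W)   # Flatt.: (C, HW)
    p = P.reshape(1, H * W)        # foreground activation, flattened
    q = 1.0 - p                    # background activation
    # Trans. + matrix multiplication: activation-weighted pooling
    v_f = (Z_flat @ p.T).squeeze(1) / (p.sum() + 1e-8)
    v_b = (Z_flat @ q.T).squeeze(1) / (q.sum() + 1e-8)
    return v_f, v_b

# toy example with random values
Z = np.random.rand(8, 4, 4)
P = np.random.rand(4, 4)
v_f, v_b = disentangle(Z, P)
print(v_f.shape, v_b.shape)  # (8,) (8,)
```

The matrix multiplication $\mathbf{Z}_i \otimes \mathbf{P}_i^\top$ corresponds to the $\otimes$ operator in the figure; the division simply normalizes by total activation mass.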

Figure 4: Illustration of cross-image foreground-background contrastive learning. Each image representation, i.e., $\mathbf{Z}_i$, is disentangled into the foreground and background representations, i.e., $\mathbf{v}_i^f$ and $\mathbf{v}_i^b$. Two foreground or two background representations are coupled into one positive pair, while a negative pair is formed with one foreground and one background representation. Contrastive learning is applied to pull close the representations from the positive pair and push apart the representations from the negative pair.
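The pairing rule in Figure 4 can be sketched as follows. Note this is a simplified stand-in: it scores pairs with a sigmoid of cosine similarity and averages uniformly, whereas the paper's actual contrastive loss is more elaborate, so treat the exact loss form here as an assumption:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def contrastive_loss(v_f, v_b):
    """Pull together fg-fg and bg-bg pairs; push apart fg-bg pairs.

    v_f, v_b: lists of foreground/background vectors from a batch.
    """
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos, neg = [], []
    n = len(v_f)
    for i in range(n):
        for j in range(i + 1, n):
            pos.append(-np.log(sig(cos_sim(v_f[i], v_f[j])) + 1e-8))  # fg-fg positive
            pos.append(-np.log(sig(cos_sim(v_b[i], v_b[j])) + 1e-8))  # bg-bg positive
    for i in range(n):
        for j in range(n):
            neg.append(-np.log(1.0 - sig(cos_sim(v_f[i], v_b[j])) + 1e-8))  # fg-bg negative
    return np.mean(pos) + np.mean(neg)

# toy batch: two similar foregrounds, two similar backgrounds
v_f = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
v_b = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]
loss = contrastive_loss(v_f, v_b)
```

Minimizing this loss increases fg-fg and bg-bg similarities while decreasing fg-bg similarity, which is what drives the activation head to separate foreground from background.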

Figure 5: Refinement of the initial CAM using background cues.

Figure 6: Visual comparison between CAM and the class-agnostic activation maps generated by C$^2$AM.

Figure 7: Illustration of CAM refinement using background cues extracted from C$^2$AM. First column: initial CAM. Second column: background cues extracted from C$^2$AM (255: background, 0: foreground). Third column: CAM refined using background cues. Last column: ground-truth masks.
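A much-simplified sketch of the refinement illustrated in Figures 5 and 7: background cues from C$^2$AM suppress CAM activations on confident background pixels. The hard-masking rule and the threshold are our assumptions for illustration; the paper's refinement procedure may differ:

```python
import numpy as np

def refine_cam(cams, bg_map, bg_thresh=0.5):
    """Suppress CAM activations on confident background pixels.

    cams:    (K, H, W) initial class activation maps.
    bg_map:  (H, W) background activation in [0, 1],
             e.g. 1 - P from the C2AM output.
    Returns refined maps of the same shape.
    """
    # 1 where the pixel is likely foreground, 0 where background
    fg_mask = (bg_map < bg_thresh).astype(cams.dtype)
    return cams * fg_mask[None, :, :]

# toy example: 2 classes on a 2x2 image
cams = np.ones((2, 2, 2))
bg_map = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
refined = refine_cam(cams, bg_map)
```

In this toy case the top-left and bottom-right pixels are zeroed out because the background cue there exceeds the threshold, while the other pixels keep their original activation.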

Figure 8: Sensitivity analysis of hyper-parameter $\alpha$.

Bibtex

@inproceedings{ccam,

title={C$^2$AM: Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation},

author={Xie, Jinheng and Xiang, Jianfeng and Chen, Junliang and Hou, Xianxu and Zhao, Xiaodong and Shen, Linlin},

booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},

year={2022}

}