Class-Weighted Convolutional Features for Visual Instance Search (BMVC2017)
Updated: Nov 21, 2018
This is the third part of the blog series on visual image retrieval. In this post I will explain the work carried out during my master's thesis and the paper published at the BMVC 2017 conference under the name Class-Weighted Convolutional Features for Visual Instance Search.
All the models and instructions to run the code can be found on the GitHub repository.
We recall from the introduction that the process of image retrieval begins with the exploration of a dataset and the encoding of its images into compact representations. These representations are later compared by means of a distance metric and ranked following the obtained score in order of similarity. We will focus on how to compute these representations.
Image retrieval in realistic scenarios targets large dynamic datasets of unlabeled images. In these cases, training or fine-tuning a model every time new images are added to the database is neither efficient nor scalable. Convolutional Neural Networks trained for image classification over large datasets with many different classes, like ImageNet, have proven to be effective feature extractors when transferred to the task of image retrieval. The most successful approaches are based on encoding the activations of convolutional layers, as these convey the spatial information of the image. In the next image the related works are shown:
With the exception of CroW, none of them use the image contents to weight the features; they base their methods on heuristics or randomness.
The main aim of our work is encoding images into compact representations, taking into account the semantics of the scene by using only the knowledge contained in a CNN. To achieve that, we use convolutional features weighted by a soft attention model over the semantic classes detected in the image. Our main contribution is exploiting the transferability of the information encoded in a CNN, not only in its features, but also in its ability to focus the attention on the most representative regions of the image. For this goal, we explore the potential of Class Activation Maps (CAMs) to generate semantic-aware weights for convolutional features extracted from the deeper layers of a network.
Our premise is that every image or scene can be described by the objects that appear inside:
CAMs were proposed as a method to estimate the pixels of the image that were most attended by the CNN when predicting each semantic class. The computation of CAMs is straightforward in most state-of-the-art CNN architectures for image classification: the last fully-connected layers are replaced with a Global Average Pooling (GAP) layer and a linear layer (classifier).
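To make the GAP-plus-classifier idea concrete, here is a minimal NumPy sketch. It assumes the features of the last convolutional layer are available as a (K, H, W) array and the classifier weights as a (num_classes, K) matrix; the function and variable names are hypothetical and not taken from the repository.

```python
import numpy as np

def class_activation_map(features, classifier_weights, class_idx):
    """CAM for one class: weighted sum of the K feature maps.

    features: (K, H, W) activations of the last convolutional layer.
    classifier_weights: (num_classes, K) weights of the linear layer after GAP.
    """
    # Contract the channel axis: sum_k w[c, k] * features[k, :, :]
    return np.tensordot(classifier_weights[class_idx], features, axes=1)

rng = np.random.default_rng(0)
features = rng.random((8, 4, 5))          # toy sizes: K=8, H=4, W=5
weights = rng.standard_normal((10, 8))    # toy classifier with 10 classes
cam = class_activation_map(features, weights, class_idx=3)
```

Because GAP and the classifier are both linear, the spatial mean of the CAM equals the class score (ignoring the bias), which is why the map can be read as a per-location contribution to the prediction.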
In our particular case, we use networks trained on ImageNet, so we have up to 1000 classes to choose from. In the next slide I show our image encoding pipeline:
It can be divided into three steps:
Features and CAMs Extraction
Each image is forwarded through the CNN to compute, in a single pass, the convolutional features of a selected layer and the CAMs. The selected convolutional layer has K feature maps (χ) of width W and height H. Every CAM highlights the class-specific discriminative regions attended by the network to make its predictions. CAMs are normalized to fall in the range [0, 1] and, if their dimensions do not match those of the selected convolutional feature maps, they are resized to match.
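The normalization and resizing can be sketched as follows. I am assuming min-max scaling for the [0, 1] normalization and using nearest-neighbour resizing as a simple stand-in for whatever interpolation a real pipeline would use; both helper names are hypothetical.

```python
import numpy as np

def normalize_cam(cam, eps=1e-12):
    """Min-max scale a CAM to [0, 1] (assumed normalization)."""
    shifted = cam - cam.min()
    return shifted / (shifted.max() + eps)

def resize_nearest(cam, out_h, out_w):
    """Nearest-neighbour resize to (out_h, out_w); a stand-in for interpolation."""
    rows = np.arange(out_h) * cam.shape[0] // out_h
    cols = np.arange(out_w) * cam.shape[1] // out_w
    return cam[np.ix_(rows, cols)]

cam = np.array([[-2.0, 0.0], [2.0, 6.0]])
cam01 = normalize_cam(cam)                 # values now lie in [0, 1]
cam_resized = resize_nearest(cam01, 4, 4)  # match a 4x4 feature map
```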
Feature Weighting and Pooling
Once the convolutional features and the CAMs have been extracted, the next step is weighting the features and pooling them to obtain a compact representation. For a given class c, we weight its features spatially, multiplying element-wise by the corresponding normalized CAM. Afterwards, each convolutional feature map is reduced to a single value by sum-pooling, so that the dimension of the feature vector corresponds to the number of convolutional filters in the selected layer. We choose sum-pooling instead of max-pooling because we want to cover the full extent of the objects rather than only their most discriminative part. Furthermore, sum-pooling aggregation benefits more from the later application of PCA and whitening, as we observed experimentally and as also noted in [CroW, SPoC]. Finally, we include the channel weighting proposed in CroW to reduce channel redundancies and increase the contribution of rare features.
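A minimal sketch of this step for a single class is shown below. The channel weighting follows my reading of the sparsity-based scheme in CroW (channels that fire rarely get a larger weight); the function names and the small constant eps are assumptions made for illustration.

```python
import numpy as np

def channel_weights(features, eps=1e-6):
    """Sparsity-based channel weights in the spirit of CroW.

    Channels with few non-zero activations receive larger weights.
    """
    k = features.shape[0]
    # q[i]: fraction of spatial positions where channel i is non-zero
    q = (features > 0).reshape(k, -1).mean(axis=1)
    return np.log((k * eps + q.sum()) / (eps + q))

def class_vector(features, cam, eps=1e-6):
    """Weight (K, H, W) features by an (H, W) CAM and sum-pool to a K-vector."""
    weighted = features * cam[None, :, :]   # spatial weighting, broadcast over channels
    pooled = weighted.sum(axis=(1, 2))      # sum-pooling: one value per channel
    return channel_weights(features, eps) * pooled

features = np.zeros((2, 2, 2))
features[0] = 1.0            # dense channel: active everywhere
features[1, 0, 0] = 2.0      # sparse channel: active at one location
vector = class_vector(features, cam=np.ones((2, 2)))
```

Running this once per selected class yields the NC class vectors that are aggregated in the next step.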
The final step is building a descriptor DI for each image I by aggregating NC class vectors. We perform l2 normalization, PCA-whitening and l2 normalization once more as in [CroW, R-MAC]. Then we combine the NC class vectors into a single one by summing them and l2-normalizing the result.
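The aggregation chain (l2, PCA-whitening, l2, sum, l2) can be sketched like this. The PCA fit via SVD and all helper names are my own illustrative choices, assuming the PCA parameters are estimated beforehand on l2-normalized class vectors from a separate dataset.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x (or a single vector) to unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fit_pca_whitening(samples, dim):
    """Estimate mean, projection and per-component scale from (N, K) samples."""
    mean = samples.mean(axis=0)
    _, s, vt = np.linalg.svd(samples - mean, full_matrices=False)
    std = s[:dim] / np.sqrt(len(samples) - 1)   # std. dev. along each component
    return mean, vt[:dim], std

def build_descriptor(class_vectors, mean, components, std, eps=1e-12):
    """Aggregate (N_C, K) class vectors into one unit-norm descriptor."""
    x = l2_normalize(class_vectors)
    x = (x - mean) @ components.T / (std + eps)  # PCA projection + whitening
    x = l2_normalize(x)
    return l2_normalize(x.sum(axis=0))           # sum over classes, normalize again

rng = np.random.default_rng(0)
train = l2_normalize(rng.random((200, 16)))      # toy class vectors from another dataset
mean, components, std = fit_pca_whitening(train, dim=8)
descriptor = build_descriptor(rng.random((3, 16)), mean, components, std)
```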
All the classes that we have available to aggregate are given by a pre-trained CNN. As we are transferring the learning into other datasets, we have to define a policy to select which classes are the most relevant. We define two basic approaches depending on the moment when we build the descriptors for the dataset:
Online Aggregation (OnA). The top NC predicted classes of the query image are obtained at search time (online) and the same set of classes is used to aggregate the features of each image in the dataset. This strategy generates descriptors that adapt to the query, but it has two important drawbacks that prevent it from scaling. First, it requires extracting and storing the CAMs for all classes for every image in the target dataset, with the corresponding cost in computation and storage. Second, the aggregation of weighted feature maps must also be computed at query time, which slows down the retrieval process.
Offline Aggregation (OfA). The considered top NC semantic classes can also be predicted individually for each image in the dataset at indexing time. This task is performed offline and no intermediate information needs to be stored, just the final descriptor, which makes the system more scalable than the online approach.
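The difference between the two policies comes down to whose class predictions drive the encoding of a dataset image. A small sketch, with hypothetical function names and toy score vectors:

```python
import numpy as np

def top_classes(class_scores, n_c):
    """Indices of the n_c highest-scoring classes, best first."""
    return np.argsort(class_scores)[::-1][:n_c]

def select_classes(query_scores, image_scores, n_c, online):
    """Classes used to encode one dataset image.

    online=True  (OnA): reuse the query's top classes for every dataset image.
    online=False (OfA): use the image's own top classes, known at indexing time.
    """
    return top_classes(query_scores if online else image_scores, n_c)

query_scores = np.array([0.1, 0.7, 0.2])
image_scores = np.array([0.6, 0.1, 0.3])
ona = select_classes(query_scores, image_scores, n_c=2, online=True)
ofa = select_classes(query_scores, image_scores, n_c=2, online=False)
```

With OnA the selected classes change with every query, so the dataset descriptors cannot be precomputed; with OfA they depend only on the image itself.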
We present experiments on Oxford5k Buildings and Paris6k Buildings. Both datasets contain 55 query images to perform the search, each annotated with a region of interest. To test instance-level retrieval in a larger-scale scenario, we also consider the Oxford105k and Paris106k datasets, which extend Oxford5k and Paris6k with 100k distractor images. All the images in the datasets have a maximum size of 1024, so we keep it and resize the minimum dimension to 720, maintaining the aspect ratio. We follow the evaluation protocol using the features from the annotated region of interest of each query. We compute the PCA parameters with Paris descriptors when we test on Oxford, and vice versa. We choose cosine similarity to compute the scores, as this operation can be computed quickly and efficiently on GPUs. The final ranked list is generated by ordering these scores. The evaluation metric for all the experiments is the mean Average Precision (mAP), as adopted in most related works using these datasets.
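As a rough illustration of the scoring and evaluation, the sketch below ranks a toy database by cosine similarity and computes a plain average precision for one query. It does not reproduce the official Oxford/Paris protocol (for example, its handling of "junk" images), and the function names are hypothetical.

```python
import numpy as np

def rank_by_cosine(query, database):
    """Return database indices sorted by decreasing cosine similarity, plus the scores."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q
    return np.argsort(-scores), scores

def average_precision(ranking, relevant):
    """Plain AP: mean of precision@k over the ranks k where a relevant item appears."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, idx in enumerate(ranking, start=1):
        if int(idx) in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

database = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.1])
ranking, scores = rank_by_cosine(query, database)
ap = average_precision(ranking, relevant=[0, 1])
```

The mAP is then the mean of these AP values over all queries.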
We explored the use of CAMs in different network architectures such as DenseNet161, ResNet50, the widely used VGG-16, and DecomposeMe, which is a compact network based on 1D convolutions. Here are some qualitative results with the CAMs generated by these networks.
We can observe that VGG-16 tends to focus more on the discriminative part of the objects rather than on their global shape, while being less spread around the image, a desirable property for our retrieval system.
This discriminative focus on the most relevant areas is beneficial, as the VGG-16-CAM model is the one that performs best. For all the following experiments we'll use that model, as it is also the one used by all the related works (without the CAM modification).
In the next slides we show that introducing the CAM spatial weighting is beneficial and provides a large improvement over the baseline results.
The Online (OnA) and Offline (OfA) Aggregations are compared in terms of mAP as a function of the number of top classes NC and the number of classes Npca used to compute the PCA.
The computational burden grows as we increase the number of CAMs per image.
This is how we compare with the state-of-the-art:
To improve the performance of the offline aggregation without the practical limitations of aggregating online, we suggest restricting the total number of classes used to the most probable classes of the dataset’s theme. As we have two similar building datasets, Oxford and Paris, we compute the most representative classes of the 55 Paris queries and use that predefined list of classes ordered by probability of appearance to obtain the image representations in Oxford.
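One way to build such a predefined list is sketched below. I am assuming that "probability of appearance" can be approximated by how often a class shows up among the top predictions of the queries; the actual criterion used in the experiments may differ, and the function name is hypothetical.

```python
import numpy as np
from collections import Counter

def predefined_class_list(query_scores, top_n, list_size):
    """Rank classes by how often they appear in the top_n predictions of the queries."""
    counts = Counter()
    for scores in query_scores:                 # one score vector per query image
        counts.update(np.argsort(scores)[::-1][:top_n].tolist())
    return [cls for cls, _ in counts.most_common(list_size)]

# Toy example: 3 queries, 4 classes
query_scores = np.array([
    [0.90, 0.06, 0.04, 0.00],
    [0.60, 0.30, 0.10, 0.00],
    [0.10, 0.20, 0.70, 0.00],
])
class_list = predefined_class_list(query_scores, top_n=2, list_size=2)
```

The resulting list is computed once on one dataset (Paris queries in our case) and then reused to encode every image of the other dataset offline.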
A common approach in image retrieval systems is to apply post-processing steps that refine a first fast search to improve the system performance. We have explored query expansion and re-ranking. We use the most probable classes predicted from the query to generate the regions of interest in the target images and, with this information, perform a spatial re-ranking after the first search. To obtain these regions, we first heuristically define a set of thresholds based on the normalized intensity of the CAM heatmap values. More precisely, we define a set of values at 1%, 10%, 20%, 30% and 40% of the maximum value of the CAM and compute bounding boxes around the largest connected component of the CAM. Then, we build an image descriptor for each of the spatial regions and compare them to the query image using the cosine distance. We keep the one with the best score. Using more than one threshold aims at covering the variability of object dimensions across images and detecting them with more precision.
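The region proposal part can be sketched in plain NumPy as follows. The 4-connectivity, the inclusive box format and the function names are assumptions made for illustration; a real implementation would likely rely on an image-processing library for the connected components.

```python
import numpy as np
from collections import deque

def largest_component_bbox(mask):
    """Inclusive box (row0, col0, row1, col1) of the largest 4-connected True region."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    best = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or seen[r, c]:
                continue
            seen[r, c] = True
            queue, component = deque([(r, c)]), []
            while queue:                         # breadth-first flood fill
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component
    if not best:
        return None
    ys, xs = zip(*best)
    return min(ys), min(xs), max(ys), max(xs)

def candidate_regions(cam, fractions=(0.01, 0.1, 0.2, 0.3, 0.4)):
    """One bounding box per threshold, each threshold a fraction of the CAM maximum."""
    return [largest_component_bbox(cam > f * cam.max()) for f in fractions]

cam = np.zeros((5, 6))
cam[1:4, 1:3] = 1.0      # main object
cam[2, 3] = 0.05         # weak activation touching the object
cam[0, 5] = 0.5          # isolated activation elsewhere
boxes = candidate_regions(cam)
```

Each box would then be encoded into a descriptor and scored against the query, keeping the best-scoring region for the re-ranking.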
In this work we proposed a technique to build compact image representations focusing on their semantic content. Our experiments demonstrated that selecting the relevant content of an image to build the image descriptor is beneficial and helps increase the retrieval performance. The proposed approach establishes a new state-of-the-art compared to methods that build image representations combining off-the-shelf features using random or fixed grid regions.
Thanks for reading!