State-of-the-art Image Retrieval 2017
Updated: Nov 21, 2018
This is the second part of the blog series on visual image retrieval. In this post I will cover the academic state of the art up to mid-to-late 2017. I will place special emphasis on techniques that do not train a dedicated model for retrieval, but instead try to extract the maximum information from features obtained with pre-trained Convolutional Neural Networks (CNNs).
If we recall from the introduction, the process of image retrieval begins with the exploration of a dataset and the encoding of its images into compact representations. These representations are later compared using a distance metric and ranked by the resulting similarity score. We will focus on how to compute these representations.
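To make the comparison and ranking step concrete, here is a minimal sketch. It assumes cosine similarity as the metric (the post does not fix a particular one), and the descriptors are toy values invented for illustration; `rank_by_cosine` is a hypothetical helper name.

```python
import numpy as np

def rank_by_cosine(query, database):
    """Return database indices sorted from most to least similar to the query."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                      # cosine similarity per database image
    return np.argsort(-scores), scores   # descending similarity

# Toy 3-D descriptors (hypothetical values, for illustration only).
database = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])
order, scores = rank_by_cosine(query, database)
```

Here `order` lists the database images from best to worst match. In a real system the descriptors would come from one of the encoders discussed in the rest of the post.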
The most successful retrieval approaches before the popularization of Deep Learning were based on locally invariant features, often encoded using a Bag of Words model and improved using large visual codebooks. In a nutshell, interest points are detected in the image and local invariant descriptors are extracted around them. Each descriptor is assigned to its closest visual word in a visual vocabulary: a codebook obtained offline by clustering a large set of descriptors with k-means. This results in a sparse histogram representation that is typically high-dimensional. An inverted list structure is then employed for efficient indexing, and Term Frequency - Inverse Document Frequency (TF-IDF) scoring is used to discount the influence of visual words that occur in many images.
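The Bag of Words pipeline can be sketched in a few lines. This is an illustrative simplification: the codebook is assumed to be given (in practice it would come from k-means), the inverted index is omitted, and the exact TF-IDF variant is my own choice; `bow_histogram` and `tfidf` are hypothetical helper names.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Count how many local descriptors fall on each visual word."""
    # Squared Euclidean distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)            # index of the closest visual word
    return np.bincount(words, minlength=len(codebook)).astype(float)

def tfidf(histograms):
    """Weight an (images x words) count matrix with one common TF-IDF variant."""
    n_images = histograms.shape[0]
    tf = histograms / np.maximum(histograms.sum(axis=1, keepdims=True), 1.0)
    df = (histograms > 0).sum(axis=0)            # images containing each word
    idf = np.log(n_images / np.maximum(df, 1))   # frequent words get low weight
    return tf * idf

# Toy 2-D example with a two-word codebook (values invented for illustration).
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
image_a = np.array([[0.0, 1.0], [9.0, 9.0]])
image_b = np.array([[1.0, 0.0], [0.0, 0.0]])
counts = np.stack([bow_histogram(image_a, codebook),
                   bow_histogram(image_b, codebook)])
weights = tfidf(counts)
```

The first visual word appears in both images, so its IDF is zero and it no longer contributes to the comparison; only the second, rarer word keeps a non-zero weight.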
Following the success of CNNs in image classification, recent image retrieval works have replaced the classical hand-crafted features with representations extracted from CNNs. The first trend was to transfer what the network had learned on a classification task; more recently, a growing number of works focus on learning features that are specific to retrieval.
Unsupervised Retrieval Approaches
A first approach to using CNNs for image retrieval was to encode the images using features extracted from the fully connected layers (2014). This yields a high-level dense descriptor of the visual content of the image, referred to as a Neural Code. It was shown that, by applying PCA, these codes could be shortened while still performing better than the previous state of the art based on hand-crafted features. An extension to local analysis was later presented, where these features were extracted over a fixed set of regions defined over the image at different scales.
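As a rough sketch of the PCA compression step, the snippet below shortens a set of descriptors with plain NumPy. The random matrix is only a stand-in for real fully connected activations, and the L2 re-normalization after projection is a common choice that I am assuming here, not a detail taken from the original work.

```python
import numpy as np

def fit_pca(codes, n_components):
    """Learn a PCA projection from an (images x dims) matrix of descriptors."""
    mean = codes.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(codes - mean, full_matrices=False)
    return mean, vt[:n_components]

def compress(codes, mean, components):
    """Project descriptors onto the principal directions and L2-normalize them."""
    reduced = (codes - mean) @ components.T
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    return reduced / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
codes = rng.normal(size=(50, 64))   # stand-in for 50 fully connected activations
mean, components = fit_pca(codes, n_components=8)
short_codes = compress(codes, mean, components)   # 50 descriptors of length 8
```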
Later works observed that features from convolutional layers preserve the spatial information of the images, which makes them more useful for retrieval. In addition, convolutional layers accept images of variable size as input, which also improved performance. Building on this observation, different authors have combined convolutional features with different ways of estimating the areas of interest within the image.
In the next slides we can see how to go from convolutional features to compact representations:
One approach builds a global descriptor by sum-pooling the convolutional features (the SPoC descriptor) and introducing a Gaussian centering prior, which assumes that the relevant content is in the center of the image (and therefore introduces a dataset bias). Razavian's technique performs a multiresolution search, extracting sub-patches of different sizes at random locations. R-MAC uses a rigid grid of regions of different sizes and encodes one vector per region by max-pooling each feature map within it; the region vectors are then aggregated into a global image representation. Finally, BoW builds a Bag of Words model on top of the convolutional features, also using a rigid grid of regions. These works show that focusing on local regions of the convolutional features can improve performance, but the regions are chosen through heuristics and randomness, not from the image content.
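The two pooling ideas can be sketched as follows. This is a simplified illustration and not the published pipelines: the width of the Gaussian (`sigma_ratio`) is an assumed parameter, the regions are passed in by hand instead of being generated by a multi-scale grid, and any PCA or whitening step is omitted. All function names are hypothetical.

```python
import numpy as np

def l2n(v):
    """L2-normalize a vector, guarding against division by zero."""
    return v / max(np.linalg.norm(v), 1e-12)

def sum_pool_with_center_prior(features, sigma_ratio=1 / 3):
    """Sum-pool a (C, H, W) feature map with a Gaussian centre prior."""
    _, h, w = features.shape
    ys = np.arange(h) - (h - 1) / 2.0    # vertical offset from the centre
    xs = np.arange(w) - (w - 1) / 2.0    # horizontal offset from the centre
    sigma = sigma_ratio * min(h, w) / 2.0
    prior = np.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2 * sigma ** 2))
    return l2n((features * prior).sum(axis=(1, 2)))

def regional_max_pool(features, regions):
    """Max-pool each (y0, y1, x0, x1) region, normalize, then sum the region vectors."""
    vecs = [l2n(features[:, y0:y1, x0:x1].max(axis=(1, 2)))
            for y0, y1, x0, x1 in regions]
    return l2n(np.sum(vecs, axis=0))

rng = np.random.default_rng(0)
features = rng.random((4, 6, 8))          # stand-in for a conv layer output
global_sum = sum_pool_with_center_prior(features)
global_max = regional_max_pool(features, [(0, 6, 0, 8), (0, 3, 0, 4), (3, 6, 4, 8)])
```

Both functions return one vector with as many dimensions as the layer has channels, regardless of the spatial size of the input, which is what makes variable-size images easy to handle.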
Turning to methods that do rely on the image content, a strategy called CroW estimates a spatial weighting as a combination of the convolutional feature maps across all channels of the layer. Its authors report that this boosts features at locations with salient visual content and down-weights those at non-salient locations. Another work uses saliency maps generated by a separate model to predict zones of interest, which are then used to weight the convolutional features.
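A content-based spatial weighting of this kind can be sketched as below. It only captures the general idea described above (sum the feature maps across channels and use the result to weight each location); the normalization is an assumption of mine and the sketch leaves out any per-channel weighting, so it should not be read as the exact CroW formulation.

```python
import numpy as np

def spatially_weighted_pool(features):
    """features: (C, H, W) non-negative activations, e.g. after a ReLU."""
    saliency = features.sum(axis=0)                    # (H, W): total activation per location
    saliency = saliency / max(np.linalg.norm(saliency), 1e-12)
    pooled = (features * saliency).sum(axis=(1, 2))    # weighted sum-pooling per channel
    return pooled / max(np.linalg.norm(pooled), 1e-12), saliency

# Toy feature map with one strongly activated location (invented values).
features = np.zeros((2, 3, 3))
features[:, 1, 1] = [1.0, 2.0]
features[:, 0, 0] = [0.1, 0.1]
descriptor, saliency = spatially_weighted_pool(features)
```

In this toy case the location (1, 1) receives the largest weight, so it dominates the pooled descriptor.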
Our work, Class-Weighted Convolutional Features for Visual Instance Search, explicitly leverages the semantic information contained in the model. To this end, we adopt Class Activation Maps (CAMs) as a method to exploit the predicted classes and obtain semantic-aware spatial weights for the convolutional features (post).
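As a rough illustration of the idea, the sketch below computes a CAM as the weighted sum of the feature maps using the classifier weights of one class, and then uses it as a spatial weight before sum-pooling. The random inputs, the min-max scaling of the CAM and the way the class vectors are aggregated are simplifying assumptions for this example; the full method is described in the linked post.

```python
import numpy as np

def class_activation_map(features, class_weights):
    """features: (C, H, W); class_weights: (C,) classifier weights for one class."""
    cam = np.tensordot(class_weights, features, axes=([0], [0]))   # (H, W)
    cam = cam - cam.min()
    return cam / max(cam.max(), 1e-12)                             # scaled to [0, 1]

def class_weighted_descriptor(features, weight_matrix, classes):
    """Pool the features once per class, weighted by that class's CAM, then aggregate."""
    vecs = []
    for c in classes:
        cam = class_activation_map(features, weight_matrix[c])
        v = (features * cam).sum(axis=(1, 2))          # CAM-weighted sum-pooling
        vecs.append(v / max(np.linalg.norm(v), 1e-12))
    d = np.sum(vecs, axis=0)
    return d / max(np.linalg.norm(d), 1e-12)

rng = np.random.default_rng(0)
features = rng.random((4, 5, 5))          # stand-in for the last conv layer output
weight_matrix = rng.normal(size=(3, 4))   # stand-in for (classes x channels) weights
cam = class_activation_map(features, weight_matrix[0])
descriptor = class_weighted_descriptor(features, weight_matrix, classes=[0, 2])
```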
Supervised Retrieval Approaches
The works above use off-the-shelf features, while the following ones apply supervised learning to fine-tune CNNs. The first uses a similarity-oriented loss such as ranking [slides], and the second explores a pairwise similarity. Both adapt the CNN to the particular dataset and boost the performance of the resulting representations. They obtain the best results, but the fine-tuning step has one main drawback: it requires a large effort to collect, annotate and clean a large dataset, which is sometimes not feasible. Furthermore, since these approaches are domain specific, it is not clear how well they generalize to other domains and datasets.
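To give a feel for what such losses look like, here is a minimal sketch of a triplet ranking loss and a contrastive pairwise loss. These are generic textbook formulations with margins picked arbitrarily for the example; they are assumptions for illustration and not necessarily the exact losses used in the works mentioned above.

```python
import numpy as np

def triplet_ranking_loss(anchor, positive, negative, margin=0.1):
    """Penalize the anchor being closer to the negative than to the positive."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, margin + d_pos - d_neg)

def contrastive_loss(a, b, is_match, margin=0.7):
    """Pull matching pairs together and push non-matching pairs beyond the margin."""
    d = np.linalg.norm(a - b)
    if is_match:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

anchor = np.array([1.0, 0.0])
same_object = np.array([1.0, 0.0])
other_object = np.array([0.0, 1.0])

good_ranking = triplet_ranking_loss(anchor, same_object, other_object)
bad_ranking = triplet_ranking_loss(anchor, other_object, same_object)
matching_pair = contrastive_loss(anchor, same_object, is_match=True)
```

A correctly ordered triplet gives zero loss, while the swapped one is penalized. During fine-tuning, the gradient of such a loss is what adapts the CNN features to the target dataset.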
Thanks for reading!