Visual search has become one of the most prominent areas in the development of analysis techniques and machine intelligence. As the multimedia sector has grown dramatically, so has the need for an effective and computationally efficient visual search engine. The goal is to retrieve, from a vast corpus of images, the specific photographs that show instances of a user-specified object, scene, or place. Important applications include mobile commerce, medical imaging, and augmented reality, among many others.

Robust and reliable visual search is difficult because of variations in object appearance, viewpoint, and scale, partial occlusions, varying backgrounds, and differing imaging conditions. Furthermore, given the massive amounts of multimedia data now available, today’s systems must scale to billions of images.

A compact and discriminative image representation is necessary to overcome these obstacles. Convolutional Neural Networks (CNNs) have proven useful in a variety of computer vision applications, such as image classification. However, they have yet to deliver the expected performance improvements in image retrieval, particularly at very large scale. The key reason is that two basic issues remain unsolved:

  • how to optimally aggregate the deep features extracted by a CNN into compact and discriminative image-level representations
  • how to train the resulting CNN-aggregator architecture for image retrieval tasks.

The overall design comprises a baseline CNN followed by the REMAP network. The CNN component produces dense, deep convolutional features, which the REMAP network aggregates. The CNN filter weights and the REMAP parameters are trained simultaneously, so the aggregation adapts to the changing distributions of the deep descriptors while the multi-region aggregation parameters are optimized as training proceeds.

Components of REMAP

The design of the REMAP descriptor addresses two key issues in content-based image retrieval: how to aggregate the multi-layer deep convolutional features extracted by a CNN, and how to assemble multi-region and multi-layer representations with end-to-end training.

The objective is to aggregate a hierarchy of deep features from several CNN layers, each explicitly trained to represent multiple and complementary levels of visual feature abstraction, which significantly improves recognition.

The multi-layer architecture is trained end-to-end and specifically for recognition. This means that multiple CNN layers are trained jointly to be discriminative individually, complementary to one another in recognition tasks, and supportive of the feature extraction required at later layers.

End-to-end training of the CNN is crucial because it explicitly enforces intra-layer feature complementarity, which significantly boosts performance. Without such joint multi-layer learning, the features from the additional layers may happen to be useful, but they are not trained to be either discriminative or complementary.

Another key innovation is region entropy weighting, which estimates how discriminative the features are in each local region and uses that information to control the subsequent sum-pooling operation. Region entropy is defined as the relative entropy between the distributions of distances for matching and non-matching image descriptor pairs, measured with the KL-divergence function.
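The region entropy definition above can be illustrated with a minimal sketch. It is not the authors' implementation: the histogram estimate of the two distance distributions, the bin count, the distance range [0, 2] (which would suit Euclidean distances between L2-normalized descriptors), and the function names are all assumptions made for illustration.

```python
import math

def histogram(values, bins, lo, hi):
    """Normalized histogram of `values` over [lo, hi] (hypothetical helper)."""
    # Assumption: a tiny additive constant keeps every bin non-zero,
    # so the logarithm in the KL divergence is always defined.
    counts = [1e-6] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp out-of-range values
        counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Relative entropy KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def region_entropy(match_dists, nonmatch_dists, bins=10, lo=0.0, hi=2.0):
    """KL divergence between matching and non-matching distance histograms."""
    p = histogram(match_dists, bins, lo, hi)
    q = histogram(nonmatch_dists, bins, lo, hi)
    return kl_divergence(p, q)

# Toy distances: one region separates the two classes well, the other does not.
separable = region_entropy([0.1, 0.2, 0.15, 0.3], [1.5, 1.6, 1.7, 1.8])
overlapping = region_entropy([0.5, 0.9, 1.1, 0.7], [0.6, 0.8, 1.0, 1.2])
print(separable > overlapping)
```

In this toy setup the region whose matching and non-matching distances barely overlap receives the larger score, which is the behaviour the weighting scheme relies on.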

Regions that exhibit high separability between the matching and non-matching distributions are more informative for recognition and are therefore assigned higher weights.
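How such weights might steer sum pooling can be sketched as follows. This is an illustration under stated assumptions rather than the REMAP implementation: the per-region descriptors are assumed to be already extracted, the weights are applied as plain scalars, and the L2 normalization steps and the names `l2_normalize` and `weighted_sum_pool` are hypothetical choices.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def weighted_sum_pool(region_descriptors, region_weights):
    """Sum region descriptors, each scaled by its weight, into one image descriptor."""
    dim = len(region_descriptors[0])
    pooled = [0.0] * dim
    for desc, weight in zip(region_descriptors, region_weights):
        unit = l2_normalize(desc)  # assumption: regions are normalized before pooling
        for i in range(dim):
            pooled[i] += weight * unit[i]
    return l2_normalize(pooled)

# Two toy regions; the first is assumed to be three times as informative.
image_descriptor = weighted_sum_pool([[1.0, 0.0], [0.0, 2.0]], [3.0, 1.0])
print(image_descriptor)
```

With these toy inputs the pooled descriptor leans toward the direction of the more heavily weighted region, so an informative region contributes more to the final image-level representation.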


Because each block in the REMAP network represents a differentiable operation, the complete architecture can be trained end-to-end.

REMAP is a CNN-based architecture that learns a hierarchy of deep features representing different and complementary levels of visual abstraction. The whole framework is trained end-to-end with triplet loss, and extensive experiments are reported to show that REMAP outperforms prior state-of-the-art methods.
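For readers unfamiliar with triplet loss, a minimal sketch of its common margin-based form is given here. The squared Euclidean distance, the margin value of 0.1, and the function names are assumptions for illustration; the text does not specify the exact formulation used to train REMAP.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss that is zero once the negative is farther than the positive by `margin`."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# A well-ordered triplet: the positive coincides with the anchor.
print(triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
# A violated triplet: the negative is closer to the anchor than the positive.
print(triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```

The loss vanishes for triplets that are already ordered correctly and grows with the size of the violation, so the gradient concentrates on descriptors that are still ranked incorrectly.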