Many machine learning and computer vision techniques and applications rely on the extraction of useful features. One form of feature extraction in computer vision is the detection and description of salient image regions. Traditionally, these features have been extracted with hand-engineered detectors and descriptors. Methods that follow this paradigm are referred to as keypoint-based or feature-based techniques.

Neural networks have recently been reintroduced into several computer vision tasks, largely replacing hand-engineered feature-based techniques. In most cases, neural-network-based methods learn feature extraction as part of an end-to-end pipeline. While these methods have shown considerable success in tasks such as scene recognition, object detection, and classification, other tasks such as structure-from-motion still rely on hand-designed features to detect and describe keypoints.

  • Keypoint detection is the process of locating the important parts of an object.

Eye corners, eyebrows, and nose tips are important features of our faces. These components help represent the underlying object in a feature-rich way. Pose estimation and face recognition are examples of keypoint detection applications.

Computer Vision and Deep Learning

SIFT, HOG, SURF, and ORB are just a few of the engineered feature extractors and descriptors that have been published in the computer vision literature.
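To make the idea of a hand-engineered detector concrete, here is a minimal sketch of the classic Harris corner detector in NumPy (a simpler relative of the detectors listed above; the function name and thresholds are illustrative, not from any particular library):

```python
import numpy as np

def harris_corners(img, k=0.05, window=3, threshold=1e-2):
    """Minimal Harris corner detector: score each pixel with the corner
    response det(M) - k * trace(M)^2 of its gradient structure tensor M."""
    # Image gradients via central differences (axis 0 = rows = y).
    Iy, Ix = np.gradient(img.astype(float))
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    # Sum gradient products over a local window (box filter).
    pad = window // 2
    def box_sum(a):
        out = np.zeros_like(a)
        padded = np.pad(a, pad)
        for dy in range(window):
            for dx in range(window):
                out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
        return out

    Sxx, Syy, Sxy = box_sum(Ixx), box_sum(Iyy), box_sum(Ixy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    R = det - k * trace * trace
    # Keep pixels whose response exceeds a fraction of the maximum.
    ys, xs = np.where(R > threshold * R.max())
    return list(zip(ys.tolist(), xs.tolist()))

# A white square on a black background: responses fire near its 4 corners.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
corners = harris_corners(img)
```

The same detect-then-score structure (compute a local response, threshold it) underlies SIFT's difference-of-Gaussians detector and ORB's FAST detector; full implementations add scale selection, non-maximum suppression, and orientation estimation.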

These extractors and descriptors were designed with a variety of objectives in mind, such as improving matching accuracy or speeding up extraction and matching. In general, they have been shown to perform well in a wide range of computer vision applications. In addition, techniques that learn keypoint detectors and descriptors have been documented in the literature.

In correspondence matching tasks, descriptors are used to establish geometric relations between two or more sets of keypoints; the putative matches are then filtered by enforcing geometric constraints with model-fitting techniques such as RANSAC. Structure-from-motion algorithms start from correspondence matching and propagate the computed relations across a large number of images, yielding a single global model.
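The RANSAC filtering step can be sketched with a toy example in NumPy. Here the geometric model is deliberately the simplest possible one, a pure 2D translation, so a minimal sample is a single correspondence (real pipelines fit homographies or essential matrices; the function and parameters below are illustrative):

```python
import numpy as np

def ransac_translation(src, dst, iters=200, inlier_tol=1.0, seed=0):
    """Toy RANSAC: estimate a 2D translation mapping src -> dst keypoints,
    robust to outlier correspondences."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(src))        # hypothesize from one random match
        t = dst[i] - src[i]               # candidate translation
        residual = np.linalg.norm(src + t - dst, axis=1)
        inliers = residual < inlier_tol   # geometric consistency check
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers for the final estimate.
    best_t = (dst[best_inliers] - src[best_inliers]).mean(axis=0)
    return best_t, best_inliers

# 8 true matches shifted by (5, -3), plus 2 gross outlier correspondences.
src = np.array([[0, 0], [1, 2], [3, 1], [4, 4], [2, 5],
                [6, 0], [5, 3], [7, 7], [0, 9], [9, 0]], dtype=float)
dst = src + np.array([5.0, -3.0])
dst[8] = [50, 50]
dst[9] = [-20, 7]                          # corrupt two correspondences
t, inliers = ransac_translation(src, dst)
```

The hypothesize-verify loop is the essence of RANSAC: a model fit from a random minimal sample is scored by how many correspondences it explains, and gross outliers never accumulate support.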

Deep architectures

In recent years, research on Deep Convolutional Neural Networks has flooded the computer vision literature with state-of-the-art results on all fronts.

Deep architectures have been developed to learn feature descriptors. Based on the siamese architecture, these networks are trained to embed 64×64 patches in a feature space where matching patches lie closer to each other than non-matching patches. The supervisory signal they use is derived from patches produced by structure-from-motion pipelines. However, they do not learn keypoint detection, nor do they natively handle multiple scales.
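The siamese training objective described above can be illustrated with a contrastive loss over pairs of patch embeddings. The NumPy sketch below shows only the loss, not the network; the margin value and variable names are illustrative:

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, match, margin=1.0):
    """Contrastive loss over patch-embedding pairs (a siamese objective):
    matching pairs are pulled together, while non-matching pairs are
    pushed at least `margin` apart. `match` is 1 for a matching pair,
    0 otherwise."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)            # pairwise distances
    pos = match * d ** 2                                 # pull matches together
    neg = (1 - match) * np.maximum(0.0, margin - d) ** 2 # push non-matches apart
    return np.mean(pos + neg)

# Two matching pairs that are already close, one non-matching pair far apart:
emb_a = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
emb_b = np.array([[0.0, 0.1], [1.0, 1.0], [3.0, 4.0]])
match = np.array([1, 1, 0])
loss = contrastive_loss(emb_a, emb_b, match)
```

Because both embeddings come from the same network with shared weights, minimizing this loss shapes a feature space in which nearest-neighbor search recovers the match/no-match labels.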

The identification of salient regions with deep architectures has been studied largely in the object detection and recognition literature. Features in later layers were found to correspond to fine details within the receptive fields they cover. Visual attention models are one method for generating salient region proposals: a recurrent network is trained to progressively analyze and propose parts of the visual space.

Attention mechanisms learn these salient features in an unsupervised manner. The Spatial Transformer Network is one such technique: it describes a region proposal mechanism capable of automatically identifying regions together with the transformations that map them to a canonical pose. In spatial transformer networks, a predefined number of likely patch matches is detected. In essence, the network tries to identify and match patches simultaneously, with only weak image-level supervision via match/no-match labels.
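The differentiable core of a spatial transformer is the grid generator plus sampler: a 2×3 affine matrix predicted by the network defines, in normalized coordinates, where each output pixel should sample the input. A minimal NumPy sketch of that sampling step (assuming a single-channel image; function name and conventions are illustrative):

```python
import numpy as np

def affine_grid_sample(img, theta):
    """Sampler of a spatial transformer: build a sampling grid from a 2x3
    affine matrix `theta` (in normalized [-1, 1] coordinates) and sample
    the input image at those locations with bilinear interpolation."""
    H, W = img.shape
    # Normalized target coordinates in [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # homogeneous
    sx, sy = theta @ grid                                      # source coords

    # Back to pixel coordinates.
    px = (sx + 1) * (W - 1) / 2
    py = (sy + 1) * (H - 1) / 2

    # Bilinear interpolation between the four surrounding pixels.
    x0 = np.clip(np.floor(px).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(py).astype(int), 0, H - 2)
    wx, wy = px - x0, py - y0
    out = (img[y0, x0] * (1 - wx) * (1 - wy)
           + img[y0, x0 + 1] * wx * (1 - wy)
           + img[y0 + 1, x0] * (1 - wx) * wy
           + img[y0 + 1, x0 + 1] * wx * wy)
    return out.reshape(H, W)

# The identity transform reproduces the input patch.
theta_id = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
patch = np.arange(16.0).reshape(4, 4)
warped = affine_grid_sample(patch, theta_id)
```

Because bilinear sampling is differentiable with respect to both the image and `theta`, gradients flow back into the network that predicts the transformation, which is what lets the region proposals be learned with only weak supervision.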

Region Proposal Networks are another method for generating region proposals. Their main goal is to identify regions of the image space that contain objects. In most cases, region proposal networks are fully supervised during training.


To train the keypoint detection network and the keypoint matching network, we need a large set of patches containing good-quality keypoints, additionally annotated with pairwise match information.

There is currently no large-scale dataset available for learning both keypoint detectors and descriptors from image patches. Moreover, acquiring training examples for this task is difficult, because gathering human-annotated examples is prohibitively expensive.