Semantic segmentation classifies every pixel in an image into a category. In GIS, U-net segmentation can be used to classify land cover or to extract roads and buildings from satellite images.

U-net is a convolutional network that evolved from the traditional convolutional neural network.

The goal is the same as in traditional pixel-based image classification in remote sensing, which is normally done with standard machine learning techniques. As in that kind of classification, semantic segmentation takes two inputs:

  • A raster image with several bands,
  • A label image that holds the label of each pixel.
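As a rough illustration of these two inputs, the sketch below builds a small multi-band raster and a matching label image with NumPy. The band count, tile size, and class ids (0 = background, 1 = building, 2 = road) are assumptions made up for the example, not values from any particular dataset.

```python
import numpy as np

# Hypothetical 4-band raster tile, shape (bands, height, width),
# e.g. red, green, blue and near-infrared.
rng = np.random.default_rng(0)
image = rng.random((4, 6, 6), dtype=np.float32)

# Label image with the same height and width: one integer class id per pixel.
# The class ids are invented for this sketch: 0 = background, 1 = building, 2 = road.
labels = np.zeros((6, 6), dtype=np.int64)
labels[1:3, 1:4] = 1   # a small building footprint
labels[4, :] = 2       # a road crossing the tile

# The two inputs must line up pixel for pixel.
assert image.shape[1:] == labels.shape

# Number of pixels in each class.
class_ids, counts = np.unique(labels, return_counts=True)
print(dict(zip(class_ids.tolist(), counts.tolist())))
```

The only hard requirement shown here is that the label image shares the raster's height and width, so that every pixel has exactly one label.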

Many segmentation architectures exist, including U-net, Mask R-CNN, and Feature Pyramid Network (FPN).

U-net architecture for semantic segmentation

U-net was created for, and first applied to, biomedical image segmentation. Its architecture can be broadly divided into two parts: an encoder network followed by a decoder network. Unlike classification, where the final output of a very deep network is the only thing that matters, semantic segmentation requires not just pixel-level discrimination but also a mechanism to project the discriminative features learned at different stages of the encoder onto the pixel space.

  • The encoder is the first half of the architecture diagram.

It is usually a pre-trained classification network such as VGG or ResNet, in which convolution blocks followed by max pool downsampling encode the input image into feature representations at multiple levels.
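The downsampling step of the encoder can be sketched in a few lines of NumPy. This shows only 2x2 max pooling with stride 2, which is a common choice but an assumption here; the convolution blocks themselves are left out, and `max_pool_2x2` is a hypothetical helper name.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (channels, height, width) array."""
    c, h, w = feature_map.shape
    assert h % 2 == 0 and w % 2 == 0, "this sketch assumes even height and width"
    # Split each spatial axis into (blocks, 2) and keep the maximum of every 2x2 block.
    blocks = feature_map.reshape(c, h // 2, 2, w // 2, 2)
    return blocks.max(axis=(2, 4))

x = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
pooled = max_pool_2x2(x)
print(pooled.shape)   # (1, 2, 2)
print(pooled[0])      # [[ 5.  7.] [13. 15.]]
```

Each pooling step halves the height and width, which is why the encoder ends up with feature maps at several resolutions.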

  • The decoder is the second half of the architecture.

The goal is to produce a dense classification by semantically projecting the lower-resolution discriminative features learned by the encoder onto the higher-resolution pixel space. The decoder consists of upsampling and concatenation, followed by regular convolution operations.
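To make "dense classification" concrete, here is a minimal sketch that assumes the decoder ends with one score map per class and that each pixel takes the class with the highest score. The scores and the three-class setup are invented for illustration.

```python
import numpy as np

# Hypothetical decoder output: one score map per class, shape (classes, height, width).
scores = np.array([
    [[2.0, 0.1], [0.3, 0.2]],   # class 0
    [[0.5, 1.5], [0.1, 0.4]],   # class 1
    [[0.1, 0.2], [0.9, 0.3]],   # class 2
])

# Dense classification: every pixel gets the class with the highest score.
label_map = scores.argmax(axis=0)
print(label_map)   # [[0 1] [2 1]]
```

The result has the same height and width as the score maps, with one class id per pixel.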

Upsampling in a CNN may be new to readers who know classification and object recognition architectures, but the idea is simple: we expand the feature dimensions to restore the condensed feature map to the original size of the input image.

  • Upsampling is also referred to as transposed convolution, up-convolution, or deconvolution.
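A single-channel transposed convolution can be written out by hand to show how it enlarges a feature map. This is a sketch under simplifying assumptions (one channel, no padding, stride 2, a fixed kernel of ones instead of learned weights), and `transposed_conv2d` is a hypothetical helper, not a library function.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Single-channel transposed convolution without padding.

    Every input value scatters a scaled copy of the kernel into the output,
    `stride` pixels apart; overlapping contributions are summed.
    """
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
kernel = np.ones((2, 2))   # fixed weights for illustration; in a network they are learned
up = transposed_conv2d(x, kernel)
print(up.shape)   # (4, 4)
print(up)
```

With a 2x2 kernel of ones and stride 2, each input value is simply copied into a 2x2 block, so the output is twice as large in each direction. Learned kernel weights would produce a smoother or sharper result instead of plain copies.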

After each upsampling step, the feature map should match the size of the corresponding concatenation block from the encoder. The two arrows in the architecture diagram show where two feature maps are concatenated. The main contribution of U-Net in this respect is that, while upsampling in the network, we also concatenate the higher-resolution feature maps from the encoder with the upsampled features, so that the following convolutions can learn better representations. Because upsampling is a sparse operation, we need a good prior from the earlier stages to represent the localization well.
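The concatenation step can be sketched with plain array operations. The channel counts and spatial sizes below are assumptions chosen for the example, and nearest-neighbour repetition stands in for the learned upsampling.

```python
import numpy as np

# Hypothetical shapes, (channels, height, width).
encoder_features = np.ones((64, 8, 8), dtype=np.float32)    # high-resolution map saved from the encoder
decoder_features = np.zeros((128, 4, 4), dtype=np.float32)  # coarser map coming up the decoder

# Upsample the decoder map to the spatial size of the encoder map.
# Nearest-neighbour repetition is a stand-in for the learned upsampling.
upsampled = decoder_features.repeat(2, axis=1).repeat(2, axis=2)

# Skip connection: stack the two maps along the channel axis.
merged = np.concatenate([encoder_features, upsampled], axis=0)
print(upsampled.shape)   # (128, 8, 8)
print(merged.shape)      # (192, 8, 8)
```

The spatial sizes must match before concatenation, which is why the upsampling comes first. The convolutions that follow then see both the fine encoder detail and the upsampled decoder features.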

To summarize, unlike classification, where the final output of a very deep network is the only thing that matters, semantic segmentation requires not only pixel-level discrimination but also a mechanism to project the discriminative features learned at different stages of the encoder onto the pixel space.