CNN based Two-stage Detectors Summary and Comparison

R-CNN, SPPNet, Fast R-CNN, Faster R-CNN, FPN

R-CNN: Regions with CNN features (2014)

Object detection system overview.

Test time detection

  • Input: A test image.
  • Extract region proposals: Extract ~2000 region proposals using selective search.
  • Compute CNN features: Warp each proposal and forward propagate it through a CNN (AlexNet).
  • Classify regions: For each extracted feature vector, score it through a bunch of SVMs for each class. Apply a greedy non-maximum suppression for each class independently.
  • Bounding-box regression: Post-process the prediction windows.
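The greedy non-maximum suppression step above can be sketched as follows. This is a minimal NumPy version for illustration, not the authors' implementation; the function name and the 0.3 threshold default are assumptions:

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, suppress boxes that
    overlap it by more than iou_thresh, then repeat on the remainder.
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) array."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of box i with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```

In R-CNN this is applied once per class, over that class's scored proposals.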

CNN supervised pre-training

Pre-trained the CNN (AlexNet) on the ImageNet dataset using image-level annotations only (bounding-box labels are not available for this data).

CNN domain-specific fine-tuning

To adapt the CNN to the new task (detection) and the new domain (warped proposal windows), the pre-trained CNN is further fine-tuned using only warped region proposals.

Image input: For each proposal region, convert image data in that region into a fixed pixel size (227 × 227) as required by the CNN. Regardless of the size or aspect ratio of the candidate region, warp all pixels in a tight bounding box around it to the required size. The tight bounding box is created by dilating the original box so that at the warped size there are exactly p (p = 16) pixels of warped image context around the original box.

Output layer: The 1000-way classification layer is replaced with a randomly initialized (N + 1)-way classification layer (N object classes + background). The training labels are defined as:

  • Positive examples: Region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class.
  • Negative examples: All remaining region proposals.

Mini-batch: (size=128) Uniformly sample 32 positive and 96 background examples.
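The IoU overlap used in these label definitions can be computed with a small helper. This is a hypothetical sketch (the function names `iou` and `fine_tune_label` are mine, not from the paper's code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def fine_tune_label(proposal, gt_boxes, gt_classes):
    """Fine-tuning label: >= 0.5 IoU with some ground-truth box ->
    that box's class; otherwise background (class 0)."""
    overlaps = [iou(proposal, g) for g in gt_boxes]
    best = max(range(len(overlaps)), key=overlaps.__getitem__)
    return gt_classes[best] if overlaps[best] >= 0.5 else 0
```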

Object category classifiers

Train class-specific linear SVMs. The positive and negative examples are defined differently from fine-tuning:

  • Positive examples: Ground-truth bounding boxes for each class.
  • Negative examples: Region proposals with less than 0.3 IoU overlap with all ground-truth boxes of a class as negatives for that class.

Once features are extracted (from the fine-tuned CNN) and training labels are assigned, optimize one linear SVM per class adopting the standard hard negative mining method.

Bounding-box regression

Train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal.
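The regression targets follow the R-CNN parameterization from a proposal box P to a ground-truth box G in center/size coordinates; the inverse transform turns predicted deltas back into a detection window. A minimal sketch (function names are mine):

```python
import math

def bbox_regression_targets(p, g):
    """R-CNN-style regression targets from proposal p to ground truth g,
    both given as (cx, cy, w, h)."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    tx = (gx - px) / pw            # scale-invariant center shifts
    ty = (gy - py) / ph
    tw = math.log(gw / pw)         # log-space size scaling
    th = math.log(gh / ph)
    return (tx, ty, tw, th)

def apply_deltas(p, t):
    """Invert the transform: predicted window from proposal + deltas."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))
```

`apply_deltas(p, bbox_regression_targets(p, g))` recovers `g` exactly, which is the round-trip property the regressor is trained to approximate.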


The redundant feature computations on a large number of overlapped proposals (over 2000 boxes from one image) lead to an extremely slow detection speed.

Spatial Pyramid Pooling Networks: SPPNet (2014)

Improvement against R-CNN

  • R-CNN is slow because it repeatedly applies the deep convolutional network to ~2000 windows per image.
  • SPPNet can run orders of magnitude faster than R-CNN because it runs the convolutional layers only once on the entire image (possibly at multiple scales), and then extracts features for each candidate window from the feature maps.

Test time detection

Similar to R-CNN, with one difference:

  • Compute CNN features: Instead of computing features for each proposal region, compute the feature map on the entire image and then extract features for each proposal region on the feature map.

The Spatial Pyramid Pooling Layer

A network structure with a spatial pyramid pooling layer. Here 256 is the filter number of the conv5 layer, and conv5 is the last convolutional layer.

To extract features for proposal regions of various sizes from the feature map, a spatial pyramid pooling layer is added on top of the last convolution layer. Spatial pyramid pooling can maintain spatial information by pooling (max-pooling in the paper) in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size. Thus SPP layer can accept inputs of arbitrary aspect ratios and scales and generates fixed-length outputs, which are then fed into the fully connected layers to generate the final features.

The outputs of the spatial pyramid pooling are kM-dimensional vectors, where M denotes the number of bins and k the number of filters in the last convolutional layer. In the paper, for each candidate window, a 4-level spatial pyramid (1×1, 2×2, 3×3, 6×6; 50 bins in total) is used to pool the features. This generates a 12,800-d (256×50) representation for each window. These representations are provided to the fully connected layers of the network.
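The fixed-length property can be seen in a small NumPy sketch of the SPP layer. The bin-boundary rounding here is one plausible choice, not necessarily the paper's exact windows:

```python
import math
import numpy as np

def spp_pool(feat, levels=(1, 2, 3, 6)):
    """Spatial pyramid max-pooling of a (C, H, W) feature map.
    Each level n divides the map into an n x n grid of bins whose sizes
    scale with the input, so the output length is C * sum(n*n) regardless
    of H and W."""
    c, h, w = feat.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                y1, y2 = (i * h) // n, math.ceil((i + 1) * h / n)
                x1, x2 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(feat[:, y1:y2, x1:x2].max(axis=(1, 2)))
    return np.concatenate(pooled)
```

With 256 conv5 filters and the 4-level pyramid (50 bins), any input size yields a 12,800-d vector, matching the dimensionality stated above.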

CNN supervised pre-training

Pre-trained the CNN (ZF-5) on the ImageNet dataset using image-level annotations only, following R-CNN. However, there is a small modification to the CNN architecture: An l-level spatial pyramid pooling layer is added on top of the last convolutional layer, then the input of the next fully connected layer (fc6) is the concatenation of the l outputs.

CNN domain-specific fine-tuning

Similar to the implementation in R-CNN. However, since the features are pooled from the conv5 feature maps, for simplicity only the fully-connected layers are fine-tuned.

In this case, the data layer accepts the fixed-length pooled features after conv5, and the fc6 and fc7 layers and a new (N + 1)-way fc8 layer follow. In each mini-batch, 25% of the samples are positive.

Object category classifiers

Similar to the implementation in R-CNN. Any negative sample is removed if it overlaps another negative sample by more than 70%.

Bounding-box regression

Similar to the implementation in R-CNN. However, the features used for regression are the pooled features from conv5, as a counterpart of the pool5 features used in R-CNN. The windows used for the regression training are those overlapping with a ground-truth window by at least 50%.


  • The training is still multi-stage.
  • SPPNet only fine-tunes its fully connected layers while simply ignoring all previous layers.

Fast R-CNN (2015)

Improvement against R-CNN and SPPNet

  • Higher detection quality (mAP).
  • Training is single-stage, using a multi-task loss, unlike R-CNN and SPPNet, where training is a multi-stage pipeline.
  • Training can update all network layers, unlike SPPNet, whose fine-tuning algorithm cannot update the convolutional layers that precede the spatial pyramid pooling.
  • No disk storage is required for feature caching. For both R-CNN and SPPNet, SVM and bounding-box regressor training require features to be extracted and written to disk.

Fast R-CNN architecture

Test time detection

  • Input image: A test image.
  • Extract region proposals: Extract ~2000 region proposals, which are called regions of interest (RoIs).
  • Compute CNN features: Compute the feature maps of the entire image. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs).
  • Classification and Bounding-box regression: For each RoI r, the network produces two output vectors: softmax class probabilities and per-class bounding-box regression offsets relative to r.
  • Post-processing: For each object class, assign detection confidence to r using the estimated probability of that class. Then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN.

The RoI pooling layer

The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPNet in which there is only one pyramid level. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w). RoI max-pooling works by dividing the h×w RoI window into a H×W grid of sub-windows of approximate size h/H ×w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling.
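The single-level RoI max-pooling described above can be sketched in NumPy. The sub-window rounding is one plausible choice (the name `roi_max_pool` is mine):

```python
import math
import numpy as np

def roi_max_pool(feat, roi, H=7, W=7):
    """feat: (C, Hmap, Wmap) feature map; roi: (r, c, h, w) with top-left
    corner (r, c) and size (h, w) on the feature map.  Divides the h x w
    window into an H x W grid and max-pools each sub-window per channel."""
    r, c, h, w = roi
    out = np.empty((feat.shape[0], H, W))
    for i in range(H):
        for j in range(W):
            # sub-window of approximate size h/H x w/W
            y1, y2 = r + (i * h) // H, r + math.ceil((i + 1) * h / H)
            x1, x2 = c + (j * w) // W, c + math.ceil((j + 1) * w / W)
            out[:, i, j] = feat[:, y1:y2, x1:x2].max(axis=(1, 2))
    return out
```

Every RoI, whatever its size, comes out as a fixed C×H×W block, which is what lets the fully connected layers that follow accept it.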

Initializing from pre-trained CNNs

The pre-trained CNN undergoes three transformations:

  • The last max-pooling layer is replaced by an RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16). Similar to SPPNet but only with one pyramid level (H×W).
  • The network’s last fully connected layer and softmax are replaced with two sibling layers: (1) a fully connected layer and softmax over K+1 categories; (2) category-specific bounding-box regressors. With these two new layers, the class-specific SVMs and box regressors are no longer required.
  • The network is modified to take two data inputs: a list of images and a list of RoIs in those images.

CNN domain-specific fine-tuning

Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v.

  • Foreground (positive) examples: Object proposals that have ≥ 0.5 IoU overlap with a ground-truth bounding box.
  • Background (negative) examples: Object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5).

Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors using a multi-task loss:

  • L_cls(p, u): The log loss for true class u.
  • [u ≥ 1]L_loc(t^u, v): The robust L1 loss for bounding-box regression when u ≥ 1. L_loc is ignored for background RoIs.
  • Hyper-parameter λ controls the balance between the two task losses.
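The combined objective can be sketched as follows, using the smooth (robust) L1 form of L_loc from the paper; the function names are mine:

```python
import numpy as np

def smooth_l1(x):
    """Robust L1 loss used in L_loc: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """L = L_cls(p, u) + lam * [u >= 1] * L_loc(t_u, v).
    p: predicted class probabilities; u: true class (0 = background);
    t_u: predicted box deltas for class u; v: regression targets."""
    l_cls = -np.log(p[u])                                   # log loss
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc
```

Note how the indicator [u ≥ 1] switches the localization term off entirely for background RoIs, so no regression gradient flows for them.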

Unlike SPPNet, where only the fully-connected layers are fine-tuned, Fast R-CNN updates all network layers by backpropagation through RoI pooling layers.

Mini-batch: (size R=128) Choose N=2 images uniformly at random, sampling 64 RoIs from each image, where 25% of the samples are foreground examples.

Truncated SVD for faster detection

Large fully connected layers are accelerated by truncated SVD. A layer parameterized by the weight matrix W (u×v) is approximately factorized as

W ≈ U Σ_t V^T

where U is a u×t matrix comprising the first t left-singular vectors of W, Σ_t is a t×t diagonal matrix containing the top t singular values, and V is a v×t matrix of the first t right-singular vectors. Truncated SVD reduces the parameter count from uv to t(u + v), which can be significant if t is much smaller than min(u, v).
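In practice the single layer is split into two smaller ones, as sketched below with NumPy (the helper name `truncate_fc` is mine):

```python
import numpy as np

def truncate_fc(W, t):
    """Split a u x v fully connected weight matrix into two layers via
    truncated SVD: a v -> t layer (V_t^T) followed by a t -> u layer
    (U_t @ Sigma_t), so y = W @ x becomes y ~= second @ (first @ x)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = Vt[:t]                 # t x v: applied to the input first
    second = U[:, :t] * s[:t]      # u x t: U_t scaled by the singular values
    return second, first
```

When W is (near) rank-t, the two-layer factorization reproduces it almost exactly; otherwise it is the best rank-t approximation in the least-squares sense.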


Although Fast R-CNN successfully integrates the advantages of R-CNN and SPPNet, its detection speed is still limited by the region proposal step.

Faster R-CNN (2015)

Improvement against Fast R-CNN

Faster R-CNN is a single, unified network for object detection.
  • Get rid of region proposals by introducing Region Proposal Network (RPN), where region proposals are generated directly from convolutional feature maps used by region-based detectors.
  • The RPN is simply a kind of fully convolutional network (FCN) and can be trained end-to-end specifically for the task of generating detection proposals with a wide range of scales and aspect ratios.
  • Faster R-CNN is a combination of a deep fully convolution network that proposes regions and the Fast R-CNN detector that uses the proposed regions. The entire system is a single, unified network for object detection.

Test time detection

Similar to Fast R-CNN, but Faster R-CNN generates region proposals (from the RPN branch) and computes CNN features from the same network. Non-maximum suppression is adopted to reduce redundancy due to highly overlapping RPN proposals, finally yielding ~2000 proposal regions per image.

Region Proposal Network (RPN)

A Region Proposal Network (RPN) is a fully convolutional network that takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. It shares a common set of convolutional layers with a Fast R-CNN object detection network.

To generate region proposals, a small network slides over the convolutional feature map output by the last shared convolutional layer. This small network takes an n × n (n=3 in the paper) spatial window of the input convolutional feature map as input. Each sliding window is mapped to a lower-dimensional feature and fed into two sibling fully-connected layers — a box-regression layer (reg) and a box-classification layer (cls). Because the mini-network operates in a sliding window fashion, the fully connected layers are shared across all spatial locations.

At each sliding window location, k region proposals are predicted simultaneously. So the reg layer outputs 4k coordinates and the cls layer outputs 2k scores. The k proposals are parameterized relative to k reference boxes, which are called anchors. An anchor is centered at the sliding window and is associated with a scale and aspect ratio. In the paper, 3 different scales (128², 256², 512² ) and aspect ratios (1:1, 1:2, 2:1) are used, yielding k=9 anchors at each sliding window. Thus for a convolutional feature map of a size W×H, there are WHk anchors in total.
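The anchor shapes can be enumerated directly from the scales and ratios. A minimal sketch (the helper name is mine; anchors are represented just by their width and height, centered at the sliding window):

```python
def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """One (w, h) anchor per scale/ratio pair: area = scale**2 and
    height/width = ratio, giving k = len(scales) * len(ratios) anchors."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((w, h))
    return anchors
```

With the paper's 3 scales and 3 ratios this yields k = 9 shapes, replicated at every position of a W×H feature map for WHk anchors in total.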

To train an RPN, a binary class label (an object or not) is assigned to each anchor.

  • Positive anchors: (i) the anchor(s) with the highest IoU overlap with a ground-truth box, or (ii) an anchor with an IoU overlap higher than 0.7 with any ground-truth box.
  • Negative anchors: anchors whose IoU is lower than 0.3 for all ground-truth boxes.
  • Other anchors: anchors that are neither positive nor negative do not contribute to the training objective.

The loss function for an image is defined as

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where p_i is the predicted objectness probability of anchor i, p_i* is its ground-truth label (1 for positive anchors, 0 for negative ones), and t_i, t_i* are the predicted and ground-truth box parameterizations. The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter λ. In the paper’s implementation, N_cls = 256 is the mini-batch size and N_reg ≈ 2400 is the number of anchor locations. By default λ = 10, thus the two terms are roughly equally weighted.

The bounding-box regression adopts the parameterizations of the 4 coordinates:

t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a)
t_x* = (x* − x_a)/w_a, t_y* = (y* − y_a)/h_a, t_w* = log(w*/w_a), t_h* = log(h*/h_a)

where x, y, w, and h denote the box’s center coordinates and its width and height, and x, x_a, and x* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.

Since the features used for regression are of the same spatial size (3 × 3) on the feature maps, to account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights.

Mini-batch: (size=256) Choose a single image that contains many positive and negative example anchors. 256 anchors are sampled randomly, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, the mini-batch is padded with negative ones.
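The sampling rule above, including the negative padding, can be sketched as follows (the function name and the label encoding 1/0/−1 for positive/negative/ignored are my assumptions):

```python
import random

def sample_rpn_minibatch(labels, size=256):
    """labels[i]: 1 = positive anchor, 0 = negative, -1 = ignored.
    Sample up to size//2 positives; pad the rest with negatives so the
    mini-batch always has exactly `size` anchors."""
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    n_pos = min(len(pos), size // 2)
    pos = random.sample(pos, n_pos)
    neg = random.sample(neg, size - n_pos)
    return pos, neg
```

Without this balancing, the loss would be dominated by the vastly more numerous negative anchors.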

Sharing Features for RPN and Fast R-CNN

The authors adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization.

  • Step-1: Train the RPN. This network is initialized with an ImageNet pre-trained model and fine-tuned end-to-end for the region proposal task.
  • Step-2: Train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet pre-trained model.

At this point, the two networks do not share convolutional layers.

  • Step-3: Use the detector network to initialize RPN training, but fix the shared convolutional layers and only fine-tune the layers unique to RPN.

Now the two networks share convolutional layers.

  • Step-4: Keeping the shared convolutional layers fixed, fine-tune the unique layers of Fast R-CNN.

As such, both networks share the same convolutional layers and form a unified network.

Feature Pyramid Networks (2017)

Improvement against Faster R-CNN

Feature Pyramid Networks
  • Construct feature pyramids that have rich semantics at all levels with marginal extra cost by naturally leveraging the pyramidal shape of a ConvNet’s feature hierarchy.
  • Using FPN in a basic Faster R-CNN system achieves SOTA performance.

Test time detection

Same as the detection system it is applied to.

Feature Pyramid Networks

The construction of the pyramid involves a (1) bottom-up pathway; a (2) top-down pathway, and (3) lateral connections.

(1) The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. (There are often many layers producing output maps of the same size, and these layers are grouped in the same network stage.) One pyramid level is defined for each stage, and the output of the last layer of each stage is chosen to form the reference set of feature maps. These feature maps will be enriched to create the feature pyramid.

Specifically, for ResNets, feature activations output by each stage’s last residual block are chosen. The output of these last residual blocks is denoted as {C2, C3, C4, C5} for conv2, conv3, conv4, and conv5 outputs. They have strides of {4, 8, 16, 32} pixels with respect to the input image.

(2) The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.

Starting with a coarser-resolution feature map, the spatial resolution is upsampled by a factor of 2 (nearest-neighbor upsampling). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated.

To start the iteration, a 1×1 convolutional layer is attached on C5 to produce the coarsest resolution map. Finally, a 3×3 convolution is appended on each merged map to generate the final feature map, aiming to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.
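One top-down merge step can be sketched in NumPy. This sketch assumes the lateral map has already been 1×1-convolved to d channels, and the function names are mine; the final 3×3 smoothing convolution is omitted:

```python
import numpy as np

def upsample2_nearest(x):
    """Nearest-neighbor upsampling of a (C, H, W) map by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_merge(coarse, lateral):
    """One FPN merge: upsample the coarser (semantically stronger) map
    and add the same-resolution lateral (bottom-up) map element-wise."""
    return upsample2_nearest(coarse) + lateral
```

Iterating this from C5 downward produces the merged maps that, after the 3×3 convolutions, become {P2, P3, P4, P5}.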

Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, the feature dimension (number of channels, denoted as d) of all the feature maps is fixed (d=256 in the paper), and thus all extra convolutional layers have d-channel outputs.

Feature Pyramid Networks for RPN

In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map. It can be simply adapted by replacing the single-scale feature map with the FPN: the same subnetwork is attached to each level on the feature pyramid.

Because the subnetwork slides densely over all locations in all pyramid levels, multi-scale anchors on a specific level are not necessary anymore. Instead, the anchors are designed to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively and have multiple aspect ratios {1:2, 1:1, 2:1} at each level. In total, there are 15 anchors over the pyramid.

The training labels are assigned to the anchors following the same convention in Faster R-CNN. Note that scales of ground-truth boxes are not explicitly used to assign them to the levels of the pyramid; instead, ground-truth boxes are associated with anchors, which have been assigned to pyramid levels. The parameters of the subnetwork are shared across all feature pyramid levels since it’s believed that all levels of the pyramid share similar semantic levels.

Feature Pyramid Networks for Fast R-CNN

Fast R-CNN uses RoI pooling to extract features and mostly performs on a single-scale feature map. To use it with FPN, RoIs of different scales (produced by the RPN) need to be assigned to the pyramid levels. An RoI of width w and height h (on the input image) is assigned to the level P_k of the feature pyramid by:

k = ⌊k_0 + log2(√(wh)/224)⌋

where 224 is the canonical ImageNet pre-training size and k_0 = 4 is the level onto which an RoI of size 224×224 is mapped.
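The assignment rule is a one-liner in code; the clamping to the available head levels is my assumption of how out-of-range values are handled:

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    """Assign an RoI of size w x h (input-image scale) to pyramid level
    P_k via k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to the
    levels the heads are attached to."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return min(max(k, k_min), k_max)
```

Intuitively, an RoI half the linear size of 224×224 is mapped one level finer (P3), so smaller objects are pooled from higher-resolution maps.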

The predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) are attached to all RoIs of all levels. Again, the heads all share parameters, regardless of their levels.
