Paper Reading — Object Detection in 20 Years: A Survey (Part 8)(End)

Applications and Future Directions

Pedestrian Detection

Pedestrian detection, an important application of object detection, has received extensive attention in areas such as autonomous driving, video surveillance, and criminal investigation. Some early pedestrian detection methods laid a solid foundation for general object detection in terms of:

In recent years, some general object detection algorithms have been introduced to pedestrian detection and have greatly promoted the progress of this area: e.g., Faster RCNN [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Is Faster R-CNN Doing Well for Pedestrian Detection?].

Fig. 20. Some hard examples of pedestrian detection from Caltech dataset
  • (a) Small pedestrians: Some examples of the small pedestrians that are captured far from the camera. In Caltech Dataset, 15% of the pedestrians are less than 30 pixels in height.
  • (b) Hard negatives: Some backgrounds in street view images are very similar to pedestrians in their visual appearance.
  • (c) Dense and occluded pedestrian: Some examples of dense and occluded pedestrians. In the Caltech Dataset, pedestrians that haven’t been occluded only account for 29% of the total pedestrian instances.
  • Real-time detection: The real-time pedestrian detection from HD video is crucial for some applications like autonomous driving and video surveillance.

Due to the limitations of computing resources, the Haar wavelet feature was broadly used in early pedestrian detection [A Trainable System for Object Detection, Example-based object detection in images by components, Detecting pedestrians using patterns of motion and appearance].

To improve the detection of occluded pedestrians, one popular idea of that time was “detection by components”, i.e., to treat the detector as an ensemble of multiple part detectors that are trained individually on different human parts, e.g., head, legs, and arms: [Example-based object detection in images by components, Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, An HOG-LBP human detector with partial occlusion handling].
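
The “detection by components” idea can be illustrated with a toy fusion rule. The sketch below is only an assumption for illustration: the cited papers each use their own combination scheme (for example, a Bayesian combination in the edgelet work), and the function name, thresholds, and scores here are all hypothetical.

```python
def combine_part_scores(part_scores, min_visible_parts=2, part_threshold=0.5):
    """Fuse per-part confidences into one detection decision.

    part_scores maps a part name to a confidence in [0, 1].
    A part counts as visible when its confidence reaches part_threshold.
    Returns (is_person, mean score of the visible parts).
    """
    visible = [s for s in part_scores.values() if s >= part_threshold]
    if len(visible) < min_visible_parts:
        # Too few parts fired: treat the window as background.
        return False, 0.0
    return True, sum(visible) / len(visible)


# Hypothetical part-detector outputs for two candidate windows.
occluded_person = {"head": 0.9, "arms": 0.8, "legs": 0.1}  # legs hidden
background = {"head": 0.2, "arms": 0.6, "legs": 0.1}

person_result = combine_part_scores(occluded_person)
background_result = combine_part_scores(background)
```

Because the decision depends only on the parts that respond, a pedestrian whose legs are hidden can still be accepted, which is the motivation the paragraph above describes.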

With the increase of computing power, people started to design more complex detection models, and since 2005, gradient-based representation [Histograms of Oriented Gradients for Human Detection, Fast Human Detection Using a Cascade of Histograms of Oriented Gradients, An HOG-LBP human detector with partial occlusion handling, Detecting Pedestrians by Learning Shapelet Features] and DPM [Object Detection with Discriminatively Trained Part-Based Models, Object Detection with Grammar Models, 30Hz Object Detection with DPM V5] have become the mainstream of pedestrian detection.

In 2009, an effective and lightweight feature representation, the Integral Channel Features (ICF), was proposed, which relies on integral image acceleration [Integral Channel Features]. ICF then became the new benchmark of pedestrian detection at that time [Pedestrian Detection: An Evaluation of the State of the Art].
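
To make the integral image acceleration concrete, here is a minimal pure-Python sketch of a summed-area table. It is not the ICF implementation: the function names and the toy 3x3 channel are assumptions for illustration. It only shows why the sum over any rectangle of a channel costs four lookups once the table is built.

```python
def integral_image(channel):
    """Return a (H+1) x (W+1) summed-area table for a 2-D list of numbers."""
    height = len(channel)
    width = len(channel[0]) if height else 0
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += channel[y][x]
            # Sum of everything above plus the running sum of this row.
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def box_sum(table, top, left, bottom, right):
    """Sum of rows top..bottom-1 and columns left..right-1 in four lookups."""
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


# Toy channel (for example, one gradient-magnitude channel).
channel = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
table = integral_image(channel)
total = box_sum(table, 0, 0, 3, 3)         # whole channel
center_block = box_sum(table, 1, 1, 3, 3)  # rows 1-2, columns 1-2
```

The table is computed once per channel, after which every rectangular feature is a constant-time operation regardless of the rectangle size.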

In addition to the feature representation, some domain knowledge has also been considered, such as appearance constancy and shape symmetry [Pedestrian Detection Inspired by Appearance Constancy and Shape Symmetry] and stereo information [Pedestrian detection at 100 frames per second, Stixels estimation without depth map computation].

To improve small pedestrian detection. Although deep learning object detectors such as Fast/Faster R-CNN have shown state of the art performance for general object detection, they have limited success for detecting small pedestrians due to the low resolution of their convolutional features. Some recent solutions to this problem include:

To improve hard negative detection. Some recent improvements include the integration of:

To improve dense and occluded pedestrian detection. As mentioned in the paper, the features in deeper layers of CNN have richer semantics but are not effective for detecting dense objects.

Face Detection

Face detection is one of the oldest computer vision applications. Early face detectors, such as the VJ detector [Rapid Object Detection using a Boosted Cascade of Simple Features], greatly advanced object detection, and many of their ideas still play important roles in today’s object detection.

Fig. 21. Challenges in face detection.
  • (a) Intra-class variation: (image from WildestFaces Dataset) Human faces may present a variety of expressions, skin colors, poses, and movements.
  • (b) Face occlusion: (image from UFDD Dataset) Faces may be partially occluded by other objects.
  • (c) Multi-scale face detection: (image from P. Hu et al. CVPR2017) Detecting faces in a large variety of scales, especially for some tiny faces.
  • Real-time detection: Face detection on mobile devices usually requires a CPU real-time detection speed.

Early face detection algorithms can be divided into three groups:

There are two groups of face detectors in this period.

In the deep learning era, most face detection algorithms follow the detection idea of general object detectors such as Faster RCNN and SSD.

Text Detection

Text has been a major information carrier for humans for thousands of years. The fundamental goal of text detection is to determine whether or not there is text in a given image, and if there is, to localize and recognize it. Text detection has very broad applications.

Fig. 22. Challenges in text detection and recognition.
  • (a) Different fonts and languages: (image from maxpixel) Texts may have different fonts, colors, and languages.
  • (b) Text rotation and perspective distortion: (image from Y. Liu et al. CVPR2017) Texts may have different orientations and may even have perspective distortion.
  • (c) Densely arranged text localization: (image from Y. Wu et al. ICCV2017) Text lines with large aspect ratios and dense layout are difficult to localize accurately.
  • Broken and blurred characters: Broken and blurred characters are common in street view images.

Step-wise detection methods [Scene Text Localization and Recognition with Oriented Stroke Detection, Robust Text Detection in Natural Scene Images] consist of a series of processing steps including character segmentation, candidate region verification, character grouping, and word recognition.

  • Advantage: Most of the background can be filtered in the coarse segmentation step, which greatly reduces the computational cost of the following process.
  • Disadvantage: The parameters of all steps need to be set carefully, and errors can occur at each step and accumulate across them.

Integrated methods [End-to-end scene text recognition, End-to-end text recognition with convolutional neural networks, Text Flow: A Unified Text Detection System in Natural Scene Images, Deep Features for Text Spotting] frame the text detection as a joint probability inference problem, where the steps of character localization, grouping, and recognition are processed under a unified framework.

  • Advantage: It avoids the cumulative error and makes it easy to integrate language models.
  • Disadvantage: The inference will be computationally expensive when considering a large number of character classes and candidate windows.

Most of the traditional text detection methods generate text candidates in an unsupervised way, where the commonly used techniques include Maximally Stable Extremal Regions (MSER) segmentation [Robust Text Detection in Natural Scene Images] and morphological filtering [Multi-Orientation Scene Text Detection with Adaptive Clustering].

Some domain knowledge, such as the symmetry of texts and the structures of strokes, has also been considered in these methods [Scene Text Localization and Recognition with Oriented Stroke Detection, Robust Text Detection in Natural Scene Images, Symmetry-based text line detection in natural scenes].

In recent years, researchers have paid more attention to the problem of text localization rather than recognition. Two groups of methods have been proposed recently.

The recent deep learning based text detection methods have proposed some solutions to the above problems.

Traffic Sign and Traffic Light Detection

With the development of self-driving technology, the automatic detection of traffic signs and traffic lights has attracted great attention in recent years.

Fig. 23. Challenges in traffic sign detection and traffic light detection.
  • (a) Illumination changes: (image from pxhere) The detection will be particularly difficult when driving into the sun glare or at night.
  • (b) Motion blur: (image from GTSRB Dataset) The image captured by an on-board camera will become blurred due to the motion of the car.
  • (c) Detection under bad weather: (image from Flickr and Max Pixel) In bad weather, e.g., rainy and snowy days, the image quality will be affected.
  • Real-time detection: This is particularly important for autonomous driving.

Because traffic signs and lights have particular shapes and colors, the traditional detection methods are usually based on:

Because the above methods are designed purely on low-level vision, they usually fail in complex environments. Some researchers therefore began to look for solutions beyond vision-based approaches, e.g., combining GPS and digital maps in traffic light detection [Traffic light mapping and detection, Traffic light mapping, localization, and state detection for autonomous vehicles].

In the deep learning era, some well-known detectors such as Faster RCNN and SSD were applied to traffic sign/light detection tasks [Traffic-Sign Detection and Classification in the Wild, A deep learning approach to traffic lights: Detection, tracking, and classification, Traffic signal detection and classification in street views using an attention model, Deep Convolutional Traffic Light Recognition for Automated Driving].

On the basis of these detectors, some new techniques, such as the attention mechanism and adversarial training, have been used to improve detection under complex traffic environments [Perceptual Generative Adversarial Networks for Small Object Detection, Traffic signal detection and classification in street views using an attention model].

Remote Sensing Target Detection

Remote sensing imaging has opened a door for people to better understand the earth. In recent years, as the resolution of remote sensing images has increased, remote sensing target detection (e.g., the detection of airplanes, ships, oil-pots, etc.) has become a research hot-spot. Remote sensing target detection has broad applications, such as military investigation, disaster rescue, and urban traffic management.

Fig. 24. Challenges in remote sensing target detection.
  • (a) Detection in “big data”: A comparison on data volume between remote sensing images and natural images (VOC, ImageNet, and MS-COCO). Due to the huge data volume of remote sensing images, how to quickly and accurately detect remote sensing targets remains a problem.
  • (b) Targets occluded by cloud: (images from S. Qiu et al. JSTARS2017 and Z. Zou et al. TGRS2016) Over 50% of the earth’s surface is covered by cloud every day.
  • Domain adaptation: Remote sensing images captured by different sensors (e.g., with different modalities and resolutions) differ considerably from one another.

Most of the traditional remote sensing target detection methods follow a two-stage detection paradigm:

Stage 1: Candidate extraction. Some frequently used methods include:

What the above methods have in common is that they are all unsupervised, and thus they usually fail in complex environments.

Stage 2: Target verification. In the target verification stage, some frequently used features include:

To detect targets with particular structure and shape such as oil-pots and inshore ships, some domain knowledge is used. For example, oil-pot detection can be considered as a circle/arc detection problem [A Hierarchical Oil Tank Detector With Deep Surrounding Features for High-Resolution Optical Satellite Imagery, Framework design and implementation for oil tank detection in optical satellite imagery]. Inshore ship detection can be considered as the detection of the foredeck and the stern [A New Method on Inshore Ship Detection in High-Resolution Satellite Images Using Shape and Context Information, Automatic Detection of Inshore Ships in High-Resolution Remote Sensing Images Using Robust Invariant Generalized Hough Transform].

To improve the occluded target detection, one commonly used idea is “detection by parts” [Occluded Object Detection in High-Resolution Remote Sensing Images Using Partial Configuration Object Model, An On-Road Vehicle Detection Method for High-Resolution Aerial Images Based on Local and Global Structure Learning].

To detect targets with different orientations, the “mixture model” is used by training different detectors for targets of different orientations [Multi-class geospatial object detection and geographic image classification based on collection of part detectors].
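
The “mixture model” idea can be sketched as scoring each window with every orientation-specific detector and keeping the best response. The code below is a toy illustration under stated assumptions: the linear templates, the component labels, and the function names are hypothetical and are not taken from the cited paper.

```python
def make_linear_detector(template):
    """Build a toy detector that scores a window by a dot product."""
    def detector(window):
        return sum(w * t for w, t in zip(window, template))
    return detector


def mixture_detect(window, components):
    """Score one window with every component and keep the best one."""
    best_label, best_score = None, float("-inf")
    for label, detector in components.items():
        score = detector(window)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Two hypothetical components on flattened 3x3 windows:
# one tuned to horizontal targets, one to vertical targets.
components = {
    "0_deg": make_linear_detector([0, 0, 0, 1, 1, 1, 0, 0, 0]),
    "90_deg": make_linear_detector([0, 1, 0, 0, 1, 0, 0, 1, 0]),
}

vertical_bar = [0, 1, 0, 0, 1, 0, 0, 1, 0]
label, score = mixture_detect(vertical_bar, components)
```

In a real system each component would be a trained detector for one orientation bin. Taking the maximum over components lets the ensemble respond to a target at any of the trained orientations.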

After the great success of RCNN in 2014, deep CNNs were soon applied to remote sensing target detection [RIFD-CNN: Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object Detection, Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images, Efficient Saliency-Based Object Detection in Remote Sensing Images Using Deep Belief Networks, Airport Detection on Optical Satellite Images Using Deep Convolutional Neural Networks]. General object detection frameworks such as Faster RCNN and SSD have attracted increasing attention in the remote sensing community [Random Access Memories: A New Paradigm for Target Detection in High Resolution Aerial Remote Sensing Images, Vehicle Detection in Aerial Images Based on Region Convolutional Neural Networks and Hard Negative Example Mining, Ship Detection in Spaceborne Optical Image With SVD Networks, Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?, An Efficient and Robust Integrated Geospatial Object Detection Framework for High Spatial Resolution Remote Sensing Imagery, Deformable ConvNet with Aspect Ratio Constrained NMS for Object Detection in Remote Sensing Imagery, Integrated Localization and Recognition for Inshore Ships in Large Scene Remote Sensing Images].

Due to the huge difference between remote sensing images and everyday images, some investigations have been made into the effectiveness of deep CNN features for remote sensing images [Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?, Fast Vehicle Detection in Aerial Imagery, Comprehensive Analysis of Deep Learning-Based Vehicle Detection in Aerial Images]. People discovered that, in spite of its great success, the deep CNN is no better than traditional methods for spectral data [Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?].

To detect targets with different orientations, some researchers have improved the ROI Pooling layer for better rotation invariance [Online Exemplar-Based Fully Convolutional Network for Aircraft Detection in Remote Sensing Images, Rotated region based CNN for ship detection].

To improve domain adaptation, some researchers formulated the detection from a Bayesian view that at the detection stage, the model is adaptively updated based on the distribution of test images [Random Access Memories: A New Paradigm for Target Detection in High Resolution Aerial Remote Sensing Images].

In addition, attention mechanisms and feature fusion strategies have also been used to improve small target detection [Fully Convolutional Network With Task Partitioning for Inshore Ship Detection in Optical Remote Sensing Images, Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale Fully Convolutional Network].

Future Directions

Future research on object detection may focus on, but is not limited to, the following aspects:

  • Lightweight object detection: To speed up the detection algorithm so that it can run smoothly on mobile devices. Some important applications include mobile augmented reality, smart cameras, face verification, etc.
  • Detection meets AutoML: Recent deep learning based detectors are becoming more and more sophisticated and rely heavily on experience. A future direction is to reduce human intervention when designing the detection model (e.g., how to design the engine and how to set anchor boxes) by using neural architecture search.
  • Detection meets domain adaptation: The training process of any target detector can be essentially considered as a likelihood estimation process under the assumption of independent and identically distributed (i.i.d.) data. Object detection with non-i.i.d. data, especially for some real-world applications, still remains a challenge. GAN has shown promising results in domain adaptation and may be of great help to object detection in the future.
  • Weakly supervised detection: The training of a deep learning based detector usually relies on a large amount of well-annotated images. The annotation process is time-consuming, expensive, and inefficient. Developing weakly supervised detection techniques where the detectors are only trained with image-level annotations, or partially with bounding box annotations is of great importance for reducing labor costs and improving detection flexibility.
  • Small object detection: Detecting small objects in large scenes has long been a challenge. Potential applications of this research direction include counting the population of wild animals with remote sensing images and detecting the state of some important military targets. Further directions may include the integration of visual attention mechanisms and the design of high-resolution lightweight networks.
  • Detection in videos: Real-time object detection/tracking in HD videos is of great importance for video surveillance and autonomous driving. Traditional object detectors are usually designed for image-wise detection and simply ignore the correlations between video frames. Improving detection by exploring the spatial and temporal correlation is an important research direction.
  • Detection with information fusion: Object detection with multiple sources/modalities of data, e.g., RGB-D images, 3D point clouds, LIDAR, etc., is of great importance for autonomous driving and drone applications. Some open questions include: how to migrate well-trained detectors to different modalities of data, how to fuse information to improve detection, etc.
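
The i.i.d. assumption mentioned under domain adaptation above can be written out explicitly. The notation below is an assumption for illustration and does not come from the survey: x_i is a training input, y_i its annotation, θ the detector parameters, and N the number of training samples.

```latex
\hat{\theta}
  = \arg\max_{\theta} \prod_{i=1}^{N} p(y_i \mid x_i; \theta)
  = \arg\max_{\theta} \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta),
\qquad (x_i, y_i) \overset{\text{i.i.d.}}{\sim} p_{\text{train}}(x, y).
```

The product form is justified only when the samples are independent and identically distributed. If the test data follow a different distribution from p_train, the estimate is fitted to the wrong distribution, which is the gap that domain adaptation tries to close.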

Senior Data Scientist@ViSenze. If I don’t create, I don’t understand.