Paper Reading — Object Detection in 20 Years: A Survey (Part 8, End)
Object Detection in 20 Years: A Survey
Object detection, as one of the most fundamental and challenging problems in computer vision, has received great…
Pedestrian detection, as an important object detection application, has received extensive attention in many areas such as autonomous driving, video surveillance, and criminal investigation. Some early pedestrian detection methods laid a solid foundation for general object detection in terms of:
- Feature representation: HOG detector [Histograms of Oriented Gradients for Human Detection], ICF detector [Integral Channel Features]
- Design of classifier: [Classification using intersection kernel support vector machines is efficient]
- Detection acceleration: [Fast Human Detection Using a Cascade of Histograms of Oriented Gradients].
In recent years, some general object detection algorithms have been introduced to pedestrian detection and have greatly promoted the progress of this area: e.g., Faster RCNN [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Is Faster R-CNN Doing Well for Pedestrian Detection?].
Difficulties and Challenges
- (a) Small pedestrians: Pedestrians captured far from the camera can be very small. In the Caltech Dataset, 15% of the pedestrians are less than 30 pixels in height.
- (b) Hard negatives: Some backgrounds in street view images are very similar to pedestrians in their visual appearance.
- (c) Dense and occluded pedestrians: In the Caltech Dataset, pedestrians without any occlusion account for only 29% of the total pedestrian instances.
- Real-time detection: Real-time pedestrian detection in HD video is crucial for applications like autonomous driving and video surveillance.
Traditional pedestrian detection methods
Due to the limitations of computing resources, the Haar wavelet feature was broadly used in early pedestrian detection [A Trainable System for Object Detection, Example-based object detection in images by components, Detecting pedestrians using patterns of motion and appearance].
To improve the detection of occluded pedestrians, one popular idea of that time was “detection by components”, i.e., to treat detection as an ensemble of multiple part detectors trained individually on different human parts, e.g., head, legs, and arms: [Example-based object detection in images by components, Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors, An HOG-LBP human detector with partial occlusion handling].
With the increase of computing power, people started to design more complex detection models. Since 2005, gradient-based representations [Histograms of Oriented Gradients for Human Detection, Fast Human Detection Using a Cascade of Histograms of Oriented Gradients, An HOG-LBP human detector with partial occlusion handling, Detecting Pedestrians by Learning Shapelet Features] and DPM [Object Detection with Discriminatively Trained Part-Based Models, Object Detection with Grammar Models, 30Hz Object Detection with DPM V5] have been the mainstream of pedestrian detection.
In 2009, by using integral image acceleration, an effective and lightweight feature representation, the Integral Channel Features (ICF), was proposed [Integral Channel Features]. ICF then became the new benchmark of pedestrian detection at that time [Pedestrian Detection: An Evaluation of the State of the Art].
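The integral-image trick that makes ICF fast can be sketched in a few lines of NumPy. This is a generic illustration of the acceleration (not the paper's actual channel features): after one pass of cumulative sums, the sum over any rectangle costs only four lookups.

```python
import numpy as np

def integral_image(img):
    """Integral image with a zero-padded first row/column:
    ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) via four lookups."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
```

In ICF, the same four-lookup trick is applied per feature channel, which is what makes evaluating thousands of rectangular features per window affordable.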
In addition to feature representation, some domain knowledge has also been considered, such as appearance constancy and shape symmetry [Pedestrian Detection Inspired by Appearance Constancy and Shape Symmetry] and stereo information [Pedestrian detection at 100 frames per second, Stixels estimation without depth map computation].
Deep learning based pedestrian detection methods
To improve small pedestrian detection. Although deep learning object detectors such as Fast/Faster R-CNN have shown state-of-the-art performance for general object detection, they have had limited success in detecting small pedestrians due to the low resolution of their convolutional features. Some recent solutions to this problem include:
- Feature fusion: [Is Faster R-CNN Doing Well for Pedestrian Detection?]
- Introducing extra high-resolution handcrafted features: [Learning Multilayer Channel Features for Pedestrian Detection, What Can Help Pedestrian Detection?]
- Ensembling detection results on multiple resolutions: [Pushing the Limits of Deep CNNs for Pedestrian Detection]
To improve hard negative detection. Some recent improvements include the integration of:
- Boosted decision tree: [Is Faster R-CNN Doing Well for Pedestrian Detection?]
- Semantic segmentation (as the context of the pedestrians): [Pedestrian Detection aided by Deep Learning Semantic Tasks]
- In addition, the idea of “cross-modal learning” has also been introduced to enrich the features of hard negatives by using both RGB and infrared images [Learning Cross-Modal Deep Representations for Robust Pedestrian Detection].
To improve dense and occluded pedestrian detection. As mentioned in the paper, the features in the deeper layers of a CNN have richer semantics but are less effective for detecting dense objects.
- To this end, some researchers have designed new loss functions that consider both the attraction of the target and the repulsion of other surrounding objects [Repulsion Loss: Detecting Pedestrians in a Crowd].
- Target occlusion is another problem that usually comes with dense pedestrians. The ensemble of part detectors [Deep Learning Strong Parts for Pedestrian Detection, Jointly Learning Deep Features, Deformable Parts, Occlusion and Classification for Pedestrian Detection] and the attention mechanism [Occluded Pedestrian Detection Through Guided Attention in CNNs] are the most common ways to improve occluded pedestrian detection.
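The attraction-plus-repulsion idea behind Repulsion Loss can be illustrated with a deliberately simplified sketch. Note the assumptions: the actual paper uses smooth-ln RepGT/RepBox terms, whereas the function name, the plain L1 attraction, and the `alpha` weight below are made up for illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x0, y0, x1, y1] boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def repulsion_style_loss(pred, target, other_gts, alpha=0.5):
    """Attraction (L1) pulls pred toward its assigned target box;
    the repulsion term penalizes overlap with *other* ground truths,
    discouraging boxes that drift onto neighboring pedestrians."""
    attraction = float(np.abs(np.asarray(pred) - np.asarray(target)).sum())
    repulsion = sum(iou(pred, g) for g in other_gts)
    return attraction + alpha * repulsion
```

In a crowd, a plain regression loss is indifferent to a box sliding toward an adjacent person as long as it still overlaps its own target; the repulsion term breaks that indifference.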
Face detection is one of the oldest computer vision applications. Early face detection methods, such as the VJ detector [Rapid Object Detection using a Boosted Cascade of Simple Features], greatly promoted object detection, and many of their remarkable ideas still play important roles in today’s detectors.
Difficulties and Challenges
- (a) Intra-class variation: (image from WildestFaces Dataset) Human faces may present a variety of expressions, skin colors, poses, and movements.
- (b) Face occlusion: (image from UFDD Dataset) Faces may be partially occluded by other objects.
- (c) Multi-scale face detection: (image from P. Hu et al. CVPR2017) Faces need to be detected across a large variety of scales, especially tiny faces.
- Real-time detection: Face detection on mobile devices usually requires real-time detection speed on a CPU.
Early face detection (before 2001)
Early face detection algorithms can be divided into three groups:
- Rule-based methods. This group of methods encodes human knowledge of what constitutes a typical face and captures the relationships between facial elements [Human face detection in a complex background, Finding face features].
- Subspace analysis-based methods. This group of methods analyzes the face distribution in an underlying linear subspace [Eigenfaces for Recognition, View-based and modular eigenspaces for face recognition]. Eigenfaces is the representative method of this group.
- Learning-based methods. These methods frame face detection as a sliding window + binary classification (face vs. background) process. Commonly used models in this group include neural networks [Original approach for the localisation of objects in images, Human face detection in visual scenes, Neural network-based face detection] and SVMs [A general framework for object detection, Training support vector machines: an application to face detection].
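The sliding-window formulation above can be sketched as follows. The brightness "classifier" here is a toy stand-in for a trained neural network or SVM; the window size, stride, and threshold are arbitrary illustrative choices.

```python
import numpy as np

def sliding_window_detect(image, classifier, win=(24, 24), stride=8, thresh=0.5):
    """Scan a fixed-size window over the image; every window the
    binary classifier scores above `thresh` becomes a detection."""
    H, W = image.shape
    wh, ww = win
    detections = []
    for y in range(0, H - wh + 1, stride):
        for x in range(0, W - ww + 1, stride):
            score = classifier(image[y:y + wh, x:x + ww])
            if score > thresh:
                detections.append((x, y, ww, wh, score))
    return detections

# toy stand-in classifier: "face" = mostly bright patch
brightness = lambda patch: patch.mean() / 255.0
img = np.zeros((64, 64))
img[8:32, 8:32] = 255
hits = sliding_window_detect(img, brightness)
```

Real detectors of this era repeated the scan over an image pyramid to handle scale, which is precisely why the per-window classifier had to be so cheap.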
Traditional face detection (2000–2015)
There are two groups of face detectors in this period.
- The first group of methods is built on boosted decision trees [Rapid Object Detection using a Boosted Cascade of Simple Features, Robust Real-Time Face Detection, Boosting chain learning for object detection]. These methods are fast to compute but usually suffer from low detection accuracy in complex scenes.
- The second group is based on early convolutional neural networks, where the shared computation of features is used to speed up detection [A neural architecture for fast and robust face detection, Synergistic Face Detection and Pose Estimation with Energy-Based Models].
Deep learning based face detection (after 2015)
In the deep learning era, most face detection algorithms follow the detection ideas of general object detectors such as Faster RCNN and SSD.
- To speed up face detection: Cascaded detection is the most common way to speed up a face detector in the deep learning era [A convolutional neural network cascade for face detection, Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks]. Another speed-up method is to predict the scale distribution of the faces in an image [Scale-Aware Face Detection] and then run detection only on some selected scales.
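The cascaded-rejection idea, common to both the classic VJ detector and deep cascades such as MTCNN, can be sketched like this. The two stage functions and their thresholds are toy placeholders, not a real cascade:

```python
import numpy as np

def cascade_classify(patch, stages):
    """Run increasingly expensive stages in order; reject as soon
    as any stage scores below its threshold, so most background
    patches never reach the costly later stages."""
    for score_fn, threshold in stages:
        if score_fn(patch) < threshold:
            return False  # early rejection
    return True  # survived every stage -> detection

stages = [
    (lambda p: p.mean(), 50.0),  # cheap brightness check first
    (lambda p: p.std(), 10.0),   # pricier texture check last
]
face_like = np.tile([0.0, 255.0], (8, 4))  # high-contrast toy "face"
background = np.zeros((8, 8))              # flat toy background
```

Because the vast majority of windows in an image are easy background, the average cost per window stays close to the cost of the first, cheapest stage.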
- To improve multi-pose and occluded face detection: The idea of “face calibration” has been used to improve multi-pose face detection by estimating the calibration parameters [Supervised Transformer Network for Efficient Face Detection] or using progressive calibration through multiple detection stages [Real-Time Rotation-Invariant Face Detection with Progressive Calibration Networks].
- To improve occluded face detection, two methods have been proposed recently. The first is to incorporate an “attention mechanism” to highlight the features of the underlying face targets [Face Attention Network: An Effective Face Detector for the Occluded Faces]. The second is “detection based on parts” [Faceness-Net: Face Detection through Deep Facial Part Responses], which inherits ideas from DPM.
- To improve multi-scale face detection: Recent works on multi-scale face detection [Finding Tiny Faces, Face Detection through Scale-Friendly Deep Convolutional Networks, SSH: Single Stage Headless Face Detector, S³FD: Single Shot Scale-invariant Face Detector] use similar detection strategies as those in general object detection, including multi-scale feature fusion and multi-resolution detection.
Text has been the major information carrier for humans for thousands of years. The fundamental goal of text detection is to determine whether or not there is text in a given image and, if there is, to localize and recognize it. Text detection has very broad applications.
- It helps people who are visually impaired to “read” street signs and currency [A camera phone based currency reader for the visually impaired, Improved text-detection methods for a camera-based text reading system for blind persons].
- In geographic information systems, the detection and recognition of house numbers and street signs make it easier to build digital maps [Convolutional Neural Networks Applied to House Numbers Digit Classification, Attention-based Extraction of Structured Information from Street View Imagery].
Difficulties and Challenges
- (a) Different fonts and languages: (image from maxpixel) Texts may have different fonts, colors, and languages.
- (b) Text rotation and perspective distortion: (image from Y. Liu et al. CVPR2017) Texts may have different orientations and even may have perspective distortion.
- (c) Densely arranged text localization: (image from Y. Wu et al. ICCV2017) Text lines with large aspect ratios and dense layout are difficult to localize accurately.
- Broken and blurred characters: Broken and blurred characters are common in street view images.
Step-wise detection vs integrated detection
Step-wise detection methods [Scene Text Localization and Recognition with Oriented Stroke Detection, Robust Text Detection in Natural Scene Images] consist of a series of processing steps including character segmentation, candidate region verification, character grouping, and word recognition.
- Advantage: Most of the background can be filtered in the coarse segmentation step, which greatly reduces the computational cost of the following process.
- Disadvantage: The parameters of all steps need to be set carefully, and errors can occur and accumulate through each of these steps.
Integrated methods [End-to-end scene text recognition, End-to-end text recognition with convolutional neural networks, Text Flow: A Unified Text Detection System in Natural Scene Images, Deep Features for Text Spotting] frame text detection as a joint probability inference problem, where the steps of character localization, grouping, and recognition are processed under a unified framework.
- Advantage: They avoid cumulative errors and make it easy to integrate language models.
- Disadvantage: The inference will be computationally expensive when considering a large number of character classes and candidate windows.
Traditional methods vs deep learning methods
Most of the traditional text detection methods generate text candidates in an unsupervised way, where the commonly used techniques include Maximally Stable Extremal Regions (MSER) segmentation [Robust Text Detection in Natural Scene Images] and morphological filtering [Multi-Orientation Scene Text Detection with Adaptive Clustering].
Some domain knowledge, such as the symmetry of texts and the structures of strokes, has also been considered in these methods [Scene Text Localization and Recognition with Oriented Stroke Detection, Robust Text Detection in Natural Scene Images, Symmetry-based text line detection in natural scenes].
In recent years, researchers have paid more attention to the problem of text localization rather than recognition. Two groups of methods are proposed recently.
- The first group frames text detection as a special case of general object detection [Single Shot Text Detector with Regional Attention, Reading Text in the Wild with Convolutional Neural Networks, etc]. These methods have a unified detection framework, but they are less effective for detecting rotated text or text with large aspect ratios.
- The second group frames text detection as an image segmentation problem [Deep Matching Prior Network: Toward Tighter Multi-oriented Text Detection, Self-Organized Text Detection with Minimal Post-processing via Border Learning, Scene Text Detection via Holistic, Multi-Channel Prediction, Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping, Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation]. The advantage of these methods is that there are no special restrictions on the shape and orientation of the text; the disadvantage is that it is not easy to distinguish densely arranged text lines from each other based on the segmentation result.
The recent deep learning based text detection methods have proposed some solutions to the above problems.
- For text rotation and perspective changes: The most common solution to this problem is to introduce additional parameters in anchor boxes and RoI pooling layer that are associated with rotation and perspective changes [Arbitrary-Oriented Scene Text Detection via Rotation Proposals, R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection, Deep Direct Regression for Multi-Oriented Scene Text Detection, Deep Matching Prior Network: Toward Tighter Multi-oriented Text Detection, EAST: An Efficient and Accurate Scene Text Detector].
- To improve densely arranged text detection: The segmentation-based approach shows more advantages in detecting densely arranged texts. To distinguish adjacent text lines, two groups of solutions have been proposed recently. The first is “segment and linking”, where “segment” refers to the character heatmap and “linking” refers to the connection between two adjacent segments indicating that they belong to the same word or line of text [Deep Matching Prior Network: Toward Tighter Multi-oriented Text Detection, Scene Text Detection via Holistic, Multi-Channel Prediction]. The second group introduces an additional corner/border detection task to help separate densely arranged texts, where a group of corners or a closed boundary corresponds to an individual line of text [Self-Organized Text Detection with Minimal Post-processing via Border Learning, Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping, Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation].
- To improve broken and blurred text detection: A recent idea for dealing with broken and blurred texts is to use word-level [Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, Detecting Text in Natural Image with Connectionist Text Proposal Network] and sentence-level recognition [Attention-based Extraction of Structured Information from Street View Imagery]. To deal with texts in different fonts, the most effective way is training with synthetic samples [Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, Reading Text in the Wild with Convolutional Neural Networks].
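The “segment and linking” grouping described above amounts to connected-component merging over predicted links, which can be sketched with a small union-find. In real systems a CNN predicts the link scores; the segment ids and link pairs below are made-up toy data.

```python
def group_segments(segments, links):
    """Union-find: character segments joined by predicted 'link'
    pairs end up in the same word/text-line group."""
    parent = {s: s for s in segments}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for s in segments:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())

# five character segments; links join two separate text lines
lines = group_segments([0, 1, 2, 3, 4], [(0, 1), (1, 2), (3, 4)])
```

Because links are predicted only between adjacent segments, two text lines that touch in the segmentation map can still be kept apart as long as no link connects them.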
Traffic Sign and Traffic Light Detection
With the development of self-driving technology, the automatic detection of traffic signs and traffic lights has attracted great attention in recent years.
Difficulties and Challenges
- (a) Illumination changes: (image from pxhere) The detection will be particularly difficult when driving into the sun glare or at night.
- (b) Motion blur: (image from GTSRB Dataset) The image captured by an on-board camera will become blurred due to the motion of the car.
- (c) Detection under bad weather: (image from Flickr and Max Pixel) In bad weather, e.g., on rainy and snowy days, image quality suffers.
- Real-time detection: This is particularly important for autonomous driving.
Traditional detection methods
As traffic signs/lights have particular shapes and colors, traditional detection methods are usually based on:
- Color thresholding: [Automatic Detection and Classification of Traffic Signs, Traffic sign recognition and analysis for intelligent vehicles, Road traffic sign detection in color images, Road-Sign Detection and Recognition Based on Support Vector Machines, Traffic light detection with color and edge information]
- Visual saliency detection: [Unifying visual saliency with HOG feature learning for traffic sign detection]
- Morphological filtering: [Real time visual traffic lights recognition based on Spot Light Detection and adaptive traffic lights templates]
- Edge/contour analysis: [A single target voting scheme for traffic sign detection, Fast and Robust Traffic Sign Detection]
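The color-thresholding idea from the list above can be sketched in a few lines. The channel thresholds here are invented for illustration; real systems typically threshold in HSV space and follow up with shape or edge checks.

```python
import numpy as np

def red_sign_mask(rgb):
    """Mark pixels whose red channel clearly dominates green and
    blue -- a crude stand-in for a red traffic-sign color mask."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    return (r > 120) & (r - g > 50) & (r - b > 50)

img = np.zeros((4, 4, 3), dtype=np.uint8)
img[1:3, 1:3] = [200, 30, 30]  # a small red patch
mask = red_sign_mask(img)
```

Fixed thresholds like these are exactly what breaks down under sun glare or at night, which motivates the learning-based methods discussed next.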
As the above methods are designed merely on low-level vision cues, they usually fail in complex environments. Therefore, some researchers began to look for solutions beyond vision-based approaches, e.g., combining GPS and digital maps in traffic light detection [Traffic light mapping and detection, Traffic light mapping, localization, and state detection for autonomous vehicles].
Deep learning based detection methods
In the deep learning era, some well-known detectors such as Faster RCNN and SSD were applied to traffic sign/light detection tasks [Traffic-Sign Detection and Classification in the Wild, A deep learning approach to traffic lights: Detection, tracking, and classification, Traffic signal detection and classification in street views using an attention model, Deep Convolutional Traffic Light Recognition for Automated Driving].
On the basis of these detectors, some new techniques, such as the attention mechanism and adversarial training, have been used to improve detection under complex traffic environments [Perceptual Generative Adversarial Networks for Small Object Detection, Traffic signal detection and classification in street views using an attention model].
Remote Sensing Target Detection
Remote sensing imaging techniques have opened a door for people to better understand the earth. In recent years, as the resolution of remote sensing images has increased, remote sensing target detection (e.g., the detection of airplanes, ships, and oil tanks) has become a research hotspot, with broad applications such as military investigation, disaster rescue, and urban traffic management.
Difficulties and Challenges
- (a) Detection in “big data”: The data volume of remote sensing images is far larger than that of natural image datasets such as VOC, ImageNet, and MS-COCO. How to quickly and accurately detect remote sensing targets at this scale remains a problem.
- (b) Targets occluded by cloud: (images from S. Qiu et al. JSTARS2017 and Z. Zou et al. TGRS2016) Over 50% of the earth’s surface is covered by cloud every day.
- Domain adaptation: Remote sensing images captured by different sensors (e.g., with different modalities and resolutions) present a high degree of difference.
Traditional detection methods
Most of the traditional remote sensing target detection methods follow a two-stage detection paradigm:
Stage 1: Candidate extraction. Some frequently used methods include:
- Gray value filtering based methods: [Characterization of a Bayesian Ship Detection Method in Optical Satellite Images, A Novel Hierarchical Method of Ship Detection from Spaceborne Optical Image Based on Shape and Texture Features]
- Visual saliency-based methods: [Unsupervised Ship Detection Based on Saliency and S-HOG Descriptor From Optical Satellite Images, A Visual Search Inspired Computational Model for Ship Detection in Optical Satellite Images, Efficient, simultaneous detection of multi-class geospatial targets based on visual saliency modeling and discriminative learning of sparse coding, Object Detection in Optical Remote Sensing Images Based on Weakly Supervised Learning and High-Level Feature Learning]
- Wavelet transform based methods: [Compressed-Domain Ship Detection on Spaceborne Optical Image Using Deep Neural Network and Extreme Learning Machine]
- Anomaly detection based methods: [Ship Detection in High-Resolution Optical Imagery Based on Anomaly Detector and Local Shape Feature]
All of the above methods are unsupervised, and thus they usually fail in complex environments.
Stage 2: Target verification. In the target verification stage, some frequently used features include:
- HOG (Histogram of oriented gradients): [Ship Detection in High-Resolution Optical Imagery Based on Anomaly Detector and Local Shape Feature, Vehicle Detection Using Partial Least Squares]
- LBP (Local Binary Pattern): [A Novel Hierarchical Method of Ship Detection from Spaceborne Optical Image Based on Shape and Texture Features]
- SIFT (Scale-invariant feature transform): [A Visual Search Inspired Computational Model for Ship Detection in Optical Satellite Images, Object Detection in Optical Remote Sensing Images Based on Weakly Supervised Learning and High-Level Feature Learning, Affine Invariant Description and Large-Margin Dimensionality Reduction for Target Detection in Optical Remote Sensing Images]
- In addition, some other methods follow the sliding window detection paradigm: [Vehicle Detection Using Partial Least Squares, Affine Invariant Description and Large-Margin Dimensionality Reduction for Target Detection in Optical Remote Sensing Images, Robust Vehicle Detection in Aerial Images Using Bag-of-Words and Orientation Aware Scanning, Detection of Cars in High-Resolution Aerial Images of Complex Urban Environments].
To detect targets with particular structures and shapes, such as oil tanks and inshore ships, some domain knowledge is used. For example, oil tank detection can be treated as a circle/arc detection problem [A Hierarchical Oil Tank Detector With Deep Surrounding Features for High-Resolution Optical Satellite Imagery, Framework design and implementation for oil tank detection in optical satellite imagery]. Inshore ship detection can be treated as the detection of the foredeck and the stern [A New Method on Inshore Ship Detection in High-Resolution Satellite Images Using Shape and Context Information, Automatic Detection of Inshore Ships in High-Resolution Remote Sensing Images Using Robust Invariant Generalized Hough Transform].
To improve the occluded target detection, one commonly used idea is “detection by parts” [Occluded Object Detection in High-Resolution Remote Sensing Images Using Partial Configuration Object Model, An On-Road Vehicle Detection Method for High-Resolution Aerial Images Based on Local and Global Structure Learning].
To detect targets with different orientations, a “mixture model” is used, training separate detectors for targets of different orientations [Multi-class geospatial object detection and geographic image classification based on collection of part detectors].
Deep learning based detection methods
After the great success of RCNN in 2014, deep CNNs were soon applied to remote sensing target detection [RIFD-CNN: Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object Detection, Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images, Efficient Saliency-Based Object Detection in Remote Sensing Images Using Deep Belief Networks, Airport Detection on Optical Satellite Images Using Deep Convolutional Neural Networks]. General object detection frameworks like Faster RCNN and SSD have attracted increasing attention in the remote sensing community [Random Access Memories: A New Paradigm for Target Detection in High Resolution Aerial Remote Sensing Images, Vehicle Detection in Aerial Images Based on Region Convolutional Neural Networks and Hard Negative Example Mining, Ship Detection in Spaceborne Optical Image With SVD Networks, Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image?, An Efficient and Robust Integrated Geospatial Object Detection Framework for High Spatial Resolution Remote Sensing Imagery, Deformable ConvNet with Aspect Ratio Constrained NMS for Object Detection in Remote Sensing Imagery, Integrated Localization and Recognition for Inshore Ships in Large Scene Remote Sensing Images].
Due to the huge difference between remote sensing images and everyday images, some investigations have been made into the effectiveness of deep CNN features for remote sensing images [Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?, Fast Vehicle Detection in Aerial Imagery, Comprehensive Analysis of Deep Learning-Based Vehicle Detection in Aerial Images]. People discovered that, in spite of their great success, deep CNNs are no better than traditional methods for spectral data [Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?].
To detect targets with different orientations, some researchers have improved the ROI Pooling layer for better rotation invariance [Online Exemplar-Based Fully Convolutional Network for Aircraft Detection in Remote Sensing Images, Rotated region based CNN for ship detection].
To improve domain adaptation, some researchers have formulated detection from a Bayesian view: at the detection stage, the model is adaptively updated based on the distribution of the test images [Random Access Memories: A New Paradigm for Target Detection in High Resolution Aerial Remote Sensing Images].
In addition, attention mechanisms and feature fusion strategies have also been used to improve small target detection [Fully Convolutional Network With Task Partitioning for Inshore Ship Detection in Optical Remote Sensing Images, Maritime Semantic Labeling of Optical Remote Sensing Images with Multi-Scale Fully Convolutional Network].
Future research on object detection may focus on, but is not limited to, the following aspects:
- Lightweight object detection: To speed up the detection algorithm so that it can run smoothly on mobile devices. Some important applications include mobile augmented reality, smart cameras, face verification, etc.
- Detection meets AutoML: Recent deep learning based detectors are becoming more and more sophisticated and rely heavily on experience. A future direction is to reduce human intervention in designing the detection model (e.g., how to design the engine and how to set anchor boxes) by using neural architecture search.
- Detection meets domain adaptation: The training process of any target detector can essentially be considered a likelihood estimation process under the assumption of independent and identically distributed (i.i.d.) data. Object detection with non-i.i.d. data, especially in some real-world applications, still remains a challenge. GANs have shown promising results in domain adaptation and may be of great help to object detection in the future.
- Weakly supervised detection: The training of a deep learning based detector usually relies on a large amount of well-annotated images. The annotation process is time-consuming, expensive, and inefficient. Developing weakly supervised detection techniques where the detectors are only trained with image-level annotations, or partially with bounding box annotations is of great importance for reducing labor costs and improving detection flexibility.
- Small object detection: Detecting small objects in large scenes has long been a challenge. Some potential applications of this research direction include counting the population of wild animals with remote sensing images and detecting the state of some important military targets. Some further directions may include the integration of visual attention mechanisms and the design of high-resolution lightweight networks.
- Detection in videos: Real-time object detection/tracking in HD videos is of great importance for video surveillance and autonomous driving. Traditional object detectors are usually designed for image-wise detection and simply ignore the correlations between video frames. Improving detection by exploring spatial and temporal correlations is an important research direction.
- Detection with information fusion: Object detection with multiple sources/modalities of data, e.g., RGB-D images, 3D point clouds, LIDAR, etc., is of great importance for autonomous driving and drone applications. Some open questions include: how to migrate well-trained detectors to different modalities of data, and how to fuse information to improve detection.