YOLO, SSD, RetinaNet

YOLO: You Only Look Once (2015)

Improvement against other detectors

The YOLO Detection System.

YOLO: System overview

R-CNN, SPPNet, Fast R-CNN, Faster R-CNN, FPN

R-CNN: Regions with CNN features (2014)

Object detection system overview.

Test time detection

CNN supervised pre-training

The CNN (AlexNet) is pre-trained on the ImageNet dataset using image-level annotations only (bounding-box labels are not available for this data).

CNN domain-specific fine-tuning

To adapt the CNN to the new task (detection) and the new domain (warped proposal windows), the pre-trained CNN is…

Applications and Future Directions

Pedestrian Detection

Pedestrian detection, an important object detection application, has received extensive attention in areas such as autonomous driving, video surveillance, and criminal investigation. Some early pedestrian detection methods laid a solid foundation for general object detection in terms of:

In recent years, some general object detection algorithms have been introduced to pedestrian detection and have greatly promoted…

Recent Advances in Object Detection

Detection with Better Engines

Fig. 17. A comparison of detection accuracy of three detectors: Faster RCNN, R-FCN and SSD on MS-COCO dataset with different detection engines.

As the accuracy of a detector depends heavily on its feature extraction networks, we refer to the backbone networks, e.g. the ResNet and VGG, as the “engine” of a detector. Here we introduce some of the important detection engines in the deep learning era.

AlexNet (2012)

AlexNet, an eight-layer deep network, was the first CNN model that started the deep learning revolution in computer vision.

Speed-up of Detection

Fig. 12. An overview of the speed up techniques in object detection.

Feature Map Shared Computation

Among the different computational stages of an object detector, feature extraction usually dominates the amount of computation. For a sliding-window detector, computational redundancy arises from both positions and scales: the former is caused by the overlap between adjacent windows, the latter by the feature correlation between adjacent scales.

Spatial Computational Redundancy and SpeedUp

Feature map shared computation means computing the feature map of the whole image only once, before sliding windows over it. The “image pyramid” of a traditional detector can then be regarded as a “feature pyramid”.
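To make the saving concrete, here is a minimal sketch in plain Python. It is not taken from any particular detector: the toy per-pixel feature (an absolute horizontal gradient), the window descriptor (a sum over the window), and all function names are assumptions chosen purely for illustration. The naive version recomputes features inside every overlapping window, while the shared version computes the feature map once and lets every window read from it.

```python
def pixel_feature(image, y, x):
    """Toy per-pixel feature (an assumption): absolute horizontal gradient,
    clamped at the right border of the image."""
    row = image[y]
    right = row[min(x + 1, len(row) - 1)]
    return abs(right - row[x])


def window_origins(h, w, win, stride):
    """Top-left corners of all sliding windows that fit inside the image."""
    return [(top, left)
            for top in range(0, h - win + 1, stride)
            for left in range(0, w - win + 1, stride)]


def naive_descriptors(image, win, stride):
    """Recompute features inside every window; pixels covered by several
    overlapping windows are processed several times."""
    h, w = len(image), len(image[0])
    out, evals = {}, 0
    for top, left in window_origins(h, w, win, stride):
        total = 0
        for y in range(top, top + win):
            for x in range(left, left + win):
                total += pixel_feature(image, y, x)
                evals += 1
        out[(top, left)] = total
    return out, evals


def shared_descriptors(image, win, stride):
    """Compute the feature map of the whole image once, then let every
    window pool from that shared map."""
    h, w = len(image), len(image[0])
    fmap = [[pixel_feature(image, y, x) for x in range(w)] for y in range(h)]
    evals = h * w  # one feature evaluation per pixel, independent of window count
    out = {}
    for top, left in window_origins(h, w, win, stride):
        out[(top, left)] = sum(fmap[y][x]
                               for y in range(top, top + win)
                               for x in range(left, left + win))
    return out, evals


# Small synthetic 8x8 image, 4x4 windows, stride 2 (all arbitrary choices).
image = [[(3 * y + 5 * x) % 7 for x in range(8)] for y in range(8)]
naive, naive_evals = naive_descriptors(image, win=4, stride=2)
shared, shared_evals = shared_descriptors(image, win=4, stride=2)
print(naive == shared, naive_evals, shared_evals)
```

Both versions produce the same descriptors; only the number of feature evaluations differs. With these settings there are 9 windows of 16 pixels each, so the naive version evaluates the feature 144 times against 64 for the shared map, and the gap widens as the stride shrinks.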

Technical Evolution in Object Detection

Early Time’s Dark Knowledge

Early object detection (before 2000) did not follow a unified detection philosophy. Detectors at that time were usually designed based on low-level and mid-level vision.

Components, shapes, and edges

Some early researchers framed object detection as a measurement of similarity between object components, shapes, and contours. Despite promising initial results, these approaches did not work well on more complicated detection problems.

Therefore, machine learning based detection methods began to prosper. Machine learning based detection has gone through multiple periods, including statistical models of appearance (before 1998), wavelet feature representations (1998–2005), and gradient-based representations (2005–2012).

Building statistical models of an…

Object Detection Datasets and Metrics


Building larger datasets with less bias is critical for developing advanced computer vision algorithms. In object detection, a number of well-known datasets and benchmarks have been released in the past 10 years.

CNN based One-stage Detectors

Milestones: CNN based One-stage Detectors

In the deep learning era, object detection can be grouped into two genres: “two-stage detection” and “one-stage detection”. The former frames detection as a “coarse-to-fine” process, while the latter frames it as a task to “complete in one step”.

YOLO: You Only Look Once (2015)

Original paper: You Only Look Once: Unified, Real-Time Object Detection


We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
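As a rough illustration of what "a single regression problem" means for the network output, the sketch below follows the layout described in the YOLO paper: the image is divided into an S × S grid, each cell predicts B boxes of five values (x, y, w, h, confidence) plus C class probabilities, giving an S × S × (B·5 + C) tensor (7 × 7 × 30 for S=7, B=2, C=20 on PASCAL VOC). The function names and the exact ordering of values inside a cell vector are assumptions made for this example, not the reference implementation.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: one vector of B*5 + C values per grid cell."""
    return (S, S, B * 5 + C)


def decode_cell(cell, row, col, S=7, B=2, C=20):
    """Decode one grid cell's prediction vector.

    Assumed layout (illustrative): B consecutive boxes of
    (x, y, w, h, confidence), followed by C class probabilities.
    x, y are offsets within the cell; w, h are relative to the whole image.
    """
    class_probs = cell[B * 5:B * 5 + C]
    best_class = max(range(C), key=lambda k: class_probs[k])
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
        boxes.append({
            "cx": (col + x) / S,  # box centre in image-normalised coordinates
            "cy": (row + y) / S,
            "w": w,
            "h": h,
            "class": best_class,
            # class-specific score = box confidence * conditional class probability
            "score": conf * class_probs[best_class],
        })
    return boxes


shape = yolo_output_shape()

# A hand-written cell vector with B=2 boxes and only C=3 classes, for readability.
cell = [0.5, 0.5, 0.2, 0.4, 0.9,   # box 0
        0.1, 0.1, 0.3, 0.3, 0.2,   # box 1
        0.1, 0.7, 0.2]             # class probabilities
boxes = decode_cell(cell, row=3, col=3, S=7, B=2, C=3)
print(shape, boxes[0])
```

In a full pipeline these per-cell boxes would still be thresholded by score and passed through non-maximum suppression; that step is left out here to keep the sketch short.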

CNN based Two-stage Detectors

Milestones: CNN based Two-stage Detectors

In 2012, the world saw the rebirth of convolutional neural networks. Since a deep convolutional network can learn robust, high-level feature representations of an image, a natural question is whether it can be brought to object detection. R. Girshick et al. took the lead in breaking the deadlock in 2014 by proposing Regions with CNN features (RCNN) for object detection. Since then, object detection has evolved at an unprecedented speed.

R-CNN: Regions with CNN features (2014)

Original paper: Rich feature hierarchies for accurate object detection and semantic segmentation



Not all debt is bad, but all debt needs to be serviced.

Developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive. This dichotomy can be understood through the lens of technical debt, a metaphor introduced by Ward Cunningham in 1992 to help reason about the long-term costs incurred by moving quickly in software engineering.

Technical debt may be paid down by refactoring code, improving unit tests, deleting dead code, reducing dependencies, tightening APIs, and improving documentation. The goal is not to add new functionality, but to enable future improvements, reduce…

Jiangchun Li

Senior Data Scientist@ViSenze. If I don’t create, I don’t understand.
