State of Deep Learning for Object Detection - You Should Consider CenterNets!
This post presents a short discussion of recent progress in practical deep learning models for object detection.
Source: Tensorflow Object Detection API, Ultralytics
I explored object detection models in detail about 3 years ago while building Handtrack.js, and quite a bit has changed since then. For one, MobileNet SSD[^2] was then the gold standard for low latency applications (e.g. browser deployment); now CenterNets[^1] appear to do even better.
This post does not pretend to be exhaustive; it focuses on methods that are practical (reproducible checkpoints exist) for today's use cases. Hence the excitement for CenterNets[^1]! The reader is encouraged to review the papers listed in the references section below for more in-depth discussions.
I also highly recommend the Tensorflow Object Detection API[^3] from Google as a source of reference implementations; this post visualizes the performance metrics they report for pretrained models, alongside metrics from Ultralytics on YOLOv5[^7].
TL;DR
Benchmarking Data Source
The data used in the chart above is harvested from the Tensorflow Object Detection model zoo[^3]. Based on this Github issue[^4], the timings reported are from experiments on an Nvidia GeForce GTX TITAN X card.
Data on YOLOv5 is also included from the Ultralytics benchmarks[^7], which were conducted on a Tesla V100 GPU using a PyTorch implementation. To compare FPS across the two setups, I generously assume a Tesla V100 is 1.6x faster than a Titan X, based on comparisons in this Lambda Labs GPU benchmarking article[^6].
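For concreteness, the normalization is just a division by that assumed speedup factor. The constant and function below are my own naming of the assumption above, not part of either benchmark:

```python
# Rough normalization of the Ultralytics YOLOv5 FPS figures (Tesla V100)
# into Titan X terms. The 1.6x factor is the assumption described above.
V100_TO_TITANX_SPEEDUP = 1.6

def titanx_equivalent_fps(v100_fps: float) -> float:
    """Scale a V100 FPS figure down to an approximate Titan X figure."""
    return v100_fps / V100_TO_TITANX_SPEEDUP

# e.g. a model at 100 FPS on a V100 is assumed to run at ~62.5 FPS on a Titan X
print(titanx_equivalent_fps(100.0))  # 62.5
```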
Object Detection Primer
The task of object detection is focused on predicting the *where* and *what* of each object instance in an image. For instance, given an image of a breakfast table, we want to know where the objects of interest (plates, forks, knives, cups) are located (bounding box coordinates). Models that achieve this goal must solve two tasks: first, they must identify regions within an image that contain objects (*region proposal*), and then they must classify each of these regions into one of *n* categories. To perform the second task, you typically need to incorporate a model that can understand images well, e.g. a pretrained image classification model (ResNet50 or EfficientNet trained on the ImageNet dataset, or a Stacked Hourglass network). This pretrained model, called a *backbone*, is then used in some section of the object detection model to extract feature maps that help in predicting things like whether an object exists in a section of the image, the class of this object, and its location.
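To make the backbone idea concrete, here is a minimal PyTorch sketch (the library, layer cut-off, and input size are my choices for illustration, not a specific detector's implementation) that strips the classification head off a pretrained ResNet50 and uses what remains to extract a feature map:

```python
import torch
import torchvision

# Load a pretrained ResNet50 and drop its classification head (the final
# average pool + fully connected layer), keeping only the conv stages.
backbone = torchvision.models.resnet50(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])
feature_extractor.eval()

image = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed RGB image
with torch.no_grad():
    feature_map = feature_extractor(image)

# A 512x512 input is downsampled 32x by ResNet50's conv stages.
print(feature_map.shape)  # torch.Size([1, 2048, 16, 16])
```

A detection model then attaches prediction heads to feature maps like this one, rather than learning image features from scratch.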
Research in this area has gone through several generations:
Two Stage Detection
Two stage object detection models typically have two distinct parts. The first stage focuses on generating proposals (i.e. which parts of my image are likely to contain objects?). This may be done using a region proposal algorithm like selective search[^13] (e.g. as seen in R-CNN) or with a region proposal network (RPN), which takes in a feature map produced by a backbone network and predicts regions where an object exists (e.g. as seen in Faster R-CNN, Mask R-CNN, Cascade R-CNN, CPNs[^11], etc.).
Early RPNs used the concept of anchor boxes: a set of k predefined boxes at different aspect ratios and scales, used to predict whether a section of a feature map (from the backbone) contains an object. RPNs also predict refinements, where needed, that transform the initial anchor box to match a proposal box.
The second stage takes the proposed regions and predicts the object class along with any additional refinements or bounding box offsets. Note that RPNs may be implemented using multiple approaches; e.g. recent two stage models like CPNs do not use anchor boxes in their RPN.
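To make the anchor box idea concrete, here is a minimal sketch of how a set of k anchor shapes might be generated. The scales, aspect ratios, and function name are illustrative placeholders, not the values any particular model uses:

```python
import itertools

# A toy sketch of anchor box generation (widths and heights only).
# Real detectors tile these k shapes across every feature map location.
def make_anchors(scales=(64, 128, 256), aspect_ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(aspect_ratios) (width, height) pairs."""
    anchors = []
    for scale, ratio in itertools.product(scales, aspect_ratios):
        # Keep the anchor's area roughly scale**2 while varying its shape;
        # ratio here is width / height.
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        anchors.append((w, h))
    return anchors

print(make_anchors())  # k = 9 anchors, one per (scale, aspect ratio) pair
```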
Single Stage Detection
Single stage detectors get rid of the first stage and explore how the same network can be used to predict the presence of objects, the class of each object, and region/bounding box transforms/offsets all at once. Two broad families exist:
- Anchor based detectors: Models in this category leverage the concept of anchor boxes described above. We slide each anchor box across the preceding feature map and predict whether an object exists, plus any refinements. Examples include RetinaNet[^12], SSD[^9], and YOLO. Anchor based detectors rely on a set of anchor boxes of fixed size (a hyperparameter that needs to be tuned carefully), making it hard to detect objects that don't fit well within the selected anchor box shapes. In addition, the class predictions and refinements for each box incur compute cost.
- Anchor free detectors: More recently, there have been efforts to simplify single stage detectors by removing the need for a predefined set of anchor boxes and the computational costs they incur (sliding them across feature maps). Examples of this approach include CenterNets[^1] (more below), CornerNets[^10], and FCOS[^8]. A sketch contrasting the two head designs follows this list.
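The practical difference between the two approaches shows up directly in the detection head. Below is a toy PyTorch sketch (my own illustration, not the exact head from any paper) contrasting an anchor based head, which predicts k sets of class scores and box refinements per feature map location, with a CenterNet-style anchor free head, which predicts a per-class center heatmap plus a single size/offset estimate per location:

```python
import torch
import torch.nn as nn

num_classes, k, channels = 80, 9, 256  # k = anchors per feature map location

# Anchor based: k class score sets and k box refinements at every location.
anchor_based_head = nn.Conv2d(channels, k * (num_classes + 4),
                              kernel_size=3, padding=1)

# Anchor free (CenterNet-style): one heatmap channel per class, plus a
# single size and sub-pixel offset prediction per location.
heatmap_head = nn.Conv2d(channels, num_classes, kernel_size=3, padding=1)
size_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)    # width, height
offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)  # center offset

features = torch.randn(1, channels, 128, 128)  # stand-in backbone feature map
print(anchor_based_head(features).shape)  # [1, 9 * (80 + 4), 128, 128]
print(heatmap_head(features).shape)       # [1, 80, 128, 128]
```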
Two stage detectors are often more accurate, but slower, than their single stage counterparts.
Why Do CenterNets (and Anchor Free Models) Work So Well?
So, why do CenterNets[^1] (and related anchor free models) appear to work so well, achieving decent performance at lower latency? The short answer is that they reformulate the object detection task from generating and classifying proposals into a simpler problem: predicting each object's center (a keypoint) and regressing its corresponding attributes (width and height). This (anchor free) formulation provides a few benefits:
- It avoids the use of anchor boxes (a fixed number of prototypical boxes that we slide over the image at multiple scales), which can be costly to compute (quadratic in the number of pixels).
- It is more amenable (in theory) to oddly shaped objects, which an anchor box method might miss.
- It avoids the need for non-maximum suppression, since there are no anchor boxes generating duplicate detections; a simple peak extraction over the predicted center heatmap suffices (see the sketch below).
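As an illustration of that last point, here is a minimal sketch (the function name, shapes, and top-k cutoff are my own) of the peak-extraction trick used by center-heatmap decoders: a 3x3 max pool keeps only local maxima of the heatmap, playing the role that NMS plays in anchor based pipelines:

```python
import torch
import torch.nn.functional as F

def extract_peaks(heatmap: torch.Tensor, k: int = 100):
    """heatmap: [batch, num_classes, H, W] of (sigmoid) center scores."""
    # A location survives only if it equals the max of its 3x3 neighborhood,
    # i.e. it is a local maximum of the heatmap.
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()  # zero out non-maxima
    # Keep the k highest-scoring peaks across all classes and locations.
    scores, indices = torch.topk(peaks.flatten(1), k)
    return scores, indices

heatmap = torch.rand(1, 80, 128, 128)  # stand-in for predicted center scores
scores, indices = extract_peaks(heatmap)
print(scores.shape)  # torch.Size([1, 100])
```

Because this is just a pooling operation, it runs on the GPU alongside the rest of the network, rather than as a separate post-processing step like classical NMS.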
CenterNets have also been applied to adjacent tasks such as (multi) object tracking, segmentation, movement prediction, etc.[^5]
I certainly look forward to digging in and experimenting with CenterNets.
References
[^1]: Duan, Kaiwen, et al. "CenterNet: Keypoint triplets for object detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[^2]: Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer, Cham, 2016.
[^3]: Tensorflow 2.0 Object Detection API model zoo. https://github.com/tensorflow/models/blob/master/research/object_detection/
[^4]: Tensorflow Object Detection API performance issue. https://github.com/tensorflow/models/issues/3243
[^5]: CenterNet and its variants. http://guanghan.info/blog/en/my-thoughts/centernet-and-its-variants/
[^6]: Titan RTX Deep Learning Benchmarks. https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/
[^7]: Ultralytics YOLOv5 benchmarks. https://github.com/ultralytics/yolov5
[^8]: Tian, Zhi, et al. "FCOS: A simple and strong anchor-free object detector." IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[^9]: Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer, Cham, 2016.
[^10]: Law, Hei, and Jia Deng. "CornerNet: Detecting objects as paired keypoints." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[^11]: (CPN) Duan, Kaiwen, et al. "Corner proposal network for anchor-free, two-stage object detection." arXiv preprint arXiv:2007.13816 (2020).
[^12]: Lin, Tsung-Yi, et al. "Focal loss for dense object detection." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[^13]: OpenCV Selective Search for Object Detection. https://www.pyimagesearch.com/2020/06/29/opencv-selective-search-for-object-detection/