
State of Deep Learning for Object Detection - You Should Consider CenterNets!

It's 2021 - there has been quite a bit of progress in practical models for object detection.

This post presents a short discussion of recent progress in practical deep learning models for object detection.

[Interactive chart: COCO mAP vs. inference speed (FPS) for the pretrained models benchmarked below]

I explored object detection models in detail about three years ago while building Handtrack.js, and quite a bit has changed since then. For one, MobileNet SSD[^2] was then the gold standard for low-latency applications (e.g. browser deployment); now CenterNets[^1] appear to do even better.

This post does not pretend to be exhaustive; instead, it focuses on methods that are practical for today's use cases (i.e. reproducible checkpoints exist). Hence the excitement about CenterNets[^1]! The reader is encouraged to review the papers listed in the references section below for more in-depth discussions.

I also highly recommend the TensorFlow Object Detection API[^3] from Google as a source of reference implementations; this post visualizes the performance metrics of the pretrained models they report, alongside metrics from Ultralytics for YOLOv5[^7].

TL;DR

Benchmarking Data Source

The data used in the chart above is harvested from the TensorFlow Object Detection model zoo[^3]. Based on this GitHub issue[^4], the reported timings come from experiments on an Nvidia GeForce GTX TITAN X card.

Data on YOLOv5 is also included from the Ultralytics benchmarks[^7], which were conducted on a Tesla V100 GPU using a PyTorch implementation. To compare FPS across hardware, I generously assume a Tesla V100 is 1.6x faster than a TITAN X, based on comparisons in this Lambda Labs GPU benchmarking article[^6].
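Concretely, that normalization is just a division by the assumed speedup; a minimal sketch (the 1.6x factor is my assumption from the Lambda Labs comparison, not a measured value):

```python
V100_TO_TITANX_SPEEDUP = 1.6  # assumed speedup, per the Lambda Labs comparison [^6]

def titanx_equivalent_fps(v100_fps: float) -> float:
    """Scale FPS measured on a Tesla V100 down to an estimated TITAN X value."""
    return v100_fps / V100_TO_TITANX_SPEEDUP

print(titanx_equivalent_fps(100.0))  # 100 FPS on a V100 ≈ 62.5 FPS on a TITAN X
```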


| Model | Box mAP | Keypoint mAP | FPS |
| --- | --- | --- | --- |
| CenterNet HourGlass104 512x512 | 41.9 | 41.9 | 14.29 |
| CenterNet HourGlass104 Keypoints 512x512 | 40.0 | 61.4 | 13.16 |
| CenterNet HourGlass104 1024x1024 | 44.5 | 44.5 | 5.08 |
| CenterNet HourGlass104 Keypoints 1024x1024 | 42.8 | 64.5 | 4.74 |
| CenterNet Resnet50 V1 FPN 512x512 | 31.2 | 31.2 | 37.04 |
| CenterNet Resnet50 V1 FPN Keypoints 512x512 | 29.3 | 50.7 | 33.33 |
| CenterNet Resnet101 V1 FPN 512x512 | 34.2 | 34.2 | 29.41 |
| CenterNet Resnet50 V2 512x512 | 29.5 | 29.5 | 37.04 |
| CenterNet Resnet50 V2 Keypoints 512x512 | 27.6 | 48.2 | 33.33 |
| CenterNet MobileNetV2 FPN 512x512 | 23.4 | 23.4 | 166.67 |
| CenterNet MobileNetV2 FPN Keypoints 512x512 | 41.7 | 41.7 | 166.67 |
| SSD MobileNet V2 320x320 | 20.2 | -- | 52.63 |
| SSD MobileNet V1 FPN 640x640 | 29.1 | -- | 20.83 |
| SSD MobileNet V2 FPNLite 320x320 | 22.2 | -- | 45.45 |
| SSD MobileNet V2 FPNLite 640x640 | 28.2 | -- | 25.64 |
| SSD ResNet50 V1 FPN 640x640 | 34.3 | -- | 21.74 |
| SSD ResNet50 V1 FPN 1024x1024 | 38.3 | -- | 11.49 |
| SSD ResNet101 V1 FPN 640x640 | 35.6 | -- | 17.54 |
| SSD ResNet101 V1 FPN 1024x1024 | 39.5 | -- | 9.62 |
| SSD ResNet152 V1 FPN 640x640 | 35.4 | -- | 12.50 |
| SSD ResNet152 V1 FPN 1024x1024 | 39.6 | -- | 9.01 |
| EfficientDet D0 512x512 | 33.6 | -- | 25.64 |
| EfficientDet D1 640x640 | 38.4 | -- | 18.52 |
| EfficientDet D2 768x768 | 41.8 | -- | 14.93 |
| EfficientDet D3 896x896 | 45.4 | -- | 10.53 |
| EfficientDet D4 1024x1024 | 48.5 | -- | 7.52 |
| EfficientDet D5 1280x1280 | 49.7 | -- | 4.50 |
| EfficientDet D6 1280x1280 | 50.5 | -- | 3.73 |
| EfficientDet D7 1536x1536 | 51.2 | -- | 3.08 |
| Faster R-CNN ResNet50 V1 640x640 | 29.3 | -- | 18.87 |
| Faster R-CNN ResNet50 V1 1024x1024 | 31.0 | -- | 15.38 |
| Faster R-CNN ResNet50 V1 800x1333 | 31.6 | -- | 15.38 |
| Faster R-CNN ResNet101 V1 640x640 | 31.8 | -- | 18.18 |
| Faster R-CNN ResNet101 V1 1024x1024 | 37.1 | -- | 13.89 |
| Faster R-CNN ResNet101 V1 800x1333 | 36.6 | -- | 12.99 |
| Faster R-CNN ResNet152 V1 640x640 | 32.4 | -- | 15.63 |
| Faster R-CNN ResNet152 V1 1024x1024 | 37.6 | -- | 11.76 |
| Faster R-CNN ResNet152 V1 800x1333 | 37.4 | -- | 9.90 |
| Faster R-CNN Inception ResNet V2 640x640 | 37.7 | -- | 4.85 |
| Faster R-CNN Inception ResNet V2 1024x1024 | 38.7 | -- | 4.24 |
| YOLOv5s 640x640 | 36.8 | -- | 273.00 |
| YOLOv5m 640x640 | 44.5 | -- | 207.00 |
| YOLOv5l 640x640 | 36.8 | -- | 158.00 |
| YOLOv5x 640x640 | 50.1 | -- | 24.00 |

Object Detection Primer

The task of object detection is focused on predicting the where and the what of each object instance in an image. For instance, given an image of a breakfast table, we want to know where the objects of interest (plates, forks, knives, cups) are located (bounding box coordinates) and what each one is. Models that achieve this goal must solve two tasks: first, they must identify regions within an image that contain objects (region proposal), and then they must classify each of these regions into one of n categories.

To perform the second task, you typically need to incorporate a model that understands images well, e.g. a pretrained image classification model (ResNet50, EfficientNet, or stacked HourGlass networks trained on the ImageNet dataset). This pretrained model, called a backbone, is then used in some section of the object detection model to extract feature maps that help in predicting whether an object exists in a section of the image, the class of this object, and its location.
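To make the backbone idea concrete, here is a minimal sketch of extracting feature maps from a pretrained ResNet50 using torchvision (PyTorch is just one convenient choice here; the TensorFlow models in the table do the equivalent internally):

```python
import torch
from torchvision.models import resnet50

# Load an ImageNet-pretrained ResNet50 and drop its classification head,
# keeping only the convolutional layers that produce feature maps.
backbone = torch.nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])
backbone.eval()

image = torch.randn(1, 3, 512, 512)  # a dummy 512x512 RGB image batch
with torch.no_grad():
    features = backbone(image)       # a detector head consumes these feature maps
print(features.shape)                # torch.Size([1, 2048, 16, 16])
```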

Research in this area has gone through several generations:

Two Stage Detection

Two stage object detection models typically have two distinct parts. The first stage focuses on generating proposals (i.e. which parts of the image are likely to contain objects?). This may be done using a region proposal algorithm like selective search[^13] (e.g. as seen in R-CNN) or with a region proposal network (RPN), which takes in the feature map produced by a backbone network and predicts regions where an object exists (e.g. as seen in Faster R-CNN, Mask R-CNN, Cascade R-CNN, CPNs[^11], etc.).
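As a concrete example of the first (network-free) flavor, here is a minimal selective search sketch using OpenCV's contrib module (a random stand-in image is used so the snippet runs as-is; see [^13] for a full walkthrough):

```python
import cv2
import numpy as np

# A random stand-in image; in practice, load a real one with cv2.imread(...).
image = np.random.randint(0, 255, (300, 400, 3), dtype=np.uint8)

# Selective search ships with opencv-contrib-python (cv2.ximgproc).
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()  # the "fast" mode trades some recall for speed
rects = ss.process()              # array of (x, y, w, h) proposals, often thousands
print(len(rects), "region proposals")
```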

Early RPNs used the concept of anchor boxes: a set of k predefined boxes at different aspect ratios and scales, used to predict whether a section of a feature map (from the backbone) contains an object. RPNs also predict refinements, where needed, that transform the initial anchor box to match a proposal box. A minimal sketch of anchor generation follows below.
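The anchor set itself is simple to construct; here is a minimal NumPy sketch (the base size, scales, and ratios are illustrative hyperparameters, not values from any specific paper):

```python
import numpy as np

def generate_anchors(base_size=256, scales=(0.5, 1.0, 2.0), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors centered at the origin,
    as (x1, y1, x2, y2). These get tiled across every feature-map cell."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio = height / width
            # Hold the anchor area fixed at (base_size * scale)^2 while
            # varying its aspect ratio.
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(generate_anchors().round(1))  # 9 anchors: 3 scales x 3 aspect ratios
```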

The second stage takes the proposed regions and predicts the object class plus any additional refinements or bounding box offsets. Note that RPNs may be implemented using multiple approaches; e.g. recent two stage models like CPNs do not use anchor boxes in their RPN.

Single Stage Detection

Single stage detection gets rid of the separate proposal stage: a single network predicts the presence of objects, the class of each object, and the region/bounding box transforms/offsets in one pass.

  • Anchor based detectors: Models in this category leverage the concept of anchor boxes described above. Each anchor box is slid across the preceding feature map to predict whether an object exists, plus any refinements. Examples include RetinaNet[^12], SSD[^9], and YOLO (the RPN inside the two stage Faster R-CNN relies on anchors in the same way). Anchor based detectors depend on a set of anchor boxes of fixed sizes (a hyperparameter that needs to be tuned carefully), making it hard to detect objects that don't fit well within the selected anchor box shapes. In addition, the class predictions and refinements for each box incur compute cost (see the prediction-head sketch after this list).

  • Anchor free detectors: More recently, there have been efforts to simplify single stage detectors by removing the need for a predefined set of anchor boxes and the computational costs they incur (sliding them across feature maps). Examples of this approach include CenterNets[^1] (more below), CornerNets[^10], and FCOS[^8].
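To illustrate the per-anchor cost, here is a minimal sketch of an SSD-style prediction head (channel counts and the single feature-map scale are illustrative; real detectors attach heads at multiple scales):

```python
import torch

num_anchors, num_classes = 9, 80
# For each feature-map cell, one 3x3 conv predicts class scores and
# 4 box offsets for each of the k anchors tiled at that cell.
head = torch.nn.Conv2d(2048, num_anchors * (num_classes + 4),
                       kernel_size=3, padding=1)

features = torch.randn(1, 2048, 16, 16)  # backbone output, as in the earlier sketch
out = head(features)                      # (1, 9 * 84, 16, 16)
out = out.permute(0, 2, 3, 1).reshape(1, -1, num_classes + 4)
print(out.shape)                          # (1, 2304, 84): one prediction per anchor
```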

Two-stage detectors are often more accurate, but slower, than one-stage detectors.

A rough taxonomy of the models discussed so far:

  • Backbone networks: ResNet, GoogleNet, InceptionResNet, EfficientNet, HourGlassNet ..
  • Two stage detectors
    • Anchor based: Faster R-CNN, Mask R-CNN, Cascade R-CNN ..
    • Anchor free: R-CNN, Fast R-CNN (selective search), CPN
  • Single stage detectors
    • Anchor based: SSD, RetinaNet, YOLO ..
    • Anchor free: CenterNet, CornerNet, FCOS ..

Why Do CenterNets (and Anchor Free Models) Work So Well?

So - why do CenterNets[^1] (and related anchor free models) appear to work so well, achieving decent performance at lower latency? The short answer is that they reformulate object detection from generating and classifying proposals into a simpler problem: predicting objects' centers (keypoints) and regressing their corresponding attributes (width and height). This (anchor free) formulation provides a few benefits:

  • It avoids the use of anchor boxes (a fixed number of prototypical boxes slid over the image at multiple scales), which can be costly to compute.
  • It is (in theory) more amenable to oddly shaped objects, which an anchor box method might miss.
  • It avoids the need for non maxima suppression, since there are no anchor boxes generating duplicate detections; a simple local-peak check suffices (see the decoding sketch after this list).
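Here is a minimal sketch of that decoding step, assuming a network that outputs a per-class center heatmap and a width/height map as described above (the 80-class / 128x128 sizes are illustrative, and random tensors stand in for real network outputs):

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, k=100):
    """Decode boxes from CenterNet-style outputs.
    heatmap: (1, C, H, W) per-class center-point probabilities.
    wh:      (1, 2, H, W) regressed box width/height at each location."""
    # Keep only local maxima via a 3x3 max-pool -- this stands in for NMS.
    peaks = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    heatmap = heatmap * (heatmap == peaks).float()

    b, c, h, w = heatmap.shape
    scores, idx = heatmap.view(b, -1).topk(k)  # top-k peaks across classes*H*W
    classes = torch.div(idx, h * w, rounding_mode="floor")
    ys = torch.div(idx % (h * w), w, rounding_mode="floor")
    xs = idx % w

    # Look up the regressed width/height at each peak, build (x1, y1, x2, y2).
    bw = wh[0, 0, ys[0], xs[0]]
    bh = wh[0, 1, ys[0], xs[0]]
    boxes = torch.stack([xs[0] - bw / 2, ys[0] - bh / 2,
                         xs[0] + bw / 2, ys[0] + bh / 2], dim=1)
    return boxes, scores[0], classes[0]

boxes, scores, classes = decode_centers(torch.rand(1, 80, 128, 128),
                                        torch.rand(1, 2, 128, 128) * 32)
print(boxes.shape, scores.shape)  # torch.Size([100, 4]) torch.Size([100])
```

No sliding anchors, no IoU-based suppression: a max-pool and a top-k are the entire post-processing step.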

CenterNets have also been applied to adjacent tasks such as (multi-)object tracking, segmentation, and movement prediction[^5].

CenterNet Architecture. A convolutional backbone network applies cascade corner pooling and center pooling to output two corner heatmaps and a center keypoint heatmap, respectively. Similar to CornerNet, a pair of detected corners and the similar embeddings are used to detect a potential bounding box. Then the detected center keypoints are used to determine the final bounding boxes. Source: Duan, Kaiwen, et al. 2019.

I certainly look forward to digging in and experimenting with CenterNets.
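If you want to poke at one quickly, pretrained CenterNet checkpoints from the model zoo are published on TensorFlow Hub; a minimal sketch (the module handle is what I believe the TF2 detection collection uses, but double-check it on tfhub.dev):

```python
import tensorflow as tf
import tensorflow_hub as hub

# Handle assumed from the TF2 detection collection on tfhub.dev; verify it there.
detector = hub.load("https://tfhub.dev/tensorflow/centernet/hourglass_512x512/1")

image = tf.zeros([1, 512, 512, 3], dtype=tf.uint8)  # stand-in for a real image batch
outputs = detector(image)
print(outputs["detection_boxes"].shape,   # (1, 100, 4) normalized box coordinates
      outputs["detection_scores"].shape)  # (1, 100) confidence per detection
```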

References

[^1]: Duan, Kaiwen, et al. "CenterNet: Keypoint triplets for object detection." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[^2]: Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer, Cham, 2016.
[^3]: TensorFlow 2 Object Detection API model zoo. https://github.com/tensorflow/models/blob/master/research/object_detection/
[^4]: TensorFlow Object Detection API performance issue. https://github.com/tensorflow/models/issues/3243
[^5]: CenterNet and its variants. http://guanghan.info/blog/en/my-thoughts/centernet-and-its-variants/
[^6]: Titan RTX Deep Learning Benchmarks. https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/
[^7]: Ultralytics YOLOv5 benchmarks. https://github.com/ultralytics/yolov5
[^8]: Tian, Zhi, et al. "FCOS: A simple and strong anchor-free object detector." IEEE Transactions on Pattern Analysis and Machine Intelligence. 2020.
[^9]: Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer, Cham, 2016.
[^10]: Law, Hei, and Jia Deng. "CornerNet: Detecting objects as paired keypoints." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[^11]: (CPN) Duan, Kaiwen, et al. "Corner proposal network for anchor-free, two-stage object detection." arXiv preprint arXiv:2007.13816. 2020.
[^12]: Lin, Tsung-Yi, et al. "Focal loss for dense object detection." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[^13]: OpenCV Selective Search for Object Detection. https://www.pyimagesearch.com/2020/06/29/opencv-selective-search-for-object-detection/

Interested in more articles like this? Subscribe to get a monthly roundup of new posts and other interesting ideas at the intersection of Applied AI and HCI.
