March 3rd 2021 • 3 min read

Real-Time High-Resolution Background Matting | Paper Review

A review of the CVPR 2021 paper "Real-Time High-Resolution Background Matting".

This post summarizes the paper by Lin et al. (2020)[^1], which proposes a two-stage deep neural network for real-time segmentation of subjects from their background.

The technique employed is based on background matting, where an additional frame of the background is captured and used in recovering the alpha matte and the foreground layer. Source: Lin et al 2020.
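
The roles of the alpha matte and the foreground layer can be made concrete with the standard compositing equation, where an observed pixel is I = αF + (1 − α)B for foreground F, background B and alpha α. The sketch below is not taken from the paper's code: `composite` is an illustrative helper name, and colors are assumed to be RGB tuples with values in [0, 1].

```python
from typing import Tuple

Color = Tuple[float, float, float]  # RGB in [0, 1] (assumed convention)


def composite(alpha: float, fg: Color, bg: Color) -> Color:
    """Blend one pixel: alpha * foreground + (1 - alpha) * background."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


red_subject = (1.0, 0.0, 0.0)
blue_backdrop = (0.0, 0.0, 1.0)

# alpha = 1 keeps the subject, alpha = 0 shows the new background,
# and a fractional alpha (e.g. along strands of hair) mixes the two.
opaque = composite(1.0, red_subject, blue_backdrop)
hair_edge = composite(0.5, red_subject, blue_backdrop)
```

Once α and F have been recovered, the same equation places the subject over any replacement background, which is why soft alpha values along hair matter more than a hard segmentation mask.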

Review TL;DR

Pros

- The proposed model handles hair and subject-boundary details much better than current approaches (think of how Zoom might crop out portions of hair, or fail when your hand is close to your face or something else occludes you).
- It improves on the state of the art in speed/latency for processing large images. Previous approaches that attempt fine-grained segmentation achieve 8 FPS on 512×512 images (pretty much unusable). This approach achieves 60 FPS on HD images (1920×1080) and 30 FPS on 4K images (3840×2160).
- These speed gains come from a two-stage network. The first network downsamples the image and outputs a matte prediction plus an error prediction map at low resolution. The second network (a refinement network) uses the low-resolution result and the original image to generate high-resolution output (fine-grained detail) for only select regions of the image.
- The authors compare their approach with several existing approaches and create a Zoom plugin that pipes model output to Zoom.
- They provide sample code to reproduce their results and allow experimentation via notebooks.
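
The "refine only select regions" idea from the two-stage design can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the function names, the `k` and `scale` values, and the use of plain Python lists in place of tensors are all assumptions made for the example. It ranks cells of a low-resolution error map and maps the worst ones back to full-resolution patches that a refinement stage would then process.

```python
from typing import List, Tuple

Cell = Tuple[int, int]           # (row, col) in the low-resolution error map
Box = Tuple[int, int, int, int]  # (top, left, bottom, right) in full-res pixels


def select_cells(error_map: List[List[float]], k: int) -> List[Cell]:
    """Pick the k low-resolution cells with the highest predicted error."""
    ranked = sorted(
        ((err, r, c) for r, row in enumerate(error_map) for c, err in enumerate(row)),
        key=lambda item: item[0],
        reverse=True,
    )
    return sorted((r, c) for _, r, c in ranked[:k])


def to_full_res_boxes(cells: List[Cell], scale: int) -> List[Box]:
    """Map each selected cell to the full-resolution patch it covers."""
    return [(r * scale, c * scale, (r + 1) * scale, (c + 1) * scale) for r, c in cells]


# Toy 2x3 error map: high values mark uncertain regions such as hair edges.
error_map = [
    [0.0, 0.1, 0.9],
    [0.2, 0.8, 0.05],
]
cells = select_cells(error_map, k=2)
boxes = to_full_res_boxes(cells, scale=4)  # scale=4 is an arbitrary toy value
refined_fraction = len(cells) / (len(error_map) * len(error_map[0]))
```

Here only a third of the image area is passed to the expensive stage, while the remaining cells keep the upsampled low-resolution prediction. That trade is what makes high-resolution output affordable.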

Cons

- The system requires a background image to work well. This is not a huge issue, but it introduces a new step (selecting a background image) that might interfere with usability.
- The reported results (60 FPS on HD images and 30 FPS on 4K images) were obtained on an Nvidia RTX 2080 Ti GPU. This suggests the model might still be unusable on CPUs (the majority of user environments).
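
To put the frame rates quoted in this review in perspective (8 FPS at 512×512 for earlier approaches, versus 60 FPS at 1920×1080 and 30 FPS at 3840×2160 for this model), the short calculation below converts them to raw pixel throughput and per-frame time budgets. It is plain arithmetic with illustrative helper names. It does not account for hardware differences between the systems being compared.

```python
def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Raw pixel throughput implied by a resolution and frame rate."""
    return width * height * fps


def frame_budget_ms(fps: int) -> float:
    """Time available to process a single frame, in milliseconds."""
    return 1000.0 / fps


prior = pixels_per_second(512, 512, 8)     # 2,097,152 px/s
hd = pixels_per_second(1920, 1080, 60)     # 124,416,000 px/s
uhd = pixels_per_second(3840, 2160, 30)    # 248,832,000 px/s

hd_speedup = hd / prior    # about 59x
uhd_speedup = uhd / prior  # about 119x
```

At 60 FPS the whole pipeline has under 17 ms per frame, which helps explain why the reported numbers depend on a high-end GPU.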


It's pretty remarkable what has been accomplished here. Future work might benefit from:

- Efforts to further optimize for latency while maintaining fine-grained segmentation (e.g. using residual U-blocks as proposed in the U²-Net paper). This might lead to usable FPS values on commodity CPU machines.
- Efforts to optimize for usability by eliminating the background image requirement. For example, by reframing the ML problem, we could train the network to jointly predict the background (image completion) in addition to the alpha matte values, and leverage this knowledge in predicting better matte values. Ideally, this formulation would use background images (as labels) during training but not require them during inference.
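
One way to picture the joint-prediction idea is as a training objective with two terms. The sketch below is purely hypothetical: it is not from the paper, and the function names, the choice of mean absolute error, and the `bg_weight` value are assumptions made for illustration. Pixel values are flattened into plain lists to keep the example dependency-free.

```python
from typing import Sequence


def mean_abs_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average absolute difference between two equally sized sequences."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def joint_matting_loss(
    alpha_pred: Sequence[float],
    alpha_true: Sequence[float],
    bg_pred: Sequence[float],
    bg_true: Sequence[float],
    bg_weight: float = 0.5,  # hypothetical weighting of the background term
) -> float:
    """Alpha matte error plus a weighted background-reconstruction error."""
    alpha_term = mean_abs_error(alpha_pred, alpha_true)
    bg_term = mean_abs_error(bg_pred, bg_true)
    return alpha_term + bg_weight * bg_term


# The captured background is used only as a training label here;
# at inference time the network would see the input frame alone.
loss = joint_matting_loss(
    alpha_pred=[0.0, 1.0], alpha_true=[0.0, 0.5],
    bg_pred=[0.0, 0.0], bg_true=[0.0, 1.0],
)
```

Whether such an auxiliary background term actually improves matte quality would need to be tested empirically.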

Overall, a well-written paper and a well-produced video explaining the work.

I will be updating this post as I experiment with the model itself.

References

[^1]: Lin, S., Ryabtsev, A., Sengupta, S., Curless, B., Seitz, S., & Kemelmacher-Shlizerman, I. (2020). Real-Time High-Resolution Background Matting. arXiv preprint arXiv:2012.07810. CVPR 2021.
