Developer Blog

To expand the possibilities for augmented reality (AR) applications and player performance tracking in professional sports, researchers at Stats Perform, a sports analytics company, have developed an AI-based method that makes camera calibration faster and more flexible.

Sports teams use vision-based tracking systems to analyze player performance, and broadcasters deploy AR to enhance the viewing experience (e.g., the Virtual 3 in basketball, the First Down Line in the NFL). These solutions depend on high-quality camera calibration and are therefore usually constrained to pre-calibrated, fixed cameras or to pan-tilt-zoom cameras that report real-time parameter updates. More flexible solutions exist, but they do not work well in busy, non-uniform environments such as a basketball game, where the court is crowded and its appearance changes from arena to arena.

To overcome these limitations, the researchers devised a novel neural network that combines semantic segmentation, camera pose initialization, and homography refinement into a single network architecture.
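At a high level, the three stages described above run as one composed pipeline: segment the frame, use the segmentation to initialize a camera pose, then refine the homography. The sketch below is an illustrative assumption of that flow, not the authors' actual network; the function bodies are trivial placeholders standing in for learned modules:

```python
import numpy as np

def semantic_segmentation(frame):
    """Placeholder for the learned segmentation stage.

    The real network predicts a per-pixel map of the playing surface;
    here we simply threshold brightness to get a binary mask.
    """
    gray = frame.mean(axis=2)
    return (gray > 200).astype(np.uint8)

def initialize_camera_pose(segmentation):
    """Placeholder for the camera pose initialization stage.

    The real network regresses a coarse initial estimate from the
    segmentation map; here we return the identity homography.
    """
    return np.eye(3)

def refine_homography(H_init, segmentation):
    """Placeholder for the homography refinement stage.

    The real network iteratively adjusts the estimate so the projected
    field model aligns with the segmented markings; here it is a no-op.
    """
    return H_init

def calibrate(frame):
    # The three stages composed end to end: frame in, 3x3 homography out.
    seg = semantic_segmentation(frame)
    H0 = initialize_camera_pose(seg)
    return refine_homography(H0, seg)
```

Framing all three stages as one network is what lets the method train and run end to end, rather than calibrating each camera by hand.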

Using an NVIDIA TITAN RTX GPU and the cuDNN-accelerated TensorFlow deep learning framework for training, the researchers can determine the camera homography of a single moving camera using only the camera frame and the sport as inputs.
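The homography the network predicts is a 3x3 matrix that maps points on the 2D field plane to pixel coordinates in the camera frame. A minimal NumPy example of applying such a matrix is shown below; the matrix entries and field dimensions are made-up values for illustration only:

```python
import numpy as np

# Hypothetical homography mapping field coordinates (meters) to pixels.
H = np.array([
    [50.0,  0.0, 100.0],
    [ 0.0, 40.0,  50.0],
    [ 0.0,  0.0,   1.0],
])

def project(H, points):
    """Apply a 3x3 homography to an N x 2 array of planar points.

    Points are lifted to homogeneous coordinates, multiplied by H,
    then divided by the last coordinate to return to 2D pixels.
    """
    pts = np.hstack([points, np.ones((len(points), 1))])  # N x 3
    mapped = pts @ H.T                                     # N x 3
    return mapped[:, :2] / mapped[:, 2:3]

# Project two opposite court corners into the image
# (a basketball court is roughly 28 x 15 m).
corners = np.array([[0.0, 0.0], [28.0, 15.0]])
pixels = project(H, corners)  # [[100., 50.], [1500., 650.]]
```

This projection is what draws the blue field lines in the figure below: once the homography is known, any point of the field model can be overlaid on the broadcast frame.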

Their technique also reduces inference time to 4 milliseconds, roughly two orders of magnitude faster than the previous state of the art, making it suitable for live broadcast.

“The evaluation results show that our method outperforms the previous state-of-the-art in challenging scenarios like basketball and achieves competitive performance in relatively static environments like soccer,” the researchers stated in their paper, End-to-End Camera Calibration for Broadcast Videos, which will be presented at the Computer Vision and Pattern Recognition (CVPR) conference this week.

Top row: example frames from a basketball game with the field projection (blue lines) generated by the predicted homography. Bottom row: the output of the semantic segmentation step of the network. After segmentation, the network performs camera pose initialization and homography refinement.

According to the researchers, a typical sporting event is covered by multiple separate cameras, but today only a handful of them provide the calibration capabilities needed for these applications. Extending these capabilities to more cameras and more sports will improve insights and help sports leagues build a tighter relationship with their fans.

Read more>