Researchers from Google and Stanford have taught their computer vision model to detect the most important person in a multi-person video scene: for example, identifying the shooter in a basketball game, where a single scene typically contains dozens or hundreds of people.
The team used 20 Tesla K40 GPUs and the cuDNN-accelerated TensorFlow deep learning framework to train their recurrent neural network on 257 NCAA basketball games from YouTube. An attention mask selects which of the several people in a scene are most relevant to the action being performed, then tracks each one's relevance over time. The team published a paper detailing more of their work.
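To make the idea of an attention mask over people concrete, here is a minimal sketch in plain Python. It is not the researchers' model: the feature vectors, the `query` vector standing in for the recurrent network's state, the dot-product scoring, and every function name are assumptions chosen for illustration. It only shows the general pattern of scoring each person, normalizing the scores with a softmax, and picking the highest-weighted person in each frame.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(player_features, query):
    """Weight each player's feature vector by its relevance to `query`.

    player_features: one feature vector per player (hypothetical inputs)
    query: vector standing in for the recurrent state at this time step
    Returns (weights, pooled); pooled is the attention-weighted feature sum.
    """
    # Assumed scoring rule: dot product between player features and the query.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in player_features]
    weights = softmax(scores)
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, player_features))
        for d in range(len(query))
    ]
    return weights, pooled


def key_actor_per_frame(frames, queries):
    """Return the index of the highest-weighted player in each frame."""
    result = []
    for feats, query in zip(frames, queries):
        weights, _ = attend(feats, query)
        result.append(max(range(len(weights)), key=weights.__getitem__))
    return result


# Toy data: three players, two frames, 2-D features (made-up numbers).
frames = [
    [[1.0, 0.0], [0.2, 0.1], [0.0, 0.3]],  # frame 0: player 0 stands out
    [[0.1, 0.0], [0.2, 0.1], [0.0, 2.0]],  # frame 1: player 2 stands out
]
queries = [[1.0, 1.0], [1.0, 1.0]]

print(key_actor_per_frame(frames, queries))  # prints [0, 2]
```

In this toy run the highest-weighted player changes from one frame to the next, which is the kind of shift in relevance over time that the attention mask is described as tracking. In a trained system the scores would presumably come from learned parameters rather than a fixed dot product.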
Over time, the system can identify not only the most important actor but also other potentially important actors and the events they are associated with. For example, it can recognize that the player going up for a layup may be important, but that the most important player is the one who then blocks the shot.
Teaching an AI to Detect Key Actors in Multi-person Videos
Jul 06, 2016
