5. Autonomous shopping for frictionless customer experience

Autonomous shopping is catching on rapidly. It allows customers to enter, shop, and leave the store without ever interacting with a cashier or waiting in long lines at automated checkout counters that don’t always work reliably.

Among the many models required to power this application, we would need one that can predict human actions accurately.

5.1 Challenges with recognizing human actions

Recognizing human actions provides a deeper understanding of the scene than just detecting people. In a retail environment, it can help identify customer activities such as picking up an object from a shelf, inspecting it, or putting it in a cart, all of which are essential for a frictionless shopping experience.

However, action recognition is a difficult problem to solve. First, you need labeled video data, which is more expensive to obtain than labeled image data. Second, action recognition is a more complex, compute-intensive operation: recognizing an action requires temporal information, so the model operates on a sequence of frames instead of a single one.
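To make the difference between a single frame and a sequence of frames concrete, the sketch below picks the frame indices for one fixed-length clip from the middle of a video. The function name, the clip length of 32 frames, and the stride parameter are illustrative assumptions, not values taken from the experiment described here.

```python
def sample_clip_indices(num_frames, clip_len, stride=1):
    """Return the frame indices of one clip taken from the middle of a video.

    clip_len and stride are illustrative parameters (assumptions).
    """
    # Number of source frames the clip covers, counting skipped frames.
    span = (clip_len - 1) * stride + 1
    if num_frames < span:
        raise ValueError("video is too short for the requested clip")
    start = (num_frames - span) // 2
    return [start + i * stride for i in range(clip_len)]


# An image classifier consumes one frame; an action model consumes all of these.
indices = sample_clip_indices(num_frames=90, clip_len=32)
print(len(indices), indices[0], indices[-1])  # 32 29 60
```

Every index in the list corresponds to a decoded frame that must be loaded and processed together, which is where the extra compute cost comes from.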

5.2 Customizing a pretrained action recognition model

The solution starts with a pretrained model for action recognition. For this experiment, we chose the I3D Inception model architecture described in this paper, which can be found at this repository.

Figure 11. 3-D action classification model

In TAO, both RGB and optical flow I3D pretrained models are supported. We take the RGB model and finetune it with the classes that represent the typical actions performed by a human in a retail setting.

5.3 Results

For this experiment, we used the open-source “MERL Shopping Dataset”. This dataset has a total of 106 videos, each two minutes in duration for a given action. To use this data for action recognition, we preprocessed each video, extracted subclips from it, and labeled each subclip with one of five action classes:

  • Reach to shelf
  • Retract from shelf
  • Hand in shelf
  • Inspect product
  • Inspect shelf

The data is then stored in the directory format shown below:

  |--> <Class name>
   |--> <Video 1>
     |--> rgb
      |--> 000001.png
      |--> 000002.png
      |--> 000003.png

From the 106 videos, 65 were chosen for training, 10 for validation, and 31 for testing. https://github.com/NVIDIA-AI-IOT/TAO-Toolkit-Whitepaper-use-cases/tree/main/workspace/ar_merl.
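The text does not say how videos were assigned to each partition. The sketch below assumes a seeded random assignment at the video level and only reproduces the 65 / 10 / 31 counts; the function name, the seed, and the video identifiers are hypothetical.

```python
import random


def split_videos(video_ids, n_train=65, n_val=10, n_test=31, seed=0):
    """Randomly partition video IDs into train / val / test (assumed scheme)."""
    if len(video_ids) != n_train + n_val + n_test:
        raise ValueError("partition sizes must add up to the number of videos")
    shuffled = sorted(video_ids)           # fixed starting order
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


# Hypothetical identifiers standing in for the 106 videos.
splits = split_videos([f"video_{i:03d}" for i in range(1, 107)])
print({name: len(ids) for name, ids in splits.items()})
```

Splitting by video rather than by subclip keeps all subclips of one video in the same partition.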

We used the I3D Kinetics pretrained weights as a starting point. The recipe to use these pretrained weights can be found at this GitHub repository.

A spec file is provided for model finetuning, export, inference, and evaluation.

After 100 epochs on the given training dataset, the final training and validation losses were less than 0.1 and 0.4, respectively. The accuracy for these five classes is reported in the table below.

  Class Name            Accuracy (%)
  Hand in shelf         61.5
  Inspect product       48.67
  Inspect shelf         95.73
  Reach to shelf        89.63
  Retract from shelf    89.33
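The text does not spell out how the per-class figures were computed. A minimal sketch, assuming each figure is the percentage of test clips of that class that were predicted correctly, is shown below; the function name and the toy labels are made up for illustration and are not results from the experiment.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Percentage of clips of each true class that were predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: 100.0 * correct[cls] / total[cls] for cls in total}


# Toy labels, not data from the experiment.
y_true = ["Reach to shelf", "Reach to shelf", "Inspect shelf", "Inspect shelf"]
y_pred = ["Reach to shelf", "Inspect shelf", "Inspect shelf", "Inspect shelf"]
print(per_class_accuracy(y_true, y_pred))
# {'Reach to shelf': 50.0, 'Inspect shelf': 100.0}
```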

For model evaluation, we used the ‘conv’ method; more information about this evaluation method can be found in our developer blog.
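This section does not define the ‘conv’ method. One plausible reading, assumed here rather than confirmed by the source, is that several evenly spaced segments are taken from each video, the model scores every segment, and the scores are averaged before the class is chosen. The sketch below illustrates that idea only; all function names, the segment count, and the toy scores are hypothetical.

```python
def segment_starts(num_frames, clip_len, num_segments):
    """Evenly spaced start indices for num_segments clips of clip_len frames."""
    last_start = num_frames - clip_len
    if last_start < 0:
        raise ValueError("video is shorter than one clip")
    if num_segments == 1:
        return [last_start // 2]
    return [round(i * last_start / (num_segments - 1)) for i in range(num_segments)]


def average_scores(per_segment_scores):
    """Average the class scores column-wise over all segments."""
    n = len(per_segment_scores)
    return [sum(column) / n for column in zip(*per_segment_scores)]


def predict_video(per_segment_scores, class_names):
    """Pick the class with the highest averaged score."""
    mean = average_scores(per_segment_scores)
    best = max(range(len(mean)), key=mean.__getitem__)
    return class_names[best]


# Toy scores for three segments and two classes (not real model output).
scores = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
print(segment_starts(num_frames=100, clip_len=32, num_segments=3))  # [0, 34, 68]
print(predict_video(scores, ["Inspect product", "Inspect shelf"]))  # Inspect shelf
```

Averaging over several segments makes the prediction less sensitive to where a single clip happens to fall within the video, at the cost of one forward pass per segment.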

Figure 12. Inference results on a sample test clip: images in the second row are from simulation videos

After training the model, we evaluated it using the clips from the test dataset. The images in Figure 12 show the classification results. The step-by-step training process is described in this repo.