Conversational AI / NLP

Low-Code Building Blocks for Speech AI Robotics

Sep 22, 2022

By Dave Niewinski and Sven Chilton

Discuss (0)

AI-Generated Summary

Dislike

The NVIDIA Riva SDK provides low-code building blocks for adding speech AI skills, such as automatic speech recognition (ASR) and text-to-speech (TTS), to robotic applications.
By leveraging Riva, developers can create robots like Spot that can interact with humans vocally, perceive their environment, and navigate safely, with minimal coding required.
Riva's customizable and GPU-accelerated speech AI capabilities enable robots to understand and respond to voice commands, facilitating seamless human-machine interaction.

AI-generated content may summarize information incompletely. Verify important information. Learn more

When examining an intricate speech AI robotic system, it’s easy for developers to feel intimidated by its complexity. Arthur C. Clarke claimed, “Any sufficiently advanced technology is indistinguishable from magic.”

From accepting natural-language commands to safely interacting in real-time with its environment and the humans around it, today’s speech AI robotics systems can perform tasks to a level previously unachievable by machines.

Sign up for the latest Data Science news. Get the latest announcements, notebooks, hands-on tutorials, events, and more in your inbox once a month from NVIDIA.

Take Spot, a speech AI-enabled robot that can fetch drinks on its own, for example. To easily add speech AI skills, such as automatic speech recognition (ASR) or text-to-speech (TTS), many developers leverage simpler low-code building blocks when building complex robot systems.

For developers creating robotic applications with speech AI skills, this post breaks down the low-code building blocks provided by the NVIDIA Riva SDK.

By following along with the provided code examples, you learn how speech AI technology makes it possible for intelligent robots to take food orders, relay those orders to a restaurant employee, and finally navigate back home when prompted.

Design an AI robotic system using building blocks

Complex systems consist of several building blocks. Each building block is much simpler to understand on its own.

When you understand the function of each component, the end product becomes less daunting. If you’re using low-code building blocks, you can now focus on domain-specific customizations requiring more effort.

Our latest project uses “Spot,” a four-legged robot, and an NVIDIA Jetson Orin, which is connected to Spot through an Ethernet cable. This project is a prime example of using AI building blocks to form a complex speech AI robot system.

Architectural diagram of an AI robotics system with Riva low-code speech AI blocks shown as core components for platform, navigation, and interaction. — *Figure 2.* *A speech AI robot system with Riva low-code speech AI blocks to add ASR and TTS skills*

Our goal was to build a robot that could fetch us snacks on its own from a local restaurant, with as little intervention from us as possible. We also set out to write as little code as possible by using what we could from open-source libraries and tools. Almost all the software used in this project was freely available.

To achieve this goal, an AI system must be able to interact with humans vocally, perceive its environment (in our case, with an embedded camera), and navigate through the surroundings safely. Figure 2 shows how interaction, platform, and navigation represent our Spot robot’s three fundamental operation components, and how those components are further subdivided into low-code building blocks.

This post focuses solely on the human interaction blocks from the Riva SDK.

Add speech recognition and speech synthesis skills using Riva

We have so many interactions with people every day that it is easy to overlook how complex those interactions actually are. Speaking comes naturally to humans but is not nearly so simple for an intelligent machine to understand and talk.

Riva is a fully customizable, GPU-accelerated speech AI SDK that handles ASR and TTS skills, and is deployable on-premises, in all clouds, at the edge, and on embedded devices. It facilitates human-machine speech interactions.

Riva runs entirely locally on the Spot robot. Therefore, processing is secure and does not require internet access. It is also completely configurable with a simple parameter file, so no extra coding is needed.

Riva code examples for each speech AI task

Riva provides ready-to-use Python scripts and command-line tools for real-time transformation of audio data captured by a microphone into text (ASR, speech recognition, or speech-to-text) and for converting text into an audio output (TTS, or speech synthesis).

Adapting these scripts for compatibility with Open Robotics (ROS) requires only minor changes. This helps simplify the robotic system development process.

ASR customizations

The Riva OOTB Python client ASR script is named transcribe_mic.py. By default, it prints ASR output to the terminal. By modifying it, the ASR output is routed to a ROS topic and can be read by anything in the ROS network. The critical additions to the script’s main() function are shown in the following code example:

   inter_pub = rospy.Publisher('intermediate', String, queue_size=10)
   final_pub = rospy.Publisher('final', String, queue_size=10)
   rospy.init_node('riva_asr', anonymous=True)

The following code example includes more critical additions to main:

       for response in responses:
           if not response.results:
               continue
           partial_transcript = ""
           for result in response.results:
               if not result.alternatives:
                   continue
               transcript = result.alternatives[0].transcript
               if result.is_final:
                   for i, alternative in enumerate(result.alternatives):
                       final_pub.publish(alternative.transcript)
              else:
                  partial_transcript += transcript
           if partial_transcript:
               inter_pub.publish(partial_transcript)

TTS customizations

Riva also provides the talk.py script for TTS. By default, you enter text in a terminal or Python interpreter, from which Riva generates audio output. For Spot to speak, the input text talk.py script is modified so that the text comes from a ROS callback rather than a human’s keystrokes. The key changes to the OOTB script include this function for extracting the text:

def callback(msg):
   global TTS
   TTS = msg.data

They also include these additions to the main() function:

   rospy.init_node('riva_tts', anonymous=True)
   rospy.Subscriber("speak", String, callback)

These altered conditional statements in the main() function are also key:

       while not rospy.is_shutdown():
           if TTS != None:
               text = TTS

Voice interaction script

Simple scripts like voice_control.py consist primarily of the callback and talker functions. They tell Spot what words to listen for and how to respond.

def callback(msg):
   global pub, order
   rospy.loginfo(msg.data)
   if "hey spot" in msg.data.lower() and "fetch me" in msg.data.lower():
       order_start = msg.data.index("fetch me")
       order = msg.data[order_start + 9:]
       pub.publish("Fetching " + order)

def talker():
   global pub
   rospy.init_node("spot_voice_control", anonymous=True)
   pub = rospy.Publisher("speak", String, queue_size=10)
   rospy.Subscriber("final", String, callback)
   rospy.spin()

In other words, if the text contains “Hey Spot, … fetch me…” Spot saves the rest of the sentence as an order. After the ASR transcript indicates that the sentence is finished, Spot activates the TTS client and recites the word “Fetching” plus the contents of the order. Other scripts then engage a ROS action server instructing Spot to navigate to the restaurant, while taking care to avoid cars and other obstacles.

When Spot reaches the restaurant, it waits for a person to take its order by saying “Hello Spot.” If the ASR analysis script detects this sequence, Spot recites the order and ends it with “please.” The restaurant employee places the ordered food and any change in the appropriate container on Spot’s back. Spot returns home after Riva ASR recognizes that the restaurant staffer has said, “Go home, Spot.”

The technology behind a speech AI SDK like Riva for building and deploying fully customizable real-time speech AI applications deployable on-premises, in all clouds, at the edge, and embedded, brings AI robotics into the real world.

When a robot seamlessly interacts with people, it opens up a world of new areas where robots can help without needing a technical person on a computer to do the translation.

Deploy your own speech AI robot with a low-code solution

Teams such as NVIDIA, Open Robotics, and the robotics community, in general, have done a fantastic job in solving speech AI and robotics problems and making that technology available and accessible for everyday robotics users.

Anyone eager to get into the industry or to improve the technology they already have can look to these groups for inspiration and examples of cutting-edge technology. These technologies are usable through free SDKs (Riva, ROS, NVIDIA DeepStream, NVIDIA CUDA) and capable hardware (robots, NVIDIA Jetson Orin, sensors).

I am thrilled to see this level of community support from technology leaders and invite you to build your own speech AI robot. Robots are awesome!

For more information, see the following related resources:

Explore how to get started with integrating and deploying Riva ASR and TTS models in production with high-performance inference and minimal effort with the free ebook, Building Speech AI Applications.
Explore the NVIDIA Riva Quick Start Guide for Embedded Platforms for scripts and utilities for getting started with Riva Speech Skills on Embedded platforms.
Get started with downloading Riva, try guided Riva labs on ready-to-run infrastructure in LaunchPad, or get support for large-scale Riva deployments with NVIDIA AI Enterprise software.

Discuss (0)

About the Authors

About Dave Niewinski
Dave Niewinski is a Canadian robotics engineer who leverages years of industry experience to help companies adopt or develop their own robotic systems. He also shares his love of advanced robotics technologies through his YouTube channel by developing fun and engaging projects. He has a bachelor’s of engineering from the University of Guelph, and is a professional engineer in Ontario.

View all posts by Dave Niewinski

About Sven Chilton
Sven Chilton is a deep learning developer advocate at NVIDIA. He develops, plays with, educates about, and evangelizes for fundamental NVIDIA deep learning tools, in particular the Riva speech AI framework. Sven has long loved mathematics and languages and considers speech AI and NLP beautiful fusions of the two. He earned a bachelor's degree in physics from the Massachusetts Institute of Technology, and master's and doctoral degrees in nuclear engineering from the University of California, Berkeley.

View all posts by Sven Chilton