How Conversational AI Works

When you ask an application a spoken question, the automatic speech recognition (ASR) stage converts the audio waveform into text for processing by subsequent components. The question is then interpreted, and a large language model enhanced with retrieval-augmented generation (RAG) generates a text response. Finally, the text-to-speech (TTS) stage, also known as speech synthesis, converts that response into an audio signal for the user.
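
The flow described above can be sketched as a chain of three stages. This is a minimal illustration only, not a real implementation: every name here (Audio, recognize_speech, retrieve, generate_response, synthesize_speech, answer, DOCUMENTS) is hypothetical, and each stage is a stub standing in for an actual ASR model, a retriever plus language model, and a TTS engine. To keep the example runnable without any models, the "audio" is simply UTF-8 bytes of the spoken words.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Audio:
    """Hypothetical audio container; real systems carry waveform samples."""
    samples: bytes


# Hypothetical knowledge base used by the retrieval step.
DOCUMENTS = {
    "asr": "ASR converts speech audio into text.",
    "tts": "TTS converts text into speech audio.",
}


def recognize_speech(audio: Audio) -> str:
    """ASR stub: treats the audio bytes as the UTF-8 transcript."""
    return audio.samples.decode("utf-8")


def retrieve(question: str) -> List[str]:
    """Retrieval stub: returns documents whose key appears in the question."""
    lowered = question.lower()
    return [text for key, text in DOCUMENTS.items() if key in lowered]


def generate_response(question: str, context: List[str]) -> str:
    """LLM stub: answers from the retrieved context, or admits it has none."""
    if context:
        return " ".join(context)
    return "Sorry, I don't know."


def synthesize_speech(text: str) -> Audio:
    """TTS stub: encodes the reply text as the 'audio' payload."""
    return Audio(samples=text.encode("utf-8"))


def answer(audio: Audio) -> Audio:
    """Runs the full pipeline: ASR, then retrieval and generation, then TTS."""
    question = recognize_speech(audio)
    context = retrieve(question)
    reply = generate_response(question, context)
    return synthesize_speech(reply)
```

Calling answer(Audio(b"What is ASR?")) passes the question through all three stages in order. Keeping each stage behind its own function means any one of them could be swapped for a real model without changing the others.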