Conversational AI

Personalized Learning with Gipi, NVIDIA TensortRT-LLM, and AI Foundation Models

Stylized image of a smartphone chat with a young woman smiling off to one side.

Over 1.2B people are actively learning new languages, with over 500M learners on digital learning platforms such as Duolingo. At the same time, a significant portion of the global population, including 73% of Gen-Z, experiences feelings of disconnection and unhappiness, often exacerbated by social media.

This highlights a unique dichotomy: People are hungry for personalized learning experiences, but often lack the tools to navigate the impacts on personal well-being emanating from the platforms that pervade our day-to-day lives.

Rise of AI chatbots: Transforming education and interaction

AI chatbots are increasingly being used to address these challenges, offering unique personalities, personalized wellness check-ins, multilingual capabilities, and tutorship features with instant feedback. 

Some, including Gipi, are engineered to remember user conversations, learn from their interests, and engage in dialogue about topics that matter to them, including personalized help in language learning, speaking practice, mathematics, science, and other domains. 

Gipi also proactively reaches out to users to check in, continuing conversations from where they left off. For instance, when a user mentioned an upcoming job interview, Gipi followed up* with encouragement and later checked in for an update (Figure 1).

Screenshot where the user tells Gipi, 'I'm having an interview tomorrow at noon with the new job that I told you about.' Gipi responds, 'That's fantastic news, Emily! I'm so excited for you. How are you feeling about the interview? Is there anything specific you'd like to talk about or prepare for? I'm here to help in any way I can. Good luck, and remember, you've got this!' The conversation continues with Gipi checking in on the user, 'Hey Emily! How did your job interview go? Any news?' The user replies, 'I would love to share some good news!' indicating an ongoing supportive dialogue. 
Figure 1. Conversation text about an interview in the Gipi app

The mechanics of Gipi’s intelligence

The architecture of Gipi’s intelligence involves a range of technologies and processes. This section introduces the key components that enable Gipi to understand and interact with users:

  • Speech-to-text
  • Prompt creation and management
  • Make Gipi smart
  • Text-to-speech
Diagram shows three main steps: 1) Speech-to-text conversion of the user's audio input, 2) processing by the LLM, and 3) text-to-speech generation of Gipi's response, with arrows indicating the flow between the user, Gipi, and the processes involved.
Figure 2. Gipi data flow

Speech-to-text

Gipi’s speech-to-text technology relies on a custom Whisper-based model and the model size has been optimized to improve efficiency, reduce latency, and enhance GPU memory usage. 

Originally, the model used the standard Whisper dataset, which is composed of public videos that are prone to errors. To mitigate these anomalies, Gipi now trains its model on a unique, more reliable dataset, which enables more efficient voice-to-text conversion and captures the wide variety of linguistic nuances of our user base. 

Early investment in strong speech-to-text capabilities has been validated by the fact that over half of Gipi users actively engage with the voice chat feature.

Prompt creation and management

Gipi’s sophisticated personalities and tailored responses rely on user preferences and prompt history. Our history management system personalizes each interaction; Gipi remembers every user. 

We improve Gipi’s memory retention by summarizing past interactions and feeding them back into the system. More importantly, we constantly extract and integrate personal attributes about the user into the conversation prompts. This process enables Gipi to remember and reference every significant detail, ensuring a personalized and continuous dialogue.

We use LangChain to simplify prompt creation, which enables us to effectively organize and manage different types of prompts, such as system-related or conversational. This helps us keep prompts clear and appropriate for their specific uses. LangChain facilitates the adaptation of our prompts to different language models, making the system model-independent and more versatile. It also helps manage short-term memory, enabling Gipi to recall what was discussed earlier in an early conversation.

Make Gipi smart

Gipi’s LLM is at the heart of its intelligence. Although we originally relied on a proprietary model, we later turned to NVIDIA TensorRT for backend optimization to improve LLM inference speed. 

Originally, using the Llama 2 4-bit model with 4096 input tokens and 512 output tokens on an NVIDIA A6000 Ada GPU, we saw response times of 35–40 seconds per request. But after integrating NVIDIA TensorRT-LLM, we’ve dramatically reduced this to just 3–4 seconds, achieving a 10–12x speed increase. This framework excels in processing text-based language models quickly and efficiently.  

To complement these capabilities, we are working on integrating Mistral 7B, chosen for its versatility in tasks such as summarizing texts, translating languages, coding assistance, sentiment analysis, and more, further enhancing research and educational tools.

GIF shows a chat where the user asks what to buy for a barbecue with friends. Gipi suggests essentials such as burgers, hot dogs, condiments, refreshing drinks, sides like potato salad and coleslaw, and sweet treats for dessert. 
Figure 3. Conversation about barbecue supplies with Gipi

We developed a long-term memory system for Gipi, enabling it to recall past interactions for enhanced personalization in each session. This system, integrated with Gipi’s tailored response mechanisms, aims to provide a more engaging user experience.

Text-to-speech

In the realm of text-to-speech, we use the NVIDIA NeMo TTS Framework to ensure that Gipi not only understands you but also responds with a natural-sounding voice. 

Recently, we’ve expanded the product’s capabilities by developing the ability to create custom voices. Gipi can generate entirely new voices based on voice audio clips submitted by users, offering an even greater degree of personalization. 

The latest model uses a GPT2 backbone along with a perceiver model for speaker conditioning, which has improved Gipi’s ability to capture speaker characteristics and ensure consistent output. We have also integrated HifiGAN for audio signal computation, significantly reducing inference latency. 

Summary

As AI becomes integrated into daily routines, it enhances efficiency and expands our access to information. Gipi uses advanced AI to support language learning and skill development, providing tools that help users enhance their capabilities. 

We envision sophisticated AI tools becoming as accessible and ubiquitous as smartphones, empowering users with intelligent, adaptive support. Gipi is designed to facilitate growth and learning, offering a supportive boost in your pursuit of knowledge and self-improvement.

To discover how Gipi can enhance your interaction and learning experience, download it from the Google Play Store, Apple Store, or visit Gipi.

For more information about LLM enterprise applications, see Getting Started with Large Language Models for Enterprise Solutions. Join the conversation on LLMs in the NVIDIA TensorRT forum.

*For readability and to address privacy concerns, user consent has been obtained. All messages have been anonymized and edited accordingly.

Discuss (0)

Tags