AI agents are transforming business operations by automating processes, optimizing decision-making, and streamlining actions. Their effectiveness hinges on expert reasoning, enabling smarter planning and efficient execution.
Agentic AI applications could benefit from the capabilities of models such as DeepSeek-R1. Built for solving problems that require advanced AI reasoning, DeepSeek-R1 is an open 671-billion-parameter mixture of experts (MoE) model. Trained with reinforcement learning (RL) techniques that incentivize accurate and well-structured reasoning chains, it excels at logical inference, multistep problem-solving, and structured analysis.
DeepSeek-R1 breaks down complex problems into multiple steps with chain-of-thought (CoT) reasoning, enabling it to tackle intricate questions with greater accuracy and depth. To do this, DeepSeek-R1 uses test-time scaling, a new scaling law that enhances a model’s capabilities and deduction powers by allocating additional computational resources during inference.
However, this structured AI reasoning comes at the cost of longer inference times. As the model processes more complex problems, inference time scales nonlinearly, making real-time and large-scale deployment challenging. Optimizing its execution is critical to making DeepSeek-R1 practical for broader adoption.
This post explains the DeepSeek-R1 NIM microservice and how you can use it to build an AI agent that converts PDFs into engaging audio content in the form of monologues or dialogues.
DeepSeek-R1 masters complex reasoning and strategic adaptation
The exceptional performance of DeepSeek-R1 in benchmarks like AIME 2024, CodeForces, GPQA Diamond, MATH-500, MMLU, and SWE-Bench highlights its advanced reasoning and mathematical and coding capabilities. For enterprise agentic AI, this translates to enhanced problem-solving and decision-making across various domains. Its proficiency in complex tasks enables the automation of sophisticated workflows, leading to more efficient and scalable operations.
Reasoning models, however, are not well-suited for extractive tasks like fetching and summarizing information. Instead of retrieving answers efficiently, they tend to overthink, analyzing nuances unnecessarily before responding. This slows down performance and wastes computational resources, making them inefficient for high-throughput, fact-based tasks where simpler retrieval models would be more effective.
Enhance agentic reasoning with DeepSeek-R1 NIM microservice
As a developer, you can easily integrate state-of-the-art reasoning capabilities into AI agents through privately hosted endpoints using the DeepSeek-R1 NIM microservice, which is now available for download and deployment anywhere. This integration enhances the planning, decision-making, and actions of AI agents.
NVIDIA NIM microservices support industry standard APIs and are designed to be deployed seamlessly at scale on any Kubernetes-powered GPU system including cloud, data center, workstation, and PC. The flexibility to run a NIM microservice on your secure infrastructure also provides full control over your proprietary data.
NIM microservices advance a model’s performance, enabling enterprise AI agents to run faster on GPU-accelerated systems. Faster reasoning enhances the performance of agentic AI systems by accelerating decision-making across interdependent agents in dynamic environments.
Optimize performance at scale with NVIDIA NIM
NVIDIA NIM is optimized to deliver high throughput and latency across different NVIDIA GPUs. By taking advantage of Data Parallel Attention, NVIDIA NIM scales to support users on a single NVIDIA H200 Tensor Core GPU node, ensuring high performance even under peak demand.
It achieves this efficiency through the NVIDIA Hopper architecture FP8 Transformer Engine, utilized across all layers, and the 900 GB/s of NVLink bandwidth that accelerates MoE communication for seamless scalability. This high efficiency translates to a reduction in overall operational costs and low latency delivers fast response times that enhance user experience, making interactions more seamless and responsive.
The latency and throughput of the DeepSeek-R1 model will continue to improve as new optimizations will be incorporated in the NIM.
Build a real-world application with NVIDIA AI Blueprints and DeepSeek-R1
NVIDIA Blueprints are reference workflows for agentic and generative AI use cases. You can quickly integrate the capabilities of the DeepSeek-R1 NIM with these blueprints. To provide an example, this section walks through this integration for the NVIDIA AI Blueprint for PDF to podcast. This blueprint enables you to convert PDFs into engaging audio content in the form of monologues or dialogues.
The general workflow for this blueprint is as follows:
- The user passes the target PDF document to the blueprint. This document is the main source of information for the podcast. The user can optionally provide several context PDF documents to the blueprint, which will be used as additional sources of information.
- The blueprint processes the target PDF into markdown format and passes the results to the long reasoning agent.
- The agentic workflow for this blueprint relies on several LLM NIM endpoints to iteratively process the documents, including:
- A reasoning NIM for document summarization, raw outline generation and dialogue synthesis.
- A JSON NIM for converting the raw outline to structured segments, as well as converting dialogues to structured conversation format.
- An iteration NIM for converting segments into transcripts, as well as combining the dialogues together in a cohesive manner. This NIM is invoked multiple times and in parallel.
 
- These LLM NIM microservices are used iteratively and in several stages to form the final podcast content and structure.
- Once the final structure and content is ready, the podcast audio file is generated using the Text-to-Speech service provided by ElevenLabs.
Figure 1 shows an overview of this blueprint, which is available through NVIDIA-AI-Blueprints/pdf-to-podcast on GitHub.

The NIM used for each type of processing can be easily switched to any remotely or locally deployed NIM endpoint, as explained in subsequent sections.
Blueprint requirements
The NVIDIA AI Blueprint for PDF to podcast can be executed locally on Ubuntu-based machines (v20.04 and above). The minimum requirements necessary to follow along include:
- Docker engine, with the NVIDIA Container Toolkit and Docker Compose
- An API key for ElevenLabs Text-to-Speech API
- NIM endpoints
- You can use the NVIDIA-hosted endpoint for the DeepSeek-R1 NIM available from the NVIDIA API catalog by signing up to obtain an API key.
- For locally hosted NIM endpoints, see NVIDIA NIM for LLMs Getting Started for deployment instructions. Note that DeepSeek-R1 requires 16 NVIDIA H100 Tensor Core GPUs (or eight NVIDIA H200 Tensor Core GPUs) for deployment. Ensure the endpoints are up and reachable (for example, localhost:8000).
 
For more information, see the NVIDIA AI Blueprint for PDF to podcast documentation.
Blueprint setup
To set up the blueprint on your local machine, run the following commands:
# Clone the repository
$ git clone https://github.com/NVIDIA-AI-Blueprints/pdf-to-podcast
$ cd pdf-to-podcast
# Setup the API keys
$ echo "ELEVENLABS_API_KEY=your_key" >> .env
$ echo "NVIDIA_API_KEY=your_key" >> .env
# Avoid ElevenLabs API rate limiting issues
$ echo "MAX_CONCURRENT_REQUESTS=1" >> .env
# Install the dependencies
$ make uv
Model selection
Specifying the underlying models used throughout various pipeline stages is quite simple and can be done by modifying the models.json file in your local repository.
Considering the reasoning power of DeepSeek-R1, this model will be used as the reasoning NIM to ensure a deeper analysis and discussion for the resulting podcast. Other smaller models will be used for JSON and iteration NIM microservices that would make the nonreasoning processing stages much faster.
The following code demonstrates how to use some of the models from the NVIDIA API catalog for the podcast:
{
  "reasoning": {
    "name": "deepseek-ai/deepseek-r1",
    "api_base": "https://integrate.api.nvidia.com/v1"
  },
  "json": {
    "name": "meta/llama-3.1-8b-instruct",
    "api_base": "https://integrate.api.nvidia.com/v1"
  },
  "iteration": {
    "name": "meta/llama-3.1-70b-instruct",
    "api_base": "https://integrate.api.nvidia.com/v1"
  }
}
Note that, as part of its reasoning and test-time scaling process, DeepSeek-R1 typically generates many output tokens. If you’re using externally hosted models or APIs, such as those available through the NVIDIA API Catalog or ElevenLabs TTS service, be mindful of API usage credit limits or other associated costs and limitations.
After specifying the models, you can start the podcast generation agent services by running the following command:
# Start the services
$ make all-services
This command starts all the required services for podcast generation in the active terminal. To stop the services, simply send the CTRL + C signal in that terminal.
Generate the podcast
Once all the agent services are up and running, you can start generating the podcast. The repository provides a few sample documents to use under the samples directory. You can use your own documents by copying them to the samples directory. 
Run the following command to start generating the podcast:
# From inside the `pdf-to-podcast` directory:
# Activate the virtual environment
$ source .venv/bin/activate
# To generate a podcast with two speakers (dialog form):
$ python tests/test.py --target <main_doc.pdf> --context <context_doc.pdf>
# To generate a podcast with a single speakers (monologue form):
$ python tests/test.py --target <main_doc.pdf> --context <context_doc.pdf> --monologue
When the command above finishes executing, the resulting podcast will be written to an MP3 file under the tests directory.
Note that, when using the DeepSeek-R1 model as the reasoning model, we recommend experimenting with short documents (one or two pages, for example) for your podcasts to avoid running into timeout issues or API usage credits limits.
Customize and experiment
You can control the behavior of the underlying models used in this blueprint and customize them to your liking. The prompts used for invoking each model are available in podcast_prompts.py. Specifically, the definitions PODCAST_SUMMARY_PROMPT_STR, as well as PODCAST_MULTI_PDF_OUTLINE_PROMPT_STR and PODCAST_TRANSCRIPT_TO_DIALOGUE_PROMPT_STR are used for invoking the reasoning model during generation. Adjust these prompts and experiment with the blueprint.
Get started with the DeepSeek-R1 NIM microservice
You can build AI agents that deliver fast, accurate reasoning in real-world applications by combining the reasoning prowess of DeepSeek-R1 with the flexible, secure deployment offered by NVIDIA NIM microservices. You can also leverage the DeepSeek-R1 NIM in various NVIDIA Blueprints.
Get started with the DeepSeek-R1 NVIDIA NIM with an NVIDIA-hosted API or download and run on the NVIDIA platform.
 
         
           
           
           
     
     
     
     
     
     
     
    