Data scientists spend much of their time cleaning and preparing large, unstructured datasets before analysis can begin, work that often requires strong programming and statistical expertise. Managing feature engineering, model tuning, and consistency across workflows is complex and error-prone. These challenges are amplified by the slow, sequential nature of CPU-based ML workflows, which makes experimentation and iteration painfully inefficient.
Accelerated data science ML agent
We prototyped a data science agent that can interpret user intent and orchestrate repetitive tasks in an ML workflow to simplify data science and ML experimentation. With GPU acceleration, the agent can process datasets with millions of samples using NVIDIA CUDA-X Data Science libraries. It showcases NVIDIA Nemotron Nano-9B-v2, a compact, powerful open-source language model designed to translate the intent of a data scientist into an optimized workflow.
With this setup, developers can explore large datasets, train models, and evaluate results just by chatting with the agent. It bridges the gap between natural language and high-performance computing, enabling users to go from raw data to business insights in minutes. We encourage you to use this as a starting point to build your own agent with different LLMs, tools, and storage solutions tailored to your specific needs. Explore the Python scripts for this agent on GitHub.
Data science agent orchestration
The agent’s architecture is designed for modularity, scalability, and GPU acceleration. It is organized into five core layers and one temporary data store that work together to translate natural language prompts into executable data processing and ML workflows. Figure 1 shows the high-level workflow of how each layer interacts.

Let’s take a closer look at how the layers work together.
Layer 1: User interface
The user interface is a Streamlit-based conversational chatbot that lets users interact with the agent in plain English.
Layer 2: Agent orchestrator
This is the central controller that coordinates all layers. It interprets user prompts, delegates intent understanding to the LLM, calls the right GPU-accelerated functions from the tool layer, and responds in natural language. Each orchestrator method is a lightweight wrapper around a GPU function; for instance, a query to describe the data routes to _describe_data, which calls basic_eda(), while a ridge-tuning request routes to _optimize_ridge, which calls optimize_ridge_regression().
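To make the wrapper pattern concrete, here is a minimal sketch of an orchestrator dispatching to tool functions. The stub bodies and the dispatch method are illustrative stand-ins; see chat_agent.py in the repo for the actual implementation.

```python
# Minimal sketch of the orchestrator's wrapper pattern. The tool functions
# below are stubs; in the repo, basic_eda() and optimize_ridge_regression()
# are the GPU-accelerated implementations in the tool layer.

def basic_eda(df):
    """Stub for the GPU-accelerated EDA tool."""
    return df.describe()

def optimize_ridge_regression(df, target, n_trials=50):
    """Stub for the GPU-accelerated ridge HPO tool."""
    return {"best_alpha": 1.0, "n_trials": n_trials}

class AgentOrchestrator:
    def __init__(self):
        # Tool names as exposed to the LLM, mapped to thin wrapper methods.
        self._tools = {
            "describe_data": self._describe_data,
            "optimize_ridge": self._optimize_ridge,
        }

    def _describe_data(self, df):
        # Lightweight wrapper: delegate straight to the GPU function.
        return basic_eda(df)

    def _optimize_ridge(self, df, target, n_trials=50):
        return optimize_ridge_regression(df, target, n_trials=n_trials)

    def dispatch(self, tool_name, **kwargs):
        # Invoked after the LLM returns a function-call decision.
        return self._tools[tool_name](**kwargs)
```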

Layer 3: LLM layer
The LLM layer serves as the reasoning engine of the agent, initializing the language model client that communicates with Nemotron Nano-9B-v2 through the NVIDIA NIM API. This layer enables the agent to interpret natural language inputs and translate them into structured, executable actions through four key mechanisms: the LLM model, a retry strategy for resilient communication, function calling for structured tool invocation, and the function calling workflow.
- LLM model: The LLM layer architecture is LLM-agnostic and works with any language model that supports function calling. For this application, we used Nemotron Nano-9B-v2, which supports both function calling and advanced reasoning. Being smaller, the model offers an optimal balance between efficiency and capability and can be deployed on a single GPU for inference. It delivers up to 6x higher token generation throughput than other leading models in its size class, while its thinking budget feature lets developers control how many “thinking” tokens are used, reducing reasoning costs by up to 60%. This combination of performance and cost efficiency enables real-time conversational workflows that are economically viable for production deployment.
- Retry strategy for resilient communication: The LLM client implements an exponential backoff retry mechanism to handle transient network failures and API rate limits, ensuring reliable communication even under adverse network conditions or high API load (a minimal backoff sketch follows this list).
- Function calling for structured tool invocation: Function calling bridges natural language and code execution by enabling the LLM to translate user intent into structured tool invocations in the agent orchestrator. The agent defines available tools using OpenAI-compatible function schemas that specify each tool’s name, purpose, parameters, and constraints.
- Function calling workflow: Function calling transforms the LLM from a text generator into a reasoning engine capable of API orchestration. Nemotron Nano-9B-v2 is provided with a structured “API specification” of available tools, which it uses to understand user intent, select appropriate functions, extract parameters with proper types, and coordinate multi-step data processing and ML operations. All of this is driven by natural language, eliminating the need for users to understand API syntax or write code.
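The backoff behavior referenced in the retry bullet can be expressed in a few lines. Here is a minimal sketch, where a generic exception stands in for the API’s rate-limit and network errors; the actual client in llm.py may differ:

```python
import random
import time

def call_with_retries(request_fn, max_retries=5, base_delay=1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:  # in practice, catch only rate-limit/network errors
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...) with random jitter.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```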
Figure 3 shows the complete function-calling flow, from natural language to executable code. Refer to the chat_agent.py and llm.py scripts in the GitHub code for the operations listed in Figure 3.
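For illustration, here is what one OpenAI-compatible tool schema might look like for the ridge-tuning tool. The name and parameters are hypothetical stand-ins; the agent’s real schemas live in the GitHub scripts.

```python
# One OpenAI-compatible function schema (hypothetical example entry;
# the agent's actual schemas are defined in the GitHub scripts).
RIDGE_TOOL_SCHEMA = {
    "type": "function",
    "function": {
        "name": "optimize_ridge",
        "description": "Run hyperparameter optimization for ridge regression "
                       "on the loaded dataset.",
        "parameters": {
            "type": "object",
            "properties": {
                "target": {
                    "type": "string",
                    "description": "Name of the target column to predict.",
                },
                "n_trials": {
                    "type": "integer",
                    "description": "Number of HPO trials to run.",
                    "minimum": 1,
                },
            },
            "required": ["target"],
        },
    },
}
```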

Layer 4: Memory layer
The memory layer (ExperimentStore) stores experiment metadata, including model configurations, performance metrics, and evaluation results, such as accuracy and F1 scores. This metadata is saved in standard JSONL format in a session-specific file, allowing for in-session tracking and retrieval via functions like get_recent_experiments() and show_history().
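The pattern is straightforward append-only logging. Here is a minimal sketch of such a store with a simplified interface; the actual ExperimentStore in the repo may differ in detail:

```python
import json
from pathlib import Path

class ExperimentStore:
    """Append-only JSONL store for per-session experiment metadata (sketch;
    the repo's ExperimentStore may differ in detail)."""

    def __init__(self, path="experiments.jsonl"):
        self.path = Path(path)

    def log(self, record):
        # One JSON object per line: model config, metrics, evaluation results.
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def get_recent_experiments(self, n=5):
        # Return the n most recently logged experiments.
        if not self.path.exists():
            return []
        lines = self.path.read_text().splitlines()
        return [json.loads(line) for line in lines[-n:]]

store = ExperimentStore()
store.log({"model": "ridge", "alpha": 1.0, "r2": 0.87})
print(store.get_recent_experiments(1))
```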
Layer 5: Temporary data storage
The temporary data storage layer stores session-specific output files (best_model.joblib and predictions.csv) in the system’s temporary directory and surfaces them in the user interface for immediate download and use. These files are automatically deleted when the agent shuts down.
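A common way to implement this is a session-scoped temporary directory cleaned up at shutdown. A minimal sketch, with the file names taken from above and the atexit cleanup mechanism assumed:

```python
import atexit
import shutil
import tempfile
from pathlib import Path

# Session-scoped temporary directory, removed automatically at shutdown.
# (The atexit cleanup is an assumed mechanism; file names are from the article.)
session_dir = Path(tempfile.mkdtemp(prefix="ds_agent_"))
atexit.register(shutil.rmtree, session_dir, ignore_errors=True)

model_path = session_dir / "best_model.joblib"
predictions_path = session_dir / "predictions.csv"
```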
Layer 6: Tool layer
The tool layer is the computational core of the agent, responsible for executing data science functions such as data loading, exploratory data analysis (EDA), model training and evaluation, and hyperparameter optimization (HPO). The function selected for execution is based on the user query. Various optimization strategies are used, including:
- Consistency and Repeatability: The agent uses abstraction methods from scikit-learn (a popular open-source library) to ensure consistent data preprocessing and model training across training, testing, and production environments. This design prevents common ML pitfalls such as data leakage and inconsistent preprocessing by automatically applying the exact same transformations (imputation values, scaling parameters, and encoding mappings) learned during training to all inference data (a minimal pipeline sketch follows this list).
- Memory Management: To handle large datasets, we use memory optimization strategies. Float32 conversion reduces memory use, GPU memory management releases cached GPU memory, and dense output configuration is faster on GPUs than sparse formats.
- Function Execution: The tool layer uses CUDA-X Data Science libraries such as cuDF and cuML to deliver GPU-accelerated performance while maintaining the same syntax as pandas and scikit-learn. This zero-code-change acceleration is achieved through Python’s module preloading mechanism, enabling developers to run existing CPU code on GPUs without refactoring. The cudf.pandas accelerator replaces pandas operations with GPU equivalents, while cuml.accel automatically substitutes scikit-learn models with cuML’s GPU implementations.
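As referenced in the consistency bullet above, here is a minimal sketch of the scikit-learn Pipeline pattern that keeps preprocessing identical between training and inference. The column names and data are hypothetical; the agent’s actual pipelines may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "income"]   # hypothetical column names
categorical = ["segment"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    # Dense output is faster on GPUs than sparse formats (see memory bullet).
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical),
])

pipeline = Pipeline([("prep", preprocess), ("model", Ridge())])

df = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0],
    "income": [50_000.0, None, 72_000.0, 61_000.0],
    "segment": ["a", "b", "a", "c"],
    "target": [1.0, 2.0, 1.5, 2.2],
})
df[numeric] = df[numeric].astype(np.float32)  # float32 halves memory vs float64

# fit() learns imputation values, scaling parameters, and encodings once;
# predict() replays exactly the same transformations, preventing leakage.
pipeline.fit(df[numeric + categorical], df["target"])
preds = pipeline.predict(df[numeric + categorical])
```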
The following command launches a Streamlit interface with GPU acceleration enabled for both data processing and machine learning components:
```bash
python -m cudf.pandas -m cuml.accel -m streamlit run user_interface.py
```
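For scripts and notebooks where module preloading is inconvenient, the same acceleration can also be enabled programmatically. A minimal sketch, assuming the install() entry points available in recent RAPIDS releases:

```python
# Enable zero-code-change GPU acceleration BEFORE importing pandas/sklearn.
# (Assumes the cudf.pandas.install() and cuml.accel.install() entry points
# from recent RAPIDS releases; check your RAPIDS version's docs.)
import cudf.pandas
cudf.pandas.install()

import cuml.accel
cuml.accel.install()

import pandas as pd                      # now backed by cuDF on the GPU
from sklearn.linear_model import Ridge   # now dispatched to cuML

df = pd.read_csv("data.csv")             # hypothetical input file
model = Ridge().fit(df[["x1", "x2"]], df["y"])  # hypothetical columns
```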
Acceleration, modularity, and extension of the ML agent
The agent is built with a modular design for easy extension through new function calls, experiment stores, LLM integrations, and other enhancements. Its layered architecture supports the incorporation of additional capabilities over time. Out of the box, it includes support for popular machine learning algorithms, exploratory data analysis (EDA), and hyperparameter optimization (HPO).
Using CUDA-X data science libraries, the agent accelerates data processing and machine learning workflows end to end. This GPU-based acceleration delivers performance gains ranging from 3x to 43x, depending on the specific operation. Table 1 highlights the speedups achieved across several key tasks, including ML operations, data processing, and HPO.
| Agent Task | CPU (sec) | GPU (sec) | Speedup | Details |
| --- | --- | --- | --- | --- |
| Classification ML task | 21,410 | 6,886 | ~3x | Using logistic regression, random forest classification, and linear support vector classification with 1 million samples |
| Regression ML task | 57,040 | 8,947 | ~6x | Using ridge regression, random forest regression, and linear support vector regression with 1 million samples |
| Hyperparameter optimization for ML algorithm | 18,447 | 906 | ~20x | cuBLAS-accelerated matrix operations (QR decomposition, SVD) dominate; the regularization path is computed in parallel |
Get started with Nemotron models and CUDA-X Data Science libraries
The open-source data science agent is available on GitHub and ready to integrate with your datasets for end-to-end ML experimentation. Download the agent and let us know what datasets you tried, how much speedup you achieved, and which customizations you made.
Learn more:
- Nemotron family of models and agentic applications
- CUDA-X Data Science libraries: zero-code-change cuML and RAPIDS.ai community notebooks
- Google Colab with the latest data processing and ML libraries and zero-code-change acceleration
- DLI learning path for data science with self-paced and instructor-led courses