Large language models (LLMs) in quantitative finance are increasingly being used for alpha generation, automated report analysis, and risk prediction. Yet adoption is constrained by cost, latency, and integration complexity. In financial markets, where alpha signals emerge from rapidly evolving data, the ability to continuously fine-tune, distill, and deploy models from proprietary and real-world sources is crucial.

This example shows how NVIDIA technology supports continuous model fine-tuning and distillation so that the resulting models can be integrated into financial workflows. Researchers can systematically optimize, compress, and deploy high-performing models with direct connectivity to backtesting and strategy evaluation processes.

The AI Model Distillation for Financial Data developer example is intended for quantitative researchers, AI developers, and enterprise data scientists. Using a data flywheel, we operate over a financial newsfeed dataset to generate features from unstructured data that can be used for alpha research and risk prediction. The result is a set of smaller, domain-specific, and task-optimized models that maintain high accuracy while reducing computational overhead and deployment costs.

What is AI Model Distillation for Financial Data?

Model distillation is the process of transferring knowledge from a large, high-performing teacher model to a smaller, efficient student model. This enables faster inference, lower resource consumption, and deployment in edge or hybrid environments, while maintaining accuracy on domain-specific tasks.

What is a developer example?

A developer example is a tested, reproducible reference architecture that combines best practices, software tools, and modular deployment patterns to accelerate enterprise AI adoption. These end-to-end customizable examples demonstrate how complex workflows such as domain adaptation, model compression, or agent orchestration can be developed and scaled using the NVIDIA AI Enterprise software stack. They bridge the gap between concept and production, pairing reference code with tested architectural guidance.

This developer example provides a practical framework for continuous domain adaptation and model distillation, creating smaller, high-performance models tailored to enterprise financial data. By combining NVIDIA NeMo, NVIDIA Nemotron, NVIDIA NIM, and Dockerized components, you can develop a data flywheel for feature engineering, signal evaluation, and retraining. The architecture supports both on-premises and hybrid cloud deployment, ensuring flexibility and compliance with financial data governance standards.

Figure 1. Circular data flywheel AI workflow illustrating seven stages for compressing large models into smaller, efficient versions for enterprise deployment

This example distills the capabilities of a 49B or 70B parameter teacher into a smaller, customized student (1B, 3B, or 8B). We demonstrate this on a multi-class classification problem: the teacher generates labels for our dataset, and the labeled dataset is then used to customize the student models.

The developer example enables teams to:

Distill large LLMs into efficient domain-specific versions suited for financial text, thus reducing latency and inference costs while maintaining accuracy targets.

Accelerate backtesting and strategy evaluation by enabling rapid iteration and evaluation of trading signals, while maintaining model accuracy as market conditions and data sources evolve.

Ensure scalability and observability by facilitating model evaluation with built-in experiment tracking.

Deploy distilled models alongside existing NIM into financial AI workflows across on-prem, hybrid cloud, and edge environments.

These capabilities enable the deployment of lightweight, specialized models directly into research pipelines, trading systems, or edge inference environments.

How does it work?

We provide a reusable recipe to experiment and train these distilled models using the NVIDIA Data Flywheel Blueprint. At the heart of the blueprint is the flywheel orchestrator, a unified control plane that abstracts the complexity of interacting directly with NVIDIA NeMo microservices. Acting as the brain of the flywheel system, the orchestrator API coordinates the data flywheel job by leveraging a suite of modular NeMo microservices:

NVIDIA NeMo Customizer to handle lightweight LoRA-based fine-tuning

NVIDIA NeMo Evaluator to automate evaluations across runs

Datastore within NeMo to manage structured datasets and artifacts

Deployment manager within NeMo to spin up and serve candidate distilled models dynamically for inference

Each microservice is packaged as a Docker container for consistent deployment across environments. The workflow is orchestrated through Kubernetes, which handles dynamic orchestration of NIM microservices for both experimentation and production workloads.

Figure 2. Architecture diagram for AI Model Distillation for Financial Data developer example showing the data flywheel orchestrator and its integration

Prerequisites for NeMo Microservices

To get the developer example up and running, you’ll first need to set up your environment and deploy the required services. Detailed instructions can be found on GitHub.

Generate a personal API key to pull Docker containers and assets to deploy NeMo microservices and access open Nemotron models hosted as a NIM.

Deploy the NeMo microservices platform.

Install and configure the Data Flywheel Orchestrator.

Once the environment is ready, you’ll configure your models and workflows using a config.yaml file.

Note: This file loads when the flywheel server starts. The settings remain static during a flywheel run. To update anything, you must stop the services, modify the YAML, and redeploy.



Unpacking the workflow

Next, we look at the developer example in action, showing each stage of the workflow with selected code snippets and experiment outputs. We demonstrate how different model configurations and dataset sizes influence performance, efficiency, and accuracy. By showcasing multiple experiment runs and distilled model comparisons, the walkthrough highlights how the developer example enables teams to iteratively refine models and find a suitable trade-off between cost, size, and precision.

Step 1: Dataset labeling

We use a sample dataset of news headlines to demonstrate this workflow. Using the teacher model and a prompt with few-shot examples (provided with our code), we generate a label for each headline in the dataset. The teacher is asked to classify each headline into one of thirteen predefined classes. As a sanity check, and to establish the baseline performance of the LLM, we also report its performance on a human-labeled subset of the dataset (~1k examples).

The following are three examples of financial news headlines, with their respective labels assigned by the teacher model:

[
  {
    "Headline": "Ultratech Achieves ISO 9001 and 14001 Certification for Singapore Operations and Recertification for U.S. Facility",
    "Classified Category": "Regulatory"
  },
  {
    "Headline": "Mid-Afternoon Market Update: Dow Up Over 200 Points; Lakeland Industries Shares Spike Higher",
    "Classified Category": "Stock price movement"
  },
  {
    "Headline": "Analyst: Chipotle Is Successful Because It Sticks To What Works (Giant, Tasty Burritos)",
    "Classified Category": "Analyst Rating"
  }
]
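To illustrate how a few-shot labeling prompt might be assembled, here is a minimal Python sketch. The actual prompt and few-shot examples ship with the blueprint code; the function name build_messages, the instruction wording, and the restriction to the three categories shown above are our assumptions for illustration. The system prompt and the triple-bracket answer form mirror the ingestion record shown in Step 2, and the headline passed in at the end is made up.

```python
# Few-shot examples taken from the three labeled headlines shown above.
# The real prompt covers all thirteen classes; only three appear here.
FEW_SHOT_EXAMPLES = [
    ("Ultratech Achieves ISO 9001 and 14001 Certification for Singapore "
     "Operations and Recertification for U.S. Facility", "Regulatory"),
    ("Mid-Afternoon Market Update: Dow Up Over 200 Points; Lakeland "
     "Industries Shares Spike Higher", "Stock price movement"),
    ("Analyst: Chipotle Is Successful Because It Sticks To What Works "
     "(Giant, Tasty Burritos)", "Analyst Rating"),
]


def build_messages(headline, examples=FEW_SHOT_EXAMPLES):
    """Build chat messages that ask the teacher to label one headline."""
    categories = ", ".join(label for _, label in examples)
    lines = [
        f"Classify the headline into exactly one category: {categories}.",
        "Answer in the form [[[category]]].",
        "",
    ]
    # Each example shows the teacher a headline and its expected answer.
    for text, label in examples:
        lines += [f"Headline: {text}", f"Category: [[[{label.lower()}]]]", ""]
    # The headline to classify goes last, with the answer left open.
    lines += [f"Headline: {headline}", "Category:"]
    return [
        {"role": "system", "content": "You are a financial news classifier."},
        {"role": "user", "content": "\n".join(lines)},
    ]


# Hypothetical headline, used only to show the resulting prompt.
messages = build_messages("Company X Raises Full-Year Revenue Guidance")
print(messages[1]["content"])
```

The resulting messages list can be sent to any chat endpoint that accepts this message structure.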

We run the following steps using the Data Flywheel Blueprint.

Step 2: Dataset ingestion to flywheel server

Next, we ingest the dataset into an Elasticsearch index. The prompt and teacher model responses follow the OpenAI-compliant format, which the data flywheel server uses to run experiments.

{
  "request": {
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
      { "role": "system", "content": "You are a financial news classifier." },
      { "role": "user", "content": "USER PROMPT" }
    ]
  },
  "response": {
    "choices": [
      { "message": { "role": "assistant", "content": "[[[analyst rating]]]" } }
    ]
  },
  "workload_id": "news_classifier",
  "client_id": "<DATASET ID>",
  "timestamp": 1760845128
}

Here, client_id is the dataset identifier in the flywheel server, and timestamp records when the dataset was last updated.

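As a rough sketch of how such records could be produced and read back, the Python below wraps a prompt and teacher label in the structure shown above and extracts the label from the triple-bracket response. The helper names make_record and parse_label and the client_id value "demo-dataset" are hypothetical. Loading the records into Elasticsearch is not shown; use the loader provided with the blueprint for that.

```python
import json
import re
import time

# Matches the '[[[label]]]' convention used in the teacher response above.
LABEL_PATTERN = re.compile(r"\[\[\[(.+?)\]\]\]")


def parse_label(content):
    """Return the lower-cased label inside '[[[...]]]', or None if absent."""
    match = LABEL_PATTERN.search(content)
    return match.group(1).strip().lower() if match else None


def make_record(user_prompt, label, client_id,
                workload_id="news_classifier",
                model="meta/llama-3.3-70b-instruct",
                timestamp=None):
    """Wrap one prompt and teacher label in the record structure shown above."""
    return {
        "request": {
            "model": model,
            "messages": [
                {"role": "system",
                 "content": "You are a financial news classifier."},
                {"role": "user", "content": user_prompt},
            ],
        },
        "response": {
            "choices": [
                {"message": {"role": "assistant",
                             "content": f"[[[{label}]]]"}}
            ]
        },
        "workload_id": workload_id,
        "client_id": client_id,
        # Default to the current Unix time when no timestamp is supplied.
        "timestamp": int(time.time()) if timestamp is None else timestamp,
    }


record = make_record("USER PROMPT", "analyst rating", "demo-dataset",
                     timestamp=1760845128)
print(json.dumps(record, indent=2))
```

Returning None for responses without a bracketed label makes malformed teacher outputs easy to filter out before ingestion.
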
Additionally, in this example, we show that the student model can be customized to match the teacher’s performance without requiring the entire dataset. We split our dataset into smaller stratified subsets of the original dataset (5k, 10k, and 25k examples). The split sizes and ratios for sampling from the multiple label classes, some of which occur less frequently than others, can be specified in the config.yaml file, as shown in our default example:

# Data split config:
# train, val, eval split sizes and ratios
data_split_config:
  eval_size: 100
  val_ratio: 0.1
  min_total_records: 50
  random_seed: 42
  limit: null # null = use all available records (ingress limit increased to 1GB)
  parse_function_arguments: true # parse function arguments to JSON objects for tool calling records
  stratify_enabled: true # Enable stratified splitting to maintain class balance
  min_samples_per_class: 2 # Minimum samples required per class for stratification
  rare_class_threshold: 1 # Group classes with <= this many samples as 'others'
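To make the effect of these settings more concrete, the following Python sketch shows one plausible reading of eval_size, val_ratio, random_seed, and rare_class_threshold. It is not the blueprint's implementation: the function stratified_split is hypothetical, min_samples_per_class and min_total_records are not modeled, and because counts are rounded per class, the evaluation set only approximately matches eval_size in general.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, eval_size=100, val_ratio=0.1,
                     rare_class_threshold=1, random_seed=42):
    """Split (text, label) pairs into train/val/eval, preserving class mix."""
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(random_seed)
    counts = Counter(label for _, label in records)

    # Pool classes at or below the rare threshold into a single 'others' group.
    groups = defaultdict(list)
    for text, label in records:
        key = label if counts[label] > rare_class_threshold else "others"
        groups[key].append((text, label))

    eval_fraction = min(1.0, eval_size / len(records))
    train, val, evaluation = [], [], []
    for key in sorted(groups):  # sorted for a reproducible group order
        items = groups[key]
        rng.shuffle(items)
        # Take the same fraction of every group for evaluation,
        # then carve the validation share out of what remains.
        n_eval = round(len(items) * eval_fraction)
        n_val = round((len(items) - n_eval) * val_ratio)
        evaluation.extend(items[:n_eval])
        val.extend(items[n_eval:n_eval + n_val])
        train.extend(items[n_eval + n_val:])
    return train, val, evaluation


# Toy dataset: 600 records of class 'a' and 400 of class 'b'.
records = [(f"headline {i}", "a") for i in range(600)]
records += [(f"headline {i}", "b") for i in range(600, 1000)]
train, val, evaluation = stratified_split(records)
print(len(train), len(val), len(evaluation))
```

Because every group contributes the same fraction to each split, infrequent classes stay represented in the evaluation set rather than being crowded out by the common ones.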

Next, using the flywheel server, we repeat the following steps to customize and evaluate the models for different dataset sizes.

Step 3: Fine-tuning jobs

Using NeMo Customizer, supervised fine-tuning jobs are launched with LoRA adapters. Each job distills the knowledge from the dataset into the adapter to create smaller task-specific candidates. The student models for the distillation should be specified in the config.yaml file.

For example, to include the llama-3.2-1b-instruct model as one of the candidate students, we specify its model name and details following the naming conventions and details in the NeMo Microservices Model Catalog.

nim:
  - model_name: "meta/llama-3.2-1b-instruct"
    model_type: "llm"
    context_length: 8192
    gpus: 1
    pvc_size: 25Gi
    tag: "1.8.3"
    customization_enabled: true
    customizer_configs:
      target: "meta/llama-3.2-1b-instruct@2.0"
      gpus: 1
      max_seq_length: 8192
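To give a sense of why LoRA-based customization is lightweight, here is a small Python calculation. LoRA keeps a weight matrix of shape d_out x d_in frozen and trains only two low-rank factors, A (rank x d_in) and B (d_out x rank). The matrix size and rank below are illustrative assumptions and are not taken from the blueprint configuration or from any specific model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen full weight matrix (d_out x d_in)."""
    return d_in * d_out


d_in = d_out = 2048  # illustrative projection size (assumption)
rank = 16            # illustrative LoRA rank (assumption)

lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"{lora} trainable vs. {full} frozen weights ({lora / full:.2%})")
```

For this one matrix the adapter trains roughly 1/64 of the weights, which is why each fine-tuning job in the flywheel can stay small.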

Step 4: Evaluate runs

We then compare the performance of student models with and without customization. This is done by comparing the F1-score for each candidate student, referred to as:

base-eval: Zero-shot F1-score baseline of student model before customization

customized-eval: F1-score evaluation of customized model
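For readers who want to see the metric itself, the sketch below computes an F1-score for a multi-class problem in plain Python. In the workflow, scoring is handled by NeMo Evaluator; this code is only illustrative. The source does not state how per-class scores are averaged, so the macro average used here is an assumption, and the toy labels are made up.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} from true and predicted label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (assumed averaging)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy comparison of a student before and after customization,
# scored against teacher-assigned labels.
teacher = ["regulatory", "regulatory", "analyst rating", "analyst rating"]
base = ["regulatory", "analyst rating", "regulatory", "analyst rating"]
customized = ["regulatory", "analyst rating", "analyst rating", "analyst rating"]

base_eval = macro_f1(teacher, base)
customized_eval = macro_f1(teacher, customized)
print(f"base-eval={base_eval:.3f} customized-eval={customized_eval:.3f}")
```

Comparing the two numbers for the same student and evaluation set isolates the gain that comes from customization.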

Step 5: Scoring and aggregation

Model outputs are scored using NeMo Evaluator, and results are reported back through the Orchestrator API. We aggregate these results over different students and corresponding dataset sizes.

Step 6: Review and promotion

Developers can programmatically access metrics, download artifacts, launch follow-up experiments, or promote top-performing candidates to production to replace the teacher NIM.

This loop can be scheduled or triggered on demand, creating an automated, scalable system that continuously and progressively surfaces smaller, faster, and more cost-efficient models, while preserving the accuracy of the larger baseline model.

Results

The reported F1-scores in Table 1 and Figure 3 are evaluated on a held-out test set and are given relative to the F1-score of the large teacher model. In this setup, the teacher model is considered to have a perfect F1-score, against which each distilled student model is compared.

The table clearly shows that larger student models have a greater capacity to learn from the teacher’s supervision and achieve higher scores even with a small number of examples. As the number of training examples increases, the quality of the distilled model improves for each student model size. With enough examples, they converge to similar F1-scores.

These results show the trade-offs and possible gains of using larger student models and more training data during distillation. Practical factors such as data availability, hardware constraints, latency, and throughput at inference time influence the optimal choices for each application within the AI Model Distillation for Financial Data developer example.

Figure 3. F1-score improvement relative to teacher model performance for customized models over increasing dataset sizes

Training Data    Model Name                    F1-Score
5000             meta/llama-3.2-1b-instruct    0.29
10000            meta/llama-3.2-1b-instruct    0.78
25000            meta/llama-3.2-1b-instruct    0.9
5000             meta/llama-3.2-3b-instruct    0.584
10000            meta/llama-3.2-3b-instruct    0.89
25000            meta/llama-3.2-3b-instruct    0.95
5000             meta/llama-3.1-8b-instruct    0.8
10000            meta/llama-3.1-8b-instruct    0.94
25000            meta/llama-3.1-8b-instruct    0.95

Table 1. Relative F1-scores of the distilled student models, customized on different training dataset sizes
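One way to act on these numbers is to pick the smallest student that clears a target score. The Python sketch below does this over the Table 1 values. The function cheapest_candidate, the target thresholds, and the tie-breaking rule (smaller model first, then fewer training examples) are our assumptions for illustration and are not part of the blueprint.

```python
# (training examples, model name, relative F1-score) rows from Table 1.
RESULTS = [
    (5000, "meta/llama-3.2-1b-instruct", 0.29),
    (10000, "meta/llama-3.2-1b-instruct", 0.78),
    (25000, "meta/llama-3.2-1b-instruct", 0.9),
    (5000, "meta/llama-3.2-3b-instruct", 0.584),
    (10000, "meta/llama-3.2-3b-instruct", 0.89),
    (25000, "meta/llama-3.2-3b-instruct", 0.95),
    (5000, "meta/llama-3.1-8b-instruct", 0.8),
    (10000, "meta/llama-3.1-8b-instruct", 0.94),
    (25000, "meta/llama-3.1-8b-instruct", 0.95),
]

# Parameter counts in billions, read from the model names.
SIZE_B = {
    "meta/llama-3.2-1b-instruct": 1,
    "meta/llama-3.2-3b-instruct": 3,
    "meta/llama-3.1-8b-instruct": 8,
}


def cheapest_candidate(results, target_f1):
    """Return the row with the smallest model (then least data) meeting target_f1."""
    eligible = [row for row in results if row[2] >= target_f1]
    if not eligible:
        return None  # no student reaches the target
    return min(eligible, key=lambda row: (SIZE_B[row[1]], row[0]))


for target in (0.9, 0.94):
    print(target, cheapest_candidate(RESULTS, target))
```

With a target of 0.9 the 1B student trained on 25,000 examples qualifies, while a target of 0.94 requires at least the 3B student, which is the kind of trade-off the discussion above describes.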

Conclusion

Model distillation in finance enables smaller, faster models to match the performance of complex ones, improving efficiency and explainability without sacrificing accuracy. By transferring knowledge from large teacher models to lightweight students, the AI Model Distillation for Financial Data developer example enables faster decision-making for feature engineering and signal generation, risk management, and surveillance.

Learn more

Model compression continues to advance rapidly, driving new possibilities for deploying LLMs efficiently across industries. Learn more with the following resources:

For a detailed walk-through of the Data Flywheel blueprint, read this blog post and watch the accompanying video.

For advanced compression techniques, explore The Art of Compressing LLMs course from the NVIDIA Deep Learning Institute.

Get started

Visit build.nvidia.com to deploy the notebook in a GPU-accelerated environment, either with NVIDIA Brev or on your own cloud infrastructure using a standard GitHub repository.