Every swipe, transfer, and payment on a modern financial network encodes a pattern of human behavior. Transaction data is one of the richest signals an enterprise owns. Yet most production use cases for such tabular data still depend on hand-engineered features and rule sets that are brittle, expensive to maintain, and blind to the sequential structure inside a customer history.
Foundation models, pre-trained on large volumes of unlabeled transaction sequences, change this equation by producing general-purpose representations of financial behavior that transfer across a wide array of downstream tasks. A single backbone covers fraud detection, credit scoring, lifetime value prediction, segmentation, personalized recommendations, recurrent-transaction detection, and more.
The industry signal is strong and accelerating. Innovative financial firms are training transformer-based models on billions of transactions, reporting double-digit relative lifts on production-scale tasks while simultaneously streamlining operations. See Stripe’s payments foundation model, Nubank’s NuFormer, Visa’s TransactionGPT, Mastercard’s large tabular model, Revolut’s PRAGMA, Plaid’s transaction foundation model, and more.
The NVIDIA Build Your Own Transaction Model developer example walks through how to build a transaction foundation model end-to-end using accelerated computing.
You will progress through five steps in this workflow:
- GPU-accelerated data processing with NVIDIA CUDA-X library cuDF
- Custom tokenization with NVIDIA CUDA-X libraries cuDF and cuML
- Transformer decoder model pretraining from scratch with the NVIDIA NeMo AutoModel open library, part of the NVIDIA NeMo framework
- Extracting learned embeddings
- Augmenting a downstream fraud classifier with embeddings
By the end, you will reproduce a roughly 42% relative lift in average precision (AP) over a strong XGBoost baseline on the IBM TabFormer fraud dataset. AP is the area under the precision-recall curve and captures how well the model ranks fraud across all operating thresholds. Figure 1, below, shows the end-to-end pipeline.

Why transformers fit transaction histories
Large language models learn from sequences of words. During pretraining, a model sees text and learns that words, phrases, and sentences carry meaning through order and context. A transaction foundation model applies the same principle to financial behavior. A sequence such as “paycheck deposit, grocery purchase, transit fare, recurring subscription, card-present restaurant payment” carries information that no single transaction row can express alone.
Transformers are well suited to this structure because self-attention can connect events that sit far apart in history. A fraudulent transaction may only look suspicious when paired with a recent travel pattern or a sudden burst of small authorizations. Traditional tabular features can approximate these patterns, but engineers must decide which windows, aggregates, and rules to build up front. A pretrained transformer learns those relationships directly from the sequence.
This approach complements other NVIDIA financial AI workflows, including the NVIDIA AI Blueprint for financial fraud detection using graph neural networks (GNNs). GNNs capture relationships across connected entities such as accounts, merchants, devices, and transactions. Transaction foundation models focus on behavioral histories within a customer or account sequence. In practice, both methods produce rich embeddings with complementary information that pair naturally.
Load the data and set a baseline
Notebook 01_dataset_baseline.ipynb loads the IBM TabFormer dataset, roughly 24.4M synthetic card transactions with a ~0.12% fraud rate, directly into GPU memory with cuDF.
The dataset is split temporally by cumulative transaction count: the first 80% of transactions by date are used for training, the next 10% for validation, and the final 10% for testing. The splits therefore occupy disjoint, ordered time windows, which prevents data leakage and mirrors real-world production environments.
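As a rough illustration of the split logic described above, the following sketch orders a toy list of transactions by timestamp and cuts it at 80% and 90% of the cumulative count. It uses plain Python rather than cuDF so that it runs anywhere. The record fields and the `temporal_split` helper are assumptions for illustration, not the notebook's code.

```python
from datetime import datetime, timedelta


def temporal_split(rows, train_frac=0.8, val_frac=0.1):
    """Split rows into train/val/test by cumulative count after sorting by time."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Toy data (hypothetical): 20 transactions, one per day, deliberately out of order.
start = datetime(2020, 1, 1)
rows = [{"timestamp": start + timedelta(days=d), "amount": 10.0 + d} for d in range(20)]
rows = rows[::-1]

train, val, test = temporal_split(rows)
print(len(train), len(val), len(test))
```

Because the cut points are taken after sorting, every training timestamp in this toy data precedes every validation timestamp, which in turn precedes every test timestamp.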
With the splits in place, the notebook trains an XGBoost classifier utilizing native GPU acceleration with tree_method="hist" and device="cuda" on a 1M-row balanced training sample. Evaluation runs on a 100k stratified holdout that preserves the realistic ~0.1% fraud prevalence.
The baseline numbers set the bar for the rest of the tutorial:
- Test ROC-AUC: 0.9885
- Test AP: 0.1238
Pay attention to AP rather than ROC-AUC. At a fraud prevalence of roughly 0.1%, ROC-AUC saturates quickly and hides meaningful differences in the high-scoring region. AP summarizes precision across the full range of recall and responds to improvements where they matter operationally. Every subsequent model in this tutorial is judged by AP first.
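To make the saturation effect concrete, the sketch below scores two hypothetical models on synthetic data with 10 fraud cases among 10,000 legitimate transactions (about 0.1% prevalence). The data, the scores, and both helper functions are assumptions written for this illustration. They are small pure-Python versions of the standard metric definitions, not the notebook's evaluation code.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k taken at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


n_neg, n_pos = 10_000, 10
neg_scores = [i / n_neg for i in range(n_neg)]           # legitimate traffic, evenly spread
labels = [0] * n_neg + [1] * n_pos

# Model A: every fraud case is outranked by 100 legitimate transactions.
scores_a = neg_scores + [0.98995 + j * 1e-6 for j in range(n_pos)]
# Model B: every fraud case is outranked by only 10 legitimate transactions.
scores_b = neg_scores + [0.99895 + j * 1e-6 for j in range(n_pos)]

roc_a, roc_b = roc_auc(labels, scores_a), roc_auc(labels, scores_b)
ap_a, ap_b = average_precision(labels, scores_a), average_precision(labels, scores_b)
print(f"ROC-AUC: A={roc_a:.4f}  B={roc_b:.4f}")
print(f"AP:      A={ap_a:.4f}  B={ap_b:.4f}")
```

In this toy setup ROC-AUC moves by less than one point between the two models, while AP rises several-fold, because AP is driven by how many legitimate transactions sit above each fraud case at the top of the ranking.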
Tokenize transactions on the GPU
General-purpose LLM tokenizers waste capacity on tabular financial data. For example, a byte pair encoding (BPE) tokenizer splits a single transaction into roughly 39 subword tokens, most of which encode commas and dollar signs rather than behavior. Notebook 02_seq_preproc_tokenization.ipynb introduces a custom domain tokenizer that converts each transaction into roughly 12 semantic tokens with a much smaller vocabulary (6,251 symbols vs. 50,257 from BPE).
Beyond higher information density per token, this efficiency fits more than 3x as many transactions into a fixed token budget. In practice, a model with a context window of 4,092 tokens can hold a history of ~315 transactions from the domain tokenizer but only ~102 transactions from a BPE tokenizer.
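The ratio can be checked directly from the figures quoted above. This short calculation uses only those numbers; the variable names are illustrative.

```python
context_window = 4092      # token budget quoted above
domain_txns = 315          # approximate history length with the domain tokenizer
bpe_txns = 102             # approximate history length with a BPE tokenizer

ratio = domain_txns / bpe_txns
implied_domain_tokens = context_window / domain_txns   # average tokens per transaction
implied_bpe_tokens = context_window / bpe_txns

print(f"{ratio:.2f}x more transactions per window")
print(f"{implied_domain_tokens:.1f} vs {implied_bpe_tokens:.1f} tokens per transaction")
```

The implied per-transaction averages come out near 13 and 40 tokens, slightly above the rounded 12 and 39 quoted earlier. The text does not say what accounts for the difference.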
Figure 2, below, compares token counts per transaction between the two tokenization methods on the same records.
The domain tokenizer is implemented in src/tokenizer/financial_pipeline.py. This flexible pipeline handles amount binning, merchant hashing, hour-of-day and day-of-week, month, card identity, chip type, ZIP3 and state, and customer identity. Every step runs on the GPU through cuDF.
The tokenizer can be readily adapted to different transaction schema by adding or replacing individual steps in the modular pipeline. Each step implements a small BaseTokenizer interface, so extending coverage to new fields such as device ID or beneficiary country takes just a short subclass.
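The following is a minimal sketch of what such a modular pipeline could look like. The `BaseTokenizer` defined here is a hypothetical stand-in: its `tokenize` method, the token formats, and both subclasses are assumptions for illustration, and the real interface in src/tokenizer/ may differ. The sketch also works on plain dictionaries rather than cuDF columns.

```python
import hashlib
from abc import ABC, abstractmethod


class BaseTokenizer(ABC):
    """Hypothetical step interface: one transaction record in, one token out."""

    @abstractmethod
    def tokenize(self, record: dict) -> str:
        ...


class AmountBinTokenizer(BaseTokenizer):
    """Maps a transaction amount to a coarse bin token."""

    def __init__(self, edges=(10, 50, 200, 1000)):
        self.edges = edges

    def tokenize(self, record: dict) -> str:
        bin_idx = sum(record["amount"] >= edge for edge in self.edges)
        return f"AMT_{bin_idx}"


class DeviceIdTokenizer(BaseTokenizer):
    """New field: hashes a device ID into a fixed number of buckets."""

    def __init__(self, num_buckets=1024):
        self.num_buckets = num_buckets

    def tokenize(self, record: dict) -> str:
        # hashlib gives a hash that is stable across runs, unlike built-in hash().
        digest = hashlib.md5(record["device_id"].encode("utf-8")).hexdigest()
        return f"DEV_{int(digest, 16) % self.num_buckets}"


def tokenize_transaction(record: dict, steps) -> list:
    """Run every step in order and collect one token per step."""
    return [step.tokenize(record) for step in steps]


steps = [AmountBinTokenizer(), DeviceIdTokenizer()]
record = {"amount": 75.5, "device_id": "device-abc"}   # hypothetical record
tokens = tokenize_transaction(record, steps)
print(tokens)
```

Under this design, supporting a new field means writing one small class and appending it to the list of steps. The existing steps stay untouched.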

Pretrain with NeMo AutoModel
NeMo AutoModel is a PyTorch-native open-source training library under the NVIDIA NeMo Framework, designed to streamline and scale training and fine-tuning for LLMs and VLMs.
Notebook 03_foundation_model_training.ipynb pretrains a decoder-only foundation model on the tokenized corpus using causal language modeling. The objective is simple: predict the next token given every previous token. The supervision signal, however, is dense. Every position in a sequence contributes a gradient, so a single packed transaction sequence yields thousands of next-event predictions.
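To illustrate why the signal is dense, this small sketch enumerates the prediction targets that one token sequence provides under causal language modeling. The token IDs and the helper name are hypothetical. In practice the training framework computes all of these predictions in a single forward pass by shifting the labels by one position.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs: each position predicts the token that follows it."""
    return [(token_ids[: i + 1], token_ids[i + 1]) for i in range(len(token_ids) - 1)]


sequence = [101, 7, 42, 913, 5, 102]   # hypothetical token IDs for one short history
pairs = next_token_pairs(sequence)

for context, target in pairs:
    print(context, "->", target)
```

A sequence of N tokens yields N - 1 supervised predictions, so a packed sequence thousands of tokens long contributes thousands of training signals.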
The model is a compact Llama decoder defined in configs/pretrain_financial_decoder.yaml:
- ~29M parameters
- Hidden size 512, 8 transformer layers
- Grouped-Query Attention with 8 query heads and 2 KV heads
- 8,192-token RoPE context window
- SwiGLU activation, RMSNorm, domain vocabulary of 6,251 tokens
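The attention settings listed above imply a few derived sizes, worked out in the sketch below. It assumes the common Llama convention that the per-head dimension equals the hidden size divided by the number of query heads. The YAML may set this differently, so treat the results as illustrative.

```python
hidden_size = 512
num_query_heads = 8
num_kv_heads = 2

# Assumption: head_dim = hidden_size / num_query_heads (common Llama default).
head_dim = hidden_size // num_query_heads
queries_per_kv_head = num_query_heads // num_kv_heads    # query heads sharing each KV head
kv_proj_width = num_kv_heads * head_dim                  # output width of the K (and V) projection
kv_cache_reduction = num_query_heads / num_kv_heads      # vs. full multi-head attention

print(f"head_dim={head_dim}, queries per KV head={queries_per_kv_head}")
print(f"K/V projection width={kv_proj_width}, KV cache is {kv_cache_reduction:.0f}x smaller")
```

Under that assumption, grouped-query attention stores a quarter of the keys and values that full multi-head attention would, which matters at an 8,192-token context.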
NeMo AutoModel handles the rest of the stack. Kick off a single-GPU sanity run:
python scripts/train_decoder_model.py \
    --config configs/pretrain_financial_decoder.yaml \
    --step_scheduler.max_steps 30
The 30-step demo drops training loss from ln(6251)≈8.74 (the random-guess baseline for this vocabulary) to around 6.0. To scale the same run to eight GPUs, simply prefix the command with torchrun --nproc-per-node=8. No changes to the script or distributed boilerplate are required. Multi-node scaling is straightforward as well. NeMo AutoModel wires up FSDP2 sharding, mixed precision, gradient accumulation, and checkpoint consolidation from the YAML.
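The random-guess baseline quoted above follows from the cross-entropy of a uniform prediction over the vocabulary. The short calculation below reproduces it and converts a loss value into perplexity. The `perplexity` helper is an illustrative addition, not part of the training script.

```python
import math

vocab_size = 6251

# A model that assigns equal probability to every token has this cross-entropy loss.
uniform_prob = 1.0 / vocab_size
random_guess_loss = -math.log(uniform_prob)


def perplexity(loss):
    """Effective number of equally likely choices implied by a cross-entropy loss."""
    return math.exp(loss)


print(f"random-guess loss: {random_guess_loss:.2f}")
print(f"perplexity at loss 6.0: {perplexity(6.0):.0f}")
```

Read this way, a loss of about 6.0 means the model has narrowed each prediction from 6,251 equally likely tokens to the equivalent of roughly 400 after only 30 steps.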
Checkpoints land as standard safetensors files, which means the trained backbone loads with a one-liner anywhere Hugging Face Transformers is installed:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("models/decoder-foundation-model")
The repository ships a full checkpoint trained for 3,000 steps, which Notebooks 04 and 05 load; the 30-step run is for demonstration and validation purposes.
To swap architectures, edit model._target_ and model.config._target_ in the YAML. Any Hugging Face-compatible decoder is designed to drop in without training-code changes.
Extract embeddings at scale
Notebook 04_inference_embedding_extraction.ipynb turns the pretrained backbone into a feature extractor. It loads the checkpoint with AutoModelForCausalLM, requests output_hidden_states=True, and pools the final hidden layer down to a 512-dim vector per user history.
For decoder-only models with causal attention, only the final position has observed the entire sequence while earlier positions are blind to later tokens. Last-token pooling therefore picks the most informative location in the sequence. The implementation in src/decoder_inference.py uses the attention mask to find the last non-pad token per row and gathers its hidden state.
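The pooling step can be sketched in a few lines. The version below is a simplified illustration on nested Python lists, not the code in src/decoder_inference.py. It assumes right-padded sequences and at least one real token per row, and the function name is hypothetical. With tensors, the same idea becomes an index gather along the sequence axis.

```python
def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state at the last non-pad position of each row.

    hidden_states: [batch][seq_len][hidden] nested lists
    attention_mask: [batch][seq_len], 1 for real tokens and 0 for padding
    Assumes right-padding and at least one real token per row.
    """
    pooled = []
    for row_states, row_mask in zip(hidden_states, attention_mask):
        last_idx = sum(row_mask) - 1      # index of the last real token
        pooled.append(row_states[last_idx])
    return pooled


# Toy batch (hypothetical values): 2 sequences, 4 positions, hidden size 2.
hidden_states = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]],   # 3 real tokens + 1 pad
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5], [1.6, 1.7]],   # 4 real tokens
]
attention_mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]

pooled = last_token_pool(hidden_states, attention_mask)
print(pooled)
```

Using the mask rather than simply taking the final position keeps padded rows from returning the hidden state of a pad token.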
The extraction loop is a single call:
embeddings = inference.extract_embeddings_batched(
    padded_ids, batch_size=1024, show_progress=True
)
The notebook extracts and saves train, validation, and test embeddings as .npy files. Additionally, a metadata.json describing shapes and row alignment is saved, which is later used in Notebook 05 to join embeddings back to the associated raw tabular features.
Figure 3, below, shows a 3D UMAP projection of 50k validation embeddings, colored by merchant industry category and zip code. Visible clusters in each field confirm that the backbone has learned semantically coherent representations without ever seeing any target labels during pretraining.


Figure 3. 3D UMAP projection of 50,000 validation-set transaction embeddings. Points colored by merchant industry and user zip code each show clear behavioral clusters in the learned representation space
Measure lift on a downstream task
Notebook 05_xgboost_fraud_detection.ipynb answers the billion-dollar question: Can transaction foundation model embeddings move downstream metrics?
It trains three GPU XGBoost classifiers and evaluates all of them on the same 100k stratified test set:
- Raw—13 hand-engineered tabular features (the baseline from Step 1)
- Embeddings—512-dim foundation-model vectors compressed to 64d with PCA (~78% variance retained)
- Combined—raw features concatenated with the 64d embeddings, 77d total
Table 1, below, summarizes the test results.
| Model | Feature dim | Test ROC-AUC | Test AP |
|---|---|---|---|
| Raw (baseline) | 13 | 0.9885 | 0.1238 |
| Embeddings only | 64 | 0.8775 | 0.0123 |
| Combined | 77 | 0.9925 | 0.1755 |
In relative terms, the combined model lifts ROC-AUC by 0.41% and AP by 41.76% over the baseline. That AP delta is the operational win: a review team with fixed daily capacity catches materially more fraud at the same workload.
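The relative lifts can be recomputed from the rounded values in Table 1, as in the short calculation below. The helper name is illustrative. Because the table entries are rounded to four decimals, a recomputed figure can differ in the last digit from one derived from unrounded scores.

```python
def relative_lift_pct(baseline, candidate):
    """Relative improvement of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Values copied from Table 1.
results = {
    "raw": {"roc_auc": 0.9885, "ap": 0.1238},
    "combined": {"roc_auc": 0.9925, "ap": 0.1755},
}

ap_lift = relative_lift_pct(results["raw"]["ap"], results["combined"]["ap"])
roc_lift = relative_lift_pct(results["raw"]["roc_auc"], results["combined"]["roc_auc"])

print(f"AP lift: {ap_lift:.2f}%")
print(f"ROC-AUC lift: {roc_lift:.2f}%")
```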
Embeddings encode the user’s transaction history and provide predictive power, but underperform the baseline as lone features. The combined model leverages event-level information from the raw tabular row and sequence-level historical context from embeddings that were learned during pretraining. Figure 4, below, shows the comparison visually.

Customize the developer example
The repository is structured so that each component is swappable independently:
- Tokenizer: Adapt the pipeline in src/tokenizer/ to any transaction schema by adding or replacing steps. Each step is a small subclass of BaseTokenizer, so supporting new fields such as device fingerprint, beneficiary country, and merchant country is a short addition.
- Model architecture: Edit model._target_ and model.config._target_ in the training YAML to point at any Hugging Face-compatible decoder. The rest of the NeMo training pipeline (data loader, FSDP2, checkpointing, evaluation) stays put.
- Downstream task: Replace XGBoost with any model that consumes fixed-length feature vectors. Churn prediction, customer segmentation, lifetime value regression, next-best-action ranking, and credit scoring all fit the same embedding-plus-head pattern.
The developer example is also designed to extend to labels other than fraud, reflecting the general-purpose nature of a foundation model. Swap the Is Fraud? label in Step 5, above, for any event label that aligns with the user histories encoded by the backbone.
Get started
You now have a reference path from raw transaction logs to a pretrained foundation model that augments a downstream classifier, accelerated end-to-end with NVIDIA. The three components (a custom tokenizer, a transformer decoder backbone, and an embedding-driven XGBoost head) together deliver a roughly 42% relative AP lift over a strong industry-standard baseline on the TabFormer fraud benchmark.
Visit build.nvidia.com to deploy the notebooks in a GPU-accelerated environment via NVIDIA Launchable, or run them in your own environment via the GitHub repository.