Telcos around the world are building sovereign AI factories based on the NVIDIA Cloud Partner (NCP) reference architecture, giving governments, enterprises, and startups access to in‑country AI infrastructure with the right controls, trust, and performance. But infrastructure alone doesn’t get you to high-margin, production-ready enterprise AI services.
Model sizes and reasoning workloads continue to grow, driving up tokens per request, while each new generation of accelerated computing drives down cost per token. Together, these trends make it more valuable to push AI economics higher up the stack—from selling GPU hours to delivering AI services measured and billed in tokens.
At the same time, enterprises don’t want to manage clusters, runtimes, or model weights. They want production‑ready applications and model APIs with predictable performance, metered by token consumption, and backed by service‑level agreements (SLAs) tied to AI‑native metrics such as tokens per second, time‑to‑first‑token (TTFT), and end‑to‑end query latency.
This post traces the path from GPU‑per‑hour infrastructure to token‑metered AI services and outlines the technical building blocks telcos need to evolve from infrastructure landlords into “token factories” with transparent, token‑based economics that enterprises can easily adopt without operating the underlying infrastructure themselves.
Building the telco AI cloud stack

AI can be understood as a 5-layer cake—energy, chips, infrastructure, models, and applications. Telco sovereign AI factories sit on top of the energy and chip layers and anchor the infrastructure layer, providing NVIDIA‑accelerated compute, networking, and storage that can securely host models and applications.
Telco AI factories start with NVIDIA‑certified infrastructure and a choice of software partners that together define the platform’s economic and regulatory posture. This foundational layer sets the cost of compute‑as‑a‑service, enforces where data can reside, and controls which tenants can run which workloads in a shared environment.
In practice, it turns raw GPU capacity into secure, multi‑tenant compute that can be exposed as services, and its cost structure and footprint set the baseline for cost per token as telcos move up the stack—from compute‑as‑a‑service to token‑as‑a‑service, where most of the long-term economic upside sits.
Compute‑as‑a‑Service: Infrastructure and platforms
Compute‑as‑a‑Service (CaaS) is how telcos monetize the energy, chips, and infrastructure layers of the 5‑layer cake. It exposes NVIDIA‑certified systems, CPUs, GPUs, NVLink, high‑speed InfiniBand or Ethernet, and storage as GPU/Infrastructure‑as‑a‑Service (IaaS) that customers rent by the hour, similar to traditional cloud instances.
On top of that, a Kubernetes‑based platform layer turns this raw capacity into a managed environment with multi‑tenant clusters, namespaces, and GPU scheduling, so developers can deploy containers and inference runtimes while being billed primarily on GPU‑hours, node‑hours, and storage.
This tier is essential for flexibility, control, and sovereignty, but it keeps the business anchored in a GPU‑per‑hour model. The real economic shift happens when telcos add token‑metered models and applications on top of it and start selling AI output rather than just infrastructure time.
Token-as‑a‑Service: Creating and consuming token-metered services
Token‑as‑a‑Service (TaaS) moves telcos up into the model and application layers of the 5‑layer cake, where value is measured in tokens, API calls, and workflows rather than GPU‑hours. In this layer, GPU capacity from the AI factory is packaged into products that are measured, billed, and governed in those same units. Revenue is then limited not by how many hours a GPU can be rented, but by how many tokens the stack can serve at a given price and SLA.
Telcos typically begin with a focused portfolio of token‑metered services, powered by open models like NVIDIA Nemotron together with NVIDIA NIM microservices and blueprints, such as:
- Vertical AI applications (for example, customer‑care copilots or knowledge assistants tailored to local languages and regulations)
- Model and tools APIs for text, vision, speech, and agents
- Inference‑as‑a‑Service endpoints for fine‑tuned and domain‑specific models
Customers integrate these services through APIs and pay in units that match how their business consumes AI—tokens, requests, or workflows—rather than in opaque infrastructure metrics. SLAs shift accordingly: instead of uptime on specific servers, enterprises care about latency, reliability, and response quality at the model or application level.
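As a minimal sketch of how the AI-native SLA metrics mentioned earlier (time-to-first-token, end-to-end latency, tokens per second) could be derived for a single streamed response, consider the following. The `StreamedResponse` shape and all function names are hypothetical and chosen only for illustration; a real platform would take these timestamps from its own inference gateway.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamedResponse:
    """Timestamps in seconds for one streamed completion (hypothetical shape)."""
    request_sent: float
    token_times: List[float]  # arrival time of each output token, in order


def time_to_first_token(r: StreamedResponse) -> float:
    """Seconds between sending the request and receiving the first token."""
    if not r.token_times:
        raise ValueError("response contains no tokens")
    return r.token_times[0] - r.request_sent


def end_to_end_latency(r: StreamedResponse) -> float:
    """Seconds between sending the request and receiving the last token."""
    if not r.token_times:
        raise ValueError("response contains no tokens")
    return r.token_times[-1] - r.request_sent


def output_tokens_per_second(r: StreamedResponse) -> float:
    """Output tokens divided by end-to-end latency."""
    return len(r.token_times) / end_to_end_latency(r)


# Illustrative values only: four tokens arriving 0.25 s apart.
resp = StreamedResponse(request_sent=0.0, token_times=[0.25, 0.5, 0.75, 1.0])
print(time_to_first_token(resp))       # 0.25
print(end_to_end_latency(resp))        # 1.0
print(output_tokens_per_second(resp))  # 4.0
```

An SLA would normally be stated over percentiles of these values across many requests rather than over a single response.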
To simplify service creation and consumption at this layer, many telcos work with NVIDIA-certified software partners to develop AI developer studios and AI marketplaces.
An AI developer studio is where these token‑metered services are designed, adapted, and operated. Data scientists and developers use NVIDIA NeMo to fine‑tune foundation models, deploy them as secure NIM‑based endpoints, and connect them to retrieval pipelines or agentic workflows. Within an AI studio, they can choose models from a curated catalog, fine-tune them with their own enterprise data to improve accuracy and relevancy, and publish them as reusable AI assets—models, agents, and blueprints—that developers can reuse without ever touching the underlying infrastructure.
An AI marketplace then becomes the storefront that turns those assets into products. Business and application owners browse a catalog of copilots, retrieval-augmented generation (RAG) applications, model SKUs, and independent software vendor (ISV) solutions, then subscribe and deploy them with a few clicks.
Behind the scenes, the platform provisions inference endpoints and meters usage in input and output tokens, API calls, or workflow executions, automatically enforcing quotas, rate limits, and SLAs.
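The metering and quota enforcement described here can be sketched roughly as follows. This is an illustrative in-memory model, not the behavior of any particular platform: `TokenMeter`, `QuotaExceeded`, and the tenant name are hypothetical, and the check is applied after token counts are known, whereas a production system would also need rate limiting and pre-request admission control.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict


class QuotaExceeded(Exception):
    """Raised when a request would push a tenant past its token quota."""


@dataclass
class TokenMeter:
    # tenant -> maximum total tokens per billing period (assumed policy)
    quotas: Dict[str, int]
    # tenant -> running input/output token counters
    usage: Dict[str, Dict[str, int]] = field(
        default_factory=lambda: defaultdict(lambda: {"input": 0, "output": 0})
    )

    def total(self, tenant: str) -> int:
        """Total tokens (input plus output) recorded for a tenant."""
        counters = self.usage[tenant]
        return counters["input"] + counters["output"]

    def record(self, tenant: str, input_tokens: int, output_tokens: int) -> None:
        """Record usage, rejecting it if the tenant's quota would be exceeded.

        Tenants without a configured quota are treated as having a quota of 0.
        """
        requested = input_tokens + output_tokens
        if self.total(tenant) + requested > self.quotas.get(tenant, 0):
            raise QuotaExceeded(f"{tenant} would exceed its token quota")
        self.usage[tenant]["input"] += input_tokens
        self.usage[tenant]["output"] += output_tokens


meter = TokenMeter(quotas={"acme": 1000})
meter.record("acme", input_tokens=300, output_tokens=500)
print(meter.total("acme"))  # 800

try:
    meter.record("acme", input_tokens=100, output_tokens=200)
except QuotaExceeded as exc:
    print(f"rejected: {exc}")
```

Keeping input and output counters separate matters because the two are often priced differently.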
TaaS, enabled by the AI developer studio and AI marketplace, transforms the telco AI factory from a pool of GPUs into a portfolio of sovereign, token‑metered AI products that enterprises can adopt out of the box.
Token-level metering and billing
To turn those capabilities into products, telcos require a metering and billing layer that treats tokens as a first-class signal and connects them to performance, governance, and infrastructure efficiency.
| KPI group | Examples |
|---|---|
| Token usage | Tokens per tenant, model, endpoint; input vs output; hourly/daily/monthly totals |
| Performance | QPS, request counts, p50–p99 latency, throughput in tokens per second |
| Reliability | Error rates tied to token volume |
| Governance | Per‑tenant quotas, rate limits, access/audits, policy signals |
| Economics | Tokens per GPU‑hour (by GPU type), tokens per dollar |
Together, these metrics let telcos offer plans priced per million tokens, enforce usage across tenants, and pick the right NVIDIA platform SKUs and service price-points based on real cost-per-token data.
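To make the link between these KPIs and per-million-token pricing concrete, here is a small sketch that aggregates usage records per tenant, prices them, and derives tokens per GPU-hour. The record layout, tenant and model names, price, and GPU-hour figure are all assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (tenant, model, input_tokens, output_tokens): a hypothetical record layout
UsageRecord = Tuple[str, str, int, int]


def tokens_per_tenant(records: Iterable[UsageRecord]) -> Dict[str, int]:
    """Sum input and output tokens for each tenant."""
    totals: Dict[str, int] = defaultdict(int)
    for tenant, _model, input_tokens, output_tokens in records:
        totals[tenant] += input_tokens + output_tokens
    return dict(totals)


def invoice_usd(records: Iterable[UsageRecord],
                price_per_million_usd: float) -> Dict[str, float]:
    """Price each tenant's total usage at a flat rate per million tokens."""
    return {
        tenant: total / 1_000_000 * price_per_million_usd
        for tenant, total in tokens_per_tenant(records).items()
    }


def tokens_per_gpu_hour(total_tokens: int, gpu_hours: float) -> float:
    """Economics KPI: tokens served per GPU-hour consumed."""
    return total_tokens / gpu_hours


# Illustrative usage records, not measured data.
records = [
    ("acme", "model-a", 2_000_000, 1_000_000),
    ("acme", "model-b", 500_000, 500_000),
    ("globex", "model-a", 1_000_000, 1_000_000),
]

print(tokens_per_tenant(records))                      # {'acme': 4000000, 'globex': 2000000}
print(invoice_usd(records, price_per_million_usd=1.0))  # {'acme': 4.0, 'globex': 2.0}
print(tokens_per_gpu_hour(6_000_000, gpu_hours=0.5))    # 12000000.0
```

A flat rate is the simplest case; separate input and output prices or per-model rates would follow the same aggregation pattern.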
Over time, this token‑level visibility turns the AI factory into a true token factory, where every improvement in the stack is measured in lower cost per token and higher, more predictable gross margin.
Monetizing AI infrastructure as a token factory

In a GPU‑per‑hour model, revenue is capped by how many hours a GPU can be rented and at what rate. You can tune utilization and pricing, but the unit of value remains “dollars per GPU‑hour,” so improvements in hardware and software mainly show up as pressure to lower hourly prices rather than as higher margins.
In a token‑as‑a‑service model, the same GPU is monetized by how many high‑quality tokens it can produce through an optimized stack, at a given price per million tokens and SLA.
Viewed this way, the AI factory becomes a token factory. Every improvement to the stack—better batching, smarter routing and scheduling, more efficient models, faster networking, and storage that removes I/O bottlenecks—either increases tokens per second or reduces cost‑per‑token.
Revenue scales with token throughput and price per token, while margin improves with each new NVIDIA platform generation and each software optimization, not just with higher hourly rental rates.
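The relationship between throughput, cost per token, and margin can be sketched with two small functions. The GPU cost and throughput values below are hypothetical round numbers picked to keep the arithmetic easy to follow; they are not figures from this post or from any benchmark.

```python
def cost_per_million_tokens(gpu_cost_per_hour_usd: float,
                            billable_tokens_per_hour: float) -> float:
    """All-in hourly GPU cost spread over the billable tokens served in that hour."""
    return gpu_cost_per_hour_usd / (billable_tokens_per_hour / 1_000_000)


def gross_margin(price_per_million_usd: float,
                 cost_per_million_usd: float) -> float:
    """Fraction of the token price left after the cost of serving the tokens."""
    return (price_per_million_usd - cost_per_million_usd) / price_per_million_usd


# Hypothetical inputs: 2 USD/hour all-in GPU cost, price of 1 USD per million tokens.
baseline_cost = cost_per_million_tokens(2.0, 16_000_000)
optimized_cost = cost_per_million_tokens(2.0, 32_000_000)  # throughput doubled

print(baseline_cost, gross_margin(1.0, baseline_cost))    # 0.125 0.875
print(optimized_cost, gross_margin(1.0, optimized_cost))  # 0.0625 0.9375
```

Under these assumptions, doubling throughput at the same hourly cost halves cost per million tokens, and the gain shows up as margin as long as the token price holds.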
A practical example: GPU-per-hour vs. TaaS
The example in Figure 3, below, uses simplified assumptions to show how the economics change when you move from GPU‑per‑hour to TaaS. These numbers are illustrative, not prescriptive pricing.

GPU-per-hour model: Assume an H100‑class instance rents for about 3 USD per hour. At 70% average utilization over a year, that works out to roughly 18,400 USD in annual revenue per GPU. In this model, you mainly tune utilization and hourly price—you are still selling time on a GPU, not AI output.
TaaS model: Now assume you run a throughput‑optimized, mid‑size model that can sustain 30 million billable tokens per hour on a single H100. If you charge 1 USD per 1 million tokens, that GPU has 30 USD per hour of token revenue potential. At 60% “token‑active” utilization, that yields about 18 USD of realized token revenue per hour, or roughly 157,680 USD per year per GPU.
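The arithmetic behind both figures can be reproduced in a few lines. The inputs are the illustrative assumptions stated above (3 USD per hour at 70% utilization; 30 million tokens per hour at 1 USD per million tokens and 60% token-active utilization); the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def annual_gpu_hour_revenue(hourly_rate_usd: float, utilization: float) -> float:
    """Annual revenue per GPU when selling time on the GPU."""
    return hourly_rate_usd * utilization * HOURS_PER_YEAR


def annual_token_revenue(tokens_per_hour: float,
                         price_per_million_usd: float,
                         token_active_utilization: float) -> float:
    """Annual revenue per GPU when selling tokens served by the GPU."""
    potential_per_hour = tokens_per_hour / 1_000_000 * price_per_million_usd
    return potential_per_hour * token_active_utilization * HOURS_PER_YEAR


rental = annual_gpu_hour_revenue(hourly_rate_usd=3.0, utilization=0.70)
taas = annual_token_revenue(tokens_per_hour=30_000_000,
                            price_per_million_usd=1.0,
                            token_active_utilization=0.60)

print(round(rental))  # 18396, the "roughly 18,400 USD" above
print(round(taas))    # 157680
```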
New GPU generations amplify this effect. NVIDIA GB200 NVL72 delivers order‑of‑magnitude improvements in tokens‑per‑second and cost‑per‑million‑tokens versus the previous generation, and leading inference providers report up to 10x lower cost‑per‑token on real workloads when they pair Blackwell with optimized stacks.
These savings are easiest to capture when you monetize at the token layer rather than per GPU‑hour, because higher tokens‑per‑second and lower cost‑per‑token translate directly into better unit economics for token‑metered services.

For example, if a B200‑class GPU doubles effective token throughput from 30 million to 60 million billable tokens per hour at the same price of 1 USD per 1 million tokens and 60% token‑active utilization, annual token‑as‑a‑service revenue per GPU increases from 157,680 USD to approximately 315,360 USD.
In a GPU‑per‑hour model, that extra throughput does not show up as additional revenue, but in a token‑as‑a‑service model it directly translates into higher revenue on the same GPU footprint and better margins as cost per token improves.
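A short sketch shows why the extra throughput appears only on the token side. It uses the same illustrative inputs as the example (3 USD per hour at 70% utilization for rental; 1 USD per million tokens at 60% token-active utilization for TaaS), and the helper names are hypothetical.

```python
def annual_revenue_usd(revenue_per_hour_usd: float,
                       utilization: float,
                       hours_per_year: int = 8760) -> float:
    """Annualize an hourly revenue potential at a given utilization."""
    return revenue_per_hour_usd * utilization * hours_per_year


def token_revenue_per_hour_usd(tokens_per_hour: float,
                               price_per_million_usd: float) -> float:
    """Hourly revenue potential of a GPU when billed per million tokens."""
    return tokens_per_hour / 1_000_000 * price_per_million_usd


results = {}
for tokens_per_hour in (30_000_000, 60_000_000):
    # Rental revenue does not depend on how many tokens the GPU can serve.
    rental = annual_revenue_usd(3.0, 0.70)
    taas = annual_revenue_usd(token_revenue_per_hour_usd(tokens_per_hour, 1.0), 0.60)
    results[tokens_per_hour] = (round(rental), round(taas))
    print(f"{tokens_per_hour:,} tokens/h: rental {round(rental):,} USD, "
          f"TaaS {round(taas):,} USD")
```

With these inputs the rental figure stays at 18,396 USD per year at both throughput levels, while the token-metered figure moves from 157,680 USD to 315,360 USD.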
Where telcos go from here
For telcos that have already invested in NVIDIA‑powered sovereign AI factories, the next step is to move quickly up the stack—from AI infrastructure to AI services—and to align their business models with the AI token economy.
Practically, this means going beyond GPU clusters and standing up an AI cloud stack with an NVIDIA‑certified software provider that can orchestrate GPUs, enforce multi‑tenant policies, and connect token‑level usage to billing, SLAs, and governance. For example, partners such as Rafay are already helping telcos roll out token‑metered AI services on sovereign infrastructure, offering early evidence that this approach matches real enterprise demand and use cases.
From there, telcos can launch token‑metered AI services: AI studios where teams build and adapt models using NVIDIA NIM and NeMo, marketplaces where those models and applications are offered as SKUs, and APIs that enterprises can consume on a per‑token or per‑workflow basis.
By treating tokens as the core economic unit—backed by NVIDIA’s advances in tokens‑per‑second, tokens‑per‑watt, and cost‑per‑token—telcos can evolve from connectivity and infrastructure providers into sovereign AI service providers, with revenue and margins that scale as their token factories grow.
Learn how telecom operators are turning sovereign AI infrastructure into real revenue and impact for their nations.