
Transforming Financial Analysis with NVIDIA NIM

In financial services, portfolio managers and research analysts diligently sift through vast amounts of data to gain a competitive edge in investments. Making informed decisions requires access to the most pertinent data and the ability to quickly synthesize and interpret that data.

Traditionally, sell-side analysts and fundamental portfolio managers have focused on a small subset of companies, meticulously examining financial statements, earnings calls, and corporate filings. Systematically analyzing financial documents across a larger trading universe can uncover additional insights. Because such tasks are technically and algorithmically difficult, systematic analysis of transcripts over a wide trading universe was, until recently, only accessible to sophisticated quant-trading firms.

On these tasks, traditional natural language processing (NLP) methods, such as bag-of-words, sentiment dictionaries, and word statistics, often fall short of what large language models (LLMs) achieve in financial NLP. Besides financial applications, LLMs have demonstrated superior performance in domains like medical document understanding, news article summarization, and legal document retrieval.

By leveraging AI and NVIDIA technology, sell-side analysts, fundamental traders, and retail traders can significantly accelerate their research workflow, extract more nuanced insights from financial documents, and cover more companies and industries. By adopting these advanced AI tools, the financial services sector can enhance its data analysis capabilities, saving time and improving the accuracy of investment decisions. According to the NVIDIA 2024 State of AI in Financial Services survey report, 37% of respondents are exploring generative AI and LLMs for report generation, synthesis, and investment research to reduce repetitive manual work.

In this post, we’ll walk you through an end-to-end demo on how to build an AI assistant to extract insights from earnings call transcripts using NVIDIA NIM inference microservices to implement a retrieval-augmented generation (RAG) system. We’ll highlight how leveraging advanced AI technologies can accelerate workflows, uncover hidden insights, and ultimately enhance decision-making processes in the financial services industry. 

Analyzing earnings call transcripts with NIMs

Earnings calls in particular are a vital source for investors and analysts, providing a platform for companies to communicate important financial and business information. These calls offer insights into the industry, the company’s products, competitors, and most importantly, its business prospects. 

By analyzing earnings call transcripts, investors can glean valuable information about a company’s future earnings and valuation. Earnings call transcripts have successfully been used to generate alpha for over two decades. For more details, see Natural Language Processing – Part I: Primer and Natural Language Processing – Part II: Stock Selection.

Step 1: The data 

In this demo, we use transcripts from NASDAQ earnings calls from 2016 to 2020 for our analysis. This Earnings Call Transcripts dataset can be downloaded from Kaggle.

For our evaluation, we used a subset of 10 companies from which we then randomly selected 63 transcripts for manual annotation. For all transcripts, we answered the following set of questions: 

  1. What are the company’s primary revenue streams and how have they changed over the past year? 
  2. What are the company’s major cost components and how have they fluctuated in the reporting period? 
  3. What capital expenditures were made and how are these supporting the company’s growth?   
  4. What dividends or stock buybacks were executed? 
  5. What significant risks are mentioned in the transcript? 

This makes for a total of 315 question-answer pairs. All questions are answered using a structured JSON format. For example: 

Question: What are the company’s primary revenue streams and how have they changed over the past year?

{
  "Google Search and Other advertising": {
    "year_on_year_change": "-10%",
    "absolute_revenue": "21.3 billion",
    "currency": "USD"
  },
  "YouTube advertising": {
    "year_on_year_change": "6%",
    "absolute_revenue": "3.8 billion",
    "currency": "USD"
  },
  "Network advertising": {
    "year_on_year_change": "-10%",
    "absolute_revenue": "4.7 billion",
    "currency": "USD"
  },
  "Google Cloud": {
    "year_on_year_change": "43%",
    "absolute_revenue": "3 billion",
    "currency": "USD"
  },
  "Other revenues": {
    "year_on_year_change": "26%",
    "absolute_revenue": "5.1 billion",
    "currency": "USD"
  }
}

Using JSON enables evaluating model performance in a manner that does not rely on subjective language understanding methods, such as LLM-as-a-judge, which might introduce unwanted biases into the evaluation. 


Step 2: NVIDIA NIM

This demo uses NVIDIA NIM, a set of microservices designed to speed up enterprise generative AI deployment. For more details, see NVIDIA NIM Offers Optimized Inference Microservices for Deploying AI Models at Scale. Supporting a wide range of AI models, including NVIDIA-optimized community and commercial partner models, NIM ensures seamless, scalable AI inferencing, on-premises or in the cloud, leveraging industry-standard APIs.

When ready for production, NIMs are deployed with a single command for easy integration into enterprise-grade AI applications using standard APIs and just a few lines of code. Built on robust foundations including inference engines like NVIDIA TensorRT, TensorRT-LLM, and PyTorch, NIM is engineered to facilitate seamless AI inferencing with best performance out-of-the-box based on the underlying hardware. Self-hosting models with NIM supports the protection of customer and enterprise data, which is a common requirement in RAG applications. 

Step 3: Setting up on NVIDIA API catalog 

NIMs can be accessed using the NVIDIA API catalog. Setup only requires an NVIDIA API key (from the API catalog, click Get API Key). For the purposes of this post, we’ll store it in an environment variable.
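
One way to store the key from within Python is sketched below. The variable name NVIDIA_API_KEY is the one the langchain-nvidia-ai-endpoints package looks for by default; the helper name store_api_key and the check on the nvapi- prefix are assumptions made for illustration, not part of the original demo.

```python
import os

def store_api_key(key: str, var_name: str = "NVIDIA_API_KEY") -> None:
    """Store the API key in an environment variable for the current process."""
    # Assumption: keys issued by the API catalog start with "nvapi-"
    if not key.startswith("nvapi-"):
        raise ValueError("this does not look like an NVIDIA API key")
    os.environ[var_name] = key
```

In a notebook, the key can be read with getpass.getpass() and passed to this helper so that it never appears in the saved code.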


LangChain provides a package for convenient NGC integration. This tutorial uses endpoints to run embedding, reranking, and chat models with NIMs. To reproduce the code, you’ll need to install the Python dependencies that the listings import: LangChain, FAISS, and the langchain-nvidia-ai-endpoints package.
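
As a quick sanity check before running the rest of the code, the following sketch reports which dependencies are not yet installed. The package list is inferred from the imports used in this post and is an assumption; no specific versions are implied.

```python
import importlib.util

# Assumed mapping from pip package name to the module it provides,
# inferred from the imports used in this post
REQUIRED = {
    "langchain": "langchain",
    "langchain-community": "langchain_community",
    "faiss-cpu": "faiss",
    "langchain-nvidia-ai-endpoints": "langchain_nvidia_ai_endpoints",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]

# Anything listed here still needs to be installed with pip
print(missing_packages())
```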


Step 4: Building a RAG pipeline with NIMs 

RAG is a method that enhances language models by combining retrieval of relevant documents from a large corpus with text generation.  

The first step of RAG is to vectorize your collection of documents. This involves taking a series of documents, splitting them into smaller chunks, using an embedder model to turn each of these chunks into a neural network embedding (a vector), and storing them in a vector database. We’ll do this for each of the earnings call transcripts:

import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Initialize the embedder that converts text to vectors
transcript_embedder = NVIDIAEmbeddings(model='nvidia/nv-embed-v1')

# The document we will be chunking and vectorizing
transcript_fp = "Transcripts/GOOGL/2020-Feb-03-GOOGL.txt"
raw_document = TextLoader(transcript_fp).load()

# Split the document into chunks of at most 3000 characters each
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000)
documents = text_splitter.split_documents(raw_document)

# Vectorize each chunk into a separate entry in the database
vectorstore = FAISS.from_documents(documents, transcript_embedder)
vector_store_path = "vector_db/google_transcript_2020_feb.pkl"
# Assumed reconstruction: clear any stale file at this path, then save the index
try:
    os.remove(vector_store_path)
except OSError:
    pass
vectorstore.save_local(vector_store_path)

Once the vectorized database is built, the simplest RAG flow for the earnings call transcripts is as follows:

  1. A user inputs a query. For example, “What are the company’s main revenue sources?” 
  2. The embedder model embeds the query into a vector and then searches through the vectorized database of the documents for the Top-K (Top-30, for example) most relevant chunks. 
  3. A reranker model, also known as a cross-encoder, then outputs a similarity score for each query-document pair. This score is used to reorder the Top-K documents retrieved by the embedder by relevance to the user query. Metadata can also be used to improve the accuracy of the reranking step. Further filtering can then be applied, retaining only the Top-N (Top-10, for example) documents.
  4. The Top-N most relevant documents are then passed on to an LLM alongside the user query. The retrieved documents are used as context to ground the model’s answer.

Figure 2. A simplified RAG workflow that involves three main steps: embedding and retrieval, reranking, and context-grounded LLM answer generation
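
To make the retrieve-then-rerank flow concrete, here is a toy, dependency-free sketch of steps 2 and 3. The embed and rerank functions are deliberately simple stand-ins (word counts and term overlap) for the embedding and reranking models, and all function names and sample chunks are hypothetical; the sketch shows only the shape of the flow.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k):
    """Rank chunks by embedding similarity to the query and keep the Top-K."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

def rerank(query, chunks, top_n):
    """Rescore each query-chunk pair and keep the Top-N.

    Toy stand-in for a cross-encoder: the fraction of query terms in the chunk.
    """
    terms = set(query.lower().split())

    def score(chunk):
        return len(terms & set(chunk.lower().split())) / max(len(terms), 1)

    return sorted(chunks, key=score, reverse=True)[:top_n]

chunks = [
    "cloud revenue grew 43 percent year over year",
    "we thank our employees for their hard work",
    "advertising revenue declined in the quarter",
    "capital expenditures were mainly for data centers",
]
query = "how did revenue change"

candidates = retrieve(query, chunks, top_k=3)          # Top-K by similarity
context = "\n\n".join(rerank(query, candidates, top_n=2))  # Top-N after reranking
print(context)
```

The resulting context string is what would be handed to the LLM together with the user query in step 4.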

Note that modifications can be made to improve a model’s answer accuracy, but for now we’ll continue with the simplest robust approach.  

Consider the following user query and desired JSON format:

question = "What are the company’s primary revenue streams and how have they changed over the past year?"

json_template = """
{"revenue_streams": [
    {
        "name": "<Revenue Stream Name 1>",
        "amount": <Current Year Revenue Amount 1>,
        "currency": "<Currency 1>",
        "percentage_change": <Change in Revenue Percentage 1>
    },
    {
        "name": "<Revenue Stream Name 2>",
        "amount": <Current Year Revenue Amount 2>,
        "currency": "<Currency 2>",
        "percentage_change": <Change in Revenue Percentage 2>
    },
    // Add more revenue streams as needed
]
}
"""

user_query = question + json_template

The JSON template will be used so that, further down the pipeline, the LLM knows to output its answer in valid JSON, rather than in plain text. As mentioned in Step 1, using JSON enables the automated evaluation of model answers in an objective manner. Note that this could be removed if a more conversational style is preferred. 

To contextualize the user query, initialize the Embedder and the Reranker for the retrieval and ordering of the relevant documents: 

from langchain_nvidia_ai_endpoints import NVIDIARerank
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever

# How many retrieved documents to keep at each step
top_k_documents_retriever = 30
top_n_documents_reranker = 5

# Initialize retriever for vector database
retriever = vectorstore.as_retriever(search_type='similarity',
                                     search_kwargs={'k': top_k_documents_retriever})

# Add a reranker to reorder documents by relevance to user query
reranker = NVIDIARerank(model="ai-rerank-qa-mistral-4b",
                        top_n=top_n_documents_reranker)
retriever = ContextualCompressionRetriever(base_compressor=reranker,
                                           base_retriever=retriever)

# Retrieve documents, rerank them and pick top-N
retrieved_docs = retriever.invoke(user_query)

# Join all retrieved documents into a single string
context = ""
for doc in retrieved_docs:
    context += doc.page_content + "\n\n"

Once the relevant documents are retrieved, they can be passed on to the LLM alongside the user query. We are using the Llama 3 70B NIM:

from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Prompt with placeholders for the context, the question, and the JSON template
PROMPT_FORMAT = """
Given the following context:
{context}
Answer the following question:
{question}
using the following JSON structure:
{json_template}
For amounts don't forget to always state if it's in billions or millions and "N/A" if not present.
Only use information and JSON keys that are explicitly mentioned in the transcript.
If you don't have information for any of the keys use "N/A" as a value.
Answer only with JSON. Every key and value in the JSON should be a string.
"""

llm = ChatNVIDIA(model="ai-llama3-70b")

llm_input = PROMPT_FORMAT.format(**{"context": context,
                                    "question": question,
                                    "json_template": json_template
                                    })
answer = llm.invoke(llm_input)

Running this code will produce a JSON-structured answer to the user query. The code can now be easily modified to read in multiple transcripts and answer varying user queries.
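
Models sometimes wrap the JSON in extra text, so it can help to parse the response defensively before evaluating it. The following is a minimal sketch; the helper name parse_json_answer is hypothetical, and it assumes the response text is available as a plain string (for LangChain chat models, typically the content attribute of the returned message).

```python
import json

def parse_json_answer(text):
    """Parse the outermost {...} span of a model response.

    Returns a dict, or None if no valid JSON object is found.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Responses that fail to parse can be logged and counted separately, so that formatting failures are not confused with extraction errors.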

Step 5: Evaluation 

To evaluate the performance of the retrieval step, use the annotated question-answer pairs previously described to compare the ground-truth JSON with the predicted JSON, key-by-key. Consider the following ground-truth example:

"Google Cloud": {
    "year_on_year_change": "43%",
    "absolute_revenue": "3 billion",
    "currency": "N/A"
}

The prediction looks like this:

"Google Cloud": {
    "year_on_year_change": "43%",
    "absolute_revenue": "N/A",
    "currency": "USD"
}

The three possible outcomes are: 

  1. True Positive (TP): There is a value to extract, and the ground truth and the prediction match. For the previous example, the prediction for year_on_year_change is TP.
  2. False Positive (FP): The ground truth value is “N/A”. In other words, there is no value to extract, but the prediction hallucinates a value. For the previous example, the prediction for currency is FP.
  3. False Negative (FN): There is a ground truth value to extract, but the prediction fails to capture it. For the previous example, the prediction for absolute_revenue is FN.

With these outcomes measured, next calculate the following three main metrics: 

  1. Recall = TP / (TP + FN): Higher recall implies our model returns more of the relevant results.
  2. Precision = TP / (TP + FP): Higher precision implies our model returns a higher ratio of relevant results versus irrelevant ones. 
  3. F1-score = (2 * Precision * Recall) / (Precision + Recall): The F1-score is a harmonic mean of precision and recall. 
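
The outcome counts and the three metrics can be sketched as follows for a single flat ground-truth/prediction pair. The function names are hypothetical, and one choice here is an assumption the definitions above do not cover: a prediction that gives a wrong (non-"N/A") value for a key that has a ground-truth value is counted as a false negative.

```python
def count_outcomes(ground_truth, prediction, missing="N/A"):
    """Compare two flat dicts key-by-key and count TP, FP, and FN."""
    tp = fp = fn = 0
    for key, gt_value in ground_truth.items():
        pred_value = prediction.get(key, missing)
        if gt_value == missing:
            if pred_value != missing:
                fp += 1  # nothing to extract, but a value was predicted
            # both "N/A": a true negative, which the metrics do not use
        elif pred_value == gt_value:
            tp += 1      # value present and correctly extracted
        else:
            fn += 1      # value present but missed (assumption: wrong values count here too)
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1, returning 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gt = {"year_on_year_change": "43%", "absolute_revenue": "3 billion", "currency": "N/A"}
pred = {"year_on_year_change": "43%", "absolute_revenue": "N/A", "currency": "USD"}

tp, fp, fn = count_outcomes(gt, pred)
print(tp, fp, fn)                       # 1 1 1
print(precision_recall_f1(tp, fp, fn))  # (0.5, 0.5, 0.5)
```

In practice, the counts would be summed over all keys of all question-answer pairs before computing the metrics once.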

A user might want some flexibility when comparing non-numeric string values for some of the attributes. For example, consider a question about revenue sources, where one of the ground-truth answers is “Data Centers” and the model outputs “Data Center”. An exact-match evaluation would treat this as a mismatch. To achieve more robust evaluation in such cases, use fuzzy matching with Python’s built-in difflib module:

import difflib

def get_ratio_match(gt_string, pred_string):
    # Compare only up to the length of the shorter string
    min_len = min(len(gt_string), len(pred_string))
    if min_len == 0:
        # Avoid dividing by zero when either string is empty
        return 0.0
    matcher = difflib.SequenceMatcher(None, gt_string, pred_string, autojunk=False)
    _, _, longest_match = matcher.find_longest_match(0, min_len, 0, min_len)
    # Return the longest common block as a fraction of the shorter string
    return longest_match / min_len

For evaluation, consider any string attributes to be a match if their similarity ratio is above 90%. 
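
Applying the 90% threshold can be sketched as below. This self-contained variant mirrors the longest-common-block ratio described above; the function name is_fuzzy_match and the handling of empty strings are assumptions for illustration.

```python
import difflib

def is_fuzzy_match(gt_string, pred_string, threshold=0.9):
    """Return True if the longest common block covers more than
    `threshold` of the shorter string."""
    min_len = min(len(gt_string), len(pred_string))
    if min_len == 0:
        # Assumption: empty strings only match each other
        return gt_string == pred_string
    matcher = difflib.SequenceMatcher(None, gt_string, pred_string, autojunk=False)
    size = matcher.find_longest_match(0, min_len, 0, min_len).size
    return size / min_len > threshold

print(is_fuzzy_match("Data Centers", "Data Center"))  # True
print(is_fuzzy_match("Data Centers", "Gaming"))       # False
```

Note that the comparison is case-sensitive; lowercasing both strings first is a reasonable extension if casing varies between annotations and model output.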

Table 1 presents results for two of the most-used open-source model families (Mistral AI Mixtral models and Meta Llama 3 models) on our manually annotated data. For both model families, there is noticeable performance deterioration when lowering the number of parameters. Visit the NVIDIA API catalog to experience these NIMs.

Method          F1      Precision   Recall
Llama 3 70B     84.4%   91.3%       78.5%
Llama 3 8B      75.8%   85.2%       68.2%
Mixtral 8x22B   84.4%   91.9%       78.0%
Mixtral 8x7B    62.2%   80.2%       50.7%
Table 1. Performance of Llama and Mixtral models on JSON-structured information extraction and question-answering from earnings call transcripts

Mixtral 8x22B performs roughly on par with Llama 3 70B. For both model families, however, reducing the number of parameters significantly decreases performance, and the decrease is most pronounced for recall. This reflects a common trade-off: better accuracy comes at the cost of larger hardware requirements.

In most cases, model accuracy can be improved without increasing the number of parameters by fine-tuning the Embedder, the Reranker, or the LLM on domain-specific data (in this case, earnings call transcripts).

The Embedder is the smallest and therefore the quickest and most cost-effective to fine-tune. For detailed instructions, refer to the NVIDIA NeMo documentation. NVIDIA NeMo also simplifies fine-tuning the LLM itself and makes it more efficient.

Key implications for users

This demo is designed to extract insights from earnings call transcripts. By leveraging advanced AI technologies like NIM, it’s now possible to quickly and accurately retrieve information from earnings call transcripts. The AI product assists multiple categories of financial researchers, analysts, advisors, and fundamental portfolio managers during the most intensive processes of documentation and data analysis, enabling financial professionals to spend more time on strategic decision-making or with clients. 

In the asset management sector, for example, portfolio managers can use the assistant to quickly synthesize insights from a vast number of earnings calls, improving investment strategies and outcomes. In the insurance industry, the AI assistant can analyze financial health and risk factors from company reports, enhancing underwriting and risk assessment processes. In fundamental and retail trading, the assistant can help with systematic information extraction to identify market trends and sentiment shifts, enabling the use of more detailed information for future trades. 

Even in banking, it can be used to assess the financial stability of potential loan recipients by analyzing their earnings calls. Ultimately, this technology enhances efficiency, accuracy, and the ability to make data-driven decisions, giving users a competitive edge in their respective markets. 

Visit the NVIDIA API catalog to see all the available NIMs and experiment with LangChain’s convenient integration to see what works best for your own data.
