Generative AI

Streamline Trade Capture and Evaluation with Self-Correcting AI Workflows


The success of LLMs in chat and digital assistant applications is sparking high expectations for their potential in business process automation. While achieving human-level reliability in such workflows has been challenging, it has highlighted key areas for improvement and fueled ongoing innovation.

Despite reliability challenges, there’s tremendous business potential in automating workflows that involve free-form, natural language content for which AI is the only alternative to manual processing. 

This post explores why AI-based free-form text workflows often fail and shows how combining AI with rules-based error correction can achieve near-perfect accuracy in trade entry for financial ‘what-if’ analysis.

Our experiments were powered by NVIDIA NIM, self-hosted inference containers that address data control concerns, reduce latency, and cut costs compared to cloud APIs. We used NIM to run models such as Qwen3 and DeepSeek-v3 locally for benchmarking and performance evaluation.

Trade entry

‘What-if’ analysis involves assessing the impact of a potential new trade on a financial institution’s risk, trading limits, and capital requirements before the trade is executed. The first step in a ‘what-if’ analysis is trade entry, i.e., adding the prospective trade to the trading system to evaluate its risk and capital requirements. 

Trade entry input is free-form text and may come from an email chain, trader chat, or even voice. The output is data in the format that the trading system can accept. While the data schema for each trade type is fixed, the way trades are described varies widely, from one-line descriptions like the following example to lengthy ones packed with details, many of which defy clean categorization.

We pay 5y fixed 3% vs. SOFR on 100m, effective Jan 10

This example describes an interest rate swap, one of the most widely traded financial instruments. It involves two counterparties making periodic coupon payments to each other based on a fixed or floating interest rate. 

The two payment streams going in opposite directions are called “legs” of the swap. In this case, one leg involves fixed payments of 3%. The other leg involves floating payments based on a US dollar interest rate index (SOFR), published by the Federal Reserve Bank of New York, on a notional amount of $100 million over a 5-year term (also called the swap tenor).
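To make the target format concrete, the following sketch shows one possible way to represent this schema in Python. The field names mirror the dictionary output shown later in this post; the dataclass layout itself is illustrative and not tied to any particular trading system.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SwapLeg:
    side: str                         # "pay" or "receive"
    fixed_rate: Optional[str] = None  # set on the fixed leg, e.g., "3%"
    index: Optional[str] = None       # set on the floating leg, e.g., "SOFR"

@dataclass
class InterestRateSwap:
    notional: int        # e.g., 100_000_000
    tenor: str           # e.g., "5Y"
    effective_date: str  # e.g., "Jan 10", before any post-processing
    leg_1: SwapLeg
    leg_2: SwapLeg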

The lack of any predefined format for trade descriptions makes trade entry notoriously resistant to automation. A trade can be specified in many ways, foiling rules- or template-based parsing approaches. For example, the same swap can also be described as follows:

We are long 3% swap on $100m, maturity 10-Jan-2030

The latter description includes the currency (USD), which implies the index (SOFR), while the former includes the index, which implies the currency. The latter description also designates the long side, which by well-known convention means “pay fixed”, while the former description makes this designation explicit. 

The majority of data fields embedded in the two concise examples above aren’t assigned explicit names typical of more verbose descriptions found in trade confirmations (e.g., “Notional: $100m”). Converting these trade descriptions into data requires interpreting the meaning of each field from its position relative to other fields and understanding complex relationships between field values and financial industry conventions.  
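As a rough illustration, some of these conventions can be written down as simple deterministic lookups. The mapping tables and field names below are simplified assumptions for illustration, not a complete market-convention reference.

# Hypothetical, simplified convention lookups for illustration only
INDEX_TO_CURRENCY = {"SOFR": "USD", "ESTR": "EUR", "SONIA": "GBP"}
CURRENCY_TO_DEFAULT_INDEX = {ccy: idx for idx, ccy in INDEX_TO_CURRENCY.items()}

def resolve_conventions(trade: dict) -> dict:
    """Fill in fields implied by other fields and by market convention."""
    resolved = dict(trade)
    # The floating index implies the currency, and vice versa
    if "index" in resolved and "currency" not in resolved:
        resolved["currency"] = INDEX_TO_CURRENCY[resolved["index"]]
    elif "currency" in resolved and "index" not in resolved:
        resolved["index"] = CURRENCY_TO_DEFAULT_INDEX[resolved["currency"]]
    # By convention, being "long" the swap means paying the fixed rate
    if resolved.get("position") == "long":
        resolved["fixed_side"] = "pay"
    return resolved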

Converting free-form text to structured JSON output is possible with custom discriminative models, but this requires careful training and data labeling. Modern LLMs, by contrast, have no difficulty understanding free-form trade descriptions without any specialized prompting, thanks to their training data.

A sample output generated by providing the first example to a Llama 3.1 70B model, along with the simple prompt “Convert this data to a dictionary,” follows:

{
    "notional": 100000000,
    "tenor": "5Y",
    "effective_date": "2024-01-10",
    "leg_1": {
        "side": "pay",
        "fixed_rate": "3%"
    },
    "leg_2": {
        "side": "receive",
        "index": "SOFR"
    }
}

And yet, despite its impressive performance in recognizing the trader jargon, the data produced by the LLM above contains an error. We’ll examine the nature of this error and the ways to overcome similar errors in the next section.

Controlling hallucinations in LLM-based trade capture workflows

Participants in CompatibL’s 2024 TradeEntry.ai hackathon showed that a single call to an LLM with a detailed and well-crafted prompt reaches a peak accuracy of around 90-95% on simple trade texts. However, for more complex inputs, the accuracy falls to around 80%, which is insufficient for production applications. 

Importantly, many errors observed during the hackathon weren’t due to AI doing too little but rather doing too much—i.e., performing additional transformations it learned from its training data that weren’t valid for the specific trade at hand. 

For example, in the LLM output example shown, the start date (2024-01-10) includes the year even though the input does not (“Jan 10”). Because the model had access to the current date of December 10, 2024, it learned from its training data that the effective date must include the year and used the current year. 

This logic is faulty because, for a ‘what-if’ analysis, the start date is always in the future. An expert would know that a trader requesting ‘what-if’ analysis in December 2024 for a trade starting “Jan 10” means January 10, 2025, not January 10, 2024, which is already in the past.

AI-based coding assistants handle incorrect model assumptions by involving a human in the loop. Over multiple iterations, users review interim outputs and prompt corrections until the result is acceptable. For our use case, however, involving a human at every step would defeat the purpose of automation. Instead, we use a self-correction approach that prevents the LLM from performing any transformations while converting the free-form text input to a data dictionary. Any logic, such as inserting the year, is then performed in post-processing based on deterministic rules.
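For the missing year, such a deterministic rule can simply pick the next future occurrence of the stated month and day, reflecting the fact that a ‘what-if’ trade always starts in the future. A minimal sketch using only the standard library follows; it assumes the date text is a bare month and day such as “Jan 10”.

from datetime import date, datetime

def resolve_effective_year(month_day: str, today: date) -> date:
    """Return the next future date matching a month-and-day string like 'Jan 10'."""
    parsed = datetime.strptime(month_day, "%b %d")
    candidate = date(today.year, parsed.month, parsed.day)
    # A 'what-if' trade always starts in the future, so roll into next year if needed
    if candidate <= today:
        candidate = date(today.year + 1, parsed.month, parsed.day)
    return candidate

# "Jan 10" requested on December 10, 2024 resolves to January 10, 2025
print(resolve_effective_year("Jan 10", date(2024, 12, 10)))  # 2025-01-10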

We implement our approach by prompting the LLM to provide a string template (we use Box templating) along with a data dictionary, with the requirement that substituting the dictionary into the template will faithfully reproduce the original input. The string template for the first example may be structured as follows:

{fixed_side} {tenor} fixed {fixed_rate} vs. {floating_index} on {notional}, effective {effective_date}

Compared to the simpler approach of requiring all output data values to be substrings of the input text, the template-based approach ensures the extracted data fully captures the original meaning and structure of the trade description. If some errors remain or new ones are introduced, the process continues with a new correction prompt. Usually, all errors are eliminated in fewer than three iterations.
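A minimal sketch of this reconstruction check is shown below. It substitutes the extracted dictionary into the model-supplied template with Python’s str.format and compares the result to the original input; the helper name and the whitespace-normalized comparison are illustrative choices rather than the exact production logic.

def validate_template(trade_text: str, template: str, trade_dict: dict) -> list:
    """Return a list of differences between the input and its reconstruction."""
    try:
        reconstructed = template.format(**trade_dict)
    except KeyError as missing:
        return [f"template placeholder {missing} is absent from the dictionary"]
    # Normalize whitespace so cosmetic differences are not flagged as errors
    original = " ".join(trade_text.split())
    rebuilt = " ".join(reconstructed.split())
    if original != rebuilt:
        return [f"expected: {original!r}", f"reconstructed: {rebuilt!r}"]
    return []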

The data dictionary produced using this approach is different from the sample dictionary we discussed in the preceding section. It doesn’t include implied fields, such as the side (“receive”) of the floating leg, which was deduced by the model from the side (“pay”) of the fixed leg. 

Most importantly, the error with the added year is eliminated because the field “effective_date” will have its verbatim value “Jan 10” without the year. Once the data is in the dictionary, conventional rules-based processing can determine the year and perform all other required transformations. For dates and amounts expressed as text, libraries such as dateparser and text2num provide reliable rule-based conversions without using AI. Most trade analytics libraries already contain code that can resolve defaults using a reference database.
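A brief sketch of how these libraries could be used in post-processing follows; dateparser’s PREFER_DATES_FROM setting implements the always-in-the-future rule discussed earlier, and text2num converts amounts written out in words. The specific field handling is an assumption for illustration.

import dateparser                 # pip install dateparser
from text_to_num import text2num  # pip install text2num

# Resolve "Jan 10" relative to today's date, preferring dates in the future
effective_date = dateparser.parse("Jan 10", settings={"PREFER_DATES_FROM": "future"})

# Convert a spelled-out amount into a number
notional = text2num("one hundred million", "en")  # 100000000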

The self-correction process described in this section is presented below in Pythonic pseudo-code:

class SelfCorrectingTradeParser:

    def __init__(self, llm, max_iter: int = 3):
        # LLM object such as LangChain's ChatNVIDIA
        self.llm = llm
        # Maximum number of self-correction iterations to try
        self.max_iter = max_iter

    def parse_with_self_correction(self, trade_text: str) -> dict:
        # Create prompt describing the trade parsing process
        prompt = self._initial_prompt(trade_text)
        for _ in range(self.max_iter):
            # Get the LLM response
            reply = self.llm(prompt)
            # Extract trade dictionary and string template
            trade_dict, tmpl = self._extract(reply)
            # See if we can reconstruct the original trade text
            diffs = self._validate(trade_text, tmpl, trade_dict)
            # If there are no differences/errors, we are done
            if not diffs:
                return trade_dict
            # Otherwise we create a prompt with the corrections and retry
            prompt = self._correction_prompt(trade_text, trade_dict, tmpl, diffs)

        return trade_dict
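As a usage sketch, the parser above could be driven by a LangChain ChatNVIDIA client pointed at a self-hosted NIM endpoint. The model name and endpoint URL below are placeholders, and the thin wrapper simply returns the reply text so it matches the callable expected by the pseudo-code.

from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Placeholder model name and local NIM endpoint
chat = ChatNVIDIA(model="qwen/qwen3-235b-a22b",
                  base_url="http://localhost:8000/v1",
                  temperature=0.6)

def llm(prompt: str) -> str:
    # Return only the text of the model's reply
    return chat.invoke(prompt).content

parser = SelfCorrectingTradeParser(llm, max_iter=5)
trade_dict = parser.parse_with_self_correction(
    "We pay 5y fixed 3% vs. SOFR on 100m, effective Jan 10"
)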

Deploying open models

NVIDIA NIM offers self-hosted, GPU-accelerated inference Docker containers with standard APIs, optimized for low latency and high throughput using inference engines like NVIDIA TensorRT and NVIDIA TensorRT-LLM.
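Because NIM microservices expose an OpenAI-compatible API, a locally deployed model can be queried with standard clients. The sketch below assumes a NIM container serving on localhost port 8000 and uses a placeholder model name.

from openai import OpenAI

# Point the standard OpenAI client at the local NIM endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="deepseek-ai/deepseek-r1",  # placeholder; use the model served by your NIM
    messages=[{
        "role": "user",
        "content": "Convert this data to a dictionary: "
                   "We pay 5y fixed 3% vs. SOFR on 100m, effective Jan 10",
    }],
    temperature=0.6,
)
print(response.choices[0].message.content)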

NIM microservices support a range of model families and sizes, enabling users to balance accuracy and speed. We used NIM to evaluate the self-correcting workflow described in this post with self-hosted Qwen and DeepSeek models. To measure model performance, we used a test set collected for CompatibL’s 2024 TradeEntry.ai hackathon. We also employed a prompting technique called few-shot learning, where the model is shown example inputs and outputs to help it understand the task.

Specifically, we tested two versions of the prompt: one that included a single example and another that included ten. Each example in the model’s prompt includes the text describing the basis swap (input) alongside the desired output dictionary and the accompanying formatted string template (outputs).
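A rough sketch of how such a few-shot prompt could be assembled is shown below; the instruction wording and example formatting are assumptions for illustration.

import json

def build_few_shot_prompt(examples: list, trade_text: str) -> str:
    """Assemble a prompt from (input, dictionary, template) examples plus a new trade."""
    parts = [
        "Extract a data dictionary and a string template from the trade text. "
        "Substituting the dictionary into the template must reproduce the input exactly."
    ]
    for ex in examples:  # each example: {"text": ..., "dict": ..., "template": ...}
        parts.append(f"Input: {ex['text']}")
        parts.append(f"Dictionary: {json.dumps(ex['dict'])}")
        parts.append(f"Template: {ex['template']}")
    parts.append(f"Input: {trade_text}")
    return "\n\n".join(parts)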

To evaluate model performance, we measure the following outcomes for each basis swap:

  1. True positives (TP): There is a value to extract, e.g., “notional”, and the ground truth and the prediction match. 
  2. False positives (FP): The model hallucinates a value that doesn’t exist in the ground truth or extracts an incorrect value. 
  3. False negatives (FN): There is a ground-truth value to extract, but the model prediction fails to capture it. 

With these outcomes measured, we calculate the following three metrics: 

  1. Recall = TP / (TP + FN). Higher recall means the model captures a larger share of the relevant values. 
  2. Precision = TP / (TP + FP). Higher precision means a larger share of the extracted values is correct rather than spurious. 
  3. F1-score = (2 * Precision * Recall) / (Precision + Recall). The F1-score is the harmonic mean of precision and recall; a minimal computation is sketched after this list.
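The sketch below computes these metrics directly from the outcome counts; the function name and the counts in the usage example are hypothetical.

def score(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1-score from outcome counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 95 correct fields, 3 hallucinated, 2 missed
print(score(95, 3, 2))  # precision ≈ 0.969, recall ≈ 0.979, f1 ≈ 0.974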

When using self-correction, we allow a maximum of 5 iterations. Because LLMs can produce different results each time they run, we set the temperature to 0.6 (to balance randomness and consistency) and run each model on the test set five times, averaging the results across those runs. The averaged results on the test set are shown below:

Figure 1. F1-score for varying few-shot examples and with/without self-correction. All models show improved F1-scores with self-correction, with DeepSeek-R1 (10-shot) achieving the highest at 0.988.
Figure 2. Total trade capture errors (false positives + false negatives) for varying few-shot examples and with/without self-correction.

Both graphs show that the self-correction method reduces errors by 20% to 25% and raises F1-scores by 3% to 5%. Furthermore, models explicitly trained for reasoning, such as DeepSeek-R1, outperform general-purpose counterparts like DeepSeek-v3, achieving near-perfect accuracy in the 10-shot scenario and underscoring the superior error correction and structured task decomposition capabilities required in our self-correction loop. Finally, increasing the number of few-shot examples consistently boosts performance, with DeepSeek-v3 improving its F1-score by about 4.8 percentage points (from 91.5% to 96.3%) and Qwen3-235B by about 6 percentage points (from 90.7% to 96.9%), showing the concrete benefit of richer prompting context.

Conclusion

Much like humans, AI often makes implicit assumptions. Many errors, including the examples analyzed in this post, are caused by implicit reasoning, where the model attempts to do more than the task requires but misses a critical piece of data. Such errors can be eliminated using self-correcting workflows that combine AI with rules-based validation.

Humans learn best from practical examples and expert guidance, not when they are left alone to consult a long list of all possible errors that may occur. The self-correcting workflow uses the same “learning by example” approach that works so well for humans, first providing few-shot samples (practical training) and then correcting any residual errors (supervision). If you’re building LLM-based automation for financial workflows, we encourage you to adopt a self-correcting approach. Start by evaluating it on your own trade data, either with the free cloud model APIs on build.nvidia.com or by using NIM to deploy models locally.

Join us at NVIDIA GTC Paris from June 10 to 12 to hear industry leaders speak about AI in financial services. We’re hosting a connect with experts session on generative AI applications in financial services, where you can learn lessons from deploying these kinds of systems in production.
