Data Science

Limit Order Book Dataset Generation for Accelerated Short-Term Price Prediction with RAPIDS

Stock board

In the high-frequency trading world, thousands of market participants interact daily. In fact, high-frequency trading accounts for more than half of the US equity trading volume, according to the paper High-Frequency Trading Synchronizes Prices in Financial Markets

Market makers are the big players on the sell side who provide liquidity in the market. Speculators are on the buy side, running experiments and researching processes they expect to profit from. End users are the retail investors who consult retail brokers for advice and transactions. Overall, financial firms are interested in evaluating financial machine learning (ML) algorithms to discover which algorithms are most profitable.

Researchers have published many versions of this type of algorithm recently. We sought to leverage high-frequency data and the explainability of random forest (RF) models, and chose the RF method presented in the paper Investigating Limit Order Book Characteristics for Short-Term Price Prediction: a Machine Learning Approach

Our study found that hardware acceleration using GPUs reduces the time required for financial ML researchers to obtain prediction results. As most of the runtime can be spent in classifier training, there is, of course, interest in more efficient training methods.

This post walks through our study, including the generated dataset, using limit order book (LOB) data for price prediction, and recommended steps for ML training. We explain how the studied GPU configurations lead to significantly faster ML training time, enabling more efficient and extensive model development. 


This study used a time series dataset showing real-time stock prices to better understand LOB structure and direction prediction. Provided by market data company Intrinio, the dataset for this study contains samples of actual market prices for NYSE and NASDAQ tickers from the DOW 30 stocks on a 1-second basis.

The 1-second quotes were used as input to ABIDES (Agent-Based Interactive Discrete Event Simulation) to generate LOB data that looks like the LOB of the market. The timestamp on every record is at the second mark; for example: 2019-01-02T14:09:18Z for January 2, the first trading day in 2019. 

The CSV file input to ABIDES consists of this column as the first, followed by 30 columns for the prices in US dollars (to two digits) for the DOW 30. This post focuses on the AAPL ticker as a test case.

Synthetic data generation using ABIDES

ABIDES is an approach to simulating the operation of financial markets. It is explained in the recent paper, ABIDES: Towards High-Fidelity Multi-Agent Market Simulation, by researchers at the Georgia Institute of Technology, the University of Georgia, and J.P. Morgan Chase Bank. 

ABIDES simulates many individual trading agents who buy and sell assets through exchange agents. Every trade and other event in the simulation is recorded and tied to the agent that performed it. This enables market researchers to analyze in detail how different agent strategies and events affect the simulated market. Importantly, the LOB of a given exchange can be reconstructed after the simulation. 

Some agents in an ABIDES simulation base their valuation of an asset on a time series representing the true value of the asset that the agents observe at some frequency with some added noise. This time series is called the fundamental value of the stock. To simulate a more realistic market at a macro scale, we use real historical data for our fundamental value.

To create plausible LOB data for training our RF model, we use 1-second quotes provided by Intrinio as the historical fundamental value for an ABIDES simulation. Figure 1 compares the mid-price of the output LOB data with the 1-second quotes used as the historical fundamental for AAPL.

Chart comparing the 1-second input quote data to the mid-price of the output ABIDES simulation for a single trading day. ‘Fundamental’ (blue line) to the mid-price of the output ABIDES simulation and ‘Mid-price’ (black line) for a single trading day. The two time series stay nearly the same throughout the course of the day.
Figure 1. Chart comparing the 1-second input AAPL quote data from Intrinio (blue line) to the mid-price of the output ABIDES simulation (black line) for a single trading day

The LOB as a predictor of short-term price movements

On the bid side of a trade transaction, the buyer would like to pay as little as possible to buy a given security. On the ask side, the seller would like to sell the security for as much as possible. Limit orders are a way of setting these limits on the bid and ask side.

The LOB for a given security is a list of order sizes with security prices on the x-axis and total volume at that price on the y-axis for the bid and ask sides. For example, a buyer is willing to buy 100 shares of the GOOG security at $580 each, so there must be enough shares from those selling to fulfill those 100 shares. See Figure 2 for an example LOB.

The LOB is separated into the bid portion (on the left of the red line in Figure 2), where prices are lower than the mid-market, and the ask portion (on the right of the red line in Figure 2) where prices are higher.

Limit order book snapshot for the Google security at two time points 292 microseconds apart. The chart shows order volume and price for bid and ask orders, as well as mid-price. Mid-quote price direction is upward almost one U.S. dollar from the first frame to the second frame. The mid-quote price is marked by a red line between the bid and ask orders.
Figure 2. Limit order book snapshot for GOOG security at two time points 292 microseconds apart. Order volume is shown on the y-axis, with price on the x-axis. The mid-quote price is marked by the red line between the bid and ask orders.

Put simply, buying parties expect to pay lower prices and selling parties expect to gain higher prices in the market. The time points have nine digits behind the decimal point, which reflects the nanosecond accuracy of modern securities exchanges. 

Figure 3 shows that accuracy in predicting the immediate direction of price movement increases when more LOB depth is provided to a classifier. This makes intuitive sense, as a classifier has more available information on both sides of the market (bid levels and ask levels) during training.

Chart showing LOB depth (x-axis at the bottom) and percent accuracy (y-axis on the left) with a relatively straight curve of points, showing that as depth increases, accuracy also increases.
Figure 3. Limit order book depth can vary. The ML mid-price direction prediction accuracy is more robust with more levels in the book
Image credit: Faisal I Qureshi

Accelerate random forest training with RAPIDS

We trained a random forest model to predict short-term price movements with LOB data as input. We trained a classifier to predict whether a given stock price will move upwards, downwards, or stay flat. 

Specifically, the goal is to predict whether the average of the next 20 mid-prices (mnext) will be smaller or larger than the average of the previous 20 mid-prices (mprev) by a certain margin. We defined this margin as 0.5 cents, the smallest non-zero difference in mid-prices between any two LOB frames in our dataset.

Our dataset consisted of 90 days of ABIDES-generated data, providing about 7.5 million labeled LOB frames. We used 66% of this dataset as our training set and held out the remaining 33% as a testing set, as in the paper. Our input to the RF model consisted of 40 features, which were the bid and ask prices and volumes at 10 LOB levels.

A label of 2 indicates an upward price movement (mnext – mprev > 0.5 cents), a label of 1 indicates a neutral price movement, and a label of 0 indicates a downward price movement (mnext – mprev < -0.5 cents). 

The following experiments were run on one NVIDIA A100 80 GB SXM for RAPIDS cuDF and RAPIDS cuML, and on two AMD EPYC 7742 64-core processors for scikit-learn and pandas. Mid-prices, averages, and labels are computed using both the RAPIDS cuDF library and pandas. 

Figure 4 presents a comparison of runtimes. The mean preprocessing time is calculated from 10 runs with 10 warmups for each configuration. This is a labeling step prior to the ML training run shown in Figure 5.

Bar graph comparing mean preprocessing time on CPU with pandas and on GPU with cuDF. CPU with pandas takes about 19 seconds, GPU with cuDF takes about 4.5 seconds.
Figure 4. Comparison of mean preprocessing time on CPU with pandas, and on GPU with cuDF

We trained a random forest classifier of 100 trees using both scikit-learn and RAPIDS cuML, and compared training time with each. RAPIDS cuML is a freely available, drop-in replacement for scikit-learn, which enables many popular ML algorithms to be accelerated on the GPU. 

Figure 5 shows a comparison of the runtime of the training workload on one NVIDIA A100 80 GB with RAPIDS cuML, and on two AMD EPYC 7742 64-core processors with scikit-learn. Training on CPU is multithreaded with 128 threads, using the scikit-learn n_jobs parameter. 

Time is averaged over 50 runs with five warmups, and scikit-learn time is averaged over 10 runs with five warmups. Using the GPU is about 10x faster for training. These results are in line with the 2022 GPU study detailed in Accelerating Machine Learning Training Time for Limit Order Book Prediction.

Bar graph showing training runtime in seconds for scikit-learn on CPU and cuML on GPU. scikit-learn on CPU takes around 388 seconds, while cuML on GPU takes around 35 seconds.
Figure 5. Training runtime in seconds for scikit-learn on CPU and cuML on GPU 

Training on GPU offers more than a 10x speedup for this workload. The iterative nature of ML classifier development makes it time intensive, especially considering the large amount of time series data used in financial markets. In short, the GPU is a game changer in ML algorithms research.

Rising compute needs for financial datasets

While the preceding example uses a single ticker, these high-frequency trading and limit order book use cases need multiple AI systems running the algorithms—equivalent to multiple NVIDIA DGX SuperPODs. Typically, organizations specializing in such use cases require multiple asset classes and trackers. 

Hence, the analysis and application of such algorithms can be easily parallelized with cases extending to multiple AI systems needing speedup time and significant compute. For example, quantitative finance, machine learning (such as RAPIDS cuML), and deep learning applications (such as neural nets on top of LOB datasets).

To speed up training when developing financial ML algorithms, you can leverage GPU acceleration by using the RAPIDS suite of libraries:

  • RAPIDS cuDF replaces the pandas Python library
  • RAPIDS cuML replaces the scikit-learn Python library 

Download and install RAPIDS to get started enabling GPUs for your data science workloads. Remember to install an NVIDIA driver and CUDA toolkit beforehand.

Discuss (0)