Edge Computing

Enable Real-Time AI for High-Speed Data Acquisition with DAQIRI

When AlphaFold2 revolutionized drug discovery in 2020, its success relied entirely on the roughly 170,000 protein structures collected by scientists since 1971 and preserved in the Protein Data Bank. Measured data is the backbone of all AI models and of workflows that process data as it’s created, act on what matters in real time, and analyze data for deep insights. With the current rise of modern sensors and detectors, nobody needs to wait 50 years to collect enough data for groundbreaking AI models.

From large scientific facilities such as the Linac Coherent Light Source II (LCLS-II), which generates photon pulses at a 1 MHz repetition rate, to industrial CT scanners and high-bandwidth software-defined radios, output rates continue to increase. The bottleneck is shifting from a lack of data to the current “collect, store, analyze” architecture, which was never designed to handle high data rates on short time scales.

Moving to an adaptable data acquisition pipeline that pre-processes data at the source mitigates information loss from the data deluge and accelerates the path from data collection to discovery.

NVIDIA DAQIRI (Data Acquisition for Integrated Real-time Instruments) shifts data acquisition from an inflexible, hardware-centric design to an adaptable, software-centric architecture. As a high-performance networking library and part of the NVIDIA Holoscan Platform, DAQIRI directly connects existing high-bandwidth streaming detectors and sensors to the NVIDIA software ecosystem. Examples include Holoscan for real-time multi-modal, multi-rate processing; NVIDIA TensorRT for real-time inference; and NVIDIA nvCOMP for streaming compression. In addition to the NVIDIA ecosystem, DAQIRI can also stream directly into custom instrument-specific software platforms.

DAQIRI enables developers and instrument builders to create data acquisition workflows that process data in stream for tasks such as filtering, inference, compression, event selection, and adaptive control without having to modify their instruments. 

By streaming instrument outputs directly into edge supercomputing systems equipped with an NVIDIA ConnectX Network Interface Card (NIC), instruments can monitor the experiment continuously, respond to changing conditions, and trigger actions in real time. These systems range in size from the NVIDIA DGX Spark to the NVIDIA IGX Platform to node- and rack-based solutions like NVIDIA RTX Pro Server or VR200, depending on the compute the specific instrument requires.

This approach gives researchers immediate insight into incoming data while also preparing selected outputs for downstream processing at supercomputing facilities using AI and other computationally intensive methods. 

A-GHOST: Making unsavable data searchable

The High-Luminosity Large Hadron Collider (HL-LHC) upgrade at the European Organization for Nuclear Research (CERN) will increase the luminosity by a factor of 10 compared to the original design. To process the much higher data rates, the ATLAS detector will upgrade its current selection system. The new design will still use a two-stage selection system, but the rate of selected events will rise to 1 MHz (up from 100 kHz) after the first stage and to as much as 10 kHz (up from 1 kHz) after the second stage, which feeds storage. Even at these increased rates, more than 99% of all collisions are still rejected in the online system.
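
The rates quoted above translate into rejection fractions with a few lines of arithmetic. The sketch below is illustrative only: the 40 MHz bunch-crossing rate is the nominal LHC figure and is an assumption that this post does not state, and the function and constant names are made up for the example.

```cpp
// Fraction of events rejected when an input rate is reduced to an output rate.
constexpr double rejected_fraction(double input_hz, double output_hz) {
    return 1.0 - output_hz / input_hz;
}

// Rates quoted in the text for the upgraded ATLAS selection system.
constexpr double kStage1OutHz = 1.0e6;  // 1 MHz after the first stage
constexpr double kStage2OutHz = 1.0e4;  // up to 10 kHz written to storage

// ASSUMPTION (not from this post): nominal LHC bunch-crossing rate of 40 MHz.
constexpr double kBunchCrossingHz = 40.0e6;

// Second stage alone: 10 kHz kept out of 1 MHz, so about 99% is rejected.
constexpr double kStage2Rejected = rejected_fraction(kStage1OutHz, kStage2OutHz);

// Both stages together, relative to the assumed 40 MHz input: about 99.975%.
constexpr double kOverallRejected = rejected_fraction(kBunchCrossingHz, kStage2OutHz);
```

Under that assumption, only about 1 in 4,000 bunch crossings reaches storage, which is the discarded stream that the project described next sets out to search.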

The A-GHOST project uses DAQIRI to apply more powerful AI-driven searches to the stream that is discarded by the nominal selection path by employing efficient networking to bring the GPUs closer to the raw detector data. 

The R&D effort explores a streaming link between the custom Field-Programmable Gate Array (FPGA)-based hardware boards planned for the HL-LHC and a high-performance, GPU-enabled processing farm. Led by scientists from CERN Openlab, the University of Chicago, and UCL, the effort aims to enable real-time analysis of the full data stream with powerful models such as Convolutional Auto-Encoders (CAEs), Temporal Convolutional Neural Networks (TCNNs), and transformer-based models, which are planned to be tested on the prototype hardware.

How DAQIRI works under the hood

DAQIRI is designed to handle high-bandwidth Ethernet data, including UDP and RoCE v2 traffic, at line rates of hundreds of Gbps and higher. To achieve this, the high-speed data path bypasses the Linux kernel entirely.

By leveraging the Data Plane Development Kit (DPDK), DAQIRI provides zero-copy access, routing data directly from the NIC to the GPU’s Direct Memory Access (DMA) buffers. This kernel-bypass mechanism reduces the latency and CPU overhead typically associated with traditional network stacks, ensuring that massive instrument data streams arrive at the GPU ready for immediate processing.

NVIDIA DAQIRI Key Features:

  • High Throughput, Low Latency
    • Achieve line rate on any interface with proper hardware and CPU/NUMA tuning
  • Customized Receive Processing
    • Automated packet reordering, data type conversion, and hardware-based flow steering
  • Zero Memory Copy to GPU
    • Direct NIC ring-buffer access (Batched and Header Data Split) to GPU tensor keeps latency at PCIe transit time
  • YAML-Driven Configuration
    • Optimized and customizable boilerplate network configurations for ease of deployment
  • Flexible Data Movement Backends
    • Linux Sockets, DPDK, and RoCEv2 support for varying application and hardware demands
  • Plug and Play C++ and Python APIs
    • Build a real-time application and interface with other GPU libraries in minutes, not hours

While the underlying data movement relies on low-level networking optimizations, DAQIRI abstracts this complexity for instrument builders. Developers can orchestrate the data acquisition pipeline using accessible C++ and Python APIs, configured via readable YAML files. 

Instead of managing individual network packets or manual memory allocations, DAQIRI automatically batches incoming network packets directly into GPU tensors. This allows developers to focus entirely on writing their custom inference or filtering logic rather than managing network protocols.

The walkthrough below shows how an instrument designer configures a high-speed data stream in small, inspectable pieces, then uses a short C++ loop to receive GPU-ready tensors.

DAQIRI applications start with a configuration file. The config describes the data path before the application runs: which NIC to use, which GPU owns the packet buffers, how packets should be filtered, and how received packet payloads should be assembled for downstream processing.

The assembled output is the handoff point most instrument pipelines want: a batch-shaped tensor already resident on the GPU rather than individual network packets. The reorder stage can also convert payload data while assembling the tensor. For example, a sensor frontend may send compact int4 samples on the wire, while the GPU processing or AI inference stage expects fp16.

DAQIRI can perform that conversion as part of the GPU reorder step, avoiding a separate unpacking pass in the application. These files are easily editable for any hardware configuration, and more examples can be found in the Cocktail Book (code examples).
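
To make the int4-to-fp16 expansion concrete, here is a minimal host-side sketch of the unpacking arithmetic. It is not DAQIRI code: the nibble order (low nibble first), the signed two's-complement interpretation, and all function names are assumptions for illustration, and `float` stands in for fp16 because standard C++ only gained a half-precision type in C++23.

```cpp
#include <cstddef>
#include <cstdint>

// Sign-extend a 4-bit two's-complement value (0..15) to the range -8..7.
constexpr int sign_extend_int4(std::uint8_t nibble) {
    return (nibble & 0x8) ? static_cast<int>(nibble & 0xF) - 16
                          : static_cast<int>(nibble & 0xF);
}

// ASSUMPTION: the low nibble holds the first sample, the high nibble the second.
constexpr int first_sample(std::uint8_t byte) {
    return sign_extend_int4(static_cast<std::uint8_t>(byte & 0x0F));
}
constexpr int second_sample(std::uint8_t byte) {
    return sign_extend_int4(static_cast<std::uint8_t>(byte >> 4));
}

// Unpack n_bytes of packed int4 data into 2 * n_bytes output values.
// float is a stand-in for fp16; a GPU kernel would write half-precision values.
inline void unpack_int4(const std::uint8_t* in, std::size_t n_bytes, float* out) {
    for (std::size_t i = 0; i < n_bytes; ++i) {
        out[2 * i]     = static_cast<float>(first_sample(in[i]));
        out[2 * i + 1] = static_cast<float>(second_sample(in[i]));
    }
}
```

Each input byte yields two samples, and each fp16 sample occupies two bytes, so the output is four times the size of the packed payload. Folding this step into the reorder stage saves the application a separate pass over the data.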

DAQIRI walkthrough

Start with the top-level DAQIRI settings. This establishes the raw streaming path, assigns a CPU core to manage DAQIRI, and keeps logs concise enough for deployment.

%YAML 1.2
---
daqiri:
  cfg:
    version: 1
    stream_type: "raw"        # Use DAQIRI's high-speed DPDK/GPUDirect path.
    master_core: 3            # CPU core used to start and manage DAQIRI.
    log_level: "info"

Next, define the GPU memory regions DAQIRI will use. rx_packets is where raw packet buffers land through GPUDirect, while rx_tensor is the completed, reordered tensor consumed by the GPU workload. The buffer sizes also make the int4-to-fp16 expansion explicit.

  memory_regions:
  - name: "rx_packets"
    kind: "device"            # Raw packets land directly in GPU memory.
    affinity: 0               # Use GPU 0.
    num_bufs: 16384
    buf_size: 8192            # Headers + sequence number + 8000B payload.
  - name: "rx_tensor"
    kind: "device"            # Reordered output also stays on the GPU.
    affinity: 0
    num_bufs: 128             # Number of completed tensors DAQIRI can queue.
    buf_size: 32768000        # 1024 packets * 8000 payload bytes * 4 (each int4 byte holds 2 samples, each fp16 sample is 2 bytes).
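
The buffer sizes above can be checked with a short compile-time calculation. This is a sketch using only the numbers from the config; the helper names are invented for the example and are not part of DAQIRI.

```cpp
#include <cstdint>

// Bytes needed for one reordered tensor when every input sample of
// `in_bits` bits is widened to `out_bits` bits during the reorder step.
constexpr std::uint64_t tensor_bytes(std::uint64_t packets_per_batch,
                                     std::uint64_t payload_bytes,
                                     std::uint64_t in_bits,
                                     std::uint64_t out_bits) {
    return packets_per_batch * payload_bytes * 8 / in_bits * out_bits / 8;
}

// Total bytes reserved by a memory region of num_bufs buffers.
constexpr std::uint64_t region_bytes(std::uint64_t num_bufs, std::uint64_t buf_size) {
    return num_bufs * buf_size;
}

// One tensor: 1024 packets of 8000 int4 bytes expanded to fp16 = 32,768,000 bytes.
constexpr std::uint64_t kTensorBytes = tensor_bytes(1024, 8000, 4, 16);

// rx_packets region: 16384 * 8192 = 134,217,728 bytes (128 MiB).
constexpr std::uint64_t kRxPacketsTotal = region_bytes(16384, 8192);

// rx_tensor region: 128 * 32,768,000 = 4,194,304,000 bytes (about 3.9 GiB).
constexpr std::uint64_t kRxTensorTotal = region_bytes(128, kTensorBytes);
```

With these settings the two regions together reserve a little over 4 GiB of GPU memory, so `num_bufs` for the tensor region is the first value to revisit on a memory-constrained device.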

Then bind the configuration to the physical receive interface. This names the NIC by PCIe address, enables flow isolation, and assigns a polling queue. The queue batches 1024 packets at a time but can flush after 2 ms so downstream processing is not waiting indefinitely.

  interfaces:
  - name: "rx0"
    address: "0000:00:00.0"   # RX NIC PCIe address from lspci.
    rx:
      flow_isolation: true    # Only packets matching the flow below are accepted.
      queues:
      - name: "q0"
        id: 0
        cpu_core: 9           # CPU core polling the NIC queue.
        batch_size: 1024      # Build tensors from 1024 packets.
        timeout_us: 2000      # Flush a partial batch after 2 ms.
        memory_regions: ["rx_packets"]
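
The relationship between `batch_size` and `timeout_us` depends on how fast packets arrive. The sketch below estimates how long a 1024-packet batch takes to fill. It assumes a single flow saturating the link, counts 68 bytes of headers and sequence number plus the 8000-byte payload per packet, and ignores Ethernet preamble, frame check sequence, and inter-frame gap. The link speeds and helper names are illustrative assumptions.

```cpp
// Time in microseconds for `packets` packets of `bytes_per_packet` bytes
// to arrive on a link running at `link_gbps` gigabits per second.
constexpr double batch_fill_us(double packets, double bytes_per_packet, double link_gbps) {
    return packets * bytes_per_packet * 8.0 / (link_gbps * 1.0e3);
}

// True when the timeout expires before a full batch has arrived.
constexpr bool timeout_fires_first(double fill_us, double timeout_us) {
    return fill_us > timeout_us;
}

constexpr double kPacketBytes  = 68.0 + 8000.0;  // headers + sequence number + payload
constexpr double kBatchPackets = 1024.0;         // batch_size from the config
constexpr double kTimeoutUs    = 2000.0;         // timeout_us from the config

// ASSUMED link speeds, for illustration only.
constexpr double kFillAt100G = batch_fill_us(kBatchPackets, kPacketBytes, 100.0);  // ~661 us
constexpr double kFillAt25G  = batch_fill_us(kBatchPackets, kPacketBytes, 25.0);   // ~2644 us
```

At a saturated 100 Gbps a batch fills in roughly a third of the 2 ms timeout, so full batches are the norm. At 25 Gbps the timeout would expire first and partial batches would be flushed, which suggests sizing the two settings together for the expected data rate.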

The flow rule narrows the accepted traffic to the expected UDP stream and sends matching packets into queue 0. That keeps the rest of the instrument or network traffic out of this receive path.

  flows:
    - name: "data_flow"
      id: 100                 # This flow is attached to the reorder rule.
      action: {type: queue, id: 0}
      match: {udp_src: 4096, udp_dst: 4096}
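
Conceptually, the flow rule is a predicate on each packet's UDP ports. The following sketch models that decision in software purely for illustration. In the configuration above the matching is a NIC-level feature, and the types and names here are hypothetical, not DAQIRI's API.

```cpp
#include <cstdint>

// UDP source and destination ports of a packet (or of a match rule).
struct UdpPorts {
    std::uint16_t src;
    std::uint16_t dst;
};

// Software model of one flow entry from the YAML.
struct FlowRule {
    int flow_id;
    int queue_id;
    UdpPorts match;
};

constexpr int kNotDelivered = -1;

// With flow isolation enabled, only packets matching the rule reach the queue.
constexpr int steer(const FlowRule& rule, UdpPorts pkt) {
    return (pkt.src == rule.match.src && pkt.dst == rule.match.dst)
               ? rule.queue_id
               : kNotDelivered;
}

// Mirrors the "data_flow" entry: id 100, queue 0, UDP ports 4096 -> 4096.
constexpr FlowRule kDataFlow{100, 0, {4096, 4096}};
```

A packet from port 4096 to port 4096 lands in queue 0. Anything else never reaches this receive path, which keeps the polling core and the GPU buffers dedicated to instrument data.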

Finally, the reorder configuration turns packet payloads into the tensor shape the application actually wants. It skips packet headers, uses sequence numbers to restore ordering, batches 1024 packets, and converts compact int4 payloads into fp16 values as part of the GPU reorder step. 

  reorder_configs:
    - name: "packets_to_tensor"
      reorder_type: "gpu"     # Reorder and pack the batch on the GPU.
      memory_region: "rx_tensor"
      payload_byte_offset: 68 # Skip Ethernet/IP/UDP headers and sequence number.
      flow_ids: [100]
      data_types:
        input_type: "int4"    # Payload format on the wire.
        output_type: "fp16"   # Tensor format for the model/kernel.
        endianness: "host"
      method:
        seq_packets_per_batch:
          sequence_number:
            bit_offset: 512   # Sequence number starts at byte 64.
            bit_width: 32
      packets_per_batch: 1024
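
The offsets in this config fit together as follows. The sketch is a host-side model of where a packet's payload could land in the output tensor. It assumes little-endian sequence numbers and a simple modulo mapping from sequence number to batch slot. Both are illustrative assumptions, and the GPU reorder step in DAQIRI may work differently.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSeqByteOffset   = 512 / 8;  // bit_offset: 512 -> byte 64
constexpr std::size_t kSeqBytes        = 32 / 8;   // bit_width: 32  -> 4 bytes
constexpr std::size_t kPayloadOffset   = 68;       // payload_byte_offset
constexpr std::size_t kPayloadBytes    = 8000;     // payload size per packet
constexpr std::size_t kPacketsPerBatch = 1024;     // packets_per_batch
constexpr std::size_t kExpansion       = 4;        // int4 -> fp16 grows data 4x

// The payload starts right after the sequence number.
static_assert(kSeqByteOffset + kSeqBytes == kPayloadOffset, "offsets are consistent");

// Read a 32-bit value from 4 bytes. ASSUMPTION: little-endian byte order.
constexpr std::uint32_t read_u32_le(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// ASSUMPTION: a packet's slot in the batch is its sequence number modulo the batch size.
constexpr std::size_t slot_in_batch(std::uint32_t seq) {
    return seq % kPacketsPerBatch;
}

// Byte offset of this packet's converted payload inside the output tensor.
constexpr std::size_t tensor_offset_bytes(std::uint32_t seq) {
    return slot_in_batch(seq) * kPayloadBytes * kExpansion;
}

// Example: the bytes 01 04 00 00 encode sequence number 1025 in little-endian.
constexpr std::uint8_t kExampleSeqBytes[4] = {0x01, 0x04, 0x00, 0x00};
```

For a real packet, the sequence number would be read with `read_u32_le(packet + kSeqByteOffset)`. Packet 1025 maps to slot 1, so its expanded payload begins 32,000 bytes into the tensor regardless of the order in which packets arrived.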

With those details in YAML, the application code becomes intentionally small. First, initialize DAQIRI from the configuration file and declare the burst handle that will receive completed packet batches.

// Initialize DAQIRI with a config file.
daqiri::daqiri_init("rx_reorder.yaml");

daqiri::BurstParams* burst = nullptr;

Then ask DAQIRI for the next completed burst. By the time the application receives the pointer, DAQIRI has already reordered the packets into contiguous GPU memory, so the code can hand the tensor directly to a model or CUDA kernel.

// Receive packets; DAQIRI reorders them into contiguous GPU memory.
daqiri::get_rx_burst(&burst);

// The reordered batch is now a GPU tensor ready for processing/inference.
auto* tensor = daqiri::get_packet_ptr(burst, 0);
run_model_or_kernel(tensor);

Once the GPU job is done, return the buffers to DAQIRI so they can be reused for the next burst.
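
To see why returning buffers matters, consider a toy model of the `rx_tensor` pool, which the config above sizes at 128 buffers. The function below is a simulation written for this explanation. It calls no DAQIRI functions, and the real library's behavior when its pool is exhausted may differ.

```cpp
#include <cstddef>

// Number of bursts successfully received out of `bursts_wanted` when the
// output pool holds `pool_size` buffers. If `release_after_use` is false,
// buffers are never handed back and the pool eventually runs dry.
constexpr std::size_t bursts_received(std::size_t bursts_wanted,
                                      std::size_t pool_size,
                                      bool release_after_use) {
    std::size_t free_bufs = pool_size;
    std::size_t received = 0;
    for (std::size_t i = 0; i < bursts_wanted; ++i) {
        if (free_bufs == 0) {
            break;  // no free buffer for the next tensor
        }
        --free_bufs;  // receiving a burst claims one buffer
        ++received;
        // ... GPU work on the tensor would happen here ...
        if (release_after_use) {
            ++free_bufs;  // hand the buffer back for reuse
        }
    }
    return received;
}
```

With release enabled, 1000 requested bursts all succeed. Without it, the model stops after 128 bursts, which at the batch rates discussed earlier is a fraction of a second of data.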

Bring DAQIRI to your sensor or detector

By shifting from an inflexible, hardware-centric collection paradigm to a software-defined, AI-enabled architecture, DAQIRI removes the traditional bottlenecks of scientific data acquisition. Developers can now process data in stream, run real-time AI inference at the edge, and ensure that only high-quality, AI-ready data is sent to HPC facilities for deeper analysis.

Start integrating real-time processing into your streaming workflows with DAQIRI today!

Explore the DAQIRI GitHub repository

Read the DAQIRI Getting Started Docs for tutorials and documentation

Visit the DAQIRI Landing Page on GitHub for Benchmarks and Examples

Acknowledgments

We’d like to thank David Miller and Ioannis Xiotidis from CERN for their work and collaboration on using DAQIRI in the A-GHOST data acquisition pipeline, as well as for their technical review of this blog post. We also acknowledge Alexis Girault, whose ANO documentation helped inform this effort. Special thanks to NVIDIA contributors Cliff Burdick, Chloe Crozier, Jay Carlson, Mahdi Azizian, Julien Jomier, and the broader Holoscan team for their expertise, guidance, and support.
