NvSciStream Performance Test Application

NvSciStream provides a test application to measure KPIs when streaming buffers between a CPU producer and CPU consumers. The test focuses on NvSciStream performance and does not use CUDA, NvMedia, or other hardware engines. To simplify measuring the packet-delivery latency for each payload, the stream uses FIFO mode.

This test application is intended for performance testing only. It may simplify some setup steps and attach synchronization objects or fences that the CPU endpoints do not strictly need, so that fence transport latency is included in the measurement. To see how to create a stream with the NvSciStream API, refer to the NvSciStream Sample Application.
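
For orientation, the core of the single-process FIFO stream reduces to a handful of block-creation and connect calls. The following is only a minimal sketch using the public NvSciStream C API; it omits the event handling, packet/element setup, and sync-object setup that the sample and test applications perform, and the createFifoStream helper name is illustrative.

    #include <stdint.h>
    #include <nvscistream.h>

    /* Minimal sketch: create the pool, producer, FIFO queue, and consumer
     * blocks for a single-process unicast stream, then connect the producer
     * to the consumer. Event servicing and buffer/sync setup are omitted. */
    static NvSciError createFifoStream(uint32_t numPackets,
                                       NvSciStreamBlock *producer,
                                       NvSciStreamBlock *consumer)
    {
        NvSciStreamBlock pool, queue;
        NvSciError err;

        err = NvSciStreamStaticPoolCreate(numPackets, &pool);
        if (err != NvSciError_Success) return err;

        err = NvSciStreamProducerCreate(pool, producer);
        if (err != NvSciError_Success) return err;

        /* FIFO mode delivers every payload, which keeps the per-payload
         * latency measurement simple. */
        err = NvSciStreamFifoQueueCreate(&queue);
        if (err != NvSciError_Success) return err;

        err = NvSciStreamConsumerCreate(queue, consumer);
        if (err != NvSciError_Success) return err;

        return NvSciStreamBlockConnect(*producer, *consumer);
    }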

This test uses the NvPlayFair library (see Benchmarking Library) to record timestamps, set the rate limit, save raw latency data, and calculate the latency statistics (such as the min, max, and mean value) on different platforms and operating systems.
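
The reported statistics are simple aggregates over the raw per-payload samples. The snippet below is only an illustration of what those figures amount to; it is plain C, not the NvPlayFair API, and the library's exact percentile method may differ.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int cmpDouble(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    /* Illustration only: min, max, mean, and 99.99th percentile of raw
     * latency samples (in microseconds). */
    static void reportLatency(double *samplesUs, size_t n)
    {
        double sum = 0.0;
        qsort(samplesUs, n, sizeof(double), cmpDouble);
        for (size_t i = 0; i < n; i++) {
            sum += samplesUs[i];
        }
        size_t idx9999 = (size_t)ceil(0.9999 * (double)n) - 1;
        printf("min %.2f us, max %.2f us, mean %.2f us, 99.99%% %.2f us\n",
               samplesUs[0], samplesUs[n - 1], sum / (double)n,
               samplesUs[idx9999]);
    }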

The test app supports a variety of test cases:

  • Single-process, inter-process and inter-chip streaming
  • Unicast and multicast streaming

The test can set different stream configurations:

  • Number of packets allocated in the pool.
  • Number of payloads transmitted between the producer and consumers.
  • Buffer size for each element.
  • Number of synchronization objects used by each endpoint.
  • Frame rate: the frequency at which the producer presents payloads.
  • Memory type: sysmem or vidmem.

The test measures several performance KPIs:

  • Latency for each process:
    • Total initialization time
    • Stream setup time
    • Streaming time
  • Latency for each payload:
    • Duration to wait for an available or ready packet
    • End-to-end packet-delivery latency (see the sketch below)
  • PCIe bandwidth for inter-chip streams

The README file in the test folder explains these KPIs in more detail.
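
As an illustration of the end-to-end packet-delivery KPI: the producer records a timestamp when it presents a packet, and the consumer subtracts that timestamp from the time at which it acquires the same packet. The sketch below uses clock_gettime directly; the actual test records and aggregates these samples with NvPlayFair.

    #include <stdint.h>
    #include <time.h>

    /* Illustration of the end-to-end packet-delivery measurement. The real
     * test records these timestamps with NvPlayFair rather than by hand. */
    static uint64_t nowNs(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    /* Producer side: take the timestamp just before presenting the packet
     * (for example, before NvSciStreamProducerPacketPresent()). */
    static uint64_t stampPresent(void)
    {
        return nowNs();
    }

    /* Consumer side: packet-delivery latency in microseconds, computed once
     * the packet has been acquired (for example, after
     * NvSciStreamConsumerPacketAcquire() returns). */
    static double deliveryLatencyUs(uint64_t presentTimeNs)
    {
        return (double)(nowNs() - presentTimeNs) / 1000.0;
    }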

Prerequisites

NvSciIpc 

Where inter-process streaming is used, the performance test application streams packets between a producer process and a consumer process via inter-process communication (NvSciIpc) channels.

The NvSciIpc channels are configured via the device tree (DT) on QNX and via a plain text file, /etc/nvsciipc.cfg, on Linux. For more information on NvSciIpc configuration data, see the NvSciIpc Configuration Data chapter. The NvSciIpc channels used by the performance test application are as follows: 
INTER_PROCESS	nvscistream_0	nvscistream_1    16	24576  
INTER_PROCESS	nvscistream_2	nvscistream_3    16	24576  
INTER_PROCESS	nvscistream_4	nvscistream_5    16	24576  
INTER_PROCESS	nvscistream_6	nvscistream_7    16	24576  
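
Each entry defines a pair of endpoints: the producer process opens one and the consumer process opens the other. The following is a rough sketch of how an endpoint might be opened and wrapped in a stream block, assuming the public NvSciIpc and NvSciStream C APIs; the openIpcSrc helper and the pre-created syncModule/bufModule arguments are illustrative, and the real test performs additional endpoint setup and error handling.

    #include <nvscibuf.h>
    #include <nvsciipc.h>
    #include <nvscistream.h>
    #include <nvscisync.h>

    /* Sketch: the producer process opens its side of an inter-process
     * channel (here nvscistream_0) and wraps it in an IpcSrc block that
     * carries the stream to the other process. The consumer process does
     * the same with the paired endpoint (nvscistream_1) and
     * NvSciStreamIpcDstCreate(). */
    static NvSciError openIpcSrc(NvSciSyncModule syncModule,
                                 NvSciBufModule bufModule,
                                 NvSciStreamBlock *ipcSrc)
    {
        NvSciIpcEndpoint endpoint;
        NvSciError err;

        err = NvSciIpcInit();
        if (err != NvSciError_Success) return err;

        err = NvSciIpcOpenEndpoint("nvscistream_0", &endpoint);
        if (err != NvSciError_Success) return err;

        return NvSciStreamIpcSrcCreate(endpoint, syncModule, bufModule, ipcSrc);
    }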

Where inter-chip streaming is used, the performance test application streams packets between different chips via NvSciIpc (INTER_CHIP, PCIe) channels. For more information, see Chip to Chip Communication.

NvPlayFair

This performance test application uses the performance utility functions in the NvPlayFair library.

Building the NvSciStream Performance Test Application

The NvSciStream performance test includes source code, a README file, and a Makefile.

On the host system, navigate to the test directory:

cd <top>/drive-linux/samples/nvsci/nvscistream/perf_tests/

Build the performance test application:
make clean
make

Running the NvSciStream Performance Test Application

The test application supports the following command-line options:

-h
    Prints the supported test options.

-n <count>
    Specifies the number of consumers. Set in the producer process.
    Default: 1.

-k <count>
    Specifies the number of packets in the pool. Set in the producer
    process for the primary pool and in the consumer process for the
    C2C pool.
    Default: 1.

-f <count>
    Specifies the number of payloads. Set in all processes.
    Default: 100.

-b <size>
    Specifies the buffer size (MB) per packet.
    Default: 1.

-s <count>
    Specifies the number of sync objects per client. Set by each process.
    Default: 1.

-r <count>
    Specifies the producer frame-present rate (fps).

-t <0|1>
    Specifies the memory type: 0 for sysmem, 1 for vidmem. Pass the dGPU
    UUID with -u when using vidmem.
    Default: 0.

-u <uuid>
    Specifies the dGPU UUID. Required for vidmem buffers. The UUID can be
    retrieved with the 'nvidia-smi -L' command on x86.

-l
    Measures latency. Skip this option if vidmem is used. Set in all
    processes.
    Default: False.

-v
    Saves the raw latency data in a CSV file. Ignored if latency is not
    measured.
    Default: False.

-a <target>
    Specifies the average KPI target (us) for packet-delivery latency.
    The test result is compared against this target with a 5% tolerance.
    Ignored if latency is not measured.

-m <target>
    Specifies the 99.99th-percentile KPI target (us) for packet-delivery
    latency. The test result is compared against this target with a 5%
    tolerance. Ignored if latency is not measured.

For inter-process operation:

-p
    Inter-process producer.

-c <index>
    Inter-process consumer with the given index.

For inter-chip operation:

-P <index> <IPC endpoint>
    Inter-SoC producer; the NvSciIpc endpoint name connected to the
    indexed consumer.

-C <index> <IPC endpoint>
    Inter-SoC consumer; the NvSciIpc endpoint used by this indexed
    consumer.

Copy the test application to the target filesystem:

cp <top>/drive-linux/samples/nvsci/nvscistream/perf_tests/test_nvscistream_perf \
   <top>/drive-linux/targetfs/home/nvidia/

Following are examples of running the performance test application with different configurations:

  • Measure latency for single-process unicast stream with default setup:

    ./test_nvscistream_perf -l

  • Measure latency for single-process unicast stream with three packets in pool:

    ./test_nvscistream_perf -l -k 3

  • Measure latency for single-process multicast stream with two consumers:

    ./test_nvscistream_perf -n 2 -l

  • Measure latency for inter-process unicast stream with default setup:

    ./test_nvscistream_perf -p -l &

    ./test_nvscistream_perf -c 0 -l

  • Measure latency for inter-process unicast stream at a fixed producer-present rate of 100 fps, transmitting 10,000 payloads:

    ./test_nvscistream_perf -p -f 10000 -l -r 100 &

    ./test_nvscistream_perf -c 0 -f 10000 -l

  • Measure latency and save the raw latency data in an nvscistream_*.csv file for an inter-process unicast stream that transmits 10 payloads:

    ./test_nvscistream_perf -p -f 10 -l -v &

    ./test_nvscistream_perf -c 0 -f 10 -l -v

  • Measure PCIe bandwidth for an inter-chip unicast stream with a 12.5 MB buffer size per packet, transmitting 10,000 frames. The two commands run on different SoCs connected by the <pcie_s0_1> and <pcie_s1_1> PCIe channel endpoints:

    On chip s0:

    ./test_nvscistream_perf -P 0 pcie_s0_1 -l -b 12.5 -f 10000

    On chip s1:

    ./test_nvscistream_perf -C 0 pcie_s1_1 -l -b 12.5 -f 10000
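
As a rough way to interpret the reported PCIe bandwidth figure: the payload data moved across PCIe is approximately the per-packet buffer size times the number of payloads, so dividing that by the streaming time gives an approximate bandwidth. The snippet below is only a back-of-envelope sanity check with an assumed streaming time, not the test's own calculation.

    #include <stdio.h>

    /* Back-of-envelope estimate (not the test's own calculation):
     * approximate PCIe bandwidth from the -b and -f arguments and the
     * streaming time reported by the test. */
    static double approxBandwidthGBps(double bufMB, unsigned frames,
                                      double streamingSeconds)
    {
        double totalGB = bufMB * (double)frames / 1024.0;
        return totalGB / streamingSeconds;
    }

    int main(void)
    {
        /* e.g. -b 12.5 -f 10000 streamed in an assumed 10 s: ~12.2 GB/s */
        printf("%.2f GB/s\n", approxBandwidthGBps(12.5, 10000, 10.0));
        return 0;
    }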

Note:

The test_nvscistream_perf application must be run as the root user (with sudo).

For the inter-process use case:

If the application fails to open the IPC channel, cleaning up the NvSciIpc resources may help:
sudo rm -rf /dev/mqueue/*
sudo rm -rf /dev/shm/*
Note: For inter-chip use cases:

Inter-chip streaming is supported only on Linux.

Ensure that different SoCs are configured with different SoC IDs. For Tegra-x86 use cases, set a non-zero SoC ID on the Tegra side, because x86 uses 0 as its SoC ID. For more information, refer to the "Bind Options" section in AV PCT Configuration.