Modern data centers can run thousands of services and applications. When an issue occurs, as a network administrator, you are guilty by default. You have to prove your innocence on a daily basis, as it is easy to blame the network. It is an unfair world.
Correlating application performance issues with the network is hard. You can start by checking basic connectivity with simple pings or traceroutes, reviewing your SNMP-based monitoring tools, running sniffers, or even reading device counters to look for drops. In the meantime, users suffer from application slowness, poor performance, or even unavailability.
Unfortunately, all these classic network troubleshooting methods are time-consuming and don’t guarantee success, as it is sometimes nearly impossible to pinpoint problems using them.
NetQ to the rescue
To facilitate network troubleshooting, NVIDIA developed NetQ, a scalable, modern network operations toolset that provides network visibility in real time.
The NetQ team recently introduced a flow analysis tool that enhances this visibility further. Flow analysis allows network administrators to instantly correlate service traffic flows to the paths taken in the fabric, dramatically reducing the mean time to innocence (MTTI) or even confirming that there is no network issue at all.
Flow analysis enables you to discover and visualize all paths that a specific application’s traffic flow takes between endpoints in the fabric. It monitors the fabric-wide latency and buffer utilization statistics. With EVPN and multi-tenancy becoming the standard solution in most modern data centers, the flow analysis tool was designed to sample TCP or UDP data on overlay and underlay networks within different VRFs.
Flow analysis becomes even more powerful when used with What Just Happened (WJH) ASIC telemetry. While flows are being analyzed, flow-related WJH events from all switches in traffic paths are presented to help you discover if there were drops that caused the service issue. These two features working together maximize the probability of pinpointing the actual problem affecting an application.
Flow analysis is supported on NVIDIA Spectrum 2 and later switches running Cumulus Linux 5.0 or later. It can also provide partial-path discovery for brownfield deployments with unsupported switches or switches running older versions of Cumulus Linux or SONiC.
Flow analysis samples traffic based on the packet’s 4-tuple or 5-tuple, including VXLAN inner and outer headers. The sampling lifetime can be set to 10, 15, 20, or 30 minutes. You can run the analysis immediately on creation or schedule it for a later time.
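To make the tuple terminology concrete, here is a minimal Python sketch of how a flow selector could be modeled. It is not NetQ code: `FlowKey`, `matches`, and `validate_lifetime` are hypothetical names introduced for illustration. It also assumes that the shorter tuple form leaves the source port unconstrained, which the text above does not spell out.

```python
from dataclasses import dataclass
from typing import Optional

# Sampling lifetimes, in minutes, as listed in the article.
ALLOWED_LIFETIMES_MIN = (10, 15, 20, 30)


@dataclass(frozen=True)
class FlowKey:
    """Hypothetical flow selector for illustration; not a NetQ type."""
    src_ip: str
    dst_ip: str
    dst_port: int
    protocol: str                   # "tcp" or "udp"
    src_port: Optional[int] = None  # Assumption: None means any source port

    def matches(self, packet: dict) -> bool:
        """Return True if a packet's header fields fall within this flow."""
        if (packet["src_ip"], packet["dst_ip"]) != (self.src_ip, self.dst_ip):
            return False
        if packet["dst_port"] != self.dst_port:
            return False
        if packet["protocol"] != self.protocol:
            return False
        # Only a key with a source port set constrains the source port.
        return self.src_port is None or packet["src_port"] == self.src_port


def validate_lifetime(minutes: int) -> int:
    """Accept only the sampling lifetimes the article lists."""
    if minutes not in ALLOWED_LIFETIMES_MIN:
        raise ValueError(f"lifetime must be one of {ALLOWED_LIFETIMES_MIN}")
    return minutes


if __name__ == "__main__":
    packet = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
              "src_port": 51000, "dst_port": 443, "protocol": "tcp"}
    any_source_port = FlowKey("10.0.0.1", "10.0.0.2", 443, "tcp")
    fixed_source_port = FlowKey("10.0.0.1", "10.0.0.2", 443, "tcp", src_port=40000)
    print(any_source_port.matches(packet), fixed_source_port.matches(packet))
```

For VXLAN traffic, the same kind of match would apply to the inner or the outer header fields, depending on whether you are following the overlay or the underlay.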
The sampling rate is also configurable: low (1 per 10,000 packets), medium (1 per 1,000), high (1 per 100), or all packets (1 per 1). The higher the sampling rate, the more accurate your analyzed data. However, a higher sampling rate also results in higher CPU utilization, so I recommend setting lower sampling rates for heavy traffic flows.
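The trade-off between accuracy and CPU load comes down to simple arithmetic. The following Python sketch estimates how many packets each setting would sample and picks the most detailed setting that stays under a budget. The ratios come from the text above; the function names and the idea of a "samples per second" budget are assumptions for illustration, not NetQ parameters.

```python
# Sampling ratios from the article: one sampled packet per N packets.
SAMPLING_RATES = {"low": 10_000, "medium": 1_000, "high": 100, "all": 1}


def expected_samples(packets_per_second: float, duration_minutes: int, rate: str) -> float:
    """Estimate how many packets are sampled over the analysis lifetime."""
    total_packets = packets_per_second * duration_minutes * 60
    return total_packets / SAMPLING_RATES[rate]


def pick_rate(packets_per_second: float, max_samples_per_second: float) -> str:
    """Pick the most detailed rate that stays within a hypothetical budget.

    Falls back to "low" when even the lowest rate exceeds the budget.
    """
    for rate in ("all", "high", "medium", "low"):
        if packets_per_second / SAMPLING_RATES[rate] <= max_samples_per_second:
            return rate
    return "low"


if __name__ == "__main__":
    # A heavy flow of 100,000 packets per second analyzed for 10 minutes.
    for rate in SAMPLING_RATES:
        print(rate, expected_samples(100_000, 10, rate))
    print(pick_rate(100_000, max_samples_per_second=500))
```

For a heavy flow, the gap between settings spans four orders of magnitude, which is why a lower rate is the safer default there.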
NVIDIA Air is a tool for creating data center digital twins. With Air, you can build your own Cumulus Linux virtual data center, test it, validate it with NetQ, explore features, and learn some best practices. It is entirely free to use!
Try out flow analysis by spinning up the prebuilt NVIDIA Air Infrastructure Simulation Platform demo in the Air Marketplace. Follow the guided tour and see the significant benefits that flow analysis with NetQ can bring to your organization.
For more information, see the following resources:
- NVIDIA NetQ User Guide
- NVIDIA Air User Guide
- NVIDIA Cumulus Linux User Guide
- Analyzing Fabric-wide Network Latency With NetQ 4.1.0
- Automate Network Monitoring and Reduce Downtime with the Latest Release of NVIDIA NetQ
- Close Knowledge Gaps and Elevate Training with Digital Twin NVIDIA Air
- Troubleshooting Networks with NetQ