This document gives network administrators and capacity planning engineers background on understanding and using the bandwidth data collected by Netreo.
Background on Digital Transmission
To understand the full implications of the bandwidth data, it is useful to have some background on how data communication over digital lines works. This is necessarily a simplified explanation. To convey the basic concepts (and to simplify the math required), we’ll assume a perfect transmission line with no latency, overhead, setup, errors, or contention.
Digital circuits operate in a binary fashion, which is to say that at any given moment of time, they are either on or off. The speed of a digital circuit is controlled by a number of factors, but a simple way to think of it is to consider the line to have a clock speed similar to the way that a CPU would. For example, a 10-megabit Ethernet line could be said to have a clock of 10 million bits per second. Effectively, this means that at a small enough time scale (in this example, 1/10,000,000th of a second), you can see that the line is always either 100% utilized (on), or 0% utilized (off).
When a device attached to this line has data to send, it sends it all as fast as the line can take it, and then stops. The way most applications are written and used is that they will request a burst of data, then the user must interact with it—during which time activity may be minimal—then the cycle repeats. This is why network traffic is sometimes called “bursty.”
Since computers typically have many small amounts of data to send, and only a few large ones, the line alternates between short busy periods and idle time, and an average utilization can be calculated for any monitoring interval. Effectively, what you are calculating is the percentage of time the line was 100% utilized during that interval.
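As a minimal sketch of this idea, under the same perfect-line assumption used above: if the line is either fully busy or idle, the utilization for an interval is simply the busy time divided by the interval length. The function name and the burst durations below are hypothetical and chosen only for illustration.

```python
def utilization_percent(busy_seconds, interval_seconds):
    """Percentage of the interval during which the line was transmitting."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 100.0 * sum(busy_seconds) / interval_seconds


# Hypothetical bursts (seconds spent transmitting) within a 300-second interval.
bursts = [0.5, 2.0, 12.5, 15.0]

# 30 seconds busy out of 300 seconds.
print(utilization_percent(bursts, 300))
```

The line was at 100% during every burst, yet the interval as a whole reports a much lower figure, which is why the averaged value depends so heavily on the length of the interval.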
How is the data collected?
Netreo uses SNMP to collect bandwidth statistics. (For some managed devices, Netreo may also use other protocols such as WBEM, WMI, and SOAP for this data collection; however, the basic concepts are the same.) Devices typically store this data in the form of a counter, which is like the odometer in your car, showing how much data has been sent or received over the interface. Netreo reads this counter at regular intervals, then divides the difference between consecutive readings by the elapsed collection time to compute an average rate per second for that time period. This rate is then divided by the theoretical maximum rate to arrive at a utilization value for the collection period. This is the value Netreo reports in the bandwidth graphs.
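The calculation described above can be sketched as follows. This is an illustration, not Netreo's actual implementation: the function name and sample readings are hypothetical, the counter is assumed to count octets (bytes), as SNMP interface counters commonly do, and it is assumed to have rolled over at most once between readings.

```python
def utilization_from_counters(prev_count, curr_count, elapsed_seconds,
                              line_speed_bps, counter_bits=64):
    """Average utilization (%) between two counter readings.

    Assumptions: the counter reports octets, and it wrapped at most
    once between the two readings.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_count - prev_count
    if delta < 0:
        # The counter rolled over, like an odometer passing its maximum.
        delta += 2 ** counter_bits
    bits_per_second = delta * 8 / elapsed_seconds
    return 100.0 * bits_per_second / line_speed_bps


# Hypothetical readings taken 300 seconds apart on a 10 Mbit/s line.
print(utilization_from_counters(1_000_000, 38_500_000, 300, 10_000_000))
```

Here 37,500,000 octets in 300 seconds works out to 1 Mbit/s, or 10% of the line speed. Nothing in the two readings says whether that traffic arrived evenly or in a single burst.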
Since the device typically isn’t keeping track of the peak rate achieved during this period, the calculations are limited to the precision defined by the collection interval. To get more precise data, you must collect it more often. However, the tradeoff when doing so is that the more often you read the odometer, the more traffic is generated on the network to request, authenticate and collect the data. The device being monitored must also respond to the request—using CPU and memory. This means that the data collection interval must be balanced against the impact on network performance.
By default, Netreo uses a data collection interval of 5 minutes for this type of data, which provides a good balance between visibility and performance. Beginning in version 7.3, there is also a bandwidth microscope mode that can be used to zoom in on a particular interface and collect data at 5-second intervals. This gives a network engineer a very detailed view of performance on a specific interface for troubleshooting purposes. Since this can generate excessive amounts of network traffic and load on the monitored devices, Netreo does not recommend using the tool continuously for long periods, or on many devices at once.
In order to make bandwidth data useful, once it is collected it is converted into several types of statistical data: peak, average and percentile.
Peak vs. Average
Peak utilization is a misleading term. As has been illustrated, the true peak value on any physical digital line is always 100% (when viewed at a small enough time interval). Peak utilization, as Netreo records it, is therefore a product of data consolidation. Once Netreo has calculated the average utilization for each collection interval, it stores that data for a period of time. Later, it “rolls up” that data into an “average of averages” in order to retain more history. During this “roll up” process, Netreo also retains (separately) the highest average value from each roll-up interval, and records this as the peak value. Effectively, peak utilization represents the highest calculated average seen over the time of the graph, rather than the actual peak rate, which is always 100%.
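The roll-up process can be sketched roughly as below. Netreo's actual consolidation schedule and storage format are not described here, so the function name, the group size, and the sample values are all hypothetical.

```python
def roll_up(averages, group_size):
    """Consolidate fine-grained averages into (average, peak) pairs.

    Each pair summarizes one group: the "average of averages" and the
    highest individual average, which is what gets recorded as the peak.
    """
    consolidated = []
    for start in range(0, len(averages), group_size):
        group = averages[start:start + group_size]
        consolidated.append((sum(group) / len(group), max(group)))
    return consolidated


# Hypothetical utilization averages (%), one per collection interval.
samples = [10.0, 20.0, 60.0, 30.0, 5.0, 15.0, 40.0, 20.0]

for average, peak in roll_up(samples, 4):
    print(f"average={average:.1f}%  peak={peak:.1f}%")
```

After consolidation only the two summary values per group remain, so the recorded peak is still an average over one original collection interval, not an instantaneous rate.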
To understand this better, here is an example of data collected for a specific interface over a specific period of time:
| Sample | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Calculated Average Rate | N/A | 33.333 | 40 | 53.333 | 153.333 | 60 |
For this example, the average of the five calculated rates is 68 (the first of the 6 samples has no rate, because there is no earlier reading to compare it against), and the peak rate for this interval is 153.333. When Netreo later rolls this data up for long-term storage, these are the values that will be retained. The individual sample values will then be dropped.
If we zoom into the peak in the last table by collecting the data more often, we’d probably see something like this:
| Sample | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Calculated Average Rate | 11.666 | 47.666 | 174 | 36.666 | 496.666 |
As you can see, the finer granularity lets us see a shorter-term peak that we would not see otherwise. However, it also results in 5 times as much network monitoring traffic, so data collection at these intervals should be considered carefully. As indicated earlier, if we were to look at a small enough granularity, the values returned would always be either 0% or 100% of the line speed. In modern networks, transient spikes like this are not generally a problem, as network equipment is designed with many queuing and prioritization features to reduce the impact of short transients. Prioritization, Quality of Service (QoS), and congestion-avoidance techniques such as Random Early Detection (RED, also known as Random Early Discard) are designed to prevent any one user from consuming all the bandwidth and shutting out other users and applications.
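A quick check of the two tables illustrates the effect of granularity. The sketch below assumes the five values in the second table cover equal-length sub-intervals of the single 153.333 interval from the first table. That is an assumption, though the arithmetic is consistent with it.

```python
# Values from the second (zoomed-in) table.
fine_rates = [11.666, 47.666, 174, 36.666, 496.666]

# Averaging the fine-grained values reproduces the single coarse value.
coarse_average = sum(fine_rates) / len(fine_rates)

# The highest fine-grained value is only visible at the shorter interval.
fine_peak = max(fine_rates)

print(f"average over the whole interval: {coarse_average:.3f}")
print(f"highest fine-grained value:      {fine_peak:.3f}")
```

The same traffic yields a peak of 153.333 at the coarser interval and 496.666 at the finer one, which shows that a reported peak only has meaning relative to the collection interval that produced it.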
Since the peak rate is typically not representative of the usual traffic flow across that interface, it is of limited utility when doing capacity planning. If you base capacity decisions on the peak rate, you are likely to keep increasing the speed of the circuit unnecessarily. If equipment and circuits were free, that would not be a problem, but in the real world what we really need to know is how utilized the interface is most of the time. For this, we use percentile data.
Percentile is a statistical function. In basic terms, the Nth percentile represents the peak value of a variable after the top (100 − N)% of values are discarded. This gives you a better idea of the overall high water mark of the utilization, while excluding short spikes that would throw off your calculations. For an exhaustive explanation of percentile calculations, refer to the Wikipedia article on Percentiles.
While almost any percentile value can be used for capacity planning purposes, the most commonly used is the 95th percentile. The 95th percentile represents the peak value after the top 5% of values are discarded. For short time periods (i.e., a small sample size), the 95th percentile may be the same value as the peak. This is a normal effect of the statistical calculations and is not an error. For best results when capacity planning, the 95th percentile should always be calculated over a period of a month or longer. Other common values used for capacity planning are the 90th and 75th percentiles, each of which discards more of the highest values and so gives a progressively lower estimate of bandwidth usage.
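The following sketch shows one common way to compute a percentile, the nearest-rank method. The method Netreo itself uses is not stated in this document, so treat this as an illustration only. The function name and sample values are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value that has at least
    pct% of the samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be greater than 0 and at most 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based position
    return ordered[rank - 1]


# Hypothetical utilization samples (%): mostly quiet, with a few spikes.
samples = [5] * 90 + [20] * 6 + [80] * 4

print("95th percentile:", percentile_nearest_rank(samples, 95))
print("peak:", max(samples))

# With only a handful of samples, the 95th percentile equals the peak.
print("small sample:", percentile_nearest_rank([10, 30, 20], 95))
```

In the 100-sample case the four spikes at 80 fall within the discarded top 5%, so the 95th percentile is 20 while the peak is 80. In the three-sample case there are too few values to discard any, so the percentile and the peak are both 30.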
In this example, looking at inbound traffic values:
The average value for this time period is 309.778K. The peak value (Netreo usually calls this “Max”) is 11.114M, and the 95th percentile value is 3.097M.
By using the 95th percentile for our capacity calculations, we can determine the speed the interface needs to be to allow 95% of the traffic to go across it with minimal congestion and latency. Quality of Service (QoS) can also be configured on routers and gateway devices to ensure that our high-priority, mission-critical traffic is always given the best service, while less critical traffic is delayed or even dropped during times of congestion.
When planning for capacity in the real world, 100% of the theoretical maximum utilization is not really achievable. There are several reasons: the protocols used to manage communication have some overhead; the transmission lines themselves introduce latency (and possibly errors); and communication must first be initiated and set up so that the computers involved know exactly which applications are supposed to handle the data. The actual real-world throughput of a given digital circuit is limited by all of these factors, and by the amount of communication that must be dedicated to setup, error correction, routing, and other functions. This is mainly a function of how complex the underlying technology is and how many users must share it. For basic half-duplex Ethernet, the achievable maximum can be as low as 60% of the actual line speed. Wireless protocols can be much lower. Point-to-point serial lines, on the other hand, can achieve very high efficiencies, because there are only ever two parties involved. If we take these factors into account, we can set reasonable and appropriate thresholds for each kind of line we are sending data across, and factor that into our capacity planning calculations.
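One way to fold these factors into a capacity estimate is sketched below. It is not a Netreo formula: the function name, the idea of a separate planning target, and the 0.8 target value are hypothetical. The 3.097M figure is the 95th percentile from the earlier example (assumed here to be in bits per second), and 0.6 is the half-duplex Ethernet figure mentioned above.

```python
def required_line_speed(p95_bps, efficiency, target_utilization=1.0):
    """Line speed needed so that 95th-percentile traffic fits within the
    usable share of the line.

    efficiency: fraction of the nominal speed that is realistically achievable.
    target_utilization: hypothetical planning threshold applied on top of that.
    """
    if not 0 < efficiency <= 1 or not 0 < target_utilization <= 1:
        raise ValueError("efficiency and target_utilization must be in (0, 1]")
    return p95_bps / (efficiency * target_utilization)


p95 = 3_097_000  # 95th percentile from the example, assumed to be bits/second

print(f"{required_line_speed(p95, 0.6) / 1e6:.2f} Mbit/s")
print(f"{required_line_speed(p95, 0.6, 0.8) / 1e6:.2f} Mbit/s")
```

A line type with higher efficiency, such as a point-to-point serial link, would need a nominal speed much closer to the 95th percentile itself.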