tuning the deadband
There are two problems that many IT folks will have in their career:

1. We will obtain enough knowledge that we will occasionally forget some basics.
2. We will receive a flurry of systems alerts that are flapping between problem and resolved faster than the wings of a hummingbird.

In this post, I want to chat about the basics of building effective monitoring in terms of defining problems and recoveries.

what the heck are we measuring?

We're monitoring servers and databases and network switches and whatnot, duh! Well, if only it were that simple. Every node we monitor is going to have different items we can monitor, and before we build a monitor, we need to understand what type of data we want to monitor. Generally, these items fall into one of two categories:

Gauges represent a point-in-time value that can go up or down. Think inlet or CPU temperature, memory usage, or current connections. With gauges, we are concerned mostly with how much the value changes and in which direction.

Counters, on the other hand, only ever go up from their last reset. Think count of errors on an interface or total bytes transmitted. With counters, there isn't much sense in monitoring direction since they only go up, and we already expect them to change, so there's generally not much sense in monitoring whether they change. Instead, we are concerned with the rate at which the counter is changing.

One of the main reasons I want to point this out is that the type of item matters. Were you to apply some logic to alert when a counter is strictly greater than a value (as one example) without converting it to a rate first, you'd have an alert that, once triggered, would stay active until the counter reset. I acknowledge there are specific use cases for most monitoring configurations, but we are focusing on the basic principles and general concepts.

some fast maths

Speaking of "strictly greater than," let's have a chat about inequalities.
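The Law of Trichotomy dictates that a given value is exactly one of three things relative to a threshold at any given time: greater than it, less than it, or equal to it. The first two are strict inequalities:

| Name | Symbol | Expression | Examples | Type |
|---|---|---|---|---|
| Greater Than | > | Value > Threshold | 6 -gt 7 returns $false; 7 -gt 6 returns $true | Strict Inequality (Boundary Excluded) |
| Less Than | < | Value < Threshold | 6 -lt 7 returns $true; 7 -lt 6 returns $false | Strict Inequality (Boundary Excluded) |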
When we use strict inequalities, we are making a decision about the boundary/threshold value. "If value is greater than 7..." "If value is less than 7..." So what happens when value is...7? 7 is neither less than nor greater than itself. I know we're not breaking any new ground there, but it's important to understand this because if you have a monitoring case where something can critically fail at 7, and you are using strict inequalities, you might be opening yourself up to a failure state that wouldn't alert.

If you did happen to need to know when something is exactly a value, your boundary is incredibly specific:

| Name | Symbol | Expression | Examples | Type |
|---|---|---|---|---|
| Equal To | = | Value = Threshold | 6 -eq 7 returns $false; 7 -eq 6 returns $false; 7 -eq 7 returns $true | Equality (Is the Boundary) |
When you need more flexibility, you can use non-strict inequalities (also known as "inclusive" inequalities, as the boundary value is included in the set of true values):

| Name | Symbol | Expression | Examples | Type |
|---|---|---|---|---|
| Greater Than Or Equal To | ≥ | Value ≥ Threshold | 6 -ge 7 returns $false; 7 -ge 6 returns $true; 7 -ge 7 returns $true | Non-strict Inequality (Boundary Included) |
| Less Than Or Equal To | ≤ | Value ≤ Threshold | 6 -le 7 returns $true; 7 -le 6 returns $false; 7 -le 7 returns $true | Non-strict Inequality (Boundary Included) |
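If you'd rather poke at the boundary than take my word for it, here is a minimal sketch in Python (not PowerShell, purely so it runs anywhere). The `evaluate` helper and the `COMPARISONS` mapping are names I made up for illustration; the only real API used is Python's standard `operator` module. It checks a value against a threshold of 7 with all five comparisons at once.

```python
import operator

THRESHOLD = 7  # the boundary value used in the tables above

# Maps each PowerShell-style operator name to its Python equivalent.
COMPARISONS = {
    "-gt": operator.gt,
    "-lt": operator.lt,
    "-eq": operator.eq,
    "-ge": operator.ge,
    "-le": operator.le,
}


def evaluate(value, threshold=THRESHOLD):
    """Return which comparisons are true for value against threshold."""
    return {name: compare(value, threshold) for name, compare in COMPARISONS.items()}


for value in (6, 7, 8):
    print(value, evaluate(value))
```

The line to look at is the one for 7: both strict comparisons come back false, so a monitor built only on `-gt` or `-lt` stays silent when the value sits exactly on the boundary.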
These operators are the foundation of the logic that drives all systems monitoring, 100% of the time.

fighting the flap

Remember the situation from the introduction to this post, where you receive a flurry of systems alerts that are flapping between problem and resolved faster than the wings of a hummingbird? This can be a result of a badly defined problem expression. To avoid such situations, we use hysteresis (also known as "deadband"), which is a logic design where we set the recovery expression's threshold to a different value than the problem expression's threshold. We might say:

It becomes a problem when the value is less than 5.
It is no longer a problem when the value is greater than 8.

This might seem counterintuitive at first, since we are taking the threshold we defined and throwing it out the window, but by creating this gray area between 5 and 8, we are getting the monitored system to prove it has stabilized before the alert closes.

It is super important that the recovery threshold not be too far flung from the problem expression's threshold; otherwise the monitor becomes ineffectual and can remain in a warning state even when the monitored item has returned to a safe value. We want the recovery threshold generous enough to keep noise from causing flapping, but constrained enough that every time you get the alert, it is accurate and honest.

aggregation is more agreeable

Speaking of time, it is another major consideration when building a monitor. Imagine you have a Microsoft SQL server that is humming away happily at 90% RAM usage, and Bob from your data engineering team decides to open SQL Server Management Studio locally on the server and the RAM usage shoots up to 95% for around a minute while it loads. Your monitoring system's problem expression alerts when the RAM usage is greater than or equal to 95%, so you get paged to have a look just to find it's just good old Bob.
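To put rough numbers on Bob's minute, here is a small Python sketch. The samples are made up (one reading every 30 seconds for five minutes), and `point_in_time_alerts` and `aggregated_alert` are hypothetical helpers, not any monitoring product's API. It compares checking each raw sample against the 95% threshold with checking the mean or median of the whole window.

```python
from statistics import mean, median

THRESHOLD = 95  # percent RAM; it is a problem when the value is >= THRESHOLD

# Hypothetical RAM samples, one every 30 seconds over 5 minutes.
# Bob opens SSMS and usage sits at 95% for about a minute.
samples = [90, 90, 91, 90, 95, 95, 90, 90, 91, 90]


def point_in_time_alerts(values, threshold=THRESHOLD):
    """Return the indexes of samples that would fire a raw-value check."""
    return [i for i, value in enumerate(values) if value >= threshold]


def aggregated_alert(values, aggregate, threshold=THRESHOLD):
    """Return True if the aggregate of the whole window breaches the threshold."""
    return aggregate(values) >= threshold


print("raw samples that page you:", point_in_time_alerts(samples))
print("mean of window:", mean(samples), "alert:", aggregated_alert(samples, mean))
print("median of window:", median(samples), "alert:", aggregated_alert(samples, median))
```

With these particular numbers the raw check fires on two samples, while neither the mean (91.2) nor the median (90) of the window reaches 95, so nobody gets paged for Bob.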
Monitoring purely real-time values is not always the most valuable thing to do. By adding a dimension to our monitoring (time), we can smooth out alerting noise and better alert on what we want. We accomplish this by using the mean or median over a period of time.

The mean (average) is helpful because it captures the "weight" of an issue over time, since it accounts for every datapoint, outliers included. That same property means it can become skewed and trigger a non-actionable alert if an outlier is too far from the rest of the datapoints.

The median (50th percentile) is a great alternative as it represents the consistency of the state. If the median value over the course of 5 minutes is above the problem threshold, then you know the monitored item was unhealthy for at least half of the samples in that window, and the result is not affected by one-off spikes that might be non-impactful.

pager? i hardly know 'er.

Finally, know your environment and know your hardware! If you know your servers are in a T1 data center with exceedingly consistent climate control, then take that into account. The same goes for your network switch in a non-climate-controlled production shop. You'd likely monitor these for temperature differently, as you have different expectations. Look at the manufacturer specifications for your hardware. Are you monitoring for critical up/down health? Or performance optimization? Consider your objectives.

Monitoring isn't just about turning on the alert light when things get hot. It's about translating physical or digital scenarios into logic that respects the IT folks on the other end of the pager. There are a million and one things that we cram into our heads, and we're all liable to forget some basics now and again. When we do, and configure ineffective monitors as a result, it is possible for the one true page to get lost amid the alert fatigue from a flock of flapping alerts.
By mastering and implementing these concepts, you can ensure that when that pager blares in the middle of the night, it's actually something that requires your attention and expertise. The more effort you put into getting monitoring right, the more your systems' reliability will improve, along with your quality of sleep.
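For anyone who wants the deadband from "fighting the flap" in something runnable, here is a minimal Python sketch using the same numbers as the example: it becomes a problem below 5 and recovers above 8. `DeadbandTrigger`, `count_flips`, and the readings are all hypothetical names and data of my own, and this is only one way to model it, not how any particular monitoring system implements recovery expressions.

```python
class DeadbandTrigger:
    """Low-side trigger with separate problem and recovery thresholds."""

    def __init__(self, problem_below=5, recover_above=8):
        if recover_above < problem_below:
            raise ValueError("recovery threshold must not sit below the problem threshold")
        self.problem_below = problem_below
        self.recover_above = recover_above
        self.in_problem = False

    def update(self, value):
        """Feed one reading and return True while the alert is active."""
        if not self.in_problem and value < self.problem_below:
            self.in_problem = True   # crossed the problem threshold
        elif self.in_problem and value > self.recover_above:
            self.in_problem = False  # proved it is stable above the deadband
        return self.in_problem


def count_flips(states):
    """Count state changes in a sequence of booleans, starting from OK."""
    flips = 0
    previous = False
    for state in states:
        if state != previous:
            flips += 1
        previous = state
    return flips


# Made-up readings that hover around 5 before climbing to safety.
readings = [6, 4, 6, 4, 6, 4, 7, 9, 9]

single_threshold = [value < 5 for value in readings]
trigger = DeadbandTrigger()
with_deadband = [trigger.update(value) for value in readings]

print("single threshold flips:", count_flips(single_threshold))
print("deadband flips:", count_flips(with_deadband))
```

On these readings the single threshold changes state six times (three pages, three recoveries), while the deadband version opens once and closes once, when the value finally clears 8.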