Monitoring is essential to ensuring a system runs well in production. Without monitoring, you are driving a car blindfolded. It can go well up to a point, but inevitably you will have a fender bender or even a fatal crash.
The whole purpose of monitoring is to detect issues before your users do. With thousands or millions of active users, people are quick to complain when something is not working correctly, hurting the experience and your brand significantly.
This story focuses on visual monitoring dashboards and leaves out everything else, namely alerting, anomaly detection, and instrumentation, among others. The monitoring technology stack is also irrelevant, as everything stated here applies to every popular visualization platform. Well-designed dashboards provide data that can be understood easily and interpreted at a glance to decide whether further action is necessary and, ideally, even provide hints on where to start looking.
The decision of what to monitor will affect the entire observability process. It is a double-edged sword: monitor too much, and important information may be lost to information overload. Monitor too little, and relevant data may be missed, causing issues to be detected too late.
A good starting point is to work from the outside in and think about scenarios that affect users directly. For example, an API endpoint responding with HTTP 500 errors or being extremely slow does not identify the root cause; however, it certainly brings visibility to an issue that must be investigated. In any event-based system, the dead letter topic is a good initial indicator that something is going wrong.
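The outside-in signal above can be sketched as a tiny check on recent responses. This is a minimal illustration, not any particular monitoring product's API: the function names and the 1% threshold are assumptions chosen for the example.

```python
# Hypothetical outside-in health signal: flag an endpoint whose recent
# responses contain too many HTTP 5xx errors. The threshold is a
# made-up example value, not a recommendation.

def error_rate(status_codes: list[int]) -> float:
    """Fraction of responses that were server errors (HTTP 5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)

def needs_investigation(status_codes: list[int], threshold: float = 0.01) -> bool:
    """True when the 5xx rate exceeds the alerting threshold."""
    return error_rate(status_codes) > threshold

recent = [200] * 95 + [500] * 5      # 5% of responses are server errors
print(needs_investigation(recent))   # True: users are already affected
```

Note that the check says nothing about the root cause; like the HTTP 500 example in the text, it only tells us that something user-facing deserves a look.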
We don’t want (at least in the beginning) to obsess over technical metrics, e.g., the throughput of database operations. Sure, there could be an issue here, but the technical nature of this metric makes it difficult to tell good from bad, and hardly anyone defines service level objectives (SLOs) on database writes anyway. A database that stops responding will also be apparent in every upstream metric and must be investigated regardless.
Similar to slicing microservices, you can cut monitoring dashboards along functional or technical seams.
A mix of both provides the best observability, as functional and aspect-oriented dashboards go hand in hand when analyzing an issue. The dashboard design patterns apply to functional and aspect-oriented dashboards alike.
I was once on a call with a UX designer and, for a brief moment, showed her one of our dashboards. She immediately pointed out how user-unfriendly it was: too much data at once, dozens of data visualization types, lacking structure, no visual cues, and no easy-to-understand indication of what is going well and what is not.
These kinds of dashboards provide all the necessary information but make it challenging to extract value from them, especially for developers who are new to the team or who don’t look at the dashboards often.
Colors are the premier way to indicate what is going well and what is not. Besides the traditional traffic-light scheme, blue is an excellent color for information that is neither good nor bad.
This summary section of a calculation service dashboard allows viewers to understand which calculation types are working and which are not. The top blue row states throughput without judging its values. Rows two and three carry the service level objectives (SLOs) in their names and show a color-coded representation of the current value compared to the SLO. Immediately we can see that processing in the second column is struggling. The last row shows a gauge representing the percentage of successful calculations, and clearly, the second gauge is screaming at us. Something is going wrong!
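The color coding in those SLO rows boils down to a small mapping rule. A minimal sketch, assuming a value at or above the SLO is healthy and a warning band of five percentage points sits just below it; the band width is an invented example, not part of the dashboard described above.

```python
# Traffic-light logic for an SLO panel. The warning band of 0.05
# (5 percentage points) is a hypothetical choice for illustration.

def slo_color(current: float, slo: float, warn_band: float = 0.05) -> str:
    """Map a current success ratio against its SLO to a panel color."""
    if current >= slo:
        return "green"          # SLO met
    if current >= slo - warn_band:
        return "amber"          # degraded, but within the warning band
    return "red"                # well below the SLO: the screaming gauge

print(slo_color(0.999, 0.99))   # green
print(slo_color(0.97, 0.99))    # amber
print(slo_color(0.80, 0.99))    # red
```

Most visualization platforms let you express exactly this as per-panel thresholds, so the rule lives in the dashboard configuration rather than in code.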
As in UX design, users are accustomed to specific patterns they already know how to interact with. Applied to dashboard design, this means one type of representation should only be used for one kind of metric. Nothing is worse than having some time series graphs that are stacked and others that are not. The same value looks very different and will confuse, at least at first glance.
A state timeline is one of the best ways to depict the different issue types that occurred in a given time period. Use this visualization for all system components, and immediately anyone will know when to look into excessive issues. Depicting something else, like throughput, would be harmful, as high throughput would look like a high error rate even though the two are unrelated.
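The data behind such a state timeline is simple: error events bucketed into fixed time windows per component, so each cell shows which issue types (if any) occurred. The sketch below uses invented event data and a one-minute bucket size purely for illustration.

```python
# Minimal state-timeline data model: per component, per time bucket,
# the set of issue types observed. Events and bucket size are made up.
from collections import defaultdict

def build_timeline(events, bucket_seconds=60):
    """events: iterable of (timestamp, component, issue_type) tuples."""
    timeline = defaultdict(lambda: defaultdict(set))
    for ts, component, issue in events:
        bucket = ts // bucket_seconds * bucket_seconds  # floor to window start
        timeline[component][bucket].add(issue)
    return timeline

events = [
    (10, "ingest", "timeout"),
    (70, "ingest", "timeout"),
    (75, "pricing", "validation"),
]
tl = build_timeline(events)
print(tl["ingest"][0])     # {'timeout'}
print(tl["ingest"][60])    # {'timeout'}
print(tl["pricing"][60])   # {'validation'}
```

An empty set in a bucket renders as a healthy cell, which is exactly why throughput does not belong in this panel: a full cell must always mean trouble.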
Visualization variations are a good solution when multiple metric types are best shown with the same visualization type. They ensure each component has only one meaning while not limiting the functionality of the dashboard.
Consistency applies not only to individual elements but also to sections. For example, say you own multiple microservices that publish events and subscribe to others. Although customized dashboards for different event types are not bad per se, a consistent layout with consistent data makes the most sense, as it can be understood more quickly. Adding specific information on top is fine, because the viewer has already understood the high-level overview and knows the context.
The dashboard section above shows all the essential information about event consumption. The first row represents the active queue (green because we are up to date), and the second row shows the dead letter topic (red because we never like events that fail). The bottom half then goes into more detail: how fast processing is, what our throughput looks like, and finally, which errors we encountered while processing messages, using the state timeline.
Personally, I am on the fence about whether absolute numbers have a place in monitoring dashboards designed for system observability, as these should not creep into business-intelligence territory. However, absolute numbers are the most effective when talking with stakeholders outside the technical domain. “We failed to process 5000 orders in the past 24 hours” is received with significantly higher urgency than “we are not processing 0.X% of orders.”
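The framing effect above is just the same number expressed two ways. A quick sketch, where the daily order volume and the 0.5% failure rate are made-up assumptions chosen so the arithmetic lands on a round figure:

```python
# Same incident, two framings. Volume and rate are hypothetical.
daily_orders = 1_000_000   # assumed order volume over 24 hours
failure_rate = 0.005       # assumed 0.5% of orders failing

failed_orders = round(daily_orders * failure_rate)
print(f"We are not processing {failure_rate:.1%} of orders.")
print(f"We failed to process {failed_orders} orders in the past 24 hours.")
# The absolute count lands with far more urgency than the percentage.
```

Keeping the percentage on the engineering dashboard and translating to absolute counts only when talking to stakeholders gives you both framings without cluttering the panels.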
I'm interested in exciting ideas, big or small, and business partners alike. So drop me a message, and we will talk about the vision you are pursuing.