On The Importance of End-to-End Monitoring for IoT

IoT systems are made up of many building blocks these days. A common setup looks like this: An edge gateway communicates with an MQTT broker over the public Internet or a VPN (the "southbound" interface). The gateways send messages with telemetry data from attached devices and machines. This data gets processed by normalization and transformation services before it is written to a time series database. REST or GraphQL APIs make the data available to a user interface or third-party systems (the "northbound" interface). The fundamental business function of this IoT system is that it can ingest data, process it, store it and make it available. All other business functions (KPI calculation, alert detection, etc.) are usually built upon this basic one. It spans the southbound and northbound interfaces and all the steps in between.

You must, at all times, know the status of this functionality. If it breaks, you must be the first to know, so you can start fixing it.

End-to-end monitoring will give you this knowledge and the confidence in your IoT solution that follows from it.

What does an IoT End-to-End Check look like?

An end-to-end check could look like this: Each time it runs, it sends data to your MQTT broker. Often, sending just one datapoint is enough; more can be necessary if you also want to monitor the data processing logic. Your check then starts querying the REST or GraphQL API until it can read back the exact value that it sent previously. The datapoint's value is often a UUID or a timestamp, so it can be uniquely identified. If the check receives the value before it times out, it is green. If it doesn't, or if it took too long, it is red.
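The check described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: `run_end_to_end_check`, `CheckResult`, `publish` and `read_latest` are hypothetical names. The two callables stand in for whatever MQTT client and REST or GraphQL client you actually use, and the default timeout and polling interval are placeholders.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    ok: bool                    # True = green, False = red
    latency_s: Optional[float]  # seconds from publish until the value was read back
    marker: str                 # the unique value that was sent


def run_end_to_end_check(
    publish: Callable[[str], None],             # assumption: sends one datapoint to the MQTT broker
    read_latest: Callable[[], Optional[str]],   # assumption: reads the datapoint back via the API
    timeout_s: float = 30.0,
    poll_interval_s: float = 1.0,
) -> CheckResult:
    # A UUID makes the datapoint uniquely identifiable.
    marker = str(uuid.uuid4())
    started = time.monotonic()
    publish(marker)
    deadline = started + timeout_s

    # Poll the northbound API until the value shows up or the check times out.
    while True:
        if read_latest() == marker:
            return CheckResult(True, time.monotonic() - started, marker)
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return CheckResult(False, None, marker)
        time.sleep(min(poll_interval_s, remaining))
```

Passing the two clients in as plain callables keeps the check independent of a specific MQTT or HTTP library and makes it easy to exercise against an in-memory fake before pointing it at a real system.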

This kind of check also gives you a useful metric: The end-to-end latency of data ingestion, processing and storage. You can use this to detect gradual decreases in performance (maybe the ingestion performance of your time series database gets worse the more data it stores) or temporary spikes (if devices send more data during certain times of the day, causing congestion and delays in the data processing). The end-to-end latency is also a good SLA metric for reporting the performance to stakeholders.
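As a rough illustration of how the latency values from repeated check runs could be evaluated, the following sketch computes an SLA compliance ratio and flags a gradual slowdown. The function names, the window size and the slowdown factor are assumptions chosen for the example. Comparing medians is just one simple heuristic, not a recommendation from a specific monitoring tool.

```python
from statistics import median
from typing import Optional, Sequence


def sla_compliance(latencies: Sequence[Optional[float]], threshold_s: float) -> float:
    """Fraction of check runs that finished within threshold_s.

    A latency of None means the check timed out and counts as a violation.
    """
    if not latencies:
        raise ValueError("no check results to evaluate")
    within = sum(1 for lat in latencies if lat is not None and lat <= threshold_s)
    return within / len(latencies)


def is_degrading(samples: Sequence[float], window: int = 10, slowdown_factor: float = 1.5) -> bool:
    """Compare the median of the most recent `window` latencies to the median of all earlier ones."""
    if len(samples) < 2 * window:
        return False  # not enough history to make a statement
    baseline = median(samples[:-window])
    recent = median(samples[-window:])
    return recent > slowdown_factor * baseline
```

For example, `sla_compliance(latencies, threshold_s=5.0)` gives a single number suitable for a stakeholder report, while `is_degrading` could be evaluated after every run to catch a slow drift before it turns into a failed check.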

Taking Action

If the end-to-end check breaks, your on-call DevOps staff must be informed right away, so that they can investigate. The check will not tell them what is wrong, but it will tell them that something is wrong. They can then use more fine-grained monitoring metrics to pinpoint the cause of the issue. To be clear: End-to-end monitoring isn't a replacement for individual component checks; it's a complementary tool.

If your time for setting up monitoring checks is limited (as it often is, because features are deemed more important than reliability), make sure you have at least this end-to-end monitoring check in place. It delivers a lot of value for a small amount of effort.