Monitoring Recommendation - Best Practices

The following guide outlines strategic monitoring best practices for the Cedalo Mosquitto broker. These recommendations focus on ensuring high availability, reliability, and license compliance.

Please note that these are general recommendations. Depending on your specific use case (e.g., high-throughput telemetry vs. low-frequency command & control), you may need to track additional metrics.

Prerequisite: OS-Level Monitoring

While monitoring the broker internally is critical, it is equally important to monitor the underlying operating system resources (CPU, RAM, Disk I/O, and Network Bandwidth). Broker metrics alone may not reveal if the host machine is starving the process.

We recommend running a system-metrics exporter, such as the Prometheus Node Exporter, alongside your broker to capture these system-level metrics.

Tool: Prometheus Node Exporter
Key OS Metrics to Watch: System Load, RAM Available, Disk Space Used (especially for persistence), and File Descriptor usage.

1. Connection Health & License Compliance

Goal: Ensure clients can connect, stay connected, and that usage remains within licensed limits.

License Utilization Check

Metric: mosquitto_sessions vs. Your License Limit
Best Practice: Create a gauge that shows % License Used.
Why: Your license limit typically applies to the total number of sessions (both active/online and passive/offline with cleanSession=false). Monitoring mosquitto_sessions ensures you account for disconnected devices that are still holding a license slot.
Alert Threshold: Warn at 80% usage; Critical at 90%.
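As a minimal sketch of the threshold logic above, the following Python function turns a session count and a license limit into a percentage and a severity level. The function name `license_status` and its parameters are hypothetical and only serve as illustration; in practice the session count would come from mosquitto_sessions and the limit from your license.

```python
def license_status(sessions, license_limit, warn_pct=80.0, crit_pct=90.0):
    """Return (percent_used, severity) for a session count and a license limit.

    Hypothetical helper: the 80% / 90% defaults mirror the thresholds
    recommended above and can be adjusted.
    """
    if license_limit <= 0:
        raise ValueError("license_limit must be positive")
    percent_used = 100.0 * sessions / license_limit
    if percent_used >= crit_pct:
        severity = "critical"
    elif percent_used >= warn_pct:
        severity = "warning"
    else:
        severity = "ok"
    return percent_used, severity
```

For example, 850 sessions against a limit of 1000 yields 85% and a warning, while 900 sessions would already be critical.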

The "Zombie Session" Check

Metrics: mosquitto_sessions vs. mosquitto_clients_online (see also mosquitto_clients_offline)
Best Practice: Monitor the gap between total sessions and online clients.
Why: If the number of offline sessions continues to grow, you may have "zombie" clients that are not reconnecting. These dormant sessions accumulate queued messages, keep occupying license slots, and could eventually exhaust system RAM or disk.
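One way to reason about this check is sketched below in Python: compute the gap between total sessions and online clients for a series of samples and flag the case where it never shrinks and ends higher than it started. The helper names and the `min_increase` parameter are assumptions for illustration, not part of the broker or any exporter.

```python
def offline_gap(sessions, clients_online):
    """Sessions that exist on the broker but are not currently connected."""
    return max(sessions - clients_online, 0)


def gap_is_growing(samples, min_increase=1):
    """Check whether the offline gap grows steadily over a window.

    samples: list of (sessions, clients_online) pairs, oldest first.
    min_increase: hypothetical minimum growth over the window that
    should count as a trend.
    """
    gaps = [offline_gap(s, o) for s, o in samples]
    if len(gaps) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(gaps, gaps[1:]))
    return never_shrinks and gaps[-1] - gaps[0] >= min_increase
```

A window in which the gap goes 10, 20, 34 would be flagged; a window in which clients reconnect and the gap falls would not.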

The "Backlog" Warning

Metric: mosquitto_clients_backlog
Threshold: Alert if > 0 for more than 1 minute.
Why: This tracks connections waiting for authentication or the TCP stack. A positive value implies the broker cannot accept new connections fast enough, indicating a potential connection storm.
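The "greater than zero for more than one minute" condition can be sketched as follows. This is an illustrative Python helper, assuming you already have timestamped samples of mosquitto_clients_backlog; the function name and sample format are hypothetical.

```python
def backlog_alert(samples, hold_seconds=60.0):
    """Return True if the backlog has stayed above zero for longer than hold_seconds.

    samples: list of (timestamp_seconds, backlog_value), oldest first.
    The condition is evaluated up to the most recent sample.
    """
    positive_since = None
    for timestamp, value in samples:
        if value > 0:
            if positive_since is None:
                positive_since = timestamp  # start of the current positive streak
        else:
            positive_since = None  # streak broken, start over
    if positive_since is None or not samples:
        return False
    return samples[-1][0] - positive_since > hold_seconds
```

A single positive scrape does not trigger the alert; only a streak that lasts longer than the hold time does, which filters out brief reconnect bursts.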

Client Churn Rate (Stability Index)

Metrics: rate(mosquitto_mqtt_connect_received) vs. mosquitto_clients_online
Best Practice: Alert if the Connect Rate is high (e.g., >10% of your total fleet size per minute) while the Online Client count remains flat.
Why: This pattern indicates that clients are failing to establish a stable session. They are stuck in a Connect -> Reject/Fail -> Retry loop, wasting broker CPU on handshake processing without ever becoming "Online".
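The churn pattern described above can be expressed as a small predicate. The sketch below is only an illustration: `fleet_size` is a number you supply yourself, and `flat_tolerance` (how much the online count may move and still count as "flat") is an assumed parameter that does not appear in the recommendation.

```python
def churn_alert(connects_per_minute, online_before, online_after,
                fleet_size, connect_fraction=0.10, flat_tolerance=0.01):
    """Flag a high connect rate that does not increase the online client count.

    connect_fraction: share of the fleet per minute that counts as "high"
    (10% as in the example above).
    flat_tolerance: hypothetical share of the fleet by which the online
    count may change and still be considered flat.
    """
    high_connect_rate = connects_per_minute > connect_fraction * fleet_size
    online_flat = abs(online_after - online_before) <= flat_tolerance * fleet_size
    return high_connect_rate and online_flat
```

With a fleet of 10,000 devices, 1,500 connects per minute while the online count moves only from 8,000 to 8,010 would be flagged; the same connect rate during a genuine recovery (8,000 to 9,400 online) would not.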

2. Message Traffic & Reliability

Goal: Detect data loss, validate throughput, and identify bottlenecks.

The "Dropped Message" Alert (CRITICAL)

Metric: mosquitto_mqtt_publish_dropped
Threshold: Alert on any value > 0.
Why: If this counter increments, the broker has permanently discarded messages because the queues for a subscribing client were full.
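Because this metric is a counter, an alert is usually based on its increase between two scrapes rather than on its absolute value. The Python sketch below illustrates that idea under one assumption that is not stated in this guide: that the counter restarts from zero when the broker restarts. The helper names are hypothetical.

```python
def counter_increase(previous, current):
    """Increase of a counter between two scrapes.

    Assumption: if the current value is lower than the previous one,
    the counter was reset (for example by a restart), so everything
    counted since the reset is treated as the increase.
    """
    if current < previous:
        return current
    return current - previous


def dropped_alert(previous, current):
    """True if any message was dropped between the two scrapes."""
    return counter_increase(previous, current) > 0
```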

Traffic Volume Analysis

Metrics:
- mosquitto_mqtt_publish_received (Ingress)
- mosquitto_mqtt_publish_sent (Egress)
Best Practice: Compare these two to verify your fan-out ratio (messages sent per message received). If sent drops to near zero while received continues, your consumers are most likely offline.
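A rough sketch of the fan-out comparison, assuming you have already computed per-second rates for both publish metrics. The function names and the `min_ratio` cutoff for "near zero" are illustrative assumptions.

```python
def fan_out_ratio(received_rate, sent_rate):
    """Messages sent per message received; None when nothing is received."""
    if received_rate <= 0:
        return None
    return sent_rate / received_rate


def consumers_look_offline(received_rate, sent_rate, min_ratio=0.01):
    """True if messages keep arriving but almost none are delivered.

    min_ratio is a hypothetical cutoff for "near zero" egress.
    """
    ratio = fan_out_ratio(received_rate, sent_rate)
    return ratio is not None and ratio < min_ratio
```

A ratio of 3.0 means each published message is delivered to three subscribers on average; what counts as normal depends entirely on your topic and subscription layout.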

Total Packet Load

Metrics:
- mosquitto_mqtt_packets_received
- mosquitto_mqtt_packets_sent
Best Practice: Use these to measure the overall "work" the broker is doing, including control packets (PING, ACKs) which might not show up in the publish metrics but still consume CPU/Network.

3. Persistence & Storage Health

Goal: Ensure the persistence engine is not filling up storage or memory.

Storage Saturation

Metrics:
- mosquitto_stored_messages (Count)
- mosquitto_stored_bytes (Size)
Best Practice: Monitor the rate of growth.
Why: A continuous upward trend indicates that offline clients are not returning to consume their queued messages. If mosquitto_stored_bytes approaches your RAM capacity, the broker may crash or stop accepting messages.
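To make "rate of growth" concrete, the sketch below estimates the average growth of mosquitto_stored_bytes over a window and extrapolates linearly to a capacity limit. It is a simplification: the linear extrapolation, the helper names, and the `limit_bytes` parameter are assumptions for illustration.

```python
def growth_per_second(samples):
    """Average growth in bytes per second between the first and last sample.

    samples: list of (timestamp_seconds, stored_bytes), oldest first.
    """
    if len(samples) < 2:
        return 0.0
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (b1 - b0) / (t1 - t0)


def seconds_until_limit(samples, limit_bytes):
    """Linear estimate of the time left until stored bytes reach limit_bytes.

    Returns None when storage is flat or shrinking.
    """
    rate = growth_per_second(samples)
    if rate <= 0:
        return None
    remaining = limit_bytes - samples[-1][1]
    return max(remaining, 0) / rate
```

Alerting on the estimated time remaining (for example, less than a few hours) gives earlier warning than alerting on an absolute size.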

4. High Availability (HA) Cluster Status

> Note: This section is only applicable if you are running Mosquitto in an HA Cluster configuration.

Goal: Monitor the integrity of the Raft cluster and voting process.

Cluster Quorum Check

Metrics:
- mosquitto_ha_voting_nodes_online (Available on Leader)
- mosquitto_ha_voting_nodes (Total Configured)
Best Practice: Alert immediately if mosquitto_ha_voting_nodes_online < mosquitto_ha_voting_nodes.
Why:
- Degraded State: If a node is offline it should be investigated immediately to restore the cluster to full redundancy.
- Cluster availability risk: If the number of online nodes drops below the majority quorum (e.g. to only 1 online out of 3 configured, or 2 online out of 5 configured), the cluster will close connections for clients until the majority is restored. This can be prevented by monitoring the cluster state and acting as soon as a single node becomes unavailable.
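The two cases above can be told apart with a majority calculation. The following Python sketch classifies the cluster from the two metric values; the function names and the state labels are hypothetical.

```python
def quorum_size(voting_nodes):
    """Smallest number of voting nodes that forms a majority."""
    return voting_nodes // 2 + 1


def cluster_state(voting_nodes_online, voting_nodes):
    """Classify the cluster as 'healthy', 'degraded', or 'no_quorum'."""
    if voting_nodes_online < quorum_size(voting_nodes):
        return "no_quorum"   # majority lost
    if voting_nodes_online < voting_nodes:
        return "degraded"    # still has a majority, but reduced redundancy
    return "healthy"
```

For a 3-node cluster the majority is 2, so 2 of 3 online is degraded and 1 of 3 has lost quorum; for a 5-node cluster the majority is 3, so 2 of 5 online has lost quorum.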

5. Security & Authentication Monitoring

Goal: Detect intrusion attempts.

Brute Force Detection

Metrics: mosquitto_basic_auth_fail and mosquitto_extended_auth_fail
Best Practice: Alert on sudden spikes (e.g., > 50 failures / minute).
Why: A sudden spike indicates either an attack or a fleet-wide misconfiguration of credentials.
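As an illustration of the spike check, the sketch below converts the increase of both failure counters between two scrapes into a combined per-minute rate and compares it with the example threshold of 50. Adding the two metrics together, treating a lower value as a counter reset, and all function names are assumptions made for this example.

```python
def failures_per_minute(previous, current, interval_seconds):
    """Per-minute rate of a failure counter between two scrapes.

    Assumption: a lower current value means the counter was reset.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    increase = current - previous if current >= previous else current
    return increase * 60.0 / interval_seconds


def brute_force_alert(prev_basic, curr_basic, prev_extended, curr_extended,
                      interval_seconds, threshold=50.0):
    """True if combined basic + extended auth failures exceed threshold per minute."""
    rate = (failures_per_minute(prev_basic, curr_basic, interval_seconds)
            + failures_per_minute(prev_extended, curr_extended, interval_seconds))
    return rate > threshold
```

With a 30-second scrape interval, 30 new basic failures and 5 new extended failures correspond to 70 failures per minute and would trigger the alert.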

Recommended Grafana Dashboard Layout

Row 1: The "Pulse" (License & Health)

Gauge: mosquitto_sessions vs License Limit
Stat: mosquitto_ha_voting_nodes_online (Cluster Health - Green if matches configured nodes)
Stat: mosquitto_mqtt_publish_dropped (CRITICAL - Red if > 0)

Row 2: Throughput

Graph: rate(mosquitto_mqtt_publish_received) vs rate(mosquitto_mqtt_publish_sent)
Graph: rate(mosquitto_mqtt_packets_received) vs rate(mosquitto_mqtt_packets_sent)

Row 3: Persistence & Backlog

Graph: mosquitto_stored_bytes (Disk Usage Trend)
Graph: mosquitto_sessions vs mosquitto_clients_online (Zombie Session Gap)
