Monitoring Recommendation - Best Practices

The following guide outlines strategic monitoring best practices for the Cedalo Mosquitto broker. These recommendations focus on ensuring high availability, reliability, and license compliance.

Please note that these are general recommendations. Depending on your specific use case (e.g., high-throughput telemetry vs. low-frequency command & control), you may need to track additional metrics.

Prerequisite: OS-Level Monitoring

While monitoring the broker internally is critical, it is equally important to monitor the underlying operating system resources (CPU, RAM, Disk I/O, and Network Bandwidth). Broker metrics alone may not reveal if the host machine is starving the process.

We recommend running a host-level exporter, such as the Prometheus Node Exporter, alongside your broker to capture these system-level metrics.

  • Tool: Prometheus Node Exporter
  • Key OS Metrics to Watch: System Load, RAM Available, Disk Space Used (especially for persistence), and File Descriptor usage.

1. Connection Health & License Compliance

Goal: Ensure clients can connect, stay connected, and that usage remains within licensed limits.

License Utilization Check

  • Metric: mosquitto_sessions vs. Your License Limit
  • Best Practice: Create a gauge that shows % License Used.
  • Why: Your license limit typically applies to the total number of sessions (both active/online and passive/offline with cleanSession=false). Monitoring mosquitto_sessions ensures you account for disconnected devices that are still holding a license slot.
    • Alert Threshold: Warn at 80% usage; Critical at 90%.
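
The thresholds above can be sketched as a small helper. This is a minimal illustration, assuming the current value of mosquitto_sessions has already been scraped and that the license limit is known; the function name `license_usage` and its return format are hypothetical and not part of any Cedalo API.

```python
from typing import Tuple


def license_usage(sessions: int, license_limit: int) -> Tuple[float, str]:
    """Return (percent of license used, severity) for a session-based license.

    `sessions` is assumed to be the current value of mosquitto_sessions.
    """
    if license_limit <= 0:
        raise ValueError("license_limit must be positive")
    percent = 100.0 * sessions / license_limit
    if percent >= 90.0:
        severity = "critical"  # Critical at 90% usage
    elif percent >= 80.0:
        severity = "warning"   # Warn at 80% usage
    else:
        severity = "ok"
    return percent, severity
```

For example, `license_usage(850, 1000)` reports 85% usage with severity "warning".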

The "Zombie Session" Check

  • Metrics: mosquitto_sessions, mosquitto_clients_online, and mosquitto_clients_offline
  • Best Practice: Monitor the gap between mosquitto_sessions and mosquitto_clients_online over time; this gap is the number of offline sessions.
  • Why: If the number of offline sessions continues to grow, you may have "zombie" clients that are not reconnecting. These dormant sessions hoard messages, occupy license slots, and could eventually exhaust system RAM or disk.
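
One possible way to detect a steadily growing offline population is sketched below. It assumes the offline count can be derived as mosquitto_sessions minus mosquitto_clients_online, and that samples are collected at a regular interval. The helper names and the `min_samples` parameter are hypothetical.

```python
from typing import List, Tuple


def offline_gap(sessions: int, clients_online: int) -> int:
    """Sessions that exist on the broker but are not currently connected."""
    return max(sessions - clients_online, 0)


def offline_keeps_growing(samples: List[Tuple[int, int]], min_samples: int = 3) -> bool:
    """True if the offline gap increased at every step of the sample window.

    `samples` holds (sessions, clients_online) pairs, oldest first.
    """
    gaps = [offline_gap(sessions, online) for sessions, online in samples]
    if len(gaps) < min_samples:
        return False  # not enough data to call it a trend
    return all(later > earlier for earlier, later in zip(gaps, gaps[1:]))
```

A strictly increasing window is a deliberately simple criterion; a longer window or a slope threshold may suit fleets whose devices routinely go offline for short periods.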

The "Backlog" Warning

  • Metric: mosquitto_clients_backlog
  • Threshold: Alert if > 0 for more than 1 minute.
  • Why: This tracks connections that are still waiting in the TCP stack or for authentication. A positive value implies the broker cannot accept new connections fast enough, indicating a potential connection storm.
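
The "for more than 1 minute" condition means a single positive sample should not fire the alert. A minimal sketch of that hold logic follows, assuming timestamped samples of mosquitto_clients_backlog are available; the function name `backlog_alert` is hypothetical.

```python
from typing import List, Optional, Tuple


def backlog_alert(samples: List[Tuple[float, float]], hold_seconds: float = 60.0) -> bool:
    """True if the backlog stayed above zero for longer than `hold_seconds`.

    `samples` holds (unix_timestamp, backlog_value) pairs, oldest first.
    """
    positive_since: Optional[float] = None
    for timestamp, backlog in samples:
        if backlog > 0:
            if positive_since is None:
                positive_since = timestamp  # start of the current positive streak
            if timestamp - positive_since > hold_seconds:
                return True
        else:
            positive_since = None  # streak broken, start over
    return False
```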

Client Churn Rate (Stability Index)

  • Metrics: rate(mosquitto_mqtt_connect_received) vs. mosquitto_clients_online
  • Best Practice: Alert if the Connect Rate is high (e.g., >10% of your total fleet size per minute) while the Online Client count remains flat.
  • Why: This pattern indicates that clients are failing to establish a stable session. They are stuck in a Connect -> Reject/Fail -> Retry loop, wasting broker CPU on handshake processing without ever becoming "Online".
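
The churn condition combines two signals: a high connect rate and a flat online count. The sketch below is one way to express it, assuming the per-minute connect rate has already been derived from mosquitto_mqtt_connect_received. The `flat_tolerance` parameter (how much the online count may move and still count as "flat") is a hypothetical addition, as is the function name.

```python
def churn_alert(
    connects_per_minute: float,
    online_before: int,
    online_after: int,
    fleet_size: int,
    connect_ratio: float = 0.10,
    flat_tolerance: float = 0.01,
) -> bool:
    """True if many clients are connecting while the online count barely moves."""
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    # Connect rate above e.g. 10% of the fleet per minute
    high_connect_rate = connects_per_minute > connect_ratio * fleet_size
    # Online count changed by no more than the tolerance (assumed 1% of the fleet)
    online_is_flat = abs(online_after - online_before) <= flat_tolerance * fleet_size
    return high_connect_rate and online_is_flat
```

A high connect rate with a rising online count is usually a legitimate reconnect wave (for example after a network outage), which is why both conditions are required.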

2. Message Traffic & Reliability

Goal: Detect data loss, validate throughput, and identify bottlenecks.

The "Dropped Message" Alert (CRITICAL)

  • Metric: mosquitto_mqtt_publish_dropped
  • Threshold: Alert on any value > 0.
  • Why: If this counter increments, the broker has permanently discarded messages because the queue for a subscribing client was full.
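
Because this metric is described as a counter, an alert is typically evaluated on the increase between two scrapes rather than on the absolute value. The sketch below assumes the counter restarts from zero when the broker restarts; that reset behaviour and the helper names are assumptions.

```python
def counter_increase(previous: float, current: float) -> float:
    """Increase of a monotonically increasing counter between two scrapes.

    If the current value is lower than the previous one, the counter is
    assumed to have been reset, so the current value is the whole increase.
    """
    if current < previous:
        return current
    return current - previous


def dropped_alert(previous: float, current: float) -> bool:
    """True if any message was dropped since the previous scrape."""
    return counter_increase(previous, current) > 0
```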

Traffic Volume Analysis

  • Metrics:
    • mosquitto_mqtt_publish_received (Ingress)
    • mosquitto_mqtt_publish_sent (Egress)
  • Best Practice: Compare these two to verify your fan-out ratio. If mosquitto_mqtt_publish_sent drops to near zero while mosquitto_mqtt_publish_received continues, your consumers are most likely offline.
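
The fan-out comparison can be sketched as a ratio of the two rates. This example assumes both per-second rates have already been computed from the counters; the `min_ratio` cut-off for "near zero" and the helper names are hypothetical and should be tuned to your normal fan-out.

```python
from typing import Optional


def fan_out_ratio(received_rate: float, sent_rate: float) -> Optional[float]:
    """Messages sent per message received, or None when nothing is received."""
    if received_rate <= 0:
        return None  # no ingress, so the ratio is undefined
    return sent_rate / received_rate


def consumers_look_offline(received_rate: float, sent_rate: float, min_ratio: float = 0.05) -> bool:
    """True if messages keep arriving but almost nothing is delivered."""
    ratio = fan_out_ratio(received_rate, sent_rate)
    return ratio is not None and ratio < min_ratio
```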

Total Packet Load

  • Metrics:
    • mosquitto_mqtt_packets_received
    • mosquitto_mqtt_packets_sent
  • Best Practice: Use these to measure the overall "work" the broker is doing, including control packets (PING, ACKs) which might not show up in the publish metrics but still consume CPU/Network.

3. Persistence & Storage Health

Goal: Ensure the persistence engine is not filling up storage or memory.

Storage Saturation

  • Metrics:
    • mosquitto_stored_messages (Count)
    • mosquitto_stored_bytes (Size)
  • Best Practice: Monitor the rate of growth.
  • Why: A continuous upward trend indicates that offline clients are not returning to consume their queued messages. If mosquitto_stored_bytes approaches your RAM capacity, the broker may crash or stop accepting messages.
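
Monitoring the rate of growth can be turned into a rough time-to-exhaustion estimate. The sketch below uses a simple average slope between the first and last sample of mosquitto_stored_bytes; the capacity limit passed in is whatever budget you choose (for example a share of available RAM). The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def growth_rate(samples: List[Tuple[float, float]]) -> float:
    """Average growth in bytes per second between the first and last sample.

    `samples` holds (unix_timestamp, stored_bytes) pairs, oldest first.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (v1 - v0) / (t1 - t0)


def seconds_until_limit(samples: List[Tuple[float, float]], limit_bytes: float) -> Optional[float]:
    """Estimated seconds until `limit_bytes` is reached, or None if not growing."""
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    remaining = limit_bytes - samples[-1][1]
    return max(remaining, 0.0) / rate
```

A linear extrapolation is only a first approximation; bursty workloads may need a longer window to avoid false alarms.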

4. High Availability (HA) Cluster Status

> Note: This section is only applicable if you are running Mosquitto in an HA Cluster configuration.

Goal: Monitor the integrity of the Raft cluster and voting process.

Cluster Quorum Check

  • Metrics:
    • mosquitto_ha_voting_nodes_online (Available on Leader)
    • mosquitto_ha_voting_nodes (Total Configured)
  • Best Practice: Alert immediately if mosquitto_ha_voting_nodes_online < mosquitto_ha_voting_nodes.
  • Why:
    • Degraded State: If a node is offline, it should be investigated immediately to restore the cluster to full redundancy.
    • Cluster availability risk: If the number of online nodes drops below the majority quorum (e.g., to only 1 online out of 3 configured, or 2 online out of 5 configured), the cluster closes client connections until the majority is restored. This can be prevented by monitoring the cluster state and acting as soon as a single node becomes unavailable.
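
The two cases above (degraded versus quorum lost) can be distinguished with a simple majority calculation. This is a minimal sketch, assuming the two metric values are read from the leader; the function name and the state labels are hypothetical.

```python
def cluster_state(voting_nodes_online: int, voting_nodes: int) -> str:
    """Classify the cluster from the online and configured voting node counts."""
    majority = voting_nodes // 2 + 1  # e.g. 2 of 3, 3 of 5
    if voting_nodes_online < majority:
        return "quorum_lost"  # cluster cannot serve clients
    if voting_nodes_online < voting_nodes:
        return "degraded"     # still serving, but redundancy is reduced
    return "healthy"
```

Alerting already on "degraded" matches the recommendation to act as soon as a single node becomes unavailable.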

5. Security & Authentication Monitoring

Goal: Detect intrusion attempts.

Brute Force Detection

  • Metrics: mosquitto_basic_auth_fail and mosquitto_extended_auth_fail
  • Best Practice: Alert on sudden spikes (e.g., > 50 failures / minute).
  • Why: A sudden spike indicates either an attack or a fleet-wide misconfiguration of credentials.
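
A spike check can combine both failure counters into one per-minute rate. The sketch below assumes both metrics are counters sampled at two points in time. Negative differences (for example after a restart) are clamped to zero, which is a simplification. The function names are hypothetical.

```python
def auth_failures_per_minute(
    basic_prev: float,
    basic_now: float,
    extended_prev: float,
    extended_now: float,
    interval_seconds: float,
) -> float:
    """Combined authentication failures per minute over the sampling interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    increase = max(basic_now - basic_prev, 0) + max(extended_now - extended_prev, 0)
    return increase * 60.0 / interval_seconds


def brute_force_alert(failures_per_minute: float, threshold: float = 50.0) -> bool:
    """True if the failure rate exceeds the example threshold of 50 per minute."""
    return failures_per_minute > threshold
```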

Suggested Dashboard Layout

Row 1: The "Pulse" (License & Health)

  • Gauge: mosquitto_sessions vs License Limit
  • Stat: mosquitto_ha_voting_nodes_online (Cluster Health - Green if matches configured nodes)
  • Stat: mosquitto_mqtt_publish_dropped (CRITICAL - Red if > 0)

Row 2: Throughput

  • Graph: rate(mosquitto_mqtt_publish_received) vs rate(mosquitto_mqtt_publish_sent)
  • Graph: rate(mosquitto_mqtt_packets_received) vs rate(mosquitto_mqtt_packets_sent)

Row 3: Persistence & Backlog

  • Graph: mosquitto_stored_bytes (Disk Usage Trend)
  • Graph: mosquitto_sessions vs mosquitto_clients_online (Zombie Session Gap)