MQTT

MQTT is hard to understand not because of its packet types, but because it builds message distribution on a broker in the middle, topic-based routing, and the reality of weak or unstable connections. Many devices already have network access and a TCP connection, yet the subscriber still does not receive the status update. Or a device comes back online after being disconnected for a few minutes, and the application expects the message to be recovered, only to get part of it back. The difficult part is not the message format itself. It is the fact that MQTT treats distribution as a broker-centered problem.

MQTT is not about sending messages directly between devices. It uses a broker to route messages centrally, a session to keep the minimum state, and layered QoS to control delivery cost on unstable networks and constrained devices.

This article follows the classic MQTT model and focuses on three things: why publishers and subscribers are both centered on the broker, why state is kept in connections and sessions, and why QoS emphasizes recoverability and cost tiers rather than “perfect delivery for everything.” MQTT 5.0 properties, shared subscriptions, MQTT-SN, and vendor-specific conventions are mentioned, but not expanded into separate topics.

Follow One Message Flow

The most common path is not complicated:

A device connects to the broker as a client and completes CONNECT / CONNACK
A subscriber sends SUBSCRIBE to the broker and declares which topic filters it cares about
A publisher sends PUBLISH to the broker with a specific topic and payload
The broker forwards the message to the matching subscribers according to the topic
Both sides rely on keep alive, session state, and QoS handshakes to keep later delivery going

If you compress the logical relationship, it looks like this:

Publisher Client
  -> Broker: CONNECT
  -> Broker: PUBLISH sensors/floor1/temp = 23.4

Subscriber Client
  -> Broker: CONNECT
  -> Broker: SUBSCRIBE sensors/+/temp

Broker
  -> Subscriber Client: PUBLISH sensors/floor1/temp = 23.4

In this path, the publisher and subscriber usually do not know who the other side is, and they do not know whether the other side is online. What they actually depend on are these three things that the broker maintains:

Who has already established a connection
Who has subscribed to which topic filters
Whether some messages and acknowledgments still need to be tracked in the current session

So the core of MQTT is not “clients send messages to other clients.” It is “all participants hand distribution control to the broker.”
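The flow above can be sketched with a toy in-memory broker. This is an illustration only, not a real MQTT implementation: `ToyBroker`, `matches`, and the client names are made up for this example, there is no networking, and only the single-level `+` wildcard is handled.

```python
from collections import defaultdict


def matches(topic_filter: str, topic: str) -> bool:
    """Match a topic against a filter; only the single-level '+' wildcard is handled."""
    f_levels = topic_filter.split("/")
    t_levels = topic.split("/")
    if len(f_levels) != len(t_levels):
        return False
    return all(f == "+" or f == t for f, t in zip(f_levels, t_levels))


class ToyBroker:
    """In-memory stand-in for a broker: it owns connections and subscriptions."""

    def __init__(self):
        self.inboxes = {}                      # client_id -> delivered (topic, payload) pairs
        self.subscriptions = defaultdict(set)  # client_id -> set of topic filters

    def connect(self, client_id):
        self.inboxes.setdefault(client_id, [])

    def subscribe(self, client_id, topic_filter):
        self.subscriptions[client_id].add(topic_filter)

    def publish(self, topic, payload):
        # The publisher names a topic, never a receiver.
        for client_id, filters in self.subscriptions.items():
            if client_id in self.inboxes and any(matches(f, topic) for f in filters):
                self.inboxes[client_id].append((topic, payload))


broker = ToyBroker()
broker.connect("sensor-1")
broker.connect("dashboard")
broker.subscribe("dashboard", "sensors/+/temp")
broker.publish("sensors/floor1/temp", "23.4")
broker.publish("sensors/floor1/humidity", "40")
print(broker.inboxes["dashboard"])  # [('sensors/floor1/temp', '23.4')]
```

Note that `publish` takes no receiver argument at all. Which clients get the message is decided entirely by state the broker holds.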

Why It Appeared

It solves persistent distribution in weak-connection environments, not just message transfer

If you go back to early telemetry and remote-device networks, the problem is straightforward: terminals are weak, links are expensive, the network is unstable, and many devices sit behind satellite links, mobile networks, or high-latency dedicated lines. Devices want to send status out, and controllers want to send commands back, but traditional request-based interaction does not fit this environment.

If you use a “device polls the server for new commands” model, several costs appear immediately:

The link is already expensive, so frequent polling amplifies empty-request overhead
Devices often go offline, so keeping point-to-point synchronization is hard
Sender and receiver are not online at the same time, so message distribution is naturally misaligned
As the number of terminals grows, direct connections or per-peer maintenance quickly become unmanageable

MQTT’s value is not that it invented messages. It moves message distribution from the two ends of the device network into a middle broker, so terminals know as little about others as possible and only maintain their own connection to the broker.

Its historical background makes it favor field engineering, not pure Internet theory

MQTT was first proposed by Andy Stanford-Clark and Arlen Nipper in the late 1990s, then standardized through OASIS and later adopted by ISO/IEC. It did not first run at Internet scale and then slowly become a standard. From the beginning, it was clearly aimed at telemetry, industrial sites, and remote-device communication.

That directly leads to several design preferences:

Reduce terminal implementation cost first, instead of chasing the richest interaction semantics
Accept that links can break at any time, so leave room in the protocol for recovery paths
Value “it still works in the end” more than making every step strongly synchronous
Assume there is a central node willing to take on connection management, routing, and part of the state burden

That is why many MQTT choices are pragmatic. It does not try to make every device smart enough. It makes the broker carry more complexity so the terminals stay lighter.

The Main Model

The first thing worth keeping in mind is not the bits in the fixed header, but this relationship:

Clients do not look for each other directly
Clients only maintain a long-lived connection to the broker
Messages are named and routed by topic
Delivery reliability is layered by QoS instead of being one-size-fits-all

The main objects are:

Client: can be a publisher, a subscriber, or both at the same time
Broker: terminates client connections, maintains subscriptions, handles message routing, and keeps part of the session state
Topic: the message namespace that decides where messages should be routed
Session: the minimum protocol state kept outside the connection, such as unfinished QoS delivery and subscription information
QoS: delivery level, where different levels mean different acknowledgment cost and duplication risk

This model is easy to misread in two places.

First, MQTT is not a decentralized protocol. It explicitly assumes the broker is the control point. Second, MQTT is not a “naturally reliable message queue.” It only provides different mechanisms for different reliability levels. Whether that finally satisfies the application semantics still depends on session settings, broker behavior, and when devices go online or offline.

Why It Was Designed This Way

Why the broker must sit in the middle instead of devices talking directly

If a publisher had to know all subscribers directly, the system would quickly run into three problems:

The sender would need to maintain receiver addresses, online state, and retry logic
Adding one consumer would force a change in every publisher that feeds it
As the number of terminals grows, the connection graph becomes close to a mesh, and deployment and operations costs rise sharply

The broker-centered design rewrites “who sends to whom” into “who publishes to which topic, and who subscribes to which topic.”

The benefits are:

Publishers and subscribers are decoupled from each other
Terminals only maintain one connection to the broker
Routing, authorization, retained messages, and offline-message strategies can be implemented centrally

The tradeoffs are also obvious:

The broker becomes a single control-plane point and a critical capacity point
Every message passes through the middle layer, so latency and availability depend on the broker
Centralized design makes authorization and topic planning very important

MQTT is not blind to the risk of a central node. It explicitly judged that in weak-terminal and weak-network scenarios, concentrating the complexity is worth it.
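A rough way to see why concentrating complexity pays off is to count connections. The sketch below compares the worst case, where every device keeps a link to every other device, with the broker-centered layout. Real deployments rarely need a full mesh, so treat the first number as an upper bound rather than a measurement.

```python
def mesh_links(n: int) -> int:
    """Links needed if each of n devices connects directly to every other device."""
    return n * (n - 1) // 2


def broker_links(n: int) -> int:
    """Links needed if each of n devices keeps one connection to the broker."""
    return n


for n in (10, 100, 1000):
    print(n, mesh_links(n), broker_links(n))
# 10 45 10
# 100 4950 100
# 1000 499500 1000
```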

Why a topic is a namespace, not a queue name

It is easy to think of a topic as a queue name in a message queue system. That comparison helps a little, but it also misleads easily.

A topic is closer to a hierarchical namespace. The publisher only says, “this message belongs to factory/line1/temp,” and the broker decides who should receive it based on subscriptions. The point is not point-to-point delivery. It is selective distribution based on a namespace.

That design is valuable because:

Business systems can organize message space by device, region, or type
Subscribers can catch a batch of messages with wildcards
Routing rules are expressed more through topic structure than through hard-coded client-address relationships

The cost is that once topic design gets messy, almost everything afterward becomes harder:

Authorization boundaries are harder to draw
Subscription scope is easy to make too broad
Retained-message and shared-subscription semantics become hard to predict

At its core, MQTT places the namespace at the center to replace the explicit topology of “who exactly should I send this message to?”
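Selective distribution by namespace comes down to matching a topic against a topic filter. Below is a minimal sketch of that matching, covering the single-level `+` and multi-level `#` wildcards as well as the rule that topics beginning with `$` are not matched by a filter that starts with a wildcard. The function name `topic_matches` is made up for this example, and the sketch does not validate malformed filters beyond a misplaced `#`.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if `topic` matches `topic_filter` using MQTT '+' and '#' wildcards."""
    f_levels = topic_filter.split("/")
    t_levels = topic.split("/")
    # Topics starting with '$' are not matched by filters that start with a wildcard.
    if topic.startswith("$") and f_levels[0] in ("+", "#"):
        return False
    for i, f in enumerate(f_levels):
        if f == "#":
            # Multi-level wildcard: must be the last level; matches the parent and everything below.
            return i == len(f_levels) - 1
        if i >= len(t_levels):
            return False
        if f != "+" and f != t_levels[i]:
            return False
    return len(f_levels) == len(t_levels)


print(topic_matches("factory/+/temp", "factory/line1/temp"))  # True
print(topic_matches("factory/#", "factory/line1/temp"))       # True
print(topic_matches("factory/+", "factory/line1/temp"))       # False
```

A filter such as `#` or `factory/#` is the kind of overly broad subscription scope described above: one short string silently covers every topic added to that subtree later.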

Why it needs long connections and keep alive

MQTT is not a protocol that waits for a request and then creates a connection on the spot. It is more like saying: since neither side knows when the next message will appear, let us keep the channel open first.

The benefits of a long connection are straightforward:

It avoids frequent connection setup costs
The broker can detect whether the client is still alive more quickly
Downlink messages from the server to the device can be sent immediately over the existing connection

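Keep alive is not just about “staying alive.” More importantly, it gives both sides a time boundary for failure detection. When the link is broken, there is no need to wait forever. If the broker sees no packet from the client within one and a half times the agreed interval, it closes the connection, and a client that gets no PINGRESP to its PINGREQ within a reasonable time does the same. Both sides can then treat the connection as dead and enter reconnection and state recovery logic. Whether the session survives is a separate question.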

So keep alive is not about showcasing activity. It is the time boundary for detecting connection failure.
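The broker-side time boundary can be sketched as a single comparison. This assumes the MQTT 3.1.1 rule that the broker allows one and a half times the keep alive interval before giving up, and that a keep alive of 0 turns the mechanism off. The function name and the use of plain floats for timestamps are choices made for this example.

```python
def broker_timed_out(last_packet_at: float, now: float, keep_alive_s: int) -> bool:
    """True once the broker may treat the connection as dead.

    The broker waits one and a half times the keep alive interval.
    A keep alive of 0 disables the check entirely.
    """
    if keep_alive_s == 0:
        return False
    return (now - last_packet_at) > 1.5 * keep_alive_s


print(broker_timed_out(last_packet_at=0.0, now=89.0, keep_alive_s=60))  # False
print(broker_timed_out(last_packet_at=0.0, now=91.0, keep_alive_s=60))  # True
```

In practice this means a device with a 60-second keep alive can be gone for up to 90 seconds before anyone is told.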

Why reliability is split by QoS instead of making everything maximally reliable

MQTT’s QoS design is a good reflection of its engineering tradeoffs.

QoS 0: at most once, no acknowledgment, lowest cost, and the most likely to be lost
QoS 1: at least once, confirmed with PUBACK, duplicates allowed
QoS 2: exactly once, uses a four-packet handshake (PUBLISH, PUBREC, PUBREL, PUBCOMP) to avoid duplicates, highest cost

If the protocol tried to make every message maximally reliable from the start, several problems would appear immediately:

Terminal implementation complexity would rise
A weak link would make round-trip acknowledgment cost higher
The broker would need to maintain more intermediate state
A large amount of telemetry that could tolerate occasional loss would be forced to bear a high cost

So MQTT does not predefine that “everything must be the most reliable.” It gives the choice back to the application. Temperature telemetry, per-second status updates, critical control commands, and payment confirmations all have different cost models.

The key point is not just that there are three levels. It is that MQTT clearly admits that reliability is something you buy with state and round-trip cost.
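To make "state and round-trip cost" concrete, the sketch below lists the control packets exchanged on a single hop (client to broker, or broker to subscriber) at each QoS level and derives a rough cost summary from them. The dictionary keys are invented for this illustration, and retransmissions are not counted.

```python
# Control packets exchanged on ONE hop (sender -> receiver) per QoS level.
QOS_FLOW = {
    0: ["PUBLISH"],
    1: ["PUBLISH", "PUBACK"],
    2: ["PUBLISH", "PUBREC", "PUBREL", "PUBCOMP"],
}


def hop_cost(qos: int) -> dict:
    """Summarize what one delivery at this QoS level costs on a single hop."""
    flow = QOS_FLOW[qos]
    return {
        "packets": len(flow),
        "round_trips": len(flow) // 2,
        # From QoS 1 upward the sender must keep the message until it is acknowledged.
        "sender_keeps_state": qos >= 1,
        # Only QoS 2 makes the receiver remember the packet identifier to reject duplicates.
        "receiver_keeps_state": qos == 2,
    }


for level in (0, 1, 2):
    print(level, hop_cost(level))
```

On a high-latency link, the jump from zero to two round trips per message is often the deciding factor, more than the extra bytes.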

Why session must exist independently of the instantaneous connection

Device networks often fail in a specific way: TCP is broken, but the device may come back later, and the business does not want it to start over every time.

If MQTT tied all state to the current connection, then once it disconnected:

Subscription relationships would need to be rebuilt completely
In-flight QoS 1 / QoS 2 messages could not be continued
The broker would have no basis for deciding whether to keep messages that arrive while the client is offline

The session exists so the broker and client can keep a small amount of collaborative state outside the connection. In MQTT 3.1 and 3.1.1, this is controlled by the Clean Session flag; in MQTT 5.0, it is split more clearly into Clean Start and the Session Expiry Interval.

The benefit is more natural recovery after disconnection. The cost is:

The broker needs to keep more state
Behavior after reconnect no longer depends only on the current connection
Application teams can easily misread this as “offline messages will definitely come back”

The session makes MQTT feel more like a continuous relationship than an instantaneous request, but it preserves only limited state, not unlimited compensation.
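The difference between connection state and session state can be sketched with a toy broker that follows the MQTT 3.1.1 Clean Session behavior. Everything here is a simplification for illustration: `SessionBroker` is a made-up class, topics are matched exactly, and messages published at QoS 0 while the client is offline are dropped (the specification lets a broker choose whether to store them).

```python
class SessionBroker:
    """Toy broker that keeps per-client session state outside the connection."""

    def __init__(self):
        self.sessions = {}   # client_id -> {"subs": set, "queue": list, "clean": bool}
        self.online = set()

    def connect(self, client_id, clean_session):
        """Return (session_present, messages queued while the client was offline)."""
        session_present = client_id in self.sessions and not clean_session
        if not session_present:
            self.sessions[client_id] = {"subs": set(), "queue": [], "clean": clean_session}
        session = self.sessions[client_id]
        session["clean"] = clean_session
        self.online.add(client_id)
        backlog, session["queue"] = session["queue"], []
        return session_present, backlog

    def disconnect(self, client_id):
        self.online.discard(client_id)
        if self.sessions[client_id]["clean"]:
            del self.sessions[client_id]  # a clean session ends with the connection

    def subscribe(self, client_id, topic):
        self.sessions[client_id]["subs"].add(topic)

    def publish(self, topic, payload, qos):
        """Return the clients that received the message immediately."""
        delivered = []
        for client_id, session in self.sessions.items():
            if topic not in session["subs"]:
                continue
            if client_id in self.online:
                delivered.append(client_id)
            elif qos > 0:
                session["queue"].append((topic, payload))  # kept for the next connect
        return delivered


broker = SessionBroker()
broker.connect("dev", clean_session=False)
broker.subscribe("dev", "cmd/dev")
broker.disconnect("dev")
broker.publish("cmd/dev", "reboot", qos=1)  # queued in the session
broker.publish("cmd/dev", "ping", qos=0)    # dropped in this sketch
print(broker.connect("dev", clean_session=False))  # (True, [('cmd/dev', 'reboot')])
```

The last line shows the limited nature of the recovery: the subscription and the QoS 1 message come back, the QoS 0 message does not.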

Awkward but Important Design Choices

Allowing duplicates in QoS 1 is not a design mistake

QoS 1 often feels uncomfortable: if it is acknowledged, why can duplicates still happen?

The reason is simple. After the sender issues PUBLISH, the message might really not have arrived, or it might have arrived but the PUBACK was lost on the way back. If the protocol wants to keep moving in this uncertainty, it has to allow “I would rather send again than get stuck forever waiting for an acknowledgment.”

This means QoS 1 solves “try not to lose it,” not “never duplicate it.” If deduplication matters to the business, it usually still has to be handled by application-level idempotency keys, message sequence numbers, or state versions.
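The lost-PUBACK case and an application-level fix can be sketched together. In this illustration the sender retransmits until an acknowledgment gets through, and the receiver deduplicates on an idempotency key carried by the application. The names `IdempotentReceiver`, `send_qos1`, and `message_key` are hypothetical. The key is not the MQTT packet identifier, which the protocol reuses. A real receiver would also bound or expire the set of seen keys.

```python
class IdempotentReceiver:
    """Applies each application message once, even if QoS 1 delivers it twice."""

    def __init__(self):
        self.seen = set()   # idempotency keys already applied
        self.applied = []   # payloads that actually took effect

    def on_message(self, message_key, payload):
        # Always acknowledge, but apply the payload only the first time.
        if message_key not in self.seen:
            self.seen.add(message_key)
            self.applied.append(payload)
        return "PUBACK"


def send_qos1(receiver, message_key, payload, acks_lost):
    """Resend until a PUBACK gets through; return the number of transmissions."""
    attempts = 0
    while True:
        attempts += 1
        ack = receiver.on_message(message_key, payload)  # the message arrives every time
        ack_arrived = attempts > acks_lost               # the first `acks_lost` PUBACKs vanish
        if ack == "PUBACK" and ack_arrived:
            return attempts


receiver = IdempotentReceiver()
print(send_qos1(receiver, "cmd-42", "open-valve", acks_lost=1))  # 2
print(receiver.applied)                                          # ['open-valve']
```

The sender transmitted twice and the receiver saw the message twice, yet the command took effect once. That last property comes from the application, not from QoS 1.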

A retained message is not an offline queue

The broker keeps at most one retained message per topic: the most recent one published with the retain flag set. A new subscriber that comes later will immediately receive that retained value, which is very useful for device-state topics.

But retained means “give later arrivals the latest state,” not “queue up all historical messages and replay them later.” If retained is treated as an offline message pile, the expectation will quickly become wrong. It is more like a current snapshot of the topic than a durable history.
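A minimal sketch of the snapshot behavior follows: one slot per topic, overwritten on every retained publish and cleared by a retained publish with an empty payload. `RetainingBroker` is a made-up class that only models the retained store, with exact topic matching and no live delivery.

```python
class RetainingBroker:
    """Toy model of the retained-message store: one slot per topic."""

    def __init__(self):
        self.retained = {}  # topic -> latest retained payload

    def publish(self, topic, payload, retain=False):
        if not retain:
            return  # non-retained messages never touch the store
        if payload == "":
            self.retained.pop(topic, None)  # empty retained payload clears the slot
        else:
            self.retained[topic] = payload  # overwrite, never append

    def subscribe(self, topic):
        """Return what a brand-new subscriber receives immediately."""
        return [self.retained[topic]] if topic in self.retained else []


broker = RetainingBroker()
for value in ("21.0", "22.5", "23.4"):
    broker.publish("factory/line1/temp", value, retain=True)
print(broker.subscribe("factory/line1/temp"))  # ['23.4']
```

Three values were published, and the late subscriber gets one. Anything that needs the other two needs a different mechanism.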

Last Will is not business compensation, it is a disconnect signal

The Last Will and Testament message looks magical: the client tells the broker in advance to publish a message if it disconnects abnormally.

What it really solves is: how do others know that I did not go offline normally, but disappeared suddenly? That is very useful for online status, failure broadcasts, and device availability judgment.

But it is only a connection-level failure signal, not a business compensation transaction. The broker can announce, “this client went offline unexpectedly,” but it cannot infer for the business whether the last command was actually executed.
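The distinction between a clean disconnect and a lost connection can be sketched as two different code paths in the broker. `WillBroker` and the `status/...` topics are invented for this illustration, which models only the will bookkeeping and nothing else.

```python
class WillBroker:
    """Toy model of Last Will handling."""

    def __init__(self):
        self.wills = {}      # client_id -> (topic, payload) registered at CONNECT
        self.published = []  # messages the broker published on a client's behalf

    def connect(self, client_id, will=None):
        if will is not None:
            self.wills[client_id] = will

    def disconnect(self, client_id):
        """Clean DISCONNECT: the will is discarded and nothing is published."""
        self.wills.pop(client_id, None)

    def connection_lost(self, client_id):
        """Keep alive timeout or socket error: the broker publishes the will."""
        will = self.wills.pop(client_id, None)
        if will is not None:
            self.published.append(will)


broker = WillBroker()
broker.connect("pump-a", will=("status/pump-a", "offline"))
broker.connect("pump-b", will=("status/pump-b", "offline"))
broker.disconnect("pump-b")       # normal shutdown, no will
broker.connection_lost("pump-a")  # abnormal loss, will is published
print(broker.published)           # [('status/pump-a', 'offline')]
```

The published will says nothing about the last command sent to `pump-a`. It only reports that the connection ended without a DISCONNECT.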

MQTT says “Message Queuing,” but it is not a traditional queue system

The historical name includes Queuing, but classic MQTT is closer to a broker-driven pub/sub protocol than to a general-purpose message queue platform. It can have retained messages, offline caching, and shared subscriptions, but it does not naturally provide the full set of capabilities found in mature queue systems, such as consumption offsets, replay, long-term backlog management, and complex delivery strategies.

If you treat MQTT as something that can directly replace Kafka or RabbitMQ for every messaging scenario, you will quickly run into semantic boundary problems. MQTT is best at lightweight device access and topic-based distribution, not at endlessly growing into a universal messaging infrastructure.

How It Evolved

MQTT’s evolution did not overthrow the core broker + topic + QoS structure. Instead, it kept the lightweight model and added more capabilities for real-world needs.

Representative directions include:

Gradually tightening interoperability details from 3.1 to 3.1.1
Adding richer reason codes, properties, and error feedback in 5.0
Supporting clearer session lifecycle control and message expiry
Adding shared subscriptions, request-response patterns, and other features that fit better with cloud-side system integration

What stayed the same matters just as much:

The central broker is still there
Topic routing is still there
QoS tiers are still there
New capabilities mostly enter the system by improving sessions, properties, and observability

When you look at MQTT today, the key question is not “how many fields did 5.0 add?” It is whether those additions changed the core judgment. In most cases, they did not.

How to Read MQTT in Real Engineering Work

If you are implementing a minimal usable client, what should you get right first?

Get these basics stable first:

Complete CONNECT / CONNACK and reconnection correctly
Maintain SUBSCRIBE, UNSUBSCRIBE, and topic-filter matching as expected
Make the basic delivery logic of QoS 0 / 1 correct before considering QoS 2
Define the session strategy clearly, and know which state should survive a disconnect
Handle keep alive timeouts, broker disconnects, and retransmission correctly

Many device-side implementations try to bring in every advanced feature from the start, and end up with an unstable reconnect and state-recovery path. For most IoT terminals, getting “it can come back after a drop, and its behavior after recovery is predictable” right is more valuable than supporting a pile of extensions.
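As a starting point for the first item, here is a minimal sketch of building a CONNECT packet and reading a CONNACK, assuming MQTT 3.1.1 (protocol level 4) and a client that sends only a client ID, with no will, username, or password. The function names are made up for this example, and there is no socket handling.

```python
import struct


def encode_remaining_length(n: int) -> bytes:
    """Variable-length encoding used in the MQTT fixed header (7 bits per byte)."""
    if not 0 <= n <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        if n > 0:
            digit |= 0x80  # continuation bit: more bytes follow
        out.append(digit)
        if n == 0:
            return bytes(out)


def encode_string(s: str) -> bytes:
    """UTF-8 string prefixed with its two-byte big-endian length."""
    data = s.encode("utf-8")
    return struct.pack("!H", len(data)) + data


def build_connect(client_id: str, keep_alive_s: int, clean_session: bool) -> bytes:
    """CONNECT for MQTT 3.1.1 with only a client ID in the payload."""
    flags = 0x02 if clean_session else 0x00  # bit 1 is the Clean Session flag
    variable_header = encode_string("MQTT") + bytes([0x04, flags]) + struct.pack("!H", keep_alive_s)
    body = variable_header + encode_string(client_id)
    return bytes([0x10]) + encode_remaining_length(len(body)) + body


def parse_connack(packet: bytes):
    """Return (session_present, return_code) from a 4-byte CONNACK."""
    if len(packet) != 4 or packet[0] != 0x20 or packet[1] != 0x02:
        raise ValueError("not a CONNACK packet")
    return bool(packet[2] & 0x01), packet[3]


print(build_connect("c1", keep_alive_s=60, clean_session=True).hex())
# 100e00044d5154540402003c00026331
print(parse_connack(bytes([0x20, 0x02, 0x01, 0x00])))  # (True, 0)
```

The `session_present` flag in CONNACK is the first thing to check after a reconnect: it tells the client whether the broker still holds the old session or whether subscriptions have to be rebuilt.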

If you are capturing packets or reading logs, what should you check first?

Do not start by staring at every control packet. First check these points:

Whether the connection is really stable, or whether the device keeps reconnecting around CONNECT
Whether the issue is between publisher and broker, or between broker and subscriber
What QoS the current message uses, and which acknowledgments should appear
Whether the client uses a persistent session, and whether the broker actually restores it
Whether the message never reached the broker, or the broker received it but failed to forward it

When analyzing MQTT issues, the most valuable thing is to quickly locate the fault layer:

Connection layer
Session layer
Topic / authorization layer
Broker routing layer
Application idempotency and state-handling layer

If something goes wrong in production, what is the most common failure path?

High-frequency failures are usually not “the protocol is completely broken.” They are cases where one default assumption was wrong:

A device reconnects and gets a new connection, but does not restore the old session
The publisher uses QoS 0, while the business treats the message as guaranteed delivery
The subscriber filter is too broad or too narrow, so it receives too much or nothing at all
The broker receives the message, but authorization, ACLs, or shared-subscription rules block delivery
The business mixes up retained messages, offline messages, and historical replay

In MQTT debugging, confirming which semantics are actually enabled matters more than memorizing packet order.
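One semantic that is easy to check and often overlooked: QoS applies per hop, and the level used from broker to subscriber is the lower of the publish QoS and the QoS granted to the subscription. The function name below is invented for this sketch.

```python
def delivered_qos(publish_qos: int, granted_subscription_qos: int) -> int:
    """QoS used on the broker-to-subscriber hop."""
    return min(publish_qos, granted_subscription_qos)


print(delivered_qos(publish_qos=0, granted_subscription_qos=2))  # 0
print(delivered_qos(publish_qos=2, granted_subscription_qos=1))  # 1
```

A subscriber that asks for QoS 2 still receives QoS 0 traffic at QoS 0, so subscribing at a high level cannot repair a publisher that sends at a low one.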

If you are designing topics, what is the most dangerous default assumption?

The most dangerous assumption is usually: a topic is just a string, so let us get it working first and think later.

At minimum, you should define these boundaries up front:

Whether the topic hierarchy is stable enough to support future ACLs and multi-tenant isolation
Whether state topics and event topics are separated
Which topics are allowed to be retained, and which absolutely are not
Whether shared subscriptions, broadcast subscriptions, and device-specific subscriptions will be mixed

Once topic design gets out of control, broker authorization, consumption semantics, and debugging cost all get worse together.
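One way to keep these boundaries enforceable is to write the topic layout down as code. The sketch below assumes a purely hypothetical layout, `tenant/<tenant>/device/<device>/<kind>/<name>`, where `<kind>` is either `state` or `event` and only state topics may be retained. The layout, the function names, and the rules are illustrations, not an MQTT convention.

```python
ALLOWED_KINDS = {"state", "event"}  # hypothetical: the two topic families kept apart


def parse_topic(topic: str) -> dict:
    """Parse 'tenant/<tenant>/device/<device>/<kind>/<name>' or raise ValueError."""
    parts = topic.split("/")
    if len(parts) != 6 or parts[0] != "tenant" or parts[2] != "device":
        raise ValueError(f"topic does not follow the layout: {topic!r}")
    if any(p == "" or "+" in p or "#" in p for p in parts):
        raise ValueError(f"empty level or wildcard in a publish topic: {topic!r}")
    _, tenant, _, device, kind, name = parts
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown kind {kind!r} in {topic!r}")
    return {"tenant": tenant, "device": device, "kind": kind, "name": name}


def may_retain(topic: str) -> bool:
    """Only state topics may carry the retain flag in this layout."""
    return parse_topic(topic)["kind"] == "state"


def tenant_acl_filter(tenant: str) -> str:
    """A single filter that covers exactly one tenant's subtree."""
    return f"tenant/{tenant}/#"


print(may_retain("tenant/acme/device/d1/state/temp"))       # True
print(may_retain("tenant/acme/device/d1/event/door-open"))  # False
print(tenant_acl_filter("acme"))                            # tenant/acme/#
```

Because the tenant sits at a fixed level, isolation reduces to one filter per tenant. A layout where the tenant appears at varying depths would not allow that.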

The most important thing to remember about MQTT is not how many control packets it has. It is that it uses the broker to unify connection relationships, uses sessions to resist disconnection, and uses QoS to assign different costs to different messages. When you implement, capture, or design a system, it is usually more effective to first identify which layer of semantics you depend on than to dive directly into field details.
