ARP


Sometimes the target IP and next hop are already known, yet packets still cannot be sent. In many such cases the problem is not at the IP, TCP, or HTTP layer. It is one step lower: the host knows which IP it wants to reach, but it still does not know which MAC address on the local link should receive the frame.

That is what ARP handles. It looks like a tiny helper protocol, but it actually performs one of the most important handoff tasks in IPv4 local delivery: mapping a Layer-3 address to a Layer-2 address that is reachable on the local link. Without that step, IP knows the direction but not the concrete local receiver.

The core of ARP is not “look up a MAC address.”
It is to resolve the next-hop IPv4 address into a local Layer-2 destination that can actually receive the frame.

Why It Exists

IP addresses and MAC addresses solve different problems:

IP addresses handle inter-network addressing and routing
MAC addresses handle frame delivery on the local link

When a host is about to send an IPv4 packet, even if it already knows the destination IP or next-hop IP, it cannot directly turn “send to 192.168.1.1” into an Ethernet frame. When the NIC actually transmits, it must know the destination MAC on the local link.

That is why ARP exists: in an IPv4 Ethernet environment, Layer-3 and Layer-2 need a local resolution step between them.

What It Was Built For, and Against What Background

ARP comes from the early TCP/IP stack (it was specified in RFC 826 in 1982) and solves a very practical problem in a multi-layer model: different layers have different address semantics, but the host must finally turn the upper-layer choice into the real local delivery target.

Its design focus is very clear:

It only works inside the local broadcast domain
It does not try to solve cross-subnet addressing
It does not build complicated connection state
It uses caching to reduce repeated lookup cost

That is why ARP looks simple, but it naturally brings broadcast, caching, and trust-boundary problems with it.

The Main Model

To understand ARP, keep these two questions in mind:

Is the IP I am about to send to a local subnet peer?
If not, who is the first hop gateway on this local link?

What ARP really resolves is never “who is the final destination.” It is “who should receive the next frame on the local link.”

The logic can be reduced to this:

Same subnet:
Host A -> ARP(B's IP) -> get B's MAC -> send frame to B

Cross subnet:
Host A -> ARP(gateway IP) -> get gateway MAC -> send frame to gateway

If you do not get this boundary right, many packet-capture symptoms will be misread.
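
The same-subnet versus cross-subnet decision can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name arp_target is hypothetical, the host has a single interface and a single default gateway, and the host address 192.168.1.10/24 is made up for the example. A real host consults its full routing table rather than one prefix.

```python
import ipaddress


def arp_target(dst_ip: str, local_interface: str, gateway_ip: str) -> str:
    """Return the IPv4 address the host would ARP for when sending to dst_ip.

    local_interface is the host's own address in CIDR form, e.g. "192.168.1.10/24".
    """
    iface = ipaddress.ip_interface(local_interface)
    dst = ipaddress.ip_address(dst_ip)
    if dst in iface.network:
        # Same subnet: resolve the peer itself.
        return str(dst)
    # Different subnet: resolve the first-hop gateway instead.
    return gateway_ip


# Same-subnet peer: ARP for the peer's own IP.
print(arp_target("192.168.1.20", "192.168.1.10/24", "192.168.1.1"))
# Remote server: ARP for the gateway's IP, not the server's.
print(arp_target("203.0.113.10", "192.168.1.10/24", "192.168.1.1"))
```

The point of the sketch is that the final destination IP never reaches ARP unless it happens to be on the local subnet.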

A Typical Path

First, same-subnet communication:

sequenceDiagram
    participant A as Host A
    participant B as Host B
    A->>B: ARP Request (Who has 192.168.1.20?)
    B->>A: ARP Reply (192.168.1.20 is at 00:11:22:33:44:55)
    A->>B: Ethernet Frame(dst=00:11:22:33:44:55) + IP Packet

Then, cross-subnet communication:

sequenceDiagram
    participant A as Host A
    participant G as Gateway
    participant R as Remote Host
    A->>G: ARP Request (Who has 192.168.1.1?)
    G->>A: ARP Reply (192.168.1.1 is at aa:bb:cc:dd:ee:ff)
    A->>G: Ethernet Frame(dst=aa:bb:cc:dd:ee:ff) + IP Packet(dst=203.0.113.10)
    G->>R: Gateway continues routing and forwarding

In both paths, ARP only handles the local first hop. It does not care how the remote host ultimately receives the packet, and it does not participate in cross-subnet routing decisions.

Why ARP Starts with Broadcast

When a host is ready to send a frame, it usually knows only:

The destination IP or next-hop IP
Its own subnet and routing information

It does not know which local node owns the MAC address for that IP. For that “I know your Layer-3 identity, but not where you are on Layer 2” problem, the most direct solution is to ask the whole local broadcast domain.

The essence of an ARP Request is:

Broadcast to the whole local link
Ask “who owns this IP?”
Let the true owner reply by unicast

Broadcast is not elegant, but it is the cheapest discovery method to implement on the local link, because it needs no directory and no prior configuration.
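
To make the broadcast concrete, here is a minimal sketch that builds the bytes of an ARP Request inside an Ethernet frame, following the standard ARP-over-Ethernet field layout. The helper names and the sender MAC 00:11:22:33:44:aa are assumptions made up for the example. The code only constructs the frame; actually transmitting it would need a raw socket and elevated privileges, which are left out here.

```python
import socket
import struct

BROADCAST_MAC = bytes.fromhex("ffffffffffff")
ETHERTYPE_ARP = 0x0806


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(sender_mac: str, sender_ip: str, target_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP Request for target_ip."""
    src = mac_to_bytes(sender_mac)
    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    eth_header = BROADCAST_MAC + src + struct.pack("!H", ETHERTYPE_ARP)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        1,                            # opcode: 1 = request
        src,                          # sender MAC
        socket.inet_aton(sender_ip),  # sender IP
        bytes(6),                     # target MAC: unknown, all zeros
        socket.inet_aton(target_ip),  # target IP: the address being asked about
    )
    return eth_header + arp_payload


frame = build_arp_request("00:11:22:33:44:aa", "192.168.1.10", "192.168.1.20")
print(len(frame), frame.hex())
```

The destination MAC in the Ethernet header is all ones, while the target MAC inside the ARP payload is all zeros: the frame goes to everyone precisely because the sender does not yet know who the owner is.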

Why the Cache Must Be Soft State

If every packet required a new ARP broadcast, the overhead would be too high, so the host caches the result. But that cache cannot be treated as permanent truth, because real networks change:

A host may move to another port
A NIC may be replaced
A gateway may fail over to another MAC
Virtualization, containers, and migration can change Layer-2 ownership

That is why the ARP cache must be soft state:

Remember it for a while
Refresh it after expiration
Let new broadcasts or replies update it when needed

This is also why “it worked just a moment ago, and now it is broken” is often a cache-state problem.
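
The soft-state idea can be sketched as a small cache whose entries expire. This is not how any particular operating system implements its neighbor table: the class name ArpCache, the 60-second lifetime, and the injectable clock are all assumptions chosen to keep the example short and testable.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ArpCache:
    """Toy IP-to-MAC cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def learn(self, ip: str, mac: str) -> None:
        """Store or overwrite a mapping and restart its lifetime."""
        self._entries[ip] = (mac, self._clock() + self._ttl)

    def lookup(self, ip: str) -> Optional[str]:
        """Return the cached MAC, or None if it is unknown or expired."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: forget it so the caller has to resolve again.
            del self._entries[ip]
            return None
        return mac


# A fake clock makes the expiry visible without waiting.
now = [0.0]
cache = ArpCache(ttl_seconds=60.0, clock=lambda: now[0])
cache.learn("192.168.1.1", "aa:bb:cc:dd:ee:ff")
print(cache.lookup("192.168.1.1"))   # still fresh
now[0] = 61.0
print(cache.lookup("192.168.1.1"))   # expired, so the host must ARP again
```

Note that learn() overwrites silently. That is what lets a failover or migration correct the cache, and it is also what makes a stale or wrong reply so easy to accept.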

Why Cross-Subnet Traffic Is Different

This is one of the most common misunderstandings.

If the destination IP is outside the local subnet, the host does not ARP for the remote host’s IP. The reason is simple: ARP broadcasts stay on the local link because routers do not forward them, so the remote host cannot hear them.

In that case, the ARP target becomes the default gateway’s local IP. In other words:

The IP layer decides that the next hop is the gateway
ARP resolves the gateway IP into the gateway MAC

So when a capture shows:

The IP packet’s destination is a remote server
But the Ethernet frame’s destination MAC is the gateway

that is normal, not a mismatch.

Why Gratuitous ARP Looks Strange but Is Useful

Some ARP packets are not answers to other hosts. They proactively announce:

Which MAC currently owns a certain IP
Or whether that IP is already in use by someone else

These are often called Gratuitous ARP. They are very useful in practice:

Fast refresh of neighbor caches during active/passive failover
Making nearby devices learn a new owner quickly after an IP move
Detecting address conflicts at startup

It looks like “answering when nobody asked,” but in dynamic networks it is how neighbor caches get corrected before they expire on their own.
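
When reading raw captures, a Gratuitous ARP can usually be recognized by one property: the sender IP and the target IP in the ARP payload are the same. The sketch below parses the 28-byte ARP payload for IPv4 over Ethernet and applies that check. The function names are hypothetical, and the "sender IP equals target IP" rule is the common convention rather than a complete classifier (for example, address probes that use a sender IP of 0.0.0.0 are not matched).

```python
import socket
import struct


def parse_arp(payload: bytes) -> dict:
    """Parse the 28-byte ARP payload for IPv4 over Ethernet."""
    _htype, _ptype, _hlen, _plen, oper, sha, spa, tha, tpa = struct.unpack(
        "!HHBBH6s4s6s4s", payload[:28]
    )
    return {
        "opcode": oper,  # 1 = request, 2 = reply
        "sender_mac": ":".join(f"{b:02x}" for b in sha),
        "sender_ip": socket.inet_ntoa(spa),
        "target_mac": ":".join(f"{b:02x}" for b in tha),
        "target_ip": socket.inet_ntoa(tpa),
    }


def is_gratuitous(arp: dict) -> bool:
    """Sender announces its own IP: sender IP and target IP are identical."""
    return arp["sender_ip"] == arp["target_ip"]


# Hand-built sample: the gateway announcing 192.168.1.1 at aa:bb:cc:dd:ee:ff.
sample = struct.pack(
    "!HHBBH6s4s6s4s",
    1, 0x0800, 6, 4, 1,
    bytes.fromhex("aabbccddeeff"), socket.inet_aton("192.168.1.1"),
    bytes(6), socket.inet_aton("192.168.1.1"),
)
arp = parse_arp(sample)
print(arp["sender_ip"], arp["sender_mac"], is_gratuitous(arp))
```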

Why ARP Is So Easy to Turn into a Trust Problem

ARP has no built-in authentication. On the local link, many devices will accept any host's claim that “this IP is mine” and put it into cache, often overwriting an entry they already had.

That leads to two kinds of problems:

Unintentional configuration conflicts: two devices think they should use the same IP
Malicious ARP spoofing: someone intentionally redirects traffic to the wrong device

So the risk is not that ARP is complex. The risk is that it trusts the local link too much. It works well in a closed, default-trusted local broadcast domain, but once the environment is not trusted, you need switch security features, static bindings, or higher-layer policy to reinforce it.

What to Look at in Captures

Do not start by staring at field names. A more useful order is:

First, see who is being asked about

Confirm:

What the requested target IP is
Whether that IP is a same-subnet host or the default gateway
Whether anyone replies after the broadcast

For many ARP failures, the first diagnostic question is not “is the format correct?” It is “did we ask about the right IP, on the right link?”

Then, look for cache conflicts or repeated refreshes

Common symptoms include:

The same IP keeps mapping to different MAC addresses
The broadcast gets no reply
A learned cache entry is immediately replaced by another reply
Gratuitous ARP keeps appearing repeatedly in the network

These can directly point to address conflicts, active/passive failover, VM migration, or ARP spoofing.
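
The first symptom, one IP mapping to several MAC addresses, is easy to check mechanically once the sender IP and sender MAC of each ARP packet have been extracted from a capture. The sketch below assumes that extraction has already been done and the result is a list of (ip, mac) pairs. The function name and the sample MAC de:ad:be:ef:00:01 are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_ip_mac_conflicts(
    observations: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Return every IP that was seen with more than one MAC, in order of appearance."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for ip, mac in observations:
        mac = mac.lower()  # compare MACs case-insensitively
        if mac not in seen[ip]:
            seen[ip].append(mac)
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}


observed = [
    ("192.168.1.1", "aa:bb:cc:dd:ee:ff"),
    ("192.168.1.20", "00:11:22:33:44:55"),
    ("192.168.1.1", "AA:BB:CC:DD:EE:FF"),   # same MAC, different case
    ("192.168.1.1", "de:ad:be:ef:00:01"),   # a second claimant for the gateway IP
]
print(find_ip_mac_conflicts(observed))
```

A hit from a check like this does not say which explanation applies. Failover, migration, misconfiguration, and spoofing all produce the same pattern, so the timing and the identity of the second MAC still have to be examined.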

Finally, see whether the problem was misattributed to IP/TCP

Many problems that look like “IP is broken,” “TCP connection failed,” or “the server did not respond” are actually rooted earlier:

The correct MAC was never learned
The frame was sent to the wrong device
The gateway MAC cache is stale or wrong

If the ARP layer is not stable, everything above it will look dead too.

What to Think About ARP Today

Do not treat ARP as “just looking up a MAC.” It is responsible for IPv4 local first-hop reachability
Do not assume the ARP target is the final destination. Across subnets, ARP resolves the gateway
Do not treat the ARP cache as permanent state. Many failures come from its soft-state nature
Do not ignore Gratuitous ARP. It is often the key clue for failover, migration, and conflict detection
Do not push every ARP problem up into Layer 3 and above. If the local first hop is wrong, everything after it will be wrong too