Kubernetes Network Infrastructure — L2, BGP, and On-Prem Load Balancing
A companion to the Kubernetes Services and DNS Guide. That doc covers Kubernetes service types, DNS, and traffic policies. This doc goes deeper into the network infrastructure underneath — how traffic actually enters the cluster on-prem, the protocols involved, and the load balancing options available when there’s no cloud provider.
1. The On-Prem Problem
In cloud environments (EKS, GKE, AKS), creating a LoadBalancer Service automatically provisions a cloud load balancer. On-prem, there’s no cloud API to call — the Service stays in Pending state unless something fulfills it.
The two main approaches to solve this:
- L2 Announcements — a node claims a Virtual IP (VIP) via ARP on the local network segment
- BGP Route Advertisement — nodes advertise VIP routes to upstream routers
Both make LoadBalancer Services work on bare-metal clusters by assigning external IPs and making them reachable from outside the cluster.
2. L2 Announcements
How It Works
L2 Announcements make service VIPs reachable on the local area network by responding to ARP queries. One node at a time “owns” each VIP.
- A LoadBalancer Service is created and assigned a VIP (e.g., 192.168.1.100)
- Via leader election, one node becomes the owner of that VIP
- When any device on the L2 segment sends an ARP request (“Who has 192.168.1.100?”), the owning node responds with its MAC address
- All traffic for that VIP flows to the owning node
- The owning node uses kube-proxy/eBPF to load-balance across backend pods on any node
Client → ARP "Who has 192.168.1.100?" → Node2 responds with its MAC
Client → 192.168.1.100 → Node2 → kube-proxy → Pod (any node)
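With MetalLB, for example, L2 mode is configured with two resources: an address pool the VIPs are drawn from, and an L2Advertisement that announces them via ARP. A minimal sketch (the pool name and address range are illustrative):

```yaml
# Assign VIPs from this range to LoadBalancer Services
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: office-pool          # illustrative name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.100-192.168.1.120
---
# Announce those VIPs via ARP on the local L2 segment
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: office-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - office-pool
```

Once applied, any Service of type LoadBalancer gets an external IP from the pool, and one node starts answering ARP for it.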
Failover via Gratuitous ARP
When the owning node fails:
- Another node wins leader election for the VIP
- It sends a gratuitous ARP — an unsolicited broadcast announcing “192.168.1.100 is now at my MAC”
- All devices on the segment (switches, routers, hosts) update their ARP tables
- Traffic seamlessly redirects to the new node
Advantages Over NodePort
- Each service gets a unique IP — multiple services can use port 80/443
- Clients don’t need to know about individual node IPs
- Automatic failover — VIP migrates to a healthy node
Limitations
- Single entry point — all traffic for a VIP enters through one node (potential bottleneck)
- L2 scope only — only works within a single broadcast domain (VLAN/subnet). The router/gateway for external clients must be on the same L2 segment as the cluster nodes
- No load balancing before traffic reaches the cluster — unlike BGP with ECMP, the upstream network cannot spread a VIP’s traffic across multiple nodes
Who Actually Sends the ARP?
Clients outside the cluster don’t send ARP directly. The router/gateway on the same L2 segment does:
- External client sends packet to VIP 192.168.1.100
- Internet/WAN routing delivers it to the local router
- Router sees VIP is in its local subnet, sends ARP to resolve the MAC
- Kubernetes node responds, router forwards the packet
This is why L2 Announcements are described as being for “office or campus networks” — the cluster’s gateway must be on the same broadcast domain.
3. BGP Route Advertisement
BGP is the enterprise-grade alternative that works across routed networks (multiple subnets, data centers).
How It Works
Instead of ARP, Kubernetes nodes establish BGP peering sessions (TCP port 179) with upstream routers and advertise VIP routes:
- Each node configured as a BGP speaker connects to the upstream router
- When a node owns a VIP, it sends a BGP UPDATE: “Route to 192.168.100.10/32 via me”
- The router installs this route and can propagate it further
- Traffic for the VIP is routed to the advertising node
Node1 → Router: BGP UPDATE "192.168.100.10/32 via 192.168.1.10"
Router installs route → forwards traffic to Node1
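With MetalLB, for example, BGP mode pairs a BGPPeer (the session to the upstream router) with a BGPAdvertisement. A minimal sketch (ASNs, addresses, and pool name are all illustrative):

```yaml
# Peering session with the upstream router
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: upstream-router
  namespace: metallb-system
spec:
  myASN: 64512              # cluster's private ASN
  peerASN: 64513            # router's ASN
  peerAddress: 192.168.1.1  # router's IP
---
# Advertise VIPs from the pool as /32 routes to that peer
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: vip-routes
  namespace: metallb-system
spec:
  ipAddressPools:
    - vip-pool              # assumed name of an existing IPAddressPool
```

Each node running a MetalLB speaker opens a session to the router and sends UPDATEs for the VIPs it serves.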
ECMP — True Multi-Node Load Balancing
Unlike L2 (single owner), BGP supports Equal Cost Multi-Path — multiple nodes advertise the same route:
Node1 → Router: "192.168.100.10/32 via 192.168.1.10"
Node2 → Router: "192.168.100.10/32 via 192.168.1.11"
Node3 → Router: "192.168.100.10/32 via 192.168.1.12"
Router: ECMP group → load-balances across all three nodes
This eliminates the single-entry-point bottleneck of L2.
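How an ECMP group keeps flows sticky while spreading load can be sketched with a toy hash (real routers compute a similar 5-tuple hash in hardware with vendor-specific functions; this is illustrative only):

```python
import hashlib

def ecmp_next_hop(src_ip, src_port, dst_ip, dst_port, proto, next_hops):
    """Pick a next hop by hashing the flow 5-tuple (simplified model of
    router ECMP; a given flow always maps to the same next hop)."""
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}:{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]

nodes = ["192.168.1.10", "192.168.1.11", "192.168.1.12"]

# The same flow always lands on the same node (no mid-connection moves)...
a = ecmp_next_hop("203.0.113.5", 40000, "192.168.100.10", 443, "tcp", nodes)
b = ecmp_next_hop("203.0.113.5", 40000, "192.168.100.10", 443, "tcp", nodes)
assert a == b

# ...while distinct flows spread across the ECMP group
hops = {ecmp_next_hop("203.0.113.5", p, "192.168.100.10", 443, "tcp", nodes)
        for p in range(40000, 40100)}
print(f"100 flows landed on {len(hops)} of {len(nodes)} nodes")
```

Per-flow hashing is why ECMP avoids packet reordering: every packet of a TCP connection takes the same path.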
Failover
When a node fails, its BGP session drops. The router receives a WITHDRAW message (or detects session timeout) and removes that route. Traffic redistributes across remaining nodes.
Requirements
- BGP-capable routers (Cisco, Juniper, Arista, or open-source FRR)
- ASN allocation (private ASNs 64512–65535 for internal use)
- BGP peering configuration on both the router and Kubernetes side
- Network engineering expertise
BGP Message Types
| Message | Purpose |
|---|---|
| OPEN | Establish peering session (ASN, capabilities) |
| UPDATE | Advertise new routes or withdraw old ones |
| KEEPALIVE | Maintain session (heartbeat) |
| NOTIFICATION | Signal errors or close session |
4. L2 vs BGP — When to Use Which
| | L2 Announcements | BGP |
|---|---|---|
| Network scope | Single broadcast domain | Multi-subnet, multi-datacenter |
| Protocol | ARP (Layer 2) | BGP over TCP (Layer 3) |
| Load balancing | Single node per VIP | ECMP across multiple nodes |
| Failover | Gratuitous ARP (seconds) | Route withdrawal (tunable, sub-second with BFD) |
| Infrastructure needed | None (just a LAN) | BGP-capable routers |
| Expertise needed | Minimal | Network engineering |
| Best for | Office/campus, simple setups | Enterprise, multi-segment networks |
Choose L2 when:
- Simple office/campus network, single subnet
- Limited networking expertise
- Moderate traffic, simplicity over performance
Choose BGP when:
- Enterprise network with multiple subnets/VLANs
- Need true multi-node load balancing (ECMP)
- Multi-datacenter deployments
- Have BGP-capable infrastructure and expertise
5. On-Prem Load Balancer Solutions
Three categories of solutions fulfill LoadBalancer Services on bare-metal:
MetalLB
- What it does: VIP advertisement only (L2 or BGP). Assigns external IPs to Services and announces them.
- What it doesn’t do: No data plane — relies entirely on kube-proxy for packet forwarding
- Works with: Any CNI
- Maturity: Widely deployed, large community
Cilium LB-IPAM
- What it does: Full load balancing integrated into the CNI — IP allocation, L2/BGP advertisement, and eBPF-based data plane
- Advantages over MetalLB: Better performance (eBPF, no iptables), built-in observability via Hubble, integrated network policy enforcement
- Trade-off: Must use Cilium as your CNI
LoxiLB
- What it does: Complete eBPF-based load balancer with its own control and data plane
- Differentiator: Designed for telecom — strong SCTP support, active health checks per Service, built-in metrics
- Works with: Any CNI (standalone, like MetalLB)
- Trade-off: Smaller community, CNCF sandbox project
Comparison
| | MetalLB | Cilium LB-IPAM | LoxiLB |
|---|---|---|---|
| Data plane | None (kube-proxy) | eBPF (integrated) | eBPF (standalone) |
| Health checks | None (relies on K8s) | Integrated | Active per-service |
| Observability | Basic metrics | Hubble (deep) | Built-in dashboards |
| CNI requirement | Any | Cilium only | Any |
| Protocol support | TCP/UDP | TCP/UDP | TCP/UDP/SCTP |
| Community | Large | Large (CNCF graduated) | Small (CNCF sandbox) |
Control Plane vs Data Plane
In the context of these load balancers:
- Control plane — decides where traffic goes: service discovery, endpoint management, health checking, VIP assignment, BGP/ARP advertisement, leader election
- Data plane — actually forwards packets: NAT, load balancing to backends, connection tracking, packet filtering
MetalLB only has a control plane. Cilium and LoxiLB have both.
6. Overlay vs Native CNI
This distinction matters for load balancing because it determines whether pod IPs are routable from outside the cluster.
Overlay CNI (Flannel, Weave)
Pods get IPs from a virtual address space. Traffic between nodes is encapsulated (typically VXLAN):
Pod on Node1 (10.244.0.5) → Pod on Node2 (10.244.1.8):
Original: [src: 10.244.0.5 → dst: 10.244.1.8]
Encapsulated: [src: 192.168.1.10 → dst: 192.168.1.11] wrapping [10.244.0.5 → 10.244.1.8]
The physical network only sees host-to-host traffic. Pod IPs (10.244.x.x) are not routable from outside the cluster — external routers have no routes to them.
Consequence: External load balancers cannot target pod IPs directly. Traffic must enter via NodePort or a VIP on a node.
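The wrapping shown above carries a fixed per-packet cost. A quick back-of-the-envelope using standard VXLAN header sizes shows where Flannel's familiar 1450-byte pod MTU comes from:

```python
# Per-packet overhead added by VXLAN encapsulation (bytes)
OUTER_ETHERNET = 14   # outer MAC header (node-to-node)
OUTER_IPV4     = 20   # outer IP header
OUTER_UDP      = 8    # VXLAN runs over UDP (port 4789)
VXLAN_HEADER   = 8    # flags + VNI
INNER_ETHERNET = 14   # the encapsulated frame keeps its own MAC header

overhead = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER
print(f"VXLAN adds {overhead} bytes per packet")

# With a 1500-byte MTU on the physical network, the inner packet must
# shrink so the encapsulated frame still fits on the wire:
inner_mtu = 1500 - (OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET)
print(f"Usable inner MTU: {inner_mtu} bytes")  # 1450
```

That 50-byte tax per packet (plus encap/decap CPU work) is the "performance overhead" row in the comparison table below.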
Native/Routed CNI (AWS VPC CNI, Calico BGP mode)
Pods get real IPs from the host network (e.g., VPC subnet). No encapsulation — packets route directly.
Pod on Node1 (10.0.1.101) → Pod on Node2 (10.0.1.201):
Packet: [src: 10.0.1.101 → dst: 10.0.1.201] ← routed directly, no wrapping
The network infrastructure knows about pod IPs. External load balancers can target pods directly (IP target mode).
Consequence: Cloud LBs (NLB/ALB) can skip NodePort entirely and route to pod IPs. See IP Target vs Instance Target Mode in the services doc.
| | Overlay (Flannel) | Native (VPC CNI) |
|---|---|---|
| Pod IPs | Virtual, not externally routable | Real, externally routable |
| Encapsulation | Yes (VXLAN/UDP) | No |
| Performance | Overhead from encap/decap | Native speed |
| External LB → pod | Not possible | Possible (IP target mode) |
| Portability | Works anywhere | Requires specific infra |
7. External Load Balancers (NGINX, HAProxy)
Traditional load balancers can front a Kubernetes cluster on-prem. The main challenge is service discovery — how does the LB know about pod/node endpoints?
Option 1: Static NodePort Config
Point the LB at node IPs + NodePort. Simple but requires manual updates when nodes change.
# nginx.conf
upstream backend {
server 192.168.1.10:30080; # node1
server 192.168.1.11:30080; # node2
server 192.168.1.12:30080; # node3
}
Option 2: Dynamic Discovery via Consul
Consul’s consul-k8s controller watches the Kubernetes API for Service/Endpoints changes and syncs them into Consul’s service catalog. Then consul-template regenerates the LB config automatically:
K8s Endpoints change → consul-k8s syncs to Consul catalog →
consul-template regenerates nginx.conf → NGINX reloads
Template example:
upstream backend {
{{range service "web-service"}}
server {{.Address}}:{{.Port}};
{{end}}
}
Each pod backing the service is registered individually in Consul — the LB gets fine-grained endpoint awareness.
Option 3: External Ingress Controller
Run an ingress controller (HAProxy, NGINX) outside the cluster, connected to the Kubernetes API. It watches Ingress/Service resources and configures itself dynamically.
When to Use External LBs
- Need advanced L7 features (caching, rate limiting, WAF)
- Already have NGINX/HAProxy expertise and infrastructure
- Want to load-balance both K8s and non-K8s services
- Need dedicated hardware for LB performance
Comparison with K8s-Native Solutions
| | MetalLB/Cilium/LoxiLB | External NGINX/HAProxy |
|---|---|---|
| Service discovery | Automatic (K8s native) | Needs Consul/static config |
| Failover | Built-in leader election | Separate HA setup needed |
| L7 features | Limited | Rich (caching, WAF, etc.) |
| Operational model | K8s-native | Traditional infra |
8. Protocol Deep Dives
ARP (Address Resolution Protocol)
ARP resolves IP addresses to MAC addresses on a local network segment.
ARP Request — broadcast to FF:FF:FF:FF:FF:FF (MAC broadcast, not IP 255.255.255.255):
Ethernet: dst=FF:FF:FF:FF:FF:FF src=<requester MAC>
ARP: op=REQUEST sender-ip=192.168.1.50 target-ip=192.168.1.100
Every device on the L2 segment receives it. Only the device owning the target IP responds (unicast):
Ethernet: dst=<requester MAC> src=<owner MAC>
ARP: op=REPLY sender-ip=192.168.1.100 sender-mac=<owner MAC>
Gratuitous ARP — unsolicited broadcast announcing an IP→MAC mapping. Used during VIP failover to force all devices to update their ARP tables immediately.
Security note: ARP has no authentication. Any device can claim any IP (ARP spoofing). In Kubernetes, leader election ensures only one node claims each VIP.
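The request structure above can be sketched as raw bytes. This is an illustration of the frame layout only, not a production sender (the MAC and IPs are made up, and actually transmitting would require a raw socket with CAP_NET_RAW):

```python
import socket
import struct

def arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Build a raw ARP who-has frame: 14-byte Ethernet header + 28-byte ARP payload."""
    broadcast = b"\xff" * 6
    eth = broadcast + sender_mac + struct.pack("!H", 0x0806)  # EtherType = ARP
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                       # HTYPE: Ethernet
        0x0800,                  # PTYPE: IPv4
        6, 4,                    # hardware / protocol address lengths
        1,                       # OPER: 1 = request ("who has?")
        sender_mac, socket.inet_aton(sender_ip),
        b"\x00" * 6,             # target MAC unknown -- that's the question
        socket.inet_aton(target_ip),
    )
    return eth + arp

frame = arp_request(bytes.fromhex("aabbccddeeff"), "192.168.1.50", "192.168.1.100")
print(len(frame), frame[:6].hex())  # 42 ffffffffffff (42-byte frame, broadcast dst)
```

A gratuitous ARP uses the same layout with OPER=2 (reply), sender and target IP both set to the VIP, and the broadcast destination kept, which is what forces every device on the segment to update its table.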
BGP (Border Gateway Protocol)
BGP exchanges routing information between autonomous systems over persistent TCP sessions (port 179). Unlike ARP broadcasts, BGP is point-to-point and stateful.
Session lifecycle:
Idle → Connect → OpenSent → OpenConfirm → Established
                                          ↑ routes exchanged here
Route advertisement (UPDATE message):
NLRI: 192.168.100.10/32 (the VIP)
Next-Hop: 192.168.1.10 (the advertising node)
AS-Path: 65001 (originating AS)
Route withdrawal — when a node stops owning a VIP or its session drops, the router removes the route and stops forwarding traffic to it.
Path attributes control how routes propagate: AS-Path (loop prevention), Local-Pref (prefer certain paths), MED (influence inbound traffic), Communities (tagging for policy).
SCTP (Stream Control Transmission Protocol)
A transport protocol combining TCP reliability with UDP message boundaries:
- Multi-streaming — multiple independent data streams in one connection
- Multi-homing — single connection across multiple network paths for redundancy
- Message-oriented — preserves message boundaries (unlike TCP’s byte stream)
Primarily used in telecom (SIP signaling, Diameter, SS7-over-IP). LoxiLB’s strong SCTP support is why it’s positioned for telecom environments.
9. Network Addressing Modes
Understanding how packets are addressed ties together many concepts in this doc.
Unicast (one-to-one)
A packet sent to a single specific destination. Most network traffic is unicast.
- BGP peering sessions — TCP between one node and one router
- Pod-to-pod traffic — one source IP to one destination IP
- kube-proxy forwarding — after DNAT, traffic goes to one specific pod
Broadcast (one-to-all-on-segment)
A packet sent to all devices on a local network segment.
- ARP requests — MAC broadcast FF:FF:FF:FF:FF:FF, received by all devices on the L2 segment
- Limited broadcast (255.255.255.255) — IP-level broadcast, never forwarded by routers (RFC 919). Used by DHCP discovery.
- Directed broadcast (e.g., 192.168.2.255) — targets a specific subnet’s broadcast address. Routers can forward but usually don’t (security risk).
L2 Announcements rely on ARP broadcast, which is why they’re confined to a single broadcast domain.
Anycast (one-to-nearest)
The same IP address is advertised from multiple locations. Routers send traffic to the nearest/best one.
This is exactly what BGP ECMP does in section 3 — multiple nodes advertise the same VIP route:
Node1 → Router: "192.168.100.10/32 via me"
Node2 → Router: "192.168.100.10/32 via me"
Node3 → Router: "192.168.100.10/32 via me"
Router: picks nearest/best path (or load-balances across all)
Anycast is also used at internet scale:
- DNS root servers — the same IP serves DNS from hundreds of locations worldwide
- CDNs (Cloudflare, AWS CloudFront) — client traffic routes to the nearest edge node
- AWS Global Accelerator — anycast IPs route to the nearest AWS region
The key property: anycast provides geographic load balancing and fault tolerance without the client knowing anything about the topology. If one node goes down, traffic automatically routes to the next nearest.
In Kubernetes terms: L2 Announcements give you a VIP owned by one node (unicast with failover). BGP with ECMP gives you anycast — the same VIP served by multiple nodes simultaneously.
Multicast (one-to-group)
A packet sent to a group of interested receivers — not all devices (broadcast), not one (unicast), but a subscribed subset.
How it works:
- IP range 224.0.0.0/4 (224.0.0.0 – 239.255.255.255) is reserved for multicast
- A sender transmits once to a multicast group address (e.g., 239.1.1.1)
- The network replicates the packet only to hosts that have joined that group
- IGMP (Internet Group Management Protocol) handles group membership — hosts tell their local router “I want to receive group 239.1.1.1”
- PIM (Protocol Independent Multicast) handles routing between routers — ensures multicast traffic reaches all network segments with interested receivers
- At L2, multicast IP addresses map to special MAC addresses (01:00:5e:xx:xx:xx), so switches can forward selectively
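The IP-to-MAC mapping in the last point can be computed directly (RFC 1112: fixed 01:00:5e prefix plus the low 23 bits of the group address, which means 32 IP groups share each MAC):

```python
import socket
import struct

def multicast_mac(group_ip: str) -> str:
    """Map an IPv4 multicast group to its L2 MAC address: 01:00:5e
    followed by the low 23 bits of the group IP (RFC 1112)."""
    ip = struct.unpack("!I", socket.inet_aton(group_ip))[0]
    low23 = ip & 0x7FFFFF                    # top 9 bits of the IP are dropped
    mac = bytes([0x01, 0x00, 0x5E]) + low23.to_bytes(3, "big")
    return ":".join(f"{b:02x}" for b in mac)

print(multicast_mac("239.1.1.1"))    # 01:00:5e:01:01:01
print(multicast_mac("224.0.0.251"))  # 01:00:5e:00:00:fb (mDNS)
```

Because 9 bits are discarded, a switch doing MAC-based filtering may deliver a few unwanted groups; IGMP snooping refines this.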
How applications use it:
- Video/audio streaming — one source, many viewers. The source sends once; the network replicates. Far more efficient than sending individual unicast streams to each viewer.
- Financial market data — stock exchanges publish price feeds to a multicast group. Thousands of trading systems subscribe and receive simultaneously.
- Service/cluster discovery — protocols like mDNS (224.0.0.251), VRRP, and older cluster membership systems (JGroups) use multicast to find peers without a central registry.
- Gaming — multiplayer game state updates broadcast to all players in a session.
The key advantage: the sender transmits once regardless of how many receivers exist. The network handles replication. With unicast, the sender would need one copy per receiver.
Multicast Beyond the LAN
IP multicast was designed for local/campus networks. It breaks down in distributed scenarios:
- No multicast routing across the internet — ISPs don’t enable PIM or carry multicast between autonomous systems. There’s no global multicast routing table.
- Cloud providers don’t support it across regions — AWS VPC multicast only works within a single VPC via Transit Gateway. No cross-region, no cross-cloud.
- NAT breaks it — IGMP group joins don’t traverse NAT gateways.
In practice, distributed multicast uses one of these approaches:
Application-level fan-out (most common): The application builds its own distribution tree using unicast connections. The “multicast” happens at the app layer, not the network:
Source → Relay A (US-East) → Subscribers 1, 2
→ Relay B (EU-West) → Subscribers 3, 4
This is how Kafka, NATS, Redis pub/sub, and WebSocket fan-out work — producers write once, the middleware replicates and fans out over unicast.
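The fan-out shape can be reduced to a toy sketch (plain lists stand in for unicast connections here, so only the replication pattern is shown, not real networking):

```python
class Relay:
    """Minimal application-level fan-out: the source sends each message
    once to the relay; the relay replicates to every subscriber over
    what would be ordinary unicast connections in a real system."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, inbox: list):
        self.subscribers.append(inbox)

    def publish(self, message: str):
        # Replication happens here, in the application, not in the network
        for inbox in self.subscribers:
            inbox.append(message)

relay = Relay()
inboxes = [[], [], []]
for inbox in inboxes:
    relay.subscribe(inbox)

relay.publish("price update: ACME 101.5")  # one send from the source...
print([len(i) for i in inboxes])           # ...one delivery per subscriber
```

Chaining relays (source → regional relay → subscribers) gives the distribution-tree shape in the diagram above while every hop stays unicast.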
Multicast-to-unicast gateways: For legacy apps that speak native multicast, a gateway at each site captures multicast traffic and tunnels it (GRE/VXLAN) to remote sites, which re-inject it as local multicast. Common in financial services for bridging stock exchange feeds across data centers.
Protocols designed for wide-area distribution: MQTT (IoT), AMQP — message protocols built for distributed pub/sub over unicast from the start.
The bottom line: once you cross network boundaries, everyone falls back to application-level fan-out over unicast. The sender still transmits once, but replication happens in middleware rather than network routers.
Multicast on Kubernetes
Not used in Kubernetes service networking, and most CNIs don’t support it:
- Overlay CNIs (Flannel, Calico VXLAN) — encapsulate packets in unicast tunnels. Multicast semantics are lost inside the tunnel.
- VPC CNI (AWS) — AWS VPC doesn’t support IP multicast between ENIs (except via Transit Gateway multicast domains, a separate feature).
- Host networking (hostNetwork: true) — pods share the node’s network stack and can do multicast if the physical network supports it, but you lose pod isolation
Approaches that work
Multicast-capable CNI:
- Weave Net — supports multicast natively, replicates packets across its overlay via gossip
- Macvlan/IPVLAN CNI — pods get interfaces directly on the physical network, so multicast works if the LAN supports it. But you lose network policy, service mesh, etc.
- Calico in BGP mode (unencapsulated) — if the physical network supports multicast routing (PIM), it can work since pod IPs are routable
Application-level workaround:
- Replace multicast with a message broker (Kafka, NATS, Redis pub/sub). Publish to a topic, subscribers receive via unicast. Most common approach — redesign the app to not need multicast.
Multus + secondary interface:
- Use Multus CNI to attach a second network interface to pods
- Primary interface: normal CNI (Calico/Cilium) for K8s services
- Secondary interface: Macvlan/SR-IOV on a multicast-capable network
- Common in telecom/5G workloads (which is also where SCTP from section 8 shows up)
Most teams running multicast on Kubernetes (telecom, financial services) use the Multus approach — a dedicated multicast network alongside the regular CNI, rather than trying to make the primary CNI support it.
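A minimal sketch of that Multus setup (the attachment name, master interface, and subnet are all illustrative; the macvlan plugin and host-local IPAM are standard CNI plugins):

```yaml
# Secondary macvlan network on a multicast-capable physical interface
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: multicast-net
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "ipam": { "type": "host-local", "subnet": "10.10.0.0/24" }
    }
---
# Pods opt in via annotation: they keep eth0 from the primary CNI
# and gain net1 on the multicast network
apiVersion: v1
kind: Pod
metadata:
  name: feed-consumer
  annotations:
    k8s.v1.cni.cncf.io/networks: multicast-net
spec:
  containers:
    - name: app
      image: feed-consumer:latest   # illustrative image
```

The pod joins multicast groups on net1 while Services, network policy, and the mesh continue to apply only to eth0.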
10. Summary
| Topic | Solution | Scope |
|---|---|---|
| Expose services on-prem (simple) | L2 Announcements (MetalLB/Cilium) | Single LAN segment |
| Expose services on-prem (enterprise) | BGP (MetalLB/Cilium/Calico) | Multi-subnet, multi-DC |
| Full LB with health checks | Cilium LB-IPAM or LoxiLB | Any |
| External LB with K8s discovery | NGINX/HAProxy + Consul | Any |
| Cloud LB direct-to-pod | IP target mode (VPC CNI) | Cloud only |
The progression for on-prem:
- NodePort — works but exposes cluster topology, port conflicts
- L2 Announcements — unique VIPs, automatic failover, single broadcast domain
- BGP — multi-node ECMP, multi-subnet, enterprise-grade
- External LB — advanced L7 features, hybrid K8s + non-K8s
Cross-References
Related: Kubernetes Services & DNS Guide
See 01-k8s-services-dns.md for:
- Service types (ClusterIP, NodePort, LoadBalancer, ExternalName, Headless) — detailed specs and traffic flows
- IP Target vs Instance Target Mode — how cloud LBs bypass NodePort with VPC CNI
- MetalLB bare-metal section — how MetalLB fulfills LoadBalancer Services
- Ingress & Gateway API — L7 routing that sits in front of Services
- ExternalDNS — bridging cluster DNS to external DNS providers
- Traffic Policies — externalTrafficPolicy and internalTrafficPolicy behavior