This document describes docker networking in bridge and overlay mode delivered via libnetwork. Libnetwork uses iptables extensively to configure NATting and forwarding rules. https://wiki.archlinux.org/index.php/iptables provides a good introduction to iptables and its default chains. More details may be found in http://ipset.netfilter.org/iptables-extensions.man.html
The above diagram illustrates the network topology when a container instantiated with network mode set to bridge by docker engine. In this case, the libnetwork does the following
This completes network setup for container running in bridge mode. Outbound traffic from container flows through routing (Container NS) ? veth-pair ? docker0 bridge (Host NS) ---> docker0 interface (HOST NS) ? routing (HostNS) ? eth0 (HostNS) and out of host. And inbound traffic to container flows through the reverse direction.
Note that the container?s assigned IP (172.17.0.2 in above example) address is on docker0 subnet, and is not visible to externally to host. For this reason, a default masquerading rule is added to nat iptable?s POSTROUTING chain in host NS at docker engine initialization time. It states that for request traffic flow that has gone through the routing stage and the srcIP is within docker0 subnet (172.17.0.0/16), the traffic request must be originated from docker containers, therefore its srcIP is replaced with IP of outbound interface determined by routing. In the above diagram eth0?s IP 172.31.2.1 is used by replacement IP. In another word, masquerade is same as SNAT with replacement srcIP set to outbound interface?s IP.
If the container backends a service and has a listening targetPort in the container NS, it also must also have a corresponding publishedPort in host NS to receive the request and forward it to the container. Two rules are created in host NS for this purpose:
Libnetwork use completely different set of namespaces, bridges, and iptables to forward container traffic in swarm/overlay mode.
As depicted in the above diagram, when a host joinis a swarm cluster, the docker engine creates following network topology.
Initial Setup ( before any services are created)
In host NS, creates a docker_gwbridge bridge, assigning a subnet range to this bridge. In this above diagram 172.18.0.1/16. This subnet is local, and does not leak outside of the host.
In host NS, adds masquerading rule in nat iptable (5) PREROUTING chain for any request with srcIP within 172.18.0.0/16 subnet.
Creates a new network namespace ingress_sbox NS, creates two veth-pairs, one eth1 connects to docker_gwbridge bridge with fixed IP 172.18.0.2, and other (eth0) connected to ingress NS bridge br0. The eth0 is assigned an IP address, in this example, 10.255.0.2.
In ingress_sbox NS, adds to nat iptable(2)?s PREROUTING chain a rule that snat and redirect service request to ipvs for load-balancing. For instance,
?SNAT all -- anywhere 10.255.0.0/16 ipvs to:10.255.0.2?
In ingress_sbox NS, adds to nat iptable(2)?s POSTROUTING and OUTPUT chains rules that allows DNS lookup to be redirected to/from docker engine as is required by swarm service discovery.
Creates a new network namespace ingress NS, and creates a bridge br0 that has two links, one attaches to eth0 interface on ingress_sbox NS, and other to vxlan interface that in essence makes bridge br0 span across all hosts on the same swarm cluster. Each eth0 interface in ingress_sbox, each container instance of a service, and each service itself are given a unique IP 10.255.xx.xx, and is attached to br0, so that on the same swarm cluster, services, container instances of services, and ingress_sbox?s eth0 are all connected via bridge br0.
When a service is created, say with targetPort=80 and publishedPort=30000. The following are added to the existing network topology.
Service Setup
A container backends the service has its own namespace container NS. And two vether-pairs are created, eth1 is attached to docker_gwbridge in host NS, is given an IP assigned from docker_gwbridge subnet, in the example, 172.18.0.2, eth0 is attached to br0 in ingress NS, is given an IP of 10.255.0.5 in this example.
In container NS, adds rules to filter iptable (3)?s INPUT and OUTPUT chain that only allows targetPortt =80 traffic.
In container NS, adds rule to nat iptables (4)?s PREROUTING chain that changes publishedPort to targetPort. For instance,
?REDIRECT tcp -- anywhere 10.255.0.11 tcp dpt:30000 redir ports 80?
In container NS, adds rules to nat iptable(4)?s INPUT/OUTPUT chain that allow DNS lookup in this container to be redirected to docker engine.
In host NS, in filter iptable (5)?s FORWARD chain, inserts DOCKER-INGRESS chain, and adds a rule to allow service request to port 30000 and its reply.
I.e ?ACCEPT tcp -- anywhere anywhere tcp dpt:30000? and
?ACCEPT tcp -- anywhere anywhere state RELATED,ESTABLISHED tcp spt:30000?
In host NS, in nat iptable(6)?s PREROUTING chain, inserts (different) DOCKER-INGRESS chain, and adds a rule to dnat service request to ingress NS?s eth1?s IP (172.18.0.2). i.e
?DNAT tcp -- anywhere anywhere tcp dpt:30000 to:172.18.0.2:30000?
In ingress_sbox NS, in mangle iptable(1)?s PREROUTING chain, adds a rule to mark service request, i.e
?MARK tcp -- anywhere anywhere tcp dpt:30000 MARK set 0x100?
In ingress_sbox NS, in nat iptable(2)?s POSTROUTING chain, adds a rule to snat request?s srcIP to eth1?s IP, and forward to ipvs to load-balancing, i.e
?SNAT all -- 0.0.0.0/0 10.255.0.0/16 ipvs to:10.255.0.2?
In ingress_sbox, configures ipvs LB policy for marked traffic, i.e
FWM 256 rr
-> 10.255.0.5:0 Masq 1 0 0
-> 10.255.0.7:0 Masq 1 0 0
-> 10.255.0.8:0 Masq 1 0 0
Here each of 10.255.0.x represents IP address of of container instance backending the service.
This section describes traffic flow of request and reply to/from a service with publishedPort = 30000, targetPort = 80
Northbound traffic originated from a container instance, for example, ping www.cnn.com:
The traffic flow is exactly the same as in bridge mode, except it is via docker_gwbridge in host NS, and traffic is masqueraded with nat rule in initial setup (2).
DNS traffic
DNS lookup traffic is routed to docker engine from container instance for service discovery, filling the blank.
Other iptable chains and rules created and/or managed by docker engine/libnetwork.
DOCKER-USER: inserted as the first rule to FORWARD chain of filter iptable in host NS. So that user can independently managed traffic that may or may not be related docker containers.
DOCKER-ISOLATION-STAGE-1 / 2: Filling in the blank