The problem with kube-proxy: enabling IPVS on EKS
Update (3/11/20): kube-proxy no longer automatically cleans up network rules created by running kube-proxy in other modes. If you are switching the kube-proxy mode (e.g., iptables to IPVS), you will need to run kube-proxy --cleanup, or restart the worker node (recommended), before restarting kube-proxy. If you are not switching kube-proxy between different modes, this change should not require any action.
Introduction
All worker nodes in a Kubernetes cluster run a program called kube-proxy that is responsible for routing traffic to backend pods. Each time a service is created, a corresponding object is stored in etcd. This triggers the endpoint controller, which records a set of endpoints in etcd. These endpoints are then propagated to all of the nodes, where kube-proxy uses them to update the local iptables rules. This works fine at small to medium scale, but at very large scale it can cause several performance issues.
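If you want to see this in action, you can inspect the NAT rules kube-proxy programs on a node. This is just an illustrative check, assuming kube-proxy is running in its default iptables mode:
# On a worker node, list the chain kube-proxy uses to match service traffic
sudo iptables -t nat -L KUBE-SERVICES -n | head -20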
The problem
kube-proxy configures iptables to perform load balancing. iptables is a user space application that allows you to create kernel firewall rules; it was never really intended to be used as a load balancer. As the number of nodes and services grows, three issues start to emerge:
- Inserting/removing entries becomes less and less efficient
- Every incoming packet has to match an iptables rule to determine where to route it. iptables stores these rules in a list/table that has to be linearly traversed. When the table becomes very large, per-packet latency increases, which ultimately translates into lower throughput (a rough way to gauge the size of this table is shown after the list).
- The table is locked while it’s updated. This can cause contention, which adds to the latency issue. For example, it can take as long as 10 minutes to insert a rule in iptables on a cluster with 5,000 services.
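One way to get a feel for how the table grows is to count the service-related NAT rules on a worker node; the count climbs with the number of services and endpoints (illustrative only):
# Count the KUBE-* rules kube-proxy has programmed in the nat table
sudo iptables-save -t nat | grep -c KUBE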
IPVS to the rescue
IPVS is a kernel mode, transport layer load balancer that directs traffic to real servers. It avoids a lot of the issues that iptables suffers from at scale, in part because it is hash-based instead of list/table-based. Using a hash allows it to process updates in approximately 2ms irrespective of cluster size. It also supports both UDP and TCP and offers several different load balancing algorithms such as round robin, least connection, destination hashing, source hashing, shortest expected delay, and never queue.
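Outside of Kubernetes, ipvsadm gives you a feel for how IPVS works. The following is just a sketch with made-up addresses, showing a virtual server with two real servers behind it using the round robin scheduler:
# Create a virtual server on 10.0.0.100:80 using round robin (rr)
sudo ipvsadm -A -t 10.0.0.100:80 -s rr
# Add two real servers behind it, using NAT (masquerading)
sudo ipvsadm -a -t 10.0.0.100:80 -r 10.0.1.10:80 -m
sudo ipvsadm -a -t 10.0.0.100:80 -r 10.0.1.11:80 -m
# List the table; because it is hash-based, lookups stay fast as it grows
sudo ipvsadm -Ln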
Implementing IPVS on EKS
The first step is to update the worker node’s user data. The excerpt below is from the cloud-config that eksctl uses to bootstrap instances; the additions I’ve made are the ipvsadm package and the modprobe commands under runcmd.
#cloud-config
packages:
- ipvsadm
runcmd:
- sudo modprobe ip_vs
- sudo modprobe ip_vs_rr
- sudo modprobe ip_vs_wrr
- sudo modprobe ip_vs_sh
- sudo modprobe nf_conntrack_ipv4
- /var/lib/cloud/scripts/per-instance/bootstrap.al2.sh
These updates install ipvsadm and load the IPVS kernel modules into the Linux kernel on Amazon Linux 2. By making these changes, every node that gets deployed will have IPVS installed.
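Once a node comes up, a quick sanity check (assuming you can SSH to the instance) confirms the modules are loaded and the admin tool is available:
# Verify the IPVS kernel modules are loaded
lsmod | grep -e ip_vs -e nf_conntrack
# Verify ipvsadm was installed by cloud-init
ipvsadm --version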
The next step involves updating the kube-proxy daemonset.
containers:
- command:
  - /bin/sh
  - -c
  - kube-proxy --v=2 --kubeconfig=/var/lib/kube-proxy/kubeconfig --proxy-mode=ipvs --ipvs-scheduler=sed
  image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.13.8
To enable IPVS, you need to add the --proxy-mode and --ipvs-scheduler flags to the kube-proxy command. The --ipvs-scheduler flag tells kube-proxy which routing algorithm to use; in this case, we’re using shortest expected delay (sed).
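One way to apply and verify the change is sketched below, assuming the DaemonSet pods carry the default k8s-app=kube-proxy label; the exact output will vary by cluster:
# Edit the kube-proxy DaemonSet and add the flags shown above
kubectl -n kube-system edit daemonset kube-proxy
# After the pods roll, service VIPs should show up as IPVS virtual servers on the worker nodes
sudo ipvsadm -Ln
# kube-proxy also logs which proxier it selected at startup
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i ipvs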
Conclusion
If you plan to operate very large clusters with thousands of services, you might want to consider using IPVS instead of iptables. As always, research IPVS thoroughly before implementing it and be aware of potential issues that might arise. For additional information on IPVS and Kubernetes, please refer to IPVS-based In-cluster Load Balancing Deep Dive on the Kubernetes blog.