【容器实战高手课-学习-5】如何修改容器的网络参数

Demon.Lee 2021年09月15日

课程原文: 李程远. 容器网络:我修改了/proc/sys/net下的参数,为什么在容器中不起效?

本文实践环境:
Operating System: Ubuntu 20.04.2 LTS
Kernel: Linux 5.11.0-27-generic
Architecture: x86-64
Docker Client/Server Version: 20.10.7

最近一段时间一直在学习网络相关的基础知识,深感网络体系的博大精深。假如没有网络,我们的生活将会变成什么样,不敢想象。

好久没有继续更新这门课的学习笔记了,这一节继续跟随李程远老师,将容器中网络参数的修改进行一个简单梳理。

Network Namespace 理解

之前的相关学习笔记中,对容器中使用到的 Namespace 和 Cgroups 进行了初步梳理。最近看到刘超老师在《趣谈网络协议》中对二者的描述也很有意思,摘录如下:

封闭的环境主要使用到了两种技术,一种是看起来隔离的技术,称为 namespace,也即每个 namespace 中的应用看到的是不同的 IP 地址、用户空间、进程号等。另一种是用起来隔离的技术,称为 cgroup,也即明明整台机器有很多的 CPU、内存,而一个应用只能用其中的一部分。

而 Network Namespace 就是针对网络进行隔离的一种看起来隔离的技术。在 Linux 系统下,通过 man 7 network_namespaces 可以看到相关的描述:

NETWORK_NAMESPACES(7)                                                                                Linux Programmer's Manual        

NAME
       network_namespaces - overview of Linux network namespaces

DESCRIPTION
       Network  namespaces  provide  isolation of the system resources associated with networking: network devices, IPv4 and IPv6 protocol stacks, IP routing tables, firewall rules, the /proc/net directory (which is a symbolic
       link to /proc/PID/net), the /sys/class/net directory, various files under /proc/sys/net, port numbers (sockets), and so on.  In addition, network namespaces  isolate  the  UNIX  domain  abstract  socket  namespace  (see
       unix(7)).

       A  physical network device can live in exactly one network namespace.  When a network namespace is freed (i.e., when the last process in the namespace terminates), its physical network devices are moved back to the ini‐
       tial network namespace (not to the parent of the process).

       A virtual network (veth(4)) device pair provides a pipe-like abstraction that can be used to create tunnels between network namespaces, and can be used to create a bridge to a physical network device  in  another  name‐
       space.  When a namespace is freed, the veth(4) devices that it contains are destroyed.

       Use of network namespaces requires a kernel that is configured with the CONFIG_NET_NS option.

SEE ALSO
       nsenter(1), unshare(1), clone(2), veth(4), proc(5), sysfs(5), namespaces(7), user_namespaces(7), brctl(8), ip(8), ip-address(8), ip-link(8), ip-netns(8), iptables(8), ovs-vsctl(8)

对上面的内容进行梳理,可以知道有以下几类资源是通过 Network Namespace 进行隔离的:

  • 网络设备:通过 ip link 命令可以查看到它们。
  # 宿主机上的网络接口
  root@demonlee-ubuntu:~# ip link
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
      link/ether 08:00:27:06:96:00 brd ff:ff:ff:ff:ff:ff
  3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
      link/ether 02:42:74:45:fe:50 brd ff:ff:ff:ff:ff:ff
  5: vethdbd733a@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default 
      link/ether b2:4c:5a:71:8d:7a brd ff:ff:ff:ff:ff:ff link-netnsid 0
  root@demonlee-ubuntu:~#
  # 容器内的网络接口
  root@demonlee-ubuntu:~# docker exec -it net_para ip link
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default 
      link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
  root@demonlee-ubuntu:~#
  • IPv4 和 IPv6 协议栈:涉及 IP 层及其上层的传输层协议,它们的相关参数大多位于 /proc/sys/net/ 目录下。
  root@demonlee-ubuntu:~# docker exec -it net_para ls -l /proc/sys/net
  total 0
  dr-xr-xr-x 1 root root 0 Sep  6 11:19 bridge
  dr-xr-xr-x 1 root root 0 Sep  6 11:19 core
  dr-xr-xr-x 1 root root 0 Sep  6 11:18 ipv4
  dr-xr-xr-x 1 root root 0 Sep  6 11:19 ipv6
  dr-xr-xr-x 1 root root 0 Sep  6 11:19 mptcp
  dr-xr-xr-x 1 root root 0 Sep  6 11:19 netfilter
  dr-xr-xr-x 1 root root 0 Sep  6 11:19 unix
  root@demonlee-ubuntu:~#
  • IP 路由表:不同的 Network Namespace 运行 ip route 命令,能看到不同的路由表。
  # 宿主机路由表
  root@demonlee-ubuntu:~# ip route
  default via 192.168.105.254 dev enp0s8 proto dhcp metric 101 
  169.254.0.0/16 dev enp0s3 scope link metric 1000 
  172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
  192.168.56.0/24 dev enp0s3 proto kernel scope link src 192.168.56.104 metric 100 
  192.168.105.0/24 dev enp0s8 proto kernel scope link src 192.168.105.20 metric 101 
  root@demonlee-ubuntu:~# 
  # 容器内的路由表
  root@demonlee-ubuntu:~# docker exec -it net_para ip route
  default via 172.17.0.1 dev eth0 
  172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2 
  root@demonlee-ubuntu:~#
  • 防火墙规则:每个 Network Namespace 可以单独设置 iptables 规则。

  • 网络状态:从 /proc/net(内核网络信息等) 和 /sys/class/net(网卡信息) 中获取上面四种资源的状态信息。

  root@demonlee-ubuntu:~# docker exec -it net_para ls /proc/net
  anycast6      igmp6              mcfilter6  rt6_stats     udp
  arp           ip6_flowlabel      netfilter  rt_acct       udp6
  dev           ip6_mr_cache       netlink    rt_cache      udplite
  dev_mcast     ip6_mr_vif         netstat    snmp          udplite6
  dev_snmp6     ip_mr_cache        packet     snmp6         unix
  fib_trie      ip_mr_vif          protocols  sockstat      wireless
  fib_triestat  ip_tables_matches  psched     sockstat6     xfrm_stat
  icmp          ip_tables_names    ptype      softnet_stat
  icmp6         ip_tables_targets  raw        stat
  if_inet6      ipv6_route         raw6       tcp
  igmp          mcfilter           route      tcp6
  root@demonlee-ubuntu:~# 
  root@demonlee-ubuntu:~# docker exec -it net_para ls /sys/class/net
  eth0  lo
  root@demonlee-ubuntu:~#
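除了观察现成的容器,也可以用 SEE ALSO 中提到的 ip-netns(8) 命令自己创建一个 Network Namespace,快速验证上面这几类资源确实是各自独立的。下面是一个简单的操作思路(命名空间名称 demo 仅为示例,需要 root 权限):

# 创建一个名为 demo 的 Network Namespace
ip netns add demo
# 新命名空间里只有一个尚未启用的 lo 设备,路由表也是空的
ip netns exec demo ip link
ip netns exec demo ip route
# /proc/sys/net 下的参数同样是独立的一份
ip netns exec demo sysctl net.ipv4.tcp_keepalive_time
# 实验结束后清理
ip netns delete demo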

手动实现一个 Network Namespace

下面通过 clone() 和 unshare() 这两个系统调用,来直观体验一下 Linux 系统中网络隔离的效果。

clone() 系统调用会创建一个新的进程,若在入参中传入 CLONE_NEWNET 标志位,那么新创建的进程就会运行在一个全新的 Network Namespace 中,此时使用 ip link 查看网络设备,就能直观看到它与宿主机之间的差别。

具体代码如下所示:

#ifndef _GNU_SOURCE 
#define _GNU_SOURCE
#endif

#include <sched.h>
#include <signal.h>   /* SIGCHLD */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define errExit(msg) do{perror(msg);exit(EXIT_FAILURE);}while(0)
#define STACK_SIZE (1024*1024)
static char stack[STACK_SIZE];

int new_netns(void *para){
  printf("[%d] New Namespace Devices: \n", getpid());
  system("ip link");
  printf("\n\n");
  sleep(100);
  return 0;
}

int main(void){
  pid_t pid;
  printf("[%d] Host Namespace Devices:\n", getpid());
  system("ip link");
  printf("\n\n");

  pid = clone(new_netns, stack + STACK_SIZE, CLONE_NEWNET | SIGCHLD, NULL);
  if(pid==-1){
    errExit("clone");
  }

  if(waitpid(pid, NULL, 0)==-1){
    errExit("waitpid");
  }

  return 0;
}
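这里假设源文件保存为 clone-netns.c(文件名仅为示例),可以用类似下面的命令编译;由于创建新的 Network Namespace 需要 CAP_SYS_ADMIN 权限,运行时请使用 root 身份:

gcc -Wall -o clone-netns clone-netns.c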

编译之后,运行的结果如下所示:

root@demonlee-ubuntu:network_namespace# ./clone-netns
[2313] Host Namespace Devices:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:c7:66:ff brd ff:ff:ff:ff:ff:ff
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:3b:7f:d0 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default 
    link/ether 02:42:eb:01:ca:ab brd ff:ff:ff:ff:ff:ff


[2316] New Namespace Devices: 
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00


root@demonlee-ubuntu:network_namespace#

可以看到,新进程中的网络设备只有一个 lo(回环设备),而宿主机上有 4 个。

另外一种方式,是通过 unshare() 系统调用来改变当前进程的 Network Namespace,通过 man 2 unshare 简单了解一下这个系统调用的作用:

NAME
       unshare - disassociate parts of the process execution context

SYNOPSIS
       #define _GNU_SOURCE
       #include <sched.h>

       int unshare(int flags);

DESCRIPTION
       unshare()  allows  a process (or thread) to disassociate parts of its execution context that are currently being shared with other
       processes (or threads).  Part of the execution context, such as the mount namespace, is shared implicitly when a  new  process  is
       created  using  fork(2)  or vfork(2), while other parts, such as virtual memory, may be shared by explicit request when creating a
       process or thread using clone(2).

       The main use of unshare() is to allow a process to control its shared execution context without creating a new process.

       The flags argument is a bit mask that specifies which parts of the execution context should be unshared.  This argument is  speci‐
       fied by ORing together zero or more of the following constants:

       CLONE_FILES
              Reverse  the  effect  of  the clone(2) CLONE_FILES flag.  Unshare the file descriptor table, so that the calling process no
              longer shares its file descriptors with any other process.

       CLONE_FS
              Reverse the effect of the clone(2) CLONE_FS flag.  Unshare filesystem attributes, so that the  calling  process  no  longer
              shares its root directory (chroot(2)), current directory (chdir(2)), or umask (umask(2)) attributes with any other process.

       CLONE_NEWCGROUP (since Linux 4.6)
              This  flag has the same effect as the clone(2) CLONE_NEWCGROUP flag.  Unshare the cgroup namespace.  Use of CLONE_NEWCGROUP
              requires the CAP_SYS_ADMIN capability.

       CLONE_NEWIPC (since Linux 2.6.19)
              This flag has the same effect as the clone(2) CLONE_NEWIPC flag.  Unshare the IPC namespace, so that  the  calling  process
              has a private copy of the IPC namespace which is not shared with any other process.  Specifying this flag automatically im‐
              plies CLONE_SYSVSEM as well.  Use of CLONE_NEWIPC requires the CAP_SYS_ADMIN capability.

       CLONE_NEWNET (since Linux 2.6.24)
              This flag has the same effect as the clone(2) CLONE_NEWNET flag.  Unshare  the  network  namespace,  so  that  the  calling
              process  is  moved  into  a  new  network  namespace  which  is  not  shared  with any previously existing process.  Use of
              CLONE_NEWNET requires the CAP_SYS_ADMIN capability.

       CLONE_NEWNS
              This flag has the same effect as the clone(2) CLONE_NEWNS flag.  Unshare the mount namespace, so that the  calling  process
              has a private copy of its namespace which is not shared with any other process.  Specifying this flag automatically implies
              CLONE_FS as well.  Use of CLONE_NEWNS requires the CAP_SYS_ADMIN capability.   For  further  information,  see  mount_name‐
              spaces(7).
       ...

演示代码如下:

#ifndef _GNU_SOURCE 
#define _GNU_SOURCE
#endif

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define errExit(msg) do{perror(msg);exit(EXIT_FAILURE);}while(0)

int main(void){
  printf("[%d] Host Namespace Devices:\n", getpid());
  system("ip link");
  printf("\n\n");

  if(unshare(CLONE_NEWNET)==-1){
    errExit("unshare");
  }

  printf("[%d] New Namespace Devices:\n", getpid());
  system("ip link");
  printf("\n\n");

  return 0;
}
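同样假设源文件保存为 unshare-netns.c(文件名仅为示例),编译方式与前面相同:

gcc -Wall -o unshare-netns unshare-netns.c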

编译之后,运行的结果如下所示:

root@demonlee-ubuntu:network_namespace# ./unshare-netns 
[2360] Host Namespace Devices:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:c7:66:ff brd ff:ff:ff:ff:ff:ff
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:3b:7f:d0 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default 
    link/ether 02:42:eb:01:ca:ab brd ff:ff:ff:ff:ff:ff


[2360] New Namespace Devices:
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00


root@demonlee-ubuntu:network_namespace#

其结果与前面的 clone() 系统调用一样,新的网络空间拥有的网络设备与宿主机上是不同的。
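顺带一提,util-linux 提供的 unshare(1) 命令就是对这个系统调用的封装,不写 C 代码也能达到同样的效果,例如(需要 root 权限):

unshare --net ip link

这条命令会在一个新建的 Network Namespace 中执行 ip link,因此同样只能看到一个尚未启用的 lo 设备。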

那通过什么方式,可以访问一个已经存在的独立网络空间呢?答案是 lsns 和 nsenter 这两个命令。

  • lsns -t net:查看已存在的 Network Namespace。
  • nsenter -t <pid> -n:这里的 -n 表示进入 <pid> 对应的 Network Namespace。

关于这两个命令的具体用法,可以通过 lsns -h 和 nsenter -h 进行详细了解。

下面就以前面的 clone-netns 进程为例,运行程序后,再通过 lsns 和 nsenter 进行查看:

root@demonlee-ubuntu:~# lsns -t net
        NS TYPE NPROCS   PID USER     NETNSID NSFS COMMAND
4026531992 net     176     1 root  unassigned      /sbin/init splash
4026532245 net       1  1208 rtkit unassigned      /usr/libexec/rtkit-daemon
4026532323 net       1  3034 root  unassigned      ./clone-netns
root@demonlee-ubuntu:~# 
# 进入对应的 Network Namespace,并执行 ip addr
root@demonlee-ubuntu:~# nsenter -t 3034 -n ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@demonlee-ubuntu:~#
# 进入对应的 Network Namespace
root@demonlee-ubuntu:~# nsenter -t 3034 -n
# 然后执行相关命令,比如 ip link
root@demonlee-ubuntu:~# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@demonlee-ubuntu:~#
# 回到 1 号进程的 Network Namespace,即宿主机的网络环境
root@demonlee-ubuntu:~# nsenter -t 1 -n
root@demonlee-ubuntu:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:c7:66:ff brd ff:ff:ff:ff:ff:ff
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:3b:7f:d0 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default 
    link/ether 02:42:60:ee:fd:ff brd ff:ff:ff:ff:ff:ff
root@demonlee-ubuntu:~# 
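如果目标是某个 Docker 容器,还可以先用 docker inspect 查出容器主进程在宿主机上的 PID,再交给 nsenter,省去手工比对的步骤。以前文的 net_para 容器为例,大致做法如下(仅作示意):

PID=$(docker inspect --format '{{.State.Pid}}' net_para)
nsenter -t "$PID" -n ip addr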

容器网络参数修改

问题重现

前面提到 Linux OS 下很大一部分网络参数在 /proc/sys/net 目录下,容器中运行的应用程序,可能需要对相关网络参数进行修改,那要怎么做呢?

在这之前,我们先来看这样一个场景:将宿主机上部分网络参数的默认值修改之后,观察新启动的容器中,这些网络参数是否会跟着变化。

root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_congestion_control
cubic
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
root@demonlee-ubuntu:~# 
root@demonlee-ubuntu:~# echo bbr > /proc/sys/net/ipv4/tcp_congestion_control
root@demonlee-ubuntu:~# echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
root@demonlee-ubuntu:~# echo 10 > /proc/sys/net/ipv4/tcp_keepalive_intvl
root@demonlee-ubuntu:~# echo 6 > /proc/sys/net/ipv4/tcp_keepalive_probes
root@demonlee-ubuntu:~# 
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_congestion_control
bbr
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_time
600
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
10
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_probes
6
root@demonlee-ubuntu:~#

可以看到,宿主机上的 4 个网络参数已经调整完毕,并且确认修改成功。下面启动一个容器,看看里面的网络参数值:

root@demonlee-ubuntu:~# docker run -d --name net_para centos:8.1.1911 sleep 3600
f7b71a9615943e04249596829f05bf8ebb8273cffaf039aa5205db95c54d1a86
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# docker exec -it net_para /bin/bash
[root@f7b71a961594 /]# cat /proc/sys/net/ipv4/tcp_congestion_control 
bbr
[root@f7b71a961594 /]# cat /proc/sys/net/ipv4/tcp_keepalive_time     
7200
[root@f7b71a961594 /]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl 
75
[root@f7b71a961594 /]# cat /proc/sys/net/ipv4/tcp_keepalive_probes 
9
[root@f7b71a961594 /]#

从结果来看,有点出乎意料,除了 tcp_congestion_control 这个参数继承了宿主机的值,其他几项都还是默认值。即新启动容器中的网络参数,一部分继承自宿主机环境,另外一部分又重新初始化了。这又是为啥呢?

梳理原因

要想知道这个结果是如何产生的,还得回到内核源码中找答案。net/ipv4/tcp_ipv4.c 文件中的 tcp_sk_init() 函数会在每个新的 Network Namespace 创建时被调用,其代码如下:

static int __net_init tcp_sk_init(struct net *net) {
	int res, cpu, cnt;

	net->ipv4.tcp_sk = alloc_percpu(struct sock *);
	if (!net->ipv4.tcp_sk)
		return -ENOMEM;

	for_each_possible_cpu(cpu) {
		struct sock *sk;

		res = inet_ctl_sock_create(&sk, PF_INET, SOCK_RAW,
					   IPPROTO_TCP, net);
		if (res)
			goto fail;
		sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);

		/* Please enforce IP_DF and IPID==0 for RST and
		 * ACK sent in SYN-RECV and TIME-WAIT state.
		 */
		inet_sk(sk)->pmtudisc = IP_PMTUDISC_DO;

		*per_cpu_ptr(net->ipv4.tcp_sk, cpu) = sk;
	}

	net->ipv4.sysctl_tcp_ecn = 2;
	net->ipv4.sysctl_tcp_ecn_fallback = 1;

	net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
	net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
	net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;

	net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
	net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
	net->ipv4.sysctl_tcp_keepalive_intvl = TCP_KEEPALIVE_INTVL;

	net->ipv4.sysctl_tcp_syn_retries = TCP_SYN_RETRIES;
	net->ipv4.sysctl_tcp_synack_retries = TCP_SYNACK_RETRIES;
	net->ipv4.sysctl_tcp_syncookies = 1;
	net->ipv4.sysctl_tcp_reordering = TCP_FASTRETRANS_THRESH;
	net->ipv4.sysctl_tcp_retries1 = TCP_RETR1;
	net->ipv4.sysctl_tcp_retries2 = TCP_RETR2;
	net->ipv4.sysctl_tcp_orphan_retries = 0;
	net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT;
	net->ipv4.sysctl_tcp_notsent_lowat = UINT_MAX;
	net->ipv4.sysctl_tcp_tw_reuse = 2;
	net->ipv4.sysctl_tcp_no_ssthresh_metrics_save = 1;

	cnt = tcp_hashinfo.ehash_mask + 1;
	net->ipv4.tcp_death_row.sysctl_max_tw_buckets = cnt / 2;
	net->ipv4.tcp_death_row.hashinfo = &tcp_hashinfo;

	net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128);
	net->ipv4.sysctl_tcp_sack = 1;
	net->ipv4.sysctl_tcp_window_scaling = 1;
	net->ipv4.sysctl_tcp_timestamps = 1;
	net->ipv4.sysctl_tcp_early_retrans = 3;
	net->ipv4.sysctl_tcp_recovery = TCP_RACK_LOSS_DETECTION;
	net->ipv4.sysctl_tcp_slow_start_after_idle = 1; /* By default, RFC2861 behavior.  */
	net->ipv4.sysctl_tcp_retrans_collapse = 1;
	net->ipv4.sysctl_tcp_max_reordering = 300;
	net->ipv4.sysctl_tcp_dsack = 1;
	net->ipv4.sysctl_tcp_app_win = 31;
	net->ipv4.sysctl_tcp_adv_win_scale = 1;
	net->ipv4.sysctl_tcp_frto = 2;
	net->ipv4.sysctl_tcp_moderate_rcvbuf = 1;
	/* This limits the percentage of the congestion window which we
	 * will allow a single TSO frame to consume.  Building TSO frames
	 * which are too large can cause TCP streams to be bursty.
	 */
	net->ipv4.sysctl_tcp_tso_win_divisor = 3;
	/* Default TSQ limit of 16 TSO segments */
	net->ipv4.sysctl_tcp_limit_output_bytes = 16 * 65536;
	/* rfc5961 challenge ack rate limiting */
	net->ipv4.sysctl_tcp_challenge_ack_limit = 1000;
	net->ipv4.sysctl_tcp_min_tso_segs = 2;
	net->ipv4.sysctl_tcp_min_rtt_wlen = 300;
	net->ipv4.sysctl_tcp_autocorking = 1;
	net->ipv4.sysctl_tcp_invalid_ratelimit = HZ/2;
	net->ipv4.sysctl_tcp_pacing_ss_ratio = 200;
	net->ipv4.sysctl_tcp_pacing_ca_ratio = 120;
	if (net != &init_net) {
		memcpy(net->ipv4.sysctl_tcp_rmem,
		       init_net.ipv4.sysctl_tcp_rmem,
		       sizeof(init_net.ipv4.sysctl_tcp_rmem));
		memcpy(net->ipv4.sysctl_tcp_wmem,
		       init_net.ipv4.sysctl_tcp_wmem,
		       sizeof(init_net.ipv4.sysctl_tcp_wmem));
	}
	net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC;
	net->ipv4.sysctl_tcp_comp_sack_slack_ns = 100 * NSEC_PER_USEC;
	net->ipv4.sysctl_tcp_comp_sack_nr = 44;
	net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE;
	spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
	atomic_set(&net->ipv4.tfo_active_disable_times, 0);

	/* Reno is always built in */
	if (!net_eq(net, &init_net) &&
	    bpf_try_module_get(init_net.ipv4.tcp_congestion_control,
			       init_net.ipv4.tcp_congestion_control->owner))
		net->ipv4.tcp_congestion_control = init_net.ipv4.tcp_congestion_control;
	else
		net->ipv4.tcp_congestion_control = &tcp_reno;

	return 0;
fail:
	tcp_sk_exit(net);

	return res;
}

从代码中可以看到,sysctl_tcp_keepalive_time、sysctl_tcp_keepalive_probes、sysctl_tcp_keepalive_intvl 在每个新的 Network Namespace 初始化时,都被直接赋成内核编译期的默认值(TCP_KEEPALIVE_TIME、TCP_KEEPALIVE_PROBES、TCP_KEEPALIVE_INTVL),因此不会继承宿主机上的修改;而 tcp_congestion_control 的赋值逻辑不同:只要当前 Namespace 不是 init_net,且能成功获取该拥塞控制模块的引用,就直接复用 init_net(也就是宿主机 Network Namespace)当前使用的拥塞控制算法,否则退回到内置的 reno。这就解释了为什么容器里只有 tcp_congestion_control 跟随了宿主机的值。
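基于这个逻辑,也可以不借助 Docker 直接验证:前面已经在宿主机(即 init_net)上把拥塞控制算法改成了 bbr,此时用 unshare 命令新建一个 Network Namespace 再查看这两个参数,预期 tcp_congestion_control 会继承 bbr,而 tcp_keepalive_time 会回到默认值 7200。下面只是一个验证思路,具体输出与内核版本有关:

# 在新建的 Network Namespace 中查看这两个参数(需要 root 权限)
unshare --net sysctl net.ipv4.tcp_congestion_control net.ipv4.tcp_keepalive_time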

那我们是不是可以直接在容器中修改这些参数呢?以下为测试结果:

root@demonlee-ubuntu:~# docker exec -it net_para /bin/bash
[root@f7b71a961594 /]# echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
bash: /proc/sys/net/ipv4/tcp_keepalive_time: Read-only file system
[root@f7b71a961594 /]# 
[root@f7b71a961594 /]# echo 10 > /proc/sys/net/ipv4/tcp_keepalive_intvl
bash: /proc/sys/net/ipv4/tcp_keepalive_intvl: Read-only file system
[root@f7b71a961594 /]# echo bbr > /proc/sys/net/ipv4/tcp_congestion_control
bash: /proc/sys/net/ipv4/tcp_congestion_control: Read-only file system
[root@f7b71a961594 /]# 
[root@f7b71a961594 /]# cat /proc/mounts | grep "/proc/sys"
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sysrq-trigger proc ro,nosuid,nodev,noexec,relatime 0 0
[root@f7b71a961594 /]#

在普通容器中(非 privileged 容器,后面会单独梳理这一块的学习笔记)修改这些参数会提示错误:“Read-only file system”。
通过 cat /proc/mounts | grep "/proc/sys" 我们也能发现 /proc/sys 是只读的(read-only)。

而这么做的原因也很容易想到:安全。为了防止容器内的进程随意改动内核参数、进而影响宿主机或其他容器,Docker 默认把容器中的 /proc/sys 等目录以只读方式 mount,更详细的说明请点击参考文献中的链接进行了解。

[root@f7b71a961594 /]# cat /proc/mounts
overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/RQFUDW2R4OWJ5N7WAKVSKXRFZE:/var/lib/docker/overlay2/l/SLRTXUAAXRPQTLBQIAX5Q6EMEQ,upperdir=/var/lib/docker/overlay2/e36766f7c706db2f31c4d53ec0ef2058af85693e4f3923d80aae4f80e5d3575e/diff,workdir=/var/lib/docker/overlay2/e36766f7c706db2f31c4d53ec0ef2058af85693e4f3923d80aae4f80e5d3575e/work 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,nosuid,size=65536k,mode=755,inode64 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,relatime,mode=755,inode64 0 0
cgroup /sys/fs/cgroup/systemd cgroup ro,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
cgroup /sys/fs/cgroup/freezer cgroup ro,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup ro,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/cpuset cgroup ro,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup ro,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/pids cgroup ro,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/devices cgroup ro,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup ro,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/rdma cgroup ro,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/perf_event cgroup ro,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/blkio cgroup ro,nosuid,nodev,noexec,relatime,blkio 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k,inode64 0 0
/dev/sda5 /etc/resolv.conf ext4 rw,relatime,errors=remount-ro 0 0
/dev/sda5 /etc/hostname ext4 rw,relatime,errors=remount-ro 0 0
/dev/sda5 /etc/hosts ext4 rw,relatime,errors=remount-ro 0 0
proc /proc/bus proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/fs proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/irq proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sysrq-trigger proc ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /proc/asound tmpfs ro,relatime,inode64 0 0
tmpfs /proc/acpi tmpfs ro,relatime,inode64 0 0
tmpfs /proc/kcore tmpfs rw,nosuid,size=65536k,mode=755,inode64 0 0
tmpfs /proc/keys tmpfs rw,nosuid,size=65536k,mode=755,inode64 0 0
tmpfs /proc/timer_list tmpfs rw,nosuid,size=65536k,mode=755,inode64 0 0
tmpfs /proc/sched_debug tmpfs rw,nosuid,size=65536k,mode=755,inode64 0 0
tmpfs /proc/scsi tmpfs ro,relatime,inode64 0 0
tmpfs /sys/firmware tmpfs ro,relatime,inode64 0 0
[root@f7b71a961594 /]#

根据前面的分析,我们可以得出两点结论:

  • 修改宿主机上的网络参数,新启动的容器并不能全部继承这些值。
  • 容器启动之后,在容器内部没有权限再修改这些网络参数。

既然如此,那如何才能修改容器中的网络参数呢?

解决方案

前面提到使用 nsenter 可以进入相关 Network Namespace,然后便可以更改对应容器的网络参数了。比如下面演示的样例:

root@demonlee-ubuntu:~# lsns -t net
        NS TYPE NPROCS   PID USER     NETNSID NSFS                           COMMAND
4026531992 net     175     1 root  unassigned                                /sbin/init splash
4026532253 net       1  1733 root           0 /run/docker/netns/46ce60a32587 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 3600
4026532326 net       1  1174 rtkit unassigned                                /usr/libexec/rtkit-daemon
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# nsenter -t 1733 -n sysctl -a | grep net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 7200
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# nsenter -t 1733 -n sysctl -w net.ipv4.tcp_keepalive_time=6000
net.ipv4.tcp_keepalive_time = 6000
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# nsenter -t 1733 -n sysctl -a | grep net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 6000
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# docker exec -it net_para cat /proc/sys/net/ipv4/tcp_keepalive_time
6000
root@demonlee-ubuntu:~# 

通过这行命令 nsenter -t 1733 -n sysctl -w net.ipv4.tcp_keepalive_time=6000(也可以使用这个方式:nsenter -t 1733 -n bash -c 'echo 6000 > /proc/sys/net/ipv4/tcp_keepalive_time'),我们就将容器 net_para 中对应的网络参数给修改了。

但是,这又产生了新的问题:

  • 一般生产环境是不允许通过 nsenter 这种命令行的方式进行修改的。
  • 即使通过 nsenter 进行了修改,服务也需要重启,对应的配置才能生效,这同样也不被允许。

如此一来,我们就只有一条方案可以选了,那就是在容器启动时调整相关配置。

Docker 和 Kubernetes 都提供了在容器启动时设置 sysctl 的入口:Docker 对应的是 docker run 的 --sysctl 参数,Kubernetes 对应的是 Pod securityContext 中的 sysctls 配置,配合 kubelet 的 allowed-unsafe-sysctls 特性使用。下面以 Docker 的 --sysctl 参数为例进行验证:

root@demonlee-ubuntu:~# docker run -d --name net_para2 --sysctl net.ipv4.tcp_keepalive_time=6600 centos:8.1.1911 sleep 3600 
79192ce108974be1cf493c4a592dce992fee57cdec80acc2a1d08072c0aebfef
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# docker exec -it net_para2 cat /proc/sys/net/ipv4/tcp_keepalive_time 
6600
root@demonlee-ubuntu:~#
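Kubernetes 的思路类似:在 Pod 的 securityContext.sysctls 中声明需要设置的参数;像 net.ipv4.tcp_keepalive_time 这类被视为非安全(unsafe)的 sysctl,还需要节点上的 kubelet 开启 allowed-unsafe-sysctls 才会被接受。下面是一个假设的 Pod 示例(名称、镜像等均为示意):

# 前提(假设):节点 kubelet 已配置 --allowed-unsafe-sysctls='net.ipv4.tcp_keepalive_time'
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: net-para-demo
spec:
  securityContext:
    sysctls:
    - name: net.ipv4.tcp_keepalive_time
      value: "6600"
  containers:
  - name: demo
    image: centos:8.1.1911
    command: ["sleep", "3600"]
EOF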

总结

  • 容器中应用了 Network Namespace,从而实现了网络设备、路由规则和网络内核参数等的隔离。
  • 在 Linux 下可以使用 clone() 和 unshare() 系统调用进行 Network Namespace 的验证和测试。
  • 容器中修改内核参数,需要在启动容器时进行配置,Docker 中可以使用 --sysctl 参数进行配置。

参考文献