Source lecture: 李程远, "Container Networking: I modified the parameters under /proc/sys/net, so why don't they take effect in the container?"
Environment used in this article:
Operating System: Ubuntu 20.04.2 LTS
Kernel: Linux 5.11.0-27-generic
Architecture: x86-64
Docker Client/Server Version: 20.10.7
I have been studying networking fundamentals for a while now, and the more I learn, the more I appreciate how vast and deep the field is; it is hard to imagine what life would look like without networks.
It has been a long time since I last updated my notes for this course. In this installment I follow 李程远's lesson again and walk through how network parameters are modified in containers.
Understanding Network Namespaces
In earlier notes I took a first pass over the Namespaces and Cgroups that containers rely on. Recently I came across a description of the two by 刘超 in 《趣谈网络协议》 that I found quite apt; here is an excerpt:
A closed-off environment relies mainly on two techniques. One makes things look isolated and is called a namespace: the applications in each namespace see their own IP addresses, user space, process IDs, and so on. The other makes things isolated in use and is called a cgroup: the machine as a whole may have plenty of CPU and memory, yet a single application can only use a portion of them.
Network Namespace is the "looks isolated" technique applied to networking. On Linux, man 7 network_namespaces gives the following description:
NETWORK_NAMESPACES(7) Linux Programmer's Manual
NAME
network_namespaces - overview of Linux network namespaces
DESCRIPTION
Network namespaces provide isolation of the system resources associated with networking: network devices, IPv4 and IPv6 protocol stacks, IP routing tables, firewall rules, the /proc/net directory (which is a symbolic
link to /proc/PID/net), the /sys/class/net directory, various files under /proc/sys/net, port numbers (sockets), and so on. In addition, network namespaces isolate the UNIX domain abstract socket namespace (see
unix(7)).
A physical network device can live in exactly one network namespace. When a network namespace is freed (i.e., when the last process in the namespace terminates), its physical network devices are moved back to the ini‐
tial network namespace (not to the parent of the process).
A virtual network (veth(4)) device pair provides a pipe-like abstraction that can be used to create tunnels between network namespaces, and can be used to create a bridge to a physical network device in another name‐
space. When a namespace is freed, the veth(4) devices that it contains are destroyed.
Use of network namespaces requires a kernel that is configured with the CONFIG_NET_NS option.
SEE ALSO
nsenter(1), unshare(1), clone(2), veth(4), proc(5), sysfs(5), namespaces(7), user_namespaces(7), brctl(8), ip(8), ip-address(8), ip-link(8), ip-netns(8), iptables(8), ovs-vsctl(8)
COLOPHON
This page is part of release 5.05 of the Linux man-pages project. A description of the project, information about reporting bugs, and the latest version of this page, can be found at
https://www.kernel.org/doc/man-pages/.
Sorting through the text above, the following kinds of resources are isolated by a Network Namespace:
- Network devices: they can be listed with the ip link command.
# Network interfaces on the host
root@demonlee-ubuntu:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 08:00:27:06:96:00 brd ff:ff:ff:ff:ff:ff
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:74:45:fe:50 brd ff:ff:ff:ff:ff:ff
5: vethdbd733a@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
link/ether b2:4c:5a:71:8d:7a brd ff:ff:ff:ff:ff:ff link-netnsid 0
root@demonlee-ubuntu:~#
# Network interfaces inside the container
root@demonlee-ubuntu:~# docker exec -it net_para ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
root@demonlee-ubuntu:~#
- IPv4 and IPv6 protocol stacks: this covers the IP layer and the transport-layer protocols above it; most of their tunable parameters live under /proc/sys/net/.
root@demonlee-ubuntu:~# docker exec -it net_para ls -l /proc/sys/net
total 0
dr-xr-xr-x 1 root root 0 Sep 6 11:19 bridge
dr-xr-xr-x 1 root root 0 Sep 6 11:19 core
dr-xr-xr-x 1 root root 0 Sep 6 11:18 ipv4
dr-xr-xr-x 1 root root 0 Sep 6 11:19 ipv6
dr-xr-xr-x 1 root root 0 Sep 6 11:19 mptcp
dr-xr-xr-x 1 root root 0 Sep 6 11:19 netfilter
dr-xr-xr-x 1 root root 0 Sep 6 11:19 unix
root@demonlee-ubuntu:~#
- IP routing tables: running ip route in different Network Namespaces shows different routing tables.
# Routing table on the host
root@demonlee-ubuntu:~# ip route
default via 192.168.105.254 dev enp0s8 proto dhcp metric 101
169.254.0.0/16 dev enp0s3 scope link metric 1000
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.56.0/24 dev enp0s3 proto kernel scope link src 192.168.56.104 metric 100
192.168.105.0/24 dev enp0s8 proto kernel scope link src 192.168.105.20 metric 101
root@demonlee-ubuntu:~#
# Routing table inside the container
root@demonlee-ubuntu:~# docker exec -it net_para ip route
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2
root@demonlee-ubuntu:~#
- Firewall rules: each Network Namespace can have its own iptables rules.
- Network state: the state of the four kinds of resources above can be read from /proc/net (kernel networking information) and /sys/class/net (network interface information).
root@demonlee-ubuntu:~# docker exec -it net_para ls /proc/net
anycast6 igmp6 mcfilter6 rt6_stats udp
arp ip6_flowlabel netfilter rt_acct udp6
dev ip6_mr_cache netlink rt_cache udplite
dev_mcast ip6_mr_vif netstat snmp udplite6
dev_snmp6 ip_mr_cache packet snmp6 unix
fib_trie ip_mr_vif protocols sockstat wireless
fib_triestat ip_tables_matches psched sockstat6 xfrm_stat
icmp ip_tables_names ptype softnet_stat
icmp6 ip_tables_targets raw stat
if_inet6 ipv6_route raw6 tcp
igmp mcfilter route tcp6
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# docker exec -it net_para ls /sys/class/net
eth0 lo
root@demonlee-ubuntu:~#
Building a Network Namespace by Hand
Let's get a feel for network isolation on Linux through two system calls: clone() and unshare().
The clone() system call creates a new process. If we pass the CLONE_NEWNET flag, the new process gets a brand-new network stack, and running ip link inside it makes the difference from the host immediately visible.
The code is shown below:
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <signal.h>     /* SIGCHLD */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
#define STACK_SIZE (1024 * 1024)

static char stack[STACK_SIZE];  /* child stack for clone() */

/* Entry point of the child; it runs inside the new Network Namespace. */
int new_netns(void *para) {
    (void)para;
    printf("[%d] New Namespace Devices:\n", getpid());
    system("ip link");
    printf("\n\n");
    sleep(100);          /* keep the namespace alive so lsns/nsenter can inspect it */
    return 0;
}

int main(void) {
    pid_t pid;

    printf("[%d] Host Namespace Devices:\n", getpid());
    system("ip link");
    printf("\n\n");

    /* CLONE_NEWNET puts the child into a brand-new Network Namespace. */
    pid = clone(new_netns, stack + STACK_SIZE, CLONE_NEWNET | SIGCHLD, NULL);
    if (pid == -1) {
        errExit("clone");
    }
    if (waitpid(pid, NULL, 0) == -1) {
        errExit("waitpid");
    }
    return 0;
}
After compiling, running it produces the following output:
root@demonlee-ubuntu:network_namespace# ./clone-netns
[2313] Host Namespace Devices:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 08:00:27:c7:66:ff brd ff:ff:ff:ff:ff:ff
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 08:00:27:3b:7f:d0 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:eb:01:ca:ab brd ff:ff:ff:ff:ff:ff
[2316] New Namespace Devices:
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@demonlee-ubuntu:network_namespace#
As you can see, the new process has only one network device, lo (the loopback device), while the host has four.
The other approach is to change the current process's Network Namespace with the unshare() system call. man 2 unshare gives a quick overview of what this call does:
NAME
unshare - disassociate parts of the process execution context
SYNOPSIS
#define _GNU_SOURCE
#include <sched.h>
int unshare(int flags);
DESCRIPTION
unshare() allows a process (or thread) to disassociate parts of its execution context that are currently being shared with other
processes (or threads). Part of the execution context, such as the mount namespace, is shared implicitly when a new process is
created using fork(2) or vfork(2), while other parts, such as virtual memory, may be shared by explicit request when creating a
process or thread using clone(2).
The main use of unshare() is to allow a process to control its shared execution context without creating a new process.
The flags argument is a bit mask that specifies which parts of the execution context should be unshared. This argument is speci‐
fied by ORing together zero or more of the following constants:
CLONE_FILES
Reverse the effect of the clone(2) CLONE_FILES flag. Unshare the file descriptor table, so that the calling process no
longer shares its file descriptors with any other process.
CLONE_FS
Reverse the effect of the clone(2) CLONE_FS flag. Unshare filesystem attributes, so that the calling process no longer
shares its root directory (chroot(2)), current directory (chdir(2)), or umask (umask(2)) attributes with any other process.
CLONE_NEWCGROUP (since Linux 4.6)
This flag has the same effect as the clone(2) CLONE_NEWCGROUP flag. Unshare the cgroup namespace. Use of CLONE_NEWCGROUP
requires the CAP_SYS_ADMIN capability.
CLONE_NEWIPC (since Linux 2.6.19)
This flag has the same effect as the clone(2) CLONE_NEWIPC flag. Unshare the IPC namespace, so that the calling process
has a private copy of the IPC namespace which is not shared with any other process. Specifying this flag automatically im‐
plies CLONE_SYSVSEM as well. Use of CLONE_NEWIPC requires the CAP_SYS_ADMIN capability.
CLONE_NEWNET (since Linux 2.6.24)
This flag has the same effect as the clone(2) CLONE_NEWNET flag. Unshare the network namespace, so that the calling
process is moved into a new network namespace which is not shared with any previously existing process. Use of
CLONE_NEWNET requires the CAP_SYS_ADMIN capability.
CLONE_NEWNS
This flag has the same effect as the clone(2) CLONE_NEWNS flag. Unshare the mount namespace, so that the calling process
has a private copy of its namespace which is not shared with any other process. Specifying this flag automatically implies
CLONE_FS as well. Use of CLONE_NEWNS requires the CAP_SYS_ADMIN capability. For further information, see mount_name‐
spaces(7).
...
The demo code is as follows:
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

int main(void) {
    printf("[%d] Host Namespace Devices:\n", getpid());
    system("ip link");
    printf("\n\n");

    /* Detach the current process into its own Network Namespace. */
    if (unshare(CLONE_NEWNET) == -1) {
        errExit("unshare");
    }

    printf("[%d] New Namespace Devices:\n", getpid());
    system("ip link");
    printf("\n\n");
    return 0;
}
After compiling, running it produces the following output:
root@demonlee-ubuntu:network_namespace# ./unshare-netns
[2360] Host Namespace Devices:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 08:00:27:c7:66:ff brd ff:ff:ff:ff:ff:ff
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 08:00:27:3b:7f:d0 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:eb:01:ca:ab brd ff:ff:ff:ff:ff:ff
[2360] New Namespace Devices:
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@demonlee-ubuntu:network_namespace#
The result is the same as with clone(): the new network namespace has a different set of network devices from the host.
So how do we get into a network namespace that already exists? The answer is two system utilities, lsns and nsenter:
- lsns -t net: list the existing Network Namespaces.
- nsenter -t <pid> -n: the -n flag means "enter the Network Namespace that <pid> belongs to".
For the detailed usage of both commands, see lsns -h and nsenter -h.
Taking the clone-netns program from earlier as an example: run it, then inspect it with lsns and nsenter:
root@demonlee-ubuntu:~# lsns -t net
NS TYPE NPROCS PID USER NETNSID NSFS COMMAND
4026531992 net 176 1 root unassigned /sbin/init splash
4026532245 net 1 1208 rtkit unassigned /usr/libexec/rtkit-daemon
4026532323 net 1 3034 root unassigned ./clone-netns
root@demonlee-ubuntu:~#
# Enter the corresponding Network Namespace and run ip addr
root@demonlee-ubuntu:~# nsenter -t 3034 -n ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@demonlee-ubuntu:~#
# Enter the corresponding Network Namespace
root@demonlee-ubuntu:~# nsenter -t 3034 -n
# Then run commands in it, e.g. ip link
root@demonlee-ubuntu:~# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@demonlee-ubuntu:~#
# Switch back to the Network Namespace of PID 1, i.e. the host's network environment
root@demonlee-ubuntu:~# nsenter -t 1 -n
root@demonlee-ubuntu:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 08:00:27:c7:66:ff brd ff:ff:ff:ff:ff:ff
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 08:00:27:3b:7f:d0 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:60:ee:fd:ff brd ff:ff:ff:ff:ff:ff
root@demonlee-ubuntu:~#
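Under the hood, nsenter is a thin wrapper around the setns(2) system call. As a complement to the course material, here is a minimal C sketch of the same idea (my own illustration, error handling kept to a minimum): open the target process's /proc/<pid>/ns/net file and pass the descriptor to setns() before listing devices.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Usage: ./setns-netns <pid>
 * Joins the Network Namespace of <pid> and lists its devices (run as root). */
int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return EXIT_FAILURE;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/ns/net", argv[1]);

    int fd = open(path, O_RDONLY);       /* handle to the target namespace */
    if (fd == -1) {
        perror("open");
        return EXIT_FAILURE;
    }
    if (setns(fd, CLONE_NEWNET) == -1) { /* join it; requires CAP_SYS_ADMIN */
        perror("setns");
        return EXIT_FAILURE;
    }
    close(fd);

    system("ip link");                   /* now shows that namespace's devices */
    return 0;
}
Running it against the clone-netns PID above (e.g. ./setns-netns 3034) should print the same single lo device that nsenter -t 3034 -n ip link showed.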
Modifying Network Parameters in a Container
Reproducing the Problem
As mentioned earlier, a large share of the network parameters on Linux live under /proc/sys/net. An application running in a container may need to change some of them, so how is that done?
Before answering, let's look at a simple experiment: change the default values of a few network parameters on the host, then check whether a newly started container picks up those changes.
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_congestion_control
cubic
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# echo bbr > /proc/sys/net/ipv4/tcp_congestion_control
root@demonlee-ubuntu:~# echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
root@demonlee-ubuntu:~# echo 10 > /proc/sys/net/ipv4/tcp_keepalive_intvl
root@demonlee-ubuntu:~# echo 6 > /proc/sys/net/ipv4/tcp_keepalive_probes
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_congestion_control
bbr
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_time
600
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
10
root@demonlee-ubuntu:~# cat /proc/sys/net/ipv4/tcp_keepalive_probes
6
root@demonlee-ubuntu:~#
The four parameters on the host have been adjusted and the changes verified. Now start a container and look at the same parameters inside it:
root@demonlee-ubuntu:~# docker run -d --name net_para centos:8.1.1911 sleep 3600
f7b71a9615943e04249596829f05bf8ebb8273cffaf039aa5205db95c54d1a86
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# docker exec -it net_para /bin/bash
[root@f7b71a961594 /]# cat /proc/sys/net/ipv4/tcp_congestion_control
bbr
[root@f7b71a961594 /]# cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
[root@f7b71a961594 /]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
[root@f7b71a961594 /]# cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
[root@f7b71a961594 /]#
The result is a bit surprising: apart from tcp_congestion_control, which inherited the host's value, the other parameters are still at their defaults. In other words, some of the network parameters in a newly started container are inherited from the host, while others are re-initialized. Why is that?
Tracing the Cause
To understand where this behavior comes from, we have to go back to the kernel source. The tcp_sk_init function in tcp_ipv4.c looks like this:
static int __net_init tcp_sk_init(struct net *net) {
int res, cpu, cnt;
net->ipv4.tcp_sk = alloc_percpu(struct sock *);
if (!net->ipv4.tcp_sk)
return -ENOMEM;
for_each_possible_cpu(cpu) {
struct sock *sk;
res = inet_ctl_sock_create(&sk, PF_INET, SOCK_RAW,
IPPROTO_TCP, net);
if (res)
goto fail;
sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
/* Please enforce IP_DF and IPID==0 for RST and
* ACK sent in SYN-RECV and TIME-WAIT state.
*/
inet_sk(sk)->pmtudisc = IP_PMTUDISC_DO;
*per_cpu_ptr(net->ipv4.tcp_sk, cpu) = sk;
}
net->ipv4.sysctl_tcp_ecn = 2;
net->ipv4.sysctl_tcp_ecn_fallback = 1;
net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
net->ipv4.sysctl_tcp_keepalive_intvl = TCP_KEEPALIVE_INTVL;
net->ipv4.sysctl_tcp_syn_retries = TCP_SYN_RETRIES;
net->ipv4.sysctl_tcp_synack_retries = TCP_SYNACK_RETRIES;
net->ipv4.sysctl_tcp_syncookies = 1;
net->ipv4.sysctl_tcp_reordering = TCP_FASTRETRANS_THRESH;
net->ipv4.sysctl_tcp_retries1 = TCP_RETR1;
net->ipv4.sysctl_tcp_retries2 = TCP_RETR2;
net->ipv4.sysctl_tcp_orphan_retries = 0;
net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT;
net->ipv4.sysctl_tcp_notsent_lowat = UINT_MAX;
net->ipv4.sysctl_tcp_tw_reuse = 2;
net->ipv4.sysctl_tcp_no_ssthresh_metrics_save = 1;
cnt = tcp_hashinfo.ehash_mask + 1;
net->ipv4.tcp_death_row.sysctl_max_tw_buckets = cnt / 2;
net->ipv4.tcp_death_row.hashinfo = &tcp_hashinfo;
net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128);
net->ipv4.sysctl_tcp_sack = 1;
net->ipv4.sysctl_tcp_window_scaling = 1;
net->ipv4.sysctl_tcp_timestamps = 1;
net->ipv4.sysctl_tcp_early_retrans = 3;
net->ipv4.sysctl_tcp_recovery = TCP_RACK_LOSS_DETECTION;
net->ipv4.sysctl_tcp_slow_start_after_idle = 1; /* By default, RFC2861 behavior. */
net->ipv4.sysctl_tcp_retrans_collapse = 1;
net->ipv4.sysctl_tcp_max_reordering = 300;
net->ipv4.sysctl_tcp_dsack = 1;
net->ipv4.sysctl_tcp_app_win = 31;
net->ipv4.sysctl_tcp_adv_win_scale = 1;
net->ipv4.sysctl_tcp_frto = 2;
net->ipv4.sysctl_tcp_moderate_rcvbuf = 1;
/* This limits the percentage of the congestion window which we
* will allow a single TSO frame to consume. Building TSO frames
* which are too large can cause TCP streams to be bursty.
*/
net->ipv4.sysctl_tcp_tso_win_divisor = 3;
/* Default TSQ limit of 16 TSO segments */
net->ipv4.sysctl_tcp_limit_output_bytes = 16 * 65536;
/* rfc5961 challenge ack rate limiting */
net->ipv4.sysctl_tcp_challenge_ack_limit = 1000;
net->ipv4.sysctl_tcp_min_tso_segs = 2;
net->ipv4.sysctl_tcp_min_rtt_wlen = 300;
net->ipv4.sysctl_tcp_autocorking = 1;
net->ipv4.sysctl_tcp_invalid_ratelimit = HZ/2;
net->ipv4.sysctl_tcp_pacing_ss_ratio = 200;
net->ipv4.sysctl_tcp_pacing_ca_ratio = 120;
if (net != &init_net) {
memcpy(net->ipv4.sysctl_tcp_rmem,
init_net.ipv4.sysctl_tcp_rmem,
sizeof(init_net.ipv4.sysctl_tcp_rmem));
memcpy(net->ipv4.sysctl_tcp_wmem,
init_net.ipv4.sysctl_tcp_wmem,
sizeof(init_net.ipv4.sysctl_tcp_wmem));
}
net->ipv4.sysctl_tcp_comp_sack_delay_ns = NSEC_PER_MSEC;
net->ipv4.sysctl_tcp_comp_sack_slack_ns = 100 * NSEC_PER_USEC;
net->ipv4.sysctl_tcp_comp_sack_nr = 44;
net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE;
spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
atomic_set(&net->ipv4.tfo_active_disable_times, 0);
/* Reno is always built in */
if (!net_eq(net, &init_net) &&
bpf_try_module_get(init_net.ipv4.tcp_congestion_control,
init_net.ipv4.tcp_congestion_control->owner))
net->ipv4.tcp_congestion_control = init_net.ipv4.tcp_congestion_control;
else
net->ipv4.tcp_congestion_control = &tcp_reno;
return 0;
fail:
tcp_sk_exit(net);
return res;
}
The code shows that tcp_congestion_control is assigned differently from sysctl_tcp_keepalive_time and its siblings. The keepalive parameters are unconditionally reset to compile-time defaults such as TCP_KEEPALIVE_TIME, whereas at the end of the function tcp_congestion_control is copied from init_net, the initial (host) network namespace, whenever the namespace being created is not init_net itself and the congestion-control module can be referenced (tcp_rmem and tcp_wmem are likewise memcpy'd from init_net). That is why the new container inherits the host's bbr setting but not the modified keepalive values.
So can we simply modify these parameters from inside the container? Here is the test result:
root@demonlee-ubuntu:~# docker exec -it net_para /bin/bash
[root@f7b71a961594 /]# echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
bash: /proc/sys/net/ipv4/tcp_keepalive_time: Read-only file system
[root@f7b71a961594 /]#
[root@f7b71a961594 /]# echo 10 > /proc/sys/net/ipv4/tcp_keepalive_intvl
bash: /proc/sys/net/ipv4/tcp_keepalive_intvl: Read-only file system
[root@f7b71a961594 /]# echo bbr > /proc/sys/net/ipv4/tcp_congestion_control
bash: /proc/sys/net/ipv4/tcp_congestion_control: Read-only file system
[root@f7b71a961594 /]#
[root@f7b71a961594 /]# cat /proc/mounts | grep "/proc/sys"
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sysrq-trigger proc ro,nosuid,nodev,noexec,relatime 0 0
[root@f7b71a961594 /]#
In an ordinary (non-privileged) container, which I will cover in a separate set of notes, modifying these parameters fails with "Read-only file system". cat /proc/mounts | grep "/proc/sys" confirms that /proc/sys is mounted read-only.
The reason is not hard to guess: security. To guard against uncontrollable side effects, directories such as /proc and /sys are mounted read-only inside the container; see the links in the references for a more detailed explanation.
[root@f7b71a961594 /]# cat /proc/mounts
overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/RQFUDW2R4OWJ5N7WAKVSKXRFZE:/var/lib/docker/overlay2/l/SLRTXUAAXRPQTLBQIAX5Q6EMEQ,upperdir=/var/lib/docker/overlay2/e36766f7c706db2f31c4d53ec0ef2058af85693e4f3923d80aae4f80e5d3575e/diff,workdir=/var/lib/docker/overlay2/e36766f7c706db2f31c4d53ec0ef2058af85693e4f3923d80aae4f80e5d3575e/work 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev tmpfs rw,nosuid,size=65536k,mode=755,inode64 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666 0 0
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,relatime,mode=755,inode64 0 0
cgroup /sys/fs/cgroup/systemd cgroup ro,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
cgroup /sys/fs/cgroup/freezer cgroup ro,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup ro,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/cpuset cgroup ro,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup ro,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/pids cgroup ro,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/devices cgroup ro,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup ro,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/rdma cgroup ro,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/perf_event cgroup ro,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/blkio cgroup ro,nosuid,nodev,noexec,relatime,blkio 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k,inode64 0 0
/dev/sda5 /etc/resolv.conf ext4 rw,relatime,errors=remount-ro 0 0
/dev/sda5 /etc/hostname ext4 rw,relatime,errors=remount-ro 0 0
/dev/sda5 /etc/hosts ext4 rw,relatime,errors=remount-ro 0 0
proc /proc/bus proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/fs proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/irq proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
proc /proc/sysrq-trigger proc ro,nosuid,nodev,noexec,relatime 0 0
tmpfs /proc/asound tmpfs ro,relatime,inode64 0 0
tmpfs /proc/acpi tmpfs ro,relatime,inode64 0 0
tmpfs /proc/kcore tmpfs rw,nosuid,size=65536k,mode=755,inode64 0 0
tmpfs /proc/keys tmpfs rw,nosuid,size=65536k,mode=755,inode64 0 0
tmpfs /proc/timer_list tmpfs rw,nosuid,size=65536k,mode=755,inode64 0 0
tmpfs /proc/sched_debug tmpfs rw,nosuid,size=65536k,mode=755,inode64 0 0
tmpfs /proc/scsi tmpfs ro,relatime,inode64 0 0
tmpfs /sys/firmware tmpfs ro,relatime,inode64 0 0
[root@f7b71a961594 /]#
From the analysis above we can draw two conclusions:
- When network parameters are changed on the host, a newly created container does not inherit all of them.
- Once a container has started, there is no permission to modify its network parameters from inside.
So how can a container's network parameters be modified at all?
The Solution
As shown earlier, nsenter lets us enter a container's Network Namespace, and from there we can change its network parameters. For example:
root@demonlee-ubuntu:~# lsns -t net
NS TYPE NPROCS PID USER NETNSID NSFS COMMAND
4026531992 net 175 1 root unassigned /sbin/init splash
4026532253 net 1 1733 root 0 /run/docker/netns/46ce60a32587 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 3600
4026532326 net 1 1174 rtkit unassigned /usr/libexec/rtkit-daemon
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# nsenter -t 1733 -n sysctl -a | grep net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 7200
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# nsenter -t 1733 -n sysctl -w net.ipv4.tcp_keepalive_time=6000
net.ipv4.tcp_keepalive_time = 6000
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# nsenter -t 1733 -n sysctl -a | grep net.ipv4.tcp_keepalive_time
net.ipv4.tcp_keepalive_time = 6000
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# docker exec -it net_para cat /proc/sys/net/ipv4/tcp_keepalive_time
6000
root@demonlee-ubuntu:~#
With the command nsenter -t 1733 -n sysctl -w net.ipv4.tcp_keepalive_time=6000 (equivalently, nsenter -t 1733 -n bash -c 'echo 6000 > /proc/sys/net/ipv4/tcp_keepalive_time'), we have changed that network parameter inside the net_para container.
But this introduces new problems:
- In a typical production environment, ad-hoc command-line changes via nsenter are not allowed.
- Even if the parameters are changed with nsenter, the service has to be restarted for the new configuration to take effect, which is equally unacceptable.
That leaves only one option: set the parameters when the container starts.
Both Docker and Kubernetes provide a way to do this at container startup: Docker has the --sysctl option, and Kubernetes sets sysctls through the Pod securityContext (unsafe sysctls additionally require the kubelet's --allowed-unsafe-sysctls flag). Let's verify this with Docker's --sysctl option:
root@demonlee-ubuntu:~# docker run -d --name net_para2 --sysctl net.ipv4.tcp_keepalive_time=6600 centos:8.1.1911 sleep 3600
79192ce108974be1cf493c4a592dce992fee57cdec80acc2a1d08072c0aebfef
root@demonlee-ubuntu:~#
root@demonlee-ubuntu:~# docker exec -it net_para2 cat /proc/sys/net/ipv4/tcp_keepalive_time
6600
root@demonlee-ubuntu:~#
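Conceptually, what the runtime does with --sysctl is straightforward: after creating the container's new Network Namespace and before handing control to the workload, it writes the requested values into that namespace's own /proc/sys/net files, at a point where it still has the privilege to do so. The following minimal C sketch (my own illustration of the idea, not the actual runc code) reproduces that sequence on the host; run it as root.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Write a value into a sysctl file such as /proc/sys/net/ipv4/tcp_keepalive_time. */
static void write_sysctl(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(EXIT_FAILURE); }
    fprintf(f, "%s\n", value);
    fclose(f);
}

int main(void) {
    /* 1. Create a fresh Network Namespace for the "container". */
    if (unshare(CLONE_NEWNET) == -1) { perror("unshare"); exit(EXIT_FAILURE); }

    /* 2. Apply the requested sysctl while we still can; this is the step that
     *    docker run --sysctl net.ipv4.tcp_keepalive_time=6600 corresponds to. */
    write_sysctl("/proc/sys/net/ipv4/tcp_keepalive_time", "6600");

    /* 3. Start the "workload"; it simply sees the value, without ever needing
     *    write access to /proc/sys itself. */
    execlp("cat", "cat", "/proc/sys/net/ipv4/tcp_keepalive_time", (char *)NULL);
    perror("execlp");
    return EXIT_FAILURE;
}
Because /proc/sys/net is resolved against the Network Namespace of the process that opens it, the write after unshare() only affects the new namespace and leaves the host's value untouched.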
Summary
- Containers use Network Namespaces, which isolate network devices, routing rules, network kernel parameters, and more.
- On Linux, the clone() and unshare() system calls can be used to experiment with and verify Network Namespaces.
- Kernel network parameters for a container have to be configured when the container is started; in Docker this is done with the --sysctl option.