Linux Zombie Processes in Practice
Test environment:
Operating System: Ubuntu 20.04.2 LTS
Kernel: Linux 5.8.0-50-generic
Architecture: x86-64
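In a previous article, I summarized the process states on Linux. This article focuses on zombie processes and walks through some of their details.
Zombie processes
A zombie process can be understood as follows: the program has finished running, but a shell of it (its PID and related resources) remains because the parent has not yet reaped it.
A parent usually reaps a zombie child in one of two ways:
1) After creating child process B, parent process A calls wait() or waitpid() to wait for the child to terminate and reclaim its resources;
2) When child B terminates, SIGCHLD is also sent to parent A, so A can register a handler for that signal and reap the child asynchronously. The default disposition of SIGCHLD is to ignore it: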
demonlee@demonlee-ubuntu:~$ man 7 signal
SIGCHLD P1990 Ign Child stopped or terminated
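If the parent handles the child in neither of these ways, or the child exits before the parent gets around to reaping it, the child becomes a zombie whose lifetime may match the parent's. As long as the parent keeps running without reaping it, the zombie remains; once the parent exits, the zombie is adopted by the init process, which reaps it.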
wait() vs waitpid()
Because zombies are reaped through the wait() and waitpid() system calls, it is worth going over both to avoid pitfalls. Running man 2 wait on Linux shows the following (abridged):
...
pid_t wait(int *wstatus);
pid_t waitpid(pid_t pid, int *wstatus, int options);
wait() and waitpid()
The wait() system call suspends execution of the calling thread until one of its children terminates. The call wait(&wstatus) is equivalent
to: waitpid(-1, &wstatus, 0);
The waitpid() system call suspends execution of the calling thread until a child specified by pid argument has changed state. By default, waitpid() waits only for terminated children, but this behavior is modifiable via the options argument, as described below.
The value of pid can be:
< -1 meaning wait for any child process whose process group ID is equal to the absolute value of pid.
-1 meaning wait for any child process.
0 meaning wait for any child process whose process group ID is equal to that of the calling process at the time of the call to waitpid().
> 0 meaning wait for the child whose process ID is equal to the value of pid.
The value of options is an OR of zero or more of the following constants:
WNOHANG return immediately if no child has exited.
WUNTRACED also return if a child has stopped (but not traced via ptrace(2)). Status for traced children which have stopped is provided even
if this option is not specified.
WCONTINUED (since Linux 2.6.10)
also return if a stopped child has been resumed by delivery of SIGCONT.
...
- After wait() is called, the calling process blocks until one of its children terminates. If a child has already terminated, the call returns immediately.
- wait(&wstatus) is equivalent to waitpid(-1, &wstatus, 0); in other words, wait() is just a special case of waitpid().
- The other two parameters of waitpid() change its behavior. The pid parameter selects which child to wait for: -1 means any child, and a value > 0 means the child with that PID. The options parameter offers finer control; for example, setting it to WNOHANG tells the kernel not to block when no child has terminated yet.
- The wstatus parameter receives the child's termination status. If you do not care about the status, call wait(NULL). The stored value can be inspected with the following macros:
  If wstatus is not NULL, wait() and waitpid() store status information in the int to which it points. This integer can be inspected with the following macros (which take the integer itself as an argument, not a pointer to it, as is done in wait() and waitpid()!):
  WIFEXITED(wstatus): returns true if the child terminated normally, that is, by calling exit(3) or _exit(2), or by returning from main().
  WEXITSTATUS(wstatus): returns the exit status of the child. This consists of the least significant 8 bits of the status argument that the child specified in a call to exit(3) or _exit(2) or as the argument for a return statement in main(). This macro should be employed only if WIFEXITED returned true.
  WIFSIGNALED(wstatus): returns true if the child process was terminated by a signal.
  WTERMSIG(wstatus): returns the number of the signal that caused the child process to terminate. This macro should be employed only if WIFSIGNALED returned true.
  WCOREDUMP(wstatus): returns true if the child produced a core dump (see core(5)). This macro should be employed only if WIFSIGNALED returned true.
  ...
- Return values: on success both functions return the PID of the child, and on error they return -1:
  wait(): on success, returns the process ID of the terminated child; on error, -1 is returned.
  waitpid(): on success, returns the process ID of the child whose state has changed; if WNOHANG was specified and one or more child(ren) specified by pid exist, but have not yet changed state, then 0 is returned. On error, -1 is returned.
For more details, see the man page. In addition, a Stack Overflow question about the difference between the two has an answer that says:
The waitpid() function is provided for three reasons:
- To support job control
- To permit a non-blocking version of the wait() function
- To permit a library routine, such as system() or pclose(), to wait for its children without interfering with other terminated children for which the process has not waited
Practice
The code below puts the theory above into practice. The sample program zombie-test.c is as follows:
/**
* Created by DemonLee on 2021/06/05.
*/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <time.h>
#include <string.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
void create_process(int total);
char *get_current_time();
void reap_child_process(int total);
void sig_handler(int sig_no) {
printf("[%s]:receive sig_no: %d\n", get_current_time(), sig_no);
if (sig_no == SIGCHLD) {
int wait_status;
int pid;
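// ④
// 1) Do not use if here. Standard Linux signals are not queued: when several children terminate at nearly the same time,
//    their SIGCHLD signals may be merged, so the handler runs fewer times than children exit. With if, each run reaps
//    only one child and the remaining children stay zombies.
// 2) In the while loop, use waitpid() with WNOHANG instead of wait(): once no terminated child is left, wait() blocks
//    the parent until the next child terminates, that is, until all children have been reaped.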
if ((pid = wait(&wait_status)) > 0) {
// while ((pid = wait(&wait_status)) > 0) {
// while ((pid = waitpid(-1, &wait_status, WNOHANG)) > 0) {
printf("[%s]: child process [%d] is terminated, wait_status=[%d], WIFEXITED=[%d], WEXITSTATUS=[%d]...\n",
get_current_time(), pid, wait_status, WIFEXITED(wait_status), WEXITSTATUS(wait_status));
}
}
}
int main(int argc, char *args[]) {
int total = 1;
if (argc > 1) {
total = atoi(args[1]);
}
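// ①: reap children via a registered signal handler
// signal(SIGCHLD, sig_handler);
create_process(total);
// ②: reap children by calling wait() in a loop
// reap_child_process(total);
int k = 0;
// A signal wakes the parent out of sleep() early, so an infinite loop keeps the parent from exiting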
while (1) {
printf("[%s]:Parent process is going to sleep now[%d]...\n", get_current_time(), k++);
sleep(60);
}
printf("[%s]:Parent process exit now...\n", get_current_time());
return EXIT_SUCCESS;
}
void reap_child_process(int total) {
for (int i = 0; i < total; ++i) {
// waitStatus is not initialized, so the value printed before wait() returns is indeterminate
int waitStatus;
printf("[%s]: idx[%d] pre to wait child process, waitStatus: [%d]\n", get_current_time(), i, waitStatus);
pid_t exit_pid = wait(&waitStatus);
//pid_t exit_pid = waitpid(-1, &waitStatus, 0);
printf("[%s]:idx[%d] child process[%d] is terminated, waitStatus: [%d]\n",
get_current_time(), i, exit_pid, waitStatus);
}
}
void create_process(int total) {
for (int i = 0; i < total; ++i) {
pid_t pid = fork();
if (pid == 0) {
printf("[%s]: child[%d] process is running, ppid: [%d], pid: [%d]\n",
get_current_time(), i, getppid(), getpid());
//sleep(10 + i * 5);
sleep(10);
// ③ simulate the difference between wait() and waitpid()
// if (i == 1) {
// while (1) { sleep(30); }
// }
exit(EXIT_FAILURE);
//exit(EXIT_SUCCESS);
} else if (pid > 0) {
printf("[%s]: parent: [%d] create a child[%d] process success...\n", get_current_time(), getpid(), i);
} else {
printf("[%s]: Cannot create child process, errno: [%d]\n", get_current_time(), errno);
break;
}
}
}
char *get_current_time() {
time_t t;
time(&t);
char *time_str = ctime(&t);
time_str[strlen(time_str) - 1] = '\0';
return time_str;
}
1) Simulating zombie processes
Compile and run the code above. The log looks like this:
demonlee@demonlee-ubuntu:zombie-proc$ ./zombie-test 5
[Sun Jun 6 15:05:06 2021]: parent: [24449] create a child[0] process success...
[Sun Jun 6 15:05:06 2021]: parent: [24449] create a child[1] process success...
[Sun Jun 6 15:05:06 2021]: child[0] process is running, ppid: [24449], pid: [24450]
[Sun Jun 6 15:05:06 2021]: child[1] process is running, ppid: [24449], pid: [24451]
[Sun Jun 6 15:05:06 2021]: parent: [24449] create a child[2] process success...
[Sun Jun 6 15:05:06 2021]: child[2] process is running, ppid: [24449], pid: [24452]
[Sun Jun 6 15:05:06 2021]: parent: [24449] create a child[3] process success...
[Sun Jun 6 15:05:06 2021]: parent: [24449] create a child[4] process success...
[Sun Jun 6 15:05:06 2021]:Parent process is going to sleep now[0]...
[Sun Jun 6 15:05:06 2021]: child[4] process is running, ppid: [24449], pid: [24454]
[Sun Jun 6 15:05:06 2021]: child[3] process is running, ppid: [24449], pid: [24453]
[Sun Jun 6 15:06:06 2021]:Parent process is going to sleep now[1]...
[Sun Jun 6 15:07:06 2021]:Parent process is going to sleep now[2]...
...
...
Checking the process states from another terminal shows that the children start in state S+ and later become Z+:
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 24449 0.0 0.0 2496 716 pts/2 S+ 15:05 0:00 ./zombie-test 5
demonlee 24450 0.0 0.0 2496 76 pts/2 S+ 15:05 0:00 ./zombie-test 5
demonlee 24451 0.0 0.0 2496 88 pts/2 S+ 15:05 0:00 ./zombie-test 5
demonlee 24452 0.0 0.0 2496 88 pts/2 S+ 15:05 0:00 ./zombie-test 5
demonlee 24453 0.0 0.0 2496 88 pts/2 S+ 15:05 0:00 ./zombie-test 5
demonlee 24454 0.0 0.0 2496 88 pts/2 S+ 15:05 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 24449 0.0 0.0 2496 716 pts/2 S+ 15:05 0:00 ./zombie-test 5
demonlee 24450 0.0 0.0 0 0 pts/2 Z+ 15:05 0:00 [zombie-test] <defunct>
demonlee 24451 0.0 0.0 0 0 pts/2 Z+ 15:05 0:00 [zombie-test] <defunct>
demonlee 24452 0.0 0.0 0 0 pts/2 Z+ 15:05 0:00 [zombie-test] <defunct>
demonlee 24453 0.0 0.0 0 0 pts/2 Z+ 15:05 0:00 [zombie-test] <defunct>
demonlee 24454 0.0 0.0 0 0 pts/2 Z+ 15:05 0:00 [zombie-test] <defunct>
demonlee@demonlee-ubuntu:zombie_proc$
2) The parent reaps children by calling wait() in a loop
Uncomment the code at ② in main(), so that main() looks like this:
int main(int argc, char *args[]) {
int total = 1;
if (argc > 1) {
total = atoi(args[1]);
}
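// ①: reap children via a registered signal handler
// signal(SIGCHLD, sig_handler);
create_process(total);
// ⑤
// sleep(15);
// ②: reap children by calling wait() in a loop
reap_child_process(total);
int k = 0;
// A signal wakes the parent out of sleep() early, so an infinite loop keeps the parent from exiting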
while (1) {
printf("[%s]:Parent process is going to sleep now[%d]...\n", get_current_time(), k++);
sleep(60);
}
printf("[%s]:Parent process exit now...\n", get_current_time());
return EXIT_SUCCESS;
}
Recompile and run:
demonlee@demonlee-ubuntu:zombie-proc$ ./zombie-test 5
[Sun Jun 6 17:00:29 2021]: parent: [24595] create a child[0] process success...
[Sun Jun 6 17:00:29 2021]: child[0] process is running, ppid: [24595], pid: [24596]
[Sun Jun 6 17:00:29 2021]: parent: [24595] create a child[1] process success...
[Sun Jun 6 17:00:29 2021]: parent: [24595] create a child[2] process success...
[Sun Jun 6 17:00:29 2021]: child[1] process is running, ppid: [24595], pid: [24597]
[Sun Jun 6 17:00:29 2021]: parent: [24595] create a child[3] process success...
[Sun Jun 6 17:00:29 2021]: parent: [24595] create a child[4] process success...
[Sun Jun 6 17:00:29 2021]: idx[0] pre to wait child process, waitStatus: [5]
[Sun Jun 6 17:00:29 2021]: child[3] process is running, ppid: [24595], pid: [24599]
[Sun Jun 6 17:00:29 2021]: child[2] process is running, ppid: [24595], pid: [24598]
[Sun Jun 6 17:00:29 2021]: child[4] process is running, ppid: [24595], pid: [24600]
[Sun Jun 6 17:00:39 2021]:idx[0] child process[24596] is terminated, waitStatus: [256]
[Sun Jun 6 17:00:39 2021]: idx[1] pre to wait child process, waitStatus: [256]
[Sun Jun 6 17:00:39 2021]:idx[1] child process[24598] is terminated, waitStatus: [256]
[Sun Jun 6 17:00:39 2021]: idx[2] pre to wait child process, waitStatus: [256]
[Sun Jun 6 17:00:39 2021]:idx[2] child process[24597] is terminated, waitStatus: [256]
[Sun Jun 6 17:00:39 2021]: idx[3] pre to wait child process, waitStatus: [256]
[Sun Jun 6 17:00:39 2021]:idx[3] child process[24599] is terminated, waitStatus: [256]
[Sun Jun 6 17:00:39 2021]: idx[4] pre to wait child process, waitStatus: [256]
[Sun Jun 6 17:00:39 2021]:idx[4] child process[24600] is terminated, waitStatus: [256]
[Sun Jun 6 17:00:39 2021]:Parent process is going to sleep now[0]...
...
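The log shows that the parent's looped wait() calls are synchronous and blocking: after one child terminates, the parent blocks again waiting for the next. Checking from another terminal, the number of processes drops from 6 to 1, with no zombies left: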
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 24595 0.0 0.0 2496 716 pts/2 S+ 17:00 0:00 ./zombie-test 5
demonlee 24596 0.0 0.0 2496 80 pts/2 S+ 17:00 0:00 ./zombie-test 5
demonlee 24597 0.0 0.0 2496 92 pts/2 S+ 17:00 0:00 ./zombie-test 5
demonlee 24598 0.0 0.0 2496 92 pts/2 S+ 17:00 0:00 ./zombie-test 5
demonlee 24599 0.0 0.0 2496 92 pts/2 S+ 17:00 0:00 ./zombie-test 5
demonlee 24600 0.0 0.0 2496 92 pts/2 S+ 17:00 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 24595 0.0 0.0 2496 716 pts/2 S+ 17:00 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$
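Reaping this way depends on the parent reaching its wait() loop. What happens if a child exits before the parent calls wait()? The child first becomes a zombie and is reaped only later. And if something goes wrong in the parent so that wait() is never called, the child stays a zombie for as long as the parent is alive.
Uncomment both ② and ⑤ in main() and recompile: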
int main(int argc, char *args[]) {
int total = 1;
if (argc > 1) {
total = atoi(args[1]);
}
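// ①: reap children via a registered signal handler
// signal(SIGCHLD, sig_handler);
create_process(total);
// ⑤
sleep(15);
// ②: reap children by calling wait() in a loop
reap_child_process(total);
int k = 0;
// A signal wakes the parent out of sleep() early, so an infinite loop keeps the parent from exiting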
while (1) {
printf("[%s]:Parent process is going to sleep now[%d]...\n", get_current_time(), k++);
sleep(60);
}
printf("[%s]:Parent process exit now...\n", get_current_time());
return EXIT_SUCCESS;
}
The test result is as follows:
demonlee@demonlee-ubuntu:zombie-proc$ ./zombie-test 5
[Sun Jun 6 17:17:23 2021]: parent: [24734] create a child[0] process success...
[Sun Jun 6 17:17:23 2021]: child[0] process is running, ppid: [24734], pid: [24735]
[Sun Jun 6 17:17:23 2021]: parent: [24734] create a child[1] process success...
[Sun Jun 6 17:17:23 2021]: child[1] process is running, ppid: [24734], pid: [24736]
[Sun Jun 6 17:17:23 2021]: parent: [24734] create a child[2] process success...
[Sun Jun 6 17:17:23 2021]: child[2] process is running, ppid: [24734], pid: [24737]
[Sun Jun 6 17:17:23 2021]: parent: [24734] create a child[3] process success...
[Sun Jun 6 17:17:23 2021]: parent: [24734] create a child[4] process success...
[Sun Jun 6 17:17:23 2021]: child[3] process is running, ppid: [24734], pid: [24738]
[Sun Jun 6 17:17:23 2021]: child[4] process is running, ppid: [24734], pid: [24739]
[Sun Jun 6 17:17:38 2021]: idx[0] pre to wait child process, waitStatus: [0]
[Sun Jun 6 17:17:38 2021]:idx[0] child process[24735] is terminated, waitStatus: [256]
[Sun Jun 6 17:17:38 2021]: idx[1] pre to wait child process, waitStatus: [256]
[Sun Jun 6 17:17:38 2021]:idx[1] child process[24736] is terminated, waitStatus: [256]
[Sun Jun 6 17:17:38 2021]: idx[2] pre to wait child process, waitStatus: [256]
[Sun Jun 6 17:17:38 2021]:idx[2] child process[24737] is terminated, waitStatus: [256]
[Sun Jun 6 17:17:38 2021]: idx[3] pre to wait child process, waitStatus: [256]
[Sun Jun 6 17:17:38 2021]:idx[3] child process[24738] is terminated, waitStatus: [256]
[Sun Jun 6 17:17:38 2021]: idx[4] pre to wait child process, waitStatus: [256]
[Sun Jun 6 17:17:38 2021]:idx[4] child process[24739] is terminated, waitStatus: [256]
[Sun Jun 6 17:17:38 2021]:Parent process is going to sleep now[0]...
...
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 24734 0.0 0.0 2496 720 pts/2 S+ 17:17 0:00 ./zombie-test 5
demonlee 24735 0.0 0.0 2496 80 pts/2 S+ 17:17 0:00 ./zombie-test 5
demonlee 24736 0.0 0.0 2496 92 pts/2 S+ 17:17 0:00 ./zombie-test 5
demonlee 24737 0.0 0.0 2496 92 pts/2 S+ 17:17 0:00 ./zombie-test 5
demonlee 24738 0.0 0.0 2496 92 pts/2 S+ 17:17 0:00 ./zombie-test 5
demonlee 24739 0.0 0.0 2496 92 pts/2 S+ 17:17 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 24734 0.0 0.0 2496 720 pts/2 S+ 17:17 0:00 ./zombie-test 5
demonlee 24735 0.0 0.0 0 0 pts/2 Z+ 17:17 0:00 [zombie-test] <defunct>
demonlee 24736 0.0 0.0 0 0 pts/2 Z+ 17:17 0:00 [zombie-test] <defunct>
demonlee 24737 0.0 0.0 0 0 pts/2 Z+ 17:17 0:00 [zombie-test] <defunct>
demonlee 24738 0.0 0.0 0 0 pts/2 Z+ 17:17 0:00 [zombie-test] <defunct>
demonlee 24739 0.0 0.0 0 0 pts/2 Z+ 17:17 0:00 [zombie-test] <defunct>
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 24734 0.0 0.0 2496 720 pts/2 S+ 17:17 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$
As analyzed above, because the parent delays its call to wait(), the children go from S+ to Z+ and only then disappear.
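3) The parent registers a SIGCHLD handler to reap children
This is the standard approach. Why? Signal handling is asynchronous, so it does not hold up the parent's main flow, and because the handler is registered before any child is created, the parent no longer ends up with a child that has finished before reaping has even started.
Uncomment ① in main() (and comment ② and ⑤ out again), then recompile: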
int main(int argc, char *args[]) {
int total = 1;
if (argc > 1) {
total = atoi(args[1]);
}
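// ①: reap children via a registered signal handler
signal(SIGCHLD, sig_handler);
create_process(total);
// ⑤
// sleep(15);
// ②: reap children by calling wait() in a loop
// reap_child_process(total);
int k = 0;
// A signal wakes the parent out of sleep() early, so an infinite loop keeps the parent from exiting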
while (1) {
printf("[%s]:Parent process is going to sleep now[%d]...\n", get_current_time(), k++);
sleep(60);
}
printf("[%s]:Parent process exit now...\n", get_current_time());
return EXIT_SUCCESS;
}
The test result is as follows:
demonlee@demonlee-ubuntu:zombie-proc$ ./zombie-test 5
[Sun Jun 6 22:02:33 2021]: parent: [24901] create a child[0] process success...
[Sun Jun 6 22:02:33 2021]: child[0] process is running, ppid: [24901], pid: [24902]
[Sun Jun 6 22:02:33 2021]: parent: [24901] create a child[1] process success...
[Sun Jun 6 22:02:33 2021]: child[1] process is running, ppid: [24901], pid: [24903]
[Sun Jun 6 22:02:33 2021]: parent: [24901] create a child[2] process success...
[Sun Jun 6 22:02:33 2021]: child[2] process is running, ppid: [24901], pid: [24904]
[Sun Jun 6 22:02:33 2021]: parent: [24901] create a child[3] process success...
[Sun Jun 6 22:02:33 2021]: child[3] process is running, ppid: [24901], pid: [24905]
[Sun Jun 6 22:02:33 2021]: parent: [24901] create a child[4] process success...
[Sun Jun 6 22:02:33 2021]:Parent process is going to sleep now[0]...
[Sun Jun 6 22:02:33 2021]: child[4] process is running, ppid: [24901], pid: [24906]
[Sun Jun 6 22:02:43 2021]:receive sig_no: 17
[Sun Jun 6 22:02:43 2021]: child process [24902] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:02:43 2021]:receive sig_no: 17
[Sun Jun 6 22:02:43 2021]: child process [24903] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:02:43 2021]:Parent process is going to sleep now[1]...
[Sun Jun 6 22:02:43 2021]:receive sig_no: 17
[Sun Jun 6 22:02:43 2021]: child process [24904] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:02:43 2021]:Parent process is going to sleep now[2]...
[Sun Jun 6 22:03:43 2021]:Parent process is going to sleep now[3]...
...
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 24901 0.0 0.0 2496 716 pts/2 S+ 22:02 0:00 ./zombie-test 5
demonlee 24902 0.0 0.0 2496 76 pts/2 S+ 22:02 0:00 ./zombie-test 5
demonlee 24903 0.0 0.0 2496 88 pts/2 S+ 22:02 0:00 ./zombie-test 5
demonlee 24904 0.0 0.0 2496 88 pts/2 S+ 22:02 0:00 ./zombie-test 5
demonlee 24905 0.0 0.0 2496 88 pts/2 S+ 22:02 0:00 ./zombie-test 5
demonlee 24906 0.0 0.0 2496 88 pts/2 S+ 22:02 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 24901 0.0 0.0 2496 716 pts/2 S+ 22:02 0:00 ./zombie-test 5
demonlee 24905 0.0 0.0 0 0 pts/2 Z+ 22:02 0:00 [zombie-test] <defunct>
demonlee 24906 0.0 0.0 0 0 pts/2 Z+ 22:02 0:00 [zombie-test] <defunct>
demonlee@demonlee-ubuntu:zombie_proc$
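Oops, zombies have appeared. The log also shows that the signal handler reaped only 3 of the 5 children, so where did the other two go? The answer lies in the signal handler: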
void sig_handler(int sig_no) {
printf("[%s]:receive sig_no: %d\n", get_current_time(), sig_no);
if (sig_no == SIGCHLD) {
int wait_status;
int pid;
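// ④
// 1) Do not use if here. Standard Linux signals are not queued: when several children terminate at nearly the same time,
//    their SIGCHLD signals may be merged, so the handler runs fewer times than children exit. With if, each run reaps
//    only one child and the remaining children stay zombies.
// 2) In the while loop, use waitpid() with WNOHANG instead of wait(): once no terminated child is left, wait() blocks
//    the parent until the next child terminates, that is, until all children have been reaped.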
if ((pid = wait(&wait_status)) > 0) {
// while ((pid = wait(&wait_status)) > 0) {
// while ((pid = waitpid(-1, &wait_status, WNOHANG)) > 0) {
printf("[%s]: child process [%d] is terminated, wait_status=[%d], WIFEXITED=[%d], WEXITSTATUS=[%d]...\n",
get_current_time(), pid, wait_status, WIFEXITED(wait_status), WEXITSTATUS(wait_status));
}
}
}
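The code comments already give the reason: with if ((pid = wait(&wait_status)) > 0), each handler invocation reaps at most one child, while merged SIGCHLD signals mean the handler runs fewer times than children exit. The if therefore has to become a while, as follows: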
void sig_handler(int sig_no) {
printf("[%s]:receive sig_no: %d\n", get_current_time(), sig_no);
if (sig_no == SIGCHLD) {
int wait_status;
int pid;
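// ④
// 1) Do not use if here. Standard Linux signals are not queued: when several children terminate at nearly the same time,
//    their SIGCHLD signals may be merged, so the handler runs fewer times than children exit. With if, each run reaps
//    only one child and the remaining children stay zombies.
// 2) In the while loop, use waitpid() with WNOHANG instead of wait(): once no terminated child is left, wait() blocks
//    the parent until the next child terminates, that is, until all children have been reaped.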
// if ((pid = wait(&wait_status)) > 0) {
while ((pid = wait(&wait_status)) > 0) {
// while ((pid = waitpid(-1, &wait_status, WNOHANG)) > 0) {
printf("[%s]: child process [%d] is terminated, wait_status=[%d], WIFEXITED=[%d], WEXITSTATUS=[%d]...\n",
get_current_time(), pid, wait_status, WIFEXITED(wait_status), WEXITSTATUS(wait_status));
}
}
}
Recompile and test:
demonlee@demonlee-ubuntu:zombie-proc$ ./zombie-test 5
[Sun Jun 6 22:19:14 2021]: parent: [25051] create a child[0] process success...
[Sun Jun 6 22:19:14 2021]: child[0] process is running, ppid: [25051], pid: [25052]
[Sun Jun 6 22:19:14 2021]: parent: [25051] create a child[1] process success...
[Sun Jun 6 22:19:14 2021]: child[1] process is running, ppid: [25051], pid: [25053]
[Sun Jun 6 22:19:14 2021]: parent: [25051] create a child[2] process success...
[Sun Jun 6 22:19:14 2021]: parent: [25051] create a child[3] process success...
[Sun Jun 6 22:19:14 2021]: child[3] process is running, ppid: [25051], pid: [25055]
[Sun Jun 6 22:19:14 2021]: child[2] process is running, ppid: [25051], pid: [25054]
[Sun Jun 6 22:19:14 2021]: parent: [25051] create a child[4] process success...
[Sun Jun 6 22:19:14 2021]: child[4] process is running, ppid: [25051], pid: [25056]
[Sun Jun 6 22:19:14 2021]:Parent process is going to sleep now[0]...
[Sun Jun 6 22:19:24 2021]:receive sig_no: 17
[Sun Jun 6 22:19:24 2021]: child process [25052] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:19:24 2021]: child process [25053] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:19:24 2021]: child process [25054] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:19:24 2021]: child process [25055] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:19:24 2021]: child process [25056] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:19:24 2021]:receive sig_no: 17
[Sun Jun 6 22:19:24 2021]:Parent process is going to sleep now[1]...
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 25051 0.0 0.0 2496 704 pts/2 S+ 22:19 0:00 ./zombie-test 5
demonlee 25052 0.0 0.0 2496 76 pts/2 S+ 22:19 0:00 ./zombie-test 5
demonlee 25053 0.0 0.0 2496 88 pts/2 S+ 22:19 0:00 ./zombie-test 5
demonlee 25054 0.0 0.0 2496 88 pts/2 S+ 22:19 0:00 ./zombie-test 5
demonlee 25055 0.0 0.0 2496 88 pts/2 S+ 22:19 0:00 ./zombie-test 5
demonlee 25056 0.0 0.0 2496 88 pts/2 S+ 22:19 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 25051 0.0 0.0 2496 704 pts/2 S+ 22:19 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$
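With the while loop, all children are reaped and none is missed. Looking closer, though, the 5 children generated 5 SIGCHLD signals, yet the handler's log line appears only twice, and the first invocation reaped all 5 children in one go. But what if the 5 children finish at different times? As noted earlier, wait() blocks, so the parent will be stuck on the while line inside the handler until the remaining children terminate.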
Next, uncomment the code at ③ in create_process(), as follows:
void create_process(int total) {
for (int i = 0; i < total; ++i) {
pid_t pid = fork();
if (pid == 0) {
printf("[%s]: child[%d] process is running, ppid: [%d], pid: [%d]\n",
get_current_time(), i, getppid(), getpid());
//sleep(10 + i * 5);
sleep(10);
// ③ simulate the difference between wait() and waitpid()
if (i == 1) {
while (1) { sleep(30); }
}
exit(EXIT_FAILURE);
//exit(EXIT_SUCCESS);
} else if (pid > 0) {
printf("[%s]: parent: [%d] create a child[%d] process success...\n", get_current_time(), getpid(), i);
} else {
printf("[%s]: Cannot create child process, errno: [%d]\n", get_current_time(), errno);
break;
}
}
}
Recompile and test. The result is as follows:
demonlee@demonlee-ubuntu:zombie-proc$ ./zombie-test 5
[Sun Jun 6 22:41:39 2021]: parent: [25174] create a child[0] process success...
[Sun Jun 6 22:41:39 2021]: child[0] process is running, ppid: [25174], pid: [25175]
[Sun Jun 6 22:41:39 2021]: parent: [25174] create a child[1] process success...
[Sun Jun 6 22:41:39 2021]: child[1] process is running, ppid: [25174], pid: [25176]
[Sun Jun 6 22:41:39 2021]: parent: [25174] create a child[2] process success...
[Sun Jun 6 22:41:39 2021]: child[2] process is running, ppid: [25174], pid: [25177]
[Sun Jun 6 22:41:39 2021]: parent: [25174] create a child[3] process success...
[Sun Jun 6 22:41:39 2021]: child[3] process is running, ppid: [25174], pid: [25178]
[Sun Jun 6 22:41:39 2021]: parent: [25174] create a child[4] process success...
[Sun Jun 6 22:41:39 2021]:Parent process is going to sleep now[0]...
[Sun Jun 6 22:41:39 2021]: child[4] process is running, ppid: [25174], pid: [25179]
[Sun Jun 6 22:41:49 2021]:receive sig_no: 17
[Sun Jun 6 22:41:49 2021]: child process [25175] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:41:49 2021]: child process [25177] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:41:49 2021]: child process [25178] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:41:49 2021]: child process [25179] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 25174 0.0 0.0 2496 720 pts/2 S+ 22:41 0:00 ./zombie-test 5
demonlee 25175 0.0 0.0 2496 80 pts/2 S+ 22:41 0:00 ./zombie-test 5
demonlee 25176 0.0 0.0 2496 92 pts/2 S+ 22:41 0:00 ./zombie-test 5
demonlee 25177 0.0 0.0 2496 92 pts/2 S+ 22:41 0:00 ./zombie-test 5
demonlee 25178 0.0 0.0 2496 92 pts/2 S+ 22:41 0:00 ./zombie-test 5
demonlee 25179 0.0 0.0 2496 92 pts/2 S+ 22:41 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 25174 0.0 0.0 2496 720 pts/2 S+ 22:41 0:00 ./zombie-test 5
demonlee 25176 0.0 0.0 2496 92 pts/2 S+ 22:41 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$
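The parent is supposed to print its sleep log about once a minute, but only iteration 0 appears: [Sun Jun 6 22:41:39 2021]:Parent process is going to sleep now[0]... Nothing is printed after that, so we can infer that the parent is blocked.
How can we keep the parent from blocking? The answer is waitpid(). Change sig_handler() to the following: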
void sig_handler(int sig_no) {
printf("[%s]:receive sig_no: %d\n", get_current_time(), sig_no);
if (sig_no == SIGCHLD) {
int wait_status;
int pid;
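// ④
// 1) Do not use if here. Standard Linux signals are not queued: when several children terminate at nearly the same time,
//    their SIGCHLD signals may be merged, so the handler runs fewer times than children exit. With if, each run reaps
//    only one child and the remaining children stay zombies.
// 2) In the while loop, use waitpid() with WNOHANG instead of wait(): once no terminated child is left, wait() blocks
//    the parent until the next child terminates, that is, until all children have been reaped.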
// if ((pid = wait(&wait_status)) > 0) {
// while ((pid = wait(&wait_status)) > 0) {
while ((pid = waitpid(-1, &wait_status, WNOHANG)) > 0) {
printf("[%s]: child process [%d] is terminated, wait_status=[%d], WIFEXITED=[%d], WEXITSTATUS=[%d]...\n",
get_current_time(), pid, wait_status, WIFEXITED(wait_status), WEXITSTATUS(wait_status));
}
}
}
Compiling and testing gives:
demonlee@demonlee-ubuntu:zombie-proc$ ./zombie-test 5
[Sun Jun 6 22:52:08 2021]: parent: [25212] create a child[0] process success...
[Sun Jun 6 22:52:08 2021]: child[0] process is running, ppid: [25212], pid: [25213]
[Sun Jun 6 22:52:08 2021]: parent: [25212] create a child[1] process success...
[Sun Jun 6 22:52:08 2021]: child[1] process is running, ppid: [25212], pid: [25214]
[Sun Jun 6 22:52:08 2021]: parent: [25212] create a child[2] process success...
[Sun Jun 6 22:52:08 2021]: child[2] process is running, ppid: [25212], pid: [25215]
[Sun Jun 6 22:52:08 2021]: parent: [25212] create a child[3] process success...
[Sun Jun 6 22:52:08 2021]: child[3] process is running, ppid: [25212], pid: [25216]
[Sun Jun 6 22:52:08 2021]: parent: [25212] create a child[4] process success...
[Sun Jun 6 22:52:08 2021]:Parent process is going to sleep now[0]...
[Sun Jun 6 22:52:08 2021]: child[4] process is running, ppid: [25212], pid: [25217]
[Sun Jun 6 22:52:18 2021]:receive sig_no: 17
[Sun Jun 6 22:52:18 2021]: child process [25213] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:52:18 2021]: child process [25215] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:52:18 2021]: child process [25216] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:52:18 2021]: child process [25217] is terminated, wait_status=[256], WIFEXITED=[1], WEXITSTATUS=[1]...
[Sun Jun 6 22:52:18 2021]:receive sig_no: 17
[Sun Jun 6 22:52:18 2021]:Parent process is going to sleep now[1]...
[Sun Jun 6 22:53:18 2021]:Parent process is going to sleep now[2]...
[Sun Jun 6 22:54:18 2021]:Parent process is going to sleep now[3]...
...
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 25212 0.0 0.0 2496 716 pts/2 S+ 22:52 0:00 ./zombie-test 5
demonlee 25213 0.0 0.0 2496 80 pts/2 S+ 22:52 0:00 ./zombie-test 5
demonlee 25214 0.0 0.0 2496 92 pts/2 S+ 22:52 0:00 ./zombie-test 5
demonlee 25215 0.0 0.0 2496 92 pts/2 S+ 22:52 0:00 ./zombie-test 5
demonlee 25216 0.0 0.0 2496 92 pts/2 S+ 22:52 0:00 ./zombie-test 5
demonlee 25217 0.0 0.0 2496 92 pts/2 S+ 22:52 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$ ps aux|grep -v grep|grep zombie-test
demonlee 25212 0.0 0.0 2496 716 pts/2 S+ 22:52 0:00 ./zombie-test 5
demonlee 25214 0.0 0.0 2496 92 pts/2 S+ 22:52 0:00 ./zombie-test 5
demonlee@demonlee-ubuntu:zombie_proc$
Clearly, unlike before, the parent is no longer blocked, so this is the approach we recommend.
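Summary
To wrap up, a brief recap of this article:
- A zombie process is in the EXIT_ZOMBIE state, which ps and top display as Z.
- Zombies appear because the parent has not yet reaped a terminated child. Once the parent reaps it, the zombie state ends and the child is gone for good.
- A parent can reap terminated children with the wait() or waitpid() system calls.
- The best practice is to reap children asynchronously in a signal handler, using the non-blocking form of waitpid(), that is: while ((pid = waitpid(-1, &wait_status, WNOHANG)) > 0).
One last point: a parent can reap only its own children, not its grandchildren. For example, if a child creates a child of its own (a grandchild) and that grandchild becomes a zombie, it stays a zombie for as long as the child is alive and does not reap it.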