nohup & tmux

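Problems with nohup

While using the nohup command to launch training jobs in the background, I ran into a few pitfalls. I am writing them down here as a record and as a warning to myself.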

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4156332 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4156333 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/user2/miniconda/envs/matting/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user2/miniconda/envs/matting/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user2/miniconda/envs/matting/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user2/miniconda/envs/matting/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user2/miniconda/envs/matting/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user2/miniconda/envs/matting/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/user2/miniconda/envs/matting/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user2/miniconda/envs/matting/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
    result = agent.run()
  File "/home/user2/miniconda/envs/matting/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/user2/miniconda/envs/matting/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/user2/miniconda/envs/matting/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/user2/miniconda/envs/matting/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1100295 got signal: 1

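This is the same problem described in this GitHub issue: "nohup跑一段时间显示Message: 'Received 1 death signal, shutting down workers'" (after running under nohup for a while, the message 'Received 1 death signal, shutting down workers' appears) · Issue #237 · OpenMOSS/MOSS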

My current understanding of what causes this problem:

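  • nohup itself ignores SIGHUP, but it only sets the signal's disposition to "ignore" before starting the command. If the program later installs its own SIGHUP handler, that setting is replaced. The traceback above points in this direction: the exception is raised from _terminate_process_handler in torch.distributed.elastic.multiprocessing, so PyTorch's multiprocessing code may be changing how the signal is handled. (This is still a guess.)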

(I will come back and analyze this properly once I have studied operating systems in more depth.)
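The first half of that explanation can be checked in isolation. The sketch below is my own illustration, not anything taken from nohup or PyTorch: it starts a child Python process with SIGHUP either ignored or left at its default, and asks the child how it sees the signal at startup. The helper name child_sighup_disposition is hypothetical, and preexec_fn only stands in for what nohup does before it runs the command.

```python
import signal
import subprocess
import sys


def child_sighup_disposition(ignore_before_exec: bool) -> str:
    """Start a child Python process and report whether it sees SIGHUP as ignored.

    Hypothetical helper for illustration; POSIX only.
    """

    def pre_exec():
        # Runs in the child between fork and exec, roughly where nohup
        # would mark SIGHUP as ignored before running the real command.
        disposition = signal.SIG_IGN if ignore_before_exec else signal.SIG_DFL
        signal.signal(signal.SIGHUP, disposition)

    # The child prints True if SIGHUP is ignored when it starts up.
    code = "import signal; print(signal.getsignal(signal.SIGHUP) == signal.SIG_IGN)"
    result = subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=pre_exec,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("ignored before exec:", child_sighup_disposition(True))
    print("default before exec:", child_sighup_disposition(False))
```

An "ignore" setting survives the exec, which is the protection nohup relies on. It stays in effect only as long as the program does not call signal.signal for SIGHUP itself.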

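import signal
import time
import sys

def handle_sighup(signum, frame):
    # Called whenever the process receives SIGHUP.
    print("Received SIGHUP signal", file=sys.stderr)

# Installing a handler replaces the "ignore" disposition inherited from nohup.
signal.signal(signal.SIGHUP, handle_sighup)

print("Running test_sighup.py", file=sys.stderr)
# Stay alive long enough for SIGHUP to be sent from outside.
time.sleep(1000)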

>>> # OUTPUT
nohup: ignoring input
Running test_sighup.py
Received SIGHUP signal
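The output above shows that a handler installed by the script still runs under nohup. A common Unix convention is to leave a signal alone if it was already ignored when the program started. Below is a minimal sketch of that convention, assuming you control the script that installs the handler. The function name install_unless_ignored is hypothetical, and this is not how PyTorch behaves.

```python
import signal


def install_unless_ignored(signum, handler):
    """Install handler only if the signal is not already ignored.

    Returns True if the handler was installed, False if the inherited
    "ignore" disposition (for example from nohup) was left in place.
    """
    if signal.getsignal(signum) == signal.SIG_IGN:
        return False
    signal.signal(signum, handler)
    return True


def handle_sighup(signum, frame):
    print("Received SIGHUP signal")


if __name__ == "__main__":
    installed = install_unless_ignored(signal.SIGHUP, handle_sighup)
    print("handler installed:", installed)
```

With this check, the script would keep ignoring SIGHUP when started through nohup and would only react to it when started normally. It does not help when the handler is installed by library code that you do not control, as appears to be the case with the training job above.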

How to use tmux: