Classic problem: Python runs as PID 1 inside the container. When an SSH connection times out (especially with ProxyCommand/bastion), the killed ssh leaves child processes behind; they are reparented to PID 1, which never reaps them, so they end up as zombies. I fix this in two layers: I install tini as init inside the image (this protects any plain docker run), and I explicitly enable init: true in compose as a clear signal.
Dockerfile:
 RUN apt-get update \
-    && apt-get install -y --no-install-recommends openssh-client \
+    && apt-get install -y --no-install-recommends openssh-client tini \
     && rm -rf /var/lib/apt/lists/*
-ENTRYPOINT ["python3", "/app/server.py", "--host", "0.0.0.0", "--config", "/config/config.yaml"]
+# tini as PID 1 reaps orphaned children. The main source of zombies: an ssh using
+# ProxyCommand/bastion is killed on ConnectTimeout/SSH_TIMEOUT_SEC and leaves its
+# child ssh to the bastion hanging on init.
+# tini also correctly propagates signals to the Python server.
+ENTRYPOINT ["/usr/bin/tini", "-g", "--", "python3", "/app/server.py", "--host", "0.0.0.0", "--config", "/config/config.yaml"]
docker-compose.yml:
     restart: always
+    # In case the ENTRYPOINT in the image was overridden: docker-init (tini)
+    # will still catch zombie processes from ssh/ProxyCommand on timeouts.
+    init: true
     ports:
What changed and why:
- Dockerfile: I install the tini package and wrap the command as tini -g -- python3 …. Now PID 1 is tini, which:
  - reaps orphaned children (the main source of zombies: an ssh with ProxyCommand/bastion, killed by ConnectTimeout/SSH_TIMEOUT_SEC, leaves its child ssh to the bastion hanging on init);
  - correctly propagates SIGTERM/SIGINT to the entire process group (flag -g), so docker stop doesn't hang for 10 seconds and doesn't leave SSH leftovers.
- docker-compose.yml: init: true is an extra layer of protection, in case someone locally overrides entrypoint: (then tini from the image disappears, but docker-init will still become PID 1).
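To make "reaping" concrete, here is a rough Python sketch of the non-blocking wait loop that an init process effectively runs for its children. It is only an illustration of the idea, not tini's actual code, and the function name reap_exited_children is made up for this example.

```python
import os


def reap_exited_children() -> list:
    """Collect every child that has already exited, without blocking.

    Illustrative only: this is the job that falls to PID 1 for orphaned
    processes. A server that never calls wait() on them leaves zombies.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns immediately if none has exited.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

An init calls a loop like this whenever it receives SIGCHLD. The Python server is not written to do that for processes it did not start itself, which is why putting a real init in front of it is the simpler fix.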
How to apply:
docker compose build --no-cache ansible-status
docker compose up -d
Check that zombies no longer multiply (inside the container, there should be no lines with Z in STAT):
docker compose exec ansible-status sh -c 'ps -e -o pid,ppid,stat,comm | awk "NR==1 || \$3 ~ /Z/"'
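If you would rather run the same check from Python (for example from a health check inside the container), something like the sketch below should work. It assumes ps is available in the image and produces the same four columns as the command above; the function names zombie_rows and list_container_zombies are hypothetical.

```python
import subprocess


def zombie_rows(ps_output: str) -> list:
    """Return (pid, ppid, stat, comm) for every row whose STAT starts with Z."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)  # comm may contain spaces, e.g. "ssh <defunct>"
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        if stat.startswith("Z"):
            rows.append((int(pid), int(ppid), stat, comm))
    return rows


def list_container_zombies() -> list:
    """Run ps (assumed to be installed) and return the zombie rows it reports."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid,ppid,stat,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return zombie_rows(result.stdout)
```

Calling list_container_zombies() inside the container should return an empty list once an init is in place. Matching on the STAT column only avoids false hits from command names that happen to contain a Z.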
If you want to go even further, you can add the options -o ControlMaster=no -o ControlPath=none to the ssh command and start_new_session=True to the subprocess call in ssh_check_for_target. Then ssh doesn't spawn long-lived multiplexers, and on TimeoutExpired we can kill the entire process group, not just the main ssh. But this is an improvement in behavior; the root cause of zombies (lack of init) is already gone.
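A minimal sketch of that optional hardening, under some assumptions: I don't know what ssh_check_for_target really looks like, so build_ssh_cmd and run_in_own_group are hypothetical helpers, and the remote command true is a placeholder. The sketch uses subprocess.Popen rather than subprocess.run, because the child's PID is needed to signal its process group after the timeout.

```python
import os
import signal
import subprocess
from typing import Optional


def build_ssh_cmd(target: str) -> list:
    """Hypothetical helper: an ssh command with connection multiplexing disabled."""
    return [
        "ssh",
        "-o", "ControlMaster=no",
        "-o", "ControlPath=none",
        target,
        "true",  # placeholder remote command
    ]


def run_in_own_group(cmd: list, timeout: float) -> tuple:
    """Run cmd in a new session; on timeout, SIGKILL its whole process group.

    Returns (returncode, output); returncode is None if the timeout fired.
    """
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # child leads a new session and process group
    )
    returncode: Optional[int]
    try:
        out, _ = proc.communicate(timeout=timeout)
        returncode = proc.returncode
    except subprocess.TimeoutExpired:
        try:
            # The group ID equals proc.pid because the child called setsid().
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group is already gone
        out, _ = proc.communicate()  # reap the child and drain its output
        returncode = None
    return returncode, out
```

Usage would look like run_in_own_group(build_ssh_cmd("some-host"), timeout=10), where the host name and timeout are placeholders. With the group kill, a ProxyCommand helper started by ssh is terminated together with it instead of being orphaned.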