Zombie process issue in Docker container

Classic problem: Python runs as PID 1 inside the container. When an SSH check times out (especially with ProxyCommand/bastion), the child processes it leaves behind are orphaned and reparented to PID 1, which never reaps them, so they pile up as zombies. I fix this in two layers: I install tini as the init process inside the image (this protects any plain docker run), and I explicitly set init: true in compose as a clear signal.

 RUN apt-get update \
-    && apt-get install -y --no-install-recommends openssh-client \
+    && apt-get install -y --no-install-recommends openssh-client tini \
     && rm -rf /var/lib/apt/lists/*

-ENTRYPOINT ["python3", "/app/server.py", "--host", "0.0.0.0", "--config", "/config/config.yaml"]
+# tini as PID 1: reaps orphaned children. The main source of zombies is ssh with ProxyCommand/bastion:
+# when it is killed on a timeout (ConnectTimeout/SSH_TIMEOUT_SEC), the child ssh to the bastion is left hanging on init.
+# With -g, tini also forwards signals to the whole process group of the Python server.
+ENTRYPOINT ["/usr/bin/tini", "-g", "--", "python3", "/app/server.py", "--host", "0.0.0.0", "--config", "/config/config.yaml"]

     restart: always
+    # In case the ENTRYPOINT in the image was overridden: docker-init (tini)
+    # will still catch zombie processes from ssh/ProxyCommand on timeouts.
+    init: true
     ports:

What changed and why:

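  • Dockerfile: I install the tini package and wrap the command as tini -g -- python3 …. Now PID 1 is tini, which:

    • reaps orphaned children. The main source of zombies is ssh with ProxyCommand/bastion: when it is killed on a timeout (ConnectTimeout/SSH_TIMEOUT_SEC), the child ssh to the bastion is orphaned and reparented to PID 1;
    • forwards SIGTERM/SIGINT to the entire process group (flag -g). Python running as PID 1 without its own SIGTERM handler ignores the signal, so docker stop waits out the default 10-second grace period before sending SIGKILL. With tini the stop is prompt and doesn’t leave SSH leftovers.
  • docker-compose.yml: init: true is an extra layer of protection in case someone locally overrides entrypoint:. The tini from the image then disappears, but docker-init is still PID 1. When both are active, docker-init is PID 1 and the image’s tini runs as its child; it may log a warning that it is not PID 1, but orphans are still reaped by docker-init.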
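To make "reaping" concrete, here is a minimal Linux-only sketch of what a zombie is and what clears it. The helper names proc_state and demo_zombie are hypothetical and not part of the server. The sketch forks a child that exits at once, observes state Z in /proc while nobody has waited on it, then calls waitpid, which is essentially what tini does for every child that gets reparented to it.

```python
import os
import time


def proc_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state field comes right after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def demo_zombie():
    """Fork a child that exits, observe it as a zombie, then reap it.

    Returns (state_before_reaping, reaped).
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without cleanup

    # Parent: the child has exited (or is about to) but nobody has waited on it yet.
    deadline = time.monotonic() + 5
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)

    os.waitpid(pid, 0)  # reaping: collect the exit status, the kernel frees the entry
    reaped = not os.path.exists(f"/proc/{pid}")
    return state, reaped


if __name__ == "__main__":
    print(demo_zombie())
```

A Python server as PID 1 only waits on the children it started itself, so orphans that get reparented to it stay in state Z indefinitely.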

How to apply:

docker compose build --no-cache ansible-status
docker compose up -d

Check that zombies no longer accumulate (inside the container, no process should have Z in the STAT column):

docker compose exec ansible-status sh -c 'ps -e -o pid,ppid,stat,comm | awk "NR==1 || \$3 ~ /^Z/"'
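If ps turns out not to be installed in the image, the same check can be sketched in Python by reading /proc directly. This is an illustrative helper, not part of the server: the name list_zombies and its proc_root parameter are assumptions made for this example.

```python
import os


def list_zombies(proc_root="/proc"):
    """Return a sorted list of (pid, ppid, comm) for processes in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                data = f.read()
        except OSError:
            continue  # the process went away between listdir() and open()
        # Format: pid (comm) state ppid ...; comm itself may contain spaces or ")".
        comm = data[data.index("(") + 1 : data.rindex(")")]
        fields = data[data.rindex(")") + 1 :].split()
        state, ppid = fields[0], int(fields[1])
        if state == "Z":
            zombies.append((int(entry), ppid, comm))
    return sorted(zombies)


if __name__ == "__main__":
    for pid, ppid, comm in list_zombies():
        print(pid, ppid, comm)
```

An empty result after a few timed-out SSH checks suggests that PID 1 is reaping orphans as intended.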

If you want to go even further, you can add -o ControlMaster=no -o ControlPath=none to the ssh options and pass start_new_session=True to subprocess.run in ssh_check_for_target. Then ssh doesn’t spawn long-lived multiplexers, and on TimeoutExpired we can kill the entire process group rather than just the main ssh. This is a behavioral improvement, though: the root cause of the zombies (lack of an init) is already gone.
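A rough sketch of that optional hardening, under stated assumptions: I have not seen the real signature of ssh_check_for_target, so build_ssh_cmd and run_in_own_group below are hypothetical helpers, and the default timeout value is made up. The sketch uses Popen rather than subprocess.run because run, on timeout, kills only the direct child; killing the group has to be done explicitly with os.killpg. It also assumes the ProxyCommand child stays in ssh’s process group.

```python
import os
import signal
import subprocess


def build_ssh_cmd(target, remote_cmd="true", connect_timeout=5):
    """Build an ssh argv with connection multiplexing disabled."""
    return [
        "ssh",
        "-o", f"ConnectTimeout={connect_timeout}",
        "-o", "ControlMaster=no",   # never become a multiplexing master
        "-o", "ControlPath=none",   # never reuse an existing control socket
        target,
        remote_cmd,
    ]


def run_in_own_group(cmd, timeout):
    """Run cmd in a new session; on timeout, SIGKILL its whole process group.

    Returns (returncode, timed_out).
    """
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child calls setsid(): new session, new process group
    )
    try:
        return proc.wait(timeout=timeout), False
    except subprocess.TimeoutExpired:
        try:
            # After setsid() the process group id equals the child's pid.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group vanished between the timeout and the kill
        return proc.wait(), True  # reap the direct child ourselves
```

Usage would look like run_in_own_group(build_ssh_cmd("some-host"), timeout=10), where "some-host" is a placeholder. Any grandchildren that are killed this way still end up reparented to PID 1, so tini remains the piece that reaps them.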