Ansible stuck on gathering facts

Solution 1

I had a similar issue with Ansible ping on Vagrant: it suddenly got stuck for no apparent reason, after previously working absolutely fine. Unlike an ordinary SSH or connectivity problem, it hung forever with no timeout.

What resolved it for me was cleaning out the ~/.ansible directory, after which it just worked again. I couldn't find out why, but it did fix the problem.

If you get the chance to see it again, try cleaning the ~/.ansible folder before you refresh your Vagrant box.

Solution 2

Ansible can hang like this for a number of reasons, usually because of a connection problem or because the setup module hangs. Here's how to narrow the problem down so you can solve it.

Ansible cannot connect to the destination host

Host Key (known_hosts) Problems

1) On older versions of Ansible (2.1 or older), Ansible would not always tell you if the host key for the destination does not exist on the source, or if there is a mismatch.

Solution: try opening an SSH connection with the same parameters to that destination. You may find SSH errors you need to resolve, and then the command will work.

2) Sometimes Ansible displays an SSH connection message to you in the midst of other statuses, causing Ansible to "freeze" on that task:

Warning: the ECDSA host key for 'myhost' differs from the key for the IP address '10.10.1.10'
Offending key for IP in /etc/ssh/ssh_known_hosts:246
Matching host key in /etc/ssh/ssh_known_hosts:477
Are you sure you want to continue connecting (yes/no)?

In this case, simply typing "yes" for as many SSH questions as you were asked will permit the play to continue. Afterwards you can fix the root known_hosts problems.
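
To fix the underlying known_hosts problem afterwards, the stale entries can be removed with ssh-keygen -R. The function below is only a sketch: the name forget_host_keys is made up for this example, and it assumes an OpenSSH ssh-keygen on the machine that runs Ansible.

```shell
# Remove every known_hosts entry for the given names
# (a hostname, an IP address, or "[host]:port").
# Usage: forget_host_keys <known_hosts_file> <name>...
forget_host_keys() {
    known_hosts=$1
    shift
    for name in "$@"; do
        # ssh-keygen -R rewrites the file and keeps a backup in <file>.old
        ssh-keygen -R "$name" -f "$known_hosts"
    done
}
```

For the warning shown above, that would be something like forget_host_keys /etc/ssh/ssh_known_hosts 10.10.1.10 (as root, since that file is system-wide), followed by a manual ssh to accept the current key.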

Private Key Authentication Problems

If you are using key-based rather than password authentication, other possible problems include:

  • The public key may not be installed properly on the destination (for example, missing from the remote user's authorized_keys)
  • The private key might have incorrect permissions locally (it should be readable only by the user running the Ansible job)

Solution: try running ansible -m ping <destination> -k against the problem host (-k makes Ansible prompt for the SSH password instead of relying on the key). If that doesn't work, try the Host Key Problems solutions above.
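
The permissions point is easy to check from the shell. Below is a rough sketch; key_is_private is a hypothetical helper name, and it only looks at the mode string printed by ls, so it ignores ACLs.

```shell
# Succeeds only if the key file grants no permissions to group or others.
# Returns 2 if the file does not exist.
# Usage: key_is_private ~/.ssh/id_rsa
key_is_private() {
    key=$1
    [ -f "$key" ] || return 2
    # ls -ld prints the mode string first, e.g. "-rw-------";
    # characters 5-10 are the group and other permission bits.
    perms=$(ls -ld "$key" | cut -c5-10)
    [ "$perms" = "------" ]
}
```

If the check fails, chmod 600 on the key file is the usual fix. As one commenter points out, a passphrase-protected key that has not been added to the ssh agent can also stall the connection; ssh-add -l lists the keys the agent currently holds.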

Ansible cannot quickly gather facts

The setup module (when run automatically at the beginning of an ansible-playbook run, or when run manually as ansible -m setup <host>) can often hang when gathering hardware facts (e.g. if getting disk information from hosts with high i/o, bad mount entries, etc.).

Solution: try running ansible -m setup -a 'gather_subset=!all' <destination> (the quotes stop the shell from interpreting the "!"). If this works, you should consider setting this line in the [defaults] section of your ansible.cfg:

gather_subset=!hardware

Solution 3

For me, the setup module was stuck on a dead NFS mount.

If you run "df" on the machine and it hangs without printing anything, you may be in the same situation.

PS: if you can't unmount the NFS share or mount point normally, consider the lazy unmount "umount -l" as a last resort.
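
A hung df is awkward to test for, because the test itself hangs. One way around that is to run the check in the background and stop waiting after a few seconds. This is a minimal sketch; path_responds and its default of 5 seconds are assumptions made for illustration.

```shell
# Succeeds if listing the path finishes within the limit (default 5 seconds).
# Usage: path_responds /mnt/share [seconds]
path_responds() {
    path=$1
    limit=${2:-5}
    # Run the probe in the background so a dead mount cannot block this shell.
    ls -d "$path" >/dev/null 2>&1 &
    probe=$!
    waited=0
    while kill -0 "$probe" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            return 1    # still blocked: the mount is probably dead
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Run it against each NFS or CIFS mount point on the target (on Linux they are listed in /proc/mounts). Note that a probe stuck on a dead mount may stay behind as a blocked process until the mount recovers or is unmounted.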

Solution 4

There are many reasons why ansible may hang at fact gathering, but before going any further, here is the first test you should run in any such situation:

ansible -m ping <hostname>

This test just connects to the host, and executes enough code to return:

<hostname> | SUCCESS => {
    "changed": false, 
    "ping": "pong"
}

If this works, you can pretty much rule out any setup or connectivity issue, as it proves that you can resolve the target hostname, open a connection, authenticate, and execute an ansible module with the remote python interpreter.

Now, here is a (non-exhaustive) list of things that can go wrong at the beginning of a playbook :

The command executed by ansible is waiting for an interactive input

I can remember this happening on older ansible versions, where a command would wait for an interactive input that would never come, such as a sudo password (when you forgot the -K switch), or acceptance of a new ssh host fingerprint (for a new target host).

Modern versions of ansible handle both these cases gracefully and raise an error immediately for normal use cases, so unless you're doing things such as calling ssh or sudo yourself, you shouldn't have this kind of issue. And even if you did, it would happen after fact gathering.

Dead ssh master connection

There are some very interesting options passed to the ssh client in the debug log from the question:

  • ControlMaster=auto
  • ControlPersist=60s
  • ControlPath=/home/vagrant/.ansible/cp/ansible-ssh-%h-%p-%r

These options are documented in man ssh_config.

By default, ansible will try and be smart regarding its ssh connection use. For a given host, instead of creating a new connection for each and every task in the play, it will open it once, and keep it open for the whole playbook (and even across playbooks).

That's good, as establishing a new connection is far slower and more computation-intensive than reusing an existing one.

In practice, every ssh connection will check for the existence of a socket at ~/.ansible/cp/some-host-specific-path. The first connection cannot find it, so it connects normally, and then creates it. Every subsequent connection will then just use this socket to go through the already established connection.

If the established connection eventually times out and closes after being idle for long enough, the socket is removed too, and we're back to square one.

So far so good.

Sometimes, however, the connection actually dies, but the ssh client still considers it established. This typically happens when you execute the playbook from your laptop, and you lose your WiFi connection (or switch from WiFi to Ethernet, etc.)

This last example is a terrible situation : you can ssh to the target machine with a default ssh config, but as long as your previous connection is still considered active, ansible won't even try establishing a new one.

At this point, we just want to get rid of this old socket, and the simplest way to do that is to remove it:

# Delete all the current sockets (may disrupt currently running playbooks)
rm -r ~/.ansible/cp
# Delete only the affected socket (requires knowing which one it is)
rm ~/.ansible/cp/<replace-by-your-socket>
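
Removing the socket file leaves the old ssh master process running. A slightly gentler variant is to ask each master to exit through its socket with ssh -O exit, and to delete the file only if that fails. The sketch below assumes the default ~/.ansible/cp directory; close_ansible_masters is a name made up here.

```shell
# Ask every ssh master with a socket in the control path directory to exit.
# Usage: close_ansible_masters [directory]   (default: ~/.ansible/cp)
close_ansible_masters() {
    cp_dir=${1:-$HOME/.ansible/cp}
    for sock in "$cp_dir"/*; do
        # Skip anything that is not a socket (also covers an empty directory).
        [ -S "$sock" ] || continue
        # -S selects the control socket; ssh still expects a host argument,
        # so a placeholder name is passed.
        if ! ssh -O exit -S "$sock" placeholder 2>/dev/null; then
            rm -f "$sock"
        fi
        echo "closed $sock"
    done
}
```

Like rm -r ~/.ansible/cp, this may disrupt playbooks that are currently running.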

This is perfect for a one-shot fix, but if it happens too often, you may need to look for a longer-term fix. Here are some pointers that might help towards this goal :

  • Start playbooks from a server (with a network connection way more stable than your laptop's)
  • Use ansible configuration, or directly ssh client configuration to disable connection sharing
  • Use the same mechanisms to fine-tune timeouts, so that a dead master connection actually times out faster
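
As a sketch of the last two points: Ansible's ssh connection plugin takes its ssh options from the ANSIBLE_SSH_ARGS environment variable (ssh_args in ansible.cfg), so both ideas can be tried from the shell before changing any configuration file. The wrapper names and the keepalive values below are illustrative assumptions, not recommended settings.

```shell
# Run ansible with connection sharing switched off for this one invocation.
ansible_without_sharing() {
    ANSIBLE_SSH_ARGS='-C -o ControlMaster=no -o ControlPath=none' ansible "$@"
}

# Keep connection sharing, but let the master notice a dead link after
# about three unanswered keepalives sent 15 seconds apart.
ansible_with_keepalive() {
    ANSIBLE_SSH_ARGS='-C -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=15 -o ServerAliveCountMax=3' ansible "$@"
}
```

For example, ansible_without_sharing -m ping <hostname> shows whether the hang disappears once no master connection is reused.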

Please note that at the time of writing, a few options have changed (for example, my latest run gave me ControlPath=/home/toadjaune/.ansible/cp/871b533295), but the general idea is still valid.

Fact gathering actually taking too much time

At the beginning of every play, ansible collects a lot of information on the target system, and puts it into Facts. These are variables that you can then use in your playbook, and are usually really handy, but sometimes, getting this info can take a very long time (bad mount points, disks with high i/o, high load…)

That said, you don't strictly need facts to run a playbook, and almost certainly not all of them, so let's try to disable what we don't need.

For debugging purposes, it is really convenient to invoke the setup module directly from the command line:

ansible -m setup <hostname>

This last command should hang just like your playbook, and eventually time out (or succeed). Now, let's execute the module again, disabling everything we can:

ansible -m setup -a gather_subset='!all' <hostname>

If this still hangs, you can always try disabling fact gathering entirely in your play (gather_facts: false), but it is quite likely that your problem lies elsewhere.

If, however, it works fine (and quickly), then have a look at the module documentation. You have two options :

  • Limit the fact gathering to a subset, excluding what you don't need (see possible values for gather_subset)
  • gather_timeout can also help you fix your issue, by allowing more time (although that would be to fix a timeout error, not a hang)
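
To find out which group of facts is responsible, the subsets can be tried one at a time. The loop below is a rough sketch: probe_subsets is a hypothetical helper, the 60 second default is arbitrary, and it uses the coreutils timeout command when available so that a hanging subset does not block the loop.

```shell
# Run the setup module once per fact subset and report how each one ends.
# Usage: probe_subsets <hostname> [seconds]
probe_subsets() {
    host=$1
    limit=${2:-60}
    for subset in '!all' network virtual hardware; do
        status=0
        if command -v timeout >/dev/null 2>&1; then
            timeout "$limit" ansible "$host" -m setup -a "gather_subset=$subset" >/dev/null 2>&1 || status=$?
        else
            # Without timeout, a hanging subset has to be interrupted by hand.
            ansible "$host" -m setup -a "gather_subset=$subset" >/dev/null 2>&1 || status=$?
        fi
        case $status in
            0)   printf '%s\n' "$subset: ok" ;;
            124) printf '%s\n' "$subset: timed out after ${limit}s" ;;
            *)   printf '%s\n' "$subset: failed (exit $status)" ;;
        esac
    done
}
```

A subset that times out is a candidate for exclusion, for example with gather_subset set to !hardware.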

Other issues

Obviously, other things can go wrong. A few pointers to help debugging :

  • Use ansible maximum verbosity level (-vvvv), as it will show you every command executed
  • Use ping and setup modules directly from the command-line as explained above
  • Try to ssh manually if ansible -m ping doesn't work

Solution 5

I had a similar issue with Ansible hanging at Gathering Facts. I pared my script down to a prompt with no tasks or roles and it still hung.

I found 12 hung ansible processes in my process list that had accumulated over the day.

/usr/bin/python /tmp/ansible_Jfv4PA/ansible_module_setup.py
/usr/bin/python /tmp/ansible_M2T10L/ansible_module_setup.py

Once I killed those, it started working again.
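
Leftover module processes like these can be found with ps. A minimal sketch follows; the function names are invented for this example, and the second pattern (AnsiballZ_setup.py) is my assumption about how newer Ansible releases name the module file.

```shell
# Keep only lines that look like a running Ansible setup module.
filter_setup_processes() {
    grep -E 'ansible_module_setup\.py|AnsiballZ_setup\.py'
}

# List PID, elapsed time and command line of setup modules still running.
list_setup_processes() {
    ps -eo pid=,etime=,args= | filter_setup_processes
}
```

Run list_setup_processes on the machine where the modules execute, check the elapsed time column, and kill the PIDs that have clearly been stuck for a long time.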

Author by

Bj Blazkowicz

Updated on September 18, 2022

Comments

  • Bj Blazkowicz
    Bj Blazkowicz 3 months

    I'm having some odd issues with my ansible box(vagrant).

    Everything worked yesterday and my playbook worked fine.

    Today, ansible hangs on "gathering facts"?

    Here is the verbose output:

    <5.xxx.xxx.xxx> ESTABLISH CONNECTION FOR USER: deploy
    <5.xxx.xxx.xxx> REMOTE_MODULE setup
    <5.xxx.xxx.xxx> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/vagrant/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'Port=2221', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=deploy', '-o', 'ConnectTimeout=10', '5.xxx.xxx.xxx', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1411372677.18-251130781588968 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1411372677.18-251130781588968 && echo $HOME/.ansible/tmp/ansible-tmp-1411372677.18-251130781588968'"]
    
    • Antonis Christofides
      Antonis Christofides about 8 years
      It hangs for how much time? Did you try vagrant ssh and investigate during the hang to see if there is anything useful in ps and netstat? Also, one of the first suspects in hangs is DNS - check if DNS is resolving from inside the virtual machine.
    • Bj Blazkowicz
      Bj Blazkowicz about 8 years
      Thanks for your comment. The solution was simple: vagrant destroy and vagrant up... I still think it's weird that it just stopped working.
    • rektide
      rektide over 7 years
      I had an issue with Ansible stalling out if there's an inaccessible (cifs-) mounts.
    • GnP
      GnP over 7 years
      Just had it happen, it was caused by an outdated host key in the known_hosts file. Weird that the connection didn't fail as is usual in this case.
    • Pablo Martinez
      Pablo Martinez about 7 years
      Can you check sshd logs in the vagrant box? You may need to set "LogLevel DEBUG" in /etc/ssh/sshd_config but that may provide more info of what's going on.
    • Danny Staple
      Danny Staple almost 7 years
      I looked at the below - and didn't find anything there. ansible tmp/setup was running as a python process on the target box (not vagrant but a vm), but was taking a very long time and doing something very IO heavy. I had to kill -9 and wait for it to stop after about 5 minutes.
    • Shawn
      Shawn over 1 year
      sudo apt install -y ansible sshpass will fix this
  • Quanlong
    Quanlong about 7 years
    rm -rf ~/.ansible did not work for me on El Capitan
  • Deer Hunter
    Deer Hunter almost 7 years
    Puppet? What puppet? This is an ansible question.
  • JamesP
    JamesP over 5 years
    Passing to 'gather_subset=!hardware' to setup worked for a particular VM that was not responding.
  • melihovv
    melihovv over 5 years
    rm -rf ~/.ansible/cp is enough
  • David Boshton
    David Boshton about 5 years
    Fixed for me. Dodgy mount points, I think. I had a VM that I used for ansible provisioning and it worked until I added a new NFS share. Now it doesn't, until I added the above.
  • haridsv
    haridsv about 4 years
    Turned out to be a host key problem in my case. The host was reimaged, so my first run failed and I ran the suggested ssh-keygen -R command to remove the offending key. I ran ssh once to get the key added, but the second run was hanging. When I ran ssh again, I got the key confirmation prompt which was unexpected. I realized that there is an offending key that needed to be removed, so after removing that and rerunning ssh, I got the Warning: Permanently added the ECDSA host key ... message and then only the fact gathering continued.
  • tschale
    tschale about 4 years
    I can confirm the observation from @DavidBoshton. Had this issue on a VM that had NFS directories mounted, that weren't available (NFS server problem). After fixing the NFS server it worked
  • Saurabh Nanda
    Saurabh Nanda almost 4 years
    yup, that was it!
  • pkaramol
    pkaramol over 3 years
    I got around the issue initially by setting gather_facts to False but this tip really saved the day because that was my problem too.
  • Karthik
    Karthik about 3 years
    In my case I reused a IP address. Hence two host keys were present in the known_hosts file
  • Luke Stewart
    Luke Stewart about 3 years
    +1 for explanation of why wiping ~/.ansible works (in answer from @yikaus)
  • Luke Stewart
    Luke Stewart about 3 years
    See the answer below from @toadjaune for why this works.
  • Thomasleveil
    Thomasleveil over 2 years
    it can also be that the private ssh key is protected by a password and that key was not added to ssh agent (check with ssh-add -l)
  • Komal-SkyNET
    Komal-SkyNET over 2 years
    Thanks! How did you find out? Strace?
  • Martin
    Martin about 2 years
    Thanks for this excellent and detailed explanation, especially about the ssh master connection!
  • mik3fly-4steri5k
    mik3fly-4steri5k almost 2 years
    well, sometimes, i start ansible, then i kill it in the beginning, but the ssh connection stay active/alive; this answer helped me a lot.
  • MoRe
    MoRe over 1 year
    Careful! This deleted my installed plugin(s)!