Ansible stuck on gathering facts
Solution 1
I had a similar issue with Ansible ping on Vagrant: it suddenly got stuck for no apparent reason, even though it had previously worked absolutely fine. Unlike an SSH or connectivity issue, it just hung forever with no timeout.
One thing that resolved this issue for me was cleaning out the ~/.ansible directory, after which it just worked again. I couldn't find out why, but it did resolve the problem.
If you happen to run into it again, try cleaning the ~/.ansible folder before you refresh your Vagrant box.
Solution 2
Ansible can hang like this for a number of reasons, usually because of a connection problem or because the setup module hangs. Here's how to narrow the problem down so you can solve it.
Ansible cannot connect to the destination host
Host Key (known_hosts) Problems
1) On older versions of Ansible (2.1 or older), Ansible would not always tell you if the host key for the destination does not exist on the source, or if there is a mismatch.
Solution: try opening an SSH connection with the same parameters to that destination. You may find SSH errors you need to resolve, and then the command will work.
2) Sometimes Ansible displays an SSH connection message to you in the midst of other statuses, causing Ansible to "freeze" on that task:
Warning: the ECDSA host key for 'myhost' differs from the key for the IP address '10.10.1.10'
Offending key for IP in /etc/ssh/ssh_known_hosts:246
Matching host key in /etc/ssh/ssh_known_hosts:477
Are you sure you want to continue connecting (yes/no)?
In this case, simply typing "yes" for as many SSH questions as you were asked will permit the play to continue. Afterwards you can fix the root known_hosts problems.
Private Key Authentication Problems
If you are using key-based rather than password authentication, other possible problems include:
- The matching public key may not be set up properly on the destination (in the remote user's authorized_keys)
- Private key might have incorrect permissions locally (should be readable only by the user running the Ansible job)
Solution: try running ansible -m ping <destination> -k against the problem host. If that doesn't work, try the Host Key Problems solutions above.
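The permissions point above can be checked quickly from the shell. This is only a minimal sketch: key_perms_ok is a hypothetical helper name, and the key path in the commented example is an assumption, not something taken from the answer.

```shell
#!/bin/sh
# key_perms_ok KEYFILE (hypothetical helper)
# Succeeds only if KEYFILE is a regular file with no permission bits set
# for group or others (for example mode 600 or 400).
key_perms_ok() {
    [ -f "$1" ] || return 2
    # ls -ldL prints the mode string first, e.g. "-rw-------"
    mode=$(ls -ldL "$1" | cut -c1-10)
    case "$mode" in
        -???------) return 0 ;;
        *)          return 1 ;;
    esac
}

# Example (the path is an assumption, adjust it to your own key):
# key_perms_ok "$HOME/.ssh/id_ed25519" || chmod 600 "$HOME/.ssh/id_ed25519"
```

If the key is protected by a passphrase, ssh-add -l shows whether it is currently loaded in the ssh agent.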
Ansible cannot quickly gather facts
The setup module (when run automatically at the beginning of an ansible-playbook run, or when run manually as ansible -m setup <host>) can often hang when gathering hardware facts (e.g. if getting disk information from hosts with high i/o, bad mount entries, etc.).
Solution: try running ansible -m setup -a 'gather_subset=!all' <destination> (the quotes keep the shell from interpreting the exclamation mark). If this works, you should consider setting this line in the [defaults] section of your ansible.cfg:
gather_subset=!hardware
Solution 3
For me the setup module was stuck on a dead NFS mount.
If you run "df" on your machine and nothing happens, you may be in the same situation.
PS: if you can't unmount the NFS share/mountpoint, consider falling back on the blunt "umount -l" (lazy unmount).
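Rather than waiting on a hung "df", each network mount can be probed with a time limit. The following is a rough sketch, assuming a Linux host with /proc/mounts and the coreutils timeout command; check_network_mounts is a hypothetical helper name. A missing mount point is reported the same way as an unresponsive one, and in some failure modes even SIGKILL may not free a blocked process.

```shell
#!/bin/sh
# check_network_mounts [MOUNTS_FILE] [SECONDS] (hypothetical helper)
# Probe every NFS/CIFS mount point listed in MOUNTS_FILE (default /proc/mounts)
# and report whether a stat call on it returns within SECONDS (default 5).
check_network_mounts() {
    mounts_file=${1:-/proc/mounts}
    limit=${2:-5}
    # Field 2 is the mount point, field 3 the filesystem type.
    awk '$3 ~ /^(nfs|nfs4|cifs)$/ { print $2 }' "$mounts_file" |
    while read -r mount_point; do
        # SIGKILL is used because a process blocked on a dead share
        # often does not react to the default SIGTERM.
        if timeout -s KILL "$limit" stat "$mount_point" >/dev/null 2>&1; then
            echo "ok $mount_point"
        else
            echo "unresponsive $mount_point"
        fi
    done
}

# Example: check_network_mounts
```

Any mount reported as unresponsive is a candidate for the unmount described above.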
Solution 4
There are many reasons why ansible may hang at fact gathering, but before going any further, here is the first test you should run in any such situation:
ansible -m ping <hostname>
This test just connects to the host, and executes enough code to return :
<hostname> | SUCCESS => {
"changed": false,
"ping": "pong"
}
If this works, you can pretty much rule out any setup or connectivity issue, as it proves that you could resolve the target hostname, open a connection, authenticate, and execute an ansible module with the remote python interpreter.
Now, here is a (non-exhaustive) list of things that can go wrong at the beginning of a playbook:
The command executed by ansible is waiting for an interactive input
I can remember this happening on older ansible versions, where a command would wait for an interactive input that would never come, such as a sudo password (when you forgot a -K switch), or acceptance of a new ssh host fingerprint (for a new target host).
Modern versions of ansible handle both these cases gracefully and raise an error immediately for normal use cases, so unless you're doing things such as calling ssh or sudo yourself, you shouldn't have this kind of issue. And even if you did, it would be after fact gathering.
Dead ssh master connection
There are some very interesting options passed to the ssh client in the debug log given in the question:
ControlMaster=auto
ControlPersist=60s
ControlPath=/home/vagrant/.ansible/cp/ansible-ssh-%h-%p-%r
These options are documented in man ssh_config.
By default, ansible will try and be smart regarding its ssh connection use. For a given host, instead of creating a new connection for each and every task in the play, it will open it once, and keep it open for the whole playbook (and even across playbooks).
That's good, as establishing a new connection is far slower and more computation-intensive than using an already existing one.
In practice, every ssh connection will check for the existence of a socket at ~/.ansible/cp/some-host-specific-path.
The first connection cannot find it, so it connects normally, and then creates it.
Every subsequent connection will then just use this socket to go through the already established connection.
If the established connection eventually times out and closes after going unused for long enough, the socket is closed too, and we're back to square one.
So far so good.
Sometimes, however, the connection actually dies, but the ssh client still considers it established. This typically happens when you execute the playbook from your laptop, and you lose your WiFi connection (or switch from WiFi to Ethernet, etc.).
This last example is a terrible situation: you can ssh to the target machine with a default ssh config, but as long as your previous connection is still considered active, ansible won't even try establishing a new one.
At this point, we just want to get rid of this old socket, and the simplest way to do that is to remove it:
# Delete all the current sockets (may disrupt currently running playbooks)
rm -r ~/.ansible/cp
# Delete only the affected socket (requires knowing which one it is)
rm ~/.ansible/cp/<replace-by-your-socket>
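A slightly gentler variant of the removal above is to ask each master process to exit before deleting its socket, so that no orphaned ssh master lingers in the background. This is a sketch under assumptions: close_control_masters and the SSH_BIN override are hypothetical names, and "placeholder" is only a dummy host argument, which ssh requires on the command line even when -S names the socket explicitly. Note that ssh -O check would not help to detect the situation described here, since it only confirms that the master process answers, not that its network path is alive.

```shell
#!/bin/sh
# close_control_masters [DIR] (hypothetical helper)
# Ask every ssh master behind a socket in DIR (default ~/.ansible/cp) to exit,
# then remove the socket file if it is still there.
# Like the rm above, this may disrupt currently running playbooks.
close_control_masters() {
    cp_dir=${1:-$HOME/.ansible/cp}
    [ -d "$cp_dir" ] || return 0
    find "$cp_dir" -type s | while read -r sock; do
        # -S selects the control socket, -O exit asks its master to quit.
        "${SSH_BIN:-ssh}" -S "$sock" -O exit placeholder >/dev/null 2>&1
        rm -f "$sock"
        echo "closed $sock"
    done
}

# Example: close_control_masters
```

Only socket files are touched, so anything else stored under ~/.ansible is left alone.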
Removing the socket is perfect for a one-shot fix, but if the problem happens too often, you may need to look for a longer-term fix. Here are some pointers that might help towards this goal:
- Start playbooks from a server (with a network connection way more stable than your laptop's)
- Use ansible configuration, or directly ssh client configuration to disable connection sharing
- Use the same resources to fine-tune timeouts, so that a master connection crash actually times out faster
Please note that at the time of writing, a few options have changed (for example, my latest run gave me ControlPath=/home/toadjaune/.ansible/cp/871b533295), but the general idea is still valid.
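The last two pointers can be tried without editing any configuration file, through the ANSIBLE_SSH_ARGS environment variable. The sketch below is an illustration, not a recommended setting: keepalive_ssh_args is a hypothetical helper, the interval and count values are arbitrary, and because the variable replaces ansible's default ssh arguments the connection-sharing options are repeated explicitly.

```shell
#!/bin/sh
# keepalive_ssh_args [INTERVAL] [COUNT] (hypothetical helper)
# Print ssh options that keep connection sharing but add keepalive probes, so
# a master whose network path died gives up after roughly INTERVAL*COUNT seconds.
keepalive_ssh_args() {
    printf '%s %s %s %s\n' \
        '-o ControlMaster=auto' \
        '-o ControlPersist=60s' \
        "-o ServerAliveInterval=${1:-15}" \
        "-o ServerAliveCountMax=${2:-3}"
}

# Examples (commented out; <hostname> is a placeholder):
# ANSIBLE_SSH_ARGS="$(keepalive_ssh_args)" ansible -m ping <hostname>
# ANSIBLE_SSH_ARGS='-o ControlMaster=no' ansible -m ping <hostname>   # no sharing at all
```

The ServerAlive options are documented in man ssh_config next to the Control options mentioned above.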
Fact gathering actually taking too much time
At the beginning of every play, ansible collects a lot of information on the target system, and puts it into Facts. These are variables that you can then use in your playbook, and are usually really handy, but sometimes, getting this info can take a very long time (bad mount points, disks with high i/o, high load…)
That being said, you don't strictly need facts to run a playbook, and almost certainly not all of them, so let's try and disable what we don't need. There are several options for that:
- Completely disable the setup module
- Change the configuration of the setup module to include only certain parts of it:
  - Via command-line arguments
  - Via ansible configuration files
For debugging purposes, it is really convenient to invoke the setup module directly from the command-line :
ansible -m setup <hostname>
This last command should hang just like your playbook does, and eventually time out (or succeed). Now, let's execute the module again, disabling everything we can:
ansible -m setup -a gather_subset='!all' <hostname>
If this still hangs, you can always try disabling the module entirely in your play, but it's really likely that your problem is somewhere else.
If, however, it works fine (and quickly), then have a look at the module documentation. You have two options:
- Limit the fact gathering to a subset, excluding what you don't need (see possible values for gather_subset)
- gather_timeout can also help you fix your issue, by allowing more time (although that would be to fix a timeout error, not a hang)
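To find out which subset to exclude, the setup module can be run once per subset under a time limit. This is a sketch under assumptions: find_slow_subset and the ANSIBLE_BIN override are hypothetical names, the subset list is the one I know from the setup module documentation and may differ between versions, and the coreutils timeout command is assumed to be available on the control machine.

```shell
#!/bin/sh
# find_slow_subset HOST [SECONDS] (hypothetical helper)
# Run the setup module once per fact subset, each limited to SECONDS
# (default 60), and report which subsets finish in time.
find_slow_subset() {
    host=$1
    limit=${2:-60}
    for subset in hardware network virtual ohai facter; do
        # '!all,!min' switches everything off; the subset under test is added back.
        if timeout "$limit" "${ANSIBLE_BIN:-ansible}" "$host" -m setup \
            -a 'gather_subset=!all,!min,'"$subset" >/dev/null 2>&1; then
            echo "ok $subset"
        else
            echo "slow-or-failed $subset"
        fi
    done
}

# Example (<hostname> is a placeholder):
# find_slow_subset <hostname> 30
```

A subset reported as slow-or-failed is the one to exclude with gather_subset; a non-zero exit from ansible for any other reason is reported the same way, so rerun that subset by hand with -vvvv to see what actually happened.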
Other issues
Obviously, other things can go wrong. A few pointers to help debugging:
- Use ansible maximum verbosity level (-vvvv), as it will show you every command executed
- Use ping and setup modules directly from the command-line as explained above
- Try to ssh manually if ansible -m ping doesn't work
Solution 5
I had a similar issue with Ansible hanging at Gathering Facts. I pared my script down to a prompt with no tasks or roles and it still hung.
I found 12 hung ansible processes in my process list that had accumulated over the day.
/usr/bin/python /tmp/ansible_Jfv4PA/ansible_module_setup.py
/usr/bin/python /tmp/ansible_M2T10L/ansible_module_setup.py
Once I killed those, it started working again.

Comments
-
Bj Blazkowicz 3 months
I'm having some odd issues with my ansible box(vagrant).
Everything worked yesterday and my playbook worked fine.
Today, ansible hangs on "gathering facts"?
Here is the verbose output:
<5.xxx.xxx.xxx> ESTABLISH CONNECTION FOR USER: deploy
<5.xxx.xxx.xxx> REMOTE_MODULE setup
<5.xxx.xxx.xxx> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/vagrant/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'Port=2221', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=deploy', '-o', 'ConnectTimeout=10', '5.xxx.xxx.xxx', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1411372677.18-251130781588968 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1411372677.18-251130781588968 && echo $HOME/.ansible/tmp/ansible-tmp-1411372677.18-251130781588968'"]
-
Antonis Christofides about 8 years
It hangs for how much time? Did you try vagrant ssh and investigate during the hang to see if there is anything useful in ps and netstat? Also, one of the first suspects in hangs is DNS - check if DNS is resolving from inside the virtual machine.
-
Bj Blazkowicz about 8 years
Thanks for your comment. The solution was simple, vagrant destroy and vagrant up... I still think it's weird that it just stopped working?
-
rektide over 7 years
I had an issue with Ansible stalling out if there are inaccessible (cifs-) mounts.
-
GnP over 7 years
Just had it happen, it was caused by an outdated host key in the known_hosts file. Weird that the connection didn't fail as is usual in this case.
-
Pablo Martinez about 7 years
Can you check sshd logs in the vagrant box? You may need to set "LogLevel DEBUG" in /etc/ssh/sshd_config but that may provide more info of what's going on.
-
Danny Staple almost 7 years
I looked at the below - and didn't find anything there. ansible tmp/setup was running as a python process on the target box (not vagrant but a vm), but was taking a very long time and doing something very IO heavy. I had to kill -9 and wait for it to stop after about 5 minutes.
-
Shawn over 1 year
sudo apt install -y ansible sshpass will fix this
-
Quanlong about 7 years
rm -rf ~/.ansible did not work for me on El Capitan
-
Deer Hunter almost 7 years
Puppet? What puppet? This is an ansible question.
-
JamesP over 5 years
Passing 'gather_subset=!hardware' to setup worked for a particular VM that was not responding.
-
melihovv over 5 years
rm -rf ~/.ansible/cp is enough
-
David Boshton about 5 years
Fixed for me. Dodgy mount points, I think. I had a VM that I used for ansible provisioning and it worked until I added a new NFS share. Now it doesn't, until I added the above.
-
haridsv about 4 years
Turned out to be a host key problem in my case. The host was reimaged, so my first run failed and I ran the suggested ssh-keygen -R command to remove the offending key. I ran ssh once to get the key added, but the second run was hanging. When I ran ssh again, I got the key confirmation prompt, which was unexpected. I realized that there was another offending key that needed to be removed, so after removing that and rerunning ssh, I got the "Warning: Permanently added the ECDSA host key ..." message, and only then did the fact gathering continue.
-
tschale about 4 years
I can confirm the observation from @DavidBoshton. Had this issue on a VM that had NFS directories mounted that weren't available (NFS server problem). After fixing the NFS server it worked.
-
Saurabh Nanda almost 4 years
yup, that was it!
-
pkaramol over 3 years
I got around the issue initially by setting gather_facts to False but this tip really saved the day because that was my problem too.
Karthik about 3 years
In my case I reused an IP address. Hence two host keys were present in the known_hosts file.
-
Luke Stewart about 3 years
+1 for explanation of why wiping ~/.ansible works (in answer from @yikaus)
-
Luke Stewart about 3 years
See the answer below from @toadjaune for why this works.
-
Thomasleveil over 2 years
It can also be that the private ssh key is protected by a password and that key was not added to the ssh agent (check with ssh-add -l)
-
Komal-SkyNET over 2 years
Thanks! How did you find out? Strace?
-
Martin about 2 years
Thanks for this excellent and detailed explanation, especially about the ssh master connection!
-
mik3fly-4steri5k almost 2 years
Well, sometimes I start ansible, then I kill it at the beginning, but the ssh connection stays active/alive; this answer helped me a lot.
-
MoRe over 1 year
Careful! This deleted my installed plugin(s)!