Reasons for a server to be unreachable: how to investigate?
Solution 1
"The server is actually part of Azure Cloud"
The error could be occurring anywhere along the network path between the ssh client / mongo client and the server. This may represent a large number of components you will not have access to.
Your next port of call (after checking for reboots) should be Microsoft's support (good luck with that).
Meanwhile:
Check your system logs for any messages relating to your network devices.
If this doesn't turn up anything then you'll need to set up some remote monitoring to track the outages. In addition to providing useful information for the support staff to investigate the problem, it also provides you a means to get out of your contract and switch to a different provider.
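As a rough illustration of the remote monitoring idea, here is a minimal sketch of a TCP probe meant to run on a machine other than the affected server. The function names, the log file name `probe.log` and the 30 second interval are assumptions made for this example, not something from the original answer; port 22 (SSH) or 27017 (MongoDB's default port) would be the natural targets.

```python
import socket
import time
from datetime import datetime, timezone


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


def probe_line(host, port, timeout=5.0):
    """Build one log line: UTC timestamp, target, and UP or DOWN."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    state = "UP" if tcp_reachable(host, port, timeout) else "DOWN"
    return f"{stamp} {host}:{port} {state}"


def monitor(host, port, interval=30.0, logfile="probe.log", rounds=None):
    """Append one probe line per interval; loops forever when rounds is None."""
    count = 0
    while rounds is None or count < rounds:
        with open(logfile, "a") as fh:
            fh.write(probe_line(host, port) + "\n")
        count += 1
        if rounds is None or count < rounds:
            time.sleep(interval)


# Example call (hypothetical host name):
# monitor("db.example.com", 22)
```

Running the probe from two different networks helps separate a problem on the server's side from a problem on one particular path to it.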
Solution 2
From your question, I guess that there is no performance or availability problem. This looks like a network connectivity problem, and it may be related to firewalls on your client or on the target server.
There can be multiple ways to investigate.
Check the ping response.
Trace the route between the client and the server, in both directions, with the traceroute and tracepath commands.
Try to connect by both FQDN and IP address, and check the name-server entries in /etc/resolv.conf; make sure they are IPv4 addresses.
Check the sshd configuration on the server.
Check the TCP connection timeout settings.
Disable the firewall and SELinux for a short time and retry, to see whether the problem is related to them.
Check for clues in /var/log/messages, /var/log/secure or /var/log/auth.log, /var/log/audit/audit.log, etc.
Use tcpdump to inspect the packets; the cause may be a TCP keepalive problem.
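The FQDN versus IP check from the list above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `resolve_ipv4` and `classify_connect` are invented for this example, and the interpretation of each outcome is a rule of thumb rather than a diagnosis.

```python
import socket
import time


def resolve_ipv4(name):
    """Return the sorted, de-duplicated IPv4 addresses the resolver gives for name."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def classify_connect(host, port, timeout=5.0):
    """Try one TCP connection and return (outcome, seconds elapsed).

    "timeout" usually means packets were dropped somewhere on the path
    (firewall, routing, host frozen); "refused" means the host answered
    but nothing was listening on that port.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "ok"
    except (socket.timeout, TimeoutError):
        outcome = "timeout"
    except ConnectionRefusedError:
        outcome = "refused"
    except OSError as exc:
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - start


# Example (hypothetical host name): compare the name with each resolved address.
# for addr in resolve_ipv4("db.example.com"):
#     print(addr, classify_connect(addr, 22))
# print(classify_connect("db.example.com", 22))
```

If the connection by IP address succeeds while the connection by name fails or is much slower, name resolution is the first thing to look at.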
Comments
-
Julien Leray over 1 year
One of my servers, which hosts a MongoDB instance, is sometimes and "randomly" unreachable.
After a while, it comes back as if nothing had happened.
During this period, it is impossible to open an SSH tunnel (it times out without even asking for a password), every application connection to the hosted MongoDB breaks, ...
I'm not even sure the server is still up, and the issue can occur anywhere from twice a day to once a week.
Unfortunately, I'm unable to find any trace of an ungraceful shutdown/reboot or any other clue about what is going on at these times.
What I've done so far to investigate:
foo@bar:/var/log$ who -b
         system boot  Jun 22 09:25
Nothing suspicious here: the server hasn't been rebooted in a month.
This is confirmed by boot.log:
foo@bar:/var/log# tail boot.log
2016/06/22 09:25:34 Processing completed for Microsoft.OSTCExtensions.LinuxDiagnostic-2.3.9001
2016/06/22 09:25:34 Finished processing ExtensionsConfig.xml
monit: /opt/foo/common/lib/libcrypto.so.1.0.0: no version information available (required by monit)
monit: /opt/foo/common/lib/libssl.so.1.0.0: no version information available (required by monit)
 * Starting daemon monitor monit
   ...done.
 * Stopping System V runlevel compatibility
I also checked the last logged-in users; nothing seems to be wrong:
foo@bar:/var/log# last -x
localadm pts/0 16.618.3.75 Tue Jul 19 14:37   still logged in
localadm pts/0 16.618.3.75 Tue Jul 19 13:59 - 14:36 (00:37)
localadm pts/0 16.618.3.75 Tue Jul 19 13:18 - 13:53 (00:35)
localadm pts/0 16.618.3.75 Tue Jul 19 07:45 - 09:15 (01:29)
localadm pts/3 16.618.3.75 Mon Jul 18 15:14 - 15:51 (00:37)
localadm pts/0 16.618.3.75 Mon Jul 18 14:57 - 15:22 (00:24)
localadm pts/0 16.618.3.75 Mon Jul  4 10:01 - 10:06 (00:05)
localadm pts/0 16.618.3.75 Mon Jul  4 09:03 - 09:19 (00:16)
localadm pts/0 16.618.3.75 Mon Jul  4 08:16 - 08:19 (00:03)
localadm pts/0 16.618.3.75 Mon Jul  4 08:07 - 08:14 (00:06)
localadm pts/0 16.618.3.75 Mon Jul  4 08:00 - 08:04 (00:04)
I also checked the cron jobs; none of them seems to affect any runlevel:
foo@bar:/var/log$ cat syslog
Jul 20 07:02:01 bar CRON[28967]: (localadmin) CMD (cd /opt/foo/stats && ./agent.bin --run -D)
Jul 20 07:17:01 bar CRON[29489]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jul 20 08:02:01 bar CRON[30754]: (localadmin) CMD (cd /opt/foo/stats && ./agent.bin --run -D)
(I've also manually checked every cron table at the global and user level, for example with less /etc/crontab.)
The server is actually part of Azure Cloud (I don't know whether this is related to the problem).
Do you know what else could cause this issue?
Any idea how I can investigate further?
-
chrishollinworth almost 8 years
I agree that this is a network problem, but turning off the firewall is unlikely to reveal anything and is rather dangerous (I expect some people would say the same about SELinux - but that's a much longer discussion).
-
chrishollinworth almost 8 years
"can really occur 2 times a days as 1 time a week." - this means the firewall would need to be down for a week to establish whether the problem persists, and longer to establish whether the problem is resolved.
-
Julien Leray almost 8 years
The Azure part is the point that worries me too. Even with remote monitoring: how can this work if the issue is network-related? I cannot switch providers; it's a corporate decision :/
-
Julien Leray almost 8 years
@IjazKhan You're right, but I cannot do this kind of test while the server is working. I have only once been able to see the issue live, and I don't really know how long the server stays "down".
-
Ijaz Ahmad almost 8 years
You don't need to disable SELinux, because if SELinux is preventing something you can see it clearly in the logs.
-
Ijaz Ahmad almost 8 years
@Julien Leray answer updated
-
chrishollinworth almost 8 years
The remote monitoring will give specific times when the server is unavailable, which should correlate with events being logged elsewhere on the network.
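To make that correlation easier, probe results can be collapsed into outage windows before comparing them with other logs. The sketch below assumes the results are already available as (timestamp, is_up) pairs in time order; the function name `outage_windows` and that input shape are choices made for this example only.

```python
from datetime import datetime


def outage_windows(samples):
    """Collapse ordered (timestamp, is_up) samples into outage windows.

    Each window is (first_down_sample, first_up_sample_after_it).
    An outage still in progress at the end is reported with None as its end.
    """
    windows = []
    down_since = None
    for stamp, is_up in samples:
        if not is_up and down_since is None:
            down_since = stamp            # outage starts
        elif is_up and down_since is not None:
            windows.append((down_since, stamp))  # outage ends
            down_since = None
    if down_since is not None:
        windows.append((down_since, None))
    return windows


# Example with made-up timestamps:
# samples = [
#     (datetime(2016, 7, 20, 7, 0), True),
#     (datetime(2016, 7, 20, 7, 1), False),
#     (datetime(2016, 7, 20, 7, 2), False),
#     (datetime(2016, 7, 20, 7, 3), True),
# ]
# outage_windows(samples)
```

The real outage can start up to one probe interval before the first failed sample and end up to one interval before the first successful one, so a shorter interval gives tighter windows to match against syslog or provider-side events.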