Reasons for a server to be unreachable: how to investigate?

ssh shutdown reboot mongodb

5,247

Solution 1

"The server is actually part of Azure Cloud"

The error could be occurring anywhere along the network path between the ssh client / mongo client and the server. This may represent a large number of components you will not have access to.

Your next port of call (after checking for reboots) should be Microsoft's support (good luck with that).

Meanwhile:

Check your system logs for any messages relating to your network devices.
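
One way to do that check, as a minimal sketch: filter the logs for lines that mention the network interface or its link state. The function name net_log_lines and the pattern list are assumptions for illustration, not an exhaustive set; hv_netvsc is the Hyper-V network driver that Azure Linux guests normally use, and the paths in the usage comment are the usual Debian/Ubuntu locations.

```shell
# net_log_lines FILE...: print log lines that look network-related.
# The pattern list is illustrative; extend it for your drivers and tools.
net_log_lines() {
    grep -iHE 'eth[0-9]+|link is (up|down)|carrier|hv_netvsc|dhclient' "$@" 2>/dev/null
}

# Hypothetical usage (paths assumed; adjust to your distribution):
#   net_log_lines /var/log/syslog /var/log/kern.log
```

Any hits whose timestamps fall inside an outage window are worth passing on to support.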

If this doesn't turn up anything then you'll need to set up some remote monitoring to track the outages. In addition to providing useful information for the support staff to investigate the problem, it also provides you a means to get out of your contract and switch to a different provider.
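
A minimal sketch of such remote monitoring, meant to run from a machine outside the server's network so that it keeps logging while the server is unreachable. It assumes bash and the coreutils timeout command are installed on the monitoring host; the function names, the host db.example.com and the log file name are hypothetical. Port 22 is SSH and 27017 is MongoDB's default port; use whichever ports your services actually listen on.

```shell
# probe_once HOST PORT: try one TCP connect and print a timestamped UP/DOWN line.
# Assumes bash (for /dev/tcp) and coreutils timeout are available.
probe_once() {
    host=$1
    port=$2
    if timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        state=UP
    else
        state=DOWN
    fi
    printf '%s %s:%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$host" "$port" "$state"
}

# monitor HOST PORT [INTERVAL]: probe forever, one line per attempt
# (INTERVAL is in seconds and defaults to 30).
monitor() {
    while :; do
        probe_once "$1" "$2"
        sleep "${3:-30}"
    done
}

# Hypothetical usage, run from the monitoring host:
#   monitor db.example.com 22 30 >> reachability.log &
#   monitor db.example.com 27017 30 >> reachability.log &
```

The UP/DOWN transitions in the log give the outage windows to hand to support and to compare against the server's own logs.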

Solution 2

From your question, I guess that there is no performance or availability problem; this seems to be a network connectivity problem, possibly related to firewalls on your client or on the target server.

There can be multiple ways to investigate.

Check the ping response

Trace the route in both directions, from the client to the server and from the server to the client, using the traceroute and tracepath commands.

Try to connect both by FQDN and by IP address, and check the name-server entries in /etc/resolv.conf; make sure they are IPv4 addresses.

Check sshd configuration on the server

Check tcp connection timeout settings

Temporarily disable the firewall and SELinux and retry, to see whether the problem is related to them.

Check for clues in /var/log/messages, /var/log/secure or /var/log/auth.log, /var/log/audit/audit.log, etc.

Use tcpdump to inspect the packets; the cause may be a TCP keepalive problem.
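
Several of the checks above can be collected in a small script so that they are ready to run the moment an outage starts. This is a sketch, not a complete diagnostic: the function names, the DRY_RUN switch and the host db.example.com are hypothetical, and it assumes ping, traceroute, tracepath, getent, ssh and sysctl are installed on a Linux client.

```shell
# step CMD...: print the command, then run it unless DRY_RUN=1.
# A failing command is reported but does not stop the remaining checks.
step() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] && return 0
    "$@" || echo "  (exit status $?)"
}

# run_checks HOST [PORT]: connectivity checks from the client side.
run_checks() {
    host=$1
    port=${2:-22}
    step ping -c 4 "$host"
    step traceroute "$host"
    step tracepath "$host"
    step getent hosts "$host"      # how the name resolves on this client
    step cat /etc/resolv.conf      # name-server entries
    # BatchMode avoids hanging on a password prompt; -vvv shows where ssh stalls.
    step ssh -vvv -o BatchMode=yes -o ConnectTimeout=10 -p "$port" "$host" true
    step sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
}

# Packet capture needs root, so it is left as a manual step, for example:
#   tcpdump -ni any host db.example.com and port 22
#
# Hypothetical usage:
#   run_checks db.example.com 22 2>&1 | tee checks.log
```

Setting DRY_RUN=1 before calling run_checks prints the commands without running them, which is a cheap way to review the list beforehand.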

Read this article as well

Julien Leray

Updated on September 18, 2022

Comments

Julien Leray over 1 year

One of my servers, which hosts a MongoDB instance, is sometimes and "randomly" unreachable.

After a while, it comes back as if nothing had happened.

During this period, it is impossible to open an SSH tunnel (it times out without even asking for a password), and every application connection to the hosted MongoDB breaks.

I'm not even sure the server is still up, and the issue can occur anywhere from twice a day to once a week.

Unfortunately, I'm unable to find any trace of an ungraceful shutdown or reboot, or any other clue about what is going on at these times.

What I've done so far to investigate:

foo@bar:/var/log$ who -b
         system boot  Jun 22 09:25

Nothing suspicious here: the server hasn't been rebooted in a month.

This is confirmed by boot.log:

foo@bar:/var/log# tail boot.log
2016/06/22 09:25:34 Processing completed for Microsoft.OSTCExtensions.LinuxDiagnostic-2.3.9001
2016/06/22 09:25:34 Finished processing ExtensionsConfig.xml
monit: /opt/foo/common/lib/libcrypto.so.1.0.0: no version information available (required by monit)
monit: /opt/foo/common/lib/libssl.so.1.0.0: no version information available (required by monit)
 * Starting daemon monitor monit
   ...done.
 * Stopping System V runlevel compatibility

I also checked the last logged-in users; nothing seems to be wrong:

foo@bar:/var/log# last -x
localadm pts/0        16.618.3.75      Tue Jul 19 14:37   still logged in
localadm pts/0        16.618.3.75      Tue Jul 19 13:59 - 14:36  (00:37)
localadm pts/0        16.618.3.75      Tue Jul 19 13:18 - 13:53  (00:35)
localadm pts/0        16.618.3.75      Tue Jul 19 07:45 - 09:15  (01:29)
localadm pts/3        16.618.3.75      Mon Jul 18 15:14 - 15:51  (00:37)
localadm pts/0        16.618.3.75      Mon Jul 18 14:57 - 15:22  (00:24)
localadm pts/0        16.618.3.75      Mon Jul  4 10:01 - 10:06  (00:05)
localadm pts/0        16.618.3.75      Mon Jul  4 09:03 - 09:19  (00:16)
localadm pts/0        16.618.3.75      Mon Jul  4 08:16 - 08:19  (00:03)
localadm pts/0        16.618.3.75      Mon Jul  4 08:07 - 08:14  (00:06)
localadm pts/0        16.618.3.75      Mon Jul  4 08:00 - 08:04  (00:04)

I also checked cron jobs, none of them seems to affect any run level:

foo@bar:/var/log$ cat syslog
Jul 20 07:02:01 bar CRON[28967]: (localadmin) CMD (cd /opt/foo/stats && ./agent.bin --run -D)
Jul 20 07:17:01 bar CRON[29489]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 20 08:02:01 bar CRON[30754]: (localadmin) CMD (cd /opt/foo/stats && ./agent.bin --run -D)

(I've also manually checked every cron table at the global and user level: less /etc/crontab)

The server is actually part of Azure Cloud (I don't know if this can be related to the problem).

Do you know what else could cause this issue?

Any idea how I can investigate further?

chrishollinworth almost 8 years

Agree that this is a network problem, but turning off the firewall is unlikely to reveal anything and is rather dangerous (I expect some people would say the same about SELinux - but that's a much longer discussion).
chrishollinworth almost 8 years

"can really occur 2 times a days as 1 time a week." - means the firewall would need to be down for a week to establish whether the problem persists, and longer to establish whether the problem is resolved.
Julien Leray almost 8 years

The Azure part is the point that worries me too. Even for remote monitoring: how can this work if the issue is network related? I cannot switch providers; it's a corporate decision :/
Julien Leray almost 8 years

@IjazKhan You're right, but I cannot run this kind of test while the server is working. There was only one time when I could have seen the issue live, and I don't really know how long the server stays "down".
Ijaz Ahmad almost 8 years

For SELinux, you don't need to disable it, because if SELinux is preventing something you can see it clearly in the logs.
Ijaz Ahmad almost 8 years

@Julien Leray answer updated
chrishollinworth almost 8 years

The remote monitoring will give specific times when the server is unavailable which should correlate with events being logged elsewhere on the network.