AWS EC2 DNS resolution diagnostic


There is a known AWS bug which causes DNS resolution to sporadically fail:

https://forums.aws.amazon.com/thread.jspa?messageID=330465#330465

You might want to test with persistent connections as that would reduce the frequency at which DNS resolution is performed.

A local DNS cache (e.g. pdns-recursor or dnscache) will also reduce how often lookups are performed. However, the RDS hostname records have very short (60-second) TTLs, so the problem will occur far less frequently but will still happen a few times a day.
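
To illustrate why caching lowers the failure frequency, here is a minimal in-process sketch in Python, not a configuration of either daemon named above. It caches an address for a TTL and, as an added assumption of this sketch, falls back to the last known address when a refresh fails. The class name `StaleTolerantResolver`, its parameters, and the 60-second default are hypothetical choices for illustration.

```python
import socket
import time


class StaleTolerantResolver:
    """Cache hostname lookups and fall back to the last good answer on failure."""

    def __init__(self, ttl_seconds=60.0, lookup=None, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        # The lookup function is injectable so the class can be exercised offline.
        self._lookup = lookup or self._system_lookup
        self._clock = clock
        self._cache = {}  # hostname -> (address, time the address was fetched)

    @staticmethod
    def _system_lookup(hostname):
        # Ask the system resolver; raises socket.gaierror when resolution fails.
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        return infos[0][4][0]

    def resolve(self, hostname):
        now = self._clock()
        cached = self._cache.get(hostname)
        if cached is not None and now - cached[1] < self.ttl_seconds:
            return cached[0]  # still fresh, no DNS query needed
        try:
            address = self._lookup(hostname)
        except OSError:
            if cached is not None:
                return cached[0]  # resolver failed: reuse the stale address
            raise  # nothing cached yet, so the failure is passed on
        self._cache[hostname] = (address, now)
        return address
```

Reusing a stale address is a trade-off: the address behind an RDS hostname can change, so a stale entry may point at the wrong host until the next successful refresh.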

Author by

Kevin Lee

Updated on September 18, 2022

Comments

  • Kevin Lee
    Kevin Lee almost 2 years

    I am using EC2 instances with Amazon Linux installed (with the Amazon DNS server settings, which come from DHCP), as well as an RDS database. The EC2 instances are behind an ELB and get high traffic. The application that I use is written in PHP.

    The problem is that when PHP tries to connect to the RDS database, it sometimes returns the following error:

    PHP Warning:  mysqli_connect(): (HY000/2005): Unknown MySQL server host ...
    

    It doesn't happen often, but sometimes it gets worse and I get thousands of error events with that message.

    Are there any suggestions for diagnosing the problem? I was thinking about dumping all DNS traffic to a file and checking it, but the servers get really high traffic, so it would be hard to track the failures down in that file.

    netstat -s output from one of the application servers:

    Ip:
    197171459 total packets received
    1 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    197171458 incoming packets delivered
    175015443 requests sent out
    Icmp:
    12528 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 188
        echo requests: 12340
    12559 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 219
        echo replies: 12340
    IcmpMsg:
        InType3: 188
        InType8: 12340
        OutType0: 12340
        OutType3: 219
    Tcp:
    5231380 active connections openings
    3978862 passive connection openings
    881 failed connection attempts
    6420 connection resets received
    17 connections established
    191630575 segments received
    200105352 segments send out
    2797151 segments retransmited
    0 bad segments received.
    6910 resets sent
    Udp:
    5577451 packets received
    219 packets to unknown port received.
    0 packet receive errors
    5577700 packets sent
    UdpLite:
    TcpExt:
    172 invalid SYN cookies received
    808 resets received for embryonic SYN_RECV sockets
    7176788 TCP sockets finished time wait in fast timer
    507 packets rejects in established connections because of timestamp
    448055 delayed acks sent
    2927 delayed acks further delayed because of locked socket
    Quick ack mode was activated 2433 times
    94865861 packets directly queued to recvmsg prequeue.
    16611185 packets directly received from backlog
    54150864749 packets directly received from prequeue
    2158966 packets header predicted
    79141174 packets header predicted and directly queued to user
    40780030 acknowledgments not containing data received
    56946553 predicted acknowledgments
    84 times recovered from packet loss due to SACK data
    Detected reordering 4 times using FACK
    Detected reordering 11 times using SACK
    Detected reordering 69 times using time stamp
    70 congestion windows fully recovered
    1241 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 13
    2491 congestion windows recovered after partial ack
    0 TCP data loss events
    220 timeouts after SACK recovery
    104 fast retransmits
    99 forward retransmits
    7 retransmits in slow start
    2792531 other TCP timeouts
    22 times receiver scheduled too late for direct processing
    2423 DSACKs sent for old packets
    2785871 DSACKs received
    5162 connections reset due to unexpected data
    921 connections reset due to early user close
    135 connections aborted due to timeout
    TCPDSACKIgnoredOld: 533
    TCPDSACKIgnoredNoUndo: 393
    TCPSackShifted: 477
    TCPSackMerged: 536
    TCPSackShiftFallback: 2709
    TCPBacklogDrop: 46
    TCPDeferAcceptDrop: 3906058
    IpExt:
    InOctets: 69400712361
    OutOctets: 94841399143
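
As a lighter-weight alternative to capturing all DNS traffic, a small probe can resolve the database hostname on a fixed interval and record only the failures and timings. The following Python sketch is one way to do that. The function names, the record fields, and the use of port 3306 (the MySQL default) are assumptions made for illustration.

```python
import socket
import time


def probe_resolution(hostname, attempts=10, interval_seconds=1.0,
                     lookup=socket.getaddrinfo, sleep=time.sleep):
    """Resolve hostname repeatedly and return one record per attempt."""
    records = []
    for attempt in range(attempts):
        started = time.monotonic()
        try:
            lookup(hostname, 3306)  # 3306 is assumed: the default MySQL port
            error = None
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            error = str(exc)
        records.append({
            "wall_time": time.time(),  # when the attempt finished
            "elapsed_seconds": time.monotonic() - started,
            "error": error,  # None when resolution succeeded
        })
        if attempt < attempts - 1:
            sleep(interval_seconds)
    return records


def failure_rate(records):
    """Fraction of attempts that failed to resolve."""
    if not records:
        return 0.0
    failed = sum(1 for record in records if record["error"] is not None)
    return failed / len(records)
```

Running `probe_resolution` against the RDS hostname from one of the affected instances and comparing the failure timestamps with the PHP error log could show whether the two line up.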
    
    • Konrad K.
      Konrad K. almost 13 years
      We're seeing this too on an EC2-hosted site with only developer traffic, also with RDS as the backend DB. Now at least 1 out of 10 queries ends in an error with an "Unknown MySQL server host" message. It looks as if queries that take several seconds to run are more prone to this than those that execute in less than a second.
  • Kevin Lee
    Kevin Lee almost 13 years
    Actually, with AWS we don't have that kind of choice: they give us a hostname and don't promise that the IP will stay the same, so it changes. We could make a local cache to keep the addresses updated, but then that would fail as well, since it would need to check every second or even more often. I'm adding the netstat -s output from an appSrv; it seems OK to me.