AWS Elastic Load Balancer and target group health checks fail for no apparent reason


I had the same problem, and it took me a long time to find what was causing the Classic Load Balancer "Out Of Service" error.

Go to "Health Check" and change the "Ping target" protocol from "HTTP" to "TCP".

After that change, the instance shows as "In Service".

I hope this helps any of you.


Author: wlarcheveque

Updated on September 18, 2022

Comments

  • wlarcheveque
    wlarcheveque over 1 year

    I have completed my AWS ELB architecture for our website and successfully created the Launch Configuration and Target Groups for which instances are created behind the Load Balancer.

    My configuration is as follows :

    Target Group
    WebInstancesHttps
    HTTPS through port 443

    Health check
    HTTPS
    Path : /healthy.html
    Port 443
    Healthy threshold : 10
    Unhealthy threshold : 2
    Timeout : 5
    Interval : 30

    Autoscaling group
    Desired : 2
    Min : 2
    Max : 3
    No scaling policy for now.

    Load Balancer
    An Application Load Balancer that listens on HTTP:80 and HTTPS:443; both listeners forward to the target group described above.
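To make the timing implied by these settings concrete, here is a minimal Python sketch. The `HealthCheck` class and its method names are made up for illustration and are not part of any AWS SDK. It uses the rough approximation that a state change takes about `threshold × interval` seconds of consecutive results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    """Illustrative container for the target group settings listed above."""
    interval_s: int
    timeout_s: int
    healthy_threshold: int
    unhealthy_threshold: int

    def seconds_to_unhealthy(self) -> int:
        # Approximate time for a healthy target to be marked unhealthy:
        # this many consecutive failed checks, one per interval.
        return self.interval_s * self.unhealthy_threshold

    def seconds_to_healthy(self) -> int:
        # Approximate time for a new or recovered target to be marked healthy.
        return self.interval_s * self.healthy_threshold


# Values taken from the configuration in the question.
check = HealthCheck(interval_s=30, timeout_s=5,
                    healthy_threshold=10, unhealthy_threshold=2)

print(check.seconds_to_unhealthy())  # about 60 seconds
print(check.seconds_to_healthy())    # about 300 seconds
```

Under this approximation, two slow responses in a row (each over the 5 second timeout) are enough to take an instance out of service, while a replacement needs about five minutes of passing checks before it counts as healthy.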

    The problem
    I have one remaining issue: my instances keep being terminated because of failed health checks. However, querying the path set in the health check configuration always works, and the instances appear to have no problems at all.

    Every now and then, an instance becomes unhealthy and a new one is launched to replace it.

    I have read all the documentation on health checks and I understand the theory behind them, but I don't understand why hosts fail health checks at times when there is almost no traffic or load on the application. The application works well apart from the fact that instances keep failing health checks, and I cannot figure out why.

    I am having a hard time investigating the issue because the configuration all seems correct.

    UPDATE 2019-03-13
    It seems the aws s3 sync command executed every minute in my crontab stalls...

    20343 bitnami 20 0 192392 48884 9696 R 34.3 2.4 1:06.32 /usr/bin/python3 /home/bitnami/.local/bin/aws s3 sync --delete /opt/bitnami/apps/wordpress/htdocs s3://nutriti-code
    20351 bitnami 20 0 192108 48608 9680 R 32.7 2.4 0:37.24 /usr/bin/python3 /home/bitnami/.local/bin/aws s3 sync --delete /opt/bitnami/apps/wordpress/htdocs s3://nutriti-code
    20375 bitnami 20 0 339360 48748 9728 R 32.7 2.4 0:10.88 /usr/bin/python3 /home/bitnami/.local/bin/aws s3 sync --delete /opt/bitnami/apps/wordpress/htdocs s3://nutriti-code

    (This is from my writer node but I suspect the same happens on my reader nodes that get terminated.)

    The aws s3 sync command runs for minutes until the server fails health check.
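The `top` output above shows three `aws s3 sync` processes running at once, which suggests each cron run starts before the previous one finishes. One common way to prevent that pile-up is to guard the job with a non-blocking file lock. The following is a sketch under assumptions, not a confirmed fix for this setup: the `run_exclusive` helper and the lock path `/tmp/s3-sync.lock` are hypothetical, and the command is copied from the process list above.

```python
import fcntl
import subprocess

# Command copied from the process list in the question.
SYNC_COMMAND = ["aws", "s3", "sync", "--delete",
                "/opt/bitnami/apps/wordpress/htdocs", "s3://nutriti-code"]

# Hypothetical lock file location; any path writable by the cron user works.
LOCK_PATH = "/tmp/s3-sync.lock"


def run_exclusive(lock_path, command):
    """Run command unless another run already holds the lock.

    Returns the command's exit code, or None if the run was skipped.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # A previous sync is still running; skip this run.
        # The lock is released when lock_file is closed at the end of the block.
        return subprocess.run(command).returncode


# Intended use from cron (one call per scheduled run):
#     run_exclusive(LOCK_PATH, SYNC_COMMAND)
```

With this guard, a stalled sync still consumes CPU, but at most one copy runs at a time instead of a new one stacking up every minute.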

    Attached are all timestamps for health check failures : https://photos.app.goo.gl/sdU1yzL4r5q8q5hz5

    Some insights would be greatly appreciated!

    Thanks!

    • muskaan sharma
      muskaan sharma about 5 years
      Your web-logs on your back-ends should show the health-check requests, and the corresponding responses. Also configure your web-server to log request completion times. Your health-check is configured with a 5 second timeout, which means that it will be considered failed if the load-balancer does not get a response from the target within 5 seconds. Any health-check requests that take more than 5 seconds in your logs would be failed health-checks as far as the ALB is concerned.
    • wlarcheveque
      wlarcheveque about 5 years
      @ColtonCat Thanks for the insight. I will adjust the timeout for my health checks. However, there is no web log to check because the Application ELB performs the check, am I right? The only information I get is that it has failed the health check, not whether it was a 404 or a timeout.
    • muskaan sharma
      muskaan sharma about 5 years
      I meant the web-log on your web server... The ALB health-check attempts should be present in that...
    • wlarcheveque
      wlarcheveque about 5 years
      Thanks, good point. I checked a single instance access_logs and I see the "healthy.html" file used for my checks is always queryable by the ALB (200).
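The log check suggested in these comments can be sketched in Python. This assumes the web server writes the Apache "combined" log format with the request duration in microseconds (`%D`) appended as the last field, which is an assumption about the server configuration and not something shown in the question. ALB health check requests identify themselves with a user agent beginning with `ELB-HealthChecker`. The function name and the sample lines are hypothetical.

```python
import re

# Matches: "METHOD PATH PROTO" STATUS BYTES "REFERER" "USER-AGENT" MICROSECONDS
LINE_RE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)" (?P<micros>\d+)\s*$'
)


def suspicious_health_checks(lines, timeout_s=5.0):
    """Return (path, status, seconds) for health checks the ALB would fail."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m or "ELB-HealthChecker" not in m["agent"]:
            continue  # Not a parsable line, or not an ALB health check.
        status = int(m["status"])
        seconds = int(m["micros"]) / 1_000_000
        # A 200 that took longer than the timeout still counts as a failure.
        if status != 200 or seconds > timeout_s:
            hits.append((m["path"], status, seconds))
    return hits


# Hypothetical sample lines, for illustration only.
sample = [
    '10.0.1.23 - - [13/Mar/2019:10:00:00 +0000] "GET /healthy.html HTTP/1.1" 200 12 "-" "ELB-HealthChecker/2.0" 1834',
    '10.0.1.23 - - [13/Mar/2019:10:00:30 +0000] "GET /healthy.html HTTP/1.1" 200 12 "-" "ELB-HealthChecker/2.0" 7000000',
]
print(suspicious_health_checks(sample))
```

A request that the ALB abandoned after 5 seconds can still be logged as a 200 on the server, so the duration field matters more than the status code here. Checks that never reached the web server, for example a refused connection, will not appear in the log at all.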