Kubernetes autoscaler - 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added)


Solution 1

I had the wrong parameters defined for the autoscaler.

I had to modify the --node-group-auto-discovery and --nodes parameters:

        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=default
        - --scan-interval=25s
        - --scale-down-unneeded-time=30s
        - --nodes=1:20:terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/example-job-runner
        - --logtostderr=true
        - --stderrthreshold=info
        - --v=4

When installing the cluster autoscaler, it is not enough to simply apply the example config, e.g.:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

As documented in the user guide, that config contains a placeholder for your EKS cluster name in the value of the --node-group-auto-discovery flag; you must either replace it before applying the manifest or update the deployment after it has been created.
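
For example, assuming the example manifest was applied unmodified (it creates a cluster-autoscaler Deployment in kube-system) and your cluster is called my-cluster (a placeholder), one way to fix the flag after the fact is:

# Open the cluster-autoscaler deployment for editing and replace the
# <YOUR CLUSTER NAME> placeholder in the --node-group-auto-discovery flag.
kubectl -n kube-system edit deployment.apps/cluster-autoscaler

# The flag should end up looking something like this (my-cluster is a placeholder):
#   --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster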

Solution 2

I ran into this as well. It isn't documented as well as you would expect, in the place you would expect it. Here is the detailed explanation from the main README.md:

AWS - Using auto-discovery of tagged instance groups

Auto-discovery finds ASGs tagged as below and automatically manages them based on the min and max size specified in the ASG. cloudProvider=aws only.

  • Tag the ASGs with keys to match .Values.autoDiscovery.tags, by default: k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>
  • Verify the IAM Permissions
  • Set autoDiscovery.clusterName=<YOUR CLUSTER NAME>
  • Set awsRegion=<YOUR AWS REGION>
  • Set awsAccessKeyID=<YOUR AWS KEY ID> and awsSecretAccessKey=<YOUR AWS SECRET KEY> if you want to use AWS credentials directly instead of an instance role

$ helm install my-release autoscaler/cluster-autoscaler-chart --set autoDiscovery.clusterName=<CLUSTER NAME>
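
For completeness, here is a sketch of the same install with the region value from the list above set explicitly. The Helm repo URL is an assumption (substitute your own), and <CLUSTER NAME> / <YOUR AWS REGION> are placeholders:

# Add the autoscaler Helm repo (assumed URL) and install the chart with the
# auto-discovery values from the list above.
$ helm repo add autoscaler https://kubernetes.github.io/autoscaler
$ helm install my-release autoscaler/cluster-autoscaler-chart \
    --set autoDiscovery.clusterName=<CLUSTER NAME> \
    --set awsRegion=<YOUR AWS REGION>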

My issue was that I had not specified both tags; I had only added the k8s.io/cluster-autoscaler/enabled tag. This makes sense in hindsight: if you have multiple Kubernetes clusters in the same account, the cluster-autoscaler needs the cluster-specific tag to know which ASGs it should actually scale.
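
If you need to add the missing tag to an existing ASG, a rough sketch with the AWS CLI looks like this; my-asg and my-cluster are placeholders for your own ASG and cluster names:

# Add both auto-discovery tags to the worker ASG and propagate them to new instances.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=my-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=true"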

Solution 3

I mistakenly added k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME> as Kubernetes node labels.

They should actually be AWS tags on the worker groups (applied to the Auto Scaling Group and propagated to its instances at launch), not node labels.

Specifically, if you're using the AWS EKS module in Terraform:

  workers_group_defaults = {
    tags = [{
        key                 = "k8s.io/cluster-autoscaler/enabled"
        value               = "TRUE"
        propagate_at_launch = true
      },{
        key                 = "k8s.io/cluster-autoscaler/${var.cluster_name}"
        value               = "owned"
        propagate_at_launch = true
      }]
  }
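
To confirm the tags ended up on the ASG itself rather than only as node labels, something like the following should list both keys; my-asg is a placeholder for the worker group's ASG name:

# List the tags attached to the ASG; both k8s.io/cluster-autoscaler/* keys should appear.
aws autoscaling describe-tags --filters "Name=auto-scaling-group,Values=my-asg"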

Comments

  • Chris Stryczynski, almost 2 years ago

    I'd like to run a 'job' per node, one pod on a node at a time.

    • I've scheduled a bunch of jobs
    • I have a whole bunch of pending pods now
    • I'd like these pending pods to now trigger a node scaling up event (which does NOT happen)

    Very much like this issue (which I also reported): Kubernetes reports "pod didn't trigger scale-up (it wouldn't fit if a new node is added)" even though it would?

    However in this case it should indeed fit on a new node.

    How can I diagnose why Kubernetes has determined that a node scaling event is not possible?

    My job yaml:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: example-job-${job_id}
      labels:
        job-in-progress: job-in-progress-yes
    spec:
      template:
        metadata:
          name: example-job-${job_id}
        spec:
          # this bit ensures a job/container does not get scheduled alongside another - 'anti' affinity
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - topologyKey: kubernetes.io/hostname 
                labelSelector:
                  matchExpressions:
                  - key: job-in-progress
                    operator: NotIn
                    values:
                    - job-in-progress-yes
          containers:
          - name: buster-slim
            image: debian:buster-slim
            command: ["bash"]
            args: ["-c", "sleep 60; echo ${echo_param}"]
          restartPolicy: Never
    

    Autoscaler logs:

    I0920 19:27:58.190751       1 static_autoscaler.go:128] Starting main loop
    I0920 19:27:58.261972       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: []
    I0920 19:27:58.262003       1 aws_manager.go:152] Refreshed ASG list, next refresh after 2019-09-20 19:28:08.261998185 +0000 UTC m=+302.102284346
    I0920 19:27:58.262092       1 static_autoscaler.go:261] Filtering out schedulables
    I0920 19:27:58.264212       1 static_autoscaler.go:271] No schedulable pods
    I0920 19:27:58.264246       1 scale_up.go:262] Pod default/example-job-21-npv6p is unschedulable
    I0920 19:27:58.264252       1 scale_up.go:262] Pod default/example-job-28-zg4f8 is unschedulable
    I0920 19:27:58.264258       1 scale_up.go:262] Pod default/example-job-24-fx9rd is unschedulable
    I0920 19:27:58.264263       1 scale_up.go:262] Pod default/example-job-6-7mvqs is unschedulable
    I0920 19:27:58.264268       1 scale_up.go:262] Pod default/example-job-20-splpq is unschedulable
    I0920 19:27:58.264273       1 scale_up.go:262] Pod default/example-job-25-g5mdg is unschedulable
    I0920 19:27:58.264279       1 scale_up.go:262] Pod default/example-job-16-wtnw4 is unschedulable
    I0920 19:27:58.264284       1 scale_up.go:262] Pod default/example-job-7-g89cp is unschedulable
    I0920 19:27:58.264289       1 scale_up.go:262] Pod default/example-job-8-mglhh is unschedulable
    I0920 19:27:58.264323       1 scale_up.go:304] Upcoming 0 nodes
    I0920 19:27:58.264370       1 scale_up.go:420] No expansion options
    I0920 19:27:58.264511       1 static_autoscaler.go:333] Calculating unneeded nodes
    I0920 19:27:58.264533       1 utils.go:474] Skipping ip-10-0-1-118.us-west-2.compute.internal - no node group config
    I0920 19:27:58.264542       1 utils.go:474] Skipping ip-10-0-0-65.us-west-2.compute.internal - no node group config
    I0920 19:27:58.265063       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"example-job-25-g5mdg", UID:"d2e0e48c-dbd9-11e9-a9e2-024e7db9d360", APIVersion:"v1", ResourceVersion:"7256", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 
    I0920 19:27:58.265090       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"example-job-8-mglhh", UID:"c7d3ce78-dbd9-11e9-a9e2-024e7db9d360", APIVersion:"v1", ResourceVersion:"7267", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 
    I0920 19:27:58.265101       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"example-job-6-7mvqs", UID:"c6a5b0e4-dbd9-11e9-a9e2-024e7db9d360", APIVersion:"v1", ResourceVersion:"7273", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 
    I0920 19:27:58.265110       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"example-job-20-splpq", UID:"cfeb9521-dbd9-11e9-a9e2-024e7db9d360", APIVersion:"v1", ResourceVersion:"7259", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 
    I0920 19:27:58.265363       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"example-job-21-npv6p", UID:"d084c067-dbd9-11e9-a9e2-024e7db9d360", APIVersion:"v1", ResourceVersion:"7275", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 
    I0920 19:27:58.265384       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"example-job-16-wtnw4", UID:"ccbe48e0-dbd9-11e9-a9e2-024e7db9d360", APIVersion:"v1", ResourceVersion:"7265", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 
    I0920 19:27:58.265490       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"example-job-28-zg4f8", UID:"d4afc868-dbd9-11e9-a9e2-024e7db9d360", APIVersion:"v1", ResourceVersion:"7269", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 
    I0920 19:27:58.265515       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"example-job-24-fx9rd", UID:"d24975e5-dbd9-11e9-a9e2-024e7db9d360", APIVersion:"v1", ResourceVersion:"7271", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 
    I0920 19:27:58.265685       1 static_autoscaler.go:360] Scale down status: unneededOnly=true lastScaleUpTime=2019-09-20 19:23:23.822104264 +0000 UTC m=+17.662390361 lastScaleDownDeleteTime=2019-09-20 19:23:23.822105556 +0000 UTC m=+17.662391653 lastScaleDownFailTime=2019-09-20 19:23:23.822106849 +0000 UTC m=+17.662392943 scaleDownForbidden=false isDeleteInProgress=false
    I0920 19:27:58.265910       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"example-job-7-g89cp", UID:"c73cfaea-dbd9-11e9-a9e2-024e7db9d360", APIVersion:"v1", ResourceVersion:"7263", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 
    
  • Nhan Tran, almost 4 years ago
    It's not recommended to set both --nodes and --node-group-auto-discovery; usually --node-group-auto-discovery alone is enough: github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/…