<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Systems under Load]]></title><description><![CDATA[Field notes on infrastructure, reliability, and large-scale systems.]]></description><link>https://www.kannanak.com</link><image><url>https://www.kannanak.com/img/substack.png</url><title>Systems under Load</title><link>https://www.kannanak.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 12 Apr 2026 12:43:33 GMT</lastBuildDate><atom:link href="https://www.kannanak.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Kannan Anandakrishnan]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[kannanak@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[kannanak@substack.com]]></itunes:email><itunes:name><![CDATA[Kannan Anandakrishnan]]></itunes:name></itunes:owner><itunes:author><![CDATA[Kannan Anandakrishnan]]></itunes:author><googleplay:owner><![CDATA[kannanak@substack.com]]></googleplay:owner><googleplay:email><![CDATA[kannanak@substack.com]]></googleplay:email><googleplay:author><![CDATA[Kannan Anandakrishnan]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Why Kubernetes HPA Didn’t Scale When CPU Was Above 100%]]></title><description><![CDATA[and the hidden impact of unready pods]]></description><link>https://www.kannanak.com/p/why-kubernetes-hpa-didnt-scale-when</link><guid isPermaLink="false">https://www.kannanak.com/p/why-kubernetes-hpa-didnt-scale-when</guid><dc:creator><![CDATA[Kannan Anandakrishnan]]></dc:creator><pubDate>Sun, 22 Mar 2026 11:32:34 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!RvZf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RvZf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RvZf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 424w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 848w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 1272w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RvZf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png" width="1935" height="1018" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1018,&quot;width&quot;:1935,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:302790,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kannanak.com/i/191737566?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b6323e-94aa-4656-a25f-c0d256816ee0_1938x1049.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RvZf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 424w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 848w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 1272w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Recently we had an outage in one of our production applications running in Kubernetes, which became unresponsive due to high CPU usage.</p><p>Before the outage, a new release had been rolled out for the application. The new version required a database schema migration that had not yet been applied.</p><p>As a result, the new ReplicaSet&#8217;s pods kept erroring, failed their startup probes, and never became ready. The old ReplicaSet&#8217;s pods continued serving all traffic, eventually hit CPU throttling, and started failing their probes.</p><p>We had configured Horizontal Pod Autoscaling (HPA) with a threshold of 70%, so HPA should&#8217;ve scaled up the deployment&#8217;s replicas.</p><p>But HPA never scaled up the deployment, even though the running pods&#8217; CPU usage was above 100%. We ended up fixing the missing schema, and after that the new rollout came up fine.</p><p>In this post, I will take a deep dive into how HPA scales up, how it behaves when there are unready pods, and how to handle these scenarios.</p><div><hr></div><h2>How HPA Calculates Current Usage</h2><p><strong>Source</strong>: <a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/replica_calculator.go">https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/replica_calculator.go</a></p><h3>Step 1: Group pods by state</h3><p>HPA first classifies all pods targeted by the Deployment into four categories.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;go&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-go">readyPodCount, unreadyPods, missingPods, ignoredPods := groupPods(

    podList, metrics, resource,

    c.cpuInitializationPeriod, c.delayOfInitialReadinessStatus,
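    // Annotation (assumption, based on the kube-controller-manager flag reference):
    // these two values come from --horizontal-pod-autoscaler-cpu-initialization-period
    // (default 5m) and --horizontal-pod-autoscaler-initial-readiness-delay (default 30s),
    // which control how long a recently started pod may still be classified as unready.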

)</code></pre></div><p>Missing pods are those with no metrics available; pods that are being deleted or terminated are treated as ignored pods.</p><h3>Step 2: First pass - compute usage ratio from ready pods only</h3><p>HPA removes metrics for ignored and unready pods, then computes the usage ratio only for the ready pods:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;go&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-go">removeMetricsForPods(metrics, ignoredPods)

removeMetricsForPods(metrics, unreadyPods)

usageRatio, utilization, rawUtilization, err := metricsclient.GetResourceUtilizationRatio(

    metrics, requests, targetUtilization,

)</code></pre></div><p><em>GetResourceUtilizationRatio</em> sums up the CPU usage and requests across all pods in the metrics map and returns the usageRatio (currentUtilization / targetUtilization):</p><p>currentUtilization = (totalCpuUsage * 100) / totalCpuRequests</p><p>A ratio &gt; 1.0 means &#8220;using more than target&#8221;, i.e. scale up, and a ratio &lt; 1.0 means scale down.</p><h3>Step 3: Second pass - add unready pods back at 0% usage (dampening)</h3><p>If there are unready pods and the first pass says scale up, HPA enters the dampening path and now includes them in the resource utilization ratio.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;go&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-go">scaleUpWithUnready := len(unreadyPods) &gt; 0 &amp;&amp; usageRatio &gt; 1.0

if scaleUpWithUnready {
  // on a scale-up, treat unready pods as using 0% of the resource request
  for podName := range unreadyPods {
    metrics[podName] = metricsclient.PodMetric{Value: 0}
  }

  newUsageRatio, _, _, err := metricsclient.GetResourceUtilizationRatio(metrics, requests, targetUtilization)
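  // newUsageRatio is then checked against the tolerance band (Step 4 below)
  // before any scale-up is actually issued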
}</code></pre></div><h3>Step 4: Tolerance consideration</h3><p>Kubernetes applies a tolerance threshold to metric variations, configured for all metrics-based autoscaling. This prevents the autoscaler from acting on minor variations.</p><p>For example, consider a HorizontalPodAutoscaler configured with a target memory consumption of 100MiB and a scale-up tolerance of 5%:</p><pre><code><code>behavior:
  scaleUp:
    tolerance: 0.05 # 5% tolerance for scale up</code></code></pre><p>With this configuration, the HPA algorithm will only consider scaling up if the memory consumption is higher than 105MiB (that is, 5% above the target).</p><p>By default, HPA uses a cluster-wide tolerance of 10%.</p><p>After including the unready pods in the resource utilization ratio, HPA checks the result against the tolerance. If it is within the tolerance, HPA doesn&#8217;t scale up.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;go&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-go">if tolerances.isWithin(newUsageRatio) || (usageRatio &lt; 1.0 &amp;&amp; newUsageRatio &gt; 1.0) || (usageRatio &gt; 1.0 &amp;&amp; newUsageRatio &lt; 1.0) {
  // return the current replicas if the change would be too small,
  // or if the new usage ratio would cause a change in scale direction
  return currentReplicas, usage, nil
}</code></pre></div><h2>Applying this to our outage scenario</h2><p>Application resources and HPA configuration:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">CPU Request/Limit: 2000m/3000m
Limit/Request ratio = 1.5x (max utilization = 150%)
HPA target: 70% average CPU utilization
Deployment minReplicas: 2
Rolling update: maxSurge: 100%, maxUnavailable: 0%</code></pre></div><p>During the incident:</p><ul><li><p>2 old pods (ready) at ~150% CPU utilization</p></li><li><p>2 new pods (unready), failing startup probes, at 0% CPU usage</p></li></ul><p><strong>Steps 1 and 2: Calculate the usage ratio for ready pods</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">Usage Total  = 3000m + 3000m = 6000m

Requests Total = 2000m + 2000m = 4000m

Usage Percent = (6000 &#215; 100) / 4000 = 150%
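# 150% is the ceiling here: the 3000m CPU limit is 1.5x the 2000m request,
# so throttling prevents measured usage from going any higher.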

Usage Ratio = 150 / 70 (threshold) = 2.14  &#8594; Indicates scale-up

# Since there are unready pods and the initial direction is scale-up, 
# HPA recalculates utilization by including unready pods at 0% usage (dampening logic).</code></pre></div><p><strong>Step 3: Recalculate usage ratio including the unready pods</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">Usage Total  = 3000m + 3000m + 0m + 0m = 6000m
Requests Total = 2000m + 2000m + 2000m + 2000m = 8000m
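# The unready pods contribute 0m of usage but their full 2000m requests,
# which is what dilutes the utilization below the first-pass 150%.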

Usage Percent  = (6000 &#215; 100) / 8000 = 75%

Usage Ratio = 75 / 70 = 1.07  &#8594; Indicates scale-up. Check for tolerances.</code></pre></div><p><strong>Step 4: Check for tolerances</strong></p><p>By default, HPA has a tolerance of 10%, meaning no scaling action is taken if the usage ratio falls between 0.9 and 1.1.<br><br>Since 1.07 falls within this range (0.9 &lt; 1.07 &lt; 1.1), HPA does not scale up.<br><br>For HPA to scale up in this scenario, the recalculated usage ratio would need to exceed 1.1.<br><br>That would require the old pods to exceed ~150% CPU utilization. However, since CPU limits were set at 3000m (1.5&#215; the request), the pods could not exceed that level.</p><h2>Why does HPA do this?</h2><p>HPA is intentionally designed to act conservatively during rollouts.</p><p>During a normal rollout, new pods often take time to initialize. If HPA only looked at the ready pods (the old ReplicaSet), it might aggressively scale the deployment while the new pods are still starting. Once those new pods become ready, the deployment would suddenly be over-provisioned, and HPA would scale down again. </p><p>By including them at 0%, HPA conservatively assumes &#8220;these pods will soon be up&#8221; and avoids oscillating between scaling up and down. HPA also can&#8217;t assume that these pods will become ready within any particular time.</p><p>From the <a href="https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/#algorithm-details">Kubernetes documentation</a>:</p><blockquote><p>Furthermore, if any not-yet-ready pods were present, and the workload would have scaled up without factoring in missing metrics or not-yet-ready pods, the controller conservatively assumes that the not-yet-ready pods are consuming 0% of the desired metric, further dampening the magnitude of a scale up.</p><p>After factoring in the not-yet-ready pods and missing metrics, the controller recalculates the usage ratio. If the new ratio reverses the scale direction, or is within the tolerance, the controller doesn&#8217;t take any scaling action. In other cases, the new ratio is used to decide any change to the number of Pods.</p></blockquote><h2>Possible solutions</h2><p>There are some possible solutions one can implement to avoid such scenarios.</p><ol><li><p><strong>Alert on unready replicas or dangling ReplicaSets</strong></p></li></ol><p>Often when a new release fails (a new ReplicaSet), the service continues to function on the old ReplicaSet, and we tend to ignore it.</p><p>The straightforward action in this scenario is to identify and fix the unready pods. </p><p><a href="https://github.com/kubernetes/kube-state-metrics">Kube-state-metrics</a> has the following metrics available to identify these kinds of replicas.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">kube_deployment_status_replicas_ready
kube_deployment_spec_replicas
kube_pod_container_status_last_terminated_reason
kube_pod_container_status_waiting_reason</code></pre></div><p>So we can make use of these to identify such dangling ReplicaSet pods.<br></p><ol start="2"><li><p><strong>Reduce the maxSurge to 50%</strong></p></li></ol><p>MaxSurge is one of the key configurations in the rollout strategy. It decides how many replicas can be spun up in the new ReplicaSet, and therefore the rollout time.</p><p>We had configured it as 100% for faster rollouts.</p><p>Had maxSurge been configured as 50%, only 1 new replica would&#8217;ve been spun up during the rollout. This helps the usage ratio calculation.</p><p>Assuming that replica is not ready, the utilization including the unready pod becomes (150 + 150 + 0) / 3 = 100%, for a usage ratio of 100 / 70 &#8776; 1.43, which is well outside the tolerance band, so HPA would&#8217;ve triggered the scale-up.<br></p><ol start="3"><li><p><strong>Remove CPU limits</strong></p></li></ol><p>Given the CPU limit is configured at 1.5x the request, the old replicas couldn&#8217;t use more than that and were getting throttled. </p><p>Had the limit not been set, they could&#8217;ve potentially used the available CPU on the underlying node (not guaranteed, though), and their real usage would&#8217;ve been higher. This would give HPA a higher usage ratio and lead to a scale-up.</p><p>Keeping or removing CPU limits is a never-ending debate in the K8s community. There are supporting arguments on both sides, so one can experiment and adopt accordingly.</p><p>Ref: <a href="https://home.robusta.dev/blog/stop-using-cpu-limits">https://home.robusta.dev/blog/stop-using-cpu-limits</a></p><h2>Key Takeaways</h2><ol><li><p>HPA doesn&#8217;t only look at ready pods. Unready pods are included in the scale-up calculations at 0% usage.</p></li><li><p>A maxSurge of 100% can speed up rollouts. But if the new rollout is failing, it dilutes the utilization and prevents HPA from scaling.</p></li><li><p>When investigating HPA behavior, it is important to consider rollout configuration, resource limits, and autoscaling tolerance together.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Understanding How Deployment Pods Are Named in Kubernetes]]></title><description><![CDATA[One of my favorite interview questions is about pod names in Kubernetes.]]></description><link>https://www.kannanak.com/p/understanding-how-deployment-pods</link><guid isPermaLink="false">https://www.kannanak.com/p/understanding-how-deployment-pods</guid><dc:creator><![CDATA[Kannan Anandakrishnan]]></dc:creator><pubDate>Sat, 24 Jan 2026 16:19:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Gn3y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of my favorite interview questions is about pod names in Kubernetes. </p><p>We&#8217;re all familiar with the names of the pods created by a Deployment: deployment name + ReplicaSet suffix + random characters.</p><p>For example, if you create an nginx deployment, you will likely see:</p><pre><code>$ k create deployment --image nginx:latest nginx
deployment.apps/nginx created
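# The Deployment also creates a ReplicaSet whose name already carries the
# pod-template-hash suffix (output illustrative):
$ k get rs
NAME               DESIRED   CURRENT   READY   AGE
nginx-54c98b4f84   1         1         0       3s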

$ k get pods
NAME                     READY   STATUS              RESTARTS   AGE
nginx-54c98b4f84-l6wqq   0/1     ContainerCreating   0          3s</code></pre><p>I usually ask about the middle part - what does <code>54c98b4f84</code> represent? How is this suffix generated, and why does it change with each rollout?</p><p>This question maps neatly to one of the interesting design choices of the Deployment in Kubernetes.</p><p>I like asking this because it tells me</p><ul><li><p>about the candidate&#8217;s reasoning and thought process</p></li><li><p>whether they know the internals of Kubernetes beyond the surface level</p></li><li><p>how they approach a question even if they don&#8217;t know the exact answer</p></li></ul><p>This is something a good engineer can often deduce and figure out on the fly, even if they&#8217;ve never heard or read about it before.</p><h2>Pod Name hierarchy</h2><p>In a Kubernetes Deployment, the Pod name follows a specific hierarchy: <code>[Deployment-Name]-[Pod-Template-Hash]-[Pod-Unique-ID]</code>.</p><p>Each suffix that follows the deployment name serves a different purpose in the orchestration process. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gn3y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gn3y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gn3y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2534112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kannanak.com/i/184860914?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gn3y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.kannanak.com/p/understanding-how-deployment-pods?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.kannanak.com/p/understanding-how-deployment-pods?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h2>Pod Template Hash</h2><p>The first suffix is the pod template hash. </p><p>When you create or update a Deployment, the Deployment controller takes the <code>podTemplate</code> (the configuration of the containers, labels, etc.) and runs it through a hashing algorithm (FNV-32a). The resulting hash is added to the ReplicaSet as the <code>pod-template-hash</code> label. The ReplicaSet in turn adds this label to the pods it manages.</p><p>This ensures that all the pods managed by a ReplicaSet are identical. </p><p>Kubernetes uses the pod-template-hash both as a <strong>label</strong> and a <strong>selector</strong> on ReplicaSets and Pods so that the correct set of Pods is managed by the right ReplicaSet.</p><pre><code>$ k describe rs nginx-54c98b4f84

Name:           nginx-54c98b4f84
Namespace:      default
Selector:       app=nginx,pod-template-hash=54c98b4f84
Labels:         app=nginx
                <strong>pod-template-hash=54c98b4f84</strong>
Annotations:    deployment.kubernetes.io/desired-replicas: 1
                deployment.kubernetes.io/max-replicas: 2
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/nginx
...
Pod Template:
  Labels:  app=nginx
           <strong>pod-template-hash=54c98b4f84</strong>
  Containers:
   nginx:
    Image:         nginx:latest
    Port:          &lt;none&gt;
...</code></pre><p>The pod-template-hash label is also applied to the pods managed by this replicaset.</p><pre><code>$ k describe pod nginx-54c98b4f84-l6wqq

Name:             nginx-54c98b4f84-l6wqq
Namespace:        default
...
Labels:           app=nginx
                  <strong>pod-template-hash=54c98b4f84</strong>
Annotations:      &lt;none&gt;
Status:           Running
...
Controlled By:  ReplicaSet/nginx-54c98b4f84</code></pre><h2>Pod suffix</h2><p>The second suffix is a unique, random string generated by the ReplicaSet controller.</p><p>The ReplicaSet controller uses a random string generator for the pod suffix. Since a ReplicaSet can manage N pods, this random string ensures that every pod gets a unique name. If a pod dies and a new one comes up, the new pod gets a completely new, randomly generated suffix. </p><p>When we scale the deployment&#8217;s replicas, the ReplicaSet remains the same and the new pods follow the same naming pattern.</p><p>In the deployment spec, the pod template includes the metadata, labels, and container spec.</p><pre><code>$ k create deployment --image nginx:latest nginx --dry-run=client -oyaml

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  strategy: {}
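  # Everything under .spec.template below is what gets hashed into the
  # pod-template-hash; fields outside it (replicas, strategy) do not affect the hash.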
  <strong>template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:latest
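        # changing this image tag changes the template hash and triggers a new ReplicaSet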
        name: nginx
        resources: {}</strong>
status: {}</code></pre><p>Once the deployment is created, we can clearly see the pod template using the describe command.</p><pre><code>$ k describe deployment nginx

Name:                   nginx
Namespace:              default
CreationTimestamp:      Sat, 17 Jan 2026 17:57:37 +0530
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=nginx
...
<strong>Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:         nginx:latest
    Port:          &lt;none&gt;
    Host Port:     &lt;none&gt;
    Environment:   &lt;none&gt;
    Mounts:        &lt;none&gt;
  Volumes:         &lt;none&gt;
  Node-Selectors:  &lt;none&gt;
  Tolerations:     &lt;none&gt;</strong></code></pre><p>So now, any change to the pod template fields will cause the template&#8217;s hash value to change and lead to the creation of a new ReplicaSet.</p><p>For example, if we update the image version, it leads to a new ReplicaSet, which is typically what we observe during releases.</p><p>But when we increase the replicas, that field is outside of the template, so it doesn&#8217;t cause any hash change and the ReplicaSet remains the same.<br></p><h2>How rollout restart creates a new ReplicaSet</h2><p>So far we have seen how changing the pod template changes the hash value and so creates a new ReplicaSet. By that logic, executing a rollout restart shouldn&#8217;t change the hash value, given we are not explicitly changing anything in the pod template, right?</p><p>That&#8217;s right. We are not updating anything; rather, K8s does it with an annotation. When a rollout restart is executed against a deployment, K8s adds an annotation <code>kubectl.kubernetes.io/restartedAt: &#8220;2026-01-17T20:26:06+05:30&#8221; </code>under the pod template.</p><pre><code>$ k rollout restart deployment nginx
deployment.apps/nginx restarted

$ k describe deployment nginx

Name:                   nginx
Namespace:              default
CreationTimestamp:      Sat, 17 Jan 2026 17:57:37 +0530
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision: 2
Selector:               app=nginx
Replicas:               1 desired | 1 updated | 2 total | 1 available | 1 unavailable
...
Pod Template:
  Labels:       app=nginx
  <strong>Annotations:  kubectl.kubernetes.io/restartedAt: 2026-01-17T20:26:06+05:30</strong>
  Containers:
   nginx:
    Image:         nginx:latest
...</code></pre><p>This changes the template&#8217;s hash value and leads to the creation of a new ReplicaSet.</p><p>This is also why even adding labels to the pods leads to new ReplicaSet creation.</p><h2>Summary</h2><p>Each pod backed by the Deployment controller follows the naming scheme deployment-name + pod template hash + pod unique ID.</p><p>Any change to the pod template changes the hash value and so warrants a new ReplicaSet. This also explains why only a few fields can be changed on the fly while most can&#8217;t.</p>]]></content:encoded></item><item><title><![CDATA[When One AWS Tag Broke Our Production NLB]]></title><description><![CDATA[Last year, we started noticing a strange issue in several of our AWS EKS clusters.]]></description><link>https://www.kannanak.com/p/when-one-aws-tag-broke-our-production</link><guid isPermaLink="false">https://www.kannanak.com/p/when-one-aws-tag-broke-our-production</guid><dc:creator><![CDATA[Kannan Anandakrishnan]]></dc:creator><pubDate>Sun, 11 Jan 2026 07:49:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9ffc172f-cc06-4296-bb3c-6d46d2d7500f_898x436.png" length="0"
type="image/jpeg"/><content:encoded><![CDATA[<p>Last year, we started noticing a strange issue in several of our AWS Network Load Balancers.</p><p>After our regular weekend maintenance, which involved:</p><ul><li><p>Elasticsearch (ES) image upgrades (StatefulSet running in EKS)</p></li><li><p>EKS node group upgrades</p></li></ul><p>our Network Load Balancer (NLB) for ES suddenly stopped registering new targets.</p><p>To mitigate this, the engineers had to identify where the ES pods were running, copy the instance IDs, and manually add them to the target group.</p><p>At Glean, we have hundreds of AWS customer accounts, each with its own EKS cluster in a single-tenant setup.</p><p>This issue appeared randomly and only during weekends, making it harder to reason about. We didn&#8217;t have enough observability for the AWS Load Balancer Controller, so there were no historical logs to debug with when the issue happened. Since the failures always followed ES upgrades, we assumed the upgrade itself was the problem.</p><p>When I finally got a chance to triage this deeply, the root cause turned out to be something entirely different and unexpected.</p><h3><br>Architecture Overview</h3><p>ES runs as a StatefulSet with 3 pods in a dedicated node group.</p><p>It listens on two ports, 9200 (HTTP) and 9300 (TCP), and is exposed via a Kubernetes Service of type LoadBalancer, which the AWS Load Balancer Controller (AWS LBC) creates as an external NLB (with target type instance).</p><p><strong>Traffic flow</strong></p><pre><code>Client -&gt; NLB -&gt; EC2 instances (Target group) -&gt; NodePort -&gt; kube-proxy -&gt; ES pods</code></pre><p>The target type matters here because traffic is routed via EC2 instances rather than directly to pods.</p><p>In instance target mode, the NLB forwards traffic to a NodePort on worker nodes.
kube-proxy updates iptables such that <strong>every node in the cluster listens on that NodePort</strong>, regardless of whether a pod is running locally.</p><p>This also means spot nodes can receive the traffic; when such a node is reclaimed, its connections get reset.</p><p>To mitigate this, we configured:</p><pre><code>externalTrafficPolicy: Local</code></pre><p>This ensures that only the <strong>nodes with ES pods</strong> will receive traffic.</p><h3><br>Role of the AWS Load Balancer Controller</h3><p><a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/">AWS LBC</a> is a controller that satisfies Kubernetes Service resources by provisioning Network Load Balancers.</p><p>It is responsible for:</p><ul><li><p>Creating Load Balancers</p></li><li><p>Creating Target Groups</p></li><li><p>Registering targets</p></li><li><p>Configuring health checks</p></li><li><p>Continuous reconciliation as per the K8s spec</p></li></ul><p>When we checked the controller logs, we saw continuous reconciliation failures.</p><p>Error logs (simplified):</p><pre><code>{"msg":"Requesting network requeue due to error from ReconcileForNodePortEndpoints","tgb":{"name":"k8s-elastics-xxx","namespace":"elasticsearch-1-namespace"},

{"msg":"Reconciler error", "controllerGroup":"NLBv2.k8s.aws", "controllerKind":"TargetGroupBinding","TargetGroupBinding":{"name":"k8s-elastics-xxx-1b854ad070","namespace":"elasticsearch-1-namespace"}

"error":"expected exactly one securityGroup tagged with kubernetes.io/cluster/cluster-name for eni eni-0bc35d6b81992f2d3, got: [sg-0f5c2897d0ec362e8 sg-0fa198e24300075e0] (clusterName: cluster-name)"}</code></pre><p>During reconciliation, the controller queries the security groups attached to the node ENIs so that it can allow traffic from the LB to the nodes.</p><p>Since all the nodes carry the cluster security group, it uses the <code>kubernetes.io/cluster/clusterName: owned</code> tag to discover the SGs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EAOQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EAOQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 424w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 848w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 1272w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!EAOQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png" width="1456" height="529" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:529,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85046,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kannanak.com/i/184116950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EAOQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 424w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 848w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 1272w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p><a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.5/guide/service/nlb/#worker-node-security-groups-selection">AWS-LBC-security-groups-selection</a></p><p>The controller expects exactly one SG carrying that tag (the one created by default during cluster creation). In our case, it found two security groups, so it stopped the reconciliation. This is the actual issue that caused the failure to register new targets.</p><pre><code>aws ec2 describe-security-groups \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/cluster-name" \
  --query 'SecurityGroups[].{id:GroupId,name:GroupName,tags:Tags}' \
  --region us-west-1</code></pre><p>Output (simplified):</p><pre><code>[
  {
    "id": "sg-0f5c2897d0ec362e8",
    "name": "eks-cluster-sg-cluster-name-1347457926",
    "tags": [
      {"Key": "aws:eks:cluster-name", "Value": "cluster-name"},
      {"Key": "kubernetes.io/cluster/cluster-name", "Value": "owned"}
    ]
  },
  {
    "id": "sg-0fa198e24300075e0",
    "name": "terraform-20250312123836166600000001",
    "tags": [
      {"Key": "kubernetes.io/cluster/cluster-name", "Value": "owned"},
      {"Key": "Name", "Value": "eks-xxx-yyy"}
    ]
  }
]</code></pre><h3><br>How the tag got added</h3><p>We had adopted Karpenter a couple of months earlier as the cluster autoscaling solution in AWS. It was rolled out in phases to the AWS accounts, and during the rollout we accidentally added the <code>kubernetes.io/cluster/cluster-name: owned</code> tag to an additional security group.</p><p>This caused the AWS LBC reconciliation failures, which is a known issue <a href="https://karpenter.sh/docs/concepts/nodeclasses/#:~:text=)%20is%20supported.-,Note,-When%20launching%20nodes">documented</a> by Karpenter.</p><blockquote><p>When launching nodes, Karpenter uses all the security groups that match the selector.<br>If you choose to use the <a href="http://kubernetes.io/cluster/$CLUSTER_NAME">kubernetes.io/cluster/$CLUSTER_NAME</a> tag for discovery, note that this may result in failures using the AWS Load Balancer controller.<br>The Load Balancer controller only supports a single security group having that tag key. See <a href="https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2367">this issue</a> for more details.</p></blockquote><h3><br>Why this issue didn&#8217;t occur on weekdays</h3><p>Even though the LBC reconciliation was failing regularly, it didn&#8217;t affect Elasticsearch during the week.</p><p>This is because ES runs in a dedicated node group.</p><p><strong>Weekdays</strong>:</p><ul><li><p>No nodegroup rotation</p></li><li><p>Existing nodes and their NLB registrations remained untouched</p></li><li><p>No changes in the targets</p></li></ul><p><strong>Weekends</strong>:</p><ul><li><p>The nodegroup upgrade created new nodes and removed old ones</p></li><li><p>Old nodes in the target groups no longer existed</p></li><li><p>Reconciliation hit the security groups error and skipped registering targets</p></li><li><p>The NLB ended up with no healthy targets</p></li></ul><h3><br>The Fix</h3><p>We simply removed the tag from the extra security group.
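</p><p>A sketch of what the removal looks like with the AWS CLI (the SG ID and region are taken from the output above; specifying only the tag key deletes the tag regardless of its value):</p><pre><code>aws ec2 delete-tags \
  --resources sg-0fa198e24300075e0 \
  --tags Key=kubernetes.io/cluster/cluster-name \
  --region us-west-1</code></pre><p>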
</p><p>Once the tag was removed, the LBC immediately started reconciling successfully. After the weekend nodegroup upgrades, new nodes were registered correctly.</p><p>Logs after the fix (simplified):</p><pre><code>"msg": "registering targets",
"arn": "arn:aws:elasticloadbalancing:us-west-1:...:targetgroup/k8s-elastics-xxx-1b854ad070/...",
"targets": [
  {"Id":"i-00778c57210a62f39","Port":30301},
  {"Id":"i-069e316e76e8e2074","Port":30301},
  {"Id":"i-06ae7fb26e916f702","Port":30301},
  {"Id":"i-06f7a589b03ae78c1","Port":30301}
]
...
"msg": "Successful reconcile"</code></pre><h3><br>Longer-term improvements</h3><p>We also made longer&#8209;term improvements so that this class of issue is less likely to occur again.</p><ol><li><p><strong>Switch Elasticsearch NLB target type to IP</strong></p></li></ol><p>For ES, we changed the target type from instance to IP:</p><pre><code>service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip</code></pre><ul><li><p>The LBC now registers <strong>pod IPs</strong> as NLB targets for ES. All other services were using this already.</p></li><li><p>The path becomes</p><pre><code>Client -&gt; NLB -&gt; PodIP:ContainerPort</code></pre></li><li><p>It no longer needs to treat every node as a backend via NodePort, so the particular &#8220;pick a cluster SG from ENI SGs&#8221; path is not exercised for the NLB.</p></li><li><p>This also removes one redundant network hop from the path.</p><p></p></li></ul><ol start="2"><li><p><strong>Terraform validations on security group tags</strong></p><p><br>We added <strong>Terraform tests/validations</strong> to ensure:</p><ul><li><p>Only the EKS cluster security group can carry the <code>kubernetes.io/cluster/&lt;cluster-name&gt;</code> tag.</p></li><li><p>Any attempt to add that tag to additional SGs fails validation.</p></li></ul><p></p></li></ol><h3>Improving Observability of AWS LBC</h3><p>After this incident, we also improved the observability of AWS LBC to detect similar issues quickly.</p><p><strong>Critical alerts</strong></p><ul><li><p><strong>Reconciliation failures (error %)</strong></p><ul><li><p>Metric: <code>controller_runtime_reconcile_errors_total</code> (rate, filtered by controller).</p></li><li><p>Signal: sustained increase in error rate for targetGroupBinding / Service reconciliation.</p></li><li><p>Impact: load balancers, target groups, and targets may <strong>not be created/updated/deleted</strong> as desired.<br></p></li></ul></li><li><p><strong>Workqueue
depth</strong></p><ul><li><p>Metric: <code>workqueue_depth</code> (per controller).</p></li><li><p>Signal: queue length growing and not draining.</p></li><li><p>Impact: <strong>delayed</strong> provisioning and updates of NLB/ALB resources.<br></p></li></ul></li></ul><p><strong>Supporting indicators</strong></p><ul><li><p><strong>P95 reconcile latency</strong></p><ul><li><p>Metric: <code>controller_runtime_reconcile_time_seconds_bucket</code>.</p></li><li><p>Signal: shows how long it takes to process 95% of reconciliation requests. High latency can indicate <strong>AWS API throttling</strong>, permission issues, or controller resource constraints.</p></li><li><p>Impact: slower provisioning and updates of AWS resources.<br></p></li></ul></li><li><p><strong>AWS API errors</strong></p><ul><li><p>Metric: <code>aws_api_requests_total{job="aws-load-balancer-controller",status!~"2.."}</code>.</p></li><li><p>Signal: indicates failures when communicating with AWS services.</p></li><li><p>Impact: non&#8209;2xx responses from AWS APIs (ELBv2, EC2, etc.) directly affect the LBC&#8217;s ability to reconcile state.</p></li></ul></li></ul><h3><br>Lessons learned and key takeaways</h3><ol><li><p>Tag hygiene matters - We should enforce validations on critical tags used by controllers.</p></li><li><p>Observability for the control plane - Control plane metrics are as important as application metrics, if not more so. We should identify critical controller workflows and define SLOs and alerts around them.</p></li><li><p>Always validate assumptions - Correlation vs causation is real.
In our case, the ES upgrade was a red herring that delayed root cause identification.</p></li><li><p>Create reproducers - Reproducer setups help uncover blind spots in observability and significantly speed up debugging.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Why I'm Writing
Again]]></title><description><![CDATA[and what this space will be]]></description><link>https://www.kannanak.com/p/why-im-writing-again</link><guid isPermaLink="false">https://www.kannanak.com/p/why-im-writing-again</guid><dc:creator><![CDATA[Kannan Anandakrishnan]]></dc:creator><pubDate>Sun, 04 Jan 2026 12:40:13 GMT</pubDate><content:encoded><![CDATA[<p>Long time, no see!</p><p>It&#8217;s been more than 8 years since I stopped writing on my personal site hadoopandcloud.com. I failed to renew the hosting and all the data was gone. The site went viral during the Big Data phase, as it was the only site at the time with a detailed curriculum for the CCA131 Cloudera certification. The old content is still available on the WordPress account: <a href="https://hadoopandcloud.wordpress.com/category/cca131/">https://hadoopandcloud.wordpress.com/category/cca131/</a> </p><p>During its popularity, multiple people reached out and interacted with me. I even made some money through affiliate marketing and also got a Udemy course deal. It opened so many opportunities. I could&#8217;ve been &#8230; ok, I&#8217;m ruminating about the past. I&#8217;ll stop.</p><p>Outside of my site, I was also writing on Medium under <a href="https://medium.com/@kannan_ak">https://medium.com/@kannan_ak</a> </p><p>I know, everybody loves talking about back in the day: back in the day I was so fit, back in the day I used to study hard. Back in the day this, back in the day that. Yada, yada. Ok fine, what are you doing now?</p><p>For the last year, I&#8217;ve been wanting to write, but I kept procrastinating for no clear reason. I still can&#8217;t figure out whether I love writing or I love the idea of writing. I felt so much friction in starting this, let alone actively writing. But one thing I knew for sure: I regretted not writing, not keeping up with learning and sharing things. It&#8217;s never too late. So here we go again.
</p><p>I joined Glean as an SRE six months ago and have gotten my hands on so many things there. Given it&#8217;s a single-tenant setup, with 800+ customers each having dedicated GCP/AWS projects, I&#8217;m facing so many interesting problems at work. More than writing code, I love debugging and fixing things. So yeah, I decided to write up my learnings and observations: all things infra and reliability.</p>]]></content:encoded></item></channel></rss>