<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Systems under Load]]></title><description><![CDATA[Field notes on infrastructure, reliability, and large-scale systems.]]></description><link>https://www.kannanak.com</link><image><url>https://www.kannanak.com/img/substack.png</url><title>Systems under Load</title><link>https://www.kannanak.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 12 Apr 2026 12:43:33 GMT</lastBuildDate><atom:link href="https://www.kannanak.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Kannan Anandakrishnan]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[kannanak@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[kannanak@substack.com]]></itunes:email><itunes:name><![CDATA[Kannan Anandakrishnan]]></itunes:name></itunes:owner><itunes:author><![CDATA[Kannan Anandakrishnan]]></itunes:author><googleplay:owner><![CDATA[kannanak@substack.com]]></googleplay:owner><googleplay:email><![CDATA[kannanak@substack.com]]></googleplay:email><googleplay:author><![CDATA[Kannan Anandakrishnan]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Why Kubernetes HPA Didn’t Scale When CPU Was Above 100%]]></title><description><![CDATA[and the hidden impact of unready pods]]></description><link>https://www.kannanak.com/p/why-kubernetes-hpa-didnt-scale-when</link><guid isPermaLink="false">https://www.kannanak.com/p/why-kubernetes-hpa-didnt-scale-when</guid><dc:creator><![CDATA[Kannan Anandakrishnan]]></dc:creator><pubDate>Sun, 22 Mar 2026 11:32:34 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!RvZf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RvZf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RvZf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 424w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 848w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 1272w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RvZf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png" width="1935" height="1018" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1018,&quot;width&quot;:1935,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:302790,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kannanak.com/i/191737566?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01b6323e-94aa-4656-a25f-c0d256816ee0_1938x1049.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RvZf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 424w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 848w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 1272w, https://substackcdn.com/image/fetch/$s_!RvZf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cd6b630-b101-418b-9f15-526b7e072deb_1935x1018.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Recently we had an outage in one of our production applications running in Kubernetes, which became unresponsive due to high CPU usage.</p><p>Before the outage, a new release had been rolled out for the application. The new version required a database schema migration that had not yet been applied.</p><p>As a result, the new ReplicaSet&#8217;s pods kept erroring, failed their startup probes, and never became ready. The old ReplicaSet&#8217;s pods continued serving all traffic, eventually hit CPU throttling, and started failing their probes.</p><p>We had configured Horizontal Pod Autoscaling (HPA) with a threshold of 70%, so HPA should&#8217;ve scaled up the deployment&#8217;s replicas.</p><p>But HPA never scaled up the deployment, even though the running pods&#8217; CPU usage was above 100%. We ended up fixing the missing schema, and after that the new rollout came up fine.</p><p>In this post, I will take a deep dive into how HPA scales up, how it behaves when there are unready pods, and how to handle these scenarios.</p><div><hr></div><h2>How HPA Calculates Current Usage</h2><p><strong>Source</strong>: <a href="https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/replica_calculator.go">https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/replica_calculator.go</a></p><h3>Step 1: Group pods by state</h3><p>HPA first classifies all pods targeted by the Deployment into four categories.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;go&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-go">readyPodCount, unreadyPods, missingPods, ignoredPods := groupPods(

    podList, metrics, resource,

    c.cpuInitializationPeriod, c.delayOfInitialReadinessStatus,
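    // Annotation (assumption, based on the kube-controller-manager flag reference):
    // these two values come from --horizontal-pod-autoscaler-cpu-initialization-period
    // (default 5m) and --horizontal-pod-autoscaler-initial-readiness-delay (default 30s),
    // which control how long a recently started pod may still be classified as unready.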

)</code></pre></div><p>Missing pods are those with no metrics available; pods that are being deleted or terminated are treated as ignored pods.</p><h3>Step 2: First pass - compute usage ratio from ready pods only</h3><p>HPA removes metrics for ignored and unready pods, then computes the usage ratio only for the ready pods:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;go&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-go">removeMetricsForPods(metrics, ignoredPods)

removeMetricsForPods(metrics, unreadyPods)

usageRatio, utilization, rawUtilization, err := metricsclient.GetResourceUtilizationRatio(

    metrics, requests, targetUtilization,

)</code></pre></div><p><em>GetResourceUtilizationRatio</em> sums up the CPU usage and requests across all pods in the metrics map and returns the usageRatio (currentUtilization / targetUtilization):</p><p>currentUtilization = (totalCpuUsage * 100) / totalCpuRequests</p><p>A ratio &gt; 1.0 means &#8220;using more than target&#8221;, i.e. scale up, and a ratio &lt; 1.0 means scale down.</p><h3>Step 3: Second pass - add unready pods back at 0% usage (dampening)</h3><p>If there are unready pods and the first pass says scale up, HPA enters the dampening path and now includes them in the resource utilization ratio.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;go&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-go">scaleUpWithUnready := len(unreadyPods) &gt; 0 &amp;&amp; usageRatio &gt; 1.0

if scaleUpWithUnready {
  // on a scale-up, treat unready pods as using 0% of the resource request
  for podName := range unreadyPods {
    metrics[podName] = metricsclient.PodMetric{Value: 0}
  }

  newUsageRatio, _, _, err := metricsclient.GetResourceUtilizationRatio(metrics, requests, targetUtilization)
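  // newUsageRatio is then checked against the tolerance band (Step 4 below)
  // before any scale-up is actually issued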
}</code></pre></div><h3>Step 4: Tolerance consideration</h3><p>Kubernetes applies a tolerance threshold to metric variations, configured for all metrics-based autoscaling. This prevents the autoscaler from acting on minor variations.</p><p>For example, consider a HorizontalPodAutoscaler configured with a target memory consumption of 100MiB and a scale-up tolerance of 5%:</p><pre><code><code>behavior:
  scaleUp:
    tolerance: 0.05 # 5% tolerance for scale up</code></code></pre><p>With this configuration, the HPA algorithm will only consider scaling up if the memory consumption is higher than 105MiB (that is, 5% above the target).</p><p>By default, HPA uses a cluster-wide tolerance of 10%.</p><p>After including the unready pods in the resource utilization ratio, HPA checks the result against the tolerance. If it is within the tolerance, HPA doesn&#8217;t scale up.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;go&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-go">if tolerances.isWithin(newUsageRatio) || (usageRatio &lt; 1.0 &amp;&amp; newUsageRatio &gt; 1.0) || (usageRatio &gt; 1.0 &amp;&amp; newUsageRatio &lt; 1.0) {
  // return the current replicas if the change would be too small,
  // or if the new usage ratio would cause a change in scale direction
  return currentReplicas, usage, nil
}</code></pre></div><h2>Applying this to our outage scenario</h2><p>Application resources and HPA configuration:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">CPU Request/Limit: 2000m/3000m
Limit/Request ratio = 1.5x (max utilization = 150%)
HPA target: 70% average CPU utilization
Deployment minReplicas: 2
Rolling update: maxSurge: 100%, maxUnavailable: 0%</code></pre></div><p>During the incident:</p><ul><li><p>2 old pods (ready) at ~150% CPU utilization</p></li><li><p>2 new pods (unready), failing startup probes, at 0% CPU usage</p></li></ul><p><strong>Steps 1 and 2: Calculate the usage ratio for ready pods</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;shell&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-shell">Usage Total  = 3000m + 3000m = 6000m

Requests Total = 2000m + 2000m = 4000m

Usage Percent = (6000 &#215; 100) / 4000 = 150%
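# 150% is the ceiling here: the 3000m CPU limit is 1.5x the 2000m request,
# so throttling prevents measured usage from going any higher.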

Usage Ratio = 150 / 70 (threshold) = 2.14  &#8594; Indicates scale-up

# Since there are unready pods and the initial direction is scale-up, 
# HPA recalculates utilization by including unready pods at 0% usage (dampening logic).</code></pre></div><p><strong>Step 3: Recalculate usage ratio including the unready pods</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">Usage Total  = 3000m + 3000m + 0m + 0m = 6000m
Requests Total = 2000m + 2000m + 2000m + 2000m = 8000m
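# The unready pods contribute 0m of usage but their full 2000m requests,
# which is what dilutes the utilization below the first-pass 150%.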

Usage Percent  = (6000 &#215; 100) / 8000 = 75%

Usage Ratio = 75 / 70 = 1.07  &#8594; Indicates scale-up. Check for tolerances.</code></pre></div><p><strong>Step 4: Check for tolerances</strong></p><p>By default, HPA has a tolerance of 10%, meaning no scaling action is taken if the usage ratio falls between 0.9 and 1.1.<br><br>Since 1.07 falls within this range (0.9 &lt; 1.07 &lt; 1.1), HPA does not scale up.<br><br>For HPA to scale up in this scenario, the recalculated usage ratio would need to exceed 1.1.<br><br>That would require the old pods to exceed ~150% CPU utilization. However, since CPU limits were set at 3000m (1.5&#215; the request), the pods could not exceed that level.</p><h2>Why does HPA do this?</h2><p>HPA is intentionally designed to act conservatively during rollouts.</p><p>During a normal rollout, new pods often take time to initialize. If HPA only looked at the ready pods (the old ReplicaSet), it might aggressively scale the deployment while the new pods are still starting. Once those new pods become ready, the deployment would suddenly be over-provisioned, and HPA would scale down again. </p><p>By including them at 0%, HPA conservatively assumes &#8220;these pods will soon be up&#8221; and avoids oscillating between scaling up and down. HPA also can&#8217;t assume that these pods will become ready within any particular time.</p><p>From the <a href="https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/#algorithm-details">Kubernetes documentation</a>:</p><blockquote><p>Furthermore, if any not-yet-ready pods were present, and the workload would have scaled up without factoring in missing metrics or not-yet-ready pods, the controller conservatively assumes that the not-yet-ready pods are consuming 0% of the desired metric, further dampening the magnitude of a scale up.</p><p>After factoring in the not-yet-ready pods and missing metrics, the controller recalculates the usage ratio. If the new ratio reverses the scale direction, or is within the tolerance, the controller doesn&#8217;t take any scaling action. In other cases, the new ratio is used to decide any change to the number of Pods.</p></blockquote><h2>Possible solutions</h2><p>There are some possible solutions one can implement to avoid such scenarios.</p><ol><li><p><strong>Alert on unready replicas or dangling ReplicaSets</strong></p></li></ol><p>Often when a new release fails (a new ReplicaSet), the service continues to function on the old ReplicaSet, and we tend to ignore it.</p><p>The straightforward action in this scenario is to identify and fix the unready pods. </p><p><a href="https://github.com/kubernetes/kube-state-metrics">Kube-state-metrics</a> has the following metrics available to identify these kinds of replicas.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">kube_deployment_status_replicas_ready
kube_deployment_spec_replicas
kube_pod_container_status_last_terminated_reason
kube_pod_container_status_waiting_reason</code></pre></div><p>So we can make use of these to identify such dangling ReplicaSet pods.<br></p><ol start="2"><li><p><strong>Reduce the maxSurge to 50%</strong></p></li></ol><p>MaxSurge is one of the key configurations in the rollout strategy. It decides how many replicas can be spun up in the new ReplicaSet, and therefore the rollout time.</p><p>We had configured it as 100% for faster rollouts.</p><p>Had maxSurge been configured as 50%, only 1 new replica would&#8217;ve been spun up during the rollout. This helps the usage ratio calculation.</p><p>Assuming that replica is not ready, the utilization including the unready pod becomes (150 + 150 + 0) / 3 = 100%, for a usage ratio of 100 / 70 &#8776; 1.43, which is well outside the tolerance band, so HPA would&#8217;ve triggered the scale-up.<br></p><ol start="3"><li><p><strong>Remove CPU limits</strong></p></li></ol><p>Given the CPU limit is configured at 1.5x the request, the old replicas couldn&#8217;t use more than that and were getting throttled. </p><p>Had the limit not been set, they could&#8217;ve potentially used the available CPU on the underlying node (not guaranteed, though), and their real usage would&#8217;ve been higher. This would give HPA a higher usage ratio and lead to a scale-up.</p><p>Keeping or removing CPU limits is a never-ending debate in the K8s community. There are supporting arguments on both sides, so one can experiment and adopt accordingly.</p><p>Ref: <a href="https://home.robusta.dev/blog/stop-using-cpu-limits">https://home.robusta.dev/blog/stop-using-cpu-limits</a></p><h2>Key Takeaways</h2><ol><li><p>HPA doesn&#8217;t only look at ready pods. Unready pods are included in the scale-up calculations at 0% usage.</p></li><li><p>A maxSurge of 100% can speed up rollouts. But if the new rollout is failing, it dilutes the utilization and prevents HPA from scaling.</p></li><li><p>When investigating HPA behavior, it is important to consider rollout configuration, resource limits, and autoscaling tolerance together.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Understanding How Deployment Pods Are Named in Kubernetes]]></title><description><![CDATA[One of my favorite interview questions is about pod names in Kubernetes.]]></description><link>https://www.kannanak.com/p/understanding-how-deployment-pods</link><guid isPermaLink="false">https://www.kannanak.com/p/understanding-how-deployment-pods</guid><dc:creator><![CDATA[Kannan Anandakrishnan]]></dc:creator><pubDate>Sat, 24 Jan 2026 16:19:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Gn3y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of my favorite interview questions is about pod names in Kubernetes. </p><p>We&#8217;re all familiar with the names of the pods created by a Deployment: deployment name + ReplicaSet suffix + random characters.</p><p>For example, if you create an nginx deployment, you will likely see:</p><pre><code>$ k create deployment --image nginx:latest nginx
deployment.apps/nginx created
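# The Deployment also creates a ReplicaSet whose name already carries the
# pod-template-hash suffix (output illustrative):
$ k get rs
NAME               DESIRED   CURRENT   READY   AGE
nginx-54c98b4f84   1         1         0       3s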

$ k get pods
NAME                     READY   STATUS              RESTARTS   AGE
nginx-54c98b4f84-l6wqq   0/1     ContainerCreating   0          3s</code></pre><p>I usually ask about the middle part - what does <code>54c98b4f84</code> represent? How is this suffix generated, and why does it change with each rollout?</p><p>This question maps neatly to one of the interesting design choices of the Deployment in Kubernetes.</p><p>I like asking this because it tells me</p><ul><li><p>about the candidate&#8217;s reasoning and thought process</p></li><li><p>whether they know the internals of Kubernetes beyond the surface level</p></li><li><p>how they approach a question even if they don&#8217;t know the exact answer</p></li></ul><p>This is something a good engineer can often deduce and figure out on the fly, even if they&#8217;ve never heard or read about it before.</p><h2>Pod Name hierarchy</h2><p>In a Kubernetes Deployment, the Pod name follows a specific hierarchy: <code>[Deployment-Name]-[Pod-Template-Hash]-[Pod-Unique-ID]</code>.</p><p>Each suffix that follows the deployment name serves a different purpose in the orchestration process. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gn3y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gn3y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gn3y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2534112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kannanak.com/i/184860914?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gn3y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Gn3y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa0f3d51-3f40-4ff9-83bd-af45f30b9371_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.kannanak.com/p/understanding-how-deployment-pods?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.kannanak.com/p/understanding-how-deployment-pods?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h2>Pod Template Hash</h2><p>The first suffix is the pod template hash. </p><p>When you create or update a Deployment, the Deployment controller takes the <code>podTemplate</code> (the configuration of the containers, labels, etc.) and runs it through a hashing algorithm (FNV-32a). The resulting hash is added to the ReplicaSet as the <code>pod-template-hash</code> label. The ReplicaSet in turn adds this label to the pods it manages.</p><p>This ensures that all the pods managed by a ReplicaSet are identical. </p><p>Kubernetes uses the pod-template-hash both as a <strong>label</strong> and a <strong>selector</strong> on ReplicaSets and Pods so that the correct set of Pods is managed by the right ReplicaSet.</p><pre><code>$ k describe rs nginx-54c98b4f84

Name:           nginx-54c98b4f84
Namespace:      default
Selector:       app=nginx,pod-template-hash=54c98b4f84
Labels:         app=nginx
                <strong>pod-template-hash=54c98b4f84</strong>
Annotations:    deployment.kubernetes.io/desired-replicas: 1
                deployment.kubernetes.io/max-replicas: 2
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/nginx
...
Pod Template:
  Labels:  app=nginx
           <strong>pod-template-hash=54c98b4f84</strong>
  Containers:
   nginx:
    Image:         nginx:latest
    Port:          &lt;none&gt;
...</code></pre><p>The pod-template-hash label is also applied to the pods managed by this replicaset.</p><pre><code>$ k describe pod nginx-54c98b4f84-l6wqq

Name:             nginx-54c98b4f84-l6wqq
Namespace:        default
...
Labels:           app=nginx
                  <strong>pod-template-hash=54c98b4f84</strong>
Annotations:      &lt;none&gt;
Status:           Running
...
Controlled By:  ReplicaSet/nginx-54c98b4f84</code></pre><h2>Pod suffix</h2><p>The second suffix is a unique, random string generated by the ReplicaSet controller.</p><p>The ReplicaSet controller uses a random string generator for the pod suffix. Since a ReplicaSet can manage N pods, this random string ensures that every pod gets a unique name. If a pod dies and a new one comes up, the new pod gets a completely new, randomly generated suffix. </p><p>When we scale the deployment&#8217;s replicas, the ReplicaSet remains the same and the new pods follow the same naming pattern.</p><p>In the deployment spec, the pod template includes the metadata, labels, and container spec.</p><pre><code>$ k create deployment --image nginx:latest nginx --dry-run=client -oyaml

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  strategy: {}
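  # Everything under .spec.template below is what gets hashed into the
  # pod-template-hash; fields outside it (replicas, strategy) do not affect the hash.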
  <strong>template:
    metadata:
      creationTimestamp: null
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:latest
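        # changing this image tag changes the template hash and triggers a new ReplicaSet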
        name: nginx
        resources: {}</strong>
status: {}</code></pre><p>Once the deployment is created, we can clearly see the pod template using the describe command.</p><pre><code>$ k describe deployment nginx

Name:                   nginx
Namespace:              default
CreationTimestamp:      Sat, 17 Jan 2026 17:57:37 +0530
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=nginx
...
<strong>Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:         nginx:latest
    Port:          &lt;none&gt;
    Host Port:     &lt;none&gt;
    Environment:   &lt;none&gt;
    Mounts:        &lt;none&gt;
  Volumes:         &lt;none&gt;
  Node-Selectors:  &lt;none&gt;
  Tolerations:     &lt;none&gt;</strong></code></pre><p>So now, any change to the pod template fields will cause the template&#8217;s hash value to change and lead to the creation of a new ReplicaSet.</p><p>For example, if we update the image version, it leads to a new ReplicaSet, which is typically what we observe during releases.</p><p>But when we increase the replicas, that field is outside of the template, so it doesn&#8217;t cause any hash change and the ReplicaSet remains the same.<br></p><h2>How rollout restart creates a new ReplicaSet</h2><p>So far we have seen how changing the pod template changes the hash value and so creates a new ReplicaSet. By that logic, executing a rollout restart shouldn&#8217;t change the hash value, given we are not explicitly changing anything in the pod template, right?</p><p>That&#8217;s right. We are not updating anything; rather, K8s does it with an annotation. When a rollout restart is executed against a deployment, K8s adds an annotation <code>kubectl.kubernetes.io/restartedAt: &#8220;2026-01-17T20:26:06+05:30&#8221; </code>under the pod template.</p><pre><code>$ k rollout restart deployment nginx
deployment.apps/nginx restarted

$ k describe deployment nginx

Name:                   nginx
Namespace:              default
CreationTimestamp:      Sat, 17 Jan 2026 17:57:37 +0530
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision: 2
Selector:               app=nginx
Replicas:               1 desired | 1 updated | 2 total | 1 available | 1 unavailable
...
Pod Template:
  Labels:       app=nginx
  <strong>Annotations:  kubectl.kubernetes.io/restartedAt: 2026-01-17T20:26:06+05:30</strong>
  Containers:
   nginx:
    Image:         nginx:latest
...</code></pre><p>This changes the template&#8217;s hash value and leads to the creation of a new ReplicaSet.</p><p>This is also why even adding labels to the pods leads to new ReplicaSet creation.</p><h2>Summary</h2><p>Each pod backed by the Deployment controller follows the naming scheme deployment-name + pod template hash + pod unique ID.</p><p>Any change to the pod template changes the hash value and so warrants a new ReplicaSet. This also explains why only a few fields can be changed on the fly while most can&#8217;t.</p>]]></content:encoded></item><item><title><![CDATA[When One AWS Tag Broke Our Production NLB]]></title><description><![CDATA[Last year, we started noticing a strange issue in several of our AWS EKS clusters.]]></description><link>https://www.kannanak.com/p/when-one-aws-tag-broke-our-production</link><guid isPermaLink="false">https://www.kannanak.com/p/when-one-aws-tag-broke-our-production</guid><dc:creator><![CDATA[Kannan Anandakrishnan]]></dc:creator><pubDate>Sun, 11 Jan 2026 07:49:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9ffc172f-cc06-4296-bb3c-6d46d2d7500f_898x436.png" length="0"
type="image/jpeg"/><content:encoded><![CDATA[<p>Last year, we started noticing a strange issue in several of our AWS Network Load Balancers.</p><p>After our regular weekend maintenance, which involved:</p><ul><li><p>Elasticsearch (ES) image upgrades (StatefulSet running in EKS)</p></li><li><p>EKS node group upgrades</p></li></ul><p>our Network Load Balancer (NLB) for ES suddenly stopped registering new targets.</p><p>To mitigate this, the engineers had to identify where the ES pods were running, copy the instance IDs, and manually add them to the target group.</p><p>At Glean, we have hundreds of AWS customer accounts, each with its own EKS cluster in a single-tenant setup.</p><p>This issue appeared randomly and only during weekends, making it harder to reason about. We didn&#8217;t have enough observability for the AWS Load Balancer Controller, so there were no historical logs to debug with when the issue happened. Since the failures always followed ES upgrades, we assumed the upgrade itself was the problem.</p><p>When I finally got a chance to triage this deeply, the root cause turned out to be something entirely different and unexpected.</p><h3><br>Architecture Overview</h3><p>ES runs as a StatefulSet with 3 pods in a dedicated node group.</p><p>It listens on two ports, 9200 (HTTP) and 9300 (TCP), and is exposed via a Kubernetes Service of type LoadBalancer, which the AWS Load Balancer Controller (AWS LBC) creates as an external NLB (with target type instance).</p><p><strong>Traffic flow</strong></p><pre><code>Client -&gt; NLB -&gt; EC2 instances (Target group) -&gt; NodePort -&gt; kube-proxy -&gt; ES pods</code></pre><p>The target type matters here because traffic is routed via EC2 instances rather than directly to pods.</p><p>In instance target mode, the NLB forwards traffic to a NodePort on worker nodes.
kube-proxy updates iptables such that <strong>every node in the cluster listens on that NodePort</strong>, regardless of whether a pod is running locally.</p><p>This also means spot nodes can receive the traffic; when such a node is reclaimed, its connections get reset.</p><p>To mitigate this, we configured:</p><pre><code>externalTrafficPolicy: Local</code></pre><p>This ensures that only the <strong>nodes with ES pods</strong> will receive traffic.</p><h3><br>Role of the AWS Load Balancer Controller</h3><p><a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/">AWS LBC</a> is a controller that satisfies Kubernetes Service resources by provisioning Network Load Balancers.</p><p>It is responsible for:</p><ul><li><p>Creating Load Balancers</p></li><li><p>Creating Target Groups</p></li><li><p>Registering targets</p></li><li><p>Configuring health checks</p></li><li><p>Continuous reconciliation as per the K8s spec</p></li></ul><p>When we checked the controller logs, we saw continuous reconciliation failures.</p><p>Error logs (simplified):</p><pre><code>{"msg":"Requesting network requeue due to error from ReconcileForNodePortEndpoints","tgb":{"name":"k8s-elastics-xxx","namespace":"elasticsearch-1-namespace"},

{"msg":"Reconciler error", "controllerGroup":"NLBv2.k8s.aws", "controllerKind":"TargetGroupBinding","TargetGroupBinding":{"name":"k8s-elastics-xxx-1b854ad070","namespace":"elasticsearch-1-namespace"}

"error":"expected exactly one securityGroup tagged with kubernetes.io/cluster/cluster-name for eni eni-0bc35d6b81992f2d3, got: [sg-0f5c2897d0ec362e8 sg-0fa198e24300075e0] (clusterName: cluster-name)"}</code></pre><p>During reconciliation, the controller queries the security groups attached to the node ENIs so that it can allow traffic from the LB to the nodes.</p><p>Since all the nodes carry the cluster security group, it uses the <code>kubernetes.io/cluster/clusterName: owned</code> tag to discover the SGs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EAOQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EAOQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 424w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 848w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 1272w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!EAOQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png" width="1456" height="529" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:529,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85046,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kannanak.com/i/184116950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EAOQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 424w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 848w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 1272w, https://substackcdn.com/image/fetch/$s_!EAOQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe8d4756-cbd8-42ad-9825-5c390fe888c0_1530x556.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p><a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.5/guide/service/nlb/#worker-node-security-groups-selection">AWS-LBC-security-groups-selection</a></p><p>The controller expects exactly one SG carrying that tag (the one created by default during cluster creation). In our case, it found two security groups, so it stopped the reconciliation. This is the actual issue that caused the failure to register new targets.</p><pre><code>aws ec2 describe-security-groups \
  --filters "Name=tag-key,Values=kubernetes.io/cluster/cluster-name" \
  --query 'SecurityGroups[].{id:GroupId,name:GroupName,tags:Tags}' \
  --region us-west-1</code></pre><p>Output (simplified):</p><pre><code>[
  {
    "id": "sg-0f5c2897d0ec362e8",
    "name": "eks-cluster-sg-cluster-name-1347457926",
    "tags": [
      {"Key": "aws:eks:cluster-name", "Value": "cluster-name"},
      {"Key": "kubernetes.io/cluster/cluster-name", "Value": "owned"}
    ]
  },
  {
    "id": "sg-0fa198e24300075e0",
    "name": "terraform-20250312123836166600000001",
    "tags": [
      {"Key": "kubernetes.io/cluster/cluster-name", "Value": "owned"},
      {"Key": "Name", "Value": "eks-xxx-yyy"}
    ]
  }
]</code></pre><h3><br>How the tag got added</h3><p>We had adopted Karpenter a couple of months earlier as the cluster autoscaling solution in AWS. It was rolled out in phases to the AWS accounts, and during the rollout we accidentally added the <code>kubernetes.io/cluster/cluster-name: owned</code> tag to an additional security group.</p><p>This caused the AWS LBC reconciliation failures, which is a known issue <a href="https://karpenter.sh/docs/concepts/nodeclasses/#:~:text=)%20is%20supported.-,Note,-When%20launching%20nodes">documented</a> by Karpenter.</p><blockquote><p>When launching nodes, Karpenter uses all the security groups that match the selector.<br>If you choose to use the <a href="http://kubernetes.io/cluster/$CLUSTER_NAME">kubernetes.io/cluster/$CLUSTER_NAME</a> tag for discovery, note that this may result in failures using the AWS Load Balancer controller.<br>The Load Balancer controller only supports a single security group having that tag key. See <a href="https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2367">this issue</a> for more details.</p></blockquote><h3><br>Why this issue didn&#8217;t occur on weekdays</h3><p>Even though the LBC reconciliation was failing regularly, it didn&#8217;t affect Elasticsearch during the week.</p><p>This is because ES runs in a dedicated node group.</p><p><strong>Weekdays</strong>:</p><ul><li><p>No nodegroup rotation</p></li><li><p>Existing nodes and their NLB registrations remained untouched</p></li><li><p>No changes in the targets</p></li></ul><p><strong>Weekends</strong>:</p><ul><li><p>The nodegroup upgrade created new nodes and removed old ones</p></li><li><p>Old nodes in the target groups no longer existed</p></li><li><p>Reconciliation hit the security groups error and skipped registering targets</p></li><li><p>The NLB ended up with no healthy targets</p></li></ul><h3><br>The Fix</h3><p>We simply removed the tag from the extra security group.
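</p><p>A sketch of what the removal looks like with the AWS CLI (the SG ID and region are taken from the output above; specifying only the tag key deletes the tag regardless of its value):</p><pre><code>aws ec2 delete-tags \
  --resources sg-0fa198e24300075e0 \
  --tags Key=kubernetes.io/cluster/cluster-name \
  --region us-west-1</code></pre><p>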
</p><p>Once the tag was removed, the LBC immediately started reconciling successfully. After the weekend nodegroup upgrades, new nodes were registered correctly.</p><p>Logs after the fix (simplified):</p><pre><code>"msg": "registering targets",
"arn": "arn:aws:elasticloadbalancing:us-west-1:...:targetgroup/k8s-elastics-xxx-1b854ad070/...",
"targets": [
  {"Id":"i-00778c57210a62f39","Port":30301},
  {"Id":"i-069e316e76e8e2074","Port":30301},
  {"Id":"i-06ae7fb26e916f702","Port":30301},
  {"Id":"i-06f7a589b03ae78c1","Port":30301}
]
...
"msg": "Successful reconcile"</code></pre><h3><br>Longer-term improvements</h3><p>We also made longer&#8209;term improvements so that this class of issue is less likely to occur again.</p><ol><li><p><strong>Switch Elasticsearch NLB target type to IP</strong></p></li></ol><p>For ES, we changed the target type from instance to IP:</p><pre><code>service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip</code></pre><ul><li><p>The LBC now registers <strong>pod IPs</strong> as NLB targets for ES. All other services were using this already.</p></li><li><p>The path becomes</p><pre><code>Client -&gt; NLB -&gt; PodIP:ContainerPort</code></pre></li><li><p>It no longer needs to treat every node as a backend via NodePort, so the particular &#8220;pick a cluster SG from ENI SGs&#8221; path is not exercised for the NLB.</p></li><li><p>This also removes one redundant network hop from the path.</p><p></p></li></ul><ol start="2"><li><p><strong>Terraform validations on security group tags</strong></p><p><br>We added <strong>Terraform tests/validations</strong> to ensure:</p><ul><li><p>Only the EKS cluster security group can carry the <code>kubernetes.io/cluster/&lt;cluster-name&gt;</code> tag.</p></li><li><p>Any attempt to add that tag to additional SGs fails validation.</p></li></ul><p></p></li></ol><h3>Improving Observability of AWS LBC</h3><p>After this incident, we also improved the observability of AWS LBC to detect similar issues quickly.</p><p><strong>Critical alerts</strong></p><ul><li><p><strong>Reconciliation failures (error %)</strong></p><ul><li><p>Metric: <code>controller_runtime_reconcile_errors_total</code> (rate, filtered by controller).</p></li><li><p>Signal: sustained increase in error rate for targetGroupBinding / Service reconciliation.</p></li><li><p>Impact: load balancers, target groups, and targets may <strong>not be created/updated/deleted</strong> as desired.<br></p></li></ul></li><li><p><strong>Workqueue
depth</strong></p><ul><li><p>Metric: <code>workqueue_depth</code> (per controller).</p></li><li><p>Signal: queue length growing and not draining.</p></li><li><p>Impact: <strong>delayed</strong> provisioning and updates of NLB/ALB resources.<br></p></li></ul></li></ul><p><strong>Supporting indicators</strong></p><ul><li><p><strong>P95 reconcile latency</strong></p><ul><li><p>Metric: <code>controller_runtime_reconcile_time_seconds_bucket</code>.</p></li><li><p>Signal: shows how long it takes to process 95% of reconciliation requests. High latency can indicate <strong>AWS API throttling</strong>, permission issues, or controller resource constraints.</p></li><li><p>Impact: slower provisioning and updates of AWS resources.<br></p></li></ul></li><li><p><strong>AWS API errors</strong></p><ul><li><p>Metric: <code>aws_api_requests_total{job="aws-load-balancer-controller",status!~"2.."}</code>.</p></li><li><p>Signal: indicates failures when communicating with AWS services.</p></li><li><p>Impact: non&#8209;2xx responses from AWS APIs (ELBv2, EC2, etc.) directly affect the LBC&#8217;s ability to reconcile state.</p></li></ul></li></ul><h3><br>Lessons learned and key takeaways</h3><ol><li><p>Tag hygiene matters - We should enforce validations on critical tags used by controllers.</p></li><li><p>Observability for the control plane - Control plane metrics are as important as application metrics, if not more so. We should identify critical controller workflows and define SLOs and alerts around them.</p></li><li><p>Always validate assumptions - Correlation vs causation is real.
In our case, the ES upgrade was a red herring that delayed root cause identification.</p></li><li><p>Create reproducers - Reproducer setups help uncover blind spots in observability and significantly speed up debugging.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Why I'm Writing
Again]]></title><description><![CDATA[and what this space will be]]></description><link>https://www.kannanak.com/p/why-im-writing-again</link><guid isPermaLink="false">https://www.kannanak.com/p/why-im-writing-again</guid><dc:creator><![CDATA[Kannan Anandakrishnan]]></dc:creator><pubDate>Sun, 04 Jan 2026 12:40:13 GMT</pubDate><content:encoded><![CDATA[<p>Long time, no see!</p><p>It&#8217;s been more than 8 years since I stopped writing on my personal site hadoopandcloud.com. I failed to renew the hosting and all the data was gone. The site went viral during the Big Data phase, as it was the only site at the time with a detailed curriculum for the CCA131 Cloudera certification. The old content is still available on the WordPress account: <a href="https://hadoopandcloud.wordpress.com/category/cca131/">https://hadoopandcloud.wordpress.com/category/cca131/</a> </p><p>During its popularity, multiple people reached out and interacted with me. I even made some money through affiliate marketing and also got a Udemy course deal. It opened so many opportunities. I could&#8217;ve been &#8230; ok, I&#8217;m ruminating about the past. I&#8217;ll stop.</p><p>Outside of my site, I was also writing on Medium under <a href="https://medium.com/@kannan_ak">https://medium.com/@kannan_ak</a> </p><p>I know, everybody loves talking about back in the day: back in the day I was so fit, back in the day I used to study hard. Back in the day this, back in the day that. Yada, yada. Ok fine, what are you doing now?</p><p>For the last year, I&#8217;ve been wanting to write, but I kept procrastinating for no clear reason. I still can&#8217;t figure out whether I love writing or I love the idea of writing. I felt so much friction in starting this, let alone actively writing. But one thing I knew for sure: I regretted not writing, not keeping up with learning and sharing things. It&#8217;s never too late. So here we go again.
</p><p>I joined Glean as an SRE six months ago and have gotten my hands on so many things there. Given it&#8217;s a single-tenant setup, with 800+ customers each having dedicated GCP/AWS projects, I&#8217;m facing so many interesting problems at work. More than writing code, I love debugging and fixing things. So yeah, I decided to write up my learnings and observations: all things infra and reliability.</p>]]></content:encoded></item></channel></rss>