Debugging an EBS ZoneMismatch Error and Migrating to the AWS EBS CSI Driver in EKS
EKS Administration
While doing some routine maintenance work on my EKS cluster recently, I ended up modernising part of my storage setup.
The cluster was already using the AWS EBS CSI driver, but it had originally been installed manually using eksctl rather than being managed through Terraform.
Because of that:

- The add-on wasn’t managed as infrastructure-as-code
- The driver version was behind the recommended version for my Kubernetes release

That led me to:

- Bring the AWS EBS CSI driver under Terraform management
- Migrate Grafana’s persistent storage from the legacy in-tree EBS provisioner to the modern CSI driver
But this work didn’t start as a storage migration task.
It started with a scheduling issue.
The Problem That Triggered This Work
While improving Grafana's scheduling in my EKS cluster, I ran into an issue regarding how stateful workloads interact with AWS EBS volumes.
I covered the scheduling side of this problem in more detail in a previous post about node affinity and zonal storage constraints in Kubernetes.
While investigating the issue further, I noticed that the Grafana volume had originally been provisioned using the legacy in-tree AWS EBS driver, rather than the modern AWS EBS CSI driver.
Since I was already reviewing the storage configuration, this became a good opportunity to migrate Grafana’s persistent storage to the CSI driver and bring the EBS CSI addon under Terraform management.
In-Tree vs CSI Storage Drivers
As mentioned above, the Grafana volume had been created using the legacy in-tree AWS EBS provisioner. But what does that actually mean?
Older Kubernetes clusters relied on built-in storage plugins provided by the cloud provider.
For AWS EBS this meant the provisioner looked like this:

```yaml
provisioner: kubernetes.io/aws-ebs
```

These drivers have since been deprecated in favour of CSI drivers.
You can easily identify which driver a cluster is using by inspecting the StorageClass.
In my case, the existing storage class looked like this:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus-stack-storageclass
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```

The important line here is the provisioner field, which shows the legacy in-tree driver:

```yaml
provisioner: kubernetes.io/aws-ebs
```

Modern Kubernetes clusters instead use the AWS EBS CSI driver, which uses the following provisioner:

```yaml
ebs.csi.aws.com
```

So the new storage class looks something like this:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-csi-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```

Since the Grafana volume had been provisioned using the in-tree driver, this became a good opportunity to migrate it to a CSI-backed storage class.
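A quick way to spot this across a whole cluster, without dumping full YAML per StorageClass, is a single kubectl query (standard kubectl; nothing here is specific to my setup):

```shell
# List each StorageClass alongside its provisioner.
# Legacy classes show kubernetes.io/aws-ebs; CSI-backed ones show ebs.csi.aws.com.
kubectl get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner
```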
Managing the CSI Driver with Terraform
During this work I also noticed that the AWS EBS CSI driver had originally been installed manually using eksctl.
While that works perfectly fine, the rest of my infrastructure is managed using Terraform, so leaving the add-on outside of Terraform would eventually lead to configuration drift.
To bring the driver under infrastructure-as-code management, I first defined the add-on in Terraform:
```hcl
resource "aws_eks_addon" "aws_ebs_csi_driver" {
  cluster_name                = var.cluster_name
  addon_name                  = "aws-ebs-csi-driver"
  addon_version               = "v1.xx.x-eksbuild.x"
  service_account_role_arn    = var.ebs_csi_driver_role_arn
  resolve_conflicts_on_create = "OVERWRITE"
  resolve_conflicts_on_update = "OVERWRITE"
}
```

Since the add-on already existed in the cluster, the next step was to import it into Terraform state rather than recreate it.
This can be done using:

```shell
terraform import aws_eks_addon.aws_ebs_csi_driver <cluster-name>:aws-ebs-csi-driver
```

After importing the resource, Terraform was able to recognise the existing driver and manage it moving forward.
This means future upgrades to the CSI driver can now be handled alongside the rest of the cluster infrastructure through Terraform.
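The add-on resource above references var.ebs_csi_driver_role_arn without showing where that role comes from. For completeness, here is a minimal sketch of how such an IRSA role could be defined in Terraform. The resource names and the aws_iam_openid_connect_provider.eks reference are illustrative assumptions about this setup; the managed policy ARN and the ebs-csi-controller-sa service account are the standard ones for this driver:

```hcl
# Sketch: IAM role assumable by the EBS CSI controller's service account (IRSA).
# Assumes an existing aws_iam_openid_connect_provider.eks for the cluster's OIDC issuer.
data "aws_iam_policy_document" "ebs_csi_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.eks.arn]
    }
    condition {
      test     = "StringEquals"
      variable = "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub"
      values   = ["system:serviceaccount:kube-system:ebs-csi-controller-sa"]
    }
  }
}

resource "aws_iam_role" "ebs_csi_driver" {
  name               = "ebs-csi-driver"
  assume_role_policy = data.aws_iam_policy_document.ebs_csi_assume.json
}

resource "aws_iam_role_policy_attachment" "ebs_csi_driver" {
  role       = aws_iam_role.ebs_csi_driver.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
}
```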
Backing Up Grafana Before Migration
Grafana stores dashboards, plugin data and configuration on disk, so before migrating the volume, I created a snapshot of the existing EBS volume.
The snapshot was created at the AWS level first, which meant Kubernetes did not yet know about it.
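Taking the snapshot at the AWS level can be done with the AWS CLI, for example (the volume ID and description here are illustrative; the real volume ID can be found on the PV spec or in the EC2 console):

```shell
# Snapshot the existing Grafana EBS volume before touching it.
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "grafana pre-migration backup"
```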
To make the snapshot usable within Kubernetes, I needed to import it using Kubernetes snapshot resources.
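The snapshot resources below reference a VolumeSnapshotClass called ebs-csi-snapclass. If the cluster doesn't already define one, a minimal version looks like this (assuming the external-snapshotter CRDs are installed, which the EKS snapshot workflow requires):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-csi-snapclass
driver: ebs.csi.aws.com
deletionPolicy: Retain
```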
Importing the AWS Snapshot into Kubernetes
Since the snapshot already existed in AWS, I created a VolumeSnapshotContent resource referencing the AWS snapshot ID.
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: grafana-snap-content
spec:
  deletionPolicy: Retain
  driver: ebs.csi.aws.com
  volumeSnapshotClassName: ebs-csi-snapclass
  source:
    snapshotHandle: snap-xxxxxxxx
  volumeSnapshotRef:
    name: grafana-snapshot
    namespace: prometheus
```

This effectively imports the AWS snapshot into Kubernetes.
Creating the VolumeSnapshot
Next I created the Kubernetes snapshot resource:
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: grafana-snapshot
  namespace: prometheus
spec:
  volumeSnapshotClassName: ebs-csi-snapclass
  source:
    volumeSnapshotContentName: grafana-snap-content
```

Once created, Kubernetes recognised the snapshot as a restore source.
I confirmed it was ready:

```shell
kubectl get volumesnapshot -n prometheus
```

The output showed `READYTOUSE true`.

Restoring the Snapshot Into a CSI Volume
Once the snapshot was ready, I created a new PVC restoring the snapshot using the CSI storage class.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc-csi
  namespace: prometheus
spec:
  storageClassName: ebs-csi-gp3
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 5Gi
  dataSource:
    name: grafana-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```

Note that a KMS key cannot be set on the PVC itself; with the EBS CSI driver, encryption is controlled through the `encrypted` and `kmsKeyId` parameters on the StorageClass. Kubernetes then provisioned a new EBS volume using the CSI driver and restored the snapshot contents.
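With the EBS CSI driver, volume encryption and the KMS key are configured as StorageClass parameters rather than on the claim. For completeness, an encrypted variant of the ebs-csi-gp3 class might look like this (a sketch; the key ARN is the redacted one from this setup):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-csi-gp3-encrypted
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
  encrypted: "true"
  kmsKeyId: arn:aws:kms:eu-west-2:XXXXX:key/<XXXXXX>
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```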
Updating Grafana to Use the New Volume
Grafana was deployed using the kube-prometheus-stack Helm chart.
I updated the Helm values so Grafana would use the restored claim:
```yaml
grafana:
  persistence:
    enabled: true
    existingClaim: grafana-pvc-csi
```

After redeploying the chart, Grafana started successfully using the new CSI-backed volume.
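Redeploying is a standard helm upgrade against the existing release; the release name and values file here are assumptions about this particular setup:

```shell
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n prometheus -f values.yaml
```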
All existing configuration was preserved, including dashboards and data sources.
Final Result
After completing the work:

- The AWS EBS CSI driver is now managed through Terraform
- Grafana storage has been migrated from the in-tree AWS EBS provisioner to the CSI driver
- The monitoring stack now runs on a modern storage architecture
- Snapshots provide a safe rollback path for future changes
Closing Thoughts
This migration stemmed from a real operational issue.
A scheduling problem led me to review storage behaviour, which uncovered an outdated driver and an opportunity to modernise the cluster.
That kind of chain reaction is very common in platform engineering.
You start by fixing one issue, and if you take the opportunity, you leave the platform in a much better state than you found it.



