Most Terraform code I see for EKS works for a single cluster and falls apart the moment a second environment shows up. The patterns below are what I have converged on across multiple production clusters at Turna and earlier projects: not best practices in the abstract, but choices that pay rent on real teams.
Module boundaries follow ownership, not AWS services
The temptation is to make a module per AWS service: modules/vpc, modules/eks, modules/rds. This looks tidy and ages poorly. Real ownership boundaries are coarser. A platform team owns "networking" (VPC + subnets + endpoints + NAT + flow logs) as one indivisible unit. They almost never change one piece without considering the others.
I structure modules around that ownership:
modules/
├── platform-network/ # VPC, subnets, endpoints, NAT, flow logs
├── platform-cluster/ # EKS, addons, IRSA, default node groups
├── platform-observability/ # Prometheus stack, Loki, alerting
└── app-database/ # RDS instances, parameter groups, IAM
Each module has 5–15 inputs and a flat output surface. If a module needs more than 20 inputs, it is doing too much.
One directory per environment, not one workspace
Terraform workspaces are tempting because they look like a free per-environment knob. They are a trap when used to separate prod from dev: every workspace shares the same backend configuration, and a single fat-fingered terraform workspace select points the next apply or destroy at the wrong environment.
What I use instead: separate directories per environment, each with its own backend config and its own state file.
envs/
├── dev/
│ ├── main.tf # Calls platform-* modules with dev-sized inputs
│ ├── backend.tf # S3 key: env/dev/terraform.tfstate
│ └── terraform.tfvars
├── stage/
└── prod/
Workspaces are still useful for short-lived stack variants (a feature branch's preview cluster, a load test stack). For prod vs dev, separate directories with separate backends.
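As a minimal sketch of that layout, the dev directory's backend.tf could look like the following. The bucket name and region are placeholders I made up for illustration; only the key comes from the tree above.

```hcl
# envs/dev/backend.tf
terraform {
  backend "s3" {
    bucket  = "example-terraform-state"   # hypothetical bucket name
    key     = "env/dev/terraform.tfstate" # key from the directory tree above
    region  = "eu-west-1"                 # assumed region
    encrypt = true
  }
}
```

stage and prod would carry the same block with their own key (or their own bucket), so a command run inside envs/dev can only ever touch dev state.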
Variables: required, optional, derived
A pattern that has saved me dozens of hours: separate variables into three buckets by intent.
- Required inputs have no default. The plan fails loudly if the caller forgot one. Example: cluster_name, vpc_cidr.
- Optional inputs have a sensible default that works for 80% of callers. Example: node_group_instance_types = ["t3.large"].
- Derived values are locals, not variables. Example: locals { subnet_count = length(var.availability_zones) }.
I never use optional variables for safety-critical knobs. If turning on cluster logging is important, make it required.
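Put together, the three buckets might look like this in a module's variables file. The names are the ones used in the examples above; treating availability_zones as a required input and the description strings are my assumptions.

```hcl
# Required: no default, so the caller must supply a value.
variable "cluster_name" {
  type        = string
  description = "Name of the EKS cluster."
}

variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the VPC."
}

# Assumed to be required as well; the bucket it belongs in is a judgment call.
variable "availability_zones" {
  type        = list(string)
  description = "Availability zones to spread subnets across."
}

# Optional: a default that suits most callers.
variable "node_group_instance_types" {
  type        = list(string)
  description = "Instance types for the default node group."
  default     = ["t3.large"]
}

# Derived: computed from inputs, never set by the caller.
locals {
  subnet_count = length(var.availability_zones)
}
```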
Published outputs, not remote state
You will eventually need values from one stack (the network ID, a security group) in another (the cluster, an app). Two ways:
- The data source way: data "terraform_remote_state" "network". Reaches into another stack's state.
- The output-and-pass way: write the value to SSM Parameter Store or AWS Secrets Manager, read it from the consumer.
I default to the second one. Reaching into another stack's state couples your modules tightly and breaks if the producer reorganizes its outputs. SSM is a stable contract you control, and it works for tools other than Terraform.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/platform/network/${var.env}/vpc_id"
  type  = "String"
  value = aws_vpc.main.id
}
On the consumer side, a small data block fetches it. No state coupling.
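A sketch of that consumer side, assuming the consumer also takes an env variable and reads the same parameter path. The security group is a made-up example of something that needs the VPC ID.

```hcl
variable "env" {
  type        = string
  description = "Environment name, matching the producer's parameter path."
}

# Read the value the network stack published.
data "aws_ssm_parameter" "vpc_id" {
  name = "/platform/network/${var.env}/vpc_id"
}

# Hypothetical consumer resource that needs the VPC ID.
resource "aws_security_group" "app" {
  name   = "app-${var.env}"
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

One thing to expect: the AWS provider marks the data source's value attribute as sensitive, so the plan output hides it. Wrapping the reference in nonsensitive() is an option when the value is as harmless as a VPC ID.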
The provider block belongs to the root, not the module
A module that declares its own provider block (with region, profile, assume-role) seems convenient until you need a second instance of it. Terraform rejects count, for_each, and depends_on on any module that contains its own provider configuration, and a hard-coded provider cannot be swapped per call for another region or account. The fix is to keep required_providers in the module (declaring the version range it tolerates) but put the actual provider block only in the root configuration.
# inside module
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# in root
provider "aws" {
  region = var.region
}
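With the provider owned by the root, the same module can be instantiated against different provider configurations. A rough sketch, assuming a second region and a relative module path that are not from the original setup:

```hcl
# Root configuration: two configurations of the same provider.
provider "aws" {
  region = "eu-west-1" # assumed primary region
}

provider "aws" {
  alias  = "dr"
  region = "eu-central-1" # assumed second region
}

module "cluster_primary" {
  source = "../../modules/platform-cluster" # assumed relative path
  # Uses the default (unaliased) aws provider.
  # ... module inputs omitted
}

module "cluster_dr" {
  source = "../../modules/platform-cluster"

  # Hand the aliased configuration to the module under the name it expects.
  providers = {
    aws = aws.dr
  }
  # ... module inputs omitted
}
```

The module itself stays unchanged; it only knows it needs an aws provider in the version range it declared.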
Lock the module versions
Calling a module by Git reference without a tag means the next terraform init may pull a different commit. Always pin to a tag:
module "cluster" {
  source = "git::https://github.com/org/tf-modules.git//platform-cluster?ref=v1.4.2"
  # ...
}
Renovate or Dependabot can open PRs to bump these tags. The PR is the audit trail.
Things I have stopped using
- Per-environment tfvars files inside a single directory. Easier than separate directories on day one, much worse on day 100. Switch early.
- Recursive module nesting. A module that calls another module that calls another module produces traces no human can follow. Two levels max.
- The terraform-aws-modules EKS module's defaults without overrides. Excellent starting point; reading the source the first time is mandatory.
- Using count for environment toggling. count = var.env == "prod" ? 1 : 0 creates resources whose addresses depend on the variable (every reference needs an index) and that are destroyed or created when you change it. Use modules or branches in main.tf.
What stays the same across all of this
The actual EKS cluster config is mostly boring. Two node groups (system + workload), IRSA enabled, CloudWatch logging on, KMS-encrypted secrets, private API endpoint. The interesting code is everything around the cluster — the IAM policies, the addon orchestration, the per-app IRSA roles. That is where Terraform earns its keep.
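To show how little there is to the boring part, here is a minimal sketch of a cluster resource with logging, KMS-encrypted secrets, and a private-only endpoint. All variable names and the choice of log types are my assumptions; node groups and the OIDC provider for IRSA are left out.

```hcl
variable "cluster_name" {
  type = string
}

variable "cluster_role_arn" {
  type        = string
  description = "IAM role the EKS control plane assumes (hypothetical input)."
}

variable "private_subnet_ids" {
  type = list(string)
}

variable "kms_key_arn" {
  type        = string
  description = "KMS key used to encrypt Kubernetes secrets (hypothetical input)."
}

resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = var.cluster_role_arn

  # Control plane logs shipped to CloudWatch; the set of types is an assumption.
  enabled_cluster_log_types = ["api", "audit", "authenticator"]

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = false
  }

  # Envelope encryption for Kubernetes secrets.
  encryption_config {
    resources = ["secrets"]

    provider {
      key_arn = var.kms_key_arn
    }
  }
}
```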