Infrastructure as Code (IaC) has transformed how we manage cloud resources. After managing Terraform configurations for large-scale infrastructure, I've learned patterns that prevent common pitfalls and enable teams to scale effectively.
State Management
The foundation of Terraform success is proper state management:
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
Key principles:
- Remote state for collaboration
- State locking to prevent conflicts
- Encryption at rest
- Separate states per environment (see the sketch below)
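A common way to keep states separate per environment is to reuse the same bucket and lock table but give each environment its own state key. A minimal sketch reusing the backend settings from above; the file path is illustrative:
# environments/staging/backend.tf (illustrative path)
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "staging/terraform.tfstate" # one key per environment
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}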
Module Organization
Structure code for reusability:
# modules/k8s-cluster/main.tf
variable "cluster_name" {
  type = string
}

variable "node_count" {
  type    = number
  default = 3 # feeds the cluster's node group (not shown here)
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_eks_cluster" "main" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn # IAM role defined elsewhere in the module

  vpc_config {
    subnet_ids = var.subnet_ids
  }
}
# Root module usage
module "production_cluster" {
  source       = "./modules/k8s-cluster"
  cluster_name = "production"
  node_count   = 10
  subnet_ids   = module.network.private_subnet_ids # from a networking module (not shown)
}
Environment Management
Use workspaces or separate directories:
# Directory structure approach
infrastructure/
├── modules/
│   ├── network/
│   ├── k8s/
│   └── database/
└── environments/
    ├── dev/
    │   ├── main.tf
    │   └── terraform.tfvars
    ├── staging/
    └── production/
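Each environment directory then composes the shared modules with its own values. A minimal sketch; the module inputs and outputs shown here are illustrative:
# environments/production/main.tf (illustrative)
module "network" {
  source      = "../../modules/network"
  environment = "production"
}

module "k8s" {
  source       = "../../modules/k8s"
  cluster_name = "production"
  node_count   = 10
  subnet_ids   = module.network.private_subnet_ids
}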
Testing Infrastructure
Validate changes before applying:
// Terratest example
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestTerraformModule(t *testing.T) {
	opts := &terraform.Options{
		TerraformDir: "../modules/k8s-cluster",
		Vars: map[string]interface{}{
			"cluster_name": "test-cluster",
		},
	}
	// Clean up test resources even if assertions fail
	defer terraform.Destroy(t, opts)

	terraform.InitAndApply(t, opts)

	clusterName := terraform.Output(t, opts, "cluster_name")
	assert.Equal(t, "test-cluster", clusterName)
}
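For terraform.Output to return cluster_name, the module must declare a matching output. The module listing above doesn't show one, so an outputs file along these lines is assumed:
# modules/k8s-cluster/outputs.tf (assumed)
output "cluster_name" {
  value = aws_eks_cluster.main.name
}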
CI/CD Integration
Automate infrastructure changes:
# GitLab CI
stages:
  - validate
  - plan
  - apply

validate:
  stage: validate
  script:
    - terraform init -backend=false
    - terraform fmt -check
    - terraform validate

plan:
  stage: plan
  script:
    - terraform init
    - terraform plan -out=plan.tfplan
  artifacts:
    paths:
      - plan.tfplan

apply:
  stage: apply
  script:
    - terraform init
    - terraform apply plan.tfplan
  when: manual
  only:
    - main
Drift Detection
Detect manual changes:
#!/bin/bash
# Run daily via cron
# With -detailed-exitcode, exit code 2 means the plan succeeded and changes are pending (drift)
terraform plan -detailed-exitcode
if [ $? -eq 2 ]; then
  echo "Drift detected!" | mail -s "Infrastructure Drift" ops@example.com
fi
Cost Estimation
Implement cost controls:
# Use Infracost in CI
resource "aws_instance" "expensive" {
  instance_type = "m5.24xlarge" # Infracost warns about cost
}
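Beyond CI-time estimates, a budget alarm in the account itself can act as a backstop. A minimal sketch using the AWS provider's budget resource; the limit, threshold, and email address are placeholders:
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-infrastructure-budget"
  budget_type  = "COST"
  limit_amount = "1000" # placeholder: monthly limit in USD
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Alert when forecasted spend crosses 80% of the budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["ops@example.com"]
  }
}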
Security and Compliance
Implement security scanning for infrastructure code:
# Example bucket configured to satisfy common security scans
# (On AWS provider v4+, the inline blocks below are deprecated in favor of
# standalone aws_s3_bucket_* resources, but the intent is the same.)
resource "aws_s3_bucket" "secure" {
  bucket = "my-secure-bucket"

  # Ensure encryption at rest
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }

  # Enable versioning
  versioning {
    enabled = true
  }

  # Lifecycle policy: move objects to cheaper storage classes over time
  lifecycle_rule {
    enabled = true

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}

# Public access settings are not bucket arguments; they live on a separate resource
resource "aws_s3_bucket_public_access_block" "secure" {
  bucket = aws_s3_bucket.secure.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
Use tools like tfsec, checkov, or Terraform Sentinel for policy enforcement:
# Run security scan
tfsec .
# Output example
Check 1/25: S3 Bucket has an ACL defined which allows public READ access.
modules/storage/main.tf:15-23
# Fix issues before applying
Implement policy as code with Sentinel:
import "tfplan"
# Ensure all S3 buckets have encryption
main = rule {
all tfplan.resources.aws_s3_bucket as _, instances {
all instances as _, r {
r.applied.server_side_encryption_configuration != null
}
}
}
Advanced Patterns
Dynamic Blocks
Generate repeated configuration blocks:
variable "ingress_rules" {
type = list(object({
port = number
protocol = string
cidr_blocks = list(string)
}))
}
resource "aws_security_group" "main" {
name = "main-sg"
dynamic "ingress" {
for_each = var.ingress_rules
content {
from_port = ingress.value.port
to_port = ingress.value.port
protocol = ingress.value.protocol
cidr_blocks = ingress.value.cidr_blocks
}
}
}
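The variable can then be populated from a tfvars file or a module call; the values below are purely illustrative:
# terraform.tfvars (illustrative values)
ingress_rules = [
  { port = 443, protocol = "tcp", cidr_blocks = ["0.0.0.0/0"] },
  { port = 22, protocol = "tcp", cidr_blocks = ["10.0.0.0/8"] },
]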
Conditional Resource Creation
Create resources based on conditions:
variable "enable_monitoring" {
type = bool
default = false
}
resource "aws_cloudwatch_dashboard" "main" {
count = var.enable_monitoring ? 1 : 0
dashboard_name = "monitoring-dashboard"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
properties = {
metrics = [
["AWS/EC2", "CPUUtilization"]
]
}
}
]
})
}
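Because the resource uses count, references to it elsewhere must handle the zero-instance case. A short sketch of an output (the output name is illustrative) that returns null when monitoring is disabled:
output "dashboard_name" {
  # null when monitoring is disabled, since count = 0 produces an empty list
  value = one(aws_cloudwatch_dashboard.main[*].dashboard_name)
}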
Data Sources for Discovery
Fetch existing resources:
# Find latest AMI
data "aws_ami" "latest" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

# Use in resource
resource "aws_instance" "app" {
  ami           = data.aws_ami.latest.id
  instance_type = "t3.micro"
}
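Data sources are also handy for wiring modules together without hard-coding IDs, for example discovering subnets by tag. The tag key and value here are illustrative:
data "aws_subnets" "private" {
  filter {
    name   = "tag:Tier"
    values = ["private"]
  }
}

# e.g. subnet_ids = data.aws_subnets.private.ids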
Workspace Management
Use Terraform workspaces for environment separation:
# Create workspaces
terraform workspace new dev
terraform workspace new staging
terraform workspace new production
# List workspaces
terraform workspace list
# Select workspace
terraform workspace select production
# Use workspace in code
resource "aws_instance" "app" {
instance_type = terraform.workspace == "production" ? "t3.large" : "t3.micro"
tags = {
Environment = terraform.workspace
}
}
However, for complex environments, separate directories often provide better isolation and clarity.
Remote Execution
Use Terraform Cloud or Enterprise for team collaboration:
terraform {
  cloud {
    organization = "my-org"

    workspaces {
      name = "production-infrastructure"
    }
  }
}
Benefits of remote execution:
- Consistent execution environment
- Automatic state management
- Role-based access control
- Policy enforcement
- Cost estimation
- Audit logging
Performance Optimization
For large infrastructures, optimize performance:
# Use parallel operations (default: 10); increase for faster applies
terraform apply -parallelism=20

# Target specific resources
terraform apply -target=module.networking

# Refresh state less frequently
terraform plan -refresh=false
Resource Targeting
When working with large state files:
# Plan specific module
terraform plan -target=module.database
# Apply specific resource
terraform apply -target=aws_instance.web[0]
# Destroy specific resource
terraform destroy -target=aws_s3_bucket.logs
Use targeting sparingly; it can lead to inconsistent state.
Import Existing Resources
Bring existing infrastructure under Terraform management:
# Import EC2 instance
terraform import aws_instance.web i-1234567890abcdef0
# Import S3 bucket
terraform import aws_s3_bucket.data my-bucket-name
Write the corresponding configuration first:
resource "aws_instance" "web" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.micro"
# ... other required attributes
}
Then import and verify:
terraform import aws_instance.web i-1234567890abcdef0
terraform plan # Should show no changes
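Terraform 1.5 and later also support declarative import blocks, which let the import happen as part of a normal plan and apply instead of a separate CLI step. A brief sketch for the same instance:
import {
  to = aws_instance.web
  id = "i-1234567890abcdef0"
}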
Troubleshooting Common Issues
State Lock Issues
If state is locked after a failed operation:
# Force unlock (use carefully!)
terraform force-unlock <lock-id>
# Check DynamoDB for lock details
aws dynamodb get-item \
--table-name terraform-locks \
--key '{"LockID": {"S": "my-state-file-md5"}}'
State Drift
When manual changes cause drift:
# Detect drift
terraform plan -detailed-exitcode

# Accept the drifted values into state without changing infrastructure
terraform apply -refresh-only

# Bring resources created outside Terraform under management
terraform import <resource_type>.<name> <id>

# Or update the configuration to match reality and re-apply
terraform apply
Module Version Conflicts
Pin module versions:
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 3.0" # Allow minor updates
}
Use version constraints:
- = : exact version
- >= : greater than or equal
- ~> : pessimistic constraint; only the rightmost version component may increase (for example, ~> 1.2.0 allows 1.2.x, while ~> 1.2 allows any 1.x release at or above 1.2)
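The same pinning discipline applies to the provider and the Terraform CLI itself; a typical constraints block (the versions shown are illustrative):
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}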
Conclusion
Successful Infrastructure as Code at scale requires:
- Robust state management with remote backends and locking
- Modular code organization for reusability and maintainability
- Environment isolation through workspaces or separate directories
- Automated testing with tools like Terratest
- CI/CD integration for consistent, automated deployments
- Drift detection to catch manual changes
- Cost awareness with tools like Infracost
- Security scanning with tfsec, checkov, or Sentinel
- Documentation for team knowledge sharing
- Incremental adoption starting simple and adding complexity
The key to success is treating infrastructure code with the same rigor as application code. Use version control, peer reviews, automated testing, and continuous integration. Start with core principles, then layer on advanced patterns as your infrastructure scales and team grows.
Infrastructure as Code is a journey, not a destination. Continuously refine your practices, learn from incidents, and share knowledge across your organization. The investment in proper IaC practices pays dividends in reduced operational toil, faster deployments, and more reliable infrastructure.