The Intentional Imperative: Designing Your IaC for Deliberate State Mutation

If you have ever watched a Terraform plan produce a diff that makes no sense—a planned destruction of a database that hasn't changed in the code, or a resource replacement that seems to come from nowhere—you have felt the gap between declarative intent and actual state mutation. The tool promises "infrastructure as code," but the code alone does not guarantee that each change is deliberate, well-ordered, or even safe. This guide is for teams who already use IaC daily and want to move beyond the default apply-and-hope cycle. We will explore a design principle we call deliberate state mutation: the practice of modeling each infrastructure change as an explicit, observable transition, not as a side effect of a generic reconciliation loop.

Why This Topic Matters Now

Infrastructure as Code has become the standard for provisioning and managing cloud resources. Terraform, Pulumi, CloudFormation, and others let us define desired state in files and let the tool figure out how to get there. This abstraction is powerful, but it hides a critical tension: the tool's plan is a guess about what changes are needed, and that guess can be wrong. When a team has dozens of environments, each with its own drift history, a single plan can cascade into unintended deletions, orphaned resources, or state that no longer reflects reality.

We have seen projects where a routine update to a security group rule triggered a full replacement of an EC2 instance, because the tool's internal logic interpreted a change in a dependency as a need to recreate. The team had no warning because they had not modeled the transition explicitly. The result was downtime, a rollback, and a loss of trust in the IaC pipeline. This is not a tool bug; it is a design gap. The tool cannot know which mutations are safe and which require careful sequencing. That knowledge must live in the code itself.

State drift is a recurring problem for teams that run IaC across many environments. The common response is to add more tooling: state locking, drift detection, approval gates. But these are band-aids. The root cause is that the IaC code describes what the infrastructure should look like, not how to get there safely. Deliberate state mutation flips this: we design our IaC to make transitions explicit, so that every change is a known, tested, and reversible step.

The Cost of Blind Reconciliation

When a Terraform apply runs, it compares the current state to the desired state and computes a diff. That diff is a set of actions: create, update, delete. But the tool does not know the cost of each action. To the planner, deleting a resource that holds data is the same as deleting a stateless one, and replacing a load balancer is the same as replacing a log group. The team must add safeguards manually (lifecycle settings such as prevent_destroy, manual approvals), but these are reactive, not designed.
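For reference, a reactive safeguard of this kind looks like the following in Terraform. This is a minimal sketch; the resource label and bucket name are hypothetical.

```hcl
resource "aws_s3_bucket" "audit_logs" {
  bucket = "example-audit-logs" # hypothetical bucket name

  lifecycle {
    # Any plan that would destroy this bucket fails with an error
    # instead of proceeding.
    prevent_destroy = true
  }
}
```

The setting blocks an accidental destroy, but it says nothing about how a legitimate replacement should be carried out, which is the gap the rest of this guide addresses.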

Deliberate mutation means that instead of letting the tool decide the sequence, we encode the sequence in our code. We break a deployment into phases: first, create the new resource; second, migrate traffic; third, delete the old resource. Each phase is a separate apply, with its own plan and its own rollback path. This is not a new idea—it is common in database migrations and application deployments—but it is surprisingly rare in infrastructure code.

Core Idea in Plain Language

Deliberate state mutation is the practice of making each infrastructure change a first-class operation with a defined input, output, and rollback. Instead of saying "I want the infrastructure to look like this," you say "I want to transition from state A to state B, and here is the exact sequence of operations to do that safely." The distinction is subtle but powerful.

Think of a database migration. You do not write a single script that drops a column and renames a table in one go. You write a migration that adds the new column, then a second migration that copies data, then a third that drops the old column. Each migration is reversible, and you can run them one at a time. Infrastructure should work the same way. A change that replaces a security group should be a series of steps: create the new group, attach it to resources, verify traffic, then remove the old group.

This approach requires a shift in how we write IaC. Instead of one monolithic configuration that represents the final state, we write a series of state transitions, each with its own module or configuration set. The tool still computes diffs, but we constrain the scope of each apply to a single transition. This is not the same as using multiple workspaces or environments; it is about breaking a single resource change into multiple steps.

Why Not Just Use Lifecycle Blocks?

Terraform's create_before_destroy and prevent_destroy are partial solutions. create_before_destroy tells the tool to create a new resource before destroying the old one, and prevent_destroy makes any plan that would destroy the resource fail, but neither tells it how to handle data migration or dependencies that live outside the configuration. If the new resource has a different name or ARN, references inside the same configuration are updated in the same apply, but anything that refers to the old one from elsewhere (another state file, a hard-coded ID, an external system) will break. Terraform cannot update those references because they are not part of its plan. Deliberate mutation addresses this by separating the creation of the new resource from the deletion of the old one, and by updating references as a separate step.
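As an illustration of what the lifecycle setting does cover, here is a minimal sketch of create_before_destroy on a security group. The variable and resource names are hypothetical.

```hcl
variable "vpc_id" {
  type = string
}

resource "aws_security_group" "web" {
  # name_prefix lets Terraform generate a unique name, so the replacement
  # can be created while the original group still exists.
  name_prefix = "web-"
  vpc_id      = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}
```

This handles ordering within one apply. It does not give you a pause between creating the new group and destroying the old one, which is where verification would normally happen.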

Another common workaround is to rename resources with moved blocks or with manual state commands such as terraform state mv. A moved block is at least reviewed along with the rest of the code, but manual state surgery is risky and hard to audit. A deliberate mutation approach avoids state manipulation by designing the infrastructure so that changes are additive before they are subtractive.
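For readers who have not used one, a moved block looks like this. The sketch assumes Terraform 1.1 or later, and both resource addresses are hypothetical.

```hcl
# Records that the resource formerly addressed as aws_security_group.web
# is now addressed as aws_security_group.web_frontend.
# Only the address in state changes; no infrastructure is modified.
moved {
  from = aws_security_group.web
  to   = aws_security_group.web_frontend
}
```

Note that this only changes how Terraform addresses an existing resource. It does not help when the resource itself has to be replaced.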

How It Works Under the Hood

At the implementation level, deliberate state mutation relies on a few key patterns: phased deployments, state versioning, and explicit dependency mapping. Let us look at each.

Phased Deployments

A phased deployment breaks a single change into multiple applies. Each apply moves the infrastructure from one safe state to another. For example, to replace a security group, you might have:

  • Phase 1: Create the new security group with the desired rules. No resources use it yet.
  • Phase 2: Update the resources that reference the security group to use the new one. This is a separate apply that only touches the referencing resources.
  • Phase 3: Remove the old security group after confirming that no resources reference it.

Each phase is a separate Terraform configuration or a separate workspace. The key is that each phase's desired state is a valid intermediate state. After phase 1, the infrastructure has both security groups. After phase 2, it has only the new one in use. After phase 3, it has only the new one.
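One way to keep phases separate is to give each phase configuration its own backend key. The following is a sketch under assumed names: the bucket, key path, and directory layout are all hypothetical.

```hcl
# phase-1/backend.tf (hypothetical layout)
terraform {
  backend "s3" {
    bucket = "example-tf-state"
    key    = "prod/sg-replacement/phase-1/terraform.tfstate"
    region = "us-east-1"
  }
}
```

A phase-2 directory would carry the same block with phase-2 in the key, so that each apply reads and writes only its own state.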

State Versioning

To make phases reversible, you need to version your state. This can be done with state backends that support versioning (like S3 with versioning enabled) or by taking snapshots before each apply. In practice, we recommend storing each phase's state file separately, or using a tool like Terragrunt to manage multiple state files for the same logical environment. The goal is that you can roll back a phase by re-applying the previous configuration, restoring the previous state version first if the state itself was corrupted or lost. Restoring an old state file on its own does not change any infrastructure; it only changes what Terraform believes exists.
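If the state lives in S3, versioning can itself be declared in Terraform. A minimal sketch, assuming AWS provider v4 or later and a hypothetical bucket name:

```hcl
resource "aws_s3_bucket" "tf_state" {
  bucket = "example-tf-state" # hypothetical bucket name
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    # Every write of a state file keeps the previous object version.
    status = "Enabled"
  }
}
```

With versioning enabled, each apply leaves the prior state object retrievable, which gives you the snapshot without a separate backup step.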

Explicit Dependency Mapping

Most IaC tools infer dependencies from resource references. But inference is not enough for deliberate mutation. You need to explicitly document which resources depend on each other for the transition, not just for the final state. For example, if resource B depends on resource A, and you are replacing A, you need to know that B must be updated to point to the new A before A can be destroyed. This is often done with output variables and data sources that are versioned.
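One possible way to make a transition dependency explicit is to publish the new resource's ID as an output in one phase and read it in the next. The sketch below assumes the S3 backend and hypothetical bucket, key, and file names; aws_security_group.new stands for the group created in the earlier phase.

```hcl
# phase-1/outputs.tf (hypothetical file)
output "new_security_group_id" {
  value = aws_security_group.new.id
}

# phase-2/main.tf (hypothetical file)
data "terraform_remote_state" "phase1" {
  backend = "s3"

  config = {
    bucket = "example-tf-state"
    key    = "prod/sg-replacement/phase-1/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Phase 2 depends on phase 1 only through this one published value.
  new_sg_id = data.terraform_remote_state.phase1.outputs.new_security_group_id
}
```

The output acts as the documented contract between phases: anything phase 2 needs from phase 1 has to be listed there.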

In practice, we use a small state machine encoded in the CI/CD pipeline. Each phase is a job that runs only if the previous phase succeeded. The pipeline stores the current phase number in a parameter store, so that a rollback can jump to any previous phase. This is not a new idea—it is the same pattern used by deployment strategies like blue-green or canary—but applied to resource-level changes.
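The phase counter can be an ordinary parameter. The sketch below assumes that "parameter store" means AWS Systems Manager Parameter Store; the parameter name is hypothetical.

```hcl
resource "aws_ssm_parameter" "current_phase" {
  name  = "/deploy/sg-replacement/current-phase" # hypothetical path
  type  = "String"
  value = "0" # initial value only

  lifecycle {
    # The pipeline overwrites the value after each successful phase,
    # so Terraform should not reset it on later applies.
    ignore_changes = [value]
  }
}
```

Terraform creates the parameter once, and the pipeline owns its value from then on.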

Worked Example: Replacing a Security Group

Let us walk through a concrete example. Suppose you have an EC2 instance that uses security group sg-old. You need to update the rules to allow a new port. The naive approach is to modify the security group resource in your Terraform code and run apply. Terraform will update the group in place, which is safe. But if the change requires a new group (e.g., because the name or VPC changes), you need a replacement.

With deliberate mutation, you define three phases:

Phase 1: Create the New Security Group

Create a new configuration that defines only the new security group. Apply it. The new group exists alongside the old one. The EC2 instance still uses sg-old.

resource "aws_security_group" "new" {
  name        = "web-new" # the group called sg-new in the text; AWS rejects names that start with "sg-"
  description = "New security group"
  vpc_id      = "vpc-12345"
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

After apply, you have two security groups. The state file for phase 1 is saved separately.

Phase 2: Update the EC2 Instance

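Now define a configuration that updates the EC2 instance to use sg-new. This phase only touches the instance resource, which must already be tracked in this configuration's state; otherwise Terraform would plan a second instance instead of an update. Because phase 1 keeps its own state, the new group's ID is passed in as a variable rather than referenced as a resource. Apply it. The instance is updated, and traffic starts flowing through the new group. The old group still exists but is no longer attached to any resource.

# ID of the group created in phase 1, supplied by the pipeline.
variable "new_security_group_id" {
  type = string
}

resource "aws_instance" "web" {
  ami                    = "ami-12345"
  instance_type          = "t3.micro"
  vpc_security_group_ids = [var.new_security_group_id]
}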

After apply, verify that the instance is reachable on port 443. If not, roll back by re-applying the previous configuration, which attaches sg-old to the instance again. Restore the saved state only if the state file itself was damaged.
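The verification step can be partly expressed in Terraform itself. The following is a sketch, assuming Terraform 1.5 or later and the hashicorp/http provider; the variable and check names are hypothetical.

```hcl
variable "health_url" {
  type        = string
  description = "HTTPS URL that should answer once the instance uses the new group (hypothetical)"
}

check "web_reachable_on_443" {
  # Scoped data source: queried only to evaluate this check.
  data "http" "probe" {
    url = var.health_url
  }

  assert {
    condition     = data.http.probe.status_code == 200
    error_message = "${var.health_url} did not return HTTP 200 after the security group switch."
  }
}
```

A failed check is reported as a warning and does not stop the apply, so the pipeline still has to decide whether to proceed to phase 3 or roll back.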

Phase 3: Remove the Old Security Group

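Finally, remove the resource block for sg-old from the configuration that manages it, then apply. Terraform destroys a resource when its block disappears from the configuration, not when a block describes it, so the plan should show exactly one resource to destroy. The old group is gone. The infrastructure now has only the new group.

# Deleted in this phase (shown commented out for reference):
#
# resource "aws_security_group" "old" {
#   vpc_id = "vpc-12345"
# }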
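If you would rather have the deletion appear in code review as an explicit statement, a removed block is one option. This is a sketch assuming Terraform 1.7 or later; it goes in the configuration whose state tracks the old group, in place of that group's resource block.

```hcl
removed {
  from = aws_security_group.old

  lifecycle {
    # true: destroy the real security group on apply.
    # false: only forget it from state and leave the group in AWS.
    destroy = true
  }
}
```

Setting destroy to false first and to true in a later apply would split the removal into two smaller steps, at the cost of one more phase.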

This three-step process is more work upfront, but it gives you a clean rollback at each step. If phase 2 fails, you can revert to phase 1 without recreating anything. If phase 3 fails, the old group remains but is unused, causing no harm.

Edge Cases and Exceptions

Deliberate mutation is not a silver bullet. There are situations where it adds overhead without benefit, and cases where it is impossible without additional tooling.

When the Change Is Idempotent and Safe

If a change is purely additive (e.g., adding a new rule to a security group) and the tool supports in-place updates, phased deployment is overkill. The tool's native plan is sufficient. Use deliberate mutation only when the change is destructive (delete or replace) or when the order of operations matters for correctness.
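A purely additive change of this kind might look like the following. It is a sketch that assumes AWS provider v5 or later and that the group's rules are managed as standalone rule resources rather than inline ingress blocks; the variable, port, and CIDR are hypothetical.

```hcl
variable "security_group_id" {
  type = string
}

# Adds one rule to an existing group. The group itself is not replaced.
resource "aws_vpc_security_group_ingress_rule" "admin_https" {
  security_group_id = var.security_group_id
  description       = "Admin HTTPS from the internal network"
  cidr_ipv4         = "10.0.0.0/16"
  from_port         = 8443
  to_port           = 8443
  ip_protocol       = "tcp"
}
```

The plan for this change should show a single resource to add and nothing to change or destroy, which is a quick signal that phases are not needed.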

When Resources Have Hard Dependencies

Some resources cannot exist simultaneously. For example, an ALB name must be unique within an account and region, so a replacement that has to keep the same name cannot be created while the original still exists. An alias record complicates this further: a Route53 alias that points to the ALB has to be repointed before the old ALB is deleted, or it will reference a load balancer that no longer exists. In such cases, you might need to use a temporary resource (like a dummy ALB) or accept a brief outage. The deliberate mutation approach still helps by making the outage explicit and planned.

When the State File Is Shared

If multiple team members run applies concurrently, phased deployments can conflict. Use state locking and separate state files per phase to avoid this. In practice, we recommend using a CI/CD pipeline that serializes phases, so only one phase runs at a time.
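With the S3 backend, state locking has traditionally been configured through a DynamoDB table. A minimal sketch with hypothetical bucket, key, and table names:

```hcl
terraform {
  backend "s3" {
    bucket  = "example-tf-state"
    key     = "prod/sg-replacement/phase-2/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true

    # A concurrent apply against the same key waits or fails
    # instead of overwriting state. Recent Terraform releases also
    # offer an S3-native lock file; check your version's backend docs.
    dynamodb_table = "example-tf-locks"
  }
}
```

Locking protects a single state file. Ordering between phases is still the pipeline's job.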

When the Tool Does Not Support Partial State

Some IaC tools (like CloudFormation) do not allow you to manage a subset of resources independently. You would need to use nested stacks or separate stacks per phase. This is possible but adds complexity. Terraform and Pulumi are more flexible because you can split configurations arbitrarily.

Limits of the Approach

Deliberate state mutation is a design discipline, not a tool feature. Its limits are largely about complexity and cost.

Increased Configuration Surface

Each phase requires its own configuration, which means more files, more state files, and more pipeline jobs. For large infrastructures with hundreds of resources, this can become unwieldy. We recommend applying deliberate mutation only to high-risk resources: databases, load balancers, IAM roles, and security groups. Stateless resources like auto-scaling groups or launch templates can be handled with simpler patterns.

Requires Pipeline Infrastructure

To manage phases, you need a CI/CD pipeline that can store phase state, run conditional jobs, and handle rollbacks. Not all teams have this. If your deployment pipeline is a single script that runs terraform apply, you will need to upgrade it. This is a one-time investment that pays off for critical resources.

Not a Substitute for Testing

Deliberate mutation does not eliminate the need for testing. Each phase should be tested independently, and the overall sequence should be tested in a staging environment. The approach makes testing easier because you can test each transition in isolation, but it does not remove the testing burden.

Can Encourage Over-Engineering

It is tempting to break every change into ten phases. Resist that. The goal is safety, not perfection. A good rule of thumb is: if a change can be rolled back with a single undo (like reverting a commit and re-applying), you do not need phases. If a rollback would require manual intervention or data recovery, you need phases.

Reader FAQ

Does this work with Pulumi?

Yes. Pulumi's programming model makes it easier to express phases as separate functions or stacks. You can use the same pattern: create a new resource in one stack, update references in another, and delete the old resource in a third. The key is to use stack references to pass outputs between phases.

How do I handle data migration?

Data migration is separate from infrastructure mutation. If you are replacing a database, you need to migrate data before switching traffic. This guide focuses on infrastructure resources, not data. But the same phased approach applies: create the new database, migrate data, switch traffic, then delete the old database. The infrastructure phases should align with the data migration steps.

Can I use this with Kubernetes?

Kubernetes already has a form of deliberate mutation through deployments and rolling updates. But if you are managing the cluster itself (e.g., upgrading the control plane), the same principles apply. Use phased updates for node groups, API server versions, and add-ons.

What if a phase fails halfway?

If a phase fails, you have two options: roll back to the previous phase's state, or fix the issue and re-run the same phase. Rollback is safer if you do not know why it failed. Re-running is acceptable if the failure was transient (e.g., network timeout). We recommend automating rollback for critical resources.

Practical Takeaways

Deliberate state mutation is not a tool or a library; it is a design principle. Here are three actions you can take today:

  1. Identify high-risk resources in your infrastructure—those whose replacement could cause data loss or downtime. For each, sketch a phased transition plan. Start with one resource and implement the phases in a staging environment.
  2. Audit your current IaC pipeline for the ability to run partial applies and store multiple state versions. If you use Terraform, consider Terragrunt or a custom script that manages phase state in a parameter store.
  3. Write a rollback playbook for each high-risk resource. Document the exact steps to revert a failed phase. Test the playbook in staging. This is more valuable than any tooling change.

The goal is not to eliminate risk but to make risk visible and manageable. When every mutation is deliberate, you can sleep better after an apply.
