AWS to GCP, after the workload inventory

The cloud bill does not care which provider you like. It cares about idle machines, chatty networks, bad storage classes, and workloads nobody has looked at since the last migration.

The AWS to GCP move that actually worked for us was not a religious provider swap. It was an inventory exercise with a migration attached. We were moving AI training and processing away from self-hosted boxes and into managed GPU workflows, but the savings came from the questions we asked before touching Terraform.

Measure the workload, not the invoice

An invoice tells you where the money went. It rarely tells you why.

I want the inventory by workload:

what runs all day;
what runs only during training or batch processing;
which services need GPUs and which only inherited them;
where data crosses regions or providers;
which disks exist because deleting them felt risky;
which dashboards, logs, and traces nobody reads.

That inventory changed the tone of the migration. The conversation stopped being "GCP is cheaper" and became "this specific workload should not be warm all month."

The provider mattered. Managed GPU availability mattered. But the real work was separating always-on product infrastructure from bursty AI training and processing. Once we did that, the architecture got less dramatic and the bill got more honest.

flowchart LR
    A[workload inventory] --> B[classify runtime]
    B --> C{always on?}
    C -->|yes| D[right-size platform]
    C -->|no| E[batch / managed GPU]
    D --> F[move with health checks]
    E --> F
    F --> G[compare cost + latency]
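The "classify runtime" step in that flow can be sketched in a few lines of Python. This is a minimal illustration, not the tooling we used: the `Workload` record, the hourly activity flags, and the 0.9 threshold are all assumptions made up for the example, and real input would come from whatever utilization data your monitoring already exports.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    hourly_active: list[bool]  # hypothetical: one flag per sampled hour
    uses_gpu: bool


def classify(workload: Workload, always_on_threshold: float = 0.9) -> str:
    """Label a workload by the fraction of sampled hours it was active."""
    if not workload.hourly_active:
        return "unknown"  # no samples means the inventory is incomplete
    active = sum(workload.hourly_active) / len(workload.hourly_active)
    return "always-on" if active >= always_on_threshold else "batch"


def next_step(workload: Workload) -> str:
    """Map the label onto the branches of the flowchart."""
    label = classify(workload)
    if label == "always-on":
        return "right-size platform"
    if label == "batch":
        return "batch / managed GPU"
    return "needs inventory"
```

The threshold is a judgment call. The point is that the label comes from measured activity, so a service that only inherited its GPU shows up as a batch candidate instead of staying warm all month.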

Move the boring things first

The safest migrations are deliberately dull.

We started with workloads that had clear inputs, clear outputs, and health checks the team trusted. Batch jobs are good candidates. Internal services are good candidates. Anything with unclear ownership or hidden customer coupling does not belong in the first wave; it is where migrations go to become folklore.

For each service, I wanted three things written down before the move:

What proves the workload is healthy?
What metric says the move is unsafe?
Who owns the forward fix when a boring migration window stops being boring?

That sounds slow. It is faster than debugging a cross-cloud incident where nobody remembers which DNS entry was supposed to be authoritative.
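One way to make those three answers hard to skip is a small pre-move check. The sketch below is an assumption about how such a gate could look; the `MoveRecord` fields and the example service are hypothetical, not something from our runbooks.

```python
from dataclasses import dataclass, fields


@dataclass
class MoveRecord:
    service: str
    health_proof: str = ""       # what proves the workload is healthy
    unsafe_metric: str = ""      # what metric says the move is unsafe
    forward_fix_owner: str = ""  # who owns the forward fix


def missing_answers(record: MoveRecord) -> list[str]:
    """Return the names of fields that are still blank."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


def ready_for_first_wave(record: MoveRecord) -> bool:
    """A service qualifies only when every answer is written down."""
    return not missing_answers(record)


record = MoveRecord(
    service="thumbnail-batch",  # hypothetical service name
    health_proof="nightly job exits 0 and writes an output manifest",
    unsafe_metric="job duration more than twice the 30-day median",
)
print(missing_answers(record))  # ['forward_fix_owner']
```

A check like this does not judge whether the answers are good. It only refuses to schedule a move while one of them is empty.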

Terraform is not the plan

Terraform can make the new shape repeatable. It cannot tell you whether the new shape should exist.

The plan lived in runbooks, owners, dashboards, and health checks. Terraform followed that plan. When infrastructure code becomes the only source of truth, teams stop writing down the operational truth: the weird dependency, the one manual approval, the fact that training can pause overnight but inference cannot.

The useful split was:

Terraform for the durable shape;
scripts for repeatable one-time moves;
runbooks for human sequencing;
dashboards for "did it work?";
weekly cleanup for the things the migration revealed we should delete.

The savings came from deletion

The provider move helped. The managed GPU flow helped. Committed-use planning helped. But the part I trust most is the deletion.

We deleted old instances, idle disks, stale logs, duplicated environments, and "temporary" services that had survived long enough to become normal. The migration gave us permission to ask whether each thing still deserved to exist.

That is why the cost cut held. A cheaper copy of a messy platform is still a messy platform. A measured migration is a chance to remove the mess while you still have everyone's attention.
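The weekly cleanup pass can start from a list as simple as the one below. This is a sketch under stated assumptions: the `Resource` record, its fields, and the 30-day idle window are hypothetical, and the inventory itself would come from your own export, not from any provider API shown here.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Resource:
    name: str
    kind: str            # e.g. "disk", "instance", "log-bucket"
    last_used: date      # hypothetical: last observed read, write, or request
    attached: bool       # still referenced by something that runs
    monthly_cost: float


def deletion_candidates(
    resources: list[Resource], today: date, idle_days: int = 30
) -> list[Resource]:
    """Unattached resources idle longer than the window, costliest first."""
    cutoff = today - timedelta(days=idle_days)
    stale = [r for r in resources if not r.attached and r.last_used < cutoff]
    return sorted(stale, key=lambda r: r.monthly_cost, reverse=True)
```

The output is a review list, not a delete command. Sorting by cost puts the conversations worth having at the top, and a named owner still decides whether each item deserves to exist.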