RDS Right-Sizing: Why It's Harder Than It Looks

01 Mar 2026

RDS is often the largest single line item in an AWS bill and the one teams are most afraid to touch.

On paper, right-sizing looks simple. In practice, it rarely is.

The reasoning usually looks like this:

Instance size looks oversized.
CPU utilization seems low.
Storage appears underused.

This is where most cost optimization attempts go wrong.

The false signals

CPU utilization lies.

A common strategy:

“Monitor maximum CPU last N weeks/months. Resize if the CPU is below a certain threshold.”

This is partially right, but it is not sufficient.

CPU can look healthy while the database is already constrained, or look scary while the database is perfectly fine.

CPU tells you how busy the CPU cores are, not whether the database is the bottleneck.

How CPU lies

Lie #1: CPU is low, but the DB is struggling

This happens when the bottleneck is elsewhere:

I/O wait (EBS throughput or IOPS).
Lock contention.
Slow queries waiting on missing or inefficient indexes.
Network latency.
Memory pressure causing page churn.

In these cases:

CPU may sit at 20-30%.
Latency is terrible.
Developers (and alarms) scream.
And the instinctive (wrong) move is: “scale up the instance”.

You pay more.

The problem stays.

CPU is a signal, not a verdict.
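As a minimal sketch of checking the other signals before scaling up, the function below looks at a few CloudWatch RDS metrics (CPUUtilization, ReadLatency, DiskQueueDepth, SwapUsage) whenever CPU is low. The `Snapshot` type, the function name, and every threshold are assumptions made for illustration, not recommended values. Lock contention and network latency do not show up in these metrics and need engine-level tooling.

```python
from dataclasses import dataclass

# Illustrative thresholds only (assumptions); tune them per workload and storage type.
LOW_CPU_PCT = 40.0
HIGH_READ_LATENCY_S = 0.010      # 10 ms
HIGH_QUEUE_DEPTH = 10.0
MAX_SWAP_BYTES = 50 * 1024**2    # 50 MiB


@dataclass
class Snapshot:
    """Averages over one observation window, named after CloudWatch RDS metrics."""
    cpu_utilization: float    # CPUUtilization, percent
    read_latency: float       # ReadLatency, seconds
    disk_queue_depth: float   # DiskQueueDepth, outstanding I/O requests
    swap_usage: float         # SwapUsage, bytes


def non_cpu_suspects(s: Snapshot) -> list[str]:
    """Return likely bottlenecks that a low CPU reading would hide."""
    if s.cpu_utilization >= LOW_CPU_PCT:
        return []  # CPU is busy; this check only covers the low-CPU case
    suspects = []
    if s.read_latency > HIGH_READ_LATENCY_S:
        suspects.append("storage latency")
    if s.disk_queue_depth > HIGH_QUEUE_DEPTH:
        suspects.append("I/O queueing")
    if s.swap_usage > MAX_SWAP_BYTES:
        suspects.append("memory pressure")
    return suspects


# CPU at 25% with slow reads and a deep queue: the bottleneck is storage, not compute.
print(non_cpu_suspects(Snapshot(25.0, 0.030, 22.0, 0.0)))
```

A non-empty result suggests that a bigger instance class would add cost without addressing the constraint.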

Lie #2: CPU spikes don’t mean “we need a bigger instance”

CPU spikes are normal.

Common examples:

Backups.
Maintenance windows.
Short-lived traffic bursts.
Cache warmups.
Batch jobs.

If you size for peak CPU, you end up paying 24/7 for minutes of pressure.

This is one of the most common RDS cost traps. How many downsize initiatives are frozen because of a single isolated weekly peak?
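To make the difference between peak and sustained load concrete, here is a small sketch that summarizes a CPU series by its maximum, its 95th percentile, and the share of samples above a threshold. The function names, the 80% threshold, and the sample week are hypothetical.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


def cpu_profile(samples, threshold=80.0):
    """Summarize CPU samples (percent) as peak, p95, and share of time above threshold."""
    above = sum(1 for v in samples if v > threshold)
    return {
        "max": max(samples),
        "p95": percentile(samples, 95),
        "share_above": above / len(samples),
    }


# One week of hourly samples: a steady 20% with a single two-hour batch spike.
week = [20.0] * 166 + [95.0] * 2
profile = cpu_profile(week)
print(profile)
```

In this made-up week the maximum is 95% while the 95th percentile is 20%, and the spike covers about 1% of the samples. A rule based on the maximum alone would block the downsize; a rule based on sustained behavior would at least prompt a closer look at what runs during those two hours.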

Lie #3: CPU hides bad architecture

High CPU is often caused by:

Missing indexes
N+1 query patterns
Chatty ORMs
Poor connection pooling

Scaling RDS “fixes” it (temporarily) and locks in higher cost.

This is cost optimization debt.

Memory is not optional

A database does not gracefully degrade when memory is tight. It fails loudly.

Memory in RDS is used for:

Buffer/cache (pages, indexes).
Sorts and joins.
Temporary tables.
Connection overhead.

When memory is insufficient:

Disk I/O explodes.
Latency spikes.
CPU may stay moderate.
Users feel pain.

And here’s the trap:

Teams see “FreeableMemory still > 0” and assume they’re safe.

They are not.

Why FreeableMemory lies

FreeableMemory includes:

Cache that could be freed.
Memory that will be reclaimed under pressure.

But databases want to keep memory hot.

A low-but-stable FreeableMemory is often good.

A sawtooth pattern (drops -> sudden reclaim -> drops again) is a warning sign:

The DB is constantly evicting useful data.
Reads fall through to disk.
Performance becomes noisy and unpredictable.
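One rough way to tell a low-but-stable series from a sawtooth is to count sudden upward jumps in FreeableMemory. The sketch below assumes a list of samples in bytes; the function name and the 10% jump threshold are illustrative assumptions, and a real check would also look at read IOPS over the same window.

```python
GIB = 1024**3


def count_reclaims(freeable_bytes, total_bytes, jump_fraction=0.10):
    """Count sudden upward jumps in a FreeableMemory series.

    A jump is counted when one sample exceeds the previous sample by more
    than jump_fraction of the instance's total memory.
    """
    threshold = jump_fraction * total_bytes
    return sum(
        1
        for prev, cur in zip(freeable_bytes, freeable_bytes[1:])
        if cur - prev > threshold
    )


total = 16 * GIB  # hypothetical 16 GiB instance

# Low but stable: hovers around 2 GiB with small noise.
stable = [2.0 * GIB, 2.1 * GIB, 1.9 * GIB, 2.0 * GIB, 2.1 * GIB, 2.0 * GIB]

# Sawtooth: drains to 1 GiB, jumps back to 6 GiB, drains again.
sawtooth = [g * GIB for g in (6, 4, 2, 1, 6, 4, 2, 1, 6, 4)]

print(count_reclaims(stable, total), count_reclaims(sawtooth, total))
```

Zero jumps on a low series points to a warm, well-used cache. Repeated jumps suggest the cache keeps being emptied and refilled.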

Cost optimization mistake #1

“CPU is low, so let’s downsize.”

A smaller instance class also comes with less memory, which:

Shrinks buffer cache.
Increases disk reads.
Pushes cost into I/O.
Makes performance worse and cost higher.

This is why memory is not optional.

It is the foundation, not an optimization lever.

If memory pressure exists, instance downscaling is not a cost optimization. It’s risk creation. Before downsizing an instance, study the memory pattern.

Storage & IOPS: the hidden coupling

The illusion

“We’re not I/O bound. CPU is fine.”

Reality:

Memory pressure -> cache misses.
Cache misses -> disk reads.
Disk reads -> IOPS + latency dependency.

You don’t choose to care about IOPS. Memory problems force you to.

Once memory becomes insufficient, storage performance stops being a secondary concern and becomes the primary constraint.

The dangerous coupling (especially on GP2)

With GP2:

Baseline IOPS scale with storage size (3 IOPS per provisioned GiB).
Need more IOPS? -> Buy more disk.
Even if you don’t need the space.

Result:

Storage grows.
Cost grows.
Root cause (memory / queries) untouched.
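The arithmetic behind this coupling can be sketched as follows. The constants are the published baseline figures for a single EBS gp2 volume (3 IOPS per GiB, a floor of 100, a ceiling of 16,000); treat them as assumptions here, since RDS may stripe storage across volumes at larger sizes and the sketch ignores burst credits. The function names are hypothetical.

```python
import math

# Assumed single-volume gp2 baseline figures; verify against current AWS documentation.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size (burst not modeled)."""
    return max(GP2_MIN_IOPS, min(GP2_IOPS_PER_GIB * size_gib, GP2_MAX_IOPS))


def gp2_size_for_iops(target_iops):
    """Smallest size in GiB whose baseline reaches target_iops."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("target exceeds the assumed per-volume gp2 ceiling")
    return math.ceil(target_iops / GP2_IOPS_PER_GIB)


data_gib = 200                               # space the database actually needs
provisioned_gib = gp2_size_for_iops(3000)    # size required to sustain 3,000 IOPS
print(gp2_baseline_iops(data_gib), provisioned_gib, provisioned_gib - data_gib)
```

Under these assumptions, a 200 GiB database that needs 3,000 sustained IOPS ends up on a 1,000 GiB volume: 800 GiB is paid for only to buy I/O.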

This is classic cost displacement, not optimization.

GP3 helps, but doesn’t save you

GP3 decouples:

Storage size.
IOPS.
Throughput.

Good. But:

It makes IOPS visible.
It turns hidden inefficiency into a bill line.
Teams suddenly ask: “Why did IOPS jump?”

Answer: because memory was never sufficient.

Cost optimization mistake #2

“Let’s reduce storage, it’s expensive.”

Reducing storage without understanding I/O:

Lowers baseline IOPS (on GP2).
Increases latency.
Triggers retries.
Raises CPU.
Creates instability.

You save dollars and spend credibility.

Memory pressure silently converts compute cost into I/O cost.

What right-sizing actually requires:

Right-sizing is not instance-only.
Memory is the first constraint.
Storage is the silent multiplier.
CPU is the last thing to trust.
Never downsize RDS without reviewing memory behavior.
Never touch storage without understanding I/O patterns.
Treat CPU as a symptom, not a cause.
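The ordering in this list can be written down as a simple pre-downsize gate. This is a sketch only: the function name, the inputs, and all thresholds are hypothetical, and the inputs are assumed to come from a sustained observation window rather than a single reading.

```python
def downsize_blockers(reclaim_jumps, swap_bytes, read_latency_p95_s, cpu_p95_pct):
    """Return reasons not to downsize yet; an empty list means 'worth testing'."""
    blockers = []

    # Memory is the first constraint.
    if reclaim_jumps > 0:
        blockers.append("FreeableMemory shows a sawtooth pattern")
    if swap_bytes > 50 * 1024**2:
        blockers.append("instance is swapping")

    # Storage is the silent multiplier.
    if read_latency_p95_s > 0.010:
        blockers.append("read latency is already high")

    # CPU is checked last, and on sustained load rather than peaks.
    if cpu_p95_pct > 50.0:
        blockers.append("sustained CPU leaves little headroom on a smaller class")

    return blockers


print(downsize_blockers(reclaim_jumps=0, swap_bytes=0,
                        read_latency_p95_s=0.002, cpu_p95_pct=18.0))
```

An empty list is not a green light to resize in production. It only says the obvious memory and I/O warnings are absent, so a reversible test is reasonable.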

Why teams delay (human factors)

Even when metrics suggest an instance is oversized, teams hesitate.

Databases are stateful, critical, and often poorly understood outside the platform team.

Fear of downtime, lack of representative staging environments, and unclear rollback plans make inaction feel safer than change.

Cost optimization without operational confidence rarely happens.

What “responsible” right-sizing actually looks like

Responsible right-sizing is not a one-off resize action.

It requires:

Observing sustained behavior, not isolated peaks.
Understanding memory and I/O before touching instance size.
Making reversible changes with rollback plans.
Reviewing impact after the change.

Without feedback loops, right-sizing becomes guesswork.

The real cost of getting it wrong

RDS right-sizing fails when it is treated as a resizing exercise instead of a systems decision.

CPU, memory, storage, and I/O are not independent knobs. They trade cost and risk across layers.

When teams optimize one signal in isolation, they often move cost rather than remove it. Or worse, create instability that freezes future optimization attempts.

Right-sizing done responsibly is slower, more deliberate, and less dramatic.

But it is also repeatable.

And repeatability is what turns cost optimization from a one-time win into a discipline.
