
Zero-Downtime DB Migration, Easier Said Than Done

Had to change a DB schema without taking the service down. The plan was perfect. Reality was not.

Why is changing one table column so terrifying?

We needed to change the phone column in the users table from varchar(20) to varchar(15) and add format validation. The table had 184,720 rows of user data.

"Just run one ALTER TABLE, right?" Correct. But while that ALTER TABLE runs, the table gets locked. Our service runs 24/7, and scheduling downtime makes the CS team lose their minds.

The initial plan

Step 1: Add a new column, phone_new.
Step 2: Write to both columns in the code (dual write).
Step 3: Batch-migrate existing data.
Step 4: Switch code to read only from the new column.
Step 5: Drop the old column.
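Sketched as SQL (table and column names from the plan above; syntax assumes PostgreSQL, where adding a nullable column without a default is a quick metadata-only change), the two schema phases look like this. Steps 2 and 4 are application deploys, not SQL, and Step 3 is the batch backfill:

```python
# Schema phases of the expand/contract plan, assuming PostgreSQL.
# Steps 2 and 4 happen in application code; Step 3 is a separate backfill job.

EXPAND = "ALTER TABLE users ADD COLUMN phone_new varchar(15)"  # Step 1: nullable on purpose
CONTRACT = "ALTER TABLE users DROP COLUMN phone"               # Step 5: only after the read switch
```

Keeping the expand and contract statements in separate deploys is the whole trick: at no point does running code depend on a column that doesn't exist yet.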

Textbook approach. The plan was flawless.

(Things started going sideways at step 2.)

What happened during dual write

Deployed the code update to write to both phone and phone_new. Five minutes after deployment, errors started firing: the ORM model declared phone_new as NOT NULL, but every existing row still had it as NULL.

Removed the NOT NULL constraint, redeployed. Worked this time. But during that window, 3 orders were saved in an incomplete state. Fixed them manually. A minor 27-minute incident.
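A minimal sketch of the dual-write step, assuming an ORM-style user object with phone and phone_new attributes (the helper name and the digits-only rule for the new column are my assumptions, not the actual codebase). The key point is that phone_new must stay nullable until the backfill completes:

```python
import re

def save_phone(user, raw_phone):
    """Dual-write sketch: during the transition, every write hits both columns.

    `user` is assumed to be an ORM model with `phone` and `phone_new`
    attributes. phone_new must be nullable at this stage -- old rows have
    no value for it yet, which is exactly what caused the NOT NULL errors.
    """
    digits = re.sub(r"\D", "", raw_phone)  # new column: digits only (assumption)
    user.phone = raw_phone                 # old column: legacy format, still the read path
    user.phone_new = digits[:15]           # fit the varchar(15) limit
    return user
```

Reads keep going to the old column for now, so a bug here is recoverable: the new column can be wiped and re-backfilled without user impact.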

The batch migration trap

Running UPDATE on all 184,720 rows at once would lock the table. So I split it into batches of 1,000 with a 100ms delay between each batch.

Total time: 43 minutes 12 seconds. New data coming in during that time was fine thanks to dual write. But DB CPU spiked to 68% during the batch run. Normal was 15%. Slow query alerts fired 3 times.

Bumped the delay to 500ms -- CPU stabilized, but total time stretched to 2 hours. These tradeoffs are the most frustrating part.
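The batch loop looked roughly like this -- a sketch, not the real script, written in generic DB-API style with sqlite-style `?` placeholders. Selecting a small set of ids and updating only those keeps each transaction short, and the sleep between batches is the throttle knob discussed above:

```python
import re
import time

BATCH_SIZE = 1000
DELAY_SEC = 0.1  # raised to 0.5 once DB CPU spiked

def backfill(conn, batch_size=BATCH_SIZE, delay=DELAY_SEC):
    """Copy users.phone into users.phone_new, batch_size rows per transaction.

    `conn` is any DB-API-style connection. Filtering on phone_new IS NULL
    makes the job resumable: rerunning it just picks up where it stopped.
    """
    total = 0
    while True:
        cur = conn.execute(
            "SELECT id, phone FROM users WHERE phone_new IS NULL "
            "ORDER BY id LIMIT ?",
            (batch_size,),
        )
        rows = cur.fetchall()
        if not rows:
            return total
        for row_id, phone in rows:
            digits = re.sub(r"\D", "", phone or "")[:15]  # match varchar(15)
            conn.execute(
                "UPDATE users SET phone_new = ? WHERE id = ?",
                (digits, row_id),
            )
        conn.commit()                 # short transaction per batch, short locks
        total += len(rows)
        time.sleep(delay)             # throttle so the DB can breathe
```

With 184,720 rows, 1,000-row batches, and a 100ms delay, the sleep alone accounts for about 18 seconds; the remaining ~43 minutes was actual query time, which is why shrinking the delay alone can't make the job fast.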

Another mistake during the code switch

Switched the code to read only from the new column, but the phone field format in the API response had changed. The frontend expected a hyphenated format, but the new column only had digits. Took 18 minutes to deploy the frontend fix.

This is honestly something I should've checked before the migration. My fault for not looking at the frontend code.

Dropping the old column

DROP COLUMN also needs to be done carefully. Something somewhere might be referencing it with a hardcoded column name. Grep'd the entire codebase before dropping it. Monitored for a week, confirmed no issues, and only then ran the DROP.
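The pre-drop check can be scripted, too. Below is a rough Python equivalent of that grep (the function name, the `*.py` glob, and the default column name are mine); word-boundary matching avoids hits on phone_new, but matches inside comments or unrelated strings still need a human look before the DROP:

```python
import re
from pathlib import Path

def find_column_refs(root, column="phone"):
    """List (file, line number, line) tuples that mention the column name.

    A sketch of the pre-DROP grep. \b keeps `phone_new` and `telephone`
    out of the results, but every remaining hit still needs manual review.
    """
    pattern = re.compile(rf"\b{re.escape(column)}\b")
    hits = []
    for path in Path(root).rglob("*.py"):  # widen the glob per codebase
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```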

Next time

Test with a dataset similar in size to production first. This time the test DB only had 500 rows, so I underestimated the batch processing time. And I need a checklist that includes frontend impact analysis.

Zero-downtime migration isn't technically hard -- it's hard because there are so many things that are easy to miss.
