
Zero-Downtime DB Migration, Easier Said Than Done

Had to change a DB schema without taking the service down. The plan was perfect. Reality was not.

Why is changing one table column so terrifying?

We needed to change the phone column in the users table from varchar(20) to varchar(15) and add format validation. The table had 184,720 rows of user data.

"Just run one ALTER TABLE, right?" Correct. But while that ALTER TABLE runs, the table gets locked. Our service runs 24/7, and scheduling downtime makes the CS team lose their minds.

The initial plan

Step 1: Add a new column, phone_new.
Step 2: Write to both columns in the code (dual write).
Step 3: Batch-migrate existing data.
Step 4: Switch code to read only from the new column.
Step 5: Drop the old column.
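Sketched as SQL (table and column names from the plan above; syntax assumes PostgreSQL, where adding a nullable column without a default is a quick metadata-only change), the two schema phases look like this. Steps 2 and 4 are application deploys, not SQL, and Step 3 is the batch backfill:

```python
# Schema phases of the expand/contract plan, assuming PostgreSQL.
# Steps 2 and 4 happen in application code; Step 3 is a separate backfill job.

EXPAND = "ALTER TABLE users ADD COLUMN phone_new varchar(15)"  # Step 1: nullable on purpose
CONTRACT = "ALTER TABLE users DROP COLUMN phone"               # Step 5: only after the read switch
```

Keeping the expand and contract statements in separate deploys is the whole trick: at no point does running code depend on a column that doesn't exist yet.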

Textbook approach. The plan was flawless.

(Things started going sideways at step 2.)

What happened during dual write

Deployed the code update to write to both phone and phone_new. Five minutes after deployment, errors started firing: the ORM model declared phone_new as NOT NULL, but every existing row still had it as NULL.

Removed the NOT NULL constraint, redeployed. Worked this time. But during that window, 3 orders were saved in an incomplete state. Fixed them manually. A minor 27-minute incident.
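A minimal sketch of the dual-write step, assuming an ORM-style user object with phone and phone_new attributes (the helper name and the digits-only rule for the new column are my assumptions, not the actual codebase). The key point is that phone_new must stay nullable until the backfill completes:

```python
import re

def save_phone(user, raw_phone):
    """Dual-write sketch: during the transition, every write hits both columns.

    `user` is assumed to be an ORM model with `phone` and `phone_new`
    attributes. phone_new must be nullable at this stage -- old rows have
    no value for it yet, which is exactly what caused the NOT NULL errors.
    """
    digits = re.sub(r"\D", "", raw_phone)  # new column: digits only (assumption)
    user.phone = raw_phone                 # old column: legacy format, still the read path
    user.phone_new = digits[:15]           # fit the varchar(15) limit
    return user
```

Reads keep going to the old column for now, so a bug here is recoverable: the new column can be wiped and re-backfilled without user impact.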

The batch migration trap

Running UPDATE on all 184,720 rows at once would lock the table. So I split it into batches of 1,000 with a 100ms delay between each batch.

Total time: 43 minutes 12 seconds. New data coming in during that time was fine thanks to dual write. But DB CPU spiked to 68% during the batch run. Normal was 15%. Slow query alerts fired 3 times.

Bumped the delay to 500ms -- CPU stabilized, but total time stretched to 2 hours. These tradeoffs are the most frustrating part.
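The batch loop looked roughly like this -- a sketch, not the real script, written in generic DB-API style with sqlite-style `?` placeholders. Selecting a small set of ids and updating only those keeps each transaction short, and the sleep between batches is the throttle knob discussed above:

```python
import re
import time

BATCH_SIZE = 1000
DELAY_SEC = 0.1  # raised to 0.5 once DB CPU spiked

def backfill(conn, batch_size=BATCH_SIZE, delay=DELAY_SEC):
    """Copy users.phone into users.phone_new, batch_size rows per transaction.

    `conn` is any DB-API-style connection. Filtering on phone_new IS NULL
    makes the job resumable: rerunning it just picks up where it stopped.
    """
    total = 0
    while True:
        cur = conn.execute(
            "SELECT id, phone FROM users WHERE phone_new IS NULL "
            "ORDER BY id LIMIT ?",
            (batch_size,),
        )
        rows = cur.fetchall()
        if not rows:
            return total
        for row_id, phone in rows:
            digits = re.sub(r"\D", "", phone or "")[:15]  # match varchar(15)
            conn.execute(
                "UPDATE users SET phone_new = ? WHERE id = ?",
                (digits, row_id),
            )
        conn.commit()                 # short transaction per batch, short locks
        total += len(rows)
        time.sleep(delay)             # throttle so the DB can breathe
```

With 184,720 rows, 1,000-row batches, and a 100ms delay, the sleep alone accounts for about 18 seconds; the remaining ~43 minutes was actual query time, which is why shrinking the delay alone can't make the job fast.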

Another mistake during the code switch

Switched the code to read only from the new column, but the phone field format in the API response had changed. The frontend expected a hyphenated format, but the new column only had digits. Took 18 minutes to deploy the frontend fix.

This is honestly something I should've checked before the migration. My fault for not looking at the frontend code.

Dropping the old column

DROP COLUMN also needs to be done carefully. Something somewhere might be referencing it with a hardcoded column name. Grep'd the entire codebase before dropping it. Monitored for a week, confirmed no issues, and only then ran the DROP.
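The pre-drop check can be scripted, too. Below is a rough Python equivalent of that grep (the function name, the `*.py` glob, and the default column name are mine); word-boundary matching avoids hits on phone_new, but matches inside comments or unrelated strings still need a human look before the DROP:

```python
import re
from pathlib import Path

def find_column_refs(root, column="phone"):
    """List (file, line number, line) tuples that mention the column name.

    A sketch of the pre-DROP grep. \b keeps `phone_new` and `telephone`
    out of the results, but every remaining hit still needs manual review.
    """
    pattern = re.compile(rf"\b{re.escape(column)}\b")
    hits = []
    for path in Path(root).rglob("*.py"):  # widen the glob per codebase
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```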

Next time

Test with a dataset similar in size to production first. This time the test DB only had 500 rows, so I underestimated the batch processing time. And I need a checklist that includes frontend impact analysis.

Zero-downtime migration isn't technically hard -- it's hard because there are so many things that are easy to miss.
